@TheTuringPost
5 Open-Source Datasets for LLM Training
3 years ago
5 open-source datasets used to train and evaluate LLMs:
1. YT-Temporal-1B (video)
2. Muffin (text)
3. LAION-400M (image and text)
4. HumanEval (code)
5. WebVid-2M (text and video)
Links 🧵
YT-Temporal-1B (video)
crfm.stanford.edu/ecosystem-graphs/index.html?asset=YT-Temporal-1B
Muffin (text)
crfm.stanford.edu/ecosystem-graphs/index.html?asset=Muffin
LAION-400M (image and text) A dataset of 400 million CLIP-filtered image-text pairs, with their CLIP embeddings and kNN indices for efficient similarity search.
crfm.stanford.edu/ecosystem-graphs/index.html?asset=LAION-400M
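To give a rough idea of what similarity search over embeddings means, here is a minimal sketch. It is not how LAION-400M's kNN indices are built or queried: it is a brute-force search over three made-up 3-dimensional vectors, and the names `cosine_similarity`, `nearest_neighbors`, and `toy_embeddings` are hypothetical. Real CLIP embeddings are much higher-dimensional, and a prebuilt index exists precisely to avoid this kind of full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, embeddings, k=2):
    """Return the k item names whose vectors are most similar to the query.

    Brute force: scores every item, so cost grows linearly with dataset size.
    """
    scored = [(cosine_similarity(query, vec), name) for name, vec in embeddings.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Made-up toy vectors standing in for image embeddings (assumption, not real data).
toy_embeddings = {
    "cat photo": [0.9, 0.1, 0.0],
    "dog photo": [0.8, 0.3, 0.1],
    "city skyline": [0.0, 0.2, 0.9],
}

query = [1.0, 0.0, 0.0]  # stand-in for an embedded text query
top = nearest_neighbors(query, toy_embeddings, k=2)
print(top)
```

At the scale of hundreds of millions of pairs, scanning every vector like this is impractical, which is why shipping kNN indices alongside the embeddings is useful.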
HumanEval (code) A dataset of 164 hand-written programming problems created by OpenAI to evaluate its Codex model.
crfm.stanford.edu/ecosystem-graphs/index.html?asset=HumanEval
WebVid-2M (text and video) A large-scale dataset of 2.5M short videos with textual descriptions sourced from stock footage sites.
crfm.stanford.edu/ecosystem-graphs/index.html?asset=WebVid-2M