5 open-source datasets used to train LLMs

1. YT-Temporal-1B (video)
2. Muffin (text)
3. LAION-400M (text and image)
4. HumanEval (code)
5. WebVid-2M (text and video)

Links 🧵

YT-Temporal-1B (video)

https://crfm.stanford.edu/ecosystem-graphs/index.html?asset=YT-Temporal-1B

Muffin (text)

https://crfm.stanford.edu/ecosystem-graphs/index.html?asset=Muffin

LAION-400M (text and image)

A dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search.

https://crfm.stanford.edu/ecosystem-graphs/index.html?asset=LAION-400M
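To make "similarity search" concrete, here is a toy sketch of nearest-neighbor lookup by cosine similarity. The three-dimensional vectors, the captions, and the helper names (`cosine_similarity`, `top_k`) are made up for illustration. The real dataset ships CLIP embeddings and prebuilt kNN indices, so it would not use a brute-force scan like this one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, embeddings, k=2):
    """Return the k keys whose vectors are most similar to the query (brute force)."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in embeddings.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical toy embeddings; real CLIP vectors have hundreds of dimensions.
toy_embeddings = {
    "cat photo": [0.9, 0.1, 0.0],
    "dog photo": [0.8, 0.3, 0.1],
    "city skyline": [0.0, 0.2, 0.9],
}

query = [1.0, 0.0, 0.0]
nearest = top_k(query, toy_embeddings, k=2)
print(nearest)
```

A precomputed index avoids comparing the query against every item, which is what keeps search practical at the scale of 400 million pairs.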

HumanEval (code)

A dataset of 164 hand-written programming problems, created by OpenAI to evaluate its Codex model.

https://crfm.stanford.edu/ecosystem-graphs/index.html?asset=HumanEval

WebVid-2M (text and video)

A large-scale dataset of 2.5M short videos with textual descriptions sourced from stock footage sites.

https://crfm.stanford.edu/ecosystem-graphs/index.html?asset=WebVid-2M
