Typefully
@TheTuringPost
5 Open-Source Datasets for LLM Training
3 years ago
5 open-source datasets used to train LLMs:

1. MineDojo (videos and text)
2. ROOTS (text)
3. NaturalInstructions-v2 (text)
4. Anthropic Helpfulness dataset (text)
5. LAION-115M (text and image)

Links 🧵
MineDojo (videos and text)
crfm.stanford.edu/ecosystem-graphs/index.html?asset=MineDojo
ROOTS: a 1.6TB dataset spanning 59 languages
crfm.stanford.edu/ecosystem-graphs/index.html?asset=ROOTS
NaturalInstructions-v2
crfm.stanford.edu/ecosystem-graphs/index.html?asset=NaturalInstructions-v2
Anthropic Helpfulness dataset
crfm.stanford.edu/ecosystem-graphs/index.html?asset=Anthropic%20Helpfulness%20dataset
LAION-115M
crfm.stanford.edu/ecosystem-graphs/index.html?asset=LAION-115M