5 Open-Source Datasets for LLM Training

3 years ago

5 open-source datasets used to train LLMs:
1. MineDojo (videos and text)
2. ROOTS (text)
3. NaturalInstructions-v2 (text)
4. Anthropic Helpfulness dataset (text)
5. LAION-115M (text and image)
Links 🧵
MineDojo (videos and text) crfm.stanford.edu/ecosystem-graphs/index.html?asset=MineDojo
ROOTS (a 1.6TB dataset spanning 59 languages) crfm.stanford.edu/ecosystem-graphs/index.html?asset=ROOTS
NaturalInstructions-v2 crfm.stanford.edu/ecosystem-graphs/index.html?asset=NaturalInstructions-v2
Anthropic Helpfulness dataset crfm.stanford.edu/ecosystem-graphs/index.html?asset=Anthropic%20Helpfulness%20dataset
LAION-115M crfm.stanford.edu/ecosystem-graphs/index.html?asset=LAION-115M
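
All five links follow the same pattern: the Ecosystem Graphs page plus an `asset` query parameter holding the percent-encoded dataset name. A minimal Python sketch for rebuilding them is below. The `https://` scheme and the helper name `asset_url` are assumptions for illustration; the thread gives the links without a scheme.

```python
from urllib.parse import quote

# Base page taken from the links in the thread; the https:// scheme is assumed.
BASE = "https://crfm.stanford.edu/ecosystem-graphs/index.html"

# Asset names exactly as they appear in the thread.
ASSETS = [
    "MineDojo",
    "ROOTS",
    "NaturalInstructions-v2",
    "Anthropic Helpfulness dataset",
    "LAION-115M",
]


def asset_url(name: str) -> str:
    """Build the link for one asset (hypothetical helper).

    quote() percent-encodes spaces as %20, matching the
    'Anthropic%20Helpfulness%20dataset' link in the thread.
    """
    return f"{BASE}?asset={quote(name)}"


for name in ASSETS:
    print(asset_url(name))
```

Names without spaces pass through unchanged, so only the Anthropic entry is altered by the encoding step.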

TuringPost

@TheTuringPost

Newsletter exploring AI & ML: AI 101, ML techniques, AI business insights, global dynamics, and ML history. Led by @kseniase_.