
5 Open-Source Datasets for LLM Training


3 years ago


5 open-source datasets used to train LLMs:
1. LAION-2B-en (text and image)
2. P3 (text, prompts)
3. VIMA dataset (text and image)
4. COYO-700M (text and image)
5. xP3 (text, prompts and code)
Links 🧵
LAION-2B-en (text and image) laion.ai/blog/laion-5b/
P3 (text, prompts) The Public Pool of Prompts is a collection of prompted datasets that relies on the Hugging Face Datasets library. huggingface.co/datasets/bigscience/P3
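To make "prompted dataset" concrete, here is a minimal sketch of the idea: a raw labeled example is rewritten as a natural-language input and a text target. The template, the label mapping, and the `inputs`/`targets` field names are illustrative assumptions, not P3's actual templates or schema.

```python
from string import Template

# Hypothetical prompt template, written for illustration only.
SENTIMENT_PROMPT = Template('Review: "$text"\nIs this review positive or negative?')

# Hypothetical mapping from integer class labels to answer words.
LABELS = {0: "negative", 1: "positive"}


def apply_prompt(example: dict) -> dict:
    """Turn a raw labeled example into an (inputs, targets) text pair."""
    return {
        "inputs": SENTIMENT_PROMPT.substitute(text=example["text"]),
        "targets": LABELS[example["label"]],
    }


raw = {"text": "A sharp, funny script.", "label": 1}
prompted = apply_prompt(raw)
print(prompted["inputs"])
print(prompted["targets"])
```

Applying several differently worded templates to the same raw example yields several training pairs, which is the general pattern behind prompt collections of this kind.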
xP3 (text, prompts and code) A collection of prompts and datasets across 46 languages & 16 NLP tasks. huggingface.co/datasets/bigscience/xP3
COYO-700M (text and image) A large-scale dataset of 747M image-text pairs, with additional meta-attributes that make it more usable for training various models. github.com/kakaobrain/coyo-dataset
VIMA dataset (text and image) vimalabs.github.io/
TuringPost

@TheTuringPost

Newsletter exploring AI & ML: AI 101, ML techniques, AI business insights, global dynamics, ML history. Led by @kseniase_