@TheTuringPost
5 Open-Source Datasets for LLM Training
2 years ago
5 open-source datasets used to train LLMs:
1. LAION-2B-en (text and image)
2. P3 (text, prompts)
3. VIMA dataset (text and image)
4. COYO-700M (text and image)
5. xP3 (text, prompts and code)
Links 🧵
LAION-2B-en (text and image)
laion.ai/blog/laion-5b/
P3 (text, prompts) The Public Pool of Prompts is a collection of prompted English datasets that relies on the Hugging Face Datasets library.
huggingface.co/datasets/bigscience/P3
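To make "prompted dataset" concrete, here is a minimal sketch of how a labeled record can be rendered into an input/target text pair with a prompt template. The template wording, the record, and the field names (`inputs`, `targets`) are hypothetical illustrations, not taken from P3 itself.

```python
from string import Template

# Hypothetical template for a natural-language-inference record (illustration only).
NLI_TEMPLATE = Template('$premise\nQuestion: Does this imply that "$hypothesis"? Yes or no?')


def apply_prompt(record, template, answer_choices):
    """Render a labeled record into an (inputs, targets) text pair."""
    return {
        "inputs": template.substitute(
            premise=record["premise"], hypothesis=record["hypothesis"]
        ),
        # The integer label indexes into the human-readable answer choices.
        "targets": answer_choices[record["label"]],
    }


# Made-up example record; label 0 is assumed to mean "entailment".
example = {
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
    "label": 0,
}

prompted = apply_prompt(example, NLI_TEMPLATE, ["Yes", "No"])
print(prompted["inputs"])
print(prompted["targets"])
```

Applying several differently worded templates to the same record is one way to get multiple prompted variants of a single example.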
xP3 (text, prompts and code) A collection of prompts and datasets across 46 languages and 16 NLP tasks.
huggingface.co/datasets/bigscience/xP3
COYO-700M (text and image) A large-scale dataset of 747M image-text pairs, along with many other meta-attributes that make it more usable for training a variety of models.
github.com/kakaobrain/coyo-dataset
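As a rough illustration of why per-pair meta-attributes are useful, the sketch below filters image-text rows by image size and an image-text similarity score before training. The rows, the field names (`width`, `height`, `clip_similarity`), and the thresholds are assumptions made up for this example; check the dataset's own schema before relying on any of them.

```python
# Illustrative rows only; field names and values are assumptions, not real dataset entries.
rows = [
    {"url": "https://example.com/a.jpg", "text": "a red bicycle",
     "width": 640, "height": 480, "clip_similarity": 0.31},
    {"url": "https://example.com/b.jpg", "text": "logo",
     "width": 64, "height": 64, "clip_similarity": 0.35},
    {"url": "https://example.com/c.jpg", "text": "sunset over a lake",
     "width": 1024, "height": 768, "clip_similarity": 0.12},
]


def keep(row, min_side=200, min_similarity=0.25):
    """Keep pairs whose image is large enough and whose caption matches the image well."""
    large_enough = min(row["width"], row["height"]) >= min_side
    well_aligned = row["clip_similarity"] >= min_similarity
    return large_enough and well_aligned


filtered = [row for row in rows if keep(row)]
for row in filtered:
    print(row["url"], row["text"])
```

The thresholds here are arbitrary; in practice they would be tuned to the model and the amount of data one can afford to discard.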
VIMA dataset (text and image)
vimalabs.github.io/