5 open-source datasets used to train LLMs

1. LAION-2B-en (text and image)
2. P3 (text, prompts)
3. xP3 (text, prompts and code)
4. COYO-700M (text and image)
5. VIMA dataset (text and image)

Links 🧵

LAION-2B-en (text and image)

https://laion.ai/blog/laion-5b/

P3 (text, prompts)

The Public Pool of Prompts (P3) relies on the Hugging Face Datasets library.

https://huggingface.co/datasets/bigscience/P3
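Since P3 is served through the Datasets library, a quick way to peek at it is to stream a few records instead of downloading everything. The snippet below is a minimal sketch, not an official recipe: the helper names `format_example` and `preview_p3` are made up for illustration, the field names `inputs_pretokenized` and `targets_pretokenized` are assumed from the dataset card, and the subset name you pass in has to be one of the configurations listed on that page. Depending on your installed `datasets` version, the loading call may need adjusting.

```python
from itertools import islice

P3_REPO = "bigscience/P3"  # repo id taken from the link above


def format_example(record: dict) -> str:
    """Join the human-readable prompt and answer of one P3 record."""
    # Field names are an assumption based on the dataset card.
    prompt = record["inputs_pretokenized"].strip()
    answer = record["targets_pretokenized"].strip()
    return f"{prompt}\n-> {answer}"


def preview_p3(subset: str, n: int = 3) -> list[str]:
    """Stream the first n training examples of one P3 subset (needs network)."""
    # Imported lazily so the formatting helper works without the library.
    from datasets import load_dataset  # pip install datasets

    stream = load_dataset(P3_REPO, subset, split="train", streaming=True)
    return [format_example(record) for record in islice(stream, n)]
```

Call `preview_p3("<subset name>")` with a configuration name from the dataset page; streaming keeps the download limited to the records you actually read.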

xP3 (text, prompts and code)

A collection of prompts and datasets across 46 languages & 16 NLP tasks.

https://huggingface.co/datasets/bigscience/xP3

COYO-700M (text and image)

A large-scale dataset of 747M image-text pairs, each with many additional meta-attributes that make it easier to use for training various models.

https://github.com/kakaobrain/coyo-dataset
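The meta-attributes are what let you cut the dataset down to the pairs you want before training. Here is a rough sketch of that idea in plain Python. The column names `clip_similarity_vitb32_score` and `watermark_score` are assumptions to check against the repo, the thresholds are arbitrary, and the sample rows are made up for illustration.

```python
from typing import Iterable, Iterator

# Assumed column names; verify them against the dataset documentation.
CLIP_KEY = "clip_similarity_vitb32_score"
WATERMARK_KEY = "watermark_score"


def filter_pairs(
    records: Iterable[dict],
    min_clip: float = 0.3,
    max_watermark: float = 0.5,
) -> Iterator[dict]:
    """Yield records whose caption matches the image and that look watermark-free."""
    for record in records:
        clip = record.get(CLIP_KEY)
        watermark = record.get(WATERMARK_KEY)
        # Skip rows with missing scores instead of guessing a value.
        if clip is None or watermark is None:
            continue
        if clip >= min_clip and watermark <= max_watermark:
            yield record


# Toy rows standing in for real metadata (values are invented).
sample = [
    {"text": "a red bicycle", CLIP_KEY: 0.34, WATERMARK_KEY: 0.02},
    {"text": "click here", CLIP_KEY: 0.12, WATERMARK_KEY: 0.01},
    {"text": "stock photo", CLIP_KEY: 0.40, WATERMARK_KEY: 0.90},
]
kept = [row["text"] for row in filter_pairs(sample)]
print(kept)
```

Because the filter is a generator, it can sit on top of a streamed metadata iterator without loading all rows into memory.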

VIMA dataset (text and image)

https://vimalabs.github.io/
