Datasets

Open datasets and data collections

Showing 21-40 of 59 projects

google-deepmind/mathematics_dataset

Generates mathematical question-and-answer pairs at school-level difficulty, useful for AI/ML research.

1.9K
Archived
Python
LLM Frameworks
Coding Challenges
#mathematics #dataset #machine-learning

shramos/Awesome-Cybersecurity-Datasets

A curated list of cybersecurity datasets for security researchers and machine learning practitioners.

1.9K
Archived
Security Research
Datasets
#cybersecurity #dataset #security-research

ChineseGLUE/ChineseGLUE

A benchmark for evaluating language understanding models and datasets for the Chinese language.

1.8K
Archived
Python
LLM Frameworks
Datasets
#nlp #benchmarking #language-understanding

karolpiczak/ESC-50

Open-source dataset for training environmental sound classification models.

1.8K
Archived
Python
Computer Vision
Datasets
#audio-processing #machine-learning #open-source

coderonion/awesome-yolo-object-detection

A curated collection of YOLO object detection projects and datasets for developers working with computer vision and AI.

1.7K
Experimental
Computer Vision
Datasets
#object-detection #yolo #datasets

njvisionpower/Safety-Helmet-Wearing-Dataset

A dataset and pretrained model for detecting safety helmet wearing, useful for computer vision projects.

1.7K
Archived
Python
Computer Vision
Datasets
GluonCV
#dataset #detection #hardhat

Toyhom/Chinese-medical-dialogue-data

This repository contains a dataset of Chinese medical dialogues for NLP and conversational AI research.

1.6K
Archived
Python
LLM Frameworks
Datasets
#medical-data #chinese-language #natural-language-processing

EleutherAI/the-pile

The Pile is a large, diverse language model training dataset for use in AI research and development.

1.6K
Archived
Python
LLM Frameworks
Datasets
#language-model #dataset #machine-learning

gururise/AlpacaDataCleaned

A cleaned and curated version of Stanford's Alpaca instruction-following dataset, useful for fine-tuning language models.

1.6K
Archived
Python
Datasets
#machine-learning #dataset #computer-vision

opendatalab/OmniDocBench

A comprehensive benchmark for evaluating document parsing, presented at CVPR 2025.

1.5K
Stable
Python
Computer Vision
Datasets
#computer-vision #document-parsing #benchmark

facebookresearch/fastMRI

A large-scale dataset of raw MRI measurements and clinical MRI images for medical imaging research.

1.5K
Archived
Python
Computer Vision
Datasets
PyTorch
#medical-imaging #mri-reconstruction #deep-learning

brendenlake/omniglot

The Omniglot dataset for one-shot learning experiments in MATLAB.

1.4K
Archived
MATLAB
Datasets
#machine-learning #datasets #one-shot-learning

EricGuo5513/HumanML3D

A large and diverse 3D human motion-language dataset for deep learning and motion generation.

1.4K
Archived
Python
Computer Vision
Datasets
#dataset #motion-generation #text-annotation

PolyAI-LDN/conversational-datasets

A collection of large datasets for training conversational AI models and agents.

1.4K
Archived
Python
LLM Frameworks
Datasets
#conversational-ai #machine-learning #datasets

Hello-SimpleAI/chatgpt-comparison-detection

The Human ChatGPT Comparison Corpus (HC3), which pairs human-written and ChatGPT-generated answers, with detectors for identifying ChatGPT-generated text.

1.3K
Archived
Python
LLM Wrappers & SDKs
Datasets
#chatgpt #text-classification #machine-learning

KaiDMML/FakeNewsNet

A dataset for fake news detection research, with Python scripts for collecting the data.

1.3K
Archived
Python
Computer Vision
Datasets
#fake-news-detection #nlp #machine-learning

google-deepmind/rc-data

A question answering dataset for training reading-comprehension models, built from CNN and Daily Mail news articles.

1.3K
Archived
Python
LLM Frameworks
Datasets
#question-answering #natural-language-processing #dataset

wainshine/Company-Names-Corpus

A corpus of company names, abbreviations, and brands that can be used for Chinese text segmentation and entity recognition.

1.3K
Archived
Datasets
CLI Tools
#corpus #dataset #ner

datitran/raccoon_dataset

A dataset for training a raccoon detector using TensorFlow.

1.3K
Archived
Jupyter Notebook
Computer Vision
Datasets
#tensorflow #computer-vision #dataset

kakaobrain/coyo-dataset

A large-scale image-text dataset for training visual and multimodal AI models.

1.3K
Archived
Python
Computer Vision
Agents & Orchestration
#computer-vision #multimodal-ai #dataset
