Datasets

Open datasets and data collections

google-deepmind/mathematics_dataset

Generates mathematical question-and-answer pairs at roughly school-level difficulty, useful for AI/ML research.

1.9K

Archived

Python

LLM Frameworks

Coding Challenges

#mathematics #dataset #machine-learning

shramos/Awesome-Cybersecurity-Datasets

A curated list of cybersecurity datasets for security researchers and machine learning practitioners.

1.9K

Archived

Security Research

Datasets

#cybersecurity #dataset #security-research

ChineseGLUE/ChineseGLUE

A language understanding evaluation benchmark for Chinese, with datasets and baseline models.

1.8K

Archived

Python

LLM Frameworks

Datasets

#nlp #benchmarking #language-understanding

karolpiczak/ESC-50

Open-source dataset for training environmental sound classification models.

1.8K

Archived

Python

Computer Vision

Datasets

#audio-processing #machine-learning #open-source

coderonion/awesome-yolo-object-detection

A curated collection of YOLO object detection projects and datasets for developers working with computer vision and AI.

1.7K

Experimental

Computer Vision

Datasets

#object-detection #yolo #datasets

njvisionpower/Safety-Helmet-Wearing-Dataset

A dataset and pretrained model for detecting safety helmet wearing, useful for computer vision projects.

1.7K

Archived

Python

Computer Vision

Datasets

GluonCV

#dataset #detection #hardhat

Toyhom/Chinese-medical-dialogue-data

A dataset of Chinese medical dialogues for NLP and conversational AI research.

1.6K

Archived

Python

LLM Frameworks

Datasets

#medical-data #chinese-language #natural-language-processing

EleutherAI/the-pile

The Pile is a large, diverse language model training dataset for use in AI research and development.

1.6K

Archived

Python

LLM Frameworks

Datasets

#language-model #dataset #machine-learning

gururise/AlpacaDataCleaned

A cleaned and curated version of Stanford's Alpaca instruction-tuning dataset.

1.6K

Archived

Python

Datasets

#machine-learning #dataset #computer-vision

opendatalab/OmniDocBench

A comprehensive benchmark for evaluating document parsing, presented at CVPR 2025.

1.5K

Stable

Python

Computer Vision

Datasets

#computer-vision #document-parsing #benchmark

facebookresearch/fastMRI

A large-scale dataset of raw MRI measurements and clinical MRI images for medical imaging research.

1.5K

Archived

Python

Computer Vision

Datasets

PyTorch

#medical-imaging #mri-reconstruction #deep-learning

brendenlake/omniglot

The Omniglot dataset for one-shot learning experiments, with MATLAB code.

1.4K

Archived

MATLAB

Datasets

#machine-learning #datasets #one-shot-learning

EricGuo5513/HumanML3D

A large and diverse 3D human motion-language dataset for deep learning and motion generation.

1.4K

Archived

Python

Computer Vision

Datasets

#dataset #motion-generation #text-annotation

PolyAI-LDN/conversational-datasets

A collection of large datasets for training conversational AI models and agents.

1.4K

Archived

Python

LLM Frameworks

Datasets

#conversational-ai #machine-learning #datasets

Hello-SimpleAI/chatgpt-comparison-detection

A corpus comparing human-written and ChatGPT-generated answers, with Python detectors for identifying ChatGPT-generated text.

1.3K

Archived

Python

LLM Wrappers & SDKs

Datasets

#chatgpt #text-classification #machine-learning

KaiDMML/FakeNewsNet

A dataset for fake news detection research, with Python tooling.

1.3K

Archived

Python

Computer Vision

Datasets

#fake-news-detection #nlp #machine-learning

google-deepmind/rc-data

A question answering dataset for building AI-powered language models and conversational agents.

1.3K

Archived

Python

LLM Frameworks

Datasets

#question-answering #natural-language-processing #dataset

wainshine/Company-Names-Corpus

A corpus of company names, abbreviations, and brands for Chinese word segmentation and named entity recognition.

1.3K

Archived

Datasets

CLI Tools

#corpus #dataset #ner

datitran/raccoon_dataset

A dataset for training a raccoon detector with TensorFlow.

1.3K

Archived

Jupyter Notebook

Computer Vision

Datasets

#tensorflow #computer-vision #dataset

kakaobrain/coyo-dataset

A large-scale image-text dataset for training vision and multimodal models.

1.3K

Archived

Python

Computer Vision

Agents & Orchestration

#computer-vision #multimodal-ai #dataset
