
Overview

A dataset marketplace and directory for navigating 40+ categories of AI, LLM, RL, text, and image datasets.

DATASET

# Webscale-RL Dataset

## Dataset Description

**Webscale-RL** is a large-scale reinforcement learning dataset designed to address the fundamental bottleneck in LLM RL training: the scarcity of high-quality, diverse RL data. While pretraining leverages **>1T diverse web tokens**, existing RL datasets remain limited to **<10B tokens** with constrained diversity. Webscale-RL bridges this gap

deepnlp/agent-reinforcement-learning-open-dataset
500 credits

# Open Agent RL Dataset: High Quality AI Agent | Tool Use & Function Calls | Reinforcement Learning Datasets

The DeepNLP website provides **high-quality, genuine online user requests** as Agent & RL datasets, helping LLM foundation/SFT/post-training produce models that are more capable at function calling, tool use, and planning. The datasets are collected and sampled from users' requests on our various clients
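As a loose illustration of what a tool-use / function-call training record can look like, the sketch below builds one in a generic chat-message layout; the field names and tool name are hypothetical conventions, not the actual schema of this dataset.

```python
# Hypothetical tool-use / function-calling record in a generic chat-message
# layout. The schema is illustrative only; the real Agent & RL dataset format
# is not shown in the preview above.
example_record = {
    "messages": [
        {"role": "user", "content": "What is the weather in Berlin tomorrow?"},
        {
            "role": "assistant",
            "tool_call": {
                "name": "get_weather",  # hypothetical tool name
                "arguments": {"city": "Berlin", "date": "tomorrow"},
            },
        },
        {"role": "tool", "name": "get_weather", "content": "12°C, light rain"},
        {"role": "assistant", "content": "Expect around 12°C with light rain."},
    ],
    "reward": 1.0,  # RL-style scalar feedback for the completed trajectory
}

print(example_record["messages"][1]["tool_call"]["name"])
```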

opendataarena/oda-mixture-100k

# ODA-Mixture-100k

ODA-Mixture-100k is a compact general-purpose post-training dataset curated from top-performing open corpora (selected via the *OpenDataArena* leaderboard) and refined through deduplication and benchmark decontamination.

---

## Dataset Summary

- **Domain**: General-purpose (e.g., Math, Code, Reasoning, General).
- **Format**: Problem → Solution (reasoning trace) → Final answer
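A minimal sketch of loading the mixture with the Hugging Face `datasets` library and inspecting one Problem → Solution → Final answer record; the split name and column names below are assumptions to verify against the actual dataset card.

```python
# Minimal sketch: load ODA-Mixture-100k (repository id shown above) and
# inspect one record. Split and field names are assumptions; print the
# real column names before relying on them.
from datasets import load_dataset

ds = load_dataset("opendataarena/oda-mixture-100k", split="train")
print(ds.column_names)          # verify the actual schema first

example = ds[0]
for key, value in example.items():
    preview = str(value)[:200]  # truncate long reasoning traces for display
    print(f"{key}: {preview}")
```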

Access to the ILSVRC/imagenet-1k dataset is gated: you must be granted access to the repository and be authenticated (logged in) to use it.
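For gated repositories such as this one, the usual flow is to authenticate with a Hugging Face access token and then load the dataset as normal; the sketch below assumes access has already been granted and uses streaming to avoid a full download.

```python
# Minimal sketch, assuming you have already been granted access to the gated
# ILSVRC/imagenet-1k repository on the Hugging Face Hub.
from huggingface_hub import login
from datasets import load_dataset

login(token="hf_...")  # or run `huggingface-cli login` once in a terminal

imagenet = load_dataset("ILSVRC/imagenet-1k", split="train", streaming=True)
first = next(iter(imagenet))
print(first.keys())  # typically an image plus an integer class label
```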

# CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography

## Overview

CADS is a robust, fully automated framework for segmenting 167 anatomical structures in Computed Tomography (CT), spanning from head to knee regions across diverse anatomical systems. The framework consists of two main components:

1. **CADS-dataset**:
   - 22,022 CT volumes w

huggingfaceh4/math-500

# Dataset Card for MATH-500

This dataset contains a subset of 500 problems from the MATH benchmark, created by OpenAI in their _Let's Verify Step by Step_ paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits
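A minimal loading sketch using the repository id shown above; the split name ("test") and column layout are assumptions to confirm against the dataset card.

```python
# Minimal sketch: load the 500-problem MATH-500 subset with the Hugging Face
# `datasets` library. The split name is an assumption.
from datasets import load_dataset

math500 = load_dataset("HuggingFaceH4/MATH-500", split="test")
print(len(math500))   # expected: 500 problems
print(math500[0])     # inspect one problem record and its field names
```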

# Summary

This is the dataset proposed in our paper [**[ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation**](https://arxiv.org/abs/2407.02371). OpenVid-1M is a high-quality text-to-video dataset designed for research institutions to enhance video quality, featuring high aesthetics, clarity, and resolution. It can be used for direct training or as a qualit

allenai/dolma3_dolmino_mix-100b-1025

# Dolma 3 Dolmino Mix (100B)

The Dolma 3 Dolmino Mix (100B) is the mixture of high-quality data used for the second stage of training of the Olmo 3 7B model.

### Dataset Sources

| Source | Category | Tokens | Documents |
|--------|----------|--------|-----------|
| TinyMATH Mind | Math (synth) | 898M (0.9%) | 1.52M |
| TinyMATH PoT | Math (synth) | 241M (0.24%) | 758K |
| CraneMath | Math (sy

# Dataset Card for Alpaca-Cleaned

- **Repository:** https://github.com/gururise/AlpacaDataCleaned

## Dataset Description

This is a cleaned version of the original Alpaca Dataset released by Stanford. The following issues have been identified in the original release and fixed in this dataset:

1. **Hallucinations:** Many instructions in the original dataset had instructions referencing data on t

# UltraData-Math Dataset | Source Code | Chinese README

***UltraData-Math*** is a large-scale, high-quality mathematical pre-training dataset totaling **290B+ tokens** across three progressive tiers: **L1** (170.5B tokens, web corpus), **L2** (33.7B tokens, quality-selected), and **L3** (88B tokens, multi-format refined), designed to systematically enhance mathematical reasoning in LLMs. It has

# OLMo 2 (November 2024) Pretraining set

Collection of data used to train OLMo-2-1124 models. The majority of this dataset comes from DCLM-Baseline with no additional filtering, but we provide the explicit breakdowns below.

| Name | Tokens | Bytes (uncompressed) | Documents | License |
|------|--------|----------------------|-----------|---------|
| DCLM-Baseline | 3.

# Dataset Card for tiny-imagenet

## Dataset Description

- **Homepage:** https://www.kaggle.com/c/tiny-imagenet
- **Repository:** [Needs More Information]
- **Paper:** http://cs231n.stanford.edu/reports/2017/pdfs/930.pdf
- **Leaderboard:** https://paperswithcode.com/sota/image-classification-on-tiny-imagenet-1

### Dataset Summary

Tiny ImageNet contains 100,000 images of 200 classes (500 for each

# Instruction-Finetuning Dataset Collection (Alpaca-CoT)

This repository will continuously collect various instruction-tuning datasets. We standardize the different datasets into the same format, which can be directly loaded by the Alpaca model's training code. We have also conducted an empirical study on various instruction-tuning datasets based on the Alpaca model. If you think this dataset c
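For context, instruction-tuning records in this family of datasets are commonly stored as instruction/input/output triples; the snippet below is a generic sketch of that layout, using the conventional Alpaca-style field names rather than the repository's confirmed schema.

```python
import json

# A generic Alpaca-style instruction-tuning record; the exact standardized
# field names used by Alpaca-CoT are not shown in the preview above, so treat
# these as conventional placeholders.
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Tiny ImageNet contains 100,000 images of 200 classes ...",
    "output": "Tiny ImageNet is a 200-class image dataset with 100,000 images.",
}

# Datasets standardized this way are usually shipped as JSON/JSONL files that
# a fine-tuning script can read directly.
with open("example.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```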

facebook/multilingual_librispeech

# Dataset Card for MultiLingual LibriSpeech

## Dataset Description

- **Repository:** [Needs More Information]

### Dataset Summary

This is a streamable version of the Multilingual LibriSpeech (MLS) dataset. The data ar
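Since the card highlights streamability, here is a minimal streaming sketch using the repository id shown above; the language configuration name ("german") is an assumption, so check the card for the available configurations.

```python
# Minimal sketch for the streamable MLS dataset. The config name ("german")
# and split are assumptions; streaming avoids downloading the full corpus.
from datasets import load_dataset

mls = load_dataset(
    "facebook/multilingual_librispeech",
    "german",
    split="train",
    streaming=True,
)

sample = next(iter(mls))
print(sample.keys())                     # inspect the actual field names
print(sample["audio"]["sampling_rate"])  # audio is typically decoded array + rate
```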

## BLIP3o Pretrain Long-Caption Dataset

This collection contains **27 million images**, each paired with a long (~120 token) caption generated by **Qwen/Qwen2.5-VL-7B-Instruct**.

### Download

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="BLIP3o/BLIP3o-Pretrain-Long-Caption",
    repo_type="dataset",
)
```

## Load Dataset without Extracting

You do

# Dataset Card for Boolq

## Dataset Description

- **Repository:** https://github.com/google-research-datasets/boolean-questions
- **Paper:** https://arxiv.org/abs/1905.10044
- **Size of downloaded dataset files:** 8.77 MB
- **Size of

# AIME 2024 Dataset

## Dataset Description

This dataset contains problems from the American Invitational Mathematics Examination (AIME) 2024. AIME is a prestigious high school mathematics competition known for its challenging mathematical problems.

## Dataset Details

- **Format**: JSONL
- **Size**: 30 records
- **Source**: AIME 2024 I & II
- **Language**: English

### Data Fields

Each record
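Since the card specifies JSONL with 30 records, here is a minimal reading sketch; the file name and field names are placeholders, as the preview cuts off before the actual "Data Fields" section.

```python
import json

# Minimal sketch for reading a JSONL-formatted competition dataset like the
# one described above. File name and field names are placeholders.
with open("aime_2024.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records))       # expected: 30 problems (AIME 2024 I & II)
print(records[0].keys())  # inspect the real field names
```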

gustavosta/stable-diffusion-prompts

# Stable Diffusion Dataset

This is a set of about 80,000 prompts filtered and extracted from the image finder for Stable Diffusion: "". It was a little difficult to extract the data, since the search engine still doesn't have a public API that isn't protected by Cloudflare. If you want to test the model with a demo, you can go to: "". If you want to see the model, go to: "".

# Measuring Multimodal Mathematical Reasoning with the MATH-Vision Dataset

## Data Usage

```python
from datasets import load_dataset

dataset = load_dataset("MathLLMs/MathVision")
print(dataset)
```

## Acknowledgments

We would like to thank the following contributors for helping improve the dataset quality:

- for correcting answers for ID 338 and ID 1826

## News

digitallearninggmbh/math-lighteval

# Dataset Card for Mathematics Aptitude Test of Heuristics (MATH) dataset in lighteval format

## Dataset Description

- **Homepage:** https://github.com/hendrycks/math
- **Repository:** https://github.com/hendrycks/math
- **Paper:** https://arxiv.org/pdf/2103.03874.pdf
- **Leaderboard:** N/A
- **Point of Contact:** Dan

