# ODA-Mixture-100k

*Figure: subject distribution of ODA-Mixture-100k.*

ODA-Mixture-100k is a compact general-purpose post-training dataset curated from top-performing open corpora (selected via the *OpenDataArena* leaderboard) and refined through deduplication and benchmark decontamination.

---

## Dataset Summary

- **Domain**: General-purpose (e.g., Math, Code, Reasoning, General).
- **Format**: Problem → Solution (reasoning trace) → Final answer.
- **Scale (selected training set)**: ~**100K** samples.
- **Goal**: Achieve significant general-purpose performance gains across various domains (Math, Code, Reasoning, etc.) using a small-scale, curated dataset of ~100K samples.

---

## Data Curation Pipeline

ODA-Mixture-100k is built by following a single rule: **trust the OpenDataArena leaderboard**.

### 1️⃣ Data Collection

We chose **LIMO** as our foundation because it achieves a high ranking on the ODA overall leaderboard with very few samples; this efficiency allows us to establish a strong reasoning baseline. We then augment this core with **AM-Thinking-v1-Distilled-math** and **AM-Thinking-v1-Distilled-code**, the top-performing and most efficient datasets on the ODA Math and Code leaderboards, to enhance specialized domain capabilities.

### 2️⃣ Deduplication & Decontamination

We first perform **exact deduplication** over all questions to remove identical items, and then run **benchmark decontamination** to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.

### 3️⃣ Data Selection

To stay within our ~100K data budget while maximizing the impact of each sample, we employ semantic clustering to map the overall data distribution. Within each cluster, we preferentially sample the most challenging instances, using sequence length as a practical proxy for reasoning complexity and problem difficulty. A minimal sketch of the deduplication and selection steps appears after the results table below.

---

## Source Composition

| Source | Count | Percentage |
|---|---:|---:|
| LIMO | 817 | 0.81% |
| AM-Thinking-Distilled-math | 50,244 | 49.59% |
| AM-Thinking-Distilled-code | 50,245 | 49.60% |

---

## Data Format

```json
{
  "id": "unique_identifier",
  "source": "data source",
  "question": "textual question or instruction",
  "response": "textual response"
}
```

A minimal loading sketch is shown below.

---

## Performance

ODA-Mixture-100k is evaluated as an SFT corpus for both **Qwen2.5-7B-Base** and **Qwen3-8B-Base**. Across the full ODA benchmark suite spanning four domains, **General** (DROP, IFEVAL, AGIEVAL, MMLU-Pro), **Math** (GSM8K, MATH500, Omni-Math, OlympiadBench, AIME2024), **Code** (HumanEval, MBPP, LCB (V5), HumanEval+), and **Reasoning** (ARC-C, BBH, CALM, KOR-BENCH), we observe consistent improvements over the corresponding base checkpoints, with particularly strong gains on several benchmarks.
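To use the corpus (e.g., for SFT as evaluated below), records in the schema above can be loaded with the Hugging Face `datasets` library. The following is a minimal sketch rather than an official loader: the repository id is taken from the citation URL at the bottom of this card, and the single `train` split is an assumption.

```python
# Minimal loading sketch. Assumptions: repo id from the citation URL,
# a "train" split, and the field names from the Data Format section.
from datasets import load_dataset

ds = load_dataset("OpenDataArena/Mixture-100k", split="train")

sample = ds[0]
print(sample["id"], sample["source"])   # identifier and originating corpus
print(sample["question"][:200])         # problem / instruction text
print(sample["response"][:200])         # reasoning trace and final answer
```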
**Leaderboard Performance Comparison.** Best scores in bold, second-best underlined. Eff. denotes Data Efficiency.

**Qwen2.5-7B-Base**

| Dataset | Size | Eff. | General | Math | Code | Reasoning | AVG |
|---|---:|---:|---:|---:|---:|---:|---:|
| Qwen2.5-7B-Base | -- | -- | 51.4 | 39.8 | 50.1 | 42.7 | 46.0 |
| OpenThoughts3-1.2M | 1.2M | +0.011 | 45.5 | 71.8 | <u>67.0</u> | 54.3 | 59.6 |
| OmniThought-0528 | 365k | +0.027 | 47.1 | 71.2 | 47.6 | 57.2 | 55.8 |
| SYNTHETIC-2-SFT-verified | 105k | +0.086 | 51.3 | 69.8 | 40.1 | <u>58.9</u> | 55.0 |
| AM-Thinking-v1-Distilled-math | 558k | +0.016 | 57.7 | **77.4** | 39.5 | 44.8 | 54.8 |
| LIMO | 817 | +9.920 | <u>60.7</u> | 44.0 | 57.9 | 53.8 | 54.1 |
| MiroMind-M1-SFT-719K | 719k | +0.006 | 52.0 | 71.0 | 26.3 | 51.5 | 50.2 |
| AM-Thinking-v1-Distilled-code | 324k | +0.024 | 49.9 | 52.3 | **68.7** | 44.4 | 53.8 |
| Light-R1-SFTData | 79k | +0.084 | 55.5 | 64.4 | 38.8 | 51.9 | 52.7 |
| ODA-Mixture-500k | 500k | +0.039 | **63.4** | <u>72.8</u> | 66.7 | **59.6** | **65.6** |
| ODA-Mixture-100k | 100k | +0.149 | 56.8 | 71.2 | 64.4 | 51.5 | <u>61.0</u> |

**Qwen3-8B-Base**

| Dataset | Size | Eff. | General | Math | Code | Reasoning | AVG |
|---|---:|---:|---:|---:|---:|---:|---:|
| Qwen3-8B-Base | -- | -- | 58.7 | 51.2 | 52.4 | 50.6 | 53.2 |
| MiroMind-M1-SFT-719K | 719k | +0.023 | 64.5 | 77.2 | 63.6 | 65.8 | 67.8 |
| AM-Thinking-v1-Distilled-math | 558k | +0.028 | <u>65.9</u> | **79.7** | 59.5 | 63.2 | 67.1 |
| OmniThought-0528 | 365k | +0.043 | 55.8 | <u>78.3</u> | 68.1 | 66.0 | 67.0 |
| AM-Thinking-v1-Distilled-code | 324k | +0.045 | 64.8 | 64.9 | **75.8** | 59.3 | 66.2 |
| Light-R1-SFTData | 79k | +0.168 | 64.9 | 71.8 | 59.0 | 63.6 | 64.8 |
| SYNTHETIC-2-SFT-verified | 105k | +0.107 | 59.5 | 75.4 | 56.1 | <u>66.6</u> | 64.4 |
| LIMO | 817 | +0.490 | 61.7 | 46.0 | 52.7 | 54.1 | 53.6 |
| ODA-Mixture-500k | 500k | +0.042 | **71.2** | 77.2 | 73.0 | **69.7** | **72.8** |
| ODA-Mixture-100k | 100k | +0.177 | 61.1 | 77.3 | <u>73.2</u> | 64.7 | <u>69.0</u> |
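For concreteness, here is a minimal sketch of the deduplication and difficulty-aware selection steps from the curation pipeline above. It is an illustrative simplification rather than the exact pipeline: benchmark decontamination is omitted, cluster assignments are taken as given (the semantic clustering step itself is not shown), and using combined question-plus-response length as the difficulty proxy is an assumption.

```python
# Simplified sketch of steps 2 and 3 of the curation pipeline (illustrative only).
import hashlib
from collections import defaultdict


def exact_dedup(records):
    """Drop records whose question text exactly duplicates an earlier one."""
    seen, kept = set(), []
    for rec in records:
        key = hashlib.sha256(rec["question"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def select_within_clusters(records, cluster_ids, budget):
    """Split the sample budget evenly across clusters and, within each cluster,
    keep the longest samples (sequence length as a rough difficulty proxy)."""
    clusters = defaultdict(list)
    for rec, cid in zip(records, cluster_ids):
        clusters[cid].append(rec)
    per_cluster = max(1, budget // max(1, len(clusters)))
    selected = []
    for items in clusters.values():
        # Question + response length stands in for reasoning complexity (assumption).
        items.sort(key=lambda r: len(r["question"]) + len(r["response"]), reverse=True)
        selected.extend(items[:per_cluster])
    return selected[:budget]


if __name__ == "__main__":
    demo = [
        {"question": "What is 2 + 2?", "response": "4"},
        {"question": "What is 2 + 2?", "response": "Four."},  # exact duplicate question
        {"question": "Prove that sqrt(2) is irrational.",
         "response": "Assume sqrt(2) = p/q in lowest terms ..."},
    ]
    deduped = exact_dedup(demo)                                  # 2 records remain
    picked = select_within_clusters(deduped, cluster_ids=[0, 1], budget=2)
    print(f"{len(deduped)} after dedup, {len(picked)} selected")
```

In the real pipeline, details such as the per-cluster budget split and the exact length measure (tokens vs. characters, question vs. response) are tunable choices not specified on this card.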
---

## About OpenDataArena

[OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.

**Key Features:**

- **Dataset Leaderboard**: helps researchers identify **the most valuable and high-quality datasets across different domains**.
- **Detailed Evaluation Scores**: provides **comprehensive metrics** to assess data quality, complexity, difficulty, etc.
- **Data Processing Toolkit**: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring.

If you find our work helpful, please consider **⭐ starring and subscribing** to support our research.

---

## Citation

```bibtex
@dataset{opendataarena_odamix100k_2025,
  author = {OpenDataArena},
  title  = {OpenDataArena-ODA-Mixture-100k},
  year   = {2025},
  url    = {https://huggingface.co/datasets/OpenDataArena/Mixture-100k}
}
```

```bibtex
@article{cai2025opendataarena,
  title   = {OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author  = {Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal = {arXiv preprint arXiv:2512.14051},
  year    = {2025}
}
```
