This is a Plain English Papers summary of a research paper called LeetCodeDataset: New Benchmark for Robust Code LLM Evaluation & Training. If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.

Introducing LeetCodeDataset: A New Benchmark for Code LLMs

Code generation has become a critical capability for large language models (LLMs), but researchers face two significant challenges: the scarcity of coding benchmarks that accurately assess reasoning abilities and the lack of self-contained training resources. While benchmarks like LiveCodeBench have attempted to address data contamination through live updates, they cover only a limited number of problems and lack detailed metadata for comprehensive analysis.

The authors introduce LeetCodeDataset, a high-quality benchmark designed to overcome these limitations. By meticulously curating over 90% of Python problems from LeetCode, they've created a resource that serves dual purposes: robust evaluation and efficient training. Each problem in the dataset comes with rich metadata (difficulty levels, release dates, topic tags) and an average of 100+ test cases to minimize false positives.

A key innovation is the dataset's temporal split strategy - problems released after July 1, 2024 form the test set, while earlier problems constitute the training set. This approach ensures contamination-free evaluation while providing ample training material.
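
As an illustration, the temporal split can be reproduced with a simple date filter over the per-problem metadata. The sketch below is minimal and assumes hypothetical field names (`release_date`, `difficulty`, `slug`) rather than the dataset's documented schema.

```python
from datetime import date

# Hypothetical records; the real dataset exposes similar metadata per problem.
problems = [
    {"slug": "problem-a", "release_date": date(2019, 3, 2), "difficulty": "Easy"},
    {"slug": "problem-b", "release_date": date(2024, 9, 15), "difficulty": "Medium"},
]

CUTOFF = date(2024, 7, 1)

# Problems released on or after the cutoff form the contamination-free test set;
# everything released earlier is available for training.
train_set = [p for p in problems if p["release_date"] < CUTOFF]
test_set = [p for p in problems if p["release_date"] >= CUTOFF]
```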

Building a Comprehensive Code Evaluation Dataset

Data Collection

As of March 2025, LeetCode hosted 3,505 programming problems, of which 3,115 supported Python submissions. The data collection process began with this Python problem set and followed several key steps:

  1. Metadata Acquisition: Using LeetCode's GraphQL API, the researchers collected comprehensive metadata for each problem, including unique identifiers, difficulty ratings, descriptions, starter code, and topic tags.

  2. Canonical Solution Verification: Reference solutions were retrieved from open-source GitHub repositories and verified on the LeetCode platform to establish ground truth solutions with 100% acceptance rates.

  3. Entry Point Identification: For each problem, the function targeted for testing was identified through text pattern matching, focusing exclusively on problems with single-function starter code.

  4. Input Generation: Inputs for testing were generated using one-shot prompting with an LLM, followed by additional prompting to create more complex test cases.

  5. Test Case Generation: Outputs were computed by running the canonical solutions in a sandboxed execution environment, with special handling for data structures such as binary trees and linked lists (a simplified sketch of steps 3–5 follows this list).
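
To make the pipeline concrete, here is a simplified sketch of steps 3–5 under stated assumptions: the regular expression, the `find_entry_point` and `generate_test_cases` helpers, and the toy problem are illustrative stand-ins, and the real pipeline additionally runs inside a sandbox and serializes structures such as binary trees and linked lists.

```python
import re
from typing import Any, Callable

def find_entry_point(starter_code: str) -> str:
    """Identify the function to test from single-function starter code (step 3)."""
    match = re.search(r"def\s+(\w+)\s*\(", starter_code)
    if match is None:
        raise ValueError("No function definition found in starter code")
    return match.group(1)

def generate_test_cases(canonical_solution: Callable[..., Any],
                        inputs: list[tuple]) -> list[dict]:
    """Compute expected outputs by running the verified canonical solution (step 5)."""
    return [{"input": args, "expected": canonical_solution(*args)} for args in inputs]

# Toy example; step 4 (LLM prompting) would produce the input list.
starter = "class Solution:\n    def twoSum(self, nums, target):\n        ..."
entry_point = find_entry_point(starter)  # -> "twoSum"

def canonical_two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i

tests = generate_test_cases(canonical_two_sum, [([2, 7, 11, 15], 9), ([3, 3], 6)])
```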

Figure 1: An example of a LeetCode problem.

Through this process, the researchers successfully generated outputs for 2,869 problems, covering over 90% of all Python problems available on the platform.

For supervised fine-tuning (SFT), they employed Qwen2.5-Coder-32B-Instruct to implement a multi-stage generation process that produced diverse, verified solution candidates. The resulting dataset supports both evaluation and training, with the potential to facilitate reinforcement learning approaches as well.
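
The summary does not spell out the exact multi-stage prompting, but the core generate-then-verify loop might look roughly like the sketch below; `generate_solution` stands in for a call to Qwen2.5-Coder-32B-Instruct, and the direct `exec` is a simplification of the sandboxed execution described above.

```python
def verify(candidate_code: str, entry_point: str, tests: list[dict]) -> bool:
    """Run a candidate solution against the problem's verified test cases."""
    namespace: dict = {}
    exec(candidate_code, namespace)  # simplification: the real pipeline sandboxes this
    func = namespace[entry_point]    # assumes a top-level function, not a Solution class
    return all(func(*t["input"]) == t["expected"] for t in tests)

def build_sft_record(problem: dict, generate_solution, max_attempts: int = 4):
    """Keep only model-generated solutions that pass every test case."""
    for _ in range(max_attempts):
        candidate = generate_solution(problem["prompt"])  # hypothetical LLM call
        if verify(candidate, problem["entry_point"], problem["tests"]):
            return {"prompt": problem["prompt"], "response": candidate}
    return None  # no verified solution found for this problem
```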

Dataset Overview

The LeetCodeDataset can be analyzed along multiple dimensions:

Difficulty Levels: Problems are categorized into three levels of difficulty:

  • Easy: Problems focusing on basic syntax and foundational data structure applications
  • Medium: Problems requiring familiarity with classical algorithms and efficient strategy design
  • Hard: Problems involving complex algorithmic combinations, mathematical insights, or specialized optimizations

| Difficulty | Count | Proportion (%) | Release Year | Count | Proportion (%) |
|------------|-------|----------------|--------------|-------|----------------|
| Easy       | 686   | 23.91          | Before 2020  | 1077  | 37.54          |
| Medium     | 1498  | 52.21          | 2020–2022    | 1009  | 35.17          |
| Hard       | 685   | 23.88          | 2023–2025    | 783   | 27.29          |

Table 1: Distribution of difficulty and release year on the LeetCodeDataset.

Release Dates: The yearly distribution shows approximately 350 new problems added annually in recent years. This temporal information is valuable for contamination-free evaluation, with problems from the past 6-12 months providing an effective balance between bias and variance.

Topic Tags: Each problem is labeled with algorithm and data structure tags, with multiple tags permitted per problem. This tagging system helps learners focus on specific skills and provides valuable insights for LLMs.

Figure 2: Topic frequency distribution.

The comprehensive coverage and detailed metadata make LeetCodeDataset a valuable resource for both evaluating existing models and training new ones.

Holistic Evaluation

The researchers evaluated six models on the LeetCodeDataset test set, which consists of 256 programming problems released after July 1, 2024. The evaluated models included:

  • Two proprietary systems: GPT-4o and Claude 3.7 Sonnet
  • Four open-source models: DeepSeek-V3, DeepSeek-R1, Qwen2.5-Max, and QwQ-Plus

All experiments used identical generation parameters (temperature=0.2, top_p=0.95) to ensure fair comparisons.
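
As a rough sketch of such an evaluation loop (assuming each model is served behind an OpenAI-compatible endpoint; the `pass_at_1` helper and prompt format are illustrative, not the authors' exact harness):

```python
from openai import OpenAI  # assumption: models are reachable via an OpenAI-compatible API

client = OpenAI()

def generate_solution(model: str, prompt: str) -> str:
    """One completion per problem with the reported decoding parameters."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        top_p=0.95,
    )
    return response.choices[0].message.content

def pass_at_1(passed: list[bool]) -> float:
    """With a single sample per problem, pass@1 is simply the fraction solved."""
    return 100.0 * sum(passed) / len(passed)
```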

Following LiveCodeBench's temporal evaluation methodology, they analyzed monthly accuracy changes relative to problem release months and summarized model pass rates across difficulty levels.
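
Given per-problem results, the monthly breakdown in Figure 3 can be reproduced with a simple group-by; the column names and sample rows below are assumptions for the sketch.

```python
import pandas as pd

# Hypothetical per-problem results: one row per (model, problem) evaluation.
results = pd.DataFrame({
    "model": ["DeepSeek-R1", "DeepSeek-R1", "GPT-4o-0806"],
    "release_month": ["2024-08", "2024-09", "2024-08"],
    "passed": [True, False, True],
})

# Pass rate (%) per model and problem-release month, as plotted in Figure 3.
monthly = (results.groupby(["model", "release_month"])["passed"]
                  .mean()
                  .mul(100)
                  .unstack("release_month"))
print(monthly.round(1))
```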

Figure 3: Monthly pass rates of various models on the LeetCodeDataset.

The evaluation revealed three key insights:

  1. Superior Performance of Reasoning Models: DeepSeek-R1 (pass@1 rate = 65.23%) and QwQ-Plus (pass@1 rate = 56.25%) demonstrated substantial advantages in solving complex coding problems, highlighting the value of long-CoT reasoning capabilities.

  2. Baseline Comparison: Claude-3.7-Sonnet performed best among non-reasoning models. GPT-4o and DeepSeek-V3 achieved identical overall scores, with GPT-4o performing better on easy problems and DeepSeek-V3 excelling on hard problems.

  3. Contamination Analysis: The minimal temporal overlap between GPT-4o's release date (August 2024) and the test problem release window (post-July 2024) suggests authentic capability measurements.

| Model             | Easy (%) | Medium (%) | Hard (%) | Overall (%) |
|-------------------|----------|------------|----------|-------------|
| GPT-4o-0806       | 81.48    | 32.76      | 10.47    | 35.55       |
| Claude-3.7-Sonnet | 87.04    | 54.31      | 23.26    | 50.78       |
| DeepSeek-V3       | 77.78    | 31.90      | 13.95    | 35.55       |
| DeepSeek-R1       | 94.44    | 68.97      | 41.86    | 65.23       |
| Qwen2.5-Max       | 74.07    | 25.00      | 10.47    | 30.47       |
| QwQ-Plus          | 92.59    | 62.93      | 24.42    | 56.25       |

Table 2: Model pass rates by difficulty level on the LeetCodeDataset.

The researchers also analyzed model performance across different topic tags, identifying each model's strengths and weaknesses. Key findings included:

  • DeepSeek-R1 showed consistently strong performance across all topic tags, with pass rates mostly ranging from 60% to 70% and minimal variation.
  • Non-reasoning models exhibited significant fluctuations, such as GPT-4o dropping to 7.7% in Binary Search tasks but reaching 63.2% in Simulation tasks.
  • Significant performance gaps between reasoning and non-reasoning models appeared in Dynamic Programming, Binary Search, and Tree-related tasks.

| Topic Tag | GPT-4o (%) | DeepSeek-V3 (%) | Qwen2.5-Max (%) | Claude-3.7-Sonnet (%) | DeepSeek-R1 (%) | QwQ-Plus (%) |
|---|---|---|---|---|---|---|
| Array | 32.1 | 34.5 | 28.0 | 51.2 | 67.9 | 55.4 |
| String | 37.3 | 38.8 | 35.8 | 49.3 | 68.7 | 50.7 |
| Dynamic Programming | 10.5 | 15.8 | 8.8 | 31.6 | 70.2 | 40.4 |
| Hash Table | 39.5 | 37.5 | 35.7 | 50.0 | 66.1 | 50.0 |
| Math | 38.2 | 40.0 | 32.7 | 56.4 | 69.1 | 58.2 |
| Greedy | 12.5 | 15.6 | 12.5 | 21.9 | 62.5 | 28.1 |
| Sorting | 20.0 | 20.0 | 6.7 | 36.7 | 66.7 | 53.3 |
| Prefix Sum | 17.9 | 14.3 | 14.3 | 35.7 | 71.4 | 35.7 |
| Binary Search | 7.7 | 23.1 | 11.5 | 30.8 | 73.1 | 30.8 |
| Sliding Window | 52.2 | 47.8 | 43.5 | 69.6 | 56.5 | 52.2 |
| Enumeration | 27.3 | 31.8 | 9.1 | 45.5 | 63.6 | 50.0 |
| Matrix | 19.0 | 33.3 | 19.0 | 52.4 | 76.2 | 61.9 |
| Simulation | 63.2 | 57.9 | 42.1 | 63.2 | 63.2 | 84.2 |
| Depth-First Search | 31.6 | 21.1 | 26.3 | 31.6 | 57.9 | 57.9 |
| Bit Manipulation | 33.3 | 44.4 | 27.8 | 50.0 | 50.0 | 66.7 |
| Combinatorics | 12.5 | 18.8 | 12.5 | 37.5 | 93.8 | 25.0 |
| Counting | 20.0 | 26.7 | 26.7 | 46.7 | 53.3 | 46.7 |
| Graph | 40.0 | 33.3 | 46.7 | 53.3 | 66.7 | 66.7 |
| Heap (Priority Queue) | 40.0 | 53.3 | 33.3 | 66.7 | 66.7 | 66.7 |
| Number Theory | 38.5 | 30.8 | 30.8 | 38.5 | 69.2 | 53.8 |
| Breadth-First Search | 41.7 | 33.3 | 50.0 | 58.3 | 58.3 | 75.0 |
| Tree | 27.3 | 18.2 | 9.1 | 9.1 | 72.7 | 54.5 |
| Two Pointers | 20.0 | 30.0 | 30.0 | 40.0 | 80.0 | 40.0 |
| Segment Tree | 30.0 | 30.0 | 30.0 | 70.0 | 80.0 | 30.0 |
| All | 35.5 | 35.5 | 30.5 | 50.8 | 65.2 | 56.2 |

Table 3: Model pass rates by topic tags on the LeetCodeDataset.

This detailed analysis, similar to that seen in Performance Study of LLM-Generated Code on LeetCode, provides valuable insights for future model development and improvement.

Efficient Training

Experiment Setup

The researchers conducted supervised fine-tuning (SFT) experiments using Qwen2.5-Coder-7B as the base model. Training parameters included:

  • 3 epochs
  • Initial learning rate of 1e-5
  • Warmup ratio of 0.1
  • Cosine learning rate scheduling
  • Batch size of 32

These parameters remained consistent across all experiments to ensure fair comparisons.
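
One way to express these hyperparameters is with Hugging Face `TrainingArguments`, as sketched below; the output directory, the per-device/accumulation split of the batch size, and the bf16 setting are assumptions, and the dataset preprocessing and trainer wiring are omitted.

```python
from transformers import TrainingArguments

# Reported SFT hyperparameters for Qwen2.5-Coder-7B, expressed as TrainingArguments.
# Effective batch size 32 = 4 per device x 8 accumulation steps (split is assumed).
training_args = TrainingArguments(
    output_dir="qwen2.5-coder-7b-leetcode-sft",  # hypothetical path
    num_train_epochs=3,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    bf16=True,  # assumption: mixed-precision training
)
```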

Results

To evaluate the training efficiency of LeetCodeDataset, the researchers compared it with several widely used coding datasets ranging from 9.5K to 111.1K samples, all substantially larger than the 2.6K-sample LeetCodeDataset training set.

Models were trained on each dataset and evaluated across four benchmarks: HumanEval, MBPP, LiveCodeBench, and the LeetCodeDataset evaluation set.

| Training Data | Rows | HumanEval | MBPP | LiveCodeBench (24-08 to 25-02) | LeetCodeDataset (24-07 to 25-03) |
|---|---|---|---|---|---|
| Magicoder Evol-Instruct-110K | 111.1K | 77.4 | 74.1 | 15.1 | 13.7 |
| Magicoder OSS-Instruct-75K | 75.1K | 73.8 | 76.5 | 15.1 | 12.9 |
| Open-R1 CodeForces-CoT | 9.5K | 79.9 | 74.1 | 15.8 | 13.3 |
| OpenThoughts 114k | 19.9K | 77.4 | 75.7 | 16.9 | 16.4 |
| LeetCodeDataset (pre 2024-07, human) | 2.6K | 55.5 | 53.4 | 14.0 | 10.9 |
| LeetCodeDataset (pre 2024-07, model) | 2.6K | 79.9 | 77.5 | 15.4 | 12.5 |

Table 4: Model SFT-training results.

The results revealed three key findings:

  1. Superior Model-Generated Training Data: Models trained on model-generated responses significantly outperformed those trained on human-written responses (79.9% vs. 55.5% on HumanEval; 77.5% vs. 53.4% on MBPP), despite both being verified as correct. This highlights the quality advantage of model-generated training data for code generation tasks.

  2. High Data Efficiency: Training with only 2.6K model-generated LeetCode samples achieved superior performance on HumanEval (79.9%) and MBPP (77.5%), surpassing models trained on much larger datasets. This demonstrates exceptional data efficiency for domain-specific code generation.

  3. Limitations on Hard Benchmarks: Despite being in-distribution for LeetCodeDataset, the 2.6K-trained model underperformed on hard benchmarks, suggesting that small-scale SFT primarily develops basic programming skills.

These findings align with research on training efficiency in KodCode: A Diverse, Challenging, Verifiable Synthetic Dataset for Coding, which similarly emphasizes the value of high-quality, targeted training data.

Related Work

Code Generation Benchmarks: Numerous benchmarks have been developed to evaluate code generation capabilities in LLMs. For foundational Python programming, widely used benchmarks include HumanEval and MBPP. EvalPlus offers a more rigorous variant, while MultiPL-E extends these benchmarks to 18 other programming languages.

As LLM capabilities advance, many benchmarks have become too easy to adequately assess modern models. Specialized benchmarks for competitive programming include APPS, CodeContests, and TACO, which source problems from platforms like Codeforces and AtCoder. LiveCodeBench provides holistic and contamination-free evaluations by dynamically updating coding challenges, while CODEELO aligns with the CodeForces platform by submitting directly to it.

Fine-tuning Datasets for Code: Synthetic data is a primary source for LLM supervised fine-tuning. CodeAlpaca employs few-shot prompting and teacher models to synthesize data, while Magicoder leverages open-source code snippets to generate high-quality instructional data. Competitive programming benchmarks like APPS and CodeTest provide training splits for SFT. For advanced reasoning, Open-R1 CodeForces-CoTs includes 10K CodeForces problems with reasoning traces, while OpenThoughts is a synthetic dataset with 114K examples spanning various domains.

Limitations

Despite its effectiveness, LeetCodeDataset has three key limitations:

  1. False Positive Risks: While the dataset includes diverse inputs and test cases to reduce incorrect solutions passing, it lacks extremely complex input patterns and suffers from an imbalanced test case distribution. These limitations present residual risks of false positives, such as solutions passing tests despite logic errors.

  2. Complexity Analysis Gap: Determining time and space complexity would require LeetCode-style test cases tailored to each algorithm's behavior; this lies beyond the dataset's current scope because it demands manual, problem-specific validation.

  3. Coverage Gaps: The dataset doesn't include certain problem types, particularly problems with multiple solution entry points.

Future Impact

LeetCodeDataset addresses key challenges in code-generation research by providing a rigorously curated resource that enables reliable, contamination-free model evaluation and highly efficient training. Its temporal split ensures clean benchmarking and supports longitudinal studies, while its comprehensive coverage of algorithms and data structures facilitates robust overall evaluation and fine-grained skill analysis.

The integrated evaluation toolkit streamlines assessment and comparison across models, making it a valuable resource for researchers and practitioners alike.

Perhaps most importantly, the dataset demonstrates remarkable training efficiency - models trained on just 2.6K curated samples can match the performance of those trained on 110K examples from previous benchmarks. This finding suggests that future work might benefit from focusing on data quality rather than quantity.

LeetCodeDataset is available on Hugging Face and GitHub, positioning it to become a foundational resource for developing, training, and evaluating advanced code-generation models.
