This is a Plain English Papers summary of a research paper called MLRC-Bench: Can LLMs Conquer Machine Learning Research Competitions? Objective Metrics Revealed!
Can Language Agents Solve Machine Learning Research Challenges?
Evaluating how well large language model (LLM) agents perform research tasks has been limited by a lack of objective metrics. Current benchmarks often rely on subjective judgments of innovation or focus solely on basic implementation tasks that don't capture true research abilities.
A new benchmark called MLRC-Bench addresses this gap by measuring how effectively language agents tackle actual Machine Learning Research Competitions. Unlike previous work, which either uses LLMs to judge research quality or tests simple engineering tasks, MLRC-Bench provides objective metrics based on research competitions from top ML conferences.
Overview of the MLRC-Bench evaluation pipeline, showing the standardized environment, agent scaffolding, and objective evaluation process.
How MLRC-Bench Differs from Previous Benchmarks
MLRC-Bench distinguishes itself from previous benchmarks by focusing specifically on method proposal and implementation for unsolved research problems. The framework evaluates both novelty and effectiveness against established baselines and top human solutions.
Benchmark | Problem Identification | Method Proposal | Experiment Design | Code Implementation | Evaluation Method | Evaluation Object | Compute Constraints |
---|---|---|---|---|---|---|---|
AI Scientist (Lu et al., 2024b) | ✓ | ✓ | ✓ | ✓ | LLM & Human Judge | Paper | |
Can LLMs Generate Novel Research Ideas? (Si et al., 2024) | ✓ | ✓ | ✓ | | Human Judge | Idea Proposal | |
DiscoPOP (Lu et al., 2024a) | | ✓ | | ✓ | Performance-Based | Function-Level Code | |
MLE-Bench (Chan et al., 2024) | | ∼ | | ✓ | Performance-Based | File-Level Code | |
MLAgentBench (Huang et al., 2024a) | | ∼ | | ✓ | Performance-Based | File-Level Code | |
MLRC-BENCH (Ours) | | ✓ | | ✓ | Performance-Based | Repository-Level Code | ✓ |
Table 1: Comparison between MLRC-Bench and existing work on automated scientific discovery in machine learning with LLM agents. "∼" means that some but not all of the tasks in that benchmark require the indicated capability. "Compute Constraints" indicates whether the solution code must adhere to specified runtime and GPU memory limitations, an important aspect ignored by most prior work.
Unlike benchmarks like MLE-Bench that focus on engineering tasks, MLRC-Bench requires genuine methodological innovation. It also differs from frameworks like MLR-Copilot by providing objective performance metrics rather than relying on subjective paper evaluations.
The MLRC-Bench Framework
Task Selection
MLRC-Bench includes seven tasks from recent ML conferences and workshops, selected based on three criteria:
- Novel Research-Focused: Tasks require genuine methodological innovation, not just hyperparameter tuning
- Non-Trivial: Problems must involve complexity beyond standard ML algorithms
- Feasible: Starter code, data splits, and evaluation procedures must be publicly available
Competition | Venue | Research Area | Modality | Metric | Test Runtime | GPU Memory |
---|---|---|---|---|---|---|
LLM Merging (Tam et al., 2024) | NeurIPS 2024 | Efficient LLM | Text | Accuracy, ROUGE | 1 hour | 48 GB |
Backdoor Trigger Recovery (Xiang et al., 2024) | NeurIPS 2024 | LLM Safety | Text | REASR, Recall | 0.5 hour | 48 GB |
Temporal Action Localisation (Heyward et al., 2024) | ECCV 2024 Workshop | Multimodal Perception | Video, Audio | mAP | 0.5 hour | 16 GB |
Rainfall Prediction (Gruca et al., 2022) | NeurIPS 2023 | AI for Science | Satellite Data | Critical Success Index | 0.5 hour | 48 GB |
Machine Unlearning (Triantafillou et al., 2024) | NeurIPS 2023 | Data Privacy | Image | Forgetting Quality, Accuracy | 0.5 hour | 16 GB |
Next Product Recommendation (Jin et al., 2023) | KDD Cup 2023 | Recommendation System | Text | Mean Reciprocal Rank | 0.5 hour | 16 GB |
Cross-Domain Meta Learning (Carrión-Ojeda et al., 2022) | NeurIPS 2022 | Few-Shot Learning | Image | Accuracy | 3.5 hours | 16 GB |
Table 2: The 7 MLRC-Bench tasks, representing cutting-edge machine learning research. For each competition, we show the venue where it was held, its research area, data modality, and performance metric, along with the constraints presented to the agents: the maximum allowed runtime and GPU memory under our hardware configuration.
The benchmark is designed to be continuously updated with new competitions, reducing the risk of data contamination as newer LLMs are released.
Task Environment
For each task, MLRC-Bench provides:
- Task Description: Details about the research problem and constraints
- Starter Code: A baseline model, evaluation scripts, and data splits
- Human Idea: Insights from state-of-the-art papers or top participant solutions
The environment features a standardized code structure where agents can modify only the methods directory while evaluation scripts remain read-only. This ensures fair comparison while preserving evaluation integrity.
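To make the read/write split concrete, here is a minimal sketch of how such a constraint could be enforced. The directory names and helper function are hypothetical illustrations, not the benchmark's actual harness:

```python
from pathlib import Path

# Hypothetical layout: agents may edit files under methods/, while the
# evaluation scripts and data splits outside it are treated as read-only.
WRITABLE_ROOT = Path("methods")

def check_submission(changed_files):
    """Reject a submission that modifies any file outside methods/."""
    violations = [f for f in changed_files
                  if WRITABLE_ROOT not in Path(f).parents]
    if violations:
        raise ValueError(f"read-only files modified: {violations}")

# Example with a hypothetical diff listing
check_submission(["methods/model.py", "methods/utils/merge.py"])   # passes
# check_submission(["evaluation/score.py"])                        # raises ValueError
```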
Evaluation Metrics
MLRC-Bench evaluates solutions on three objective dimensions:
- Effectiveness: Performance on the competition's primary metric
- Efficiency: Runtime during training and inference
- Simplicity: Code complexity, measured in logical lines of code (fewer is simpler)
To compare across different tasks, the benchmark uses a "Relative Improvement to Human" metric that normalizes scores, setting the baseline solution to 0 and the top human solution to 100.
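Concretely, the normalization described here (and defined in the caption of Table 3 below) can be sketched as follows; the function name and example numbers are ours:

```python
def relative_improvement_to_human(agent, baseline, human):
    """Map an agent's raw competition score so that the baseline solution
    scores 0 and the top human solution scores 100 (higher raw score = better)."""
    return 100.0 * (agent - baseline) / (human - baseline)

# Toy example: baseline metric 0.40, top human 0.60, agent 0.42
# -> the agent closes 10% of the baseline-to-human gap.
print(relative_improvement_to_human(0.42, 0.40, 0.60))  # 10.0
```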
Experimental Results
Agent Performance
Researchers tested five leading LLMs (Claude 3.5 Sonnet v2, gemini-exp-1206, Llama 3.1 405B Instruct, o3-mini, and GPT-4o) using the MLAB scaffolding, which allows agents to iteratively modify code and run experiments.
Agent | temporal-action-loc | llm-merging | meta-learning | product-rec | rainfall-pred | machine-unlearning | backdoor-trigger | Avg |
---|---|---|---|---|---|---|---|---|
MLAB (gemini-exp-1206) | -0.5 | **5.0** | **-1.1** | 0.1 | 43.1 | 5.6 | 12.9 | 9.3 |
MLAB (llama3-1-405b-instruct) | 0.5 | -1.0 | -4.9 | 0.0 | 31.5 | 6.2 | 11.5 | 6.3 |
MLAB (o3-mini) | 0.3 | -1.0 | -4.9 | 0.1 | 25.1 | 3.6 | 6.2 | 4.2 |
MLAB (claude-3-5-sonnet-v2) | **0.8** | **5.0** | -4.9 | **3.0** | 14.6 | -94.7 | **39.9** | -5.2 |
MLAB (gpt-4o) | 0.3 | 2.0 | -4.9 | 0.6 | **47.5** | -18.0 | 10.4 | 5.4 |
Human Idea + MLAB (gpt-4o) | 0.5 | -1.0 | -4.9 | 2.2 | 12.3 | 6.8 | 8.8 | 3.5 |
CoI-Agent Idea (o1) + MLAB (gpt-4o) | 0.4 | -1.0 | -4.9 | 0.1 | 39.4 | **11.8** | 4.0 | 7.1 |
Table 3: For each research competition and agent, we report the test-phase relative improvement to human, defined as the agent's solution margin over the baseline normalized by the top human solution's margin over the baseline, taking the best of 8 trials. Top human participants in each competition score 100.0 by construction. Additionally, we evaluate two other gpt-4o-based pipelines: MLAB augmented with ideas from either CoI-Agent (Li et al., 2024a) or humans. The best-performing agent in each task is highlighted in bold. Our results indicate that providing additional ideas, whether sourced from AI or humans, does not consistently yield performance improvements. The best-performing configuration, gemini-exp-1206 under MLAB, achieves only 9.3% of the human-level improvement over baseline on average, underscoring the inherent difficulty of these research tasks.
The results revealed that even the best-performing agent (gemini-exp-1206) closed only 9.3% of the gap between baseline and top human performance. Surprisingly, providing additional ideas from either AI or humans didn't consistently improve performance, highlighting the challenge of effectively implementing even good ideas.
Scaling with Inference-Time Compute
We measure Pass@k as we scale the number of trials and ideas, running MLAB for eight trials per idea. The results indicate that (1) providing high-quality ideas, especially human-generated ones, significantly boosts an agent's success rate across multiple attempts, and (2) varying the balance between idea exploration and exploitation under a fixed budget yields similar outcomes, owing to diminishing returns from repeated trials.
The study analyzed how agents scale with more inference-time compute by sampling multiple ideas and running multiple implementation trials per idea. The findings show that high-quality ideas (especially human ones) boost success rates, but under a fixed compute budget, there was no significant difference between exploring more ideas versus running more trials per idea.
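For reference, Pass@k over n trials with c successes is typically computed with the standard unbiased estimator sketched below. Whether the paper uses exactly this estimator, and its precise success criterion (e.g., beating the baseline), is our assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials, drawn without replacement
    from n trials containing c successes, is a success."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example: 8 MLAB trials of one idea, 2 of which count as successes
print(pass_at_k(n=8, c=2, k=4))  # ~0.79
```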
Subjective vs. Objective Evaluation
Radar plots of objective and subjective evaluations for agent-generated solutions across the seven research tasks. Each dimension is normalized to a 1–5 scale, where higher values indicate better performance. The objective metrics (effectiveness, efficiency, and simplicity) are highlighted in bold; the rest are subjective metrics, assessed by prompting an LLM as a judge. Notably, the more effective solutions identified by agents tend to be more complex and time-consuming (e.g., in the backdoor trigger recovery task). Additionally, overlapping scores on the subjective dimensions suggest that LLM-based evaluation struggles to distinguish the research capabilities of different models.
Correlation heatmap between objective (x-axis) and subjective (y-axis) metrics for agent-generated solutions across all tasks. In this setting, code is included when prompting the LLM to evaluate subjective dimensions. No strong correlation is observed, suggesting that LLM-judged subjective metrics may not reliably indicate real-world impact.
The researchers investigated whether LLM-based evaluations can reliably assess research quality by comparing subjective judgments (validity, clarity, rigor, generalizability, and innovativeness) with objective metrics. They found almost no correlation between LLM-judged innovativeness and actual effectiveness, suggesting that LLM evaluations alone aren't reliable proxies for real-world research impact.
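A minimal sketch of this kind of check, assuming a per-solution table of objective scores and LLM-judged ratings (the column names and values below are hypothetical, not the paper's data):

```python
import pandas as pd

# Hypothetical per-solution scores: objective effectiveness (relative
# improvement to human) vs. an LLM judge's 1-5 innovativeness rating.
df = pd.DataFrame({
    "effectiveness":  [10.0, 2.5, 40.0, -1.0, 0.5],
    "innovativeness": [4, 5, 3, 4, 3],
})

# Spearman rank correlation; a value near 0 means the judge's
# innovativeness ratings carry little signal about actual performance.
print(df["effectiveness"].corr(df["innovativeness"], method="spearman"))
```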
Implementation Process Analysis
We track the percentage change in performance, runtime, and lines of code relative to the baseline across iterative refinement of implementations within a single trial of the LLM-based MLAB agent on the development set. Higher performance improvement is better, while larger increases in runtime and lines of code are worse.
Distribution of stages at each step, annotated using GPT-4o and grouped into seven distinct stages, illustrating how task focus and activity shift over the course of all tasks.
Analysis of the implementation process revealed that agents spend most of their time understanding the problem environment and writing code, with limited effort on brainstorming new ideas. As they refine implementations, runtime and code complexity increase steadily, often without proportional performance gains, suggesting that agents tend to over-refine their solutions.
Cost-Effectiveness Analysis
We perform a cost-effectiveness analysis of various setups. On the x-axis, we plot API cost, where lower is better, and on the y-axis, we show relative improvement to human, where higher is better. Among the settings evaluated, Llama 3.1 405B with the MLAB scaffolding emerges as a Pareto-optimal setting that balances cost and performance improvement.
When considering both performance and API costs, Llama 3.1 405B Instruct with the MLAB scaffolding offered the most favorable trade-off, achieving higher success rates than GPT-4o and Claude 3.5 Sonnet at a significantly lower cost.
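As a generic illustration of how such Pareto-optimal settings are identified (this is not the paper's plotting code, and the numbers below are made up):

```python
def pareto_frontier(points):
    """Keep (cost, improvement) pairs not dominated by any other pair,
    i.e., no other setting is both cheaper and more effective."""
    frontier = []
    for cost, improvement in sorted(points):      # ascending API cost
        if not frontier or improvement > frontier[-1][1]:
            frontier.append((cost, improvement))
    return frontier

# Hypothetical (API cost in dollars, relative improvement to human) pairs
settings = [(4.0, 6.0), (10.0, 5.0), (18.0, 9.0), (22.0, -5.0)]
print(pareto_frontier(settings))  # [(4.0, 6.0), (18.0, 9.0)]
```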
Conclusion
MLRC-Bench provides a rigorous, objective framework for evaluating AI research agents that avoids the pitfalls of purely subjective assessment. The results show that current state-of-the-art LLM agents still fall significantly short of human research capabilities, with the best agent achieving only 9.3% of human-level improvement over baseline.
The benchmark reveals a concerning misalignment between LLM-judged innovation and actual performance on cutting-edge research problems. This highlights the importance of objective metrics when evaluating research agents' capabilities.
As a dynamic benchmark that can incorporate new competitions, MLRC-Bench provides a solid foundation for tracking progress in AI-assisted scientific discovery while maintaining rigorous evaluation standards. The significant gap between current agents and human researchers suggests ample room for improvement in how AI systems approach genuine research innovation.