This is a Plain English Papers summary of a research paper called AI Writes First Peer-Reviewed Paper: Meet AI Scientist-v2.

The First AI-Generated Paper to Pass Peer Review: Introducing AI Scientist-v2

Artificial intelligence has reached a remarkable milestone in scientific research. The AI Scientist-v2, developed by researchers at Sakana AI, has become the first fully autonomous AI system to author a scientific paper that passed peer review at a machine learning workshop. This breakthrough demonstrates how AI can independently formulate hypotheses, design and execute experiments, analyze results, and compile findings into workshop-level manuscripts without human intervention.

Unlike its predecessor, AI Scientist-v2 eliminates the need for human-authored code templates, can work across diverse machine learning domains, and employs a sophisticated tree-search methodology to explore multiple research paths simultaneously. This advancement represents a significant step toward AI systems that can meaningfully contribute to scientific discovery.

| Feature | Codebase Drafting | Execution Planning | Parallel Experiments | VLM Reviewer | Human Evaluation Result |
|---|---|---|---|---|---|
| The AI Scientist-v1 | Topic-Specific | Linear | ✗ | ✗ | Not Submitted |
| The AI Scientist-v2 | Domain-General | Tree-Based | ✓ | ✓ | Workshop Acceptance-Worthy |

Table 1: Comparison of features between AI Scientist-v1 and AI Scientist-v2

Evolution of AI in Scientific Research: Building on Previous Work

The development of AI for scientific discovery has progressed considerably in recent years. Early systems relied heavily on human guidance and specialized templates, limiting their generalizability and autonomy. AI Scientist-v2 builds upon these foundations while addressing their key limitations.

Previous approaches often required domain-specific code templates and human oversight at critical decision points. The creators of AI Scientist-v2 removed these constraints by building a system that generates domain-general code and makes autonomous decisions throughout the research process. This achievement extends a lineage of increasingly capable automated research tools, including automated machine learning (AutoML) methods and the original AI Scientist-v1.

The new system represents a significant leap forward in AI-driven scientific discovery, moving from tools that assist human researchers to systems that can conduct original research independently.

How AI Scientist-v2 Works: Architecture and Methodology

AI Scientist-v2 operates as an end-to-end research system capable of handling the entire scientific workflow without human intervention. The system begins by formulating research questions, then designs a series of experiments to investigate these questions. It generates and executes code, analyzes the resulting data, and ultimately compiles its findings into a coherent scientific manuscript.

Key improvements over the predecessor version include domain-general code generation, parallel experimentation through tree search, and integration of vision-language models for figure evaluation. These advancements enable the system to tackle more complex research questions across a broader range of machine learning topics.

Managing the Research Process: Stage-Based Approach

The research process in AI Scientist-v2 follows a structured four-stage approach:

  1. Preliminary Investigation: The system explores basic questions to establish a research direction
  2. Hyperparameter Tuning: Parameters are optimized for subsequent experiments
  3. Research Agenda Execution: The main experimental work is conducted
  4. Ablation Studies: The system tests the contribution of individual components

Each stage builds upon the findings from previous stages, creating a progressive research narrative. This structured approach helps manage computational resources effectively and ensures thorough exploration of the research space.

| Hyperparameter | Value |
|---|---|
| Debug Probability | 1.0 |
| Maximum Debug Depth | 3 |
| Maximum Experiment Runtime per Node | 1 hour |
| Node allocation, Stage 1: Preliminary Investigation | 21 nodes |
| Node allocation, Stage 2: Hyperparameter Tuning | 12 nodes |
| Node allocation, Stage 3: Research Agenda Execution | 12 nodes |
| Node allocation, Stage 4: Ablation Studies | 12 nodes |

Table 2: Hyperparameters controlling the research stages and resource allocation
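The staged schedule and the node budgets in Table 2 can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not the released implementation. Only the stage names and node counts come from Table 2. `Node`, `run_stage`, `run_pipeline`, and the random scoring are hypothetical stand-ins. Seeding each stage with the previous stage's best node is one reading of "each stage builds upon the findings from previous stages".

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Optional

# Node budgets per stage, copied from Table 2 (dict order = stage order).
STAGE_BUDGETS = {
    "Preliminary Investigation": 21,
    "Hyperparameter Tuning": 12,
    "Research Agenda Execution": 12,
    "Ablation Studies": 12,
}


@dataclass
class Node:
    """One experiment attempt (hypothetical structure, for illustration only)."""
    stage: str
    config: dict
    score: float


def run_stage(stage: str, budget: int, seed: Optional[Node], rng: random.Random) -> Node:
    """Try `budget` variants built on the previous stage's best node and return the best.

    Hypothetical stand-in: a real agent would have an LLM write and run code.
    Here each variant extends the seed's config and receives a random score.
    """
    base = dict(seed.config) if seed is not None else {}
    candidates = [Node(stage, {**base, stage: v}, rng.random()) for v in range(budget)]
    return max(candidates, key=lambda n: n.score)


def run_pipeline(rng_seed: int = 0) -> list[Node]:
    """Run the four stages in order, seeding each one with the previous winner."""
    rng = random.Random(rng_seed)
    winners: list[Node] = []
    best: Optional[Node] = None
    for stage, budget in STAGE_BUDGETS.items():
        best = run_stage(stage, budget, best, rng)
        winners.append(best)
    return winners


if __name__ == "__main__":
    for node in run_pipeline():
        print(f"{node.stage}: score={node.score:.3f}, config keys={len(node.config)}")
    print("total node budget:", sum(STAGE_BUDGETS.values()))  # 57 nodes
```

Each winner's config carries forward every earlier stage's choice. This is the "progressive research narrative" in miniature. The fixed per-stage budgets cap total compute at 57 experiment nodes.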

Exploring Multiple Research Paths: Agentic Tree Search

A core innovation in AI Scientist-v2 is its agentic tree search methodology for experimental design. Unlike the linear approach of its predecessor, this system can explore multiple research directions simultaneously, similar to how human scientists might pursue several promising leads in parallel.

The tree-search approach allocates resources by expanding promising branches and pruning less successful ones. When an experiment fails, the system enters a debugging step to diagnose and fix the problem. It retries recursively up to a fixed limit (the maximum debug depth of 3 in Table 2) rather than indefinitely.

This parallel exploration significantly improves research efficiency compared to the sequential approach used in earlier systems. It also enables the AI to be more resilient to setbacks, as failures in one branch don't halt the entire research process.
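A minimal best-first sketch of this idea follows. It assumes a toy experiment that crashes 30% of the time and otherwise rewards parameters near 0.7. It also reads Table 2's "Debug Probability 1.0" as "whenever a failed node is eligible, debug it". `Node`, `run_experiment`, and `tree_search` are hypothetical names, and this is not the authors' code. Only `MAX_DEBUG_DEPTH` and `DEBUG_PROB` come from Table 2.

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Optional

MAX_DEBUG_DEPTH = 3  # "Maximum Debug Depth" from Table 2
DEBUG_PROB = 1.0     # "Debug Probability" from Table 2 (interpretation assumed)


@dataclass
class Node:
    params: float           # stand-in for an experiment's code/config
    score: Optional[float]  # None marks a buggy (crashed) run
    debug_depth: int = 0    # consecutive debug attempts leading to this node
    expanded: bool = False  # each buggy node is debugged at most once


def run_experiment(params: float, rng: random.Random) -> Optional[float]:
    """Hypothetical toy experiment: crashes 30% of the time, else rewards params near 0.7."""
    if rng.random() < 0.3:
        return None
    return 1.0 - abs(params - 0.7)


def tree_search(budget: int, rng_seed: int = 0) -> tuple[Optional[Node], list[Node]]:
    rng = random.Random(rng_seed)
    nodes: list[Node] = []
    for _ in range(budget):
        debuggable = [n for n in nodes
                      if n.score is None and not n.expanded
                      and n.debug_depth < MAX_DEBUG_DEPTH]
        working = [n for n in nodes if n.score is not None]
        if debuggable and rng.random() < DEBUG_PROB:
            # Debug: re-run the failed idea one level deeper
            # (a real system would have an LLM patch the code here).
            parent = debuggable[0]
            parent.expanded = True
            params, depth = parent.params, parent.debug_depth + 1
        elif working:
            # Refine: perturb the best working node. Weaker branches are
            # never selected, which acts as implicit pruning.
            parent = max(working, key=lambda n: n.score)
            params, depth = parent.params + rng.gauss(0.0, 0.1), 0
        else:
            # Nothing to debug or refine yet: draft a fresh root idea.
            params, depth = rng.random(), 0
        nodes.append(Node(params, run_experiment(params, rng), depth))
    working = [n for n in nodes if n.score is not None]
    best = max(working, key=lambda n: n.score) if working else None
    return best, nodes
```

A failed branch never stalls the search. After at most three debug attempts it simply stops being eligible, and the budget flows to working branches. Setting `DEBUG_PROB` below 1.0 would let the search sometimes skip a crashed branch in favor of refining a working one.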

Turning Research into Papers: The Writing Process

Once the experimental work is complete, AI Scientist-v2 autonomously compiles its findings into a scientific manuscript. The system generates a paper structure aligned with academic conventions, writes clear technical content, and integrates experimental results, figures, and tables into a coherent narrative.

The writing process involves summarizing experimental findings, contextualizing results within relevant literature, and drawing meaningful conclusions. The system can present complex statistical analyses and discuss their implications in a manner consistent with academic standards.

Importantly, the writing agent doesn't simply report results but constructs a scientific argument that builds from the initial hypothesis through experimental evidence to justified conclusions.

Improving Visual Quality: The VLM Feedback Loop

A unique feature of AI Scientist-v2 is its integration of Vision-Language Models (VLMs) to evaluate and refine figures and visualizations. This feedback loop allows the system to assess the clarity, informativeness, and aesthetic quality of visualizations from a "reader's perspective."

When figures don't effectively communicate the intended information, the VLM provides specific feedback, which the system uses to regenerate improved visualizations. This iterative refinement process ensures that figures are not only technically accurate but also effectively communicate research findings.
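The critique-and-regenerate loop can be sketched as follows, assuming a figure is described by a small spec dict. In the real system the rendered image would go to a vision-language model (Table 3 lists GPT-4o for the feedback agents). Here `critique_figure` is a hypothetical rule-based stand-in for that reviewer. `revise_figure`, the placeholder axis labels, and the `MAX_ROUNDS` cap are also illustrative assumptions.

```python
from __future__ import annotations

MAX_ROUNDS = 3  # hypothetical cap on critique/regenerate cycles


def critique_figure(spec: dict) -> list[str]:
    """Hypothetical stand-in for a VLM reviewer: returns a list of readability issues."""
    issues = []
    if not spec.get("xlabel") or not spec.get("ylabel"):
        issues.append("missing axis labels")
    if spec.get("n_series", 1) > 1 and not spec.get("legend"):
        issues.append("multiple series but no legend")
    if spec.get("font_size", 0) < 8:
        issues.append("font too small to read")
    return issues


def revise_figure(spec: dict, issues: list[str]) -> dict:
    """Hypothetical stand-in for the agent regenerating plotting code from feedback."""
    revised = dict(spec)
    if "missing axis labels" in issues:
        revised["xlabel"] = revised.get("xlabel") or "Epoch"                # placeholder label
        revised["ylabel"] = revised.get("ylabel") or "Validation accuracy"  # placeholder label
    if "multiple series but no legend" in issues:
        revised["legend"] = True
    if "font too small to read" in issues:
        revised["font_size"] = 10
    return revised


def refine_figure(spec: dict, max_rounds: int = MAX_ROUNDS) -> tuple[dict, list[list[str]]]:
    """Alternate critique and revision until the reviewer reports no issues or rounds run out."""
    history: list[list[str]] = []
    for _ in range(max_rounds):
        issues = critique_figure(spec)
        history.append(issues)
        if not issues:
            break
        spec = revise_figure(spec, issues)
    return spec, history


final_spec, history = refine_figure({"n_series": 3, "font_size": 6})
```

With this starting spec, the first critique reports three issues, one revision fixes all of them, and the second critique comes back empty. Capping the rounds keeps a reviewer that never fully approves from looping forever.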

| Component/Task | Model Used | Max Tokens | Temperature |
|---|---|---|---|
| Code Generation (§3.2) | Claude 3.5 Sonnet (v2) | 8,192 | 0.5 |
| LLM/VLM Feedback Agents (§3.4) | GPT-4o | 8,192 | 0.5 |
| Summary Report Agent (§3) | GPT-4o | 8,192 | 1.0 |

Table 3: Model configurations used in different components of AI Scientist-v2
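Table 3 maps naturally onto a small, immutable configuration object. The sketch below transcribes the table's values. The component keys (`code_generation`, `feedback_agents`, `summary_report`) are hypothetical names. The model strings are the table's descriptive labels, not necessarily the exact identifiers an API would expect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSettings:
    model: str          # descriptive label from Table 3, not necessarily an API model ID
    max_tokens: int
    temperature: float


# Per-component settings transcribed from Table 3 (keys are hypothetical).
MODEL_CONFIG = {
    "code_generation": ModelSettings("Claude 3.5 Sonnet (v2)", 8192, 0.5),
    "feedback_agents": ModelSettings("GPT-4o", 8192, 0.5),
    "summary_report": ModelSettings("GPT-4o", 8192, 1.0),
}


def settings_for(component: str) -> ModelSettings:
    """Look up a component's settings, failing loudly on an unknown name."""
    try:
        return MODEL_CONFIG[component]
    except KeyError:
        raise KeyError(
            f"unknown component {component!r}; expected one of {sorted(MODEL_CONFIG)}"
        ) from None
```

Freezing the dataclass prevents one agent from silently changing settings that another agent depends on.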

Putting AI Scientist-v2 to the Test: Evaluation and Results

To evaluate AI Scientist-v2's capabilities, the researchers submitted three independently generated papers to the ICLR 2025 "I Can't Believe It's Not Better" (ICBINB) workshop—a peer-reviewed venue focused on negative or surprising results in machine learning research.

This evaluation approach is particularly significant because it subjected the AI's work to the same peer review process that human-authored submissions face, providing an independent, external assessment of its scientific capabilities.

Experimental Setup: Creating Three Research Papers

The system generated three distinct research papers, each exploring different machine learning topics:

  1. Compositional Regularization: Investigating unexpected challenges in enhancing neural network generalization
  2. Impact of Label Noise: Examining how noisy labels affect model calibration in deep learning
  3. Challenges in Pest Detection: Analyzing failures and solutions in applying deep learning to pest detection

For each paper, AI Scientist-v2 independently formulated research questions, designed and executed experiments, analyzed results, and authored a complete manuscript—all without human intervention. The only human involvement was submitting the finished papers to the workshop review system.

| Title | Workshop Result | Materials |
|---|---|---|
| Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization | Accepted (Score: 6.33) | See Appendix C.1, GitHub Repository |
| Unveiling the Impact of Label Noise on Model Calibration in Deep Learning | Rejected | See Appendix C.2.1, GitHub Repository |
| Real-world Challenges in Pest Detection using Deep Learning: an Investigation into Failures and Solutions | Rejected | See Appendix C.3.1, GitHub Repository |

Table 4: Summary of the three papers submitted to the ICLR workshop and their results

Breaking New Ground: The First AI-Accepted Paper

In a historic achievement, one of the papers, "Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization", was accepted to the workshop with an average reviewer score of 6.33. That score sits above the workshop's average acceptance threshold for human-authored submissions. This marks the first instance of a fully AI-generated scientific paper passing peer review, in this case at a peer-reviewed workshop.

The accepted paper explored why combining multiple regularization techniques doesn't always improve neural network generalization as expected. The research included comprehensive ablation studies and controlled experiments that impressed reviewers with their methodological rigor.

| Parameter | Value |
|---|---|
| Optimizer | SGD with Momentum |
| Momentum | 0.9 |
| Initial Learning Rate | 0.1 |
| Learning Rate Decay | Factor of 0.1 at epochs 50 and 75 |
| Number of Epochs | 100 |
| Batch Size | 128 |
| Weight Decay | 5e-4 |

Table 5: Hyperparameters used in the experiments for the accepted paper on compositional regularization
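The training recipe in Table 5 is a standard step-decay schedule. The plain-Python sketch below shows what the table implies, assuming zero-indexed epochs and a decay applied at the start of epochs 50 and 75. It also assumes PyTorch's formulation of momentum with weight decay and no dampening. In PyTorch the same setup would be `torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)` combined with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 75], gamma=0.1)`, stepped once per epoch. The function names here are illustrative.

```python
# Hyperparameters from Table 5.
BASE_LR = 0.1
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4
MILESTONES = (50, 75)
GAMMA = 0.1
NUM_EPOCHS = 100
BATCH_SIZE = 128  # listed for completeness; unused in this scalar sketch


def lr_at_epoch(epoch: int) -> float:
    """Step decay: multiply the base LR by GAMMA once for each milestone already reached."""
    drops = sum(1 for m in MILESTONES if epoch >= m)
    return BASE_LR * GAMMA ** drops


def sgd_momentum_step(w: float, v: float, grad: float, lr: float) -> tuple:
    """One SGD-with-momentum update on a scalar weight, with L2 weight decay (no dampening)."""
    g = grad + WEIGHT_DECAY * w  # add the weight-decay term to the gradient
    v = MOMENTUM * v + g         # update the momentum buffer
    return w - lr * v, v         # step against the accumulated direction


# Learning rate for epochs 0-49, 50-74, and 75-99.
schedule = [lr_at_epoch(e) for e in (0, 50, 75)]
```

The schedule is 0.1 for epochs 0 to 49, 0.01 for epochs 50 to 74, and 0.001 for the final 25 epochs.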

This success demonstrates that AI can now produce research that meets workshop-level academic standards without human assistance, a significant milestone in artificial intelligence research.

Looking Beyond the Horizon: Implications and Future Directions

The achievements of AI Scientist-v2 have profound implications for the future of scientific research and discovery. As these systems continue to advance, they could dramatically accelerate scientific progress across numerous fields.

Current Limitations: What AI Scientist-v2 Can't Yet Do

Despite its impressive capabilities, AI Scientist-v2 has notable limitations. The system currently operates only within machine learning research and depends on existing software libraries and frameworks. It cannot yet develop novel theoretical frameworks or the paradigm-shifting insights that characterize groundbreaking human research.

Additionally, the system has limited ability to situate its work within the broader context of scientific literature, primarily citing sources provided in its initial prompt rather than conducting comprehensive literature reviews. Its experimental design, while sophisticated, follows established patterns rather than developing innovative methodologies.

The Road Ahead: Future Improvements

Future versions could expand beyond machine learning to other scientific domains, including experimental sciences that require interaction with physical laboratory equipment. Integrating larger knowledge bases and more sophisticated reasoning capabilities could enable deeper theoretical contributions.

Promising directions include improving the system's ability to synthesize knowledge across disparate fields, enhancing experimental creativity, and developing more sophisticated debugging and error recovery mechanisms. These improvements could lead to AI scientists capable of more novel and impactful scientific contributions.

The Ethical Dimension: AI Safety and Responsible Research

The emergence of autonomous AI scientists raises important ethical considerations. These include ensuring scientific quality and reproducibility, proper attribution of credit, and maintaining research integrity. As these systems become more capable, frameworks for responsible development and deployment become increasingly important.

On the positive side, AI scientists could democratize research by making high-quality scientific investigation accessible to organizations with limited resources. They could also accelerate progress on urgent societal challenges by conducting research at scales and speeds beyond human capabilities.

The researchers behind AI Scientist-v2 have open-sourced their code on GitHub to promote transparency and enable collaborative improvement of these technologies.

Conclusion: Transforming Scientific Discovery

AI Scientist-v2 represents a watershed moment in the development of artificial intelligence for scientific discovery. By producing the first AI-generated paper to pass peer review, it demonstrates that autonomous AI systems can now carry out meaningful scientific research, from hypothesis to a peer-reviewed workshop paper, without human intervention.

This breakthrough suggests a future where AI scientists work alongside human researchers, dramatically increasing research productivity and accelerating scientific progress. While many challenges remain, the path toward increasingly capable AI scientists is now clearly established.

As these technologies continue to develop, they promise to transform how scientific knowledge is generated, potentially leading to new discoveries at unprecedented scales and speeds. The future of science may well be a collaborative endeavor between human and artificial intelligence, each contributing its own strengths to the advancement of human knowledge.
