Refact.ai Agent has achieved the #1 score on SWE-bench Lite — solving 179 out of 300 tasks, for a 59.7% success rate. Our approach: a fully autonomous AI Agent for programming, with no manual intervention needed.
SWE-bench Lite is a benchmark that evaluates LLM-based systems on real GitHub issues from popular open-source Python projects. Each task requires applying a bug fix or feature implementation, then validating the result through test execution. This makes the benchmark particularly valuable for understanding how AI tools will perform in actual production environments.
Agent setup
Refact.ai Agent takes a fully autonomous, iterative approach. It plans, executes, tests, and self-corrects — repeating steps as needed to reach a single correct solution with no user input. So, the benchmark setup was designed to reflect our autonomy-first philosophy:
- Prompt strategy: Defines the Agent’s behavior and high-level task-solving logic. Open-source and available on GitHub.
- Model: Claude 3.7 Sonnet — responsible for orchestration and decision-making.
- Execution layer: refact-lsp, a backend that connects the model to tools and the environment.
- deep_analysis() tool: Enhanced reasoning, powered by o4-mini.
- Tool suite: For repository exploration, code modification, and testing. Used dynamically based on task needs.
- Step cap: 60 agent steps (each one a discrete action) per task.
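To summarize the setup in one place, here is a minimal sketch of what such an agent configuration could look like. The class name, field names, and model identifiers below are ours for illustration only and do not reflect Refact.ai's actual internals:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    """Hypothetical summary of the benchmark setup described above."""
    orchestrator_model: str = "claude-3.7-sonnet"   # orchestration and decision-making
    temperature: float = 0.0                        # deterministic sampling (see "Claude 3.7: Model choice")
    reasoning_model: str = "o4-mini"                # backs the deep_analysis() tool
    max_steps: int = 60                             # hard cap on discrete agent actions per task
    tools: list[str] = field(default_factory=lambda: [
        "search", "regex_search", "definition", "references",
        "tree", "cat", "create_textdoc", "update_textdoc",
        "shell", "deep_analysis",
    ])
```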
What sets Refact.ai apart is that its AI Agent drives the entire process itself. While some solutions take a semi-autonomous approach, requiring users to manually invoke tools and guide the agent, Refact.ai Agent operates independently from start to finish.
Prompt strategy
Refact.ai’s SWE-bench Lite prompt follows a clear workflow:
- Describe the problem.
- Investigate the repo.
- Create and run a problem reproduction script.
- Form a plan (using deep_analysis(), powered by o4-mini) and apply changes.
- Run tests and evaluate the changes (optionally reasoning with deep_analysis()).
- Repeat steps 4 and 5 until the problem is solved.
This workflow serves as high-level guidance, not hard rules. Refact.ai Agent uses it to form its own strategy — repeating, skipping, or reordering steps based on task context.
For each SWE-bench problem, Refact.ai Agent made one multi-step run to produce a single, correct final solution.
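To make the workflow concrete, the sketch below paraphrases how such a system prompt could be laid out. The wording is illustrative only; the actual prompt is the open-source one published on GitHub:

```python
# Illustrative paraphrase of the workflow above; not the exact open-source prompt text.
SYSTEM_PROMPT = """\
You are an autonomous coding agent. For each GitHub issue:
1. Describe the problem in your own words.
2. Investigate the repository to locate the relevant code.
3. Create and run a script that reproduces the problem.
4. Form a plan (call deep_analysis() if deeper reasoning helps) and apply changes.
5. Run tests and evaluate the changes (optionally calling deep_analysis() again).
6. Repeat steps 4 and 5 until the reproduction script and tests pass.
Treat these steps as guidance: repeat, skip, or reorder them as the task requires.
"""
```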
Claude 3.7: Model choice
Refact.ai uses Claude 3.7 Sonnet with sampling temperature 0.0 as its core model for SWE-bench Lite. It demonstrated exceptional capabilities for autonomous workflows: following multi-step instructions, understanding complex codebases, and maintaining context across long interactions.
We’ve previously paired Refact.ai with Claude 3.7 on the Polyglot benchmark, where it reached 93.3% with Thinking Mode and 92.9% without — the highest known scores to date on that task set.
The deep_analysis() tool
One of the key features in Refact.ai’s approach is the deep_analysis() tool. It adds a structured, three-step reasoning process that improves solution quality at critical moments in the task flow.
deep_analysis() is powered by o4-mini — a small, fast reasoning model that handles the cognitive load of problem-solving so Claude 3.7 can focus on orchestration.
The prompt for the deep_analysis() tool follows this pattern (also available on GitHub):
- Solution generation: “Get the initial solution.”
- Critique: “Please critique the solution above. Identify any weaknesses, limitations, or bugs. Be specific and thorough in your analysis. Remember that the final solution must be minimal, robust, and effective.”
- Refinement: “Please improve the original solution based on the critique. Provide a comprehensive, refined solution that addresses the weaknesses identified in the critique while maintaining the strengths of the original solution.”
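A minimal sketch of this generate–critique–refine loop might look as follows. The `complete()` helper and message format are assumptions for illustration; only the critique and refinement instructions are taken from the prompt pattern above:

```python
CRITIQUE = (
    "Please critique the solution above. Identify any weaknesses, limitations, "
    "or bugs. Be specific and thorough in your analysis. Remember that the final "
    "solution must be minimal, robust, and effective."
)
REFINE = (
    "Please improve the original solution based on the critique. Provide a "
    "comprehensive, refined solution that addresses the weaknesses identified in "
    "the critique while maintaining the strengths of the original solution."
)

def deep_analysis(problem: str, complete) -> str:
    """Three-step reasoning: generate an initial solution, critique it, refine it.

    `complete(messages)` is a hypothetical helper that sends the conversation to
    a reasoning model (o4-mini in Refact.ai's setup) and returns its reply.
    """
    messages = [{"role": "user", "content": problem}]
    solution = complete(messages)                                  # 1. initial solution
    messages += [{"role": "assistant", "content": solution},
                 {"role": "user", "content": CRITIQUE}]
    critique = complete(messages)                                  # 2. critique
    messages += [{"role": "assistant", "content": critique},
                 {"role": "user", "content": REFINE}]
    return complete(messages)                                      # 3. refined solution
```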
This structured loop is normally triggered during step 4 of the benchmark prompt — when planning and applying code changes. In practice, though, Refact.ai Agent decides on its own when to use this tool.
While completing the benchmark, we observed that the Agent sometimes called deep_analysis() multiple times — first during planning, then again when testing and evaluating results. In other cases, it skipped the tool entirely. This shows that Refact.ai Agent doesn’t follow a rigid script, but instead builds its own strategy to get the task done right.
Tools, tools, tools
Refact.ai Agent has access to a variety of tools that allow it to interact with the entire development environment to solve tasks.
- Code exploration: search(), regex_search(), definition(), references(), tree(), cat()
- Editing: create_textdoc(), update_textdoc()
- Shell execution: shell() — used to run Python tests and verify solutions.
These tools enable the AI Agent to navigate codebases, understand dependencies, make precise changes, and verify that its solutions work correctly. It uses them autonomously, choosing which to use and when.
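As an illustration of how such a tool suite can be exposed to the model, the sketch below registers two tools behind a simple dispatch function. The registry, function signatures, and dispatch mechanism are assumptions for illustration, not refact-lsp's actual interface:

```python
import subprocess
from pathlib import Path

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {}

def tool(fn):
    """Register a callable under its function name so the model can request it."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def cat(path: str) -> str:
    """Return the contents of a file in the repository."""
    return Path(path).read_text()

@tool
def shell(command: str, timeout: int = 120) -> str:
    """Run a shell command (e.g. the test suite) and return its combined output."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

def dispatch(name: str, **kwargs) -> str:
    """Execute the tool the model asked for and hand the output back to the model."""
    return TOOLS[name](**kwargs)
```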
Although Refact.ai Agent can also interface with real-world tools (GitHub, Docker, PostgreSQL, etc.) and 1000+ tools via MCP servers, these integrations weren’t used in the benchmark run — but are part of standard workflows in user environments.
60-step cap
Claude 3.7 Sonnet has 60 steps to complete a task. A step is a single AI action, such as modifying a file, listing a directory, or running tests. The AI Agent strategically decides how to spend these steps, leading to clear, controlled solutions.
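A step-capped run loop could look roughly like the sketch below; `ask_model()` and `dispatch()` are hypothetical stand-ins for the orchestrating model call and the tool layer, not Refact.ai's actual code:

```python
MAX_STEPS = 60  # hard budget of discrete agent actions per task

def run_task(issue: str, ask_model, dispatch) -> str | None:
    """Drive the agent until it submits a patch or exhausts its step budget."""
    history = [{"role": "user", "content": issue}]
    for step in range(MAX_STEPS):
        action = ask_model(history)                      # model decides the next action
        history.append({"role": "assistant", "content": str(action)})
        if action["type"] == "submit":
            return action["patch"]                       # final solution produced within budget
        output = dispatch(action["tool"], **action["args"])   # one tool call = one step
        history.append({"role": "tool", "content": output})
    return None                                          # step cap reached without a solution
```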
Final SOTA score
Out of 300 tasks in SWE-bench Lite:
- 🥇 Solved: 179 (59.7% resolve rate)
- Not solved: 121 (40.3%)
Refact.ai Agent even managed to solve two SWE-bench tasks that no other listed agent has solved (django-12589, sympy-21627) — likely thanks to the o4-mini model’s reasoning capabilities.
Evaluation results
| Total | Solved | Not solved | Solved (%) | Not solved (%) |
|---|---|---|---|---|
| 300 | 179 | 121 | 59.7% | 40.3% |
Resolved by repository
astropy/astropy: 3/6 (50.0%)
django/django: 78/114 (68.4%)
matplotlib/matplotlib: 11/23 (47.8%)
mwaskom/seaborn: 2/4 (50.0%)
pallets/flask: 0/3 (0.0%)
psf/requests: 5/6 (83.3%)
pydata/xarray: 2/5 (40.0%)
pylint-dev/pylint: 3/6 (50.0%)
pytest-dev/pytest: 10/17 (58.8%)
scikit-learn/scikit-learn: 17/23 (73.9%)
sphinx-doc/sphinx: 6/16 (37.5%)
sympy/sympy: 43/77 (55.8%)
Looking forward
Refact.ai’s performance on SWE-bench Lite demonstrates that AI agents are becoming increasingly capable of handling real-world software engineering tasks autonomously — not just generating code, but planning, debugging, testing, and refining it with minimal human input.
Our next step is evaluating Refact.ai Agent on SWE-bench Verified — a benchmark with more rigorously validated tasks.
All of this is part of our open-source commitment. Developers can explore the system, understand how autonomy is implemented, and even contribute. We believe that as the baseline work of software development shifts to AI, human engineers will be free to focus on the more interesting and creative parts of the job — and invite developers to build the future of programming together.
Why does SWE-bench matter for developers?
This isn’t just about ranking highly on a benchmark — it’s about real-world coding impact. Refact.ai Agent helps developers and software companies:
- Automate repetitive tasks across the SDLC
- Focus on core work while the AI handles the rest
- Deliver faster with the AI Agent working alongside you in your IDE
- Delegate with confidence, knowing the AI writes reliable, tested code.
Get Refact.ai Agent for your IDE
Vibe coding is the future of software development — get it today.
Refact.ai’s autonomous AI Agent works like a senior developer in your IDE:
- Works inside your workflow & with your dev tools
- Boosts productivity 10x with real automation
- Handles coding while you focus on core work
- Available to everyone, right in your IDE.
Try open-source Refact.ai Agent for programming in VS Code or JetBrains and let us know what you think!