TL;DR

I built Testronaut, an autonomous testing agent powered by AI, in just a few weeks.

It turns natural language descriptions like "Login to the app and check the homepage" into structured, automated web tests using Playwright MCP.

In this article, I share:

  • Why local models (like Llama3.2 and Qwen2.5) failed me
  • How mastering fundamentals was better than picking fancy frameworks
  • How Google ADK helped me create specialized collaborating agents
  • How Testronaut plans tests, executes them, and documents everything automatically

And most importantly:

The work has just begun — I’m building the foundation for what might become the best open-source AI testing platform out there.

🔗 Check out the full Testronaut repository here!


Introduction:

In recent weeks, Microsoft released Playwright MCP, an extension to the Playwright ecosystem that opens up new possibilities for automation.

Unlike traditional browser automation tools that require direct scripting, Playwright MCP exposes the browser through the Model Context Protocol (MCP), an open protocol that lets AI agents and external systems drive the browser via structured tool calls, enabling a new level of abstraction and control.
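
To make "structured" concrete, here is a minimal sketch of what a single tool call could look like on the wire. MCP messages are JSON-RPC 2.0, and tools are invoked with the tools/call method. The tool name browser_navigate and its url argument are assumptions about what the Playwright MCP server exposes, and build_tool_call is a hypothetical helper, not part of any library.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request asking an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Assumption: the server exposes a navigation tool named "browser_navigate"
# that takes a "url" argument.
request = build_tool_call(
    1, "browser_navigate", {"url": "https://www.saucedemo.com/"}
)
print(json.dumps(request, indent=2))
```

In practice the agent framework builds and sends these messages for you; the point is only that the agent talks to the browser through named tools with typed arguments rather than through a script.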

As soon as I read about it, one thought immediately crossed my mind:

"What if I could finally build an agent that tests my applications for me, without having to write a single line of Selenium or Playwright code ever again?"

That idea marked the beginning of Testronaut — an autonomous AI agent designed to transform natural language intentions into real, documented web tests.

In this article, I want to share how I built it in just a few weeks, the lessons I learned along the way, and why I believe we're entering a new era of software testing.


Defining the Requirements

One of the key premises for Testronaut was clear from the start:

Use open-source tools, keep it free, and make it run locally.

I initially imagined that I could handle the entire automation flow using Ollama and local models running on my machine.

It sounded perfect — low cost, full control, and no dependency on cloud APIs.

However, after several days of struggling with local models such as llama3.2:3b, llama3.1, and qwen2.5:14b, a painful reality became clear:

Most of the challenges I faced were not because of the agent logic — they were because of the performance limitations of local models.

They couldn't reason fast enough, lost consistency when following multi-step plans, or consumed too many resources under real-world load.

Along the way, I also explored a few popular frameworks. Each one offered interesting ideas and tools, but I quickly realized something essential:

The biggest challenge isn’t about picking the right framework. It's about mastering the fundamentals — understanding how agents think, reason, and interact.

(Obvious? Maybe. Painful to learn the hard way? Definitely.)

Learning the Hard Way (No Shortcuts)

So, just like we used to do in the pre-ChatGPT era, I went back to basics:

I opened the Google ADK documentation — my final framework choice — and decided to truly understand:

  • How autonomous agents are supposed to work
  • How the Google ADK could help me structure a real solution
  • How to properly integrate the right tools (instead of just plugging in a model and hoping for magic)

It felt a bit like ancient craftsmanship — like the Aztecs building something solid, one stone at a time, without shortcuts.

Reading, experimenting, and failing (a lot) became my daily routine.

And slowly, a clear picture started to emerge:

Building a reliable agent wasn't about fancy prompts or huge models — it was about designing a brain that could:

  • Interpret intentions clearly
  • Plan actions systematically
  • Execute steps carefully
  • Document every move reliably

In other words, it was less "make a smart chatbot" and more "build a serious, disciplined autonomous worker."


Defining the Solution

Once I figured out how to build autonomous agents, the next step was to decide what exactly Testronaut should do.

With a clear understanding of the platform's goal, I was able to define its core functionalities:

1. Creating a Test Plan from a Natural Language Intention

Testronaut was born to help dev teams validate critical user flows, experience journeys, or even very simple scripts such as:

"Navigate to https://www.saucedemo.com/ and verify the homepage loads.

Fill the login form with standard_user and secret_sauce.

In the Products Page, take a snapshot of the first product available."

(Quick note: SauceDemo is a public website designed specifically for QA testing, commonly used for automation practice and demos.)

The ability to describe tests in natural language is powerful:

It allows developers — even those without deep QA expertise — to simply express what they want to validate using plain English.

Testronaut takes that informal narrative and transforms it into a structured Test Plan, explicitly defining the expected outcomes of each step.

For example, the above narrative would become:

{
  "test_plan": {
    "name": "Homepage Load and Product Snapshot Test",
    "description": "Verify homepage loads correctly, login with standard credentials, and capture a snapshot of the first product on the products page.",
    "steps": [
      {
        "step": 1,
        "action": "Navigate to https://www.saucedemo.com/",
        "expected_result": "Homepage loads successfully and is visible."
      },
      {
        "step": 2,
        "action": "Verify homepage content loads correctly.",
        "expected_result": "Homepage content is visible and correct."
      },
      {
        "step": 3,
        "action": "Fill the login form with username 'standard_user' and password 'secret_sauce' and submit.",
        "expected_result": "Login is successful and user is redirected to the Products page."
      },
      {
        "step": 4,
        "action": "On the Products page, locate the first product available.",
        "expected_result": "First product is located and visible."
      },
      {
        "step": 5,
        "action": "Take a snapshot of the first product.",
        "expected_result": "Snapshot of the first product is saved."
      }
    ]
  }
}

By structuring the intention into JSON, Testronaut could now plan, execute, and document every step systematically.
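
Because the plan is produced by a model, it helps to validate it before anything touches a browser. The following is a minimal sketch of that idea, assuming the JSON shape shown above; PlanStep, Plan, and parse_test_plan are hypothetical names for illustration and are not taken from the Testronaut code base.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class PlanStep:
    step: int
    action: str
    expected_result: str


@dataclass
class Plan:
    name: str
    description: str
    steps: List[PlanStep]


def parse_test_plan(raw: str) -> Plan:
    """Parse the planner's JSON and reject plans that cannot be executed."""
    data = json.loads(raw)["test_plan"]
    steps = [PlanStep(**item) for item in data["steps"]]
    # Steps must be numbered 1..N with no gaps or duplicates.
    numbers = [s.step for s in steps]
    if numbers != list(range(1, len(steps) + 1)):
        raise ValueError(f"unexpected step numbering: {numbers}")
    # Every step needs something to do and something to check.
    for s in steps:
        if not s.action.strip() or not s.expected_result.strip():
            raise ValueError(f"step {s.step} is missing an action or expected result")
    return Plan(data["name"], data["description"], steps)


RAW = json.dumps({
    "test_plan": {
        "name": "Login smoke test",
        "description": "Open the site and log in.",
        "steps": [
            {"step": 1, "action": "Navigate to https://www.saucedemo.com/",
             "expected_result": "Homepage loads successfully and is visible."},
            {"step": 2, "action": "Fill the login form and submit.",
             "expected_result": "User is redirected to the Products page."},
        ],
    }
})

plan = parse_test_plan(RAW)
print(plan.name, len(plan.steps))
```

A check like this catches malformed plans early, which matters most with smaller models that do not always follow the requested format.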


2. The Executor Agent

With the Test Plan created, the next natural step was to execute it.

For that, I needed a second agent: one capable of reading the Test Plan and performing the corresponding browser actions using Playwright MCP.

Fortunately, the Google ADK makes it extremely simple to integrate with external MCP servers.

Here’s how I connected the agent to a Playwright MCP server:

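from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters


async def get_playwright_mcp_toolset():
    """Connect to a Playwright MCP server and return its tools.

    Returns:
        Tuple[list, AsyncExitStack]: The MCP tools and the exit stack that
        owns the server connection (close it when the agent is done).
    """
    print("Attempting to connect to MCP Playwright server...")
    # Start the Playwright MCP server over stdio with a headless browser.
    # Note: MCPToolset.from_server is the API of the ADK release used here.
    tools, exit_stack = await MCPToolset.from_server(
        connection_params=StdioServerParameters(
            command='npx',
            args=["@playwright/mcp@latest", "--headless"],
        )
    )
    return tools, exit_stack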

And inside the agent definition:

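from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm

# constants, prompt, and get_test_plan come from Testronaut's own modules.


async def get_test_executor_agent():
    # Open the Playwright MCP connection; the caller must close exit_stack.
    tools, exit_stack = await get_playwright_mcp_toolset()
    test_executor_agent = Agent(
        model=LiteLlm(
            model=constants.MODEL
        ),
        name="testronaut_test_executor",
        description="A helpful assistant for test execution.",
        instruction=prompt.TEST_EXECUTOR_PROMPT,
        # The agent's final response is stored in session state under this key.
        output_key="STATE_TEST_EXECUTOR",
        tools=[get_test_plan, *tools]
    )
    return test_executor_agent, exit_stack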
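
Both functions hand back an exit_stack, and whoever receives it is responsible for closing it so the MCP server process is shut down. Here is a minimal, standard-library-only sketch of that lifecycle. fake_mcp_connection is a hypothetical stand-in for the real server connection, and the tool names in it are placeholders.

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

events = []


@asynccontextmanager
async def fake_mcp_connection():
    """Hypothetical stand-in for a connection to an MCP server."""
    events.append("connected")
    try:
        yield ["browser_navigate", "browser_click"]  # placeholder tool names
    finally:
        events.append("closed")


async def open_toolset():
    # Same shape as the function above: return the tools plus the stack
    # that owns the connection.
    exit_stack = AsyncExitStack()
    tools = await exit_stack.enter_async_context(fake_mcp_connection())
    return tools, exit_stack


async def main():
    tools, exit_stack = await open_toolset()
    try:
        events.append(f"using {len(tools)} tools")
    finally:
        # Always release the connection, even if the run fails.
        await exit_stack.aclose()


asyncio.run(main())
print(events)  # ['connected', 'using 2 tools', 'closed']
```

Wrapping the agent run in try/finally like this avoids leaving orphaned browser or npx processes behind when a test run crashes.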

By setting it up this way, the Test Executor Agent is able to:

  • Retrieve the full Test Plan from the session state;
  • Iterate step-by-step;
  • Execute precise browser actions via MCP;
  • Analyze the updated UI state after each action;
  • Move on to the next step intelligently.
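
The loop described in this list can be sketched in plain Python. In Testronaut the model itself performs each action and judges the result; in this sketch, perform and check are hypothetical stand-ins for those two jobs, and stopping at the first failed step is a choice made here for illustration, not necessarily what Testronaut does.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepResult:
    step: int
    action: str
    passed: bool
    observed: str


def run_plan(
    steps: List[Dict],
    perform: Callable[[str], str],
    check: Callable[[str, str], bool],
) -> List[StepResult]:
    """Run steps in order, recording what was observed after each action."""
    results = []
    for item in steps:
        observed = perform(item["action"])  # e.g. a browser action via MCP
        passed = check(observed, item["expected_result"])
        results.append(StepResult(item["step"], item["action"], passed, observed))
        if not passed:
            break  # sketch-only policy: stop at the first failure
    return results


steps = [
    {"step": 1, "action": "Navigate to https://www.saucedemo.com/",
     "expected_result": "Login form is visible"},
    {"step": 2, "action": "Submit the login form",
     "expected_result": "Products page is visible"},
    {"step": 3, "action": "Take a snapshot of the first product",
     "expected_result": "Snapshot is saved"},
]

# Canned observations standing in for the real UI state after each action.
canned = {
    steps[0]["action"]: "Login form is visible",
    steps[1]["action"]: "Error banner is visible",
}

results = run_plan(
    steps,
    perform=lambda action: canned.get(action, "nothing happened"),
    check=lambda observed, expected: observed == expected,
)
for r in results:
    print(r.step, "PASS" if r.passed else "FAIL", "-", r.observed)
```

Keeping the observed state next to each result is what makes the run documentable afterwards: every step carries its own evidence.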

All that remains is enabling collaboration between the agents...


3. Assembling the Megazord Testronaut

The Google ADK provides specialized agents for coordinating the execution of multiple sub-agents.

They're called Workflow Agents.

For Testronaut, I used the simplest and most effective one:

the Sequential Agent, which connects the two agents I had developed (the Planner and the Executor).

Here's how the final connection looks:

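from google.adk.agents import SequentialAgent

# Runs the sub-agents in order: first the Planner, then the Executor.
sequential_agent = SequentialAgent(
    name="TestronautEntrypointAgent",
    sub_agents=[test_planner_agent, test_executor_agent],
)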

Simple as that!

Now, the two agents collaborate seamlessly:

  • The Test Planner creates a structured test plan from a natural language input;
  • The Test Executor reads that plan and executes it step by step in a real browser.

This collaboration forms the core of Testronaut: an autonomous agent capable of understanding intentions, executing them accurately, and documenting every step along the way.
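
Conceptually, the hand-off works because both agents share one session state: the first stage writes its output there and the second reads it. The sketch below shows that pattern in plain Python. It is not how the ADK implements SequentialAgent; the state keys and the two stage functions are hypothetical, and the "planner" here just splits sentences where the real one calls a model.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Stage = Callable[[State], None]


def plan_stage(state: State) -> None:
    """Stand-in planner: one step per sentence of the intention."""
    sentences = [s.strip() for s in str(state["intention"]).split(".") if s.strip()]
    state["test_plan"] = [
        {"step": i, "action": s} for i, s in enumerate(sentences, start=1)
    ]


def execute_stage(state: State) -> None:
    """Stand-in executor: reads the plan left behind by the planner."""
    state["report"] = [
        f"step {s['step']}: {s['action']} -> done" for s in state["test_plan"]
    ]


def run_sequence(stages: List[Stage], state: State) -> State:
    # Each stage runs to completion before the next one starts.
    for stage in stages:
        stage(state)
    return state


final_state = run_sequence(
    [plan_stage, execute_stage],
    {"intention": "Open the login page. Sign in as standard_user. Snapshot the first product."},
)
for line in final_state["report"]:
    print(line)
```

The ordering is the whole contract: the executor can assume the plan exists because the planner always runs first.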


Conclusion: The Work Has Just Begun!

Developing Testronaut helped me understand the true fundamentals behind building multi-agent collaboration systems.

It wasn't just about coding — it was about learning how intelligent agents can plan, execute, and work together towards a common goal.

Testronaut is still taking its very first steps.

But the vision for its evolution is already well defined and publicly shared in its Manifesto.

The next milestones will be shared with you all — and who knows, maybe a community will emerge to transform Testronaut into...

"The ultimate application for behavioral testing."

Thanks for reading! 🚀
Stay tuned — the journey has just begun.