What if LLMs had opinions and argued in threads? What happens when we give LLMs not just memory and tools, but also autonomy, voices, perspectives, and structure?

I built a simple CLI tool, a tiny experiment in multi-agent, tree-of-thought reasoning. The tool lets you simulate a conversation between different AI personas, each contributing their thoughts in a threaded format.

Having language models work together to solve complex problems, with each agent assuming a role suited to its strengths, is a central tenet of multi-agent LLM systems. Such systems often outperform single-agent models, particularly on intricate tasks that call for diverse expertise and collaborative decision-making.


🛠️ The CLI: What It Does

The CLI tool I built supports:

  • Defining multiple personas (e.g., Philosopher, Technologist, Policymaker, Educator)
  • Assigning each a role or lens to interpret the prompt
  • Running rounds of discussion threads (like Reddit comments)
  • Outputting results in HTML, Markdown, or JSON
  • Optional logging of engagement and “most insightful” paths

This first version wasn’t meant to be a goal-oriented, enterprise-grade agent framework. It’s intentionally simple: an experiment in structure, not infrastructure.
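For reference, here’s roughly the shape a personas file could take. This is a hypothetical sketch, not the exact schema; check the repo for the real format:

[
  {
    "name": "Technologist",
    "role": "Evaluates feasibility, tooling, and operational cost",
    "style": "pragmatic and detail-oriented"
  },
  {
    "name": "Policymaker",
    "role": "Weighs regulation, risk, and societal impact",
    "style": "cautious and principle-driven"
  }
]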

👉 Git repo: http://github.com/gajakannan/public-showcase/tree/main/multillm-tot


💡 Try It Yourself

Here are a few example prompts and command-line inputs to experiment with:

python main.py \
     --prompt "With increased complexity, should we rethink the proliferation of microservices and build modular monoliths?" \
     --rounds 5 \
     --personas-file './input/microservice-personas.json' \
     --output html \
     --save-to "./output/microservice-discussion.html"

Screenshot of the output HTML

python main.py \
     --prompt "Can a specialized AI or AGI replace primary care physicians?" \
     --rounds 20 \
     --personas-file './input/pcp-personas.json' \
     --output html \
     --save-to "./output/pcp-discussion.html"

python main.py \
     --prompt "Is AI going to replace primary care physicians?" \
     --rounds 8 \
     --personas-file './input/pcp-personas.json' \
     --output html \
     --save-to "./output/pcp-discussion.html"

Screenshot of the output HTML

python main.py \
     --prompt "Which front-end technology is best for developing web applications?" \
     --rounds 25 \
     --personas-file './input/frontend-personas.json' \
     --output html \
     --save-to "./output/discussion.html"

Screenshot of the output HTML


🧎‍♂️ Why Tree-of-Thought?

LLMs are great at linear reasoning, but sometimes ideas need branches, not just steps. Tree-of-thought lets agents diverge in their thinking, reflect on different possibilities, and build upon each other’s ideas. The focus is on expanding the thought space rather than converging on a single “right answer.”

This CLI is a lightweight nod to the idea, not a full-fledged ToT implementation with path optimization or scoring. Perhaps in the future, it could be.
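To make the branching concrete, here’s a minimal, illustrative sketch of the kind of round loop such a tool might run. This is not the repo’s actual code, and call_llm is a hypothetical stand-in for whatever model client you wire in:

import random
from dataclasses import dataclass, field

@dataclass
class Node:
    persona: str
    text: str
    children: list["Node"] = field(default_factory=list)

def call_llm(persona: str, context: str) -> str:
    # Hypothetical stand-in for a real model call (OpenAI, a local model, etc.)
    return f"[{persona} replies to: {context[:60]}]"

def run_rounds(prompt: str, personas: list[str], rounds: int) -> Node:
    root = Node(persona="moderator", text=prompt)
    nodes = [root]                         # every node is a potential branch point
    for _ in range(rounds):
        parent = random.choice(nodes)      # pick a thread to extend: this is where branching happens
        persona = random.choice(personas)  # pick who speaks next
        child = Node(persona, call_llm(persona, parent.text))
        parent.children.append(child)      # threaded, Reddit-style reply
        nodes.append(child)
    return root

tree = run_rounds("Is AI going to replace primary care physicians?",
                  ["Physician", "Technologist", "Ethicist"], rounds=8)

Because a reply can attach to any earlier node, the transcript grows as a tree rather than a single chain of messages.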


🤖 Where This Fits: RAG, Agentic RAG, and CAG

If you’ve been following the evolution of LLM architecture, you’ll recognize these three patterns:

  • RAG (Retrieval-Augmented Generation): Adds external knowledge to LLMs.
  • Agentic RAG: Enables LLMs to delegate tasks to autonomous agents (e.g., web search, coding).
  • CAG (Cache-Augmented Generation): Optimizes for speed by preloading knowledge into the model’s context and caching it, rather than retrieving it at query time.

This CLI sits squarely as a collaborative multi-agent conversational tool. It doesn’t fetch external data, but it orchestrates reasoning. Each agent is a simulated persona, capable of reflecting, responding, and evolving the conversation.


🌱 What’s Next?

This CLI is just a start. I’m toying with ideas like:

  • Integrating OpenAI tools to let each agent choose its own approach—whether that means browsing the web for real-time data, executing code snippets, reflecting quietly on prior context, or invoking APIs. This modularity opens the door to more flexible decision-making pathways where agents can act autonomously based on the task at hand. Frameworks like LangChain, CrewAI, or AutoGPT can be used to orchestrate these multi-agent workflows, where each agent has a set of tools and reasoning capabilities and can decide which to invoke depending on its role and context. For instance, one agent might fact-check a statement using retrieval tools powered by LangChain, while another writes simulation code, and a third prompts itself with counterfactuals for deeper reflection. CrewAI can be especially useful when simulating structured teams, while AutoGPT lends itself to more open-ended exploration.
  • Assigning voting power or influence to different personas, where each agent’s opinion carries a weight based on its domain authority, confidence level, or even engagement score. This lets the system resolve disagreement in a structured way, such as weighted consensus, probabilistic sampling, or majority opinion (see the sketch after this list). For example, an actuarial persona might be granted higher voting weight in pricing discussions, while a customer advocate might have more say in usability debates.
  • Implementing Agentic RAG and CAG options—where agents can dynamically retrieve data or leverage cached memory to balance responsiveness with accuracy, opening the door for adaptive workflows like delegated web searches or instant responses based on frequently accessed context.
  • Assigning a goal to the agents, such as reaching consensus, challenging assumptions, ranking solutions, or role-playing stakeholder positions, which can dramatically shift how they interact and contribute. For example, a debate-style goal encourages conflict and contrast, while a synthesis goal prioritizes convergence. These behavioral nudges can help simulate more realistic, human-like collaboration or disagreement.
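As a toy sketch of the weighted-consensus idea above (the personas, positions, and weights are all made up):

from collections import defaultdict

def weighted_vote(opinions: dict[str, str], weights: dict[str, float]) -> str:
    # opinions maps persona -> position; weights maps persona -> voting weight
    tally = defaultdict(float)
    for persona, position in opinions.items():
        tally[position] += weights.get(persona, 1.0)  # unweighted personas default to 1.0
    return max(tally, key=tally.get)

# Hypothetical pricing debate where the actuarial persona carries extra weight
opinions = {"Actuary": "raise premium", "Marketer": "hold premium", "Advocate": "hold premium"}
weights = {"Actuary": 2.5, "Marketer": 1.0, "Advocate": 1.0}
print(weighted_vote(opinions, weights))  # "raise premium" wins, 2.5 to 2.0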

Bringing this into the insurance domain, here are some business use cases that could benefit from multi-agent reasoning and simulation:

  • Underwriting: Simulating multi-underwriter collaboration in property insurance scenarios, an Agentic RAG example where each underwriter agent brings a different lens to evaluating the same risk. For instance, one agent may specialize in structural risk, another in climate exposure, and another in occupancy or usage-based data. These agents could debate, validate, or challenge each other’s views asynchronously, ultimately helping the human underwriter synthesize a more holistic decision.
  • Claims: where agents representing medical experts, policy terms, and historical precedent triage a complex case. Each agent provides input based on its specialization, such as medical necessity, coverage interpretation, or comparative case history, and the system surfaces areas of alignment or contention. The result might be a collaborative decision summary, a weighted score, or even a generated explanation suitable for review or audit.
  • Fraud detection: where multiple perspectives evaluate anomalies in claim data or customer behavior. Vector embeddings can significantly enhance this use case by enabling similarity searches across high-dimensional claim histories, customer behaviors, or provider patterns, letting each agent retrieve semantically similar past cases for comparative reasoning (a minimal sketch follows this list). Combined with RAG, the agents can pull in structured anomaly flags, historical fraud investigations, and contextual metadata to enrich their evaluations, making the simulation both data-aware and behaviorally diverse.
  • Product design: where marketing, actuarial, and distribution personas simulate how a new coverage option might perform or be perceived. By combining Agentic RAG and Tree-of-Thought prompting, these personas can retrieve relevant market data, historical uptake metrics, and regulatory constraints to shape their opinions. The simulation can branch into competing strategies, for example, a low-premium, high-volume plan versus a niche, high-margin variant, and converge through deliberation or voting. The result might be a ranked list of product ideas, a synthesized go-to-market narrative, or an early warning about friction points across departments.
  • Customer service: where empathy agents, compliance agents, and procedural agents work together to craft responses that are both kind and accurate.
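To ground the vector-embedding point in the fraud-detection bullet, here’s a minimal cosine-similarity lookup. The vectors below are toy values; in practice they would come from an embedding model:

import numpy as np

def top_k_similar(query_vec: np.ndarray, case_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    # Normalize, then rank past cases by cosine similarity to the new claim
    q = query_vec / np.linalg.norm(query_vec)
    m = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    sims = m @ q
    return np.argsort(sims)[::-1][:k]  # indices of the k most similar past cases

past_claims = np.random.rand(100, 8)   # toy embeddings for 100 historical claims
new_claim = np.random.rand(8)
print(top_k_similar(new_claim, past_claims))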

But mostly, I wanted to share a tiny, runnable proof‑of‑concept—something that shows how easy it is to spark new ideas once your LLMs are in conversation, not isolation.

In the Part 2 installment, we’ll evolve this CLI into a goal-driven ToT engine, add RAG powered by vector embeddings, benchmark different agent roles, and walk through two real-world scenarios, insurance underwriting and care management, to show how the tool can be used in a business context.