Just discovered OpenAI's newest models (o3 and o4-mini) and couldn't resist giving them a quick spin! 🧪 These are being called "our smartest and most capable models to date with full tool access" - but do they live up to the hype? Here's what I found during my definitely-not-thorough-but-still-revealing test session!

What's new with these AI besties? 🚀

All-in-one tool access! For the first time, these models can seamlessly use ALL their tools (web search, coding, analyzing images, creating images) without awkward mode switching

Extended thinking time - o3 can take up to a MINUTE to think through complex problems (like that friend who says "give me a sec" before delivering the perfect advice)

o4-mini is the speedy option ⚡ When you need fast answers without the deep thinking, this lighter model delivers surprisingly smart responses with higher usage limits

They actually reason with images instead of just describing them - analyzing diagrams, zooming in, and using visuals as part of solving problems

My totally unscientific, low-effort tests 🧪

Test #1: The ancient flowchart challenge

I dug up a flowchart from a game my school project team developed back in 2003 (when flip phones were cool!) and first asked o3 to convert it into a cool 3D diagram - which completely failed (probably limitations in DALL-E, or maybe Sora?).

Then I asked it to analyze the flowchart and convert it to Mermaid chart code. First attempt? Syntax errors galore! But after some tweaking, I was genuinely impressed - it didn't just recreate the chart, it enhanced it! Cleaner layout, better logic flow, and somehow made our 20-year-old work look better 😅

Old flowchart revival attempt
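For readers who haven't used it, Mermaid describes flowcharts as plain text that renders into a diagram. This is just a made-up fragment to show the syntax, not the actual chart from our game:

```mermaid
flowchart TD
    Start([Game start]) --> Menu{Main menu}
    Menu -->|New game| Init[Initialize level]
    Menu -->|Quit| Done([Exit])
    Init --> Loop[Game loop]
    Loop -->|Player dies| GameOver[Game over screen]
    GameOver --> Menu
```

Text like this is exactly what o3 produced - which also explains the syntax errors, since one misplaced bracket breaks the whole render.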

Test #2: The website cloning experiment

For a one-shot prompt test, I grabbed a screenshot of a Swedish business website (Wint, a financial platform) and challenged o3 to recreate it as a static page with HTML/CSS/JS. The results? Not very impressive compared to older models - it handled the basic structure but really wasn't any better than what GPT-4 could already do.

Original Wint page

One prompt o3 clone attempt

Good old 4o produced similar results, and so did Lovable.dev, a Swedish company specializing in AI-generated web apps, with its fine-tuned system. So nothing spectacular here.

One prompt Lovable clone attempt

Test #3: The tax calculation gauntlet

This one was fascinating! First, I had o3 research the current Swedish tax regulations. Then I asked it to audit my company's K10 tax calculations going back to 2018. In just 33 SECONDS, it verified all my previous calculations, spotted some subtle rule changes over the years, and even suggested an optimal setup for this year.

Analyzing tax stuff
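For context, the K10 "simplified rule" (förenklingsregeln) ties each year's dividend allowance (gränsbelopp) to that year's income base amount, so the kind of multi-year check I asked o3 to do boils down to a small calculation per year. Here's a rough sketch - the base amounts below are illustrative placeholders, not official Skatteverket figures:

```python
# Sketch of a K10 simplified-rule (förenklingsregeln) sanity check.
# Under this rule the allowance is 2.75 x the income base amount
# (inkomstbasbelopp) for the year.
# NOTE: the base amounts below are ILLUSTRATIVE PLACEHOLDERS, not official
# figures - look up the real values before trusting any result.

SIMPLIFIED_RULE_FACTOR = 2.75  # multiplier on the income base amount

# year -> income base amount in SEK (placeholder values!)
income_base_amount = {
    2018: 62_500,
    2019: 64_400,
    2020: 66_800,
}

def gransbelopp(year: int) -> float:
    """Dividend allowance under the simplified rule for a given year."""
    return SIMPLIFIED_RULE_FACTOR * income_base_amount[year]

for year in sorted(income_base_amount):
    print(year, gransbelopp(year))
```

This ignores the carry-forward and uplift of unused allowance between years, which is exactly the kind of subtle rule change o3 caught and a toy script like this would not.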

What was particularly interesting was watching o3 get into an argument with Claude about the tax regulations! Claude made some corrections based on older drafts and proposals that never actually became law. o3 pushed back with: "Below is what Skatteverket's current (2025‑edition) guidance and the wording of 57 kap. IL actually say. Your two 'corrections' are based on older drafts/proposals that never became law (or were repealed years ago)."

Claude then had to admit: "After reviewing the information from the Swedish Tax Agency's current guidelines (2025), I must correct my previous analysis. You are completely right."

Of course, only the tax authority would know if o3's analysis was really correct, but the confidence and specificity were impressive.

The verdict? A quiet evolution, not a revolution 🧠

Models are definitely better, but the "wow" is smaller — because you've evolved, too.

A year or two ago, I probably wouldn't have slept for a week after playing with these models. Today? They're impressive but in a more subdued way - like getting a nicer coffee machine that makes your morning better without changing your life.

The autonomous tool selection implementation is cool, but nothing revolutionary, and ChatGPT still falls far behind by not supporting MCP (Model Context Protocol) the way Claude Desktop does. That said, the image analysis capabilities are pretty impressive - the way it can zoom in on details and reason about visual information is a nice step forward.

What o3 and o4-mini do well:

Multi-step reasoning across different inputs (like analyzing images + code + files together)

Autonomously choosing tools based on what your question needs

Understanding complex, domain-specific knowledge (like multi-year tax regulations)

Enhanced comprehension of diagrams, charts, and visual information

What still needs work:

We can't draw firm conclusions from such a small test, but:

Image analysis (and generation) - DALL-E absolutely still needs work. However, if the zoom-to-read-an-image feature now works, maybe generating parts of a diagram in a multi-step, sequential way could actually do the trick. What if it could create one layer for entities, one for text, and one for connections? That would probably simplify the problem enough to actually pull it off.


Mermaid and code syntax - still needs human correction sometimes

Complex visual reasoning - understands basic layouts well but can miss nuanced details

Context preservation - occasionally forgets details from earlier in the conversation

For daily use, o3 feels right for complex tasks (coding projects, deep research, analyzing data), while o4-mini handles quick questions where speed matters more than depth.

Easter weekend will be my true test of whether these models earn a permanent place in my workflow. For now, they feel like a solid step forward - not mind-blowing, but definitely useful enough to keep around.

Oh, and can we talk about those names? o3 and o4-mini? Every time OpenAI makes progress with AI, they take a step backward with naming conventions! It's like they're allergic to memorable branding 😂


Reference: Introducing o3 and o4-mini - OpenAI