Introduction
The Alibaba team recently released the Qwen 3 series, including two standout models: a 235B-parameter MoE model (with 22B active parameters) and a lightweight 30B version (3B active).
As per the official docs, the Qwen3-235B-A22B model takes on giants like DeepSeek R1, Grok 3, and Gemini 2.5 Pro, and it does so with fewer active parameters, faster inference, and open-source accessibility.
On the lighter side, Qwen3-30B-A3B outcompetes the previous QwQ-32B, a model with ten times its activated parameters, and even a small model like Qwen3-4B can rival the performance of Qwen2.5-72B-Instruct.
Best part: it costs a fraction as much per million input and output tokens compared to SOTA models.
Impressive, isn't it?
Let's look at some key details of the new Qwen 3 and then put it to the test!
TL;DR
- Alibaba released Qwen 3, featuring efficient MoE models (235B & 30B) claimed to outperform giants like DeepSeek R1 and GPT-4o with fewer active parameters.
- Qwen 3 consistently produced better, more functional, and user-friendly code for tasks.
- Both Qwen 3 and DeepSeek R1 correctly solved the logic puzzles, but Qwen 3 produced better-structured output.
- While both performed well on simpler math, DeepSeek R1 accurately solved a complex multi-step calculation problem that Qwen 3 slightly missed, suggesting results vary by task.
- Qwen 3 excelled at researching, summarizing, and structuring text.
- Overall, Qwen3 is a highly capable, efficient open-source choice, strong in coding and writing, while DeepSeek R1 holds an edge in complex math and reasoning speed.
Qwen3: Key Details
Here are all the important details about Qwen3 that you need to know:
Feature | Details |
---|---|
Motto | Think Deeper, Act Faster |
Variants (XXB = total parameters, AXXB = active parameters) | Qwen3-235B-A22B (MoE); Qwen3-30B-A3B (MoE); 0.6B, 1.7B, 4B, 8B, 14B, and 32B (dense, fully activated) |
Benchmarks beaten (Qwen3-235B-A22B & 30B-A3B) | HumanEval (coding); MATH (mathematical problem solving); GSM8K (grade-school math word problems); Big-Bench Hard (general reasoning); BoolQ (Boolean QA / reading comprehension); ARC-Challenge (scientific multiple-choice reasoning) |
Mixture of Experts (MoE) | ~10% active parameters = huge inference savings |
Modes | Hybrid thinking mode to switch between instant answers and step-by-step reasoning |
Language support | Supports 119 languages |
MCP support | Improved MCP support |
Pretraining | Pretrained on 36T tokens, twice the size of Qwen 2.5's corpus |
Open source | Apache 2.0 license with local deployment option |
Local support | Available for all except the 235B variant |
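Since hybrid thinking mode is the headline feature, here's a minimal sketch of toggling it, following the usage shown on the Qwen3 model card; the model name and generation settings below are illustrative choices:

```python
# Minimal sketch: toggling Qwen3's hybrid thinking mode via the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for instant, non-reasoning answers
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```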
Additionally, Qwen3-235B-A22B outperformed others on tasks like:
- Coding – beat models like DeepSeek R1, Gemini 2.5 Pro, OpenAI o3-mini, and o1
- Mathematics – surpassed all major competitors in problem-solving and step-by-step accuracy
- General Reasoning – outcompeted top-tier models in logical and multi-step reasoning tasks
- Reading Comprehension – excelled in integrating multi-section ideas and inferencing
- Logical Deduction (Puzzles) – solved complex deduction tasks better than many others
And Qwen3-30B-A3B performed competitively on tasks like:
- Coding & Math – matched or exceeded lighter models like GPT-4o, Gemma 3, DCV3
- Efficiency Benchmarks – achieved strong performance with only 3B active params
- General Use Benchmarks – handled a wide variety of tasks comparably to larger models
With such impressive results, benchmarks, and task support, it's easy to get lost.
So, let's evaluate the models' performance on real-world use cases across different domains, and compare the results against a SOTA model like DeepSeek R1.
By the end of the article, you will have a clear idea of whether Qwen3 is right for your use case or whether you should go for DeepSeek R1.
So, let's begin!
CODING
The first task we will check the models on is coding. Nothing fancy here; let's get straight to testing.
1. Functional Sticky Note App
(Skill Tested: Front-end UI development, DOM manipulation, interactivity)
Sticky notes are a very good productivity tool, and I use them quite often, so let's see how both models perform at generating a functional sticky note web app.
Prompt
Create a frontend for a modern note taking app.
Make it so that you can add sticky notes.
Add snip / notes drag functionality too.
The only added functionality is drag and drop. Let's check the results.
Qwen 3
Code: note_taker_qwen.html
Output: (From artifacts window)
Qwen3 - Notes Taker
DeepSeek R1
Code: note_taker_deepseek.html
Output
deepseekr1_notes_taker
Comparing the two, both models did a fine job, but Qwen3's output was faster, more consistent, and more user friendly.
So, I would definitely go for Qwen 3 for simple tasks, as it understands nuances and requirements well.
Now on to the next test.
2. Conway Game of Life (Code)
(Skill Tested: Matrix logic, algorithm implementation, terminal rendering)
Conway's Game of Life is a cellular automaton with the following rules:
- Any live cell with fewer than two live neighbors dies (underpopulation).
- Any live cell with more than three live neighbors dies (overpopulation).
- Any live cell with two or three live neighbors continues to live.
- Any dead cell with exactly three live neighbors comes to life.
Credits: Pinterest
Game Workings
The starting layout is called the seed of the system.
The first generation is created by applying the rules to every cell in the seed at the same time, whether the cell is alive or dead.
All changes—cells being born or dying—happen at once in a step called a tick.
Each new generation depends only on the one before it.
The same rules are applied again and again to create each new generation until the specified number of iterations is reached.
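To make the tick rule concrete, here's a minimal reference sketch of a single generation in Python (my own, not either model's output); cells off the grid are treated as dead:

```python
def tick(grid):
    # One Game of Life generation on a 2D list of 0/1 cells.
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # count the 8 surrounding live cells (off-grid counts as dead)
            n = sum(
                grid[r + dr][c + dc]
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols
            )
            # birth on exactly 3 neighbors; survival on 2 or 3
            new[r][c] = 1 if n == 3 or (grid[r][c] and n == 2) else 0
    return new

# blinker seed: oscillates between a horizontal and a vertical bar each tick
seed = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(tick(seed))  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```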
Given the complexity and decision-making involved in the game, it struck me as a good question for testing the models' coding ability and in-code decision-making.
Prompt
Create a Python implementation of Conway's Game of Life that runs in the terminal.
The program should accept an initial state, run for a specified number of generations,
and display each generation in the terminal.
There's no specific mention of testing the code. Let's check the results.
Qwen3
Code: game_of_live_qwen.py
Output
game_of_life_qwen
DeepSeek R1
Code: game_of_life_deepseek.py, test file: blinker.txt, terminal command: terminal_command.bash
Output
game_of_life_deepseekr1
One thing I liked straightaway about Qwen3 is that it provided a sample test case, making it easier to test the code, and kept the implementation simple: just a single script. Good for prototyping things out.
That was not the case with DeepSeek R1: it provided two files, one main and one test (after I asked how to run it; I was assuming a vibe-coding workflow). Its implementation also differed from Qwen3's, and it had performance issues.
So again, I would go with Qwen3 for coding tasks, as it produces fast, reliable, more accurate, and better-optimized code.
Now let's move on to the next test.
3. Butterfly SVG Generation
(Skill Tested: SVG generation, geometric reasoning, visual symmetry)
SVG stands for Scalable Vector Graphics, and it is quite prominent in graphic design and core web development. So, let's code a butterfly for ourselves.
Prompt
Generate SVG code for a simple butterfly shape.
The butterfly should have symmetrical wings and basic body parts (head, thorax, abdomen).
Use basic SVG shapes and keep the design relatively simple but recognizable.
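Before looking at the model outputs, here's a minimal hand-rolled reference (mine, not a model's) showing one easy way to guarantee wing symmetry: define a single wing and mirror it across the body axis with an SVG transform. Shapes, sizes, and colors are arbitrary choices:

```python
# Emit a simple symmetric butterfly SVG: one wing definition, mirrored
# across the vertical body axis with transform="scale(-1, 1)".
wing = '<ellipse cx="60" cy="0" rx="50" ry="35" fill="orchid"/>'

svg = f"""<svg xmlns="http://www.w3.org/2000/svg" viewBox="-120 -80 240 160">
  <g>{wing}</g>
  <g transform="scale(-1, 1)">{wing}</g>
  <circle cx="0" cy="-48" r="8" fill="#333"/>           <!-- head -->
  <ellipse cx="0" cy="-28" rx="7" ry="14" fill="#333"/> <!-- thorax -->
  <ellipse cx="0" cy="8" rx="6" ry="26" fill="#333"/>   <!-- abdomen -->
</svg>"""

with open("butterfly.svg", "w") as out:
    out.write(svg)
```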
The prompt specifically mentions symmetry in the body parts to check visual symmetry. Let's check the results.
Qwen 3
Code: svg_butterfly_qwen3.xml (rename to svg_butterfly_qwen3.svg when running locally)
Output
DeepSeek R1
Code: svg_butterfly_deepseek.xml (rename to svg_butterfly_deepseek.svg when running locally)
Output
Both models produced symmetric results with the specified body parts.
However, Qwen3 produced better output than DeepSeek R1 and was more nuanced in following the instructions, as is visible in the code.
DeepSeek's output looks more cartoonish, and its code shows the model skipping some of the instructions.
Honestly, I never expected Qwen3 to generate better output than Grok 3, but it really did beat it. Proof (Grok 3's result was even worse than DeepSeek R1's) 👇
This marks the end of the coding, mathematical, and graphical-analysis ability tests, and Qwen3 took the lead.
Now let's test the models' performance on reasoning tasks, where DeepSeek R1 truly shines!
REASONING
One of the top selling points of DeepSeek R1 was its reasoning ability; I even wrote a blog analyzing its CoT reasoning.
Let's see if Qwen3 can beat DeepSeek at its own game.
1. Find The Guilty Person
(Skill Tested: Deductive logic, truth-table consistency)
This is a puzzle that many open-source models failed to answer correctly in past tests. It requires strong deductive reasoning to crack, making it ideal for this test.
Prompt
A detective is investigating a case involving five suspects: Alice, Ben, Carla, David, and Eva. Only one of them is guilty.
The detective receives the following statements:
Alice says: “Ben is guilty.”
Ben says: “Carla is lying.”
Carla says: “Eva is guilty.”
David says: “I didn’t do it.”
Eva says: “Carla is telling the truth.”
Exactly one person is telling the truth. Who is guilty, and how do you know?
Ans: David
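The puzzle is small enough to brute-force; this quick Python check (mine, not either model's) confirms the answer by testing each suspect against the "exactly one truth-teller" constraint:

```python
suspects = ["Alice", "Ben", "Carla", "David", "Eva"]

for guilty in suspects:
    carla_true = (guilty == "Eva")   # Carla: "Eva is guilty."
    statements = [
        guilty == "Ben",             # Alice: "Ben is guilty."
        not carla_true,              # Ben: "Carla is lying."
        carla_true,                  # Carla: "Eva is guilty."
        guilty != "David",           # David: "I didn't do it."
        carla_true,                  # Eva: "Carla is telling the truth."
    ]
    if sum(statements) == 1:         # exactly one truth-teller
        print("Guilty:", guilty)     # -> Guilty: David
```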
For fairness, only the question was provided. Let's check the results.
Qwen 3
Conclusion
The only consistent scenario is when Ben is the only truth-teller and David is the guilty person. This satisfies both conditions:
Only one person is telling the truth (Ben).
Only one person is guilty (David).
All other statements are false, consistent with their being lies.
Final Answer:
David
DeepSeek R1
Why Other Suspects Don’t Fit:
If Alice, Ben, Carla, or Eva were guilty, it would lead to more than one truthful statement, violating the rule.
Only David's guilt results in exactly one truthful person (Ben) while keeping all others false.
Final Answer:
David is guilty, and Ben is the only one telling the truth.
Both models performed great and followed the same truth-table consistency checks to arrive at the right answer.
However, DeepSeek's response was the fastest at around 80 seconds, in contrast to Qwen's 120+ seconds; Qwen, for its part, generated a more structured answer, following a 4-step deduction process.
As for me, I would go for DeepSeek R1, as I just want a quick solution.
Now let's move on to the next test.
2. Museum Field Trip Planning
(Skill Tested: Constraint satisfaction, arithmetic reasoning, real-world planning)
I often go on field trips with my friends, which involves a lot of real-world planning: bus booking, hotel booking, money management, and so on.
This reasoning task comes from similar ground and requires real-world planning, making it ideal for testing model ability.
Prompt
A group of students are planning to go on a field trip to a museum.
They need to decide how many buses to rent and how to divide the students among the buses.
Each bus can hold up to 40 students, but the museum can only accommodate 120 students at a time.
The group has a budget of $800 for the bus rental, and each bus costs $200 per day.
How many buses should the group rent, and how many students should go on each bus and the cost associated for day?
Explain your reasoning.
Ans: 3 buses, 40 students each, $200 per bus ($600 for the day)
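As a sanity check, the arithmetic fits in a few lines of Python (assuming the group sends the full 120 students the museum can host, since the prompt never states a head count):

```python
import math

BUS_CAPACITY = 40      # students per bus
MUSEUM_CAP = 120       # students the museum can host at a time
BUS_COST = 200         # dollars per bus per day
BUDGET = 800

students = MUSEUM_CAP                        # assumption: fill the museum cap
buses = math.ceil(students / BUS_CAPACITY)   # 120 / 40 -> 3 buses
assert buses * BUS_COST <= BUDGET            # $600 fits the $800 budget
print(buses, students // buses, buses * BUS_COST)  # 3 buses, 40 each, $600/day
```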
Again, for fairness, only the question was provided. Let's check the results.
Qwen 3
Final Answer
The group should rent 3 buses with 40 students on each bus.
DeepSeek R1
The group should rent 3 buses, and each bus should carry 40 students.
On the 1st try, both models failed to include the last requirement, the cost associated for the day. On the 2nd try, both gave the total cost instead of the cost per day, which may be due to the prompt not being fully clear.
But again, DeepSeek was faster (95 seconds) compared to Qwen (105 seconds), while for output structuring, Qwen3 takes the lead.
So, in essence, for reasoning tasks it's just a matter of preference; both models performed equally well and stumbled on the same prompt.
So, if you prefer speed, go for DeepSeek R1; if you prefer output quality, Qwen 3 is preferred.
Anyway, this marks the end of the reasoning test with a tie. Now let's move on to the next test: maths.
MATHS
Qwen3's official docs state that its flagship model beats other models on many math-related benchmarks, like MATH, GSM8K, and more.
So, let's test its mathematical ability and check whether it lives up to those claims.
1. Train Timing Problem
(Skill Tested: Multi-step calculations, algebra, word problem reasoning)
This problem involves calculations of relative motion (focus: distance = speed × time) from physics and needs to be solved through multi-step calculation.
Prompt
A train leaves City A at 9:00 AM traveling at 60 km/h.
Another train leaves City B at 11:00 AM traveling toward City A at 90 km/h.
The distance between City A and City B is 450 km. At what time do the two trains meet? Show all steps clearly.
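For reference, the whole computation fits in a few lines of Python (my own sanity check, not either model's output):

```python
head_start = 60 * 2            # Train A runs alone from 9:00 to 11:00 -> 120 km
remaining = 450 - head_start   # 330 km between the trains at 11:00 AM
closing_speed = 60 + 90        # approaching at a combined 150 km/h
hours = remaining / closing_speed
print(hours)                   # 2.2 h = 2 h 12 min -> they meet at 1:12 PM
```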
Trust me, I had to solve a variant of this question in my application development exam, and though it's easy, one can often get tricked by the unit conversions. Let's see how the models perform.
Qwen3 (Shortened)
Train A travels 60 km/h × 2 h = 120 km by 11:00 AM,
leaving 330 km between the trains. They close the gap at a combined speed of 150 km/h (60 + 90), meeting in 330/150 = 2.2 h (2 hours 12 minutes) after 11:00 AM, which results in a meeting time of 1:12 PM.
Calculations:
* Head start: 60×2=120 km
* Remaining distance: 450−120=330 km
* Relative speed: 60+90=150 km/h
* Time to meet: 330÷150=2.2 hours = 2 h 12 min
* Meeting time: 11:00 AM + 2 h 12 min = 1:12 PM
DeepSeek R1 (Shortened)
The first train (Train A) departs at 9:00 AM at 60 km/h and covers 120 km by 11:00 AM,
leaving a remaining distance of 330 km.
The second train (Train B) departs at 11:00 AM at 90 km/h,
and the two trains approach each other at a combined speed of 150 km/h (60 + 90).
They meet after 2.2 hours (2 hours and 12 minutes), at 1:12 PM.
Key Calculations:
1. Distance covered by Train A by 11:00 AM: 60 km/h × 2 h = 120 km
2. Remaining distance: 450 km - 120 km = 330 km
3. Relative speed: 60 km/h + 90 km/h = 150 km/h
4. Time to meet: 330 km / 150 km/h = 2.2 h = 2 hours and 12 minutes
5. Meeting time: 11:00 AM + 2h 12 min = 1:12 PM
Amazingly, both models aced it. They not only returned the answer but also showed detailed calculation steps.
However, Qwen3 was faster and generated a more step-by-step solution; DeepSeek R1, on the other hand, tackled the problem in two different ways.
It started with a number-based calculation, then shifted to algebra, got misguided, and then fell back to the pure calculation approach. This also explains why it took so much time.
So here I would prefer Qwen3, as it tackled the problem head-on and derived the solution using pure numeric calculation.
On to the next question.
2. Fuel Calculation for 20-Minute Race
(Skill Tested: Multi-step calculations, unit conversions, practical application in racing context)
I saw this question getting quite a lot of interaction on Reddit. Per the poster's claims and my own tests, most models failed to solve it due to rounding errors, making it ideal for our use case 😁
Prompt
I'm playing assetto corsa competizione, and I need you to tell me how many liters of fuel to take in a race.
The qualifying time was 2:04.317, the race is 20 minutes long, and the car uses 2.73 liters per lap
As per post owner:
This is actually really hard. It requires the model to compute the number of laps (9.x), then round up because a partial lap isn't possible (10), then multiply by the liters/lap to get the correct answer of 27.3 L, with bonus points for suggesting an extra liter or two.
The most common failures I see are forgetting to round up and then doing the final multiplication totally wrong.
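In Python terms, the poster's expected computation is just a few lines (a sketch of the reference answer, not either model's output):

```python
import math

LAP_TIME = 2 * 60 + 4.317    # qualifying lap 2:04.317 -> 124.317 s
RACE_TIME = 20 * 60          # 20-minute race -> 1200 s
LITERS_PER_LAP = 2.73

laps_needed = math.ceil(RACE_TIME / LAP_TIME)   # ~9.65 -> round up to 10 laps
fuel = laps_needed * LITERS_PER_LAP             # 10 * 2.73 = 27.3 liters
print(f"{fuel:.1f} L (plus a liter or two of margin)")
```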
One common behavior across models on this question is adding a safety buffer in some form, leading to misguided steps and wrong answers. Let's see if our models can overcome that.
Qwen 3
To determine the required fuel load for a 20-minute race in Assetto Corsa Competizione, follow these steps:
1. Convert Lap Time to Seconds:
Qualifying time = 2:04.317 = 2 × 60 + 4.317 = 124.317 seconds per lap.
2. Calculate Total Race Duration in Seconds:
20 minutes × 60 = 1200 seconds.
3. Estimate Number of Laps in the Race:
1200 ÷ 124.317 ≈ 9.65 laps.
4. Compute Fuel Requirement:
Multiply the number of laps (including the partial lap) by fuel consumption per lap:
9.65 laps × 2.73 liters/lap ≈ 26.35 liters.
5. Safety Margin (Optional):
To account for unexpected variations (e.g., off-track moments, traffic), round up to 27 liters.
In ACC, fuel can be set with decimal precision, so 26.4 liters is sufficient for the race duration. However, 27 liters is safer to avoid running dry.
Final Recommendation:
- With Safety Margin: 27 liters (ensures completion of 10 laps).
- Precise Calculation: 26.4 liters (exact amount for ~9.65 laps).
DeepSeek R1
To determine the fuel needed for your 20-minute race in Assetto Corsa Competizione:
1. Convert Qualifying Time to Seconds:
2 minutes + 4.317 seconds = 124.317 seconds per lap
2. Convert Race Duration to Seconds:
20 minutes × 60 = 1200 seconds
3. Calculate Total Laps:
1200 seconds ÷ 124.317 seconds per lap ≈ 9.65 laps
Round up to 10 laps, since you can’t finish a partial lap
4. Base Fuel Requirement:
10 laps × 2.73 liters per lap = 27.3 liters
5. Add a Safety Buffer:
ACC races can extend by a lap due to timing quirks or unexpected slowdowns
27.3 liters + 2.73 liters = 30.03 liters
Final Recommendation:
Take 30 to 31 liters to safely cover 10 to 11 laps, accounting for any race anomalies or pace variations
Would you like help creating a reusable race prep guide?
What surprised me is that DeepSeek R1 not only solved the problem but also created a variation of its own, without explicit prompting, and offered it as a recommendation.
On the other hand, Qwen3 failed to generate the right output (27.3 L), though it was not off by a lot.
So, for complex math, I would still go with DeepSeek R1; it's better here and solved 2/2 challenges correctly.
Off to the last task: writing.
WRITING
Most models perform fairly well in this domain, so I have prepared a complex multi-step task prompt to put the models' writing ability to the test.
1. Summarizing a Research Article on Climate Modeling
(Skill Tested: Multi-section synthesis, causal inference, reasoning)
I like to read research papers and articles, but since they're so long, I often end up dividing them into sections, reading them section by section, and summarizing later.
Let's replicate the same method with the LLMs and see what they generate.
Prompt
Imagine a research article is split into three sections provided below.
After reading them, answer the final question.
Section A (Introduction): Describes the problem of climate modeling in arid regions and the limitations of previous approaches.
Section B (Methodology): Introduces a hybrid model combining satellite data with on-the-ground sensors using a deep ensemble neural network.
Section C (Results & Conclusion): Shows the hybrid model outperforms all baselines, especially in extreme temperature prediction.
Question: Based on the full article, explain why the hybrid approach was more effective than traditional models, and summarize the primary advantage it offered in the context of arid regions.
This prompt tests the causal-inference reasoning and summarization capabilities of the models. Let's check the results!
- Qwen 3: climate_modelling_summary_qwen.txt
- DeepSeek R1: climate_modelling_summary_deepseek.txt
I was expecting both models to generate 3 separate section summaries (reader friendly), but only Qwen3 did that, showing strong comprehension and inference capabilities.
DeepSeek R1, on the other hand, provided similar text but gave me a hard time reading it and figuring out the details; it's not reader friendly unless explicitly instructed.
So, if I had to pick a model for writing-related tasks, Qwen3 would again be my open-source choice.
This marks the end of the tests, and Qwen 3 takes the lead.
Final Thoughts
So where does that leave us?
Qwen 3, especially the 235B flagship, beats models like GPT-4o, DeepSeek R1, Grok 3, etc. in many cases.
With strong coding and math skills, and a Mixture of Experts design, it delivers top-tier results at a much lower inference cost.
The 30B version is plenty impressive as well: lightweight, local- and dev-friendly, and more capable than other midsize models like Gemma 3 or DCV3.
I can only say this at the end: if you've been waiting for a truly capable, open alternative to the likes of GPT-4o or DeepSeek R1, this might be it.