Introduction
The Alibaba team recently released the Qwen 3 series, including two standout models: a 235B-parameter MoE model (with 22B active parameters) and a lightweight 30B version (3B active).
As per the official docs, the Qwen3-235B-A22B model takes on giants like DeepSeek R1, Grok 3, and Gemini 2.5 Pro, and it does so with fewer active parameters, faster inference, and open-source accessibility.
On the lighter side, Qwen3-30B-A3B outcompetes the previous QwQ-32B, a model with ten times its activated parameters, and even a small model like Qwen3-4B can rival the performance of Qwen2.5-72B-Instruct.
Best part: it costs a fraction as much per million input and output tokens compared to SOTA models.
Impressive, isn't it?
Let's look at some key details of the new Qwen 3 and then put it to the test!
TL;DR
- Alibaba released Qwen 3, featuring efficient MoE models (235B & 30B) claimed to outperform giants like DeepSeek R1 and GPT-4o with fewer active parameters.
- Qwen 3 consistently produced better, more functional, and user-friendly code for tasks.
- Both Qwen 3 and DeepSeek R1 correctly solved the logic puzzles, but Qwen 3 produced better-structured output.
- While both performed well on simpler math, DeepSeek R1 accurately solved a complex multi-step calculation problem that Qwen 3 slightly missed, suggesting results vary by task.
- Qwen 3 excelled at researching, summarizing, and structuring text.
- Overall, Qwen3 is a highly capable, efficient open-source choice, strong in coding and writing, while DeepSeek R1 holds an edge in complex math and reasoning speed.
Qwen3: Key Details
Here are all the important details about Qwen3 that you need to know:
Feature | Details |
---|---|
Motto | Think Deeper, Act Faster |
Variants (XXB = total parameters, AXXB = active parameters) | Qwen3-235B-A22B (MoE); Qwen3-30B-A3B (MoE); 0.6B, 1.7B, 4B, 8B, 14B, and 32B (dense, fully activated) |
Benchmarks beaten (Qwen3-235B-A22B & 30B-A3B) | HumanEval (coding); MATH (mathematical problem solving); GSM8K (grade-school math word problems); Big-Bench Hard (general reasoning); BoolQ (Boolean QA / reading comprehension); ARC-Challenge (scientific multiple-choice reasoning) |
Mixture of Experts (MoE) | ~10% active parameters = huge inference savings |
Modes | Hybrid thinking mode to switch between instant answers and step-by-step reasoning |
Language support | Supports 119 languages |
MCP support | Improved MCP support |
Pretraining | Pretrained on 36T tokens, twice the size of Qwen 2.5's corpus |
Open source | Apache 2.0 license with local deployment option |
Local support | Available for all except the 235B variant |
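Since hybrid thinking mode is the headline feature, here's a minimal sketch of toggling it, following the usage shown on the Qwen3 model card; the model name and generation settings below are illustrative choices:

```python
# Minimal sketch: toggling Qwen3's hybrid thinking mode via the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for instant, non-reasoning answers
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```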
Additionally, Qwen3-235B-A22B outperformed others on tasks like:
- Coding – beat models like DeepSeek R1, Gemini 2.5 Pro, OpenAI o3-mini, and o1
- Mathematics – surpassed all major competitors in problem-solving and step-by-step accuracy
- General Reasoning – outcompeted top-tier models in logical and multi-step reasoning tasks
- Reading Comprehension – excelled in integrating multi-section ideas and inferencing
- Logical Deduction (Puzzles) – solved complex deduction tasks better than many others
And Qwen3-30B-A3B performed competitively on tasks like:
- Coding & Math – matched or exceeded lighter models like GPT-4o, Gemma 3, DCV3
- Efficiency Benchmarks – achieved strong performance with only 3B active params
- General Use Benchmarks – handled a wide variety of tasks comparably to larger models
With such impressive results, benchmarks, and task support, it's easy to get lost.
So, let's evaluate the models' performance on real-world use cases across different domains, and compare the results against a SOTA model like DeepSeek R1.
By the end of the article, you will have a clear idea of whether Qwen3 is right for your use case or whether you should go for DeepSeek R1.
So, let's begin!
CODING
The first task we will check the models on is coding. Nothing fancy here; let's get straight to testing.
1. Functional Sticky Note App
(Skill Tested: Front-end UI development, DOM manipulation, interactivity)
Sticky notes are a very good productivity tool, and I use them quite often, so let's see how both models perform at generating a functional sticky note web app.
Prompt
Create a frontend for a modern note taking app.
Make it so that you can add sticky notes.
Add snip / notes drag functionality too.
The only added functionality is drag and drop. Let's check the results.
Qwen 3
Code: note_taker_qwen.html
Output: (From artifacts window)
Qwen3 - Notes Taker
DeepSeek R1
Code: note_taker_deepseek.html
Output
deepseekr1_notes_taker
Comparing the two, both models did a fine job, but Qwen3's output was faster, more consistent, and more user friendly.
So, I would definitely go for Qwen 3 for simple tasks, as it understands nuances and requirements well.
Now on to the next test.
2. Conway Game of Life (Code)
(Skill Tested: Matrix logic, algorithm implementation, terminal rendering)
Conway's Game of Life is a cellular automaton with the following rules:
- Any live cell with fewer than two live neighbors dies (underpopulation).
- Any live cell with more than three live neighbors dies (overpopulation).
- Any live cell with two or three live neighbors continues to live.
- Any dead cell with exactly three live neighbors comes to life.
Credits: Pinterest
Game Workings
The starting layout is called the seed of the system.
The first generation is created by applying the rules to every cell in the seed at the same time, whether the cell is alive or dead.
All changes—cells being born or dying—happen at once in a step called a tick.
Each new generation depends only on the one before it.
The same rules are applied again and again to create each new generation until the specified number of iterations is reached.
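To make the tick rule concrete, here's a minimal reference sketch of a single generation in Python (my own, not either model's output); cells off the grid are treated as dead:

```python
def tick(grid):
    # One Game of Life generation on a 2D list of 0/1 cells.
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # count the 8 surrounding live cells (off-grid counts as dead)
            n = sum(
                grid[r + dr][c + dc]
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols
            )
            # birth on exactly 3 neighbors; survival on 2 or 3
            new[r][c] = 1 if n == 3 or (grid[r][c] and n == 2) else 0
    return new

# blinker seed: oscillates between a horizontal and a vertical bar each tick
seed = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(tick(seed))  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```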
Given the complexity and decision-making involved in the game, it struck me as a good question for testing the models' coding ability and in-code decision-making.
Prompt
Create a Python implementation of Conway's Game of Life that runs in the terminal.
The program should accept an initial state, run for a specified number of generations,
and display each generation in the terminal.
There's no specific mention of testing the code. Let's check the results.
Qwen3
Code: game_of_live_qwen.py
Output
game_of_life_qwen
DeepSeek R1
Code: game_of_life_deepseek.py, test file: blinker.txt, terminal command: terminal_command.bash
Output
game_of_life_deepseekr1
One thing I liked straightaway about Qwen3 is that it provided a sample test case, making it easier to test the code, and kept the implementation simple: just a single script. Good for prototyping things out.
That was not the case with DeepSeek R1: it provided two files, one main and one test (after I asked how to run it; I was assuming a vibe-coding workflow). Its implementation also differed from Qwen3's, and it had performance issues.
So again, I would go with Qwen3 for coding tasks, as it produces fast, reliable, more accurate, and better-optimized code.
Now let's move on to the next test.
3. Butterfly SVG Generation
(Skill Tested: SVG generation, geometric reasoning, visual symmetry)
SVG stands for Scalable Vector Graphics, and it is quite prominent in graphic design and core web development. So, let's code a butterfly for ourselves.
Prompt
Generate SVG code for a simple butterfly shape.
The butterfly should have symmetrical wings and basic body parts (head, thorax, abdomen).
Use basic SVG shapes and keep the design relatively simple but recognizable.
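Before looking at the model outputs, here's a minimal hand-rolled reference (mine, not a model's) showing one easy way to guarantee wing symmetry: define a single wing and mirror it across the body axis with an SVG transform. Shapes, sizes, and colors are arbitrary choices:

```python
# Emit a simple symmetric butterfly SVG: one wing definition, mirrored
# across the vertical body axis with transform="scale(-1, 1)".
wing = '<ellipse cx="60" cy="0" rx="50" ry="35" fill="orchid"/>'

svg = f"""<svg xmlns="http://www.w3.org/2000/svg" viewBox="-120 -80 240 160">
  <g>{wing}</g>
  <g transform="scale(-1, 1)">{wing}</g>
  <circle cx="0" cy="-48" r="8" fill="#333"/>           <!-- head -->
  <ellipse cx="0" cy="-28" rx="7" ry="14" fill="#333"/> <!-- thorax -->
  <ellipse cx="0" cy="8" rx="6" ry="26" fill="#333"/>   <!-- abdomen -->
</svg>"""

with open("butterfly.svg", "w") as out:
    out.write(svg)
```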
The prompt specifically mentions symmetry in the body parts to check visual symmetry. Let's check the results.
Qwen 3
Code: svg_butterfly_qwen3.xml (rename to svg_butterfly_qwen3.svg when running locally)
Output
DeepSeek R1
Code: svg_butterfly_deepseek.xml (rename to svg_butterfly_deepseek.svg when running locally)
Output
Both models produced symmetric results with the specified body parts.
However, Qwen3 produced better output than DeepSeek R1 and was more nuanced in following the instructions, as is visible in the code.
DeepSeek's output looks more cartoonish, and its code shows the model skipping some of the instructions.
Honestly, I never expected Qwen3 to generate better output than Grok 3, but it really did beat it. Proof (Grok 3's result was even worse than DeepSeek R1's) 👇
This marks the end of the coding, mathematical, and graphical-analysis ability tests, and Qwen3 took the lead.
Now let's test the models' performance on reasoning tasks, where DeepSeek R1 truly shines!
REASONING
One of the top selling points of DeepSeek R1 was its reasoning ability; I even wrote a blog analyzing its CoT reasoning.
Let's see if Qwen3 can beat DeepSeek at its own game.
1. Find The Guilty Person
(Skill Tested: Deductive logic, truth-table consistency)
This is a puzzle that many open-source models failed to answer correctly in past tests. It requires strong deductive reasoning to crack, making it ideal for this test.
Prompt
A detective is investigating a case involving five suspects: Alice, Ben, Carla, David, and Eva. Only one of them is guilty.
The detective receives the following statements:
Alice says: “Ben is guilty.”
Ben says: “Carla is lying.”
Carla says: “Eva is guilty.”
David says: “I didn’t do it.”
Eva says: “Carla is telling the truth.”
Exactly one person is telling the truth. Who is guilty, and how do you know?
Ans: David
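The puzzle is small enough to brute-force; this quick Python check (mine, not either model's) confirms the answer by testing each suspect against the "exactly one truth-teller" constraint:

```python
suspects = ["Alice", "Ben", "Carla", "David", "Eva"]

for guilty in suspects:
    carla_true = (guilty == "Eva")   # Carla: "Eva is guilty."
    statements = [
        guilty == "Ben",             # Alice: "Ben is guilty."
        not carla_true,              # Ben: "Carla is lying."
        carla_true,                  # Carla: "Eva is guilty."
        guilty != "David",           # David: "I didn't do it."
        carla_true,                  # Eva: "Carla is telling the truth."
    ]
    if sum(statements) == 1:         # exactly one truth-teller
        print("Guilty:", guilty)     # -> Guilty: David
```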
For fairness, only the question was provided. Let's check the results.
Qwen 3
Conclusion
The only consistent scenario is when Ben is the only truth-teller and David is the guilty person. This satisfies both conditions:
Only one person is telling the truth (Ben).
Only one person is guilty (David).
All other statements are false, consistent with their being lies.
Final Answer:
David
DeepSeek R1
Why Other Suspects Don’t Fit:
If Alice, Ben, Carla, or Eva were guilty, it would lead to more than one truthful statement, violating the rule.
Only David's guilt results in exactly one truthful person (Ben) while keeping all others false.
Final Answer:
David is guilty, and Ben is the only one telling the truth.
Both models performed great and followed the same truth-table consistency checks to arrive at the right answer.
However, DeepSeek's response was the fastest at around 80 seconds, in contrast to Qwen's 120+ seconds; Qwen, for its part, generated a more structured answer, following a 4-step deduction process.
As for me, I would go for DeepSeek R1, as I just want a quick solution.
Now let's move on to the next test.
2. Museum Field Trip Planning
(Skill Tested: Constraint satisfaction, arithmetic reasoning, real-world planning)
I often go on field trips with my friends, which involves a lot of real-world planning: bus booking, hotel booking, money management, and so on.
This reasoning task comes from similar ground and requires real-world planning, making it ideal for testing model ability.
Prompt
A group of students are planning to go on a field trip to a museum.
They need to decide how many buses to rent and how to divide the students among the buses.
Each bus can hold up to 40 students, but the museum can only accommodate 120 students at a time.
The group has a budget of $800 for the bus rental, and each bus costs $200 per day.
How many buses should the group rent, and how many students should go on each bus and the cost associated for day?
Explain your reasoning.
Ans: 3 buses, 40 students each, $200 per bus ($600 for the day)
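As a sanity check, the arithmetic fits in a few lines of Python (assuming the group sends the full 120 students the museum can host, since the prompt never states a head count):

```python
import math

BUS_CAPACITY = 40      # students per bus
MUSEUM_CAP = 120       # students the museum can host at a time
BUS_COST = 200         # dollars per bus per day
BUDGET = 800

students = MUSEUM_CAP                        # assumption: fill the museum cap
buses = math.ceil(students / BUS_CAPACITY)   # 120 / 40 -> 3 buses
assert buses * BUS_COST <= BUDGET            # $600 fits the $800 budget
print(buses, students // buses, buses * BUS_COST)  # 3 buses, 40 each, $600/day
```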
Again, for fairness, only the question was provided. Let's check the results.
Qwen 3
Final Answer
The group should rent 3 buses with 40 students on each bus.
DeepSeek R1
The group should rent 3 buses, and each bus should carry 40 students.
On the 1st try, both models failed to include the last requirement, the cost associated for the day. On the 2nd try, both gave the total cost instead of the cost per day, which may be due to the prompt not being fully clear.
But again, DeepSeek was faster (95 seconds) compared to Qwen (105 seconds), while for output structuring, Qwen3 takes the lead.
So, in essence, for reasoning tasks it's just a matter of preference; both models performed equally well and stumbled on the same prompt.
So, if you prefer speed, go for DeepSeek R1; if you prefer output quality, Qwen 3 is preferred.
Anyway, this marks the end of the reasoning test with a tie. Now let's move on to the next test: maths.
MATHS
Qwen3's official docs state that its flagship model beats other models on many math-related benchmarks, like MATH, GSM8K, and more.
So, let's test its mathematical ability and check whether it lives up to those claims.
1. Train Timing Problem
(Skill Tested: Multi-step calculations, algebra, word problem reasoning)
This problem involves calculations of relative motion (focus: distance = speed × time) from physics and needs to be solved through multi-step calculation.
Prompt
A train leaves City A at 9:00 AM traveling at 60 km/h.
Another train leaves City B at 11:00 AM traveling toward City A at 90 km/h.
The distance between City A and City B is 450 km. At what time do the two trains meet? Show all steps clearly.
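For reference, the whole computation fits in a few lines of Python (my own sanity check, not either model's output):

```python
head_start = 60 * 2            # Train A runs alone from 9:00 to 11:00 -> 120 km
remaining = 450 - head_start   # 330 km between the trains at 11:00 AM
closing_speed = 60 + 90        # approaching at a combined 150 km/h
hours = remaining / closing_speed
print(hours)                   # 2.2 h = 2 h 12 min -> they meet at 1:12 PM
```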
Trust me, I had to solve a variant of this question in my application development exam, and though it's easy, one can often get tricked by the unit conversions. Let's see how the models perform.
Qwen3 (Shortened)
Train A travels 60 km/h × 2 h = 120 km by 11:00 AM,
leaving 330 km between the trains. They close the gap at a combined speed of 150 km/h (60 + 90), meeting in 330/150 = 2.2 h (2 hours 12 minutes) after 11:00 AM, which results in a meeting time of 1:12 PM.
Calculations:
* Head start: 60×2=120 km
* Remaining distance: 450−120=330 km
* Relative speed: 60+90=150 km/h
* Time to meet: 330÷150=2.2 hours = 2 h 12 min
* Meeting time: 11:00 AM + 2 h 12 min = 1:12 PM
DeepSeek R1 (Shortened)
The first train (Train A) departs at 9:00 AM at 60 km/h and covers 120 km by 11:00 AM,
leaving a remaining distance of 330 km.
The second train (Train B) departs at 11:00 AM at 90 km/h,
and the two trains approach each other at a combined speed of 150 km/h (60 + 90).
They meet after 2.2 hours (2 hours and 12 minutes), at 1:12 PM.
Key Calculations:
1. Distance covered by Train A by 11:00 AM: 60 km/h × 2 h = 120 km
2. Remaining distance: 450 km - 120 km = 330 km
3. Relative speed: 60 km/h + 90 km/h = 150 km/h
4. Time to meet: 330 km / 150 km/h = 2.2 h = 2 hours and 12 minutes
5. Meeting time: 11:00 AM + 2h 12 min = 1:12 PM
Amazingly, both models aced it. They not only returned the answer but also showed detailed calculation steps.
However, Qwen3 was faster and generated a more step-by-step solution; DeepSeek R1, on the other hand, tackled the problem in two different ways.
It started with a number-based calculation, then shifted to algebra, got misguided, and then fell back to the pure calculation approach. This also explains why it took so much time.
So here I would prefer Qwen3, as it tackled the problem head-on and derived the solution using pure numeric calculation.
On to the next question.
2. Fuel Calculation for 20-Minute Race
(Skill Tested: Multi-step calculations, unit conversions, practical application in racing context)
I saw this question getting quite a lot of interaction on Reddit. Per the poster's claims and my own tests, most models failed to solve it due to rounding errors, making it ideal for our use case 😁
Prompt
I'm playing assetto corsa competizione, and I need you to tell me how many liters of fuel to take in a race.
The qualifying time was 2:04.317, the race is 20 minutes long, and the car uses 2.73 liters per lap
As per post owner:
This is actually really hard. It requires the model to compute the number of laps (9.x), then round up because a partial lap isn't possible (10), then multiply by the liters/lap to get the correct answer of 27.3 L, with bonus points for suggesting an extra liter or two.
The most common failures I see are forgetting to round up and then doing the final multiplication totally wrong.
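In Python terms, the poster's expected computation is just a few lines (a sketch of the reference answer, not either model's output):

```python
import math

LAP_TIME = 2 * 60 + 4.317    # qualifying lap 2:04.317 -> 124.317 s
RACE_TIME = 20 * 60          # 20-minute race -> 1200 s
LITERS_PER_LAP = 2.73

laps_needed = math.ceil(RACE_TIME / LAP_TIME)   # ~9.65 -> round up to 10 laps
fuel = laps_needed * LITERS_PER_LAP             # 10 * 2.73 = 27.3 liters
print(f"{fuel:.1f} L (plus a liter or two of margin)")
```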
One common behavior across models on this question is adding a safety buffer in some form, leading to misguided steps and wrong answers. Let's see if our models can overcome that.
Qwen 3
To determine the required fuel load for a 20-minute race in Assetto Corsa Competizione, follow these steps:
1. Convert Lap Time to Seconds:
Qualifying time = 2:04.317 = 2 × 60 + 4.317 = 124.317 seconds per lap.
2. Calculate Total Race Duration in Seconds:
20 minutes × 60 = 1200 seconds.
3. Estimate Number of Laps in the Race:
1200 ÷ 124.317 ≈ 9.65 laps.
4. Compute Fuel Requirement:
Multiply the number of laps (including the partial lap) by fuel consumption per lap:
9.65 laps × 2.73 liters/lap ≈ 26.35 liters.
5. Safety Margin (Optional):
To account for unexpected variations (e.g., off-track moments, traffic), round up to 27 liters.
In ACC, fuel can be set with decimal precision, so 26.4 liters is sufficient for the race duration. However, 27 liters is safer to avoid running dry.
Final Recommendation:
- With Safety Margin: 27 liters (ensures completion of 10 laps).
- Precise Calculation: 26.4 liters (exact amount for ~9.65 laps).
DeepSeek R1
To determine the fuel needed for your 20-minute race in Assetto Corsa Competizione:
1. Convert Qualifying Time to Seconds:
2 minutes + 4.317 seconds = 124.317 seconds per lap
2. Convert Race Duration to Seconds:
20 minutes × 60 = 1200 seconds
3. Calculate Total Laps:
1200 seconds ÷ 124.317 seconds per lap ≈ 9.65 laps
Round up to 10 laps, since you can’t finish a partial lap
4. Base Fuel Requirement:
10 laps × 2.73 liters per lap = 27.3 liters
5. Add a Safety Buffer:
ACC races can extend by a lap due to timing quirks or unexpected slowdowns
27.3 liters + 2.73 liters = 30.03 liters
Final Recommendation:
Take 30 to 31 liters to safely cover 10 to 11 laps, accounting for any race anomalies or pace variations
Would you like help creating a reusable race prep guide?
What surprised me is that DeepSeek R1 not only solved the problem but also created a variation of its own, without explicit prompting, and offered it as a recommendation.
On the other hand, Qwen3 failed to generate the right output (27.3 L), though it was not off by a lot.
So, for complex math, I would still go with DeepSeek R1; it's better here and solved 2/2 challenges correctly.
Off to the last task: writing.
WRITING
Most models perform fairly well in this domain, so I have prepared a complex multi-step task prompt to put the models' writing ability to the test.
1. Summarizing a Research Article on Climate Modeling
(Skill Tested: Multi-section synthesis, causal inference, reasoning)
I like to read research papers and articles, but since they're so long, I often end up dividing them into sections, reading them section by section, and summarizing later.
Let's replicate the same method with the LLMs and see what they generate.
Prompt
Imagine a research article is split into three sections provided below.
After reading them, answer the final question.
Section A (Introduction): Describes the problem of climate modeling in arid regions and the limitations of previous approaches.
Section B (Methodology): Introduces a hybrid model combining satellite data with on-the-ground sensors using a deep ensemble neural network.
Section C (Results & Conclusion): Shows the hybrid model outperforms all baselines, especially in extreme temperature prediction.
Question: Based on the full article, explain why the hybrid approach was more effective than traditional models, and summarize the primary advantage it offered in the context of arid regions.
This prompt tests the causal-inference reasoning and summarization capabilities of the models. Let's check the results!
- Qwen 3: climate_modelling_summary_qwen.txt
- DeepSeek R1: climate_modelling_summary_deepseek.txt
I was expecting both models to generate 3 separate section summaries (reader friendly), but only Qwen3 did that, showing strong comprehension and inference capabilities.
DeepSeek R1, on the other hand, provided similar text but gave me a hard time reading it and figuring out the details; it's not reader friendly unless explicitly instructed.
So, if I had to pick a model for writing-related tasks, Qwen3 would again be my open-source choice.
This marks the end of the tests, and Qwen 3 takes the lead.
Final Thoughts
So where does that leave us?
Qwen 3, especially the 235B flagship, beats models like GPT-4o, DeepSeek R1, Grok 3, etc. in many cases.
With strong coding and math skills, and a Mixture of Experts design, it delivers top-tier results at a much lower inference cost.
The 30B version is plenty impressive as well: lightweight, local- and dev-friendly, and more capable than other midsize models like Gemma 3 or DCV3.
I can only say this at the end: if you've been waiting for a truly capable, open alternative to the likes of GPT-4o or DeepSeek R1, this might be it.