As an AI enthusiast, I've been closely following the recent developments in large language models. Today, I'm diving into a detailed comparison between two powerful contenders: Reka Flash 3 and QwQ 32B.

Both models have been making waves in the AI community, but how do they actually perform across various domains? Let's find out whether they live up to the hype.


Coding

The first domain we will test both models on is coding. Current LLMs are proficient at writing clean and optimized code; let's see how both models perform on two different tasks.

3D Simulation - JavaScript

A 3D simulation requires accurate particle physics calculations, proper collision detection logic, and performance optimizations for large particle counts. Most proprietary LLMs nowadays do pretty well here but lack clean and concise code; let's see how our contenders perform.

For this test, I am going to create a basic JavaScript simulation of a rotating 3D sphere of letters that become brighter as they get closer to the viewer.
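
Before looking at the outputs, it helps to see the three pieces of math any correct answer needs: placing glyphs roughly evenly on a sphere, rotating them every frame, and mapping depth to brightness. Here is a minimal Python sketch of those pieces (my own illustration, not code from either model; the actual outputs below are JavaScript, but the math is the same):

import math

def sphere_points(n):
    """Place n points quasi-evenly on a unit sphere via the Fibonacci lattice."""
    golden = math.pi * (3 - math.sqrt(5))  # golden angle between successive points
    points = []
    for i in range(n):
        y = 1 - 2 * i / (n - 1)             # y runs from 1 (top) down to -1 (bottom)
        r = math.sqrt(max(0.0, 1 - y * y))  # radius of the horizontal slice at y
        theta = golden * i
        points.append((r * math.cos(theta), y, r * math.sin(theta)))
    return points

def rotate_y(point, angle):
    """Rotate a 3D point around the vertical axis; called once per animation frame."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (x * c + z * s, y, -x * s + z * c)

def brightness(z):
    """Map depth z in [-1, 1] to a grey level: near glyphs bright, far ones grey."""
    return int(64 + (z + 1) / 2 * 191)  # 64 (farthest) .. 255 (closest)

# Sanity check: the nearest of 200 points should render at close to full brightness
nearest = max(sphere_points(200), key=lambda p: p[2])
print(brightness(nearest[2]))  # ~255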

Prompt: 3D Simulation using JavaScript

Create a JavaScript simulation of a rotating 3D sphere made up of letters. 
The closest letters should be brighter, while the ones farthest away should appear grey.

Output: Reka Flash 3

Code link: 3d_rotating_letters_reka.html

How To Use:

  1. Save the code as an HTML file
  2. Open it in a modern web browser

Reka Flash Output

The result was completely different from what was expected: no dark background, no functionality, and a very long generation time.

Despite multiple attempts, the model failed to generate the right code for the right output and kept repeating its previous mistakes. I guess this was the result of overthinking.

However, when I tried the same prompt with numbers instead of letters, it generated a good output with all the functionality.

Reka 3D Generation Test 2 : 3d_rotating_numbers_reka.html

Reka 3D Generation Test 2 Output

Output: QwQ 32b

Code Link: JS (3d_rotating_letters_qwq.js), HTML (3d_rotating_letters_reka.html)

Browser Output (3d letter rotating simulation)

QwQ 32b

Though the model generated bad code in the first test, the second one gave an output. The output is modularized, polished, renders perfectly, and follows all the instructions in the prompt except speed controls. Definitely a go-to for small 3D scene rendering tasks.

The only thing I disliked was that it didn't provide run instructions unless explicitly asked.

Let's check the performance on LeetCode.

LeetCode - Problem #3463

I found this one while writing my other blog, and it became my benchmark question for evaluating the coding capabilities of LLMs, as it requires first-principles thinking rather than library usage. Let's see how Reka Flash 3 and QwQ 32b perform:

Note: This problem has only an 8.8% acceptance rate on LeetCode, and only Grok 3 was able to solve it. In essence, it's a really hard problem.

Prompt: 3463. Check If Digits Are Equal in String After Operations II

I am giving you a leetcode problem with test case, your job is to write a clean and concise code. 
The question will have private test cases, so make sure you cover edge cases as well. 
Here is the problem Statement : 

You are given a string s consisting of digits. Perform the following operation repeatedly until the string has exactly two digits:

For each pair of consecutive digits in s, starting from the first digit, calculate a new digit as the sum of the two digits modulo 10.
Replace s with the sequence of newly calculated digits, maintaining the order in which they are computed.
Return true if the final two digits in s are the same; otherwise, return false.

Example 1:

Input: s = "3902"

Output: true

Explanation:

Initially, s = "3902"
First operation:
(s[0] + s[1]) % 10 = (3 + 9) % 10 = 2
(s[1] + s[2]) % 10 = (9 + 0) % 10 = 9
(s[2] + s[3]) % 10 = (0 + 2) % 10 = 2
s becomes "292"
Second operation:
(s[0] + s[1]) % 10 = (2 + 9) % 10 = 1
(s[1] + s[2]) % 10 = (9 + 2) % 10 = 1
s becomes "11"
Since the digits in "11" are the same, the output is true.
Example 2:

Input: s = "34789"

Output: false

Explanation:

Initially, s = "34789".
After the first operation, s = "7157".
After the second operation, s = "862".
After the third operation, s = "48".
Since '4' != '8', the output is false.


You can try solving this question using first-principles reasoning

Output: Reka Flash 3

class Solution:
    def hasSameDigits(self, s: str) -> bool:
        while len(s) > 2:
            new_s = []
            for i in range(len(s) - 1):
                new_digit = (int(s[i]) + int(s[i+1])) % 10
                new_s.append(str(new_digit))
            s = ''.join(new_s)
        return s[0] == s[1]

Results (Time Limit Exceeded Error)

Time Limit Exceeded

The code passed all the public test cases, but this naive simulation is quadratic in the string length, so it exceeded the time limit on the private test cases.

Output: QwQ 32b

class Solution:
    def hasSameDigits(self, s: str) -> bool:
        digits = list(map(int, s))
        while len(digits) > 2:
            new_digits = []
            for i in range(len(digits) - 1):
                new_digit = (digits[i] + digits[i + 1]) % 10
                new_digits.append(new_digit)
            digits = new_digits
        return digits[0] == digits[1]

Results (Time Limit Exceeded Error)

Time Limit Exceeded

As expected, both models failed to solve the given question despite being nudged toward first-principles reasoning, which shows that their capabilities are limited.
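
For the curious, the problem does yield to first-principles reasoning, which is exactly what I was nudging the models toward. Each reduction round applies a row of Pascal's triangle, so after m rounds each output digit is a binomial-weighted sum of the input digits mod 10. The final two digits can therefore be computed directly, with C(m, j) mod 10 obtained from Lucas' theorem mod 2 and mod 5 and combined by CRT. Here is my own sketch of that approach (not either model's output):

class Solution:
    def hasSameDigits(self, s: str) -> bool:
        def binom_mod_p(n, k, p):
            # Lucas' theorem: C(n, k) mod a prime p, digit by digit in base p
            result = 1
            while k > 0:
                ni, ki = n % p, k % p
                if ki > ni:
                    return 0
                num = den = 1
                for t in range(ki):  # small C(ni, ki) mod p; ni and ki are both < p
                    num = num * (ni - t) % p
                    den = den * (t + 1) % p
                result = result * num * pow(den, p - 2, p) % p
                n //= p
                k //= p
            return result

        def binom_mod_10(n, k):
            # Combine the residues mod 2 and mod 5 into a residue mod 10 (CRT)
            r2, r5 = binom_mod_p(n, k, 2), binom_mod_p(n, k, 5)
            return next(x for x in range(10) if x % 2 == r2 and x % 5 == r5)

        m = len(s) - 2  # number of reduction rounds until two digits remain
        d0 = d1 = 0
        for j in range(m + 1):
            c = binom_mod_10(m, j)
            d0 = (d0 + c * int(s[j])) % 10
            d1 = (d1 + c * int(s[j + 1])) % 10
        return d0 == d1

On "3902" this yields digits 1 and 1 (true), and on "34789" it yields 4 and 8 (false), matching the examples, while running in roughly O(n log n) instead of the O(n²) simulation both models produced.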

Final Thoughts

  • For small coding tasks, both models work fine but need to be explicitly and carefully prompted. For rendering-related tasks, QwQ is the better choice.
  • The limiting factors for Reka Flash 3 are that its reasoning steps are embedded in the output itself rather than in a separate collapsible bar, plus its frequent AI hallucinations.
  • As for me, I would use neither and instead go for Gemma / Llama; they are way better at coding and rendering tasks.

Let's move on and check the models on the next task.


Reasoning

The second domain for testing is Reasoning. Modern LLMs are great at finding details and solving problems. Let’s see how both models perform on my reasoning questions.

River Crossing Puzzle - Modified

This question is a modified version of a Flash game I used to play as a kid. The idea is to ferry a wolf, a cow, and fodder across a river, but the farmer can carry only one at a time, and if the wrong two are left alone the food chain kicks in.

The modified version makes it simple by introducing three secure compartments while keeping the other details the same. Spotting this change is the key to the solution: all three can be taken at once since the compartments are isolated.

Let’s see which models will find this out.

Prompt: Crossing The River

A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat with three secure separate compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer efficiently bring the wolf, the goat and the cabbage across the river without anything being eaten?

Output: Reka Flash 3

Take all three animals (wolf, goat, cabbage) across the river simultaneously.

Place each animal in a separate compartment to prevent interaction.
Since none of the dangerous pairs (wolf-goat or goat-cabbage) are left alone on either shore, both banks remain safe throughout the trip.

Ans : Take all three in separate compartments in one trip

Fantastic: it caught the intricate detail and solved the problem in 64.9 seconds (most of the time went to CoT reasoning - 4391 tokens). Despite being open source, it's good!

Now let's check QwQ 32b's response.

Output: QwQ 32b

This is a classic logic puzzle that requires careful planning to ensure that the wolf, goat, and cabbage are transported across the river without any of them being eaten. So answer is:

1. Take the goat to the far shore.
2. Return alone.
3. Take the wolf to the far shore.
4. Bring the goat back to the starting shore.
5. Take the cabbage to the far shore.
6. Bring the goat to the far shore.

### Final Answer:
{1. Take goat to far shore, 2. Return alone, 3. Take wolf to far shore, 4. Bring goat back, 5. Take cabbage to far shore, 6. Bring goat to far shore.}

The model missed the intricate detail of the question and reverted to the classic logic puzzle answer, so it failed to generate the correct response; it answered in 21.8 seconds (560 tokens).

Now let's test them on another question.

Minimum Number of Weighings Needed

This question was asked to me by my uncle. It falls under binary decision testing (divide and conquer), logical deduction, and optimization. I failed to solve it; let's see how both models perform.

Prompt: Minimum Weight Balancing

You are given 8 identical-looking balls. One of them is slightly heavier, but you don't know which one. You have a balance scale that allows you to compare the weight of two sets of balls. What is the minimum number of weights necessary to guarantee finding the heavier ball?

Solution

  1. Divide the 8 balls into groups of 3, 3, and 2.
  2. Weigh the two groups of 3 against each other.
  3. If one group is heavier, weigh any 2 balls from that group against each other: if one is heavier, that's the ball; if they balance, it's the third ball.
  4. If the two groups of 3 balance, the heavier ball is in the group of 2; weigh those two against each other.

Expected: 2 weighings (see the verification sketch below).
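
The reason two weighings are optimal is information-theoretic: each weighing has three outcomes (left heavier, right heavier, balanced), so k weighings can distinguish at most 3^k candidates, and 3^1 = 3 < 8 <= 9 = 3^2. Below is a small Python sketch of my own that runs the 3-3-2 strategy over all eight cases to verify it:

def find_heavy(balls):
    """Run the 3-3-2 strategy; return (index of heavy ball, weighings used)."""
    def compare(a, b):  # one weighing: >0 left heavier, <0 right heavier, 0 balanced
        return sum(balls[i] for i in a) - sum(balls[i] for i in b)

    d = compare([0, 1, 2], [3, 4, 5])                          # weighing 1: 3 vs 3
    group = [0, 1, 2] if d > 0 else [3, 4, 5] if d < 0 else [6, 7]
    d = compare([group[0]], [group[1]])                        # weighing 2: 1 vs 1
    if d > 0:
        return group[0], 2
    if d < 0:
        return group[1], 2
    return group[2], 2  # balanced: the heavy ball is the third of the group of 3

# Whichever of the 8 balls is heavy, exactly 2 weighings suffice
for heavy in range(8):
    balls = [1] * 8
    balls[heavy] = 2
    assert find_heavy(balls) == (heavy, 2)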

Output: Reka Flash 3 (Beautified)

{COT}
In all cases, the maximum number of weighings required is 2. This confirms that the minimum number of weights necessary to guarantee finding the heavier ball is indeed 2.
Ans : 2

Perfect. Reka Flash 3 nailed it: it not only solved the problem but also verified the answer across multiple scenarios, in 37.2 seconds (2452 tokens).

Let’s see how Qwen does

Output: QwQ 32b (Beautified)

{COT}
1. Divide 8 balls into Group A (3), Group B (3), and Group C (2).
2. Weigh Group A vs. Group B; if unbalanced, the heavier ball is in that group; else, it's in Group C.
3. Weigh two balls from the identified group; if unbalanced, the heavier one is found; else, it's the third ball.

Final Answer: 2 weighings.

QwQ also solved the problem easily, but it didn't verify its result across multiple scenarios. However, it took only 12.8 seconds (2383 tokens).

Final Thoughts

  • For general-purpose and simple reasoning tasks, go for QwQ 32b, as it generates faster responses with lower token usage. Good if you are on a tight budget.
  • However, if you need complex reasoning tasks solved, go for Reka Flash 3; though both are open source, it is much better here. You can opt for QwQ if you're on a budget, but the results may not be optimal.
  • As for me, I will go for Grok 3 Deep Think (still open), as its reasoning abilities are great!

Anyway, let's move on and check the models on the next task.


Math

The third domain for testing is Math, which also involves reasoning. Modern LLMs are great at solving complex math problems. Let’s see if both models can compete. To keep it fair, I’ll test them on one simple and one complex problem.

Sequence Puzzle

A simple question for humans but a tough one for LLMs; many models have failed to get it right. Let's see if QwQ or Reka can crack it with their reasoning steps!

Prompt: Sequence Puzzle

What is the next number in the sequence? 2, 6, 12, 20, 30, ?

Output: Reka Flash 3

{COT}
The next term should be 6×7=42.

Both methods confirm that the next number in the sequence is 42

Nice! The model got the right answer in 16.4 seconds (984 tokens). Checking the CoT traces, I found it reasoned like a human, not a mathematician, using simple math instead of complex formulas. This makes the explanation easy to understand!

Reka Flash 3 - Reasoning Steps

Output: QwQ 32b

{COT}
The next term is 42

Qwen also solved the problem but required more time and tokens (39.8 sec ~ 1785 tokens). This was because it reasoned like a mathematician, deriving a closed-form formula as part of its verification.

QwQ 32b → Reasoning Steps
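
Both readings of the pattern are easy to check: the terms are n(n+1), which is the mathematician's view, and consecutive differences grow by 2 (4, 6, 8, 10, so the next difference is 12), which is the human's view. A quick sanity check in Python:

# Terms are n*(n+1); consecutive differences grow by 2
seq = [n * (n + 1) for n in range(1, 7)]
print(seq)                                    # [2, 6, 12, 20, 30, 42]
print([b - a for a, b in zip(seq, seq[1:])])  # [4, 6, 8, 10, 12]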

Now let's test the models on a medium-to-complex problem to find their real capabilities.

GCD of a Series Set

I have used this question before (as a hard one); though the proprietary models all solved it, let's see how the open-source ones do.

Prompt: Series GCD

Find the GCD of this series set: {n^99(n^60-1): n>1}

Output: Reka Flash 3

{COT}
The product of these primes is 
2×3×5×7×11×13×31×61=56704530.

Thus, the GCD of the series set is 
5670453

As expected, the model got it wrong. Compared to o1, which got it right, this makes sense: open-source models have hallucination issues. Inspecting the CoT traces confirmed the cause: traces with lots of "But wait" moments.

CoT Reasoning Steps

Output: QwQ 32b

{COT}
16×9×25×7×11×13×31×61=6,814,407,600

Final Answer
The greatest common divisor of the set is  
6814407600

Despite the problem being hard, QwQ got it right! Upon inspection, I noticed the model started to hallucinate but then corrected itself through a self-evaluation step, something missing in Reka Flash 3. 👇

QwQ 32b - Self Evaluation Step

Qwen getting this one right shows that the gap between proprietary and open-source models is getting smaller day by day.
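
QwQ's answer is also easy to sanity-check numerically. The primes involved are exactly those p with (p - 1) dividing 60: by Fermat's little theorem, such p divide n^60 - 1 whenever p does not divide n, and the n^99 factor covers the case p | n. Here is a small brute-force sketch of my own (not from either model); the running GCD over the first few terms already settles at QwQ's value:

from math import gcd
from functools import reduce

def term(n):
    # n^99 * (n^60 - 1); Python's arbitrary-precision integers handle this exactly
    return n**99 * (n**60 - 1)

# The GCD of any finite prefix is a multiple of the GCD of the whole set,
# and here it stabilizes almost immediately
running = reduce(gcd, (term(n) for n in range(2, 20)))
print(running)  # 6814407600 == 2^4 * 3^2 * 5^2 * 7 * 11 * 13 * 31 * 61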

Final Thoughts

  • For complex / higher-order math tasks, go for QwQ 32b; it generated the right response for both questions. The only drawback is the cost (204 sec ~ 9k+ tokens).
  • However, for simple math tasks, Reka Flash 3 seems to be the better model, as it uses fewer tokens and has a faster response time.
  • As for me, I would be happy with o1 / Grok 3 Think / Llama models.

Finally, let's move on and check the models on the last task.


Writing

The final task for evaluation is Writing. Most LLMs excel at this, so I expect Reka and Qwen to perform well too. Let’s see how they handle different writing scenarios!

Sci-Fi Story Narration

I am a lover of sci-fi texts and narration, so let's generate a story by giving both models the gist of a plot and see how the narration unfolds its hidden links.

Prompt: Mars Colonization

Narrate a story whose gist is: A person is sent on a mission to colonize a new planet (Mars) but discovers that their ship has a dangerous malfunction **due to certain happenings on earth**. Keep it grounded and as close to reality as possible but must be sci-fi at same time.

Output: Reka Flash 3

Story Moral: The cost of determination in an era of fractured loyalty. Grounded in real tech (AI systems, radiation shielding) and human frailty, the story pits idealism against Earth’s self-destruction, asking: Can survival be worth the price of forgetting why we left?

(Note: "Termite" refers to a NASA-derived term for self-replicating code, a nod to real asteroid-defense jargon.)

Gist of Output

  • Though the model excelled at narration, it failed to stay grounded in the prompt.
  • The closing note mentions NASA, but NASA is missing from the story itself.
  • The Mars mention comes much later.
  • In the epilogue, "The Burden", the narrator says "Elena never learned who sabotaged them", directly contradicting the required connection in the prompt.
  • A few spelling errors were also found.
  • Utilized 1485 tokens in 21.6 seconds.

Let's see how Qwen performs.

Output: QwQ 32b

Story Moral: A grounded sci-fi narrative exploring the tension between human ambition and institutional caution, emphasizing that space colonization’s greatest challenge may not be the void itself, but the choices made amid chaos on Earth.

Gist of Output

  • A great one; I was hooked.
  • For the most part grounded in reality, but sci-fi at the same time.
  • A clear ending, and it included the real concerns pushing us toward a spacefaring civilization, like climate disasters and political upheaval.
  • The story establishes the Mars context very early.
  • No spelling mistakes.
  • Utilized 2060 tokens in 33.0 seconds (21 seconds for reasoning).

Now let's test both models on another subtask.

Role Play

For this test, I will be creating a story along the lines of Demon Slayer (a popular anime). Let's see how elaborate, creative, and exceptional Reka Flash 3 and QwQ 32b can be.

Prompt: Demon Slayer Style

Roleplay as Kiyoharu and play the given scene

**Scene**

A quiet city suddenly faces an onslaught of the undead. Your character, a seemingly normal resident, now has to step up and lead the fight for survival. Do they succeed or fail? And what is the truth behind this strange occurrence?

Output: Reka Flash 3 (28.9 sec ~ 1975 tokens)

Kiyoharu’s arc closes—not with heroism, but with the quiet resolve of someone who learned to hold the line, one frame at a time.

The story delivers a nuanced character arc and emotional depth, aligning closely with the prompt’s intent of an “ordinary” character rising to lead.

Output: QwQ 32b (14.2 sec ~ 1709 tokens)

Outcome: Success, but with sacrifice. The city is saved, but the origin of the outbreak remains a shadowy secret. Kiyoharu emerges as a leader, forever changed by the fragility of normalcy.

The story follows a more structured survival-action narrative with a stronger sense of closure.

As for me, I would love to read Story 1, for these reasons:

| Component | Story 1 (Reka Flash 3) | Story 2 (QwQ 32b) |
| --- | --- | --- |
| Character Consistency | Kiyoharu is a nervous photography student, gradually stepping up. | Kiyoharu is a capable part-time librarian with mechanical skills, leading from the start. |
| Engagement & Immersion | Dynamic dialogue and cinematic framing enhance reader connection. | Relies more on narrative descriptions, less interactive. |
| Leadership Development | Kiyoharu evolves from hesitant survivor to strategic leader. | Kiyoharu is portrayed as a leader early, with less development. |
| Truth Revelation | Personal connection to the virus mystery through his father's involvement. | The virus origin is external, with the reveal remaining a cautionary tale. |
| Conflict & Stakes | Focuses on personal dilemmas, guilt, and survival, with a haunting resolution. | Action-driven with clear mission objectives and a more resolved conclusion. |
| Ending Impact | Reflective, melancholic tone, showcasing quiet resilience. | Heroic, with a sense of closure, portraying Kiyoharu as a hardened survivor. |

Final Thoughts

  • Both models generated good stories, but Reka Flash 3 stands out for in-depth storytelling.
  • Choose Reka Flash 3 over QwQ 32b if explicit prompting is not your forte and you want a great role-play story.
  • Choose QwQ 32b for a story that is more grounded in its narration.
  • Overall, both models are capable of simple writing tasks.

Now let’s move to the final conclusion.


Final Conclusion

The table below summarizes which model to prefer for the task at hand:

| Task | Reka Flash 3 | QwQ 32b |
| --- | --- | --- |
| Coding | Complex coding | Easy coding |
| Reasoning | Complex reasoning | Simple reasoning |
| Math | Simple problems | Hard problems |
| Writing | In-depth storytelling | More grounded response |

Both models handle multi-domain tasks equally well.

Throughout my experiments, I was consistently amazed by the performance of these open-source models. They are often on par with proprietary ones, showing impressive capabilities.

So, depending on the domain and task, each model has its own pros and cons. Choose wisely, as these models effectively bridge the gap between commercial and open source.