This is a Plain English Papers summary of a research paper called AI Learns General Reasoning: Nemotron-CrossThink Beats Math-Only LLMs. If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.

Breaking the Math Barrier: How Nemotron-CrossThink Extends LLM Self-Learning to General Reasoning

Large Language Models (LLMs) have shown impressive reasoning capabilities when enhanced through Reinforcement Learning (RL). While applying RL to math reasoning has been successful due to well-defined rules and verifiable answers, extending these techniques to broader domains presents significant challenges. Diverse reasoning tasks demand different cognitive approaches and reward structures, making it difficult to design effective RL training that generalizes beyond mathematics.

Figure 1: Employing self-learning with multi-domain data, Nemotron-CrossThink outperforms baselines, including domain-specific training (Only Math) and Open-Reasoner-Zero (ORZ-7B), achieving consistent gains across diverse reasoning benchmarks.

In this research, NVIDIA and Carnegie Mellon researchers introduce Nemotron-CrossThink, a framework that systematically incorporates multi-domain corpora into RL training to improve generalization across diverse reasoning tasks. The approach addresses key challenges by incorporating data from varied domains, applying structured templates to control answer-space complexity, filtering for verifiable answers, and optimizing data blending strategies.

The Nemotron-CrossThink Framework

Nemotron-CrossThink offers a systematic way to incorporate multi-domain data for self-learning that results in better generalization across diverse tasks:

Figure 2: The Nemotron-CrossThink pipeline. (a) Curate QA pairs from synthetic (Common Crawl) and open-source datasets, categorized into general-purpose reasoning ($\mathcal{D}_{gpr}$) and mathematical reasoning ($\mathcal{D}_{mr}$); (b) apply structured templates to convert data into multiple-choice (MCQ) and open-ended formats, promoting diverse reasoning trajectories; (c) filter out unverifiable or ill-formatted responses; (d) train an RL policy using Group Relative Policy Optimization (GRPO), using the final reward to iteratively update the model's reasoning capabilities across diverse domains.

The framework consists of four main components:

  1. Data Curation: Collecting diverse data from synthetic sources (Common Crawl) and open-source QA datasets spanning STEM, humanities, law, and social sciences.

  2. Template Application: Converting questions into multiple-choice (MCQ) and open-ended formats to promote diverse reasoning trajectories.

  3. Data Filtering: Removing samples that can't be verified with rule-based rewards.

  4. Data Blending: Optimizing the mix of data from different domains and formats.

This process culminates in reinforcement learning using Group Relative Policy Optimization (GRPO) to refine reasoning capabilities across diverse domains.
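To make the GRPO step more concrete, the sketch below pairs a simple rule-based exact-match reward with the group-relative advantage computation that gives the algorithm its name: several responses are sampled per prompt, and each response's reward is normalized by the mean and standard deviation of its group. The function names and the toy reward are illustrative assumptions, not the paper's exact implementation.

```python
from statistics import mean, pstdev

def exact_match_reward(predicted: str, gold: str) -> float:
    """Rule-based reward: 1.0 if the extracted answer matches the gold answer, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and standard deviation of its group (one group per prompt)."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Example: four responses sampled for one prompt, two of which verify as correct.
rewards = [exact_match_reward(ans, "42") for ans in ["42", "41", "42", "unsure"]]
print(grpo_advantages(rewards))  # positive advantages for correct responses, negative otherwise
```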

Building a Diverse Training Dataset

The researchers curated a dataset combining both synthetically generated data and open-source QA pairs, categorized into general-purpose reasoning (GPR) and mathematical reasoning (MR):

| Data Source       | Category | Type | Samples |
|-------------------|----------|------|---------|
| MMLU [Train]      | GPR      | MCQ  | 99,842  |
| Syn-QA            | GPR      | MCQ  | 192,950 |
| Natural Reasoning | GPR      | OE   | 100,000 |
| NuminaMath        | MR       | OE   | 87,350  |
| PersonaSkill-MATH | MR       | OE   | 100,000 |
| MATH              | MR       | OE   | 8,523   |
| Total             |          |      | 588,645 |

Table 1: Training data distribution showing the mix of general-purpose and mathematical reasoning sources in multiple-choice and open-ended formats.

For general-purpose reasoning (GPR), they collected open-source datasets such as Natural Reasoning and the MMLU training set, spanning domains including STEM fields, economics, and social sciences. They also synthesized additional QA pairs from Common Crawl documents (Syn-QA).

For mathematical reasoning (MR), they incorporated established datasets like MATH and NuminaMath, along with synthetic math problems generated using the PersonaSkill method.

Controlling Answer Space Through Templates

Since different question types elicit different reasoning patterns, the researchers applied two templates to their data:

  1. Multiple-Choice Questions (MCQ): Presenting fixed options for selection.
  2. Open-Ended Questions: Requiring the model to generate answers without options.

This templating approach allows for verifiable rewards in non-deterministic domains by limiting the answer space, making it feasible to apply reinforcement learning beyond math.
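To illustrate how the two templates shape the answer space, here is a minimal sketch of converting a raw QA pair into each format. The prompt wording and field names are assumptions for illustration, not the paper's exact templates.

```python
def to_mcq(question: str, options: list[str], answer_idx: int) -> dict:
    """Render a QA pair as a multiple-choice prompt whose gold answer is a single option letter."""
    letters = "ABCDEFGH"
    option_lines = [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    prompt = f"{question}\n" + "\n".join(option_lines) + "\nAnswer with the option letter only."
    return {"prompt": prompt, "gold": letters[answer_idx]}

def to_open_ended(question: str, answer: str) -> dict:
    """Render a QA pair as an open-ended prompt whose gold answer is the free-form reference."""
    prompt = f"{question}\nGive only the final answer."
    return {"prompt": prompt, "gold": answer.strip()}

example = to_mcq("Which gas do plants absorb during photosynthesis?",
                 ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"], answer_idx=1)
print(example["gold"])  # "B" -- a small, easily verifiable answer space
```

Constraining the gold answer to an option letter (or a short final answer) is what makes the reward rule-checkable outside of mathematics.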

Strategic Data Blending for Optimal Performance

After gathering diverse datasets, the researchers explored different blending strategies to understand how domain and format diversity affect model performance:

| Category        | Blend Name                      | Symbol       | Blend Description |
|-----------------|---------------------------------|--------------|-------------------|
| Data Source     | Natural Distribution            | $B_{nd}$     | Each dataset weighted by its share of the total number of samples across all datasets |
| Data Source     | More Math                       | $B_{mr↑}$    | 2:1 ratio of $\mathcal{D}_{mr}$ to $\mathcal{D}_{gpr}$ |
| Data Source     | More General-Purpose Reasoning  | $B_{gpr↑}$   | 2:1 ratio of $\mathcal{D}_{gpr}$ to $\mathcal{D}_{mr}$ |
| Question Types  | More MCQ                        | $B_{mcq↑}$   | 2:1 ratio of $\mathcal{D}_{mcq}$ to $\mathcal{D}_{open}$ |
| Question Types  | More Open-Ended                 | $B_{open↑}$  | 2:1 ratio of $\mathcal{D}_{open}$ to $\mathcal{D}_{mcq}$ |
| Data Usefulness | Avg. Score                      | $B_{score}$  | Each source weighted by its average benchmark performance |

Table 2: Overview of data blending strategies, categorized by data source, question type, and data usefulness.

These blending strategies were designed to test hypotheses about how domain diversity and question format affect reinforcement learning outcomes. For comparison, the researchers also created two single-source blends: one using only math data ($B_{only\_mr}$) and one using only general-purpose reasoning data ($B_{only\_gpr}$).
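As a rough illustration of how a 2:1 blend such as $B_{gpr↑}$ could be realized, the sketch below turns a domain ratio into per-dataset sampling weights. The dataset names, sizes, and weighting scheme here are placeholders, not the paper's exact recipe.

```python
def blend_weights(sizes: dict[str, int],
                  domain: dict[str, str],
                  domain_ratio: dict[str, float]) -> dict[str, float]:
    """Per-dataset sampling weights such that each domain's total probability mass
    follows the requested ratio (e.g. gpr:mr = 2:1), with the mass split
    proportionally to dataset size inside each domain."""
    domain_totals: dict[str, int] = {}
    for name, n in sizes.items():
        domain_totals[domain[name]] = domain_totals.get(domain[name], 0) + n
    ratio_sum = sum(domain_ratio.values())
    return {
        name: (domain_ratio[domain[name]] / ratio_sum) * (n / domain_totals[domain[name]])
        for name, n in sizes.items()
    }

# Placeholder sizes; 2:1 in favor of general-purpose reasoning, as in B_gpr↑.
weights = blend_weights(
    sizes={"syn_qa": 190_000, "natural_reasoning": 100_000, "numina_math": 87_000},
    domain={"syn_qa": "gpr", "natural_reasoning": "gpr", "numina_math": "mr"},
    domain_ratio={"gpr": 2.0, "mr": 1.0},
)
print(weights)  # probability of drawing the next training example from each dataset (sums to 1.0)
```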

Experimental Setup

The researchers used Qwen2.5-7B and Qwen2.5-32B as their baseline models, applying GRPO training using the veRL framework. They evaluated model performance on diverse benchmarks including:

  • Math benchmarks: MATH-500, AMC23
  • General-purpose reasoning benchmarks: MMLU, MMLU-PRO, AGIEVAL, GPQA-DIAMOND, and SUPERGPQA

This comprehensive evaluation ensured the results generalized across various reasoning domains, not just mathematical ones.

Individual Dataset Impact

To understand which datasets contribute most to reasoning performance, the researchers first trained models on individual datasets:

| Data Source             | MMLU  | MMLU-PRO | GPQA-DIAMOND | AGIEVAL | SUPERGPQA | MATH-500 | AMC23 | Avg   |
|-------------------------|-------|----------|--------------|---------|-----------|----------|-------|-------|
| $\mathcal{M}$ (baseline)| 74.20 | 45.00    | 31.82        | 48.59   | 25.36     | 48.30    | 40.00 | 44.75 |
| MMLU [Train]            | 69.76 | 38.50    | 32.83        | 47.66   | 27.69     | 22.00    | 5.00  | 34.78 |
| Syn-QA                  | 70.45 | 52.41    | 30.81        | 52.10   | 24.57     | 54.20    | 35.00 | 45.65 |
| Natural Reasoning       | 68.89 | 31.33    | 33.33        | 46.65   | 22.44     | 68.60    | 42.50 | 44.82 |
| NuminaMath              | 72.94 | 52.05    | 33.84        | 54.39   | 26.97     | 76.20    | 55.00 | 53.06 |
| PersonaSkill-Math       | 53.99 | 28.08    | 18.69        | 45.69   | 16.92     | 77.20    | 50.00 | 41.51 |
| MATH                    | 63.30 | 31.64    | 21.72        | 51.95   | 18.31     | 78.40    | 50.00 | 45.04 |

Table 3: Results of self-learning on individual datasets, showing their varying strengths across different benchmarks.

The analysis revealed several insights:

  • NuminaMath achieved the highest overall average, outperforming the baseline by over 8.3%, with strong performance on both math and general-purpose reasoning tasks.
  • Syn-QA showed modest improvement over the baseline with stronger accuracy in MMLU-PRO and AGIEVAL.
  • MMLU [Train] performed best on SUPERGPQA, suggesting it captures broad conceptual knowledge useful for long-tail domains.
  • Natural Reasoning showed surprisingly good math performance despite being a GPR dataset.

These results informed the data blending strategy for maximum effectiveness.

Performance Across Blending Strategies

Next, the researchers compared the performance of different blending strategies:

| Model                    | Category        | Blend            | MMLU  | MMLU-PRO | GPQA-DIAMOND | AGIEVAL | SUPERGPQA | MATH-500 | AMC23 | Avg   |
|--------------------------|-----------------|------------------|-------|----------|--------------|---------|-----------|----------|-------|-------|
| $\mathcal{M}$ (baseline) |                 |                  | 74.20 | 45.00    | 31.82        | 48.59   | 25.36     | 48.30    | 40.00 | 44.75 |
| ORZ                      |                 |                  | 73.20 | 48.90    | 29.30        | 63.49   | 27.60     | 81.40    | 62.50 | 55.20 |
| Nemotron-CrossThink-7B   | Data Source     | $B_{nd}$         | 73.18 | 54.81    | 38.07        | 59.99   | 26.54     | 77.00    | 60.00 | 55.66 |
|                          |                 | $B_{mr↑}$        | 74.85 | 55.51    | 40.10        | 61.47   | 26.81     | 77.80    | 67.50 | 57.72 |
|                          |                 | $B_{gpr↑}$       | 74.94 | 57.82    | 38.58        | 63.71   | 29.16     | 77.60    | 65.00 | 58.12 |
|                          | Question Types  | $B_{mcq↑}$       | 74.26 | 55.77    | 39.59        | 62.54   | 28.05     | 78.00    | 60.00 | 56.89 |
|                          |                 | $B_{open↑}$      | 74.46 | 55.82    | 43.15        | 61.28   | 26.82     | 78.40    | 62.50 | 57.49 |
|                          | Data Usefulness | $B_{score}$      | 74.70 | 56.16    | 40.10        | 59.80   | 27.37     | 78.00    | 62.50 | 56.95 |
|                          | Single Source   | $B_{only\_mr}$   | 74.24 | 54.26    | 38.58        | 61.39   | 27.69     | 78.60    | 70.00 | 57.82 |
|                          |                 | $B_{only\_gpr}$  | 72.77 | 52.06    | 37.06        | 56.56   | 27.44     | 72.20    | 55.00 | 53.30 |

Table 4: Results showing Nemotron-CrossThink-7B performance across different blending strategies. The multi-domain blend $B_{gpr↑}$ achieves the highest overall average.

Key findings include:

  • All blending strategies significantly outperformed the base model.
  • $B_{gpr↑}$ (emphasizing general-purpose reasoning) achieved the highest overall average (58.12% vs. 55.20% for Open-Reasoner-Zero (ORZ), roughly 3 points or about 5% relative).
  • While $B_{only\_mr}$ (math-only training) performed slightly better on math-specific tasks, it lagged behind on general reasoning benchmarks.
  • $B_{open↑}$ (emphasizing open-ended questions) outperformed $B_{mcq↑}$ (emphasizing multiple-choice questions).

These results demonstrate that integrating diverse reasoning domains in RL leads to better overall performance than domain-specific training.

Token Efficiency: Doing More with Less

Beyond accuracy, the researchers analyzed how different training strategies affect the verbosity and efficiency of model responses:

Figure 3: Token-efficiency comparison of models trained on the multi-domain blend ($B_{gpr↑}$) and two single-domain baselines ($B_{only\_mr}$ and ORZ), showing that multi-domain training leads to more concise correct responses than math-only training.

On general-purpose reasoning benchmarks, models trained with the multi-domain blend ($B_{gpr↑}$) produced correct answers with significantly fewer tokens than math-only models. For instance, on MMLU, $B_{gpr↑}$ used an average of 229 tokens for correct responses, compared to 351 for $B_{only\_mr}$.

Interestingly, on math benchmarks, $B_{gpr↑}$ appropriately generated longer responses, adapting its verbosity to the task requirements. This adaptive behavior highlights a key strength of multi-domain training: it equips the model with the flexibility to tailor its response style to the nature of the task.

Across all tasks, the multi-domain model used 28% fewer tokens on average for correct responses than the math-only model, demonstrating more efficient reasoning.

Figure 4: Average token lengths of correct and incorrect responses across general-purpose and math reasoning tasks for models trained on $B_{gpr↑}$, $B_{only\_mr}$, and ORZ. Incorrect responses are consistently longer than correct ones, suggesting verbose reasoning does not guarantee correctness.

The analysis also revealed that incorrect responses are substantially longer than correct ones—by 3.6× on average. This suggests that verbose reasoning often reflects model uncertainty or repetitive thinking rather than productive deduction.
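A minimal sketch of this kind of length analysis over evaluation logs is shown below; the record fields (`tokens`, `correct`) are assumptions about the log format rather than the paper's tooling.

```python
def avg_token_lengths(records: list[dict]) -> dict[str, float]:
    """Average response length in tokens, split by whether the response was correct.
    Each record is assumed to carry 'tokens' (int) and 'correct' (bool)."""
    buckets: dict[str, list[int]] = {"correct": [], "incorrect": []}
    for rec in records:
        buckets["correct" if rec["correct"] else "incorrect"].append(rec["tokens"])
    return {label: sum(lengths) / len(lengths) for label, lengths in buckets.items() if lengths}

logs = [{"tokens": 220, "correct": True},
        {"tokens": 900, "correct": False},
        {"tokens": 240, "correct": True}]
print(avg_token_lengths(logs))  # {'correct': 230.0, 'incorrect': 900.0}
```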

Data Format Study: Question and Answer Templates

The researchers investigated how question and answer formats affect model performance:

| Question Type    | MMLU  | MMLU-PRO | GPQA-DIAMOND | AGIEVAL | SUPERGPQA | MATH-500 | AMC23 | Avg   |
|------------------|-------|----------|--------------|---------|-----------|----------|-------|-------|
| MCQ + Open-Ended | 73.18 | 54.81    | 38.07        | 59.99   | 26.54     | 77.00    | 60.00 | 55.66 |
| Open-Ended       | 74.61 | 54.36    | 39.09        | 59.30   | 29.16     | 76.60    | 65.00 | 56.87 |

Table 5: Impact of Question Format. Converting all questions to open-ended format improves accuracy across benchmarks, reducing reliance on option guessing and encouraging deeper reasoning.

The open-ended-only configuration outperformed the mixed-format setting across most benchmarks, achieving a 1.21% higher average score. This suggests that removing multiple-choice options forces the model to rely more on reasoning rather than potentially guessing from options.

For answer formats in multiple-choice questions, they compared short-form (option label only, e.g., "A") versus long-form answers (option label with description, e.g., "(A) The sky is blue"):

| Answer Type | MMLU  | MMLU-PRO | GPQA-DIAMOND | AGIEVAL | SUPERGPQA | MATH-500 | AMC23 | Avg   |
|-------------|-------|----------|--------------|---------|-----------|----------|-------|-------|
| Long        | 72.77 | 52.06    | 37.06        | 56.56   | 27.44     | 72.20    | 55.00 | 53.30 |
| Short       | 74.22 | 54.56    | 39.59        | 58.01   | 28.39     | 74.20    | 52.50 | 54.50 |

Table 6: Impact of Answer Format. Using short-form answers improves accuracy by reducing output ambiguity and avoiding penalization from rigid reward functions in rule-based training.

Short-form answers consistently outperformed long-form variants, with a 1.20% improvement in average accuracy. This suggests that reducing output complexity minimizes ambiguity and better aligns with rule-based reward signals.

Difficulty-Based Filtering Enhances Performance

The researchers explored a simple approach to estimate question difficulty for datasets without explicit difficulty labels. They labeled questions as "difficult" if they were answered incorrectly by a smaller model (Qwen-2.5-7B) in a zero-shot setting:

| Model                    | Blend           | MMLU  | MMLU-PRO | GPQA-DIAMOND | AGIEVAL | SUPERGPQA | MATH-500 | AMC23 | Avg   |
|--------------------------|-----------------|-------|----------|--------------|---------|-----------|----------|-------|-------|
| Qwen-2.5-32B             |                 | 83.30 | 55.10    | 40.40        | 62.77   | 33.16     | 60.55    | 45.00 | 54.33 |
| Nemotron-CrossThink-32B  | $B_{gpr↑}$      | 83.57 | 68.83    | 46.70        | 73.90   | 37.99     | 82.40    | 67.50 | 65.84 |
|                          | $B_{f(gpr)↑}$   | 83.60 | 69.43    | 49.75        | 75.82   | 38.34     | 84.00    | 75.00 | 67.99 |

Table 7: Difficulty-Based Filtering results showing that training on harder examples yields consistent gains across all benchmarks.

This filtering approach resulted in consistent performance improvements across all benchmarks. While both filtered and unfiltered models outperformed the original baseline, the model trained on the filtered dataset achieved the highest accuracy on every task. On average, filtering boosted overall accuracy by 2.15%, suggesting that selectively training on challenging examples produces more robust and generalizable models.
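The difficulty heuristic itself is easy to sketch: keep only the questions the smaller reference model answers incorrectly in a zero-shot pass. In the sketch below, `answer_zero_shot` is a hypothetical wrapper around the 7B model's inference call, not an API from the paper.

```python
def filter_difficult(examples: list[dict], answer_zero_shot) -> list[dict]:
    """Keep examples the smaller reference model gets wrong zero-shot.
    `answer_zero_shot(question) -> str` is a hypothetical wrapper around the 7B model."""
    hard = []
    for ex in examples:
        prediction = answer_zero_shot(ex["question"])
        if prediction.strip().lower() != ex["gold"].strip().lower():
            hard.append(ex)  # the reference model failed, so treat the question as difficult
    return hard

# Usage (hypothetical): harder_blend = filter_difficult(gpr_examples, qwen_7b_answer)
```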

Broader Implications

Nemotron-CrossThink demonstrates that data diversity, not just volume, is key to broader reasoning capabilities in language models. By combining data from diverse domains and applying thoughtful formatting strategies, the framework enables consistent gains across both general-purpose and mathematical benchmarks.

The research also shows that properly designed RL can extend beyond mathematics to general reasoning tasks, provided that appropriate templates and filtering techniques are applied to manage answer-space complexity.

Perhaps most importantly, the models trained with Nemotron-CrossThink show adaptive behavior—generating concise answers for general-purpose questions and more detailed responses for math problems—reducing inference costs while preserving task-specific rigor.

Conclusion

Nemotron-CrossThink offers a practical framework for improving the generalization abilities of language models through reinforcement learning with multi-domain corpora. By systematically addressing the challenges of applying RL beyond mathematical reasoning, it achieves substantial improvements across diverse benchmarks.

The research demonstrates that incorporating diverse reasoning domains in reinforcement learning leads to more accurate, efficient, and generalizable language models. This paves the way for future work in self-learning systems that can reason effectively across the full spectrum of human knowledge domains.
