NVIDIA Releases Nemotron-Cascade 2: An Open 30B MoE with 3B Active Parameters, Delivering Better Reasoning and Strong Agentic Capabilities

NVIDIA has announced the release of Nemotron-Cascade 2, an open-weight 30B Mixture-of-Experts (MoE) model with 3B activated parameters. The model focuses on maximizing ‘intelligence density,’ delivering advanced reasoning capabilities at a fraction of the parameter scale used by frontier models. Nemotron-Cascade 2 is the second open-weight LLM to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals.

https://research.nvidia.com/labs/nemotron/files/Nemotron-Cascade-2.pdf

Targeted Performance and Strategic Trade-offs

The primary value proposition of Nemotron-Cascade 2 is its specialized performance in mathematical reasoning, coding, alignment, and instruction following. While it achieves state-of-the-art results in these reasoning-intensive domains, it is not a blanket win across all benchmarks.

The model excels in several targeted categories when compared to the recently released Qwen3.5-35B-A3B (February 2026) and the larger Nemotron-3-Super-120B-A12B:

Mathematical Reasoning: Outperforms Qwen3.5-35B-A3B on AIME 2025 (92.4 vs. 91.9) and HMMT Feb25 (94.6 vs. 89.0).

Coding: Leads on LiveCodeBench v6 (87.2 vs. 74.6) and IOI 2025 (439.28 vs. 348.6).

Alignment and Instruction Following: Scores significantly higher on ArenaHard v2 (83.5 vs. 65.4) and IFBench (82.9 vs. 70.2).


Technical Architecture: Cascade RL and Multi-domain On-Policy Distillation (MOPD)

The model’s reasoning capabilities stem from its post-training pipeline, starting from the Nemotron-3-Nano-30B-A3B-Base model.

1. Supervised Fine-Tuning (SFT)

During SFT, the NVIDIA research team used a meticulously curated dataset in which samples were packed into sequences of up to 256K tokens. The dataset included:

1.9M Python reasoning traces and 1.3M Python tool-calling samples for competitive coding.

816K samples for mathematical natural language proofs.

A specialized Software Engineering (SWE) blend consisting of 125K agentic and 389K agentless samples.
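The packing step mentioned above can be sketched as a simple greedy loop. This is an illustrative assumption about how packing might work, not NVIDIA's actual data pipeline; the sample lengths and the `pack_samples` helper are hypothetical.

```python
MAX_LEN = 256 * 1024  # 256K-token packing budget, per the text

def pack_samples(sample_lengths):
    """Greedily pack tokenized samples into sequences of at most MAX_LEN tokens.

    Returns a list of packs, each a list of sample indices. An oversized
    sample still gets a pack of its own rather than being dropped.
    """
    packs, current, used = [], [], 0
    for i, n in enumerate(sample_lengths):
        if used + n > MAX_LEN and current:
            packs.append(current)   # flush the full pack
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        packs.append(current)
    return packs

# Example: three samples fit together; a 120K sample starts a new pack.
packs = pack_samples([1000, 2000, 200_000, 120_000])
```

Greedy first-fit packing like this minimizes padding waste when training with very long context windows.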

2. Cascade Reinforcement Learning

Following SFT, the model underwent Cascade RL, which applies sequential, domain-wise training. This prevents catastrophic forgetting by allowing hyperparameters to be tailored to specific domains without destabilizing others. The pipeline includes stages for instruction-following (IF-RL), multi-domain RL, RLHF, long-context RL, and specialized Code and SWE RL.
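The sequential, domain-wise schedule described above can be summarized as a staged loop. The stage names come from the text; the hyperparameter values and the `train_stage` callback are illustrative assumptions, not NVIDIA's actual configuration.

```python
# Each stage carries its own hyperparameters, so tuning one domain
# does not require retuning (or destabilizing) the others.
CASCADE_STAGES = [
    ("IF-RL",           {"lr": 1e-6, "steps": 100}),  # instruction following
    ("multi-domain-RL", {"lr": 5e-7, "steps": 300}),
    ("RLHF",            {"lr": 5e-7, "steps": 200}),
    ("long-context-RL", {"lr": 2e-7, "steps": 150}),
    ("code-RL",         {"lr": 5e-7, "steps": 250}),
    ("SWE-RL",          {"lr": 5e-7, "steps": 250}),
]

def run_cascade(model, train_stage):
    """Apply each RL stage in sequence, passing the stage's own hyperparameters."""
    for name, hparams in CASCADE_STAGES:
        model = train_stage(model, name, **hparams)
    return model
```

The key design point is sequencing: later stages start from the checkpoint produced by earlier ones, rather than mixing all domains into a single RL objective.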


3. Multi-Domain On-Policy Distillation (MOPD)

A critical innovation in Nemotron-Cascade 2 is the integration of MOPD during the Cascade RL process. MOPD uses the best-performing intermediate ‘teacher’ models—already derived from the same SFT initialization—to provide a dense token-level distillation advantage. This advantage is defined mathematically as:

$$a_{t}^{\mathrm{MOPD}} = \log \pi^{\mathrm{domain}_t}(y_t \mid s_t) - \log \pi^{\mathrm{train}}(y_t \mid s_t)$$

The research team found that MOPD is substantially more sample-efficient than sequence-level reward algorithms like Group Relative Policy Optimization (GRPO). For instance, on AIME25, MOPD reached teacher-level performance (92.0) within 30 steps, while GRPO reached only 91.0 after the same number of steps.
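The advantage in the formula above is just the per-token log-probability gap between the domain teacher and the training policy. A minimal sketch, assuming plain per-token log-probabilities are available as Python floats (shapes and interface are illustrative):

```python
import math

def mopd_advantages(teacher_logprobs, student_logprobs):
    """Dense per-token MOPD advantage: log pi_teacher(y_t|s_t) - log pi_train(y_t|s_t).

    Positive where the domain teacher assigns the sampled token more
    probability than the current training policy, negative where less.
    """
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]

# Example: the teacher is more confident than the student on the second token.
adv = mopd_advantages(
    [math.log(0.9), math.log(0.8), math.log(0.5)],  # teacher log-probs
    [math.log(0.9), math.log(0.4), math.log(0.5)],  # student log-probs
)
```

Because this signal exists at every token, it gives far denser feedback than a single sequence-level reward, which is consistent with the sample-efficiency gap over GRPO reported above.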

Inference Features and Agentic Interaction

Nemotron-Cascade 2 supports two primary operating modes through its chat template:

Thinking Mode: Initiated by a single <think> token, followed by a newline. This activates deep reasoning for complex math and code tasks.

Non-Thinking Mode: Activated by prepending an empty <think></think> block for more efficient, direct responses.
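The two modes above amount to a different prefix in the assistant turn. The sketch below illustrates the idea; the surrounding role-marker strings are hypothetical, so consult the model's actual chat template in its tokenizer config before relying on exact formatting.

```python
def build_assistant_prefix(thinking: bool) -> str:
    """Return the assistant-turn prefix selecting the operating mode."""
    if thinking:
        # Thinking mode: a single <think> token followed by a newline
        # cues the model to emit a reasoning trace before the answer.
        return "<think>\n"
    # Non-thinking mode: an empty <think></think> block skips the
    # reasoning trace for a faster, direct response.
    return "<think></think>"

deep = build_assistant_prefix(thinking=True)
fast = build_assistant_prefix(thinking=False)
```

In practice this toggle lets one deployment serve both latency-sensitive chat traffic and reasoning-heavy math/code queries from the same weights.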

For agentic tasks, the model utilizes a structured tool-calling protocol within the system prompt. Available tools are listed within <tools> tags, and the model is instructed to perform tool calls wrapped in <tool_call> tags to ensure verifiable execution feedback.
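On the client side, an agent loop has to pull those `<tool_call>` blocks back out of the model's response. A minimal sketch, assuming each call is a JSON object between the tags (the payload schema shown is a common convention, not confirmed by the source):

```python
import json
import re

# Non-greedy match so multiple <tool_call> blocks in one response each parse.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def extract_tool_calls(response: str):
    """Parse every <tool_call> payload in a model response as JSON."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(response)]

calls = extract_tool_calls(
    'Let me check.\n'
    '<tool_call>{"name": "search", "arguments": {"q": "IOI 2025"}}</tool_call>'
)
```

The harness would execute each parsed call and feed the result back to the model, giving it the verifiable execution feedback the protocol is designed around.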

By focusing on ‘intelligence density,’ Nemotron-Cascade 2 demonstrates that specialized reasoning capabilities once thought to be the exclusive domain of frontier-scale models are achievable at a 30B scale through domain-specific reinforcement learning.

Check out the Paper and the Model on Hugging Face.


