Multi-step reasoning tasks like mathematical problem solving are vulnerable to cascading failures, where a single incorrect step leads to complete solution breakdown. Current LLM routing methods assign entire queries to one model, treating all reasoning steps as equal. We propose TRIM (Targeted Routing In Multi-step reasoning tasks), which routes only critical steps—those likely to derail the solution—to larger models while letting smaller models handle routine continuations. Our key insight is that targeted step-level interventions can fundamentally transform inference efficiency by confining expensive calls to precisely those steps where stronger models prevent cascading errors. TRIM operates at the step level: it uses process reward models to identify erroneous steps and makes routing decisions based on step-level uncertainty and budget constraints. We develop several routing strategies within TRIM, ranging from a simple threshold-based policy to more expressive policies that reason about long-horizon accuracy-cost trade-offs and uncertainty in step-level correctness estimates. On MATH-500, even the simplest thresholding strategy surpasses prior routing methods with 5× higher cost efficiency, while more advanced policies match the strong, expensive model's performance using 80% fewer expensive-model tokens. On harder benchmarks such as AIME, TRIM achieves up to 6× higher cost efficiency. All methods generalize effectively across math reasoning tasks, demonstrating that step-level difficulty reflects a fundamental, transferable property of reasoning.
Query-level routing asks: which model should solve this problem? However, multi-step reasoning is inherently heterogeneous—most steps are routine, while a few correspond to critical decision points that determine the solution trajectory. Query-level routers implicitly assume that strong-model intervention is uniformly required across all tokens, committing to a single model for the entire generation.
This mismatch leads to substantial inefficiency. In tasks such as mathematical reasoning or code generation, errors at early steps can propagate and cause complete failure. Yet, query-level routing incurs the cost of strong-model generation for the entire solution, even though intervention is only necessary at a small number of error-prone steps.
Query-level routing assigns the entire problem to a single model, treating all steps as equally important. This ignores the fact that only a few steps are error-prone and critical, leading to unnecessary strong-model usage even when most of the reasoning can be handled by a cheaper model.
TRIM addresses this mismatch by operating at the step level: it generates solutions incrementally and selectively escalates only those steps where the reasoning trace risks diverging from a correct solution. By intervening only at critical steps, TRIM prevents cascading errors while allowing the weaker model to handle routine continuations.
Key Insight: A small number of well-placed step-level interventions can recover strong-model performance, yielding substantial efficiency gains by confining expensive computation to only the error-prone critical steps.
Consider a two-model setup: a weak, cheap model Mw and a strong, expensive model Ms. At each reasoning step, Mw proposes a candidate continuation, and the TRIM router decides whether to keep it or regenerate it using Ms.
This transforms routing from a one-shot decision into a sequential control problem over the reasoning trajectory. Decisions are guided by step-level correctness signals (e.g., PRM/self-verification scores) and token cost, enabling explicit optimization of the accuracy-cost trade-off.
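As a concrete illustration, the loop below sketches this sequential control problem with toy stand-ins for Mw, Ms, and the PRM (all hypothetical stubs; in TRIM these are actual LLMs and a learned process reward model). The structure is the point: Mw proposes each step, the router inspects a step-level score, and only escalated steps pay strong-model cost.

```python
# Toy sketch of TRIM's sequential control loop. weak_step, strong_step,
# and prm_score are hypothetical stubs for illustration, not TRIM's real
# components; only the control flow mirrors the description above.

def weak_step(t):
    # Toy Mw: produces an unreliable step at index 2, reliable steps elsewhere.
    return ("bad" if t == 2 else "ok", 12)      # (step text, token cost)

def strong_step(t):
    return ("ok", 12)                           # toy Ms: always reliable

def prm_score(text):
    return 0.9 if text == "ok" else 0.2         # toy step-level correctness signal

def trim_generate(router, n_steps=5):
    trace, strong_tokens = [], 0
    for t in range(n_steps):
        text, cost = weak_step(t)               # Mw proposes a candidate step
        if router(prm_score(text), trace):      # sequential routing decision
            text, cost = strong_step(t)         # regenerate with Ms
            strong_tokens += cost               # only escalated steps pay Ms cost
        trace.append(text)
    return trace, strong_tokens

# A trivially simple router policy: escalate when the step score looks low.
trace, cost = trim_generate(router=lambda score, trace: score < 0.5)
```

With this toy setup, exactly one step (the error-prone one) is escalated, so the strong-model cost is a small fraction of generating the whole trace with Ms.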
Step-wise router architecture. TRIM uses process rewards to evaluate partial solutions and employs RL-based policies or POMDP-based solvers for making routing decisions at each step.
We develop four routing strategies within TRIM, ranging from simple to highly expressive:
A simple myopic policy that escalates the current step to Ms whenever its score falls below a predefined fixed threshold k. Despite its simplicity, it surpasses all query-level baselines with 5× higher cost efficiency.
A full trajectory-aware policy trained via RL over the complete sequence of (PRM score, token cost) features.
A light policy trained via RL on compact aggregate features summarizing the reasoning trace (current PRM score, min of prior scores, token count, step index).
A principled uncertainty-aware policy that models routing as a partially observable Markov decision process (POMDP) over latent trajectory-correctness states.
We instantiate TRIM with routing policies of increasing sophistication. The simplest policy, TRIM-Thr, uses only the PRM score of the current step, while the more expressive policies incorporate trajectory-level information to reason about whether intervention is still likely to improve the final solution enough to justify its cost.
We first introduce TRIM-Thr, a simple routing policy that relies solely on the PRM score of the current step generated by the weak model Mw. If this score falls below a predefined threshold k, the router regenerates the step with the strong model Ms; otherwise, it accepts the weak model’s output and continues generation.
Despite its simplicity, TRIM-Thr is highly effective. The threshold parameter k provides a principled mechanism for controlling the performance-cost trade-off: lower thresholds reduce strong-model usage, while higher thresholds allow more aggressive intervention. Empirically, this simple step-level policy already outperforms prior query-level routing methods and is competitive with an idealized oracle query-level router that selects the best model per query.
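The effect of the threshold k can be illustrated directly: since TRIM-Thr escalates whenever the PRM score falls below k, sweeping k traces out the performance-cost curve. The per-step scores below are synthetic numbers for the sketch, not measured PRM outputs.

```python
# Hypothetical illustration of the TRIM-Thr trade-off knob. The scores are
# synthetic; the rule itself matches the text: regenerate with Ms whenever
# the current step's PRM score falls below the threshold k.

scores = [0.95, 0.85, 0.15, 0.90, 0.45, 0.92]   # synthetic per-step PRM scores

def escalation_rate(scores, k):
    # Fraction of steps TRIM-Thr would route to the strong model at threshold k.
    return sum(s < k for s in scores) / len(scores)

for k in (0.2, 0.5, 0.9):
    print(f"k={k}: escalate {escalation_rate(scores, k):.0%} of steps")
```

Lower k escalates only the clearly suspect steps (cheap, conservative); higher k intervenes more aggressively at higher strong-model cost.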
TRIM-Thr vs. Idealized Oracle Query-level Router. Comparison of task performance–cost trade-offs for Qwen2.5-3B-Instruct (Mw) and Claude 3.7 Sonnet (Ms), under TRIM-Thr (with Qwen2.5-Math-PRM-7B) versus the idealized oracle query-level router.
While TRIM-Thr performs remarkably well, it is inherently myopic: it makes routing decisions using only the correctness estimate of the most recent step, without accounting for past context or future consequences. In many cases, the current step alone is insufficient for deciding whether intervention is worthwhile.
For example, even if the current step appears incorrect, regeneration may not be beneficial if the overall trajectory has already diverged too far from a correct solution, or if the cost of invoking Ms outweighs the potential gain. Conversely, when earlier steps remain largely consistent, a targeted intervention can recover the trajectory and substantially improve final correctness.
This motivates richer routing policies that reason jointly about trajectory correctness and marginal intervention cost.
TRIM-Seq learns routing decisions from the full sequence of step-level signals accumulated along the reasoning trace. At each step, the router observes the sequence of correctness estimates and token counts from prior steps, allowing it to reason over how the trajectory has evolved rather than focusing only on the latest PRM score.
Concretely, TRIM-Seq uses the feature sequence (r1, c1), ..., (rt, ct), where ri denotes the PRM score of step i and ci denotes its token length. These features capture the two quantities most fundamental to routing in multi-step reasoning: semantic fidelity, via step-level correctness estimates, and marginal intervention cost, via token counts that quantify the expense of regeneration with the strong model.
We parameterize the routing policy with a transformer over this feature sequence and train it using RL. The objective maximizes expected return by balancing final solution correctness against the cumulative cost of invoking the strong model. Each regeneration incurs a cost proportional to the number of strong-model tokens, weighted by a trade-off parameter λ, while the final reward reflects task correctness. This enables the router to learn when an intervention is likely to improve solution quality enough to justify its cost.
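The RL return described above can be written compactly. This is a minimal sketch of the objective's shape (the symbol names and the example numbers are assumptions for illustration, not the paper's exact reward implementation): final task correctness minus λ times the strong-model tokens spent on regenerations.

```python
# Sketch of the TRIM-Seq training return: correctness reward minus a
# lambda-weighted charge for every strong-model token spent on escalations.
# Function name and example values are illustrative assumptions.

def episode_return(correct, strong_token_counts, lam):
    # correct: 1.0 if the final answer is right, else 0.0
    # strong_token_counts: Ms tokens consumed at each escalated step
    return float(correct) - lam * sum(strong_token_counts)

# e.g. a correct solution with two escalations costing 40 and 60 Ms tokens:
r = episode_return(correct=1, strong_token_counts=[40, 60], lam=0.001)
```

Sweeping λ moves the learned policy along the accuracy-cost frontier: a larger λ penalizes escalation more heavily, yielding a cheaper but less interventionist router.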
While TRIM-Seq exploits the full sequential history, much of the relevant structure can be captured using a compact set of aggregate statistics. In multi-step reasoning, errors often exhibit a compounding structure: a single incorrect step can invalidate everything that follows, while a sequence of mildly unreliable steps can gradually push the trajectory off course.
Motivated by this, TRIM-Agg uses a reduced feature representation consisting of the current PRM score, the minimum of prior scores, the token count of the current step, and the step index. These aggregated features preserve key signals about whether the trajectory remains plausibly on track and whether intervention is cost-justified, while discarding the full sequential history.
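The aggregate state is simple to compute from the running history of (PRM score, token count) pairs. The sketch below assumes a particular tuple layout for the four features named above; the exact encoding in TRIM-Agg may differ.

```python
# Sketch of TRIM-Agg's compact feature representation, built from the
# per-step history of (PRM score, token count) pairs. The tuple layout
# is an illustrative assumption.

def agg_features(history):
    # history: list of (prm_score, token_count) for steps generated so far
    scores = [r for r, _ in history]
    return (
        scores[-1],                              # current PRM score
        min(scores[:-1], default=scores[-1]),    # min of prior scores
        history[-1][1],                          # token count of current step
        len(history),                            # step index (1-based)
    )

feats = agg_features([(0.9, 30), (0.4, 25), (0.8, 40)])
```

The min-of-prior-scores feature is what captures the compounding-error structure: a single earlier low score keeps signaling that the trajectory may already be off course, even if the current step looks fine.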
We train TRIM-Agg with the same RL objective as TRIM-Seq. Empirically, this compact representation enables substantially faster training with little to no loss in performance, making it a particularly attractive practical instantiation of the TRIM framework.
Summary: TRIM-Thr offers a simple and strong stepwise routing rule, while TRIM-Seq and TRIM-Agg leverage trajectory-level correctness and cost signals to learn more principled intervention policies under explicit performance-cost trade-offs.
A final challenge in stepwise routing arises from the imperfect nature of PRM estimates. Although PRMs provide useful signals about the correctness of intermediate steps, their predictions are often noisy and can misclassify correct steps as incorrect, or vice versa. In principle, RL-trained routers can learn to compensate for this noise, but training such policies under long-horizon sparse rewards is often sample-inefficient and expensive.
TRIM-POMDP addresses this by explicitly modeling PRM scores as noisy observations of an unobserved latent state that reflects the true correctness status of the reasoning trajectory. Rather than acting directly on raw PRM outputs, the router first infers the latent state and then plans accordingly, casting routing as a partially observable Markov decision process (POMDP).
TRIM-POMDP defines a compact latent state space with three correctness classes: S0, where the trajectory remains correct so far; S1, where the trajectory has already diverged irrecoverably; and S2, where the most recent step is incorrect but prior steps are correct, so the trajectory may still be recoverable through intervention.
POMDP formulation in TRIM. The latent state consists of three correctness classes: S0 (trajectory correct), S1 (irrecoverably incorrect), and S2 (current step incorrect but recoverable), augmented with step index and token cost. The observation space comprises PRM-based scores of prior and current steps along with auxiliary features, providing noisy signals of the underlying correctness state.
If this latent state were directly observed, routing would reduce to a standard fully observable control problem. In practice, however, the router only sees noisy PRM-based signals. This makes partial observability central to the problem rather than incidental, and motivates an explicit belief-state approach.
To bridge the gap between noisy PRM outputs and latent correctness states, TRIM-POMDP learns an observation function that maps the history of routing observations to a probability distribution over the latent states. Concretely, this amounts to modeling the distribution of PRM-based features conditioned on each correctness class.
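One minimal way to realize this mapping is a Bayesian belief update over the three latent classes, using class-conditional likelihoods of the observed PRM signal. The likelihood table below contains invented placeholder numbers, not values fitted by TRIM-POMDP; only the update rule (posterior ∝ likelihood × prior) is the point.

```python
# Sketch of belief updating over the latent correctness classes
# (S0 correct, S1 irrecoverable, S2 recoverable). The class-conditional
# observation probabilities are made-up placeholders for illustration.

LIKELIHOOD = {                      # P(PRM-score bucket | latent state)
    "high": {"S0": 0.8, "S1": 0.3, "S2": 0.2},
    "low":  {"S0": 0.2, "S1": 0.7, "S2": 0.8},
}

def update_belief(belief, obs):
    # Posterior over latent states: likelihood times prior, renormalized.
    post = {s: LIKELIHOOD[obs][s] * belief[s] for s in belief}
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

belief = {"S0": 1 / 3, "S1": 1 / 3, "S2": 1 / 3}   # uniform prior
belief = update_belief(belief, "low")              # a low PRM score arrives
```

After a low score, probability mass shifts from S0 toward the error states, and the router can plan against this belief instead of thresholding the raw score.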
This observation model can be fit offline using process supervision datasets with step-level annotations, such as ProcessBench. Once learned, it can be reused across different cost budgets, since it depends only on the alignment between PRM scores and ground-truth correctness labels rather than on a specific routing objective.
By explicitly separating state inference (via the observation function) from policy optimization, TRIM-POMDP provides a principled way to route under noisy correctness signals without requiring expensive RL training for every performance-cost trade-off.
Observation modeling for TRIM-POMDP and PRM noise. Conditional PRM-score distributions reveal that step-level signals are informative but noisy, motivating explicit latent-state inference rather than direct thresholding on raw scores alone.
Once the observation model is learned, TRIM-POMDP computes routing policies using standard POMDP solvers. An additional advantage is that the resulting router is largely agnostic to the specific choice of weak and strong LLMs, depending primarily on their step-level transition characteristics. This makes the formulation modular and reusable across model pairs.
Key Takeaway: TRIM spans a spectrum from simple thresholding to learned RL policies and uncertainty-aware POMDP planning, showing that increasingly structured use of trajectory-level information leads to increasingly principled and efficient routing decisions.
Setup: We evaluate TRIM on challenging mathematical reasoning benchmarks including MATH-500, AIME, OlympiadBench, and Minerva Math. The primary two-model setup uses Qwen2.5-3B-Instruct as the cheap model (Mw) and Claude 3.7 Sonnet as the expensive strong model (Ms), with step-level signals obtained from Qwen2.5-Math-PRM-7B. We compare against RouteLLM (BERT, MF, SW Ranking), Smoothie, AutoMix, and our adapted variant of AutoMix, AutoMix-PRM.
We measure cost efficiency using CPT(x%): the fraction of expensive-model Ms tokens (normalized by full Ms usage) required to recover x% of the performance gap between Mw and Ms (lower is better). ΔIBC measures the average performance gain per unit cost relative to a random router, capturing the overall quality of the cost-performance frontier (higher is better).
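The CPT(x%) metric can be computed by scanning a method's performance-cost curve for the cheapest operating point that recovers x% of the Mw-to-Ms accuracy gap. The curve points and accuracies below are synthetic numbers for the sketch, not results from the tables.

```python
# Sketch of CPT(x%): the lowest fraction of Ms tokens at which a routing
# curve recovers x% of the accuracy gap between Mw and Ms. All numbers
# here are synthetic placeholders for illustration.

def cpt(curve, acc_w, acc_s, x):
    # curve: list of (ms_token_fraction, accuracy), sorted by cost
    target = acc_w + (x / 100.0) * (acc_s - acc_w)
    for frac, acc in curve:
        if acc >= target:
            return frac          # cheapest point reaching the target accuracy
    return None                  # target never reached on this curve

curve = [(0.05, 0.62), (0.10, 0.70), (0.20, 0.78), (0.40, 0.82)]
frac = cpt(curve, acc_w=0.60, acc_s=0.84, x=50)
```

Lower CPT is better: it means less strong-model spend to close a given fraction of the performance gap.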
TRIM substantially improves over query-level routing across benchmarks. Even the simplest policy, TRIM-Thr, is already highly competitive and often dramatically more efficient than prior methods, while TRIM-Agg and TRIM-POMDP further improve the trade-off by leveraging trajectory-level information and uncertainty modeling.
The headline pattern is consistent across settings: routing a few critical steps is markedly better than routing an entire query. TRIM approaches strong-model accuracy using a small fraction of its token cost, with particularly large gains on harder benchmarks like AIME.
| Method | MATH-500 CPT(50%)↓ | MATH-500 CPT(80%)↓ | MATH-500 CPT(95%)↓ | MATH-500 ΔIBC↑ | AIME CPT(50%)↓ | AIME CPT(80%)↓ | AIME CPT(95%)↓ | AIME ΔIBC↑ |
|---|---|---|---|---|---|---|---|---|
| BERT | 42.52% | 71.68% | 85.31% | 0.08 | 38.18% | 70.93% | 80.71% | 0.44 |
| MF | 34.78% | 70.10% | 93.55% | 0.49 | 41.28% | 69.32% | 93.56% | 0.65 |
| SW Ranking | 40.17% | 60.44% | 71.56% | 0.37 | 34.18% | 57.15% | 82.34% | 0.79 |
| Smoothie | 47.73% | 74.65% | 93.81% | 0.30 | 45.65% | 81.05% | 94.57% | 0.03 |
| AutoMix | 37.99% | 67.93% | 91.02% | 0.12 | 51.80% | 82.29% | 98.69% | 0.0004 |
| AutoMix-PRM | 23.88% | 42.98% | 53.96% | 0.95 | 43.77% | 69.70% | 80.96% | 0.07 |
| TRIM-Thr | 9.45% | 15.95% | 25.08% | 4.75 | 23.47% | 36.20% | 42.89% | 1.81 |
| TRIM-Agg | 7.30% | 12.22% | 17.21% | 5.67 | 12.35% | 27.79% | 38.01% | 2.50 |
| TRIM-POMDP | 6.33% | 14.41% | 17.98% | 5.86 | 16.01% | 23.71% | 28.17% | 5.00 |
Table 1: Benchmarking TRIM on MATH-500 and AIME. TRIM consistently dominates prior query-level routing baselines. The simple threshold TRIM-Thr policy is already very strong, while TRIM-Agg and TRIM-POMDP deliver the best overall trade-offs in different budget regimes.
We now visualize the full performance-cost Pareto frontiers of all TRIM routing approaches on MATH-500 and AIME.
Performance-cost trade-offs of TRIM routing approaches on MATH-500 and AIME. TRIM-Thr provides a strong low-complexity baseline, while TRIM-Agg and TRIM-POMDP achieve the best cost-performance trade-offs, reaching near-Ms accuracy using a small fraction of expensive tokens. Notably, TRIM-POMDP performs particularly well in low-budget regimes, whereas TRIM-Agg slightly dominates at higher budgets.
A key strength of TRIM is that step-level difficulty patterns transfer across benchmarks. We train routers on AIME and evaluate on OlympiadBench and Minerva Math—without any retraining. TRIM maintains strong performance while many query-level baselines degrade to near-zero or negative efficiency.
| Method | OlympiadBench CPT(50%)↓ | OlympiadBench CPT(80%)↓ | OlympiadBench CPT(95%)↓ | OlympiadBench ΔIBC↑ | Minerva Math CPT(50%)↓ | Minerva Math CPT(80%)↓ | Minerva Math CPT(95%)↓ | Minerva Math ΔIBC↑ |
|---|---|---|---|---|---|---|---|---|
| BERT | 55.03% | 87.41% | 96.14% | -0.04 | 48.99% | 88.25% | 98.33% | -0.10 |
| MF | 55.24% | 78.21% | 90.05% | -0.07 | 38.96% | 58.28% | 76.07% | 0.42 |
| SW Ranking | 52.57% | 76.48% | 95.03% | 0.07 | 49.63% | 79.96% | 98.26% | 0.04 |
| Smoothie | 52.16% | 76.56% | 92.10% | -0.08 | 54.75% | 80.53% | 93.83% | -0.09 |
| AutoMix | 44.95% | 71.13% | 96.44% | 0.02 | 19.91% | 45.04% | 54.03% | 0.72 |
| AutoMix-PRM | 39.80% | 61.57% | 72.05% | 0.22 | 16.82% | 32.72% | 45.76% | 1.35 |
| TRIM-Thr | 20.45% | 33.03% | 46.97% | 1.31 | 15.20% | 21.65% | 34.66% | 2.23 |
| TRIM-Agg | 14.13% | 28.45% | 42.97% | 2.57 | 11.05% | 20.89% | 32.35% | 3.12 |
Table 2: Cross-benchmark generalization. Routers trained on AIME and evaluated on OlympiadBench and Minerva Math. TRIM methods generalize effectively while query-level baselines degrade, demonstrating that step-level difficulty captures transferable reasoning structure.
Our cross-dataset results highlight a key distinction between routing paradigms. Query-level routers (e.g., RouteLLM) often fit to dataset-specific signals—such as formatting, style, or typical problem structure—that do not transfer across benchmarks. In contrast, TRIM bases decisions on step-level correctness signals within the evolving reasoning trace, which reflect universal failure modes in multi-step reasoning (e.g., divergence at critical steps). As a result, TRIM learns transferable routing behavior that generalizes more robustly across datasets of comparable difficulty.
Performance-cost trade-off on OlympiadBench and Minerva Math. Both TRIM-Agg and TRIM-Thr achieve a smooth accuracy-cost Pareto frontier under cross-benchmark transfer. At a fraction of the expensive model's token cost, TRIM approaches the expensive model's accuracy.
In practice, TRIM can be implemented using the same system-level optimization techniques that enable low-latency speculative decoding, thereby avoiding frequent context re-encoding ("prefill") when switching between models and effectively eliminating the associated latency overhead. We provide direct empirical evidence that TRIM does not introduce significant wall-clock overhead—and is in fact faster than running the expensive model Ms alone.
We measured end-to-end latency and throughput on a fixed 2×H100 setup using vLLM with prefix caching enabled. For TRIM-Thr, Ms was sharded across both GPUs via tensor parallelism (55% memory per GPU), while Mw and the PRM each ran on separate GPUs (40% memory). For single-model baselines, both GPUs were fully dedicated to the large model. All measurements were conducted on the MATH-500 test set under two model-pair configurations:
| Configuration | Threshold (k) | Latency (sec/query) | Throughput (tok/sec) |
|---|---|---|---|
| Qwen2.5-32B | — | 17.10 | 31.77 |
| Qwen2.5-7B | — | 10.27 | 65.20 |
| TRIM-Thr (1.5B + 32B) | 0.1 | 6.21 | 64.30 |
| TRIM-Thr (1.5B + 32B) | 0.4 | 9.02 | 52.30 |
| TRIM-Thr (1.5B + 32B) | 0.7 | 12.10 | 47.86 |
| TRIM-Thr (1.5B + 7B) | 0.1 | 6.35 | 66.66 |
| TRIM-Thr (1.5B + 7B) | 0.4 | 8.17 | 65.25 |
| TRIM-Thr (1.5B + 7B) | 0.7 | 9.51 | 64.61 |
Table 3: End-to-end latency and throughput. TRIM-Thr (1.5B+32B) achieves a 1.4×–2.75× speedup over running the 32B model alone (17.10/12.10 ≈ 1.41, 17.10/6.21 ≈ 2.75), despite incorporating PRM evaluation and stepwise routing. Latency gains increase with larger strong models and lower routing thresholds.
```bibtex
@article{kapoor2026trim,
  title   = {TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks},
  author  = {Kapoor, Vansh and Gupta, Aman and Chen, Hao and Beniwal, Anurag and Huang, Jing and Kumar, Aviral},
  journal = {arXiv preprint arXiv:2601.10245},
  year    = {2026}
}
```