Every production agent stack hits the same wall: you can't send everything to a frontier model. Cost explodes. Latency kills the UX. And most queries don't need Claude — they need a fast, private, local answer.
The solution is a routing layer. But the way most teams build it is wrong.
The pattern that works in production is a three-tier model stack:
- Router (tiny, 3-8B): Scores query complexity in <100ms. This is the gatekeeper.
- Local (medium, 24B): Handles 60-70% of all queries. Function calling, lookups, simple reasoning.
- Frontier (API): Reserved for genuine complexity — multi-step reasoning, novel analysis, tool orchestration.
The router scores every incoming query on a 0-1 complexity scale. Below threshold (typically 0.3-0.4), it routes local. Above, it routes frontier. The key insight: the threshold is tunable per-domain. Finance queries route frontier more aggressively than status checks.
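The decision described above can be sketched in a few lines. The `score_complexity` stub and the per-domain threshold table are illustrative assumptions, not part of any specific library; in production the stub would wrap a call to the router model.

```python
# Per-domain thresholds: lower means "route to frontier more aggressively".
# The specific numbers here are illustrative.
DOMAIN_THRESHOLDS = {
    "finance": 0.25,   # finance queries route frontier sooner
    "status": 0.45,    # cheap status checks tolerate a higher bar
    "default": 0.35,
}

def score_complexity(query: str) -> float:
    # Stand-in for the router model call. In production this is an
    # LLM request that returns a 0.0-1.0 complexity score.
    return 0.8 if "analyze" in query.lower() else 0.1

def route(query: str, domain: str = "default") -> str:
    # Below threshold: local model. At or above: frontier API.
    score = score_complexity(query)
    threshold = DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS["default"])
    return "frontier" if score >= threshold else "local"
```

Swapping the threshold table per deployment is what makes the same router usable across domains without retraining anything.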
Memory Considerations
On a 48GB machine, you can keep two models loaded simultaneously with room for KV cache:
- Primary (24B quantized): ~14GB
- Router (8B): ~8.5GB
- macOS overhead: ~7GB
- KV cache headroom: ~18GB
This means zero cold-start latency for the two models that handle 95% of traffic.
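The budget above can be sanity-checked with simple arithmetic. The figures are the article's estimates, not measured values:

```python
# Memory budget in GB, taken from the list above.
budget = {
    "primary_24b_quantized": 14.0,
    "router_8b": 8.5,
    "macos_overhead": 7.0,
    "kv_cache_headroom": 18.0,
}

total = sum(budget.values())  # 47.5GB
assert total <= 48.0  # fits on a 48GB machine with a small margin
```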
From the user's perspective, the routing layer is invisible — and that's the point. A well-tuned router means:
- Simple queries (account balance, status check, schedule lookup) return in under 2 seconds
- Complex queries (spending analysis, investment recommendations) take 10-15 seconds but deliver frontier-quality answers
- No configuration — the user never picks a model or adjusts settings
The failure mode people don't talk about: when the router gets it wrong. A complex query routed locally gives a confident but mediocre answer. The user doesn't know they got a worse response. This is why scoring calibration matters more than model selection.
The Feedback Signal
Track which model handled each query, the response quality (implicit: did the user follow up with clarification?), and the router's confidence score. Over time, this data tunes the threshold automatically.
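A minimal sketch of that feedback loop, assuming a per-query log record and a simple nudge rule. The field names and the specific tuning rule (follow-up rate above 30% lowers the threshold) are assumptions, not a prescribed algorithm:

```python
from dataclasses import dataclass

@dataclass
class RoutingRecord:
    query: str
    model: str             # "local" or "frontier"
    router_score: float    # the router's 0-1 complexity estimate
    needed_followup: bool  # implicit quality signal: user asked again

def tune_threshold(records: list[RoutingRecord], threshold: float,
                   step: float = 0.01) -> float:
    # If locally-routed queries keep needing follow-ups, lower the
    # threshold so borderline queries go to the frontier model instead.
    local = [r for r in records if r.model == "local"]
    if not local:
        return threshold
    followup_rate = sum(r.needed_followup for r in local) / len(local)
    if followup_rate > 0.3:
        threshold -= step
    elif followup_rate < 0.1:
        threshold += step
    # Clamp to the sane operating range from the section above.
    return round(max(0.1, min(0.6, threshold)), 3)
```

Run this periodically over a sliding window of records rather than per-query, so one noisy interaction can't move the threshold.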
The router model itself needs to be tiny and fast. Phi-4-mini (3.8B) or Hermes 3 (8B) both work. The scoring prompt is simple:
"Rate the complexity of this query from 0.0 to 1.0. Consider: does it require multi-step reasoning? Does it need current data? Does it involve nuanced judgment?"
Parse the float from the response. That's it. No embeddings, no classifiers, no fine-tuning. The LLM is the classifier.
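"Parse the float" deserves one defensive detail: small models occasionally wrap the number in prose or return no number at all. A hedged sketch of the parse, where the regex and the 0.5 fallback are assumptions rather than part of the original prompt:

```python
import re

def parse_score(reply: str, fallback: float = 0.5) -> float:
    # Grab the first number in the reply, e.g. "0.7" from
    # "Complexity: 0.7 because it needs multi-step reasoning."
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if not match:
        return fallback  # unparseable reply: score as ambiguous
    # Clamp to the 0-1 scale in case the model goes off-script.
    return max(0.0, min(1.0, float(match.group())))
```

A fallback of 0.5 lands the ambiguous case near the threshold, which you can bias toward local or frontier depending on how costly a mediocre answer is in your domain.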
Model costs are dropping, but they're not zero. More importantly, latency is the real constraint for agent UX. A routing layer isn't an optimization — it's an architectural requirement for any agent system that needs to feel responsive.
The teams that figure this out ship products that feel like magic. The teams that don't ship demos that feel like waiting rooms.