# MoM: Specialized Models for Intelligent Routing

One fabric. Many minds. We're introducing MoM (Mixture of Models), a family of specialized routing models that power vLLM-SR's intelligent decision-making.
## Why MoM?
vLLM-SR solves a critical problem: how to route LLM requests to the right model at the right time. Not every query needs the same resources: "What's the weather?" shouldn't cost as much as "Analyze this legal contract."
## The Evolution: From Encoder-Only to Mixture-of-Models
### Where We Started: ModernBERT Foundation
vLLM-SR initially built its routing intelligence entirely on ModernBERT (encoder-only models):
Advantages:
- Blazing fast: Sub-10ms inference latency
- High throughput: 10,000+ QPS on commodity hardware
- Cost-effective: Minimal compute requirements
- Proven accuracy: Strong performance on classification tasks
Limitations:
- Black-box decisions: No explanation for routing choices
- Limited reasoning: Cannot handle complex, multi-step logic
- Fixed capabilities: Hard to extend with new behaviors
- No tool integration: Cannot leverage external tools or APIs
### Why We're Evolving: Decoder-Only Models
As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements:
- Explainability: Users need to understand why a query was routed to a specific model
- Complex reasoning: Some routing decisions require multi-step analysis
- Agentic workflows: Integration with tool calling, function execution, and external APIs
- Advanced techniques: Reinforcement learning (RL), sophisticated post-training methods
- Domain expertise: Specialized routing for legal, medical, scientific domains
The Solution: Expand to decoder-only models while keeping encoder speed where it matters.
## The MoM Architecture: Best of Both Worlds
Our Mixture-of-Models approach combines encoder and decoder strengths:
- Encoders → Fast classification (sub-10ms latency) for high-throughput scenarios
- Decoders → Explainable decisions with reasoning for transparency
- Domain agents → Expert routing with specialized knowledge
This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.
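One way such a hybrid can be wired is confidence-based escalation: try the fast encoder first and fall back to a decoder only when the encoder is unsure. A minimal Python sketch, where `encoder_classify`, `decoder_classify`, and the 0.8 threshold are illustrative stand-ins rather than the actual vLLM-SR API:

```python
# Hypothetical sketch of hybrid routing: fast encoder path by default,
# decoder path only when the encoder's confidence is low.
ESCALATION_THRESHOLD = 0.8  # below this, pay for decoder reasoning

def encoder_classify(query: str) -> tuple[str, float]:
    """Stand-in for mom-enc-class-intent-v1: returns (label, confidence)."""
    if "contract" in query.lower():
        return "legal", 0.55   # ambiguous query -> low confidence
    return "general", 0.95

def decoder_classify(query: str) -> tuple[str, str]:
    """Stand-in for mom-dec-class-intent-v1: returns (label, explanation)."""
    return "legal", "The query mentions a contract, so legal analysis is required."

def route(query: str) -> dict:
    label, conf = encoder_classify(query)
    if conf >= ESCALATION_THRESHOLD:
        return {"label": label, "path": "encoder", "explanation": None}
    label, why = decoder_classify(query)
    return {"label": label, "path": "decoder", "explanation": why}

route("What's the weather?")           # stays on the fast encoder path
route("Analyze this legal contract")   # escalates to the decoder
```

The design choice to encode is that escalation is per-request: the sub-10ms path handles the common case, and decoder latency is only paid when transparency or extra reasoning is actually needed.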
## The MoM Model Family
### Encoders: Speed & Safety
Fast, high-throughput models for classification and security checks:
| Model | Purpose |
|---|---|
| `mom-enc-class-intent-v1` | Intent/topic classification (sub-10ms latency) |
| `mom-enc-guard-pii-v1` | PII detection (privacy protection) |
| `mom-enc-guard-jailbreak-v1` | Jailbreak/attack detection (security) |
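To illustrate where a guard model sits in the flow, here is a toy pre-routing check. The regex patterns merely stand in for `mom-enc-guard-pii-v1`; a real deployment would call the encoder model itself rather than pattern-match:

```python
import re

# Toy stand-in for mom-enc-guard-pii-v1: flag obvious PII patterns
# before any routing decision is made. Illustrative only.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def guard_pii(query: str) -> bool:
    """Return True if the query should be blocked or redacted."""
    return any(p.search(query) for p in PII_PATTERNS)

guard_pii("My SSN is 123-45-6789")   # flagged before routing
guard_pii("Explain transformers")    # safe to route
```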
### Decoders: Explainability
When you need to understand why a routing decision was made:
| Model | Purpose |
|---|---|
| `mom-dec-class-intent-v1` | Intent classification with reasoning |
| `mom-dec-class-intent-r1` | Higher-capacity variant for complex cases |
### Domain Agents: Specialized Expertise
Expert models for domain-specific routing:
| Model | Domain |
|---|---|
| `mom-dec-agent-sci-v1` | Science (physics, chemistry, biology) |
| `mom-dec-agent-math-v1` | Mathematics (algebra, calculus, statistics) |
| `mom-dec-agent-hum-v1` | Humanities (literature, philosophy, history) |
| `mom-dec-agent-soc-v1` | Social sciences (psychology, economics) |
| `mom-dec-agent-law-v1` | Legal (contracts, compliance) |
| `mom-dec-agent-gen-v1` | Generalist fallback |
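Dispatching on an intent label can be as simple as a lookup table with a generalist fallback. A hypothetical sketch; the intent label strings are assumptions, and only the model identifiers come from the table above:

```python
# Illustrative dispatch table from an intent label to a MoM domain agent.
DOMAIN_AGENTS = {
    "science":    "mom-dec-agent-sci-v1",
    "math":       "mom-dec-agent-math-v1",
    "humanities": "mom-dec-agent-hum-v1",
    "social":     "mom-dec-agent-soc-v1",
    "legal":      "mom-dec-agent-law-v1",
}

def pick_agent(intent: str) -> str:
    # Unknown or out-of-scope intents fall through to the generalist agent.
    return DOMAIN_AGENTS.get(intent, "mom-dec-agent-gen-v1")

pick_agent("math")     # -> 'mom-dec-agent-math-v1'
pick_agent("cooking")  # -> 'mom-dec-agent-gen-v1' (fallback)
```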
## Design Principles
Safety-First: Guardrail models (PII, jailbreak detection) run before routing, putting security at the edge.
Speed vs. Explainability: Choose encoders for sub-10ms latency or decoders for transparent reasoning. Different endpoints, different SLAs.
Domain Expertise: Specialized agents achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts, legal queries to legal experts.
## How vLLM-SR Uses MoM
vLLM-SR's routing pipeline leverages MoM models at multiple stages:
1. Security check → `mom-enc-guard-*` models filter malicious or sensitive requests
2. Intent classification → `mom-enc-class-intent-v1` or `mom-dec-class-intent-v1` determines the query type
3. Domain routing → `mom-dec-agent-*` models route specialized queries to the optimal downstream models
4. Cost optimization → simple queries go to lightweight models; complex queries go to premium models
This achieves 2x+ cost reduction while maintaining quality, similar to RouteLLM.
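The four pipeline stages can be sketched end to end with stub functions standing in for the corresponding MoM models; all of the logic below is illustrative, not the shipped implementation:

```python
def guard(query: str) -> bool:
    """Stage 1 stub (mom-enc-guard-*): True means block the request."""
    return "ignore previous instructions" in query.lower()

def classify_intent(query: str) -> str:
    """Stage 2 stub (mom-enc-class-intent-v1): crude digit-based heuristic."""
    return "math" if any(ch.isdigit() for ch in query) else "general"

def domain_route(intent: str) -> str:
    """Stage 3 stub (mom-dec-agent-*): generalist fallback for unknown intents."""
    return {"math": "mom-dec-agent-math-v1"}.get(intent, "mom-dec-agent-gen-v1")

def cost_tier(intent: str) -> str:
    """Stage 4 stub: simple queries get the lightweight tier."""
    return "lightweight" if intent == "general" else "premium"

def pipeline(query: str) -> dict:
    if guard(query):
        return {"status": "blocked"}
    intent = classify_intent(query)
    return {"status": "ok",
            "agent": domain_route(intent),
            "tier": cost_tier(intent)}
```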
## Performance
Early benchmarks:
- Encoders: sub-10ms p99 latency, 10,000+ QPS
- Decoders: ~50-100ms latency with explainable outputs
- Domain Agents: 15-25% accuracy improvement over generalist routing
## What's Next: Exploring Frontier Techniques
The move to decoder-only models opens exciting possibilities for vLLM-SR:
### Agentic Routing
Decoder models can act as intelligent agents that:
- Dynamically select and orchestrate multiple models
- Make multi-step routing decisions with tool calling
- Adapt routing strategies based on feedback
### Reinforcement Learning (RL)
Apply RL techniques to optimize routing decisions:
- Learn from user feedback and model performance
- Discover optimal routing policies through trial and error
- Continuously improve cost-quality trade-offs
### Advanced Post-Training
Leverage cutting-edge post-training methods:
- Distillation: Transfer knowledge from large models to efficient routers
- Preference learning: Train on human feedback (RLHF, DPO)
- Domain adaptation: Fine-tune for specific industries or use cases
### Tool Integration
Enable routers to:
- Call external APIs for context-aware routing
- Query databases for historical routing patterns
- Integrate with monitoring systems for real-time optimization
The vision: vLLM-SR routers that not only classify but reason, learn, and adapt.
## Model Naming
`mom-{type}-{function}-{domain}-{version}`

- type: `enc` (encoder) / `dec` (decoder)
- function: `class` (classification) / `guard` (safety) / `agent` (domain expert)
- domain: `intent`, `pii`, `jailbreak`, `sci`, `math`, etc.
- version: `v1` (baseline) / `r1` (higher-capacity)
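The scheme is regular enough to parse mechanically. A small helper, hypothetical and not part of any shipped SDK:

```python
def parse_model_name(name: str) -> dict:
    """Split a MoM identifier into its four naming fields."""
    prefix, mtype, function, domain, version = name.split("-")
    if prefix != "mom":
        raise ValueError(f"not a MoM model name: {name}")
    return {"type": mtype, "function": function,
            "domain": domain, "version": version}

parse_model_name("mom-enc-guard-pii-v1")
# {'type': 'enc', 'function': 'guard', 'domain': 'pii', 'version': 'v1'}
```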
## Get Started
All MoM models are available on Hugging Face.
vLLM-SR · Route with intent. Think with reason.
