
MoM: Specialized Models for Intelligent Routing

· 5 min read
Xunzhuo Liu
Software Engineer @ Tencent

MoM Family

One fabric. Many minds. We're introducing MoM (Mixture of Models): a family of specialized routing models that power vLLM-SR's intelligent decision-making.

Why MoM?

vLLM-SR solves a critical problem: how to route LLM requests to the right model at the right time. Not every query needs the same resources: "What's the weather?" shouldn't cost as much as "Analyze this legal contract."

The Evolution: From Encoder-Only to Mixture-of-Models

Where We Started: ModernBERT Foundation

vLLM-SR initially built its routing intelligence entirely on ModernBERT, an encoder-only model family:

Advantages:

  • ⚡ Blazing fast: Sub-10ms inference latency
  • 📊 High throughput: 10,000+ QPS on commodity hardware
  • 💰 Cost-effective: Minimal compute requirements
  • 🎯 Proven accuracy: Strong performance on classification tasks

Limitations:

  • ❌ Black box decisions: No explanation for routing choices
  • ❌ Limited reasoning: Cannot handle complex, multi-step logic
  • ❌ Fixed capabilities: Hard to extend with new behaviors
  • ❌ No tool integration: Cannot leverage external tools or APIs

Why We're Evolving: Decoder-Only Models

As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements:

  • Explainability: Users need to understand why a query was routed to a specific model
  • Complex reasoning: Some routing decisions require multi-step analysis
  • Agentic workflows: Integration with tool calling, function execution, and external APIs
  • Advanced techniques: Reinforcement learning (RL), sophisticated post-training methods
  • Domain expertise: Specialized routing for legal, medical, scientific domains

The Solution: Expand to decoder-only models while keeping encoder speed where it matters.

The MoM Architecture: Best of Both Worlds

Our Mixture-of-Models approach combines encoder and decoder strengths:

  • ⚡ Encoders: Fast classification (sub-10ms latency) for high-throughput scenarios
  • 🧠 Decoders: Explainable decisions with reasoning for transparency
  • 🎯 Domain Agents: Expert routing with specialized knowledge

This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.
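
To make the tradeoff concrete, here is a minimal Python sketch of the dispatch idea. The `route` function and its policy are illustrative, not the vLLM-SR API; only the model names come from the family described below.

```python
# Illustrative dispatch sketch (not the vLLM-SR API): pick the cheapest
# model tier that satisfies the request's requirements.
def route(query: str, need_reason: bool = False, domain: str | None = None) -> str:
    if domain is not None:
        return f"mom-dec-agent-{domain}-v1"   # expert routing for known domains
    if need_reason:
        return "mom-dec-class-intent-v1"      # explainable, ~50-100ms
    return "mom-enc-class-intent-v1"          # fast path, sub-10ms

print(route("What's the weather?"))                      # fast encoder path
print(route("Analyze this contract", need_reason=True))  # decoder with rationale
print(route("Prove this identity", domain="math"))       # domain agent
```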

The MoM Model Family

🔒 Encoders: Speed & Safety

Fast, high-throughput models for classification and security checks:

| Model | Purpose |
| --- | --- |
| `mom-enc-class-intent-v1` | Intent/topic classification (sub-10ms latency) |
| `mom-enc-guard-pii-v1` | PII detection (privacy protection) |
| `mom-enc-guard-jailbreak-v1` | Jailbreak/attack detection (security) |
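
As a sketch of how the guard models might sit in front of routing: the Hugging Face pipeline calls are standard, but the repo paths and label names below are placeholders, since the published label schema may differ.

```python
from transformers import pipeline

# Placeholder repo paths and label names; adjust to the published model cards.
pii_guard = pipeline("text-classification", model="<org>/mom-enc-guard-pii-v1")
jailbreak_guard = pipeline("text-classification", model="<org>/mom-enc-guard-jailbreak-v1")

def is_safe(query: str, threshold: float = 0.5) -> bool:
    """Return False if either guard flags the query above the threshold."""
    for guard, bad_label in ((pii_guard, "PII"), (jailbreak_guard, "JAILBREAK")):
        result = guard(query)[0]  # e.g. {"label": "PII", "score": 0.97}
        if result["label"] == bad_label and result["score"] >= threshold:
            return False
    return True
```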

🧠 Decoders: Explainability

When you need to understand why a routing decision was made:

| Model | Purpose |
| --- | --- |
| `mom-dec-class-intent-v1` | Intent classification with reasoning |
| `mom-dec-class-intent-r1` | Higher-capacity variant for complex cases |
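
A sketch of what calling a decoder classifier could look like. The JSON output contract is an assumption made for illustration, and the repo path is a placeholder.

```python
import json
from transformers import pipeline

# Placeholder repo path; the JSON response format is assumed for this sketch.
classifier = pipeline("text-generation", model="<org>/mom-dec-class-intent-v1")

def classify_with_reason(query: str) -> dict:
    prompt = (
        "Classify the query into one of: science, math, legal, general.\n"
        'Respond as JSON with keys "intent" and "reason".\n\nQuery: ' + query
    )
    text = classifier(prompt, max_new_tokens=96, return_full_text=False)[0]["generated_text"]
    return json.loads(text)  # e.g. {"intent": "legal", "reason": "mentions contracts"}
```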

🎯 Domain Agents: Specialized Expertise

Expert models for domain-specific routing:

| Model | Domain |
| --- | --- |
| `mom-dec-agent-sci-v1` | Science (physics, chemistry, biology) |
| `mom-dec-agent-math-v1` | Mathematics (algebra, calculus, statistics) |
| `mom-dec-agent-hum-v1` | Humanities (literature, philosophy, history) |
| `mom-dec-agent-soc-v1` | Social sciences (psychology, economics) |
| `mom-dec-agent-law-v1` | Legal (contracts, compliance) |
| `mom-dec-agent-gen-v1` | Generalist fallback |
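
Routing on top of these agents can be as simple as a lookup. The intent-to-agent mapping below is an assumption for illustration, with the generalist model as the fallback.

```python
# Assumed intent-to-agent mapping; the generalist model catches everything else.
DOMAIN_AGENTS = {
    "science":    "mom-dec-agent-sci-v1",
    "math":       "mom-dec-agent-math-v1",
    "humanities": "mom-dec-agent-hum-v1",
    "social":     "mom-dec-agent-soc-v1",
    "legal":      "mom-dec-agent-law-v1",
}

def pick_agent(intent: str) -> str:
    return DOMAIN_AGENTS.get(intent, "mom-dec-agent-gen-v1")

assert pick_agent("math") == "mom-dec-agent-math-v1"
assert pick_agent("cooking") == "mom-dec-agent-gen-v1"  # generalist fallback
```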

Design Principles

Safety-First: Guardrail models (PII, jailbreak detection) run before routing, keeping security at the edge.

Speed ↔ Explainability: Choose encoders for sub-10ms latency or decoders for transparent reasoning. Different endpoints, different SLAs.

Domain Expertise: Specialized agents achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts, legal queries to legal experts.

How vLLM-SR Uses MoM

vLLM-SR's routing pipeline leverages MoM models at multiple stages:

  1. Security Check → mom-enc-guard-* models filter malicious/sensitive requests
  2. Intent Classification → mom-enc-class-intent-v1 or mom-dec-class-intent-v1 determines query type
  3. Domain Routing → mom-dec-agent-* models route specialized queries to optimal downstream models
  4. Cost Optimization → simple queries go to lightweight models; complex queries to premium models

This achieves 2x+ cost reduction while maintaining quality, similar to RouteLLM.
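
Putting the stages together, and reusing the helpers sketched earlier (`is_safe`, `classify_with_reason`, `pick_agent`): the wiring and the notion of a "simple" intent below are assumptions, not vLLM-SR internals.

```python
# End-to-end sketch of the four stages, reusing is_safe, classify_with_reason,
# and pick_agent from the snippets above. Illustrative wiring only.
SIMPLE_INTENTS = {"general"}  # assumption: intents cheap enough for stage 4

def handle(query: str) -> str:
    if not is_safe(query):                          # 1. security check
        raise ValueError("rejected by guardrails")
    intent = classify_with_reason(query)["intent"]  # 2. intent classification
    if intent not in SIMPLE_INTENTS:
        return pick_agent(intent)                   # 3. domain routing
    return "lightweight-downstream-model"           # 4. cost optimization (placeholder)
```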

Performance

Early benchmarks:

  • Encoders: sub-10ms p99 latency, 10,000+ QPS
  • Decoders: ~50-100ms latency with explainable outputs
  • Domain Agents: 15-25% accuracy improvement over generalist routing

What's Next: Exploring Frontier Techniques

The move to decoder-only models opens exciting possibilities for vLLM-SR:

🤖 Agentic Routing

Decoder models can act as intelligent agents that:

  • Dynamically select and orchestrate multiple models
  • Make multi-step routing decisions with tool calling
  • Adapt routing strategies based on feedback

🎯 Reinforcement Learning (RL)

Apply RL techniques to optimize routing decisions:

  • Learn from user feedback and model performance
  • Discover optimal routing policies through trial and error
  • Continuously improve cost-quality trade-offs

🔧 Advanced Post-Training

Leverage cutting-edge post-training methods:

  • Distillation: Transfer knowledge from large models to efficient routers
  • Preference learning: Train on human feedback (RLHF, DPO)
  • Domain adaptation: Fine-tune for specific industries or use cases

🛠️ Tool Integration

Enable routers to:

  • Call external APIs for context-aware routing
  • Query databases for historical routing patterns
  • Integrate with monitoring systems for real-time optimization

The vision: vLLM-SR routers that not only classify but reason, learn, and adapt.

Model Naming

```
mom-{type}-{function}-{domain}-{version}
```

  • type: enc (encoder) / dec (decoder)
  • function: class (classification) / guard (safety) / agent (domain expert)
  • domain: intent, pii, jailbreak, sci, math, etc.
  • version: v1 (baseline) / r1 (higher-capacity)
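
The scheme is regular enough to parse mechanically; here is a tiny hypothetical helper, not an official utility:

```python
def parse_mom_name(name: str) -> dict:
    """Split a MoM model name into its four fields (sketch, not an official utility)."""
    prefix, *fields = name.split("-")
    assert prefix == "mom" and len(fields) == 4, f"unexpected name: {name}"
    return dict(zip(("type", "function", "domain", "version"), fields))

print(parse_mom_name("mom-enc-guard-pii-v1"))
# {'type': 'enc', 'function': 'guard', 'domain': 'pii', 'version': 'v1'}
```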

Get Started

All MoM models are available on Hugging Face.

vLLM-SR · Route with intent. Think with reason.