MoM: Specialized Models for Intelligent Routing
One fabric. Many minds. We're introducing MoM (Mixture of Models): a family of specialized routing models that powers vLLM-SR's intelligent decision-making.
Why MoM?
vLLM-SR solves a critical problem: how to route LLM requests to the right model at the right time. Not every query needs the same resources: "What's the weather?" shouldn't cost as much as "Analyze this legal contract."
MoM System Card
A quick overview of all MoM models:
| Category | Model | Size | Architecture | Base Model | Purpose |
|---|---|---|---|---|---|
| Intelligent Routing | `mom-brain-flash` | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
| | `mom-brain-pro` | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
| | `mom-brain-max` | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
| Similarity Search | `mom-similarity-flash` | Flash | Encoder | ModernBERT | Semantic similarity matching |
| Prompt Guardian | `mom-jailbreak-flash` | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
| | `mom-pii-flash` | Flash | Encoder | ModernBERT | PII detection & privacy protection |
| SLM Experts | `mom-expert-math-flash` | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
| | `mom-expert-science-flash` | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
| | `mom-expert-social-flash` | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
| | `mom-expert-humanities-flash` | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
| | `mom-expert-law-flash` | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
| | `mom-expert-generalist-flash` | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |
Key Insights:
- 4 categories: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
- ModernBERT (encoder-only) → sub-10ms latency for high-throughput routing
- Qwen3 (decoder-only) → explainable routing decisions + domain-specific problem solving
- Flash models achieve 10,000+ QPS on commodity hardware
- SLM Experts are not routers; they are specialized backend models that solve domain-specific problems
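For programmatic lookup, the system card can be mirrored as a small in-memory registry. This is a minimal sketch: the model names and metadata come from the table above, but the dict structure and helper function are assumptions, not an official API.

```python
# Minimal MoM model registry mirroring the system card above.
# The dict shape is illustrative, not an official API.
MOM_REGISTRY = {
    "mom-brain-flash":      {"category": "routing",    "arch": "encoder", "base": "ModernBERT"},
    "mom-brain-pro":        {"category": "routing",    "arch": "decoder", "base": "Qwen3 0.6B"},
    "mom-brain-max":        {"category": "routing",    "arch": "decoder", "base": "Qwen3 1.7B"},
    "mom-similarity-flash": {"category": "similarity", "arch": "encoder", "base": "ModernBERT"},
    "mom-jailbreak-flash":  {"category": "guardian",   "arch": "encoder", "base": "ModernBERT"},
    "mom-pii-flash":        {"category": "guardian",   "arch": "encoder", "base": "ModernBERT"},
}

def models_in_category(category: str) -> list[str]:
    """Return all registered model names in a category, in table order."""
    return [name for name, meta in MOM_REGISTRY.items()
            if meta["category"] == category]
```

A lookup like `models_in_category("guardian")` then yields both Prompt Guardian models for the pre-routing security stage.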
The Evolution: From Encoder-Only to Mixture-of-Models
Where We Started: ModernBERT Foundation
vLLM-SR initially built its routing intelligence entirely on ModernBERT (encoder-only models):
Advantages:
- Blazing fast: sub-10ms inference latency
- High throughput: 10,000+ QPS on commodity hardware
- Cost-effective: minimal compute requirements
- Proven accuracy: strong performance on classification tasks
Limitations:
- Black-box decisions: no explanation for routing choices
- Limited reasoning: cannot handle complex, multi-step logic
- Fixed capabilities: hard to extend with new behaviors
- No tool integration: cannot leverage external tools or APIs
Why We're Evolving: Decoder-Only Models
As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements:
- Explainability: Users need to understand why a query was routed to a specific model
- Complex reasoning: Some routing decisions require multi-step analysis
- Agentic workflows: Integration with tool calling, function execution, and external APIs
- Advanced techniques: Reinforcement learning (RL), sophisticated post-training methods
- Domain expertise: Specialized routing for legal, medical, scientific domains
The Solution: Expand to decoder-only models while keeping encoder speed where it matters.
The MoM Architecture: Best of Both Worlds
Mixture-of-Models (MoM) is both a philosophy and an architecture:
- Backend LLM Architecture → route requests to the optimal downstream model (GPT-4, Claude, Llama, etc.)
- Router Internal Design → the router itself uses multiple specialized models working together
Our MoM approach combines encoder and decoder strengths:
- Encoders (ModernBERT) → fast classification (sub-10ms latency) for high-throughput scenarios
- Decoders (Qwen3) → explainable decisions with reasoning for transparency
- Domain Agents (Qwen3) → expert problem solving with specialized knowledge
This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.
Key Insight: Just as vLLM-SR routes to different backend LLMs, the router itself is powered by a mixture of specialized models, each optimized for a specific routing task (security, similarity, intent classification, domain expertise).
The MoM Model Family
We organize MoM models into four categories; the Intelligent Routing family comes in three size variants (Flash, Pro, Max):
Intelligent Routing
Smart routing models with three size variants:
| Model | Size | Base Model | Purpose |
|---|---|---|---|
| `mom-brain-flash` | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
| `mom-brain-pro` | Pro | Qwen3 0.6B | Balanced performance with reasoning capabilities |
| `mom-brain-max` | Max | Qwen3 1.7B | Maximum accuracy for complex routing decisions |
Architecture: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen3 0.6B and 1.7B (decoder-only) models.
Similarity Search
Semantic similarity and vector search:
| Model | Size | Base Model | Purpose |
|---|---|---|---|
| `mom-similarity-flash` | Flash | ModernBERT | Fast semantic similarity matching for route selection |
Architecture: Based on ModernBERT (encoder-only) for high-speed embedding generation.
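The matching step can be sketched with plain cosine similarity over embeddings. The toy 2-D vectors below are stand-ins for real `mom-similarity-flash` outputs; only the similarity math is meant literally.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_route(query_vec: list[float], route_vecs: dict) -> str:
    """Pick the route whose embedding is most similar to the query."""
    return max(route_vecs, key=lambda r: cosine(query_vec, route_vecs[r]))

# Toy embeddings standing in for encoder outputs.
routes = {"math": [1.0, 0.0], "law": [0.0, 1.0]}
print(best_route([0.9, 0.1], routes))  # math
```

In production the route embeddings would be precomputed once and the query embedded per request, so selection stays a cheap nearest-neighbor lookup.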
Prompt Guardian
Security and safety checks before routing:
| Model | Size | Base Model | Purpose |
|---|---|---|---|
| `mom-jailbreak-flash` | Flash | ModernBERT | Jailbreak/attack detection (security) |
| `mom-pii-flash` | Flash | ModernBERT | PII detection (privacy protection) |
Architecture: Both based on ModernBERT (encoder-only) for ultra-fast security checks.
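As a rough illustration of what the guardian stage decides, here is a trivial regex pre-filter. This is illustrative only: the real `mom-pii-flash` is a trained encoder classifier, and the two patterns below are stand-ins, not its detection logic.

```python
import re

# Illustrative-only patterns; the actual mom-pii-flash is a trained
# encoder classifier, not a regex filter.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def contains_pii(text: str) -> bool:
    """Return True if any known PII pattern matches the text."""
    return any(p.search(text) for p in PII_PATTERNS.values())
```

A request flagged here would be blocked or redacted before it ever reaches intent classification, which is the "security at the edge" property described below.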
SLM Experts
Specialized small language models deployed as backend problem solvers:
| Model | Size | Base Model | Domain | Training Data |
|---|---|---|---|---|
| `mom-expert-math-flash` | Flash | Qwen3 0.6B | Mathematics | GSM8K, MATH |
| `mom-expert-science-flash` | Flash | Qwen3 0.6B | Science | ARC-Challenge, OpenBookQA, SciQ |
| `mom-expert-social-flash` | Flash | Qwen3 0.6B | Social Sciences | CommonsenseQA, StrategyQA |
| `mom-expert-humanities-flash` | Flash | Qwen3 0.6B | Humanities | TruthfulQA, MMLU-train subset |
| `mom-expert-law-flash` | Flash | Qwen3 0.6B | Law | MMLU-train law subset + specialized sources |
| `mom-expert-generalist-flash` | Flash | Qwen3 0.6B | Generalist | Mixed from above domains |
Architecture: All based on Qwen3 0.6B (decoder-only) for domain-specific problem solving. Currently only Flash variants are available.
Purpose: These models are not routers; they are deployed as backend LLMs to solve domain-specific problems. They form part of the Mixture-of-Models backend architecture that vLLM-SR routes to.
Design Principles
Safety-First: Prompt Guardian models (PII, jailbreak detection) run before routing, putting security at the edge.
Speed ↔ Capability: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.
Domain Expertise: SLM Expert models are deployed as backend problem solvers, achieving 15-25% better accuracy on domain-specific tasks vs. generalist LLMs. Math problems are solved by math experts, science questions by science experts, etc.
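The speed/capability trade-off can be sketched as a simple variant picker. The latency thresholds here are illustrative assumptions, not published SLAs.

```python
def pick_brain_variant(latency_budget_ms: float, needs_reasoning: bool) -> str:
    """Choose a mom-brain variant from a latency budget and a reasoning flag.

    The 10ms and 100ms thresholds are illustrative, not official SLAs.
    """
    if latency_budget_ms < 10:
        return "mom-brain-flash"   # encoder, sub-10ms classification
    if needs_reasoning and latency_budget_ms >= 100:
        return "mom-brain-max"     # largest decoder, maximum accuracy
    return "mom-brain-pro"         # balanced decoder with reasoning
```

A gateway could call this per request, so tight interactive paths get the encoder while offline or high-stakes routing gets the decoder.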
How vLLM-SR Uses MoM
MoM operates at two levels in vLLM-SR:
Level 1: Router Internal Architecture (MoM Inside)
The router itself is a mixture of specialized models working together in a pipeline:
- Security Check → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
- Intent Classification → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
- Similarity Search → `mom-similarity-flash` finds semantically similar routes
Each stage uses the right model for the right task: fast encoders for security checks, reasoning decoders for complex decisions.
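The three-stage pipeline can be sketched with stub functions standing in for the MoM models. The heuristics inside each stub are placeholders for illustration, not the actual classifiers.

```python
# Stub stages standing in for the MoM models in the pipeline above.
def is_jailbreak(query: str) -> bool:
    # Placeholder heuristic for mom-jailbreak-flash.
    return "ignore previous instructions" in query.lower()

def has_pii(query: str) -> bool:
    # Placeholder heuristic for mom-pii-flash.
    return "@" in query

def classify_intent(query: str) -> str:
    # Placeholder heuristic for mom-brain-*.
    return "math" if any(ch.isdigit() for ch in query) else "general"

def route(query: str) -> str:
    """Run the three routing stages in order."""
    # 1. Security check: mom-jailbreak-flash + mom-pii-flash
    if is_jailbreak(query) or has_pii(query):
        return "blocked"
    # 2. Intent classification: mom-brain-*
    intent = classify_intent(query)
    # 3. Similarity search (mom-similarity-flash) would refine the choice
    #    among candidate routes; here we map the intent directly.
    return f"route:{intent}"
```

The structural point is the ordering: cheap encoder checks run first and can short-circuit the pipeline before any heavier model is invoked.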
Level 2: Backend LLM Orchestration (MoM Outside)
The router then directs requests to the optimal backend LLM from a mixture of models:
General-Purpose LLMs:
- Simple queries → lightweight models (Llama 3.2, Qwen2.5)
- Complex queries → premium models (GPT-4, Claude 3.5)
Domain-Specific SLM Experts (`mom-expert-*`):
- Math problems → `mom-expert-math-flash` (Qwen3 0.6B trained on GSM8K, MATH)
- Science questions → `mom-expert-science-flash` (Qwen3 0.6B trained on ARC, SciQ)
- Social sciences → `mom-expert-social-flash` (Qwen3 0.6B trained on CommonsenseQA, StrategyQA)
- Humanities → `mom-expert-humanities-flash` (Qwen3 0.6B trained on TruthfulQA, MMLU)
- Legal queries → `mom-expert-law-flash` (Qwen3 0.6B trained on MMLU law + specialized sources)
- General tasks → `mom-expert-generalist-flash` (Qwen3 0.6B trained on a mix of the above)
This dual-level MoM architecture achieves 2x+ cost reduction while maintaining quality, similar to RouteLLM.
The Philosophy: Mixture-of-Models all the way down, from the router's internal decision-making to the backend LLM pool (including both general-purpose LLMs and specialized SLM experts).
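Level 2 dispatch can be sketched as a lookup from classified intent to a backend model. The expert mapping follows the lists above; the general-purpose fallback names are hypothetical placeholders for whatever lightweight/premium models are configured.

```python
# Intent → domain-expert backend, following the lists above.
EXPERT_BACKENDS = {
    "math": "mom-expert-math-flash",
    "science": "mom-expert-science-flash",
    "social": "mom-expert-social-flash",
    "humanities": "mom-expert-humanities-flash",
    "law": "mom-expert-law-flash",
}

def pick_backend(intent: str, complex_query: bool) -> str:
    """Pick a backend: a domain expert if one exists, otherwise a
    general-purpose LLM tiered by complexity (tier names are placeholders)."""
    if intent in EXPERT_BACKENDS:
        return EXPERT_BACKENDS[intent]
    return "premium-llm" if complex_query else "lightweight-llm"
```

This is where the cost savings come from: a 0.6B expert or a lightweight model handles most traffic, and the premium tier is reserved for genuinely complex general queries.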
What's Next: Exploring Frontier Techniques
The move to decoder-only models opens exciting possibilities for vLLM-SR:
Agentic Routing
Decoder models can act as intelligent agents that:
- Dynamically select and orchestrate multiple models
- Make multi-step routing decisions with tool calling
- Adapt routing strategies based on feedback
Reinforcement Learning (RL)
Apply RL techniques to optimize routing decisions:
- Learn from user feedback and model performance
- Discover optimal routing policies through trial and error
- Continuously improve cost-quality trade-offs
Advanced Post-Training
Leverage cutting-edge post-training methods:
- Distillation: Transfer knowledge from large models to efficient routers
- Preference learning: Train on human feedback (RLHF, DPO)
- Domain adaptation: Fine-tune for specific industries or use cases
Tool Integration
Enable routers to:
- Call external APIs for context-aware routing
- Query databases for historical routing patterns
- Integrate with monitoring systems for real-time optimization
The vision: vLLM-SR routers that not only classify but reason, learn, and adapt.
Model Naming Convention
- `mom-{category}-{size}`
- `mom-expert-{domain}-{size}`
Four Categories
- Intelligent Routing: `mom-brain-{flash|pro|max}`
- Similarity Search: `mom-similarity-flash`
- Prompt Guardian: `mom-{jailbreak|pii}-flash`
- SLM Experts: `mom-expert-{domain}-flash`, where domain = `{math|science|social|humanities|law|generalist}`
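The convention above can be validated and parsed with a small regex; a sketch, where the returned dict shape is an assumption rather than an official API:

```python
import re

# Matches the MoM naming convention described above.
NAME_RE = re.compile(
    r"^mom-(?P<category>"
    r"expert-(?P<domain>math|science|social|humanities|law|generalist)"
    r"|brain|similarity|jailbreak|pii)"
    r"-(?P<size>flash|pro|max)$"
)

def parse_mom_name(name: str) -> dict:
    """Split a MoM model name into category, optional domain, and size."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"not a MoM model name: {name}")
    return {
        "category": m.group("category").split("-")[0],  # "expert-math" -> "expert"
        "domain": m.group("domain"),                    # None for non-experts
        "size": m.group("size"),
    }
```

For example, `parse_mom_name("mom-expert-law-flash")` yields category `expert`, domain `law`, size `flash`, while a non-conforming name raises `ValueError`.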
Three Size Variants
- flash: ModernBERT-based (brain/similarity/guardian) or Qwen3 0.6B (experts) → fastest, sub-10ms latency
- pro: Qwen3 0.6B (brain only) → balanced performance with reasoning
- max: Qwen3 1.7B (brain only) → maximum accuracy and capabilities
Architecture Summary
- Intelligent Routing: Flash (ModernBERT) + Pro/Max (Qwen3 0.6B/1.7B)
- Similarity Search: Flash (ModernBERT)
- Prompt Guardian: Flash (ModernBERT)
- SLM Experts: Flash only (Qwen3 0.6B) → 6 domain specialists
Get Started
All MoM models are available on Hugging Face.
vLLM-SR · Route with intent. Think with reason.