Skip to main content

MoM: Specialized Models for Intelligent Routing

Β· 8 min read
Xunzhuo Liu
Software Engineer @ Tencent

MoM Family

One fabric. Many minds. We're introducing MoM (Mixture of Models)β€”a family of specialized routing models that power vLLM-SR's intelligent decision-making.

Why MoM?​

vLLM-SR solves a critical problem: how to route LLM requests to the right model at the right time. Not every query needs the same resourcesβ€”"What's the weather?" shouldn't cost as much as "Analyze this legal contract."

MoM System Card​

A quick overview of all MoM models:

CategoryModelSizeArchitectureBase ModelPurpose
🧠 Intelligent Routingmom-brain-flashFlashEncoderModernBERTUltra-fast intent classification
mom-brain-proProDecoderQwen3 0.6BBalanced routing with reasoning
mom-brain-maxMaxDecoderQwen3 1.7BMaximum accuracy for complex decisions
πŸ” Similarity Searchmom-similarity-flashFlashEncoderModernBERTSemantic similarity matching
πŸ”’ Prompt Guardianmom-jailbreak-flashFlashEncoderModernBERTJailbreak/attack detection
mom-pii-flashFlashEncoderModernBERTPII detection & privacy protection
🎯 SLM Expertsmom-expert-math-flashFlashDecoderQwen3 0.6BBackend math problem solver
mom-expert-science-flashFlashDecoderQwen3 0.6BBackend science problem solver
mom-expert-social-flashFlashDecoderQwen3 0.6BBackend social sciences solver
mom-expert-humanities-flashFlashDecoderQwen3 0.6BBackend humanities solver
mom-expert-law-flashFlashDecoderQwen3 0.6BBackend law problem solver
mom-expert-generalist-flashFlashDecoderQwen3 0.6BBackend generalist solver

Key Insights:

  • 4 Categories: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
  • ModernBERT (encoder-only) β†’ Sub-10ms latency for high-throughput routing
  • Qwen3 (decoder-only) β†’ Explainable routing decisions + domain-specific problem solving
  • Flash models achieve 10,000+ QPS on commodity hardware
  • SLM Experts are not routersβ€”they are specialized backend models that solve domain-specific problems

The Evolution: From Encoder-Only to Mixture-of-Models​

Where We Started: ModernBERT Foundation​

vLLM-SR initially built its routing intelligence entirely on ModernBERT (encoder-only models):

Advantages:

  • ⚑ Blazing fast: Sub-10ms inference latency
  • πŸ“Š High throughput: 10,000+ QPS on commodity hardware
  • πŸ’° Cost-effective: Minimal compute requirements
  • 🎯 Proven accuracy: Strong performance on classification tasks

Limitations:

  • ❌ Black box decisions: No explanation for routing choices
  • ❌ Limited reasoning: Cannot handle complex, multi-step logic
  • ❌ Fixed capabilities: Hard to extend with new behaviors
  • ❌ No tool integration: Cannot leverage external tools or APIs

Why We're Evolving: Decoder-Only Models​

As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements:

  • Explainability: Users need to understand why a query was routed to a specific model
  • Complex reasoning: Some routing decisions require multi-step analysis
  • Agentic workflows: Integration with tool calling, function execution, and external APIs
  • Advanced techniques: Reinforcement learning (RL), sophisticated post-training methods
  • Domain expertise: Specialized routing for legal, medical, scientific domains

The Solution: Expand to decoder-only models while keeping encoder speed where it matters.

The MoM Architecture: Best of Both Worlds​

Mixture-of-Models (MoM) is both a philosophy and an architecture:

  1. Backend LLM Architecture β€” Route requests to the optimal downstream model (GPT-4, Claude, Llama, etc.)
  2. Router Internal Design β€” The router itself uses multiple specialized models working together

Our MoM approach combines encoder and decoder strengths:

  • ⚑ Encoders (ModernBERT) β€” Fast classification (sub-10ms latency) for high-throughput scenarios
  • 🧠 Decoders (Qwen3) β€” Explainable decisions with reasoning for transparency
  • 🎯 Domain Agents (Qwen3) β€” Expert problem solving with specialized knowledge

This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.

Key Insight: Just as vLLM-SR routes to different backend LLMs, the router itself is powered by a mixture of specialized modelsβ€”each optimized for specific routing tasks (security, similarity, intent classification, domain expertise).

The MoM Model Family​

We organize MoM models into four categories with three size variants (Flash, Pro, Max):

🧠 Intelligent Routing​

Smart routing models with three size variants:

ModelSizeBase ModelPurpose
mom-brain-flashFlashModernBERTUltra-fast intent classification (sub-10ms latency)
mom-brain-proProQwen3 0.6BBalanced performance with reasoning capabilities
mom-brain-maxMaxQwen3 1.7BMaximum accuracy for complex routing decisions

Architecture: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen3 0.6B and 1.7B (decoder-only) models.

Semantic similarity and vector search:

ModelSizeBase ModelPurpose
mom-similarity-flashFlashModernBERTFast semantic similarity matching for route selection

Architecture: Based on ModernBERT (encoder-only) for high-speed embedding generation.

πŸ”’ Prompt Guardian​

Security and safety checks before routing:

ModelSizeBase ModelPurpose
mom-jailbreak-flashFlashModernBERTJailbreak/attack detection (security)
mom-pii-flashFlashModernBERTPII detection (privacy protection)

Architecture: Both based on ModernBERT (encoder-only) for ultra-fast security checks.

🎯 SLM Experts​

Specialized small language models deployed as backend problem solvers:

ModelSizeBase ModelDomainTraining Data
mom-expert-math-flashFlashQwen3 0.6BMathematicsGSM8K, MATH
mom-expert-science-flashFlashQwen3 0.6BScienceARC-Challenge, OpenBookQA, SciQ
mom-expert-social-flashFlashQwen3 0.6BSocial SciencesCommonsenseQA, StrategyQA
mom-expert-humanities-flashFlashQwen3 0.6BHumanitiesTruthfulQA, MMLU-train subset
mom-expert-law-flashFlashQwen3 0.6BLawMMLU-train law subset + specialized sources
mom-expert-generalist-flashFlashQwen3 0.6BGeneralistMixed from above domains

Architecture: All based on Qwen3 0.6B (decoder-only) for domain-specific problem solving. Currently only Flash variants are available.

Purpose: These models are not routersβ€”they are deployed as backend LLMs to solve domain-specific problems. They form part of the Mixture-of-Models backend architecture that vLLM-SR routes to.

Design Principles​

Safety-First: Prompt Guardian models (PII, jailbreak detection) run before routingβ€”security at the edge.

Speed ↔ Capability: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.

Domain Expertise: SLM Expert models are deployed as backend problem solvers, achieving 15-25% better accuracy on domain-specific tasks vs. generalist LLMs. Math problems are solved by math experts, science questions by science experts, etc.

How vLLM-SR Uses MoM​

MoM operates at two levels in vLLM-SR:

Level 1: Router Internal Architecture (MoM Inside)​

The router itself is a mixture of specialized models working together in a pipeline:

  1. Security Check β†’ mom-jailbreak-flash and mom-pii-flash filter malicious/sensitive requests
  2. Intent Classification β†’ mom-brain-* models (flash/pro/max) determine query type and routing decisions
  3. Similarity Search β†’ mom-similarity-flash finds semantically similar routes

Each stage uses the right model for the right task: fast encoders for security checks, reasoning decoders for complex decisions.

Level 2: Backend LLM Orchestration (MoM Outside)​

The router then directs requests to the optimal backend LLM from a mixture of models:

General-Purpose LLMs:

  • Simple queries β†’ Lightweight models (Llama 3.2, Qwen3 2.5)
  • Complex queries β†’ Premium models (GPT-4, Claude 3.5)

Domain-Specific SLM Experts (mom-expert-*):

  • Math problems β†’ mom-expert-math-flash (Qwen3 0.6B trained on GSM8K, MATH)
  • Science questions β†’ mom-expert-science-flash (Qwen3 0.6B trained on ARC, SciQ)
  • Social sciences β†’ mom-expert-social-flash (Qwen3 0.6B on CommonsenseQA, StrategyQA)
  • Humanities β†’ mom-expert-humanities-flash (Qwen3 0.6B on TruthfulQA, MMLU)
  • Legal queries β†’ mom-expert-law-flash (Qwen3 0.6B on MMLU law + specialized sources)
  • General tasks β†’ mom-expert-generalist-flash (Qwen3 0.6B on mixed training)

This dual-level MoM architecture achieves 2x+ cost reduction while maintaining quality, similar to RouteLLM.

The Philosophy: Mixture-of-Models all the way downβ€”from the router's internal decision-making to the backend LLM pool (including both general-purpose LLMs and specialized SLM experts).

What's Next: Exploring Frontier Techniques​

The move to decoder-only models opens exciting possibilities for vLLM-SR:

πŸ€– Agentic Routing​

Decoder models can act as intelligent agents that:

  • Dynamically select and orchestrate multiple models
  • Make multi-step routing decisions with tool calling
  • Adapt routing strategies based on feedback

🎯 Reinforcement Learning (RL)​

Apply RL techniques to optimize routing decisions:

  • Learn from user feedback and model performance
  • Discover optimal routing policies through trial and error
  • Continuously improve cost-quality trade-offs

πŸ”§ Advanced Post-Training​

Leverage cutting-edge post-training methods:

  • Distillation: Transfer knowledge from large models to efficient routers
  • Preference learning: Train on human feedback (RLHF, DPO)
  • Domain adaptation: Fine-tune for specific industries or use cases

πŸ› οΈ Tool Integration​

Enable routers to:

  • Call external APIs for context-aware routing
  • Query databases for historical routing patterns
  • Integrate with monitoring systems for real-time optimization

The vision: vLLM-SR routers that not only classify but reason, learn, and adapt.

Model Naming Convention​

mom-{category}-{size}
mom-expert-{domain}-{size}

Four Categories​

  1. Intelligent Routing: mom-brain-{flash|pro|max}
  2. Similarity Search: mom-similarity-{flash}
  3. Prompt Guardian: mom-{jailbreak|pii}-{flash}
  4. SLM Experts: mom-expert-{domain}-{flash} where domain = {math|science|social|humanities|law|generalist}

Three Size Variants​

  • flash: ModernBERT-based (for brain/similarity/guardian) or Qwen3 0.6B (for experts) β€” fastest, sub-10ms latency
  • pro: Qwen3 0.6B (for brain) β€” balanced performance with reasoning
  • max: Qwen3 1.7B (for brain) β€” maximum accuracy and capabilities

Architecture Summary​

  • Intelligent Routing: Flash (ModernBERT) + Pro/Max (Qwen3 0.6B/1.7B)
  • Similarity Search: Flash (ModernBERT)
  • Prompt Guardian: Flash (ModernBERT)
  • SLM Experts: Flash only (Qwen3 0.6B) β€” 6 domain specialists

Get Started​

All MoM models are available on Hugging Face.

Resources:


vLLM-SR Β· Route with intent. Think with reason.