LLM Inference MCP Server

Multi-model LLM routing with cost optimization, structured output, and batch inference.

Features

  • Smart Routing: Automatically selects the best model based on task type (quick/reasoning/creative/code/extraction/chat/translation/summary)
  • Multi-Provider: Supports OpenAI, DeepSeek, Anthropic Claude, and local vLLM
  • Cost Optimization: Compare costs across models, track spending, find cheapest options
  • Structured Output: Force JSON schema output from any model (native or prompt-engineered)
  • Batch Inference: Process up to 50 prompts in parallel
  • Token Counting: Estimate tokens for text/messages before making API calls
  • Model Comparison: Run the same prompt on multiple models side by side

Tools

| Tool | Description |
|------|-------------|
| list_models | List all available models with specs and pricing |
| chat_completion | Chat completion with smart routing and automatic fallback |
| structured_output | Force JSON schema output from any model |
| batch_inference | Process multiple prompts in parallel (up to 50) |
| count_tokens | Estimate tokens for text or messages |
| estimate_inference_cost | Compare costs across models before running |
| compare_models | Run the same prompt on multiple models and compare |
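When a model has no native JSON mode, structured_output falls back to prompt engineering: the schema is embedded in the prompt and the reply is validated after the fact. A minimal sketch of that fallback, using hypothetical helper names (build_structured_prompt, parse_structured_reply) that are not part of the server's actual API:

```python
import json

def build_structured_prompt(user_prompt: str, schema: dict) -> str:
    """Wrap a prompt so the model is instructed to reply with JSON
    matching `schema` (the prompt-engineered fallback path)."""
    return (
        f"{user_prompt}\n\n"
        "Respond ONLY with a JSON object matching this schema:\n"
        f"{json.dumps(schema)}"
    )

def parse_structured_reply(reply: str, schema: dict) -> dict:
    """Parse the model's reply and check that required keys are present."""
    data = json.loads(reply)
    missing = [k for k in schema.get("required", []) if k not in data]
    if missing:
        raise ValueError(f"reply missing required keys: {missing}")
    return data
```

Native JSON-schema modes (where the provider supports them) skip the parse-and-validate step, which is why structured output works on "any model" here: the server only needs the fallback when the provider lacks a native mode.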

Configuration

Set environment variables to enable providers:

# OpenAI
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://api.openai.com/v1"  # optional

# DeepSeek
export DEEPSEEK_API_KEY="sk-..."
export DEEPSEEK_BASE_URL="https://api.deepseek.com/v1"  # optional

# Anthropic Claude
export ANTHROPIC_API_KEY="sk-ant-..."
export ANTHROPIC_BASE_URL="https://api.anthropic.com"  # optional

# vLLM (local GPU inference)
export VLLM_BASE_URL="http://localhost:8000/v1"  # default

# Custom OpenAI-compatible providers (up to 5)
export CUSTOM_LLM_1_API_KEY="..."
export CUSTOM_LLM_1_BASE_URL="https://..."
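A provider is enabled when its key variable is set; vLLM needs no API key and falls back to the default base URL shown above. A minimal sketch of that detection logic (the mapping and function name are illustrative, not the server's internals):

```python
# Illustrative: which env var gates each key-based provider.
PROVIDER_ENV = {
    "openai": "OPENAI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def enabled_providers(env: dict) -> list:
    """Return the providers usable under the given environment."""
    providers = [name for name, var in PROVIDER_ENV.items() if env.get(var)]
    # vLLM requires only a base URL; the documented default applies
    # when VLLM_BASE_URL is unset, so it is always considered enabled.
    if env.get("VLLM_BASE_URL", "http://localhost:8000/v1"):
        providers.append("vllm")
    return providers
```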

Installation

Using with Claude Desktop / Cursor / Windsurf

Add to your MCP settings:

{
  "mcpServers": {
    "llm-inference": {
      "command": "python",
      "args": ["-m", "llm_inference_mcp.server"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "DEEPSEEK_API_KEY": "sk-...",
        "VLLM_BASE_URL": "http://localhost:8000/v1"
      }
    }
  }
}

Using with uvx

uvx llm-inference-mcp

Using with pip

pip install llm-inference-mcp
llm-inference-mcp

Smart Routing

The router automatically selects the best model based on task type:

| Task Type | Description | Prioritizes |
|-----------|-------------|-------------|
| quick | Quick, simple tasks | Speed + Cost |
| reasoning | Complex analysis | Quality + Reasoning |
| creative | Writing, brainstorming | Quality + Diversity |
| code | Code generation | Accuracy + Quality |
| extraction | Data extraction | Reliability + Speed |
| chat | General Q&A | Balance |
| translation | Translation | Accuracy |
| summary | Summarization | Speed + Cost |
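One way to picture this routing: weight each candidate model's specs by the task's priorities and pick the highest score. The weights and model specs below are invented for illustration; the server's real scoring may differ:

```python
# Invented per-task priority weights, mirroring the table above.
TASK_WEIGHTS = {
    "quick":     {"speed": 0.5, "cost": 0.5},
    "reasoning": {"quality": 0.7, "reasoning": 0.3},
}

# Invented 0-1 spec scores for two candidate models.
MODELS = {
    "gpt-4o-mini": {"speed": 0.9, "cost": 0.9, "quality": 0.6, "reasoning": 0.5},
    "deepseek-r1": {"speed": 0.4, "cost": 0.8, "quality": 0.8, "reasoning": 0.95},
}

def route(task_type: str) -> str:
    """Pick the model with the best weighted score for this task type."""
    weights = TASK_WEIGHTS[task_type]
    def score(specs):
        return sum(w * specs.get(dim, 0.0) for dim, w in weights.items())
    return max(MODELS, key=lambda name: score(MODELS[name]))
```

Under these made-up numbers, "quick" routes to the fast, cheap model and "reasoning" to the reasoning-heavy one, which is the behavior the table describes.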

Cost Optimization Examples

# Find cheapest model for a task
chat_completion(messages=[...], prefer_cheapest=True)

# Compare costs before running
estimate_inference_cost(input_text="Analyze this report...", output_tokens=1000)

# Use local vLLM for zero-cost inference
chat_completion(messages=[...], provider="vllm")
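The arithmetic behind estimate_inference_cost is straightforward: prices are quoted per million tokens, so the cost is tokens / 1,000,000 times the rate, summed over input and output. A sketch using the DeepSeek V3 rates from the pricing table (the token counts are an assumed call size):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_per_1m: float, output_per_1m: float) -> float:
    """Dollar cost of one call given per-1M-token rates."""
    return (input_tokens / 1_000_000 * input_per_1m
            + output_tokens / 1_000_000 * output_per_1m)

# DeepSeek V3: $0.14 input / $0.28 output per 1M tokens
cost = estimate_cost(input_tokens=2_000, output_tokens=1_000,
                     input_per_1m=0.14, output_per_1m=0.28)  # ~ $0.00056
```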

Supported Models

| Provider | Models | Input $/1M | Output $/1M |
|----------|--------|------------|-------------|
| OpenAI | GPT-4o, GPT-4o Mini, O3 Mini | $0.15-$10.00 | $0.60-$30.00 |
| DeepSeek | DeepSeek V3, R1 | $0.14-$0.55 | $0.28-$2.19 |
| Anthropic | Claude Sonnet 4, Haiku 3.5 | $0.80-$3.00 | $4.00-$15.00 |
| vLLM | Qwen2.5-7B, any local model | Free | Free |

License

MIT
