LLM Inference MCP Server

Multi-model LLM routing with cost optimization, structured output, and batch inference.

Features

Smart Routing: Automatically selects the best model based on task type (quick/reasoning/creative/code/extraction/chat/translation/summary)
Multi-Provider: Supports OpenAI, DeepSeek, Anthropic Claude, and local vLLM
Cost Optimization: Compare costs across models, track spending, find cheapest options
Structured Output: Force JSON schema output from any model (native or prompt-engineered)
Batch Inference: Process up to 50 prompts in parallel
Token Counting: Estimate tokens for text/messages before making API calls
Model Comparison: Run the same prompt on multiple models and compare the results side by side
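
For models without native JSON-schema support, the "prompt-engineered" path presumably means asking for JSON in the prompt and then checking the reply. The sketch below illustrates that idea with the standard library only. `extract_json` and `check_required` are hypothetical helpers written for this example, not part of this server's API, and the check covers only the `required` keys of a schema.

```python
import json


def extract_json(reply: str) -> dict:
    """Parse the first JSON object found in a model reply."""
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object in reply")
    # raw_decode parses one JSON value and ignores any trailing text.
    obj, _ = json.JSONDecoder().raw_decode(reply[start:])
    return obj


def check_required(obj: dict, schema: dict) -> list[str]:
    """Return the names of required keys that are missing from obj."""
    return [key for key in schema.get("required", []) if key not in obj]


# Hypothetical schema and reply, for illustration only.
schema = {"type": "object", "required": ["name", "age"]}
reply = 'Sure, here is the result: {"name": "Ada", "age": 36} Hope that helps.'

data = extract_json(reply)
missing = check_required(data, schema)
print(data, missing)
```

A real implementation would likely validate types as well and retry the request when `missing` is non-empty.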

Tools

| Tool | Description |
| --- | --- |
| `list_models` | List all available models with specs and pricing |
| `chat_completion` | Chat completion with smart routing and automatic fallback |
| `structured_output` | Force JSON schema output from any model |
| `batch_inference` | Process multiple prompts in parallel (up to 50) |
| `count_tokens` | Estimate tokens for text or messages |
| `estimate_inference_cost` | Compare costs across models before running |
| `compare_models` | Run the same prompt on multiple models and compare the results |
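
To make the parallel batch behaviour concrete, here is a minimal sketch of bounded-concurrency batching with `asyncio`. It is not the server's implementation: `fake_model_call`, `run_batch`, and the `concurrency` parameter are hypothetical, and only the limit of 50 prompts comes from the description above.

```python
import asyncio

MAX_BATCH = 50  # batch limit stated in the tool description


async def fake_model_call(prompt: str) -> str:
    """Hypothetical stand-in for a real provider request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return prompt.upper()


async def run_batch(prompts: list[str], concurrency: int = 8) -> list[str]:
    """Run all prompts concurrently, at most `concurrency` at a time."""
    if len(prompts) > MAX_BATCH:
        raise ValueError(f"batch size {len(prompts)} exceeds {MAX_BATCH}")
    gate = asyncio.Semaphore(concurrency)

    async def guarded(prompt: str) -> str:
        async with gate:
            return await fake_model_call(prompt)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in prompts))


results = asyncio.run(run_batch(["hello", "world"]))
print(results)
```

The semaphore keeps the number of in-flight requests bounded, which helps stay under provider rate limits even when the batch itself is large.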

Configuration

Set environment variables to enable providers:

# OpenAI
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://api.openai.com/v1"  # optional

# DeepSeek
export DEEPSEEK_API_KEY="sk-..."
export DEEPSEEK_BASE_URL="https://api.deepseek.com/v1"  # optional

# Anthropic Claude
export ANTHROPIC_API_KEY="sk-ant-..."
export ANTHROPIC_BASE_URL="https://api.anthropic.com"  # optional

# vLLM (local GPU inference)
export VLLM_BASE_URL="http://localhost:8000/v1"  # default

# Custom OpenAI-compatible providers (up to 5)
export CUSTOM_LLM_1_API_KEY="..."
export CUSTOM_LLM_1_BASE_URL="https://..."
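
As a rough illustration of how these variables could map to enabled providers, the sketch below inspects an environment mapping. The variable names come from the configuration above, while `enabled_providers` and its rules are assumptions. For example, vLLM has a default base URL, so the real server may enable it even when `VLLM_BASE_URL` is unset.

```python
import os

# Variable names are from the configuration above; treating each one as
# the switch that enables its provider is an assumption of this sketch.
PROVIDER_VARS = {
    "openai": "OPENAI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "vllm": "VLLM_BASE_URL",
}


def enabled_providers(env: dict[str, str]) -> list[str]:
    """Return providers whose enabling variable is set and non-empty."""
    found = [name for name, var in PROVIDER_VARS.items() if env.get(var)]
    # Custom OpenAI-compatible providers use numbered slots 1 to 5.
    for slot in range(1, 6):
        has_key = env.get(f"CUSTOM_LLM_{slot}_API_KEY")
        has_url = env.get(f"CUSTOM_LLM_{slot}_BASE_URL")
        if has_key and has_url:
            found.append(f"custom_{slot}")
    return found


print(enabled_providers(dict(os.environ)))
```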

Installation

Using with Claude Desktop / Cursor / Windsurf

Add to your MCP settings:

{
  "mcpServers": {
    "llm-inference": {
      "command": "python",
      "args": ["-m", "llm_inference_mcp.server"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "DEEPSEEK_API_KEY": "sk-...",
        "VLLM_BASE_URL": "http://localhost:8000/v1"
      }
    }
  }
}

Using with uvx

uvx llm-inference-mcp

Using with pip

pip install llm-inference-mcp
llm-inference-mcp

Smart Routing

The router automatically selects the best model based on task type:

| Task Type | Description | Prioritizes |
| --- | --- | --- |
| `quick` | Quick, simple tasks | Speed + Cost |
| `reasoning` | Complex analysis | Quality + Reasoning |
| `creative` | Writing, brainstorming | Quality + Diversity |
| `code` | Code generation | Accuracy + Quality |
| `extraction` | Data extraction | Reliability + Speed |
| `chat` | General Q&A | Balance |
| `translation` | Translation | Accuracy |
| `summary` | Summarization | Speed + Cost |
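
One way to picture priority-based routing is a weighted score per task type. The sketch below is an illustration under assumptions, not the router's actual algorithm: the `Model` fields, the numeric weights, and the model names are all hypothetical, since the table names priorities but not numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    speed: float    # 0..1, higher is faster
    cost: float     # 0..1, higher is cheaper
    quality: float  # 0..1, higher is better


# Hypothetical weights for three of the task types.
WEIGHTS = {
    "quick": {"speed": 0.5, "cost": 0.5, "quality": 0.0},
    "reasoning": {"speed": 0.0, "cost": 0.0, "quality": 1.0},
    "chat": {"speed": 1 / 3, "cost": 1 / 3, "quality": 1 / 3},
}


def pick_model(task_type: str, models: list[Model]) -> Model:
    """Return the model with the highest weighted score for the task."""
    w = WEIGHTS[task_type]

    def score(m: Model) -> float:
        return w["speed"] * m.speed + w["cost"] * m.cost + w["quality"] * m.quality

    return max(models, key=score)


# Hypothetical catalog, for illustration only.
catalog = [
    Model("small-fast", speed=0.9, cost=0.9, quality=0.5),
    Model("large-smart", speed=0.4, cost=0.2, quality=0.95),
]
print(pick_model("quick", catalog).name)
print(pick_model("reasoning", catalog).name)
```

With these made-up numbers, `quick` favours the small model and `reasoning` favours the large one, which matches the intent of the priorities listed in the table.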

Cost Optimization Examples

# Find cheapest model for a task
chat_completion(messages=[...], prefer_cheapest=True)

# Compare costs before running
estimate_inference_cost(input_text="Analyze this report...", output_tokens=1000)

# Use local vLLM to avoid per-token API charges
chat_completion(messages=[...], provider="vllm")
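
The arithmetic behind a per-request cost estimate is simple when prices are quoted per one million tokens. The sketch below shows it. `estimate_cost` is a hypothetical helper rather than the server's `estimate_inference_cost` tool, and the model names and price points are made up for illustration.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Cost in USD, with prices quoted per one million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical price points in USD per 1M tokens: (input, output).
PRICES = {
    "budget-model": (0.15, 0.60),
    "premium-model": (10.00, 30.00),
}

for name, (in_price, out_price) in PRICES.items():
    cost = estimate_cost(2_000, 1_000, in_price, out_price)
    print(f"{name}: ${cost:.4f}")
```

Output tokens usually cost more than input tokens, so the expected response length often dominates the estimate.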

Supported Models

| Provider | Models | Input ($/1M tokens) | Output ($/1M tokens) |
| --- | --- | --- | --- |
| OpenAI | GPT-4o, GPT-4o Mini, O3 Mini | $0.15-$10.00 | $0.60-$30.00 |
| DeepSeek | DeepSeek V3, R1 | $0.14-$0.55 | $0.28-$2.19 |
| Anthropic | Claude Sonnet 4, Haiku 3.5 | $0.80-$3.00 | $4.00-$15.00 |
| vLLM | Qwen2.5-7B, any local model | FREE | FREE |

License

MIT
