infra-advisor-mcp

A Model Context Protocol (MCP) server that estimates GPU requirements, training/inference costs, and cloud-vs-on-prem TCO for AI workloads.

Describe a workload in plain English ("a customer-support chatbot for a 50-person startup", "continual pre-training a 7B model on 50B tokens") and the server returns model recommendations, monthly cost projections, hardware sizing, and a break-even analysis — as structured data or a full Markdown report.

All numbers are produced by deterministic Python calculators (scaling laws, VRAM math, TCO models). No LLM is invoked for the arithmetic, so results are reproducible and auditable. Pricing and hardware specs live in version-controlled YAML.
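To give a feel for the kind of deterministic VRAM arithmetic meant here, weight memory is roughly parameter count times bytes per parameter. The sketch below is illustrative only and is not the project's calculator: the function name, the byte table, and the choice to ignore KV cache, activations, and framework overhead are all assumptions.

```python
# Hypothetical sketch: weight-only VRAM estimate.
# Ignores KV cache, activations, and runtime overhead (assumption).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    """Return memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 1e9


if __name__ == "__main__":
    # Same 8B model at three precisions.
    for precision in ("fp16", "int8", "int4"):
        print(precision, weight_vram_gb(8.0, precision), "GB")
```

Because the function is pure arithmetic on its arguments, the same inputs always give the same result, which is the reproducibility property described above.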

What it can answer

Which model should I use for this task, scale, and budget? (open-source vs. API)
What will inference cost per month across cloud APIs and self-hosted options?
What does a training run cost (SFT / continual pre-training / pre-training / RL) in GPU-hours, wall-clock time, and dollars?
How do I shard the model across those GPUs? Recommends a parallelism strategy (DDP, FSDP/ZeRO-3, or tensor+pipeline parallel) and parallelism degrees based on the model footprint, GPU VRAM, and interconnect.
How many GPUs do I actually need to serve the load? Sizes replicas to the daily output volume at the latency target (so a "cheaper" option can't be silently under-provisioned), and models how quantization (fp8/int8/int4) shrinks VRAM and lifts throughput.
Cloud or on-prem? Full 1/3/5-year TCO with a break-even month.
What are the ongoing on-prem costs — power, cooling, rack, networking, depreciation, and ML-infra staffing?
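For the training-cost question, a common rule of thumb puts dense-transformer training compute at about 6 FLOPs per parameter per token. The sketch below assumes that rule plus a hypothetical peak throughput and MFU; the project's calculators may use different formulas, and real values would come from gpu_specs.yaml rather than the round numbers used here.

```python
def training_gpu_hours(params: float, tokens: float,
                       peak_flops_per_gpu: float, mfu: float) -> float:
    """GPU-hours for one pass over `tokens`, using the ~6*N*D approximation."""
    total_flops = 6.0 * params * tokens               # forward + backward
    effective_flops_per_s = peak_flops_per_gpu * mfu  # what one GPU sustains
    return total_flops / effective_flops_per_s / 3600.0


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Elapsed days, assuming perfect scaling across GPUs (optimistic)."""
    return gpu_hours / num_gpus / 24.0


if __name__ == "__main__":
    # 7B parameters, 50B tokens; 1e15 FLOP/s peak and 40% MFU are hypothetical.
    hours = training_gpu_hours(7e9, 50e9, peak_flops_per_gpu=1e15, mfu=0.40)
    print(f"GPU-hours: {hours:,.0f}")
    print(f"Days on 8 GPUs: {wall_clock_days(hours, 8):.1f}")
    print(f"Cost at a hypothetical $3.00/GPU-hour: ${hours * 3.00:,.0f}")
```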

MCP tools

Tool	Purpose
`generate_full_report`	Main entry point. Runs every tool and returns a complete plain-English Markdown report.
`analyze_task`	Parse a free-text description into structured parameters (scale, use case, domain, token volumes).
`recommend_model`	Rank open- and closed-source models for the task.
`estimate_training_cost`	GPU-hours, wall-clock, cost, and sharding strategy (DDP/FSDP/TP+PP) for pretrain / continual-pretrain / SFT / RL.
`estimate_inference_cost`	Monthly cost across API providers and self-hosted options, with break-even, quantization (fp8/int8/int4), and replica sizing for the latency target.
`compare_cloud_vs_onprem`	Cloud vs. on-prem TCO over 1/3/5 years.
`estimate_maintenance_cost`	Detailed on-prem monthly OpEx + staffing.
`generate_followup_answer`	Focused answer to a single follow-up question with an inline glossary.
`save_report`	Write the report (and follow-ups) to `.md` and `.html`.
`list_available_gpus`	List all GPUs in the database with specs and pricing.
`get_data_freshness_info`	Report `last_updated` dates so you can tell if pricing is stale.
`reload_data`	Re-read the YAML data files without restarting the server.

Requirements

Python 3.11+
An MCP client (e.g. Claude Code, or any MCP-compatible host)

Installation

git clone https://github.com/c3-yang-song/infra-advisor-mcp.git
cd infra-advisor-mcp

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

pip install -e ".[dev]"          # drop [dev] if you don't need tests/lint

This installs an infra-advisor console script that runs the MCP server over stdio.

Connecting to an MCP client

Important: an MCP client launches the server in its own environment — it does not inherit the venv you activated in your shell. Always point the client at the absolute path to the infra-advisor script inside your venv, so it works regardless of your PATH.

Get the absolute path:

echo "$(pwd)/.venv/bin/infra-advisor"
# e.g. /Users/you/infra-advisor-mcp/.venv/bin/infra-advisor

Claude Code

claude mcp add infra-advisor -- /absolute/path/to/infra-advisor-mcp/.venv/bin/infra-advisor

Verify it connected:

claude mcp list          # infra-advisor should show as connected

Generic MCP client (JSON config)

Add this to your client's MCP server configuration (e.g. claude_desktop_config.json for Claude Desktop):

{
  "mcpServers": {
    "infra-advisor": {
      "command": "/absolute/path/to/infra-advisor-mcp/.venv/bin/infra-advisor",
      "args": []
    }
  }
}

(If you installed the package into an environment that is on the client's PATH, you can use the bare "infra-advisor" as the command instead.)
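If you prefer to generate that JSON instead of hand-editing the path, a small script can resolve the absolute path for you. This is a convenience sketch, not part of the package: build_mcp_config is a hypothetical helper, and it assumes the POSIX venv layout (bin/infra-advisor) used in the commands above.

```python
import json
from pathlib import Path


def build_mcp_config(venv_dir: str) -> dict:
    """Build an MCP client config entry that points at the venv's console script."""
    # resolve() makes the path absolute even if venv_dir is relative.
    script = Path(venv_dir).resolve() / "bin" / "infra-advisor"
    return {
        "mcpServers": {
            "infra-advisor": {"command": str(script), "args": []}
        }
    }


if __name__ == "__main__":
    # Run from the repository root so ".venv" resolves to the right directory.
    print(json.dumps(build_mcp_config(".venv"), indent=2))
```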

Note: after editing any .py file, fully restart the MCP server for changes to take effect. YAML-only edits can be picked up with the reload_data tool — no restart needed.

Usage

Once connected, just talk to your MCP client in natural language. Example prompts:

"Generate a full infrastructure report for a customer-support chatbot serving a 50-person startup."
"We want to continually pre-train a 7B model on 50B tokens of legal text. What does it cost on H100s vs A100s?"
"At 5 million tokens/day, is it cheaper to use the OpenAI API or self-host Llama 3.1 8B?"
"Compare 5-year TCO of 8× H100 on AWS vs buying our own cluster at 70% utilization."

The client will call the relevant tools and return the estimates. Start with generate_full_report for a complete picture, then use generate_followup_answer for focused questions.

Using the calculators directly (Python)

The estimators are plain functions and can be imported without an MCP client:

from infra_advisor.tools.report import generate_full_report

print(generate_full_report("A coding assistant for a 50-person startup"))

Example output

Real output from the tools (abridged where noted). Every figure is computed from the bundled data/ — your numbers will track whatever's in those YAML files.

1. Parse a request — `analyze_task`

Ask: "Customer support chatbot for a 50-person SaaS startup, ~2 million tokens/day, near real-time."

{
  "use_case": "inference_only",
  "domain": "nlp",
  "scale": "startup",
  "quality_requirement": "medium",
  "latency_requirement": "realtime",
  "estimated_daily_input_tokens": 1400000,
  "estimated_daily_output_tokens": 600000,
  "team_ml_expertise": "low",
  "on_prem_preference": false,
  "key_constraints": ["Real-time latency required (<1s response)"],
  "open_questions": [
    "What is your monthly infrastructure budget?",
    "What is your acceptable latency (P50 / P99)?",
    "How many concurrent users or requests do you expect at peak?"
  ]
}
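Since the result is plain JSON, downstream code can sanity-check it before acting on it. The sketch below uses a subset of the fields shown above and confirms that the input and output estimates add up to the ~2 million tokens/day in the request; daily_token_summary is a hypothetical helper, not a function shipped with the package.

```python
import json

# Subset of the analyze_task output shown above.
SAMPLE = """
{
  "use_case": "inference_only",
  "latency_requirement": "realtime",
  "estimated_daily_input_tokens": 1400000,
  "estimated_daily_output_tokens": 600000
}
"""


def daily_token_summary(task: dict) -> dict:
    """Total daily tokens and the share of them that are output tokens."""
    inp = task["estimated_daily_input_tokens"]
    out = task["estimated_daily_output_tokens"]
    total = inp + out
    return {"total": total, "output_share": out / total}


if __name__ == "__main__":
    print(daily_token_summary(json.loads(SAMPLE)))
```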

2. Full report — `generate_full_report` (abridged)

generate_full_report(...) returns one Markdown document. Its Executive Summary for the request above:

The five things you need to know before reading the full report:

1. What you're building: Inference Only system for a Startup in the nlp domain. Estimated 1.4M input + 0.6M output tokens per day.
2. Cloud vs. self-host: You're in the middle range where it depends on growth trajectory. Start with cloud APIs, monitor spend, and revisit self-hosting at 3× current volume.
3. Training cost: Not applicable — this is an inference-only workload.
4. Hardware: No hardware purchase recommended at this stage. Use cloud APIs or managed inference providers until monthly spend exceeds ~$5,000.
5. Staffing: A single ML engineer (or 0.5 FTE of an existing engineer) can manage a small self-hosted deployment.

…followed by the scored model shortlist:

Rank	Model	Type	Size	Context Window	Price (In / Out per 1M)	Cost Tier
1	OpenAI GPT-4o Mini	Closed Source	Undisclosed	128,000 tokens	$0.15 / $0.60	Very Low ($)
2	Google Gemini 2.0 Flash	Closed Source	Undisclosed	1,000,000 tokens	$0.03 / $0.17	Very Low ($)
5	Meta LLaMA 3.1 8B	Open Source	8.0B	128,000 tokens	Self-hosted	low (self-hosted) / medium (API)

The full report continues through eight sections — inference cost (cloud API vs self-hosted), training cost, cloud-vs-on-prem TCO with a break-even month, on-prem monthly OpEx, a decision checklist, and next steps — plus a plain-English glossary. A data-staleness banner appears automatically when the bundled prices are >30 days old.

3. Follow-up: training cost — `generate_followup_answer`

Ask: "How much does a QLoRA fine-tune of an 8B model cost on one H100?"

Training type: QLoRA Fine-Tuning (4-bit base + adapters) · Model: 8.0B · Dataset: 50,000,000 tokens

Metric	Value	Plain English
GPU config	1× H100 SXM	Minimum to fit the model in VRAM
Sharding	DDP × 1 GPU	Data Parallel — adapters fit on one GPU
GPU-hours	1	All GPUs × hours each
Wall-clock	~0 days	Real elapsed time

Provider	On-demand	Spot (35% off)
Lambda	$2	$1
AWS	$8	$3
On-prem (power only)	$1	excl. $30,000 hardware

Recommendation: Use spot instances to cut cost ~35%; checkpoint every 30 min. Budget for 3 experimental runs: $9–$25.

(QLoRA needs one GPU because only small adapters are trained — full fine-tuning of the same 8B model reports 8 GPUs / ~216 GB.)

4. Follow-up: self-hosting capacity & quantization

Ask (on a ~300M-tokens/day, realtime workload): "How many GPUs and what monthly cost to self-host an open model in int4 for our volume?"

Cheapest cloud API option: Meta LLaMA 3.1 8B via Groq at $531/month.

Self-hosted options — sized for ~3,125 output tok/s peak (realtime latency), int4 weights:

Model	GPU	Total GPUs	Serving Topology	Cloud GPU/mo	On-prem/mo	Break-even vs API
Meta LLaMA 3.1 8B	RTX 4090	4	Single GPU × 4	$1,008	$578	Never
Mistral Mixtral 8x7B	RTX 4090	8	TP=4 × 2	$2,016	$1,155	Never
Meta LLaMA 3.1 8B	A100 80GB SXM	4	Single GPU × 4	$3,715	$554	Never

Recommendation: Consider managed inference (Meta LLaMA 3.1 8B) unless you have ML-ops expertise to self-host.

This shows the two newest levers working together: GPUs are sized to the load (4 replicas of the 8B model to sustain the peak token rate), int4 shrinks each replica, and the tool reports plainly that at this volume the $531/mo managed API beats owning hardware (the break-even column reads "Never").
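The sizing and break-even logic can be sketched in a few lines. Everything here is an assumption for illustration: the 30% output share and 3× peak factor are values that happen to reproduce the ~3,125 tok/s figure above (the tool's actual parameters are not stated here), the 800 tok/s per-replica throughput and the hardware cost are hypothetical, and the function names are not the package's API.

```python
import math
from typing import Optional

SECONDS_PER_DAY = 86_400


def peak_output_tok_per_s(daily_tokens: float, output_share: float,
                          peak_factor: float) -> float:
    """Average output rate scaled by a peak-to-average factor."""
    return daily_tokens * output_share / SECONDS_PER_DAY * peak_factor


def replicas_needed(peak_tok_s: float, per_replica_tok_s: float) -> int:
    """Smallest replica count whose combined throughput covers the peak."""
    return max(1, math.ceil(peak_tok_s / per_replica_tok_s))


def break_even_month(hardware_cost: float, api_monthly: float,
                     self_host_monthly: float) -> Optional[int]:
    """Months until savings repay the hardware; None means it never does."""
    savings = api_monthly - self_host_monthly
    if savings <= 0:
        return None  # self-hosting costs at least as much every month
    return math.ceil(hardware_cost / savings)


if __name__ == "__main__":
    peak = peak_output_tok_per_s(300e6, output_share=0.30, peak_factor=3.0)
    print(round(peak), replicas_needed(peak, per_replica_tok_s=800.0))
    # $531/mo API vs $554/mo on-prem, taken from the table above.
    print(break_even_month(hardware_cost=10_000, api_monthly=531,
                           self_host_monthly=554))
```

With the table's numbers the monthly on-prem cost already exceeds the API bill, so there are no savings to repay the hardware, which is the situation the "Never" entries describe.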

Keeping data current

All pricing and hardware data lives in version-controlled YAML under src/infra_advisor/data/:

File	Holds	Authoritative for
`gpu_specs.yaml`	GPU specs (VRAM, TDP, buy price, MFU, inference throughput), `onprem_overhead`, `planning` defaults, and fallback cloud rates	hardware specs, on-prem costs
`cloud_pricing.yaml`	AWS/GCP/Azure GPU instance rates, `reserved_discounts`, `egress`	cloud GPU-hour rates (overlaid onto `gpu_specs` at load time), committed-use discounts
`model_registry.yaml`	open/closed-source model catalog, managed `inference_providers`	model + API pricing

Each entry carries a last_updated date; reports show a staleness warning when data is older than 30 days, and the get_data_freshness_info tool lists every date.
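The staleness rule amounts to a date comparison. A minimal sketch, assuming last_updated is stored as an ISO YYYY-MM-DD string (the actual on-disk format and helper names are not confirmed here):

```python
from datetime import date

STALE_AFTER_DAYS = 30  # threshold described above


def is_stale(last_updated: str, today: date) -> bool:
    """True when the entry is more than STALE_AFTER_DAYS days old."""
    age_days = (today - date.fromisoformat(last_updated)).days
    return age_days > STALE_AFTER_DAYS


if __name__ == "__main__":
    print(is_stale("2025-01-01", today=date(2025, 2, 15)))  # 45 days old
    print(is_stale("2025-01-01", today=date(2025, 1, 31)))  # exactly 30 days old
```

Passing today in explicitly, rather than calling date.today() inside the function, keeps the check deterministic and easy to test.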

The refresh loop

Run the relevant sync script (see below).
Review the changes — for model/API pricing this is mandatory (scrapers can misread a page).
Set last_updated to today on anything you accept.
Reload — call the reload_data MCP tool (or infra_advisor.data_loader.reload_all()); the YAML loaders are cached, so changes aren't picked up until you do. No server restart needed for YAML-only edits.

# Cloud GPU rates → cloud_pricing.yaml (which the calculators read via gpu_specs overlay)
python scripts/sync_cloud_pricing.py --auto          # --auto writes without the confirm prompt
python scripts/sync_cloud_pricing.py --provider aws  # one provider

# API / model pricing → writes pricing_review.md for you to verify; NEVER edits the registry
python scripts/sync_provider_pricing.py

# New open-source models → prints suggestions; NEVER edits the registry
python scripts/sync_models.py --min-downloads 500000

A GitHub Action (.github/workflows/sync-pricing.yml) runs all three scripts every Monday at 9am UTC and opens a PR if data/ changed. Review it carefully before merging.

Cloud-sync credentials

The cloud fetchers use official APIs and skip cleanly when credentials are absent (so the Action still runs — Azure needs no credentials):

Provider	Requirement	How
Azure	none	Public Retail Prices API
AWS	`boto3` + AWS credentials	`pip install -e ".[sync]"`, then standard AWS creds (`AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY`, IAM `pricing:GetProducts`). Uses the Price List Query API.
GCP	`GCP_BILLING_API_KEY` env var	A Cloud Billing Catalog API key. GCP machine prices are reassembled from component SKUs (GPU + vCPU + RAM); a price is emitted only if every component resolves, otherwise that instance is skipped.

GCP figures are assembled from component SKUs and should be verified against the console — SKU descriptions occasionally change. For CI, set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and GCP_BILLING_API_KEY as repository secrets; the model/API price scraper remains review-only by design.
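The "emit a price only if every component resolves" rule for GCP can be sketched as follows. The SKU keys, quantities, and prices are hypothetical placeholders, and assemble_instance_price is an illustrative name rather than the sync script's actual function.

```python
from typing import Optional


def assemble_instance_price(
    components: dict[str, float],
    sku_prices: dict[str, Optional[float]],
) -> Optional[float]:
    """Sum quantity * unit price over all components.

    Returns None (skip the instance) if any component has no resolved price.
    """
    total = 0.0
    for sku, quantity in components.items():
        unit_price = sku_prices.get(sku)
        if unit_price is None:
            return None  # one missing SKU invalidates the whole instance
        total += unit_price * quantity
    return total


if __name__ == "__main__":
    # Hypothetical hourly unit prices and machine shape.
    prices = {"gpu": 2.00, "vcpu": 0.03, "ram_gb": 0.004}
    shape = {"gpu": 8, "vcpu": 96, "ram_gb": 1360}
    print(assemble_instance_price(shape, prices))
    print(assemble_instance_price(shape, {"gpu": 2.00, "vcpu": 0.03}))
```

Skipping the instance outright avoids publishing a partial sum that would look like a suspiciously cheap machine.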

Development

pytest                 # run the test suite
ruff check src/ tests/ # lint

Tests are split into pure-calculator unit tests (tests/test_calculators.py, no I/O) and full-stack integration tests against the real YAML (tests/test_tools.py).

Architecture

src/infra_advisor/
├── server.py        # FastMCP entry point — registers all @mcp.tool()s
├── constants.py     # shared time-base constants (DAYS/HOURS per month)
├── data_loader.py   # lru_cache YAML loaders; reload_all() clears caches
├── glossary.py      # plain-English term definitions (report + follow-up)
├── data/            # gpu_specs / model_registry / cloud_pricing YAML
├── calculators/     # pure math, no I/O (compute, memory, tco)
└── tools/           # MCP tool implementations (call calculators + data)

Data flow: server.py → tools/ (calls calculators + data_loader) → calculators/ (pure math) + data/ YAML.

Design invariant: calculators never import from tools/ or data_loader — they receive specs as plain dict arguments. See CLAUDE.md for deeper contributor notes.
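In practice the invariant means a calculator looks like an ordinary pure function over plain values. The sketch below shows the shape of such a function for on-prem power cost; the tdp_watts key, the PUE parameter, and the 730 hours/month default are assumptions for illustration, not the project's actual schema or constants.

```python
def monthly_power_cost(gpu_spec: dict, num_gpus: int, usd_per_kwh: float,
                       pue: float = 1.0,
                       hours_per_month: float = 730.0) -> float:
    """Monthly electricity cost in USD for a group of GPUs.

    Takes the GPU spec as a plain dict; does no file or network I/O.
    """
    total_kw = gpu_spec["tdp_watts"] * num_gpus / 1000.0
    # PUE scales IT power up to account for cooling and facility overhead.
    return total_kw * pue * hours_per_month * usd_per_kwh


if __name__ == "__main__":
    # The caller (a tool) would load the spec from YAML and pass it in.
    spec = {"tdp_watts": 700}  # hypothetical entry
    print(monthly_power_cost(spec, num_gpus=8, usd_per_kwh=0.10, pue=1.5))
```

Because the function never touches the data loader, a unit test can call it with a hand-written dict and no fixtures.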

A note on accuracy

These are directional estimates for planning, not precise budgets. Actual costs vary by region, negotiated rates, model architecture, serving stack, and utilization. Always validate with a small paid pilot before committing to infrastructure.

Contact

Questions or issues: [email protected]

License

MIT
