srefix-diagnosis
Diagnose any tech stack with Claude. 256 specialized SRE agents for Postgres, Kafka, Kubernetes, Istio, … – local, MCP-based, Apache 2.0 licensed.
Live · srefix.com · Quick Start · Architecture
30-second demo: Claude (headless) diagnoses an Nginx 502 incident by orchestrating 5 MCPs:
diag-nginx (manual lookup), srefix-explorer (fallback plan), srefix-mock-telemetry (canned Prom/Loki/jumphost). Reasoning is real; only the telemetry I/O is mocked. Reproduce with python3 demo/run_benchmark.py.
Content provenance: The diagnostic content under
agents/*.md was SYNTHESIZED BY LARGE LANGUAGE MODELS (Claude family by Anthropic and/or GPT family by OpenAI), then validated against publicly available open-source documentation. No proprietary, internal, or non-public material was used. See NOTICE for the list of public sources whose documentation informed synthesis, and ISSUES.md to report any copyright concerns.
Verify coverage caveat: only 3 of 250 manuals (nginx, prometheus, vitess) currently ship with metric-name whitelists, against which a manual's PromQL is verified. The other 247 manuals are LLM-synthesized and unverified – Claude may reference metric names that don't exist in your environment. Run srefix-verify-corpus before trusting the manuals on a tech that matters. Adding a whitelist takes ~10 min per tech; PRs welcome.
What's in this repo
| Directory | What it is |
|---|---|
| agents/ | 250 markdown diagnosis manuals |
| mcp/ | One Python package with 250 entry-point launchers – each is its own MCP server (srefix-diag-postgres / srefix-diag-hbase / ...) |
| discovery-mcp/ | Auto-discovery MCP (33 adapters across 5 layers – cloud API / service registry / cluster bootstrap / k8s / VM tag). Coverage of any specific tech depends on its deployment shape; see "Coverage caveats" below. |
| prometheus-mcp/ | PromQL query MCP |
| loki-mcp/ | LogQL query MCP |
| es-mcp/ | Elasticsearch / OpenSearch query MCP |
| jumphost-mcp/ | SSH-via-jumphost executor with safety gating |
| explorer-mcp/ | Tier-2/3 fallback exploration when manuals miss + cross-tech dependency fan-out |
| verify-mcp/ | Metric-name verifier – diffs manuals against real-exporter whitelists. Run this first. See "Verify accuracy" below. |
At a glance
256 MCP servers, ~1799 tools, 33 discovery adapters across 5 layers. Whether a given tech is actually auto-discoverable depends on how you deployed it – managed cloud service vs registered in Consul/ZK vs k8s workload vs tagged VM vs reachable bootstrap host. Techs that don't fit any path (local CLI tools / abstract concepts) are surfaced by a placeholder virtual adapter – listed, not "discovered." See "Coverage caveats" before relying on this.
5-scenario benchmark (real Claude, mocked telemetry)
Real Claude reasoning over 5 SRE incident scenarios. The mock-telemetry MCP serves canned data per scenario so the demo is reproducible without a production environment. Reasoning is genuine – only the I/O is mocked.
| Scenario | Difficulty | Pass | Duration | Keywords matched |
|---|---|---|---|---|
| Nginx 502 Bad Gateway – Upstream Timeout | Basic | ✅ | 57.2s | 4/4 (100%) |
| MongoDB Replica Set Election Storm | Advanced | ✅ | 44.2s | 4/6 (67%) |
| etcd Disk I/O Latency Degrading Kubernetes | Advanced | ✅ | 36.0s | 5/5 (100%) |
| CoreDNS Failure – Cluster-wide DNS Outage | Advanced | ✅ | 90.9s | 5/5 (100%) |
| Cassandra GC Pause Storm | Intermediate | ✅ | 100.0s | 4/5 (80%) |
Pass rate: 5/5 (100%) · avg 65.7s – reproduce with python3 demo/run_benchmark.py; per-scenario detail in demo/benchmark_report.json. Pass criterion: keyword overlap with the scenario's expected diagnosis ≥ min_confidence (0.5–0.8 per scenario, see scenarios.json). Keyword sets include reasonable domain synonyms (e.g., rollback and secondary alongside stepdown/replica for the MongoDB scenario) so a correct diagnosis using equivalent vocabulary still scores. Numbers fluctuate ±10s run-to-run from non-determinism in Claude's tool-use loop.
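The pass criterion above amounts to a keyword-overlap ratio check. A minimal sketch of that idea (field names are illustrative; the authoritative logic lives in demo/run_benchmark.py and scenarios.json):

```python
# Minimal sketch of the pass criterion: the fraction of expected keywords found
# in Claude's diagnosis text must reach the scenario's min_confidence.
# Field names here are illustrative, not the actual scenarios.json schema.
def scenario_passes(diagnosis_text: str, expected_keywords: list[str], min_confidence: float) -> bool:
    text = diagnosis_text.lower()
    matched = [kw for kw in expected_keywords if kw.lower() in text]
    return len(matched) / len(expected_keywords) >= min_confidence

# e.g. the MongoDB scenario: 4 of 6 keywords matched -> 0.67 >= its threshold -> pass
```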
| Layer | MCP servers | Tools |
|---|---|---|
| Knowledge – 250 diag-{tech} | 250 | 1750 (7 each) |
| Telemetry – srefix-prom / srefix-loki / srefix-es | 3 | 22 |
| Execution – srefix-jumphost | 1 | 5–6 |
| Discovery – srefix-discovery (33 adapters) | 1 | 6 |
| Reasoning fallback – srefix-explorer | 1 | 8 |
| Total | 256 | ~1799 |
Architecture
flowchart TB
LLM([LLM / Claude])
subgraph KNOW["Knowledge – 250 diag-tech MCPs"]
DIAG["diag-postgres / diag-hbase / diag-kafka / ... / diag-cloudflare"]
end
subgraph META["Reasoning Layer"]
EX["srefix-explorer<br/>fallback plan + dependency graph"]
end
subgraph DISCOVERY["srefix-discovery – 33 adapters across 5 layers (coverage varies by deployment)"]
D_BASIC["opscloud4 / kubernetes / zookeeper"]
D_CLOUD["aws / gcp / azure / digitalocean"]
D_CHINA["aliyun / tencentcloud / huaweicloud / jdcloud / volcengine"]
D_PAAS["vercel / flyio / railway / heroku"]
D_REG["consul / eureka / nacos / backstage"]
D_DIRECT["redis-cluster / mongo / cassandra / es-direct / etcd"]
D_SAAS["SaaS 14: cloudflare / datadog / sentry / snowflake / planetscale / auth0 / okta / netlify / pagerduty / opsgenie / github-actions / gitlab-ci / newrelic / splunk"]
D_RUN["nomad / rancher / helm / zabbix / nagios / openfaas / knative"]
D_VIRT["virtual: meta-agents + local-tools + abstract concepts"]
end
subgraph EXEC["Telemetry + Execution"]
PROM["srefix-prom<br/>PromQL"]
LOKI["srefix-loki<br/>LogQL"]
ES["srefix-es<br/>ES/OpenSearch"]
JH["srefix-jumphost<br/>SSH via bastion"]
end
LLM --> DIAG
LLM --> EX
LLM --> DISCOVERY
LLM --> EXEC
EX -.suggested calls.-> EXEC
DIAG -.extract_diagnostic_queries.-> EXEC
Discovery layers + env-var matrix
flowchart LR
subgraph BASIC["Basic 3"]
OPSCLOUD4["OPSCLOUD4_BASE_URL +<br/>OPSCLOUD4_TOKEN"]
K8S["K8S_DISCOVERY_ENABLED=1"]
ZK["ZK_QUORUMS"]
end
subgraph CLOUD_W["Western cloud"]
AWS["AWS_DISCOVERY_REGIONS"]
GCP["GCP_PROJECTS"]
AZ["AZURE_SUBSCRIPTION_IDS"]
DO["DIGITALOCEAN_TOKEN"]
end
subgraph CLOUD_CN["China cloud"]
ALI["ALIBABA_CLOUD_ACCESS_KEY_ID"]
TC["TENCENTCLOUD_SECRET_ID"]
HW["HUAWEICLOUD_ACCESS_KEY"]
JD["JDCLOUD_ACCESS_KEY"]
VOLC["VOLCENGINE_ACCESS_KEY"]
end
subgraph PAAS["PaaS"]
VC["VERCEL_TOKEN"]
FLY["FLY_API_TOKEN"]
RW["RAILWAY_TOKEN"]
HK["HEROKU_API_TOKEN"]
end
subgraph REG["Service registries"]
CON["CONSUL_URL"]
EUR["EUREKA_URL"]
NAC["NACOS_URL"]
BS["BACKSTAGE_URL"]
end
subgraph DIRECT["Self-as-registry"]
RC["REDIS_CLUSTERS"]
MGO["MONGODB_CLUSTERS"]
CAS["CASSANDRA_CLUSTERS"]
ESD["ES_DISCOVERY_ENDPOINTS"]
ETD["ETCD_CLUSTERS"]
end
subgraph SAAS["SaaS – 14 adapters"]
CF["CLOUDFLARE_API_TOKEN"]
DD["DD_API_KEY + DD_APP_KEY"]
SE["SENTRY_AUTH_TOKEN + SENTRY_ORG"]
SF["SNOWFLAKE_ACCOUNT/USER/PASSWORD"]
PS["PLANETSCALE_TOKEN + ORG"]
AU["AUTH0_DOMAIN + AUTH0_MGMT_TOKEN"]
OK["OKTA_DOMAIN + OKTA_API_TOKEN"]
NL["NETLIFY_TOKEN"]
PD["PAGERDUTY_TOKEN"]
OG["OPSGENIE_API_KEY"]
GH["GITHUB_TOKEN + GITHUB_ORG"]
GL["GITLAB_TOKEN"]
NR["NEW_RELIC_API_KEY"]
SP["SPLUNK_URL + SPLUNK_TOKEN"]
end
subgraph RUN["Orchestrator + monitoring servers"]
NM["NOMAD_ADDR"]
RN["RANCHER_URL + RANCHER_TOKEN"]
HE["HELM_DISCOVERY_ENABLED=1"]
ZB["ZABBIX_URL + ZABBIX_API_TOKEN"]
NG["NAGIOS_URL"]
OF["OPENFAAS_URL"]
KN["KNATIVE_DISCOVERY_ENABLED=1"]
end
subgraph VIRT["Always-on virtual"]
VRT["(no env required – covers 32 meta + tools + abstract)"]
end
Each env block is independently opt-in. Setting nothing leaves only the virtual adapter active (which still surfaces the 32 meta/tools/abstract techs).
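In code terms, each adapter registers itself only when its env vars are present, and the virtual adapter is always active. A minimal sketch of that pattern (adapter names and the helper are illustrative, not the actual discovery-mcp internals):

```python
import os

# Illustrative sketch of env-gated adapter registration; not the real
# srefix-discovery internals. Each adapter opts in only if its env vars exist.
ADAPTER_ENV_KEYS = {
    "kubernetes": ["K8S_DISCOVERY_ENABLED"],
    "zookeeper": ["ZK_QUORUMS"],
    "aws": ["AWS_DISCOVERY_REGIONS"],
    "consul": ["CONSUL_URL"],
}

def enabled_adapters() -> list[str]:
    active = [name for name, keys in ADAPTER_ENV_KEYS.items()
              if all(os.environ.get(k) for k in keys)]
    return active + ["virtual"]   # the virtual adapter is always on

print(enabled_adapters())   # with nothing set, this prints ['virtual']
```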
Coverage caveats (read this before trusting "covered")
Discovery is seeded, not scanned. Whether a given tech in agents/ is actually surfaced by srefix-discovery depends on the deployment shape:
| If the tech is… | …it's discoverable via |
|---|---|
| A managed cloud service (RDS / ElastiCache / MSK / Aliyun RDS / Tencent Redis / etc.) | Layer 1 – cloud API |
| Self-deployed on tagged VMs (Service=hbase) | Layer 2 – VM tag classification |
| Registered in Consul / Nacos / Eureka / ZooKeeper | Layer 3 – service registry |
| Reachable from a bootstrap host (Redis / Mongo / Cassandra / ES / etcd) | Layer 4 – cluster bootstrap |
| Running as a k8s StatefulSet/Deployment | Layer 5 – kubeconfig walk |
Things to know:
- Big-data stack (Spark / Hadoop / Hive / Trino) maps to Layer 1 (only if you're on EMR / DataProc / HuaweiCloud MRS) or Layer 2 (tagged EC2 / GCE) or Layer 5 (k8s). Self-deployed Hive on bare metal that doesn't register anywhere – not auto-discoverable without contributing a custom seed.
- Local CLI tools, abstract concepts, and meta-agents (~32 techs: git / docker-compose / linux-perf / etc.) are surfaced by the virtual adapter, which does no actual discovery – it returns a placeholder cluster so the diag-{tech} MCP can still be invoked. Don't read these as "discovered."
- ZooKeeper adapter ships dispatchers for HBase / Kafka / Solr / HDFS HA (the last surfaces the active + standby NameNodes by parsing the ActiveNodeInfo protobuf in /hadoop-ha/<nameservice>/ActiveStandbyElectorLock; DataNodes are not in ZK – query the NN admin API to enumerate those).
- Network adapters that can't be tested without a real cluster (CN cloud SDKs, JD/Volcengine especially) are present but uncovered by the test suite – version drift in their SDKs may break them silently.
- The vtgate_queries_error-class problem (LLM hallucination in static content) doesn't apply here – discovery code is hand-written and uses real client libraries – but feature-completeness claims should be treated as "implemented," not "battle-tested in production."
Self-deployed tech on raw cloud VMs (tag-based classification)
Cloud APIs only see "an EC2 instance" – they don't know the box is running HBase. To bridge this, every cloud adapter accepts a tag-based classifier that groups VMs into typed clusters matching agents/<tech>.md. Convention:
Service=hbase → matches diag-hbase.md (tech short-name)
ClusterName=hbase-prod-us-east → members with the same value form one cluster
Role=master | regionserver → optional; surfaced as Host.role
Supported across AWS / GCP / Azure / DigitalOcean / Aliyun / Tencent Cloud / Huawei Cloud / JD Cloud / Volcengine – each with the right SDK-shape tag normalizer. The set of valid Service values is auto-loaded from agents/*.md, so all 250 techs are recognized; adding a new manual extends the matcher with no code change.
Example for AWS:
from srefix_discovery_mcp.adapters.aws_extended import build_aws_ec2_classified
instances = ec2.describe_instances()['Reservations'][...]['Instances'] # raw boto3
clusters = build_aws_ec2_classified(instances, region='us-east-1', account='123')
# → 1 hbase cluster (3 nodes, roles preserved) + 1 kafka cluster + N untagged
Per-cloud entry points:
| Cloud | Function | Tag shape |
|---|---|---|
| AWS | build_aws_ec2_classified | [{Key, Value}, ...] |
| GCP | build_gce_instances_classified | {key: value} (lowercase) |
| Azure | build_azure_vms_classified | {key: value} |
| DigitalOcean | build_do_droplets_classified | ["service:hbase", ...] |
| Aliyun | build_aliyun_ecs_classified | {Tag: [{TagKey, TagValue}]} |
| Tencent Cloud | build_tc_cvm_classified | [{Key, Value}, ...] |
| Huawei Cloud | build_hw_ecs_classified | [{key, value}, ...] |
| JD Cloud | build_jd_vm_classified | [{Key, Value}, ...] |
| Volcengine | build_volc_ecs_classified | [{Key, Value}, ...] |
Untagged instances fall through to per-cloud defaults (ec2 / gce / azure-vm / droplet / cvm / vm) so nothing is dropped silently.
For self-deployed HBase / Kafka on raw VMs, the fastest path is still ZooKeeper (ZK_QUORUMS=…) – it works without any tagging. Tag-based classification is the right answer when you want to discover arbitrary self-deployed tech (Cassandra, ClickHouse, Redis, MongoDB, custom apps) that doesn't necessarily register in ZK.
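Conceptually, classification normalizes each cloud's tag shape into a flat dict and then groups by the Service / ClusterName convention above. A minimal sketch of that grouping, assuming tags are already normalized (illustrative only; the real per-cloud normalizers live under srefix_discovery_mcp.adapters):

```python
from collections import defaultdict

# Illustrative grouping by the Service / ClusterName / Role convention above,
# assuming tags were already normalized to a flat dict per instance.
def classify(instances: list[dict], known_techs: set[str], default_tech: str = "ec2") -> dict:
    clusters = defaultdict(list)
    for inst in instances:
        tags = inst.get("tags", {})
        tech = tags.get("Service", "").lower() or default_tech
        if tech not in known_techs:
            tech = default_tech                        # untagged / unknown fall-through
        cluster = tags.get("ClusterName", f"{tech}-default")
        clusters[(tech, cluster)].append({"id": inst["id"], "role": tags.get("Role")})
    return dict(clusters)

hosts = [
    {"id": "i-1", "tags": {"Service": "hbase", "ClusterName": "hbase-prod-us-east", "Role": "master"}},
    {"id": "i-2", "tags": {"Service": "hbase", "ClusterName": "hbase-prod-us-east", "Role": "regionserver"}},
    {"id": "i-3", "tags": {}},
]
print(classify(hosts, known_techs={"hbase", "kafka"}))
```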
Requirements
- Python ≥ 3.10 (the mcp package requires it)
- macOS / Linux
- For discovery-mcp[kubernetes]: a kubeconfig
- For discovery-mcp[zookeeper]: network access to your ZK quorums
- For jumphost-mcp: a working ~/.ssh/config with ProxyJump entries + ssh-agent loaded
Verify accuracy (recommended FIRST step)
The 250 manuals under agents/ were LLM-synthesized. We audited 60 of them manually (Phase 2 + 3, ~819 fixes – see PHASE2_AUDIT.md and PHASE3_AUDIT.md); the remaining 190 are unverified and may still contain hallucinated metric names, deprecated CLI flags, or version-drifted config keys. Before you install and trust this corpus, run the metric verifier.
verify-mcp ships per-tech whitelists captured from real exporter /metrics output and diffs them against every metric reference in the manuals. Anything in a manual that doesn't exist in the real exporter is flagged as a likely hallucination.
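At its core the check is a set difference between metric names referenced in a manual and the captured whitelist. A minimal sketch of that idea (the regex and the whitelist JSON shape are simplified assumptions, not the actual verify-mcp implementation):

```python
import json, re
from pathlib import Path

# Simplified sketch of the whitelist diff. The real verify-mcp records per-line
# references and parses PromQL more carefully; the whitelist JSON shape here
# (a top-level "metrics" list) is an assumption.
METRIC_RE = re.compile(r"\b[a-z][a-z0-9_:]{3,}\b")

def flag_manual(manual_path: str, whitelist_path: str) -> set[str]:
    whitelist = set(json.loads(Path(whitelist_path).read_text()).get("metrics", []))
    referenced = {m for m in METRIC_RE.findall(Path(manual_path).read_text()) if "_" in m}
    return referenced - whitelist   # likely-hallucinated metric names
```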
# 1. Install the verifier (small – no telemetry deps)
cd srefix-diagnosis/verify-mcp
pip install -e .
# 2. Run the audit on the whole agents/ corpus
srefix-verify-corpus srefix-diagnosis/agents
# Or just one tech
srefix-verify-corpus --tech vitess srefix-diagnosis/agents
Sample output:
Manuals scanned: 250
with whitelist: 3 (nginx, prometheus, vitess)
without whitelist: 247 (not yet covered)
In covered manuals: 58 metric references
matched whitelist (likely real): 16
flagged (likely hallucinated): 42
── Flagged metrics by manual ──
[vitess] 19 flagged
  × vtgate_queries_error            ×6  L155,242,278,426,429,…
  × vtgate_queries_processed_total  ×4  L167,1252,1271,1326
...
vtgate_queries_error is a representative case: the LLM invented this name and used it ~21 times in the Vitess manual; the real metric is vtgate_api_error_counts. The verifier surfaces every such pattern with file/line references so you (or a script) can patch them en masse.
Adding a whitelist for a new tech
The shipped whitelists cover only 3 of 250 techs – community contributions are how we close the gap. To add a whitelist:
- Boot the tech's exporter (locally or in CI) and capture /metrics (a Python capture sketch follows this list):
  curl http://<exporter>:<port>/metrics \
    | grep '^# HELP' | awk '{print $3}' | sort -u > metric_names.txt
- Save as verify-mcp/verify_mcp/whitelists/<tech>.json with the format used by vitess.json (include source, captured_at, method).
- Re-run srefix-verify-corpus --tech <tech> to see what gets flagged.
- PR welcome.
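The capture step from the list above, sketched in Python (the fields source / captured_at / method come from the text above; check vitess.json for the authoritative schema, and treat the metrics field name as an assumption):

```python
import datetime, json, re, urllib.request

# Sketch of the whitelist-capture step. Verify the exact JSON layout against
# verify-mcp/verify_mcp/whitelists/vitess.json before opening a PR; the
# "metrics" field name below is an assumption.
def capture_whitelist(exporter_url: str, tech: str) -> None:
    body = urllib.request.urlopen(f"{exporter_url}/metrics", timeout=10).read().decode()
    names = sorted({m.group(1) for m in re.finditer(r"^# HELP (\S+)", body, re.MULTILINE)})
    whitelist = {
        "source": exporter_url,
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "method": "scrape /metrics and parse '# HELP' lines",
        "metrics": names,
    }
    with open(f"verify-mcp/verify_mcp/whitelists/{tech}.json", "w") as fh:
        json.dump(whitelist, fh, indent=2)
```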
For techs without a Prometheus exporter (CLI-only, Java JMX-only, etc.), a CLI-flag verifier is on the roadmap – see ISSUES.md.
Fixing what the verifier finds – propose / apply split
Once the verifier surfaces flagged metrics, the question is "what should they be replaced with?" The srefix-fix tool keeps the LLM that proposes a fix strictly separated from the deterministic tool that applies it – so an LLM mistake can never silently reach agents/.
| Stage | Tool | Trust |
|---|---|---|
| Propose | srefix-fix propose <tech> – spawns claude --print with read-only --allowedTools (Read / Bash(grep) / WebFetch). Cross-checks the manual against the real exporter source and emits a YAML draft. | LOW (LLM may hallucinate) |
| Review | Human edits the YAML – drop unsure entries, correct new if the proposer got it wrong, add confirmed_by: <name> to entries they trust | gate |
| Apply | srefix-fix apply <yaml> – pure regex sed, word-boundary safe, no LLM at runtime. Same input, same output, every CI run. | HIGH |
# 1. Auto-draft (read-only Claude run against the Vitess manual + source)
srefix-fix propose vitess
# → fix_maps/vitess.draft.yaml
# 2. Review (drop unsure entries, add confirmed_by)
$EDITOR fix_maps/vitess.draft.yaml
mv fix_maps/vitess.draft.yaml fix_maps/vitess.yaml
git add fix_maps/vitess.yaml
# 3. Dry-run, then real apply
srefix-fix apply fix_maps/vitess.yaml --dry-run
srefix-fix apply fix_maps/vitess.yaml
# 4. Inspect + commit
git diff agents/vitess-agent.md
git commit -am "fix(vitess): apply confirmed metric-name corrections"
See fix_maps/README.md for the full schema and fix_maps/_example.yaml for a worked Vitess example (the vtgate_queries_error → vtgate_api_error_counts headline fix).
Why this design works for batch audits:
- claude --print headless lets the propose stage run unattended overnight across all 247 uncovered techs.
- The --allowedTools whitelist keeps Claude read-only – no possibility of it editing a manual directly.
- YAML in git makes "LLM suggestion" and "human decision" separately reviewable and auditable.
- The applier is pure sed. Run it 1000 times in CI; same diff every time.
- --dry-run + git diff makes every actual change to agents/ reviewable.
If you remember one thing: proposer and applier MUST be separated, with a human-reviewed YAML between them. That separation is the only thing guaranteeing "an LLM mistake doesn't pollute source-of-truth content."
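For intuition, the apply stage reduces to deterministic, word-boundary-safe substitution driven by the reviewed mapping. A minimal sketch (not the real srefix-fix code; the single mapping shown is just the headline Vitess fix mentioned above):

```python
import re

# Sketch of the deterministic apply stage: word-boundary regex substitution,
# no LLM at runtime. The real srefix-fix reads the mapping from a reviewed
# fix_maps/<tech>.yaml; the dict below is only the headline Vitess fix.
FIX_MAP = {"vtgate_queries_error": "vtgate_api_error_counts"}

def apply_fixes(manual_text: str, fixes: dict[str, str]) -> str:
    for old, new in fixes.items():
        manual_text = re.sub(rf"\b{re.escape(old)}\b", new, manual_text)
    return manual_text
# Same input, same output on every run, so CI can apply it repeatedly.
```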
Use as an MCP tool
The verifier is also exposed as an MCP server (srefix-verify) so Claude can self-check a manual before trusting it during diagnosis:
{
"mcpServers": {
"srefix-verify": { "command": "srefix-verify" }
}
}
Tools: verify_manual, audit_corpus, list_whitelisted_techs, whitelist_info.
Quick Start
Requirements: Python 3.10+, the claude CLI (Claude Code) on PATH. If your system Python is 3.9 or older, create a fresh env first:
conda create -n srefix python=3.11 -y && conda activate srefix
# or: python3.11 -m venv .venv && source .venv/bin/activate
1. The 250 diagnosis MCPs (mcp/)
By default, pip install -e . would install all 250 commands. You almost never want this – too heavy for Claude. Pick a subset first.
cd srefix-diagnosis/mcp
# See what's available
python3 generate.py --list-all
python3 generate.py --list-all | grep -i sql
# Pick one of three filter modes:
# (a) explicit list
python3 generate.py --techs postgres redis kafka hbase k8s
# (b) from a file (recommended – you can git-commit the file)
cat > my_stack.txt <<'EOF'
postgres
redis
kafka
hbase
k8s
prometheus
nginx
istio
EOF
python3 generate.py --from-file my_stack.txt
# (c) regex
python3 generate.py --regex 'postgres|mysql|redis|kafka'
# Then install – only the selected commands appear on PATH
pip install -e .
Verify:
which srefix-diag-postgres # should print a path
srefix-diag-postgres        # note: MCP servers have no --help; if it starts and waits on stdio, it works (Ctrl-C to exit)
To add or remove a tech later: edit my_stack.txt, re-run python3 generate.py --from-file my_stack.txt, re-run pip install -e ..
To install all 250 anyway (not recommended in Claude config, OK on PATH):
python3 generate.py # no filter = all 250
pip install -e .
2. Discovery MCP (cluster auto-discovery)
cd srefix-diagnosis/discovery-mcp
pip install -e ".[all]" # includes kubernetes + kazoo extras
# Or pick one:
# pip install -e ".[kubernetes]"
# pip install -e ".[zookeeper]"
# pip install -e . # opscloud4-only
Adapters auto-register based on env vars:
# opscloud4 (any one of: REST CMDB)
export OPSCLOUD4_BASE_URL=https://opscloud.your-corp.com
export OPSCLOUD4_TOKEN=<x-token>
# kubernetes (uses ambient kubeconfig)
export K8S_DISCOVERY_ENABLED=1
export K8S_CONTEXTS=prod-east,prod-west # optional
# zookeeper (HBase / Kafka legacy / Solr)
export ZK_QUORUMS="zk-prod-east=zk1:2181,zk2:2181,zk3:2181;zk-prod-west=zkw1:2181"
export ZK_WATCHES=hbase,kafka,solr # optional, default = all
export DISCOVERY_CACHE_TTL=300 # optional, default 300s
Run: srefix-discovery
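For reference, the ZK_QUORUMS value shown above is a semicolon-separated list of named quorums. A minimal parsing sketch (illustrative only, not the actual discovery-mcp parser):

```python
# Illustrative parse of the ZK_QUORUMS format shown above
# ("name=host:port,host:port;name2=..."); not the actual discovery-mcp parser.
def parse_zk_quorums(value: str) -> dict[str, list[str]]:
    quorums = {}
    for entry in filter(None, value.split(";")):
        name, hosts = entry.split("=", 1)
        quorums[name.strip()] = [h.strip() for h in hosts.split(",") if h.strip()]
    return quorums

print(parse_zk_quorums("zk-prod-east=zk1:2181,zk2:2181,zk3:2181;zk-prod-west=zkw1:2181"))
# {'zk-prod-east': ['zk1:2181', 'zk2:2181', 'zk3:2181'], 'zk-prod-west': ['zkw1:2181']}
```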
3. Prometheus MCP
cd srefix-diagnosis/prometheus-mcp
pip install -e .
export PROMETHEUS_URL=http://prometheus.prod:9090
# Optional auth:
# PROMETHEUS_TOKEN=<bearer>
# PROMETHEUS_USERNAME=... PROMETHEUS_PASSWORD=...
# PROMETHEUS_VERIFY_TLS=0 (skip TLS verify)
# PROMETHEUS_TIMEOUT=30
Run: srefix-prom
4. Loki MCP
cd srefix-diagnosis/loki-mcp
pip install -e .
export LOKI_URL=http://loki.prod:3100
# Optional:
# LOKI_TOKEN=<bearer> or LOKI_USERNAME / LOKI_PASSWORD
# LOKI_ORG_ID=<tenant> (multi-tenant Loki)
# LOKI_TIMEOUT=30
# LOKI_VERIFY_TLS=0
Run: srefix-loki
5. Elasticsearch / OpenSearch MCP
cd srefix-diagnosis/es-mcp
pip install -e .
export ES_URL=https://es.prod:9200
# Auth (one of):
# ES_API_KEY=<id:secret base64>
# ES_USERNAME=... ES_PASSWORD=...
# Optional:
# ES_VERIFY_TLS=0
# ES_TIMEOUT=30
Run: srefix-es
6. Jumphost MCP (SSH via bastion)
cd srefix-diagnosis/jumphost-mcp
pip install -e .
Set up two YAML config files:
mkdir -p ~/.config/srefix
# Inventory β which hosts exist + their tags
cat > ~/.config/srefix/inventory.yaml <<'EOF'
hosts:
pg-prod-1:
tags: {env: prod, tech: postgres, role: primary}
pg-prod-2:
tags: {env: prod, tech: postgres, role: replica}
hbase-rs-001:
tags: {env: prod, tech: hbase, role: regionserver}
EOF
# Presets β read-only commands the LLM is allowed to invoke
cat > ~/.config/srefix/presets.yaml <<'EOF'
postgres:
pg-replication-status:
description: Replication status from primary
command: 'psql -At -c "SELECT pid, state, sent_lsn FROM pg_stat_replication"'
allowed_roles: [primary]
timeout: 10
pg-table-size:
description: Size of one table
command: 'psql -At -c "SELECT pg_total_relation_size(''{table_name}'')"'
allowed_roles: [primary, replica]
allowed_args: [table_name]
timeout: 10
EOF
Make sure ~/.ssh/config has ProxyJump set up for each host:
Host bastion
HostName bastion.your-corp.com
User you
Host pg-prod-*
ProxyJump bastion
User postgres
Env:
export JUMPHOST_INVENTORY=~/.config/srefix/inventory.yaml
export JUMPHOST_PRESETS=~/.config/srefix/presets.yaml
export JUMPHOST_MODE=preset_only # default; safest
# Other modes:
# JUMPHOST_MODE=filtered_arbitrary # `run` enabled, denylist applied
# JUMPHOST_MODE=unrestricted # `run` enabled, no filter (use with external approval)
export JUMPHOST_DRY_RUN=1 # never actually exec; great for testing
export JUMPHOST_DEFAULT_TIMEOUT=30
Run: srefix-jumphost
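For intuition, preset_only mode means the LLM can only name a preset plus whitelisted args; the command template, role check, and timeout stay under operator control. A minimal sketch of that gating (illustrative only, not the jumphost-mcp implementation):

```python
# Illustrative sketch of preset_only gating: the LLM supplies only a preset
# name and whitelisted args; the command template and role check never come
# from the LLM. Not the actual jumphost-mcp code.
def run_safe(host: dict, preset: dict, args: dict) -> str:
    if host["tags"]["role"] not in preset["allowed_roles"]:
        raise PermissionError("preset not allowed on this host role")
    unexpected = set(args) - set(preset.get("allowed_args", []))
    if unexpected:
        raise ValueError(f"args not whitelisted: {unexpected}")
    command = preset["command"].format(**args)   # e.g. {table_name} substitution
    return f"would run on {host['name']} (via ProxyJump): {command}"

host = {"name": "pg-prod-1", "tags": {"env": "prod", "tech": "postgres", "role": "primary"}}
preset = {"command": "psql -At -c \"SELECT pg_total_relation_size('{table_name}')\"",
          "allowed_roles": ["primary", "replica"], "allowed_args": ["table_name"]}
print(run_safe(host, preset, {"table_name": "orders"}))
```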
7. Explorer MCP (Tier-2/3 fallback)
cd srefix-diagnosis/explorer-mcp
pip install -e .
No env required. 8 tools:
- fallback_exploration_plan(symptom, tech?, cluster_id?, host_pattern?) – Tier-2 structured plan
- free_explore_bootstrap(symptom, tech?, cluster_id?) – Tier-3 schema-aware starter
- expand_to_dependencies(tech, depth?, observation?) – upstream fan-out via dep graph
- expand_to_dependents(tech) – blast-radius reverse lookup
- reflect_on_findings(findings, top_k?) – feed evidence back, get keywords
- categorize_symptom / list_symptom_categories / list_supported_techs
Run: srefix-explorer
Register with Claude
Add to your Claude MCP config:
- Claude Desktop: ~/Library/Application Support/Claude/claude_desktop_config.json
- Claude Code: ~/.config/claude-code/mcp.json or project-level .claude/mcp.json
Skeleton (pick the MCPs you actually want):
{
"mcpServers": {
"discovery": {
"command": "srefix-discovery",
"env": {
"OPSCLOUD4_BASE_URL": "https://opscloud.your-corp.com",
"OPSCLOUD4_TOKEN": "...",
"K8S_DISCOVERY_ENABLED": "1",
"ZK_QUORUMS": "zk-prod-east=zk1:2181,zk2:2181,zk3:2181"
}
},
"prom": { "command": "srefix-prom", "env": { "PROMETHEUS_URL": "..." } },
"loki": { "command": "srefix-loki", "env": { "LOKI_URL": "..." } },
"es": { "command": "srefix-es", "env": { "ES_URL": "...", "ES_API_KEY": "..." } },
"jumphost": {
"command": "srefix-jumphost",
"env": {
"JUMPHOST_INVENTORY": "/Users/you/.config/srefix/inventory.yaml",
"JUMPHOST_PRESETS": "/Users/you/.config/srefix/presets.yaml",
"JUMPHOST_MODE": "preset_only",
"JUMPHOST_DRY_RUN": "1"
}
},
"diag-postgres": { "command": "srefix-diag-postgres" },
"diag-hbase": { "command": "srefix-diag-hbase" },
"diag-redis": { "command": "srefix-diag-redis" },
"diag-kafka": { "command": "srefix-diag-kafka" },
"diag-k8s": { "command": "srefix-diag-k8s" }
}
}
The diag-* block is generated automatically – copy whichever lines you need from mcp/claude_mcp_config.json. Or filter to a subset:
cd mcp
python3 filter_config.py postgres redis kafka hbase k8s > ~/my_diag_subset.json
# then merge ~/my_diag_subset.json["mcpServers"] into your Claude config
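If you'd rather script the merge, a minimal sketch (paths are examples; point them at your actual Claude config and subset files from the locations listed above):

```python
import json
from pathlib import Path

# Sketch of merging the filtered diag-* subset into an existing Claude config.
# Paths are examples only; adjust for your Claude client.
config_path = Path.home() / ".config/claude-code/mcp.json"
subset_path = Path.home() / "my_diag_subset.json"

config = json.loads(config_path.read_text()) if config_path.exists() else {"mcpServers": {}}
subset = json.loads(subset_path.read_text())

config.setdefault("mcpServers", {}).update(subset["mcpServers"])
config_path.write_text(json.dumps(config, indent=2))
```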
How the diagnose-with-evidence loop works
Observation: "PostgreSQL replica replication lag > 5min"
  │
  ├─ discovery.list_hosts(tech="postgres", role="primary")
  │    → ["pg-prod-1"]
  ├─ diag-postgres.diagnose("replication lag spiking")
  │    → case "ReplicationLagCritical" + Symptoms / Diagnosis / Root Cause Tree / Thresholds
  ├─ diag-postgres.extract_diagnostic_queries("ReplicationLagCritical")
  │    → [4 structured queries: 3 psql + 1 shell, each with suggested_mcp]
  │
  ├─ prom.range_query("pg_replication_lag_seconds{...}", "-30m", "now")
  ├─ jumphost.run_safe(host="pg-prod-1", tech="postgres",
  │                    preset_name="pg-replication-status")
  ├─ loki.query_range('{app="postgres"} |= "wal"', "-30m", "now")
  ├─ es.search("logs-postgres-*", query='level:ERROR AND host:"pg-prod-1"')
  │
  └─ Claude correlates evidence + manual → diagnosis + suggested fixes
Verify install
# Each command should be on PATH:
which srefix-discovery srefix-prom srefix-loki srefix-es srefix-jumphost
which srefix-diag-postgres # plus whichever diag-* you generated
# Smoke-test one MCP server (will hang on stdin – that's normal; Ctrl-C to exit):
srefix-diag-postgres
# Full inspection via the MCP CLI inspector:
npx @modelcontextprotocol/inspector srefix-diag-postgres
Troubleshooting
| Symptom | Cause / Fix |
|---|---|
| ModuleNotFoundError: No module named 'mcp' | Python < 3.10 or pip install ran with the wrong interpreter. Use python3.10 -m pip install -e . |
| srefix-diag-foo: command not found | foo.md exists but you didn't include foo in the generate.py filter. Re-run generate.py + pip install -e . |
| Discovery returns 0 clusters | Missing env vars (no adapter enabled), or auth failure – check discovery_health() output |
| jumphost.run_safe returns "host not in inventory" | Add the host to your inventory.yaml, or your JUMPHOST_INVENTORY env points at a different file |
| Claude can't see the MCP | Restart Claude after editing config; Claude only re-reads at startup |
License
Apache License 2.0 – see LICENSE for the full text and the "CONTENT PROVENANCE DISCLOSURE" section explaining how the diagnostic content was synthesized.
Apache 2.0 was chosen over MIT because it adds an explicit patent grant, which protects users (and contributors) against patent claims by adopters. See NOTICE for required attribution and ISSUES.md for the copyright takedown procedure.