k9fr4n

thruk-mcp


MCP server for Thruk monitoring.


Model Context Protocol (MCP) server for Thruk — the unified web frontend for Naemon, Nagios, Icinga and Shinken.

Expose Thruk's REST API to MCP-compatible clients (Claude Desktop, Dust, LibreChat, OpenWebUI...) so that an LLM can query hosts/services, schedule downtimes, acknowledge problems, force rechecks and more in natural language.

Features

  • Read: hosts, services, hostgroups, servicegroups, downtimes, comments, sites, aggregated stats, current problems
  • Write: schedule/delete downtimes, acknowledge & remove acks, force rechecks
  • Escape hatch: thruk_query tool to call any Thruk REST endpoint
  • Multi-backend support (Thruk federated sites): pass backends="prod,dr" to any tool
  • Two transports: stdio (default) or Streamable-HTTP (--listen <port>)
  • Async httpx client with proper error handling and TLS verification
  • Tested with pytest + respx, linted with ruff, packaged with hatchling

Quick start

1. Configure

cp .env.example .env
$EDITOR .env   # set THRUK_BASE_URL and THRUK_API_KEY

An API key can be created from the Thruk user profile page (requires api_keys_enabled in thruk_local.conf) or via the REST API itself.
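
To sanity-check a key outside the MCP server, the request can be sketched with the standard library alone. The header name comes from the environment table in this README; the `/r/` REST prefix and the `build_thruk_request` helper are assumptions for illustration, so adjust them to your Thruk deployment.

```python
import urllib.request


def build_thruk_request(path: str, base_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request for a Thruk REST path.

    Assumption: the REST API is mounted under <base_url>/r/.
    """
    url = f"{base_url.rstrip('/')}/r/{path.lstrip('/')}"
    # The API key travels in the X-Thruk-Auth-Key header.
    return urllib.request.Request(url, headers={"X-Thruk-Auth-Key": api_key})


if __name__ == "__main__":
    req = build_thruk_request("hosts", "https://monitor.example.com/thruk", "xxxxxxxx")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=30)
```

Building the request separately from sending it keeps the sketch testable without a live Thruk instance.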

2a. Run with Docker

docker compose up -d
# MCP Streamable-HTTP endpoint: http://localhost:8001/mcp

2b. Run locally

pip install thruk-mcp        # or: pipx install thruk-mcp

# stdio mode (for Claude Desktop, LibreChat, etc.)
thruk-mcp

# HTTP mode
thruk-mcp --listen 8001

For local development of the project itself, see CONTRIBUTING.md.

3. Wire it to an MCP client

Claude Desktop (~/.config/Claude/claude_desktop_config.json or macOS equivalent):

{
  "mcpServers": {
    "thruk": {
      "command": "thruk-mcp",
      "env": {
        "THRUK_BASE_URL": "https://monitor.example.com/thruk",
        "THRUK_API_KEY": "xxxxxxxx"
      }
    }
  }
}

4. Use with the Docker MCP Gateway

The image at ghcr.io/k9fr4n/thruk-mcp:latest defaults to stdio transport, so it can be spawned natively by the gateway.

Option A — Private local catalog
# 1. Create your private catalog
docker mcp catalog create thruk-private

# 2. Register this server (catalog/server.yaml ships with the repo)
docker mcp catalog add thruk-private thruk-mcp ./catalog/server.yaml

# 3. Configure credentials & enable
docker mcp secret set thruk-mcp.api_key=YOUR_KEY
docker mcp config write thruk-mcp.base_url=https://monitor.example.com/thruk
docker mcp server enable thruk-mcp

# 4. Run the gateway with your catalog
docker mcp gateway run --catalog thruk-private

Then point any MCP client (Claude Desktop, VS Code, Cursor, ...) at the gateway as documented here.

Option B — Submit upstream

catalog/server.yaml, catalog/tools.json and catalog/readme.md follow the docker/mcp-registry schema and can be submitted to the official Docker MCP Catalog via PR.

What's exposed

57 MCP Tools

Read (state): thruk_list_hosts, thruk_get_host, thruk_list_services, thruk_get_service, thruk_list_hostgroups, thruk_list_servicegroups, thruk_list_contacts, thruk_get_contact, thruk_problems, thruk_stats, thruk_totals (compact 16-field host+service totals, faster than thruk_stats), thruk_sites.

Read (history & comments): thruk_list_logs, thruk_list_alerts, thruk_list_notifications, thruk_notification_summary (notifications grouped by contact/host/service/state/command), thruk_recent_events, thruk_list_comments, thruk_list_downtimes, thruk_get_downtime.

Read (noise & flap analysis): thruk_top_noisy_hosts (hosts ranked by alert count over a window), thruk_top_noisy_services (services ranked by alert count), thruk_flap_summary (hosts/services ranked by state transition count).

Read (problem intelligence): thruk_oldest_problems (unhandled problems sorted by age, oldest first), thruk_unacked_critical (CRITICAL/DOWN not acknowledged for > N minutes), thruk_stale_acks (acknowledgements older than N days, i.e. forgotten problems), thruk_problem_counts (flat aggregate of unhealthy-state counts, filterable by hostgroup, custom vars or any structured filter; replaces the former thruk_problems_by_hostgroup), thruk_stale_checks (surfaces checks that stopped running, the dangerous "false green").

Read (analytics): thruk_alert_heatmap (alert counts bucketed by time, useful for spotting recurring patterns), thruk_notification_heatmap (notification counts bucketed by time, useful for spotting mail/paging storms), thruk_concurrent_failures (windows where multiple hosts failed simultaneously), thruk_recurring_problems (hosts/services generating repeated alerts over a window).

Read (availability / SLA): thruk_host_availability (uptime % for a single host: time_up_percent, time_down_percent, time_unreachable_percent and scheduled equivalents), thruk_service_availability (ok/warning/critical/unknown % for a single service), thruk_hostgroup_availability (availability for all hosts or services in a hostgroup, sorted worst-first; type = hosts | services | both). All three accept since/until (Thruk relative or ISO) or a timeperiod shortcut (lastmonth, thismonth, last24hours, lastweek, …). thruk_reliability_report (per host/service reliability metrics: MTTR / MTBF / incident counts, derived from the log over a window).

Read (performance data): thruk_get_perfdata (fetch and parse performance data for a single host or service), thruk_perfdata_snapshot (parsed perfdata for every service matching a filter, in one call), thruk_perfdata_near_threshold (metrics within within_percent % of breaching their warn/crit range, an early-warning signal before an alert fires).

Write (downtime management): thruk_schedule_downtime (host/service), thruk_schedule_host_services_downtime (all services of a host), thruk_schedule_propagated_host_downtime (parent+children), thruk_schedule_hostgroup_downtime, thruk_schedule_servicegroup_downtime, thruk_delete_downtime, thruk_delete_active_downtimes, thruk_delete_downtimes_by_filter.

Write (problem handling): thruk_acknowledge, thruk_bulk_acknowledge (acknowledge multiple hosts/services in one call), thruk_remove_acknowledgement, thruk_recheck, thruk_add_comment, thruk_delete_comment, thruk_checks (enable/disable active checks for a host or service), thruk_notifications (enable/disable host or service notifications, with optional cascade to all services of a host).

Escape hatches: thruk_query (raw call to any REST endpoint), thruk_run_background_query (long-running endpoint via Thruk's ?background=1 mechanism with automatic job polling).
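
The polling half of a background query could look roughly like the sketch below. `fetch_output` is a hypothetical callable standing in for a GET on the job's output endpoint; it is assumed to return `None` while the job is still running, since the actual response shape of Thruk's job endpoint is not described in this README. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Any, Callable, Optional


def poll_job(
    fetch_output: Callable[[], Optional[Any]],
    timeout: float = 300.0,   # 5 minutes, matching the default mentioned in this README
    interval: float = 1.0,    # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch_output until it returns a result or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        result = fetch_output()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError("background job did not finish in time")
        sleep(interval)
```

Using a monotonic clock for the deadline avoids surprises when the wall clock is adjusted mid-poll.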

All list-style tools share a consistent limit / offset / sort / columns contract. By default they return a tight subset of columns (~10 fields per row) to keep LLM token consumption low. Pass columns="" to opt out and receive every column the Thruk row contains.
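
One way to picture that contract is as a small function that turns tool arguments into query parameters. This is a sketch only: `build_list_params` and `DEFAULT_HOST_COLUMNS` are hypothetical names, the default limit is assumed, and the real per-tool column subsets are not listed in this README.

```python
from typing import Dict, Optional, Sequence

# Hypothetical compact default; the project's actual column subsets may differ.
DEFAULT_HOST_COLUMNS = ("name", "state", "plugin_output", "last_check", "acknowledged")


def build_list_params(
    limit: int = 100,
    offset: int = 0,
    sort: Optional[str] = None,
    columns: Optional[str] = None,
    default_columns: Sequence[str] = DEFAULT_HOST_COLUMNS,
) -> Dict[str, str]:
    """Translate the shared list-tool arguments into query parameters."""
    params = {"limit": str(limit), "offset": str(offset)}
    if sort:
        params["sort"] = sort
    if columns is None:
        # Caller did not choose: fall back to the compact default subset.
        params["columns"] = ",".join(default_columns)
    elif columns != "":
        # Caller asked for specific columns.
        params["columns"] = columns
    # columns == "" means "opt out": send no columns filter, get every column.
    return params
```

Distinguishing `None` (use defaults) from `""` (opt out) is what lets a single argument express all three behaviours.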

5 MCP Resources

URI templates that MCP clients with a resource browser (Claude Desktop, VS Code, ...) can "open" like files:

| URI | Content |
| --- | --- |
| thruk://hosts/{name} | Full host JSON |
| thruk://services/{host}/{service} | Full service JSON |
| thruk://hostgroups/{name} | Host group config + members |
| thruk://problems | Current unhandled problems (hosts + services) |
| thruk://stats | Aggregated host/service stats (cached) |
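
To illustrate how these templates map onto parameters, here is a minimal sketch that resolves a concrete URI back to its resource kind and arguments. `parse_resource_uri` and `RESOURCE_PARAMS` are hypothetical names, and percent-decoding of path segments is an assumption about how names with spaces would be encoded.

```python
from typing import Dict, Tuple
from urllib.parse import unquote, urlsplit

# Parameter names per resource kind, taken from the URI templates above.
RESOURCE_PARAMS = {
    "hosts": ("name",),
    "services": ("host", "service"),
    "hostgroups": ("name",),
    "problems": (),
    "stats": (),
}


def parse_resource_uri(uri: str) -> Tuple[str, Dict[str, str]]:
    """Split a thruk:// URI into (resource kind, template parameters)."""
    parts = urlsplit(uri)
    if parts.scheme != "thruk" or parts.netloc not in RESOURCE_PARAMS:
        raise ValueError(f"unknown resource URI: {uri}")
    names = RESOURCE_PARAMS[parts.netloc]
    # Path segments carry the template values; decode %20 and friends.
    values = [unquote(segment) for segment in parts.path.split("/") if segment]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} path segment(s) in {uri}")
    return parts.netloc, dict(zip(names, values))
```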

3 MCP Prompts

Pre-canned workflows the user can invoke as a slash-command in the MCP client UI:

| Prompt | Arguments | Purpose |
| --- | --- | --- |
| investigate_alert | host, optional service | 7-step incident triage |
| schedule_maintenance | target, duration_minutes, kind | Safe downtime workflow with confirmation |
| diagnose_flapping | host, service | Root-cause a flapping service (uses thruk_flap_summary) |

Robustness

  • Connection retries: httpx.AsyncHTTPTransport(retries=3) retries failed connection attempts (DNS failures, connection refusals, TLS handshake errors).
  • HTTP retries with backoff: 5xx and 429 responses are retried up to 3 times with exponential backoff + jitter (cap 5 s).
  • Opt-in TTL cache: slow-moving endpoints (/sites, /processinfo, /hosts/stats, /services/stats, /contacts, /timeperiods, ...) are cached in-process for 15 s. Any tool can request caching via cache_ttl= on the underlying client. This absorbs the burst of identical calls an LLM agent typically issues across a multi-tool turn.
  • Pagination helper: ThrukClient.get_all() is an async generator that iterates pages of 500 rows up to a configurable hard limit (default 50 000), so internal callers can scan entire backends without manual offset math.
  • Long-running queries: the thruk_run_background_query tool wraps Thruk's ?background=1 flow and polls /thruk/jobs/<id>/output until the job completes (5 min default timeout).
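
The HTTP retry policy described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the 0.5 s base delay and the "full jitter" variant are guesses (the README only states exponential backoff, jitter and a 5 s cap), and `send` is a hypothetical callable that performs one request and returns its status code.

```python
import random
import time
from typing import Callable


def is_retryable(status: int) -> bool:
    """5xx and 429 responses are worth retrying."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(
    attempt: int,
    base: float = 0.5,   # assumed base delay
    cap: float = 5.0,    # cap from this README
    rng: Callable[[], float] = random.random,
) -> float:
    """Delay before retry number `attempt` (1-based): capped exponential, full jitter."""
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return ceiling * rng()


def call_with_retries(
    send: Callable[[], int],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() once, then retry up to max_retries times on retryable statuses."""
    attempt = 0
    while True:
        status = send()
        if not is_retryable(status) or attempt >= max_retries:
            return status
        attempt += 1
        sleep(backoff_delay(attempt))
```

Jitter matters here because several tool calls from one LLM turn tend to fail together; randomizing the delay keeps their retries from landing on Thruk at the same instant.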

Environment variables

Connection

| Variable | Default | Description |
| --- | --- | --- |
| THRUK_BASE_URL | http://localhost/thruk | Thruk URL (no trailing slash) |
| THRUK_API_KEY | (required) | X-Thruk-Auth-Key header |
| THRUK_AUTH_USER | | Impersonation user (superuser key only) |
| THRUK_VERIFY_SSL | true | Set false for self-signed certs |
| THRUK_TIMEOUT | 30 | HTTP timeout in seconds |
| THRUK_DEFAULT_BACKENDS | | CSV of default backend names (federated Thruk) |
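
For readers scripting against the same variables, the defaults above could be loaded like this. `ThrukSettings` and `load_settings` are hypothetical names, and the accepted spellings for a false boolean are an assumption; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class ThrukSettings:
    base_url: str
    api_key: str
    auth_user: Optional[str]
    verify_ssl: bool
    timeout: float
    default_backends: Tuple[str, ...]


def load_settings(env: Mapping[str, str] = os.environ) -> ThrukSettings:
    """Read the connection variables, applying the documented defaults."""
    api_key = env.get("THRUK_API_KEY", "")
    if not api_key:
        raise ValueError("THRUK_API_KEY is required")
    raw_backends = env.get("THRUK_DEFAULT_BACKENDS", "")
    return ThrukSettings(
        # The base URL is documented as having no trailing slash; normalize it.
        base_url=env.get("THRUK_BASE_URL", "http://localhost/thruk").rstrip("/"),
        api_key=api_key,
        auth_user=env.get("THRUK_AUTH_USER") or None,
        # Assumption: false/0/no/off disable verification, anything else enables it.
        verify_ssl=env.get("THRUK_VERIFY_SSL", "true").strip().lower()
        not in {"false", "0", "no", "off"},
        timeout=float(env.get("THRUK_TIMEOUT", "30")),
        default_backends=tuple(b.strip() for b in raw_backends.split(",") if b.strip()),
    )
```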

Security / multi-tenant (v0.6)

| Variable | Default | Description |
| --- | --- | --- |
| THRUK_READ_ONLY | false | Strip every write tool (ack, downtime, recheck, ...) |
| THRUK_ENABLED_TOOLS | | Allowlist of tool names. CSV with fnmatch wildcards. Empty = all |
| THRUK_AUDIT_LOG | true | Emit one JSON audit line on stderr per write tool invocation |
| THRUK_MAX_CONCURRENT | 0 | Cap of concurrent in-flight HTTP requests. 0 = unlimited |
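
How THRUK_READ_ONLY and THRUK_ENABLED_TOOLS might combine can be sketched with the standard `fnmatch` module. `select_tools` is a hypothetical helper, and applying both filters together (a tool must pass each one) is an assumption about the intended semantics.

```python
from fnmatch import fnmatchcase
from typing import Collection, List, Sequence


def select_tools(
    all_tools: Sequence[str],
    write_tools: Collection[str],
    enabled_patterns: str = "",
    read_only: bool = False,
) -> List[str]:
    """Return the tool names that survive the read-only switch and the allowlist."""
    patterns = [p.strip() for p in enabled_patterns.split(",") if p.strip()]
    selected = []
    for name in all_tools:
        if read_only and name in write_tools:
            continue  # read-only mode strips every write tool
        if patterns and not any(fnmatchcase(name, p) for p in patterns):
            continue  # a non-empty allowlist must match; empty means "all"
        selected.append(name)
    return selected
```

For example, `select_tools(tools, writes, "thruk_list_*,thruk_stats")` would keep only the list tools and thruk_stats.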

Security

  • Read-only mode: set THRUK_READ_ONLY=true to remove every write tool (thruk_acknowledge, thruk_schedule_*_downtime, thruk_recheck, thruk_delete_*, thruk_run_background_query) from the MCP server. The LLM then cannot mutate monitoring state. Use this for general-purpose agents that should only observe.

  • Tool allowlist: THRUK_ENABLED_TOOLS=thruk_list_*,thruk_problems,thruk_stats restricts the exposed surface to the listed tools (fnmatch wildcards supported). Useful when fronting multiple LLM clients with the same gateway but different scopes.

  • Audit log: every write tool invocation emits one JSON line on thruk_mcp.audit (stderr by default):

    {"ts":"2026-05-17T22:00:00+00:00","tool":"thruk_acknowledge","user":"alice",
     "args":{"host":"srv01","comment":"investigating"},"target":"srv01","status":"ok"}
    

    Disable with THRUK_AUDIT_LOG=false. Sensitive keys (api_key, password, token) are redacted as *** before logging.

  • Rate limit: THRUK_MAX_CONCURRENT=8 caps in-flight HTTP requests with an asyncio.Semaphore. Combined with the v0.3 TTL cache, this protects the Thruk core from an LLM that loops on tools or chains them aggressively.
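
The semaphore-based cap can be demonstrated in isolation. The sketch below uses a hypothetical `RequestLimiter` wrapper and a fake request that only sleeps; it measures the peak number of simultaneously running requests to show the cap holding. Treating 0 as "unlimited" follows the THRUK_MAX_CONCURRENT description; everything else is illustrative.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class RequestLimiter:
    """Cap concurrently running coroutines; 0 means unlimited."""

    def __init__(self, max_concurrent: int = 0) -> None:
        self._sem: Optional[asyncio.Semaphore] = (
            asyncio.Semaphore(max_concurrent) if max_concurrent > 0 else None
        )

    async def run(self, make_request: Callable[[], Awaitable[None]]) -> None:
        if self._sem is None:
            await make_request()
            return
        async with self._sem:  # waits here once max_concurrent are in flight
            await make_request()


async def measure_peak(n_requests: int, max_concurrent: int) -> int:
    """Fire n_requests fake requests and report the peak number in flight."""
    limiter = RequestLimiter(max_concurrent)  # created inside the running loop
    active = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for the HTTP round trip
        active -= 1

    await asyncio.gather(*(limiter.run(fake_request) for _ in range(n_requests)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(measure_peak(20, 8)))
```

Note that this caps concurrency rather than request rate: a fast Thruk backend can still serve many requests per second, just never more than the configured number at once.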

Development

pip install -e ".[dev]"
pre-commit install                              # one-time setup of git hooks

ruff check src tests && ruff format src tests   # lint + format
mypy src                                        # type-check
pytest -v --cov=thruk_mcp --cov-fail-under=80   # tests with coverage gate

Conventions:

  • Conventional Commits (feat:, fix:, chore:, docs:, refactor:, test:).
  • No direct push to main: branch → PR → squash merge.
  • Any new tool must come with a respx-mocked unit test in tests/test_tools.py and an entry in catalog/tools.json (Docker MCP Registry contract).
  • CI gate: ruff, ruff format --check, mypy, pytest with 80 % coverage minimum.

References

Project docs

  • CHANGELOG.md: what changed in each release.
  • UPGRADING.md: per-version migration notes.
  • SUPPORT.md: supported Python / Thruk / MCP-client versions, security policy, release cadence.
  • CONTRIBUTING.md: dev setup, PR conventions, tool / env-var contribution checklists.

License

MIT — see LICENSE.
