databricks-mcp
Safe, read-only SQL analytics for AI agents — over MCP. Point an agent at a SQL warehouseand let it explore, profile, and query data without any risk of mutating it.
What it is
databricks-mcp is a Model Context Protocol server that givesan AI agent safe, read-only analytics access to a SQL warehouse. It exposes five typed tools —list_tables, describe_table, sample_rows, run_sql, and profile_table — and routes everyquery through an AST-based safety guard that enforces read-only, single-statement, and row-capguarantees.
Two backends ship in the box:
- DuckDB (default) — runs fully offline against a bundled synthetic logistics warehouse(shipments, carriers, lanes). Zero setup, ~30 seconds to first query.
- Databricks SQL Warehouse — connect to a real warehouse with a few environment variables.
About the name: the project is named for its Databricks backend, but it runs completelyoffline on DuckDB out of the box — you don't need a Databricks account to try it.
30-second quickstart
uvx databricks-mcp # runs on the bundled DuckDB logistics sample data
That's it — the server starts on stdio with the sample warehouse loaded and waits for an MCPclient.
Claude Desktop config
Add this to your Claude Desktop MCP configuration (claude_desktop_config.json):
{
"mcpServers": {
"databricks-mcp": { "command": "uvx", "args": ["databricks-mcp"] }
}
}
Restart Claude Desktop and the five tools become available to the assistant.
Connecting a real Databricks SQL Warehouse
Set the backend to databricks and provide your warehouse credentials via environment variables:
export DB_BACKEND=databricks
export DATABRICKS_SERVER_HOSTNAME=... # e.g. dbc-xxxxxxxx-xxxx.cloud.databricks.com
export DATABRICKS_HTTP_PATH=... # e.g. /sql/1.0/warehouses/abc123
export DATABRICKS_TOKEN=... # a Databricks personal access token
Optionally cap the maximum rows any single query may return (default 1000):
export MAX_ROWS=500
Secrets are only ever read from the environment and are never logged.
Tools
| Tool | Input | Output |
|---|---|---|
list_tables |
— | Table names and column counts. |
describe_table |
table |
Columns, types, and row count. |
sample_rows |
table, n (default 10, max 100) |
Preview rows from the table. |
run_sql |
query |
Guarded read-only result rows (row-capped). |
profile_table |
table |
Per-column null fraction, distinct count, and min/max. |
All inputs and outputs are typed with pydantic models, so the agent receives clean JSON schemas.
Safety / guardrails
Every query passed to run_sql — and every statement the other tools generate internally —goes through safety.py, which validates against the parsed sqlglotAST rather than fragile string matching:
- Parse or reject. Anything that fails to parse is rejected with a structured error.
- Single statement only. Multi-statement input is rejected, blocking stacked-query injection.
- Read-only only. Only
SELECTand CTE (WITH) queries are allowed. AnyINSERT/UPDATE/DELETE/DROP/ALTER/CREATE/GRANT/COPY/CALL/PRAGMA/ATTACHis rejected. - System-table block. References to
information_schema,pg_catalog,system, and similarcatalogs are denied — agents introspect schema throughlist_tables/describe_tableinstead. - Filesystem-function block. Read-only SELECTs can still call table functions like
read_csv,read_parquet,read_text, andglobto read local files. The guard walks the ASTand denies these, so an agent can't exfiltrate the host filesystem (e.g.SELECT * FROM read_text('/etc/passwd')). - Auto-LIMIT. A
LIMIT(default1000, configurable viaMAX_ROWS) is injected when absent,so an agent can never pull unbounded data.
Identifier arguments (table) are additionally checked against the known-table list before theyare ever interpolated into SQL, preventing identifier injection.
Every one of these rules is backed by a passing test — seetests/test_safety.py (read-only allowlist, multi-statement, unparseable,system-table, filesystem-function, and auto-LIMIT cases) and tests/test_duckdb_backend.py(unknown-table rejection, row-cap truncation). The README makes no guardrail claim that isn'tproven by the suite.
Recorded transcript
An agent exploring the bundled logistics warehouse:
> list_tables
[
{"name": "carriers", "column_count": 4},
{"name": "lanes", "column_count": 4},
{"name": "shipments", "column_count": 7}
]
> run_sql: SELECT c.mode,
count(*) AS shipments,
round(100.0 * avg(s.delivered_on_time::INT), 1) AS on_time_pct
FROM shipments s
JOIN carriers c ON s.carrier_id = c.carrier_id
GROUP BY c.mode
ORDER BY shipments DESC
columns: ["mode", "shipments", "on_time_pct"]
rows:
["LTL", 1250, 88.0]
["Intermodal", 1250, 88.0]
["FTL", 1250, 88.0]
["Parcel", 1250, 88.0]
truncated: false
A DDL attempt is refused before it ever reaches the warehouse:
> run_sql: DROP TABLE shipments
SQLValidationError: Only read-only SELECT queries are allowed.
Development
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
pytest
ruff check .
The DuckDB sample data is regenerated deterministically with:
python sample_data/generate.py
License
MIT — see LICENSE.