Caseware Procurement Knowledge Base
Local RAG pipeline and MCP server for the Caseware Senior AI Data take-home exercise.
Resumen visual (HTML): docs/take-home-summary.html Repositorio: https://github.com/leonardorhojas/caseware-procurement-kb
Quick start (how to run)
Run these steps in order from the project root:
cd /Users/rafael/Projects/CASEWARE
# 1. Create and activate the virtual environment (once)
python3 -m venv .venv
source .venv/bin/activate
# 2. Install dependencies (once)
pip install -e .
# 3. Configure API key (once) optional if using local embeddings
cp .env.example .env
# Edit .env and set OPENAI_API_KEY=sk-...
# 4. Build the knowledge base index (run after setup, or when data changes)
python scripts/ingest.py
# 5. Use the MCP server from Cursor (recommended)
# See "Connect from Cursor" below reload MCP after config changes.
# Then ask questions in Cursor chat, e.g.:
# "Which documents support order 10687?"
# Optional: run MCP server manually in a terminal (stdio for debugging only)
# python mcp_server/server.py
Expected ingest output: JSON summary with documents_indexed, chunks_indexed, and artifact_dir: ".../artifacts". If that folder exists with kb.sqlite and vectors.npy, the index is ready.
Typical workflow:
| Step | Command | When |
|---|---|---|
| Setup | pip install -e . + .env |
Once |
| Index data | python scripts/ingest.py |
Once, or after adding/changing files in data/ |
| Query via agent | Cursor MCP caseware-kb | Every session |
The MCP server does not expose a web UI. It communicates over stdio with Cursor (or another MCP client). Running python mcp_server/server.py in a terminal will appear to hang that is normal; connect it through Cursor instead.
Prerequisites
- Python 3.11+
- OpenAI API key (optional; local embeddings are used automatically if OpenAI quota is unavailable)
- Tesseract OCR for invoice JPGs
- macOS:
brew install tesseract
- macOS:
Setup (detailed)
Same steps as Quick start; use this if you need more context.
cd /Users/rafael/Projects/CASEWARE
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
cp .env.example .env
# Edit .env and set OPENAI_API_KEY=sk-...
On macOS, activate the venv in each new terminal:
source /Users/rafael/Projects/CASEWARE/.venv/bin/activate
Environment variables
| Variable | Default | Description |
|---|---|---|
OPENAI_API_KEY |
- | OpenAI key for embeddings |
CASEWARE_EMBEDDING_PROVIDER |
openai |
openai or local |
CASEWARE_LOCAL_EMBEDDING_MODEL |
all-MiniLM-L6-v2 |
Local fallback model |
CASEWARE_DATA_DIR |
data |
Source documents |
CASEWARE_ARTIFACTS_DIR |
artifacts |
Generated index |
If OpenAI returns a quota error during ingest, the pipeline automatically falls back to local embeddings and records the provider in artifacts/manifest.json.
Run the data pipeline
Must run before using MCP. Re-run if you add documents or install Tesseract for JPG OCR.
source .venv/bin/activate
python scripts/ingest.py
Alternative entry point:
caseware-ingest
This will:
- Scan
data/for PDFs and images - Extract text (PyMuPDF for PDFs, Tesseract for JPGs)
- Chunk text and infer metadata (
doc_type,order_id, inventoryperiod) - Build embeddings and persist:
artifacts/kb.sqliteartifacts/vectors.npyartifacts/manifest.json
Run the MCP server
The MCP server reads the pre-built index from artifacts/ it does not ingest documents at startup.
Option A Cursor (recommended)
- Ensure ingest has completed (
artifacts/kb.sqliteexists). - Config is in
.cursor/mcp.json(and optionally~/.cursor/mcp.json). - Restart Cursor or reload MCP in Settings ? MCP.
- Enable caseware-kb and ask questions in chat.
Option B Manual stdio (debugging)
source .venv/bin/activate
python mcp_server/server.py
This blocks the terminal waiting for an MCP client on stdin/stdout. Use only to debug; normal usage is through Cursor.
Alternative entry point:
caseware-mcp
Connect from Cursor
Project config: .cursor/mcp.json
Also added to your global Cursor MCP settings (~/.cursor/mcp.json).
Restart Cursor (or reload MCP servers in Settings > MCP), then enable caseware-kb.
{
"mcpServers": {
"caseware-kb": {
"command": "/Users/rafael/Projects/CASEWARE/.venv/bin/python",
"args": ["/Users/rafael/Projects/CASEWARE/mcp_server/server.py"],
"cwd": "/Users/rafael/Projects/CASEWARE",
"env": {
"CASEWARE_EMBEDDING_PROVIDER": "local"
}
}
}
}
Architecture overview
flowchart TB
subgraph ingest [Offline ingest]
Files[data PDFs and JPGs]
Extract[extract.py]
Chunk[chunk.py]
Embed[embed.py]
Store[store.py SQLite plus numpy]
Files --> Extract --> Chunk --> Embed --> Store
end
subgraph runtime [MCP runtime]
Load[KnowledgeBase.load]
Tools[MCP tools]
Store --> Load --> Tools
end
Agent[Cursor or Claude Desktop] --> Tools
Data flow
- Files in
data/are classified by folder (invoices,purchase_orders, etc.). order_idis parsed from filenames (invoice_10687.pdf) or document text (Order ID: 10687).- Text is chunked (~900 chars, 100 overlap) with page references preserved.
- Embeddings are stored in
vectors.npy; metadata and chunk text live in SQLite. - MCP tools query the pre-built index (no extraction at request time).
Data modeling
| Field | Source | Used for |
|---|---|---|
doc_type |
folder name | filtering, gap analysis |
order_id |
filename / OCR / text | cross-document joins |
period |
inventory filename | inventory report queries |
page |
PDF page / OCR page | citations |
chunk_id |
generated | source references |
Retrieval strategy
The system uses layered retrieval not a single search engine. Implementation: src/caseware_kb/retrieve.py.
| Layer | When | How |
|---|---|---|
| Structured routing | Order IDs, gaps, inventory lists | SQL on metadata (matching.py) |
| Vector search | Semantic questions (contract terms, concepts) | Cosine similarity on chunk embeddings |
| Keyword fallback | Vector search returns no hits | SQL LIKE '%term%' with AND across query terms |
What we do not use: BM25, SQLite FTS5, or score fusion (RRF). The current keyword fallback is a simple substring match, not ranked lexical retrieval.
Hybrid flow today: try vector search first; if empty, fall back to keyword LIKE. See hybrid_search() in retrieve.py.
Structured extraction vs document retrieval
- Structured:
order_id,doc_type, andperiodare parsed at ingest time and queried via dedicated MCP tools. Best for audit-style questions with clear keys (order 10687, missing POs). - Document retrieval: chunk text + embeddings for open-ended questions (contract clauses, vendor/item mentions). Returns cited snippets, not pre-extracted fields.
Future improvement: BM25 lexical search
Why it would help: Vector search finds semantically similar text but can miss exact tokens (order numbers, product SKUs, legal clause references like "Clause 7", vendor names). Our current LIKE fallback has no relevance ranking and requires all terms to match.
How BM25 works: BM25 (Best Matching 25) is a probabilistic ranking function for keyword search. For each query term it scores documents higher when:
- the term appears often in the chunk (term frequency), with diminishing returns;
- the term is rare across the whole corpus (inverse document frequency discriminative terms rank higher);
- shorter chunks are not unfairly boosted (length normalization).
A typical upgrade path for this project:
- Index chunk text in SQLite FTS5 (or a small BM25 library) at ingest time.
- Run vector search and BM25 in parallel on the same query.
- Merge ranks (e.g. reciprocal rank fusion) so exact ID/clause matches and semantic matches both surface.
That would improve queries like "Find evidence for product Spegesild" or "Clause 14 damages" without replacing the structured tools for order-level joins.
MCP tools
| Tool | Purpose |
|---|---|
search_knowledge_base |
Semantic search with optional doc_type / order_id filters |
get_order_documents |
All documents for an order ID |
find_document_gaps |
Invoices without PO, shipping without invoice, etc. |
list_inventory_reports |
Available stock reports and periods |
get_source_excerpt |
Expand a cited chunk |
Citation strategy
Search results return:
chunk_id,doc_id,doc_typefile_path,page,order_idscore,snippet
Use get_source_excerpt(chunk_id=...) for longer grounded context.
MCP demo evidence
Validated with caseware-kb MCP server connected in Cursor (June 2026).
1. Order document bundle get_order_documents
Query: "Which documents support order 10687?"
Tool: get_order_documents(order_id="10687")
| Type | Document | Path | Pages |
|---|---|---|---|
| Invoice | invoice 10687 | data/invoices/invoice_10687.pdf |
1 |
| Shipping order | order 10687 | data/shipping_orders/order_10687.pdf |
2 |
No purchase order, inventory report, or contract is linked to this order. Customer: HUNGO (Hungry Owl All-Night Grocers), order date 2017-09-30.
2. Cross-document gaps find_document_gaps
Query: "Which invoices are missing a matching purchase order?"
Tool: find_document_gaps()
Invoices without PO (3):
| Order ID | Invoice on file |
|---|---|
| 10436 | Yes no matching PO |
| 10687 | Yes no matching PO |
| 10839 | Yes no matching PO |
Context from the knowledge base:
- 8 invoices, 8 purchase orders, 16 shipping orders indexed
- 5 order IDs with full invoice + PO + shipping match:
10248,10382,10442,10603,10711
3. Contract terms search_knowledge_base
Query: "Summarize contract terms relevant to supply of goods"
Tool: search_knowledge_base(query="contract terms supply of goods", doc_type="contract")
Source: TotalEnergies Master Contract for Supply of Goods and/or Services (dated 20231231)
Key terms retrieved from indexed contract chunks (primarily Clause 7, with supporting clauses):
| Topic | Summary |
|---|---|
| Framework | Master Contract + Order Requests; no exclusivity; open Orders survive contract end |
| Ordering | Each Order specifies goods, quantity, delivery date/site, and price |
| Delivery (Clause 7) | Deliver on date at place; partial deliveries need consent; Supplier bears transport/packing risk |
| Title & risk | Title on identification/payment/acceptance; default Incoterms DDP if unspecified |
| Quality | Goods must conform to specs, Good Industry Practice, and Applicable Laws |
| Warranty | 12 months from first use or 18 months from delivery (whichever ends first) |
| Acceptance (Clause 9) | Customer inspects; may accept, accept with reserves, or refuse non-conforming goods |
| Pricing (Clause 13) | Firm prices; payment within 30 days; invoices must reference order and goods |
| Remedies (Clause 14) | Late/non-conforming delivery triggers damages and third-party procurement rights |
Practical takeaway: Customer-favorable terms on delivery, conformity, warranty, and remedies with firm pricing and clear rights to reject, terminate, or substitute supply.
4. Inventory reports list_inventory_reports
Query: "What inventory reports are available?"
Tool: list_inventory_reports()
7 stock reports (1 page each), covering July 2016 through January 2018:
| Period | Report | File |
|---|---|---|
| 2016-07 | StockReport 2016-07 8 | data/inventory_reports/StockReport_2016-07_8.pdf |
| 2016-12 | StockReport 2016-12 1 | data/inventory_reports/StockReport_2016-12_1.pdf |
| 2017-03 | StockReport 2017-03 6 | data/inventory_reports/StockReport_2017-03_6.pdf |
| 2017-06 | StockReport 2017-06 7 | data/inventory_reports/StockReport_2017-06_7.pdf |
| 2017-09 | StockReport 2017-09 4 | data/inventory_reports/StockReport_2017-09_4.pdf |
| 2017-11 | StockReport 2017-11 3 | data/inventory_reports/StockReport_2017-11_3.pdf |
| 2018-01 | StockReport 2018-01 5 | data/inventory_reports/StockReport_2018-01_5.pdf |
Coverage is roughly quarterly with gaps (e.g. no 2017-12 report). None are linked to specific order IDs.
5. Shipment vs invoice match get_order_documents + document content
Query: "Does shipment 10436 match the related invoice?"
Tools: get_order_documents(order_id="10436") plus line-item comparison from indexed text
Result: Yes shipment 10436 matches invoice 10436.
| Field | Shipment (order_10436.pdf) |
Invoice (invoice_10436.pdf) |
|---|---|---|
| Order ID | 10436 | 10436 |
| Customer ID | BLONP | BLONP |
| Order date | 2017-02-05 | 2017-02-05 |
| Address | 24, place Klber, Strasbourg, 67000, France | Same |
Line items (identical on both):
| Product | Qty | Unit price | Line total |
|---|---|---|---|
| Spegesild | 5 | 9.6 | 48.0 |
| Gnocchi di nonna Alice | 40 | 30.4 | 1,216.0 |
| Wimmers gute Semmelkndel | 30 | 26.6 | 798.0 |
| Rhnbru Klosterbier | 24 | 6.2 | 148.8 |
Grand total: 2,210.8 on both documents.
Notes:
- Shipped 2017-02-11 via United Package (shipping order only)
- No purchase order on file for order 10436 consistent invoice/shipment pair, but incomplete procurement trail
- Minor naming differences (ship-to vs invoice contact) do not affect product or financial match
Suggested demo flow (1015 min presentation)
- Architecture: ingest pipeline ? artifacts ? MCP tools
- Structured lookup: order 10687 (
get_order_documents) - Gap analysis: missing POs (
find_document_gaps) - Semantic search: contract supply terms (
search_knowledge_base) - Metadata query: inventory periods (
list_inventory_reports) - Cross-doc reasoning: shipment vs invoice 10436
Manual validation checklist
- Order
10687-> invoice + shipping (no PO in dataset) - Order
10436-> invoice + shipping, no PO - Contract TotalEnergies -> semantic search returns contract chunks
- 7 inventory reports indexed for periods 2016-07 through 2018-01
- JPG invoices require Tesseract (
brew install tesseract, then re-run ingest)
Project layout
CASEWARE/
|-- data/
|-- artifacts/ # generated index (gitignored)
|-- scripts/ingest.py
|-- mcp_server/server.py
`-- src/caseware_kb/
AI-assisted development note
This project was built with Cursor AI assistance. Parts assisted by AI: project scaffold, pipeline modules, MCP server, README, and debugging the tool naming collision in mcp_server/server.py.
Validation: python scripts/ingest.py, MCP client smoke tests, and the live demo queries documented above via Cursor + caseware-kb MCP.