PDF Reader MCP
Production-ready PDF processing server for AI agents
5-10x faster parallel processing • Y-coordinate content ordering • 94%+ test coverage • 103 tests passing
Based on the original project: this project is a modified version of pdf-reader-mcp.
Overview
PDF Reader MCP is a production-ready Model Context Protocol server that empowers AI agents with enterprise-grade PDF processing capabilities. Extract text, images, and metadata with unmatched performance and reliability.
The Problem:
// Traditional PDF processing
- Sequential page processing (slow)
- No natural content ordering
- Complex path handling
- Poor error isolation
The Solution:
// PDF Reader MCP
- 5-10x faster parallel processing
- Y-coordinate based ordering
- Flexible path support (absolute/relative)
- Per-page error resilience
- 94%+ test coverage
Result: Production-ready PDF processing that scales.
Key Features
Performance
- 5-10x faster than sequential with automatic parallelization
- 12,933 ops/sec error handling, 5,575 ops/sec text extraction
- Process 50-page PDFs in seconds with multi-core utilization
- Lightweight with minimal dependencies
Developer Experience
- Path Flexibility - Absolute & relative paths, Windows/Unix support (v1.3.0)
- Smart Ordering - Y-coordinate based content ordering preserves document layout
- Type Safe - Full TypeScript with strict mode enabled
- Battle-tested - 103 tests, 94%+ coverage, 98%+ function coverage
- Simple API - Single tool handles all operations
Performance Benchmarks
Real-world performance from production testing:
| Operation | Ops/sec | Performance | Use Case |
|---|---|---|---|
| Error handling | 12,933 | ⚡⚡⚡⚡⚡ | Validation & safety |
| Extract full text | 5,575 | ⚡⚡⚡⚡ | Document analysis |
| Extract page | 5,329 | ⚡⚡⚡⚡ | Single page ops |
| Multiple pages | 5,242 | ⚡⚡⚡⚡ | Batch processing |
| Metadata only | 4,912 | ⚡⚡⚡ | Quick inspection |
Parallel Processing Speedup
| Document | Sequential | Parallel | Speedup |
|---|---|---|---|
| 10-page PDF | ~2s | ~0.3s | 5-8x faster |
| 50-page PDF | ~10s | ~1s | 10x faster |
| 100+ pages | ~20s | ~2s | Linear scaling with CPU cores |
Benchmarks vary based on PDF complexity and system resources.
Installation
# Quick start - zero installation
npx @sylphx/pdf-reader-mcp
# Using pnpm (recommended)
pnpm add @sylphx/pdf-reader-mcp
# Using npm
npm install @sylphx/pdf-reader-mcp
# Using yarn
yarn add @sylphx/pdf-reader-mcp
# For Claude Desktop (easiest)
npx -y @smithery/cli install @sylphx/pdf-reader-mcp --client claude
Quick Start
Configuration
Add to your MCP client (claude_desktop_config.json, Cursor, Cline):
{
"mcpServers": {
"pdf-reader-mcp": {
"command": "npx",
"args": ["@bachstudio/pdf-reader-mcp"]
}
}
}
Basic Usage
{
"sources": [{
"path": "documents/report.pdf"
}],
"include_full_text": true,
"include_metadata": true,
"include_page_count": true
}
Result:
- Full text content extracted
- PDF metadata (author, title, dates)
- Total page count
- Structural sharing - unchanged parts preserved
Extract Specific Pages
{
"sources": [{
"path": "documents/manual.pdf",
"pages": "1-5,10,15-20"
}],
"include_full_text": true
}
Absolute Paths (v1.3.0+)
// Windows - Both formats work!
{
"sources": [{
"path": "C:\\Users\\John\\Documents\\report.pdf"
}],
"include_full_text": true
}
// Unix/Mac
{
"sources": [{
"path": "/home/user/documents/contract.pdf"
}],
"include_full_text": true
}
No more "Absolute paths are not allowed" errors!
Extract Images with Natural Ordering
{
"sources": [{
"path": "presentation.pdf",
"pages": [1, 2, 3]
}],
"include_images": true,
"include_full_text": true
}
Response includes:
- Text and images in exact document order (Y-coordinate sorted)
- Base64-encoded images with metadata (width, height, format)
- Natural reading flow preserved for AI comprehension
Batch Processing
{
"sources": [
{ "path": "C:\\Reports\\Q1.pdf", "pages": "1-10" },
{ "path": "/home/user/Q2.pdf", "pages": "1-10" },
{ "url": "https://example.com/Q3.pdf" }
],
"include_full_text": true
}
All PDFs are processed in parallel automatically!
Features
Core Capabilities
- Text Extraction - Full document or specific pages with intelligent parsing
- Image Extraction - Base64-encoded with complete metadata (width, height, format)
- Content Ordering - Y-coordinate based layout preservation for natural reading flow
- Metadata Extraction - Author, title, creation date, and custom properties
- Page Counting - Fast enumeration without loading full content
- Dual Sources - Local files (absolute or relative paths) and HTTP/HTTPS URLs
- Batch Processing - Multiple PDFs processed concurrently
Advanced Features
- 5-10x Performance - Parallel page processing with Promise.all
- Smart Pagination - Extract ranges like "1-5,10-15,20"
- Multi-Format Images - RGB, RGBA, Grayscale with automatic detection
- Path Flexibility - Windows, Unix, and relative paths all supported (v1.3.0)
- Error Resilience - Per-page error isolation with detailed messages
- Large File Support - Efficient streaming and memory management
- Type Safe - Full TypeScript with strict mode enabled
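The combination of Promise.all parallelism and per-page error isolation listed above can be sketched roughly as follows. This is an illustration only, not the server's actual code: `PageResult`, `processPages`, and the `extractPage` callback are hypothetical names introduced for this example.

```typescript
// Hypothetical result type: each page yields either text or an error message.
type PageResult =
  | { page: number; ok: true; text: string }
  | { page: number; ok: false; error: string };

// `extractPage` stands in for whatever per-page extraction the server performs.
async function processPages(
  pages: number[],
  extractPage: (page: number) => Promise<string>,
): Promise<PageResult[]> {
  // Start every page at once. Each task catches its own failure, so one
  // bad page cannot reject the whole Promise.all.
  const tasks = pages.map(async (page): Promise<PageResult> => {
    try {
      return { page, ok: true, text: await extractPage(page) };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      return { page, ok: false, error };
    }
  });
  // Promise.all preserves input order, so results line up with `pages`.
  return Promise.all(tasks);
}
```

Catching inside each task, rather than around Promise.all, is what keeps a single corrupt page from discarding the results of the pages that succeeded.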
What's New in v1.3.0
Absolute Paths Now Supported!
// Windows
{ "path": "C:\\Users\\John\\Documents\\report.pdf" }
{ "path": "C:/Users/John/Documents/report.pdf" }
// Unix/Mac
{ "path": "/home/john/documents/report.pdf" }
{ "path": "/Users/john/Documents/report.pdf" }
// Relative (still works)
{ "path": "documents/report.pdf" }
Other Improvements:
- Fixed Zod validation error handling
- Updated all dependencies to latest versions
- 103 tests passing, 94%+ coverage maintained
v1.2.0 - Content Ordering
- Y-coordinate based text and image ordering
- Natural reading flow for AI models
- Intelligent line grouping
v1.1.0 - Image Extraction & Performance
- Base64-encoded image extraction
- 10x speedup with parallel processing
- Comprehensive test coverage (94%+)
View Full Changelog →
API Reference
read_pdf Tool
The single tool that handles all PDF operations.
Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| sources | Array | List of PDF sources to process | Required |
| include_full_text | boolean | Extract full text content | false |
| include_metadata | boolean | Extract PDF metadata | true |
| include_page_count | boolean | Include total page count | true |
| include_images | boolean | Extract embedded images | false |
Source Object
{
path?: string; // Local file path (absolute or relative)
url?: string; // HTTP/HTTPS URL to PDF
pages?: string | number[]; // Pages to extract: "1-5,10" or [1,2,3]
}
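For client code that builds requests, the parameter table and Source Object above could be mirrored in TypeScript along these lines. The interface names (`PdfSource`, `ReadPdfArgs`) and the `withDefaults` helper are hypothetical and not part of the package; only the field names and default values come from the table.

```typescript
// Mirrors the Source Object shown above.
interface PdfSource {
  path?: string;             // Local file path (absolute or relative)
  url?: string;              // HTTP/HTTPS URL to PDF
  pages?: string | number[]; // "1-5,10" or [1, 2, 3]
}

// Mirrors the read_pdf parameter table.
interface ReadPdfArgs {
  sources: PdfSource[];
  include_full_text?: boolean;
  include_metadata?: boolean;
  include_page_count?: boolean;
  include_images?: boolean;
}

// Fills in the documented defaults for any flag the caller left out.
function withDefaults(args: ReadPdfArgs): Required<ReadPdfArgs> {
  return {
    sources: args.sources,
    include_full_text: args.include_full_text ?? false,
    include_metadata: args.include_metadata ?? true,
    include_page_count: args.include_page_count ?? true,
    include_images: args.include_images ?? false,
  };
}
```

Calling `withDefaults({ sources: [{ path: "large.pdf" }] })` yields a request with metadata and page count enabled and text and image extraction disabled, matching the table.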
Examples
Metadata only (fast):
{
"sources": [{ "path": "large.pdf" }],
"include_metadata": true,
"include_page_count": true,
"include_full_text": false
}
From URL:
{
"sources": [{
"url": "https://arxiv.org/pdf/2301.00001.pdf"
}],
"include_full_text": true
}
Page ranges:
{
"sources": [{
"path": "manual.pdf",
"pages": "1-5,10-15,20" // Pages 1,2,3,4,5,10,11,12,13,14,15,20
}]
}
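To make the range syntax concrete, here is a minimal sketch of how a string such as "1-5,10-15,20" could be expanded into page numbers. `parsePageRanges` is a hypothetical helper, not the server's parser, and it assumes 1-based pages and closed ranges only; whether the server accepts other forms (for example open-ended ranges) is not covered here.

```typescript
// Expands a page spec like "1-5,10-15,20" into a sorted list of unique pages.
function parsePageRanges(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    // Accept either a single page ("10") or a closed range ("1-5").
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page range: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) throw new Error(`Invalid page range: "${token}"`);
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}
```

With this sketch, `parsePageRanges("1-5,10-15,20")` returns pages 1 through 5, 10 through 15, and 20, the same expansion noted in the example above.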
Advanced Usage
Y-Coordinate Content Ordering
Content is returned in natural reading order based on Y-coordinates:
Document Layout:
┌─────────────────────┐
│ [Title]    Y:100    │
│ [Image]    Y:150    │
│ [Text]     Y:400    │
│ [Photo A]  Y:500    │
│ [Photo B]  Y:550    │
└─────────────────────┘
Response Order:
[
{ type: "text", text: "Title..." },
{ type: "image", data: "..." },
{ type: "text", text: "..." },
{ type: "image", data: "..." },
{ type: "image", data: "..." }
]
Benefits:
- AI understands spatial relationships
- Natural document comprehension
- Perfect for vision-enabled models
- Automatic multi-line text grouping
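The ordering and line grouping described above can be approximated with a sort plus a merge pass. The sketch below is an assumption-laden illustration, not the server's algorithm: `ContentItem`, `orderByY`, and the `lineTolerance` parameter are hypothetical, and Y is assumed to grow downward as in the diagram (raw PDF coordinates usually grow upward, so a real implementation may need to flip the comparison).

```typescript
// Hypothetical content item carrying the Y position used for ordering.
type ContentItem =
  | { type: "text"; y: number; text: string }
  | { type: "image"; y: number; data: string };

function orderByY(items: ContentItem[], lineTolerance = 2): ContentItem[] {
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...items].sort((a, b) => a.y - b.y);
  const out: ContentItem[] = [];
  for (const item of sorted) {
    const prev = out[out.length - 1];
    // Merge neighbouring text fragments whose Y values are close to each other.
    if (
      prev !== undefined &&
      prev.type === "text" &&
      item.type === "text" &&
      Math.abs(item.y - prev.y) <= lineTolerance
    ) {
      out[out.length - 1] = { ...prev, text: `${prev.text} ${item.text}` };
    } else {
      out.push(item);
    }
  }
  return out;
}
```

Applied to the layout in the diagram, this produces the text, image, text, image, image sequence shown under Response Order.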
Image Extraction
Enable extraction:
{
"sources": [{ "path": "manual.pdf" }],
"include_images": true
}
Response format:
{
"images": [{
"page": 1,
"index": 0,
"width": 1920,
"height": 1080,
"format": "rgb",
"data": "base64-encoded-png..."
}]
}
Supported formats: RGB, RGBA, Grayscale
Auto-detected: JPEG, PNG, and other embedded formats
Path Configuration
Absolute paths (v1.3.0+) - Direct file access:
{ "path": "C:\\Users\\John\\file.pdf" }
{ "path": "/home/user/file.pdf" }
Relative paths - Workspace files:
{ "path": "docs/report.pdf" }
{ "path": "./2024/Q1.pdf" }
Configure working directory:
{
"mcpServers": {
"pdf-reader-mcp": {
"command": "npx",
"args": ["@sylphx/pdf-reader-mcp"],
"cwd": "/path/to/documents"
}
}
}
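As a rough illustration of how the three path styles differ, the sketch below classifies a path and resolves relative ones against a working directory. `classifyPath` and `resolveAgainstCwd` are hypothetical helpers written with plain string checks; the treatment of UNC paths is an assumption, `..` segments are not normalized, and real code would more likely rely on Node's path module.

```typescript
type PathKind = "windows-absolute" | "unix-absolute" | "relative";

function classifyPath(p: string): PathKind {
  // Drive-letter paths (C:\... or C:/...) and, as an assumption, UNC paths (\\server\share).
  if (/^[A-Za-z]:[\\/]/.test(p) || p.startsWith("\\\\")) return "windows-absolute";
  if (p.startsWith("/")) return "unix-absolute";
  return "relative";
}

// Absolute paths pass through unchanged; relative ones are joined onto `cwd`.
function resolveAgainstCwd(p: string, cwd: string): string {
  if (classifyPath(p) !== "relative") return p;
  const base = cwd.replace(/[\\/]+$/, ""); // drop trailing separators
  const rel = p.replace(/^\.[\\/]/, "");   // drop a leading "./"
  return `${base}/${rel}`;
}
```

For example, with "cwd" set to /path/to/documents, the relative path ./2024/Q1.pdf would resolve to /path/to/documents/2024/Q1.pdf under this sketch.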
Large PDF Strategies
Strategy 1: Page ranges
{ "sources": [{ "path": "big.pdf", "pages": "1-20" }] }
Strategy 2: Progressive loading
// Step 1: Get page count
{ "sources": [{ "path": "big.pdf" }], "include_full_text": false }
// Step 2: Extract sections
{ "sources": [{ "path": "big.pdf", "pages": "50-75" }] }
Strategy 3: Parallel batching
{
"sources": [
{ "path": "big.pdf", "pages": "1-50" },
{ "path": "big.pdf", "pages": "51-100" }
]
}
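Strategies 2 and 3 can be combined: fetch the page count first, then generate the batched `sources` array from it. The helper below is a sketch under that assumption; `BatchSource` and `buildBatchSources` are hypothetical names, and the batch size is left to the caller.

```typescript
interface BatchSource {
  path: string;
  pages: string;
}

// Splits pages 1..pageCount into consecutive ranges of at most `batchSize` pages.
function buildBatchSources(path: string, pageCount: number, batchSize: number): BatchSource[] {
  if (!Number.isInteger(pageCount) || !Number.isInteger(batchSize) || pageCount < 1 || batchSize < 1) {
    throw new Error("pageCount and batchSize must be positive integers");
  }
  const sources: BatchSource[] = [];
  for (let start = 1; start <= pageCount; start += batchSize) {
    const end = Math.min(start + batchSize - 1, pageCount);
    // A one-page batch is written as a single number rather than "n-n".
    sources.push({ path, pages: start === end ? `${start}` : `${start}-${end}` });
  }
  return sources;
}
```

`buildBatchSources("big.pdf", 100, 50)` produces the same two entries as the Strategy 3 example above.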
Troubleshooting
"Absolute paths are not allowed"
Solution: Upgrade to v1.3.0+
npm update @sylphx/pdf-reader-mcp
Restart your MCP client completely.
"File not found"
Causes:
- File doesn't exist at path
- Wrong working directory
- Permission issues
Solutions:
Use absolute path:
{ "path": "C:\\Full\\Path\\file.pdf" }
Or configure cwd:
{
"pdf-reader-mcp": {
"command": "npx",
"args": ["@sylphx/pdf-reader-mcp"],
"cwd": "/path/to/docs"
}
}
"No tools showing up"
Solution:
npm cache clean --force
rm -rf node_modules package-lock.json
npm install @sylphx/pdf-reader-mcp@latest
Restart MCP client completely.
Architecture
Tech Stack
| Component | Technology |
|---|---|
| Runtime | Node.js 22+ ESM |
| PDF Engine | PDF.js (Mozilla) |
| Validation | Zod + JSON Schema |
| Protocol | MCP SDK |
| Language | TypeScript (strict) |
| Testing | Vitest (103 tests) |
| Quality | Biome (50x faster) |
| CI/CD | GitHub Actions |
Design Principles
- Security First - Flexible paths with secure defaults
- Simple Interface - One tool, all operations
- Performance - Parallel processing, efficient memory
- Reliability - Per-page isolation, detailed errors
- Quality - 94%+ coverage, strict TypeScript
- Type Safety - No any types, strict mode
- Backward Compatible - Smooth upgrades always
Development
Setup & Scripts
Prerequisites:
- Node.js >= 22.0.0
- pnpm (recommended) or npm
Setup:
git clone https://github.com/SylphxAI/pdf-reader-mcp.git
cd pdf-reader-mcp
pnpm install && pnpm build
Scripts:
pnpm run build # Build TypeScript
pnpm run test # Run 103 tests
pnpm run test:cov # Coverage (94%+)
pnpm run check # Lint + format
pnpm run check:fix # Auto-fix
pnpm run benchmark # Performance tests
Quality:
- 103 tests
- 94%+ coverage
- 98%+ function coverage
- Zero lint errors
- Strict TypeScript
Contributing Quick Start:
- Fork repository
- Create branch: git checkout -b feature/awesome
- Make changes, then run the tests: pnpm test
- Format: pnpm run check:fix
- Commit: Use Conventional Commits
- Open PR
Commit Format:
feat(images): add WebP support
fix(paths): handle UNC paths
docs(readme): update examples
See CONTRIBUTING.md
Documentation
- Full Docs - Complete guides
- Getting Started - Quick start
- API Reference - Detailed API
- Design - Architecture
- Performance - Benchmarks
- Comparison - vs. alternatives
Roadmap
Completed
- Image extraction (v1.1.0)
- 5-10x parallel speedup (v1.1.0)
- Y-coordinate ordering (v1.2.0)
- Absolute paths (v1.3.0)
- 94%+ test coverage (v1.3.0)
Next
- OCR for scanned PDFs
- Annotation extraction
- Form field extraction
- Table detection
- 100+ MB streaming
- Advanced caching
- PDF generation
Vote at Discussions
Recognition
Trusted worldwide • Enterprise adoption • Battle-tested
Support
- Bug Reports
- Discussions
- Documentation
- Email
Show Your Support: Star • Watch • Report bugs • Suggest features • Contribute
Stats
103 Tests • 94%+ Coverage • Production Ready
License
MIT © Sylphx
Credits
Special thanks to the open source community ❤️
5-10x faster. Production-ready. Battle-tested. The PDF processing server that actually scales.
sylphx.com • @SylphxAI