SylphxAI

PDF Reader MCP 📄

📄 Production-ready MCP server for PDF processing - 5-10x faster with parallel processing and 94%+ test coverage

Production-ready PDF processing server for AI agents

5-10x faster parallel processing • Y-coordinate content ordering • 94%+ test coverage • 103 tests passing

🚀 Overview

PDF Reader MCP is a production-ready Model Context Protocol (MCP) server that lets AI agents extract text, images, and metadata from PDF files, with parallel page processing and per-page error isolation.

The Problem:

// Traditional PDF processing
- Sequential page processing (slow)
- No natural content ordering
- Complex path handling
- Poor error isolation

The Solution:

// PDF Reader MCP
- 5-10x faster parallel processing ⚡
- Y-coordinate based ordering 📐
- Flexible path support (absolute/relative) 🎯
- Per-page error resilience 🛡️
- 94%+ test coverage ✅

Result: Production-ready PDF processing that scales.

⚑ Key Features

Performance

  • 🚀 5-10x faster than sequential with automatic parallelization
  • ⚡ 12,933 ops/sec error handling, 5,575 ops/sec text extraction
  • 💨 Process 50-page PDFs in seconds with multi-core utilization
  • 📦 Lightweight with minimal dependencies

Developer Experience

  • 🎯 Path Flexibility - Absolute & relative paths, Windows/Unix support (v1.3.0)
  • 🖼️ Smart Ordering - Y-coordinate based content preserves document layout
  • 🛡️ Type Safe - Full TypeScript with strict mode enabled
  • 📚 Battle-tested - 103 tests, 94%+ coverage, 98%+ function coverage
  • 🎨 Simple API - Single tool handles all operations elegantly

📊 Performance Benchmarks

Real-world performance from production testing:

| Operation | Ops/sec | Performance | Use Case |
|---|---|---|---|
| Error handling | 12,933 | ⚡⚡⚡⚡⚡ | Validation & safety |
| Extract full text | 5,575 | ⚡⚡⚡⚡ | Document analysis |
| Extract page | 5,329 | ⚡⚡⚡⚡ | Single page ops |
| Multiple pages | 5,242 | ⚡⚡⚡⚡ | Batch processing |
| Metadata only | 4,912 | ⚡⚡⚡ | Quick inspection |

Parallel Processing Speedup

| Document | Sequential | Parallel | Speedup |
|---|---|---|---|
| 10-page PDF | ~2s | ~0.3s | 5-8x faster |
| 50-page PDF | ~10s | ~1s | 10x faster |
| 100+ pages | ~20s | ~2s | Linear scaling with CPU cores |

Benchmarks vary based on PDF complexity and system resources.

📦 Installation

# Quick start - zero installation
npx @sylphx/pdf-reader-mcp

# Using pnpm (recommended)
pnpm add @sylphx/pdf-reader-mcp

# Using npm
npm install @sylphx/pdf-reader-mcp

# Using yarn
yarn add @sylphx/pdf-reader-mcp

# For Claude Desktop (easiest)
npx -y @smithery/cli install @sylphx/pdf-reader-mcp --client claude

🎯 Quick Start

Configuration

Add to your MCP client (claude_desktop_config.json, Cursor, Cline):

{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@sylphx/pdf-reader-mcp"]
    }
  }
}

Basic Usage

{
  "sources": [{
    "path": "documents/report.pdf"
  }],
  "include_full_text": true,
  "include_metadata": true,
  "include_page_count": true
}

Result:

  • ✅ Full text content extracted
  • ✅ PDF metadata (author, title, dates)
  • ✅ Total page count
  • ✅ Structural sharing - unchanged parts preserved

Extract Specific Pages

{
  "sources": [{
    "path": "documents/manual.pdf",
    "pages": "1-5,10,15-20"
  }],
  "include_full_text": true
}

Absolute Paths (v1.3.0+)

// Windows - escaped backslashes shown here; forward slashes (C:/Users/...) also work
{
  "sources": [{
    "path": "C:\\Users\\John\\Documents\\report.pdf"
  }],
  "include_full_text": true
}

// Unix/Mac
{
  "sources": [{
    "path": "/home/user/documents/contract.pdf"
  }],
  "include_full_text": true
}

No more "Absolute paths are not allowed" errors!

Extract Images with Natural Ordering

{
  "sources": [{
    "path": "presentation.pdf",
    "pages": [1, 2, 3]
  }],
  "include_images": true,
  "include_full_text": true
}

Response includes:

  • Text and images in exact document order (Y-coordinate sorted)
  • Base64-encoded images with metadata (width, height, format)
  • Natural reading flow preserved for AI comprehension

Batch Processing

{
  "sources": [
    { "path": "C:\\Reports\\Q1.pdf", "pages": "1-10" },
    { "path": "/home/user/Q2.pdf", "pages": "1-10" },
    { "url": "https://example.com/Q3.pdf" }
  ],
  "include_full_text": true
}

⚑ All PDFs processed in parallel automatically!

✨ Features

Core Capabilities

  • ✅ Text Extraction - Full document or specific pages with intelligent parsing
  • ✅ Image Extraction - Base64-encoded with complete metadata (width, height, format)
  • ✅ Content Ordering - Y-coordinate based layout preservation for natural reading flow
  • ✅ Metadata Extraction - Author, title, creation date, and custom properties
  • ✅ Page Counting - Fast enumeration without loading full content
  • ✅ Dual Sources - Local files (absolute or relative paths) and HTTP/HTTPS URLs
  • ✅ Batch Processing - Multiple PDFs processed concurrently

Advanced Features

  • ⚡ 5-10x Performance - Parallel page processing with Promise.all
  • 🎯 Smart Pagination - Extract ranges like "1-5,10-15,20"
  • 🖼️ Multi-Format Images - RGB, RGBA, Grayscale with automatic detection
  • 🛡️ Path Flexibility - Windows, Unix, and relative paths all supported (v1.3.0)
  • 🔍 Error Resilience - Per-page error isolation with detailed messages
  • 📁 Large File Support - Efficient streaming and memory management
  • 📝 Type Safe - Full TypeScript with strict mode enabled
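
A minimal sketch of how parallel page processing with Promise.all can be combined with per-page error isolation. This is not the server's actual code: `PageResult`, `ExtractPage`, `extractPages`, and `demoExtract` are hypothetical names, and the extractor is a stand-in for a real PDF.js-backed function.

```typescript
// Hypothetical result shape; the server's real types may differ.
type PageResult =
  | { page: number; ok: true; text: string }
  | { page: number; ok: false; error: string };

// Stand-in for a real per-page extractor (for example, one backed by PDF.js).
type ExtractPage = (page: number) => Promise<string>;

// Start every page at once and wait for all of them. Each promise catches
// its own failure, so Promise.all never rejects early and one bad page
// cannot discard the results of the good ones.
async function extractPages(
  pages: number[],
  extractPage: ExtractPage,
): Promise<PageResult[]> {
  return Promise.all(
    pages.map(async (page): Promise<PageResult> => {
      try {
        return { page, ok: true, text: await extractPage(page) };
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        return { page, ok: false, error };
      }
    }),
  );
}

// Demo extractor for illustration only: page 2 fails on purpose.
const demoExtract: ExtractPage = async (page) => {
  if (page === 2) throw new Error("corrupt page stream");
  return `text of page ${page}`;
};
```

Calling `extractPages([1, 2, 3], demoExtract)` resolves with three entries in input order; only the entry for page 2 carries an error message.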

🆕 What's New in v1.3.0

🎉 Absolute Paths Now Supported!

// ✅ Windows
{ "path": "C:\\Users\\John\\Documents\\report.pdf" }
{ "path": "C:/Users/John/Documents/report.pdf" }

// ✅ Unix/Mac
{ "path": "/home/john/documents/report.pdf" }
{ "path": "/Users/john/Documents/report.pdf" }

// ✅ Relative (still works)
{ "path": "documents/report.pdf" }

Other Improvements:

  • 🐛 Fixed Zod validation error handling
  • 📦 Updated all dependencies to latest versions
  • ✅ 103 tests passing, 94%+ coverage maintained

v1.2.0 - Content Ordering

  • Y-coordinate based text and image ordering
  • Natural reading flow for AI models
  • Intelligent line grouping

v1.1.0 - Image Extraction & Performance

  • Base64-encoded image extraction
  • 10x speedup with parallel processing
  • Comprehensive test coverage (94%+)

View Full Changelog →

📖 API Reference

read_pdf Tool

The single tool that handles all PDF operations.

Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| sources | Array | List of PDF sources to process | Required |
| include_full_text | boolean | Extract full text content | false |
| include_metadata | boolean | Extract PDF metadata | true |
| include_page_count | boolean | Include total page count | true |
| include_images | boolean | Extract embedded images | false |

Source Object

{
  path?: string;        // Local file path (absolute or relative)
  url?: string;         // HTTP/HTTPS URL to PDF
  pages?: string | number[];  // Pages to extract: "1-5,10" or [1,2,3]
}
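
The parameters and source object can be written as TypeScript types, with the documented defaults applied in one place. This is only a sketch: `PdfSource`, `ReadPdfRequest`, and `normalizeRequest` are hypothetical names, and the rule that each source carries exactly one of `path` or `url` is an assumption, not something the reference states.

```typescript
// Mirrors the Source Object; type names here are illustrative only.
interface PdfSource {
  path?: string;
  url?: string;
  pages?: string | number[];
}

interface ReadPdfRequest {
  sources: PdfSource[];
  include_full_text?: boolean;
  include_metadata?: boolean;
  include_page_count?: boolean;
  include_images?: boolean;
}

// Fill in the defaults listed in the parameter table.
function normalizeRequest(req: ReadPdfRequest): Required<ReadPdfRequest> {
  if (req.sources.length === 0) {
    throw new Error("sources must contain at least one entry");
  }
  for (const source of req.sources) {
    // Assumption: a source names exactly one of `path` or `url`.
    if ((source.path === undefined) === (source.url === undefined)) {
      throw new Error("each source needs exactly one of path or url");
    }
  }
  return {
    sources: req.sources,
    include_full_text: req.include_full_text ?? false,
    include_metadata: req.include_metadata ?? true,
    include_page_count: req.include_page_count ?? true,
    include_images: req.include_images ?? false,
  };
}
```
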
Examples

Metadata only (fast):

{
  "sources": [{ "path": "large.pdf" }],
  "include_metadata": true,
  "include_page_count": true,
  "include_full_text": false
}

From URL:

{
  "sources": [{
    "url": "https://arxiv.org/pdf/2301.00001.pdf"
  }],
  "include_full_text": true
}

Page ranges:

{
  "sources": [{
    "path": "manual.pdf",
    "pages": "1-5,10-15,20"  // Pages 1,2,3,4,5,10,11,12,13,14,15,20
  }]
}
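
To make the range syntax concrete, here is a small sketch of a parser that expands a string such as "1-5,10-15,20" into page numbers. `parsePageRanges` is a hypothetical helper; sorting, de-duplicating, and rejecting malformed parts are assumptions about sensible behavior, not a description of the server's own parser.

```typescript
// Expand a page specification like "1-5,10-15,20" into a sorted,
// de-duplicated list of 1-based page numbers.
function parsePageRanges(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    // Accept "N" or "N-M", with optional surrounding whitespace.
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
    if (!match) throw new Error(`Invalid page range: "${part}"`);
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    for (let page = start; page <= end; page++) pages.add(page);
  }
  return Array.from(pages).sort((a, b) => a - b);
}
```

With this sketch, `parsePageRanges("1-5,10-15,20")` yields pages 1 through 5, 10 through 15, and 20, matching the comment in the example above.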

🔧 Advanced Usage

📐 Y-Coordinate Content Ordering

Content is returned in natural reading order based on Y-coordinates:

Document Layout:
┌─────────────────────┐
│ [Title]       Y:100 │
│ [Image]       Y:150 │
│ [Text]        Y:400 │
│ [Photo A]     Y:500 │
│ [Photo B]     Y:550 │
└─────────────────────┘

Response Order:
[
  { type: "text", text: "Title..." },
  { type: "image", data: "..." },
  { type: "text", text: "..." },
  { type: "image", data: "..." },
  { type: "image", data: "..." }
]

Benefits:

  • AI understands spatial relationships
  • Natural document comprehension
  • Perfect for vision-enabled models
  • Automatic multi-line text grouping
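
The ordering idea can be sketched in a few lines. Everything here is an assumption made for illustration: the `ContentItem` shape, the helper names, the convention that a smaller Y means higher on the page (as in the diagram above), and the tolerance used to merge text fragments into one line. The server's real implementation may differ on all of these.

```typescript
// Hypothetical item shape: `y` is measured from the top of the page.
type ContentItem =
  | { type: "text"; text: string; y: number }
  | { type: "image"; data: string; y: number };

// Sort a copy by Y so the caller's array is left untouched.
function orderByY(items: ContentItem[]): ContentItem[] {
  return items.slice().sort((a, b) => a.y - b.y);
}

// Merge consecutive text fragments whose Y lies within `tolerance`
// of the Y of the line they would join; images always stay separate.
function groupTextLines(items: ContentItem[], tolerance = 2): ContentItem[] {
  const grouped: ContentItem[] = [];
  for (const item of orderByY(items)) {
    const last = grouped[grouped.length - 1];
    if (
      last !== undefined &&
      last.type === "text" &&
      item.type === "text" &&
      Math.abs(item.y - last.y) <= tolerance
    ) {
      grouped[grouped.length - 1] = { ...last, text: `${last.text} ${item.text}` };
    } else {
      grouped.push(item);
    }
  }
  return grouped;
}
```

Feeding the five items from the layout diagram through `orderByY` in any order returns them as text, image, text, image, image, which is the response order shown above.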
🖼️ Image Extraction

Enable extraction:

{
  "sources": [{ "path": "manual.pdf" }],
  "include_images": true
}

Response format:

{
  "images": [{
    "page": 1,
    "index": 0,
    "width": 1920,
    "height": 1080,
    "format": "rgb",
    "data": "base64-encoded-png..."
  }]
}

Supported formats: RGB, RGBA, Grayscale

Auto-detected: JPEG, PNG, and other embedded formats

📂 Path Configuration

Absolute paths (v1.3.0+) - Direct file access:

{ "path": "C:\\Users\\John\\file.pdf" }
{ "path": "/home/user/file.pdf" }

Relative paths - Workspace files:

{ "path": "docs/report.pdf" }
{ "path": "./2024/Q1.pdf" }
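
As a rough illustration of the three path styles, the sketch below classifies a path string without touching the file system. `classifyPath` and `PathKind` are hypothetical; the server itself most likely relies on Node's path utilities, and this dependency-free version does not handle UNC paths.

```typescript
type PathKind = "windows-absolute" | "unix-absolute" | "relative";

// Classify a path by its prefix only; the file is never accessed.
function classifyPath(p: string): PathKind {
  // Drive letter followed by a backslash or slash: C:\Users\... or C:/Users/...
  if (/^[A-Za-z]:[\\/]/.test(p)) return "windows-absolute";
  // A leading slash marks a Unix/Mac absolute path.
  if (p.charAt(0) === "/") return "unix-absolute";
  // Everything else would be resolved against the working directory.
  return "relative";
}
```

Relative results are the ones affected by the working directory of the MCP server process.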

Configure working directory:

{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@sylphx/pdf-reader-mcp"],
      "cwd": "/path/to/documents"
    }
  }
}
📊 Large PDF Strategies

Strategy 1: Page ranges

{ "sources": [{ "path": "big.pdf", "pages": "1-20" }] }

Strategy 2: Progressive loading

// Step 1: Get page count
{ "sources": [{ "path": "big.pdf" }], "include_full_text": false }

// Step 2: Extract sections
{ "sources": [{ "path": "big.pdf", "pages": "50-75" }] }

Strategy 3: Parallel batching

{
  "sources": [
    { "path": "big.pdf", "pages": "1-50" },
    { "path": "big.pdf", "pages": "51-100" }
  ]
}
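
Once the page count is known (Strategy 2, step 1), the batched `sources` array for Strategy 3 can be generated rather than written by hand. A minimal sketch, assuming a hypothetical helper named `buildBatches` and a caller-chosen batch size:

```typescript
interface BatchSource {
  path: string;
  pages: string;
}

// Split pages 1..pageCount into consecutive ranges of at most `batchSize`
// pages, one source entry per range.
function buildBatches(path: string, pageCount: number, batchSize: number): BatchSource[] {
  if (pageCount < 1 || batchSize < 1) {
    throw new Error("pageCount and batchSize must be at least 1");
  }
  const sources: BatchSource[] = [];
  for (let start = 1; start <= pageCount; start += batchSize) {
    const end = Math.min(start + batchSize - 1, pageCount);
    // A single-page batch is written as "N" rather than "N-N".
    sources.push({ path, pages: start === end ? `${start}` : `${start}-${end}` });
  }
  return sources;
}
```

`buildBatches("big.pdf", 100, 50)` produces the same two entries as the Strategy 3 example: pages "1-50" and "51-100".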

🔧 Troubleshooting

"Absolute paths are not allowed"

Solution: Upgrade to v1.3.0+

npm update @sylphx/pdf-reader-mcp

Restart your MCP client completely.

"File not found"

Causes:

  • File doesn't exist at path
  • Wrong working directory
  • Permission issues

Solutions:

Use absolute path:

{ "path": "C:\\Full\\Path\\file.pdf" }

Or configure cwd:

{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@sylphx/pdf-reader-mcp"],
      "cwd": "/path/to/docs"
    }
  }
}

"No tools showing up"

Solution:

npm cache clean --force
rm -rf node_modules package-lock.json
npm install @sylphx/pdf-reader-mcp@latest

Restart MCP client completely.

🏗️ Architecture

Tech Stack

| Component | Technology |
|---|---|
| Runtime | Node.js 22+ ESM |
| PDF Engine | PDF.js (Mozilla) |
| Validation | Zod + JSON Schema |
| Protocol | MCP SDK |
| Language | TypeScript (strict) |
| Testing | Vitest (103 tests) |
| Quality | Biome (50x faster) |
| CI/CD | GitHub Actions |

Design Principles

  • 🔒 Security First - Flexible paths with secure defaults
  • 🎯 Simple Interface - One tool, all operations
  • ⚡ Performance - Parallel processing, efficient memory
  • 🛡️ Reliability - Per-page isolation, detailed errors
  • 🧪 Quality - 94%+ coverage, strict TypeScript
  • 📝 Type Safety - No any types, strict mode
  • 🔄 Backward Compatible - Smooth upgrades always

🧪 Development

Setup & Scripts

Prerequisites:

  • Node.js >= 22.0.0
  • pnpm (recommended) or npm

Setup:

git clone https://github.com/SylphxAI/pdf-reader-mcp.git
cd pdf-reader-mcp
pnpm install && pnpm build

Scripts:

pnpm run build       # Build TypeScript
pnpm run test        # Run 103 tests
pnpm run test:cov    # Coverage (94%+)
pnpm run check       # Lint + format
pnpm run check:fix   # Auto-fix
pnpm run benchmark   # Performance tests

Quality:

  • ✅ 103 tests
  • ✅ 94%+ coverage
  • ✅ 98%+ function coverage
  • ✅ Zero lint errors
  • ✅ Strict TypeScript

Contributing

Quick Start:

  1. Fork repository
  2. Create branch: git checkout -b feature/awesome
  3. Make changes, then run the tests: pnpm test
  4. Format: pnpm run check:fix
  5. Commit: Use Conventional Commits
  6. Open PR

Commit Format:

feat(images): add WebP support
fix(paths): handle UNC paths
docs(readme): update examples

See CONTRIBUTING.md

📚 Documentation

  • 📖 Full Docs - Complete guides
  • 🚀 Getting Started - Quick start
  • 📘 API Reference - Detailed API
  • 🏗️ Design - Architecture
  • ⚡ Performance - Benchmarks
  • 🔍 Comparison - vs. alternatives

🗺️ Roadmap

✅ Completed

  • Image extraction (v1.1.0)
  • 5-10x parallel speedup (v1.1.0)
  • Y-coordinate ordering (v1.2.0)
  • Absolute paths (v1.3.0)
  • 94%+ test coverage (v1.3.0)

🚀 Next

  • OCR for scanned PDFs
  • Annotation extraction
  • Form field extraction
  • Table detection
  • 100+ MB streaming
  • Advanced caching
  • PDF generation

Vote at Discussions

🏆 Recognition

Trusted worldwide • Enterprise adoption • Battle-tested

🀝 Support

GitHub IssuesDiscord

Show Your Support:⭐ Star β€’ πŸ‘€ Watch β€’ πŸ› Report bugs β€’ πŸ’‘ Suggest features β€’ πŸ”€ Contribute

📊 Stats

103 Tests • 94%+ Coverage • Production Ready

📄 License

MIT © Sylphx

🙏 Credits

Special thanks to the open source community ❤️

5-10x faster. Production-ready. Battle-tested. The PDF processing server that actually scales.

sylphx.com • @SylphxAI • [email protected]