Crawl4AI MCP Server: Extract content from web pages, PDFs, Office docs, YouTube videos with AI-powered summarization. 17 tools, token reduction, production-ready.

Crawl-MCP: Unofficial MCP Server for crawl4ai

⚠️ Important: This is an unofficial MCP server implementation for the excellent crawl4ai library. Not affiliated with the original crawl4ai project.

A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.

🌟 Key Features

  • 🔍 Google Search Integration: 7 optimized search genres using official Google search operators
  • 🔍 Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction
  • 🌐 Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
  • 🤖 AI-Powered Summarization: Smart token reduction (up to 88.5%) while preserving essential information
  • 🎬 YouTube Integration: Extract video transcripts and summaries without API keys
  • ⚡ Production Ready: 17 specialized tools with comprehensive error handling

🚀 Quick Start

Prerequisites (Required First)

  • Python 3.11 or later (FastMCP requires Python 3.11+)

Install system dependencies for Playwright:

Ubuntu 24.04 LTS (manual setup required):

# Manual setup required due to t64 library transition
sudo apt update && sudo apt install -y \
  libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
  libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
  libxcomposite1 libxcursor1 libxdamage1 libxi6 \
  fonts-noto-color-emoji fonts-unifont python3-venv python3-pip

python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
sudo playwright install-deps

Other Linux/macOS:

sudo bash scripts/prepare_for_uvx_playwright.sh

Windows (as Administrator):

scripts/prepare_for_uvx_playwright.ps1

Installation

UVX (Recommended - Easiest):

# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp

Docker (Production-Ready):

# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp

# Build and run with Docker Compose (STDIO mode)
docker-compose up --build

# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http

# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp

Docker Features:

  • 🔧 Multi-Browser Support: Chromium, Firefox, Webkit headless browsers
  • 🐧 Google Chrome: Additional Chrome Stable for compatibility
  • ⚡ Optimized Performance: Pre-configured browser flags for Docker
  • 🔒 Security: Non-root user execution
  • 📦 Complete Dependencies: All required libraries included

Claude Desktop Setup

UVX Installation: Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "stdio",
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/walksoda/crawl-mcp",
        "crawl-mcp"
      ],
      "env": {
        "CRAWL4AI_LANG": "en"
      }
    }
  }
}

Docker HTTP Mode:

{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "http",
      "baseUrl": "http://localhost:8000"
    }
  }
}

For Japanese interface:

"env": {
  "CRAWL4AI_LANG": "ja"
}

📖 Documentation

| Topic | Description |
|---|---|
| Installation Guide | Complete installation instructions for all platforms |
| API Reference | Full tool documentation and usage examples |
| Configuration Examples | Platform-specific setup configurations |
| HTTP Integration | HTTP API access and integration methods |
| Advanced Usage | Power user techniques and workflows |
| Development Guide | Contributing and development setup |

Language-Specific Documentation

  • English: docs/ directory
  • Japanese (日本語): docs/ja/ directory

🛠️ Tool Overview

Web Crawling

  • crawl_url - Single page crawling with JavaScript support
  • deep_crawl_site - Multi-page site mapping and exploration
  • crawl_url_with_fallback - Robust crawling with retry strategies
  • batch_crawl - Process multiple URLs simultaneously

AI-Powered Analysis

  • intelligent_extract - Semantic content extraction with custom instructions
  • auto_summarize - LLM-based summarization for large content
  • extract_entities - Pattern-based entity extraction (emails, phones, URLs, etc.)
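To illustrate what pattern-based extraction means for extract_entities, here is a minimal Python sketch built on regular expressions. The patterns and the `extract_entities_sketch` name are assumptions made for illustration; they are deliberately simplistic and are not the server's actual implementation.

```python
import re

# Illustrative patterns only (assumptions, not the server's own);
# real-world email, phone, and URL matching needs more care.
PATTERNS = {
    "emails": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "urls": re.compile(r"https?://[^\s<>()]+"),
    "phones": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def extract_entities_sketch(text: str) -> dict[str, list[str]]:
    """Return de-duplicated matches per entity type, in order of first appearance."""
    found = {}
    for name, pattern in PATTERNS.items():
        # dict.fromkeys drops duplicates while keeping insertion order
        found[name] = list(dict.fromkeys(pattern.findall(text)))
    return found


sample = "Contact sales@example.com or call +1 555-010-9999. Docs: https://example.com/docs"
print(extract_entities_sketch(sample))
```

Because the approach is purely pattern-based, it needs no LLM call, which is why it suits bulk extraction of well-structured entities.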

Media Processing

  • process_file - Convert PDFs, Office docs, ZIP archives to markdown
  • extract_youtube_transcript - Multi-language transcript extraction
  • batch_extract_youtube_transcripts - Process multiple videos

Search Integration

  • search_google - Genre-filtered Google search with metadata
  • search_and_crawl - Combined search and content extraction
  • batch_search_google - Multiple search queries with analysis

🎯 Common Use Cases

Content Research:

search_and_crawl → intelligent_extract → structured analysis

Documentation Mining:

deep_crawl_site → batch processing → comprehensive extraction

Media Analysis:

extract_youtube_transcript → auto_summarize → insight generation

Competitive Intelligence:

batch_crawl → extract_entities → comparative analysis
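The workflows above chain one tool's output into the next tool's input. A minimal sketch of that pattern follows, assuming a `call_tool(name, arguments)` callable that wraps whatever MCP client you use. The `run_chain` helper, the stub client, and the argument keys `url` and `content` are hypothetical; check the API Reference for the real parameter names.

```python
from typing import Any, Callable

# call_tool(name, arguments) -> result; in practice this would wrap an MCP client session.
ToolCaller = Callable[[str, dict[str, Any]], Any]


def run_chain(call_tool: ToolCaller, steps: list[tuple[str, str]], initial: Any) -> Any:
    """Run tools in order, feeding each result to the next step.

    Each step is (tool_name, input_key): the previous result is passed
    to the tool as {input_key: previous_result}.
    """
    value = initial
    for tool_name, input_key in steps:
        value = call_tool(tool_name, {input_key: value})
    return value


# Media Analysis workflow; the argument keys are hypothetical.
MEDIA_ANALYSIS = [
    ("extract_youtube_transcript", "url"),
    ("auto_summarize", "content"),
]


def fake_call_tool(name: str, arguments: dict[str, Any]) -> str:
    """Stand-in for a real MCP client so the sketch runs offline."""
    return f"{name}({arguments})"


print(run_chain(fake_call_tool, MEDIA_ANALYSIS, "https://www.youtube.com/watch?v=VIDEO_ID"))
```

Injecting `call_tool` keeps the chaining logic independent of the transport, so the same steps work over STDIO or HTTP.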

🚨 Quick Troubleshooting

Installation Issues:

  1. Run system diagnostics with the get_system_diagnostics tool
  2. Re-run the setup scripts with the required privileges (sudo or Administrator)
  3. Try the development installation method

Performance Issues:

  • Use wait_for_js: true for JavaScript-heavy sites
  • Increase timeout for slow-loading pages
  • Enable auto_summarize for large content
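As a rough illustration of how these options reach the server, the sketch below builds an MCP `tools/call` JSON-RPC request for crawl_url. The `url` argument name, the `timeout` argument name, and its unit (seconds) are assumptions, and `build_tool_call` is a hypothetical helper; consult the API Reference for the actual parameters.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def build_tool_call(tool: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


request = build_tool_call(
    "crawl_url",
    {
        "url": "https://example.com/app",  # assumed argument name
        "wait_for_js": True,               # wait for JavaScript-heavy pages to render
        "timeout": 60,                     # assumed name; assumed to be seconds
    },
)
print(json.dumps(request, indent=2))
```

An MCP client such as Claude Desktop normally builds this message for you; the sketch only shows where the options end up.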

Configuration Issues:

  • Check JSON syntax in claude_desktop_config.json
  • Verify file paths are absolute
  • Restart Claude Desktop after configuration changes
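The first two checks can be automated. The following is a minimal sketch, assuming the config layout shown in the Claude Desktop Setup section; `check_mcp_config` is a hypothetical helper, not part of this project, and it only inspects the `command` field for path-like values.

```python
import json
from pathlib import PurePosixPath, PureWindowsPath


def _is_absolute(path: str) -> bool:
    """True if the path is absolute under POSIX or Windows rules."""
    return PurePosixPath(path).is_absolute() or PureWindowsPath(path).is_absolute()


def check_mcp_config(text: str) -> list[str]:
    """Return a list of problems found in a claude_desktop_config.json document."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]

    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict):
        return ['missing top-level "mcpServers" object']

    problems = []
    for name, entry in servers.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry must be a JSON object")
            continue
        command = entry.get("command")
        # Bare commands such as "uvx" are resolved via PATH; only path-like values are checked.
        if isinstance(command, str) and ("/" in command or "\\" in command):
            if not _is_absolute(command):
                problems.append(f"{name}: command path is not absolute: {command}")
    return problems


# The trailing comma below is a typical hand-editing mistake that breaks JSON parsing.
sample = '{"mcpServers": {"crawl-mcp": {"command": "./venv/bin/crawl-mcp",}}}'
for problem in check_mcp_config(sample) or ["no problems found"]:
    print(problem)
```

To check a real file, read it with `pathlib.Path(...).read_text(encoding="utf-8")` and pass the text to `check_mcp_config`.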

๐Ÿ—๏ธ Project Structure

  • Original Library: crawl4ai by unclecode
  • MCP Wrapper: This repository (walksoda)
  • Implementation: Unofficial third-party integration

📄 License

This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.

🤝 Contributing

See our Development Guide for contribution guidelines and development setup instructions.
