Crawl4AI Documentation Scraper

Keep your dependency documentation lean, current, and AI-ready. This toolkit helps you extract clean, focused documentation from any framework or library website, perfect for both human readers and LLM consumption.

Why This Tool?

In today's fast-paced development environment, you need:

  • 📚 Quick access to dependency documentation without the bloat
  • 🤖 Documentation in a format that's ready for RAG systems and LLMs
  • 🎯 Focused content without navigation elements, ads, or irrelevant sections
  • ⚡ Fast, efficient way to keep documentation up-to-date
  • 🧹 Clean Markdown output for easy integration with documentation tools

Traditional web scraping often gives you everything, including navigation menus, footers, ads, and other noise. This toolkit is specifically designed to extract only what matters: the actual documentation content.

Key Benefits

  1. Clean Documentation Output

    • Markdown format for content-focused documentation
    • JSON format for structured menu data
    • Perfect for documentation sites, wikis, and knowledge bases
    • Ideal format for LLM training and RAG systems
  2. Smart Content Extraction

    • Automatically identifies main content areas
    • Strips away navigation, ads, and irrelevant sections
    • Preserves code blocks and technical formatting
    • Maintains proper Markdown structure
  3. Flexible Crawling Strategies

    • Single page for quick reference docs
    • Multi-page for comprehensive library documentation
    • Sitemap-based for complete framework coverage
    • Menu-based for structured documentation hierarchies
  4. LLM and RAG Ready

    • Clean Markdown text suitable for embeddings
    • Preserved code blocks for technical accuracy
    • Structured menu data in JSON format
    • Consistent formatting for reliable processing

A comprehensive Python toolkit for scraping documentation websites using different crawling strategies. Built using the Crawl4AI library for efficient web crawling.

Powered by Crawl4AI
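
For orientation, the sketch below shows the kind of Crawl4AI call these crawlers are built around: a single asynchronous fetch that returns the selected content as Markdown. It is a minimal sketch, assuming a recent Crawl4AI release that exposes AsyncWebCrawler, arun(), and a css_selector option; the URL and selector are placeholders, and the real crawlers add more configuration on top of this.

```python
# Minimal sketch (not the toolkit's exact code): fetch one page with Crawl4AI
# and keep only the main content area as Markdown. The URL and CSS selector
# are placeholders; exact configuration options vary by Crawl4AI version.
import asyncio

from crawl4ai import AsyncWebCrawler

async def fetch_markdown(url, selector="article"):
    async with AsyncWebCrawler() as crawler:
        # Restricting extraction to a content selector is what strips away
        # navigation menus, footers, and other page chrome.
        result = await crawler.arun(url=url, css_selector=selector)
        return result.markdown

if __name__ == "__main__":
    print(asyncio.run(fetch_markdown("https://docs.example.com/page")))
```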

Features

Core Features

  • 🚀 Multiple crawling strategies
  • 📑 Automatic nested menu expansion
  • 🔄 Handles dynamic content and lazy-loaded elements
  • 🎯 Configurable selectors
  • 📝 Clean Markdown output for documentation
  • 📊 JSON output for menu structure
  • 🎨 Colorful terminal feedback
  • 🔍 Smart URL processing
  • ⚡ Asynchronous execution

Available Crawlers

  1. Single URL Crawler (single_url_crawler.py)

    • Extracts content from a single documentation page
    • Outputs clean Markdown format
    • Perfect for targeted content extraction
    • Configurable content selectors
  2. Multi URL Crawler (multi_url_crawler.py)

    • Processes multiple URLs in parallel
    • Generates individual Markdown files per page
    • Efficient batch processing
    • Shared browser session for better performance
  3. Sitemap Crawler (sitemap_crawler.py)

    • Automatically discovers and crawls sitemap.xml
    • Creates Markdown files for each page
    • Supports recursive sitemap parsing
    • Handles gzipped sitemaps
  4. Menu Crawler (menu_crawler.py)

    • Extracts all menu links from documentation
    • Outputs structured JSON format
    • Handles nested and dynamic menus
    • Smart menu expansion

Requirements

  • Python 3.7+
  • Virtual Environment (recommended)

Installation

  1. Clone the repository:
git clone https://github.com/felores/crawl4ai_docs_scraper.git
cd crawl4ai_docs_scraper
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Usage

1. Single URL Crawler

python single_url_crawler.py https://docs.example.com/page

Arguments:

  • URL: Target documentation URL (required, first argument)

Note: Use quotes only if your URL contains special characters or spaces.

Output format (Markdown):

# Page Title

## Section 1
Content with preserved formatting, including:
- Lists
- Links
- Tables

### Code Examples
```python
def example():
    return "Code blocks are preserved"
```

2. Multi URL Crawler

# Using a text file with URLs
python multi_url_crawler.py urls.txt

# Using JSON output from menu crawler
python multi_url_crawler.py menu_links.json

# Using custom output prefix
python multi_url_crawler.py menu_links.json --output-prefix custom_name

Arguments:

  • URLs file: Path to file containing URLs (required, first argument)
    • Can be .txt with one URL per line
    • Or .json from menu crawler output
  • --output-prefix: Custom prefix for output markdown file (optional)

Note: Use quotes only if your file path contains spaces.

Output filename format:

  • Without --output-prefix: domain_path_docs_content_timestamp.md (e.g., cloudflare_agents_docs_content_20240323_223656.md)
  • With --output-prefix: custom_prefix_docs_content_timestamp.md (e.g., custom_name_docs_content_20240323_223656.md)
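
The trailing timestamp follows the usual year-month-day_hour-minute-second pattern, so a name of this shape can be reproduced with the standard library. The helper below is purely illustrative, not the crawler's own code:

```python
# Illustrative only: build an output filename in the documented shape,
# e.g. custom_name_docs_content_20240323_223656.md
from datetime import datetime

def timestamped_filename(prefix):
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # e.g. 20240323_223656
    return f"{prefix}_docs_content_{stamp}.md"

print(timestamped_filename("custom_name"))
```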

The crawler accepts two types of input files:

  1. Text file with one URL per line:
https://docs.example.com/page1
https://docs.example.com/page2
https://docs.example.com/page3
  2. JSON file (compatible with menu crawler output):
{
    "menu_links": [
        "https://docs.example.com/page1",
        "https://docs.example.com/page2"
    ]
}
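
As a rough sketch of how both input formats can be consumed (the load_urls helper below is hypothetical and stands in for whatever the crawler does internally):

```python
# Hypothetical helper: read either supported input format and return a URL list.
# .json files are expected to carry a "menu_links" array (menu crawler output);
# anything else is treated as plain text with one URL per line.
import json
from pathlib import Path

def load_urls(path):
    file = Path(path)
    if file.suffix == ".json":
        data = json.loads(file.read_text(encoding="utf-8"))
        return data["menu_links"]
    lines = file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```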

3. Sitemap Crawler

python sitemap_crawler.py https://docs.example.com/sitemap.xml

Options:

  • --max-depth: Maximum sitemap recursion depth (optional)
  • --patterns: URL patterns to include (optional)
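
For context, a bare-bones version of the sitemap discovery step might look like the sketch below. It only shows fetching a sitemap (gzipped or not), collecting <loc> entries, and recursing into sitemap indexes; the function is hypothetical, not the toolkit's actual implementation.

```python
# Hypothetical sketch (standard library only) of sitemap discovery:
# fetch a sitemap, handle .gz files, and recurse into sitemap indexes
# up to a maximum depth.
import gzip
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_sitemap_urls(sitemap_url, max_depth=2):
    with urllib.request.urlopen(sitemap_url) as resp:
        raw = resp.read()
    if sitemap_url.endswith(".gz"):
        raw = gzip.decompress(raw)  # gzipped sitemaps
    root = ET.fromstring(raw)
    locs = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]
    if root.tag.endswith("sitemapindex") and max_depth > 0:
        # A sitemap index lists further sitemaps; recurse into each one.
        urls = []
        for child in locs:
            urls.extend(fetch_sitemap_urls(child, max_depth - 1))
        return urls
    return locs
```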

4. Menu Crawler

python menu_crawler.py https://docs.example.com

Options:

  • --selectors: Custom menu selectors (optional)

The menu crawler now saves its output to the input_files directory, making it ready for immediate use with the multi-url crawler. The output JSON has this format:

{
    "start_url": "https://docs.example.com/",
    "total_links_found": 42,
    "menu_links": [
        "https://docs.example.com/page1",
        "https://docs.example.com/page2"
    ]
}

After running the menu crawler, you'll get a command to run the multi-url crawler with the generated file.
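
For example, if a run saved its links to input_files/menu_links.json, the suggested follow-up command would look something like:

python multi_url_crawler.py input_files/menu_links.json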

Directory Structure

crawl4ai_docs_scraper/
├── input_files/           # Input files for URL processing
│   ├── urls.txt          # Text file with URLs
│   └── menu_links.json   # JSON output from menu crawler
├── scraped_docs/         # Output directory for markdown files
│   └── docs_timestamp.md # Generated documentation
├── multi_url_crawler.py
├── menu_crawler.py
└── requirements.txt

Error Handling

All crawlers include comprehensive error handling with colored terminal output:

  • 🟢 Green: Success messages
  • 🔵 Cyan: Processing status
  • 🟡 Yellow: Warnings
  • 🔴 Red: Error messages
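
As a small illustration of this kind of feedback, the sketch below assumes the colorama package; that is an assumption for the example, not a confirmed dependency of the crawlers.

```python
# Hedged sketch of colored status output; colorama is assumed, not confirmed.
from colorama import Fore, init

init(autoreset=True)  # restore the default color after each print

print(Fore.GREEN + "Saved scraped_docs/docs_20240323_223656.md")    # success
print(Fore.CYAN + "Processing https://docs.example.com/page2 ...")  # status
print(Fore.YELLOW + "Warning: content selector matched nothing")    # warning
print(Fore.RED + "Error: request timed out")                        # error
```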

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Attribution

This project uses Crawl4AI for web data extraction.
