AI-Driven Universal Web Data Extraction Platform

A production-grade, MCP-enabled universal web scraping platform with MongoDB storage and advanced anti-bot (antigravity) mechanisms.

🎯 Features

  • Dual Scraping Engines: Static (Requests + BeautifulSoup) and Dynamic (Playwright)
  • Auto-Detection: Automatically selects the appropriate scraper based on page content
  • Anti-Bot Protection: User-Agent rotation, rate limiting, robots.txt compliance, stealth mode
  • MongoDB Storage: Persists all scraped data with full metadata
  • MCP Integration: Exposes scraping as tools for LLM invocation
  • Export Options: JSON and CSV export capabilities
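The auto-detection feature could be driven by a simple heuristic: pages that ship an empty SPA mount point or very little visible text usually need a real browser to render. A minimal sketch of such logic, using only the standard library (`needs_dynamic_scraper` and the marker regex are illustrative, not the actual contents of `strategy_selector.py`):

```python
import re

# Hypothetical heuristic: an empty <div id="root"> or <div id="app"> is a
# strong hint that the page is a JavaScript SPA needing Playwright.
SPA_MARKERS = re.compile(
    r'<div[^>]+id=["\'](?:root|app)["\'][^>]*>\s*</div>', re.I
)

def needs_dynamic_scraper(html: str, min_text_chars: int = 200) -> bool:
    """Return True when the static HTML likely needs browser rendering."""
    if SPA_MARKERS.search(html):
        return True
    # Crudely strip scripts, styles, and tags, then measure visible text.
    visible = re.sub(
        r'<script.*?</script>|<style.*?</style>', '', html, flags=re.S | re.I
    )
    visible = re.sub(r'<[^>]+>', ' ', visible)
    return len(" ".join(visible.split())) < min_text_chars
```

A static HTML page with plenty of body text would route to the Requests scraper, while a bare SPA shell would route to Playwright.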

๐Ÿ“ Project Structure

d:\mcp\
├── requirements.txt          # Python dependencies
├── config.py                 # Configuration settings
├── main.py                   # FastAPI MCP server entry point
├── scraper/
│   ├── static_scraper.py     # Requests + BeautifulSoup scraper
│   ├── dynamic_scraper.py    # Playwright scraper
│   └── strategy_selector.py  # Auto-detection logic
├── antigravity/
│   ├── user_agents.py        # User-Agent rotation
│   ├── throttle.py           # Request delays & rate limiting
│   ├── robots_validator.py   # robots.txt compliance
│   └── stealth.py            # Playwright stealth configuration
├── database/
│   ├── mongodb.py            # MongoDB connection & operations
│   └── models.py             # Pydantic data models
├── mcp/
│   └── tools.py              # MCP tool definitions
├── utils/
│   ├── normalizer.py         # Data normalization
│   └── exporter.py           # CSV/JSON export
├── tests/                    # Test suite
└── docs/
    └── README.md             # This file

🚀 Quick Start

1. Install Dependencies

cd d:\mcp
pip install -r requirements.txt
playwright install chromium

2. Start MongoDB

Ensure MongoDB is running on localhost:27017 (or update MONGODB_URI in config.py).
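If config.py follows the usual pattern of reading overrides from the environment, the connection settings might look like the sketch below (only MONGODB_URI is documented above; DATABASE_NAME and the env-var names are illustrative):

```python
import os

# MongoDB connection string; override via environment for deployments.
MONGODB_URI = os.environ.get("MONGODB_URI", "mongodb://localhost:27017")

# Illustrative database name; the actual setting may differ in config.py.
DATABASE_NAME = os.environ.get("MONGODB_DATABASE", "scraper")
```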

3. Run the Server

python main.py

The server will start at http://localhost:8000.

4. Test the API

Open http://localhost:8000/docs for interactive Swagger documentation.

🔌 API Endpoints

Endpoint       Method     Description
/scrape        POST/GET   Scrape a website
/stats         GET        Get scraping statistics
/recent        GET        Get recently scraped data
/logs          GET        Get scrape logs
/export/json   POST       Export data to JSON
/export/csv    POST       Export data to CSV
/health        GET        Health check

Example Scrape Request

curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "auto_detect": true}'
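The same request can be made from Python using only the standard library (sketch; the response body shape depends on the server):

```python
import json
from urllib.request import Request, urlopen

# Build the same POST request as the curl example above.
payload = json.dumps({"url": "https://example.com", "auto_detect": True}).encode()
req = Request(
    "http://localhost:8000/scrape",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the server running, send it and decode the JSON response:
# with urlopen(req) as resp:
#     result = json.load(resp)
```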

🧠 MCP Tool Usage

The platform exposes a scrape_website tool via MCP:

# Tool Schema
{
    "name": "scrape_website",
    "parameters": {
        "url": "string (required)",
        "dynamic": "boolean (default: false)",
        "auto_detect": "boolean (default: true)",
        "store_in_mongodb": "boolean (default: true)"
    }
}
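A caller would supply at minimum the url and rely on the documented defaults for the rest. A small sketch of how a call payload could be assembled (build_tool_call is a hypothetical helper, not part of mcp/tools.py):

```python
# Documented defaults from the tool schema above.
DEFAULTS = {"dynamic": False, "auto_detect": True, "store_in_mongodb": True}

def build_tool_call(url: str, **overrides) -> dict:
    """Fill in schema defaults, letting the caller override any of them."""
    params = {**DEFAULTS, "url": url, **overrides}
    return {"name": "scrape_website", "parameters": params}
```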

๐Ÿ›ก๏ธ Anti-Bot (Antigravity) Features

  1. User-Agent Rotation: 20+ realistic browser User-Agents
  2. Request Throttling: 1-5 second random delays between requests
  3. Rate Limiting: Max 10 requests per domain per minute
  4. robots.txt Compliance: Respects crawling restrictions
  5. Playwright Stealth Mode: Disables automation detection flags
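Items 1-3 can be sketched compactly; this is an illustrative outline, not the actual contents of user_agents.py or throttle.py (the class and function names here are hypothetical, and the real User-Agent pool has 20+ entries):

```python
import random
import time
from collections import defaultdict, deque
from typing import Optional

# Illustrative pool; the real user_agents.py ships 20+ realistic entries.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

def pick_user_agent() -> str:
    return random.choice(USER_AGENTS)

def polite_delay() -> float:
    """Random 1-5 second pause between requests (feature 2)."""
    return random.uniform(1.0, 5.0)

class DomainRateLimiter:
    """Sliding-window cap: at most N requests per domain per minute."""

    def __init__(self, max_per_minute: int = 10):
        self.max_per_minute = max_per_minute
        self.history = defaultdict(deque)  # domain -> recent timestamps

    def allow(self, domain: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        window = self.history[domain]
        while window and now - window[0] >= 60:
            window.popleft()  # drop timestamps older than one minute
        if len(window) >= self.max_per_minute:
            return False
        window.append(now)
        return True
```

A scraper loop would call `pick_user_agent()` per request, sleep for `polite_delay()` seconds, and skip or queue any URL whose domain fails `allow()`.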

📊 MongoDB Schema

scraped_data Collection

{
  "_id": "ObjectId",
  "url": "string",
  "scraped_at": "ISO timestamp",
  "scraper_type": "static | dynamic",
  "content": {
    "title": "string",
    "text": "string",
    "links": ["string"]
  },
  "metadata": {
    "status_code": "number",
    "response_time": "number",
    "user_agent": "string"
  }
}

scrape_logs Collection

{
  "url": "string",
  "timestamp": "ISO timestamp",
  "success": "boolean",
  "error": "string | null"
}
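The two document shapes above map naturally onto typed models. The project uses Pydantic (database/models.py); the sketch below uses stdlib dataclasses purely for illustration, so the field layout is taken from the schema while the class names are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Content:
    title: str
    text: str
    links: list[str] = field(default_factory=list)

@dataclass
class Metadata:
    status_code: int
    response_time: float
    user_agent: str

@dataclass
class ScrapedData:
    """One document in the scraped_data collection."""
    url: str
    scraper_type: str  # "static" or "dynamic"
    content: Content
    metadata: Metadata
    scraped_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

@dataclass
class ScrapeLog:
    """One document in the scrape_logs collection."""
    url: str
    success: bool
    error: Optional[str] = None
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```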

🧪 Running Tests

cd d:\mcp
pytest tests/ -v

โš–๏ธ Ethical Considerations

  • Always respects robots.txt directives
  • Implements polite crawling with delays
  • Only scrapes publicly accessible content
  • Rate limiting prevents server overload
  • Designed for responsible use

📋 Limitations

  • Cannot bypass authentication or CAPTCHAs
  • JavaScript-heavy SPAs may require dynamic scraping
  • Some sites may detect and block scraping despite stealth measures
  • Rate limiting may slow down bulk operations

🔮 Future Scope

  • Proxy rotation support
  • CAPTCHA solving integration
  • Distributed scraping with task queues
  • Advanced content extraction (structured data, tables)
  • Scheduled/recurring scrapes
  • WebSocket real-time updates

📄 License

This project is for educational purposes.
