# AI-Driven Universal Web Data Extraction Platform
A production-grade, MCP-enabled universal web scraping platform with MongoDB storage and advanced anti-bot (antigravity) mechanisms.
## Features
- Dual Scraping Engines: Static (Requests + BeautifulSoup) and Dynamic (Playwright)
- Auto-Detection: Automatically selects the appropriate scraper based on page content
- Anti-Bot Protection: User-Agent rotation, rate limiting, robots.txt compliance, stealth mode
- MongoDB Storage: Persists all scraped data with full metadata
- MCP Integration: Exposes scraping as tools for LLM invocation
- Export Options: JSON and CSV export capabilities
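The auto-detection feature can be approximated with a simple heuristic: try the static scraper first, and fall back to Playwright when the fetched HTML looks like a JavaScript-rendered shell with little visible text. A minimal sketch under that assumption (hypothetical helper name; the real logic lives in `scraper/strategy_selector.py` and may differ):

```python
from html.parser import HTMLParser

class _TextCounter(HTMLParser):
    """Counts visible text characters, ignoring script/style contents."""

    def __init__(self):
        super().__init__()
        self.chars = 0
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chars += len(data.strip())

def needs_dynamic_scraper(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: pages with almost no static text are likely JS-rendered SPAs."""
    counter = _TextCounter()
    counter.feed(html)
    return counter.chars < min_text_chars
```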
## Project Structure
```
d:\mcp\
├── requirements.txt          # Python dependencies
├── config.py                 # Configuration settings
├── main.py                   # FastAPI MCP server entry point
├── scraper/
│   ├── static_scraper.py     # Requests + BeautifulSoup scraper
│   ├── dynamic_scraper.py    # Playwright scraper
│   └── strategy_selector.py  # Auto-detection logic
├── antigravity/
│   ├── user_agents.py        # User-Agent rotation
│   ├── throttle.py           # Request delays & rate limiting
│   ├── robots_validator.py   # robots.txt compliance
│   └── stealth.py            # Playwright stealth configuration
├── database/
│   ├── mongodb.py            # MongoDB connection & operations
│   └── models.py             # Pydantic data models
├── mcp/
│   └── tools.py              # MCP tool definitions
├── utils/
│   ├── normalizer.py         # Data normalization
│   └── exporter.py           # CSV/JSON export
├── tests/                    # Test suite
├── docs/
└── README.md                 # This file
```
## Quick Start
### 1. Install Dependencies

```bash
cd d:\mcp
pip install -r requirements.txt
playwright install chromium
```
### 2. Start MongoDB

Ensure MongoDB is running on `localhost:27017` (or update `MONGODB_URI` in `config.py`).
### 3. Run the Server

```bash
python main.py
```

The server will start at `http://localhost:8000`.
### 4. Test the API

Open `http://localhost:8000/docs` for interactive Swagger documentation.
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/scrape` | POST/GET | Scrape a website |
| `/stats` | GET | Get scraping statistics |
| `/recent` | GET | Get recently scraped data |
| `/logs` | GET | Get scrape logs |
| `/export/json` | POST | Export data to JSON |
| `/export/csv` | POST | Export data to CSV |
| `/health` | GET | Health check |
### Example Scrape Request

```bash
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "auto_detect": true}'
```
## MCP Tool Usage

The platform exposes a `scrape_website` tool via MCP:
**Tool Schema**

```json
{
  "name": "scrape_website",
  "parameters": {
    "url": "string (required)",
    "dynamic": "boolean (default: false)",
    "auto_detect": "boolean (default: true)",
    "store_in_mongodb": "boolean (default: true)"
  }
}
```
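When an LLM omits the optional parameters, the server has to fill in the documented defaults before dispatching the scrape. A minimal sketch of that normalization step (hypothetical helper; the actual handling lives in `mcp/tools.py`):

```python
# Defaults taken from the tool schema above.
SCRAPE_TOOL_DEFAULTS = {
    "dynamic": False,
    "auto_detect": True,
    "store_in_mongodb": True,
}

def normalize_tool_args(args: dict) -> dict:
    """Validate a scrape_website call and apply the documented defaults."""
    if "url" not in args:
        raise ValueError("scrape_website: 'url' is required")
    # Explicit arguments win over defaults.
    return {**SCRAPE_TOOL_DEFAULTS, **args}
```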
## Anti-Bot (Antigravity) Features
- User-Agent Rotation: 20+ realistic browser User-Agents
- Request Throttling: 1-5 second random delays between requests
- Rate Limiting: Max 10 requests per domain per minute
- robots.txt Compliance: Respects crawling restrictions
- Playwright Stealth Mode: Disables automation detection flags
## MongoDB Schema
### `scraped_data` Collection

```json
{
  "_id": "ObjectId",
  "url": "string",
  "scraped_at": "ISO timestamp",
  "scraper_type": "static | dynamic",
  "content": {
    "title": "string",
    "text": "string",
    "links": ["string"]
  },
  "metadata": {
    "status_code": "number",
    "response_time": "number",
    "user_agent": "string"
  }
}
```
### `scrape_logs` Collection

```json
{
  "url": "string",
  "timestamp": "ISO timestamp",
  "success": "boolean",
  "error": "string | null"
}
```
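A lightweight shape check helps catch malformed documents before insertion. The sketch below is illustrative only; the project's real validation uses the Pydantic models in `database/models.py`:

```python
from datetime import datetime, timezone

def make_document(url: str, scraper_type: str, content: dict, metadata: dict) -> dict:
    """Assemble a scraped_data document with an ISO timestamp."""
    return {
        "url": url,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "scraper_type": scraper_type,
        "content": content,
        "metadata": metadata,
    }

def validate_scraped_document(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape matches the schema."""
    problems = []
    if not isinstance(doc.get("url"), str):
        problems.append("url must be a string")
    if doc.get("scraper_type") not in ("static", "dynamic"):
        problems.append("scraper_type must be 'static' or 'dynamic'")
    content = doc.get("content")
    if not isinstance(content, dict) or not isinstance(content.get("links"), list):
        problems.append("content.links must be a list")
    return problems
```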
## Running Tests

```bash
cd d:\mcp
pytest tests/ -v
```
## Ethical Considerations

- Always respects `robots.txt` directives
- Implements polite crawling with delays
- Only scrapes publicly accessible content
- Rate limiting prevents server overload
- Designed for responsible use
## Limitations
- Cannot bypass authentication or CAPTCHAs
- JavaScript-heavy SPAs may require dynamic scraping
- Some sites may detect and block scraping despite stealth measures
- Rate limiting may slow down bulk operations
## Future Scope
- Proxy rotation support
- CAPTCHA solving integration
- Distributed scraping with task queues
- Advanced content extraction (structured data, tables)
- Scheduled/recurring scrapes
- WebSocket real-time updates
## License

This project is for educational purposes.