AI-Cursor-Scraping-Assistant
A powerful tool that leverages Cursor AI and MCP (Model Context Protocol) to easily generate web scrapers for various types of websites. This project helps you quickly analyze websites and generate proper Scrapy or Camoufox scrapers with minimal effort.
Project Overview
This project contains two main components:
- Cursor Rules - A set of rules that teach Cursor AI how to analyze websites and create different types of Scrapy spiders
- MCP Tools - A collection of Model Context Protocol tools that enhance Cursor's capabilities for web scraping tasks
Prerequisites
- Cursor AI installed
- Python 3.10+ installed
- Basic knowledge of web scraping concepts
Installation
Clone this repository to your local machine:
git clone https://github.com/TheWebScrapingClub/AI-Cursor-Scraping-Assistant.git
cd AI-Cursor-Scraping-Assistant
Install the required dependencies:
pip install mcp camoufox scrapy
If you plan to use Camoufox, you'll need to fetch its browser binary:
python -m camoufox fetch
Setup
Setting Up MCP Server
The MCP server provides tools that help Cursor AI analyze web pages and generate XPath selectors. To start the MCP server:
Navigate to the MCPfiles directory:
cd MCPfiles
Update the `CAMOUFOX_FILE_PATH` in `xpath_server.py` to point to your local `Camoufox_template.py` file. Start the MCP server:
python xpath_server.py
In Cursor, connect to the MCP server by configuring it in the settings or using the MCP panel.
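For reference, tools exposed by xpath_server.py follow the standard pattern of the MCP Python SDK. The snippet below is only a minimal sketch, not the actual implementation: the tool name, its body, and the plain urllib fetch are illustrative assumptions.

```python
# Minimal MCP tool sketch using the FastMCP helper from the `mcp` Python SDK.
# The tool name and fetching logic are illustrative, not the real xpath_server.py.
import urllib.request

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("scraping-assistant")

@mcp.tool()
def fetch_page_html(url: str) -> str:
    """Return the raw HTML of a page so Cursor can analyze its structure."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    mcp.run()  # serve the registered tools over stdio so Cursor can connect
```

Once the server is running, any tool registered this way appears in Cursor's MCP panel and can be called during a chat session.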
Cursor Rules
The cursor-rules directory contains rules that teach Cursor AI how to analyze websites and create different types of scrapers. These rules are automatically loaded when you open the project in Cursor.
Detailed Cursor Rules Explanation
The cursor-rules directory contains a set of MDC (Markdown Configuration) files that guide Cursor's behavior when creating web scrapers:
prerequisites.mdc
This rule handles initial setup tasks before creating any scrapers:
- Gets the full path of the current project using `pwd`
- Stores the path in context for later use by other rules
- Confirms the execution of preliminary actions before proceeding
website-analysis.mdc
This comprehensive rule guides Cursor through website analysis:
- Identifies the type of Scrapy spider to build (PLP, PDP, etc.)
- Fetches and stores homepage HTML and cookies
- Strips CSS using the MCP tool to simplify HTML analysis
- Checks cookies for anti-bot protection (Akamai, Datadome, PerimeterX, etc.), as illustrated in the sketch after this list
- For PLP scrapers: fetches category pages, analyzes structure, looks for JSON data
- For PDP scrapers: fetches product pages, analyzes structure, looks for JSON data
- Detects schema.org markup and modern frameworks like Next.js
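To illustrate the cookie check mentioned in this list, a helper along the following lines can flag common anti-bot vendors by their well-known cookie names. The function and the exact prefix list are hypothetical and not part of the rule file:

```python
# Hypothetical helper showing cookie-based anti-bot detection; the prefixes
# below are common vendor cookies, but the list is not exhaustive.
ANTIBOT_COOKIE_HINTS = {
    "akamai": ("_abck", "ak_bmsc", "bm_sz"),
    "datadome": ("datadome",),
    "perimeterx": ("_px", "_pxhd", "_pxvid"),
}

def detect_antibot(cookie_names: list[str]) -> list[str]:
    """Return the anti-bot vendors whose cookies appear in a response."""
    detected = []
    for vendor, prefixes in ANTIBOT_COOKIE_HINTS.items():
        if any(name.startswith(prefixes) for name in cookie_names):
            detected.append(vendor)
    return detected

print(detect_antibot(["_abck", "sessionid"]))  # ['akamai']
```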
scrapy-step-by-step-process.mdc
This rule provides the execution flow for creating scrapers:
- Outlines the sequence of steps to follow
- References other rule files in the correct order
- Ensures prerequisite actions are completed before scraper creation
- Guides Cursor to analyze the website before generating code
scrapy.mdc
This extensive rule contains Scrapy best practices:
- Defines recommended code organization and directory structure
- Details file naming conventions and module organization
- Provides component architecture guidelines
- Offers strategies for code splitting and reuse
- Includes performance optimization recommendations
- Covers security practices, error handling, and logging
- Provides specific syntax examples and code snippets
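To ground these guidelines, here is a minimal spider written in the style the rule encourages. The target site, XPath expressions, and field names are placeholders, not output of the rule:

```python
import scrapy

class ExamplePlpSpider(scrapy.Spider):
    """Sketch of a PLP spider; URL, selectors, and fields are placeholders."""
    name = "example_plp"
    start_urls = ["https://www.example.com/collections/shoes"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "DOWNLOAD_DELAY": 1.0,  # stay polite by default
    }

    def parse(self, response):
        # Each listing card becomes one item.
        for product in response.xpath("//li[contains(@class, 'product')]"):
            yield {
                "product_name": product.xpath(".//h2/text()").get(),
                "price": product.xpath(".//span[contains(@class, 'price')]/text()").get(),
                "url": response.urljoin(product.xpath(".//a/@href").get()),
            }
        # Follow pagination until there is no next page.
        next_page = response.xpath("//a[@rel='next']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```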
scraper-models.mdc
This rule defines the different types of scrapers that can be created:
- E-commerce PLP: Details the data structure, field definitions, and implementation steps
- E-commerce PDP: Details the data structure, field definitions, and implementation steps
- Field mapping guidelines for all scraper types
- Step-by-step instructions for creating each type of scraper
- Default settings recommendations
- Anti-bot countermeasures for different protection systems
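As a rough illustration of how such a data structure translates into Scrapy items, a PLP item could look like the sketch below; the field names are assumptions rather than the exact schema defined in scraper-models.mdc:

```python
import scrapy

# Illustrative PLP item; the real field list in scraper-models.mdc may differ.
class ProductListingItem(scrapy.Item):
    product_name = scrapy.Field()
    product_url = scrapy.Field()
    price = scrapy.Field()
    currency = scrapy.Field()
    image_url = scrapy.Field()
    category = scrapy.Field()
```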
Usage
Here's how to use the AI-Cursor-Scraping-Assistant:
- Open the project in Cursor AI
- Make sure the MCP server is running
- Ask Cursor to create a scraper with a prompt like:
Write an e-commerce PLP scraper for the website gucci.com
Cursor will then:
- Analyze the website structure
- Check for anti-bot protection
- Extract the relevant HTML elements
- Generate a complete Scrapy spider based on the website type
Available Scraper Types
You can request different types of scrapers:
- E-commerce PLP (Product Listing Page) - Scrapes product catalogs/category pages
- E-commerce PDP (Product Detail Page) - Scrapes detailed product information
For example:
Write an e-commerce PDP scraper for nike.com
Advanced Usage
Camoufox Integration
The project includes a Camoufox template for creating stealth scrapers that can bypass certain anti-bot measures. The MCP tools help you:
- Fetch page content using Camoufox
- Generate XPath selectors for the desired elements
- Create a complete Camoufox scraper based on the template
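A stripped-down version of that flow, assuming Camoufox's synchronous API, is shown below; the URL is a placeholder and the real Camoufox_template.py is more elaborate:

```python
# Minimal Camoufox fetch sketch using the synchronous API; the target URL is
# a placeholder and this is not the actual Camoufox_template.py.
from camoufox.sync_api import Camoufox

def fetch_html(url: str) -> str:
    with Camoufox(headless=True) as browser:
        page = browser.new_page()
        page.goto(url)
        return page.content()

if __name__ == "__main__":
    html = fetch_html("https://www.example.com/product/123")
    print(len(html))
```

The returned HTML can then be fed to the XPath-generation tools to pick selectors for the elements you need.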
Custom Scrapers
You can extend the functionality by adding new scraper types to the cursor-rules files. The modular design allows for easy customization.
Project Structure
AI-Cursor-Scraping-Assistant/
├── MCPfiles/
│ ├── xpath_server.py # MCP server with web scraping tools
│ └── Camoufox_template.py # Template for Camoufox scrapers
├── cursor-rules/
│ ├── website-analysis.mdc # Rules for analyzing websites
│ ├── scrapy.mdc # Best practices for Scrapy
│ ├── scrapy-step-by-step-process.mdc # Guide for creating scrapers
│ ├── scraper-models.mdc # Templates for different scraper types
│ └── prerequisites.mdc # Setup requirements
└── README.md
TODO: Future Enhancements
The following features are planned for future development:
Proxy Integration
- Add proxy support when requested by the operator
- Implement proxy rotation strategies
- Support for different proxy providers
- Handle proxy authentication
- Integrate with popular proxy services
Improved XPath Generation and Validation
- Add validation mechanisms for generated XPath selectors
- Implement feedback loop for selector refinement
- Control flow management for reworking selectors
- Auto-correction of problematic selectors
- Handle edge cases like dynamic content and AJAX loading
Other Planned Features
- Support for more scraper types (news sites, social media, etc.)
- Integration with additional anti-bot bypass techniques
- Enhanced JSON extraction capabilities
- Support for more complex navigation patterns
- Multi-page scraping optimizations
References
This project is based on articles from The Web Scraping Club. For more information on web scraping techniques and best practices, visit The Web Scraping Club.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.