MCP Web Research Agent

A powerful MCP (Model Context Protocol) tool for automated web research, scraping, and intelligence gathering.

License: MIT · Python 3.8+ · MCP Protocol

A web research automation tool that turns a conventional web scraper into an MCP-compatible agent for AI-assisted workflows. Well suited to competitive intelligence, market research, and automated data collection.

🚀 Features

  • 🔍 Intelligent Scraping: Recursive web crawling with configurable depth
  • 🔎 Search Integration: Multi-engine search with result processing
  • 💾 Database Storage: Persistent SQLite storage with advanced querying
  • 📊 Multiple Export Formats: JSON, Markdown, and CSV exports
  • 🤖 MCP Integration: Seamless integration with AI assistants
  • ⚡ Async Ready: Built for concurrent operations
  • 🔧 Configurable: Adjustable settings for any use case

🛠️ Installation

Prerequisites

  • Python 3.8+
  • MCP-compatible client (Claude Desktop, etc.)

Quick Install

# Clone the repository
git clone https://github.com/yourusername/mcp-web-research-agent.git
cd mcp-web-research-agent

# Install dependencies
pip install -e .
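
# Alternatively (a hedged option, not required by the steps above): install only
# the runtime dependencies listed in requirements.txt (see Project Structure below)
pip install -r requirements.txt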

MCP Client Configuration

Add to your MCP client configuration:

{
  "mcpServers": {
    "web-research-agent": {
      "command": "python",
      "args": ["/path/to/mcp-web-research-agent/server.py"]
    }
  }
}
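
If the agent is installed into a virtual environment rather than the system Python, point "command" at that interpreter instead. A minimal sketch; the venv path below is only illustrative:

{
  "mcpServers": {
    "web-research-agent": {
      "command": "/path/to/mcp-web-research-agent/.venv/bin/python",
      "args": ["/path/to/mcp-web-research-agent/server.py"]
    }
  }
}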

📖 Usage

Available Tools

scrape_url

Scrape a single URL for specific keywords

result = await scrape_url(
    url="https://example.com",
    keywords=["python", "automation", "scraping"],
    extract_links=False,
    max_depth=1
)

search_and_scrape

Search the web and automatically scrape results

result = await search_and_scrape(
    query="web scraping best practices",
    keywords=["python", "beautifulsoup", "requests"],
    search_engine_url="https://searx.gophernuttz.us/search/",
    max_results=10
)

get_scraping_results

Query the database for previous scraping results

result = await get_scraping_results(
    keyword_filter="python",
    limit=50
)

export_results

Export results to JSON, Markdown, or CSV

result = await export_results(
    format="markdown",
    keyword_filter="python",
    output_path="/path/to/output.md"
)

get_scraping_stats

Get current statistics and status

result = await get_scraping_stats()
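
The tools above compose into a single research pass. A minimal sketch, assuming the tools are awaited exactly as in the examples above; the query, keywords, and output path are placeholders:

import asyncio

async def research_workflow():
    # Search the web and scrape the matching pages
    await search_and_scrape(
        query="web scraping best practices",
        keywords=["python", "beautifulsoup", "requests"],
        max_results=5
    )

    # Pull stored matches for one keyword back out of the database
    # (the structure of the returned results depends on the server implementation)
    matches = await get_scraping_results(keyword_filter="python", limit=20)

    # Export the filtered results to Markdown (placeholder output path)
    await export_results(
        format="markdown",
        keyword_filter="python",
        output_path="research_summary.md"
    )

    # Report overall counts after the run
    return await get_scraping_stats()

# asyncio.run(research_workflow())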

🗃️ Database Schema

The agent uses SQLite with the following structure:

-- URLs table
CREATE TABLE urls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Keywords table  
CREATE TABLE keywords (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    keyword TEXT UNIQUE NOT NULL
);

-- URL-Keyword relationships
CREATE TABLE url_keywords (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url_id INTEGER,
    keyword_id INTEGER,
    matches INTEGER DEFAULT 1,
    context TEXT,
    FOREIGN KEY (url_id) REFERENCES urls (id),
    FOREIGN KEY (keyword_id) REFERENCES keywords (id),
    UNIQUE(url_id, keyword_id)
);
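
Because storage is plain SQLite, stored matches can also be inspected directly with Python's built-in sqlite3 module. A minimal sketch, assuming the default database file scraper_results.db from the Configuration section below; the keyword value is a placeholder:

import sqlite3

# List pages that matched a given keyword, ordered by match count
conn = sqlite3.connect("scraper_results.db")
rows = conn.execute(
    """
    SELECT u.url, u.title, uk.matches, uk.context
    FROM urls u
    JOIN url_keywords uk ON uk.url_id = u.id
    JOIN keywords k ON k.id = uk.keyword_id
    WHERE k.keyword = ?
    ORDER BY uk.matches DESC
    """,
    ("python",),
).fetchall()
conn.close()

for url, title, matches, context in rows:
    print(f"{matches:>3}  {url}  ({title})")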

🔧 Configuration

Default Settings

  • Max Depth: 3 levels of recursive crawling
  • Request Delay: 1 second between requests
  • User Agent: Modern Chrome browser simulation
  • Database: scraper_results.db (auto-created)

Customization

Modify settings in the MCPWebScraper constructor:

scraper = MCPWebScraper(
    db_manager=db_manager,
    max_depth=5,      # Increase crawl depth
    delay=0.5         # Faster requests
)

🧪 Development

Running Tests

python test_mcp_scraper.py

Example Usage

python example_usage.py

Project Structure

mcp-web-research-agent/
├── server.py              # MCP server implementation
├── scraper.py             # Core scraping logic
├── database.py            # Database management
├── requirements.txt       # Python dependencies
├── pyproject.toml         # Package configuration
├── test_mcp_scraper.py    # Unit tests
├── example_usage.py       # Usage examples
└── README.md              # This file

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built on the Model Context Protocol
  • Inspired by modern web scraping best practices
  • Thanks to the open-source community for amazing tools

Built with ❤️ for the MCP ecosystem
