# WebSurfer MCP
A powerful Model Context Protocol (MCP) server that enables Large Language Models (LLMs) to fetch and extract readable text content from web pages. This tool provides a secure, efficient, and feature-rich way for AI assistants to access web content through a standardized interface.
## Features

- Secure URL Validation: Blocks dangerous schemes, private IPs, and localhost domains
- Smart Content Extraction: Extracts clean, readable text from HTML pages using trafilatura and BeautifulSoup
- Rate Limiting: Built-in rate limiting to prevent abuse (60 requests/minute)
- Content Type Filtering: Only processes supported content types (HTML, plain text, XML)
- Size Limits: Configurable content size limits (default: 10 MB)
- Timeout Management: Configurable request timeouts with validation
- Comprehensive Error Handling: Detailed error messages for the various failure scenarios
- Full Test Coverage: 45+ unit tests covering all functionality
## Architecture
The project consists of several key components:
### Core Components

- `MCPURLSearchServer`: Main MCP server implementation
- `TextExtractor`: Handles web content fetching and text extraction
- `URLValidator`: Validates and sanitizes URLs for security
- `Config`: Centralized configuration management
### Key Features
- Async/Await: Built with modern Python async patterns for high performance
- Resource Management: Proper cleanup of network connections and resources
- Context Managers: Safe resource handling with automatic cleanup
- Logging: Comprehensive logging for debugging and monitoring
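To illustrate the context-manager pattern mentioned above, here is a minimal sketch (not the project's actual extractor) of how an async context manager guarantees resource cleanup; the stand-in session object takes the place of a real `aiohttp.ClientSession`:

```python
import asyncio

class ManagedExtractor:
    """Hypothetical stand-in for the server's extractor, showing the
    async context-manager pattern: open a session on entry, always
    release it on exit, even if the body raises."""

    def __init__(self):
        self.session = None
        self.closed = False

    async def __aenter__(self):
        # The real implementation would open an aiohttp.ClientSession here.
        self.session = object()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Cleanup runs unconditionally when the `async with` block exits.
        self.session = None
        self.closed = True

async def main() -> bool:
    async with ManagedExtractor() as extractor:
        assert extractor.session is not None  # resource is live inside the block
    return extractor.closed  # resource was released on exit

print(asyncio.run(main()))  # True
```

Because `__aexit__` runs on both normal exit and exceptions, network connections cannot leak even when a fetch fails partway through.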
## Installation

### Prerequisites

- Python 3.12 or higher
- uv package manager (recommended)

### Quick Start
1. Clone the repository:

```bash
git clone https://github.com/crybo-rybo/websurfer-mcp
cd websurfer-mcp
```

2. Install dependencies:

```bash
uv sync
```

3. Verify the installation:

```bash
uv run python -c "import mcp_url_search_server; print('Installation successful!')"
```
## Usage

### Starting the MCP Server
The server communicates via stdio (standard input/output) and can be integrated with any MCP-compatible client.
```bash
# Start the server
uv run run_server.py serve

# Start with a custom log level
uv run run_server.py serve --log-level DEBUG
```
### Testing URL Search Functionality
Test the URL search functionality directly:
```bash
# Test with a simple URL
uv run run_server.py test --url "https://example.com"

# Test with a custom timeout
uv run run_server.py test --url "https://httpbin.org/html" --timeout 15
```
### Example Test Output
```json
{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "content_type": "text/html",
  "status_code": 200,
  "text_length": 1250,
  "text_preview": "Example Domain This domain is for use in illustrative examples in documents..."
}
```
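A client consuming this response might handle it as follows (a sketch assuming exactly the field names shown above):

```python
import json

# The example response shown above, as a client might receive it.
raw = """
{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "content_type": "text/html",
  "status_code": 200,
  "text_length": 1250,
  "text_preview": "Example Domain This domain is for use in illustrative examples in documents..."
}
"""

result = json.loads(raw)
if result["success"]:
    # On success, the title and extracted-text length are available.
    print(f'{result["title"]} ({result["text_length"]} chars)')
else:
    print(f'Fetch failed for {result["url"]}')
```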
## Configuration
The server can be configured using environment variables:
| Variable | Default | Description |
|---|---|---|
| `MCP_DEFAULT_TIMEOUT` | `10` | Default request timeout in seconds |
| `MCP_MAX_TIMEOUT` | `60` | Maximum allowed timeout in seconds |
| `MCP_USER_AGENT` | `MCP-URL-Search-Server/1.0.0` | User-agent string for requests |
| `MCP_MAX_CONTENT_LENGTH` | `10485760` | Maximum content size in bytes (10 MB) |
### Example Configuration
```bash
export MCP_DEFAULT_TIMEOUT=15
export MCP_MAX_CONTENT_LENGTH=5242880  # 5MB
uv run run_server.py serve
```
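A configuration class backing these variables might look roughly like this (the variable names are real; the class itself is an illustrative sketch, not the project's `Config`):

```python
import os

class Config:
    """Illustrative sketch: read each setting from the environment,
    falling back to the documented default when the variable is unset."""

    DEFAULT_TIMEOUT = int(os.environ.get("MCP_DEFAULT_TIMEOUT", "10"))
    MAX_TIMEOUT = int(os.environ.get("MCP_MAX_TIMEOUT", "60"))
    USER_AGENT = os.environ.get("MCP_USER_AGENT", "MCP-URL-Search-Server/1.0.0")
    # 10485760 bytes == 10 MB, matching the table above.
    MAX_CONTENT_LENGTH = int(os.environ.get("MCP_MAX_CONTENT_LENGTH", str(10 * 1024 * 1024)))

print(Config.DEFAULT_TIMEOUT, Config.MAX_CONTENT_LENGTH)
```

Reading the environment once at import time keeps the values consistent for the lifetime of the server process.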
## Testing

### Running All Tests
```bash
# Run all tests with verbose output
uv run python -m unittest discover tests -v

# Run tests with coverage (if coverage is installed)
uv run coverage run -m unittest discover tests
uv run coverage report
```
### Running Specific Test Files
```bash
# Run only integration tests
uv run python -m unittest tests.test_integration -v

# Run only text extraction tests
uv run python -m unittest tests.test_text_extractor -v

# Run only URL validation tests
uv run python -m unittest tests.test_url_validator -v
```
### Test Results
All 45 tests should pass successfully:
```
test_content_types_immutable (test_config.TestConfig.test_content_types_immutable) ... ok
test_default_configuration_values (test_config.TestConfig.test_default_configuration_values) ... ok
test_404_error_handling (test_integration.TestMCPURLSearchIntegration.test_404_error_handling) ... ok
...
----------------------------------------------------------------------
Ran 45 tests in 1.827s

OK
```
## Development

### Project Structure
```
websurfer-mcp/
├── mcp_url_search_server.py   # Main MCP server implementation
├── text_extractor.py          # Web content extraction logic
├── url_validator.py           # URL validation and security
├── config.py                  # Configuration management
├── run_server.py              # Command-line interface
├── run_tests.py               # Test runner utilities
├── tests/                     # Test suite
│   ├── test_integration.py        # Integration tests
│   ├── test_text_extractor.py     # Text extraction tests
│   ├── test_url_validator.py      # URL validation tests
│   └── test_config.py             # Configuration tests
├── pyproject.toml             # Project configuration
└── README.md                  # This file
```
## Security Features

### URL Validation

- Scheme Blocking: Blocks `file://`, `javascript:`, and `ftp://` schemes
- Private IP Protection: Blocks access to private IP ranges (10.x.x.x, 192.168.x.x, etc.)
- Localhost Protection: Blocks localhost and local domain access
- URL Length Limits: Rejects excessively long URLs
- Format Validation: Ensures proper URL structure
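The checks above can be sketched with the standard library alone. This is an illustrative validator, not the project's `URLValidator` (the function name, length limit, and exact rules are assumptions):

```python
import ipaddress
from urllib.parse import urlparse

BLOCKED_SCHEMES = {"file", "javascript", "ftp"}

def is_url_allowed(url: str, max_length: int = 2048) -> bool:
    """Hypothetical check: reject blocked schemes, localhost,
    private/loopback IPs, and over-long URLs."""
    if len(url) > max_length:
        return False
    parsed = urlparse(url)
    # Allow only http/https; this implicitly blocks file://, javascript:, ftp://.
    if parsed.scheme in BLOCKED_SCHEMES or parsed.scheme not in {"http", "https"}:
        return False
    host = parsed.hostname or ""
    if host == "localhost" or host.endswith(".local"):
        return False
    try:
        ip = ipaddress.ip_address(host)
        if ip.is_private or ip.is_loopback:
            return False
    except ValueError:
        pass  # host is a domain name, not a literal IP
    return bool(host)

print(is_url_allowed("https://example.com"))       # True
print(is_url_allowed("http://192.168.1.5/admin"))  # False
print(is_url_allowed("file:///etc/passwd"))        # False
```

Using `ipaddress` for the private-range check is more robust than prefix matching, since it also covers loopback and IPv6 literals.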
### Content Safety
- Content Type Filtering: Only processes supported text-based content types
- Size Limits: Configurable maximum content size (default: 10MB)
- Rate Limiting: Prevents abuse with configurable limits
- Timeout Protection: Configurable request timeouts
## Performance
- Async Processing: Non-blocking I/O for high concurrency
- Connection Pooling: Efficient HTTP connection reuse
- DNS Caching: Reduces DNS lookup overhead
- Resource Cleanup: Automatic cleanup prevents memory leaks
## Acknowledgments
- Built with the Model Context Protocol (MCP)
- Uses aiohttp for async HTTP requests
- Leverages trafilatura for content extraction
- Powered by BeautifulSoup for HTML parsing
Happy web surfing with your AI assistant!