🤖 MCP Dataset Onboarding Server
A FastAPI-based MCP (Model Context Protocol) server that automates dataset onboarding, using Google Drive as both the input source and a mock catalog.
🔒 SECURITY FIRST - READ THIS BEFORE SETUP
⚠️ This repository contains template files only. You MUST configure your own credentials before use.
📖 Read SECURITY_SETUP.md for complete security instructions.
🚨 Never commit service account keys or real folder IDs to version control!
Features
- Automated Dataset Processing: Complete workflow from raw CSV/Excel files to cataloged datasets
- Google Drive Integration: Uses Google Drive folders as input source and catalog storage
- Metadata Extraction: Automatically extracts column information, data types, and basic statistics
- Data Quality Rules: Suggests DQ rules based on data characteristics
- Contract Generation: Creates Excel contracts with schema and DQ information
- Mock Catalog: Publishes processed artifacts to a catalog folder
- 🤖 Automated Processing: Watches folders and processes files automatically
- 🌐 Multiple Interfaces: FastAPI server, MCP server, CLI tools, and dashboards
Project Structure
├── main.py # FastAPI server and endpoints
├── mcp_server.py # True MCP protocol server for LLM integration
├── utils.py # Google Drive helpers and DQ functions
├── dataset_processor.py # Centralized dataset processing logic
├── auto_processor.py # 🤖 Automated file monitoring
├── start_auto_processor.py # 🚀 Easy startup for auto-processor
├── processor_dashboard.py # 📊 Monitoring dashboard
├── dataset_manager.py # CLI tool for managing datasets
├── local_test.py # Local processing script
├── auto_config.py # ⚙️ Configuration management
├── requirements.txt # Python dependencies
├── Dockerfile # Container configuration
├── .env.template # Environment variables template
├── .gitignore # Security: excludes sensitive files
├── SECURITY_SETUP.md # 🔒 Security configuration guide
├── processed_datasets/          # Organized output folder
│   └── [dataset_name]/          # Individual dataset folders
│       ├── [dataset].csv        # Original dataset
│       ├── [dataset]_metadata.json
│       ├── [dataset]_contract.xlsx
│       ├── [dataset]_dq_report.json
│       └── README.md            # Dataset summary
└── README.md                    # This file
🚀 Quick Start
1. Security Setup (REQUIRED)
# 1. Read the security guide
cat SECURITY_SETUP.md
# 2. Set up your Google service account (outside this repo)
# 3. Configure your environment variables
cp .env.template .env
# Edit .env with your actual values
# 4. Verify no sensitive files will be committed
git status
2. Installation
# Install dependencies
pip install -r requirements.txt
# Test the setup
python local_test.py
3. Choose Your Interface
🤖 Fully Automated (Recommended)
# Start auto-processor - upload files and walk away!
python start_auto_processor.py
🌐 API Server
# Start FastAPI server
python main.py
🧠 LLM Integration (MCP)
# Start MCP server for Claude Desktop, etc.
python mcp_server.py
🖥️ Command Line
# Manual dataset management
python dataset_manager.py list
python dataset_manager.py process YOUR_FILE_ID
🎯 Usage Scenarios
Scenario 1: Set-and-Forget Automation
python start_auto_processor.py
- Upload files to Google Drive
- Files are processed automatically within about 30 seconds (the default check interval)
- Monitor progress with:
python processor_dashboard.py --live
Scenario 2: LLM-Powered Data Analysis
- Configure the MCP server in Claude Desktop (sample config below)
- Chat: "Analyze the dataset I just uploaded"
- Claude uses MCP tools to process and explain your data
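For reference, a minimal Claude Desktop entry for this server might look like the sketch below; the server name, command, and paths are placeholders for your environment, and MCP_INTEGRATION_GUIDE.md covers the full setup:
{
  "mcpServers": {
    "dataset-onboarding": {
      "command": "python",
      "args": ["/path/to/mcp_server.py"],
      "env": {
        "GOOGLE_SERVICE_ACCOUNT_KEY_PATH": "/secure/path/to/key.json"
      }
    }
  }
}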
Scenario 3: API Integration
python main.py
- Integrate with your data pipelines via REST API
- Programmatic dataset onboarding
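For a quick smoke test against the running server (the request body is illustrative; check main.py for the exact schema):
# Health check
curl http://localhost:8000/health
# Extract metadata for a Drive file (illustrative payload)
curl -X POST http://localhost:8000/tool/extract_metadata \
  -H "Content-Type: application/json" \
  -d '{"file_id": "YOUR_FILE_ID"}'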
📊 What You Get
For each processed dataset:
- 📄 Original File: Preserved in organized folder
- 📋 Metadata JSON: Column info, types, statistics
- 📊 Excel Contract: Professional multi-sheet contract
- 🔍 Quality Report: Data quality assessment
- 📖 README: Human-readable summary
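As an illustration, the metadata JSON captures dataset- and column-level details roughly like this (field names are illustrative, not the exact output schema):
{
  "dataset_name": "sales_2024",
  "row_count": 1200,
  "columns": [
    {"name": "order_id", "dtype": "int64", "null_count": 0},
    {"name": "region", "dtype": "object", "unique_values": 4}
  ]
}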
🛠️ Available Tools
FastAPI Endpoints
- /tool/extract_metadata - Analyze dataset structure
- /tool/apply_dq_rules - Generate quality rules
- /process_dataset - Complete workflow
- /health - System health check
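A minimal Python sketch for driving the full workflow from a data pipeline (this assumes the server is on localhost:8000 and accepts a JSON body with a file_id; verify the request schema in main.py):
import requests

# Trigger the complete onboarding workflow for one Drive file
# (request body shape is an assumption - check main.py for the real schema)
resp = requests.post(
    "http://localhost:8000/process_dataset",
    json={"file_id": "YOUR_FILE_ID"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())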
MCP Tools (for LLMs)
- extract_dataset_metadata - Dataset analysis
- generate_data_quality_rules - Quality assessment
- process_complete_dataset - Full pipeline
- list_catalog_files - Catalog browsing
CLI Commands
- dataset_manager.py list - Show processed datasets
- auto_processor.py --once - Single check cycle
- processor_dashboard.py --live - Real-time monitoring
🔧 Configuration
Environment Variables (.env)
GOOGLE_SERVICE_ACCOUNT_KEY_PATH=path/to/your/key.json
MCP_SERVER_FOLDER_ID=your_input_folder_id
MCP_CLIENT_FOLDER_ID=your_output_folder_id
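The code reads these via standard environment lookups; a minimal sketch of the pattern (the actual loading logic lives in utils.py and auto_config.py):
import os

# Fail fast if a required setting is missing (names match .env.template)
KEY_PATH = os.environ["GOOGLE_SERVICE_ACCOUNT_KEY_PATH"]
INPUT_FOLDER_ID = os.environ["MCP_SERVER_FOLDER_ID"]
OUTPUT_FOLDER_ID = os.environ["MCP_CLIENT_FOLDER_ID"]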
Auto-Processor Settings (auto_config.py)
- Check interval: 30 seconds
- Supported formats: CSV, Excel
- File age threshold: 1 minute
- Max files per cycle: 5
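These map to simple constants in auto_config.py, along these lines (names are illustrative; check the file for the real ones):
# Illustrative defaults mirroring the list above; actual names may differ
CHECK_INTERVAL_SECONDS = 30            # how often to poll the input folder
SUPPORTED_EXTENSIONS = (".csv", ".xlsx", ".xls")
MIN_FILE_AGE_SECONDS = 60              # skip files that may still be uploading
MAX_FILES_PER_CYCLE = 5                # throttle processing per cycle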
📈 Monitoring & Analytics
# Current status
python processor_dashboard.py
# Live monitoring (auto-refresh)
python processor_dashboard.py --live
# Detailed statistics
python processor_dashboard.py --stats
# Processing history
python auto_processor.py --list
🐳 Docker Deployment
# Build
docker build -t mcp-dataset-server .
# Run (mount your service account key securely)
docker run -p 8000:8000 \
-v /secure/path/to/key.json:/app/keys/key.json \
-e GOOGLE_SERVICE_ACCOUNT_KEY_PATH=/app/keys/key.json \
-e MCP_SERVER_FOLDER_ID=your_folder_id \
mcp-dataset-server
🔍 Troubleshooting
Common Issues
- No files detected: Check Google Drive permissions
- Processing errors: Verify service account access
- MCP not working: Check Claude Desktop configuration
Debug Commands
# Test Google Drive connection
python -c "from utils import get_drive_service; get_drive_service(); print('✅ Connected')"
# Check auto-processor status
python auto_processor.py --once
# Verify MCP server
python test_mcp_server.py
🤝 Contributing
- Fork the repository
- Create a feature branch
- Never commit sensitive data
- Test your changes
- Submit a pull request
📚 Documentation
- SECURITY_SETUP.md - Security configuration
- AUTOMATION_GUIDE.md - Automation features
- MCP_INTEGRATION_GUIDE.md - LLM integration
📄 License
MIT License
🎉 What Makes This Special
- 🔒 Security First: Proper credential management
- 🤖 True Automation: Zero manual intervention
- 🧠 LLM Integration: Natural language data processing
- 📊 Professional Output: Enterprise-ready documentation
- 🔧 Multiple Interfaces: API, CLI, MCP, Dashboard
- 📈 Real-time Monitoring: Live processing status
- 🗂️ Perfect Organization: Structured output folders
Transform your messy data files into professional, documented, quality-checked datasets automatically! 🚀