Video MCP Server
Bridge the gap between Claude and video content - An MCP (Model Context Protocol) server that extracts keyframes and transcribes audio from videos, enabling Claude to analyze video files.
Overview
While Claude is multimodal and can analyze images, it cannot directly process video files. This MCP server automatically:
- ๐ฌ Extracts keyframes using intelligent scene detection or fixed intervals
- ๐ค Transcribes audio using OpenAI's Whisper
- โก Provides structured, timestamped output combining visual and textual information
- ๐พ Caches results for instant follow-up queries
- ๐ผ๏ธ Optimized image compression to fit within Claude Code's token limits
Use Cases
- ๐น Meeting Analysis: Process Teams/Zoom recordings to extract key points and visual content
- ๐ Tutorial Understanding: Analyze demo videos and instructional content
- ๐ Content Review: Quickly scan through long videos to find specific moments
- ๐ Presentation Analysis: Extract slides and speaker notes from recorded presentations
Installation
Prerequisites
Python 3.11 or higher
python3 --version # Should be 3.11+FFmpeg (required for video processing)
macOS:
brew install ffmpegUbuntu/Debian:
sudo apt-get update sudo apt-get install ffmpegWindows:
- Download from ffmpeg.org
- Add to PATH
Verify FFmpeg installation:
ffmpeg -version
Install Video MCP Server
Clone repository:
Create a virtual environment (recommended):
python3 -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activateInstall the package:
pip install -e .Verify installation:
video-mcp --help # Should show no errors (will start MCP server)
Configuration
MCP Configuration for Claude Code
Add the following to your Claude Code MCP configuration file:
Location: ~/.claude.json
{
"mcpServers": {
"video-mcp": {
"command": "/path/to/vidmcp/venv/bin/video-mcp",
"args": [],
"env": {
"VIDEO_MCP_FRAME_INTERVAL_SECONDS": "10",
"VIDEO_MCP_WHISPER_MODEL": "base",
"VIDEO_MCP_MAX_FRAMES_PER_VIDEO": "100",
"VIDEO_MCP_USE_SCENE_DETECTION": "true",
"VIDEO_MCP_SCENE_THRESHOLD": "0.3",
"VIDEO_MCP_SCENE_MAX_INTERVAL_SECONDS": "30",
"PYTHONUNBUFFERED": "1"
}
}
}
}
Important Notes:
- Adjust the
commandpath to match your virtual environment location - Restart Claude Code after modifying
~/.claude.json
Environment Variables
You can customize behavior using environment variables (all optional):
| Variable | Default | Description |
|---|---|---|
VIDEO_MCP_FRAME_INTERVAL_SECONDS |
10 |
Interval between frames (fixed-interval mode) |
VIDEO_MCP_MAX_FRAMES_PER_VIDEO |
100 |
Maximum frames to extract per video |
VIDEO_MCP_USE_SCENE_DETECTION |
true |
Use scene detection instead of fixed intervals |
VIDEO_MCP_SCENE_THRESHOLD |
0.3 |
Scene detection sensitivity (0.0-1.0, lower=more sensitive) |
VIDEO_MCP_SCENE_MAX_INTERVAL_SECONDS |
30 |
Maximum seconds between frames (ensures static scenes are captured) |
VIDEO_MCP_WHISPER_MODEL |
base |
Whisper model (tiny, base, small, medium, large) |
VIDEO_MCP_WHISPER_DEVICE |
cpu |
Device for Whisper (cpu or cuda) |
VIDEO_MCP_CACHE_DIR |
~/.cache/video-mcp |
Directory for cache storage |
VIDEO_MCP_CACHE_MAX_AGE_DAYS |
30 |
Maximum age of cache entries |
VIDEO_MCP_CACHE_MAX_SIZE_GB |
10 |
Maximum total cache size |
Scene Detection
Enabled by default - Intelligently extracts frames at scene changes instead of fixed intervals.
Settings:
VIDEO_MCP_USE_SCENE_DETECTION=true- Enable scene detectionVIDEO_MCP_SCENE_THRESHOLD=0.3- Sensitivity (0.0-1.0)- 0.1-0.2: Very sensitive (subtle changes, camera movements)
- 0.3: Default (balanced, most content)
- 0.4-0.5: Conservative (only major scene changes)
VIDEO_MCP_SCENE_MAX_INTERVAL_SECONDS=30- Maximum seconds between frames- Ensures static content is captured even if scene doesn't change
- Prevents missing long presentations, talking heads, or static slides
- Set to
0to disable (pure scene detection only)
Benefits:
- Captures key moments automatically
- Avoids redundant frames in static scenes
- Max interval ensures no content is missed during static periods
- Better coverage for dynamic content (movies, vlogs, edited videos)
Example:If someone talks for 2 minutes without moving (no scene change), max interval of 30s ensures you still get frames at 0s, 30s, 60s, 90s, 120s.
When to use fixed intervals:
- You need perfectly predictable, evenly-spaced coverage
- Content is extremely static throughout
Whisper Model Selection
Choose based on your needs:
| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
tiny |
39M | Fastest | Good | Quick scans, low-end hardware |
base |
74M | Fast | Better | Recommended default |
small |
244M | Medium | Great | High accuracy needs |
medium |
769M | Slow | Excellent | Production quality |
large |
1550M | Slowest | Best | Maximum accuracy |
Token Limits and Video Length
Claude Code has a 25,000 token limit for MCP responses. Current settings are optimized to fit within this limit.
Image Token Calculation:
- Formula (from Anthropic docs):
tokens = (width ร height) / 750 - For 640x360 frames:
(640 ร 360) / 750 = ~307 tokens per frame
Token Budget Breakdown:
- Frame images: ~307 tokens each
- Transcript text: ~1 token per 4 characters
- Metadata: ~200 tokens
Capacity at 25K Limit:
- Safe limit: ~70 frames (accounting for transcript)
- Maximum: ~80 frames (minimal transcript)
Recommended Video Lengths (10-second intervals):
- 2-5 minutes: 12-30 frames (~4-10K tokens) โ Always safe
- 5-10 minutes: 30-60 frames (~10-20K tokens) โ Usually safe
- 10-12 minutes: 60-72 frames (~20-24K tokens) โ ๏ธ Approaching limit
- 12+ minutes: 72+ frames โ Will likely truncate
If you hit truncation:
- Increase scene threshold:
VIDEO_MCP_SCENE_THRESHOLD=0.4 - Reduce max frames:
VIDEO_MCP_MAX_FRAMES_PER_VIDEO=60 - Increase frame interval:
VIDEO_MCP_FRAME_INTERVAL_SECONDS=15 - Process video in segments
Usage
Basic Usage with Claude Code
Once configured, you can ask Claude to process videos:
Process this video: /.../meeting_recording.mp4
Claude will automatically:
- Extract frames every 10 seconds (configurable)
- Transcribe the audio
- Present a timeline with frames and transcript
- Cache results for future queries
Follow-up Queries
Ask questions about the video content:
What were the main points discussed in the meeting?
Show me the slides that were presented
What was said around the 5-minute mark?
These follow-up queries are instant - they use the cached processed video data.
Custom Processing Options
You can customize processing parameters:
Process this video with 5-second intervals: /path/to/video.mp4
Process using the small Whisper model: /path/to/video.mp4
List Cached Videos
List all processed videos
Shows all videos in cache with their settings and cache timestamps.
Tools
The MCP server provides two tools:
1. process_video
Process a video file into frames and transcription.
Parameters:
video_path(required): Absolute path to video fileframe_interval(optional): Interval between frames (fixed-interval mode only)use_scene_detection(optional): Use scene detection (default: true)scene_threshold(optional): Scene detection sensitivity 0.0-1.0 (default: 0.3)whisper_model(optional): Whisper model to usemax_frames(optional): Maximum number of frames to extract (default: 100)
Example:
{
"video_path": "/path/to/video.mp4",
"frame_interval": 5,
"whisper_model": "small",
"max_frames": 100
}
2. list_processed_videos
List all videos currently in cache.
Returns: Summary of cached videos with paths, timestamps, and settings.
Resources
The server also provides MCP resources for cached videos:
URI Format: video:///absolute/path/to/video.mp4
Cached videos are automatically exposed as resources that Claude can access.
Cache Management
Cache Location
By default, cached data is stored in:
~/.cache/video-mcp/
โโโ {cache_key}/
โ โโโ metadata.json
โ โโโ transcript.json
โ โโโ timeline.json
โ โโโ frames/
โ โโโ frame_0000.jpg
โ โโโ frame_0001.jpg
โ โโโ ...
โโโ cache_index.json
Cache Key Format
Cache keys are generated from:
- Video file hash (MD5 of first 1MB + file size)
- Processing settings (scene detection/interval, Whisper model)
Fixed Interval Mode:Format: {hash}_{interval}s_{model}Example: abc123def456_10s_base
Scene Detection Mode:Format: {hash}_scene_{threshold}_{model}Example: abc123def456_scene_0.3_base
Manual Cache Management
Check cache size:
du -sh ~/.cache/video-mcp
Clear cache:
rm -rf ~/.cache/video-mcp
View cache index:
cat ~/.cache/video-mcp/cache_index.json | python3 -m json.tool
Automatic Cleanup
The cache automatically cleans up:
- Entries older than 30 days (configurable)
- When total size exceeds 10GB (configurable)
Supported Formats
- MP4 (
.mp4) - MOV (
.mov) - AVI (
.avi) - MKV (
.mkv) - WebM (
.webm)
Most common video codecs are supported via FFmpeg.
Performance
Processing Time
Approximate times for a 10-minute video (on Apple M1):
| Task | Time |
|---|---|
| Frame extraction (10s intervals) | ~10s |
| Audio transcription (base model) | ~60s |
| Timeline building | ~1s |
| Total | ~70s |
Cache Benefits
- First query: ~70s (processes video)
- Follow-up queries: Instant (uses cache)
Troubleshooting
Corporate Environments / JFrog Artifactory
Error: Could not find a version that satisfies the requirement fastmcp or No matching distribution found for fastmcp
Solutions:
Install with PyPI fallback:
pip install -e . --extra-index-url https://pypi.org/simpleThis checks your JFrog repo first, then falls back to public PyPI for missing packages.
Install FastMCP directly from PyPI first:
# Install FastMCP from public PyPI pip install --index-url https://pypi.org/simple fastmcp # Then install video-mcp (uses JFrog for other dependencies) pip install -e .
FFmpeg Not Found
Error: FFmpeg not found. Please install it...
Solution:
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get install ffmpeg
# Verify
ffmpeg -version
Video File Not Found
Error: Video file not found: /path/to/video.mp4
Solution:
- Provide absolute path, not relative path
- Check file exists:
ls -la /path/to/video.mp4 - Use tab completion to avoid typos
Unsupported Format
Error: Unsupported video format: .flv
Solution:
- Convert to MP4:
ffmpeg -i input.flv output.mp4 - Supported formats: MP4, MOV, AVI, MKV, WebM
Whisper Model Download Issues
Error: Failed to load Whisper model...
Solution:
- Models download automatically on first use
- Requires internet connection
- Check disk space (models: 39MB - 1.5GB)
- Download location:
~/.cache/whisper/
Out of Memory
Error: Processing fails on large videos
Solution:
- Reduce
max_frames:VIDEO_MCP_MAX_FRAMES_PER_VIDEO=30 - Increase
frame_interval:VIDEO_MCP_FRAME_INTERVAL_SECONDS=15 - Use smaller Whisper model:
VIDEO_MCP_WHISPER_MODEL=tiny
Slow Processing
Issue: Transcription takes too long
Solution:
- Use smaller Whisper model (
tinyorbase) - For GPU: Set
VIDEO_MCP_WHISPER_DEVICE=cuda(requires CUDA setup) - Process shorter video segments
Development
Project Structure
vidmcp/
โโโ src/vidmcp/
โ โโโ __init__.py # Package initialization
โ โโโ server.py # MCP server implementation
โ โโโ config.py # Configuration management
โ โโโ models.py # Data models
โ โโโ video_processor.py # FFmpeg integration
โ โโโ audio_processor.py # Whisper integration
โ โโโ timeline_builder.py # Timeline synchronization
โ โโโ cache.py # Cache management
โโโ tests/ # Test suite
โโโ pyproject.toml # Project configuration
โโโ README.md # This file
Running Tests
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=vidmcp
Code Formatting
# Format code
black src/vidmcp
# Lint code
ruff check src/vidmcp
Architecture
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Claude Code (Client) โ
โโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ MCP Protocol
โโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Video MCP Server โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ MCP Interface Layer โ โ
โ โ โข process_video tool โ โ
โ โ โข list_processed_videos tool โ โ
โ โ โข video:// resources โ โ
โ โโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Processing Pipeline โ โ
โ โ โข VideoProcessor (FFmpeg) โ โ
โ โ โข AudioProcessor (Whisper) โ โ
โ โ โข TimelineBuilder โ โ
โ โโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Cache Layer โ โ
โ โ โข Hash-based caching โ โ
โ โ โข Persistent storage โ โ
โ โ โข Auto cleanup โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Future Enhancements
Planned features:
- GPU Support: CUDA acceleration for faster Whisper transcription
- Video Chunking: Handle very long videos (>1 hour) efficiently
- YouTube Support: Download and process YouTube videos directly
- Speaker Diarization: Identify different speakers in transcript
- Summary Mode: Generate video summaries with fewer frames
- Custom Frame Selection: User-specified exact timestamps for frames
- Faster Whisper: Use
faster-whisperlibrary for 4x speed improvement - Adaptive Scene Detection: Auto-tune threshold based on content
Acknowledgments
- MCP Protocol: Anthropic's Model Context Protocol
- FFmpeg: Video processing
- Whisper: OpenAI's speech recognition
- Claude: Anthropic's AI assistant
Made with โค๏ธ for the Claude Code community