ethnn-b

Video MCP Server

Community ethnn-b
Updated

MCP server for processing videos into frames and transcriptions using FFmpeg and Whisper

Video MCP Server

Bridge the gap between Claude and video content - An MCP (Model Context Protocol) server that extracts keyframes and transcribes audio from videos, enabling Claude to analyze video files.

Overview

While Claude is multimodal and can analyze images, it cannot directly process video files. This MCP server automatically:

  • ๐ŸŽฌ Extracts keyframes using intelligent scene detection or fixed intervals
  • ๐ŸŽค Transcribes audio using OpenAI's Whisper
  • โšก Provides structured, timestamped output combining visual and textual information
  • ๐Ÿ’พ Caches results for instant follow-up queries
  • ๐Ÿ–ผ๏ธ Optimized image compression to fit within Claude Code's token limits

Use Cases

  • ๐Ÿ“น Meeting Analysis: Process Teams/Zoom recordings to extract key points and visual content
  • ๐ŸŽ“ Tutorial Understanding: Analyze demo videos and instructional content
  • ๐Ÿ” Content Review: Quickly scan through long videos to find specific moments
  • ๐Ÿ“Š Presentation Analysis: Extract slides and speaker notes from recorded presentations

Installation

Prerequisites

  1. Python 3.11 or higher

    python3 --version  # Should be 3.11+
    
  2. FFmpeg (required for video processing)

    macOS:

    brew install ffmpeg
    

    Ubuntu/Debian:

    sudo apt-get update
    sudo apt-get install ffmpeg
    

    Windows:

  3. Verify FFmpeg installation:

    ffmpeg -version
    

Install Video MCP Server

  1. Clone repository:

  2. Create a virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install the package:

    pip install -e .
    
  4. Verify installation:

    video-mcp --help  # Should show no errors (will start MCP server)
    

Configuration

MCP Configuration for Claude Code

Add the following to your Claude Code MCP configuration file:

Location: ~/.claude.json

{
  "mcpServers": {
    "video-mcp": {
      "command": "/path/to/vidmcp/venv/bin/video-mcp",
      "args": [],
      "env": {
        "VIDEO_MCP_FRAME_INTERVAL_SECONDS": "10",
        "VIDEO_MCP_WHISPER_MODEL": "base",
        "VIDEO_MCP_MAX_FRAMES_PER_VIDEO": "100",
        "VIDEO_MCP_USE_SCENE_DETECTION": "true",
        "VIDEO_MCP_SCENE_THRESHOLD": "0.3",
        "VIDEO_MCP_SCENE_MAX_INTERVAL_SECONDS": "30",
        "PYTHONUNBUFFERED": "1"
      }
    }
  }
}

Important Notes:

  • Adjust the command path to match your virtual environment location
  • Restart Claude Code after modifying ~/.claude.json

Environment Variables

You can customize behavior using environment variables (all optional):

Variable Default Description
VIDEO_MCP_FRAME_INTERVAL_SECONDS 10 Interval between frames (fixed-interval mode)
VIDEO_MCP_MAX_FRAMES_PER_VIDEO 100 Maximum frames to extract per video
VIDEO_MCP_USE_SCENE_DETECTION true Use scene detection instead of fixed intervals
VIDEO_MCP_SCENE_THRESHOLD 0.3 Scene detection sensitivity (0.0-1.0, lower=more sensitive)
VIDEO_MCP_SCENE_MAX_INTERVAL_SECONDS 30 Maximum seconds between frames (ensures static scenes are captured)
VIDEO_MCP_WHISPER_MODEL base Whisper model (tiny, base, small, medium, large)
VIDEO_MCP_WHISPER_DEVICE cpu Device for Whisper (cpu or cuda)
VIDEO_MCP_CACHE_DIR ~/.cache/video-mcp Directory for cache storage
VIDEO_MCP_CACHE_MAX_AGE_DAYS 30 Maximum age of cache entries
VIDEO_MCP_CACHE_MAX_SIZE_GB 10 Maximum total cache size

Scene Detection

Enabled by default - Intelligently extracts frames at scene changes instead of fixed intervals.

Settings:

  • VIDEO_MCP_USE_SCENE_DETECTION=true - Enable scene detection
  • VIDEO_MCP_SCENE_THRESHOLD=0.3 - Sensitivity (0.0-1.0)
    • 0.1-0.2: Very sensitive (subtle changes, camera movements)
    • 0.3: Default (balanced, most content)
    • 0.4-0.5: Conservative (only major scene changes)
  • VIDEO_MCP_SCENE_MAX_INTERVAL_SECONDS=30 - Maximum seconds between frames
    • Ensures static content is captured even if scene doesn't change
    • Prevents missing long presentations, talking heads, or static slides
    • Set to 0 to disable (pure scene detection only)

Benefits:

  • Captures key moments automatically
  • Avoids redundant frames in static scenes
  • Max interval ensures no content is missed during static periods
  • Better coverage for dynamic content (movies, vlogs, edited videos)

Example:If someone talks for 2 minutes without moving (no scene change), max interval of 30s ensures you still get frames at 0s, 30s, 60s, 90s, 120s.

When to use fixed intervals:

  • You need perfectly predictable, evenly-spaced coverage
  • Content is extremely static throughout

Whisper Model Selection

Choose based on your needs:

Model Size Speed Accuracy Best For
tiny 39M Fastest Good Quick scans, low-end hardware
base 74M Fast Better Recommended default
small 244M Medium Great High accuracy needs
medium 769M Slow Excellent Production quality
large 1550M Slowest Best Maximum accuracy

Token Limits and Video Length

Claude Code has a 25,000 token limit for MCP responses. Current settings are optimized to fit within this limit.

Image Token Calculation:

  • Formula (from Anthropic docs): tokens = (width ร— height) / 750
  • For 640x360 frames: (640 ร— 360) / 750 = ~307 tokens per frame

Token Budget Breakdown:

  • Frame images: ~307 tokens each
  • Transcript text: ~1 token per 4 characters
  • Metadata: ~200 tokens

Capacity at 25K Limit:

  • Safe limit: ~70 frames (accounting for transcript)
  • Maximum: ~80 frames (minimal transcript)

Recommended Video Lengths (10-second intervals):

  • 2-5 minutes: 12-30 frames (~4-10K tokens) โœ… Always safe
  • 5-10 minutes: 30-60 frames (~10-20K tokens) โœ… Usually safe
  • 10-12 minutes: 60-72 frames (~20-24K tokens) โš ๏ธ Approaching limit
  • 12+ minutes: 72+ frames โŒ Will likely truncate

If you hit truncation:

  1. Increase scene threshold: VIDEO_MCP_SCENE_THRESHOLD=0.4
  2. Reduce max frames: VIDEO_MCP_MAX_FRAMES_PER_VIDEO=60
  3. Increase frame interval: VIDEO_MCP_FRAME_INTERVAL_SECONDS=15
  4. Process video in segments

Usage

Basic Usage with Claude Code

Once configured, you can ask Claude to process videos:

Process this video: /.../meeting_recording.mp4

Claude will automatically:

  1. Extract frames every 10 seconds (configurable)
  2. Transcribe the audio
  3. Present a timeline with frames and transcript
  4. Cache results for future queries

Follow-up Queries

Ask questions about the video content:

What were the main points discussed in the meeting?
Show me the slides that were presented
What was said around the 5-minute mark?

These follow-up queries are instant - they use the cached processed video data.

Custom Processing Options

You can customize processing parameters:

Process this video with 5-second intervals: /path/to/video.mp4
Process using the small Whisper model: /path/to/video.mp4

List Cached Videos

List all processed videos

Shows all videos in cache with their settings and cache timestamps.

Tools

The MCP server provides two tools:

1. process_video

Process a video file into frames and transcription.

Parameters:

  • video_path (required): Absolute path to video file
  • frame_interval (optional): Interval between frames (fixed-interval mode only)
  • use_scene_detection (optional): Use scene detection (default: true)
  • scene_threshold (optional): Scene detection sensitivity 0.0-1.0 (default: 0.3)
  • whisper_model (optional): Whisper model to use
  • max_frames (optional): Maximum number of frames to extract (default: 100)

Example:

{
  "video_path": "/path/to/video.mp4",
  "frame_interval": 5,
  "whisper_model": "small",
  "max_frames": 100
}

2. list_processed_videos

List all videos currently in cache.

Returns: Summary of cached videos with paths, timestamps, and settings.

Resources

The server also provides MCP resources for cached videos:

URI Format: video:///absolute/path/to/video.mp4

Cached videos are automatically exposed as resources that Claude can access.

Cache Management

Cache Location

By default, cached data is stored in:

~/.cache/video-mcp/
โ”œโ”€โ”€ {cache_key}/
โ”‚   โ”œโ”€โ”€ metadata.json
โ”‚   โ”œโ”€โ”€ transcript.json
โ”‚   โ”œโ”€โ”€ timeline.json
โ”‚   โ””โ”€โ”€ frames/
โ”‚       โ”œโ”€โ”€ frame_0000.jpg
โ”‚       โ”œโ”€โ”€ frame_0001.jpg
โ”‚       โ””โ”€โ”€ ...
โ””โ”€โ”€ cache_index.json

Cache Key Format

Cache keys are generated from:

  • Video file hash (MD5 of first 1MB + file size)
  • Processing settings (scene detection/interval, Whisper model)

Fixed Interval Mode:Format: {hash}_{interval}s_{model}Example: abc123def456_10s_base

Scene Detection Mode:Format: {hash}_scene_{threshold}_{model}Example: abc123def456_scene_0.3_base

Manual Cache Management

Check cache size:

du -sh ~/.cache/video-mcp

Clear cache:

rm -rf ~/.cache/video-mcp

View cache index:

cat ~/.cache/video-mcp/cache_index.json | python3 -m json.tool

Automatic Cleanup

The cache automatically cleans up:

  • Entries older than 30 days (configurable)
  • When total size exceeds 10GB (configurable)

Supported Formats

  • MP4 (.mp4)
  • MOV (.mov)
  • AVI (.avi)
  • MKV (.mkv)
  • WebM (.webm)

Most common video codecs are supported via FFmpeg.

Performance

Processing Time

Approximate times for a 10-minute video (on Apple M1):

Task Time
Frame extraction (10s intervals) ~10s
Audio transcription (base model) ~60s
Timeline building ~1s
Total ~70s

Cache Benefits

  • First query: ~70s (processes video)
  • Follow-up queries: Instant (uses cache)

Troubleshooting

Corporate Environments / JFrog Artifactory

Error: Could not find a version that satisfies the requirement fastmcp or No matching distribution found for fastmcp

Solutions:

  1. Install with PyPI fallback:

    pip install -e . --extra-index-url https://pypi.org/simple
    

    This checks your JFrog repo first, then falls back to public PyPI for missing packages.

  2. Install FastMCP directly from PyPI first:

    # Install FastMCP from public PyPI
    pip install --index-url https://pypi.org/simple fastmcp
    
    # Then install video-mcp (uses JFrog for other dependencies)
    pip install -e .
    

FFmpeg Not Found

Error: FFmpeg not found. Please install it...

Solution:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Verify
ffmpeg -version

Video File Not Found

Error: Video file not found: /path/to/video.mp4

Solution:

  • Provide absolute path, not relative path
  • Check file exists: ls -la /path/to/video.mp4
  • Use tab completion to avoid typos

Unsupported Format

Error: Unsupported video format: .flv

Solution:

  • Convert to MP4: ffmpeg -i input.flv output.mp4
  • Supported formats: MP4, MOV, AVI, MKV, WebM

Whisper Model Download Issues

Error: Failed to load Whisper model...

Solution:

  • Models download automatically on first use
  • Requires internet connection
  • Check disk space (models: 39MB - 1.5GB)
  • Download location: ~/.cache/whisper/

Out of Memory

Error: Processing fails on large videos

Solution:

  • Reduce max_frames: VIDEO_MCP_MAX_FRAMES_PER_VIDEO=30
  • Increase frame_interval: VIDEO_MCP_FRAME_INTERVAL_SECONDS=15
  • Use smaller Whisper model: VIDEO_MCP_WHISPER_MODEL=tiny

Slow Processing

Issue: Transcription takes too long

Solution:

  • Use smaller Whisper model (tiny or base)
  • For GPU: Set VIDEO_MCP_WHISPER_DEVICE=cuda (requires CUDA setup)
  • Process shorter video segments

Development

Project Structure

vidmcp/
โ”œโ”€โ”€ src/vidmcp/
โ”‚   โ”œโ”€โ”€ __init__.py          # Package initialization
โ”‚   โ”œโ”€โ”€ server.py            # MCP server implementation
โ”‚   โ”œโ”€โ”€ config.py            # Configuration management
โ”‚   โ”œโ”€โ”€ models.py            # Data models
โ”‚   โ”œโ”€โ”€ video_processor.py   # FFmpeg integration
โ”‚   โ”œโ”€โ”€ audio_processor.py   # Whisper integration
โ”‚   โ”œโ”€โ”€ timeline_builder.py  # Timeline synchronization
โ”‚   โ””โ”€โ”€ cache.py             # Cache management
โ”œโ”€โ”€ tests/                   # Test suite
โ”œโ”€โ”€ pyproject.toml          # Project configuration
โ””โ”€โ”€ README.md               # This file

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=vidmcp

Code Formatting

# Format code
black src/vidmcp

# Lint code
ruff check src/vidmcp

Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚         Claude Code (Client)            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                โ”‚ MCP Protocol
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚         Video MCP Server                โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚  MCP Interface Layer            โ”‚   โ”‚
โ”‚  โ”‚  โ€ข process_video tool           โ”‚   โ”‚
โ”‚  โ”‚  โ€ข list_processed_videos tool   โ”‚   โ”‚
โ”‚  โ”‚  โ€ข video:// resources           โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚             โ”‚                           โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚  Processing Pipeline            โ”‚   โ”‚
โ”‚  โ”‚  โ€ข VideoProcessor (FFmpeg)      โ”‚   โ”‚
โ”‚  โ”‚  โ€ข AudioProcessor (Whisper)     โ”‚   โ”‚
โ”‚  โ”‚  โ€ข TimelineBuilder              โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚             โ”‚                           โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚  Cache Layer                    โ”‚   โ”‚
โ”‚  โ”‚  โ€ข Hash-based caching           โ”‚   โ”‚
โ”‚  โ”‚  โ€ข Persistent storage           โ”‚   โ”‚
โ”‚  โ”‚  โ€ข Auto cleanup                 โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Future Enhancements

Planned features:

  • GPU Support: CUDA acceleration for faster Whisper transcription
  • Video Chunking: Handle very long videos (>1 hour) efficiently
  • YouTube Support: Download and process YouTube videos directly
  • Speaker Diarization: Identify different speakers in transcript
  • Summary Mode: Generate video summaries with fewer frames
  • Custom Frame Selection: User-specified exact timestamps for frames
  • Faster Whisper: Use faster-whisper library for 4x speed improvement
  • Adaptive Scene Detection: Auto-tune threshold based on content

Acknowledgments

  • MCP Protocol: Anthropic's Model Context Protocol
  • FFmpeg: Video processing
  • Whisper: OpenAI's speech recognition
  • Claude: Anthropic's AI assistant

Made with โค๏ธ for the Claude Code community

MCP Server ยท Populars

MCP Server ยท New