MCP Audio Server
A Model Context Protocol (MCP) server that gives AI agents the ability to process audio files: transcribe speech to text, detect spoken languages, and extract audio metadata. Built with OpenAI Whisper and served over SSE (Server-Sent Events) transport, so any MCP-compatible client can connect to it over HTTP.
Features
| Tool | Description |
|---|---|
| `speech_to_text` | Transcribes spoken dialogue from an audio file into structured text using Whisper |
| `detect_audio_language` | Analyzes the first 30 seconds of audio to predict the primary spoken language with a confidence score |
| `get_audio_metadata` | Extracts technical specs (duration, bitrate, sample rate, channels, format, and file size) via ffprobe |
Highlights
- Thread-safe model caching: Whisper models are loaded once and reused across requests
- Strict input validation: all inputs are validated with Pydantic (file existence, extension support, model size)
- SSE transport: HTTP-based transport accessible by any MCP client over the network
- Multiple Whisper models: choose from `tiny`, `base`, `small`, `medium`, or `large` depending on the accuracy/speed tradeoff
- Wide format support: `.mp3`, `.wav`, `.flac`, `.m4a`, `.ogg`, `.mp4`, `.aac`
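The thread-safe caching mentioned above can be sketched as a small lock-guarded dictionary. This is a minimal illustration, not the code in `audio_processor.py`: the `ModelCache` class and the `fake_loader` stand-in are hypothetical names, and in the real server the loader would presumably be `whisper.load_model`.

```python
import threading
from typing import Any, Callable, Dict


class ModelCache:
    """Load each model size at most once, even under concurrent requests."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader  # assumption: whisper.load_model in the real server
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, model_size: str) -> Any:
        # Holding the lock during the load stops two threads from loading
        # the same model twice; the trade-off is that loads are serialized.
        with self._lock:
            if model_size not in self._models:
                self._models[model_size] = self._loader(model_size)
            return self._models[model_size]


# Stand-in loader so the sketch runs without Whisper installed.
load_calls = []


def fake_loader(model_size: str) -> dict:
    load_calls.append(model_size)
    return {"size": model_size}


cache = ModelCache(fake_loader)
threads = [threading.Thread(target=cache.get, args=("base",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(load_calls)  # ['base']
```

Eight concurrent requests for `base` trigger a single load; every later call returns the cached object.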
Project Structure
mcp-audio-server/
├── server.py              # MCP server entry point: registers tools, runs SSE transport
├── audio_processor.py     # Core processing logic: transcription, language detection, metadata
├── models.py              # Pydantic models: request validation & standardized response format
├── requirements.txt       # Python dependencies
├── speech-text-MCP.json   # Pre-built n8n workflow for AI agent integration
└── tests/
    └── test_models.py     # Unit tests for input validation and response serialization
Prerequisites
- Python 3.10+
- ffmpeg (required for audio metadata extraction and Whisper audio loading)
  - Windows: `winget install ffmpeg` or download from ffmpeg.org
  - macOS: `brew install ffmpeg`
  - Linux: `sudo apt install ffmpeg`
- GPU (optional): Whisper uses CUDA if available and otherwise falls back to CPU
Getting Started
1. Clone the repository
git clone https://github.com/<your-username>/mcp-audio-server.git
cd mcp-audio-server
2. Create a virtual environment and install dependencies
Using uv (recommended):
uv venv
uv pip install -r requirements.txt
Or with standard pip:
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate
pip install -r requirements.txt
3. Start the server
python server.py
The server starts on http://127.0.0.1:8000 with the following endpoints:
| Endpoint | Purpose |
|---|---|
| `http://127.0.0.1:8000/sse` | SSE connection endpoint for MCP clients |
| `http://127.0.0.1:8000/messages/` | JSON-RPC message endpoint |
Testing
MCP Inspector
The MCP Inspector is the easiest way to test the server interactively:
npx @modelcontextprotocol/inspector
- Open the Inspector UI in your browser
- Set Transport Type to `SSE`
- Set URL to `http://127.0.0.1:8000/sse`
- Click Connect
- Select any tool and provide an absolute path to an audio file
Unit Tests
pytest tests/ -v
Integration
n8n Workflow
A pre-built n8n workflow is included in speech-text-MCP.json. It sets up a complete AI agent pipeline:
Chat Trigger → AI Agent → Google Gemini LLM
                │    │
        MCP Client   Buffer Memory
       (this server)
To import:
- Start n8n (`npx n8n`)
- Go to Workflows → Import from File
- Select `speech-text-MCP.json`
- Configure your Google Gemini API credentials in the Google Gemini Chat Model node
- Ensure this MCP server is running on `http://127.0.0.1:8000`
- Activate the workflow and start chatting. The AI agent can now transcribe audio, detect languages, and extract metadata on demand
Claude Desktop
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "audio-server": {
      "url": "http://127.0.0.1:8000/sse"
    }
  }
}
Any MCP Client
Connect to the SSE endpoint at http://127.0.0.1:8000/sse using any MCP-compatible client. The server exposes three tools, which clients discover automatically through MCP's standard tool listing.
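As an illustration of a generic client, the sketch below uses the official `mcp` Python SDK (`pip install mcp`) to list the tools and call `speech_to_text`. It assumes the tool accepts flat `audio_path` and `model_size` arguments as listed in the API reference; `tool_arguments` and `transcribe` are hypothetical helper names, and the SDK is imported lazily so the helper can be tried without it installed.

```python
SERVER_URL = "http://127.0.0.1:8000/sse"


def tool_arguments(audio_path: str, model_size: str = "base") -> dict:
    """Build the argument payload for speech_to_text (assumed flat shape)."""
    return {"audio_path": audio_path, "model_size": model_size}


async def transcribe(audio_path: str) -> None:
    # Imported here so the rest of the sketch loads without the SDK.
    from mcp import ClientSession
    from mcp.client.sse import sse_client

    async with sse_client(SERVER_URL) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()

            # Tool discovery: the names should match the API reference.
            listing = await session.list_tools()
            print([tool.name for tool in listing.tools])

            result = await session.call_tool(
                "speech_to_text", tool_arguments(audio_path)
            )
            # Text results arrive as content items; print whatever comes back.
            for item in result.content:
                print(getattr(item, "text", item))
```

With the server running, call it with `asyncio.run(transcribe("/absolute/path/to/audio.mp3"))`, replacing the placeholder path with a real file.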
API Reference
speech_to_text
Transcribes audio to text using OpenAI Whisper.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_path` | `string` | required | Absolute path to the audio file |
| `model_size` | `string` | `"base"` | Whisper model variant: `tiny`, `base`, `small`, `medium`, `large` |
Response:
{
  "status": "success",
  "data": {
    "text": "The transcribed text content...",
    "language": "en"
  }
}
detect_audio_language
Identifies the spoken language from the first 30 seconds of audio.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_path` | `string` | required | Absolute path to the audio file |
Response:
{
  "status": "success",
  "data": {
    "detected_language": "en",
    "confidence_score": 0.9847
  }
}
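One plausible way to produce this result with the `openai-whisper` package is shown below. It is a sketch rather than the project's actual implementation: `pick_language` and `detect_language` are hypothetical names, and the `base` model is an assumption, since the tool takes no model parameter.

```python
from typing import Dict


def pick_language(probs: Dict[str, float]) -> dict:
    """Reduce Whisper's per-language probabilities to the response payload."""
    language = max(probs, key=probs.get)
    return {
        "detected_language": language,
        "confidence_score": round(probs[language], 4),
    }


def detect_language(audio_path: str, model_size: str = "base") -> dict:
    import whisper  # imported lazily so pick_language can be tried without it

    model = whisper.load_model(model_size)
    # pad_or_trim cuts (or pads) the waveform to Whisper's 30-second window.
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)
    _, probs = model.detect_language(mel.to(model.device))
    return pick_language(probs)


# Illustrative probabilities, not real model output.
print(pick_language({"en": 0.98468, "de": 0.0102, "fr": 0.0051}))
# {'detected_language': 'en', 'confidence_score': 0.9847}
```

Trimming to 30 seconds is why the tool only looks at the start of the file: language detection runs on a single fixed-size spectrogram.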
get_audio_metadata
Extracts technical metadata using ffprobe.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `audio_path` | `string` | required | Absolute path to the audio file |
Response:
{
  "status": "success",
  "data": {
    "format_name": "mp3",
    "duration_seconds": 245.67,
    "size_bytes": 3932160,
    "bit_rate": "128000",
    "sample_rate": "44100",
    "channels": 2
  }
}
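The fields above map closely onto ffprobe's JSON output, which reports most numbers as strings. The sketch below shows one way to run ffprobe and reshape its output; `run_ffprobe` and `summarize` are hypothetical names, and the exact flags and rounding used by the project are not confirmed by this README.

```python
import json
import subprocess


def run_ffprobe(audio_path: str) -> dict:
    """Ask ffprobe for container and stream information as JSON."""
    cmd = [
        "ffprobe", "-v", "quiet", "-print_format", "json",
        "-show_format", "-show_streams", audio_path,
    ]
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


def summarize(probe: dict) -> dict:
    """Pick the fields shown in the response out of ffprobe's JSON."""
    fmt = probe.get("format", {})
    # Use the first audio stream; files such as .mp4 may also carry video.
    audio = next(
        (s for s in probe.get("streams", []) if s.get("codec_type") == "audio"),
        {},
    )
    return {
        "format_name": fmt.get("format_name"),
        "duration_seconds": round(float(fmt.get("duration", 0.0)), 2),
        "size_bytes": int(fmt.get("size", 0)),
        "bit_rate": fmt.get("bit_rate"),         # left as ffprobe's string
        "sample_rate": audio.get("sample_rate"), # left as ffprobe's string
        "channels": audio.get("channels"),
    }


# Hand-written sample in ffprobe's output shape, so no binary is needed here.
sample = {
    "format": {"format_name": "mp3", "duration": "245.670000",
               "size": "3932160", "bit_rate": "128000"},
    "streams": [{"codec_type": "audio", "sample_rate": "44100", "channels": 2}],
}
print(summarize(sample))
```

For a real file, `summarize(run_ffprobe("/absolute/path/to/audio.mp3"))` yields the same shape, provided ffprobe is on the `PATH`.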
Error Response
All tools return a standardized error format on failure:
{
  "status": "error",
  "message": "Validation failed: The path '/bad/path.mp3' does not exist on this machine."
}
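To make the error envelope concrete, here is a standard-library sketch of the three checks listed under Highlights (file existence, extension support, model size). The project itself performs these checks with Pydantic models; `validate_request` and `error_response` are hypothetical names, and only the "does not exist" message is taken from the example above.

```python
from pathlib import Path
from typing import Optional

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".mp4", ".aac"}
MODEL_SIZES = {"tiny", "base", "small", "medium", "large"}


def error_response(message: str) -> dict:
    """Wrap a message in the standardized error envelope."""
    return {"status": "error", "message": f"Validation failed: {message}"}


def validate_request(audio_path: str, model_size: str = "base") -> Optional[dict]:
    """Return an error envelope, or None when the request looks valid."""
    path = Path(audio_path)
    if not path.is_file():
        return error_response(
            f"The path '{audio_path}' does not exist on this machine."
        )
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        return error_response(f"Unsupported file extension '{path.suffix}'.")
    if model_size not in MODEL_SIZES:
        return error_response(f"Unknown model size '{model_size}'.")
    return None


print(validate_request("/bad/path.mp3"))
```

Returning the envelope instead of raising keeps every tool's failure path in the same `status`/`message` shape that clients already parse.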
Architecture
┌──────────────────────────────────────────────────────────┐
│                        MCP Client                        │
│              (Claude, n8n, Inspector, etc.)              │
└────────────────────────────┬─────────────────────────────┘
                             │ SSE (HTTP)
                             ▼
┌──────────────────────────────────────────────────────────┐
│                server.py: FastMCP Server                 │
│ ┌────────────────┬─────────────────┬───────────────────┐ │
│ │ speech_to_text │ detect_language │ get_metadata      │ │
│ └───────┬────────┴────────┬────────┴─────────┬─────────┘ │
│         │                 │                  │           │
│         ▼                 ▼                  ▼           │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ models.py: Pydantic Validation Layer                 │ │
│ │ (AudioPathMixin, TranscriptionRequest, etc.)         │ │
│ └──────────────────────────┬───────────────────────────┘ │
│                            ▼                             │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ audio_processor.py: Processing Engine                │ │
│ │ ┌──────────────┐  ┌───────────┐  ┌────────────┐      │ │
│ │ │ Whisper      │  │ Whisper   │  │ ffprobe    │      │ │
│ │ │ transcribe() │  │ detect()  │  │ metadata   │      │ │
│ │ └──────────────┘  └───────────┘  └────────────┘      │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
License
This project is open source. See LICENSE for details.