arXiv Deep Research
A Model Context Protocol (MCP) server for searching, downloading, and reading arXiv papers — designed as a specialist agent for integration into multi-agent systems like Microsoft Magentic-UI and AutoGen.
The idea: rather than treating arXiv search as a simple lookup tool, this server is structured as a first-class research agent that you can plug directly into a Magentic-One-style team as an McpAgent, giving the Orchestrator access to the arXiv literature as a delegatable resource.
Integration with Magentic-UI
Magentic-UI supports custom McpAgent instances via mcp_agent_configs in its config file. This server plugs in directly:
# examples/magentic_ui_config.yaml
client:
  mcp_agent_configs:
    - agent_name: ArxivResearcher
      description: >
        Specialist agent for searching and reading arXiv papers.
        Use when the task requires finding academic papers, understanding
        research literature, or retrieving technical details from published work.
      server_params:
        type: StdioServerParams
        command: python
        args: ["-m", "arxiv_mcp_server"]
        env:
          PYTHONPATH: /path/to/arxiv-deep-research/src
Once registered, the Magentic-UI Orchestrator can delegate research subtasks to this agent through the standard Task Ledger / Progress Ledger pattern — exactly how WebSurfer handles web browsing, but for academic literature.
Integration with AutoGen AgentChat
See examples/autogen_research_team.py for a complete 3-agent team:
Orchestrator (MagenticOneGroupChat)
├── ArxivSurfer ← this MCP server, wrapped via StdioServerParams + mcp_server_tools
└── Coder ← synthesizes findings into structured markdown reports
pip install "autogen-agentchat" "autogen-ext[openai]" "mcp>=1.2.0"
export OPENAI_API_KEY=...
python examples/autogen_research_team.py
Tools
| Tool | Description |
|---|---|
| search_papers | Query arXiv with advanced filters: date range, category, sort by relevance or date |
| download_paper | Fetch a paper PDF and convert to clean markdown for LLM consumption |
| read_paper | Access previously downloaded paper content |
| list_papers | View all papers in local storage |
search_papers
Supports rich query syntax — quoted phrases, boolean operators, field-specific search (ti:, au:, abs:), and category filtering:
{
"query": "\"multi-agent\" AND \"orchestration\" ANDNOT survey",
"max_results": 10,
"date_from": "2024-01-01",
"categories": ["cs.AI", "cs.MA"],
"sort_by": "relevance"
}
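As a rough illustration of how these arguments could map onto the arXiv API's search_query syntax, the sketch below combines the free-text query with category and date filters. The helper name build_search_query, the date_to parameter, and the exact mapping are assumptions made for this example, not the server's actual implementation; the cat: prefix and the submittedDate:[... TO ...] range filter are standard arXiv API query syntax.

```python
from datetime import date


def build_search_query(query, categories=None, date_from=None, date_to=None):
    """Combine a free-text query with category and date filters (hypothetical helper)."""
    parts = [f"({query})"]
    if categories:
        # A paper matches if it belongs to any of the listed categories.
        cats = " OR ".join(f"cat:{c}" for c in categories)
        parts.append(f"({cats})")
    if date_from or date_to:
        # arXiv expects YYYYMMDDHHMM bounds; open ends get wide defaults.
        start = (date_from or "1991-01-01").replace("-", "") + "0000"
        end = (date_to or date.today().isoformat()).replace("-", "") + "2359"
        parts.append(f"submittedDate:[{start} TO {end}]")
    return " AND ".join(parts)


example = build_search_query(
    '"multi-agent" AND "orchestration" ANDNOT survey',
    categories=["cs.AI", "cs.MA"],
    date_from="2024-01-01",
    date_to="2024-12-31",  # assumed parameter; the request above only shows date_from
)
print(example)
```

Keeping the user query in its own parenthesized group means the appended filters cannot change the precedence of the boolean operators the user wrote.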
Multi‑stage research pipeline
At a high level, arxiv-deep-research runs a five-stage loop:
- Plan the research task
  - A coordinator agent (for example the AutoGen MagenticOneGroupChatOrchestrator) takes the user goal and breaks it into sub-tasks.
- Discover candidate papers
  - The coordinator calls the MCP search_papers tool to find relevant arXiv papers by topic, category, and date.
- Download and normalize content
  - For selected IDs, it calls download_paper, which fetches the PDF and converts it into clean markdown for LLMs to read.
- Deep paper analysis
  - The coordinator (or another agent) uses the deep-paper-analysis prompt to ask for a structured analysis of a given paper ID, optionally across multiple calls as you explore related work.
- Synthesis and reporting
  - A downstream agent such as Coder (in the AutoGen example) turns these analyses into a final research report: summaries, comparison tables, open problems, and next-step suggestions.
You can run this pipeline manually by calling the tools and prompts from any MCP‑aware client, or automatically using the sample AutoGen team.
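For the manual route, the discover, download, and read stages can be sketched as a plain loop over tool calls. This is only an illustration: call_tool stands in for whatever MCP client call you use, and the result shape of search_papers (a list of dicts with an "id" key) and the "paper_id" argument name for download_paper and read_paper are assumptions, not documented behavior of this server.

```python
from typing import Any, Callable

# call_tool(name, arguments) is a placeholder for your MCP client's tool call.
ToolCaller = Callable[[str, dict[str, Any]], Any]


def run_pipeline(call_tool: ToolCaller, topic: str, max_papers: int = 3) -> list[dict]:
    """Discover, download, and read papers for one sub-task (illustrative only)."""
    # Discover candidates. The result shape (list of dicts with "id") is assumed.
    hits = call_tool("search_papers", {"query": topic, "max_results": max_papers})
    collected = []
    for hit in hits[:max_papers]:
        paper_id = hit["id"]
        # Download and convert to markdown. The "paper_id" argument name is assumed.
        call_tool("download_paper", {"paper_id": paper_id})
        # Read the stored markdown back as input for the analysis stage.
        content = call_tool("read_paper", {"paper_id": paper_id})
        collected.append({"paper_id": paper_id, "content": content})
    return collected


def fake_call_tool(name, arguments):
    """A fake tool backend so the sketch runs without a server."""
    if name == "search_papers":
        return [{"id": "2308.08155"}, {"id": "2411.04468"}]
    if name == "download_paper":
        return {"status": "ok"}
    return f"# markdown for {arguments['paper_id']}"


papers = run_pipeline(fake_call_tool, "multi-agent orchestration")
```

Passing the tool caller in as a parameter keeps the loop independent of any particular client library, so the same function can be driven by a real MCP session or by a fake in tests.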
Evaluation Benchmark
The repo includes a retrieval quality benchmark (eval/benchmark.py) measuring:
- Precision@K — fraction of top-K results that are relevant
- Recall@K — fraction of known relevant papers found in top-K
- MRR — Mean Reciprocal Rank of first relevant result
Ground-truth queries are seeded from landmark papers (AutoGen 2308.08155, Magentic-One 2411.04468, RAG 2005.11401, CoT 2201.11903) and can be extended automatically using the synthetic data pipeline below.
python eval/benchmark.py --k 10 --output results.json
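For reference, the three metrics can be written in a few lines. This is a minimal sketch of the standard definitions, not a copy of eval/benchmark.py; the function names and the sample ranking are chosen for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for paper_id in ranked[:k] if paper_id in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the known relevant papers that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, paper_id in enumerate(ranked, start=1):
        if paper_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Illustrative ranking for one query; the relevance labels are made up.
ranked = ["2005.11401", "2308.08155", "2201.11903", "2411.04468"]
relevant = {"2308.08155", "2411.04468"}
```

With this sample, the first relevant paper sits at rank 2, so the reciprocal rank is 0.5, and both relevant papers are recovered once K reaches 4.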
Synthetic Eval Data Generation (AgentInstruct-style)
scripts/generate_eval_tasks.py implements a 4-stage pipeline that generates diverse benchmark queries from arXiv abstracts — mirroring the AgentInstruct approach:
Stage 1: Seed collection → fetch paper abstracts from arXiv by category
Stage 2: Content transform → extract key concepts and problem statements
Stage 3: Instruction gen → generate realistic research queries via GPT-4o-mini
Stage 4: Instruction refine → create harder variants at subtopic intersections
export OPENAI_API_KEY=...
python scripts/generate_eval_tasks.py --seed-category cs.AI --num-seeds 20 --output eval/generated_queries.json
Output includes easy/medium/hard difficulty tiers for stratified evaluation.
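Stratified evaluation here means reporting a metric per difficulty tier rather than a single average. The sketch below shows one way to do that; the record fields "difficulty" and "score" and the sample values are hypothetical, since the generator's output schema is not described here.

```python
from collections import defaultdict


def mean_by_tier(results):
    """Average a per-query score within each difficulty tier."""
    buckets = defaultdict(list)
    for row in results:
        buckets[row["difficulty"]].append(row["score"])
    return {tier: sum(scores) / len(scores) for tier, scores in buckets.items()}


# Made-up per-query scores, for illustration only.
results = [
    {"difficulty": "easy", "score": 1.0},
    {"difficulty": "easy", "score": 0.5},
    {"difficulty": "hard", "score": 0.25},
]
tier_means = mean_by_tier(results)
```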
Observability: OpenTelemetry Tracing
Every tool call is instrumented with OpenTelemetry spans (mirrors AutoGen v0.4's built-in OTel support):
# Console output (no infrastructure needed)
export ARXIV_MCP_TRACE_CONSOLE=true
python -m arxiv_mcp_server
# OTLP export to Jaeger / Azure Monitor
docker run -d --name jaeger -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_SERVICE_NAME=arxiv-mcp-server
python -m arxiv_mcp_server
# View traces: http://localhost:16686
Spans recorded: mcp.tool.search_papers, mcp.tool.download_paper, mcp.tool.read_paper — each with query, categories, result count, latency, and error status as attributes.
Tracing is a zero-cost no-op when opentelemetry-sdk is not installed.
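One common way to get this no-op behavior is to decide at decoration time whether to wrap the handler at all. The sketch below is an assumption about how such a decorator could look, not the contents of tracing.py: it checks whether the OpenTelemetry API package can be imported, uses a synchronous handler for brevity (real MCP tool handlers are usually async and would need an async wrapper), and the latency_ms attribute name is invented for the example.

```python
import functools
import time

try:
    from opentelemetry import trace

    _tracer = trace.get_tracer("arxiv-mcp-server")
except ImportError:  # tracing extras not installed
    _tracer = None


def trace_tool(name):
    """Wrap a tool handler in a span, or return it untouched when OTel is absent."""

    def decorator(func):
        if _tracer is None:
            return func  # no wrapper at all, so no per-call overhead

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # start_as_current_span records exceptions and sets error status by default.
            with _tracer.start_as_current_span(f"mcp.tool.{name}") as span:
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    elapsed_ms = (time.perf_counter() - start) * 1000
                    span.set_attribute("latency_ms", elapsed_ms)

        return wrapper

    return decorator


@trace_tool("search_papers")
def search_papers(query):
    """Stand-in handler used only to exercise the decorator."""
    return [f"result for {query}"]
```

Because the import check runs once at module load, an untraced install calls the original function directly, with no wrapper frame in between.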
Installation
Requires Python 3.11+
git clone https://github.com/freyzo/arxiv-deep-research
cd arxiv-deep-research
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
# Optional: OTel tracing
pip install -e ".[tracing]"
Claude Desktop
{
"mcpServers": {
"arxiv": {
"command": "/path/to/.venv/bin/python",
"args": ["-m", "arxiv_mcp_server", "--storage-path", "/path/to/papers"]
}
}
}
Cursor
{
"mcpServers": {
"arxiv": {
"command": "python",
"args": ["-m", "arxiv_mcp_server"],
"env": { "PYTHONPATH": "/path/to/arxiv-deep-research/src" }
}
}
}
Prompts
deep-paper-analysis
Comprehensive analysis workflow covering executive summary, methodology, results, implications, and future directions:
{ "paper_id": "2401.12345" }
Running and resuming research sessions
There are two main ways to run research sessions today.
1. AutoGen multi‑agent team (recommended demo)
This uses OpenAI models to coordinate a full research workflow.
cd arxiv-deep-research
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
pip install "autogen-agentchat" "autogen-ext[openai]" "mcp>=1.2.0"
export OPENAI_API_KEY=your_openai_key
python examples/autogen_research_team.py
This starts an interactive console UI where:
- the Orchestrator plans the work,
- ArxivSurfer searches and downloads papers via MCP, and
- Coder writes the final markdown report.
To resume a session, you can:
- run the script again and paste the previous summary as part of a new task, or
- keep the same console session open and give the team a follow‑up instruction (for example, “Now focus on safety trade‑offs”).
2. Direct MCP usage from tools like Claude Desktop or Cursor
You can also talk to the MCP server directly and build your own loop:
cd arxiv-deep-research
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
export ARXIV_MCP_TRACE_CONSOLE=true # optional
python -m arxiv_mcp_server
While this server runs, any MCP‑aware client can:
- call search_papers and download_paper,
- use read_paper to pull content into the chat, and
- call the deep-paper-analysis prompt multiple times.
The prompt handler keeps a simple global research context, so repeated calls in the same process will mention previously analyzed paper IDs and encourage the model to connect them. In practice, “resuming” a research session means:
- keeping the same MCP server process alive, and
- issuing new deep-paper-analysis calls for new paper IDs from the same client or workspace.
Repository Structure
arxiv-deep-research/
├── src/arxiv_mcp_server/
│ ├── server.py # MCP server + OTel init
│ ├── tracing.py # @trace_tool decorator, OTLP + console exporters
│ ├── config.py
│ ├── tools/ # search, download, read, list
│ └── prompts/ # deep research analysis prompt
├── examples/
│ ├── autogen_research_team.py # Magentic-One-style 3-agent team
│ └── magentic_ui_config.yaml # McpAgent config for Magentic-UI
├── eval/
│ └── benchmark.py # Precision@K / Recall@K / MRR harness
├── scripts/
│ └── generate_eval_tasks.py # AgentInstruct-style query generator
└── pyproject.toml
Environment Variables
| Variable | Default | Description |
|---|---|---|
| ARXIV_STORAGE_PATH | ~/.arxiv-mcp-server/papers | Paper storage location |
| ARXIV_MCP_TRACE_CONSOLE | false | Enable console trace output |
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTLP endpoint (e.g. http://localhost:4317) |
| OTEL_SERVICE_NAME | arxiv-mcp-server | Service name in traces |
If you use the optional eval data generator, you also need:
| Variable | Description |
|---|---|
| OPENAI_API_KEY | Used by scripts/generate_eval_tasks.py to talk to gpt-4o-mini |
Known issues
- Model support is OpenAI-only today.
  - The AutoGen research team and the synthetic eval generator both call OpenAI models (gpt-4o / gpt-4o-mini) via the OpenAI Python SDK.
  - There is no first-class google-genai / Gemini or Gemma integration yet, even though the design would support it.
- No MCP Resources yet.
  - Papers are exposed only via tools (read_paper) rather than as MCP Resources with stable arxiv:// URIs. MCP clients that prefer Resources cannot list papers yet.
- Limited testing.
  - The core retrieval and eval logic has very light automated testing; metric functions and tool handlers should gain unit tests over time.
Roadmap
Planned improvements (subject to change):
- Gemini / Gemma support via google-genai
  - Add an optional google-genai dependency and a small runner that can call Gemini/Gemma models using GEMINI_API_KEY.
  - Expose this as an alternative backend for the research team demo and the eval generator.
- MCP Resources for downloaded papers
  - Implement list_resources / read_resource so downloaded PDFs appear as arxiv://paper_id resources in MCP clients.
- Stronger testing and evals
  - Add unit tests for metrics, search helpers, and prompt handlers.
  - Automate running eval/benchmark.py and track regression over time.
- Richer research sessions
  - Replace the simple global research context with explicit session IDs and persisted state, so “resume session X” becomes a first-class feature across restarts.