⚡ Archive Agent
Archive Agent is an open-source semantic file tracker with OCR + AI search (RAG) and MCP capability.
Smart Indexer with RAG Engine
MCP server for automation through IDE or AI extension
Fast and effective semantic chunking (smart chunking)
Qdrant vector DB (running locally) for storage and search
Automatic OCR and AI cache save costs and headaches
100% dev-friendly: Clean docs and code ✨
YouTube explainer: How RAG Helped Me Find Any PDF in Seconds
Just getting started? 👉 Install Archive Agent on Linux
Want to know the nitty-gritty details? 👉 How Archive Agent works
Looking for the CLI command reference? 👉 Run Archive Agent
Looking for the MCP tool reference? 👉 MCP Tools
Want to upgrade for the latest features? 👉 Update Archive Agent
🍀 Collaborators welcome: You are invited to contribute to this open-source project! Feel free to file issues and submit pull requests anytime.
📷 Screenshot of command-line interface (CLI):
📷 Screenshot of graphical user interface (GUI):
Structure
- ⚡ Archive Agent
- Structure
- Supported OS
- Install Archive Agent
- Ubuntu / Linux Mint
- AI provider setup
- OpenAI provider setup
- Ollama provider setup
- LM Studio provider setup
- How Archive Agent works
- Which files are processed
- OCR strategies
- How files are processed
- How smart chunking works
- How chunks are retrieved
- How files are selected for tracking
- Run Archive Agent
- Show list of commands
- Create or switch profile
- Open current profile config in nano
- Add included patterns
- Add excluded patterns
- Remove included / excluded patterns
- List included / excluded patterns
- Resolve patterns and track files
- List tracked files
- List changed files
- Commit changed files to database
- Combined track and commit
- Search your files
- Query your files
- Launch Archive Agent GUI
- Start MCP Server
- MCP Tools
- Update Archive Agent
- Archive Agent settings
- Profile configuration
- Watchlist
- AI cache
- Qdrant database
- Developer's guide
- Important modules
- Testing and code analysis
- Known bugs
- Licensed under GNU GPL v3.0
Supported OS
Archive Agent has been tested with these configurations:
- Ubuntu 24.04 (PC x64)
If you've successfully installed and tested Archive Agent with a different setup, please let me know and I'll add it here!
Install Archive Agent
Please install these requirements before proceeding:
Ubuntu / Linux Mint
This installation method should work on any Linux distribution derived from Ubuntu (e.g. Linux Mint).
To install Archive Agent in the current directory of your choice, run this once:
git clone https://github.com/shredEngineer/Archive-Agent
cd Archive-Agent
chmod +x install.sh
./install.sh
The `install.sh` script will execute the following steps in order:
- Download and install `uv` (used for Python environment management)
- Install the custom Python environment
- Install the `spaCy` tokenizer model (used for chunking)
- Install `pandoc` (used for document parsing)
- Download and install the Qdrant docker image with persistent storage and auto-restart
- Install a global `archive-agent` command for the current user
🚀 Archive Agent is now installed!
👉 Please complete the AI provider setup next. (Afterward, you'll be ready to Run Archive Agent!)
AI provider setup
Archive Agent lets you choose between different AI providers:
Remote APIs (higher performance and costs, less privacy):
- OpenAI: Requires an OpenAI API key.
Local APIs (lower performance and costs, best privacy):
- Ollama: Requires Ollama running locally.
- LM Studio: Requires LM Studio running locally.
💡 Good to know: You will be prompted to choose an AI provider at startup; see: Run Archive Agent.
📌 Note: You can customize the specific models used by the AI provider in the Archive Agent settings. However, you cannot change the AI provider of an existing profile, as the embeddings will be incompatible; to choose a different AI provider, create a new profile instead.
OpenAI provider setup
If the OpenAI provider is selected, Archive Agent requires the OpenAI API key.
To export your OpenAI API key, replace `sk-...` with your actual key and run this once:
echo "export OPENAI_API_KEY='sk-...'" >> ~/.bashrc && source ~/.bashrc
This will persist the export for the current user.
💡 Good to know: OpenAI won't use your data for training.
Ollama provider setup
If the Ollama provider is selected, Archive Agent requires Ollama running at `http://localhost:11434`.
With the default Archive Agent Settings, these Ollama models are expected to be installed:
ollama pull llama3.1:8b # for chunk/query
ollama pull llava:7b-v1.6 # for vision
ollama pull nomic-embed-text:v1.5 # for embed
💡 Good to know: Ollama also works without a GPU. At least 32 GiB RAM is recommended for smooth performance.
LM Studio provider setup
If the LM Studio provider is selected, Archive Agent requires LM Studio running at `http://localhost:1234`.
With the default Archive Agent Settings, these LM Studio models are expected to be installed:
meta-llama-3.1-8b-instruct # for chunk/query
llava-v1.5-7b # for vision
text-embedding-nomic-embed-text-v1.5 # for embed
💡 Good to know: LM Studio also works without a GPU. At least 32 GiB RAM is recommended for smooth performance.
How Archive Agent works
Which files are processed
Archive Agent currently supports these file types:
- Text:
  - Plaintext: `.txt`, `.md`
  - Documents:
    - ASCII documents: `.html`, `.htm`
    - Binary documents: `.odt`, `.docx` (including images)
  - PDF documents: `.pdf` (including images, see note below)
- Images: `.jpg`, `.jpeg`, `.png`, `.gif`, `.webp`, `.bmp`
OCR strategies
For PDF documents, there are different OCR strategies supported by Archive Agent:
- `auto` OCR strategy:
  - Selects best OCR strategy for each page based on the number of characters extracted from the PDF OCR text layer, if any.
  - Decides based on `ocr_auto_threshold` (see Archive Agent settings), the minimum number of characters for `auto` OCR strategy to resolve to `relaxed` instead of `strict`.
  - Optimal trade-off between cost, speed, and accuracy.
- `strict` OCR strategy:
  - PDF OCR text layer is ignored.
  - PDF pages are treated as images.
  - Expensive and slow, but more accurate.
- `relaxed` OCR strategy:
  - PDF OCR text layer is extracted.
  - PDF foreground images are decoded, but background images are ignored.
  - Cheap and fast, but less accurate.
💡 Good to know: You will be prompted to choose an OCR strategy at startup (see Run Archive Agent).
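The per-page decision of the `auto` OCR strategy can be sketched roughly as follows. This is a minimal illustration, not the actual Archive Agent code: the function name `resolve_auto_strategy`, the page character counts, and the threshold value `32` are assumptions made up for the example.

```python
def resolve_auto_strategy(text_layer_chars: int, ocr_auto_threshold: int) -> str:
    """Return the strategy that `auto` resolves to for a single PDF page."""
    # Enough characters in the text layer: trust it and take the cheap path.
    if text_layer_chars >= ocr_auto_threshold:
        return "relaxed"
    # Too little (or no) text layer: treat the page as an image.
    return "strict"


# Hypothetical character counts extracted from the text layer of four pages.
pages = [0, 12, 480, 2300]
strategies = [resolve_auto_strategy(n, ocr_auto_threshold=32) for n in pages]
print(strategies)  # ['strict', 'strict', 'relaxed', 'relaxed']
```

Pages with little or no text layer fall back to `strict`, so only those pages pay the higher cost of being processed as images.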
How files are processed
Ultimately, Archive Agent decodes everything to text like this:
- Plaintext files are decoded to UTF-8.
- Documents are converted to plaintext, images are extracted.
- PDF documents are decoded according to the OCR strategy.
- Images are decoded to text using AI vision.
- The vision model will reject unintelligible images.
Archive Agent uses Pandoc for documents, PyMuPDF4LLM for PDFs, and Pillow for images.
📌 Note: Unsupported files are tracked but not processed.
How smart chunking works
Archive Agent processes decoded text like this:
- Decoded text is sanitized and split into sentences.
- Sentences are grouped into reasonably-sized blocks.
- Each block is split into smaller chunks using an AI model.
- Block boundaries are handled gracefully (last chunk carries over).
- Each chunk is turned into a vector using AI embeddings.
- Each vector is turned into a point with file metadata.
- Each point is stored in the Qdrant database.
💡 Good to know: This smart chunking improves the accuracy and effectiveness of the retrieval.
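The block handling described above can be sketched as follows. This is only an illustration under stated assumptions: `split_sentences` is a naive stand-in for the spaCy tokenizer, `pair_chunker` is a stand-in for the AI model that decides chunk boundaries, and all function names are hypothetical. Embedding and storage in Qdrant are left out.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (stand-in for a real tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_blocks(
    sentences: List[str],
    block_size: int,
    chunker: Callable[[List[str]], List[List[str]]],
) -> List[str]:
    """Chunk block by block; the last chunk of a block carries over."""
    chunks: List[str] = []
    carry: List[str] = []
    for start in range(0, len(sentences), block_size):
        block = carry + sentences[start:start + block_size]
        pieces = chunker(block)
        is_last_block = start + block_size >= len(sentences)
        if not is_last_block and len(pieces) > 1:
            # Hold back the final chunk so it can merge with the next block.
            carry = pieces.pop()
        else:
            carry = []
        chunks.extend(" ".join(piece) for piece in pieces)
    return chunks


def pair_chunker(block: List[str]) -> List[List[str]]:
    """Stand-in for the AI chunker: groups sentences in pairs."""
    return [block[i:i + 2] for i in range(0, len(block), 2)]


text = "A one. B two. C three. D four. E five."
result = chunk_blocks(split_sentences(text), block_size=3, chunker=pair_chunker)
print(result)  # ['A one. B two.', 'C three. D four.', 'E five.']
```

Without the carry-over, the sentence `C three.` would end up as a chunk of its own simply because a block ended there.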
How chunks are retrieved
Archive Agent retrieves chunks related to your question like this:
- The question is turned into a vector using AI embeddings.
- Points with similar vectors are retrieved from the Qdrant database.
- Chunks of points with sufficient score are returned.
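The retrieval steps above can be illustrated with a small in-memory sketch. In Archive Agent, the similarity search is performed by the Qdrant database; the cosine similarity, the toy two-dimensional vectors, and the function names below are assumptions for illustration only. The parameters `score_min` and `chunks_max` mirror the `qdrant_score_min` and `qdrant_chunks_max` settings.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    points: List[Tuple[Sequence[float], str]],
    score_min: float,
    chunks_max: int,
) -> List[Tuple[float, str]]:
    """Return up to chunks_max chunks scoring at least score_min, best first."""
    scored = [(cosine(query_vec, vec), chunk) for vec, chunk in points]
    kept = [item for item in scored if item[0] >= score_min]
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:chunks_max]


# Toy points: (embedding vector, chunk text). Real embeddings are much larger.
points = [
    ([0.0, 1.0], "tax form"),
    ([0.6, 0.8], "donut recipe"),
    ([1.0, 0.0], "invoice from March"),
]
hits = retrieve([1.0, 0.0], points, score_min=0.5, chunks_max=2)
print([chunk for _, chunk in hits])  # ['invoice from March', 'donut recipe']
```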
Archive Agent answers your question using retrieved chunks like this:
- The LLM receives the retrieved chunks as context to the question.
- The LLM's answer is returned and formatted.
The LLM's answer is structured to be multi-faceted, making Archive Agent a helpful assistant.
How files are selected for tracking
Archive Agent uses patterns to select your files:
- Patterns can be actual file paths.
- Patterns can be paths containing wildcards that resolve to actual file paths.
- Patterns must be specified as (or resolve to) absolute paths, e.g. `/home/user/Documents/*.txt` (or `~/Documents/*.txt`).
- Patterns may use the wildcard `**` to match any files and zero or more directories, subdirectories, and symbolic links to directories.
There are included patterns and excluded patterns:
- The set of resolved excluded files is removed from the set of resolved included files.
- Only the remaining set of files (included but not excluded) is tracked by Archive Agent.
- Hidden files are always ignored!
This approach gives you the best control over the specific files or file types to track.
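The selection logic can be sketched with Python's standard `glob` module. This is a minimal approximation, not the actual implementation: the function names `resolve` and `tracked_files` are hypothetical, and the demo file names are made up.

```python
import glob
import os
import tempfile
from typing import Iterable, Set


def resolve(patterns: Iterable[str]) -> Set[str]:
    """Resolve patterns to absolute file paths, skipping hidden files."""
    files: Set[str] = set()
    for pattern in patterns:
        expanded = os.path.expanduser(pattern)  # supports ~/Documents/*.txt
        for path in glob.glob(expanded, recursive=True):  # recursive enables **
            if os.path.isfile(path) and not os.path.basename(path).startswith("."):
                files.add(os.path.abspath(path))
    return files


def tracked_files(included: Iterable[str], excluded: Iterable[str]) -> Set[str]:
    """Included files minus excluded files."""
    return resolve(included) - resolve(excluded)


# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.txt", "b.txt", "notes.md", ".hidden.txt"):
        open(os.path.join(root, name), "w").close()
    tracked = tracked_files(
        included=[os.path.join(root, "*")],
        excluded=[os.path.join(root, "b.txt")],
    )
    names = sorted(os.path.basename(path) for path in tracked)

print(names)  # ['a.txt', 'notes.md']
```

Resolving both pattern sets to concrete files first, and only then subtracting, means an excluded pattern never has to be textually related to an included one.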
Run Archive Agent
💡 Good to know: At startup, you will be prompted to choose the following:
- Profile name
- AI provider (see AI provider setup)
- OCR strategy (see OCR strategies)
Show list of commands
To show the list of supported commands, run this:
archive-agent
Create or switch profile
To switch to a new or existing profile, run this:
archive-agent switch "My Other Profile"
📌 Note: Always use quotes for the profile name argument, or skip it to get an interactive prompt.
💡 Good to know: Profiles are useful to manage independent Qdrant collections (see Qdrant database) and Archive Agent settings.
Open current profile config in nano
To open the current profile's config (JSON) in the `nano` editor, run this:
archive-agent config
See Archive Agent settings for details.
Add included patterns
To add one or more included patterns, run this:
archive-agent include "~/Documents/*.txt"
📌 Note: Always use quotes for the pattern argument (to prevent your shell's wildcard expansion), or skip it to get an interactive prompt.
Add excluded patterns
To add one or more excluded patterns, run this:
archive-agent exclude "~/Documents/*.txt"
📌 Note: Always use quotes for the pattern argument (to prevent your shell's wildcard expansion), or skip it to get an interactive prompt.
Remove included / excluded patterns
To remove one or more previously included / excluded patterns, run this:
archive-agent remove "~/Documents/*.txt"
📌 Note: Always use quotes for the pattern argument (to prevent your shell's wildcard expansion), or skip it to get an interactive prompt.
List included / excluded patterns
To show the list of included / excluded patterns, run this:
archive-agent patterns
Resolve patterns and track files
To resolve all patterns and track changes to your files, run this:
archive-agent track
List tracked files
To show the list of tracked files, run this:
archive-agent list
📌 Note: Don't forget to `track` your files first.
List changed files
To show the list of changed files, run this:
archive-agent diff
📌 Note: Don't forget to `track` your files first.
Commit changed files to database
To sync changes to your files with the Qdrant database, run this:
archive-agent commit
To see additional information on chunking and embedding, pass the `--verbose` option:
archive-agent commit --verbose
To bypass the AI cache for this commit, pass the `--nocache` option:
archive-agent commit --nocache
💡 Good to know: Changes are triggered by:
- File added
- File removed
- File changed:
- Different file size
- Different modification date
📌 Note: Don't forget to `track` your files first.
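The change triggers listed above amount to comparing two snapshots of file metadata. The following sketch shows the idea; the `diff_files` function, the `FileMeta` type, and the example paths are hypothetical and not taken from the Archive Agent code.

```python
from typing import Dict, Tuple

FileMeta = Tuple[int, float]  # (size in bytes, modification time)


def diff_files(
    previous: Dict[str, FileMeta],
    current: Dict[str, FileMeta],
) -> Dict[str, str]:
    """Classify each path as 'added', 'changed', or 'removed'."""
    changes: Dict[str, str] = {}
    for path, meta in current.items():
        if path not in previous:
            changes[path] = "added"
        elif meta != previous[path]:
            # Different file size or different modification date.
            changes[path] = "changed"
    for path in previous:
        if path not in current:
            changes[path] = "removed"
    return changes


previous = {
    "/docs/a.txt": (10, 100.0),
    "/docs/b.txt": (20, 200.0),
    "/docs/c.txt": (30, 300.0),
}
current = {
    "/docs/a.txt": (10, 100.0),  # unchanged
    "/docs/b.txt": (25, 250.0),  # size and mtime differ
    "/docs/d.txt": (5, 400.0),   # new file
}
changes = diff_files(previous, current)
print(changes)
```

Unchanged files do not appear in the result, so a commit only needs to process the listed paths.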
Combined track and commit
To `track` and then `commit` in one go, run this:
archive-agent update
To see additional information on chunking and embedding, pass the `--verbose` option:
archive-agent update --verbose
To bypass the AI cache for this commit, pass the `--nocache` option:
archive-agent update --nocache
Search your files
archive-agent search "Which files mention donuts?"
Lists files relevant to the question.
📌 Note: Always use quotes for the question argument, or skip it to get an interactive prompt.
Query your files
archive-agent query "Which files mention donuts?"
Answers your question using RAG.
📌 Note: Always use quotes for the question argument, or skip it to get an interactive prompt.
Launch Archive Agent GUI
To launch the Archive Agent GUI in your browser, run this:
archive-agent gui
📌 Note: Press `CTRL+C` in the console to close the GUI server.
Start MCP Server
To start the Archive Agent MCP server, run this:
archive-agent mcp
📌 Note: Press `CTRL+C` in the console to close the MCP server.
💡 Good to know: Use these MCP configurations to let your IDE or AI extension automate Archive Agent:
- `.vscode/mcp.json` for GitHub Copilot agent mode (VS Code)
- `.roo/mcp.json` for Roo Code (VS Code extension)
MCP Tools
Archive Agent exposes these tools via MCP:
MCP tool | Equivalent CLI command(s) | Argument(s) | Description |
---|---|---|---|
`get_patterns` | `patterns` | None | Get the list of included / excluded patterns. |
`get_files_tracked` | `track` and then `list` | None | Get the list of tracked files. |
`get_files_changed` | `track` and then `diff` | None | Get the list of changed files. |
`get_search_result` | `search` | `question` | Get the list of files relevant to the question. |
`get_answer_rag` | `query` | `question` | Get answer to question using RAG. |
📌 Note: These commands are read-only, preventing the AI from changing your Qdrant database.
💡 Good to know: Just type `#get_answer_rag` (for example) in your IDE or AI extension to call the tool directly.
Update Archive Agent
This step is not needed right away if you just installed Archive Agent. However, to get the latest features, you should update your installation regularly.
To update your Archive Agent installation, run this in the installation directory:
git pull
./install.sh
📌 Note: If updating doesn't work, try removing the installation directory and then Install Archive Agent again. Your config and data are safely stored in another place; see Archive Agent settings and Qdrant database for details.
💡 Good to know: To also update the Qdrant docker image, run this:
sudo ./manage-qdrant.sh update
Archive Agent settings
Archive Agent settings are organized as profile folders in `~/.archive-agent-settings/`.
E.g., the `default` profile is located in `~/.archive-agent-settings/default/`.
The currently used profile is stored in `~/.archive-agent-settings/profile.json`.
📌 Note: To delete a profile, simply delete the profile folder. This will not delete the Qdrant collection (see Qdrant database).
Profile configuration
The profile configuration is contained in the profile folder as `config.json`.
💡 Good to know: Use the `config` CLI command to open the current profile's config (JSON) in the `nano` editor (see Open current profile config in nano).
💡 Good to know: Use the `switch` CLI command to switch to a new or existing profile (see Create or switch profile).
Key | Description |
---|---|
`config_version` | Config version |
`ocr_strategy` | OCR strategy in `DecoderSettings.py` |
`ocr_auto_threshold` | Minimum number of characters for `auto` OCR strategy to resolve to `relaxed` instead of `strict` |
`ai_provider` | AI provider in `ai_provider_registry.py` |
`ai_server_url` | AI server URL |
`ai_model_chunk` | AI model used for chunking |
`ai_model_embed` | AI model used for embedding |
`ai_model_query` | AI model used for queries |
`ai_model_vision` | AI model used for vision (`""` disables vision) |
`ai_vector_size` | Vector size of embeddings (used for Qdrant collection) |
`ai_temperature_query` | Temperature of the query model |
`qdrant_server_url` | URL of the Qdrant server |
`qdrant_collection` | Name of the Qdrant collection |
`qdrant_score_min` | Minimum similarity score of retrieved chunks (`0`...`1`) |
`qdrant_chunks_max` | Maximum number of retrieved chunks |
`chunk_lines_block` | Number of lines per block for chunking |
`mcp_server_port` | MCP server port (default `8008`) |
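If you edit `config.json` by hand, a quick check for missing keys can be sketched as follows. This is a hypothetical helper, not part of Archive Agent: the key names are taken from the table above, while the assumption that every key must be present and the example values are made up for illustration.

```python
import json
from typing import List

# Keys as listed in the table above.
EXPECTED_KEYS = {
    "config_version", "ocr_strategy", "ocr_auto_threshold",
    "ai_provider", "ai_server_url",
    "ai_model_chunk", "ai_model_embed", "ai_model_query", "ai_model_vision",
    "ai_vector_size", "ai_temperature_query",
    "qdrant_server_url", "qdrant_collection",
    "qdrant_score_min", "qdrant_chunks_max",
    "chunk_lines_block", "mcp_server_port",
}


def missing_keys(config: dict) -> List[str]:
    """Return the expected keys that are absent from a config dict."""
    return sorted(EXPECTED_KEYS - config.keys())


# Deliberately incomplete example config (values are placeholders).
config = json.loads('{"config_version": 1, "mcp_server_port": 8008}')
missing = missing_keys(config)
print(len(missing), "keys missing, e.g.", missing[0])
```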
Watchlist
The profile watchlist is contained in the profile folder as `watchlist.json`.
The watchlist is managed by these commands only:
- `include` / `exclude` / `remove`
- `track` / `commit` / `update`
AI cache
Each profile folder also contains an `ai_cache` folder.
The AI cache ensures that, in a given profile:
- The same text is only chunked once.
- The same text is only embedded once.
- The same image is only OCR-ed once.
This way, Archive Agent can quickly resume where it left off if a commit was interrupted.
To bypass the AI cache for a single commit, pass the `--nocache` option to the `commit` or `update` command (see Commit changed files to database and Combined track and commit).
💡 Good to know: Queries are never cached, so you always get a fresh answer.
📌 Note: To clear the entire AI cache, simply delete the profile's cache folder.
📌 Technical Note: Archive Agent keys the cache using a composite hash made from the text/image bytes and the AI model names for chunking, embedding, and vision. Cache keys are deterministic and change whenever you change the chunking, embedding, or vision AI model names. Since cache entries are retained forever, switching back to a prior combination of AI model names will again access the "old" keys.
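A composite cache key of this kind can be sketched as follows. The hash function (SHA-256), the length-prefixed layout, and the `cache_key` function name are assumptions for illustration; the actual key derivation in Archive Agent may differ.

```python
import hashlib


def cache_key(payload: bytes, model_chunk: str, model_embed: str, model_vision: str) -> str:
    """Derive a deterministic key from content bytes and AI model names."""
    hasher = hashlib.sha256()
    parts = (
        payload,
        model_chunk.encode("utf-8"),
        model_embed.encode("utf-8"),
        model_vision.encode("utf-8"),
    )
    for part in parts:
        # Length-prefix each part so that field boundaries stay unambiguous.
        hasher.update(len(part).to_bytes(8, "big"))
        hasher.update(part)
    return hasher.hexdigest()


text = b"Which files mention donuts?"
key_a = cache_key(text, "llama3.1:8b", "nomic-embed-text:v1.5", "llava:7b-v1.6")
key_b = cache_key(text, "llama3.1:8b", "some-other-embed-model", "llava:7b-v1.6")
key_c = cache_key(text, "llama3.1:8b", "nomic-embed-text:v1.5", "llava:7b-v1.6")

print(key_a == key_c)  # True: same content and models give the same key
print(key_a == key_b)  # False: a different model name gives a new key
```

Because the key depends only on its inputs, switching back to an earlier combination of model names reproduces the earlier keys and therefore finds the earlier cache entries.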
Qdrant database
The Qdrant database is stored in `~/.archive-agent-qdrant-storage/`.
📌 Note: This folder is created by the Qdrant Docker image running as root.
💡 Good to know: Visit your Qdrant dashboard to manage collections and snapshots.
Developer's guide
Archive Agent was written from scratch for educational purposes (on either end of the software).
Important modules
To get started, check out these epic modules:
- The app context is initialized in `archive_agent/core/ContextManager.py`
- The default config is defined in `archive_agent/config/ConfigManager.py`
- The CLI commands are defined in `archive_agent/__main__.py`
- The commit logic is implemented in `archive_agent/core/CommitManager.py`
- The CLI verbosity is handled in `archive_agent/util/CliManager.py`
- The GUI is implemented in `archive_agent/core/GuiManager.py`
- The AI API prompts for chunking, embedding, vision, and querying are defined in `archive_agent/ai/AiManager.py`
- The AI provider registry is located in `archive_agent/ai_provider/ai_provider_registry.py`
If you miss something or spot bad patterns, feel free to contribute and refactor!
Testing and code analysis
To run unit tests, check types, and check style, run this:
./audit.sh
(Some remaining type errors need to be fixed…)
Known bugs
- While `track` initially reports a file as added, subsequent `track` calls report it as changed.
- Removing and restoring a tracked file in the tracking phase is currently not handled properly:
  - Removing a tracked file sets `{size=0, mtime=0, diff=removed}`.
  - Restoring a tracked file sets `{size=X, mtime=Y, diff=added}`.
  - Because `size` and `mtime` were cleared, we lost the information to detect a restored file.
Licensed under GNU GPL v3.0
Copyright © 2025 Dr.-Ing. Paul Wilhelm <[email protected]>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
See LICENSE for details.