raintree-technology

docpull

Convert the public web into AI-ready Markdown with a local Python CLI/SDK/MCP crawler.

docpull

Context dependencies for AI agents. Browser-free by default.

Python 3.10+ · License: MIT

DocPull is a local-first dependency manager for AI context. Define the public docs and web sources an agent depends on, sync them into cited context packs, diff what changed, and export reproducible context for Cursor, Claude, Codex, OpenAI, LlamaIndex, LangChain, MCP clients, and RAG pipelines.

The core workflow is a docpull.yaml plus a .docpull/context.lock.json, similar in spirit to code dependency manifests and lockfiles:

docpull init my-agent-context
docpull add stripe react postgres
docpull install
docpull deps
docpull sync
docpull diff
docpull export context-pack --target codex

Bundled aliases such as stripe, react, postgres, openai, and apple-hig expand to normal HTTPS sources in docpull.yaml. Runs stay reproducible through the lockfile: source URLs, discovered URLs, content hashes, run IDs, aliases, and export metadata are recorded without storing secrets. Use docpull sources list to inspect the bundled alias catalog and docpull install to validate or recreate the local dependency lock. Use docpull deps to see the current dependency, lockfile, latest run, and export status.

Projects can also track typed known-source specs such as pypi:requests, rfc:9110, wiki:Web_scraping, or a local dataset path. Those sources sync through their typed lanes and do not use discovery.

The original docpull URL ... workflow still works: fetch public or explicitly authorized static/server-rendered web pages and write clean Markdown, NDJSON, SQLite, or OKF outputs. Project mode adds the persistent evidence lifecycle on top: sources, runs, diffs, exports, evals, accounting, and local auditability.

DocPull is local-first: direct fetching, sitemap/link discovery, extraction, indexing, pack intelligence, and opt-in agent-browser rendering can run with no external account and no required API spend. Cloud rendering is explicit and budget-guarded.

DocPull aligns core workflows across CLI, Python SDK, and MCP, with each surface optimized for its user. The Surface Contract defines how those surfaces align and where they intentionally differ. For the context dependency workflow, see Context Dependencies.

Web-source ingestion is the core workflow. Documentation is one high-value lane, not the product boundary. It works best on static or server-rendered pages such as blogs, API references, OpenAPI specs, changelogs, vendor pages, product pages, filings, docs sites, and other pages where the useful content is available in HTML or embedded page data.

DocPull is browser-free by default. JS-only pages are skipped with a clear reason unless you explicitly opt into a local renderer. See Web Source Boundary and Alternatives for the full boundary.

Install

pip install docpull

Project Quickstart

docpull init stripe-docs
docpull add stripe
docpull install
docpull sync
docpull deps
docpull diff
docpull export context-pack --target cursor

Context CI

Use Context CI when an agent loop depends on current, cited context and a missing or stale source should fail the build:

docpull ci --prepare

docpull ci runs locally against either a project root or a standalone pack. It checks the project lockfile, pack score, pack audit, coverage confidence, citation coverage, eval-grade sidecars, evidence basis quality, rights metadata, and optional context predictions. It writes context-ci.report.json and CONTEXT_CI.md; the command exits non-zero when hard gates fail.

Minimal GitHub Actions job:

name: Context CI
on: [pull_request]
jobs:
  context:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install docpull
      - run: docpull ci --prepare

For the full workflow, see Context CI. The durable artifact shape is documented in Context Pack Contract v3.

Example diff after a later sync:

Project diff: +4 -2 ~18 api=2 pricing=1

Changed pages:
- /payments/payment-intents
  likely API behavior change
- /billing/subscriptions
  pricing / billing change
- /webhooks
  likely API behavior change

0 failed URLs
0 robots blocked
0 paid/cloud routes used

Context Pack Contract

DocPull writes three explicit layers of artifacts:

| Layer | Purpose | Contract check |
| --- | --- | --- |
| Raw extraction | Fetched documents, chunks, routes, and source index sidecars | docpull pack validate PACK --level raw |
| Agent-ready pack | Raw evidence plus citation index, coverage, score, audit, and lock sidecars | docpull pack validate PACK --level agent |
| Eval-grade pack | Agent pack plus rights, provenance, basis/eval artifacts, and pack card | docpull pack validate PACK --level eval |

Core ingestion paths write into the same v3 contract:

docpull https://docs.example.com -o packs/docs
docpull parse ./handbook.pdf -o packs/handbook --backend auto
docpull openapi-pack ./openapi.json -o packs/api
docpull feed-pack https://example.com/news -o packs/news
docpull paper-pack arxiv:1706.03762 -o packs/papers
docpull repo-pack psf/requests -o packs/repo --cache
docpull package-pack pypi:requests -o packs/package
docpull standards-pack rfc:9110 -o packs/standard
docpull dataset-pack ./metrics.csv -o packs/dataset
docpull transcript-pack ./meeting.vtt -o packs/transcript
docpull wiki-pack wiki:Web_scraping -o packs/wiki
docpull pack prepare packs/docs --eval-grade
docpull pack validate packs/docs --level eval
docpull export packs/docs --format openai-vector-jsonl -o exports/openai.jsonl
docpull export packs/docs --format cursor-rules -o .cursor/rules --skill-name docs

Use docpull ci --prepare to validate a project or standalone pack in CI.

Install optional extras as needed:

pip install 'docpull[llm]'           # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]'   # alternative extractor for noisy pages
pip install 'docpull[parse]'         # MarkItDown + Unstructured local document parsers
pip install 'docpull[presidio]'      # optional Presidio PII detection for redaction
pip install 'docpull[mcp]'           # stdio MCP server
pip install 'docpull[serve]'         # local pack JSON server runner
pip install 'docpull[parquet]'       # optional Parquet export support
pip install 'docpull[e2b]'           # E2B cloud sandbox renderer SDK

Prefer installing the extras needed for the current lane instead of a broad bundle. The base install remains useful without API keys or paid services.

Browser rendering is an explicit external extension, not part of the base install. Install an agent-browser compatible CLI separately, then either put it on PATH or set DOCPULL_AGENT_BROWSER_BIN=/path/to/agent-browser. Verify the runtime with docpull render --check. Render targets must use HTTPS except for localhost/loopback HTTP during local testing, and DocPull keeps renderer action permissions locked down to HTML retrieval only. Because browser rendering cannot fully enforce redirect, subresource, or connect-time DNS allow-lists, network browser rendering fails closed unless the operator sets DOCPULL_RENDER_TRUSTED_BROWSER_TARGETS=1 for trusted targets. For localhost/loopback HTTP tests, set DOCPULL_RENDER_ALLOW_LOCAL_TARGETS=1.

For stronger isolation, cloud runtimes are available explicitly: docpull render URL --runtime vercel uses the Vercel Sandbox CLI and Vercel auth, while docpull render URL --runtime e2b uses the E2B Python SDK and E2B_API_KEY. These are never enabled by default. All runtimes execute the same agent-browser --json renderer contract. Use --cloud-max-estimated-cost to set a local per-render budget guard, and use --cloud-agent-browser-install skip with a prebuilt sandbox/template that already includes agent-browser. For E2B, pass --template or set DOCPULL_E2B_TEMPLATE to use that prebuilt environment.

For release acceptance, run the opt-in real-data smoke harness. The default path uses public free sources and local tooling; --include-cloud also attempts the keyed/cloud render lanes when the local environment is configured for them. The strict scorecard also requires synchronized generated metadata and a clean git status --short before tagging.

python scripts/release_a_plus_check.py --strict
python scripts/real_feature_smoke.py --json --full-mcp --strict-ci --auth-matrix --monitor-soak-minutes 10
python scripts/real_feature_smoke.py --include-cloud --json

Free-First Budgets

Use --budget 0 when a run must not make paid-capable cloud calls:

docpull https://docs.example.com --budget 0 -o ./docs/example
DOCPULL_RENDER_TRUSTED_BROWSER_TARGETS=1 docpull render https://example.com/app --runtime local --budget 0

Under a zero budget, local cache, direct HTTP, sitemap/static-link discovery, local extraction, local indexing, pack analysis, monitors, and local browser rendering for trusted targets remain allowed. Vercel Sandbox and E2B rendering are blocked before execution. Runs involving a budget or paid-capable route write run.accounting.json with non-secret route, cost, HTTP/cache, browser, and blocked-action metadata.
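
The zero-budget rule can be pictured as a small gate that runs before any route executes. The sketch below is illustrative only: the `BudgetGuard` class, the route names, and the cost figures are hypothetical and do not describe DocPull's internal budget policy or the layout of run.accounting.json.

```python
from dataclasses import dataclass, field

# Hypothetical route names; the paid-capable set mirrors the two cloud runtimes named above.
PAID_CAPABLE = {"vercel", "e2b"}


@dataclass
class BudgetGuard:
    budget: float
    spent: float = 0.0
    blocked: list[str] = field(default_factory=list)

    def allow(self, route: str, estimated_cost: float = 0.0) -> bool:
        """Always permit free routes; permit paid-capable routes only within budget."""
        if route not in PAID_CAPABLE:
            return True
        if self.budget <= 0 or self.spent + estimated_cost > self.budget:
            self.blocked.append(route)  # remember the blocked action for accounting
            return False
        self.spent += estimated_cost
        return True


guard = BudgetGuard(budget=0)
routes = ["direct_http", "local_browser", "e2b", "vercel"]
decisions = {route: guard.allow(route, 0.01) for route in routes}
print(decisions)
print(guard.blocked)
```

With a budget of 0, the two local routes pass and both cloud routes are refused before they would run, which is the behavior the paragraph above describes.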

Release Boundary

The open-source package owns local fetching, explicit rendering adapters, source aliases, v3 pack contracts, validation, preparation, exports, Context CI, monitors, MCP, budget policy, and accounting.

This release does not include a hosted scheduler, browser/proxy service, accounts, marketplace, proprietary web index, CAPTCHA bypass, stealth scraping, or hidden paid calls.

Persistent Projects

Use project mode when a source corpus needs to stay fresh over time. A project is a local docpull.yaml plus a .docpull/ state directory containing run history, cache, manifests, context-pack exports, eval sets, and a SQLite index.

docpull init stripe-docs
docpull add https://docs.stripe.com
docpull sync
docpull diff
docpull export context-pack --target cursor

Each sync writes a normal local DocPull pack under .docpull/runs/<run_id>/, including run.json, documents.jsonl, chunks.jsonl, manifest.json, documents.ndjson, corpus.manifest.json, sources.md, source-health.json, local.pack.json, and accounting metadata.

# Inspect the latest project state
docpull status

# Show run history
docpull history

# Diff the latest two runs, with deterministic local categories by default
docpull diff

# Write a review summary for the latest run
docpull review

# Create a versioned context-pack release
docpull release context-pack --target cursor --tag stripe-docs-v1

# One-command project sync, diff, and export for one source
docpull watch https://docs.stripe.com --export cursor --alert changes

Ad hoc docpull watch projects are bounded to one page and one level of depth by default. Use explicit bounds when the watch should cover more:

docpull watch https://docs.stripe.com --export cursor --max-pages 10 --max-depth 2

docpull diff is hash-based and deterministic locally. Optional BYOK semantic summaries are advisory and skip cleanly when no model key is configured. Each diff also writes local semantic categories to semantic.diff.json. Use docpull add URL --discover or docpull sync --update-discovery to refresh and persist discovered source URLs in docpull.yaml; sync then uses that stored URL set for repeatable exact refreshes.
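
To illustrate what a hash-based, deterministic diff means in practice, here is a minimal sketch that compares two runs represented as URL-to-content-hash mappings. The `content_hash` and `diff_runs` helpers and the in-memory layout are assumptions made for illustration; they do not describe DocPull's lockfile format or its internal implementation.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_runs(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {url: hash} maps and list added, removed, and changed URLs."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(url for url in shared if old[url] != new[url]),
    }


# Two hypothetical runs: one page edited, one page added, one unchanged.
previous = {"/webhooks": content_hash("v1"), "/billing": content_hash("same")}
latest = {
    "/webhooks": content_hash("v2"),
    "/billing": content_hash("same"),
    "/payments": content_hash("new page"),
}

result = diff_runs(previous, latest)
print(result)
```

Because the comparison depends only on the stored hashes, the same pair of runs always yields the same result, with no model call involved.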

For authenticated sources, store only environment variable references in docpull.yaml; DocPull resolves values in memory at sync time and writes only masked auth type/readiness to status, manifests, reviews, releases, and webhooks:

sources:
  - name: internal-docs
    url: https://docs.example.com
    auth:
      type: bearer_env
      env: EXAMPLE_DOCS_TOKEN
      policy: explicit-private
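
The idea of storing a reference rather than a secret can be sketched as follows. This is a hypothetical illustration, not DocPull's code: `resolve_bearer` and `masked_status` are invented names, and the only details taken from the configuration above are the `type`, `env`, and `policy` keys.

```python
import os


def resolve_bearer(auth: dict, environ=os.environ) -> dict:
    """Build request headers in memory from an environment variable reference."""
    if auth.get("type") != "bearer_env":
        raise ValueError(f"unsupported auth type: {auth.get('type')!r}")
    token = environ.get(auth["env"])
    if not token:
        raise KeyError(f"environment variable {auth['env']} is not set")
    return {"Authorization": f"Bearer {token}"}


def masked_status(auth: dict, environ=os.environ) -> dict:
    """Report only the auth type and whether the secret is available."""
    return {"type": auth["type"], "ready": bool(environ.get(auth["env"]))}


auth = {"type": "bearer_env", "env": "EXAMPLE_DOCS_TOKEN", "policy": "explicit-private"}
fake_env = {"EXAMPLE_DOCS_TOKEN": "s3cr3t"}  # stand-in for os.environ in this demo

headers = resolve_bearer(auth, fake_env)  # used for the request, never persisted
status = masked_status(auth, fake_env)    # safe to write to reports
print(status)
```

Only the masked status would be written to disk in a design like this; the header dictionary holding the token stays in memory for the duration of the request.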

The launch screenshot for this flow lives at docs/launch-assets/docpull-project-diff-demo.png.

30-Second Usage

docpull https://www.python.org/blogs/ --single -o ./python-news

Example output:

python-news/
  index.md
  corpus.manifest.json

Markdown includes source metadata and readable page content:

---
title: "Blogs"
source: https://www.python.org/blogs/
source_type: "html"
---

# Blogs

News from the Python Software Foundation, Python core developers, and the
wider Python community.

Stream chunked NDJSON for agents and RAG:

docpull https://www.python.org/blogs/ \
  --single \
  --profile llm \
  --stream | jq .

Each line is a JSON document:

{"schema_version":1,"document_id":"doc_...","chunk_id":"chunk_...","url":"https://www.python.org/blogs/","title":"Blogs","content":"News from the Python Software Foundation...","source_type":"html","chunk_index":0,"token_count":842}
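
A consumer of this stream might parse it one line at a time. The sketch below relies only on the field names visible in the sample record above; the `iter_chunks` helper and its token filter are hypothetical and not part of docpull, and the IDs in the sample are shortened placeholders.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str], max_tokens: int = 1000) -> Iterator[dict]:
    """Yield parsed records whose token_count fits within max_tokens."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("token_count", 0) <= max_tokens:
            yield record


# One record shaped like the example line above.
sample = [
    '{"schema_version":1,"document_id":"doc_a","chunk_id":"chunk_a0",'
    '"url":"https://www.python.org/blogs/","title":"Blogs",'
    '"content":"News from the Python Software Foundation...",'
    '"source_type":"html","chunk_index":0,"token_count":842}',
]

chunks = list(iter_chunks(sample))
print(chunks[0]["title"], chunks[0]["token_count"])
```

In a pipeline, the same helper could iterate over `sys.stdin` when fed by a `--stream` run instead of the in-memory sample.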

Common Workflows

# Crawl a public web section and write Markdown files
docpull https://www.python.org/blogs/ -o ./python-news

# Stream LLM-ready NDJSON chunks from a source
docpull https://www.python.org/blogs/ --profile llm --stream | jq .

# Write SQLite with an FTS5 search index
docpull https://www.python.org/blogs/ --format sqlite -o ./python-news-db

# Build an Open Knowledge Format (OKF) bundle for portable source packs
docpull https://example.com --profile okf -o ./site-okf

# Turn a source corpus into agent-ready skills/rules
docpull https://sdk.vercel.ai \
  --skill vercel-ai \
  --skill-agent all \
  --skill-description "Vercel AI SDK source reference"

More examples live in CLI Recipes.

With an explicit --skill-agent, docpull stores the fetched corpus under .docpull/skills/<name>/references and creates agent-specific wrappers that point at that corpus. --skill-agent claude writes a Claude Code skill under .claude/skills/<name>/, --skill-agent codex writes a Codex skill under .agents/skills/<name>/ with agents/openai.yaml, and --skill-agent cursor writes a Cursor project rule at .cursor/rules/<name>.mdc. Use --skill-agent all to create all three. If you pass --output-dir, docpull stages the generated corpus there; explicit --skill-agent targets still write their active agent wrappers.

Use docpull when you need to:

  • Convert public web sources - docs, blogs, API references, vendor pages, product pages, changelogs, filings, and OpenAPI specs - into Markdown or chunked NDJSON for LLM and RAG pipelines.
  • Give an agent a local tool for fetching, caching, grepping, and reading web sources.
  • Build repeatable context packs with stable IDs, hashes, manifests, and source metadata.
  • Mirror public web content for offline work while preserving attribution.

Why docpull?

docpull is designed for agent and RAG workflows, not just downloading pages.

| Need | docpull gives you |
| --- | --- |
| Clean Markdown | Article-focused extraction with source metadata |
| LLM chunks | NDJSON streaming and optional token-aware chunking |
| Repeatability | Stable document IDs, chunk IDs, hashes, and manifests |
| Offline work | Cached archives and mirrored source artifacts |
| Agent access | Local CLI, Python SDK, and stdio MCP server |
| Downstream exports | JSONL, Sheets CSV/TSV, n8n JSON, Vercel AI JSON, CrewAI JSON, warehouse NDJSON, optional Parquet, and agent skills |
| Safer fetching | HTTPS defaults, robots.txt compliance, SSRF protections, and redirect guards |

Supported Sources

docpull uses async HTTP instead of browser automation by default and includes special handling for common web, documentation, and API surfaces.

| Source shape | Support |
| --- | --- |
| Static HTML / SSR pages | Extracts article, main, or document regions |
| Next.js / Mintlify | Parses static HTML and __NEXT_DATA__ when available |
| OpenAPI / Swagger | Renders specs into Markdown |
| OpenAPI pack | docpull openapi-pack emits endpoint/schema records with v3 sidecars |
| RSS / Atom / JSON Feed | docpull feed-pack emits item-level records, dates, and listing sidecars |
| Research papers | docpull paper-pack emits paper metadata, abstracts, optional local/arXiv PDF full text, and references |
| Public GitHub repos | docpull repo-pack emits repo metadata, README/docs/examples/changelog files, manifests, and releases |
| npm / PyPI packages | docpull package-pack emits registry metadata, README/description, versions, license, dependencies, and install commands |
| Standards | docpull standards-pack emits RFC, IETF, W3C, and WHATWG metadata plus section-level records |
| Local datasets | docpull dataset-pack emits bounded schema, exact row counts where streamable, column, null-count, and sample summaries |
| Transcripts | docpull transcript-pack emits timestamped segment records from VTT, SRT, text, JSON, or direct transcript URLs |
| Wikimedia / Wikipedia | docpull wiki-pack emits MediaWiki REST page metadata, license/revision metadata, and section-level records |
| Docusaurus / Sphinx / MkDocs | Extracts static article or document regions |
| VitePress / VuePress / Astro Starlight | Extracts static content regions |
| GitBook / ReadMe.io | Extracts available article or content regions |
| Redoc / Scalar | Extracts static API reference regions |
| JS-only apps | Skipped unless useful content is present in HTML or embedded data |

Use --strict-js-required when an agent should treat JS-only pages as hard errors instead of normal skips.

Output Formats

| Output | Use it for |
| --- | --- |
| Markdown | Local readable source snapshots with YAML frontmatter |
| NDJSON | Streamed records or chunked records for agents and RAG |
| SQLite | Local retrieval with an FTS5 index |
| OKF | Portable Open Knowledge Format bundles with indexes and manifests |
| Archive / mirror | Cached offline source snapshots |

All file-backed outputs now write the DocPull output contract v3 raw sidecars: corpus.manifest.json, sources.md, and acquisition.routes.json. Use docpull pack validate <pack-dir> --level raw|agent|eval to check whether a pack is raw extraction output, agent-ready context, or eval-grade context.

Local files can enter the same contract with the document parse lane:

docpull parse ./handbook.pdf -o ./packs/handbook --backend auto
docpull parse ./handbook.docx -o ./packs/handbook --prepare --eval-grade

--backend auto reads plain text/Markdown directly and uses optional MarkItDown or Unstructured parsers for complex office/PDF files when installed. Install docpull[markitdown], docpull[unstructured], or docpull[parse] for those backends.

Every file-backed run writes corpus.manifest.json with stable document IDs, chunk IDs, hashes, output paths, and chunk counts. See Corpus Manifest.

Profiles

docpull https://site.com --profile rag        # Default. Dedup + metadata.
docpull https://site.com --profile llm        # NDJSON chunks for agents/RAG.
docpull https://site.com --profile okf        # Portable Open Knowledge Format bundle.
docpull https://site.com --profile mirror     # Cached archive.
docpull https://site.com --profile quick      # Small sampling crawl.
docpull https://site.com --profile sec-filing # EDGAR-friendly evidence chunks.

Run docpull --help for the full option list.

When Not to Use docpull

docpull intentionally does not use a browser unless rendering is explicitly enabled. It is not the right tool for:

  • JS-only pages that require complex browser workflows beyond static rendered HTML.
  • Authenticated dashboards or private apps.
  • Pages behind CAPTCHA or bot challenges.
  • Workflows that require clicking, scrolling, or browser state.

For those cases, use full browser automation outside DocPull, then pass rendered HTML or exported content into your pipeline. For simple public JS-rendered pages, use docpull render --runtime local or fetch with --render fallback for an explicit agent-browser fallback without changing the default fetch behavior. DocPull does not claim complete browser coverage unless rendering is explicitly enabled and available.

Use --extractor ensemble when a crawl should score multiple local extraction candidates and keep the strongest Markdown. The ensemble always includes the built-in generic extractor and adds trafilatura when docpull[trafilatura] is installed.

How It Compares

| Tool type | Best for | Tradeoff |
| --- | --- | --- |
| wget / site mirroring | Downloading raw files | Not agent/RAG-oriented |
| Browser automation | JS-heavy pages and interactions | Slower, heavier, more stateful |
| Hosted extraction APIs | Managed extraction at scale | External dependency and cost |
| docpull | Local public web-source extraction and context packs | No JavaScript rendering by default |

Python SDK

# One-shot fetch of a single page
from docpull import fetch_one

ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title)
print(ctx.markdown[:500])

# Streaming crawl that prints progress events as pages are fetched
import asyncio
from docpull import Fetcher, DocpullConfig, EventType, ProfileName

async def main():
    cfg = DocpullConfig(url="https://example.com/blog", profile=ProfileName.LLM)
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")

asyncio.run(main())

MCP Server

docpull can run as a stdio MCP server for agent clients:

pip install 'docpull[mcp]'
docpull mcp

Claude Code:

claude mcp add --transport stdio docpull -- docpull mcp

Cursor and Claude Desktop use the same mcpServers shape:

{
  "mcpServers": {
    "docpull": {
      "type": "stdio",
      "command": "docpull",
      "args": ["mcp"]
    }
  }
}

The supported MCP path is the Python stdio server started by docpull mcp. The repository's mcp/ directory is an internal TypeScript/Bun lab and is not part of the package release contract.

Advanced Workflows

  • Local pack intelligence can build citation maps, extract cited entities, search pack records, build cited source graphs, prepare the full sidecar bundle, and write eval-grade rights/provenance artifacts with docpull pack citations, docpull pack entities, docpull pack search, docpull pack brief, docpull graph build, docpull graph query, and docpull pack prepare --eval-grade.
  • Release commands add policy files, refresh reports, audits, exports, a localhost pack server, explicit rendering, authenticated-source checks, and cron-friendly monitors: docpull policy, docpull refresh, docpull parse, docpull openapi-pack, docpull feed-pack, docpull paper-pack, docpull repo-pack, docpull package-pack, docpull standards-pack, docpull dataset-pack, docpull transcript-pack, docpull wiki-pack, docpull pack validate, docpull pack audit, docpull export, docpull serve, docpull share, docpull render, docpull auth check, and docpull monitor.
  • docpull export writes local files for OpenAI vector JSONL, LangChain, LlamaIndex, DSPy, Sheets CSV/TSV, n8n workflow JSON, Vercel AI SDK JSON, CrewAI JSON, warehouse NDJSON, optional Parquet via docpull[parquet], and Codex/Claude/Cursor agent references.

Security Defaults

  • HTTPS-only fetching with robots.txt compliance.
  • SSRF protections, private network blocking, DNS rebinding protection, and connect-time address pinning.
  • XXE protection for sitemaps.
  • Path traversal and CRLF header injection guards.
  • Auth headers stripped on cross-origin redirects.

When running with --proxy, DNS pinning is delegated to the proxy. Pass --require-pinned-dns to refuse that configuration.
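
Private network blocking and address pinning are general techniques, and a rough sketch may help show how they fit together: resolve a host once, reject it if any answer is non-public, then connect to the validated address so a later DNS answer cannot swap in an internal one. The helper names below are hypothetical and the resolved addresses are hard-coded for the demo; this is not DocPull's implementation.

```python
import ipaddress


def is_public_address(ip: str) -> bool:
    """Return True only for globally routable addresses."""
    return ipaddress.ip_address(ip).is_global


def pick_pinned_address(resolved: list[str]) -> str:
    """Reject the host if any resolved address is non-public, then pin the first."""
    if not resolved:
        raise ValueError("no addresses resolved")
    for ip in resolved:
        if not is_public_address(ip):
            raise ValueError(f"blocked non-public address: {ip}")
    return resolved[0]


# Stand-ins for DNS answers; a real client would obtain these from its resolver.
pinned = pick_pinned_address(["8.8.8.8"])
print(pinned)

for candidate in ["127.0.0.1", "10.0.0.5", "169.254.169.254", "::1"]:
    print(candidate, is_public_address(candidate))
```

When a proxy performs the DNS lookup, the client never sees the resolved address, so a check like this cannot run locally. That is consistent with the note above that pinning is delegated to the proxy.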

Troubleshooting

docpull --doctor
docpull render --check
docpull URL --verbose
docpull URL --dry-run
docpull URL --preview-urls

Documentation

  • CLI Recipes - common commands and advanced workflows.
  • Web Source Boundary - what docpull does and does not fetch.
  • Alternatives - when to use browser automation or hosted extraction.
  • Corpus Manifest - stable IDs, hashes, and source maps.
  • Surface Contract - how the CLI, Python SDK/API, and MCP surfaces align.
  • Changelog - release history.

License

MIT
