TheCrawler

Web scraper with validated extraction contracts for AI agents. Handles PDF/DOCX, markdown, JSON-LD, microdata, commerce data, forms, and analytics detection, with no-LLM readiness diagnostics, structured errors, UA rotation, retry/timeout, and an optional in-memory cache. Adaptive Cheerio→Playwright crawling. Open source, AGPL-3.0-or-later.

Install

npm install thecrawler

Or from a local checkout:

npm install file:/path/to/TheCrawler/engine

Library use

import { crawl } from 'thecrawler';

// Plain crawl
const r = await crawl({
    urls: ['https://example.com'],
    extractMarkdown: true,
});
console.log(r.pages[0].markdown);

// Multi-URL with reliability options
const r2 = await crawl({
    urls: ['https://...', 'https://...'],
    extractMarkdown: true,
    requestRetries: 3,           // retry transient failures
    requestTimeoutSecs: 30,
    rotateUserAgent: true,       // rotate from real-browser UA pool
    cache: { enabled: true, ttlSeconds: 300 },
});

// Errors are structured. Branch on errorType + retryable.
for (const p of r2.pages) {
    if (p.status === 'error') {
        console.log(p.errorType, p.errorRetryable, p.error);
        // errorType ∈ 'dns'|'timeout'|'rate-limit'|'blocked-bot'|
        // 'js-required'|'http-4xx'|'http-5xx'|'parse'|'network'|'unknown'
    }
}
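
As a rough illustration of branching on errorType and errorRetryable, the sketch below maps a page outcome to a follow-up action. The PageOutcome shape (including the url field), the NextStep names, and the policy itself are assumptions made for this example, not part of the library.

```typescript
// Error categories as listed in the comment above.
type ErrorType =
    | 'dns' | 'timeout' | 'rate-limit' | 'blocked-bot' | 'js-required'
    | 'http-4xx' | 'http-5xx' | 'parse' | 'network' | 'unknown';

// Minimal local shape for illustration; the real page object has more fields.
interface PageOutcome {
    url: string; // assumed field name
    status: string;
    errorType?: ErrorType;
    errorRetryable?: boolean;
}

// Hypothetical follow-up actions, named for this sketch only.
type NextStep = 'done' | 'retry-later' | 'render-with-browser' | 'try-proxy' | 'give-up';

// Hypothetical policy: decide what to do with a page after a crawl.
function nextStep(page: PageOutcome): NextStep {
    if (page.status !== 'error') return 'done';
    if (page.errorType === 'js-required') return 'render-with-browser';
    if (page.errorType === 'blocked-bot') return 'try-proxy';
    return page.errorRetryable ? 'retry-later' : 'give-up';
}
```

With a policy like this, a caller could re-run only the js-required pages with usePlaywright: true and route blocked-bot pages through proxyUrl.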

LLM-powered structured extraction

Crawls a URL, sends the cleaned markdown to an OpenAI-compatible LLM endpoint with a JSON schema or natural-language prompt, and returns parsed, typed data. Endpoint-agnostic: works against llama.cpp's llama-server, vLLM, LM Studio, Ollama, OpenAI proper, and other compatible /v1/chat/completions endpoints. Schema-backed extraction uses the JSON Schema response format where supported, with fallbacks for endpoints that only support JSON-object or text output.

import { extract } from 'thecrawler';

const r = await extract({
    urls: ['https://shop.example.com/products/123'],
    jsonSchema: {
        type: 'object',
        properties: {
            productName: { type: 'string' },
            price: { type: 'number' },
            currency: { type: 'string' },
            inStock: { type: 'boolean' },
        },
        required: ['productName'],
    },
    llm: {
        baseUrl: 'http://your-llm-host:8080/v1/chat/completions',
        model: 'your-model-name',
        // apiKey: 'optional',
        // temperature: 0,
        // maxTokens: 4000,
        // timeoutSecs: 120,
    },
});

console.log(r[0].data);
// { productName: '...', price: 49.99, currency: 'USD', inStock: true }

ExtractResult includes parsed data, status, structured errorType, rawResponse (for debugging), token usage, and timing breakdown (crawlMs, llmMs, responseTimeMs).
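
Because some endpoints fall back to JSON-object or text output, a caller may want to re-check the parsed data before using it. The type guard below is a minimal sketch that mirrors the example schema above; the Product interface and isProduct function are hypothetical names for this example and are not exported by the library.

```typescript
// Local type matching the example jsonSchema: only productName is required.
interface Product {
    productName: string;
    price?: number;
    currency?: string;
    inStock?: boolean;
}

// Runtime check: productName must be a string; the optional fields must
// have the declared type whenever they are present.
function isProduct(data: unknown): data is Product {
    if (typeof data !== 'object' || data === null) return false;
    const d = data as Record<string, unknown>;
    const optional = (key: string, type: string): boolean =>
        d[key] === undefined || typeof d[key] === type;
    return (
        typeof d['productName'] === 'string' &&
        optional('price', 'number') &&
        optional('currency', 'string') &&
        optional('inStock', 'boolean')
    );
}
```

A caller could then write `if (isProduct(r[0].data)) { ... }` and get typed access to the fields inside the branch.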

CLI

# Crawl
thecrawler crawl https://example.com --markdown
thecrawler crawl https://example.com --retries 5 --timeout 60 --cache

# Search Google + scrape top results
thecrawler search "your query" --markdown

# Sitemap-driven crawl
thecrawler sitemap https://example.com/sitemap.xml --markdown

# Markdown shortcut
thecrawler md https://example.com

# Built-in extraction contract with validation evidence
thecrawler extract https://example.com/listing \
  --contract real-estate-listing \
  --llm-base-url http://localhost:1234/v1/chat/completions \
  --llm-model local-model \
  --evidence-output real-estate-evidence.json

# No-LLM contract readiness diagnostic
thecrawler diagnose https://example.com/listing-1 https://example.com/listing-2 \
  --contract real-estate-listing \
  --output real-estate-workflow-diagnostic.json \
  --report real-estate-workflow-report.md

Extraction contracts

Contracts turn a crawl into a repeatable, validated output shape for agent workflows. The first built-in contract is real-estate-listing, which extracts normalized listing fields such as title, price, location, beds/baths, area, listing type, broker/contact, source URL, confidence, and evidence notes.

Use thecrawler extract --list-contracts to list available contracts. Contract mode returns the normal ExtractResult plus a validation object with valid, requiredFields, and missingRequiredFields, so an agent can branch on extraction quality instead of trusting loose markdown.
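
A minimal sketch of how an agent might branch on the validation object. The field names come from the description above; their types (a boolean and two string arrays) and the describeValidation helper are assumptions for this example.

```typescript
// Assumed shape of the validation object returned in contract mode.
interface ContractValidation {
    valid: boolean;
    requiredFields: string[];
    missingRequiredFields: string[];
}

// Hypothetical helper: turn a validation result into a one-line status.
function describeValidation(v: ContractValidation): string {
    if (v.valid) {
        return `ok: all ${v.requiredFields.length} required fields present`;
    }
    return `incomplete: missing ${v.missingRequiredFields.join(', ')}`;
}
```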

Use thecrawler diagnose <url...> --contract real-estate-listing before LLM extraction to score whether a source or workflow is ready for contract extraction. The diagnostic does not call an LLM; it crawls each page, checks source signals, and returns per-URL verdict, readyForExtraction, score, blockers, warnings, recommendedNextStep, and signal evidence plus an aggregate workflow summary (readyUrls, blockedUrls, workflowVerdict, blockersByType, recommendedNextStep). Add --report report.md to produce a buyer-readable Markdown report without raw extracted contact details or page evidence.
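
The per-URL diagnostic fields can be used to decide which URLs to send on to LLM extraction. The sketch below assumes each per-URL entry carries a url field and that blockers are plain strings; the actual layout of the diagnostic JSON may differ, and splitByReadiness is a hypothetical helper.

```typescript
// Assumed subset of a per-URL diagnostic entry.
interface UrlDiagnostic {
    url: string;                 // assumed field name
    readyForExtraction: boolean;
    score: number;
    blockers: string[];          // assumed to be string identifiers
}

// Hypothetical helper: split URLs by readiness and count blockers by type,
// similar in spirit to the aggregate readyUrls / blockedUrls / blockersByType.
function splitByReadiness(items: UrlDiagnostic[]) {
    const ready: string[] = [];
    const blocked: string[] = [];
    const blockerCounts: Record<string, number> = {};
    for (const item of items) {
        (item.readyForExtraction ? ready : blocked).push(item.url);
        for (const blocker of item.blockers) {
            blockerCounts[blocker] = (blockerCounts[blocker] ?? 0) + 1;
        }
    }
    return { ready, blocked, blockerCounts };
}
```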

MCP server

Five tools: crawl, crawl_markdown, search_and_crawl, crawl_sitemap, extract_structured.

Add to your MCP client config (Claude Code / Cursor / Windsurf):

{
    "mcpServers": {
        "thecrawler": {
            "command": "node",
            "args": ["/path/to/TheCrawler/engine/dist/mcp.js"],
            "env": {
                "NODE_OPTIONS": "--use-system-ca",
                "THECRAWLER_LLM_BASEURL": "http://your-llm-host:8080/v1/chat/completions",
                "THECRAWLER_LLM_MODEL": "your-model-name"
            }
        }
    }
}

The env vars set defaults for extract_structured; per-call args override.
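
The precedence rule above (per-call args win, env vars fill the gaps) can be pictured with a small sketch. This is not the server's actual code; resolveLlmConfig and its argument shapes are hypothetical, and only the two environment variable names come from the config above.

```typescript
// Per-call LLM arguments; either may be omitted.
interface LlmArgs {
    baseUrl?: string | undefined;
    model?: string | undefined;
}

// Resolved settings after applying env defaults.
interface ResolvedLlm {
    baseUrl: string | undefined;
    model: string | undefined;
}

// Hypothetical resolution: a per-call value overrides the env default.
function resolveLlmConfig(
    env: Record<string, string | undefined>,
    callArgs: LlmArgs,
): ResolvedLlm {
    return {
        baseUrl: callArgs.baseUrl ?? env['THECRAWLER_LLM_BASEURL'],
        model: callArgs.model ?? env['THECRAWLER_LLM_MODEL'],
    };
}
```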

REST API server

THECRAWLER_API_KEY=secret thecrawler-api --port 3000

Endpoints: POST /v1/crawl, POST /v1/markdown, POST /v1/search, POST /v1/sitemap, GET /v1/health. The POST endpoints accept the same options as the library.
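
A sketch of building a request for POST /v1/crawl. The README does not say how the server expects the API key to be sent, so the Bearer Authorization header below is an assumption to verify against the server source; buildCrawlRequest is a hypothetical helper, and the body reuses the library options shown earlier.

```typescript
interface CrawlRequest {
    url: string;
    method: 'POST';
    headers: Record<string, string>;
    body: string;
}

// Hypothetical helper. ASSUMPTION: the key travels as a Bearer token;
// the README does not document the expected auth header.
function buildCrawlRequest(
    baseUrl: string,
    apiKey: string,
    options: { urls: string[]; extractMarkdown?: boolean },
): CrawlRequest {
    return {
        // Strip trailing slashes so the path is not doubled.
        url: `${baseUrl.replace(/\/+$/, '')}/v1/crawl`,
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify(options),
    };
}
```

The result could then be passed to fetch as `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.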

What it extracts (out of the box, no extra code)

Per page: title, description, language, canonical URL, robots directives, full text (50K cap), markdown (boilerplate-stripped, GFM), heading-aware chunks, headings (h1-h6), links (with internal/external + rel), images (with lazy-load src), meta tags (incl. OG + Twitter Card), tables, JSON-LD, microdata (itemscope/itemprop), commerce data (price/currency/SKU/rating from JSON-LD Product), forms (action/method/fields), 16 analytics trackers detected (GA4, GTM, Facebook Pixel, Hotjar, Segment, Mixpanel, Amplitude, Heap, Plausible, Matomo, Clarity, LinkedIn, Twitter, Pinterest, TikTok, etc.), emails, phones, social links, hreflang tags, pagination links, redirect chain, response timing, page size.
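
To illustrate what "commerce data from JSON-LD Product" can mean in practice, the sketch below reads price, currency, SKU, and rating from a schema.org Product node. It is a simplified stand-in, not the library's extractor: it handles only a single '@type' string and the first offer, and the CommerceData shape and helper names are assumptions.

```typescript
interface CommerceData {
    price: number | undefined;
    currency: string | undefined;
    sku: string | undefined;
    rating: number | undefined;
}

type JsonObject = Record<string, unknown>;

// Narrow an unknown value to a plain object, or undefined.
const asObject = (v: unknown): JsonObject | undefined =>
    typeof v === 'object' && v !== null && !Array.isArray(v) ? (v as JsonObject) : undefined;

// Accept numbers or numeric strings (JSON-LD prices are often strings).
const asNumber = (v: unknown): number | undefined => {
    const n = typeof v === 'string' && v.trim() !== '' ? Number(v) : v;
    return typeof n === 'number' && Number.isFinite(n) ? n : undefined;
};

const asString = (v: unknown): string | undefined =>
    typeof v === 'string' ? v : undefined;

// Read schema.org Product fields: offers.price, offers.priceCurrency,
// sku, and aggregateRating.ratingValue.
function commerceFromJsonLd(node: JsonObject): CommerceData | undefined {
    if (node['@type'] !== 'Product') return undefined;
    const offers = node['offers'];
    const offer = asObject(Array.isArray(offers) ? offers[0] : offers);
    const rating = asObject(node['aggregateRating']);
    return {
        price: asNumber(offer?.['price']),
        currency: asString(offer?.['priceCurrency']),
        sku: asString(node['sku']),
        rating: asNumber(rating?.['ratingValue']),
    };
}
```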

PDF and DOCX URLs are auto-detected and parsed (text + metadata for PDFs; text + markdown for DOCX).

Adaptive crawling

The default engine is Cheerio (fast HTTP fetch and parse). Set usePlaywright: true for full JS rendering, or adaptiveCrawling: true to try Cheerio first and fall back automatically to Playwright when an SPA shell is detected (text under 200 characters or a known SPA root div).
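
The SPA-shell heuristic described above can be sketched roughly as follows. This is an approximation for illustration, not the library's detector: the list of root ids, the requirement that the root div be empty, and the regex-based text stripping are all assumptions.

```typescript
// ASSUMPTION: common framework mount points; the library's list may differ.
const SPA_ROOT_IDS = ['root', 'app', '__next'];

// Crude visible-text estimate: drop script/style bodies, then all tags.
function visibleText(html: string): string {
    return html
        .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
        .replace(/<[^>]+>/g, ' ')
        .replace(/\s+/g, ' ')
        .trim();
}

// True when the page has little text or contains an empty known root div.
function looksLikeSpaShell(html: string, minTextLength = 200): boolean {
    if (visibleText(html).length < minTextLength) return true;
    return SPA_ROOT_IDS.some((id) =>
        new RegExp(`<div[^>]*\\bid=["']${id}["'][^>]*>\\s*</div>`, 'i').test(html),
    );
}
```

A crawler using a check like this would fetch with plain HTTP first and re-render in a browser only when the check returns true.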

Anti-bot resilience

User-Agent rotation from a real-browser pool (Chrome/Firefox/Safari, randomized per request). Anti-bot challenge page detection: when a 200 response carries a Cloudflare/WAF challenge body ("checking your browser", "attention required", "cloudflare ray id"), the page is marked errorType: 'blocked-bot' rather than silently returning the challenge HTML.
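
A minimal sketch of the challenge-page check described above, using the three marker phrases quoted in the text. The function name and the case-insensitive matching are assumptions; the library's actual detection may use more signals.

```typescript
// Marker phrases quoted in the paragraph above.
const CHALLENGE_MARKERS = [
    'checking your browser',
    'attention required',
    'cloudflare ray id',
];

// True when a 200 response body looks like a Cloudflare/WAF challenge page.
// ASSUMPTION: matching is case-insensitive substring search.
function looksLikeChallenge(statusCode: number, body: string): boolean {
    if (statusCode !== 200) return false;
    const haystack = body.toLowerCase();
    return CHALLENGE_MARKERS.some((marker) => haystack.includes(marker));
}
```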

For harder targets, supply proxyUrl (any Crawlee-compatible HTTP(S) proxy URL).

License

AGPL-3.0-or-later. Commercial licensing is available; contact the maintainers through the GitHub repo.
