manchittlab

TheCrawler

Open-source web scraper with LLM-powered structured extraction. Handles PDF/DOCX, markdown, JSON-LD, microdata, commerce data, forms, and detection of 16 analytics trackers. Structured errors with retryable flags. Adaptive Cheerio→Playwright. Available as a CLI, npm package, REST API, and MCP server. AGPL-3.0.

Web scraper + validated extraction contracts for AI agents. PDF/DOCX, markdown, JSON-LD/microdata/commerce/forms/analytics-detection, no-LLM readiness diagnostics, structured errors, UA rotation, retry/timeout, optional in-memory cache. Adaptive Cheerio→Playwright. Open source, AGPL-3.0.

Install

npm install thecrawler

Or from a local checkout:

npm install file:/path/to/TheCrawler/engine

Library use

import { crawl } from 'thecrawler';

// Plain crawl
const r = await crawl({
    urls: ['https://example.com'],
    extractMarkdown: true,
});
console.log(r.pages[0].markdown);

// Multi-URL with reliability options
const r2 = await crawl({
    urls: ['https://...', 'https://...'],
    extractMarkdown: true,
    requestRetries: 3,           // retry transient failures
    requestTimeoutSecs: 30,
    rotateUserAgent: true,       // rotate from real-browser UA pool
    cache: { enabled: true, ttlSeconds: 300 },
});

// Errors are structured. Branch on errorType + retryable.
for (const p of r2.pages) {
    if (p.status === 'error') {
        console.log(p.errorType, p.errorRetryable, p.error);
        // errorType ∈ 'dns'|'timeout'|'rate-limit'|'blocked-bot'|
        // 'js-required'|'http-4xx'|'http-5xx'|'parse'|'network'|'unknown'
    }
}
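
One way to act on errorType and errorRetryable is to split failed pages into a retry list and a give-up list. The following is a minimal sketch, not part of the library: the PageStatus type and the partitionFailures helper are hypothetical, and the url field on each page is an assumption.

```typescript
// Hypothetical local shape mirroring the fields used in the loop above.
// The library's real page type may differ; `url` is an assumption.
type PageStatus = {
    url: string;
    status: string;
    errorType?: string;
    errorRetryable?: boolean;
    error?: string;
};

// Split failed pages into URLs worth retrying and URLs to give up on.
function partitionFailures(pages: PageStatus[]): { retry: string[]; giveUp: string[] } {
    const retry: string[] = [];
    const giveUp: string[] = [];
    for (const p of pages) {
        if (p.status !== 'error') continue; // successful pages are ignored
        if (p.errorRetryable) {
            retry.push(p.url);
        } else {
            giveUp.push(p.url);
        }
    }
    return { retry, giveUp };
}
```

The retry list could then be passed back to crawl() as a new urls array, for example after a delay when errorType is 'rate-limit'.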

LLM-powered structured extraction

Crawls a URL, sends the cleaned markdown to an OpenAI-compatible LLM endpoint with a JSON schema or natural-language prompt, returns parsed typed data. Endpoint-agnostic: works against llama.cpp's llama-server, vLLM, LM Studio, Ollama, OpenAI proper, and compatible /v1/chat/completions endpoints. Schema-backed extraction uses JSON Schema response format where supported, with fallbacks for endpoints that only support JSON-object or text output.

import { extract } from 'thecrawler';

const r = await extract({
    urls: ['https://shop.example.com/products/123'],
    jsonSchema: {
        type: 'object',
        properties: {
            productName: { type: 'string' },
            price: { type: 'number' },
            currency: { type: 'string' },
            inStock: { type: 'boolean' },
        },
        required: ['productName'],
    },
    llm: {
        baseUrl: 'http://your-llm-host:8080/v1/chat/completions',
        model: 'your-model-name',
        // apiKey: 'optional',
        // temperature: 0,
        // maxTokens: 4000,
        // timeoutSecs: 120,
    },
});

console.log(r[0].data);
// { productName: '...', price: 49.99, currency: 'USD', inStock: true }

ExtractResult includes parsed data, status, structured errorType, rawResponse (for debugging), token usage, and timing breakdown (crawlMs, llmMs, responseTimeMs).
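
The response-format fallback described at the start of this section could look roughly like the sketch below. It is not TheCrawler's implementation. The response_format shapes follow the OpenAI chat completions convention, while responseFormatCandidates, withFormatFallback, the schema name 'extraction', and the send callback are hypothetical.

```typescript
// Response formats from strictest to loosest. `undefined` means plain-text
// output, where the caller must parse JSON out of the reply itself.
type ResponseFormat =
    | { type: 'json_schema'; json_schema: { name: string; schema: object } }
    | { type: 'json_object' }
    | undefined;

function responseFormatCandidates(schema: object): ResponseFormat[] {
    return [
        { type: 'json_schema', json_schema: { name: 'extraction', schema: schema } },
        { type: 'json_object' },
        undefined,
    ];
}

// Try each format in turn. `send` is a hypothetical callback that posts the
// chat completion request and throws when the endpoint rejects the format.
async function withFormatFallback<T>(
    schema: object,
    send: (format: ResponseFormat) => Promise<T>,
): Promise<T> {
    let lastError: unknown;
    for (const format of responseFormatCandidates(schema)) {
        try {
            return await send(format);
        } catch (err) {
            lastError = err; // remember the failure and try a looser format
        }
    }
    throw lastError;
}
```

Ordering the candidates from strictest to loosest means an endpoint that supports JSON Schema gets the strongest guarantee, and weaker endpoints still produce output that can be parsed afterwards.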

CLI

# Crawl
thecrawler crawl https://example.com --markdown
thecrawler crawl https://example.com --retries 5 --timeout 60 --cache

# Search Google + scrape top results
thecrawler search "your query" --markdown

# Sitemap-driven crawl
thecrawler sitemap https://example.com/sitemap.xml --markdown

# Markdown shortcut
thecrawler md https://example.com

# Built-in extraction contract with validation evidence
thecrawler extract https://example.com/listing \
  --contract real-estate-listing \
  --llm-base-url http://localhost:1234/v1/chat/completions \
  --llm-model local-model \
  --evidence-output real-estate-evidence.json

# No-LLM contract readiness diagnostic
thecrawler diagnose https://example.com/listing-1 https://example.com/listing-2 \
  --contract real-estate-listing \
  --output real-estate-workflow-diagnostic.json \
  --report real-estate-workflow-report.md

Extraction contracts

Contracts turn a crawl into a repeatable, validated output shape for agent workflows. The first built-in contract is real-estate-listing, which extracts normalized listing fields such as title, price, location, beds/baths, area, listing type, broker/contact, source URL, confidence, and evidence notes.

Use thecrawler extract --list-contracts to list available contracts. Contract mode returns the normal ExtractResult plus a validation object with valid, requiredFields, and missingRequiredFields, so an agent can branch on extraction quality instead of trusting loose markdown.
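
To illustrate how a validation object of this shape can be derived, here is a minimal sketch. The validateRequired helper is hypothetical, and treating null and empty strings as missing is an assumption rather than documented behavior.

```typescript
type Validation = {
    valid: boolean;
    requiredFields: string[];
    missingRequiredFields: string[];
};

// Check extracted data against a contract's required fields.
// Assumption: undefined, null, and '' all count as missing.
function validateRequired(data: Record<string, unknown>, requiredFields: string[]): Validation {
    const missingRequiredFields = requiredFields.filter((field) => {
        const value = data[field];
        return value === undefined || value === null || value === '';
    });
    return {
        valid: missingRequiredFields.length === 0,
        requiredFields: requiredFields,
        missingRequiredFields: missingRequiredFields,
    };
}
```

An agent can then branch on valid, and use missingRequiredFields to decide whether to retry with Playwright rendering, try another source, or flag the record for review.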

Use thecrawler diagnose <url...> --contract real-estate-listing before LLM extraction to score whether a source or workflow is ready for contract extraction. The diagnostic does not call an LLM; it crawls each page and checks source signals. For each URL it returns verdict, readyForExtraction, score, blockers, warnings, recommendedNextStep, and signal evidence. It also returns an aggregate workflow summary (readyUrls, blockedUrls, workflowVerdict, blockersByType, recommendedNextStep). Add --report report.md to produce a buyer-readable Markdown report that omits raw extracted contact details and page evidence.
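
The aggregate summary can be pictured as a fold over the per-URL results. The sketch below shows one plausible aggregation, not the tool's actual logic. The field types, the summarizeWorkflow helper, and the verdict values 'ready', 'partial', and 'blocked' are all assumptions.

```typescript
// Assumed per-URL shape: only the fields needed for aggregation.
type UrlDiagnostic = { url: string; readyForExtraction: boolean; blockers: string[] };

type WorkflowSummary = {
    readyUrls: number;
    blockedUrls: number;
    workflowVerdict: 'ready' | 'partial' | 'blocked'; // hypothetical values
    blockersByType: Record<string, number>;
};

function summarizeWorkflow(diagnostics: UrlDiagnostic[]): WorkflowSummary {
    let readyUrls = 0;
    const blockersByType: Record<string, number> = {};
    for (const d of diagnostics) {
        if (d.readyForExtraction) readyUrls += 1;
        for (const blocker of d.blockers) {
            blockersByType[blocker] = (blockersByType[blocker] || 0) + 1;
        }
    }
    const blockedUrls = diagnostics.length - readyUrls;
    let workflowVerdict: WorkflowSummary['workflowVerdict'] = 'partial';
    if (readyUrls === 0) workflowVerdict = 'blocked';
    else if (blockedUrls === 0) workflowVerdict = 'ready';
    return { readyUrls, blockedUrls, workflowVerdict, blockersByType };
}
```

Counting blockers by type makes it easy to see whether a workflow fails for one systematic reason, such as JavaScript rendering, or for many unrelated ones.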

MCP server

Five tools: crawl, crawl_markdown, search_and_crawl, crawl_sitemap, extract_structured.

Add to your MCP client config (Claude Code / Cursor / Windsurf):

{
    "mcpServers": {
        "thecrawler": {
            "command": "node",
            "args": ["/path/to/TheCrawler/engine/dist/mcp.js"],
            "env": {
                "NODE_OPTIONS": "--use-system-ca",
                "THECRAWLER_LLM_BASEURL": "http://your-llm-host:8080/v1/chat/completions",
                "THECRAWLER_LLM_MODEL": "your-model-name"
            }
        }
    }
}

The env vars set defaults for extract_structured; per-call args override.
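
The precedence rule (per-call arguments first, environment defaults second) can be sketched as follows. The resolveLlmConfig helper and the LlmConfig type are hypothetical; only the two environment variable names come from the config above.

```typescript
type LlmConfig = { baseUrl?: string; model?: string };

// Per-call arguments win; environment variables fill in whatever is missing.
function resolveLlmConfig(
    env: Record<string, string | undefined>,
    callArgs: LlmConfig = {},
): LlmConfig {
    return {
        baseUrl: callArgs.baseUrl ?? env['THECRAWLER_LLM_BASEURL'],
        model: callArgs.model ?? env['THECRAWLER_LLM_MODEL'],
    };
}
```

In a Node process, process.env would be passed as the env argument.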

REST API server

THECRAWLER_API_KEY=secret thecrawler-api --port 3000

Endpoints: POST /v1/crawl, POST /v1/markdown, POST /v1/search, POST /v1/sitemap, GET /v1/health. The POST endpoints accept the same options as the library.
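
A client call to POST /v1/crawl might be assembled as in the sketch below. It assumes the request body is the library's options object sent as JSON. How the API key is transmitted is not documented in this README, so the hypothetical buildCrawlRequest helper takes auth headers from the caller instead of guessing a scheme.

```typescript
// Subset of the library options shown earlier in this README.
type CrawlOptions = { urls: string[]; extractMarkdown?: boolean; requestRetries?: number };

// Build the URL and fetch() init for POST /v1/crawl without sending anything.
function buildCrawlRequest(
    baseUrl: string,
    options: CrawlOptions,
    authHeaders: Record<string, string> = {}, // auth scheme is left to the caller
) {
    return {
        url: baseUrl.replace(/\/+$/, '') + '/v1/crawl',
        init: {
            method: 'POST',
            headers: { 'Content-Type': 'application/json', ...authHeaders },
            body: JSON.stringify(options),
        },
    };
}
```

The result can be passed to fetch as fetch(req.url, req.init) once the server from the command above is running on port 3000.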

What it extracts (out of the box, no extra code)

Per page: title, description, language, canonical URL, robots directives, full text (50K cap), markdown (boilerplate-stripped, GFM), heading-aware chunks, headings (h1-h6), links (with internal/external + rel), images (with lazy-load src), meta tags (incl. OG + Twitter Card), tables, JSON-LD, microdata (itemscope/itemprop), commerce data (price/currency/SKU/rating from JSON-LD Product), forms (action/method/fields), 16 analytics trackers detected (GA4, GTM, Facebook Pixel, Hotjar, Segment, Mixpanel, Amplitude, Heap, Plausible, Matomo, Clarity, LinkedIn, Twitter, Pinterest, TikTok, etc.), emails, phones, social links, hreflang tags, pagination links, redirect chain, response timing, page size.

PDF and DOCX URLs are auto-detected and parsed (text + metadata for PDFs; text + markdown for DOCX).
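
Auto-detection of this kind is commonly done from the Content-Type header, with the URL's file extension as a fallback. The sketch below shows that approach. It is an assumption about how detection could work, not a description of TheCrawler's internals, and detectDocKind is a hypothetical name.

```typescript
type DocKind = 'pdf' | 'docx' | 'html';

const DOCX_MIME = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';

// Prefer the Content-Type header; fall back to the URL path's extension.
function detectDocKind(url: string, contentType?: string): DocKind {
    const ct = (contentType || '').toLowerCase();
    if (ct.indexOf('application/pdf') === 0) return 'pdf';
    if (ct.indexOf(DOCX_MIME) === 0) return 'docx';
    // Strip query string and fragment before checking the extension.
    const path = (url.split(/[?#]/)[0] || '').toLowerCase();
    if (/\.pdf$/.test(path)) return 'pdf';
    if (/\.docx$/.test(path)) return 'docx';
    return 'html';
}
```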

Adaptive crawling

The default crawler is Cheerio (fast HTTP fetch and parse). Set usePlaywright: true for full JS rendering, or adaptiveCrawling: true to try Cheerio first and fall back to Playwright automatically when an SPA shell is detected (text under 200 characters or a known SPA root div).
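
The SPA-shell heuristic can be approximated as in the sketch below. It uses plain string processing instead of Cheerio so that it stays self-contained. The root div ids (root, app, __next) and the looksLikeSpaShell helper are assumptions. Only the 200-character threshold comes from the description above.

```typescript
// Assumed list of "known SPA root div" ids: an empty div with one of these.
const SPA_ROOT_PATTERN = /<div[^>]+id=["'](root|app|__next)["'][^>]*>\s*<\/div>/i;

// Return true when the HTML looks like an unrendered SPA shell.
function looksLikeSpaShell(html: string): boolean {
    const text = html
        .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop inline scripts
        .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop inline styles
        .replace(/<[^>]+>/g, ' ')                    // drop remaining tags
        .replace(/\s+/g, ' ')
        .trim();
    return text.length < 200 || SPA_ROOT_PATTERN.test(html);
}
```

When this returns true for a Cheerio response, an adaptive crawler would re-fetch the same URL with Playwright so that client-side rendering can run.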

Anti-bot resilience

User-Agent rotation from a real-browser pool (Chrome/Firefox/Safari, randomized per request). Anti-bot challenge page detection: when a 200 response carries a Cloudflare/WAF challenge body ("checking your browser", "attention required", "cloudflare ray id"), the page is marked errorType: 'blocked-bot' rather than silently returning the challenge HTML.
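
Challenge-page detection of this kind reduces to a case-insensitive substring check on a 200 response. The sketch below uses the three marker strings quoted above. The classifyChallenge helper is hypothetical, and the real detector may check more signals.

```typescript
// Marker phrases quoted in the paragraph above.
const CHALLENGE_MARKERS = ['checking your browser', 'attention required', 'cloudflare ray id'];

// Return 'blocked-bot' when a 200 response body looks like a challenge page.
function classifyChallenge(statusCode: number, body: string): 'blocked-bot' | null {
    if (statusCode !== 200) return null; // other statuses are classified elsewhere
    const lower = body.toLowerCase();
    const hit = CHALLENGE_MARKERS.some((marker) => lower.indexOf(marker) !== -1);
    return hit ? 'blocked-bot' : null;
}
```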

For harder targets, supply proxyUrl (any Crawlee-compatible HTTP(S) proxy URL).

License

AGPL-3.0-or-later. Commercial licensing is available; contact the maintainers through the GitHub repo.
