The local-first data studio with a built-in AI assistant.
Duckle is an open-source desktop ETL / ELT studio. Drag a pipeline onto the canvas, describe what you need in plain English to Duckie (the on-device AI assistant), and execute at native speed through DuckDB. 290+ connectors, 50+ transforms, a built-in scheduler, and a chat assistant that runs entirely on your CPU. Ships as a ~65 MB single-file desktop app. No cloud, no servers, no lock-in.
Quick links
Get started
|
Use the product
|
Reference
|
Resources
|
What is Duckle?
A visual data pipeline studio that runs on your laptop. Drag sources, transforms, validators, and sinks onto a canvas. Wire them together. Press Run. Duckle compiles the graph to SQL and executes it through a real columnar engine, with live previews, generated SQL on every node, and zero hidden state.
Three things make Duckle different from the heavyweights and the toy ETL tools:
- An AI assistant that ships in the box. Describe the pipeline you want in English; Duckie writes the JSON and drops it onto the canvas. The model runs locally - no API key, no telemetry, no cloud round-trip.
- 290+ connectors at install time. Files, lakehouses, SQL databases, warehouses, NoSQL, vector DBs, streaming brokers, SaaS REST/GraphQL APIs, even FTP and IMAP - working today, not coming-soon.
- A self-contained binary you can audit. ~65 MB download. Engines install on first launch. Workspaces are plain files in a folder you choose. Diff them, branch them, ship them.
Meet Duckie - the local AI pipeline assistant
Describe what you need. Duckie writes the pipeline.
The sidebar on the right is Duckie AI Assistant - powered by Qwen 2.5 Coder 1.5B running through llama.cpp, downloaded once (~1.1 GB) and then run entirely on your CPU. Ask in plain English; Duckie streams back a valid Duckle pipeline definition. One click drops it onto the canvas, ready to inspect, tweak, and run.
| Truly local | The Qwen model runs as a llama-server subprocess on 127.0.0.1. No API keys. No network calls. Disconnect your wifi and it keeps working. |
| Streamed responses | Tokens arrive as they're generated, with a blinking caret in the bubble. No "wait 20 seconds for the spinner to vanish" UX. |
| One-click insert | When Duckie produces a JSON pipeline, an Insert into canvas button appears. The graph populates with positioned nodes, wired edges, and the props the model chose. |
| Bring-your-own-model option | The chat plumbing is the same OpenAI-compatible HTTP interface used by xf.ai.llm / xf.ai.embed connectors. Point baseUrl at Ollama, llama.cpp, Cohere, OpenAI, Voyage - anything that speaks the OpenAI shape. |
| Sandboxed | The model has no fs / net / tool access. It can only emit text - your pipeline JSON. |
Why Duckle is different
| Visual, never opaque | The canvas compiles to SQL you can read, and every node has a live preview tab. No black box. |
| Local-first AI | An assistant that runs on your laptop without an API key. Your prompts, your data, your machine. |
| Single-file binary, no bundled DB | ~65 MB app (it embeds the headless runner + MCP server). DuckDB downloads on first launch with a guided step. AI engine is opt-in. |
| Native speed | Execution runs through DuckDB: vectorized, columnar, local. A clean-and-export job that crawls in a spreadsheet finishes in milliseconds. |
| Git-friendly by design | Pipelines, connections, contexts, and routines persist as plain files in a folder you pick. Diff them, branch them, review them. |
| 290+ connectors that work | Files, databases, warehouses, lakehouses, object stores, SaaS APIs, NoSQL, streaming brokers, vector DBs, FTP, IMAP, SMTP. Each is covered by tests. |
| Honest about scope | Single-machine and embedded by design. Built to make local and small-team data work fast, not to replace a distributed warehouse. |
| 60 UI languages | Topbar, palette, chat assistant, properties panel, and common dialogs ship localized. English, Spanish, Chinese (Simplified + Traditional), Hindi, Arabic, Portuguese (Brazil), Bengali, Russian, Japanese, Punjabi, German, Korean, French, Vietnamese, Telugu, Marathi, Turkish, Tamil, Urdu, Persian, Polish, Italian, Ukrainian, Indonesian, Thai, Dutch, Hebrew, Swedish, Greek, Czech, Hungarian, Romanian, Filipino, Malay, Norwegian, Danish, Finnish, Catalan, Bulgarian, Slovak, Croatian, Serbian, Slovenian, Lithuanian, Latvian, Estonian, Khmer, Burmese, Sinhala, Nepali, Swahili, Afrikaans, Welsh, Irish, Icelandic, Albanian, Azerbaijani, Mongolian, Kazakh. RTL (Arabic, Hebrew, Persian, Urdu) supported. Switch languages from the topbar globe. |
| Open source | Dual-licensed MIT OR Apache-2.0. Yours to use, fork, and extend. |
Status
Duckle is in public beta. The visual designer, the DuckDB execution engine, the scheduler, the cloud connectors, and the Duckie AI assistant all work today and are covered by 170+ integration tests across Linux, macOS, and Windows. The catalog is still growing and APIs may evolve before 1.0, but the day-to-day surface is stable enough for real work.
Scope, stated plainly: Duckle is a single-machine, embedded studio. If you outgrow one box, point Duckle's output at the system that scales (a warehouse, an object store, a lakehouse). It will not pretend to be a cluster.
The component palette ships 313 nodes so the roadmap is visible in the product itself:
- 292 available runs on the DuckDB engine today
- 5 preview is configurable in the designer (drag, wire, set properties); execution is being wired engine-by-engine
- 16 planned is reserved in the palette but not yet executable - see
docs/roadmap.md
Screenshots
Real pipelines, built and run in Duckle - not mockups.
A 5M-row pipeline: a CSV, a Parquet file, a DuckDB table, and a SQLite table enriched through one visual Map (3-way join), no SQL.
Left: the visual Map editor - main plus lookups, per-output expressions, an inline filter. Right: Parallelize fanning out aggregate, window, and top-N branches.
One run, many branches: 16 nodes finish in a few seconds. Concurrency auto-detects from CPU cores; branches write to Parquet, CSV, DuckDB, and SQLite at once.
Left: DuckLake CDC change-feed mirrored via upsert + delete propagation (100k rows). Right: watermark incremental load over 5M rows, advancing state only on a fully successful run.
Capabilities
Duckle is not a CSV tool with extras. It reads a broad set of formats and sources, ships a deep transform library, and writes to files, databases, object storage, vector DBs, message buses, and email.
Sources (74 available)
| Group | Connectors | Status |
|---|---|---|
| Files | CSV, TSV, Parquet, JSON, JSONL / NDJSON, Excel (.xlsx), YAML, TOML, Fixed-width (mainframe / banking positional dumps), XML (slash-separated rowPath), Apache Avro (.avro / .ocf, pure-Rust) | Available |
| Geospatial files | GeoJSON, Shapefile, GeoPackage, KML, GPX, GML via the spatial extension |
Available (lazy-loaded) |
| Lakehouse table formats | Apache Iceberg, Delta Lake, DuckLake | Available |
| Embedded databases | SQLite (read tables), DuckDB (read tables or run a query) | Available |
| Network relational DBs | PostgreSQL, MySQL, MariaDB, CockroachDB | Available (live CI for PG + MySQL) |
| Network relational DBs | SQL Server (TDS), Oracle (Instant Client at runtime), ClickHouse (HTTP API) | Available |
| Network relational DBs | IBM DB2, generic JDBC | Planned |
| Object storage | Amazon S3, Google Cloud Storage, Azure Blob, HTTP(S), MinIO, Cloudflare R2, Backblaze B2 | Available (live CI for MinIO) |
| Cloud warehouses | MotherDuck, Snowflake (SQL API + PAT/JWT), BigQuery, Redshift (postgres ATTACH), Databricks SQL (Statement Execution + chunk follow), Azure Synapse (TDS), DuckDB Quack (May 2026 remote protocol - HTTP on :9494, SECRET-based token auth) | Available |
| Streaming | Apache Kafka / Redpanda (pure-Rust rskafka), NATS JetStream, GCP Pub/Sub (REST + auto-ack), RabbitMQ (lapin AMQP), AWS Kinesis (HTTP + SigV4 - no AWS SDK) |
Available |
| Streaming | Pulsar, Event Hubs, multi-shard Kinesis | Planned |
| APIs and SaaS (REST) | Salesforce, HubSpot, Pipedrive, Zendesk, Intercom, Stripe, QuickBooks, Xero, Shopify, Notion, Airtable, Asana, Trello, ClickUp, Monday.com, GitHub, GitLab, Linear, Jira, Slack, Discord, Telegram, Twilio, Mailchimp, SendGrid, Segment - thin pre-configured wrappers over src.rest / src.graphql |
Available |
| APIs (protocols) | OData v4 (follows @odata.nextLink), SOAP / generic XML APIs (XML response parsing with namespace local-name match) |
Available |
| NoSQL and search | MongoDB (official driver), Cassandra / ScyllaDB (CQL), Elasticsearch / OpenSearch (from+size + search_after), Redis (SCAN + GET), CouchDB (_all_docs), DynamoDB (HTTP + SigV4 - no AWS SDK; auto-unwraps typed attributes) |
Available |
| Vector / AI databases | pgvector (postgres ATTACH), Qdrant (/points/scroll), Weaviate (/v1/objects), Milvus (/v1/vector/query) |
Available |
| Vector / AI databases | Pinecone (no list-all-vectors API), Chroma, LanceDB | Preview |
| File transfer | FTP / FTPS (pure-Rust suppaftp) and SFTP (SSH, pure-Rust russh + russh-sftp on the ring backend; password or private-key auth, optional host-fingerprint pin) - one File Transfer component, pick the protocol. Glob filter, base64 content per file |
Available |
| Mailbox | IMAP (rustls TLS, mail-parser) - basic auth today, OAuth (gmail / o365) on the roadmap |
Available |
| Webhook listener | Binds 127.0.0.1:port, collects N inbound HTTP requests with a timeout, parses JSON-object / JSON-array bodies into rows |
Available |
| Desktop | System clipboard (pure-Rust arboard, auto-detects JSON-array shape) |
Available |
| Repos | Git (commit log or file tree from a local working copy; shells out to system git CLI) |
Available |
For CSV / TSV sources, the Schema panel accepts an optional per-column Format (a strptime token string such as %d/%m/%Y) on Date and Timestamp columns. Several date columns can each parse a different layout in one read - the column is read as text and re-parsed with its own format, working around DuckDB's single global date format. A value that does not match its format becomes null rather than failing the run.
Transforms (126 available)
| Group | Operations |
|---|---|
| Fields | Map (visual mapper: joins a main input to up to 3 lookup inputs with inner / left joins and per-output expressions + filter), Project / Select, Cast, Rename, Add / Drop / Reorder Column, Coalesce, UUID v4 |
| Rows | Filter (visual or raw SQL, with reject port), Distinct, Sample, Top N / Limit, Sort, Skip, Top N per Group, Forward Fill, Backward Fill, Constant Fill |
| Aggregate | Group By, Rollup, Cube, Count, Window Aggregate, Cumulative, Approx Quantile (t-digest), Approx Count Distinct (HyperLogLog) |
| Join | Inner, Left, Right, Full Outer, Cross, Lookup, Semi, Anti, Spatial Join |
| Set operations | Union, Union All, Intersect, Except / Minus |
| Window | Row Number, Rank, Dense Rank, Lead, Lag, First Value, Last Value, NTile |
| Strings | Regex Replace, Regex Extract, Regex Match, Split, Concat, Trim, Case Change, Length, Substring, Format, Hash (md5 / sha1 / sha256), IP Parse, URL Parse, Text Similarity (Levenshtein / Jaro-Winkler / Jaccard), Base64, Pad, Text Match |
| Date / Time | Parse, Format, Extract Part, Date Diff / Add, Truncate, Timezone Convert, Time Bin, Current Timestamp, Epoch Convert |
| Numeric | Round, Modulo, Absolute, Logarithm, Power, Square Root, Bucketize, Z-Score, Clamp, Sign |
| JSON / nested | Parse, Stringify, Flatten, JSONPath Extract, Merge Objects, Array Aggregate |
| Array | Explode / Unnest, Collect List, Element At, Contains, Distinct, Length |
| Pivot / shape | Pivot, Unpivot, Denormalize, Normalize, Transpose |
| CDC / SCD | Incremental Load (watermark column; saves the high-water mark to workspace state and advances only on a fully successful run), Diff Detect, SCD Type 1, SCD Type 2 (valid_from / valid_to / is_current), Merge / Upsert (universal across embedded, network, warehouse and Mongo sinks, with optional delete propagation driven by a CDC change-type column), DuckLake CDC change-feed reader, Row Hash (md5 / sha1 / sha256 fingerprint), Audit Stamp (_loaded_at / _loaded_date / _source / _batch_id) |
| AI / Search | Vector Similarity Search (cosine / L2 / inner product over FLOAT[N] via vss), Full-Text Search (BM25 via fts), Embeddings (OpenAI-compatible /v1/embeddings), LLM Transform (per-row chat completion with {column} templates), Classify (LLM-backed, normalizes to UNKNOWN), Text Chunker (RAG-ready, pure local), PII Redact (regex - emails / phones / SSNs / cards), Semantic Dedupe (cosine over precomputed embeddings) |
| Geospatial | Spatial Distance (ST_Distance), Spatial Buffer (ST_Buffer), Spatial Intersects (ST_Intersects) |
| Debug | Log Rows, Assert (hard-fail on SQL predicate violation) |
All 6 AI transforms ship today. Three need a model API (LLM, Classify, Embeddings) and ride the apiKey-in-props pattern; three are pure-local (Chunk, PII Redact, Dedupe).
Data quality (12 available)
Validators split their input: passing rows continue on the main port, failures route to a reject port you can sink, count, or inspect.
| Component | Behavior |
|---|---|
| Not-Null Check | Pass rows with no nulls in the chosen columns |
| Range Check | Pass rows inside a numeric range (inclusive or exclusive) |
| Regex Match | Pass rows whose column fully matches a pattern |
| Uniqueness Check | Pass the first row per key; route duplicates to reject |
| Schema Validate | Reject rows where any expected column is null |
| Column Profile | Per-column stats (count, null %, distinct, min / max, quartiles) via SUMMARIZE |
| Describe | Column names + types of the input |
| Histogram | Value frequencies for one column, most-frequent first |
| Standardize | Trim + case-normalize + collapse inner whitespace, in place |
| Fuzzy Deduplicate | Keep the first row per near-duplicate cluster |
| Record Match | Self-join: emit pairs of rows above a similarity threshold |
| Address Cleanse | Address parsing / normalization (planned - needs external lib) |
Custom code (7 available)
| Capability | What it does |
|---|---|
| Inline SQL | Write a SELECT; the upstream node is exposed as input, result runs as a real materialized stage |
| SQL Template | Parameterized SQL with ${context.var} substitution |
| SQL Routines | Reusable, named SQL saved in the workspace |
| Shell | Run any shell command; emits {stdout, stderr, exit_code, duration_ms}. Platform-aware default shell. Optional timeoutMs kills the child. |
| WebAssembly UDF | Per-row WASM transform via pure-Rust wasmi. Sandboxed (no fs / net / env). Works with any WASM toolchain (Rust, AssemblyScript, C, TinyGo). |
| JavaScript UDF | Per-row JS transform via pure-Rust boa interpreter. Sandboxed. Define a transform(row) function. |
| Python / Rust UDFs | Embedded-language stages |
Sinks (58 available)
| Group | Connectors | Status |
|---|---|---|
| Files | CSV, TSV, Parquet (ZSTD), JSON, JSONL / NDJSON, Excel (.xlsx), YAML, TOML, XML (configurable wrappers), Avro (schema inferred from first row). Parquet + CSV support Hive-partitioned writes | Available |
| Geospatial files | GeoJSON, GeoPackage, Shapefile, KML, GPX via GDAL | Available (lazy-loaded) |
| Lakehouse | Apache Iceberg (full table layout), DuckLake - modes: overwrite, append, truncate, upsert (set-based delete-by-key + re-insert) with optional CDC delete propagation | Available |
| Embedded databases | SQLite, DuckDB - modes: overwrite, append, upsert (set-based delete-by-key + re-insert, no PK required) with optional CDC delete propagation | Available |
| Network relational DBs | PostgreSQL, MySQL, MariaDB, CockroachDB - modes: overwrite, append, truncate, upsert (ON CONFLICT / ON DUPLICATE KEY) with optional CDC delete propagation | Available (live CI for PG + MySQL) |
| Network relational DBs | SQL Server / Azure Synapse (TDS, multi-row VALUES batched; auto-creates the table if absent; upsert via MERGE), Oracle (Instant Client; INSERT ALL, batched per statement; auto-creates the table if absent; upsert via MERGE), ClickHouse (HTTP JSONEachRow; upsert by pointing at a ReplacingMergeTree target table) - every MERGE sink supports CDC delete propagation (a delete-flag column removes matched rows) | Available (SQL Server + Oracle + MySQL upsert and delete propagation verified live in Docker) |
| Network relational DBs | IBM DB2, generic JDBC | Planned |
| Object storage | S3, GCS, Azure Blob via DuckDB httpfs (MinIO / R2 / B2 via endpoint) |
Available |
| Cloud warehouses | MotherDuck, Snowflake (PAT or JWT RS256; upsert + delete propagation via MERGE), BigQuery, Redshift, Databricks SQL (upsert + delete propagation via MERGE), Azure Synapse, DuckDB Quack (concurrent writers to remote DuckDB via the May 2026 protocol) | Available (Snowflake MERGE verified live against the SQL-API emulator) |
| HTTP APIs | REST (POST/PUT/PATCH batched JSON-array), Webhook (one POST per row), GraphQL mutations | Available |
| Email (SMTP) | Per-row SMTP send via pure-Rust lettre + rustls. Plain text v1; HTML + attachments follow. |
Available |
| NoSQL | MongoDB (insert_many batched; upsert via replace_one on a key, plus delete propagation via delete_one), Cassandra / ScyllaDB (CQL), Elasticsearch / OpenSearch (_bulk NDJSON), Redis (pipelined SET) |
Available |
| NoSQL | DynamoDB | Planned |
| Streaming | Kafka / Redpanda (rskafka), NATS JetStream, GCP Pub/Sub (REST + OAuth2), RabbitMQ (lapin) |
Available |
| Streaming | Pulsar, Kinesis | Planned |
| Vector / AI databases | pgvector, Pinecone (/vectors/upsert), Qdrant (/points PUT), Weaviate (/v1/batch/objects), Milvus (/v1/vector/insert) |
Available |
| Vector / AI databases | Chroma, LanceDB | Preview (need vendor SDK) |
Control flow (19 available)
| Component | What it does |
|---|---|
| Replicate / Tee | Send the same data to multiple downstream outputs |
| Merge Streams | Concatenate multiple input streams (UNION ALL) |
| Switch / Conditional Split | Route rows to case_1..N outputs by boolean (first match wins); default for unmatched |
| Wait / Delay | Sleep N ms / s / min / h before passing rows through |
| Throttle | Inter-stage delay derived from a rows-per-second target |
| Checkpoint | Pass rows through and also write a parquet snapshot to a path |
| Dead Letter Queue | Terminal sink for rejected rows (JSON / CSV / Parquet) |
| Run Pipeline | Inline-execute another pipeline file (ctl.runpipeline) |
| Run Job | Call a child pipeline (picked from the workspace) passing parent context variables; chain several to build a Master Job (ctl.runjob) |
| Parallelize | Run the downstream branches wired to its outputs concurrently; branches are unlimited (ctl.parallelize) |
| Iterate | Run a sub-pipeline N times with ${ITER_INDEX} substitution |
| For Each | Run a sub-pipeline once per input row with ${ITER_ITEM_<FIELD>} substitution |
| Try / Catch | Install a fallback sub-pipeline if the wrapped stage fails |
| Retry | Per-stage retry policy (configure on Advanced tab) |
| Log Message | Emit an info log line ({rows} = upstream count), pass rows through (ctl.log) |
| Warn | Emit a warning log line, pass rows through (ctl.warn) |
| Die / Fail | Stop the run with a message: always, only when the input has rows, or only when empty (ctl.die) |
| Schedule | Cron / interval / file-watch triggers via the orchestration crate |
Advanced settings (per-node)
Every node has an Advanced tab with fields the engine honours at run time:
| Field | What it does |
|---|---|
| Retry attempts | Total tries on failure (1 = no retry). Sleeps backoff * attempt ms between attempts. |
| Retry backoff (ms) | Inter-attempt sleep, linearly scaled by attempt index. |
| Memory limit (MB) | PRAGMA memory_limit applied to this stage only. |
| Log row count | Print the post-stage rowcount to the run output. |
Orchestration and workspace
| Capability | What it does |
|---|---|
| Run feedback | Streaming run events light nodes up stage by stage, with per-node row counts, real mid-query cancel, and run history. |
| Run logs | Every run writes component-level NDJSON to <workspace>/logs/<pipeline name>/runtime.log (start/finish per stage, row counts, durations, ctl.log / ctl.warn / ctl.die messages). Tail it straight into Splunk or Dynatrace. |
| Schedules | Cron, fixed-interval, and file-watch triggers, driven by an in-process scheduler. |
| Context variables | Per-environment variables; bind any field to one via a Manual / Context dropdown, or reference ${var} inline. Resolved at run time. |
| Cloud credentials | Saved S3 / GCS / Azure connections become DuckDB SECRETs; cloud reads / writes go through httpfs. S3-compatible endpoints (MinIO / R2 / B2) supported via ENDPOINT + URL_STYLE. |
| Workspace | Pipelines, connections, contexts, documents, and routines persist as plain JSON and Markdown files in a folder you choose. |
Clean data before it reaches your AI
Models inherit the quality of their inputs. RAG indexes, embedding stores, and training sets quietly accumulate duplicates, nulls, malformed rows, mixed encodings, and inconsistent schemas. Duckle is built to scrub that data before it lands in a vector store:
- Deduplicate with exact Distinct, Uniqueness, and Fuzzy Deduplicate (Jaro-Winkler / Levenshtein); use Record Match to find near-duplicate pairs with a similarity score
- Semantic dedupe with
xf.ai.dedupeover a precomputed embedding column - Profile + describe every column up front (Column Profile, Describe, Histogram) so issues surface before they reach a model
- Validate and filter malformed, empty, or out-of-range records and route failures to a reject port
- Normalize types, encodings, casing, and null handling across messy sources (Standardize, Cast, regex / string transforms)
- Redact PII (emails, phones, SSNs, credit cards) via
xf.ai.piibefore embedding - Chunk + embed long text via
xf.ai.chunk->xf.ai.embedfor RAG indexing - Classify rows with an LLM (
xf.ai.classifyconstrains the model to one of N user-supplied categories) - Retrieve with both halves of hybrid search, locally, no model API required: Vector Similarity Search (cosine / L2 / inner product) and Full-Text Search (BM25)
- Land it in your store - pgvector ships, and Pinecone, Qdrant, Weaviate, Milvus all have working sinks that POST batches through each vendor's HTTP API
Engines
Duckle ships a thin shell and installs its engines on first launch.
| Engine | Role | Status |
|---|---|---|
| DuckDB | Default execution engine: analytics, file formats, cloud reads, SQL pushdown. Tracking v1.5.3 (latest stable). | Working |
| Duckie AI Assistant | Local chat assistant via llama.cpp + Qwen 2.5 Coder 1.5B GGUF. Downloads ~1.1 GB; runs entirely offline once installed. Managed as a llama-server subprocess exposing an OpenAI-compatible API on 127.0.0.1. |
Installable |
| SlothDB | Alternate embedded analytical engine (SouravRoy-ETL/slothdb), installed the same way and selectable per pipeline. | Installable |
| Native | In-process Rust streaming / incremental engine. | Planned |
First-launch extension pre-fetch
When the installer downloads the DuckDB CLI it also pre-fetches the extensions Duckle uses, with per-extension progress, so the first time you touch a Postgres source or an Iceberg table there is no surprise network hop mid-pipeline:
httpfs (S3 / GCS / HTTP), azure (Azure Blob native), sqlite, postgres, mysql, excel, iceberg, delta, ducklake, vss, fts.
spatial is lazy-loaded (~50 MB GDAL bundle) - it installs on first use of a geospatial source/sink to keep the initial download small.
Download / Install
Pick the binary for your OS from the latest release:
| OS | Asset | How to run |
|---|---|---|
| Windows | Duckle-windows-x64.exe |
Double-click. Unsigned binary - Windows SmartScreen will warn the first time; click "More info" -> "Run anyway". |
| macOS (Apple Silicon) | Duckle-macos-arm64 |
chmod +x Duckle-macos-arm64 && ./Duckle-macos-arm64. Right-click -> Open the first time to bypass Gatekeeper. |
| Linux (x86_64) | Duckle-linux-x64 |
chmod +x Duckle-linux-x64 && ./Duckle-linux-x64. Requires WebKitGTK 4.1 (libwebkit2gtk-4.1-0 on Debian / Ubuntu). |
The single-file binary above is all you need for Build Pipeline too: the headless runner is embedded into the app at build time, and exporting a pipeline produces ONE self-contained executable (the engine, the DuckDB CLI, any needed extensions, and the resolved pipeline are all inside that one file). Copy that single file to your server and run or schedule it - no separate runner download required.
The binary is ~55-78 MB depending on platform (it embeds the headless runner and the bundled MCP server). On first launch you'll be guided through downloading two engines into your app-data directory:
| Engine | Size | Required? | What it powers |
|---|---|---|---|
| DuckDB CLI | ~30 MB + extensions | Yes - cannot run pipelines without it | Every source / transform / sink that runs as SQL |
| Duckie AI Assistant | ~1.1 GB (llama-server + Qwen 2.5 Coder 1.5B GGUF) | Optional | The chat sidebar that generates pipelines from natural language |
App-data location:
- Windows:
%APPDATA%\io.duckle.app\engines\ - macOS:
~/Library/Application Support/io.duckle.app/engines/ - Linux:
~/.config/io.duckle.app/engines/
Delete the engines/ folder if you ever want to force a fresh install.
Quickstart (60 seconds)
- Download the binary for your OS (see Download / Install above) - or build from source.
- Launch it. First run shows the setup modal:
- Click Install on DuckDB (required, takes ~30 s).
- Optionally click Install on Duckie AI Assistant (~1.1 GB, takes 5-10 min on average broadband).
- Pick a workspace folder. Pipelines, connections, context variables, and routines live there as plain files.
- Build a pipeline two ways:
- Drag + wire: drag a CSV source in, point it at
samples/orders.csv, hit Autodetect schema. Drag a Filter, wire it up. Drag a Parquet sink with an output path. Press Run, watch the nodes light up. - Ask Duckie: click the Sparkles icon (top-right of the toolbar), type "read orders.csv, filter where status = 'paid', write to paid.parquet". When Duckie streams back a pipeline, click Insert into canvas.
- Drag + wire: drag a CSV source in, point it at
- Inspect. Click any node to see its generated SQL in the Plan tab and a live row sample in the Preview tab.
That's a real, native ETL pipeline built and run in under a minute. CSV is just the easiest first node; swap in Parquet, JSON, S3, Snowflake, MongoDB, or Stripe the same way.
Run your first pipeline
A worked example using the bundled samples/orders.csv data.
1. Add a source
- Open the Components sidebar (left). Click Sources -> Files -> CSV.
- Drag it onto the canvas.
- In the right-side Properties panel:
- Path: browse to
samples/orders.csv - Click Autodetect schema - the Schema tab fills in column types from the file, the Preview tab shows the first 20 rows.
- Path: browse to
2. Add a transform
- Components -> Transforms -> Rows -> Filter. Drag onto canvas.
- Wire the CSV source's
mainoutput port to the Filter'smaininput. - In Properties:
- Predicate:
status = 'paid'(you can write raw SQL or use the visual builder) - Filter has two output ports:
pass(rows matching) andreject(rows that don't).
- Predicate:
3. Add a sink
- Components -> Sinks -> Files -> Parquet.
- Wire Filter's
passport to the Parquet sink. - Path:
paid_orders.parquet. Write mode:overwrite. Compression:zstd.
4. Run it
- Press Run in the toolbar. Nodes light up in execution order; row counts appear under each.
- Open the Output tab (bottom panel) to see per-stage timing.
- Click any node to inspect generated SQL in Plan + sampled rows in Preview.
5. Iterate
- Add a Group By before the sink to aggregate. Re-run. Sub-second on small data.
- Cancel mid-run with the Stop button - the DuckDB process is killed cleanly.
- Save your work: Cmd/Ctrl-S writes a JSON pipeline file to your workspace folder.
How to use Duckle
A wider tour of the workflow.
| Step | What you do | Where to look |
|---|---|---|
| 1. Sources | Drag a source, point it at a file / DB / cloud URL / SaaS endpoint. Click Autodetect schema to read columns + a sample. | Sources reference |
| 2. Transforms | Wire transforms to source output ports. Configure in the Properties panel. Preview tab shows live rows; Plan tab shows generated SQL. | Transforms reference |
| 3. Data quality | Drop in a validator (Not-Null, Range, Regex, Uniqueness). Passing rows continue on the main port; failures route to the reject port. | Data quality reference |
| 4. Sinks | Finish with a sink (file, DB, cloud, vector DB, message bus, email). Set write mode (overwrite, append, truncate, upsert). | Sinks reference |
| 5. Run | Press Run to execute on DuckDB. Nodes light up stage by stage; Output + Console show row counts, timing, errors. Stop button kills mid-run. | Run feedback |
| 6. Ask Duckie | For anything you can describe in English, the AI assistant can sketch a pipeline. Iterate by editing the graph or asking follow-ups. | Meet Duckie |
| 7. Reuse | Save Connections, Context variables, and SQL Routines in the workspace; reference ${context.var} in any field. Everything persists as plain files. |
Workspace and Git flow |
| 8. Schedule | Attach a cron, interval, or file-watch trigger to run a pipeline automatically. | Schedules and triggers |
Recipes and examples
Ready-to-adapt patterns. Each one is a few nodes you wire on the canvas (or ask Duckie to sketch).
CSV cleanup
"Read orders.csv, drop nulls, deduplicate by order_id, write to orders_clean.parquet"
src.csv -> qa.not_null -> qa.uniqueness -> snk.parquet
Set qa.not_null to the columns that must be present; set qa.uniqueness to order_id. Rejected rows go to a snk.csv on the reject port for inspection.
Postgres -> Snowflake nightly load
"Read all rows from Postgres
events, upsert into Snowflake tableanalytics.eventsonevent_id"
src.postgres -> snk.snowflake (mode=upsert, conflict=event_id)
Attach a ctl.schedule with cron 0 2 * * * to run nightly at 02:00.
S3 -> partitioned Parquet
"Read all .json.gz files in
s3://logs/2026/*/*.json.gz, parse, write Hive-partitioned byevent_date"
src.s3 (glob, autodetect json.gz)
-> xf.derive (event_date = CAST(ts AS DATE))
-> snk.parquet (path=out/, partitionBy=event_date, mode=overwrite_or_ignore)
RAG ingestion
"Chunk our docs, embed with OpenAI, dedupe near-identicals, store in pgvector"
src.s3 (markdown files)
-> xf.ai.chunk (chunkSize=1500, overlap=150)
-> xf.ai.pii (redact)
-> xf.ai.embed (model=text-embedding-3-small, baseUrl=https://api.openai.com)
-> xf.ai.dedupe (threshold=0.95)
-> snk.pgvector (table=docs)
Slack channel digest
"Pull yesterday's Slack messages from #support, classify by sentiment, email a summary"
src.slack (channels.history with oldest=yesterday)
-> xf.ai.classify (categories=positive,negative,neutral)
-> xf.aggregate (group by sentiment, count)
-> snk.email (to=oncall@..., subject=Daily Support Digest)
Webhook -> S3 archive
"Receive 100 webhooks, archive each one as JSON in S3"
src.webhook (port=8080, maxRequests=100, timeoutMs=300000)
-> snk.s3 (path=s3://archive/events/, format=jsonl, partitionBy=event_date)
Git commit-log analytics
"Build a dashboard of who's been committing what in the last 30 days"
src.git (mode=log, maxRows=10000)
-> xf.filter (date > current_date - INTERVAL '30 days')
-> xf.aggregate (group by author_email, count)
-> snk.csv (path=author-stats.csv)
More examples live in samples/ - drop the pipeline files into a workspace and open them.
Git integration (GitHub + GitLab)
Push, pull, branch, and watch CI from inside Duckle. No terminal required.
Click the Git icon in the topbar to open the workspace Git panel. Built-in integration with GitHub and GitLab, on the system git CLI (no FFI, no embedded git library):
| Feature | What it does |
|---|---|
| Status snapshot | Current branch, ahead/behind counts, list of modified / staged / untracked / conflicted files |
| Stage all + commit | One-click git add -A && git commit -m "..." with your message |
| Push / Pull | git push and git pull --ff-only against origin. The button stays disabled when there's nothing to push |
| Branch list, switch, create | Lists local branches; click to switch; create new branches inline |
| Remote URL config | Add or change origin URL from inside the panel - auto-detects GitHub vs GitLab from the host |
| PAT-prompt fallback | First tries git push using your system credential helper (GitHub CLI, osxkeychain, manager-core). On a 401, prompts for a Personal Access Token, saves it AES-encrypted in <workspace>/.duckle/secrets/git.json (auto-gitignored), retries with the token injected into the HTTPS URL |
| CI build badge in topbar | Polls GitHub Actions or GitLab CI every 30 s for the latest pipeline on your current branch. Shows green / red / yellow / gray. Click to open the build in your browser |
Workflow. Workspaces are plain folders (see Workspace and Git flow) - any standard Git workflow works:
Create / clone -> open in Duckle -> edit pipelines -> commit + push ->
PR / MR -> CI runs your pipeline tests -> merge -> pull
You can do the entire push / pull / merge loop without leaving Duckle. Heavy operations (interactive rebase, conflict resolution, log archaeology) still live in your terminal or external Git tool - the panel is designed for the everyday flow, not as a full Git replacement.
Provider detection. The remote URL host determines which CI API the badge polls:
| Provider | CI source | API |
|---|---|---|
github.com |
GitHub Actions | GET /repos/{owner}/{repo}/actions/runs |
gitlab.com or self-hosted GitLab |
GitLab CI | GET /api/v4/projects/{id}/pipelines |
| Other / bitbucket | (no CI badge for now) | - |
The badge uses the same PAT you saved for pushes - no separate auth step.
Workspace and Git flow
A workspace is a folder you pick on first launch. Everything you build lives there as plain text:
my-workspace/
pipelines/
orders_etl.pipeline.json # the node graph
nightly_load.pipeline.json
connections/
prod-postgres.connection.json # saved DB credentials (encrypted)
snowflake-analytics.connection.json
contexts/
dev.context.json # variables for dev environment
prod.context.json
routines/
cleanse-addresses.sql # reusable SQL snippets
documents/
runbook.md # plain-Markdown docs
schedules.json # all scheduled runs in this workspace
run-history/
orders_etl/ # one folder per pipeline
2026-05-25T14-30-00.json # one file per run
Git-friendly by design. Every file is human-readable JSON or Markdown. Standard workflows work:
git init my-workspace && cd my-workspace
git add . && git commit -m "Initial pipelines"
# Pull a teammate's update
git pull --rebase
# Push your changes
git push
# Branch for a risky migration
git checkout -b feature/upsert-mode
# ...edit pipelines in Duckle...
git diff # readable JSON diffs
git push -u origin feature/upsert-mode
# open PR / MR
Sensitive values in connections get encrypted with a workspace-local key (workspace/.duckle/keys/). Don't commit that file - add **/.duckle/keys/ to .gitignore. The connection JSON files themselves only hold the ciphertext, which is safe.
Schedules and triggers
Pipelines can run on cron, fixed interval, or file-watch triggers. Configure these in the Schedule panel (toolbar -> Schedule icon), not as graph nodes.
| Trigger type | Config | Example |
|---|---|---|
| Cron | Standard 5-field cron expression with optional timezone | 0 2 * * * (every day at 2 AM) |
| Interval | every N {seconds, minutes, hours, days} |
every 15 minutes |
| File watch | Watch a directory for new/changed files matching a glob | /inbox/*.csv |
| Manual | Run-on-demand only (the default) | - |
Schedules persist to workspace/schedules.json and execute via the in-process scheduler crate. They survive app restarts but require Duckle to be running.
For headless / always-on schedules that run when Duckle is closed, build the pipeline into a standalone file and let the operating system's own scheduler run it - see Server deployment below.
Server deployment (Build Pipeline)
The in-app scheduler runs only while Duckle is open. To run a pipeline on a server with no desktop app, Build Pipeline turns it into ONE self-contained executable - the equivalent of a standalone "Job".
Right-click a pipeline (in the project tree or on the canvas) and choose Build Pipeline. The output is a single file named after the pipeline (orders_etl.exe on Windows, orders_etl on macOS / Linux) that embeds everything it needs:
- the headless execution engine,
- the DuckDB CLI,
- only the DuckDB extensions that pipeline's components actually use,
- the resolved pipeline (context variables substituted, routines inlined),
- its secrets (see below).
On first run it self-extracts to a temp cache and uses its own embedded DuckDB, so the server needs nothing installed - no Duckle, no DuckDB. There is no folder to copy, no run.sh, and no separate runner download. A CSV-to-CSV pipeline builds to about 28 MB; only the extensions a pipeline uses are bundled, so the file stays lean.
./orders_etl # or orders_etl.exe on Windows
The process exits 0 on success and non-zero on failure, and writes the same NDJSON run logs under logs/ (Splunk / Dynatrace friendly).
Build options
| Option | What it does |
|---|---|
| Target OS | The file is built for the OS you build on - build on Linux to deploy to a Linux server. Appending the payload makes the file unsigned, so do not codesign / Authenticode-sign it. |
| Context | Pick a context at build time; its non-secret variables are baked into the pipeline. |
| Secrets: Environment | Each secret becomes a ${ENV:KEY} placeholder, so nothing sensitive is written into the file. The runner resolves real environment variables first, then a secrets.env (KEY=VALUE lines) placed next to the file. |
| Secrets: Passphrase | Secrets are encrypted inside the file with AES-256-GCM, decrypted at run time from the DUCKLE_BUNDLE_PASSPHRASE environment variable. |
Schedule it with whatever the server already has - point the OS scheduler straight at the file:
# Linux cron - run every day at 02:00
0 2 * * * /opt/duckle/orders_etl >> /var/log/orders_etl.log 2>&1
On Windows use Task Scheduler; on macOS a launchd plist; on Linux a systemd timer. Full examples in docs/current/scheduler.md.
Run against an existing workspace - the same embedded headless runner can also execute a pipeline JSON directly, resolving context the way the app does:
duckle-runner --pipeline /path/to/pipeline.json [--workspace /path/to/workspace] [--duckdb /path/to/duckdb]
MCP server (connect Claude or any LLM to Duckle)
Duckle ships its own Model Context Protocolserver, so Claude (or any MCP client - Claude Desktop, Claude Code, Cursor, orany other LLM agent) can drive Duckle directly: browse the full component catalogand per-component property schemas, generate a pipeline straight into a workingdirectory you choose, validate it (compile without running), run it headlessly,read existing pipelines and their run logs, build a standalone artifact, andmanage saved connections.
Connect in one click (recommended)
The MCP server is bundled inside the app - there is nothing extra to install.In the designer, click Connect to Claude in the top bar to open the connectorpopup, then pick your client:
- Connect to Claude Code - registers the
duckleserver for you (runsclaude mcp addunder the hood). - Add to Claude Desktop / Add to Cursor - writes the
duckleentry intothat client's config, with the resolved engine paths filled in (both theMicrosoft Store / MSIX and standalone Claude Desktop layouts are handled). - Or copy the command / config for any other MCP client.
Restart the AI client, then try "Use duckle to list the available components"to confirm the connection.
Manual / headless
For a build-from-source or server setup, point any client at the duckle-mcpbinary directly. It speaks JSON-RPC over stdio and reuses the DuckDB enginein-process (no GUI, no Node runtime).
cargo build -p duckle-mcp --release # target/release/duckle-mcp
claude mcp add duckle -- /path/to/duckle-mcp
For Claude Desktop and other clients, add it to mcpServers:
{
"mcpServers": {
"duckle": {
"command": "/path/to/duckle-mcp",
"env": {
"DUCKLE_DUCKDB_BIN": "/path/to/duckdb",
"DUCKLE_RUNNER_BIN": "/path/to/duckle-runner"
}
}
}
}
Tools: list_components, get_component_schema, create_pipeline,validate_pipeline, run_pipeline, list_pipelines, read_pipeline,read_run_logs, build_pipeline, list_connections, create_connection.run_pipeline / build_pipeline need a DuckDB binary (DUCKLE_DUCKDB_BIN);build_pipeline also needs duckle-runner (DUCKLE_RUNNER_BIN). Full guide:docs/current/mcp.md.
Connection management
Saved connections become DuckDB secrets at runtime so credentials never leak into the pipeline JSON.
| Type | Stored fields | Used by |
|---|---|---|
| PostgreSQL / MySQL / etc. | host, port, user, password, database, ssl mode | src.postgres, snk.postgres, ... |
| Snowflake | account, user, role, warehouse, PAT or JWT private key | src.snowflake, snk.snowflake |
| S3 / GCS / Azure | access key, secret, region (or service-account JSON) | All cloud sources/sinks via httpfs |
| MotherDuck / Databricks / BigQuery | token, workspace URL | Respective sources/sinks |
| Generic REST / SaaS | base URL, auth scheme (Bearer / API key / Basic), token, custom headers | All REST aliases |
Connections live in workspace/connections/ as JSON. The token/password field is encrypted with the workspace key; the rest is plain text.
To use a connection in a pipeline, the Properties panel of any compatible source/sink shows a Connection dropdown - pick one and the fields auto-fill.
The Copy SQL / Export SQL output is display-only and never executed. Secret values (passwords, tokens, keys, connection strings) are replaced with named placeholders such as ${DUCKLE_PASSWORD}, so the exported script stays valid and is safe to share - substitute the real value at run time. To emit the real credentials instead (so the script runs unchanged), set the environment variable DUCKLE_EXPORT_INCLUDE_SECRETS=1; the output then contains live secrets and should be handled accordingly.
Context variables
Bind any field to a context variable that resolves at run time. Useful for dev vs prod, per-environment paths, secrets injected from CI, etc.
In a context file (workspace/contexts/prod.context.json):
{
"name": "prod",
"vars": {
"DB_HOST": "db.internal.acme.com",
"S3_BUCKET": "acme-prod-data",
"BATCH_SIZE": "10000"
}
}
In the Properties panel of any node, switch a field from Manual to Context and pick DB_HOST. Or inline-reference one with ${DB_HOST} in a string field.
Pick the active context from the topbar's Context dropdown. Switch contexts and re-run without editing the pipeline.
Build from source
Prerequisites
- Rust (stable)
- Node.js 18+ and npm
cargo-tauriCLI:cargo install tauri-cli --version "^2"- Platform webview dependencies per the Tauri prerequisites. WebView2 is preinstalled on Windows 10 and 11.
Clone and install
git clone https://github.com/SouravRoy-ETL/duckle
cd duckle
npm --prefix frontend install
Run in development (hot-reloading frontend plus the native shell):
cargo tauri dev
Build a release binary:
# The --features custom-protocol flag is required: without it, tauri-codegen
# embeds the dev URL instead of the bundled frontend.
cargo build --release --manifest-path apps/desktop/Cargo.toml --features custom-protocol
Outputs land in target/release/duckle (or duckle.exe). The engine is not statically linked: DuckDB downloads at first launch, which is why the build is fast and the binary is tiny.
Run the tests:
cargo test # workspace unit + plan tests
DUCKLE_DUCKDB_BIN=/path/to/duckdb cargo test -p duckle-duckdb-engine # full integration suite
Architecture
duckle/
apps/desktop/ Tauri 2 shell: Tauri commands, engine installer, llama runtime, window
frontend/ React 19 + Vite + TypeScript: the designer UI + chat panel
crates/
duckdb-engine/ Compiles the node graph to SQL and drives the DuckDB CLI
slothdb-engine/ SlothDB adapter
scheduler/ Cron / interval / file-watch triggers
metadata/ Schema and type model
plugin-sdk/ Connector / inspector traits
connectors/ Source and sink connectors
runtime, workflow-engine, transform-engine, stream-engine, execution-core
- The frontend (React with @xyflow/react) is the visual designer; it talks to the Rust core over Tauri commands.
- duckdb-engine topologically sorts the graph, lowers each node into SQL, and executes by shelling out to the downloaded DuckDB CLI. Non-sink nodes materialize as tables so later stages can reference them; sinks become
COPY ... TOstatements; cancel kills the process. No statically linked database, so the binary stays small. - Duckie is a
llama-serversubprocess on127.0.0.1exposing an OpenAI-compatible chat-completions API. The chat panel streams from it via SSE. The model is sandboxed: no fs, no net, no tools - it can only emit text. - Everything persists to the workspace folder you choose, as plain JSON and Markdown files.
Configuration
A few knobs you can set without touching code.
| Setting | Where | Effect |
|---|---|---|
| Theme | Topbar sun/moon toggle | Light / dark, persisted to localStorage |
| Workspace | Topbar workspace pill -> Switch | Change the folder Duckle reads/writes to |
| Active engine | Topbar engine selector | DuckDB (default) or SlothDB - per-pipeline |
| Active context | Topbar context dropdown | Switches which context variables resolve at run time |
| AI Assistant baseURL | xf.ai.llm / xf.ai.embed / xf.ai.classify props |
Point at any OpenAI-compatible endpoint (default: Duckie's local llama-server) |
| Per-stage retry | Properties panel -> Advanced tab | Total attempts + linear-scaled backoff per stage |
| Per-stage memory cap | Properties panel -> Advanced tab | PRAGMA memory_limit applied just to that stage |
| DuckDB extensions | Pre-fetched at install; lazy-loaded for spatial |
See First-launch extension pre-fetch |
Env var RUST_LOG |
Before launching the binary | RUST_LOG=debug duckle.exe to see verbose engine logs |
Env var DUCKLE_DUCKDB_BIN |
Before running engine tests | Points the integration test suite at a DuckDB CLI |
Env var DUCKLE_CA_CERT |
Before launching the binary | Path to a PEM bundle of extra CA certificates to trust (corporate proxy / private CA), added on top of the OS trust store and bundled roots |
Performance tips
A few patterns that consistently produce sub-second runs at small / medium data scale, and tractable runs at warehouse scale.
| Tip | Why |
|---|---|
| Use Parquet, not CSV, for intermediate steps | Columnar + compressed; DuckDB reads only the columns the next stage needs. CSV is fine for source / sink at the edges. |
| Push filters as early as possible | xf.filter early in the graph compiles to a WHERE that runs at scan time, not a post-scan filter. |
Use the vss + fts indexes |
Vector + full-text search hit DuckDB extensions directly. Faster than the alternative of pulling data out and indexing in Python. |
| Avoid per-row API calls when batch APIs exist | xf.ai.embed batches up to 100 inputs per request; snk.rest defaults to one batched request. Per-row patterns (xf.ai.llm, snk.webhook) are slower by design - use them when you actually need per-row behavior. |
| Cap heavy aggregates with the per-stage memory limit | Properties panel -> Advanced -> Memory limit (MB) prevents one big GROUP BY from blowing through all of RAM. |
Use ctl.checkpoint for long-running pipelines |
A checkpoint stage writes a Parquet snapshot to a path you choose, so a future run can resume from there with src.parquet. |
Disable xf.debug.log in prod |
Logging rows is per-row I/O; fine for dev, costly at scale. |
| Sort once at the end, not in the middle | xf.sort is a global sort; doing it once before the sink avoids re-sorting downstream. |
FAQ
Is Duckle free? What's the license?Yes, free + open source. Dual-licensed MIT OR Apache-2.0. You can use it commercially, fork it, sell what you build with it. No usage limits, no telemetry.
Does Duckle send my data anywhere?No. The app runs entirely on your machine. The engines (DuckDB, llama.cpp) are downloaded from official upstream releases on first launch and then run locally. The only network calls Duckle makes on your behalf are the ones your pipelines explicitly do (e.g. a src.s3 reading from your S3 bucket, or xf.ai.embed if you configure it to hit OpenAI).
Duckie AI Assistant runs fully offline once the model is downloaded.
How big are pipelines this works well on?DuckDB is excellent on data that fits on one machine - tens of GB on a laptop, hundreds on a workstation. Beyond that, point Duckle's output at a warehouse / lakehouse that scales horizontally. Duckle is honest about being single-machine.
Do I need DuckDB installed first?No - Duckle downloads it for you on first launch. The download is ~30 MB and includes the most-used extensions (httpfs, postgres, mysql, iceberg, delta, vss, fts, etc.) so the first time you touch a Postgres source there's no mid-pipeline network pause.
How big is the binary, exactly?About 55-78 MB depending on platform (macOS ~54-67, Windows ~59-68, Linux ~66-78); it embeds the headless runner and the MCP server. The engines aren't statically linked - DuckDB (~50 MB with extensions) and the Duckie LLM (~1.1 GB for the Qwen GGUF) both download on first launch with a guided installer into your app-data folder, so they update independently of the app.
Can I use OpenAI / Cohere / Voyage instead of the local Duckie?Yes. The AI transforms (xf.ai.embed, xf.ai.llm, xf.ai.classify) accept a baseUrl prop. Point it at any OpenAI-compatible /v1/... endpoint and an apiKey and Duckle uses that instead. The local Duckie chat panel is hardwired to localhost; the pipeline AI transforms are configurable.
In the workspace folder you pick on first launch (see Workspace and Git flow). Pipelines are plain JSON files you can commit to Git, diff, branch, and review.
Can multiple people collaborate on the same workspace?Via Git, yes - check the workspace into a repo and use standard branch/PR flows. Duckle does not have a real-time multiplayer mode (single-machine by design).
Can I run pipelines headlessly / from CI?Yes. Build Pipeline (right-click a pipeline) produces a single self-contained executable that runs anywhere with nothing installed - drop it on a server or CI runner and execute it, or schedule it with cron / systemd / Task Scheduler. The embedded duckle-runner can also run a workspace pipeline JSON directly (duckle-runner --pipeline pipeline.json). See Server deployment. You can also import the engine crate (duckle-duckdb-engine) into your own Rust binary.
For 90% of common pipelines (read source -> simple transforms -> sink), yes - the Qwen 2.5 Coder model is tuned for structured-JSON generation. For long, complex pipelines you'll likely want to iterate: describe the first half, click insert, then ask for the next half. You can also swap the model: point xf.ai.llm's baseUrl at GPT-4 or Claude for more capable pipeline drafting.
No. Once llama-server and the Qwen GGUF are downloaded into your app-data directory, Duckie runs fully offline. Tested by killing wifi and asking it for a pipeline - works fine.
DuckDB's SQL surface is wide enough to express most ETL work, it's vectorized and fast on a laptop, it has first-class Iceberg/Delta/Parquet readers, and its extension model lets us add vector + full-text + Postgres ATTACH without code changes. Polars is great but doesn't ship the cloud/format/extension breadth we need; Spark is a great cluster but overkill for the local-first niche we're in.
How do I contribute a new connector?See the Contributing section and crates/duckdb-engine/src/plan.rs (planner branch) + crates/duckdb-engine/src/lib.rs (executor). The shortest path: copy an existing connector with similar shape (e.g. src.rabbit for a streaming source, src.dynamodb for an HTTP+auth API), adapt, add a test, flip the palette tile.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Window opens but content shows "localhost refused to connect" | Release binary built without --features custom-protocol (the v0.0.7 bug) |
Rebuild with cargo build --release --features custom-protocol per Build from source. The release workflow already passes this flag. |
| "DuckDB CLI not found" on Run | First-launch installer was skipped or interrupted | Open the engine setup modal from the toolbar; click Install on DuckDB |
| "Couldn't download Duckie AI Assistant (HTTP 404)" | Pinned llama.cpp build temporarily unavailable from upstream | Bump LLAMACPP_BUILD in apps/desktop/src/engine_manager.rs to a recent stable, rebuild |
| Linux: app won't launch, missing libwebkit | WebKitGTK 4.1 isn't installed | sudo apt install libwebkit2gtk-4.1-0 (Debian/Ubuntu) or your distro's equivalent |
| macOS: "App can't be opened because Apple cannot check it" | Gatekeeper, unsigned binary | Right-click the binary -> Open -> Open Anyway |
| Pipeline runs but a connector errors with "extension not loaded" | Lazy-loaded extension (e.g. spatial) downloaded mid-run and failed |
Run duckdb :memory: -c "INSTALL spatial; LOAD spatial;" from a terminal to pre-install; relaunch Duckle |
| Chat panel says "AI engine not registered" | Old version of Duckle before AI shipped (pre-v0.0.10) | Update to latest release |
| Duckie generates a pipeline but Insert doesn't put anything on the canvas | Active pipeline tab has been closed; nothing to insert into | Open a pipeline (or create a new one) before clicking Insert |
| MotherDuck / Snowflake auth fails | Token expired, or PAT lacks the role you're trying to use | Regenerate in the vendor UI; paste into the Connection in Duckle |
Postgres ATTACH says "could not connect" |
Local SSL mode mismatch | Connection -> Advanced -> set SSL mode to disable for localhost / require for production |
| AI tests skip with no failure | DUCKLE_DUCKDB_BIN isn't set |
export DUCKLE_DUCKDB_BIN=/path/to/duckdb before cargo test |
| TLS "UnknownIssuer" / "invalid peer certificate" behind a corporate proxy | A TLS-inspecting proxy (Zscaler, Netskope, ...) re-signs traffic with its own CA | Duckle trusts your OS certificate store on top of its bundled roots, so the proxy CA in the Windows / macOS / Linux store is honoured automatically. If the CA isn't in the store, point DUCKLE_CA_CERT at a PEM file containing it. Note: DuckDB's own extension fetch (extensions.duckdb.org) and cloud reads (S3 / GCS / Azure) run inside the DuckDB engine with its own TLS, so also allow / exempt extensions.duckdb.org from inspection. |
If you see something not listed, please open an issue with steps to reproduce + the relevant log line.
CI / CD
Duckle's CI pipeline runs on both GitHub and GitLab - the project mirrors to both. Push / pull-request / merge-request / tag events all trigger builds.
| Trigger | GitHub Actions | GitLab CI |
|---|---|---|
| Push to main or feature branch | .github/workflows/ci.yml |
.gitlab-ci.yml (test + desktop-build stages) |
| Pull request / merge request | .github/workflows/ci.yml |
.gitlab-ci.yml (same stages, rules: gate on MR events) |
Tag v* |
.github/workflows/release.yml |
.gitlab-ci.yml (release stage; uploads binaries to GitLab Releases) |
What each pipeline does:
- Frontend -
npm ci+npm run build(type-check + bundle) - Rust test matrix -
cargo test --workspaceon Linux + macOS + Windows - Live-service integration tests - PostgreSQL + MySQL + MinIO services spun up via Docker, real connector code runs against them
- Desktop release-build smoke check -
cargo build --release --features custom-protocolthen grep the binary for the embedded frontend JS chunk (catches the v0.0.7-class "binary loads devUrl" bug at PR time) - Format + clippy - informational (does not block merge)
- On tag: build the Duckle binary on all three OSes, upload as release assets
See .github/workflows/ and .gitlab-ci.yml for the exact steps. The two pipelines are kept feature-equivalent so contributors can fork to either platform.
Releasing a new version
Nothing regenerates this README, the hero / flow SVGs, or the downloadlinks automatically - they are hand-maintained, so they drift unless eachrelease updates them. Treat the README as a release artifact: walk thischecklist every time before tagging.
# 0. Update the README in the SAME commit as the version bump:
# - bump every vX.Y.Z reference (the Download / Install link, badges)
# - refresh capability tables for any new sources/transforms/sinks
# - add/replace screenshots in docs/assets for shipped features
# - re-check the hero/flow SVG wording if positioning changed
# 1. Bump version in apps/desktop/tauri.conf.json
# 2. Commit (README + version together)
git commit -am "Release: bump to vX.Y.Z"
# 3. Tag + push
git tag vX.Y.Z
git push origin main vX.Y.Z
# Both GitHub Actions and GitLab CI pick up the tag and build the
# release artifacts automatically. Once green, the draft release on
# GitHub gets the binaries uploaded; un-draft + mark Latest with:
gh release edit vX.Y.Z --draft=false --latest
Roadmap
A complete planned-component breakdown lives in docs/roadmap.md. Highlights:
- Multi-shard Kinesis and Pulsar streaming (Pulsar blocked on
protocat build time) - Apache ORC read / write (blocked on the Arrow version conflict between
orc-rustand our workspace pin) - SFTP source (shipped -
russh+russh-sftpon the ring backend, password / key auth, host-fingerprint pin) - OAuth-heavy SaaS (Google Sheets, Excel Online, full Salesforce OAuth, Gmail / O365 IMAP)
- Embedded Python / Rust code stages (current code.* family: SQL, Shell, JavaScript, WebAssembly all ship)
- Hosted documentation site
- Plugin marketplace via the connector SDK
- In-process Native engine - a Rust streaming / incremental executor as an alternative to shelling out to the DuckDB CLI
Contributing
Contributions, issues, and ideas are welcome. Duckle is young and there is a lot of green field. Open an issue to discuss a change before a large PR, match the existing code style, and keep changes focused. Run cargo test and npm --prefix frontend run build before submitting. See CONTRIBUTING.md.
License
Licensed under either of MIT or Apache-2.0 at your option.