
DuckDB Over Parquet via MCP: Querying 175 Million Rows in 627ms

Loading a 67 GB CSV with 175.8 million rows into PostgreSQL is prohibitively expensive. DuckDB queries Parquet directly from disk, with no ingestion step. We built an automatic query router for our MCP platform.

175.8M rows queried · 627ms query time · 67 GB CSV → 7.7 GB Parquet (8.7x compression)

The Problem: PostgreSQL Doesn't Scale to 175 Million Rows

Our MCP platform lets AI agents query datasets through standard tool calls. Most datasets — government records, company registries, census data — fit comfortably in PostgreSQL. A few thousand to a few million rows, indexed and fast.

Then someone uploaded the Chicago taxi trip dataset: 175.8 million rows, 67 GB as CSV. PostgreSQL can technically ingest this, but the cost is prohibitive — you'd need to COPY 67 GB into a table, build indexes, and keep it all in managed storage. For a dataset that agents query occasionally, that's wasteful.

We needed a different approach: query the data where it sits on disk, with no ingestion step, and return results through MCP's JSON-RPC transport.

Why DuckDB + Parquet

DuckDB is an in-process OLAP database. It runs inside your Node.js process — no separate server, no TCP connections, no connection pooling. It reads files directly from the filesystem.

Parquet is a columnar storage format. Unlike CSV (which stores data row by row), Parquet stores each column contiguously and applies compression per column. This means:

  1. Column pruning: queries that touch a few columns read only those columns from disk, not entire rows.
  2. Better compression: similar values stored together compress far more effectively (8.7x here with ZSTD).
  3. Chunk skipping: per-row-group min/max statistics let the engine skip data that cannot match a filter.

The combination is powerful: DuckDB handles the query engine, Parquet handles storage efficiency, and both work in-process with no external dependencies.

The Conversion: CSV to Parquet

DuckDB itself handles the conversion. One SQL statement reads the CSV and writes Parquet:

conversion.ts
import { Database } from "duckdb-async";

const db = await Database.create(":memory:");
const conn = await db.connect();

// Convert 67 GB CSV → 7.7 GB Parquet
await conn.run(`
  COPY (SELECT * FROM read_csv_auto('taxi-trips.csv'))
  TO 'taxi-trips.parquet'
  (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 122880)
`);

// Verify row count
const rows = await conn.all(
  `SELECT COUNT(*) as cnt FROM read_parquet('taxi-trips.parquet')`
);
console.log(`Parquet rows: ${rows[0].cnt}`);
// → Parquet rows: 175842192

Key parameters:

  1. FORMAT PARQUET: write the output as Parquet rather than CSV.
  2. COMPRESSION ZSTD: the codec behind the 8.7x ratio, with negligible decompression overhead.
  3. ROW_GROUP_SIZE 122880: rows per row group (DuckDB's default); row groups are the unit of statistics and skipping.

Format           Size     Rows     Query (LIMIT 5)         Full Scan
CSV (raw)        67 GB    175.8M   ~8s (sequential scan)   Impractical
Parquet (ZSTD)   7.7 GB   175.8M   627ms                   ~1.2s (aggregation)

The Query Router

Not every dataset needs DuckDB. Small datasets (under 100 MB) query faster through PostgreSQL because the data is already indexed and the connection is already warm. Large datasets with Parquet derivatives should use DuckDB. We built a router that decides automatically.

query-router.ts
import { Database } from "duckdb-async";

const DUCKDB_THRESHOLD_MB = 1024; // default threshold: 1 GB

async function routeQuery(datasetId: string, sql: string) {
  const dataset = await getDataset(datasetId);
  const parquetFile = await findParquetDerivative(datasetId);

  // Route decision: use DuckDB if a Parquet file exists
  // and the dataset exceeds the size threshold
  if (parquetFile && dataset.sizeBytes > DUCKDB_THRESHOLD_MB * 1024 * 1024) {
    return queryViaDuckDB(parquetFile.storagePath, sql);
  }

  // Small datasets: PostgreSQL with existing indexes
  return queryViaPostgres(datasetId, sql);
}

async function queryViaDuckDB(parquetPath: string, sql: string) {
  const db = await Database.create(":memory:");
  const conn = await db.connect();

  try {
    // Create a VIEW that points to the Parquet file.
    // The user's SQL references "source" — our VIEW name.
    // parquetPath is server-controlled; escape single quotes regardless.
    const safePath = parquetPath.replace(/'/g, "''");
    await conn.run(
      `CREATE VIEW source AS SELECT * FROM read_parquet('${safePath}')`
    );

    return await conn.all(sql);
  } finally {
    // Close even when the query throws, so instances don't leak.
    await conn.close();
    await db.close();
  }
}

The router checks two conditions: (1) does a Parquet derivative exist for this dataset, and (2) does the dataset exceed a configurable size threshold (default: 1 GB). If both are true, the query goes to DuckDB. Otherwise, it goes to PostgreSQL.
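
The decision itself reduces to a pure predicate, which makes it easy to unit-test in isolation. A minimal sketch; the function name is ours, and the 1 GB default mirrors the threshold described above:

```typescript
// Hypothetical predicate mirroring the router's two conditions:
// a Parquet derivative must exist, and the dataset must exceed
// the size threshold (default 1 GB).
function shouldUseDuckDB(
  hasParquet: boolean,
  sizeBytes: number,
  thresholdMb = 1024
): boolean {
  return hasParquet && sizeBytes > thresholdMb * 1024 * 1024;
}

console.log(shouldUseDuckDB(true, 67 * 1024 ** 3)); // 67 GB with Parquet → true
console.log(shouldUseDuckDB(true, 50 * 1024 ** 2)); // 50 MB → false
console.log(shouldUseDuckDB(false, 67 * 1024 ** 3)); // no Parquet → false
```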

Gotcha #1: Lazy VIEW Creation

An in-memory DuckDB instance starts empty, and nothing persists between requests, so the VIEW over the Parquet file has to be created on every query. This sounds expensive but isn't: creating a VIEW over read_parquet() is essentially free, because DuckDB reads only the Parquet metadata (file footer and row-group stats), not the actual data. The data is read only when the query executes.

VIEW vs TABLE

Using CREATE VIEW source AS SELECT * FROM read_parquet(...) instead of CREATE TABLE is critical. A TABLE would copy all data into DuckDB's in-memory storage. A VIEW is just a pointer — the Parquet file stays on disk and DuckDB reads only the columns and row groups needed by the query.

Gotcha #2: Column Name Auto-Quoting

Real-world CSV files have column names with spaces, special characters, and mixed case. The Chicago taxi dataset has columns like Trip Start Timestamp, Pickup Community Area, and Trip Miles.

DuckDB handles these natively, but your SQL needs to double-quote them. If users (or AI agents) submit queries with unquoted column names, they'll get cryptic errors. We auto-detect column names from the Parquet metadata and quote them in the VIEW definition:

auto-quoting.ts
// Read column names from the Parquet metadata.
// parquet_schema() returns one row per schema element; its "name" column
// holds the element name, and the root element of DuckDB-written files
// is named 'duckdb_schema', so we filter it out.
const schema = await conn.all(
  `SELECT name FROM parquet_schema('${parquetPath}')
   WHERE name != 'duckdb_schema'`
);

// Build the VIEW with every column name double-quoted
const columns = schema
  .map(col => `"${col.name}"`)
  .join(", ");

await conn.run(
  `CREATE VIEW source AS SELECT ${columns}
   FROM read_parquet('${parquetPath}')`
);
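
One detail the snippet above glosses over: a column name can itself contain a double quote, which would break the interpolated VIEW definition. Standard SQL escapes an embedded " by doubling it. A small helper for that case (quoteIdent is our name, not part of the code above):

```typescript
// Hypothetical helper: quote an identifier for DuckDB SQL,
// doubling any embedded double quotes per standard SQL rules.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

console.log(quoteIdent("Trip Start Timestamp")); // "Trip Start Timestamp"
console.log(quoteIdent('odd"name'));             // "odd""name"
```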

Gotcha #3: MCP JSON-RPC Transport

MCP tools communicate over JSON-RPC — every result must be serialized to JSON and sent back through the transport layer. For small result sets this is fine. For large ones, three issues emerge:

  1. BigInt values: DuckDB's Node bindings return 64-bit integers (COUNT(*), SUM, BIGINT columns) as BigInt, which JSON.stringify rejects outright.
  2. Date objects: timestamps need an explicit, consistent string form (ISO 8601) rather than implicit coercion.
  3. Payload size: a JSON-RPC message carrying millions of rows is unusable, so results need a row cap and a truncation warning.

serialize.ts
function sanitizeRow(row: Record<string, any>) {
  const clean: Record<string, any> = {};
  for (const [key, val] of Object.entries(row)) {
    if (typeof val === "bigint") {
      // Number() is lossy above 2^53 - 1; switch to val.toString()
      // if exact 64-bit values matter for your data.
      clean[key] = Number(val);
    } else if (val instanceof Date) {
      clean[key] = val.toISOString();
    } else {
      clean[key] = val;
    }
  }
  return clean;
}
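
To see why sanitizing matters, note that JSON.stringify throws a TypeError on any BigInt, and a COUNT(*) result arrives as one. A quick demonstration (sanitizeRow repeated so the snippet runs standalone):

```typescript
// sanitizeRow as defined above, repeated so this snippet is self-contained.
function sanitizeRow(row: Record<string, any>) {
  const clean: Record<string, any> = {};
  for (const [key, val] of Object.entries(row)) {
    if (typeof val === "bigint") clean[key] = Number(val);
    else if (val instanceof Date) clean[key] = val.toISOString();
    else clean[key] = val;
  }
  return clean;
}

// A row shaped like what a COUNT(*) query returns: BigInt plus a timestamp.
const row = { cnt: 175842192n, ts: new Date("2024-01-15T12:00:00Z") };

try {
  JSON.stringify(row); // throws: BigInt cannot be serialized
} catch (e) {
  console.log((e as Error).name); // TypeError
}

console.log(JSON.stringify(sanitizeRow(row)));
// {"cnt":175842192,"ts":"2024-01-15T12:00:00.000Z"}
```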

Gotcha #4: DuckDB Timeout via MCP

When MCP tool calls go through the SSE transport, there's an inherent timeout problem — the MCP client expects a response within a certain window. DuckDB queries on large Parquet files can take seconds, and the first time DuckDB loads a Parquet file, there's a cold-start penalty.

Our benchmarks on the taxi dataset:

Query Type                               First Run (Cold)   Subsequent (Warm)
SELECT * LIMIT 5                         ~1.2s              627ms
Filtered query (WHERE trip_miles > 20)   ~2.1s              ~1.0s
Aggregation (GROUP BY community_area)    ~2.8s              ~1.2s

The cold-start penalty comes from reading the Parquet file footer and row group metadata. On subsequent queries, this metadata is cached by the OS page cache. For our MCP server, we set the tool timeout to 30 seconds to accommodate cold starts, and the query-level timeout to 10 seconds for the actual DuckDB execution.
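
The query-level timeout is plain Promise plumbing: race the DuckDB call against a timer. A minimal sketch under the numbers above (withTimeout is our name; nothing MCP-specific is involved):

```typescript
// Hypothetical helper: reject if `work` doesn't settle within `ms`.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`query exceeded ${ms}ms timeout`)),
      ms
    );
  });
  // Clear the timer in either outcome so it can't keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer!));
}

// Usage sketch: a 10-second budget for DuckDB execution, per the text above.
// const results = await withTimeout(conn.all(sql), 10_000);
```

Note that the DuckDB query keeps running after the race rejects; the timeout bounds the caller's wait, not the engine's work.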


The Complete Pattern

Here's the full query flow from MCP tool call to DuckDB result:

  1. MCP tool call arrives: dataset_query(dataset_id, sql, limit)
  2. Router checks: Does this dataset have a Parquet derivative? Is it above the size threshold?
  3. DuckDB path: Create in-memory DB → create VIEW over Parquet file → execute user SQL → sanitize results → serialize to JSON
  4. PostgreSQL path: Use existing connection pool → execute against loaded table → return results
  5. MCP response: JSON-RPC result with rows, column metadata, and truncation warning if applicable
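
Step 5's truncation warning is simple to make concrete: cap the row count and tell the agent when the cap was hit. A sketch with hypothetical names (capRows and the result shape are ours):

```typescript
// Hypothetical result shape for the MCP tool response (field names are ours).
interface QueryResult {
  rows: Record<string, unknown>[];
  truncated: boolean;
  warning?: string;
}

// Cap the result at `limit` rows and flag truncation so agents know
// more data exists than the transport returned.
function capRows(rows: Record<string, unknown>[], limit: number): QueryResult {
  if (rows.length <= limit) return { rows, truncated: false };
  return {
    rows: rows.slice(0, limit),
    truncated: true,
    warning: `Result truncated to ${limit} of ${rows.length} rows; add a LIMIT or WHERE clause to narrow the query.`,
  };
}
```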
The Result

AI agents can query 175 million row datasets through the same MCP tool call they use for 10,000 row datasets. The routing is invisible — they just call dataset_query and get results. The only difference is which engine executes under the hood.

When to Use This Pattern

  1. Large analytical datasets (multi-GB) that are queried occasionally: aggregations, filters, and scans over flat files.
  2. Data that arrives as CSV or Parquet and doesn't justify a full ingest-and-index pipeline in PostgreSQL.
  3. Single-server deployments where standing up a separate OLAP service would be overkill.

When Not to Use This

  1. Small, hot datasets: below the size threshold, an indexed PostgreSQL table with a warm connection answers faster.
  2. Workloads that mutate data: Parquet files are immutable, so this path is strictly read-only.
  3. Point lookups by key: a B-tree index beats a columnar scan for single-row retrieval.

Key Takeaways

  1. DuckDB + Parquet is the right tool for large analytical datasets. 627ms to query 175 million rows on a single VPS, no external database required.
  2. Build a query router, not a monolithic engine. Small datasets stay in PostgreSQL. Large datasets go to DuckDB. The routing is transparent to the caller.
  3. Use VIEWs, not TABLEs. CREATE VIEW source AS SELECT * FROM read_parquet(...) is free. CREATE TABLE would copy the entire dataset into memory.
  4. Sanitize before JSON serialization. BigInt, Date, and other DuckDB native types don't survive JSON.stringify() without conversion.
  5. ZSTD compression on Parquet is the sweet spot. 8.7x compression ratio with negligible decompression overhead.