
DuckDB Over Parquet via MCP: Querying 175 Million Rows in 627ms

Loading a 67 GB CSV with 175.8 million rows into PostgreSQL is prohibitively expensive. DuckDB queries Parquet directly from disk, with no ingestion step. We built an automatic query router for our MCP platform.

175.8M rows queried · 627ms query time · 67 GB CSV → 7.7 GB Parquet (8.7x compression)

The Problem: PostgreSQL Doesn't Scale to 175 Million Rows

Our MCP platform lets AI agents query datasets through standard tool calls. Most datasets — government records, company registries, census data — fit comfortably in PostgreSQL. A few thousand to a few million rows, indexed and fast.

Then someone uploaded the Chicago taxi trip dataset: 175.8 million rows, 67 GB as CSV. PostgreSQL can technically ingest this, but the cost is prohibitive — you'd need to COPY 67 GB into a table, build indexes, and keep it all in managed storage. For a dataset that agents query occasionally, that's wasteful.

We needed a different approach: query the data where it sits on disk, with no ingestion step, and return results through MCP's JSON-RPC transport.

Why DuckDB + Parquet

DuckDB is an in-process OLAP database. It runs inside your Node.js process — no separate server, no TCP connections, no connection pooling. It reads files directly from the filesystem.

Parquet is a columnar storage format. Unlike CSV (which stores data row by row), Parquet stores each column contiguously and applies compression per column. This means:

  1. Column pruning: queries that touch a few columns read only those columns from disk, not entire rows.
  2. Better compression: similar values stored together compress far more effectively (8.7x here with ZSTD).
  3. Chunk skipping: per-row-group min/max statistics let the engine skip data that cannot match a filter.

The combination is powerful: DuckDB handles the query engine, Parquet handles storage efficiency, and both work in-process with no external dependencies.

The Conversion: CSV to Parquet

DuckDB itself handles the conversion. One SQL statement reads the CSV and writes Parquet:

conversion.ts
import { Database } from "duckdb-async";

const db = await Database.create(":memory:");
const conn = await db.connect();

// Convert 67 GB CSV → 7.7 GB Parquet
await conn.run(`
  COPY (SELECT * FROM read_csv_auto('taxi-trips.csv'))
  TO 'taxi-trips.parquet'
  (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 122880)
`);

// Verify row count
const rows = await conn.all(
  `SELECT COUNT(*) as cnt FROM read_parquet('taxi-trips.parquet')`
);
console.log(`Parquet rows: ${rows[0].cnt}`);
// → Parquet rows: 175842192

Key parameters:

  1. FORMAT PARQUET: write the output as Parquet rather than CSV.
  2. COMPRESSION ZSTD: the codec behind the 8.7x ratio, with negligible decompression overhead.
  3. ROW_GROUP_SIZE 122880: rows per row group (DuckDB's default); row groups are the unit of statistics and skipping.

Format           Size     Rows     Query (LIMIT 5)         Full Scan
CSV (raw)        67 GB    175.8M   ~8s (sequential scan)   Impractical
Parquet (ZSTD)   7.7 GB   175.8M   627ms                   ~1.2s (aggregation)

The Query Router

Not every dataset needs DuckDB. Small datasets (under 100 MB) query faster through PostgreSQL because the data is already indexed and the connection is already warm. Large datasets with Parquet derivatives should use DuckDB. We built a router that decides automatically.

query-router.ts
import { Database } from "duckdb-async";

const DUCKDB_THRESHOLD_MB = 1024; // default threshold: 1 GB

async function routeQuery(datasetId: string, sql: string) {
  const dataset = await getDataset(datasetId);
  const parquetFile = await findParquetDerivative(datasetId);

  // Route decision: use DuckDB if a Parquet file exists
  // and the dataset exceeds the size threshold
  if (parquetFile && dataset.sizeBytes > DUCKDB_THRESHOLD_MB * 1024 * 1024) {
    return queryViaDuckDB(parquetFile.storagePath, sql);
  }

  // Small datasets: PostgreSQL with existing indexes
  return queryViaPostgres(datasetId, sql);
}

async function queryViaDuckDB(parquetPath: string, sql: string) {
  const db = await Database.create(":memory:");
  const conn = await db.connect();

  try {
    // Create a VIEW that points to the Parquet file.
    // The user's SQL references "source" — our VIEW name.
    // parquetPath is server-controlled; escape single quotes regardless.
    const safePath = parquetPath.replace(/'/g, "''");
    await conn.run(
      `CREATE VIEW source AS SELECT * FROM read_parquet('${safePath}')`
    );

    return await conn.all(sql);
  } finally {
    // Close even when the query throws, so instances don't leak.
    await conn.close();
    await db.close();
  }
}

The router checks two conditions: (1) does a Parquet derivative exist for this dataset, and (2) does the dataset exceed a configurable size threshold (default: 1 GB). If both are true, the query goes to DuckDB. Otherwise, it goes to PostgreSQL.
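
The decision itself reduces to a pure predicate, which makes it easy to unit-test in isolation. A minimal sketch; the function name is ours, and the 1 GB default mirrors the threshold described above:

```typescript
// Hypothetical predicate mirroring the router's two conditions:
// a Parquet derivative must exist, and the dataset must exceed
// the size threshold (default 1 GB).
function shouldUseDuckDB(
  hasParquet: boolean,
  sizeBytes: number,
  thresholdMb = 1024
): boolean {
  return hasParquet && sizeBytes > thresholdMb * 1024 * 1024;
}

console.log(shouldUseDuckDB(true, 67 * 1024 ** 3)); // 67 GB with Parquet → true
console.log(shouldUseDuckDB(true, 50 * 1024 ** 2)); // 50 MB → false
console.log(shouldUseDuckDB(false, 67 * 1024 ** 3)); // no Parquet → false
```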

Gotcha #1: Lazy VIEW Creation

An in-memory DuckDB instance starts empty, and nothing persists between requests, so the VIEW over the Parquet file has to be created on every query. This sounds expensive but isn't: creating a VIEW over read_parquet() is essentially free, because DuckDB reads only the Parquet metadata (file footer and row-group stats), not the actual data. The data is read only when the query executes.

VIEW vs TABLE

Using CREATE VIEW source AS SELECT * FROM read_parquet(...) instead of CREATE TABLE is critical. A TABLE would copy all data into DuckDB's in-memory storage. A VIEW is just a pointer — the Parquet file stays on disk and DuckDB reads only the columns and row groups needed by the query.

Gotcha #2: Column Name Auto-Quoting

Real-world CSV files have column names with spaces, special characters, and mixed case. The Chicago taxi dataset has columns like Trip Start Timestamp, Pickup Community Area, and Trip Miles.

DuckDB handles these natively, but your SQL needs to double-quote them. If users (or AI agents) submit queries with unquoted column names, they'll get cryptic errors. We auto-detect column names from the Parquet metadata and quote them in the VIEW definition:

auto-quoting.ts
// Read column names from the Parquet metadata.
// parquet_schema() returns one row per schema element; its "name" column
// holds the element name, and the root element of DuckDB-written files
// is named 'duckdb_schema', so we filter it out.
const schema = await conn.all(
  `SELECT name FROM parquet_schema('${parquetPath}')
   WHERE name != 'duckdb_schema'`
);

// Build the VIEW with every column name double-quoted
const columns = schema
  .map(col => `"${col.name}"`)
  .join(", ");

await conn.run(
  `CREATE VIEW source AS SELECT ${columns}
   FROM read_parquet('${parquetPath}')`
);
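
One detail the snippet above glosses over: a column name can itself contain a double quote, which would break the interpolated VIEW definition. Standard SQL escapes an embedded " by doubling it. A small helper for that case (quoteIdent is our name, not part of the code above):

```typescript
// Hypothetical helper: quote an identifier for DuckDB SQL,
// doubling any embedded double quotes per standard SQL rules.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

console.log(quoteIdent("Trip Start Timestamp")); // "Trip Start Timestamp"
console.log(quoteIdent('odd"name'));             // "odd""name"
```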

Gotcha #3: MCP JSON-RPC Transport

MCP tools communicate over JSON-RPC — every result must be serialized to JSON and sent back through the transport layer. For small result sets this is fine. For large ones, three issues emerge:

  1. BigInt values: DuckDB's Node bindings return 64-bit integers (COUNT(*), SUM, BIGINT columns) as BigInt, which JSON.stringify rejects outright.
  2. Date objects: timestamps need an explicit, consistent string form (ISO 8601) rather than implicit coercion.
  3. Payload size: a JSON-RPC message carrying millions of rows is unusable, so results need a row cap and a truncation warning.

serialize.ts
function sanitizeRow(row: Record<string, any>) {
  const clean: Record<string, any> = {};
  for (const [key, val] of Object.entries(row)) {
    if (typeof val === "bigint") {
      // Number() is lossy above 2^53 - 1; switch to val.toString()
      // if exact 64-bit values matter for your data.
      clean[key] = Number(val);
    } else if (val instanceof Date) {
      clean[key] = val.toISOString();
    } else {
      clean[key] = val;
    }
  }
  return clean;
}
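
To see why sanitizing matters, note that JSON.stringify throws a TypeError on any BigInt, and a COUNT(*) result arrives as one. A quick demonstration (sanitizeRow repeated so the snippet runs standalone):

```typescript
// sanitizeRow as defined above, repeated so this snippet is self-contained.
function sanitizeRow(row: Record<string, any>) {
  const clean: Record<string, any> = {};
  for (const [key, val] of Object.entries(row)) {
    if (typeof val === "bigint") clean[key] = Number(val);
    else if (val instanceof Date) clean[key] = val.toISOString();
    else clean[key] = val;
  }
  return clean;
}

// A row shaped like what a COUNT(*) query returns: BigInt plus a timestamp.
const row = { cnt: 175842192n, ts: new Date("2024-01-15T12:00:00Z") };

try {
  JSON.stringify(row); // throws: BigInt cannot be serialized
} catch (e) {
  console.log((e as Error).name); // TypeError
}

console.log(JSON.stringify(sanitizeRow(row)));
// {"cnt":175842192,"ts":"2024-01-15T12:00:00.000Z"}
```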

Gotcha #4: DuckDB Timeout via MCP

When MCP tool calls go through the SSE transport, there's an inherent timeout problem — the MCP client expects a response within a certain window. DuckDB queries on large Parquet files can take seconds, and the first time DuckDB loads a Parquet file, there's a cold-start penalty.

Our benchmarks on the taxi dataset:

Query Type                               First Run (Cold)   Subsequent (Warm)
SELECT * LIMIT 5                         ~1.2s              627ms
Filtered query (WHERE trip_miles > 20)   ~2.1s              ~1.0s
Aggregation (GROUP BY community_area)    ~2.8s              ~1.2s

The cold-start penalty comes from reading the Parquet file footer and row group metadata. On subsequent queries, this metadata is cached by the OS page cache. For our MCP server, we set the tool timeout to 30 seconds to accommodate cold starts, and the query-level timeout to 10 seconds for the actual DuckDB execution.
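
The query-level timeout is plain Promise plumbing: race the DuckDB call against a timer. A minimal sketch under the numbers above (withTimeout is our name; nothing MCP-specific is involved):

```typescript
// Hypothetical helper: reject if `work` doesn't settle within `ms`.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`query exceeded ${ms}ms timeout`)),
      ms
    );
  });
  // Clear the timer in either outcome so it can't keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer!));
}

// Usage sketch: a 10-second budget for DuckDB execution, per the text above.
// const results = await withTimeout(conn.all(sql), 10_000);
```

Note that the DuckDB query keeps running after the race rejects; the timeout bounds the caller's wait, not the engine's work.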


The Complete Pattern

Here's the full query flow from MCP tool call to DuckDB result:

  1. MCP tool call arrives: dataset_query(dataset_id, sql, limit)
  2. Router checks: Does this dataset have a Parquet derivative? Is it above the size threshold?
  3. DuckDB path: Create in-memory DB → create VIEW over Parquet file → execute user SQL → sanitize results → serialize to JSON
  4. PostgreSQL path: Use existing connection pool → execute against loaded table → return results
  5. MCP response: JSON-RPC result with rows, column metadata, and truncation warning if applicable
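
Step 5's truncation warning is simple to make concrete: cap the row count and tell the agent when the cap was hit. A sketch with hypothetical names (capRows and the result shape are ours):

```typescript
// Hypothetical result shape for the MCP tool response (field names are ours).
interface QueryResult {
  rows: Record<string, unknown>[];
  truncated: boolean;
  warning?: string;
}

// Cap the result at `limit` rows and flag truncation so agents know
// more data exists than the transport returned.
function capRows(rows: Record<string, unknown>[], limit: number): QueryResult {
  if (rows.length <= limit) return { rows, truncated: false };
  return {
    rows: rows.slice(0, limit),
    truncated: true,
    warning: `Result truncated to ${limit} of ${rows.length} rows; add a LIMIT or WHERE clause to narrow the query.`,
  };
}
```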
The Result

AI agents can query 175 million row datasets through the same MCP tool call they use for 10,000 row datasets. The routing is invisible — they just call dataset_query and get results. The only difference is which engine executes under the hood.

When to Use This Pattern

  1. Large analytical datasets (multi-GB) that are queried occasionally: aggregations, filters, and scans over flat files.
  2. Data that arrives as CSV or Parquet and doesn't justify a full ingest-and-index pipeline in PostgreSQL.
  3. Single-server deployments where standing up a separate OLAP service would be overkill.

When Not to Use This

  1. Small, hot datasets: below the size threshold, an indexed PostgreSQL table with a warm connection answers faster.
  2. Workloads that mutate data: Parquet files are immutable, so this path is strictly read-only.
  3. Point lookups by key: a B-tree index beats a columnar scan for single-row retrieval.

Key Takeaways

  1. DuckDB + Parquet is the right tool for large analytical datasets. 627ms to query 175 million rows on a single VPS, no external database required.
  2. Build a query router, not a monolithic engine. Small datasets stay in PostgreSQL. Large datasets go to DuckDB. The routing is transparent to the caller.
  3. Use VIEWs, not TABLEs. CREATE VIEW source AS SELECT * FROM read_parquet(...) is free. CREATE TABLE would copy the entire dataset into memory.
  4. Sanitize before JSON serialization. BigInt, Date, and other DuckDB native types don't survive JSON.stringify() without conversion.
  5. ZSTD compression on Parquet is the sweet spot. 8.7x compression ratio with negligible decompression overhead.