Notes on DuckCon 7 🦆

I followed DuckCon 7 live on YouTube today, while the conference itself took place in Amsterdam.

This was my first genuinely database-oriented conference. I am more accustomed to Python, data science, or applied AI events, so I was curious to see what the atmosphere would be like.

The format was excellent: roughly three hours of short talks, with no live demos. This removed most of the usual conference dead time (laptops refusing to connect, environments breaking, speakers discovering that fifteen minutes is not actually half an hour). They kept the schedule to within about two minutes, which may be the most database-conference thing imaginable.

This is not a complete report. More modestly: a few things I liked, a few tools I want to investigate, and my overall impression of an ecosystem that is still visibly being built.

The full recording is available here.

The ecosystem feels energetic

My general impression was very positive.

DuckDB is no longer just the neat embedded OLAP database that data people occasionally use to query a Parquet file from a notebook. It is becoming the centre of a broader ecosystem: DuckLake for lakehouse storage, Quack for client-server communication, compatibility layers for existing APIs, browser-based analytical applications, database extensions, and increasingly, agent-oriented tooling.

Some of this will presumably disappear. Some projects are still in alpha, and several of the talks were about things that are only beginning to take shape. But that is also what made the conference interesting. There was a noticeable sense that people were actively discovering what becomes possible once a fast analytical database is cheap and easy to embed almost anywhere.

The opening "State of the Duck" talk announced DuckDB 2.0 for the autumn, with the codename Cinnamon Teal. The planned additions include a VARIANT type (described as "JSON on steroids"), triggers, asynchronous I/O for object stores, improved partitioning support, and a new SQL parser.

Useful features, but the part I found most interesting was not any individual SQL addition. It was how far the project is expanding beyond its initial local, single-process framing.

Picture of a Cinnamon Teal duck, codename of DuckDB's next release.

DuckLake and the appeal of less infrastructure

DuckLake was announced about a year ago and recently reached version 1.0.

The basic proposition is attractive: use Parquet files for data storage and a normal SQL database for the catalogue and transactional metadata. You get the expected lakehouse features (snapshots, time travel, schema evolution, partitioning and ACID transactions) without reproducing all the machinery of formats such as Iceberg.
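The split between data files and a SQL catalogue can be sketched with the Python standard library alone. This is a toy illustration of the general idea, not DuckLake's actual schema or API: SQLite stands in for the catalogue database, CSV files stand in for Parquet, and the `snapshot` and `data_file` tables, along with the `commit` and `read` helpers, are hypothetical names invented for this sketch.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical toy catalogue: table and column names are invented for this sketch.
SCHEMA = """
CREATE TABLE snapshot (id INTEGER PRIMARY KEY AUTOINCREMENT);
CREATE TABLE data_file (
    path TEXT NOT NULL,
    added_in INTEGER NOT NULL REFERENCES snapshot(id),
    removed_in INTEGER REFERENCES snapshot(id)  -- NULL while the file is live
);
"""


def commit(catalog, data_dir, rows=None, remove=()):
    """Create a snapshot that adds one new file and/or retires existing files."""
    with catalog:  # one catalogue transaction: commit on success, roll back on error
        snap = catalog.execute("INSERT INTO snapshot DEFAULT VALUES").lastrowid
        for path in remove:
            # Files are never rewritten or deleted here, only marked as retired.
            catalog.execute(
                "UPDATE data_file SET removed_in = ? WHERE path = ?", (snap, path)
            )
        if rows:
            path = str(Path(data_dir) / f"part-{snap}.csv")
            with open(path, "w", newline="") as fh:
                csv.writer(fh).writerows(rows)
            catalog.execute(
                "INSERT INTO data_file (path, added_in) VALUES (?, ?)", (path, snap)
            )
    return snap


def read(catalog, snapshot_id):
    """Return the rows visible at a given snapshot (a minimal form of time travel)."""
    files = catalog.execute(
        "SELECT path FROM data_file "
        "WHERE added_in <= ? AND (removed_in IS NULL OR removed_in > ?) "
        "ORDER BY added_in",
        (snapshot_id, snapshot_id),
    ).fetchall()
    rows = []
    for (path,) in files:
        with open(path, newline="") as fh:
            rows.extend(tuple(r) for r in csv.reader(fh))
    return rows


def demo():
    with tempfile.TemporaryDirectory() as data_dir:
        catalog = sqlite3.connect(":memory:")
        catalog.executescript(SCHEMA)
        s1 = commit(catalog, data_dir, rows=[("ams", "10")])
        s2 = commit(catalog, data_dir, rows=[("par", "20")])
        first = str(Path(data_dir) / f"part-{s1}.csv")
        s3 = commit(catalog, data_dir, remove=[first])  # logical delete only
        result = {s: read(catalog, s) for s in (s1, s2, s3)}
        catalog.close()
        return result
```

Calling `demo()` returns the rows visible at each of the three snapshots. The point of the sketch is that snapshots and time travel reduce to ordinary rows and an ordinary transaction in the catalogue, while the data files themselves stay immutable. A real system adds a great deal on top (statistics, schema evolution, concurrency control, compaction) that this sketch ignores.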

I should be careful here. A short conference presentation is not enough to conclude that DuckLake is simpler in every operational setting, or that it is a generic replacement for Snowflake, Iceberg or a managed lakehouse platform.

Still, it made me want to test it properly.

A recurring pattern in data engineering is that we adopt architectures designed for the largest possible workload, then apply them to much smaller problems. We end up operating distributed systems because the word "data" appeared somewhere in the requirements.

DuckLake is appealing because it asks the reverse question: how far can we go with Parquet, SQL, object storage and a much smaller amount of infrastructure?

One of the lightning talks apparently managed to run a DuckLake setup on Hetzner for less than €15 a month. The talk was too short to provide much substance, but it is exactly the sort of experiment I would like to reproduce.

This also connects nicely with DuckDB’s article on "The Lost Decade of Small Data": a useful reminder that "does not fit in memory on one laptop" and "requires a large distributed platform" are not the same statement.

SQLFrame: an exit ramp from unnecessary Spark

The most directly practical talk was probably Nicolas Renkamp's presentation of a migration from PySpark to DuckDB at Merck.

The migration relied on SQLFrame, which implements the PySpark DataFrame API and translates operations into SQL, using SQLGlot underneath. Existing PySpark transformation code can therefore run against DuckDB or another database engine without an immediate rewrite.

This is a clever migration strategy.

The usual objection to replacing an oversized platform is not necessarily that the replacement cannot execute the workload. It is that years of code, tests and team knowledge are built around the existing API. Even when the original architectural choice no longer makes sense, rewriting everything is expensive and risky.

SQLFrame provides an intermediate path: preserve much of the DataFrame code while changing the execution engine, then gradually decide what should remain as Python and what should eventually become explicit SQL.
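The general technique, a DataFrame-style API that builds SQL instead of executing anything itself, can be illustrated with a deliberately tiny sketch. This is not SQLFrame's API or implementation: the `Frame` class and its methods are hypothetical, it concatenates trusted strings rather than building expression trees as a SQLGlot-based tool would, and SQLite stands in for the target engine so that the example runs with the standard library only.

```python
import sqlite3
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    """Hypothetical lazy frame: methods return new Frames; SQL runs only in collect()."""

    table: str
    columns: tuple = ("*",)
    predicates: tuple = ()
    group_keys: tuple = ()

    def where(self, condition: str) -> "Frame":
        return replace(self, predicates=self.predicates + (condition,))

    def group_by(self, *keys: str) -> "Frame":
        return replace(self, group_keys=keys)

    def agg(self, **aggregates: str) -> "Frame":
        # agg(total="SUM(amount)") becomes "SUM(amount) AS total"
        exprs = tuple(f"{expr} AS {name}" for name, expr in aggregates.items())
        return replace(self, columns=self.group_keys + exprs)

    def to_sql(self) -> str:
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        return sql

    def collect(self, conn: sqlite3.Connection) -> list:
        # The engine is chosen here, not in the pipeline definition.
        return conn.execute(self.to_sql()).fetchall()


# Invented sample data for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("eu", 10.0), ("eu", 5.0), ("us", 7.0), ("us", -1.0)],
)

pipeline = (
    Frame("orders")
    .where("amount > 0")
    .group_by("region")
    .agg(total="SUM(amount)")
)
print(pipeline.to_sql())
print(sorted(pipeline.collect(conn)))
```

Because the pipeline is only a description until `collect` is called, the same method chain could in principle be handed to a different connection. That separation between the API the team writes against and the engine that executes the query is what makes an incremental migration plausible.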

The numbers presented by Merck were striking: roughly 75% less compute overall, with some jobs going from eight hours to ten to twenty minutes. They were running DuckDB on AWS Spot instances, so this was not simply a laptop benchmark presented as an enterprise architecture.

Of course, this does not prove that DuckDB is faster than Spark in general. Spark exists for good reasons, and at sufficient scale distribution is not optional.

But distribution also has a cost. If the data fits comfortably on one machine, Spark may spend a lot of time coordinating work that DuckDB can simply execute.

The speaker's recommendation was sensible: start with the workloads that look most promising rather than attempting a heroic platform migration. This is probably relevant to quite a few organizations with medium-sized PySpark pipelines.

DuckDB as an application runtime

Another thread I liked was the use of DuckDB inside standalone analytical applications.

The most polished example was SQLRooms, an open-source React toolkit for building local-first data applications. DuckDB runs in the browser, the application can process local data without a dedicated analytical backend, and the data does not necessarily have to leave the user's machine.

This reminded me of several DuckDB-powered interfaces I saw circulating on Twitter a few years ago but never properly investigated.

It also made me wonder where this approach could replace Streamlit.

I like Streamlit. It is an extremely efficient way to turn some Python and a dataframe into an internal application. But it generally preserves a fairly conventional architecture: a Python process runs somewhere, queries data, maintains application state and serves the interface.

A browser application with an embedded analytical engine is a different proposition. For the right use case, deployment can become little more than publishing static assets and making a Parquet file available.

That will not work for every application. Authentication, centrally controlled data, write operations, very large datasets and business logic will often require a backend. React is also a rather larger commitment than writing twenty lines of Streamlit.

But for portable exploratory tools, public datasets, offline applications or self-contained analytical reports, the model is very appealing.

A genomics presentation from the Hartwig Medical Foundation showed another good example: a data-heavy application combining DuckDB queries with interactive visualization. These applications felt less like dashboards and more like small analytical products.

SQL, search and agents

There were inevitably a few talks about agents.

Spotify presented a SQL layer built over user listening histories, allowing internal agents to query behavioural data. Altertable.ai presented a more unusual idea: adding search-oriented retrieval directly to DuckLake.

Their starting observation was that coding agents use grep constantly (something that has been bugging me too, in my explorations of using LLMs with Markdown notes). Before understanding a codebase precisely, they search for relevant fragments. But when agents interact with analytical data, we usually expect them to know the schema, identify the correct tables and generate a valid query immediately.

The missing primitive, in their framing, is schema-agnostic retrieval over the lakehouse.

Their implementation adds search indexes as Parquet "sidecars", close to the underlying data. The appeal is partly architectural: nobody particularly enjoys maintaining yet another pipeline to synchronize a database with a separate search engine.

I did not follow every implementation detail, so I would need to revisit the talk before forming a strong opinion. But the underlying idea seems right. Reliable agents over data will need more than text-to-SQL. They need ways to discover what data exists, retrieve likely relevant subsets, understand semantics, and only then formulate queries.

Once again, the model is not the hard part. The useful system is everything around it.
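As a rough illustration of what "grep for a database" might mean, the sketch below searches every column of every table for a substring without knowing the schema in advance. It is a brute-force scan, not the Parquet sidecar index approach described in the talk. SQLite stands in for the lakehouse, and the `grep_database` helper, the tables and the sample rows are all hypothetical.

```python
import sqlite3


def grep_database(conn, needle):
    """Return (table, column, value) hits for a substring, without knowing the schema."""
    hits = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
        for info in conn.execute(f'PRAGMA table_info("{table}")').fetchall():
            column = info[1]
            # Case-insensitive substring match; NULL values never match.
            query = (
                f'SELECT DISTINCT "{column}" FROM "{table}" '
                f'WHERE instr(lower(CAST("{column}" AS TEXT)), lower(?)) > 0'
            )
            for (value,) in conn.execute(query, (needle,)):
                hits.append((table, column, value))
    return hits


# Invented sample data for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artists (id INTEGER, name TEXT);
    CREATE TABLE plays (artist_id INTEGER, device TEXT, note TEXT);
    INSERT INTO artists VALUES (1, 'Duck Sauce'), (2, 'Teal Ensemble');
    INSERT INTO plays VALUES (1, 'phone', 'duck-themed playlist'), (2, 'laptop', NULL);
    """
)

hits = grep_database(conn, "duck")
for table, column, value in hits:
    print(f"{table}.{column}: {value}")
```

An agent could use hits like these to decide which tables and columns deserve a proper query. A full scan obviously does not scale, which is presumably where precomputed indexes stored next to the data come in.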

A grammar of graphics inside SQL

One of the more delightful talks was about ggsql, an alpha-stage project from Posit that brings a grammar-of-graphics approach directly into SQL.

You write a normal query, then add declarative clauses describing variables, graphical marks, scales, facets and labels.

The project already exists as a Jupyter kernel, a DuckDB extension, and integrations for VS Code and Positron. The demonstration included a recreation of Minard's famous visualization of Napoleon's Russian campaign, which is a fairly ambitious "hello world"!

Screenshot of the Minard chart recreation demo.

I am not yet convinced that SQL should absorb every other language and interface in data work. At some point, a query can become a programming language, visualization specification, orchestration framework and cry for help.

But the declarative model is elegant. It may also work particularly well with agents: generating a constrained visualization grammar is safer and easier to inspect than asking a model to write arbitrary Python plotting code.

At the very least, the online playground is worth trying.

What I want to explore

My short list after the conference:

build a small DuckLake locally and understand what the operational model actually looks like;
test SQLFrame on a real PySpark pipeline rather than a toy example;
try SQLRooms and compare the development experience with Streamlit;
experiment with ggsql;
understand the Altertable approach to search indexes in more detail;
use DuckDB more often as an embedded component of applications, rather than only as a notebook query engine.

My main takeaway is not that DuckDB should replace every existing data platform.

It is that a growing class of analytical problems can probably be solved with much less machinery than we have become accustomed to using.

A laptop, some Parquet files and a duck may take you surprisingly far.