Notes on the PyData Paris 2025 conference

At the end of September, I attended PyData Paris 2025, once again at the Cité des Sciences in Paris. Like last year, this is not meant to be a complete conference report. More modestly: a few notes, talks I liked, and things I want to look into later.
The full schedule is available here.

A few numbers first

The conference took place over two days, September 30th and October 1st, with a sprint day afterwards.

There were 48 presentations, split across three parallel tracks, plus lightning talks, coffee breaks, lunch breaks, and a cocktail reception. I attended 19 talks in total, including 4 that I watched later on YouTube, and took around 20 pages of notes.

Also, I was briefly seated next to Wes McKinney, creator of pandas, which was enough to make my inner data nerd very happy.

[Photo: the Cité des Sciences et de l'Industrie, seen from the sky]

My overall impressions

As always, PyData Paris was excellent. It’s a nice mix of practitioners, researchers, open-source maintainers, consultants, and people from larger organizations. The vibe is technical, but not academic, which is exactly what I like about it.

Compared to last year, I felt a few shifts.

First, LLM/RAG tooling seems to be growing up a little. The talks I saw were less about flashy demos and more about evaluation, retries, structured outputs, integration with existing workflows, and the general messiness of building reliable systems around non-deterministic components.

Second, the ecosystem is very clearly moving away from a pandas-only worldview. Pandas is still everywhere, of course, but the energy seems to be around Polars on one side, and DuckDB on the other. Personally, I’m not especially eager to learn yet another dataframe syntax, so DuckDB feels like the more tempting direction for my own notebook workflows.

Third, I felt a real consolidation around open-source projects. probabl, the scikit-learn spin-off, had a noticeable presence, and it was interesting to see the broader ecosystem around scikit-learn becoming more structured. Tools like skrub and skore are good examples: not glamorous in the "new model architecture" sense, but very relevant if you care about real-world tabular data, evaluation, reproducibility, and maintainable ML workflows.

Some talks I liked

State of Parquet 2025

A deep dive into the Parquet file format: row groups, columns, pages, metadata footers, compression, encodings, Bloom filters, and various extensions.
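
As a rough illustration of the "metadata footer" part: an unencrypted Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes just before the trailing magic store the footer length as a little-endian integer. The sketch below locates that footer in a synthetic byte string, not a real Parquet file. The names `footer_span` and `fake_footer` are mine, and decoding the Thrift-encoded metadata itself is deliberately left out.

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker at both ends of an unencrypted Parquet file


def footer_span(data: bytes) -> tuple:
    """Return (start, end) byte offsets of the footer metadata inside `data`."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end


# Synthetic stand-in for a file: the placeholder bytes are NOT valid Parquet content.
fake_footer = b"<thrift-encoded FileMetaData would go here>"
blob = (
    MAGIC
    + b"<row groups / column chunks / pages>"
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + MAGIC
)

start, end = footer_span(blob)
print(blob[start:end])
```

Readers start from the end of the file for this reason: the footer tells them where the row groups and column chunks live, so they can skip everything a query does not need.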

My main takeaway: Parquet is much more complicated than "a columnar file format". It is a seriously engineered piece of infrastructure, and I am very happy other people are in charge of implementing it.

Video here.

Open-source Business

A panel discussion with Sylvain Corlay, from QuantStack, and Yann Lechelle, from probabl.

The discussion was about open-source business models, the relationship between community needs and customer needs, and the role open source can play for Europe in the competition with US big tech.

One point I found interesting: open source can reduce marketing costs, because a good project already has users, visibility, and trust. But the needs of paying customers and the needs of the community are not always the same, so there is real product and governance work behind the scenes.

Yann Lechelle recently published a book on this topic, Ouvertarisme, available here.

Video here.

You Don’t Need Spark For That

A pragmatic intro to lakehouse workflows in Python, by Romain Clément.

The talk covered open table formats such as Delta, Iceberg, Hudi and DuckLake, with the usual lakehouse promises: ACID semantics, schema enforcement, time travel, snapshots, etc.

The nice part was that Spark was not treated as the unavoidable starting point. The talk explored how far one can go with a lighter Python stack, Parquet files, transaction logs, object storage, and DuckDB.
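
To make the "transaction log plus data files" idea concrete, here is a toy sketch of how an append-only log can give snapshots and time travel. It is my own simplification, not the actual protocol of Delta, Iceberg, Hudi or DuckLake; `ToyTable`, `Commit` and the file names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """One entry in the append-only log: data files added and removed."""
    added: frozenset = frozenset()
    removed: frozenset = frozenset()


class ToyTable:
    """A table is just a log of commits; data files are never edited in place."""

    def __init__(self) -> None:
        self._log = []

    def commit(self, added=(), removed=()) -> int:
        """Append a commit and return its version number."""
        self._log.append(Commit(frozenset(added), frozenset(removed)))
        return len(self._log) - 1

    def snapshot(self, version: Optional[int] = None) -> set:
        """Replay the log up to `version` (inclusive) to list the live data files."""
        if version is None:
            version = len(self._log) - 1
        live = set()
        for entry in self._log[: version + 1]:
            live -= entry.removed
            live |= entry.added
        return live


table = ToyTable()
v0 = table.commit(added=["part-000.parquet"])
v1 = table.commit(added=["part-001.parquet"])
# A rewrite: one new file replaces an old one in a single commit.
v2 = table.commit(added=["part-002.parquet"], removed=["part-000.parquet"])

latest = table.snapshot()       # current state of the table
as_of_v1 = table.snapshot(v1)   # "time travel" to an earlier version
print(sorted(latest), sorted(as_of_v1))
```

Real formats add a lot on top of this (atomic commits on object storage, schemas, statistics, checkpoints), but the replay-the-log idea is the part that makes snapshots cheap.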

It also included a demo of laketower, a small web interface to browse datasets, which I found quite appealing.

Video here, slides here.

Modern Web Data Extraction

A good 101 on crawling and scraping, but broader than a simple BeautifulSoup tutorial.

The talk covered the practical constraints of web extraction: robots.txt, terms of service, crawl delays, user agents, sitemaps, and the general etiquette of crawling. It also gave a quick tour of the tooling landscape: Scrapy, Selenium, Puppeteer, Playwright, and related tools.
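
As a small illustration of the etiquette part, Python's standard library can already check a robots.txt policy before fetching anything. This is a minimal sketch, not something shown in the talk: the robots.txt content, the `my-notes-bot` user agent and the example.com URLs are made up, and the file is parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

USER_AGENT = "my-notes-bot"  # hypothetical crawler name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parsed policy what this user agent may fetch, and how fast.
allowed = parser.can_fetch(USER_AGENT, "https://example.com/blog/post")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

In a real crawler one would point the parser at the site's actual robots.txt (for instance with `set_url` and `read`) and sleep for the advertised delay between requests. Terms of service still have to be read by a human.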

Nothing revolutionary, but a clear and useful overview of a domain that is often approached too casually. And the speaker was really funny, which made for an engaging talk!

Video here.

How to do real TDD in data science?

A very nice live-coding demo by Alix Tiran-Cappello, showing how to use test-driven development to refactor data science code and migrate a pipeline from pandas to Polars.

I liked the emphasis on equivalence testing: when refactoring a data pipeline, the main question is often "did I preserve the behavior?", and tests are the obvious way to make that question less scary.
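
Here is a minimal sketch of what such an equivalence test can look like. It is not the code from the talk: to keep it dependency-free I use plain Python functions instead of pandas and Polars, and `legacy_totals`, `refactored_totals` and the sample rows are hypothetical.

```python
from collections import defaultdict


def legacy_totals(rows):
    """Original implementation: explicit loop with manual bookkeeping."""
    totals = {}
    for row in rows:
        if row["amount"] is None:
            continue
        if row["customer"] not in totals:
            totals[row["customer"]] = 0.0
        totals[row["customer"]] += row["amount"]
    return totals


def refactored_totals(rows):
    """Refactored implementation that must preserve the legacy behavior."""
    totals = defaultdict(float)
    for row in rows:
        if row["amount"] is not None:
            totals[row["customer"]] += row["amount"]
    return dict(totals)


# Inputs chosen to cover the edge cases we care about: empty, missing, repeated.
CASES = [
    [],
    [{"customer": "a", "amount": 1.0}, {"customer": "a", "amount": 2.0},
     {"customer": "b", "amount": 5.0}],
    [{"customer": "a", "amount": None}, {"customer": "b", "amount": 4.0}],
]


def assert_equivalent(old, new, cases):
    """Fail loudly on the first input where the two implementations disagree."""
    for rows in cases:
        expected, actual = old(rows), new(rows)
        assert actual == expected, f"mismatch on {rows!r}: {actual!r} != {expected!r}"


assert_equivalent(legacy_totals, refactored_totals, CASES)
```

The legacy function acts as the oracle, so the test needs no hand-written expected values. The price is that it also preserves any bugs in the legacy code, which is usually what you want during a migration and something to revisit afterwards.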

Even without caring about Polars specifically, there were useful patterns to steal for notebook work.

Video here.

Move beyond academia

A talk by Alexandre Abraham and Louis Ledain on a new benchmark for machine learning on tabular data.

The motivation is easy to agree with: many academic tabular benchmarks are too clean, too small, or too disconnected from real industrial problems. If we want methods that transfer to practice, we need benchmarks that look more like practice.

I also liked the idea of comparing datasets through model behavior, which reminded me of some old work I did during my PhD.

Video here.

Documents Meet LLMs

A practical talk on using LLMs for information extraction from documents.

The interesting part was not "LLMs can extract structured data", which we know by now, but the operational reality around it: missing outputs, retries, structured formats, validation, and the need for actual evaluation criteria.

A useful reminder that checking whether a field is an integer is easy. Checking whether the extracted information is correct is the hard part.
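
The "easy" half of that workflow can be sketched in a few lines: parse the model output, validate its structure, and retry on failure. This is my own illustration under simple assumptions, not the speaker's code. `call_model` stands for any function that takes a document and returns text, the field names are invented, and the fake model below just replays canned responses.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"invoice_number": str, "total_cents": int}


def validate(raw: str) -> Optional[dict]:
    """Return the parsed payload if it is structurally valid, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return None
    return data


def extract_with_retries(
    call_model: Callable[[str], str], document: str, max_attempts: int = 3
) -> Optional[dict]:
    """Call the model until its output passes validation or attempts run out."""
    for _ in range(max_attempts):
        result = validate(call_model(document))
        if result is not None:
            return result
    return None


# Fake model for the demo: chatty text, then a missing field, then a valid answer.
responses = iter([
    "Sure! Here is the JSON you asked for:",
    '{"invoice_number": "A-17"}',
    '{"invoice_number": "A-17", "total_cents": 1250}',
])


def fake_model(document: str) -> str:
    return next(responses)


extracted = extract_with_retries(fake_model, "invoice text goes here")
print(extracted)
```

Note what this does not do: it only tells you that `total_cents` is an integer, not that it matches the document. That part needs labeled examples and real evaluation criteria.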

Video here.

TorchFastText @ INSEE

A concrete applied ML talk on automatically classifying business activities into NAF/NACE codes (the French and European nomenclatures of economic activities).

The constraints were interesting: training can happen on GPU, but inference has to run on CPU. The surrounding stack was also nice: DVC, MLflow, Quarto, Captum for explainability, and a bit of DuckDB.

The original fastText repository was archived in 2024, so this is also a good example of a maintenance problem becoming a software project.

Video here.

Other notes

A few other talks or topics I want to follow up on:

Prediction intervals, from Olivier Grisel: quantile regression, pinball loss, coverage, proper scoring rules, reliability diagrams (video here).
Quarto, for turning notebooks into reports, slides, websites, or blog posts (link).
Browser-based Jupyter / AI workflows, especially JupyterLite, Pyodide, WebLLM, and Transformers.js.
Skada, for domain adaptation and distribution shift.
COSApp, for simulating complex systems; it made me wonder whether similar tools could help model data pipelines.
Why GenAI Models Cannot Feed Themselves, a nice talk by Valeria Zuccoli on model collapse, with a few papers I want to read (video here).
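
For the prediction intervals item above, two of the quantities are simple enough to write down directly. The sketch below is a plain-Python reminder for myself, with made-up numbers and my own function names: the pinball loss scores a single predicted quantile, and the empirical coverage measures how often the true value falls inside a predicted interval.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball (quantile) loss of `y_pred` as an estimate of the given quantile."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by `quantile`, over-prediction by `1 - quantile`.
        total += quantile * diff if diff >= 0 else (quantile - 1) * diff
    return total / len(y_true)


def coverage(y_true, lower, upper):
    """Fraction of observations that fall inside their [lower, upper] interval."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return inside / len(y_true)


# Made-up observations and predicted intervals.
y_true = [10.0, 12.0, 9.0, 15.0]
lower = [8.0, 11.0, 10.0, 12.0]
upper = [12.0, 14.0, 13.0, 16.0]

empirical_coverage = coverage(y_true, lower, upper)
upper_loss = pinball_loss(y_true, upper, quantile=0.9)
print(empirical_coverage, upper_loss)
```

If the interval bounds are meant to be the 10% and 90% quantiles, the empirical coverage on held-out data should land near 80%. Comparing the two numbers is the basic sanity check before looking at anything fancier.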

Things I want to explore

My short list after the conference:

use DuckDB more systematically in notebooks;
test Quarto properly;
keep exploring marimo;
learn more about prediction intervals and reliability diagrams;
keep an eye on probabl, especially skrub and skore;
maybe attend a more data-engineering-oriented conference next time.

Until next year!