Notes on the PyData Paris 2024 conference

In late September, I attended the PyData Paris 2024 conference. There were two days of talks and presentations, plus a day of sprints on open-source libraries, which I sadly couldn't participate in.

A few numbers first

In total, 46 talks were given on three parallel tracks (full schedule here), along with scores of lightning talks, a few coffee and lunch breaks, and a nice evening cocktail.

The event was hosted at “La Cité des Sciences et de l’Industrie”, and 680 people attended! The last PyData conference in Paris was in 2017, so I guess people were excited to join after such a long break.

I attended 16 talks, took 15 pages of notes, slept during only one talk (probably a new personal record), did some networking, and got a cool t-shirt!

City of Science and Industry, the biggest science museum in Europe!

My overall impressions

It was my first in-person conference since PyData Cambridge 2019, and it’s much, much better than attending online!

A lot of the talks were on LLMs and RAG, but I found most of them somewhat unsophisticated, or missing comparisons with basic benchmarks.

There were however some excellent talks on methodology and best practices (my personal favorites), and a few unexpectedly introspective ones.

I was honestly quite disappointed with the keynotes from Mistral and HuggingFace: I felt they were presenting a product catalog, rather than giving insightful perspectives on the field, which I think most attendees were expecting.

As always, lightning talks are very random, but very fun! For instance, an Italian guy managed to turn data structures into stand-up comedy material—no easy feat!

All in all, I came back with a lot of new things to explore!

My top-5 talks

“Handling predictive uncertainty in Machine Learning”, by Olivier Grisel

(video here, summary + link to the slides here)

This was one of the keynotes, but it was more of a presentation on methodology, particularly on calibrating output probabilities in ML models, and how to make sure your model doesn’t bankrupt your company. I really liked how he presented the decoupling of the predictive model from the decision making algorithm.
Olivier is one of the core members of the original scikit-learn team, and he's always giving insightful talks.
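
For the curious, here is a toy sketch of what decoupling the predictive model from the decision-making step can look like. It is not taken from Olivier's slides: the loan scenario, the cost values, and the names `expected_cost`, `decide`, and `COSTS` are all made up for illustration. The idea is that the model only outputs a calibrated probability, and a separate function picks the action with the lowest expected cost.

```python
# Hypothetical business costs for a loan decision (illustrative values only).
# Keys are (action, true_outcome), where true_outcome 1 means the loan defaults.
COSTS = {
    ("accept", 1): 1000.0,  # accepted a loan that defaults: we lose money
    ("accept", 0): -100.0,  # accepted a good loan: profit, i.e. negative cost
    ("reject", 1): 0.0,     # rejected a loan that would have defaulted
    ("reject", 0): 0.0,     # rejected a good loan (missed profit ignored here)
}


def expected_cost(p_default, costs, action):
    """Expected cost of `action`, given a calibrated probability of default."""
    return p_default * costs[(action, 1)] + (1 - p_default) * costs[(action, 0)]


def decide(p_default, costs, actions=("accept", "reject")):
    """Pick the action with the lowest expected cost.

    The predictive model is not involved here: it only has to supply
    `p_default`, which is why that probability needs to be well calibrated.
    """
    return min(actions, key=lambda action: expected_cost(p_default, costs, action))


if __name__ == "__main__":
    for p in (0.05, 0.20):
        print(p, decide(p, COSTS))
```

With these made-up costs, accepting only pays off when the predicted probability of default is below roughly 9%, which is very far from the default 0.5 threshold. Changing the costs changes the decision rule without retraining anything.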

“The expanding Apache Arrow universe”, by Joris Van den Bossche

(video here, summary of the talk here)

This was an unexpectedly fascinating talk on what’s going on behind the curtain when we're trying to efficiently store data in memory. It also allowed me to better understand the differences between Arrow, Parquet, and Feather, which can be somewhat blurry. The developers behind all of this are doing very low-level work in the stack, and I am truly in awe of their dedication and the positive impact they have on our field!
Joris is one of the core developers of pandas, Arrow, and GeoPandas, and he's very good at explaining deeply technical topics clearly.

“Unpacking business metrics”, by Max Halford

(video here, summary of the talk here, slides there)

This was a highly practical talk on how to do meaningful analytics for business purposes. I was already familiar with the topic, having read Max's blog post on it a few months ago. It laid out very plainly the kind of best practices that seem totally obvious when stated, but that I never see applied out there, so I think it had a lot of value!
On top of that, Max coalesced (😇) his insights into an open-source tool for us to easily use! And he used some really cool tech in the process, such as ibis.
I was just a little sad to see that some members of the audience really didn't get the value of the talk. I think they were too far on the academic side of the crowd, and obviously not used to touchy discussions with business stakeholders...
In any case, Max's blog is a must-read, he's a brilliant science communicator!

“Dreadful Frailties in Propensity Score Matching”, by Alexandre Abraham

(video here, summary of the talk here, slides, paper and code there)

Alexandre presented work from his recent paper on Propensity Score Matching, an algorithm widely used in the healthcare domain. Typically it's applied in order to understand the effect of a new treatment, evaluated in clinical trials on a specific population of sick patients, when given to a broader population of patients maybe not as sick, or with different demographic characteristics. The gist of it is to reweight parts of the samples, to align their statistics.
He showed that this method allows for many degrees of freedom, which can lead to shockingly different estimates, and that's obviously not good. He also introduced ways to reduce this estimation variability.
It was deeply interesting and gave me a lot of pointers to explore, because there's a lot that I didn't fully understand on the spot!
It was also a nice occasion to catch up with Alexandre, discuss the evolution of the healthcare startup scene in Paris, and exchange impressions on the talks.
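
To make the "reweight the samples to align their statistics" idea a bit more concrete, here is a toy sketch. Big caveat: this is plain stratum reweighting on a single known covariate, not the propensity score matching procedure from Alexandre's paper, and all the numbers and names (`stratum_weights`, `weighted_mean`, the severity groups) are invented for illustration. It assumes a trial sample that is mostly made of severe cases, and a target population that is mostly mild.

```python
from collections import Counter


def stratum_weights(sample_groups, target_shares):
    """Weight per group = share in the target population / share in the sample."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {group: target_shares[group] / (counts[group] / n) for group in counts}


def weighted_mean(values, weights):
    """Mean of `values` where each value counts proportionally to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Made-up trial data: 8 severe patients, 2 mild ones, with an observed effect each.
groups = ["severe"] * 8 + ["mild"] * 2
effects = [10.0] * 8 + [2.0] * 2

# Made-up target population: mostly mild patients.
target_shares = {"severe": 0.3, "mild": 0.7}

weights_by_group = stratum_weights(groups, target_shares)
sample_weights = [weights_by_group[g] for g in groups]

naive = sum(effects) / len(effects)                  # average over the trial as-is
reweighted = weighted_mean(effects, sample_weights)  # average aligned with the target

if __name__ == "__main__":
    print(f"naive: {naive:.1f}, reweighted: {reweighted:.1f}")
```

In this toy setup the naive average effect is 8.4, while the reweighted one drops to 4.4, because the mild patients count for much more once the sample is aligned with the target population. Real propensity-based methods estimate those weights (or matches) from many covariates at once, which is where the degrees of freedom discussed in the talk come from.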

“MLOps at Renault Group”, by Alix Tiran-Cappello and Alexandre Carton

(video unavailable, summary of the talk here)

Last one, a surprisingly deep talk! As the speakers were scaling an MLOps pipeline to their whole organization, they found themselves grappling with Conway's Law in real life! It was great to hear them having that kind of hindsight, and explaining the organizational challenges they had to overcome.

Things I want to explore

I came back from the conference with a long list of interesting libraries and references to nice scientific papers and books. In no particular order, here's what I'll look into in the coming months:

more theory on propensity-score matching
FireDucks: a compiler-accelerated DataFrame library with a pandas API (link)
Ibis, an open source dataframe library that works with 20+ data backends: SQL, dataframes... (link)
Py.Cafe, a platform to build Python apps that execute in the browser (thanks to Pyodide)—no need for deployment anymore! (link)
ADBC, a cross-language, Arrow-native column-based database access protocol (link)
“Practical Data Privacy”, a 2023 O’Reilly book by Katharine Jarmul, a Berlin-based privacy activist and data scientist straight out of a Cory Doctorow novel (link)

Finally, there are cool events coming up soon: PyCon FR will be hosted in Strasbourg from October 31st to November 3rd (link), and Scaleway is organizing the one-day ai-PULSE conference at StationF in Paris on November 7th (link).
In the meantime, you can watch the released videos from the recent PyCon DE and PyData Amsterdam conferences! (here and there)

Until next time! 🐍