Napaykullayki! 🇵🇪
You've likely already heard about survivorship bias. Here's a striking (and very literal!) example, worked out by mathematician Abraham Wald during the Second World War ✈️. You can discover this fascinating episode in an excerpt from the book "How Not to Be Wrong" by Jordan Ellenberg, a professor of mathematics at the University of Wisconsin-Madison.
There is also a more technical account in a column from the AMS (American Mathematical Society), which explains Wald's calculations in detail.
Confusion matrices are one of the essential tools for evaluating the performance of machine learning models, but they become unwieldy as soon as the problem involves many classes or a hierarchical structure. Researchers from Apple's 🍏 Machine Learning Research department proposed an elegant solution to these limitations in a recent paper: by designing an algebra that represents these matrices as probability distributions, they built a system in which confusion matrices can be manipulated easily and interactively, for example by grouping several classes together.
The approach is clearly explained in this video, as well as in the corresponding article (which received a Best Paper Award 🤩 when it was presented at the Conference on Human Factors in Computing Systems in New Orleans a few days ago). An interactive demo is also available here 🤖.
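To make the class-grouping idea concrete, here is a minimal sketch in plain Python. It is not the algebra from the paper, only an illustration of why the probabilistic view is convenient: once the counts are read as a joint distribution over (true class, predicted class), grouping classes amounts to summing the matching rows and columns. The function names, the toy counts and the `groups` mapping are all assumptions made up for this example.

```python
def merge_classes(matrix, labels, groups):
    """Collapse a confusion matrix by summing the rows and columns of grouped classes.

    matrix[i][j] counts samples whose true class is labels[i] and whose
    predicted class is labels[j]. `groups` maps an original label to the
    name of its parent class (hypothetical hierarchy); labels missing
    from `groups` are kept unchanged.
    """
    # Build the list of merged labels, preserving first-seen order.
    new_labels = []
    for label in labels:
        parent = groups.get(label, label)
        if parent not in new_labels:
            new_labels.append(parent)
    index = {name: k for k, name in enumerate(new_labels)}

    # Add every original cell to the cell of its parent classes.
    merged = [[0] * len(new_labels) for _ in new_labels]
    for i, true_label in enumerate(labels):
        for j, pred_label in enumerate(labels):
            row = index[groups.get(true_label, true_label)]
            col = index[groups.get(pred_label, pred_label)]
            merged[row][col] += matrix[i][j]
    return new_labels, merged


def normalize(matrix):
    """Turn raw counts into a joint probability distribution (entries sum to 1)."""
    total = sum(sum(row) for row in matrix)
    return [[cell / total for cell in row] for row in matrix]


# Toy data, invented for illustration only.
labels = ["cat", "dog", "car"]
matrix = [
    [8, 2, 0],  # true cat
    [3, 6, 1],  # true dog
    [0, 1, 9],  # true car
]
groups = {"cat": "animal", "dog": "animal", "car": "vehicle"}

new_labels, merged = merge_classes(matrix, labels, groups)
joint = normalize(merged)
```

With these toy counts, the cat/dog confusions disappear into a single "animal" cell, which is exactly the kind of question ("is the model at least getting the parent class right?") that a flat matrix makes hard to answer.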
Finally, some NLP: the result of recent work by a team from CépiDc (Centre for Epidemiology on Medical Causes of Death, a service of INSERM). In this article, they describe how they built a transformer-based model to automatically assign ICD-10 codes to diseases described in French free text. This is one of the fundamental tasks of medical text analysis.
The clever part is that they use death certificate data as a supervised training set: the certificates contain a free-text description of the causes of death, as well as the corresponding codes, entered by a doctor.
The model shows very good performance, but it remains to be shown that it works well on other types of data, for example hospitalization reports 🏥.
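As a rough sketch of what "supervised training data" means here, each certificate can be turned into a (text, codes) pair that a text classifier could learn from. The `Certificate` class, its field names and the two sample records below are hypothetical and invented for illustration; the article's actual data format and preprocessing are not described here.

```python
from dataclasses import dataclass


@dataclass
class Certificate:
    cause_text: str    # free-text description of the causes of death
    icd10_codes: list  # ICD-10 codes recorded for that certificate


def to_training_pairs(certificates):
    """Build (input text, target codes) pairs, skipping incomplete records."""
    pairs = []
    for cert in certificates:
        # Collapse runs of whitespace so the model sees clean text.
        text = " ".join(cert.cause_text.split())
        if text and cert.icd10_codes:
            pairs.append((text, sorted(set(cert.icd10_codes))))
    return pairs


# Made-up records, for illustration only.
certificates = [
    Certificate("infarctus du   myocarde", ["I21.9"]),
    Certificate("", ["J18.9"]),  # no text: dropped
]
pairs = to_training_pairs(certificates)
```

The appeal of this setup is that no extra annotation effort is needed: the labels already exist as a by-product of how the certificates are filled in.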
Enjoy the rest of your week! 🤓