⚡️Trendbreak #21⚡️

Alò 🇭🇹 ("hello" in Haitian Creole)

A dream many in the NLP (Natural Language Processing) world share is to unlock deeper insights and quantify well-known phenomena in literature by applying algorithms to lengthy texts 🤖📖. Take, for instance, this fascinating study, which dives into the creation of numerical representations, or 'embeddings', of fictional characters. It's a game-changer: we can now detect shared traits among characters from different stories.
It also leaves me pondering whether we could play around with a form of "character algebra", something akin to the "word algebra" of word2vec. Could we have something like Sherlock Holmes - Watson + Hastings ≃ Hercule Poirot, or Don Quixote - Sancho Panza + Little John ≃ Robin Hood?
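
Just for fun, here's a minimal sketch of what such character algebra could look like, assuming we already have character embeddings at hand. The vectors below are random placeholders (the study would provide real ones), and the lookup mimics how word2vec resolves word analogies:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed character embeddings; the random vectors are
# placeholders standing in for embeddings learned from the novels.
rng = np.random.default_rng(42)
names = ["Sherlock Holmes", "Watson", "Hastings", "Hercule Poirot", "Miss Marple"]
characters = {name: rng.normal(size=50) for name in names}

# "Character algebra": detective, minus his sidekick, plus another sidekick.
query = (characters["Sherlock Holmes"]
         - characters["Watson"]
         + characters["Hastings"])

# Nearest neighbour, excluding the query's own characters (as word2vec does).
used = {"Sherlock Holmes", "Watson", "Hastings"}
best = max((n for n in characters if n not in used),
           key=lambda n: cosine_similarity(query, characters[n]))
print("Closest character:", best)
```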

When we're crafting machine learning models, a question that's always at the back of our minds is whether packing in more data would supercharge performance. It's a head-scratcher that's traditionally sorted out through good old trial and error. However, a squad from Microsoft Research has come up with some seriously encouraging initial findings on calculating distances between datasets. This could be the silver bullet we've been waiting for to decide which dataset to use for training data augmentation. The blog post is chock-full of interactive visuals that make understanding the approach a breeze.
The metric they're using, known as the Wasserstein distance, can be traced back to a far more "down-to-earth" problem posed in 1781 by French mathematician Gaspard Monge: How can you shift a pile of rubble from A to B, spending the least amount of energy?
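
To build some intuition, here's a minimal sketch using SciPy's one-dimensional Wasserstein distance on two toy samples; the dataset-to-dataset distance described in the post generalises this idea to labelled, high-dimensional data and is considerably more involved:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Two toy "datasets": samples from two 1-D distributions.
a = rng.normal(loc=0.0, scale=1.0, size=1000)  # the pile of rubble at A
b = rng.normal(loc=2.0, scale=1.0, size=1000)  # where we want it, at B

# Minimal average "effort" needed to morph a's distribution into b's.
print(wasserstein_distance(a, b))  # ≈ 2.0 here
```

For two Gaussians with equal spread, the distance comes out as roughly the gap between their means: the effort of shifting the whole pile over by 2.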

Finally, let's wrap up with a methodological paper on best practices for evaluating machine learning models, particularly when you're out to show you've got a leg up on the current state-of-the-art. As there are numerous sources of variation (choice of dataset, train/test split, random seeds, hyperparameter tuning...), a classifier's performance can itself be considered a random variable. Therefore, it's crucial to sample from these sources of randomness 🎲 as much as possible to provide convincing evidence that one method is superior to another.
The key takeaways from the paper are summed up here in a Twitter thread by Gaël Varoquaux, one of the co-founders of scikit-learn and co-author of the paper.
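
To make the idea concrete, here's a minimal sketch of the principle (not the paper's exact protocol) that resamples two of those randomness sources, the train/test split and the model's own seed, using scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for a real benchmark dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def score_distribution(make_model, n_runs=20):
    """Accuracy over several random train/test splits (and model seeds)."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed)  # resample the split
        model = make_model(seed)                       # resample model randomness
        scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
    return np.array(scores)

rf = score_distribution(lambda s: RandomForestClassifier(random_state=s))
lr = score_distribution(lambda s: LogisticRegression(max_iter=1000))  # deterministic fit

# Compare distributions, not single lucky numbers.
print(f"Random forest:       {rf.mean():.3f} ± {rf.std():.3f}")
print(f"Logistic regression: {lr.mean():.3f} ± {lr.std():.3f}")
```

Reporting a mean and spread over many such runs is far more convincing than a single accuracy figure obtained with one fortunate seed.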

Wishing you a week filled with insightful reads! 📚

By @Clément Chastagnol
Tags: #Trendbreak