My Reading List: Data Science
This is a live post listing links to Data Science related posts and videos I consider to be interesting, high-quality, or even essential to better understand particular topics within such a wide field.
Data
Preprocessing
Extending Target Encoding: post by Daniele Micci-Barreca explaining how he came up with the idea of target encoding, and its possible extensions.
Target encoding done the right way: post by Max Halford, Head of Data at Carbonfact, explaining in detail how to combine additive smoothing and target encoding.
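To make the idea in the entry above concrete, here is a minimal sketch of target encoding with additive smoothing in plain Python. It assumes the usual formulation, where each category mean is blended with the global mean using a smoothing weight; the function name `smoothed_target_encoding` and the parameter `m` are illustrative choices, not taken from either post.

```python
from collections import defaultdict


def smoothed_target_encoding(categories, targets, m=10.0):
    """Map each category to a smoothed mean of the target.

    `m` is a hypothetical smoothing weight: the larger it is, the more
    each category mean is pulled towards the global mean.
    """
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    if len(targets) == 0:
        return {}

    global_mean = sum(targets) / len(targets)

    # Accumulate per-category sums and counts in a single pass.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    # Smoothed mean: (n * category_mean + m * global_mean) / (n + m),
    # where n * category_mean is simply the per-category sum.
    return {
        category: (sums[category] + m * global_mean) / (counts[category] + m)
        for category in counts
    }


categories = ["a", "a", "a", "b"]
targets = [1, 1, 0, 0]
encoding = smoothed_target_encoding(categories, targets, m=2.0)
```

With `m=0` the encoding falls back to the raw category means, so rare categories such as "b" get extreme values. In practice the mapping would normally be computed on training data only, to avoid leaking the target into validation rows.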
Handling and Management
Apache Parquet
A Deep Dive into Parquet: The Data Format Engineers Need to Know: This post by Aditi Prakash, published in the Airbyte Blog, offers a complete guide to the Apache Parquet file format.
Querying Parquet with Millisecond Latency: This post by Raphael Taylor-Davies and Andrew Lamb explains in depth the optimization methods used in Apache Parquet files. Warning, this is a very technical read!
DuckDB
Multi-Database Support in DuckDB: This post by Mark Raasveldt, published in the DuckDB blog, explains how to query data from several databases at once.
Analysis and Modeling
Modeling Methods
Unraveling Principal Component Analysis: This book, reviewed here, is a tour of linear algebra focused on intuitive explanations rather than mathematical demonstrations.
Mixed Models for Big Data: This post by Michael Clark (see entry below by the same author) reviews several mixed modelling approaches for large data in R.
Generalized Additive Models: A good online book on Generalized Additive Models by Michael Clark, Senior Machine Learning Scientist at Strong Analytics.
Model Explainability
Model-Independent Score Explanation: Post by Daniele Micci-Barreca on model explainability. It also explains a very clever method to better understand any model just from its predictions.
AI Explanations whitepaper: White paper of Google’s “AI Explanations” product with a pretty good overall view of the state of the art of model explainability.
Towards A Rigorous Science of Interpretable Machine Learning: Pre-print by Finale Doshi-Velez and Been Kim offering a rigorous definition and evaluation of model interpretability.
Spatial Analysis
PostGEESE? Introducing The DuckDB Spatial Extension: In this post, the authors of DuckDB present the new PostGIS-like spatial extension for this popular in-process database engine.
Geocomputation with Python: A very nice book on geographic data analysis with Python.
Coding
General Concepts
A Philosophy Of Software Design: This book by John Ousterhout is full of high-level concepts and tips to help tackle software complexity. It’s so good I had to buy a hard copy that now lives on my desk. This post by Gergely Orosz offers a balanced review of the book.
Why You Shouldn’t Nest Your Code: In this wonderful video, CodeAesthetic explains in detail (and with beautiful graphics!) a couple of methods to reduce the level of nesting in our code to improve readability and maintainability. This video has truly changed how I code in R!
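As a small illustration of what reducing nesting can look like, the sketch below shows one common technique of this kind: replacing nested conditionals with early exits (guard clauses). The example is in Python rather than R, the function names and the order structure are made up for illustration, and it is an assumption that this matches the exact methods presented in the video.

```python
def shipping_cost_nested(order):
    # The "happy path" is buried three levels deep.
    if order is not None:
        if order["items"]:
            if order["country"] == "ES":
                return 5.0
            else:
                return 15.0
        else:
            raise ValueError("order has no items")
    else:
        raise ValueError("order is missing")


def shipping_cost_flat(order):
    # Same behavior: invalid cases exit early, the main logic stays flat.
    if order is None:
        raise ValueError("order is missing")
    if not order["items"]:
        raise ValueError("order has no items")
    if order["country"] == "ES":
        return 5.0
    return 15.0
```

Both functions return the same results and raise the same errors; the flat version simply lets the reader discard the invalid cases first and then read the main logic without indentation.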
R
Beautiful Code, Because We’re Worth It!: This post by Maëlle Salmon (research software engineer) and Yanina Bellini Saibene (rOpenSci Community Manager) provides simple tips to help write more visually pleasant R code.
Coloring in R’s Blind Spot: This article, published in The R Journal by Achim Zeileis (he has a great analytics blog too!) and Paul Murrell, gives a great overview of the base R color functions and offers specific advice on which color palettes work better in different scenarios.
Taking R to its limits: 70+ tips: This pre-print (not peer-reviewed AFAIK) by Tsagris and Papadakis offers a long list of tips to speed up computation with the R language. I think a few of these tips lack enough context or are poorly explained, but it’s still a good resource to help optimize our R code.
Code Smell: Error Handling Eclipse: This post by Nick Tierney explains how to address situations where error-checking code totally eclipses the intent of the code.
Building a team of internal R packages: This post by Emily Riederer delves into the particularities of building a team of internal R packages that do jobs to help an organization answer impactful questions.
Python
Deep Learning With Python: This book by François Chollet, Software Engineer at Google and creator of the Keras library, seems to me like the best resource out there for those wanting to understand and build deep learning models from scratch. I have a hard copy on my desk, and I am finding it pretty easy to follow. Also, the code examples are clearly explained, and they ramp up in a very consistent manner.
Python Rgonomics: In this post, Emily Riederer offers a list of Python libraries with an “R feeling”.
Coding Workflow
How to use stacked PRs to unblock your entire team: This post in Graphite’s blog explains how to split large coding changes into small managed PRs (aka “stacked PRs”) to avoid blocks when PR reviews are hard to come by.
Other Fancy Things
What’s new with ML in production: This post by Vicki Boykis, machine learning engineer at Mozilla.ai, goes deep into the differences and similarities between classical Machine Learning approaches and Large Language Models. I learned a lot from this read!
What is Retrieval-Augmented Generation (RAG)?: In this video, Marina Danilevsky, Senior Data Scientist at IBM, offers a pretty good explanation of how the Retrieval-Augmented Generation method can improve the credibility of large language models.
A novel framework for spatio-temporal prediction of environmental data using deep learning: This paper by Federico Amato and collaborators describes an intriguing regression method combining a feedforward neural network with empirical orthogonal functions for spatio-temporal interpolation. Regrettably, the paper offers no code or data at all, but it’s still an interesting read.
Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook: This pre-print by Weng and collaborators reviews the current state of the art in spatio-temporal modelling with Large Language Models and Pre-Trained Foundation Models.
Management and Leadership
Using fake deadlines without driving your engineers crazy: In this post, James Stanier explains how fake deadlines can help push projects forward in healthy work environments.
You are hurting your team without even noticing: This post by Anton Zaides (Development Team Leader) and Eugene Shulga (Software Engineer) offers insight into the harmful effects of a manager’s ego on their team’s dynamics.
Teamwork Habits for Leaders: This post by Csaba Okrona focuses on how shifting from talker to listener in team meetings gives leaders better insight into the team’s needs.