R Code Optimization II: Language, Design, and Readability

In the first article I covered the dimensions of code efficiency, and the three commandments of code optimization. This article focuses on the actual code features that shape efficiency: programming languages, code design, and algorithms.

Programming Language

The choice between a compiled and an interpreted language (see here and here) shapes key aspects of code efficiency.

In languages like C, C++, or Fortran, the code we write is translated into binary code by a compiler before execution. Compilers optimize the binary code for the given hardware, resulting in very fast execution and low memory overhead. These are the hallmarks of efficient code for a machine! On the other hand, compiled languages typically lack interactive execution, which makes them harder to debug.

In contrast, languages like R and Python, which are much easier to write and read, are translated and executed line-by-line by an interpreter. Interpreters enable interactive execution and facilitate debugging, at the cost of slower execution and higher memory usage. Here goes a rabbit-hole on the R interpreter for the brave!

The boundary between compiled and interpreted languages isn’t rigid though: many built-in functions in interpreted languages rely on compiled code for speed. Tools like Rcpp (C++ in R) or Cython (C extensions for Python) can boost performance by orders of magnitude, albeit with added code complexity.
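To make the point about built-in functions concrete, here is a minimal sketch in base R. The function `loop_sum()` is a hypothetical helper written for this illustration; it is compared against the built-in `sum()`, which is a primitive backed by compiled C code. Timings vary between machines, so I show none here.

```r
# Sum a numeric vector with an explicit loop evaluated by the R interpreter.
# `loop_sum` is a hypothetical helper, written only for this comparison.
loop_sum <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total
}

x <- as.numeric(1:1e6)

# sum() is a primitive: its work is done by compiled C code, not R code.
is.primitive(sum)

# Both approaches return the same result...
all.equal(loop_sum(x), sum(x))

# ...but the compiled version is usually much faster.
system.time(for (i in 1:10) loop_sum(x))
system.time(for (i in 1:10) sum(x))
```

On most machines the second timing should be a small fraction of the first, even though R byte-compiles the loop behind the scenes.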

Simplicity and Readability

Software exists for us, developers and users, and our time is far more valuable than CPU time! That’s why the most straightforward way to improve efficiency is writing clean code. Clean code is readable, modular (but not excessively modular!), easy to use, and easy to maintain.

The key here is reducing the cognitive load required to maintain AND use the code. That is exactly the focus of the best book I’ve read on this topic: A Philosophy of Software Design. It changed the way I code!

For example, the book presents the concept of deep modules, which are classes or functions with very simple interfaces (think of a well-named function with one or two arguments) hiding a complex functionality or long sequences of processing steps. Before reading it you might want to check this review and this interesting podcast with the author, or even this talk by John Ousterhout himself.
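Here is a rough sketch of what a deep function might look like in R. The name `clean_measurements()` and the processing steps are my own assumptions for illustration, not an example taken from the book: the point is only that a single argument hides a sequence of steps the user never has to think about.

```r
# Hypothetical deep function: one argument in, a clean numeric vector out.
# The interface is tiny, but several processing steps hide behind it.
clean_measurements <- function(x) {
  # Step 1: coerce text input to numbers; unparseable entries become NA.
  values <- suppressWarnings(as.numeric(x))

  # Step 2: drop missing and non-finite values.
  values <- values[is.finite(values)]
  if (length(values) == 0) {
    return(numeric(0))
  }

  # Step 3: clamp extreme values to the 1st and 99th percentiles.
  limits <- stats::quantile(values, probs = c(0.01, 0.99), names = FALSE)
  values <- pmin(pmax(values, limits[1]), limits[2])

  # Step 4: rescale to the range [0, 1].
  value_range <- range(values)
  if (value_range[1] == value_range[2]) {
    return(rep(0, length(values)))
  }
  (values - value_range[1]) / (value_range[2] - value_range[1])
}

# Usage: messy input, one call, no options to remember.
raw <- c("12.5", "7", "not a number", NA, "300", "9.1", "Inf")
cleaned <- clean_measurements(raw)
cleaned
```

A shallow alternative would expose each step as its own function with its own arguments, pushing the sequencing work (and the cognitive load) onto the user.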

In the end, it is important to strike a good balance between code readability and computational efficiency, as excessive simplicity may leave other efficiency gains off the table.

Key Principles for Clean Code

Here are some fundamental principles to keep your code readable and maintainable:

  • Use a consistent style: Stick to a recognizable style guide, such as the tidyverse style guide or Google’s R style guide.

  • Avoid deep nesting: Excessive nesting makes code harder to read and debug. This wonderful video makes the point quite clear: Why You Shouldn’t Nest Your Code.

  • Use meaningful names: Clear names for functions, arguments, and variables make the code self-explanatory! Avoid cryptic abbreviations or single-letter names and do not hesitate to use long and descriptive names. The video Naming Things in Code is a great resource on this topic.

  • Limit the number of function arguments: According to Robert C. Martin (“Uncle Bob”), author of the book “Clean Code”, the ideal number of arguments for a function is zero. There’s no need to be that extreme, but it is important to acknowledge that the user’s cognitive load increases with the number of arguments. If a function has more than five arguments you can either rethink your approach or let the user pay the check.
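As a small illustration of the nesting and naming principles above, the sketch below shows the same logic written twice. Both functions are hypothetical examples I made up for this purpose; the second one uses guard clauses (early exits for the special cases) so the main logic sits at the top level.

```r
# Deeply nested version: the main logic is buried three levels down.
mean_positive_nested <- function(values) {
  if (is.numeric(values)) {
    if (length(values) > 0) {
      positive_values <- values[!is.na(values) & values > 0]
      if (length(positive_values) > 0) {
        return(mean(positive_values))
      } else {
        return(NA_real_)
      }
    } else {
      return(NA_real_)
    }
  } else {
    stop("`values` must be numeric.")
  }
}

# Flat version: guard clauses handle the special cases first.
mean_positive_flat <- function(values) {
  if (!is.numeric(values)) {
    stop("`values` must be numeric.")
  }

  positive_values <- values[!is.na(values) & values > 0]

  if (length(positive_values) == 0) {
    return(NA_real_)
  }

  mean(positive_values)
}

observations <- c(4, -2, NA, 10, 0)
mean_positive_nested(observations)  # 7
mean_positive_flat(observations)    # 7
```

Both versions behave the same, but the flat one can be read top to bottom without keeping track of which `else` belongs to which `if`.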

Algorithm Design and Data Structures

There is no efficient code without a well-designed algorithm. Good algorithms have a clear purpose, avoid redundant steps, and scale well with data size.
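A toy example of what "scaling well" means, using only base R. Both functions below are hypothetical helpers that count how many elements of `x` also appear in `y`. The first compares every pair, so its cost grows with the product of the two lengths; the second delegates to `%in%`, which is built on `match()` and, as far as I know, uses hashing internally.

```r
# Quadratic approach: compare every element of x against every element of y.
count_shared_nested <- function(x, y) {
  shared <- 0
  for (x_value in x) {
    for (y_value in y) {
      if (x_value == y_value) {
        shared <- shared + 1
        break  # stop scanning y once a match is found
      }
    }
  }
  shared
}

# Near-linear approach: a single vectorized membership test.
count_shared_hashed <- function(x, y) {
  sum(x %in% y)
}

set.seed(1)
x <- sample.int(1e5, size = 2000)
y <- sample.int(1e5, size = 2000)

# Same answer, very different growth in run time as x and y get longer.
count_shared_nested(x, y) == count_shared_hashed(x, y)
```

Doubling the size of both vectors roughly quadruples the work of the nested version, while the second version should grow roughly in proportion to the input.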

Choosing the right data structure for the job can make a huge difference in speed, readability, and memory usage. For instance, in R, vectors and matrices are more memory-efficient and faster than data frames for numerical operations. In certain contexts, though, the reference semantics of data.table (an extension of R’s data frames) can make operations orders of magnitude faster, at the expense of readability (the syntax can get weird).
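The sketch below hints at the matrix versus data frame difference using base R only (data.table is left out because it requires an external package). The object names and sizes are arbitrary choices for this illustration, and row-by-row access is deliberately the worst case for a data frame, so take it as a demonstration of the mechanism rather than a benchmark.

```r
set.seed(1)
n_rows <- 2000
n_cols <- 50

numeric_matrix <- matrix(rnorm(n_rows * n_cols), nrow = n_rows, ncol = n_cols)
numeric_df <- as.data.frame(numeric_matrix)

# Row means on a matrix: one call over a single block of numbers.
matrix_means <- rowMeans(numeric_matrix)

# Row means on a data frame, row by row: every numeric_df[i, ] has to
# assemble a one-row data frame from 50 separate column vectors.
df_means <- vapply(
  seq_len(nrow(numeric_df)),
  function(i) mean(unlist(numeric_df[i, ])),
  numeric(1)
)

# Same numbers either way.
all.equal(matrix_means, df_means)

# The data frame also carries some extra overhead for its attributes.
object.size(numeric_matrix)
object.size(numeric_df)
```

Wrapping each approach in `system.time()` should show the gap in speed on your own machine.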

A complete overview of the role of algorithm design and data structures in code optimization is well beyond the scope of this post. However, if you wish to go further, I strongly recommend the classic book The Algorithm Design Manual, which bridges theory and practice with elegance.

In the next article I will tackle some juicy stuff: vectorization, parallelization, and memory management.

Blas M. Benito
Data Scientist and Team Lead