The R vs Python debate needs more nuance

You should care far more about best practices
Categories: R, Python, Data Science

Author: Johnny Breen

Published: May 23, 2023

Disclaimer: Python is my favourite programming language but even I felt compelled to write this!

David Robinson once wrote the following sage advice in his blog:

When you’ve written the same code 3 times, write a function

When you’ve given the same in-person advice 3 times, write a blog post

Well, this post is my chance to give you my own advice on a matter very dear to my heart: whether data scientists should be writing their code in Python or R [1].

But seriously, ‘real’ programmers only use Python, right?

[Image: an xkcd comic in which programmers of different languages successively deride each other for not using a ‘real’ programming language]

Anyway, the TL;DR (executive summary) that I would like to propose for this post is the following:

For most data scientists, the programming language of choice is far less important than the development practices they are using to build their solutions

To arrive at this conclusion, I first need to explain why I think this debate arises over and over again, and then synthesise my thoughts into something more cohesive than this opening section!

Why do people care so much?

There are more important things in life (genuinely, guys), such as eating a healthy, balanced diet, getting sufficient exercise or meeting up with friends, but here we are: people care about this topic a lot.

That said, I think some of this has to do with context.

There is a selection bias at play here: people who end up in Python tend to come from a software engineering background, whereas people who end up in R tend to come from a more diverse range of subjects, from social science to the field of statistics itself.

I probably lean more towards the former, as most of my master’s degree involved Java, which is more similar in spirit to Python than it is to something like R (which I think is actually trickier for beginners to grasp).

Anyway, what this means, by extension, is that most Pythonistas come into an R programming environment and panic as soon as they realise that there are three object-oriented systems [2]. They then moan about the lack of type hints, decorators, dataclasses (the list goes on) and the plethora of other cool tools that Python brings into the development workflow.
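To make that wish list a little more concrete, here is a minimal sketch of three of the Python features in question (type hints, a decorator and a dataclass) working together. The `Claim` record, the `logged` decorator and the `total_paid` function are hypothetical names invented purely for illustration, and only the standard library is used.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable, List

# Names of decorated functions, appended each time one of them is called.
CALL_LOG: List[str] = []


def logged(func: Callable[..., float]) -> Callable[..., float]:
    """Decorator: record the function's name in CALL_LOG on every call."""

    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs) -> float:
        CALL_LOG.append(func.__name__)
        return func(*args, **kwargs)

    return wrapper


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; the field types are declared up front."""

    claim_id: str
    amount_paid: float


@logged
def total_paid(claims: List[Claim]) -> float:
    """Return the sum of amount_paid across all claims."""
    return sum(claim.amount_paid for claim in claims)


claims = [Claim("C-1", 120.0), Claim("C-2", 80.0)]
total = total_paid(claims)
```

Note that the type hints here are documentation for readers and static checkers; Python does not enforce them at runtime.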

By the same token, users of R will load up a Python environment… actually, they won’t: they will spend a frustrating few hours just getting Python to work (not realising that they are creating hassle for themselves by not following installation best practices) and then give up. Normally, the process ends up feeling a bit like this for your typical R user:

[Image: an xkcd comic diagram depicting how convoluted the installation of Python can sometimes be (multiple environments, confusing dependencies)]

They might get as far as reading in data from a CSV or experimenting with pandas, but that will quickly start to annoy them because they just want to use the pipe operator, %>%, to funnel their actions into a pipeline: “What is all this ‘.’ notation everywhere?!” I hear them cry in the distance.
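For anyone who has not seen the two styles side by side, here is a rough sketch in plain Python. The `pipe` helper is a hypothetical function written for this example (it is not part of the standard library) and it only approximates what %>% does in R. pandas users may also want to look at the `pipe` method on DataFrames, which serves a similar purpose.

```python
from functools import reduce
from typing import Any, Callable


def pipe(value: Any, *steps: Callable[[Any], Any]) -> Any:
    """Pass value through each step in order, loosely like x %>% f() %>% g()."""
    return reduce(lambda acc, step: step(acc), steps, value)


raw = ["  Alice ", "BOB", "  carol"]

# Method chaining: each call hangs off the previous result with a dot.
chained = [name.strip().lower().capitalize() for name in raw]

# Pipe style: the same steps, written left to right as plain functions.
piped = [pipe(name, str.strip, str.lower, str.capitalize) for name in raw]
```

Both approaches produce the same cleaned names; the difference is purely in how the sequence of steps reads on the page.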

“If everything looks like a nail…”

The thing is, people miss two essential points here.

First, Python and R were created for entirely different reasons and, as such, they solve different types of problems in different ways.

This means that R tends [3] to be very strong when it comes to interactive, exploratory data science, whereas Python tends to be relatively adept at adding reliability, security and stability to production-grade code. This is because R was originally designed with the express purpose of being interactive, whereas Python was originally designed as a multi-purpose programming language. In practice, this means that there are certain invaluable tools in Python which, in my view, you simply cannot construct or deploy in R: think of frameworks such as pydantic (data validation) or sqlalchemy (ORM). Anyone who tries to recreate these things in R would be setting themselves up for failure.
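To give a flavour of what ‘data validation’ means here, the sketch below checks untrusted input at the boundary of a program. To be clear, this is not pydantic’s API: it is a rough stand-in built on standard-library dataclasses, and the `Quote` record, its fields and the age limits are assumptions made up for the example. pydantic does this kind of checking (and much more) declaratively.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Quote:
    """Hypothetical insurance quote request."""

    customer_age: int
    annual_premium: float

    def __post_init__(self) -> None:
        # Dataclasses do not enforce type hints at runtime, so check by hand.
        if not isinstance(self.customer_age, int):
            raise TypeError("customer_age must be an integer")
        if not 18 <= self.customer_age <= 120:  # assumed limits
            raise ValueError("customer_age must be between 18 and 120")
        if self.annual_premium <= 0:
            raise ValueError("annual_premium must be positive")


def parse_quote(payload: Dict[str, Any]) -> Quote:
    """Build a Quote from untrusted input, failing loudly on bad data."""
    return Quote(
        customer_age=payload["customer_age"],
        annual_premium=float(payload["annual_premium"]),
    )


# A well-formed payload is accepted (the premium string is converted).
valid = parse_quote({"customer_age": 42, "annual_premium": "310.50"})

# An out-of-range age is rejected before it can reach the rest of the code.
try:
    parse_quote({"customer_age": 12, "annual_premium": 310.5})
    rejected = False
except ValueError:
    rejected = True
```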

So, knowing the purpose and limits of each language is highly important.

Second, far more important than which programming language you are using is which development practices (if any) you are following. Regardless of whether you are writing code in R or Python, can you answer ‘Yes’ to all of the following questions?

  • Have you documented your code?

  • Have you been consistent with naming conventions within your code?

  • Are you managing development, test and production configurations in your code properly?

  • Have you decomposed your code into the appropriate units of execution?

  • Have you built a test suite to formally ‘test’ the units of your code against their expected behaviour?

  • Are you using version control?

  • (Bonus) Have you configured any continuous integration to enable automated testing, thus leading to a more reliable and secure product?

If not, then it really doesn’t matter which language you are using: your code will fail eventually if you do not adhere to these best practices. I can guarantee you that!
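As a small illustration of a few items on the checklist above (documentation, consistent naming, environment-specific configuration, small units and a test suite), here is a minimal sketch in Python. The `loss_ratio` function, the `APP_ENV` variable and the `CONFIGS` table are hypothetical names chosen for this example, not a prescription.

```python
import os
import unittest
from typing import Dict, Optional

# One settings table per environment, so nothing is hard-coded in the logic.
CONFIGS: Dict[str, Dict[str, bool]] = {
    "development": {"debug": True},
    "test": {"debug": True},
    "production": {"debug": False},
}


def load_config(env_name: Optional[str] = None) -> Dict[str, bool]:
    """Return settings for env_name, defaulting to the APP_ENV variable."""
    name = env_name or os.environ.get("APP_ENV", "development")
    return CONFIGS[name]  # raises KeyError for an unknown environment


def loss_ratio(claims_paid: float, premiums_earned: float) -> float:
    """Return claims paid as a fraction of premiums earned.

    Raises ValueError if premiums_earned is not positive.
    """
    if premiums_earned <= 0:
        raise ValueError("premiums_earned must be positive")
    return claims_paid / premiums_earned


class LossRatioTests(unittest.TestCase):
    """Each test pins one piece of expected behaviour."""

    def test_typical_values(self) -> None:
        self.assertAlmostEqual(loss_ratio(50.0, 200.0), 0.25)

    def test_rejects_zero_premium(self) -> None:
        with self.assertRaises(ValueError):
            loss_ratio(50.0, 0.0)


# Run the suite programmatically and keep the result object around.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(LossRatioTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
config = load_config("test")
```

In a real project the tests would live in their own file, sit under version control with the rest of the code, and be run automatically by continuous integration. The same structure carries over to R more or less one-to-one, which is rather the point.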

I can’t tell you how many times I have had to explain these points to people who are still stuck on the irrelevant question of whether people should use ‘R or Python’. There isn’t one answer, guys and gals! It depends on the use case, as frightening as that sounds.

Coexistence is healthy and productive

I work in the insurance industry and, as in other corporate domains, there are unfortunately lots of people trying to convince you that they know the single best programming language to use… across the entire industry.

The truth, however, is that there isn’t ‘one language’ that will work for all of the employees of one industry or corporation. If you look at technology giants such as Facebook and Google, they employ both R and Python developers because they recognise that R and Python are required to solve different types of problems.

This is actually a more economical and productive setup for an organisation because, rather than imposing one solution on all of their problems, they are thinking more carefully about which problems will benefit the most from which solutions.

So, the next time somebody makes the statement ‘We should be using Python, not R’ or ‘We should be using R, not Python’, please do me a favour: point them to this article and let them know (kindly) that they do not know what they are talking about.

Footnotes

  1. Obligatory xkcd comic

  2. Which I also think is confusing, but so do people within the R community like Hadley Wickham, who is attempting to synthesise these systems into something more well-rounded with the R7 project

  3. Note that I say ‘tends’ because in practice this is no longer the case; the open source community, particularly Posit PBC and rOpenSci, has made immense contributions to the packages and development tools available to individuals who need to write robust, maintainable applications in R