I have spent the last three months at SDG in Milan for my private-sector secondment, which is part of my contract. During this time I was able to take a closer look at what it is like to work in a consulting company and to get my hands on real business problems. Here I want to summarise some of my experiences and reflect on the R language, an inseparable friend of statisticians.
At SDG I was given two separate tasks. The first was oriented more towards developing computational skills, while the second focused on analytical abilities. The goal was to broaden my data-mining skills so that I could become a “sexy” data scientist.
The first task was to build a predictive model in R for customer traffic in the stores of a luxury Italian clothing brand (welcome to Milan!). The problem wasn’t trivial and, in addition, my boss placed a lot of emphasis on how the software was implemented. The proposed model depends on many parameters, so the code had to be tidy to allow easy development, extension, function nesting, debugging, etc. In general, obvious things for computer scientists.
I realized that I had never been taught R in a proper way. For no obvious reason, R (and similarly Matlab) had always been presented to me as a scripting language. One explanation could be that, in general, statisticians are not the best programmers and focus instead on the theoretical properties of their methods, so algorithm implementation is a secondary task. In contrast, I had always been taught Python and C++ from a programming point of view, i.e. object and class construction, inheritance, proper debugging, etc. Why was R treated differently?
At SDG, I took a two-day course on advanced R programming. It was strange to learn how the language works at a basic level, even though I have been using it for years. I’m even somewhat ashamed that only now am I learning object-oriented programming in R (the so-called S3 system).
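To give a flavour of what S3 looks like, here is a minimal sketch. The class name `traffic_forecast`, its fields and the numbers are invented for illustration and are not taken from the actual project; only the mechanism (a class attribute plus methods named `generic.class`) is standard R.

```r
# Constructor for a hypothetical S3 class "traffic_forecast"
new_traffic_forecast <- function(store, visitors) {
  stopifnot(is.character(store), is.numeric(visitors))
  structure(list(store = store, visitors = visitors),
            class = "traffic_forecast")
}

# Method for the existing generic print(); R dispatches on the class attribute
print.traffic_forecast <- function(x, ...) {
  cat("Forecast for store", x$store, "covering",
      length(x$visitors), "days\n")
  invisible(x)
}

# Method for the existing generic summary()
summary.traffic_forecast <- function(object, ...) {
  c(mean = mean(object$visitors), max = max(object$visitors))
}

fc <- new_traffic_forecast("Milan", c(120, 135, 150))  # made-up numbers
print(fc)
fc_summary <- summary(fc)
```

Calling `print(fc)` or `summary(fc)` picks the right method automatically, which is what lets the rest of the code stay unaware of how the object is built.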
Another important aspect is code editing and documentation writing. I have a bad habit of writing code in a way that only I understand (and I lose this ability after two weeks). I often skip writing comments and descriptions, so the code ends up very messy. Such an approach is unacceptable, particularly in a company where code is often shared between employees. Not to mention extensive documentation, which I have barely ever produced. I hope the secondment has successfully changed my practice.
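As an illustration of the habit I am trying to acquire, here is a small documented function. The roxygen2-style `#'` comments are my assumption of a common convention, not necessarily the one used at SDG, and the function name, column names and data are made up.

```r
#' Average number of visitors per store
#'
#' @param visits A data frame with columns `store` and `visitors`.
#' @return A named numeric array with one mean value per store.
mean_visitors <- function(visits) {
  # Fail early with a clear error if the input has the wrong shape
  stopifnot(is.data.frame(visits),
            all(c("store", "visitors") %in% names(visits)))
  # Group the visitor counts by store and average each group
  tapply(visits$visitors, visits$store, mean)
}

# Made-up example data
visits <- data.frame(store = c("A", "A", "B"),
                     visitors = c(100, 120, 80))
res <- mean_visitors(visits)
```

Because the `#'` lines are ordinary comments, the code runs as plain R even without roxygen2 installed.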
Finally, I learnt how to build R packages. A package is a self-contained set of reusable functions, documentation, vignettes and possibly data, the equivalent of a Python library. However, in R, packages are often small and dedicated to very specific purposes. They are easy to create, test and share.
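A rough sketch of how little is needed to get started, using `package.skeleton()` from base R's `utils` package. The package name `trafficdemo` and the function `peak_visitors` are hypothetical, and the skeleton is written to a temporary directory so nothing is left behind.

```r
# Hypothetical function that the package should contain
peak_visitors <- function(visitors) max(visitors)

# Create the package skeleton in a temporary directory
pkg_dir <- tempfile("pkgdemo")
dir.create(pkg_dir)
utils::package.skeleton(name = "trafficdemo",
                        list = "peak_visitors",
                        environment = environment(),
                        path = pkg_dir)

# Inspect what was generated (DESCRIPTION, NAMESPACE, R/, man/, ...)
pkg_files <- list.files(file.path(pkg_dir, "trafficdemo"))
print(pkg_files)
```

The generated files are only stubs: the DESCRIPTION and the help pages in man/ still have to be filled in by hand before the package can be built and shared.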
In short, I spent a decent amount of time learning the fundamentals of tidy, reusable programming combined with proper documentation. Although it is not easy to change one’s habits, I have definitely advanced in programming, with benefits for me (quicker code development) and for my collaborators (code sharing).
To demonstrate what R packages are, I want to show you an example. I recently stumbled upon a new R package for generating memes. A single simple function creates the image: it calls hidden, more advanced functions, which in turn instruct ImageMagick to render the result. Tens of small R packages, dedicated to specific tasks, are created on a daily basis. This is how the statistical community shares its advances.
Below I present an example of the “meme” R package, which I used to produce the feature image for this article.
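# Load the meme package
library(meme)
# URL of the background image
u <- "https://pbs.twimg.com/media/B1dUJFhIMAAObAo.jpg"
# Draw the upper and lower captions on the image
meme(u, "Secondment in a private sector\n during your PhD", "AMVA4NewPhysics")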