by Giles Strong

During my master’s degree in Glasgow, I mostly used C++ and Root for my research. However, these past few months I’ve been based almost entirely in Python. The focus of my work is currently on developing machine learning (ML) tools, so it’s no surprise that I would be using Python, since that’s where much of the active development of ML libraries happens. However, I am also enjoying using it not just for ML, but for general data analysis as well.

I thought it would be good to share with you some of the modules I’ve recently begun using in case they are of use to you as well. All are easily available through pip, conda, or manual installation.

Jupyter

Not so much a module as a browser-based working environment. Rather than developing some code in a file, running it, checking the results, inevitably finding some small mistake, and having to repeat potentially hours of runtime, Jupyter instead allows you to split a program into cells in a notebook.

Cells can either contain Python code or documentation, which supports Markdown (think simplified LaTeX). The cells can be run individually and display any outputs below themselves. This allows for quick development of code, since results are displayed in situ, and any changes are quick to perform and don’t require re-running the whole script.

The documentation cells also provide a much clearer description of what code blocks do than inline comments. They also allow the notebook to be presented as a description or tutorial for some concept or module.

There are even built-in methods to convert notebooks into .py format, or into slide-shows via RISE. For those who like to work remotely, you can also use ssh tunnels to access notebooks through your browser as normal. Jupyter can also be used with lots of other programming languages.

Since being introduced to Jupyter at an ML school, I’ve used it for all the development of my multivariate analyses (MVAs).

Pandas

Pandas is a way of structuring data and then performing analysis on it; essentially building a database and then running queries. Where it really shines is in its speed of running queries.

Working in HEP, a lot of data is processed into Root format. Using the root_numpy module I’ll normally run something similar to

pandas.DataFrame(root_numpy.root2array("data.root", treename="MyData"))

This reads a specified TTree from a Root file and converts it into a Pandas dataframe, using the branches of the tree as fields in the database.

From there you can run queries of the form data[data.muon_pT > 20], which returns a dataframe containing only the entries (events) which satisfy the condition (here that the muon pT is greater than 20 GeV).

It can also be used to return Numpy arrays of fields, which is useful for plotting feature distributions or for feeding in data to MVAs.
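
Putting those pieces together, a minimal sketch of the kind of thing I mean might look like this (the file, tree, and branch names are just the placeholders used above):

import root_numpy
import pandas

# load the "MyData" TTree from a Root file; its branches become the dataframe's columns
data = pandas.DataFrame(root_numpy.root2array("data.root", treename="MyData"))

# select events with muon pT above 20 GeV
selected = data[data.muon_pT > 20]

# pull out a single feature as a Numpy array, e.g. for plotting or as MVA input
muon_pt = selected.muon_pT.values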

Keras

This is what I use to create and train my neural networks. Its modular, minimalistic design and its openness to extension make it a great way to easily create and test networks, and to add in your own classes. It’s also kept up to date, with active development working to include the latest ML concepts, and it runs on both CPUs and GPUs.

If you’re looking to try working with neural networks, I’d definitely suggest giving it a look.
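
As a taster, here’s a minimal sketch of a small binary classifier (the layer sizes, activations, and the ten-feature random data are purely illustrative, not taken from my actual networks):

import numpy
from keras.models import Sequential
from keras.layers import Dense

# a small fully-connected network for binary classification
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=10))  # ten input features, chosen for illustration
model.add(Dense(1, activation='sigmoid'))              # single output: signal probability
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# train on placeholder data: X is (n_events, 10), y holds 0/1 labels
X = numpy.random.rand(1000, 10)
y = numpy.random.randint(2, size=1000)
model.fit(X, y, epochs=10, batch_size=32, verbose=0)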

Seaborn & Statsmodels

Having all this lovely data and neat MVAs is all well and good, but I am but a human (at least until the technological singularity): give me plots! For years I used Matplotlib, which was, and still is, great, but Pablo recently introduced me to Seaborn.

Importing Seaborn will, by default, override the appearance of Matplotlib plots, giving them a ‘softer’ look which is easier on the eye. I’d recommend it just for that, but it also provides a whole range of new plotting styles to easily visualise data, as well as methods to quickly apply regressions, kernel density estimations, and confidence intervals.

My only complaint so far is that the kernel density-estimation class doesn’t provide any inbuilt bootstrapping methods, so there’s no way to directly plot the uncertainty on a density estimate. My workaround is to use the KDE class from Statsmodels: construct my own bootstrap resamples of the data, fit a KDE to each resample, then plot them on a Seaborn time-series plot using its built-in confidence-interval arguments.
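
In case it’s useful, here’s a rough sketch of that workaround (the Gaussian sample is just a stand-in for a real feature distribution, and tsplot is the Seaborn time-series plot I mean):

import numpy as np
import seaborn as sns
from statsmodels.nonparametric.kde import KDEUnivariate

sample = np.random.normal(size=1000)  # stand-in for a real feature distribution
grid = np.linspace(sample.min(), sample.max(), 200)

# bootstrap: resample the data, fit a KDE to each resample, evaluate on a common grid
n_boot = 100
densities = np.empty((n_boot, grid.size))
for i in range(n_boot):
    resample = np.random.choice(sample, size=sample.size, replace=True)
    kde = KDEUnivariate(resample)
    kde.fit()
    densities[i] = np.interp(grid, kde.support, kde.density)

# Seaborn's time-series plot then draws the mean density with a confidence band
sns.tsplot(densities, time=grid, ci=95)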

SciKit-Learn

Another gem from the MLHEP school. If you’re looking to get into ML, this is where to begin. It’s got implementations of lots of different MVAs for classification, regression, and clustering, as well as the necessary ‘support’ to train, test, and validate them: cross-validation, train/test splitting, etc. There are also modules for preprocessing data, such as standardisation and principal-component analysis.
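
To give an idea of how little code that takes, here’s a minimal sketch using the Iris dataset that ships with SciKit-Learn (the choice of a random forest is purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# load the Iris dataset and standardise the features
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
y = iris.target

# hold out a test set, then cross-validate a classifier on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = RandomForestClassifier(n_estimators=100)
print(cross_val_score(clf, X_train, y_train, cv=5).mean())

# final fit and evaluation on the held-out test set
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))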

Having moved to Keras for MVAs, I’ve reduced the amount of use I get from SciKit-Learn, but there are still lots of use cases for it in other applications.

If these sound of use to you, or ML is something you’d like to get into, I’d recommend checking out the slides from the MLHEP school, since they provide a good introduction to applying many of these modules in the context of ML. There are also a few large, public datasets, such as MNIST (handwriting recognition) and Iris (plant classification), so you could even start today! For more challenging (and potentially financially rewarding) tasks, check out the competitions run on Kaggle. They even have a load of datasets.