Welcome back to the second part of my journey through the Fast.AI deep-learning course; the series begins here. Last time I gave an example of analysing images; now I’ll move on to working with columnar data.
Columnar data is a form of structured data, meaning that the features of the data are already extracted (in this case into columns), unlike in images or audio where features must be learnt or carefully constructed by hand.
Coming from a background in experimental particle physics, I was already most familiar with analysing columnar data. Effectively it consists of rows of entries (particle collisions), columns of features describing each entry (e.g. particle momenta, energies, and masses), and some target value one wishes to learn to compute (e.g. collision class or invariant mass). New features can be computed from existing columns, and a model can then be trained by feeding it batches of entries and having it try to predict the target.
The example in lesson 3 of the course uses data from an earlier Kaggle competition (Rossmann Store Sales), in which participants were tasked with predicting sales at pharmaceutical stores based on timing, geographical data, and store information. Basically, the task is to develop a regressor which takes the features in the columns as inputs and outputs the store turnover on a given day.
The lesson follows the procedure of the third place result, which includes a variety of extra datasets, such as Google keywords and meteorological information. Times and dates are decomposed into a format which allows a model to better capture periodic trends in the data, and new features, such as proximity to holidays and school dates, are added.
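To make the date handling concrete, here is a minimal sketch of the idea using only the Python standard library. The function name date_features, the feature names, and the holiday list are my own assumptions for illustration; they are not taken from the lesson or from the Rossmann data.

```python
from datetime import date

# Hypothetical holiday list, for illustration only (not from the Rossmann data)
HOLIDAYS = [date(2015, 12, 25), date(2016, 1, 1)]


def date_features(d, holidays=HOLIDAYS):
    """Split a date into its parts and add the distance to the next holiday."""
    # Days until each holiday that falls on or after d
    upcoming = [(h - d).days for h in holidays if h >= d]
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "day_of_week": d.weekday(),  # 0 = Monday ... 6 = Sunday
        "day_of_year": d.timetuple().tm_yday,
        # None when no later holiday is listed
        "days_to_holiday": min(upcoming) if upcoming else None,
    }


features = date_features(date(2015, 12, 21))
```

Splitting the date this way lets the model see, say, the day of the week as its own feature instead of having to infer a weekly pattern from a raw timestamp.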
The really interesting part comes next. The massaged data now contains a lot of categorical features (features whose values do not have any numerical meaning, such as the store ID and day of the week). The common approach when dealing with categorical inputs is to one-hot encode them: each value is turned into a vector of length equal to the number of unique values in the feature (its cardinality), with the corresponding element set to one and the rest set to zero. As an example, “day of the week” has a cardinality of 7, and “Monday” could be encoded as “(1,0,0,0,0,0,0)”.
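As a quick illustration of one-hot encoding, here is a minimal pure-Python sketch; the helper name one_hot is my own and not from any library.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    index = categories.index(value)  # raises ValueError for unseen categories
    return [1 if i == index else 0 for i in range(len(categories))]


monday = one_hot("Monday", DAYS)
```

The length of the output always equals the cardinality of the feature, which is exactly why this gets expensive for something like a store ID.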
This is ok for a few features, but the problem becomes apparent when many features have to be encoded like this. The data contains 22 categorical features and store ID alone has a cardinality of 1116 – the number of inputs to the model will be huge if all of them get one-hot encoded! Instead, the approach is to learn an embedding matrix for each feature, which can considerably reduce the number of inputs whilst still providing a ‘rich representation’ of the feature; continuing the example, it is perhaps more important to know whether the day is a weekday or on the weekend, than exactly which day it is.
Basically, rather than feeding in the one-hot vector, the category is used to look up a row of an N×M matrix, where N is the cardinality of the feature and M is the size of the embedding, i.e. the number of inputs used to represent the feature when it is fed into the model.
In the above diagram, ‘Monday’ would be represented by the vector (0.3,0.9,0.4,0.7) and these numbers would change during training as the optimal representation of ‘Monday’ is learnt.
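The lookup can be sketched in a few lines of plain Python. Only the ‘Monday’ row below comes from the example above; the other rows are made-up numbers, and the function names are my own. The second function shows that the lookup gives the same result as multiplying the one-hot vector by the matrix, just without the wasted multiplications by zero.

```python
# Illustrative 7x4 embedding matrix: one row per day, four values per row.
# Only the Monday row is from the text; the rest are made-up values.
embedding = [
    [0.3, 0.9, 0.4, 0.7],  # Monday
    [0.1, 0.8, 0.5, 0.6],  # Tuesday
    [0.2, 0.7, 0.5, 0.5],  # Wednesday
    [0.2, 0.8, 0.3, 0.6],  # Thursday
    [0.4, 0.6, 0.2, 0.8],  # Friday
    [0.9, 0.1, 0.7, 0.2],  # Saturday
    [0.8, 0.2, 0.9, 0.1],  # Sunday
]


def embed(index, matrix):
    """Embedding lookup: simply select the row for this category."""
    return matrix[index]


def one_hot_times_matrix(index, matrix):
    """The same result, computed as (one-hot vector) x (matrix)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    one_hot = [1.0 if i == index else 0.0 for i in range(n_rows)]
    return [sum(one_hot[r] * matrix[r][c] for r in range(n_rows))
            for c in range(n_cols)]


monday_vector = embed(0, embedding)
```

In a real model the rows would be trainable parameters updated by backpropagation; here they are fixed numbers purely to show the indexing.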
The embedding sizes can be tuned, but the lesson uses the following rule-of-thumb: the size equals half the cardinality of the feature (rounded up to the nearest integer) or 50, whichever is smaller. So the days of the week end up being represented by 4 values rather than 7, and store ID by 50 instead of O(1000). The exact values in the embedding matrix can start from some random initialisation and then be learnt via backpropagation as normal, or, if a similar embedding already exists from an earlier problem, that can be used for the initial values and then refined.
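The rule-of-thumb is easy to write down. This is a sketch of the rule as described above, not a copy of the course library's code; the function name embedding_size and its max_size parameter are my own.

```python
import math


def embedding_size(cardinality, max_size=50):
    """Half the cardinality, rounded up, capped at max_size."""
    return min(max_size, math.ceil(cardinality / 2))


# Cardinalities quoted in the text
cardinalities = {"day_of_week": 7, "store_id": 1116}
sizes = {name: embedding_size(c) for name, c in cardinalities.items()}

# Number of model inputs for these two features under each scheme
one_hot_inputs = sum(cardinalities.values())
embedded_inputs = sum(sizes.values())
```

For just these two features the number of inputs drops from 1123 with one-hot encoding to 54 with embeddings.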
This method of Entity Embedding was summarised by the 3rd-place team in a paper here.
Effectively, this can be thought of as learning an optimal dimensionality reduction method. Luckily in particle physics, the majority of our features are continuous, and the ones which are categorical normally have low cardinality; an example from my own work is the decay channel of a pair of tau leptons, each of which can decay along three different channels (or more if the individual modes of the hadronic channel are considered).
Still, it made me wonder whether it is better to encode the decay channel of the di-tau (nominal cardinality of nine) or each of the taus separately (two features of cardinality three), and whether one-hot encoding or embedding is better. In a test it didn’t seem to matter too much, except that the embedding method took longer to run (but my afternoon’s worth of Keras implementation could probably be refined somewhat).
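The two ways of encoding the decay channels can be sketched as follows. The channel labels and function names are assumptions for illustration; this is not the code from my Keras test.

```python
from itertools import product

# Labels assumed for illustration: three decay channels per tau
TAU_CHANNELS = ["electron", "muon", "hadronic"]

# Option 1: a single feature for the pair, cardinality 3 * 3 = 9
PAIR_CHANNELS = list(product(TAU_CHANNELS, repeat=2))


def encode_pair(tau_0, tau_1):
    """One categorical index in [0, 9) describing the di-tau system."""
    return PAIR_CHANNELS.index((tau_0, tau_1))


def encode_separately(tau_0, tau_1):
    """Two categorical indices, each in [0, 3), one per tau."""
    return TAU_CHANNELS.index(tau_0), TAU_CHANNELS.index(tau_1)


pair_index = encode_pair("muon", "hadronic")
separate_indices = encode_separately("muon", "hadronic")
```

Either set of indices can then be one-hot encoded or passed to an embedding lookup, which gives the four combinations compared in the test.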
Putting it into practice
I was still eager to try out the embedding method on something with more categorical features, and it just so happened that Kaggle had recently launched a challenge to predict the poverty class of households in Costa Rica based on the occupants and the house itself, in order to help distribute welfare assistance better.
The data contained a good number of categorical features, but unfortunately only had about 10,000 training-data rows, and even fewer households; each entry was a member of a particular household, but the final prediction was done for the head of the household only.
My attempt was to preprocess the household by creating features for the head of the household describing various characteristics of the other members in the household (e.g. number of people and number of years of education for different age groups). The majority of the categorical features came from the house itself; the various construction materials, electricity and water suppliers, district, et cetera. I created embedding matrices for each of these.
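The household aggregation can be sketched roughly like this. The column names, the toy rows, and the age threshold are hypothetical and much simpler than the real competition data; the point is only to show member rows being collapsed into one row per head of household.

```python
from collections import defaultdict

# Toy rows; column names are hypothetical, not the competition's real ones
rows = [
    {"household": "A", "is_head": True, "age": 45, "years_education": 11},
    {"household": "A", "is_head": False, "age": 12, "years_education": 6},
    {"household": "A", "is_head": False, "age": 40, "years_education": 9},
    {"household": "B", "is_head": True, "age": 70, "years_education": 4},
]


def aggregate_households(rows, adult_age=18):
    """Return one row per household head with summaries of the other members."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["household"]].append(row)

    heads = []
    for members in groups.values():
        # Assumes every household has a head; raises StopIteration otherwise
        head = next(m for m in members if m["is_head"])
        others = [m for m in members if not m["is_head"]]
        children = [m for m in others if m["age"] < adult_age]
        adults = [m for m in others if m["age"] >= adult_age]
        heads.append({
            **head,
            "n_members": len(members),
            "n_children": len(children),
            "child_years_education": sum(m["years_education"] for m in children),
            "adult_years_education": sum(m["years_education"] for m in adults),
        })
    return heads


heads = aggregate_households(rows)
```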
Another challenge was to deal with some missing and incorrect values in the data (which had either been left deliberately or accidentally; the competition was classed as a ‘playground’). This isn’t something I’d encountered before, so it was a nice experience. I initially tried my own solution of taking averages, but eventually followed another competitor’s suggestions. They also had a more detailed way of aggregating the household member data.
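For reference, the simple averaging idea looks something like the sketch below: missing entries in a column are replaced by the mean of the observed ones. The function name and the example values are made up, and this is not the other competitor's method.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to average, so leave the column as-is
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up column with two missing entries
column = [100.0, None, 300.0, None]
filled = fill_with_mean(column)
```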
I struggled quite a bit with overfitting due to the small amount of training data, and picked up a bit of experience with tuning regularisation, but never managed to beat the BDT-based (boosted decision tree) solutions other people were using.
One thing I tried was using the other members of the household to individually predict the poverty class of the entire household, as a way of pretraining a model; there was over double the amount of training data available that way, and it seemed a waste to throw it away. Whilst this worked (much to my surprise and delight), it failed to offer a serious improvement. I still managed to make the top 25%, though, which was nice.