My internship at B12 Consulting goes on! This time I would like to tell you a bit about some Machine Learning clustering algorithms that I’ve been using here to process and analyze my data.
Before going into details, let me just give you a very general overview of Machine Learning. Machine Learning techniques are generally divided into two big macro-areas: supervised and unsupervised machine learning (I’m not going to cover the semi-supervised category). Below is a simple explanation.
- Supervised learning is about inferring a function from labeled training data that will eventually be used to map a new set of data. OK, this is more or less the standard definition, but let me put it in simple words. Say we have a set of input variables (X) and output variables (Y). We can use an algorithm to infer the function f that maps this data, namely Y=f(X). This set of data is called the training dataset. Once the algorithm has estimated the mapping function precisely enough after many iterations, we can apply it to a new set of input data X’ to infer the corresponding outputs Y’. Classification and regression are two types of supervised learning problems.
- Unsupervised learning is about inferring a function to describe hidden structure from unlabelled data (standard definition, again). In simple words: say we have only input data X and no corresponding output, i.e. we don’t have a model to apply to our dataset. How can we learn more about the data? We need to model the underlying structure or distribution of the data… and one efficient way to do that is to use clustering algorithms.
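To make the distinction concrete, here is a minimal sketch using scikit-learn on a tiny dataset I invented for illustration: a regression (supervised, we have labels Y) next to a clustering (unsupervised, only X):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Supervised: we have inputs X AND labels Y, and learn the mapping Y = f(X).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([2.0, 4.0, 6.0, 8.0])            # labels: here simply Y = 2*X
model = LinearRegression().fit(X, Y)
print(model.predict([[5.0]]))                 # ≈ [10.]

# Unsupervised: only inputs X, no labels; we look for structure (clusters).
X_unlabeled = np.array([[1.0], [1.2], [9.8], [10.0]])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unlabeled)
print(clusters)  # the two small values end up together, the two large ones together
```

The cluster IDs themselves (0 or 1) are arbitrary; only the grouping matters.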
k-means clustering algorithm
One of the most used clustering algorithms is k-means. It groups the data into k clusters according to the similarities among the points, where the number k is given as input to the algorithm. I’ll start with a simple example.
Let’s imagine we have 5 objects (say 5 people) and for each of them we know two features (height and weight). We want to group them into k=2 clusters.
Our dataset will look like this:

| | Height | Weight |
|---|---|---|
| Person 1 | 167 | 55 |
| Person 2 | 120 | 32 |
| Person 3 | 113 | 33 |
| Person 4 | 175 | 76 |
| Person 5 | 108 | 25 |
First of all, we have to initialize the value of the centroids for our clusters. For instance, let’s choose Person 2 and Person 3 as the two centroids c1 and c2, so that c1=(120,32) and c2=(113,33).
Now we compute the Euclidean distance between each of the two centroids and each point in the data. If you do all the calculations, you should come up with the following numbers:
| | Distance of object from c1 | Distance of object from c2 |
|---|---|---|
| Person 1 | 52.3 | 58.3 |
| Person 2 | 0 | 7.1 |
| Person 3 | 7.1 | 0 |
| Person 4 | 70.4 | 75.5 |
| Person 5 | 13.9 | 9.4 |
At this point, we assign each object to the cluster it is closest to (that is, we take the minimum of the two computed distances for each object).
We can then arrange the points as follows:
Person 1 → cluster 1
Person 2 → cluster 1
Person 3 → cluster 2
Person 4 → cluster 1
Person 5 → cluster 2
Let’s now iterate: we redefine each centroid as the mean of the members of its cluster.
So c’1 = ((167+120+175)/3, (55+32+76)/3) = (154, 54.3) and c’2 = ((113+108)/2, (33+25)/2) = (110.5, 29)
Then, we calculate the distances again and re-assign the points to the new centroids.
We repeat this process until the centroids don’t move anymore (or until the difference between two consecutive positions falls below some small threshold).
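The whole loop above can be sketched in a few lines of NumPy (a from-scratch illustration, not the scikit-learn implementation). Interestingly, once the centroids are updated, Person 2 migrates to the second cluster, since (120, 32) is closer to the new “small” centroid than to the new “tall” one:

```python
import numpy as np

# The five people from the example above, as (height, weight) pairs
points = np.array([[167, 55], [120, 32], [113, 33], [175, 76], [108, 25]], dtype=float)
centroids = points[[1, 2]].copy()  # start from Person 2 and Person 3, as above

for _ in range(100):
    # Euclidean distance from every point to every centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # assign each point to its closest centroid
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):  # centroids stopped moving -> done
        break
    centroids = new_centroids

print(labels)     # [0 1 1 0 1] -> Persons 1 and 4 in one cluster; 2, 3 and 5 in the other
print(centroids)  # [[171.    65.5  ]  [113.67  30.  ]]
```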
In our case, the result we get is shown in the figure below. You can see the two clusters labelled with two different colours, and the positions of the centroids, marked by crosses.
How to apply k-means?
As you probably already know, I’m using Python libraries to analyze my data. The k-means algorithm is implemented in the scikit-learn package. To use it, you will just need the following line in your script:
from sklearn.cluster import KMeans
I’m not going to go through all the “technical” steps: a quick search for “kmeans sklearn python” will turn up all the tutorials you need (some suggestions are listed at the end of this post).
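Still, here is a minimal sketch of what such a script could look like, run on the toy dataset above (the n_init and random_state parameters are just reasonable values I picked, not requirements):

```python
import numpy as np
from sklearn.cluster import KMeans

# Same toy dataset: (height, weight) for the five people
X = np.array([[167, 55], [120, 32], [113, 33], [175, 76], [108, 25]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)               # cluster index of each person
print(kmeans.cluster_centers_)      # final centroid coordinates
print(kmeans.predict([[160, 60]]))  # a new, tall-ish person joins Persons 1 and 4
```

Which cluster gets index 0 depends on the random initialization, so only the grouping is meaningful, not the label values themselves.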
What if our data is… non-numerical?
At this point, you may have noticed something. The basic concept of k-means rests on mathematical calculations (means, Euclidean distances). But what if our data is non-numerical or, in other words, categorical? Imagine, for instance, having the ID code and date of birth of the five people of the previous example, instead of their heights and weights.
We could think of transforming our categorical values into numerical values and then applying k-means. But beware: k-means works on numerical distances, so it could consider two really different objects close merely because they happened to be assigned two close numbers.
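A tiny example (with a made-up categorical feature) shows the pitfall:

```python
# Naively encode a categorical feature, e.g. a favourite colour, as integers:
encoding = {"red": 0, "green": 1, "blue": 2}

# k-means would then see these numeric "distances" between colours:
print(abs(encoding["red"] - encoding["green"]))  # 1
print(abs(encoding["red"] - encoding["blue"]))   # 2

# But "red vs green" is not twice as similar as "red vs blue" -- the three
# colours are simply different. The ordering 0 < 1 < 2 is an artifact of the
# encoding, yet k-means would happily cluster red with green because of it.
```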
The solution is the k-modes algorithm, introduced in a 1998 paper by Zhexue Huang. But this time, don’t expect to find a lot of material and tutorials on the web: unfortunately, there’s very little documentation about it.
k-modes is an extension of k-means. Instead of distances, it uses dissimilarities (that is, a quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data. We will have as many modes as the number of clusters we requested, since they act as centroids.
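To give an idea of how this works, here is a tiny hand-rolled sketch of the two core ingredients — the matching dissimilarity and the mode update — not the full algorithm (the records and attributes are invented for illustration; for a real implementation see the package linked at the end):

```python
from collections import Counter

def dissimilarity(a, b):
    """Simple matching dissimilarity: the number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def mode(objects):
    """Per-attribute most frequent value: the 'mode' vector minimizing the
    total dissimilarity to all objects in the cluster (ties broken arbitrarily)."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))

# Invented categorical records: (birth year, country)
cluster = [("1985", "FR"), ("1985", "BE"), ("1990", "FR")]
print(dissimilarity(("1985", "FR"), ("1990", "FR")))  # 1 -- they differ only on the year
print(mode(cluster))                                   # ('1985', 'FR')
```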
For mixed numerical and categorical data, another extension of these algorithms exists, basically combining k-means and k-modes. It is called k-prototypes.
If you had the patience to read this post until the end, here’s your reward: a collection of links to deepen your knowledge about clustering algorithms and some useful tutorials! 😛
Another explanation of KMeans with examples: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
A really useful talk by Sarah Guido: http://www.slideshare.net/SarahGuido/kmeans-clustering-with-scikitlearn and her tutorial on github: https://github.com/sarguido/k-means-clustering
Some useful documentation about k-modes: http://www.irma-international.org/viewtitle/10828/
The Python implementation of k-modes and k-prototypes: https://github.com/nicodv/kmodes