by Greg Kotkowski

A year ago I posted an article that used word clouds to visualise the subjects covered by the authors of this blog. The clouds contained stemmed and filtered nouns and verbs used in the posts of each author who had produced at least 3 articles. Giles had suggested revisiting the topic the following year for a comparison, so here it is.

Just to remind you, a word cloud (also called a tag cloud) visually represents the frequencies of the words used in a given text. Such a representation shows the terms an author uses most often via the font size and colour of the words. As an example, the feature image for this post is a cloud constructed from all the articles posted on this blog. The most frequent terms are certainly strongly bound to our interests and activities: physics, particle, datum, event, network, etc.
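To make the link between frequency and font size concrete, here is a minimal sketch. It is not the code used for the clouds in this post; the function name `font_sizes`, the linear scaling, and the point sizes are assumptions chosen only for illustration.

```python
from collections import Counter

def font_sizes(words, min_pt=10, max_pt=60):
    """Map each distinct word to a font size that grows with its frequency.

    Hypothetical helper: linear scaling between min_pt and max_pt is an
    assumption, real word-cloud tools may scale differently.
    """
    counts = Counter(words)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for word, n in counts.items():
        # If every word is equally frequent, give all of them the largest size.
        scale = (n - lo) / span if span else 1.0
        sizes[word] = round(min_pt + scale * (max_pt - min_pt))
    return sizes

sample = ["physics", "particle", "physics", "event", "physics", "particle"]
sizes = font_sizes(sample)
print(sizes)
```

In this toy sample "physics" appears most often, so it gets the largest font, while "event" appears once and gets the smallest.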

To obtain the cloud, the blog was crawled again (its content was downloaded) and the obtained data were cleaned. Next, the article contents were split into words, which were transformed to their lemmas. Finally, after some filtering, only the nouns and verbs were selected. I was glad to reuse the code I wrote last year, so I got the results with almost no effort.
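The steps above (clean, split into words, lemmatise, keep nouns and verbs) can be sketched roughly as follows. This is only an illustration, not my actual code: the tiny `TOY_LEXICON`, the `STOP_LEMMAS` filter, and all function names are hypothetical stand-ins for what a proper language-processing library would provide.

```python
import re

# Hypothetical toy lexicon: surface form -> (lemma, part of speech).
# A real pipeline would obtain lemmas and tags from an NLP library.
TOY_LEXICON = {
    "data": ("datum", "NOUN"),
    "events": ("event", "NOUN"),
    "particles": ("particle", "NOUN"),
    "collided": ("collide", "VERB"),
    "are": ("be", "VERB"),
    "quickly": ("quickly", "ADV"),
    "the": ("the", "DET"),
}
STOP_LEMMAS = {"be"}          # assumed filter for uninformative verbs
KEEP_POS = {"NOUN", "VERB"}   # only nouns and verbs go into the cloud

def clean(html):
    """Strip HTML tags and lowercase the remaining text."""
    return re.sub(r"<[^>]+>", " ", html).lower()

def tokenize(text):
    """Split cleaned text into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text)

def lemmas_for_cloud(html):
    """Return the lemmas of the nouns and verbs found in one article."""
    kept = []
    for token in tokenize(clean(html)):
        # Words missing from the toy lexicon are tagged UNKNOWN and dropped.
        lemma, pos = TOY_LEXICON.get(token, (token, "UNKNOWN"))
        if pos in KEEP_POS and lemma not in STOP_LEMMAS:
            kept.append(lemma)
    return kept

post = "<p>The particles collided quickly; the data are events.</p>"
result = lemmas_for_cloud(post)
print(result)
```

The resulting list of lemmas is what would then be counted to build a cloud.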

In total, I gathered data on 264 articles by 17 authors, published since the beginning of the blog’s existence (to be more accurate, more authors contributed, but their work is published under the AMVA4NewPhysics Press Office account). For the subsequent analysis, only the data of the 14 authors who have written at least 3 articles were used.
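The selection of authors with at least 3 articles amounts to a simple group-and-filter step. A minimal sketch, assuming the crawled data is a list of (author, text) pairs; the author names and function names below are placeholders, not real data from the blog.

```python
from collections import defaultdict

MIN_ARTICLES = 3  # threshold stated in the post

def group_by_author(articles):
    """Collect article texts per author from (author, text) pairs."""
    by_author = defaultdict(list)
    for author, text in articles:
        by_author[author].append(text)
    return by_author

def prolific_authors(articles, min_articles=MIN_ARTICLES):
    """Keep only the authors with at least min_articles articles."""
    by_author = group_by_author(articles)
    return {a: texts for a, texts in by_author.items()
            if len(texts) >= min_articles}

# Placeholder data for illustration only.
articles = [
    ("alice", "post 1"), ("alice", "post 2"), ("alice", "post 3"),
    ("bob", "post 1"), ("bob", "post 2"),
]
selected = prolific_authors(articles)
print(sorted(selected))
```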

Below, I compare the word clouds constructed before November 25, 2016, shown in the left column, with the clouds based on all the current data, shown on the right. The new clouds use a slightly different stemming (e.g. data → datum). Reading along the rows, we can track how the terms used have evolved with time. It seems that most of us are mono-thematic writers who focus on a few particular subjects, though there are exceptions like Giles, whose cloud has changed completely.

And here are the clouds of the authors not considered in the previous visualisation.