A week ago I performed some elementary text mining of the network's Grant Agreement document, as described in the post Summarizing documents. Pablo suggested that I also scrape our blog, as it might contain some interesting information. Well, I took his suggestion seriously.
Downloading all the content from the blog turned out to be quite easy. I used the wget command with special flags that allowed me to retrieve the pages recursively. I then read the collected data into R and performed some cleaning (removing comments, likes, duplicates, etc.). Using the packages XML and rvest, I could extract the important and informative data from the messy HTML content, that is, the author and the actual text of each article. The text was cleaned in the same way as described in the previous article.
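For readers who want to try something similar, here is a minimal sketch of the extraction step. It is written in Python with only the standard library, so it is not the R code (XML and rvest) used for this post. The class names "author" and "entry-content" are assumptions about the page markup and would need to be adapted to the real blog template.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text found inside elements carrying a target class name."""

    # Tags without a closing tag; they must not change the nesting depth.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, author_class="author", body_class="entry-content"):
        super().__init__()
        # Hypothetical class names: adapt them to the actual page markup.
        self.targets = {author_class: "author", body_class: "text"}
        self.field = None  # field currently being captured, if any
        self.depth = 0     # nesting depth inside the captured element
        self.parts = {"author": [], "text": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.field is not None:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name, field in self.targets.items():
            if name in classes:
                self.field, self.depth = field, 1
                return

    def handle_endtag(self, tag):
        if self.field is not None and tag not in self.VOID:
            self.depth -= 1
            if self.depth == 0:
                self.field = None

    def handle_data(self, data):
        if self.field is not None:
            self.parts[self.field].append(data)


def extract_post(html):
    """Return the author and article text with whitespace normalised."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return {key: " ".join(" ".join(chunks).split())
            for key, chunks in parser.parts.items()}
```

Anything outside the two target elements (comments, like buttons and so on) is simply never collected, which takes care of part of the cleaning.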
Following this, I selected all authors who had published at least 3 articles on the blog. For each author I drew a word-cloud of the words they used most frequently (as of 25.11.2016). The results are presented below (from left to right: Alessia Saggio, Andrea Giammanco, Cecilia Tosciri, Fabricio Jiménez, Giles Strong, Greg Kotkowski, Pablo de Castro, Pietro Vischia, the AMVA4NewPhysics press office and, the most devoted author, Tommaso Dorigo). I hope nobody is offended that I made this work public. If so, I’m sorry, but the data is freely accessible.
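The selection and counting step can be sketched roughly as follows. Again this is a Python illustration and not the R code behind the clouds. The function name, the `top_n` parameter and the simple tokenizer are assumptions, and the texts are assumed to be cleaned already (no stop-word removal happens here).

```python
import re
from collections import Counter, defaultdict


def word_counts_by_author(posts, min_posts=3, top_n=10):
    """Count word frequencies per author, skipping authors with few posts.

    `posts` is an iterable of (author, text) pairs.
    """
    texts_by_author = defaultdict(list)
    for author, text in posts:
        texts_by_author[author].append(text)

    result = {}
    for author, texts in texts_by_author.items():
        if len(texts) < min_posts:
            continue  # fewer than min_posts articles: no cloud for this author
        # Very simple tokenizer: lower-case runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", " ".join(texts).lower())
        result[author] = Counter(words).most_common(top_n)
    return result
```

The output maps each remaining author to a list of (word, count) pairs, most frequent first, which is the kind of input a word-cloud plotting function needs.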
The comparison between authors was very surprising to me. You can see very clearly that each of us touches on different subjects. I’d be far from proclaiming that a word-cloud is the fingerprint of our interests and character, but some correlation is apparent. It is also funny how often we speak about time. Next year I will surely repeat the analysis to see what has changed.
28 November 2016 at 10:35
This is really interesting to see! It would certainly be good to compare this to next year’s.
28 November 2016 at 11:16
Really nice, the per-author idea was wonderful, Greg! Given that mine is the largest and I am not the author with the largest number of posts, what are the factors that affect the size of the word-cloud? Is it the number of different words with at least N repetitions?
28 November 2016 at 12:21
I plotted the words that were repeated at least 3 times, but the size of the cloud depends on more factors. The most frequent word has the same font size across authors. The sizes of the other words are scaled relative to the dominant one. This is the reason why Andrea’s cloud looks small even though it consists of many words: the use of the word “matrix” overwhelmed the others. The size of the cloud also depends greatly on the length of the words (especially the most frequent ones). You are not the author with the largest number of posts, but your posts were often long and addressed a wide range of topics. Those are the main reasons for the variety in your cloud and its lack of a dominant word. Each word-cloud is also limited to 150 words, and believe me, out of the 10 clouds presented only 3 didn’t reach the limit.
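The sizing rules described in this reply can be illustrated with a small Python sketch. The thresholds (at least 3 repetitions, at most 150 words) come from the reply itself. The linear scaling and the font sizes `max_size` and `min_size` are assumptions made for illustration, since the exact plotting settings are not given.

```python
def cloud_font_sizes(counts, min_count=3, max_words=150,
                     max_size=40.0, min_size=8.0):
    """Map each kept word to a font size relative to the most frequent word."""
    # Keep words repeated at least min_count times, most frequent first.
    kept = sorted(((word, n) for word, n in counts.items() if n >= min_count),
                  key=lambda pair: (-pair[1], pair[0]))[:max_words]
    if not kept:
        return {}
    top = kept[0][1]  # the dominant word always gets max_size
    # Assumed linear scaling between min_size and max_size.
    return {word: min_size + (max_size - min_size) * n / top
            for word, n in kept}
```

With counts such as {"matrix": 30, "time": 6, "higgs": 3}, the dominant word gets the full size and all the others end up close to the minimum, which is the effect described above for a cloud with one overwhelming word.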