A week ago I did some elementary text mining on the network's Grant Agreement document, as described in the post Summarizing documents. Pablo suggested that I also scrape our blog, since it might contain some interesting information. Well, I took his suggestion seriously.
Downloading all the content from the blog turned out to be quite an easy task. I used the wget command with flags that make it follow links and download pages recursively. I then read the collected data into R and did some cleaning (removing comments, likes, duplicates, etc.). Using the packages XML and rvest, I could extract the important and informative data from the messy HTML content: the author and the actual text of each article. The text was cleaned in the same way as described in the previous article.
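For readers who want to try something similar, the extraction step could look roughly like the sketch below. It is written in Python with only the standard library, not the R packages used for the actual analysis, and the class names `author` and `entry-content` are assumptions about the page markup, not taken from the real blog. It also assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect the author name and body text of a single blog page.

    The class names "author" and "entry-content" are hypothetical;
    adapt them to the markup of the pages you actually downloaded.
    """

    # Tags that never have a closing tag, so they must not go on the stack.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []   # one label ("author", "text" or None) per open tag
        self.author = []  # text fragments found inside the author element
        self.text = []    # text fragments found inside the article body

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if "author" in classes:
            self.stack.append("author")
        elif "entry-content" in classes:
            self.stack.append("text")
        else:
            self.stack.append(None)

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # The nearest labelled ancestor decides where the text goes;
        # text outside both elements (comments, menus, ...) is dropped.
        for label in reversed(self.stack):
            if label == "author":
                self.author.append(data)
                return
            if label == "text":
                self.text.append(data)
                return


def extract_post(html):
    """Return {"author": ..., "text": ...} with whitespace normalised."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return {
        "author": " ".join(" ".join(parser.author).split()),
        "text": " ".join(" ".join(parser.text).split()),
    }
```

Calling `extract_post` on the content of each downloaded file gives one small record per article, which is all the later steps need.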
Following this, I selected all authors who had published at least 3 articles on the blog. For each author I drew a word cloud of the words they used most frequently (as of 25.11.2016). The results are presented below (from left to right: Alessia Saggio, Andrea Giammanco, Cecilia Tosciri, Fabricio Jiménez, Giles Strong, Greg Kotkowski, Pablo de Castro, Pietro Vischia, the AMVA4NewPhysics press office and, the most devoted author, Tommaso Dorigo). I hope nobody is offended that I made this work public. If so, I’m sorry, but the data are freely accessible.
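The selection and counting behind the word clouds can be sketched in a few lines. Again, this is a Python illustration under my own assumptions, not the R code used for the figures: the function name, the tiny stop-word list and the input format (a list of records with `author` and `text` fields) are all made up for the example, and the drawing of the cloud itself is left out.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stop-word list; a real analysis
# would use a much fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"}


def frequent_words_by_author(posts, min_articles=3, top_n=5):
    """Return the top_n most frequent words for each sufficiently active author.

    `posts` is a list of dicts with the keys "author" and "text".
    Authors with fewer than `min_articles` posts are skipped.
    """
    # Group the article texts by author.
    by_author = {}
    for post in posts:
        by_author.setdefault(post["author"], []).append(post["text"])

    result = {}
    for author, texts in by_author.items():
        if len(texts) < min_articles:
            continue
        counts = Counter()
        for text in texts:
            # Simplification: only unaccented Latin letters count as words.
            words = re.findall(r"[a-z]+", text.lower())
            counts.update(w for w in words if w not in STOPWORDS)
        result[author] = counts.most_common(top_n)
    return result
```

The word counts returned for each author are exactly what a word-cloud tool needs: the more often a word appears, the larger it is drawn.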
The comparison between authors was very surprising to me. You can see very clearly that each of us touches on different subjects. I’d be far from proclaiming that a word cloud is the fingerprint of our interests and character, but some correlation is apparent. It is also funny how often we talk about time. Next year I will surely repeat the analysis to see what has changed.