by Greg Kotkowski

Physics and mathematics may be considered complex fields, but for me the most incomprehensible field is law. The natural sciences are driven by nature, while law is devised by people, and for this reason it is sometimes incoherent from a logical point of view. It sometimes seems to me that lawyers have no idea about logic.

For this reason I’m even more concerned about politicians who, with a single click, vote on new laws, often following the guidelines of their party and its lawyers. As an example, take the CETA trade agreement between the EU and Canada. The document has almost 1600 pages written in difficult legal jargon.

If a single person were to read the CETA agreement, it would take between 30 and 60 hours of non-stop reading. Do politicians know what they are voting for? Polish politicians, for instance, had only 2 weeks to become familiar with the official Polish translation of the agreement. Thankfully, the Belgian government temporarily blocked the process in order to have time to study it in more depth.

However, let us leave politics and focus on another subject. This article nicely shows some text-mining techniques for understanding what is written in a long document. It inspired me to perform my own analysis of this kind.

The work of our network is funded by the H2020 programme of the European Commission. In order to get the funding, Tommaso and others made a great effort and wrote a huge project proposal (not 1600 pages as in CETA, but still a lot). Before signing my contract I was obliged to read the Grant Agreement. I’m sorry that I cannot share it with you, but believe me, it is a complex document.

To perform the analysis I loaded the PDF version of the Grant Agreement into R and cleaned the data. The cleaning removes special characters and transforms each word to its lowercase lemma, so for example the words washed, washing, washes and Wash are all converted to the lemma wash. Cleaning also removes so-called stop words, which are needed to form sentences but do not add any information about the subject (for example to, be, the, of etc.). Finally, I kept only the nouns and verbs.
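The cleaning steps described above can be sketched in a few lines. The analysis in this post was done in R; the snippet below is only a minimal Python illustration, not the code that was actually used. The stop-word set `STOP_WORDS` and the lemma table `LEMMAS` are tiny hypothetical stand-ins for the full stop-word list and the real lemmatizer a text-mining package would provide, and the noun/verb filtering is left out because it needs a part-of-speech tagger.

```python
import re

# Hypothetical, deliberately tiny stop-word list (real lists hold hundreds of words).
STOP_WORDS = {"to", "be", "the", "of", "a", "and", "is", "in"}

# Hypothetical lemma lookup standing in for a real lemmatizer.
LEMMAS = {"washed": "wash", "washing": "wash", "washes": "wash"}


def clean(text):
    """Return the lowercase lemmas of the words in `text`, without stop words."""
    # Replace everything that is not a letter or whitespace with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    # Lowercase and split on whitespace to get the individual words.
    tokens = letters_only.lower().split()
    # Map each word to its lemma; words missing from the table stay unchanged.
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    # Drop the stop words.
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]


print(clean("Wash the data: washed, washing, washes!"))
# prints ['wash', 'data', 'wash', 'wash', 'wash']
```

All four variants of wash collapse into one lemma, and the stop word "the" disappears, which is exactly what makes the later word counts meaningful.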

Having created this dataset I could draw a cloud of words. Below you see the set of the 150 most used words in the Grant Agreement. The size of the font corresponds to the frequency of appearance in the text.
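The numbers behind such a word cloud are just word counts. As a rough Python sketch (again not the R code used for the figure), the most frequent words can be picked with `collections.Counter` and each count mapped to a font size. The function names `top_words` and `font_sizes`, the size range, and the linear scaling are all assumptions made for illustration; the post only states that font size corresponds to frequency.

```python
from collections import Counter


def top_words(tokens, n=150):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(tokens).most_common(n)


def font_sizes(word_counts, min_size=10, max_size=60):
    """Map each word to a font size that grows linearly with its count."""
    if not word_counts:
        return {}
    counts = [count for _, count in word_counts]
    lowest, highest = min(counts), max(counts)
    # Avoid dividing by zero when every word has the same count.
    span = (highest - lowest) or 1
    scale = (max_size - min_size) / span
    return {word: min_size + (count - lowest) * scale for word, count in word_counts}


tokens = ["grant", "agreement", "grant", "payment", "grant", "agreement"]
print(font_sizes(top_words(tokens, n=2)))
# prints {'grant': 60.0, 'agreement': 10.0}
```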


From the figure I conclude that the GRANT AGREEMENT is all about AMOUNT of PAYMENTS and who BENEFITS most. The AGREEMENT is all about the RESEARCH in PHYSICS and extracting INFORMATION from the DATA (SEE DOCUMENT).

I was also interested in the pairwise appearance of words within sentences. This tells us a little more about the characteristics of the document. Below you see the corresponding graph. Each line connects a pair of words that frequently occur together in the same sentence. The width of the line corresponds to the frequency.
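Counting such pairs is simple once every sentence has been reduced to a list of cleaned words. The following Python sketch is an illustration under that assumption, not the R code behind the graph; the function name `pair_counts` is hypothetical. Each unordered pair is counted once per sentence in which both words appear, and the resulting counts would serve as the line widths.

```python
from collections import Counter
from itertools import combinations


def pair_counts(sentences):
    """Count in how many sentences each unordered pair of words appears together."""
    counts = Counter()
    for tokens in sentences:
        # Deduplicate and sort so that (a, b) and (b, a) are the same pair
        # and a word repeated within one sentence is counted only once.
        unique_words = sorted(set(tokens))
        counts.update(combinations(unique_words, 2))
    return counts


sentences = [
    ["grant", "agreement", "payment"],
    ["grant", "agreement"],
    ["payment", "amount"],
]
print(pair_counts(sentences).most_common(1))
# prints [(('agreement', 'grant'), 2)]
```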


To conclude, I’m very thankful to Tommaso and everybody else who made this project possible. I hope that PARTICLE PHYSICS will be a great BENEFICIARY of the GRANT AGREEMENT research and the network COORDINATION.