Summarizing documents

by Greg Kotkowski

Physics or mathematics could be considered complex fields, but for me the most incomprehensible field is law. The natural sciences are driven by nature, while law is devised by people, and for this reason it is sometimes incoherent from a logical point of view. It sometimes seems to me that lawyers have no idea about logic.

For this reason I’m even more concerned about politicians who, with a single click, vote on new laws, often following their party guidelines and lawyers. As an example, let us take the CETA trade agreement between the EU and Canada. The document has almost 1600 pages written in difficult legal jargon.

If a single person were to read the CETA agreement, it would take between 30 and 60 hours of non-stop reading. Do politicians know what they are voting for? (The Polish politicians, for instance, had only 2 weeks to become familiar with the official Polish translation of the agreement.) Thankfully, the Belgian government temporarily blocked the process in order to have time to study it in more depth.

However, let us leave politics and focus on another subject. This article nicely shows some text-mining techniques for understanding what is written in a long document. It inspired me to perform my own analysis of this kind.

The work of our network is funded by the H2020 programme of the European Commission. In order to get the funding, Tommaso and others made a great effort and wrote a huge project proposal (not 1600 pages as in CETA, but still a lot). Before signing my contract I was obliged to read the Grant Agreement. I’m sorry that I cannot share it with you, but believe me, it is a complex document.

To perform the analysis I loaded the PDF version of the Grant Agreement in R and cleaned the data. The cleaning removes special characters and transforms each word to its lowercase lemma, so for example the words washed, washing, washes and Wash are all converted to the lemma wash. Cleaning also removes so-called stop words, which are needed to form sentences but do not add any information about the subject (for example to, be, the, of etc.). Finally, I kept only nouns and verbs and filtered out everything else.
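The cleaning steps described above can be sketched in a few lines. This is only an illustration in Python, not the R code used for the post: the stop-word set and the lemma table are tiny hand-written stand-ins (assumptions), and the part-of-speech filter is left out because it needs a trained tagger.

```python
import re

# Toy stand-ins (assumptions): a real analysis would use a full stop-word
# list and a proper lemmatizer instead of these hand-written tables.
STOP_WORDS = {"to", "be", "the", "of", "a", "an", "and", "in", "is", "are"}
LEMMAS = {"washed": "wash", "washing": "wash", "washes": "wash"}


def clean(text):
    """Return lowercase lemmas with special characters and stop words removed."""
    # Lowercase, then keep only runs of the letters a-z; punctuation,
    # digits and other special characters are dropped.
    words = re.findall(r"[a-z]+", text.lower())
    # Map each word to its lemma when the toy table knows it.
    lemmas = [LEMMAS.get(word, word) for word in words]
    # Drop stop words.
    return [word for word in lemmas if word not in STOP_WORDS]


tokens = clean("The beneficiary washed, Wash and washes: washing of the data!")
print(tokens)
```

The function returns a flat list of cleaned words, which is all that the frequency count for a word cloud needs.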

Having created this dataset I could draw a cloud of words. Below you see the set of the 150 most used words in the Grant Agreement. The size of the font corresponds to the frequency of appearance in the text.
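A word cloud boils down to a frequency table plus a rule that maps counts to font sizes. Here is a minimal Python sketch of that idea; the function names, the size range and the linear scaling are my assumptions for illustration, and the plotting package used for the figure may scale differently.

```python
from collections import Counter


def top_words(tokens, n=150):
    """Return the n most frequent words with their counts."""
    return Counter(tokens).most_common(n)


def font_sizes(counts, min_size=10, max_size=60):
    """Map each word to a font size that grows linearly with its count."""
    if not counts:
        return {}
    highest = max(count for _, count in counts)
    lowest = min(count for _, count in counts)
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    return {
        word: min_size + (count - lowest) * (max_size - min_size) / span
        for word, count in counts
    }


# Hypothetical cleaned tokens, not taken from the Grant Agreement.
tokens = ["grant", "agreement", "grant", "payment", "grant", "agreement"]
counts = top_words(tokens)
sizes = font_sizes(counts)
print(counts)
print(sizes)
```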

From the figure I conclude that the GRANT AGREEMENT is all about AMOUNT of PAYMENTS and who BENEFITS most. The AGREEMENT is all about the RESEARCH in PHYSICS and extracting INFORMATION from the DATA (SEE DOCUMENT).

I was also interested in the pairwise appearance of words in the sentences. This tells us a little more about the characteristics of the document. Below you see the corresponding graph. Each line illustrates the frequent co-occurrence of a pair of words in the sentences. The width of the lines corresponds to the frequency.
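The pair counts behind such a graph can be computed as follows. Again this is a Python sketch under my own assumptions rather than the code used for the figure: each sentence is taken to be an already cleaned list of words, and a pair is counted once per sentence no matter how often its words repeat there.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentences):
    """Count, for each unordered word pair, the sentences containing both words."""
    pairs = Counter()
    for words in sentences:
        # Deduplicate and sort so that each pair has one canonical form.
        unique = sorted(set(words))
        pairs.update(combinations(unique, 2))
    return pairs


# Hypothetical cleaned sentences, not taken from the Grant Agreement.
sentences = [
    ["beneficiary", "receive", "payment"],
    ["payment", "amount", "beneficiary"],
    ["grant", "agreement"],
]
pairs = cooccurrences(sentences)
print(pairs.most_common(1))
```

Each counted pair becomes an edge of the graph, and its count can be used as the line width.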

To conclude, I’m very thankful to Tommaso and everybody else who made this project possible. I hope that PARTICLE PHYSICS is going to be a great BENEFICIARY of the GRANT AGREEMENT research and network COORDINATION.

6 thoughts on “Summarizing documents”


Pietro Vischia
17 November 2016 at 18:35


Hi Greg,

very nice post!

Best,
Pietro


Pablo de Castro
18 November 2016 at 12:51


Really cool Greg,

just out of curiosity, how many times did “beneficiary” appear?

Another interesting AMVA4NP-related source of text data could be obtained by scraping all the blog posts available here.

Best regards,
Pablo


- Greg Kotkowski
  18 November 2016 at 14:19
  
  
  I’m glad that you like it.
  
  I’m not sure if I could share this information. More than 150 times.
  
  Best,
  Greg
  
  
Andrea Giammanco
22 November 2016 at 22:49


In that specific lingo, “beneficiary” is who gets the money. The prominence of that word in the Grant Agreement (as well as “amount”, “payment” and “grant”) reflects the importance that is attached to the money 🙂
I am only puzzled by the equal prominence of the word “see”.
What do they want to see?


- Greg Kotkowski
  23 November 2016 at 9:57
  
  
  Thank you for the question.
  
  In the Grant Agreement the reader is often referred to other sections of the document or to attachments by a note in brackets, for example (see attachment 1).
  
  Maybe I should remove it from the analysis as a stop word.
  

AMVA4NewPhysics

A Marie Sklodowska-Curie ITN funded by the Horizon2020 program of the European Commission