So, in a recent blog post I told you a bit about myself. However, this morning I was lazily re-reading it, when I suddenly remembered that a text is nothing more than a collection of elements, linked to one another by some rules (grammar) and some standard associations: by standard associations, I mean that the topic you are writing about dictates the words that you are most likely to use. For example, if you are writing about baking a cake, you are more likely to use terms such as oven, wheat, eggs than if you are writing about a guidance system for missiles. Hopefully 😀
This led me to remember two concepts: frequency tables and Markov chains.
Frequency tables are very simple: you take an input text, you pick each word, and you count how many times it is followed by each of the other words. For example, you take hot, and you might discover that 60% of the time it is followed by cake, and 40% of the time by topic. You do that for each word of the text.
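To make the counting concrete, here is a minimal Python sketch of a frequency table. It is only an illustration under simple assumptions: words are split on whitespace, the sample sentence is made up so that it reproduces the 60%/40% split of the example, and the function names (build_frequency_table, follower_probabilities) are mine, not taken from any existing tool.

```python
from collections import Counter, defaultdict


def build_frequency_table(text):
    """Map each word to a Counter of the words that immediately follow it."""
    words = text.split()
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table


def follower_probabilities(table, word):
    """Turn the raw follower counts of one word into relative frequencies."""
    counts = table[word]
    total = sum(counts.values())
    return {follower: n / total for follower, n in counts.items()}


# Made-up input: "hot" is followed by "cake" 3 times and by "topic" 2 times.
sample = "hot cake hot topic hot cake hot cake hot topic"
table = build_frequency_table(sample)
print(follower_probabilities(table, "hot"))
```

On a real text you would probably also want to decide what to do with capitalization and punctuation; here they simply stay attached to the words.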
And now we come to the awesome part, that is the Markov chain stuff: a Markov chain is, as Wikipedia puts it, a stochastic process that satisfies the Markov property. In layman’s terms, this means a process in which the evolution of the system depends only on the current state of the system, and not on its past history. This means that you only need an initial state and a decision rule for any given state in order to perform a full evolution of the system; each new state is then chosen by looking only at the state immediately preceding it, and not at the initial state or at anything else that came before.
Let’s apply it to our text, to make it clear. Let’s say that from our input text we select as initial state the word hot. If we take our frequency table to be our decision rule, we see that we can generate the next word of our new text by picking either cake with 60% probability or topic with 40% probability. Let’s say we picked cake. Now, to generate the next word, we look at the frequency table for the word cake, which might consist of made 90% of the time and from 10% of the time. We would then obtain either hot cake made or hot cake from.
As you can see, the probability of picking made or from depends only on the frequency table for the current word cake (our decision rule) and not on the past states of the system: it is not a function of the frequency table for the word hot. If we modified the frequency table for hot, when eventually arriving at the word cake we would still have an evolution dictated by a 90% probability of finding made and 10% of finding from.
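The hot/cake example can be played out in a few lines of Python. This is just a sketch: the table is written by hand with the illustrative percentages used above (it is not measured from any real text), and the helper names next_word and generate are hypothetical.

```python
import random

# Hand-written decision rule matching the example in the text;
# the numbers are illustrative, not measured from a real corpus.
TABLE = {
    "hot": {"cake": 0.6, "topic": 0.4},
    "cake": {"made": 0.9, "from": 0.1},
}


def next_word(table, current, rng):
    """Pick the next word looking only at the current word (Markov property)."""
    followers = table.get(current)
    if not followers:
        return None  # dead end: no follower recorded for this word
    words = list(followers)
    weights = list(followers.values())
    return rng.choices(words, weights=weights, k=1)[0]


def generate(table, start, max_words, rng=None):
    """Walk the chain from `start` until max_words or a dead end is reached."""
    rng = rng or random.Random()
    output = [start]
    while len(output) < max_words:
        word = next_word(table, output[-1], rng)
        if word is None:
            break
        output.append(word)
    return " ".join(output)


print(generate(TABLE, "hot", 3))
```

Note that next_word never looks at anything but the last word generated: that is the whole Markov property. With this tiny table the walk stops early if it lands on topic, since that word has no followers recorded.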
Good, we have all the ingredients for a little game, but we need one thing: some software to do that for us. Since it is Sunday and I have stuff to do (involving cleaning the house and having a nice walk), I did not write the code for such a method myself (I might do it in the next few days, though). I just found some code by Andrew Plotkin, which is very nice, also because it lets you group words together.
Since the rules of language are quite strict (both in terms of grammar and in terms of associations), instead of building a frequency table for single words we can build a frequency table for pairs of words. This will enable us to find groups of words that are more likely to follow our usual grammar rules, resulting in a text that will somehow make more sense. Another very nice feature of Andrew’s code is that it includes punctuation and paragraph division in the count, thus making the output text much more realistic.
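Here is a rough Python sketch of the same idea with pairs of words as states. To be clear, this is not Andrew's code and makes no attempt to mimic it: it splits on whitespace only (so punctuation stays glued to the words and paragraph breaks are ignored), and build_pair_table and generate_from_pairs are names I made up for the illustration.

```python
import random
from collections import Counter, defaultdict


def build_pair_table(text, order=2):
    """Map each group of `order` consecutive words to a Counter of next words."""
    words = text.split()
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        table[state][words[i + order]] += 1
    return table


def generate_from_pairs(table, max_words, rng=None):
    """Start from a random state and extend it one word at a time."""
    rng = rng or random.Random()
    state = rng.choice(list(table))
    output = list(state)
    while len(output) < max_words:
        followers = table.get(state)
        if not followers:
            break  # this group of words only appears at the end of the input
        choices = list(followers)
        weights = list(followers.values())
        output.append(rng.choices(choices, weights=weights, k=1)[0])
        state = tuple(output[-len(state):])
    return " ".join(output)


# Made-up sample sentence, just to have something to chew on.
sample = "the hot cake was made from the hot topic of the day"
pair_table = build_pair_table(sample)
print(generate_from_pairs(pair_table, 10))
```

The state is now the last two words instead of the last one, so every group of three consecutive words in the output also appears somewhere in the input.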
OK, what else do we need? Well, an input text, of course! Let’s take my presentation post and say that we want to generate 1000 words using frequency tables built from pairs of words. Here is the result, which I will comment on below:
Very shamefully, this is my first year, I looked at the University of Oviedo, in Spain. Here, my analysis work will be oriented to searches of new physics, mostly in the stat stuff” for the first pages, it already promises to be built in a future multivariate study.
During my Ph.D. in Lisbon, I started to read a Spanish series of books called “Las aventuras de capitán Alatriste” (The Adventures of Captain Alatriste) that reportedly is the Spanish region Oviedo is the heavier brother of the Tevatron accelerator, near Chicago), so there I started working at simple tasks as learning how to plot variables.
An additional internship experience, the following year, finally convinced me that the challenge of their energy. The results were somehow inconclusive, though, leaving just the hope of being able to exploit one particular variable in a future multivariate study.
During my summer student I had the patience of arriving here) and encourage you to drop by in the CMS detector.
I moved then to the Higgs decay topology.
Actually, digging up old emails: by removing QCD events via the Neural Network, the resolution on the value of some ongoing war, had been performed by Ohio University on improving the di-jet mass got down from 18.0% to 10.7%, according to some criteria, and finally you redo the whole process a few non-work-related bits about me!
I always loved reading (thanks to my notes!!!
For the master thesis, I switched to the measurement of the work is to distinguish jets from hadronically decaying taus. I exploited this final state also in the stat stuff” for the jets coming from quarks of type “b” have some peculiar characteristics, that can be exploited to classify events into “signal-like” events and “background-like” events. For this classification problem, I used a machine learning technique called Boosted Decision Trees, which is somehow an extension of the BDT permitted to enhance significantly (in some cases even by 50%) the ability of choosing, in the CMS detector.
I moved then to the CMS experiment at the University of Oviedo group that is joining the network: more on the society.
I went on to humanities high school and beyond. Nowadays, I would have chosen on the basis of the most recent years I started to play live roleplaying games: the difference is that you are not seated around a table describing what your characters do; you are roaming around a table describing what your characters do; you are roaming around a room, or a garden, or a town, effectively acting as if you were your character. This implies that you can forget stuff typical of a heavy charged Higgs decays mostly into a top quark mass as a function of the blog (Sabine, the Press Office of the network!), so I will summarize the rest quite briefly: also, I must only say that it is expected to find two jets is not known (there are neutrinos, and when we collide two proton beams we don’t know exactly the longitudinal momentum of the muon, and its signature characteristic is that it simply reflects my prior ranking of all the possible discriminant variables: this way, I would then skip mentioning the fact that this practice helped me with my general dialectic skills. If you drop by in the CMS experiment at Tevatron, where I am reading none of them: I just began my postdoc at the Conservatory. I was preparing for a direct search of a solid scientific result. It is actually a pity that a non-negligible part of what an experimental physicist should know is not really “work”.
During my Ph.D. last July, and am currently in the comments for any question. For now, talk to you in the event, the correct pair of jets for the production of a heavy charged Higgs boson is lower than 173 GeV), the charged Higgs boson decaying into a tau and a bottom quark!
Across the years of my Ph.D., so right now I am still working nowadays. There, I worked on something we call “VBF H->ZZ->mumubb” Higgs boson decaying into two muons and the other side (Then you do this many times, for many variables, according to some criteria, and finally you redo the whole process a few hundred or thousand times). So, what has the result been? Well, it turned out that the Higgs boson search: OK, what the hell does this mean? Let’s start with “H”: that is the coolest physicist you might think about, if you were your character. This implies that you can forget stuff typical of the LHC).
How on Earth did I end up doing these things? During primary school I already knew that I just began my postdoc at the LHC, where I was a typical search channel for a heavy charged Higgs boson with mass higher than the top quark pair events in the context of a solid scientific result. It is actually a pity that a non-negligible part of what an experimental physicist working in the near future. Uh, and I soon became fascinated by the idea of becoming a lawyer (not a judge. Lawyer is cooler: Perry Mason was a typical search channel for a heavy charged Higgs boson we discovered could have some delicate statistics issue: having direct access to tricky and delicate conversations on statistics boosted and is boosting my understanding and experience in this blog some posts from Giles and from my former supervisor João Varela), where I have probably started to think that most likely the most amiable professor. Back at the time gathering useful documentation on the network blog: I hope to make up for this by spamming writing a reasonable amount of
Well, what can we say? First of all, there are some scrambled bits that are quite funny (meaning that from a grammatical point of view they make sense, but they say unexpected things), but we can also see that there are entire passages that appear exactly as they appeared in the original blog post! Why is that? Well, this is simple: many entries of the frequency table consist of a single follower with 100% probability, and this is due to the fact that the original blog post is only about 3000 words long; this is not enough to have multiple choices in most of the steps.
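If you want to check how much of a table leaves the generator no real choice, a quick count does the job. The following Python sketch assumes whitespace splitting and pairs of words as states; the function name single_choice_fraction is hypothetical, and I have not run it on the actual blog post, so no numbers are quoted here.

```python
from collections import defaultdict


def single_choice_fraction(text, order=2):
    """Fraction of states (groups of `order` words) with exactly one distinct follower."""
    words = text.split()
    followers = defaultdict(set)
    for i in range(len(words) - order):
        followers[tuple(words[i:i + order])].add(words[i + order])
    if not followers:
        return 0.0  # text too short to contain any state
    forced = sum(1 for options in followers.values() if len(options) == 1)
    return forced / len(followers)


# Made-up toy input: only the state ("a", "b") offers two different followers.
print(single_choice_fraction("a b c a b d"))
```

The closer this fraction is to 1, the more the output will just copy stretches of the input word for word.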
What if we tried to use a longer input text? Well, I tried using as an input Harry Potter and the Sorcerer’s Stone, for a total of 78451 words, and generated a 1000-word text out of it, using again a frequency table built for 2-word groups.
Here is the output, which I will comment on below:
Harry Potter and the Weasley twins insisted that nothing had been trying to get past that three-headed dog at Halloween! That’s where he was laden down with rock cakes were shapeless lumps with raisins that almost broke their teeth, but Harry and felt it. It was chalk white with glaring red eyes and slits for nostrils, like a toothless walnut. The low buzz of chatter stopped when they walked in. Everyone seemed to have forgotten that Malfoy had
gotten a detention, too.
“It’s tonight,” said Harry, “so there’s someone called Nicolas Flamel is, that’s all,” said Hermione.
“I saw you and me, that’s saying something. You know that night you won Norbert? What did the best morning he’d had a word to him. The goblin wrinkled his nose. Harry watched the goblin on their right weighing a pile of Chocolate Frogs from Hermione.
“Well, it’s best if I added powdered root of asphodel to an infusion of wormwood?” Powdered root of what they were going to be somewhere, we’ll see you later –”
Professor McGonagall Harry made a mistake. I don’t think I know I don’t know
what was good at drawing, had done the guarding, really.” Hermione went on. “We wondered who had her arm around him.
“They’ll all forget this in a letter? These people will never forget you!”
“Neville, “Ron exploded, “get away from Professor Flitwick, you know.” He put Hedwig inside first and then shot at Wood, who dived on top of all were Professor Snape’s classes down in her pocket, she scrambled back along the walls, keeping their eyes on his card and gave Hermione a rare smile.
“Las’ time I saw you, you just remember what I’m famous for. I don’t see why
first years never — you seen anythin’?”
Ronan didn’t answer Aunt Petunia kept looking around. Everything looked so grim and worried, or why they had been given a wild jerk and Harry stood quite still, both thinking the
same thing — did the stranger you were trying to reason with him.
“Very good,” said Wood.
“Er — okay,” said Harry.
“ALL WHAT?” Hagrid thundered. “Now wait jus’ one second!”
He walked forward and pointed at the lights of passing cars and wondering….
They didn’t meet anyone else until they spotted a notice pinned up in the school, not jus’ then, anyway.
Harry was glad school was over, Hagrid himself was in the Gryffindor Seeker, which
could happen to anyone, I’m sure, so a penalty because George Weasley really did fall off his thick black coat and threw them into a nightmare — this year, the third-floor corridor
— and the round-faced boy Harry had to be working with Hagrid it wouldn’t budge, not even dressed in Muggle clothes, swapping rumors.”
She threw a dirty look at him and he got his revenge, without even realizing he was a plump woman who was quite bald and looked up at Harry, who couldn’t understand why they looked like a cork shot out of a centaur. Ruddy stargazers. Not interested in a minute, I hope…
And now there were a bit nearer home, said Hagrid. “So yeh haven’t noticed anythin’ strange?”
Yet Harry Potter day in the furor over the points they’d lost. He half expected Voldemort to find out that dangerous nonsense?”
That third night he tried to get any points back if we win.”
“Just as long as it may, fighting is against Hogwarts rules, Hagrid,” said Snape silkily. “Five points from Gryffindor. He sprinted back upstairs.
“Did you see all your family standing around you. Ronald Weasley, who has always been overshadowed by his name. Please cheer up, Hagrid, we saved the Stone, well, I’ll have thirty … thirty…”
“Thirty-nine, sweetums,” said Aunt Petunia had sheared it off He had no choice. The cut had turned to the first time in his step — Quirrell seemed to end with him in we’d put a stop to that rubbish,” said Uncle Vernon, “swore we’d stamp it out with difficulty, because it said:
THROUGH THE TRAPDOOR
In the back of his half-moon glasses. “It would be back to normal next year, or as normal as it cringed away from Curses and Countercurses (Bewitch Your Friends and Befuddle Your Enemies with the bat to stop me! Voldemort killed my mother because she said impatiently.
“You’d think they’d be running around looking for it again. Anyway, this — this wizard, about twenty years ago now, started lookin’ fer followers. Got ’em, too — Beaters.”
“I found out about the special circumstances, Potter. And what model is it?”
“A minute –“
Hermione had become a bit pink and pointing to a baked potato when Professor Quirrell
came sprinting into the clearing came — was that their sudden appearance had taken one step toward them out of his newspaper as usual and Dudley were leaning right up to something. And Gryffindor really can’t afford to lose him.” On Christmas Eve, Harry went to bed with a hag — never been brought up to it.”
“Nothing?” said Ron sleepily.
“The big one,” said Hermione. “I know he’s not about ter steal it.”
Harry watched Snape for a moment later they stood blinking in the house, rolled up and we can’t win at Quidditch?”
“Blimey, Harry, I don’ s’pose it could hurt ter tell yeh — mind you, he’s usually tremblin’.”
“Is it — any questions?”
Harry had never seen anything like it. Was that the Dursleys almost ten years, ten miserable years, as long as he poured sugar on his shoulder and looked until a distant shout.
“Is that where -?” whispered Professor McGonagall.
“Please wait quietly.”
She turned to the portrait of the Sorcerer’s Stone “You can’t –“
“Bet you could,” Ron muttered.
There we are. Behold the power of Markov Chains!
Grammatically, most of the text is OK, and the sentences kind of make sense, although in a bizarre way (well, the bizarre part was our purpose from the beginning, uh? 😉 ).
The most astounding characteristic, though, is that the output text is by construction statistically equivalent to the input one (at least, if we take the set of frequencies as an estimator of statistical equivalence)! Markov chains can actually be used to analyze texts in order to decide if they are similar, which could allow us to make some inference about authorship.
For example, if we find a new Shakespeare play, we can compare it statistically with the full Shakespeare corpus and determine if it is compatible or not! Of course this is not a foolproof method (authors change style, for example, or write in a different way for different purposes), but still it is something I find quite impressive!
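Just to give a flavour of what such a comparison could look like, here is a deliberately crude Python sketch: it counts the pairs of consecutive words in two texts and computes the cosine similarity of the two sets of counts. This particular similarity measure is my own choice for the illustration, not an established authorship test, and the function names are hypothetical.

```python
import math
from collections import Counter


def bigram_counts(text):
    """Count every pair of consecutive (lowercased) words in the text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def cosine_similarity(a, b):
    """Cosine similarity between two count tables (0 = nothing shared, 1 = same profile)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up snippets, only to show the call.
known = bigram_counts("the hot cake was made from the hot oven")
candidate = bigram_counts("the hot cake came from the hot oven")
print(cosine_similarity(known, candidate))
```

A serious study would need far longer texts and a proper statistical treatment, but the principle is the same: similar frequency tables hint at similar style.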
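If you want to explore the world of computer-generated text, though, you could go on and check stuff based on context-free grammars: a context-free grammar is basically a set of rewriting rules in which each rule replaces a single symbol (say, “sentence” or “noun”) with a sequence of words and other symbols, regardless of the context in which that symbol appears. Applying the rules over and over until only words are left generates a grammatically valid string.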
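As a toy illustration of generating text from a context-free grammar, here is a small Python sketch. The grammar is entirely made up for the example (a real generator would have a much richer set of rules), and GRAMMAR and expand are hypothetical names.

```python
import random

# Toy grammar invented for this example: each symbol maps to its possible
# expansions; anything that is not a key of the dictionary is a plain word.
GRAMMAR = {
    "SENTENCE": [["NOUN_PHRASE", "VERB_PHRASE"]],
    "NOUN_PHRASE": [["the", "NOUN"], ["the", "ADJECTIVE", "NOUN"]],
    "VERB_PHRASE": [["VERB", "NOUN_PHRASE"]],
    "ADJECTIVE": [["hot"], ["bizarre"]],
    "NOUN": [["cake"], ["wizard"], ["physicist"]],
    "VERB": [["bakes"], ["studies"]],
}


def expand(symbol, grammar, rng):
    """Recursively rewrite a symbol until only plain words are left."""
    if symbol not in grammar:
        return [symbol]  # a plain word: nothing left to rewrite
    production = rng.choice(grammar[symbol])
    words = []
    for part in production:
        words.extend(expand(part, grammar, rng))
    return words


print(" ".join(expand("SENTENCE", GRAMMAR, random.Random())))
```

Unlike the Markov chain, nothing here is learned from an input text: the structure of the sentence is fixed by the rules, and only the choice among alternatives is random.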
It might seem silly, but computer-generated texts have gone way beyond simple examples like the ones I gave you. A program called Mathgen had a computer-generated gibberish text accepted for publication in an advanced mathematics journal! Similarly, the authors of SCIgen managed to get invited to a conference by sending a computer-generated paper, as they describe in a Reddit AMA. In one of the answers, by the way, they also discuss Markov chain generators:
we explicitly avoided Markov chains or anything else that was technically challenging, in the service of trying to make the papers as funny as possible. With Markov chains, you might get something syntactically correct, but it is likely to be boring.
Well, it is time for me to start cooking: I hope you found Markov chains as awesome as I think they are! If you want to play with them, you can download and compile Andrew’s code (just add a #include<string> on top, and compile with cc or gcc), or find some more generators online!
By the way, if you like Harry Potter, you might find some amusement in reading Harry Potter and the Methods of Rationality, which is a fan-fiction revisiting the original books by applying the scientific method to Harry’s world!
Have a nice Sunday (or whichever day this will be published 😛 )!
P.S. No Muggle has been hurt during the preparation of this post.
20 September 2016 at 21:17
Hmmm… I wonder if I’ve written enough science stuff to start auto-generating my blog articles. Might have to test it out.
20 September 2016 at 21:42
My idea is nastier. I will write you an email about that, so if you’ll join we will share the blame 😀