Generating text with n-gram language models
These are some experiments in generating text with probabilistic n-gram language models. I chose a lengthy corpus, Tolstoy’s “War and Peace”, as my main text, complementing it later with Milton’s “Paradise Lost”. Finally, I generated text from three poems by Dylan Thomas.
The initial experiment used the LingPipe example code. Below are 75 words generated with a 3-gram model:
swayed and fell asleep . Forgive me for troubling you … ” ” Oh , how splendid ! ” He took the pistol Makar Alexeevich by the French . The soldiers ‘ ward , with a sigh , and their voices reverberated now near to her that it might seem , be it whom it had to be sold , and it seemed to him that the enemy , and it seems to me . ” I dont
Next, I experimented with a simpler model in NLTK, again a 3-gram model generating 75 words:
# Adapted from work by Pedro Paulo Balage - http://nlpb.blogspot.com/

# Import the functions used from nltk library
from nltk.probability import LidstoneProbDist
from nltk.model import NgramModel
import re

filename = 'WarAndPeace.txt'
tokens = list(re.split('\s+', file(filename).read().lower()))

# estimator for smoothing the N-gram model
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

# N-gram language model with 3-grams
model = NgramModel(3, tokens, estimator)

# Apply the language model to generate 75 words in sequence
text_words = model.generate(75)

# Concatenate all words generated in a string.
text = ' '.join([word for word in text_words])

# print the text
print text
This generated the following text:
war and the weaker and more quickly, as if considering something, and that he would of course… but as his daughter’s distress, and pains in his husbandry. pierre remained for him in french. december 4. today when andrusha (her eldest boy) woke up on the contrary. but no rain. the ground ten paces ahead. bushes looked like a wound-up clock, by force of habit employed all his stewards to the young princess bolkonskaya had brought
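The NLTK script above targets Python 2 and an older NLTK API that may not be available in newer releases. To show what a 3-gram generator does underneath, here is a minimal standard-library sketch in Python 3. It is not NLTK's implementation: the function names, the choice to start from the corpus's first two words, the add-gamma weighting over observed followers only, and the fallback for unseen contexts are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def build_trigram_counts(tokens):
    """Map each two-word context to a Counter of the words seen after it."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts


def generate(tokens, n_words=75, gamma=0.2, seed=None):
    """Generate n_words tokens from a trigram model (illustrative sketch)."""
    if len(tokens) < 3:
        raise ValueError("need at least three tokens to build trigrams")
    rng = random.Random(seed)
    counts = build_trigram_counts(tokens)
    # Assumption: begin with the first two words of the corpus.
    out = list(tokens[:2])
    while len(out) < n_words:
        followers = counts.get((out[-2], out[-1]))
        if followers:
            words = list(followers)
            # Add-gamma weights, restricted to followers actually observed.
            weights = [followers[w] + gamma for w in words]
            out.append(rng.choices(words, weights=weights, k=1)[0])
        else:
            # Unseen context: fall back to a word drawn by corpus frequency.
            out.append(rng.choice(tokens))
    return out[:n_words]
```

With a token list such as `open('WarAndPeace.txt').read().lower().split()`, calling `' '.join(generate(tokens))` should give text in the same spirit as the sample above, though not the same words, since the sampling is random and the smoothing differs from NLTK's.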
Then with the addition of “Paradise Lost”:
file1 = 'WarAndPeace.txt'
file2 = 'milton-paradise.txt'
tokens = list(re.split('\s+', file(file2).read().lower()))
tokens.extend(list(re.split('\s+', file(file1).read().lower())))
Finally, using only the three Dylan Thomas poems:
file1 = 'DeathShallHave.txt'
file2 = 'AfterTheFuneral.txt'
file3 = 'ProcessInWeather.txt'
tokens = list(re.split('\s+', file(file1).read().lower()))
tokens.extend(list(re.split('\s+', file(file2).read().lower())))
tokens.extend(list(re.split('\s+', file(file3).read().lower())))
This generated the following text:
and death shall have no dominion. dead mean naked they shall be one with the man in the flesh and bone is damp and dry; the golden shot storms in the dark of the fox twitch and cry love and the heart gives up its dead. after the funeral, mule praises, brays, windshake of sailshaped ears, muffle-toed tap tap happily of one peg in the flesh and bone is damp and dry; the quick
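The corpus-combining snippets above repeat the same read, lowercase and split step for every file. A small helper can do this for any number of files. This is a sketch in Python 3; `tokenize`, `load_tokens` and the UTF-8 default are hypothetical choices of mine, not part of the original scripts.

```python
import re


def tokenize(text):
    """Lowercase and split on whitespace, dropping empty strings."""
    return [tok for tok in re.split(r"\s+", text.lower()) if tok]


def load_tokens(paths, encoding="utf-8"):
    """Concatenate the tokens of several text files, in the order given."""
    tokens = []
    for path in paths:
        with open(path, encoding=encoding) as handle:
            tokens.extend(tokenize(handle.read()))
    return tokens
```

For the three poems this would be called as `load_tokens(['DeathShallHave.txt', 'AfterTheFuneral.txt', 'ProcessInWeather.txt'])`. Note that plain concatenation creates a few n-grams that span the boundary between two files, which is probably harmless for a large corpus but may be noticeable with very short texts.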
