Learning Bit by Bit – Text Generation

Generating text with n-gram language models

I used n-gram probabilistic language models to generate text. I chose a lengthy corpus, Tolstoy’s “War and Peace”, as my main text, later complementing it with Milton’s “Paradise Lost”. Finally, I generated text from three poems by Dylan Thomas.

An initial experiment used the LingPipe example code. Below are 75 words generated using a 3-gram model:

swayed and fell asleep . Forgive me for troubling you … ” ” Oh , how splendid ! ” He took the pistol Makar Alexeevich by the French . The soldiers ‘ ward , with a sigh , and their voices reverberated now near to her that it might seem , be it whom it had to be sold , and it seemed to him that the enemy , and it seems to me . ” I dont
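
As a rough illustration of what 3-gram generation does in general (this is not the LingPipe code, and the function names and toy corpus are my own assumptions), here is a minimal unsmoothed sketch in plain Python. It records which words follow each pair of words, then repeatedly samples a follower for the last two words generated.

```python
import random
from collections import defaultdict


def build_trigram_table(tokens):
    """Map each pair of consecutive words to the list of words seen after it."""
    table = defaultdict(list)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        table[(first, second)].append(third)
    return table


def generate(table, start, length, seed=None):
    """Generate up to `length` words, starting from the two-word tuple `start`."""
    rng = random.Random(seed)
    words = list(start)
    while len(words) < length:
        followers = table.get((words[-2], words[-1]))
        if not followers:
            break  # dead end: this pair never had a successor in the corpus
        # Repeated entries make frequent continuations proportionally likelier.
        words.append(rng.choice(followers))
    return words


# Toy corpus standing in for the novel (hypothetical example text).
sample_tokens = "the cat sat on the mat and the cat sat on the hat".split()
trigram_table = build_trigram_table(sample_tokens)
generated = generate(trigram_table, ("the", "cat"), 8, seed=1)
print(" ".join(generated))
```

Without smoothing, the sketch can only reproduce word triples that occur in the corpus, and it stops early if it reaches a pair with no recorded follower.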

Next, I experimented with a simpler model in NLTK, again a 3-gram model generating 75 words:

# Adapted from work by Pedro Paulo Balage - http://nlpb.blogspot.com/
# Written for NLTK 2.x: nltk.model.NgramModel was removed in NLTK 3.

# Import the functions used from the nltk library
from nltk.probability import LidstoneProbDist
from nltk.model import NgramModel
import re

filename = 'WarAndPeace.txt'

# Read the corpus, lowercase it, and split it on whitespace
tokens = re.split(r'\s+', open(filename).read().lower())

# Estimator for smoothing the n-gram model (Lidstone, gamma = 0.2)
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

# N-gram language model with 3-grams
model = NgramModel(3, tokens, estimator)

# Apply the language model to generate 75 words in sequence
text_words = model.generate(75)

# Concatenate all generated words into one string
text = ' '.join(text_words)

# Print the text
print(text)

This generated the following text:

war and the weaker and more quickly, as if considering something, and that he would of course… but as his daughter’s distress, and pains in his husbandry. pierre remained for him in french. december 4. today when andrusha (her eldest boy) woke up on the contrary. but no rain. the ground ten paces ahead. bushes looked like a wound-up clock, by force of habit employed all his stewards to the young princess bolkonskaya had brought
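
The estimator in the NLTK listing applies Lidstone (additive) smoothing with gamma = 0.2: a count c out of N observations spread over B bins becomes (c + gamma) / (N + B * gamma). The sketch below works through that arithmetic in plain Python, independent of NLTK. The function name, the counts, and the choice of three bins are hypothetical, chosen only to show the effect. Note that the lambda in the listing ignores its bins argument, so the bin count there is left to LidstoneProbDist's default.

```python
from collections import Counter


def lidstone_prob(counts, word, gamma=0.2, bins=None):
    """Additive (Lidstone) estimate: (count + gamma) / (total + bins * gamma)."""
    if bins is None:
        bins = len(counts)  # assumption: one bin per distinct observed word
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + bins * gamma)


# Hypothetical counts of words seen after some two-word context.
followers = Counter({"andrew": 3, "vasili": 1})

# With three bins, one bin is reserved for a word never seen in this context.
p_seen = lidstone_prob(followers, "andrew", bins=3)
p_unseen = lidstone_prob(followers, "bagration", bins=3)
print(p_seen, p_unseen)
```

The unseen word gets a small non-zero probability, taken from the words that were actually observed, and the three estimates still sum to one.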

Then with the addition of “Paradise Lost”:

file1 = 'WarAndPeace.txt'
file2 = 'milton-paradise.txt'

# Paradise Lost first, then War and Peace appended to the same token list
tokens = re.split(r'\s+', open(file2).read().lower())
tokens.extend(re.split(r'\s+', open(file1).read().lower()))

paradise lost and her surprise, nonrecognition, and with your excellency?” but still following announcement of three hills were lines more miserable.” prince andrew’s eyes continually changing their prayers could not at all, isn’t it?” asked these earthly life. when he declared, was five too mean i put his activities there. when the appointed place thyself aright. so he has to prevent looting, and accuses himself.” boris did not know you take them all foreigners who

And then with three poems by Dylan Thomas:

file1 = 'DeathShallHave.txt'
file2 = 'AfterTheFuneral.txt'
file3 = 'ProcessInWeather.txt'

# Append the three poems into one token list
tokens = re.split(r'\s+', open(file1).read().lower())
tokens.extend(re.split(r'\s+', open(file2).read().lower()))
tokens.extend(re.split(r'\s+', open(file3).read().lower()))

and death shall have no dominion. dead mean naked they shall be one with the man in the flesh and bone is damp and dry; the golden shot storms in the dark of the fox twitch and cry love and the heart gives up its dead.  after the funeral, mule praises, brays, windshake of sailshaped ears, muffle-toed tap tap happily of one peg in the flesh and bone is damp and dry; the quick
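
The repeated read-and-extend lines in the two combined-corpus listings could be folded into one small helper. This is only a sketch: load_tokens is a hypothetical name of my own, not part of NLTK, and it additionally drops the empty strings that splitting can leave at the start or end of a file.

```python
import re


def load_tokens(filenames):
    """Read each file, lowercase it, and return one combined token list."""
    tokens = []
    for name in filenames:
        with open(name) as handle:
            # Same whitespace tokenisation as the listings above;
            # empty strings from leading/trailing whitespace are dropped.
            pieces = re.split(r"\s+", handle.read().lower())
            tokens.extend(piece for piece in pieces if piece)
    return tokens


# Usage, with the file names from the Dylan Thomas listing:
# tokens = load_tokens(['DeathShallHave.txt', 'AfterTheFuneral.txt',
#                       'ProcessInWeather.txt'])
```

Because the lists are simply concatenated, a longer text contributes proportionally more n-grams to the model, and a few n-grams straddle each file boundary.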

This entry was posted in Learning Bit by Bit, Spring 2011.
