Reading and Writing Electronic Text – NLTK

An Introductory Experiment with NLTK
(Natural Language Toolkit for Python)

Let me start by saying that while I fully intended to play with sets, dictionaries, and comprehensions, I could not escape the allure of NLTK and N-gram language models for generating text, something we will be covering soon. (So I’ll spend the weekend doing what I was supposed to be doing.)

Below is a simple implementation of NLTK’s N-gram model using Lidstone smoothing (which I don’t understand yet). I used the entire text of “War and Peace” as the corpus with a 3-gram model, and with it generated 75 words of new text.
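For what it's worth, here is a rough sketch of what Lidstone (additive) smoothing does, as far as I can tell: it adds a small constant gamma to every count before normalizing, so that words never seen after a given context still get a little probability. The function name `lidstone_prob`, the toy counts, and the vocabulary size of 4 are all made up for illustration; this is plain Python, not NLTK's own code.

```python
from collections import Counter

def lidstone_prob(counts, word, gamma, bins):
    """Additive (Lidstone) estimate: (count + gamma) / (total + bins * gamma)."""
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + bins * gamma)

# Toy counts of words seen after some fixed two-word context (made-up data).
counts = Counter({"house": 3, "order": 1})
vocabulary_size = 4  # assumption: 4 possible words, two of them never observed

p_seen = lidstone_prob(counts, "house", 0.2, vocabulary_size)
p_unseen = lidstone_prob(counts, "village", 0.2, vocabulary_size)
print(p_seen, p_unseen)
```

With gamma set to 0 this reduces to plain relative frequency, and an unseen word would get probability zero; a larger gamma moves more probability toward unseen words.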

The Code:

# Adapted from work by Pedro Paulo Balage - http://nlpb.blogspot.com/
# Written for Python 2 and NLTK 2.x; as far as I know, nltk.model.NgramModel
# is not available in NLTK 3.

# Import the functions used from the nltk library
from nltk.probability import LidstoneProbDist
from nltk.model import NgramModel

filename = 'WarAndPeace.txt'

# Read the whole book, lowercase it, and split it into tokens on whitespace
with open(filename) as f:
    tokens = f.read().lower().split()

# Estimator for smoothing the N-gram model: Lidstone smoothing with
# gamma = 0.2 (the details are still beyond me)
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

# N-gram language model with 3-grams: each word is predicted from the two before it
model = NgramModel(3, tokens, estimator)

# Apply the language model to generate 75 words in sequence
text_words = model.generate(75)

# Put everything back together
text = ' '.join(text_words)

# Print the text
print(text)
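To get a feel for what a 3-gram model is doing during generation, here is a minimal sketch in plain Python with no NLTK and no smoothing. It simply records which words followed each pair of words and then samples from those lists. The helper names `build_trigram_table` and `generate` and the tiny sample sentence are my own inventions for illustration; NLTK's model is more involved than this.

```python
import random
from collections import defaultdict

def build_trigram_table(tokens):
    """Map each pair of consecutive words to the words observed right after it."""
    table = defaultdict(list)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        table[(first, second)].append(third)
    return table

def generate(table, seed_pair, length, rng):
    """Extend seed_pair one word at a time, sampling in proportion to raw counts."""
    words = list(seed_pair)
    while len(words) < length:
        followers = table.get((words[-2], words[-1]))
        if not followers:  # context never seen in the corpus: stop early
            break
        words.append(rng.choice(followers))
    return words

# Tiny stand-in corpus; the post used the full text of "War and Peace".
sample = "the prince looked at the princess and the prince smiled at the count".split()
table = build_trigram_table(sample)
words = generate(table, ("the", "prince"), 10, random.Random(0))
print(" ".join(words))
```

Because this sketch has no smoothing, it stops as soon as it reaches a pair of words it has never seen followed by anything, which is one of the situations a smoothed estimator is meant to handle.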

Some examples of generated text:

3-gram text:
war and the house. those who were quite changed. and i have been calling you all this horror and curiosity at his host, sorted his packages and asked when he was not dancing it was all that belongs to history, in theory, rejects both these principles. it would still remain piteous and plain. after two in the village and turn to look round, but his trembling, swollen lips could not speak of the order.

2-gram text:
he had been with faces he needs fresh moral strength, affirming the greater part in speranski, either declare his glass of another, pointing to him and that princess mary was plain and frightened faces. (he saw that appellation, which were stationed inactive behind the staff, a law of the second staircase led him along the lives in such a red and indistinctly, without proper posts. boris had to start running, out

This entry was posted in Reading and Writing Electronic Text, Spring 2011.
