An Introductory Experiment with NLTK
(Natural Language Toolkit for Python)
Let me start by saying that while it was my every intention to play with sets, dictionaries, and comprehensions, I somehow could not escape the allure of NLTK and N-gram language models for generating text, something that we will be covering soon. (So I'll spend the weekend doing what I was supposed to be doing.)
Below is a simple implementation of NLTK's N-gram model using Lidstone smoothing (which I don't understand yet). I used the entire text of "War and Peace" as the corpus with a 3-gram model and generated 75 words of new text.
# Adapted from work by Pedro Paulo Balage - http://nlpb.blogspot.com/
# Note: this is Python 2 code and relies on nltk.model.NgramModel from NLTK 2.x

# Import the functions used from the nltk library
from nltk.probability import LidstoneProbDist
from nltk.model import NgramModel
import re

filename = 'WarAndPeace.txt'

# Read the whole book, lowercase it, and split it on whitespace
with open(filename) as f:
    tokens = re.split(r'\s+', f.read().lower().strip())

# Estimator for smoothing the N-gram model <<This is beyond me
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

# N-gram language model with 3-grams <<Each word is predicted from the two words before it
model = NgramModel(3, tokens, estimator)

# Apply the language model to generate 75 words in sequence
text_words = model.generate(75)

# Put everything back together
text = ' '.join(text_words)

# Print the text
print text
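For what it's worth, here is my rough understanding of what the `LidstoneProbDist(fdist, 0.2)` estimator is doing. Lidstone smoothing adds a small constant (gamma, here 0.2) to every count before turning counts into probabilities, so words never seen after a given context still get a little probability. The sketch below is plain Python rather than NLTK; the function name `lidstone_prob`, the toy counts, and the vocabulary size of 4 are all made up for illustration.

```python
from collections import Counter


def lidstone_prob(counts, word, gamma, bins):
    """Lidstone estimate: (count + gamma) / (total + gamma * bins)."""
    total = sum(counts.values())
    return (counts[word] + gamma) / float(total + gamma * bins)


# Toy counts of the words seen after some two-word context (hypothetical data)
counts = Counter({"said": 3, "was": 1})

# Assumed number of possible outcomes: said, was, looked, smiled
bins = 4

for w in ["said", "was", "looked"]:
    # "looked" was never observed but still gets a small nonzero probability
    print("%s %.3f" % (w, lidstone_prob(counts, w, 0.2, bins)))
```

With gamma set to 1 this becomes add-one (Laplace) smoothing, and with gamma set to 0 it falls back to raw relative frequencies, where unseen words get probability zero.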
Some examples of generated text:
3-gram text:
war and the house. those who were quite changed. and i have been calling you all this horror and curiosity at his host, sorted his packages and asked when he was not dancing it was all that belongs to history, in theory, rejects both these principles. it would still remain piteous and plain. after two in the village and turn to look round, but his trembling, swollen lips could not speak of the order.
2-gram text:
he had been with faces he needs fresh moral strength, affirming the greater part in speranski, either declare his glass of another, pointing to him and that princess mary was plain and frightened faces. (he saw that appellation, which were stationed inactive behind the staff, a law of the second staircase led him along the lives in such a red and indistinctly, without proper posts. boris had to start running, out
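The difference between the two samples above comes down to how much context the model looks at: a 3-gram model picks each word based on the previous two words, while a 2-gram model only looks at the previous one. To make that concrete, here is a minimal sketch of an N-gram generator in plain Python. It is not how NLTK implements its model: there is no smoothing, the helper names `build_model` and `generate` are my own, and the toy sentence stands in for the full book.

```python
import random
from collections import defaultdict


def build_model(tokens, n):
    """Map each (n-1)-word context to the list of words that followed it."""
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model


def generate(model, n, length, rng):
    """Generate up to `length` words, sampling each from the last n-1 words."""
    # Start from a randomly chosen context seen in the corpus
    words = list(rng.choice(sorted(model)))
    while len(words) < length:
        context = tuple(words[len(words) - (n - 1):])
        followers = model.get(context)
        if not followers:
            # Dead end: nothing ever followed this context in the corpus
            break
        # Frequent followers appear more often in the list, so they are
        # proportionally more likely to be picked
        words.append(rng.choice(followers))
    return words[:length]


# Toy corpus (hypothetical); the real experiment used all of War and Peace
tokens = "the prince looked at the princess and the prince smiled at the count".split()
rng = random.Random(0)

for n in (2, 3):
    print("%d-gram: %s" % (n, " ".join(generate(build_model(tokens, n), n, 10, rng))))
```

Because every word is drawn from continuations that were actually observed, longer contexts tend to reproduce longer stretches of the source, which is probably why the 3-gram sample reads a bit more smoothly than the 2-gram one.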