I’m starting to do phoneme analysis on words and phrases to create lists of words and phrases that “sound” similar.
I have a very crude “scoring system” (lower scores mean closer matches) to keep track of things like:
- Overall similarity of phonemes (in order; I would like to add overall phoneme similarity regardless of order)
- Overall similarity of stress patterns
- Extra points for alliteration (phoneme matches at start of word/phrase)
- Extra points for rhyming (phoneme matches at end of word/phrase)
- Word overlap
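The order-independent comparison mentioned in the first bullet could be sketched roughly as below. This is only an illustration, not part of the scripts further down: the helper names (`strip_stress`, `unordered_similarity`, `shared_prefix`, `shared_suffix`) are made up for this example, and the pronunciations are typed in by hand in CMU-dictionary style rather than looked up, so treat them as assumptions.

```python
import re
from collections import Counter


def strip_stress(phonemes):
    """Drop the stress digits, e.g. 'IH2' becomes 'IH'."""
    return [re.sub(r"\d", "", ph) for ph in phonemes]


def unordered_similarity(a, b):
    """Share of phonemes two words have in common, ignoring order (0.0 to 1.0)."""
    bag_a = Counter(strip_stress(a))
    bag_b = Counter(strip_stress(b))
    shared = sum((bag_a & bag_b).values())  # multiset intersection
    total = sum((bag_a | bag_b).values())   # multiset union
    return shared / total if total else 0.0


def shared_prefix(a, b):
    """Number of matching phonemes at the start of both words (alliteration)."""
    count = 0
    for x, y in zip(strip_stress(a), strip_stress(b)):
        if x != y:
            break
        count += 1
    return count


def shared_suffix(a, b):
    """Number of matching phonemes at the end of both words (rhyme)."""
    return shared_prefix(a[::-1], b[::-1])


# Hand-typed pronunciations (assumed, not looked up in cmudict here)
disappear = ["D", "IH2", "S", "AH0", "P", "IH1", "R"]
persevere = ["P", "ER2", "S", "AH0", "V", "IH1", "R"]
disrepair = ["D", "IH2", "S", "R", "IH0", "P", "EH1", "R"]

print(unordered_similarity(disappear, disrepair))
print(shared_prefix(disappear, disrepair), shared_suffix(disappear, persevere))
```

Counting shared phonemes as a multiset means a word is not penalised when one extra phoneme shifts everything after it out of alignment, which is the weakness of a purely position-by-position comparison.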
WORDS: DISAPPEAR
Enter a word: disappear
disappear
D, IH2, S, AH0, P, IH1, R
(‘disappeared’, {‘score’: -608})
(‘disappears’, {‘score’: -608})
(‘disappear’, {‘score’: -608})
(‘disappearance’, {‘score’: -598})
(‘disappearing’, {‘score’: -598})
(‘disappearances’, {‘score’: -588})
(‘disappoint’, {‘score’: -579})
(‘disappoints’, {‘score’: -579})
(‘disapproved’, {‘score’: -579})
(‘disapprove’, {‘score’: -579})
(‘discipline’, {‘score’: -569})
(‘disappointment’, {‘score’: -569})
(‘dissipates’, {‘score’: -569})
(‘dissipate’, {‘score’: -569})
(‘disciplines’, {‘score’: -569})
(‘disciplined’, {‘score’: -569})
(‘disappointing’, {‘score’: -569})
(‘disapproves’, {‘score’: -569})
(‘disappointed’, {‘score’: -569})
(‘datapower’, {‘score’: -565})
(‘disciplining’, {‘score’: -559})
(‘dissipative’, {‘score’: -559})
(‘disciple’, {‘score’: -559})
(‘disciples’, {‘score’: -559})
(‘deceptions’, {‘score’: -559})
(‘dissipated’, {‘score’: -559})
(‘deceptive’, {‘score’: -559})
(‘disapproving’, {‘score’: -559})
(‘dissipation’, {‘score’: -559})
(‘deception’, {‘score’: -559})
(‘dissipating’, {‘score’: -559})
(‘disapproval’, {‘score’: -559})
(‘disappointments’, {‘score’: -559})
(‘disallowed’, {‘score’: -558})
(‘disaffect’, {‘score’: -558})
(‘disobey’, {‘score’: -558})
(‘disavow’, {‘score’: -558})
(‘disavowed’, {‘score’: -558})
(‘disallow’, {‘score’: -558})
(‘displaywrite’, {‘score’: -557})
(‘domineer’, {‘score’: -554})
(‘dubilier’, {‘score’: -554})
(‘misapplies’, {‘score’: -554})
(‘misapplied’, {‘score’: -554})
(‘unsupported’, {‘score’: -553})
(‘persevere’, {‘score’: -552})
(‘perseveres’, {‘score’: -552})
(‘persevered’, {‘score’: -552})
(‘disappointingly’, {‘score’: -549})
(‘disciplinary’, {‘score’: -549})
(‘deskpro’, {‘score’: -549})
(‘deceptively’, {‘score’: -549})
(‘dissonant’, {‘score’: -548})
(‘dissidence’, {‘score’: -548})
(‘disregards’, {‘score’: -548})
(‘.decimal’, {‘score’: -548})
(‘description’, {‘score’: -548})
(‘disinclined’, {‘score’: -548})
(‘dispossess’, {‘score’: -548})
(‘disallowance’, {‘score’: -548})
(‘desiccates’, {‘score’: -548})
(‘disentangle’, {‘score’: -548})
(‘decimals’, {‘score’: -548})
(‘disbarment’, {‘score’: -548})
(‘disrepair’, {‘score’: -548})
(‘disrepute’, {‘score’: -548})
(‘decibels’, {‘score’: -548})
(‘discontents’, {‘score’: -548})
(‘decency’, {‘score’: -548})
(‘dissonance’, {‘score’: -548})
(‘duesler’, {‘score’: -548})
(‘decently’, {‘score’: -548})
(‘disagreed’, {‘score’: -548})
(‘disagrees’, {‘score’: -548})
(‘disassemble’, {‘score’: -548})
(‘disabuse’, {‘score’: -548})
(‘decimal’, {‘score’: -548})
(‘disaffection’, {‘score’: -548})
(‘disregard’, {‘score’: -548})
(‘dissident’, {‘score’: -548})
(‘dusseldorf’, {‘score’: -548})
(‘dispersant’, {‘score’: -548})
(‘disrespects’, {‘score’: -548})
(‘disconcert’, {‘score’: -548})
(‘disincline’, {‘score’: -548})
(‘disagree’, {‘score’: -548})
(‘disability’, {‘score’: -548})
(‘duesseldorf’, {‘score’: -548})
(‘disbelief’, {‘score’: -548})
(‘disaffected’, {‘score’: -548})
(‘disagreements’, {‘score’: -548})
(‘descriptive’, {‘score’: -548})
(‘descriptions’, {‘score’: -548})
(‘daseke’, {‘score’: -548})
(“dissidents’”, {‘score’: -548})
(‘decimate’, {‘score’: -548})
(‘dispersants’, {‘score’: -548})
(‘dissidents’, {‘score’: -548})
(‘disarray’, {‘score’: -548})
(‘disrespect’, {‘score’: -548})
PHRASES: TRIED AND TRUE
1011768
1011768
Enter a phrase: tried and true
tried and true
T, R, AY1, D
AH0, N, D
T, R, UW1
SAME WORD PHRASES
tried and true-12
tried and tried-11
cried and cried-9
tried and found-9
tried and tested-9
loved and trusted-8
priests and priestesses-8
trees and brush-8
trials and tribulations-8
tried and convicted-8
tried and failed-8
tried and sentenced-8
argentina and uruguay-7
around and tried-7
arrested and taken-7
breath and tried-7
briefly and then-7
buses and trucks-7
children and three-7
crime and drug-7
RHYMING PHRASES
absolutely not true-297
also be true-297
be especially true-297
but the true-297
could be true-297
is all true-297
is especially true-297
is probably true-297
is still true-297
it were true-297
n’t a true-297
n’t it true-297
same holds true-297
same is true-297
see the true-297
stories are true-297
that the true-297
that was true-297
thats not true-297
was certainly true-297
was it true-297
a big tree-199
a bill through-199
a child through-199
a finger through-199
a huge country-199
a huge tree-199
a look through-199
a major industry-199
a palm tree-199
a stake through-199
a tiny country-199
achieved only through-199
after the country-199
ago to try-199
agriculture and industry-199
an apple tree-199
an arab country-199
and continuing through-199
and gas industry-199
and halfway through-199
and he grew-199
and his crew-199
and i grew-199
and it grew-199
and loan industry-199
and looking through-199
and move through-199
and moved through-199
and not through-199
and private industry-199
and that through-199
and the poetry-199
and the tree-199
and then through-199
and then try-199
and will try-199
any other country-199
are heated through-199
are walking through-199
as it grew-199
as the country-199
as they grew-199
as they try-199
as we try-199
at the country-199
attempt to portray-199
baking and pastry-199
be spread through-199
became the country-199
because i grew-199
because the industry-199
become the country-199
before you try-199
believe the country-199
beneath the tree-199
best to try-199
big oak tree-199
but i grew-199
but the extra-199
by an industry-199
by going through-199
by the country-199
came in through-199
can move through-199
can pass through-199
can to try-199
carried out through-199
crisscrossing the country-199
decides to try-199
defend the country-199
democrats to try-199
determined to try-199
did n’t screw-199
did not destroy-199
dividing the country-199
do it through-199
do you try-199
down a tree-199
down the country-199
CODE
"""Word Test
"""
import re

from nltk.corpus import cmudict


def get_input():
    try:
        word = input("Enter a word: ").strip().lower()
        print(word)
        associate(word)
    except KeyboardInterrupt:
        print('\nInput Error')


def count_syllables(phs):
    # Only vowel phonemes carry a stress digit, so counting digits counts syllables
    return len([x for x in ''.join(phs) if '0' <= x <= '9'])


def associate(input_word):
    wds_to_consider = {}
    phs = cmudict.dict()
    if input_word not in phs:
        print("'%s' is not in the CMU dictionary" % input_word)
        return
    input_phs = phs[input_word][0]
    input_syll = count_syllables(input_phs)

    # Create arrays for input with just phonemes and just stresses
    input_phs_only = [re.sub(r"\d", "", ph) for ph in input_phs]
    input_stresses = [re.sub(r"[A-Za-z]", "", ph) for ph in input_phs]

    # Create reversed array for input_phs_only for rhyme matching
    input_phs_only_reverse = input_phs_only[::-1]

    print(", ".join(input_phs))

    # Look through all the words in the phoneme list
    for word, pronunciations in phs.items():
        score = 0
        word_phs = pronunciations[0]
        word_syll = count_syllables(word_phs)

        # Create array with just phonemes and just stresses
        word_phs_only = [re.sub(r"\d", "", ph) for ph in word_phs]
        word_stresses = [re.sub(r"[A-Za-z]", "", ph) for ph in word_phs]

        # Penalty for difference in syllable count (lower scores are better)
        score += abs(word_syll - input_syll) * 10

        # Points for having parallel phonemes *and* parallel stresses
        for i, w in zip(input_phs, word_phs):
            if i == w:
                score -= 10

        # Points for having parallel phonemes (stress digits stripped on both sides)
        for i, w in zip(input_phs_only, word_phs_only):
            if i == w:
                score -= 5

        # Alliteration: matches closer to the start of the word are worth more
        j = 10
        for i, w in zip(input_phs_only, word_phs_only):
            if i == w:
                score -= j
            j -= 1

        # Rhyming: matches closer to the end of the word are worth more
        j = 100
        word_phs_only_reverse = word_phs_only[::-1]
        for i, w in zip(input_phs_only_reverse, word_phs_only_reverse):
            if i == w:
                score -= j
            j -= 10

        # Points for having parallel stresses
        # (consonants have no stress mark, so empty entries are skipped)
        for i, w in zip(input_stresses, word_stresses):
            if i and i == w:
                score -= 1

        wds_to_consider[word] = {"score": score}

    # Sort by score, lowest (best) first, and keep the top 100
    sorted_wds = sorted(wds_to_consider.items(), key=lambda item: item[1]["score"])
    sorted_wds = sorted_wds[:100]
    for word in sorted_wds:
        print(word)


if __name__ == "__main__":
    get_input()
"""Phrase Test
"""
import re

import nltk
from nltk.corpus import cmudict

ngrams = []
phrases_to_play = {}
phs = cmudict.dict()
ngrams_phs = []
ngrams_phs_only = []
ngrams_phs_only_reverse = []
ngrams_stresses = []


def parse_ngrams(fileloc):
    # Each line is presumably a count followed by the three words of the ngram
    with open(fileloc, 'r') as ngramfile:
        for i, line in enumerate(ngramfile):
            # Only keep every tenth line
            if i % 10 == 0:
                line = nltk.word_tokenize(line)
                if len(line) >= 4:
                    ngrams.append([line[1].lower(), line[2].lower(), line[3].lower()])

    # Get pronunciations for each ngram
    for ngram in ngrams:
        ngram_phs = []
        ngram_phs_only = []
        ngram_phs_only_reverse = []
        ngram_stresses = []
        for gram in ngram:
            # Words missing from cmudict get empty lists so word positions stay aligned
            gram_phs = phs.get(gram, [[]])[0]
            gram_phs_only = [re.sub(r"\d", "", ph) for ph in gram_phs]
            gram_stresses = [re.sub(r"[A-Za-z]", "", ph) for ph in gram_phs]
            ngram_phs.append(gram_phs)
            ngram_phs_only.append(gram_phs_only)
            ngram_phs_only_reverse.append(gram_phs_only[::-1])
            ngram_stresses.append(gram_stresses)
        # Reverse the word order as well, so the last word is compared first
        ngram_phs_only_reverse.reverse()
        ngrams_phs.append(ngram_phs)
        ngrams_phs_only.append(ngram_phs_only)
        ngrams_phs_only_reverse.append(ngram_phs_only_reverse)
        ngrams_stresses.append(ngram_stresses)
    print(len(ngrams))
    print(len(ngrams_phs))


def get_input():
    parse_ngrams('../data/w3.txt')
    try:
        phrase = input("Enter a phrase: ")
        print(phrase)
        find_anchor(phrase)
    except KeyboardInterrupt:
        print('\nInput Error')


def find_anchor(phrase):
    phrase = [word.lower() for word in nltk.word_tokenize(phrase)]
    phrase_phs = []
    # Find phonemes for each word of phrase
    for word in phrase:
        if word not in phs:
            print("'%s' is not in the CMU dictionary" % word)
            return
        phrase_phs.append(phs[word][0])
        print(", ".join(phs[word][0]))

    # Create arrays for input with just phonemes and just stresses
    phrase_phs_only = []
    phrase_phs_only_reverse = []
    phrase_stresses = []
    for word in phrase_phs:
        word_phs_only = [re.sub(r"\d", "", ph) for ph in word]
        word_stresses = [re.sub(r"[A-Za-z]", "", ph) for ph in word]
        # Append the whole word's lists, not just its last phoneme
        phrase_stresses.append(word_stresses)
        phrase_phs_only.append(word_phs_only)
        phrase_phs_only_reverse.append(word_phs_only[::-1])

    # Reverse the word order of phrase_phs_only_reverse for rhyme matching
    phrase_phs_only_reverse.reverse()

    #find_sameword_ngrams(phrase)
    find_rhyming_ngrams(phrase_phs_only, phrase_phs_only_reverse, phrase_stresses)
    #find_samelength_ngrams(phrase_phs)


def find_sameword_ngrams(match_phrase):
    scores = []
    sameword_ngrams = []
    for i, ngram in enumerate(ngrams):
        score = 0
        for word, match_word in zip(ngram, match_phrase):
            # One point for every letter that matches in the same position
            for letter, match_letter in zip(word, match_word):
                if letter == match_letter:
                    score -= 1
        scores.append([score, i])
    print("\nSAME WORD PHRASES")
    score_it(scores)
    return sameword_ngrams


def find_rhyming_ngrams(phrase_phs_only, phrase_phs_only_reverse, phrase_stresses):
    scores = []
    for i, ngram_reverse in enumerate(ngrams_phs_only_reverse):  # For each ngram in the list of ngrams
        score = 0
        cap = 100
        # Compare the words in the phrase against words in ngram, last word first
        for p_word, n_gram in zip(phrase_phs_only_reverse, ngram_reverse):
            for p_ph, n_ph in zip(p_word, n_gram):  # Compare each phoneme in each word
                if p_ph == n_ph:
                    score -= cap
                    cap -= 1  # Each further match is worth slightly less
        scores.append([score, i])  # Track scores with the ngram's index
    print("\nRHYMING PHRASES")
    score_it(scores)


def count_syllables(phs):
    # Only vowel phonemes carry a stress digit, so counting digits counts syllables
    return len([x for x in ''.join(phs) if '0' <= x <= '9'])


def find_samelength_ngrams(phrase_phs):
    scores = []
    phrase_syll = 0
    for word_ph in phrase_phs:
        phrase_syll += count_syllables(word_ph)
    print(phrase_syll)

    # Compare length of phrase to ngrams
    for i, ngram_phs in enumerate(ngrams_phs):
        ngram_syll = 0
        for gram_ph in ngram_phs:
            ngram_syll += count_syllables(gram_ph)
        scores.append([abs(ngram_syll - phrase_syll), i])
    print("\nSAME LENGTH PHRASES")
    score_it(scores)


def score_it(scores):
    # Re-sort list of scores by score (lowest is best)
    scores.sort()
    # Take top 100 results
    scores = scores[:100]
    for score in scores:
        print(" ".join(ngrams[score[1]]) + str(score[0]))


if __name__ == "__main__":
    get_input()