Exercise 1: Analyzing Sound

I’m starting to do phoneme analysis on words and phrases to create lists of words and phrases that “sound” similar.

I have a very crude “scoring system” to keep track of things like:

  • Overall similarity of phonemes (in order; I would like to add overall phoneme similarity regardless of order)
  • Overall similarity of stress patterns
  • Extra points for alliteration (phoneme matches at start of word/phrase)
  • Extra points for rhyming (phoneme matches at end of word/phrase)
  • Word overlap
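
The order-independent comparison mentioned in the first bullet could be sketched roughly as follows. This is only one possible approach and is not part of the scripts in this post: it treats each word as a bag of phonemes and counts how many the two words share. The helper names strip_stress and unordered_overlap are made up for illustration, and the phoneme lists are typed in by hand rather than looked up in the CMU dictionary.

```python
from collections import Counter
import re

def strip_stress(phonemes):
    """Drop the stress digit from each ARPAbet phoneme (IH2 -> IH)."""
    return [re.sub(r"\d", "", ph) for ph in phonemes]

def unordered_overlap(phs_a, phs_b):
    """Count phonemes shared by both words, ignoring position and stress."""
    # Counter & Counter keeps the minimum count of each shared phoneme
    shared = Counter(strip_stress(phs_a)) & Counter(strip_stress(phs_b))
    return sum(shared.values())

disappear = ["D", "IH2", "S", "AH0", "P", "IH1", "R"]
tried = ["T", "R", "AY1", "D"]

print(unordered_overlap(disappear, tried))      # D and R are shared: 2
print(unordered_overlap(disappear, disappear))  # every phoneme is shared: 7
```

Because the count ignores position, it would reward words that reuse the same sounds in a different order, which the in-order comparison misses.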

WORDS: DISAPPEAR

Enter a word: disappear
disappear
D, IH2, S, AH0, P, IH1, R
('disappeared', {'score': -608})
('disappears', {'score': -608})
('disappear', {'score': -608})
('disappearance', {'score': -598})
('disappearing', {'score': -598})
('disappearances', {'score': -588})
('disappoint', {'score': -579})
('disappoints', {'score': -579})
('disapproved', {'score': -579})
('disapprove', {'score': -579})
('discipline', {'score': -569})
('disappointment', {'score': -569})
('dissipates', {'score': -569})
('dissipate', {'score': -569})
('disciplines', {'score': -569})
('disciplined', {'score': -569})
('disappointing', {'score': -569})
('disapproves', {'score': -569})
('disappointed', {'score': -569})
('datapower', {'score': -565})
('disciplining', {'score': -559})
('dissipative', {'score': -559})
('disciple', {'score': -559})
('disciples', {'score': -559})
('deceptions', {'score': -559})
('dissipated', {'score': -559})
('deceptive', {'score': -559})
('disapproving', {'score': -559})
('dissipation', {'score': -559})
('deception', {'score': -559})
('dissipating', {'score': -559})
('disapproval', {'score': -559})
('disappointments', {'score': -559})
('disallowed', {'score': -558})
('disaffect', {'score': -558})
('disobey', {'score': -558})
('disavow', {'score': -558})
('disavowed', {'score': -558})
('disallow', {'score': -558})
('displaywrite', {'score': -557})
('domineer', {'score': -554})
('dubilier', {'score': -554})
('misapplies', {'score': -554})
('misapplied', {'score': -554})
('unsupported', {'score': -553})
('persevere', {'score': -552})
('perseveres', {'score': -552})
('persevered', {'score': -552})
('disappointingly', {'score': -549})
('disciplinary', {'score': -549})
('deskpro', {'score': -549})
('deceptively', {'score': -549})
('dissonant', {'score': -548})
('dissidence', {'score': -548})
('disregards', {'score': -548})
('.decimal', {'score': -548})
('description', {'score': -548})
('disinclined', {'score': -548})
('dispossess', {'score': -548})
('disallowance', {'score': -548})
('desiccates', {'score': -548})
('disentangle', {'score': -548})
('decimals', {'score': -548})
('disbarment', {'score': -548})
('disrepair', {'score': -548})
('disrepute', {'score': -548})
('decibels', {'score': -548})
('discontents', {'score': -548})
('decency', {'score': -548})
('dissonance', {'score': -548})
('duesler', {'score': -548})
('decently', {'score': -548})
('disagreed', {'score': -548})
('disagrees', {'score': -548})
('disassemble', {'score': -548})
('disabuse', {'score': -548})
('decimal', {'score': -548})
('disaffection', {'score': -548})
('disregard', {'score': -548})
('dissident', {'score': -548})
('dusseldorf', {'score': -548})
('dispersant', {'score': -548})
('disrespects', {'score': -548})
('disconcert', {'score': -548})
('disincline', {'score': -548})
('disagree', {'score': -548})
('disability', {'score': -548})
('duesseldorf', {'score': -548})
('disbelief', {'score': -548})
('disaffected', {'score': -548})
('disagreements', {'score': -548})
('descriptive', {'score': -548})
('descriptions', {'score': -548})
('daseke', {'score': -548})
("dissidents'", {'score': -548})
('decimate', {'score': -548})
('dispersants', {'score': -548})
('dissidents', {'score': -548})
('disarray', {'score': -548})
('disrespect', {'score': -548})

PHRASES: TRIED AND TRUE

1011768
1011768
Enter a phrase: tried and true
tried and true
T, R, AY1, D
AH0, N, D
T, R, UW1

SAME WORD PHRASES
tried and true-12
tried and tried-11
cried and cried-9
tried and found-9
tried and tested-9
loved and trusted-8
priests and priestesses-8
trees and brush-8
trials and tribulations-8
tried and convicted-8
tried and failed-8
tried and sentenced-8
argentina and uruguay-7
around and tried-7
arrested and taken-7
breath and tried-7
briefly and then-7
buses and trucks-7
children and three-7
crime and drug-7
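
The scores in this list are consistent with a letter-by-letter comparison of each word against the word in the same position, with one point subtracted per matching letter (so lower is better). A minimal sketch of that idea follows. It assumes simple whitespace splitting rather than NLTK's tokenizer, and the function name letter_overlap is hypothetical.

```python
def letter_overlap(phrase, candidate):
    """Score two phrases by position-by-position letter matches, word by word.

    Each matching letter subtracts one point, so lower scores mean more overlap.
    """
    score = 0
    for word, cand_word in zip(phrase.split(), candidate.split()):
        for letter, cand_letter in zip(word, cand_word):
            if letter == cand_letter:
                score -= 1
    return score

for candidate in ["tried and true", "tried and tried", "cried and cried"]:
    print("%s %d" % (candidate, letter_overlap("tried and true", candidate)))
# tried and true -12
# tried and tried -11
# cried and cried -9
```

Note that zip stops at the shorter word, so "tried" against "true" can only match in its first four letters.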

RHYMING PHRASES
absolutely not true-297
also be true-297
be especially true-297
but the true-297
could be true-297
is all true-297
is especially true-297
is probably true-297
is still true-297
it were true-297
n’t a true-297
n’t it true-297
same holds true-297
same is true-297
see the true-297
stories are true-297
that the true-297
that was true-297
thats not true-297
was certainly true-297
was it true-297
a big tree-199
a bill through-199
a child through-199
a finger through-199
a huge country-199
a huge tree-199
a look through-199
a major industry-199
a palm tree-199
a stake through-199
a tiny country-199
achieved only through-199
after the country-199
ago to try-199
agriculture and industry-199
an apple tree-199
an arab country-199
and continuing through-199
and gas industry-199
and halfway through-199
and he grew-199
and his crew-199
and i grew-199
and it grew-199
and loan industry-199
and looking through-199
and move through-199
and moved through-199
and not through-199
and private industry-199
and that through-199
and the poetry-199
and the tree-199
and then through-199
and then try-199
and will try-199
any other country-199
are heated through-199
are walking through-199
as it grew-199
as the country-199
as they grew-199
as they try-199
as we try-199
at the country-199
attempt to portray-199
baking and pastry-199
be spread through-199
became the country-199
because i grew-199
because the industry-199
become the country-199
before you try-199
believe the country-199
beneath the tree-199
best to try-199
big oak tree-199
but i grew-199
but the extra-199
by an industry-199
by going through-199
by the country-199
came in through-199
can move through-199
can pass through-199
can to try-199
carried out through-199
crisscrossing the country-199
decides to try-199
defend the country-199
democrats to try-199
determined to try-199
did n’t screw-199
did not destroy-199
dividing the country-199
do it through-199
do you try-199
down a tree-199
down the country-199
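
The two score levels in this list, -297 and -199, are consistent with comparing only the final word of each phrase against "true", phoneme by phoneme from the end, with a bonus that starts at 100 and drops by one after each match. The sketch below works through that arithmetic. The function name rhyme_score is hypothetical, and the pronunciation of "tree" is typed in by hand.

```python
import re

def rhyme_score(phrase_last, candidate_last):
    """Compare two words' phonemes from the end, ignoring stress.

    Each match subtracts the current cap, which starts at 100 and drops
    by one after every match, so lower scores mean a closer rhyme.
    """
    a = [re.sub(r"\d", "", ph) for ph in reversed(phrase_last)]
    b = [re.sub(r"\d", "", ph) for ph in reversed(candidate_last)]
    score, cap = 0, 100
    for ph_a, ph_b in zip(a, b):
        if ph_a == ph_b:
            score -= cap
            cap -= 1
    return score

true_phs = ["T", "R", "UW1"]
tree_phs = ["T", "R", "IY1"]

print(rhyme_score(true_phs, true_phs))  # 100 + 99 + 98 -> -297
print(rhyme_score(true_phs, tree_phs))  # vowel differs, R and T match -> -199
```

This also shows why "tree", "through" and "country" all tie at -199: the score counts matching phonemes near the end of the word but does not require the final vowel to match.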

CODE

"""Word Test
"""
import nltk
from nltk.corpus import cmudict

import pickle
import re
import operator

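def input():
    #commonwords = pickle.load( open( "commonwords.p", "wb"))

    try:
        # CMU dictionary keys are lowercase
        word = raw_input("Enter a word: ").strip().lower()
        print word
        #print worddict[word]
        associate(word)
    except KeyError:
        print '\nWord not found in the CMU dictionary'
    except KeyboardInterrupt:
        print '\nInput Error'
        return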

def count_syllables(phs):
    # Vowel phonemes end in a stress digit (0, 1 or 2), so counting them counts syllables
    syllable_count = len([ph for ph in phs if ph[-1].isdigit()])
    return syllable_count

def associate(input):
    wds_to_consider = {}
    phs = nltk.corpus.cmudict.dict()
    input_phs = phs[input][0]
    input_syll = count_syllables(input_phs)

    #Create arrays for input with just phonemes and just stresses
    input_phs_only = []
    input_stresses = []

    for ph in input_phs:
        phs_only = re.sub("\d", "", ph)
        input_phs_only.append(phs_only)

        ph_stresses = re.sub("[A-Za-z]", "", ph)
        input_stresses.append(ph_stresses)

    # Create reversed array for input_phs_only for rhyme matching
    input_phs_only_reverse = input_phs_only[0:]
    input_phs_only_reverse.reverse()

    # Print the phonemes of the input word
    print ", ".join(input_phs)

    #print worddict[input] 

    #Look through all the words in the phoneme list
    for word in phs:
        score = 0

        word_phs = phs[word][0]
        word_syll = count_syllables(phs[word][0])

        #Create array with just phonemes and just stresses
        word_phs_only = []
        word_stresses = []

        for ph in word_phs:
            phs_only = re.sub("\d", "", ph)
            word_phs_only.append(phs_only)

            ph_stresses = re.sub("[A-Za-z]", "", ph)
            word_stresses.append(ph_stresses)

        # Penalize differences in syllable count (lower scores are better)
        score += abs(word_syll - input_syll) * 10

        # Points for having parallel phonemes *and* parallel stresses
        for i, w in zip(input_phs, word_phs):
            if i == w: score -= 10

        # Points for having parallel phonemes (stress stripped on both sides)
        for i, w in zip(input_phs_only, word_phs_only):
            if i == w: score -= 5

        # Alliteration: matches closer to the start of the word are worth more
        j = 10
        for i, w in zip(input_phs_only, word_phs_only):
            if i == w: score -= j
            j -= 1

        # Rhyming: compare from the end; matches closer to the end are worth more
        j = 100
        word_phs_only_reverse = word_phs_only[0:]
        word_phs_only_reverse.reverse()
        for i, w in zip(input_phs_only_reverse, word_phs_only_reverse):
            if i == w: score -= j
            j -= 10

        # Points for having parallel stresses (vowels only; consonants carry no stress digit)
        for i, w in zip(input_stresses, word_stresses):
            if i != "" and i == w: score -= 1

        wds_to_consider[word] = {"score": score}       

    # Sort by score, lowest (best) first
    sorted_wds = sorted(wds_to_consider.items(), key=lambda item: item[1]["score"])

    sorted_wds = sorted_wds[:100]

    for word in sorted_wds:
        print word

if __name__ == "__main__":
    input()

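
Both scripts repeat the same loop to separate phonemes from their stress digits. As a sketch only (the helper name split_stress is hypothetical and is not used in the scripts in this post), that step could be factored out like this:

```python
import re

def split_stress(phonemes):
    """Split ARPAbet phonemes into (phonemes without stress, stress digits)."""
    phs_only = [re.sub(r"\d", "", ph) for ph in phonemes]
    stresses = [re.sub(r"[A-Za-z]", "", ph) for ph in phonemes]
    return phs_only, stresses

# Phonemes for "disappear", copied from the output above
phs_only, stresses = split_stress(["D", "IH2", "S", "AH0", "P", "IH1", "R"])
print(phs_only)           # ['D', 'IH', 'S', 'AH', 'P', 'IH', 'R']
print(stresses)           # ['', '2', '', '0', '', '1', '']
print("".join(stresses))  # 201
```

Joining the stress digits gives a compact stress pattern ("201" here) that can be compared between words directly.
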
"""Phrase Test
"""
import re
import nltk
from nltk.corpus import cmudict

#for sorting dictionaries by value
import operator

ngrams = []
phrases_to_play = {}
phs = nltk.corpus.cmudict.dict()
ngrams_phs = []
ngrams_phs_only = []
ngrams_phs_only_reverse = []
ngrams_stresses = []

def parse_ngrams(fileloc):
    # Keep every 10th line; token 0 of each line is skipped and tokens 1-3 are the three words
    with open(fileloc, 'r') as ngramfile:
        for i, line in enumerate(ngramfile):
            if i%10 == 0:
                line = nltk.word_tokenize(line)
                ngrams.append([line[1].lower(), line[2].lower(), line[3].lower()])

    # Find rhyming ngrams
    # Get pronunciations for each ngram
    for ngram in ngrams:
        ngram_phs = []
        ngram_phs_only = []
        ngram_phs_only_reverse = []
        ngram_stresses = []
        for gram in ngram:
            try:    # Append phonemes for that word
                gram_phs = phs[gram][0]
                gram_phs_only = []
                gram_stresses = []
                for ph in gram_phs:
                    ph_only = re.sub("\d","", ph)
                    stress = re.sub("[A-Za-z]", "", ph)
                    gram_phs_only.append(ph_only)
                    gram_stresses.append(stress)
                gram_phs_only_reverse = gram_phs_only[0:]
                gram_phs_only_reverse.reverse()

                ngram_phs.append(gram_phs)
                ngram_phs_only.append(gram_phs_only)
                ngram_phs_only_reverse.append(gram_phs_only_reverse)
                ngram_stresses.append(gram_stresses)
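            except KeyError:    # Word not in the CMU dictionary: append empty lists to keep all four lists aligned
                ngram_phs.append([])
                ngram_phs_only.append([])
                ngram_phs_only_reverse.append([])
                ngram_stresses.append([])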
        ngram_phs_only_reverse.reverse()
        ngrams_phs.append(ngram_phs)
        ngrams_phs_only.append(ngram_phs_only)
        ngrams_phs_only_reverse.append(ngram_phs_only_reverse)
        ngrams_stresses.append(ngram_stresses)       

    print len(ngrams)
    print len(ngrams_phs)

def get_input():
    parse_ngrams('../data/w3.txt')

    try:
        phrase = raw_input("Enter a phrase: ")
        print phrase
        find_anchor(phrase)
    except KeyboardInterrupt:
        print '\nInput Error'
        return

def find_anchor(phrase):
    phrase = nltk.word_tokenize(phrase.lower())    # CMU dictionary keys are lowercase
    phrase_phs = []

    # Find phonemes for each word of phrase
    for word in phrase:
        phrase_phs.append(phs[word][0])
        print ", ".join(phs[word][0])

    #Create arrays for input with just phonemes and just stresses
    phrase_phs_only = []
    phrase_phs_only_reverse = []
    phrase_stresses = []

    for word in phrase_phs:
        word_phs_only = []
        word_stresses = []
        for ph in word:
            phs_only = re.sub("\d", "", ph)
            word_phs_only.append(phs_only)
            ph_stresses = re.sub("[A-Za-z]", "", ph)
            word_stresses.append(ph_stresses)
        word_phs_only_reverse = word_phs_only[0:]
        word_phs_only_reverse.reverse()

        # Collect the per-word lists so every word of the phrase is compared, not just the last
        phrase_stresses.append(word_stresses)
        phrase_phs_only.append(word_phs_only)
        phrase_phs_only_reverse.append(word_phs_only_reverse)

    # Create reversed array for phrase_phs_only for rhyme matching
    phrase_phs_only_reverse.reverse()    

    #find_sameword_ngrams(phrase)
    find_rhyming_ngrams(phrase_phs_only, phrase_phs_only_reverse, phrase_stresses)
    #find_samelength_ngrams(phrase_phs)

def find_sameword_ngrams(match_phrase):
    scores = []
    sameword_ngrams = []

    i=0
    for ngram in ngrams:
        score = 0
        for word, match_word in zip(ngram, match_phrase):
            for letter, match_letter in zip(word, match_word):
                if letter == match_letter:
                    score -=1
        scores.append([score, i])
        i+=1

    print "\nSAME WORD PHRASES"
    score_it(scores)
    return sameword_ngrams

def find_rhyming_ngrams(phrase_phs_only, phrase_phs_only_reverse, phrase_stresses):
    scores = []
    i = 0

    for ngram_phs_only in ngrams_phs_only_reverse:    # For each ngram in the list of ngrams
        score = 0
        cap = 100
        # Compare the words in the phrase against the words in the ngram, last word first
        for p_word, n_gram in zip(phrase_phs_only_reverse, ngram_phs_only):
            for p_ph, n_ph in zip(p_word, n_gram):    # Compare each phoneme in each word, last phoneme first
                if p_ph == n_ph:
                    score -= cap
                    cap -= 1
        scores.append([score, i])    # Track each score with the ngram's index
        i+=1

    print "\nRHYMING PHRASES"
    score_it(scores)   

def count_syllables(phs):
    # Vowel phonemes end in a stress digit (0, 1 or 2), so counting them counts syllables
    syllable_count = len([ph for ph in phs if ph[-1].isdigit()])
    return syllable_count

def find_samelength_ngrams(phrase_phs):
    scores = []
    phrase_syll = 0

    for word_ph in phrase_phs:
        phrase_syll += count_syllables(word_ph) 

    print phrase_syll 

    # Compare length of phrase to ngrams
    i=0
    for ngram_phs in ngrams_phs:
        ngram_syll = 0
        for gram_ph in ngram_phs:
            ngram_syll += count_syllables(gram_ph)
        scores.append([abs(ngram_syll - phrase_syll), i])
        i+=1

    print "\nSAME LENGTH PHRASES"
    score_it(scores)

def score_it(scores):

    # Re-sort list of scores by score
    scores.sort()
    #print scores

    # Take top 100 results
    scores = scores[:100]
    for score in scores:
        print " ".join(ngrams[score[1]]) + str(score[0])

if __name__ == "__main__":get_input()
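
The phrase script substitutes empty lists when a word is missing from the dictionary, which also affects the syllable totals used for the same-length comparison. The sketch below shows that behavior in isolation. It uses a small hand-typed pronunciation table in place of the CMU dictionary, and the name phrase_syllables is hypothetical.

```python
def count_syllables(phonemes):
    """Count syllables as the number of phonemes that end in a stress digit."""
    return sum(1 for ph in phonemes if ph[-1].isdigit())

def phrase_syllables(words, pronunciations):
    """Total syllables for a phrase; words without a pronunciation count as 0."""
    return sum(count_syllables(pronunciations.get(w, [])) for w in words)

# Hand-typed stand-in for the CMU dictionary (phonemes copied from the output above)
pronunciations = {
    "tried": ["T", "R", "AY1", "D"],
    "and": ["AH0", "N", "D"],
    "true": ["T", "R", "UW1"],
}

print(phrase_syllables(["tried", "and", "true"], pronunciations))   # 3
print(phrase_syllables(["tried", "and", "zzyzx"], pronunciations))  # 2
```

A phrase containing an unknown word therefore looks shorter than it is, so it may be worth skipping such phrases rather than scoring them.
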