新しい言葉 (New Words)
A method for generating phonetically unique words from a corpus of existing Japanese words, resulting in new words that are statistically similar to Japanese. (project proposal)

This project was also presented as a final project for Learning Bit by Bit.
This project was inspired by an ongoing love of the Japanese language: its unique phonetic and rhythmic qualities, and its capacity for adaptive, inter-lingual word play.
The Japanese language has a multitude of distinctive phonetic forms. It has an abundance of onomatopoeia, such as ザーザー (za-za-), which signifies heavy rainfall. It also readily adopts non-Japanese words, especially within youth-centric slang: for example the soon-to-fall-out-of-fashion ナウ (nau), or “now”, used in place of the conventional いま (ima) to indicate the present, as in 宿題をするナウ。 “Doing homework now.”
While I have from time to time come up with my own variations on Japanese words, some of which I even use frequently with close friends, it seemed natural to apply natural language processing to the generation of ‘unique’ Japanese words: words that are statistically similar to Japanese in phonetic terms but not present in a Japanese dictionary.
Initial Attempts
I began with some naive experiments in which I simply made a list of base phonemes derived from the standard Hepburn Romanization:
“a”, “i”, “u”, “e”, “o”, “ya”, “yu”, “yo”, “ka”, “ki”, “ku”, “ke”, “ko”, “kya”, “kyu”, “kyo”, “sa”, “shi”, “su”, “se”, “so”, “sha”, “shu”, “sho”, “ta”, “chi”, “tsu”, “te”, “to”, “cha”, “chu”, “cho”, “na”, “ni”, “nu”, “ne”, “no”, “nya”, “nyu”, “nyo”, “ha”, “hi”, “fu”, “he”, “ho”, “hya”, “hyu”, “hyo”, “ma”, “mi”, “mu”, “me”, “mo”, “mya”, “myu”, “myo”, “ya”, “yu”, “yo”, “ra”, “ri”, “ru”, “re”, “ro”, “rya”, “ryu”, “ryo”, “wa”, “wo”, “n”, “ga”, “gi”, “gu”, “ge”, “go”, “gya”, “gyu”, “gyo”, “za”, “ji”, “zu”, “ze”, “zo”, “ja”, “ju”, “jo”, “da”, “ji”, “zu”, “de”, “do”, “dya”, “dyu”, “dyo”, “ba”, “bi”, “bu”, “be”, “bo”, “bya”, “byu”, “byo”, “pa”, “pi”, “pu”, “pe”, “po”, “pya”, “pyu”, “pyo”
From here I would choose a random word length within a set range and randomly string together that many phonemes into a ‘word’. The first obvious problems arose when words were generated starting with phonemes like “pyo” or “pyu”, which gave them a distinctly more Korean than Japanese quality, or starting with the lone consonant “n”, which do not exist in Japanese. I attempted to remedy this by creating a secondary list containing only start phonemes. This and subsequent versions use a Romaji module written by Ed Halley to convert to and from Unicode kana characters.
A sample of words generated with this method:
(code)
jajidaji, じぁじだじ
jijafubi, じじぁふび
jugesunyu, じぅげすんゆ
zochiko, ぞちこ
rerinu, れりぬ
shukyu, しゅきゅ
Though there were occasionally a few interesting results, they were rare. At this point the words were not being checked against a dictionary, so it was still possible to generate an existing word without knowing it. It quickly became apparent that this, or any rule-based approach, was probably not the direction to proceed in, and that a statistical model made more sense.
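As a rough illustration of the naive approach described above, here is a minimal Python sketch. It is not the original code: the phoneme list is shortened for brevity, the names `START_PHONEMES` and `random_phonemes` are made up for this example, and reading the "random range" as a random word length is my assumption.

```python
import random

# A shortened subset of the Hepburn phoneme list, for brevity.
PHONEMES = ["a", "i", "u", "e", "o", "ka", "ki", "ku", "ke", "ko",
            "sa", "shi", "su", "ta", "chi", "tsu", "na", "ni", "n",
            "ra", "ri", "ru", "pyo", "pyu"]

# Hypothetical secondary list of phonemes allowed to start a word:
# the lone "n" and the "py-" phonemes are left out.
START_PHONEMES = [p for p in PHONEMES if p not in ("n", "pyo", "pyu")]


def random_phonemes(rng, min_len=2, max_len=5):
    """Pick a random length, then a start phoneme plus random followers."""
    length = rng.randint(min_len, max_len)
    parts = [rng.choice(START_PHONEMES)]
    parts += [rng.choice(PHONEMES) for _ in range(length - 1)]
    return parts


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        print("".join(random_phonemes(rng)))
```

Every phoneme after the first is drawn uniformly, which is why the output rarely resembles real Japanese: nothing in the procedure knows which phonemes tend to follow one another.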
Statistical Models and EDICT
Turning toward a bigram method, I needed a good corpus to train the model on. I found cjklib, an Asian language library for Python that makes use of Jim Breen’s EDICT project at Monash University. The EDICT dictionary resource file contains around 150,000 headwords. The headwords are all in Unicode kana, so before I could start building a statistical model they had to be converted into parseable ASCII, which also made things easier to debug. Finding discrepancies with my base Hepburn romanization list, I had to build a custom phoneme list that could account for the doubled (hard) consonants indicated by the small “tsu” (“っ”), as in あれっくす, romanized arekkusu with a hard ‘k’. This meant that variant phonemes like “kku” had to be added to the list, along with some unusual phonemes like “vu” and “du” that were most likely being used for loan words. This nearly doubled the number of ‘phonemes’. I knew that treating the hard consonants as separate phonemes would shift the eventual statistics in a certain direction, but it seemed like the best approach. An alternative would have been to choose a unique ASCII character for “っ” and then modify the Romaji module so that, on later romanization for reading, it is converted to whatever consonant follows it.
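To make the phoneme-list idea concrete, here is a small Python sketch of splitting a romanized headword into phonemes, with doubled-consonant variants such as “kku” treated as phonemes in their own right. The function names, the shortened base list, and the greedy longest-match strategy are assumptions made for illustration, not the project's actual code.

```python
# A shortened base list; the real list would cover all of the kana.
BASE = ["a", "i", "u", "e", "o",
        "ka", "ki", "ku", "ke", "ko", "kya", "kyu", "kyo",
        "sa", "shi", "su", "se", "so",
        "ra", "ri", "ru", "re", "ro",
        "n", "vu", "du"]


def with_doubled_consonants(base):
    """Add a variant such as "kku" for every consonant-initial phoneme.

    Simplification: vowels and n-initial phonemes get no doubled variant.
    """
    variants = set(base)
    for p in base:
        if p[0] not in "aiueon":
            variants.add(p[0] + p)
    return variants


def tokenize(romaji, phonemes):
    """Split a romanized word into phonemes by greedy longest match."""
    max_len = max(len(p) for p in phonemes)
    tokens, i = [], 0
    while i < len(romaji):
        for size in range(min(max_len, len(romaji) - i), 0, -1):
            chunk = romaji[i:i + size]
            if chunk in phonemes:
                tokens.append(chunk)
                i += size
                break
        else:
            raise ValueError("no phoneme matches %r at position %d" % (romaji, i))
    return tokens


if __name__ == "__main__":
    phonemes = with_doubled_consonants(BASE)
    print(tokenize("arekkusu", phonemes))
```

Because each doubled form is generated from its plain counterpart, this roughly doubles the size of the list, in line with the near-doubling mentioned above.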
The eventual generator made use of a few core functions:

A sample of words generated from a single (EDICT) source corpus:
kikiiza, ききいざ
keou, けおう
sokutsusou, そくつそう
mouchuu, もうちゅう
naiawakaioku, ないあわかいおく
tensasantsuuegarigasa, てんささんつうえがりがさ
kagaicho, かがいちょ
kouraushucho, こうらうしゅちょ
iketsu, いけつ
Note: EDICT has a number of phrasal entries, as well as lengthy technical terms consisting of conjoined subwords, which may account for some of the unusually long words being generated. To check this, I searched for the longest headword in EDICT, which is apparently 35 phonemes long.
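For readers unfamiliar with bigram models, the following Python sketch shows the general technique: count which phoneme follows which in the corpus, then generate a word by walking those counts. It is a generic illustration under my own assumptions (the boundary markers, function names, and the tiny stand-in corpus are all invented here), not a reproduction of the generator's core functions.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical word-boundary markers


def train_bigrams(tokenized_words):
    """Count how often each phoneme follows another (or starts/ends a word)."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in tokenized_words:
        seq = [START] + list(tokens) + [END]
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, rng, max_len=12):
    """Walk the bigram table from START until END (or max_len phonemes)."""
    tokens, current = [], START
    while len(tokens) < max_len:
        followers = counts[current]
        current = rng.choices(list(followers),
                              weights=list(followers.values()))[0]
        if current == END:
            break
        tokens.append(current)
    return tokens


def generate_new(counts, known_words, rng, attempts=1000):
    """Return a generated word that is not in known_words, or None."""
    for _ in range(attempts):
        word = "".join(generate(counts, rng))
        if word and word not in known_words:
            return word
    return None


# Toy stand-in corpus, already split into phonemes; this is not EDICT data.
CORPUS = [["ke", "shi", "no", "ko"], ["ho", "shi", "ru"],
          ["o", "sa", "me", "ki"], ["ki", "mo", "ba", "ta"]]

if __name__ == "__main__":
    table = train_bigrams(CORPUS)
    known = {"".join(t) for t in CORPUS}
    print(generate_new(table, known, random.Random()))
```

The `max_len` cap is one simple guard against the very long outputs noted above; `generate_new` adds the dictionary check that the purely random version lacked.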
The generated words are immediately distinguishable from the purely random attempts. At this point I began to realize that the generator lacked a means of personal influence or nuance. I had recently attended Daniel Shiffman’s session on ‘Genetic Algorithms’, which I found really inspiring, and felt that a simplified form of genetic algorithm could be implemented with a human-defined ‘fitness’ function in the form of a qualitative word-list selection: a secondary list of hand-picked words. The secondary list can consist of actual Japanese words or of words that have been generated from the EDICT corpus. A weight coefficient is then applied to the secondary list, which is equivalent to adding its words to the general word population that many times over, increasing the probability of the secondary words.
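One way to picture the weighting is as a multiplier on the bigram counts contributed by the secondary list. The Python sketch below assumes that interpretation; the function names and the three-word toy corpus are hypothetical, and only “kanishinai” comes from the actual secondary list.

```python
from collections import Counter


def bigram_counts(tokenized_words, weight=1):
    """Count phoneme pairs, adding `weight` per occurrence instead of 1."""
    counts = Counter()
    for tokens in tokenized_words:
        seq = ["<s>"] + list(tokens) + ["</s>"]
        for pair in zip(seq, seq[1:]):
            counts[pair] += weight
    return counts


def combined_counts(corpus, secondary, weight):
    """Merge corpus counts with the weighted secondary-list counts."""
    return bigram_counts(corpus) + bigram_counts(secondary, weight)


def probability(counts, prev, nxt):
    """Estimate P(nxt | prev) from the pair counts."""
    total = sum(c for (p, _), c in counts.items() if p == prev)
    return counts[(prev, nxt)] / total if total else 0.0


# Toy data: a made-up three-word corpus and one secondary word.
corpus = [["ka", "ni"], ["ka", "sa"], ["ka", "sa"]]
secondary = [["ka", "ni", "shi", "na", "i"]]  # "kanishinai"

if __name__ == "__main__":
    before = bigram_counts(corpus)
    after = combined_counts(corpus, secondary, weight=20)
    print(probability(before, "ka", "ni"))  # 1 of 3 "ka" transitions
    print(probability(after, "ka", "ni"))   # 21 of 23 "ka" transitions
```

Passing `weight=20` gives the same counts as appending the secondary list to the corpus twenty times, which is the equivalence described above, without actually enlarging the word population in memory.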

A sample secondary list of words that I found interesting:
dictionary=[
"koutaria",
"mukotome",
"shisenbaikoushu",
"kyuubutsuke",
"keshinoko",
"umemoku",
"hoshiru",
"kimobata",
"hankaken",
"modansou",
"sangairu",
"osakirini",
"tsubanei",
"ekobari",
"ohinomi",
"sairabo",
"kanishinai",
"osameki",
"ronisu"]
After applying a weight of 20,000 to this list and running the generator we get:
shinoubu, しのうぶ
kyuumemoda, きゅうめもだ
hanbanega, はんばねが
kanishisen, かにしせん
keshinasa, けしなさ
osamemo, おさめも
hoshinoha, ほしのは
ubutsubane, うぶつばね
kanishinaha, かにしなは
The influence can be seen pretty clearly with pleasing results.
Further Investigation
At present I have enjoyed deriving meanings for the generated words through an entirely non-computational process, essentially from my own ability to find likenesses and relationships:
芥子の子 [けしのこ] keshinoko
-noun
1. mustard child
2. a precocious child
However, it would be feasible to randomize the selection of available kanji, which would allow a partial meaning to be derived automatically.
(Main generator code)
(Pickled dictionary parse file – for optimization) 11MB. Alternatively, the parsing function can be uncommented and run with the pickle loading commented out.
(Sample secondary dictionary)
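The pickle optimization mentioned above amounts to caching the parsed dictionary on disk. A minimal sketch of that pattern follows; `load_or_parse` and the stand-in parse function are hypothetical names, and the real generator instead switches between the two paths by commenting code in or out.

```python
import os
import pickle
import tempfile


def load_or_parse(cache_path, parse):
    """Return the cached parse if cache_path exists; otherwise parse and cache."""
    if os.path.exists(cache_path):
        # Only unpickle files you created yourself; pickle can run code.
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = parse()
    with open(cache_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data


if __name__ == "__main__":
    def slow_parse():
        # Stand-in for the real dictionary parsing step.
        return [["ke", "shi", "no", "ko"], ["ho", "shi", "ru"]]

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "parsed.pickle")
        print(load_or_parse(path, slow_parse))  # parses, then writes the cache
        print(load_or_parse(path, slow_parse))  # reads the cache
```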
Required libs and modules:
cjklib with EDICT resource
romaji.py