Redial: Interactive Telephony : Week 10

Speech Synthesis

Concept to Speech

"Sorry Dave" ... "It does not compute"
- Hopefully we won't hear these too often!

Artificial production of human speech

Historically, attempts have been mechanical in nature: tubes simulating vocal cords, and so on. Instruments that simulate speech have also been attempted. Unfortunately, these devices and instruments don't have the flexibility of human articulators (such as the lips, tongue and teeth).

Vocoder - Early voice synthesis, used as an instrument

More recent attempts at speech synthesis have been made with computers.

Two characteristics are used to judge quality: naturalness and intelligibility. Naturalness is how much the output sounds like a human; intelligibility is how easily it can be understood.

3 main techniques:

Database of phones or diphones (concatenative synthesis) - Flexible and able to produce a wide variety of words; not terribly easy to understand (intelligible) but can be somewhat natural sounding. Has "glitches" due to the nature of combining pieces of recorded speech.

Limited Domain - Recordings of entire words, phrases and perhaps even sentences are stored for specific uses: telling time, pronouncing the alphabet, reading numbers. This is very easy to understand and natural sounding but not very flexible.

Mathematical models (formant synthesis) - Shows great promise; expensive, but can be very convincing. Highly intelligible but not very natural.

Some key words and concepts:

Phonology - The study of the sound system of a language (abstract)

Phonetics - The physical production and perception of sounds that comprise speech (concrete).

Phone - A portion of speech that has a distinct physical or perceptual property (concrete).

Phoneme - The abstract representation of a sound (abstract).

Prosody - A term referring to elements of speech such as intonation, pitch, rate, loudness and rhythm.

Diphone - A pair of phones spoken together. Diphones are used in speech synthesis to create output that sounds more natural than combining phones directly: the transition between two phones differs depending on which phones are involved, and diphones capture those transitions.
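To make the idea concrete, here is a small sketch (not from any synthesis engine) of how a phone sequence for a word becomes a diphone sequence. The phone list for "cat" is a simplified ARPAbet-style transcription used only for illustration.

```python
def diphones(phones):
    """Pad the phone list with silence, then pair each phone with its neighbor."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# "cat" -> /k/ /ae/ /t/
print(diphones(["k", "ae", "t"]))
# ['sil-k', 'k-ae', 'ae-t', 't-sil']
```

A synthesizer would then look each of those units up in its recorded diphone database and concatenate the audio.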

More Information:
  • Speech Synthesis - Wikipedia
  • Phonology - Wikipedia
  • Phonetics - Wikipedia
  • International Phonetic Alphabet - Wikipedia
  • Written Language

In order to perform text to speech, computers need to be able to turn our written language into something that can be spoken.

    This requires not only turning the written language into phones or sounds but understanding punctuation, timing, intonation and focus.
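A toy sketch of that first step, text normalization, might look like the following. The abbreviation and digit tables here are made up for illustration; real systems (Festival's token-to-word rules, for example) are far more elaborate.

```python
# Hypothetical text-normalization pass: expand abbreviations and
# digit strings into speakable words before letter-to-sound rules run.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
DIGITS = {"2": "two", "7": "seven", "8": "eight"}

def normalize(text):
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            # Read digit strings out one digit at a time.
            words.extend(DIGITS.get(d, d) for d in token)
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Dr. Smith lives at 278 Elm St."))
# Doctor Smith lives at two seven eight Elm Street
```

Notice that even this tiny example has to make a decision a real system struggles with: should "278" be read as digits, or as "two hundred seventy eight"?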

    From "The Talking Computer - Text to Speech Synthesis":

    HAL: I enjoy working with people.

    He could stress any word in the sentence and change its meaning. If he stresses I, he contrasts the meaning with "you enjoy ..." If he stresses enjoy, he implies a contrast with "I hate ..." When working is stressed, it means "rather than playing." To convey the meaning of a message the computer must assign a prominent stress to the correct word.

    As you can see, turning text into intelligent speech is no easy task, even if we could make the computer sound natural and intelligible.

    Do we want computers to talk? Was it good that HAL talked?

    More Information:
  • The Talking Computer - Text to Speech Synthesis
  • Smithsonian Speech Synthesis History Project

  • Systems

    There are many different speech synthesis engines and databases of phones available to researchers. There are many more that are commercial products.

    Here are some that I think you will find interesting:

    The Festival Speech Synthesis System - What we will be using with Asterisk. (Open Source)

    MBROLA - Diphone databases (of primary interest to us when using Festival; not Open Source) | A list of various software that uses MBROLA.


    Java FreeTTS - Open Source

    Open Mary - Java based, uses XML, and has rich prosodic capabilities (emotional speech).

    AT&T Natural Voices Text to Speech | Research Site | Demo

    Festival for use with Asterisk

    Asterisk has a handy dandy command for working with a Festival server:

    		exten => s,1,Festival('Hello World, I am a talking phone system') ; the quotes are important

    Unfortunately, the Festival command for Asterisk doesn't give us much flexibility for determining the voice to be used or other timing elements. (Also, Festival isn't set up on our server for use in this manner.)

    Festival uses the "Scheme" programming language to define its configuration. I don't pretend to understand it, but here is an example from a configuration file, which prints each segment's end time and name and then rescales the volume:
    			(set! after_synth_hooks
    				(lambda (utt)
    				  (mapcar
    					(lambda (x)
    					  (format t "%s %s\n" (item.feat x 'segment_end) (item.name x)))
    					(utt.relation.items utt 'Segment))
    				  (utt.wave.rescale utt 2.6)))

    Scheme expressions can also be passed on the command line:
    			text2wave -eval '(voice_kal_diphone)'


    Fortunately for us, we can gain a little bit of this power by using the text2wave system command instead of using the Festival command directly.

    Here are our options:
    		[sve204@social festival]$ text2wave -?
    		text2wave [options] textfile
    		  Convert a textfile to a waveform
    		  -mode   Explicit tts mode.
    		  -o ofile        File to save waveform (default is stdout).
    		  -otype  Output waveform type: ulaw, snd, aiff, riff, nist etc.
    						  (default is riff)
    		  -F         Output frequency.
    		  -scale   Volume factor
    		  -eval   File or lisp s-expression to be evaluated before

    The easiest way to use this from Asterisk is like so:
    			exten => s,1,System(echo 'Hello, I am a phone not a person' | /usr/bin/text2wave -scale 1.5 -F 8000 -o /home/sve204/tester.wav);
    			exten => s,2,Background(/home/sve204/tester);

    We can also pass in arguments to use different voices:
    			echo 'Hello World' | /usr/bin/text2wave -F 8000 -o /home/sve204/tester2.wav -eval "(voice_us1_mbrola)"

    This gives us a bit more flexibility, but what if we want more, more, more control?

    Fortunately, there is an XML spec called SABLE:

    SABLE: A Synthesis Markup Language (version 1.0)

    With SABLE you can create a text file and pass that to text2wave. Here is a sample:
    			<?xml version="1.0"?>
    			<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
    			"Sable.v0_2.dtd" []>
    			<SABLE>
    			<SPEAKER NAME="kal_diphone">
    			The boy saw the girl in the park <BREAK/> with the telescope.
    			The boy saw the girl <BREAK/> in the park with the telescope.
    			Good morning <BREAK /> My name is Stuart, which is spelled
    			<RATE SPEED="-40%">
    			<SAYAS MODE="literal">stuart</SAYAS> </RATE>
    			though some people pronounce it 
    			<PRON SUB="stoo art">stuart</PRON>.  My telephone number
    			is <SAYAS MODE="literal">2787</SAYAS>.
    			I used to work in <PRON SUB="Buckloo">Buccleuch</PRON> Place, 
    			but no one can pronounce that.
    			</SPEAKER>
    			</SABLE>

    Another Example:

    			<?xml version="1.0"?>
    			<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
    			"Sable.v0_2.dtd" []>
    			<SABLE>
    			<SPEAKER NAME="kal_diphone">
    			Good evening class <BREAK />
    			How are you all doing?
    			My name is Shawn.
    			Should I <VOLUME LEVEL="loud">yell loudly</VOLUME>
    			or should I speak <VOLUME LEVEL="quiet">in a quiet voice</VOLUME>
    			Should I <RATE SPEED="+100%">speak in a fast voice</RATE> or
    			should I <RATE SPEED="-50%">speak in a slow manner</RATE>
    			</SPEAKER>
    			</SABLE>

    If the above were a text file named test.sable, we would create a wav file using the text2wave command as follows:

    		/usr/bin/text2wave -F 8000 -o /home/sve204/testsable.wav /home/sve204/test.sable

    Here is an article regarding using SABLE with Festival: Sable

    Here are the Supported Tags

    AGI + Web/RSS/XML + Festival

    Just an example: php_rss_example

    Building Voices in Festival
    For the really really ambitious:

    Voice Demos

    Speech Recognition

    Although this is an extremely simple way of looking at it, speech recognition can be thought of as the inverse of what we covered with speech synthesis. Many of the concepts are the same, just going in the opposite direction.

    Instead of using phones to generate audio, audio is parsed and searched for phones. A statistical model is used to figure out which phones have been spoken, and those with high probability are mapped to words. Here is a nice overview: How Speech Recognition Works
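The "highest probability wins" idea can be sketched in a few lines. The acoustic scores and the two-word lexicon below are made up for illustration; real recognizers use hidden Markov models or neural networks over thousands of words.

```python
import math

# Hypothetical per-phone scores from an acoustic model for one utterance.
phone_probs = {"y": 0.8, "eh": 0.7, "s": 0.9, "n": 0.1, "ow": 0.2}

# Toy pronunciation lexicon mapping words to phone sequences.
lexicon = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}

def best_word(lexicon, phone_probs):
    """Score each candidate word by its phones' probabilities and pick the best."""
    def score(phones):
        # Sum log-probabilities (equivalent to multiplying probabilities,
        # but avoids underflow on long words).
        return sum(math.log(phone_probs.get(p, 1e-9)) for p in phones)
    return max(lexicon, key=lambda w: score(lexicon[w]))

print(best_word(lexicon, phone_probs))  # yes
```

This is also why grammars (covered below) matter so much: the smaller the lexicon the engine has to score, the better the odds of picking the right word.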

    As with speech synthesis, there are many different speech recognition engines. Probably the most well known is Dragon Naturally Speaking. Dragon is generally used as a tool on desktop computers and must be trained on a per-user basis. Unfortunately, this won't suit our purposes, but it is something to be aware of.

    Here is a list of Speech Recognition software that runs on Linux (which is what we will be using): Speech Recognition HOWTO: 5. Speech Recognition Software

    We will be using the Lumenvox Speech Engine with Asterisk. In the past we have used Sphinx engines from CMU, specifically Sphinx 2, as it is fast and, like all of the Sphinx engines, Open Source. Unfortunately, it has proven to be less accurate than we need for our applications.

    Lumenvox has several example applications that utilize the dialplan and AGI scripts to interface with the speech engine. You can find them here:

    Let's go through a simple example.

    The first thing we need to do is create a "grammar". A grammar defines the words that the speech engine will be expected to understand. We need this in order to narrow down the possibilities from the universe of words that someone might speak.

    Here is a simple grammar for Yes or No:
    #ABNF 1.0;
    language en-US; //use the American English pronunciation dictionary.
    mode voice;  //the input for this grammar will be spoken words (as opposed to DTMF)
    root $yesorno;
    $yes = yes;
    $no = no;
    $yesorno = $yes | $no;
    You would save this in a text file in your asterisk_conf directory as NET-ID_yesno.gram.
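To see what this grammar buys us, here is a toy illustration (not how the Lumenvox engine actually works internally): the real engine compiles the ABNF, while this sketch simply hard-codes the expansions of the root rule $yesorno and rejects anything else.

```python
# Expansions of root $yesorno from the grammar above.
GRAMMAR = {"yes", "no"}

def match(utterance):
    """Accept only utterances the grammar's root rule can produce."""
    return utterance.strip().lower() in GRAMMAR

print(match("Yes"))    # True
print(match("maybe"))  # False
```

The recognizer never has to consider "maybe" at all, which is exactly how the grammar narrows the universe of possible words.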

    There are many possibilities for doing more complex work in grammar definitions. You can look into these on your own, but the above illustrates how to match just about any word that you need. For instance, if we wanted to recognize the names of the students in the class, we would create a grammar that looks like this:
    #ABNF 1.0;
    mode voice;
    language en-US;
    tag-format <lumenvox/1.0>;
    root $name;
    $shawn = ((shawn [van])[every]):"shawn";
    $james = "{JH AE M S:jaymes}";
    $nobu = nobu;
    $name = ($shawn|$james|$nobu);
    I saved this as sve204_names.gram in my asterisk_conf directory.

    You can see that the beginning is much the same. The "root" is the variable name that will be output at the end. Everything else defines variables with the speech that will be matched. There are rules for handling more complicated words and for ignoring certain words. You can research this further by looking through A Simple Grammar (Lumenvox Programmers Guide)

    More more more:
    Using Phonetic Spellings
    English Phonemes
    Built-in Grammars

    The next thing we need to do is create a dialplan that will utilize that grammar:

    exten => s,1,SpeechCreate
    exten => s,n,SpeechLoadGrammar(sve204_names|/home/sve204/asterisk_conf/sve204_names.gram)
    exten => s,n,SpeechActivateGrammar(sve204_names)
    exten => s,n,SpeechBackground(beep,10)
    exten => s,n,Verbose(1,Result was ${SPEECH_TEXT(0)})
    exten => s,n,Verbose(1,Confidence was ${SPEECH_SCORE(0)})
    exten => s,n,SpeechDeactivateGrammar(sve204_names)
    Of course, much more can be done. Have a look at Asterisk's Speech Recognition API
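As one sketch of where to take this, the confidence score can drive branching in the dialplan. The threshold of 600 and the extension name "unsure" below are made up for illustration; check your engine's documentation for the score range it actually reports.

```
exten => s,n,GotoIf($[${SPEECH_SCORE(0)} < 600]?unsure,1)
exten => s,n,Verbose(1,Confident match: ${SPEECH_TEXT(0)})
exten => s,n,Hangup()

; Low confidence: apologize and listen again.
exten => unsure,1,Playback(vm-sorry)
exten => unsure,n,Goto(s,1)
```

Combining this with SpeechActivateGrammar lets you build multi-step voice menus that re-prompt only when the engine isn't sure what it heard.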