Redial: Interactive Telephony : Week 9
Text to Speech in Asterisk (Festival)
Sorry Dave Concept to Speech
It does not compute
- Hopefully we won't hear these too often!
Artificial production of human speech
Historically, attempts have been mechanical in nature: tubes, simulated vocal cords and so on. Instruments that simulate speech have also been attempted. Unfortunately, these devices and instruments don't have the flexibility of human articulators (such as lips, tongue and teeth).
Vocoder - Early voice synthesis, used as an instrument
More recent attempts at speech synthesis have been done with computers.
Two characteristics are used to judge the quality: naturalness and intelligibility. Naturalness is how much it sounds like a human and intelligibility is how easily it can be understood.
3 main techniques:
Database of phones or diphones (concatenative synthesis) - Flexible, able to produce a wide variety of words. Not terribly easy to understand (intelligible) but can be somewhat natural sounding. Has "glitches" due to the nature of combining parts of speech.
Limited Domain - Recordings of entire words, phrases and perhaps even sentences are stored for specific use. Telling time, pronouncing the alphabet, reading numbers. This is very easy to understand and natural sounding but not very flexible.
Mathematical models (Formant synthesis) - Shows great promise; expensive but can be very convincing. Highly intelligible, not very natural.
Some key words and concepts:
Phonology - The study of the sound system of a language (abstract)
Phonetics - The physical production and perception of sounds that comprise speech (concrete).
Phone - A portion of speech that has a distinct physical or perceptual property (concrete).
Phoneme - The abstract representation of a sound (abstract).
Prosody - A term referring to elements of speech such as intonation, pitch, rate, loudness and rhythm.
Diphone - A pair of phones spoken together. Diphones are used in speech synthesis to create sounds that are more natural sounding than combining phones directly. The transitions between phones differ depending on the phones involved, and diphones capture those transitions.
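To make the diphone idea concrete, here is a minimal Python sketch that turns a sequence of phones into the diphones a concatenative synthesizer would need to look up. The phone labels and the "_" silence marker are illustrative assumptions, not the labels used by any particular Festival or MBROLA database.

```python
def to_diphones(phones):
    """Return the diphone sequence for a list of phone labels.

    A silence marker ("_") is padded on both ends so the transitions into
    the first phone and out of the last phone are captured as well.
    """
    padded = ["_"] + list(phones) + ["_"]
    # Each diphone spans two adjacent phones.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# "hello" as a rough, hand-written phone sequence (illustrative labels only).
hello_phones = ["h", "eh", "l", "ow"]
print(to_diphones(hello_phones))
# ['_-h', 'h-eh', 'eh-l', 'l-ow', 'ow-_']
```

Four phones need five diphones here, because every transition (including those to and from silence) is its own recorded unit.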
Speech Synthesis - Wikipedia
Phonology - Wikipedia
Phonetics - Wikipedia
International Phonetic Alphabet - Wikipedia
In order to perform text to speech, computers need to be able to turn our written text into something that can be spoken.
This requires not only turning the written language into phones or sounds but understanding punctuation, timing, intonation and focus.
From "The Talking Computer - Text to Speech Synthesis":
HAL: I enjoy working with people.
He could stress any word in the sentence and change its meaning. If he stresses I he contrasts the meaning with "you enjoy ..." If he stresses enjoy, he implies a contrast with "I hate ..." When working is stressed, it means "rather than playing." To convey the meaning of a message the computer must assign a prominent stress to the correct word.
As you can see, turning text into intelligent speech is no easy task, even if we could make the computer sound natural and intelligible.
Do we want computers to talk? Was it good that HAL talked?
The Talking Computer - Text to Speech Synthesis
Smithsonian Speech Synthesis History Project
There are many different speech synthesis engines and databases of phones available to researchers. There are many more that are commercial products.
Here are some that I think you will find interesting:
The Festival Speech Synthesis System - What we will be using with Asterisk. (Open Source)
MBROLA - Diphone Databases (of primary interest to us when using Festival, Not Open Source) | A list of Various Software that uses MBROLA.
Java FreeTTS - Open Source
Open Mary - Java based, uses XML and has rich prosodic capabilities (emotional speech).
AT&T Natural Voices Text to Speech | Research Site | Demo
Festival for use with Asterisk
Asterisk has a handy dandy command for working with a Festival server:
Festival('Hello World, I am a talking computer!') ; quotes are important
exten => s,1,Festival('Hello World, I am a talking phone system')
Unfortunately, the Festival command for Asterisk doesn't give us much flexibility for determining the voice to be used or other timing elements.
Festival uses the Scheme programming language to define its configuration. I don't pretend to understand it but here is an example from a configuration file:
(format t "%s %s\n" (item.feat x 'segment_end) (item.name x)))
(utt.relation.items utt 'Segment))
(utt.wave.rescale utt 2.6)))
text2wave -eval '(voice_kal_diphone)'
Fortunately for us, we can gain a little bit of this power by using the text2wave system command instead of using the Festival command directly.
Here are our options:
[sve204@social festival]$ text2wave -?
text2wave [options] textfile
Convert a textfile to a waveform
-mode Explicit tts mode.
-o ofile File to save waveform (default is stdout).
-otype Output waveform type: ulaw, snd, aiff, riff, nist etc.
(default is riff)
-F Output frequency.
-scale Volume factor
-eval File or lisp s-expression to be evaluated before
The easiest way to use this in Asterisk is like so:
exten => s,1,System(echo 'Hello, I am a phone not a person' | /usr/bin/text2wave -scale 1.5 -F 8000 -o /home/sve204/tester.wav);
exten => s,2,Background(/home/sve204/tester);
We can also pass in arguments to use different voices:
echo 'Hello World' | /usr/bin/text2wave -F 8000 -o /home/sve204/tester2.wav -eval "(voice_us1_mbrola)"
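If you end up calling text2wave from a script (an AGI script, for instance) rather than from the dialplan, the same commands can be assembled programmatically. The following is only a sketch in Python: the function names `text2wave_command` and `synthesize` are made up for illustration, and it assumes text2wave lives at /usr/bin/text2wave and reads the text from standard input, as in the echo examples above. Only the -F, -o, -scale and -eval options from the help output are used.

```python
import subprocess


def text2wave_command(out_path, voice=None, scale=None, rate=8000,
                      binary="/usr/bin/text2wave"):
    """Build the argument list for text2wave, reading text from stdin."""
    cmd = [binary, "-F", str(rate), "-o", out_path]
    if scale is not None:
        cmd += ["-scale", str(scale)]
    if voice is not None:
        # e.g. voice="voice_us1_mbrola" becomes -eval "(voice_us1_mbrola)"
        cmd += ["-eval", f"({voice})"]
    return cmd


def synthesize(text, out_path, **options):
    """Pipe text into text2wave; raises CalledProcessError if it fails."""
    subprocess.run(text2wave_command(out_path, **options),
                   input=text.encode("utf-8"), check=True)


# Only prints the command; call synthesize() on a machine with Festival installed.
print(text2wave_command("/home/sve204/tester2.wav", voice="voice_us1_mbrola"))
```

Passing the arguments as a list rather than one shell string means the text to be spoken never goes through shell quoting, so apostrophes in a sentence cannot break the command.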
This gives us a bit more flexibility but what if we want more more more control?
Fortunately there is an XML spec called SABLE:
SABLE: A Synthesis Markup Language (version 1.0)
With SABLE you can create a text file and pass that to text2wave. Here is a sample:
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
The boy saw the girl in the park <BREAK/> with the telescope.
The boy saw the girl <BREAK/> in the park with the telescope.
Good morning <BREAK /> My name is Stuart, which is spelled
<SAYAS MODE="literal">stuart</SAYAS> </RATE>
though some people pronounce it
<PRON SUB="stoo art">stuart</PRON>. My telephone number
is <SAYAS MODE="literal">2787</SAYAS>.
I used to work in <PRON SUB="Buckloo">Buccleuch</PRON> Place,
but no one can pronounce that.
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
Good evening class <BREAK />
How are you all doing?
My name is Shawn.
Should I <VOLUME LEVEL="loud">yell loudly</VOLUME>
or should I speak <VOLUME LEVEL="quiet">in a quiet voice</VOLUME>
Should I <RATE SPEED="+100%">speak in a fast voice</RATE> or
should I <RATE SPEED="-50%">speak in a slow manner</RATE>
If the above were a text file named test.sable, we would create a wav file using the text2wave command as follows:
/usr/bin/text2wave -F 8000 -o /home/sve204/testsable.wav /home/sve204/test.sable
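When the text comes from somewhere else (user input, a database, a feed), the SABLE markup can be generated rather than typed by hand. Here is a rough Python sketch that builds body fragments using the VOLUME, RATE and BREAK tags from the samples above. The helper names `tag`, `loud` and `fast` are invented for this example, and it produces only the marked-up sentence: the result would still need to be placed inside a complete SABLE file before being handed to text2wave.

```python
from xml.sax.saxutils import escape, quoteattr


def tag(name, text, **attrs):
    """Wrap text in a SABLE element, escaping the text and attribute values."""
    attr_text = "".join(f" {key}={quoteattr(value)}" for key, value in attrs.items())
    return f"<{name}{attr_text}>{escape(text)}</{name}>"


def loud(text):
    return tag("VOLUME", text, LEVEL="loud")


def fast(text):
    return tag("RATE", text, SPEED="+100%")


line = "Should I " + loud("yell loudly") + " <BREAK /> or " + fast("speak in a fast voice")
print(line)
```

Escaping matters because characters such as & and < in the source text would otherwise produce markup that is not valid XML.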
Here is an article regarding SABLE with Festival: Sable
Here are the supported tags: Supported Tags
AGI + Web/RSS/XML + Festival
Just an example: php_rss_example
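The linked example is in PHP; as a rough illustration of the same idea, here is a Python sketch (not a translation of that script) that pulls the headlines out of an RSS feed and turns them into one string suitable for handing to text2wave. The feed contents and the function name `headlines_to_text` are made up for this example, and fetching the feed and the AGI plumbing are left out.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in feed; a real script would download this from a URL.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First headline</title></item>
    <item><title>Second headline</title></item>
  </channel>
</rss>"""


def headlines_to_text(rss_xml, limit=3):
    """Return up to `limit` item titles joined into speakable sentences."""
    root = ET.fromstring(rss_xml)
    titles = [item.findtext("title", default="").strip()
              for item in root.iter("item")]
    titles = [t for t in titles if t][:limit]
    return ". ".join(titles) + "." if titles else "No headlines available."


print(headlines_to_text(SAMPLE_RSS))
# First headline. Second headline.
```

Ending each headline with a period gives the synthesizer a sentence boundary, so it pauses between items instead of running them together.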
Building Voices in Festival
For the really really ambitious: