Redial: Interactive Telephony : Week 10

Speech Recognition for use with Asterisk (LumenVox)

Fun Stuff!

Speech Recognition

Although this is an extremely simplified way of looking at it, speech recognition can be thought of as the inverse of what we covered last week: speech synthesis. Many of the concepts are the same, just going in the opposite direction.

Instead of using phones to generate audio, the audio is parsed and searched for phones. A statistical model is used to figure out which phones have been spoken, and those with high probability are mapped to words. Here is a nice overview: How Speech Recognition Works

As with speech synthesis, there are many different speech recognition engines. Probably the most well known is Dragon NaturallySpeaking. Dragon is generally used as a tool on desktop computers and must be trained on a per-user basis. Unfortunately, this won't suit our purposes, but it is something to be aware of.

Here is a list of Speech Recognition software that runs on Linux (which is what we will be using): Speech Recognition HOWTO: 5. Speech Recognition Software

We will be using the LumenVox Speech Engine with Asterisk. In the past we have used Sphinx engines from CMU, specifically Sphinx 2, as it is fast and, like all of the Sphinx engines, open source. Unfortunately, it has proven less accurate than we need for our applications.

Lumenvox has several example applications that utilize the dialplan and AGI scripts to interface with the speech engine. You can find them here: http://www.lumenvox.com/partners/integrator/digium/applicationzone/index.aspx

Let's go through a simple example.

The first thing we need to do is create a "grammar". A grammar defines the words that the speech engine will be expected to understand. We need this in order to narrow down the possibilities from the universe of words that someone might speak.

Here is a simple grammar for Yes or No:
#ABNF 1.0;
language en-US; //use the American English pronunciation dictionary.
mode voice;  //the input for this grammar will be spoken words (as opposed to DTMF)
root $yesorno;
$yes = yes;
$no = no;
$yesorno = $yes | $no;
You would save this in a text file in your asterisk_conf directory as NET-ID_yesno.gram.
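To try this grammar out quickly, you would load and activate it from the dialplan. This is just a sketch (the path assumes the asterisk_conf directory in your home directory, as above); we will walk through a full dialplan below:

exten => s,1,SpeechCreate()
exten => s,n,SpeechLoadGrammar(yesno|/home/NET-ID/asterisk_conf/NET-ID_yesno.gram)
exten => s,n,SpeechActivateGrammar(yesno)
exten => s,n,SpeechBackground(beep,10)
exten => s,n,Verbose(1,Heard: ${SPEECH_TEXT(0)})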

There are many possibilities for doing more complex work in grammar definitions. You can look into these on your own, but the above illustrates how to match just about any word that you need. For instance, if we wanted to recognize the names of the students in the class, we would create a grammar that looks like this:
 
#ABNF 1.0;
mode voice;
language en-US;
tag-format <lumenvox/1.0>;

root $name;

$shawn = ((shawn [van])[every]):"shawn";
$jaymes = "{JH AE M S:jaymes}";
$cho = cho;

$name = ($shawn|$jaymes|$cho);
I saved this as sve204_names.gram in my asterisk_conf directory.

You can see that the beginning is much the same. The "root" is the variable whose match will be output at the end. Everything else defines variables for the speech that will be matched. There are rules for handling more complicated words and for ignoring certain words. You can research this further by looking through A Simple Grammar (LumenVox Programmer's Guide)
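For example, optional words go in square brackets (as with [van] and [every] in the names grammar above), and a semantic tag (the :"..." syntax) normalizes what the engine returns no matter which optional words were spoken. Here is a sketch of the yes/no grammar rewritten that way; it follows the same conventions but has not been tested against the engine:

#ABNF 1.0;
language en-US;
mode voice;
tag-format <lumenvox/1.0>;
root $yesorno;
$yes = ([oh] yes [please]):"yes";
$no = (no [thank you]):"no";
$yesorno = $yes | $no;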

More more more:
Using Phonetic Spellings
English Phonemes
Phrases
Built-in Grammars


The next thing we need to do is create a dialplan that will utilize that grammar:

exten => s,1,SpeechCreate
exten => s,n,SpeechLoadGrammar(sve204_names|/home/sve204/asterisk_conf/sve204_names.gram)
exten => s,n,SpeechActivateGrammar(sve204_names)
exten => s,n,SpeechBackground(beep,10)
exten => s,n,Verbose(1,Result was ${SPEECH_TEXT(0)})
exten => s,n,Verbose(1,Confidence was ${SPEECH_SCORE(0)})
exten => s,n,SpeechDeactivateGrammar(sve204_names)
exten => s,n,SpeechDestroy()
Of course, much more can be done. Have a look at Asterisk's Speech Recognition API
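For instance, you can branch on what was recognized. This sketch assumes the names grammar above is active and that extensions named shawn, jaymes, and cho exist in the same context (the semantic tags in the grammar guarantee the result text matches those names):

exten => s,n,GotoIf($["${SPEECH(results)}" = "0"]?nomatch,1)
exten => s,n,Goto(${SPEECH_TEXT(0)},1)
exten => nomatch,1,Verbose(1,Nothing was recognized)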