Redial: Interactive Telephony : Week 9

Speech Recognition for use with Asterisk (Sphinx)

Fun Stuff!

Speech Recognition

Although it is an extremely simplified way of looking at it, speech recognition can be thought of as the inverse of what we covered last week: Speech Synthesis. Many of the concepts are the same, but they run in the opposite direction.

Instead of using phones to generate audio, audio is parsed and phones are looked for. A statistical model is used to figure out which phones have been spoken, and those with high probability are then mapped into words. Here is a nice overview: How Speech Recognition Works
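As a rough illustration of that idea (and not a description of how Sphinx works internally), the Python sketch below takes made-up per-frame phone probabilities, keeps the most likely phone in each frame, collapses repeats, and looks the result up in a tiny dictionary. The probabilities, the dictionary, and the function names are all invented for the example.

```python
# Toy illustration only: real recognizers use hidden Markov models and a search
# over many hypotheses, not a simple per-frame "pick the best". All numbers are made up.

# Each frame maps candidate phones to a (made-up) probability.
FRAMES = [
    {"D": 0.7, "T": 0.2, "B": 0.1},
    {"D": 0.6, "T": 0.3, "B": 0.1},
    {"AE": 0.8, "EH": 0.2},
    {"AE": 0.7, "EH": 0.3},
    {"N": 0.9, "M": 0.1},
]

# Hypothetical pronunciation dictionary: phone sequence -> word.
DICTIONARY = {
    ("D", "AE", "N"): "dan",
    ("AE", "N"): "ann",
}


def best_phones(frames):
    """Pick the most probable phone per frame and collapse consecutive repeats."""
    phones = []
    for frame in frames:
        phone = max(frame, key=frame.get)
        if not phones or phones[-1] != phone:
            phones.append(phone)
    return phones


def decode(frames, dictionary):
    """Map the collapsed phone sequence to a word, or None if it is unknown."""
    return dictionary.get(tuple(best_phones(frames)))


if __name__ == "__main__":
    print(best_phones(FRAMES))
    print(decode(FRAMES, DICTIONARY))
```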

As with Speech Synthesis, there are many different speech recognition engines. Probably the most well known is Dragon NaturallySpeaking. Dragon is generally used as a tool on desktop computers and should be trained on a per-user basis. Unfortunately, this won't suit our purposes, but it is something to be aware of.

Here is a list of Speech Recognition software that runs on Linux (which is what we will be using): Speech Recognition HOWTO: 5. Speech Recognition Software

We will be using one of the Sphinx engines from CMU, specifically Sphinx 2, as it is the fastest (although not the most accurate). I chose Sphinx because it is Open Source, speaker independent, and some Asterisk integration has already been done with it.

To give you an idea of what Sphinx can do for us, here are some examples of Sphinx in action:

LET'S GO!: A Spoken Dialog System For The General Public
RoomLine - A Spoken Dialogue System for Conference Room Reservation in SCS
ZoIP: Zork and Asterisk (using Sphinx)
CMU Communicator

Unfortunately, Sphinx is a BIG project with resources that aren't the easiest to get started with.

The Sphinx software requires two major things in order to be useful. The first is called an Acoustic Model. Acoustic Models are created by taking a large body of recordings and mapping the sounds in them to specific phones. (Of course, it is much more complex than that.)

The second is called a Language Model. A language model describes the words that are expected and the probability of those words coming in a particular order. It is paired with a dictionary that maps each expected word to its phones.
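To make the idea of order probabilities concrete at the word level, here is a minimal Python sketch that counts which word follows which in a small corpus and turns the counts into probabilities. The models that the CMU tool generates are more involved than this (there is no smoothing here, for one); the corpus and function name are just for illustration.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus in the spirit of the examples used in class.
CORPUS = [
    "my name is shawn",
    "my name is dan",
    "dan is my name",
]


def bigram_probabilities(sentences):
    """Estimate P(next_word | word) from raw counts, with no smoothing."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }


if __name__ == "__main__":
    model = bigram_probabilities(CORPUS)
    for nxt, p in sorted(model["is"].items()):
        print(f"P({nxt} | is) = {p:.2f}")
```

In this corpus "name" always follows "my", so a recognizer using such a model would strongly prefer "my name" over "my dan" even if the audio were ambiguous.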

The unfortunate thing here is that both Acoustic Models and Language Models are difficult to create. The fortunate thing is that CMU has created an acoustic model (communicator) for telephony applications, which they use in the above examples, and that they have an online tool for generating language models.

We already have the communicator acoustic model installed on our server, so there is no need to download that, but for any application you want to run with Sphinx you should generate a language model.

To generate a language model using the online tool, you first need to make a text file with the words and sentences that you expect to be spoken. (Words alone are fine, but including them in possible sentences is better for accuracy.)

Here is an example:

			my name is
			is my name
			shawn
			dan
			scott
			christin
			christian
			jury
			chris
			summer
			diane
			matt
			robert
			nanna
			ann
			steve
			jeff
			jacky
			megan
			ahn
			arly
			won
		
(Note some of these are purposefully misspelled so that they are more likely to be in the dictionary that is generating the model for us.)

After that is handed to the online tool, a GZIP file is generated with the various files that need to be made available to Sphinx. I would rename all of the files with something meaningful, such as sve204_names (don't remove the extensions though) and upload them to your home directory.
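If you would rather not rename the files by hand, a short Python sketch along these lines could do it after you unpack the archive. It assumes a directory that contains only the generated files; the function name and the example file names in the comments are hypothetical, not something the online tool promises.

```python
import sys
from pathlib import Path


def rename_model_files(directory, new_stem):
    """Rename each file in `directory` to `new_stem` plus its original extension.

    For example, a hypothetical "4512.lm" becomes "sve204_names.lm".
    Returns the list of new paths; files whose new name is taken are skipped.
    """
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # Keep everything from the first dot onward as the extension.
        _, dot, extension = path.name.partition(".")
        target = path.with_name(new_stem + dot + extension)
        if target == path or target.exists():
            continue  # already renamed, or the new name is taken
        path.rename(target)
        renamed.append(target)
    return renamed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python rename_model.py <directory> <new_stem>
    for new_path in rename_model_files(sys.argv[1], sys.argv[2]):
        print(new_path)
```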

Next we need a way to call the Sphinx recognition engine from Asterisk. Unfortunately, Sphinx does not come with a command line application or an Asterisk plugin that will suit our purposes. Fortunately, there is a Perl module called Speech::Recognizer::SPX that makes our lives slightly easier.

Here is a Perl script that can be used and modified for our purposes: sphinxtest.pl.

To use the script:
  • Download the script and change the extension to .pl instead of .txt
  • Open the script in a text editor and change the following lines:
      • $WAVFILE = "/home/sve204/sphinx/name-in.wav"; # This is the path to the wav file that you want processed
      • Any lines in fbs_init that mention the "names" folder need to be changed to the folder that contains your language model
  • That's it. Now you can upload it and test it out:
    		perl sphinxtest.pl
    		
    Notice that it is slow? We will work on making a faster example next week...

    Now you can call this from your Asterisk dialplan or from an AGI script. Here is a dialplan example:
    		[sve204_sphinx]
    		exten => s,1,Wait(1);
    		exten => s,n,Playback(/home/sve204/sphinx/tellmeyourname);
    		exten => s,n,Monitor(wav,/home/sve204/sphinx/name);
    		exten => s,n,Wait(7);
    		exten => s,n,StopMonitor();
    		exten => s,n,Playback(/home/sve204/sphinx/ithinkyousaid);
    		exten => s,n,System(/usr/bin/perl /home/sve204/sphinx/sphinxtest.pl | /usr/bin/text2wave -scale 1.5 -F 8000 -o /home/sve204/sphinx/playbackname.wav);
    		exten => s,n,Playback(/home/sve204/sphinx/playbackname);
    		exten => s,n,Wait(5);
    		exten => s,n,Goto(sve204_sphinx,s,1);
    		
    Try it out: Extension 10, then 200. It seems to work best over a normal phone rather than a soft phone (due to the communicator acoustic model).

    More more more:
  • Primer on Java Speech API (JSAPI)
  • Christian's Blog

    Speeding up Sphinx

    The speed issues that we see with the above commands in Sphinx can be partially alleviated by running our Sphinx application as a client/server application. The reason that Sphinx recognition is slow is that the application is somewhat slow on startup. The recognition itself can be done pretty quickly if Sphinx is already up and running.

    In order to get it up and running and just waiting for input we can modify our sphinxtest.pl script to run as a server and create a second script that will run as a client.

    Download the sphinxserver.pl (change extension from .txt to .pl) and sphinxclient.pl (change extension from .txt to .pl) scripts.

    These two scripts work in tandem. The server runs and manages connections to the Sphinx engine, and the client feeds the server audio to be processed.
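The general pattern can be sketched in Python as follows. This is not the contents of sphinxserver.pl or sphinxclient.pl; FakeRecognizer is a stand-in that only pretends to be slow on startup, and every name in the block is invented for the illustration. The point is that the expensive setup happens once, in the server, and each client request only pays for the recognition step.

```python
import socket
import socketserver
import threading
import time


class FakeRecognizer:
    """Stand-in for a speech engine: slow to construct, quick to use."""

    def __init__(self, startup_seconds=0.2):
        time.sleep(startup_seconds)  # pretend to load acoustic and language models

    def recognize(self, audio_bytes):
        # A real engine would decode the audio; this only reports its size.
        return f"received {len(audio_bytes)} bytes"


class RecognitionHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Read until the client signals end of input, then reply with the text.
        chunks = []
        while True:
            data = self.request.recv(4096)
            if not data:
                break
            chunks.append(data)
        text = self.server.recognizer.recognize(b"".join(chunks))
        self.request.sendall(text.encode("utf-8"))


def start_server(port=0):
    """Load the recognizer once, then serve requests in a background thread."""
    server = socketserver.TCPServer(("127.0.0.1", port), RecognitionHandler)
    server.recognizer = FakeRecognizer()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def recognize_remote(port, audio_bytes):
    """Client side: send the audio, signal end of input, read the reply."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        conn.sendall(audio_bytes)
        conn.shutdown(socket.SHUT_WR)
        reply = b""
        while True:
            data = conn.recv(4096)
            if not data:
                break
            reply += data
    return reply.decode("utf-8")


if __name__ == "__main__":
    server = start_server()  # port 0 lets the OS pick a free port
    port = server.server_address[1]
    print(recognize_remote(port, b"\x00" * 16000))
    print(recognize_remote(port, b"\x00" * 8000))  # no startup cost this time
    server.shutdown()
    server.server_close()
```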

    Both scripts require a bit of editing in order to work with your particular situation.
    Specifically, both require you to edit the port number that they will use:
    		$PORTTOUSE = 3010; # Use 3000 plus the number of your extension so that we are all unique
    		
    Just as with the sphinxtest.pl script, you have to put the path to the language model that you intend to use in the server:
    	  		-kbdumpdir	=> "$SPHINXDIR/model/lm/names",
    	  		-lmfn		=> "$SPHINXDIR/model/lm/names/names.lm",
    	  		-dictfn	=> "$SPHINXDIR/model/lm/names/names.dic",
    		
    The client needs to know the name of the audio file that you intend to process:
    			$WAVFILE = "/home/sve204/sphinx/name-in.wav"; # Monitor appends -in to the file name
    		
    Following those edits, you should be good to go. You can run the server from the command line to test it as follows:
    			perl sphinxserver.pl &
    		
    The ampersand tells the server to run in the background so that you can then run the client:
    			perl sphinxclient.pl
    		
    Assuming that you have an audio file for the client, it should connect to the server and output the text produced by the recognition engine.

    To use it in a dialplan, you would issue the same commands, but through the System() application:
    			[sve204_sphinxclientserver]
    			exten => s,1,Wait(1);
    			exten => s,n,System(/usr/bin/perl /home/sve204/sphinx/sphinxserver.pl &);
    			exten => s,n,Playback(/home/sve204/sphinx/tellmeyourname);
    			exten => s,n,Monitor(wav,/home/sve204/sphinx/name);
    			exten => s,n,Wait(5);
    			exten => s,n,StopMonitor();
    			exten => s,n,Playback(/home/sve204/sphinx/ithinkyousaid);
    			exten => s,n,System(/usr/bin/perl /home/sve204/sphinx/sphinxclient.pl | /usr/bin/text2wave -scale 1.5 -F 8000 -o /home/sve204/sphinx/playbackname.wav);
    			exten => s,n,Playback(/home/sve204/sphinx/playbackname);
    			exten => s,n,Wait(5);
    			exten => s,n,Goto(sve204_sphinxclientserver,s,1);
    		
    The first System command runs the server (it just exits if it is already running) and the second runs the client on the newly recorded file. This removes the overhead of starting the Sphinx system each and every time; it just stays running between requests.
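One common way to get the "just exits if it is already running" behaviour is to rely on the port itself, since only one process can listen on a given port at a time. The Python sketch below shows that idea; it is an assumption about the general technique and not necessarily how sphinxserver.pl does it, and try_listen is a made-up name.

```python
import errno
import socket


def try_listen(port, host="127.0.0.1"):
    """Return a listening socket, or None if something already holds the port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None  # another instance is already running
        raise
    sock.listen(5)
    return sock


if __name__ == "__main__":
    first = try_listen(0)  # port 0: let the OS pick a free port
    port = first.getsockname()[1]
    second = try_listen(port)  # same port again while it is still in use
    print("first instance listening:", first is not None)
    print("second instance would exit:", second is None)
    first.close()
```

A server started this way can be launched on every call, as the dialplan above does, without piling up extra copies.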

    The next problem that we encounter arises when multiple individuals are using the system at the same time and recording to the same audio file. We can alleviate this problem by using a variable in the name of the audio file that we record to, and then changing the client so that it gets the name of the file passed in.

    To enable this in the sphinxclient.pl script, we simply have to change the line that refers to the audio file, $WAVFILE = "/home/sve204/sphinx/name-in.wav";, so that it refers to the passed-in argument instead: $WAVFILE = $ARGV[0]; (The dialplan below assumes that the modified copy is saved as sphinxclient_new.pl.)
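Once the file name comes from the command line, it is worth guarding against a missing or wrong argument, since otherwise the client would quietly try to process nothing. Here is the idea as a small Python sketch; the function name and messages are made up, and the same checks could be ported to the Perl script.

```python
import os
import sys


def wav_path_from_args(argv):
    """Return the WAV path given as the first argument, or raise ValueError."""
    if len(argv) < 2:
        raise ValueError("usage: client <path-to-wav-file>")
    path = argv[1]
    if not os.path.isfile(path):
        raise ValueError(f"no such file: {path}")
    return path


if __name__ == "__main__":
    try:
        print(wav_path_from_args(sys.argv))
    except ValueError as err:
        # Report the problem instead of handing an empty name to the recognizer.
        print(err, file=sys.stderr)
```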

    Next we have to modify our dialplan so that it uses a unique ID for each caller's file and passes that to the client:
    			[sve204_sphinxclientserver_new]
    			exten => s,1,Wait(1);
    			exten => s,n,System(/usr/bin/perl /home/sve204/sphinx/sphinxserver.pl &);
    			exten => s,n,Playback(/home/sve204/sphinx/tellmeyourname);
    			exten => s,n,Monitor(wav,/home/sve204/sphinx/name-${UNIQUEID});
    			exten => s,n,Wait(5);
    			exten => s,n,StopMonitor();
    			exten => s,n,Playback(/home/sve204/sphinx/ithinkyousaid);
    			exten => s,n,System(/usr/bin/perl /home/sve204/sphinx/sphinxclient_new.pl /home/sve204/sphinx/name-${UNIQUEID}-in.wav | /usr/bin/text2wave -scale 1.5 -F 8000 -o /home/sve204/sphinx/playbackname.wav);
    			exten => s,n,Playback(/home/sve204/sphinx/playbackname);
    			exten => s,n,Wait(5);
    			exten => s,n,Goto(sve204_sphinxclientserver_new,s,1);