Speech Recognition and VOIP

Speech recognition technology is quickly becoming ubiquitous. I'm willing to bet that at some point today you will come in contact with some form of it. But one area has presented a unique series of problems for speech recognition: voice over IP (VOIP).

See, speech recognition depends on the data the engine can find in the audio. Too much noise, or too little, and the acoustic model does not provide acceptable results. In other words, the audio coming in should be as close as possible to the audio used to build the acoustic models for the engine. It takes a lot of audio to create an acoustic model reliable enough for production use. In the past, the audio had to be meticulously transcribed and time-aligned, then fed into a software tool that would take the audio, timing and text information and churn through it to produce an acoustic model. Because of that effort, acoustic models for VOIP have been lacking. Since most VOIP terminates as a standard phone line, VOIP has usually been pushed through the acoustic models created for telephone-quality audio. As a result, recognition results for VOIP have been spotty at best.

My partners and I use VOIP extensively. Two years ago, we decided to embark on a series of experiments to see how hard it would be to come up with a reliable acoustic model for our VOIP use. The use case we concentrated on initially was one where a maintenance person completes a repair and simply "calls in" the results. Whether they called an 800 number or initiated a VOIP connection to the PBX, everything coming in was VOIP by the time it reached our engine. We wanted a way to create the acoustic model without having to time-align every single audio file used to create it. In fact, we didn't want to have to time-align ANY of the audio.

The goal of the acoustic model is to find the phonemes. In the English language there are roughly 42 phonemes that make up all of the words you can speak (the exact count varies with dialect and the phoneme set used). Other languages have more or fewer phonemes. We were interested only in English. Once you have the phonemes, the hard part is done and the language model takes over for correct token replacement. I created a process where the audio was captured and sent to a software tool running on a Linux box doing nothing but breaking the audio into phonemes. Then, a number of people were recruited to read known text into the system. This known text was converted to phonemes, and the known phonemes were compared to the generated phonemes to create an initial statistical model to gauge the accuracy of the acoustic model. For the first year, we also had the calls transcribed verbatim. If the caller coughed or sneezed, there was a marker; the ums and ahs are in there too.
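To make the comparison of known and generated phonemes concrete, here is a minimal sketch of one common way to score it: a phoneme error rate based on edit distance. This is an illustration, not necessarily the statistical model we built. The function names and the ARPAbet-style phoneme symbols are assumptions chosen for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # phoneme missing from hyp
                            curr[j - 1] + 1,      # extra phoneme in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the length of the known sequence."""
    if not ref:
        raise ValueError("reference phoneme sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical data: phonemes from the known text vs. what the engine produced.
known = ["R", "IH", "P", "EH", "R"]
generated = ["R", "IY", "P", "EH", "R"]
rate = phoneme_error_rate(known, generated)
print(f"phoneme error rate: {rate:.2f}")
```

A lower rate means the generated phonemes are closer to the known text. Averaged over many readers and many passages, a score like this gives a rough gauge of how the acoustic model is doing.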

In January of 2009, the callers started editing their own calls. For the first six months, nothing was suppressed, and they were encouraged to leave in any markers for the non-speech elements the system found, such as coughs, sneezes, ums and ahs, and to add markers for any it missed. They were assured this would allow us to have the system remove those elements before the text was added to the property maintenance records. They were able to verify this. After six months, the non-speech items were suppressed, and now when they edit their call dictation, they don't see OR HEAR the non-speech items.
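As a rough illustration of the text side of that suppression, the sketch below strips non-speech markers from a transcript. The bracketed marker format (for example "[cough]") and the function name are assumptions made for the example. The real markers and the audio-side suppression are not shown here.

```python
import re

# Hypothetical marker format: non-speech events appear as bracketed tags.
NON_SPEECH = re.compile(r"\[(?:cough|sneeze|um|ah)\]", re.IGNORECASE)


def suppress_non_speech(transcript):
    """Remove non-speech markers and collapse the leftover whitespace."""
    cleaned = NON_SPEECH.sub(" ", transcript)
    return re.sub(r"\s+", " ", cleaned).strip()


print(suppress_non_speech("Replaced the [um] faucet [cough] washer"))
```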

Now, with over 2 years of data, we are ready to take the next step. Anyone doing maintenance for me, or for any of my partners using this system, can simply call an 800 number, speak the reference number of the work order, explain in detail what they found and what they did to correct or complete the work order, and hang up. Then the magic happens.

The system transcribes the call, identifies the reference number, matches that with the property manager responsible for the property, and routes the transcribed text, audio and other data to them for verification and approval. Once the property manager approves the repair, the payment is authorized and submitted to the accounting system to cut the vendor a check or initiate an ACH deposit into their preferred account. This has greatly reduced the workload on the frontline property managers.
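The identify-and-route step can be sketched roughly as follows. This is a simplified illustration, not our backend. It assumes the transcript already contains the reference number as digits after a phrase like "work order". The lookup table, the address, the `Routing` type and the `route_call` function are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Routing:
    work_order: str
    manager: str
    transcript: str


# Hypothetical lookup: work order reference number -> responsible manager.
WORK_ORDERS = {"48213": "manager_a@example.com"}

# Assumed phrasing: "work order 48213" or "reference number 48213".
REF_PATTERN = re.compile(
    r"\b(?:work order|reference)(?: number)?\s+(\d{4,})\b", re.IGNORECASE
)


def route_call(transcript: str, work_orders: dict) -> Optional[Routing]:
    """Find the reference number and pair the call with its manager."""
    match = REF_PATTERN.search(transcript)
    if match is None:
        return None  # no reference number heard; needs manual review
    ref = match.group(1)
    manager = work_orders.get(ref)
    if manager is None:
        return None  # unknown work order; needs manual review
    return Routing(work_order=ref, manager=manager, transcript=transcript)


result = route_call("Work order 48213 replaced the kitchen faucet washer", WORK_ORDERS)
print(result)
```

Returning `None` when no match is found leaves room for a human to pick up calls the system cannot place, rather than guessing at a property manager.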

Obviously, a lot of detail has been left out because I believe the way we created this acoustic model was innovative. But the backend processes are even more innovative. The caller does not have to key in any identifying information. They just start talking and we figure it all out on the backend.

Next, we intend to employ the same technology to capture a maintenance request from the tenant or the property manager.