Speech Recognition with Python
Speech recognition technologies have been evolving rapidly for the last couple of years, and are transitioning from the realm of science to engineering. With the growing popularity of voice assistants like Alexa, Siri and Google Assistant, several apps (e.g., YouTube, Gana, Paytm Travel, My Jio) are beginning to have functionalities controlled by voice. At Slang Labs, we are building a platform for programmers to easily augment existing apps with voice experiences. We are very interested in Conversational AI for Indic languages.
Automatic Speech Recognition (ASR) is the necessary first step in processing voice. In ASR, an audio file or speech spoken into a microphone is processed and converted to text, which is why it is also known as Speech-to-Text (STT). This text is then fed to a Natural Language Processing/Understanding (NLP/NLU) component that extracts key information (such as intentions and sentiments), and an appropriate action is taken. There are also stand-alone applications of ASR, e.g. transcribing dictation, or producing real-time subtitles for videos.
We are interested in ASR and NLU in general, and their efficacy in the voice-to-action loop in apps in particular. Our Android and Web SDKs provide simple APIs suited to app programmers, while the Slang platform handles the complexity of stitching together ASR, NLU and Text-to-Speech (TTS). But, naturally, we are curious about the state of the art in ASR, NLU and TTS even though we do not expose these parts of our tech stack as separate SaaS offerings. This exploration of existing ASR solutions is the result of that curiosity.
Service vs. Software
There are two possibilities: make calls to a Speech-to-Text SaaS in the cloud, or host one of the ASR software packages in your application.
A service is the easiest way to start. You sign up for a SaaS, get a key or credentials, and you are all set to use it in your code, either through HTTP endpoints or through libraries in the programming language of your choice. However, at reasonably large usage, it typically costs more money.
Software packages offer you full control as you are hosting it, and also the possibility of creating smaller models tailored for your application, and deploying it on-device/edge without needing network connectivity. But it requires expertise and upfront efforts to train and deploy the models.
It is a reversible choice. For example, you can start with a cloud service and, if needed, move to your own deployment of a software package, or vice versa. You can design your code to limit the blast radius of such a reversal, or of a migration to another SaaS or software package.
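One way to limit that blast radius is to have application code depend on a small interface rather than on a vendor SDK. The sketch below is only an illustration: the `Transcriber` interface, the `EchoTranscriber` stand-in, and `handle_voice_command()` are hypothetical names, not part of any library discussed in this article.

```python
from typing import Iterable, Iterator, Protocol


class Transcriber(Protocol):
    """Hypothetical interface that the rest of the application depends on."""

    def transcribe(self, audio: bytes, sample_rate: int) -> str:
        ...

    def transcribe_stream(self, chunks: Iterable[bytes], sample_rate: int) -> Iterator[str]:
        ...


class EchoTranscriber:
    """Stand-in backend that only shows the wiring; it does no recognition."""

    def transcribe(self, audio, sample_rate):
        return f"<{len(audio)} bytes at {sample_rate} Hz>"

    def transcribe_stream(self, chunks, sample_rate):
        total = 0
        for chunk in chunks:
            total += len(chunk)
            yield f"<{total} bytes so far>"


def handle_voice_command(transcriber: Transcriber, audio: bytes, sample_rate: int) -> str:
    # Application code sees only the interface, so switching from a cloud
    # service to a hosted package means writing one new adapter class.
    return transcriber.transcribe(audio, sample_rate).strip().lower()
```

A real adapter would wrap one of the services or packages covered later and expose the same two methods.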
Batch vs. Streaming
You need to determine whether your application requires batch ASR or streaming ASR.
Batch: If you have audio recordings that need to be transcribed offline, then batch processing will suffice and is also more economical. In a batch API, an audio file is passed as a parameter, and speech-to-text transcription is done in one shot.
Streaming: If you need to process speech in real time (e.g. in voice-controlled applications or video subtitles), you will need a streaming API. A streaming API is invoked repeatedly with the available chunks of the audio buffer. It may send interim results, but the final result is available only at the end.
All services and software packages have batch APIs, but some lack streaming APIs at the moment. So if you have a streaming application, that eliminates some of the choices.
Choice of Python
Most speech services provide libraries in popular programming languages. In the worst case, you can always use HTTP endpoints. The same is true for speech packages: they come with bindings in various programming languages, and in the worst case, you can create bindings yourself. So you are not constrained to Python.
I am choosing Python for this article because most speech cloud services and ASR software packages have Python libraries. Also, you can run code snippets of the article using its companion Colab notebook in the browser, without requiring anything to be installed on your computer.
One common use case is to collect audio from a microphone and pass the buffer (batch or streaming) to the speech recognition API. Invariably, in such transcribers, the microphone is accessed through PyAudio, which is implemented over PortAudio. But since the microphone is not accessible on Colab, we simplify it. We will use a complete audio file to examine the batch API. For the streaming API, we will break an audio file into chunks and simulate a stream.
How to best use the rest of the article
The following services and software packages are covered.
- Google Speech-to-Text
- Microsoft Azure Speech
- IBM Watson Speech to Text
- Amazon Transcribe
- CMU Sphinx
- Mozilla DeepSpeech
- Facebook wav2letter
Code samples are not provided for Amazon Transcribe, Nuance, Kaldi, and Facebook wav2letter due to some peculiarity or limitation (listed in their respective sections). Instead, links to code samples and resources are given.
The next section has the common utility functions and test cases. The last section covers the Python SpeechRecognition package, which provides an abstraction over the batch APIs of several cloud services and software packages.
If you want to have an overview of all services and software packages, then please open the Colab, and execute the code as you read this post. If you are interested only in a specific service or package, directly jump to that section. But in either case, do play with the code in Colab to explore it better.
Let’s plunge into the code.
Download the audio files we will use for testing Speech Recognition services and software packages:
It has three audio files. Define test cases with needed metadata:
Also, write some utility functions. The read_wav_file() takes the path to the audio file, and returns the buffer bytes and sample rate:
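A minimal sketch of such a helper, using only the standard-library `wave` module and assuming uncompressed PCM WAV input, might look like this:

```python
import wave


def read_wav_file(filename):
    """Return (raw PCM bytes, sample rate in Hz) for a WAV file."""
    with wave.open(filename, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        buffer = wav.readframes(frames)
    return buffer, rate
```

The returned buffer holds the raw audio frames without the WAV header, which is the form most of the APIs below expect.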
The simulate_stream() function is useful for simulating a stream to try streaming APIs. Usually, there will be an audio source like a microphone. At regular intervals, the microphone will generate a speech chunk, which has to be passed to the streaming API. The simulate_stream() function helps to avoid all that complexity and to focus on the APIs. It takes an audio buffer and a batch size, and generates chunks of that size. Notice the yield buf statement in the following:
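A minimal sketch of such a generator follows. The default batch size of 4096 bytes is an assumption chosen for illustration.

```python
def simulate_stream(buffer, batch_size=4096):
    """Yield successive chunks of `buffer`, each at most `batch_size` bytes."""
    offset = 0
    while offset < len(buffer):
        buf = buffer[offset:offset + batch_size]
        offset += batch_size
        yield buf
```

Because it is a generator, chunks are produced lazily, one per iteration, much as a microphone callback would deliver them. The last chunk may be shorter than the batch size.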
Google Speech-to-Text
You will need your Google Cloud credentials. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the credentials file:
Using the batch speech-to-text API is straightforward. You need to create a SpeechClient, create a config with the audio metadata, and call the recognize() method of the speech client.
When you run this, you will see the text of each of the audio test files in the output:
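As a sketch of those three steps, assuming the 2.x line of the `google-cloud-speech` package and 16-bit mono PCM audio (the function name `google_batch_stt` is hypothetical):

```python
def google_batch_stt(content, sample_rate, lang="en-US"):
    """Transcribe a complete buffer of 16-bit mono PCM audio in one request."""
    # Imported inside the function so the module can be loaded
    # even when google-cloud-speech is not installed.
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate,
        language_code=lang,
    )
    audio = speech.RecognitionAudio(content=content)
    response = client.recognize(config=config, audio=audio)
    # Each result covers a consecutive portion of the audio; the first
    # alternative is the most likely transcription for that portion.
    return " ".join(
        result.alternatives[0].transcript.strip()
        for result in response.results
        if result.alternatives
    )
```

Older 1.x releases of the client library use a different import layout (`enums` and `types` modules), so the calls would need adjusting there.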
Google’s streaming API is also quite simple. For processing audio stream, you can repeatedly call the streaming API with the available chunk of audio, and it will return you interim results:
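A sketch of the streaming call, again assuming `google-cloud-speech` 2.x and 16-bit mono PCM chunks; `google_streaming_stt` is a hypothetical name, and `chunks` can be any iterable of byte strings, such as the output of a stream simulator:

```python
def google_streaming_stt(chunks, sample_rate, lang="en-US"):
    """Yield (is_final, transcript) pairs as the service returns results."""
    # Imported lazily so the module loads without the package installed.
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate,
        language_code=lang,
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config,
        interim_results=True,  # ask for partial hypotheses, not just the final one
    )
    # One request per audio chunk; the client sends the config first.
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk) for chunk in chunks
    )
    responses = client.streaming_recognize(config=streaming_config, requests=requests)
    for response in responses:
        for result in response.results:
            if result.alternatives:
                yield result.is_final, result.alternatives[0].transcript
```

Results with `is_final` set to False are interim hypotheses that later results may revise.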
In the output, you can see that the result improves as more audio is fed:
Microsoft Azure Speech
Install Azure speech package:
Azure’s batch API is simple too. It takes a config and audio input, and returns the text:
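A sketch of that call with the `azure-cognitiveservices-speech` package; the function name and its parameters are hypothetical, and the subscription key and region come from your own Azure resource:

```python
def azure_batch_stt(filename, subscription_key, region):
    """Recognize a single utterance from a WAV file and return its text."""
    # Imported lazily so the module loads without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=filename)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )
    # recognize_once() returns after the first recognized utterance.
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return ""
```

For recordings that contain several utterances, continuous recognition is the better fit.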
The output will be as follows:
Azure has several kinds of streaming API. By creating different types of audio sources, one can either push the audio chunks, or pass a callback to Azure to pull the audio chunks. It fires several types of speech recognition events to which you can hook up callbacks. Here is how you can wire a push audio stream with the audio stream generator:
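A sketch of the push-stream wiring, assuming the default push stream format (16 kHz, 16-bit, mono PCM); the function name, the printed labels, and the 60-second timeout are illustrative choices:

```python
import threading


def azure_streaming_stt(chunks, subscription_key, region):
    """Push audio chunks to Azure and print interim and final results."""
    # Imported lazily so the module loads without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
    stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=stream)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    done = threading.Event()
    # Hook callbacks to the recognition events.
    recognizer.recognizing.connect(lambda evt: print("interim:", evt.result.text))
    recognizer.recognized.connect(lambda evt: print("final:", evt.result.text))
    recognizer.session_stopped.connect(lambda evt: done.set())
    recognizer.canceled.connect(lambda evt: done.set())

    recognizer.start_continuous_recognition()
    for chunk in chunks:
        stream.write(chunk)
    stream.close()  # signals end of audio
    done.wait(timeout=60)
    recognizer.stop_continuous_recognition()
```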
The output for the first test case looks like:
IBM Watson Speech to Text
You will need to sign up or sign in, get an API key credential and a service URL, and fill them in below.
The batch API is predictably simple:
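A sketch with the `ibm-watson` package, assuming IAM API-key authentication and a WAV file as input; `watson_batch_stt` and its parameters are hypothetical names:

```python
def watson_batch_stt(filename, api_key, service_url):
    """Transcribe a WAV file with Watson and return the joined transcript."""
    # Imported lazily so the module loads without the SDK installed.
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
    from ibm_watson import SpeechToTextV1

    stt = SpeechToTextV1(authenticator=IAMAuthenticator(api_key))
    stt.set_service_url(service_url)
    with open(filename, "rb") as audio_file:
        response = stt.recognize(
            audio=audio_file, content_type="audio/wav"
        ).get_result()
    # The response is a dict; each result carries a list of alternatives.
    return " ".join(
        result["alternatives"][0]["transcript"].strip()
        for result in response["results"]
    )
```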
Here is the output:
Watson’s streaming API works over WebSocket, and takes a little bit of work to set it all up. It has the following steps:
- Create a RecognizeCallback object for receiving speech recognition notifications and results.
- Create a buffer queue. Audio chunks produced by the microphone (or stream simulator) should be written to this queue, and Watson reads and consumes the chunks.
- Start a thread in which speech recognition (along with WebSocket communication) executes.
- Start microphone or speech simulator, to start producing audio chunks
- Upon completion, join the speech recognition thread (i.e. wait till it completes).
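The steps above can be sketched as follows, assuming the `ibm-watson` package and raw 16-bit PCM chunks. The function name, the `PrintingCallback` class, and the printed labels are hypothetical.

```python
import queue
import threading


def watson_streaming_stt(chunks, sample_rate, api_key, service_url):
    """Feed audio chunks to Watson over WebSocket and print the results."""
    # Imported lazily so the module loads without the SDK installed.
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
    from ibm_watson import SpeechToTextV1
    from ibm_watson.websocket import AudioSource, RecognizeCallback

    # Step 1: callback object that receives notifications and results.
    class PrintingCallback(RecognizeCallback):
        def on_hypothesis(self, hypothesis):
            print("interim:", hypothesis)

        def on_transcription(self, transcript):
            print("final:", transcript)

        def on_error(self, error):
            print("error:", error)

    stt = SpeechToTextV1(authenticator=IAMAuthenticator(api_key))
    stt.set_service_url(service_url)

    # Step 2: buffer queue that Watson reads audio chunks from.
    buffer_queue = queue.Queue()
    audio_source = AudioSource(buffer_queue, is_recording=True, is_buffer=True)

    # Step 3: run recognition (and the WebSocket traffic) in its own thread.
    worker = threading.Thread(
        target=stt.recognize_using_websocket,
        kwargs={
            "audio": audio_source,
            "content_type": f"audio/l16; rate={sample_rate}",
            "recognize_callback": PrintingCallback(),
            "interim_results": True,
        },
    )
    worker.start()

    # Step 4: produce audio chunks (here from a simulated stream).
    for chunk in chunks:
        buffer_queue.put(chunk)
    audio_source.completed_recording()  # tell Watson no more audio is coming

    # Step 5: wait for the recognition thread to finish.
    worker.join()
```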
Amazon Transcribe Python APIs currently do not facilitate use cases covered in this article, and therefore code samples are not included here.
Nuance is most probably the oldest commercial speech recognition product, and it has even been customised for various domains and industries. They do have Python bindings for a speech recognition service. Here is a code sample in their GitHub repo.
I could not figure out a way to create a developer account. I hope there is a way to get limited-period free trial credits, as with other products, and obtain the credentials needed to access the services.
CMU Sphinx
First, install swig. On macOS, you can install it using brew:
On Linux, you can use apt-get:
And then install pocketsphinx using pip:
Create a Decoder Object
Whether you use batch or streaming API, you will require a decoder object:
Batch API is expectedly simple, just a couple of lines of code:
And you will see the now-familiar output:
Notice the errors in transcription. With more training data, it typically improves.
Streaming APIs are also quite simple, but there is no hook to get intermediate results:
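To show the shape of both call patterns, here is a sketch that takes an already-created pocketsphinx decoder object as a parameter. The function names are hypothetical, and the positional arguments to `process_raw()` are `no_search` and `full_utt`.

```python
def sphinx_batch_stt(decoder, audio):
    """Decode a complete buffer of 16-bit PCM audio as a single utterance."""
    decoder.start_utt()
    decoder.process_raw(audio, False, True)  # full_utt=True: whole utterance at once
    decoder.end_utt()
    hyp = decoder.hyp()
    return hyp.hypstr if hyp else ""


def sphinx_streaming_stt(decoder, chunks):
    """Feed audio chunk by chunk and return the hypothesis at the end."""
    decoder.start_utt()
    for chunk in chunks:
        decoder.process_raw(chunk, False, False)  # full_utt=False: more audio follows
    decoder.end_utt()
    hyp = decoder.hyp()
    return hyp.hypstr if hyp else ""
```

The only difference between the two is whether the audio arrives in one call or in many.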
You can install DeepSpeech with pip (make it deepspeech-gpu==0.6.0 if you want to use GPU in Colab runtime or on your machine):
Download and unzip the models (this will take a while):
Test that it all works. Examine the output of the last three commands, and you will see results “experience proof less”, “why should one halt on the way”, and “your power is sufficient i said” respectively. You are all set.
Create a Model Object
The first step is to read the model files and create a DeepSpeech model object.
It takes just a couple of lines of code to do batch speech-to-text:
The DeepSpeech streaming API requires creating a stream context and using it repeatedly to feed chunks of audio:
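A sketch of that loop against the DeepSpeech 0.6.0 Python API, where the stream context is passed back to methods on the model object (later releases moved these methods onto the stream object itself). The function name is hypothetical, and `model` is assumed to be an already-loaded DeepSpeech model.

```python
def deepspeech_streaming_stt(model, chunks):
    """Yield the interim transcript after each chunk, then the final one."""
    import numpy as np  # DeepSpeech expects 16-bit samples as a NumPy array

    context = model.createStream()
    for chunk in chunks:
        # Each chunk must hold whole 16-bit samples (an even number of bytes).
        samples = np.frombuffer(chunk, dtype=np.int16)
        model.feedAudioContent(context, samples)
        yield model.intermediateDecode(context)
    # finishStream() returns the final transcript and releases the context.
    yield model.finishStream(context)
```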
DeepSpeech returns interim results:
Kaldi is a very popular speech recognition toolkit in the research community. It is designed for experimenting with different research ideas and possibilities, and it has a rich collection of techniques and alternatives. The learning curve is steeper than for the other alternatives discussed in the code lab.
There is no pre-built, ready-to-use PyPI package, so you have to either build it from source or install it from Conda. Neither option suits the Colab environment.
Facebook released wav2letter@anywhere in January 2020. It boasts a fully convolutional (CNN) acoustic model instead of the recurrent neural network (RNN) used by other solutions. It is very promising, including for use on edge devices. It has Python bindings for its inference framework.
Like Kaldi, it does not provide a PyPI package, and needs to be built and installed from source.
SpeechRecognition Python Package
The SpeechRecognition package provides a nice abstraction over several solutions. We have already explored the Google service and the CMU Sphinx package. Now we will use these through the SpeechRecognition package APIs. It can be installed using pip:
SpeechRecognition has only a batch API. The first step is to create an audio record, either from a file or from a microphone, and the second step is to call the recognize_<speech engine name>() function. It currently has APIs for CMU Sphinx, Google, Microsoft, IBM, Houndify, and Wit. Let's check out one cloud service (Google) and one software package (Sphinx) through the SpeechRecognition abstraction.
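A sketch of those two steps follows. The function name `sr_transcribe` and its `engine` parameter are hypothetical; `recognize_google()` calls Google's Web Speech API using the package's built-in default key, and `recognize_sphinx()` needs pocketsphinx installed.

```python
def sr_transcribe(filename, engine="google"):
    """Transcribe an audio file through the SpeechRecognition abstraction."""
    # Imported lazily so the module loads without the package installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    # Step 1: create an audio record from the file.
    with sr.AudioFile(filename) as source:
        audio = recognizer.record(source)
    # Step 2: call the recognize_* function for the chosen engine.
    try:
        if engine == "sphinx":
            return recognizer.recognize_sphinx(audio)
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        # Raised when the engine could not understand the audio.
        return ""
```

Swapping the engine changes a single call, which is the main appeal of this package.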
API for other providers
For other speech recognition providers, you will need to create API credentials and pass them to the recognize_<speech engine name>() function. You can check out this example.
Want to write a Python transcriber that converts microphone input to text? Check this out.