How to build a Python transcriber using Mozilla DeepSpeech
Voice assistants and conversational AI are among the hottest technologies right now. Siri, Alexa, and Google Assistant all aim to let you talk to computers, not just touch and type. Automated Speech Recognition (ASR) and Natural Language Understanding (NLU/NLP) are the key technologies enabling them. If you are just-a-programmer like me, you might be itching to get a piece of the action and hack something. You are at the right place; read on.
Let’s get started
You need a computer with Python 3.6.5+ installed, a good internet connection, and elementary Python programming skills. Even if you do not know Python, read along; it is not that hard. If you don't want to install anything, you can try out the DeepSpeech APIs in the browser using this code lab.
Let’s do the needed setup:
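The original setup commands are not shown above; a sketch, assuming the DeepSpeech 0.6.0 release artifacts and sample audio published on Mozilla's GitHub releases page, might look like:

```shell
# Install the DeepSpeech Python package (requires Python 3.6.5+)
pip3 install deepspeech==0.6.0

# Download and unpack the pre-trained English model
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.6.0/deepspeech-0.6.0-models.tar.gz
tar -xvzf deepspeech-0.6.0-models.tar.gz

# Download and unpack the sample audio clips
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.6.0/audio-0.6.0.tar.gz
tar -xvzf audio-0.6.0.tar.gz

# Transcribe the three samples with the command-line client
deepspeech --model deepspeech-0.6.0-models/output_graph.pbmm --audio audio/2830-3980-0043.wav
deepspeech --model deepspeech-0.6.0-models/output_graph.pbmm --audio audio/4507-16021-0012.wav
deepspeech --model deepspeech-0.6.0-models/output_graph.pbmm --audio audio/8455-210777-0068.wav
```

The exact file names and release version are assumptions; adjust them to whatever you downloaded.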
Examine the output of the last three commands, and you will see the results “experience proof less”, “why should one halt on the way”, and “your power is sufficient i said”, respectively. You are all set.
The API is quite simple. You first need to create a model object using the model files you downloaded:
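The original snippet is not shown here; a minimal sketch, assuming the 0.6 API (where the `Model` constructor takes the model path and a beam width) and the model files unpacked to `deepspeech-0.6.0-models/`, could be:

```python
import deepspeech

# Assumed path of the acoustic model from the 0.6.0 release bundle
model_file_path = 'deepspeech-0.6.0-models/output_graph.pbmm'

# Beam width for the CTC decoder; 500 is an assumed, commonly used value
beam_width = 500

model = deepspeech.Model(model_file_path, beam_width)
```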
You should add a language model for better accuracy:
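A sketch of this step, assuming the language model files shipped in the same release bundle and the alpha/beta hyperparameter values suggested for the 0.6 release (both assumptions):

```python
# Language model and trie from the release bundle (assumed paths)
lm_file_path = 'deepspeech-0.6.0-models/lm.binary'
trie_file_path = 'deepspeech-0.6.0-models/trie'

# Decoder hyperparameters (assumed values)
lm_alpha = 0.75
lm_beta = 1.85

# Attach the language model to the model object created earlier
model.enableDecoderWithLM(lm_file_path, trie_file_path, lm_alpha, lm_beta)
```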
Once you have the model object, you can use either batch or streaming speech-to-text API.
To use the batch API, the first step is to read the audio file:
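The reading step can be sketched with Python's standard `wave` module; the file path below is an assumed sample clip from the release bundle:

```python
import wave

def read_wav_file(filename):
    # Return the sample rate and the raw frame bytes of a wav file
    with wave.open(filename, 'rb') as w:
        rate = w.getframerate()
        frames = w.getnframes()
        buffer = w.readframes(frames)
    return rate, buffer

# Assumed path: one of the sample clips from the release bundle
# rate, buffer = read_wav_file('audio/2830-3980-0043.wav')
```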
As you can see, the sample rate of the wav file is 16000 Hz, the same as the model’s sample rate. But the buffer is a byte array, whereas the DeepSpeech model expects a 16-bit int array. Let’s convert it:
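NumPy makes this conversion a one-liner, since the byte string can simply be reinterpreted as 16-bit integers; a sketch:

```python
import numpy as np

def bytes_to_int16(buffer):
    # Reinterpret the raw wav byte string as an array of 16-bit samples
    return np.frombuffer(buffer, dtype=np.int16)

# data16 = bytes_to_int16(buffer)
```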
Run speech-to-text in batch mode to get the text:
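With the model object and the 16-bit sample array in hand, batch inference is a single call (a sketch; `model` and `data16` are assumed from the previous steps):

```python
# Transcribe the whole clip in one shot
text = model.stt(data16)
print(text)
```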
Now let’s accomplish the same using the streaming API. It consists of three steps: open a session, feed data, and close the session.
Open a streaming session:
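In the 0.6 API, opening a session returns a stream context object that is then passed to the feed, decode, and finish calls on the model; a sketch:

```python
# Create a streaming inference state on the model object
context = model.createStream()
```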
Repeatedly feed chunks of speech buffer, and get interim results if desired:
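The chunking can be sketched as below; the chunk size is an assumed value, and the feeding loop (commented out, since it needs the model and audio from the previous steps) shows how each chunk would be pushed into the stream with an interim decode after it:

```python
import numpy as np

def speech_chunks(buffer, batch_size=8192):
    # Yield successive byte chunks of the audio buffer
    for offset in range(0, len(buffer), batch_size):
        yield buffer[offset:offset + batch_size]

# Feeding loop, assuming model, context, and buffer from earlier steps:
# for chunk in speech_chunks(buffer):
#     model.feedAudioContent(context, np.frombuffer(chunk, dtype=np.int16))
#     print(model.intermediateDecode(context))  # interim result
```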
Close stream and get the final result:
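Closing the session finalizes decoding and frees the stream state; a sketch, assuming `model` and `context` from the previous steps:

```python
# Finish the stream and obtain the final transcript
text = model.finishStream(context)
print(text)
```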
A transcriber consists of two parts: a producer that captures voice from the microphone, and a consumer that converts this speech stream to text. The two execute in parallel: the audio recorder keeps producing chunks of the speech stream, while the speech recognizer listens to this stream, consumes the chunks upon arrival, and updates the transcribed text.
PyAudio provides Python bindings for PortAudio, and you can install it with pip:
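A sketch of the installation; the PortAudio development package name is the Debian/Ubuntu one, an assumption that will differ on other platforms:

```shell
# PortAudio headers are needed to build PyAudio on Debian/Ubuntu
sudo apt-get install portaudio19-dev

pip3 install pyaudio
```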
PyAudio has two modes: blocking, where data has to be read (pulled) from the stream; and non-blocking, where a callback function is passed to PyAudio for feeding (pushing) the audio data stream. The non-blocking mechanism suits the transcriber. The data-buffer-processing code using the DeepSpeech streaming API has to be wrapped in a callback:
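A sketch of such a callback, following PyAudio's callback signature; `model` and `context` are assumed from the earlier DeepSpeech setup:

```python
import numpy as np
import pyaudio

def process_audio(in_data, frame_count, time_info, status):
    # Invoked by PyAudio for every captured chunk: feed the 16-bit
    # samples to the DeepSpeech stream and print the interim result
    data16 = np.frombuffer(in_data, dtype=np.int16)
    model.feedAudioContent(context, data16)
    print(model.intermediateDecode(context))
    # paContinue keeps the stream running
    return (in_data, pyaudio.paContinue)
```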
Now you have to create a PyAudio input stream with this callback:
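A sketch of the input stream setup; the frames-per-buffer value is an assumption, while the format, channel count, and rate follow from the model's requirements (16-bit mono at 16000 Hz):

```python
import pyaudio

audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paInt16,   # 16-bit samples, as the model expects
    channels=1,               # mono
    rate=16000,               # model sample rate
    input=True,
    frames_per_buffer=1024,   # chunk size (assumed value)
    stream_callback=process_audio,
)
stream.start_stream()
```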
Finally, you need to print the final result and clean up when the user ends recording by pressing Ctrl-C:
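A sketch of the shutdown path, assuming `stream`, `audio`, `model`, and `context` from the previous steps; Ctrl-C raises `KeyboardInterrupt`, which triggers the cleanup:

```python
import time

try:
    # Keep the main thread alive while PyAudio runs the callback
    while stream.is_active():
        time.sleep(0.1)
except KeyboardInterrupt:
    # Stop recording and release the audio device
    stream.stop_stream()
    stream.close()
    audio.terminate()
    # Close the DeepSpeech stream and print the final transcript
    print('Transcript:', model.finishStream(context))
```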
That’s all it takes, just 66 lines of Python code to put it all together: ds-transcriber.py.
In this article, you had a quick introduction to the batch and streaming APIs of DeepSpeech 0.6, and learned how to marry them with PyAudio to create a speech transcriber. The ASR model used here is for US English speakers; accuracy will vary for other accents. The same code will work for other languages or accents by swapping in the appropriate model.
Did you enjoy building it? Any feedback, improvements, suggestions? What other voice application would you like to build? Do let me know in the comments. Thanks for reading.