Add Voice Commands to Android Apps

A guide to adding voice commands to mobile applications.

“Okay Google, take me home.”
“Hey Siri, play some music.”

If you have ever used voice commands on a mobile device, you have probably tried something similar. The voice assistants bundled with your phone are getting better every day and can help you do a lot of things. But can they help you once you are inside an app? Can they help you navigate the maze of each app’s UI design? Can they help you use voice for tasks once you open the app?

Wait, what? What do you mean by voice “inside” the app?

Yes. Have you ever wondered if you could add voice commands to an app? Imagine getting things done just by speaking to the app rather than wriggling through the UI via touch, just like in sci-fi movies.

We are going to do just that, with an Android app today.

No, I am not going to ask you to shell out loads of time and money to build some fancy deep learning based Automatic Speech Recognition system as a prerequisite. We will make use of the freely available SpeechRecognizer class in Android to complete this project.

However, if you are interested in building out your own ASR system, you can refer to our article linked below.

We will create a new Android Studio project for the purpose of this experiment.

There are five steps that need to be followed to enable our app to respond to voice commands:

  1. Set up audio permissions.
  2. Create a UI element which will act as a microphone trigger.
  3. Initialize and set up the SpeechRecognizer instance.
  4. Create a list of supported utterances which will serve as the domain for voice commands in our app.
  5. Execute actions in the app depending on the voice command given by the user.

Set up audio permissions

We will need to declare the audio permission requirement by adding the following line in the AndroidManifest.xml file of our app.
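The permission in question is RECORD_AUDIO. As a minimal sketch, the declaration sits directly inside the manifest element, next to (not inside) the application element. The rest of the manifest is elided here and will differ in your project.

```xml
<manifest xmlns:android="http://schemas.android.com/apk/res/android">

    <!-- Lets the app record audio from the microphone -->
    <uses-permission android:name="android.permission.RECORD_AUDIO" />

    <application>
        <!-- activities are declared here -->
    </application>
</manifest>
```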

Before our app can start listening to user input, we need to run a permission check, and if the audio permission has not been granted, request the user to allow access to the microphone.

Note: You can choose to ask for audio permission on app startup, or during run time when the user tries to talk to your app by clicking on the designated UI element. Here we will ask for audio permissions on app startup.

We will also need to override the onRequestPermissionsResult() function in the MainActivity to know if the user approved or denied our permission request.
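The permission flow described above could look roughly like the sketch below. It assumes an AndroidX project where MainActivity extends AppCompatActivity. The request code REQUEST_RECORD_AUDIO, the helper name checkAudioPermission() and the toast message are our own illustrative choices.

```java
import android.Manifest;
import android.content.pm.PackageManager;
import android.os.Bundle;
import android.widget.Toast;

import androidx.annotation.NonNull;
import androidx.appcompat.app.AppCompatActivity;
import androidx.core.app.ActivityCompat;
import androidx.core.content.ContextCompat;

public class MainActivity extends AppCompatActivity {

    // Hypothetical request code; any int that is unique within this activity works.
    private static final int REQUEST_RECORD_AUDIO = 1;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        checkAudioPermission(); // ask on app startup
    }

    // Requests RECORD_AUDIO only if it has not been granted yet.
    private void checkAudioPermission() {
        if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
                != PackageManager.PERMISSION_GRANTED) {
            ActivityCompat.requestPermissions(
                    this, new String[] {Manifest.permission.RECORD_AUDIO}, REQUEST_RECORD_AUDIO);
        }
    }

    @Override
    public void onRequestPermissionsResult(
            int requestCode, @NonNull String[] permissions, @NonNull int[] grantResults) {
        super.onRequestPermissionsResult(requestCode, permissions, grantResults);
        if (requestCode != REQUEST_RECORD_AUDIO) {
            return;
        }
        // An empty grantResults array means the request was cancelled.
        boolean granted = grantResults.length > 0
                && grantResults[0] == PackageManager.PERMISSION_GRANTED;
        if (!granted) {
            Toast.makeText(this, "Microphone access is needed for voice commands",
                    Toast.LENGTH_LONG).show();
        }
    }
}
```

If the user denies the request, the app keeps working but the microphone trigger will not be able to start an audio session, so it is worth telling the user why.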

Add UI element for microphone trigger

The next step is to add a microphone button, a.k.a. the trigger. Whenever the user clicks on the trigger, we will start a new audio session. For this experiment, let us take the image of a microphone as our trigger icon. We will add the image resource to the project, and then add the ImageView to activity_main.xml. We will also add a TextView to show the user’s detected utterance when they are speaking.
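A bare-bones activity_main.xml with these two views might look like the sketch below. The view ids (utterance_text, mic_trigger), the drawable name ic_mic and the choice of a LinearLayout are assumptions made for illustration; use whatever fits your project.

```xml
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:gravity="center"
    android:orientation="vertical">

    <!-- Shows the utterance detected while the user is speaking -->
    <TextView
        android:id="@+id/utterance_text"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content" />

    <!-- Microphone trigger; ic_mic is a hypothetical drawable added to the project -->
    <ImageView
        android:id="@+id/mic_trigger"
        android:layout_width="72dp"
        android:layout_height="72dp"
        android:contentDescription="Start voice input"
        android:src="@drawable/ic_mic" />
</LinearLayout>
```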

We will also add a click listener to our trigger icon to start and stop audio sessions.

Set up the SpeechRecognizer instance

Now onto the exciting stuff. Using the SpeechRecognizer instance, we will be able to listen to the user’s commands and execute actions inside our app.

Add a global instance of SpeechRecognizer called mSpeechRecognizer to MainActivity, and add a function to initialize it.

We will also add a RecognitionListener to mSpeechRecognizer which enables us to be notified at each stage of the audio session. The two callbacks we are interested in are onPartialResults() and onResults(). We will fill up these callbacks later on. You can ignore the other callbacks - we will not be needing them for this experiment.
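As a rough sketch, the initialization and the listener skeleton could look like this inside MainActivity. The method name initSpeechRecognizer() is our own, and the availability check and the cleanup in onDestroy() are additions we would suggest rather than strict requirements.

```java
import android.os.Bundle;
import android.speech.RecognitionListener;
import android.speech.SpeechRecognizer;

// The members below go inside MainActivity.
private SpeechRecognizer mSpeechRecognizer;

// Hypothetical helper; call it from onCreate() on the main thread.
private void initSpeechRecognizer() {
    if (!SpeechRecognizer.isRecognitionAvailable(this)) {
        return; // no speech recognition service on this device
    }
    mSpeechRecognizer = SpeechRecognizer.createSpeechRecognizer(this);
    mSpeechRecognizer.setRecognitionListener(new RecognitionListener() {
        // Callbacks we do not need for this experiment are left empty.
        @Override public void onReadyForSpeech(Bundle params) {}
        @Override public void onBeginningOfSpeech() {}
        @Override public void onRmsChanged(float rmsdB) {}
        @Override public void onBufferReceived(byte[] buffer) {}
        @Override public void onEndOfSpeech() {}
        @Override public void onError(int error) {}
        @Override public void onEvent(int eventType, Bundle params) {}

        @Override public void onPartialResults(Bundle partialResults) {
            // to be filled in later
        }

        @Override public void onResults(Bundle results) {
            // to be filled in later
        }
    });
}

@Override
protected void onDestroy() {
    super.onDestroy();
    if (mSpeechRecognizer != null) {
        mSpeechRecognizer.destroy(); // release the recognition service
    }
}
```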

Before starting an audio session, we also need to define a speech recognition intent, which will be passed to mSpeechRecognizer.

The EXTRA_LANGUAGE_MODEL is a required extra for ACTION_RECOGNIZE_SPEECH. We are also setting the EXTRA_PARTIAL_RESULTS to true so we get notified of partial speech results when the user starts speaking, via the onPartialResults() callback.

The EXTRA_LANGUAGE is optional and sets the locale for speech recognition. There are a few other optional extras, which you can check out here.
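Putting those extras together, the intent could be built along these lines. The helper name createRecognitionIntent(), the free-form language model and the "en-US" language tag are assumptions for this sketch; pick the model and locale that suit your app.

```java
import android.content.Intent;
import android.speech.RecognizerIntent;

// Goes inside MainActivity; the method name is hypothetical.
private Intent createRecognitionIntent() {
    Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
    // Required extra: tells the recognizer which language model to use.
    intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
            RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
    // Deliver intermediate hypotheses through onPartialResults().
    intent.putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true);
    // Optional: IETF language tag for recognition (assumed to be US English here).
    intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-US");
    return intent;
}
```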

Let us now handle the audio sessions inside the click handler for our trigger icon.
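One way to wire this up is sketched below. It assumes the recognizer and the intent have already been created, and it uses a hypothetical view id (mic_trigger), a hypothetical field mRecognitionIntent and a simple mIsListening flag of our own to toggle between starting and stopping a session.

```java
import android.content.Intent;
import android.speech.SpeechRecognizer;
import android.widget.ImageView;

// The members below go inside MainActivity.
// Both are assumed to be initialized before the trigger is clicked.
private SpeechRecognizer mSpeechRecognizer;
private Intent mRecognitionIntent;      // hypothetical field holding the recognition intent
private boolean mIsListening = false;   // hypothetical flag tracking the session state

// Hypothetical helper; call it from onCreate() after setContentView().
private void setUpMicTrigger() {
    ImageView micTrigger = findViewById(R.id.mic_trigger); // hypothetical view id
    micTrigger.setOnClickListener(view -> {
        if (mIsListening) {
            mSpeechRecognizer.stopListening();
        } else {
            mSpeechRecognizer.startListening(mRecognitionIntent);
        }
        mIsListening = !mIsListening;
    });
}
```

Since a session also ends on its own once the user stops speaking, it is worth resetting the flag in onResults() and onError() as well.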

Once an audio session starts and the user starts speaking, we will be notified of partial and complete results in the callbacks. Let us define those callbacks.
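A minimal version of the two callbacks might look like this. They replace the empty stubs inside the RecognitionListener. mUtteranceText is a hypothetical TextView field of MainActivity, and we simply take the first candidate, which the recognizer reports as the most likely one.

```java
import android.os.Bundle;
import android.speech.SpeechRecognizer;

import java.util.ArrayList;

// These two methods go inside the RecognitionListener implementation.
// mUtteranceText is a hypothetical TextView field on MainActivity.

@Override
public void onPartialResults(Bundle partialResults) {
    ArrayList<String> matches =
            partialResults.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
    if (matches != null && !matches.isEmpty()) {
        // Show what has been recognized so far while the user is still speaking.
        mUtteranceText.setText(matches.get(0));
    }
}

@Override
public void onResults(Bundle results) {
    ArrayList<String> matches =
            results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
    if (matches != null && !matches.isEmpty()) {
        String utterance = matches.get(0); // first element is the most likely candidate
        mUtteranceText.setText(utterance);
        handleCommand(utterance);          // executes the matching app action
    }
}
```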

Create a list of app supported actions

Let us define certain actions the app can perform, and add the utterances for all those actions to an array. These depend on the app: for example, an e-commerce app would have a different set of actions than a messaging app.

We will define actions according to an e-commerce app for this experiment.

So our array of user command utterances would look something like this.
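As an illustration only, such a list could be declared as below. The four utterances and the class name SupportedCommands are made up for this sketch; your own app's commands would go here.

```java
import java.util.Arrays;
import java.util.List;

class SupportedCommands {
    // Hypothetical utterances for an e-commerce app, stored in lower case.
    static final List<String> COMMANDS = Arrays.asList(
            "show my cart",
            "show my orders",
            "track my order",
            "go to checkout");
}
```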

Execute actions as per user voice commands

Now that we have our list of supported commands and working end-to-end speech recognition in the app, let us define the handleCommand() function, which is responsible for executing user commands.

Depending on what command we received from the user, we would trigger different app actions. For the purposes of this experiment, we would only update the UI on our MainActivity informing the user that the action was a success. If we could not find the user utterance in our static list of commands, we will tell them that the action requested by them is not supported.
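The matching logic itself can be kept free of Android classes, which makes it easy to try out on its own. The sketch below is one possible shape for it: the CommandHandler class, the message strings and the trim/lower-case normalization are our own assumptions. In the activity, handleCommand() would pass the returned message to the TextView.

```java
import java.util.List;
import java.util.Locale;

class CommandHandler {
    private final List<String> supportedCommands;

    // supportedCommands is expected to hold lower-case utterances.
    CommandHandler(List<String> supportedCommands) {
        this.supportedCommands = supportedCommands;
    }

    // Returns the message to show the user for the given utterance.
    String handleCommand(String utterance) {
        // Normalize so that "Show my cart " still matches "show my cart".
        String normalized =
                utterance == null ? "" : utterance.trim().toLowerCase(Locale.ROOT);
        if (supportedCommands.contains(normalized)) {
            // A real app would trigger the matching action here.
            return "Done: " + normalized;
        }
        return "Sorry, \"" + utterance + "\" is not supported.";
    }
}
```

This is still exact matching against a static list, so an utterance that is phrased differently from every entry falls through to the "not supported" message.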

That’s it! Our sample app is now a voice-enabled app that lets users perform actions with voice commands.

I hope you found this helpful, particularly if you decide to take up any experiments with voice.

While the steps above let you code your own in-app voice commands, the result is still limited in what it can do and in how flexible it is. If the user phrases a command slightly differently, it will not work. Nor is it designed to speak back to the user to collect additional information when required. Google's speech recognition is powerful, but it is a generic ASR optimized for broad-based recognition; if you want to increase its accuracy for the words that matter to your app, that is not possible. The voice UI and UX also still need to be built by hand, as the default text box is very limited in functionality.

This is where a platform like Slang CONVA, which we have built at Slang, can help. It lets you add sophisticated in-app voice assistants without explicitly handling all the audio aspects or parsing each command to understand the intent behind it. Its full-stack, pre-built voice assistants provide out-of-the-box support for various domains, including support for multiple languages.