Natural Language Processing for Voice Assistants 101

NLP is a critical component in building voice and conversational assistants. This post deconstructs the basic terminology of NLP. Part 1 of the Slang Assistant blog series.

In this blog series, we discuss Slang’s assistant builder and how it can be used to create domain-specific voice assistants quickly and effectively. In this first part, we touch upon the natural language processing that goes on behind the scenes of a voice assistant. In the later parts, we discuss why Slang’s assistant builder is useful, and in the last part we go deeper into how it is designed and used.

In case you want to skip ahead, the following parts of the blog are available here:

Part 2: Slang Assistant Builder helps you forget about intents and entities

Part 3: Slang Assistant Builder vs Building from Scratch

Part 4: How does the Assistant Builder Work?

First off, what is Slang? Slang is a Voice Assistant as a Service platform that helps you quickly add a multi-lingual, multi-modal Voice Assistant to your mobile app. The Assistant allows users to talk to your app to get things done. Naturally, a crucial part of any voice assistant is understanding what the user is saying so that the assistant can perform the correct actions. The bit that does the understanding broadly falls under Natural Language Processing.

There are multiple moving parts that come together to make an effective voice assistant. The most prominent ones are Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), the dialog system, and an effective UI that lets users control the app through voice and the traditional touch interface in a multi-modal manner. There are many more auxiliary parts, such as analytics and feedback mechanisms that help the system learn, but more about these in another blog.

The field of natural language processing (NLP) is vast and consists of many subdomains or areas of study, such as document classification, machine translation, information extraction, information representation, summarization, and language generation, to name a few.

The two aspects of the larger NLP landscape used in building voice assistants, and indeed the tech that underlies chatbots, are classification and information extraction, also called tagging. Classification, in the context of conversational systems, means finding the intent of the spoken sentence, which loosely corresponds to the action to be performed. Tagging means extracting the useful pieces of information from within the sentence that are needed to complete the task.

Those of you who have experience with Dialogflow, Rasa, or Snips, among others, may realize that I am describing the concepts of intents and entities. For those who don’t know what these mean, the concepts are best understood with examples. Most eCommerce applications have the notion of filters on the product listing page: based on particular attributes, products can be filtered out of or into the displayed list. Typically, the filter-related functions a user can execute are adding a filter, removing a filter, or removing all filters in one go. Apart from filters, another action a user may want to perform is adding a product to the cart. The types of utterances a user may say are given in the following table:

Some eCommerce use cases and utterances that users may speak out

As you can see from the above table, a possible utterance a user may say is ‘Show only collared shirts’. Understanding that the sentence instructs the assistant to perform the action add filter, and not remove filter or add to cart, is classification. The buckets into which we classify sentences are called intents, and an intent loosely maps to the action to be performed.

The add filter action also needs some additional information, which in this case is collared and shirts. These become the entities that are extracted from the sentence through the tagging process, and they are used to determine which filter and filter value should be triggered.
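To make the two tasks concrete, here is a minimal, rule-based sketch in Python of what an NLU layer conceptually produces for this utterance. The keyword lists, entity names, and the `understand` function are illustrative assumptions, not Slang’s (or any real engine’s) implementation; production systems use trained statistical models rather than substring matching.

```python
# Toy illustration of classification (intent) and tagging (entities).
# All names and keyword lists here are hypothetical, for explanation only.

INTENT_KEYWORDS = {
    "add_filter": ["show only", "filter by", "only show"],
    "remove_filter": ["remove", "clear"],
    "add_to_cart": ["add to cart", "buy"],
}

ENTITY_VALUES = {
    "collar_type": ["collared", "round neck"],
    "product_type": ["shirts", "trousers"],
}

def understand(utterance: str) -> dict:
    text = utterance.lower()
    # Classification: pick the first intent whose keyword appears in the text.
    intent = next(
        (name for name, keys in INTENT_KEYWORDS.items()
         if any(k in text for k in keys)),
        "unknown",
    )
    # Tagging: extract the entity values mentioned in the text.
    entities = {
        etype: value
        for etype, values in ENTITY_VALUES.items()
        for value in values
        if value in text
    }
    return {"intent": intent, "entities": entities}

result = understand("Show only collared shirts")
print(result)
# → {'intent': 'add_filter', 'entities': {'collar_type': 'collared', 'product_type': 'shirts'}}
```

The output pairs one intent with a set of entities, which is the shape of result that engines like Dialogflow or Rasa hand back to the application layer.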

The reason I say intents are loosely connected to the function performed is that, while most of the time they map one to one, there are many cases where additional information, such as the context or state of the application, or the entities themselves, influences the actual function performed. For example, suppose the user asks to show only collared shirts, and the intent recognized is add filter with the entities collar and shirt for collar type and product type respectively. If the app accepts these filters, that is the happy path: the filters are added. If the filters are not accepted, however, no filter is set and the user is informed accordingly. So even though the intent was add filter, the function performed is not add filter.
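This branching between intent and the function actually performed can be sketched as follows. The names (`SUPPORTED_FILTERS`, `handle_add_filter`) are hypothetical; a real app would consult its actual filter state rather than a hard-coded set.

```python
# Sketch of how the same intent (add filter) can lead to different outcomes
# depending on what the app can accept. Names here are illustrative only.

SUPPORTED_FILTERS = {"collar_type", "product_type"}

def handle_add_filter(entities: dict) -> str:
    """Perform the action for the 'add filter' intent, or fall back."""
    unsupported = set(entities) - SUPPORTED_FILTERS
    if unsupported:
        # The intent was add_filter, but the function performed is an
        # explanation to the user, because the app rejects these filters.
        return "Sorry, cannot filter by: " + ", ".join(sorted(unsupported))
    # Happy path: all filters accepted, so the filters are actually applied.
    return "Filters applied: " + ", ".join(
        f"{k}={v}" for k, v in sorted(entities.items())
    )

print(handle_add_filter({"collar_type": "collared", "product_type": "shirts"}))
# → Filters applied: collar_type=collared, product_type=shirts
print(handle_add_filter({"sleeve_length": "full"}))
# → Sorry, cannot filter by: sleeve_length
```

The design point is that the NLU layer stops at intent plus entities; deciding what actually happens remains the application’s job.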

Read the next part in the series here.