Slang Blueprints: Heuristics for the semi-automated feedback system

A detailed description of the heuristics for the semi-automated feedback system



In Slang Blueprints: A feedback loop for continuous learning, we made the case for a feedback system to improve the overall accuracy of Slang CONVA. We also described the general architecture of the feedback loop, deferring some details to other blogs. In this blog we cover one of those parts: the set of heuristic-driven processors that work on the analytics data and create suggestions for the human worker, who can use them to improve the templates, potentially lending itself to a fully automated feedback system.


The heuristics that can be used to flag issues can be loosely classified into:

  • Heuristics based on natural language processing (NLP)
  • Heuristics based on user behaviour


Natural Language Processing Heuristics:

Behind the scenes of Slang CONVA is an NLP engine that classifies utterances into intents and extracts entities from them. The issues that can arise are those where the intent is wrongly classified or the entities are wrongly extracted. An utterance could be classified to the wrong intent, or to no intent at all, that is to say it is unrecognized. Entity extraction can go wrong in three ways: a phrase in the utterance could be extracted and tagged with the wrong entity, a phrase that was supposed to be tagged was not tagged at all, or a phrase that was not supposed to be tagged was tagged.


The heuristics here aim to better identify these false positives and false negatives.


Intent classification:

The classifier used inside CONVA, like most classifiers, gives a confidence score. If a certain utterance was classified to a certain intent with only a small delta of confidence over the next intent, it usually means it was a toss-up between the two, and the other one could have been right too. Often, this means the two intents overlap too much and would be better off merged, or that we need better training sentences for the classifier to learn the nuances between the two. For example, assume two intents, ‘add to basket’ and ‘search’, and the utterance ‘send bananas to my basket’. Assume the words ‘send’ and ‘basket’ were missing from the training sentences. The intent classifier would be able to work only on the word ‘banana’, with the others potentially being unknown, so it could assign very close confidence values to ‘add to basket’ and ‘search’. If such a sentence is flagged to the human worker, the worker can add the missing words to the template, so that the next time it is trained on, these two words make the distinction between the two intents clearer.
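
To make this concrete, here is a minimal sketch of the low-margin heuristic in Python. The `classify` interface (returning a map of intent to confidence) and the threshold value are assumptions for illustration, not the actual CONVA internals.

```python
def flag_low_margin(utterance, classify, margin_threshold=0.1):
    """Flag an utterance whose top two intent confidences are nearly tied."""
    scores = classify(utterance)  # e.g. {"add_to_basket": 0.46, "search": 0.44, ...}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return None
    (top_intent, top_conf), (runner_up, runner_conf) = ranked[0], ranked[1]
    if top_conf - runner_conf < margin_threshold:
        # Surface the utterance and both candidate intents to the human worker.
        return {"utterance": utterance,
                "candidates": [(top_intent, top_conf), (runner_up, runner_conf)]}
    return None
```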


If there were a system that could always give better intent accuracy than the one currently behind CONVA, and we could use it in our system without treating it as an oracle, then that system would be the one powering CONVA. In other words, no classifier we could build can be better than the one inside CONVA, for if it were, why aren’t we using it? But that doesn’t mean alternate classifiers are useless. Even though we have to choose the best one for live inference, alternate classifiers can give an alternate set of confidences that helps find utterances that are classified closely. Thus, by creating a family of alternate classifiers and pooling their outputs, we can flag a larger variety of possible conflicts.
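
Here is a sketch of what pooling a family of alternate classifiers could look like. Each classifier is assumed to expose the same `classify(utterance) -> {intent: confidence}` interface as in the previous sketch; the interface and thresholds are illustrative assumptions.

```python
def flag_conflicts(utterance, classifiers, margin_threshold=0.1):
    """Flag an utterance if any classifier in the pool finds it ambiguous,
    or if the classifiers disagree on the top intent."""
    reasons = []
    top_intents = set()
    for name, classify in classifiers.items():
        ranked = sorted(classify(utterance).items(), key=lambda kv: kv[1], reverse=True)
        top_intents.add(ranked[0][0])
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin_threshold:
            reasons.append(f"{name}: low margin between {ranked[0][0]} and {ranked[1][0]}")
    if len(top_intents) > 1:
        reasons.append(f"classifiers disagree on top intent: {sorted(top_intents)}")
    return reasons or None  # a non-empty list means flag for the human worker
```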


Even though in the example above we described the potential remedy as adding missing words, other remedies could include creating a better classifier, or merging intents so that the soft classification gets the intent roughly right, with more deterministic decision tools, based on entities, layered on top to narrow it down. This may lose some generalization, but it improves precision.


Entity Tagging:

Assume a given word in a sentence is not tagged at all. This word could belong to a list of stopwords, or what we like to call fluff words: words that don’t belong in any entity. To know that the word was indeed a fluff word and not a legitimate entity, we maintain a list of all such stop words. If the word belongs to that list, it is effectively the same as belonging to an entity named ‘fluff word’. This is an oversimplification: in reality we have more nuanced entities that try to tag every word in the sentence, even though certain words may not be useful to the business logic downstream; we have seen that this leads to better, more fine-grained outputs. For the sake of improving the template, if a word belongs neither to an entity nor to our known list of fluff words, it can be flagged as an interesting word, and using the context it appeared in, the human worker can add it to the template under the correct list.
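
The following sketch shows the ‘interesting word’ check. The fluff list and the tagger output format are hypothetical stand-ins for whatever the real template and tagger provide.

```python
FLUFF_WORDS = {"please", "the", "my", "to", "a"}  # hypothetical fluff/stopword list

def flag_interesting_words(utterance_words, tagged_words):
    """Return words that were neither tagged as an entity nor known fluff words."""
    return [word for word in utterance_words
            if word not in tagged_words and word.lower() not in FLUFF_WORDS]

# e.g. flag_interesting_words(["send", "bananas", "to", "my", "basket"],
#                             {"bananas"}) -> ["send", "basket"]
```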


Wrongly tagged words are harder. However, the same concept applies: where we used confidence scores of the classifier, or of alternate classifiers, for intents, here we can use the confidence scores of the tagger and of alternate taggers to surface situations where a word was tagged with a certain entity but was very close to being tagged as another. In this case, the human worker can add training sentences or entities back to the template to help it make the distinction better.
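
Applied per token, the low-margin idea might look like this; `tag_scores`, mapping each word to entity confidences, is an assumed interface, not the real tagger’s.

```python
def flag_ambiguous_tags(tag_scores, margin_threshold=0.1):
    """Flag words whose top two entity tags have nearly equal confidence."""
    flagged = {}
    for word, scores in tag_scores.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin_threshold:
            flagged[word] = ranked[:2]  # the two competing entity tags
    return flagged
```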


Behaviour-based heuristics:

NLP-based heuristics assume that the utterance used as input to the NLP system was correctly transcribed from the user’s speech. Many times, ASR gets the transcription wrong. The surest way to know whether a sentence was transcribed wrongly is to actually listen to what the user spoke, but for reasons of privacy, we do not want to do that. That means there is no way to know for sure whether what the user spoke was correctly transcribed.

This is where the behavioural heuristics come in: is there something about the way the user uses the app and the assistant, or patterns in their behaviour, that can give us a clue as to whether a failure in classification or tagging can be attributed to a wrong transcription by the ASR engine?


Repeated utterances:

Assume a user speaks a sentence and Slang responds with ‘intent unrecognized’. The prompt Slang speaks at that point is the clarification prompt, of the form, ‘I am sorry, I didn’t understand that, please say that again?’ In response, the user can be expected to repeat the utterance they said before, in the hope that the system picks it up this time. Therefore, searching for a pattern of repeated ‘intent unrecognized’ responses followed by the same or similar user utterances can be one signal of a wrong ASR transcription.
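
A minimal sketch of this signal, using the standard library’s difflib for string similarity. The session format (an ordered list of (utterance, response) pairs) and the threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher

def flag_repeated_unrecognized(session, similarity_threshold=0.8):
    """Flag consecutive turns where an 'intent unrecognized' response is
    followed by a similar utterance, hinting at an ASR transcription failure."""
    flagged = []
    for (utt, resp), (next_utt, _) in zip(session, session[1:]):
        if resp == "intent_unrecognized":
            similarity = SequenceMatcher(None, utt, next_utt).ratio()
            if similarity >= similarity_threshold:
                flagged.append((utt, next_utt, similarity))
    return flagged
```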


Sometimes, entities that seem rightly tagged are actually false positives, because ASR transcribed the speech to a word that is legitimate in the context but is not what the user actually said. It would be hard to expect the NLP heuristics described above to find these issues; behaviour heuristics can help weed out such false positives.


Sometimes, two words sound too similar for ASR to transcribe correctly. In this case, coupling a fuzzy matcher into the loop can help find what the real word could be.
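
For instance, a fuzzy matcher could be as simple as difflib’s `get_close_matches` run against the template’s entity vocabulary; the vocabulary below is a made-up example.

```python
from difflib import get_close_matches

ENTITY_VOCABULARY = ["lemons", "melons", "bananas", "basket"]  # hypothetical

def suggest_real_word(transcribed_word, vocabulary=ENTITY_VOCABULARY):
    """Suggest vocabulary words the ASR output may have been confused with."""
    return get_close_matches(transcribed_word.lower(), vocabulary, n=3, cutoff=0.6)

# e.g. suggest_real_word("melons") -> ["melons", "lemons"]
```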


User Experience Flows:

Assume a user asks for lemons on a grocery voice assistant, and the assistant, or even the app, does not take them down the happy path: it replies with ‘intent unrecognized’, shows them the wrong items, or shows no items at all. The user will then either try to speak again, as mentioned above, try typing the word using the touch interface, or go back to the home screen and do something else: anything that is not the happy path of clicking on the item or adding it to the cart. These paths can be used as clues to an unhappy customer, and probably to whether the assistant’s voice capability was the source of the unhappiness. Of course, this is a very noisy signal with many false negatives and false positives, which is why we need the human-in-the-loop to have the final say on these utterances.
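
As a sketch, such a flow check could scan a per-session event stream for a voice query followed by non-happy-path events. The event names here are invented for illustration; a real system would use its own analytics schema.

```python
UNHAPPY_EVENTS = {"typed_search", "back_to_home", "intent_unrecognized"}
HAPPY_EVENTS = {"item_clicked", "added_to_cart"}

def flag_unhappy_voice_sessions(events):
    """Return voice utterances followed by unhappy-path behaviour before
    any happy-path event, as candidates for human review."""
    flagged = []
    for i, (event, payload) in enumerate(events):
        if event != "voice_query":
            continue
        for next_event, _ in events[i + 1:]:
            if next_event in HAPPY_EVENTS:
                break  # the user succeeded; nothing to flag
            if next_event in UNHAPPY_EVENTS:
                flagged.append(payload)  # the utterance spoken in this turn
                break
    return flagged
```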


Slang CONVA provides the easiest and fastest way to add In-App Voice Assistants to mobile and web apps. Sign up for an account at https://www.slanglabs.in to create your own retail assistant, or see Introducing Slang Conva for more information.