A Hand Worker’s Tale
Why am I talking about manual work in an automated world? It’s like being analog in a digital world, right? Is it because manual work gives you a sense of confidence that automated work sometimes fails to provide?
If you haven’t already, I suggest you take a look at Introducing Slang CONVA and Slang’s Assistant Builder - helps you forget about intents and entities for an overview of what Slang CONVA is.
Up until a few months ago, I didn’t know voice assistants required any kind of “manual work”. You probably weren’t as naive as me, though. I am sure most of you have thought about the level of manual work that goes into building a domain for a voice assistant.
Consider this blog a qualitative recount of the types of manual effort that go into building and improving a domain, so that customers of Slang CONVA get, right out of the box, the best domain that suits most of their requirements.
To ensure that we had a dataset as extensive as possible for our voice assistants, we sourced our data from various places and ended up with a big pile of raw data. This big pile of data consisted of around 100,000 items.
To make a superset, we had to combine multiple datasets from all the sources. When combining multiple datasets, there are many opportunities for data to be duplicated or mislabeled. If the data is incorrect, even in cases where it looks somewhat correct to the eye, outcomes and algorithms will eventually be unreliable. Hence, data cleaning is required. However, there is no single way to prescribe the exact steps of the data cleaning process, because these steps vary from dataset to dataset. But it is crucial to establish a template for the process so that, at the very least, you have a playbook for tackling the vast amount of data and know where to start.
Not your average car wash:
To add this data in any form to our domain, we need to send it through six layers of different data-cleaning techniques. They are the following:
First, duplicate observations are removed. Duplicates arise most often during data collection: when you combine datasets from multiple places, there are many opportunities for duplicate entries to be created. De-duplication is often the source of the largest downsizing of the data; just by keeping the unique items, you can often prune the collection down to 10% of its original size.
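As a rough illustration, the de-duplication step can be sketched in a few lines of Python. The sample items are made up, and a real pipeline would of course run over far larger files:

```python
# A minimal sketch of de-duplication: normalize case and whitespace,
# then keep only the first occurrence of each item. Since we are
# case-insensitive, we keep the normalized form.

def dedupe(items):
    seen = set()
    unique = []
    for item in items:
        key = " ".join(item.lower().split())  # case/whitespace-insensitive key
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

raw = ["Potatoes", "potatoes", "  POTATOES ", "Tomatoes"]
print(dedupe(raw))  # -> ['potatoes', 'tomatoes']
```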
Second, irrelevant entries are removed. Irrelevant entries are observations that do not fit the specific problem you are trying to analyze. For example, if you were building a catalog for a grocery app that sells only fresh produce like fruits and vegetables, you wouldn’t want entries for Fast-Moving Consumer Goods (FMCG) like shampoos and soaps in there. In a more extreme case, when you’re building a retail catalog you wouldn’t want travel-domain data lingering in there.
This can make the analysis more efficient and minimize distraction from your primary target, especially when the sets are fairly similar, like groceries and FMCG. It helps create a more manageable and more performant dataset. Sometimes there are attributes of the data, such as columns, that tell you which category an item could belong to. Or sometimes you know that if a datum came from a particular source, then it most definitely belonged to a particular category; the site of the clothing company Forever 21, for instance, probably only had clothing-related inventory. In such cases, one can easily code a filter in Python or other programming techniques, or one can use tools like Google Sheets, to easily filter out the required entries.
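When such a category column is available, the filter really is a one-liner. A minimal Python sketch, with a made-up catalog and made-up category labels:

```python
# Keep only the entries relevant to the problem at hand; here,
# a hypothetical fresh-produce-only grocery catalog.

catalog = [
    {"name": "alphonso mango", "category": "fresh-produce"},
    {"name": "herbal shampoo", "category": "fmcg"},
    {"name": "baby spinach",   "category": "fresh-produce"},
]

wanted = [row for row in catalog if row["category"] == "fresh-produce"]
print([row["name"] for row in wanted])  # -> ['alphonso mango', 'baby spinach']
```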
However, the tougher version of this process is when such metadata is not available. The human worker has to look at each item on its own and use discretion to decide whether it belongs to the category or not.
Third, typographical errors are fixed. Typographical errors can be in the form of missing or extra letters, like ‘potatos’ instead of ‘potatoes’. They can also be in the form of extra or missing special characters, like ‘fritolay’ instead of ‘frito-lay’. It could also be a case of inconsistent capitalization, in cases where your downstream application is case-sensitive. In our case, though, we are case-insensitive, so we convert everything to lower-case in the de-duplication step itself.
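One semi-automated aid here is fuzzy matching against a known vocabulary. A hedged sketch using Python’s standard-library difflib; the vocabulary is illustrative, and in practice a human would still review the suggestions before accepting them:

```python
import difflib

# Snap likely typos to the closest known spelling. The cutoff keeps
# unrelated words from being "corrected" by accident.

vocabulary = ["potatoes", "frito-lay", "tomatoes"]

def suggest(word, cutoff=0.8):
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word  # fall back to the original word

print(suggest("potatos"))   # -> potatoes
print(suggest("fritolay"))  # -> frito-lay
print(suggest("zzzz"))      # -> zzzz (no close match, left untouched)
```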
Fourth, we try to merge synonymous words into the same entity. In most cases, the word ‘Tee’, in the context of the clothing/fashion category, refers to ‘T-shirts’. If the user searches for ‘Tees’, we want the downstream application, in this case the CONVA assistant, to return ‘T-shirts’. Again, this can be automated if some attribute of the data says these are the same objects, or if some well-known source claims they are synonyms. We could even use more sophisticated tools or concepts such as clustering, but we haven’t explored those yet; for now, we depend on the human evaluator and their discretion to get the most accurate results.
Consider a more detailed example. We take a men’s fashion catalog consisting of different subcategories.
In most, if not all, cases, the word “trousers” is identical to the word “pants”, except for being British English instead of American English. Of course, it is debatable, but the question finally boils down to the effect on the application and its users: if it turns out most people looking for ‘pants’ were happy with the search results for the term ‘trousers’, then, in the context of the app, they are synonymous. It is the same with “jumper” and “sweater”: in British English, a “jumper” is what American English calls a sweater. If we then merge ‘t-shirts’, ‘tees’, and ‘Tshirts’, and ‘Inner Wear’ and ‘innerwear’, we get the following distribution, which looks narrower and deeper.
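The merging itself can be sketched as a lookup into a hand-curated canonical table. The entries below are just the examples from this section, not an exhaustive synonym list:

```python
# Map synonymous labels to one canonical entity. The table is a
# hand-curated assumption; a real domain would have many more entries.

CANONICAL = {
    "tees": "t-shirts",
    "tshirts": "t-shirts",
    "trousers": "pants",
    "jumper": "sweater",
    "inner wear": "innerwear",
}

def canonicalize(label):
    label = label.strip().lower()
    return CANONICAL.get(label, label)  # unknown labels pass through

print(canonicalize("Tees"))      # -> t-shirts
print(canonicalize("Trousers"))  # -> pants
```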
Fifth, we weed out unwanted relations. Often there will be one-off entries that, at a glance, do not appear to fit the data you are analyzing. For example, some datasets may claim that ‘fruits’ is synonymous with ‘mangoes.’ All ‘mangoes’ are ‘fruits,’ but not all ‘fruits’ are ‘mangoes.’ This can be a problem for CONVA: if someone asks for ‘mangoes’, we may end up showing other fruits too, like ‘apples’ and ‘strawberries.’ The other way around is a problem as well; if someone asks for ‘fruits’, we may show only ‘mangoes.’ Removing such improper data points helps the performance of the data you are working with.
But how does a human worker even spot these kinds of problems? We depend on their past experience and inherent knowledge of such relations. Of course, not all human workers know all possible patterns, especially the esoteric ones. So, many times, human workers work in layers: first they build a mental toolkit to identify patterns that indicate outliers, and then they work on correcting them.
For example, consider a dataset of the sizes of 10 shoes in UK measurements. 9 out of these 10 shoes have a size of 9 (UK). The tenth shoe’s size is entered as UK size 1. It may be an outlier. It stands out, so we take a look into it, and after looking into it we realize it was an impossible size for an adult, and thus something was wrong with that entry.
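This kind of “it stands out” check can be sketched as a simple distance-from-median filter. The sizes and the threshold below are illustrative; the point is only to surface suspicious values for a human to inspect:

```python
from statistics import median

# Flag values far from the middle of the distribution for review.

sizes = [9, 9, 9, 9, 9, 9, 9, 9, 9, 1]  # UK shoe sizes, one suspect entry

def flag_outliers(values, max_distance=3):
    m = median(values)
    return [v for v in values if abs(v - m) > max_distance]

print(flag_outliers(sizes))  # -> [1]
```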
Sixth, we handle missing data. Missing data is a deceptively tricky issue. You may be tempted to ignore it; after all, you can’t say a relationship is wrong if it wasn’t there in the first place. It was just incomplete, and that is the lesser evil most times. Still, it’s better to handle it in some way and try to fill in the attribute or relationship. Most algorithms may accept missing values syntactically, but then they will be forced to fill in dummy values in the given positions. For example, we have an item ‘bananas’, but nowhere is it mentioned that ‘bananas’ are fruits. In such a case, the algorithm will be forced to treat it as an item that does not belong to any category or sub-category.
Unfortunately, from our experience, the two processes we could use to solve the missing-data problem each have their own pros and cons.
The first option, as we said earlier, is to drop or delete observations that have missing values, but doing this means losing information. Consider another example, to add to the one given above.
Consider a brand called “$ Pancakes”, generally pronounced “Dollar Pancakes”. During the collation of data, the “$” may have been removed as part of the ‘removal of special characters’ in step 3. And, unaware of this, you proceed to delete the whole entry because it appears to be a generic, brand-less “Pancakes” entry. You have just committed two consecutive deletions for different reasons: one, because it was a special character, and two, because you thought it was a generic word and not a well-known brand. The end result is that we can no longer recognize ‘dollar pancakes.’ You might start thinking that you probably should have taken it easy with the removal of special characters. But then you’d have to look at every phrase that has ‘dollar’ in it, with the chance that one of them was meaningful and the rest were just the price of the item in hand.
Sometimes, the fact that the value was missing may be informative in itself. For example, the brand ‘Unilever’ is an international corporation, while ‘Hindustan Unilever’ is its Indian subsidiary. These could be two different brands for a given purpose. How does one decide then?
As a second option, you can impute the missing values based on other observations. Imputing missing values means inferring them, making educated guesses from the surrounding information. The con here is that the value was originally missing but you filled it in, which always means a loss of information compared to the ground truth, no matter how sophisticated your imputation method is. Again, there is an opportunity to lose the integrity of the data, because you may be operating from assumptions and not actual entries. In the above-stated example of “$ Pancakes,” you cannot just add “$” to it, as the list may contain multiple items whose special characters are missing.
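A minimal sketch of such imputation, assuming a hypothetical known-item lookup table. Given the caveats above, imputed rows are flagged so a human can review them rather than being silently trusted:

```python
# Fill a missing category from a lookup of known items; the table is
# an illustrative assumption, and imputed rows are marked for review.

KNOWN_CATEGORIES = {"bananas": "fruits", "spinach": "vegetables"}

def impute_category(item):
    name = item.get("name", "").lower()
    if item.get("category") is None and name in KNOWN_CATEGORIES:
        return {**item, "category": KNOWN_CATEGORIES[name], "imputed": True}
    return item

print(impute_category({"name": "bananas", "category": None}))
```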
What we think about when we think about validation.
Now that you’ve got a taste of the different types of data cleaning we perform at Slang, let’s discuss why we do data validation. Technically, we should have discussed this before, but it would have been hard for you to visualize what we were talking about without seeing some concrete examples first. Before we do any work on the data, we need to know the goals we are working towards. Some questions that can help are:
- Does the data make sense? Does the structure we are using semantically fit the domain? The travel domain probably needs a bunch of airport names; is that what we are supplying? Is it what the algorithm expects - airport names and corresponding airport codes - and is it in the form it expects? If your data meets this bar, then it is valid.
- Does it meet a minimum bar in terms of quantity/coverage of data? At the end of the day, people aren’t going to think you are the real deal, just a toy, if your ‘grocery’ assistant only understands 10 fruits and no more, or no more ways of speaking. If it does meet this minimum bar, then your data is complete.
- Does it meet a minimum bar in terms of the accuracy of the data? As in many of the examples above, if a user asks for ‘potatoes,’ they probably don’t want to be shown ‘potato chips.’ If it meets this minimum bar, then it is accurate.
- Is the data consistent? Do you remember this analogy exercise from your school days? ‘Man is to woman as king is to ____?’ Well, the data should maintain such relationships. It shouldn’t depict ‘man is to woman as king is to kingdom.’ Taking an example that resonates more closely with CONVA: in the retail domain, if we have ‘Maggi is to noodles’, the similar brand-item relationship for ‘Lays’ would be ‘chips’. But if, instead of ‘chips’, we had said ‘packed foods’, that would not be consistent, since ‘packed foods’ is the sub-category, not the item type, for ‘Lays’.
An astute reader may ask, ‘What is this “minimum bar” that you speak of above, for completeness, accuracy, and consistency?’ To which the answer is: there is no clear answer; is it 80% or 90%? That is a topic for another blog. In short, there is some number at which the cost-benefit trade-off skews towards being too costly for the benefit it gives. However, once we have found a set of numbers, the point of data cleaning and validation is to make sure we can reach them.
Another astute reader may ask what the difference is between what we are going to talk about next and step 6 of the data-cleaning set, where we fill in missing data. Technically, the line is blurry, but it helps to make a logical break here, because the methods we describe next are used to improve the systems that are auxiliary to the core NLU: specifically, the translation system and the Automatic Speech Recognition (ASR) system.
For inference in another language, the ASR system first has to transcribe the word in that language, and then the translation system needs to translate it to English because our NLU works in English.
The ASR and translation systems work on general models that aren’t specific to a domain, but to tune them to a domain we can augment their inherent models with extra words or language-to-language word pairs.
So, after the data-cleaning steps above, we need to create this clean set of translation pairs. For common nouns, we want the translation to go from the common noun in one language to the common noun in the other. For example, सेब in Hindi should translate to ‘apple’ in English. However, a proper noun like ‘oreo’ needs to carry over into Hindi as-is, which means it is not really translated but transliterated.
In general, we can transliterate words one by one by hand, but many simple transliteration scripts exist that do a syllable-for-syllable conversion. However, such conversions aren’t always accurate, because the conversion is not one-to-one but one-to-many, owing to the way many Indian languages deal with vowels. If you are one for nice linguistics lingo: these Indian languages follow a writing system called an Abugida, which uses vowel diacritics for all its vowels. Here a consonant (vyanjana varna) is combined with a vowel diacritic (swara maatraa) to give a syllable. This means that when a vowel appears on its own it takes one form, but when it appears after a consonant it takes another. On top of that, there are often multiple similar vowels depending on their stress. The English language itself has many ways to spell something that sounds the same: is it ‘boy’, as in a male child, or ‘buoy’, as in a flotation device at sea?
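To make the one-to-many problem concrete, here is a toy sketch counting how many Devanagari spellings a naive syllable converter would have to choose among. The candidate table is a made-up assumption for illustration, not a real transliterator:

```python
# Each English syllable can map to several Devanagari spellings
# (retroflex vs dental consonants, short vs long vowel matras),
# so naive conversion faces a combinatorial choice.

CANDIDATES = {
    "ti": ["टी", "ती", "टि", "ति"],
    "na": ["ना", "न"],
}

def spelling_options(syllables):
    # Multiply the number of candidates per syllable.
    options = 1
    for s in syllables:
        options *= len(CANDIDATES.get(s, [s]))
    return options

print(spelling_options(["ti", "na"]))  # -> 8 possible spellings
```

This is exactly why a human has to pick the right spelling when the script guesses wrong.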
So, all in all, though the script helps get most things right, there is always someplace it trips up, and this is where human workers improve upon the accuracy the automated system was able to produce.
If one is interested in a real-world example: a tablet called "pregaba NT" was transliterated as “प्रेगाबा एनट” by the automated script, which is not accurate, so it had to be manually changed to “प्रेगाबा एनटी”.
Similarly, “Amlokind H” was transliterated as “अमलोकाइंड च” and had to be manually changed to “एमलोकाइंड एच”.
Aren’t there multiple correct pronunciations for an item? Yes, you’re correct! And this is where hints to the ASR system help, both in English and in the alternate language. However, sometimes, due to the nature of the model used in the ASR system, it decides it can ignore the hint; after all, it is not called a ‘hint’ for no reason.
At this point, if we feel the word being recognized is meaningless, or meaningless in this context, we can map it to the word we were actually trying to say. For example, suppose you were in the context of selling Toyota vehicles and were trying to say ‘Corolla’, as in the car, but the ASR system always assumed you were saying ‘Gorilla’. Since the two sound similar, the model feels that, in the data it was trained on, ‘Gorilla’ was the higher-probability word. In such a case, we choose to map ‘Gorilla’ to ‘Corolla’, since we know ‘Gorilla’ itself is meaningless in this context.
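Such a context-specific remapping can be sketched as a simple post-ASR substitution table; the entries are the hypothetical examples from this section:

```python
# Remap words that are meaningless in the current context to the
# word the user most likely said. The table is illustrative.

ASR_CORRECTIONS = {
    "gorilla": "corolla",  # meaningless in a Toyota-dealership context
}

def correct_transcript(transcript):
    return " ".join(ASR_CORRECTIONS.get(w, w) for w in transcript.lower().split())

print(correct_transcript("show me the Gorilla"))  # -> show me the corolla
```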
However, the downside of this is that you may very often hit a situation where the words being mismatched by ASR are both meaningful in the context. Take the case of retail: the user may say, ‘show me my cart,’ but it is often recognized as ‘show me my card,’ and in that case the system might assume the user was searching for the retail item ‘cards.’ Thus, one has to tread very lightly here.
Fast yet Focused.
Based on feedback and learnings during the past year of extensive remote work and research, we came up with a process for working quickly on large datasets while still taking care of all the cleaning and augmentation steps mentioned above. It’s with these learnings that we now believe building world-class voice assistants is not as easy as using off-the-shelf NLU products, which were probably made for generic purposes and need some amount of specialization to get that level of finesse. That is our aim at Slang: with CONVA, we do all the heavy lifting for you, so that you don’t have to. And if you think what we described above was not heavy, well, hats off to you, Arnold!
Batteries included inside
For many of the pure NLU systems, and even with Slang’s own VAX system, one has to add their own synonyms and ASR hints. That meant sourcing, cleaning, and all of the above shenanigans. CONVA, however, provides all of this for you. And for argument’s sake, let’s say you were unhappy with only some of the data (synonyms, ASR hints, or translation hints) but want to keep most of ours. With CONVA, you get the best of both worlds, managed and customized: you can use our data and seamlessly augment it with your own. And the best part is, whenever we update or improve our data, you get the benefit OTA (over the air), without losing any of your augmentations or having to start again.
We’re in control
How do we account for diversity in speech patterns and other possible gotchas, and how do we improve the above process itself? Since each voice is distinct, to maximize our quality we hold regular company-wide “bug bashes”, during which all employees get together on a video call to test the sample test dataset and identify recognition issues, which further helps us improve our recognition quality from a broader perspective. We truly apply dogfooding to our testing.
Staying up to date
To continuously improve the recognition quality and our assistant’s experience, we monitor production data regularly and implement any changes that will positively impact the recognition experience. This topic warrants a blog of its own, and we will be writing about it shortly.
Above, we mentioned that we work on a domain and its aspects until we reach certain cutoff numbers for accuracy, completeness, validity, and consistency. We will soon write a blog on how we run these experiments, along with our numbers for them.
As of today, we have a manual process in place as mentioned above, but we aim to improve upon this system to make a semi-automated feedback loop that helps the human worker not to spend time on parts that a computer is good at but rather spend time on parts for which they are truly needed. You can read about our plans to build such a system in this blog, Slang Blueprints: A feedback loop for continuous learning.
Slang CONVA provides the easiest and fastest way to add In-App Voice Assistants to mobile and web apps. Sign up for an account at https://www.slanglabs.in to create your own retail assistant, or see Introducing Slang Conva for more information.
(For those wondering, the title is a play on The Handmaid's Tale, a dystopian novel by Margaret Atwood)