Deep-diving into Linguistics and Localization

Different cultures use their language in different ways, and linguistics highlights these cultural nuances. Here are my key takeaways from Joseph Tyler's talk on linguistics and localization at Voice Summit 2019.

Language is an important part of our lives. It is a uniquely human gift that lets us communicate and sets us apart from other primates. But language is much more than just a means of communication; it is also an inseparable part of our culture. Different cultures use their language in different ways, and these differences should not be underestimated. Linguistics highlights these cultural nuances.

Linguists were once largely confined to academia, but people are now taking note of the importance of linguistics in Natural Language Processing (NLP). Companies are hiring linguists to help incorporate these cultural nuances into their models and their conversational design. They now realize how important it is to design conversations for chatbots and voice apps with the consumer in mind, and linguists can help them design these conversations.

These conversational designs become even more critical when companies expand into new markets.

Spoken Arabic is one thing; written Arabic is another.

Joseph Tyler holds a PhD in linguistics and heads conversation design at Sensely, a company that builds a medical conversational AI assistant. He gave a brilliant and insightful workshop on the importance of linguistics in localization.

Localizing even a simple application is quite complex. It becomes even more complex for chatbots and voice assistants, because businesses are dealing with crucial components like NLP, Automatic Speech Recognition (ASR), and Text-to-Speech (TTS). Each of these building blocks has to be localized individually, which is not a trivial effort and requires significant investment.

Simply running the app's content through Google Translate will not give the best result; instead, it will degrade the user experience.

Here are a few examples he shared during his talk. Spoken Arabic is very different from written Arabic: one marks vowels while the other does not, and the language varies according to gender and whether the subject is singular or plural. There are more than 20 variants of Arabic. Not only is Arabic written from right to left, but all the significant icons also have to be mirrored from left to right. A classic example is the 'home' button: it is traditionally placed on the left, but when the app is in Arabic, it should be placed on the right.
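As a rough illustration of why layout has to flip, here is a minimal Python sketch that picks a side for the 'home' button based on the UI language's writing direction. The `RTL_LANGUAGES` set and the `home_button_side` helper are my own illustrative assumptions, not something from the talk or from any real localization framework.

```python
# Minimal sketch: choose UI alignment based on the language's writing direction.
# RTL_LANGUAGES and home_button_side() are illustrative assumptions only.

RTL_LANGUAGES = {"ar", "he", "fa", "ur"}  # Arabic, Hebrew, Persian, Urdu

def home_button_side(language_code: str) -> str:
    """Return which side the 'home' button should be anchored to."""
    base = language_code.split("-")[0].lower()  # "ar-SA" -> "ar"
    return "right" if base in RTL_LANGUAGES else "left"

print(home_button_side("en-US"))  # left
print(home_button_side("ar-SA"))  # right
```

In a real app this decision would come from the platform's localization and layout-direction APIs rather than a hand-rolled lookup, but the point stands: mirroring the layout is part of localization, not an afterthought.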

Tokenization, the process of breaking a sentence down into smaller fragments, is widely used in NLP and is one of the key techniques for building NLP models. Tokenization in English can rely on word boundaries marked by spaces, but the same approach cannot be used for languages like Japanese and Chinese, which do not put spaces between words. Similarly, word boundaries based on punctuation do not hold for every language. Hence, it is not recommended to reuse the same model for every other app.
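To make this concrete, here is a small Python sketch contrasting whitespace tokenization of English with a Chinese sentence, where splitting on spaces fails and a dedicated segmenter is needed. The use of the jieba library and the example sentences are my own assumptions for illustration, not something taken from the talk.

```python
# Whitespace tokenization works for English because spaces mark word boundaries.
english = "I would like to book an appointment"
print(english.split())
# ['I', 'would', 'like', 'to', 'book', 'an', 'appointment']

# The same approach fails for Chinese, which has no spaces between words.
chinese = "我想预约医生"  # roughly: "I want to book a doctor's appointment"
print(chinese.split())
# ['我想预约医生']  -- the whole sentence comes back as a single "token"

# A language-specific segmenter is needed instead, e.g. the jieba library.
import jieba
print(jieba.lcut(chinese))
# Something like ['我', '想', '预约', '医生'] (exact output depends on jieba's dictionary)
```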

This blog is based on my understanding of a workshop given by Joseph Tyler at Voice Summit 2019. I attended the Voice Summit conference as part of Slang Labs, where I had the pleasure of meeting Joseph. Slang Labs is a platform that lets developers add voice capabilities inside their apps.