Indic Language Stack for Voice Assistants and Conversational AI

Bhārat Bhāṣā Stack will catalyze Voice Assistant and Conversational AI innovations for vernacular Indic languages as India Stack did for FinTech.

A decade ago, it was unimaginable.

That one would pay a street vendor in a nondescript small town in India by scanning, on a mobile phone, a QR code hanging on his cart. Even for an amount as little as 50 rupees (less than a dollar).

That there would be many mobile apps and payment wallets from banks and non-banks. All seamlessly interoperable. Any two parties would transact by sharing an email-like wallet address. Without paying any transaction fee.

That myriad small businesses would send catalogs on WhatsApp. Deliver goods to your home. Accept digital payments at your doorstep. Without having to build a website or payment gateway.

A decade ago, cash was king.

A decade ago, it was unimaginable.

But it happened. Thanks to the India Stack. The digital infrastructure for authentication, payment, and authorization. It started in 2009.

India Stack is a set of APIs that allows governments, businesses, startups and developers to utilize a unique digital Infrastructure to solve India’s hard problems towards presence-less, paperless, and cashless service delivery.

Unified Payments Interface (UPI) is at the heart of cashless, instantaneous payments. UPI became the digital payment gateway for small businesses. It eliminated an entry barrier that only big enterprises could afford. India Stack catalyzed FinTech innovations.


Bhārat Bhāṣā Stack can do for Indic language tech what India Stack did for FinTech.

Right now, it may be unimaginable.

That anyone at any remote corner of India will harness the power of the Internet. Across linguistic and socioeconomic groups. Even if they can’t read or write English. By talking to mobile apps. In their own language.

That any business will be able to bake voice assistants into their apps for the masses. In all Indic languages. With affordable data sets, AI models, and services.

Right now, type and touch are kings. But Bharat is discovering its voice.

So it can happen. If we build the Bhārat Bhāṣā Stack. The tech ecosystem for conversational AI in Indic languages.

Smartphone penetration and voice usage have been increasing. As per Google's report, voice searches in Indic languages have been growing too.

Bhārat Bhāṣā Stack can become the voice and language API for small businesses. It can eliminate another entry barrier that only big enterprises can afford. It can catalyze Voice Assistant and Conversational AI innovations.

Bhārat Bhāṣā Stack can unleash the next wave of unimaginable innovations across India’s linguistic and socioeconomic boundaries to more than a billion people.

The rest of the article:

  • maps the Conversational AI landscape of current voice assistants and applications,
  • proposes Bhārat Bhāṣā Stack with needed technologies and layers for various entry points to build these apps, and
  • discusses how Ecosystem Participants can help in building the stack.

Conversational AI

Conversational AI is essential, but it takes a huge investment.

Conversational AI makes machines communicate like humans. From research labs, it is now reaching consumers’ hands. It started as standalone voice assistants like Siri. It is now moving into apps and devices in several forms.

Conversational AI will become pervasive in all apps and devices.

Voice Assistants

Voice Assistants are intelligent virtual assistants that interact using voice. These are also called voice bots, especially when delivered through a voice-only interface. For example, interactive voice response (IVR) systems serve as customer support voice bots.

Siri was the first famous voice assistant. Then came Alexa in Amazon Echo devices which, among other things, made it possible to buy things on Amazon. The next entrant was Google Assistant. It first came in Google Home devices and later in Android phones.

Voice assistants progressed from being amusing to being useful, even if within closed and limited ecosystems.

Voice Actions

The next logical progression was to make voice available inside apps. Platforms let programmers integrate voice commands that trigger specific app actions. Amazon did this with Alexa for Apps, and Google with App Actions.

Voice Search has emerged as a common use case. Almost all apps have some kind of search:

  • Search the internet
  • Search for directions in a Map app
  • Search for a song in music or video apps
  • Search for an item in a shopping app
  • Search for a flight or train in a travel app

Though all are a kind of search, each works with a different category of world knowledge.

Voice Assistants in Apps

Voice Actions in apps have serious limitations. The voice journey ends as soon as it starts. Once the assistant invokes an app, users can interact with the app only through touch. That prevents building rich voice experiences suitable for an app.

That’s why several apps built a Voice Assistant inside the app (instead of hiding their app behind Alexa or Google Assistant). Gaana, YouTube, Paytm Travel, My Jio, Amazon, and Flipkart apps have optimised Voice Assistants for their domain.

Building these optimised assistants requires deep pockets. It takes significant investment, effort, and time to build. Most of these apps support English and Hindi. Broad support for most Indic languages is missing. Bhārat Bhāṣā Stack can make these technologies accessible to smaller entrepreneurs.


Bhārat Bhāṣā Stack: Indic Language Stack

An open Bhārat Bhāṣā Stack can lower communication barriers, cut costs, and spur innovation.

Bhārat Bhāṣā Stack should have a set of models, services, and SDKs for building conversational apps in Indic languages. It should include speech, language, and vision technologies needed for building voice applications. The stack layers should offer convenient entry points to use the technologies. That will make it easy to build chatbots, voice bots, voice assistants, and applications.

Technologies

Voice Assistants mimic human actions:

  • Listen: Convert speech audio to text. It is called Automatic Speech Recognition (ASR) or Speech-to-Text (STT).
  • Understand: Understand meaning or intent in the text, and extract important entities. It is called Natural Language Understanding (NLU).
  • Act: Based on that understanding, the application performs the desired action. This is the business logic of the app. The stack provides hooks for applications to plug in their logic.
  • Speak: Convert text to speech audio, to ask questions that clarify, confirm, or seek needed information from the user. It is called Speech Synthesis or Text-to-Speech (TTS).
Steps in Voice Assistants
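
The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the listen, understand, and speak functions are hypothetical stubs standing in for real ASR, NLU, and TTS components (here the "audio" is simply UTF-8 text), and the VoiceAssistant class and its on() hook are made-up names, not part of any existing stack or SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Understanding:
    """Result of the Understand step: an intent plus extracted entities."""
    intent: str
    entities: Dict[str, str] = field(default_factory=dict)


def listen(audio: bytes) -> str:
    """ASR stub (assumption): treat the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


def understand(text: str) -> Understanding:
    """NLU stub (assumption): toy keyword rules instead of a trained model."""
    words = text.lower().split()
    if "balance" in words:
        return Understanding("check_balance")
    if len(words) >= 3 and words[0] == "pay":
        # e.g. "pay 50 ramesh" -> amount and payee entities
        return Understanding("pay", {"amount": words[1], "payee": words[2]})
    return Understanding("unknown")


def speak(text: str) -> bytes:
    """TTS stub (assumption): return the reply text as bytes, not audio."""
    return text.encode("utf-8")


class VoiceAssistant:
    """Wires Listen -> Understand -> Act -> Speak; apps plug in the Act step."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Dict[str, str]], str]] = {}

    def on(self, intent: str, handler: Callable[[Dict[str, str]], str]) -> None:
        """Hook: register the app's business logic for one intent."""
        self._handlers[intent] = handler

    def handle(self, audio: bytes) -> bytes:
        meaning = understand(listen(audio))
        handler = self._handlers.get(meaning.intent)
        if handler is None:
            reply = "Sorry, I did not understand. Could you repeat that?"
        else:
            reply = handler(meaning.entities)
        return speak(reply)


assistant = VoiceAssistant()
assistant.on("check_balance", lambda entities: "Your balance is 1200 rupees.")
assistant.on("pay", lambda e: f"Paying {e['amount']} rupees to {e['payee']}.")

print(assistant.handle(b"pay 50 ramesh").decode("utf-8"))
```

The point of the design is that the application only supplies the handlers; everything else could, in principle, be swapped for real speech and language components without touching the business logic.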

Other Conversational AI tasks are:

  • Translate: Humans speak different languages. Applications may need to translate text from one language to another. It is called Machine Translation (MT).
  • Phonetically Translate: Many people type Indic languages using phonetic spellings on Roman keyboards. Computers may need to convert such phonetic text into Indic language scripts. It is called Transliteration.
  • See/Read: The ability to recognize images of handwritten or printed characters. It is called Optical Character Recognition (OCR).

All of these rely on a machine learning technique called Deep Neural Networks (DNNs). Building and training DNNs is very expensive because it requires a lot of data and computing time.

To summarise, Bhārat Bhāṣā Stack spans across all three types of problems that DNNs are good at solving:

  • Speech: Automatic Speech Recognition, Speech Synthesis
  • Language: Natural Language Understanding, Machine Translation, Transliteration
  • Vision: Optical Character Recognition

Layers

Application developers should be able to focus on only their business logic. Bhārat Bhāṣā Stack should provide hooks to conversational AI tech for the rest of the steps in voice assistants. This section describes various layers at which applications can hook in.

Scripts

Scripts for a language are encoded using the Unicode character set. Information exchange between Conversational AI technologies happens using these character sets.

Indic scripts are largely phonetic, i.e., the spelling of a word closely follows its pronunciation. This characteristic might allow using speech data for training across similar languages.

The stack should utilize and address the uniqueness of Indic language speakers:

  • Many Indic languages are low-resource languages. Available speech and language data is not enough to train models.
  • Speakers often mix English and Hindi words into conversations (known as code-mixing or code-switching).
  • A significant shared vocabulary exists due to common linguistic origins.
  • Many Indians are multilingual. They speak 2 to 3 languages and understand 4 to 5 languages.
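
As a rough illustration of mixed-language input, the sketch below labels each word in a typed utterance by the script of its first letter, using Unicode character names from Python's standard unicodedata module. The function names are made up for this example. It also has a clear limit: mixed speech is often typed entirely in Roman letters, which a script check like this cannot detect.

```python
import unicodedata
from typing import List


def script_of(word: str) -> str:
    """Return the script of the first letter, e.g. 'DEVANAGARI' or 'LATIN'."""
    for ch in word:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "DEVANAGARI LETTER KA".
            return unicodedata.name(ch, "OTHER").split()[0]
    return "OTHER"  # digits, punctuation, and so on


def scripts_in(utterance: str) -> List[str]:
    """Label every whitespace-separated word with its script."""
    return [script_of(word) for word in utterance.split()]


def is_mixed_script(utterance: str) -> bool:
    """True when the utterance contains words from more than one script."""
    scripts = {s for s in scripts_in(utterance) if s != "OTHER"}
    return len(scripts) > 1


print(scripts_in("मुझे train ticket चाहिए"))
print(is_mixed_script("मुझे train ticket चाहिए"))
```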

Bharati Script, developed at IIT Madras, is designed to be a common script for Indic languages. The work shows that Indic languages can be transliterated using one-to-one character set mappings.

Bharati Script, invented at IIT Madras, can be a common script for Indic languages.
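
One way to see why one-to-one mappings between Indic scripts are plausible: Unicode lays out the major Indic script blocks in parallel (a design inherited from ISCII), so the same letter usually sits at the same offset within each 128-code-point block. The sketch below relies only on that property. It is not the Bharati Script method, and the function name and block table are illustrative. Letters that have no counterpart in the target script are left unchanged, and script-specific spelling conventions are ignored.

```python
import unicodedata

# Start of each script's 128-code-point block in Unicode.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def transliterate(text: str, source: str, target: str) -> str:
    """Shift each character from the source block to the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            candidate = chr(cp - src + dst)
            # Keep the original character when the target slot is unassigned
            # (unassigned code points have no Unicode name).
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)  # spaces, digits, Latin letters, and so on
    return "".join(out)


# Devanagari "कमल" (lotus) rendered in Kannada and Bengali scripts.
print(transliterate("कमल", "devanagari", "kannada"))
print(transliterate("कमल", "devanagari", "bengali"))
```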

Data

The availability and cost of the data is the biggest obstacle for most entrepreneurs. Curated data sets for speech and language are the lowest layer in the stack.

Many renowned institutes have been collecting data on Indic languages for their research. These institutes have hundreds of hours of data. But the data, as well as the know-how to use it, is often lost over time.

After a student graduates, thesis data is like grandmother’s precious jewelry box. Few know where it is, and nobody ever opens it. — A professor @ IISc Bangalore

For Speech Recognition, some available data sets:

For Natural Language, some available data sets:

Consolidating and establishing a LibriSpeech-like data set for research and development in Indic languages will pay rich dividends.

Models

Having data is the first mandatory step. But it requires a high level of expertise to train DNN models. It is costly too.

Making pre-trained, ready-to-use models available for Indic languages is the logical next step. Privacy-sensitive applications can either use these models on the device or host them as a service on an on-premise private cloud.

Software as a Service (SaaS)

SaaS frees developers from hosting the model and managing the service infrastructure. It makes it easier to start building applications.

All major cloud infrastructure providers offer SaaS for speech recognition, natural language understanding, and text-to-speech for some Indic languages.

These services are quite expensive (just as payment gateways were). Having more SaaS providers using these pre-trained models can bring the cost down.

Software Development Kit (SDK)

SDKs in popular programming languages and OS platforms form the final layer. SDKs can use the models or SaaS.
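
A minimal sketch of how an SDK layer might let an app switch between an on-device model and a hosted service without changing application code. All names here (SpeechRecognizer, OnDeviceRecognizer, HostedRecognizer, voice_search) are hypothetical, and the model and the network call are replaced by fake callables so that the example runs on its own.

```python
from typing import Callable, Dict, Protocol


class SpeechRecognizer(Protocol):
    """What the SDK exposes to apps, whichever backend is used."""

    def transcribe(self, audio: bytes, language: str) -> str: ...


class OnDeviceRecognizer:
    """Backend that runs a pre-trained model locally (model is an assumption)."""

    def __init__(self, model: Callable[[bytes, str], str]) -> None:
        self._model = model

    def transcribe(self, audio: bytes, language: str) -> str:
        return self._model(audio, language)


class HostedRecognizer:
    """Backend that calls a hosted service (send is a hypothetical transport)."""

    def __init__(self, send: Callable[[Dict], Dict]) -> None:
        self._send = send

    def transcribe(self, audio: bytes, language: str) -> str:
        response = self._send({"audio": audio, "language": language})
        return response["text"]


def voice_search(recognizer: SpeechRecognizer, audio: bytes, language: str) -> str:
    """Application code: depends only on the SDK interface."""
    return "search:" + recognizer.transcribe(audio, language)


# Fakes standing in for a real model and a real service.
def fake_model(audio: bytes, language: str) -> str:
    return audio.decode("utf-8")


def fake_service(request: Dict) -> Dict:
    return {"text": request["audio"].decode("utf-8")}


local = OnDeviceRecognizer(fake_model)
hosted = HostedRecognizer(fake_service)
audio = "चाय".encode("utf-8")
print(voice_search(local, audio, "hi"))
print(voice_search(hosted, audio, "hi"))
```

With an interface like this, a privacy-sensitive app could pick the on-device backend while another app uses the hosted one, and the rest of the code stays the same.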

Tuning a model or service for an application domain requires some ASR and NLU expertise. Domain-specific SDKs (e.g. for banking, e-commerce, agriculture) are needed to further reduce the entry barriers.

We at Slang Labs have learned this from our customers. We now offer Voice Assistant as a Service (VaaS) to improve customer adoption of Voice Assistants.

Bhārat Bhāṣā Stack. Technologies: Automatic Speech Recognition (ASR), Text-to-Speech (TTS), Natural Language Understanding (NLU), Machine Translation, Transliteration, and Optical Character Recognition (OCR).

Ecosystem Participants

It will take systematic and sustained collaboration to design and build the Bhārat Bhāṣā Stack:

  • Academia sharing research papers with code on Conversational AI problems relevant to India.
  • Industry building voice-enabled products and services for the common man.
  • Government playing a role like the one in building India Stack.
  • Industry Bodies speeding up collaboration through conferences and consortiums.

The government has been proactive in formulating AI policy:

NASSCOM and FICCI have been conducting workshops bringing companies and universities together. Slang Labs has been an avid participant.


Summary

Voice-enabled applications can bridge the internet divide across diverse linguistic and socioeconomic groups. Voice and natural language technologies are maturing, but remain prohibitively expensive.

Data and model training costs are formidable obstacles for entrepreneurs and small businesses. Bhārat Bhāṣā Stack for vernacular Indic languages will remove that entry barrier.

This article outlines the stack, and the components we need to build. We need to:

  • Consolidate ongoing efforts in various organizations
  • Pool resources to build it, and
  • Make it available at a low cost.

India Stack for FinTech succeeded because everyone, including the government, came together. Building Bhārat Bhāṣā Stack together for India’s unique needs is pivotal for its success and widespread adoption.

Here is a video of a domestic helper struggling to find her account balance in the SBI app. She hesitantly tries everything though the right button is prominently there on the first screen. Bhārat Bhāṣā Stack can make her life a tad easier.

Let’s do it!