Slang Visits… BHARAT NLP series by Niti Aayog
On Friday, 30th November, Slang had the opportunity to represent industry at large at the first workshop of the Building Heuristic Architecture for Artificial Intelligence (BHARAT) NLP series, organized by Niti Aayog and IIC, University of Chicago at Microsoft Research Labs, Bangalore. This workshop is part of the beginnings of an endeavour to collate, organize and, potentially, create resources for Natural Language Processing in Indian languages in the form of a toolkit available to all researchers, practitioners, hobbyists, and those in the industry.
The program was kicked off with a call to action by Ms. Anna Roy from NITI Aayog, followed by many short talks by industry and academic leaders, such as Dr. Vivekananda Pani and Dr. Pushpak Bhattacharyya. It also included a panel discussion featuring an industry leader and representatives of two startups, one of whom was our very own Kumar Rangarajan of Slang Labs.
The highlight of the day, however, was the breakout sessions, in which groups of participants took part in moderated brainstorming to answer questions such as: what is the current state of resources for NLP in Indic languages, and what would be the desired outcome of an NLP toolkit for Indic languages if one were to be created from scratch? It was refreshing to see the level of discussion amongst the participants, even if it got a little heated sometimes. Anyone seen 1957’s 12 Angry Men? Imagine the NLP version of those jurors. Now imagine 8 copies of them in one large room.
But if the participants had to agree on one thing, it was that there is an urgent need for NLP resources to be available at a single, accessible location. Right now there are many toolkits for many Indic languages, but they are fragmented and hosted at multiple sites by multiple research groups in both industry and academia. This often makes access difficult: the resources are hard to discover, are not in a ready-to-download format, or come in various formats with little supporting documentation.
We all found that the ease of access to NLP tools for the English language, such as spaCy, the Stanford NLP toolkit and NLTK, to name just a few, led to easy adoption and fast prototyping, which in turn led to their popularity. This is something that the toolkit that is supposed to come out of this endeavour should aim to emulate.
In general, tools in NLP are wide-ranging and can be grouped in many ways. One such grouping is by level of abstraction of use. Some tools are directly useful at the application layer, such as spell checkers and intent classifiers. Others are more specialized, such as Named Entity Recognizers (NER), Part-of-Speech (POS) taggers and morphology parsers, and can be used by the more NLP-inclined. There is a need to serve tools for both layers of abstraction. In addition to this, the most cutting-edge research usually employs deep learning, and deep learning in NLP usually uses (and sometimes abuses) word embeddings. Seeing the success of off-the-shelf embeddings such as Google’s Word2Vec, Stanford’s GloVe embeddings and Facebook’s FastText, there could be a need for such ready-made embeddings for each language, if only to cut development time for many practitioners.
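To make the appeal of ready-made embeddings a little more concrete, here is a minimal sketch of what a practitioner typically does with them: look up word vectors and compare them by cosine similarity. The three transliterated Hindi words and their 3-dimensional vectors are invented purely for illustration and do not come from any real embedding set; real embeddings usually have a few hundred dimensions.

```python
import math

# Hypothetical toy vectors, made up for this example only.
# "paani" and "jal" both mean water; "gaadi" means vehicle.
TOY_EMBEDDINGS = {
    "paani": [0.9, 0.1, 0.0],
    "jal": [0.8, 0.2, 0.1],
    "gaadi": [0.0, 0.9, 0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to `word`."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    # Pick the candidate with the highest similarity score.
    return max(scored, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    print(most_similar("paani", TOY_EMBEDDINGS))
```

With a downloadable embedding file for a language, the dictionary above would simply be loaded from disk instead of typed by hand, which is where most of the saved development time would come from.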
If and when it succeeds, whatever comes out of this workshop will be tremendously useful to companies, research groups and startups like Slang Labs that aim to bring the technology of the 21st century to the masses of countries like India, who naturally speak languages other than English.
All in all, Slang Labs applauds these efforts taken by Niti Aayog and IIC as part of Niti Aayog’s #AIForAll Program launched this year, and hopes to work closely with them to help this and other resources see the light of day.