Evaluating Whisper's Spoken Language Identification capabilities
Spoken Language Identification (SLID) is a well-known and active research problem. The task is to classify the language spoken in a given audio sample. The goal of this experiment is to evaluate the Whisper model on SLID for Indian English and Hindi.
Whisper is an open-source model trained and released by OpenAI. It is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It can perform multiple tasks, such as language identification, transcription, and translation of non-English speech into English. You can learn more about the model on OpenAI's blog and test its capabilities using the resources provided in their GitHub repo.
We used our playground app to collect audio samples within the company. Each sample was labeled with the language the tester selected while recording. We had 238 English audio samples and 99 Hindi samples from the retail domain. Indian English was labeled as 1 (positive) and Hindi as 0 (negative).
You can learn more about the Slang playground app here
In our previous experiments, the CLSRIL-23 model produced reasonably accurate results, so we treat it as our baseline. The baseline scores were calculated using the method specified in its paper, i.e., clustering the speech representations output by the model.
A visualization of Hindi and English speech representations provided by the CLSRIL model.
We evaluated the Whisper models, from the tiny to the medium variants, to see how well each performs on SLID with Indian English (en-IN) and Hindi (hi-IN). Since only these two languages matter here, the evaluation script was modified to consider only the Hindi and English scores and mask out the scores for all other languages. For evaluation, we calculate the following metrics: accuracy, precision, recall, and F1-score. We also measured the latency of each model to understand which one is fit for rolling out to production.
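To make the masking step concrete, here is a minimal sketch of how per-language scores could be restricted to English and Hindi. It is not the exact evaluation script: the function names, the renormalization step, and the tie-breaking rule are our assumptions for illustration, and the example scores are made up. The input is assumed to be a dictionary of language code to probability, which is what `model.detect_language(mel)` returns in the openai-whisper package.

```python
def restrict_to_languages(probs, allowed=("en", "hi")):
    """Keep only the allowed language codes and renormalize their scores.

    `probs` maps language codes to probabilities. Every language outside
    `allowed` is masked out (dropped) before renormalizing.
    """
    kept = {lang: probs.get(lang, 0.0) for lang in allowed}
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("none of the allowed languages has a non-zero score")
    return {lang: score / total for lang, score in kept.items()}


def predict_label(probs, positive="en", negative="hi"):
    """Return 1 for Indian English (positive) or 0 for Hindi (negative).

    Ties go to the positive class; this is an arbitrary choice.
    """
    masked = restrict_to_languages(probs, allowed=(positive, negative))
    return 1 if masked[positive] >= masked[negative] else 0


# Hypothetical scores for one clip. In practice they would come from Whisper,
# e.g. `_, probs = model.detect_language(mel)`.
example_probs = {"en": 0.30, "hi": 0.45, "ur": 0.20, "pa": 0.05}
masked = restrict_to_languages(example_probs)
label = predict_label(example_probs)
```

Without the mask, a Hindi clip that Whisper scores highest as a closely related language would count as an error even though it was never scored as English. With the mask, the decision is always a two-way choice between the languages we care about.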
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F1-score = (2 * Precision * Recall) / (Precision + Recall)
TP = True Positives
TN = True Negatives
FP = False Positives
FN = False Negatives
True positives are the number of samples labeled as positive and predicted as positive.
True negatives are the number of samples labeled as negative and predicted as negative.
False positives are the number of samples labeled as negative and predicted as positive.
False negatives are the number of samples labeled as positive and predicted as negative.
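The definitions above translate directly into code. The following is a small sketch in plain Python, assuming labels and predictions are lists of 0s and 1s with Indian English as 1 and Hindi as 0. The function names and the toy data are ours, and returning 0.0 when a denominator is zero is a convention we picked for the sketch.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1-score from the counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data, not our evaluation set: three English clips and two Hindi clips.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
```

In this toy example there are 2 true positives, 1 true negative, 1 false positive, and 1 false negative, so accuracy is 3/5 and precision, recall, and F1-score are all 2/3.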
The results after evaluation can be seen in the table below.
In terms of accuracy, all of the Whisper models outperformed the CLSRIL model by a large margin. The CLSRIL model does have 100 percent recall, which means that it didn’t produce a single false negative. This is more evident if you check the visualization of CLSRIL’s speech representations.
In terms of latency, all models have a sub-second p99 latency. The Whisper small and medium models were slower than our baseline model.
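For readers unfamiliar with p99 latency, it is the value that 99 percent of the measured requests stay at or below. Below is a rough sketch of how it could be measured. The helper names are hypothetical, the nearest-rank percentile is one of several common definitions, and `predict` stands in for whatever function runs a model on one audio sample; it is not necessarily how our numbers were produced.

```python
import math
import time


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of data <= it."""
    if not values:
        raise ValueError("cannot take a percentile of an empty list")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def measure_latencies(predict, inputs):
    """Time one call to `predict` per input and return the durations in seconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        predict(item)
        latencies.append(time.perf_counter() - start)
    return latencies


# Stand-in for a model call, so the sketch runs without any model installed.
def fake_predict(sample):
    return len(sample)


latencies = measure_latencies(fake_predict, ["clip-a", "clip-b", "clip-c"])
p99 = percentile(latencies, 99)
```

With only a few hundred samples, the p99 value is determined by the handful of slowest calls, so it is worth running a few warm-up predictions first so that one-time model loading does not dominate the number.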
The audio data can be downloaded from here: https://drive.google.com/drive/folders/1dNLeuRKg_y9AUDX2z0w7ISwxbs9x61PI?usp=sharing
The Colab notebook is available at https://colab.research.google.com/drive/1l-CrROo8x0DZLGQjPxRQlBrHxFPJdj7E#scrollTo=f5tGniEqy6zY