How we used a series of caches to decrease inference latencies
Last year when we built our Slang VAX platform, we had one of India's biggest online grocer integrate Slang. As the number of incoming requests to our server grew, we noticed that our performance began to slow. Each query latency that took a millisecond to respond increased to over 1 seconds for English and over 2 seconds for non-English. In this article, we look at how we reduced our latency by over 30% and our ongoing efforts to improve it continuously.
One of the significant services of VAX, and now CONVA, is the infer service: the one responsible for handling the user voice queries and sending back the intent and entities recognized so that the app can perform the actions. Infer itself consists of 2 steps; the first is the handshake and then the actual infer request. The handshake consists of data such as prompts, hints, and other data that doesn't change from inference to inference. As a result, the handshake does not need to precede every actual infer request. Still, it is required for one entire session of the app or when the user changes the language. For example, when the app first opens up, Slang makes a handshake request and performs the pre-infer process at the client level. Handshake enables the app to take actions based on the infer request.
Before we dived into optimizing and making assumptions along the way, we started by understanding the overall system performance. Is the current response time the best we could do to handle infer requests or get better? What if we horizontally scale by adding more servers to handle the peak load? How would the system perform during average load and in peak? These were some of the questions that we initially set before we started investigating the performance bottlenecks.
As a standard process, best, average and worst-case latency across multiple requests are measured by percentiles. You'd often see the words p50, p95, p99, etc. These all mean the percentile on which the parameters were measured. p50 means the 50th percentile (meaning how well the system performs for 50% of the requests), p90 means the 90th percentile.
Percentiles are better suited to measure latency metric than averages. Consider this example; you have collected some data on how long the user waits on your website to load. You have collected 6 data points (in milliseconds): 62, 920, 37, 20, 850, and 45. If you take the average load time, you get 322 ms. But 322 ms is not the accurate metric for the user experience. From this data, it's clear some of your users are having a high-speed experience (less than 70ms), and some are having a prolonged experience (more significant than 850ms). But none of them has an average experience. Average can give you the incorrect metric.
The way to avoid being misled is to use percentiles. An excellent place to start is P50 and P90. To compute P50, which is just the median, sort the data points in ascending order: 20, 37, 45, 62, 850, 920. You get P50 by throwing out the bottom 50% of the points and looking at the first point that remains: 62ms. You get P90 by throwing out the bottom 90% of the points and looking at the first point, which remains: 920.
To measure our current performance, we loaded the inference into a sandbox environment. We simulated load by sending a predefined number of actual requests at a predefined rate. We took into account the p75 latencies during high load.
Running tests under varying load and RPS (Request Per Second), we were able to identify three significant bottlenecks.
🙌🏻 Handshakes required sending the entire schema to the client for post inference operations, and doing so each time a client session ended was costly in terms of time and actual financial cost. The handshake schema could be huge for complex applications, and additional language support could increase in size to over 4MB. This was a bottleneck for the fast initialization of the application during launches.
🙌🏻 For every request, we made round trips to the NLP service, which is costly. It involves loading the model, extracting the intents and entities, and encoding and decoding the response.
🙌🏻 Slang supports non-English languages like Kannada, Tamil, Malayalam and Hindi. However, under the hood, the NLP service works only in English, so the text requires additional translation and transliteration before being sent to the NLP. A request that involves the translation of the text can roughly take two times more time than an English text.
The answers, for questions we set initially, were found with the help of the test data. As it turned out, we were not efficiently utilizing our resources. There was much room for improvement with the current setup that we had.
If we closely analyzed the bottlenecks, we could see that most of the request time was spent on the NLP or the translation system. We realized that most requests were repeated with a long tail of first-time requests. But, an NLP engine working on a model, until trained again, will reply with exactly the same answer. Therefore, asking NLP to spend time on precisely the same kind of request to get the same response is overkill. What if a layer could store a map between a given request and a given response. Every time the user says 'onions', we just lookup in the map and reply with the same response that the NLP engine had already processed. Imagine if hundreds of users speak the same set of sentences, the amount of processing time we would save.
The other obvious problem with the current system is the number of unnecessary calls the client has to make even when there are no schema changes.
A preflight request is a small request that the client sends before the actual client request. This is done to figure out if the larger, more expensive request is even required. The larger request may not be required because the data backing it has not been invalidated yet. Thus we can use the same one the client had received on the previous request.
We used a preflight HEAD request first to check if the client handshake was needed. This gave us manifold improvement. The client did not have to do a handshake every time before an infer request, since we implemented a client cache that gets invalidated based on preflight response.
As we mentioned above, to improve the latency of NLP and translation responses, we decided to go with implementing cache due to the nature of the response. The request did not have to go to the NLP or the translation each time if it could be reused again across requests.
When designing a cache system in a multilayer system (layers such as client, server, model among others) like CONVA, we could benefit from adding a cache to each layer in the system, but based on the nature of the layer there were choices to be made on the type of cache and the invalidation strategy of the cache.
First was the L1 cache. An L1 cache is a volatile cache that stores data in RAM, which means that if you boot up the system again or the system crashes, all the data is lost. But, the upside of an L1 cache is that they are the fastest of all the possible caches mentioned here. The volatile nature of the cache means that upon reboot, the cache has to be repopulated upon every inference request.
To account for the non-persistent nature of the in-memory cache, one can use an application like Redis or Memcached to create an L2 cache. This solves the problem by periodically saving the in-memory data into a hard disk. This reduces the impact of the volatility of an L1 cache.
We first implemented the L1 cache for resources in the critical path. The way it now works is when a handshake/infer request is fired, we do an in-memory cache lookup and fetch the result from the cache; however, if there is a cache miss, then the request goes all the way to NLP service and responses are saved in the in-memory cache so that subsequent request can retrive the data from the cache.
We then started exploring Redis and Memcache, these are both in-memory data storage systems with support for data persistence. Redis is an open-source key-value store and both store most of the data in-memory. However, Redis supports operations on various data types including strings, hash tables, and linked lists among others. And it was because of this rich feature support, we decided to go ahead with Redis for our L2 cache layer.
The Redis cluster is not self-hosted; instead, it is an instance hosted on the cloud provider. Cloud-hosted Redis ensures high availability and less overhead on the team to manage our cluster.
In the two-layer cache system design, imagine an inference request is sent to a Slang server, the response can be satisfied in one of the ways described below.
At this point, we have discussed details of our multi layer cache system but there is one thing that is still missing. Lets understand when our cache evicts or invalidates the cache entry.
Normally, the trade-off of a cache is that to be fast, it will also be small. That means it will store less data in a smaller, faster and more expensive memory unit. But that means, at some point this memory is going to be filled up and we need policy to either ignore this for new requests or to make space for the new request by removing old stored requests.
Our cache uses LRU (Least Recently Used) policy. LRU discards items that have not been used for a long time (more than other elements in the cache). The least recently used elements get evicted first when the cache gets full. It ensures that our memory doesn’t get bloated.
There are many possible eviction policies, read about them on Wikipedia. But, just to contrast, another possible eviction policy is Least Frequently Used. What this means is we order the requests from most used to least used and then remove the one that is used least with the logic that if not many people or processes need it, then we can probably do without it. The problem here though, is that, if we have a large long tail of single use requests, then it is possible that almost 80% of the requests are equivalent in that they are the least used. Then the cache will have a low hit ratio, because as soon as the request is stored, it is removed for another one and then that one is removed itself, and so on. This is called thrashing. And we have seen that CONVA (or rather VAX, it’s previous iteration) has this kind of a long tail of requests.
There are only two hard things in Computer Science: cache invalidation and naming things — Phil Karlton.
Another trade-off of a cache is that it will serve data faster, but it may not necessarily be the most recent data. Why is that so? Because the cache saves time by not looking into the database or by not performing some function that computes a possible new value for the data. This means we needed a policy to decide when the cache was stale and when we needed to possibly recompute the data and refresh the cache.
To invalidate the cache, we revisited our inference flow to better understand when the response from NLP/handshake/translation would change. Changes to schema required the retraining of the assistant. This means once the train completed, all caches had to be invalidated. But how did we know if the train has completed. We could do that in two ways. The first was a broadcast system. Once the train had completed it would inform each of the processes that a train had, in fact, completed. The process then knew that on the next inference it should look at the data base.
The alternative was, storing a flag in a particular position and having the concerned process continuously and periodically ping this flag. If the flag had changed, then that meant the underlying data, or model had been invalidated and the cache had to be refreshed on the next inference. This was called the time-to-live (TTL) approach. Since the period (TTL) of the ping was configured into the service, and that was the time until which it assumed the data was still ‘alive’ or fresh. Keeping a small TTL means that the cache very frequently invalidates its cache. If the TTL goes to 0, it is then the same as not having a cache at all.
We used the second type. When an assistant got trained, it stored the timestamp at which the train completed in the assistant’s metadata. At inference time, once the TTL ran out, it pinged the database for the assistant’s metadata. It looked at the last train timestamp, if it was greater than the one it already knew then that meant that a train had occurred since the last time. The process then invalidated its own data and requested for the new data.
Above, we spoke of two layers in the inference flow, there were actually many more caches that we implemented all over the server. Another one of these was the NLU model cache. The NLU model itself took about 2 seconds to load during inference. It first had to be fetched from the database, get deserialized and then be used. However, we could save time by caching the deserialized model itself. Using the principles mentioned above we did exactly that, we cached the model itself and chose a TTL eviction strategy that evicted it based on the last timestamp available in the metadata.
So now we had a cache in the client itself, we had two different caches in the server which were keyed on the utterance, and we had an NLU model cache that was keyed on the assistant ID.
The reason we had so many caches were, they all had different trade-offs and all saved processing time in different ways at their respective layers.
The client cache was only for that client. If a user spoke the same sentence on the phone, then we could use a previous response that was keyed using the utterance, looking out for usual cache invalidation, obviously. But, since this cache was only for the client, the repetition of utterances did not benefit other clients, since there was no sharing of data across clients. This, of course, was the fastest from the client point of view since the request did not even reach the server, thus, saving on network and processing time.
As mentioned at length above, the L1 cache saved time by not having to go to NLU to get the response, since a previous response for the same utterance was stored and keyed by that utterance. However, the issue with this cache was that it was volatile and couldn’t keep its memory across reboots.
The L2 cache saved time by not having to go NLU, but was slower than L1 and, of course, only worked if the utterance has been spoken before.
The final resort for all utterances was the actual NLU process if they were not present in the lower cache. But, if we knew for a fact that the assistant model had remained consistent with the one in the DB, why would we have had to look up the DB for a model every time?
So every layer had its own purpose and together they maintained correctness and sped up returns due to the principle of locality. (Quick digression: What is the principle of locality you ask? Ever used Instagram nowadays? You’ve seen one post with one thing and then every post after that is the same thing?)
Looking at the explanation above, you may think that a better metaphor for caches are sieves, but hey, who doesn’t like themselves a piece of wedding cake!
Performance improvement is ongoing work. This is one way we reduced overall latency and got some great results on the infer system.
Some of the takeaways while solving the problem:
Slang CONVA provides the easiest and fastest way to add In-App Voice Assistants to mobile and web apps. Sign up for an account at https://www.slanglabs.in to create your own retail assistant, or see Introducing Slang Conva for more information.