Upgrading Elasticsearch at Redmart Pt 2 - Stress Testing

This is part 2 of a three-part series about how we upgraded from Elasticsearch 1.3 to 5.6 at Redmart. Upgrading a critical piece of infrastructure that directly impacts the customer experience is a large task, and this post covers how we stress tested the system.

1) The code level changes - This post is about the code changes needed to move our queries from 1.3 to 5.6 compatibility. You can think of this as the unit testing phase.
2) The system level tests (this post) - This post is about how we stress tested the system to make sure it could handle the necessary levels of traffic, as well as all the different types of queries that we use. You can think of this as the integration testing phase.
3) The user level tests - This post is about how we tested our customers' reactions to the new system. You can think of this as the acceptance testing phase.

Introduction

In order to thoroughly test a large piece of infrastructure, you need to verify two important things:

1) Stress testing - can the system handle production levels of traffic gracefully?
2) Edge cases - can the system handle all the different scenarios that customers will throw at it?

Technical Testing

At Redmart, we use many different Elasticsearch queries for different contexts. It is very difficult to exhaustively unit test all possible combinations of personalization and customization. Therefore, before we could have confidence that our version upgrade was correct, we needed to test it with real user behavior. In order to do this, we did both stress testing and a dark launch.

Our backend is built as microservices, and two of them are responsible for the majority of our search infrastructure: the catalog service, which faces frontend clients (website, mobile apps, bots, etc.) and handles prettying up the results, and the search service, which interfaces with Elasticsearch by transforming the input into an Elasticsearch query and interpreting Elasticsearch's response.

Existing Search

Stress Testing

In order to stress test the new cluster, we used Siege. Siege can read a list of URLs from a file, pick them at random, and fire many requests in parallel. This allowed us to easily change the host URL between the existing cluster and the new cluster and verify that we got similar performance numbers in terms of queries per second.
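As a rough illustration of the setup, the sketch below writes a file of search URLs for one host and launches Siege against it. The host names, URL path, query terms, and the concurrency/duration values are all hypothetical; only the general approach (one URL file per cluster, identical traffic against both) reflects what we did.

```python
# Minimal sketch, assuming siege is installed locally. Hosts, paths, and
# siege parameters are illustrative, not our production values.
import subprocess

OLD_HOST = "http://search-old.internal:9200"   # hypothetical host names
NEW_HOST = "http://search-new.internal:9200"

def write_urls(host: str, queries: list[str], path: str = "urls.txt") -> None:
    """Write one search URL per line; siege picks lines at random in -i mode."""
    with open(path, "w") as f:
        for q in queries:
            f.write(f"{host}/catalog/_search?q={q}\n")

def run_siege(path: str = "urls.txt") -> None:
    # -i: pick a random URL per request, -c: concurrent users, -t: duration
    subprocess.run(["siege", "-i", "-c", "50", "-t", "5M", "-f", path], check=True)

if __name__ == "__main__":
    queries = ["milk", "eggs", "rice"]   # in practice, sampled from production logs
    for host in (OLD_HOST, NEW_HOST):
        write_urls(host, queries)
        run_siege()                      # compare transactions/sec between the two runs
```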

Using this, we identified a few types of custom queries that were performing inefficiently. The culprit ended up being some of the aggregations we were running, which were not necessary in some contexts. While the aggregations were syntactically correct, they did not behave exactly as they did in 1.3, so we modified them to perform more efficiently. We also identified some aggregations that were being run on EVERY query, even when the frontend client did not display the aggregation results. Since aggregations took up more than 50% of the execution time of the queries, we eliminated them from these types of queries, resulting in dramatic improvements in performance (queries went from 100 milliseconds to 20).
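The shape of that fix is roughly the sketch below: only attach the aggregations block when the client will actually render facet counts. The field names and the include_facets flag are made up for illustration; our real query builder lives in the search service.

```python
# Illustrative sketch only; field and parameter names are hypothetical.
def build_search_body(term: str, include_facets: bool) -> dict:
    body = {
        "query": {"match": {"title": term}},
        "size": 20,
    }
    if include_facets:
        # Aggregations accounted for more than half of query time, so we only
        # add them when the client will actually display facet counts.
        body["aggs"] = {
            "brands": {"terms": {"field": "brand", "size": 20}},
        }
    return body
```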

Dark Launch

A dark launch works by having the new version of the software running "silently" in production, but still serving traffic. To accomplish this, we used a tool called TeeProxy to duplicate incoming traffic. It sends the duplicated requests to the new system while only ever showing customers the results from the existing system.
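Conceptually, the duplication looks like the toy sketch below; this is not TeeProxy's code, just a minimal Python illustration of the idea, and the host names are hypothetical. Every request is forwarded to the live service, whose response goes back to the caller, while a copy is fired at the new system and its response is discarded.

```python
# Toy illustration of what a tee proxy does. We used the real TeeProxy tool.
import threading
import requests
from flask import Flask, Response, request

LIVE = "http://search-live.internal:9000"     # existing (1.3-backed) service
SHADOW = "http://search-dark.internal:9000"   # new (5.6-backed) service

app = Flask(__name__)

def shadow_call(path: str, params: dict) -> None:
    try:
        # Response is discarded; customers never see it.
        requests.get(f"{SHADOW}/{path}", params=params, timeout=5)
    except requests.RequestException:
        pass  # failures in the dark system must never affect customers

@app.route("/<path:path>", methods=["GET"])
def tee(path):
    params = request.args.to_dict()
    threading.Thread(target=shadow_call, args=(path, params), daemon=True).start()
    live = requests.get(f"{LIVE}/{path}", params=params, timeout=5)
    return Response(live.content, status=live.status_code)
```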

Dark Launch Search

There were two important sources of data for evaluating the new cluster. The Elasticsearch logs themselves recorded whenever an invalid query was sent, which hinted at a type of query we needed to fix. We also have some analysis running in the catalog service, which logs how many products were returned for a given query and generates a unique id for the search based on the time, customer, and query performed. This id lets us compare the results of the dark launch with the results of the existing services. We logged this data to a separate log file and shipped that file to an external system for analysis, where we looked for queries that returned different numbers of results between the new cluster and the old cluster.
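A minimal sketch of that comparison logging is below. The log format, field names, and hashing choice are illustrative; the important part is that the id is deterministic, so the same search can be matched up across both clusters downstream.

```python
# Sketch of the comparison logging; field names and format are hypothetical.
import hashlib
import json
import logging

logging.basicConfig(level=logging.INFO)   # in production this went to a dedicated log file
comparison_log = logging.getLogger("search_comparison")

def search_id(timestamp: str, customer_id: str, query: str) -> str:
    """Deterministic id so the same search can be matched across both clusters."""
    raw = f"{timestamp}|{customer_id}|{query}"
    return hashlib.sha1(raw.encode()).hexdigest()

def log_result_count(timestamp, customer_id, query, cluster, num_results):
    # These lines are shipped to an external system, where searches are grouped
    # by search_id and flagged if the result counts differ between clusters.
    comparison_log.info(json.dumps({
        "search_id": search_id(timestamp, customer_id, query),
        "cluster": cluster,          # "old" or "new"
        "query": query,
        "num_results": num_results,
    }))
```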

Findings

We found a problem with one form of personalization we were doing. We implement personalization using boost queries, and some customers' histories generated over a thousand boost clauses. This did not work in version 5.6 of Elasticsearch, as there is a limit to how many items you can boost in a single query. It caused the dark launch to return 0 items for certain customers, while the old cluster worked fine. It's a good thing we caught this before showing our customers, as this would have been a very difficult situation to reproduce in production unless our test users happened to have the same amount of personalization as the affected customers.
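The rough shape of the failing query is sketched below. The field names and boost values are hypothetical; the relevant constraint is that Elasticsearch 5.x caps bool queries at 1024 clauses by default (indices.query.bool.max_clause_count), so a customer whose history produced more boost clauses than that broke the entire search on the new cluster.

```python
# Illustrative shape of a personalized query; field names are hypothetical.
def personalized_query(term: str, boosted_product_ids: list[str]) -> dict:
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": term}}],
                # One "should" clause per previously purchased product. With more
                # than ~1024 clauses, 5.x rejects the query outright.
                "should": [
                    {"term": {"product_id": {"value": pid, "boost": 2.0}}}
                    for pid in boosted_product_ids
                ],
            }
        }
    }
```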

Further Improvements

There were other things we could have tested. For example, one common approach to automating search evaluation is to maintain "golden sets" for your most frequently searched terms. Then you can programmatically check that the golden set items appear within the top 10 results for the terms you care about after any change to your search algorithm. In the absence of golden sets, you could simply use your previous results as the set to compare against and check, for example, that at least 80% of the top 10 items are the same across most of the search terms you care about; a sketch of that check follows. We will go into some math that explains why the results would have changed in the next blog post. In the end, due to the other issues we found, we did not have the bandwidth to tackle these things.
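Here is what that overlap check could look like. The endpoint, response shape, and the 80% threshold are the hypothetical values from the paragraph above, not a tuned recommendation.

```python
# Sketch of a top-10 overlap check between the old and new clusters.
import requests

def top_ids(host: str, term: str, n: int = 10) -> list[str]:
    # Hypothetical search endpoint and response shape.
    resp = requests.get(f"{host}/catalog/search", params={"q": term, "size": n}, timeout=5)
    return [product["id"] for product in resp.json()["products"][:n]]

def overlap(old_ids: list[str], new_ids: list[str]) -> float:
    return len(set(old_ids) & set(new_ids)) / max(len(old_ids), 1)

def failing_terms(terms: list[str], old_host: str, new_host: str) -> list[str]:
    """Return the terms whose top-10 overlap falls below 80%."""
    return [t for t in terms
            if overlap(top_ids(old_host, t), top_ids(new_host, t)) < 0.8]
```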

Takeaways

Dark launches can be a great way to test a complex system before releasing to your customers. To run an effective dark launch, you need two important pieces: infrastructure that can support doubling your traffic, and data generation and analysis that let you find problems and understand how to fix them. A dark launch also gives you confidence that your new system can handle production levels of traffic.

Conclusion

After migrating the code, stress testing the new system, and performing a dark launch, we still weren't ready to release to production. We had one final test to run, which was to see how our customers liked the changes.

Part 3