Upgrading Elasticsearch at Redmart Pt 3 - Testing Customer Reactions

This is the final part of a three-part series about how we upgraded from Elasticsearch 1.3 to 5.6 at Redmart. Upgrading a critical piece of infrastructure that directly impacts the customer experience is a large task, and this post covers how and why we used AB tests to measure our customers' reactions.

1) The code level changes - covers the code changes needed to move our queries from 1.3 to 5.6 compatibility. You can think of this as the unit testing phase.
2) The system level tests - covers how we stress tested the system to make sure it could handle the necessary levels of traffic, as well as all the different types of queries we use. You can think of this as the integration testing phase.
3) The user level tests (this post) - covers how we tested our customers' reactions to the new system. You can think of this as the acceptance testing phase.

Introduction

How can you be certain that a change is good for your customers? You can use your expertise about your own product to make changes which you think are good, but the best way to make decisions is with data. One way to collect data is to expose new features or changes to a percentage of your customers, chosen randomly, and measure their reaction. This is called AB testing, and is the final tool we used to test our 5.6 Elasticsearch upgrade.

User Experience Testing

At Redmart, we use Elasticsearch to power almost every shopping page. Because this is such a critical dependency, small changes can have large impacts on our customers. However, this upgrade was a very large change! When customers go to the wine section, they like seeing the red wine they usually buy at the top. When a customer searches for chicken, they expect to see food and not a costume as the first result. To make sure that the way Elasticsearch 5.6 ranked products was satisfactory for our customers, we needed to perform an AB test. The important metrics we wanted to track were how many items customers added to cart and how much money they spent with us (overall revenue).

What Changed

In version 5.6 of Elasticsearch, the range of the scoring values changed (Elasticsearch 5 replaced TF/IDF with BM25 as the default relevancy algorithm). Relevant documents were still ranked higher than others, but by different amounts. Our customization functions add values on top of the relevancy scores, which means that the ranges of those customization values may need to change as well. Let's make this concrete with a classic example of confusing search: chocolate milk vs milk chocolate when the user searches for chocolate.

[Figure: Elasticsearch 1.3 relevancy scores for the query "chocolate"]

[Figure: Elasticsearch 5.6 relevancy scores for the query "chocolate"]

Here you can see that even though the two documents are scored relatively similarly to each other, the customization values which previously caused Milk Chocolate to be ranked higher are no longer sufficient to have that effect. We could manually verify the results for "chocolate" (and we did), but this problem extends across thousands of potential customer queries!
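A minimal sketch of the arithmetic with invented numbers (the scores and boost value below are illustrative, not taken from our system):

```python
# Hypothetical relevancy scores for the query "chocolate".
scores_13 = {"Chocolate Milk": 2.1, "Milk Chocolate": 1.9}    # ES 1.3 score range
scores_56 = {"Chocolate Milk": 12.4, "Milk Chocolate": 11.0}  # ES 5.6 score range

# An additive customization boost, tuned while scores looked like 1.3's.
CUSTOMIZATION = {"Milk Chocolate": 0.5, "Chocolate Milk": 0.0}

def rank(relevancy):
    final = {doc: score + CUSTOMIZATION[doc] for doc, score in relevancy.items()}
    return sorted(final, key=final.get, reverse=True)

print(rank(scores_13))  # ['Milk Chocolate', 'Chocolate Milk'] -- the boost flips the order
print(rank(scores_56))  # ['Chocolate Milk', 'Milk Chocolate'] -- same boost, no longer enough
```

The raw relevancy order is identical in both versions; only the size of the gap between the scores changed, which is exactly why the customization values needed to be re-examined against the new ranges.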

Even though we accounted for the changing structure of 5.6 queries (see previous posts for details), we still needed to make sure that our customers liked the new results.

AB Test

An AB test is when you take random segments of your users and show them different versions of your features. Then you measure how they react through metrics such as how much they purchase, how quickly they leave the web site, and others. Finally, you compare the numbers to see if there's a statistically significant impact to decide whether or not to move forward. You can read more about this topic on wikipedia. We use Optimizely as our AB testing tool, which integrates easily with our web code and makes it easy to view experiment results.
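To make "statistically significant" concrete (Optimizely performs this kind of analysis for us; the numbers below are invented), here is a minimal sketch of a two-proportion z-test comparing a conversion-style metric between the two groups:

```python
import math

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for whether two conversion rates differ."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Invented numbers: add-to-cart events out of sessions, per group.
z, p = two_proportion_ztest(4120, 50000, 4075, 50000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p > 0.05 here, so no significant difference
```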

In order to show different versions to different customers, we needed the frontend clients to send different parameters. When a request came in with the query parameter "variation=a", the catalog service would direct the request to the new cluster. Otherwise, it would direct the request to the existing cluster. The value of this variation parameter is decided by Optimizely, which handles splitting customers into random segments for us. This is different from our dark launch because now we are showing results from the new cluster to our customers!
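In rough pseudocode, the routing inside the catalog service looked something like the sketch below (the hostnames and parameter handling are simplified stand-ins, not our actual code):

```python
OLD_CLUSTER = "http://es13.internal:9200"  # hypothetical hostnames
NEW_CLUSTER = "http://es56.internal:9200"

def pick_cluster(query_params: dict) -> str:
    """Route a search request based on the Optimizely-assigned variation.

    The frontend forwards the variation it received from Optimizely;
    the catalog service never decides who is in which segment.
    """
    if query_params.get("variation") == "a":
        return NEW_CLUSTER
    return OLD_CLUSTER

# e.g. a request arriving as /search?q=chocolate&variation=a
print(pick_cluster({"q": "chocolate", "variation": "a"}))  # new 5.6 cluster
print(pick_cluster({"q": "chocolate"}))                    # existing 1.3 cluster
```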

[Figure: AB search setup]

Strategy

We ran several progressive AB tests, slowly increasing the percentage of our customers who were allocated to the new variation with each test. It is generally considered poor practice to alter the allocation in the middle of a test, as doing so can obscure the results, so every time we increased the traffic we started a new test to gather fresh data. For each test, we measured how many items customers added to cart per page view, and the overall revenue per customer. Even when you have made great technical improvements, it is critical to make sure that your customers still like the changes and that there is no negative impact on the business.
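As an entirely invented illustration of those two metrics computed per group (the real analysis lived in Optimizely, not in a script like this):

```python
# Invented per-test aggregates; each ramp-up step was a brand-new test,
# so each step's data started fresh.
control = {"page_views": 200_000, "add_to_carts": 16_400,
           "customers": 21_000, "revenue": 1_890_000.0}
variation = {"page_views": 198_500, "add_to_carts": 16_310,
             "customers": 20_800, "revenue": 1_872_000.0}

for name, agg in (("control", control), ("variation", variation)):
    add_to_cart_rate = agg["add_to_carts"] / agg["page_views"]
    revenue_per_customer = agg["revenue"] / agg["customers"]
    print(f"{name}: add-to-carts per page view = {add_to_cart_rate:.4f}, "
          f"revenue per customer = ${revenue_per_customer:.2f}")
```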

Manual Feedback

The architecture of this setup made it easy to point our internal testing site at the new cluster as well. We hard coded our beta website to always indicate that the user was part of the new customer segment, then asked internal employees to look at the new results. This helped divide up the work needed to verify the changes, as the category managers (the heads of the business departments that manage different types of products) have far more context than the developers do about which items should appear higher within their categories. We also made sure that all Redmart employees were part of the new variation in the AB test, as a form of dogfooding.
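A minimal sketch of that kind of override (the helper and field names are hypothetical, not from our codebase):

```python
import random
from dataclasses import dataclass

@dataclass
class User:
    user_id: str
    is_employee: bool

def optimizely_bucket(user: User) -> str:
    # Stand-in for the real Optimizely SDK call, which assigns
    # customers to random, sticky segments.
    return random.Random(user.user_id).choice(["a", "b"])

def assign_variation(user: User, site: str) -> str:
    """Beta-site users and employees always see the new 5.6 results;
    everyone else is split randomly between the two clusters."""
    if site == "beta" or user.is_employee:
        return "a"  # pinned to the new cluster for internal review
    return optimizely_bucket(user)

print(assign_variation(User("cat-mgr-1", True), site="www"))     # always "a"
print(assign_variation(User("customer-42", False), site="www"))  # "a" or "b"
```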

Findings

Fortunately, our AB tests did not reveal any reductions in business-critical metrics, so aside from increasing the percentage of customers who were exposed to the new results, we did not need to iterate on the implementation.

Takeaways

AB tests are a useful tool for measuring how your customers react to new features. As long as your code is architected so that you can easily show one version or the other based on a runtime flag, you can test changes quickly and have data-backed confidence that they are good for the business.

Conclusion

This concludes our three-part series on upgrading Elasticsearch. It was a massive project, but it helped us eliminate a lot of tech debt, move to a faster version of Elasticsearch, and establish internal practices for testing before releasing (we have run a few dark launches since this project).