Upgrading Elasticsearch at Redmart Pt 1 - Code Changes

For several years, Redmart has been using Elasticsearch to power our catalog, browsing, and search experience. We started with version 1.3, but this reached end of life in January of 2016. We decided to upgrade to 5.6, but upgrading a critical piece of infrastructure which directly impacts the customer experience is a large task. This series of blog posts breaks down the journey into three pieces, which you can think of as analogous to different levels of testing:

1) The code-level changes (this post) - the code changes needed to move our queries from 1.3 to 5.6 compatibility. You can think of this as the unit testing phase.
2) The system-level tests - how we stress tested the new cluster to make sure it could handle the necessary levels of traffic, as well as all the different types of queries we use. You can think of this as the integration testing phase.
3) The user-level tests - how we tested our customers' reactions to the new system. You can think of this as the acceptance testing phase.

Introduction

Whenever you make changes to code, you should include automated tests which capture your intent and will fail if a future change no longer matches that intent. To test a system as complex as Elasticsearch, we relied on two core ideas while upgrading:

1) Use libraries and SDKs where possible, specifically for constructing Elasticsearch queries.
2) Test your queries against a real Elasticsearch instance. To keep that instance isolated for testing, you can use Docker to run an ephemeral, test-only instance.

Technical Path

At Redmart, we use Elasticsearch to power all the browsing and searching on our site. When a new customer searches, we execute one query. When a logged-in customer searches, we execute a more personalized query. When a customer browses to a subcategory, we execute a query specific to that category. All of these different contexts mean that we need to dynamically generate Elasticsearch queries for every customer request.

Using the Library

Our code for generating Elasticsearch queries originally used StringBuilders to assemble the query as a json string. This meant we had to take care of escaping any special characters ourselves, and it also left the code rather difficult for newcomers to understand.

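In spirit, the old code looked something like the following sketch (heavily simplified; the real queries were far larger, and escape() stands in for the manual escaping we had to do ourselves):

// Sketch of the old approach: assembling the query json by hand.
// Every brace, quote, and comma is our responsibility.
private String buildSearchQuery(String searchTerm) {
    StringBuilder json = new StringBuilder();
    json.append("{\"query\":{\"match\":{");
    json.append("\"title\":\"").append(escape(searchTerm)).append("\"");
    json.append("}}}");
    return json.toString();
}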

Before this upgrade project, we were using version 1.3 of Elasticsearch. When we decided to upgrade, the latest stable version was 5.6. One of the big challenges for our upgrade plan was that there are several backwards-incompatible structural changes to Elasticsearch queries between 1.3 and 5.6. With hand-built json, nothing would flag an invalid query until we manually executed each edge case. Because we use many different types of queries in many different contexts, finding every broken query that way would be very tedious and would risk impacting our customers. Therefore, our first step was to migrate our code to use Elasticsearch's Java SDK to generate the same queries.

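With the SDK, the same intent reads more like this (again a simplified sketch; the field names are illustrative, and the 5.6-era QueryBuilders classes are shown):

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Sketch of the SDK approach: the builders handle quoting and escaping for us,
// and query types that no longer exist show up as compile errors instead of
// runtime failures.
private SearchSourceBuilder buildSearchQuery(String searchTerm) {
    BoolQueryBuilder query = QueryBuilders.boolQuery()
            .must(QueryBuilders.matchQuery("title", searchTerm))
            .filter(QueryBuilders.termQuery("available", true));
    return new SearchSourceBuilder().query(query);
}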

In order to test that this change did not break anything, we used pinning tests that simply verify the same json is still being generated. The downside is that we would need to make very low-level changes to the json fixtures whenever we changed anything in a query. While these kinds of tests are better than nothing, they are often very brittle, and with so much syntax and structure changing between 1.3 and 5.6, we were likely to run into a lot of headaches.

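The pinning test itself is little more than a string comparison against a checked-in fixture, roughly along these lines (the file path and names are illustrative, and buildSearchQuery refers to the sketch above):

import static org.junit.Assert.assertEquals;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.junit.Test;

public class SearchQueryPinningTest {

    @Test
    public void generatedJsonMatchesFixture() throws Exception {
        // The fixture holds the json the old StringBuilder code produced.
        String expected = new String(
                Files.readAllBytes(Paths.get("tests/resources/fixtures/search-oats.json")),
                StandardCharsets.UTF_8);

        // Any change to the query builder shows up as a diff against the fixture.
        assertEquals(expected, buildSearchQuery("oats").toString());
    }
}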

Upgrading the Version

When we first changed the version of the Elasticsearch SDK from 1.3.2 to 5.6.2, there were approximately 300 compilation errors! This was a good thing though, as it told us what needed to be fixed while changes were still cheap (locally), before pushing to our test environment or, worse, production, where problems take longer to identify, fix, push, and verify.

In order to automate testing our queries, we inserted approximately 40 documents into a local Elasticsearch cluster. Our tests then connected to this cluster, executed queries against it, and verified that the correct documents came back (for example, when querying for oats, you should get the food and not the toilet brush brand Oates). This worked alright as a proof of concept, but we quickly ran into reproducibility problems between different developers' machines and our continuous integration environment. To solve this, we used Docker to run an Elasticsearch instance hosting our test data.
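Whether Elasticsearch runs directly on a laptop or inside Docker, the tests themselves stay the same, roughly like this sketch (the index name, field names, port, and client settings are illustrative assumptions, not our actual schema):

import static org.junit.Assert.assertEquals;

import java.net.InetAddress;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.transport.client.PreBuiltTransportClient;
import org.junit.Test;

public class OatsSearchTest {

    @Test
    public void oatsReturnsFoodNotToiletBrushes() throws Exception {
        // Ignore the cluster name so the test does not care how the local node is configured.
        Settings settings = Settings.builder()
                .put("client.transport.ignore_cluster_name", true)
                .build();
        try (TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("localhost"), 9300))) {
            SearchResponse response = client.prepareSearch("products")
                    .setQuery(QueryBuilders.matchQuery("title", "oats"))
                    .get();
            // The top hit should be the breakfast food, not the Oates toilet brush.
            assertEquals("food", response.getHits().getAt(0).getSourceAsMap().get("category"));
        }
    }
}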

Using Docker

In order to run Docker as part of our tests, we needed to tackle three things:
1) starting and stopping the Docker image as part of the test
2) keeping data consistent between test runs
3) preventing the test runtime from getting too long

In order to start and stop the Docker image, we decided to use a JUnit rule. There are many options out there, but we ended up using this one: https://github.com/tdomzal/junit-docker-rule. Running it as a @Rule or @ClassRule did the trick. However, for the Elasticsearch container to be useful, it needed to read its data from somewhere, which led to our second problem.
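With that rule in place, wiring the container into a test class looks roughly like this sketch (the class name, image tag, and port mappings are illustrative; the builder calls follow the library's README):

import org.junit.ClassRule;
import pl.domzal.junit.docker.rule.DockerRule;

public class ElasticsearchQueryIT {

    // Starts the Elasticsearch container once before the first test in this class
    // and stops it after the last one.
    @ClassRule
    public static DockerRule elasticsearch = DockerRule.builder()
            .imageName("elasticsearch:5.6.3")
            .expose("9200", "9200")
            .expose("9300", "9300")
            .build();

    // tests that connect to localhost:9300 go here
}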

In order to keep data consistent between test runs, we wanted to satisfy two conditions: new developers can run the tests immediately, and running the tests should not have any side effects on our local systems. To allow anyone to run the tests, we decided to check Elasticsearch's data files into git. To generate these files, we kept the json documents we wanted Elasticsearch to contain in a resource folder, started the container with its data volume mounted to another resource folder, indexed the documents into Elasticsearch, then stopped the container and committed the generated data files to git.

Data directory:

tests/resources/es-data  
└── nodes
    └── 0
        ├── _state
        │   ├── global-103.st
        │   └── node-25.st
        ├── indices
        │   ├── erHV5kbTQsW0oQpJByctuQ
        │   ...
        │   ├── Dz3pk8jpR5uL0KGmbZXMNw

Docker run command: docker run -v tests/resources/es-data/:/usr/share/elasticsearch/data elasticsearch:5.6.3

This worked well for allowing any developer to access the data and run the tests, but there was a problem. Every time Elasticsearch starts and stops, it modifies its metadata files. This meant that every time we ran the tests, git would report a pile of changes to tracked files. To solve this, we copy the data to a temporary folder every time we start the image, and have Elasticsearch read from that temporary folder instead. This way the data tracked by git never changes.
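The copy itself is a few lines of java.nio run before the container starts, roughly like this sketch (paths and names are illustrative); the resulting temporary folder is what gets mounted into the container instead of the git-tracked one:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public final class TestDataCopier {

    // Copy the git-tracked Elasticsearch data to a temp folder so the container
    // can modify it freely without dirtying the working copy.
    public static Path copyToTempFolder() throws IOException {
        Path source = Paths.get("tests/resources/es-data");
        Path target = Files.createTempDirectory("es-data");
        try (Stream<Path> paths = Files.walk(source)) {
            for (Path path : (Iterable<Path>) paths::iterator) {
                Path destination = target.resolve(source.relativize(path));
                if (Files.isDirectory(path)) {
                    Files.createDirectories(destination);
                } else {
                    Files.copy(path, destination, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        return target;
    }
}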

Finally, we noticed that every start and stop of the Docker image took approximately fifteen seconds. We have many different types of queries, and overall we had 11 different integration test classes. This meant that even if each integration test class used a @ClassRule, there would be almost three minutes of non-productive test time added to our builds. To streamline this, we used JUnit's Enclosed test runner. This allowed us to list the 11 integration test classes in a single file with a single @ClassRule, limiting the amount of time spent starting and stopping the Elasticsearch image.

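Structurally, the combined suite looks like this trimmed-down sketch (nested class and test names are illustrative; each nested class holds the tests for one type of query):

import org.junit.ClassRule;
import org.junit.Test;
import org.junit.experimental.runners.Enclosed;
import org.junit.runner.RunWith;
import pl.domzal.junit.docker.rule.DockerRule;

// Enclosed runs each nested class as its own test class, but the @ClassRule on the
// outer class wraps the whole suite, so the container starts and stops only once.
@RunWith(Enclosed.class)
public class ElasticIntegration {

    @ClassRule
    public static DockerRule elasticsearch = DockerRule.builder()
            .imageName("elasticsearch:5.6.3")
            .expose("9300", "9300")
            .build();

    public static class SearchQueryTest {
        @Test
        public void oatsReturnsFoodNotToiletBrushes() { /* query localhost and assert on hits */ }
    }

    public static class CategoryQueryTest {
        @Test
        public void subcategoryQueryReturnsOnlyThatSubcategory() { /* ... */ }
    }

    // ...nine more nested classes, one per query type...
}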

Takeaways

Two main learnings came out of this part of the project. First, use libraries and SDKs where possible; this gives you greater protection and flexibility when developing new features or upgrading. Second, running your external dependencies in Docker gives you a lightweight way to test your integration points.

Closing

This is only the start of our series about upgrading Elasticsearch. We still need to cover how we stress tested the new cluster, looked for malfunctioning queries, and validated the user experience!

Part 2