Making Micro-services Visible through ELK - Part 1 of 3

RedMart’s journey has been evolutionary rather than revolutionary. On a few previous occasions, we’ve shared how our infrastructure stack evolved as we scaled up the engineering team and became one of the region’s most recognized online grocery retailers and logistics partners.

The decision to break away from a single monolith into a micro-service architecture was made in 2013. By 2014, development of micro-services was in high gear. Fast forward three years, and more than 95% of our current traffic is served by over 65 micro-services. (We are still waiting for the day when we can say we are 100% on micro-services, but that day is not too far away.)

Micro-service architecture is great, but it comes with its own set of challenges. Our first priority was the provisioning of a test environment. Without a proper test environment, micro-services could end up doing more harm than good: the room for autonomy across teams would be limited, and higher productivity and development speed would be nothing but elusive promises. (If you are interested in our micro-service test architecture, you can watch it here.)

There were, of course, numerous challenges to address as the team continued to scale and the core business grew. But always high on our priority list was a centralized logging system.

With a micro-service architecture, a single end-to-end customer transaction is likely to touch multiple services. The task of carrying out the occasional root-cause analysis (RCA) is made more difficult by the fact that each service is now horizontally scaled and load balanced across multiple hosts.

The degree of horizontal scaling differs among services: those that are hit more frequently are scaled out further than those that are hit less. But as a general principle, every service in our production environment is deployed on at least 2 hosts (in two AWS availability zones).

How much more complexity is involved now compared to the days when we had a single monolithic API? You get the idea.

The Choices

In setting up any centralized logging system, the questions that come up are always:
1. Free vs Paid - The answer depends on the in-house engineering resources you have. Setting up the ELK (Elasticsearch, Logstash and Kibana) stack is trivial. Scaling it up to be production-ready is not! Our success did not happen on the first attempt, but I’ll get more into that later.
2. The Stack - Splunk vs ELK. There are plenty of discussions on which makes the better technology choice. We do not intend to start another such discussion, but it is the first decision you have to make before you start building your centralized logging system.

Never Never Give Up!

Our first attempt at an ELK setup did not end well. It is really easy to bring up the entire ELK stack: I’ve heard some people can do it within minutes, while others may take longer, depending on their technical background.

The simplicity of starting an ELK stack is what leads most people (ourselves included) to assume there is no challenge in setting up a production-ready cluster.

In our first attempt, we ran into issues with the logstash shipper (the role Filebeat plays today). Back then, we were using the Java-based shipper, which ended up consuming too many resources to run on the same machine as a micro-service.

As RedMartians, we ‘Never Never Give Up!’. It is what keeps us going in the face of setbacks such as this one. Today, that setback looks like a minor speed bump, and we’re happy to share the details of our ELK production stack.

Our ELK Architecture

The renewed effort to bring up our own ELK stack started in late 2015. Back then, Filebeat was newly released and still in beta. We assessed that it was probably premature to jump onto it and decided to go with the Go-based logstash-forwarder.
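For reference, logstash-forwarder is driven by a small JSON config on each host that points at the log files to tail and the downstream Logstash endpoint. Below is a minimal sketch; the hostnames, file paths, and field values are illustrative placeholders, not our actual configuration:

```json
{
  "network": {
    "servers": ["logstash.internal.example:5043"],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt",
    "timeout": 15
  },
  "files": [
    {
      "paths": ["/var/log/my-service/*.log"],
      "fields": {"type": "microservice", "service": "my-service"}
    }
  ]
}
```

The `fields` block is what tags each shipped line with its originating service, which is exactly what you need later when filtering an RCA down to one service in Kibana.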

Following our usual development / production pipeline, we started setting up the ELK cluster in our alpha environment (we have 3 environments in total: feature, alpha and production). From our initial research, we found that the architecture recommended by most (including the folks at Elastic) was to have an additional layer (a.k.a. a broker) between the logstash shipper and the logstash server. Redis is what is commonly used, though more recently I’m seeing AWS Kinesis gain popularity for this purpose.

Figure 1. Textbook-recommended ELK Setup
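In this textbook setup, the shipper side pushes events onto a Redis list and a separate indexer-side Logstash pops them off and writes to Elasticsearch. A minimal sketch of the two Logstash configs follows; the host names and list key are placeholders, and option names may differ slightly across Logstash versions:

```conf
# Shipper side: push incoming events onto the Redis broker
output {
  redis {
    host      => "redis-broker.internal.example"
    data_type => "list"        # use a Redis list as the queue
    key       => "logstash"    # list key the indexer reads from
  }
}

# Indexer side (separate Logstash instance): drain the Redis list
input {
  redis {
    host      => "redis-broker.internal.example"
    data_type => "list"
    key       => "logstash"
  }
}
output {
  elasticsearch { hosts => ["es-node1.internal.example:9200"] }
}
```

The broker’s job is to absorb bursts, so that a slow or recovering Elasticsearch cluster does not back-pressure every application host at once.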

We had considered starting with this textbook-recommended approach but decided to pursue a different path. We did away with the broker layer for a start, while keeping the option open should the simpler setup fail.
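Without the broker, logstash-forwarder ships directly to the Logstash server over the lumberjack protocol. A sketch of what the server-side pipeline might look like in that case (certificate paths and hosts are again placeholders):

```conf
# Logstash server: accept events directly from logstash-forwarder
input {
  lumberjack {
    port            => 5043
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key         => "/etc/pki/tls/private/logstash-forwarder.key"
  }
}
output {
  elasticsearch { hosts => ["es-node1.internal.example:9200"] }
}
```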