Making Micro-services Visible through ELK - Part 2 of 3

Logstash Clustering

Figure 2. Initial Experimentation Setup

We experimented with a number of different setups before finally settling on one. In the course of these experiments, we found that there is a ‘virtual’ limit on how many Logstash agents a single Logstash server can handle.

Initially, we suspected that the Logstash server was starved of resources (since Grok parsing is performed there). After all, conventional wisdom says we should not start with anything less than 2 CPU cores.

Figure 3. Our Current ELK Configuration

But our further findings led us to conclude that the limit is not due to the number of events being shipped. Nor is it due to resource limitations (both CPU and memory utilization levels were low). Rather, it is due to the connection pool managed at the Logstash server.

Depending on the RAM and CPU you have, the limit may vary. In our case, it lay between 8 and 10 Logstash agents for a Logstash server running on a t2.medium. Indeed, given the low resource utilization, we did experiment with running multiple Logstash servers on the same instance (on different ports). Judging by the utilization levels, running 3-4 Logstash servers on the same instance appeared to be a distinct option.
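As an illustration of that multi-server setup, below is a minimal sketch of one Logstash server pipeline; a second server on the same host would use an identical configuration with a different port. It assumes agents ship over the lumberjack protocol and a Logstash 2.x-style elasticsearch output; the ports, certificate paths, hostname and Grok pattern are illustrative rather than our actual configuration.

    # logstash-server-1.conf -- started with: logstash -f logstash-server-1.conf
    # (a second server on the same host would listen on, say, port 5044)
    input {
      lumberjack {
        port            => 5043                           # each server instance gets its own port
        ssl_certificate => "/etc/pki/tls/logstash.crt"     # illustrative paths
        ssl_key         => "/etc/pki/tls/logstash.key"
      }
    }

    filter {
      grok {
        # Grok parsing runs here on the server, hence the earlier CPU concern
        match => { "message" => "%{COMBINEDAPACHELOG}" }   # illustrative pattern
      }
    }

    output {
      elasticsearch {
        hosts => ["elasticsearch.internal:9200"]           # hypothetical endpoint
        index => "logstash-%{+YYYY.MM.dd}"                 # one index per day
      }
    }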

AWS's introduction of the t2.nano (first announced in October 2015) opened up another possibility. Instead of running multiple Logstash servers on a single host, why not spread them across tiny instances, we asked ourselves. A bold idea indeed! Most people would balk at the slightest idea of running t2.nano in production.

What advantages do we get out of it? At least three.

First, we are able to achieve high availability. Second, it allows us to do clustering. In a heterogeneous micro-service environment, services do not always scale uniformly: the amount of traffic, and hence of logs, generated by different services varies. Some services handle traffic loads that are much heavier than others, and depending on business criticality, the amount of log detail being recorded will also vary. Based on this pattern alone, there are many ways clusters can be formed; in our case, we chose to form clusters based on team structure. Third, it allows us to achieve all this at significant cost savings: we can spin up 11 t2.nano instances for the same price as one t2.medium.
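To sketch how a team-based Logstash cluster looks from the shipping side, the agents of one team only need to know that team's Logstash nodes. The snippet below assumes the lumberjack output plugin, which accepts a list of hosts; the hostnames (three hypothetical t2.nano instances belonging to one team) and the certificate path are illustrative.

    # Shipper-side output on a service owned by the (hypothetical) payments team
    output {
      lumberjack {
        hosts => ["ls-payments-1.internal",    # the team's t2.nano Logstash nodes
                  "ls-payments-2.internal",
                  "ls-payments-3.internal"]
        port            => 5043
        ssl_certificate => "/etc/pki/tls/logstash.crt"
      }
    }

Another team's services would point at that team's own set of nodes, so a particularly chatty team only loads its own cluster.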

Elasticsearch Clustering

We learnt many lessons while designing the Elasticsearch cluster.

One initial thought that surfaced was to perform a read / write split (with the help of DNS routing). It appeared to be a viable option because on one end there is a constant stream of write operations, while on the other end the read operations (log visualization) are not mission-critical, and real-time freshness is not a paramount design criterion. And since Elasticsearch queries were expected to be more resource-intensive than write operations, an asymmetric read / write split was on the cards.
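To make the idea concrete: the split we had in mind lived entirely at the connection level, with writers and readers resolving different DNS names to different subsets of nodes. A hypothetical sketch (the DNS names are illustrative):

    # Write path: the Logstash servers index through a 'write' DNS name
    output {
      elasticsearch {
        hosts => ["es-write.internal:9200"]   # name resolving to the write-oriented nodes
      }
    }
    # Read path: Kibana would instead be configured against a 'read' DNS name,
    # e.g. es-read.internal, resolving to the query-oriented nodes.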

Figure 4. Elasticsearch Cluster Snapshot (1x Daily Index, 5 Shards / Index, 1 Replica / Index, 10 Elasticsearch Nodes)

With a bit of experimentation and the help of Elasticsearch's documentation, we got to see the Elasticsearch cluster and its shards in action. The result? The asymmetric read / write cluster configuration gave limited performance gain, as cluster performance is tied to the slowest query (or slowest node): the weakest-link-in-the-chain syndrome.

With a good view of the Elasticsearch cluster in action, our next task was cluster optimization. A few key factors that were closely watched were: 1. Query time; 2. Horizontal vs. vertical scaling; and 3. Disk (EBS) type.

From our list, we picked one Kibana dashboard. With the time range set over a two-week period, it took 45 seconds for the dashboard to load completely. With this, we had a good candidate to optimize. Our optimization variables included the AWS EC2 instance type, the EBS volume type (gp2, io1) and the shard / replica settings.
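Settings like those in Figure 4 (5 shards and 1 replica per daily index) are typically applied through an index template, so that each new daily index picks them up automatically. A sketch, assuming the template API of the Elasticsearch 1.x / 2.x era we were running (the endpoint hostname is illustrative):

    # Apply 5 shards / 1 replica to every daily logstash-* index
    curl -XPUT 'http://es-master.internal:9200/_template/logstash' \
         -H 'Content-Type: application/json' -d '{
      "template": "logstash-*",
      "settings": {
        "number_of_shards":   5,
        "number_of_replicas": 1
      }
    }'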

The result? We managed to bring the query time down to about 10 seconds or less. Our experiment results indicate that the disk type used has limited impact on query performance. Compared to vertical scaling, horizontal scaling yielded a higher performance gain. (Note: horizontal scaling can be achieved by careful calibration of the shard and replica configuration.) We were happy to learn that horizontal scaling works well, as this means a couple of things. First, we automatically get the benefits of high availability should we need to scale out the Elasticsearch cluster. Second, we have to worry less about the limits of vertical scaling.
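One reason the shard / replica calibration matters: in the Elasticsearch versions we were working with, the shard count is fixed when an index is created, whereas the replica count can be changed on live indices. That makes replicas a convenient knob when adding nodes, as in this hypothetical adjustment (the endpoint and the value are illustrative):

    # Spread query load across newly added nodes by raising the replica count
    curl -XPUT 'http://es-master.internal:9200/logstash-*/_settings' \
         -H 'Content-Type: application/json' -d '{
      "index": { "number_of_replicas": 2 }
    }'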