Consul, Datacenters, and LANs

May 25, 2018 Rob Hernandez

The consequences of not properly isolating your Consul clusters.

Consul Overview

Consul is a distributed, highly available system providing service discovery, health checking, and a key/value store across multiple datacenters. A more detailed overview of the architecture can be found in the Consul architecture documentation.
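To make that a little more concrete, here is a minimal sketch of interacting with a running local agent from the CLI (the key and value below are made up for illustration):

# Write and read a value from the KV store through the local agent
consul kv put env1/app/db_host db.internal.example.com
consul kv get env1/app/db_host

# List the nodes this agent knows about in its datacenter
consul members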

The Issue

Recently I was working in an environment that had two separate Consul clusters (env1 and env2) on the same network segments in AWS. At one point during the build-out and testing of env1, I noticed that the env1 and env2 Consul clusters had somehow merged into a single cluster! I now had EC2 instances attempting to communicate across environments, leading to all kinds of chaos. Below I'll cover why this happened and how to avoid it in your own Consul clusters.

Investigating

With the clusters merged and in a bad state, I poked around to see if I could figure out what had triggered the two clusters to become one.
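As a rough sketch (not a verbatim transcript of that session), the agent's own CLI is a quick way to see what each node believes its cluster looks like:

# Nodes from both env1 and env2 showing up in the same member list
# is a strong sign the clusters have merged
consul members

# Agent and serf details, including the configured datacenter name
consul info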

First I checked the AWS EC2 tags on the Consul servers to make sure they were, in fact, unique to each environment, since each Consul agent was using the EC2 auto-join functionality for initial agent bootstrapping. That checked out: the env1 cluster had its own tag, env1-consul, while env2 had its own tag, env2-consul. Moving on!
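For reference, the auto-join setup looked roughly like the following (the tag key below is a hypothetical stand-in, not the exact key used in that environment):

# env1 agents join only instances carrying the env1-consul tag value
consul agent -retry-join "provider=aws tag_key=consul-cluster tag_value=env1-consul"

# env2 agents join only instances carrying the env2-consul tag value
consul agent -retry-join "provider=aws tag_key=consul-cluster tag_value=env2-consul"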

I then started to look at the Consul gossip protocol in more depth. The gossip protocol allows all members of a Consul datacenter to automatically discover the servers, which sounded like something worth investigating further. It turned out to be a race condition related to using duplicate Consul datacenter names on the same LAN.

After looking at the EC2 console in AWS, I noticed that instances had been terminated in both the env1 and env2 environments around the same time the two clusters joined each other (confirmed against the local instance logs in /var/log/consul.log). It turned out that an IP previously used by an env2 EC2 instance had been reused for a new env1 instance, thanks to EC2 Auto Scaling and DHCP. This, in turn, caused the env2 Consul cluster to mistake the missing env2 node for having come back online, when in fact it was really a different node in env1!

It turns out this is intended behavior of the gossip protocol, which attempts to rejoin failed members in order to repair temporary network segmentation and other issues that might crop up. The issue is covered on the HashiCorp Consul Google Group as well.

Luckily I didn't have to try to separate the clusters back out: since they were just development environments, I terminated everything and rebuilt them from scratch. Regardless, it was nice to have an explanation for this kind of behavior!

Improvements

Let's cover a few configuration changes that can keep this kind of problem from affecting your own environment.

Datacenter

The easiest way to isolate these two environments would've been to give each Consul cluster a unique datacenter name. That way, even if the Consul agents can reach one another on the network, the gossip protocol will see that the datacenters don't match and agent discovery will not happen. An example would be to prefix the datacenter name with the environment, in this case env1-dc1 for environment env1:

consul agent -datacenter env1-dc1 -retry-join "provider=aws tag_key=... tag_value=..."
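If you prefer configuration files over flags, the same setting can live in the agent's config directory. A minimal sketch, assuming a hypothetical /etc/consul.d/datacenter.json path:

# Write the datacenter and join settings into the agent's config directory
cat > /etc/consul.d/datacenter.json <<'EOF'
{
  "datacenter": "env1-dc1",
  "retry_join": ["provider=aws tag_key=... tag_value=..."]
}
EOF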

Encryption

A better solution would be to enable gossip encryption on each Consul cluster. That way, Consul agents can only decrypt traffic from their own cluster, so agents holding a different key cannot join one another. A full walkthrough is outside the scope of this article; it's covered in more depth in the Consul Encryption Guide.
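At a high level, this is a shared secret generated once and handed to every agent in the cluster. A minimal sketch of turning gossip encryption on (the placeholder key is illustrative):

# Generate a new gossip encryption key (run once per cluster)
consul keygen

# Start each agent with that cluster's key; agents with a different
# key (or no key) cannot join the gossip pool
consul agent -encrypt "<key from consul keygen>" -datacenter env1-dc1 ...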

Separate LAN

Running two separate Consul clusters in the same LAN should be avoided if at all possible. The Consul docs even point out that:

Consul has first-class support for multiple datacenters, but it relies on proper configuration. Nodes in the same datacenter should be on a single LAN.

Conclusion

Hopefully this article provides some guidance for those implementing their own Consul clusters. With any technology (especially new tech) there are bound to be some war stories, but overall I'm excited to see what features future Consul development brings. Consul is an awesome piece of software, and its rapid development is truly remarkable.

Be sure to follow the development progress on GitHub.

Insight Authors

Rob Hernandez, CTO
