Consul is a distributed, highly available system providing service discovery, health checking, and a KV store across multiple datacenters. A more detailed overview of the architecture can be found here.
Recently I was working in an environment that had two separate Consul clusters (env1 and env2) on the same network segments in AWS. At one point during the build-out and testing of the env1 environment, I noticed that the env1 and env2 consul clusters had somehow merged into a single cluster! I now had EC2 instances attempting to communicate across environments, leading to all kinds of chaos. Below I'll cover why this happened and how to avoid it within your own Consul cluster.
With the clusters merged and in a bad state, I poked around to see if I could figure out what had triggered the two clusters to become one.
First I checked the AWS EC2 tags on the consul servers to make sure they were, in fact, unique to each environment, since each consul agent was using EC2 auto-join functionality for initial agent bootstrapping. That checked out and seemed to be good: the env1 cluster had its own tag, with env2 having its own tag: env2-consul. Moving on!
I then started to look at the consul gossip protocol in more depth. The gossip protocol allows all members of a consul datacenter to automatically discover servers, which sounded worth investigating further. It turned out to be a race condition related to using duplicate consul datacenter names on the same LAN.
After looking at the EC2 console in AWS, I noticed that some instances had been terminated in both environments around the same time the two clusters joined each other (confirmed via the local instance logs in /var/log/consul.log). It turned out that an IP used by a terminated env2 EC2 instance had been reused for a new env1 instance, courtesy of EC2 Autoscaling and DHCP. This, in turn, caused the env2 consul cluster to mistake the missing instance as having come back online, when in fact it was really a different node in environment env1.
It turns out this is intended functionality of the gossip protocol, designed to repair temporary network segmentation and other issues that might crop up. The issue is covered on the HashiCorp Consul Google Group as well.
Luckily I didn't have to try to separate the clusters back out: since they were just development environments, I terminated everything and respun the environments from scratch. Regardless, it was nice to have an explanation for this kind of behavior!
Let's cover a couple of different configurations that can be enabled to keep this kind of problem from affecting your own Consul clusters.
The easiest way to isolate these two environments would've been to create unique datacenters for each consul cluster. This way, even if the consul agents have the ability to talk to one another on the network, the gossip protocol will see that the datacenters don't match, so agent discovery will not happen. One approach is to prefix the datacenter name with the environment, in this case:
consul agent -datacenter env1-dc1 -retry-join "provider=aws tag_key=... tag_value=..."
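For agents configured via files rather than CLI flags, the same isolation can be expressed in the agent configuration. The sketch below assumes a config-file setup; the tag key and value are placeholders, not values from the incident above:

```json
{
  "datacenter": "env1-dc1",
  "retry_join": ["provider=aws tag_key=Environment tag_value=env1-consul"]
}
```

Agents in the second environment would set "datacenter": "env2-dc1" instead; with mismatched datacenter names, gossip discovery between the two clusters is refused.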
A better solution would be to enable gossip encryption on each consul cluster; that way, the consul agents are only able to decrypt their own cluster's traffic. This is outside the scope of this article, but it's covered in more depth in the Consul Encryption Guide.
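As a brief sketch: gossip encryption is enabled by generating a shared key with `consul keygen` and setting it identically on every agent in the cluster. The key below is a made-up placeholder; generate your own:

```json
{
  "encrypt": "pUqJrVyVRj5jsiYEkM/tFQ=="
}
```

Agents that don't hold a matching key can't decrypt the cluster's gossip traffic, so two clusters with different keys won't merge even if their datacenter names happen to collide.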
Running two separate consul clusters in the same LAN should be avoided if at all possible. The consul docs even point out that:
Consul has first-class support for multiple datacenters, but it relies on proper configuration. Nodes in the same datacenter should be on a single LAN.
Hopefully this article provides some guidance for those implementing their own consul clusters. With any technology (especially new tech) there are bound to be some war stories, but overall I'm excited to see what features future Consul development brings. Consul is really an awesome piece of software, and its rapid development is truly remarkable.
Be sure to follow the development progress on GitHub.