Consul Overview
Consul is a distributed, highly available system that provides service discovery, health checking, and a KV store across multiple datacenters. A more detailed overview of the architecture is available in the Consul architecture documentation.
The Issue
Recently I was working in an environment that had two separate Consul clusters (env1 and env2) on the same network segments in AWS. At one point during the build-out and testing of env1, I noticed that the env1 and env2 Consul clusters had somehow merged into a single cluster! I now had EC2 instances attempting to communicate across environments, leading to all kinds of chaos! Below I’ll cover why this happened and how to avoid it in your own Consul cluster.
Investigating
With the clusters merged and in a bad state I went ahead and poked around to see if I could figure out what had triggered the two clusters to become one.
First I checked the AWS EC2 tags on the Consul servers to make sure they were, in fact, unique to each environment, since each Consul agent was using the EC2 auto-join functionality for initial agent bootstrapping. That checked out: the env1 cluster had its own tag, env1-consul, while env2 had its own tag, env2-consul. Moving on!
I then started to look at the Consul gossip protocol in more depth. The gossip protocol allows all members of a Consul datacenter to automatically discover servers. This sounded like something worth investigating further. It turned out to be a race condition related to using duplicate Consul datacenter names on the same LAN.
After looking at the EC2 console in AWS, I noticed that some instances were being terminated in both the env1 and env2 environments around the same time the two clusters had joined each other (I checked the local instance logs in /var/log/consul.log). It turned out that an IP address used by an env2 EC2 instance was reused for a new env1 instance thanks to EC2 Auto Scaling and DHCP. This, in turn, caused the env2 Consul cluster to mistake the new instance for the missing env2 instance coming back online, when in fact it was a different node in env1!
It turns out this is intended behavior of the gossip protocol, which tries to repair temporary network partitions and other issues that might crop up. The issue is covered on the HashiCorp Consul Google Group as well.
Luckily I didn’t have to try to separate the clusters back out. Since they were just development environments, I terminated everything and rebuilt the environments from scratch. Regardless, it was nice to have an explanation for this kind of behavior!
Improvements
Let’s cover a couple of different configurations that can be enabled to keep this kind of problem from affecting your own environment.
Datacenter
The easiest way to isolate these two environments would’ve been to give each Consul cluster its own unique datacenter name. That way, even if the Consul agents can talk to one another on the network, the gossip protocol will see that the datacenters don’t match, so agent discovery will not happen. One approach is to prefix the datacenter name with the environment, in this case dc1 for environment env1:
consul agent -datacenter env1-dc1 -retry-join "provider=aws tag_key=... tag_value=..."
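The same setting can also be kept in a config file rather than on the command line. The following is a minimal sketch, assuming a hypothetical consul-config/ directory with one subdirectory per environment; the directory and file names are illustrative and not from the original setup. The only Consul-specific piece is the datacenter key, which is the config-file counterpart of the -datacenter flag.

```shell
set -eu

# Hypothetical layout: one config directory per environment.
for env in env1 env2; do
  dir="consul-config/$env"
  mkdir -p "$dir"
  # "datacenter" is the config-file equivalent of the -datacenter flag.
  cat > "$dir/datacenter.json" <<EOF
{
  "datacenter": "${env}-dc1"
}
EOF
done

# Show the datacenter name each environment will use.
grep '"datacenter"' consul-config/*/datacenter.json
```

Each agent would then be pointed at its own environment's directory, for example with consul agent -config-dir consul-config/env1 plus the same -retry-join settings as above.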
Encryption
A better solution would be to enable gossip encryption on each Consul cluster with a different key per cluster. That way the Consul agents are only able to decrypt their own cluster’s traffic. The full setup is outside the scope of this article; however, it’s covered in more depth in the Consul Encryption Guide.
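As a rough illustration of the idea (not a replacement for the guide), the sketch below creates a separate gossip key for each environment and writes it to a per-environment config file under the encrypt key. The consul-config/ layout and file names are hypothetical. It uses consul keygen when the consul binary is on the PATH; the fallback to 32 random base64-encoded bytes is an assumption about the key format your Consul version expects, so check it against the Encryption Guide before relying on it.

```shell
set -eu

# Produce one base64-encoded gossip key.
gen_key() {
  if command -v consul >/dev/null 2>&1; then
    consul keygen
  else
    # Assumption: 32 random bytes, base64-encoded, is an acceptable key.
    head -c 32 /dev/urandom | base64
  fi
}

# Hypothetical layout: one config directory per environment.
for env in env1 env2; do
  dir="consul-config/$env"
  mkdir -p "$dir"
  key="$(gen_key)"
  # The key is a shared secret, so create the file readable by the owner only.
  ( umask 077; printf '{\n  "encrypt": "%s"\n}\n' "$key" > "$dir/encrypt.json" )
done
```

Because env1 and env2 end up with different keys, an agent from one environment that lands on a recycled IP address cannot take part in the other environment's gossip. Every agent within a single cluster must share that cluster's key.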
Separate LAN
Running two separate Consul clusters on the same LAN should be avoided if at all possible. The Consul docs even point out that:
Consul has first-class support for multiple datacenters, but it relies on proper configuration. Nodes in the same datacenter should be on a single LAN.
Conclusion
Hopefully, this article provides some guidance for those implementing their own Consul clusters. With any technology (especially new tech) there are bound to be some war stories, but overall I’m excited to see what features future Consul development brings. Consul is really an awesome piece of software and its rapid development is truly remarkable.
Be sure to follow the development progress on GitHub.