Our EC2 network black hole
This is a blog post detailing an issue I encountered while attempting to connect an EC2 instance to an internet endpoint via a firewall.
The Requirements
Nice and simple requirements.
- We need to deploy two EC2 instances to AWS, one in AZ-A and one in AZ-B.
- Both instances must have static IP addresses assigned.
- Both instances must be created from the same AMI.
- They should connect out to the internet to reach the public AWS SSM endpoint (via their respective AZ firewall).
So they are pretty much identical, apart from the subnet (AZ) they are deployed to (and, implicitly, the firewall they are talking to). Something like the sketch below.
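For illustration, here's a minimal AWS CLI sketch of that setup. The AMI ID, instance type, and subnet IDs are placeholders, not our real environment; the private IPs are the ones you'll see later in this post:

```bash
# Launch the AZ-A instance from a shared AMI with a fixed private IP.
aws ec2 run-instances \
  --image-id ami-0abc1234def567890 \
  --instance-type t3.micro \
  --subnet-id subnet-aaaa1111 \
  --private-ip-address 10.130.105.12

# Same AMI for the AZ-B instance, different subnet and address.
aws ec2 run-instances \
  --image-id ami-0abc1234def567890 \
  --instance-type t3.micro \
  --subnet-id subnet-bbbb2222 \
  --private-ip-address 10.130.106.12
```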
The Problem
For some reason, I find that the instance in AZ-A is not able to connect out to the public SSM endpoint (ssm.ap-southeast-2.amazonaws.com). The instance in AZ-B is connecting fine, so what's gone wrong?
Time to check the usual suspects (example commands for these checks follow the list):
- DNS ✅
- Security group rules ✅
- ufw ✅
- proxy ✅
- ssm agent - installed, but obviously failing to connect to the AWS mothership
- route tables ✅
- NACLs ✅
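For reference, these are roughly the kinds of commands used for those checks. They're generic examples rather than the exact ones run, and the subnet ID is a placeholder:

```bash
# DNS: does the endpoint resolve?
dig +short ssm.ap-southeast-2.amazonaws.com

# Host firewall: is ufw blocking anything?
sudo ufw status verbose

# SSM agent: installed and running (but failing to connect)?
sudo systemctl status amazon-ssm-agent

# Route tables and NACLs attached to the instance's subnet.
aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=subnet-aaaa1111
aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=subnet-aaaa1111
```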
Hmmm, okay. The usual suspects are all fine. Maybe there is a rule missing from the firewall?
Nope. All good there. Traffic is flowing and green. Weird.
Digging Deeper
tcpdump time! Let's see what happens when we run a tcpdump on the problem instance and attempt to curl https://ssm.ap-southeast-2.amazonaws.com.
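The capture was along these lines; the interface name and host filter are my assumptions, with the endpoint IP taken from the DNS answer:

```bash
# On the problem instance, in one shell: watch traffic to the SSM endpoint IP.
sudo tcpdump -ni eth0 host 99.82.187.47 and tcp port 443

# In another shell: trigger the connection attempt.
curl -v https://ssm.ap-southeast-2.amazonaws.com
```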
```
…
75 36.398945 10.130.105.12 99.82.187.47 TCP 76 55702 → 443 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=421262246 TSecr=0 WS=128
79 37.428293 10.130.105.12 99.82.187.47 TCP 76 [TCP Retransmission] [TCP Port numbers reused] 55702 → 443 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=421263276 TSecr=0 WS=128
80 39.476291 10.130.105.12 99.82.187.47 TCP 76 [TCP Retransmission] [TCP Port numbers reused] 55702 → 443 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=421265324 TSecr=0 WS=128
81 43.508296 10.130.105.12 99.82.187.47 TCP 76 [TCP Retransmission] [TCP Port numbers reused] 55702 → 443 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=421269356 TSecr=0 WS=128
101 51.700296 10.130.105.12 99.82.187.47 TCP 76 [TCP Retransmission] [TCP Port numbers reused] 55702 → 443 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=421277548 TSecr=0 WS=128
```
Hmm okay, that's obviously not right: the SYN is retransmitted over and over, and we never get a SYN-ACK back from the endpoint.
I also had a tcpdump run on the firewall, and it looked very similar to the above. Sorry, I don't have that capture to share.
On further troubleshooting, a simple ping was attempted from the firewall to the trouble instance. That succeeded, as expected. Then something weird happened: our problem went away! The trouble instance could now connect to SSM!
Thanks to Phil for mentioning ARP; that was the hint I needed to go digging a little deeper. A quick "aws arp caching" google turned up this Hacker News post, which led me to this Clever blog post that reads exactly like the issue I'd been having.
The Firewall ARP Tables
Looking at the firewall ARP tables while the issue was happening, I noticed the MAC address cached for the problem instance was out of date!
Firewall 1 ARP entry, showing the stale (incorrect) MAC address:

```
[root@FW01:0]$ arp | grep 10.130.105.12
ip-10-130-105-12.ap-sou ether 06:cf:a2:b2:c6:9e C eth2
```
Firewall 2 ARP entry; this MAC address is correct:

```
[root@FW02:0]$ arp | grep 10.130.106.12
ip-10-130-106-12.ap-sou ether 02:73:56:60:d5:e6 C eth2
```
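As a sanity check, an instance's real MAC address can be read from instance metadata on the instance itself; these are the standard IMDSv2 calls:

```bash
# Grab an IMDSv2 session token, then read the primary interface's MAC.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/mac
```

Comparing that value against what the firewall has cached makes the stale entry obvious.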
After another ping to "fix" the issue, this is what the ARP entries looked like:
Firewall 1:

```
[root@FW01:0]$ arp | grep 10.130.105.12
ip-10-130-105-12.ap-sou ether 06:0c:03:1a:aa:fc C eth2
```
Firewall 2:

```
[root@FW02:0]$ arp | grep 10.130.106.12
ip-10-130-106-12.ap-sou ether 02:73:56:60:d5:e6 C eth2
```
… and traffic was flowing smoothly again: the stale MAC on Firewall 1 had been updated.
The (Temporary) Resolution
The final resolution is pending. We have a ticket open with our firewall manufacturer to see what they think. I'm assuming there's a patch to apply, so a permanent fix could be a little while away.
In the meantime, ping is our friend if this pops up again.
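Here's a rough sketch of that workaround, assuming the firewall exposes a Linux-style shell with arp and ping available (ours does, per the prompts above; yours may differ):

```bash
#!/bin/bash
# If the instance goes dark again: drop the (possibly stale) ARP entry
# for it on the firewall, then ping so the cache is repopulated.
INSTANCE_IP="10.130.105.12"

arp -d "$INSTANCE_IP" 2>/dev/null   # remove the cached entry, if any
ping -c 3 "$INSTANCE_IP"            # force a fresh ARP resolution
arp -n | grep "$INSTANCE_IP"        # confirm the new MAC is cached
```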
Closing Notes
What first looked like a nice simple issue (hello, missing security group rules) turned out to be much more complex than expected (from a debugging perspective, at least!).
It’s been super interesting digging a little deeper into the networking side of things, but I think I’ll leave networking to the pros 😄