SRX/AWS VPN tunnel dropouts

One of my clients’ networks is made up of a number of small offices, all of which have VPN tunnels from their SRX210s to AWS, where their AD/file/application servers live. After the cloud migration, I started getting complaints from users about intermittent dropouts in AWS connectivity. They were experiencing this primarily as RDP sessions dropping, which could have many different causes, but I looked into it a bit. I was able to observe a couple of dropouts of a few seconds each, but couldn’t find any causes or correlations at a glance.

Amazon has a great writeup on the requirements and procedures for a site-to-site VPN between your equipment and a VPC, but to sum it up: you provide Amazon with a few pieces of info (public IP address, prefixes to route, firewall vendor/version) and Amazon provides a list of set commands that only require minor edits. This is how I built these tunnels, and after changing interface names and IP addresses everything was up and running except for proxy-dns. I wrote about troubleshooting and resolving that here, but essentially (and importantly for the issue at hand) I had to give the interfaces on my side of the tunnel addresses in my corporate network rather than the link-local addresses that the autogenerated Amazon configs provided.
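
For context, the relevant tweak looks something like this – the unit number and addresses below are placeholders for illustration, not the production config:

# Tunnel interface address from the autogenerated AWS config (link-local):
set interfaces st0 unit 1 family inet address 169.254.255.2/30
# Replaced with an address out of the corporate range so traffic sourced
# from the firewall (e.g. proxy-dns) has a routable source address:
set interfaces st0 unit 1 family inet address 10.255.0.2/30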

Back to the issue at hand – after observing intermittent dropouts and not finding immediately apparent issues, I took a look at our network monitoring setup (LogicMonitor), which is essentially just consuming SNMP from a number of our firewalls/routers/switches. The SRXes were showing interface flaps on the st0 interfaces that correspond to the AWS tunnels, although the flaps were very frequent – much more frequent than any observed or reported issues. LogicMonitor is (without tuning) noisier than it needs to be, at least on Juniper gear, so I didn’t make too much of this, although inspecting the tunnels in the AWS console showed very recent (minutes) “status last changed” times. Some more cursory investigation on the firewalls found nothing, so I decided to focus on the VPN tunnel config, and after a significant amount of fruitless troubleshooting I resorted to disabling one of the two links and monitoring one link at a time.

This showed me the actual issue, which is fascinating – the link was going down for exactly ten seconds, once every two minutes:

[Monitoring graphs: dropout1, dropout2]

By watching the tunnel’s security-association on the firewall, I could see the cause of the link loss: AWS’ default config for a Junos device includes enabling vpn-monitor on the IPsec tunnel. Because of the tunnel address config mentioned above, vpn-monitor won’t work in this setup – the probes will not be returned, and the tunnel will be taken down and rebuilt as it appears to have failed. This is the cause of the perfectly regular dropouts on each tunnel.
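
If you want to watch this happen yourself, the SA view makes it obvious – the VPN name below is a placeholder, use whatever your autogenerated config named it:

# Active IPsec SAs – a rebuilt tunnel shows up as a new SA with a fresh lifetime:
show security ipsec security-associations
# The autogenerated AWS config enables the monitor under the vpn definition:
show configuration security ipsec vpn aws-tunnel-1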

So why was this causing intermittent and hard-to-identify outages? Beat frequency. Each link was experiencing (roughly) ten seconds of downtime (roughly) every two minutes. Most of the time these outages didn’t line up, and traffic continued to flow over the remaining link. However, because internet conditions vary, the tunnel rebuild times differed slightly, causing each link’s outage window to shift a little every cycle. Eventually, like turn signals or windshield wipers, these similar-but-not-exactly-identical periods would line up for a few seconds, causing a complete outage.
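
To put hypothetical numbers on it: if one tunnel happened to cycle every 120 seconds and the other every 121, their outage windows would drift relative to each other by about a second per cycle and realign roughly every 120 × 121 ≈ 14,500 seconds – a handful of times a day. (Figures invented purely for illustration; the real periods weren’t that tidy.)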

Disabling vpn-monitor for these tunnels resolved the issue completely. A very satisfying issue to troubleshoot.
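
In config terms, “disabling” just means removing (or deactivating) the vpn-monitor stanza on each tunnel and committing – again, the VPN name here is a placeholder:

delete security ipsec vpn aws-tunnel-1 vpn-monitor

(or deactivate security ipsec vpn aws-tunnel-1 vpn-monitor if you’d rather keep the stanza around for later).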

programmatic generation of sequential Junos set commands

While building out a new enterprise network for a client using Juniper hardware (SRX240s and EX3300s), I ran into a decision I’ve encountered a number of times: how to generate repetitive configuration. For example:

set interfaces ge-2/0/0 unit 0 family ethernet-switching port-mode access vlan members corporate-data
set interfaces ge-2/0/1 unit 0 family ethernet-switching port-mode access vlan members corporate-data
…and so on.

Obviously interface-range exists to solve this issue, at least for switchport config, but other members of my team take issue with the transparency of configuration in that form, and I agree somewhat – to verify an interface’s configuration, you’re either following a chain of config items or using the less-than-ideal show vlans. Additionally, interface-range only helps with interface configuration, not with security zones or policies.
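
For reference, the interface-range version of the same thing looks roughly like this (range name and member range invented for illustration):

set interfaces interface-range corp-access member-range ge-0/0/0 to ge-0/0/23
set interfaces interface-range corp-access unit 0 family ethernet-switching port-mode access
set interfaces interface-range corp-access unit 0 family ethernet-switching vlan members corporate-data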

In the past, I’ve used a quick bash script that I modify as necessary for outputting the required set commands. This is fine for me, but I wanted to make something a bit easier and more portable for my (Windows-using) team, so I converted it to a quick Python script.
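
The whole thing boils down to string formatting in a loop – something along these lines (a stripped-down sketch, not the actual script):

#!/usr/bin/env python3
# Prompt for a set-command template and a number range, then print one
# command per interface, with {n} standing in for the interface number, e.g.:
#   set interfaces ge-2/0/{n} unit 0 family ethernet-switching port-mode access
template = input("set command template (use {n} for the interface number): ")
start = int(input("first interface number: "))
end = int(input("last interface number: "))

for n in range(start, end + 1):
    print(template.format(n=n))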

Python is still not the most accessible interpreter for Windows users, so I figured I’d go all the way and create a JavaScript web version. It’s barebones and the interactivity is a bit weird – I intended to go for the full terminal experience, but my JS and CSS skills are weak at best. Anyway, hope this is useful to someone other than me and my coworkers.

manually set AWS EC2 RHEL7 hostname

I use a few t2.micro EC2 instances running RHEL7 for various general-purpose applications best suited to a small VPS. A personal preference of mine is to have short, descriptive hostnames on each system I regularly interact with, and by default, RHEL7 on AWS uses dynamic hostnames – the system’s hostname is defined by the private IP address of the instance:

[Screenshot: shell prompt showing the default dynamic hostname derived from the instance’s private IP]

I’m sure that’s useful in some applications (“at scale”), but it’s not what I prefer. Amazon has an article on how to set a static hostname, but it seems more involved than it needs to be – I have a hard time believing that every step is required. I put this to the test: rather than follow the steps to the letter, I tested each step independently, then in combination over multiple iterations, to see what was actually required. Here’s all that’s needed (condensed into a command sequence below the list):

  1. Replace the contents of /etc/hostname with your desired hostname. No idea why Amazon tells you to use HOSTNAME=newhostname; just echo "newhostname" | sudo tee /etc/hostname.
  2. Append preserve_hostname: true to /etc/cloud/cloud.cfg
  3. Reboot.
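
Condensed into commands (the hostname is shown here as myhost, purely as a placeholder):

echo "myhost" | sudo tee /etc/hostname
echo "preserve_hostname: true" | sudo tee -a /etc/cloud/cloud.cfg
sudo reboot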

Caveat: I could be wrong – there could be good reasons to go through all of the other steps that Amazon’s doc explains. Can’t imagine why, though!