20 Jul 2020

Fun with AWS Transit Gateways

Sushil Developer Talk

Recently I was playing around with setting up an AWS Transit Gateway. The objective was to route traffic from multiple VPCs through a single NAT gateway to the company network. The solution had to work if the origin VPCs were in different AWS accounts.

Part I: NAT Gateway

I took it in stages. I first set up the NAT gateway in a single VPC and made sure the traffic was routing through the NAT. As Figure 1 shows, the VPC (vpc-1) has three subnets. subnet-1a1 is private (no route to an Internet gateway), and all non-VPC traffic (0.0.0.0/0) is routed to the NAT gateway (nat-1). subnet-1a2 is public, and all non-local traffic is routed through the Internet gateway (this is the default configuration when you create a VPC). subnet-1b1 is also public (it has the same route table as subnet-1a2), and it contains the NAT gateway.

A server instance in subnet-1a2 (e.g. i-1a2) can connect to the company network (public IP 11.22.33.44) or any other Internet endpoint directly via the Internet gateway. The instance will present itself to the endpoint with its own public IP address (in this case 52.52.52.52). An instance in subnet-1a1 can also connect to any endpoint on the Internet, but its traffic is routed through nat-1, so it will show up to endpoints with the NAT gateway’s Elastic IP address (in this case 34.34.34.34).

Figure 1: NAT Gateway and subnet routing for a single VPC
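
The routing in Figure 1 can be modeled as a longest-prefix match over each subnet's route table. What follows is a minimal sketch in plain Python, not an AWS API call. The VPC CIDR (10.1.0.0/16) and the name igw-1 are assumptions, since the post gives neither; the subnet and NAT names come from the figure.

```python
import ipaddress

# Assumed CIDR for vpc-1; the post does not state one.
VPC1_CIDR = "10.1.0.0/16"

# Route tables keyed by subnet, as described for Figure 1.
# "igw-1" is a hypothetical name for vpc-1's Internet gateway.
ROUTE_TABLES = {
    "subnet-1a1": {VPC1_CIDR: "local", "0.0.0.0/0": "nat-1"},  # private
    "subnet-1a2": {VPC1_CIDR: "local", "0.0.0.0/0": "igw-1"},  # public
    "subnet-1b1": {VPC1_CIDR: "local", "0.0.0.0/0": "igw-1"},  # public, holds nat-1
}


def next_hop(subnet: str, destination: str) -> str:
    """Return the target of the most specific route that matches destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in ROUTE_TABLES[subnet].items()
        if dest in ipaddress.ip_network(cidr)
    ]
    # The longest matching prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]
```

Calling `next_hop("subnet-1a1", "11.22.33.44")` resolves to the NAT gateway, while the same destination from subnet-1a2 resolves to the Internet gateway.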

So far so normal and fine. I tested the connectivity from each instance and it looked like the NAT was set up correctly. The next step was to add another VPC to the mix and route traffic from the second VPC through the NAT gateway. To do that, I chose to use a Transit Gateway.

Part II: Transit Gateway

In Figure 2 I have added another VPC (vpc-2) containing a public subnet (subnet-2c1). This subnet’s route table sends non-local traffic to one of two destinations. If the endpoint is the company network’s public IP (11.22.33.44), the traffic is sent to the transit gateway, where the transit gateway route table (rtb-tg1) sends all Internet-bound traffic (0.0.0.0/0) to vpc-1. All other outbound traffic from vpc-2 is sent to that VPC’s Internet gateway (igw-2).

Figure 2: NAT Gateway and subnet routing with additional VPC and Transit Gateway
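
To make the two-hop path in Figure 2 concrete, here is a small sketch of the two route tables involved, again as a plain Python model rather than AWS calls. The vpc-2 CIDR (10.2.0.0/16) and the labels "tgw" and "vpc-1 attachment" are assumptions for illustration; the company IP, igw-2 and rtb-tg1 come from the text.

```python
import ipaddress

VPC2_CIDR = "10.2.0.0/16"  # assumed; the post does not give vpc-2's CIDR

# Route table of subnet-2c1 in vpc-2.
SUBNET_2C1_ROUTES = {
    VPC2_CIDR: "local",
    "11.22.33.44/32": "tgw",  # company network goes to the transit gateway
    "0.0.0.0/0": "igw-2",     # everything else goes straight out
}

# Transit gateway route table rtb-tg1.
RTB_TG1_ROUTES = {"0.0.0.0/0": "vpc-1 attachment"}


def lookup(routes: dict, destination: str) -> str:
    """Pick the most specific matching route (longest prefix wins)."""
    dest = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    return max(candidates, key=lambda c: c[0].prefixlen)[1]
```

Because the /32 route is more specific than 0.0.0.0/0, only traffic for 11.22.33.44 leaves subnet-2c1 via the transit gateway; `lookup(RTB_TG1_ROUTES, "11.22.33.44")` then hands it to the vpc-1 attachment.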

To make this work, you need to attach a subnet in vpc-1 to the transit gateway, and it must be a subnet that does not contain the NAT. In this case that was subnet-1a1. That way, the traffic from the TG lands in subnet-1a1, and any traffic destined for the Internet is then routed to the NAT.

I now tested the connectivity from the instance in the second VPC (i-2c1). While I was able to connect to other Internet endpoints (route 0.0.0.0/0), I was not able to connect to the company network (IP 11.22.33.44). This had me stumped. If instance i-1a1 in private subnet subnet-1a1 could connect to the corporate network, why couldn’t instance i-2c1, whose traffic was being dumped into subnet-1a1?

It turns out that the subnet attached to the TG and the subnet that holds the NAT gateway must be in the same availability zone (Figure 3). Once that piece of the puzzle fell into place, everything started to work as expected.

Figure 3: Corrected subnet routing for two or more VPCs and Transit Gateway
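
The availability-zone condition is easy to check before wiring anything up. The sketch below is a hypothetical helper, not part of any AWS SDK. The AZ names are assumptions: the post does not state them, and reading "1a" and "1b" in the subnet names as zones a and b is my inference. subnet-1b2 is an invented subnet used only to show a passing case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    subnet_id: str
    availability_zone: str  # assumed AZ names, e.g. "us-east-1a"


def same_az(attachment_subnet: Subnet, nat_subnet: Subnet) -> bool:
    """True if the TG-attached subnet and the NAT subnet share an AZ."""
    return attachment_subnet.availability_zone == nat_subnet.availability_zone


# The layout from Figure 2, with assumed AZs.
subnet_1a1 = Subnet("subnet-1a1", "us-east-1a")  # attached to the TG
subnet_1b1 = Subnet("subnet-1b1", "us-east-1b")  # holds nat-1

# Hypothetical subnet in the NAT's zone, for comparison only.
subnet_1b2 = Subnet("subnet-1b2", "us-east-1b")
```

With these assumed zones, `same_az(subnet_1a1, subnet_1b1)` is false, which matches the failure described above, while a pairing within one zone passes.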

Parting Thoughts

My purpose here is not to write a TG step-by-step guide, since there are already several out there. However, I do have some observations about the process that I have not seen others mention.

To get a TG operational you have to 1) attach VPCs to the TG, then 2) associate the VPCs with a TG route table, then 3) set up the transit gateway route table, and finally 4) update the VPC/subnet route tables. My first observation is that you can only associate a VPC with one TG route table, so step 2 may seem superfluous. Second, when a VPC is associated with a route table, AWS automatically adds a route for that VPC to the table. It is worth double-checking this, because I have seen cases where it doesn’t happen.
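
The four steps above could be scripted roughly as follows for an origin VPC such as vpc-2. This is a sketch, assuming a boto3 EC2 client is passed in as `ec2`. The function name and every parameter name are hypothetical. The wait for the new attachment to become available before step 2 is omitted, and no error handling is shown.

```python
def connect_origin_vpc(
    ec2,                      # assumed: a boto3 EC2 client, e.g. boto3.client("ec2")
    *,
    tgw_id: str,
    tgw_route_table_id: str,
    vpc_id: str,
    attachment_subnet_id: str,
    egress_attachment_id: str,    # existing attachment of the NAT VPC (vpc-1)
    subnet_route_table_id: str,
    company_cidr: str = "11.22.33.44/32",
) -> str:
    """Run steps 1-4 in order and return the new attachment ID."""
    # 1) Attach the VPC to the transit gateway.
    response = ec2.create_transit_gateway_vpc_attachment(
        TransitGatewayId=tgw_id,
        VpcId=vpc_id,
        SubnetIds=[attachment_subnet_id],
    )
    attachment_id = response["TransitGatewayVpcAttachment"][
        "TransitGatewayAttachmentId"
    ]

    # 2) Associate the attachment with the TG route table.
    ec2.associate_transit_gateway_route_table(
        TransitGatewayRouteTableId=tgw_route_table_id,
        TransitGatewayAttachmentId=attachment_id,
    )

    # 3) TG route: send Internet-bound traffic to the NAT VPC's attachment.
    ec2.create_transit_gateway_route(
        DestinationCidrBlock="0.0.0.0/0",
        TransitGatewayRouteTableId=tgw_route_table_id,
        TransitGatewayAttachmentId=egress_attachment_id,
    )

    # 4) VPC route: the only step that changes traffic flow, so it goes last.
    ec2.create_route(
        RouteTableId=subnet_route_table_id,
        DestinationCidrBlock=company_cidr,
        TransitGatewayId=tgw_id,
    )
    return attachment_id
```

Passing the client in as an argument keeps the function easy to dry-run against a stub before pointing it at a real account.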

The TG setup steps 1-3 above don’t actually change traffic flow in any way, so they’re perfectly safe to try and play around with. Traffic flow only changes when you modify the VPC route tables (step 4), so that should be the last step. It is critical to add a route back to all the VPCs using the NAT gateway in the route table of the subnet containing the gateway, even though that is not the subnet that is attached to the TG. Otherwise the return traffic coming into the NAT will not make its way all the way back to the origin.
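
The return-route requirement can be expressed as a simple check on the NAT subnet's route table. This is a sketch under stated assumptions: the CIDRs, the "tgw" and "igw-1" target labels, and the helper name are all hypothetical, and the check is simplified in that it ignores more specific routes that might override a matching one.

```python
import ipaddress


def missing_return_routes(nat_subnet_routes: dict, origin_vpc_cidrs: list) -> list:
    """Return the origin VPC CIDRs that the NAT subnet does not route to the TG."""
    missing = []
    for cidr in origin_vpc_cidrs:
        origin = ipaddress.ip_network(cidr)
        covered = any(
            target == "tgw" and origin.subnet_of(ipaddress.ip_network(route))
            for route, target in nat_subnet_routes.items()
        )
        if not covered:
            missing.append(cidr)
    return missing


# Route table of the subnet holding the NAT gateway, before the fix.
# All CIDRs here are assumed values.
nat_subnet_routes = {"10.1.0.0/16": "local", "0.0.0.0/0": "igw-1"}
origin_cidrs = ["10.2.0.0/16"]  # assumed CIDR for vpc-2
```

Here `missing_return_routes(nat_subnet_routes, origin_cidrs)` reports the vpc-2 CIDR as missing; adding a route for it that targets the transit gateway clears the list.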

Let me know if this was helpful, if something doesn’t make sense or if any of the information is out of date.

Sushil Kambampati

I help companies save money on cloud, do DevOps right and secure their systems