pmeerw's blog

Fri, 08 May 2026

Load-testing on AWS

We were load-testing a service hosted on AWS EC2 instances behind an application load balancer (ALB). At first, the simulated clients connected directly to the application (no ALB), and a single server easily handled 3000 clients. Add in an ALB with a target group (TG) spanning two availability zones (AZ), and: nothing works! Wild errors everywhere with only about 70 simulated clients.

First steps

We also tried a local setup, hosting everything on an Ubuntu machine and on a Windows Subsystem for Linux (WSL) instance. WSL worked nicely from the start, but Ubuntu, like the AWS setup, failed with only a few clients.

Quickly, it became apparent that Ubuntu imposes a limit on the number of file handles that are available to a process:

$ ulimit -n
1024    # (on Ubuntu)
10240   # (on WSL2)
No more file handles, no more open sockets. Unfortunately, the application's log output did not clearly indicate any failure condition. The limit is easily raised with ulimit -n 16384. Amazon Linux imposes no such low limit. So we could now handle thousands of simulated clients on a single machine, but the goal was a service that scales horizontally by simply spawning new server instances. Hence, the service needs to go behind an ALB that distributes the load "in the cloud".
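For reference, a minimal sketch of checking and raising the descriptor limit; the soft limit can be raised per shell session up to the hard limit, and persistently via a systemd drop-in (the unit name below is just a placeholder):

```shell
# current soft limit on open file descriptors
ulimit -n

# hard limit (the soft limit may be raised up to this value)
ulimit -Hn

# raise the soft limit for this shell session and its children
ulimit -n 16384

# to make it persistent for a systemd-managed service, a drop-in works:
#   # /etc/systemd/system/loadtest.service.d/limits.conf  (unit name is a placeholder)
#   [Service]
#   LimitNOFILE=16384
```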

Back to ALB

What's going on with the ALB? Turning on access logging and fetching the logs from S3 is tedious. The logs revealed plenty of 460 status codes, indicating that the client closed the connection to the ALB before the target service could answer. So the client apparently closes connections. But where? And how?
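Sifting through the downloaded logs can be scripted; a sketch, assuming the logs land in an S3 bucket (bucket name and prefix below are placeholders), exploiting that field 9 of an ALB access-log entry is the elb_status_code:

```shell
# pull the ALB access logs down from S3 (bucket/prefix are placeholders)
aws s3 sync s3://my-alb-logs/AWSLogs/ ./alb-logs/

# count requests that ended in a 460 (field 9 is the elb_status_code)
find ./alb-logs -name '*.log.gz' -exec zcat {} + | awk '$9 == 460' | wc -l
```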

What is the client doing?

Collecting packet traces with tcpdump (.pcap files) showed that the client indeed sends a FIN packet to close the TCP connection to the ALB on a particular TCP port (and the request with that client port shows up in the ALB log). So process traces obtained with strace, showing the syscalls leading up to close() or shutdown(), should indicate where it's happening. With added logging we could correlate the connection's port number with the interesting socket handle numbers. But: nothing showed up. No reason for the FIN packets to be found. The C++ application's asio code is dense, and even with more logging the results remained inconclusive.
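The capture-and-correlate step can be sketched like this, assuming the ALB is reached on port 443 and the client binary matches "client_app" (both placeholders); the grep/awk part extracts the local ports that sent a FIN from tcpdump's text output, to match against the ALB access log:

```shell
# capture traffic to/from the ALB in text form (port 443 is an assumption):
#   sudo tcpdump -n -i any tcp port 443 > capture.txt
# and, in parallel, trace the client's closing syscalls:
#   sudo strace -f -e trace=close,shutdown -p "$(pgrep -f client_app)" -o client.strace

# list the local ports that sent a FIN; tcpdump prints addresses as a.b.c.d.port,
# so the port is the last dot-separated component of field 3
grep 'Flags \[F' capture.txt | awk '{ n = split($3, a, "."); print a[n] }' | sort -u
```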

At least we were able to shift the reason for the ALB 460 errors to the target server's side by adjusting some connection timeouts. The ALB now logs custom 520 status codes, which at least feels a bit better (more under control). Thanks to AI for suggesting the change.

Detour: Let's put simulated clients outside AWS

Running the simulated clients outside the AWS network, accessing the servers behind the ALB, worked quite well. However, to easily simulate significant load we'd need many clients, and that is most easily achieved by spinning up EC2 instances.

The revelation

More log files were created, narrowing down the number of simulated clients the setup with the ALB seems to handle:

  50 clients: OK
  70 clients: NG

Waaay too low to be practical.

Log file analysis revealed that the client's resolution of the ALB hostname sometimes took approximately 5 seconds. How can DNS resolution take so long within AWS? Watching the DNS traffic with tcpdump -n -i any port 53 confirmed the following:

  1. A shitload of DNS requests are sent to the EC2 resolver's IP
  2. DNS resolution blocks for about 5 seconds from time to time
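The 5-second stalls are easy to make visible by timing lookups in a loop; a sketch using getent (the hostname below is a placeholder for the ALB's DNS name):

```shell
# time a batch of lookups; any ~5 s outlier points at dropped/throttled DNS packets
# ("my-alb.example.com" is a placeholder for the ALB hostname)
for i in $(seq 1 20); do
    start=$(date +%s%N)
    getent hosts my-alb.example.com > /dev/null
    end=$(date +%s%N)
    echo "lookup $i: $(( (end - start) / 1000000 )) ms"
done
```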

Amazon clearly states the limits for AWS DNS operations:

Each network interface in an Amazon VPC has a hard limit of 1024 packets that it can send to the Amazon-provided DNS server every second.
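A back-of-envelope calculation shows why 70 clients already hurt: glibc typically sends an A and an AAAA query per lookup, so the per-ENI budget caps uncached lookups well below what the clients generate (the per-client figure assumes each client resolves before every request, which matches our setup but is an assumption here):

```shell
# 1024 DNS packets/s per ENI, ~2 packets per lookup (A + AAAA):
echo $(( 1024 / 2 ))        # max uncached lookups/s for the whole instance

# split across 70 simulated clients resolving before every request:
echo $(( 1024 / 2 / 70 ))   # lookups/s available per client
```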

The solution

The solution becomes evident when asking the right questions :-)

Other people have run into the same issue. Mikail has a nice writeup. Amazon Linux 2023 uses systemd-resolved and by default configures it to disable any caching. This can be beneficial, e.g. for instant database failovers. But due to the number of connections and our application logic, we have to cache and thereby limit the number of requests hitting the EC2 DNS resolver.

The following steps fix the issue and allow local caching of DNS queries:

  1. don't disable the stub resolver
    sudo rm /usr/lib/systemd/resolved.conf.d/resolved-disable-stub-listener.conf
    sudo systemctl restart systemd-resolved
    
  2. point /etc/resolv.conf to systemd's stub resolver
    sudo rm /etc/resolv.conf
    sudo ln -s /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
    
Use resolvectl statistics (or, on older systemd versions, systemd-resolve --statistics) to check the cache's hit/miss rate.
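For belt and braces, caching can also be enabled explicitly via a resolved drop-in; a sketch, assuming the drop-in directory (which may need to be created first) and file name below:

```ini
# /etc/systemd/resolved.conf.d/10-enable-cache.conf  (path/name are placeholders)
[Resolve]
Cache=yes
```

Restart systemd-resolved afterwards for the drop-in to take effect.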

Conclusion

In memoriam Dan Kaminsky: It's always DNS!

Even apparently unrelated problems can have their root cause in a failure to resolve names via the venerable Domain Name System.

posted at: 18:00 | path: /rant | permanent link

Made with PyBlosxom