pmeerw's blog
08 May 2026
We were load-testing a service hosted on AWS EC2 instances behind an application load balancer (ALB). At first, the simulated clients connected directly to the application (no ALB), and 3000 clients were handled easily by a single server. Add in an ALB with a target group (TG) spanning two availability zones (AZ), and: nothing works! Wild errors everywhere with only about 70 simulated clients.
We also tried a local setup with everything hosted on an Ubuntu machine and a Windows Subsystem for Linux (WSL) instance. Initially, WSL worked nicely, but Ubuntu, too, failed with only a few clients.
Quickly, it became apparent that Ubuntu imposes a limit on the number of file handles that are available to a process:
$ ulimit -n
1024    # on Ubuntu
10240   # on WSL2

No more file handles, no more open sockets. Unfortunately, the application's log output did not clearly indicate any failure condition. This is easily rectified by raising the limit:

ulimit -n 16384

Amazon Linux has no such limit.
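For completeness, the process can also raise its own soft limit at startup; here is a minimal sketch (not from the actual application) using getrlimit/setrlimit, which does the same as ulimit -n in the shell:

```cpp
// Sketch: bump the soft RLIMIT_NOFILE up to the hard limit at process start,
// so the load generator is not stuck at the distribution default of 1024.
#include <sys/resource.h>
#include <cstdio>

static void raise_fd_limit() {
    rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        std::perror("getrlimit");
        return;
    }
    std::printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
                (unsigned long long)lim.rlim_cur,
                (unsigned long long)lim.rlim_max);
    lim.rlim_cur = lim.rlim_max;  // raise soft limit as far as permitted
    if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
        std::perror("setrlimit");
}
```

This only helps up to the hard limit, of course; a higher hard limit still has to come from the shell, PAM, or systemd.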
So we can now handle thousands of simulated clients on a single machine, but the goal was a service that scales horizontally by simply spawning new server instances. Hence, the service needs to go behind an ALB that distributes the load "in the cloud".
What's going on with the ALB? Turning on logging and fetching logs from S3 is tedious. The logs revealed plenty of 460 status codes, indicating that the client closed the connection to the ALB before the target service could answer. So the client apparently closes connections. But where? And how?
Logging calls to close() or shutdown() should indicate where this happens. With the added logging we could correlate the connection's port number with the socket handles of interest.
But: nothing showed up, no reason for the FIN packets to be found. The C++ application's asio code is dense; more logging was added, still inconclusive.
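To illustrate the kind of logging added, here is a sketch (assuming Boost.Asio; the actual socket type and call sites in the real code differ) that records the local port and the socket handle right before a connection is torn down, so a FIN seen on the wire can be matched to a code path:

```cpp
#include <boost/asio.hpp>
#include <iostream>

// Log local port and native handle before shutting a connection down,
// so FIN packets and ALB log entries can be correlated with call sites.
void close_with_log(boost::asio::ip::tcp::socket& sock, const char* where) {
    boost::system::error_code ec;
    const auto ep = sock.local_endpoint(ec);  // may fail if already closed
    std::cerr << "closing fd=" << sock.native_handle()
              << " local_port=" << (ec ? 0 : ep.port())
              << " at " << where << '\n';
    sock.shutdown(boost::asio::ip::tcp::socket::shutdown_both, ec);
    sock.close(ec);
}
```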
At least we were able to shift the reason for the ALB 460 errors to the target server's side by adjusting some connection timeouts. The ALB now logs custom 520 status codes, which at least feels a bit better (more under control). Thanks to AI for suggesting the change.
Running the simulated clients outside the AWS network, accessing the servers behind the ALB, worked quite well. However, to simulate significant load we need many clients, and this is most easily achieved by spinning up EC2 instances.
More log files were created, narrowing down the number of simulated clients the setup with ALB seems to handle:
| 50 clients: | OK |
| 70 clients: | NG |
Log file analysis revealed that some of the client's resolve operations for the ALB hostname took approximately 5 seconds. How can DNS take so long to resolve within AWS?
Looking at the DNS traffic with tcpdump -n -i any port 53 confirmed the following:
Amazon clearly states the limits for AWS DNS operations:
Each network interface in an Amazon VPC has a hard limit of 1024 packets that it can send to the Amazon-provided DNS server every second.
The solution becomes evident when asking the right questions.
Other people have run into the same issue as well; Mikail has a nice writeup. Amazon Linux 2023 uses systemd-resolved and by default configures it to disable any caching. This might be beneficial, e.g. to get instant database failovers. But due to the number of connections and our application logic, we have to cache and limit the number of requests to the EC2 DNS resolver.
The following fixes the issue and allows local caching of DNS queries:

(1) remove the drop-in that disables systemd-resolved's stub listener and restart the service:

sudo rm /usr/lib/systemd/resolved.conf.d/resolved-disable-stub-listener.conf
sudo systemctl restart systemd-resolved

(2) point /etc/resolv.conf to systemd's stub resolver:

sudo rm /etc/resolv.conf
sudo ln -s /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
Run systemd-resolve --statistics to check the cache's hit/miss rate.
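On the application side, the number of DNS queries could also be reduced by resolving the ALB hostname once and reusing the endpoints for all simulated clients. A rough sketch assuming Boost.Asio (hostname and port are placeholders, and an ALB's IPs do change over time, so a long-running test would still want to re-resolve occasionally):

```cpp
#include <boost/asio.hpp>
#include <iostream>
#include <memory>

int main() {
    boost::asio::io_context io;
    boost::asio::ip::tcp::resolver resolver(io);

    // One resolve for the whole run instead of one DNS query per client.
    auto endpoints = resolver.resolve("my-alb.example.com", "443");

    for (int i = 0; i < 3000; ++i) {
        auto sock = std::make_shared<boost::asio::ip::tcp::socket>(io);
        boost::asio::async_connect(*sock, endpoints,
            [sock, i](const boost::system::error_code& ec,
                      const boost::asio::ip::tcp::endpoint&) {
                if (ec)
                    std::cerr << "client " << i << ": " << ec.message() << '\n';
                // ... hand the connected socket to the simulated client ...
            });
    }
    io.run();
}
```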
In memoriam Dan Kaminsky: It's always DNS!
Even apparently unrelated problems can have their root cause in a failure to resolve names via the venerable Domain Name System.
posted at: 18:00 | path: /rant | permanent link