Squeezing through the DNS bottleneck in EKS
August 13th, 2024
I recently had a project at work to scale up concurrency with our QA test automation system. It uses Selenium Grid to run headless browser clients that carry out scripted user actions against our web app in a test environment. These headless browser pods run in an EKS test cluster, while the web app that they test against lives in yet another EKS cluster.
I initially dealt with the typical Kubernetes scaling touchpoints: horizontal pod autoscaling and worker node capacity in both clusters.
The Problem
However, there was a curveball. On the Selenium Grid side, the logs were showing a lot of DNS failures for both in-cluster and external addresses, all of them legitimate names that should have been resolvable.
Failed to open TCP connection to selenium-router.test-pipeline-12345:4444 (getaddrinfo: Name or service not known) (SocketError)
Failed to open TCP connection to api.company.biz:443 (getaddrinfo: Temporary failure in name resolution)
Failed to open TCP connection to microservice.api.company.biz:443 (getaddrinfo: Temporary failure in name resolution)
I was aware that the CoreDNS service in Kubernetes is responsible for DNS resolution, and in my cursory search for an answer, it sounded like scaling up CoreDNS might help alleviate the problem.
When researching DNS at the AWS VPC layer, I discovered that there is a rate limit on the VPC resolver that CoreDNS forwards its queries to; however, AWS does not provide a straightforward way to know whether you're being rate-limited by it.
The Amazon Virtual Private Cloud (Amazon VPC) resolver can accept a maximum hard limit of only 1024 packets per second per network interface. If more than one CoreDNS pod is on the same node, then the chances are higher to reach this limit for external domain queries.
AWS re:Post - How do I troubleshoot DNS failures with Amazon EKS?
Scaling up CoreDNS actually made the problem worse: fewer tests were able to complete, and even more DNS failures showed up in the logs. I reverted CoreDNS back to its original scaling rule. I had no choice but to do the unthinkable and try to learn how DNS queries actually work in an EKS environment.
Investigations
Turning on logging in CoreDNS was the first step. It produces a very large volume of logs, so it's most likely something you only want to keep on while you're actively debugging a problem.
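In EKS, the Corefile lives in the coredns ConfigMap in the kube-system namespace, and logging is turned on by adding the log plugin to it. Here's a rough sketch of that edit; the surrounding plugins mirror a typical EKS default and may differ from your cluster's Corefile:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        # the log plugin prints one line per query, including the response code
        log
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }

If the reload plugin is present (it is in the EKS default shown above), CoreDNS picks up the ConfigMap change on its own after a short delay.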
From there I realized I needed to brush up on the response codes being printed to the CoreDNS logs. TL;DR: NXDOMAIN means the name does not exist (the query failed), and NOERROR means the query succeeded.
With the CoreDNS logs in front of my face, I could clearly see the problem: multiple search suffixes were being attempted before we got a successful result.
Here's an example where the name we are trying to resolve is api.company.biz.
CoreDNS Logs
[INFO] "A IN api.company.biz.default.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000073309s
[INFO] "A IN api.company.biz.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000073309s
[INFO] "A IN api.company.biz.cluster.local. udp 54 false 512" NXDOMAIN qr,aa,rd 147 0.000113114s
[INFO] "A IN api.company.biz.ec2.internal. udp 53 false 512" NXDOMAIN qr,aa,rd,ra 138 0.000083938s
[INFO] "A IN api.company.biz. udp 40 false 512" NOERROR qr,aa,rd,ra 324 0.000057045s
My best guess at this point was that we were getting rate-limited by the VPC resolver, and that all the unnecessary DNS queries the pods were producing were a major contributing factor.
More reading about Kubernetes DNS revealed that this is the default behavior. Pod DNS is configured to behave like a fuzzy search to make things user-friendly. You can see that the suffixes being attempted also exist in the search list of the /etc/resolv.conf file that lives in each pod, and that the ndots value is 5:
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
nameserver 172.20.0.10
options ndots:5
Excerpt from the gnu/linux resolv.conf man page:
ndots:n
Sets a threshold for the number of dots which must appear in a name given to res_query(3) (see resolver(3)) before an initial absolute query will be made. The default for n is 1, meaning that if there are any dots in a name, the name will be tried first as an absolute name before any search list elements are appended to it. The value for this option is silently capped to 15.
Changing this behavior of iterating through the search suffixes is achieved by adjusting the ndots value in /etc/resolv.conf. In a Kubernetes environment, the contents of this file are controlled through the dnsPolicy and dnsConfig fields of the pod spec.
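For illustration, here is a minimal sketch of that mapping using a throwaway pod that just prints the resolv.conf kubelet rendered for it; the pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: resolv-conf-demo                   # hypothetical pod name
spec:
  restartPolicy: Never
  dnsPolicy: ClusterFirst                  # keep normal in-cluster resolution via CoreDNS
  dnsConfig:
    options:
      - name: ndots                        # merged into the options line of /etc/resolv.conf
        value: "2"                         # example override; the default written for ClusterFirst pods is 5
  containers:
    - name: demo
      image: busybox:latest                # placeholder image
      command: ["cat", "/etc/resolv.conf"] # show the rendered file, then exit

Running kubectl logs against this pod shows the familiar search list along with the overridden ndots value in the options line.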
Solution
It seemed like the most straightforward way to reduce bad DNS queries was to avoid the fuzzy searching altogether. The solution would be:
- update the source code running in the test pods to only use fully-qualified domain names
- set ndots to a value of 1 so that any name containing a dot is tried as an absolute query first, before the search suffixes are appended (sketched below)
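Here's a sketch of what the fix can look like on the test pods themselves; the Deployment, image, and environment variable names are illustrative, not our actual manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: selenium-chrome-node                # hypothetical name for the headless browser pods
spec:
  replicas: 3
  selector:
    matchLabels:
      app: selenium-chrome-node
  template:
    metadata:
      labels:
        app: selenium-chrome-node
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "1"                      # names containing a dot are tried as absolute queries first
      containers:
        - name: node
          image: selenium/node-chrome:latest            # illustrative image
          env:
            - name: GRID_ROUTER_URL                     # illustrative variable consumed by the test code
              value: "http://selenium-router.test-pipeline-12345.svc.cluster.local:4444"
            - name: API_BASE_URL                        # illustrative variable
              value: "https://api.company.biz"

With ndots at 1, both the fully-qualified in-cluster name and the external hostnames contain at least one dot, so each lookup is sent as a single absolute query instead of first being expanded through the search list.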
By keeping an eye on the occurrences of NXDOMAIN in the CoreDNS logs, I could see that after applying the changes above, we were no longer querying for non-existent domains. The DNS failures in the test pods disappeared, presumably because we were no longer hitting the rate-limit threshold. Our overall DNS query volume to the VPC resolver is likely about one fifth of what it was prior to the change.