pmeerw's blog

Sun, 21 Apr 2024

A long tale of an (inconvenient) bug: Firefox not working on Ubuntu 24.04 pre-beta

Sometimes one feels adventurous: so I made the hasty decision to upgrade from Ubuntu 22.04 to Ubuntu 24.04 (Noble Numbat) in alpha/pre-beta state -- I swear, this has worked nicely previously.

The update went fairly smoothly; only the network/nameserver configuration was lost, but that was easy to fix by manually installing the network-manager package and editing /etc/resolv.conf. It helps if you can remember how to manually bring up a network interface, add a default route, and set a static DNS server:

ifconfig -a # to discover available network interfaces
ifconfig enp5s0 192.168.1.123 # set static IP
route add default gw 192.168.1.1 # add default route
nano /etc/resolv.conf # set static DNS

All looks good, except: Firefox cannot resolve any host name. Also Chromium. Also Slack is dead. Ugh, the machine is pretty useless without the Web. However, network connectivity is good: ping works, nslookup works. It turned out that only snap applications were affected, but that was not so clear in the beginning.

Initially, I thought some browser-related package was at fault. Here are my notes on the debugging process and the train of thought; in any case, I learned quite a bit...

System calls

I used strace to look for failing syscalls. I also tried ltrace, but that did not really work. strace showed the following sequence: the /etc/nsswitch.conf and /etc/hosts files are read; a connect() call establishes a connection to the DNS resolver/nameserver given in /etc/resolv.conf (127.0.0.53, port 53 in my case); the DNS query is sent; the DNS response is received, but the recvfrom() syscall returns an error code (EINVAL). Inspecting the query traffic with Wireshark, everything looked good on the network layer.

I made a couple of experiments: the browser had no issue connecting to IP addresses directly, also a connection to a hostname in the /etc/hosts file was no problem. So DNS...
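
A quick way to exercise the C library resolver outside of any browser is a small getaddrinfo() test program. The following is only a minimal sketch, assuming a glibc system; the helper name resolve_and_print() is made up for illustration. Given that ping and nslookup worked, one would expect it to succeed when run unconfined, which narrows the problem down to how the affected applications are run:

```c
// compile: gcc -g -Wall resolve.c -o resolve
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

// Resolve host via getaddrinfo() and print every address found.
// Returns getaddrinfo()'s result: 0 on success, an EAI_* code otherwise.
int resolve_and_print(const char *host) {
  struct addrinfo hints, *res, *p;
  char buf[INET6_ADDRSTRLEN];

  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_UNSPEC;     // ask for IPv4 and IPv6
  hints.ai_socktype = SOCK_STREAM; // avoid one entry per socket type

  int rc = getaddrinfo(host, NULL, &hints, &res);
  if (rc != 0) {
    fprintf(stderr, "getaddrinfo(%s) = %d (%s)\n", host, rc, gai_strerror(rc));
    return rc;
  }

  for (p = res; p; p = p->ai_next) {
    // pick the address field matching the returned family
    const void *addr = (p->ai_family == AF_INET)
      ? (const void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
      : (const void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
    if (inet_ntop(p->ai_family, addr, buf, sizeof(buf)))
      printf("%s has %s\n", host, buf);
  }

  freeaddrinfo(res);
  return 0;
}

int main(int argc, char *argv[]) {
  const char *host = (argc > 1) ? argv[1] : "example.com";
  return resolve_and_print(host) == 0 ? 0 : 1;
}
```

Run it as ./resolve example.com; a non-zero exit status together with the gai_strerror() text tells whether the resolver itself is failing.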

Firefox debugging

Next, I tried to corner the issue from above and learned about Firefox's built-in debugging and logging capabilities. Firefox is my main browser; it exposes a wide range of information. Entering about:networking in the browser's address bar, one can observe recent HTTP connections, network socket information, and a glimpse of the DNS cache (see screenshot). Using the 'DNS Lookup' functionality, it is easy to confirm that Firefox cannot resolve any hostname.

[screenshot: good]
[screenshot: bad]

Entering about:logging presents Firefox's logging manager. One can restrict logging to certain modules and log levels (supposedly, that is what '5' means). The Firefox Profiler is an online tool for analyzing log files and obviously not very helpful for tracking down network issues, hence I selected 'Logging to a file'. Since Firefox usually runs in a Snap container, the log file actually ends up in the /tmp/snap-private-tmp/snap.firefox/tmp/ directory of the file system.

A successful DNS lookup for 'example.com' produces the following lines:

Parent 17655: Main Thread]: D/nsHostResolver Resolving host
Parent 17655: Main Thread]: D/nsHostResolver   No usable record in cache for host
Parent 17655: Main Thread]: D/nsHostResolver NameLookup host:example.com af:0
Parent 17655: Main Thread]: D/nsHostResolver NameLookup: example.com effectiveTRRmode: 1 flags: 0
Parent 17655: Main Thread]: D/nsHostResolver TRR service not enabled - off or disabled
Parent 17655: Main Thread]: D/nsHostResolver NativeLookup host:example.com af:0
Parent 17655: Main Thread]: D/nsHostResolver   DNS thread counters: total=6 any-live=0 idle=6 pending=1
Parent 17655: Main Thread]: D/nsHostResolver   DNS lookup for host
Parent 17655: DNS Resolver #123]: E/nsHostResolver DNS lookup thread - Calling getaddrinfo for host
Parent 17655: Main Thread]: D/nsHostResolver Resolving host
Parent 17655: Main Thread]: D/nsHostResolver   No usable record in cache for host
Parent 17655: Main Thread]: D/nsHostResolver NameLookup host:example.com af:0
Parent 17655: Main Thread]: D/nsHostResolver NameLookup: example.com effectiveTRRmode: 1 flags: 0
Parent 17655: Main Thread]: D/nsHostResolver TRR service not enabled - off or disabled
Parent 17655: DNS Resolver #123]: E/nsHostResolver DNS lookup thread - lookup completed for host
Parent 17655: DNS Resolver #123]: D/nsHostResolver nsHostResolver::CompleteLookup example.com 7650d9e24650 0 resolver=0 stillResolving=0
Parent 17655: DNS Resolver #123]: D/nsHostResolver nsHostResolver record 7650de6a1060 new gencnt
Parent 17655: DNS Resolver #123]: D/nsHostResolver Caching host
Parent 17655: DNS Resolver #123]: D/nsHostResolver CompleteLookup: example.com has 93.184.215.14
Parent 17655: DNS Resolver #123]: D/nsHostResolver CompleteLookup: example.com has 2606:2800:21f:cb07:6820:80da:af6b:8b2c
There are a couple of interesting things: (1) it calls getaddrinfo(), which lives in the C runtime; (2) nsHostResolver is a module in Firefox's code base, which can be browsed online, e.g. using searchfox; (3) it ultimately calls PR_GetAddrInfoByName(), which lives in the libnspr4 package/library and seemed like a good candidate to investigate further. Again, the source is browsable online and indeed, it calls getaddrinfo().

Digging deeper, IDA Pro

The ltrace tool would have been really helpful here to trace the arguments passed and results returned at runtime; however, it didn't work for me. So I tried attaching the IDA Pro 8.3 disassembler/debugger to the Firefox process; a free IDA Pro version is available from Hex-Rays.

In IDA, I looked for the libnspr4.so shared object in the 'Modules' list, then searched for the PR_GetAddrInfoByName() symbol in the module's function list to get a disassembly of the function.

Using the basic block graph of the function (and having the almost matching, corresponding source code), it's relatively easy to locate the potential getaddrinfo() call (right after the checks for "localhost" domain). I've set a breakpoint (press F2) on the call and on the next instruction.

Having the debugger interrupt and block the Firefox process is tedious: those functions are called too often for manual inspection of the call arguments and return code. A better way is conditional breakpoints and logging. In IDA, edit the breakpoints and enter a condition:

msg("!!!1 %s\n", get_strlit_contents(rdi, -1, STRTYPE_C)), 0
and
msg("-> %08x\n", eax), 0
respectively. Both output a message, and at the end (", 0") return 0 as the result of the conditional expression (i.e. the breakpoint's action is NOT triggered). The first message outputs the null-terminated string pointed to by the rdi register, that is, the host name to resolve. The second message outputs the value of the eax register, i.e. the result value of the getaddrinfo() function. After instrumenting the Firefox process as described, press F9 to continue execution of the debuggee. Perform a DNS query in Firefox, and the logging -- thanks to the breakpoints -- will be shown in IDA's "Output" window.

I tried to step into (press F7 in IDA) the getaddrinfo() implementation, but quickly gave up on the code. It will somehow call connect() to query the DNS resolver as seen in strace...

Showdown & conclusion

After a long time deducing all the steps, and ignoring the fact that a fresh install might have been the quicker option, here are the findings:

I tried calling PR_GetAddrInfoByName() from a small test program; of course, no issue there:

// compile: gcc -g -Wall nspr.c -lnspr4

#include "nspr/nspr.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
  int ret = EXIT_FAILURE;

  PR_Init(0, PR_PRIORITY_NORMAL, 0);

  // same NSPR entry point that Firefox ends up in (IPv4 only here)
  PRAddrInfo* ai = PR_GetAddrInfoByName("example.com", PR_AF_INET, PR_AI_ADDRCONFIG | PR_AI_NOCANONNAME);
  if (!ai) {
    printf("PR_GetAddrInfoByName() = %d\n", PR_GetError());
    goto error;
  }

  printf("PR_GetAddrInfoByName() succeeded\n");
  PR_FreeAddrInfo(ai);
  ret = EXIT_SUCCESS;

error:
  PR_Cleanup();

  // report a failed lookup via the exit status as well
  return ret;
}

My suspicion was that either AppArmor or the Snap containerization was to blame. AppArmor can be easily disabled, and hence ruled out. Snap is not so easy; I know little about it and had to defer further investigation. The plan was to run the above test program under Snap and hope that the error would reproduce...

Aftermath

A couple of days later, I learned that other people experienced the same problem: "Network problems with snap apps" (on askubuntu.com), "DNS for snaps like Firefox and Chromium fails" (on launchpad), "Snaps unable to connect to network under linux-lowlatency", and "Noble kernel regression with new apparmor profiles/features".

And also a solution became known: a kernel update from Linux 6.8.0-25-lowlatency to Linux 6.8.0-28-lowlatency -- and the issue is gone! Somewhat disappointing. I tried to track down the actual fix; maybe it's this:

diff -u linux-6.8.0/security/apparmor/af_inet.c linux-6.8.0/security/apparmor/af_inet.c
--- linux-6.8.0/security/apparmor/af_inet.c
+++ linux-6.8.0/security/apparmor/af_inet.c
@@ -103,14 +103,12 @@
 	AA_BUG(!maddr);
 
 	maddr->addrtype = addrtype;
-	if (!addr) {
+	if (!addr || addrlen < offsetofend(struct sockaddr, sa_family)) {
 		maddr->addrp = NULL;
 		maddr->port = 0;
 		maddr->len = 0;
 		return 0;
 	}
-	if (addrlen < offsetofend(struct sockaddr, sa_family))
-		return -EINVAL;
 
 	/*
 	 * its possibly to have sk->sk_family == PF_INET6 and
This somehow would match the initial finding that recvfrom() returns -EINVAL.

posted at: 22:39 | path: /rant | permanent link
