Summary:
A bug in the Ubuntu kernel 6.2.0 can cause Linux to believe it is using more TCP memory than it actually is. As a result, Linux may enforce its TCP memory limit and act as if it were out of TCP memory.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2045560
When TCP memory is limited on a node, you may see leadership transfers away from that broker, or under-replicated partitions. This can occur briefly or over a longer period.
Severity:
Medium
Redpanda Products Affected:
Redpanda Self Hosted Enterprise
Releases Affected:
All releases are potentially affected, depending on the operating system kernel.
Identification
Below are several methods you can use to determine whether you are impacted by this kernel bug. If you are not on an affected architecture, or not on an affected kernel version, you are not impacted and can skip the remaining steps.
Otherwise, the steps below can help you determine whether you are currently seeing the impact. If you are impacted at all, we suggest following the steps in Action Required.
Check Architecture
This bug only affects arm64 systems. The output of uname -a should contain `aarch64` if the system is arm64.
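The architecture check can be scripted. The sketch below uses `uname -m`, which prints only the machine hardware name. The helper name `is_affected_arch` is hypothetical, and accepting `arm64` alongside `aarch64` is an assumption for systems that report the architecture that way.

```shell
# is_affected_arch is a hypothetical helper (not part of any tool): it
# succeeds when the machine string names an arm64 system.
is_affected_arch() {
    case "$1" in
        aarch64|arm64) return 0 ;;   # "arm64" spelling accepted as an assumption
        *) return 1 ;;
    esac
}

# Classify the current host using the machine name from `uname -m`.
if is_affected_arch "$(uname -m)"; then
    echo "arm64 system: continue with the kernel version check"
else
    echo "not arm64: not affected by this bug"
fi
```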
Check Kernel Version
Determine if you are using an impacted kernel version. Any kernel version between 6.2.0 and 6.5.x may be affected.
[foo~]$ uname -r
6.5.12-100.fc37.x86_64
In the output above, the kernel version is 6.5.12-100. (This sample comes from an x86_64 system and only illustrates the output format.)
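As a rough sketch, the version range can be checked by parsing the major and minor numbers out of the `uname -r` string. The helper name `kernel_in_range` is hypothetical. It only tests whether the version falls between 6.2 and 6.5; it does not know whether a particular build within that range already carries the fix.

```shell
# kernel_in_range is a hypothetical helper: it succeeds when the release
# string (as printed by `uname -r`) has a major.minor version from 6.2
# through 6.5, and returns 2 when the string cannot be parsed.
kernel_in_range() {
    release=$1
    major=${release%%.*}        # text before the first dot, e.g. "6"
    rest=${release#*.}          # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder, e.g. "5"
    case "$major" in ''|*[!0-9]*) return 2 ;; esac
    case "$minor" in '') return 2 ;; esac
    [ "$major" -eq 6 ] && [ "$minor" -ge 2 ] && [ "$minor" -le 5 ]
}

# Check the running kernel.
if kernel_in_range "$(uname -r)"; then
    echo "kernel is in the 6.2.0 to 6.5.x range: may be affected"
else
    echo "kernel is outside the 6.2.0 to 6.5.x range"
fi
```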
Check dmesg
If you are on an impacted kernel, you can check dmesg to determine whether you are seeing the bug’s impact.
- Search dmesg for errors such as TCP: out of memory.
- If you see recurring TCP: out of memory or similar TCP memory errors and are on a matching kernel version, you are likely impacted.
dmesg | grep 'out of memory'
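To judge whether the errors are recurring, it can help to count them rather than read them. The sketch below assumes the kernel log text is piped in on stdin; `count_tcp_oom` is a hypothetical helper name.

```shell
# count_tcp_oom is a hypothetical helper: it reads kernel log text on stdin
# and prints how many lines mention "TCP: out of memory".
count_tcp_oom() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'TCP: out of memory' || true
}

# Usage (reading the kernel log may require root on some systems):
#   dmesg | count_tcp_oom
```

A count that keeps growing between runs suggests the condition is recurring rather than a one-off event.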
Examine Metrics
If you are on an impacted kernel, you can check your metrics to determine whether you are seeing the bug’s impact.
If you have Prometheus Node Exporter set up, examine the node_sockstat_TCP_mem_bytes metric over time. If its value increases continuously and you are on a matching kernel version, you are likely impacted.
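For a quick spot check without opening a dashboard, the current value can be read straight from the exporter's metrics output. This is only a sketch: `tcp_mem_bytes_from_metrics` is a hypothetical helper name, and the usage line assumes Node Exporter is listening on its default port 9100 on the local host. A single reading does not show a trend, so repeat it over time or rely on your Prometheus graphs.

```shell
# tcp_mem_bytes_from_metrics is a hypothetical helper: it reads Prometheus
# exposition text on stdin and prints the value of
# node_sockstat_TCP_mem_bytes exactly as the exporter wrote it.
tcp_mem_bytes_from_metrics() {
    awk '$1 == "node_sockstat_TCP_mem_bytes" { print $2 }'
}

# Usage (assumes Node Exporter on its default port 9100 on this host):
#   curl -s http://localhost:9100/metrics | tcp_mem_bytes_from_metrics
```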
Examine sockstat
At this point, there are several possible reasons for running out of TCP memory, and this bug is only one of them. To confirm that the bug is causing the issue, run:
[foo~]$ cat /proc/net/sockstat
sockets: used 2930
TCP: inuse 137 orphan 0 tw 14 alloc 143 mem 2025
UDP: inuse 51 mem 2318
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
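If there is a large number of orphan sockets here, the issue may be something else. Otherwise, look at the TCP mem value. This value is in pages, so with a 4096-byte page size the true memory usage in the example above is 4096 × 2025 = 8,294,400 bytes. arm64 kernels can be built with larger page sizes, so confirm yours with getconf PAGESIZE. Then run: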
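The page-to-byte conversion can be sketched as follows. Both helper names are hypothetical; the page size defaults to whatever `getconf PAGESIZE` reports on the host rather than a hard-coded 4096.

```shell
# tcp_mem_pages is a hypothetical helper: it reads /proc/net/sockstat-style
# text on stdin and prints the "mem" value from the TCP line (in pages).
tcp_mem_pages() {
    awk '$1 == "TCP:" { for (i = 2; i < NF; i += 2) if ($i == "mem") print $(i + 1) }'
}

# pages_to_bytes PAGES [PAGE_SIZE] is a hypothetical helper: it multiplies
# a page count by the page size, defaulting to `getconf PAGESIZE`.
pages_to_bytes() {
    page_size=${2:-$(getconf PAGESIZE)}
    echo $(( $1 * page_size ))
}

# Usage on a live system:
#   pages_to_bytes "$(tcp_mem_pages < /proc/net/sockstat)"
```

With the sample output above and a 4096-byte page, `pages_to_bytes 2025 4096` prints 8294400.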
[foo~]$ ss -atmp | grep -o "[rtfwod][0-9][0-9]*" | grep -o "[0-9]*" | awk '{sum += $0} END {print sum}'
20528
This gives the total memory used by all TCP sockets, in bytes. This value will likely be lower than the first because Linux allows some internal slack in its accounting. However, if it is hundreds of megabytes lower than the first, the bug is likely causing your issues.
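The comparison of the two numbers can be sketched as a small helper. The name `gap_is_suspicious` is hypothetical, and its 100 MiB default threshold is an assumption standing in for "hundreds of megabytes"; choose a value that suits your environment.

```shell
# gap_is_suspicious KERNEL_BYTES SOCKET_BYTES [THRESHOLD_BYTES]
# Hypothetical helper: prints the gap between the kernel's accounted TCP
# memory and the per-socket sum, and succeeds when the gap exceeds the
# threshold. The 100 MiB (104857600 bytes) default is an assumption.
gap_is_suspicious() {
    threshold=${3:-104857600}
    gap=$(( $1 - $2 ))
    echo "gap: $gap bytes"
    [ "$gap" -gt "$threshold" ]
}

# Usage: pass the sockstat value converted to bytes, then the ss sum.
#   gap_is_suspicious 8294400 20528
```

With the sample values in this bulletin (8,294,400 bytes from sockstat at a 4096-byte page size, 20,528 bytes from ss), the gap is 8,273,872 bytes, which is well under the threshold and would not point to the bug.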
Impact:
When TCP memory is limited on a node, you may see leadership transfers away from that broker, under-replicated partitions, or client disconnects. This can occur briefly or over a longer period.
Action Required:
In the immediate term, if you believe you are impacted by this kernel bug, you can restart the underlying node to reclaim TCP memory.
Ubuntu based Operating Systems
For a long-term resolution, even if you are not seeing immediate impact, upgrade your OS to at least the kernel version listed for your release:
- Ubuntu 23.04 (Lunar Lobster): Kernel Version: 6.2.0-30.1
- Ubuntu 22.04 LTS (Jammy Jellyfish): Kernel Version: 6.2.0-33.1
- Ubuntu 22.04 LTS (Jammy Jellyfish) Hardware Enablement (HWE) kernel: Kernel Version: 6.5.0-1007
Amazon Linux Operating Systems
For a long-term resolution, even if you are not seeing immediate impact, upgrade to at least Amazon Linux 2023 version 2023.4.20240513.
Other Operating Systems
For other Linux operating systems, if you believe you are impacted, check whether your distribution's kernel is affected by this bug and whether it is resolved in a later version.
Questions? If you have any questions on this TSB, or need further guidance, please contact support@redpanda.com