Update:
Previously in TSB 29 we advised Self-Hosted customers to upgrade to mitigate a potential overflow bug. We are releasing an updated patch and recommend that Self-Hosted customers do not upgrade to versions 24.1.20, 24.2.23, 24.3.12, or 25.1.3, and instead upgrade to the versions listed below in the Action Required section.
Summary:
The use of a 32-bit unsigned integer in the Redpanda broker code that reads Raft logs can result in an overflow when working with offsets whose distance from their segment's base offset is greater than 4,294,967,295.
Raft logs are read in several scenarios: when a client performs a fetch, and internally when Redpanda compacts or recovers segments. In each of these scenarios the faulty reader may fail to return batches or return the wrong ones, leading to incorrect data being returned to clients (fetch) or to batches not being copied to the new log during compaction or recovery.
In particular, this bug can cause compaction to discard arbitrary batches in a segment, leading directly to loss of data (for batches carrying user data) or loss of Redpanda internal state (for batches carrying system state of some kind).
This issue has been resolved by correcting the offset arithmetic. The fix is available in Redpanda versions 25.1.4, 24.3.13, 24.2.24, and 24.1.21.
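The failure mode can be illustrated with a minimal Python sketch. This is not Redpanda's actual code; it only emulates what happens when an offset delta is truncated to a 32-bit unsigned value, as the Summary describes.

```python
# Hypothetical sketch: offsets within a segment are addressed as deltas from
# the segment's base offset. If that delta is truncated to a 32-bit unsigned
# integer, any offset more than 2**32 - 1 past the base wraps around and
# resolves to the wrong position in the log.

UINT32_MAX = 2**32 - 1  # 4,294,967,295

def delta_u32(offset: int, base_offset: int) -> int:
    """Delta as a 32-bit unsigned value, emulating the buggy truncation."""
    return (offset - base_offset) & 0xFFFFFFFF

base = 100
ok = base + 1_000                # distance well within range: delta is exact
bad = base + UINT32_MAX + 5      # distance exceeds 2**32 - 1: delta wraps

assert delta_u32(ok, base) == 1_000
assert delta_u32(bad, base) == 4  # wrapped: the reader lands 4 past the base
```

A reader using the wrapped delta would return batches from near the start of the segment instead of the requested position, which is consistent with the wrong-batch and data-loss behavior described above.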
Severity:
High
Redpanda Products Affected:
- Redpanda Self-Managed - Enterprise
- Redpanda Self-Managed - Community
- Redpanda Cloud - BYOC
- Redpanda Cloud - Dedicated
- Redpanda Cloud - Serverless
Release Affected:
- Redpanda versions 24.1.20 and older
- Redpanda versions 24.2.1-24.2.23
- Redpanda versions 24.3.1-24.3.12
- Redpanda versions 25.1.1-25.1.3
Identification:
Customers can execute “rpk topic describe” to determine whether a given topic is compacted; executing with the ‘-p’ option also shows the high watermark for each partition. A high watermark beyond the 4,294,967,295 limit does not necessarily imply a problem, but it does mean one was possible.
If you already know which of your topics are compacted you can use the public metric ‘redpanda_kafka_max_offset’ to check if the high watermark is beyond 4,294,967,295.
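As a sketch of that metric check, the snippet below scans Prometheus-format metrics text for redpanda_kafka_max_offset samples above the 32-bit limit. The label names in the sample are illustrative; match them against what your deployment actually exposes.

```python
# Flag redpanda_kafka_max_offset samples whose value (the partition high
# watermark) exceeds the 32-bit unsigned limit. Input is Prometheus text
# exposition format; label names below are illustrative assumptions.

THRESHOLD = 2**32 - 1  # 4,294,967,295

def partitions_over_limit(metrics_text: str) -> list[str]:
    """Return the labeled series names whose value exceeds THRESHOLD."""
    flagged = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith("redpanda_kafka_max_offset{"):
            continue
        labels, _, value = line.rpartition(" ")
        if float(value) > THRESHOLD:
            flagged.append(labels)
    return flagged

sample = """\
redpanda_kafka_max_offset{redpanda_topic="orders",redpanda_partition="0"} 4294967500
redpanda_kafka_max_offset{redpanda_topic="orders",redpanda_partition="1"} 12345
"""
assert partitions_over_limit(sample) == [
    'redpanda_kafka_max_offset{redpanda_topic="orders",redpanda_partition="0"}'
]
```

In practice you would feed this the text scraped from your brokers' public metrics endpoint rather than a hardcoded sample.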
You should also look specifically at your __consumer_offsets topic. A high watermark beyond the limit in that topic can explain why one or more of your consumers incorrectly reset to an earlier offset after appearing to have consumed much more of a given topic.
If you encounter a topic with a high watermark over 4,294,967,295, check the Redpanda logs for the following log line to verify whether you are seeing this issue. Note: versions 24.1 and older do not emit this log entry, so on those releases you can only rely on the high watermark to determine whether you might be affected.
Offset translator state inconsistency detected
If you see the above log line together with a topic whose high watermark is over 4,294,967,295, you are likely impacted and should upgrade.
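The log check above can be sketched as a small scan over broker log lines for the indicator message. The file path and log format are up to your deployment; this only shows the string match.

```python
# Scan broker log lines for the indicator message from this TSB.
# How you obtain the lines (file, journald, log aggregator) is up to you.

INDICATOR = "Offset translator state inconsistency detected"

def find_indicator_lines(log_lines):
    """Yield (line_number, line) pairs containing the indicator message."""
    for n, line in enumerate(log_lines, start=1):
        if INDICATOR in line:
            yield n, line.rstrip()

# Illustrative sample; real log lines will carry your deployment's format.
sample = [
    "INFO  raft - recovery complete\n",
    "WARN  storage - Offset translator state inconsistency detected\n",
]
hits = list(find_indicator_lines(sample))
assert hits == [(2, "WARN  storage - Offset translator state inconsistency detected")]
```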
Impact:
Data loss is possible, as is other unexplained behavior.
Action required:
For Enterprise and Community users:
Self-hosted Enterprise and Community users should upgrade to one of the following fixed versions:
- Redpanda broker version 25.1.4 (Released)
- Redpanda broker version 24.3.13 (Released)
- Redpanda broker version 24.2.24 (Released)
- Redpanda broker version 24.1.21 (Released)
Enterprise customers or Community users who are impacted and are running releases prior to 24.1 should contact Redpanda Support.
Please check the Redpanda Releases page on Github to see availability for each version.
For Cloud users on BYOC, Dedicated, or Serverless:
Cloud clusters will be upgraded during their normal maintenance window, unless you are contacted by Redpanda.
Questions? If you have any questions about this TSB, or need further guidance, please contact Redpanda Support.