Partition balancing moves partition replicas to alleviate disk pressure and maintains the configured replication factor across brokers.
Sometimes this process can become "stalled" for a number of reasons. This article focuses specifically on clusters that are not using tiered storage.
If you are using tiered storage, take a look at the links at the bottom of this article for additional troubleshooting.
As of 23.3.2, space_management_enable defaults to true. This feature works well with tiered storage, but on clusters that do not use tiered storage it can cause Redpanda to calculate total disk space incorrectly, which is why we want to make sure it is disabled.
1. Check existing settings
Let's first check two configurations, partition_autobalancing_max_disk_usage_percent and space_management_enable. The first tells us the maximum disk usage (as a percentage) that Redpanda will use, and the second tells us whether space management is enabled.
rpk cluster config get partition_autobalancing_max_disk_usage_percent
rpk cluster config get space_management_enable
If space_management_enable returns "false", then the rest of the steps do not apply to you, and we recommend taking a look at the additional troubleshooting section below.
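The decision above can be sketched as a small shell helper. This is only an illustration: the function name space_mgmt_advice is hypothetical, and it assumes that rpk prints the bare value (true or false) for this property.

```shell
#!/bin/sh
# Hypothetical helper: interpret the value of space_management_enable.
# $1 = value printed by `rpk cluster config get space_management_enable`
#      (assumed to be the bare string "true" or "false").
space_mgmt_advice() {
    case "$1" in
        true)  echo "enabled: continue with the steps below" ;;
        false) echo "disabled: see additional troubleshooting" ;;
        *)     echo "unexpected value: $1" ;;
    esac
}

# Example usage on a host with rpk configured (left commented out so the
# sketch can be read and run without a cluster):
# space_mgmt_advice "$(rpk cluster config get space_management_enable)"
```

Keeping the rpk call outside the function makes the logic easy to check by hand before pointing it at a real cluster.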
2. Check existing disk space
By default, Redpanda will only use up to a certain amount of the available disk space, based on the partition_autobalancing_max_disk_usage_percent configuration we checked in the previous step.
With this knowledge, we can take a look at our system disk utilization.
df -h
Look for the filesystem that holds your Redpanda data, typically the /var/lib/redpanda/data directory. If it shows less than 80% utilization (the default value of partition_autobalancing_max_disk_usage_percent) but rebalancing is stalled, this could indicate that space_management_enable is causing Redpanda to believe you have reached that limit, even though you have not.
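The comparison described above can be sketched in shell. The function names disk_used_pct and check_headroom are hypothetical, the data path assumes a default install, and the parsing relies on the POSIX output format of df -P, where the fifth column is the capacity percentage.

```shell
#!/bin/sh
# Hypothetical helper: print the used-space percentage (integer, no % sign)
# of the filesystem that holds the path given as $1.
disk_used_pct() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: compare usage against the balancer limit.
# $1 = used percent
# $2 = value of partition_autobalancing_max_disk_usage_percent
check_headroom() {
    if [ "$1" -lt "$2" ]; then
        echo "below limit"
    else
        echo "at or above limit"
    fi
}

# Example usage on a broker (commented out; path and rpk access are assumptions):
# used=$(disk_used_pct /var/lib/redpanda/data)
# limit=$(rpk cluster config get partition_autobalancing_max_disk_usage_percent)
# check_headroom "$used" "$limit"
```

If this reports "below limit" on every broker while the balancer is still stalled, that matches the situation this article describes.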
3. Change settings
To disable space management, run the command below. This change does NOT require a restart.
rpk cluster config set space_management_enable=false
Once space management is disabled, check the balancer status. Rebalancing should resume.
rpk cluster partitions balancer-status
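If you want to check the status from a script, a minimal sketch follows. The function name balancer_state is hypothetical, and it assumes the command output contains a line of the form "Status: <value>"; verify this against the output of your rpk version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: read balancer-status text on stdin and print only
# the status value. Assumes a line that looks like "Status:   <value>".
balancer_state() {
    awk -F': *' '$1 == "Status" { print $2; exit }'
}

# Example usage on a host with rpk configured (commented out so the sketch
# can be run without a cluster):
# state=$(rpk cluster partitions balancer-status | balancer_state)
# if [ "$state" = "stalled" ]; then
#     echo "balancer is still stalled"
# fi
```

The balancer may take some time to pick up the change, so check the status again after a short wait before concluding that the change had no effect.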
That did not work!
If this did not get your balancing to resume, then something else is most likely causing the problem. Take a look at the documents linked below. If you are still running into trouble, feel free to reach out to our support team for further assistance.