How to troubleshoot pods stuck in Crash loop detected state – Redpanda

Background

This article is an addendum to the documentation:

https://docs.redpanda.com/current/troubleshoot/errors-solutions/k-resolve-errors/#crash-loop-backoffs

When running Redpanda on Kubernetes, if a broker exits due to a crash or an abort on startup,
the pod is restarted. The number of restarts that Redpanda tolerates is set by crash_loop_limit
(default 5).

After crash_loop_limit is exceeded, further restart attempts abort on server startup with the message:

ERROR 2024-12-11 09:12:47,179 [shard 0:main] main - application.cc:1050 - Crash loop detected. 
Too many consecutive crashes 6, exceeded crash_loop_limit configured value 5. 
To recover Redpanda from this state, manually remove file at path "/var/lib/redpanda/data/startup_log".
 Crash loop automatically resets 1h after last crash or with node configuration changes.
INFO 2024-12-11 09:12:47,179 [shard 0:main] main - application.cc:461 - Shutdown complete.
ERROR 2024-12-11 09:12:47,179 [shard 0:main] main - application.cc:487 - Failure during startup: std::runtime_error (Crash loop detected, aborting startup.)

As detailed in the documentation linked above, the crash_loop_limit check is a safeguard against frequent restarts, which create many small segments and can in turn slow Redpanda startup.

The key to troubleshooting is to understand what caused the pod/broker to enter the CrashLoopBackOff state in the first place, and then take the appropriate action.

This means reviewing the Redpanda logs for the broker that crashed or aborted on startup, from before the "Crash loop detected." state was reached.
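If the broker log has already been saved to a file, the lines leading up to the first "Crash loop detected." entry can be pulled out with something like the sketch below. The helper name logs_before_crash_loop and the default of 200 context lines are assumptions for illustration, not part of any Redpanda tooling.

```shell
# Print up to N lines that precede the first "Crash loop detected." entry
# in a saved broker log file (hypothetical helper, N defaults to 200).
logs_before_crash_loop() {
  log_file="$1"
  context_lines="${2:-200}"
  # Line number of the first occurrence; empty if the marker is absent.
  first_hit=$(grep -n -F 'Crash loop detected.' "$log_file" | head -n 1 | cut -d: -f1)
  if [ -z "$first_hit" ]; then
    echo "marker not found in $log_file" >&2
    return 1
  fi
  # Keep everything before the marker, then trim to the last N lines.
  head -n "$((first_hit - 1))" "$log_file" | tail -n "$context_lines"
}

# Example (file name is a placeholder):
# logs_before_crash_loop redpanda-0.log 500
```

The output is the place to look for the original crash, abort, or shutdown reason.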

Steps

- If Redpanda logs are being sent to a third-party log platform or SIEM:
Check the Redpanda logs prior to the first occurrence of the string "Crash loop detected."
Gather the logs for the 10 minutes before that entry; they should show the cause of the crash, abort, or shutdown.

- If you do not have a third-party log platform or SIEM, the process is a bit more involved. The steps are detailed below.

1. Identify the node running the pod in the CrashLoopBackOff state
kubectl get pods -o wide |grep <restarting-pod-name>
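As an alternative to reading the NODE column by eye, the node name can be extracted directly with a JSONPath query. This is a minimal sketch: pod_node is a hypothetical helper, and the redpanda namespace is assumed as the default.

```shell
# Print the name of the node a pod is scheduled on.
# The namespace defaults to "redpanda" (an assumption; pass another as $2).
pod_node() {
  pod="$1"
  namespace="${2:-redpanda}"
  kubectl get pod "$pod" -n "$namespace" -o jsonpath='{.spec.nodeName}'
}

# Example (pod name is a placeholder):
# node=$(pod_node redpanda-0)
```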

2. Identify the PV name for the pod in the CrashLoopBackOff state
kubectl get pv |grep <restarting-pod-name>
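If the grep above returns nothing (for example, because the claim name does not contain the pod name), the PV can also be resolved through the pod's PersistentVolumeClaims. The sketch below assumes the redpanda namespace; pod_pv_names is a hypothetical helper.

```shell
# Print the PV name behind each PersistentVolumeClaim used by a pod.
# The namespace defaults to "redpanda" (an assumption; pass another as $2).
pod_pv_names() {
  pod="$1"
  namespace="${2:-redpanda}"
  # PVC names referenced by the pod, space-separated.
  claims=$(kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}')
  for claim in $claims; do
    # A bound PVC records its PV in .spec.volumeName.
    kubectl get pvc "$claim" -n "$namespace" -o jsonpath='{.spec.volumeName}'
    echo
  done
}

# Example (pod name is a placeholder):
# pod_pv_names redpanda-0
```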

3. Connect to the node that the CrashLoopBackOff pod runs on (identified in step 1), so that /var/lib/redpanda/data/startup_log can be deleted to reset the crash counter.

How you connect to the node is implementation-dependent.

For example, on AWS there is AWS Session Manager. There are other options, such as kubectl-node-shell.

If you cannot access the node directly, one option is to use kubectl debug to attach to the node identified in step 1:

kubectl debug node/$node --image=busybox:latest -it -- chroot /host /bin/bash

4. Once connected or attached to the node, identify the mount path for /var/lib/redpanda/data

--Node Access via 'kubectl debug'

mount |grep <pv-name-of-restarting-pod-from-step-2>

The output will be similar to:

<device> on /var/lib/kubelet/pods/<pod-id>/volumes/<driver>/<pv-name-of-restarting-pod-from-step-2>/mount .....

The part after "on" is the mount path.
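To avoid copying the path by hand, the mount output can be filtered with awk. This is a sketch under a few assumptions: pv_mount_path is a hypothetical helper, the mount path contains no spaces, and the per-pod mount sits under a directory named pods, as in the sample output.

```shell
# Print the per-pod mount path for a given PV name, taken from `mount` output.
# Each line has the form "<device> on <path> ...", so field 3 is the path.
pv_mount_path() {
  pv_name="$1"
  mount | awk -v pv="$pv_name" '
    index($0, pv) && $2 == "on" && index($3, "/pods/") { print $3; exit }'
}

# Example (PV name is a placeholder):
# mount_path=$(pv_mount_path <pv-name-of-restarting-pod-from-step-2>)
```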

5. Delete the <mount_path>/var/lib/redpanda/data/startup_log file to reset the crash counter.
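The exact location of startup_log below the mount path can depend on how the volume is mounted, so a cautious approach is to search for the file and delete it only when exactly one match is found. The helper remove_startup_log and the search depth of 5 are assumptions for illustration.

```shell
# Locate startup_log below a mount path and remove it, but only if exactly
# one candidate is found (hypothetical helper; depth limit is an assumption).
remove_startup_log() {
  mount_path="$1"
  matches=$(find "$mount_path" -maxdepth 5 -type f -name startup_log)
  # Count non-empty result lines.
  count=$(printf '%s\n' "$matches" | grep -c .)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one startup_log under $mount_path, found $count" >&2
    return 1
  fi
  echo "removing $matches"
  rm -f -- "$matches"
}

# Example (run on the node, with the mount path from step 4):
# remove_startup_log "<mount_path>"
```

Refusing to act on zero or multiple matches keeps the sketch from deleting an unrelated file with the same name.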

6. At this point the pod/broker should restart and then fail with a reason detailed in the logs.

kubectl -n redpanda logs <restarting_pod>
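If the pod has already restarted again by the time you look, the logs of the previous container instance are often the more useful ones. A minimal sketch using the --previous and --tail options of kubectl logs follows; previous_logs is a hypothetical helper, and the 500-line limit is an arbitrary choice.

```shell
# Print the tail of the logs from the previously terminated container
# instance of a pod. The namespace defaults to "redpanda".
previous_logs() {
  pod="$1"
  namespace="${2:-redpanda}"
  # --previous selects the last terminated instance of the container;
  # --tail limits output to the most recent lines.
  kubectl logs "$pod" -n "$namespace" --previous --tail=500
}

# Example (pod name is a placeholder):
# previous_logs <restarting_pod>
```

For pods with more than one container, a -c <container> argument may also be needed.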

--

If you are unable to determine the cause of Redpanda failing, please reach out to
Redpanda Support.