W&B marks a run as Crashed when it stops receiving heartbeats from the process that called `wandb.init()`, without the process having called `wandb.finish()`. This happens when the training process is killed, exits unexpectedly, or loses connectivity before it can report a clean finish.
Common causes
- Out-of-memory (OOM) error: The process is killed by the OS or GPU driver when it exceeds available memory. Check `output.log` for `CUDA out of memory` or `Killed` messages.
- Uncaught exception: An unhandled Python exception causes the process to exit without calling `wandb.finish()`. The exception appears in `output.log`.
- Job scheduler preemption: On SLURM or other cluster schedulers, jobs can be preempted and killed without warning. The run never gets a chance to finish cleanly.
- Network loss: In rare cases, a long network outage causes the W&B backend to time out waiting for heartbeats and mark the run as crashed, even though the process is still running. The run will resume uploading when connectivity is restored.
- Process killed manually: Sending `SIGKILL` (for example, with `kill -9`) bypasses Python’s signal handlers, so `wandb.finish()` is never called.
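
The difference between catchable and uncatchable signals can be sketched as follows. Schedulers often send `SIGTERM` before resorting to `SIGKILL`; assuming that is the case on your cluster, the handler below turns `SIGTERM` into a Python exception so that cleanup code such as `wandb.finish(exit_code=1)` still gets a chance to run. The names `Terminated`, `raise_terminated`, and `install_handlers` are hypothetical and not part of the W&B SDK.

```python
import signal


class Terminated(Exception):
    """Raised in the main thread when a catchable termination signal arrives."""


def raise_terminated(signum, frame):
    # Python runs signal handlers in the main thread, so this exception
    # surfaces in the training code like any other exception.
    raise Terminated(f"received signal {int(signum)}")


def install_handlers(signums=(signal.SIGTERM,)):
    """Register raise_terminated for each signal; call from the main thread."""
    # SIGKILL cannot be handled: the OS terminates the process immediately,
    # which is why a run killed that way never reports a clean finish.
    for signum in signums:
        signal.signal(signum, raise_terminated)
```

After `install_handlers()` has run, a `try`/`except Terminated` block around the training loop can call `wandb.finish(exit_code=1)` before the process exits.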
To find out why a run crashed:
- Open the run page and click the Files tab.
- Download `output.log` for stdout/stderr. This usually contains the error that caused the crash.
- Download `debug.log` and `debug-internal.log` for W&B-level diagnostics (connectivity issues, upload errors).
- If the run was on a cluster, also check the scheduler’s job log for preemption or OOM signals.
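
As a rough illustration of inspecting `output.log`, the sketch below scans a downloaded log for the OOM messages mentioned earlier and for Python traceback headers. The function names `find_crash_hints` and `scan_log_file` and the marker list are assumptions made for this example; they are not part of the W&B SDK.

```python
from pathlib import Path

# Substrings that often indicate why a process died. The first two are the
# OOM messages named in this article; the third is the standard header that
# Python prints before an uncaught exception's traceback.
CRASH_MARKERS = (
    "CUDA out of memory",
    "Killed",
    "Traceback (most recent call last)",
)


def find_crash_hints(log_text, markers=CRASH_MARKERS):
    """Return (line_number, line) pairs for log lines containing any marker."""
    hits = []
    for line_number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in markers):
            hits.append((line_number, line.strip()))
    return hits


def scan_log_file(path):
    """Read a downloaded log file and return its crash hints."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return find_crash_hints(text)
```

Calling `scan_log_file("output.log")` returns the line numbers worth reading first. A match on `Killed` alone is only a hint, because the word can also appear in ordinary program output.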
Call `wandb.finish(exit_code=1)` explicitly on error to ensure the run is marked as Failed (rather than Crashed) and all buffered data is flushed.
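
A minimal sketch of this pattern follows. The helper name `run_and_report`, the `train` placeholder, and the project name are hypothetical. The `finish` callable is passed in as a parameter so the helper can be exercised without a W&B account; in a real script it would be `wandb.finish`.

```python
import traceback


def run_and_report(train_fn, finish):
    """Run train_fn and always report an explicit exit code through finish."""
    try:
        train_fn()
    except Exception:
        # Print the traceback while the run is still active, then report
        # the failure and let the exception propagate.
        traceback.print_exc()
        finish(exit_code=1)
        raise
    finish(exit_code=0)


def train():
    """Placeholder for a real training loop (hypothetical)."""


def main():
    # Imported here so the helper above has no hard dependency on wandb.
    import wandb

    wandb.init(project="crash-demo")  # hypothetical project name
    run_and_report(train, wandb.finish)
```

Calling `main()` from a training script starts a run, and any exception raised by `train` leads to `wandb.finish(exit_code=1)` before the exception propagates. This does not help against `SIGKILL` or OOM kills, where no Python code gets to run.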