Run resumption is not supported inside a W&B sweep. If you pass a run_id or use wandb.init(resume=...) while a sweep agent is running, W&B ignores the run ID and starts a fresh run instead. W&B also emits a warning when this happens.
Possible workarounds:

- Checkpoint and reload within a single run: Save model checkpoints at regular intervals inside your training function. On restart, load the latest checkpoint at the beginning of train(). The sweep starts a new run, but training picks up from the saved state.
- Use --count 1 on SLURM with requeue: Submit each sweep agent job with wandb agent --count 1 SWEEP_ID. If the job is preempted, SLURM can requeue it and the sweep controller will assign a new configuration.
- Mark a run as failed and requeue manually: If a run crashes mid-way, the sweep controller will eventually mark it as failed and may assign the same configuration to a new agent depending on your sweep settings.
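The checkpoint-and-reload workaround can be sketched roughly as follows. This is a minimal illustration, not W&B's own mechanism: the helper names (checkpoint_path, save_checkpoint, load_checkpoint), the JSON checkpoint format, and the idea of keying the checkpoint file on a hash of the hyperparameters are all assumptions made for this example. The "training step" is a stand-in, and stop_after exists only to simulate a preempted job.

```python
import hashlib
import json
import os
import tempfile


def checkpoint_path(config, root):
    """Derive a stable file name from the hyperparameters (hypothetical scheme)."""
    encoded = json.dumps(config, sort_keys=True).encode()
    key = hashlib.sha256(encoded).hexdigest()[:12]
    return os.path.join(root, f"ckpt-{key}.json")


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"epoch": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(config, root, epochs=5, stop_after=None, log=None):
    """Train from the latest checkpoint; `log` is an optional metrics callback."""
    os.makedirs(root, exist_ok=True)
    path = checkpoint_path(config, root)
    state = load_checkpoint(path)  # picks up where a previous run stopped
    for epoch in range(state["epoch"], epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulate the job being preempted
        state["weight"] += config["lr"]  # stand-in for a real training step
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
        if log is not None:
            log({"epoch": state["epoch"], "weight": state["weight"]})
    return state


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    print(train({"lr": 0.5}, root, stop_after=2))  # interrupted run
    print(train({"lr": 0.5}, root))                # new run continues from the checkpoint
```

Inside a sweep, the training function would call wandb.init() first, take the hyperparameters from wandb.config, and could pass wandb.log as the log callback. The new run gets a new run ID, so the checkpoint has to be located by something other than the run ID; hashing the configuration is one option, and it only helps when the same configuration is assigned again.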
If you need true run resumption, use wandb.init(resume="allow", id="YOUR_RUN_ID") in a standalone script instead.
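For a standalone script, one way to keep the same run ID across restarts is to store it in a small file next to the script. The sketch below assumes that approach; load_or_create_run_id, the run_id.txt file name, and the my-project project name are hypothetical and chosen only for illustration. The wandb import sits inside main() so the helper can be exercised without W&B installed.

```python
import os
import uuid


def load_or_create_run_id(path):
    """Return the run ID stored at `path`, creating and saving one on first use."""
    if os.path.exists(path):
        with open(path) as f:
            return f.read().strip()
    run_id = uuid.uuid4().hex[:8]  # short unique string used as the run ID
    with open(path, "w") as f:
        f.write(run_id)
    return run_id


def main(id_file="run_id.txt"):
    """Start or resume a run outside of any sweep agent."""
    import wandb  # imported here so the helper above has no W&B dependency

    run_id = load_or_create_run_id(id_file)
    # resume="allow" continues the run if this ID already exists,
    # and starts a new run with this ID otherwise.
    run = wandb.init(project="my-project", id=run_id, resume="allow")
    run.log({"restarts": 1})
    run.finish()
```

Calling main() from a script launched directly (not through wandb agent) reuses the stored ID on every restart; deleting the ID file starts a fresh run.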