When running a W&B sweep with distributed training on SLURM (for example, multi-GPU jobs with --gpus-per-node), only one process per SLURM job should call wandb.agent(). All other processes on the same node should join the run directly.
Use the SLURM_PROCID environment variable to restrict wandb.agent() to rank 0. This ensures that:
- Each SLURM job registers exactly one run with the sweep controller.
- Other ranks on the same node participate in the distributed run without creating duplicate sweep entries.
- The sweep controller correctly tracks progress and schedules new hyperparameter configurations.
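
The rank check described above could look roughly like the following minimal sketch. The helper names is_sweep_leader and launch, and the train_fn and worker_fn callables, are hypothetical and used only for illustration. The count=1 argument is also an assumption, chosen so that each SLURM job picks up a single sweep run. How non-zero ranks join the distributed run depends on your training framework and is left to worker_fn.

```python
import os


def is_sweep_leader(env=None):
    """Return True if this process is SLURM rank 0 (hypothetical helper)."""
    env = os.environ if env is None else env
    # SLURM_PROCID holds the rank of the task within the job step.
    # Assumption: default to "0" so the script also runs outside SLURM.
    return int(env.get("SLURM_PROCID", "0")) == 0


def launch(sweep_id, train_fn, worker_fn, env=None):
    """Call wandb.agent() on rank 0 only; other ranks run worker_fn."""
    if is_sweep_leader(env):
        # Imported here so that non-zero ranks never touch the agent API.
        import wandb

        # Assumption: count=1 so this job executes a single sweep run.
        wandb.agent(sweep_id, function=train_fn, count=1)
    else:
        # Non-zero ranks skip the agent and join the distributed run.
        worker_fn()
```

In a training entry point, this would be invoked as launch(SWEEP_ID, train_fn, worker_fn), where train_fn is the function that calls wandb.init() and trains on rank 0.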
If you use submitit or a similar launcher, apply the same check in your training entry point before calling wandb.agent().
For single-GPU or non-distributed jobs, use wandb agent --count 1 SWEEP_ID as described in How should I run sweeps on SLURM?.