This article helps you resolve training hangs when you use W&B with distributed training frameworks, so your runs can start and finish without stalling. Training usually stalls for one of two reasons:
  • Stalls at the beginning of training: W&B’s multiprocessing can interfere with the multiprocessing from distributed training frameworks.
  • Stalls at the end of training: The W&B process doesn’t detect when to exit.

Fix hangs at the start

If your run stalls as training begins, the cause is usually a conflict between W&B’s multiprocessing and the distributed training framework’s multiprocessing. To resolve this, enable W&B Service, which is the default for W&B SDK 0.13.0 and later. If you’re on an older version, upgrade your SDK:
pip install --upgrade wandb
For W&B SDK 0.12.5 through 0.12.x, enable W&B Service explicitly:
import wandb

def main():
    wandb.require("service")  # enable W&B Service (SDK 0.12.5 through 0.12.x only)
    # rest of your script
For W&B SDK 0.12.4 and earlier, set the WANDB_START_METHOD environment variable:
export WANDB_START_METHOD=thread
After you enable W&B Service (or set the start method on older SDKs), your distributed training run starts without stalling.
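The version thresholds above can be summarized as a small lookup. The following is a minimal sketch, not part of the W&B SDK: the helper names `parse_version` and `recommend_fix` are hypothetical, and the parsing is deliberately simple (it reads only the first three numeric components, so a pre-release such as `0.13.0rc1` is treated as `0.13.0`).

```python
import re


def parse_version(version: str) -> tuple:
    """Return the first three numeric components of a version string as ints."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    # Pad short versions such as "0.13" to three components.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def recommend_fix(version: str) -> str:
    """Map an SDK version to the start-of-training fix described above.

    Hypothetical helper for illustration; the thresholds come from this article.
    """
    parsed = parse_version(version)
    if parsed >= (0, 13, 0):
        return "none needed: W&B Service is enabled by default"
    if parsed >= (0, 12, 5):
        return 'call wandb.require("service") at the start of main()'
    return "set WANDB_START_METHOD=thread"


# Example: in a real script you could pass wandb.__version__ instead of a literal.
for sdk_version in ("0.12.4", "0.12.9", "0.13.0"):
    print(sdk_version, "->", recommend_fix(sdk_version))
```

Upgrading the SDK remains the simplest option. A lookup like this is mainly useful when you can't change the installed version right away.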

Fix hangs at the end

If your run stalls after training completes, W&B doesn’t detect that the run is finished. Call wandb.finish() at the end of your training script to signal to W&B that the run is complete:
wandb.finish()
This call uploads all data and exits the W&B process cleanly. For more information, see Distributed training.
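If training raises an exception, a wandb.finish() call placed on the last line of the script is never reached. One way to guard against that is to wrap the training step in try/finally. The following is a minimal sketch under that assumption: `train_and_finish` and `train_fn` are hypothetical names, and `run` is expected to be the object returned by wandb.init(), which provides a finish() method.

```python
def train_and_finish(run, train_fn):
    """Run train_fn(run) and always call run.finish() afterwards.

    Hypothetical helper for illustration: `run` is assumed to be the object
    returned by wandb.init(), and `train_fn` is your own training function.
    """
    try:
        return train_fn(run)
    finally:
        # Runs whether training succeeded or raised, so the run is always closed.
        run.finish()


# Usage in a real script (requires the wandb package and a configured account):
#
#   import wandb
#
#   def train(run):
#       run.log({"loss": 0.1})
#
#   train_and_finish(wandb.init(), train)
```

In a distributed job, each process that calls wandb.init() would need the same treatment.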