How can I use W&B with multiprocessing, for example distributed training?

This article explains how to use W&B in training programs that span multiple processes, such as distributed training jobs, so that runs are logged correctly without conflicts. If a training program uses multiple processes, structure it so that no process calls W&B methods unless that process has itself called wandb.init(). Choose one of the following approaches to manage multiprocess training:

Call wandb.init() in all processes and use the group keyword argument to create a shared group. Each process has its own W&B run, and the UI groups the training processes together.
Call wandb.init() from only one process and pass data to log through multiprocessing queues.
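As a rough illustration of the first approach, the sketch below builds the arguments that each process could pass to wandb.init() so that every process gets its own run inside one shared group. The helper names, the project name, the "rank-N" run naming scheme, and reading the rank from a RANK environment variable (as launchers such as torchrun set it) are assumptions made for this example, not part of the original answer.

```python
import os


def grouped_init_kwargs(rank: int, group: str, project: str = "my-project") -> dict:
    """Build wandb.init() keyword arguments for one process of a grouped job."""
    return {
        "project": project,      # hypothetical project name
        "group": group,          # same value in every process, so the UI groups the runs
        "name": f"rank-{rank}",  # hypothetical naming scheme to tell the runs apart
    }


def start_run_for_this_process(group: str):
    """Start this process's own run; call once in every process."""
    import wandb  # imported lazily so the helper above works without wandb installed

    # Assumption: the launcher exports RANK; fall back to 0 for single-process use.
    rank = int(os.environ.get("RANK", "0"))
    return wandb.init(**grouped_init_kwargs(rank, group))
```

Every process would call start_run_for_this_process() with the same group string, then log through the run object it gets back.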

Refer to Log distributed training experiments for detailed explanations of these approaches, including code examples with Torch DDP.
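For the second approach, where only one process calls wandb.init(), a minimal sketch might look like the following. Worker processes never touch W&B; they put metric dictionaries on a multiprocessing queue, and the parent process drains the queue and logs each item. The function names, the None sentinel, the placeholder loss value, and the project name are all hypothetical choices for this example.

```python
import multiprocessing as mp

_DONE = None  # sentinel a worker puts on the queue when it has finished


def worker(rank: int, steps: int, queue) -> None:
    """Stand-in training loop: never calls W&B, only sends metrics to the queue."""
    for step in range(steps):
        loss = 1.0 / (step + 1 + rank)  # placeholder metric, not a real loss
        queue.put({"rank": rank, "step": step, "loss": loss})
    queue.put(_DONE)


def drain_and_log(queue, num_workers: int, log_fn) -> int:
    """Log queued items until every worker has sent its sentinel.

    Returns the number of items passed to log_fn.
    """
    finished = 0
    logged = 0
    while finished < num_workers:
        item = queue.get()
        if item is _DONE:
            finished += 1
            continue
        log_fn(item)
        logged += 1
    return logged


def main(num_workers: int = 2, steps: int = 3) -> None:
    import wandb  # only the parent process imports and uses W&B

    run = wandb.init(project="my-project")  # hypothetical project name
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(r, steps, queue))
             for r in range(num_workers)]
    for p in procs:
        p.start()
    drain_and_log(queue, num_workers, run.log)  # drain before join to avoid blocking
    for p in procs:
        p.join()
    run.finish()
```

To try it, call main() under an `if __name__ == "__main__":` guard, which the spawn start method requires. Passing the logging function in as log_fn keeps the draining logic independent of W&B, so it can be exercised with a plain list's append method.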
