Why am I getting rate limit errors (429) with Serverless Inference?

This page explains why Serverless Inference returns 429 rate limit errors and how to resolve them so your requests succeed within the allowed concurrency limits. Rate limit errors (429) occur when you exceed concurrency limits.

Error: “Concurrency limit reached for requests”

Solution: To resolve the error, do one of the following:

Reduce the number of parallel requests.
Add delays between requests.
Implement exponential backoff.

Note: Rate limits apply per W&B project.
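
As one way to reduce the number of parallel requests, the sketch below caps how many calls run at the same time with a fixed-size thread pool. The names `run_limited` and `MAX_PARALLEL` are hypothetical, `func` stands in for whatever function sends a single inference request, and the value 2 is only an illustration, not a documented limit.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: illustrative cap only; choose a value that suits your project.
MAX_PARALLEL = 2

def run_limited(func, items, max_parallel=MAX_PARALLEL):
    """Call func on each item with at most max_parallel calls in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(func, items))
```

Lowering `max_parallel` trades throughput for fewer 429 responses; a value of 1 makes the calls fully sequential.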

Best practices to avoid rate limits

The following practices help your application stay within concurrency limits and recover gracefully when it reaches them.

Implement retry logic with exponential backoff: Backoff spaces out retries so transient 429 responses clear before the next attempt.

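import time

def retry_with_backoff(func, max_retries=3):
    """Call func, retrying on 429 errors with exponentially growing waits."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            # Retry only rate limit errors, and only while attempts remain.
            if "429" in str(e) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
            else:
                raise  # other errors, or the final failed attempt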

Use batch processing instead of parallel requests.
Monitor your usage on the W&B Billing page.
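
One possible reading of batch processing is to send requests sequentially in small groups, pausing between groups, rather than firing them all at once. The sketch below assumes that interpretation. `process_in_batches`, `send_one`, `batch_size`, and `pause_seconds` are hypothetical names and values chosen for illustration, with `send_one` standing in for the function that issues a single request.

```python
import time

def process_in_batches(items, send_one, batch_size=5, pause_seconds=1.0):
    """Send items one at a time, pausing between groups of batch_size."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        for item in batch:
            # Requests run sequentially, so only one is in flight at a time.
            results.append(send_one(item))
        # Pause between batches, but not after the final one.
        if start + batch_size < len(items):
            time.sleep(pause_seconds)
    return results
```

Because only one request is active at a time, this approach stays under any concurrency limit at the cost of longer total runtime.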

Default spending caps

Accounts also have default spending caps that bound overall Inference usage:

Pro accounts: $6,000 per month
Enterprise accounts: $700,000 per year

Contact your account executive or support to adjust limits.
