This page explains how to serve your own custom LoRA adapters on W&B Serverless Inference. It’s for developers and ML practitioners who want to deploy fine-tuned variants of supported base models without managing infrastructure. LoRA (Low-Rank Adaptation) lets you customize large language models by training and storing only a lightweight add-on instead of a full new model. This reduces the size and cost of customization. You can train or upload a LoRA to give a base model new capabilities, such as specializing it for customer support, creative writing, or a particular technical field. This lets you adapt the model’s behavior without retraining or redeploying the entire model.Documentation Index
Fetch the complete documentation index at: https://wb-21fd5541-style-guide-support-models-articles-20260527-00.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Why use Serverless Inference for LoRAs
Serverless Inference for LoRAs offers the following benefits:- Upload once, deploy without managing servers.
- Track which version is live with artifact versioning.
- Update models by swapping small LoRA files instead of full model weights.
Workflow
At a high level, serving a custom LoRA involves three steps:- Upload your LoRA weights as a W&B artifact.
- Reference the artifact URI as your model name in the API.
- W&B dynamically loads your weights for inference.
Prerequisites
You need the following:- A W&B API key.
- A W&B project.
- Python 3.8+ with the
openaiandwandbpackages:pip install wandb openai.
Add and use LoRAs
You can add LoRAs to your W&B account and start using them with two methods. Choose the tab that matches where your LoRA was trained:- Upload a LoRA you trained elsewhere
- Train a new LoRA with W&B
Upload your own custom LoRA directory as a W&B artifact. Use this method if you trained your LoRA elsewhere (local environment, cloud provider, or partner service).This Python code uploads your locally stored LoRA weights to W&B as a versioned artifact. It creates a
lora type artifact with the required metadata (base model and storage region), adds your LoRA files from a local directory, and logs it to your W&B project for use with inference.Key requirements
To use your own LoRAs with Inference, ensure the following:- The LoRA must have been trained using one of the models listed in the Supported base models section.
- A LoRA saved in PEFT format as a
loratype artifact in your W&B account. - The LoRA must be stored in the
storage_region="coreweave-us"for low latency. - When you upload, include the name of the base model you trained it on (for example,
meta-llama/Llama-3.1-8B-Instruct). This ensures W&B loads it with the correct model.
Supported base models
Your LoRA must be trained against one of the following base models. Use the exact model ID string when settingwandb.base_model so W&B can pair your adapter with the correct base model at inference time.
| Model ID (for API usage) | Maximum LoRA Rank |
|---|---|
meta-llama/Llama-3.1-70B-Instruct | 16 |
meta-llama/Llama-3.1-8B-Instruct | 16 |
openai/gpt-oss-120b | 64 |
OpenPipe/Qwen3-14B-Instruct | 16 |
Qwen/Qwen3.6-27B | 16 |
Qwen/Qwen3-30B-A3B-Instruct-2507 | 16 |
Pricing
You pay only for storage and the inference you run, rather than for always-on servers or dedicated GPU instances. Pricing has two components:- Storage: You’re billed for the storage that holds your LoRA weights.
- Inference usage: Calls that use LoRA artifacts are billed at the same rates as standard model inference.