Hugging Face Accelerate: Making device-agnostic ML training and inference easy at scale

Who am I?

  • Zachary Mueller
  • Technical Lead for the 🤗 Accelerate project
  • Maintain the transformers Trainer
  • API design geek

What is 🤗 Accelerate?

  • A training framework
  • An inference framework
  • A command-line interface

A Training Framework

  • Powered by PyTorch
  • Change a few lines of code, gain device and hardware-agnostic capabilities
  • Low-code, with minimal magic aimed at easy hackability and use without high-level abstractions
  • We handle the intricacies so you don’t have to

A Training Framework

  • Support for any hardware accelerator on the market:
    • CPU, GPU, TPU, XPU, NPU, MLU
  • Automatic and safe mixed-precision training in whichever format you choose:
    • FP16, BF16, FP8 (through either TransformerEngine or MS-AMP)
  • Automatic and efficient gradient accumulation
  • Support for quantization through bitsandbytes
  • Support for your favorite experiment trackers (aim, clearml, comet_ml, dvclive, mlflow, tensorboard, wandb)
  • Easy-to-configure plugin- or YAML-level APIs for setting up advanced frameworks like FSDP, DeepSpeed, and Megatron-LM (several of these features are sketched below)
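
A minimal sketch of how several of these features compose (the tracker choice, project name, accumulation value, and dataset below are illustrative assumptions):

import torch
import torch.nn.functional as F
from accelerate import Accelerator

# bf16 mixed precision, gradient accumulation, and experiment tracking in one place
accelerator = Accelerator(
    mixed_precision="bf16",          # or "fp16" / "fp8"
    gradient_accumulation_steps=4,   # illustrative value
    log_with="wandb",                # any supported tracker name works here
)
accelerator.init_trackers("accelerate-demo")  # hypothetical project name

model = torch.nn.Transformer()
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(my_dataset, shuffle=True)  # `my_dataset` is assumed

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for step, (source, targets) in enumerate(dataloader):
    # Gradients only sync (and the optimizer only steps) every 4 batches
    with accelerator.accumulate(model):
        output = model(source)
        loss = F.cross_entropy(output, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    accelerator.log({"train_loss": loss.item()}, step=step)

accelerator.end_training()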

Low-Code

  • The biggest friction with “wrapper” libraries is losing control of your code
  • By being minimally intrusive, your code just “works” while still giving you complete control
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
-         loss.backward()
+         accelerator.backward(loss)
          optimizer.step()

Easy to integrate

  • Due to the low-code nature, it’s trivial to integrate into existing PyTorch frameworks:
    1. Create an Accelerator
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
  device = 'cpu'

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
          loss.backward()
          optimizer.step()

Easy to integrate

  • Due to the low-code nature, it’s trivial to integrate into existing PyTorch frameworks:
    2. Wrap your PyTorch objects with accelerator.prepare and remove device-placements
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
  from accelerate import Accelerator

  accelerator = Accelerator()
- device = 'cpu'

- model = torch.nn.Transformer().to(device)
+ model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
-         source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
          loss.backward()
          optimizer.step()

Easy to integrate

  • Due to the low-code nature, it’s trivial to integrate into existing PyTorch frameworks:
    3. Use accelerator.backward for the backward pass
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
  from accelerate import Accelerator

  accelerator = Accelerator()

  model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

  model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
-         loss.backward()
+         accelerator.backward(loss)
          optimizer.step()

But what about inference?

  • 🤗 Accelerate is not just for training: it has helped the GPU-Poor take control of the narrative
  • Using tools like Big Model Inference, users with tiny compute can run large models locally
  • It started with the Stable Diffusion boom and has since scaled to running huge LLMs locally on a single graphics card

How does it work?

  • PyTorch introduced device="meta"
  • 🤗 Accelerate introduced device_map="auto"
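
A minimal sketch of how these combine in Big Model Inference (the checkpoint path is an illustrative placeholder):

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("gpt2")

# The model skeleton lives on the "meta" device, so it costs (almost) no memory
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Weights are then loaded shard-by-shard and dispatched across the available
# GPUs, CPU RAM, and disk
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/checkpoint",  # placeholder path to downloaded/sharded weights
    device_map="auto",
)

In 🤗 Transformers, this same machinery is what powers from_pretrained(..., device_map="auto").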

A CLI Interface

  • accelerate config
    • Configure the environment
  • accelerate launch
    • How to run your script

Launching distributed training is hard

python script.py

vs.


torchrun --nnodes=1 --nproc_per_node=2 script.py

vs.


deepspeed --num_gpus=2 script.py


How can we make this better?

accelerate launch

accelerate launch script.py


accelerate launch --multi_gpu --num_processes 2 script.py


accelerate launch \
  --multi_gpu \
  --use_deepspeed \
  --num_processes 2 \
  script.py

accelerate config

  • Rely on config.yaml files
  • Either run accelerate config or write your own (then pass it to accelerate launch, as shown after the examples):
ddp_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
fsdp_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
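
Either file can then be handed straight to the launcher, for example:

accelerate launch --config_file fsdp_config.yaml script.py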

Now that you’re up to speed, what’s new?

We’ve had a busy last year, and so has the ML Community!

New training techniques

  • Quantization has taken the field by storm
  • New ideas such as FSDP + QLoRA to train huge models on tiny compute!
  • New precision backends as we train natively in lower precision (sketched below)
  • Optimizing further how much we can push on a single machine through efficient RAM and timing techniques
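
A minimal sketch of enabling FP8 from the same Accelerator entry point, assuming TransformerEngine is installed (MS-AMP can be selected the same way):

from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# Request FP8 mixed precision and pick the TransformerEngine backend explicitly
accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=[FP8RecipeKwargs(backend="te")],  # assumption: "msamp" also works here
)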

Larger compute landscape

  • As we search for alternatives to NVIDIA, new accelerators rise:
    • XPU (Intel)
    • NPU (Huawei Ascend)
    • MLU (Cambricon)

All of which are supported by 🤗 Accelerate

Lower abstractions

  • While the Accelerator was great, we needed lower-level abstractions focused on controlling specific behaviors
  • Introduced the PartialState
from accelerate import PartialState

if PartialState().is_main_process:
    # Run on only one process
    ...

with PartialState().main_process_first():
    # Useful for dataset processing: the main process goes first, the rest follow
    ...

# Device-agnostic without the bulk of the `Accelerator`
device = PartialState().device

Faster and better inference alternatives

  • PiPPy gives us efficient pipeline-parallelism in distributed environments to increase throughput while keeping a simple torch-bound API
  • Rather than having to wait for each GPU, every GPU can be busy in parallel
  • Will be critical as larger LLMs take hold and more than one computer is needed
import torch
from transformers import AutoModelForSequenceClassification

from accelerate import PartialState, prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model.eval()

input = torch.randint(
    low=0,
    high=model.config.vocab_size,
    size=(2, 1024),  # bs x seq_len
    device="cpu",
)

model = prepare_pippy(model, split_points="auto", example_args=(input,))

with torch.no_grad():
    output = model(input)
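
The PartialState import above comes into play when reading the result: with pipeline parallelism only the last stage holds the final output, so a typical follow-up (an assumption based on the usual pattern, not shown on the slide) is:

# Only the last pipeline stage produces the real logits, so inspect them there
if PartialState().is_last_process:
    print(output)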

Adoption: Accelerate in the ecosystem

Accelerate in the Ecosystem

  • Many of the frameworks you use daily already rely on 🤗 Accelerate!
    • Nearly all of 🤗
    • axolotl
    • fastai
    • FastChat
    • lucidrains
    • kornia

Accelerate in the Ecosystem

  • Started as a way to isolate the distributed boilerplate for TPUs and DistributedDataParallel (DDP)

Accelerate in the Ecosystem

  • Now is the backbone of some of the largest PyTorch training frameworks in the ecosystem

What’s next?

Elevating the community

  • Now that more advanced training techniques are reachable (FSDP, DeepSpeed, etc.), we need to focus on educating the community on how to use them best
  • It goes beyond how to use the Trainer or Accelerator: it’s knowing which technique to use, and when
  • Keep Accelerate a tool the community can pick up and play with as new techniques come out, so new ideas can be pushed to scale quickly

1.0.0: Soon!

  • Tried and battle-tested by over 7M users/month | 110M+ total downloads
  • As we’ve been stable for over a year now, we’re near ready to release 1.0.0

Thanks for joining!