1. Training Performance
Training throughput depends as much on data movement, numeric precision, and evaluation mode as on model choice. The sections below focus on small changes that increase hardware utilization without changing model semantics.
[ ]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
torch.manual_seed(37)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
transform = transforms.Compose([transforms.Resize((64, 64)), transforms.ToTensor()])
dataset = datasets.ImageFolder('./shapes/train', transform=transform)
loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2,
    pin_memory=(device.type == 'cuda'),
)
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 64),
    nn.ReLU(),
    nn.Linear(64, len(dataset.classes)),
).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
device
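Before tuning anything, it helps to have a baseline number for loader throughput. Below is a minimal sketch of such a measurement; the `fake_images` tensor and `measure_throughput` helper are stand-ins invented here so the snippet runs without the `./shapes/train` directory.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the ImageFolder dataset so the sketch is self-contained.
fake_images = torch.randn(64, 3, 64, 64)
fake_labels = torch.randint(0, 3, (64,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=8)

def measure_throughput(loader):
    """Return images consumed per second for one full pass over the loader."""
    n_images = 0
    start = time.perf_counter()
    for images, _ in loader:
        n_images += images.shape[0]
    elapsed = time.perf_counter() - start
    return n_images / elapsed

throughput = measure_throughput(fake_loader)
print(f'{throughput:.0f} images/s')
```

Re-running this after changing `num_workers` or `batch_size` gives a quick, if noisy, signal about whether the loader or the model is the bottleneck.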
1.1. Non-blocking transfer
Pinned host memory plus non_blocking=True lets CPU-to-GPU copies overlap with GPU work. The flag is safe to pass unconditionally: when the source tensor is not pinned or the target is the CPU, PyTorch falls back to an ordinary synchronous copy.
[ ]:
images, labels = next(iter(loader))
images = images.to(device, non_blocking=True)
labels = labels.to(device, non_blocking=True)
images.device, labels.device
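Since the same transfer appears in every loop below, it can be factored into a small helper. This `to_device` function is a sketch introduced here, not part of the notebook's pipeline; it works unchanged on CPU-only machines because `non_blocking` degrades to a synchronous copy there.

```python
import torch

def to_device(batch, device):
    """Move an (images, labels) pair to the target device.

    non_blocking only overlaps the copy when the source is pinned host
    memory and the target is CUDA; otherwise it is a plain synchronous
    copy, so passing the flag unconditionally is safe.
    """
    images, labels = batch
    return (images.to(device, non_blocking=True),
            labels.to(device, non_blocking=True))

batch = (torch.randn(8, 3, 64, 64), torch.randint(0, 3, (8,)))
images, labels = to_device(batch, torch.device('cpu'))
print(images.device, labels.device)
```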
1.2. Mixed precision and gradient clipping
Automatic mixed precision can reduce memory and increase throughput on CUDA. Gradient clipping is useful when gradients explode, especially in recurrent models and some transfer-learning runs.
[ ]:
use_amp = device.type == 'cuda'
scaler = torch.amp.GradScaler('cuda', enabled=use_amp)
optimizer.zero_grad(set_to_none=True)
with torch.amp.autocast('cuda', enabled=use_amp):
    logits = model(images)
    loss = criterion(logits, labels)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
gradient_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()
print('loss:', loss.item())
print('gradient norm before clipping:', float(gradient_norm))
1.3. Gradient accumulation
Gradient accumulation simulates a larger batch when a full batch does not fit in GPU memory. Divide the loss by the number of accumulation steps so the gradient scale stays comparable.
[ ]:
accumulation_steps = 2
optimizer.zero_grad(set_to_none=True)
for step, (images, labels) in enumerate(loader):
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    with torch.amp.autocast('cuda', enabled=use_amp):
        logits = model(images)
        loss = criterion(logits, labels) / accumulation_steps
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        break  # stop after the first full accumulated step for this demo
print('completed an accumulated optimizer step')
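The divide-by-steps rule can be checked directly: with a mean-reduced loss, two accumulated half batches (each loss divided by 2) produce the same gradients as one backward pass over the full batch. A minimal sketch, using a tiny linear model invented for this check rather than the notebook's model:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
criterion = torch.nn.CrossEntropyLoss()  # reduction='mean' by default
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

# Full batch in one backward pass.
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Same batch as two accumulated half batches, loss divided by the step count.
model.zero_grad()
for half_x, half_y in ((x[:4], y[:4]), (x[4:], y[4:])):
    (criterion(model(half_x), half_y) / 2).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

The exact match holds because mean-over-4 divided by 2 equals sum-over-4 divided by 8, and summing the two accumulated gradients recovers the mean over all 8 samples.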
1.4. Evaluation mode
Use torch.inference_mode() for inference/evaluation when you do not need gradients or autograd metadata.
[ ]:
@torch.inference_mode()
def predict_one_batch(model, loader):
    model.eval()
    images, _ = next(iter(loader))
    images = images.to(device, non_blocking=True)
    return model(images).softmax(dim=1)
probabilities = predict_one_batch(model, loader)
probabilities.shape, probabilities.device
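Unlike torch.no_grad(), inference_mode() produces tensors that carry no autograd metadata at all, which is what makes it slightly cheaper. A quick sketch with a throwaway linear model:

```python
import torch

model = torch.nn.Linear(4, 2)

with torch.inference_mode():
    out = model(torch.randn(1, 4))

# Inference-mode tensors record no autograd state.
print(out.requires_grad)   # False
print(out.is_inference())  # True
```

The trade-off is that inference tensors cannot later be used in autograd-recorded computation; use torch.no_grad() instead when outputs must feed back into a differentiable graph.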