1. Training Performance

Throughput depends on data movement, precision, and evaluation mode as much as on model choice. This section focuses on small changes that increase hardware utilization without changing model semantics.

[ ]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

torch.manual_seed(37)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

transform = transforms.Compose([transforms.Resize((64, 64)), transforms.ToTensor()])
dataset = datasets.ImageFolder('./shapes/train', transform=transform)

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2,
    pin_memory=(device.type == 'cuda'),
)

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 64),
    nn.ReLU(),
    nn.Linear(64, len(dataset.classes)),
).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
device

1.1. Non-blocking transfer

Pinned host memory plus non_blocking=True can overlap CPU-to-GPU copies with GPU compute. Without pinned memory the flag is silently ignored and the copy runs synchronously, so keeping the explicit transfer call is harmless on CPU-only machines.

[ ]:
images, labels = next(iter(loader))
images = images.to(device, non_blocking=True)
labels = labels.to(device, non_blocking=True)
images.device, labels.device
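The overlap above only happens when the host tensor lives in pinned (page-locked) memory, which is what pin_memory=True in the DataLoader arranges. A minimal sketch with a synthetic tensor (the shapes here are illustrative, not from the dataset):

```python
import torch

# A plain CPU tensor is pageable, not pinned; non_blocking=True on it
# silently falls back to a synchronous copy.
cpu_tensor = torch.randn(8, 3, 64, 64)
print(cpu_tensor.is_pinned())  # False

if torch.cuda.is_available():
    # pin_memory() returns a page-locked copy, enabling async transfer.
    pinned = cpu_tensor.pin_memory()
    print(pinned.is_pinned())  # True
    gpu_tensor = pinned.to('cuda', non_blocking=True)
```

The DataLoader's pin_memory worker does this copy for you, which is why the earlier cell only passes the flag rather than calling pin_memory() by hand.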

1.2. Mixed precision and gradient clipping

Automatic mixed precision can reduce memory and increase throughput on CUDA. Gradient clipping is useful when gradients explode, especially in recurrent models and some transfer-learning runs.

[ ]:
use_amp = device.type == 'cuda'
scaler = torch.amp.GradScaler('cuda', enabled=use_amp)

optimizer.zero_grad(set_to_none=True)
with torch.amp.autocast('cuda', enabled=use_amp):
    logits = model(images)
    loss = criterion(logits, labels)

scaler.scale(loss).backward()
scaler.unscale_(optimizer)
gradient_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()

print('loss:', loss.item())
print('gradient norm before clipping:', float(gradient_norm))
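On machines without CUDA, autocast can still run on CPU with bfloat16. A GradScaler is unnecessary there because bfloat16 keeps float32's exponent range, so gradients do not underflow the way float16 gradients can. A minimal sketch with a toy linear layer (sizes are illustrative):

```python
import torch
from torch import nn

model = nn.Linear(16, 4)
x = torch.randn(8, 16)

# CPU autocast runs matmul-heavy ops in bfloat16; no scaler needed.
with torch.amp.autocast('cpu', dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # torch.bfloat16
```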

1.3. Gradient accumulation

Gradient accumulation simulates a larger batch when a full batch does not fit in GPU memory. Divide the loss by the number of accumulation steps so the gradient scale stays comparable.

[ ]:
accumulation_steps = 2
optimizer.zero_grad(set_to_none=True)

for step, (images, labels) in enumerate(loader):
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)

    with torch.amp.autocast('cuda', enabled=use_amp):
        logits = model(images)
        loss = criterion(logits, labels) / accumulation_steps

    scaler.scale(loss).backward()

    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        break  # one accumulated optimizer step is enough for this demo

print('completed an accumulated optimizer step')
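Dividing the loss by accumulation_steps makes the accumulated gradient equal (up to floating-point rounding) to the gradient of one full batch, provided the micro-batches are equal-sized. A self-contained check with a toy model and synthetic data:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

# Gradient from the full batch of 8.
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Two accumulated half-batches, each loss divided by 2.
model.zero_grad()
for chunk_x, chunk_y in zip(x.chunk(2), y.chunk(2)):
    (criterion(model(chunk_x), chunk_y) / 2).backward()

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True
```

With the loader above (batch_size=8, accumulation_steps=2), this corresponds to an effective batch size of 16.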

1.4. Evaluation mode

Use torch.inference_mode() for inference/evaluation when you do not need gradients or autograd metadata.

[ ]:
@torch.inference_mode()
def predict_one_batch(model, loader):
    model.eval()
    images, _ = next(iter(loader))
    images = images.to(device, non_blocking=True)
    return model(images).softmax(dim=1)

probabilities = predict_one_batch(model, loader)
probabilities.shape, probabilities.device
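Unlike torch.no_grad(), inference_mode also skips version-counter and view-tracking bookkeeping, and tensors created inside it are permanently excluded from autograd. A minimal sketch showing the resulting tensor flags:

```python
import torch

x = torch.randn(3)

with torch.inference_mode():
    y = x * 2

# y can never participate in a later backward pass.
print(y.requires_grad)   # False
print(y.is_inference())  # True
```

This is why inference_mode is the right tool for pure evaluation paths like predict_one_batch, and no_grad the safer choice when a result might later feed into autograd.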