1. Training Performance
Training throughput depends as much on data movement, numeric precision, and evaluation mode as on model choice. The sections below focus on small changes that increase hardware utilization without changing model semantics.
[ ]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
torch.manual_seed(37)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
transform = transforms.Compose([transforms.Resize((64, 64)), transforms.ToTensor()])
dataset = datasets.ImageFolder('./shapes/train', transform=transform)
loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2,
    pin_memory=(device.type == 'cuda'),
)
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 64),
    nn.ReLU(),
    nn.Linear(64, len(dataset.classes)),
).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
device
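Before tuning anything, it helps to have a baseline number for loader throughput. Below is a minimal sketch of such a measurement; the `fake_images` tensor and `measure_throughput` helper are stand-ins invented here so the snippet runs without the `./shapes/train` directory.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the ImageFolder dataset so the sketch is self-contained.
fake_images = torch.randn(64, 3, 64, 64)
fake_labels = torch.randint(0, 3, (64,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=8)

def measure_throughput(loader):
    """Return images consumed per second for one full pass over the loader."""
    n_images = 0
    start = time.perf_counter()
    for images, _ in loader:
        n_images += images.shape[0]
    elapsed = time.perf_counter() - start
    return n_images / elapsed

throughput = measure_throughput(fake_loader)
print(f'{throughput:.0f} images/s')
```

Re-running this after changing `num_workers` or `batch_size` gives a quick, if noisy, signal about whether the loader or the model is the bottleneck.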
1.1. Non-blocking transfer
Pinned host memory plus non_blocking=True lets CPU-to-GPU copies overlap with GPU work. The flag is safe to pass unconditionally: when the source tensor is not pinned or the target is the CPU, PyTorch falls back to an ordinary synchronous copy.
[ ]:
images, labels = next(iter(loader))
images = images.to(device, non_blocking=True)
labels = labels.to(device, non_blocking=True)
images.device, labels.device
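Since the same transfer appears in every loop below, it can be factored into a small helper. This `to_device` function is a sketch introduced here, not part of the notebook's pipeline; it works unchanged on CPU-only machines because `non_blocking` degrades to a synchronous copy there.

```python
import torch

def to_device(batch, device):
    """Move an (images, labels) pair to the target device.

    non_blocking only overlaps the copy when the source is pinned host
    memory and the target is CUDA; otherwise it is a plain synchronous
    copy, so passing the flag unconditionally is safe.
    """
    images, labels = batch
    return (images.to(device, non_blocking=True),
            labels.to(device, non_blocking=True))

batch = (torch.randn(8, 3, 64, 64), torch.randint(0, 3, (8,)))
images, labels = to_device(batch, torch.device('cpu'))
print(images.device, labels.device)
```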
1.2. Mixed precision and gradient clipping
Automatic mixed precision can reduce memory and increase throughput on CUDA. Gradient clipping is useful when gradients explode, especially in recurrent models and some transfer-learning runs.
[ ]:
use_amp = device.type == 'cuda'
scaler = torch.amp.GradScaler('cuda', enabled=use_amp)
optimizer.zero_grad(set_to_none=True)
with torch.amp.autocast('cuda', enabled=use_amp):
    logits = model(images)
    loss = criterion(logits, labels)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
gradient_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()
print('loss:', loss.item())
print('gradient norm before clipping:', float(gradient_norm))
1.3. Gradient accumulation
Gradient accumulation simulates a larger batch when a full batch does not fit in GPU memory. Divide the loss by the number of accumulation steps so the gradient scale stays comparable.
[ ]:
accumulation_steps = 2
optimizer.zero_grad(set_to_none=True)
for step, (images, labels) in enumerate(loader):
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    with torch.amp.autocast('cuda', enabled=use_amp):
        logits = model(images)
        loss = criterion(logits, labels) / accumulation_steps
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        break  # stop after the first full accumulated step for this demo
print('completed an accumulated optimizer step')
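The divide-by-steps rule can be checked directly: with a mean-reduced loss, two accumulated half batches (each loss divided by 2) produce the same gradients as one backward pass over the full batch. A minimal sketch, using a tiny linear model invented for this check rather than the notebook's model:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
criterion = torch.nn.CrossEntropyLoss()  # reduction='mean' by default
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

# Full batch in one backward pass.
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Same batch as two accumulated half batches, loss divided by the step count.
model.zero_grad()
for half_x, half_y in ((x[:4], y[:4]), (x[4:], y[4:])):
    (criterion(model(half_x), half_y) / 2).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

The exact match holds because mean-over-4 divided by 2 equals sum-over-4 divided by 8, and summing the two accumulated gradients recovers the mean over all 8 samples.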
1.4. Evaluation mode
Use torch.inference_mode() for inference/evaluation when you do not need gradients or autograd metadata.
[ ]:
@torch.inference_mode()
def predict_one_batch(model, loader):
    model.eval()
    images, _ = next(iter(loader))
    images = images.to(device, non_blocking=True)
    return model(images).softmax(dim=1)
probabilities = predict_one_batch(model, loader)
probabilities.shape, probabilities.device
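Unlike torch.no_grad(), inference_mode() produces tensors that carry no autograd metadata at all, which is what makes it slightly cheaper. A quick sketch with a throwaway linear model:

```python
import torch

model = torch.nn.Linear(4, 2)

with torch.inference_mode():
    out = model(torch.randn(1, 4))

# Inference-mode tensors record no autograd state.
print(out.requires_grad)   # False
print(out.is_inference())  # True
```

The trade-off is that inference tensors cannot later be used in autograd-recorded computation; use torch.no_grad() instead when outputs must feed back into a differentiable graph.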