3. Profiling
Performance work without measurements is mostly superstition. The goal of this section is to read profiler output well enough to decide whether your bottleneck is compute, Python overhead, memory movement, or something else.
[ ]:
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
x = torch.randn(128, 256, device=device)
y = torch.randint(0, 10, (128,), device=device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
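Before reaching for the profiler, a quick wall-clock baseline can confirm there is something worth measuring. This is a sketch, not part of the notebook's setup; the warmup loop and iteration count are arbitrary choices:

```python
import time

import torch
from torch import nn

# A small CPU model mirroring the setup above.
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(128, 256)

# Warm up so one-time costs (allocations, lazy init) don't skew the number.
for _ in range(3):
    model(x)

start = time.perf_counter()
for _ in range(10):
    model(x)
# On GPU you would call torch.cuda.synchronize() here, because kernel
# launches are asynchronous and perf_counter alone measures launch time.
elapsed = time.perf_counter() - start
print(f'{elapsed / 10 * 1e3:.3f} ms per forward pass')
```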
3.1. Profile one train step
Include CPU activities and CUDA activities when CUDA is available.
[ ]:
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    optimizer.zero_grad(set_to_none=True)
    logits = model(x)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
table = prof.key_averages().table(sort_by='self_cpu_time_total', row_limit=8)
print(table)
assert 'Self CPU' in table
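Because the step above passes `record_shapes=True`, the averages can also be grouped by input shape, which helps attribute time when the same operator runs at many sizes. A minimal CPU-only sketch (forward pass only, model sizes chosen to match the setup above):

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile

model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(128, 256)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# group_by_input_shape=True splits each operator's row by the shapes it saw,
# so a Linear at (128, 256) and one at (128, 512) get separate lines.
table = prof.key_averages(group_by_input_shape=True).table(
    sort_by='self_cpu_time_total', row_limit=5)
print(table)
```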
3.2. Export a trace
Chrome traces and TensorBoard traces make it easier to inspect operator order and overlap.
[ ]:
import os

os.makedirs('./output', exist_ok=True)  # export fails if the directory is missing
trace_path = './output/profile-trace.json'
prof.export_chrome_trace(trace_path)
print(trace_path)
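For the TensorBoard route, `torch.profiler` can write traces automatically over several steps via a schedule and `tensorboard_trace_handler`. A CPU-only sketch; the output directory name and the wait/warmup/active split are arbitrary choices:

```python
import torch
from torch import nn
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(128, 256)

# Skip 1 step, warm up for 1, then record 2; the handler writes one
# trace file per completed cycle into the given directory.
sched = schedule(wait=1, warmup=1, active=2)
with profile(activities=[ProfilerActivity.CPU],
             schedule=sched,
             on_trace_ready=tensorboard_trace_handler('./output/tb-trace')) as prof:
    for _ in range(4):
        model(x)
        prof.step()  # advance the profiler's schedule each iteration
```

Point TensorBoard at `./output/tb-trace` (with the `torch-tb-profiler` plugin installed) to browse the recorded steps.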