3. Profiling
Performance work without measurements is mostly superstition. The goal of this section is to read profiler output well enough to decide whether your bottleneck is compute, Python overhead, memory movement, or something else.
[ ]:
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
x = torch.randn(128, 256, device=device)
y = torch.randint(0, 10, (128,), device=device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
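Before reaching for the profiler, a quick wall-clock baseline can confirm there is something worth measuring. This is a sketch, not part of the notebook's setup; the warmup loop and iteration count are arbitrary choices:

```python
import time

import torch
from torch import nn

# A small CPU model mirroring the setup above.
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(128, 256)

# Warm up so one-time costs (allocations, lazy init) don't skew the number.
for _ in range(3):
    model(x)

start = time.perf_counter()
for _ in range(10):
    model(x)
# On GPU you would call torch.cuda.synchronize() here, because kernel
# launches are asynchronous and perf_counter alone measures launch time.
elapsed = time.perf_counter() - start
print(f'{elapsed / 10 * 1e3:.3f} ms per forward pass')
```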
3.1. Profile one train step
Include CPU activities and CUDA activities when CUDA is available.
[ ]:
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    optimizer.zero_grad(set_to_none=True)
    logits = model(x)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
table = prof.key_averages().table(sort_by='self_cpu_time_total', row_limit=8)
print(table)
assert 'Self CPU' in table
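Because the step above passes `record_shapes=True`, the averages can also be grouped by input shape, which helps attribute time when the same operator runs at many sizes. A minimal CPU-only sketch (forward pass only, model sizes chosen to match the setup above):

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile

model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(128, 256)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# group_by_input_shape=True splits each operator's row by the shapes it saw,
# so a Linear at (128, 256) and one at (128, 512) get separate lines.
table = prof.key_averages(group_by_input_shape=True).table(
    sort_by='self_cpu_time_total', row_limit=5)
print(table)
```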
3.2. Export a trace
Chrome traces and TensorBoard traces make it easier to inspect operator order and overlap.
[ ]:
import os

os.makedirs('./output', exist_ok=True)  # export fails if the directory is missing
trace_path = './output/profile-trace.json'
prof.export_chrome_trace(trace_path)
print(trace_path)
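For the TensorBoard route, `torch.profiler` can write traces automatically over several steps via a schedule and `tensorboard_trace_handler`. A CPU-only sketch; the output directory name and the wait/warmup/active split are arbitrary choices:

```python
import torch
from torch import nn
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(128, 256)

# Skip 1 step, warm up for 1, then record 2; the handler writes one
# trace file per completed cycle into the given directory.
sched = schedule(wait=1, warmup=1, active=2)
with profile(activities=[ProfilerActivity.CPU],
             schedule=sched,
             on_trace_ready=tensorboard_trace_handler('./output/tb-trace')) as prof:
    for _ in range(4):
        model(x)
        prof.step()  # advance the profiler's schedule each iteration
```

Point TensorBoard at `./output/tb-trace` (with the `torch-tb-profiler` plugin installed) to browse the recorded steps.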