11. Inference Batching
Inference has a different goal than training: you usually care about throughput, latency, and memory, so the right batch size depends on the deployment context rather than on the training recipe.
This notebook shows the practical effects of model.eval(), torch.no_grad(), and batch size on simple inference timing.
11.1. Setup
[ ]:
import time
import torch
from torch import nn
torch.manual_seed(19)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Sequential(
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
).to(device)
inputs = torch.randn(4096, 256, device=device)
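The model above has no Dropout or BatchNorm layers, so model.eval() will not change its outputs here. A minimal sketch (using a hypothetical small Dropout model, not the one defined above) of where the mode switch does matter:

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical toy model just to demonstrate train/eval behavior.
drop_model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(2, 4)

drop_model.train()
a = drop_model(x)  # Dropout active: activations are randomly zeroed.

drop_model.eval()
c = drop_model(x)
d = drop_model(x)
# In eval mode, Dropout is the identity, so repeated calls agree exactly.
print(torch.equal(c, d))  # True
```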
11.2. Helper
[ ]:
def benchmark(batch_size, use_no_grad=True, repeats=20):
    model.eval()
    if device.type == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        for i in range(0, len(inputs), batch_size):
            batch = inputs[i:i + batch_size]
            if use_no_grad:
                with torch.no_grad():
                    _ = model(batch)
            else:
                _ = model(batch)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    examples = len(inputs) * repeats
    return {
        'batch_size': batch_size,
        'no_grad': use_no_grad,
        'seconds': round(elapsed, 4),
        'examples_per_second': round(examples / elapsed, 1),
    }
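One caveat with wall-clock timing: the first pass through a model pays one-time costs (lazy initialization, and on GPU, kernel launches and caching-allocator growth) that can skew short benchmarks. A self-contained sketch of the warm-up pattern, using a hypothetical small model rather than the helper above:

```python
import time
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(256, 64)
x = torch.randn(512, 256)

# Warm-up passes: absorb one-time setup costs outside the timed region.
with torch.no_grad():
    for _ in range(3):
        _ = net(x)

start = time.perf_counter()
with torch.no_grad():
    for _ in range(10):
        _ = net(x)
elapsed = time.perf_counter() - start
print(f"{elapsed:.4f}s for 10 warm passes")
```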
11.3. Batch Size Comparison
[ ]:
results = []
for batch_size in [1, 8, 32, 128, 512]:
    results.append(benchmark(batch_size=batch_size, use_no_grad=True))
results
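One way to read the output is to pick the highest-throughput configuration. A minimal sketch, assuming result dicts shaped like the helper's return value (the numbers here are made up for illustration):

```python
# Hypothetical results in the same shape as benchmark()'s return value.
results = [
    {'batch_size': 1, 'examples_per_second': 1200.0},
    {'batch_size': 32, 'examples_per_second': 45000.0},
    {'batch_size': 512, 'examples_per_second': 61000.0},
]

# Select the configuration with the best throughput.
best = max(results, key=lambda r: r['examples_per_second'])
print(best['batch_size'])  # 512
```

In practice you would trade this off against latency: the largest batch usually wins on throughput, but a single request then waits for the whole batch.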
11.4. no_grad() vs Tracking Gradients
Inference should usually disable gradient tracking. Otherwise you spend memory and time on work you will never use.
[ ]:
[
    benchmark(batch_size=128, use_no_grad=True),
    benchmark(batch_size=128, use_no_grad=False),
]
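PyTorch also provides torch.inference_mode(), a stricter variant of torch.no_grad() that additionally skips view and version-counter tracking, which can shave a little more overhead. A minimal sketch (assuming a PyTorch version ≥ 1.9, where inference_mode is available):

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(8, 4)
x = torch.randn(2, 8)

with torch.inference_mode():
    y = net(x)

# Tensors produced under inference_mode carry no autograd state at all.
print(y.requires_grad)  # False
```

The trade-off is that inference-mode tensors cannot later be used in autograd at all, so use it only when you are certain no gradient will ever flow through the results.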