11. Inference Batching

Inference has a different goal than training: you usually care about throughput, latency, and memory, so the right batch size depends on the deployment context rather than on the training recipe.

This notebook shows the practical effects of model.eval(), torch.no_grad(), and batch size on simple inference timing.

11.1. Setup

[ ]:
import time
import torch
from torch import nn

torch.manual_seed(19)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Sequential(
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
).to(device)
inputs = torch.randn(4096, 256, device=device)

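A note on model.eval(): for this particular stack of Linear and ReLU layers it changes nothing, but it matters as soon as the model contains layers with distinct train/eval behavior, such as Dropout or BatchNorm. A minimal sketch:

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()   # training mode: elements are zeroed at random, survivors scaled by 1/(1-p)
noisy = drop(x)

drop.eval()    # eval mode: Dropout becomes the identity function
clean = drop(x)
assert torch.equal(clean, x)
```

Forgetting eval() is a classic inference bug: Dropout keeps randomizing outputs and BatchNorm keeps updating its running statistics.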
11.2. Helper

[ ]:
import contextlib

def benchmark(batch_size, use_no_grad=True, repeats=20):
    model.eval()
    # Pick the gradient context once instead of branching on every batch.
    ctx = torch.no_grad() if use_no_grad else contextlib.nullcontext()
    if device.type == 'cuda':
        torch.cuda.synchronize()  # don't let pending GPU work leak into the timing
    start = time.perf_counter()
    with ctx:
        for _ in range(repeats):
            for i in range(0, len(inputs), batch_size):
                _ = model(inputs[i:i + batch_size])
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    examples = len(inputs) * repeats
    return {
        'batch_size': batch_size,
        'no_grad': use_no_grad,
        'seconds': round(elapsed, 4),
        'examples_per_second': round(examples / elapsed, 1),
    }

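On GPU, the first forward pass often pays one-time costs (kernel compilation, memory-allocator growth) that would skew the first measurement. A sketch of a warmup helper you could run once before benchmarking; it takes the model, inputs, and device as arguments rather than relying on the notebook's globals:

```python
import torch
from torch import nn

def warmup(model, inputs, device, iters=3, batch_size=128):
    """Run a few throwaway forward passes so one-time setup costs
    don't end up inside the timed region."""
    model.eval()
    with torch.no_grad():
        for _ in range(iters):
            _ = model(inputs[:batch_size])
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for the throwaway passes to finish
```

Calling `warmup(model, inputs, device)` once before the benchmarks below makes the batch-size comparison fairer, especially for the first configuration measured.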
11.3. Batch Size Comparison

[ ]:
results = []
for batch_size in [1, 8, 32, 128, 512]:
    results.append(benchmark(batch_size=batch_size, use_no_grad=True))
results
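On most hardware the tiny batches lose badly: every forward pass carries fixed Python and kernel-launch overhead, and larger batches amortize it across more examples. A small hypothetical helper (not part of the notebook above) to read the results as relative throughput, using the first entry as the baseline:

```python
def relative_throughput(results):
    """Map batch_size -> throughput relative to the first (smallest-batch) entry."""
    base = results[0]['examples_per_second']
    return {r['batch_size']: round(r['examples_per_second'] / base, 2) for r in results}

# Illustrative numbers only, not measured output:
sample = [
    {'batch_size': 1, 'examples_per_second': 20000.0},
    {'batch_size': 128, 'examples_per_second': 400000.0},
]
relative_throughput(sample)  # {1: 1.0, 128: 20.0}
```

Expect the curve to flatten once batches are large enough to saturate the device; past that point, bigger batches mostly cost memory and per-request latency.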

11.4. no_grad() vs Tracking Gradients

Inference should usually run with gradient tracking disabled; otherwise you spend memory and time building an autograd graph you will never use. Recent PyTorch versions also offer torch.inference_mode(), a stricter drop-in alternative to torch.no_grad() for pure-inference code.

[ ]:
[
    benchmark(batch_size=128, use_no_grad=True),
    benchmark(batch_size=128, use_no_grad=False),
]
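For a small model like this one the timing gap is often modest; the larger cost of tracking gradients is memory, because each output tensor keeps its autograd graph alive until it is freed. A quick check of what no_grad actually changes:

```python
import torch
from torch import nn

layer = nn.Linear(256, 64)
x = torch.randn(8, 256)

with torch.no_grad():
    out_free = layer(x)    # no graph is built; nothing retained for backward

out_tracked = layer(x)     # graph (and intermediates) retained until out_tracked is freed

print(out_free.requires_grad)     # False
print(out_tracked.requires_grad)  # True
print(out_tracked.grad_fn is None)  # False: a backward node is attached
```

In a serving loop, that retained graph per batch is pure waste, which is why the no_grad variant wins on both memory and time.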