1. Text Data

This matters because text pipelines fail when tokenization, padding, and batching are hidden behind too much abstraction. Focus on how raw text becomes tensors and how variable-length batches stay usable.

1.1. Tokenize and build a vocabulary

A vocabulary maps string tokens to integer ids. Reserve ids for padding and unknown words before adding words from the training corpus.

[ ]:
from collections import Counter
import re

import torch
from torch.utils.data import DataLoader, Dataset

# Sentinel tokens: PAD fills positions added to equalize sequence lengths in a
# batch; UNK stands in for words not present in the vocabulary.
PAD = '<pad>'
UNK = '<unk>'

# Toy training corpus of (text, label) rows; labels are binary ints (0 or 1).
# NOTE(review): the exact meaning of 1 vs 0 is not stated here — treat them as
# opaque class ids.
examples = [
    ('pytorch makes tensor programs readable', 1),
    ('debug the training loop before scaling', 1),
    ('the model overfits when labels are shuffled', 0),
    ('a tiny batch should run on the gpu', 1),
    ('downloads failed and the cache was empty', 0),
    ('checkpoints make experiments easier to resume', 1),
]

def tokenize(text):
    """Lowercase *text* and return its word tokens as a list of strings.

    A token is a maximal run of lowercase letters, digits, or apostrophes;
    every other character acts as a separator.
    """
    lowered = text.lower()
    return re.findall(r"[a-z0-9']+", lowered)

# Count token frequencies over the whole corpus, then fix an id assignment:
# the two sentinels come first, followed by corpus words in sorted order.
counts = Counter()
for text, _ in examples:
    counts.update(tokenize(text))

itos = [PAD, UNK]
itos.extend(sorted(counts))

stoi = {}
for index, token in enumerate(itos):
    stoi[token] = index

def encode(text):
    """Map *text* to a 1-D LongTensor of token ids, using UNK for unknown words."""
    unknown_id = stoi[UNK]
    ids = [stoi.get(token, unknown_id) for token in tokenize(text)]
    return torch.tensor(ids, dtype=torch.long)

# Quick demo: vocabulary size and an encoding that hits the UNK path.
demo = encode('debug unknown words')
print('vocab size:', len(itos))
print('encoded:', demo)

1.2. Dataset and collate

Text sequences have different lengths. A collate_fn decides how to assemble a list of examples into one batch. Here it pads every example to the length of the longest sequence in the batch and returns the original lengths so a model can ignore padded positions.

[ ]:
class TextDataset(Dataset):
    """Dataset over (text, label) rows; items are (token-id tensor, label tensor)."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        text, label = self.rows[index]
        ids = encode(text)
        target = torch.tensor(label, dtype=torch.long)
        return ids, target

def collate_padded(batch):
    """Assemble a list of (sequence, label) pairs into one padded batch.

    Returns (padded_tokens, lengths, labels): sequences are right-padded with
    the PAD id to the longest length in the batch, and the original lengths
    are kept so a model can mask out padded positions.
    """
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences], dtype=torch.long)
    width = int(lengths.max())
    padded = torch.full((len(sequences), width), stoi[PAD], dtype=torch.long)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded, lengths, torch.stack(labels)

# Build a loader over the toy corpus and inspect a single padded batch.
loader = DataLoader(TextDataset(examples), batch_size=3, shuffle=False, collate_fn=collate_padded)
first_batch = next(iter(loader))
tokens, lengths, labels = first_batch
print(tokens)
print('lengths:', lengths.tolist())
print('labels:', labels.tolist())
# Sanity checks: batch dimension matches, width equals the longest sequence.
assert tokens.shape[0] == labels.shape[0] == 3
assert tokens.shape[1] == lengths.max().item()

1.3. Move a text batch to device

The batch contains ordinary tensors. Move every tensor that participates in the forward pass to the same device.

[ ]:
# Prefer CUDA when it is available; otherwise stay on the CPU.
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Every tensor that participates in the forward pass moves to the same device.
tokens = tokens.to(device)
lengths = lengths.to(device)
labels = labels.to(device)

print(tokens.device, lengths.device, labels.device)
assert tokens.device.type == device.type