1. Calibration and Metrics
This matters because a model can rank examples well and classify them accurately while still being badly calibrated: its predicted probabilities need not match the frequencies actually observed. Focus on measuring the different failure modes instead of collapsing everything into one score.
[ ]:
import torch
from sklearn.metrics import roc_auc_score

logits = torch.tensor([2.2, 1.4, -0.7, 0.3, 1.8, -1.5])
labels = torch.tensor([1, 1, 0, 0, 1, 0])
probabilities = torch.sigmoid(logits)
predictions = (probabilities >= 0.5).long()

# Confusion-matrix counts at a fixed 0.5 threshold.
tp = int(((predictions == 1) & (labels == 1)).sum())
tn = int(((predictions == 0) & (labels == 0)).sum())
fp = int(((predictions == 1) & (labels == 0)).sum())
fn = int(((predictions == 0) & (labels == 1)).sum())
print({'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn})

# AUROC is threshold-free: it scores only the ranking induced by the probabilities.
print('auroc:', roc_auc_score(labels.numpy(), probabilities.numpy()))
1.1. Expected calibration error
Bin predictions by confidence, compare the average confidence to the empirical accuracy inside each bin, and weight each bin's gap by the fraction of samples it contains.
[ ]:
def expected_calibration_error(probabilities, labels, bins=5):
    # Confidence is the probability of the predicted class, so it is never below 0.5 here.
    confidences = torch.where(probabilities >= 0.5, probabilities, 1 - probabilities)
    predictions = (probabilities >= 0.5).long()
    edges = torch.linspace(0, 1, bins + 1)
    ece = torch.tensor(0.0)
    for start, end in zip(edges[:-1], edges[1:]):
        # Half-open bins [start, end), except the last bin, which also includes confidence 1.0.
        if end < 1:
            in_bin = (confidences >= start) & (confidences < end)
        else:
            in_bin = (confidences >= start) & (confidences <= end)
        if in_bin.any():
            bin_accuracy = (predictions[in_bin] == labels[in_bin]).float().mean()
            bin_confidence = confidences[in_bin].mean()
            # Weight the |accuracy - confidence| gap by the fraction of samples in the bin.
            ece += in_bin.float().mean() * torch.abs(bin_accuracy - bin_confidence)
    return ece
ece = expected_calibration_error(probabilities, labels)
print('ece:', round(ece.item(), 4))
assert ece.item() >= 0.0
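To see what the ECE collapses into one number, a minimal sketch (reusing probabilities and labels from above with the same five default bins) prints the per-bin confidence and accuracy that the weighted gaps are built from, essentially a text-only reliability diagram:
[ ]:
# Sketch: per-bin view of the statistics that ECE averages.
confidences = torch.where(probabilities >= 0.5, probabilities, 1 - probabilities)
predictions = (probabilities >= 0.5).long()
edges = torch.linspace(0, 1, 6)
for start, end in zip(edges[:-1], edges[1:]):
    in_bin = (confidences >= start) & ((confidences < end) if end < 1 else (confidences <= end))
    if in_bin.any():
        bin_accuracy = (predictions[in_bin] == labels[in_bin]).float().mean().item()
        bin_confidence = confidences[in_bin].mean().item()
        print(f'bin [{float(start):.1f}, {float(end):.1f}]: n={int(in_bin.sum())} '
              f'confidence={bin_confidence:.3f} accuracy={bin_accuracy:.3f}')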