Evaluation

The evaluation pipeline lives in torchgeo_bench.main and a few focused sub-modules. Each evaluation method (KNN-5, linear probe, segmentation, intrinsic dimension) consumes per-split feature embeddings or raw images and produces one EvaluationResult row per metric.

Result schema

class torchgeo_bench.main.EvaluationResult(dataset, method, metric_name, metric_value, ci_lower, ci_upper, feature_dim, best_c, best_lr, best_batch_size, n_train, n_val, n_test, seed, model, name, normalization, image_size, interpolation, partition, bands, c_range_start, c_range_stop, c_range_num, merge_val, bootstrap, fw_iou=None, precision=None, recall=None, f1=None, ece=None, rms_ce=None, mce=None, ece_ts=None, rms_ce_ts=None, mce_ts=None, temperature=None, calibration_n_bins=None)

Bases: object

Container for a single evaluation result row.

to_row()

Convert to a flat dictionary suitable for CSV/DataFrame export.
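As a rough illustration of what a flat row looks like, the sketch below uses a hypothetical, trimmed-down stand-in (MiniResult) with only a handful of the fields listed above. It assumes the real class behaves like a dataclass of scalar fields; the source does not confirm this, and the field values are made up.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MiniResult:
    """Hypothetical stand-in for EvaluationResult with a few of its fields."""

    dataset: str
    method: str
    metric_name: str
    metric_value: float
    ci_lower: Optional[float] = None
    ci_upper: Optional[float] = None

    def to_row(self) -> dict:
        # Every field is a scalar, so asdict() already yields a flat mapping
        # that csv.DictWriter or a DataFrame constructor can consume.
        return asdict(self)


# Illustrative values only; not results from any real run.
row = MiniResult("example_dataset", "knn", "accuracy", 0.91, 0.89, 0.93).to_row()
```

Because the keys follow field declaration order, the resulting CSV columns stay stable from run to run.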

Feature extraction

torchgeo_bench.main.embed_split(model, dataloader, device, verbose)

Extract feature embeddings and labels from a data split.

Bootstrap helpers

torchgeo_bench.main.bootstrap_accuracy(y_true, y_pred, n_boot=1000, ci=95.0, seed=None)

Bootstrapped accuracy with confidence interval. Returns (mean, ci_lower, ci_upper).
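A minimal sketch of a percentile bootstrap with the same argument names and return shape is shown below. The function name bootstrap_accuracy_sketch is hypothetical, and the percentile method is an assumption; the library may compute its interval differently.

```python
import random


def bootstrap_accuracy_sketch(y_true, y_pred, n_boot=1000, ci=95.0, seed=None):
    """Percentile bootstrap of accuracy; returns (mean, ci_lower, ci_upper)."""
    rng = random.Random(seed)
    # 1 where the prediction is right, 0 where it is wrong.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    stats = []
    for _ in range(n_boot):
        # Resample n items with replacement and score the resample.
        sample = rng.choices(correct, k=n)
        stats.append(sum(sample) / n)
    stats.sort()
    # Cut (100 - ci) / 2 percent off each tail of the sorted statistics.
    alpha = (100.0 - ci) / 200.0
    lo_idx = int(alpha * (n_boot - 1))
    hi_idx = int((1.0 - alpha) * (n_boot - 1))
    return sum(stats) / n_boot, stats[lo_idx], stats[hi_idx]


y_true = [0, 1, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
mean, ci_lower, ci_upper = bootstrap_accuracy_sketch(y_true, y_pred, seed=0)
```

Passing a fixed seed makes the interval reproducible, which matters when result rows from different runs are compared.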

torchgeo_bench.main.bootstrap_map(y_true, y_scores, n_boot=1000, ci=95.0, seed=None)

Bootstrap micro-averaged mean Average Precision.

KNN-5 evaluation

torchgeo_bench.main.evaluate_knn(x_train, y_train, x_test, y_test, seed, n_bootstrap, verbose=False, device='cpu', n_neighbors=5, calibration_n_bins=None)

Evaluate KNN classifier. Auto-detects single-label vs multi-label from y shape.

Returns the primary metric with bootstrap CI, a calibration dict (ece/rms_ce/mce) computed from predict_proba, and the n_bins actually used (defaults to n_neighbors + 1).
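One plausible reading of the n_neighbors + 1 default: if predict_proba returns unweighted vote fractions, a confidence can only take the k + 1 values 0/k, 1/k, ..., k/k, so k + 1 equal-width bins give each value its own bin. The sketch below illustrates this under that assumption; the ECE helper is a generic equal-width implementation, not the library's code.

```python
def expected_calibration_error(confidences, correct, n_bins):
    """ECE with equal-width bins over [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


k = 5
# Assumed: unweighted votes, so a confidence is one of j / k for j = 0..k.
possible = [j / k for j in range(k + 1)]
bin_of = [min(int(p * (k + 1)), k) for p in possible]

confidences = [1.0, 0.8, 0.6, 0.6, 0.4]
correct = [1, 1, 1, 0, 0]
ece = expected_calibration_error(confidences, correct, n_bins=k + 1)
```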

class torchgeo_bench.knn.KNNClassifier(n_neighbors=5, device='cpu', metric='l2', use_fp16=False)

Bases: object

FAISS-backed KNN classifier with single- and multi-label support.

Multi-label mode is auto-detected from the shape of y during fit(): 1-D labels → single-label, 2-D labels → multi-label.

Parameters:

n_neighbors (int) – Number of neighbours (k). Clamped to min(k, n_train) on the CPU path; faissknn does not clamp internally.
device (str) – "cpu" (default) → the FAISS CPU index. Anything else ("cuda", "cuda:0") requires faissknn with a GPU FAISS backend (installed automatically on Linux x86_64); raises an actionable error if unavailable.
metric (Literal['l2', 'ip', 'cosine']) – Distance metric — "l2" (default), "ip" (inner product), or "cosine" (cosine similarity; auto-normalizes inputs). GPU path only; CPU path always uses L2.
use_fp16 (bool) – Use fp16 for GPU index computation (~30 % speedup on Ampere+). GPU path only; ignored on CPU.

fit(X, y)

Index training features and store labels.

Parameters:

X (ndarray) – (n_samples, n_features) float32 feature matrix.
y (ndarray) – (n_samples,) int single-label or (n_samples, n_classes) multi-hot multi-label.

property multi_label: bool

Whether the classifier is in multi-label mode.

predict(X)

Predict labels for X.

Returns single-label class indices (n_samples,) or multi-label binary predictions (n_samples, n_classes).

predict_proba(X)

Predict per-class probabilities (n_samples, n_classes).

Linear probing

torchgeo_bench.main.evaluate_logistic(x_train, y_train, x_val, y_val, x_test, y_test, c_values, seed, n_bootstrap, merge_val, device, verbose=False, calibration_n_bins=15, temp_scale=True)

Sweep C values, retrain, and evaluate. Auto-detects single/multi-label from y shape.

Returns the primary metric with bootstrap CI, the selected C, a calibration dict from raw predict_proba on the test split, and a second dict with temperature-scaled calibration plus the fitted temperature (all None when temp_scale=False).
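For readers unfamiliar with temperature scaling, the sketch below fits a single scalar T that divides the logits so as to minimize negative log-likelihood on held-out data. The grid search, the helper names, and the toy logits are all assumptions made for illustration; the source does not say which optimizer or which split the library uses.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(row, temperature)[label])
        for row, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid (0.25 .. 10.0 by default) with lowest NLL."""
    grid = grid or [0.25 * i for i in range(1, 41)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Toy held-out logits: very confident everywhere, but wrong on the last sample.
held_out_logits = [[8.0, 0.0], [8.0, 0.0], [0.0, 8.0], [8.0, 0.0]]
held_out_labels = [0, 0, 1, 1]
temperature = fit_temperature(held_out_logits, held_out_labels)
```

A fitted temperature above 1 softens overconfident probabilities without changing the predicted class, so accuracy is unaffected while the calibration metrics can improve.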

class torchgeo_bench.linear.LogisticRegression(C=1.0, max_iter=1000, lr=1.0, batch_size=1024, solver='lbfgs', tol=0.0001, patience=1, random_state=None, device=None, verbose=False, use_tf32=True, multi_label=False)

Bases: object

Logistic regression with identical objective scaling to sklearn.

Supports both single-label (softmax cross-entropy) and multi-label (sigmoid BCE) classification via the multi_label flag.

Objective:

loss = (1/n) * CrossEntropy + (1/n) * 0.5/C * ||W||^2
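To see why this scaling matches scikit-learn, assume that CrossEntropy above denotes the sum of per-sample losses \ell_i over the n training samples (an interpretation, not stated in the source). The objective is then scikit-learn's L2-penalized objective multiplied by the positive constant 1/(nC), so both share the same minimizer for a given C:

```latex
\mathcal{L}(W)
  = \frac{1}{n}\sum_{i=1}^{n} \ell_i(W) + \frac{1}{n}\cdot\frac{0.5}{C}\,\lVert W\rVert_2^2
  = \frac{1}{nC}\left( C\sum_{i=1}^{n} \ell_i(W) + \frac{1}{2}\lVert W\rVert_2^2 \right)
```

Dividing by n keeps the loss magnitude roughly independent of dataset size, which tends to make a single learning rate or tolerance behave more consistently across datasets.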

Differences from the previous version (speed-oriented but same math):

LBFGS uses its internal iteration loop (one external .step).
Adam uses on-device manual batching (no DataLoader overhead).
Inference paths use torch.inference_mode.
Optional TF32 for CUDA matmul (single linear layer still benefits slightly).
Coefficients and intercept are exposed via properties (no copying at fit time).

Args match previous class unless noted.

fit(X, y)

Fit the logistic regression model on training data.

Parameters:

X (Tensor) – Feature matrix of shape (n_samples, n_features).
y (Tensor) – Labels — (n_samples,) for single-label or (n_samples, n_classes) for multi-label.

Returns:

Self, for method chaining.

Raises:

TypeError – If X or y is not a torch.Tensor.
ValueError – If shapes are invalid or data is empty.

Return type:

Self

property coef_: ndarray

Return learned weight matrix as a NumPy array of shape (n_classes, n_features).

property intercept_: ndarray

Return learned bias vector as a NumPy array of shape (n_classes,).

predict(X)

Predict class labels (single-label) or binary indicators (multi-label).

Parameters:

X (Tensor) – Feature matrix of shape (n_samples, n_features).

Returns:

Predicted labels as a NumPy array.

Return type:

ndarray

predict_proba(X)

Predict per-class probabilities.

Parameters:

X (Tensor) – Feature matrix of shape (n_samples, n_features).

Returns:

Probability matrix of shape (n_samples, n_classes).

Return type:

ndarray

decision_function(X)

Compute raw logits (decision function values).

Parameters:

X (Tensor) – Feature matrix of shape (n_samples, n_features).

Returns:

Logits array of shape (n_samples, n_classes).

Return type:

ndarray

Segmentation

torchgeo_bench.main.evaluate_segmentation(model, train_loader, val_loader, test_loader, cfg, num_classes, device, collect_preds=False)

Evaluate segmentation performance using a frozen-backbone segmentation probe.

Trains a lightweight segmentation head on top of the frozen backbone and evaluates mIoU on the test split. Optionally pre-caches backbone features for faster training across epochs.

Parameters:

model (Module) – Frozen backbone model.
train_loader (DataLoader) – Training DataLoader.
val_loader (DataLoader) – Validation DataLoader.
test_loader (DataLoader) – Test DataLoader.
cfg (DictConfig) – Full Hydra config.
num_classes (int) – Number of segmentation classes.
device (device) – Torch device.
collect_preds (bool) – If True, collect and return test predictions as (N, H, W) tensor.

Returns:

Tuple of (metrics_dict, feature_dim, None, None, preds_or_None). preds_or_None is None when collect_preds is False.

Return type:

tuple[dict[str, float], int, float | None, int | None, Tensor | None]
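As a reminder of what the mIoU metric measures, the sketch below computes it over flattened pixel labels. It assumes pixels labelled 255 are ignored (matching the ignore_index default of SegmentationSolver) and that classes absent from both prediction and target are left out of the mean; the library's exact averaging rule is not documented here and may differ.

```python
def mean_iou(preds, targets, num_classes, ignore_index=255):
    """Mean intersection-over-union across classes, from flat label lists."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred, target in zip(preds, targets):
        if target == ignore_index:
            continue  # unlabeled pixel: contributes to no class
        if pred == target:
            intersection[target] += 1
            union[target] += 1
        else:
            # A miss enlarges the union of both the predicted and the true class.
            union[pred] += 1
            union[target] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


preds = [0, 0, 1, 1, 2, 2]
targets = [0, 1, 1, 1, 2, 255]
miou = mean_iou(preds, targets, num_classes=3)
```

Here the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18.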

class torchgeo_bench.segmentation_probe.SegmentationProbe(backbone, layer_names, num_classes, freeze_backbone=True, head_type='linear', hidden_dim=None)

Bases: Module

Multi-scale segmentation probe that hooks into backbone feature layers.

Backbone layers are tapped via forward hooks. Features are passed to a decoder head (LinearHead, ConvBlockHead, FPNHead, or DPTHead) that produces per-pixel class logits.

Layer ordering convention (applies to all head types):

Coarse-to-fine — deepest / lowest-resolution layer first.
Example for ResNet: ["layer4", "layer3", "layer2", "layer1"].
For DPTHead this means index 0 = coarsest, which is also what the DPT cascade expects.
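The ordering convention can be checked mechanically by sorting layers by the spatial size of their feature maps. The helper below is a hypothetical sketch, not part of the package; the shapes are what a standard ResNet yields for a 224 x 224 input (strides 4, 8, 16, 32 for layer1 through layer4).

```python
def coarse_to_fine(feature_shapes):
    """Order layer names from lowest to highest spatial resolution."""
    return sorted(
        feature_shapes,
        key=lambda name: feature_shapes[name][0] * feature_shapes[name][1],
    )


# (height, width) of each tapped feature map for a 224 x 224 input.
resnet_shapes = {
    "layer1": (56, 56),
    "layer2": (28, 28),
    "layer3": (14, 14),
    "layer4": (7, 7),
}
layer_names = coarse_to_fine(resnet_shapes)
```

For an unfamiliar backbone, the per-layer shapes could be obtained from a single forward pass on a dummy input before building the list.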

Parameters:

backbone (Module) – Feature extractor. May be a raw backbone or a BenchModel wrapper (backbone.* prefixes are stripped automatically).
layer_names (list[str]) – Ordered list of layer names to hook (coarse-to-fine).
num_classes (int) – Number of segmentation output classes.
freeze_backbone (bool) – If True (default), backbone parameters are frozen and the backbone runs in eval mode during inference.
head_type (str) – Decoder architecture — one of "linear", "conv_block", "fpn", "dpt", "patch_linear".
hidden_dim (int | None) – Hidden channel dimension for conv_block, fpn, and dpt heads (default 256).

extract_segmentation_features(dataloader, cache_dtype=torch.float16)

Run the frozen backbone once over dataloader and cache features.

Parameters:

dataloader (DataLoader) – DataLoader that yields dict or (image, mask) batches.
cache_dtype (dtype) – Storage dtype for cached feature tensors. Use torch.float16 (default) to halve RAM, or torch.float32 for full precision.

Returns:

A CachedFeaturesDataset with one entry per sample.

Return type:

CachedFeaturesDataset

forward(x)

Compute segmentation logits from input images.

Parameters:

x (Tensor) – Input tensor of shape (B, C, H, W).

Returns:

Logits tensor of shape (B, num_classes, H, W).

Return type:

Tensor

class torchgeo_bench.segmentation_task.SegmentationSolver(model, num_classes, lr=0.001, weight_decay=0.0, device='cuda', criterion=None, lr_scheduler='cosine', ignore_index=255)

Bases: object

A lightweight trainer for the SegmentationProbe.

fit(train_loader, val_loader=None, epochs=10, verbose=True)

Train the segmentation probe.

Parameters:

train_loader (DataLoader) – Training data loader.
val_loader (DataLoader | None) – Optional validation data loader for per-epoch mIoU logging.
epochs (int) – Number of training epochs.
verbose (bool) – Whether to show progress bars and epoch logs.

Returns:

Val mIoU from the final epoch if val_loader is given, else None.

Return type:

float | None

evaluate(dataloader, collect_preds=False)

Evaluate the model on a dataloader and return segmentation metrics.

Parameters:

dataloader (DataLoader) – Evaluation data loader.
collect_preds (bool) – If True, also return predicted class maps (N, H, W) int64.

Returns:

Dict of metric name → value, or (metrics_dict, preds_tensor) when collect_preds=True.

Return type:

dict[str, float] | tuple[dict[str, float], Tensor]

fit_cached(train_cache, val_cache=None, batch_size=64, epochs=10, verbose=True, gpu_train=None, gpu_val=None)

Train the segmentation head on pre-cached backbone features.

The backbone is not called during training — cached features are fed directly to self.model.head, which is the only component that runs a forward/backward pass.

The entire feature cache is pre-moved to the GPU as contiguous tensors (GPUTensorCache), eliminating per-batch CPU→GPU DMA transfers and torch.stack calls.

If gpu_train is provided, that pre-built cache is used directly, allowing callers (e.g. an HPO loop) to transfer the cache once and reuse it across many calls.

Parameters:

train_cache (CachedFeaturesDataset) – Pre-extracted training features from SegmentationProbe.extract_segmentation_features().
val_cache (CachedFeaturesDataset | None) – Optional validation cache for per-epoch mIoU logging.
batch_size (int) – Batch size for iterating over cached data.
epochs (int) – Number of training epochs.
verbose (bool) – Whether to show progress bars and epoch logs.
gpu_train (GPUTensorCache | None) – Optional pre-built GPU cache for training. If provided, the GPU transfer is skipped.
gpu_val (GPUTensorCache | None) – Optional pre-built GPU cache for validation. Used only when gpu_train is also provided.

Returns:

Val mIoU from the final epoch if val_cache is given, else None.

Return type:

float | None

evaluate_cached(cache, batch_size=64, collect_preds=False)

Evaluate on a CachedFeaturesDataset.

The cache is moved to GPU as a GPUTensorCache for zero per-batch host→device transfers.

Parameters:

cache (CachedFeaturesDataset) – Pre-extracted features (output of SegmentationProbe.extract_segmentation_features()).
batch_size (int) – Batch size for iterating over the cache.
collect_preds (bool) – If True, also return predicted class maps (N, H, W) int64.

Returns:

Dict of metric name → value, or (metrics_dict, preds_tensor) when collect_preds=True.

Return type:

dict[str, float] | tuple[dict[str, float], Tensor]

Intrinsic dimension

See torchgeo_bench.intrinsic_dim for the standalone module API; the orchestration function lives in torchgeo_bench.main:

torchgeo_bench.main.evaluate_intrinsic_dim(splits, estimators, selected_splits, device, max_samples, seed, common_meta, feature_dim, n_counts, verbose=False)

Compute intrinsic-dimension metrics over selected splits and return CSV rows.

Each (split, estimator) yields one row with method="intrinsic_dim" and metric_name=f"id_{estimator}_{split}".
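The naming scheme can be sketched as follows. The helper name, the estimator names ("twonn", "mle"), and the numeric values are placeholders invented for illustration; only the method and metric_name patterns come from the description above.

```python
def intrinsic_dim_rows(estimates, common_meta):
    """Build one row per (split, estimator) pair from precomputed estimates."""
    rows = []
    for (split, estimator), value in estimates.items():
        rows.append(
            {
                **common_meta,  # shared columns such as dataset and model
                "method": "intrinsic_dim",
                "metric_name": f"id_{estimator}_{split}",
                "metric_value": value,
            }
        )
    return rows


# Placeholder numbers, not measured values.
estimates = {("train", "twonn"): 12.5, ("test", "mle"): 14.0}
rows = intrinsic_dim_rows(estimates, {"dataset": "example_dataset"})
```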

Result I/O

torchgeo_bench.main.append_rows_atomic(path, rows)

Append rows to a CSV atomically, with advisory file lock and schema healing.

Behavior:

Empty/missing file: writes the header derived from rows and the rows.
Existing file whose header matches rows[0] keys exactly: appends rows without rewriting the header (fast path).
Existing file with a different schema (e.g. EvaluationResult gained a field since the file was first written): the file is rewritten with the unioned schema so every value lives under a named column instead of being silently stuffed into an unnamed position.

Parameters:

path (str) – Output CSV path; created if missing.
rows (list[dict]) – List of dicts to append. All dicts should share the same keys.
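The three behaviors can be sketched with the standard library alone. This is a simplified stand-in (append_rows_sketch and read_header are hypothetical names): it leaves out the advisory file lock, and the rewrite path goes through a temporary file plus os.replace, which may not match how the library achieves atomicity.

```python
import csv
import os
import tempfile


def read_header(path):
    """Return the CSV header as a list, or None for a missing or empty file."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    with open(path, newline="") as f:
        return next(csv.reader(f), None)


def append_rows_sketch(path, rows):
    if not rows:
        return
    new_fields = list(rows[0])
    header = read_header(path)
    if header == new_fields:
        # Fast path: schema matches, so append without touching the header.
        with open(path, "a", newline="") as f:
            csv.DictWriter(f, fieldnames=new_fields).writerows(rows)
        return
    existing = []
    if header is not None:
        with open(path, newline="") as f:
            existing = list(csv.DictReader(f))
    # Union schema: old columns first, then any columns new to this batch.
    old_fields = header or []
    fields = old_fields + [k for k in new_fields if k not in old_fields]
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as f:
        # restval fills columns that a row does not have with an empty string.
        writer = csv.DictWriter(f, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(existing)
        writer.writerows(rows)
    # Swap the finished file into place in a single step.
    os.replace(tmp_path, path)
```

Writing to a temporary file in the same directory keeps the final os.replace on one filesystem, so readers see either the old file or the complete new one.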