Evaluation

The evaluation pipeline lives in torchgeo_bench.main and a few focused sub-modules. Each evaluation method (KNN-5, linear probe, segmentation, intrinsic dimension) consumes per-split feature embeddings or raw images and produces one EvaluationResult row per metric.

Result schema

class torchgeo_bench.main.EvaluationResult(dataset, method, metric_name, metric_value, ci_lower, ci_upper, feature_dim, best_c, best_lr, best_batch_size, n_train, n_val, n_test, seed, model, name, normalization, image_size, interpolation, partition, bands, c_range_start, c_range_stop, c_range_num, merge_val, bootstrap, fw_iou=None, precision=None, recall=None, f1=None, ece=None, rms_ce=None, mce=None, ece_ts=None, rms_ce_ts=None, mce_ts=None, temperature=None, calibration_n_bins=None)

Bases: object

Container for a single evaluation result row.

to_row()

Convert to a flat dictionary suitable for CSV/DataFrame export.

Feature extraction

torchgeo_bench.main.embed_split(model, dataloader, device, verbose)

Extract feature embeddings and labels from a data split.

Bootstrap helpers

torchgeo_bench.main.bootstrap_accuracy(y_true, y_pred, n_boot=1000, ci=95.0, seed=None)

Bootstrapped accuracy with confidence interval. Returns (mean, ci_lower, ci_upper).

torchgeo_bench.main.bootstrap_map(y_true, y_scores, n_boot=1000, ci=95.0, seed=None)

Bootstrap micro-averaged mean Average Precision.
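The bootstrap helpers above can be illustrated with a minimal, standard-library sketch of a percentile bootstrap for accuracy. This is not the library's implementation: the name bootstrap_accuracy_sketch, the percentile method, and the index rounding are assumptions chosen for illustration. Only the argument names and the (mean, ci_lower, ci_upper) return shape follow the documented signature.

```python
import random


def bootstrap_accuracy_sketch(y_true, y_pred, n_boot=1000, ci=95.0, seed=None):
    """Percentile-bootstrap accuracy; returns (mean, ci_lower, ci_upper).

    Hypothetical stand-in for illustration, not torchgeo_bench's code.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    rng = random.Random(seed)
    # 1 where the prediction is right, 0 where it is wrong.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)

    scores = []
    for _ in range(n_boot):
        # Resample n outcomes with replacement and score the resample.
        resample = rng.choices(correct, k=n)
        scores.append(sum(resample) / n)

    scores.sort()
    alpha = (100.0 - ci) / 200.0  # tail mass on each side, e.g. 0.025
    lower = scores[int(alpha * (n_boot - 1))]
    upper = scores[int((1.0 - alpha) * (n_boot - 1))]
    return sum(scores) / n_boot, lower, upper


mean, ci_lower, ci_upper = bootstrap_accuracy_sketch(
    [0, 1, 1, 0, 1], [0, 1, 0, 0, 1], n_boot=200, seed=0
)
```

Passing a fixed seed makes the resampling reproducible, so repeated runs report the same interval.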

KNN-5 evaluation

torchgeo_bench.main.evaluate_knn(x_train, y_train, x_test, y_test, seed, n_bootstrap, verbose=False, device='cpu', n_neighbors=5, calibration_n_bins=None)

Evaluate KNN classifier. Auto-detects single-label vs multi-label from y shape.

Returns the primary metric with bootstrap CI, a calibration dict (ece/rms_ce/mce) computed from predict_proba, and the n_bins actually used (defaults to n_neighbors + 1).
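One plausible reading of the n_neighbors + 1 default is that, with k equally weighted neighbours, a vote fraction can only take the k + 1 values 0/k, 1/k, ..., k/k. That reading is an assumption, as is the equal-width binning below; the sketch only shows how an expected calibration error (ECE) can be computed from confidences and correctness flags, and the function name is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins):
    """ECE over equal-width bins on [0, 1].

    confidences: top-class probability per sample.
    correct: 1 if the top-class prediction was right, else 0.
    Illustrative only; the library's binning scheme may differ.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and equal length")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += ok
        count[idx] += 1

    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            gap = abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
            ece += (count[b] / n) * gap  # weight each bin by its share of samples
    return ece


k = 5
# Vote fractions from a hypothetical 5-NN classifier.
ece = expected_calibration_error([1.0, 0.8, 0.6, 0.6], [1, 1, 0, 1], n_bins=k + 1)
```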

class torchgeo_bench.knn.KNNClassifier(n_neighbors=5, device='cpu', metric='l2', use_fp16=False)

Bases: object

FAISS-backed KNN classifier with single- and multi-label support.

Multi-label mode is auto-detected from the shape of y during fit(): 1-D labels → single-label, 2-D labels → multi-label.

Parameters:
  • n_neighbors (int) – Number of neighbours (k). Clamped to min(k, n_train) on the CPU path; faissknn does not clamp internally.

  • device (str) – "cpu" (default) → faiss-cuda-cu128 CPU index. Anything else ("cuda", "cuda:0") requires the cuda extra (faissknn); raises ImportError if not installed.

  • metric (Literal['l2', 'ip', 'cosine']) – Distance metric — "l2" (default), "ip" (inner product), or "cosine" (cosine similarity; auto-normalizes inputs). GPU path only; CPU path always uses L2.

  • use_fp16 (bool) – Use fp16 for GPU index computation (~30 % speedup on Ampere+). GPU path only; ignored on CPU.

fit(X, y)

Index training features and store labels.

Parameters:
  • X (ndarray) – (n_samples, n_features) float32 feature matrix.

  • y (ndarray) – (n_samples,) int single-label or (n_samples, n_classes) multi-hot multi-label.

property multi_label: bool

Whether the classifier is in multi-label mode.

predict(X)

Predict labels for X.

Returns single-label class indices (n_samples,) or multi-label binary predictions (n_samples, n_classes).

predict_proba(X)

Predict per-class probabilities (n_samples, n_classes).

Linear probing

torchgeo_bench.main.evaluate_logistic(x_train, y_train, x_val, y_val, x_test, y_test, c_values, seed, n_bootstrap, merge_val, device, verbose=False, calibration_n_bins=15, temp_scale=True)

Sweep C values, retrain, and evaluate. Auto-detects single/multi-label from y shape.

Returns the primary metric with bootstrap CI, the selected C, a calibration dict from raw predict_proba on the test split, and a second dict with temperature-scaled calibration plus the fitted temperature (all None when temp_scale=False).
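Temperature scaling divides the logits by a single scalar T before the softmax and picks T to minimise the negative log-likelihood on held-out data. The sketch below fits T by grid search in pure Python. The grid search, the function names, and the choice of data used for fitting are assumptions; this page does not describe how the library fits the temperature.

```python
import math


def mean_nll(logits, labels, temperature):
    """Mean negative log-likelihood of softmax(logits / temperature)."""
    total = 0.0
    for row, y in zip(logits, labels):
        scaled = [z / temperature for z in row]
        m = max(scaled)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_z - scaled[y]
    return total / len(labels)


def fit_temperature(logits, labels, grid=None):
    """Return the grid value of T with the lowest mean NLL (illustrative)."""
    if grid is None:
        grid = [0.05 * i for i in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(grid, key=lambda t: mean_nll(logits, labels, t))


# Over-confident toy logits: 3 of 4 predictions are right, all very certain.
logits = [[10.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 0.0]]
labels = [0, 0, 1, 1]
temperature = fit_temperature(logits, labels)
```

A fitted T above 1 softens over-confident predictions, and a T below 1 sharpens under-confident ones. The predicted class is unchanged because dividing by a positive scalar preserves the argmax.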

class torchgeo_bench.linear.LogisticRegression(C=1.0, max_iter=1000, lr=1.0, batch_size=1024, solver='lbfgs', tol=0.0001, patience=1, random_state=None, device=None, verbose=False, use_tf32=True, multi_label=False)

Bases: object

Logistic regression with identical objective scaling to sklearn.

Supports both single-label (softmax cross-entropy) and multi-label (sigmoid BCE) classification via the multi_label flag.

Objective:

loss = (1/n) * CrossEntropy + (1/n) * 0.5/C * ||W||^2
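To see why this scaling matches sklearn, read CrossEntropy as the per-sample loss summed over the n training samples (an assumption about the notation above). The objective is then sklearn's L2-penalised objective divided by the positive constant Cn, so both have the same minimiser:

```latex
\mathcal{L}(W, b)
  = \frac{1}{n}\sum_{i=1}^{n} \ell_i(W, b) + \frac{1}{n}\cdot\frac{0.5}{C}\,\lVert W \rVert^2
  = \frac{1}{C\,n}\left( C \sum_{i=1}^{n} \ell_i(W, b) + \frac{1}{2}\lVert W \rVert^2 \right)
```

Here \ell_i is the cross-entropy of sample i, and only W appears in the penalty, as in the objective above.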

Differences from the previous version (speed-oriented but same math):

  • LBFGS uses its internal iteration loop (one external .step).

  • Adam uses on-device manual batching (no DataLoader overhead).

  • Inference paths use torch.inference_mode.

  • Optional TF32 for CUDA matmul (single linear layer still benefits slightly).

  • Coefficients and intercept are exposed via properties (no copying at fit time).

Args match previous class unless noted.

fit(X, y)

Fit the logistic regression model on training data.

Parameters:
  • X (Tensor) – Feature matrix of shape (n_samples, n_features).

  • y (Tensor) – Labels — (n_samples,) for single-label or (n_samples, n_classes) for multi-label.

Returns:

Self, for method chaining.

Raises:
  • TypeError – If X or y is not a torch.Tensor.

  • ValueError – If shapes are invalid or data is empty.

Return type:

Self

property coef_: ndarray

Return learned weight matrix as a NumPy array of shape (n_classes, n_features).

property intercept_: ndarray

Return learned bias vector as a NumPy array of shape (n_classes,).

predict(X)

Predict class labels (single-label) or binary indicators (multi-label).

Parameters:

X (Tensor) – Feature matrix of shape (n_samples, n_features).

Returns:

Predicted labels as a NumPy array.

Return type:

ndarray

predict_proba(X)

Predict per-class probabilities.

Parameters:

X (Tensor) – Feature matrix of shape (n_samples, n_features).

Returns:

Probability matrix of shape (n_samples, n_classes).

Return type:

ndarray

decision_function(X)

Compute raw logits (decision function values).

Parameters:

X (Tensor) – Feature matrix of shape (n_samples, n_features).

Returns:

Logits array of shape (n_samples, n_classes).

Return type:

ndarray

Segmentation

torchgeo_bench.main.evaluate_segmentation(model, train_loader, val_loader, test_loader, cfg, num_classes, device, collect_preds=False)

Evaluate segmentation performance using a frozen-backbone segmentation probe.

Trains a lightweight segmentation head on top of the frozen backbone and evaluates mIoU on the test split. Optionally pre-caches backbone features for faster training across epochs.

Parameters:
  • model (Module) – Frozen backbone model.

  • train_loader (DataLoader) – Training DataLoader.

  • val_loader (DataLoader) – Validation DataLoader.

  • test_loader (DataLoader) – Test DataLoader.

  • cfg (DictConfig) – Full Hydra config.

  • num_classes (int) – Number of segmentation classes.

  • device (device) – Torch device.

  • collect_preds (bool) – If True, collect and return test predictions as (N, H, W) tensor.

Returns:

Tuple of (metrics_dict, feature_dim, None, None, preds_or_None). preds_or_None is None when collect_preds is False.

Return type:

tuple[dict[str, float], int, float | None, int | None, Tensor | None]

class torchgeo_bench.segmentation_probe.SegmentationProbe(backbone, layer_names, num_classes, freeze_backbone=True, head_type='linear', hidden_dim=None)

Bases: Module

Multi-scale segmentation probe that hooks into backbone feature layers.

Backbone layers are tapped via forward hooks. Features are passed to a decoder head (LinearHead, ConvBlockHead, FPNHead, or DPTHead) that produces per-pixel class logits.

Layer ordering convention (applies to all head types):
  • Coarse-to-fine — deepest / lowest-resolution layer first.

  • Example for ResNet: ["layer4", "layer3", "layer2", "layer1"].

  • For DPTHead this means index 0 = coarsest, which is also what the DPT cascade expects.

Parameters:
  • backbone (Module) – Feature extractor. May be a raw backbone or a BenchModel wrapper (backbone.* prefixes are stripped automatically).

  • layer_names (list[str]) – Ordered list of layer names to hook (coarse-to-fine).

  • num_classes (int) – Number of segmentation output classes.

  • freeze_backbone (bool) – If True (default), backbone parameters are frozen and the backbone runs in eval mode during inference.

  • head_type (str) – Decoder architecture — one of "linear", "conv_block", "fpn", "dpt", "patch_linear".

  • hidden_dim (int | None) – Hidden channel dimension for conv_block, fpn, and dpt heads (default 256).

extract_segmentation_features(dataloader, cache_dtype=torch.float16)

Run the frozen backbone once over dataloader and cache features.

Parameters:
  • dataloader (DataLoader) – DataLoader that yields dict or (image, mask) batches.

  • cache_dtype (dtype) – Storage dtype for cached feature tensors. Use torch.float16 (default) to halve RAM, or torch.float32 for full precision.

Returns:

A CachedFeaturesDataset with one entry per sample.

Return type:

CachedFeaturesDataset
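The memory trade-off behind cache_dtype is simple arithmetic: a float16 element takes 2 bytes and a float32 element takes 4. The sketch below estimates the cache size for one hooked layer. The sample count and feature-map shape are made-up numbers, and cache_bytes is a hypothetical helper, not part of the library.

```python
def cache_bytes(n_samples, channels, height, width, bytes_per_element):
    """Bytes needed to store one (channels, height, width) feature map per sample."""
    return n_samples * channels * height * width * bytes_per_element


# Hypothetical cache: 10,000 samples, one 256 x 56 x 56 feature map each.
fp32_bytes = cache_bytes(10_000, 256, 56, 56, bytes_per_element=4)
fp16_bytes = cache_bytes(10_000, 256, 56, 56, bytes_per_element=2)

print(f"float32: {fp32_bytes / 2**30:.1f} GiB")
print(f"float16: {fp16_bytes / 2**30:.1f} GiB")
```

With several hooked layers the totals add up per layer, so the saving from float16 grows with the number of layers cached.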

forward(x)

Compute segmentation logits from input images.

Parameters:

x (Tensor) – Input tensor of shape (B, C, H, W).

Returns:

Logits tensor of shape (B, num_classes, H, W).

Return type:

Tensor

class torchgeo_bench.segmentation_task.SegmentationSolver(model, num_classes, lr=0.001, weight_decay=0.0, device='cuda', criterion=None, lr_scheduler='cosine', ignore_index=255)

Bases: object

A lightweight trainer for the SegmentationProbe.

fit(train_loader, val_loader=None, epochs=10, verbose=True)

Train the segmentation probe.

Parameters:
  • train_loader (DataLoader) – Training data loader.

  • val_loader (DataLoader | None) – Optional validation data loader for per-epoch mIoU logging.

  • epochs (int) – Number of training epochs.

  • verbose (bool) – Whether to show progress bars and epoch logs.

Returns:

Val mIoU from the final epoch if val_loader is given, else None.

Return type:

float | None

evaluate(dataloader, collect_preds=False)

Evaluate the model on a dataloader and return segmentation metrics.

Parameters:
  • dataloader (DataLoader) – Evaluation data loader.

  • collect_preds (bool) – If True, also return predicted class maps (N, H, W) int64.

Returns:

Dict of metric name → value, or (metrics_dict, preds_tensor) when collect_preds=True.

Return type:

dict[str, float] | tuple[dict[str, float], Tensor]

fit_cached(train_cache, val_cache=None, batch_size=64, epochs=10, verbose=True, gpu_train=None, gpu_val=None)

Train the segmentation head on pre-cached backbone features.

The backbone is not called during training — cached features are fed directly to self.model.head, which is the only component that runs a forward/backward pass.

The entire feature cache is pre-moved to the GPU as contiguous tensors (GPUTensorCache), eliminating per-batch CPU→GPU DMA transfers and torch.stack calls.

If gpu_train is provided, that pre-built cache is used directly, allowing callers (e.g. an HPO loop) to transfer the cache once and reuse it across many calls.

Parameters:
  • train_cache (CachedFeaturesDataset) – Pre-extracted training features from SegmentationProbe.extract_segmentation_features().

  • val_cache (CachedFeaturesDataset | None) – Optional validation cache for per-epoch mIoU logging.

  • batch_size (int) – Batch size for iterating over cached data.

  • epochs (int) – Number of training epochs.

  • verbose (bool) – Whether to show progress bars and epoch logs.

  • gpu_train (GPUTensorCache | None) – Optional pre-built GPU cache for training. If provided, the GPU transfer is skipped.

  • gpu_val (GPUTensorCache | None) – Optional pre-built GPU cache for validation. Used only when gpu_train is also provided.

Returns:

Val mIoU from the final epoch if val_cache is given, else None.

Return type:

float | None

evaluate_cached(cache, batch_size=64, collect_preds=False)

Evaluate on a CachedFeaturesDataset.

The cache is moved to GPU as a GPUTensorCache for zero per-batch host→device transfers.

Parameters:
  • cache (CachedFeaturesDataset) – Pre-extracted features (output of SegmentationProbe.extract_segmentation_features()).

  • batch_size (int) – Batch size for iterating over the cache.

  • collect_preds (bool) – If True, also return predicted class maps (N, H, W) int64.

Returns:

Dict of metric name → value, or (metrics_dict, preds_tensor) when collect_preds=True.

Return type:

dict[str, float] | tuple[dict[str, float], Tensor]

Intrinsic dimension

See torchgeo_bench.intrinsic_dim for the standalone module API; the orchestration function lives in torchgeo_bench.main:

torchgeo_bench.main.evaluate_intrinsic_dim(splits, estimators, selected_splits, device, max_samples, seed, common_meta, feature_dim, n_counts, verbose=False)

Compute intrinsic-dimension metrics over selected splits and return CSV rows.

Each (split, estimator) yields one row with method="intrinsic_dim" and metric_name=f"id_{estimator}_{split}".
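The row naming convention can be sketched as follows. The helper intrinsic_dim_rows, the estimator name "twonn", and the metadata keys are placeholders for illustration. Only method="intrinsic_dim" and the metric_name pattern come from the description above.

```python
def intrinsic_dim_rows(values, common_meta):
    """Build one flat row per (split, estimator) pair.

    values: mapping of (split, estimator) -> estimated intrinsic dimension.
    common_meta: fields shared by every row (dataset, seed, ...).
    """
    rows = []
    for (split, estimator), value in values.items():
        row = dict(common_meta)  # copy so rows do not share state
        row.update(
            method="intrinsic_dim",
            metric_name=f"id_{estimator}_{split}",
            metric_value=value,
        )
        rows.append(row)
    return rows


rows = intrinsic_dim_rows(
    {("train", "twonn"): 12.5, ("test", "twonn"): 11.0},
    {"dataset": "demo", "seed": 0},
)
```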

Result I/O

torchgeo_bench.main.append_rows_atomic(path, rows)

Append rows to a CSV atomically, with advisory file lock and schema healing.

Behavior:

  • Empty/missing file: writes the header derived from rows and the rows.

  • Existing file whose header matches rows[0] keys exactly: appends rows without rewriting the header (fast path).

  • Existing file with a different schema (e.g. EvaluationResult gained a field since the file was first written): the file is rewritten with the unioned schema so every value lives under a named column instead of being silently stuffed into an unnamed position.

Parameters:
  • path (str) – Output CSV path; created if missing.

  • rows (list[dict]) – List of dicts to append. All dicts should share the same keys.
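The three behaviours listed above can be approximated with the standard library alone. The sketch below is not the library's implementation: the function name, the sidecar ".lock" file, the temporary-file rewrite, and the POSIX-only fcntl lock are all assumptions, and it does not handle existing files whose data rows are wider than their header.

```python
import csv
import fcntl  # POSIX-only advisory locking
import os
import tempfile


def append_rows_sketch(path, rows):
    """Append dict rows to a CSV, rewriting it when the schema changes."""
    if not rows:
        return
    new_keys = list(rows[0].keys())

    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # serialise concurrent writers
        try:
            header, old_rows = [], []
            if os.path.exists(path) and os.path.getsize(path) > 0:
                with open(path, newline="") as f:
                    reader = csv.DictReader(f)
                    header = list(reader.fieldnames or [])
                    if header != new_keys:
                        old_rows = list(reader)  # needed only for a rewrite

            if header and header == new_keys:
                # Fast path: same schema, append without touching the header.
                with open(path, "a", newline="") as f:
                    csv.DictWriter(f, fieldnames=new_keys).writerows(rows)
                return

            # Missing/empty file or schema change: write the union schema to
            # a temporary file, then swap it into place in one step.
            union = header + [k for k in new_keys if k not in header]
            directory = os.path.dirname(os.path.abspath(path))
            fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
            with os.fdopen(fd, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=union, restval="")
                writer.writeheader()
                writer.writerows(old_rows)
                writer.writerows(rows)
            os.replace(tmp_path, path)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Writing to a temporary file in the same directory and then calling os.replace means readers see either the old file or the fully rewritten one, never a partial write. Old rows get an empty string in any column they did not originally have.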