Evaluation methodology#

This page describes the evaluation methodology used by torchgeo_bench.main for each supported task type. In all cases the backbone model is kept frozen — the benchmark measures the quality of learned representations, not end-to-end fine-tuning performance. The model is reinstantiated fresh for each dataset because num_channels (and therefore the input convolution) varies across datasets and band selections.

Overview#

The benchmark loads a pre-trained backbone and one or more geospatial datasets. Depending on whether a dataset provides per-pixel masks (segmentation) or per-image labels (classification), a different evaluation path is taken:

Dataset type

Methods

Primary metric

Classification (single-label)

knn5, linear

accuracy

Classification (multi-label)

knn5, linear

micro-mAP

Segmentation

seg-{linear,conv_block,fpn,dpt}

mIoU

Optionally, intrinsic-dimension metrics (torchgeo_bench.intrinsic_dim) can be emitted alongside the standard classification rows when eval.intrinsic_dim.enabled=true.

Feature extraction (classification)#

For classification tasks, features are extracted once per split and reused by both KNN and the linear probe.

  1. The backbone is placed in eval() mode with gradients disabled (torch.no_grad + torch.inference_mode).

  2. Each batch is passed through forward_patch_features, returning embeddings of shape (B, K).

  3. Dictionary outputs (e.g. DINO-style models) are flattened on the first available key in order: "norm", "global_pool", "head.global_pool".

  4. 3-D outputs (B, tokens, K) (typical of ViT models) are mean-pooled across the token dimension to a single (B, K) vector.

  5. Embeddings and labels are concatenated across batches into NumPy arrays for downstream evaluation.

KNN (k=5)#

Method name: knn5. A non-parametric baseline that measures how well the feature space clusters by class.

Procedure#

  1. Extract train and test embeddings (see above).

  2. Fit a fixed k = 5 nearest-neighbour classifier (KNNClassifier, backed by FAISS) on the train embeddings.

  3. Predict labels for every test sample.

  4. Compute test-set accuracy (or micro-mAP for multilabel datasets).

  5. Compute 95% bootstrap confidence intervals (eval.bootstrap resamples, default 200) by resampling test predictions with replacement.

Key details#

  • No hyperparameter tuning — k is fixed at 5.

  • No validation use — the validation split is extracted but ignored.

  • FAISS is used for efficient L2 nearest-neighbour search and runs on CPU or GPU.

  • Feature vectors are cast to float32 and labels to int64 before indexing.

Linear probe (logistic regression)#

Method name: linear. Multinomial logistic regression trained on top of frozen features — the standard linear-evaluation protocol.

Procedure#

  1. Extract train, validation, and test embeddings.

  2. Hyperparameter sweep: train one logistic regression per C value in a log-spaced grid (eval.c_range, default 40 values from 10⁻⁶ to 10⁴). Each model is evaluated on the validation set to pick the best C.

  3. Final model: retrain with the chosen C, optionally on train val (eval.merge_val=true, the default).

  4. Evaluate on the test set; report accuracy / micro-mAP with 95% bootstrap confidence intervals.

Implementation#

LogisticRegression is a custom PyTorch implementation matching scikit-learn’s objective scaling:

\[\mathrm{loss} \;=\; \frac{1}{n}\,\mathrm{CrossEntropy} \;+\; \frac{1}{n}\cdot\frac{1}{2C}\,\|W\|^2\]
  • Architecture: a single nn.Linear(K, num_classes) (weight + bias).

  • Solver (sweep): L-BFGS with strong-Wolfe line search, max_iter=2000, tol=1e-6.

  • Solver (final): same, but max_iter=4000 for tighter convergence.

  • Alternative solver: mini-batch Adam is available; L-BFGS is the default.

  • No feature standardisation — embeddings are used as-is.

  • TF32 is enabled on CUDA for faster matmul when available.

Hyperparameters#

Parameter

Default

Description

eval.c_range

[-6, 4, 40]

log₁₀ start, stop, and number of C values

eval.merge_val

true

merge train + val for final model training

eval.bootstrap

200

bootstrap resamples for the confidence interval

Segmentation probes#

All segmentation methods share a common skeleton: SegmentationProbe registers forward hooks on the configured eval.segmentation.layers to capture intermediate feature maps. See Segmentation backbone layer reference for verified layer names per timm backbone family.

Each layer’s feature map is reshaped into a spatial tensor (B, C, H, W):

  • 2-D (B, C) is reshaped to (B, C, 1, 1).

  • 3-D ViT-style (B, L, C) is reshaped to (B, C, √L, √L) (assuming a square spatial grid).

  • 4-D (B, C, H, W) is used directly.

The probe head is then trained end-to-end (backbone frozen) using SegmentationSolver.

Linear segmentation probe#

Method name: seg-linear (eval.segmentation.head_type=linear). A lightweight per-pixel linear classifier per layer, with multi-layer fusion via learned scalar weights.

  • Per layer: BatchNorm2d Conv2d(C, num_classes, 1×1) (a per-pixel linear classifier).

  • Each layer’s logits are bilinearly upsampled to the input resolution.

  • Multi-layer logits are combined via a learned scalar scale_weights parameter.

Convolutional probe#

Method name: seg-conv_block (eval.segmentation.head_type=conv_block). Slightly more expressive: projects and fuses multi-scale features before classification, testing whether the backbone captures complementary information at different depths.

  1. Per layer: Conv2d(C, hidden_dim, 1×1, bias=False) BatchNorm2d SiLU.

  2. Bilinearly upsample all projected maps to the largest spatial resolution among them (minimum 16×16).

  3. Concatenate along the channel dimension to (B, hidden_dim × num_layers, H, W).

  4. Conv2d(hidden_dim × num_layers, num_classes, 1×1) to produce logits, then upsample to the input resolution.

FPN probe#

Method name: seg-fpn (eval.segmentation.head_type=fpn). A Feature-Pyramid-Network-style top-down decoder that fuses multi-scale maps in coarse-to-fine order — matching common dense-prediction literature.

  1. Layers must be supplied in coarse-to-fine order (deepest / lowest-resolution first, e.g. ["layer4", "layer3", "layer2", "layer1"] for a ResNet).

  2. Each layer is projected to hidden_dim channels via a lateral 1×1 conv.

  3. Top-down pathway: starting from the coarsest scale, each level is upsampled 2× and added to the next finer lateral output.

  4. Each merged level is refined with a 3×3 conv.

  5. All refined levels are upsampled to the finest spatial resolution, concatenated, and passed through a 1×1 conv to per-pixel class logits.

  6. Logits are bilinearly upsampled to the input resolution.

DPT probe#

Method name: seg-dpt (eval.segmentation.head_type=dpt). A DPT-style reassemble + fusion-transformer decoder (DPTHead). Requires exactly four backbone layers in coarse-to-fine order; otherwise structurally similar to the FPN probe but with deeper fusion blocks.

Training & evaluation (all heads)#

  • Optimiser: AdamW, applied only to unfrozen probe parameters.

  • Loss: CrossEntropyLoss(ignore_index=255) so unlabeled pixels are excluded from both loss and metric computation.

  • Schedule: cosine decay to 1e-6 by default (eval.segmentation.lr_scheduler); none disables.

  • Metric: mean Intersection-over-Union (mIoU) via torchmetrics.MulticlassJaccardIndex. Frequency-weighted IoU plus macro precision / recall / F1 are also reported in the result row (see Results format).

Segmentation knobs#

All keys live under eval.segmentation in src/torchgeo_bench/conf/config.yaml (global defaults) or under a model preset’s eval block (per-model override).

Head type#

head_type

Description

linear

Per-layer BN + 1×1 conv → upsample. Multi-layer fused with learned scalar weights.

conv_block

Per-layer 1×1 projection to hidden_dim → upsample + concat → 1×1 head.

fpn

FPN top-down pathway. Layers must be coarse-to-fine.

dpt

DPT-style reassemble + fusion-transformer decoder.

Training knobs#

Option

Default

Description

layers

(per model)

Backbone layer names to hook. For FPN / DPT, deepest layer first.

epochs

10

Training epochs for the probe head.

lr

1e-3

Initial learning rate (AdamW).

lr_scheduler

cosine

cosine (CosineAnnealingLR to 1e-6) or none (constant).

criterion

torch.nn.CrossEntropyLoss

Instantiable loss criterion; provide an alternative via the Hydra criterion block.

hidden_dim

256

Projection dimension for conv_block / fpn / dpt heads.

batch_size

64

Batch size when training the probe head.

Feature caching#

Option

Default

Description

cache_features

true

Pre-extract backbone features once per split into RAM. Stored layer-first as contiguous (N, C, H, W) float16 tensors (CachedFeaturesDataset). GPU transfer is a single memcpy per layer. Eliminates backbone re-runs across epochs — the dominant speedup.

cache_dtype

float16

Storage dtype for cached features. float16 halves RAM; autocast upcasts during the head forward pass.

Common configuration#

All evaluation paths share these settings:

Setting

Default

Description

seed

0

Random seed for reproducibility (numpy + torch).

device

cuda:0

PyTorch device.

dataset.batch_size

64

Batch size for data loading.

dataset.image_size

224

Resize input images (null = preserve native size).

dataset.interpolation

bilinear

Resize interpolation method.

resume

false

Skip already-computed (dataset, method, model, …) combinations.

Resume mode and output schema#

When resume=true the runner reads the existing CSV at startup and skips any combination already present. See Results format for the full key, the column schema, and the atomic-append guarantee.