Evaluation methodology#

This page describes the evaluation methodology used by torchgeo_bench.main for each supported task type. In all cases the backbone model is kept frozen — the benchmark measures the quality of learned representations, not end-to-end fine-tuning performance. The model is reinstantiated fresh for each dataset because num_channels (and therefore the input convolution) varies across datasets and band selections.

Overview#

The benchmark loads a pre-trained backbone and one or more geospatial datasets. Depending on whether a dataset provides per-pixel masks (segmentation) or per-image labels (classification), a different evaluation path is taken:

Dataset type	Methods	Primary metric
Classification (single-label)	`knn5`, `linear`	accuracy
Classification (multi-label)	`knn5`, `linear`	micro-mAP
Segmentation	`seg-{linear,conv_block,fpn,dpt}`	mIoU

Optionally, intrinsic-dimension metrics (torchgeo_bench.intrinsic_dim) can be emitted alongside the standard classification rows when eval.intrinsic_dim.enabled=true.

Feature extraction (classification)#

For classification tasks, features are extracted once per split and reused by both KNN and the linear probe.

The backbone is placed in eval() mode with gradients disabled (torch.no_grad + torch.inference_mode).
Each batch is passed through forward_patch_features, returning embeddings of shape (B, K).
Dictionary outputs (e.g. DINO-style models) are flattened on the first available key in order: "norm", "global_pool", "head.global_pool".
3-D outputs (B, tokens, K) (typical of ViT models) are mean-pooled across the token dimension to a single (B, K) vector.
Embeddings and labels are concatenated across batches into NumPy arrays for downstream evaluation.

KNN (k=5)#

Method name: knn5. A non-parametric baseline that measures how well the feature space clusters by class.

Procedure#

Extract train and test embeddings (see above).
Fit a fixed k = 5 nearest-neighbour classifier (KNNClassifier, backed by FAISS) on the train embeddings.
Predict labels for every test sample.
Compute test-set accuracy (or micro-mAP for multilabel datasets).
Compute 95% bootstrap confidence intervals (eval.bootstrap resamples, default 200) by resampling test predictions with replacement.

Key details#

No hyperparameter tuning — k is fixed at 5.
No validation use — the validation split is extracted but ignored.
FAISS is used for efficient L2 nearest-neighbour search and runs on CPU or GPU.
Feature vectors are cast to float32 and labels to int64 before indexing.

Linear probe (logistic regression)#

Method name: linear. Multinomial logistic regression trained on top of frozen features — the standard linear-evaluation protocol.

Procedure#

Extract train, validation, and test embeddings.
Hyperparameter sweep: train one logistic regression per C value in a log-spaced grid (eval.c_range, default 40 values from 10⁻⁶ to 10⁴). Each model is evaluated on the validation set to pick the best C.
Final model: retrain with the chosen C, optionally on train ∪ val (eval.merge_val=true, the default).
Evaluate on the test set; report accuracy / micro-mAP with 95% bootstrap confidence intervals.

Implementation#

LogisticRegression is a custom PyTorch implementation matching scikit-learn’s objective scaling:

\[\mathrm{loss} \;=\; \frac{1}{n}\,\mathrm{CrossEntropy} \;+\; \frac{1}{n}\cdot\frac{1}{2C}\,\|W\|^2\]

Architecture: a single nn.Linear(K, num_classes) (weight + bias).
Solver (sweep): L-BFGS with strong-Wolfe line search, max_iter=2000, tol=1e-6.
Solver (final): same, but max_iter=4000 for tighter convergence.
Alternative solver: mini-batch Adam is available; L-BFGS is the default.
No feature standardisation — embeddings are used as-is.
TF32 is enabled on CUDA for faster matmul when available.

Hyperparameters#

Parameter	Default	Description
`eval.c_range`	`[-6, 4, 40]`	log₁₀ start, stop, and number of `C` values
`eval.merge_val`	`true`	merge train + val for final model training
`eval.bootstrap`	`200`	bootstrap resamples for the confidence interval

Segmentation probes#

All segmentation methods share a common skeleton: SegmentationProbe registers forward hooks on the configured eval.segmentation.layers to capture intermediate feature maps. See Segmentation backbone layer reference for verified layer names per timm backbone family.

Each layer’s feature map is reshaped into a spatial tensor (B, C, H, W):

2-D (B, C) is reshaped to (B, C, 1, 1).
3-D ViT-style (B, L, C) is reshaped to (B, C, √L, √L) (assuming a square spatial grid).
4-D (B, C, H, W) is used directly.

The probe head is then trained end-to-end (backbone frozen) using SegmentationSolver.

Linear segmentation probe#

Method name: seg-linear (eval.segmentation.head_type=linear). A lightweight per-pixel linear classifier per layer, with multi-layer fusion via learned scalar weights.

Per layer: BatchNorm2d → Conv2d(C, num_classes, 1×1) (a per-pixel linear classifier).
Each layer’s logits are bilinearly upsampled to the input resolution.
Multi-layer logits are combined via a learned scalar scale_weights parameter.

Convolutional probe#

Method name: seg-conv_block (eval.segmentation.head_type=conv_block). Slightly more expressive: projects and fuses multi-scale features before classification, testing whether the backbone captures complementary information at different depths.

Per layer: Conv2d(C, hidden_dim, 1×1, bias=False) → BatchNorm2d → SiLU.
Bilinearly upsample all projected maps to the largest spatial resolution among them (minimum 16×16).
Concatenate along the channel dimension to (B, hidden_dim × num_layers, H, W).
Conv2d(hidden_dim × num_layers, num_classes, 1×1) to produce logits, then upsample to the input resolution.

FPN probe#

Method name: seg-fpn (eval.segmentation.head_type=fpn). A Feature-Pyramid-Network-style top-down decoder that fuses multi-scale maps in coarse-to-fine order — matching common dense-prediction literature.

Layers must be supplied in coarse-to-fine order (deepest / lowest-resolution first, e.g. ["layer4", "layer3", "layer2", "layer1"] for a ResNet).
Each layer is projected to hidden_dim channels via a lateral 1×1 conv.
Top-down pathway: starting from the coarsest scale, each level is upsampled 2× and added to the next finer lateral output.
Each merged level is refined with a 3×3 conv.
All refined levels are upsampled to the finest spatial resolution, concatenated, and passed through a 1×1 conv to per-pixel class logits.
Logits are bilinearly upsampled to the input resolution.

DPT probe#

Method name: seg-dpt (eval.segmentation.head_type=dpt). A DPT-style reassemble + fusion-transformer decoder (DPTHead). Requires exactly four backbone layers in coarse-to-fine order; otherwise structurally similar to the FPN probe but with deeper fusion blocks.

Training & evaluation (all heads)#

Optimiser: AdamW, applied only to unfrozen probe parameters.
Loss: CrossEntropyLoss(ignore_index=255) so unlabeled pixels are excluded from both loss and metric computation.
Schedule: cosine decay to 1e-6 by default (eval.segmentation.lr_scheduler); none disables.
Metric: mean Intersection-over-Union (mIoU) via torchmetrics.MulticlassJaccardIndex. Frequency-weighted IoU plus macro precision / recall / F1 are also reported in the result row (see Results format).

Segmentation knobs#

All keys live under eval.segmentation in src/torchgeo_bench/conf/config.yaml (global defaults) or under a model preset’s eval block (per-model override).

Head type#

`head_type`	Description
`linear`	Per-layer BN + 1×1 conv → upsample. Multi-layer fused with learned scalar weights.
`conv_block`	Per-layer 1×1 projection to `hidden_dim` → upsample + concat → 1×1 head.
`fpn`	FPN top-down pathway. Layers must be coarse-to-fine.
`dpt`	DPT-style reassemble + fusion-transformer decoder.

Training knobs#

Option	Default	Description
`layers`	(per model)	Backbone layer names to hook. For FPN / DPT, deepest layer first.
`epochs`	`10`	Training epochs for the probe head.
`lr`	`1e-3`	Initial learning rate (AdamW).
`lr_scheduler`	`cosine`	`cosine` (CosineAnnealingLR to 1e-6) or `none` (constant).
`criterion`	`torch.nn.CrossEntropyLoss`	Instantiable loss criterion; provide an alternative via the Hydra `criterion` block.
`hidden_dim`	`256`	Projection dimension for `conv_block` / `fpn` / `dpt` heads.
`batch_size`	`64`	Batch size when training the probe head.

Feature caching#

Option	Default	Description
`cache_features`	`true`	Pre-extract backbone features once per split into RAM. Stored layer-first as contiguous `(N, C, H, W)` `float16` tensors (`CachedFeaturesDataset`). GPU transfer is a single memcpy per layer. Eliminates backbone re-runs across epochs — the dominant speedup.
`cache_dtype`	`float16`	Storage dtype for cached features. `float16` halves RAM; autocast upcasts during the head forward pass.

Common configuration#

All evaluation paths share these settings:

Setting	Default	Description
`seed`	`0`	Random seed for reproducibility (numpy + torch).
`device`	`cuda:0`	PyTorch device.
`dataset.batch_size`	`64`	Batch size for data loading.
`dataset.image_size`	`224`	Resize input images (`null` = preserve native size).
`dataset.interpolation`	`bilinear`	Resize interpolation method.
`resume`	`false`	Skip already-computed `(dataset, method, model, …)` combinations.

Resume mode and output schema#

When resume=true the runner reads the existing CSV at startup and skips any combination already present. See Results format for the full key, the column schema, and the atomic-append guarantee.