Evaluation methodology#
This page describes the evaluation methodology used by
torchgeo_bench.main for each supported task type. In all cases
the backbone model is kept frozen — the benchmark measures the
quality of learned representations, not end-to-end fine-tuning
performance. The model is reinstantiated fresh for each dataset
because num_channels (and therefore the input convolution) varies
across datasets and band selections.
Overview#
The benchmark loads a pre-trained backbone and one or more geospatial datasets. Depending on whether a dataset provides per-pixel masks (segmentation) or per-image labels (classification), a different evaluation path is taken:
Dataset type |
Methods |
Primary metric |
|---|---|---|
Classification (single-label) |
|
accuracy |
Classification (multi-label) |
|
micro-mAP |
Segmentation |
|
mIoU |
Optionally, intrinsic-dimension metrics
(torchgeo_bench.intrinsic_dim) can be emitted alongside the standard
classification rows when eval.intrinsic_dim.enabled=true.
Feature extraction (classification)#
For classification tasks, features are extracted once per split and reused by both KNN and the linear probe.
The backbone is placed in
eval()mode with gradients disabled (torch.no_grad+torch.inference_mode).Each batch is passed through
forward_patch_features, returning embeddings of shape(B, K).Dictionary outputs (e.g. DINO-style models) are flattened on the first available key in order:
"norm","global_pool","head.global_pool".3-D outputs
(B, tokens, K)(typical of ViT models) are mean-pooled across the token dimension to a single(B, K)vector.Embeddings and labels are concatenated across batches into NumPy arrays for downstream evaluation.
KNN (k=5)#
Method name: knn5. A non-parametric baseline that measures how
well the feature space clusters by class.
Procedure#
Extract train and test embeddings (see above).
Fit a fixed k = 5 nearest-neighbour classifier (
KNNClassifier, backed by FAISS) on the train embeddings.Predict labels for every test sample.
Compute test-set accuracy (or micro-mAP for multilabel datasets).
Compute 95% bootstrap confidence intervals (
eval.bootstrapresamples, default200) by resampling test predictions with replacement.
Key details#
No hyperparameter tuning — k is fixed at 5.
No validation use — the validation split is extracted but ignored.
FAISS is used for efficient L2 nearest-neighbour search and runs on CPU or GPU.
Feature vectors are cast to
float32and labels toint64before indexing.
Linear probe (logistic regression)#
Method name: linear. Multinomial logistic regression trained on
top of frozen features — the standard linear-evaluation protocol.
Procedure#
Extract train, validation, and test embeddings.
Hyperparameter sweep: train one logistic regression per
Cvalue in a log-spaced grid (eval.c_range, default 40 values from 10⁻⁶ to 10⁴). Each model is evaluated on the validation set to pick the bestC.Final model: retrain with the chosen
C, optionally ontrain ∪ val(eval.merge_val=true, the default).Evaluate on the test set; report accuracy / micro-mAP with 95% bootstrap confidence intervals.
Implementation#
LogisticRegression is a custom PyTorch
implementation matching scikit-learn’s objective scaling:
Architecture: a single
nn.Linear(K, num_classes)(weight + bias).Solver (sweep): L-BFGS with strong-Wolfe line search,
max_iter=2000,tol=1e-6.Solver (final): same, but
max_iter=4000for tighter convergence.Alternative solver: mini-batch Adam is available; L-BFGS is the default.
No feature standardisation — embeddings are used as-is.
TF32 is enabled on CUDA for faster matmul when available.
Hyperparameters#
Parameter |
Default |
Description |
|---|---|---|
|
|
log₁₀ start, stop, and number of |
|
|
merge train + val for final model training |
|
|
bootstrap resamples for the confidence interval |
Segmentation probes#
All segmentation methods share a common skeleton:
SegmentationProbe registers
forward hooks on the configured
eval.segmentation.layers
to capture intermediate feature maps. See Segmentation backbone layer reference
for verified layer names per timm backbone family.
Each layer’s feature map is reshaped into a spatial tensor
(B, C, H, W):
2-D
(B, C)is reshaped to(B, C, 1, 1).3-D ViT-style
(B, L, C)is reshaped to(B, C, √L, √L)(assuming a square spatial grid).4-D
(B, C, H, W)is used directly.
The probe head is then trained end-to-end (backbone frozen) using
SegmentationSolver.
Linear segmentation probe#
Method name: seg-linear
(eval.segmentation.head_type=linear).
A lightweight per-pixel linear classifier per layer, with multi-layer
fusion via learned scalar weights.
Per layer:
BatchNorm2d → Conv2d(C, num_classes, 1×1)(a per-pixel linear classifier).Each layer’s logits are bilinearly upsampled to the input resolution.
Multi-layer logits are combined via a learned scalar
scale_weightsparameter.
Convolutional probe#
Method name: seg-conv_block
(eval.segmentation.head_type=conv_block).
Slightly more expressive: projects and fuses multi-scale features
before classification, testing whether the backbone captures
complementary information at different depths.
Per layer:
Conv2d(C, hidden_dim, 1×1, bias=False) → BatchNorm2d → SiLU.Bilinearly upsample all projected maps to the largest spatial resolution among them (minimum 16×16).
Concatenate along the channel dimension to
(B, hidden_dim × num_layers, H, W).Conv2d(hidden_dim × num_layers, num_classes, 1×1)to produce logits, then upsample to the input resolution.
FPN probe#
Method name: seg-fpn (eval.segmentation.head_type=fpn).
A Feature-Pyramid-Network-style top-down decoder that fuses multi-scale
maps in coarse-to-fine order — matching common dense-prediction
literature.
Layers must be supplied in coarse-to-fine order (deepest / lowest-resolution first, e.g.
["layer4", "layer3", "layer2", "layer1"]for a ResNet).Each layer is projected to
hidden_dimchannels via a lateral 1×1 conv.Top-down pathway: starting from the coarsest scale, each level is upsampled 2× and added to the next finer lateral output.
Each merged level is refined with a 3×3 conv.
All refined levels are upsampled to the finest spatial resolution, concatenated, and passed through a 1×1 conv to per-pixel class logits.
Logits are bilinearly upsampled to the input resolution.
DPT probe#
Method name: seg-dpt (eval.segmentation.head_type=dpt).
A DPT-style reassemble + fusion-transformer decoder
(DPTHead). Requires exactly four
backbone layers in coarse-to-fine order; otherwise structurally
similar to the FPN probe but with deeper fusion blocks.
Training & evaluation (all heads)#
Optimiser: AdamW, applied only to unfrozen probe parameters.
Loss:
CrossEntropyLoss(ignore_index=255)so unlabeled pixels are excluded from both loss and metric computation.Schedule: cosine decay to 1e-6 by default (
eval.segmentation.lr_scheduler);nonedisables.Metric: mean Intersection-over-Union (mIoU) via
torchmetrics.MulticlassJaccardIndex. Frequency-weighted IoU plus macro precision / recall / F1 are also reported in the result row (see Results format).
Segmentation knobs#
All keys live under eval.segmentation in
src/torchgeo_bench/conf/config.yaml (global defaults) or under a
model preset’s eval block (per-model override).
Head type#
|
Description |
|---|---|
|
Per-layer BN + 1×1 conv → upsample. Multi-layer fused with learned scalar weights. |
|
Per-layer 1×1 projection to |
|
FPN top-down pathway. Layers must be coarse-to-fine. |
|
DPT-style reassemble + fusion-transformer decoder. |
Training knobs#
Option |
Default |
Description |
|---|---|---|
|
(per model) |
Backbone layer names to hook. For FPN / DPT, deepest layer first. |
|
|
Training epochs for the probe head. |
|
|
Initial learning rate (AdamW). |
|
|
|
|
|
Instantiable loss criterion; provide an alternative via the
Hydra |
|
|
Projection dimension for |
|
|
Batch size when training the probe head. |
Feature caching#
Option |
Default |
Description |
|---|---|---|
|
|
Pre-extract backbone features once per split into RAM. Stored
layer-first as contiguous |
|
|
Storage dtype for cached features. |
Common configuration#
All evaluation paths share these settings:
Setting |
Default |
Description |
|---|---|---|
|
|
Random seed for reproducibility (numpy + torch). |
|
|
PyTorch device. |
|
|
Batch size for data loading. |
|
|
Resize input images ( |
|
|
Resize interpolation method. |
|
|
Skip already-computed
|
Resume mode and output schema#
When resume=true the runner reads the existing CSV at startup and
skips any combination already present. See Results format for
the full key, the column schema, and the atomic-append guarantee.