Evaluation methodology ====================== This page describes the evaluation methodology used by :mod:`torchgeo_bench.main` for each supported task type. In all cases the backbone model is kept **frozen** — the benchmark measures the quality of learned representations, not end-to-end fine-tuning performance. The model is reinstantiated fresh for each dataset because ``num_channels`` (and therefore the input convolution) varies across datasets and band selections. Overview -------- The benchmark loads a pre-trained backbone and one or more geospatial datasets. Depending on whether a dataset provides per-pixel masks (segmentation) or per-image labels (classification), a different evaluation path is taken: .. list-table:: :header-rows: 1 :widths: 25 50 25 * - Dataset type - Methods - Primary metric * - Classification (single-label) - ``knn5``, ``linear`` - accuracy * - Classification (multi-label) - ``knn5``, ``linear`` - micro-mAP * - Segmentation - ``seg-{linear,conv_block,fpn,dpt}`` - mIoU Optionally, intrinsic-dimension metrics (:doc:`/api/intrinsic_dim`) can be emitted alongside the standard classification rows when ``eval.intrinsic_dim.enabled=true``. Feature extraction (classification) ----------------------------------- For classification tasks, features are extracted **once** per split and reused by both KNN and the linear probe. 1. The backbone is placed in ``eval()`` mode with gradients disabled (``torch.no_grad`` + ``torch.inference_mode``). 2. Each batch is passed through ``forward_patch_features``, returning embeddings of shape ``(B, K)``. 3. Dictionary outputs (e.g. DINO-style models) are flattened on the first available key in order: ``"norm"``, ``"global_pool"``, ``"head.global_pool"``. 4. 3-D outputs ``(B, tokens, K)`` (typical of ViT models) are **mean-pooled** across the token dimension to a single ``(B, K)`` vector. 5. Embeddings and labels are concatenated across batches into NumPy arrays for downstream evaluation. KNN (k=5) --------- **Method name:** ``knn5``. A non-parametric baseline that measures how well the feature space clusters by class. Procedure ^^^^^^^^^ 1. Extract train and test embeddings (see above). 2. Fit a fixed **k = 5** nearest-neighbour classifier (:class:`~torchgeo_bench.knn.KNNClassifier`, backed by FAISS) on the train embeddings. 3. Predict labels for every test sample. 4. Compute test-set accuracy (or micro-mAP for multilabel datasets). 5. Compute **95% bootstrap confidence intervals** (``eval.bootstrap`` resamples, default ``200``) by resampling test predictions with replacement. Key details ^^^^^^^^^^^ * No hyperparameter tuning — k is fixed at 5. * No validation use — the validation split is extracted but ignored. * FAISS is used for efficient L2 nearest-neighbour search and runs on CPU or GPU. * Feature vectors are cast to ``float32`` and labels to ``int64`` before indexing. Linear probe (logistic regression) ---------------------------------- **Method name:** ``linear``. Multinomial logistic regression trained on top of frozen features — the standard linear-evaluation protocol. Procedure ^^^^^^^^^ 1. Extract train, validation, and test embeddings. 2. **Hyperparameter sweep:** train one logistic regression per ``C`` value in a log-spaced grid (``eval.c_range``, default 40 values from 10⁻⁶ to 10⁴). Each model is evaluated on the validation set to pick the best ``C``. 3. **Final model:** retrain with the chosen ``C``, optionally on ``train ∪ val`` (``eval.merge_val=true``, the default). 4. Evaluate on the test set; report accuracy / micro-mAP with 95% bootstrap confidence intervals. Implementation ^^^^^^^^^^^^^^ :class:`~torchgeo_bench.linear.LogisticRegression` is a custom PyTorch implementation matching scikit-learn's objective scaling: .. math:: \mathrm{loss} \;=\; \frac{1}{n}\,\mathrm{CrossEntropy} \;+\; \frac{1}{n}\cdot\frac{1}{2C}\,\|W\|^2 * **Architecture:** a single ``nn.Linear(K, num_classes)`` (weight + bias). * **Solver (sweep):** L-BFGS with strong-Wolfe line search, ``max_iter=2000``, ``tol=1e-6``. * **Solver (final):** same, but ``max_iter=4000`` for tighter convergence. * **Alternative solver:** mini-batch Adam is available; L-BFGS is the default. * **No feature standardisation** — embeddings are used as-is. * **TF32** is enabled on CUDA for faster matmul when available. Hyperparameters ^^^^^^^^^^^^^^^ .. list-table:: :header-rows: 1 :widths: 25 25 50 * - Parameter - Default - Description * - ``eval.c_range`` - ``[-6, 4, 40]`` - log₁₀ start, stop, and number of ``C`` values * - ``eval.merge_val`` - ``true`` - merge train + val for final model training * - ``eval.bootstrap`` - ``200`` - bootstrap resamples for the confidence interval Segmentation probes ------------------- All segmentation methods share a common skeleton: :class:`~torchgeo_bench.segmentation_probe.SegmentationProbe` registers forward hooks on the configured :attr:`eval.segmentation.layers ` to capture intermediate feature maps. See :doc:`segmentation-layers` for verified layer names per timm backbone family. Each layer's feature map is reshaped into a spatial tensor ``(B, C, H, W)``: * 2-D ``(B, C)`` is reshaped to ``(B, C, 1, 1)``. * 3-D ViT-style ``(B, L, C)`` is reshaped to ``(B, C, √L, √L)`` (assuming a square spatial grid). * 4-D ``(B, C, H, W)`` is used directly. The probe head is then trained end-to-end (backbone frozen) using :class:`~torchgeo_bench.segmentation_task.SegmentationSolver`. Linear segmentation probe ^^^^^^^^^^^^^^^^^^^^^^^^^ **Method name:** ``seg-linear`` (``eval.segmentation.head_type=linear``). A lightweight per-pixel linear classifier per layer, with multi-layer fusion via learned scalar weights. * Per layer: ``BatchNorm2d → Conv2d(C, num_classes, 1×1)`` (a per-pixel linear classifier). * Each layer's logits are bilinearly upsampled to the input resolution. * Multi-layer logits are combined via a learned scalar ``scale_weights`` parameter. Convolutional probe ^^^^^^^^^^^^^^^^^^^ **Method name:** ``seg-conv_block`` (``eval.segmentation.head_type=conv_block``). Slightly more expressive: projects and fuses multi-scale features before classification, testing whether the backbone captures complementary information at different depths. 1. Per layer: ``Conv2d(C, hidden_dim, 1×1, bias=False) → BatchNorm2d → SiLU``. 2. Bilinearly upsample all projected maps to the largest spatial resolution among them (minimum 16×16). 3. Concatenate along the channel dimension to ``(B, hidden_dim × num_layers, H, W)``. 4. ``Conv2d(hidden_dim × num_layers, num_classes, 1×1)`` to produce logits, then upsample to the input resolution. FPN probe ^^^^^^^^^ **Method name:** ``seg-fpn`` (``eval.segmentation.head_type=fpn``). A Feature-Pyramid-Network-style top-down decoder that fuses multi-scale maps in coarse-to-fine order — matching common dense-prediction literature. 1. Layers must be supplied in **coarse-to-fine** order (deepest / lowest-resolution first, e.g. ``["layer4", "layer3", "layer2", "layer1"]`` for a ResNet). 2. Each layer is projected to ``hidden_dim`` channels via a lateral 1×1 conv. 3. **Top-down pathway:** starting from the coarsest scale, each level is upsampled 2× and added to the next finer lateral output. 4. Each merged level is refined with a 3×3 conv. 5. All refined levels are upsampled to the finest spatial resolution, concatenated, and passed through a 1×1 conv to per-pixel class logits. 6. Logits are bilinearly upsampled to the input resolution. DPT probe ^^^^^^^^^ **Method name:** ``seg-dpt`` (``eval.segmentation.head_type=dpt``). A DPT-style reassemble + fusion-transformer decoder (:class:`~torchgeo_bench.models.DPTHead`). Requires exactly four backbone layers in coarse-to-fine order; otherwise structurally similar to the FPN probe but with deeper fusion blocks. Training & evaluation (all heads) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * **Optimiser:** AdamW, applied **only** to unfrozen probe parameters. * **Loss:** ``CrossEntropyLoss(ignore_index=255)`` so unlabeled pixels are excluded from both loss and metric computation. * **Schedule:** cosine decay to 1e-6 by default (``eval.segmentation.lr_scheduler``); ``none`` disables. * **Metric:** mean Intersection-over-Union (mIoU) via ``torchmetrics.MulticlassJaccardIndex``. Frequency-weighted IoU plus macro precision / recall / F1 are also reported in the result row (see :doc:`results-format`). Segmentation knobs ------------------ All keys live under ``eval.segmentation`` in :file:`src/torchgeo_bench/conf/config.yaml` (global defaults) or under a model preset's ``eval`` block (per-model override). Head type ^^^^^^^^^ .. list-table:: :header-rows: 1 :widths: 20 80 * - ``head_type`` - Description * - ``linear`` - Per-layer BN + 1×1 conv → upsample. Multi-layer fused with learned scalar weights. * - ``conv_block`` - Per-layer 1×1 projection to ``hidden_dim`` → upsample + concat → 1×1 head. * - ``fpn`` - FPN top-down pathway. Layers must be coarse-to-fine. * - ``dpt`` - DPT-style reassemble + fusion-transformer decoder. Training knobs ^^^^^^^^^^^^^^ .. list-table:: :header-rows: 1 :widths: 22 18 60 * - Option - Default - Description * - ``layers`` - *(per model)* - Backbone layer names to hook. For FPN / DPT, deepest layer first. * - ``epochs`` - ``10`` - Training epochs for the probe head. * - ``lr`` - ``1e-3`` - Initial learning rate (AdamW). * - ``lr_scheduler`` - ``cosine`` - ``cosine`` (CosineAnnealingLR to 1e-6) or ``none`` (constant). * - ``criterion`` - ``torch.nn.CrossEntropyLoss`` - Instantiable loss criterion; provide an alternative via the Hydra ``criterion`` block. * - ``hidden_dim`` - ``256`` - Projection dimension for ``conv_block`` / ``fpn`` / ``dpt`` heads. * - ``batch_size`` - ``64`` - Batch size when training the probe head. Feature caching ^^^^^^^^^^^^^^^ .. list-table:: :header-rows: 1 :widths: 25 18 57 * - Option - Default - Description * - ``cache_features`` - ``true`` - Pre-extract backbone features once per split into RAM. Stored layer-first as contiguous ``(N, C, H, W)`` ``float16`` tensors (:class:`~torchgeo_bench.segmentation_probe.CachedFeaturesDataset`). GPU transfer is a single memcpy per layer. Eliminates backbone re-runs across epochs — the dominant speedup. * - ``cache_dtype`` - ``float16`` - Storage dtype for cached features. ``float16`` halves RAM; autocast upcasts during the head forward pass. Common configuration -------------------- All evaluation paths share these settings: .. list-table:: :header-rows: 1 :widths: 25 18 57 * - Setting - Default - Description * - ``seed`` - ``0`` - Random seed for reproducibility (numpy + torch). * - ``device`` - ``cuda:0`` - PyTorch device. * - ``dataset.batch_size`` - ``64`` - Batch size for data loading. * - ``dataset.image_size`` - ``224`` - Resize input images (``null`` = preserve native size). * - ``dataset.interpolation`` - ``bilinear`` - Resize interpolation method. * - ``resume`` - ``false`` - Skip already-computed ``(dataset, method, model, …)`` combinations. Resume mode and output schema ----------------------------- When ``resume=true`` the runner reads the existing CSV at startup and skips any combination already present. See :doc:`results-format` for the full key, the column schema, and the atomic-append guarantee.