Results format#

All evaluation runs append rows to a single CSV file (default results/all_results.csv). Each row is a flattened EvaluationResult describing a single (dataset, method, model, config) measurement.

Sample rows#

dataset,method,metric_name,metric_value,ci_lower,ci_upper,feature_dim,best_c,n_train,n_val,n_test,seed,model,name,normalization,image_size,interpolation,partition,bands
m-eurosat,knn5,accuracy,0.8234,0.8123,0.8345,512,,21600,5400,5400,0,torchgeo_bench.models.RCFBench,rcf,bandspec_zscore,224,bilinear,default,rgb
m-eurosat,linear,accuracy,0.8567,0.8461,0.8673,512,0.1,21600,5400,5400,0,torchgeo_bench.models.RCFBench,rcf,bandspec_zscore,224,bilinear,default,rgb
burn_scars,seg-fpn,mIoU,0.6234,0.0,0.0,768,,1000,200,300,0,torchgeo_bench.models.TimmPatchBenchModel,resnet50,bandspec_zscore,224,bilinear,default,rgb

Datasets emit unnormalized tensors; each model wrapper normalises inside normalize_inputs() according to the strategy selected by cfg.dataset.normalization. Allowed values:

Strategy	Behaviour
`bandspec_zscore`	Per-channel z-score using `BandSpec` mean/std (default).
`model_native`	Convert to the wrapper’s `expected_input_unit`, then apply any `pretrain_mean` / `pretrain_std` declared on the class.
`minmax`	Scale each channel to `[0, 1]` from BandSpec min/max.
`minmax_zscore`	`minmax` then z-score against assumed `mean=0.5, std=0.25`.
`identity`	No rescaling (for models whose forward owns normalisation).

Older snapshots may carry legacy values such as raw / mean_stdev / percentile_2_98 — they are kept verbatim for resume safety.

Method values#

`method`	Meaning
`knn5`	KNN-5 classification (multilabel KNN for `m-bigearthnet`).
`linear`	L-BFGS logistic regression with C-sweep on the validation set.
`intrinsic_dim`	Optional intrinsic-dimension metrics on extracted embeddings (requires the `[id]` extra and `eval.intrinsic_dim.enabled=true`).
`seg-<head>`	Segmentation probe with the configured head (`linear` / `conv_block` / `fpn` / `dpt`).

CSV schema#

Column	Description
`dataset`	Dataset CLI name (e.g. `m-eurosat`).
`method`	`knn5`, `linear`, `intrinsic_dim`, or `seg-<head_type>`.
`metric_name`	Primary metric (`accuracy`, `micro_mAP`, `mIoU`, or `id_<estimator>_<split>` for intrinsic dim rows).
`metric_value`	Point estimate.
`ci_lower`	Bootstrap CI lower bound (0.0 when not applicable).
`ci_upper`	Bootstrap CI upper bound (0.0 when not applicable).
`feature_dim`	Embedding dimension produced by the backbone.
`best_c`	Best `C` from the logistic-regression sweep (linear probe only, otherwise `None`).
`best_lr`	Best learning rate (segmentation only).
`best_batch_size`	Best batch size (segmentation only).
`n_train`	Train-split sample count.
`n_val`	Validation-split sample count.
`n_test`	Test-split sample count.
`seed`	RNG seed used for the run.
`model`	Fully-qualified model class (`cfg.model._target_`).
`name`	Human-readable model name (`cfg.model.name`).
`normalization`	Strategy applied by the model wrapper (see table above).
`image_size`	Input resize size (`None` if no resizing).
`interpolation`	Resize interpolation mode.
`partition`	GeoBench V1 partition name (`default` for V2).
`bands`	`rgb` / `all` / a sorted comma-joined list.
`c_range_start`	`eval.c_range[0]`.
`c_range_stop`	`eval.c_range[1]`.
`c_range_num`	`eval.c_range[2]`.
`merge_val`	Whether `train+val` was merged before final logistic fit.
`bootstrap`	Number of bootstrap resamples used for CIs.
`fw_iou`	Frequency-weighted IoU (segmentation only).
`precision`	Macro precision (segmentation only).
`recall`	Macro recall (segmentation only).
`f1`	Macro F1 (segmentation only).

Atomic appends#

Rows are appended via append_rows_atomic(), which uses fcntl advisory file locking (available on Linux and macOS). This makes it safe to point multiple parallel jobs (e.g. one per GPU or per dataset) at the same output file without corrupting it.

Resume mode#

When resume=true, the runner reads the existing CSV at startup and skips any combination that already has a matching row. The de-dup key is:

(dataset, method, model._target_, model.name,
 normalization, image_size, interpolation, partition, bands)

Note that method is per-method (knn5 / linear / intrinsic_dim / seg-<head_type>), so re-running with eval.skip_linear=false after a skip_linear=true run will fill in just the linear-probe rows.