Results format#
All evaluation runs append rows to a single CSV file (default
results/all_results.csv). Each row is a flattened
EvaluationResult describing a single
(dataset, method, model, config) measurement.
Sample rows#
dataset,method,metric_name,metric_value,ci_lower,ci_upper,feature_dim,best_c,n_train,n_val,n_test,seed,model,name,normalization,image_size,interpolation,partition,bands
m-eurosat,knn5,accuracy,0.8234,0.8123,0.8345,512,,21600,5400,5400,0,torchgeo_bench.models.RCFBench,rcf,bandspec_zscore,224,bilinear,default,rgb
m-eurosat,linear,accuracy,0.8567,0.8461,0.8673,512,0.1,21600,5400,5400,0,torchgeo_bench.models.RCFBench,rcf,bandspec_zscore,224,bilinear,default,rgb
burn_scars,seg-fpn,mIoU,0.6234,0.0,0.0,768,,1000,200,300,0,torchgeo_bench.models.TimmPatchBenchModel,resnet50,bandspec_zscore,224,bilinear,default,rgb
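As a minimal sketch of how such rows might be consumed, the snippet below parses the three sample rows above with the standard-library csv module. The helper name load_rows is hypothetical and not part of the package; in practice the handle would come from opening results/all_results.csv.

```python
import csv
import io

# The three sample rows shown above, inlined so the sketch is self-contained.
SAMPLE = """\
dataset,method,metric_name,metric_value,ci_lower,ci_upper,feature_dim,best_c,n_train,n_val,n_test,seed,model,name,normalization,image_size,interpolation,partition,bands
m-eurosat,knn5,accuracy,0.8234,0.8123,0.8345,512,,21600,5400,5400,0,torchgeo_bench.models.RCFBench,rcf,bandspec_zscore,224,bilinear,default,rgb
m-eurosat,linear,accuracy,0.8567,0.8461,0.8673,512,0.1,21600,5400,5400,0,torchgeo_bench.models.RCFBench,rcf,bandspec_zscore,224,bilinear,default,rgb
burn_scars,seg-fpn,mIoU,0.6234,0.0,0.0,768,,1000,200,300,0,torchgeo_bench.models.TimmPatchBenchModel,resnet50,bandspec_zscore,224,bilinear,default,rgb
"""


def load_rows(handle):
    """Hypothetical helper: parse CSV text into dicts with numeric metrics."""
    rows = []
    for row in csv.DictReader(handle):
        row["metric_value"] = float(row["metric_value"])
        # best_c is blank in the sample rows for methods without a C value.
        row["best_c"] = float(row["best_c"]) if row["best_c"] else None
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
linear = [r for r in rows if r["method"] == "linear"]
print(linear[0]["dataset"], linear[0]["metric_value"])  # m-eurosat 0.8567
```

Every value arrives as a string, so each numeric column has to be converted explicitly, and blank cells such as best_c need a guard.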
Datasets emit unnormalised tensors; each model wrapper normalises inside
normalize_inputs() according to
the strategy selected by cfg.dataset.normalization. Allowed values:
| Strategy | Behaviour |
|---|---|
| | Per-channel z-score using … |
| | Convert to the wrapper’s … |
| | Scale each channel to … |
| | |
| | No rescaling (for models whose forward owns normalisation). |
Older snapshots may carry legacy values such as raw, mean_stdev, or
percentile_2_98; these are kept verbatim for resume safety.
Method values#
| method | Meaning |
|---|---|
| knn5 | KNN-5 classification (multilabel KNN for …). |
| linear | L-BFGS logistic regression with C-sweep on the validation set. |
| intrinsic_dim | Optional intrinsic-dimension metrics on extracted embeddings (requires the …). |
| seg-<head_type> | Segmentation probe with the configured head (…). |
CSV schema#
| Column | Description |
|---|---|
| dataset | Dataset CLI name (e.g. …). |
| method | See Method values above. |
| metric_name | Primary metric (…). |
| metric_value | Point estimate. |
| ci_lower | Bootstrap CI lower bound (0.0 when not applicable). |
| ci_upper | Bootstrap CI upper bound (0.0 when not applicable). |
| feature_dim | Embedding dimension produced by the backbone. |
| best_c | Best … |
| | Best learning rate (segmentation only). |
| | Best batch size (segmentation only). |
| n_train | Train-split sample count. |
| n_val | Validation-split sample count. |
| n_test | Test-split sample count. |
| seed | RNG seed used for the run. |
| model | Fully-qualified model class (…). |
| name | Human-readable model name (…). |
| normalization | Strategy applied by the model wrapper (see table above). |
| image_size | Input resize size (…). |
| interpolation | Resize interpolation mode. |
| partition | GeoBench V1 partition name (…). |
| | |
| | |
| | |
| | |
| | Whether … |
| | Number of bootstrap resamples used for CIs. |
| | Frequency-weighted IoU (segmentation only). |
| | Macro precision (segmentation only). |
| | Macro recall (segmentation only). |
| | Macro F1 (segmentation only). |
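To make the ci_lower / ci_upper columns concrete, here is a minimal sketch of a percentile bootstrap over per-sample correctness. The text above does not say which bootstrap variant the package uses, so the percentile method, the function name bootstrap_ci, and the defaults for n_bootstrap and alpha are all assumptions made for illustration.

```python
import random


def bootstrap_ci(correct, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over 0/1 outcomes (illustrative)."""
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_bootstrap):
        # Resample the test set with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        stats.append(sum(sample) / n)
    stats.sort()
    lo_idx = int((alpha / 2) * n_bootstrap)
    hi_idx = min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))
    return stats[lo_idx], stats[hi_idx]


# Toy outcomes: 85 correct predictions out of 100.
correct = [1] * 85 + [0] * 15
point = sum(correct) / len(correct)
ci_lower, ci_upper = bootstrap_ci(correct)
print(point, ci_lower, ci_upper)
```

The point estimate corresponds to metric_value, and the two returned bounds correspond to ci_lower and ci_upper.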
Atomic appends#
Rows are appended via append_rows_atomic(),
which uses fcntl advisory file locking. This makes it safe to point
multiple parallel jobs (e.g. one per GPU or per dataset) at the same
output file without corrupting it.
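The locking pattern described above can be sketched as follows. This is not the source of append_rows_atomic(); the helper name append_rows_locked, its signature, and the choice of fcntl.flock are assumptions, and the fcntl module is only available on Unix-like systems.

```python
import csv
import fcntl
import os
import tempfile


def append_rows_locked(path, fieldnames, rows):
    """Hypothetical helper: append dict rows under an exclusive advisory lock."""
    with open(path, "a", newline="") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until other writers release
        try:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            # Check the size only after taking the lock, so that exactly
            # one writer emits the header for a new file.
            if os.fstat(f.fileno()).st_size == 0:
                writer.writeheader()
            writer.writerows(rows)
            f.flush()
            os.fsync(f.fileno())  # push the rows to disk before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


# Usage: two sequential appends to a file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "all_results.csv")
fields = ["dataset", "method", "metric_value"]
append_rows_locked(path, fields, [
    {"dataset": "m-eurosat", "method": "knn5", "metric_value": 0.8234},
])
append_rows_locked(path, fields, [
    {"dataset": "m-eurosat", "method": "linear", "metric_value": 0.8567},
])
```

Advisory locks only protect against writers that also take the lock, so every job writing to the file has to go through the same helper.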
Resume mode#
When resume=true, the runner reads the existing CSV at startup and
skips any combination that already has a matching row. The de-dup key
is:
(dataset, method, model._target_, model.name,
normalization, image_size, interpolation, partition, bands)
Because the key includes method, and each evaluation method (knn5,
linear, intrinsic_dim, seg-<head_type>) writes its own row, re-running
with eval.skip_linear=false after a skip_linear=true run fills in only
the linear-probe rows.
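The resume check can be illustrated with a short sketch. It assumes that the model._target_ and model.name parts of the key are stored in the model and name CSV columns, as the schema table suggests; the helper names completed_keys and pending are hypothetical, and the real runner may match rows differently.

```python
import csv
import io

# CSV columns assumed to back the de-dup key listed above.
KEY_COLUMNS = (
    "dataset", "method", "model", "name", "normalization",
    "image_size", "interpolation", "partition", "bands",
)

# A previous run with skip_linear=true left only a knn5 row behind.
EXISTING = """\
dataset,method,metric_value,model,name,normalization,image_size,interpolation,partition,bands
m-eurosat,knn5,0.8234,torchgeo_bench.models.RCFBench,rcf,bandspec_zscore,224,bilinear,default,rgb
"""


def completed_keys(handle):
    """Collect the key tuple of every row already present in the CSV."""
    return {tuple(row[c] for c in KEY_COLUMNS) for row in csv.DictReader(handle)}


def pending(candidates, done):
    """Keep only the combinations whose key is not in the CSV yet."""
    # CSV values are strings, so compare string forms (224 vs "224").
    return [c for c in candidates
            if tuple(str(c[k]) for k in KEY_COLUMNS) not in done]


base = {
    "dataset": "m-eurosat", "model": "torchgeo_bench.models.RCFBench",
    "name": "rcf", "normalization": "bandspec_zscore", "image_size": 224,
    "interpolation": "bilinear", "partition": "default", "bands": "rgb",
}
candidates = [dict(base, method="knn5"), dict(base, method="linear")]

todo = pending(candidates, completed_keys(io.StringIO(EXISTING)))
print([c["method"] for c in todo])  # ['linear']
```

Only the linear combination remains, because the knn5 row already matches on every key column.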