Four winners on GeoBench: Panopticon on KNN, DINOv3-SAT and OlmoEarth on linear, Terramind on multispectral
Across more than 14,000 measurements on 11 GeoBench classification datasets and 28 frozen-backbone variants, four distinct leaders emerge: Panopticon tops KNN-5 on most datasets, DINOv3-SAT remains the strongest RGB-only linear probe, OlmoEarth reaches 97.8% on eurosat-spatial and 97.6% on m-eurosat, and Terramind wins the multispectral datasets when all MSI bands are available.
By torchgeo-bench · Published 19 May 2026 · Source: results/all_results.csv · 14409 of 14409 observations shown
Across the 28 frozen-backbone variants evaluated on
the 11 GeoBench classification datasets, this page reports the accuracy of the
strongest configuration in this snapshot, together with
the accuracy around which the
median model variant clusters.
Use the controls below to filter the underlying observations; every
figure on this page updates accordingly.
Each row of the underlying table records a single
(dataset, method, model, normalization) experiment with bootstrapped
95% confidence intervals on accuracy. The first four figures below explore
the accuracy data from different angles: a per-dataset leaderboard,
a flexible scatter view, a head-to-head comparison of KNN-5 and
linear-probe accuracy, and a cross-dataset ranking that
surfaces variants which generalise. Figures 5 to 7 then cover compute
cost and intrinsic dimension.
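As an illustration of what a bootstrapped 95% confidence interval on accuracy involves, here is a minimal sketch of a percentile bootstrap over per-sample correctness flags. The function name, the number of resamples, and the toy data are assumptions made for this example; the page does not state which bootstrap variant torchgeo-bench uses.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for accuracy.

    `correct` is a list of 0/1 flags, one per test sample
    (1 = prediction matched the label). Hypothetical helper.
    """
    rng = random.Random(seed)
    n = len(correct)
    resampled = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Toy data (not from the results CSV): 90 correct predictions out of 100.
flags = [1] * 90 + [0] * 10
acc, lo, hi = bootstrap_accuracy_ci(flags)
print(f"accuracy={acc:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

The interval narrows as the test set grows, which is worth keeping in mind when comparing whiskers across datasets of different sizes.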
Customise the analysis
Filters: Dataset, Method, Model family, Normalization, Bands, Search by name
Figure 1
Per-dataset leaderboard
The strongest model variants on each GeoBench dataset, ranked by
accuracy. Whiskers show the bootstrapped 95% confidence interval.
Source: torchgeo-bench results CSV
Figure 2
Two-axis explorer
Map any numeric column against any other. By default, embedding
dimension is plotted against accuracy — points coloured by model
family, faceted by dataset.
Source: torchgeo-bench results CSV
Figure 3
KNN-5 versus linear probe
For each (dataset, model) pair, the linear-probe accuracy is plotted
against the non-parametric KNN-5 baseline. Points above the
diagonal are configurations where the linear probe extracts more
signal than nearest-neighbour retrieval.
Source: torchgeo-bench results CSV
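The pairing behind this figure can be sketched as follows. The column names `dataset`, `model`, `method` and `accuracy`, the `"knn5"` method label, and the choice to keep the best accuracy when several normalizations exist are assumptions for illustration; only `method="linear"` is confirmed by the page. The rows are made-up values.

```python
def pair_knn_and_linear(rows, knn_label="knn5", linear_label="linear"):
    """Pair KNN and linear-probe accuracy for each (dataset, model).

    Hypothetical helper: when a pair has several rows per method
    (for example different normalizations), the best accuracy is kept.
    """
    best = {}
    for row in rows:
        per_method = best.setdefault((row["dataset"], row["model"]), {})
        previous = per_method.get(row["method"], 0.0)
        per_method[row["method"]] = max(previous, row["accuracy"])

    pairs = []
    for (dataset, model), acc in sorted(best.items()):
        if knn_label in acc and linear_label in acc:
            knn, linear = acc[knn_label], acc[linear_label]
            # above_diagonal is True when the linear probe beats KNN.
            pairs.append((dataset, model, knn, linear, linear > knn))
    return pairs


# Made-up rows, not taken from results/all_results.csv.
rows = [
    {"dataset": "m-eurosat", "model": "model_a", "method": "knn5", "accuracy": 0.91},
    {"dataset": "m-eurosat", "model": "model_a", "method": "linear", "accuracy": 0.95},
    {"dataset": "m-eurosat", "model": "model_b", "method": "knn5", "accuracy": 0.93},
    {"dataset": "m-eurosat", "model": "model_b", "method": "linear", "accuracy": 0.90},
]
pairs = pair_knn_and_linear(rows)
for dataset, model, knn, linear, above in pairs:
    print(dataset, model, knn, linear, "above" if above else "below")
```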
Figure 4
Mean accuracy across datasets
Variants are ranked by their mean accuracy across the currently
selected datasets — the best generalisers within the filter.
Source: torchgeo-bench results CSV
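A cross-dataset ranking of this kind can be sketched as a group-by followed by a mean. The column names and the rule of dropping variants that lack a result on any selected dataset are assumptions for this example, chosen so that the means stay comparable; the page does not say how missing results are handled.

```python
from statistics import mean


def rank_by_mean_accuracy(rows, selected_datasets):
    """Rank model variants by mean accuracy over the selected datasets."""
    selected = set(selected_datasets)
    per_model = {}
    for row in rows:
        if row["dataset"] in selected:
            per_model.setdefault(row["model"], {})[row["dataset"]] = row["accuracy"]

    # Assumption: only variants evaluated on every selected dataset are ranked.
    complete = {m: accs for m, accs in per_model.items() if set(accs) == selected}
    ranking = [(m, mean(accs.values())) for m, accs in complete.items()]
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Made-up rows, not taken from the results CSV.
rows = [
    {"dataset": "ds1", "model": "model_a", "accuracy": 0.90},
    {"dataset": "ds2", "model": "model_a", "accuracy": 0.80},
    {"dataset": "ds1", "model": "model_b", "accuracy": 0.95},
    {"dataset": "ds2", "model": "model_b", "accuracy": 0.85},
    {"dataset": "ds1", "model": "model_c", "accuracy": 0.99},  # missing ds2
]
ranking = rank_by_mean_accuracy(rows, ["ds1", "ds2"])
for model, score in ranking:
    print(f"{model}: {score:.3f}")
```

Note that `model_c` is excluded despite its high single-dataset score, which is the behaviour a "best generaliser" ranking would typically want.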
Figure 5
Compute & efficiency
Accuracy alone hides cost. The left panel plots linear-probe
accuracy against the backbone's measured throughput on an
A100-80GB; points toward the upper right lie on the efficient
frontier. The right panel extrapolates the cost (USD) and emissions
(kg CO2) of running each model on one million samples in the
selected cloud region.
Profile rows measured on NVIDIA A100-SXM4-80GB. Prices are cloud
on-demand list prices (snapshot 2026-05-15); carbon intensity is from
codecarbon's regional grid table. Both are inlined into the page.
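The extrapolation can be sketched as simple arithmetic from throughput to hours, then to dollars and kilograms of CO2. The formula, the parameter names, and every number below (price, power draw, grid intensity) are assumptions for illustration; the page does not spell out how it converts runtime to energy.

```python
def extrapolate_cost(
    throughput_samples_per_s,
    price_usd_per_hour,
    power_kw,
    grid_kg_co2_per_kwh,
    n_samples=1_000_000,
):
    """Estimate USD and kg CO2 for processing n_samples on one GPU.

    Hypothetical model: runtime scales linearly with sample count and
    the GPU draws a constant `power_kw` for the whole run.
    """
    hours = n_samples / throughput_samples_per_s / 3600.0
    usd = hours * price_usd_per_hour
    kg_co2 = hours * power_kw * grid_kg_co2_per_kwh
    return usd, kg_co2


# Illustrative inputs only: 1000 samples/s, $3.60 per hour,
# 0.36 kW average draw, 0.5 kg CO2 per kWh.
usd, kg_co2 = extrapolate_cost(1000.0, 3.60, 0.36, 0.5)
print(f"${usd:.2f}, {kg_co2:.3f} kg CO2 per million samples")
```

Because both outputs scale inversely with throughput, a backbone that is twice as fast halves both the cost and the emissions under this model.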
Figure 6
Intrinsic dimension by dataset
Each frozen backbone produces a feature manifold that — for a
given dataset — has an effective dimensionality much smaller
than the embedding size. Higher bars mean the backbone uses
more of its budget; the spread between estimators is a coarse
confidence interval on the estimate.
One observation per (model, dataset, estimator, bands). Bars show
the cross-model mean; horizontal whiskers show the [P10, P90] band
(10th to 90th percentile).
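To make "intrinsic dimension" concrete, here is a minimal sketch of one common estimator, TwoNN, which uses the ratio of each point's second- to first-nearest-neighbour distance. The page does not name the estimators it uses, so TwoNN is chosen here purely as an illustration, and the synthetic data stands in for real embeddings.

```python
import math
import random


def two_nn_intrinsic_dimension(points):
    """Maximum-likelihood TwoNN estimate: N / sum(log(r2 / r1)).

    r1 and r2 are the distances from each point to its first and
    second nearest neighbours. Brute force, O(N^2); sketch only.
    """
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0:  # skip exact duplicates
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


# Synthetic "embeddings": a 2-D sheet placed inside an 8-D space.
rng = random.Random(0)
points = []
for _ in range(400):
    x, y = rng.random(), rng.random()
    points.append((x, y, x + y, x - y, 0.0, 0.0, 0.0, 0.0))

estimate = two_nn_intrinsic_dimension(points)
print(f"embedding size 8, estimated intrinsic dimension {estimate:.2f}")
```

The estimate should land near 2 rather than 8, which mirrors the point made above: the effective dimensionality can be much smaller than the embedding size.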
Figure 7
Does intrinsic dimension predict probe accuracy?
For each (model, dataset, bands) tuple, the linear-probe
accuracy is plotted against the intrinsic dimension of the
backbone's frozen embeddings. Higher ID does not automatically
buy you accuracy: some models produce richly spread embeddings
that aren't linearly separable.
Each point is one (name, dataset, bands) trio. Linear-probe
accuracy is the matching `method="linear"` row.
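One way to put a number on the relationship in this figure is a rank correlation between intrinsic dimension and linear-probe accuracy. The sketch below implements Spearman's rho with the standard library; the choice of Spearman and the toy values are assumptions for illustration, not something the page reports.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy values, not from the results CSV: the highest-ID model is the least accurate.
intrinsic_dim = [8, 15, 22, 40]
linear_accuracy = [0.80, 0.92, 0.88, 0.75]
rho = spearman(intrinsic_dim, linear_accuracy)
print(f"Spearman rho = {rho:.2f}")
```

A rho near zero or below, as in this toy case, would be consistent with the caption's claim that higher ID does not by itself predict higher probe accuracy.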
Appendix · the underlying data
Click any column to sort. Search the table or use the filters above to
narrow the view. Numeric values are rounded for display.