torchgeo-bench · Benchmark Edition
AI · Geospatial Foundation Models

Four winners on GeoBench: Panopticon on KNN, DINOv3-SAT and OlmoEarth on linear, Terramind on multispectral

Across 14,409 measurements on 11 GeoBench classification datasets and 28 frozen-backbone variants, four distinct leaders emerge: Panopticon tops KNN-5 on most datasets, DINOv3-SAT remains the strongest RGB-only linear probe, OlmoEarth reaches 97.8% on eurosat-spatial and 97.6% on m-eurosat, and Terramind wins the multispectral datasets when all MSI bands are available.

By torchgeo-bench · Published 19 May 2026 · Source: results/all_results.csv · 14409 of 14409 observations shown

Across frozen-backbone variants evaluated on GeoBench classification datasets, this page reports the accuracy of the strongest configuration in the snapshot and the value around which the median model variant clusters. Use the controls below to filter the underlying observations; every figure on this page updates accordingly.

Each row of the underlying table records a single (dataset, method, model, normalization) experiment with bootstrapped 95% confidence intervals on accuracy. The first four figures below explore the data from different angles: a per-dataset leaderboard, a flexible scatter view, a head-to-head comparison of KNN-5 and linear-probe accuracy, and a cross-dataset ranking that surfaces variants which generalise. Figures 5 to 7 then cover compute cost and intrinsic dimension.
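The page does not say how the bootstrapped intervals are computed. As a minimal sketch, assuming a percentile bootstrap over per-sample correctness flags on the test set, the interval for one experiment could be estimated as below. The function name, the number of resamples, and the toy data are all illustrative and not taken from torchgeo-bench.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from per-sample 0/1 flags.

    Assumption: resampling test samples with replacement; the benchmark's
    actual procedure is not documented on the page.
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lo, hi


# Toy data: 900 correct predictions out of 1000 (not a benchmark result).
flags = [1] * 900 + [0] * 100
acc, lo, hi = bootstrap_accuracy_ci(flags)
print(f"accuracy={acc:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

A fixed seed keeps the interval reproducible between runs; more resamples give smoother interval endpoints at the price of runtime.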

Customise the analysis: filter by dataset, method, model family, normalization, or bands, or search by name.

Figure 1
Per-dataset leaderboard
The strongest model variants on each GeoBench dataset, ranked by accuracy. Whiskers show the bootstrapped 95% confidence interval.
Source: torchgeo-bench results CSV
Figure 2
Two-axis explorer
Map any numeric column against any other. By default, embedding dimension is plotted against accuracy — points coloured by model family, faceted by dataset.
Source: torchgeo-bench results CSV
Figure 3
KNN-5 versus linear probe
For each (dataset, model) pair, linear-probe accuracy is plotted against the non-parametric KNN-5 baseline. Points above the diagonal are configurations where the linear probe extracts more signal than nearest-neighbour retrieval.
Source: torchgeo-bench results CSV
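The position of a point relative to the diagonal in Figure 3 is just the sign of the difference between the two accuracies. The sketch below pairs rows by (dataset, name) and computes that gap. It assumes one row per (dataset, name, method) after filtering, and the method label `"knn5"`, the model names, and the accuracy values are hypothetical placeholders.

```python
# Hypothetical rows: the "knn5" label and all values are placeholders,
# not entries from results/all_results.csv.
rows = [
    {"dataset": "ds-a", "name": "model-x", "method": "knn5", "accuracy": 0.88},
    {"dataset": "ds-a", "name": "model-x", "method": "linear", "accuracy": 0.93},
    {"dataset": "ds-a", "name": "model-y", "method": "knn5", "accuracy": 0.90},
    {"dataset": "ds-a", "name": "model-y", "method": "linear", "accuracy": 0.86},
]


def probe_gaps(rows):
    """Return {(dataset, name): linear accuracy minus KNN-5 accuracy}."""
    paired = {}
    for r in rows:
        key = (r["dataset"], r["name"])
        paired.setdefault(key, {})[r["method"]] = r["accuracy"]
    # Keep only pairs where both methods were measured.
    return {
        key: acc["linear"] - acc["knn5"]
        for key, acc in paired.items()
        if "linear" in acc and "knn5" in acc
    }


gaps = probe_gaps(rows)
for key, gap in gaps.items():
    side = "above" if gap > 0 else "on or below"
    print(key, f"gap={gap:+.2f}", f"({side} the diagonal)")
```

A positive gap corresponds to a point above the diagonal, where the linear probe beats KNN-5.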
Figure 4
Mean accuracy across datasets
Variants are ranked by their mean accuracy across the currently selected datasets — the best generalisers within the filter.
Source: torchgeo-bench results CSV
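The ranking in Figure 4 can be sketched as a group-by followed by a mean. The version below assumes the rows are already filtered to a single method and normalization, and it only ranks variants that have a result on every selected dataset so that the means are comparable. That completeness rule is a design choice of this sketch, not something the page confirms, and the rows are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows; names and accuracies are placeholders.
rows = [
    {"dataset": "ds-a", "name": "model-x", "accuracy": 0.90},
    {"dataset": "ds-b", "name": "model-x", "accuracy": 0.80},
    {"dataset": "ds-a", "name": "model-y", "accuracy": 0.95},
    {"dataset": "ds-b", "name": "model-y", "accuracy": 0.70},
    {"dataset": "ds-a", "name": "model-z", "accuracy": 0.99},  # no ds-b row
]


def rank_by_mean_accuracy(rows, datasets):
    """Rank variants by mean accuracy over the selected datasets."""
    selected = set(datasets)
    per_model = defaultdict(dict)
    for r in rows:
        if r["dataset"] in selected:
            per_model[r["name"]][r["dataset"]] = r["accuracy"]
    # Drop variants that are missing any selected dataset.
    complete = {m: a for m, a in per_model.items() if set(a) == selected}
    scored = [(m, mean(a.values())) for m, a in complete.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)


ranking = rank_by_mean_accuracy(rows, ["ds-a", "ds-b"])
for name, score in ranking:
    print(f"{name}: mean accuracy {score:.3f}")
```

Without the completeness filter, a variant evaluated only on easy datasets (like `model-z` here) would rank above variants evaluated on all of them.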
Figure 5
Compute & efficiency
Accuracy alone hides cost. The left panel plots linear-probe accuracy against the backbone's measured throughput on an A100-80GB; up and to the right is the efficient frontier. The right panel extrapolates the dollar cost and kg CO2 emissions of running each model on one million samples in the selected cloud region.
Profile rows were measured on an NVIDIA A100-SXM4-80GB. Prices are cloud on-demand list prices (snapshot 2026-05-15); carbon intensity comes from codecarbon's regional grid table. Both are inlined into the page.
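The page does not give the formula behind the right panel. A plausible sketch, assuming the extrapolation is linear in runtime, converts throughput to GPU-hours and multiplies by an hourly price and by power draw times grid carbon intensity. Every input value below is a placeholder, and the real calculation may include terms this sketch ignores (for example data-centre overhead).

```python
def extrapolate_cost(n_samples, throughput_sps, price_per_hour,
                     power_kw, kg_co2_per_kwh):
    """Linear extrapolation of runtime, cost, and emissions.

    Assumed model: hours = samples / (samples per second) / 3600,
    cost = hours * hourly price,
    emissions = hours * power draw (kW) * grid intensity (kg CO2 per kWh).
    """
    hours = n_samples / throughput_sps / 3600.0
    return {
        "hours": hours,
        "cost": hours * price_per_hour,
        "kg_co2": hours * power_kw * kg_co2_per_kwh,
    }


# Placeholder inputs, not measured values or real list prices.
est = extrapolate_cost(
    n_samples=1_000_000,
    throughput_sps=1000.0,   # samples per second
    price_per_hour=4.0,      # dollars per GPU-hour
    power_kw=0.4,            # average device power draw
    kg_co2_per_kwh=0.4,      # regional grid carbon intensity
)
print({k: round(v, 4) for k, v in est.items()})
```

Under this model both cost and emissions scale inversely with throughput, which is why a faster backbone with slightly lower accuracy can still sit on the efficient frontier.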
Figure 6
Intrinsic dimension by dataset
Each frozen backbone produces a feature manifold that — for a given dataset — has an effective dimensionality much smaller than the embedding size. Higher bars mean the backbone uses more of its budget; the spread between estimators is a coarse confidence interval on the estimate.
One observation per (model, dataset, estimator, bands). Bars show the cross-model mean; horizontal whiskers the [P10, P90] band.
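The page does not name the intrinsic-dimension estimators it uses. As an illustration of the general idea, the sketch below implements the maximum-likelihood form of the TwoNN estimator, which is one common choice and is used here only as an assumption. It estimates dimension from the ratio of each point's second to first nearest-neighbour distance, and is checked on synthetic points that lie on a 2-D plane inside a 10-D space.

```python
import heapq
import math
import random


def twonn_dimension(points):
    """TwoNN-style intrinsic dimension estimate (maximum-likelihood form).

    d_hat = N / sum(log(r2 / r1)), where r1 and r2 are each point's
    first and second nearest-neighbour distances. O(N^2) brute force.
    """
    log_ratios = []
    for i, p in enumerate(points):
        dists = (math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = heapq.nsmallest(2, dists)
        if r1 > 0:  # skip exact duplicates to avoid division by zero
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


# Synthetic embeddings: a 2-D uniform square padded with zeros to 10-D.
rng = random.Random(0)
points = [[rng.random(), rng.random()] + [0.0] * 8 for _ in range(300)]
id_estimate = twonn_dimension(points)
print(f"embedding size=10, estimated intrinsic dimension={id_estimate:.2f}")
```

The estimate should land near 2 even though each vector has 10 components, which is the gap between embedding size and effective dimensionality that Figure 6 describes.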
Figure 7
Does intrinsic dimension predict probe accuracy?
For each (model, dataset, bands) tuple, the linear-probe accuracy is plotted against the intrinsic dimension of the backbone's frozen embeddings. Higher ID does not automatically buy you accuracy — some models produce richly-spread embeddings that aren't linearly separable.
Each point is one (name, dataset, bands) tuple. Its linear-probe accuracy comes from the matching `method="linear"` row.
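One way to put a number on the question in Figure 7 is a rank correlation between intrinsic dimension and linear-probe accuracy. The sketch below computes Spearman's rho with average ranks for ties, using only the standard library. The page does not report such a statistic, and the paired values are invented for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Invented (intrinsic dimension, linear-probe accuracy) pairs.
intrinsic_dim = [5, 8, 12, 20, 30]
linear_acc = [0.80, 0.90, 0.85, 0.88, 0.70]
rho = spearman(intrinsic_dim, linear_acc)
print(f"Spearman rho = {rho:.2f}")
```

A rho close to zero or negative on real data would support the caption's point that a higher intrinsic dimension does not by itself yield linearly separable features.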

Appendix · the underlying data

Click any column to sort. Search the table or use the filters above to narrow the view. Numeric values are rounded for display.