Terrakit: Labels to dataset pipeline¶
This notebook demonstrates generating a ML-ready dataset from a collection of labels. The labels used in this example are in GeoJSON format from two wildfire events reported on Copernicus Rapid Mapping Service. The two wildfire events are:
- Wildfire in Central Madeira, Portugal, 2024/08/26
- Wildfire in Biebrza National Park, Poland, 2025/04/23
import os
from glob import glob
from pathlib import Path
# Set LOGLEVEL to info for more verbose logging
os.environ["LOGLEVEL"] = "WARNING"
import terrakit
# Import TerraKit Utils
from terrakit.general_utils.labels_downloader import (
rapid_mapping_geojson_downloader,
rapid_mapping_class_split,
hugging_face_file_downloader,
EXAMPLE_LABEL_FILES,
EXAMPLE_RASTER_LABEL_FILES,
EXAMPLE_CLASS_LABEL_FILES,
)
from terrakit.general_utils.plotting import (
plot_label_dataframes,
plot_labels_on_map,
plot_tiles_and_label_pair,
plot_chip_and_label_pairs,
)
from terrakit.download.geodata_utils import check_and_crop_bbox
1. Process labels¶
This initial steps takes a directory containing some label files (geojson), and calculates a list of bboxes that encompass the geospatial locations described by the label files. Assuming that the date is included in the label filename, temporal information will also be captured.
The function process_geojson_labels returns a Geopandas DataFrame where each row corresponds to a bbox for a given area. The function also saves this information in a shp file. The shp file is output into the working directory (default = "./tmp").
Either the shp file or the DataFrame can now be used in the next step to tell Terrakit which time and location to download some data from.
# Set a working directory, a dataset name and a directory where some labels can be found
DATASET_NAME = "test_dataset"
WORKING_DIR = f"./tmp/{DATASET_NAME}"
LABELS_FOLDER = "./test_wildfire_vector"
# Download some labels if none already exist.
if (
Path(LABELS_FOLDER).is_dir() is False
or set(EXAMPLE_LABEL_FILES).issubset(
[Path(f).name for f in glob(f"{LABELS_FOLDER}/*.json")]
)
is False
):
rapid_mapping_geojson_downloader(
event_id="748",
aoi="01",
monitoring_number="05",
version="v1",
dest=LABELS_FOLDER,
)
rapid_mapping_geojson_downloader(
event_id="801",
aoi="01",
monitoring_number="02",
version="v1",
dest=LABELS_FOLDER,
)
# Process the labels by providing a labels folder, working directory and dataset name.
labels_gdf, grouped_bbox_gdf = terrakit.process_labels(
labels_folder=LABELS_FOLDER,
dataset_name=DATASET_NAME,
working_dir=WORKING_DIR,
)
print(labels_gdf)
print(grouped_bbox_gdf)
# Plot the process labels and bboxes to confirm they appear as expected.
plot_label_dataframes(labels_gdf, grouped_bbox_gdf)
# Additionally plot labels and bbox on map.
map_collection, title_list = plot_labels_on_map(labels_gdf, grouped_bbox_gdf)
for i in range(0, len(map_collection)):
print(title_list[i])
display(map_collection[i])
Download tile bounding box and labels for: EMSR801_AOI01_DEL_MONIT02_observedEventA_v1_2025-04-23.json
Download tile bounding box and labels for: EMSR748_AOI01_DEL_MONIT05_observedEventA_v1_2024-08-26.json
Multiple label class example¶
It's also possible to have label files representing different label classes by including the pattern _CLASS_1_ in the label file name, in this case the label file would represent class index 1. By default the labels will have class index 1 and everything not represented by the labels would have class index 0. There is also the possibility to set anything not represented by a label as no data with class index -1. In the example below there will be two label classes, and with no data for remaining area, which means that classes for labels are indicies 0 and 1.
# Using a separate working directory, dataset name and an example label folder with multiple classes
CLASSES_DATASET_NAME = "class_test_dataset"
CLASSES_WORKING_DIR = f"./tmp/{CLASSES_DATASET_NAME}"
CLASSES_LABELS_FOLDER = "./test_wildfire_classes_vector"
# Download example multi-class labels from EMSR801 wildfire event (Poland, 2025)
# Using a single monitoring period with multiple spatial features to demonstrate multi-class labeling
# Download some labels if none already exist.
if (
Path(CLASSES_LABELS_FOLDER).is_dir() is False
or set(EXAMPLE_CLASS_LABEL_FILES).issubset(
[Path(f).name for f in glob(f"{CLASSES_LABELS_FOLDER}/*.json")]
)
is False
):
# Download MONIT02 from EMSR801 (contains 2 spatial features from same date)
downloaded_file = rapid_mapping_geojson_downloader(
event_id="801",
aoi="01",
monitoring_number="02",
version="v1",
dest=CLASSES_LABELS_FOLDER,
)
rapid_mapping_class_split(downloaded_file)
# Process the labels by providing a labels folder, working directory and dataset name.
classes_labels_gdf, classes_grouped_bbox_gdf = terrakit.process_labels(
labels_folder=CLASSES_LABELS_FOLDER,
dataset_name=CLASSES_DATASET_NAME,
working_dir=CLASSES_WORKING_DIR,
)
print(classes_labels_gdf)
print(classes_grouped_bbox_gdf)
# Plot the process labels and bboxes to confirm they appear as expected.
plot_label_dataframes(classes_labels_gdf, classes_grouped_bbox_gdf)
# Additionally plot labels and bbox on map.
map_collection, title_list = plot_labels_on_map(
classes_labels_gdf, classes_grouped_bbox_gdf
)
for i in range(0, len(map_collection)):
print(title_list[i])
display(map_collection[i])
Download tile bounding box and labels for: EMSR801_AOI01_DEL_MONIT02_CLASS_1_observedEventA_v1_2025-04-23.json,EMSR801_AOI01_DEL_MONIT02_CLASS_0_observedEventA_v1_2025-04-23.json
2. Download the data¶
Example 2.1: Use a shp file to download data¶
config = {
"download": {
"data_sources": [
{
"data_connector": "sentinel_aws",
"collection_name": "sentinel-2-l2a",
"bands": ["blue", "green", "red"],
},
],
"date_allowance": {"pre_days": 0, "post_days": 21},
"transform": {
"scale_data_xarray": True,
"impute_nans": True,
"reproject": True,
},
"max_cloud_cover": 80,
},
}
queried_data = terrakit.download_data(
data_sources=config["download"]["data_sources"],
date_allowance=config["download"]["date_allowance"],
transform=config["download"]["transform"],
max_cloud_cover=config["download"]["max_cloud_cover"],
dataset_name=DATASET_NAME,
working_dir=WORKING_DIR,
keep_files=False,
)
2.2: Inspect the data¶
Use the plot_tiles_and_label_pair function to inspect the downloaded tiles and corresponding labels.
plot_tiles_and_label_pair(
queried_data, bands=config["download"]["data_sources"][0]["bands"]
)
Legend
image_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed.tif, label_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_labels.tif
---
Legend
image_1: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed.tif, label_1: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_labels.tif
---
Example 2.3: Download data for multi-class labels¶
Now let's download data for the multi-class labels we processed earlier.
Important: When working with multi-class labels, we use the set_no_data=True parameter. This ensures that areas not covered by any label class are marked as no-data (class -1) rather than being assigned to class 0. This is crucial for multi-class classification where class 0 represents a valid label class.
config_classes = {
"download": {
"data_sources": [
{
"data_connector": "sentinel_aws",
"collection_name": "sentinel-2-l2a",
"bands": ["blue", "green", "red"],
},
],
"date_allowance": {"pre_days": 0, "post_days": 21},
"transform": {
"scale_data_xarray": True,
"impute_nans": True,
"reproject": True,
},
"max_cloud_cover": 80,
},
}
# Note: set_no_data=True is critical for multi-class labels
# This sets unlabeled areas to -1 (no-data) instead of 0 (which is a valid class)
queried_data_classes = terrakit.download_data(
data_sources=config_classes["download"]["data_sources"],
date_allowance=config_classes["download"]["date_allowance"],
transform=config_classes["download"]["transform"],
max_cloud_cover=config_classes["download"]["max_cloud_cover"],
dataset_name=CLASSES_DATASET_NAME,
working_dir=CLASSES_WORKING_DIR,
keep_files=False,
set_no_data=True, # Set unlabeled areas as no-data (-1) for multi-class
)
Example 2.4: Inspect the data for multi-class labels¶
Use the plot_tiles_and_label_pair function to inspect the downloaded tiles and corresponding multi-class labels.
Important: Since we used set_no_data=True during download, we need to pass no_data_value=-1 to the plotting function so it correctly treats -1 as the background (not class 0, which is a valid label class). You should see two distinct label classes (0 and 1) representing different burnt area features.
plot_tiles_and_label_pair(
queried_data_classes,
bands=config_classes["download"]["data_sources"][0]["bands"],
no_data_value=-1, # Set to -1 because we used set_no_data=True in download
)
Legend
image_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed.tif, label_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_labels.tif
---
3. Chip the data¶
Now that the tiled data has been downloaded, let's chip it accordingly.
3.1 Example 1: Use the queried data list returned from download_data to find the files to chip.¶
chip_args = {
"chip": {"sample_dim": 256},
}
res = terrakit.chip_and_label_data(
dataset_name=DATASET_NAME,
sample_dim=chip_args["chip"]["sample_dim"],
queried_data=queried_data,
working_dir=WORKING_DIR,
)
# Note that we can use the working directory to find the list of files to chip.
# Uncomment to try this out.
# res = terrakit.chip_and_label_data(
# dataset_name=DATASET_NAME,
# working_dir=WORKING_DIR,
# sample_dim=chip_args["chip"]["sample_dim"],
# keep_files=True,
# )
Check the results¶
Use the plot_chip_and_label_pairs function to check the chip and label pairs look as expected.
plot_chip_and_label_pairs(
res, bands=config["download"]["data_sources"][0]["bands"], samples=10
)
Legend
image_0: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_19.data.tif
image_1: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_18.data.tif
image_2: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_12.data.tif
image_3: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_1.data.tif
image_4: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_5.data.tif
image_5: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_10.data.tif
image_6: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_2.data.tif
image_7: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_0.data.tif
image_8: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_23.data.tif
image_9: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_16.data.tif
---
label_0: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_19.label.tif
label_1: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_18.label.tif
label_2: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_12.label.tif
label_3: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_1.label.tif
label_4: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_5.label.tif
label_5: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_10.label.tif
label_6: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_2.label.tif
label_7: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_0.label.tif
label_8: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_23.label.tif
label_9: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_16.label.tif
Example 3.2: Chip the multi-class data¶
Now let's chip the multi-class data we downloaded earlier.
# Check the tile dimensions before chipping
import rasterio
with rasterio.open(queried_data_classes[0]) as src:
print(f"Tile dimensions: {src.width} x {src.height}")
print(f"Tile shape: {src.shape}")
# Use smaller chip size (128) since this example has a small area (~117x255 pixels)
# This will create multiple chips for demonstration purposes
chip_args_classes = {
"chip": {"sample_dim": 128},
}
res_classes = terrakit.chip_and_label_data(
dataset_name=CLASSES_DATASET_NAME,
sample_dim=chip_args_classes["chip"]["sample_dim"],
queried_data=queried_data_classes,
working_dir=CLASSES_WORKING_DIR,
)
print(f"\nTotal files returned: {len(res_classes)}")
print(f"Files: {res_classes}")
Check the multi-class chip results¶
Use the plot_chip_and_label_pairs function to check the multi-class chip and label pairs. You should see chips with labels showing different classes (0 and 1) in different colors.
plot_chip_and_label_pairs(
res_classes,
bands=config_classes["download"]["data_sources"][0]["bands"],
samples=10,
no_data_value=-1, # Set to -1 because we used set_no_data=True in download
)
Legend
image_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_0.data.tif
image_1: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_1.data.tif
---
label_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_0.label.tif
label_1: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_1.label.tif
4. Store¶
terrakit.taco_store_data(
dataset_name=DATASET_NAME,
working_dir=WORKING_DIR,
save_dir=WORKING_DIR,
tortilla_name="terrakit_curated_dataset.tortilla",
)
import pandas as pd
import tacoreader
df = tacoreader.load("./tmp/test_dataset/terrakit_curated_dataset.tortilla")
splits = ["train", "validation", "test"]
sample_rows = []
for split in splits:
subset = df[df["tortilla:data_split"] == split]
if not subset.empty:
sample_rows.append(subset.iloc[0]) # Take first row
# Create a new DataFrame
sample_df = pd.DataFrame(sample_rows)
# Save back to a new file
tacoreader.compile(sample_df, "./tmp/test_dataset/sampled_dataset.tortilla")
Making a tortilla 🫓: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 54471.48file/s]
PosixPath('tmp/test_dataset/sampled_dataset.tortilla')
5. Upload¶
6. Additional utility functions¶
check_and_crop_bbox(bbox=[-73.645550, 44.233885, -72.539635, 44.278023], resolution=10)
2026-02-09 10:34:00,444 - INFO - Dimension less than 244, will pad - (2, 491) 2026-02-09 10:34:00,445 - INFO - New dimensions are (310, 500)
([BBox(((-73.64555, 44.233885), (-73.36960432686855, 44.278023)), crs=CRS('4326')),
BBox(((-73.36960432686855, 44.233885), (-73.0936586537371, 44.278023)), crs=CRS('4326')),
BBox(((-73.0936586537371, 44.233885), (-72.81771298060565, 44.278023)), crs=CRS('4326')),
BBox(((-72.81771298060565, 44.233885), (-72.5417673074742, 44.278023)), crs=CRS('4326')),
BBox(((-72.56101965371401, 44.233885), (-72.5203826537602, 44.278023)), crs=CRS('4326'))],
[(2194, 530), (2192, 538), (2191, 545), (2189, 553), (310, 500)])
Raster Labels to Data¶
Let's take a quick look at how we can also use TerraKit to generating a ML-ready dataset from a collection of raster labels.
The label used in this section are in raster format from two burn scar events included in the ibm-nasa-geospatial/hls_burn_scars hugging face dataset.
1. Process raster labels¶
As before, this initial steps takes a directory containing some label files. This time the labels are raster files (.tif). The date is assumed to be contained in the filename again. Supported date types are YYYYDDD (7), YYYYMMDD (8), YYMMDD (6 -> 20YYMMDD).
# Set a working directory, a dataset name and a directory where some labels can be found
DATASET_NAME_RASTER = "test_dataset_raster"
WORKING_DIR_RASTER = f"./tmp/{DATASET_NAME_RASTER}"
LABELS_FOLDER_RASTER = "./test_burn_scar_raster"
if (
Path(LABELS_FOLDER_RASTER).is_dir() is False
or set(EXAMPLE_RASTER_LABEL_FILES).issubset(
[Path(f).name for f in glob(f"{LABELS_FOLDER_RASTER}/*.tif")]
)
is False
):
for filename in EXAMPLE_RASTER_LABEL_FILES:
hugging_face_file_downloader(
repo_id="ibm-nasa-geospatial/hls_burn_scars",
filename=filename,
revision="e48662b31288f1d5f1fd5cf5ebb0e454092a19ce",
subfolder="training",
dest=LABELS_FOLDER_RASTER,
)
labels_gdf, grouped_bbox_gdf = terrakit.process_labels(
labels_folder=LABELS_FOLDER_RASTER,
dataset_name=DATASET_NAME_RASTER,
working_dir=WORKING_DIR_RASTER,
label_type="raster",
)
print(labels_gdf)
print(grouped_bbox_gdf)
plot_label_dataframes(labels_gdf, grouped_bbox_gdf)
2. Download the data¶
config = {
"download": {
"data_sources": [
{
"data_connector": "sentinel_aws",
"collection_name": "sentinel-2-l2a",
"bands": ["blue", "green", "red"],
},
],
"date_allowance": {"pre_days": 0, "post_days": 21},
"transform": {
"scale_data_xarray": True,
"impute_nans": True,
"reproject": True,
},
"max_cloud_cover": 80,
},
}
queried_data = terrakit.download_data(
data_sources=config["download"]["data_sources"],
date_allowance=config["download"]["date_allowance"],
transform=config["download"]["transform"],
max_cloud_cover=config["download"]["max_cloud_cover"],
dataset_name=DATASET_NAME_RASTER,
working_dir=WORKING_DIR_RASTER,
keep_files=False,
)
plot_tiles_and_label_pair(
queried_data, bands=config["download"]["data_sources"][0]["bands"]
)
Legend
image_0: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed.tif, label_0: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_labels.tif
---
Legend
image_1: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed.tif, label_1: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_labels.tif
---
3. Chip the data¶
Now that the tiled data has been downloaded, let's chip it accordingly.
chip_args = {
"chip": {"sample_dim": 256},
}
res = terrakit.chip_and_label_data(
dataset_name=DATASET_NAME_RASTER,
sample_dim=chip_args["chip"]["sample_dim"],
queried_data=queried_data,
working_dir=WORKING_DIR_RASTER,
)
plot_chip_and_label_pairs(
res, bands=config["download"]["data_sources"][0]["bands"], samples=10
)
Legend
image_0: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_14.data.tif
image_1: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_3.data.tif
image_2: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_13.data.tif
image_3: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_5.data.tif
image_4: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_15.data.tif
image_5: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_3.data.tif
image_6: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_12.data.tif
image_7: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_4.data.tif
image_8: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_2.data.tif
image_9: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_11.data.tif
---
label_0: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_14.label.tif
label_1: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_3.label.tif
label_2: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_13.label.tif
label_3: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_5.label.tif
label_4: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_15.label.tif
label_5: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_3.label.tif
label_6: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_12.label.tif
label_7: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_4.label.tif
label_8: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_2.label.tif
label_9: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_11.label.tif
4. Store¶
terrakit.taco_store_data(
dataset_name=DATASET_NAME_RASTER,
working_dir=WORKING_DIR_RASTER,
save_dir=WORKING_DIR_RASTER,
tortilla_name="terrakit_curated_dataset.tortilla",
)