Terrakit: Labels to dataset pipeline¶

This notebook demonstrates generating a ML-ready dataset from a collection of labels. The labels used in this example are in GeoJSON format from two wildfire events reported on Copernicus Rapid Mapping Service. The two wildfire events are:

Wildfire in Central Madeira, Portugal, 2024/08/26
Wildfire in Biebrza National Park, Poland, 2025/04/23

Install Terrakit: For instructions on how to install TerraKit, take a look at the Welcome page.

In [26]:

Copied!





import os
from glob import glob
from pathlib import Path

# Set LOGLEVEL to info for more verbose logging
os.environ["LOGLEVEL"] = "WARNING"

import terrakit

# Import TerraKit Utils
from terrakit.general_utils.labels_downloader import (
    rapid_mapping_geojson_downloader,
    rapid_mapping_class_split,
    hugging_face_file_downloader,
    EXAMPLE_LABEL_FILES,
    EXAMPLE_RASTER_LABEL_FILES,
    EXAMPLE_CLASS_LABEL_FILES,
)
from terrakit.general_utils.plotting import (
    plot_label_dataframes,
    plot_labels_on_map,
    plot_tiles_and_label_pair,
    plot_chip_and_label_pairs,
)
from terrakit.download.geodata_utils import check_and_crop_bbox
import os
from glob import glob
from pathlib import Path

# Set LOGLEVEL to info for more verbose logging
os.environ["LOGLEVEL"] = "WARNING"

import terrakit

# Import TerraKit Utils
from terrakit.general_utils.labels_downloader import (
    rapid_mapping_geojson_downloader,
    rapid_mapping_class_split,
    hugging_face_file_downloader,
    EXAMPLE_LABEL_FILES,
    EXAMPLE_RASTER_LABEL_FILES,
    EXAMPLE_CLASS_LABEL_FILES,
)
from terrakit.general_utils.plotting import (
    plot_label_dataframes,
    plot_labels_on_map,
    plot_tiles_and_label_pair,
    plot_chip_and_label_pairs,
)
from terrakit.download.geodata_utils import check_and_crop_bbox

1. Process labels¶

This initial steps takes a directory containing some label files (geojson), and calculates a list of bboxes that encompass the geospatial locations described by the label files. Assuming that the date is included in the label filename, temporal information will also be captured.

The function process_geojson_labels returns a Geopandas DataFrame where each row corresponds to a bbox for a given area. The function also saves this information in a shp file. The shp file is output into the working directory (default = "./tmp").

Either the shp file or the DataFrame can now be used in the next step to tell Terrakit which time and location to download some data from.

In [2]:

Copied!





# Set a working directory, a dataset name and a directory where some labels can be found
DATASET_NAME = "test_dataset"
WORKING_DIR = f"./tmp/{DATASET_NAME}"
LABELS_FOLDER = "./test_wildfire_vector"
# Set a working directory, a dataset name and a directory where some labels can be found
DATASET_NAME = "test_dataset"
WORKING_DIR = f"./tmp/{DATASET_NAME}"
LABELS_FOLDER = "./test_wildfire_vector"

Example labels: To download a set of example labels, use the `rapid_mapping_geojson_downloader` module.

In [ ]:

Copied!





# Download some labels if none already exist.
if (
    Path(LABELS_FOLDER).is_dir() is False
    or set(EXAMPLE_LABEL_FILES).issubset(
        [Path(f).name for f in glob(f"{LABELS_FOLDER}/*.json")]
    )
    is False
):
    rapid_mapping_geojson_downloader(
        event_id="748",
        aoi="01",
        monitoring_number="05",
        version="v1",
        dest=LABELS_FOLDER,
    )
    rapid_mapping_geojson_downloader(
        event_id="801",
        aoi="01",
        monitoring_number="02",
        version="v1",
        dest=LABELS_FOLDER,
    )
# Download some labels if none already exist.
if (
    Path(LABELS_FOLDER).is_dir() is False
    or set(EXAMPLE_LABEL_FILES).issubset(
        [Path(f).name for f in glob(f"{LABELS_FOLDER}/*.json")]
    )
    is False
):
    rapid_mapping_geojson_downloader(
        event_id="748",
        aoi="01",
        monitoring_number="05",
        version="v1",
        dest=LABELS_FOLDER,
    )
    rapid_mapping_geojson_downloader(
        event_id="801",
        aoi="01",
        monitoring_number="02",
        version="v1",
        dest=LABELS_FOLDER,
    )

In [4]:

Copied!





# Process the labels by providing a labels folder, working directory and dataset name.
labels_gdf, grouped_bbox_gdf = terrakit.process_labels(
    labels_folder=LABELS_FOLDER,
    dataset_name=DATASET_NAME,
    working_dir=WORKING_DIR,
)
print(labels_gdf)
print(grouped_bbox_gdf)
# Process the labels by providing a labels folder, working directory and dataset name.
labels_gdf, grouped_bbox_gdf = terrakit.process_labels(
    labels_folder=LABELS_FOLDER,
    dataset_name=DATASET_NAME,
    working_dir=WORKING_DIR,
)
print(labels_gdf)
print(grouped_bbox_gdf)

In [5]:

Copied!

# Plot the process labels and bboxes to confirm they appear as expected.
plot_label_dataframes(labels_gdf, grouped_bbox_gdf)
# Plot the process labels and bboxes to confirm they appear as expected.
plot_label_dataframes(labels_gdf, grouped_bbox_gdf)

No description has been provided for this image

In [6]:

Copied!





# Additionally plot labels and bbox on map.
map_collection, title_list = plot_labels_on_map(labels_gdf, grouped_bbox_gdf)
for i in range(0, len(map_collection)):
    print(title_list[i])
    display(map_collection[i])
# Additionally plot labels and bbox on map.
map_collection, title_list = plot_labels_on_map(labels_gdf, grouped_bbox_gdf)
for i in range(0, len(map_collection)):
    print(title_list[i])
    display(map_collection[i])


Download tile bounding box and labels for: EMSR801_AOI01_DEL_MONIT02_observedEventA_v1_2025-04-23.json

Make this Notebook Trusted to load map: File -> Trust Notebook


Download tile bounding box and labels for: EMSR748_AOI01_DEL_MONIT05_observedEventA_v1_2024-08-26.json

Make this Notebook Trusted to load map: File -> Trust Notebook

Multiple label class example¶

It's also possible to have label files representing different label classes by including the pattern _CLASS_1_ in the label file name, in this case the label file would represent class index 1. By default the labels will have class index 1 and everything not represented by the labels would have class index 0. There is also the possibility to set anything not represented by a label as no data with class index -1. In the example below there will be two label classes, and with no data for remaining area, which means that classes for labels are indicies 0 and 1.

In [27]:

Copied!





# Using a separate working directory, dataset name and an example label folder with multiple classes
CLASSES_DATASET_NAME = "class_test_dataset"
CLASSES_WORKING_DIR = f"./tmp/{CLASSES_DATASET_NAME}"
CLASSES_LABELS_FOLDER = "./test_wildfire_classes_vector"
# Using a separate working directory, dataset name and an example label folder with multiple classes
CLASSES_DATASET_NAME = "class_test_dataset"
CLASSES_WORKING_DIR = f"./tmp/{CLASSES_DATASET_NAME}"
CLASSES_LABELS_FOLDER = "./test_wildfire_classes_vector"

Example multi-class labels: To download a set of example multi-class labels from EMSR801 wildfire event in Poland (2025), use the `rapid_mapping_geojson_downloader` module. This demonstrates multi-class labeling using multiple spatial features from the same monitoring period (MONIT02, 2025-04-23).

In [ ]:

Copied!





# Download example multi-class labels from EMSR801 wildfire event (Poland, 2025)
# Using a single monitoring period with multiple spatial features to demonstrate multi-class labeling

# Download some labels if none already exist.
if (
    Path(CLASSES_LABELS_FOLDER).is_dir() is False
    or set(EXAMPLE_CLASS_LABEL_FILES).issubset(
        [Path(f).name for f in glob(f"{CLASSES_LABELS_FOLDER}/*.json")]
    )
    is False
):
    # Download MONIT02 from EMSR801 (contains 2 spatial features from same date)
    downloaded_file = rapid_mapping_geojson_downloader(
        event_id="801",
        aoi="01",
        monitoring_number="02",
        version="v1",
        dest=CLASSES_LABELS_FOLDER,
    )

    rapid_mapping_class_split(downloaded_file)
# Download example multi-class labels from EMSR801 wildfire event (Poland, 2025)
# Using a single monitoring period with multiple spatial features to demonstrate multi-class labeling

# Download some labels if none already exist.
if (
    Path(CLASSES_LABELS_FOLDER).is_dir() is False
    or set(EXAMPLE_CLASS_LABEL_FILES).issubset(
        [Path(f).name for f in glob(f"{CLASSES_LABELS_FOLDER}/*.json")]
    )
    is False
):
    # Download MONIT02 from EMSR801 (contains 2 spatial features from same date)
    downloaded_file = rapid_mapping_geojson_downloader(
        event_id="801",
        aoi="01",
        monitoring_number="02",
        version="v1",
        dest=CLASSES_LABELS_FOLDER,
    )

    rapid_mapping_class_split(downloaded_file)

In [9]:

Copied!





# Process the labels by providing a labels folder, working directory and dataset name.
classes_labels_gdf, classes_grouped_bbox_gdf = terrakit.process_labels(
    labels_folder=CLASSES_LABELS_FOLDER,
    dataset_name=CLASSES_DATASET_NAME,
    working_dir=CLASSES_WORKING_DIR,
)
print(classes_labels_gdf)
print(classes_grouped_bbox_gdf)
# Process the labels by providing a labels folder, working directory and dataset name.
classes_labels_gdf, classes_grouped_bbox_gdf = terrakit.process_labels(
    labels_folder=CLASSES_LABELS_FOLDER,
    dataset_name=CLASSES_DATASET_NAME,
    working_dir=CLASSES_WORKING_DIR,
)
print(classes_labels_gdf)
print(classes_grouped_bbox_gdf)

In [24]:

Copied!

# Plot the process labels and bboxes to confirm they appear as expected.
plot_label_dataframes(classes_labels_gdf, classes_grouped_bbox_gdf)
# Plot the process labels and bboxes to confirm they appear as expected.
plot_label_dataframes(classes_labels_gdf, classes_grouped_bbox_gdf)

In [11]:

Copied!





# Additionally plot labels and bbox on map.
map_collection, title_list = plot_labels_on_map(
    classes_labels_gdf, classes_grouped_bbox_gdf
)
for i in range(0, len(map_collection)):
    print(title_list[i])
    display(map_collection[i])
# Additionally plot labels and bbox on map.
map_collection, title_list = plot_labels_on_map(
    classes_labels_gdf, classes_grouped_bbox_gdf
)
for i in range(0, len(map_collection)):
    print(title_list[i])
    display(map_collection[i])


Download tile bounding box and labels for: EMSR801_AOI01_DEL_MONIT02_CLASS_1_observedEventA_v1_2025-04-23.json,EMSR801_AOI01_DEL_MONIT02_CLASS_0_observedEventA_v1_2025-04-23.json

Make this Notebook Trusted to load map: File -> Trust Notebook

2. Download the data¶

Example 2.1: Use a shp file to download data¶

In [12]:

Copied!





config = {
    "download": {
        "data_sources": [
            {
                "data_connector": "sentinel_aws",
                "collection_name": "sentinel-2-l2a",
                "bands": ["blue", "green", "red"],
            },
        ],
        "date_allowance": {"pre_days": 0, "post_days": 21},
        "transform": {
            "scale_data_xarray": True,
            "impute_nans": True,
            "reproject": True,
        },
        "max_cloud_cover": 80,
    },
}

queried_data = terrakit.download_data(
    data_sources=config["download"]["data_sources"],
    date_allowance=config["download"]["date_allowance"],
    transform=config["download"]["transform"],
    max_cloud_cover=config["download"]["max_cloud_cover"],
    dataset_name=DATASET_NAME,
    working_dir=WORKING_DIR,
    keep_files=False,
)
config = {
    "download": {
        "data_sources": [
            {
                "data_connector": "sentinel_aws",
                "collection_name": "sentinel-2-l2a",
                "bands": ["blue", "green", "red"],
            },
        ],
        "date_allowance": {"pre_days": 0, "post_days": 21},
        "transform": {
            "scale_data_xarray": True,
            "impute_nans": True,
            "reproject": True,
        },
        "max_cloud_cover": 80,
    },
}

queried_data = terrakit.download_data(
    data_sources=config["download"]["data_sources"],
    date_allowance=config["download"]["date_allowance"],
    transform=config["download"]["transform"],
    max_cloud_cover=config["download"]["max_cloud_cover"],
    dataset_name=DATASET_NAME,
    working_dir=WORKING_DIR,
    keep_files=False,
)

2.2: Inspect the data¶

Use the plot_tiles_and_label_pair function to inspect the downloaded tiles and corresponding labels.

In [13]:

Copied!

plot_tiles_and_label_pair(
    queried_data, bands=config["download"]["data_sources"][0]["bands"]
)
plot_tiles_and_label_pair(
    queried_data, bands=config["download"]["data_sources"][0]["bands"]
)

Legend
image_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed.tif, label_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_labels.tif
---

Legend
image_1: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed.tif, label_1: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_labels.tif
---

Example 2.3: Download data for multi-class labels¶

Now let's download data for the multi-class labels we processed earlier.

Important: When working with multi-class labels, we use the set_no_data=True parameter. This ensures that areas not covered by any label class are marked as no-data (class -1) rather than being assigned to class 0. This is crucial for multi-class classification where class 0 represents a valid label class.

In [14]:

Copied!





config_classes = {
    "download": {
        "data_sources": [
            {
                "data_connector": "sentinel_aws",
                "collection_name": "sentinel-2-l2a",
                "bands": ["blue", "green", "red"],
            },
        ],
        "date_allowance": {"pre_days": 0, "post_days": 21},
        "transform": {
            "scale_data_xarray": True,
            "impute_nans": True,
            "reproject": True,
        },
        "max_cloud_cover": 80,
    },
}

# Note: set_no_data=True is critical for multi-class labels
# This sets unlabeled areas to -1 (no-data) instead of 0 (which is a valid class)
queried_data_classes = terrakit.download_data(
    data_sources=config_classes["download"]["data_sources"],
    date_allowance=config_classes["download"]["date_allowance"],
    transform=config_classes["download"]["transform"],
    max_cloud_cover=config_classes["download"]["max_cloud_cover"],
    dataset_name=CLASSES_DATASET_NAME,
    working_dir=CLASSES_WORKING_DIR,
    keep_files=False,
    set_no_data=True,  # Set unlabeled areas as no-data (-1) for multi-class
)
config_classes = {
    "download": {
        "data_sources": [
            {
                "data_connector": "sentinel_aws",
                "collection_name": "sentinel-2-l2a",
                "bands": ["blue", "green", "red"],
            },
        ],
        "date_allowance": {"pre_days": 0, "post_days": 21},
        "transform": {
            "scale_data_xarray": True,
            "impute_nans": True,
            "reproject": True,
        },
        "max_cloud_cover": 80,
    },
}

# Note: set_no_data=True is critical for multi-class labels
# This sets unlabeled areas to -1 (no-data) instead of 0 (which is a valid class)
queried_data_classes = terrakit.download_data(
    data_sources=config_classes["download"]["data_sources"],
    date_allowance=config_classes["download"]["date_allowance"],
    transform=config_classes["download"]["transform"],
    max_cloud_cover=config_classes["download"]["max_cloud_cover"],
    dataset_name=CLASSES_DATASET_NAME,
    working_dir=CLASSES_WORKING_DIR,
    keep_files=False,
    set_no_data=True,  # Set unlabeled areas as no-data (-1) for multi-class
)

Example 2.4: Inspect the data for multi-class labels¶

Use the plot_tiles_and_label_pair function to inspect the downloaded tiles and corresponding multi-class labels.

Important: Since we used set_no_data=True during download, we need to pass no_data_value=-1 to the plotting function so it correctly treats -1 as the background (not class 0, which is a valid label class). You should see two distinct label classes (0 and 1) representing different burnt area features.

In [15]:

Copied!





plot_tiles_and_label_pair(
    queried_data_classes,
    bands=config_classes["download"]["data_sources"][0]["bands"],
    no_data_value=-1,  # Set to -1 because we used set_no_data=True in download
)
plot_tiles_and_label_pair(
    queried_data_classes,
    bands=config_classes["download"]["data_sources"][0]["bands"],
    no_data_value=-1,  # Set to -1 because we used set_no_data=True in download
)

Legend
image_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed.tif, label_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_labels.tif
---

3. Chip the data¶

Now that the tiled data has been downloaded, let's chip it accordingly.

3.1 Example 1: Use the queried data list returned from download_data to find the files to chip.¶

In [ ]:

Copied!





chip_args = {
    "chip": {"sample_dim": 256},
}

res = terrakit.chip_and_label_data(
    dataset_name=DATASET_NAME,
    sample_dim=chip_args["chip"]["sample_dim"],
    queried_data=queried_data,
    working_dir=WORKING_DIR,
)

# Note that we can use the working directory to find the list of files to chip.
# Uncomment to try this out.
# res = terrakit.chip_and_label_data(
#     dataset_name=DATASET_NAME,
#     working_dir=WORKING_DIR,
#     sample_dim=chip_args["chip"]["sample_dim"],
#     keep_files=True,
# )
chip_args = {
    "chip": {"sample_dim": 256},
}

res = terrakit.chip_and_label_data(
    dataset_name=DATASET_NAME,
    sample_dim=chip_args["chip"]["sample_dim"],
    queried_data=queried_data,
    working_dir=WORKING_DIR,
)

# Note that we can use the working directory to find the list of files to chip.
# Uncomment to try this out.
# res = terrakit.chip_and_label_data(
#     dataset_name=DATASET_NAME,
#     working_dir=WORKING_DIR,
#     sample_dim=chip_args["chip"]["sample_dim"],
#     keep_files=True,
# )

Check the results¶

Use the plot_chip_and_label_pairs function to check the chip and label pairs look as expected.

In [17]:

Copied!

plot_chip_and_label_pairs(
    res, bands=config["download"]["data_sources"][0]["bands"], samples=10
)
plot_chip_and_label_pairs(
    res, bands=config["download"]["data_sources"][0]["bands"], samples=10
)

Legend
image_0: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_19.data.tif
image_1: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_18.data.tif
image_2: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_12.data.tif
image_3: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_1.data.tif
image_4: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_5.data.tif
image_5: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_10.data.tif
image_6: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_2.data.tif
image_7: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_0.data.tif
image_8: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_23.data.tif
image_9: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_16.data.tif
---
label_0: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_19.label.tif
label_1: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_18.label.tif
label_2: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_12.label.tif
label_3: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_1.label.tif
label_4: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_5.label.tif
label_5: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_10.label.tif
label_6: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_2.label.tif
label_7: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_0.label.tif
label_8: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_23.label.tif
label_9: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_16.label.tif

Example 3.2: Chip the multi-class data¶

Now let's chip the multi-class data we downloaded earlier.

In [ ]:

Copied!





# Check the tile dimensions before chipping
import rasterio

with rasterio.open(queried_data_classes[0]) as src:
    print(f"Tile dimensions: {src.width} x {src.height}")
    print(f"Tile shape: {src.shape}")

# Use smaller chip size (128) since this example has a small area (~117x255 pixels)
# This will create multiple chips for demonstration purposes
chip_args_classes = {
    "chip": {"sample_dim": 128},
}

res_classes = terrakit.chip_and_label_data(
    dataset_name=CLASSES_DATASET_NAME,
    sample_dim=chip_args_classes["chip"]["sample_dim"],
    queried_data=queried_data_classes,
    working_dir=CLASSES_WORKING_DIR,
)

print(f"\nTotal files returned: {len(res_classes)}")
print(f"Files: {res_classes}")
# Check the tile dimensions before chipping
import rasterio

with rasterio.open(queried_data_classes[0]) as src:
    print(f"Tile dimensions: {src.width} x {src.height}")
    print(f"Tile shape: {src.shape}")

# Use smaller chip size (128) since this example has a small area (~117x255 pixels)
# This will create multiple chips for demonstration purposes
chip_args_classes = {
    "chip": {"sample_dim": 128},
}

res_classes = terrakit.chip_and_label_data(
    dataset_name=CLASSES_DATASET_NAME,
    sample_dim=chip_args_classes["chip"]["sample_dim"],
    queried_data=queried_data_classes,
    working_dir=CLASSES_WORKING_DIR,
)

print(f"\nTotal files returned: {len(res_classes)}")
print(f"Files: {res_classes}")

Check the multi-class chip results¶

Use the plot_chip_and_label_pairs function to check the multi-class chip and label pairs. You should see chips with labels showing different classes (0 and 1) in different colors.

In [29]:

Copied!





plot_chip_and_label_pairs(
    res_classes,
    bands=config_classes["download"]["data_sources"][0]["bands"],
    samples=10,
    no_data_value=-1,  # Set to -1 because we used set_no_data=True in download
)
plot_chip_and_label_pairs(
    res_classes,
    bands=config_classes["download"]["data_sources"][0]["bands"],
    samples=10,
    no_data_value=-1,  # Set to -1 because we used set_no_data=True in download
)

Legend
image_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_0.data.tif
image_1: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_1.data.tif
---
label_0: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_0.label.tif
label_1: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_1.label.tif

4. Store¶

In [ ]:

Copied!





terrakit.taco_store_data(
    dataset_name=DATASET_NAME,
    working_dir=WORKING_DIR,
    save_dir=WORKING_DIR,
    tortilla_name="terrakit_curated_dataset.tortilla",
)
terrakit.taco_store_data(
    dataset_name=DATASET_NAME,
    working_dir=WORKING_DIR,
    save_dir=WORKING_DIR,
    tortilla_name="terrakit_curated_dataset.tortilla",
)

In [21]:

Copied!





import pandas as pd
import tacoreader

df = tacoreader.load("./tmp/test_dataset/terrakit_curated_dataset.tortilla")
splits = ["train", "validation", "test"]
sample_rows = []

for split in splits:
    subset = df[df["tortilla:data_split"] == split]
    if not subset.empty:
        sample_rows.append(subset.iloc[0])  # Take first row

# Create a new DataFrame
sample_df = pd.DataFrame(sample_rows)

# Save back to a new file
tacoreader.compile(sample_df, "./tmp/test_dataset/sampled_dataset.tortilla")
import pandas as pd
import tacoreader

df = tacoreader.load("./tmp/test_dataset/terrakit_curated_dataset.tortilla")
splits = ["train", "validation", "test"]
sample_rows = []

for split in splits:
    subset = df[df["tortilla:data_split"] == split]
    if not subset.empty:
        sample_rows.append(subset.iloc[0])  # Take first row

# Create a new DataFrame
sample_df = pd.DataFrame(sample_rows)

# Save back to a new file
tacoreader.compile(sample_df, "./tmp/test_dataset/sampled_dataset.tortilla")

Making a tortilla 🫓: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 54471.48file/s]

Out[21]:

PosixPath('tmp/test_dataset/sampled_dataset.tortilla')

5. Upload¶

6. Additional utility functions¶

In [22]:

Copied!

check_and_crop_bbox(bbox=[-73.645550, 44.233885, -72.539635, 44.278023], resolution=10)
check_and_crop_bbox(bbox=[-73.645550, 44.233885, -72.539635, 44.278023], resolution=10)

2026-02-09 10:34:00,444 - INFO - Dimension less than 244, will pad - (2, 491)
2026-02-09 10:34:00,445 - INFO - New dimensions are (310, 500)

Out[22]:

([BBox(((-73.64555, 44.233885), (-73.36960432686855, 44.278023)), crs=CRS('4326')),
  BBox(((-73.36960432686855, 44.233885), (-73.0936586537371, 44.278023)), crs=CRS('4326')),
  BBox(((-73.0936586537371, 44.233885), (-72.81771298060565, 44.278023)), crs=CRS('4326')),
  BBox(((-72.81771298060565, 44.233885), (-72.5417673074742, 44.278023)), crs=CRS('4326')),
  BBox(((-72.56101965371401, 44.233885), (-72.5203826537602, 44.278023)), crs=CRS('4326'))],
 [(2194, 530), (2192, 538), (2191, 545), (2189, 553), (310, 500)])

Raster Labels to Data¶

Let's take a quick look at how we can also use TerraKit to generating a ML-ready dataset from a collection of raster labels.

The label used in this section are in raster format from two burn scar events included in the ibm-nasa-geospatial/hls_burn_scars hugging face dataset.

1. Process raster labels¶

As before, this initial steps takes a directory containing some label files. This time the labels are raster files (.tif). The date is assumed to be contained in the filename again. Supported date types are YYYYDDD (7), YYYYMMDD (8), YYMMDD (6 -> 20YYMMDD).

In [23]:

Copied!





# Set a working directory, a dataset name and a directory where some labels can be found
DATASET_NAME_RASTER = "test_dataset_raster"
WORKING_DIR_RASTER = f"./tmp/{DATASET_NAME_RASTER}"
LABELS_FOLDER_RASTER = "./test_burn_scar_raster"
# Set a working directory, a dataset name and a directory where some labels can be found
DATASET_NAME_RASTER = "test_dataset_raster"
WORKING_DIR_RASTER = f"./tmp/{DATASET_NAME_RASTER}"
LABELS_FOLDER_RASTER = "./test_burn_scar_raster"

Example labels: To download a set of example raster labels, use the `hugging_face_file_downloader` function.

In [ ]:

Copied!





if (
    Path(LABELS_FOLDER_RASTER).is_dir() is False
    or set(EXAMPLE_RASTER_LABEL_FILES).issubset(
        [Path(f).name for f in glob(f"{LABELS_FOLDER_RASTER}/*.tif")]
    )
    is False
):
    for filename in EXAMPLE_RASTER_LABEL_FILES:
        hugging_face_file_downloader(
            repo_id="ibm-nasa-geospatial/hls_burn_scars",
            filename=filename,
            revision="e48662b31288f1d5f1fd5cf5ebb0e454092a19ce",
            subfolder="training",
            dest=LABELS_FOLDER_RASTER,
        )
if (
    Path(LABELS_FOLDER_RASTER).is_dir() is False
    or set(EXAMPLE_RASTER_LABEL_FILES).issubset(
        [Path(f).name for f in glob(f"{LABELS_FOLDER_RASTER}/*.tif")]
    )
    is False
):
    for filename in EXAMPLE_RASTER_LABEL_FILES:
        hugging_face_file_downloader(
            repo_id="ibm-nasa-geospatial/hls_burn_scars",
            filename=filename,
            revision="e48662b31288f1d5f1fd5cf5ebb0e454092a19ce",
            subfolder="training",
            dest=LABELS_FOLDER_RASTER,
        )

In [ ]:

Copied!





labels_gdf, grouped_bbox_gdf = terrakit.process_labels(
    labels_folder=LABELS_FOLDER_RASTER,
    dataset_name=DATASET_NAME_RASTER,
    working_dir=WORKING_DIR_RASTER,
    label_type="raster",
)
print(labels_gdf)
print(grouped_bbox_gdf)
labels_gdf, grouped_bbox_gdf = terrakit.process_labels(
    labels_folder=LABELS_FOLDER_RASTER,
    dataset_name=DATASET_NAME_RASTER,
    working_dir=WORKING_DIR_RASTER,
    label_type="raster",
)
print(labels_gdf)
print(grouped_bbox_gdf)

In [26]:

Copied!

plot_label_dataframes(labels_gdf, grouped_bbox_gdf)
plot_label_dataframes(labels_gdf, grouped_bbox_gdf)

2. Download the data¶

In [ ]:

Copied!





config = {
    "download": {
        "data_sources": [
            {
                "data_connector": "sentinel_aws",
                "collection_name": "sentinel-2-l2a",
                "bands": ["blue", "green", "red"],
            },
        ],
        "date_allowance": {"pre_days": 0, "post_days": 21},
        "transform": {
            "scale_data_xarray": True,
            "impute_nans": True,
            "reproject": True,
        },
        "max_cloud_cover": 80,
    },
}
queried_data = terrakit.download_data(
    data_sources=config["download"]["data_sources"],
    date_allowance=config["download"]["date_allowance"],
    transform=config["download"]["transform"],
    max_cloud_cover=config["download"]["max_cloud_cover"],
    dataset_name=DATASET_NAME_RASTER,
    working_dir=WORKING_DIR_RASTER,
    keep_files=False,
)
config = {
    "download": {
        "data_sources": [
            {
                "data_connector": "sentinel_aws",
                "collection_name": "sentinel-2-l2a",
                "bands": ["blue", "green", "red"],
            },
        ],
        "date_allowance": {"pre_days": 0, "post_days": 21},
        "transform": {
            "scale_data_xarray": True,
            "impute_nans": True,
            "reproject": True,
        },
        "max_cloud_cover": 80,
    },
}
queried_data = terrakit.download_data(
    data_sources=config["download"]["data_sources"],
    date_allowance=config["download"]["date_allowance"],
    transform=config["download"]["transform"],
    max_cloud_cover=config["download"]["max_cloud_cover"],
    dataset_name=DATASET_NAME_RASTER,
    working_dir=WORKING_DIR_RASTER,
    keep_files=False,
)

In [28]:

Copied!

plot_tiles_and_label_pair(
    queried_data, bands=config["download"]["data_sources"][0]["bands"]
)
plot_tiles_and_label_pair(
    queried_data, bands=config["download"]["data_sources"][0]["bands"]
)

Legend
image_0: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed.tif, label_0: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_labels.tif
---

Legend
image_1: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed.tif, label_1: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_labels.tif
---

3. Chip the data¶

Now that the tiled data has been downloaded, let's chip it accordingly.

In [ ]:

Copied!





chip_args = {
    "chip": {"sample_dim": 256},
}

res = terrakit.chip_and_label_data(
    dataset_name=DATASET_NAME_RASTER,
    sample_dim=chip_args["chip"]["sample_dim"],
    queried_data=queried_data,
    working_dir=WORKING_DIR_RASTER,
)
chip_args = {
    "chip": {"sample_dim": 256},
}

res = terrakit.chip_and_label_data(
    dataset_name=DATASET_NAME_RASTER,
    sample_dim=chip_args["chip"]["sample_dim"],
    queried_data=queried_data,
    working_dir=WORKING_DIR_RASTER,
)

In [30]:

Copied!

plot_chip_and_label_pairs(
    res, bands=config["download"]["data_sources"][0]["bands"], samples=10
)
plot_chip_and_label_pairs(
    res, bands=config["download"]["data_sources"][0]["bands"], samples=10
)

Legend
image_0: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_14.data.tif
image_1: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_3.data.tif
image_2: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_13.data.tif
image_3: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_5.data.tif
image_4: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_15.data.tif
image_5: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_3.data.tif
image_6: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_12.data.tif
image_7: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_4.data.tif
image_8: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_2.data.tif
image_9: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_11.data.tif
---
label_0: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_14.label.tif
label_1: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_3.label.tif
label_2: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_13.label.tif
label_3: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_5.label.tif
label_4: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_15.label.tif
label_5: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_3.label.tif
label_6: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_12.label.tif
label_7: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_4.label.tif
label_8: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_2.label.tif
label_9: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_11.label.tif

4. Store¶

In [ ]:

Copied!





terrakit.taco_store_data(
    dataset_name=DATASET_NAME_RASTER,
    working_dir=WORKING_DIR_RASTER,
    save_dir=WORKING_DIR_RASTER,
    tortilla_name="terrakit_curated_dataset.tortilla",
)
terrakit.taco_store_data(
    dataset_name=DATASET_NAME_RASTER,
    working_dir=WORKING_DIR_RASTER,
    save_dir=WORKING_DIR_RASTER,
    tortilla_name="terrakit_curated_dataset.tortilla",
)