
💿 PoF Data Generator

Before we begin, we must prepare our environment. This includes installing the Application Programming Interface (API) of the CDS and importing the Python libraries we will need.

These should have been installed into a conda environment before running the Notebook.

!pip install -q cdsapi

import os
import zipfile
from pathlib import Path

import cdsapi
import numpy as np
import pandas as pd
import requests
import xarray as xr
from xarray.coding.times import CFDatetimeCoder

Here we define some configuration options. For the purposes of the book we keep the volume of data small, since it will be downloaded from the CDS.

years = [2003]   # extend this list if needed, e.g. list(range(2003, 2022))
months = [1]     # e.g. list(range(1, 13)) for all months

cds_months = [f"{m:02d}" for m in months]       # zero-padded month strings for the CDS API
cds_days = [f"{d:02d}" for d in range(1, 32)]   # zero-padded day strings 01-31

ECBox Data Download

Some of the data we will use is available on an ECBox at ECMWF. The code below retrieves the files needed for the climate and active fire data.

os.makedirs("data", exist_ok=True)
os.makedirs("data/CLIMATE", exist_ok=True)
def retrieve_file(path):
    
    output_path=f"data/{path}"
    if not os.path.exists(output_path):
        username = "ecbox"
        token = "1vXbp0THtxZkla015skARAqTnf2DxxoQS0qlwrgpKiIGpPRKLtXklYO39tOJTXtnZrxP16WZs99iSoFgTTAy3orpmjykOBAO7CVk5RQIWFvZFhiSv0Ev3O3"

        base_url = "https://sites.ecmwf.int/ecbox/POF_IN_A_BOX/s/dav/data/"

        response = requests.get(f'{base_url}/{path}', auth=(username, token))

        if response.status_code == 200:
            with open(f"data/{path}", "wb") as f:
                f.write(response.content)
            # print("Downloaded!")
        else:
            print(response.status_code, response.text)

# Retrieve the monthly active fire maps and the static climate layers.
for year in range(2003, 2022):
    for month in range(1, 13):
        retrieve_file(f"ACTIVE_FIRE_MAP_{year}_{month:02d}_R.nc")

retrieve_file("CLIMATE/POP_2020.nc")
retrieve_file("CLIMATE/road_density_2015_agg_r.nc")

Enter your CDS API key and XDS-Preprod API key

We will request data from the Climate Data Store (CDS) and the Cross Data Store (XDS) programmatically with the help of the CDS API.

If you have not set up your CDS API credentials with a ~/.cdsapirc file, it is possible to provide the credentials when initialising the cdsapi.Client. To do this we must define the two variables needed to establish a connection: URL and KEY. The URL for the CDS API is common to all users and does not need to be modified. The KEY is a string of characters made up of your personal User ID and CDS API key. To obtain these, first register or log in to the CDS (https://cds.climate.copernicus.eu), then visit https://cds.climate.copernicus.eu/how-to-api and copy the string of characters listed after “key:”. Replace the ######### below with this string.

For XDS the URL and instance are different, so we need a separate set of credentials. The API functions in the same way, but you will need to register and log in on the XDS (https://xds-preprod.ecmwf.int/) and retrieve your API key here: https://xds-preprod.ecmwf.int/how-to-api

NOTE: If you have set up your CDS API key using a ~/.cdsapirc file, you will need to update it to use the XDS key, or override it when creating the client as we demonstrate below.
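
For reference, a ~/.cdsapirc file contains the same two fields; a minimal sketch with a placeholder key:

url: https://cds.climate.copernicus.eu/api
key: #####################################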


cds_kwargs = {
    'url': 'https://cds.climate.copernicus.eu/api',
    'key': '#####################################',
}

xds_kwargs = {
    'url': 'https://xds-preprod.ecmwf.int/api',
    'key': '#####################################',
}

CDS Data Retrieval

This code block will retrieve the datasets needed from the CDS, including temperature, wind speed, precipitation and dewpoint. We will retrieve data for the years and months configured above.

We also require wind speed, which we will calculate from the 10U and 10V wind components.
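
As a quick sanity check of the two derivations used below, the wind-speed magnitude is sqrt(u² + v²), and relative humidity follows from the Magnus formula applied to temperature and dewpoint. A minimal sketch with illustrative scalar values (not data from the files):

u, v = 3.0, 4.0                      # illustrative 10 m wind components (m/s)
print(np.sqrt(u**2 + v**2))          # 5.0 m/s

t2m, d2m = 293.15, 283.15            # illustrative temperature and dewpoint (K)
es = 6.11 * 10 ** (7.5 * (t2m - 273.15) / (237.7 + (t2m - 273.15)))  # saturation vapour pressure (hPa)
e = 6.11 * 10 ** (7.5 * (d2m - 273.15) / (237.7 + (d2m - 273.15)))   # actual vapour pressure (hPa)
print(100 * e / es)                  # relative humidity in percent (~53% here)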

client = cdsapi.Client(**cds_kwargs)

for year in years:
    for month in cds_months:
        for shortname, var in [["T2M", "2m_temperature"],
                               ["D2M", "2m_dewpoint_temperature"],
                               ["10U", "10m_u_component_of_wind"],
                               ["10V", "10m_v_component_of_wind"],
                               ["P", "total_precipitation"]]:

            dataset = "reanalysis-era5-land"
            request = {
                "variable": [var],
                "year": year,
                "month": month,
                "day": cds_days,
                "time": ["00:00"],
                "data_format": "netcdf",
                "download_format": "unarchived"
            }

            if not os.path.exists(f"data/{shortname}_{year}_{month}.nc"):
                client.retrieve(dataset, request, target=f"data/{shortname}_{year}_{month}.nc")

        # Wind speed is the magnitude of the 10 m wind vector.
        if not os.path.exists(f"data/WS_{year}_{month}.nc"):
            with xr.open_dataset(f"data/10U_{year}_{month}.nc") as dsu, \
                    xr.open_dataset(f"data/10V_{year}_{month}.nc") as dsv:
                dsw = np.sqrt(dsu.u10**2 + dsv.v10**2)
                dsw = dsw.rename("ws")
                dsw.to_netcdf(f"data/WS_{year}_{month}.nc")
                del dsw

        # Relative humidity from temperature and dewpoint via the Magnus formula.
        if not os.path.exists(f"data/RH_{year}_{month}.nc"):
            with xr.open_dataset(f"data/T2M_{year}_{month}.nc") as dst, \
                    xr.open_dataset(f"data/D2M_{year}_{month}.nc") as dsd:
                es = 6.11 * 10 ** (7.5 * (dst.t2m - 273.15) / (237.7 + (dst.t2m - 273.15)))
                e = 6.11 * 10 ** (7.5 * (dsd.d2m - 273.15) / (237.7 + (dsd.d2m - 273.15)))
                dsrh = (e / es) * 100
                dsrh = dsrh.rename("rh")
                dsrh.to_netcdf(f"data/RH_{year}_{month}.nc")
                del dsrh

We will now retrieve some data from the XDS.

The XDS data is stored on a different grid from ERA5-Land, so we will regrid the data on the fly once it is downloaded.
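
For readers unfamiliar with xarray regridding, interp_like interpolates one dataset onto the coordinates of another. A minimal, self-contained sketch with toy data (not the real files):

coarse = xr.DataArray(np.arange(9.0).reshape(3, 3),
                      coords={"lat": [0.0, 1.0, 2.0], "lon": [0.0, 1.0, 2.0]},
                      name="field")
target = xr.Dataset(coords={"lat": np.linspace(0, 2, 5), "lon": np.linspace(0, 2, 5)})

# Bilinear interpolation of the coarse field onto the target grid.
print(coarse.interp_like(target, method="linear"))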

lat_lon_data = xr.open_dataset(f"data/T2M_{years[0]}_{cds_months[0]}.nc").sortby("latitude").sortby("longitude")
lat_lon_data = lat_lon_data.rename(valid_time="time", latitude="lat", longitude="lon")
lat_lon_data = lat_lon_data.drop_vars(["expver", "number"])

client = cdsapi.Client(**xds_kwargs)

for year in years:
    for month in cds_months:
        for shortname, var in [["DFMC_MAP", "dead_fuel_moisture_content_group"],
                               ["LFMC_MAP", "live_fuel_moisture_content_group"],
                               ["FUEL_MAP", "fuel_group"]]:

            dataset = "derived-fire-fuel-biomass"
            request = {
                "variable": [var],
                "version": ["1"],
                "year": [f"{year}"],
                "month": [month]
            }

            if not os.path.exists(f"data/{shortname}_{year}_{month}_R.nc"):
                client.retrieve(dataset, request, target=f"data/{shortname}_{year}_{month}.zip")

                with zipfile.ZipFile(f"data/{shortname}_{year}_{month}.zip", "r") as zip_ref:
                    zip_ref.extractall("data/")

                print(f"Regridding {shortname} for {year}-{month}")
                with xr.open_dataset(f"data/{shortname}_{year}_{month}.nc") as ds_temp:
                    ds_regrid = ds_temp.interp_like(lat_lon_data, method="linear")
                    ds_regrid.to_netcdf(f"data/{shortname}_{year}_{month}_R.nc")
                    del ds_regrid

We now build an XGBoost-compatible training dataset by combining climate, vegetation, and anthropogenic data from multiple NetCDF sources.

output_path="./data/training_data.parquet"
base_path = Path("./data/")

sample_frac = 1 # keep all data for training.

# -----------------------------
# LOAD STATIC DATASETS
# -----------------------------
time_coder = CFDatetimeCoder(use_cftime=True)

print("Loading static datasets...")
PO = xr.open_dataset(base_path / "CLIMATE/POP_2020.nc")
RD = xr.open_dataset(base_path / "CLIMATE/road_density_2015_agg_r.nc")

po = PO.population_density.values.flatten()
rd = RD.road_length.values.flatten()

all_samples = []

# -----------------------------
# MAIN LOOP
# -----------------------------
for year in years:
    print(f"Processing year {year}...")

    for month in months:
        mon = f"{month:02d}"  # zero-padded month

        # Define paths compactly
        ds_paths = {
            "AF": f"ACTIVE_FIRE_MAP_{year}_{mon}_R.nc",
            "FU": f"FUEL_MAP_{year}_{mon}_R.nc",
            "DF": f"DFMC_MAP_{year}_{mon}_R.nc",
            "LF": f"LFMC_MAP_{year}_{mon}_R.nc",
            "PR": f"P_{year}_{mon}.nc",
            "T2": f"T2M_{year}_{mon}.nc",
            "RH": f"RH_{year}_{mon}.nc",
            "WS": f"WS_{year}_{mon}.nc"
        }

        # Load datasets efficiently with context managers
        with xr.open_dataset(base_path / ds_paths["AF"]) as AF, \
                xr.open_dataset(base_path / ds_paths["FU"]) as FU, \
                xr.open_dataset(base_path / ds_paths["DF"]) as DF, \
                xr.open_dataset(base_path / ds_paths["LF"]) as LF, \
                xr.open_dataset(base_path / ds_paths["PR"]) as PR, \
                xr.open_dataset(base_path / ds_paths["T2"]) as T2, \
                xr.open_dataset(base_path / ds_paths["RH"]) as RH, \
                xr.open_dataset(base_path / ds_paths["WS"]) as WS:
            AF=AF.fillna(0)  # Fill NaNs in ACTIVE_FIRE with 0
            days = len(AF.ACTIVE_FIRE)

            # Loop over every day in the month
            for i in np.arange(days):
                af = AF.ACTIVE_FIRE[i].values.flatten()
                fu_ll = FU.Live_Leaf[i].values.flatten()
                fu_lw = FU.Live_Wood[i].values.flatten()
                fu_df = FU.Dead_Foliage[i].values.flatten()
                fu_dw = FU.Dead_Wood[i].values.flatten()
                df = DF.DFMC_Foliage[i].values.flatten()
                dw = DF.DFMC_Wood[i].values.flatten()
                lf = LF.LFMC[i].values.flatten()
                pr = PR.tp[i].values.flatten()
                t2 = T2.t2m[i].values.flatten()
                rh = RH.rh[i].values.flatten()
                ws = WS.ws[i].values.flatten()

                # Mask where total fuel > 0
                ft = fu_ll + fu_lw + fu_df + fu_dw
                mask = ft > 0.0

                # Build dataframe for valid pixels
                dfx = pd.DataFrame({
                    "AF": af[mask],
                    "PR": pr[mask],
                    "T2": t2[mask],
                    "RH": rh[mask],
                    "WS": ws[mask],
                    "FU_LL": fu_ll[mask],
                    "FU_LW": fu_lw[mask],
                    "FU_DF": fu_df[mask],
                    "FU_DW": fu_dw[mask],
                    "DF": df[mask],
                    "DW": dw[mask],
                    "LF": lf[mask],
                    "PO": po[mask],
                    "RD": rd[mask],
                }, dtype=float)

                dfx.dropna(inplace=True)

                # Random sample for manageable dataset
                if len(dfx) > 0:
                    n_samples = max(1, int(len(dfx) * sample_frac))
                    dfx = dfx.sample(n=n_samples, random_state=1)
                    all_samples.append(dfx)

# -----------------------------
# COMBINE & SAVE
# -----------------------------
if not all_samples:
    raise SystemExit("⚠️ No samples generated. Check data masks or paths.")

print("Combining sampled data...")
dfa = pd.concat(all_samples, ignore_index=True)

print(f"Saving dataset → {output_path}")
dfa.to_parquet(output_path)

print("✅ Training dataset ready.")