```python
from pathlib import Path

import numpy as np
import rasterio as rio
import rioxarray as riox
import xarray as xr
```

How to write chunked data

```python
# generate some data to write
data = np.random.default_rng().random(size=(100, 100))
```

GeoTIFF (GDAL, rioxarray, rasterio)

```python
tmp_dir = Path("/tmp")

# tiling profile for both rasterio and rioxarray
common_profile = {
    "tiled": True,      # <-- turn tiling/chunking on
    "blockysize": 16,   # <-- tile/chunk size in y
    "blockxsize": 16,   # <-- tile/chunk size in x
    "compress": "lzw",  # <-- compress (optional)
}
```
Write with rasterio

```python
with rio.open(
    tmp_dir / "data_rio.tif",
    "w",
    height=data.shape[0],
    width=data.shape[1],
    count=1,
    dtype=data.dtype,
    **common_profile,
) as dst:
    dst.write(data, 1)
```
Write with rioxarray

```python
rows, cols = data.shape

ds = xr.DataArray(
    data=data,
    coords=(("y", np.arange(rows) + 0.5), ("x", np.arange(cols) + 0.5)),
)
ds.rio.to_raster(tmp_dir / "data_rioxarray.tif", **common_profile)
```
Chunk an existing file with GDAL (different tile size so we can see that something happened):

```shell
!gdal_translate \
    -co TILED=YES \
    -co BLOCKYSIZE=32 \
    -co BLOCKXSIZE=32 \
    /tmp/data_rio.tif \
    /tmp/data_gdal.tif
```

```
Input file size is 100, 100
0...10...20...30...40...50...60...70...80...90...100 - done.
```
Find out if a file is chunked

```shell
!gdalinfo /tmp/data_rio.tif
```

```
Driver: GTiff/GeoTIFF
Files: /tmp/data_rio.tif
Size is 100, 100
Image Structure Metadata:
  COMPRESSION=LZW
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  100.0)
Upper Right (  100.0,    0.0)
Lower Right (  100.0,  100.0)
Center      (   50.0,   50.0)
Band 1 Block=16x16 Type=Float64, ColorInterp=Gray
```
```shell
!gdalinfo /tmp/data_rioxarray.tif
```

```
Driver: GTiff/GeoTIFF
Files: /tmp/data_rioxarray.tif
Size is 100, 100
Origin = (0.000000000000000,0.000000000000000)
Pixel Size = (1.000000000000000,1.000000000000000)
Image Structure Metadata:
  COMPRESSION=LZW
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (   0.0000000,   0.0000000)
Lower Left  (   0.000,  100.000)
Upper Right ( 100.0000000,   0.0000000)
Lower Right ( 100.000,  100.000)
Center      (  50.0000000,  50.0000000)
Band 1 Block=16x16 Type=Float64, ColorInterp=Gray
```
```shell
!gdalinfo /tmp/data_gdal.tif
```

```
Driver: GTiff/GeoTIFF
Files: /tmp/data_gdal.tif
Size is 100, 100
Image Structure Metadata:
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  100.0)
Upper Right (  100.0,    0.0)
Lower Right (  100.0,  100.0)
Center      (   50.0,   50.0)
Band 1 Block=32x32 Type=Float64, ColorInterp=Gray
```
`Block=16x16` (`Block=32x32`) indicates that tiles are 16 x 16 (32 x 32) pixels.
netCDF (via xarray)

```python
# generate some 3d data to write
size_x_y = 100
size_time = 365

data_3d = np.random.default_rng().random(size=(size_x_y, size_x_y, size_time))

coords = np.arange(size_x_y)
coords_time = np.arange(size_time)

ds = xr.Dataset(
    data_vars={"test": (("y", "x", "time"), data_3d)},
    coords=dict(y=coords, x=coords, time=coords_time),
)

encoding = {"test": {"chunksizes": (20, 20, 50)}}
ds.to_netcdf(tmp_dir / "data.nc", encoding=encoding)
```
Read the encoding attribute:

```python
xr.open_dataset(tmp_dir / "data.nc").test.encoding
```
```
{'dtype': dtype('float64'),
 'zlib': False,
 'szip': False,
 'zstd': False,
 'bzip2': False,
 'blosc': False,
 'shuffle': False,
 'complevel': 0,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (20, 20, 50),
 'preferred_chunks': {'y': 20, 'x': 20, 'time': 50},
 'source': '/tmp/data.nc',
 'original_shape': (100, 100, 365),
 '_FillValue': np.float64(nan)}
```
(or run `ncdump -h -s /tmp/data.nc`)
Zarr

The xarray docs on `.to_zarr(...)` describe the different ways chunks can be defined explicitly, and how the driver picks up existing chunk sizes.

```python
ds.chunk(dict(x=20, y=20, time=100)).to_zarr(
    tmp_dir / "data.zarr",
    mode="w",
)
```

```
<xarray.backends.zarr.ZarrStore at 0x73f263c4b130>
```
/ "data.zarr").test.encoding xr.open_zarr(tmp_dir
{'chunks': (20, 20, 100),
'preferred_chunks': {'y': 20, 'x': 20, 'time': 100},
'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
'filters': None,
'_FillValue': nan,
'dtype': dtype('float64')}
Background

This post is written from the following perspective:

- I mostly work with array data (satellite and webcam imagery, output from (hydro-)climatological models), which usually comes either as GeoTIFF or netCDF files that have `x` and `y` dimensions and often a `time` dimension
- I still analyze most of the array data locally on my desktop computer, a workstation at our lab, or the HPC system of our university. By "local", I mean that my data is located directly on the system and not in some cloud storage
- The datasets I deal with are too big to fit into memory and have to be read and processed chunk-wise
- These datasets often contain relatively large areas of no-data values (due to non-rectangular domains/masks) and therefore benefit a lot from compression. Writing compressed netCDF4 files requires chunking (see here). I usually can't afford to write uncompressed data anyway, unless it's a small subset of the data
- I almost exclusively use `xarray`, `rioxarray`, and some of the `GDAL` tools (or whatever software they wrap) to take care of and manipulate data formats
Sidenote

- Chunking is closely related to compression (see here). Nevertheless, I will only summarize the options for chunking here.
- The chunk sizes in this post are only for illustration and should not be copy-pasted for use elsewhere.
Use-cases

Writing/creating data

This is the point where I can decide how the data will be chunked (tiled).

- Read 2D GeoTIFFs (or other formats) and construct an (at least 3D) `xarray.Dataset` by concatenation along some dimension. The output will be written as a netCDF or Zarr file
- Clean data that is already in netCDF format and write the result back to disk
Read/access data

This is the point where missing or unsuitable chunking can slow down data access.

- Extracting a timeseries of a single pixel at a specific location in `x` and `y`
- Extracting a time slice, often a 2D array of a single timestep
- Mapping a function over chunks of an entire (at least 3D) file (netCDF, Zarr)

It is not possible to choose chunks that are optimal for all of these access patterns at the same time!
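For reference, the three access patterns look roughly like this in xarray (the dataset, indices, and the reduction are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"test": (("y", "x", "time"), np.zeros((10, 10, 20)))})

timeseries = ds["test"].isel(y=3, x=4)  # one pixel across all timesteps
time_slice = ds["test"].isel(time=0)    # one 2D array for one timestep
means = ds["test"].mean(dim="time")     # a reduction mapped over the array
```

A timeseries read touches every chunk along `time`, while a time slice touches every chunk in `x`/`y` — which is why no single chunk layout serves both equally well.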
Important things to figure out first

- What is the layout of my data and its chunks if I don't take care of chunking explicitly? In my experience, the default settings are often not ideal.
- Which access pattern do I primarily need to support?
- Does my data format support chunking? GeoTIFF, netCDF, and Zarr all support chunking.
Further reading
Sources
- GDAL GeoTIFF driver page: the primary source to check the available options for GeoTIFFs
- GDAL translate docs
- rasterio docs on creation options
- rioxarray docs on `.to_raster(...)`
- xarray docs on writing netCDF files with compression