from pathlib import Path
import numpy as np
import rasterio as rio
import xarray as xr
import rioxarray as riox

How to write chunked data
# generate some data to write
data = np.random.default_rng().random(size=(100, 100))

GeoTIFF (GDAL, rioxarray, rasterio)
tmp_dir = Path("/tmp")

# tiling profile for both rasterio and rioxarray
common_profile = {
    "tiled": True,  # <-- turn tiling/chunking on
    "blockysize": 16,  # <-- tile/chunk size in y
    "blockxsize": 16,  # <-- tile/chunk size in x
    "compress": "lzw",  # <-- compress (optional)
}

Write with rasterio
with rio.open(
    tmp_dir / "data_rio.tif",
    "w",
    height=data.shape[0],
    width=data.shape[1],
    count=1,
    dtype=data.dtype,
    **common_profile,
) as dst:
    dst.write(data, 1)

Write with rioxarray
rows, cols = data.shape
ds = xr.DataArray(
    data=data, coords=(("y", np.arange(rows) + 0.5), ("x", np.arange(cols) + 0.5))
)
ds.rio.to_raster(tmp_dir / "data_rioxarray.tif", **common_profile)

Chunk existing file with GDAL
(different tile size so we can see that something happened)
!gdal_translate \
-co TILED=YES \
-co BLOCKYSIZE=32 \
-co BLOCKXSIZE=32 \
/tmp/data_rio.tif \
/tmp/data_gdal.tif

Input file size is 100, 100
0...10...20...30...40...50...60...70...80...90...100 - done.
Find out if file is chunked
!gdalinfo /tmp/data_rio.tif

Driver: GTiff/GeoTIFF
Files: /tmp/data_rio.tif
Size is 100, 100
Image Structure Metadata:
COMPRESSION=LZW
INTERLEAVE=BAND
Corner Coordinates:
Upper Left ( 0.0, 0.0)
Lower Left ( 0.0, 100.0)
Upper Right ( 100.0, 0.0)
Lower Right ( 100.0, 100.0)
Center ( 50.0, 50.0)
Band 1 Block=16x16 Type=Float64, ColorInterp=Gray
!gdalinfo /tmp/data_rioxarray.tif

Driver: GTiff/GeoTIFF
Files: /tmp/data_rioxarray.tif
Size is 100, 100
Origin = (0.000000000000000,0.000000000000000)
Pixel Size = (1.000000000000000,1.000000000000000)
Image Structure Metadata:
COMPRESSION=LZW
INTERLEAVE=BAND
Corner Coordinates:
Upper Left ( 0.0000000, 0.0000000)
Lower Left ( 0.000, 100.000)
Upper Right ( 100.0000000, 0.0000000)
Lower Right ( 100.000, 100.000)
Center ( 50.0000000, 50.0000000)
Band 1 Block=16x16 Type=Float64, ColorInterp=Gray
!gdalinfo /tmp/data_gdal.tif

Driver: GTiff/GeoTIFF
Files: /tmp/data_gdal.tif
Size is 100, 100
Image Structure Metadata:
INTERLEAVE=BAND
Corner Coordinates:
Upper Left ( 0.0, 0.0)
Lower Left ( 0.0, 100.0)
Upper Right ( 100.0, 0.0)
Lower Right ( 100.0, 100.0)
Center ( 50.0, 50.0)
Band 1 Block=32x32 Type=Float64, ColorInterp=Gray
Block=16x16 (or Block=32x32) in the band line indicates that the tiles are 16 x 16 (32 x 32) pixels.
netCDF (via xarray)
# generate some 3d data to write
size_x_y = 100
size_time = 365
data_3d = np.random.default_rng().random(size=(size_x_y, size_x_y, size_time))
coords = np.arange(size_x_y)
coords_time = np.arange(size_time)
ds = xr.Dataset(
    data_vars={"test": (("y", "x", "time"), data_3d)},
    coords=dict(y=coords, x=coords, time=coords_time),
)
encoding = {"test": {"chunksizes": (20, 20, 50)}}
ds.to_netcdf(tmp_dir / "data.nc", encoding=encoding)

Read the encoding attribute:
xr.open_dataset(tmp_dir / "data.nc").test.encoding

{'dtype': dtype('float64'),
'zlib': False,
'szip': False,
'zstd': False,
'bzip2': False,
'blosc': False,
'shuffle': False,
'complevel': 0,
'fletcher32': False,
'contiguous': False,
'chunksizes': (20, 20, 50),
'preferred_chunks': {'y': 20, 'x': 20, 'time': 50},
'source': '/tmp/data.nc',
'original_shape': (100, 100, 365),
'_FillValue': np.float64(nan)}
(or run ncdump -h -s /tmp/data.nc)
Zarr
The xarray docs on `.to_zarr(...)` describe the different ways chunks can be defined explicitly, and how the driver picks up existing chunk sizes.
ds.chunk(dict(x=20, y=20, time=100)).to_zarr(
    tmp_dir / "data.zarr",
    mode="w",
)

<xarray.backends.zarr.ZarrStore at 0x73f263c4b130>
xr.open_zarr(tmp_dir / "data.zarr").test.encoding

{'chunks': (20, 20, 100),
'preferred_chunks': {'y': 20, 'x': 20, 'time': 100},
'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
'filters': None,
'_FillValue': nan,
'dtype': dtype('float64')}
Background
This post is written from the following perspective:
- I mostly work with array data (satellite and webcam imagery, output from (hydro-)climatological models), which usually comes either as GeoTIFF or netCDF files that have `x` and `y` dimensions and often a `time` dimension
- I still analyze most of the array data locally on my desktop computer, a workstation at our lab, or the HPC system of our university. By “local”, I mean that my data is located directly on the system and not in some cloud storage
- The datasets I deal with are too big to fit into memory and have to be read and processed chunk-wise
- These datasets often contain relatively large areas with no-data values (due to non-rectangular domains/masks) and therefore benefit a lot from compression. Writing compressed netCDF4 files requires chunking (see here). I usually can’t afford to write uncompressed data anyway, unless it’s a small subset of the data
- I almost exclusively use `xarray`, `rioxarray` and some of the `GDAL` tools (or whatever software they wrap) to take care of and manipulate data formats
Sidenote
- Chunking is closely related to compression (see here). Nevertheless, I will only summarize the options for chunking here.
- The chunksizes here are only for illustration and should not be copy-pasted for use elsewhere.
Use-cases
Writing/creating data
This is the point where I can decide how the data will be chunked (tiled).
- Read 2D GeoTIFFs (or other formats) and construct an (at least 3D) `xarray.Dataset` by concatenation along some dimension. The output will be written as a netCDF or Zarr file
- Clean data that is already in netCDF format and write the result back to disk
Read/access data
This is the point where no or unsuitable chunking can slow down data access.
- Extracting a timeseries of a single pixel at a specific location in `x` and `y`
- Extracting a time-slice, often a 2D array of a single timestep
- Mapping a function over chunks of an entire (at least 3D) file (netCDF, Zarr)
It is not possible to choose chunks that are optimal for all of these access patterns at the same time!
Important things to figure out first
- What is the layout of my data and the chunks if I don’t take care of chunking explicitly? In my experience the default settings are often not ideal.
- Which access pattern do I primarily need to support?
- Does my data format support chunking? GeoTIFF, netCDF and Zarr all support chunking.
Further reading
Sources
- GDAL GeoTIFF driver page: the primary source to check the available options for GeoTIFFs
- GDAL translate docs
- rasterio docs on creation options
- rioxarray docs on `.to_raster(...)`
- xarray docs on writing netCDFs with compression