```python
from pathlib import Path

import numpy as np
import rasterio as rio
import rioxarray as riox
import xarray as xr
```

How to write chunked data

```python
# generate some data to write
data = np.random.default_rng().random(size=(100, 100))
```

GeoTIFF (GDAL, rioxarray, rasterio)

```python
tmp_dir = Path("/tmp")

# tiling profile for both rasterio and rioxarray
common_profile = {
    "tiled": True,      # <-- turn tiling/chunking on
    "blockysize": 16,   # <-- tile/chunk size in y
    "blockxsize": 16,   # <-- tile/chunk size in x
    "compress": "lzw",  # <-- compress (optional)
}
```
Write with rasterio

```python
with rio.open(
    tmp_dir / "data_rio.tif",
    "w",
    height=data.shape[0],
    width=data.shape[1],
    count=1,
    dtype=data.dtype,
    **common_profile,
) as dst:
    dst.write(data, 1)
```
Write with rioxarray

```python
rows, cols = data.shape

ds = xr.DataArray(
    data=data,
    coords=(("y", np.arange(rows) + 0.5), ("x", np.arange(cols) + 0.5)),
)
ds.rio.to_raster(tmp_dir / "data_rioxarray.tif", **common_profile)
```
Chunk an existing file with GDAL (different tile size so we can see that something happened):

```shell
!gdal_translate \
    -co TILED=YES \
    -co BLOCKYSIZE=32 \
    -co BLOCKXSIZE=32 \
    /tmp/data_rio.tif \
    /tmp/data_gdal.tif
```

```
Input file size is 100, 100
0...10...20...30...40...50...60...70...80...90...100 - done.
```
Find out if a file is chunked

```shell
!gdalinfo /tmp/data_rio.tif
```

```
Driver: GTiff/GeoTIFF
Files: /tmp/data_rio.tif
Size is 100, 100
Image Structure Metadata:
  COMPRESSION=LZW
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  100.0)
Upper Right (  100.0,    0.0)
Lower Right (  100.0,  100.0)
Center      (   50.0,   50.0)
Band 1 Block=16x16 Type=Float64, ColorInterp=Gray
```
```shell
!gdalinfo /tmp/data_rioxarray.tif
```

```
Driver: GTiff/GeoTIFF
Files: /tmp/data_rioxarray.tif
Size is 100, 100
Origin = (0.000000000000000,0.000000000000000)
Pixel Size = (1.000000000000000,1.000000000000000)
Image Structure Metadata:
  COMPRESSION=LZW
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (   0.0000000,   0.0000000)
Lower Left  (   0.000,  100.000)
Upper Right ( 100.0000000,   0.0000000)
Lower Right ( 100.000,  100.000)
Center      (  50.0000000,  50.0000000)
Band 1 Block=16x16 Type=Float64, ColorInterp=Gray
```
```shell
!gdalinfo /tmp/data_gdal.tif
```

```
Driver: GTiff/GeoTIFF
Files: /tmp/data_gdal.tif
Size is 100, 100
Image Structure Metadata:
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  100.0)
Upper Right (  100.0,    0.0)
Lower Right (  100.0,  100.0)
Center      (   50.0,   50.0)
Band 1 Block=32x32 Type=Float64, ColorInterp=Gray
```
`Block=16x16` (`Block=32x32`) indicates that tiles are 16 x 16 (32 x 32) pixels.
netCDF (via xarray)

```python
# generate some 3d data to write
size_x_y = 100
size_time = 365

data_3d = np.random.default_rng().random(size=(size_x_y, size_x_y, size_time))

coords = np.arange(size_x_y)
coords_time = np.arange(size_time)

ds = xr.Dataset(
    data_vars={"test": (("y", "x", "time"), data_3d)},
    coords=dict(y=coords, x=coords, time=coords_time),
)

encoding = {"test": {"chunksizes": (20, 20, 50)}}
ds.to_netcdf(tmp_dir / "data.nc", encoding=encoding)
```
Read the encoding attribute:

```python
xr.open_dataset(tmp_dir / "data.nc").test.encoding
```
```
{'dtype': dtype('float64'),
 'zlib': False,
 'szip': False,
 'zstd': False,
 'bzip2': False,
 'blosc': False,
 'shuffle': False,
 'complevel': 0,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (20, 20, 50),
 'preferred_chunks': {'y': 20, 'x': 20, 'time': 50},
 'source': '/tmp/data.nc',
 'original_shape': (100, 100, 365),
 '_FillValue': np.float64(nan)}
```
(or run `ncdump -h -s /tmp/data.nc`)
Zarr

The xarray docs on `.to_zarr(...)` describe the different ways chunks can be defined explicitly, and how the driver picks up existing chunk sizes.

```python
ds.chunk(dict(x=20, y=20, time=100)).to_zarr(
    tmp_dir / "data.zarr",
    mode="w",
)
```

```
<xarray.backends.zarr.ZarrStore at 0x73f263c4b130>
```
/ "data.zarr").test.encoding xr.open_zarr(tmp_dir
{'chunks': (20, 20, 100),
'preferred_chunks': {'y': 20, 'x': 20, 'time': 100},
'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
'filters': None,
'_FillValue': nan,
'dtype': dtype('float64')}
Background

This post is written from the following perspective:

- I mostly work with array data (satellite and webcam imagery, output from (hydro-)climatological models), which usually comes either as GeoTIFF or netCDF files that have `x` and `y` dimensions and often a `time` dimension
- I still analyze most of the array data locally on my desktop computer, a workstation at our lab, or the HPC system of our university. By "local", I mean that my data is located directly on the system and not in some cloud storage
- The datasets I deal with are too big to fit into memory and have to be read and processed chunk-wise
- These datasets often contain relatively large areas of no-data values (due to non-rectangular domains/masks) and therefore benefit a lot from compression. Writing compressed netCDF4 files requires chunking (see here). I usually can't afford to write uncompressed data anyway, unless it's a small subset of the data
- I almost exclusively use `xarray`, `rioxarray`, and some of the `GDAL` tools (or whatever software they wrap) to take care of and manipulate data formats
Sidenote

- Chunking is closely related to compression (see here). Nevertheless, I will only summarize the options for chunking here.
- The chunk sizes in this post are only for illustration and should not be copy-pasted for use elsewhere.
Use-cases

Writing/creating data

This is the point where I can decide how the data will be chunked (tiled).

- Read 2D GeoTIFFs (or other formats) and construct an (at least 3D) `xarray.Dataset` by concatenation along some dimension. The output will be written as a netCDF or Zarr file
- Clean data that is already in netCDF format and write the result back to disk
Read/access data

This is the point where missing or unsuitable chunking can slow down data access.

- Extracting a timeseries of a single pixel at a specific location in `x` and `y`
- Extracting a time slice, often a 2D array of a single timestep
- Mapping a function over chunks of an entire (at least 3D) file (netCDF, Zarr)

It is not possible to choose chunks that are optimal for all of these access patterns at the same time!
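For reference, the three access patterns look roughly like this in xarray (the dataset, indices, and the reduction are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"test": (("y", "x", "time"), np.zeros((10, 10, 20)))})

timeseries = ds["test"].isel(y=3, x=4)  # one pixel across all timesteps
time_slice = ds["test"].isel(time=0)    # one 2D array for one timestep
means = ds["test"].mean(dim="time")     # a reduction mapped over the array
```

A timeseries read touches every chunk along `time`, while a time slice touches every chunk in `x`/`y` — which is why no single chunk layout serves both equally well.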
Important things to figure out first

- What is the layout of my data and its chunks if I don't take care of chunking explicitly? In my experience, the default settings are often not ideal.
- Which access pattern do I primarily need to support?
- Does my data format support chunking? GeoTIFF, netCDF, and Zarr all support chunking.
Further reading
Sources
- GDAL GeoTIFF driver page: the primary source to check the available options for GeoTIFFs
- GDAL translate docs
- rasterio docs on creation options
- rioxarray docs on `.to_raster(...)`
- xarray docs on writing netCDF files with compression