Tutorial¶
Zarr provides classes and functions for working with N-dimensional arrays that behave like NumPy arrays but whose data is divided into chunks and compressed. If you are already familiar with HDF5 then Zarr arrays provide similar functionality, but with some additional flexibility.
Creating an array¶
Zarr has a number of convenience functions for creating arrays. For example:
>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
The code above creates a 2-dimensional array of 32-bit integers with 10000 rows and 10000 columns, divided into chunks where each chunk has 1000 rows and 1000 columns (and so there will be 100 chunks in total).
For a complete list of array creation routines see the
zarr.creation
module documentation.
Reading and writing data¶
Zarr arrays support a similar interface to NumPy arrays for reading and writing data. For example, the entire array can be filled with a scalar value:
>>> z[:] = 42
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 1.8M; ratio: 215.1; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
Notice that the values of nbytes_stored
, ratio
and
initialized
have changed. This is because when a Zarr array is
first created, none of the chunks are initialized. Writing data into
the array will cause the necessary chunks to be initialized.
Regions of the array can also be written to, e.g.:
>>> import numpy as np
>>> z[0, :] = np.arange(10000)
>>> z[:, 0] = np.arange(10000)
The contents of the array can be retrieved by slicing, which will load the requested region into a NumPy array, e.g.:
>>> z[0, 0]
0
>>> z[-1, -1]
42
>>> z[0, :]
array([ 0, 1, 2, ..., 9997, 9998, 9999], dtype=int32)
>>> z[:, 0]
array([ 0, 1, 2, ..., 9997, 9998, 9999], dtype=int32)
>>> z[:]
array([[ 0, 1, 2, ..., 9997, 9998, 9999],
[ 1, 42, 42, ..., 42, 42, 42],
[ 2, 42, 42, ..., 42, 42, 42],
...,
[9997, 42, 42, ..., 42, 42, 42],
[9998, 42, 42, ..., 42, 42, 42],
[9999, 42, 42, ..., 42, 42, 42]], dtype=int32)
Persistent arrays¶
In the examples above, compressed data for each chunk of the array was stored in memory. Zarr arrays can also be stored on a file system, enabling persistence of data between sessions. For example:
>>> z1 = zarr.open_array('example.zarr', mode='w', shape=(10000, 10000),
... chunks=(1000, 1000), dtype='i4', fill_value=0)
>>> z1
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DirectoryStore
The array above will store its configuration metadata and all
compressed chunk data in a directory called ‘example.zarr’ relative to
the current working directory. The zarr.creation.open_array()
function
provides a convenient way to create a new persistent array or continue
working with an existing array. Note that there is no need to close an
array, and data are automatically flushed to disk whenever an array is
modified.
Persistent arrays support the same interface for reading and writing data, e.g.:
>>> z1[:] = 42
>>> z1[0, :] = np.arange(10000)
>>> z1[:, 0] = np.arange(10000)
Check that the data have been written and can be read again:
>>> z2 = zarr.open_array('example.zarr', mode='r')
>>> z2
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 1.9M; ratio: 204.5; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DirectoryStore
>>> np.all(z1[:] == z2[:])
True
Resizing and appending¶
A Zarr array can be resized, which means that any of its dimensions can be increased or decreased in length. For example:
>>> z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
>>> z[:] = 42
>>> z.resize(20000, 10000)
>>> z
Array((20000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 1.5G; nbytes_stored: 3.6M; ratio: 422.3; initialized: 100/200
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
Note that when an array is resized, the underlying data are not rearranged in any way. If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store.
For convenience, Zarr arrays also provide an append()
method,
which can be used to append data to any axis. E.g.:
>>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z = zarr.array(a, chunks=(1000, 100))
>>> z
Array((10000, 1000), int32, chunks=(1000, 100), order=C)
nbytes: 38.1M; nbytes_stored: 1.9M; ratio: 20.3; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z.append(a)
(20000, 1000)
>>> z
Array((20000, 1000), int32, chunks=(1000, 100), order=C)
nbytes: 76.3M; nbytes_stored: 3.8M; ratio: 20.3; initialized: 200/200
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z.append(np.vstack([a, a]), axis=1)
(20000, 2000)
>>> z
Array((20000, 2000), int32, chunks=(1000, 100), order=C)
nbytes: 152.6M; nbytes_stored: 7.5M; ratio: 20.3; initialized: 400/400
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
Compressors¶
By default, Zarr uses the Blosc compression library to compress each chunk of an array. Blosc is extremely fast and can be configured in a variety of ways to improve the compression ratio for different types of data. Blosc is in fact a “meta-compressor”, which means that it can used a number of different compression algorithms internally to compress the data. Blosc also provides highly optimized implementations of byte and bit shuffle filters, which can significantly improve compression ratios for some data.
Different compressors can be provided via the compressor
keyword argument
accepted by all array creation functions. For example:
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000),
... compressor=zarr.Blosc(cname='zstd', clevel=3, shuffle=2))
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 4.4M; ratio: 87.6; initialized: 100/100
compressor: Blosc(cname='zstd', clevel=3, shuffle=2)
store: dict
The array above will use Blosc as the primary compressor, using the Zstandard algorithm (compression level 3) internally within Blosc, and with the bitshuffle filter applied.
A list of the internal compression libraries available within Blosc can be obtained via:
>>> from zarr import blosc
>>> blosc.list_compressors()
['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']
In addition to Blosc, other compression libraries can also be used. Zarr comes with support for zlib, BZ2 and LZMA compression, via the Python standard library. For example, here is an array using zlib compression, level 1:
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000),
... compressor=zarr.Zlib(level=1))
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 132.2M; ratio: 2.9; initialized: 100/100
compressor: Zlib(level=1)
store: dict
Here is an example using LZMA with a custom filter pipeline including LZMA’s built-in delta filter:
>>> import lzma
>>> lzma_filters = [dict(id=lzma.FILTER_DELTA, dist=4),
... dict(id=lzma.FILTER_LZMA2, preset=1)]
>>> compressor = zarr.LZMA(filters=lzma_filters)
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000), compressor=compressor)
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 248.9K; ratio: 1569.7; initialized: 100/100
compressor: LZMA(format=1, check=-1, preset=None, filters=[{'dist': 4, 'id': 3}, {'preset': 1, 'id': 33}])
store: dict
The default compressor can be changed by setting the value of the
zarr.storage.default_compressor
variable, e.g.:
>>> import zarr.storage
>>> # switch to using Zstandard via Blosc by default
... zarr.storage.default_compressor = zarr.Blosc(cname='zstd', clevel=1, shuffle=1)
>>> z = zarr.zeros(100000000, chunks=1000000)
>>> z
Array((100000000,), float64, chunks=(1000000,), order=C)
nbytes: 762.9M; nbytes_stored: 302; ratio: 2649006.6; initialized: 0/100
compressor: Blosc(cname='zstd', clevel=1, shuffle=1)
store: dict
>>> # switch back to Blosc defaults
... zarr.storage.default_compressor = zarr.Blosc()
To disable compression, set compressor=None
when creating an array, e.g.:
>>> z = zarr.zeros(100000000, chunks=1000000, compressor=None)
>>> z
Array((100000000,), float64, chunks=(1000000,), order=C)
nbytes: 762.9M; nbytes_stored: 209; ratio: 3827751.2; initialized: 0/100
store: dict
Filters¶
In some cases, compression can be improved by transforming the data in some way. For example, if nearby values tend to be correlated, then shuffling the bytes within each numerical value or storing the difference between adjacent values may increase compression ratio. Some compressors provide built-in filters that apply transformations to the data prior to compression. For example, the Blosc compressor has highly optimized built-in implementations of byte- and bit-shuffle filters, and the LZMA compressor has a built-in implementation of a delta filter. However, to provide additional flexibility for implementing and using filters in combination with different compressors, Zarr also provides a mechanism for configuring filters outside of the primary compressor.
Here is an example using the Zarr delta filter with the Blosc compressor:
>>> filters = [zarr.Delta(dtype='i4')]
>>> compressor = zarr.Blosc(cname='zstd', clevel=1, shuffle=1)
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000), filters=filters, compressor=compressor)
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 633.4K; ratio: 616.7; initialized: 100/100
filters: Delta(dtype=int32)
compressor: Blosc(cname='zstd', clevel=1, shuffle=1)
store: dict
Zarr comes with implementations of delta, scale-offset, quantize, packbits and
categorize filters. It is also relatively straightforward to implement custom
filters. For more information see the zarr.codecs
API docs.
Parallel computing and synchronization¶
Zarr arrays can be used as either the source or sink for data in parallel computations. Both multi-threaded and multi-process parallelism are supported. The Python global interpreter lock (GIL) is released for both compression and decompression operations, so Zarr will not block other Python threads from running.
A Zarr array can be read concurrently by multiple threads or processes. No synchronization (i.e., locking) is required for concurrent reads.
A Zarr array can also be written to concurrently by multiple threads or processes. Some synchronization may be required, depending on the way the data is being written.
If each worker in a parallel computation is writing to a separate region of the array, and if region boundaries are perfectly aligned with chunk boundaries, then no synchronization is required. However, if region and chunk boundaries are not perfectly aligned, then synchronization is required to avoid two workers attempting to modify the same chunk at the same time.
To give a simple example, consider a 1-dimensional array of length 60,
z
, divided into three chunks of 20 elements each. If three workers
are running and each attempts to write to a 20 element region (i.e.,
z[0:20]
, z[20:40]
and z[40:60]
) then each worker will be
writing to a separate chunk and no synchronization is
required. However, if two workers are running and each attempts to
write to a 30 element region (i.e., z[0:30]
and z[30:60]
) then
it is possible both workers will attempt to modify the middle chunk at
the same time, and synchronization is required to prevent data loss.
Zarr provides support for chunk-level synchronization. E.g., create an array with thread synchronization:
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
... synchronizer=zarr.ThreadSynchronizer())
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict; synchronizer: ThreadSynchronizer
This array is safe to read or write within a multi-threaded program.
Zarr also provides support for process synchronization via file locking, provided that all processes have access to a shared file system. E.g.:
>>> synchronizer = zarr.ProcessSynchronizer('example.sync')
>>> z = zarr.open_array('example', mode='w', shape=(10000, 10000),
... chunks=(1000, 1000), dtype='i4',
... synchronizer=synchronizer)
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DirectoryStore; synchronizer: ProcessSynchronizer
This array is safe to read or write from multiple processes.
User attributes¶
Zarr arrays also support custom key/value attributes, which can be useful for associating an array with application-specific metadata. For example:
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z.attrs['foo'] = 'bar'
>>> z.attrs['baz'] = 42
>>> sorted(z.attrs)
['baz', 'foo']
>>> 'foo' in z.attrs
True
>>> z.attrs['foo']
'bar'
>>> z.attrs['baz']
42
Internally Zarr uses JSON to store array attributes, so attribute values must be JSON serializable.
Groups¶
Zarr supports hierarchical organization of arrays via groups. As with arrays, groups can be stored in memory, on disk, or via other storage systems that support a similar interface.
To create a group, use the zarr.hierarchy.group()
function:
>>> root_group = zarr.group()
>>> root_group
Group(/, 0)
store: DictStore
Groups have a similar API to the Group class from h5py. For example, groups can contain other groups:
>>> foo_group = root_group.create_group('foo')
>>> bar_group = foo_group.create_group('bar')
Groups can also contain arrays, e.g.:
>>> z1 = bar_group.zeros('baz', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4',
... compressor=zarr.Blosc(cname='zstd', clevel=1, shuffle=1))
>>> z1
Array(/foo/bar/baz, (10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 324; ratio: 1234567.9; initialized: 0/100
compressor: Blosc(cname='zstd', clevel=1, shuffle=1)
store: DictStore
Arrays are known as “datasets” in HDF5 terminology. For compatibility with
h5py, Zarr groups also implement the zarr.hierarchy.Group.create_dataset()
and zarr.hierarchy.Group.require_dataset()
methods, e.g.:
>>> z = bar_group.create_dataset('quux', shape=(10000, 10000),
... chunks=(1000, 1000), dtype='i4',
... fill_value=0, compression='gzip',
... compression_opts=1)
>>> z
Array(/foo/bar/quux, (10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 275; ratio: 1454545.5; initialized: 0/100
compressor: Zlib(level=1)
store: DictStore
Members of a group can be accessed via the suffix notation, e.g.:
>>> root_group['foo']
Group(/foo, 1)
groups: 1; bar
store: DictStore
The ‘/’ character can be used to access multiple levels of the hierarchy, e.g.:
>>> root_group['foo/bar']
Group(/foo/bar, 2)
arrays: 2; baz, quux
store: DictStore
>>> root_group['foo/bar/baz']
Array(/foo/bar/baz, (10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 324; ratio: 1234567.9; initialized: 0/100
compressor: Blosc(cname='zstd', clevel=1, shuffle=1)
store: DictStore
The zarr.hierarchy.open_group()
provides a convenient way to create or
re-open a group stored in a directory on the file-system, with sub-groups
stored in sub-directories, e.g.:
>>> persistent_group = zarr.open_group('example', mode='w')
>>> persistent_group
Group(/, 0)
store: DirectoryStore
>>> z = persistent_group.create_dataset('foo/bar/baz', shape=(10000, 10000),
... chunks=(1000, 1000), dtype='i4',
... fill_value=0)
>>> z
Array(/foo/bar/baz, (10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DirectoryStore
For more information on groups see the zarr.hierarchy
API docs.
Tips and tricks¶
Copying large arrays¶
Data can be copied between large arrays without needing much memory, e.g.:
>>> z1 = zarr.empty((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z1[:] = 42
>>> z2 = zarr.empty_like(z1)
>>> z2[:] = z1
Internally the example above works chunk-by-chunk, extracting only the
data from z1
required to fill each chunk in z2
. The source of
the data (z1
) could equally be an h5py Dataset.
Changing memory layout¶
The order of bytes within each chunk of an array can be changed via
the order
keyword argument, to use either C or Fortran layout. For
multi-dimensional arrays, these two layouts may provide different
compression ratios, depending on the correlation structure within the
data. E.g.:
>>> a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
>>> zarr.array(a, chunks=(1000, 1000))
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 26.3M; ratio: 14.5; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> zarr.array(a, chunks=(1000, 1000), order='F')
Array((10000, 10000), int32, chunks=(1000, 1000), order=F)
nbytes: 381.5M; nbytes_stored: 9.2M; ratio: 41.6; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
In the above example, Fortran order gives a better compression ratio. This is an artifical example but illustrates the general point that changing the order of bytes within chunks of an array may improve the compression ratio, depending on the structure of the data, the compression algorithm used, and which compression filters (e.g., byte shuffle) have been applied.
Storage alternatives¶
Zarr can use any object that implements the MutableMapping
interface as
the store for a group or an array.
Here is an example storing an array directly into a Zip file:
>>> store = zarr.ZipStore('example.zip', mode='w')
>>> z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='i4', store=store)
>>> z
Array((1000, 1000), int32, chunks=(100, 100), order=C)
nbytes: 3.8M; nbytes_stored: 319; ratio: 12539.2; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: ZipStore
>>> z[:] = 42
>>> z
Array((1000, 1000), int32, chunks=(100, 100), order=C)
nbytes: 3.8M; nbytes_stored: 21.8K; ratio: 179.2; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: ZipStore
>>> store.close()
>>> import os
>>> os.path.getsize('example.zip')
30721
Re-open and check that data have been written:
>>> store = zarr.ZipStore('example.zip', mode='r')
>>> z = zarr.Array(store)
>>> z
Array((1000, 1000), int32, chunks=(100, 100), order=C)
nbytes: 3.8M; nbytes_stored: 21.8K; ratio: 179.2; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: ZipStore
>>> z[:]
array([[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
...,
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42]], dtype=int32)
>>> store.close()
Note that there are some restrictions on how Zip files can be used, because items within a Zip file cannot be updated in place. This means that data in the array should only be written once and write operations should be aligned with chunk boundaries.
Note also that the close()
method must be called after writing any data to
the store, otherwise essential records will not be written to the underlying
zip file.
The Dask project has implementations of the MutableMapping
interface for distributed storage systems, see the S3Map
and HDFSMap
classes.
Chunk size and shape¶
In general, chunks of at least 1 megabyte (1M) seem to provide the best performance, at least when using the Blosc compression library.
The optimal chunk shape will depend on how you want to access the data. E.g.,
for a 2-dimensional array, if you only ever take slices along the first
dimension, then chunk across the second dimenson. If you know you want to
chunk across an entire dimension you can use None
within the chunks
argument, e.g.:
>>> z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
>>> z1.chunks
(100, 10000)
Alternatively, if you only ever take slices along the second dimension, then chunk across the first dimension, e.g.:
>>> z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
>>> z2.chunks
(10000, 100)
If you require reasonable performance for both access patterns then you need to find a compromise, e.g.:
>>> z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z3.chunks
(1000, 1000)
If you are feeling lazy, you can let Zarr guess a chunk shape for your data, although please note that the algorithm for guessing a chunk shape is based on simple heuristics and may be far from optimal. E.g.:
>>> z4 = zarr.zeros((10000, 10000), dtype='i4')
>>> z4.chunks
(313, 313)
Configuring Blosc¶
The Blosc compressor is able to use multiple threads internally to accelerate compression and decompression. By default, Zarr allows Blosc to use up to 8 internal threads. The number of Blosc threads can be changed to increase or decrease this number, e.g.:
>>> from zarr import blosc
>>> blosc.set_nthreads(2)
8
When a Zarr array is being used within a multi-threaded program, Zarr
automatically switches to using Blosc in a single-threaded
“contextual” mode. This is generally better as it allows multiple
program threads to use Blosc simultaneously and prevents CPU thrashing
from too many active threads. If you want to manually override this
behaviour, set the value of the blosc.use_threads
variable to
True
(Blosc always uses multiple internal threads) or False
(Blosc always runs in single-threaded contextual mode). To re-enable
automatic switching, set blosc.use_threads
to None
.