Tutorial¶
Zarr provides classes and functions for working with N-dimensional arrays that behave like NumPy arrays but whose data is divided into chunks and compressed. If you are already familiar with HDF5 datasets then Zarr arrays provide similar functionality, but with some additional flexibility.
Creating an array¶
Zarr has a number of convenience functions for creating arrays. For example:
>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 317; ratio: 1261829.7; initialized: 0/100
store: builtins.dict
The code above creates a 2-dimensional array of 32-bit integers with 10000 rows and 10000 columns, divided into chunks where each chunk has 1000 rows and 1000 columns (and so there will be 100 chunks in total).
For a complete list of array creation routines see the
zarr.creation
module documentation.
Reading and writing data¶
Zarr arrays support a similar interface to NumPy arrays for reading and writing data. For example, the entire array can be filled with a scalar value:
>>> z[:] = 42
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 2.2M; ratio: 170.4; initialized: 100/100
store: builtins.dict
Notice that the values of nbytes_stored
, ratio
and
initialized
have changed. This is because when a Zarr array is
first created, none of the chunks are initialized. Writing data into
the array will cause the necessary chunks to be initialized.
Regions of the array can also be written to, e.g.:
>>> import numpy as np
>>> z[0, :] = np.arange(10000)
>>> z[:, 0] = np.arange(10000)
The contents of the array can be retrieved by slicing, which will load the requested region into a NumPy array, e.g.:
>>> z[0, 0]
0
>>> z[-1, -1]
42
>>> z[0, :]
array([ 0, 1, 2, ..., 9997, 9998, 9999], dtype=int32)
>>> z[:, 0]
array([ 0, 1, 2, ..., 9997, 9998, 9999], dtype=int32)
>>> z[:]
array([[ 0, 1, 2, ..., 9997, 9998, 9999],
[ 1, 42, 42, ..., 42, 42, 42],
[ 2, 42, 42, ..., 42, 42, 42],
...,
[9997, 42, 42, ..., 42, 42, 42],
[9998, 42, 42, ..., 42, 42, 42],
[9999, 42, 42, ..., 42, 42, 42]], dtype=int32)
Persistent arrays¶
In the examples above, compressed data for each chunk of the array was stored in memory. Zarr arrays can also be stored on a file system, enabling persistence of data between sessions. For example:
>>> z1 = zarr.open('example.zarr', mode='w', shape=(10000, 10000),
... chunks=(1000, 1000), dtype='i4', fill_value=0)
>>> z1
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 317; ratio: 1261829.7; initialized: 0/100
store: zarr.storage.DirectoryStore
The array above will store its configuration metadata and all
compressed chunk data in a directory called ‘example.zarr’ relative to
the current working directory. The zarr.creation.open()
function
provides a convenient way to create a new persistent array or continue
working with an existing array. Note that there is no need to close an
array, and data are automatically flushed to disk whenever an array is
modified.
Persistent arrays support the same interface for reading and writing data, e.g.:
>>> z1[:] = 42
>>> z1[0, :] = np.arange(10000)
>>> z1[:, 0] = np.arange(10000)
Check that the data have been written and can be read again:
>>> z2 = zarr.open('example.zarr', mode='r')
>>> z2
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 2.3M; ratio: 163.8; initialized: 100/100
store: zarr.storage.DirectoryStore
>>> np.all(z1[:] == z2[:])
True
Resizing and appending¶
A Zarr array can be resized, which means that any of its dimensions can be increased or decreased in length. For example:
>>> z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
>>> z[:] = 42
>>> z.resize(20000, 10000)
>>> z
zarr.core.Array((20000, 10000), float64, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 1.5G; nbytes_stored: 5.9M; ratio: 259.9; initialized: 100/200
store: builtins.dict
Note that when an array is resized, the underlying data are not rearranged in any way. If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store.
For convenience, Zarr arrays also provide an append()
method,
which can be used to append data to any axis. E.g.:
>>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z = zarr.array(a, chunks=(1000, 100))
>>> z
zarr.core.Array((10000, 1000), int32, chunks=(1000, 100), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 38.1M; nbytes_stored: 2.0M; ratio: 19.3; initialized: 100/100
store: builtins.dict
>>> z.append(a)
>>> z
zarr.core.Array((20000, 1000), int32, chunks=(1000, 100), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 76.3M; nbytes_stored: 4.0M; ratio: 19.3; initialized: 200/200
store: builtins.dict
>>> z.append(np.vstack([a, a]), axis=1)
>>> z
zarr.core.Array((20000, 2000), int32, chunks=(1000, 100), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 152.6M; nbytes_stored: 7.9M; ratio: 19.3; initialized: 400/400
store: builtins.dict
Compression¶
By default, Zarr uses the Blosc compression library to compress each chunk of an array. Blosc is extremely fast and can be configured in a variety of ways to improve the compression ratio for different types of data. Blosc is in fact a “meta-compressor”, which means that it can used a number of different compression algorithms internally to compress the data. Blosc also provides highly optimised implementations of byte and bit shuffle filters, which can significantly improve compression ratios for some data.
Options for the compressor can be controlled via the
compression_opts
keyword argument accepted by all array creation
functions. For example:
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000), compression='blosc',
... compression_opts=dict(cname='lz4', clevel=3, shuffle=2))
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 3, 'cname': 'lz4', 'shuffle': 2}
nbytes: 381.5M; nbytes_stored: 17.6M; ratio: 21.7; initialized: 100/100
store: builtins.dict
The array above will use Blosc as the primary compressor, using the LZ4 algorithm (compression level 3) internally within Blosc, and with the bitshuffle filter applied.
In addition to Blosc, other compression libraries can also be used. Zarr comes with support for zlib, BZ2 and LZMA compression, via the Python standard library. For example, here is an array using zlib compression, level 1:
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000), compression='zlib',
... compression_opts=1)
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: zlib; compression_opts: 1
nbytes: 381.5M; nbytes_stored: 132.2M; ratio: 2.9; initialized: 100/100
store: builtins.dict
Here is an example using LZMA with a custom filter pipeline including the delta filter:
>>> import lzma
>>> filters = [dict(id=lzma.FILTER_DELTA, dist=4),
... dict(id=lzma.FILTER_LZMA2, preset=1)]
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000), compression='lzma',
... compression_opts=dict(filters=filters))
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: lzma; compression_opts: {'preset': None, 'filters': [{'dist': 4, 'id': 3}, {'preset': 1, 'id': 33}], 'check': 0, 'format': 1}
nbytes: 381.5M; nbytes_stored: 248.1K; ratio: 1574.7; initialized: 100/100
store: builtins.dict
Parallel computing and synchronization¶
Zarr arrays can be used as either the source or sink for data in parallel computations. Both multi-threaded and multi-process parallelism are supported. The Python global interpreter lock (GIL) is released for both compression and decompression operations, so Zarr will not block other Python threads from running.
A Zarr array can be read concurrently by multiple threads or processes. No synchronization (i.e., locking) is required for concurrent reads.
A Zarr array can also be written to concurrently by multiple threads or processes. Some synchronization may be required, depending on the way the data is being written.
If each worker in a parallel computation is writing to a separate region of the array, and if region boundaries are perfectly aligned with chunk boundaries, then no synchronization is required. However, if region and chunk boundaries are not perfectly aligned, then synchronization is required to avoid two workers attempting to modify the same chunk at the same time.
To give a simple example, consider a 1-dimensional array of length 60,
z
, divided into three chunks of 20 elements each. If three workers
are running and each attempts to write to a 20 element region (i.e.,
z[0:20]
, z[20:40]
and z[40:60]
) then each worker will be
writing to a separate chunk and no synchronization is
required. However, if two workers are running and each attempts to
write to a 30 element region (i.e., z[0:30]
and z[30:60]
) then
it is possible both workers will attempt to modify the middle chunk at
the same time, and synchronization is required to prevent data loss.
Zarr provides support for chunk-level synchronization. E.g., create an array with thread synchronization:
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
... synchronizer=zarr.ThreadSynchronizer())
>>> z
zarr.sync.SynchronizedArray((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 317; ratio: 1261829.7; initialized: 0/100
store: builtins.dict; synchronizer: zarr.sync.ThreadSynchronizer
This array is safe to read or write within a multi-threaded program.
Zarr also provides support for process synchronization via file locking, provided that all processes have access to a shared file system. E.g.:
>>> synchronizer = zarr.ProcessSynchronizer('example.zarr')
>>> z = zarr.open('example.zarr', mode='w', shape=(10000, 10000),
... chunks=(1000, 1000), dtype='i4',
... synchronizer=synchronizer)
>>> z
zarr.sync.SynchronizedArray((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 317; ratio: 1261829.7; initialized: 0/100
store: zarr.storage.DirectoryStore; synchronizer: zarr.sync.ProcessSynchronizer
This array is safe to read or write from multiple processes.
User attributes¶
Zarr arrays also support custom key/value attributes, which can be useful for associating an array with application-specific metadata. For example:
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z.attrs['foo'] = 'bar'
>>> z.attrs['baz'] = 42
>>> sorted(z.attrs)
['baz', 'foo']
>>> 'foo' in z.attrs
True
>>> z.attrs['foo']
'bar'
>>> z.attrs['baz']
42
Internally Zarr uses JSON to store array attributes, so attribute values must be JSON serializable.
Tips and tricks¶
Copying large arrays¶
Data can be copied between large arrays without needing much memory, e.g.:
>>> z1 = zarr.empty((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z1[:] = 42
>>> z2 = zarr.empty_like(z1)
>>> z2[:] = z1
Internally the example above works chunk-by-chunk, extracting only the
data from z1
required to fill each chunk in z2
. The source of
the data (z1
) could equally be an h5py Dataset.
Changing memory layout¶
The order of bytes within each chunk of an array can be changed via
the order
keyword argument, to use either C or Fortran layout. For
multi-dimensional arrays, these two layouts may provide different
compression ratios, depending on the correlation structure within the
data. E.g.:
>>> a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
>>> zarr.array(a, chunks=(1000, 1000))
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 26.1M; ratio: 14.6; initialized: 100/100
store: builtins.dict
>>> zarr.array(a, chunks=(1000, 1000), order='F')
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=F)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 10.0M; ratio: 38.0; initialized: 100/100
store: builtins.dict
In the above example, Fortran order gives a better compression ratio. This is an artifical example but illustrates the general point that changing the order of bytes within chunks of an array may improve the compression ratio, depending on the structure of the data, the compression algorithm used, and which compression filters (e.g., byte shuffle) have been applied.
Storage alternatives¶
Zarr can use any object that implements the MutableMapping
interface as
the store for an array.
Here is an example storing an array directly into a Zip file via the zict package:
>>> import zict
>>> import os
>>> store = zict.Zip('example.zip', mode='w')
>>> z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='i4',
... compression='zlib', compression_opts=1, store=store)
>>> z
zarr.core.Array((1000, 1000), int32, chunks=(100, 100), order=C)
compression: zlib; compression_opts: 1
nbytes: 3.8M; initialized: 0/100
store: zict.zip.Zip
>>> z[:] = 42
>>> store.flush() # only required for zict.Zip
>>> os.path.getsize('example.zip')
30828
Re-open and check that data have been written:
>>> store = zict.Zip('example.zip', mode='r')
>>> z = zarr.Array(store)
>>> z
zarr.core.Array((1000, 1000), int32, chunks=(100, 100), order=C)
compression: zlib; compression_opts: 1
nbytes: 3.8M; initialized: 100/100
store: zict.zip.Zip
>>> z[:]
array([[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
...,
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42]], dtype=int32)
Note that there are some restrictions on how Zip files can be used, because items within a Zip file cannot be updated in place. This means that data in the array should only be written once and write operations should be aligned with chunk boundaries.
Chunk size and shape¶
In general, chunks of at least 1 megabyte (1M) seem to provide the best performance, at least when using the Blosc compression library.
The optimal chunk shape will depend on how you want to access the data. E.g.,
for a 2-dimensional array, if you only ever take slices along the first
dimension, then chunk across the second dimenson. If you know you want to
chunk across an entire dimension you can use None
within the chunks
argument, e.g.:
>>> z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
>>> z1.chunks
(100, 10000)
Alternatively, if you only ever take slices along the second dimension, then chunk across the first dimension, e.g.:
>>> z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
>>> z2.chunks
(10000, 100)
If you require reasonable performance for both access patterns then you need to find a compromise, e.g.:
>>> z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z3.chunks
(1000, 1000)
Configuring Blosc¶
The Blosc compressor is able to use multiple threads internally to accelerate compression and decompression. By default, Zarr allows Blosc to use up to 4 internal threads. The number of Blosc threads can be changed, e.g.:
>>> from zarr import blosc
>>> blosc.set_nthreads(2)
4
When a Zarr array is being used within a multi-threaded program, Zarr
automatically switches to using Blosc in a single-threaded
“contextual” mode. This is generally better as it allows multiple
program threads to use Blosc simultaneously and prevents CPU thrashing
from too many active threads. If you want to manually override this
behaviour, set the value of the blosc.use_threads
variable to
True
(Blosc always uses multiple internal threads) or False
(Blosc always runs in single-threaded contextual mode). To re-enable
automatic switching, set blosc.use_threads
to None
.