Zarr¶
Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays.
Highlights¶
- Create N-dimensional arrays with any NumPy dtype.
- Chunk arrays along any dimension.
- Compress chunks using the fast Blosc meta-compressor or alternatively using zlib, BZ2 or LZMA.
- Store arrays in memory, on disk, inside a Zip file, on S3, ...
- Read an array concurrently from multiple threads or processes.
- Write to an array concurrently from multiple threads or processes.
- Organize arrays into hierarchies via groups.
- Use filters to preprocess data and improve compression.
Status¶
Zarr is still in an early phase of development. Feedback and bug reports are very welcome; please get in touch via the GitHub issue tracker.
Installation¶
Zarr depends on NumPy. It is generally best to install NumPy first, using whatever method is most appropriate for your operating system and Python distribution.
Install Zarr from PyPI:
$ pip install zarr
Alternatively, install Zarr via conda:
$ conda install -c conda-forge zarr
Zarr includes a C extension providing integration with the Blosc library. Installing via conda will install a pre-compiled binary distribution. However, if you have a newer CPU that supports the AVX2 instruction set (e.g., Intel Haswell, Broadwell or Skylake) then installing via pip is preferable, because this will compile the Blosc library from source with optimisations for AVX2.
To work with Zarr source code in development, install from GitHub:
$ git clone --recursive https://github.com/alimanfoo/zarr.git
$ cd zarr
$ python setup.py install
To verify that Zarr has been fully installed (including the Blosc extension), run the test suite:
$ pip install nose
$ python -m nose -v zarr
Contents¶
Tutorial¶
Zarr provides classes and functions for working with N-dimensional arrays that behave like NumPy arrays but whose data is divided into chunks and compressed. If you are already familiar with HDF5 then Zarr arrays provide similar functionality, but with some additional flexibility.
Creating an array¶
Zarr has a number of convenience functions for creating arrays. For example:
>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
The code above creates a 2-dimensional array of 32-bit integers with 10000 rows and 10000 columns, divided into chunks where each chunk has 1000 rows and 1000 columns (and so there will be 100 chunks in total).
For a complete list of array creation routines see the zarr.creation module documentation.
Reading and writing data¶
Zarr arrays support a similar interface to NumPy arrays for reading and writing data. For example, the entire array can be filled with a scalar value:
>>> z[:] = 42
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 1.8M; ratio: 215.1; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
Notice that the values of nbytes_stored, ratio and initialized have changed. This is because when a Zarr array is first created, none of the chunks are initialized. Writing data into the array will cause the necessary chunks to be initialized.
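For example, writing to a single row initializes only the chunks intersecting that row (a sketch; the nchunks_initialized property is documented in the API reference below):
>>> z2 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z2.nchunks_initialized
0
>>> z2[0, :] = 42
>>> z2.nchunks_initialized
10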
Regions of the array can also be written to, e.g.:
>>> import numpy as np
>>> z[0, :] = np.arange(10000)
>>> z[:, 0] = np.arange(10000)
The contents of the array can be retrieved by slicing, which will load the requested region into a NumPy array, e.g.:
>>> z[0, 0]
0
>>> z[-1, -1]
42
>>> z[0, :]
array([ 0, 1, 2, ..., 9997, 9998, 9999], dtype=int32)
>>> z[:, 0]
array([ 0, 1, 2, ..., 9997, 9998, 9999], dtype=int32)
>>> z[:]
array([[ 0, 1, 2, ..., 9997, 9998, 9999],
[ 1, 42, 42, ..., 42, 42, 42],
[ 2, 42, 42, ..., 42, 42, 42],
...,
[9997, 42, 42, ..., 42, 42, 42],
[9998, 42, 42, ..., 42, 42, 42],
[9999, 42, 42, ..., 42, 42, 42]], dtype=int32)
Persistent arrays¶
In the examples above, compressed data for each chunk of the array was stored in memory. Zarr arrays can also be stored on a file system, enabling persistence of data between sessions. For example:
>>> z1 = zarr.open_array('example.zarr', mode='w', shape=(10000, 10000),
... chunks=(1000, 1000), dtype='i4', fill_value=0)
>>> z1
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DirectoryStore
The array above will store its configuration metadata and all compressed chunk data in a directory called ‘example.zarr’ relative to the current working directory. The zarr.creation.open_array() function provides a convenient way to create a new persistent array or continue working with an existing array. Note that there is no need to close an array, and data are automatically flushed to disk whenever an array is modified.
Persistent arrays support the same interface for reading and writing data, e.g.:
>>> z1[:] = 42
>>> z1[0, :] = np.arange(10000)
>>> z1[:, 0] = np.arange(10000)
Check that the data have been written and can be read again:
>>> z2 = zarr.open_array('example.zarr', mode='r')
>>> z2
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 1.9M; ratio: 204.5; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DirectoryStore
>>> np.all(z1[:] == z2[:])
True
Resizing and appending¶
A Zarr array can be resized, which means that any of its dimensions can be increased or decreased in length. For example:
>>> z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
>>> z[:] = 42
>>> z.resize(20000, 10000)
>>> z
Array((20000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 1.5G; nbytes_stored: 3.6M; ratio: 422.3; initialized: 100/200
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
Note that when an array is resized, the underlying data are not rearranged in any way. If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store.
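For example, shrinking the array above back to its original shape discards the chunks that now fall outside the new bounds (a sketch continuing the example; the nchunks property is documented in the API reference below):
>>> z.resize(10000, 10000)
>>> z.shape
(10000, 10000)
>>> z.nchunks
100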
For convenience, Zarr arrays also provide an append() method, which can be used to append data to any axis. E.g.:
>>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z = zarr.array(a, chunks=(1000, 100))
>>> z
Array((10000, 1000), int32, chunks=(1000, 100), order=C)
nbytes: 38.1M; nbytes_stored: 1.9M; ratio: 20.3; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z.append(a)
(20000, 1000)
>>> z
Array((20000, 1000), int32, chunks=(1000, 100), order=C)
nbytes: 76.3M; nbytes_stored: 3.8M; ratio: 20.3; initialized: 200/200
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z.append(np.vstack([a, a]), axis=1)
(20000, 2000)
>>> z
Array((20000, 2000), int32, chunks=(1000, 100), order=C)
nbytes: 152.6M; nbytes_stored: 7.5M; ratio: 20.3; initialized: 400/400
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
Compressors¶
By default, Zarr uses the Blosc compression library to compress each chunk of an array. Blosc is extremely fast and can be configured in a variety of ways to improve the compression ratio for different types of data. Blosc is in fact a “meta-compressor”, which means that it can use a number of different compression algorithms internally to compress the data. Blosc also provides highly optimized implementations of byte and bit shuffle filters, which can significantly improve compression ratios for some data.
Different compressors can be provided via the compressor keyword argument accepted by all array creation functions. For example:
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000),
... compressor=zarr.Blosc(cname='zstd', clevel=3, shuffle=2))
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 4.4M; ratio: 87.6; initialized: 100/100
compressor: Blosc(cname='zstd', clevel=3, shuffle=2)
store: dict
The array above will use Blosc as the primary compressor, using the Zstandard algorithm (compression level 3) internally within Blosc, and with the bitshuffle filter applied.
A list of the internal compression libraries available within Blosc can be obtained via:
>>> from zarr import blosc
>>> blosc.list_compressors()
['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']
In addition to Blosc, other compression libraries can also be used. Zarr comes with support for zlib, BZ2 and LZMA compression, via the Python standard library. For example, here is an array using zlib compression, level 1:
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000),
... compressor=zarr.Zlib(level=1))
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 132.2M; ratio: 2.9; initialized: 100/100
compressor: Zlib(level=1)
store: dict
Here is an example using LZMA with a custom filter pipeline including LZMA’s built-in delta filter:
>>> import lzma
>>> lzma_filters = [dict(id=lzma.FILTER_DELTA, dist=4),
... dict(id=lzma.FILTER_LZMA2, preset=1)]
>>> compressor = zarr.LZMA(filters=lzma_filters)
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000), compressor=compressor)
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 248.9K; ratio: 1569.7; initialized: 100/100
compressor: LZMA(format=1, check=-1, preset=None, filters=[{'dist': 4, 'id': 3}, {'preset': 1, 'id': 33}])
store: dict
The default compressor can be changed by setting the value of the zarr.storage.default_compressor variable, e.g.:
>>> import zarr.storage
>>> # switch to using Zstandard via Blosc by default
... zarr.storage.default_compressor = zarr.Blosc(cname='zstd', clevel=1, shuffle=1)
>>> z = zarr.zeros(100000000, chunks=1000000)
>>> z
Array((100000000,), float64, chunks=(1000000,), order=C)
nbytes: 762.9M; nbytes_stored: 302; ratio: 2649006.6; initialized: 0/100
compressor: Blosc(cname='zstd', clevel=1, shuffle=1)
store: dict
>>> # switch back to Blosc defaults
... zarr.storage.default_compressor = zarr.Blosc()
To disable compression, set compressor=None when creating an array, e.g.:
>>> z = zarr.zeros(100000000, chunks=1000000, compressor=None)
>>> z
Array((100000000,), float64, chunks=(1000000,), order=C)
nbytes: 762.9M; nbytes_stored: 209; ratio: 3827751.2; initialized: 0/100
store: dict
Filters¶
In some cases, compression can be improved by transforming the data in some way. For example, if nearby values tend to be correlated, then shuffling the bytes within each numerical value or storing the difference between adjacent values may increase compression ratio. Some compressors provide built-in filters that apply transformations to the data prior to compression. For example, the Blosc compressor has highly optimized built-in implementations of byte- and bit-shuffle filters, and the LZMA compressor has a built-in implementation of a delta filter. However, to provide additional flexibility for implementing and using filters in combination with different compressors, Zarr also provides a mechanism for configuring filters outside of the primary compressor.
Here is an example using the Zarr delta filter with the Blosc compressor:
>>> filters = [zarr.Delta(dtype='i4')]
>>> compressor = zarr.Blosc(cname='zstd', clevel=1, shuffle=1)
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
... chunks=(1000, 1000), filters=filters, compressor=compressor)
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 633.4K; ratio: 616.7; initialized: 100/100
filters: Delta(dtype=int32)
compressor: Blosc(cname='zstd', clevel=1, shuffle=1)
store: dict
Zarr comes with implementations of delta, scale-offset, quantize, packbits and categorize filters. It is also relatively straightforward to implement custom filters. For more information see the zarr.codecs API docs.
Parallel computing and synchronization¶
Zarr arrays can be used as either the source or sink for data in parallel computations. Both multi-threaded and multi-process parallelism are supported. The Python global interpreter lock (GIL) is released for both compression and decompression operations, so Zarr will not block other Python threads from running.
A Zarr array can be read concurrently by multiple threads or processes. No synchronization (i.e., locking) is required for concurrent reads.
A Zarr array can also be written to concurrently by multiple threads or processes. Some synchronization may be required, depending on the way the data is being written.
If each worker in a parallel computation is writing to a separate region of the array, and if region boundaries are perfectly aligned with chunk boundaries, then no synchronization is required. However, if region and chunk boundaries are not perfectly aligned, then synchronization is required to avoid two workers attempting to modify the same chunk at the same time.
To give a simple example, consider a 1-dimensional array of length 60, z, divided into three chunks of 20 elements each. If three workers are running and each attempts to write to a 20-element region (i.e., z[0:20], z[20:40] and z[40:60]) then each worker will be writing to a separate chunk and no synchronization is required. However, if two workers are running and each attempts to write to a 30-element region (i.e., z[0:30] and z[30:60]) then it is possible both workers will attempt to modify the middle chunk at the same time, and synchronization is required to prevent data loss.
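As a minimal sketch of the aligned case using only the standard library (the fill() helper and worker count are illustrative; no synchronizer is needed because each region exactly covers one chunk):
>>> import numpy as np
>>> import zarr
>>> from concurrent.futures import ThreadPoolExecutor
>>> z = zarr.zeros(60, chunks=20, dtype='i4')
>>> def fill(start, stop):
...     # each task writes a region exactly covering one chunk
...     z[start:stop] = np.arange(start, stop)
>>> with ThreadPoolExecutor(max_workers=3) as pool:
...     for start in (0, 20, 40):
...         _ = pool.submit(fill, start, start + 20)
>>> z[59]
59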
Zarr provides support for chunk-level synchronization. E.g., create an array with thread synchronization:
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
... synchronizer=zarr.ThreadSynchronizer())
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict; synchronizer: ThreadSynchronizer
This array is safe to read or write within a multi-threaded program.
Zarr also provides support for process synchronization via file locking, provided that all processes have access to a shared file system. E.g.:
>>> synchronizer = zarr.ProcessSynchronizer('example.sync')
>>> z = zarr.open_array('example', mode='w', shape=(10000, 10000),
... chunks=(1000, 1000), dtype='i4',
... synchronizer=synchronizer)
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DirectoryStore; synchronizer: ProcessSynchronizer
This array is safe to read or write from multiple processes.
User attributes¶
Zarr arrays also support custom key/value attributes, which can be useful for associating an array with application-specific metadata. For example:
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z.attrs['foo'] = 'bar'
>>> z.attrs['baz'] = 42
>>> sorted(z.attrs)
['baz', 'foo']
>>> 'foo' in z.attrs
True
>>> z.attrs['foo']
'bar'
>>> z.attrs['baz']
42
Internally Zarr uses JSON to store array attributes, so attribute values must be JSON serializable.
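Values that are not JSON serializable, such as NumPy arrays, must be converted to plain Python types first, e.g. (a sketch):
>>> import numpy as np
>>> z.attrs['means'] = np.arange(3).tolist()  # an ndarray itself would fail
>>> z.attrs['means']
[0, 1, 2]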
Groups¶
Zarr supports hierarchical organization of arrays via groups. As with arrays, groups can be stored in memory, on disk, or via other storage systems that support a similar interface.
To create a group, use the zarr.hierarchy.group() function:
>>> root_group = zarr.group()
>>> root_group
Group(/, 0)
store: DictStore
Groups have a similar API to the Group class from h5py. For example, groups can contain other groups:
>>> foo_group = root_group.create_group('foo')
>>> bar_group = foo_group.create_group('bar')
Groups can also contain arrays, e.g.:
>>> z1 = bar_group.zeros('baz', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4',
... compressor=zarr.Blosc(cname='zstd', clevel=1, shuffle=1))
>>> z1
Array(/foo/bar/baz, (10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 324; ratio: 1234567.9; initialized: 0/100
compressor: Blosc(cname='zstd', clevel=1, shuffle=1)
store: DictStore
Arrays are known as “datasets” in HDF5 terminology. For compatibility with h5py, Zarr groups also implement the zarr.hierarchy.Group.create_dataset() and zarr.hierarchy.Group.require_dataset() methods, e.g.:
>>> z = bar_group.create_dataset('quux', shape=(10000, 10000),
... chunks=(1000, 1000), dtype='i4',
... fill_value=0, compression='gzip',
... compression_opts=1)
>>> z
Array(/foo/bar/quux, (10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 275; ratio: 1454545.5; initialized: 0/100
compressor: Zlib(level=1)
store: DictStore
Members of a group can be accessed via the suffix notation, e.g.:
>>> root_group['foo']
Group(/foo, 1)
groups: 1; bar
store: DictStore
The ‘/’ character can be used to access multiple levels of the hierarchy, e.g.:
>>> root_group['foo/bar']
Group(/foo/bar, 2)
arrays: 2; baz, quux
store: DictStore
>>> root_group['foo/bar/baz']
Array(/foo/bar/baz, (10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 324; ratio: 1234567.9; initialized: 0/100
compressor: Blosc(cname='zstd', clevel=1, shuffle=1)
store: DictStore
The zarr.hierarchy.open_group() function provides a convenient way to create or re-open a group stored in a directory on the file-system, with sub-groups stored in sub-directories, e.g.:
>>> persistent_group = zarr.open_group('example', mode='w')
>>> persistent_group
Group(/, 0)
store: DirectoryStore
>>> z = persistent_group.create_dataset('foo/bar/baz', shape=(10000, 10000),
... chunks=(1000, 1000), dtype='i4',
... fill_value=0)
>>> z
Array(/foo/bar/baz, (10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DirectoryStore
For more information on groups see the zarr.hierarchy API docs.
Tips and tricks¶
Copying large arrays¶
Data can be copied between large arrays without needing much memory, e.g.:
>>> z1 = zarr.empty((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z1[:] = 42
>>> z2 = zarr.empty_like(z1)
>>> z2[:] = z1
Internally the example above works chunk-by-chunk, extracting only the data from z1 required to fill each chunk in z2. The source of the data (z1) could equally be an h5py Dataset.
Changing memory layout¶
The order of bytes within each chunk of an array can be changed via the order keyword argument, to use either C or Fortran layout. For multi-dimensional arrays, these two layouts may provide different compression ratios, depending on the correlation structure within the data. E.g.:
>>> a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
>>> zarr.array(a, chunks=(1000, 1000))
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 26.3M; ratio: 14.5; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> zarr.array(a, chunks=(1000, 1000), order='F')
Array((10000, 10000), int32, chunks=(1000, 1000), order=F)
nbytes: 381.5M; nbytes_stored: 9.2M; ratio: 41.6; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
In the above example, Fortran order gives a better compression ratio. This is an artificial example but illustrates the general point that changing the order of bytes within chunks of an array may improve the compression ratio, depending on the structure of the data, the compression algorithm used, and which compression filters (e.g., byte shuffle) have been applied.
Storage alternatives¶
Zarr can use any object that implements the MutableMapping interface as the store for a group or an array.
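For example, a plain Python dict can serve as a fully functional in-memory store (a sketch; the '.zarray' and '.zattrs' metadata keys are described under zarr.storage.init_array() below):
>>> store = dict()
>>> z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='i4', store=store)
>>> sorted(store.keys())
['.zarray', '.zattrs']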
Here is an example storing an array directly into a Zip file:
>>> store = zarr.ZipStore('example.zip', mode='w')
>>> z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='i4', store=store)
>>> z
Array((1000, 1000), int32, chunks=(100, 100), order=C)
nbytes: 3.8M; nbytes_stored: 319; ratio: 12539.2; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: ZipStore
>>> z[:] = 42
>>> z
Array((1000, 1000), int32, chunks=(100, 100), order=C)
nbytes: 3.8M; nbytes_stored: 21.8K; ratio: 179.2; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: ZipStore
>>> store.close()
>>> import os
>>> os.path.getsize('example.zip')
30721
Re-open and check that data have been written:
>>> store = zarr.ZipStore('example.zip', mode='r')
>>> z = zarr.Array(store)
>>> z
Array((1000, 1000), int32, chunks=(100, 100), order=C)
nbytes: 3.8M; nbytes_stored: 21.8K; ratio: 179.2; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: ZipStore
>>> z[:]
array([[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
...,
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42]], dtype=int32)
>>> store.close()
Note that there are some restrictions on how Zip files can be used, because items within a Zip file cannot be updated in place. This means that data in the array should only be written once and write operations should be aligned with chunk boundaries.
Note also that the close() method must be called after writing any data to the store, otherwise essential records will not be written to the underlying zip file.
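The ZipStore class also supports the context manager protocol, which guarantees close() is called on leaving the with statement, e.g. (a sketch; 'example2.zip' is just an illustrative file name):
>>> with zarr.ZipStore('example2.zip', mode='w') as store:
...     z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='i4', store=store)
...     z[:] = 42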
The Dask project has implementations of the MutableMapping interface for distributed storage systems; see the S3Map and HDFSMap classes.
Chunk size and shape¶
In general, chunks of at least 1 megabyte (1M) seem to provide the best performance, at least when using the Blosc compression library.
The optimal chunk shape will depend on how you want to access the data. E.g., for a 2-dimensional array, if you only ever take slices along the first dimension, then chunk across the second dimension. If you know you want to chunk across an entire dimension you can use None within the chunks argument, e.g.:
>>> z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
>>> z1.chunks
(100, 10000)
Alternatively, if you only ever take slices along the second dimension, then chunk across the first dimension, e.g.:
>>> z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
>>> z2.chunks
(10000, 100)
If you require reasonable performance for both access patterns then you need to find a compromise, e.g.:
>>> z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z3.chunks
(1000, 1000)
If you are feeling lazy, you can let Zarr guess a chunk shape for your data, although please note that the algorithm for guessing a chunk shape is based on simple heuristics and may be far from optimal. E.g.:
>>> z4 = zarr.zeros((10000, 10000), dtype='i4')
>>> z4.chunks
(313, 313)
Configuring Blosc¶
The Blosc compressor is able to use multiple threads internally to accelerate compression and decompression. By default, Zarr allows Blosc to use up to 8 internal threads. This number can be changed, e.g.:
>>> from zarr import blosc
>>> blosc.set_nthreads(2)
8
When a Zarr array is being used within a multi-threaded program, Zarr automatically switches to using Blosc in a single-threaded “contextual” mode. This is generally better as it allows multiple program threads to use Blosc simultaneously and prevents CPU thrashing from too many active threads. If you want to manually override this behaviour, set the value of the blosc.use_threads variable to True (Blosc always uses multiple internal threads) or False (Blosc always runs in single-threaded contextual mode). To re-enable automatic switching, set blosc.use_threads to None.
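For example, the three settings look like this:
>>> from zarr import blosc
>>> blosc.use_threads = True   # Blosc always uses multiple internal threads
>>> blosc.use_threads = False  # Blosc always runs single-threaded
>>> blosc.use_threads = None   # re-enable automatic switching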
API reference¶
Array creation (zarr.creation)¶
zarr.creation.create(shape, chunks=None, dtype=None, compressor='default', fill_value=0, order='C', store=None, synchronizer=None, overwrite=False, path=None, chunk_store=None, filters=None, cache_metadata=True, **kwargs)¶
Create an array.
Parameters: shape : int or tuple of ints
Array shape.
chunks : int or tuple of ints, optional
Chunk shape. If not provided, will be guessed from shape and dtype.
dtype : string or dtype, optional
NumPy dtype.
compressor : Codec, optional
Primary compressor.
fill_value : object
Default value to use for uninitialized portions of the array.
order : {‘C’, ‘F’}, optional
Memory layout to be used within each chunk.
store : MutableMapping or string
Store or path to directory in file system.
synchronizer : object, optional
Array synchronizer.
overwrite : bool, optional
If True, delete all pre-existing data in store at path before creating the array.
path : string, optional
Path under which array is stored.
chunk_store : MutableMapping, optional
Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.
filters : sequence of Codecs, optional
Sequence of filters to use to encode chunk data prior to compression.
cache_metadata : bool, optional
If True, array configuration metadata will be cached for the lifetime of the object. If False, array metadata will be reloaded prior to all data access and modification operations (may incur overhead depending on storage and data access pattern).
Returns: z : zarr.core.Array
Examples
Create an array with default settings:
>>> import zarr
>>> z = zarr.create((10000, 10000), chunks=(1000, 1000))
>>> z
Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 762.9M; nbytes_stored: 323; ratio: 2476780.2; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
zarr.creation.empty(shape, **kwargs)¶
Create an empty array.
For parameter definitions see zarr.creation.create().
Notes
The contents of an empty Zarr array are not defined. On attempting to retrieve data from an empty Zarr array, any values may be returned, and these are not guaranteed to be stable from one access to the next.
zarr.creation.zeros(shape, **kwargs)¶
Create an array, with zero being used as the default value for uninitialized portions of the array.
For parameter definitions see zarr.creation.create().
Examples
>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000))
>>> z
Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 762.9M; nbytes_stored: 323; ratio: 2476780.2; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z[:2, :2]
array([[ 0.,  0.],
       [ 0.,  0.]])
zarr.creation.ones(shape, **kwargs)¶
Create an array, with one being used as the default value for uninitialized portions of the array.
For parameter definitions see zarr.creation.create().
Examples
>>> import zarr
>>> z = zarr.ones((10000, 10000), chunks=(1000, 1000))
>>> z
Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 762.9M; nbytes_stored: 323; ratio: 2476780.2; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z[:2, :2]
array([[ 1.,  1.],
       [ 1.,  1.]])
zarr.creation.full(shape, fill_value, **kwargs)¶
Create an array, with fill_value being used as the default value for uninitialized portions of the array.
For parameter definitions see zarr.creation.create().
Examples
>>> import zarr
>>> z = zarr.full((10000, 10000), chunks=(1000, 1000), fill_value=42)
>>> z
Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 762.9M; nbytes_stored: 324; ratio: 2469135.8; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z[:2, :2]
array([[ 42.,  42.],
       [ 42.,  42.]])
zarr.creation.array(data, **kwargs)¶
Create an array filled with data.
The data argument should be a NumPy array or array-like object. For other parameter definitions see zarr.creation.create().
Examples
>>> import numpy as np
>>> import zarr
>>> a = np.arange(100000000).reshape(10000, 10000)
>>> z = zarr.array(a, chunks=(1000, 1000))
>>> z
Array((10000, 10000), int64, chunks=(1000, 1000), order=C)
nbytes: 762.9M; nbytes_stored: 15.2M; ratio: 50.2; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
zarr.creation.open_array(store=None, mode='a', shape=None, chunks=None, dtype=None, compressor='default', fill_value=0, order='C', synchronizer=None, filters=None, cache_metadata=True, path=None, **kwargs)¶
Open array using mode-like semantics.
Parameters: store : MutableMapping or string
Store or path to directory in file system.
mode : {‘r’, ‘r+’, ‘a’, ‘w’, ‘w-‘}
Persistence mode: ‘r’ means read only (must exist); ‘r+’ means read/write (must exist); ‘a’ means read/write (create if doesn’t exist); ‘w’ means create (overwrite if exists); ‘w-‘ means create (fail if exists).
shape : int or tuple of ints
Array shape.
chunks : int or tuple of ints, optional
Chunk shape. If not provided, will be guessed from shape and dtype.
dtype : string or dtype, optional
NumPy dtype.
compressor : Codec, optional
Primary compressor.
fill_value : object
Default value to use for uninitialized portions of the array.
order : {‘C’, ‘F’}, optional
Memory layout to be used within each chunk.
synchronizer : object, optional
Array synchronizer.
filters : sequence, optional
Sequence of filters to use to encode chunk data prior to compression.
cache_metadata : bool, optional
If True, array configuration metadata will be cached for the lifetime of the object. If False, array metadata will be reloaded prior to all data access and modification operations (may incur overhead depending on storage and data access pattern).
path : string, optional
Array path.
Returns: z : zarr.core.Array
Notes
There is no need to close an array. Data are automatically flushed to the file system.
Examples
>>> import numpy as np
>>> import zarr
>>> z1 = zarr.open_array('example.zarr', mode='w', shape=(10000, 10000),
...                      chunks=(1000, 1000), fill_value=0)
>>> z1[:] = np.arange(100000000).reshape(10000, 10000)
>>> z1
Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 762.9M; nbytes_stored: 23.0M; ratio: 33.2; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DirectoryStore
>>> z2 = zarr.open_array('example.zarr', mode='r')
>>> z2
Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 762.9M; nbytes_stored: 23.0M; ratio: 33.2; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DirectoryStore
>>> np.all(z1[:] == z2[:])
True
zarr.creation.empty_like(a, **kwargs)¶
Create an empty array like a.
zarr.creation.zeros_like(a, **kwargs)¶
Create an array of zeros like a.
zarr.creation.ones_like(a, **kwargs)¶
Create an array of ones like a.
zarr.creation.full_like(a, **kwargs)¶
Create a filled array like a.
zarr.creation.open_like(a, path, **kwargs)¶
Open a persistent array like a.
The Array class (zarr.core)¶
class zarr.core.Array(store, path=None, read_only=False, chunk_store=None, synchronizer=None, cache_metadata=True)¶
Instantiate an array from an initialized store.
Parameters: store : MutableMapping
Array store, already initialized.
path : string, optional
Storage path.
read_only : bool, optional
True if array should be protected against modification.
chunk_store : MutableMapping, optional
Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.
synchronizer : object, optional
Array synchronizer.
cache_metadata : bool, optional
If True, array configuration metadata will be cached for the lifetime of the object. If False, array metadata will be reloaded prior to all data access and modification operations (may incur overhead depending on storage and data access pattern).
Attributes
- store: A MutableMapping providing the underlying storage for the array.
- path: Storage path.
- name: Array name following h5py convention.
- read_only: A boolean, True if modification operations are not permitted.
- chunk_store: A MutableMapping providing the underlying storage for array chunks.
- shape: A tuple of integers describing the length of each dimension of the array.
- chunks: A tuple of integers describing the length of each dimension of a chunk of the array.
- dtype: The NumPy data type.
- fill_value: A value used for uninitialized portions of the array.
- order: A string indicating the order in which bytes are arranged within chunks of the array.
- synchronizer: Object used to synchronize write access to the array.
- filters: One or more codecs used to transform data prior to compression.
- attrs: A MutableMapping containing user-defined attributes.
- size: The total number of elements in the array.
- itemsize: The size in bytes of each item in the array.
- nbytes: The total number of bytes that would be required to store the array without compression.
- nbytes_stored: The total number of stored bytes of data for the array.
- cdata_shape: A tuple of integers describing the number of chunks along each dimension of the array.
- nchunks: Total number of chunks.
- nchunks_initialized: The number of chunks that have been initialized with some data.
- is_view: A boolean, True if this array is a view on another array.
- compression
- compression_opts
Methods
- __getitem__(item): Retrieve data for some portion of the array.
- __setitem__(item, value): Modify data for some portion of the array.
- resize(*args): Change the shape of the array by growing or shrinking one or more dimensions.
- append(data[, axis]): Append data to axis.
- view([shape, chunks, dtype, fill_value, ...]): Return an array sharing the same data.
__getitem__(item)¶
Retrieve data for some portion of the array. Most NumPy-style slicing operations are supported.
Returns: out : ndarray
A NumPy array containing the data for the requested region.
Examples
Set up a 1-dimensional array:
>>> import zarr
>>> import numpy as np
>>> z = zarr.array(np.arange(100000000), chunks=1000000, dtype='i4')
>>> z
Array((100000000,), int32, chunks=(1000000,), order=C)
nbytes: 381.5M; nbytes_stored: 6.4M; ratio: 59.9; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
Take some slices:
>>> z[5]
5
>>> z[:5]
array([0, 1, 2, 3, 4], dtype=int32)
>>> z[-5:]
array([99999995, 99999996, 99999997, 99999998, 99999999], dtype=int32)
>>> z[5:10]
array([5, 6, 7, 8, 9], dtype=int32)
>>> z[:]
array([ 0, 1, 2, ..., 99999997, 99999998, 99999999], dtype=int32)
Set up a 2-dimensional array:
>>> import zarr
>>> import numpy as np
>>> z = zarr.array(np.arange(100000000).reshape(10000, 10000),
...                chunks=(1000, 1000), dtype='i4')
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 9.2M; ratio: 41.6; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
Take some slices:
>>> z[2, 2]
20002
>>> z[:2, :2]
array([[    0,     1],
       [10000, 10001]], dtype=int32)
>>> z[:2]
array([[    0,     1,     2, ...,  9997,  9998,  9999],
       [10000, 10001, 10002, ..., 19997, 19998, 19999]], dtype=int32)
>>> z[:, :2]
array([[       0,        1],
       [   10000,    10001],
       [   20000,    20001],
       ...,
       [99970000, 99970001],
       [99980000, 99980001],
       [99990000, 99990001]], dtype=int32)
>>> z[:]
array([[       0,        1,        2, ...,     9997,     9998,     9999],
       [   10000,    10001,    10002, ...,    19997,    19998,    19999],
       [   20000,    20001,    20002, ...,    29997,    29998,    29999],
       ...,
       [99970000, 99970001, 99970002, ..., 99979997, 99979998, 99979999],
       [99980000, 99980001, 99980002, ..., 99989997, 99989998, 99989999],
       [99990000, 99990001, 99990002, ..., 99999997, 99999998, 99999999]], dtype=int32)
__setitem__(item, value)¶
Modify data for some portion of the array.
Examples
Set up a 1-dimensional array:
>>> import zarr
>>> import numpy as np
>>> z = zarr.zeros(100000000, chunks=1000000, dtype='i4')
>>> z
Array((100000000,), int32, chunks=(1000000,), order=C)
nbytes: 381.5M; nbytes_stored: 301; ratio: 1328903.7; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
Set all array elements to the same scalar value:
>>> z[:] = 42
>>> z[:]
array([42, 42, 42, ..., 42, 42, 42], dtype=int32)
Set a portion of the array:
>>> z[:100] = np.arange(100)
>>> z[-100:] = np.arange(100)[::-1]
>>> z[:]
array([0, 1, 2, ..., 2, 1, 0], dtype=int32)
Set up a 2-dimensional array:
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z
Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
nbytes: 381.5M; nbytes_stored: 323; ratio: 1238390.1; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
Set all array elements to the same scalar value:
>>> z[:] = 42
>>> z[:]
array([[42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       ...,
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42]], dtype=int32)
Set a portion of the array:
>>> z[0, :] = np.arange(z.shape[1])
>>> z[:, 0] = np.arange(z.shape[0])
>>> z[:]
array([[   0,    1,    2, ..., 9997, 9998, 9999],
       [   1,   42,   42, ...,   42,   42,   42],
       [   2,   42,   42, ...,   42,   42,   42],
       ...,
       [9997,   42,   42, ...,   42,   42,   42],
       [9998,   42,   42, ...,   42,   42,   42],
       [9999,   42,   42, ...,   42,   42,   42]], dtype=int32)
resize(*args)¶
Change the shape of the array by growing or shrinking one or more dimensions.
Notes
When resizing an array, the data are not rearranged in any way.
If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store.
Examples
>>> import zarr
>>> z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
>>> z
Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 762.9M; nbytes_stored: 323; ratio: 2476780.2; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z.resize(20000, 10000)
>>> z
Array((20000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 1.5G; nbytes_stored: 323; ratio: 4953560.4; initialized: 0/200
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z.resize(30000, 1000)
>>> z
Array((30000, 1000), float64, chunks=(1000, 1000), order=C)
nbytes: 228.9M; nbytes_stored: 322; ratio: 745341.6; initialized: 0/30
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
append(data, axis=0)¶
Append data to axis.
Parameters: data : array_like
Data to be appended.
axis : int
Axis along which to append.
Returns: new_shape : tuple
Notes
The size of all dimensions other than axis must match between this array and data.
Examples
>>> import numpy as np
>>> import zarr
>>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z = zarr.array(a, chunks=(1000, 100))
>>> z
Array((10000, 1000), int32, chunks=(1000, 100), order=C)
nbytes: 38.1M; nbytes_stored: 1.9M; ratio: 20.3; initialized: 100/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z.append(a)
(20000, 1000)
>>> z
Array((20000, 1000), int32, chunks=(1000, 100), order=C)
nbytes: 76.3M; nbytes_stored: 3.8M; ratio: 20.3; initialized: 200/200
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
>>> z.append(np.vstack([a, a]), axis=1)
(20000, 2000)
>>> z
Array((20000, 2000), int32, chunks=(1000, 100), order=C)
nbytes: 152.6M; nbytes_stored: 7.5M; ratio: 20.3; initialized: 400/400
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: dict
view(shape=None, chunks=None, dtype=None, fill_value=None, filters=None, read_only=None, synchronizer=None)¶
Return an array sharing the same data.
Parameters: shape : int or tuple of ints
Array shape.
chunks : int or tuple of ints, optional
Chunk shape.
dtype : string or dtype, optional
NumPy dtype.
fill_value : object
Default value to use for uninitialized portions of the array.
filters : sequence, optional
Sequence of filters to use to encode chunk data prior to compression.
read_only : bool, optional
True if array should be protected against modification.
synchronizer : object, optional
Array synchronizer.
Notes
WARNING: This is an experimental feature and should be used with care. There are plenty of ways to generate errors and/or cause data corruption.
Examples
Bypass filters:
>>> import zarr
>>> import numpy as np
>>> np.random.seed(42)
>>> labels = [b'female', b'male']
>>> data = np.random.choice(labels, size=10000)
>>> filters = [zarr.Categorize(labels=labels,
...                            dtype=data.dtype,
...                            astype='u1')]
>>> a = zarr.array(data, chunks=1000, filters=filters)
>>> a[:]
array([b'female', b'male', b'female', ..., b'male', b'male', b'female'], dtype='|S6')
>>> v = a.view(dtype='u1', filters=[])
>>> v.is_view
True
>>> v[:]
array([1, 2, 1, ..., 2, 2, 1], dtype=uint8)
Views can be used to modify data:
>>> x = v[:]
>>> x.sort()
>>> v[:] = x
>>> v[:]
array([1, 1, 1, ..., 2, 2, 2], dtype=uint8)
>>> a[:]
array([b'female', b'female', b'female', ..., b'male', b'male', b'male'], dtype='|S6')
View as a different dtype with the same itemsize:
>>> data = np.random.randint(0, 2, size=10000, dtype='u1')
>>> a = zarr.array(data, chunks=1000)
>>> a[:]
array([0, 0, 1, ..., 1, 0, 0], dtype=uint8)
>>> v = a.view(dtype=bool)
>>> v[:]
array([False, False, True, ..., True, False, False], dtype=bool)
>>> np.all(a[:].view(dtype=bool) == v[:])
True
An array can be viewed with a dtype with a different itemsize, however some care is needed to adjust the shape and chunk shape so that chunk data is interpreted correctly:
>>> data = np.arange(10000, dtype='u2')
>>> a = zarr.array(data, chunks=1000)
>>> a[:10]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint16)
>>> v = a.view(dtype='u1', shape=20000, chunks=2000)
>>> v[:10]
array([0, 0, 1, 0, 2, 0, 3, 0, 4, 0], dtype=uint8)
>>> np.all(a[:].view('u1') == v[:])
True
Change fill value for uninitialized chunks:
>>> a = zarr.full(10000, chunks=1000, fill_value=-1, dtype='i1')
>>> a[:]
array([-1, -1, -1, ..., -1, -1, -1], dtype=int8)
>>> v = a.view(fill_value=42)
>>> v[:]
array([42, 42, 42, ..., 42, 42, 42], dtype=int8)
Note that resizing or appending to views is not permitted:
>>> a = zarr.empty(10000)
>>> v = a.view()
>>> try:
...     v.resize(20000)
... except PermissionError as e:
...     print(e)
not permitted for views
Groups (zarr.hierarchy)¶
zarr.hierarchy.group(store=None, overwrite=False, chunk_store=None, synchronizer=None, path=None)¶
Create a group.
Parameters: store : MutableMapping or string
Store or path to directory in file system.
overwrite : bool, optional
If True, delete any pre-existing data in store at path before creating the group.
chunk_store : MutableMapping, optional
Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.
synchronizer : object, optional
Array synchronizer.
path : string, optional
Group path.
Returns: g : zarr.hierarchy.Group
Examples
Create a group in memory:
>>> import zarr
>>> g = zarr.group()
>>> g
Group(/, 0)
store: DictStore
Create a group with a different store:
>>> store = zarr.DirectoryStore('example')
>>> g = zarr.group(store=store, overwrite=True)
>>> g
Group(/, 0)
store: DirectoryStore
zarr.hierarchy.open_group(store=None, mode='a', synchronizer=None, path=None)¶
Open a group using mode-like semantics.
Parameters: store : MutableMapping or string
Store or path to directory in file system.
mode : {‘r’, ‘r+’, ‘a’, ‘w’, ‘w-‘}
Persistence mode: ‘r’ means read only (must exist); ‘r+’ means read/write (must exist); ‘a’ means read/write (create if doesn’t exist); ‘w’ means create (overwrite if exists); ‘w-‘ means create (fail if exists).
synchronizer : object, optional
Array synchronizer.
path : string, optional
Group path.
Returns: g : zarr.hierarchy.Group
Examples
>>> import zarr
>>> root = zarr.open_group('example', mode='w')
>>> foo = root.create_group('foo')
>>> bar = root.create_group('bar')
>>> root
Group(/, 2)
groups: 2; bar, foo
store: DirectoryStore
>>> root2 = zarr.open_group('example', mode='a')
>>> root2
Group(/, 2)
groups: 2; bar, foo
store: DirectoryStore
>>> root == root2
True
class zarr.hierarchy.Group(store, path=None, read_only=False, chunk_store=None, synchronizer=None)¶
Instantiate a group from an initialized store.
Parameters: store : MutableMapping
Group store, already initialized.
path : string, optional
Group path.
read_only : bool, optional
True if group should be protected against modification.
chunk_store : MutableMapping, optional
Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.
synchronizer : object, optional
Array synchronizer.
Attributes
- store: A MutableMapping providing the underlying storage for the group.
- path: Storage path.
- name: Group name following h5py convention.
- read_only: A boolean, True if modification operations are not permitted.
- chunk_store: A MutableMapping providing the underlying storage for array chunks.
- synchronizer: Object used to synchronize write access to groups and arrays.
- attrs: A MutableMapping containing user-defined attributes.
Methods
- __len__(): Number of members.
- __iter__(): Return an iterator over group member names.
- __contains__(item): Test for group membership.
- __getitem__(item): Obtain a group member.
- group_keys(): Return an iterator over member names for groups only.
- groups(): Return an iterator over (name, value) pairs for groups only.
- array_keys(): Return an iterator over member names for arrays only.
- arrays(): Return an iterator over (name, value) pairs for arrays only.
- create_group(name[, overwrite]): Create a sub-group.
- require_group(name[, overwrite]): Obtain a sub-group, creating one if it doesn’t exist.
- create_groups(*names, **kwargs): Convenience method to create multiple groups in a single call.
- require_groups(*names): Convenience method to require multiple groups in a single call.
- create_dataset(name, **kwargs): Create an array.
- require_dataset(name, shape[, dtype, exact]): Obtain an array, creating if it doesn’t exist.
- create(name, **kwargs): Create an array.
- empty(name, **kwargs): Create an array.
- zeros(name, **kwargs): Create an array.
- ones(name, **kwargs): Create an array.
- full(name, fill_value, **kwargs): Create an array.
- array(name, data, **kwargs): Create an array.
- empty_like(name, data, **kwargs): Create an array.
- zeros_like(name, data, **kwargs): Create an array.
- ones_like(name, data, **kwargs): Create an array.
- full_like(name, data, **kwargs): Create an array.
__len__()¶
Number of members.
__iter__()¶
Return an iterator over group member names.
Examples
>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> d1 = g1.create_dataset('baz', shape=100, chunks=10)
>>> d2 = g1.create_dataset('quux', shape=200, chunks=20)
>>> for name in g1:
...     print(name)
bar
baz
foo
quux
__contains__(item)¶
Test for group membership.
Examples
>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> d1 = g1.create_dataset('bar', shape=100, chunks=10)
>>> 'foo' in g1
True
>>> 'bar' in g1
True
>>> 'baz' in g1
False
__getitem__(item)¶
Obtain a group member.
Parameters: item : string
Member name or path.
Examples
>>> import zarr
>>> g1 = zarr.group()
>>> d1 = g1.create_dataset('foo/bar/baz', shape=100, chunks=10)
>>> g1['foo']
Group(/foo, 1)
groups: 1; bar
store: DictStore
>>> g1['foo/bar']
Group(/foo/bar, 1)
arrays: 1; baz
store: DictStore
>>> g1['foo/bar/baz']
Array(/foo/bar/baz, (100,), float64, chunks=(10,), order=C)
nbytes: 800; nbytes_stored: 290; ratio: 2.8; initialized: 0/10
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DictStore
group_keys()¶
Return an iterator over member names for groups only.
Examples
>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> d1 = g1.create_dataset('baz', shape=100, chunks=10)
>>> d2 = g1.create_dataset('quux', shape=200, chunks=20)
>>> sorted(g1.group_keys())
['bar', 'foo']
groups()¶
Return an iterator over (name, value) pairs for groups only.
Examples
>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> d1 = g1.create_dataset('baz', shape=100, chunks=10)
>>> d2 = g1.create_dataset('quux', shape=200, chunks=20)
>>> for n, v in g1.groups():
...     print(n, type(v))
bar <class 'zarr.hierarchy.Group'>
foo <class 'zarr.hierarchy.Group'>
array_keys()¶
Return an iterator over member names for arrays only.
Examples
>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> d1 = g1.create_dataset('baz', shape=100, chunks=10)
>>> d2 = g1.create_dataset('quux', shape=200, chunks=20)
>>> sorted(g1.array_keys())
['baz', 'quux']
arrays()¶
Return an iterator over (name, value) pairs for arrays only.
Examples
>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> d1 = g1.create_dataset('baz', shape=100, chunks=10)
>>> d2 = g1.create_dataset('quux', shape=200, chunks=20)
>>> for n, v in g1.arrays():
...     print(n, type(v))
baz <class 'zarr.core.Array'>
quux <class 'zarr.core.Array'>
create_group(name, overwrite=False)¶
Create a sub-group.
Parameters: name : string
Group name.
overwrite : bool, optional
If True, overwrite any existing array with the given name.
Returns: g : zarr.hierarchy.Group
Examples
>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> g4 = g1.create_group('baz/quux')
require_group(name, overwrite=False)¶
Obtain a sub-group, creating one if it doesn’t exist.
Parameters: name : string
Group name.
overwrite : bool, optional
Overwrite any existing array with given name if present.
Returns: g : zarr.hierarchy.Group
Examples
>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.require_group('foo')
>>> g3 = g1.require_group('foo')
>>> g2 == g3
True
create_groups(*names, **kwargs)¶
Convenience method to create multiple groups in a single call.
require_groups(*names)¶
Convenience method to require multiple groups in a single call.
create_dataset(name, **kwargs)¶
Create an array.
Parameters: name : string
Array name.
data : array_like, optional
Initial data.
shape : int or tuple of ints
Array shape.
chunks : int or tuple of ints, optional
Chunk shape. If not provided, will be guessed from shape and dtype.
dtype : string or dtype, optional
NumPy dtype.
compressor : Codec, optional
Primary compressor.
fill_value : object
Default value to use for uninitialized portions of the array.
order : {‘C’, ‘F’}, optional
Memory layout to be used within each chunk.
synchronizer : zarr.sync.ArraySynchronizer, optional
Array synchronizer.
filters : sequence of Codecs, optional
Sequence of filters to use to encode chunk data prior to compression.
overwrite : bool, optional
If True, replace any existing array or group with the given name.
cache_metadata : bool, optional
If True, array configuration metadata will be cached for the lifetime of the object. If False, array metadata will be reloaded prior to all data access and modification operations (may incur overhead depending on storage and data access pattern).
Returns: a : zarr.core.Array
Examples
>>> import zarr
>>> g1 = zarr.group()
>>> d1 = g1.create_dataset('foo', shape=(10000, 10000),
...                        chunks=(1000, 1000))
>>> d1
Array(/foo, (10000, 10000), float64, chunks=(1000, 1000), order=C)
nbytes: 762.9M; nbytes_stored: 323; ratio: 2476780.2; initialized: 0/100
compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
store: DictStore
require_dataset(name, shape, dtype=None, exact=False, **kwargs)¶
Obtain an array, creating if it doesn’t exist. Other kwargs are as per zarr.hierarchy.Group.create_dataset().
Parameters: name : string
Array name.
shape : int or tuple of ints
Array shape.
dtype : string or dtype, optional
NumPy dtype.
exact : bool, optional
If True, require dtype to match exactly. If False, require dtype can be cast from array dtype.
create(name, **kwargs)¶
Create an array. Keyword arguments as per zarr.creation.create().
empty(name, **kwargs)¶
Create an array. Keyword arguments as per zarr.creation.empty().
zeros(name, **kwargs)¶
Create an array. Keyword arguments as per zarr.creation.zeros().
ones(name, **kwargs)¶
Create an array. Keyword arguments as per zarr.creation.ones().
full(name, fill_value, **kwargs)¶
Create an array. Keyword arguments as per zarr.creation.full().
array(name, data, **kwargs)¶
Create an array. Keyword arguments as per zarr.creation.array().
empty_like(name, data, **kwargs)¶
Create an array. Keyword arguments as per zarr.creation.empty_like().
zeros_like(name, data, **kwargs)¶
Create an array. Keyword arguments as per zarr.creation.zeros_like().
ones_like(name, data, **kwargs)¶
Create an array. Keyword arguments as per zarr.creation.ones_like().
full_like(name, data, **kwargs)¶
Create an array. Keyword arguments as per zarr.creation.full_like().
Storage (zarr.storage)¶
This module contains storage classes for use with Zarr arrays and groups.
However, note that any object implementing the MutableMapping interface can be used as a Zarr array store.
zarr.storage.init_array(store, shape, chunks=None, dtype=None, compressor='default', fill_value=None, order='C', overwrite=False, path=None, chunk_store=None, filters=None)¶
Initialize an array store with the given configuration.
Parameters: store : MutableMapping
A mapping that supports string keys and bytes-like values.
shape : int or tuple of ints
Array shape.
chunks : int or tuple of ints, optional
Chunk shape. If not provided, will be guessed from shape and dtype.
dtype : string or dtype, optional
NumPy dtype.
compressor : Codec, optional
Primary compressor.
fill_value : object
Default value to use for uninitialized portions of the array.
order : {‘C’, ‘F’}, optional
Memory layout to be used within each chunk.
overwrite : bool, optional
If True, erase all data in store prior to initialisation.
path : string, optional
Path under which array is stored.
chunk_store : MutableMapping, optional
Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.
filters : sequence, optional
Sequence of filters to use to encode chunk data prior to compression.
Notes
The initialisation process involves normalising all array metadata, encoding as JSON and storing under the ‘.zarray’ key. User attributes are also initialized and stored as JSON under the ‘.zattrs’ key.
Examples
Initialize an array store:
>>> from zarr.storage import init_array
>>> store = dict()
>>> init_array(store, shape=(10000, 10000), chunks=(1000, 1000))
>>> sorted(store.keys())
['.zarray', '.zattrs']
Array metadata is stored as JSON:
>>> print(str(store['.zarray'], 'ascii'))
{
    "chunks": [
        1000,
        1000
    ],
    "compressor": {
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<f8",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        10000,
        10000
    ],
    "zarr_format": 2
}
User-defined attributes are also stored as JSON, initially empty:
>>> print(str(store['.zattrs'], 'ascii'))
{}
Initialize an array using a storage path:
>>> store = dict()
>>> init_array(store, shape=100000000, chunks=1000000, dtype='i1',
...            path='foo')
>>> sorted(store.keys())
['.zattrs', '.zgroup', 'foo/.zarray', 'foo/.zattrs']
>>> print(str(store['foo/.zarray'], 'ascii'))
{
    "chunks": [
        1000000
    ],
    "compressor": {
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "|i1",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        100000000
    ],
    "zarr_format": 2
}
zarr.storage.init_group(store, overwrite=False, path=None, chunk_store=None)¶
Initialize a group store.
Parameters: store : MutableMapping
A mapping that supports string keys and byte sequence values.
overwrite : bool, optional
If True, erase all data in store prior to initialisation.
path : string, optional
Path under which array is stored.
chunk_store : MutableMapping, optional
Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.
class zarr.storage.DictStore(cls=<type 'dict'>)¶
Extended mutable mapping interface to a hierarchy of dicts.
Examples
>>> import zarr
>>> store = zarr.DictStore()
>>> store['foo'] = b'bar'
>>> store['foo']
b'bar'
>>> store['a/b/c'] = b'xxx'
>>> store['a/b/c']
b'xxx'
>>> sorted(store.keys())
['a/b/c', 'foo']
>>> store.listdir()
['a', 'foo']
>>> store.listdir('a/b')
['c']
>>> store.rmdir('a')
>>> sorted(store.keys())
['foo']
class zarr.storage.DirectoryStore(path)¶
Mutable Mapping interface to a directory. Keys must be strings, values must be bytes-like objects.
Parameters: path : string
Location of directory.
Examples
>>> import zarr
>>> store = zarr.DirectoryStore('example_store')
>>> store['foo'] = b'bar'
>>> store['foo']
b'bar'
>>> open('example_store/foo', 'rb').read()
b'bar'
>>> store['a/b/c'] = b'xxx'
>>> store['a/b/c']
b'xxx'
>>> open('example_store/a/b/c', 'rb').read()
b'xxx'
>>> sorted(store.keys())
['a/b/c', 'foo']
>>> store.listdir()
['a', 'foo']
>>> store.listdir('a/b')
['c']
>>> store.rmdir('a')
>>> sorted(store.keys())
['foo']
>>> import os
>>> os.path.exists('example_store/a')
False
class zarr.storage.TempStore(suffix='', prefix='zarr', dir=None)¶
Directory store using a temporary directory for storage.
class zarr.storage.ZipStore(path, compression=0, allowZip64=True, mode='a')¶
Mutable Mapping interface to a Zip file. Keys must be strings, values must be bytes-like objects.
Parameters: path : string
Location of file.
compression : integer, optional
Compression method to use when writing to the archive.
allowZip64 : bool, optional
If True (the default) will create ZIP files that use the ZIP64 extensions when the zipfile is larger than 2 GiB. If False will raise an exception when the ZIP file would require ZIP64 extensions.
mode : string, optional
One of ‘r’ to read an existing file, ‘w’ to truncate and write a new file, ‘a’ to append to an existing file, or ‘x’ to exclusively create and write a new file.
Notes
When modifying a ZipStore the close() method must be called otherwise essential data will not be written to the underlying zip file. The ZipStore class also supports the context manager protocol, which ensures the close() method is called on leaving the with statement.
Examples
>>> import zarr
>>> store = zarr.ZipStore('example.zip', mode='w')
>>> store['foo'] = b'bar'
>>> store['foo']
b'bar'
>>> store['a/b/c'] = b'xxx'
>>> store['a/b/c']
b'xxx'
>>> sorted(store.keys())
['a/b/c', 'foo']
>>> store.close()
>>> import zipfile
>>> zf = zipfile.ZipFile('example.zip', mode='r')
>>> sorted(zf.namelist())
['a/b/c', 'foo']
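Per the notes above, ZipStore also supports the context manager protocol, which guarantees close() is called; a usage sketch (the file name is illustrative):
>>> with zarr.ZipStore('example_ctx.zip', mode='w') as store:
...     store['foo'] = b'bar'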
close()¶
Closes the underlying zip file, ensuring all records are written.
flush()¶
Closes the underlying zip file, ensuring all records are written, then re-opens the file for further modifications.
zarr.storage.migrate_1to2(store)¶
Migrate array metadata in store from Zarr format version 1 to version 2.
Parameters: store : MutableMapping
Store to be migrated.
Notes
Version 1 did not support hierarchies, so this migration function will look for a single array in store and migrate the array metadata to version 2.
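A hedged usage sketch, assuming 'example.zarr' is a directory containing an array written by Zarr version 1.x:
>>> import zarr
>>> from zarr.storage import migrate_1to2
>>> store = zarr.DirectoryStore('example.zarr')
>>> migrate_1to2(store)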
Compressors and filters (zarr.codecs)¶
This module contains compressor and filter classes for use with Zarr. Other codecs can be registered dynamically with Zarr. All that is required is to implement a class that provides the same interface as the classes listed below, and then to add the class to the codec_registry. See the source code of this module for details.
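For illustration, a minimal sketch of a custom codec, assuming codec_registry is a mapping from codec IDs to codec classes and that codec classes carry a codec_id attribute (check the module source for the exact registration mechanism):

from zarr.codecs import Codec, codec_registry

class Identity(Codec):
    # Hypothetical no-op codec, for illustration only.
    codec_id = 'identity'

    def encode(self, buf):
        # pass data through unchanged
        return buf

    def decode(self, buf, out=None):
        # ignore the optional output buffer in this sketch
        return buf

    def get_config(self):
        # configuration must be JSON-encodable
        return {'id': self.codec_id}

    @classmethod
    def from_config(cls, config):
        return cls()

codec_registry[Identity.codec_id] = Identity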
class zarr.codecs.Codec¶
Codec abstract base class.
encode(buf)¶
Encode data in buf.
Parameters: buf : buffer-like
Data to be encoded. May be any object supporting the new-style buffer protocol or array.array.
Returns: enc : buffer-like
Encoded data. May be any object supporting the new-style buffer protocol or array.array.
decode(buf, out=None)¶
Decode data in buf.
Parameters: buf : buffer-like
Encoded data. May be any object supporting the new-style buffer protocol or array.array.
out : buffer-like, optional
Buffer to store decoded data.
Returns: out : buffer-like
Decoded data. May be any object supporting the new-style buffer protocol or array.array.
get_config()¶
Return a dictionary holding configuration parameters for this codec. All values must be compatible with JSON encoding.
classmethod from_config(config)¶
Instantiate from a configuration object.
class zarr.codecs.Blosc(cname='lz4', clevel=5, shuffle=1)¶
Provides compression using the blosc meta-compressor.
Parameters: cname : string, optional
A string naming one of the compression algorithms available within blosc, e.g., ‘blosclz’, ‘lz4’, ‘zlib’ or ‘snappy’.
clevel : integer, optional
An integer between 0 and 9 specifying the compression level.
shuffle : integer, optional
Either 0 (no shuffle), 1 (byte shuffle) or 2 (bit shuffle).
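For example, a Blosc codec can be passed as the compressor when creating an array (a usage sketch; as with the other codecs in these docs, the class is assumed to be re-exported at the top level as zarr.Blosc):
>>> import zarr
>>> compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
...                compressor=compressor)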
class zarr.codecs.Zlib(level=1)¶
Provides compression using zlib via the Python standard library.
Parameters: level : int
Compression level.
class zarr.codecs.BZ2(level=1)¶
Provides compression using bzip2 via the Python standard library.
Parameters: level : int
Compression level.
class zarr.codecs.LZMA(format=1, check=-1, preset=None, filters=None)¶
Provides compression using lzma via the Python standard library (only available under Python 3).
Parameters: format : integer, optional
One of the lzma format codes, e.g., lzma.FORMAT_XZ.
check : integer, optional
One of the lzma check codes, e.g., lzma.CHECK_NONE.
preset : integer, optional
An integer between 0 and 9 inclusive, specifying the compression level.
filters : list, optional
A list of dictionaries specifying compression filters. If filters are provided, ‘preset’ must be None.
class zarr.codecs.Delta(dtype, astype=None)¶
Filter to encode data as the difference between adjacent values.
Parameters: dtype : dtype
Data type to use for decoded data.
astype : dtype, optional
Data type to use for encoded data.
Notes
If astype is an integer data type, please ensure that it is sufficiently large to store encoded values. No checks are made and data may become corrupted due to integer overflow if astype is too small. Note also that the encoded data for each chunk includes the absolute value of the first element in the chunk, and so the encoded data type in general needs to be large enough to store absolute values from the array.
Examples
>>> import zarr
>>> import numpy as np
>>> x = np.arange(100, 120, 2, dtype='i8')
>>> f = zarr.Delta(dtype='i8', astype='i1')
>>> y = f.encode(x)
>>> y
array([100,   2,   2,   2,   2,   2,   2,   2,   2,   2], dtype=int8)
>>> z = f.decode(y)
>>> z
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118])
class zarr.codecs.FixedScaleOffset(offset, scale, dtype, astype=None)¶
Simplified version of the scale-offset filter available in HDF5. Applies the transformation (x - offset) * scale to all chunks. Results are rounded to the nearest integer but are not packed according to the minimum number of bits.
Parameters: offset : float
Value to subtract from data.
scale : int
Value by which to multiply the data.
dtype : dtype
Data type to use for decoded data.
astype : dtype, optional
Data type to use for encoded data.
Notes
If astype is an integer data type, please ensure that it is sufficiently large to store encoded values. No checks are made and data may become corrupted due to integer overflow if astype is too small.
Examples
>>> import zarr
>>> import numpy as np
>>> x = np.linspace(1000, 1001, 10, dtype='f8')
>>> x
array([ 1000.        ,  1000.11111111,  1000.22222222,  1000.33333333,
        1000.44444444,  1000.55555556,  1000.66666667,  1000.77777778,
        1000.88888889,  1001.        ])
>>> f1 = zarr.FixedScaleOffset(offset=1000, scale=10, dtype='f8', astype='u1')
>>> y1 = f1.encode(x)
>>> y1
array([ 0,  1,  2,  3,  4,  6,  7,  8,  9, 10], dtype=uint8)
>>> z1 = f1.decode(y1)
>>> z1
array([ 1000. ,  1000.1,  1000.2,  1000.3,  1000.4,  1000.6,  1000.7,
        1000.8,  1000.9,  1001. ])
>>> f2 = zarr.FixedScaleOffset(offset=1000, scale=10**2, dtype='f8', astype='u1')
>>> y2 = f2.encode(x)
>>> y2
array([  0,  11,  22,  33,  44,  56,  67,  78,  89, 100], dtype=uint8)
>>> z2 = f2.decode(y2)
>>> z2
array([ 1000.  ,  1000.11,  1000.22,  1000.33,  1000.44,  1000.56,
        1000.67,  1000.78,  1000.89,  1001.  ])
>>> f3 = zarr.FixedScaleOffset(offset=1000, scale=10**3, dtype='f8', astype='u2')
>>> y3 = f3.encode(x)
>>> y3
array([   0,  111,  222,  333,  444,  556,  667,  778,  889, 1000], dtype=uint16)
>>> z3 = f3.decode(y3)
>>> z3
array([ 1000.   ,  1000.111,  1000.222,  1000.333,  1000.444,  1000.556,
        1000.667,  1000.778,  1000.889,  1001.   ])
class zarr.codecs.Quantize(digits, dtype, astype=None)¶
Lossy filter to reduce the precision of floating point data.
Parameters: digits : int
Desired precision (number of decimal digits).
dtype : dtype
Data type to use for decoded data.
astype : dtype, optional
Data type to use for encoded data.
Examples
>>> import zarr
>>> import numpy as np
>>> x = np.linspace(0, 1, 10, dtype='f8')
>>> x
array([ 0.        ,  0.11111111,  0.22222222,  0.33333333,  0.44444444,
        0.55555556,  0.66666667,  0.77777778,  0.88888889,  1.        ])
>>> f1 = zarr.Quantize(digits=1, dtype='f8')
>>> y1 = f1.encode(x)
>>> y1
array([ 0.    ,  0.125 ,  0.25  ,  0.3125,  0.4375,  0.5625,  0.6875,
        0.75  ,  0.875 ,  1.    ])
>>> f2 = zarr.Quantize(digits=2, dtype='f8')
>>> y2 = f2.encode(x)
>>> y2
array([ 0.       ,  0.109375 ,  0.21875  ,  0.3359375,  0.4453125,
        0.5546875,  0.6640625,  0.78125  ,  0.890625 ,  1.       ])
>>> f3 = zarr.Quantize(digits=3, dtype='f8')
>>> y3 = f3.encode(x)
>>> y3
array([ 0.        ,  0.11132812,  0.22265625,  0.33300781,  0.44433594,
        0.55566406,  0.66699219,  0.77734375,  0.88867188,  1.        ])
class zarr.codecs.PackBits¶
Filter to pack elements of a boolean array into bits in a uint8 array.
Notes
The first element of the encoded array stores the number of bits that were padded to complete the final byte.
Examples
>>> import zarr
>>> import numpy as np
>>> f = zarr.PackBits()
>>> x = np.array([True, False, False, True], dtype=bool)
>>> y = f.encode(x)
>>> y
array([  4, 144], dtype=uint8)
>>> z = f.decode(y)
>>> z
array([ True, False, False,  True], dtype=bool)
class zarr.codecs.Categorize(labels, dtype, astype='u1')¶
Filter encoding categorical string data as integers.
Parameters: labels : sequence of strings
Category labels.
dtype : dtype
Data type to use for decoded data.
astype : dtype, optional
Data type to use for encoded data.
Examples
>>> import zarr
>>> import numpy as np
>>> x = np.array([b'male', b'female', b'female', b'male', b'unexpected'])
>>> x
array([b'male', b'female', b'female', b'male', b'unexpected'],
      dtype='|S10')
>>> f = zarr.Categorize(labels=[b'female', b'male'], dtype=x.dtype)
>>> y = f.encode(x)
>>> y
array([2, 1, 1, 2, 0], dtype=uint8)
>>> z = f.decode(y)
>>> z
array([b'male', b'female', b'female', b'male', b''],
      dtype='|S10')
Specifications¶
Zarr storage specification version 1¶
This document provides a technical specification of the protocol and format used for storing a Zarr array. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
Status¶
This specification is deprecated. See Specifications for the latest version.
Storage¶
A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string and a value is an arbitrary sequence of bytes, and the supported operations are read (get the sequence of bytes associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair).
For example, a directory in a file system can provide this interface, where keys are file names, values are file contents, and files can be read, written or deleted via the operating system. Equally, an S3 bucket can provide this interface, where keys are resource names, values are resource contents, and resources can be read, written or deleted via HTTP.
Below an “array store” refers to any system implementing this interface.
Metadata¶
Each array requires essential configuration metadata to be stored, enabling correct interpretation of the stored data. This metadata is encoded using JSON and stored as the value of the ‘meta’ key within an array store.
The metadata resource is a JSON object. The following keys MUST be present within the object:
- zarr_format
- An integer defining the version of the storage specification to which the array store adheres.
- shape
- A list of integers defining the length of each dimension of the array.
- chunks
- A list of integers defining the length of each dimension of a chunk of the array. Note that all chunks within a Zarr array have the same shape.
- dtype
- A string or list defining a valid data type for the array. See also the subsection below on data type encoding.
- compression
- A string identifying the primary compression library used to compress each chunk of the array.
- compression_opts
- An integer, string or dictionary providing options to the primary compression library.
- fill_value
- A scalar value providing the default value to use for uninitialized portions of the array.
- order
- Either ‘C’ or ‘F’, defining the layout of bytes within each chunk of the array. ‘C’ means row-major order, i.e., the last dimension varies fastest; ‘F’ means column-major order, i.e., the first dimension varies fastest.
Other keys MAY be present within the metadata object however they MUST NOT alter the interpretation of the required fields defined above.
For example, the JSON object below defines a 2-dimensional array of 64-bit little-endian floating point numbers with 10000 rows and 10000 columns, divided into chunks of 1000 rows and 1000 columns (so there will be 100 chunks in total arranged in a 10 by 10 grid). Within each chunk the data are laid out in C contiguous order, and each chunk is compressed using the Blosc compression library:
{
"chunks": [
1000,
1000
],
"compression": "blosc",
"compression_opts": {
"clevel": 5,
"cname": "lz4",
"shuffle": 1
},
"dtype": "<f8",
"fill_value": null,
"order": "C",
"shape": [
10000,
10000
],
"zarr_format": 1
}
Data type encoding¶
Simple data types are encoded within the array metadata resource as a string, following the NumPy array protocol type string (typestr) format. The format consists of 3 parts: a character describing the byteorder of the data (< : little-endian, > : big-endian, | : not-relevant), a character code giving the basic type of the array, and an integer providing the number of bytes the type uses. The byte order MUST be specified. E.g., "<f8", ">i4", "|b1" and "|S12" are valid data types.
Structured data types (i.e., with multiple named fields) are encoded as a list of two-element lists, following NumPy array protocol type descriptions (descr). For example, the JSON list [["r", "|u1"], ["g", "|u1"], ["b", "|u1"]] defines a data type composed of three single-byte unsigned integers labelled ‘r’, ‘g’ and ‘b’.
Chunks¶
Each chunk of the array is compressed by passing the raw bytes for the chunk through the primary compression library to obtain a new sequence of bytes comprising the compressed chunk data. No header is added to the compressed bytes or any other modification made. The internal structure of the compressed bytes will depend on which primary compressor was used. For example, the Blosc compressor produces a sequence of bytes that begins with a 16-byte header followed by compressed data.
The compressed sequence of bytes for each chunk is stored under a key formed from the index of the chunk within the grid of chunks representing the array. To form a string key for a chunk, the indices are converted to strings and concatenated with the period character (‘.’) separating each index. For example, given an array with shape (10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the key ‘0.0’; the chunk with indices (2, 4) provides data for rows 2000-3000 and columns 4000-5000 and is stored under the key ‘2.4’; etc.
There is no need for all chunks to be present within an array store. If a chunk is not present then it is considered to be in an uninitialized state. An uninitialized chunk MUST be treated as if it was uniformly filled with the value of the ‘fill_value’ field in the array metadata. If the ‘fill_value’ field is null then the contents of the chunk are undefined.
Note that all chunks in an array have the same shape. If the length of any array dimension is not exactly divisible by the length of the corresponding chunk dimension then some chunks will overhang the edge of the array. The contents of any chunk region falling outside the array are undefined.
Attributes¶
Each array can also be associated with custom attributes, which are simple key/value items with application-specific meaning. Custom attributes are encoded as a JSON object and stored under the ‘attrs’ key within an array store. Even if the attributes are empty, the ‘attrs’ key MUST be present within an array store.
For example, the JSON object below encodes three attributes named ‘foo’, ‘bar’ and ‘baz’:
{
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
Example¶
Below is an example of storing a Zarr array, using a directory on the local file system as storage.
Initialize the store:
>>> import zarr
>>> store = zarr.DirectoryStore('example.zarr')
>>> zarr.init_store(store, shape=(20, 20), chunks=(10, 10),
... dtype='i4', fill_value=42, compression='zlib',
... compression_opts=1, overwrite=True)
No chunks are initialized yet, so only the ‘meta’ and ‘attrs’ keys have been set:
>>> import os
>>> sorted(os.listdir('example.zarr'))
['attrs', 'meta']
Inspect the array metadata:
>>> print(open('example.zarr/meta').read())
{
"chunks": [
10,
10
],
"compression": "zlib",
"compression_opts": 1,
"dtype": "<i4",
"fill_value": 42,
"order": "C",
"shape": [
20,
20
],
"zarr_format": 1
}
Inspect the array attributes:
>>> print(open('example.zarr/attrs').read())
{}
Set some data:
>>> z = zarr.Array(store)
>>> z[0:10, 0:10] = 1
>>> sorted(os.listdir('example.zarr'))
['0.0', 'attrs', 'meta']
Set some more data:
>>> z[0:10, 10:20] = 2
>>> z[10:20, :] = 3
>>> sorted(os.listdir('example.zarr'))
['0.0', '0.1', '1.0', '1.1', 'attrs', 'meta']
Manually decompress a single chunk for illustration:
>>> import zlib
>>> b = zlib.decompress(open('example.zarr/0.0', 'rb').read())
>>> import numpy as np
>>> a = np.frombuffer(b, dtype='<i4')
>>> a
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
Modify the array attributes:
>>> z.attrs['foo'] = 42
>>> z.attrs['bar'] = 'apples'
>>> z.attrs['baz'] = [1, 2, 3, 4]
>>> print(open('example.zarr/attrs').read())
{
"bar": "apples",
"baz": [
1,
2,
3,
4
],
"foo": 42
}
Zarr storage specification version 2¶
This document provides a technical specification of the protocol and format used for storing Zarr arrays. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
Status¶
This specification is the latest version. See Specifications for previous versions.
Storage¶
A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string and a value is an arbitrary sequence of bytes, and the supported operations are read (get the sequence of bytes associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair).
For example, a directory in a file system can provide this interface, where keys are file names, values are file contents, and files can be read, written or deleted via the operating system. Equally, an S3 bucket can provide this interface, where keys are resource names, values are resource contents, and resources can be read, written or deleted via HTTP.
Below an “array store” refers to any system implementing this interface.
Arrays¶
Metadata¶
Each array requires essential configuration metadata to be stored, enabling correct interpretation of the stored data. This metadata is encoded using JSON and stored as the value of the “.zarray” key within an array store.
The metadata resource is a JSON object. The following keys MUST be present within the object:
- zarr_format
- An integer defining the version of the storage specification to which the array store adheres.
- shape
- A list of integers defining the length of each dimension of the array.
- chunks
- A list of integers defining the length of each dimension of a chunk of the array. Note that all chunks within a Zarr array have the same shape.
- dtype
- A string or list defining a valid data type for the array. See also the subsection below on data type encoding.
- compressor
- A JSON object identifying the primary compression codec and providing configuration parameters, or null if no compressor is to be used. The object MUST contain an “id” key identifying the codec to be used.
- fill_value
- A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.
- order
- Either “C” or “F”, defining the layout of bytes within each chunk of the array. “C” means row-major order, i.e., the last dimension varies fastest; “F” means column-major order, i.e., the first dimension varies fastest.
- filters
- A list of JSON objects providing codec configurations, or null if no filters are to be applied. Each codec configuration object MUST contain an “id” key identifying the codec to be used.
Other keys MUST NOT be present within the metadata object.
For example, the JSON object below defines a 2-dimensional array of 64-bit little-endian floating point numbers with 10000 rows and 10000 columns, divided into chunks of 1000 rows and 1000 columns (so there will be 100 chunks in total arranged in a 10 by 10 grid). Within each chunk the data are laid out in C contiguous order. Each chunk is encoded using a delta filter and compressed using the Blosc compression library prior to storage:
{
"chunks": [
1000,
1000
],
"compressor": {
"id": "blosc",
"cname": "lz4",
"clevel": 5,
"shuffle": 1
},
"dtype": "<f8",
"fill_value": "NaN",
"filters": [
{"id": "delta", "dtype": "<f8", "astype": "<f4"}
],
"order": "C",
"shape": [
10000,
10000
],
"zarr_format": 2
}
Data type encoding¶
Simple data types are encoded within the array metadata as a string, following the NumPy array protocol type string (typestr) format. The format consists of 3 parts:
- One character describing the byteorder of the data ("<": little-endian; ">": big-endian; "|": not-relevant)
- One character code giving the basic type of the array ("b": Boolean (integer type where all values are only True or False); "i": integer; "u": unsigned integer; "f": floating point; "c": complex floating point; "m": timedelta; "M": datetime; "S": string (fixed-length sequence of char); "U": unicode (fixed-length sequence of Py_UNICODE); "V": other (void * – each item is a fixed-size chunk of memory))
- An integer specifying the number of bytes the type uses.
The byte order MUST be specified. E.g., "<f8", ">i4", "|b1" and "|S12" are valid data type encodings.
Structured data types (i.e., with multiple named fields) are encoded as a list of two-element lists, following NumPy array protocol type descriptions (descr). For example, the JSON list [["r", "|u1"], ["g", "|u1"], ["b", "|u1"]] defines a data type composed of three single-byte unsigned integers labelled “r”, “g” and “b”.
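These encodings mirror NumPy's own type string and type description protocols, so they can be checked interactively, e.g. (output assumes a little-endian machine):
>>> import numpy as np
>>> np.dtype('f8').str
'<f8'
>>> np.dtype([('r', 'u1'), ('g', 'u1'), ('b', 'u1')]).descr
[('r', '|u1'), ('g', '|u1'), ('b', '|u1')]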
Fill value encoding¶
For simple floating point data types, the following table MUST be used to encode values of the “fill_value” field:
Value | JSON encoding
---|---
Not a Number | "NaN"
Positive Infinity | "Infinity"
Negative Infinity | "-Infinity"
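These string encodings are needed because JSON cannot represent NaN or the infinities directly. A minimal decoding sketch (a hypothetical helper, not part of the zarr API):

def decode_fill_value(v):
    # map the special string encodings back to IEEE 754 values
    special = {'NaN': float('nan'),
               'Infinity': float('inf'),
               '-Infinity': float('-inf')}
    if isinstance(v, str) and v in special:
        return special[v]
    return v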
Chunks¶
Each chunk of the array is compressed by passing the raw bytes for the chunk through the primary compression library to obtain a new sequence of bytes comprising the compressed chunk data. No header is added to the compressed bytes or any other modification made. The internal structure of the compressed bytes will depend on which primary compressor was used. For example, the Blosc compressor produces a sequence of bytes that begins with a 16-byte header followed by compressed data.
The compressed sequence of bytes for each chunk is stored under a key formed from the index of the chunk within the grid of chunks representing the array. To form a string key for a chunk, the indices are converted to strings and concatenated with the period character (“.”) separating each index. For example, given an array with shape (10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the key “0.0”; the chunk with indices (2, 4) provides data for rows 2000-3000 and columns 4000-5000 and is stored under the key “2.4”; etc.
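For example, the chunk key for indices (2, 4) can be formed with a simple join (an illustrative one-liner, not part of the zarr API):
>>> '.'.join(map(str, (2, 4)))
'2.4'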
There is no need for all chunks to be present within an array store. If a chunk is not present then it is considered to be in an uninitialized state. An uninitialized chunk MUST be treated as if it was uniformly filled with the value of the “fill_value” field in the array metadata. If the “fill_value” field is null then the contents of the chunk are undefined.
Note that all chunks in an array have the same shape. If the length of any array dimension is not exactly divisible by the length of the corresponding chunk dimension then some chunks will overhang the edge of the array. The contents of any chunk region falling outside the array are undefined.
Filters¶
Optionally a sequence of one or more filters can be used to transform chunk data prior to compression. When storing data, filters are applied in the order specified in array metadata to encode data, then the encoded data are passed to the primary compressor. When retrieving data, stored chunk data are decompressed by the primary compressor then decoded using filters in the reverse order.
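A minimal sketch of this round trip, using hypothetical helper functions and assuming filter and compressor objects expose the encode/decode interface described in the zarr.codecs API docs (this is not the actual zarr implementation):

def encode_chunk(buf, filters=None, compressor=None):
    # apply filters in the order listed in the array metadata
    for f in (filters or []):
        buf = f.encode(buf)
    # then pass the encoded data to the primary compressor
    if compressor is not None:
        buf = compressor.encode(buf)
    return buf

def decode_chunk(buf, filters=None, compressor=None):
    # decompress first
    if compressor is not None:
        buf = compressor.decode(buf)
    # then decode filters in the reverse order
    for f in reversed(filters or []):
        buf = f.decode(buf)
    return buf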
Hierarchies¶
Logical storage paths¶
Multiple arrays can be stored in the same array store by associating each array with a different logical path. A logical path is simply an ASCII string. The logical path is used to form a prefix for keys used by the array. For example, if an array is stored at logical path “foo/bar” then the array metadata will be stored under the key “foo/bar/.zarray”, the user-defined attributes will be stored under the key “foo/bar/.zattrs”, and the chunks will be stored under keys like “foo/bar/0.0”, “foo/bar/0.1”, etc.
To ensure consistent behaviour across different storage systems, logical paths MUST be normalized as follows:
- Replace all backward slash characters (“\”) with forward slash characters (“/”)
- Strip any leading “/” characters
- Strip any trailing “/” characters
- Collapse any sequence of more than one “/” character into a single “/” character
The key prefix is then obtained by appending a single “/” character to the normalized logical path.
After normalization, if splitting a logical path by the “/” character results in any path segment equal to the string “.” or the string “..” then an error MUST be raised.
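A minimal sketch of these normalization rules (a hypothetical helper; actual store implementations may differ in detail):

def normalize_logical_path(path):
    # replace backslashes with forward slashes
    path = path.replace('\\', '/')
    # splitting and discarding empty segments strips leading/trailing
    # '/' characters and collapses repeated '/' characters
    segments = [s for s in path.split('/') if s]
    if any(s in ('.', '..') for s in segments):
        raise ValueError('invalid path segment in %r' % path)
    return '/'.join(segments)

E.g., normalize_logical_path('foo//bar/') returns 'foo/bar', and the corresponding key prefix is 'foo/bar/'.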
N.B., how the underlying array store processes requests to store values under keys containing the “/” character is entirely up to the store implementation and is not constrained by this specification. E.g., an array store could simply treat all keys as opaque ASCII strings; equally, an array store could map logical paths onto some kind of hierarchical storage (e.g., directories on a file system).
Groups¶
Arrays can be organized into groups which can also contain other groups. A group is created by storing group metadata under the “.zgroup” key under some logical path. E.g., a group exists at the root of an array store if the “.zgroup” key exists in the store, and a group exists at logical path “foo/bar” if the “foo/bar/.zgroup” key exists in the store.
If the user requests a group to be created under some logical path, then groups MUST also be created at all ancestor paths. E.g., if the user requests group creation at path “foo/bar” then groups MUST be created at path “foo” and the root of the store, if they don’t already exist.
If the user requests an array to be created under some logical path, then groups MUST also be created at all ancestor paths. E.g., if the user requests array creation at path “foo/bar/baz” then groups must be created at path “foo/bar”, path “foo”, and the root of the store, if they don’t already exist.
The group metadata resource is a JSON object. The following keys MUST be present within the object:
- zarr_format
- An integer defining the version of the storage specification to which the array store adheres.
Other keys MUST NOT be present within the metadata object.
The members of a group are arrays and groups stored under logical paths that are direct children of the parent group’s logical path. E.g., if groups exist under the logical paths “foo” and “foo/bar” and an array exists at logical path “foo/baz” then the members of the group at path “foo” are the group at path “foo/bar” and the array at path “foo/baz”.
Attributes¶
An array or group can be associated with custom attributes, which are simple key/value items with application-specific meaning. Custom attributes are encoded as a JSON object and stored under the “.zattrs” key within an array store.
For example, the JSON object below encodes three attributes named “foo”, “bar” and “baz”:
{
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
Examples¶
Storing a single array¶
Below is an example of storing a Zarr array, using a directory on the local file system as storage.
Create an array:
>>> import zarr
>>> store = zarr.DirectoryStore('example')
>>> a = zarr.create(shape=(20, 20), chunks=(10, 10), dtype='i4',
... fill_value=42, compressor=zarr.Zlib(level=1),
... store=store, overwrite=True)
No chunks are initialized yet, so only the “.zarray” and “.zattrs” keys have been set in the store:
>>> import os
>>> sorted(os.listdir('example'))
['.zarray', '.zattrs']
Inspect the array metadata:
>>> print(open('example/.zarray').read())
{
"chunks": [
10,
10
],
"compressor": {
"id": "zlib",
"level": 1
},
"dtype": "<i4",
"fill_value": 42,
"filters": null,
"order": "C",
"shape": [
20,
20
],
"zarr_format": 2
}
Inspect the array attributes:
>>> print(open('example/.zattrs').read())
{}
Chunks are initialized on demand. E.g., set some data:
>>> a[0:10, 0:10] = 1
>>> sorted(os.listdir('example'))
['.zarray', '.zattrs', '0.0']
Set some more data:
>>> a[0:10, 10:20] = 2
>>> a[10:20, :] = 3
>>> sorted(os.listdir('example'))
['.zarray', '.zattrs', '0.0', '0.1', '1.0', '1.1']
Manually decompress a single chunk for illustration:
>>> import zlib
>>> buf = zlib.decompress(open('example/0.0', 'rb').read())
>>> import numpy as np
>>> chunk = np.frombuffer(buf, dtype='<i4')
>>> chunk
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
Modify the array attributes:
>>> a.attrs['foo'] = 42
>>> a.attrs['bar'] = 'apples'
>>> a.attrs['baz'] = [1, 2, 3, 4]
>>> print(open('example/.zattrs').read())
{
"bar": "apples",
"baz": [
1,
2,
3,
4
],
"foo": 42
}
Storing multiple arrays in a hierarchy¶
Below is an example of storing multiple Zarr arrays organized into a group hierarchy, using a directory on the local file system as storage. This storage implementation maps logical paths onto directory paths on the file system, however this is an implementation choice and is not required.
Setup the store:
>>> import zarr
>>> store = zarr.DirectoryStore('example_hierarchy')
Create the root group:
>>> root_grp = zarr.group(store, overwrite=True)
The metadata resource for the root group has been created, as well as a custom attributes resource:
>>> import os
>>> sorted(os.listdir('example_hierarchy'))
['.zattrs', '.zgroup']
Inspect the group metadata:
>>> print(open('example_hierarchy/.zgroup').read())
{
"zarr_format": 2
}
Inspect the group attributes:
>>> print(open('example_hierarchy/.zattrs').read())
{}
Create a sub-group:
>>> sub_grp = root_grp.create_group('foo')
What has been stored:
>>> sorted(os.listdir('example_hierarchy'))
['.zattrs', '.zgroup', 'foo']
>>> sorted(os.listdir('example_hierarchy/foo'))
['.zattrs', '.zgroup']
Create an array within the sub-group:
>>> a = sub_grp.create_dataset('bar', shape=(20, 20), chunks=(10, 10))
>>> a[:] = 42
What has been stored:
>>> sorted(os.listdir('example_hierarchy'))
['.zattrs', '.zgroup', 'foo']
>>> sorted(os.listdir('example_hierarchy/foo'))
['.zattrs', '.zgroup', 'bar']
>>> sorted(os.listdir('example_hierarchy/foo/bar'))
['.zarray', '.zattrs', '0.0', '0.1', '1.0', '1.1']
Here is the same example using a Zip file as storage:
>>> store = zarr.ZipStore('example_hierarchy.zip', mode='w')
>>> root_grp = zarr.group(store)
>>> sub_grp = root_grp.create_group('foo')
>>> a = sub_grp.create_dataset('bar', shape=(20, 20), chunks=(10, 10))
>>> a[:] = 42
>>> store.close()
What has been stored:
>>> import zipfile
>>> zf = zipfile.ZipFile('example_hierarchy.zip', mode='r')
>>> for name in sorted(zf.namelist()):
... print(name)
.zattrs
.zgroup
foo/.zattrs
foo/.zgroup
foo/bar/.zarray
foo/bar/.zattrs
foo/bar/0.0
foo/bar/0.1
foo/bar/1.0
foo/bar/1.1
Changes¶
Changes in version 2¶
- Added support for storing multiple arrays in the same store and organising arrays into hierarchies using groups.
- Array metadata is now stored under the “.zarray” key instead of the “meta” key.
- Custom attributes are now stored under the “.zattrs” key instead of the “attrs” key.
- Added support for filters.
- Changed encoding of “fill_value” field within array metadata.
- Changed encoding of compressor information within array metadata to be consistent with representation of filter information.
Release notes¶
2.1.4¶
Resolved an issue where calling hasattr on a Group object erroneously raised a KeyError (#88, #95, Vincent Schut).
2.1.3¶
Resolved an issue with zarr.creation.array() where dtype was given as None (#80).
2.1.1¶
Various minor improvements, including: Group objects support member access via dot notation (__getattr__); fixed metadata caching for the Array.shape property and derivatives; added the Array.ndim property; fixed Array.__array__ method arguments; fixed bug in pickling Array state; fixed bug in pickling ThreadSynchronizer.
2.1.0¶
- Group objects now support member deletion via the del statement (#65).
- Added the zarr.storage.TempStore class for convenience, to provide storage via a temporary directory (#59).
- Fixed performance issues with the zarr.storage.ZipStore class (#66).
- The Blosc extension has been modified to return bytes instead of array objects from compress and decompress function calls. This should improve compatibility and also provides a small performance increase for compressing high compression ratio data (#55).
- Added an overwrite keyword argument to array and group creation methods on the zarr.hierarchy.Group class (#71).
- Added a cache_metadata keyword argument to array creation methods.
- The functions zarr.creation.open_array() and zarr.hierarchy.open_group() now accept any store as first argument (#56).
2.0.1¶
The bundled Blosc library has been upgraded to version 1.11.1.
2.0.0¶
Hierarchies¶
Support has been added for organizing arrays into hierarchies via groups. See
the tutorial section on Groups and the zarr.hierarchy
API docs for more information.
Filters¶
Support has been added for configuring filters to preprocess chunk data prior
to compression. See the tutorial section on Filters and the
zarr.codecs
API docs for more information.
Other changes¶
To accommodate support for hierarchies and filters, the Zarr metadata format
has been modified. See the Zarr storage specification version 2 for more information. To migrate an
array stored using Zarr version 1.x, use the zarr.storage.migrate_1to2()
function.
The bundled Blosc library has been upgraded to version 1.11.0.
Acknowledgments¶
Thanks to Matthew Rocklin (mrocklin), Stephan Hoyer (shoyer) and Francesc Alted (FrancescAlted) for contributions and comments.
1.1.0¶
- The bundled Blosc library has been upgraded to version 1.10.0. The ‘zstd’ internal compression library is now available within Blosc. See the tutorial section on Compressors for an example.
- When using the Blosc compressor, the default internal compression library is now ‘lz4’.
- The default number of internal threads for the Blosc compressor has been increased to a maximum of 8 (previously 4).
- Added convenience functions zarr.blosc.list_compressors() and zarr.blosc.get_nthreads().
1.0.0¶
This release includes a complete re-organization of the code base. The major version number has been bumped to indicate that there have been backwards-incompatible changes to the API and the on-disk storage format. However, Zarr is still in an early stage of development, so please do not take the version number as an indicator of maturity.
Storage¶
The main motivation for re-organizing the code was to create an
abstraction layer between the core array logic and data storage (#21). In this release, any
object that implements the MutableMapping
interface can be used as
an array store. See the tutorial sections on Persistent arrays
and Storage alternatives, the Zarr storage specification version 1, and the
zarr.storage
module documentation for more information.
Please note also that the file organization and file name conventions
used when storing a Zarr array in a directory on the file system have
changed. Persistent Zarr arrays created using previous versions of the
software will not be compatible with this version. See the
zarr.storage
API docs and the Zarr storage specification version 1 for more
information.
Compression¶
An abstraction layer has also been created between the core array
logic and the code for compressing and decompressing array
chunks. This release still bundles the c-blosc library and uses Blosc
as the default compressor, however other compressors including zlib,
BZ2 and LZMA are also now supported via the Python standard
library. New compressors can also be dynamically registered for use
with Zarr. See the tutorial sections on Compressors and
Configuring Blosc, the Zarr storage specification version 1, and the
zarr.compressors
module documentation for more information.
Synchronization¶
The synchronization code has also been refactored to create a layer of
abstraction, enabling Zarr arrays to be used in parallel computations
with a number of alternative synchronization methods. For more
information see the tutorial section on Parallel computing and synchronization and the
zarr.sync
module documentation.
Changes to the Blosc extension¶
NumPy is no longer a build dependency for the zarr.blosc
Cython
extension, so setup.py will run even if NumPy is not already
installed, and should automatically install NumPy as a runtime
dependency. Manual installation of NumPy prior to installing Zarr is
still recommended, however, as the automatic installation of NumPy may
fail or be sub-optimal on some platforms.
Some optimizations have been made within the zarr.blosc
extension to avoid unnecessary memory copies, giving a ~10-20%
performance improvement for multi-threaded compression operations.
The zarr.blosc
extension now automatically detects whether it
is running within a single-threaded or multi-threaded program and
adapts its internal behaviour accordingly (#27). There is no need for
the user to make any API calls to switch Blosc between contextual and
non-contextual (global lock) mode. See also the tutorial section on
Configuring Blosc.
Other changes¶
The internal code for managing chunks has been rewritten to be more efficient. Now no state is maintained for chunks outside of the array store, meaning that chunks do not carry any extra memory overhead not accounted for by the store. This negates the need for the “lazy” option present in the previous release, and this has been removed.
The memory layout within chunks can now be set as either “C” (row-major) or “F” (column-major), which can help to provide better compression for some data (#7). See the tutorial section on Changing memory layout for more information.
A bug has been fixed within the __getitem__
and __setitem__
machinery for slicing arrays, to properly handle getting and setting
partial slices.
Acknowledgments¶
Thanks to Matthew Rocklin (mrocklin), Stephan Hoyer (shoyer), Francesc Alted (FrancescAlted), Anthony Scopatz (scopatz) and Martin Durant (martindurant) for contributions and comments.
0.4.0¶
0.3.0¶
Acknowledgments¶
Zarr bundles the c-blosc library and uses it as the default compressor.
Zarr is inspired by HDF5, h5py and bcolz.
Development of this package is supported by the MRC Centre for Genomics and Global Health.