Zarr

Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays.

Highlights

  • Create N-dimensional arrays with any NumPy dtype.
  • Chunk arrays along any dimension.
  • Compress chunks using the fast Blosc meta-compressor, or alternatively zlib, BZ2 or LZMA.
  • Store arrays in memory, on disk, inside a Zip file, on S3, ...
  • Read an array concurrently from multiple threads or processes.
  • Write to an array concurrently from multiple threads or processes.

Status

Zarr is still in an early, experimental phase of development. Feedback and bug reports are very welcome; please get in touch via the GitHub issue tracker.

Installation

Install Zarr from PyPI:

$ pip install zarr

Please note that Zarr includes a C extension providing integration with the Blosc library. Pre-compiled binaries are available for Linux and Windows platforms and will be installed automatically via pip if available. However, if you have a newer CPU that supports the AVX2 instruction set (e.g., Intel Haswell, Broadwell or Skylake) then compiling from source is preferable, as the Blosc library includes some optimisations for those architectures:

$ pip install --no-binary=:all: zarr

To work with Zarr source code in development, install from GitHub:

$ git clone --recursive https://github.com/alimanfoo/zarr.git
$ cd zarr
$ python setup.py install

Contents

Tutorial

Zarr provides classes and functions for working with N-dimensional arrays that behave like NumPy arrays but whose data is divided into chunks and compressed. If you are already familiar with HDF5 datasets then Zarr arrays provide similar functionality, but with some additional flexibility.

Creating an array

Zarr has a number of convenience functions for creating arrays. For example:

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 317; ratio: 1261829.7; initialized: 0/100
store: builtins.dict

The code above creates a 2-dimensional array of 32-bit integers with 10000 rows and 10000 columns, divided into chunks where each chunk has 1000 rows and 1000 columns (and so there will be 100 chunks in total).
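The chunk count can be derived with simple ceiling division. A minimal sketch in plain Python (the `chunk_grid` helper is hypothetical, for illustration only, and not part of the Zarr API):

```python
import math

def chunk_grid(shape, chunks):
    # Number of chunks along each dimension is the ceiling of
    # array length divided by chunk length.
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

grid = chunk_grid((10000, 10000), (1000, 1000))
print(grid)               # (10, 10)
print(grid[0] * grid[1])  # 100 chunks in total
```

Note that when array length is not an exact multiple of chunk length, the final chunk along that dimension is simply partial, so the count still rounds up.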

For a complete list of array creation routines see the zarr.creation module documentation.

Reading and writing data

Zarr arrays support a similar interface to NumPy arrays for reading and writing data. For example, the entire array can be filled with a scalar value:

>>> z[:] = 42
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 2.2M; ratio: 170.4; initialized: 100/100
  store: builtins.dict

Notice that the values of nbytes_stored, ratio and initialized have changed. This is because when a Zarr array is first created, none of the chunks are initialized. Writing data into the array will cause the necessary chunks to be initialized.

Regions of the array can also be written to, e.g.:

>>> import numpy as np
>>> z[0, :] = np.arange(10000)
>>> z[:, 0] = np.arange(10000)

The contents of the array can be retrieved by slicing, which will load the requested region into a NumPy array, e.g.:

>>> z[0, 0]
0
>>> z[-1, -1]
42
>>> z[0, :]
array([   0,    1,    2, ..., 9997, 9998, 9999], dtype=int32)
>>> z[:, 0]
array([   0,    1,    2, ..., 9997, 9998, 9999], dtype=int32)
>>> z[:]
array([[   0,    1,    2, ..., 9997, 9998, 9999],
       [   1,   42,   42, ...,   42,   42,   42],
       [   2,   42,   42, ...,   42,   42,   42],
       ...,
       [9997,   42,   42, ...,   42,   42,   42],
       [9998,   42,   42, ...,   42,   42,   42],
       [9999,   42,   42, ...,   42,   42,   42]], dtype=int32)

Persistent arrays

In the examples above, compressed data for each chunk of the array was stored in memory. Zarr arrays can also be stored on a file system, enabling persistence of data between sessions. For example:

>>> z1 = zarr.open('example.zarr', mode='w', shape=(10000, 10000),
...                chunks=(1000, 1000), dtype='i4', fill_value=0)
>>> z1
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 317; ratio: 1261829.7; initialized: 0/100
  store: zarr.storage.DirectoryStore

The array above will store its configuration metadata and all compressed chunk data in a directory called ‘example.zarr’ relative to the current working directory. The zarr.creation.open() function provides a convenient way to create a new persistent array or continue working with an existing array. Note that there is no need to close an array, and data are automatically flushed to disk whenever an array is modified.

Persistent arrays support the same interface for reading and writing data, e.g.:

>>> z1[:] = 42
>>> z1[0, :] = np.arange(10000)
>>> z1[:, 0] = np.arange(10000)

Check that the data have been written and can be read again:

>>> z2 = zarr.open('example.zarr', mode='r')
>>> z2
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 2.3M; ratio: 163.8; initialized: 100/100
  store: zarr.storage.DirectoryStore
>>> np.all(z1[:] == z2[:])
True

Resizing and appending

A Zarr array can be resized, which means that any of its dimensions can be increased or decreased in length. For example:

>>> z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
>>> z[:] = 42
>>> z.resize(20000, 10000)
>>> z
zarr.core.Array((20000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 1.5G; nbytes_stored: 5.9M; ratio: 259.9; initialized: 100/200
  store: builtins.dict

Note that when an array is resized, the underlying data are not rearranged in any way. If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store.
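The bookkeeping behind that deletion can be sketched in plain Python (a hypothetical helper, not part of the Zarr API): compare the chunk grids before and after the resize, and any chunk index present only in the old grid is one that a shrinking resize would drop from the store.

```python
import itertools
import math

def chunks_outside(old_shape, new_shape, chunks):
    # Chunk indices present for the old shape but not the new one;
    # these are the chunks a shrinking resize would delete.
    old_grid = [math.ceil(s / c) for s, c in zip(old_shape, chunks)]
    new_grid = [math.ceil(s / c) for s, c in zip(new_shape, chunks)]
    all_old = set(itertools.product(*[range(n) for n in old_grid]))
    all_new = set(itertools.product(*[range(n) for n in new_grid]))
    return all_old - all_new

# Shrinking (20000, 10000) -> (10000, 10000) with (1000, 1000) chunks
# removes the 100 chunks in rows 10..19 of the chunk grid.
doomed = chunks_outside((20000, 10000), (10000, 10000), (1000, 1000))
print(len(doomed))  # 100
```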

For convenience, Zarr arrays also provide an append() method, which can be used to append data to any axis. E.g.:

>>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z = zarr.array(a, chunks=(1000, 100))
>>> z
zarr.core.Array((10000, 1000), int32, chunks=(1000, 100), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 38.1M; nbytes_stored: 2.0M; ratio: 19.3; initialized: 100/100
  store: builtins.dict
>>> z.append(a)
>>> z
zarr.core.Array((20000, 1000), int32, chunks=(1000, 100), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 76.3M; nbytes_stored: 4.0M; ratio: 19.3; initialized: 200/200
  store: builtins.dict
>>> z.append(np.vstack([a, a]), axis=1)
>>> z
zarr.core.Array((20000, 2000), int32, chunks=(1000, 100), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 152.6M; nbytes_stored: 7.9M; ratio: 19.3; initialized: 400/400
  store: builtins.dict

Compression

By default, Zarr uses the Blosc compression library to compress each chunk of an array. Blosc is extremely fast and can be configured in a variety of ways to improve the compression ratio for different types of data. Blosc is in fact a “meta-compressor”, which means that it can use a number of different compression algorithms internally to compress the data. Blosc also provides highly optimised implementations of byte and bit shuffle filters, which can significantly improve compression ratios for some data.

Options for the compressor can be controlled via the compression_opts keyword argument accepted by all array creation functions. For example:

>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
...                chunks=(1000, 1000), compression='blosc',
...                compression_opts=dict(cname='lz4', clevel=3, shuffle=2))
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 3, 'cname': 'lz4', 'shuffle': 2}
  nbytes: 381.5M; nbytes_stored: 17.6M; ratio: 21.7; initialized: 100/100
  store: builtins.dict

The array above will use Blosc as the primary compressor, using the LZ4 algorithm (compression level 3) internally within Blosc, and with the bitshuffle filter applied.

In addition to Blosc, other compression libraries can also be used. Zarr comes with support for zlib, BZ2 and LZMA compression, via the Python standard library. For example, here is an array using zlib compression, level 1:

>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
...                chunks=(1000, 1000), compression='zlib',
...                compression_opts=1)
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: zlib; compression_opts: 1
  nbytes: 381.5M; nbytes_stored: 132.2M; ratio: 2.9; initialized: 100/100
  store: builtins.dict

Here is an example using LZMA with a custom filter pipeline including the delta filter:

>>> import lzma
>>> filters = [dict(id=lzma.FILTER_DELTA, dist=4),
...            dict(id=lzma.FILTER_LZMA2, preset=1)]
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
...                chunks=(1000, 1000), compression='lzma',
...                compression_opts=dict(filters=filters))
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: lzma; compression_opts: {'preset': None, 'filters': [{'dist': 4, 'id': 3}, {'preset': 1, 'id': 33}], 'check': 0, 'format': 1}
  nbytes: 381.5M; nbytes_stored: 248.1K; ratio: 1574.7; initialized: 100/100
  store: builtins.dict

Parallel computing and synchronization

Zarr arrays can be used as either the source or sink for data in parallel computations. Both multi-threaded and multi-process parallelism are supported. The Python global interpreter lock (GIL) is released for both compression and decompression operations, so Zarr will not block other Python threads from running.

A Zarr array can be read concurrently by multiple threads or processes. No synchronization (i.e., locking) is required for concurrent reads.

A Zarr array can also be written to concurrently by multiple threads or processes. Some synchronization may be required, depending on the way the data is being written.

If each worker in a parallel computation is writing to a separate region of the array, and if region boundaries are perfectly aligned with chunk boundaries, then no synchronization is required. However, if region and chunk boundaries are not perfectly aligned, then synchronization is required to avoid two workers attempting to modify the same chunk at the same time.

To give a simple example, consider a 1-dimensional array of length 60, z, divided into three chunks of 20 elements each. If three workers are running and each attempts to write to a 20 element region (i.e., z[0:20], z[20:40] and z[40:60]) then each worker will be writing to a separate chunk and no synchronization is required. However, if two workers are running and each attempts to write to a 30 element region (i.e., z[0:30] and z[30:60]) then it is possible both workers will attempt to modify the middle chunk at the same time, and synchronization is required to prevent data loss.
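The alignment rule above can be checked mechanically. A minimal sketch (the `chunks_touched` helper is hypothetical, not part of the Zarr API): map a slice to the range of chunk indices it touches, then see whether two regions share a chunk.

```python
def chunks_touched(start, stop, chunk_len):
    # Indices of the chunks overlapped by the slice [start:stop).
    return set(range(start // chunk_len, (stop - 1) // chunk_len + 1))

# Three 20-element regions on 20-element chunks: no shared chunks,
# so no synchronization is needed.
a, b, c = (chunks_touched(i, i + 20, 20) for i in (0, 20, 40))
print(a & b, b & c)  # set() set()

# Two 30-element regions: both touch the middle chunk (index 1),
# so concurrent writes would need synchronization.
d = chunks_touched(0, 30, 20)   # {0, 1}
e = chunks_touched(30, 60, 20)  # {1, 2}
print(d & e)                    # {1}
```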

Zarr provides support for chunk-level synchronization. E.g., create an array with thread synchronization:

>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
...                 synchronizer=zarr.ThreadSynchronizer())
>>> z
zarr.sync.SynchronizedArray((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 317; ratio: 1261829.7; initialized: 0/100
  store: builtins.dict; synchronizer: zarr.sync.ThreadSynchronizer

This array is safe to read or write within a multi-threaded program.

Zarr also provides support for process synchronization via file locking, provided that all processes have access to a shared file system. E.g.:

>>> synchronizer = zarr.ProcessSynchronizer('example.zarr')
>>> z = zarr.open('example.zarr', mode='w', shape=(10000, 10000),
...               chunks=(1000, 1000), dtype='i4',
...               synchronizer=synchronizer)
>>> z
zarr.sync.SynchronizedArray((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 317; ratio: 1261829.7; initialized: 0/100
  store: zarr.storage.DirectoryStore; synchronizer: zarr.sync.ProcessSynchronizer

This array is safe to read or write from multiple processes.

User attributes

Zarr arrays also support custom key/value attributes, which can be useful for associating an array with application-specific metadata. For example:

>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z.attrs['foo'] = 'bar'
>>> z.attrs['baz'] = 42
>>> sorted(z.attrs)
['baz', 'foo']
>>> 'foo' in z.attrs
True
>>> z.attrs['foo']
'bar'
>>> z.attrs['baz']
42

Internally Zarr uses JSON to store array attributes, so attribute values must be JSON serializable.
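A quick way to check whether a value will survive as an attribute is to round-trip it through the standard library's json module; the constraint is the same, so Zarr itself is not needed for the check (the `json_safe` helper is for illustration only):

```python
import json

def json_safe(value):
    # Attribute values must be JSON serializable.
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False

print(json_safe({'species': 'human', 'count': 42}))  # True
print(json_safe({1, 2, 3}))  # False: sets are not JSON serializable
```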

Tips and tricks

Copying large arrays

Data can be copied between large arrays without needing much memory, e.g.:

>>> z1 = zarr.empty((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z1[:] = 42
>>> z2 = zarr.empty_like(z1)
>>> z2[:] = z1

Internally the example above works chunk-by-chunk, extracting only the data from z1 required to fill each chunk in z2. The source of the data (z1) could equally be an h5py Dataset.

Changing memory layout

The order of bytes within each chunk of an array can be changed via the order keyword argument, to use either C or Fortran layout. For multi-dimensional arrays, these two layouts may provide different compression ratios, depending on the correlation structure within the data. E.g.:

>>> a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
>>> zarr.array(a, chunks=(1000, 1000))
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 26.1M; ratio: 14.6; initialized: 100/100
  store: builtins.dict
>>> zarr.array(a, chunks=(1000, 1000), order='F')
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=F)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 10.0M; ratio: 38.0; initialized: 100/100
  store: builtins.dict

In the above example, Fortran order gives a better compression ratio. This is an artificial example but illustrates the general point that changing the order of bytes within chunks of an array may improve the compression ratio, depending on the structure of the data, the compression algorithm used, and which compression filters (e.g., byte shuffle) have been applied.

Storage alternatives

Zarr can use any object that implements the MutableMapping interface as the store for an array.

Here is an example storing an array directly into a Zip file via the zict package:

>>> import zict
>>> import os
>>> store = zict.Zip('example.zip', mode='w')
>>> z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='i4',
...                compression='zlib', compression_opts=1, store=store)
>>> z
zarr.core.Array((1000, 1000), int32, chunks=(100, 100), order=C)
  compression: zlib; compression_opts: 1
  nbytes: 3.8M; initialized: 0/100
  store: zict.zip.Zip
>>> z[:] = 42
>>> store.flush()  # only required for zict.Zip
>>> os.path.getsize('example.zip')
30828

Re-open and check that data have been written:

>>> store = zict.Zip('example.zip', mode='r')
>>> z = zarr.Array(store)
>>> z
zarr.core.Array((1000, 1000), int32, chunks=(100, 100), order=C)
  compression: zlib; compression_opts: 1
  nbytes: 3.8M; initialized: 100/100
  store: zict.zip.Zip
>>> z[:]
array([[42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       ...,
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42]], dtype=int32)

Note that there are some restrictions on how Zip files can be used, because items within a Zip file cannot be updated in place. This means that data in the array should only be written once and write operations should be aligned with chunk boundaries.
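Whether a write region is chunk-aligned along a dimension is easy to check up front. A sketch (the `is_chunk_aligned` helper is hypothetical, not part of the Zarr API):

```python
def is_chunk_aligned(start, stop, length, chunk_len):
    # A region [start:stop) is aligned if it starts on a chunk boundary
    # and ends either on a chunk boundary or at the end of the dimension.
    return start % chunk_len == 0 and (stop % chunk_len == 0 or stop == length)

print(is_chunk_aligned(0, 100, 1000, 100))   # True: exactly one chunk
print(is_chunk_aligned(50, 150, 1000, 100))  # False: straddles a boundary
```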

Chunk size and shape

In general, chunks of at least 1 megabyte (1M) seem to provide the best performance, at least when using the Blosc compression library.
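The uncompressed size of a chunk is simply the product of its shape and the dtype's item size; e.g., the (1000, 1000) int32 chunks used throughout this tutorial come out at roughly 4M, comfortably above that threshold (a sketch, independent of Zarr itself):

```python
from functools import reduce
from operator import mul

def chunk_nbytes(chunks, itemsize):
    # Uncompressed bytes per chunk: product of chunk shape times item size.
    return reduce(mul, chunks, 1) * itemsize

nbytes = chunk_nbytes((1000, 1000), 4)  # int32 -> 4 bytes per item
print(nbytes)           # 4000000
print(nbytes >= 2**20)  # True: over 1M
```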

The optimal chunk shape will depend on how you want to access the data. E.g., for a 2-dimensional array, if you only ever take slices along the first dimension, then chunk across the second dimension. If you know you want to chunk across an entire dimension you can use None within the chunks argument, e.g.:

>>> z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
>>> z1.chunks
(100, 10000)

Alternatively, if you only ever take slices along the second dimension, then chunk across the first dimension, e.g.:

>>> z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
>>> z2.chunks
(10000, 100)

If you require reasonable performance for both access patterns then you need to find a compromise, e.g.:

>>> z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z3.chunks
(1000, 1000)

Configuring Blosc

The Blosc compressor is able to use multiple threads internally to accelerate compression and decompression. By default, Zarr allows Blosc to use up to 4 internal threads. The number of Blosc threads can be changed, e.g.:

>>> from zarr import blosc
>>> blosc.set_nthreads(2)
4

When a Zarr array is being used within a multi-threaded program, Zarr automatically switches to using Blosc in a single-threaded “contextual” mode. This is generally better as it allows multiple program threads to use Blosc simultaneously and prevents CPU thrashing from too many active threads. If you want to manually override this behaviour, set the value of the blosc.use_threads variable to True (Blosc always uses multiple internal threads) or False (Blosc always runs in single-threaded contextual mode). To re-enable automatic switching, set blosc.use_threads to None.

API reference

Array creation (zarr.creation)

zarr.creation.create(shape, chunks, dtype=None, compression='default', compression_opts=None, fill_value=None, order='C', store=None, synchronizer=None, overwrite=False)

Create an array.

Parameters:

shape : int or tuple of ints

Array shape.

chunks : int or tuple of ints

Chunk shape.

dtype : string or dtype, optional

NumPy dtype.

compression : string, optional

Name of primary compression library, e.g., ‘blosc’, ‘zlib’, ‘bz2’, ‘lzma’.

compression_opts : object, optional

Options to primary compressor. E.g., for blosc, provide a dictionary with keys ‘cname’, ‘clevel’ and ‘shuffle’.

fill_value : object

Default value to use for uninitialised portions of the array.

order : {‘C’, ‘F’}, optional

Memory layout to be used within each chunk.

store : MutableMapping, optional

Array storage. If not provided, a Python dict will be used, meaning array data will be stored in memory.

synchronizer : zarr.sync.ArraySynchronizer, optional

Array synchronizer.

overwrite : bool, optional

If True, delete all pre-existing data in store before creating the array.

Returns:

z : zarr.core.Array

Examples

Create an array with default settings:

>>> import zarr
>>> z = zarr.create((10000, 10000), chunks=(1000, 1000))
>>> z
zarr.core.Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 762.9M; nbytes_stored: 320; ratio: 2500000.0; initialized: 0/100
  store: builtins.dict
zarr.creation.empty(shape, chunks, dtype=None, compression='default', compression_opts=None, order='C', store=None, synchronizer=None)

Create an empty array.

For parameter definitions see zarr.creation.create().

Notes

The contents of an empty Zarr array are not defined. On attempting to retrieve data from an empty Zarr array, any values may be returned, and these are not guaranteed to be stable from one access to the next.

zarr.creation.zeros(shape, chunks, dtype=None, compression='default', compression_opts=None, order='C', store=None, synchronizer=None)

Create an array, with zero being used as the default value for uninitialised portions of the array.

For parameter definitions see zarr.creation.create().

Examples

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000))
>>> z
zarr.core.Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 762.9M; nbytes_stored: 317; ratio: 2523659.3; initialized: 0/100
  store: builtins.dict
>>> z[:2, :2]
array([[ 0.,  0.],
       [ 0.,  0.]])
zarr.creation.ones(shape, chunks, dtype=None, compression='default', compression_opts=None, order='C', store=None, synchronizer=None)

Create an array, with one being used as the default value for uninitialised portions of the array.

For parameter definitions see zarr.creation.create().

Examples

>>> import zarr
>>> z = zarr.ones((10000, 10000), chunks=(1000, 1000))
>>> z
zarr.core.Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 762.9M; nbytes_stored: 317; ratio: 2523659.3; initialized: 0/100
  store: builtins.dict
>>> z[:2, :2]
array([[ 1.,  1.],
       [ 1.,  1.]])
zarr.creation.full(shape, chunks, fill_value, dtype=None, compression='default', compression_opts=None, order='C', store=None, synchronizer=None)

Create an array, with fill_value being used as the default value for uninitialised portions of the array.

For parameter definitions see zarr.creation.create().

Examples

>>> import zarr
>>> z = zarr.full((10000, 10000), chunks=(1000, 1000), fill_value=42)
>>> z
zarr.core.Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 762.9M; nbytes_stored: 318; ratio: 2515723.3; initialized: 0/100
  store: builtins.dict
>>> z[:2, :2]
array([[ 42.,  42.],
       [ 42.,  42.]])
zarr.creation.array(data, chunks=None, dtype=None, compression='default', compression_opts=None, fill_value=None, order='C', store=None, synchronizer=None)

Create an array filled with data.

The data argument should be a NumPy array or array-like object. For other parameter definitions see zarr.creation.create().

Examples

>>> import numpy as np
>>> import zarr
>>> a = np.arange(100000000).reshape(10000, 10000)
>>> z = zarr.array(a, chunks=(1000, 1000))
>>> z
zarr.core.Array((10000, 10000), int64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 762.9M; nbytes_stored: 16.8M; ratio: 45.5; initialized: 100/100
  store: builtins.dict
zarr.creation.open(path, mode='a', shape=None, chunks=None, dtype=None, compression='default', compression_opts=None, fill_value=0, order='C', synchronizer=None)

Open an array stored in a directory on the file system.

Parameters:

path : string

Path to directory in which to store the array.

mode : {‘r’, ‘r+’, ‘a’, ‘w’, ‘w-’}

Persistence mode: ‘r’ means readonly (must exist); ‘r+’ means read/write (must exist); ‘a’ means read/write (create if doesn’t exist); ‘w’ means create (overwrite if exists); ‘w-’ means create (fail if exists).

shape : int or tuple of ints

Array shape.

chunks : int or tuple of ints

Chunk shape.

dtype : string or dtype, optional

NumPy dtype.

compression : string, optional

Name of primary compression library, e.g., ‘blosc’, ‘zlib’, ‘bz2’, ‘lzma’.

compression_opts : object, optional

Options to primary compressor. E.g., for blosc, provide a dictionary with keys ‘cname’, ‘clevel’ and ‘shuffle’.

fill_value : object

Default value to use for uninitialised portions of the array.

order : {‘C’, ‘F’}, optional

Memory layout to be used within each chunk.

synchronizer : zarr.sync.ArraySynchronizer, optional

Array synchronizer.

Returns:

z : zarr.core.Array

Notes

There is no need to close an array. Data are automatically flushed to the file system.

Examples

>>> import numpy as np
>>> import zarr
>>> z1 = zarr.open('example.zarr', mode='w', shape=(10000, 10000),
...                chunks=(1000, 1000), fill_value=0)
>>> z1[:] = np.arange(100000000).reshape(10000, 10000)
>>> z1
zarr.core.Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 762.9M; nbytes_stored: 24.6M; ratio: 31.0; initialized: 100/100
  store: zarr.storage.DirectoryStore
>>> z2 = zarr.open('example.zarr', mode='r')
>>> z2
zarr.core.Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 762.9M; nbytes_stored: 24.6M; ratio: 31.0; initialized: 100/100
  store: zarr.storage.DirectoryStore
>>> np.all(z1[:] == z2[:])
True
zarr.creation.empty_like(a, shape=None, chunks=None, dtype=None, compression=None, compression_opts=None, order=None, store=None, synchronizer=None)

Create an empty array like a.

zarr.creation.zeros_like(a, shape=None, chunks=None, dtype=None, compression=None, compression_opts=None, order=None, store=None, synchronizer=None)

Create an array of zeros like a.

zarr.creation.ones_like(a, shape=None, chunks=None, dtype=None, compression=None, compression_opts=None, order=None, store=None, synchronizer=None)

Create an array of ones like a.

zarr.creation.full_like(a, shape=None, chunks=None, fill_value=None, dtype=None, compression=None, compression_opts=None, order=None, store=None, synchronizer=None)

Create a filled array like a.

zarr.creation.open_like(a, path, mode='a', shape=None, chunks=None, dtype=None, compression=None, compression_opts=None, fill_value=None, order=None, synchronizer=None)

Open a persistent array like a.

The Array class (zarr.core)

class zarr.core.Array(store, readonly=False)

Instantiate an array from an initialised store.

Parameters:

store : MutableMapping

Array store, already initialised.

readonly : bool, optional

True if array should be protected against modification.

Examples

>>> import zarr
>>> store = dict()
>>> zarr.init_store(store, shape=(10000, 10000), chunks=(1000, 1000))
>>> z = zarr.Array(store)
>>> z
zarr.core.Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 762.9M; nbytes_stored: 320; ratio: 2500000.0; initialized: 0/100
  store: builtins.dict

Attributes

store A MutableMapping providing the underlying storage for the array.
readonly A boolean, True if write operations are not permitted.
shape A tuple of integers describing the length of each dimension of the array.
chunks A tuple of integers describing the length of each dimension of a chunk of the array.
dtype The NumPy data type.
compression A string naming the primary compression algorithm used to compress chunks of the array.
compression_opts Parameters controlling the behaviour of the primary compression algorithm.
fill_value A value used for uninitialized portions of the array.
order A string indicating the order in which bytes are arranged within chunks of the array.
attrs A MutableMapping containing user-defined attributes.
size The total number of elements in the array.
itemsize The size in bytes of each item in the array.
nbytes The total number of bytes that would be required to store the array without compression.
nbytes_stored The total number of stored bytes of data for the array.
initialized The number of chunks that have been initialized with some data.
cdata_shape A tuple of integers describing the number of chunks along each dimension of the array.

Methods

__getitem__(item) Retrieve data for some portion of the array.
__setitem__(key, value) Modify data for some portion of the array.
resize(*args) Change the shape of the array by growing or shrinking one or more dimensions.
append(data[, axis]) Append data to axis.
__getitem__(item)

Retrieve data for some portion of the array. Most NumPy-style slicing operations are supported.

Returns:

out : ndarray

A NumPy array containing the data for the requested region.

Examples

Setup a 1-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.array(np.arange(100000000), chunks=1000000, dtype='i4')
>>> z
zarr.core.Array((100000000,), int32, chunks=(1000000,), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 6.8M; ratio: 56.0; initialized: 100/100
  store: builtins.dict

Take some slices:

>>> z[5]
5
>>> z[:5]
array([0, 1, 2, 3, 4], dtype=int32)
>>> z[-5:]
array([99999995, 99999996, 99999997, 99999998, 99999999], dtype=int32)
>>> z[5:10]
array([5, 6, 7, 8, 9], dtype=int32)
>>> z[:]
array([       0,        1,        2, ..., 99999997, 99999998, 99999999], dtype=int32)

Setup a 2-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.array(np.arange(100000000).reshape(10000, 10000),
...                chunks=(1000, 1000), dtype='i4')
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 10.0M; ratio: 38.0; initialized: 100/100
  store: builtins.dict

Take some slices:

>>> z[2, 2]
20002
>>> z[:2, :2]
array([[    0,     1],
       [10000, 10001]], dtype=int32)
>>> z[:2]
array([[    0,     1,     2, ...,  9997,  9998,  9999],
       [10000, 10001, 10002, ..., 19997, 19998, 19999]], dtype=int32)
>>> z[:, :2]
array([[       0,        1],
       [   10000,    10001],
       [   20000,    20001],
       ...,
       [99970000, 99970001],
       [99980000, 99980001],
       [99990000, 99990001]], dtype=int32)
>>> z[:]
array([[       0,        1,        2, ...,     9997,     9998,     9999],
       [   10000,    10001,    10002, ...,    19997,    19998,    19999],
       [   20000,    20001,    20002, ...,    29997,    29998,    29999],
       ...,
       [99970000, 99970001, 99970002, ..., 99979997, 99979998, 99979999],
       [99980000, 99980001, 99980002, ..., 99989997, 99989998, 99989999],
       [99990000, 99990001, 99990002, ..., 99999997, 99999998, 99999999]], dtype=int32)
__setitem__(key, value)

Modify data for some portion of the array.

Examples

Setup a 1-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.zeros(100000000, chunks=1000000, dtype='i4')
>>> z
zarr.core.Array((100000000,), int32, chunks=(1000000,), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 295; ratio: 1355932.2; initialized: 0/100
  store: builtins.dict

Set all array elements to the same scalar value:

>>> z[:] = 42
>>> z[:]
array([42, 42, 42, ..., 42, 42, 42], dtype=int32)

Set a portion of the array:

>>> z[:100] = np.arange(100)
>>> z[-100:] = np.arange(100)[::-1]
>>> z[:]
array([0, 1, 2, ..., 2, 1, 0], dtype=int32)

Setup a 2-dimensional array:

>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 381.5M; nbytes_stored: 317; ratio: 1261829.7; initialized: 0/100
  store: builtins.dict

Set all array elements to the same scalar value:

>>> z[:] = 42
>>> z[:]
array([[42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       ...,
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42]], dtype=int32)

Set a portion of the array:

>>> z[0, :] = np.arange(z.shape[1])
>>> z[:, 0] = np.arange(z.shape[0])
>>> z[:]
array([[   0,    1,    2, ..., 9997, 9998, 9999],
       [   1,   42,   42, ...,   42,   42,   42],
       [   2,   42,   42, ...,   42,   42,   42],
       ...,
       [9997,   42,   42, ...,   42,   42,   42],
       [9998,   42,   42, ...,   42,   42,   42],
       [9999,   42,   42, ...,   42,   42,   42]], dtype=int32)
resize(*args)

Change the shape of the array by growing or shrinking one or more dimensions.

Notes

When resizing an array, the data are not rearranged in any way.

If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store.

Examples

>>> import zarr
>>> z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
>>> z
zarr.core.Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 762.9M; nbytes_stored: 317; ratio: 2523659.3; initialized: 0/100
  store: builtins.dict
>>> z.resize(20000, 10000)
>>> z
zarr.core.Array((20000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 1.5G; nbytes_stored: 317; ratio: 5047318.6; initialized: 0/200
  store: builtins.dict
>>> z.resize(30000, 1000)
>>> z
zarr.core.Array((30000, 1000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 228.9M; nbytes_stored: 316; ratio: 759493.7; initialized: 0/30
  store: builtins.dict
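The chunk-grid arithmetic behind this behaviour can be sketched as follows. This is a simplified 2-dimensional illustration, not Zarr's actual implementation; `chunks_per_dim` and `surviving_chunk_keys` are hypothetical helpers:

```python
import math

def chunks_per_dim(shape, chunks):
    """Number of chunks along each dimension (edge chunks may overhang)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

def surviving_chunk_keys(old_shape, new_shape, chunks):
    """Partition the old chunk keys into those kept and those deleted by a resize."""
    old_grid = chunks_per_dim(old_shape, chunks)
    new_grid = chunks_per_dim(new_shape, chunks)
    keep, drop = [], []
    # iterate over the old grid of chunk indices (sketch assumes 2 dimensions)
    for i in range(old_grid[0]):
        for j in range(old_grid[1]):
            key = '%d.%d' % (i, j)
            (keep if i < new_grid[0] and j < new_grid[1] else drop).append(key)
    return keep, drop

# resizing (20000, 10000) -> (30000, 1000) with (1000, 1000) chunks:
# the second dimension shrinks from 10 chunks to 1, so 9 of every 10 are deleted
keep, drop = surviving_chunk_keys((20000, 10000), (30000, 1000), (1000, 1000))
```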
append(data, axis=0)

Append data to the array along the given axis.

Parameters:

data : array_like

Data to be appended.

axis : int

Axis along which to append.

Notes

The size of all dimensions other than axis must match between this array and data.

Examples

>>> import numpy as np
>>> import zarr
>>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z = zarr.array(a, chunks=(1000, 100))
>>> z
zarr.core.Array((10000, 1000), int32, chunks=(1000, 100), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 38.1M; nbytes_stored: 2.0M; ratio: 19.3; initialized: 100/100
  store: builtins.dict
>>> z.append(a)
>>> z
zarr.core.Array((20000, 1000), int32, chunks=(1000, 100), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 76.3M; nbytes_stored: 4.0M; ratio: 19.3; initialized: 200/200
  store: builtins.dict
>>> z.append(np.vstack([a, a]), axis=1)
>>> z
zarr.core.Array((20000, 2000), int32, chunks=(1000, 100), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 152.6M; nbytes_stored: 7.9M; ratio: 19.3; initialized: 400/400
  store: builtins.dict

Storage (zarr.storage)

This module contains a single DirectoryStore class providing a MutableMapping interface to a directory on the file system. However, note that any object implementing the MutableMapping interface can be used as a Zarr array store.

zarr.storage.init_store(store, shape, chunks, dtype=None, compression='default', compression_opts=None, fill_value=None, order='C', overwrite=False)

Initialise an array store with the given configuration.

Parameters:

store : MutableMapping

A mapping that supports string keys and byte sequence values.

shape : int or tuple of ints

Array shape.

chunks : int or tuple of ints

Chunk shape.

dtype : string or dtype, optional

NumPy dtype.

compression : string, optional

Name of primary compression library, e.g., ‘blosc’, ‘zlib’, ‘bz2’, ‘lzma’.

compression_opts : object, optional

Options to primary compressor. E.g., for blosc, provide a dictionary with keys ‘cname’, ‘clevel’ and ‘shuffle’.

fill_value : object

Default value to use for uninitialised portions of the array.

order : {‘C’, ‘F’}, optional

Memory layout to be used within each chunk.

overwrite : bool, optional

If True, erase all data in store prior to initialisation.

Notes

The initialisation process involves normalising all array metadata, encoding as JSON and storing under the ‘meta’ key. User attributes are also initialised and stored as JSON under the ‘attrs’ key.

Examples

>>> import zarr
>>> store = dict()
>>> zarr.init_store(store, shape=(10000, 10000), chunks=(1000, 1000))
>>> sorted(store.keys())
['attrs', 'meta']
>>> print(str(store['meta'], 'ascii'))
{
    "chunks": [
        1000,
        1000
    ],
    "compression": "blosc",
    "compression_opts": {
        "clevel": 5,
        "cname": "blosclz",
        "shuffle": 1
    },
    "dtype": "<f8",
    "fill_value": null,
    "order": "C",
    "shape": [
        10000,
        10000
    ],
    "zarr_format": 1
}
>>> print(str(store['attrs'], 'ascii'))
{}
class zarr.storage.DirectoryStore(path)

MutableMapping interface to a directory. Keys must be strings, values must be bytes-like objects.

Parameters:

path : string

Location of directory.

Examples

>>> import zarr
>>> store = zarr.DirectoryStore('example.zarr')
>>> zarr.init_store(store, shape=(10000, 10000), chunks=(1000, 1000),
...                 fill_value=0, overwrite=True)
>>> import os
>>> sorted(os.listdir('example.zarr'))
['attrs', 'meta']
>>> print(open('example.zarr/meta').read())
{
    "chunks": [
        1000,
        1000
    ],
    "compression": "blosc",
    "compression_opts": {
        "clevel": 5,
        "cname": "blosclz",
        "shuffle": 1
    },
    "dtype": "<f8",
    "fill_value": 0,
    "order": "C",
    "shape": [
        10000,
        10000
    ],
    "zarr_format": 1
}
>>> print(open('example.zarr/attrs').read())
{}
>>> z = zarr.Array(store)
>>> z
zarr.core.Array((10000, 10000), float64, chunks=(1000, 1000), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 762.9M; nbytes_stored: 317; ratio: 2523659.3; initialized: 0/100
  store: zarr.storage.DirectoryStore
>>> z[:] = 1
>>> len(os.listdir('example.zarr'))
102
>>> sorted(os.listdir('example.zarr'))[0:5]
['0.0', '0.1', '0.2', '0.3', '0.4']
>>> print(open('example.zarr/0.0', 'rb').read(10))
b'\x02\x01\x01\x08\x00\x12z\x00\x00\x80'
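A minimal directory-backed store satisfying the same key/value interface can be sketched with the standard library alone. This is an illustration only, not the real DirectoryStore; `MiniDirectoryStore` is a hypothetical name:

```python
import os
import tempfile

class MiniDirectoryStore:
    """Sketch of a directory-backed MutableMapping-style store:
    keys map to file names, values to file contents."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def __getitem__(self, key):
        with open(os.path.join(self.path, key), 'rb') as f:
            return f.read()

    def __setitem__(self, key, value):
        with open(os.path.join(self.path, key), 'wb') as f:
            f.write(value)

    def __delitem__(self, key):
        os.remove(os.path.join(self.path, key))

    def keys(self):
        return sorted(os.listdir(self.path))

# usage: store a value and read it back
store = MiniDirectoryStore(tempfile.mkdtemp())
store['meta'] = b'{}'
round_trip = store['meta']
```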

Compressors (zarr.compressors)

This module contains compressor classes for use with Zarr. Note that these classes do not normally need to be used directly; they are used under the hood by Zarr when looking up an implementation of a particular compression.

Other compressors can be registered dynamically with Zarr. All that is required is to implement a class that provides the same interface as the classes listed below, and then to add the class to the compressor registry. See the source code of this module for details.

class zarr.compressors.BloscCompressor(compression_opts)

Provides compression using the blosc meta-compressor. Registered under the name ‘blosc’.

Parameters:

compression_opts : dict

A dictionary with keys ‘cname’, ‘clevel’ and ‘shuffle’. The value of the ‘cname’ key should be a string naming one of the compression algorithms available within blosc, e.g., ‘blosclz’, ‘lz4’, ‘zlib’ or ‘snappy’. The value of the ‘clevel’ key should be an integer between 0 and 9 specifying the compression level. The value of the ‘shuffle’ key should be 0 (no shuffle), 1 (byte shuffle) or 2 (bit shuffle).

Examples

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
...                compression='blosc',
...                compression_opts=dict(cname='lz4', clevel=3, shuffle=2))
class zarr.compressors.ZlibCompressor(compression_opts)

Provides compression using zlib via the Python standard library. Registered under the name ‘zlib’.

Parameters:

compression_opts : int

An integer between 0 and 9 inclusive specifying the compression level.

Examples

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
...                compression='zlib', compression_opts=1)
class zarr.compressors.BZ2Compressor(compression_opts)

Provides compression using BZ2 via the Python standard library. Registered under the name ‘bz2’.

Parameters:

compression_opts : int

An integer between 1 and 9 inclusive specifying the compression level.

Examples

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
...                compression='bz2', compression_opts=1)
class zarr.compressors.LZMACompressor(compression_opts)

Provides compression using lzma via the Python standard library (only available under Python 3). Registered under the name ‘lzma’.

Parameters:

compression_opts : dict

A dictionary with keys ‘format’, ‘check’, ‘preset’ and ‘filters’. The value of the ‘format’ key should be an integer specifying one of the lzma format codes, e.g., lzma.FORMAT_XZ. The value of the ‘check’ key should be an integer specifying one of the lzma check codes, e.g., lzma.CHECK_NONE. The value of the ‘preset’ key should be an integer between 0 and 9 inclusive, specifying the compression level. The value of the ‘filters’ key should be a list of dictionaries specifying compression filters. If filters are provided, ‘preset’ must be None.

Examples

Simple usage:

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
...                compression='lzma',
...                compression_opts=dict(preset=1))

Custom filter pipeline:

>>> import lzma
>>> filters = [dict(id=lzma.FILTER_DELTA, dist=4),
...            dict(id=lzma.FILTER_LZMA2, preset=1)]
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
...                compression='lzma',
...                compression_opts=dict(filters=filters))
class zarr.compressors.NoCompressor(compression_opts)

No compression, i.e., pass bytes through. Registered under the name ‘none’.

Examples

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
...                compression='none')

Synchronization (zarr.sync)

class zarr.sync.ThreadSynchronizer

Provides synchronization using thread locks.

class zarr.sync.ProcessSynchronizer(path)

Provides synchronization using file locks via the fasteners package.

Parameters:

path : string

Path to a directory on a file system that is shared by all processes.

class zarr.sync.SynchronizedArray(store, synchronizer, readonly=False)

Instantiate a synchronized array.

Parameters:

store : MutableMapping

Array store, already initialised.

synchronizer : object

Array synchronizer.

readonly : bool, optional

True if array should be protected against modification.

Notes

Only writing data to the array via the __setitem__() method and modification of user attributes are synchronized. Neither append() nor resize() is synchronized.

Writing to the array is synchronized at the chunk level. I.e., the array supports concurrent write operations via the __setitem__() method, but these will only exclude each other if they both require modification of the same chunk.

Examples

>>> import zarr
>>> store = dict()
>>> zarr.init_store(store, shape=1000, chunks=100)
>>> synchronizer = zarr.ThreadSynchronizer()
>>> z = zarr.SynchronizedArray(store, synchronizer)
>>> z
zarr.sync.SynchronizedArray((1000,), float64, chunks=(100,), order=C)
  compression: blosc; compression_opts: {'clevel': 5, 'cname': 'blosclz', 'shuffle': 1}
  nbytes: 7.8K; nbytes_stored: 289; ratio: 27.7; initialized: 0/10
  store: builtins.dict; synchronizer: zarr.sync.ThreadSynchronizer
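The chunk-level locking described in the notes above can be sketched with standard threading primitives. This is a simplified illustration, not Zarr's internals; `ChunkLocks` is a hypothetical helper:

```python
import threading
from collections import defaultdict

class ChunkLocks:
    """One lock per chunk key, so writers touching different chunks
    never block each other (a sketch of chunk-level synchronization)."""

    def __init__(self):
        self._guard = threading.Lock()           # protects the lock table itself
        self._locks = defaultdict(threading.Lock)

    def lock(self, chunk_key):
        with self._guard:
            return self._locks[chunk_key]

locks = ChunkLocks()
with locks.lock('0.0'):
    pass  # writes that modify chunk '0.0' go here
```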

Specifications

Zarr storage specification version 1

This document provides a technical specification of the protocol and format used for storing a Zarr array. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Storage

A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string and a value is an arbitrary sequence of bytes, and the supported operations are read (get the sequence of bytes associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair).

For example, a directory in a file system can provide this interface, where keys are file names, values are file contents, and files can be read, written or deleted. Equally, an S3 bucket can provide this interface, where keys are resource names, values are resource contents, and resources can be read, written or deleted via HTTP.

Below, an “array store” refers to any system implementing this interface.

Metadata

Each array requires essential configuration metadata to be stored, enabling correct interpretation of the stored data. This metadata is encoded using JSON and stored as the value of the ‘meta’ key within an array store.

The metadata resource is a JSON object. The following keys MUST be present within the object:

zarr_format
An integer defining the version of the storage specification to which the array store adheres.
shape
A list of integers defining the length of each dimension of the array.
chunks
A list of integers defining the length of each dimension of a chunk of the array. Note that all chunks within a Zarr array have the same shape.
dtype
A string or list defining a valid data type for the array. See also the subsection below on data type encoding.
compression
A string identifying the primary compression library used to compress each chunk of the array.
compression_opts
An integer, string or dictionary providing options to the primary compression library.
fill_value
A scalar value providing the default value to use for uninitialized portions of the array.
order
Either ‘C’ or ‘F’, defining the layout of bytes within each chunk of the array. ‘C’ means row-major order, i.e., the last dimension varies fastest; ‘F’ means column-major order, i.e., the first dimension varies fastest.

Other keys MAY be present within the metadata object; however, they MUST NOT alter the interpretation of the required fields defined above.
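The effect of the ‘order’ field on the byte layout within a chunk can be illustrated with NumPy (assuming NumPy is available):

```python
import numpy as np

# a 2x2 chunk of little-endian 4-byte integers
chunk = np.arange(4, dtype='<i4').reshape(2, 2)

# 'C' order: the last dimension varies fastest -> values 0, 1, 2, 3
c_bytes = chunk.tobytes(order='C')
# 'F' order: the first dimension varies fastest -> values 0, 2, 1, 3
f_bytes = chunk.tobytes(order='F')
```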

For example, the JSON object below defines a 2-dimensional array of 64-bit little-endian floating point numbers with 10000 rows and 10000 columns, divided into chunks of 1000 rows and 1000 columns (so there will be 100 chunks in total arranged in a 10 by 10 grid). Within each chunk the data are laid out in C contiguous order, and each chunk is compressed using the Blosc compression library:

{
    "chunks": [
        1000,
        1000
    ],
    "compression": "blosc",
    "compression_opts": {
        "clevel": 5,
        "cname": "lz4",
        "shuffle": 1
    },
    "dtype": "<f8",
    "fill_value": null,
    "order": "C",
    "shape": [
        10000,
        10000
    ],
    "zarr_format": 1
}
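A reader of the format can check such a metadata resource for the required fields using only the standard library. This is a minimal sketch; `check_meta` is a hypothetical helper mirroring the field list above:

```python
import json

REQUIRED_KEYS = {'zarr_format', 'shape', 'chunks', 'dtype',
                 'compression', 'compression_opts', 'fill_value', 'order'}

def check_meta(doc):
    """Parse a 'meta' resource and verify the required fields are present."""
    meta = json.loads(doc)
    missing = REQUIRED_KEYS - set(meta)
    if missing:
        raise ValueError('missing required keys: %s' % sorted(missing))
    return meta

meta = check_meta('''{
    "chunks": [1000, 1000],
    "compression": "blosc",
    "compression_opts": {"clevel": 5, "cname": "lz4", "shuffle": 1},
    "dtype": "<f8",
    "fill_value": null,
    "order": "C",
    "shape": [10000, 10000],
    "zarr_format": 1
}''')
```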
Data type encoding

Simple data types are encoded within the array metadata resource as a string, following the NumPy array protocol type string (typestr) format. The format consists of three parts: a character describing the byteorder of the data (<: little-endian, >: big-endian, |: not-relevant), a character code giving the basic type of the array, and an integer providing the number of bytes the type uses. The byte order MUST be specified. E.g., "<f8", ">i4", "|b1" and "|S12" are valid data types.

Structured data types (i.e., with multiple named fields) are encoded as a list of two-element lists, following NumPy array protocol type descriptions (descr). For example, the JSON list [["r", "|u1"], ["g", "|u1"], ["b", "|u1"]] defines a data type composed of three single-byte unsigned integers labelled ‘r’, ‘g’ and ‘b’.
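NumPy can produce both encodings directly, assuming NumPy is available; the `.str` and `.descr` attributes are NumPy's, not part of this specification:

```python
import numpy as np

# simple data type -> typestr with explicit byte order
simple = np.dtype('<f8').str

# structured data type -> descr; tuples become lists when encoded as JSON
rgb = np.dtype([('r', '|u1'), ('g', '|u1'), ('b', '|u1')])
encoded = [list(field) for field in rgb.descr]
```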

Chunks

Each chunk of the array is compressed by passing the raw bytes for the chunk through the primary compression library to obtain a new sequence of bytes comprising the compressed chunk data. No header is added to the compressed bytes, nor is any other modification made. The internal structure of the compressed bytes will depend on which primary compressor was used. For example, the Blosc compressor produces a sequence of bytes that begins with a 16-byte header followed by compressed data.

The compressed sequence of bytes for each chunk is stored under a key formed from the index of the chunk within the grid of chunks representing the array. To form a string key for a chunk, the indices are converted to strings and concatenated with the period character (‘.’) separating each index. For example, given an array with shape (10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the key ‘0.0’; the chunk with indices (2, 4) provides data for rows 2000-3000 and columns 4000-5000 and is stored under the key ‘2.4’; etc.

There is no need for all chunks to be present within an array store. If a chunk is not present then it is considered to be in an uninitialized state. An uninitialized chunk MUST be treated as if it were uniformly filled with the value of the ‘fill_value’ field in the array metadata. If the ‘fill_value’ field is null then the contents of the chunk are undefined.

Note that all chunks in an array have the same shape. If the length of any array dimension is not exactly divisible by the length of the corresponding chunk dimension then some chunks will overhang the edge of the array. The contents of any chunk region falling outside the array are undefined.
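Putting the above together, the read path for a single chunk can be sketched as follows. This is a simplified illustration assuming zlib as the primary compressor, not Zarr's actual code; `chunk_key` and `read_chunk` are hypothetical helpers:

```python
import zlib
import numpy as np

def chunk_key(indices):
    """Form the store key for a chunk from its grid indices, e.g. (2, 4) -> '2.4'."""
    return '.'.join(str(i) for i in indices)

def read_chunk(store, indices, chunks, dtype, fill_value):
    """Decompress one chunk, or synthesize it from fill_value if uninitialized."""
    key = chunk_key(indices)
    if key not in store:
        # uninitialized chunk: treat as uniformly filled with fill_value
        return np.full(chunks, fill_value, dtype=dtype)
    raw = zlib.decompress(store[key])
    return np.frombuffer(raw, dtype=dtype).reshape(chunks)

# one initialized chunk in an in-memory store; the rest fall back to fill_value
store = {'0.0': zlib.compress(np.ones((10, 10), dtype='<i4').tobytes())}
present = read_chunk(store, (0, 0), (10, 10), '<i4', 42)
missing = read_chunk(store, (0, 1), (10, 10), '<i4', 42)
```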

Attributes

Each array can also be associated with custom attributes, which are simple key/value items with application-specific meaning. Custom attributes are encoded as a JSON object and stored under the ‘attrs’ key within an array store. Even if the attributes are empty, the ‘attrs’ key MUST be present within an array store.

For example, the JSON object below encodes three attributes named ‘foo’, ‘bar’ and ‘baz’:

{
    "foo": 42,
    "bar": "apples",
    "baz": [1, 2, 3, 4]
}
Example

Below is an example of storing a Zarr array, using a directory on the local file system as storage.

Initialize the store:

>>> import zarr
>>> store = zarr.DirectoryStore('example.zarr')
>>> zarr.init_store(store, shape=(20, 20), chunks=(10, 10),
...                 dtype='i4', fill_value=42, compression='zlib',
...                 compression_opts=1, overwrite=True)

No chunks are initialized yet, so only the ‘meta’ and ‘attrs’ keys have been set:

>>> import os
>>> sorted(os.listdir('example.zarr'))
['attrs', 'meta']

Inspect the array metadata:

>>> print(open('example.zarr/meta').read())
{
    "chunks": [
        10,
        10
    ],
    "compression": "zlib",
    "compression_opts": 1,
    "dtype": "<i4",
    "fill_value": 42,
    "order": "C",
    "shape": [
        20,
        20
    ],
    "zarr_format": 1
}

Inspect the array attributes:

>>> print(open('example.zarr/attrs').read())
{}

Set some data:

>>> z = zarr.Array(store)
>>> z[0:10, 0:10] = 1
>>> sorted(os.listdir('example.zarr'))
['0.0', 'attrs', 'meta']

Set some more data:

>>> z[0:10, 10:20] = 2
>>> z[10:20, :] = 3
>>> sorted(os.listdir('example.zarr'))
['0.0', '0.1', '1.0', '1.1', 'attrs', 'meta']

Manually decompress a single chunk for illustration:

>>> import zlib
>>> b = zlib.decompress(open('example.zarr/0.0', 'rb').read())
>>> import numpy as np
>>> a = np.frombuffer(b, dtype='<i4')
>>> a
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Modify the array attributes:

>>> z.attrs['foo'] = 42
>>> z.attrs['bar'] = 'apples'
>>> z.attrs['baz'] = [1, 2, 3, 4]
>>> print(open('example.zarr/attrs').read())
{
    "bar": "apples",
    "baz": [
        1,
        2,
        3,
        4
    ],
    "foo": 42
}

Release notes

1.0.0

This release includes a complete re-organization of the code base. The major version number has been bumped to indicate that there have been backwards-incompatible changes to the API and the on-disk storage format. However, Zarr is still in an early stage of development, so please do not take the version number as an indicator of maturity.

Storage

The main motivation for re-organizing the code was to create an abstraction layer between the core array logic and data storage (#21). In this release, any object that implements the MutableMapping interface can be used as an array store. See the tutorial sections on Persistent arrays and Storage alternatives, the Zarr storage specification version 1, and the zarr.storage module documentation for more information.

Please note also that the file organization and file name conventions used when storing a Zarr array in a directory on the file system have changed. Persistent Zarr arrays created using previous versions of the software will not be compatible with this version. See the zarr.storage API docs and the Zarr storage specification version 1 for more information.

Compression

An abstraction layer has also been created between the core array logic and the code for compressing and decompressing array chunks. This release still bundles the c-blosc library and uses Blosc as the default compressor, however other compressors including zlib, BZ2 and LZMA are also now supported via the Python standard library. New compressors can also be dynamically registered for use with Zarr. See the tutorial sections on Compression and Configuring Blosc, the Zarr storage specification version 1, and the zarr.compressors module documentation for more information.

Synchronization

The synchronization code has also been refactored to create a layer of abstraction, enabling Zarr arrays to be used in parallel computations with a number of alternative synchronization methods. For more information see the tutorial section on Parallel computing and synchronization and the zarr.sync module documentation.

Changes to the Blosc extension

NumPy is no longer a build dependency for the zarr.blosc Cython extension, so setup.py will run even if NumPy is not already installed, and should automatically install NumPy as a runtime dependency. Manual installation of NumPy prior to installing Zarr is still recommended, however, as the automatic installation of NumPy may fail or be sub-optimal on some platforms.

The zarr.blosc Cython extension is now optional and compilation will only be attempted on POSIX systems; other systems will fall back to a pure-Python installation (#25). On these systems only ‘zlib’, ‘bz2’ and ‘lzma’ compression will be available.

Some optimizations have been made within the zarr.blosc extension to avoid unnecessary memory copies, giving a ~10-20% performance improvement for multi-threaded compression operations.

The zarr.blosc extension now automatically detects whether it is running within a single-threaded or multi-threaded program and adapts its internal behaviour accordingly (#27). There is no need for the user to make any API calls to switch Blosc between contextual and non-contextual (global lock) mode. See also the tutorial section on Configuring Blosc.

Other changes

The internal code for managing chunks has been rewritten to be more efficient. Now no state is maintained for chunks outside of the array store, meaning that chunks do not carry any extra memory overhead not accounted for by the store. This removes the need for the “lazy” option present in the previous release, which has therefore been dropped.

The memory layout within chunks can now be set as either “C” (row-major) or “F” (column-major), which can help to provide better compression for some data (#7). See the tutorial section on Changing memory layout for more information.

A bug has been fixed within the __getitem__ and __setitem__ machinery for slicing arrays, to properly handle getting and setting partial slices.

Acknowledgments

Zarr bundles the c-blosc library and uses it as the default compressor.

Zarr is inspired by HDF5, h5py and bcolz.

Development of this package is supported by the MRC Centre for Genomics and Global Health.
