Zarr

Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays.

Highlights

  • Create N-dimensional arrays with any NumPy dtype.
  • Chunk arrays along any dimension.
  • Compress and/or filter chunks using any NumCodecs codec.
  • Store arrays in memory, on disk, inside a Zip file, on S3, …
  • Read an array concurrently from multiple threads or processes.
  • Write to an array concurrently from multiple threads or processes.
  • Organize arrays into hierarchies via groups.

Status

Zarr is still a young project. Feedback and bug reports are very welcome; please get in touch via the GitHub issue tracker. See Contributing to Zarr for further information.

Installation

Zarr depends on NumPy. It is generally best to install NumPy first using whatever method is most appropriate for your operating system and Python distribution. Other dependencies should be installed automatically if using one of the installation methods below.

Install Zarr from PyPI:

$ pip install zarr

Alternatively, install Zarr via conda:

$ conda install -c conda-forge zarr

To install the latest development version of Zarr, you can use pip with the latest GitHub master:

$ pip install git+https://github.com/zarr-developers/zarr-python.git

To work with Zarr source code in development, install from GitHub:

$ git clone --recursive https://github.com/zarr-developers/zarr-python.git
$ cd zarr-python
$ python setup.py install

To verify that Zarr has been fully installed, run the test suite:

$ pip install pytest
$ python -m pytest -v --pyargs zarr

Contents

Tutorial

Zarr provides classes and functions for working with N-dimensional arrays that behave like NumPy arrays but whose data is divided into chunks and each chunk is compressed. If you are already familiar with HDF5 then Zarr arrays provide similar functionality, but with some additional flexibility.

Creating an array

Zarr has several functions for creating arrays. For example:

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z
<zarr.core.Array (10000, 10000) int32>

The code above creates a 2-dimensional array of 32-bit integers with 10000 rows and 10000 columns, divided into chunks where each chunk has 1000 rows and 1000 columns (and so there will be 100 chunks in total).
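
The chunk shape and the total number of chunks can be inspected via the chunks and nchunks properties, e.g.:

>>> z.chunks
(1000, 1000)
>>> z.nchunks
100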

For a complete list of array creation routines see the zarr.creation module documentation.

Reading and writing data

Zarr arrays support a similar interface to NumPy arrays for reading and writing data. For example, the entire array can be filled with a scalar value:

>>> z[:] = 42

Regions of the array can also be written to, e.g.:

>>> import numpy as np
>>> z[0, :] = np.arange(10000)
>>> z[:, 0] = np.arange(10000)

The contents of the array can be retrieved by slicing, which will load the requested region into memory as a NumPy array, e.g.:

>>> z[0, 0]
0
>>> z[-1, -1]
42
>>> z[0, :]
array([   0,    1,    2, ..., 9997, 9998, 9999], dtype=int32)
>>> z[:, 0]
array([   0,    1,    2, ..., 9997, 9998, 9999], dtype=int32)
>>> z[:]
array([[   0,    1,    2, ..., 9997, 9998, 9999],
       [   1,   42,   42, ...,   42,   42,   42],
       [   2,   42,   42, ...,   42,   42,   42],
       ...,
       [9997,   42,   42, ...,   42,   42,   42],
       [9998,   42,   42, ...,   42,   42,   42],
       [9999,   42,   42, ...,   42,   42,   42]], dtype=int32)

Persistent arrays

In the examples above, compressed data for each chunk of the array was stored in main memory. Zarr arrays can also be stored on a file system, enabling persistence of data between sessions. For example:

>>> z1 = zarr.open('data/example.zarr', mode='w', shape=(10000, 10000),
...                chunks=(1000, 1000), dtype='i4')

The array above will store its configuration metadata and all compressed chunk data in a directory called ‘data/example.zarr’ relative to the current working directory. The zarr.convenience.open() function provides a convenient way to create a new persistent array or continue working with an existing array. Note that although the function is called “open”, there is no need to close an array: data are automatically flushed to disk, and files are automatically closed whenever an array is modified.

Persistent arrays support the same interface for reading and writing data, e.g.:

>>> z1[:] = 42
>>> z1[0, :] = np.arange(10000)
>>> z1[:, 0] = np.arange(10000)

Check that the data have been written and can be read again:

>>> z2 = zarr.open('data/example.zarr', mode='r')
>>> np.all(z1[:] == z2[:])
True

If you are just looking for a fast and convenient way to save NumPy arrays to disk and load them back into memory later, the functions zarr.convenience.save() and zarr.convenience.load() may be useful. E.g.:

>>> a = np.arange(10)
>>> zarr.save('data/example.zarr', a)
>>> zarr.load('data/example.zarr')
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Please note that there are a number of other options for persistent array storage; see the section on Storage alternatives below.

Resizing and appending

A Zarr array can be resized, which means that any of its dimensions can be increased or decreased in length. For example:

>>> z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
>>> z[:] = 42
>>> z.resize(20000, 10000)
>>> z.shape
(20000, 10000)

Note that when an array is resized, the underlying data are not rearranged in any way. If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store.
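
For example, shrinking the array from the example above deletes any chunks now falling outside the new shape:

>>> z.resize(10000, 5000)
>>> z.shape
(10000, 5000)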

For convenience, Zarr arrays also provide an append() method, which can be used to append data to any axis. E.g.:

>>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z = zarr.array(a, chunks=(1000, 100))
>>> z.shape
(10000, 1000)
>>> z.append(a)
(20000, 1000)
>>> z.append(np.vstack([a, a]), axis=1)
(20000, 2000)
>>> z.shape
(20000, 2000)

Compressors

A number of different compressors can be used with Zarr. A separate package called NumCodecs is available which provides a common interface to various compressor libraries including Blosc, Zstandard, LZ4, Zlib, BZ2 and LZMA. Different compressors can be provided via the compressor keyword argument accepted by all array creation functions. For example:

>>> from numcodecs import Blosc
>>> compressor = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)
>>> data = np.arange(100000000, dtype='i4').reshape(10000, 10000)
>>> z = zarr.array(data, chunks=(1000, 1000), compressor=compressor)
>>> z.compressor
Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0)

The array above will use Blosc as the primary compressor, using the Zstandard algorithm (compression level 3) internally within Blosc, and with the bit-shuffle filter applied.

When using a compressor, it can be useful to get some diagnostics on the compression ratio. Zarr arrays provide an info property which can be used to print some diagnostics, e.g.:

>>> z.info
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE,
                   : blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 3379344 (3.2M)
Storage ratio      : 118.4
Chunks initialized : 100/100

If you don’t specify a compressor, by default Zarr uses the Blosc compressor. Blosc is generally very fast and can be configured in a variety of ways to improve the compression ratio for different types of data. Blosc is in fact a “meta-compressor”, which means that it can use a number of different compression algorithms internally to compress the data. Blosc also provides highly optimized implementations of byte- and bit-shuffle filters, which can improve compression ratios for some data. A list of the internal compression libraries available within Blosc can be obtained via:

>>> from numcodecs import blosc
>>> blosc.list_compressors()
['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']

In addition to Blosc, other compression libraries can also be used. For example, here is an array using Zstandard compression, level 1:

>>> from numcodecs import Zstd
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
...                chunks=(1000, 1000), compressor=Zstd(level=1))
>>> z.compressor
Zstd(level=1)

Here is an example using LZMA with a custom filter pipeline including LZMA’s built-in delta filter:

>>> import lzma
>>> lzma_filters = [dict(id=lzma.FILTER_DELTA, dist=4),
...                 dict(id=lzma.FILTER_LZMA2, preset=1)]
>>> from numcodecs import LZMA
>>> compressor = LZMA(filters=lzma_filters)
>>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000),
...                chunks=(1000, 1000), compressor=compressor)
>>> z.compressor
LZMA(format=1, check=-1, preset=None, filters=[{'dist': 4, 'id': 3}, {'id': 33, 'preset': 1}])

The default compressor can be changed by setting the value of the zarr.storage.default_compressor variable, e.g.:

>>> import zarr.storage
>>> from numcodecs import Zstd, Blosc
>>> # switch to using Zstandard
... zarr.storage.default_compressor = Zstd(level=1)
>>> z = zarr.zeros(100000000, chunks=1000000)
>>> z.compressor
Zstd(level=1)
>>> # switch back to Blosc defaults
... zarr.storage.default_compressor = Blosc()

To disable compression, set compressor=None when creating an array, e.g.:

>>> z = zarr.zeros(100000000, chunks=1000000, compressor=None)
>>> z.compressor is None
True

Filters

In some cases, compression can be improved by transforming the data in some way. For example, if nearby values tend to be correlated, then shuffling the bytes within each numerical value or storing the difference between adjacent values may increase compression ratio. Some compressors provide built-in filters that apply transformations to the data prior to compression. For example, the Blosc compressor has built-in implementations of byte- and bit-shuffle filters, and the LZMA compressor has a built-in implementation of a delta filter. However, to provide additional flexibility for implementing and using filters in combination with different compressors, Zarr also provides a mechanism for configuring filters outside of the primary compressor.

Here is an example using a delta filter with the Blosc compressor:

>>> from numcodecs import Blosc, Delta
>>> filters = [Delta(dtype='i4')]
>>> compressor = Blosc(cname='zstd', clevel=1, shuffle=Blosc.SHUFFLE)
>>> data = np.arange(100000000, dtype='i4').reshape(10000, 10000)
>>> z = zarr.array(data, chunks=(1000, 1000), filters=filters, compressor=compressor)
>>> z.info
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Filter [0]         : Delta(dtype='<i4')
Compressor         : Blosc(cname='zstd', clevel=1, shuffle=SHUFFLE, blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 1290562 (1.2M)
Storage ratio      : 309.9
Chunks initialized : 100/100

For more information about available filter codecs, see the Numcodecs documentation.

Groups

Zarr supports hierarchical organization of arrays via groups. As with arrays, groups can be stored in memory, on disk, or via other storage systems that support a similar interface.

To create a group, use the zarr.group() function:

>>> root = zarr.group()
>>> root
<zarr.hierarchy.Group '/'>

Groups have a similar API to the Group class from h5py. For example, groups can contain other groups:

>>> foo = root.create_group('foo')
>>> bar = foo.create_group('bar')

Groups can also contain arrays, e.g.:

>>> z1 = bar.zeros('baz', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z1
<zarr.core.Array '/foo/bar/baz' (10000, 10000) int32>

Arrays are known as “datasets” in HDF5 terminology. For compatibility with h5py, Zarr groups also implement the create_dataset() and require_dataset() methods, e.g.:

>>> z = bar.create_dataset('quux', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z
<zarr.core.Array '/foo/bar/quux' (10000, 10000) int32>

Members of a group can be accessed via the suffix notation, e.g.:

>>> root['foo']
<zarr.hierarchy.Group '/foo'>

The ‘/’ character can be used to access multiple levels of the hierarchy in one call, e.g.:

>>> root['foo/bar']
<zarr.hierarchy.Group '/foo/bar'>
>>> root['foo/bar/baz']
<zarr.core.Array '/foo/bar/baz' (10000, 10000) int32>

The zarr.hierarchy.Group.tree() method can be used to print a tree representation of the hierarchy, e.g.:

>>> root.tree()
/
 └── foo
     └── bar
         ├── baz (10000, 10000) int32
         └── quux (10000, 10000) int32

The zarr.convenience.open() function provides a convenient way to create or re-open a group stored in a directory on the file-system, with sub-groups stored in sub-directories, e.g.:

>>> root = zarr.open('data/group.zarr', mode='w')
>>> root
<zarr.hierarchy.Group '/'>
>>> z = root.zeros('foo/bar/baz', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z
<zarr.core.Array '/foo/bar/baz' (10000, 10000) int32>

Groups can be used as context managers (in a with statement). If the underlying store has a close method, it will be called on exit.
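
For example, here is a minimal sketch using a Zip file store (the file name is illustrative; the store’s close() method is called automatically on exit):

>>> store = zarr.ZipStore('data/ctx.zip', mode='w')
>>> with zarr.group(store=store) as root:
...     z = root.zeros('foo', shape=(100,), chunks=(10,), dtype='i4')
...     z[:] = 42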

For more information on groups see the zarr.hierarchy and zarr.convenience API docs.

Array and group diagnostics

Diagnostic information about arrays and groups is available via the info property. E.g.:

>>> root = zarr.group()
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=1000000, chunks=100000, dtype='i8')
>>> bar[:] = 42
>>> baz = foo.zeros('baz', shape=(1000, 1000), chunks=(100, 100), dtype='f4')
>>> baz[:] = 4.2
>>> root.info
Name        : /
Type        : zarr.hierarchy.Group
Read-only   : False
Store type  : zarr.storage.MemoryStore
No. members : 1
No. arrays  : 0
No. groups  : 1
Groups      : foo

>>> foo.info
Name        : /foo
Type        : zarr.hierarchy.Group
Read-only   : False
Store type  : zarr.storage.MemoryStore
No. members : 2
No. arrays  : 2
No. groups  : 0
Arrays      : bar, baz

>>> bar.info
Name               : /foo/bar
Type               : zarr.core.Array
Data type          : int64
Shape              : (1000000,)
Chunk shape        : (100000,)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.MemoryStore
No. bytes          : 8000000 (7.6M)
No. bytes stored   : 33240 (32.5K)
Storage ratio      : 240.7
Chunks initialized : 10/10

>>> baz.info
Name               : /foo/baz
Type               : zarr.core.Array
Data type          : float32
Shape              : (1000, 1000)
Chunk shape        : (100, 100)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.MemoryStore
No. bytes          : 4000000 (3.8M)
No. bytes stored   : 23943 (23.4K)
Storage ratio      : 167.1
Chunks initialized : 100/100

Groups also have the zarr.hierarchy.Group.tree() method, e.g.:

>>> root.tree()
/
 └── foo
     ├── bar (1000000,) int64
     └── baz (1000, 1000) float32

If you’re using Zarr within a Jupyter notebook (requires ipytree), calling tree() will generate an interactive tree representation, see the repr_tree.ipynb notebook for more examples.

User attributes

Zarr arrays and groups support custom key/value attributes, which can be useful for storing application-specific metadata. For example:

>>> root = zarr.group()
>>> root.attrs['foo'] = 'bar'
>>> z = root.zeros('zzz', shape=(10000, 10000))
>>> z.attrs['baz'] = 42
>>> z.attrs['qux'] = [1, 4, 7, 12]
>>> sorted(root.attrs)
['foo']
>>> 'foo' in root.attrs
True
>>> root.attrs['foo']
'bar'
>>> sorted(z.attrs)
['baz', 'qux']
>>> z.attrs['baz']
42
>>> z.attrs['qux']
[1, 4, 7, 12]

Internally Zarr uses JSON to store array attributes, so attribute values must be JSON serializable.
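
For instance, when storing values derived from NumPy it is safest to convert them to plain Python types first (a small illustration):

>>> value = np.int64(7)
>>> z.attrs['count'] = int(value)  # convert to a plain Python int to be safe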

Advanced indexing

As of version 2.2, Zarr arrays support several methods for advanced or “fancy” indexing, which enable a subset of data items to be extracted or updated in an array without loading the entire array into memory.

Note that although this functionality is similar to some of the advanced indexing capabilities available on NumPy arrays and on h5py datasets, the Zarr API for advanced indexing is different from both NumPy and h5py, so please read this section carefully. For a complete description of the indexing API, see the documentation for the zarr.core.Array class.

Indexing with coordinate arrays

Items from a Zarr array can be extracted by providing an integer array of coordinates. E.g.:

>>> z = zarr.array(np.arange(10))
>>> z[:]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> z.get_coordinate_selection([1, 4])
array([1, 4])

Coordinate arrays can also be used to update data, e.g.:

>>> z.set_coordinate_selection([1, 4], [-1, -2])
>>> z[:]
array([ 0, -1,  2,  3, -2,  5,  6,  7,  8,  9])

For multidimensional arrays, coordinates must be provided for each dimension, e.g.:

>>> z = zarr.array(np.arange(15).reshape(3, 5))
>>> z[:]
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> z.get_coordinate_selection(([0, 2], [1, 3]))
array([ 1, 13])
>>> z.set_coordinate_selection(([0, 2], [1, 3]), [-1, -2])
>>> z[:]
array([[ 0, -1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, -2, 14]])

For convenience, coordinate indexing is also available via the vindex property, e.g.:

>>> z.vindex[[0, 2], [1, 3]]
array([-1, -2])
>>> z.vindex[[0, 2], [1, 3]] = [-3, -4]
>>> z[:]
array([[ 0, -3,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, -4, 14]])

Indexing with a mask array

Items can also be extracted by providing a Boolean mask. E.g.:

>>> z = zarr.array(np.arange(10))
>>> z[:]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> sel = np.zeros_like(z, dtype=bool)
>>> sel[1] = True
>>> sel[4] = True
>>> z.get_mask_selection(sel)
array([1, 4])
>>> z.set_mask_selection(sel, [-1, -2])
>>> z[:]
array([ 0, -1,  2,  3, -2,  5,  6,  7,  8,  9])

Here’s a multidimensional example:

>>> z = zarr.array(np.arange(15).reshape(3, 5))
>>> z[:]
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> sel = np.zeros_like(z, dtype=bool)
>>> sel[0, 1] = True
>>> sel[2, 3] = True
>>> z.get_mask_selection(sel)
array([ 1, 13])
>>> z.set_mask_selection(sel, [-1, -2])
>>> z[:]
array([[ 0, -1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, -2, 14]])

For convenience, mask indexing is also available via the vindex property, e.g.:

>>> z.vindex[sel]
array([-1, -2])
>>> z.vindex[sel] = [-3, -4]
>>> z[:]
array([[ 0, -3,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, -4, 14]])

Mask indexing is conceptually the same as coordinate indexing, and is implemented internally via the same machinery. Both styles of indexing allow selecting arbitrary items from an array, also known as point selection.

Orthogonal indexing

Zarr arrays also support methods for orthogonal indexing, which allows selections to be made along each dimension of an array independently. For example, this allows selecting a subset of rows and/or columns from a 2-dimensional array. E.g.:

>>> z = zarr.array(np.arange(15).reshape(3, 5))
>>> z[:]
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> z.get_orthogonal_selection(([0, 2], slice(None)))  # select first and third rows
array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14]])
>>> z.get_orthogonal_selection((slice(None), [1, 3]))  # select second and fourth columns
array([[ 1,  3],
       [ 6,  8],
       [11, 13]])
>>> z.get_orthogonal_selection(([0, 2], [1, 3]))       # select rows [0, 2] and columns [1, 3]
array([[ 1,  3],
       [11, 13]])

Data can also be modified, e.g.:

>>> z.set_orthogonal_selection(([0, 2], [1, 3]), [[-1, -2], [-3, -4]])
>>> z[:]
array([[ 0, -1,  2, -2,  4],
       [ 5,  6,  7,  8,  9],
       [10, -3, 12, -4, 14]])

For convenience, the orthogonal indexing functionality is also available via the oindex property, e.g.:

>>> z = zarr.array(np.arange(15).reshape(3, 5))
>>> z.oindex[[0, 2], :]  # select first and third rows
array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14]])
>>> z.oindex[:, [1, 3]]  # select second and fourth columns
array([[ 1,  3],
       [ 6,  8],
       [11, 13]])
>>> z.oindex[[0, 2], [1, 3]]  # select rows [0, 2] and columns [1, 3]
array([[ 1,  3],
       [11, 13]])
>>> z.oindex[[0, 2], [1, 3]] = [[-1, -2], [-3, -4]]
>>> z[:]
array([[ 0, -1,  2, -2,  4],
       [ 5,  6,  7,  8,  9],
       [10, -3, 12, -4, 14]])

Any combination of integer, slice, 1D integer array and/or 1D Boolean array can be used for orthogonal indexing.
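
For example, a 1D Boolean array can be combined with a slice:

>>> z = zarr.array(np.arange(15).reshape(3, 5))
>>> z.oindex[np.array([True, False, True]), 1:4]
array([[ 1,  2,  3],
       [11, 12, 13]])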

Indexing fields in structured arrays

All selection methods support a fields parameter which allows retrieving or replacing data for a specific field in an array with a structured dtype. E.g.:

>>> a = np.array([(b'aaa', 1, 4.2),
...               (b'bbb', 2, 8.4),
...               (b'ccc', 3, 12.6)],
...              dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')])
>>> z = zarr.array(a)
>>> z['foo']
array([b'aaa', b'bbb', b'ccc'],
      dtype='|S3')
>>> z['baz']
array([  4.2,   8.4,  12.6])
>>> z.get_basic_selection(slice(0, 2), fields='bar')
array([1, 2], dtype=int32)
>>> z.get_coordinate_selection([0, 2], fields=['foo', 'baz'])
array([(b'aaa',   4.2), (b'ccc',  12.6)],
      dtype=[('foo', 'S3'), ('baz', '<f8')])

Storage alternatives

Zarr can use any object that implements the MutableMapping interface from the collections.abc module in the Python standard library as the store for a group or an array.
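
For example, a plain dict can serve as a store (a quick sketch; array metadata is kept under a reserved key alongside the chunk data):

>>> store = dict()
>>> z = zarr.create(store=store, shape=(100,), chunks=(10,), dtype='i4')
>>> z[:] = 42
>>> '.zarray' in store
True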

Some pre-defined storage classes are provided in the zarr.storage module. For example, the zarr.storage.DirectoryStore class provides a MutableMapping interface to a directory on the local file system. This is used under the hood by the zarr.convenience.open() function. In other words, the following code:

>>> z = zarr.open('data/example.zarr', mode='w', shape=1000000, dtype='i4')

…is short-hand for:

>>> store = zarr.DirectoryStore('data/example.zarr')
>>> z = zarr.create(store=store, overwrite=True, shape=1000000, dtype='i4')

…and the following code:

>>> root = zarr.open('data/example.zarr', mode='w')

…is short-hand for:

>>> store = zarr.DirectoryStore('data/example.zarr')
>>> root = zarr.group(store=store, overwrite=True)

Any other compatible storage class could be used in place of zarr.storage.DirectoryStore in the code examples above. For example, here is an array stored directly into a Zip file, via the zarr.storage.ZipStore class:

>>> store = zarr.ZipStore('data/example.zip', mode='w')
>>> root = zarr.group(store=store)
>>> z = root.zeros('foo/bar', shape=(1000, 1000), chunks=(100, 100), dtype='i4')
>>> z[:] = 42
>>> store.close()

Re-open and check that data have been written:

>>> store = zarr.ZipStore('data/example.zip', mode='r')
>>> root = zarr.group(store=store)
>>> z = root['foo/bar']
>>> z[:]
array([[42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       ...,
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42],
       [42, 42, 42, ..., 42, 42, 42]], dtype=int32)
>>> store.close()

Note that there are some limitations on how Zip files can be used, because items within a Zip file cannot be updated in place. This means that data in the array should only be written once and write operations should be aligned with chunk boundaries. Note also that the close() method must be called after writing any data to the store, otherwise essential records will not be written to the underlying zip file.

Another storage alternative is the zarr.storage.DBMStore class, added in Zarr version 2.2. This class allows any DBM-style database to be used for storing an array or group. Here is an example using a Berkeley DB B-tree database for storage (requires bsddb3 to be installed):

>>> import bsddb3
>>> store = zarr.DBMStore('data/example.bdb', open=bsddb3.btopen)
>>> root = zarr.group(store=store, overwrite=True)
>>> z = root.zeros('foo/bar', shape=(1000, 1000), chunks=(100, 100), dtype='i4')
>>> z[:] = 42
>>> store.close()

Also added in Zarr version 2.2 is the zarr.storage.LMDBStore class which enables the lightning memory-mapped database (LMDB) to be used for storing an array or group (requires lmdb to be installed):

>>> store = zarr.LMDBStore('data/example.lmdb')
>>> root = zarr.group(store=store, overwrite=True)
>>> z = root.zeros('foo/bar', shape=(1000, 1000), chunks=(100, 100), dtype='i4')
>>> z[:] = 42
>>> store.close()

Added in Zarr version 2.3 is the zarr.storage.SQLiteStore class, which enables a SQLite database to be used for storing an array or group (requires that Python is built with SQLite support):

>>> store = zarr.SQLiteStore('data/example.sqldb')
>>> root = zarr.group(store=store, overwrite=True)
>>> z = root.zeros('foo/bar', shape=(1000, 1000), chunks=(100, 100), dtype='i4')
>>> z[:] = 42
>>> store.close()

Also added in Zarr version 2.3 are two storage classes for interfacing with server-client databases. The zarr.storage.RedisStore class interfaces with Redis (an in-memory data structure store), and the zarr.storage.MongoDBStore class interfaces with MongoDB (an object-oriented NoSQL database). These stores respectively require the redis-py and pymongo packages to be installed.

For compatibility with the N5 data format, Zarr also provides an N5 backend (this is currently an experimental feature). Similar to the Zip storage class, a zarr.n5.N5Store can be instantiated directly:

>>> store = zarr.N5Store('data/example.n5')
>>> root = zarr.group(store=store)
>>> z = root.zeros('foo/bar', shape=(1000, 1000), chunks=(100, 100), dtype='i4')
>>> z[:] = 42

For convenience, the N5 backend will automatically be chosen when the filename ends with .n5:

>>> root = zarr.open('data/example.n5', mode='w')

Distributed/cloud storage

It is also possible to use distributed storage systems. The Dask project has implementations of the MutableMapping interface for Amazon S3 (S3Map), Hadoop Distributed File System (HDFSMap) and Google Cloud Storage (GCSMap), which can be used with Zarr.

Here is an example using S3Map to read an array created previously:

>>> import s3fs
>>> import zarr
>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
>>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
>>> root = zarr.group(store=store)
>>> z = root['foo/bar/baz']
>>> z
<zarr.core.Array '/foo/bar/baz' (21,) |S1>
>>> z.info
Name               : /foo/bar/baz
Type               : zarr.core.Array
Data type          : |S1
Shape              : (21,)
Chunk shape        : (7,)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : fsspec.mapping.FSMap
No. bytes          : 21
Chunks initialized : 3/3
>>> z[:]
array([b'H', b'e', b'l', b'l', b'o', b' ', b'f', b'r', b'o', b'm', b' ',
       b't', b'h', b'e', b' ', b'c', b'l', b'o', b'u', b'd', b'!'],
      dtype='|S1')
>>> z[:].tostring()
b'Hello from the cloud!'

Zarr now also has a built-in storage backend for Azure Blob Storage. The class is zarr.storage.ABSStore (requires azure-storage-blob to be installed):

>>> store = zarr.ABSStore(container='test', prefix='zarr-testing', blob_service_kwargs={'is_emulated': True})  
>>> root = zarr.group(store=store, overwrite=True)  
>>> z = root.zeros('foo/bar', shape=(1000, 1000), chunks=(100, 100), dtype='i4')  
>>> z[:] = 42  

When using an actual storage account, provide account_name and account_key arguments to zarr.storage.ABSStore; the example above is just testing against the emulator. Please also note that this is an experimental feature.

Note that retrieving data from a remote service via the network can be significantly slower than retrieving data from a local file system, and will depend on network latency and bandwidth between the client and server systems. If you are experiencing poor performance, there are several things you can try. One option is to increase the array chunk size, which will reduce the number of chunks and thus reduce the number of network round-trips required to retrieve data for an array (and thus reduce the impact of network latency). Another option is to try to increase the compression ratio by changing compression options or trying a different compressor (which will reduce the impact of limited network bandwidth).

As of version 2.2, Zarr also provides the zarr.storage.LRUStoreCache which can be used to implement a local in-memory cache layer over a remote store. E.g.:

>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
>>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
>>> cache = zarr.LRUStoreCache(store, max_size=2**28)
>>> root = zarr.group(store=cache)
>>> z = root['foo/bar/baz']
>>> from timeit import timeit
>>> # first data access is relatively slow, retrieved from store
... timeit('print(z[:].tostring())', number=1, globals=globals())  
b'Hello from the cloud!'
0.1081731989979744
>>> # second data access is faster, uses cache
... timeit('print(z[:].tostring())', number=1, globals=globals())  
b'Hello from the cloud!'
0.0009490990014455747

If you are still experiencing poor performance with distributed/cloud storage, please raise an issue on the GitHub issue tracker with any profiling data you can provide, as there may be opportunities to optimise further either within Zarr or within the mapping interface to the storage.

Consolidating metadata

(This is an experimental feature.)

Since there is a significant overhead for every connection to a cloud object store such as S3, the pattern described in the previous section may incur significant latency while scanning the metadata of the dataset hierarchy, even though each individual metadata object is small. For cases such as these, once the data are static and can be regarded as read-only, at least for the metadata/structure of the dataset hierarchy, the many metadata objects can be consolidated into a single one via zarr.convenience.consolidate_metadata(). Doing this can greatly increase the speed of reading the dataset metadata, e.g.:

>>> zarr.consolidate_metadata(store)  

This creates a special key with a copy of all of the metadata from all of the metadata objects in the store.

Later, to open a Zarr store with consolidated metadata, use zarr.convenience.open_consolidated(), e.g.:

>>> root = zarr.open_consolidated(store)  

This uses the special key to read all of the metadata in a single call to the backend storage.

Note that the hierarchy could still be opened in the normal way and altered, causing the consolidated metadata to become out of sync with the real state of the dataset hierarchy. In this case, zarr.convenience.consolidate_metadata() would need to be called again.

To protect against consolidated metadata accidentally getting out of sync, the root group returned by zarr.convenience.open_consolidated() is read-only for the metadata, meaning that no new groups or arrays can be created, and arrays cannot be resized. However, data values within arrays can still be updated.
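
For example, continuing with the root group opened above (a sketch; ‘foo/bar’ stands in for a hypothetical array within the consolidated store):

>>> z = root['foo/bar']  
>>> z[:] = 42  # updating data values is still allowed  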

Copying/migrating data

If you have some data in an HDF5 file and would like to copy some or all of it into a Zarr group, or vice-versa, the zarr.convenience.copy() and zarr.convenience.copy_all() functions can be used. Here’s an example copying a group named ‘foo’ from an HDF5 file to a Zarr group:

>>> import h5py
>>> import zarr
>>> import numpy as np
>>> source = h5py.File('data/example.h5', mode='w')
>>> foo = source.create_group('foo')
>>> baz = foo.create_dataset('bar/baz', data=np.arange(100), chunks=(50,))
>>> spam = source.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
>>> zarr.tree(source)
/
 ├── foo
 │   └── bar
 │       └── baz (100,) int64
 └── spam (100,) int64
>>> dest = zarr.open_group('data/example.zarr', mode='w')
>>> from sys import stdout
>>> zarr.copy(source['foo'], dest, log=stdout)
copy /foo
copy /foo/bar
copy /foo/bar/baz (100,) int64
all done: 3 copied, 0 skipped, 800 bytes copied
(3, 0, 800)
>>> dest.tree()  # N.B., no spam
/
 └── foo
     └── bar
         └── baz (100,) int64
>>> source.close()

If rather than copying a single group or dataset you would like to copy all groups and datasets, use zarr.convenience.copy_all(), e.g.:

>>> source = h5py.File('data/example.h5', mode='r')
>>> dest = zarr.open_group('data/example2.zarr', mode='w')
>>> zarr.copy_all(source, dest, log=stdout)
copy /foo
copy /foo/bar
copy /foo/bar/baz (100,) int64
copy /spam (100,) int64
all done: 4 copied, 0 skipped, 1,600 bytes copied
(4, 0, 1600)
>>> dest.tree()
/
 ├── foo
 │   └── bar
 │       └── baz (100,) int64
 └── spam (100,) int64

If you need to copy data between two Zarr groups, the zarr.convenience.copy() and zarr.convenience.copy_all() functions can be used and provide the most flexibility. However, if you want to copy data in the most efficient way possible, without changing any configuration options, the zarr.convenience.copy_store() function can be used. This function copies data directly between the underlying stores, without any decompression or re-compression, and so should be faster. E.g.:

>>> import zarr
>>> import numpy as np
>>> store1 = zarr.DirectoryStore('data/example.zarr')
>>> root = zarr.group(store1, overwrite=True)
>>> baz = root.create_dataset('foo/bar/baz', data=np.arange(100), chunks=(50,))
>>> spam = root.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
>>> root.tree()
/
 ├── foo
 │   └── bar
 │       └── baz (100,) int64
 └── spam (100,) int64
>>> from sys import stdout
>>> store2 = zarr.ZipStore('data/example.zip', mode='w')
>>> zarr.copy_store(store1, store2, log=stdout)
copy .zgroup
copy foo/.zgroup
copy foo/bar/.zgroup
copy foo/bar/baz/.zarray
copy foo/bar/baz/0
copy foo/bar/baz/1
copy spam/.zarray
copy spam/0
copy spam/1
copy spam/2
copy spam/3
all done: 11 copied, 0 skipped, 1,138 bytes copied
(11, 0, 1138)
>>> new_root = zarr.group(store2)
>>> new_root.tree()
/
 ├── foo
 │   └── bar
 │       └── baz (100,) int64
 └── spam (100,) int64
>>> new_root['foo/bar/baz'][:]
array([ 0,  1,  2,  ..., 97, 98, 99])
>>> store2.close()  # zip stores need to be closed

String arrays

There are several options for storing arrays of strings.

If your strings are all ASCII strings, and you know the maximum length of the strings in your dataset, then you can use an array with a fixed-length bytes dtype. E.g.:

>>> z = zarr.zeros(10, dtype='S6')
>>> z
<zarr.core.Array (10,) |S6>
>>> z[0] = b'Hello'
>>> z[1] = b'world!'
>>> z[:]
array([b'Hello', b'world!', b'', b'', b'', b'', b'', b'', b'', b''],
      dtype='|S6')

A fixed-length unicode dtype is also available, e.g.:

>>> greetings = ['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
...              'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!',
...              'こんにちは世界', '世界,你好!', 'Helló, világ!', 'Zdravo svete!',
...              'เฮลโลเวิลด์']
>>> text_data = greetings * 10000
>>> z = zarr.array(text_data, dtype='U20')
>>> z
<zarr.core.Array (120000,) <U20>
>>> z[:]
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
       'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
      dtype='<U20')

For variable-length strings, the object dtype can be used, but a codec must be provided to encode the data (see also Object arrays below). At the time of writing there are four codecs available that can encode variable length string objects: numcodecs.VLenUTF8, numcodecs.JSON, numcodecs.MsgPack and numcodecs.Pickle. E.g. using VLenUTF8:

>>> import numcodecs
>>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.VLenUTF8())
>>> z
<zarr.core.Array (120000,) object>
>>> z.filters
[VLenUTF8()]
>>> z[:]
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
       'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

As a convenience, dtype=str (or dtype=unicode on Python 2.7) can be used, which is a short-hand for dtype=object, object_codec=numcodecs.VLenUTF8(), e.g.:

>>> z = zarr.array(text_data, dtype=str)
>>> z
<zarr.core.Array (120000,) object>
>>> z.filters
[VLenUTF8()]
>>> z[:]
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
       'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

Variable-length byte strings are also supported via dtype=object. Again an object_codec is required, which can be one of numcodecs.VLenBytes or numcodecs.Pickle. For convenience, dtype=bytes (or dtype=str on Python 2.7) can be used as a short-hand for dtype=object, object_codec=numcodecs.VLenBytes(), e.g.:

>>> bytes_data = [g.encode('utf-8') for g in greetings] * 10000
>>> z = zarr.array(bytes_data, dtype=bytes)
>>> z
<zarr.core.Array (120000,) object>
>>> z.filters
[VLenBytes()]
>>> z[:]
array([b'\xc2\xa1Hola mundo!', b'Hej V\xc3\xa4rlden!', b'Servus Woid!',
       ..., b'Hell\xc3\xb3, vil\xc3\xa1g!', b'Zdravo svete!',
       b'\xe0\xb9\x80\xe0\xb8\xae\xe0\xb8\xa5\xe0\xb9\x82\xe0\xb8\xa5\xe0\xb9\x80\xe0\xb8\xa7\xe0\xb8\xb4\xe0\xb8\xa5\xe0\xb8\x94\xe0\xb9\x8c'], dtype=object)

If you know ahead of time all the possible string values that can occur, you could also use the numcodecs.Categorize codec to encode each unique string value as an integer. E.g.:

>>> categorize = numcodecs.Categorize(greetings, dtype=object)
>>> z = zarr.array(text_data, dtype=object, object_codec=categorize)
>>> z
<zarr.core.Array (120000,) object>
>>> z.filters
[Categorize(dtype='|O', astype='|u1', labels=['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...])]
>>> z[:]
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
       'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

Object arrays

Zarr supports arrays with an “object” dtype. This allows arrays to contain any type of object, such as variable length unicode strings, or variable length arrays of numbers, or other possibilities. When creating an object array, a codec must be provided via the object_codec argument. This codec handles encoding (serialization) of Python objects. The best codec to use will depend on what type of objects are present in the array.

At the time of writing there are three codecs available that can serve as a general purpose object codec and support encoding of a mixture of object types: numcodecs.JSON, numcodecs.MsgPack and numcodecs.Pickle.

For example, using the JSON codec:

>>> z = zarr.empty(5, dtype=object, object_codec=numcodecs.JSON())
>>> z[0] = 42
>>> z[1] = 'foo'
>>> z[2] = ['bar', 'baz', 'qux']
>>> z[3] = {'a': 1, 'b': 2.2}
>>> z[:]
array([42, 'foo', list(['bar', 'baz', 'qux']), {'a': 1, 'b': 2.2}, None], dtype=object)

Not all codecs support encoding of all object types. The numcodecs.Pickle codec is the most flexible, supporting encoding any type of Python object. However, if you are sharing data with anyone other than yourself, then Pickle is not recommended as it is a potential security risk. This is because malicious code can be embedded within pickled data. The JSON and MsgPack codecs do not have any security issues and support encoding of unicode strings, lists and dictionaries. MsgPack is usually faster for both encoding and decoding.

Ragged arrays

If you need to store an array of arrays, where each member array can be of any length and stores the same primitive type (a.k.a. a ragged array), the numcodecs.VLenArray codec can be used, e.g.:

>>> z = zarr.empty(4, dtype=object, object_codec=numcodecs.VLenArray(int))
>>> z
<zarr.core.Array (4,) object>
>>> z.filters
[VLenArray(dtype='<i8')]
>>> z[0] = np.array([1, 3, 5])
>>> z[1] = np.array([4])
>>> z[2] = np.array([7, 9, 14])
>>> z[:]
array([array([1, 3, 5]), array([4]), array([ 7,  9, 14]),
       array([], dtype=int64)], dtype=object)

As a convenience, dtype='array:T' can be used as a short-hand for dtype=object, object_codec=numcodecs.VLenArray('T'), where ‘T’ can be any NumPy primitive dtype such as ‘i4’ or ‘f8’. E.g.:

>>> z = zarr.empty(4, dtype='array:i8')
>>> z
<zarr.core.Array (4,) object>
>>> z.filters
[VLenArray(dtype='<i8')]
>>> z[0] = np.array([1, 3, 5])
>>> z[1] = np.array([4])
>>> z[2] = np.array([7, 9, 14])
>>> z[:]
array([array([1, 3, 5]), array([4]), array([ 7,  9, 14]),
       array([], dtype=int64)], dtype=object)

Chunk optimizations

Chunk size and shape

In general, chunks of at least 1 megabyte (1M) uncompressed size seem to provide better performance, at least when using the Blosc compression library.
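
For example, the uncompressed size of each chunk can be checked by multiplying the chunk shape by the dtype itemsize:

>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z.chunks[0] * z.chunks[1] * z.dtype.itemsize  # bytes per chunk
4000000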

The optimal chunk shape will depend on how you want to access the data. E.g., for a 2-dimensional array, if you only ever take slices along the first dimension, then chunk across the second dimension. If you know you want to chunk across an entire dimension you can use None or -1 within the chunks argument, e.g.:

>>> z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
>>> z1.chunks
(100, 10000)

Alternatively, if you only ever take slices along the second dimension, then chunk across the first dimension, e.g.:

>>> z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
>>> z2.chunks
(10000, 100)

If you require reasonable performance for both access patterns then you need to find a compromise, e.g.:

>>> z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z3.chunks
(1000, 1000)

If you are feeling lazy, you can let Zarr guess a chunk shape for your data by providing chunks=True, although please note that the algorithm for guessing a chunk shape is based on simple heuristics and may be far from optimal. E.g.:

>>> z4 = zarr.zeros((10000, 10000), chunks=True, dtype='i4')
>>> z4.chunks
(625, 625)

If you know you are always going to be loading the entire array into memory, you can turn off chunks by providing chunks=False, in which case there will be one single chunk for the array:

>>> z5 = zarr.zeros((10000, 10000), chunks=False, dtype='i4')
>>> z5.chunks
(10000, 10000)

Chunk memory layout

The order of bytes within each chunk of an array can be changed via the order keyword argument, to use either C or Fortran layout. For multi-dimensional arrays, these two layouts may provide different compression ratios, depending on the correlation structure within the data. E.g.:

>>> a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
>>> c = zarr.array(a, chunks=(1000, 1000))
>>> c.info
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 6696010 (6.4M)
Storage ratio      : 59.7
Chunks initialized : 100/100
>>> f = zarr.array(a, chunks=(1000, 1000), order='F')
>>> f.info
Type               : zarr.core.Array
Data type          : int32
Shape              : (10000, 10000)
Chunk shape        : (1000, 1000)
Order              : F
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : builtins.dict
No. bytes          : 400000000 (381.5M)
No. bytes stored   : 4684636 (4.5M)
Storage ratio      : 85.4
Chunks initialized : 100/100

In the above example, Fortran order gives a better compression ratio. This is an artificial example but illustrates the general point that changing the order of bytes within chunks of an array may improve the compression ratio, depending on the structure of the data, the compression algorithm used, and which compression filters (e.g., byte-shuffle) have been applied.

Parallel computing and synchronization

Zarr arrays have been designed for use as the source or sink for data in parallel computations. By data source we mean that multiple concurrent read operations may occur. By data sink we mean that multiple concurrent write operations may occur, with each writer updating a different region of the array. Zarr arrays have not been designed for situations where multiple readers and writers are concurrently operating on the same array.

Both multi-threaded and multi-process parallelism are possible. The bottleneck for most storage and retrieval operations is compression/decompression, and the Python global interpreter lock (GIL) is released wherever possible during these operations, so Zarr will generally not block other Python threads from running.

When using a Zarr array as a data sink, some synchronization (locking) may be required to avoid data loss, depending on how data are being updated. If each worker in a parallel computation is writing to a separate region of the array, and if region boundaries are perfectly aligned with chunk boundaries, then no synchronization is required. However, if region and chunk boundaries are not perfectly aligned, then synchronization is required to avoid two workers attempting to modify the same chunk at the same time, which could result in data loss.

To give a simple example, consider a 1-dimensional array of length 60, z, divided into three chunks of 20 elements each. If three workers are running and each attempts to write to a 20 element region (i.e., z[0:20], z[20:40] and z[40:60]) then each worker will be writing to a separate chunk and no synchronization is required. However, if two workers are running and each attempts to write to a 30 element region (i.e., z[0:30] and z[30:60]) then it is possible both workers will attempt to modify the middle chunk at the same time, and synchronization is required to prevent data loss.
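
Expressed in code, the aligned case looks as follows (a sketch; each slice covers whole chunks, so no synchronizer is needed):

>>> z = zarr.zeros(60, chunks=20, dtype='i4')
>>> z[0:20] = 1   # worker 1: exactly chunk 0
>>> z[20:40] = 2  # worker 2: exactly chunk 1
>>> z[40:60] = 3  # worker 3: exactly chunk 2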

Zarr provides support for chunk-level synchronization. E.g., create an array with thread synchronization:

>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
...                 synchronizer=zarr.ThreadSynchronizer())
>>> z
<zarr.core.Array (10000, 10000) int32>

This array is safe to read or write within a multi-threaded program.
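
For example, here is a sketch of several threads each filling a chunk-aligned block of rows in the array created above (the fill function is illustrative, not part of the Zarr API):

>>> from concurrent.futures import ThreadPoolExecutor
>>> def fill(i):
...     z[i * 1000:(i + 1) * 1000, :] = i  # each block aligns with chunk rows
...
>>> with ThreadPoolExecutor(max_workers=4) as pool:
...     _ = list(pool.map(fill, range(10)))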

Zarr also provides support for process synchronization via file locking, provided that all processes have access to a shared file system, and provided that the underlying file system supports file locking (which is not the case for some networked file systems). E.g.:

>>> synchronizer = zarr.ProcessSynchronizer('data/example.sync')
>>> z = zarr.open_array('data/example', mode='w', shape=(10000, 10000),
...                     chunks=(1000, 1000), dtype='i4',
...                     synchronizer=synchronizer)
>>> z
<zarr.core.Array (10000, 10000) int32>

This array is safe to read or write from multiple processes.

When using multiple processes to parallelize reads or writes on arrays using the Blosc compression library, it may be necessary to set numcodecs.blosc.use_threads = False, as otherwise Blosc may share incorrect global state amongst processes causing programs to hang. See also the section on Configuring Blosc below.

Please note that support for parallel computing is an area of ongoing research and development. If you are using Zarr for parallel computing, we welcome feedback, experience, discussion, ideas and advice, particularly about issues related to data integrity and performance.

Pickle support

Zarr arrays and groups can be pickled, as long as the underlying store object can be pickled. Instances of any of the storage classes provided in the zarr.storage module can be pickled, as can the built-in dict class which can also be used for storage.

Note that if an array or group is backed by an in-memory store like a dict or zarr.storage.MemoryStore, then when it is pickled all of the store data will be included in the pickled data. However, if an array or group is backed by a persistent store like a zarr.storage.DirectoryStore, zarr.storage.ZipStore or zarr.storage.DBMStore then the store data are not pickled. The only thing that is pickled is the necessary parameters to allow the store to re-open any underlying files or databases upon being unpickled.

E.g., pickle/unpickle an in-memory array:

>>> import pickle
>>> z1 = zarr.array(np.arange(100000))
>>> s = pickle.dumps(z1)
>>> len(s) > 5000  # relatively large because data have been pickled
True
>>> z2 = pickle.loads(s)
>>> z1 == z2
True
>>> np.all(z1[:] == z2[:])
True

E.g., pickle/unpickle an array stored on disk:

>>> z3 = zarr.open('data/walnuts.zarr', mode='w', shape=100000, dtype='i8')
>>> z3[:] = np.arange(100000)
>>> s = pickle.dumps(z3)
>>> len(s) < 200  # small because no data have been pickled
True
>>> z4 = pickle.loads(s)
>>> z3 == z4
True
>>> np.all(z3[:] == z4[:])
True

Datetimes and timedeltas

NumPy’s datetime64 (‘M8’) and timedelta64 (‘m8’) dtypes are supported for Zarr arrays, as long as the units are specified. E.g.:

>>> z = zarr.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='M8[D]')
>>> z
<zarr.core.Array (3,) datetime64[D]>
>>> z[:]
array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
>>> z[0]
numpy.datetime64('2007-07-13')
>>> z[0] = '1999-12-31'
>>> z[:]
array(['1999-12-31', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')

Usage tips

Copying large arrays

Data can be copied between large arrays without needing much memory, e.g.:

>>> z1 = zarr.empty((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> z1[:] = 42
>>> z2 = zarr.empty_like(z1)
>>> z2[:] = z1

Internally the example above works chunk-by-chunk, extracting only the data from z1 required to fill each chunk in z2. The source of the data (z1) could equally be an h5py Dataset.
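
For example, here is a rough sketch reading from an HDF5 source (assuming a hypothetical file ‘data/source.h5’ containing a dataset ‘data’ of matching shape):

>>> import h5py
>>> with h5py.File('data/source.h5', mode='r') as source:
...     z2[:] = source['data']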

Configuring Blosc

The Blosc compressor is able to use multiple threads internally to accelerate compression and decompression. By default, Blosc uses up to 8 internal threads. This number can be changed, e.g.:

>>> from numcodecs import blosc
>>> blosc.set_nthreads(2)  
8

When a Zarr array is being used within a multi-threaded program, Zarr automatically switches to using Blosc in a single-threaded “contextual” mode. This is generally better as it allows multiple program threads to use Blosc simultaneously and prevents CPU thrashing from too many active threads. If you want to manually override this behaviour, set the value of the blosc.use_threads variable to True (Blosc always uses multiple internal threads) or False (Blosc always runs in single-threaded contextual mode). To re-enable automatic switching, set blosc.use_threads to None.

Please note that if Zarr is being used within a multi-process program, Blosc may not be safe to use in multi-threaded mode and may cause the program to hang. If using Blosc in a multi-process program then it is recommended to set blosc.use_threads = False.
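
E.g.:

>>> from numcodecs import blosc
>>> blosc.use_threads = False  # recommended when using Blosc from multiple processes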

API reference

Array creation (zarr.creation)

zarr.creation.create(shape, chunks=True, dtype=None, compressor='default', fill_value=0, order='C', store=None, synchronizer=None, overwrite=False, path=None, chunk_store=None, filters=None, cache_metadata=True, cache_attrs=True, read_only=False, object_codec=None, **kwargs)[source]

Create an array.

Parameters:
shape : int or tuple of ints

Array shape.

chunks : int or tuple of ints, optional

Chunk shape. If True, will be guessed from shape and dtype. If False, will be set to shape, i.e., single chunk for the whole array. If an int, the chunk size in each dimension will be given by the value of chunks. Default is True.

dtype : string or dtype, optional

NumPy dtype.

compressor : Codec, optional

Primary compressor.

fill_value : object

Default value to use for uninitialized portions of the array.

order : {‘C’, ‘F’}, optional

Memory layout to be used within each chunk.

store : MutableMapping or string

Store or path to directory in file system or name of zip file.

synchronizer : object, optional

Array synchronizer.

overwrite : bool, optional

If True, delete all pre-existing data in store at path before creating the array.

path : string, optional

Path under which array is stored.

chunk_store : MutableMapping, optional

Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.

filters : sequence of Codecs, optional

Sequence of filters to use to encode chunk data prior to compression.

cache_metadata : bool, optional

If True, array configuration metadata will be cached for the lifetime of the object. If False, array metadata will be reloaded prior to all data access and modification operations (may incur overhead depending on storage and data access pattern).

cache_attrs : bool, optional

If True (default), user attributes will be cached for attribute read operations. If False, user attributes are reloaded from the store prior to all attribute read operations.

read_only : bool, optional

True if array should be protected against modification.

object_codec : Codec, optional

A codec to encode object arrays, only needed if dtype=object.

Returns:
z : zarr.core.Array

Examples

Create an array with default settings:

>>> import zarr
>>> z = zarr.create((10000, 10000), chunks=(1000, 1000))
>>> z
<zarr.core.Array (10000, 10000) float64>

Create an array with some different configuration options:

>>> from numcodecs import Blosc
>>> compressor = Blosc(cname='zstd', clevel=1, shuffle=Blosc.BITSHUFFLE)
>>> z = zarr.create((10000, 10000), chunks=(1000, 1000), dtype='i1', order='F',
...                 compressor=compressor)
>>> z
<zarr.core.Array (10000, 10000) int8>

To create an array with object dtype, a filter is required that can handle Python object encoding, e.g., MsgPack or Pickle from numcodecs:

>>> from numcodecs import MsgPack
>>> z = zarr.create((10000, 10000), chunks=(1000, 1000), dtype=object,
...                 object_codec=MsgPack())
>>> z
<zarr.core.Array (10000, 10000) object>

Example with some filters, and also storing chunks separately from metadata:

>>> from numcodecs import Quantize, Adler32
>>> store, chunk_store = dict(), dict()
>>> z = zarr.create((10000, 10000), chunks=(1000, 1000), dtype='f8',
...                 filters=[Quantize(digits=2, dtype='f8'), Adler32()],
...                 store=store, chunk_store=chunk_store)
>>> z
<zarr.core.Array (10000, 10000) float64>

zarr.creation.empty(shape, **kwargs)[source]

Create an empty array.

For parameter definitions see zarr.creation.create().

Notes

The contents of an empty Zarr array are not defined. On attempting to retrieve data from an empty Zarr array, any values may be returned, and these are not guaranteed to be stable from one access to the next.

zarr.creation.zeros(shape, **kwargs)[source]

Create an array, with zero being used as the default value for uninitialized portions of the array.

For parameter definitions see zarr.creation.create().

Examples

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000))
>>> z
<zarr.core.Array (10000, 10000) float64>
>>> z[:2, :2]
array([[0., 0.],
       [0., 0.]])

zarr.creation.ones(shape, **kwargs)[source]

Create an array, with one being used as the default value for uninitialized portions of the array.

For parameter definitions see zarr.creation.create().

Examples

>>> import zarr
>>> z = zarr.ones((10000, 10000), chunks=(1000, 1000))
>>> z
<zarr.core.Array (10000, 10000) float64>
>>> z[:2, :2]
array([[1., 1.],
       [1., 1.]])

zarr.creation.full(shape, fill_value, **kwargs)[source]

Create an array, with fill_value being used as the default value for uninitialized portions of the array.

For parameter definitions see zarr.creation.create().

Examples

>>> import zarr
>>> z = zarr.full((10000, 10000), chunks=(1000, 1000), fill_value=42)
>>> z
<zarr.core.Array (10000, 10000) float64>
>>> z[:2, :2]
array([[42., 42.],
       [42., 42.]])

zarr.creation.array(data, **kwargs)[source]

Create an array filled with data.

The data argument should be a NumPy array or array-like object. For other parameter definitions see zarr.creation.create().

Examples

>>> import numpy as np
>>> import zarr
>>> a = np.arange(100000000).reshape(10000, 10000)
>>> z = zarr.array(a, chunks=(1000, 1000))
>>> z
<zarr.core.Array (10000, 10000) int64>

zarr.creation.open_array(store=None, mode='a', shape=None, chunks=True, dtype=None, compressor='default', fill_value=0, order='C', synchronizer=None, filters=None, cache_metadata=True, cache_attrs=True, path=None, object_codec=None, chunk_store=None, **kwargs)[source]

Open an array using file-mode-like semantics.

Parameters:
store : MutableMapping or string, optional

Store or path to directory in file system or name of zip file.

mode : {‘r’, ‘r+’, ‘a’, ‘w’, ‘w-’}, optional

Persistence mode: ‘r’ means read only (must exist); ‘r+’ means read/write (must exist); ‘a’ means read/write (create if doesn’t exist); ‘w’ means create (overwrite if exists); ‘w-’ means create (fail if exists).

shape : int or tuple of ints, optional

Array shape.

chunks : int or tuple of ints, optional

Chunk shape. If True, will be guessed from shape and dtype. If False, will be set to shape, i.e., single chunk for the whole array. If an int, the chunk size in each dimension will be given by the value of chunks. Default is True.

dtype : string or dtype, optional

NumPy dtype.

compressor : Codec, optional

Primary compressor.

fill_value : object, optional

Default value to use for uninitialized portions of the array.

order : {‘C’, ‘F’}, optional

Memory layout to be used within each chunk.

synchronizer : object, optional

Array synchronizer.

filters : sequence, optional

Sequence of filters to use to encode chunk data prior to compression.

cache_metadata : bool, optional

If True (default), array configuration metadata will be cached for the lifetime of the object. If False, array metadata will be reloaded prior to all data access and modification operations (may incur overhead depending on storage and data access pattern).

cache_attrs : bool, optional

If True (default), user attributes will be cached for attribute read operations. If False, user attributes are reloaded from the store prior to all attribute read operations.

path : string, optional

Array path within store.

object_codec : Codec, optional

A codec to encode object arrays, only needed if dtype=object.

chunk_store : MutableMapping or string, optional

Separate storage for chunks. A store or path to a directory in the file system or the name of a zip file. If not provided, store will be used for storage of both chunks and metadata.

Returns:
z : zarr.core.Array

Notes

There is no need to close an array. Data are automatically flushed to the file system.

Examples

>>> import numpy as np
>>> import zarr
>>> z1 = zarr.open_array('data/example.zarr', mode='w', shape=(10000, 10000),
...                      chunks=(1000, 1000), fill_value=0)
>>> z1[:] = np.arange(100000000).reshape(10000, 10000)
>>> z1
<zarr.core.Array (10000, 10000) float64>
>>> z2 = zarr.open_array('data/example.zarr', mode='r')
>>> z2
<zarr.core.Array (10000, 10000) float64 read-only>
>>> np.all(z1[:] == z2[:])
True
zarr.creation.empty_like(a, **kwargs)[source]

Create an empty array like a.

zarr.creation.zeros_like(a, **kwargs)[source]

Create an array of zeros like a.

zarr.creation.ones_like(a, **kwargs)[source]

Create an array of ones like a.

zarr.creation.full_like(a, **kwargs)[source]

Create a filled array like a.

zarr.creation.open_like(a, path, **kwargs)[source]

Open a persistent array like a.
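
The _like convenience functions take their defaults (shape, chunks, dtype, etc.) from the given array a. A short sketch:

>>> import zarr
>>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
>>> zarr.zeros_like(z)
<zarr.core.Array (10000, 10000) int32>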

The Array class (zarr.core)

class zarr.core.Array(store, path=None, read_only=False, chunk_store=None, synchronizer=None, cache_metadata=True, cache_attrs=True)[source]

Instantiate an array from an initialized store.

Parameters:
store : MutableMapping

Array store, already initialized.

path : string, optional

Storage path.

read_only : bool, optional

True if array should be protected against modification.

chunk_store : MutableMapping, optional

Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.

synchronizer : object, optional

Array synchronizer.

cache_metadata : bool, optional

If True (default), array configuration metadata will be cached for the lifetime of the object. If False, array metadata will be reloaded prior to all data access and modification operations (may incur overhead depending on storage and data access pattern).

cache_attrs : bool, optional

If True (default), user attributes will be cached for attribute read operations. If False, user attributes are reloaded from the store prior to all attribute read operations.

Attributes:
store

A MutableMapping providing the underlying storage for the array.

path

Storage path.

name

Array name following h5py convention.

read_only

A boolean, True if modification operations are not permitted.

chunk_store

A MutableMapping providing the underlying storage for array chunks.

shape

A tuple of integers describing the length of each dimension of the array.

chunks

A tuple of integers describing the length of each dimension of a chunk of the array.

dtype

The NumPy data type.

compression
compression_opts
fill_value

A value used for uninitialized portions of the array.

order

A string indicating the order in which bytes are arranged within chunks of the array.

synchronizer

Object used to synchronize write access to the array.

filters

One or more codecs used to transform data prior to compression.

attrs

A MutableMapping containing user-defined attributes.

size

The total number of elements in the array.

itemsize

The size in bytes of each item in the array.

nbytes

The total number of bytes that would be required to store the array without compression.

nbytes_stored

The total number of stored bytes of data for the array.

cdata_shape

A tuple of integers describing the number of chunks along each dimension of the array.

nchunks

Total number of chunks.

nchunks_initialized

The number of chunks that have been initialized with some data.

is_view

A boolean, True if this array is a view on another array.

info

Report some diagnostic information about the array.

vindex

Shortcut for vectorized (inner) indexing, see get_coordinate_selection(), set_coordinate_selection(), get_mask_selection() and set_mask_selection() for documentation and examples.

oindex

Shortcut for orthogonal (outer) indexing, see get_orthogonal_selection() and set_orthogonal_selection() for documentation and examples.

Methods

__getitem__(self, selection) Retrieve data for an item or region of the array.
__setitem__(self, selection, value) Modify data for an item or region of the array.
get_basic_selection(self[, selection, out, …]) Retrieve data for an item or region of the array.
set_basic_selection(self, selection, value) Modify data for an item or region of the array.
get_orthogonal_selection(self, selection[, …]) Retrieve data by making a selection for each dimension of the array.
set_orthogonal_selection(self, selection, value) Modify data via a selection for each dimension of the array.
get_mask_selection(self, selection[, out, …]) Retrieve a selection of individual items, by providing a Boolean array of the same shape as the array against which the selection is being made, where True values indicate a selected item.
set_mask_selection(self, selection, value[, …]) Modify a selection of individual items, by providing a Boolean array of the same shape as the array against which the selection is being made, where True values indicate a selected item.
get_coordinate_selection(self, selection[, …]) Retrieve a selection of individual items, by providing the indices (coordinates) for each selected item.
set_coordinate_selection(self, selection, value) Modify a selection of individual items, by providing the indices (coordinates) for each item to be modified.
digest(self[, hashname]) Compute a checksum for the data.
hexdigest(self[, hashname]) Compute a checksum for the data.
resize(self, \*args) Change the shape of the array by growing or shrinking one or more dimensions.
append(self, data[, axis]) Append data to axis.
view(self[, shape, chunks, dtype, …]) Return an array sharing the same data.
astype(self, dtype) Returns a view that does on the fly type conversion of the underlying data.
__getitem__(self, selection)[source]

Retrieve data for an item or region of the array.

Parameters:
selection : tuple

An integer index or slice or tuple of int/slice objects specifying the requested item or region for each dimension of the array.

Returns:
out : ndarray

A NumPy array containing the data for the requested region.

Notes

Slices with step > 1 are supported, but slices with negative step are not.

Currently the implementation for __getitem__ is provided by get_basic_selection(). For advanced (“fancy”) indexing, see the methods listed under See Also.

Examples

Setup a 1-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.array(np.arange(100))

Retrieve a single item:

>>> z[5]
5

Retrieve a region via slicing:

>>> z[:5]
array([0, 1, 2, 3, 4])
>>> z[-5:]
array([95, 96, 97, 98, 99])
>>> z[5:10]
array([5, 6, 7, 8, 9])
>>> z[5:10:2]
array([5, 7, 9])
>>> z[::2]
array([ 0,  2,  4, ..., 94, 96, 98])

Load the entire array into memory:

>>> z[...]
array([ 0,  1,  2, ..., 97, 98, 99])

Setup a 2-dimensional array:

>>> z = zarr.array(np.arange(100).reshape(10, 10))

Retrieve an item:

>>> z[2, 2]
22

Retrieve a region via slicing:

>>> z[1:3, 1:3]
array([[11, 12],
       [21, 22]])
>>> z[1:3, :]
array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]])
>>> z[:, 1:3]
array([[ 1,  2],
       [11, 12],
       [21, 22],
       [31, 32],
       [41, 42],
       [51, 52],
       [61, 62],
       [71, 72],
       [81, 82],
       [91, 92]])
>>> z[0:5:2, 0:5:2]
array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])
>>> z[::2, ::2]
array([[ 0,  2,  4,  6,  8],
       [20, 22, 24, 26, 28],
       [40, 42, 44, 46, 48],
       [60, 62, 64, 66, 68],
       [80, 82, 84, 86, 88]])

Load the entire array into memory:

>>> z[...]
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

For arrays with a structured dtype, specific fields can be retrieved, e.g.:

>>> a = np.array([(b'aaa', 1, 4.2),
...               (b'bbb', 2, 8.4),
...               (b'ccc', 3, 12.6)],
...              dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')])
>>> z = zarr.array(a)
>>> z['foo']
array([b'aaa', b'bbb', b'ccc'],
      dtype='|S3')
__setitem__(self, selection, value)[source]

Modify data for an item or region of the array.

Parameters:
selection : tuple

An integer index or slice or tuple of int/slice specifying the requested region for each dimension of the array.

value : scalar or array-like

Value to be stored into the array.

Notes

Slices with step > 1 are supported, but slices with negative step are not.

Currently the implementation for __setitem__ is provided by set_basic_selection(), which means that only integers and slices are supported within the selection. For advanced (“fancy”) indexing, see the methods listed under See Also.

Examples

Setup a 1-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.zeros(100, dtype=int)

Set all array elements to the same scalar value:

>>> z[...] = 42
>>> z[...]
array([42, 42, 42, ..., 42, 42, 42])

Set a portion of the array:

>>> z[:10] = np.arange(10)
>>> z[-10:] = np.arange(10)[::-1]
>>> z[...]
array([ 0, 1, 2, ..., 2, 1, 0])

Setup a 2-dimensional array:

>>> z = zarr.zeros((5, 5), dtype=int)

Set all array elements to the same scalar value:

>>> z[...] = 42

Set a portion of the array:

>>> z[0, :] = np.arange(z.shape[1])
>>> z[:, 0] = np.arange(z.shape[0])
>>> z[...]
array([[ 0,  1,  2,  3,  4],
       [ 1, 42, 42, 42, 42],
       [ 2, 42, 42, 42, 42],
       [ 3, 42, 42, 42, 42],
       [ 4, 42, 42, 42, 42]])

For arrays with a structured dtype, specific fields can be modified, e.g.:

>>> a = np.array([(b'aaa', 1, 4.2),
...               (b'bbb', 2, 8.4),
...               (b'ccc', 3, 12.6)],
...              dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')])
>>> z = zarr.array(a)
>>> z['foo'] = b'zzz'
>>> z[...]
array([(b'zzz', 1,   4.2), (b'zzz', 2,   8.4), (b'zzz', 3,  12.6)],
      dtype=[('foo', 'S3'), ('bar', '<i4'), ('baz', '<f8')])
get_basic_selection(self, selection=Ellipsis, out=None, fields=None)[source]

Retrieve data for an item or region of the array.

Parameters:
selection : tuple

A tuple specifying the requested item or region for each dimension of the array. May be any combination of int and/or slice for multidimensional arrays.

out : ndarray, optional

If given, load the selected data directly into this array.

fields : str or sequence of str, optional

For arrays with a structured dtype, one or more fields can be specified to extract data for.

Returns:
out : ndarray

A NumPy array containing the data for the requested region.

Notes

Slices with step > 1 are supported, but slices with negative step are not.

Currently this method provides the implementation for accessing data via the square bracket notation (__getitem__). See __getitem__() for examples using the alternative notation.

Examples

Setup a 1-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.array(np.arange(100))

Retrieve a single item:

>>> z.get_basic_selection(5)
5

Retrieve a region via slicing:

>>> z.get_basic_selection(slice(5))
array([0, 1, 2, 3, 4])
>>> z.get_basic_selection(slice(-5, None))
array([95, 96, 97, 98, 99])
>>> z.get_basic_selection(slice(5, 10))
array([5, 6, 7, 8, 9])
>>> z.get_basic_selection(slice(5, 10, 2))
array([5, 7, 9])
>>> z.get_basic_selection(slice(None, None, 2))
array([  0,  2,  4, ..., 94, 96, 98])

Setup a 2-dimensional array:

>>> z = zarr.array(np.arange(100).reshape(10, 10))

Retrieve an item:

>>> z.get_basic_selection((2, 2))
22

Retrieve a region via slicing:

>>> z.get_basic_selection((slice(1, 3), slice(1, 3)))
array([[11, 12],
       [21, 22]])
>>> z.get_basic_selection((slice(1, 3), slice(None)))
array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]])
>>> z.get_basic_selection((slice(None), slice(1, 3)))
array([[ 1,  2],
       [11, 12],
       [21, 22],
       [31, 32],
       [41, 42],
       [51, 52],
       [61, 62],
       [71, 72],
       [81, 82],
       [91, 92]])
>>> z.get_basic_selection((slice(0, 5, 2), slice(0, 5, 2)))
array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])
>>> z.get_basic_selection((slice(None, None, 2), slice(None, None, 2)))
array([[ 0,  2,  4,  6,  8],
       [20, 22, 24, 26, 28],
       [40, 42, 44, 46, 48],
       [60, 62, 64, 66, 68],
       [80, 82, 84, 86, 88]])

For arrays with a structured dtype, specific fields can be retrieved, e.g.:

>>> a = np.array([(b'aaa', 1, 4.2),
...               (b'bbb', 2, 8.4),
...               (b'ccc', 3, 12.6)],
...              dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')])
>>> z = zarr.array(a)
>>> z.get_basic_selection(slice(2), fields='foo')
array([b'aaa', b'bbb'],
      dtype='|S3')
set_basic_selection(self, selection, value, fields=None)[source]

Modify data for an item or region of the array.

Parameters:
selection : tuple

An integer index or slice or tuple of int/slice specifying the requested region for each dimension of the array.

value : scalar or array-like

Value to be stored into the array.

fields : str or sequence of str, optional

For arrays with a structured dtype, one or more fields can be specified to set data for.

Notes

This method provides the underlying implementation for modifying data via square bracket notation, see __setitem__() for equivalent examples using the alternative notation.

Examples

Setup a 1-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.zeros(100, dtype=int)

Set all array elements to the same scalar value:

>>> z.set_basic_selection(..., 42)
>>> z[...]
array([42, 42, 42, ..., 42, 42, 42])

Set a portion of the array:

>>> z.set_basic_selection(slice(10), np.arange(10))
>>> z.set_basic_selection(slice(-10, None), np.arange(10)[::-1])
>>> z[...]
array([ 0, 1, 2, ..., 2, 1, 0])

Setup a 2-dimensional array:

>>> z = zarr.zeros((5, 5), dtype=int)

Set all array elements to the same scalar value:

>>> z.set_basic_selection(..., 42)

Set a portion of the array:

>>> z.set_basic_selection((0, slice(None)), np.arange(z.shape[1]))
>>> z.set_basic_selection((slice(None), 0), np.arange(z.shape[0]))
>>> z[...]
array([[ 0,  1,  2,  3,  4],
       [ 1, 42, 42, 42, 42],
       [ 2, 42, 42, 42, 42],
       [ 3, 42, 42, 42, 42],
       [ 4, 42, 42, 42, 42]])

For arrays with a structured dtype, the fields parameter can be used to set data for a specific field, e.g.:

>>> a = np.array([(b'aaa', 1, 4.2),
...               (b'bbb', 2, 8.4),
...               (b'ccc', 3, 12.6)],
...              dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')])
>>> z = zarr.array(a)
>>> z.set_basic_selection(slice(0, 2), b'zzz', fields='foo')
>>> z[:]
array([(b'zzz', 1,   4.2), (b'zzz', 2,   8.4), (b'ccc', 3,  12.6)],
      dtype=[('foo', 'S3'), ('bar', '<i4'), ('baz', '<f8')])
get_mask_selection(self, selection, out=None, fields=None)[source]

Retrieve a selection of individual items, by providing a Boolean array of the same shape as the array against which the selection is being made, where True values indicate a selected item.

Parameters:
selection : ndarray, bool

A Boolean array of the same shape as the array against which the selection is being made.

out : ndarray, optional

If given, load the selected data directly into this array.

fields : str or sequence of str, optional

For arrays with a structured dtype, one or more fields can be specified to extract data for.

Returns:
out : ndarray

A NumPy array containing the data for the requested selection.

Notes

Mask indexing is a form of vectorized or inner indexing, and is equivalent to coordinate indexing. Internally the mask array is converted to coordinate arrays by calling np.nonzero.

Examples

Setup a 2-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.array(np.arange(100).reshape(10, 10))

Retrieve items by specifying a mask:

>>> sel = np.zeros_like(z, dtype=bool)
>>> sel[1, 1] = True
>>> sel[4, 4] = True
>>> z.get_mask_selection(sel)
array([11, 44])

For convenience, the mask selection functionality is also available via the vindex property, e.g.:

>>> z.vindex[sel]
array([11, 44])
set_mask_selection(self, selection, value, fields=None)[source]

Modify a selection of individual items, by providing a Boolean array of the same shape as the array against which the selection is being made, where True values indicate a selected item.

Parameters:
selection : ndarray, bool

A Boolean array of the same shape as the array against which the selection is being made.

value : scalar or array-like

Value to be stored into the array.

fields : str or sequence of str, optional

For arrays with a structured dtype, one or more fields can be specified to set data for.

Notes

Mask indexing is a form of vectorized or inner indexing, and is equivalent to coordinate indexing. Internally the mask array is converted to coordinate arrays by calling np.nonzero.

Examples

Setup a 2-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.zeros((5, 5), dtype=int)

Set data for a selection of items:

>>> sel = np.zeros_like(z, dtype=bool)
>>> sel[1, 1] = True
>>> sel[4, 4] = True
>>> z.set_mask_selection(sel, 1)
>>> z[...]
array([[0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1]])

For convenience, this functionality is also available via the vindex property. E.g.:

>>> z.vindex[sel] = 2
>>> z[...]
array([[0, 0, 0, 0, 0],
       [0, 2, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 2]])
get_coordinate_selection(self, selection, out=None, fields=None)[source]

Retrieve a selection of individual items, by providing the indices (coordinates) for each selected item.

Parameters:
selection : tuple

An integer (coordinate) array for each dimension of the array.

out : ndarray, optional

If given, load the selected data directly into this array.

fields : str or sequence of str, optional

For arrays with a structured dtype, one or more fields can be specified to extract data for.

Returns:
out : ndarray

A NumPy array containing the data for the requested selection.

Notes

Coordinate indexing is also known as point selection, and is a form of vectorized or inner indexing.

Slices are not supported. Coordinate arrays must be provided for all dimensions of the array.

Coordinate arrays may be multidimensional, in which case the output array will also be multidimensional. Coordinate arrays are broadcast against each other before being applied. The shape of the output will be the same as the shape of each coordinate array after broadcasting.

Examples

Setup a 2-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.array(np.arange(100).reshape(10, 10))

Retrieve items by specifying their coordinates:

>>> z.get_coordinate_selection(([1, 4], [1, 4]))
array([11, 44])

For convenience, the coordinate selection functionality is also available via the vindex property, e.g.:

>>> z.vindex[[1, 4], [1, 4]]
array([11, 44])
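
As described in the Notes, coordinate arrays may also be multidimensional and are broadcast against each other, with the output taking the broadcast shape, e.g. (a sketch):

>>> z.get_coordinate_selection((np.array([[1, 4], [4, 1]]), np.array([[1, 4], [4, 1]])))
array([[11, 44],
       [44, 11]])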
set_coordinate_selection(self, selection, value, fields=None)[source]

Modify a selection of individual items, by providing the indices (coordinates) for each item to be modified.

Parameters:
selection : tuple

An integer (coordinate) array for each dimension of the array.

value : scalar or array-like

Value to be stored into the array.

fields : str or sequence of str, optional

For arrays with a structured dtype, one or more fields can be specified to set data for.

Notes

Coordinate indexing is also known as point selection, and is a form of vectorized or inner indexing.

Slices are not supported. Coordinate arrays must be provided for all dimensions of the array.

Examples

Setup a 2-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.zeros((5, 5), dtype=int)

Set data for a selection of items:

>>> z.set_coordinate_selection(([1, 4], [1, 4]), 1)
>>> z[...]
array([[0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1]])

For convenience, this functionality is also available via the vindex property. E.g.:

>>> z.vindex[[1, 4], [1, 4]] = 2
>>> z[...]
array([[0, 0, 0, 0, 0],
       [0, 2, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 2]])
get_orthogonal_selection(self, selection, out=None, fields=None)[source]

Retrieve data by making a selection for each dimension of the array. For example, if an array has 2 dimensions, this allows selecting specific rows and/or columns. The selection for each dimension can be either an integer (indexing a single item), a slice, an array of integers, or a Boolean array where True values indicate a selection.

Parameters:
selection : tuple

A selection for each dimension of the array. May be any combination of int, slice, integer array or Boolean array.

out : ndarray, optional

If given, load the selected data directly into this array.

fields : str or sequence of str, optional

For arrays with a structured dtype, one or more fields can be specified to extract data for.

Returns:
out : ndarray

A NumPy array containing the data for the requested selection.

Notes

Orthogonal indexing is also known as outer indexing.

Slices with step > 1 are supported, but slices with negative step are not.

Examples

Setup a 2-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.array(np.arange(100).reshape(10, 10))

Retrieve rows and columns via any combination of int, slice, integer array and/or Boolean array:

>>> z.get_orthogonal_selection(([1, 4], slice(None)))
array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]])
>>> z.get_orthogonal_selection((slice(None), [1, 4]))
array([[ 1,  4],
       [11, 14],
       [21, 24],
       [31, 34],
       [41, 44],
       [51, 54],
       [61, 64],
       [71, 74],
       [81, 84],
       [91, 94]])
>>> z.get_orthogonal_selection(([1, 4], [1, 4]))
array([[11, 14],
       [41, 44]])
>>> sel = np.zeros(z.shape[0], dtype=bool)
>>> sel[1] = True
>>> sel[4] = True
>>> z.get_orthogonal_selection((sel, sel))
array([[11, 14],
       [41, 44]])

For convenience, the orthogonal selection functionality is also available via the oindex property, e.g.:

>>> z.oindex[[1, 4], :]
array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]])
>>> z.oindex[:, [1, 4]]
array([[ 1,  4],
       [11, 14],
       [21, 24],
       [31, 34],
       [41, 44],
       [51, 54],
       [61, 64],
       [71, 74],
       [81, 84],
       [91, 94]])
>>> z.oindex[[1, 4], [1, 4]]
array([[11, 14],
       [41, 44]])
>>> sel = np.zeros(z.shape[0], dtype=bool)
>>> sel[1] = True
>>> sel[4] = True
>>> z.oindex[sel, sel]
array([[11, 14],
       [41, 44]])
set_orthogonal_selection(self, selection, value, fields=None)[source]

Modify data via a selection for each dimension of the array.

Parameters:
selection : tuple

A selection for each dimension of the array. May be any combination of int, slice, integer array or Boolean array.

value : scalar or array-like

Value to be stored into the array.

fields : str or sequence of str, optional

For arrays with a structured dtype, one or more fields can be specified to set data for.

Notes

Orthogonal indexing is also known as outer indexing.

Slices with step > 1 are supported, but slices with negative step are not.

Examples

Setup a 2-dimensional array:

>>> import zarr
>>> import numpy as np
>>> z = zarr.zeros((5, 5), dtype=int)

Set data for a selection of rows:

>>> z.set_orthogonal_selection(([1, 4], slice(None)), 1)
>>> z[...]
array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1]])

Set data for a selection of columns:

>>> z.set_orthogonal_selection((slice(None), [1, 4]), 2)
>>> z[...]
array([[0, 2, 0, 0, 2],
       [1, 2, 1, 1, 2],
       [0, 2, 0, 0, 2],
       [0, 2, 0, 0, 2],
       [1, 2, 1, 1, 2]])

Set data for a selection of rows and columns:

>>> z.set_orthogonal_selection(([1, 4], [1, 4]), 3)
>>> z[...]
array([[0, 2, 0, 0, 2],
       [1, 3, 1, 1, 3],
       [0, 2, 0, 0, 2],
       [0, 2, 0, 0, 2],
       [1, 3, 1, 1, 3]])

For convenience, this functionality is also available via the oindex property. E.g.:

>>> z.oindex[[1, 4], [1, 4]] = 4
>>> z[...]
array([[0, 2, 0, 0, 2],
       [1, 4, 1, 1, 4],
       [0, 2, 0, 0, 2],
       [0, 2, 0, 0, 2],
       [1, 4, 1, 1, 4]])
digest(self, hashname='sha1')[source]

Compute a checksum for the data. The default hash function is sha1, chosen for speed.

Examples

>>> import binascii
>>> import zarr
>>> z = zarr.empty(shape=(10000, 10000), chunks=(1000, 1000))
>>> binascii.hexlify(z.digest())
b'041f90bc7a571452af4f850a8ca2c6cddfa8a1ac'
>>> z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
>>> binascii.hexlify(z.digest())
b'7162d416d26a68063b66ed1f30e0a866e4abed60'
>>> z = zarr.zeros(shape=(10000, 10000), dtype="u1", chunks=(1000, 1000))
>>> binascii.hexlify(z.digest())
b'cb387af37410ae5a3222e893cf3373e4e4f22816'
hexdigest(self, hashname='sha1')[source]

Compute a checksum for the data. The default hash function is sha1, chosen for speed.

Examples

>>> import zarr
>>> z = zarr.empty(shape=(10000, 10000), chunks=(1000, 1000))
>>> z.hexdigest()
'041f90bc7a571452af4f850a8ca2c6cddfa8a1ac'
>>> z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
>>> z.hexdigest()
'7162d416d26a68063b66ed1f30e0a866e4abed60'
>>> z = zarr.zeros(shape=(10000, 10000), dtype="u1", chunks=(1000, 1000))
>>> z.hexdigest()
'cb387af37410ae5a3222e893cf3373e4e4f22816'
resize(self, *args)[source]

Change the shape of the array by growing or shrinking one or more dimensions.

Notes

When resizing an array, the data are not rearranged in any way.

If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store.

Examples

>>> import zarr
>>> z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
>>> z.shape
(10000, 10000)
>>> z.resize(20000, 10000)
>>> z.shape
(20000, 10000)
>>> z.resize(30000, 1000)
>>> z.shape
(30000, 1000)
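
Because chunks falling outside the new shape are deleted from the store, shrinking and then growing again exposes the fill value in the affected region, e.g. (a small sketch):

>>> z2 = zarr.zeros(4, chunks=2, dtype='i4')
>>> z2[:] = [1, 2, 3, 4]
>>> z2.resize(2)
>>> z2[:]
array([1, 2], dtype=int32)
>>> z2.resize(4)
>>> z2[:]
array([1, 2, 0, 0], dtype=int32)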
append(self, data, axis=0)[source]

Append data to axis.

Parameters:
data : array_like

Data to be appended.

axis : int

Axis along which to append.

Returns:
new_shape : tuple

Notes

The size of all dimensions other than axis must match between this array and data.

Examples

>>> import numpy as np
>>> import zarr
>>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
>>> z = zarr.array(a, chunks=(1000, 100))
>>> z.shape
(10000, 1000)
>>> z.append(a)
(20000, 1000)
>>> z.append(np.vstack([a, a]), axis=1)
(20000, 2000)
>>> z.shape
(20000, 2000)
view(self, shape=None, chunks=None, dtype=None, fill_value=None, filters=None, read_only=None, synchronizer=None)[source]

Return an array sharing the same data.

Parameters:
shape : int or tuple of ints

Array shape.

chunks : int or tuple of ints, optional

Chunk shape.

dtype : string or dtype, optional

NumPy dtype.

fill_value : object

Default value to use for uninitialized portions of the array.

filters : sequence, optional

Sequence of filters to use to encode chunk data prior to compression.

read_only : bool, optional

True if array should be protected against modification.

synchronizer : object, optional

Array synchronizer.

Notes

WARNING: This is an experimental feature and should be used with care. There are plenty of ways to generate errors and/or cause data corruption.

Examples

Bypass filters:

>>> import zarr
>>> import numpy as np
>>> np.random.seed(42)
>>> labels = ['female', 'male']
>>> data = np.random.choice(labels, size=10000)
>>> filters = [zarr.Categorize(labels=labels,
...                            dtype=data.dtype,
...                            astype='u1')]
>>> a = zarr.array(data, chunks=1000, filters=filters)
>>> a[:]
array(['female', 'male', 'female', ..., 'male', 'male', 'female'],
      dtype='<U6')
>>> v = a.view(dtype='u1', filters=[])
>>> v.is_view
True
>>> v[:]
array([1, 2, 1, ..., 2, 2, 1], dtype=uint8)

Views can be used to modify data:

>>> x = v[:]
>>> x.sort()
>>> v[:] = x
>>> v[:]
array([1, 1, 1, ..., 2, 2, 2], dtype=uint8)
>>> a[:]
array(['female', 'female', 'female', ..., 'male', 'male', 'male'],
      dtype='<U6')

View as a different dtype with the same item size:

>>> data = np.random.randint(0, 2, size=10000, dtype='u1')
>>> a = zarr.array(data, chunks=1000)
>>> a[:]
array([0, 0, 1, ..., 1, 0, 0], dtype=uint8)
>>> v = a.view(dtype=bool)
>>> v[:]
array([False, False,  True, ...,  True, False, False])
>>> np.all(a[:].view(dtype=bool) == v[:])
True

An array can be viewed with a dtype with a different item size; however, some care is needed to adjust the shape and chunk shape so that chunk data is interpreted correctly:

>>> data = np.arange(10000, dtype='u2')
>>> a = zarr.array(data, chunks=1000)
>>> a[:10]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint16)
>>> v = a.view(dtype='u1', shape=20000, chunks=2000)
>>> v[:10]
array([0, 0, 1, 0, 2, 0, 3, 0, 4, 0], dtype=uint8)
>>> np.all(a[:].view('u1') == v[:])
True

Change fill value for uninitialized chunks:

>>> a = zarr.full(10000, chunks=1000, fill_value=-1, dtype='i1')
>>> a[:]
array([-1, -1, -1, ..., -1, -1, -1], dtype=int8)
>>> v = a.view(fill_value=42)
>>> v[:]
array([42, 42, 42, ..., 42, 42, 42], dtype=int8)

Note that resizing or appending to views is not permitted:

>>> a = zarr.empty(10000)
>>> v = a.view()
>>> try:
...     v.resize(20000)
... except PermissionError as e:
...     print(e)
operation not permitted for views
astype(self, dtype)[source]

Returns a view that does on the fly type conversion of the underlying data.

Parameters:
dtype : string or dtype

NumPy dtype.

See also

Array.view

Notes

This method returns a new Array object which is a view on the same underlying chunk data. Modifying any data via the view is currently not permitted and will result in an error. This is an experimental feature and its behavior is subject to change in the future.

Examples

>>> import zarr
>>> import numpy as np
>>> data = np.arange(100, dtype=np.uint8)
>>> a = zarr.array(data, chunks=10)
>>> a[:]
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
       16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
       32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
       48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
       64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
       80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
       96, 97, 98, 99], dtype=uint8)
>>> v = a.astype(np.float32)
>>> v.is_view
True
>>> v[:]
array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,
        10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,
        20.,  21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,
        30.,  31.,  32.,  33.,  34.,  35.,  36.,  37.,  38.,  39.,
        40.,  41.,  42.,  43.,  44.,  45.,  46.,  47.,  48.,  49.,
        50.,  51.,  52.,  53.,  54.,  55.,  56.,  57.,  58.,  59.,
        60.,  61.,  62.,  63.,  64.,  65.,  66.,  67.,  68.,  69.,
        70.,  71.,  72.,  73.,  74.,  75.,  76.,  77.,  78.,  79.,
        80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,  88.,  89.,
        90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,  99.],
      dtype=float32)

Groups (zarr.hierarchy)

zarr.hierarchy.group(store=None, overwrite=False, chunk_store=None, cache_attrs=True, synchronizer=None, path=None)[source]

Create a group.

Parameters:
store : MutableMapping or string, optional

Store or path to directory in file system.

overwrite : bool, optional

If True, delete any pre-existing data in store at path before creating the group.

chunk_store : MutableMapping, optional

Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.

cache_attrs : bool, optional

If True (default), user attributes will be cached for attribute read operations. If False, user attributes are reloaded from the store prior to all attribute read operations.

synchronizer : object, optional

Array synchronizer.

path : string, optional

Group path within store.

Returns:
g : zarr.hierarchy.Group

Examples

Create a group in memory:

>>> import zarr
>>> g = zarr.group()
>>> g
<zarr.hierarchy.Group '/'>

Create a group with a different store:

>>> store = zarr.DirectoryStore('data/example.zarr')
>>> g = zarr.group(store=store, overwrite=True)
>>> g
<zarr.hierarchy.Group '/'>
zarr.hierarchy.open_group(store=None, mode='a', cache_attrs=True, synchronizer=None, path=None, chunk_store=None)[source]

Open a group using file-mode-like semantics.

Parameters:
store : MutableMapping or string, optional

Store or path to directory in file system or name of zip file.

mode : {‘r’, ‘r+’, ‘a’, ‘w’, ‘w-’}, optional

Persistence mode: ‘r’ means read only (must exist); ‘r+’ means read/write (must exist); ‘a’ means read/write (create if doesn’t exist); ‘w’ means create (overwrite if exists); ‘w-’ means create (fail if exists).

cache_attrs : bool, optional

If True (default), user attributes will be cached for attribute read operations. If False, user attributes are reloaded from the store prior to all attribute read operations.

synchronizer : object, optional

Array synchronizer.

path : string, optional

Group path within store.

chunk_store : MutableMapping or string, optional

Store or path to directory in file system or name of zip file.

Returns:
g : zarr.hierarchy.Group

Examples

>>> import zarr
>>> root = zarr.open_group('data/example.zarr', mode='w')
>>> foo = root.create_group('foo')
>>> bar = root.create_group('bar')
>>> root
<zarr.hierarchy.Group '/'>
>>> root2 = zarr.open_group('data/example.zarr', mode='a')
>>> root2
<zarr.hierarchy.Group '/'>
>>> root == root2
True
class zarr.hierarchy.Group(store, path=None, read_only=False, chunk_store=None, cache_attrs=True, synchronizer=None)[source]

Instantiate a group from an initialized store.

Parameters:
store : MutableMapping

Group store, already initialized. If the Group is used in a context manager, and the store has a close method, it will be called on exit.

path : string, optional

Group path.

read_only : bool, optional

True if group should be protected against modification.

chunk_store : MutableMapping, optional

Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.

cache_attrs : bool, optional

If True (default), user attributes will be cached for attribute read operations. If False, user attributes are reloaded from the store prior to all attribute read operations.

synchronizer : object, optional

Array synchronizer.

Attributes:
store

A MutableMapping providing the underlying storage for the group.

path

Storage path.

name

Group name following h5py convention.

read_only

A boolean, True if modification operations are not permitted.

chunk_store

A MutableMapping providing the underlying storage for array chunks.

synchronizer

Object used to synchronize write access to groups and arrays.

attrs

A MutableMapping containing user-defined attributes.

info

Return diagnostic information about the group.

Methods

__len__(self) Number of members.
__iter__(self) Return an iterator over group member names.
__contains__(self, item) Test for group membership.
__getitem__(self, item) Obtain a group member.
__enter__(self) Return the Group for use as a context manager.
__exit__(self, exc_type, exc_val, exc_tb) If the underlying Store has a close method, call it.
group_keys(self) Return an iterator over member names for groups only.
groups(self) Return an iterator over (name, value) pairs for groups only.
array_keys(self[, recurse]) Return an iterator over member names for arrays only.
arrays(self[, recurse]) Return an iterator over (name, value) pairs for arrays only.
visit(self, func) Run func on each object’s path.
visitkeys(self, func) An alias for visit().
visitvalues(self, func) Run func on each object.
visititems(self, func) Run func on each object’s path and the object itself.
tree(self[, expand, level]) Provide a printable display of the hierarchy.
create_group(self, name[, overwrite]) Create a sub-group.
require_group(self, name[, overwrite]) Obtain a sub-group, creating one if it doesn’t exist.
create_groups(self, \*names, \*\*kwargs) Convenience method to create multiple groups in a single call.
require_groups(self, \*names) Convenience method to require multiple groups in a single call.
create_dataset(self, name, \*\*kwargs) Create an array.
require_dataset(self, name, shape[, dtype, …]) Obtain an array, creating it if it doesn’t exist.
create(self, name, \*\*kwargs) Create an array.
empty(self, name, \*\*kwargs) Create an array.
zeros(self, name, \*\*kwargs) Create an array.
ones(self, name, \*\*kwargs) Create an array.
full(self, name, fill_value, \*\*kwargs) Create an array.
array(self, name, data, \*\*kwargs) Create an array.
empty_like(self, name, data, \*\*kwargs) Create an array.
zeros_like(self, name, data, \*\*kwargs) Create an array.
ones_like(self, name, data, \*\*kwargs) Create an array.
full_like(self, name, data, \*\*kwargs) Create an array.
info Return diagnostic information about the group.
move(self, source, dest) Move contents from one path to another relative to the Group.
__len__(self)[source]

Number of members.

__iter__(self)[source]

Return an iterator over group member names.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> d1 = g1.create_dataset('baz', shape=100, chunks=10)
>>> d2 = g1.create_dataset('quux', shape=200, chunks=20)
>>> for name in g1:
...     print(name)
bar
baz
foo
quux
__contains__(self, item)[source]

Test for group membership.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> d1 = g1.create_dataset('bar', shape=100, chunks=10)
>>> 'foo' in g1
True
>>> 'bar' in g1
True
>>> 'baz' in g1
False
__getitem__(self, item)[source]

Obtain a group member.

Parameters:
item : string

Member name or path.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> d1 = g1.create_dataset('foo/bar/baz', shape=100, chunks=10)
>>> g1['foo']
<zarr.hierarchy.Group '/foo'>
>>> g1['foo/bar']
<zarr.hierarchy.Group '/foo/bar'>
>>> g1['foo/bar/baz']
<zarr.core.Array '/foo/bar/baz' (100,) float64>
__enter__(self)[source]

Return the Group for use as a context manager.

__exit__(self, exc_type, exc_val, exc_tb)[source]

If the underlying Store has a close method, call it.
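
Examples

A sketch of context-manager usage with a store that has a close method (the file path is illustrative); the store is closed automatically on leaving the context:

>>> import zarr
>>> store = zarr.ZipStore('data/ctx.zip', mode='w')
>>> with zarr.group(store=store) as g:
...     d = g.create_dataset('foo', shape=100, chunks=10)
...     d[...] = 42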

group_keys(self)[source]

Return an iterator over member names for groups only.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> d1 = g1.create_dataset('baz', shape=100, chunks=10)
>>> d2 = g1.create_dataset('quux', shape=200, chunks=20)
>>> sorted(g1.group_keys())
['bar', 'foo']
groups(self)[source]

Return an iterator over (name, value) pairs for groups only.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> d1 = g1.create_dataset('baz', shape=100, chunks=10)
>>> d2 = g1.create_dataset('quux', shape=200, chunks=20)
>>> for n, v in g1.groups():
...     print(n, type(v))
bar <class 'zarr.hierarchy.Group'>
foo <class 'zarr.hierarchy.Group'>
array_keys(self, recurse=False)[source]

Return an iterator over member names for arrays only.

Parameters:
recurse : bool, optional

If True, return member names for all arrays, even from groups below the current one. If False, only member names for arrays in the current group will be returned. Default value is False.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> d1 = g1.create_dataset('baz', shape=100, chunks=10)
>>> d2 = g1.create_dataset('quux', shape=200, chunks=20)
>>> sorted(g1.array_keys())
['baz', 'quux']
arrays(self, recurse=False)[source]

Return an iterator over (name, value) pairs for arrays only.

Parameters:
recurse : bool, optional

If True, return (name, value) pairs for all arrays, even from groups below the current one. If False, only (name, value) pairs for arrays in the current group will be returned. Default value is False.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> d1 = g1.create_dataset('baz', shape=100, chunks=10)
>>> d2 = g1.create_dataset('quux', shape=200, chunks=20)
>>> for n, v in g1.arrays():
...     print(n, type(v))
baz <class 'zarr.core.Array'>
quux <class 'zarr.core.Array'>
visit(self, func)[source]

Run func on each object’s path.

Note: If func returns None (or doesn’t return), iteration continues. However, if func returns anything else, iteration ceases and that value is returned.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> g4 = g3.create_group('baz')
>>> g5 = g3.create_group('quux')
>>> def print_visitor(name):
...     print(name)
>>> g1.visit(print_visitor)
bar
bar/baz
bar/quux
foo
>>> g3.visit(print_visitor)
baz
quux
visitkeys(self, func)[source]

An alias for visit().

visitvalues(self, func)[source]

Run func on each object.

Note: If func returns None (or doesn’t return), iteration continues. However, if func returns anything else, iteration ceases and that value is returned.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> g4 = g3.create_group('baz')
>>> g5 = g3.create_group('quux')
>>> def print_visitor(obj):
...     print(obj)
>>> g1.visitvalues(print_visitor)
<zarr.hierarchy.Group '/bar'>
<zarr.hierarchy.Group '/bar/baz'>
<zarr.hierarchy.Group '/bar/quux'>
<zarr.hierarchy.Group '/foo'>
>>> g3.visitvalues(print_visitor)
<zarr.hierarchy.Group '/bar/baz'>
<zarr.hierarchy.Group '/bar/quux'>
visititems(self, func)[source]

Run func on each object’s path and the object itself.

Note: If func returns None (or doesn’t return), iteration continues. However, if func returns anything else, iteration ceases and that value is returned.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> g4 = g3.create_group('baz')
>>> g5 = g3.create_group('quux')
>>> def print_visitor(name, obj):
...     print((name, obj))
>>> g1.visititems(print_visitor)
('bar', <zarr.hierarchy.Group '/bar'>)
('bar/baz', <zarr.hierarchy.Group '/bar/baz'>)
('bar/quux', <zarr.hierarchy.Group '/bar/quux'>)
('foo', <zarr.hierarchy.Group '/foo'>)
>>> g3.visititems(print_visitor)
('baz', <zarr.hierarchy.Group '/bar/baz'>)
('quux', <zarr.hierarchy.Group '/bar/quux'>)
tree(self, expand=False, level=None)[source]

Provide a printable display of the hierarchy.

Parameters:
expand : bool, optional

Only relevant for HTML representation. If True, tree will be fully expanded.

level : int, optional

Maximum depth to descend into hierarchy.

Notes

Please note that this is an experimental feature. The behaviour of this function is still evolving and the default output and/or parameters may change in future versions.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> g4 = g3.create_group('baz')
>>> g5 = g3.create_group('quux')
>>> d1 = g5.create_dataset('baz', shape=100, chunks=10)
>>> g1.tree()
/
 ├── bar
 │   ├── baz
 │   └── quux
 │       └── baz (100,) float64
 └── foo
>>> g1.tree(level=2)
/
 ├── bar
 │   ├── baz
 │   └── quux
 └── foo
>>> g3.tree()
bar
 ├── baz
 └── quux
     └── baz (100,) float64
create_group(self, name, overwrite=False)[source]

Create a sub-group.

Parameters:
name : string

Group name.

overwrite : bool, optional

If True, overwrite any existing array with the given name.

Returns:
g : zarr.hierarchy.Group

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> g4 = g1.create_group('baz/quux')
require_group(self, name, overwrite=False)[source]

Obtain a sub-group, creating one if it doesn’t exist.

Parameters:
name : string

Group name.

overwrite : bool, optional

Overwrite any existing array with given name if present.

Returns:
g : zarr.hierarchy.Group

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.require_group('foo')
>>> g3 = g1.require_group('foo')
>>> g2 == g3
True
create_groups(self, *names, **kwargs)[source]

Convenience method to create multiple groups in a single call.

require_groups(self, *names)[source]

Convenience method to require multiple groups in a single call.
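
Examples

Both methods return a tuple with one group per name, e.g. (a sketch; the names are illustrative):

>>> import zarr
>>> g1 = zarr.group()
>>> foo, bar = g1.create_groups('foo', 'bar')
>>> foo
<zarr.hierarchy.Group '/foo'>
>>> foo2, bar2 = g1.require_groups('foo', 'bar')
>>> foo2 == foo
True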

create_dataset(self, name, **kwargs)[source]

Create an array.

Parameters:
name : string

Array name.

data : array_like, optional

Initial data.

shape : int or tuple of ints

Array shape.

chunks : int or tuple of ints, optional

Chunk shape. If not provided, will be guessed from shape and dtype.

dtype : string or dtype, optional

NumPy dtype.

compressor : Codec, optional

Primary compressor.

fill_value : object

Default value to use for uninitialized portions of the array.

order : {‘C’, ‘F’}, optional

Memory layout to be used within each chunk.

synchronizer : zarr.sync.ArraySynchronizer, optional

Array synchronizer.

filters : sequence of Codecs, optional

Sequence of filters to use to encode chunk data prior to compression.

overwrite : bool, optional

If True, replace any existing array or group with the given name.

cache_metadata : bool, optional

If True, array configuration metadata will be cached for the lifetime of the object. If False, array metadata will be reloaded prior to all data access and modification operations (may incur overhead depending on storage and data access pattern).

Returns:
a : zarr.core.Array

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> d1 = g1.create_dataset('foo', shape=(10000, 10000),
...                        chunks=(1000, 1000))
>>> d1
<zarr.core.Array '/foo' (10000, 10000) float64>
>>> d2 = g1.create_dataset('bar/baz/qux', shape=(100, 100, 100),
...                        chunks=(100, 10, 10))
>>> d2
<zarr.core.Array '/bar/baz/qux' (100, 100, 100) float64>
require_dataset(self, name, shape, dtype=None, exact=False, **kwargs)[source]

Obtain an array, creating it if it doesn’t exist. Other kwargs are as per zarr.hierarchy.Group.create_dataset().

Parameters:
name : string

Array name.

shape : int or tuple of ints

Array shape.

dtype : string or dtype, optional

NumPy dtype.

exact : bool, optional

If True, require that dtype matches exactly. If False, require only that dtype can be cast from the array dtype.
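
Examples

A short sketch (names illustrative); requiring an existing array again returns the existing array rather than creating a new one:

>>> import zarr
>>> g1 = zarr.group()
>>> d1 = g1.require_dataset('foo', shape=100, chunks=10, dtype='f8')
>>> d2 = g1.require_dataset('foo', shape=100, chunks=10, dtype='f8')
>>> d2
<zarr.core.Array '/foo' (100,) float64>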

create(self, name, **kwargs)[source]

Create an array. Keyword arguments as per zarr.creation.create().

empty(self, name, **kwargs)[source]

Create an array. Keyword arguments as per zarr.creation.empty().

zeros(self, name, **kwargs)[source]

Create an array. Keyword arguments as per zarr.creation.zeros().

ones(self, name, **kwargs)[source]

Create an array. Keyword arguments as per zarr.creation.ones().

full(self, name, fill_value, **kwargs)[source]

Create an array. Keyword arguments as per zarr.creation.full().

array(self, name, data, **kwargs)[source]

Create an array. Keyword arguments as per zarr.creation.array().

empty_like(self, name, data, **kwargs)[source]

Create an array. Keyword arguments as per zarr.creation.empty_like().

zeros_like(self, name, data, **kwargs)[source]

Create an array. Keyword arguments as per zarr.creation.zeros_like().

ones_like(self, name, data, **kwargs)[source]

Create an array. Keyword arguments as per zarr.creation.ones_like().

full_like(self, name, data, **kwargs)[source]

Create an array. Keyword arguments as per zarr.creation.full_like().

move(self, source, dest)[source]

Move contents from one path to another relative to the Group.

Parameters:
source : string

Name or path to a Zarr object to move.

dest : string

New name or path of the Zarr object.
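
Examples

A minimal sketch (names illustrative):

>>> import zarr
>>> g1 = zarr.group()
>>> d1 = g1.create_dataset('foo/bar', shape=100, chunks=10)
>>> g1.move('foo/bar', 'foo/baz')
>>> 'foo/baz' in g1
True
>>> 'foo/bar' in g1
False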

Storage (zarr.storage)

This module contains storage classes for use with Zarr arrays and groups.

Note that any object implementing the MutableMapping interface from the collections module in the Python standard library can be used as a Zarr array store, as long as it accepts string (str) keys and bytes values.

In addition to the MutableMapping interface, store classes may also implement optional methods listdir (list members of a “directory”) and rmdir (remove all members of a “directory”). These methods should be implemented if the store class is aware of the hierarchical organisation of resources within the store and can provide efficient implementations. If these methods are not available, Zarr will fall back to slower implementations that work via the MutableMapping interface. Store classes may also optionally implement a rename method (rename all members under a given path) and a getsize method (return the size in bytes of a given value).
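
For example, here is a minimal custom store sketch relying only on the MutableMapping behaviour described above (a plain dict subclass; the class name is illustrative):

>>> import zarr
>>> class MyStore(dict):
...     """Toy in-memory store: string keys mapped to bytes values."""
...     pass
>>> z = zarr.zeros((4, 4), chunks=(2, 2), store=MyStore())
>>> z[...] = 1
>>> sorted(z.store)  # array metadata plus one entry per chunk
['.zarray', '0.0', '0.1', '1.0', '1.1']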

class zarr.storage.MemoryStore(root=None, cls=<class 'dict'>)[source]

Store class that uses a hierarchy of dict objects, thus all data will be held in main memory.

Notes

Safe to write in multiple threads.

Examples

This is the default class used when creating a group. E.g.:

>>> import zarr
>>> g = zarr.group()
>>> type(g.store)
<class 'zarr.storage.MemoryStore'>

Note that the default class when creating an array is the built-in dict class, i.e.:

>>> z = zarr.zeros(100)
>>> type(z.store)
<class 'dict'>
class zarr.storage.DirectoryStore(path, normalize_keys=False)[source]

Storage class using directories and files on a standard file system.

Parameters:
path : string

Location of directory to use as the root of the storage hierarchy.

normalize_keys : bool, optional

If True, all store keys will be normalized to use lower case characters (e.g. ‘foo’ and ‘FOO’ will be treated as equivalent). This can be useful to avoid potential discrepancies between case-sensitive and case-insensitive file systems. Default value is False.

Notes

Atomic writes are used, which means that data are first written to a temporary file, then moved into place when the write is successfully completed. Files are only held open while they are being read or written and are closed immediately afterwards, so there is no need to manually close any files.

Safe to write in multiple threads or processes.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.DirectoryStore('data/array.zarr')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42

Each chunk of the array is stored as a separate file on the file system, i.e.:

>>> import os
>>> sorted(os.listdir('data/array.zarr'))
['.zarray', '0.0', '0.1', '1.0', '1.1']

Store a group:

>>> store = zarr.DirectoryStore('data/group.zarr')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42

When storing a group, levels in the group hierarchy will correspond to directories on the file system, i.e.:

>>> sorted(os.listdir('data/group.zarr'))
['.zgroup', 'foo']
>>> sorted(os.listdir('data/group.zarr/foo'))
['.zgroup', 'bar']
>>> sorted(os.listdir('data/group.zarr/foo/bar'))
['.zarray', '0.0', '0.1', '1.0', '1.1']
class zarr.storage.TempStore(suffix='', prefix='zarr', dir=None, normalize_keys=False)[source]

Directory store using a temporary directory for storage.

Parameters:
suffix : string, optional

Suffix for the temporary directory name.

prefix : string, optional

Prefix for the temporary directory name.

dir : string, optional

Path to parent directory in which to create temporary directory.

normalize_keys : bool, optional

If True, all store keys will be normalized to use lower case characters (e.g. ‘foo’ and ‘FOO’ will be treated as equivalent). This can be useful to avoid potential discrepancies between case-sensitive and case-insensitive file systems. Default value is False.
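
Examples

A short sketch; TempStore behaves like a DirectoryStore rooted at a freshly created temporary directory:

>>> import zarr
>>> store = zarr.TempStore()
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store)
>>> z[...] = 42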

class zarr.storage.NestedDirectoryStore(path, normalize_keys=False)[source]

Storage class using directories and files on a standard file system, with special handling for chunk keys so that chunk files for multidimensional arrays are stored in a nested directory tree.

Parameters:
path : string

Location of directory to use as the root of the storage hierarchy.

normalize_keys : bool, optional

If True, all store keys will be normalized to use lower case characters (e.g. ‘foo’ and ‘FOO’ will be treated as equivalent). This can be useful to avoid potential discrepancies between case-sensitive and case-insensitive file systems. Default value is False.

Notes

The DirectoryStore class stores all chunk files for an array together in a single directory. On some file systems, the potentially large number of files in a single directory can cause performance issues. The NestedDirectoryStore class provides an alternative where chunk files for multidimensional arrays will be organised into a directory hierarchy, thus reducing the number of files in any one directory.

Safe to write in multiple threads or processes.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.NestedDirectoryStore('data/array.zarr')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42

Each chunk of the array is stored as a separate file on the file system, note the multiple directory levels used for the chunk files:

>>> import os
>>> sorted(os.listdir('data/array.zarr'))
['.zarray', '0', '1']
>>> sorted(os.listdir('data/array.zarr/0'))
['0', '1']
>>> sorted(os.listdir('data/array.zarr/1'))
['0', '1']

Store a group:

>>> store = zarr.NestedDirectoryStore('data/group.zarr')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42

When storing a group, levels in the group hierarchy will correspond to directories on the file system, i.e.:

>>> sorted(os.listdir('data/group.zarr'))
['.zgroup', 'foo']
>>> sorted(os.listdir('data/group.zarr/foo'))
['.zgroup', 'bar']
>>> sorted(os.listdir('data/group.zarr/foo/bar'))
['.zarray', '0', '1']
>>> sorted(os.listdir('data/group.zarr/foo/bar/0'))
['0', '1']
>>> sorted(os.listdir('data/group.zarr/foo/bar/1'))
['0', '1']
class zarr.storage.ZipStore(path, compression=0, allowZip64=True, mode='a')[source]

Storage class using a Zip file.

Parameters:
path : string

Location of file.

compression : integer, optional

Compression method to use when writing to the archive.

allowZip64 : bool, optional

If True (the default), ZIP files will use the ZIP64 extensions when the zipfile is larger than 2 GiB. If False, an exception will be raised when the ZIP file would require ZIP64 extensions.

mode : string, optional

One of ‘r’ to read an existing file, ‘w’ to truncate and write a new file, ‘a’ to append to an existing file, or ‘x’ to exclusively create and write a new file.

Notes

Each chunk of an array is stored as a separate entry in the Zip file. Note that Zip files do not provide any way to remove or replace existing entries. If an attempt is made to replace an entry, then a warning is generated by the Python standard library about a duplicate Zip file entry. This can be triggered if you attempt to write data to a Zarr array more than once, e.g.:

>>> store = zarr.ZipStore('data/example.zip', mode='w')
>>> z = zarr.zeros(100, chunks=10, store=store)
>>> # first write OK
... z[...] = 42
>>> # second write generates warnings
... z[...] = 42  
>>> store.close()

This can also happen in a more subtle situation, where data are written only once to a Zarr array, but the write operations are not aligned with chunk boundaries, e.g.:

>>> store = zarr.ZipStore('data/example.zip', mode='w')
>>> z = zarr.zeros(100, chunks=10, store=store)
>>> z[5:15] = 42
>>> # write overlaps chunk previously written, generates warnings
... z[15:25] = 42  

To avoid creating duplicate entries, only write data once, and align writes with chunk boundaries. This alignment is done automatically if you call z[...] = ... or create an array from existing data via zarr.array().

Alternatively, use a DirectoryStore when writing the data, then manually Zip the directory and use the Zip file for subsequent reads.
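
For example, here is a minimal sketch of that approach, using the zipfile module from the Python standard library (file names are for illustration only):

>>> import os
>>> import zipfile
>>> import zarr
>>> store = zarr.DirectoryStore('data/example.zarr')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> with zipfile.ZipFile('data/example.zip', mode='w') as zf:
...     for root, _, files in os.walk('data/example.zarr'):
...         for fn in files:
...             path = os.path.join(root, fn)
...             # store keys must be relative to the store root, with '/' separators
...             key = os.path.relpath(path, 'data/example.zarr').replace(os.sep, '/')
...             zf.write(path, key)
>>> z2 = zarr.open_array(zarr.ZipStore('data/example.zip', mode='r'), mode='r')
>>> z2[0, 0]
42.0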

Safe to write in multiple threads but not in multiple processes.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.ZipStore('data/array.zip', mode='w')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done

Store a group:

>>> store = zarr.ZipStore('data/group.zip', mode='w')
>>> root = zarr.group(store=store)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done

After modifying a ZipStore, the close() method must be called, otherwise essential data will not be written to the underlying Zip file. The ZipStore class also supports the context manager protocol, which ensures the close() method is called on leaving the context, e.g.:

>>> with zarr.ZipStore('data/array.zip', mode='w') as store:
...     z = zarr.zeros((10, 10), chunks=(5, 5), store=store)
...     z[...] = 42
...     # no need to call store.close()

close(self)[source]

Closes the underlying zip file, ensuring all records are written.

flush(self)[source]

Closes the underlying zip file, ensuring all records are written, then re-opens the file for further modifications.

class zarr.storage.DBMStore(path, flag='c', mode=438, open=None, write_lock=True, **open_kwargs)[source]

Storage class using a DBM-style database.

Parameters:
path : string

Location of database file.

flag : string, optional

Flags for opening the database file.

mode : int

File mode used if a new file is created.

open : function, optional

Function to open the database file. If not provided, dbm.open() will be used on Python 3, and anydbm.open() will be used on Python 2.

write_lock : bool, optional

Use a lock to prevent concurrent writes from multiple threads (True by default).

**open_kwargs

Keyword arguments to pass to the open function.

Notes

Please note that, by default, this class will use the Python standard library dbm.open function to open the database file (or anydbm.open on Python 2). There are up to three different implementations of DBM-style databases available in any Python installation, and which one is used may vary from one system to another. Database file formats are not compatible between these different implementations. Also, some implementations are more efficient than others. In particular, the “dumb” implementation will be the fall-back on many systems, and has very poor performance for some usage scenarios. If you want to ensure a specific implementation is used, pass the corresponding open function, e.g., dbm.gnu.open to use the GNU DBM library.
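
For example, here is a minimal sketch using the GNU DBM implementation (assuming the dbm.gnu module is available on your system):

>>> import dbm.gnu
>>> store = zarr.DBMStore('data/array.gdb', open=dbm.gnu.open)
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()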

Safe to write in multiple threads. May be safe to write in multiple processes, depending on which DBM implementation is being used, although this has not been tested.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.DBMStore('data/array.db')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done

Store a group:

>>> store = zarr.DBMStore('data/group.db')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done

After modifying a DBMStore, the close() method must be called, otherwise essential data may not be written to the underlying database file. The DBMStore class also supports the context manager protocol, which ensures the close() method is called on leaving the context, e.g.:

>>> with zarr.DBMStore('data/array.db') as store:
...     z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
...     z[...] = 42
...     # no need to call store.close()

A different database library can be used by passing a different function to the open parameter. For example, if the bsddb3 package is installed, a Berkeley DB database can be used:

>>> import bsddb3
>>> store = zarr.DBMStore('data/array.bdb', open=bsddb3.btopen)
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()

close(self)[source]

Closes the underlying database file.

flush(self)[source]

Synchronizes data to the underlying database file.

class zarr.storage.LMDBStore(path, buffers=True, **kwargs)[source]

Storage class using LMDB. Requires the lmdb package to be installed.

Parameters:
path : string

Location of database file.

buffers : bool, optional

If True (the default), use support for buffers, which should increase performance by reducing memory copies.

**kwargs

Keyword arguments passed through to the lmdb.open function.

Notes

By default, writes are not immediately flushed to disk, in order to increase performance. You can ensure data are flushed to disk by calling the flush() or close() methods.

Should be safe to write in multiple threads or processes due to the synchronization support within LMDB, although writing from multiple processes has not been tested.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.LMDBStore('data/array.mdb')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done

Store a group:

>>> store = zarr.LMDBStore('data/group.mdb')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done

After modifying an LMDBStore, the close() method must be called, otherwise essential data may not be written to the underlying database file. The LMDBStore class also supports the context manager protocol, which ensures the close() method is called on leaving the context, e.g.:

>>> with zarr.LMDBStore('data/array.mdb') as store:
...     z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
...     z[...] = 42
...     # no need to call store.close()

close(self)[source]

Closes the underlying database.

flush(self)[source]

Synchronizes data to the file system.

class zarr.storage.SQLiteStore(path, **kwargs)[source]

Storage class using SQLite.

Parameters:
path : string

Location of database file.

**kwargs

Keyword arguments passed through to the sqlite3.connect function.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.SQLiteStore('data/array.sqldb')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done

Store a group:

>>> store = zarr.SQLiteStore('data/group.sqldb')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done

close(self)[source]

Closes the underlying database.

class zarr.storage.MongoDBStore(database='mongodb_zarr', collection='zarr_collection', **kwargs)[source]

Storage class using MongoDB.

Note

This is an experimental feature.

Requires the pymongo package to be installed.

Parameters:
database : string

Name of database

collection : string

Name of collection

**kwargs

Keyword arguments passed through to the pymongo.MongoClient function.

Notes

The maximum chunk size that can be stored is 16 MB, which is the maximum document size in MongoDB.
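
For example, here is a minimal sketch, assuming a MongoDB server is running locally (connection parameters are for illustration only):

>>> import zarr
>>> store = zarr.MongoDBStore(host='127.0.0.1', database='mydb')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()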

class zarr.storage.RedisStore(prefix='zarr', **kwargs)[source]

Storage class using Redis.

Note

This is an experimental feature.

Requires the redis package to be installed.

Parameters:
prefix : string

Name of prefix for Redis keys

**kwargs

Keyword arguments passed through to the redis.Redis function.
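
For example, here is a minimal sketch, assuming a Redis server is running locally (connection parameters are for illustration only):

>>> import zarr
>>> store = zarr.RedisStore(host='localhost', port=6379)
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42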

class zarr.storage.LRUStoreCache(store, max_size)[source]

Storage class that implements a least-recently-used (LRU) cache layer over some other store. Intended primarily for use with stores that can be slow to access, e.g., remote stores that require network communication to store and retrieve data.

Parameters:
store : MutableMapping

The store containing the actual data to be cached.

max_size : int

The maximum size that the cache may grow to, in number of bytes. Provide None if you would like the cache to have unlimited size.

Examples

The example below wraps an S3 store with an LRU cache:

>>> import s3fs
>>> import zarr
>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
>>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
>>> cache = zarr.LRUStoreCache(store, max_size=2**28)
>>> root = zarr.group(store=cache)  
>>> z = root['foo/bar/baz']  
>>> from timeit import timeit
>>> # first data access is relatively slow, retrieved from store
... timeit('print(z[:].tostring())', number=1, globals=globals())  
b'Hello from the cloud!'
0.1081731989979744
>>> # second data access is faster, uses cache
... timeit('print(z[:].tostring())', number=1, globals=globals())  
b'Hello from the cloud!'
0.0009490990014455747

invalidate(self)[source]

Completely clear the cache.

invalidate_values(self)[source]

Clear the values cache.

invalidate_keys(self)[source]

Clear the keys cache.

class zarr.storage.ABSStore(container, prefix='', account_name=None, account_key=None, blob_service_kwargs=None)[source]

Storage class using Azure Blob Storage (ABS).

Parameters:
container : string

The name of the ABS container to use.

prefix : string

Location of the “directory” to use as the root of the storage hierarchy within the container.

account_name : string

The Azure blob storage account name.

account_key : string

The Azure blob storage account access key.

blob_service_kwargs : dictionary

Extra arguments to be passed into the Azure blob client, e.g., when using the emulator, pass blob_service_kwargs={‘is_emulated’: True}.

Notes

In order to use this store, you must install the Microsoft Azure Storage SDK for Python.
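
For example, here is a minimal sketch, assuming an existing container and valid credentials (the '...' values are placeholders, not real credentials):

>>> import zarr
>>> store = zarr.ABSStore(container='test', prefix='zarr-testing',
...                       account_name='...', account_key='...')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42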

class zarr.storage.ConsolidatedMetadataStore(store, metadata_key='.zmetadata')[source]

A layer over other storage, where the metadata has been consolidated into a single key.

The purpose of this class is to be able to get all of the metadata for a given dataset in a single read operation from the underlying storage. See zarr.convenience.consolidate_metadata() for how to create this single metadata key.

This class loads the metadata from that one key and stores it in a dict, so that accessing the metadata keys no longer requires operations on the backend store.

This class is read-only, and attempts to change the dataset metadata will fail, but changing the data is possible. If the backend storage is changed directly, then the metadata stored here could become obsolete, and zarr.convenience.consolidate_metadata() should be called again and a new instance of this class created. The use case is write once, read many times.

New in version 2.3.

Note

This is an experimental feature.

Parameters:
store : MutableMapping

Containing the zarr dataset.

metadata_key : str

The target in the store where all of the metadata are stored. We assume JSON encoding.
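
For example, here is a minimal sketch of consolidating metadata and then layering this class over the underlying store for read-only access (paths are for illustration only):

>>> import zarr
>>> store = zarr.DirectoryStore('data/example.zarr')
>>> root = zarr.group(store=store, overwrite=True)
>>> z = root.zeros('foo', shape=(10, 10), chunks=(5, 5))
>>> zarr.consolidate_metadata(store)
<zarr.hierarchy.Group '/'>
>>> meta_store = zarr.storage.ConsolidatedMetadataStore(store)
>>> root = zarr.open(meta_store, mode='r')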

zarr.storage.init_array(store, shape, chunks=True, dtype=None, compressor='default', fill_value=None, order='C', overwrite=False, path=None, chunk_store=None, filters=None, object_codec=None)[source]

Initialize an array store with the given configuration. Note that this is a low-level function and there should be no need to call this directly from user code.

Parameters:
store : MutableMapping

A mapping that supports string keys and bytes-like values.

shape : int or tuple of ints

Array shape.

chunks : int or tuple of ints, optional

Chunk shape. If True, will be guessed from shape and dtype. If False, will be set to shape, i.e., single chunk for the whole array.

dtype : string or dtype, optional

NumPy dtype.

compressor : Codec, optional

Primary compressor.

fill_value : object

Default value to use for uninitialized portions of the array.

order : {‘C’, ‘F’}, optional

Memory layout to be used within each chunk.

overwrite : bool, optional

If True, erase all data in store prior to initialisation.

path : string, optional

Path under which array is stored.

chunk_store : MutableMapping, optional

Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.

filters : sequence, optional

Sequence of filters to use to encode chunk data prior to compression.

object_codec : Codec, optional

A codec to encode object arrays, only needed if dtype=object.

Notes

The initialisation process involves normalising all array metadata, encoding as JSON and storing under the ‘.zarray’ key.

Examples

Initialize an array store:

>>> from zarr.storage import init_array
>>> store = dict()
>>> init_array(store, shape=(10000, 10000), chunks=(1000, 1000))
>>> sorted(store.keys())
['.zarray']

Array metadata is stored as JSON:

>>> print(store['.zarray'].decode())
{
    "chunks": [
        1000,
        1000
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<f8",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        10000,
        10000
    ],
    "zarr_format": 2
}

Initialize an array using a storage path:

>>> store = dict()
>>> init_array(store, shape=100000000, chunks=1000000, dtype='i1', path='foo')
>>> sorted(store.keys())
['.zgroup', 'foo/.zarray']
>>> print(store['foo/.zarray'].decode())
{
    "chunks": [
        1000000
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "|i1",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        100000000
    ],
    "zarr_format": 2
}

zarr.storage.init_group(store, overwrite=False, path=None, chunk_store=None)[source]

Initialize a group store. Note that this is a low-level function and there should be no need to call this directly from user code.

Parameters:
store : MutableMapping

A mapping that supports string keys and byte sequence values.

overwrite : bool, optional

If True, erase all data in store prior to initialisation.

path : string, optional

Path under which the group is stored.

chunk_store : MutableMapping, optional

Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.
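
For example, here is a minimal sketch of initializing a group store, both at the root and under a storage path:

>>> from zarr.storage import init_group
>>> store = dict()
>>> init_group(store)
>>> sorted(store.keys())
['.zgroup']
>>> init_group(store, path='foo')
>>> sorted(store.keys())
['.zgroup', 'foo/.zgroup']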

zarr.storage.contains_array(store, path=None)[source]

Return True if the store contains an array at the given logical path.

zarr.storage.contains_group(store, path=None)[source]

Return True if the store contains a group at the given logical path.

zarr.storage.listdir(store, path=None)[source]

Obtain a directory listing for the given path. If store provides a listdir method, this will be called; otherwise a fall-back implementation via the MutableMapping interface will be used.

zarr.storage.rmdir(store, path=None)[source]

Remove all items under the given path. If store provides a rmdir method, this will be called; otherwise a fall-back implementation via the MutableMapping interface will be used.

zarr.storage.getsize(store, path=None)[source]

Compute the size of stored items for a given path. If store provides a getsize method, this will be called; otherwise -1 will be returned.

zarr.storage.rename(store, src_path, dst_path)[source]

Rename all items under the given path. If store provides a rename method, this will be called; otherwise a fall-back implementation via the MutableMapping interface will be used.
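
The following minimal sketch illustrates these helper functions, using a plain dict as the store:

>>> from zarr.storage import (init_array, contains_array, contains_group,
...                           listdir, getsize, rmdir)
>>> store = dict()
>>> init_array(store, shape=(100,), chunks=(10,), path='foo')
>>> contains_group(store)  # a group is created at the root
True
>>> contains_array(store, path='foo')
True
>>> listdir(store, path='foo')
['.zarray']
>>> getsize(store)  # a plain dict provides no getsize method
-1
>>> rmdir(store, path='foo')
>>> contains_array(store, path='foo')
False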

zarr.storage.migrate_1to2(store)[source]

Migrate array metadata in store from Zarr format version 1 to version 2.

Parameters:
store : MutableMapping

Store to be migrated.

Notes

Version 1 did not support hierarchies, so this migration function will look for a single array in store and migrate the array metadata to version 2.

N5 (zarr.n5)

This module contains a storage class and codec to support the N5 format.

class zarr.n5.N5Store(path, normalize_keys=False)[source]

Storage class using directories and files on a standard file system, following the N5 format (https://github.com/saalfeldlab/n5).

Parameters:
path : string

Location of directory to use as the root of the storage hierarchy.

normalize_keys : bool, optional

If True, all store keys will be normalized to use lower case characters (e.g. ‘foo’ and ‘FOO’ will be treated as equivalent). This can be useful to avoid potential discrepancies between case-sensitive and case-insensitive file systems. Default value is False.

Notes

This is an experimental feature.

Safe to write in multiple threads or processes.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.N5Store('data/array.n5')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42

Store a group:

>>> store = zarr.N5Store('data/group.n5')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42

Convenience functions (zarr.convenience)

Convenience functions for storing and loading data.

zarr.convenience.open(store=None, mode='a', **kwargs)[source]

Convenience function to open a group or array using file-mode-like semantics.

Parameters:
store : MutableMapping or string, optional

Store or path to directory in file system or name of zip file.

mode : {‘r’, ‘r+’, ‘a’, ‘w’, ‘w-‘}, optional

Persistence mode: ‘r’ means read only (must exist); ‘r+’ means read/write (must exist); ‘a’ means read/write (create if doesn’t exist); ‘w’ means create (overwrite if exists); ‘w-‘ means create (fail if exists).

**kwargs

Additional parameters are passed through to zarr.creation.open_array() or zarr.hierarchy.open_group().

Returns:
z : zarr.core.Array or zarr.hierarchy.Group

Array or group, depending on what exists in the given store.

Examples

Storing data in a directory ‘data/example.zarr’ on the local file system:

>>> import zarr
>>> store = 'data/example.zarr'
>>> zw = zarr.open(store, mode='w', shape=100, dtype='i4')  # open new array
>>> zw
<zarr.core.Array (100,) int32>
>>> za = zarr.open(store, mode='a')  # open existing array for reading and writing
>>> za
<zarr.core.Array (100,) int32>
>>> zr = zarr.open(store, mode='r')  # open existing array read-only
>>> zr
<zarr.core.Array (100,) int32 read-only>
>>> gw = zarr.open(store, mode='w')  # open new group, overwriting previous data
>>> gw
<zarr.hierarchy.Group '/'>
>>> ga = zarr.open(store, mode='a')  # open existing group for reading and writing
>>> ga
<zarr.hierarchy.Group '/'>
>>> gr = zarr.open(store, mode='r')  # open existing group read-only
>>> gr
<zarr.hierarchy.Group '/' read-only>

zarr.convenience.save(store, *args, **kwargs)[source]

Convenience function to save an array or group of arrays to the local file system.

Parameters:
store : MutableMapping or string

Store or path to directory in file system or name of zip file.

args : ndarray

NumPy arrays with data to save.

kwargs

NumPy arrays with data to save.

Examples

Save an array to a directory on the file system (uses a DirectoryStore):

>>> import zarr
>>> import numpy as np
>>> arr = np.arange(10000)
>>> zarr.save('data/example.zarr', arr)
>>> zarr.load('data/example.zarr')
array([   0,    1,    2, ..., 9997, 9998, 9999])

Save an array to a Zip file (uses a ZipStore):

>>> zarr.save('data/example.zip', arr)
>>> zarr.load('data/example.zip')
array([   0,    1,    2, ..., 9997, 9998, 9999])

Save several arrays to a directory on the file system (uses a DirectoryStore and stores arrays in a group):

>>> import zarr
>>> import numpy as np
>>> a1 = np.arange(10000)
>>> a2 = np.arange(10000, 0, -1)
>>> zarr.save('data/example.zarr', a1, a2)
>>> loader = zarr.load('data/example.zarr')
>>> loader
<LazyLoader: arr_0, arr_1>
>>> loader['arr_0']
array([   0,    1,    2, ..., 9997, 9998, 9999])
>>> loader['arr_1']
array([10000,  9999,  9998, ...,     3,     2,     1])

Save several arrays using named keyword arguments:

>>> zarr.save('data/example.zarr', foo=a1, bar=a2)
>>> loader = zarr.load('data/example.zarr')
>>> loader
<LazyLoader: bar, foo>
>>> loader['foo']
array([   0,    1,    2, ..., 9997, 9998, 9999])
>>> loader['bar']
array([10000,  9999,  9998, ...,     3,     2,     1])

Store several arrays in a single zip file (uses a ZipStore):

>>> zarr.save('data/example.zip', foo=a1, bar=a2)
>>> loader = zarr.load('data/example.zip')
>>> loader
<LazyLoader: bar, foo>
>>> loader['foo']
array([   0,    1,    2, ..., 9997, 9998, 9999])
>>> loader['bar']
array([10000,  9999,  9998, ...,     3,     2,     1])

zarr.convenience.load(store)[source]

Load data from an array or group into memory.

Parameters:
store : MutableMapping or string

Store or path to directory in file system or name of zip file.

Returns:
out

If the store contains an array, out will be a numpy array. If the store contains a group, out will be a dict-like object where keys are array names and values are numpy arrays.

See also

save, savez

Notes

If loading data from a group of arrays, data will not be immediately loaded into memory. Rather, arrays will be loaded into memory as they are requested.

zarr.convenience.save_array(store, arr, **kwargs)[source]

Convenience function to save a NumPy array to the local file system, following a similar API to the NumPy save() function.

Parameters:
store : MutableMapping or string

Store or path to directory in file system or name of zip file.

arr : ndarray

NumPy array with data to save.

kwargs

Passed through to create(), e.g., compressor.

Examples

Save an array to a directory on the file system (uses a DirectoryStore):

>>> import zarr
>>> import numpy as np
>>> arr = np.arange(10000)
>>> zarr.save_array('data/example.zarr', arr)
>>> zarr.load('data/example.zarr')
array([   0,    1,    2, ..., 9997, 9998, 9999])

Save an array to a single file (uses a ZipStore):

>>> zarr.save_array('data/example.zip', arr)
>>> zarr.load('data/example.zip')
array([   0,    1,    2, ..., 9997, 9998, 9999])

zarr.convenience.save_group(store, *args, **kwargs)[source]

Convenience function to save several NumPy arrays to the local file system, following a similar API to the NumPy savez()/savez_compressed() functions.

Parameters:
store : MutableMapping or string

Store or path to directory in file system or name of zip file.

args : ndarray

NumPy arrays with data to save.

kwargs

NumPy arrays with data to save.

Notes

Default compression options will be used.

Examples

Save several arrays to a directory on the file system (uses a DirectoryStore):

>>> import zarr
>>> import numpy as np
>>> a1 = np.arange(10000)
>>> a2 = np.arange(10000, 0, -1)
>>> zarr.save_group('data/example.zarr', a1, a2)
>>> loader = zarr.load('data/example.zarr')
>>> loader
<LazyLoader: arr_0, arr_1>
>>> loader['arr_0']
array([   0,    1,    2, ..., 9997, 9998, 9999])
>>> loader['arr_1']
array([10000,  9999,  9998, ...,     3,     2,     1])

Save several arrays using named keyword arguments:

>>> zarr.save_group('data/example.zarr', foo=a1, bar=a2)
>>> loader = zarr.load('data/example.zarr')
>>> loader
<LazyLoader: bar, foo>
>>> loader['foo']
array([   0,    1,    2, ..., 9997, 9998, 9999])
>>> loader['bar']
array([10000,  9999,  9998, ...,     3,     2,     1])

Store several arrays in a single zip file (uses a ZipStore):

>>> zarr.save_group('data/example.zip', foo=a1, bar=a2)
>>> loader = zarr.load('data/example.zip')
>>> loader
<LazyLoader: bar, foo>
>>> loader['foo']
array([   0,    1,    2, ..., 9997, 9998, 9999])
>>> loader['bar']
array([10000,  9999,  9998, ...,     3,     2,     1])

zarr.convenience.copy(source, dest, name=None, shallow=False, without_attrs=False, log=None, if_exists='raise', dry_run=False, **create_kws)[source]

Copy the source array or group into the dest group.

Parameters:
source : group or array/dataset

A zarr group or array, or an h5py group or dataset.

dest : group

A zarr or h5py group.

name : str, optional

Name to copy the object to.

shallow : bool, optional

If True, only copy immediate children of source.

without_attrs : bool, optional

Do not copy user attributes.

log : callable, file path or file-like object, optional

If provided, will be used to log progress information.

if_exists : {‘raise’, ‘replace’, ‘skip’, ‘skip_initialized’}, optional

How to handle arrays that already exist in the destination group. If ‘raise’ then a CopyError is raised on the first array already present in the destination group. If ‘replace’ then any array will be replaced in the destination. If ‘skip’ then any existing arrays will not be copied. If ‘skip_initialized’ then any existing arrays with all chunks initialized will not be copied (not available when copying to h5py).

dry_run : bool, optional

If True, don’t actually copy anything, just log what would have happened.

**create_kws

Passed through to the create_dataset method when copying an array/dataset.

Returns:
n_copied : int

Number of items copied.

n_skipped : int

Number of items skipped.

n_bytes_copied : int

Number of bytes of data that were actually copied.

Notes

Please note that this is an experimental feature. The behaviour of this function is still evolving and the default behaviour and/or parameters may change in future versions.

Examples

Here’s an example of copying a group named ‘foo’ from an HDF5 file to a Zarr group:

>>> import h5py
>>> import zarr
>>> import numpy as np
>>> source = h5py.File('data/example.h5', mode='w')
>>> foo = source.create_group('foo')
>>> baz = foo.create_dataset('bar/baz', data=np.arange(100), chunks=(50,))
>>> spam = source.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
>>> zarr.tree(source)
/
 ├── foo
 │   └── bar
 │       └── baz (100,) int64
 └── spam (100,) int64
>>> dest = zarr.group()
>>> from sys import stdout
>>> zarr.copy(source['foo'], dest, log=stdout)
copy /foo
copy /foo/bar
copy /foo/bar/baz (100,) int64
all done: 3 copied, 0 skipped, 800 bytes copied
(3, 0, 800)
>>> dest.tree()  # N.B., no spam
/
 └── foo
     └── bar
         └── baz (100,) int64
>>> source.close()

The if_exists parameter provides options for how to handle pre-existing data in the destination. Here are some examples of these options, also using dry_run=True to find out what would happen without actually copying anything:

>>> source = zarr.group()
>>> dest = zarr.group()
>>> baz = source.create_dataset('foo/bar/baz', data=np.arange(100))
>>> spam = source.create_dataset('foo/spam', data=np.arange(1000))
>>> existing_spam = dest.create_dataset('foo/spam', data=np.arange(1000))
>>> from sys import stdout
>>> try:
...     zarr.copy(source['foo'], dest, log=stdout, dry_run=True)
... except zarr.CopyError as e:
...     print(e)
...
copy /foo
copy /foo/bar
copy /foo/bar/baz (100,) int64
an object 'spam' already exists in destination '/foo'
>>> zarr.copy(source['foo'], dest, log=stdout, if_exists='replace', dry_run=True)
copy /foo
copy /foo/bar
copy /foo/bar/baz (100,) int64
copy /foo/spam (1000,) int64
dry run: 4 copied, 0 skipped
(4, 0, 0)
>>> zarr.copy(source['foo'], dest, log=stdout, if_exists='skip', dry_run=True)
copy /foo
copy /foo/bar
copy /foo/bar/baz (100,) int64
skip /foo/spam (1000,) int64
dry run: 3 copied, 1 skipped
(3, 1, 0)

zarr.convenience.copy_all(source, dest, shallow=False, without_attrs=False, log=None, if_exists='raise', dry_run=False, **create_kws)[source]

Copy all children of the source group into the dest group.

Parameters:
source : group or array/dataset

A zarr group or array, or an h5py group or dataset.

dest : group

A zarr or h5py group.

shallow : bool, optional

If True, only copy immediate children of source.

without_attrs : bool, optional

Do not copy user attributes.

log : callable, file path or file-like object, optional

If provided, will be used to log progress information.

if_exists : {‘raise’, ‘replace’, ‘skip’, ‘skip_initialized’}, optional

How to handle arrays that already exist in the destination group. If ‘raise’ then a CopyError is raised on the first array already present in the destination group. If ‘replace’ then any array will be replaced in the destination. If ‘skip’ then any existing arrays will not be copied. If ‘skip_initialized’ then any existing arrays with all chunks initialized will not be copied (not available when copying to h5py).

dry_run : bool, optional

If True, don’t actually copy anything, just log what would have happened.

**create_kws

Passed through to the create_dataset method when copying an array/dataset.

Returns:
n_copied : int

Number of items copied.

n_skipped : int

Number of items skipped.

n_bytes_copied : int

Number of bytes of data that were actually copied.

Notes

Please note that this is an experimental feature. The behaviour of this function is still evolving and the default behaviour and/or parameters may change in future versions.

Examples

>>> import h5py
>>> import zarr
>>> import numpy as np
>>> source = h5py.File('data/example.h5', mode='w')
>>> foo = source.create_group('foo')
>>> baz = foo.create_dataset('bar/baz', data=np.arange(100), chunks=(50,))
>>> spam = source.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
>>> zarr.tree(source)
/
 ├── foo
 │   └── bar
 │       └── baz (100,) int64
 └── spam (100,) int64
>>> dest = zarr.group()
>>> import sys
>>> zarr.copy_all(source, dest, log=sys.stdout)
copy /foo
copy /foo/bar
copy /foo/bar/baz (100,) int64
copy /spam (100,) int64
all done: 4 copied, 0 skipped, 1,600 bytes copied
(4, 0, 1600)
>>> dest.tree()
/
 ├── foo
 │   └── bar
 │       └── baz (100,) int64
 └── spam (100,) int64
>>> source.close()

zarr.convenience.copy_store(source, dest, source_path='', dest_path='', excludes=None, includes=None, flags=0, if_exists='raise', dry_run=False, log=None)[source]

Copy data directly from the source store to the dest store. Use this function when you want to copy a group or array in the most efficient way, preserving all configuration and attributes. This function is more efficient than the copy() or copy_all() functions because it avoids de-compressing and re-compressing data; instead, the compressed chunk data for each array are copied directly between stores.

Parameters:
source : Mapping

Store to copy data from.

dest : MutableMapping

Store to copy data into.

source_path : str, optional

Only copy data from under this path in the source store.

dest_path : str, optional

Copy data into this path in the destination store.

excludes : sequence of str, optional

One or more regular expressions which will be matched against keys in the source store. Any matching key will not be copied.

includes : sequence of str, optional

One or more regular expressions which will be matched against keys in the source store and will override any excludes also matching.

flags : int, optional

Regular expression flags used for matching excludes and includes.

if_exists : {‘raise’, ‘replace’, ‘skip’}, optional

How to handle keys that already exist in the destination store. If ‘raise’ then a CopyError is raised on the first key already present in the destination store. If ‘replace’ then any data will be replaced in the destination. If ‘skip’ then any existing keys will not be copied.

dry_run : bool, optional

If True, don’t actually copy anything, just log what would have happened.

log : callable, file path or file-like object, optional

If provided, will be used to log progress information.

Returns:
n_copied : int

Number of items copied.

n_skipped : int

Number of items skipped.

n_bytes_copied : int

Number of bytes of data that were actually copied.

Notes

Please note that this is an experimental feature. The behaviour of this function is still evolving and the default behaviour and/or parameters may change in future versions.

Examples

>>> import zarr
>>> store1 = zarr.DirectoryStore('data/example.zarr')
>>> root = zarr.group(store1, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.create_group('bar')
>>> baz = bar.create_dataset('baz', shape=100, chunks=50, dtype='i8')
>>> import numpy as np
>>> baz[:] = np.arange(100)
>>> root.tree()
/
 └── foo
     └── bar
         └── baz (100,) int64
>>> from sys import stdout
>>> store2 = zarr.ZipStore('data/example.zip', mode='w')
>>> zarr.copy_store(store1, store2, log=stdout)
copy .zgroup
copy foo/.zgroup
copy foo/bar/.zgroup
copy foo/bar/baz/.zarray
copy foo/bar/baz/0
copy foo/bar/baz/1
all done: 6 copied, 0 skipped, 566 bytes copied
(6, 0, 566)
>>> new_root = zarr.group(store2)
>>> new_root.tree()
/
 └── foo
     └── bar
         └── baz (100,) int64
>>> new_root['foo/bar/baz'][:]
array([ 0,  1,  2,  ..., 97, 98, 99])
>>> store2.close()  # zip stores need to be closed

zarr.convenience.tree(grp, expand=False, level=None)[source]

Provide a printable display of the hierarchy. This function is provided mainly as a convenience for obtaining a tree view of an h5py group; zarr groups have a .tree() method.

Parameters:
grp : Group

Zarr or h5py group.

expand : bool, optional

Only relevant for HTML representation. If True, tree will be fully expanded.

level : int, optional

Maximum depth to descend into hierarchy.

Notes

Please note that this is an experimental feature. The behaviour of this function is still evolving and the default output and/or parameters may change in future versions.

Examples

>>> import zarr
>>> g1 = zarr.group()
>>> g2 = g1.create_group('foo')
>>> g3 = g1.create_group('bar')
>>> g4 = g3.create_group('baz')
>>> g5 = g3.create_group('qux')
>>> d1 = g5.create_dataset('baz', shape=100, chunks=10)
>>> g1.tree()
/
 ├── bar
 │   ├── baz
 │   └── qux
 │       └── baz (100,) float64
 └── foo
>>> import h5py
>>> h5f = h5py.File('data/example.h5', mode='w')
>>> zarr.copy_all(g1, h5f)
(5, 0, 800)
>>> zarr.tree(h5f)
/
 ├── bar
 │   ├── baz
 │   └── qux
 │       └── baz (100,) float64
 └── foo

zarr.convenience.consolidate_metadata(store, metadata_key='.zmetadata')[source]

Consolidate all metadata for groups and arrays within the given store into a single resource and put it under the given key.

This produces a single object in the backend store, containing all the metadata read from all the zarr-related keys that can be found. After metadata have been consolidated, use open_consolidated() to open the root group in optimised, read-only mode, using the consolidated metadata to reduce the number of read operations on the backend store.

Note that if the metadata in the store are changed after this consolidation, then the metadata read by open_consolidated() would be incorrect unless this function is called again.

Note

This is an experimental feature.

Parameters:
store : MutableMapping or string

Store or path to directory in file system or name of zip file.

metadata_key : str

Key to put the consolidated metadata under.

Returns:
g : zarr.hierarchy.Group

Group instance, opened with the new consolidated metadata.
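
For example, here is a minimal sketch (paths are for illustration only):

>>> import zarr
>>> store = zarr.DirectoryStore('data/example.zarr')
>>> root = zarr.group(store=store, overwrite=True)
>>> z = root.zeros('foo', shape=(10, 10), chunks=(5, 5))
>>> zarr.consolidate_metadata(store)
<zarr.hierarchy.Group '/'>
>>> '.zmetadata' in store
True
>>> root = zarr.open_consolidated(store)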

zarr.convenience.open_consolidated(store, metadata_key='.zmetadata', mode='r+', **kwargs)[source]

Open group using metadata previously consolidated into a single key.

This is an optimised method for opening a Zarr group, where instead of traversing the group/array hierarchy by accessing the metadata keys at each level, a single key contains all of the metadata for everything. This is especially useful for remote data sources where the overhead of accessing a key is large compared to the time to read data.

The group accessed must have already had its metadata consolidated into a single key using the function consolidate_metadata().

This optimised method only works in modes which do not change the metadata, although the data may still be written/updated.

Parameters:
store : MutableMapping or string

Store or path to directory in file system or name of zip file.

metadata_key : str

Key to read the consolidated metadata from. The default (.zmetadata) corresponds to the default used by consolidate_metadata().

mode : {‘r’, ‘r+’}, optional

Persistence mode: ‘r’ means read only (must exist); ‘r+’ means read/write (must exist), although only writes to data are allowed; changes to metadata, including creation of new arrays or groups, are not allowed.

**kwargs

Additional parameters are passed through to zarr.creation.open_array() or zarr.hierarchy.open_group().

Returns:
g : zarr.hierarchy.Group

Group instance, opened with the consolidated metadata.

Compressors and filters (zarr.codecs)

This module contains compressor and filter classes for use with Zarr. Please note that this module is provided for backwards compatibility with previous versions of Zarr. From Zarr version 2.2 onwards, all codec classes have been moved to a separate package called Numcodecs. The two packages (Zarr and Numcodecs) are designed to be used together. For example, a Numcodecs codec class can be used as a compressor for a Zarr array:

>>> import zarr
>>> from numcodecs import Blosc
>>> z = zarr.zeros(1000000, compressor=Blosc(cname='zstd', clevel=1, shuffle=Blosc.SHUFFLE))

Codec classes can also be used as filters. See the tutorial section on Filters for more information.
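
For example, here is a minimal sketch using the Numcodecs Delta codec as a filter in combination with Blosc compression:

>>> from numcodecs import Blosc, Delta
>>> filters = [Delta(dtype='i4')]
>>> compressor = Blosc(cname='zstd', clevel=1, shuffle=Blosc.SHUFFLE)
>>> z = zarr.zeros(1000000, chunks=100000, dtype='i4',
...                filters=filters, compressor=compressor)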

Please note that it is also relatively straightforward to define and register custom codec classes. See the Numcodecs codec API and codec registry documentation for more information.

The Attributes class (zarr.attrs)

class zarr.attrs.Attributes(store, key='.zattrs', read_only=False, cache=True, synchronizer=None)[source]

Class providing access to user attributes on an array or group. Should not be instantiated directly, will be available via the .attrs property of an array or group.

Parameters:
store : MutableMapping

The store in which to store the attributes.

key : str, optional

The key under which the attributes will be stored.

read_only : bool, optional

If True, attributes cannot be modified.

cache : bool, optional

If True (default), attributes will be cached locally.

synchronizer : Synchronizer

Only necessary if attributes may be modified from multiple threads or processes.

__getitem__(self, item)[source]
__setitem__(self, item, value)[source]
__delitem__(self, item)[source]
__iter__(self)[source]
__len__(self)[source]
keys(self)[source]
asdict(self)[source]

Retrieve all attributes as a dictionary.

put(self, d)[source]

Overwrite all attributes with the key/value pairs in the provided dictionary d in a single operation.

update(self, *args, **kwargs)[source]

Update the values of several attributes in a single operation.

refresh(self)[source]

Refresh cached attributes from the store.
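
For example, here is a minimal sketch of typical usage via the .attrs property:

>>> import zarr
>>> z = zarr.zeros((10, 10), chunks=(5, 5))
>>> z.attrs['foo'] = 'bar'
>>> z.attrs['foo']
'bar'
>>> z.attrs.update(baz=42, qux=[1, 2, 3])
>>> sorted(z.attrs)
['baz', 'foo', 'qux']
>>> z.attrs.asdict() == {'foo': 'bar', 'baz': 42, 'qux': [1, 2, 3]}
True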

Synchronization (zarr.sync)

class zarr.sync.ThreadSynchronizer[source]

Provides synchronization using thread locks.

class zarr.sync.ProcessSynchronizer(path)[source]

Provides synchronization using file locks via the fasteners package.

Parameters:
path : string

Path to a directory on a file system that is shared by all processes. N.B., this should be a different path to where you store the array.
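
For example, here is a minimal sketch of using a ProcessSynchronizer when opening an array (paths are for illustration only):

>>> import zarr
>>> synchronizer = zarr.ProcessSynchronizer('data/example.sync')
>>> z = zarr.open_array('data/example.zarr', mode='w', shape=(10000,),
...                     chunks=(1000,), dtype='i4', synchronizer=synchronizer)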

Specifications

Zarr storage specification version 1

This document provides a technical specification of the protocol and format used for storing a Zarr array. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Status

This specification is deprecated. See Specifications for the latest version.

Storage

A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string and a value is an arbitrary sequence of bytes, and the supported operations are read (get the sequence of bytes associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair).

For example, a directory in a file system can provide this interface, where keys are file names, values are file contents, and files can be read, written or deleted via the operating system. Equally, an S3 bucket can provide this interface, where keys are resource names, values are resource contents, and resources can be read, written or deleted via HTTP.

Below an “array store” refers to any system implementing this interface.

Metadata

Each array requires essential configuration metadata to be stored, enabling correct interpretation of the stored data. This metadata is encoded using JSON and stored as the value of the ‘meta’ key within an array store.

The metadata resource is a JSON object. The following keys MUST be present within the object:

zarr_format
An integer defining the version of the storage specification to which the array store adheres.
shape
A list of integers defining the length of each dimension of the array.
chunks
A list of integers defining the length of each dimension of a chunk of the array. Note that all chunks within a Zarr array have the same shape.
dtype
A string or list defining a valid data type for the array. See also the subsection below on data type encoding.
compression
A string identifying the primary compression library used to compress each chunk of the array.
compression_opts
An integer, string or dictionary providing options to the primary compression library.
fill_value
A scalar value providing the default value to use for uninitialized portions of the array.
order
Either ‘C’ or ‘F’, defining the layout of bytes within each chunk of the array. ‘C’ means row-major order, i.e., the last dimension varies fastest; ‘F’ means column-major order, i.e., the first dimension varies fastest.

Other keys MAY be present within the metadata object however they MUST NOT alter the interpretation of the required fields defined above.

For example, the JSON object below defines a 2-dimensional array of 64-bit little-endian floating point numbers with 10000 rows and 10000 columns, divided into chunks of 1000 rows and 1000 columns (so there will be 100 chunks in total arranged in a 10 by 10 grid). Within each chunk the data are laid out in C contiguous order, and each chunk is compressed using the Blosc compression library:

{
    "chunks": [
        1000,
        1000
    ],
    "compression": "blosc",
    "compression_opts": {
        "clevel": 5,
        "cname": "lz4",
        "shuffle": 1
    },
    "dtype": "<f8",
    "fill_value": null,
    "order": "C",
    "shape": [
        10000,
        10000
    ],
    "zarr_format": 1
}

Data type encoding

Simple data types are encoded within the array metadata resource as a string, following the NumPy array protocol type string (typestr) format. The format consists of 3 parts: a character describing the byteorder of the data (<: little-endian, >: big-endian, |: not-relevant), a character code giving the basic type of the array, and an integer providing the number of bytes the type uses. The byte order MUST be specified. E.g., "<f8", ">i4", "|b1" and "|S12" are valid data types.

Structured data types (i.e., with multiple named fields) are encoded as a list of two-element lists, following NumPy array protocol type descriptions (descr). For example, the JSON list [["r", "|u1"], ["g", "|u1"], ["b", "|u1"]] defines a data type composed of three single-byte unsigned integers labelled ‘r’, ‘g’ and ‘b’.

Chunks

Each chunk of the array is compressed by passing the raw bytes for the chunk through the primary compression library to obtain a new sequence of bytes comprising the compressed chunk data. No header is added to the compressed bytes or any other modification made. The internal structure of the compressed bytes will depend on which primary compressor was used. For example, the Blosc compressor produces a sequence of bytes that begins with a 16-byte header followed by compressed data.

The compressed sequence of bytes for each chunk is stored under a key formed from the index of the chunk within the grid of chunks representing the array. To form a string key for a chunk, the indices are converted to strings and concatenated with the period character (‘.’) separating each index. For example, given an array with shape (10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the key ‘0.0’; the chunk with indices (2, 4) provides data for rows 2000-3000 and columns 4000-5000 and is stored under the key ‘2.4’; etc.

There is no need for all chunks to be present within an array store. If a chunk is not present then it is considered to be in an uninitialized state. An uninitialized chunk MUST be treated as if it was uniformly filled with the value of the ‘fill_value’ field in the array metadata. If the ‘fill_value’ field is null then the contents of the chunk are undefined.

Note that all chunks in an array have the same shape. If the length of any array dimension is not exactly divisible by the length of the corresponding chunk dimension then some chunks will overhang the edge of the array. The contents of any chunk region falling outside the array are undefined.

Attributes

Each array can also be associated with custom attributes, which are simple key/value items with application-specific meaning. Custom attributes are encoded as a JSON object and stored under the ‘attrs’ key within an array store. Even if the attributes are empty, the ‘attrs’ key MUST be present within an array store.

For example, the JSON object below encodes three attributes named ‘foo’, ‘bar’ and ‘baz’:

{
    "foo": 42,
    "bar": "apples",
    "baz": [1, 2, 3, 4]
}

Example

Below is an example of storing a Zarr array, using a directory on the local file system as storage.

Initialize the store:

>>> import zarr
>>> store = zarr.DirectoryStore('example.zarr')
>>> zarr.init_store(store, shape=(20, 20), chunks=(10, 10),
...                 dtype='i4', fill_value=42, compression='zlib',
...                 compression_opts=1, overwrite=True)

No chunks are initialized yet, so only the ‘meta’ and ‘attrs’ keys have been set:

>>> import os
>>> sorted(os.listdir('example.zarr'))
['attrs', 'meta']

Inspect the array metadata:

>>> print(open('example.zarr/meta').read())
{
    "chunks": [
        10,
        10
    ],
    "compression": "zlib",
    "compression_opts": 1,
    "dtype": "<i4",
    "fill_value": 42,
    "order": "C",
    "shape": [
        20,
        20
    ],
    "zarr_format": 1
}

Inspect the array attributes:

>>> print(open('example.zarr/attrs').read())
{}

Set some data:

>>> z = zarr.Array(store)
>>> z[0:10, 0:10] = 1
>>> sorted(os.listdir('example.zarr'))
['0.0', 'attrs', 'meta']

Set some more data:

>>> z[0:10, 10:20] = 2
>>> z[10:20, :] = 3
>>> sorted(os.listdir('example.zarr'))
['0.0', '0.1', '1.0', '1.1', 'attrs', 'meta']

Manually decompress a single chunk for illustration:

>>> import zlib
>>> b = zlib.decompress(open('example.zarr/0.0', 'rb').read())
>>> import numpy as np
>>> a = np.frombuffer(b, dtype='<i4')
>>> a
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Modify the array attributes:

>>> z.attrs['foo'] = 42
>>> z.attrs['bar'] = 'apples'
>>> z.attrs['baz'] = [1, 2, 3, 4]
>>> print(open('example.zarr/attrs').read())
{
    "bar": "apples",
    "baz": [
        1,
        2,
        3,
        4
    ],
    "foo": 42
}

Zarr storage specification version 2

This document provides a technical specification of the protocol and format used for storing Zarr arrays. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Status

This specification is the latest version. See Specifications for previous versions.

Storage

A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string and a value is an arbitrary sequence of bytes, and the supported operations are read (get the sequence of bytes associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair).

For example, a directory in a file system can provide this interface, where keys are file names, values are file contents, and files can be read, written or deleted via the operating system. Equally, an S3 bucket can provide this interface, where keys are resource names, values are resource contents, and resources can be read, written or deleted via HTTP.

Below an “array store” refers to any system implementing this interface.

Arrays
Metadata

Each array requires essential configuration metadata to be stored, enabling correct interpretation of the stored data. This metadata is encoded using JSON and stored as the value of the “.zarray” key within an array store.

The metadata resource is a JSON object. The following keys MUST be present within the object:

zarr_format
An integer defining the version of the storage specification to which the array store adheres.
shape
A list of integers defining the length of each dimension of the array.
chunks
A list of integers defining the length of each dimension of a chunk of the array. Note that all chunks within a Zarr array have the same shape.
dtype
A string or list defining a valid data type for the array. See also the subsection below on data type encoding.
compressor
A JSON object identifying the primary compression codec and providing configuration parameters, or null if no compressor is to be used. The object MUST contain an "id" key identifying the codec to be used.
fill_value
A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.
order
Either “C” or “F”, defining the layout of bytes within each chunk of the array. “C” means row-major order, i.e., the last dimension varies fastest; “F” means column-major order, i.e., the first dimension varies fastest.
filters
A list of JSON objects providing codec configurations, or null if no filters are to be applied. Each codec configuration object MUST contain a "id" key identifying the codec to be used.

Other keys MUST NOT be present within the metadata object.

For example, the JSON object below defines a 2-dimensional array of 64-bit little-endian floating point numbers with 10000 rows and 10000 columns, divided into chunks of 1000 rows and 1000 columns (so there will be 100 chunks in total arranged in a 10 by 10 grid). Within each chunk the data are laid out in C contiguous order. Each chunk is encoded using a delta filter and compressed using the Blosc compression library prior to storage:

{
    "chunks": [
        1000,
        1000
    ],
    "compressor": {
        "id": "blosc",
        "cname": "lz4",
        "clevel": 5,
        "shuffle": 1
    },
    "dtype": "<f8",
    "fill_value": "NaN",
    "filters": [
        {"id": "delta", "dtype": "<f8", "astype": "<f4"}
    ],
    "order": "C",
    "shape": [
        10000,
        10000
    ],
    "zarr_format": 2
}
Data type encoding

Simple data types are encoded within the array metadata as a string, following the NumPy array protocol type string (typestr) format. The format consists of 3 parts:

  • One character describing the byteorder of the data ("<": little-endian; ">": big-endian; "|": not-relevant)
  • One character code giving the basic type of the array ("b": Boolean (integer type where all values are only True or False); "i": integer; "u": unsigned integer; "f": floating point; "c": complex floating point; "m": timedelta; "M": datetime; "S": string (fixed-length sequence of char); "U": unicode (fixed-length sequence of Py_UNICODE); "V": other (void * – each item is a fixed-size chunk of memory))
  • An integer specifying the number of bytes the type uses.

The byte order MUST be specified. E.g., "<f8", ">i4", "|b1" and "|S12" are valid data type encodings.

For datetime64 (“M”) and timedelta64 (“m”) data types, these MUST also include the units within square brackets. A list of valid units and their definitions are given in the NumPy documentation on Datetimes and Timedeltas. For example, "<M8[ns]" specifies a datetime64 data type with nanosecond time units.

Structured data types (i.e., with multiple named fields) are encoded as a list of lists, following NumPy array protocol type descriptions (descr). Each sub-list has the form [fieldname, datatype, shape] where shape is optional. fieldname is a string, datatype is a string specifying a simple data type (see above), and shape is a list of integers specifying subarray shape. For example, the JSON list below defines a data type composed of three single-byte unsigned integer fields named “r”, “g” and “b”:

[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]

For example, the JSON list below defines a data type composed of three fields named “x”, “y” and “z”, where “x” and “y” each contain 32-bit floats, and each item in “z” is a 2 by 2 array of floats:

[["x", "<f4"], ["y", "<f4"], ["z", "<f4", [2, 2]]]

Structured data types may also be nested, e.g., the following JSON list defines a data type with two fields “foo” and “bar”, where “bar” has two sub-fields “baz” and “qux”:

[["foo", "<f4"], ["bar", [["baz", "<f4"], ["qux", "<i4"]]]]

Fill value encoding

For simple floating point data types, the following table MUST be used to encode values of the “fill_value” field:

Value               JSON encoding
Not a Number        "NaN"
Positive Infinity   "Infinity"
Negative Infinity   "-Infinity"

If an array has a fixed length byte string data type (e.g., "|S12"), or a structured data type, and if the fill value is not null, then the fill value MUST be encoded as an ASCII string using the standard Base64 alphabet.
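
For example, here is a minimal sketch of encoding a 12-byte fill value for the "|S12" data type using Python's base64 module:

>>> import base64
>>> fill_value = b'Hello wrld!!'  # 12 bytes, matching "|S12"
>>> base64.standard_b64encode(fill_value).decode('ascii')
'SGVsbG8gd3JsZCEh'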

Chunks

Each chunk of the array is compressed by passing the raw bytes for the chunk through the primary compression library to obtain a new sequence of bytes comprising the compressed chunk data. No header is added to the compressed bytes or any other modification made. The internal structure of the compressed bytes will depend on which primary compressor was used. For example, the Blosc compressor produces a sequence of bytes that begins with a 16-byte header followed by compressed data.

The compressed sequence of bytes for each chunk is stored under a key formed from the index of the chunk within the grid of chunks representing the array. To form a string key for a chunk, the indices are converted to strings and concatenated with the period character (“.”) separating each index. For example, given an array with shape (10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the key “0.0”; the chunk with indices (2, 4) provides data for rows 2000-3000 and columns 4000-5000 and is stored under the key “2.4”; etc.
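
For example, forming the key for the chunk with indices (2, 4) can be sketched in Python as:

>>> '.'.join(map(str, (2, 4)))
'2.4'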

There is no need for all chunks to be present within an array store. If a chunk is not present then it is considered to be in an uninitialized state. An uninitialized chunk MUST be treated as if it was uniformly filled with the value of the “fill_value” field in the array metadata. If the “fill_value” field is null then the contents of the chunk are undefined.

Note that all chunks in an array have the same shape. If the length of any array dimension is not exactly divisible by the length of the corresponding chunk dimension then some chunks will overhang the edge of the array. The contents of any chunk region falling outside the array are undefined.

Filters

Optionally a sequence of one or more filters can be used to transform chunk data prior to compression. When storing data, filters are applied in the order specified in array metadata to encode data, then the encoded data are passed to the primary compressor. When retrieving data, stored chunk data are decompressed by the primary compressor then decoded using filters in the reverse order.

Hierarchies
Logical storage paths

Multiple arrays can be stored in the same array store by associating each array with a different logical path. A logical path is simply an ASCII string. The logical path is used to form a prefix for keys used by the array. For example, if an array is stored at logical path “foo/bar” then the array metadata will be stored under the key “foo/bar/.zarray”, the user-defined attributes will be stored under the key “foo/bar/.zattrs”, and the chunks will be stored under keys like “foo/bar/0.0”, “foo/bar/0.1”, etc.

To ensure consistent behaviour across different storage systems, logical paths MUST be normalized as follows:

  • Replace all backward slash characters (“\”) with forward slash characters (“/”)
  • Strip any leading “/” characters
  • Strip any trailing “/” characters
  • Collapse any sequence of more than one “/” character into a single “/” character

The key prefix is then obtained by appending a single “/” character to the normalized logical path.

After normalization, if splitting a logical path by the “/” character results in any path segment equal to the string “.” or the string “..” then an error MUST be raised.
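
A minimal sketch of this normalization in Python (the function name is illustrative and not part of any API):

>>> def normalize_path(path):
...     path = path.replace('\\', '/')  # backslashes become forward slashes
...     # splitting and rejoining strips leading/trailing slashes and
...     # collapses repeated slashes
...     segments = [s for s in path.split('/') if s]
...     if any(s in ('.', '..') for s in segments):
...         raise ValueError('path segments "." and ".." are not allowed')
...     return '/'.join(segments)
>>> normalize_path('\\foo//bar/')
'foo/bar'
>>> normalize_path('foo/bar') + '/'  # the key prefix
'foo/bar/'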

N.B., how the underlying array store processes requests to store values under keys containing the “/” character is entirely up to the store implementation and is not constrained by this specification. E.g., an array store could simply treat all keys as opaque ASCII strings; equally, an array store could map logical paths onto some kind of hierarchical storage (e.g., directories on a file system).

Groups

Arrays can be organized into groups which can also contain other groups. A group is created by storing group metadata under the “.zgroup” key under some logical path. E.g., a group exists at the root of an array store if the “.zgroup” key exists in the store, and a group exists at logical path “foo/bar” if the “foo/bar/.zgroup” key exists in the store.

If the user requests a group to be created under some logical path, then groups MUST also be created at all ancestor paths. E.g., if the user requests group creation at path “foo/bar” then groups MUST be created at path “foo” and the root of the store, if they don’t already exist.

If the user requests an array to be created under some logical path, then groups MUST also be created at all ancestor paths. E.g., if the user requests array creation at path “foo/bar/baz” then groups MUST be created at path “foo/bar”, path “foo”, and the root of the store, if they don’t already exist.

The group metadata resource is a JSON object. The following keys MUST be present within the object:

zarr_format
An integer defining the version of the storage specification to which the array store adheres.

Other keys MUST NOT be present within the metadata object.

The members of a group are arrays and groups stored under logical paths that are direct children of the parent group’s logical path. E.g., if groups exist under the logical paths “foo” and “foo/bar” and an array exists at logical path “foo/baz” then the members of the group at path “foo” are the group at path “foo/bar” and the array at path “foo/baz”.
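
As a rough sketch of how members might be derived from store keys (illustrative only; stores are free to implement listing differently):

>>> keys = ['foo/.zgroup', 'foo/bar/.zgroup', 'foo/baz/.zarray', 'foo/baz/0.0']
>>> prefix = 'foo/'
>>> children = {k[len(prefix):].split('/')[0] for k in keys if k.startswith(prefix)}
>>> sorted(c for c in children if not c.startswith('.'))
['bar', 'baz']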

Attributes

An array or group can be associated with custom attributes, which are simple key/value items with application-specific meaning. Custom attributes are encoded as a JSON object and stored under the “.zattrs” key within an array store. The “.zattrs” key does not have to be present, and if it is absent the attributes should be treated as empty.

For example, the JSON object below encodes three attributes named “foo”, “bar” and “baz”:

{
    "foo": 42,
    "bar": "apples",
    "baz": [1, 2, 3, 4]
}
Examples
Storing a single array

Below is an example of storing a Zarr array, using a directory on the local file system as storage.

Create an array:

>>> import zarr
>>> store = zarr.DirectoryStore('data/example.zarr')
>>> a = zarr.create(shape=(20, 20), chunks=(10, 10), dtype='i4',
...                 fill_value=42, compressor=zarr.Zlib(level=1),
...                 store=store, overwrite=True)

No chunks are initialized yet, so only the “.zarray” key has been set in the store:

>>> import os
>>> sorted(os.listdir('data/example.zarr'))
['.zarray']

Inspect the array metadata:

>>> print(open('data/example.zarr/.zarray').read())
{
    "chunks": [
        10,
        10
    ],
    "compressor": {
        "id": "zlib",
        "level": 1
    },
    "dtype": "<i4",
    "fill_value": 42,
    "filters": null,
    "order": "C",
    "shape": [
        20,
        20
    ],
    "zarr_format": 2
}

Chunks are initialized on demand. E.g., set some data:

>>> a[0:10, 0:10] = 1
>>> sorted(os.listdir('data/example.zarr'))
['.zarray', '0.0']

Set some more data:

>>> a[0:10, 10:20] = 2
>>> a[10:20, :] = 3
>>> sorted(os.listdir('data/example.zarr'))
['.zarray', '0.0', '0.1', '1.0', '1.1']

Manually decompress a single chunk for illustration:

>>> import zlib
>>> buf = zlib.decompress(open('data/example.zarr/0.0', 'rb').read())
>>> import numpy as np
>>> chunk = np.frombuffer(buf, dtype='<i4')
>>> chunk
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
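
The flat buffer corresponds to the chunk’s (10, 10) shape in the “C” order given in the array metadata; reshaping recovers the two-dimensional chunk:

>>> chunk.reshape((10, 10), order='C')[0]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)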

Modify the array attributes:

>>> a.attrs['foo'] = 42
>>> a.attrs['bar'] = 'apples'
>>> a.attrs['baz'] = [1, 2, 3, 4]
>>> sorted(os.listdir('data/example.zarr'))
['.zarray', '.zattrs', '0.0', '0.1', '1.0', '1.1']
>>> print(open('data/example.zarr/.zattrs').read())
{
    "bar": "apples",
    "baz": [
        1,
        2,
        3,
        4
    ],
    "foo": 42
}
Storing multiple arrays in a hierarchy

Below is an example of storing multiple Zarr arrays organized into a group hierarchy, using a directory on the local file system as storage. This storage implementation maps logical paths onto directory paths on the file system; however, this is an implementation choice and is not required by the specification.

Set up the store:

>>> import zarr
>>> store = zarr.DirectoryStore('data/group.zarr')

Create the root group:

>>> root_grp = zarr.group(store, overwrite=True)

The metadata resource for the root group has been created:

>>> import os
>>> sorted(os.listdir('data/group.zarr'))
['.zgroup']

Inspect the group metadata:

>>> print(open('data/group.zarr/.zgroup').read())
{
    "zarr_format": 2
}

Create a sub-group:

>>> sub_grp = root_grp.create_group('foo')

What has been stored:

>>> sorted(os.listdir('data/group.zarr'))
['.zgroup', 'foo']
>>> sorted(os.listdir('data/group.zarr/foo'))
['.zgroup']

Create an array within the sub-group:

>>> a = sub_grp.create_dataset('bar', shape=(20, 20), chunks=(10, 10))
>>> a[:] = 42

Set a custom attribute:

>>> a.attrs['comment'] = 'answer to life, the universe and everything'

What has been stored:

>>> sorted(os.listdir('data/group.zarr'))
['.zgroup', 'foo']
>>> sorted(os.listdir('data/group.zarr/foo'))
['.zgroup', 'bar']
>>> sorted(os.listdir('data/group.zarr/foo/bar'))
['.zarray', '.zattrs', '0.0', '0.1', '1.0', '1.1']

Here is the same example using a Zip file as storage:

>>> store = zarr.ZipStore('data/group.zip', mode='w')
>>> root_grp = zarr.group(store)
>>> sub_grp = root_grp.create_group('foo')
>>> a = sub_grp.create_dataset('bar', shape=(20, 20), chunks=(10, 10))
>>> a[:] = 42
>>> a.attrs['comment'] = 'answer to life, the universe and everything'
>>> store.close()

What has been stored:

>>> import zipfile
>>> zf = zipfile.ZipFile('data/group.zip', mode='r')
>>> for name in sorted(zf.namelist()):
...     print(name)
.zgroup
foo/.zgroup
foo/bar/.zarray
foo/bar/.zattrs
foo/bar/0.0
foo/bar/0.1
foo/bar/1.0
foo/bar/1.1
Changes
Version 2 clarifications

The following changes have been made to the version 2 specification since it was initially published to clarify ambiguities and add some missing information.

  • The specification now describes how bytes fill values should be encoded and decoded for arrays with a fixed-length byte string data type (#165, #176).
  • The specification now clarifies that units must be specified for datetime64 and timedelta64 data types (#85, #215).
  • The specification now clarifies that the ‘.zattrs’ key does not have to be present for either arrays or groups, and if absent then custom attributes should be treated as empty.
  • The specification now describes how structured datatypes with subarray shapes and/or with nested structured data types are encoded in array metadata (#111, #296).
Changes from version 1 to version 2

The following changes were made between version 1 and version 2 of this specification:

  • Added support for storing multiple arrays in the same store and organising arrays into hierarchies using groups.
  • Array metadata is now stored under the “.zarray” key instead of the “meta” key.
  • Custom attributes are now stored under the “.zattrs” key instead of the “attrs” key.
  • Added support for filters.
  • Changed encoding of “fill_value” field within array metadata.
  • Changed encoding of compressor information within array metadata to be consistent with representation of filter information.

Release notes

2.4.0

Enhancements
  • Add key normalization option for DirectoryStore, NestedDirectoryStore, TempStore, and N5Store. By James Bourbeau; #459.
  • Add recurse keyword to Group.array_keys and Group.arrays methods. By James Bourbeau; #458.
  • Use uniform chunking for all dimensions when specifying chunks as an integer. Also adds support for specifying -1 to chunk across an entire dimension. By James Bourbeau; #456.
  • Rename DictStore to MemoryStore. By James Bourbeau; #455.
  • Rewrite .tree() pretty representation to use ipytree. Allows it to work in both the Jupyter Notebook and JupyterLab. By John Kirkham; #450.
  • Do not rename Blosc parameters in n5 backend and add blocksize parameter, compatible with n5-blosc. By @axtimwalde, #485.
  • Update DirectoryStore to create files with more permissive permissions. By Eduardo Gonzalez and James Bourbeau; #493.
  • Use math.ceil for scalars. By John Kirkham; #500.
  • Ensure contiguous data using astype. By John Kirkham; #513.
  • Refactor out _tofile/_fromfile from DirectoryStore. By John Kirkham; #503.
  • Add __enter__/__exit__ methods to Group for h5py.File compatibility. By Chris Barnes; #509.
Bug fixes
  • Fix Sqlite Store Wrong Modification. By Tommy Tran; #440.
  • Add intermediate step (using zipfile.ZipInfo object) to write inside ZipStore to solve too restrictive permission issue. By Raphael Dussin; #505.
  • Fix ‘/’ prepend bug in ABSStore. By Shikhar Goenka; #525.

2.3.2


2.3.1


2.3.0

Bug fixes
  • The implementation of the zarr.storage.DirectoryStore class has been modified to ensure that writes are atomic and there are no race conditions where a chunk might appear transiently missing during a write operation. By sbalmer, #327, #263.
  • Avoid raising in zarr.storage.DirectoryStore’s __setitem__ when file already exists. By Justin Swaney, #272, #318.
  • The required version of the Numcodecs package has been upgraded to 0.6.2, which has enabled some code simplification and fixes a failing test involving msgpack encoding. By John Kirkham, #361, #360, #352, #355, #324.
  • Failing tests related to pickling/unpickling have been fixed. By Ryan Williams, #273, #308.
  • Corrects handling of NaT in datetime64 and timedelta64 in various compressors (by John Kirkham; #344).
  • Ensure DictStore contains only bytes to facilitate comparisons and protect against writes. By John Kirkham, #350.
  • Test and fix an issue (w.r.t. fill values) when storing complex data to Array. By John Kirkham, #363.
  • Always use a tuple when indexing a NumPy ndarray. By John Kirkham, #376.
  • Ensure when Array uses a dict-based chunk store that it only contains bytes to facilitate comparisons and protect against writes. Drop the copy for the no filter/compressor case as this handles that case. By John Kirkham, #359.

2.2.0

Enhancements
  • Advanced indexing. The Array class has several new methods and properties that enable a selection of items in an array to be retrieved or updated. See the Advanced indexing tutorial section for more information. There is also a notebook with extended examples and performance benchmarks. #78, #89, #112, #172.
  • New package for compressor and filter codecs. The classes previously defined in the zarr.codecs module have been factored out into a separate package called Numcodecs. The Numcodecs package also includes several new codec classes not previously available in Zarr, including compressor codecs for Zstd and LZ4. This change is backwards-compatible with existing code, as all codec classes defined by Numcodecs are imported into the zarr.codecs namespace. However, it is recommended to import codecs from the new package, see the tutorial sections on Compressors and Filters for examples. With contributions by John Kirkham; #74, #102, #120, #123, #139.
  • New storage class for DBM-style databases. The zarr.storage.DBMStore class enables any DBM-style database such as gdbm, ndbm or Berkeley DB, to be used as the backing store for an array or group. See the tutorial section on Storage alternatives for some examples. #133, #186.
  • New storage class for LMDB databases. The zarr.storage.LMDBStore class enables an LMDB “Lightning” database to be used as the backing store for an array or group. #192.
  • New storage class using a nested directory structure for chunk files. The zarr.storage.NestedDirectoryStore has been added, which is similar to the existing zarr.storage.DirectoryStore class but nests chunk files for multidimensional arrays into sub-directories. #155, #177.
  • New tree() method for printing hierarchies. The Group class has a new zarr.hierarchy.Group.tree() method which enables a tree representation of a group hierarchy to be printed. Also provides an interactive tree representation when used within a Jupyter notebook. See the Array and group diagnostics tutorial section for examples. By John Kirkham; #82, #140, #184.
  • Visitor API. The Group class now implements the h5py visitor API, see docs for the zarr.hierarchy.Group.visit(), zarr.hierarchy.Group.visititems() and zarr.hierarchy.Group.visitvalues() methods. By John Kirkham, #92, #122.
  • Viewing an array as a different dtype. The Array class has a new zarr.core.Array.astype() method, which is a convenience that enables an array to be viewed as a different dtype. By John Kirkham, #94, #96.
  • New open(), save(), load() convenience functions. The function zarr.convenience.open() provides a convenient way to open a persistent array or group, using either a DirectoryStore or ZipStore as the backing store. The functions zarr.convenience.save() and zarr.convenience.load() are also available and provide a convenient way to save an entire NumPy array to disk and load back into memory later. See the tutorial section Persistent arrays for examples. #104, #105, #141, #181.
  • IPython completions. The Group class now implements __dir__() and _ipython_key_completions_() which enables tab-completion for group members to be used in any IPython interactive environment. #170.
  • New info property; changes to __repr__. The Group and Array classes have a new info property which can be used to print diagnostic information, including compression ratio where available. See the tutorial section on Array and group diagnostics for examples. The string representation (__repr__) of these classes has been simplified to ensure it is cheap and quick to compute in all circumstances. #83, #115, #132, #148.
  • Chunk options. When creating an array, chunks=False can be specified, which will result in an array with a single chunk only. Alternatively, chunks=True will trigger an automatic chunk shape guess. See Chunk optimizations for more on the chunks parameter. #106, #107, #183.
  • Zero-dimensional arrays are now supported; by Prakhar Goel, #154, #161.
  • Arrays with one or more zero-length dimensions are now fully supported; by Prakhar Goel, #150, #154, #160.
  • The .zattrs key is now optional and will now only be created when the first custom attribute is set; #121, #200.
  • New Group.move() method supports moving a sub-group or array to a different location within the same hierarchy. By John Kirkham, #191, #193, #196.
  • ZipStore is now thread-safe; #194, #192.
  • New Array.hexdigest() method computes an Array’s hash with hashlib. By John Kirkham, #98, #203.
  • Improved support for object arrays. In previous versions of Zarr, creating an array with dtype=object was possible but could under certain circumstances lead to unexpected errors and/or segmentation faults. To make it easier to properly configure an object array, a new object_codec parameter has been added to array creation functions. See the tutorial section on Object arrays for more information and examples. Also, runtime checks have been added in both Zarr and Numcodecs so that segmentation faults are no longer possible, even with a badly configured array. This API change is backwards compatible and previous code that created an object array and provided an object codec via the filters parameter will continue to work, however a warning will be raised to encourage use of the object_codec parameter. #208, #212.
  • Added support for datetime64 and timedelta64 data types; #85, #215.
  • Array and group attributes are now cached by default to improve performance with slow stores, e.g., stores accessing data via the network; #220, #218, #204.
  • New LRUStoreCache class. The class zarr.storage.LRUStoreCache has been added and provides a means to locally cache data in memory from a store that may be slow, e.g., a store that retrieves data from a remote server via the network; #223.
  • New copy functions. The new functions zarr.convenience.copy() and zarr.convenience.copy_all() provide a way to copy groups and/or arrays between HDF5 and Zarr, or between two Zarr groups. The zarr.convenience.copy_store() provides a more efficient way to copy data directly between two Zarr stores. #87, #113, #137, #217.
Bug fixes
  • Fixed bug where read_only keyword argument was ignored when creating an array; #151, #179.
  • Fixed bugs when using a ZipStore opened in ‘w’ mode; #158, #182.
  • Fill values can now be provided for fixed-length string arrays; #165, #176.
  • Fixed a bug where the number of chunks initialized could be counted incorrectly; #97, #174.
  • Fixed a bug related to the use of an ellipsis (…) in indexing statements; #93, #168, #172.
  • Fixed a bug preventing use of other integer types for indexing; #143, #147.
Maintenance
  • A data fixture has been included in the test suite to ensure data format compatibility is maintained; #83, #146.
  • The test suite has been migrated from nosetests to pytest; #189, #225.
  • Various continuous integration updates and improvements; #118, #124, #125, #126, #109, #114, #171.
  • Bump numcodecs dependency to 0.5.3, completely remove nose dependency, #237.
  • Fix compatibility issues with NumPy 1.14 regarding fill values for structured arrays, #222, #238, #239.
Acknowledgments

Code was contributed to this release by Alistair Miles, John Kirkham and Prakhar Goel.

Documentation was contributed to this release by Mamy Ratsimbazafy and Charles Noyes.

Thank you to John Kirkham, Stephan Hoyer, Francesc Alted, and Matthew Rocklin for code reviews and/or comments on pull requests.

2.1.4

  • Resolved an issue where calling hasattr on a Group object erroneously raised a KeyError. By Vincent Schut; #88, #95.

2.1.3

2.1.2

  • Resolved an issue when no compression is used and chunks are stored in memory (#79).

2.1.1

Various minor improvements, including: Group objects support member access via dot notation (__getattr__); fixed metadata caching for Array.shape property and derivatives; added Array.ndim property; fixed Array.__array__ method arguments; fixed bug in pickling Array state; fixed bug in pickling ThreadSynchronizer.

2.1.0

  • Group objects now support member deletion via del statement (#65).
  • Added zarr.storage.TempStore class for convenience to provide storage via a temporary directory (#59).
  • Fixed performance issues with zarr.storage.ZipStore class (#66).
  • The Blosc extension has been modified to return bytes instead of array objects from compress and decompress function calls. This should improve compatibility and also provides a small performance increase for compressing high compression ratio data (#55).
  • Added overwrite keyword argument to array and group creation methods on the zarr.hierarchy.Group class (#71).
  • Added cache_metadata keyword argument to array creation methods.
  • The functions zarr.creation.open_array() and zarr.hierarchy.open_group() now accept any store as first argument (#56).

2.0.1

The bundled Blosc library has been upgraded to version 1.11.1.

2.0.0

Hierarchies

Support has been added for organizing arrays into hierarchies via groups. See the tutorial section on Groups and the zarr.hierarchy API docs for more information.

Filters

Support has been added for configuring filters to preprocess chunk data prior to compression. See the tutorial section on Filters and the zarr.codecs API docs for more information.

Other changes

To accommodate support for hierarchies and filters, the Zarr metadata format has been modified. See the Zarr storage specification version 2 for more information. To migrate an array stored using Zarr version 1.x, use the zarr.storage.migrate_1to2() function.

The bundled Blosc library has been upgraded to version 1.11.0.

Acknowledgments

Thanks to Matthew Rocklin, Stephan Hoyer and Francesc Alted for contributions and comments.

1.1.0

  • The bundled Blosc library has been upgraded to version 1.10.0. The ‘zstd’ internal compression library is now available within Blosc. See the tutorial section on Compressors for an example.
  • When using the Blosc compressor, the default internal compression library is now ‘lz4’.
  • The default number of internal threads for the Blosc compressor has been increased to a maximum of 8 (previously 4).
  • Added convenience functions zarr.blosc.list_compressors() and zarr.blosc.get_nthreads().

1.0.0

This release includes a complete re-organization of the code base. The major version number has been bumped to indicate that there have been backwards-incompatible changes to the API and the on-disk storage format. However, Zarr is still in an early stage of development, so please do not take the version number as an indicator of maturity.

Storage

The main motivation for re-organizing the code was to create an abstraction layer between the core array logic and data storage (#21). In this release, any object that implements the MutableMapping interface can be used as an array store. See the tutorial sections on Persistent arrays and Storage alternatives, the Zarr storage specification version 1, and the zarr.storage module documentation for more information.

Please note also that the file organization and file name conventions used when storing a Zarr array in a directory on the file system have changed. Persistent Zarr arrays created using previous versions of the software will not be compatible with this version. See the zarr.storage API docs and the Zarr storage specification version 1 for more information.

Compression

An abstraction layer has also been created between the core array logic and the code for compressing and decompressing array chunks. This release still bundles the c-blosc library and uses Blosc as the default compressor, however other compressors including zlib, BZ2 and LZMA are also now supported via the Python standard library. New compressors can also be dynamically registered for use with Zarr. See the tutorial sections on Compressors and Configuring Blosc, the Zarr storage specification version 1, and the zarr.compressors module documentation for more information.

Synchronization

The synchronization code has also been refactored to create a layer of abstraction, enabling Zarr arrays to be used in parallel computations with a number of alternative synchronization methods. For more information see the tutorial section on Parallel computing and synchronization and the zarr.sync module documentation.

Changes to the Blosc extension

NumPy is no longer a build dependency for the zarr.blosc Cython extension, so setup.py will run even if NumPy is not already installed, and should automatically install NumPy as a runtime dependency. Manual installation of NumPy prior to installing Zarr is still recommended, however, as the automatic installation of NumPy may fail or be sub-optimal on some platforms.

Some optimizations have been made within the zarr.blosc extension to avoid unnecessary memory copies, giving a ~10-20% performance improvement for multi-threaded compression operations.

The zarr.blosc extension now automatically detects whether it is running within a single-threaded or multi-threaded program and adapts its internal behaviour accordingly (#27). There is no need for the user to make any API calls to switch Blosc between contextual and non-contextual (global lock) mode. See also the tutorial section on Configuring Blosc.

Other changes

The internal code for managing chunks has been rewritten to be more efficient. Now no state is maintained for chunks outside of the array store, meaning that chunks do not carry any extra memory overhead not accounted for by the store. This removes the need for the “lazy” option present in the previous release, which has therefore been removed.

The memory layout within chunks can now be set as either “C” (row-major) or “F” (column-major), which can help to provide better compression for some data (#7). See the tutorial section on Chunk memory layout for more information.

A bug has been fixed within the __getitem__ and __setitem__ machinery for slicing arrays, to properly handle getting and setting partial slices.

Acknowledgments

Thanks to Matthew Rocklin, Stephan Hoyer, Francesc Alted, Anthony Scopatz and Martin Durant for contributions and comments.

Contributing to Zarr

Zarr is a community maintained project. We welcome contributions in the form of bug reports, bug fixes, documentation, enhancement proposals and more. This page provides information on how best to contribute.

Asking for help

If you have a question about how to use Zarr, please post your question on StackOverflow using the “zarr” tag. If you don’t get a response within a day or two, feel free to raise a GitHub issue including a link to your StackOverflow question. We will try to respond to questions as quickly as possible, but please bear in mind that there may be periods where we have limited time to answer questions due to other commitments.

Bug reports

If you find a bug, please raise a GitHub issue. Please include the following items in a bug report:

  1. A minimal, self-contained snippet of Python code reproducing the problem. You can format the code nicely using markdown, e.g.:

    ```python
    import zarr
    g = zarr.group()
    # etc.
    ```
    
  2. An explanation of why the current behaviour is wrong/not desired, and what you expect instead.

  3. Information about the version of Zarr, along with versions of dependencies and the Python interpreter, and installation information. The version of Zarr can be obtained from the zarr.__version__ property. Please also state how Zarr was installed, e.g., “installed via pip into a virtual environment”, or “installed using conda”. Information about other packages installed can be obtained by executing pip freeze (if using pip to install packages) or conda env export (if using conda to install packages) from the operating system command prompt. The version of the Python interpreter can be obtained by running a Python interactive session, e.g.:

    $ python
    Python 3.6.1 (default, Mar 22 2017, 06:17:05)
    [GCC 6.3.0 20170321] on linux
    

Enhancement proposals

If you have an idea about a new feature or some other improvement to Zarr, please raise a GitHub issue first to discuss.

We very much welcome ideas and suggestions for how to improve Zarr, but please bear in mind that we are likely to be conservative in accepting proposals for new features. The reasons for this are that we would like to keep the Zarr code base lean and focused on a core set of functionalities, and available time for development, review and maintenance of new features is limited. But if you have a great idea, please don’t let that stop you from posting it on GitHub, just please don’t be offended if we respond cautiously.

Contributing code and/or documentation

Forking the repository

The Zarr source code is hosted on GitHub at the following location:

https://github.com/zarr-developers/zarr-python

You will need your own fork to work on the code. Go to the link above and hit the “Fork” button. Then clone your fork to your local machine:

$ git clone git@github.com:your-user-name/zarr-python.git
$ cd zarr-python
$ git remote add upstream git@github.com:zarr-developers/zarr-python.git
Creating a development environment

To work with the Zarr source code, it is recommended to set up a Python virtual environment and install all Zarr dependencies using the same versions as are used by the core developers and continuous integration services. Assuming you have a Python 3 interpreter and the virtualenv package installed, and that you have cloned the Zarr source code with your current working directory at the root of the repository, you can do something like the following:

$ mkdir -p ~/pyenv/zarr-dev
$ virtualenv --no-site-packages --python=/usr/bin/python3.8 ~/pyenv/zarr-dev
$ source ~/pyenv/zarr-dev/bin/activate
$ pip install -r requirements_dev_minimal.txt -r requirements_dev_numpy.txt
$ pip install -e .

To verify that your development environment is working, you can run the unit tests:

$ pytest -v zarr
Creating a branch

Before you do any new work or submit a pull request, please open an issue on GitHub to report the bug or propose the feature you’d like to add.

It’s best to synchronize your fork with the upstream repository, then create a new, separate branch for each piece of work you want to do. E.g.:

git checkout master
git fetch upstream
git rebase upstream/master
git push
git checkout -b shiny-new-feature
git push -u origin shiny-new-feature

This switches your working copy to the ‘shiny-new-feature’ branch. Keep any changes in this branch specific to one bug or feature so it is clear what the branch brings to Zarr.

To update this branch with latest code from Zarr, you can retrieve the changes from the master branch and perform a rebase:

git fetch upstream
git rebase upstream/master

This will replay your commits on top of the latest Zarr git master. If this leads to merge conflicts, these need to be resolved before submitting a pull request. Alternatively, you can merge the changes in from upstream/master instead of rebasing, which can be simpler:

git fetch upstream
git merge upstream/master

Again, any conflicts need to be resolved before submitting a pull request.

Running the test suite

Zarr includes a suite of unit tests, as well as doctests included in function and class docstrings and in the tutorial and storage spec. The simplest way to run the unit tests is to activate your development environment (see creating a development environment above) and invoke:

$ pytest -v zarr

Some tests require optional dependencies to be installed, otherwise the tests will be skipped. To install all optional dependencies, run:

$ pip install -r requirements_dev_optional.txt

To also run the doctests within docstrings (requires optional dependencies to be installed), run:

$ pytest -v --doctest-plus zarr

To run the doctests within the tutorial and storage spec (requires optional dependencies to be installed), run:

$ python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst docs/spec/v2.rst

Note that some tests also require storage services to be running locally. To run the Azure Blob Service storage tests, run an Azure storage emulator (e.g., azurite) and set the environment variable ZARR_TEST_ABS=1. To run the Mongo DB storage tests, run a Mongo server locally and set the environment variable ZARR_TEST_MONGO=1. To run the Redis storage tests, run a Redis server locally on port 6379 and set the environment variable ZARR_TEST_REDIS=1.

All tests are automatically run via Travis (Linux) and AppVeyor (Windows) continuous integration services for every pull request. Tests must pass under both Travis and AppVeyor before code can be accepted. Test coverage is also collected automatically via the Coveralls service, and total coverage over all builds must be 100% (although individual builds may be lower due to Python 2/3 or other differences).

Code standards

All code must conform to the PEP8 standard. Regarding line length, lines up to 100 characters are allowed, although please try to keep under 90 wherever possible. Conformance can be checked by running:

$ flake8 --max-line-length=100 zarr
Test coverage

Zarr maintains 100% test coverage under the latest Python stable release (currently Python 3.8). Both unit tests and docstring doctests are included when computing coverage. Running tox -e py38 will automatically run the test suite with coverage and produce a coverage report. This should be 100% before code can be accepted into the main code base.

When submitting a pull request, coverage will also be collected across all supported Python versions via the Coveralls service, and will be reported back within the pull request. Coveralls coverage must also be 100% before code can be accepted.

Documentation

Docstrings for user-facing classes and functions should follow the numpydoc standard, including sections for Parameters and Examples. All examples should run and pass as doctests under Python 3.8. To run doctests, activate your development environment, install optional requirements, and run:

$ pytest -v --doctest-plus zarr

Zarr uses Sphinx for documentation, hosted on readthedocs.org. Documentation is written in the reStructuredText markup language (.rst files) in the docs folder. The documentation consists of both prose and API documentation. All user-facing classes and functions should be included in the API documentation, under the docs/api folder. Any new features or important usage information should be included in the tutorial (docs/tutorial.rst). Any changes should also be included in the release notes (docs/release.rst).

The documentation can be built locally by running:

$ tox -e docs

The resulting built documentation will be available in the .tox/docs/tmp/html folder.

Development best practices, policies and procedures

The following information is mainly for core developers, but may also be of interest to contributors.

Merging pull requests

Pull requests submitted by an external contributor should be reviewed and approved by at least one core developer before being merged. Ideally, pull requests submitted by a core developer should be reviewed and approved by at least one other core developer before being merged.

Pull requests should not be merged until all CI checks have passed (Travis, AppVeyor, Coveralls) against code that has had the latest master merged in.

Compatibility and versioning policies

Because Zarr is a data storage library, there are two types of compatibility to consider: API compatibility and data format compatibility.

API compatibility

All functions, classes and methods that are included in the API documentation (files under docs/api/*.rst) are considered part of the Zarr public API, except if they have been documented as an experimental feature, in which case they are part of the experimental API.

Any change to the public API that does not break existing third party code importing Zarr, or cause third party code to behave in a different way, is a backwards-compatible API change. For example, adding a new function, class or method is usually a backwards-compatible change. However, removing a function, class or method; removing an argument to a function or method; adding a required argument to a function or method; or changing the behaviour of a function or method, are examples of backwards-incompatible API changes.

If a release contains no changes to the public API (e.g., contains only bug fixes or other maintenance work), then the micro version number should be incremented (e.g., 2.2.0 -> 2.2.1). If a release contains public API changes, but all changes are backwards-compatible, then the minor version number should be incremented (e.g., 2.2.1 -> 2.3.0). If a release contains any backwards-incompatible public API changes, the major version number should be incremented (e.g., 2.3.0 -> 3.0.0).

Backwards-incompatible changes to the experimental API can be included in a minor release, although this should be minimised if possible. I.e., it would be preferable to save up backwards-incompatible changes to the experimental API to be included in a major release, and to stabilise those features at the same time (i.e., move from experimental to public API), rather than frequently tinkering with the experimental API in minor releases.

Data format compatibility

The data format used by Zarr is defined by a specification document, which should be platform-independent and contain sufficient detail to construct an interoperable software library to read and/or write Zarr data using any programming language. The latest version of the specification document is available from the Specifications page.

Here, data format compatibility means that all software libraries that implement a particular version of the Zarr storage specification are interoperable, in the sense that data written by any one library can be read by all others. It is obviously desirable to maintain data format compatibility wherever possible. However, if a change is needed to the storage specification, and that change would break data format compatibility in any way, then the storage specification version number should be incremented (e.g., 2 -> 3).

The versioning of the Zarr software library is related to the versioning of the storage specification as follows. A particular version of the Zarr library will implement a particular version of the storage specification. For example, Zarr version 2.2.0 implements the Zarr storage specification version 2. If a release of the Zarr library implements a different version of the storage specification, then the major version number of the Zarr library should be incremented. E.g., if Zarr version 2.2.0 implements the storage spec version 2, and the next release of the Zarr library implements storage spec version 3, then the next library release should have version number 3.0.0. Note however that the major version number of the Zarr library may not always correspond to the spec version number. For example, Zarr versions 2.x, 3.x, and 4.x might all implement the same version of the storage spec and thus maintain data format compatibility, although they will not maintain API compatibility. The version number of the storage specification that is currently implemented is stored under the zarr.meta.ZARR_FORMAT variable.
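
For example, the spec version implemented by the installed library can be checked directly; with the 2.x series described in this document, a quick sanity check looks like:

>>> import zarr
>>> zarr.meta.ZARR_FORMAT
2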

Note that the Zarr test suite includes a data fixture and tests to try and ensure that data format compatibility is not accidentally broken. See the test_format_compatibility() function in the zarr.tests.test_storage module for details.

When to make a release

Ideally, any bug fixes that don’t change the public API should be released as soon as possible. It is fine for a micro release to contain only a single bug fix.

When to make a minor release is at the discretion of the core developers. There are no hard-and-fast rules, e.g., it is fine to make a minor release to make a single new feature available; equally, it is fine to make a minor release that includes a number of changes.

Major releases obviously need to be given careful consideration, and should be done as infrequently as possible, as they will break existing code and/or affect data compatibility in some way.

Release procedure

Checkout and update the master branch:

$ git checkout master
$ git pull

Verify all tests pass on all supported Python versions, and docs build:

$ tox

Tag the version (where “X.X.X” stands for the version number, e.g., “2.2.0”):

$ version=X.X.X
$ git tag -a v$version -m v$version
$ git push --tags

Release source code to PyPI:

$ python setup.py sdist
$ twine upload dist/zarr-${version}.tar.gz

Obtain checksum for release to conda-forge:

$ openssl sha256 dist/zarr-${version}.tar.gz

Release to conda-forge by making a pull request against the zarr-feedstock conda-forge repository, incrementing the version number.

Projects using Zarr

If you are using Zarr, we would love to hear about it.

Acknowledgments

The following people have contributed to the development of Zarr by contributing code, documentation, code reviews, comments and/or ideas:

Zarr is inspired by HDF5, h5py and bcolz.

Development of Zarr is supported by the MRC Centre for Genomics and Global Health.
