Storage (zarr.storage)

This module contains storage classes for use with Zarr arrays and groups.

Note that any object implementing the MutableMapping interface (collections.abc.MutableMapping in the Python standard library) can be used as a Zarr array store, as long as it accepts string (str) keys and bytes values.
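As a rough illustration of the interface requirement above, the sketch below defines a store from scratch using only the standard library. The class name LoggingStore and its written attribute are hypothetical and not part of Zarr; the point is only that five methods are enough to satisfy MutableMapping.

```python
from collections.abc import MutableMapping


class LoggingStore(MutableMapping):
    """Hypothetical store: a dict wrapper that records every key written."""

    def __init__(self):
        self._data = {}     # str key -> bytes value
        self.written = []   # keys in the order they were set

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = bytes(value)  # keep an immutable copy of the value
        self.written.append(key)

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


store = LoggingStore()
store['.zarray'] = b'{"zarr_format": 2}'
store['0.0'] = b'\x00' * 8
```

Methods such as keys(), items(), pop() and the in operator are supplied by the MutableMapping base class, so they do not need to be written by hand.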

In addition to the MutableMapping interface, store classes may also implement optional methods listdir (list members of a “directory”) and rmdir (remove all members of a “directory”). These methods should be implemented if the store class is aware of the hierarchical organisation of resources within the store and can provide efficient implementations. If these methods are not available, Zarr will fall back to slower implementations that work via the MutableMapping interface. Store classes may also optionally implement a rename method (rename all members under a given path) and a getsize method (return the size in bytes of a given value).
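To see why a native listdir can be worthwhile, here is a minimal sketch of what a fallback based purely on the MutableMapping interface might look like. The function name listdir_fallback is hypothetical and this is not Zarr's actual implementation; it assumes keys use '/' as the path separator.

```python
def listdir_fallback(store, path=''):
    """List the immediate children of `path` using only the mapping's keys."""
    prefix = path.strip('/')
    if prefix:
        prefix += '/'
    children = set()
    for key in store:  # full scan of every key in the store
        if key.startswith(prefix):
            remainder = key[len(prefix):]
            children.add(remainder.split('/', 1)[0])
    return sorted(children)


store = {
    '.zgroup': b'{}',
    'foo/.zgroup': b'{}',
    'foo/bar/.zarray': b'{}',
    'foo/bar/0.0': b'\x00',
}
top_level = listdir_fallback(store)
under_foo = listdir_fallback(store, 'foo')
```

Every call iterates over all keys, which is the kind of cost a store with real directory awareness can avoid.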

class zarr.storage.DictStore(root=None, cls=dict)

Store class that uses a hierarchy of dict objects, thus all data will be held in main memory.

Notes

Safe to write in multiple threads.
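The idea of a hierarchy of dict objects can be pictured with the sketch below, in which each '/'-separated path segment selects one level of nested dicts. The helper names set_nested and get_nested are hypothetical; this is an illustration of the data structure, not DictStore's code, and it omits any locking.

```python
def set_nested(root, key, value):
    """Store `value` under a '/'-separated key, creating intermediate dicts."""
    *parents, leaf = key.split('/')
    node = root
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


def get_nested(root, key):
    """Walk down one dict per path segment and return the stored value."""
    node = root
    for name in key.split('/'):
        node = node[name]
    return node


root = {}
set_nested(root, '.zgroup', b'{}')
set_nested(root, 'foo/.zgroup', b'{}')
set_nested(root, 'foo/bar/0.0', b'\x2a')
```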

Examples

This is the default class used when creating a group. E.g.:

>>> import zarr
>>> g = zarr.group()
>>> type(g.store)
<class 'zarr.storage.DictStore'>

Note that the default class when creating an array is the built-in dict class, i.e.:

>>> z = zarr.zeros(100)
>>> type(z.store)
<class 'dict'>

class zarr.storage.DirectoryStore(path)

Storage class using directories and files on a standard file system.

Parameters:
path : string

Location of directory to use as the root of the storage hierarchy.

Notes

Atomic writes are used, which means that data are first written to a temporary file, then moved into place when the write is successfully completed. Files are only held open while they are being read or written and are closed immediately afterwards, so there is no need to manually close any files.
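The write-then-move pattern described above can be sketched with the standard library as follows. The function atomic_write is a hypothetical helper, not the code DirectoryStore uses; it assumes the temporary file lives in the same directory as the target so that the final move stays on one file system.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write `data` (bytes) to `path` via a temporary file in the same directory."""
    directory = os.path.dirname(path) or '.'
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.partial')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        # Move the finished file into place, replacing any previous version.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the move.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


root = tempfile.mkdtemp()
chunk_path = os.path.join(root, 'array.zarr', '0.0')
atomic_write(chunk_path, b'first')
atomic_write(chunk_path, b'second')
```

A reader opening the target path sees either the old contents or the new contents, never a partially written file.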

Safe to write in multiple threads or processes.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.DirectoryStore('data/array.zarr')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42

Each chunk of the array is stored as a separate file on the file system, i.e.:

>>> import os
>>> sorted(os.listdir('data/array.zarr'))
['.zarray', '0.0', '0.1', '1.0', '1.1']

Store a group:

>>> store = zarr.DirectoryStore('data/group.zarr')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42

When storing a group, levels in the group hierarchy will correspond to directories on the file system, i.e.:

>>> sorted(os.listdir('data/group.zarr'))
['.zgroup', 'foo']
>>> sorted(os.listdir('data/group.zarr/foo'))
['.zgroup', 'bar']
>>> sorted(os.listdir('data/group.zarr/foo/bar'))
['.zarray', '0.0', '0.1', '1.0', '1.1']

class zarr.storage.TempStore(suffix='', prefix='zarr', dir=None)

Directory store using a temporary directory for storage.

Parameters:
suffix : string, optional

Suffix for the temporary directory name.

prefix : string, optional

Prefix for the temporary directory name.

dir : string, optional

Path to parent directory in which to create temporary directory.

class zarr.storage.NestedDirectoryStore(path)

Storage class using directories and files on a standard file system, with special handling for chunk keys so that chunk files for multidimensional arrays are stored in a nested directory tree.

Parameters:
path : string

Location of directory to use as the root of the storage hierarchy.

Notes

The DirectoryStore class stores all chunk files for an array together in a single directory. On some file systems, the potentially large number of files in a single directory can cause performance issues. The NestedDirectoryStore class provides an alternative where chunk files for multidimensional arrays will be organised into a directory hierarchy, thus reducing the number of files in any one directory.
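One way to picture the special handling of chunk keys is as a translation from dotted chunk keys to nested paths. The sketch below is an assumption about the general idea rather than the class's actual code; the function name nested_key and the regular expression are hypothetical.

```python
import re

# A chunk key's final segment is integers joined by dots, e.g. '0.1' or '12.3.4'.
_CHUNK_SEGMENT = re.compile(r'^\d+(\.\d+)*$')


def nested_key(key):
    """Map a flat chunk key such as 'foo/bar/0.1' to 'foo/bar/0/1'."""
    head, sep, last = key.rpartition('/')
    if _CHUNK_SEGMENT.match(last):
        last = last.replace('.', '/')
    # Metadata keys such as '.zarray' do not match and pass through unchanged.
    return head + sep + last


translated = [nested_key(k) for k in ['0.1', 'foo/bar/1.0', 'foo/bar/.zarray', '.zgroup']]
```

With a mapping like this, a 2-D array with an N by M chunk grid ends up with N subdirectories of M files each, instead of N × M files in one directory.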

Safe to write in multiple threads or processes.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.NestedDirectoryStore('data/array.zarr')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42

Each chunk of the array is stored as a separate file on the file system. Note the multiple directory levels used for the chunk files:

>>> import os
>>> sorted(os.listdir('data/array.zarr'))
['.zarray', '0', '1']
>>> sorted(os.listdir('data/array.zarr/0'))
['0', '1']
>>> sorted(os.listdir('data/array.zarr/1'))
['0', '1']

Store a group:

>>> store = zarr.NestedDirectoryStore('data/group.zarr')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42

When storing a group, levels in the group hierarchy will correspond to directories on the file system, i.e.:

>>> sorted(os.listdir('data/group.zarr'))
['.zgroup', 'foo']
>>> sorted(os.listdir('data/group.zarr/foo'))
['.zgroup', 'bar']
>>> sorted(os.listdir('data/group.zarr/foo/bar'))
['.zarray', '0', '1']
>>> sorted(os.listdir('data/group.zarr/foo/bar/0'))
['0', '1']
>>> sorted(os.listdir('data/group.zarr/foo/bar/1'))
['0', '1']

class zarr.storage.ZipStore(path, compression=0, allowZip64=True, mode='a')

Storage class using a Zip file.

Parameters:
path : string

Location of file.

compression : integer, optional

Compression method to use when writing to the archive.

allowZip64 : bool, optional

If True (the default), ZIP files that use the ZIP64 extensions will be created when the file is larger than 2 GiB. If False, an exception will be raised when the ZIP file would require ZIP64 extensions.

mode : string, optional

One of ‘r’ to read an existing file, ‘w’ to truncate and write a new file, ‘a’ to append to an existing file, or ‘x’ to exclusively create and write a new file.

Notes

Each chunk of an array is stored as a separate entry in the Zip file. Note that Zip files do not provide any way to remove or replace existing entries. If an attempt is made to replace an entry, then a warning is generated by the Python standard library about a duplicate Zip file entry. This can be triggered if you attempt to write data to a Zarr array more than once, e.g.:

>>> store = zarr.ZipStore('data/example.zip', mode='w')
>>> z = zarr.zeros(100, chunks=10, store=store)
>>> z[...] = 42  # first write OK
>>> z[...] = 42  # second write generates warnings
>>> store.close()

This can also happen in a more subtle situation, where data are written only once to a Zarr array, but the write operations are not aligned with chunk boundaries, e.g.:

>>> store = zarr.ZipStore('data/example.zip', mode='w')
>>> z = zarr.zeros(100, chunks=10, store=store)
>>> z[5:15] = 42
>>> z[15:25] = 42  # write overlaps chunk previously written, generates warnings

To avoid creating duplicate entries, only write data once, and align writes with chunk boundaries. This alignment is done automatically if you call z[...] = ... or create an array from existing data via zarr.array().
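To check in advance whether a planned sequence of writes would touch the same chunk twice, the chunk indices covered by each slice can be computed with integer division. The sketch below handles the one-dimensional case only; chunks_touched and is_aligned are hypothetical helpers, not Zarr functions, and they assume stop > start.

```python
def chunks_touched(start, stop, chunk_len):
    """Return the indices of the 1-D chunks overlapped by the slice [start, stop)."""
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    return list(range(first, last + 1))


def is_aligned(start, stop, chunk_len):
    """True when the slice begins and ends on chunk boundaries."""
    return start % chunk_len == 0 and stop % chunk_len == 0


# The two unaligned writes from the example above, with chunks of length 10.
first_write = chunks_touched(5, 15, chunk_len=10)
second_write = chunks_touched(15, 25, chunk_len=10)
rewritten = sorted(set(first_write) & set(second_write))
```

Here the first write covers chunks 0 and 1 and the second covers chunks 1 and 2, so chunk 1 is stored twice. Writing z[0:10], z[10:20] and so on would keep the two sets disjoint.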

Alternatively, use a DirectoryStore when writing the data, then manually Zip the directory and use the Zip file for subsequent reads.
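The manual zipping step could look something like the sketch below, which uses only the standard library. It assumes that entry names in the Zip file must match the store keys, i.e., paths relative to the store root with '/' separators; zip_directory is a hypothetical helper, and the small directory created here is a stand-in for one written via DirectoryStore.

```python
import os
import tempfile
import zipfile


def zip_directory(src_dir, zip_path):
    """Write every file under `src_dir` into `zip_path`, uncompressed."""
    with zipfile.ZipFile(zip_path, mode='w', compression=zipfile.ZIP_STORED) as zf:
        for dirpath, _dirnames, filenames in os.walk(src_dir):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                # Entry names are relative to src_dir and use '/' separators.
                arcname = os.path.relpath(full, src_dir).replace(os.sep, '/')
                zf.write(full, arcname)


# Stand-in for a directory written via DirectoryStore.
work = tempfile.mkdtemp()
src = os.path.join(work, 'array.zarr')
os.makedirs(src)
for name, payload in [('.zarray', b'{}'), ('0.0', b'\x00'), ('0.1', b'\x01')]:
    with open(os.path.join(src, name), 'wb') as f:
        f.write(payload)

zip_path = os.path.join(work, 'array.zip')
zip_directory(src, zip_path)
with zipfile.ZipFile(zip_path) as zf:
    names = sorted(zf.namelist())
```

zipfile.ZIP_STORED has the value 0, matching the default compression=0 in the ZipStore signature; chunk data are usually already compressed, so a second layer of compression tends to add little.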

Safe to write in multiple threads but not in multiple processes.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.ZipStore('data/array.zip', mode='w')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done

Store a group:

>>> store = zarr.ZipStore('data/group.zip', mode='w')
>>> root = zarr.group(store=store)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done

After modifying a ZipStore, the close() method must be called, otherwise essential data will not be written to the underlying Zip file. The ZipStore class also supports the context manager protocol, which ensures the close() method is called on leaving the context, e.g.:

>>> with zarr.ZipStore('data/array.zip', mode='w') as store:
...     z = zarr.zeros((10, 10), chunks=(5, 5), store=store)
...     z[...] = 42
...     # no need to call store.close()

close()

Closes the underlying zip file, ensuring all records are written.

flush()

Closes the underlying zip file, ensuring all records are written, then re-opens the file for further modifications.

class zarr.storage.DBMStore(path, flag='c', mode=438, open=None, write_lock=True, **open_kwargs)

Storage class using a DBM-style database.

Parameters:
path : string

Location of database file.

flag : string, optional

Flags for opening the database file.

mode : int

File mode used if a new file is created.

open : function, optional

Function to open the database file. If not provided, dbm.open() will be used on Python 3, and anydbm.open() will be used on Python 2.

write_lock : bool, optional

Use a lock to prevent concurrent writes from multiple threads (True by default).

**open_kwargs

Keyword arguments to pass the open function.

Notes

Please note that, by default, this class will use the Python standard library dbm.open function to open the database file (or anydbm.open on Python 2). There are up to three different implementations of DBM-style databases available in any Python installation, and which one is used may vary from one system to another. Database file formats are not compatible between these different implementations. Also, some implementations are more efficient than others. In particular, the “dumb” implementation will be the fall-back on many systems, and has very poor performance for some usage scenarios. If you want to ensure a specific implementation is used, pass the corresponding open function, e.g., dbm.gnu.open to use the GNU DBM library.
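The effect of pinning an implementation can be tried out with the standard library alone. The sketch below calls dbm.dumb.open directly, chosen here only because it is available on every Python 3 installation and not because it performs well; the file path is a throwaway temporary location.

```python
import dbm.dumb
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example')

# Pin the implementation by calling its open function directly,
# rather than letting dbm.open pick whichever backend is available.
with dbm.dumb.open(path, 'c') as db:
    db['0.0'] = b'chunk bytes'

# Re-open read-only with the same implementation to read the value back.
with dbm.dumb.open(path, 'r') as db:
    value = db['0.0']
    keys = sorted(db.keys())
```

A file written this way should be re-opened with the same open function, since the on-disk formats of the different implementations are not interchangeable.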

Safe to write in multiple threads. May be safe to write in multiple processes, depending on which DBM implementation is being used, although this has not been tested.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.DBMStore('data/array.db')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done

Store a group:

>>> store = zarr.DBMStore('data/group.db')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done

After modifying a DBMStore, the close() method must be called, otherwise essential data may not be written to the underlying database file. The DBMStore class also supports the context manager protocol, which ensures the close() method is called on leaving the context, e.g.:

>>> with zarr.DBMStore('data/array.db') as store:
...     z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
...     z[...] = 42
...     # no need to call store.close()

A different database library can be used by passing a different function to the open parameter. For example, if the bsddb3 package is installed, a Berkeley DB database can be used:

>>> import bsddb3
>>> store = zarr.DBMStore('data/array.bdb', open=bsddb3.btopen)
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()

close()

Closes the underlying database file.

flush()

Synchronizes data to the underlying database file.

class zarr.storage.LMDBStore(path, buffers=True, **kwargs)

Storage class using LMDB. Requires the lmdb package to be installed.

Parameters:
path : string

Location of database file.

buffers : bool, optional

If True (default) use support for buffers, which should increase performance by reducing memory copies.

**kwargs

Keyword arguments passed through to the lmdb.open function.

Notes

By default, to increase performance, writes are not immediately flushed to disk. You can ensure data are flushed to disk by calling the flush() or close() methods.

Should be safe to write in multiple threads or processes due to the synchronization support within LMDB, although writing from multiple processes has not been tested.

Examples

Store a single array:

>>> import zarr
>>> store = zarr.LMDBStore('data/array.mdb')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done

Store a group:

>>> store = zarr.LMDBStore('data/group.mdb')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done

After modifying an LMDBStore, the close() method must be called, otherwise essential data may not be written to the underlying database file. The LMDBStore class also supports the context manager protocol, which ensures the close() method is called on leaving the context, e.g.:

>>> with zarr.LMDBStore('data/array.mdb') as store:
...     z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
...     z[...] = 42
...     # no need to call store.close()

close()

Closes the underlying database.

flush()

Synchronizes data to the file system.

class zarr.storage.LRUStoreCache(store, max_size)

Storage class that implements a least-recently-used (LRU) cache layer over some other store. Intended primarily for use with stores that can be slow to access, e.g., remote stores that require network communication to store and retrieve data.

Parameters:
store : MutableMapping

The store containing the actual data to be cached.

max_size : int

The maximum size that the cache may grow to, in number of bytes. Provide None if you would like the cache to have unlimited size.

Examples

The example below wraps an S3 store with an LRU cache:

>>> import s3fs
>>> import zarr
>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
>>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
>>> cache = zarr.LRUStoreCache(store, max_size=2**28)
>>> root = zarr.group(store=cache)
>>> z = root['foo/bar/baz']
>>> from timeit import timeit
>>> # first data access is relatively slow, retrieved from store
... timeit('print(z[:].tobytes())', number=1, globals=globals())
b'Hello from the cloud!'
0.1081731989979744
>>> # second data access is faster, uses cache
... timeit('print(z[:].tobytes())', number=1, globals=globals())
b'Hello from the cloud!'
0.0009490990014455747

invalidate()

Completely clear the cache.

invalidate_values()

Clear the values cache.

invalidate_keys()

Clear the keys cache.
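The caching behaviour described for LRUStoreCache can be pictured with the simplified sketch below: a read-through cache that is bounded by the total size of the cached values in bytes and evicts the least recently used entry first. The class TinyLRUCache and its hits and misses counters are hypothetical; the sketch handles reads only and is not the library's implementation.

```python
from collections import OrderedDict


class TinyLRUCache:
    """Hypothetical read-through cache bounded by total value size in bytes."""

    def __init__(self, store, max_size):
        self._store = store
        self._max_size = max_size      # None means unlimited
        self._values = OrderedDict()   # key -> bytes, least recently used first
        self._current_size = 0
        self.hits = 0
        self.misses = 0

    def __getitem__(self, key):
        if key in self._values:
            self.hits += 1
            self._values.move_to_end(key)   # mark as most recently used
            return self._values[key]
        self.misses += 1
        value = self._store[key]            # slow path: underlying store
        self._remember(key, value)
        return value

    def _remember(self, key, value):
        size = len(value)
        if self._max_size is not None and size > self._max_size:
            return                          # too large to cache at all
        self._values[key] = value
        self._current_size += size
        while self._max_size is not None and self._current_size > self._max_size:
            _, evicted = self._values.popitem(last=False)  # drop least recently used
            self._current_size -= len(evicted)

    def invalidate(self):
        """Forget everything that has been cached."""
        self._values.clear()
        self._current_size = 0


backing = {'a': b'x' * 6, 'b': b'y' * 6}
cache = TinyLRUCache(backing, max_size=10)
cache['a']   # miss, value is cached
cache['a']   # hit
cache['b']   # miss; caching 'b' evicts 'a' to stay within 10 bytes
cache['a']   # miss again, 'a' was evicted
```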

zarr.storage.init_array(store, shape, chunks=True, dtype=None, compressor='default', fill_value=None, order='C', overwrite=False, path=None, chunk_store=None, filters=None, object_codec=None)

Initialize an array store with the given configuration. Note that this is a low-level function and there should be no need to call this directly from user code.

Parameters:
store : MutableMapping

A mapping that supports string keys and bytes-like values.

shape : int or tuple of ints

Array shape.

chunks : int or tuple of ints, optional

Chunk shape. If True, will be guessed from shape and dtype. If False, will be set to shape, i.e., single chunk for the whole array.

dtype : string or dtype, optional

NumPy dtype.

compressor : Codec, optional

Primary compressor.

fill_value : object

Default value to use for uninitialized portions of the array.

order : {‘C’, ‘F’}, optional

Memory layout to be used within each chunk.

overwrite : bool, optional

If True, erase all data in store prior to initialisation.

path : string, optional

Path under which array is stored.

chunk_store : MutableMapping, optional

Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.

filters : sequence, optional

Sequence of filters to use to encode chunk data prior to compression.

object_codec : Codec, optional

A codec to encode object arrays, only needed if dtype=object.

Notes

The initialisation process involves normalising all array metadata, encoding as JSON and storing under the ‘.zarray’ key.
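Because the metadata is plain JSON stored under a known key, it can be read back with the standard library. The sketch below uses a hand-written, abbreviated metadata document in the same layout as the '.zarray' documents in the Examples section; a real store would obtain these bytes from init_array. It then derives the size of the chunk grid from the shape and chunks fields.

```python
import json
import math

# Hand-written stand-in for the bytes stored under the '.zarray' key.
store = {
    '.zarray': json.dumps({
        'chunks': [1000, 1000],
        'dtype': '<f8',
        'order': 'C',
        'shape': [10000, 10000],
        'zarr_format': 2,
    }).encode('ascii'),
}

meta = json.loads(store['.zarray'].decode('ascii'))

# Number of chunks along each dimension, rounding up for partial edge chunks.
chunks_per_dim = [math.ceil(s / c) for s, c in zip(meta['shape'], meta['chunks'])]
total_chunks = math.prod(chunks_per_dim)
```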

Examples

Initialize an array store:

>>> from zarr.storage import init_array
>>> store = dict()
>>> init_array(store, shape=(10000, 10000), chunks=(1000, 1000))
>>> sorted(store.keys())
['.zarray']

Array metadata is stored as JSON:

>>> print(store['.zarray'].decode())
{
    "chunks": [
        1000,
        1000
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<f8",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        10000,
        10000
    ],
    "zarr_format": 2
}

Initialize an array using a storage path:

>>> store = dict()
>>> init_array(store, shape=100000000, chunks=1000000, dtype='i1', path='foo')
>>> sorted(store.keys())
['.zgroup', 'foo/.zarray']
>>> print(store['foo/.zarray'].decode())
{
    "chunks": [
        1000000
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "|i1",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        100000000
    ],
    "zarr_format": 2
}

zarr.storage.init_group(store, overwrite=False, path=None, chunk_store=None)

Initialize a group store. Note that this is a low-level function and there should be no need to call this directly from user code.

Parameters:
store : MutableMapping

A mapping that supports string keys and byte sequence values.

overwrite : bool, optional

If True, erase all data in store prior to initialisation.

path : string, optional

Path under which group is stored.

chunk_store : MutableMapping, optional

Separate storage for chunks. If not provided, store will be used for storage of both chunks and metadata.

zarr.storage.contains_array(store, path=None)

Return True if the store contains an array at the given logical path.

zarr.storage.contains_group(store, path=None)

Return True if the store contains a group at the given logical path.

zarr.storage.listdir(store, path=None)

Obtain a directory listing for the given path. If store provides a listdir method, it is called; otherwise this function falls back to an implementation via the MutableMapping interface.

zarr.storage.rmdir(store, path=None)

Remove all items under the given path. If store provides a rmdir method, it is called; otherwise this function falls back to an implementation via the MutableMapping interface.

zarr.storage.getsize(store, path=None)

Compute size of stored items for a given path. If store provides a getsize method, it is called; otherwise this function returns -1.

zarr.storage.rename(store, src_path, dst_path)

Rename all items under the given path. If store provides a rename method, it is called; otherwise this function falls back to an implementation via the MutableMapping interface.
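A fallback built only on the MutableMapping interface might work roughly as sketched below, by re-keying every item whose key lies under the source path. The function rename_fallback is hypothetical and is not Zarr's code; it assumes '/'-separated keys.

```python
def rename_fallback(store, src_path, dst_path):
    """Move every item under `src_path` to the same relative key under `dst_path`."""
    src = src_path.strip('/')
    dst = dst_path.strip('/')
    src_prefix = src + '/'
    # Take a snapshot of the keys, because the mapping is modified in the loop.
    for key in list(store):
        if key == src or key.startswith(src_prefix):
            new_key = dst + key[len(src):]
            store[new_key] = store.pop(key)


store = {
    'foo/.zarray': b'{}',
    'foo/0.0': b'\x00',
    'food/0.0': b'\x01',   # shares a string prefix with 'foo' but is a different path
}
rename_fallback(store, 'foo', 'baz')
```

Matching on 'foo/' rather than 'foo' keeps unrelated paths such as 'food' untouched.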

zarr.storage.migrate_1to2(store)

Migrate array metadata in store from Zarr format version 1 to version 2.

Parameters:
store : MutableMapping

Store to be migrated.

Notes

Version 1 did not support hierarchies, so this migration function will look for a single array in store and migrate the array metadata to version 2.