V3 Specification Implementation (zarr._storage.v3)#
This module contains the implementation of the Zarr V3 Specification.
Warning
Since the Zarr Python 2.12 release, this module provides experimental infrastructure for reading and writing the upcoming V3 spec of the Zarr format. Users wishing to prepare for the migration can set the environment variable ZARR_V3_EXPERIMENTAL_API=1 to begin experimenting; however, data written with this API should be expected to become stale, as the implementation will still change.
The new zarr._storage.v3 package has the necessary classes and functions for evaluating Zarr V3. Since the design is not finalised, the classes and functions are not automatically imported into the regular Zarr namespace.
Code snippet for creating Zarr V3 arrays:
>>> import zarr
>>> z = zarr.create((10000, 10000),
...                 chunks=(100, 100),
...                 dtype='f8',
...                 compressor='default',
...                 path='path-where-you-want-zarr-v3-array',
...                 zarr_version=3)
Further, you can use z.info to see details about the array you just created:
>>> z.info
Name : path-where-you-want-zarr-v3-array
Type : zarr.core.Array
Data type : float64
Shape : (10000, 10000)
Chunk shape : (100, 100)
Order : C
Read-only : False
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr._storage.v3.KVStoreV3
No. bytes : 800000000 (762.9M)
No. bytes stored : 557
Storage ratio : 1436265.7
Chunks initialized : 0/10000
You can also check the Store type in the output above (zarr._storage.v3.KVStoreV3), which indicates a Zarr V3 store.
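As a quick sanity check, the No. bytes figure reported by z.info follows directly from the array's shape and dtype (plain Python, no zarr required):

```python
# Verify the "No. bytes" figure from z.info above:
# a (10000, 10000) float64 array occupies shape-product * 8 bytes.
shape = (10000, 10000)
itemsize = 8  # bytes per float64 element
n_bytes = shape[0] * shape[1] * itemsize
print(n_bytes)                  # 800000000
print(round(n_bytes / 2**20, 1))  # 762.9 (MiB), matching the report
# Likewise, (10000/100) * (10000/100) = 10000 chunks, matching "0/10000".
n_chunks = (shape[0] // 100) * (shape[1] // 100)
print(n_chunks)                 # 10000
```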
- class zarr._storage.v3.RmdirV3[source]#
Mixin class that can be used to ensure override of any existing v2 rmdir class.
- class zarr._storage.v3.KVStoreV3(mutablemapping)[source]#
This provides a default implementation of a store interface around a mutable mapping, to avoid having to test stores for the presence of methods. For most methods this is just a pass-through to the underlying key-value store, which is likely to expose a MutableMapping interface.
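The pass-through idea can be illustrated with a minimal sketch (this is not the actual KVStoreV3 code; the class and method set here are for illustration only):

```python
from collections.abc import MutableMapping

class KVStoreSketch:
    """Illustrative store that forwards every operation to an
    underlying MutableMapping (a sketch, not zarr's KVStoreV3)."""

    def __init__(self, mutablemapping: MutableMapping):
        self._mutable_mapping = mutablemapping

    def __getitem__(self, key):
        return self._mutable_mapping[key]

    def __setitem__(self, key, value):
        self._mutable_mapping[key] = value

    def __delitem__(self, key):
        del self._mutable_mapping[key]

    def __contains__(self, key):
        return key in self._mutable_mapping

store = KVStoreSketch({})
store["meta/root.group.json"] = b"{}"
print("meta/root.group.json" in store)  # True
```

Wrapping any MutableMapping this way means callers never need to probe the underlying object for which methods it supports.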
- class zarr._storage.v3.FSStoreV3(url, normalize_keys=False, key_separator=None, mode='w', exceptions=(<class 'KeyError'>, <class 'PermissionError'>, <class 'OSError'>), dimension_separator=None, fs=None, check=False, create=False, missing_exceptions=None, **storage_options)[source]#
- class zarr._storage.v3.MemoryStoreV3(root=None, cls=<class 'dict'>, dimension_separator=None)[source]#
Store class that uses a hierarchy of KVStore objects, thus all data will be held in main memory.
Notes
Safe to write in multiple threads.
Examples
This is the default class used when creating a group, e.g.:
>>> import zarr
>>> g = zarr.group()
>>> type(g.store)
<class 'zarr.storage.MemoryStore'>
Note that the default class when creating an array is the built-in KVStore class, i.e.:
>>> z = zarr.zeros(100)
>>> type(z.store)
<class 'zarr.storage.KVStore'>
- class zarr._storage.v3.DirectoryStoreV3(path, normalize_keys=False, dimension_separator=None)[source]#
Storage class using directories and files on a standard file system.
- Parameters:
- path : string
Location of directory to use as the root of the storage hierarchy.
- normalize_keys : bool, optional
If True, all store keys will be normalized to use lower case characters (e.g. ‘foo’ and ‘FOO’ will be treated as equivalent). This can be useful to avoid potential discrepancies between case-sensitive and case-insensitive file systems. Default value is False.
- dimension_separator : {‘.’, ‘/’}, optional
Separator placed between the dimensions of a chunk.
Notes
Atomic writes are used, which means that data are first written to a temporary file, then moved into place when the write is successfully completed. Files are only held open while they are being read or written and are closed immediately afterwards, so there is no need to manually close any files.
Safe to write in multiple threads or processes.
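The atomic-write pattern described in the notes can be sketched with the standard library alone (an illustration of the technique, not zarr's actual implementation):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write to a temporary file in the same directory, then atomically
    move it into place, so readers never observe a partial file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)  # atomic rename into place
    except BaseException:
        os.remove(tmp_path)  # clean up the temporary file on failure
        raise

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "0.0")  # a chunk-like file name
    atomic_write(target, b"\x00" * 16)
    with open(target, "rb") as f:
        result = f.read()
print(len(result))  # 16
```

Creating the temporary file in the same directory as the target matters: os.replace is only atomic within a single file system.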
Examples
Store a single array:
>>> import zarr
>>> store = zarr.DirectoryStore('data/array.zarr')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
Each chunk of the array is stored as a separate file on the file system, i.e.:
>>> import os
>>> sorted(os.listdir('data/array.zarr'))
['.zarray', '0.0', '0.1', '1.0', '1.1']
Store a group:
>>> store = zarr.DirectoryStore('data/group.zarr')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
When storing a group, levels in the group hierarchy will correspond to directories on the file system, i.e.:
>>> sorted(os.listdir('data/group.zarr'))
['.zgroup', 'foo']
>>> sorted(os.listdir('data/group.zarr/foo'))
['.zgroup', 'bar']
>>> sorted(os.listdir('data/group.zarr/foo/bar'))
['.zarray', '0.0', '0.1', '1.0', '1.1']
- class zarr._storage.v3.ZipStoreV3(path, compression=0, allowZip64=True, mode='a', dimension_separator=None)[source]#
Storage class using a Zip file.
- Parameters:
- path : string
Location of file.
- compression : integer, optional
Compression method to use when writing to the archive.
- allowZip64 : bool, optional
If True (the default) will create ZIP files that use the ZIP64 extensions when the zipfile is larger than 2 GiB. If False will raise an exception when the ZIP file would require ZIP64 extensions.
- mode : string, optional
One of ‘r’ to read an existing file, ‘w’ to truncate and write a new file, ‘a’ to append to an existing file, or ‘x’ to exclusively create and write a new file.
- dimension_separator : {‘.’, ‘/’}, optional
Separator placed between the dimensions of a chunk.
Notes
Each chunk of an array is stored as a separate entry in the Zip file. Note that Zip files do not provide any way to remove or replace existing entries. If an attempt is made to replace an entry, then a warning is generated by the Python standard library about a duplicate Zip file entry. This can be triggered if you attempt to write data to a Zarr array more than once, e.g.:
>>> store = zarr.ZipStore('data/example.zip', mode='w')
>>> z = zarr.zeros(100, chunks=10, store=store)
>>> # first write OK
... z[...] = 42
>>> # second write generates warnings
... z[...] = 42
>>> store.close()
This can also happen in a more subtle situation, where data are written only once to a Zarr array, but the write operations are not aligned with chunk boundaries, e.g.:
>>> store = zarr.ZipStore('data/example.zip', mode='w')
>>> z = zarr.zeros(100, chunks=10, store=store)
>>> z[5:15] = 42
>>> # write overlaps chunk previously written, generates warnings
... z[15:25] = 42
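The duplicate-entry warning originates in the Python standard library's zipfile module itself, which can be demonstrated without zarr:

```python
import io
import warnings
import zipfile

# Write the same entry name twice into an in-memory Zip file.
buf = io.BytesIO()
zf = zipfile.ZipFile(buf, mode="w")
zf.writestr("0", b"chunk data")        # first write is fine
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    zf.writestr("0", b"chunk data")    # duplicate entry -> UserWarning
zf.close()
print(caught[0].category.__name__)     # UserWarning
```

Both copies of the entry remain in the archive, wasting space; only the last one is returned on read.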
To avoid creating duplicate entries, only write data once, and align writes with chunk boundaries. This alignment is done automatically if you call z[...] = ... or create an array from existing data via zarr.array().
Alternatively, use a DirectoryStore when writing the data, then manually Zip the directory and use the Zip file for subsequent reads. Take note that the files in the Zip file must be relative to the root of the Zarr archive. You may find it easier to create such a Zip file with 7z, e.g.:
7z a -tzip archive.zarr.zip archive.zarr/.
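If 7z is not available, the standard library's zipfile module can produce an equivalent archive, provided each entry name is made relative to the Zarr root (a sketch using a throwaway directory in place of a real Zarr store):

```python
import os
import tempfile
import zipfile

# Build a tiny stand-in for a Zarr directory, then zip it so that
# entry names are relative to the archive root (the same result the
# 7z command above produces).
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "archive.zarr")
    os.makedirs(root)
    for name in (".zarray", "0.0"):
        with open(os.path.join(root, name), "wb") as f:
            f.write(b"x")
    zip_path = os.path.join(tmp, "archive.zarr.zip")
    with zipfile.ZipFile(zip_path, mode="w") as zf:
        for dirpath, _, filenames in os.walk(root):
            for fname in filenames:
                full = os.path.join(dirpath, fname)
                # arcname relative to the Zarr root, as required
                zf.write(full, arcname=os.path.relpath(full, root))
    with zipfile.ZipFile(zip_path) as zf:
        names = sorted(zf.namelist())
print(names)  # ['.zarray', '0.0']
```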
Safe to write in multiple threads but not in multiple processes.
Examples
Store a single array:
>>> import zarr
>>> store = zarr.ZipStore('data/array.zip', mode='w')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done
Store a group:
>>> store = zarr.ZipStore('data/group.zip', mode='w')
>>> root = zarr.group(store=store)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done
After modifying a ZipStore, the close() method must be called, otherwise essential data will not be written to the underlying Zip file. The ZipStore class also supports the context manager protocol, which ensures the close() method is called on leaving the context, e.g.:
>>> with zarr.ZipStore('data/array.zip', mode='w') as store:
...     z = zarr.zeros((10, 10), chunks=(5, 5), store=store)
...     z[...] = 42
...     # no need to call store.close()
- class zarr._storage.v3.RedisStoreV3(prefix='zarr', dimension_separator=None, **kwargs)[source]#
Storage class using Redis.
Note
This is an experimental feature.
Requires the redis package to be installed.
- Parameters:
- prefix : string
Name of prefix for Redis keys.
- dimension_separator : {‘.’, ‘/’}, optional
Separator placed between the dimensions of a chunk.
- **kwargs
Keyword arguments passed through to the redis.Redis function.
- class zarr._storage.v3.MongoDBStoreV3(database='mongodb_zarr', collection='zarr_collection', dimension_separator=None, **kwargs)[source]#
Storage class using MongoDB.
Note
This is an experimental feature.
Requires the pymongo package to be installed.
- Parameters:
- database : string
Name of database.
- collection : string
Name of collection.
- dimension_separator : {‘.’, ‘/’}, optional
Separator placed between the dimensions of a chunk.
- **kwargs
Keyword arguments passed through to the pymongo.MongoClient function.
Notes
Chunks are stored as MongoDB documents, so the maximum chunk size is MongoDB's 16 MB document limit.
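Given that limit, it can be worth checking that an uncompressed chunk fits under 16 MB before choosing a chunk shape. A small illustrative helper (not part of zarr; the function name is hypothetical):

```python
# Hypothetical pre-flight check against MongoDB's document size limit.
MONGO_DOC_LIMIT = 16 * 1024 * 1024  # 16 MiB

def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size in bytes of one chunk."""
    n = itemsize
    for dim in chunk_shape:
        n *= dim
    return n

nbytes = chunk_nbytes((1000, 1000), itemsize=8)  # float64 chunks
print(nbytes, nbytes <= MONGO_DOC_LIMIT)  # 8000000 True
```

Compression usually shrinks chunks well below their uncompressed size, so this check is conservative.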
- class zarr._storage.v3.DBMStoreV3(path, flag='c', mode=438, open=None, write_lock=True, dimension_separator=None, **open_kwargs)[source]#
Storage class using a DBM-style database.
- Parameters:
- path : string
Location of database file.
- flag : string, optional
Flags for opening the database file.
- mode : int
File mode used if a new file is created.
- open : function, optional
Function to open the database file. If not provided, dbm.open() will be used on Python 3, and anydbm.open() will be used on Python 2.
- write_lock : bool, optional
Use a lock to prevent concurrent writes from multiple threads (True by default).
- dimension_separator : {‘.’, ‘/’}, optional
Separator placed between the dimensions of a chunk.
- **open_kwargs
Keyword arguments to pass to the open function.
Notes
Please note that, by default, this class will use the Python standard library dbm.open function to open the database file (or anydbm.open on Python 2). There are up to three different implementations of DBM-style databases available in any Python installation, and which one is used may vary from one system to another. Database file formats are not compatible between these different implementations. Also, some implementations are more efficient than others. In particular, the “dumb” implementation will be the fall-back on many systems, and has very poor performance for some usage scenarios. If you want to ensure a specific implementation is used, pass the corresponding open function, e.g., dbm.gnu.open to use the GNU DBM library.
Safe to write in multiple threads. May be safe to write in multiple processes, depending on which DBM implementation is being used, although this has not been tested.
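For example, the always-available dbm.dumb module can be passed explicitly to pin the implementation (shown here using the dbm module directly rather than through zarr; substitute dbm.gnu.open where the GNU DBM library is present):

```python
import dbm.dumb  # the portable fall-back DBM implementation
import os
import tempfile

# Opening through a specific module pins the DBM flavour, avoiding
# the system-dependent choice dbm.open would otherwise make.
with tempfile.TemporaryDirectory() as d:
    db = dbm.dumb.open(os.path.join(d, "array"), "c")  # 'c': create if absent
    db[b"0.0"] = b"chunk bytes"   # DBM keys and values are bytes
    value = db[b"0.0"]
    db.close()
print(value)  # b'chunk bytes'
```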
Examples
Store a single array:
>>> import zarr
>>> store = zarr.DBMStore('data/array.db')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done
Store a group:
>>> store = zarr.DBMStore('data/group.db')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done
After modifying a DBMStore, the close() method must be called, otherwise essential data may not be written to the underlying database file. The DBMStore class also supports the context manager protocol, which ensures the close() method is called on leaving the context, e.g.:
>>> with zarr.DBMStore('data/array.db') as store:
...     z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
...     z[...] = 42
...     # no need to call store.close()
A different database library can be used by passing a different function to the open parameter. For example, if the bsddb3 package is installed, a Berkeley DB database can be used:
>>> import bsddb3
>>> store = zarr.DBMStore('data/array.bdb', open=bsddb3.btopen)
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()
- class zarr._storage.v3.LMDBStoreV3(path, buffers=True, dimension_separator=None, **kwargs)[source]#
Storage class using LMDB. Requires the lmdb package to be installed.
- Parameters:
- path : string
Location of database file.
- buffers : bool, optional
If True (default) use support for buffers, which should increase performance by reducing memory copies.
- dimension_separator : {‘.’, ‘/’}, optional
Separator placed between the dimensions of a chunk.
- **kwargs
Keyword arguments passed through to the lmdb.open function.
Notes
By default writes are not immediately flushed to disk to increase performance. You can ensure data are flushed to disk by calling the flush() or close() methods.
Should be safe to write in multiple threads or processes due to the synchronization support within LMDB, although writing from multiple processes has not been tested.
Examples
Store a single array:
>>> import zarr
>>> store = zarr.LMDBStore('data/array.mdb')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done
Store a group:
>>> store = zarr.LMDBStore('data/group.mdb')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done
After modifying an LMDBStore, the close() method must be called, otherwise essential data may not be written to the underlying database file. The LMDBStore class also supports the context manager protocol, which ensures the close() method is called on leaving the context, e.g.:
>>> with zarr.LMDBStore('data/array.mdb') as store:
...     z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
...     z[...] = 42
...     # no need to call store.close()
- class zarr._storage.v3.SQLiteStoreV3(path, dimension_separator=None, **kwargs)[source]#
Storage class using SQLite.
- Parameters:
- path : string
Location of database file.
- dimension_separator : {‘.’, ‘/’}, optional
Separator placed between the dimensions of a chunk.
- **kwargs
Keyword arguments passed through to the sqlite3.connect function.
Examples
Store a single array:
>>> import zarr
>>> store = zarr.SQLiteStore('data/array.sqldb')
>>> z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
>>> z[...] = 42
>>> store.close()  # don't forget to call this when you're done
Store a group:
>>> store = zarr.SQLiteStore('data/group.sqldb')
>>> root = zarr.group(store=store, overwrite=True)
>>> foo = root.create_group('foo')
>>> bar = foo.zeros('bar', shape=(10, 10), chunks=(5, 5))
>>> bar[...] = 42
>>> store.close()  # don't forget to call this when you're done
- class zarr._storage.v3.LRUStoreCacheV3(store, max_size: int)[source]#
Storage class that implements a least-recently-used (LRU) cache layer over some other store. Intended primarily for use with stores that can be slow to access, e.g., remote stores that require network communication to store and retrieve data.
- Parameters:
- store : Store
The store containing the actual data to be cached.
- max_size : int
The maximum size that the cache may grow to, in number of bytes. Provide None if you would like the cache to have unlimited size.
Examples
The example below wraps an S3 store with an LRU cache:
>>> import s3fs
>>> import zarr
>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
>>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
>>> cache = zarr.LRUStoreCache(store, max_size=2**28)
>>> root = zarr.group(store=cache)
>>> z = root['foo/bar/baz']
>>> from timeit import timeit
>>> # first data access is relatively slow, retrieved from store
... timeit('print(z[:].tobytes())', number=1, globals=globals())
b'Hello from the cloud!'
0.1081731989979744
>>> # second data access is faster, uses cache
... timeit('print(z[:].tobytes())', number=1, globals=globals())
b'Hello from the cloud!'
0.0009490990014455747
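The caching strategy can be sketched independently of zarr with an OrderedDict (an illustration of the LRU idea only, not LRUStoreCache's actual code; the byte-size accounting here is deliberately simplified):

```python
from collections import OrderedDict

class LRUCacheLayer:
    """Byte-size-bounded read cache over another mapping (a sketch)."""

    def __init__(self, store, max_size):
        self._store = store          # slow backing store
        self._max_size = max_size    # cache budget in bytes
        self._cache = OrderedDict()  # insertion order = recency order
        self._current_size = 0

    def __getitem__(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        value = self._store[key]          # slow path: hit backing store
        self._cache[key] = value
        self._current_size += len(value)
        while self._current_size > self._max_size:
            _, evicted = self._cache.popitem(last=False)  # drop LRU entry
            self._current_size -= len(evicted)
        return value

backing = {"a": b"x" * 8, "b": b"y" * 8}
cache = LRUCacheLayer(backing, max_size=8)
cache["a"]
cache["b"]                 # pushes total to 16 bytes, evicting "a"
print(list(cache._cache))  # ['b']
```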
- class zarr._storage.v3.ConsolidatedMetadataStoreV3(store: BaseStore | MutableMapping, metadata_key='meta/root/consolidated/.zmetadata')[source]#
A layer over other storage, where the metadata has been consolidated into a single key.
The purpose of this class is to be able to get all of the metadata for a given array in a single read operation from the underlying storage. See zarr.convenience.consolidate_metadata() for how to create this single metadata key.
This class loads from the one key, and stores the data in a dict, so that accessing the keys no longer requires operations on the backend store.
This class is read-only, and attempts to change the array metadata will fail, but changing the data is possible. If the backend storage is changed directly, then the metadata stored here could become obsolete, and zarr.convenience.consolidate_metadata() should be called again and the class re-invoked. The use case is write once, read many times.
Note
This is an experimental feature.
- Parameters:
- store: Store
Containing the zarr array.
- metadata_key: str
The target in the store where all of the metadata are stored. We assume JSON encoding.
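The consolidation idea itself can be sketched with a plain dict and the json module (key names and values here are illustrative, not the exact layout produced by zarr.convenience.consolidate_metadata()):

```python
import json

# A toy store: metadata keys under "meta/", chunk data under "data/".
store = {
    "meta/root.group.json": '{"zarr_format": 3}',
    "meta/root/foo.array.json": '{"shape": [10, 10]}',
    "data/root/foo/c0/0": "chunk bytes",
}

# Consolidate: gather every metadata key into one JSON document
# stored under a single well-known key.
metadata = {k: v for k, v in store.items() if k.startswith("meta/")}
store["meta/root/consolidated/.zmetadata"] = json.dumps({"metadata": metadata})

# A reader can now recover all metadata with a single read:
loaded = json.loads(store["meta/root/consolidated/.zmetadata"])["metadata"]
print(sorted(loaded))  # ['meta/root.group.json', 'meta/root/foo.array.json']
```

This is why the class is read-only: the consolidated document is a snapshot, and direct changes to the backend make it stale until consolidation is rerun.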
In v3, storage transformers can be set via zarr.create(…, storage_transformers=[…]).
The experimental sharding storage transformer can be tested by setting the environment variable ZARR_V3_SHARDING=1. Data written with this flag enabled should be expected to become stale until ZEP 2 is approved and fully implemented.
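At its core, the sharding transformer maps a chunk's grid index to the shard holding it plus the chunk's position within that shard. A sketch of that index arithmetic (the function name is illustrative, not the spec's key layout):

```python
def shard_for_chunk(chunk_index, chunks_per_shard):
    """Map a chunk grid index to (shard index, position within shard).
    Several chunks share one shard, so the shard index is the chunk
    index divided (elementwise) by chunks_per_shard, and the local
    position is the remainder."""
    shard_index = tuple(c // s for c, s in zip(chunk_index, chunks_per_shard))
    within_shard = tuple(c % s for c, s in zip(chunk_index, chunks_per_shard))
    return shard_index, within_shard

# With 2x2 chunks per shard, chunk (3, 5) lives in shard (1, 2)
# at local position (1, 1):
print(shard_for_chunk((3, 5), (2, 2)))  # ((1, 2), (1, 1))
```

Grouping chunks into shards this way reduces the number of objects in the backing store, which matters for object stores and file systems that handle huge numbers of small keys poorly.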
- class zarr._storage.v3_storage_transformers.ShardingStorageTransformer(_type, chunks_per_shard)[source]#
Implements sharding as a storage transformer, as described in the spec:
https://zarr-specs.readthedocs.io/en/latest/extensions/storage-transformers/sharding/v1.0.html
https://purl.org/zarr/spec/storage_transformers/sharding/1.0
The abstract base class for storage transformers is