Improve Zarr V3 support, adding partial store read/write and storage transformers. Add new features from the v3 spec:
get_partial_values and set_partial_values
efficient get_partial_values implementation for FSStoreV3
sharding storage transformer
Special thanks to Outreachy participants for contributing to most of the maintenance PRs. Please read the blog post summarising the contribution phase and welcoming new Outreachy interns: https://zarr.dev/blog/welcoming-outreachy-2022-interns/
Support for alternative array classes by introducing a new argument, meta_array, that specifies the type/class of the underlying array. The meta_array argument can be any class instance that can be used as the like argument in NumPy (see NEP 35), enabling support for CuPy through, for example, the creation of a CuPy CPU compressor. By Mads R. B. Kristensen; #934.
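For example, a minimal sketch (assuming a CUDA-capable environment with CuPy installed; the shapes are illustrative):

    import zarr
    import cupy

    # Passing a 0-d CuPy array as meta_array asks Zarr to allocate output
    # buffers with CuPy instead of NumPy (NEP 35 "like" semantics).
    z = zarr.ones((100, 100), chunks=(10, 10), meta_array=cupy.empty(()))
    print(type(z[:10, :10]))  # expected: <class 'cupy.ndarray'>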
Fix bug in N5 storage that prevented arrays located in the root of the hierarchy from bearing the n5 keyword. Along with fixing this bug, new tests were added for N5 routines that had previously been excluded from testing, and type annotations were added to the N5 codebase. By Davis Bennett #1092.
Add support for reading and writing Zarr V3. The new zarr._storage.v3 package has the necessary classes and functions for evaluating Zarr V3. Since the format is not yet finalized, the classes and functions are not automatically imported into the regular zarr namespace. Setting the ZARR_V3_EXPERIMENTAL_API environment variable will activate them. By Greggory Lee; #898, #1006, and #1007; as well as by Josh Moore; #1032.
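A sketch of activating the experimental API (the path is illustrative, and passing zarr_version=3 here is an assumption about how the v3 format is selected):

    import os

    # Enable the experimental V3 classes before using them.
    os.environ["ZARR_V3_EXPERIMENTAL_API"] = "1"

    import zarr

    # zarr_version=3 requests the experimental v3 format.
    z = zarr.open("data/example.zarr", mode="w", zarr_version=3,
                  shape=(100,), chunks=(10,), dtype="i4")
    z[:] = 42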
Sparse changes with performance impact! One of the advantages of the Zarr format is that it is sparse, which means that chunks with no data (more precisely, with data equal to the fill value, which is usually 0) don’t need to be written to disk at all. They will simply be assumed to be empty at read time. However, until this release, the Zarr library would write these empty chunks to disk anyway. This changes in this version: a small performance penalty at write time leads to significant speedups at read time and in filesystem operations in the case of sparse arrays. To revert to the old behavior, pass the argument write_empty_chunks=True to the array creation function. By Juan Nunez-Iglesias; #853 and Davis Bennett; #738.
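For example (a sketch; the path and shapes are illustrative):

    import zarr

    # Empty chunks are no longer written by default; pass
    # write_empty_chunks=True to restore the old behavior.
    z = zarr.open("data/sparse.zarr", mode="w",
                  shape=(10000, 10000), chunks=(1000, 1000),
                  fill_value=0, write_empty_chunks=False)
    z[0, 0] = 1  # only the one chunk containing data is written to disk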
Fancy indexing. Zarr arrays now support NumPy-style fancy indexing with arrays of integer coordinates. This is equivalent to using zarr.Array.vindex. Mixing slices and integer arrays is not supported. By Juan Nunez-Iglesias; #725.
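For example:

    import numpy as np
    import zarr

    z = zarr.array(np.arange(10))

    # Fancy indexing with an array of integer coordinates...
    print(z[[1, 3, 5]])         # [1 3 5]
    # ...is equivalent to the explicit vindex form:
    print(z.vindex[[1, 3, 5]])  # [1 3 5]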
New base class. This release of Zarr Python introduces a new BaseStore class that all provided store classes implemented in Zarr Python now inherit from. This is done as part of refactoring to enable future support of the Zarr version 3 spec. Existing third-party stores that are a MutableMapping (e.g. dict) can be converted to a new-style key/value store inheriting from BaseStore by passing them as the argument to the new zarr.storage.KVStore class. For backwards compatibility, various higher-level array creation and convenience functions still accept plain Python dicts or other mutable mappings for the store argument, but will internally convert these to a KVStore. By Greggory Lee; #839, #789, and #950.
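For example (a sketch):

    import zarr
    from zarr.storage import KVStore

    # Wrap a plain MutableMapping (here a dict) as a new-style store.
    store = KVStore(dict())
    z = zarr.zeros((100,), chunks=(10,), store=store)
    z[:] = 42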
A swath of code-linting improvements by Dimitri Papadopoulos Orfanos:
Unnecessary comprehension (#899)
None provided as default (#900)
Use an if expression instead of and/or (#888)
Remove unnecessary literal (#891)
Decorate a few methods with @staticmethod (#885)
Unnecessary comprehension (#883)
Codespell configuration (#882)
Fix typos found by codespell (#880)
Proper C-style formatting for integer (#913)
Add LGTM.com / DeepSource.io configuration files (#909)
Fix structured arrays that contain objects. By Attila Bergou; #806.
This release of Zarr Python is the first that does not support Python 3.6.
V2 Specification Update
This release of Zarr Python is the first that does not support Python 3.5.
Change open_group/open_array to allow opening of read-only stores with mode='r'.
Add Array tests for FSStore. By Andrew Fulton; #644.
Fix a bug in which attrs would not be copied on the root when using copy_all.
Fix FileNotFoundError with dask/s3fs #649
Fix flaky fixture in test_storage.py #652
Fix failure of FSStore getitems with arrays that have a 0-length shape dimension #644
See this link for the full list of closed and merged PRs tagged with the 2.6 milestone.
Add ability to partially read and decompress arrays, see #667. This is only available for chunks stored via fsspec and compressed with Blosc.
For certain analysis cases where only a small portion of chunks is needed, it can be advantageous to access and decompress only part of each chunk. Partial read and decompression add high latency to many operations, so they should be used only when the subset of the data is small compared to the full chunks and is stored contiguously (that is to say, the last dimensions for C layout, the first for F). Pass partial_decompress=True as an argument when creating an Array, or when using open_array. No option exists yet to apply partial read and decompression on a per-operation basis.
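A sketch of how this might be used (the S3 URL is hypothetical, and this assumes an fsspec-backed store with Blosc-compressed chunks):

    import zarr

    z = zarr.open_array("s3://example-bucket/data.zarr", mode="r",
                        partial_decompress=True)
    # A small contiguous subset may now be read and decompressed
    # without processing each full chunk.
    subset = z[0, :100]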
This release will be the last to support Python 3.5; the next version of Zarr will require Python 3.6+.
DirectoryStore now uses os.scandir, which should make listing large stores faster, #563
Add typing information to many of the core functions #589
Added support for generic URL opening via fsspec, where the URLs have the form “protocol://[server]/path” or can be chained URLs with “::” separators. The additional argument storage_options is passed to the backend; see the fsspec docs. By Martin Durant; #546
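For example (the bucket URL is hypothetical):

    import zarr

    # storage_options is forwarded to the fsspec backend.
    z = zarr.open_array("s3://example-bucket/path/to/array",
                        storage_options={"anon": True})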
Added support for fetching multiple items via the getitems method of a store, if it exists. This allows for concurrent fetching of data blocks from stores that implement this; presently HTTP, S3, and GCS. Currently this only applies to reading. By Martin Durant; #606
Add “consolidated” metadata as an experimental feature: use zarr.convenience.consolidate_metadata() to copy all metadata from the various metadata keys within a dataset hierarchy under a single key, and zarr.convenience.open_consolidated() to use this single key. This can greatly cut down the number of calls to the storage backend, and so remove a lot of overhead for reading remote data. By Martin Durant, Alistair Miles, Ryan Abernathey; #268, #332, #338.
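For example (a sketch; the local path is illustrative):

    import zarr

    store = zarr.DirectoryStore("data/example.zarr")
    root = zarr.group(store=store)
    root.create_group("foo").zeros("bar", shape=(100,), chunks=(10,))

    # Copy all metadata under a single key...
    zarr.consolidate_metadata(store)
    # ...then open with a single metadata read:
    root = zarr.open_consolidated(store)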
New storage class for N5 containers. The zarr.n5.N5Store has been added, which uses zarr.storage.NestedDirectoryStore to support reading and writing from and to N5 containers. By Jan Funke and John Kirkham.
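For example (a sketch; the container path is illustrative):

    import zarr
    from zarr.n5 import N5Store

    # Read and write an N5 container through the Zarr API.
    store = N5Store("data/example.n5")
    z = zarr.zeros((1000, 1000), chunks=(100, 100), store=store)
    z[...] = 42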
The implementation of the zarr.storage.DirectoryStore class has been modified to ensure that writes are atomic and there are no race conditions where a chunk might appear transiently missing during a write operation. By sbalmer, #327, #263.
The required version of the Numcodecs package has been upgraded to 0.6.2, which has enabled some code simplification and fixes a failing test involving msgpack encoding. By John Kirkham, #361, #360, #352, #355, #324.
Ensure that the dict-based chunk store only contains bytes, to facilitate comparisons and protect against writes. Drop the copy for the no filter/compressor case, as this handles that case. By John Kirkham, #359.
Advanced indexing. The Array class has several new methods and properties that enable a selection of items in an array to be retrieved or updated. See the Advanced indexing tutorial section for more information. There is also a notebook with extended examples and performance benchmarks. #78, #89, #112, #172.
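For example:

    import numpy as np
    import zarr

    z = zarr.array(np.arange(100).reshape(10, 10))

    # Coordinate (point) selection: items at (0, 0) and (2, 2).
    print(z.get_coordinate_selection(([0, 2], [0, 2])))      # [ 0 22]
    # Orthogonal selection: rows 0 and 2, all columns.
    print(z.get_orthogonal_selection(([0, 2], slice(None))))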
New package for compressor and filter codecs. The classes previously defined in the zarr.codecs module have been factored out into a separate package called Numcodecs. The Numcodecs package also includes several new codec classes not previously available in Zarr, including compressor codecs for Zstd and LZ4. This change is backwards-compatible with existing code, as all codec classes defined by Numcodecs are imported into the zarr.codecs namespace. However, it is recommended to import codecs from the new package; see the tutorial sections on Compressors and Filters for examples. With contributions by John Kirkham; #74, #102, #120, #123, #139.
New storage class for DBM-style databases. The zarr.storage.DBMStore class enables any DBM-style database, such as gdbm, ndbm or Berkeley DB, to be used as the backing store for an array or group. See the tutorial section on Storage alternatives for some examples. #133, #186.
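For example (a sketch; the database path is illustrative):

    import zarr

    store = zarr.DBMStore("data/example.db", flag="c")
    z = zarr.zeros((100,), chunks=(10,), store=store)
    z[:] = 7
    store.close()  # DBM stores should be closed when done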
New storage class using a nested directory structure for chunk files. The zarr.storage.NestedDirectoryStore has been added, which is similar to the existing zarr.storage.DirectoryStore class but nests chunk files for multidimensional arrays into sub-directories. #155, #177.
New tree() method for printing hierarchies. The Group class has a new zarr.hierarchy.Group.tree() method which enables a tree representation of a group hierarchy to be printed. It also provides an interactive tree representation when used within a Jupyter notebook. See the Array and group diagnostics tutorial section for examples. By John Kirkham; #82, #140, #184.
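For example:

    import zarr

    root = zarr.group()
    root.create_group("foo").create_group("bar")
    root.zeros("baz", shape=(100,), chunks=(10,))

    print(root.tree())  # text tree; interactive within Jupyter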
Visitor API. The Group class now implements the h5py visitor API; see the docs for the zarr.hierarchy.Group.visitvalues() method. By John Kirkham, #92, #122.
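For example:

    import zarr

    root = zarr.group()
    root.create_group("foo").create_group("bar")

    # h5py-style visitor: the callable is invoked once per member.
    root.visitvalues(lambda obj: print(obj))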
Viewing an array as a different dtype. The Array class has a new zarr.core.Array.astype() method, which is a convenience that enables an array to be viewed as a different dtype. By John Kirkham, #94, #96.
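For example:

    import numpy as np
    import zarr

    z = zarr.array(np.arange(10, dtype="i8"))

    # View the same stored data as float32.
    v = z.astype("f4")
    print(v[:5])  # [0. 1. 2. 3. 4.]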
New open(), save(), load() convenience functions. The function zarr.convenience.open() provides a convenient way to open a persistent array or group, using either a DirectoryStore or ZipStore as the backing store. The functions zarr.convenience.save() and zarr.convenience.load() are also available and provide a convenient way to save an entire NumPy array to disk and load it back into memory later. See the tutorial section Persistent arrays for examples. #104, #105, #141, #181.
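For example (the path is illustrative):

    import numpy as np
    import zarr

    a = np.arange(10)
    zarr.save("data/example.zarr", a)   # persist a NumPy array to disk
    b = zarr.load("data/example.zarr")  # load it back into memory
    assert (a == b).all()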
IPython completions. The Group class now implements _ipython_key_completions_(), which enables tab-completion for group members to be used in any IPython interactive environment. #170.
New info property; changes to __repr__. The Group and Array classes have a new info property which can be used to print diagnostic information, including compression ratio where available. See the tutorial section on Array and group diagnostics for examples. The string representation (__repr__) of these classes has been simplified to ensure it is cheap and quick to compute in all circumstances. #83, #115, #132, #148.
Chunk options. When creating an array, chunks=False can be specified, which will result in an array with a single chunk only. Alternatively, chunks=True will trigger an automatic chunk shape guess. See Chunk optimizations for more on the chunks parameter. #106, #107, #183.
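For example:

    import zarr

    # chunks=True: let Zarr guess a reasonable chunk shape.
    z1 = zarr.zeros((10000, 10000), chunks=True, dtype="i4")
    print(z1.chunks)

    # chunks=False: the entire array is stored as a single chunk.
    z2 = zarr.zeros((100, 100), chunks=False, dtype="i4")
    print(z2.chunks)  # (100, 100)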
Improved support for object arrays. In previous versions of Zarr, creating an array with dtype=object was possible but could under certain circumstances lead to unexpected errors and/or segmentation faults. To make it easier to properly configure an object array, a new object_codec parameter has been added to array creation functions. See the tutorial section on Object arrays for more information and examples. Also, runtime checks have been added in both Zarr and Numcodecs so that segmentation faults are no longer possible, even with a badly configured array. This API change is backwards compatible, and previous code that created an object array and provided an object codec via the filters parameter will continue to work; however, a warning will be raised to encourage use of the object_codec parameter. #208, #212.
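For example:

    import numcodecs
    import zarr

    # An object array must be told how to encode its items;
    # numcodecs.JSON is one suitable object codec.
    z = zarr.empty(10, dtype=object, object_codec=numcodecs.JSON())
    z[0] = 42
    z[1] = "foo"
    z[2] = {"a": 1, "b": 2.2}
    print(z[2])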
New LRUStoreCache class. The class zarr.storage.LRUStoreCache has been added and provides a means to locally cache data in memory from a store that may be slow, e.g., a store that retrieves data from a remote server via the network; #223.
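For example (a sketch; a local DirectoryStore stands in for a slow remote store):

    import zarr

    store = zarr.DirectoryStore("data/example.zarr")
    cache = zarr.LRUStoreCache(store, max_size=2**28)  # ~256 MiB cache
    z = zarr.open(cache, mode="a", shape=(100,), chunks=(10,))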
New copy functions. The new functions zarr.convenience.copy() and zarr.convenience.copy_all() provide a way to copy groups and/or arrays between HDF5 and Zarr, or between two Zarr groups. The function zarr.convenience.copy_store() provides a more efficient way to copy data directly between two Zarr stores. #87, #113, #137, #217.
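For example (a sketch copying between two in-memory hierarchies):

    import zarr

    src = zarr.group()
    src.zeros("foo", shape=(100,), chunks=(10,))
    dest = zarr.group()

    # Copy every group/array from one hierarchy into another.
    zarr.copy_all(src, dest)

    # copy_store operates directly on the underlying key/value stores.
    zarr.copy_store(src.store, zarr.MemoryStore())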
Some changes have been made to the Zarr storage specification version 2 document to clarify ambiguities and add some missing information. These changes do not break compatibility with any of the material as previously implemented, and so the changes have been made in-place in the document without incrementing the document version number. See the section on Changes in the specification document for more information.
A new Advanced indexing section has been added to the tutorial.
The Chunk optimizations tutorial section has been reorganised and updated.
A new tutorial section on Datetimes and timedeltas has been added.
A new tutorial section on Array and group diagnostics has been added.
The tutorial sections on Parallel computing and synchronization and Configuring Blosc have been updated to provide information about how to avoid program hangs when using the Blosc compressor with multiple processes (#199, #201).
Bump numcodecs dependency to 0.5.3, completely remove nose dependency, #237.
Resolved an issue when no compression is used and chunks are stored in memory (#79).
Various minor improvements, including: Group objects support member access via dot notation (__getattr__); fixed metadata caching for the Array.shape property and derivatives; added the Array.ndim property; fixed Array.__array__ method arguments; fixed bugs in pickling.
Group objects now support member deletion via the del statement.
The Blosc extension has been modified to return bytes instead of array objects from compress and decompress function calls. This should improve compatibility and also provides a small performance increase for compressing high compression ratio data (#55).
Added a cache_metadata keyword argument to array creation methods.
The bundled Blosc library has been upgraded to version 1.11.1.
To accommodate support for hierarchies and filters, the Zarr metadata format has been modified. See the Zarr storage specification version 2 for more information. To migrate an array stored using Zarr version 1.x, use the zarr.storage.migrate_1to2() function.
The bundled Blosc library has been upgraded to version 1.11.0.
The bundled Blosc library has been upgraded to version 1.10.0. The ‘zstd’ internal compression library is now available within Blosc. See the tutorial section on Compressors for an example.
When using the Blosc compressor, the default internal compression library is now ‘lz4’.
The default number of internal threads for the Blosc compressor has been increased to a maximum of 8 (previously 4).
Added convenience functions
This release includes a complete re-organization of the code base. The major version number has been bumped to indicate that there have been backwards-incompatible changes to the API and the on-disk storage format. However, Zarr is still in an early stage of development, so please do not take the version number as an indicator of maturity.
The main motivation for re-organizing the code was to create an abstraction layer between the core array logic and data storage (#21). In this release, any object that implements the MutableMapping interface can be used as an array store. See the tutorial sections on Persistent arrays and Storage alternatives, the Zarr storage specification version 1, and the zarr.storage module documentation for more information.
Please note also that the file organization and file name conventions used when storing a Zarr array in a directory on the file system have changed. Persistent Zarr arrays created using previous versions of the software will not be compatible with this version. See the zarr.storage API docs and the Zarr storage specification version 1 for more information.
An abstraction layer has also been created between the core array logic and the code for compressing and decompressing array chunks. This release still bundles the c-blosc library and uses Blosc as the default compressor; however, other compressors including zlib, BZ2 and LZMA are also now supported via the Python standard library. New compressors can also be dynamically registered for use with Zarr. See the tutorial sections on Compressors and Configuring Blosc, the Zarr storage specification version 1, and the zarr.compressors module documentation for more information.
The synchronization code has also been refactored to create a layer of abstraction, enabling Zarr arrays to be used in parallel computations with a number of alternative synchronization methods. For more information see the tutorial section on Parallel computing and synchronization and the zarr.sync module documentation.
Changes to the Blosc extension
NumPy is no longer a build dependency for the zarr.blosc extension, so setup.py will run even if NumPy is not already installed, and should automatically install NumPy as a runtime dependency. Manual installation of NumPy prior to installing Zarr is still recommended, however, as the automatic installation of NumPy may fail or be sub-optimal on some platforms.
Some optimizations have been made within the zarr.blosc extension to avoid unnecessary memory copies, giving a ~10-20% performance improvement for multi-threaded compression operations.
The zarr.blosc extension now automatically detects whether it is running within a single-threaded or multi-threaded program and adapts its internal behaviour accordingly (#27). There is no need for the user to make any API calls to switch Blosc between contextual and non-contextual (global lock) mode. See also the tutorial section on Configuring Blosc.
The internal code for managing chunks has been rewritten to be more efficient. No state is now maintained for chunks outside of the array store, meaning that chunks do not carry any extra memory overhead not accounted for by the store. This negates the need for the “lazy” option present in the previous release, which has therefore been removed.
The memory layout within chunks can now be set as either “C” (row-major) or “F” (column-major), which can help to provide better compression for some data (#7). See the tutorial section on Chunk memory layout for more information.
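For example:

    import zarr

    # Column-major (Fortran) memory layout within chunks.
    z = zarr.zeros((1000, 1000), chunks=(100, 100), order="F")
    print(z.order)  # 'F'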
A bug has been fixed within the __getitem__ and __setitem__ machinery for slicing arrays, to properly handle getting and setting partial slices.