The Beauty of Zarr

PyCon DE & PyData Berlin 2023

In this talk, I’ll introduce [Zarr](https://zarr.dev/), an open-source data format for storing chunked, compressed N-dimensional arrays. The talk takes a systematic approach to understanding and using Zarr: how it works, why you might need it, and a hands-on session at the end. Zarr is based on an open [technical specification](https://zarr.readthedocs.io/en/stable/spec/v2.html), which makes implementations in several languages possible. I’ll focus on [Zarr’s Python](https://github.com/zarr-developers/zarr-python) implementation and show how beautifully it interoperates with the existing libraries in the PyData stack.

[Zarr](https://zarr.dev/) is a data format for storing chunked, compressed N-dimensional arrays. Zarr is based on an open [technical specification](https://zarr.readthedocs.io/en/stable/spec/v2.html) and has [implementations](https://github.com/zarr-developers/zarr_implementations) in several languages, with [Zarr-Python](https://github.com/zarr-developers/zarr-python) being the most used. Zarr is a [NumFOCUS sponsored project](https://numfocus.org/sponsored-projects).

### Outline:

First, I’ll talk about:

### What’s, Why’s, and How’s of Zarr (15 mins.)

- How does Zarr work?
  - The motivation behind Zarr and its functionality
- What’s the need for using Zarr?
  - When, where and why to use it
- Pluggable compressors and file storage
  - The compressors and file-storage systems available in Zarr
- Managing (selecting, resizing, writing, reading) chunked arrays using Zarr functions
  - Using built-in functions to manage compressed chunks
- How is Zarr different from other storage formats?
  - A brief look at the technical specification, which allows Zarr to have implementations in several languages
  - Pros and cons compared to other storage formats
- Zarr community
  - What is the Zarr community, and how do we do things?

Then I’ll run a hands-on session covering the following:

### Hands-on (10 mins.)

- Creating and using Zarr arrays
  - Using built-in functions to create Zarr arrays, and reading and writing data to them
- Looking under the hood
  - Using store functions to explain how your Zarr data is stored
- Consolidating metadata
  - Consolidating the metadata for an entire group into a single object
- Writing to and reading from cloud object storage
  - Using S3/GCS/Azure to create Zarr arrays and write data to them
- Showing how Zarr interoperates with the PyData stack
  - How Zarr interoperates with the PyData stack (NumPy, Dask and Xarray), and how you can write data to your Zarr chunks in parallel at very high speed using Dask

I’ll close the talk with:

### Conclusion (5 mins.)

- Key takeaways
- How can you contribute to Zarr?
- Q&A

This talk is aimed at people who work with large amounts of data and are looking for a data format that is transparent, easy to use and friendly to the environment. Zarr is also used in the bioimaging, geospatial and research communities. So, if you’re from a community or an organisation dealing with high-volume data, Zarr may be your one-stop solution. Anyone who is curious about Zarr and wants to learn how to use it is also most welcome.

The tone of the talk will be informative, with a hands-on session at the end. I’m also happy to adjust the style to suit the audience in the room. Attendees need intermediate knowledge of Python and NumPy arrays.

### After this talk, you will:

- Know the basic use cases for Zarr and how to use it
- Understand the basics of data storage in Zarr
- Understand the basics of compressors and file-storage systems in Zarr
- Be able to make a better-informed decision about which data format to use for your data

Speakers: Sanket Verma