Design

Why do this?

We want to bring the tidiness of modeling data with pydantic to the universe of software that uses arrays - particularly formats and packages that need to be strict about what kinds of arrays they can handle, or that must match a specific schema.

To support a new generation of data formats and data analysis libraries that can model the structure of data independently from its implementation, we made numpydantic as a bridge between abstract schemas and programmatic use.

The closest prior work is likely jaxtyping, but its support for multiple array libraries was added on top of its initial design as a jax specification package, and so its extensibility and readability are relatively low. Its `Dtype[ArrayClass, "{shape_expression}"]` syntax is not well suited for modeling arrays intended to be general across implementations, and makes it challenging to adapt to pydantic's schema generation system.

(design_challenges)=

Challenges

The Python type annotation system is weird and not like the rest of Python (at least until PEP 649 gets mainlined)! Similarly, Pydantic 2's `core_schema` system is wonderful but still has a few mysteries lurking under the documented surface. This package does the work of plugging them together to make some kind of type validation frankenstein.

The first problem is that type annotations are evaluated statically by python, mypy, etc. This means you can't use typical python syntax for declaring types - the type information has to be present at the time `__new__` is called, rather than `__init__`.
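As a minimal illustration (not numpydantic's actual code), the sketch below shows that a subscripted annotation is evaluated without any model instance existing. `Spec`, `Model`, and `evaluated` are hypothetical names made up for this example. On Python versions without deferred annotations the subscription runs while the class body executes; with PEP 649 it runs when the annotations are first inspected.

```python
evaluated = []  # records every subscription of Spec, in order


class Spec:
    """Hypothetical stand-in for an array specification type."""

    def __class_getitem__(cls, params):
        # Runs when the annotation expression is evaluated. No instance
        # exists here, so nothing set in __init__ is available.
        evaluated.append(params)
        return cls


class Model:
    array: Spec["3 x, 4 y", int]


# Inspecting the annotations forces evaluation on interpreters that defer it
# (PEP 649); on older interpreters it already happened in the class body.
hints = Model.__annotations__

# The specification was seen even though Model() was never called.
assert evaluated[0] == ("3 x, 4 y", int)
assert hints["array"] is Spec
```

Whatever the specification needs to carry has to be captured at this point, because nothing about the eventual array is known yet.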

Different implementations of arrays behave differently! HDF5 files need to be carefully opened and closed to avoid corruption, video files don't typically allow normal array slicing operations, and only some array libraries support lazy loading of arrays on disk.

We can't anticipate all the possible array libraries that exist now or in the future, so it has to be possible to extend support to them without needing to go through a potentially lengthy contribution process.

Strategy

Numpydantic uses {class}`~numpydantic.NDArray` as an abstract specification of an array that uses one of several interface classes to validate and interact with an array. These interface classes will set the instance attribute either as the passed array itself, or as a transparent proxy class (eg. {class}`~numpydantic.interface.hdf5.H5Proxy`) in the case that the native array format doesn't support numpy-like array operations out of the box.
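The following is a rough sketch of that dispatch idea using only the standard library. Every name in it (`Interface`, `ListInterface`, `FileProxy`, `FileInterface`, `match_interface`, `validate`) is hypothetical and chosen for illustration; the real interface classes have their own API. It assumes interfaces are discovered by subclassing, which is one way new array libraries could be supported without changing the core package.

```python
from typing import Any


class Interface:
    """Hypothetical base class: one subclass per supported array backend."""

    @classmethod
    def check(cls, array: Any) -> bool:
        """Return True if this interface can handle the given input."""
        raise NotImplementedError

    @classmethod
    def validate(cls, array: Any) -> Any:
        """Return the object that will be stored on the model."""
        raise NotImplementedError


class ListInterface(Interface):
    """Inputs that already behave like arrays are passed through unchanged."""

    @classmethod
    def check(cls, array: Any) -> bool:
        return isinstance(array, list)

    @classmethod
    def validate(cls, array: Any) -> Any:
        return array


class FileProxy:
    """Hypothetical proxy: holds a path and reads the data on demand."""

    def __init__(self, path: str):
        self.path = path

    def __getitem__(self, index):
        # Open the file only for the duration of the read, then close it.
        with open(self.path) as f:
            values = [float(line) for line in f]
        return values[index]


class FileInterface(Interface):
    """References to data on disk are wrapped in a proxy."""

    @classmethod
    def check(cls, array: Any) -> bool:
        return isinstance(array, str) and array.endswith(".txt")

    @classmethod
    def validate(cls, array: Any) -> Any:
        return FileProxy(array)


def match_interface(array: Any) -> type:
    """Find the single interface subclass whose check() accepts the input."""
    matches = [i for i in Interface.__subclasses__() if i.check(array)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one interface, got {len(matches)}")
    return matches[0]


def validate(array: Any) -> Any:
    """Dispatch to the matching interface and return what it produces."""
    return match_interface(array).validate(array)
```

Here `validate([1, 2, 3])` returns the list itself, while `validate("data.txt")` returns a `FileProxy` that supports indexing and slicing like the in-memory case.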

The interface validation process thus often transforms the type of the passed array - eg. when specifying an array in an HDF5 file, one will pass some reference to a `Path` and the location of a dataset within that file, but the returned value from the interface validator will be an {class}`~numpydantic.interface.hdf5.H5Proxy` to the dataset. This confuses static type checkers and IDE integrations like pylance, pyright, and mypy, which naively expect the type to literally be an {class}`~numpydantic.NDArray` instance. To address this, numpydantic generates a `.pyi` stub file on import (see {mod}`numpydantic.meta`) that declares the type of `NDArray` as the union of all {attr}`.Interface.return_types`.
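To make the stub idea concrete, here is a small sketch of how the text of such a declaration could be assembled from the `return_types` of each interface. This is an assumption-laden illustration, not the code in `numpydantic.meta`: the `Interface` subclasses and the `render_stub` helper are hypothetical, and only builtin types are used so that no import lines need to be generated.

```python
from typing import Tuple


class Interface:
    """Hypothetical base class; each subclass lists the types it may return."""

    return_types: Tuple[type, ...] = ()


class ListInterface(Interface):
    return_types = (list,)


class BytesInterface(Interface):
    return_types = (bytes, bytearray)


def render_stub() -> str:
    """Build stub text that declares NDArray as a union of all return types."""
    names = []
    for interface in Interface.__subclasses__():
        for return_type in interface.return_types:
            # Keep first-seen order and skip duplicates across interfaces.
            if return_type.__name__ not in names:
                names.append(return_type.__name__)
    # A real stub would also need imports for non-builtin array types.
    return "from typing import Union\n\nNDArray = Union[" + ", ".join(names) + "]\n"


print(render_stub())
```

Writing the returned text to a `.pyi` file next to the module is what lets a type checker see the union instead of the runtime class.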

To better support static type hinting and inspection, we will be exploring adding mypy/pylance/pyright hooks for dynamic type evaluation. The aim is for the type checker to be aware not only of the union of all `return_types`, but also of the specific array type that was passed on model instantiation, and potentially to do shape and dtype checks during type checking (eg. so a wrongly shaped or dtyped array assignment will be highlighted as wrong).

Since type annotations are static, each `NDArray[]` usage effectively creates a new class. The shape and dtype specifications are thus not available at the time that the validation is performed. Pydantic handles `Annotated` types at the time that the class definition is evaluated by generating pydantic "core schemas", which are passed to the rust `pydantic_core` for fast validation, and that can't be done with python-based validation functions. The validation function for each `NDArray` pseudo-subclass is instead a {func}`closure <numpydantic.schema.get_validate_interface>` that combines the shape and dtype annotations from class declaration time with the array object from instantiation time to find the matching validator interface and apply it.
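The closure pattern can be sketched in a few lines. This is a simplified stand-in, not the actual `get_validate_interface` function: `get_validator` and `Spec` are hypothetical names, arrays are nested lists, and only two-dimensional shapes are handled to keep the example short.

```python
from typing import Any, Callable, Tuple


def get_validator(shape: Tuple[int, int], dtype: type) -> Callable[[Any], Any]:
    """Return a validator that remembers the declared shape and dtype."""

    def validate(array: Any) -> Any:
        # `shape` and `dtype` were captured at declaration time;
        # `array` is only known when a model is instantiated.
        if len(array) != shape[0] or any(len(row) != shape[1] for row in array):
            raise ValueError(f"expected shape {shape}")
        if not all(isinstance(v, dtype) for row in array for v in row):
            raise ValueError(f"expected dtype {dtype.__name__}")
        return array

    return validate


class Spec:
    """Hypothetical stand-in for an array specification class."""

    validator: Callable[[Any], Any]

    def __class_getitem__(cls, params):
        shape, dtype = params
        # Each subscription builds a new pseudo-subclass with its own closure.
        return type(
            f"Spec[{shape}, {dtype.__name__}]",
            (cls,),
            {"validator": staticmethod(get_validator(shape, dtype))},
        )


Matrix = Spec[(2, 3), int]
print(Matrix.validator([[1, 2, 3], [4, 5, 6]]))
```

Note that two identical subscriptions such as `Spec[(2, 3), int]` produce two distinct classes in this sketch, which mirrors the statement that each usage effectively creates a new class.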

We are initially adopting nptyping's syntax for array specification. It is a longstanding answer to the desire for more granular array type annotations, but it was also developed before some key developments in python and its typing system, and is no longer actively maintained. We make some minor modifications to its {mod}`~numpydantic.dtype` specification (eg. to allow builtin python types like `int` and `float`), but any existing nptyping annotations can be used as-is with numpydantic. In v2.* we will be reimplementing it, as well as making an extended syntax for shape and dtype specifications, so that the only required dependencies are {mod}`numpy` and {mod}`pydantic`. This will also let us better hook into pydantic 2's use of `Annotated`, eliminating some of the complexity in how specification information is passed to the validators.
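To give a feel for what a shape expression in this style says, the sketch below checks a concrete shape against a string such as `"3 x, 4 y, * z"`. It is a deliberately reduced illustration and not nptyping's parser: the `matches_shape` helper is hypothetical and understands only fixed sizes and `*` wildcards, each with an optional label.

```python
from typing import Tuple


def matches_shape(expression: str, shape: Tuple[int, ...]) -> bool:
    """Check a concrete shape against a simplified shape expression.

    Only fixed sizes ("3") and wildcards ("*"), each with an optional
    label ("3 x"), are handled. The full syntax has more features.
    """
    # Drop the labels: "3 x, 4 y, * z" becomes ["3", "4", "*"].
    dimensions = [part.split()[0] for part in expression.split(",")]
    if len(dimensions) != len(shape):
        return False
    return all(
        expected == "*" or int(expected) == actual
        for expected, actual in zip(dimensions, shape)
    )


print(matches_shape("3 x, 4 y, * z", (3, 4, 10)))
print(matches_shape("3 x, 4 y, * z", (3, 5, 10)))
```

The first call accepts the shape because the wildcard allows any size for the last axis; the second rejects it because the second axis must be exactly 4.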

Numpydantic is not an array library, but a tool that allows you to use existing array libraries with pydantic. It tries to be a transparent passthrough to whatever library you are using, adding only minimal convenience classes to make array usage roughly uniform across array libraries, but otherwise exposing as much of the functionality of the library as possible.

It is designed to be something that you don't have to think too carefully about before adding it as a dependency - it is simple, clean, unsurprising, well tested, and has three required dependencies.