6 KiB
Design
Why do this?
We want to bring the tidyness of modeling data with pydantic to the universe of software that uses arrays - particularly formats and packages that need to be very particular about what kind of arrays they are able to handle or match a specific schema.
To support a new generation of data formats and data analysis libraries that can model the structure of data independently from its implementation, we made numpydantic as a bridge between abstract schemas and programmatic use.
The closest prior work is likely jaxtyping
,
but its support for multiple array libraries was backed into from its initial
design as a jax
specification package, and so its extensibility and readability is
relatively low. Its Dtype[ArrayClass, "{shape_expression}"]
syntax is not well
suited for modeling arrays intended to be general across implementations, and
makes it challenging to adapt to pydantic's schema generation system.
(design_challenges)=
Challenges
The Python type annotation system is weird and not like the rest of Python! (at least until PEP 0649 gets mainlined). Similarly, Pydantic 2's core_schema system is wonderful but still has a few mysteries lurking under the documented surface. This package does the work of plugging them in together to make some kind of type validation frankenstein.
The first problem is that type annotations are evaluated statically by python, mypy,
etc. This means you can't use typical python syntax for declaring types - it has to
be present at the time __new__
is called, rather than __init__
. So
Different implementations of arrays behave differently! HDF5 files need to be carefully opened and closed to avoid corruption, video files don't typically allow normal array slicing operations, and only some array libraries support lazy loading of arrays on disk.
We can't anticipate all the possible array libraries that exist now or in the future, so it has to be possible to extend support to them without needing to go through a potentially lengthy contribution process.
Strategy
Numpydantic uses {class}~numpydantic.NDArray
as an abstract specification of
an array that uses one of several interface classes to validate
and interact with an array. These interface classes will set the instance attribute
either as the passed array itself, or a transparent proxy class (eg.
{class}~numpydantic.interface.hdf5.H5Proxy
) in the case that the native array format
doesn't support numpy-like array operations out of the box.
The interface
validation process thus often transforms the type of the passed array -
eg. when specifying an array in an HDF5 file, one will pass some reference to
a Path
and the location of a dataset within that file, but the returned value from the
interface validator will be an {class}~numpydantic.interface.hdf5.H5Proxy
to the dataset. This confuses python's static type checker and IDE integrations like
pylance/pyright/mypy, which naively expect the type to literally be an
{class}~numpydantic.NDArray
instance. To address this, numpydantic generates a .pyi
stub file on import (see {mod}numpydantic.meta
) that declares the type of NDArray
as the union of all {attr}.Interface.return_types
.
To better support static type hinting and inspection (ie. so the type checker
is not only aware of the union of all `return_types`, but the specific array
type that was passed on model instantiation, as well as potentially
do shape and dtype checks during type checking (eg. so a wrongly shaped or dtyped
array assignment will be highlighted as wrong), we will be exploring adding
mypy/pylance/pyright hooks for dynamic type evaluation.
Since type annotations are static, each NDArray[]
usage effectively creates a new
class. The shape
and dtype
specifications are thus not available at the time
that the validation is performed (see how pydantic handles Annotated types
at the time that the class definition is evaluated by generating pydantic "core schemas",
which are passed to the rust pydantic_core
for fast validation, which can't be
done with python-based validation functions). The validation function for each
NDArray
pseudo-subclass is a {func}closure <numpydantic.schema.get_validate_interface>
that uses the class declaration-timed shape
and dtype
annotations with the
instantiation-timed array object to find the matching validator interface and apply it.
We are initially adopting nptyping
's syntax for array specification. It is a longstanding
answer to the desire for more granular array type annotations, but it also was
developed before some key developments in python and its typing system, and is
no longer actively maintained. We make some minor modifications to its
{mod}~numpydantic.dtype
specification (eg. to allow builtin python types like int
and float
), but any existing nptyping
annotations can be used as-is with
numpydantic
. In v2.* we will be reimplementing it, as well as
making an extended syntax for shape and dtype specifications, so that the
only required dependencies are {mod}numpy
and {mod}pydantic
. This will also
let us better hook into pydantic 2's use of Annotated
, eliminating some
of the complexity in how specification information is passed to the validators.
Numpydantic is not an array library, but a tool that allows you to use existing array libraries with pydantic. It tries to be a transparent passthrough to whatever library you are using, adding only minimal convenience classes to make array usage roughly uniform across array libraries, but otherwise exposing as much of the functionality of the library as possible.
It is designed to be something that you don't have to think too carefully about before adding it as a dependency - it is simple, clean, unsurprising, well tested, and has three required dependencies.