first draft of docs for 1.0

sneakers-the-rat 2024-05-23 00:27:00 -07:00
parent 0937fd7c0d
commit 2803c752b9
Signed by untrusted user who does not match committer: jonny
GPG key ID: 6DCB96EF1E4D232D
11 changed files with 271 additions and 24 deletions


@ -4,4 +4,5 @@
.. automodule:: numpydantic.dtype .. automodule:: numpydantic.dtype
:members: :members:
:undoc-members: :undoc-members:
:imported-members:
``` ```

6
docs/api/meta.md Normal file

@ -0,0 +1,6 @@
# meta
```{eval-rst}
.. automodule:: numpydantic.meta
:members:
```


@ -10,6 +10,13 @@ To support a new generation of data formats and data analysis libraries that can
model the *structure* of data independently from its *implementation,* we made model the *structure* of data independently from its *implementation,* we made
numpydantic as a bridge between abstract schemas and programmatic use. numpydantic as a bridge between abstract schemas and programmatic use.
The closest prior work is likely [`jaxtyping`](https://github.com/patrick-kidger/jaxtyping),
but its support for multiple array libraries was retrofitted onto its initial
design as a `jax` specification package, and so its extensibility and readability are
relatively low. Its `Dtype[ArrayClass, "{shape_expression}"]` syntax is not well
suited for modeling arrays intended to be general across implementations, and
makes it challenging to adapt to pydantic's schema generation system.
## Challenges ## Challenges
The Python type annotation system is weird and not like the rest of Python! The Python type annotation system is weird and not like the rest of Python!
@ -40,7 +47,53 @@ either as the passed array itself, or a transparent proxy class (eg.
{class}`~numpydantic.interface.hdf5.H5Proxy`) in the case that the native array format {class}`~numpydantic.interface.hdf5.H5Proxy`) in the case that the native array format
doesn't support numpy-like array operations out of the box. doesn't support numpy-like array operations out of the box.
- type hinting The `interface` validation process thus often transforms the type of the passed array -
- nptyping syntax eg. when specifying an array in an HDF5 file, one will pass some reference to
- not trying to be an array library a `Path` and the location of a dataset within that file, but the returned value from the
- dtyping, mapping & schematization interface validator will be an {class}`~numpydantic.interface.hdf5.H5Proxy`
to the dataset. This confuses python's static type checkers and IDE integrations like
pylance/pyright/mypy, which naively expect the type to literally be an
{class}`~numpydantic.NDArray` instance. To address this, numpydantic generates a `.pyi`
stub file on import (see {mod}`numpydantic.meta`) that declares the type of `NDArray`
as the union of all {attr}`.Interface.return_types`.
```{todo}
To better support static type hinting and inspection (ie. so the type checker
is aware not only of the union of all `return_types`, but also of the specific array
type that was passed on model instantiation), and potentially to
do shape and dtype checks during type checking (eg. so a wrongly shaped or dtyped
array assignment will be highlighted as wrong), we will be exploring adding
mypy/pylance/pyright hooks for dynamic type evaluation.
```
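The stub generation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `render_stub` is a hypothetical helper, the exact contents of the generated `.pyi` file are assumed rather than taken from {mod}`numpydantic.meta`, and `list` and `array.array` merely stand in for the return types of real interfaces.

```python
from array import array
from typing import Sequence


def render_stub(return_types: Sequence[type]) -> str:
    """Render the text of a stub declaring NDArray as a union of types.

    Hypothetical helper: the real stub generator may differ.
    """
    # Builtin types need no import line in the stub
    imports = sorted(
        {
            f"from {t.__module__} import {t.__name__}"
            for t in return_types
            if t.__module__ != "builtins"
        }
    )
    names = ", ".join(t.__name__ for t in return_types)
    lines = [
        "from typing import Union",
        *imports,
        "",
        f"NDArray = Union[{names}]",
        "",
    ]
    return "\n".join(lines)


# Stand-ins for the return types declared by each interface
stub = render_stub([list, array])
print(stub)
```

A type checker reading such a stub would accept any of the listed types wherever `NDArray` is annotated, which is why it cannot yet tell which specific array type was passed.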
Since type annotations are static, each `NDArray[]` usage effectively creates a new
class. The `shape` and `dtype` specifications are thus not otherwise available at the time
that the validation is performed: pydantic generates its "core schemas" when the class
definition is evaluated (see how [pydantic handles Annotated types](https://github.com/pydantic/pydantic/blob/87adc65888ce54ef4314ef874f7ecba52f129f84/pydantic/_internal/_generate_schema.py#L1788))
and passes them to the rust `pydantic_core` for fast validation, which can't be
done with python-based validation functions. The validation function for each
`NDArray` pseudo-subclass is therefore a {func}`closure <numpydantic.schema.get_validate_interface>`
that combines the *class declaration*-time `shape` and `dtype` annotations with the
*instantiation*-time array object to find the matching validator interface and apply it.
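As a rough illustration of the closure described above, the following sketch captures a shape and dtype when an annotation is declared and applies them once a value arrives. The names `make_validator` and `FakeArray` and the simplified matching rules are hypothetical; this is not the implementation in {func}`numpydantic.schema.get_validate_interface`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class FakeArray:
    """Hypothetical stand-in for an array: it only carries a shape and a dtype."""

    shape: Tuple[int, ...]
    dtype: type


def make_validator(
    shape: Tuple[Any, ...], dtype: type
) -> Callable[[FakeArray], FakeArray]:
    """Return a validator that closes over declaration-time shape and dtype."""

    def validate(value: FakeArray) -> FakeArray:
        # `shape` and `dtype` were fixed when the annotation was declared;
        # `value` only exists once a model is instantiated.
        if len(value.shape) != len(shape):
            raise ValueError(
                f"expected {len(shape)} dimensions, got {len(value.shape)}"
            )
        for expected, actual in zip(shape, value.shape):
            # "*" is treated as a wildcard that matches any size
            if expected != "*" and expected != actual:
                raise ValueError(f"expected shape {shape}, got {value.shape}")
        if value.dtype is not dtype:
            raise ValueError(
                f"expected dtype {dtype.__name__}, got {value.dtype.__name__}"
            )
        return value

    return validate


# Declaration time: one validator is created per annotation
validate_3_by_any_int = make_validator((3, "*"), int)

# Instantiation time: the validator receives the actual array
validated = validate_3_by_any_int(FakeArray(shape=(3, 7), dtype=int))
```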
We are initially adopting `nptyping`'s syntax for array specification. It is a longstanding
answer to the desire for more granular array type annotations, but it also was
developed before some key developments in python and its typing system, and is
no longer actively maintained. We make some minor modifications to its
{mod}`~numpydantic.dtype` specification (eg. to allow builtin python types like `int`
and `float`), but any existing `nptyping` annotations can be used as-is with
`numpydantic`. In [v2.*](todo.md#v2) we will be reimplementing it, as well as
making an extended syntax for shape and dtype specifications, so that the
only required dependencies are {mod}`numpy` and {mod}`pydantic`. This will also
let us better hook into pydantic 2's use of `Annotated`, eliminating some
of the complexity in how specification information is passed to the validators.
Numpydantic is *not* an array library, but a tool that allows you to use existing
array libraries with pydantic. It tries to be a transparent passthrough to
whatever library you are using, adding only minimal convenience classes to
make array usage roughly uniform across array libraries, but otherwise exposing
as much of the functionality of the library as possible.
It is designed to be something that you don't have
to think too carefully about before adding it as a dependency - it is simple,
clean, unsurprising, well tested, and has three required dependencies.


@ -80,6 +80,7 @@ Coming soon:
constraints like chunk sizes, as well as make array specifications more introspectable and friendly to runtime usage. constraints like chunk sizes, as well as make array specifications more introspectable and friendly to runtime usage.
- **Advanced dtype handling** - handling dtypes that only exist in some array backends, allowing - **Advanced dtype handling** - handling dtypes that only exist in some array backends, allowing
minimum and maximum precision ranges, and so on as type maps provided by interface classes :) minimum and maximum precision ranges, and so on as type maps provided by interface classes :)
- **More Elaborate Arrays** - structured dtypes, recarrays, xarray-style labeled arrays...
- (see [todo](./todo.md)) - (see [todo](./todo.md))
## Installation ## Installation
@ -452,6 +453,7 @@ dumped = instance.model_dump_json(context={'zarr_dump_array': True})
:hidden: true :hidden: true
design design
syntax
interfaces interfaces
todo todo
``` ```
@ -466,9 +468,13 @@ api/interface/index
api/dtype api/dtype
api/ndarray api/ndarray
api/maps api/maps
api/meta
api/monkeypatch api/monkeypatch
api/schema api/schema
api/types api/types
``` ```
## See Also
- [`jaxtyping`](https://docs.kidger.site/jaxtyping/)


@ -1,5 +1,66 @@
# Interfaces # Interfaces
Interfaces are the bridge between the abstract {class}`~numpydantic.NDArray` specification
and concrete array libraries. They are subclasses of the abstract {class}`.Interface`
class.
They contain methods for coercion, validation, serialization, and any other
implementation-specific functionality.
## Discovery
Interfaces are discovered through the {meth}`.Interface.interfaces` method -
returning all subclasses of `Interface`. To use a custom interface, it just
needs to be defined/imported by the time you intend to use it when instantiating
a pydantic model.
Each interface implements a {meth}`.Interface.enabled` method that determines
whether that interface can be used. Typically that means checking if its dependencies
are present in the environment, but can also control conditional use.
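A minimal sketch of this kind of subclass-based discovery, assuming a simplified stand-in `Interface` class. The class bodies and the `usable` filter below are hypothetical and only mirror the behavior described above, not numpydantic's actual code.

```python
from typing import List, Type


class Interface:
    """Hypothetical minimal base class, not numpydantic's real Interface."""

    @classmethod
    def enabled(cls) -> bool:
        """Whether this interface can be used in the current environment."""
        return True

    @classmethod
    def interfaces(cls) -> List[Type["Interface"]]:
        """Collect every subclass, including indirect ones."""
        found: List[Type["Interface"]] = []
        for sub in cls.__subclasses__():
            found.append(sub)
            found.extend(sub.interfaces())
        return found


class ListInterface(Interface):
    """Defining (or importing) a subclass is all that is needed to register it."""


class MissingDependencyInterface(Interface):
    @classmethod
    def enabled(cls) -> bool:
        # Stands in for a check that an optional dependency is importable
        return False


discovered = [i.__name__ for i in Interface.interfaces()]
usable = [i.__name__ for i in Interface.interfaces() if i.enabled()]
```

Because discovery relies on subclassing, no explicit registration call is needed.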
## Matching
When a pydantic model is instantiated and an `NDArray` is to be validated,
{meth}`.Interface.match` first finds the matching interface.
Each interface must define a {meth}`.Interface.check` method that accepts the
array to be validated and returns whether it can be used. Interfaces can
have any `check`ing logic they want, and so can eg. determine if a path
is a particular type of file, but should return quickly and do little work
since they are called frequently.
Validation fails if an argument doesn't match any interface.
```{note}
The {class}`.NumpyInterface` is special cased and is only checked if
no other interface matches. It attempts to cast the input argument to a
{class}`numpy.ndarray` to see if it is arraylike, and since many
lazy-loaded array libraries will attempt to load the whole array into memory
when cast to an `ndarray`, we only try as a last resort.
```
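The matching logic, including the last-resort fallback, can be sketched as below. All of the classes and the `match` function here are hypothetical stand-ins, and returning the first interface whose `check` passes is a simplifying assumption rather than a statement about how {meth}`.Interface.match` resolves its candidates.

```python
from typing import Any, Sequence, Type


class Interface:
    """Hypothetical minimal base class."""

    @classmethod
    def check(cls, array: Any) -> bool:
        raise NotImplementedError


class H5PathInterface(Interface):
    """Claims strings that look like paths to HDF5 files."""

    @classmethod
    def check(cls, array: Any) -> bool:
        # Cheap check only: the file is not opened here
        return isinstance(array, str) and array.endswith(".h5")


class FallbackInterface(Interface):
    """Stands in for the special-cased numpy interface, which is tried last."""

    @classmethod
    def check(cls, array: Any) -> bool:
        return isinstance(array, (list, tuple))


def match(
    array: Any,
    interfaces: Sequence[Type[Interface]],
    fallback: Type[Interface],
) -> Type[Interface]:
    """Return the first interface that accepts the input, else the fallback."""
    for interface in interfaces:
        if interface.check(array):
            return interface
    # Only reached when nothing else matched
    if fallback.check(array):
        return fallback
    raise ValueError(f"No matching interface found for {type(array).__name__}")


matched_path = match("data.h5", [H5PathInterface], FallbackInterface)
matched_list = match([1, 2, 3], [H5PathInterface], FallbackInterface)
```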
## Validation
Validation is a chain of lifecycle methods, with a single argument passed to and
returned from each:
{meth}`.Interface.validate` calls, in order:
- {meth}`.Interface.before_validation`
- {meth}`.Interface.validate_dtype`
- {meth}`.Interface.validate_shape`
- {meth}`.Interface.after_validation`
The `before` and `after` methods provide hooks for coercion, loading, etc. such that
`validate` can accept one of the types in the interface's
{attr}`~.Interface.input_types` and return the {attr}`~.Interface.return_type`.
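The chain can be sketched as follows, assuming a simplified interface that validates flat sequences. The method names follow the list above, but the constructor arguments, the `calls` log, and `TupleInterface` are hypothetical and exist only for illustration.

```python
from typing import Any, List, Tuple


class Interface:
    """Hypothetical base class that only mirrors the call order described above."""

    def __init__(self, length: int, dtype: type) -> None:
        self.length = length
        self.dtype = dtype
        self.calls: List[str] = []  # records the order of lifecycle calls

    def validate(self, array: Any) -> Any:
        # Each step receives the value returned by the previous step
        array = self.before_validation(array)
        array = self.validate_dtype(array)
        array = self.validate_shape(array)
        array = self.after_validation(array)
        return array

    def before_validation(self, array: Any) -> Any:
        self.calls.append("before_validation")
        return array

    def validate_dtype(self, array: Any) -> Any:
        self.calls.append("validate_dtype")
        if not all(isinstance(item, self.dtype) for item in array):
            raise ValueError(f"all items must be {self.dtype.__name__}")
        return array

    def validate_shape(self, array: Any) -> Any:
        self.calls.append("validate_shape")
        if len(array) != self.length:
            raise ValueError(f"expected length {self.length}, got {len(array)}")
        return array

    def after_validation(self, array: Any) -> Any:
        self.calls.append("after_validation")
        return array


class TupleInterface(Interface):
    """Accepts any iterable as input and returns a tuple."""

    def before_validation(self, array: Any) -> Tuple[Any, ...]:
        super().before_validation(array)
        # Coerce the input type to the return type before checking it
        return tuple(array)


interface = TupleInterface(length=3, dtype=int)
result = interface.validate([1, 2, 3])
```

Here the coercion happens in `before_validation`, so the dtype and shape checks only ever see the return type.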
## Diagram
```{todo}
Sorry this is unreadable, need to recall how to change the theme for
generated mermaid diagrams but it is very late and i want to push this.
```
```{mermaid} ```{mermaid}
flowchart LR flowchart LR

118
docs/syntax.md Normal file

@ -0,0 +1,118 @@
# Syntax
General form:
```python
field: NDArray[Shape["{shape_expression}"], dtype]
```
## Dtype
Dtype checking is for the most part as simple as an `isinstance` check -
the `dtype` attribute of the array is checked against the `dtype` provided in the
`NDArray` annotation. Both numpy and builtin python types can be used.
A tuple of types can also be passed:
```python
field: NDArray[Shape["2, 3"], (np.int8, np.uint8)]
```
Like `nptyping`, the {mod}`~numpydantic.dtype` module provides convenient access
and aliases to the common dtypes, but also provides "generic" dtypes like
{class}`~numpydantic.dtype.Float`, which is a tuple of all subclasses of
{class}`numpy.floating`. Numpy interprets `float` as being equivalent to
{class}`numpy.float64`, and {class}`numpy.floating` is an abstract parent class,
so "generic" tuple dtypes fill that narrow gap.
```{todo}
Future versions will support interfaces providing type maps for declaring
equality between dtypes that may be specific to that library but should be
considered equivalent to numpy or other library's dtypes.
```
```{todo}
Future versions will also support declaring minimum or maximum precisions,
so one might say "at least a 16-bit float" and also accept a 32-bit float.
```
## Shape
Full documentation of nptyping's shape syntax is available in the [nptyping docs](https://github.com/ramonhagenaars/nptyping/blob/master/USERDOCS.md#Shape-expressions),
but for the sake of self-contained docs, the high points are:
### Numerical Shape
A comma-separated list of integers.
For a 2-dimensional, 3 x 4-shaped array:
```python
Shape["3, 4"]
```
### Wildcards
Wildcards indicate that a dimension can be any size.
For a 2-dimensional, 3 x any-shaped array:
```python
Shape["3, *"]
```
### Labels
Dimensions can be given labels, and in future versions these labels will be
propagated to the generated JSON Schema.
```python
Shape["3 x, 4 y, 5 z"]
```
### Arbitrary dimensions
After some specified dimensions, one can express that there can be any number
of additional dimensions with `...`, like:
```python
Shape["3, 4, ..."]
```
### Any-Shaped
If `dtype` is also `Any`, one can just use
```python
field: NDArray
```
If a `dtype` is being passed, use the `'*'` wildcard along with the `'...'`:
```python
field: NDArray[Shape['*, ...'], int]
```
## Caveats
```{todo}
numpydantic currently does not support structured dtypes or {class}`numpy.recarray`
specifications like nptyping does. It will in future versions.
```
````{todo}
numpydantic also does not support the variable shape definition form like
```python
Shape['Dim, Dim']
```
where there are two dimensions of any size as long as they are equal,
because at the moment it appears impossible to express dynamic constraints
(ie. `minItems`/`maxItems` that depend on the shape of another array)
in JSON Schema. A future minor version will allow them by generating a JSON
schema with a warning that the equal shape constraint will not be represented.
See: https://github.com/orgs/json-schema-org/discussions/730
````
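To make the JSON Schema limitation above concrete, here is a rough sketch of how a shape made of integers and wildcards could map onto nested `array` schemas. The function `list_of_lists_json_schema` is hypothetical and is not numpydantic's schema generator. It ignores `...` entirely and only shows that each dimension's `minItems`/`maxItems` must be a constant, which is why a shape variable like `Dim` has nothing to map to.

```python
from typing import Any, Dict


def list_of_lists_json_schema(
    shape_expression: str, item_type: str = "number"
) -> Dict[str, Any]:
    """Build a nested JSON Schema ``array`` from integer and ``*`` dimensions."""
    schema: Dict[str, Any] = {"type": item_type}
    dims = [dim.strip() for dim in shape_expression.split(",")]
    # Build from the innermost dimension outwards
    for dim in reversed(dims):
        size = dim.split()[0]  # drop an optional label, eg. "3 x" -> "3"
        array_schema: Dict[str, Any] = {"type": "array", "items": schema}
        if size != "*":
            try:
                length = int(size)
            except ValueError as e:
                raise ValueError(
                    f"Cannot express dimension {dim!r} as a constant "
                    "minItems/maxItems"
                ) from e
            # A fixed size becomes equal lower and upper bounds
            array_schema["minItems"] = length
            array_schema["maxItems"] = length
        schema = array_schema
    return schema


schema = list_of_lists_json_schema("3 x, *")
```

For `"3 x, *"` the outer array is constrained to exactly three items and the inner arrays are left unbounded, while `"Dim, Dim"` raises a `ValueError`.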


@ -1,15 +1,15 @@
# TODO # TODO
## Syntax
```{todo} ## v2
We will be moving away from using nptyping in v2.0.0. We will be moving away from using nptyping in v2.0.0.
It was written for an older era in python before the dramatic changes in the Python It was written for an older era in python before the dramatic changes in the Python
type system and is no longer actively maintained. We will be reimplementing a syntax type system and is no longer actively maintained. We will be reimplementing a syntax
that extends its array specification syntax to include things like ranges and extensible that extends its array specification syntax to include things like ranges and extensible
dtypes with varying precision (and is much less finicky to deal with). dtypes with varying precision (and is much less finicky to deal with).
```
## Validation ## Validation


@ -10,8 +10,10 @@ dependencies = [
"nptyping>=2.5.0", "nptyping>=2.5.0",
"numpy>=1.24.0", "numpy>=1.24.0",
] ]
homepage = "https://numpydantic.readthedocs.io"
requires-python = "<4.0,>=3.9" requires-python = "<4.0,>=3.9"
readme = "README.md" readme = "README.md"
repository = "https://github.com/p2p-ld/numpydantic"
license = {text = "MIT"} license = {text = "MIT"}


@ -9,7 +9,7 @@ interfaces.
This module also allows for convenient access to all abstract dtypes in a single This module also allows for convenient access to all abstract dtypes in a single
module, rather than needing to import each individually. module, rather than needing to import each individually.
Some types like :ref:`Integer` are compound types - tuples of multiple dtypes. Some types like `Integer` are compound types - tuples of multiple dtypes.
Check these using ``in`` rather than ``==``. This interface will develop in future Check these using ``in`` rather than ``==``. This interface will develop in future
versions to allow a single dtype check. versions to allow a single dtype check.
""" """
@ -59,6 +59,7 @@ Timedelta64 = np.timedelta64
SignedInteger = (np.int8, np.int16, np.int32, np.int64, np.short) SignedInteger = (np.int8, np.int16, np.int32, np.int64, np.short)
UnsignedInteger = (np.uint8, np.uint16, np.uint32, np.uint64, np.ushort) UnsignedInteger = (np.uint8, np.uint16, np.uint32, np.uint64, np.ushort)
Integer = tuple([*SignedInteger, *UnsignedInteger]) Integer = tuple([*SignedInteger, *UnsignedInteger])
"""All integer types"""
Int = Integer # Int should translate to the "generic" int type. Int = Integer # Int should translate to the "generic" int type.
Float16 = np.float16 Float16 = np.float16


@ -13,7 +13,7 @@ Extension of nptyping NDArray for pydantic that allows for JSON-Schema serializa
""" """
from typing import TYPE_CHECKING, Any, Tuple from typing import Any, Tuple
import numpy as np import numpy as np
from nptyping.error import InvalidArgumentsError from nptyping.error import InvalidArgumentsError
@ -37,13 +37,6 @@ from numpydantic.schema import (
) )
from numpydantic.types import DtypeType, ShapeType from numpydantic.types import DtypeType, ShapeType
if TYPE_CHECKING: # pragma: no cover
pass
"""
python types that pydantic/json schema can't support (and Any will be used instead)
"""
class NDArrayMeta(_NDArrayMeta, implementation="NDArray"): class NDArrayMeta(_NDArrayMeta, implementation="NDArray"):
""" """
@ -90,12 +83,9 @@ class NDArray(NPTypingType, metaclass=NDArrayMeta):
Constrained array type allowing nptyping syntax for dtype and shape validation Constrained array type allowing nptyping syntax for dtype and shape validation
and serialization. and serialization.
Integrates with pydantic such that This class is not intended to be instantiated or used for type checking; it
- JSON schema for list of list encoding implements the ``__get_pydantic_core_schema__`` method to invoke
- Serialized as LoL, with automatic compression for large arrays the relevant :ref:`interface <Interfaces>` for validation and serialization.
- Automatic coercion from lists on instantiation
Also supports validation on :class:`.NDArrayProxy` types for lazy loading.
References: References:
- https://docs.pydantic.dev/latest/usage/types/custom/#handling-third-party-types - https://docs.pydantic.dev/latest/usage/types/custom/#handling-third-party-types


@ -129,7 +129,16 @@ def list_of_lists_schema(shape: Shape, array_type: CoreSchema) -> ListSchema:
elif arg == "...": elif arg == "...":
list_schema = _unbounded_shape(inner_schema, metadata=metadata) list_schema = _unbounded_shape(inner_schema, metadata=metadata)
else: else:
try:
arg = int(arg) arg = int(arg)
except ValueError as e:
raise ValueError(
"Array shapes must be integers, wildcards, or ellipses. "
"Shape variables (for declaring that one dimension must be the "
"same size as another) are not supported because it is "
"impossible to express dynamic minItems/maxItems in JSON Schema. "
"See: https://github.com/orgs/json-schema-org/discussions/730"
) from e
list_schema = core_schema.list_schema( list_schema = core_schema.list_schema(
inner_schema, min_length=arg, max_length=arg, metadata=metadata inner_schema, min_length=arg, max_length=arg, metadata=metadata
) )