numpydantic/README.md

496 lines
14 KiB
Markdown
Raw Permalink Normal View History

2024-02-02 01:50:57 +00:00
# numpydantic
[![PyPI - Version](https://img.shields.io/pypi/v/numpydantic)](https://pypi.org/project/numpydantic)
[![Documentation Status](https://readthedocs.org/projects/numpydantic/badge/?version=latest)](https://numpydantic.readthedocs.io/en/latest/?badge=latest)
[![Coverage Status](https://coveralls.io/repos/github/p2p-ld/numpydantic/badge.svg)](https://coveralls.io/github/p2p-ld/numpydantic)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
2024-05-21 04:25:35 +00:00
A python package for specifying, validating, and serializing arrays with arbitrary backends in pydantic.
2024-05-21 04:25:35 +00:00
**Problem:**
1) Pydantic is great for modeling data.
2) Arrays are one of a few elemental types in computing,
2024-05-21 04:25:35 +00:00
but ...
2024-09-26 01:15:27 +00:00
3) Typical type annotations would only work for a single array library implementation
4) They wouldnt allow you to specify array shapes and dtypes, and
5) If you try and specify an array in pydantic, this happens:
2024-02-05 19:50:56 +00:00
2024-05-21 04:25:35 +00:00
```python
>>> from pydantic import BaseModel
>>> import numpy as np
>>> class MyModel(BaseModel):
>>> array: np.ndarray
pydantic.errors.PydanticSchemaGenerationError:
Unable to generate pydantic-core schema for <class 'numpy.ndarray'>.
Set `arbitrary_types_allowed=True` in the model_config to ignore this error
or implement `__get_pydantic_core_schema__` on your type to fully support it.
```
2024-09-26 01:15:27 +00:00
**Solution**
Numpydantic allows you to do this:
```python
from pydantic import BaseModel
from numpydantic import NDArray, Shape
class MyModel(BaseModel):
array: NDArray[Shape["3 x, 4 y, * z"], int]
```
And use it with your favorite array library:
```python
import numpy as np
import dask.array as da
import zarr
# numpy
model = MyModel(array=np.zeros((3, 4, 5), dtype=int))
# dask
model = MyModel(array=da.zeros((3, 4, 5), dtype=int))
# hdf5 datasets
model = MyModel(array=('data.h5', '/nested/dataset'))
# zarr arrays
model = MyModel(array=zarr.zeros((3,4,5), dtype=int))
model = MyModel(array='data.zarr')
model = MyModel(array=('data.zarr', '/nested/dataset'))
# video files
model = MyModel(array="data.mp4")
```
`numpydantic` supports pydantic but none of its behavior is dependent on it!
Use the `NDArray` type annotation like a regular type outside
of pydantic -- eg. to validate an array anywhere, use `isinstance`:
```python
array_type = NDArray[Shape["1, 2, 3"], int]
isinstance(np.zeros((1,2,3), dtype=int), array_type)
# True
isinstance(zarr.zeros((1,2,3), dtype=int), array_type)
# True
isinstance(np.zeros((4,5,6), dtype=int), array_type)
# False
isinstance(np.zeros((1,2,3), dtype=float), array_type)
# False
```
Or use it as a convenient callable shorthand for validating and working with
array types that usually don't have an array-like API.
```python
>>> rgb_video_type = NDArray[Shape["* t, 1920 x, 1080 y, 3 rgb"], np.uint8]
>>> video = rgb_video_type('data.mp4')
>>> video.shape
(10, 1920, 1080, 3)
>>> video[0, 0:3, 0:3, 0]
array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]], dtype=uint8)
```
2024-05-21 04:25:35 +00:00
## Features:
- **Types** - Annotations (based on [npytyping](https://github.com/ramonhagenaars/nptyping))
for specifying arrays in pydantic models
- **Validation** - Shape, dtype, and other array validations
2024-05-21 04:29:01 +00:00
- **Interfaces** - Works with [`numpy`](https://numpydantic.readthedocs.io/en/latest/api/interface/numpy.html),
[`dask`](https://numpydantic.readthedocs.io/en/latest/api/interface/dask.html),
[`hdf5`](https://numpydantic.readthedocs.io/en/latest/api/interface/hdf5.html),
[`video`](https://numpydantic.readthedocs.io/en/latest/api/interface/video.html),
[`zarr`](https://numpydantic.readthedocs.io/en/latest/api/interface/zarr.html),
2024-05-21 04:25:35 +00:00
and a simple extension system to make it work with whatever else you want!
- **Serialization** - Dump an array as a JSON-compatible array-of-arrays with enough metadata to be able to
recreate the model in the native format
- **Schema Generation** - Correct JSON Schema for arrays, complete with shape and dtype constraints, to
make your models interoperable
2024-09-26 01:15:27 +00:00
- **Fast** - The validation codepath is careful to take quick exits and not perform unnecessary work,
and interfaces use whatever tools available to validate against array metadata and lazy load to avoid
expensive i/o operations. Our goal is to make numpydantic a tool you don't ever need to think about.
2024-05-21 04:25:35 +00:00
Coming soon:
- **Metadata** - This package was built to be used with [linkml arrays](https://linkml.io/linkml/schemas/arrays.html),
so we will be extending it to include arbitrary metadata included in the type annotation object in the JSON schema representation.
- **Extensible Specification** - for v1, we are implementing the existing nptyping syntax, but
for v2 we will be updating that to an extensible specification syntax to allow interfaces to validate additional
constraints like chunk sizes, as well as make array specifications more introspectable and friendly to runtime usage.
- **Advanced dtype handling** - handling dtypes that only exist in some array backends, allowing
minimum and maximum precision ranges, and so on as type maps provided by interface classes :)
- (see [todo](https://numpydantic.readthedocs.io/en/latest/todo.html))
## Installation
numpydantic tries to keep dependencies minimal, so by default it only comes with
dependencies to use the numpy interface. Add the extra relevant to your favorite
array library to be able to use it!
```shell
pip install numpydantic
# dask
pip install 'numpydantic[dask]'
# hdf5
pip install 'numpydantic[hdf5]'
# video
pip install 'numpydantic[video]'
# zarr
pip install 'numpydantic[zarr]'
# all array formats
pip intsall 'numpydantic[array]'
```
## Usage
2024-09-26 01:15:27 +00:00
> [!TIP]
> The README is just a sample! See the full documentation at
> https://numpydantic.readthedocs.io
2024-05-21 04:25:35 +00:00
Specify an array using [nptyping syntax](https://github.com/ramonhagenaars/nptyping/blob/master/USERDOCS.md)
and use it with your favorite array library :)
Use the `NDArray` class like you would any other python type,
combine it with `Union`, make it `Optional`, etc.
For example, to specify a very special type of image that can either be
- a 2D float array where the axes can be any size, or
- a 3D uint8 array where the third axis must be size 3
- a 1080p video
2024-02-05 19:50:56 +00:00
```python
from typing import Union
from pydantic import BaseModel
2024-05-21 04:25:35 +00:00
import numpy as np
from numpydantic import NDArray, Shape
2024-02-05 19:50:56 +00:00
class Image(BaseModel):
array: Union[
2024-05-21 04:25:35 +00:00
NDArray[Shape["* x, * y"], float],
NDArray[Shape["* x, * y, 3 rgb"], np.uint8],
NDArray[Shape["* t, 1080 y, 1920 x, 3 rgb"], np.uint8]
2024-02-05 19:50:56 +00:00
]
```
2024-05-21 04:25:35 +00:00
And then use that as a transparent interface to your favorite array library!
### Interfaces
#### Numpy
The Coca-Cola of array libraries
2024-02-05 19:50:56 +00:00
```python
import numpy as np
# works
2024-05-21 04:25:35 +00:00
frame_gray = Image(array=np.ones((1280, 720), dtype=float))
2024-02-05 19:50:56 +00:00
frame_rgb = Image(array=np.ones((1280, 720, 3), dtype=np.uint8))
# fails
2024-05-21 04:25:35 +00:00
wrong_n_dimensions = Image(array=np.ones((1280,), dtype=float))
2024-02-05 19:50:56 +00:00
wrong_shape = Image(array=np.ones((1280,720,10), dtype=np.uint8))
2024-05-21 04:25:35 +00:00
# shapes and types are checked together, so this also fails
wrong_shape_dtype_combo = Image(array=np.ones((1280, 720, 3), dtype=float))
```
#### Dask
High performance chunked arrays! The backend for many new array libraries!
Works exactly the same as numpy arrays
```python
import dask.array as da
# validate a humongous image without having to load it into memory
video_array = da.zeros(shape=(1e10,1e20,3), dtype=np.uint8)
dask_video = Image(array=video_array)
```
#### HDF5
Array work increasingly can't fit on memory, but dealing with arrays on disk
can become a pain in concurrent applications. Numpydantic allows you to
specify the location of an array within an hdf5 file on disk and use it just like
any other array!
eg. Make an array on disk...
```python
from pathlib import Path
import h5py
from numpydantic.interface.hdf5 import H5ArrayPath
h5f_file = Path('my_file.h5')
array_path = "/nested/array"
# make an HDF5 array
h5f = h5py.File(h5f_file, "w")
array = np.random.randint(0, 255, (1920,1080,3), np.uint8)
h5f.create_dataset(array_path, data=array)
h5f.close()
```
Then use it in your model! numpydantic will only open the file as long as it's needed
```python
>>> h5f_image = Image(array=H5ArrayPath(file=h5f_file, path=array_path))
>>> h5f_image.array[0:5,0:5,0]
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]], dtype=uint8)
>>> h5f_image.array[0:2,0:2,0] = 1
>>> h5f_image.array[0:5,0:5,0]
array([[1, 1, 0, 0, 0],
[1, 1, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]], dtype=uint8)
```
Numpydantic tries to be a smart but transparent proxy, exposing the methods and attributes
of the source type even when we aren't directly using them, like when dealing with on-disk HDF5 arrays.
If you want, you can take full control and directly interact with the underlying :class:`h5py.Dataset`
object and leave the file open between calls:
```python
>>> dataset = h5f_image.array.open()
>>> # do some stuff that requires the dataset to be held open
>>> h5f_image.array.close()
```
#### Video
Videos are just arrays with fancy encoding! Numpydantic can validate shape and dtype
as well as lazy load chunks of frames with arraylike syntax!
Say we have some video `data.mp4` ...
```python
video = Image(array='data.mp4')
# get a single frame
video.array[5]
# or a range of frames!
video.array[5:10]
# or whatever slicing you want to do!
video.array[5:50:5, 0:10, 50:70]
```
As elsewhere, a proxy class is a transparent pass-through interface to the underlying
opencv class, so we can get the rest of the video properties ...
```python
import cv2
# get the total frames from opencv
video.array.get(cv2.CAP_PROP_FRAME_COUNT)
# the proxy class also provides a convenience property
video.array.n_frames
2024-02-05 19:50:56 +00:00
```
2024-05-21 04:25:35 +00:00
#### Zarr
Zarr works similarly!
Use it with any of Zarr's backends: Nested, Zipfile, S3, it's all the same!
Eg. create a nested zarr array on disk and use it...
2024-02-05 19:50:56 +00:00
```python
2024-05-21 04:25:35 +00:00
import zarr
from numpydantic.interface.zarr import ZarrArrayPath
array_file = 'data/array.zarr'
nested_path = 'data/sets/here'
root = zarr.open(array_file, mode='w')
nested_array = root.zeros(
nested_path,
shape=(1000, 1080, 1920, 3),
dtype=np.uint8
)
# validates just fine!
zarr_video = Image(array=ZarrArrayPath(array_file, nested_path))
# or just pass a tuple, the interface can discover it's a zarr array
zarr_video = Image(array=(array_file, nested_path))
2024-02-05 19:50:56 +00:00
```
2024-05-21 04:25:35 +00:00
### JSON Schema
Numpydantic generates JSON Schema for all its array specifications, so for the above
model, we get a schema for each of the possible array types that properly handles
the shape and dtype constraints and includes the origin numpy type as a `dtype` annotation.
2024-02-05 19:50:56 +00:00
```python
2024-05-21 04:25:35 +00:00
Image.model_json_schema()
2024-02-05 19:50:56 +00:00
```
```json
{
"properties": {
"array": {
2024-05-21 04:25:35 +00:00
"anyOf": [
{
"items": {"items": {"type": "number"}, "type": "array"},
"type": "array"
},
{
"dtype": "numpy.uint8",
2024-02-05 19:50:56 +00:00
"items": {
2024-05-21 04:25:35 +00:00
"items": {
"items": {
"maximum": 255,
"minimum": 0,
"type": "integer"
},
"maxItems": 3,
"minItems": 3,
"type": "array"
},
"type": "array"
2024-02-05 19:50:56 +00:00
},
"type": "array"
},
2024-05-21 04:25:35 +00:00
{
"dtype": "numpy.uint8",
"items": {
"items": {
"items": {
"items": {
"maximum": 255,
"minimum": 0,
"type": "integer"
},
"maxItems": 3,
"minItems": 3,
"type": "array"
},
"maxItems": 1920,
"minItems": 1920,
"type": "array"
},
"maxItems": 1080,
"minItems": 1080,
"type": "array"
},
"type": "array"
}
],
"title": "Array"
2024-02-05 19:50:56 +00:00
}
},
2024-05-21 04:25:35 +00:00
"required": ["array"],
"title": "Image",
2024-02-05 19:50:56 +00:00
"type": "object"
}
```
2024-05-21 04:25:35 +00:00
numpydantic can even handle shapes with unbounded numbers of dimensions by using
recursive JSON schema!!!
So the any-shaped array (using nptyping's ellipsis notation):
2024-02-05 19:50:56 +00:00
```python
2024-05-21 04:25:35 +00:00
class AnyShape(BaseModel):
array: NDArray[Shape["*, ..."], np.uint8]
```
is rendered to JSON-Schema like this:
2024-02-05 19:50:56 +00:00
2024-05-21 04:25:35 +00:00
```json
{
"$defs": {
"any-shape-array-9b5d89838a990d79": {
"anyOf": [
{
"items": {
"$ref": "#/$defs/any-shape-array-9b5d89838a990d79"
},
"type": "array"
},
{"maximum": 255, "minimum": 0, "type": "integer"}
]
}
},
"properties": {
"array": {
"dtype": "numpy.uint8",
"items": {"$ref": "#/$defs/any-shape-array-9b5d89838a990d79"},
"title": "Array",
"type": "array"
}
},
"required": ["array"],
"title": "AnyShape",
"type": "object"
}
2024-02-05 19:50:56 +00:00
```
2024-05-21 04:25:35 +00:00
where the key `"any-shape-array-9b5d89838a990d79"` uses a (blake2b) hash of the
inner dtype specification so that having multiple any-shaped arrays in a single
model schema are deduplicated without conflicts.
### Dumping
One of the main reasons to use chunked array libraries like zarr is to avoid
needing to load the entire array into memory. When dumping data to JSON, numpydantic
tries to mirror this behavior, by default only dumping the metadata that is
necessary to identify the array.
For example, with zarr:
2024-02-05 19:50:56 +00:00
```python
2024-05-21 04:25:35 +00:00
array = zarr.array([[1,2,3],[4,5,6],[7,8,9]], dtype=float)
instance = Image(array=array)
dumped = instance.model_dump_json()
```
2024-02-05 19:50:56 +00:00
2024-05-21 04:25:35 +00:00
```json
2024-02-05 19:50:56 +00:00
{
2024-05-21 04:25:35 +00:00
"array":
{
"Chunk shape": "(3, 3)",
"Chunks initialized": "1/1",
"Compressor": "Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)",
"Data type": "float64",
"No. bytes": "72",
"No. bytes stored": "421",
"Order": "C",
"Read-only": "False",
"Shape": "(3, 3)",
"Storage ratio": "0.2",
"Store type": "zarr.storage.KVStore",
"Type": "zarr.core.Array",
"hexdigest": "c51604eace325fe42bbebf39146c0956bd2ed13c"
}
2024-02-05 19:50:56 +00:00
}
```
2024-05-21 04:25:35 +00:00
To print the whole array, we use pydantic's serialization contexts:
```python
dumped = instance.model_dump_json(context={'zarr_dump_array': True})
```
```json
{
"array":
{
"same thing,": "except also...",
"array": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
"hexdigest": "c51604eace325fe42bbebf39146c0956bd2ed13c"
}
}
2024-07-31 20:06:10 +00:00
```
## Vendored Dependencies
We have vendored dependencies in the `src/numpydantic/vendor` package,
and reproduced their licenses in the `licenses` directory.
- [nptyping](https://github.com/ramonhagenaars/nptyping) - `numpydantic.vendor.nptyping` - `/licenses/nptyping.txt`