diff --git a/README.md b/README.md
index 4145173..fa80339 100644
--- a/README.md
+++ b/README.md
@@ -1,47 +1,2 @@
 # translate-nwb
 Translating NWB schema language to linkml
-
-The [nwb specification language](https://schema-language.readthedocs.io/en/latest/description.html)
-has several components
-
-- Namespaces: subcollections of specifications
-- Groups:
-
-We want to translate the schema to LinkML so that we can export to other schema formats,
-generate code for dealing with the data, and ultimately make it interoperable
-with other formats.
-
-To do that, we need to map:
-- Namespaces: seem to operate like separate schema? Then within a namespace the
-  rest are top-level objects
-- Inheritance: NWB has an odd inheritance system, where the same syntax is used for
-  inheritance, mixins, type declaration, and inclusion.
-  - `neurodata_type_inc` -> `is_a`
-- Groups:
-- Slots: Lots of properties are reused in the nwb spec, and LinkML lets us separate these out as slots
-- dims, shape, and dtypes: these should have been just attributes rather than put in the spec
-  language, so we'll just make an Array class and use that.
-
-## How does pynwb use the schema?
-
-* nwb-schema is included as a git submodule within pynwb
-* [__get_resources](https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/__init__.py#L23) encodes the location of the directory
-* [__TYPE_MAP](https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/__init__.py#L51) eventually contains the schema information
-* on import, [load_namespaces](https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/__init__.py#L115-L116) populates `__TYPE_MAP`
-* the [register_class](https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/__init__.py#L135-L136) decorator is used on all pynwb classes to register them with `__TYPE_MAP`
-  * Unclear how the schema is used if the containers contain the same information
-* the [register_container_type](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/build/manager.py#L727-L736) method in hdmf's TypeMap class seems to overwrite the loaded schema???
-  * `__NS_CATALOG` seems to actually hold references to the schema, but it doesn't seem to be used anywhere except within `__TYPE_MAP`?
-* [NWBHDF5IO](https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/__init__.py#L237-L238) uses `TypeMap` to create a `BuildManager`
-  * Parent class [HDF5IO](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/backends/hdf5/h5tools.py#L37) then reimplements a lot of basic functionality from elsewhere
-  * Its parent, the abstract base class [HDMFIO](https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/backends/io.py), appears to be the final writing class?
-  * `BuildManager.build` then [calls `TypeMap.build`](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/build/manager.py#L171) ???
-* `TypeMap.build` ...
-  * gets the [`ObjectMapper`](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/build/manager.py#L763), which does [god knows what](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/build/manager.py#L697)
-  * Calls the [`ObjectMapper.build`](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/build/objectmapper.py#L700) method
-  * Which seems to ultimately create a [`DatasetBuilder`](https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/build/builders.py#L315) object
-* The `DatasetBuilder` is returned to the `BuildManager`, which seems to just store it?
-* [HDMFIO.write](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/backends/io.py#L78) then calls `write_builder` to use the builder, which is unimplemented in the abstract base class
-  * [HDF5IO.write_builder](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/backends/hdf5/h5tools.py#L806) implements it for HDF5, calling `write_group`, `write_dataset`, or `write_link` depending on the builder type, each of which is an extremely heavy method!
-  * eg. [`write_dataset`](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/backends/hdf5/h5tools.py#L1080) is basically unreadable to me, but seems to implement every type of dataset writing in a single method.
-* At this point it is entirely unclear how the schema is involved, but the file is written.
diff --git a/docs/notes/pynwb.md b/docs/notes/pynwb.md
new file mode 100644
index 0000000..3f6f5e4
--- /dev/null
+++ b/docs/notes/pynwb.md
@@ -0,0 +1,25 @@
+# PyNWB notes
+
+## How does pynwb use the schema?
+
+* nwb-schema is included as a git submodule within pynwb
+* [__get_resources](https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/__init__.py#L23) encodes the location of the directory
+* [__TYPE_MAP](https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/__init__.py#L51) eventually contains the schema information
+* on import, [load_namespaces](https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/__init__.py#L115-L116) populates `__TYPE_MAP`
+* the [register_class](https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/__init__.py#L135-L136) decorator is used on all pynwb classes to register them with `__TYPE_MAP`
+  * Unclear how the schema is used if the containers contain the same information
+* the [register_container_type](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/build/manager.py#L727-L736) method in hdmf's TypeMap class seems to overwrite the loaded schema???
+  * `__NS_CATALOG` seems to actually hold references to the schema, but it doesn't seem to be used anywhere except within `__TYPE_MAP`?
+* [NWBHDF5IO](https://github.com/NeurodataWithoutBorders/pynwb/blob/dev/src/pynwb/__init__.py#L237-L238) uses `TypeMap` to create a `BuildManager`
+  * Parent class [HDF5IO](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/backends/hdf5/h5tools.py#L37) then reimplements a lot of basic functionality from elsewhere
+  * Its parent, the abstract base class [HDMFIO](https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/backends/io.py), appears to be the final writing class?
+  * `BuildManager.build` then [calls `TypeMap.build`](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/build/manager.py#L171) ???
+* `TypeMap.build` ...
+  * gets the [`ObjectMapper`](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/build/manager.py#L763), which does [god knows what](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/build/manager.py#L697)
+  * Calls the [`ObjectMapper.build`](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/build/objectmapper.py#L700) method
+  * Which seems to ultimately create a [`DatasetBuilder`](https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/build/builders.py#L315) object
+* The `DatasetBuilder` is returned to the `BuildManager`, which seems to just store it?
+* [HDMFIO.write](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/backends/io.py#L78) then calls `write_builder` to use the builder, which is unimplemented in the abstract base class
+  * [HDF5IO.write_builder](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/backends/hdf5/h5tools.py#L806) implements it for HDF5, calling `write_group`, `write_dataset`, or `write_link` depending on the builder type, each of which is an extremely heavy method!
+  * eg. [`write_dataset`](https://github.com/hdmf-dev/hdmf/blob/dd39b3878523c4b03f5286fc740752befd192d8b/src/hdmf/backends/hdf5/h5tools.py#L1080) is basically unreadable to me, but seems to implement every type of dataset writing in a single method.
+* At this point it is entirely unclear how the schema is involved, but the file is written (the apparent write path is sketched below).
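+
+A minimal sketch of that write path, wiring the pieces by hand. The entry points
+(`get_type_map`, `BuildManager`, `NWBHDF5IO`) are real public names, but the
+chain is our reading of the linked source, so treat it as approximate:
+
+```python
+from datetime import datetime
+
+from hdmf.build import BuildManager
+from pynwb import NWBFile, NWBHDF5IO, get_type_map
+
+nwbfile = NWBFile(
+    session_description="demo",
+    identifier="demo-001",
+    session_start_time=datetime.now().astimezone(),
+)
+
+# NWBHDF5IO normally constructs the BuildManager itself; doing it manually
+# shows the chain: BuildManager.build -> TypeMap.build -> ObjectMapper.build
+type_map = get_type_map()
+manager = BuildManager(type_map)
+builder = manager.build(nwbfile)  # a GroupBuilder tree for the root NWBFile
+
+# HDMFIO.write then hands the same builder tree to HDF5IO.write_builder
+with NWBHDF5IO("demo.nwb", mode="w", manager=manager) as io:
+    io.write(nwbfile)
+```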
diff --git a/docs/notes/schema.md b/docs/notes/schema.md
new file mode 100644
index 0000000..f525066
--- /dev/null
+++ b/docs/notes/schema.md
@@ -0,0 +1,185 @@
+# Schema Notes
+
+https://schema-language.readthedocs.io/en/latest/
+
+rough notes kept while thinking about how to translate the schema
+
+The easiest thing to do seems to be to just make a LinkML schema of the nwb-schema spec language itself and then use that to generate Python dataclasses that process the loaded namespaces using mixin methods lol
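+
+A rough sketch of that idea; every name below is hypothetical, nothing here is an existing API:
+
+```python
+from dataclasses import dataclass, field
+from typing import List, Optional
+
+
+class TranslatorMixin:
+    """Hand-written translation logic, mixed into the generated dataclasses."""
+
+    def to_linkml(self) -> dict:
+        raise NotImplementedError
+
+
+@dataclass
+class Namespace(TranslatorMixin):
+    # fields as they might be generated from a LinkML model of the spec language
+    name: str
+    doc: Optional[str] = None
+    schema: List[str] = field(default_factory=list)
+
+    def to_linkml(self) -> dict:
+        # assumption: each namespace becomes its own LinkML schema
+        return {"id": self.name, "name": self.name, "imports": list(self.schema)}
+```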
+
+## Overview
+
+We want to translate the schema to LinkML so that we can export to other schema formats,
+generate code for dealing with the data, and ultimately make it interoperable
+with other formats.
+
+## Structure
+
+- root is `nwb.namespace.yaml` and imports the rest of the namespaces
+- `hdmf-common` is implicitly loaded (`TODO` link to issue)
+
+## Components
+
+The [nwb specification language](https://schema-language.readthedocs.io/en/latest/description.html)
+has several components
+
+- **Namespaces:** top-level object
+- **Schema:** specified within a `namespaces` object. Each schema is a list of data types
+- **Data types:** each entry in a schema file's top-level lists is a data type. Data types are one of three subtypes:
+  - Groups: generic collection
+  - Datasets: like groups, but also describe arrays
+  - Links: references to other top-level data types
+- **Attributes:** Groups and Datasets, in addition to their default properties, can also have a list named `attributes` that seems to be used like `**kwargs`, but also maybe to specify arrays?
+  - > The specification of datasets looks quite similar to attributes and groups. Similar to attributes, datasets describe the storage of arbitrary n-dimensional array data. However, in contrast to attributes, datasets are not associated with a specific parent group or dataset object but are (similar to groups) primary data objects (and as such typically manage larger data than attributes)
+
+The components, in turn, nest:
+
+- Groups and Datasets are recursive: ie. groups and datasets can have groups and datasets
+  - and also links (but the recursive part is just the group or dataset being linked to)
+
+## Properties
+
+**`dtype`** defines the storage type of the given "data type," which we'll also start calling "class" because the overloaded terminology gets confusing.
+
+dtypes can be:
+- unset, in which case the "data type"/"class" becomes a group of datasets.
+- a string
+- a list of dtypes: single-layer recursion (a compound dtype)
+- a dictionary defining a "reference":
+  - `target_type`: the type of the target of the reference
+  - `reftype`: the kind of reference being made, `ref`/`reference`/`object` (all equivalent) or `region` for a subset of the referred object.
+
+**`dims`** defines the axis names, and **`shape`** defines the possible shapes of an array. The structure of each has to match.
+
+eg:
+
+```yml
+- neurodata_type_def: Image
+  neurodata_type_inc: NWBData
+  dtype: numeric
+  dims:
+  - - x
+    - y
+  - - x
+    - y
+    - r, g, b
+  - - x
+    - y
+    - r, g, b, a
+  shape:
+  - - null
+    - null
+  - - null
+    - null
+    - 3
+  - - null
+    - null
+    - 4
+```
+
+Can a compound dtype be used with multiple dims?? if dtype also controls the shape of the data type (eg. the tabular data example with its huge compound dtype), then what are dims?
+
+Seems like when `dtype` is specified with `dims` then it is treated as an array, but otherwise scalar.
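+
+As a sanity check on that reading of `dims`/`shape`, here's a toy validator (a
+hypothetical helper, not part of pynwb or hdmf) matching a concrete array shape
+against the alternatives in the Image example:
+
+```python
+from typing import Optional, Sequence, Tuple
+
+ShapeOption = Sequence[Optional[int]]
+
+
+def matches_any_shape(options: Sequence[ShapeOption], actual: Tuple[int, ...]) -> bool:
+    """True if `actual` fits at least one allowed shape; None means any extent."""
+    return any(
+        len(option) == len(actual)
+        and all(want is None or want == got for want, got in zip(option, actual))
+        for option in options
+    )
+
+
+# The Image spec above allows (x, y), (x, y, 3), or (x, y, 4):
+image_shapes = [[None, None], [None, None, 3], [None, None, 4]]
+assert matches_any_shape(image_shapes, (480, 640, 3))      # x, y, r/g/b
+assert not matches_any_shape(image_shapes, (480, 640, 5))  # no 5-channel form
+```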
+
+### Inheritance
+
+- `neurodata_type_def` - defines a new data type
+- `neurodata_type_inc` - includes/inherits from another data type within the namespace
+
+Both are optional. Inheritance and instantiation appear to be conflated here (the four cases are sketched in code at the end of this section):
+
+- `(def unset/inc unset)` - untyped data type? - seems to be because "datasets" are recursive, so the actual numerical arrays are "datasets" but so are the top-level classes. but can datasets truly be recursive? i think the HDF5 implementation probably means that untyped datasets are terminal - ie. untyped datasets cannot contain datasets. maybe?
+- `(def set /inc unset)` - new data type
+- `(def set /inc set )` - inheritance
+- `(def unset/inc set )` - instantiate???
+
+If no new type is defined, the "data type" has a "data type" of the `inc`luded type?
+
+I believe this means that including without defining is instantiating the type, hence the need for a unique name. Otherwise, the "name" is presumably the name of the type?
+
+Does overriding a dataset or group from the parent class ... override it? or add to it? or does it need to be validated against the parent dataset schema?
+
+instantiation as a group can be used to indicate an arbitrary number of instances of a dataset; not sure how that's distinct from `dtype` and `dims` yet.
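+
+Coding up the four def/inc cases makes the conflation explicit; this is our
+interpretation of the spec, not documented behavior:
+
+```python
+from typing import Optional
+
+
+def classify(type_def: Optional[str], type_inc: Optional[str]) -> str:
+    """Map the def/inc combinations from the list above to translation rules."""
+    if type_def and type_inc:
+        return f"new class {type_def}, is_a {type_inc}"  # inheritance
+    if type_def:
+        return f"new root class {type_def}"              # new data type
+    if type_inc:
+        return f"named instance of {type_inc}"           # inclusion/instantiation
+    return "anonymous, untyped group/dataset"
+
+
+assert classify("Image", "NWBData") == "new class Image, is_a NWBData"
+```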
+
+## Mappings
+
+What can be restructured to fit LinkML?
+
+We need to map:
+- Namespaces: seem to operate like separate schema? Then within a namespace the
+  rest are top-level objects
+- Inheritance: NWB has an odd inheritance system, where the same syntax is used for
+  inheritance, mixins, type declaration, and inclusion.
+  - `neurodata_type_inc` -> `is_a`
+- Groups:
+- Slots: Lots of properties are reused in the nwb spec, and LinkML lets us separate these out as slots
+- `quantity` needs a manual map
+- dims, shape, and dtypes: these should have been just attributes rather than put in the spec
+  language, so we'll just make an Array class and use that.
+  - dims and shape should probably be a dictionary so you don't need a zillion nulls, eg rather than
+
+    ```yml
+    dims:
+    - - x
+      - y
+    - - x
+      - y
+      - r, g, b
+    shape:
+    - - null
+      - null
+    - - null
+      - null
+      - 3
+    ```
+
+    do
+
+    ```yml
+    dims:
+    - - name: x
+      - name: y
+    - - name: x
+      - name: y
+      - name: r, g, b
+        shape: 3
+    ```
+
+    or even
+
+    ```yml
+    dims:
+    - - x
+      - y
+    - - x
+      - y
+      - name: r, g, b
+        shape: 3
+    ```
+
+  And also: is there any case with some odd dependency between dims that would break this, where it wouldn't work to just use an `optional` param?
+
+  ```yml
+  dims:
+  - name: x
+    shape: null
+  - name: y
+    shape: null
+  - name: r, g, b
+    shape: 3
+    optional: true
+  ```
+
+## Parsing
+
+- Given a `nwb.schema.yml` meta-schema that defines the types of objects in nwb schema...
+- The top level of an NWB schema is a `namespaces` object
+- each file specified in the `namespaces.schema` array is a distinct schema
+  - that inherits the ...
+- `groups`
+  - Top-level lists are parsed as "groups"
+
+## Special Types
+
+holy hell, it appears as if `hdmf-common` is all special cases. eg. DynamicTable... is like a parallel implementation of links and references???
diff --git a/docs/notes/storage.md b/docs/notes/storage.md
new file mode 100644
index 0000000..702e2f1
--- /dev/null
+++ b/docs/notes/storage.md
@@ -0,0 +1,3 @@
+# NWB Storage
+
+https://nwb-storage.readthedocs.io/en/latest/