Translation Strategy

NWB to LinkML

Structure

Names

Arrays

LinkML to Pydantic

Types

Metadata

Arrays

Special Cases

DynamicTable

See the [DynamicTable](https://hdmf-common-schema.readthedocs.io/en/stable/format_description.html#dynamictable) reference docs.

One of the major special cases in NWB is DynamicTable, which holds tabular data and may contain columns that are not defined in the base spec.

Basic Usage

An example is the TimeIntervals neurodata type within nwb.epoch:

groups:
- neurodata_type_def: TimeIntervals
  neurodata_type_inc: DynamicTable
  doc: A container for aggregating epoch data and the TimeSeries that each epoch applies
    to.
  datasets:
  - name: start_time
    neurodata_type_inc: VectorData
    dtype: float32
    doc: Start time of epoch, in seconds.
  - name: stop_time
    neurodata_type_inc: VectorData
    dtype: float32
    doc: Stop time of epoch, in seconds.
  - name: tags
    neurodata_type_inc: VectorData
    dtype: text
    doc: User-defined tags that identify or categorize events.
    quantity: '?'
  - name: tags_index
    neurodata_type_inc: VectorIndex
    doc: Index for tags.
    quantity: '?'
  - name: timeseries
    neurodata_type_inc: TimeSeriesReferenceVectorData
    doc: An index into a TimeSeries object.
    quantity: '?'
  - name: timeseries_index
    neurodata_type_inc: VectorIndex
    doc: Index for timeseries.
    quantity: '?'

Each column of the table is specified as a VectorData object, which creates an implicit {n<=4}-dimensional array. A column can optionally have an adjoining VectorIndex object whose target attribute refers to the VectorData item:

- data_type_def: VectorData
  data_type_inc: Data
  doc: ...
  dims:
  - ...
  shape:
  - ...
  attributes:
  - name: description
    dtype: text
    doc: Description of what these vectors represent.

- data_type_def: VectorIndex
  data_type_inc: VectorData
  dtype: uint8
  doc: ...
  dims:
  - num_rows
  shape:
  - null
  attributes:
  - name: target
    dtype:
      target_type: VectorData
      reftype: object
    doc: Reference to the target dataset that this index applies to.

DynamicTable also allows arbitrary additional VectorData columns, with the name field used as an identifier. Columns specified in the model have a name fixed by the schema, while each additional column is identified by the name it is given:

- data_type_def: DynamicTable
  data_type_inc: Container
  doc: ...
  attributes:
  - name: colnames
    dtype: text
    dims:
    - num_columns
    shape:
    - null
    doc: The names of the columns in this table. This should be used to specify
      an order to the columns.
  - name: description
    dtype: text
    doc: Description of what is in this dynamic table.
  datasets:
  - name: id
    data_type_inc: ElementIdentifiers
    dtype: int
    dims:
    - num_rows
    shape:
    - null
    doc: Array of unique identifiers for the rows of this dynamic table.
  - data_type_inc: VectorData
    doc: Vector columns, including index columns, of this dynamic table.
    quantity: '*'

Here colnames is stored as an array in the metadata attributes of the group, while id and the columns are stored as HDF5 datasets.

In the simplest case, this results in a TimeIntervals group that looks like this (abbreviated for clarity):

$ h5ls -rv an_nwb_dataset.nwb/trials
/trials                  Group
    Attribute: colnames {7}
        Type:      variable-length null-terminated UTF-8 string
    Attribute: neurodata_type scalar
        Type:      variable-length null-terminated UTF-8 string
        Value:     TimeIntervals
/trials/id               Dataset {121/121}
    Attribute: neurodata_type scalar
        Type:      variable-length null-terminated UTF-8 string
        Value:     ElementIdentifiers
/trials/start_time       Dataset {121/121}
    Attribute: neurodata_type scalar
        Type:      variable-length null-terminated UTF-8 string
        Value:     VectorData
/trials/stop_time        Dataset {121/121}
    Attribute: neurodata_type scalar
        Type:      variable-length null-terminated UTF-8 string
        Value:     VectorData
/trials/surface_excursion_start_time Dataset {121/121}
    Attribute: neurodata_type scalar
        Type:      variable-length null-terminated UTF-8 string
        Value:     VectorData
/trials/surface_excursion_stop_time Dataset {121/121}
    Attribute: neurodata_type scalar
        Type:      variable-length null-terminated UTF-8 string
        Value:     VectorData
/trials/surface_location Dataset {121/121}
    Attribute: neurodata_type scalar
        Type:      variable-length null-terminated UTF-8 string
        Value:     VectorData
/trials/surface_return_start_time Dataset {121/121}
    Attribute: neurodata_type scalar
        Type:      variable-length null-terminated UTF-8 string
        Value:     VectorData
/trials/surface_return_stop_time Dataset {121/121}
    Attribute: neurodata_type scalar
        Type:      variable-length null-terminated UTF-8 string
        Value:     VectorData
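
To make the listing concrete, the following sketch separates the columns of this /trials group into those named by the TimeIntervals schema and those added by the user. It is only an illustration in plain Python: the dataset names come from the listing above, while the order of `colnames` (whose value the listing does not show) and the helper name `split_columns` are assumptions.

```python
# Column names fixed by the TimeIntervals schema shown earlier.
SCHEMA_COLUMNS = {"start_time", "stop_time", "tags", "timeseries"}

# Hypothetical value of the `colnames` attribute (7 entries); the real order
# is not shown in the listing, so this ordering is an assumption.
colnames = [
    "start_time",
    "stop_time",
    "surface_excursion_start_time",
    "surface_excursion_stop_time",
    "surface_location",
    "surface_return_start_time",
    "surface_return_stop_time",
]


def split_columns(names, schema_columns):
    """Split column names into (schema-defined, user-added), keeping order."""
    fixed = [n for n in names if n in schema_columns]
    extra = [n for n in names if n not in schema_columns]
    return fixed, extra


fixed, extra = split_columns(colnames, SCHEMA_COLUMNS)
```

In this example the two schema-defined columns are start_time and stop_time, and the five surface_* columns are the arbitrary additional VectorData columns that DynamicTable permits.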

Ragged Tables

VectorIndex and VectorData pairs can also be used to create ragged arrays, e.g. in the case of the Units model from nwb.misc:

- neurodata_type_def: Units
  neurodata_type_inc: DynamicTable
  default_name: Units
  doc: Data about spiking units. Event times of observed units (e.g. cell, synapse,
    etc.) should be concatenated and stored in spike_times.
  datasets:
  - name: spike_times_index
    neurodata_type_inc: VectorIndex
    doc: Index into the spike_times dataset.
    quantity: '?'
  - name: spike_times
    neurodata_type_inc: VectorData
    dtype: float64
    doc: Spike times for each unit in seconds.
    quantity: '?'
    attributes:
    - name: resolution
      dtype: float64
      doc: The smallest possible difference between two spike times. Usually 1 divided by the acquisition sampling rate
        from which spike times were extracted, but could be larger if the acquisition time series was downsampled or
        smaller if the acquisition time series was smoothed/interpolated and it is possible for the spike time to be
        between samples.
      required: false

In this case, spike_times is stored as a 1-dimensional vector with the spike times of all units concatenated. spike_times_index then stores, for each unit, the (exclusive) end index of that unit's spikes within spike_times, so that indexing NWBFile.units[0] returns an array of all the spike times for the 0th unit.
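
The lookup can be sketched in a few lines of plain Python, assuming the HDMF convention that each VectorIndex entry is the exclusive end offset of its row in the flat VectorData vector. The function name `ragged_row` and the spike values are made up for illustration; this is not the API used by hdmf or nwb-linkml.

```python
def ragged_row(data, index, i):
    """Return row ``i`` of a ragged column.

    ``index[i]`` is the exclusive end offset of row ``i`` in ``data``;
    row ``i`` starts where row ``i - 1`` ended (or at 0 for the first row).
    """
    start = index[i - 1] if i > 0 else 0
    return data[start:index[i]]


# Hypothetical values: three units with 2, 3, and 1 spikes respectively.
spike_times = [0.1, 0.4, 0.2, 0.3, 0.9, 0.5]
spike_times_index = [2, 5, 6]

unit_0 = ragged_row(spike_times, spike_times_index, 0)
unit_1 = ragged_row(spike_times, spike_times_index, 1)
unit_2 = ragged_row(spike_times, spike_times_index, 2)
```

Storing end offsets means the index has exactly one entry per row, and the last entry equals the length of the flat vector.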

Inter-table views

The DynamicTableRegion model is a subclass of VectorData that refers to rows within another DynamicTable.

For example, the ElectricalSeries model from nwb.ecephys (abbreviated for clarity):

- neurodata_type_def: ElectricalSeries
  neurodata_type_inc: TimeSeries
  doc: ...
  datasets:
  - name: data
    dtype: numeric
    dims:
    - ...
    shape:
    - ...
    doc: Recorded voltage data.
    attributes:
    - name: unit
      dtype: text
      value: volts
      doc: ...
  - name: electrodes
    neurodata_type_inc: DynamicTableRegion
    doc: DynamicTableRegion pointer to the electrodes that this time series was generated from.

This produces an HDF5 dataset like /acquisition/{name}/electrodes that has

  • a table attribute that is a reference to another dynamic table (e.g. /general/extracellular_ephys/electrodes)
  • a vector of values that are row indices into that table

so that each column of the {n_times} x {n_electrodes} /data array corresponds to one of the channels referenced by electrodes.
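
As a rough illustration of this correspondence, the sketch below resolves a region of row indices against a small column-wise table and pairs the result with the columns of a data array. All values and the helper names `resolve_region` and `channel` are hypothetical, and plain lists stand in for HDF5 datasets.

```python
# Hypothetical electrodes table, stored column-wise like a DynamicTable.
electrodes_table = {
    "id": [0, 1, 2, 3],
    "location": ["CA1", "CA1", "CA3", "DG"],
}

# Hypothetical DynamicTableRegion: row indices into `electrodes_table`.
electrodes_region = [2, 0, 3]

# Hypothetical {n_times} x {n_electrodes} data array (2 samples, 3 channels).
data = [
    [0.10, 0.20, 0.30],
    [0.11, 0.21, 0.31],
]


def resolve_region(table, region):
    """Return one dict per referenced row of a column-wise table."""
    return [{name: column[row] for name, column in table.items()} for row in region]


def channel(data, column_index):
    """Return the samples of one channel, i.e. one column of ``data``."""
    return [sample[column_index] for sample in data]


rows = resolve_region(electrodes_table, electrodes_region)
# Column j of `data` was recorded from the electrode described by rows[j].
first_channel = channel(data, 0)
```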

Implicit Behavior

  • A VectorIndex does not need to refer explicitly to a VectorData column through its target attribute; it can be linked implicitly by being named {VectorData.name}_index
  • When indexing a DynamicTable, the result returned by DynamicTable.column_name[0] is actually the VectorIndex-ed view into the VectorData column, rather than the VectorData column itself
  • References through DynamicTableRegion are similarly resolved by the API, which substitutes the values from the referenced tables and datasets.
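
The naming convention in the first point can be sketched as a small helper that pairs each `{name}_index` dataset with a dataset called `name`, when one exists. The function name `pair_index_columns` is hypothetical, and the sketch only looks at names; it does not inspect neurodata types as a real implementation presumably would.

```python
def pair_index_columns(dataset_names):
    """Map each VectorData name to the name of its implicit index, if any.

    A dataset called ``{name}_index`` is treated as the index of ``name``
    when a dataset called ``name`` is also present.
    """
    names = set(dataset_names)
    suffix = "_index"
    pairs = {}
    for candidate in dataset_names:
        if candidate.endswith(suffix):
            target = candidate[: -len(suffix)]
            if target in names:
                pairs[target] = candidate
    return pairs


pairs = pair_index_columns(["id", "spike_times", "spike_times_index", "tags_index"])
```

Here tags_index is left unpaired because there is no tags dataset for it to index.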

Implementation

When translating from nwb-schema-language to LinkML we...

TODO: Link to relevant adapter classes
  • Interpret VectorData as regular array-like slots if they have no additional attributes, or as subclasses when they do
  • Replace all the special reference notation with range: Class annotations that directly refer to the classes being linked to

When generating pydantic models we...

  • Include a special DynamicTableMixin (nwb_linkml.includes.hdmf.DynamicTableMixin) in the generated hdmf_common.table module and replace the configured base model
  • Since LinkML doesn't have the notion of "arbitrary additional slots of this type" differentiated by a name, the mixin class reconfigures the model to allow for extra fields.
  • The mixin then has model-level validation routines to verify that the columns are of equal length
  • The mixin also provides the accessor magic methods for indexing as usual.
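
The behaviour described in these points can be approximated with a library-free sketch. It is not the actual DynamicTableMixin, which is mixed into generated pydantic models; the class name `TableSketch` and its details are assumptions chosen to show the three ideas: arbitrary extra columns, an equal-length check, and indexing accessors.

```python
class TableSketch:
    """Library-free stand-in for a table model that accepts extra columns."""

    def __init__(self, **columns):
        # Accept any keyword as a column, mirroring "extra fields allowed".
        lengths = {name: len(values) for name, values in columns.items()}
        if len(set(lengths.values())) > 1:
            raise ValueError(f"Columns must have equal length, got {lengths}")
        self._columns = columns

    def __getitem__(self, key):
        # A string selects a whole column; an integer selects one row.
        if isinstance(key, str):
            return self._columns[key]
        return {name: values[key] for name, values in self._columns.items()}

    def __len__(self):
        # Number of rows; zero when the table has no columns.
        return len(next(iter(self._columns.values()), []))


table = TableSketch(start_time=[0.0, 1.0], stop_time=[0.5, 1.5], surface_location=["a", "b"])
row = table[1]

try:
    TableSketch(start_time=[0.0, 1.0], stop_time=[0.5])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

Validating at construction time keeps a malformed table from ever existing, which is the same role the model-level validation plays in the generated models.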

References

There are several different ways to create references between objects in NWB/HDMF:

  • links are group-level properties that can reference other groups or datasets like this:
    links:
    - name: Link name
      doc: Required string with the description of the link
      target_type: Type of target
      quantity: Optional quantity identifier for the group (default=1).
    
  • Reference dtypes are dataset- and attribute-level properties that can reference both other objects and regions within other objects:
    dtype:
      target_type: ElectrodeGroup
      reftype: object
    
  • TimeSeriesReferenceVectorData is a compound dtype that behaves like VectorData and VectorIndex combined into a single type. It differs slightly in that each row of the vector can refer to a different table, and in how it handles selection (with a start and a count, rather than a series of indices marking the end of each cell)
  • Implicitly, hdmf creates references between objects according to some naming conventions, e.g. an attribute/dataset that is a VectorIndex named mydata_index will be linked to a VectorData object mydata.
  • There is currently a note in the schema language docs that there will be an additional Relationships system that explicitly models relationships, but it is unclear how that would differ from references.
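
The difference between the two selection styles mentioned for TimeSeriesReferenceVectorData can be sketched as follows. The sample values and function names are hypothetical; `idx_start` and `count` are used here as the names of the start and count fields, which is an assumption about the compound dtype rather than something stated above.

```python
def select_by_end_index(data, index, i):
    """VectorIndex-style selection: ``index[i]`` is the exclusive end of cell ``i``."""
    start = index[i - 1] if i > 0 else 0
    return data[start:index[i]]


def select_by_start_count(data, idx_start, count):
    """Selection from an explicit start offset and a number of samples."""
    return data[idx_start:idx_start + count]


samples = [10, 11, 12, 13, 14, 15]

by_end = select_by_end_index(samples, [2, 6], 1)
by_count = select_by_start_count(samples, idx_start=2, count=4)
```

With end indices, each cell depends on the previous entry to find where it starts; with a start and a count, each row is self-contained, which suits rows that point into different targets.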

We represent all of these by referring directly to the object type, preserving the source type in an annotation when necessary.
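
A minimal sketch of that idea, using plain dictionaries in place of real LinkML slot definitions: the reference dtype is collapsed into a `range` that names the target class, and the kind of reference it came from is kept as an annotation. The function name `reference_to_slot`, the slot name `group`, and the annotation key `source_type` are all hypothetical and are not taken from the nwb-linkml adapters.

```python
def reference_to_slot(name, dtype, source_type):
    """Build a slot-like dict that points directly at the target class.

    ``source_type`` records which NWB mechanism the reference came from.
    The annotation key name used here is hypothetical.
    """
    return {
        "name": name,
        "range": dtype["target_type"],
        "annotations": {"source_type": source_type},
    }


# The reference dtype from the list above.
dtype = {"target_type": "ElectrodeGroup", "reftype": "object"}
slot = reference_to_slot("group", dtype, source_type="reference")
```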

LinkML to Everything

How to generalize to linked data triples.