nwb-linkml/nwb_linkml/tests/test_io/test_io_hdf5.py

import pdb

import h5py
import numpy as np
import pytest

from nwb_linkml.io.hdf5 import HDF5IO, truncate_file


@pytest.mark.xfail()
@pytest.mark.parametrize("dset", ["aibs.nwb", "aibs_ecephys.nwb"])
def test_hdf_read(data_dir, dset):
    NWBFILE = data_dir / dset
    io = HDF5IO(path=NWBFILE)
    # the test for now is just whether we can read it lol
    model = io.read()


def test_truncate_file(tmp_output_dir):
    source = tmp_output_dir / "truncate_source.hdf5"

    # create a dang ol hdf5 file with two big datasets and some object references
    # and make sure we truncate the datasets and preserve the references

    # the context manager flushes and closes the file on exit
    with h5py.File(str(source), "w") as h5f:
        h5f.create_group("data")
        dataset_contig = h5f.create_dataset(
            "/data/dataset_contig",
            data=np.zeros((1000, 30, 40), dtype=np.float64),
            compression="gzip",
            compression_opts=9,
        )
        dataset_chunked = h5f.create_dataset(
            "/data/dataset_chunked",
            data=np.zeros((1000, 40, 50), dtype=np.float64),
            compression="gzip",
            compression_opts=9,
            chunks=True,
        )
        dataset_contig.attrs["reference_other"] = dataset_chunked.ref
        dataset_chunked.attrs["reference_other"] = dataset_contig.ref
        dataset_contig.attrs["anattr"] = 1

        link_group = h5f.create_group("link/child")
        link_group.attrs["reference_contig"] = dataset_contig.ref
        link_group.attrs["reference_chunked"] = dataset_chunked.ref

    source_size = source.stat().st_size

    # do it without providing target to check that we make filename correctly
    n = 10
    target_output = truncate_file(source, n=n)
    assert target_output == source.parent / (source.stem + "_truncated.hdf5")
    # check that we actually made it smaller
    target_size = target_output.stat().st_size
    # empirically, the source dataset is ~125KB and truncated is ~17KB
    assert target_size < source_size / 5

    # then check that we have what's expected in the file
    # open with a context manager so the handle is closed even if an assert fails
    with h5py.File(target_output, "r") as target_h5f:
        # truncation happened
        assert target_h5f["data"]["dataset_contig"].shape == (n, 30, 40)
        assert target_h5f["data"]["dataset_chunked"].shape == (n, 40, 50)
        # references still work
        # can't directly assess object identity equality with "is"
        # so this tests that the references dereference, and that they
        # dereference to the right place
        assert (
            target_h5f[target_h5f["data"]["dataset_contig"].attrs["reference_other"]].name
            == target_h5f["data"]["dataset_chunked"].name
        )
        assert (
            target_h5f[target_h5f["data"]["dataset_chunked"].attrs["reference_other"]].name
            == target_h5f["data"]["dataset_contig"].name
        )
        assert (
            target_h5f[target_h5f["link"]["child"].attrs["reference_contig"]].name
            == target_h5f["data"]["dataset_contig"].name
        )
        assert (
            target_h5f[target_h5f["link"]["child"].attrs["reference_chunked"]].name
            == target_h5f["data"]["dataset_chunked"].name
        )
        assert target_h5f["data"]["dataset_contig"].attrs["anattr"] == 1
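

# A minimal sketch, not part of nwb_linkml: the default output name that
# test_truncate_file expects when truncate_file is called without a target.
# `expected_truncated_path` is a hypothetical helper introduced only to spell
# out the convention asserted in that test (same directory, stem plus
# "_truncated.hdf5"). How truncate_file derives the name internally is an
# assumption; only the resulting path is checked by the test.
from pathlib import Path


def expected_truncated_path(source: Path) -> Path:
    """Return the sibling path `<stem>_truncated.hdf5` for `source`."""
    # `stem` drops only the final suffix, so "a.hdf5" becomes "a_truncated.hdf5"
    return source.parent / (source.stem + "_truncated.hdf5")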


@pytest.mark.skip()
def test_flatten_hdf():
    from nwb_linkml.maps.hdf5 import flatten_hdf

    path = "/Users/jonny/Dropbox/lab/p2p_ld/data/nwb/sub-738651046_ses-760693773.nwb"

    # h5py is already imported at module level
    h5f = h5py.File(path)
    flat = flatten_hdf(h5f)
    assert not any("specifications" in v.path for v in flat.values())
    pdb.set_trace()
    raise NotImplementedError("Just a stub for local testing for now, finish me!")