This commit is contained in:
sneakers-the-rat 2023-06-30 23:58:54 -07:00
parent e1da79c769
commit c782ee8ba9
15 changed files with 634 additions and 62 deletions


@ -1,11 +0,0 @@
# Adapter
Interfaces to other protocols and formats
- Files
  - hdf5
  - json
  - csv
  - mat
- HTTP
- S3

15
src/codecs/hdf5.md Normal file

@ -0,0 +1,15 @@
# HDF5
We are starting with hdf5 because our initial test case is the [NWB](https://www.nwb.org/) format for neurophysiology data. This is a challenging initial test case because the data is heterogeneous and large, and the format is specified in an idiosyncratic specification language.
HDF has three primary types of objects:
- Groups - contain other groups and datasets, giving the file a hierarchical structure
- Datasets - contain the raw values in the file
- Attributes - metadata about groups or datasets.
Datasets have additional properties:
- Datatypes: binary representation of the data
- Dataspaces: Layout of individual data elements
- Properties: Additional information about the representation of the dataset, eg. chunked or contiguous
These map naturally onto triplets, where each group or dataset is a subject and its attributes supply the predicates and objects.
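To make that concrete, a rough sketch (not the eventual implementation) that walks an HDF5 file with `h5py` and emits triplets with `rdflib` - the `HDF` namespace and subject URIs here are placeholders:
```python
import h5py
from rdflib import Graph, Literal, Namespace, RDF, URIRef

HDF = Namespace("http://example.org/hdf5#")  # hypothetical vocabulary

def hdf5_to_triples(path: str, base: str = "http://example.org/file#") -> Graph:
    """Make each group/dataset a subject, with its attributes as predicate/object pairs."""
    g = Graph()
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            subject = URIRef(base + name)
            if isinstance(obj, h5py.Dataset):
                g.add((subject, RDF.type, HDF.Dataset))
                g.add((subject, HDF.dtype, Literal(str(obj.dtype))))  # datatype
                g.add((subject, HDF.shape, Literal(str(obj.shape))))  # dataspace
            else:
                g.add((subject, RDF.type, HDF.Group))
            for key, value in obj.attrs.items():
                g.add((subject, HDF[key], Literal(str(value))))
        f.visititems(visit)
    return g
```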

20
src/codecs/index.md Normal file

@ -0,0 +1,20 @@
# Codecs
Interfaces to file formats
We want to support three kinds of interaction with files:
- **References** - treat files like abstract binary with some metadata indicating file type and a hash tree for the file
- **Introspection** - Export some metadata from the file that indicates components of the file along with their byte ranges. We want to be able to know what is inside the file without downloading it, but we keep the file itself separate as an out-of-protocol entity.
- **Ingestion** - Export the metadata and the data contained within the file to triples. We also store a translation between the original binary file and the resulting triples, through a translation schema that lets us update our triples if the file changes and keeps a strong link to the source, while otherwise enabling forking/querying/etc. as if the data had no underlying file.
This is a challenging design balance: we don't want clients to need to implement a large number of codecs for different files - so they can fall back to the reference strategy as needed - but we also want people to be able to interact with and import their files without needing to abandon longstanding practices or other infrastructure they might already have for using/creating them.
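As a sketch of what that might look like as an interface (the names here are hypothetical, not an existing API), each codec could implement all three modes so that clients can fall back from ingestion to introspection to a bare reference:
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ByteRange:
    """A named component of a file, addressed by byte offsets."""
    label: str
    start: int
    end: int

class Codec(ABC):
    """Hypothetical interface for the three interaction modes above."""

    @abstractmethod
    def reference(self, path: str) -> dict:
        """Treat the file as opaque binary: file type, size, hash tree."""

    @abstractmethod
    def introspect(self, path: str) -> list[ByteRange]:
        """Describe what is inside the file as byte ranges, without exporting the data."""

    @abstractmethod
    def ingest(self, path: str) -> tuple[list[tuple], dict]:
        """Export metadata and data as triples, plus the translation schema
        linking them back to the source bytes."""
```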
```{toctree}
hdf5
```
- Files
  - json
  - csv
  - mat


@ -0,0 +1,15 @@
```{index} DataLad
```
# DataLad
DataLad is a tool for managing datasets! It is built on top of {index}`git` and {index}`git-annex <git; annex>` for storage, and is capable of integrating with external hosting providers (through git annex).
It has a number of interesting extensions that we can learn from:
- [crawler](https://docs.datalad.org/projects/crawler/en/latest/) allows you to archive web pages
- [OSF Remote](https://github.com/datalad/datalad-osf/blob/main/datalad_osf/annex_remote.py) - example of an extension for interacting with OSF
## References
- DataLad repositories: https://github.com/datalad
- Docs: https://docs.datalad.org/en/latest/


@ -4,6 +4,7 @@
:caption: Data Structures
:maxdepth: 1
eris
datalad
dmc
eris
```

123
src/comparison/ld/hdt.md Normal file

@ -0,0 +1,123 @@
```{index} Linked Data; HDT
```
(hdt)=
# HDT
Like [Linked Data Fragments](ld_fragments), [HDT](https://www.rdfhdt.org/) is a transport and query format for linked data triples.
It is a compressed format that preserves headers to enable query and browsing without decompression.
## Format
It has [three components](https://www.rdfhdt.org/technical-specification/):
{attribution="https://www.rdfhdt.org/technical-specification/"}
> - **Header:** The Header holds metadata describing an HDT semantic dataset using plain RDF. It acts as an entry point for the consumer, who can have an initial idea of key properties of the content even before retrieving the whole dataset.
> - **Dictionary:** The Dictionary is a catalog comprising all the different terms used in the dataset, such as URIs, literals and blank nodes. A unique identifier (ID) is assigned to each term, enabling triples to be represented as tuples of three IDs, which reference their respective subject/predicate/object term from the dictionary. This is a first step toward compression, since it avoids long terms to be repeated again and again. Moreover, similar strings are now stored together inside the dictionary, fact that can be exploited to improve compression even more.
> - **Triples:** As stated before, the RDF triples can now be seen as tuples of three IDs. Therefore, the Triples section models the graph of relationships among the dataset terms. By understanding the typical properties of RDF graphs, we can come up with more efficient ways of representing this information, both to reduce the overall size, but also to provide efficient search/traversal operations.
### Header
A header contains
- At least one resource of type `hdt:Dataset`, which has
- Publication metadata - Where and when the dataset was published
- Statistical metadata - Number of triples, number of terms, etc.
- Format metadata - Encoding of dataset, which must have
- `hdt:dictionary`
- `hdt:triples`
- Additional metadata - anything else, eg. the signature in the example below
````{dropdown} HDT Header Example
```turtle
@prefix void: <http://rdfs.org/ns/void#>.
@prefix dc: <http://purl.org/dc/terms/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix hdt: <http://purl.org/HDT/hdt#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix swp: <http://www.w3.org/2004/03/trix/swp-2/>.
<http://example.org/ex/DBpediaEN>
a hdt:Dataset ;
a void:Dataset ;
hdt:publicationInformation :publication ;
hdt:statisticalInformation :statistics ;
hdt:formatInformation :format ;
hdt:additionalInformation :additional ;
void:triples "431440396" ;
void:properties "57986" ;
void:distinctSubjects "24791728" ;
void:distinctObjects "108927201" .
:publication dc:issued "2012-11-23T23:17:50+0000" ;
dc:license <http://www.gnu.org/copyleft/fdl.html> ;
dc:publisher [ a foaf:Organization ;
foaf:homepage <http://www.dbpedia.org>] ;
dc:source <http://downloads.dbpedia.org/3.8/en> ;
dc:title "DBpediaEN" ;
void:sparqlEndpoint <http://www.dbpedia.org/sparql> .
:statistics hdt:originalSize "110630364018" ;
hdt:hdtSize "3082795954" .
:format hdt:dictionary :dictionary ;
hdt:triplesBitmap :triples .
:dictionary dc:format hdt:dictionaryFour ;
hdt:dictionaryNamespaces [hdt:namespace [hdt:prefixLabel "dbpedia" ;
hdt:prefixURI "http://dbpedia.org/resource/"]] ;
hdt:dictionarynumSharedSubjectObject "22762644" ;
hdt:dictionarysizeStrings "1026354060" ;
hdt:dictionaryBlockSize "8" .
:triples dc:format hdt:triplesBitmap ;
hdt:triplesOrder "SPO" ;
hdt:triplesnumTriples "431440396" .
:additional swp:signature "AZ8QWE..." ;
swp:signatureMethod "DSA" .
```
````
### Dictionary
The dictionary replaces all terms in the dataset with short, unique IDs to make the dataset more compressible. Oddly, rather than being a simple lookup table, it is split into four sections: a "shared" section for terms that appear as both subject and object, separate sections for subject-only and object-only terms, and one for predicates. Terms are lexicographically ordered and [front coded](https://en.wikipedia.org/wiki/Incremental_encoding) to further aid compression.
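A toy illustration of front coding over a sorted term list (plain Python, not HDT's actual encoding):
```python
import os

def front_code(sorted_terms):
    """Store each term as (length of prefix shared with the previous term, suffix)."""
    prev, coded = "", []
    for term in sorted_terms:
        shared = len(os.path.commonprefix([prev, term]))
        coded.append((shared, term[shared:]))
        prev = term
    return coded

def front_decode(coded):
    prev, terms = "", []
    for shared, suffix in coded:
        prev = prev[:shared] + suffix
        terms.append(prev)
    return terms

# front_code(["http://dbpedia.org/resource/Aachen", "http://dbpedia.org/resource/Aalborg"])
# -> [(0, 'http://dbpedia.org/resource/Aachen'), (30, 'lborg')]
```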
Separating encoding information into a header dictionary is a straightforwardly good idea, and an argument for distributing linked data in 'packetized' forms rather than as a bunch of raw triples, as we do here.
### Triples
Triples are encoded as a tree: each subject forms a root, its predicates form the next level, and the objects of each (subject, predicate) pair form the leaves. Since the dictionary is ordered such that the subjects get the lowest IDs, it is possible to use an implicit representation of each subject (ie. subjects are not encoded at all). The predicate and object layers are each encoded as two parallel streams: a sequence of dictionary IDs (`Sp` for predicates, `So` for objects) and a "bitsequence" (`Bp`, `Bo`) which is `1` if the entry is the first child of its parent and `0` otherwise.
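A toy encoder for that layout, following the first-child-is-`1` convention described above (`Sp`/`Bp` for the predicate layer, `So`/`Bo` for the object layer) - a sketch of the idea, not the HDT wire format:
```python
def encode_bitmap_triples(id_triples):
    """Encode (subject, predicate, object) ID-triples into predicate/object layers.
    Subjects are implicit (they are the lowest, consecutive IDs), so only the
    predicate and object layers are returned."""
    sp, bp, so, bo = [], [], [], []
    prev_s = prev_p = None
    for s, p, o in sorted(id_triples):
        if s != prev_s:
            sp.append(p); bp.append(1)  # first predicate of a new subject
            so.append(o); bo.append(1)  # first object of that predicate
        elif p != prev_p:
            sp.append(p); bp.append(0)  # another predicate of the same subject
            so.append(o); bo.append(1)
        else:
            so.append(o); bo.append(0)  # another object of the same (s, p)
        prev_s, prev_p = s, p
    return sp, bp, so, bo

# encode_bitmap_triples([(1, 2, 3), (1, 2, 4), (1, 5, 3), (2, 2, 3)])
# -> ([2, 5, 2], [1, 0, 1], [3, 4, 3, 3], [1, 0, 1, 1])
```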
## Querying
The dictionary being uncompressed allows for the dataset to be indexed at a vocabulary level - it is possible to eg. 'find all datasets that use this set of terms,' as well as slightly more refined queries like 'find datasets that use this term as both subject and object.'
Lookup is fast for subject-based queries, but predicate and object queries are slower because of the bitmap triple encoding.
## Lessons
First, there are good strategies here for practical compression and serialization of RDF triples!
The most interesting thing for p2p-ld here is the header: we are also interested in making it possible to do restricted queries and indexing over containers of triples without needing to necessarily query, download, or unpack the entire dataset. The primary focus here is compression, which has add-on benefits like faster query performance because the dataset can be held in memory. We would instead like to focus on exposing hashed tree fragments that can encapsulate query logic - eg. a given RDF resource that might indicate the metadata for a type of experiment would be hashed as a tree, and queries can discover it by querying for the root or any of its child hashes. So we will take the ideas re: using the dictionary encoding without necessarily adopting HDT wholesale.
The bitmap encoding is also interesting, as according to their tests it outperforms other similar compression schemes and I/O times. We will keep this in mind as a potential serialization format for raw triple data.
The idea of including publication data in the header seems obvious, but according to the authors' later work that is not necessarily the case in the RDF world {cite}`polleresMoreDecentralizedVision2020`. Since p2p-ld is built explicitly around making identity and origin a more central component of linked data, we will further investigate using the {index}`VOID vocabulary <Ontology; VOID>` - https://www.w3.org/TR/void/
## References
- [HDT Homepage](https://www.rdfhdt.org/)
- Original Paper: {cite}`fernandezBinaryRDFRepresentation2013`
- Later contextualization: {cite}`polleresMoreDecentralizedVision2020`


@ -7,6 +7,14 @@
rdf
solid
ld_fragments
hdt
ld_platform
nanopubs
```
Linked data was born to be p2p. Many of the [initial, lofty visions](https://jon-e.net/surveillance-graphs/#semantic-web-priesthoods) of the [semantic web](https://jon-e.net/infrastructure/#linked-data-has-an-ambivalent-history-of-thought-regarding-the-l) are only possible with p2p systems - fluid, languagelike ontologies, portable personal data, truly decentralized information structuring on the web and so on {cite}`saundersSurveillanceGraphs2023,saundersDecentralizedInfrastructureNeuro2022`. That's one of the central goals of this project --- as might be obvious from its placeholder name: p2p-ld.
Don't just take my word for it tho:
{attribution="A more decentralized vision for Linked Data. Polleres et al. (2020)"}
> So, where does this leave us? We have seen a lot of resources being put into publishing Linked Data, but yet a publicly widely visible “killer app” is still missing. The reason for this, in the opinion and experiences of the authors, lies all to often in the frustrating experiences when trying to actually use Linked Data for building actual applications. Many attempts and projects end up still using a centralized warehousing approach, integrating a handful of data sets directly from their raw data sources, rather than being able to leverage their “lifted” Linked Data versions: the use and benefits of RDF and Linked Data over conventional databases and warehouses technologies, where more trained people are available, remain questionable. {cite}`polleresMoreDecentralizedVision2020`


@ -12,7 +12,9 @@ We depart from that vision, instead favoring radical vernacularism {cite}`saunde
## RDF And Friends
RDF has a lot of formats and
```{important}
Return here re: RDF canonicalization and IPFS https://github.com/multiformats/multicodec/pull/261
```
```{index} JSON-LD
```
@ -23,11 +25,7 @@ RDF has a lot of formats and
## Challenges
### Tabular and Array Data
```{important}
See https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
```
### Ordered Data
The edges from a node in a graph are unordered, which makes array and tabular data difficult to work with in RDF!
@ -44,16 +42,16 @@ eg. one would express `MyList` which contains the `Friends` `["Arnold", "Bob", "
:MyList :Friends :list1 .
:list1
    rdf:first :Amy ;
    rdf:rest :list2 .
:list2
    rdf:first :Bob ;
    rdf:rest :list3 .
:list3
    rdf:first :Carly ;
    rdf:rest rdf:nil .
```
And thankfully turtle has a shorthand, which isn't so bad:
@ -63,33 +61,33 @@ And thankfully turtle has a shorthand, which isn't so bad:
@prefix : <https://example.com> .
:MyList
    :Friends (
        :Amy
        :Bob
        :Carly
    ).
```
Both of these correspond to the triplet graph:
```{mermaid}
flowchart LR
    MyList
    list1
    list2
    list3
    nil
    Amy
    Bob
    Carly

    MyList -->|Friends| list1
    list1 -->|rest| list2
    list2 -->|rest| list3
    list3 -->|rest| nil
    list1 -->|first| Amy
    list2 -->|first| Bob
    list3 -->|first| Carly
```
Which is not great.
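In practice a library handles the `rdf:first`/`rdf:rest` plumbing, e.g. with rdflib (a sketch using the example namespace above):
```python
from rdflib import BNode, Graph, Namespace
from rdflib.collection import Collection

EX = Namespace("https://example.com/")

g = Graph()
g.bind("", EX)
head = BNode()
Collection(g, head, [EX.Amy, EX.Bob, EX.Carly])  # builds the rdf:first/rdf:rest chain
g.add((EX.MyList, EX.Friends, head))

# the turtle serializer collapses well-formed lists back into the ( ... ) shorthand
print(g.serialize(format="turtle"))
```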
@ -152,6 +150,315 @@ which can be expanded recursively to [mimic arrays](https://www.w3.org/TR/json-l
````
`````
### Tabular Data
As an overbrief summary, converting data from tables to RDF needs a schema mapping:
- Columns to properties
- Column names in the source table to symbolic names used within the conversion schema
- Datatypes (for representation in concrete RDF syntax)
According to the [Tabular Data to RDF](https://www.w3.org/TR/csv2rdf/) recommendation, one would convert the following table (encoded as `csv`):
```{csv-table}
countryCode,latitude,longitude,name
AD,42.5,1.6,Andorra
AE,23.4,53.8,"United Arab Emirates"
AF,33.9,67.7,Afghanistan
```
Into RDF in one of two modes, "minimal" or "standard":
`````{tab-set}
````{tab-item} Minimal mode
```turtle
@base <http://example.org/countries.csv> .
:8228a149-8efe-448d-b15f-8abf92e7bd17
<#countryCode> "AD" ;
<#latitude> "42.5" ;
<#longitude> "1.6" ;
<#name> "Andorra" .
:ec59dcfc-872a-4144-822b-9ad5e2c6149c
<#countryCode> "AE" ;
<#latitude> "23.4" ;
<#longitude> "53.8" ;
<#name> "United Arab Emirates" .
:e8f2e8e9-3d02-4bf5-b4f1-4794ba5b52c9
<#countryCode> "AF" ;
<#latitude> "33.9" ;
<#longitude> "67.7" ;
<#name> "Afghanistan" .
```
````
````{tab-item} Standard mode
```turtle
@base <http://example.org/countries.csv> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:d4f8e548-9601-4e41-aadb-09a8bce32625 a csvw:TableGroup ;
csvw:table [ a csvw:Table ;
csvw:url <http://example.org/countries.csv> ;
csvw:row [ a csvw:Row ;
csvw:rownum "1"^^xsd:integer ;
csvw:url <#row=2> ;
csvw:describes :8228a149-8efe-448d-b15f-8abf92e7bd17
], [ a csvw:Row ;
csvw:rownum "2"^^xsd:integer ;
csvw:url <#row=3> ;
csvw:describes :ec59dcfc-872a-4144-822b-9ad5e2c6149c
], [ a csvw:Row ;
csvw:rownum "3"^^xsd:integer ;
csvw:url <#row=4> ;
csvw:describes :e8f2e8e9-3d02-4bf5-b4f1-4794ba5b52c9
]
] .
:8228a149-8efe-448d-b15f-8abf92e7bd17
<#countryCode> "AD" ;
<#latitude> "42.5" ;
<#longitude> "1.6" ;
<#name> "Andorra" .
:ec59dcfc-872a-4144-822b-9ad5e2c6149c
<#countryCode> "AE" ;
<#latitude> "23.4" ;
<#longitude> "53.8" ;
<#name> "United Arab Emirates" .
:e8f2e8e9-3d02-4bf5-b4f1-4794ba5b52c9
<#countryCode> "AF" ;
<#latitude> "33.9" ;
<#longitude> "67.7" ;
<#name> "Afghanistan" .
```
````
`````
The recommendation also covers more complex situations. These make use of a JSON schema that handles mapping between the CSV data and RDF.
By default, each row of a table describes a single RDF resource, and each column has a single property (so each cell is a triple).
For example this table of concerts:
```{csv-table}
Name, Start Date, Location Name, Location Address, Ticket Url
B.B. King,2014-04-12T19:30,"Lupos Heartbreak Hotel","79 Washington St., Providence, RI",https://www.etix.com/ticket/1771656
B.B. King,2014-04-13T20:00,"Lynn Auditorium","Lynn, MA, 01901",http://frontgatetickets.com/venue.php?id=11766
```
Needs to be mapped to 3 separate resources with 7 properties. The values are not transformed, just grouped in different places under different resources. Notice how in the standard mode the `csvw:describes`{l=turtle} entry can have three objects. The turtle is surprisingly humane.
The JSON schema describes five concrete triples that carry the data from the CSV, and five `virtual` triples that give the resources types and link them together. Abstractions over table iterators take the form of `"#event-{_row}"` to create a resource `<#event-1>`, `<#event-2>`, etc. for each row.
`````{tab-set}
````{tab-item} Minimal mode
```turtle
@base <http://example.org/events-listing.csv> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<#event-1> a schema:MusicEvent ;
schema:name "B.B. King" ;
schema:startDate "2014-04-12T19:30:00"^^xsd:dateTime ;
schema:location <#place-1> ;
schema:offers <#offer-1> .
<#place-1> a schema:Place ;
schema:name "Lupos Heartbreak Hotel" ;
schema:address "79 Washington St., Providence, RI" .
<#offer-1> a schema:Offer ;
schema:url "https://www.etix.com/ticket/1771656"^^xsd:anyURI .
<#event-2> a schema:MusicEvent ;
schema:name "B.B. King" ;
schema:startDate "2014-04-13T20:00:00"^^xsd:dateTime ;
schema:location <#place-2> ;
schema:offers <#offer-2> .
<#place-2> a schema:Place ;
schema:name "Lynn Auditorium" ;
schema:address "Lynn, MA, 01901" .
<#offer-2> a schema:Offer ;
schema:url "http://frontgatetickets.com/venue.php?id=11766"^^xsd:anyURI .
```
````
````{tab-item} Standard mode
```turtle
@base <http://example.org/events-listing.csv> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:95cc7970-ce99-44b0-900c-e2c2c028bbd3 a csvw:TableGroup ;
csvw:table [ a csvw:Table ;
csvw:url <http://example.org/events-listing.csv> ;
csvw:row [ a csvw:Row ;
csvw:rownum 1 ;
csvw:url <#row=2> ;
csvw:describes <#event-1>, <#place-1>, <#offer-1>
], [ a csvw:Row ;
csvw:rownum 2 ;
csvw:url <#row=3> ;
csvw:describes <#event-2>, <#place-2>, <#offer-2>
]
] .
<#event-1> a schema:MusicEvent ;
schema:name "B.B. King" ;
schema:startDate "2014-04-12T19:30:00"^^xsd:dateTime ;
schema:location <#place-1> ;
schema:offers <#offer-1> .
<#place-1> a schema:Place ;
schema:name "Lupos Heartbreak Hotel" ;
schema:address "79 Washington St., Providence, RI" .
<#offer-1> a schema:Offer ;
schema:url "https://www.etix.com/ticket/1771656"^^xsd:anyURI .
<#event-2> a schema:MusicEvent ;
schema:name "B.B. King" ;
schema:startDate "2014-04-13T20:00:00"^^xsd:dateTime ;
schema:location <#place-2> ;
schema:offers <#offer-2> .
<#place-2> a schema:Place ;
schema:name "Lynn Auditorium" ;
schema:address "Lynn, MA, 01901" .
<#offer-2> a schema:Offer ;
schema:url "http://frontgatetickets.com/venue.php?id=11766"^^xsd:anyURI .
```
````
````{tab-item} JSON Schema
```json
{
"@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
"url": "events-listing.csv",
"dialect": {"trim": true},
"tableSchema": {
"columns": [{
"name": "name",
"titles": "Name",
"aboutUrl": "#event-{_row}",
"propertyUrl": "schema:name"
}, {
"name": "start_date",
"titles": "Start Date",
"datatype": {
"base": "datetime",
"format": "yyyy-MM-ddTHH:mm"
},
"aboutUrl": "#event-{_row}",
"propertyUrl": "schema:startDate"
}, {
"name": "location_name",
"titles": "Location Name",
"aboutUrl": "#place-{_row}",
"propertyUrl": "schema:name"
}, {
"name": "location_address",
"titles": "Location Address",
"aboutUrl": "#place-{_row}",
"propertyUrl": "schema:address"
}, {
"name": "ticket_url",
"titles": "Ticket Url",
"datatype": "anyURI",
"aboutUrl": "#offer-{_row}",
"propertyUrl": "schema:url"
}, {
"name": "type_event",
"virtual": true,
"aboutUrl": "#event-{_row}",
"propertyUrl": "rdf:type",
"valueUrl": "schema:MusicEvent"
}, {
"name": "type_place",
"virtual": true,
"aboutUrl": "#place-{_row}",
"propertyUrl": "rdf:type",
"valueUrl": "schema:Place"
}, {
"name": "type_offer",
"virtual": true,
"aboutUrl": "#offer-{_row}",
"propertyUrl": "rdf:type",
"valueUrl": "schema:Offer"
}, {
"name": "location",
"virtual": true,
"aboutUrl": "#event-{_row}",
"propertyUrl": "schema:location",
"valueUrl": "#place-{_row}"
}, {
"name": "offers",
"virtual": true,
"aboutUrl": "#event-{_row}",
"propertyUrl": "schema:offers",
"valueUrl": "#offer-{_row}"
}]
}
}
```
````
`````
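To make the template mechanics concrete, a rough sketch of how the `aboutUrl`/`propertyUrl` templates expand row by row - this ignores datatypes, virtual columns, and most of the rest of the csv2rdf algorithm:
```python
import csv
import io

COLUMNS = [  # abridged from the JSON schema above
    {"titles": "Name", "aboutUrl": "#event-{_row}", "propertyUrl": "schema:name"},
    {"titles": "Start Date", "aboutUrl": "#event-{_row}", "propertyUrl": "schema:startDate"},
    {"titles": "Location Name", "aboutUrl": "#place-{_row}", "propertyUrl": "schema:name"},
    {"titles": "Location Address", "aboutUrl": "#place-{_row}", "propertyUrl": "schema:address"},
    {"titles": "Ticket Url", "aboutUrl": "#offer-{_row}", "propertyUrl": "schema:url"},
]

def rows_to_triples(csv_text: str):
    """Yield one (subject, predicate, object) per cell, grouping cells into
    per-row resources via the aboutUrl templates."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for i, row in enumerate(reader, start=1):
        for col in COLUMNS:
            yield (col["aboutUrl"].format(_row=i), col["propertyUrl"], row[col["titles"]])
```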
One could imagine how this might generalize into multidimensional array data, but that immediately becomes pretty ridiculous - a better strategy in all cases that I can think of would be to just provide metadata about the array, like the encoding, sizes, and types of its axes and indices, and then link to the array.
I'll just leave this example of encoding the pixels in one RGB video frame as a joke.
```turtle
@prefix vid: <http://example.com/GodforsakenVideoSchema> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:myVideo a vid:VideoGroup ;
vid:video [ a vid:Video ;
vid:url <http://example.com/myVideo.mp4> ;
vid:frame [ a vid:Frame ;
vid:framenum 1 ;
vid:url <#frame=1> ;
vid:describes <#frame-1> ;
], [ a vid:Frame ;
vid:framenum 2 ;
vid:url <#frame=2> ;
vid:describes <#frame-2> ;
]
] .
<#frame-1> a vid:VideoFrame ;
vid:timestamp "2023-06-29T12:00:00"^^xsd:dateTime ;
vid:bitDepth 8 ;
vid:width 1920 ;
vid:height 1080 ;
vid:channels <#red-1>, <#green-1>, <#blue-1> .
<#red-1> a vid:VideoChannel ;
vid:pixels :pixel-1 .
:pixel-1 a vid:pixelValue ;
rdf:first 0 ;
rdf:rest :pixel-2 .
:pixel-2 a vid:pixelValue ;
rdf:first 46 ;
rdf:rest :pixel-3 .
# ...
:pixel-2073600 a vid:pixelValue ;
rdf:first 57 ;
rdf:rest rdf:nil .
```
### Naming
- All names have to be global. Relative names must resolve to a global name via contexts/prefixes. The alternative is blank nodes, which are treated as equivalent in eg. graph merges. Probably here enters pattern matching or whatever those things are called.
@ -162,6 +469,8 @@ which can be expanded recursively to [mimic arrays](https://www.w3.org/TR/json-l
- [RDF 1.1 Primer](https://www.w3.org/TR/rdf11-primer/)
- W3C Recommendation on generating RDF from tabular data: {cite}`tandyGeneratingRDFTabular2015`
- Tabular data model: https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/#parsing
- Metadata model: https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/
- {index}`JSON Schema` in RDF: {cite}`charpenayJSONSchemaRDF2023`
- [Turtle](https://www.w3.org/TR/rdf12-turtle/)
- [N-ary relations in RDF](https://www.w3.org/TR/swbp-n-aryRelations/)
@ -173,11 +482,19 @@ which can be expanded recursively to [mimic arrays](https://www.w3.org/TR/json-l
- [rdf-canonize-native](https://github.com/digitalbazaar/rdf-canonize-native)
- [biolink-model](https://github.com/biolink/biolink-model) for a nice example of generating multiple schema formats from a .yaml file.
- [linkml](https://linkml.io/) - modeling language for linked data {cite}`moxonLinkedDataModeling2021`
- Multidimensional arrays in linkml https://linkml.io/linkml/howtos/multidimensional-arrays.html
- [oaklib](https://incatools.github.io/ontology-access-kit/index.html) - python package for managing ontologies
- [rdflib](https://github.com/RDFLib/rdflib) - maybe the canonical python rdf library
- [csv2rdf](https://github.com/Swirrl/csv2rdf/)
### See Also
- [HYDRA vocabulary](https://www.hydra-cg.com/spec/latest/core/) - Linked Data plus REST
- [CORAL](https://github.com/jmchandonia/CORAL)
- [SEMTAB](https://www.cs.ox.ac.uk/isg/challenges/sem-tab/) - competition for mapping tabular data to RDF
- [SciSPARQL](https://www.ceur-ws.org/Vol-1272/paper_22.pdf) - an extension of SPARQL to include arrays.
### Example Datasets
- [RDF Data Dumps](https://www.w3.org/wiki/DataSetRDFDumps)
- [bio2rdf](https://download.bio2rdf.org)


@ -76,6 +76,38 @@ Though not explicitly in the protocol spec, two prominent design decisions are w
- **Peer Selection:** Which peers should I spend finite bandwidth uploading to? BitTorrent uses a variety of **Choke** algorithms that reward peers that reciprocate bandwidth. Choke algorithms are typically some variant of a 'tit-for-tat' strategy, although rarely the strict bitwise tit-for-tat favored by later blockchain systems and others that require a peer to upload an equivalent amount to what they have downloaded before they are given any additional pieces. Contrast this with [{index}`BitSwap`](#BitSwap) from IPFS. It is by *not* perfectly optimizing peer selection that BitTorrent is better capable of using more of its available network resources.
- **Piece Selection:** Which pieces should be uploaded/requested first? BitTorrent uses a **Rarest First** strategy, where a peer keeps track of the number of copies of each piece present in the swarm, and preferentially seeds the rarest pieces. This keeps the swarm healthy, rewarding keeping and sharing complete copies of files. This is in contrast to, eg. [SWARM](#SWARM) which explicitly rewards hosting and sharing the most in-demand pieces.
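As a toy sketch of rarest-first (not any particular client's implementation): track how many copies of each piece the swarm advertises and pick among the least-replicated missing pieces, breaking ties randomly so peers don't all converge on the same piece.
```python
import random

def pick_piece(missing, availability):
    """`missing` is the set of piece indices we still need; `availability`
    maps piece index -> number of peers advertising that piece."""
    if not missing:
        return None
    rarest = min(availability.get(p, 0) for p in missing)
    return random.choice([p for p in missing if availability.get(p, 0) == rarest])
```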
```{index} Web Seeds
```
## Web Seeds
One thing we want to mimic from bittorrent is the ability to use traditional web servers as additional peers, or to treat them as ["WebSeeds"](http://bittorrent.org/beps/bep_0019.html)[^BEP17].
HTTP servers allow you to specify a byte range to resume a download, but don't like the downloading client connecting hundreds of times to download the same file, jumping between pieces. To accommodate that, BEP 19 changes piece selection accordingly:
When downloading from bittorrent peers, we modify the "rarest first" algorithm such that for pieces with similar rareness we:
- Select pieces from smaller "gaps" in between completed blocks
- Select pieces closer to the end of the gap
- After 50% of the torrent is completed, for some random subset of pieces, ignore rarest first and fill in small gaps.
When downloading from HTTP servers:
- Start from some random location in the file (to avoid every peer having the same pieces at the start of the file)
- When partially completed, select the next longest gap between completed pieces
For multi-file torrents:
- Prefer bittorrent downloads for small files that are less than a piece size
We can treat {index}`libtorrent <BitTorrent; libtorrent, Client; libtorrent>`'s implementation as a reference.
- Libtorrent chooses pieces by [starting from the assumption that the client has all files and eliminating pieces for files we don't have](https://github.com/arvidn/libtorrent/blob/c2012b084c6654d681720ea0693d87a48bc95b14/src/web_peer_connection.cpp#L165-L171).
- On requesting a piece, it [checks for resume data](https://github.com/arvidn/libtorrent/blob/c2012b084c6654d681720ea0693d87a48bc95b14/src/web_peer_connection.cpp#L368-L394) if we have already partially downloaded it before, and modifies the start and length of the piece request
- It then [constructs an HTTP GET request](https://github.com/arvidn/libtorrent/blob/c2012b084c6654d681720ea0693d87a48bc95b14/src/web_peer_connection.cpp#L423-L442), using the [Range](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range) header to select some subsection of the file.
- When we [receive data](https://github.com/arvidn/libtorrent/blob/c2012b084c6654d681720ea0693d87a48bc95b14/src/web_peer_connection.cpp#L778) from the server, we wait until we receive the full header, then we parse the body of the response. If the size is different than what we expected, we disconnect from the server. Otherwise, we iterate through any chunks and store them.
- If the pieces received from the web seed [fail the hash check](https://github.com/arvidn/libtorrent/blob/c2012b084c6654d681720ea0693d87a48bc95b14/src/web_peer_connection.cpp#L578-L584), we mark the peer as not having the file, which bans it in the case of a single file torrent, but allows us to check whether the other files on the server have been changed.
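A minimal sketch of that request/verify flow using only the standard library - piece offsets and expected hashes are assumed to come from the torrent metadata:
```python
import hashlib
import urllib.request

def fetch_piece(url: str, start: int, length: int, expected_sha1: str) -> bytes:
    """Fetch one piece from a web seed with an HTTP Range request and verify it."""
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{start + length - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    if len(data) != length:
        # server ignored the Range header or sent something unexpected: disconnect
        raise IOError("unexpected response size from web seed")
    if hashlib.sha1(data).hexdigest() != expected_sha1:
        # mark this web seed as not having the file
        raise IOError("piece from web seed failed hash check")
    return data
```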
## Lessons
@ -88,7 +120,6 @@ Though not explicitly in the protocol spec, two prominent design decisions are w
- `.torrent` files make for a very **low barrier to entry** and are extremely **portable.** They also operate over the existing idioms of files and folders, rather than creating their own filesystem abstraction.
- Explicit peer and piece selection algorithms are left out of the protocol specification, allowing individual implementations to experiment with what works. This makes it possible to exploit the protocol by refusing to seed ever, but this rarely occurs in practice, as people are not the complete assholes imagined in worst-case scenarios of scarcity. Indeed even the most selfish peers have the intrinsic incentive to upload, as by aggressively seeding the pieces that a leeching peer already has, the other peers in the swarm are less likely to "waste" the bandwidth of the seeders and more bandwidth can be allocated to pieces that the leecher doesn't already have.
### Adapt
- **Metadata**. Currently all torrent metadata is contained within the tracker, so while it is possible to restore all the files that were indexed by a downed tracker, it is very difficult to restore all the metadata at a torrent level and above, eg. the organization of specific torrents into hierarchical categories that allow one to search for an artist, all the albums they have produced, all the versions of that album in different file formats, and so on.
@ -99,13 +130,22 @@ Though not explicitly in the protocol spec, two prominent design decisions are w
2. Maintain the possibility for loose anonymity where peers can share files without needing a large and well-connected social system to share files with them
3. Avoid significant performance penalties from guarantees of strong network-level anonymity like Tor.
- **Trackers** are a good idea, even if they could use some updating. It is good to have an explicit entrypoint specified with a distributed, social mechanism rather than prespecified as a hardcoded entry point. It is a good idea to make a clear space for social curation of information, rather than something that is intrinsically bound to a torrent at the time of uploading. We update the notion of trackers with [Peer Federations](#Peer-Federations).
- **Web Seeds**
- Torrent files handle single and multi-file torrents similarly, with the file structure in the info-dict. We can instead explicitly follow the lead of Bittorrent v2.0 and have per-file hash trees and URL references, avoiding some of the ambiguity in the web seed implementation that [requires us to do some manual path traversal](https://github.com/arvidn/libtorrent/blob/c2012b084c6654d681720ea0693d87a48bc95b14/src/web_peer_connection.cpp#L101-L121)
- We want to be able to integrate with existing servers and services, so we want to be able to find files by both the URL of the original file (if that is its "canonical" location) and its hash. Rather than adding a web seed as an additional source of a torrent file, we can treat it as one of the additional identifiers for the given container. This adds an additional argument in favor of nested containers as the unit of exchange. Eg. A data repository might have a single URL for a dataset that has multiple files within it, and the individual files might not have unique URLs (eg. the file picker generates a .zip file on the fly). A peer might want to bundle together multiple files from different locations. So it should be possible for each container to have multiple names, and when another peer requests a file by eg. a URL we can look within our containers for a match. This also allows handling files that might be uploaded in multiple places.
- We want to store the [Last-Modified](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified) data when importing a file from a web seed so that we can handle version changes in a given file without giving up on the web source entirely. When the `Last-Modified` is updated, we get the new file, re-hash it, and update the relevant file container if it has been changed. Otherwise we just store the new `Last-Modified`.
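A rough sketch of that check as a conditional request (standard library only; the stored header value would come from our container metadata):
```python
import urllib.request
from urllib.error import HTTPError

def source_changed(url: str, stored_last_modified: str):
    """Return the new Last-Modified value if the web source changed,
    or None if the server answers 304 Not Modified."""
    req = urllib.request.Request(
        url, headers={"If-Modified-Since": stored_last_modified}, method="HEAD"
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.headers.get("Last-Modified")
    except HTTPError as e:
        if e.code == 304:
            return None
        raise
```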
## References
- Bittorrent Protocol Specification (BEP 3): http://www.bittorrent.org/beps/bep_0003.html
- Bittorrent v2 (BEP 52): http://www.bittorrent.org/beps/bep_0052.html
- Magnet Links (BEP 9): http://www.bittorrent.org/beps/bep_0009.html
- WebSeeds (BEP 19): http://bittorrent.org/beps/bep_0019.html
- More on BitTorrent and incentives - {cite}`cohenIncentivesBuildRobustness2003`
- Notes about writing a bittorrent client from the GetRight author, particularly re: DHT: https://www.getright.com/torrentdev.html
- Nice example of implementing a very minimal bittorrent client in Python: https://markuseliasson.se/article/bittorrent-in-python/
[^announcelist]: Or, properly, in the `announce-list` per ([BEP 12](http://www.bittorrent.org/beps/bep_0012.html))
[^BEP17]: There is a parallel [BEP 17](https://www.bittorrent.org/beps/bep_0017.html) that allows modified HTTP servers to more directly seed, but since it requires changes to existing servers we are less concerned with it.


@ -45,3 +45,11 @@ If IPFS is {index}`BitTorrent` + {index}`git`, and {index}`ActivityPub` is {inde
## Differences
- Not permanent storage! Identities retain custody and control over objects in the network.
## References
- [IPFS-LD](https://github.com/ipfs/ipfs/issues/36)
- Discussions on gateways and {index}`Web Seed <Web Seeds>`-like things in IPFS:
- https://github.com/ipfs/kubo/issues/8234
-


@ -34,6 +34,7 @@ exclude_patterns = []
html_theme = 'furo'
html_static_path = ['_static']
html_baseurl = '/docs/'
pygments_dark_style = "github-dark"
# -----------
# Extension config
@ -43,7 +44,8 @@ myst_heading_anchors = 3
myst_enable_extensions = [
'tasklist',
'linkify',
'attrs_block'
'attrs_block',
'attrs_inline'
]
myst_linkify_fuzzy_links = False


@ -41,7 +41,7 @@ evolvability
:hidden:
triplets
adapter/index
codecs/index
translation/index
```


@ -46,6 +46,24 @@
keywords = {archived}
}
@article{fernandezBinaryRDFRepresentation2013,
title = {Binary {{RDF}} Representation for Publication and Exchange ({{HDT}})},
author = {Fernández, Javier D. and Martínez-Prieto, Miguel A. and Gutiérrez, Claudio and Polleres, Axel and Arias, Mario},
date = {2013-03-01},
journaltitle = {Journal of Web Semantics},
shortjournal = {Journal of Web Semantics},
volume = {19},
pages = {22--41},
issn = {1570-8268},
doi = {10.1016/j.websem.2013.01.002},
url = {https://www.sciencedirect.com/science/article/pii/S1570826813000036},
urldate = {2023-06-29},
abstract = {The current Web of Data is producing increasingly large RDF datasets. Massive publication efforts of RDF data driven by initiatives like the Linked Open Data movement, and the need to exchange large datasets has unveiled the drawbacks of traditional RDF representations, inspired and designed by a document-centric and human-readable Web. Among the main problems are high levels of verbosity/redundancy and weak machine-processable capabilities in the description of these datasets. This scenario calls for efficient formats for publication and exchange. This article presents a binary RDF representation addressing these issues. Based on a set of metrics that characterizes the skewed structure of real-world RDF data, we develop a proposal of an RDF representation that modularly partitions and efficiently represents three components of RDF datasets: Header information, a Dictionary, and the actual Triples structure (thus called HDT). Our experimental evaluation shows that datasets in HDT format can be compacted by more than fifteen times as compared to current naive representations, improving both parsing and processing while keeping a consistent publication scheme. Specific compression techniques over HDT further improve these compression rates and prove to outperform existing compression solutions for efficient RDF exchange.},
langid = {english},
keywords = {Binary formats,Data compaction and compression,linked data,RDF,RDF metrics},
file = {/Users/jonny/Dropbox/papers/zotero/F/FernándezJ/fernandez_2013_binary_rdf_representation_for_publication_and_exchange_(hdt).pdf}
}
@article{kunzePersistenceStatementsDescribing2017,
title = {Persistence {{Statements}}: {{Describing Digital Stickiness}}},
shorttitle = {Persistence {{Statements}}},
@ -159,6 +177,26 @@
file = {/Users/jonny/Dropbox/papers/zotero/O/OgdenM/ogden_2017_dat_-_distributed_dataset_synchronization_and_versioning.pdf}
}
@article{polleresMoreDecentralizedVision2020,
title = {A More Decentralized Vision for {{Linked Data}}},
author = {Polleres, Axel and Kamdar, Maulik Rajendra and Fernández, Javier David and Tudorache, Tania and Musen, Mark Alan},
editor = {Hitzler, Pascal and Janowicz, Krzysztof},
date = {2020-01-31},
journaltitle = {Semantic Web},
shortjournal = {SW},
volume = {11},
number = {1},
pages = {101--113},
issn = {22104968, 15700844},
doi = {10.3233/SW-190380},
url = {https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/SW-190380},
urldate = {2023-06-29},
archive = {https://web.archive.org/web/20230629182643/https://content.iospress.com/articles/semantic-web/sw190380},
langid = {english},
keywords = {archived},
file = {/Users/jonny/Dropbox/papers/zotero/P/PolleresA/polleres_2020_a_more_decentralized_vision_for_linked_data.pdf}
}
@online{saundersDecentralizedInfrastructureNeuro2022,
title = {Decentralized {{Infrastructure}} for ({{Neuro}})Science},
author = {Saunders, Jonny L.},


@ -1,8 +0,0 @@
# Translation
A toolkit for writing translations between formats and schemas!
## See also
- https://linkml.io/schema-automator/introduction.html#generalization-from-instance-data
- https://apps.islab.ntua.gr/d2rml/tr/d2rml/


@ -2,3 +2,7 @@
Translation/import of existing schema/formats.
## See also
- https://linkml.io/schema-automator/introduction.html#generalization-from-instance-data
- https://apps.islab.ntua.gr/d2rml/tr/d2rml/