<spanid="index-0"></span><spanid="id1"></span><h1>HDT<aclass="headerlink"href="#hdt"title="Permalink to this heading">#</a></h1>
<p>Like <aclass="reference internal"href="ld_fragments.html"><spanclass="doc std std-doc">Linked Data Fragments</span></a>, <aclass="reference external"href="https://www.rdfhdt.org/">HDT</a> is a transport and query format for linked data triples.</p>
<p>It is a compressed format that preserves headers to enable query and browsing without decompression.</p>
<sectionid="format">
<h2>Format<aclass="headerlink"href="#format"title="Permalink to this heading">#</a></h2>
<p>It has <aclass="reference external"href="https://www.rdfhdt.org/technical-specification/">three components</a>:</p>
<blockquote>
<div><ulclass="simple">
<li><p><strong>Header:</strong> The Header holds metadata describing an HDT semantic dataset using plain RDF. It acts as an entry point for the consumer, who can have an initial idea of key properties of the content even before retrieving the whole dataset.</p></li>
<li><p><strong>Dictionary:</strong> The Dictionary is a catalog comprising all the different terms used in the dataset, such as URIs, literals and blank nodes. A unique identifier (ID) is assigned to each term, enabling triples to be represented as tuples of three IDs, which reference their respective subject/predicate/object term from the dictionary. This is a first step toward compression, since it avoids long terms to be repeated again and again. Moreover, similar strings are now stored together inside the dictionary, fact that can be exploited to improve compression even more.</p></li>
<li><p><strong>Triples:</strong> As stated before, the RDF triples can now be seen as tuples of three IDs. Therefore, the Triples section models the graph of relationships among the dataset terms. By understanding the typical properties of RDF graphs, we can come up with more efficient ways of representing this information, both to reduce the overall size, but also to provide efficient search/traversal operations.</p></li>
</ul></div>
</blockquote>
<sectionid="dictionary">
<h3>Dictionary<aclass="headerlink"href="#dictionary"title="Permalink to this heading">#</a></h3>
<p>The dictionary replaces all terms in the dataset with short, unique IDs to make the dataset more compressible. Oddly, rather than being a single lookup table, the dictionary is split into four sections: a “shared” section for terms that appear as both subject and object, one section each for terms that appear only as subjects or only as objects, and a separate section for predicates. Within each section, terms are lexicographically ordered and <aclass="reference external"href="https://en.wikipedia.org/wiki/Incremental_encoding">front coded</a> to further aid compression.</p>
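<p>Front coding over a sorted term list can be sketched as below. This is a minimal illustration, not HDT's on-disk layout: the function names are hypothetical, and the sketch codes every term against its predecessor, whereas a real implementation would likely restart from a full string at regular intervals so that lookups do not have to decode from the beginning.</p>

```python
import os


def front_code(sorted_terms):
    """Store each term as (length of prefix shared with the previous term, remaining suffix)."""
    encoded = []
    previous = ""
    for term in sorted_terms:
        shared = len(os.path.commonprefix([previous, term]))
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the original terms by re-attaching each shared prefix."""
    terms = []
    previous = ""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms


# Hypothetical example terms; sorting first is what makes neighbours share long prefixes.
terms = sorted([
    "http://example.org/alice",
    "http://example.org/alicia",
    "http://example.org/bob",
])
encoded = front_code(terms)
```

<p>Because IRIs in one dataset tend to share long namespace prefixes, most entries after the first reduce to a small integer and a short suffix.</p>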
<p>Separating encoding information into a header dictionary is a straightforwardly good idea, and an argument for distributing linked data in ‘packetized’ forms rather than as a bunch of raw triples, as we do here.</p>
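<p>As a rough sketch of how a sectioned dictionary turns triples into ID tuples, the following assigns IDs section by section. The numbering scheme (shared terms first, then subject-only and object-only terms continuing from the end of the shared range) is our reading of the format rather than a reference implementation, and the function names and example terms are hypothetical.</p>

```python
def build_dictionary(triples):
    """Split terms into shared / subject-only / object-only / predicate sections and assign IDs."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}

    shared = sorted(subjects & objects)        # terms used as both subject and object
    subject_only = sorted(subjects - objects)
    object_only = sorted(objects - subjects)

    # Shared terms get the same ID in both roles; the other sections continue after them.
    subject_ids = {t: i for i, t in enumerate(shared + subject_only, start=1)}
    object_ids = {t: i for i, t in enumerate(shared + object_only, start=1)}
    predicate_ids = {t: i for i, t in enumerate(sorted(predicates), start=1)}
    return subject_ids, predicate_ids, object_ids


def encode_triples(triples, subject_ids, predicate_ids, object_ids):
    """Replace every term with its ID, giving tuples of three integers."""
    return [(subject_ids[s], predicate_ids[p], object_ids[o]) for s, p, o in triples]


# Hypothetical example data.
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:name", '"Alice"'),
]
subject_ids, predicate_ids, object_ids = build_dictionary(triples)
id_triples = encode_triples(triples, subject_ids, predicate_ids, object_ids)
```

<p>One consequence of the shared section is that a term such as <codeclass="docutils literal notranslate"><spanclass="pre">ex:bob</span></code> has the same ID whether it appears as a subject or as an object, which makes following links from object to subject cheap.</p>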
</section>
<sectionid="triples">
<h3>Triples<aclass="headerlink"href="#triples"title="Permalink to this heading">#</a></h3>
<p>Triples are encoded as a tree, where each subject forms a root, its predicates are its children, and the objects of each predicate form the next layer down. Since the dictionary assigns subjects the lowest IDs and the triples are sorted by subject, it is possible to use an implicit representation of each subject (i.e. subjects are not encoded). The predicate and object layers are each encoded with two parallel sequences: each predicate entry has one <codeclass="docutils literal notranslate"><spanclass="pre">Sp</span></code> entry for its dictionary ID, and one <codeclass="docutils literal notranslate"><spanclass="pre">Bp</span></code> “bitsequence” entry which is <codeclass="docutils literal notranslate"><spanclass="pre">1</span></code> if the entry is the first child of its parent and <codeclass="docutils literal notranslate"><spanclass="pre">0</span></code> otherwise. The object layer uses an <codeclass="docutils literal notranslate"><spanclass="pre">So</span></code> and <codeclass="docutils literal notranslate"><spanclass="pre">Bo</span></code> pair in the same way.</p>
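<p>The encoding described above can be sketched as follows. This is an illustration under stated assumptions, not the HDT reference implementation: triples are already tuples of integer IDs, subject IDs run from 1 with no gaps, the object layer uses <codeclass="docutils literal notranslate"><spanclass="pre">So</span></code>/<codeclass="docutils literal notranslate"><spanclass="pre">Bo</span></code> sequences that mirror <codeclass="docutils literal notranslate"><spanclass="pre">Sp</span></code>/<codeclass="docutils literal notranslate"><spanclass="pre">Bp</span></code>, and plain Python lists stand in for packed bit arrays. The function names are hypothetical.</p>

```python
from itertools import groupby


def encode_bitmap_triples(triples):
    """Encode (s, p, o) ID triples as Sp/Bp (predicate layer) and So/Bo (object layer)."""
    ordered = sorted(set(triples))
    subjects = sorted({s for s, _, _ in ordered})
    # Assumption: subjects are numbered 1..n, so they can be left implicit.
    if subjects != list(range(1, len(subjects) + 1)):
        raise ValueError("subject IDs must be consecutive and start at 1")

    sp, bp, so, bo = [], [], [], []
    for _, by_subject in groupby(ordered, key=lambda t: t[0]):
        first_predicate = True
        for predicate, by_predicate in groupby(by_subject, key=lambda t: t[1]):
            sp.append(predicate)
            bp.append(1 if first_predicate else 0)  # 1 marks the first predicate of a subject
            first_predicate = False
            first_object = True
            for _, _, obj in by_predicate:
                so.append(obj)
                bo.append(1 if first_object else 0)  # 1 marks the first object of a predicate
                first_object = False
    return sp, bp, so, bo


def decode_bitmap_triples(sp, bp, so, bo):
    """Recover the sorted triples; the subject is inferred by counting 1 bits in Bp."""
    triples = []
    subject = 0
    predicate_index = -1
    for obj, first_object in zip(so, bo):
        if first_object:
            predicate_index += 1
            if bp[predicate_index]:
                subject += 1
        triples.append((subject, sp[predicate_index], obj))
    return triples


# Hypothetical ID triples.
example = [(1, 2, 3), (1, 2, 4), (1, 5, 3), (2, 2, 6)]
sp, bp, so, bo = encode_bitmap_triples(example)
```

<p>Four triples (twelve IDs) become seven IDs plus seven bits, and the saving grows as subjects and predicates fan out to more children.</p>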
</section>
</section>
<sectionid="querying">
<h2>Querying<aclass="headerlink"href="#querying"title="Permalink to this heading">#</a></h2>
<p>Because the dictionary is stored uncompressed, datasets can be indexed at the vocabulary level: it is possible to, e.g., ‘find all datasets that use this set of terms,’ as well as to run slightly more refined queries like ‘find datasets that use this term as both subject and object.’</p>
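<p>A vocabulary-level index of this kind could look roughly like the sketch below. It assumes each dataset's dictionary has already been read into a mapping from section name to terms; the class, the section names, and the dataset labels are all hypothetical.</p>

```python
from collections import defaultdict


class VocabularyIndex:
    """Inverted index from dictionary terms to the datasets that use them."""

    def __init__(self):
        self._by_term = defaultdict(set)  # term -> datasets using it in any role
        self._shared = defaultdict(set)   # term -> datasets using it as subject and object

    def add(self, dataset, sections):
        """Register one dataset; `sections` maps a section name to its terms."""
        for section, terms in sections.items():
            for term in terms:
                self._by_term[term].add(dataset)
                if section == "shared":
                    self._shared[term].add(dataset)

    def using_all(self, terms):
        """Datasets whose dictionaries contain every one of the given terms."""
        term_sets = [self._by_term.get(term, set()) for term in terms]
        return set.intersection(*term_sets) if term_sets else set()

    def as_subject_and_object(self, term):
        """Datasets in which the term sits in the shared section."""
        return set(self._shared.get(term, set()))


# Hypothetical dictionaries for two datasets.
index = VocabularyIndex()
index.add("dataset-a", {
    "shared": ["ex:bob"],
    "subjects": ["ex:alice"],
    "predicates": ["ex:knows"],
    "objects": [],
})
index.add("dataset-b", {
    "shared": [],
    "subjects": ["ex:bob"],
    "predicates": ["ex:knows", "ex:name"],
    "objects": ["ex:carol"],
})
```

<p>Neither query touches the triples section, which is the point: only the dictionaries need to be fetched and read.</p>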
<p>Lookup is fast for subject-based queries, but predicate and object queries are slower because the bitmap triple encoding is organized by subject.</p>
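<p>The asymmetry can be seen in a small sketch over hand-written sequences. The four lists below are hypothetical example data in the layout described in the Triples section, with an assumed <codeclass="docutils literal notranslate"><spanclass="pre">So</span></code>/<codeclass="docutils literal notranslate"><spanclass="pre">Bo</span></code> pair for objects. The list comprehension in <codeclass="docutils literal notranslate"><spanclass="pre">list_bounds</span></code> stands in for the select structure a real implementation would precompute.</p>

```python
# Hypothetical encoded data for the triples (1,2,3), (1,2,4), (1,5,3), (2,2,6).
SP = [2, 5, 2]     # predicate IDs
BP = [1, 0, 1]     # 1 = first predicate of a subject
SO = [3, 4, 3, 6]  # object IDs
BO = [1, 0, 1, 1]  # 1 = first object of a predicate entry


def list_bounds(bits, n):
    """Half-open range covered by the n-th (0-based) child list in a bitsequence."""
    ones = [i for i, bit in enumerate(bits) if bit]  # stand-in for a select index
    start = ones[n]
    end = ones[n + 1] if n + 1 < len(ones) else len(bits)
    return start, end


def objects_for(subject, predicate):
    """Subject-bound lookup: jump to the subject's predicates, then to their objects."""
    p_start, p_end = list_bounds(BP, subject - 1)
    for p_index in range(p_start, p_end):
        if SP[p_index] == predicate:
            o_start, o_end = list_bounds(BO, p_index)
            return SO[o_start:o_end]
    return []


def subjects_for_object(obj):
    """Object-bound lookup: without an extra index, every object entry is scanned."""
    found = []
    subject = 0
    p_index = -1
    for candidate, first_object in zip(SO, BO):
        if first_object:
            p_index += 1
            if BP[p_index]:
                subject += 1
        if candidate == obj:
            found.append((subject, SP[p_index]))
    return found
```

<p>The subject-bound lookup only visits the entries under one subject, while the object-bound lookup has to walk the whole object layer.</p>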
</section>
<sectionid="lessons">
<h2>Lessons<aclass="headerlink"href="#lessons"title="Permalink to this heading">#</a></h2>
<p>First, there are good strategies here for practical compression and serialization of RDF triples!</p>
<p>The most interesting thing for p2p-ld here is the header: we are also interested in making it possible to do restricted queries and indexing over containers of triples without needing to query, download, or unpack the entire dataset. HDT's primary focus is compression, which has add-on benefits like faster query performance because the dataset can be held in memory. We would instead like to focus on exposing hashed tree fragments that can encapsulate query logic: e.g. a given RDF resource that might indicate the metadata for a type of experiment would be hashed as a tree, and queries can discover it by querying for the root or any of its child hashes. So we will borrow the dictionary encoding ideas without necessarily adopting HDT wholesale.</p>
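<p>To make the hashed-tree idea concrete, here is one speculative sketch. Nothing in it is a settled p2p-ld design: the choice of SHA-256, hashing each predicate/object pair as a leaf, folding the subject into the root, and all function and resource names are assumptions made purely for illustration.</p>

```python
import hashlib


def sha(text):
    """Hex SHA-256 digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_resource(subject, properties):
    """Hash a resource as a two-level tree.

    `properties` maps each predicate to a list of objects. Every
    predicate/object pair becomes a child hash, and the root hash
    covers the subject plus the sorted child hashes.
    """
    children = sorted(
        sha(f"{predicate} {obj}")
        for predicate, objects in properties.items()
        for obj in objects
    )
    root = sha(subject + "\n" + "\n".join(children))
    return root, children


def build_lookup(resources):
    """Map every root and child hash to the subjects it was derived from."""
    lookup = {}
    for subject, properties in resources.items():
        root, children = hash_resource(subject, properties)
        for digest in [root, *children]:
            lookup.setdefault(digest, set()).add(subject)
    return lookup


# Hypothetical experiment metadata.
resources = {
    "ex:exp1": {"rdf:type": ["ex:Experiment"], "ex:rig": ["ex:rig2"]},
    "ex:exp2": {"rdf:type": ["ex:Experiment"], "ex:rig": ["ex:rig7"]},
}
lookup = build_lookup(resources)
```

<p>Querying by a root hash finds exactly one resource, while querying by a child hash such as the one for <codeclass="docutils literal notranslate"><spanclass="pre">rdf:type</span> <spanclass="pre">ex:Experiment</span></code> finds every resource that shares that statement.</p>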
<p>The bitmap encoding is also interesting: according to the authors’ tests, it outperforms similar schemes in both compression and I/O time. We will keep this in mind as a potential serialization format for raw triple data.</p>
<p>The idea of including publication data in the header seems obvious, but according to the authors’ later work that is not necessarily the case in the RDF world <spanid="id2">[<aclass="reference internal"href="../../references.html#id14"title="Axel Polleres, Maulik Rajendra Kamdar, Javier David Fernández, Tania Tudorache, and Mark Alan Musen. A more decentralized vision for Linked Data. Semantic Web, 11(1):101–113, 2020-01-31. URL: https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/SW-190380 (visited on 2023-06-29), doi:10.3233/SW-190380.">Polleres <em>et al.</em>, 2020</a>]</span>. Since p2p-ld is built explicitly around making identity and origin a more central component of linked data, we will further investigate using the <spanclass="target"id="index-1"></span>VOID vocabulary: <aclass="reference external"href="https://www.w3.org/TR/void/">https://www.w3.org/TR/void/</a></p>
</section>
<sectionid="references">
<h2>References<aclass="headerlink"href="#references"title="Permalink to this heading">#</a></h2>
<li><p>Original Paper: <spanid="id3">[<aclass="reference internal"href="../../references.html#id6"title="Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias. Binary RDF representation for publication and exchange (HDT). Journal of Web Semantics, 19:22–41, 2013-03-01. URL: https://www.sciencedirect.com/science/article/pii/S1570826813000036 (visited on 2023-06-29), doi:10.1016/j.websem.2013.01.002.">Fernández <em>et al.</em>, 2013</a>]</span></p></li>
<li><p>Later contextualization: <spanid="id4">[<aclass="reference internal"href="../../references.html#id14"title="Axel Polleres, Maulik Rajendra Kamdar, Javier David Fernández, Tania Tudorache, and Mark Alan Musen. A more decentralized vision for Linked Data. Semantic Web, 11(1):101–113, 2020-01-31. URL: https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/SW-190380 (visited on 2023-06-29), doi:10.3233/SW-190380.">Polleres <em>et al.</em>, 2020</a>]</span></p></li>