2020-10-14

Why RDF* Is a Mess... and How to Fix It

TL;DR

RDF* is a new group of standards that aims to bridge the gap between RDF and property graphs. However, it has taken an "easy" route that made it ambiguous and backward incompatible. An alternative approach that doesn't suffer from the mentioned problems would be to introduce the notion of triple labels instead of using the embedded triples syntax.

How Standards Should Be Developed

Our CTO used to say that standards should be written solely by those who are implementing them. And although this statement may be a little too extreme in some cases, it's a good rule of thumb. The main reason for it is not that it will make the standards simple to implement. Moreover, I don't want to argue that allowing a simple implementation is the main requirement for a standard. What's more important is that the implementors have a combined exposure to the whole variety of potential use cases both from the user feedback and own experience of being a consumer of their own dogfood. Besides, it doesn't hurt, in the long run, that if something is simple to implement it's also, usually, simple to understand, reason about, and use.

Obviously, given all power to the implementers might lead to abuse of this power, but it's a second-order problem and there are known ways to mitigate it. Primarily, by assembling representatives of several implementations in a committee. An approach that is often frowned upon by hotheads due to alleged bureaucracy and the need for compromise, yet leading to much more thought-out and lasting standards. A good example of such is the Common Lisp standard.

RDF*

But I digressed, let's talk about RDF*. This is the major candidate to solve the problem of RDF triple reification that is crucial to basic compatibility between RDF and property graph representations. In short, RDF defines the simplest elegant abstraction for representing any kind of data — a triple that comprises a subject, predicate, and object. Triples are used to represent facts. Besides, there's a third component to a triple called a graph. So, in fact, the historic name triple, in the realm of RDF triple-stores, currently stands for a structure of 4 elements. The graph may be used to group triples. And despite the beauty of the simple and elegant concept of a triple, in theory, having this fourth component is essential for any serious data modeling.

Now, we know how to represent facts and arbitrarily group them together. So, the usual next-level question arises: how to represent facts about facts? Which leads to the problem of reification. Let's consider a simple example:

:I :have :dream .

This is a simple triple represented in the popular Turtle format. As RDF deals with resources, it assumes that there's some default prefix defined elsewhere (let's say it's https://foo.com/). The sam triple may be represented in the basic NTriples format like this:

<https://foo.com/I> <https://foo.com/have> <https://foo.com/dream> .

What if we want to express the facts that it is a quote by Martin Luther King and that it was uttered in 1963? There are at least 3 ways to approach it:

  1. The obvious but wrong (as it reimplements the semantics of RDF in RDF introducing unnecessary complexity) one of turning it into a set of triples each one describing some of the properties of the original statement with all of subject, predicate, and object being one of the properties:
    _:fact1 rdf:subject :I ;
            rdf:predicate :have ;
            rdf:object :dream ;
            meta:author wiki:Martin_Luther_King_Jr. ;
            meta:date "1963" .
    

    This works but, as I said, is conceptually wrong. It's the RDF's Java-style cancer of the semicolon. It leads to storage waste and poor performance.

  2. The other one is to use singleton predicates and assign metadata to them:
    :I :have#1 :dream .
    :have#1 meta:author wiki:Martin_Luther_King_Jr. ;
            meta:date "1963" .
    

    This is complete nonsense as it makes SPARQL queries unreasonably complex unless you implement special syntax that will ignore the #1 suffix, in the query engine.

  3. Yet another one, which I consider to be the best (close to perfect), is to use the graph to attach triple metadata instead of grouping the triples.
    :t1 { :I :have :dream . }
    :t1 meta:author wiki:Martin_Luther_King_Jr. ;
        meta:date "1963" .
    

    Here, we use another (there's many more of them :) ) RDF format — TriG, which is an extension to Turtle for representing graphs. :t1 is a unique graph that is associated with our triple, and it is also used as a subject resource for metadata triples. This approach also has minor drawbacks, the most important of which is that grouping triples needs more overhead. We'll have to add an additional triple if we'd like to express that :t1 belongs to a graph :g1:

    :t1 meta:graph :g1 .
    

    On the flip side, that will open the possibility of putting the triple into more than a single graph. In other words, now grouping may be expressed as yet another triple property, which it, in fact, is.

  4. RDF* takes a different approach: embedding triples. We may say it tries to do the obvious by attaching metadata directly to the triple:
    << :I :have :dream >> meta:author wiki:Martin_Luther_King_Jr. ;
                          meta:date "1963" .
    

    Besides, you can also embed a triple into an object:

    wiki:Martin_Luther_King_Jr. meta:quote << :I :have :dream >> .
    

    And do nesting:

    << wiki:Martin_Luther_King_Jr. meta:quote << :I :have :dream >> >> meta:date "1963" .
    

    Neat, at first glance... Yet, there are many pitfalls of this seemingly simple approach.

What's Wrong with RDF*

The first obvious limitation of this approach is that this syntax is not able to unambiguously express all the possible cases. What if we want to say something like this:

<< << :I :have :dream >> meta:author wiki:Martin_Luther_King_Jr. ;
                         meta:date "1963" >>
   meta:timestamp "2020-10-13T01:02:03" .

Such syntax is not specified in the RFC and it's unclear if it is allowed (it seems like it shouldn't be), although this is perfectly legit:

<< << :I :have :dream >> meta:author wiki:Marthin_Luther_King_Jr. >>
   meta:timestamp "2020-10-13T01:02:03" .

What about this:

wiki:Martin_Luther_King_Jr. meta:quote << :I :have :dream >> .
wiki:John_Doe meta:quote << :I :have :dream >> .

Do these statements refer to the same :I :have :dream . triple or two different ones? RDF* seems to assume (although the authors don't say that anywhere explicitly) that each subject-predicate-object combination is a unique triple, i.e. there can be no duplicates. But RDF doesn't mandate it. So, some triple stores support duplicate triples. In this case, there is no way to express referencing the same embedded triple in object position from multiple triples in Turtle*.

Moreover, there's a not in the RDF* spec that mentions that the embedded triples should not, actually, be asserted (at least, in the object position — it is unclear whether that also applies to them in the subject position). I.e. in the following example:

wiki:Martin_Luther_King_Jr. meta:quote << :I :have :dream >> .

the triple :I :have :dream might be treated differently then the toplevel triples, and the SPARQL query like SELECT ?obj { :I :have ?obj } will not return :dream. And only SELECT ?obj { ?s ?p << :I :have ?obj >> } will be an acceptable way of accessing the embedded triple. We're now questioning the most basic principles of RDF...

And I haven't even started talking about graphs (for there's no TriG* yet). With graphs, there're more unspecified corner cases. For instance, the principal question is: can an embedded triple have a different graph than the enclosing property triple. It seems like a desirable property, moreover, it will be hard to prevent the appearance of such situations from directly manipulating the triple store (and not by reading serialized TriG* statements).

This is, actually, the major problem with Turtle*: it makes an impression that it exists in a vacuum. To see RDF* in context, we have to understand that the core of RDF comprises a group of connected standards: NTriples/NQuads, Turtle/TriG, and SPARQL. Turtle is a successor to NTriples that makes it more human-friendly, but all of them build on the same syntax. And this syntax is used by SPARQL also. Yet, there's no NTriples* and it's unclear whether it can exist. GraphDB implements a hack by embedding the triple (or rather its hash, but that doesn't matter much) in a resource (like <urn:abcdefgh>), but, first of all, that's ugly, and, secondly, it also assumes no duplicates. Yet, NTriples is the basic data interchange format for RDF, and forsaking it is a huge mistake. There's also no TriG* yet as I mentioned. Another sign that RDF* is mostly a theoretical exercise. TriG* can be defined as an extension to TriG with Turtle* syntax, but I have already briefly mentioned the issue it will face.

To sum up, the main deficiencies of Trutle* are:

  • poor backward compatibilty (up to a point of not taking into account other related standards)
  • limited syntax
  • lots of underspecified corners

And, in my opinion, they originate from the desire to make the most obvious UI not paying attention to all other considerations at all.

An Obvious Fix

What's the alternative? Well, probably, Turtle* will end up being implemented in some way or another by all the triple-store vendors. Although, I expect the implementations to be quite incompatible due to the high level of underspecification in the RFC.

Yet, you don't have to wait for Turtle* as graph-based reification is already available and quite usable.

Also, if we still had a choice to define an extension to RDF with the same purpose as RDF*, I'd take another quite obvious route. It may be less sexy, but it is at least as simple to understand and much more consistent both within itself and with other RDF standards. Moreover, a similar approach is already part of RDF — blank nodes.

Blank nodes are resources that are used just as ids:

We could as well use a blank node instead of http://foo.com/t1 as our graph label resouce:


_:b1 { :I :have :dream . }
_:b1 meta:author wiki:Martin_Luther_King_Jr. ;
      meta:date "1963" .

The underscore syntax is for blank nodes so _:b1 will create a graph node that is used to connect other nodes together but we don't care about its representation at all.

Similarly to blank nodes syntax, we could introduce triple label syntax:

^:t1 :I :have :dream .

This statement will mean that our triple has a t1 label. Now, we could add metadata to that label — in exactly the same manner as with graph-based reification (*:t1 is a "dereference" of the triple label):

*:t1 meta:author wiki:Martin_Luther_King_Jr. ;
     meta:date "1963" .

This would map directly to the implementation that will be able to unambiguously link the triple to its properties. Also, it would enable this:

wiki:Martin_Luther_King_Jr. *:t1 .
wiki:John_Doe meta:quote *:t1 .

And defining NTriples*/NQuads* becomes possible, as well. Here are NQuads triples for the MLK quote (with a graph g1 added for completeness).

^:t1 <https://foo.com/I> <https://foo.com/have> <https://foo.com/dream> <https://foo.com/g1> .
^:t2 *:t1 <https://meta.org/author> <https://en.wikipedia.org/wiki/Martin_Luther_King_Jr.> .
*:t1 <https://meta.org/date> "1963" .

Alas, this simple and potent approach was overlooked for RDF*, so now we have to deal with a mess that is both hard to implement and will likely lead to more fragmentation.

No comments:

Post a Comment