2020-08-12

Announcing CL-AGRAPH

AllegroGraph (agraph) is one of the hugely underappreciated pieces of software freely available to the programming community. This is especially true for the Lisp community: agraph is one of the largest, most successful, and most mature Lisp software projects around, yet it is hardly used by lispers. In part, its relative obscurity may be explained by the following factors:

  • the software is commercial... but it also has a free version with very reasonable limitations that can be used in the majority of hobby and other small projects
  • it is written in a commercial Lisp — Allegro CL... but it also has a free version; and it can run agraph
  • out-of-the-box, agraph has only an ACL client (in addition to the ones in mainstream languages like Java or Python)... in this post, a portable Lisp client is introduced to fill this gap

In fact, free access to agraph may enable the development of a wide variety of useful applications, and I plan another post about the little-known value that RDF may bring to almost any software project. Yet, to showcase it in full with code, I was missing the client. Besides, I have an occasional personal need for it, and so, with some hacking over several weekends, here it is: a minimal portable Lisp client for agraph that has, IMHO, a couple of interesting high-level features and can also be rather easily extended to support other RDF backends.

Disclosure: for the last 2.5 years, I've been working for Franz on AllegroGraph. Over that period, I was able to participate in the development and support of different parts of the system and gradually came to appreciate it both as an engineering accomplishment and as an extremely useful data store.

The HTTP API

cl-agraph provides a way to interact from a running Lisp process with AllegroGraph via its HTTP API. I call it minimal as the client implements only the essential CRUD commands and the SPARQL interface. That is the critical part that enables the usage of the triple store as part of any application. However, the agraph API also provides many administrative capabilities. Those are not (yet) supported by cl-agraph, although they may be implemented rather easily (I'll show how this can be done below). Yet, those endpoints are accessible directly both via the WebView management web interface and the agtool command-line utility. So, the absence of their support in the client doesn't preclude the effective use of agraph from any application.

The client uses nquads as the data interchange format. The availability of standard data formats, such as nquads, is one of the great advantages of RDF as a way to model any data. And it also made the development of this client much easier. To work with nquads, I have extended the cl-ntriples library by Viktor Anyakin (ntriples is a simpler version of the nquads format).

The basic data structure of agraph is a `triple` (actually, a quad, but the name "triple" is more common):

(defstruct (triple (:conc-name nil)
                   (:print-object print-triple))
  s p o g
  triple-id
  obj-lang obj-type)

Triple components are uris, blank nodes, and literals (strings, numbers, booleans).

When the triple is printed, it is displayed in the standard nquads format:

AGRAPH> (<> (make-blank-node) "http://foo.com/foo" "bar" :g "http://foo.com/baz" :lang "en")
_:bn1899 <http://foo.com/foo> "bar"@en <http://foo.com/baz> .
AGRAPH> (s *)
_:bn1899
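
To make the printed form more concrete, here is a minimal, self-contained sketch of how such a structure could be rendered as an nquads line. It is not the library's print-triple: the names demo-uri, demo-triple, format-term, and triple-line are hypothetical, and keywords stand in for blank nodes purely for illustration.

```lisp
;; Hypothetical stand-ins; none of these names belong to cl-agraph.
(defstruct (demo-uri (:constructor demo-uri (name)))
  name)

(defstruct (demo-triple (:conc-name demo-))
  s p o g lang)

(defun format-term (term &optional lang)
  "Render TERM in nquads style: <uri>, \"literal\"@lang, or _:blank-node."
  (etypecase term
    (demo-uri (format nil "<~A>" (demo-uri-name term)))
    (string (format nil "~S~@[@~A~]" term lang))   ; language tag is optional
    (symbol (format nil "_:~(~A~)" term))))        ; symbols play blank nodes

(defun triple-line (tr)
  "Return the nquads line for TR; the graph is emitted only when present."
  (format nil "~A ~A ~A~@[ ~A~] ."
          (format-term (demo-s tr))
          (format-term (demo-p tr))
          (format-term (demo-o tr) (demo-lang tr))
          (when (demo-g tr) (format-term (demo-g tr)))))
```

Calling triple-line on a demo-triple built from the same components as in the transcript above yields a line of the same shape, with a trailing graph uri only when the g slot is set.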

I have chosen the diamond sign (<>) to signify triples (as in ntriples/nquads formats, the URIs are enclosed in it). So, the API functions that deal with triples are mostly accompanied by this sign. The parts enclosed in <> in the nquads representation are uris. Also, the very short names s, p, o, and g are used as triple parts accessors. This is a generally discouraged approach, but from my experience working with AG, I have learned that these functions are used very often and no one will be mistaken when seeing them, in the context of triple-store interactions. Also, usually, they will be used with a package prefix anyway, so the common code pattern, in a convenient setup, may look like this:

(defpackage #:foo
  (:local-nicknames (#:ag #:agraph))
  ...)

FOO> (ag:with-ag (:repo "bar")
       ;; Though, there's a more efficient variant of triple iteration
       ;; that will be shown below
       (dolist (tr (ag:get<> :p (ag:uri "baz:quux")))
         (when (ag:blank-node-p (ag:s tr))
           (ag:rem<> tr))))

The function <> ensures proper types of the triple components. There's also the raw make-triple that creates the triple structure using the arguments as is.

RDF permits specifying aliases for uri prefixes and the uri function is aware of that:

AGRAPH> (register-prefix "foo" "http://foo.com/")
"http://foo.com/"
AGRAPH> (<> (make-blank-node) "rdf:type" (uri "foo:quux"))
_:bn1921 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://foo.com/quux> .

You can see that we have used the default expansion for the prefix "rdf" and the user-defined one for the prefix "foo". The object of the triple had to be explicitly converted to a uri (unlike the predicate) before it was passed to the <> function: objects may also be strings, and it's impossible to reliably tell a string literal from a uri automatically.
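
The prefix mechanism can be pictured with a small, self-contained sketch. This is only an assumption about how such a registry might work, not the library's code: *demo-prefixes*, demo-register-prefix, and demo-expand are hypothetical names.

```lisp
(defvar *demo-prefixes* (make-hash-table :test 'equal)
  "Hypothetical prefix registry mapping a prefix string to its expansion.")

(defun demo-register-prefix (prefix expansion)
  "Remember EXPANSION for PREFIX and return the expansion."
  (setf (gethash prefix *demo-prefixes*) expansion))

(defun demo-expand (name)
  "Expand \"prefix:local\" when the prefix is registered, else return NAME as is."
  (let* ((colon (position #\: name))
         (expansion (and colon
                         (gethash (subseq name 0 colon) *demo-prefixes*))))
    (if expansion
        (concatenate 'string expansion (subseq name (1+ colon)))
        name)))
```

With "foo" registered as "http://foo.com/", the string "foo:quux" expands to the full uri, while a full uri such as "http://foo.com/x" passes through unchanged because "http" is not a registered prefix.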

The other core data structure of CL-AGRAPH is ag-config. It lists the connection parameters that are used to make the client HTTP requests. Most of the parameters have reasonable defaults. The macro with-ag is a common Lisp with-style macro that is intended for creating an interaction context with fixed config parameters. Usually, it should be given at least the :repo argument.

Here are some simple interactions with agraph:

AGRAPH> (open-ag :repo "test" :port 12345)
NIL
AGRAPH> (let ((subj (make-blank-node)))
          (add<> (<> subj "rdf:type" (uri "foo:bar"))
                 (<> subj "foo:baz" "quux" :lang "en")))
2
AGRAPH> (get<>)
(_:bF049DE41x7325578 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://foo.com/bar> .
 _:bF049DE41x7325578 <http://foo.com/baz> "quux"@en .)
AGRAPH> (rem<> :g (uri "foo:bar"))
0
AGRAPH> (get<>)
(_:bF049DE41x7325578 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://foo.com/bar> .
 _:bF049DE41x7325578 <http://foo.com/baz> "quux"@en .)
AGRAPH> (rem<> :o (uri "foo:bar"))
1
AGRAPH> (get<>)
(_:bF049DE41x7325578 <http://foo.com/baz> "quux"@en .)
AGRAPH> (count<>)
1
AGRAPH> (close-ag)
T

CL-AGRAPH defines the function map<> and the macro do<> in the standard Lisp iteration paradigm. Map performs iteration with accumulation, while do is intended to be used just for the side-effects. Actually, do<> is expressed, internally, in terms of map<>. The main advantage of using these functions instead of just calling the normal mapcar or dolist on the results of get<> is their streaming mode of operation. Instead of pulling, potentially, millions of triples from the triple-store into the program's memory (or organizing paging-based iteration, as get<> has a :limit option), the triples are streamed from the backend and discarded after being processed.

AGRAPH> (map<> 'o)
("quux")

Unlike the usual mapcar, this call didn't have the second argument: it iterated all triples in the repository. Yet, surely, it can be limited to certain subjects, predicates, objects, and/or graphs:

AGRAPH> (do<> (tr :s "foo" :p "bar" :o "baz" :g "quuz")
          (print tr))
NIL  ; sorry, there were no triples with such parameters

AGRAPH> (do<> (tr :p (uri "foo:baz"))
          (print tr))
_:bF049DE41x7325578 <http://foo.com/baz> "quux"@en .
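
The streaming idea itself is independent of agraph and can be sketched in a few lines. In the following illustration, a string stream stands in for the HTTP response body, and demo-map-lines and demo-do-lines are hypothetical names, not the library's map<> and do<>.

```lisp
(defun demo-map-lines (fn stream &key (collect t))
  "Call FN on each line of STREAM as soon as it is read.
Results are accumulated only when COLLECT is true, so the input itself
is never held in memory as a whole."
  (let ((results '()))
    (loop for line = (read-line stream nil)
          while line
          do (let ((result (funcall fn line)))
               (when collect (push result results))))
    (nreverse results)))

(defmacro demo-do-lines ((var stream) &body body)
  "Side-effect-only iteration expressed in terms of DEMO-MAP-LINES."
  `(demo-map-lines (lambda (,var) ,@body) ,stream :collect nil))
```

The do-style macro simply switches accumulation off, which mirrors the relation between the two operators described above.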

Defining Your Own Commands

All these commands use, under the hood, the ag-req utility that can be utilized to define other API wrappers. For example, here is a function to get all the duplicate triples in the repository (agraph permits several triples with the same SPOG to be added):


(defun duplicates (&key mode)
  (ag-req "/statements/duplicates" (list "mode" mode)))

However, the simplest commands can be defined even without ag-req, by using just the high-level functions. Here is a small example — the function that checks if a triple exists in the repository:


(defun storedp (tr)
  (assert (triple-p tr))
  (when (get<> :s (s tr) :p (p tr) :o (o tr) :limit 1)
    t))

NB. As the client uses a REST HTTP + nquads protocol, it should be rather easy to extend it to support other triple-store backends such as GraphDB, Stardog or Virtuoso, provided they also support this method of interaction.

Sessions & Transactions

Now, let's return to open-ag and with-ag. Both of them have a sessionp keyword argument (that is, by default, true for with-ag and nil for open-ag). A session is a mechanism for speeding up some agraph operations and for running transactions. Without an active session, each update is committed at once, which is much more costly than batching up groups of operations. However, if a session is established, you need to explicitly call commit to enact the modifications to the triple-store. I.e. sessions create an implicit transaction. with-ag will commit the transaction after executing its body. It is also possible to manually roll back the changes. Any unhandled error inside with-ag will also, effectively, cause a rollback: the session will be terminated without a commit.

An agraph session has a certain lifetime/timeout that can also be specified as a parameter to open-ag/with-ag. However, there's also a maximum possible lifetime that is configured by the triple-store admin. Once the timeout expires, the session is terminated. with-ag will try to rerun the transaction if it encounters a terminated session, but that will be done just once. And the user should be careful not to place transaction-unfriendly code in the body of with-ag. open-ag, on the contrary, defaults to sessionless mode. This way, the additional complexity of timeouts and transactions is removed. In this mode, the only thing that open-ag does is configure the connection spec and internal caches.
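
The commit-on-success, rollback-on-error behavior can be illustrated with a generic with-style macro. This is a sketch under assumptions, not the implementation of with-ag: demo-commit and demo-rollback are hypothetical placeholders that only record what would have happened, the rollback is explicit here (whereas the client simply terminates the session without a commit), and the single retry after a session timeout is left out.

```lisp
(defvar *demo-log* '()
  "Records the hypothetical commit and rollback calls, newest first.")

(defun demo-commit () (push :commit *demo-log*))
(defun demo-rollback () (push :rollback *demo-log*))

(defmacro with-demo-session (&body body)
  "Run BODY and commit afterwards; on any non-local exit, roll back instead."
  (let ((done (gensym "DONE")))
    `(let ((,done nil))
       (unwind-protect
            (multiple-value-prog1 (progn ,@body)
              (demo-commit)
              (setf ,done t))
         ;; Reached on errors and other non-local exits before the commit.
         (unless ,done (demo-rollback))))))
```

Using unwind-protect means that the cleanup branch runs for any kind of exit, so the body never leaves a half-applied batch of updates behind.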

Symbolic SPARQL

Another thing worth some attention in this client is its symbolic SPARQL facility that allows generating SPARQL requests from s-expressions. Query generation from sexps is a common Lisp trick that can be found in such libraries as CLSQL & co. However, the implementation I came up with is, from my point of view, much simpler.

Here are a few simple examples that give a general impression of symbolic SPARQL:


AGRAPH> (generate-sparql '(select * (?s ?p ?o))
                         nil)
"SELECT 
* 
{
?S ?P ?O .
 }
"
AGRAPH> (generate-sparql '(select * (:union (?s ?p ?o)
                                            (:graph ?g (?s ?p ?o))))
                         nil)
"SELECT 
* 
{ {
?S ?P ?O .
 }
UNION
{
GRAPH ?G {
?S ?P ?O .
 } } }
"

The function generate-sparql uses a very simple evaluation rule. It will print any symbols as is, while lists are processed recursively in 3 possible ways depending on the first element:

  • a keyword in first position means that custom rules should be invoked;
  • any other symbol causes the list to be treated as a triples pattern containing subject, predicate(s), and object(s);
  • another list invokes recursive processing.
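
The three rules above can be sketched as a tiny standalone translator. It is an illustration only: demo-expr and demo-custom are hypothetical names, the output whitespace differs from what generate-sparql produces, and only two custom rules are shown.

```lisp
(defgeneric demo-custom (key args)
  (:documentation "Hypothetical hook for forms that start with a keyword."))

(defun demo-expr (form)
  "Translate FORM into a SPARQL fragment following the three list rules."
  (cond ((stringp form) (format nil "~S" form))     ; literals keep their quotes
        ((atom form) (format nil "~A" form))        ; symbols are printed as is
        ((keywordp (first form))                    ; rule 1: custom rule
         (demo-custom (first form) (rest form)))
        ((symbolp (first form))                     ; rule 2: triples pattern
         (format nil "~{~A~^ ~} ." (mapcar #'demo-expr form)))
        (t                                          ; rule 3: recurse into lists
         (format nil "~{~A~^ ~}" (mapcar #'demo-expr form)))))

(defmethod demo-custom ((key (eql :graph)) args)
  (format nil "GRAPH ~A { ~A }"
          (demo-expr (first args)) (demo-expr (second args))))

(defmethod demo-custom ((key (eql :union)) args)
  (format nil "~{{ ~A }~^ UNION ~}" (mapcar #'demo-expr args)))
```

Because the keyword dispatch goes through a generic function, a new construct only needs one more method and no changes to the core translator.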

Now, the custom rules are defined as methods of a generic function process-custom, which makes this mechanism quite extensible. Let's see an example SPARQL sexp and the custom rules that were used to handle it:


AGRAPH> (generate-sparql '(select ?date ?title
                                  ((?g |dc:date| ?date)
                                   (:filter (:> ?date (:|^| "2005-08-01T00:00:00Z"
                                                            |xsd:dateTime|)))
                                   (:graph ?g (?b |dc:title| ?title))))
                         nil)
"SELECT 
?DATE 
?TITLE 
{ ?G dc:date ?DATE .
 FILTER ( (?DATE > \"2005-08-01T00:00:00Z\"^^xsd:dateTime ) )
 GRAPH ?G {
?B dc:title ?TITLE .
 } }
"

(defgeneric process-custom (key tree out &key)
  ...
  (:method ((key (eql :|^|)) tree out &key)
    (assert (and (dyadic tree)
                 (stringp (first tree))
                 (symbolp (second tree))))
    (format out "~S^^~A" (first tree) (second tree)))
  (:method ((key (eql :filter)) tree out &key)
    (assert (single tree))
    (format out "FILTER ( ~A )~%"
            (process-expr (first tree) nil :top-level nil)))
  (:method ((key (eql :graph)) tree out &key)
    (assert (dyadic tree))
    (format out "GRAPH ~A " (first tree))
    (process-expr (second tree) out :top-level t))
  (:method ((key (eql :>)) tree out &key top-level)
    (process-arithm key tree out :top-level top-level))
  ...

The sexp-based form of SPARQL queries may seem unusual, but it is much more convenient and powerful than the standard string format:

  • it is more convenient to edit;
  • passing variables is easy;
  • and you can write functions and macros to construct these expressions from parts, which is awkward and error-prone with the string-based format.
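
As an illustration of the last two points, here is how the date query shown earlier might be assembled from parts with plain backquote. The helper names dated-after and titles-query are hypothetical; the functions only build the sexp, which would then be passed to generate-sparql in the same way as in the examples above.

```lisp
(defun dated-after (var date)
  "Return a FILTER sexp restricting VAR to values later than DATE."
  `(:filter (:> ,var (:|^| ,date |xsd:dateTime|))))

(defun titles-query (date)
  "Build the sexp of a query for titles in graphs dated after DATE."
  `(select ?date ?title
           ((?g |dc:date| ?date)
            ,(dated-after '?date date)
            (:graph ?g (?b |dc:title| ?title)))))
```

The date is an ordinary function argument here, and the filter fragment can be reused in other queries without any string concatenation or escaping.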

I have been considering implementing symbolic SPARQL ever since I started working with it, as programmatically filling string templates is so primitive. Finally, I've found time to realize this idea!

Afterword

This announcement is targeted mainly at those who are already "enlightened" about RDF triple stores and were eagerly waiting for a chance to try agraph. :) I hope that it provides a good starting point for you to actually do it. I believe the agraph download webpage gives enough guidance on either installing it on your machine or running it from the AWS Marketplace.

As I said, there will be another post (for now, unclear when) that will be an introduction to RDF capabilities for those developers who are still "in ignorance" about the possibilities that triple stores may open for their applications. Stay tuned...
