In the last couple of days, I was ill and had to stay in bed, so I've used this time also to tidy up the work that accumulated over the past year in cl-nlp. That was especially timely, considering the interest that was expressed in using it by some people who I've met at the recent Lisp-related events.
I've even assembled a rough checklist of the things that need to be finished to get it to v.1.0 and beyond.
Besides, after finishing the basic cleaning, I've returned to one of the programming tasks that has racked my head for long: tree pretty-printing. In NLP, we constantly have to deal with various versions of parse trees, like the constituency or dependency ones, but the problem is that they are not easily visualized. And good visualization plays, at least for me, a critical role in effective debugging, ideation and programming. It's an essential part of a solid interactive experience that is one of the fundamental traits of Lisp development.
For instance, a constituency tree is usually presented as a Lisp list. Here's an infamous example from the Penn Treebank:
(S
(NP-SBJ
(NP (NNP Pierre) (NNP Vinken) )
(, ,)
(ADJP
(NP (CD 61) (NNS years) )
(JJ old) )
(, ,) )
(VP (MD will)
(VP (VB join)
(NP (DT the) (NN board) )
(PP-CLR (IN as)
(NP (DT a) (JJ nonexecutive) (NN director) ))
(NP-TMP (NNP Nov.) (CD 29) )))
(. .) )
A dependency tree has several representations, all of which are not really intuitive to grasp. This is the Stanford format:
amod(ideas-2, Colorless-0)
amod(ideas-2, green-1)
nsubj(sleep-3, ideas-2)
root(sleep-3, sleep-3)
advmod(sleep-3, furiously-4)
punct(sleep-3, .-5)
And here's the CoNLL one:
0 Colorless _ _ ADJ 2
1 green _ _ ADJ 2
2 ideas _ _ NOUN 3
3 sleep _ _ NOUN 3
4 furiously _ _ ADV 3
5 . _ _ PUNCT 3
Also, Google's Parsey McParseface offers another - presumably, more visual - representation (using the asciitree lib). Still, it is not good enough, as it messes with the order of words in a sentence.
Input: Bob brought the pizza to Alice .
Parse:
brought VBD ROOT
+-- Bob NNP nsubj
+-- pizza NN dobj
| +-- the DT det
+-- to IN prep
| +-- Alice NNP pobj
+-- . . punct
As you see, dependency trees are not trivial to visualize (or pretty-print) in ASCII. The authors of Spacy creatively approached solving this problem by using CSS in their displaCy tool:
However, it seems like an overkill to bring a browser to with you for such a small task. And it's also not very scalable:
I, in fact, was always interested in creative ways of text-based visualization. So, I thought of ways to represent parse trees in ASCII.
With constituency ones, it's rather trivial:
> (pprint-tree '(TOP (S (NP (NN <This:0 0..4>))
(VP (VBZ <is:1 5..7>)
(NP (DT <a:2 8..9>)
(JJ <simple:3 10..16>)
(NN <test:4 17..21>)))
(|.| <.:5 22..23>)))
TOP
:
S
.-----------:---------.
: VP :
: .---------. :
NP : NP :
: : .----:-----. :
NN VBZ DT JJ NN .
: : : : : :
This is a simple test .
The dependencies are trickier, but I managed to find a way to show them without compromising the sentence word order:
> (pprint-deps '(<This:0 0..4> <is:1 5..7> <a:2 8..9> <simple:3 10..16> <test:4 17..21> <.:5 22..23>)
'(root(_ROOT_-0, is-1) nsubj(is-1, This-0) dobj(is-1, test-4) det(test-4, a-2) amod(test-4, simple-3) punct(is-1, .-5)))
Colorless green ideas sleep furiously .
^ ^ .^ .^. ^ ^
: `. amod .´: ::: : :
`..... amod .....´: ::: : :
`. nsubj .´:: : :
:`. advmod .´ :
:`.... punct .....´
root
And it looks pretty neat even for longer sentences:
We hold these truths to be self - evident , that all men are created equal , that they are endowed by their Creator with certain unalienable Rights , that among these are Life , Liberty and the pursuit of Happiness .
^ .^. ^ .^ ^ .^. ^ ^ .^ ^ ^ ^ .^ ^ .^. ^ ^ ^ ^ ^ .^. ^. ^ .^. ^. ^ ^ .^ ^ ^ ^. ^ .^. ^. ^ ^. ^ ^ .^. ^. ^ ^
`. nsubj .´:: `. det .´: `. aux .´:: : `. punct .´: : : `. det .´: `. auxpass .´:: : : : : `. auxpass .´:: :: `. poss .´:: :: : `. amod .´: : : :`. pobj .´ ::: :`. punct .´ :`. cc .´ `. det .´:: :`. pobj .´ :
:`... dobj ...´ :: `. npadvmod .´: : : : ::`. advcl .´ : : : ::: :: :: :: `...... amod ......´: : : : ::: :: :: :`. prep .´ :
:: :`..... acomp .....´ : : `.. nsubjpass ..´:: : : : ::: :: :: :`......... pobj .........´ : : : ::: :: :`...... conj .......´ :
:`......... advcl ..........´ : : ::`... punct ...´ : : ::: :: :`. prep .´ : : : ::: :`.... conj ....´ :
:`..................... punct ......................´ `........... mark ...........´:: : : ::: :`... pobj ....´ : : : ::`. attr .´ :
:: :: : : ::`. agent .´ : : `... prep ....´: :
:: :: : `.. nsubjpass ..´:: : `...... mark ......´: :
:: :: `....... mark .......´:: : : :
:: :: :`............................ punct .............................´ : :
:: :: :`........................................ advcl .........................................´ :
:: :`................ advcl ................´ :
:`...................................... ccomp .......................................´ :
:`............................................................................................................................................ punct .............................................................................................................................................´
root
However, writing the visualization code was one of the most intimidating programming tasks I've ever encountered. One explanation is that trees are most naturally processed in depth-first order top-down, while the visualization requires bottom-up BFS approach. The other may be that pixel-perfect (or, in this case, character-perfect display is always tedious). As far as I'm concerned, this is not a sufficient explanation, but I couldn't find any other. The ugliest part of this machinery is deps->levels function that prints the dependency relations in a layered fashion. The problem is to properly calculate minimal space necessary to accommodate both tokens and dependency labels and to account for different cases when the token has outgoing dependency arcs or doesn't. In theory sounds pretty easy, but in practice, it turned out a nightmare.
And all of this assumes projective trees (non-intersecting arcs), as well as doesn't know how to show on one level two arcs going from one token in two directions. Finally, I still couldn't align the two trees (constituency and dependency) above and under the sentence. Here's the target:
TOP
:
S
.----------------:--------------.
: VP :
: .---------. :
NP : NP :
: : .----:---------. :
NN VBZ DT JJ NN .
This is a simple test .
^ .^. ^ ^ .^ ^
`. nsubj .´:: : `. amod .´: :
:: `.... det ....´: :
:`..... dobj .....´ :
:`...... punct ......´
root
and this is how it prints for now (one more challenge was to transfer additional offsets from dependencies into the constituency tree):
TOP
:
S
.-----------:---------.
: VP :
: .---------. :
NP : NP :
: : .----:-----. :
NN VBZ DT JJ NN .
This is a simple test .
^ .^. ^ ^ .^ ^
`. nsubj .´:: : `. amod .´: :
:: `.... det ....´: :
:`..... dobj .....´ :
:`...... punct ......´
root
Well, the good news is that it is usable, but it still needs more work to be feature complete. I wonder what was I doing wrong: maybe, someone can come up with a clean and simple implementation of this functionality (in any language)? I consider it a great coding challenge, although it may require a week of your free time and a bunch of dead neurons to accomplish. But if you're willing to take it, I'd be glad to see the results... :D