I'm a little bit behind my schedule of implementing NLTK examples in Lisp, with no posts on the topic in March. It doesn't mean that work on CL-NLP has stopped - I've just had an unexpected vacation and also worked on parts related to writing programs for the excellent
Natural Language Processing by Michael Collins Coursera course.
Today we'll start looking at Chapter 2, but we'll do it from the end, first exploring the topic of Wordnet.
Wordnet and Lisp
Wordnet is a database of word meanings (senses) and word semantic
relations (synonymy, hyponymy and various other ymy's).
It is a really important freely available resource, so having easy access
to it is very valuable. At the same time, getting Wordnet to work is
somewhat involved technically. That's why I've decided to implement
its support in the
cl-nlp-contrib
system, so that the basic
cl-nlp
functionality can be loaded without the additional hassle of having
to ensure Wordnet (and other similar stuff) is loaded.
The implementation of the Wordnet interface in
cl-nlp
is incomplete in
the sense that it doesn't provide matching entities for all Wordnet
tables and so doesn't support all possible interactions with it
out of the box. Yet it is sufficient to run all of NLTK's examples, and
it is trivial to implement the rest as the need arises (I plan to do
that later this year).
There are several other Wordnet interfaces for CL developed over the
years, starting as long ago as the early nineties.
None of them uses my desired storage format (sqlite), and their
overall support ranges from non-existent to little if any. Besides,
one of the principles behind
cl-nlp
is to prefer a uniform interface
and ease of extensibility over performance and efficiency
with the premise that a specialized efficient version can be easily
implemented by restricting the more general one, while the opposite
is often much harder. And adapting the existing packages to the desired
interface didn't seem like a straightforward task,
so I decided to implement Wordnet support from scratch.
Wordnet installation
Wordnet isn't a corpus, although NLTK categorizes it as such, placing
it in the
nltk.corpus
module. It is a word database distributed in
a custom format, as well as in the binary formats of several databases.
I have picked the sqlite3-based distribution as the easiest to work with —
it's a single file of roughly 64 MB in size which can be downloaded from
the WNSQL project page.
The default place where
cl-nlp
will search for Wordnet is the
data/
directory. If you call
(download :wordnet)
, it will fetch a copy
and place it there.
There are several options to work with sqlite databases in Lisp, and
I have chosen
CLSQL
as the most mature and comprehensive system —
it's a go-to library for doing simple SQL stuff in CL. Besides the
low-level interface, it provides basic ORM functionality
and some other goodies.
Also, to work with sqlite from Lisp, a C client library is needed.
It can be obtained in various ways depending on your OS and distribution
(on most Linuxes with
apt-get
or similar package managers).
Yet there's still the issue of making Lisp find the installed library.
So for Linux I decided to ship the appropriate binaries in the project's
lib/
dir. If you are not on Linux and CLSQL can't find the sqlite native library,
you need to manually call the following command before you load
cl-nlp-contrib
with quicklisp or ASDF:
(clsql:push-library-path "path to sqlite3 or libsqlite3 directory")
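For example, on macOS with a Homebrew-installed sqlite the call could look
something like this (the path is purely illustrative — point it at wherever
libsqlite3 lives on your system):
;; hypothetical path — adjust for your machine
(clsql:push-library-path "/usr/local/opt/sqlite/lib/")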
Supporting Wordnet interaction
Using CLSQL, we can define classes for Wordnet entities, such as
Synset, Sense or Lexlink.
(clsql:def-view-class synset ()
  ((synsetid :initarg :synsetid :reader synset-id
             :type integer :db-kind :key)
   (pos :initarg :pos :reader synset-pos
        :type char)
   (lexdomainid :initarg :lexdomain
                :type smallint :db-constraints (:not-null))
   (definition :initarg :definition :reader synset-def
               :type string)
   (word :initarg :word :db-kind :virtual)
   (sensenum :initarg :sensenum :db-kind :virtual))
  (:base-table "synsets"))
Notice the presence of the virtual slots
word
and
sensenum
, which are used
to cache data from other tables in
synset
so that it can be displayed like it is
done in NLTK. They are populated lazily with a
slot-unbound
method,
just like we've seen in Chapter 1.
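To illustrate the idea, here's a minimal sketch of such lazy population
(not the actual cl-nlp code — the way the value is fetched here is just
an assumption for the example):
;; when an unbound virtual slot is first read, compute its value,
;; store it in the slot and return it
(defmethod slot-unbound (class (synset synset) (slot-name (eql 'word)))
  (declare (ignore class))
  ;; for illustration we reuse wn:lemmas and lemma-word to get the synset's
  ;; first word; the real code may do this differently
  (setf (slot-value synset 'word)
        (lemma-word (first (wn:lemmas synset)))))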
All the DB entities are also cached in a global cache,
so that the same requests don't produce different objects and Wordnet objects
can be compared with
eql
.
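The cache itself can be as simple as a hash-table keyed by entity type and
DB id. Here's a rough sketch of the idea (the names are made up for the
example, not cl-nlp's actual internals):
(defvar *wordnet-cache* (make-hash-table :test 'equal)
  "Cache of Wordnet DB entities keyed by (class . id).")

(defun cached-entity (class id fetch-fn)
  "Return the cached instance of CLASS with DB id ID,
fetching it with FETCH-FN and caching it if absent."
  (let ((key (cons class id)))
    (or (gethash key *wordnet-cache*)
        (setf (gethash key *wordnet-cache*) (funcall fetch-fn)))))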
There are some inconsistencies in the Wordnet lexicon which have also
migrated to the NLTK interface (and then NLTK introduced some more).
I've done a cleanup on that, so some classes and slot accessors don't
fully mimic the names of their DB counterparts or NLTK's interface.
For instance, in our dictionary
word
refers to a raw string, and
lemma
to its representation as a DB entity.
CLSQL also has special syntactic support for writing SQL queries
in Lisp:
(clsql:select [item_id] [as [count [*]] 'num]
              :from [table]
              :where [is [null [sale_date]]]
              :group-by [item_id]
              :order-by '((num :desc))
              :limit 1)
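Note that the square-bracket forms are a reader extension, so to use this
syntax in your own code you first have to enable it:
;; enable CLSQL's [...] reader syntax for SQL expressions
(clsql:enable-sql-reader-syntax)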
Yet I've chosen to use another very simple custom solution for that,
which I've kept in the back of my mind for a long time, ever since I
started working with CLSQL's literal SQL syntax. Here's how we get
all
lemma
objects for a specific
synset
—
they are connected through
senses
:
(query (select 'lemma
               `(:where wordid
                 :in ,(select '(wordid :from senses)
                              `(:where synsetid := ,(synset-id synset))))))
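For reference, this corresponds roughly to the following SQL (assuming the
table and column names of the WordNet SQL distribution):
;; SELECT * FROM words
;;  WHERE wordid IN (SELECT wordid FROM senses WHERE synsetid = ?);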
For the sake of efficiency we don't open a new DB connection on every
such request, but use a combination of the following:
- there's a CLSQL special variable
*default-database*
that points to
the current DB connection; it is used by all Wordnet interfacing
functions
- the connection can be established with
connect-wordnet
which is
given an instance of the sql-wordnet3
class (it has the usual default
instance <wordnet>
, but you can use any other instance
which may connect to some other Wordnet SQL DB, like MySQL, if needed)
- the
ensure-db
function checks if the connection is present and
opens it otherwise
- it is assumed that all functions will perform Wordnet interaction
inside the
with-wordnet
macro (sketched below), which implements the standard Lisp
resource-management call-with-*
pattern for the Wordnet connection
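Here's a minimal sketch of what such a macro could look like, assuming
ensure-db returns the open connection (the actual cl-nlp definition may
differ):
(defmacro with-wordnet ((&optional (wordnet '<wordnet>)) &body body)
  ;; bind CLSQL's current connection to WORDNET's (possibly freshly opened) one
  `(let ((clsql:*default-database* (ensure-db ,wordnet)))
     ,@body))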
Running NLTK's examples
Now, let's run the examples from the book to see how they work in our interface.
First, connect to Wordnet:
WORDNET> (connect-wordnet <wordnet>)
#<CLSQL-SQLITE3:SQLITE3-DATABASE /cl-nlp/data/wordnet30.sqlite OPEN {1011D1FAF3}>
Now we can run the functions with the existing connection passed
implicitly through the special variable.
wn
is a special convenience
package which implements the logic of implicitly relying on the
default
<wordnet>
. There are functions with the same names in
wordnet
package that take a
wordnet
instance as their first argument,
similar to how other parts of
cl-nlp
are organized.
WORDNET> (wn:synsets "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
WORDNET> (synsets <wordnet> "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
The printing of
synsets
and other Wordnet objects is performed with
custom
print-object
methods. Here's one of these methods — the one for the
sample
entity — which uses the built-in
print-unreadable-object
macro (the methods for
synset
and the other entities follow the same pattern):
(defmethod print-object ((sample sample) stream)
  (print-unreadable-object (sample stream :type t :identity t)
    (with-slots (sample sampleid) sample
      (format stream "~A ~A" sample sampleid))))
Let's proceed further.
WORDNET> (wn:words (wn:synset "car.n.1"))
("auto" "automobile" "car" "machine" "motorcar")
This snippet is equivalent to the following NLTK code:
wn.synset('car.n.01').lemma_names
Definitions and examples:
WORDNET> (synset-def (wn:synset "car.n.1"))
"a motor vehicle with four wheels; usually propelled by an internal combustion engine"
WORDNET> (wn:examples (wn:synset "car.n.1"))
("he needs a car to get to work")
Next are lemmas.
WORDNET> (wn:lemmas (wn:synset "car.n.1"))
(#<LEMMA auto 9953 {100386BCC3}> #<LEMMA automobile 10063 {100386BCF3}>
#<LEMMA car 20722 {100386BC03}> #<LEMMA machine 80312 {100386BD23}>
#<LEMMA motorcar 86898 {1003250543}>)
As I've written earlier, there's some confusion in Wordnet between
words and lemmas, and, IMHO, it is propagated in NLTK. The term
lemma
only appears once in the Wordnet DB, as a column of the
words
table.
And
synsets
are not directly related to
words
— they are
linked through the
senses
table. NLTK calls a synset-to-word pairing a
lemma, which only adds to the confusion. I decided to call the entities of the
words
table
lemma
s. Now, how do you implement the equivalent of the following NLTK call?
wn.lemma('car.n.01.automobile')
We can do it like this:
WORDNET> (remove-if-not #`(string= "automobile" (lemma-word %))
                        (wn:lemmas (wn:synset "car.n.1")))
(#<LEMMA automobile 10063 {100386BCF3}>)
But, actually, what we need is not a raw
lemma
object but a
sense
object,
because a sense is exactly the mapping of a word to its meaning (defined by a synset),
and it's the proper Wordnet entity for this:
WORDNET> (wn:sense "car~automobile.n.1")
#<SENSE car~auto.n.1 28261 {1011F226C3}>
You can also get at the raw lemma by its DB id (and, through it, at its synset):
WORDNET> (wn:synset (wn:lemma 10063))
#<SYNSET auto.n.1 102958343 {100386BB33}>
Notice that the synset here is named
auto
. This is different from
NLTK's
'car'
, and the reason is that it's unclear from the
Wordnet DB what the "primary" lemma for a synset is, so I just use
the word which appears first in the DB. NLTK probably uses the same approach,
but with a different ordering — compare the next output with the book's one:
WORDNET> (dolist (synset (wn:synsets "car"))
           (print (wn:words synset)))
("cable car" "car")
("auto" "automobile" "car" "machine" "motorcar")
("car" "railcar" "railroad car" "railway car")
("car" "elevator car")
("car" "gondola")
Also, I've chosen a different naming scheme for senses —
word~synset
— as it better reflects the semantics of this concept.
Wordnet relations
There are two types of Wordnet relations: semantic ones between synsets
and lexical ones between senses. All of the relations (or links) can be
found in the
*link-types*
parameter. There are 28 of them in Wordnet 3.0.
To get at any relation there's a generic function
related
:
WORDNET> (defvar *motorcar* (wn:synset "car.n.1"))
WORDNET> (defvar *types-of-motorcar* (wn:related *motorcar* :hyponym))
WORDNET> (nth 0 *types-of-motorcar*)
#<SYNSET ambulance.n.1 102701002 {1011E35063}>
In our case
"ambulance"
is the first motorcar type.
WORDNET> (sort (mapcan #'wn:words *types-of-motorcar*) 'string<)
("ambulance" "beach waggon" "beach wagon" "bus" "cab" "compact" "compact car"
"convertible" "coupe" "cruiser" "electric" "electric automobile"
"electric car" "estate car" "gas guzzler" "hack" "hardtop" "hatchback" "heap"
"horseless carriage" "hot rod" "hot-rod" "jalopy" "jeep" "landrover" "limo"
"limousine" "loaner" "minicar" "minivan" "model t" "pace car" "patrol car"
"phaeton" "police car" "police cruiser" "prowl car" "race car" "racer"
"racing car" "roadster" "runabout" "s.u.v." "saloon" "secondhand car" "sedan"
"sport car" "sport utility" "sport utility vehicle" "sports car" "squad car"
"stanley steamer" "station waggon" "station wagon" "stock car" "subcompact"
"subcompact car" "suv" "taxi" "taxicab" "tourer" "touring car" "two-seater"
"used-car" "waggon" "wagon")
Hypernyms are more general entities in the synset hierarchy:
WORDNET> (wn:related *motorcar* :hypernym)
(#<SYNSET automotive vehicle.n.1 103791235 {10125702C3}>)
Let's trace them up to the root entity:
WORDNET> (defvar *paths* (wn:hypernym-paths *motorcar*))
WORDNET> (length *paths*)
2
WORDNET> (mapcar #'synset-name (nth 0 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
"instrumentality.n.3" "conveyance.n.3" "vehicle.n.1" "wheeled vehicle.n.1"
"self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
WORDNET> (mapcar #'synset-name (nth 1 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
"instrumentality.n.3" "container.n.1" "wheeled vehicle.n.1"
"self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
And here are just the root hypernyms:
WORDNET> (remove-duplicates (mapcar #'car *paths*))
(#<SYNSET entity.n.1 100001740 {101D016453}>)
Now, if we look at
part-meronym
,
substance-meronym
and
member-holonym
relations and try to get them with
related
, we'll get an empty set.
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym)
NIL
That's because the relation is actually one-way: a burl is part of a
tree, but not vice versa. For this case there's a
:reverse
keyword argument to
related
:
WORDNET> (wn:related (wn:synset "tree.n.1") :part-meronym :reverse t)
(#<SYNSET stump.n.1 113111504 {10114220B3}>
#<SYNSET crown.n.7 113128003 {1011423B73}>
#<SYNSET limb.n.2 113163803 {1011425633}>
#<SYNSET bole.n.2 113165815 {10114270F3}>
#<SYNSET burl.n.2 113166044 {1011428BE3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym :reverse t)
(#<SYNSET sapwood.n.1 113097536 {10113E8BE3}>
#<SYNSET duramen.n.1 113097752 {10113EA6A3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :member-holonym :reverse t)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
While the tree is a
member-meronym
of forest (i.e. NLTK has this relation the other way around):
WORDNET> (wn:related (wn:synset "tree.n.1") :member-meronym)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
And here's the mint example:
WORDNET> (dolist (s (wn:synsets "mint" :pos #\n))
           (format t "~A: ~A~%" (synset-name s) (synset-def s)))
mint.n.6: a plant where money is coined by authority of the government
mint.n.5: a candy that is flavored with a mint oil
mint.n.4: the leaves of a mint plant used fresh or candied
mint.n.3: any member of the mint family of plants
mint.n.2: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
batch.n.2: (often followed by `of') a large number or amount or extent
WORDNET> (wn:related (wn:synset "mint.n.4") :part-holonym :reverse t)
(#<SYNSET mint.n.2 112855042 {1011CF3CF3}>)
WORDNET> (wn:related (wn:synset "mint.n.4") :substance-holonym :reverse t)
(#<SYNSET mint.n.5 107606278 {1011CEECA3}>)
Verbs:
WORDNET> (wn:related (wn:synset "walk.v.1") :entail)
(#<SYNSET step.v.1 201928838 {1004139FF3}>)
WORDNET> (wn:related (wn:synset "eat.v.1") :entail)
(#<SYNSET chew.v.1 201201089 {10041BDC33}>
#<SYNSET get down.v.4 201201856 {10041BF6F3}>)
WORDNET> (wn:related (wn:synset "tease.v.3") :entail)
(#<SYNSET arouse.v.7 201762283 {10042495E3}>
#<SYNSET disappoint.v.1 201798936 {100424B0A3}>)
And, finally, here's antonymy, which is a lexical, not semantic,
relationship, and it holds between
senses
:
WORDNET> (wn:related (wn:sense "supply~supply.n.2") :antonym)
(#<SENSE demand~demand.n.2 48880 {1003637C03}>)
WORDNET> (wn:related (wn:sense "rush~rush.v.1") :antonym)
(#<SENSE linger~dawdle.v.4 107873 {1010780DE3}>)
WORDNET> (wn:related (wn:sense "horizontal~horizontal.a.1") :antonym)
(#<SENSE inclined~inclined.a.2 94496 {1010A5E653}>)
WORDNET> (wn:related (wn:sense "staccato~staccato.r.1") :antonym)
(#<SENSE legato~legato.r.1 105844 {1010C18AF3}>)
Similarity measures
Let's now use the
lowest-common-hypernyms
function (abbreviated to
lch
) to see the semantic grouping of marine animals:
WORDNET> (defvar *right* (wn:synset "right whale.n.1"))
WORDNET> (defvar *orca* (wn:synset "orca.n.1"))
WORDNET> (defvar *minke* (wn:synset "minke whale.n.1"))
WORDNET> (defvar *tortoise* (wn:synset "tortoise.n.1"))
WORDNET> (defvar *novel* (wn:synset "novel.n.1"))
WORDNET> (wn:lch *right* *minke*)
(#<SYNSET baleen whale.n.1 102063224 {102A2E0003}>)
2
1
WORDNET> (wn:lch *right* *orca*)
(#<SYNSET whale.n.2 102062744 {102A373323}>)
3
2
WORDNET> (wn:lch *right* *tortoise*)
(#<SYNSET craniate.n.1 101471682 {102A3734A3}>)
7
5
WORDNET> (wn:lch *right* *novel*)
(#<SYNSET entity.n.1 100001740 {102A373653}>)
15
7
The second and third return values of
lch
come in handy here, as they show the
depths of the paths to the common ancestor and give some immediate data
for estimating semantic relatedness, which we're going to explore more now.
WORDNET> (wn:min-depth (wn:synset "baleen whale.n.1"))
14
WORDNET> (wn:min-depth (wn:synset "whale.n.2"))
13
WORDNET> (wn:min-depth (wn:synset "vertebrate.n.1"))
8
WORDNET> (wn:min-depth (wn:synset "entity.n.1"))
0
By the way, there's another whale that is much closer to the root:
WORDNET> (wn:min-depth (wn:synset "whale.n.1"))
5
Guess what it is?
WORDNET> (synset-def (wn:synset "whale.n.1"))
"a very large person; impressive in size or qualities"
Now, let's calculate different similarity measures:
WORDNET> (wn:path-similarity *right* *minke*)
1/4
WORDNET> (wn:path-similarity *right* *orca*)
1/6
WORDNET> (wn:path-similarity *right* *tortoise*)
1/13
WORDNET> (wn:path-similarity *right* *novel*)
1/23
The algorithm here is very simple — it just reuses the secondary
return values of
lch
:
(defmethod path-similarity ((wordnet sql-wordnet3)
                            (synset1 synset) (synset2 synset))
  (mv-bind (_ l1 l2) (lowest-common-hypernyms wordnet synset1 synset2)
    (declare (ignore _))
    (when l1
      (/ 1 (+ l1 l2 1)))))
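For example, for the right whale and the minke whale the lch call above
returned path lengths 2 and 1, so the similarity is 1/(2 + 1 + 1) = 1/4 —
exactly the first value we calculated above.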
There are many more Wordnet similarity measures in NLTK, and they are
also implemented in
cl-nlp
. For example,
lch-similarity
, the
name of which luckily coincides with the
lch
function:
WORDNET> (wn:lch-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
Calculating taxonomy depth for #<SYNSET entity.n.1 100001740 {102A9F7A83}>
2.0281482
This measure depends on the expensive computation of taxonomy
depth, which is done lazily in the
max-depth
function:
(defmethod max-depth ((wordnet sql-wordnet3) (synset synset))
  (reduce #'max (mapcar #`(or (get# % *taxonomy-depths*)
                              (progn
                                (format *debug-io*
                                        "Calculating taxonomy depth for ~A~%" %)
                                (set# % *taxonomy-depths*
                                      (calc-max-depth wordnet %))))
                        (mapcar #'car (hypernym-paths wordnet synset)))))
Other similarity measures include
wup-similarity
,
res-similarity
,
lin-similarity
and others. You can read about them in the
NLTK Wordnet manual. Most of them depend on the words' information content database
which is calculated for different corpora (e.g. BNC and Penn Treebank)
and is not part of Wordnet. There's a
WordNet::Similarity project
that distributes pre-calculated databases for some popular corpora.
Lin similarity proposes a formula similar to LCH similarity for
calculating the score, but uses information content instead of
taxonomy depth as its input:
WORDNET> (wn:lin-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
0.87035906
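For reference, the standard definitions of these two measures are, to the
best of my knowledge, the following (written down as comments, not as
cl-nlp's literal code):
;; lch-similarity(s1, s2) = - log (shortest-path-length(s1, s2) / (2 * taxonomy-depth))
;; lin-similarity(s1, s2) = 2 * IC(lcs(s1, s2)) / (IC(s1) + IC(s2))
;; where IC is the corpus-derived information content of a synset and
;; lcs is the lowest common hypernym (lowest common subsumer) of s1 and s2.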
To calculate it we first need to fetch the information content corpus
with
(download :wordnet-ic)
and load the data for SemCor:
WORDNET> (setf *lc* (load-ic :filename-pattern "semcor"))
#<HASH-TABLE :TEST EQUAL :COUNT 213482 {100D7B80D3}>
The default
:filename-pattern
will load the combined information
content scores for the following 5 corpora: BNC, Brown, SemCor and its
raw version, all of Shakespeare's works and Penn Treebank. We'll talk
more about various corpora in the next part of the series...
Well, this is all as far as Wordnet is concerned for now. There's another
useful piece of Wordnet functionality — the word morpher — but we'll return to it
when we talk about word formation, stemming and lemmatization.
Meanwhile, here's a good schema showing the full potential of Wordnet: