I'm a little bit behind my schedule of implementing NLTK examples in Lisp, with no posts on the topic in March. It doesn't mean that work on CL-NLP has stopped - I've just had an unexpected vacation and have also been working on parts related to writing programs for the excellent Natural Language Processing course by Michael Collins on Coursera.
Today we'll start looking at Chapter 2, but we'll do it from the end, first exploring the topic of Wordnet.
Wordnet and Lisp
Wordnet is a database of word meanings (senses) and word semantic relations (synonymy, hyponymy and various other -onymies). It is a really important freely available resource, so having easy access to it is very valuable. At the same time, getting Wordnet to work is somewhat involved technically. That's why I've decided to implement its support in the cl-nlp-contrib system, so that the basic cl-nlp functionality can be loaded without the additional hassle of having to ensure that Wordnet (and other similar stuff) loads.

The implementation of the Wordnet interface in cl-nlp is incomplete in the sense that it doesn't provide matching entities for all Wordnet tables and so doesn't support all the possible interactions with it out-of-the-box. Yet it is sufficient to run all of NLTK's examples, and it is trivial to implement the rest as the need arises (I plan to do it later this year).

There are several other Wordnet interfaces for CL developed over the years, starting as long ago as the early nineties:
- cl-wordnet
- cffi-wordnet
- WordNet from MIT
- Some code from UTexas
- and, probably, some others
The general approach taken in cl-nlp is to prefer a uniform interface and ease of extensibility over performance and efficiency, on the premise that a specialized efficient version can easily be implemented by restricting the more general one, while the opposite is often much harder. And adapting the existing packages to the desired interface didn't seem like a straightforward task, so I decided to implement Wordnet support from scratch.

Wordnet installation
Wordnet isn't a corpus, although NLTK categorizes it as such by placing it in the nltk.corpus module. It is a word database distributed in a custom format, as well as in the binary formats of several databases. I have picked the sqlite3-based distribution as the easiest to work with — it's a single file of roughly 64 MB in size which can be downloaded from the WNSQL project page. The default place where cl-nlp will search for Wordnet is the data/ directory. If you call (download :wordnet), it will fetch a copy and place it there.
and place it there.There are several options to work with sqlite databases in Lisp, and I have chosen
CLSQL
as the most mature and comprehensive system —
it's a goto library for doing simple SQL stuff in CL. Besides the
low-level interface it provides the basic ORM functionality
and some other goodies.Also, to work with sqlite from Lisp a C client library is needed. It can be obtained in various ways depending on your OS and distribution (on most Linuxes with
apt-get
or similar package managers).
Yet there's another issue of making Lisp find the installed library.
So for Linux I decided to ship the appropriate binaries in the project's
lib/
dir. If you are not on Linux and CLSQL can't find sqlite native library,
you need to manually call the following command before you load
cl-nlp-contrib
with quicklisp or ASDF:(clsql:push-library-path "path to sqlite3 or libsqlite3 directory")
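For example, on a typical Debian-based Linux the call might look like this (the path is only an illustration; substitute the directory where libsqlite3 actually lives on your system):

;; hypothetical path for illustration; adjust to your system
(clsql:push-library-path #P"/usr/lib/x86_64-linux-gnu/")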
Supporting Wordnet interaction
Using CLSQL we can define classes for Wordnet entities, such as Synset, Sense, or Lexlink:

(clsql:def-view-class synset ()
((synsetid :initarg :synsetid :reader synset-id
:type integer :db-kind :key)
(pos :initarg :pos :reader synset-pos
:type char)
(lexdomainid :initarg :lexdomain
:type smallint :db-constraints (:not-null))
(definition :initarg :definition :reader synset-def
:type string)
(word :initarg :word :db-kind :virtual)
(sensenum :initarg :sensenum :db-kind :virtual))
(:base-table "synsets"))
Notice the presence in the synset class of the virtual slots word and sensenum, which are used to cache data from other tables so that a synset can be displayed the way it is done in NLTK. They are populated lazily with a slot-unbound method, just like we've seen in Chapter 1.
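To illustrate the idea, here's a minimal sketch of such lazy population for the word slot. It is an assumption for illustration only, not the actual cl-nlp code, and it presumes an open connection plus the lemmas and lemma-word accessors shown later in this post:

(defmethod slot-unbound (class (obj synset) (slot (eql 'word)))
  (declare (ignore class))
  ;; on first access, cache the synset's first word and return it
  (setf (slot-value obj 'word)
        (lemma-word (first (lemmas <wordnet> obj)))))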
All the DB entities are also cached in a global cache, so that the same requests don't produce different objects and Wordnet objects can be compared with eql.

There are some inconsistencies in the Wordnet lexicon which have also migrated to the NLTK interface (and then NLTK introduced some more). I've done some cleanup, so some classes and slot accessors don't fully mimic the names of their DB counterparts or NLTK's interface. For instance, in our dictionary
word
refers to a raw string and
lemma
— to its representation as a DB entity.

CLSQL also has special syntactic support for writing SQL queries in Lisp:
(clsql:select [item_id] [as [count [*]] 'num]
:from [table]
:where [is [null [sale_date]]]
:group-by [item_id]
:order-by '((num :desc))
:limit 1)
Yet I've chosen to use another very simple custom solution for that, which I had kept in the back of my mind for a long time since I started working with CLSQL's literal SQL syntax. Here's how we get all lemma objects for a specific synset — they are connected through senses:

(query (select 'lemma
`(:where wordid
:in ,(select '(wordid :from senses)
`(:where synsetid := ,(synset-id synset))))))
For the sake of efficiency we don't open a new DB connection on every such request, but use a combination of the following:
- there's a CLSQL special variable *default-database* that points to the current DB connection; it is used by all Wordnet interfacing functions
- the connection can be established with connect-wordnet, which is given an instance of the sql-wordnet3 class (there's the usual default instance <wordnet>, but you can use any other instance, which may connect to some other Wordnet SQL DB like MySQL if needed)
- the ensure-db function checks if the connection is present and opens it otherwise
- it is assumed that all functions will perform Wordnet interaction inside the with-wordnet macro that implements the standard Lisp resource-management call-with-* pattern for the Wordnet connection (a minimal sketch of this pattern follows below)
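Here's a minimal sketch of what this call-with-* pattern typically looks like. It relies on the assumptions above (ensure-db opening the connection, <wordnet> being the default instance) and is not necessarily the actual cl-nlp definition:

(defun call-with-wordnet (wordnet thunk)
  ;; make sure a connection to this Wordnet DB is open, then run the body;
  ;; all queries inside go through clsql:*default-database*
  (ensure-db wordnet)
  (funcall thunk))

(defmacro with-wordnet ((&optional (wordnet '<wordnet>)) &body body)
  `(call-with-wordnet ,wordnet (lambda () ,@body)))

With this sketch, a user-facing function would simply wrap its DB access in (with-wordnet () ...).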
Running NLTK's examples
Now, let's run the examples from the book to see how they work in our interface.

First, connect to Wordnet:
WORDNET> (connect-wordnet <wordnet>)
#<CLSQL-SQLITE3:SQLITE3-DATABASE /cl-nlp/data/wordnet30.sqlite OPEN {1011D1FAF3}>
Now we can run the functions with the existing connection passed implicitly through the special variable. wn is a special convenience package which implements the logic of implicitly relying on the default <wordnet>. There are functions with the same names in the wordnet package that take a wordnet instance as the first argument, similar to how other parts of cl-nlp are organized.

WORDNET> (wn:synsets "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
WORDNET> (synsets <wordnet> "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
The printing of synsets and other Wordnet objects is performed with custom print-object methods. Here's such a print-object method (shown for the sample entity) that uses the built-in print-unreadable-object macro:

(defmethod print-object ((sample sample) stream)
(print-unreadable-object (sample stream :type t :identity t)
(with-slots (sample sampleid) sample
(format stream "~A ~A" sample sampleid))))
Let's proceed further.

WORDNET> (wn:words (wn:synset "car.n.1"))
("auto" "automobile" "car" "machine" "motorcar")
This snippet is equivalent to the following NLTK code:

wn.synset('car.n.01').lemma_names
Definitions and examples:

WORDNET> (synset-def (wn:synset "car.n.1"))
"a motor vehicle with four wheels; usually propelled by an internal combustion engine"
WORDNET> (wn:examples (wn:synset "car.n.1"))
("he needs a car to get to work")
Next are lemmas.

WORDNET> (wn:lemmas (wn:synset "car.n.1"))
(#<LEMMA auto 9953 {100386BCC3}> #<LEMMA automobile 10063 {100386BCF3}>
#<LEMMA car 20722 {100386BC03}> #<LEMMA machine 80312 {100386BD23}>
#<LEMMA motorcar 86898 {1003250543}>)
As I've written earlier, there's some confusion in Wordnet between words and lemmas, and, IMHO, it is propagated in NLTK. The term lemma only appears once in the Wordnet DB, as a column in the words table. And synsets are not directly related to words — they are linked through the senses table. NLTK calls a synset-to-word pairing a lemma, which only adds to the confusion. I decided to call the entities of the words table lemmas. Now, how do you implement the equivalent of

wn.lemma('car.n.01.automobile')

We can do it like this:

WORDNET> (remove-if-not #`(string= "automobile" (lemma-word %))
(wn:lemmas (wn:synset "car.n.1")))
(#<LEMMA automobile 10063 {100386BCF3}>)
But, actually, we need not a raw lemma object, but a sense object, because a sense is exactly that mapping of a word to its meaning (defined by a synset), and it's the proper Wordnet entity for this:

WORDNET> (wn:sense "car~automobile.n.1")
#<SENSE car~auto.n.1 28261 {1011F226C3}>

You can also get at the raw lemma by its DB id and find its synset:

WORDNET> (wn:synset (wn:lemma 10063))
#<SYNSET auto.n.1 102958343 {100386BB33}>
Notice that the synset here is named auto. This is different from NLTK's 'car', and the reason is that it's unclear from the Wordnet DB what the "primary" lemma of a synset is, so I just use the word which appears first in the DB. Probably, NLTK does the same but has a different ordering — compare the next output with the book's:

WORDNET> (dolist (synset (wn:synsets "car"))
(print (wn:words synset)))
("cable car" "car")
("auto" "automobile" "car" "machine" "motorcar")
("car" "railcar" "railroad car" "railway car")
("car" "elevator car")
("car" "gondola")
Also, I've chosen a different naming scheme for senses — word~synset — as it better reflects the semantic meaning of this concept.

Wordnet relations

There are two types of Wordnet relations: semantic ones between synsets and lexical ones between senses. All of the relations, or links, can be found in the *link-types* parameter. There are 28 of them in Wordnet 3.0.

To get at any relation there's the generic function related:

WORDNET> (defvar *motorcar* (wn:synset "car.n.1"))
WORDNET> (defvar *types-of-motorcar* (wn:related *motorcar* :hyponym))
WORDNET> (nth 0 *types-of-motorcar*)
#<SYNSET ambulance.n.1 102701002 {1011E35063}>
In our case "ambulance" is the first motorcar type.

WORDNET> (sort (mapcan #'wn:words *types-of-motorcar*) 'string<)
("ambulance" "beach waggon" "beach wagon" "bus" "cab" "compact" "compact car"
"convertible" "coupe" "cruiser" "electric" "electric automobile"
"electric car" "estate car" "gas guzzler" "hack" "hardtop" "hatchback" "heap"
"horseless carriage" "hot rod" "hot-rod" "jalopy" "jeep" "landrover" "limo"
"limousine" "loaner" "minicar" "minivan" "model t" "pace car" "patrol car"
"phaeton" "police car" "police cruiser" "prowl car" "race car" "racer"
"racing car" "roadster" "runabout" "s.u.v." "saloon" "secondhand car" "sedan"
"sport car" "sport utility" "sport utility vehicle" "sports car" "squad car"
"stanley steamer" "station waggon" "station wagon" "stock car" "subcompact"
"subcompact car" "suv" "taxi" "taxicab" "tourer" "touring car" "two-seater"
"used-car" "waggon" "wagon")
Hypernyms are more general entities in the synset hierarchy:

WORDNET> (wn:related *motorcar* :hypernym)
(#<SYNSET automotive vehicle.n.1 103791235 {10125702C3}>)
Let's trace them up to the root entity:

WORDNET> (defvar *paths* (wn:hypernym-paths *motorcar*))
WORDNET> (length *paths*)
2
WORDNET> (mapcar #'synset-name (nth 0 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
"instrumentality.n.3" "conveyance.n.3" "vehicle.n.1" "wheeled vehicle.n.1"
"self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
WORDNET> (mapcar #'synset-name (nth 1 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
"instrumentality.n.3" "container.n.1" "wheeled vehicle.n.1"
"self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
And here are just the root hypernyms:

WORDNET> (remove-duplicates (mapcar #'car *paths*))
(#<SYNSET entity.n.1 100001740 {101D016453}>)
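Conceptually, hypernym-paths is just a recursive walk up the :hypernym relation that collects every route from the synset to a root. Here's a hedged sketch of that idea (not the actual cl-nlp implementation, which may differ in details such as caching):

(defun hypernym-paths* (synset)
  ;; each path is a root-first list of synsets ending with SYNSET itself
  (let ((hypernyms (wn:related synset :hypernym)))
    (if hypernyms
        (loop :for hypernym :in hypernyms
              :append (mapcar #`(append % (list synset))
                              (hypernym-paths* hypernym)))
        (list (list synset)))))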
Now, if we look at the part-meronym, substance-meronym and member-holonym relations and try to get them with related, we'll get an empty set.

WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym)
NIL
That's because the relation is actually one-way: a burl is part of a tree, but not vice versa. For this case there's a :reverse key to related:

WORDNET> (wn:related (wn:synset "tree.n.1") :part-meronym :reverse t)
(#<SYNSET stump.n.1 113111504 {10114220B3}>
#<SYNSET crown.n.7 113128003 {1011423B73}>
#<SYNSET limb.n.2 113163803 {1011425633}>
#<SYNSET bole.n.2 113165815 {10114270F3}>
#<SYNSET burl.n.2 113166044 {1011428BE3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym :reverse t)
(#<SYNSET sapwood.n.1 113097536 {10113E8BE3}>
#<SYNSET duramen.n.1 113097752 {10113EA6A3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :member-holonym :reverse t)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
Meanwhile, the tree is a member-meronym of forest (i.e. NLTK has it slightly the other way around):

WORDNET> (wn:related (wn:synset "tree.n.1") :member-meronym)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
And here's the mint example:

WORDNET> (dolist (s (wn:synsets "mint" :pos #\n))
(format t "~A: ~A~%" (synset-name s) (synset-def s)))
mint.n.6: a plant where money is coined by authority of the government
mint.n.5: a candy that is flavored with a mint oil
mint.n.4: the leaves of a mint plant used fresh or candied
mint.n.3: any member of the mint family of plants
mint.n.2: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
batch.n.2: (often followed by `of') a large number or amount or extent
WORDNET> (wn:related (wn:synset "mint.n.4") :part-holonym :reverse t)
(#<SYNSET mint.n.2 112855042 {1011CF3CF3}>)
WORDNET> (wn:related (wn:synset "mint.n.4") :substance-holonym :reverse t)
(#<SYNSET mint.n.5 107606278 {1011CEECA3}>)
Verbs:

WORDNET> (wn:related (wn:synset "walk.v.1") :entail)
(#<SYNSET step.v.1 201928838 {1004139FF3}>)
WORDNET> (wn:related (wn:synset "eat.v.1") :entail)
(#<SYNSET chew.v.1 201201089 {10041BDC33}>
#<SYNSET get down.v.4 201201856 {10041BF6F3}>)
WORDNET> (wn:related (wn:synset "tease.v.3") :entail)
(#<SYNSET arouse.v.7 201762283 {10042495E3}>
#<SYNSET disappoint.v.1 201798936 {100424B0A3}>)
And, finally, here's antonymy, which is a lexical, not a semantic, relationship, and it holds between senses:

WORDNET> (wn:related (wn:sense "supply~supply.n.2") :antonym)
(#<SENSE demand~demand.n.2 48880 {1003637C03}>)
WORDNET> (wn:related (wn:sense "rush~rush.v.1") :antonym)
(#<SENSE linger~dawdle.v.4 107873 {1010780DE3}>)
WORDNET> (wn:related (wn:sense "horizontal~horizontal.a.1") :antonym)
(#<SENSE inclined~inclined.a.2 94496 {1010A5E653}>)
WORDNET> (wn:related (wn:sense "staccato~staccato.r.1") :antonym)
(#<SENSE legato~legato.r.1 105844 {1010C18AF3}>)
Similarity measures
Let's now use the lowest-common-hypernyms function, abbreviated to lch, to see the semantic grouping of marine animals:

WORDNET> (defvar *right* (wn:synset "right whale.n.1"))
WORDNET> (defvar *orca* (wn:synset "orca.n.1"))
WORDNET> (defvar *minke* (wn:synset "minke whale.n.1"))
WORDNET> (defvar *tortoise* (wn:synset "tortoise.n.1"))
WORDNET> (defvar *novel* (wn:synset "novel.n.1"))
WORDNET> (wn:lch *right* *minke*)
(#<SYNSET baleen whale.n.1 102063224 {102A2E0003}>)
2
1
WORDNET> (wn:lch *right* *orca*)
(#<SYNSET whale.n.2 102062744 {102A373323}>)
3
2
WORDNET> (wn:lch *right* *tortoise*)
(#<SYNSET craniate.n.1 101471682 {102A3734A3}>)
7
5
WORDNET> (wn:lch *right* *novel*)
(#<SYNSET entity.n.1 100001740 {102A373653}>)
15
7
The second and third return values of lch come in handy here, as they show the depths of the paths to the common ancestor and give some immediate data for estimating semantic relatedness, which we're going to explore more now.

WORDNET> (wn:min-depth (wn:synset "baleen whale.n.1"))
14
WORDNET> (wn:min-depth (wn:synset "whale.n.2"))
13
WORDNET> (wn:min-depth (wn:synset "vertebrate.n.1"))
8
WORDNET> (wn:min-depth (wn:synset "entity.n.1"))
0
By the way, there's another whale that is much closer to the root:

WORDNET> (wn:min-depth (wn:synset "whale.n.1"))
5
Guess what it is?

WORDNET> (synset-def (wn:synset "whale.n.1"))
"a very large person; impressive in size or qualities"
Now, let's calculate different similarity measures:

WORDNET> (wn:path-similarity *right* *minke*)
1/4
WORDNET> (wn:path-similarity *right* *orca*)
1/6
WORDNET> (wn:path-similarity *right* *tortoise*)
1/13
WORDNET> (wn:path-similarity *right* *novel*)
1/23
The algorithm here is very simple — it just reuses the secondary return values of lch:

(defmethod path-similarity ((wordnet sql-wordnet3)
(synset1 synset) (synset2 synset))
(mv-bind (_ l1 l2) (lowest-common-hypernyms wordnet synset1 synset2)
(declare (ignore _))
(when l1
(/ 1 (+ l1 l2 1)))))
For instance, for *right* and *minke* the secondary return values above were 2 and 1, so the result is 1/(2 + 1 + 1) = 1/4, matching the earlier output.

There are many more Wordnet similarity measures in NLTK, and they are also implemented in cl-nlp. For example, lch-similarity, the name of which luckily coincides with the lch function:

WORDNET> (wn:lch-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
Calculating taxonomy depth for #<SYNSET entity.n.1 100001740 {102A9F7A83}>
2.0281482
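For reference, the Leacock-Chodorow score is commonly defined as -log(p / (2 * D)), where p is the length of the shortest path between the two synsets counted in nodes and D is the maximum depth of the taxonomy for the given part of speech. Here's a quick sketch of that arithmetic, for illustration only and not the cl-nlp implementation:

(defun lch-score (path-length taxonomy-depth)
  ;; -log(p / 2D), natural logarithm
  (- (log (/ path-length (* 2.0 taxonomy-depth)))))

;; e.g. a 5-node shortest path between dog.n.1 and cat.n.1 and a
;; noun-taxonomy depth of 19 reproduce the score above:
;; (lch-score 5 19) => 2.0281482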
This measure depends on the expensive computation of taxonomy depth, which is done in a lazy manner in the max-depth function:

(defmethod max-depth ((wordnet sql-wordnet3) (synset synset))
(reduce #'max (mapcar #`(or (get# % *taxonomy-depths*)
(progn
(format *debug-io*
"Calculating taxonomy depth for ~A~%" %)
(set# % *taxonomy-depths*
(calc-max-depth wordnet %))))
(mapcar #'car (hypernym-paths wordnet synset)))))
Other similarity measures include wup-similarity, res-similarity, lin-similarity and others. You can read about them in the NLTK Wordnet manual. Most of them depend on the words' information content database, which is calculated for different corpora (e.g. BNC and Penn Treebank) and is not part of Wordnet. There's a WordNet::Similarity project that distributes pre-calculated databases for some popular corpora.

Lin similarity uses a formula similar to LCH similarity for calculating the score, but with information contents instead of taxonomy depths as its arguments:
WORDNET> (wn:lin-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
0.87035906
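Lin (1998) defines this score as 2 * IC(lcs) / (IC(s1) + IC(s2)), where lcs is the lowest common hypernym of the two synsets and IC is the information content of a synset according to the chosen corpus. Here's a sketch of the arithmetic, for illustration only and not the cl-nlp implementation:

(defun lin-score (ic-lcs ic-synset1 ic-synset2)
  ;; 2 * IC(lcs) / (IC(s1) + IC(s2))
  (/ (* 2 ic-lcs)
     (+ ic-synset1 ic-synset2)))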
To calculate it we first need to fetch the information content corpus with (download :wordnet-ic) and load the data for SemCor:

WORDNET> (setf *lc* (load-ic :filename-pattern "semcor"))
#<HASH-TABLE :TEST EQUAL :COUNT 213482 {100D7B80D3}>

The default :filename-pattern will load the combined information content scores for the following 5 corpora: BNC, Brown, SemCor and its raw version, all of Shakespeare's works, and Penn Treebank. We'll talk more about various corpora in the next part of the series...

Well, this is all as far as Wordnet is concerned for now. There's another useful piece of Wordnet functionality — the word morpher — but we'll return to it when we talk about word forming, stemming and lemmatization. Meanwhile, here's a good schema showing the full potential of Wordnet: