wnconnect: internals

Architecture

wnconnect comes in two forms. The Mac OS X application simply adds to the basic Prolog version a graphical user interface (GUI) implemented using Tck/Tk and a graph generator that interfaces with neato. The architecture is given below:

GUI

wnconnect

Tck/Tk Aqua
Version 8.4.4 WordNet
Version 1.7.1
(Prolog database) C code
(Breadth-first
search) Sicstus Prolog
Version 3.10.1 neato
Version 1.10
Graphviz Preview
(graph display)
Mac OS X 10.2

FreeType 2.15 ps2pdf
Ghostscript 7.0.5

Converted .ttf
TrueType Fonts
by fondu

Note that the program is a hybrid one. It is written using three programming languages calling each other as befits the task: C, Prolog and Tck/Tk (graphical user interface).

Algorithm and Representation

wnconnect as can be seen by the code supplied in connection.pl and connection.c is conceptually a simple program.

It is software originally written to permit the author to do research work exploring and inferencing with WordNet relations in Prolog exclusively. It was originally designed to run in Quintus Prolog on a Sun Sparcstation. The version with the GUI wrapper came much later.

Given the size of the WordNet semantic network, wnconnect uses a hybrid Prolog-C implementation of breadth-first search.

Breadth-First Search
Prolog's default search order is depth-first for good reasons. Breadth-first search is useful in graph searching, especially if you want to find shorter paths before longer ones.
Although it is easy to write a breadth-first search routine directly in Prolog, it's not efficient, especially in the case of very large graphs.
WordNet
WordNet is a very large database containing 195817 entries, organized into 111223 synsets.
wnconnect implements 15 semantic relations.
There are a total of 484381 links in the standard Prolog WordNet distribution. The common file number extension adds a further 53074 links.

Semantic relations are conveniently stored as Prolog database facts for easy querying (and modification), see the example for the verb senses of bust here.

For reasons of practical efficiency, the search strategy is largely controlled and implemented in C. Prolog computes the possible semantic relations that can be applied to a given word, but the nodes and links of the (incrementally grown) graph is passed to and packed and stored in a C array.

Two arrays are used in the C section of the program:

Synset Mark array

#define SIZE 111224 int mark[SIZE];
Graph Storage array

#define PTSIZE 8000000 /* 8,000,000 = 32Mb */ int pt[PTSIZE];

Synsets are entered into mark to prevent the program from looping, i.e. getting stuck by visiting the same node again and again.

pt represents the breadth-first search tree. The limit PTSIZE should be sufficient for cases of WordNet searching.

Each node/link pair in the search graph is stored as a single entry, i.e. as an integer, in pt. Given the compactness of the representation, 32Mb represents a lot of search space. The details of the encoding is described below.

Node/Link Representation

Each computed relation during search is efficiently packed into a single word (32 bits, 4 bytes) on the Prolog side for storage on the C side. Graph words are decoded for display on the Prolog side.

The following two formats are used:

Bits: 0-16 17 18-20 21 22-25 26-30 31

Format
#1 Internal Synset ID 0
Simple operator
decoding Operator:
sim, hyp, at,
cs, ent, mm, ms,
mp, vgp, ihyp (Unused) (Reserved)
mark

Format
#2 1
Operator/
operand
decoding Operator:
ant, fn,
ppl, per, sa Operand 1
Synset
offset Operand 2
Synset
offset

Two representations are used because the semantic relations come in two flavors. Some are synset-to-synset relations and each link can be encoded using just a single synset id. Format #1 above is used to encode these relations. Other operators relate specific words across synsets. Specific words are referenced using offsets within synsets. Format #2 is used to encode these links.

Synset ID

Wordnet has a total of 111223 synsets. Hence 17 bits are sufficient to store an internal-to-wnconnect synset id. A C array is used to index and mark visited synset ids.

Princeton's Prolog WordNet distribution uses non-contiguous synset ids. In wnconnect, all database facts using Princeton synset ids have been reindexed to take advantage of C array marking.

Look-up and conversion can be carried out using the two wnconnect predicates s2/5 and s3/5:

s2(Word,C,I,S,O) s3(I,O,S,Word,C)
Word: a Prolog atom encoding Word, e.g. bust
C: WordNet category label (n,v,a,s,r).
I: wnconnect synset id.
S: Princeton synset id for Word
O: Synset id offset for Word.

s2/5 and s3/5 store the same information indexed by word (s2/5) and wnconnect internal synset id (s3/5), respectively.

[Both s2/5 and s3/5 have been defined in order to take advantage of most Prologs' efficient first argument indexing mechanism.]

Synset Offset

In the case of word-specific semantic relations, the maximum size of a synset that we need to worry about is 23. Hence, 5 bits is sufficient to encode synset offsets for format #2. The relevant maximum synset sizes are given below:

ant 10 fn 23 ppl 9 per 22 sa 11

GUI
	wnconnect
Tck/Tk Aqua Version 8.4.4	WordNet Version 1.7.1 (Prolog database)	C code (Breadth-first search)	Sicstus Prolog Version 3.10.1	`neato` Version 1.10 Graphviz		Preview (graph display) Mac OS X 10.2
					FreeType 2.15	`ps2pdf` Ghostscript 7.0.5
					Converted .ttf TrueType Fonts by `fondu`

Bits:	0-16	17	18-20	21	22-25	26-30	31
Format #1	Internal Synset ID	0 Simple operator decoding	Operator: sim, hyp, at, cs, ent, mm, ms, mp, vgp, ihyp		(Unused)		(Reserved) mark
Format #2	Internal Synset ID	1 Operator/ operand decoding	Operator: ant, fn, ppl, per, sa	Operand 1 Synset offset		Operand 2 Synset offset	(Reserved) mark