
TreeBank Search

This is software for browsing and searching treebanks using logic expressions. It is capable of handling large treebanks, e.g. the Penn TreeBank (PTB). It renders bracketed expressions as nicely-formatted trees.

Note 1: This is not free software. See the Download and Licensing sections below.
[A simpler version of this program that allows treebank browsing and rendering but not search is freely available elsewhere on my homepage. (See treeviewer.)]

Note 2: The Penn TreeBank is not included here. I cannot legally re-distribute that.
[(You need to have a license from the LDC for the Penn TreeBank.)
University of Arizona students: the TREEBANK_3 CD is freely available for loan from the library. See the library catalog.]


The example screenshot on the right shows the program operating on the supplied Lasnik & Uriagereka (L&U) treebank (317 sentences).

[Sentence File = lu.lisp (sentences in Bikel parser input format)
Prolog Tree File = lu.pl (Bikel-Collins parses converted into Prolog format)]
[Dan Bikel's multilingual statistical parsing engine: here.]

On the left display panel, all 104 sentences matching the pattern "TO" are selected.

[Match Sentence = TO, search parameters are literal match (regexp not checked) and All (entire treebank). Show "Selected Only" restricts the display to matching instances only.]

On the right display panel, the Bikel-Collins parse for sentence #43 is displayed.

[The parse associated with a sentence can be displayed by clicking on the sentence. The background of the selected sentence is highlighted in blue on the left panel. The sentence number is given above the parse tree (here: 43/317).]

(This viewer employs the same underlying tree renderer as used in the next release of PAPPI.)



Multi-window Operation

The treebanksearch program is capable of operating in synchronized multi-window mode for browsing purposes.

The operation of this mode is described in a separate webpage. (See here.)

[On the right, the upper window contains one instance of the treebank viewer. The treebanksearch program is the lower window.

The treebank viewer has the Collins parser model 1 parses loaded for the Penn Treebank (PTB). The treebanksearch program has the "gold-standard" PTB trees loaded.

(Michael Collins's parser is available here.)

The two programs are operating in synchronized mode. This means the treebank viewer is slaved to the treebanksearch program for tree display purposes.

In this particular snapshot, the upper and lower windows are displaying (different) trees for the sentence: Neither Lorillard nor the researchers who studied the workers were aware of any research on smokers of the Kent cigarettes .]


Usage

Initialization

  1. Specify the files containing the sentences and Prolog trees.

    Use the appropriate button to bring up a file dialog box or type directly into the entry field.

    Sentence file lu.lisp and Prolog tree file lu.pl for the Lasnik & Uriagereka (L&U) treebank are supplied with the distribution.

    File Formats

    The format is one line per sentence and one line per tree.
    The number of lines for the sentence and tree files should be the same.

    Anything can be present in the sentence file: each line is treated as a simple string for search.

    The tree file must be parse-able by the tree renderer.
    Each tree should occupy one line and be acceptable to Prolog.
    Format is:

    tree(Tree).

    where tree node Tree should be of form:

    n(NodeName,Child1,..,Childn)

    NodeName should be an acceptable Prolog atom.
    Atoms starting with an uppercase letter must be quoted, e.g. VP should be written 'VP'.
    Each child node Childi should either be an atom or (recursively) a tree node.

    Example:

    Prolog tree input for the sentence John slept

    tree(n('S',n('NP',n('NNP','John')),n('VP',n('VBD','slept')),n('.','.'))).
    
    Bikel parser output is in Lisp sexp format:
    (S (NP (NNP John)) (VP (VBD slept)) (. .))
    

    Code for converting Bikel parser output is detailed in the appendix.

  2. Both sentence and tree files must be entered.

    Press "Load Files" to load the files.
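
The conversion from Bikel parser output (Lisp sexp format) to the Prolog tree format described under File Formats can be sketched as follows. This is only a minimal illustration in Python, not the converter promised for the appendix: the function names are made up for this sketch, every atom is quoted (which is always safe, even for lowercase words), and it assumes that character escapes are enabled in the Prolog reader, so backslashes are doubled.

```python
import re


def tokenize(sexp):
    # Split a Lisp-style bracketing into "(", ")" and bare symbols.
    return re.findall(r"\(|\)|[^\s()]+", sexp)


def parse(tokens, pos=0):
    # Parse one node starting at tokens[pos]; return (node, next position).
    if tokens[pos] != "(":
        return tokens[pos], pos + 1          # a leaf, i.e. a word
    label, pos = tokens[pos + 1], pos + 2
    children = []
    while tokens[pos] != ")":
        child, pos = parse(tokens, pos)
        children.append(child)
    return (label, children), pos + 1


def quote(atom):
    # Quote unconditionally; double any embedded backslash or single quote.
    return "'" + atom.replace("\\", "\\\\").replace("'", "''") + "'"


def to_prolog(node):
    # Leaves become quoted atoms, inner nodes become n(Label,Child1,..,Childn).
    if isinstance(node, str):
        return quote(node)
    label, children = node
    return "n(" + ",".join([quote(label)] + [to_prolog(c) for c in children]) + ")"


def sexp_to_prolog(line):
    # One sexp per line in, one tree(Tree). fact per line out.
    node, _ = parse(tokenize(line))
    return "tree(" + to_prolog(node) + ")."


print(sexp_to_prolog("(S (NP (NNP John)) (VP (VBD slept)) (. .))"))
```

Applied to the John slept example, this prints the same tree(...) line as shown under File Formats. Sentences for which the parser produced no tree are not handled here.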


Tree Display

Click on any displayed sentence in the left display panel and the corresponding tree will be rendered on the right panel.

The background of the sentence currently being displayed is highlighted in blue. The sentence number is given above the tree.

In addition to directly clicking on a sentence, the Up and Down arrows on the keyboard can be used to display the tree for the preceding and following sentence.

To go directly to a sentence, enter the sentence number in the sentence number box and press Return.

Screen and window sizing

Scrollbars are available when appropriate in both display windows.
If a scrollbar is not visible, expand the window.

The entire program window can be expanded or re-sized by dragging the handle at the bottom-right.
The divider separating the two display windows can be moved using the (small square) drag handle.

[Note: the right display window below has been resized to accommodate the large parse tree. The vertical scrollbar for the left display window has been occluded.]


Sentence Search

The program provides three simple modes for searching the loaded sentence file: Down, Up and All.

By default the match string is treated literally. Regular expression searches can also be specified by checking the regexp flag.

Example: Down

Click to focus the keyboard input in the entry box next to "Match Sentence" and type in TO.

[The box currently receiving focus is always marked with a black border.]

Then press Return or select "Down" from the pull-down menu.

The first match for TO in the file, found in sentence 10

John is likely to be here

will be displayed in red.
(See first picture on the right.)

Press Return again (or select "Down") to move to the next match shown in red for sentence 12

*It is likely John to be here

(See second picture on the right.)

[By default, the search proceeds downwards from the current insertion point in the left panel. For example, you can scroll around and click to set the insertion point for the left display panel. Selecting "Down" (or clicking in the Match Sentence entry box and pressing Return) will find the next matching string from that point.]

(Note: Sentence search looks only in the sentence file. You must click on a sentence to display the associated tree.)

Example: Up

Search may also proceed upwards from the current insertion point.

Select "Up" from the pull-down menu. Pressing Return will now default to moving up to the next match.

Example: All

Rather than stepping through one match at a time, select "All" (instead of "Up" or "Down") to highlight (in red) all possible matches in the sentence file simultaneously.

Below, we see the effect of option "All" on the matching string TO.

3 matches (TO) are visible here.

There are actually 104 matches in 317 sentences.

We can eliminate non-matching sentences and narrow the display down to the 104 matching sentences only by pressing the "Selected Only" button next to the "Show:" label.

The display toggles to show only sentences with highlighted matches.

[Note: The "Show Selected Only" button is now greyed out.]

To toggle back to the full display, press the (now-activated) "All" button to the right of the "Show" label.

The "Show Selected Only"/"Show All" toggle is also used in Tree Search. (See below.)
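
The three literal search modes and the "Selected Only" display can be pictured with the following sketch. It is a simplified illustration in Python, not the program's code: it works on whole sentences rather than on character positions, it does not wrap around at the ends of the file, and the sample lines and function names are invented for the example.

```python
# Hypothetical sentence file contents, one sentence per line.
SENTENCES = [
    "((John (NNP)) (slept (VBD)))",
    "((John (NNP)) (is (VBZ)) (likely (JJ)) (to (TO)) (be (VB)) (here (RB)))",
    "((I (PRP)) (persuaded (VBD)) (John (NNP)) (to (TO)) (leave (VB)))",
]


def find_next(sentences, pattern, current, direction="down"):
    # Return the number (1-based) of the next matching sentence after
    # (or before) sentence `current`, or None if there is none.
    if direction == "down":
        candidates = range(current + 1, len(sentences) + 1)
    else:
        candidates = range(current - 1, 0, -1)
    for number in candidates:
        if pattern in sentences[number - 1]:   # literal, case-sensitive match
            return number
    return None


def find_all(sentences, pattern):
    # "All": the numbers of every sentence containing the pattern.
    return [n for n, s in enumerate(sentences, start=1) if pattern in s]


selected = find_all(SENTENCES, "TO")
print(selected)                                  # sentence numbers with a match
print([SENTENCES[n - 1] for n in selected])      # the "Selected Only" view
print(find_next(SENTENCES, "TO", 0))             # "Down" from the top
print(find_next(SENTENCES, "TO", 3, "up"))       # "Up" from sentence 3
```

Note that the match is case-sensitive in this sketch: the tag TO matches, the word "to" does not.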

Regular Expression Search

The following is an example of a simple regular expression "Show All" sentence search. The matching string is:

\(VB[PZ]\)

which matches either (VBP) or (VBZ).

(Standard Unix regular expression syntax is assumed.)
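
To see what the pattern above selects, here is a small check written with Python's re module. This is a stand-in only: the program's own regular expression engine may differ in details, and the sample lines are invented. In this syntax the backslash-escaped parentheses match literal parentheses, which is the reading described above.

```python
import re

# The pattern from the example: a literal "(", then VBP or VBZ, then ")".
PATTERN = re.compile(r"\(VB[PZ]\)")

# Hypothetical lines from a sentence file.
LINES = [
    "((John (NNP)) (slept (VBD)))",
    "((John (NNP)) (is (VBZ)) (here (RB)))",
    "((I (PRP)) (believe (VBP)) (John (NNP)))",
    "((John (NNP)) (will (MD)) (be (VB)) (here (RB)))",
]

# Numbers (1-based) of the lines containing at least one match.
MATCHING = [n for n, line in enumerate(LINES, start=1) if PATTERN.search(line)]
print(MATCHING)
```

Lines tagged only with (VBD) or (VB) are not selected, since the character class [PZ] requires exactly one of P or Z after VB.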


Tree Search

The most powerful and extensible part of the program is the tree search facility. This uses the full power of Prolog to extract matching trees.

To enable tree search, the "Enable" button must be pressed.

This loads all the trees into the Prolog database.

The "Enable" button is now greyed out and replaced by the label "Enabled".

The "Match Tree (Prolog)" entry box is no longer greyed out and now accepts Prolog queries.

Prolog queries (currently) are evaluated in one of two modes:

  • finding a single match (option: "One"), and

  • finding all possible matches (option: "All").

Queries must obey Prolog syntax, except that the terminating period is omitted.
Currently, Prolog syntax and run-time errors result in a TCL error message being generated (to be fixed).

Match Tree (Prolog): One

Example: node(X,'VP'), branching(X,3)

This query states: "Find a tree such that.."

  1. there exists a node X with label VP, and
  2. the branching factor of X is 3, i.e. X has 3 children.

[Following Prolog convention, logic variables start with an uppercase letter.
In the example, X is a variable denoting a tree node.
Constants that begin with an uppercase letter must be enclosed in single quotes (to avoid being interpreted as logic variables).
In the example, 'VP' names a constant beginning with an uppercase letter.
Conjuncts in a query are separated by a comma, disjuncts by a semicolon. \+ is the Prolog negation operator.]

In other words: "Find the first matching tree that has a VP with 3 children".
The result is given below:

The first matching tree is associated with sentence 18

*I believe sincerely John to be here

The program displays the matching tree (VP has three children headed by VBP, ADVP and S) and highlights the associated sentence.

Match Tree (Prolog): All

Example: node(X,'VP'), branching(X,3)

This query states: "Find all matching trees with a VP with 3 children".
The result is given below:

The left display panel is restricted to the sentences associated with the 31 matching trees.

The tree associated with sentence 22

I persuaded John to leave

is currently displayed.

Note: The left display is currently in "Show Selected Only" mode (see the Sentence Search documentation above), and may be toggled back to showing all sentences in the treebank by pressing the "All" button.
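
For readers less familiar with Prolog, the meaning of the query node(X,'VP'), branching(X,3) under the "One" and "All" options can be mimicked in Python as follows. This is a sketch under stated assumptions, not how the program evaluates queries: trees are written as nested tuples (label, child, ..) mirroring n(Label,Child1,..,Childn), words are plain strings, and the three parses are made up for illustration.

```python
# Hypothetical parses; each tuple is (label, child1, .., childn).
T1 = ("S", ("NP", ("NNP", "John")), ("VP", ("VBD", "slept")), (".", "."))
T2 = (
    "S",
    ("NP", ("PRP", "I")),
    (
        "VP",
        ("VBD", "persuaded"),
        ("NP", ("NNP", "John")),
        ("S", ("VP", ("TO", "to"), ("VP", ("VB", "leave")))),
    ),
)
T3 = (
    "S",
    ("NP", ("PRP", "I")),
    (
        "VP",
        ("VBP", "believe"),
        ("ADVP", ("RB", "sincerely")),
        ("S", ("NP", ("NNP", "John")), ("VP", ("TO", "to"), ("VP", ("VB", "be")))),
    ),
)
TREEBANK = [T1, T2, T3]


def nodes(tree):
    # Yield every labelled node of a tree; words (strings) are skipped.
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from nodes(child)


def vp_with_three_children(tree):
    # node(X,'VP'), branching(X,3)
    return any(x[0] == "VP" and len(x) - 1 == 3 for x in nodes(tree))


def match_one(treebank, query):
    # Option "One": number of the first matching tree, or None.
    for number, tree in enumerate(treebank, start=1):
        if query(tree):
            return number
    return None


def match_all(treebank, query):
    # Option "All": numbers of all matching trees.
    return [n for n, tree in enumerate(treebank, start=1) if query(tree)]


print(match_one(TREEBANK, vp_with_three_children))
print(match_all(TREEBANK, vp_with_three_children))
```

The existential reading is the important point: a tree matches as soon as some VP node in it has three children, even if other VP nodes in the same tree do not.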

Prolog tree primitives

Currently, the following primitive logic formulas are pre-defined:

Primitive: node(V,Label)
    V: variable that names the node
    Label: label associated with the node
    Example: node(X,'NNP')
    Semantics: there exists a node X with label NNP in the matching tree.

Primitive: branching(V,N)
    V: variable standing for a node (must be previously introduced via node/2).
    N: positive integer indicating the branching factor of V
    Example: branching(Z,3)
    Semantics: node Z has branching factor 3.

Primitive: V1 dom V2 (also written dom(V1,V2))
    V1: variable standing for a node (must be previously introduced via node/2).
    V2: variable standing for a node (need not be previously introduced via node/2).
    Example: X dom Y
    Semantics: node X properly dominates node Y.
    Note: dom is likely to be renamed to pdom (for properly dominates) soon.

Primitive: V1 idom V2 (also written idom(V1,V2))
    V1: variable standing for a node (must be previously introduced via node/2).
    V2: variable standing for a node (need not be previously introduced via node/2).
    Example: X idom Y
    Semantics: node X immediately dominates node Y.

Primitive: notree
    No variables present
    Example: notree
    Semantics: matches cases where no parse tree was recovered.
    Usage: e.g. the Collins parser models

The following connectives are currently supported:

Connective: A , B (comma: conjunction)
    A and B are logic formulas.
    Example: node(X,'NP'), X idom Y, node(Y,'NNP')
    Semantics: in the matching tree, there exist nodes X and Y with labels NP and NNP (respectively) such that X immediately dominates Y.

Connective: A ; B (semicolon: disjunction)
    A and B are logic formulas.
    Example: node(X,'NP'), X idom Y, (node(Y,'NNP') ; node(Y,'PRP'))
    Semantics: in the matching tree, there exist nodes X and Y. X has label NP and Y has either label NNP or PRP. Furthermore, X immediately dominates Y.
    Note: parentheses can be used to delimit the scope of the connectives.
    By default, conjunction binds more tightly than disjunction in Prolog, i.e. A , B ; C is read as (A , B) ; C.

Connective: \+ A (negation)
    A is a logic formula.
    Example: \+ node(X,'PRP')
    Semantics: there does not exist a node X with label PRP (personal pronoun) in the matching tree.

It is possible to add additional Prolog definitions, e.g. c-command etc.
(Documentation forthcoming.)

Example:
node(Y,'VP'), Y dom X, node(X,'SBAR')
node(Y,'VP'), Y idom X, node(X,'SBAR')

We illustrate the difference between finding a VP that dominates some SBAR node and finding a VP that immediately dominates an SBAR node.

node(Y,'VP'), Y dom X, node(X,'SBAR')

172 matches are found. The tree for the first match (sentence 9) is shown above in the right display panel. Note that VP dominates SBAR via the intermediate node NP.

node(Y,'VP'), Y idom X, node(X,'SBAR')

109 matches are found. The tree for the first match (sentence 17) is shown. Only instances where VP immediately dominates the SBAR node are returned.
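
The difference between the two queries can also be spelled out in a few lines of Python. As before, this is an illustrative sketch and not the program's implementation: trees are nested tuples (label, child, ..), and the two example VPs are invented to show one case of each kind.

```python
# VP dominates SBAR only via an intermediate NP (hypothetical parse).
VIA_NP = (
    "VP",
    ("VBD", "heard"),
    (
        "NP",
        ("NP", ("DT", "the"), ("NN", "claim")),
        ("SBAR", ("IN", "that"), ("S", ("NP", ("NNP", "John")), ("VP", ("VBD", "left")))),
    ),
)
# VP immediately dominates SBAR (hypothetical parse).
DIRECT = (
    "VP",
    ("VBZ", "thinks"),
    ("SBAR", ("S", ("NP", ("NNP", "John")), ("VP", ("VBD", "left")))),
)


def children(node):
    # Immediate daughters that are themselves nodes (words are skipped).
    return [c for c in node[1:] if isinstance(c, tuple)]


def descendants(node):
    # All nodes properly dominated by node.
    for child in children(node):
        yield child
        yield from descendants(child)


def nodes(tree):
    yield tree
    yield from descendants(tree)


def exists(tree, upper, lower, below):
    # Is there a node X labelled `upper` and a node Y labelled `lower`
    # such that Y is among below(X)?
    return any(
        x[0] == upper and any(y[0] == lower for y in below(x))
        for x in nodes(tree)
    )


for tree in (VIA_NP, DIRECT):
    print(
        exists(tree, "VP", "SBAR", descendants),   # node(Y,'VP'), Y dom X, node(X,'SBAR')
        exists(tree, "VP", "SBAR", children),      # node(Y,'VP'), Y idom X, node(X,'SBAR')
    )
```

Every tree that satisfies the idom query also satisfies the dom query, which is consistent with the smaller match count reported for idom above.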


Download

Application

Platform: MacOS X (PowerPC) (10.3, 10.4)

    File: treebanksearch-powerpc.zip (Updated: 2/15/07)

        Note: Requires Aqua Tcl/Tk (10.4: already installed by default, 10.3: download from http://tcltkaqua.sourceforge.net/)

    Install/Run:

        (Unzip if necessary.)
        Drag the application to your Applications folder.

        Double-click the application.
        Enter the keycode.
        (See the Licensing section.)

        Double-click the application again.

Platform: MacOS X (Intel) (10.4)

    File: treebanksearch-intel.zip (1264KB) Updated: 1/22/07

        Note: There is no need to install additional software.
        The application works with the standard 10.4 Aqua Tcl/Tk libraries supplied by Apple in /Library/Frameworks/ in PowerPC (Rosetta emulation) mode.

        A universal binary (ActiveTcl) is available from http://www.activestate.com/Products/ActiveTcl/

        Release Note: 1/22/07 version is a bit flaky due to TCL/TK performance issues. Sorry. Will be updated soon.

        Application executable references /Library/Frameworks/Tk.framework/Versions/8.4/Tk
        File should report:
        Tk: Mach-O universal binary with 2 architectures
        Tk (for architecture ppc):  Mach-O dynamically linked shared library ppc
        Tk (for architecture i386): Mach-O dynamically linked shared library i386

    Install/Run:

        (Unzip if necessary.)
        Drag the application to your Applications folder.

        Double-click the application.
        Enter the keycode.
        (See the Licensing section.)

        Double-click the application again.

Platform: Windows XP

    File: treebanksearch-winxp.zip (Not available yet.)
    [Version compiled sans SP1 on Visual C++ 2005 Express Edition. SP1 breaks the code.]

        Note: This relies on ActiveTCL and Microsoft Visual C++ DLLs.

        You need to install ActiveTCL for Windows XP. Download from http://downloads.activestate.com/ActiveTcl/Windows/.

        Sicstus Prolog 3.12.7 was built against TCL/TK version 8.4.13. To avoid DLL hell, you probably should install this exact version (not the very latest release) from the above URL. In other words, download: 8.4.13/ActiveTcl8.4.13.0.261555-win32-ix86-threaded.exe

        When installing ActiveTCL, choose the exact same directory used in the Sicstus build. This means not installing in the default directory (C:\Tcl), but in C:\Tcl-8.4.13; see the configuration screen to the right.

    Install/Run:

        (Unzip the treebankviewer folder.)
        Place the folder in C:\Program Files

        To run, double-click the executable tbs.exe in the C:\Program Files\treebanksearch folder.

        [Do not double-click treesearch.tcl. It will start the GUI only, but you will not be able to display any trees.]


Treebanks

See the treebankviewer homepage downloads here for the example Lasnik & Uriagereka treebank and the Penn Treebank.

Collins Parser Output

This is the output from running the Collins parser on the Wall Street Journal (WSJ) section of the Penn Treebank (PTB).

Models 1 through 3 in treebankviewer format. Temporarily restricted download (requires password):

Download: wsj-collins.zip (34.8MB, .zip archive)

    Sentence File: wsj.txt
        Temporarily included.

    Prolog TreeBank:
        Model 1: Source: wsj-collins-m1.pl, Compiled: wsj-collins-m1.po
        Model 2: Source: wsj-collins-m2.pl, Compiled: wsj-collins-m2.po
        Model 3: Source: wsj-collins-m3.pl, Compiled: wsj-collins-m3.po


Licensing

You must have a keycode for this machine. It costs $5.

The following pop-up will appear if you do not have a keycode for the machine on which treebanksearch is installed.

You should note the displayed hostname.

The keycode supplied to you will be an (exact) function of this name.

If the keycode is corrupt or the application has been moved to another machine or the machine hostname has changed, the following pop-up will appear when starting treebanksearch.

A new keycode will need to be generated.

In either case, after you press the "OK" button:

A second pop-up will appear inviting you to enter a valid keycode.

Copy and paste the supplied keycode exactly, and press "OK".

Re-start treebanksearch. You will not see the keycode pop-ups again.


Appendix

Coming soon...


Last modified: Thu Feb 15 23:43:00 MST 2007