`ParsePF`

Components (example: John can't sleep):

Input Words
  [john,can,'t,sleep]
Expand Contractions
  contraction([],can,'''t',[can,neg]).
expandContractionsHook/2
  (null)
Words
  [john,can,neg,sleep]
Lookup Words
  lex(john,n,[a(-),p(-),agr([3,sg,m])]).
  lex(can,v,[morph(can,def(past(-))), aux,modal,subcat(vp$[morph(_,[])],[])]).
  lex(neg,mrkr,[left(v,aux,neg)]).
  lex(sleep,v,[morph(sleep,[]),grid([agent],[])]).
parsePFHook/2
  (null)
ZLSS
  [n John][v$[neg] can][v sleep]

Introduction

ParsePF represents the first stage of parsing in the PAPPI system. It receives as input the sentence to be parsed, i.e. a sequence of words, and performs lexical analysis to produce as its output a corresponding sequence of base-level or zero-level constituents (ZLSS).

These base-level constituents will be in turn fed as input to the second stage of parsing, namely ParseSS. This level performs standard phrase structure analysis to construct the corresponding phrase structure tree representations at S-structure.

For example, ParsePF transforms a sentence like John can't sleep into the following sequence of zero-level constituents: [N John][V$[neg] can][V sleep]. Within ParsePF, the transformation is broken down into a series of smaller stages, as shown in the diagram above.

Input Words

PAPPI can accept input sentences typed at the PAPPI input window (or any other window). Once the input sentence has been selected, the Run button will send the sentence to ParsePF.

Basically, the input consists of just words separated by spaces. However, there are a few embellishments:

Anything between parentheses (..) will be ignored.
The percent sign (%) is the comment character. The rest of the line is ignored.
Asterisks (*) are ignored.
. ? , ! " ' <newline> <tab> <space> are word delimiters. All other characters except for $ and [..] are considered to be part of the word.
Delimiters are generally discarded, i.e. not passed to ParsePF, except in the case of the single quote character (') which functions both as a delimiter and as the beginning of the next word. For example, John's becomes:
john 's
Square brackets [..] immediately following a word can be used for coindexation. For example:
John[i] thought that he[i] won
The preprocessor translates this into:
john$[add(index(I))] thought that he$[add(index(I))] won
[Explanation: the call to add/1 forces parsePF to assign a feature index/1 with the same variable (I) to both John and he.]
Note that empty square brackets ([]) can also be used to exclude a word from coindexation. For example, it[] translates into:
it$[add(noCoindex)]
[Explanation: noCoindex is a feature used by Free Indexation defined in the principles file.]
If caseFoldInput holds, words will be lowercased. By convention, this is set in the lexicon file. So the preprocessor turns John into the Prolog atom john, not 'John'. To leave case unchanged, use the following declaration instead:
no caseFoldInput.
The dollar sign ($) can be used in the sequence word$translation to restrict word lookup to use the lexical entry for word that has feature eng(translation). For example, in the Japanese implementation, specifying katta$buy in the input string eliminates unwanted multiple parses.
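The coindexation translation described above relies on one Prolog variable being shared by every word that carries the same index label. The following is a minimal sketch of that idea only, not PAPPI's preprocessor: the predicate coindex/3 is hypothetical, a Word-Label pair stands in for the input form Word[Label], and Word-[add(index(V))] stands in for the output form Word$[add(index(V))].

```prolog
:- use_module(library(lists)).

% coindex(+Tokens, ?Map, -Annotated)   (hypothetical helper, not part of PAPPI)
% Tokens:    plain atoms, or Word-Label pairs standing in for Word[Label].
% Map:       open-ended list of Label-Var pairs; pass a fresh variable.
% Annotated: Word-[add(index(Var))] for labelled words, plain atoms otherwise.
coindex([], _, []).
coindex([Word-Label|Ts], Map, [Word-[add(index(Var))]|As]) :-
    !,
    % Reuse the variable already recorded for Label, or extend Map with a new one.
    memberchk(Label-Var, Map),
    coindex(Ts, Map, As).
coindex([Word|Ts], Map, [Word|As]) :-
    coindex(Ts, Map, As).
```

Calling coindex([john-i,thought,that,he-i,won], _, Out) gives john and he the same index variable, while words carrying a different label would receive a distinct one.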

References: Implementation Notes

`Expand Contractions`

Expand Contractions uses simple contraction rules defined in the lexicon to 'chop' or split up words into their components. For example, the following two rules can be used to derive can not and would not from can't and wouldn't, respectively.

contraction([],can,'''t',[can,not]).
contraction([],X+n,'''t',[X=morph(_,_),not]).

Similarly, the following two rules simply stipulate that 'd and 've should be expanded into would and have, respectively.

contraction([],'''d',[would]).
contraction([],'''ve',[have]).

Both contraction/3 and contraction/4 are examples of simple string manipulation rules. That is, Expand Contractions does not produce lexical categories, nor does it take them as input. Note that this means such rules cannot make use of general lexical features except under a very limited set of circumstances concerned with stem recognition.

In general, contraction rules apply optionally. That is, PAPPI will try both applying a contraction rule and letting the input word go through unchanged. A special blockContraction declaration can be used on a word-by-word basis to inhibit contraction processing. For more details on the format of the rules, see the documentation on the predicate contraction.
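Optional rule application can be pictured as a nondeterministic rewrite over the word list. The sketch below is an illustration under stated assumptions, not PAPPI's implementation: toy_contraction/4 and toy_contraction/3 are hypothetical tables that copy the shape of the rules above, the first argument of the four-place form is carried along without being interpreted, and stem patterns such as X+n and blockContraction are not modelled.

```prolog
:- use_module(library(lists)).

% Toy rules in the shape of contraction/4 and contraction/3 (hypothetical names).
toy_contraction([], can, '''t', [can, not]).
toy_contraction([], '''d', [would]).

% expand(+Words, -Expanded)
% Tries a two-word rule, then a one-word rule, then leaves the word unchanged.
% Every alternative is kept as a choice point, so rules apply optionally.
expand([], []).
expand([Stem, Suffix|Ws], Out) :-
    toy_contraction(_, Stem, Suffix, Expansion),
    expand(Ws, Rest),
    append(Expansion, Rest, Out).
expand([W|Ws], Out) :-
    toy_contraction(_, W, Expansion),
    expand(Ws, Rest),
    append(Expansion, Rest, Out).
expand([W|Ws], [W|Rest]) :-
    expand(Ws, Rest).
```

On backtracking, expand([john,can,'''t',sleep], Out) yields first the expanded string and then the unchanged one, mirroring the two candidate strings described for the tracing example.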

Debugging

The contraction mechanism can be traced by turning on the Trace Contractions flag under the Tracing Options control panel:

(a) Turning on Trace Contractions

(Alternatively: it can be activated using the setup option tracingContractions.)

For example, here is the contraction mechanism at work on the Japanese sentence Taroo-ga hihansareta:

(a) Tracing Expand Contractions

Here, Expand Contractions produces two candidate strings. In the first one, prefixed by Exit Contraction: (1), the verb complex hihansareta is broken down into the suru-taking verb stem (hihan), followed by the passive form of suru (sare) to be modified by the past tense morpheme (-ta). This analysis subsequently leads to a successful parse. The second candidate string is the one where no contraction rule has applied.

References: contraction / blockContraction

`Expand Contractions Hook`

In later versions (PAPPI 3.x only), a hook predicate is provided after Expand Contractions but before Lookup Words. Initially, expandContractionsHook/2 is defined to be a null stage as follows:

expandContractionsHook(Words,Words).

See the Turkish implementation for an example of how to extend expandContractionsHook/2 to handle noun-noun compounding.

Words

Note that the two previous stages operate on simple strings and apply simple word transformation rules. In particular, lexical items have not yet acquired part-of-speech labels or feature bundles. That task is reserved for the next stage, Lookup Words.

Constraints attached to input words, as described above in the Input Words section, generally pass through Expand Contractions unscathed. Hence, the format of the words presented as input to Lookup Words will be largely unchanged. However, there is a special provision for compound words formed either in Expand Contractions or expandContractionsHook.

The following three compound forms are also valid:

merge(M,Word).
This structure is left-recursive. M is either another instance of merge/2 or a simple word.
mergeR(Word,M).
This structure is right-recursive. M is either a simple word or another instance of mergeR/2.
mergeRC(M₁,M₂).
M₁ and M₂ are both instances of mergeR/2. This is used in multiple compounding.

The behaviour of these merge structures during lexical item formation is described in the next section.
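To make the three compound forms concrete, the sketch below flattens a merge structure into its component words and joins them into one atom. It is only an illustration: components/2 and concat_words/2 are hypothetical helpers, and the left-to-right ordering of components within mergeRC/2 is an assumption.

```prolog
:- use_module(library(lists)).

% components(+Structure, -Words)   (hypothetical helper)
% Flattens merge/2 (left-recursive), mergeR/2 (right-recursive) and mergeRC/2.
components(merge(M, Word), Ws) :- !,
    components(M, Ws0),
    append(Ws0, [Word], Ws).
components(mergeR(Word, M), [Word|Ws]) :- !,
    components(M, Ws).
components(mergeRC(M1, M2), Ws) :- !,
    components(M1, Ws1),
    components(M2, Ws2),
    append(Ws1, Ws2, Ws).
components(Word, [Word]).          % a simple word is its own single component

% concat_words(+Words, -Atom): joins a non-empty list of atoms left to right.
concat_words([W], W).
concat_words([W1, W2|Ws], Atom) :-
    atom_concat(W1, W2, W12),
    concat_words([W12|Ws], Atom).
```

For instance, components(merge(merge(a,b),c), Ws) and components(mergeR(a,mergeR(b,c)), Ws) both produce the same word list, although the two structures nest in opposite directions.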

`Lookup Words`

The task of producing labelled lexical items falls to the third stage, namely Lookup Words. Each word is processed as follows:

If the word is a non-compound structure, it is matched against the lexicon using:

lexicon(Word,C,Fs)

C is the category label and Fs is the list of lexical features associated with Word.

The behaviour of Lookup Words is determined by the outcome of lexical lookup:

If the lookup is successful, the zero-level category [C Word] is formed (except in the case of markers to be described below).
If the lookup is unsuccessful, ParsePF will fail for the whole sentence.
If the word is ambiguous, i.e. there is more than one matching entry, ParsePF becomes non-deterministic. In other words, ParsePF will produce multiple outputs on backtracking.
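The three outcomes above follow directly from ordinary Prolog clause resolution. A minimal sketch, assuming a hypothetical toy_lex/3 table in place of lexicon/3 and a hypothetical zl(C,Word,Fs) term in place of the zero-level category [C Word]; the second entry for sleep is invented purely to show ambiguity, and markers and compounds are ignored:

```prolog
% Toy lexicon (hypothetical entries).
toy_lex(john,  n, [agr([3,sg,m])]).
toy_lex(sleep, v, [morph(sleep,[])]).
toy_lex(sleep, n, []).                 % invented second entry, for ambiguity only

% lookup_words(+Words, -ZeroLevelItems)
% Fails for the whole sentence if any word has no entry;
% yields one solution per combination of matching entries on backtracking.
lookup_words([], []).
lookup_words([W|Ws], [zl(C, W, Fs)|Zs]) :-
    toy_lex(W, C, Fs),
    lookup_words(Ws, Zs).
```

With this table, lookup_words([john,sleep], Zs) has two solutions, one per entry for sleep, and a sentence containing an unlisted word has none.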

Markers

If the word is a marker, i.e. C = mrkr, no category [mrkr Word] is formed by Lookup Words.

In PAPPI, markers are special elements appearing in the input that do not project structure like regular categories. Instead, markers are realizations of feature elements that are attached to regular categories.

For example, in the English implementation, of as in:

(a) his picture of Mary
(b) proud of Mary

is encoded as the realization of genitive Case. Following Knowledge of Language [Chomsky, 86], we assume that the heads picture and proud assign genitive Case to the complement NP Mary. In other words, rather than positing a complement PP headed by of in both cases, we get the following two (simpler) parse fragments instead:


(a) his picture of Mary	(b) proud of Mary

The lexical entry for of is as follows:

lex(of,mrkr,[right(np,case(gen),[])]).		  % object genitive Case

Here, the marker feature right/3 states that of attaches as a feature case(gen) to the NP on the right.

In general, the possible elements to which a marker can attach must be declared using the predicate relevant(C), where C is a category label. For example, for English, we have:

% relevant for marker constraints
relevant(n).
relevant(v).

Non-relevant categories are ignored, i.e. skipped over, for the purposes of feature attachment.

The following table specifies the possible marker features. Each lexical entry for a mrkr item must contain a marker feature. Generally, for each marker feature, C will be a category label, F a feature to be matched and A a feature to be added to the matching relevant item. (The possible forms for F and A are defined later in separate tables):

Marker Feature	Description

left(C,F,A)	Marks X, the relevant element of category C to its left. X must satisfy feature constraints F. If the match is successful, the features given in A are added to X. If the match is unsuccessful, Lookup Words fails locally.
Example:
The English possessive marker 's is defined as follows:
lex('''s',mrkr,[left(n,[not(morphC(_)),case(gen)],[])]).
The possessive marker is the realization of genitive Case for those nouns not already morphologically marked as genitive (e.g. possessive personal pronouns like his).

right(C,F,A)	Marks X, the relevant element of category C to its right. X must satisfy feature constraints F. If the match is successful, the features given in A are added to X. If the match is unsuccessful, Lookup Words fails locally.
Example:
The infinitival marker to matches with the (base form) verb to its right, and adds the feature inf([]) to the verb:
lex(to,mrkr,[right(v,morph(_,[]),inf([]))]). % infinitival marker
See also the of-insertion example described earlier.

leftec(C,F,A,G,_)	As for left(C,F,A), except when the relevant element immediately to its left is not of category C. In that case, a new empty category with label C is created and inserted immediately to its left. The features of this new category are given by the goal G, which must be of the form goal(Goal,Fs), where Goal is a call to a user-specified predicate that computes a list of features Fs.
Example:
In the Turkish implementation, the plural marker normally marks a singular noun to its left. In the case of a plural-marked adjective like küçükler (small-plr), the following rule produces küçük+[N], where [N] is a plural-marked empty noun:
lex(plr,mrkr,[leftec(n,[agr([3,sg,[]])],override(agr([3,pl,[]])), goal(emptyNFs(Fs),Fs),_)]).
emptyNFs/1 is defined by:
emptyNFs(Fs) :- mkFs([ec(_),a(-),p(-),grid([],[]),agr(_),noECP(lf), case(_),theta(_)],Fs).
Note: the agr(_) feature will be instantiated by override(agr([3,pl,[]])).

rightec(C,F,A,G,_)	As for right(C,F,A), except when the element immediately to its right is not of category C. In that case, a new empty category with label C is created and inserted immediately to its right. The features of this new category are given by the goal G, which must be of the form goal(Goal,Fs), where Goal is a call to a user-specified predicate that computes a list of features Fs.
Example:
From the Turkish implementation, the noun relativizer -ki marks the noun to its right and allows it to take a locative complement:
lex(ki_mrk,mrkr,[rightec(skip(n,[a]),[not(a(+)),not(p(+))], [modify(grid(X,_),grid(X,[location]))], goal(emptyNFs(Fs),Fs),_)]).
Note: the definition of emptyNFs/1 is given in the example for leftec.
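The attachment behaviour of right(C,F,A) can be sketched as a scan over the items to the marker's right. This is a simplified illustration, not PAPPI's code: item(C,Word,Fs) is a hypothetical item representation, toy_relevant/1 stands in for relevant/1, F is restricted to a single feature, A to a single feature that is always prepended, and empty category creation (leftec/rightec) is not modelled.

```prolog
:- use_module(library(lists)).

% Stand-in for the relevant/1 declarations of the English implementation.
toy_relevant(n).
toy_relevant(v).

% attach_right(+C, +F, +A, +Items, -NewItems)   (hypothetical helper)
% Skips non-relevant items, then requires the first relevant item to be of
% category C and to contain a feature unifiable with F; A is added to it.
attach_right(C, F, A, [item(C0, W, Fs)|Is], [item(C0, W, Fs)|Js]) :-
    \+ toy_relevant(C0), !,
    attach_right(C, F, A, Is, Js).
attach_right(C, F, A, [item(C, W, Fs)|Is], [item(C, W, [A|Fs])|Is]) :-
    memberchk(F, Fs).
```

Using the infinitival marker's values, attach_right(v, morph(_,[]), inf([]), Items, New) adds inf([]) to a base-form verb, skips over any item whose category is not declared relevant, and fails if the first relevant item is a noun or a verb that is not in base form.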

Feature Constraints

In general, F specifies the pre-conditions for marker attachment in left/right/leftec/rightec. The elements in F are unified with the features of the candidate item according to the following rules:

Constraint	Description

[]	Matches any item.

F	F is a feature. The item must contain a feature unifiable with F.

not(F)	F is a feature. The item must not contain a feature unifiable with F.

[F₁,..,F_n]	The item must satisfy constraints F₁ through F_n.

if(F)	F is a constraint. If F matches, the add-feature portion A is carried out. If there is no match, A is skipped.

eval(F,G)	F is a feature and G a goal. The item must contain a feature unifiable with F, and goal G must hold.

eval(if(F),G)	F is a feature and G a goal. If the item contains a feature unifiable with F, goal G must hold. Otherwise, G is skipped.

Example:

lex(gen,mrkr,[leftec(n,[eval(if(morphC(X)),(X==nom))],override(morphC(gen)),
		     goal(emptyNFs(Fs),Fs),_)]).

Here, the Turkish genitive Case marker instantiates the morphC(gen) feature of the item immediately to its left. However, if the feature morphC is already present on the item, only nominative Case can be overridden by genitive Case.
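Most rows of the constraint table translate almost directly into clauses. The following sketch covers [], a plain feature, not(F), lists, eval(F,G) and eval(if(F),G); it is an illustration with a hypothetical predicate name, satisfies/2, and it leaves out if(F), whose effect depends on the add-feature portion.

```prolog
:- use_module(library(lists)).

% satisfies(+Constraint, +Features)   (hypothetical helper)
% Features is the feature list of the candidate item.
satisfies([], _) :- !.                               % [] matches any item
satisfies([C|Cs], Fs) :- !,                          % every constraint must hold
    satisfies(C, Fs),
    satisfies(Cs, Fs).
satisfies(not(F), Fs) :- !,                          % no feature may unify with F
    \+ memberchk(F, Fs).
satisfies(eval(if(F), G), Fs) :- !,                  % G is checked only if F is present
    (   memberchk(F, Fs) -> call(G) ; true ).
satisfies(eval(F, G), Fs) :- !,                      % F must be present and G must hold
    memberchk(F, Fs),
    call(G).
satisfies(F, Fs) :-                                  % plain feature: unify with one in Fs
    memberchk(F, Fs).
```

Because matching is by unification, satisfies([not(morphC(_)),case(gen)], [case(V),a(-)]) succeeds and instantiates V to gen, which is one way to read the remark that feature instantiation can happen during matching. The Turkish constraint eval(if(morphC(X)),(X==nom)) succeeds on an item with morphC(nom) or with no morphC feature, and fails on an item with any other morphC value.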

[A note on leftec/rightec: Consecutive leftec markers will not introduce multiple empty categories. For instance, in the Turkish implementation, siyahları (black-pl-acc), to be interpreted as "the black ones", is defined as being an adjective followed by the plural and accusative Case markers. Both of these are leftecs. However, only one empty noun marked for plural number and accusative Case should be generated. Hence, for empty noun generation, PAPPI automatically batches up consecutive leftec and rightec markers.]

Add Features

In general, A in left/right/leftec/rightec specifies the features to be added to the matching item. Note that feature instantiation can be carried out during the matching phase. The following table gives the possible ranges of values for the add features:

Add Feature	Description

[]	Do nothing.

A	A is a feature. A is added to the feature list for the matching item if no feature already in the item unifies with A.
Note: compare with add features new(A) and override(A).

new(A)	A is a feature. A is added to the feature list for the matching item provided the feature is not already present.
Note: in cases where feature A contains slots, e.g. as in f(V), f/1 is considered to be already present if a feature with the same functor and arity already exists in the matching item. In these cases, f(V) is not added and add feature succeeds quietly.

override(A)	A is a feature. A is added to the feature list for the matching item.
Note: in contrast with new(A) and A, override does not care whether the feature already exists.

[A₁,..,A_n]	Features A₁ through A_n are added to the matching item.

modify(F,A)	F and A are features. The matching item must have a feature unifiable with F. If so, A is added to the item.
Example:
In the Japanese implementation, the past tense marker modifies the morph feature of the verb to its left, namely X:
lex(past,mrkr,[left(v,[],[modify(morph(X,_),morph(X,past(+))), suffix(W+ta,K+a4bf)]),prefix(W,K)]).

modify(F,A,G)	G is a goal, and F and A are features. The matching item must have a feature unifiable with F. If so, goal G must hold and A is added to the item.

suffix(S)	S is a simple atom or a concatenation expression of the form X+Y, where X and Y are simple atoms. The word for the matching item is suffixed with S. If S is not simple, X and Y are first concatenated to form a simple atom.

suffix(S,K)	As for suffix(S), except that K is also concatenated onto the end of the value of the k(_) feature for the matching item. As in the case of S, K may be simple or a concatenation expression.
Example:
In the Japanese implementation, the nominative Case marker ga marks the noun to its left by adding -ga to the word and the EUC code a4ac to its k(_) feature:
lex(ga,mrkr,[left(n,[],[morphC(nom),suffix(ga,a4ac)])]).
See the earlier example of the past tense marker for an example where S and K are non-simple.
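The difference between a plain feature A, new(A) and override(A) is easiest to see side by side. The sketch below is a literal reading of the table rows under stated assumptions, not PAPPI's implementation: add_feature/3 is a hypothetical name, a plain A that unifies with an existing feature is assumed to instantiate that feature and leave the list otherwise unchanged, override(A) simply prepends A without removing or instantiating any existing feature, and modify/suffix are not modelled.

```prolog
:- use_module(library(lists)).

% add_feature(+Add, +Features0, -Features)   (hypothetical helper)
add_feature([], Fs, Fs) :- !.                         % [] : do nothing
add_feature([A|As], Fs0, Fs) :- !,                    % list: add each in turn
    add_feature(A, Fs0, Fs1),
    add_feature(As, Fs1, Fs).
add_feature(new(A), Fs0, Fs) :- !,                    % new(A): skip if same functor/arity exists
    functor(A, Name, Arity),
    functor(Template, Name, Arity),
    (   memberchk(Template, Fs0) -> Fs = Fs0 ; Fs = [A|Fs0] ).
add_feature(override(A), Fs, [A|Fs]) :- !.            % override(A): add regardless (assumed: prepend)
add_feature(A, Fs0, Fs) :-                            % plain A: add unless one already unifies
    (   memberchk(A, Fs0) -> Fs = Fs0 ; Fs = [A|Fs0] ).
```

Under these assumptions, adding f(1) to an item that already has f(2) gives three different results: plain f(1) is added because it does not unify with f(2), new(f(1)) is silently skipped because f/1 is already present, and override(f(1)) is added unconditionally.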

In summary, markers are general devices that may be used wherever words or morphemes can simply be reduced to features attached to lexical items. See the Hungarian and Turkish lexicons for more examples of how morpheme markers are used in conjunction with the contraction mechanism.

If the word is a compound structure consisting of words W₁..W_n, n>1, each component W_i is matched against the lexicon using:
lexicon(W_i,C,Fs) for 1<=i<=n
Each W_i must share the same category label C. A compound zero-level category is formed:
[C Word] where Word is W₁,..,W_n concatenated.
Each compound structure must also obey the following rules:
- If the structure is merge(..merge(W₁,W₂)..,W_n), W₁..W_n must all share the same theta-grid feature, grid/2.
  The features of the aggregate are simply the union of the features of its components, except for the k(_) feature, which is formed by concatenating all of the component k(_) features.
  Note: This form of compounding is currently only used by the Chinese implementation for Serial Verb Constructions (SVC).
- If the structure is mergeR(W₁,..mergeR(W_n-1,W_n)..), the features of the aggregate are basically those of the head W₁.
  Two other declarations must also be provided:
  - mergeRFeatures(List), where List is a list of component features to be accumulated.
  - mergeRFeaturesComposeHook(List,List'), where each element of List contains the list of features nominated in mergeRFeatures/1 for W₁ through W_n.
    The supplied definition must return a single list, List', which will be appended to the list of features for the aggregate.
  Example:
  The Turkish implementation uses mergeR plus the following definitions to produce a composite eng(_) feature for noun-noun compounding:
```
mergeRFeatures([eng(_)]).

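% takes [[eng(E1)],[eng(E2)],..,[eng(En)]]
% returns [eng(E)], where composeEng/2 below builds E as En .. E2 E1
% (space-separated, with the gloss of the first component placed last)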

mergeRFeaturesComposeHook(List,[eng(Eng)]) :-
	composeEng(List,Eng).

composeEng([[eng(E)]],E) :- !.			% red
composeEng([[eng(E1)]|Fs],E) :-
	composeEng(Fs,E2),
	concatAtoms([E2,' ',E1],E).
```
- If the structure is a multiple compound structure, mergeRC(M₁,M₂), where M₁ and M₂ are instances of mergeR/2, the features of the aggregate are basically those of the head of the sub-compound M₁. The rest of the details are identical to those for mergeR/2.

References: lexicon / k(_)

`ParsePF Hook`

Finally, ParsePF Hook is initially defined to be a null stage. It may be overridden as needed by individual lexicons to perform additional transformations on the input. Its initial definition is as follows:

parsePFHook(PF,PF).

For instance, see the Hungarian or Japanese lexicon for examples of how ParsePF Hook can be redefined to fill in default lexical feature values.

`ZLSS`

The output of ParsePF is a sequence of zero-level categories with markers resolved into features that attach to relevant lexical or specially-introduced empty categories.
