CS 598 JH: Advanced NLP (Spring ʼ09)

CFG parsing:
The basics

Todayʼs topics
CFGs and PCFGs:
CFGs as AND/OR graphs
Shared parse forests

CFGs and
AND-OR graphs
Context-free grammars
A CFG is a 4-tuple〈N,Σ,R,S〉
A set of nonterminals N
(e.g. N = {S, NP, VP, PP, Noun, Verb, ....})

A set of terminals Σ
(e.g. Σ = {I, you, he, eat, drink, sushi, ball, ...})

A set of rules R
R ⊆ {A → β with left-hand-side (LHS) A ∈ N
and right-hand-side (RHS) β ∈ (N ∪ Σ)* }
A start symbol S (sentence)
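
The 4-tuple can be written down directly as a data structure. The following is a minimal sketch, not part of the original slides: the class name `CFG`, its field names, and the toy grammar are illustrative assumptions. The disjointness of N and Σ is the usual convention and is also assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    nonterminals: frozenset  # N
    terminals: frozenset     # Σ
    rules: frozenset         # R: pairs (lhs, rhs), rhs is a tuple of symbols
    start: str               # S

    def is_well_formed(self) -> bool:
        """Check: S ∈ N, every LHS ∈ N, every RHS ∈ (N ∪ Σ)*."""
        symbols = self.nonterminals | self.terminals
        return (
            self.start in self.nonterminals
            # Assumption (usual convention): N and Σ are disjoint.
            and not (self.nonterminals & self.terminals)
            and all(
                lhs in self.nonterminals and all(s in symbols for s in rhs)
                for lhs, rhs in self.rules
            )
        )


# Hypothetical toy grammar, for illustration only.
toy = CFG(
    nonterminals=frozenset({"S", "NP", "VP"}),
    terminals=frozenset({"I", "eat", "sushi"}),
    rules=frozenset({
        ("S", ("NP", "VP")),
        ("NP", ("I",)),
        ("NP", ("sushi",)),
        ("VP", ("eat", "NP")),
    }),
    start="S",
)
```

An empty RHS tuple `()` would represent an ε-rule, since β ranges over (N ∪ Σ)*.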

Some terminology

What is:
- the language defined by a CFG?
- a CFG derivation?
- a CFG parse tree?
- the yield of a CFG?

AND/OR graphs
Formal Definition (Mahanti/Bagchi, 1985)

An AND/OR graph G is a directed graph with a special node s (start/root node)
and a nonempty set {t1,…,tn} of terminal nodes.
The nonterminal nodes {n1...nm} are of two types: AND and OR
(The 3rd type (NONTERMINAL LEAF) is irrelevant for our purposes)

Semantics (Mahanti/Bagchi, 1985)

Start node = the problem to be solved

AND nonterminal node ni
= all its immediate descendants have to be solved
OR nonterminal node nj
= one of its immediate descendants has to be solved
Terminal nodes = their solutions are known

CFGs as AND-OR graphs

Terminals of grammar G = Terminal nodes of graph G

Start symbol of G = Start node of G
Rules of G = AND nodes of G
Nonterminals of G = OR nodes of G
(B.Lang 1989/1991)
Each individual parse tree is an AND graph
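
The mapping above can be sketched in a few lines. This is an illustrative sketch, not from the slides: the function names `build_and_or_graph` and `solvable` and the toy rules are assumptions. Each nonterminal becomes an OR node whose children are its rules (AND nodes). Because grammars are recursive, "solved" is computed as a fixpoint rather than by plain recursion.

```python
def build_and_or_graph(rules):
    """Map each nonterminal (OR node) to the list of its rules (AND nodes)."""
    or_nodes = {}
    for lhs, rhs in rules:
        or_nodes.setdefault(lhs, []).append(tuple(rhs))
    return or_nodes


def solvable(or_nodes):
    """Fixpoint over the AND/OR semantics:
    an AND node is solved if all its children are solved,
    an OR node is solved if one of its AND children is solved,
    terminal nodes (symbols without rules) count as solved."""
    solved = set()
    changed = True
    while changed:
        changed = False
        for nonterminal, and_children in or_nodes.items():
            if nonterminal in solved:
                continue
            for rhs in and_children:
                if all(sym in solved or sym not in or_nodes for sym in rhs):
                    solved.add(nonterminal)
                    changed = True
                    break
    return solved


# Hypothetical toy rules; X never bottoms out in terminals.
rules = [
    ("S", ["NP", "VP"]),
    ("NP", ["sushi"]),
    ("NP", ["NP", "PP"]),
    ("VP", ["eat", "NP"]),
    ("PP", ["with", "NP"]),
    ("X", ["X", "X"]),
]
graph = build_and_or_graph(rules)
solved = solvable(graph)
```

Here `solved` contains exactly the nonterminals that derive at least one terminal string.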

(B.Lang 1989/1991)

Context and subtree

We can split the parse tree τ at each nonterminal node ni
into a context and a subtree.

If yield(τ) = uvw, and yield(ni) = v, then
subtree(ni) = v, and context(ni) = u ni w
(B.Lang 1989/1991)

Two kinds of ambiguity

(B.Lang 1989/1991)

Are there any other kinds of
Shared parse forests

A compact representation of sentential ambiguity

Aside: Parse forests and
We can view the parse forest (AND/OR graph) of a sentence S
as a grammar G_S, with L(G_S) = {S}, except that now multiple
AND nodes may be labeled with the same rule.

The size of a parse forest
For a grammar with maximal branching factor p,
the size of a shared parse forest for a sentence S
with n words is $O(n^{p+1})$
(B. Lang, 1991)

Hence, space complexity of CKY (with binary CFG) is $O(n^3)$
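
A rough way to see the cubic bound for the binary case: every AND node of the forest is a rule X → Y Z applied to a span from position i to position j with a split point k. The sketch below (the function name `count_split_triples` is an assumption, not from the slides) counts these (i, k, j) triples. The count per rule is (n+1)n(n−1)/6, which grows as n³.

```python
def count_split_triples(n):
    """Count triples (i, k, j) of string positions with 0 <= i < k < j <= n.

    Each triple is one possible way to build a span i..j from two adjacent
    sub-spans i..k and k..j, i.e. one potential AND node per binary rule.
    """
    return sum(
        1
        for i in range(n + 1)
        for j in range(i + 2, n + 1)
        for k in range(i + 1, j)
    )


for n in (5, 10, 20, 40):
    print(n, count_split_triples(n), n ** 3)
```

This is an upper bound per rule. The actual forest for a given sentence contains only the triples that the grammar licenses.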

Parsing as a deductive
(Shieber, Schabes, Pereira, 1993)
Parsing as deduction: what?
“Parsing can be viewed as a deductive process that seeks
to prove claims about the grammatical status of a string
from assumptions describing the grammatical properties of
the stringʼs elements and the linear order between them”

(Shieber, Schabes, Pereira ʼ93)

Cf. categorial grammar (Lambek, Ajdukiewicz, Bar-Hillel)

Parsing as deduction: why?
This allows a separation of parsing into:

- a logic of grammaticality claims (= the grammar)

- a proof search procedure (= the parsing algorithm)

This also provides the formal basis and useful terminology
for understanding parsing algorithms

An inference rule

    A1 ... Ak
    ---------
        B

consists of

- antecedents A1...Ak
- a consequent B
NB: there may be side conditions on A1...Ak and B.
Usually rules are given as schemata, where A1...Ak and B
are or contain variables that need to be instantiated when
the rule is used.

A derivation of a formula B from assumptions A1...Am
is a sequence of formulas S1...Sn such that
- B = Sn
- for each i ≤ n: Si = Aj for some j, or Si is an instance of an axiom,
or there is an inference rule that allows Si to be derived
from earlier formulas among S1...Si-1

If a derivation of B from A1...Am exists, we say that
A1...Am derives B:

A1...Am ⊢ B

Parsing as deduction
- Goal formula: the input string w=w1...wn is grammatical
according to the given grammar.
- Parsing = finding a derivation for a goal formula.
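
As an illustration of this view, here is a minimal agenda-driven sketch of a CKY-style deduction system. It is an assumption-laden example, not the formulation from the slides: items have the form (A, i, j), meaning "A derives the words between positions i and j"; axioms come from lexical rules A → w; the single inference rule combines (B, i, k) and (C, k, j) into (A, i, j) under the side condition that A → B C is in the grammar; the goal item is (S, 0, n). The names `deduce`, `lexical` and `binary` are hypothetical, and the grammar is assumed to be in Chomsky normal form.

```python
def deduce(words, lexical, binary, start="S"):
    """Return True if the goal item (start, 0, n) can be derived."""
    n = len(words)
    # Axioms: one item (A, i, i+1) per lexical rule A -> words[i].
    agenda = [(a, i, i + 1) for i, w in enumerate(words) for a in lexical.get(w, ())]
    chart = set()
    while agenda:
        item = agenda.pop()
        if item in chart:
            continue
        chart.add(item)
        a, i, j = item
        for b, k, l in list(chart):
            if k == j:  # new item is the left antecedent
                for lhs in binary.get((a, b), ()):
                    agenda.append((lhs, i, l))
            if l == i:  # new item is the right antecedent
                for lhs in binary.get((b, a), ()):
                    agenda.append((lhs, k, j))
    return (start, 0, n) in chart


# Hypothetical toy grammar in Chomsky normal form.
lexical = {"I": ["NP"], "eat": ["V"], "sushi": ["NP"], "with": ["P"], "tuna": ["NP"]}
binary = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],
    ("NP", "PP"): ["NP"],
    ("P", "NP"): ["PP"],
}

print(deduce("I eat sushi with tuna".split(), lexical, binary))
```

The grammar (the logic) lives entirely in `lexical` and `binary`, while the agenda loop (the proof search) knows nothing about specific rules. Swapping the agenda from a stack to a queue changes the search order but not the set of derivable items.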

PCFG parsing
Computing P(τ | S)
Using Bayesʼ Rule:
$$\arg\max_{\tau} P(\tau \mid S) = \arg\max_{\tau} \frac{P(\tau, S)}{P(S)}$$
$$= \arg\max_{\tau} P(\tau, S)$$
$$= \arg\max_{\tau} P(\tau) \quad \text{if } S = \mathrm{yield}(\tau)$$

The yield of a tree is the string of terminal symbols
that can be read off the leaf nodes

[Figure: parse tree (correct analysis) whose yield is "eat sushi with tuna"]
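
Reading off the yield is a left-to-right traversal of the leaves. A minimal sketch, assuming trees are encoded as nested tuples `(label, child, ...)` with words as plain strings (this encoding and the function name `tree_yield` are illustrative choices, not from the slides):

```python
def tree_yield(tree):
    """Return the list of terminal symbols at the leaves, left to right."""
    if isinstance(tree, str):  # a leaf: a terminal symbol
        return [tree]
    _label, *children = tree
    return [word for child in children for word in tree_yield(child)]


# One analysis of "eat sushi with tuna", with the PP attached to the NP.
tree = (
    "VP",
    ("V", "eat"),
    ("NP", ("NP", "sushi"), ("PP", ("P", "with"), ("NP", "tuna"))),
)

print(" ".join(tree_yield(tree)))
```

Different trees can share the same yield, which is exactly what makes a string ambiguous.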
Computing P(τ)
T is the (infinite) set of all trees in the language:
L = {s ∈ Σ* | ∃τ ∈ T : yield(τ) = s}

We need to define P(τ) such that:

∀τ ∈ T : 0 ≤ P(τ) ≤ 1
∑_{τ∈T} P(τ) = 1
The set T is generated by a context-free grammar
S → NP VP VP → Verb NP NP → Det Noun
S → S conj S VP → VP PP NP → NP PP
S → ..... VP → ..... NP → .....

Probabilistic Context-Free Grammars
For every nonterminal X, define a probability distribution
P(X → α | X) over all rules with the same LHS symbol X:
S → NP VP 0.8
S → S conj S 0.2
NP → Noun 0.2
NP → Det Noun 0.4
NP → NP PP 0.2
NP → NP conj NP 0.2
VP → Verb 0.4
VP → Verb NP 0.3
VP → Verb NP NP 0.1
VP → VP PP 0.2
PP → P NP 1.0
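
A quick sanity check for such a grammar is that the rule probabilities for each LHS symbol sum to 1. The sketch below encodes the rules listed above as `(lhs, rhs, probability)` triples. The list name `PCFG` and the helper `lhs_totals` are illustrative assumptions.

```python
from collections import defaultdict
import math

PCFG = [
    ("S", ("NP", "VP"), 0.8),
    ("S", ("S", "conj", "S"), 0.2),
    ("NP", ("Noun",), 0.2),
    ("NP", ("Det", "Noun"), 0.4),
    ("NP", ("NP", "PP"), 0.2),
    ("NP", ("NP", "conj", "NP"), 0.2),
    ("VP", ("Verb",), 0.4),
    ("VP", ("Verb", "NP"), 0.3),
    ("VP", ("Verb", "NP", "NP"), 0.1),
    ("VP", ("VP", "PP"), 0.2),
    ("PP", ("P", "NP"), 1.0),
]


def lhs_totals(pcfg):
    """Sum P(X -> α | X) over all rules that share the LHS symbol X."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in pcfg:
        totals[lhs] += prob
    return dict(totals)


totals = lhs_totals(PCFG)
# Compare with a tolerance: floating-point sums are rarely exactly 1.0.
is_proper = all(math.isclose(total, 1.0) for total in totals.values())
print(totals, is_proper)
```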

Computing P(τ) with a PCFG
The probability of a tree τ is the product of the probabilities
of all its rules:

S → NP VP 0.8
S → S conj S 0.2
NP → Noun 0.2
NP → Det Noun 0.4
NP → NP PP 0.2
NP → NP conj NP 0.2
VP → Verb 0.4
VP → Verb NP 0.3
VP → Verb NP NP 0.1
VP → VP PP 0.2
PP → P NP 1.0

τ = (S (NP (Noun John))
       (VP (VP (Verb eats) (NP (Noun pie)))
           (PP (P with) (NP (Noun cream)))))

P(τ) = 0.8 × 0.3 × 0.2 × 1.0 × 0.2^3
= 0.000384
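
The product can be evaluated mechanically by walking the tree. This is a sketch under stated assumptions: trees are nested tuples `(label, child, ...)`, and a POS tag directly above a word contributes probability 1 (the slides give no lexical probabilities). The names `RULE_PROB` and `tree_prob` are illustrative.

```python
import math

# Only the rules used by the example tree are listed here.
RULE_PROB = {
    ("S", ("NP", "VP")): 0.8,
    ("NP", ("Noun",)): 0.2,
    ("VP", ("Verb", "NP")): 0.3,
    ("VP", ("VP", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
}


def tree_prob(tree):
    """Multiply the probabilities of all rules used in the tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0  # assumption: a tag over a word has probability 1
    rhs = tuple(child[0] for child in children)
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        prob *= tree_prob(child)
    return prob


tree = (
    "S",
    ("NP", ("Noun", "John")),
    ("VP",
     ("VP", ("Verb", "eats"), ("NP", ("Noun", "pie"))),
     ("PP", ("P", "with"), ("NP", ("Noun", "cream")))),
)

# 0.8 * 0.3 * 0.2 * 1.0 * 0.2**3
print(tree_prob(tree))
```

In practice such products are usually computed as sums of log probabilities to avoid underflow on long sentences.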
PCFG parsing
Probabilistic CKY
Like standard CKY, but with probabilities.
Terminals have probability p=1
Associate P(X→ YZ | X) with every pair of backpointers
from X in cell[i][j] to Y in cell[i][k] and Z in cell[k+1][j]

Finding the most likely parse

Local greedy (Viterbi) search is guaranteed to be optimal:
For every non-terminal X in cell[i][j],
keep only the highest-scoring Y in cell[i][k] and Z in cell[k+1][j]

argmax_{k,Y,Z} P(Y) × P(Z) × P(X → Y Z | X)

Probabilistic CKY
Input: POS-tagged sentence
John_N eats_V pie_N with_P cream_N

Chart entries, with cell[i][j] covering words i..j
(rule probabilities as in the PCFG given earlier):

cell[1][1] (John):                      NP  0.2
cell[1][2] (John eats):                 S   0.8*0.2*0.4
cell[1][3] (John eats pie):             S   0.8*0.2*0.06
cell[1][5] (John eats pie with cream):  S   0.2*0.0024*0.8
cell[2][2] (eats):                      VP  0.4
cell[2][3] (eats pie):                  VP  0.3*0.2
cell[2][5] (eats pie with cream):       VP  max(0.008*0.3, 0.06*0.2*0.2)
cell[3][3] (pie):                       NP  0.2
cell[3][5] (pie with cream):            NP  0.2*0.2*0.2
cell[4][4] (with):                      P
cell[4][5] (with cream):                PP
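
The chart computation can be sketched as follows. This is an illustrative implementation under several assumptions, not the code behind the slides: the input is the tag sequence, tags have probability 1, unary rules are applied only directly above tags, and the ternary rules (S → S conj S, NP → NP conj NP, VP → Verb NP NP) are left out because plain CKY needs binary rules. Spans are 0-based and half-open, so `(i, j)` covers words i+1..j in the slide numbering. The names `viterbi_cky`, `UNARY` and `BINARY` are hypothetical.

```python
# X -> tag rules, indexed by tag.
UNARY = {"Noun": {"NP": 0.2}, "Verb": {"VP": 0.4}}
# X -> Y Z rules, indexed by (Y, Z).
BINARY = {
    ("NP", "VP"): {"S": 0.8},
    ("Det", "Noun"): {"NP": 0.4},
    ("NP", "PP"): {"NP": 0.2},
    ("Verb", "NP"): {"VP": 0.3},
    ("VP", "PP"): {"VP": 0.2},
    ("P", "NP"): {"PP": 1.0},
}


def viterbi_cky(tags, unary, binary):
    """Fill best[(i, j, X)] = (probability, backpointer) for every span."""
    n = len(tags)
    best = {}
    for i, tag in enumerate(tags):
        best[i, i + 1, tag] = (1.0, None)  # tags have probability 1
        for x, p in unary.get(tag, {}).items():
            best[i, i + 1, x] = (p, tag)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (y, z), parents in binary.items():
                    if (i, k, y) in best and (k, j, z) in best:
                        for x, p in parents.items():
                            score = p * best[i, k, y][0] * best[k, j, z][0]
                            # Keep only the highest-scoring analysis of X over i..j.
                            if score > best.get((i, j, x), (0.0, None))[0]:
                                best[i, j, x] = (score, (k, y, z))
    return best


tags = ["Noun", "Verb", "Noun", "P", "Noun"]  # John eats pie with cream
best = viterbi_cky(tags, UNARY, BINARY)
print(best[0, 5, "S"])
```

For this sentence the two attachments of the PP happen to score the same, so the backpointer stored for the VP over "eats pie with cream" is whichever analysis was found first.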

Inside/outside probabilities
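[Figure: a constituent XP spanning wi…wj inside the sentence w1…wn, with outside context w1…wi−1 and wj+1…wn]

Outside probability of XP i..j: $\alpha_{ij}(XP)$
$$\alpha_{ij}(XP) = P(S \Rightarrow^{*} w_1 \ldots w_{i-1}\; XP\; w_{j+1} \ldots w_n)$$

Inside probability of XP i..j: $\beta_{ij}(XP)$
$$\beta_{ij}(XP) = P(XP \Rightarrow^{*} w_i \ldots w_j)$$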
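
Inside probabilities can be computed with the same chart as probabilistic CKY, replacing the max over analyses with a sum. The sketch below makes the same simplifying assumptions as a basic CKY implementation: tags have probability 1, unary rules apply only directly above tags, ternary rules are omitted, and spans are 0-based and half-open. The names `inside`, `UNARY` and `BINARY` are illustrative. Outside probabilities are not computed here.

```python
from collections import defaultdict

UNARY = {"Noun": {"NP": 0.2}, "Verb": {"VP": 0.4}}
BINARY = {
    ("NP", "VP"): {"S": 0.8},
    ("NP", "PP"): {"NP": 0.2},
    ("Verb", "NP"): {"VP": 0.3},
    ("VP", "PP"): {"VP": 0.2},
    ("P", "NP"): {"PP": 1.0},
}


def inside(tags, unary, binary):
    """beta[(i, j, X)] = total probability of all ways X derives tags[i:j]."""
    n = len(tags)
    beta = defaultdict(float)
    for i, tag in enumerate(tags):
        beta[i, i + 1, tag] = 1.0
        for x, p in unary.get(tag, {}).items():
            beta[i, i + 1, x] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (y, z), parents in binary.items():
                    left = beta.get((i, k, y), 0.0)
                    right = beta.get((k, j, z), 0.0)
                    if left and right:
                        for x, p in parents.items():
                            # Sum over split points and rules instead of taking the max.
                            beta[i, j, x] += p * left * right
    return beta


tags = ["Noun", "Verb", "Noun", "P", "Noun"]  # John eats pie with cream
beta = inside(tags, UNARY, BINARY)
print(beta[0, 5, "S"])
```

For this sentence the inside probability of S over the whole string is the sum over both PP attachments, so it is twice the probability of either single parse.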

