Introduction To Arabic NLP


MEADR 2009

Cairo, Egypt
April 21, 2009

Introduction to Arabic
Natural Language Processing
Nizar Habash
Columbia University
Center for Computational Learning Systems
[email protected]

1
CADIM

Focus of this tutorial

Phenomena
Concepts
Approaches
Resources

What is Arabic?
Arabic Script
Arabic Language
Modern Standard
Arabic (MSA)
Arabic Dialects
2

Road Map

Introduction
Orthography
Morphology
Syntax

Road Map
Introduction
Orthography

Arabic Script
MSA Phonology and Spelling
Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/
Encoding Issues

Morphology
Syntax
4

Arabic Script

Arabic Script
Arabic script is an alphabet with allographic variants,
optional zero-width diacritics and common ligatures.

Arabic script is used to write many languages: Arabic, Persian, Kurdish, Urdu, Pashto, etc.
6

Arabic Script
Alphabet
letter forms
letter marks
Arabic only
Other languages
Persian, Kurdish,
Urdu, Pashto, etc.
OCR output ambiguity; common spelling errors

Arabic Script
Alphabet (MSA)
letters (form+mark)
Distinctive


//

Non-distinctive

/s/

// /t/

/b/

I
//

glottal stop aka hamza

Arabic Script
Letter Shapes
No distinction between print and handwriting
No capitalization
Right-to-left
Ambiguous shapes
Connective letters
Disconnective letters (ا د ذ ر ز و) cause word-internal visual spacing
Letter forms: stand-alone, initial, medial, final

Arabic Script
Letter shaping
ك + ت + ب = كتب /katab/ to write
ك + ت + ا + ب = كتاب /kitāb/ book
Letters: ك k, ت t, ب b

Arabic Script
Diacritics
Zero-width characters
Used for short vowels
Vowel: بَ /ba/, بُ /bu/, بِ /bi/
Nunation: بً /ban/, بٌ /bun/, بٍ /bin/
كَتَب /katab/ to write
Nunation is used as the nominal indefinite marker in MSA
كِتابٌ /kitābun/ a book

Arabic Script
Diacritics
No-vowel marker (sukun): بْ /b/
مَكْتَب /maktab/ office
Double consonant marker (shadda): بّ /bb/
كَتَّب /kattab/ to dictate
Combinable: بُّ /bbu/, بٍّ /bbin/, بًّ /bban/

Arabic Script
Putting it together
Simple combination
Arab /ʕarab/: ع + ر + ب = عرب
West /ɣarb/: غ + ر + ب = غرب
Ligatures
Peace /salām/: سلام (lam followed by alef is written as the ligature لا)

Arabic Script
Tatweel
Elongation, aka kashida
Used for text highlighting and justification
human rights /ħuqūq alʔinsān/: حقوق الإنسان, elongated as حقـــوق الإنســـان

Arabic Script
Different styles
High fluidity
Optional ligatures
Vertical arrangements
Examples: Arabic /ʕarabi/ (عربي), Muhammad /muħammad/ (محمد), algebra /aldʒabr/ (الجبر)

Arabic Script
Arabic Numerals
Decimal system
Numbers written left-to-right in right-to-left text
Example: Algeria achieved its independence in 1962 after 132 years of French occupation.

Three systems of enumeration symbols that vary by region

System               Digits               Region
Western Arabic       0 1 2 3 4 5 6 7 8 9  Tunisia, Morocco, etc.
Indo-Arabic          ٠١٢٣٤٥٦٧٨٩           Middle East
Eastern Indo-Arabic  ۰۱۲۳۴۵۶۷۸۹           Iran, Pakistan, etc. (differs in ۴ ۵ ۶)
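Because the three digit systems are separate Unicode code points, text processing usually maps them onto one system first. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the function name `to_western_digits` is a hypothetical helper, not something from the tutorial.

```python
import unicodedata

def to_western_digits(text: str) -> str:
    """Map every Unicode decimal digit (category Nd) to its ASCII digit."""
    return "".join(
        # unicodedata.decimal returns the numeric value 0..9 of a digit character
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

# Indo-Arabic digits (U+0660..U+0669) and Eastern Indo-Arabic digits
# (U+06F0..U+06F9) both collapse onto the Western Arabic digits.
indo_arabic_year = "\u0661\u0669\u0666\u0662"
eastern_count = "\u06f1\u06f3\u06f2"
print(to_western_digits(indo_arabic_year), to_western_digits(eastern_count))
```

Mapping onto ASCII digits is only one possible choice; a system that must preserve the regional style would keep the original characters and normalize a copy.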

Road Map
Introduction
Orthography

Arabic Script
MSA Phonology and Spelling
Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/
Encoding Issues

Morphology
Syntax
17

MSA Phonology and Spelling


Phonological profile of Standard Arabic
28 Consonants
3 short vowels, 3 long vowels, 2 diphthongs

Arabic spelling is mostly phonemic


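Letter-sound correspondence
ء /ʔ/, ب /b/, ت /t/, ث /θ/, ج /dʒ/, ح /ħ/, خ /x/
د /d/, ذ /ð/, ر /r/, ز /z/, س /s/, ش /ʃ/, ص /sˤ/
ض /dˤ/, ط /tˤ/, ظ /ðˤ/, ع /ʕ/, غ /ɣ/, ف /f/, ق /q/
ك /k/, ل /l/, م /m/, ن /n/, ه /h/, و /w/, ي /j/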

MSA Phonology and Spelling


Arabic spelling is mostly phonemic
Except for
Medial short vowels can only appear as diacritics
Diacritics are optional in most written text
Except in holy scripture
When present, diacritics mark syntactic/semantic distinctions
كتب: /katab/ to write, /kutib/ to be written
حب: /ħubb/ love, /ħabb/ seed

Dual use of ا, و, ي as consonant and long vowel
ا (/ʔ/, /ā/), و (/w/, /ū/), ي (/j/, /ī/)
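Since diacritics are optional, a common preprocessing step is to strip them so that diacritized and undiacritized spellings compare equal. A minimal sketch, assuming the standard Unicode Arabic block (U+064B to U+0652 for the vowel, nunation, shadda and sukun marks); `strip_diacritics` is a hypothetical name.

```python
import re

# Fathatan..sukun (U+064B..U+0652) plus the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character, carries no linguistic content

def strip_diacritics(text: str) -> str:
    """Remove optional diacritics and tatweel, keeping the letters."""
    return DIACRITICS.sub("", text).replace(TATWEEL, "")

kataba = "\u0643\u064e\u062a\u064e\u0628\u064e"  # ka-ta-ba, 'he wrote'
kutiba = "\u0643\u064f\u062a\u0650\u0628\u064e"  # ku-ti-ba, 'it was written'
# Both collapse to the same three letters, which is exactly the ambiguity
# the slide describes.
print(strip_diacritics(kataba) == strip_diacritics(kutiba))
```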

MSA Phonology and Spelling


Arabic spelling is mostly phonemic
Except for (continued)
Morphophonemic characters
Feminine marker (ة ta marbuta)
كبير /kabīr/ (big, masculine), كبيرة /kabīra/ (big, feminine)
Derivation marker (ى alif maqsura)
عصى /ʕasˤā/ (to disobey), عصا /ʕasˤā/ (a stick)

Hamza variants (6 characters for one phoneme!): ء أ إ ؤ ئ آ
بهاء /bahāʔ/ + 3MascSing (his glory): بهاؤه, بهاءه, بهائه

MSA Phonology and Spelling


Arabic spelling can be ambiguous
optional diacritics and dual use of letter

But how ambiguous? Really?


Classic example
ths s wht n rbc txt lks lk wth n vwls
this is what an Arabic text looks like with no vowels

Not exactly true


Long vowels are always written
Initial vowels are represented by an alef
Some final short vowels are represented
ths is wht an Arbc txt lks lik wth no vwls
Ambiguity is revisited in more detail in the morphology discussion

21

Proper Name Spelling


The Qadafi-Schwarzenegger problem
Foreign proper name spelling is often ad hoc
Multiplicity of spellings causes increased sparsity
One Arabic name, many Roman spellings: Gadafi, Gaddafi, Gaddfi, Gadhafi, Ghaddafi, Kadaffy, Qaddafi, Qadhafi
One Roman name, many Arabic spellings: Schwarzenegger
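One way to see the sparsity problem is to group the Roman spellings under a crude consonant-skeleton key. The sketch below is purely illustrative: the rules (merging g/gh/k/q, dropping vowels and h, collapsing doubled letters) are assumptions chosen to fit the eight spellings on the slide, and `skeleton_key` is a hypothetical helper, not a published name-matching method.

```python
import re

def skeleton_key(name: str) -> str:
    """Reduce a romanized name to a rough consonant skeleton."""
    s = name.lower()
    s = re.sub(r"[gkq]h?", "q", s)    # g, gh, k, q all stand for one letter here
    s = re.sub(r"[aeiouyh]", "", s)   # drop vowels, y, and stray h
    s = re.sub(r"(.)\1+", r"\1", s)   # collapse doubled consonants
    return s

SPELLINGS = ["Gadafi", "Gaddafi", "Gaddfi", "Gadhafi",
             "Ghaddafi", "Kadaffy", "Qaddafi", "Qadhafi"]
keys = {skeleton_key(s) for s in SPELLINGS}
print(keys)  # a single shared key for all eight spellings
```

A key this aggressive will also merge unrelated names, so in practice it would only serve as a candidate generator ahead of a stricter comparison.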

Road Map
Introduction
Orthography

Arabic Script
MSA Phonology and Spelling
Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/
Encoding Issues

Morphology
Syntax
23

Arabic Script
Other languages
Arabic
No more than 3 dots
Dots either above or below
Marks are 1/2/3 dots, hamza (ء) or madda (~) only
Rare borrowing for foreign words
/p/, /v/, /g/, /tʃ/
regionally variable

Not Arabic
Extra marks: haft (v), ring (o), taa (ط), four dots (::), vertical dots (:)
Some numerals (۴, ۵, ۶)

Once you learn the alphabet, it is easier

[Example slides: text samples in Arabic script, each labeled "Arabic" or "Not Arabic"]
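The visual cues above have a rough code-point counterpart: the letters used for Arabic itself sit in U+0621 to U+064A, while the extra letters of Persian, Urdu, Pashto and others sit elsewhere in the Arabic blocks. The following is a heuristic sketch under that assumption; the range lists and the name `extended_letter_ratio` are illustrative choices, and short texts in other languages may contain no extended letters at all.

```python
# Letters of the basic Arabic alphabet (hamza..ghain, feh..yeh).
BASIC_RANGES = [(0x0621, 0x063A), (0x0641, 0x064A)]
# Extended letters used mainly by other Arabic-script languages.
EXTENDED_RANGES = [(0x063B, 0x063F), (0x0671, 0x06D3), (0x06D5, 0x06D5),
                   (0x06EE, 0x06EF), (0x06FA, 0x06FF), (0x0750, 0x077F)]

def _in_ranges(ch: str, ranges) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)

def extended_letter_ratio(text: str) -> float:
    """Share of Arabic-script letters that lie outside the basic alphabet."""
    basic = sum(_in_ranges(ch, BASIC_RANGES) for ch in text)
    extended = sum(_in_ranges(ch, EXTENDED_RANGES) for ch in text)
    total = basic + extended
    return extended / total if total else 0.0

arabic_word = "\u0643\u062a\u0627\u0628"   # 'book' with Arabic kaf (U+0643)
persian_word = "\u06a9\u062a\u0627\u0628"  # same word with Persian keheh (U+06A9)
print(extended_letter_ratio(arabic_word), extended_letter_ratio(persian_word))
```

A nonzero ratio is only a hint, since Arabic text occasionally borrows letters for foreign sounds, as the slide notes.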

Road Map
Introduction
Orthography

Arabic Script
MSA Phonology and Spelling
Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/
Encoding Issues

Morphology
Syntax
28

Encoding Issues
Encoding Arabic

Data entry, storage, and display


Ease of use for Arabic-illiterate users
Multi-script support
Multilingual support (extended Arabic characters)

Types of Encoding
Machine character sets
Graphemic (shape insensitive, logical order)
Allographic (shape/direction sensitive) [obsolete]

Human accessible
Transliteration
Phonetic spelling (IPA)
Romanization
29

Encoding Issues
Many Conflicting Character Sets for Arabic

30

Encodings
CP-1256
Commonly used
1-byte characters
Widely supported
input/display
Minimal support for
extended Arabic
characters
bi-script support
(Roman/Arabic)
Tri-lingual support: Arabic, French, English (à la ANSI)
31

Encodings
Unicode
Increasingly the standard
2-byte characters for the Arabic block (in both UTF-8 and UTF-16)
Widely supported
input/display
Supports extended
Arabic characters
Multi-script
representation
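The size and coverage trade-off between CP-1256 and Unicode can be checked directly with Python's built-in codecs. This is a small sketch; `fits_cp1256` is a hypothetical helper, and the Pashto letter is just one example of an extended character outside CP-1256.

```python
word = "\u0643\u062a\u0627\u0628"  # the four letters of 'book'

cp1256_bytes = word.encode("cp1256")     # one byte per letter
utf8_bytes = word.encode("utf-8")        # two bytes per Arabic-block letter
utf16_bytes = word.encode("utf-16-le")   # two bytes per letter, no byte-order mark

def fits_cp1256(text: str) -> bool:
    """True if every character has a slot in the CP-1256 code page."""
    try:
        text.encode("cp1256")
        return True
    except UnicodeEncodeError:
        return False

pashto_letter = "\u069a"  # seen with dot below and dot above
print(len(cp1256_bytes), len(utf8_bytes), len(utf16_bytes))
print(fits_cp1256(word), fits_cp1256(pashto_letter))
```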

32

Encodings
Unicode
Supports presentation
forms (shapes and
ligatures)
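Text that was stored with presentation-form code points (one code point per letter shape or ligature) can usually be folded back to the plain letters with Unicode compatibility normalization. A minimal sketch using the standard `unicodedata` module; `to_logical_letters` is a hypothetical name.

```python
import unicodedata

def to_logical_letters(text: str) -> str:
    """Fold Arabic presentation forms back to the basic letters via NFKC."""
    return unicodedata.normalize("NFKC", text)

# kaf initial form + teh medial form + beh final form
shaped = "\ufedb\ufe98\ufe90"
# the lam-alef ligature as a single presentation-form code point
ligature = "\ufefb"

print(to_logical_letters(shaped) == "\u0643\u062a\u0628")
print(to_logical_letters(ligature) == "\u0644\u0627")
```

NFKC also folds other compatibility characters (for example full-width Latin letters), so it may be too broad for pipelines that need to keep those distinctions.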

33

Encoding Issues
Arabic Display
Memory (logical order)
[Example: a sentence mixing Arabic words with "(Palestine)", "(Olympics)", "2000" and "2004", stored in reading order]

Display (visual order)
Bidirectional (BiDi) support
Numbers and Roman script keep their left-to-right order inside right-to-left text
Letter and ligature shaping
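The reordering from logical to visual order is driven by a directional class that Unicode assigns to every character. The sketch below only inspects those classes with the standard `unicodedata` module; it does not implement the reordering algorithm itself, and `bidi_classes` is a hypothetical helper.

```python
import unicodedata

def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

# AL = Arabic letter (right-to-left), L = left-to-right letter,
# EN = European number, AN = Arabic number, WS = whitespace, ON = other neutral.
sample = "\u0643a1\u0661 ("
print(list(zip(sample, bidi_classes(sample))))
```

Digits get their own classes (EN, AN) rather than AL, which is why numbers stay left-to-right inside right-to-left text.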

Display Problems
[Figure: the same Arabic text under different actual encodings (CP-1256, ISO-8859, Unicode) and display encodings (CP-1256, ISO-8859, Unicode, Western); mismatched pairs render as unreadable symbols]
Wrong encoding
Partial support problems
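The "wrong encoding" case is easy to reproduce: bytes written as CP-1256 but decoded with a Western code page come out as accented Latin letters. A minimal sketch with Python's built-in codecs, assuming Latin-1 as the stand-in for the "Western" display setting.

```python
original = "\u0643\u062a\u0627\u0628"   # 'book'

stored = original.encode("cp1256")       # actual encoding of the file
garbled = stored.decode("latin-1")       # displayed with a Western code page

# Latin-1 maps every byte to a character, so nothing is lost and the
# damage can be undone by reversing the two steps.
repaired = garbled.encode("latin-1").decode("cp1256")

print(garbled)
print(repaired == original)
```

The repair only works because the wrong decoder was lossless; if the display layer replaced unknown bytes with a placeholder character, the original text could not be recovered this way.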

Encoding Issues
Arabic Input
Standard graphemic keyboard
Logical order input

http://www.cyrillic.com/kbd/btc.html

Encodings
Buckwalter Encoding
Romanization
One-to-one mapping
to Arabic script spelling
Left-to-right
Easy to learn/use
Human & machine compatible

Commonly used in NLP


Penn Arabic Tree Bank

Some characters can be modified to allow use with XML and regular expressions
Roman input/display
Monolingual encoding (can't do English and Arabic)
Minimal support for extended Arabic characters
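The one-to-one property makes the Buckwalter encoding simple to implement as a lookup table. The sketch below builds the table from code-point order for the standard (unmodified) scheme; the helper names are hypothetical, and the XML-safe variant mentioned on the slide is not covered.

```python
# Buckwalter symbols listed in Unicode code-point order.
_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"   # U+0621 (hamza) .. U+063A (ghain)
_REST = "_fqklmnhwYyFNKaui~o"             # U+0640 (tatweel) .. U+0652 (sukun)

AR2BW = {chr(0x0621 + i): c for i, c in enumerate(_LETTERS)}
AR2BW.update({chr(0x0640 + i): c for i, c in enumerate(_REST)})
AR2BW["\u0670"] = "`"   # superscript (dagger) alef
AR2BW["\u0671"] = "{"   # alef wasla
BW2AR = {bw: ar for ar, bw in AR2BW.items()}

def to_buckwalter(text: str) -> str:
    """Arabic script to Buckwalter; unknown characters pass through."""
    return "".join(AR2BW.get(ch, ch) for ch in text)

def from_buckwalter(text: str) -> str:
    """Buckwalter to Arabic script; unknown characters pass through."""
    return "".join(BW2AR.get(ch, ch) for ch in text)

print(to_buckwalter("\u0643\u062a\u0627\u0628"))   # the word 'book'
print(from_buckwalter("AsbAnyA"))                  # 'Spain', as in the MADA example
```

Because Latin letters pass through unchanged, the reverse mapping cannot tell an English "b" from a transliterated one, which is the "monolingual encoding" limitation listed on the slide.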

38

Road Map

Introduction
Orthography
Morphology
Syntax

39

Morphology
Type
Concatenative: prefix, suffix, circumfix
Templatic: root+pattern

Function
Derivational
Creating new words
Mostly templatic

Inflectional
Modifying features of words
Tense, number, person, mood, aspect

Mostly concatenative

40

Road Map
Introduction
Orthography

Morphology

Derivational Morphology
Inflectional Morphology
Morphological Ambiguity
Arabic Computational Morphology

Syntax
41

Derivational Morphology
Templatic Morphology

Root + Pattern → Lexeme
ك ت ب (k-t-b) + ma12ū3 → مكتوب /maktūb/ written
ك ت ب (k-t-b) + 1ā2i3 → كاتب /kātib/ writer

Lexeme.Meaning = (Root.Meaning+Pattern.Meaning)*Idiosyncrasy.Random

Derivational Morphology
Root Meaning
KTB = notion of "writing"
كتب /katab/ write
كتاب /kitāb/ book
مكتوب /maktūb/ written
مكتبة /maktaba/ library
مكتوب /maktūb/ letter
مكتب /maktab/ office
كاتب /kātib/ writer

Derivational Morphology
Root Meaning
LHM-1
Notion of "meat"
لحم /laħm/ (laHm) Meat
لحّام /laħħām/ Butcher

Derivational Morphology
Root Meaning
LHM-2
Notion of "battle"
ملحمة /malħama/
Fierce battle
Massacre
Epic

Derivational Morphology
Root Meaning
LHM-3
Notion of "soldering"
لحم /laħam/
Weld, solder, stick, cling
التحم /iltaħam/
Be welded/soldered/fused
ملتحم /multaħim/
Welded, soldered, fused

Derivational Morphology
Pattern Meaning
Verb Pattern Meaning is hard to define
Form  Pattern    Pattern Meaning             Example          Gloss
I     1a2a3      Basic sense of root         ktb → katab      write
II    1a22a3     Intensification, causation  ktb → kattab     dictate
III   1aA2a3     Interaction with others     ktb → kaAtab     correspond with
IV    Aa12a3     Causation                   jls → Ajlas      seat
V     ta1a22a3   Reflexive of Pattern II     Elm → taEal~am   learn
VI    ta1aA2a3   Reflexive of Pattern III    ktb → takaAtab   correspond
VII   Ain1a2a3   Passive of Pattern I        ktb → Ainkatab   subscribe/enroll
VIII  Ai1ta2a3   Acquiescence, exaggeration  ktb → Aiktatab   register
IX    Ai12a33    Transformation              Hmr → AiHmarr    Turn red/blush
X     Aista12a3  Requirement                 ktb → Aistaktab  ask/make_write
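The pattern notation in the table (digits 1, 2, 3 standing for the first, second and third root consonant) translates directly into a substitution routine. A minimal sketch, assuming three-consonant roots written in the same romanization as the table; `apply_pattern` is a hypothetical helper.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Interleave a triliteral root with a pattern such as '1a22a3'."""
    if len(root) != 3:
        raise ValueError("this sketch only handles three-consonant roots")
    # Digits pick root consonants; every other character is copied as-is.
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)

for pattern in ["1a2a3", "1a22a3", "1aA2a3", "Aista12a3"]:
    print(pattern, "->", apply_pattern("ktb", pattern))
```

The routine produces the form only; as the slide stresses, the meaning of the result is not predictable from root and pattern alone.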

Road Map
Introduction
Orthography

Morphology

Derivational Morphology
Inflectional Morphology
Morphological Ambiguity
Arabic Computational Morphology

Syntax
48

Inflectional Morphology
Derivational Morphology
Lexeme Root + Pattern

Inflectional Morphology
Word = Lexeme + Features

Part-of-speech
Traditional: Noun, Verb, Particle
Computational: N, PN, V, Adj, Adv, P, Pron, Num,
Conj, Det, Aux, Pun, IJ, and others
Many tag sets exist ranging from 3 to over 22K tags

Noun-specific Features
Verb-specific Features
Other Features
49

Inflectional Morphology
Noun-specific Features

Number: singular, dual, plural, collective


Gender: masculine, feminine
Definiteness: definite, indefinite
Case: nominative, accusative, genitive
Possessive clitic

Verb-specific Features

Aspect: perfective, imperfective, imperative


Voice: active, passive
Tense: past, present, future
Mood: indicative, subjunctive, jussive
Subject (Person, Number, Gender)
Object clitic

Other Features
Single-letter conjunctions
Single-letter prepositions

50

Inflectional Morphology
Nouns
Components: conj, prep, article, noun, plural, poss

وكبيوتنا /wakabiyūtinā/
و + ك + بيوت + نا
wa+ka+biyūt+nā
and+like+houses+our
And like our houses

وللمكتبات /walilmaktabāt/
و + ل + ال + مكتبة + ات
wa+li+al+maktaba+āt
and+for+the+library+plural
And for the libraries

Morphotactics (e.g. ل + ال → لل)
Arabic Broken Plurals (templatic)

Inflectional Morphology
Verbs
Components: conj, tense, subj, verb, object

فقلناها /faqulnāhā/
ف + قل + نا + ها
fa+qul+nā+hā
so+said+we+it
So we said it

وسنقولها /wasanaqūluhā/
wa+sa+na+qūl+u+hā
and+will+we+say+it
And we will say it

Morphotactics
Subject conjugation (suffix or circumfix)

Inflectional Morphology
katab to write

Perfect verb subject conjugation (suffixes only)

Person  Singular  Dual       Plural
1       katabtu              katabnā
2       katabta   katabtumā  katabtum
3       kataba    katabā     katabū

Imperfect verb subject conjugation (prefix+suffix)

Person  Singular  Dual       Plural
1       ʔaktubu              naktubu
2       taktubu   taktubāni  taktubūna
3       yaktubu   yaktubāni  yaktubūna

Feminine form and other verb moods not shown
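The suffix-only versus prefix-plus-suffix contrast can be written as two small affix tables. This is a sketch for the masculine indicative forms only, using the standard MSA paradigm for katab; the table and function names are hypothetical, and the stems ("katab", "ktub") are supplied by hand rather than derived.

```python
# Perfect: suffix only. Keys are (person, number).
PERFECT_SUFFIX = {
    (1, "sg"): "tu", (2, "sg"): "ta", (3, "sg"): "a",
    (2, "du"): "tumā", (3, "du"): "ā",
    (1, "pl"): "nā", (2, "pl"): "tum", (3, "pl"): "ū",
}

# Imperfect: circumfix, i.e. a (prefix, suffix) pair.
IMPERFECT_AFFIX = {
    (1, "sg"): ("ʔa", "u"), (2, "sg"): ("ta", "u"), (3, "sg"): ("ya", "u"),
    (2, "du"): ("ta", "āni"), (3, "du"): ("ya", "āni"),
    (1, "pl"): ("na", "u"), (2, "pl"): ("ta", "ūna"), (3, "pl"): ("ya", "ūna"),
}

def perfect(stem: str, person: int, number: str) -> str:
    return stem + PERFECT_SUFFIX[(person, number)]

def imperfect(stem: str, person: int, number: str) -> str:
    prefix, suffix = IMPERFECT_AFFIX[(person, number)]
    return prefix + stem + suffix

print(perfect("katab", 1, "sg"), imperfect("ktub", 3, "pl"))
```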

Road Map
Introduction
Orthography
Morphology

Derivational Morphology
Inflectional Morphology
Morphological Ambiguity
Arabic Computational Morphology

Syntax
54

Morphological Ambiguity
Derivational ambiguity
قاعدة: basis/principle/rule, military base, Qa'ida/Qaeda/Qaida

Inflectional ambiguity
تكتب /taktub/: you write, she writes
Segmentation ambiguity
وجد: he found; و+جد: and+grandfather
للغة: ل+لغة: for a language; ل+اللغة: for the language

55

Morphological Ambiguity
Spelling ambiguity
Optional diacritics
كاتب: /kātib/ writer, /kātab/ to correspond

Suboptimal spelling
Hamza dropping: أ, إ → ا
Undotted ta-marbuta: ة → ه
Undotted final ya: ي → ى
Multiple sources of ambiguity: بين

Reading     POS          Gloss
/bayyana/   Verb         he demonstrated
/bayyanna/  Verb         they [feminine] demonstrated
/bayyin/    Adj          clear/evident/explicit
/bayna/     Prep         between/among
/biyin/     Proper Noun  in Yen
/biyn/      Proper Noun  Ben
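A common response to suboptimal spelling is to normalize both the careful and the careless variants to one form before matching. The sketch below is one possible convention, not the tutorial's: it reduces hamzated alef forms to bare alef, ta marbuta to ha, and alef maqsura to ya. The direction chosen for the last two is an assumption; the point is only that both spellings end up identical.

```python
import re

def normalize_spelling(text: str) -> str:
    """Collapse spelling variants that writers frequently interchange."""
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef with madda/hamza -> alef
    text = text.replace("\u0629", "\u0647")                # ta marbuta -> ha
    text = text.replace("\u0649", "\u064a")                # alef maqsura -> ya
    return text

with_hamza = "\u0623\u062d\u0645\u062f"      # 'Ahmad' with hamza on the alef
without_hamza = "\u0627\u062d\u0645\u062f"   # same name, hamza dropped
print(normalize_spelling(with_hamza) == normalize_spelling(without_hamza))
```

Normalization trades one problem for another: it removes spelling noise but adds ambiguity, since distinct words can collapse onto the same string.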

Morphological Disambiguation
in English
Select a morphological tag that fully
describes the morphology of a word
Complete English morphological tag set
(Penn Treebank): 48 tags
Verb: VB (go), VBD (went), VBG (going), VBN (gone), VBP (go), VBZ (goes)

Same as POS Tagging in English


57

Morphological Disambiguation
in Arabic
Morphological tag has 14 subtags
corresponding to different linguistic categories
Example:Verb
Gender(2), Number(3), Person(3), Aspect(3),
Mood(3), Voice(2), Pronominal clitic(12),
Conjunction clitic(3)

22,400 possible tags


Different possible subsets

2,200 appear in Penn Arabic Tree Bank Part 1


(140K words)
Example solution: MADA (Habash&Rambow 2005)
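To get a feel for how quickly the tag space grows, the value counts listed for the verb example can simply be multiplied. The sketch below covers only those eight verb subtags; it is not an attempt to reproduce the 22,400 figure, which is the slide's own number for the overall tag set.

```python
from math import prod

# Number of values per subtag, as listed for the verb example.
VERB_SUBTAG_SIZES = {
    "gender": 2, "number": 3, "person": 3, "aspect": 3,
    "mood": 3, "voice": 2, "pronominal_clitic": 12, "conjunction_clitic": 3,
}

verb_combinations = prod(VERB_SUBTAG_SIZES.values())
print(verb_combinations)  # combinations for these eight subtags alone
```

Compared with the six English verb tags, even this partial product shows why Arabic disambiguation is usually decomposed into several smaller classification problems.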
58

MADA (Habash&Rambow 2005) (Habash&Rambow 2007) (Roth et al. 2008)

[Figure: MADA architecture. The word W0 is processed with its context window W-4 ... W4]
MORPHOLOGICAL ANALYZER: rule-based, human-created
MORPHOLOGICAL CLASSIFIERS: multiple independent classifiers, corpus-trained
RANKER: heuristic or corpus-trained; orders the candidate analyses (1st, 2nd, 3rd, 4th, 5th)

MADA

(Habash&Rambow 2005)
(Habash&Rambow 2007)
(Roth et al. 2008)

Morphological Analysis and Disambiguation for Arabic

;;; SENTENCE AsbAnyA tnfy tjmyd AlmsAEdp AlmmnwHp llmgrb

;;WORD AsbAnyA
;;MADA: AsbAnyA art-NO aspect-NA case-NOCASE clitic-NO conj-NO def-DEF mood-NA num-SG part-NO per-3 pos-PN voice-NA
*0.78571 <isobAniyA=[<isobAniyA_1 POS:PN BW:+<isobAniyA/NOUN_PROP+]=Spain
^0.71429 >asobAniyA=[<isobAniyA_1 POS:PN BW:+>asobAniyA/NOUN_PROP+]=Spain
_0.50000 <isobAniy~A=[<isobAniy~_2 POS:AJ +MASC +DU +NOM +POSS BW:+<isobAniy~/ADJ+A/NSUFF_MASC_DU_NOM_
_0.50000 <isobAniy~AF=[<isobAniy~_2 POS:AJ +ACC +INDEF BW:+<isobAniy~/ADJ+AF/CASE_INDEF_ACC]=Spanish/Spaniar
_0.57143 <isobAniy~A=[<isobAniy~_1 POS:N +MASC +DU +NOM +POSS BW:+<isobAniy~/NOUN+A/NSUFF_MASC_DU_NOM
_0.57143 <isobAniy~AF=[<isobAniy~_1 POS:N +ACC +INDEF BW:+<isobAniy~/NOUN+AF/CASE_INDEF_ACC]=Spanish/Spani
--------------
;;WORD tnfy
;;MADA: tnfY art-NA aspect-IV case-NA clitic-NO conj-NO def-NA mood-I num-SG part-NO per-3 pos-V voice-ACT
*1.00000 tanofiy=[nafaY_1 POS:V +IV MOOD:I +S:3FS BW:ta/IV3FS+nofiy/IV+(null)/IVSUFF_MOOD:I]=disavow/deny/reject
_0.76923 tunofayo=[nafA-u_1 POS:V +IV +PASS MOOD:SJ +S:2FS BW:tu/IV2FS+nof/IV_PASS+ayo/IVSUFF_SUBJ:2FS_MOOD:SJ]=be_rejected/be_refuted/be_denied
_0.84615 tanif~iy=[naf~-i_1 POS:V +IV MOOD:SJ +S:2FS BW:ta/IV2FS+nif~/IV+iy/IVSUFF_SUBJ:2FS_MOOD:SJ]=blow_the_no
_0.84615 tanofiy=[nafA-u_1 POS:V +IV MOOD:SJ +S:2FS BW:ta/IV2FS+nof/IV+iy/IVSUFF_SUBJ:2FS_MOOD:SJ]=refute/deny/r
_0.84615 tanofiy=[nafaY_1 POS:V +IV MOOD:SJ +S:2FS BW:ta/IV2FS+nof/IV+iy/IVSUFF_SUBJ:2FS_MOOD:SJ]=disavow/deny
_0.84615 tanofiya=[nafaY_1 POS:V +IV MOOD:S +S:2MS BW:ta/IV2MS+nofiy/IV+a/IVSUFF_MOOD:S]=disavow/deny/reject
_0.92308 tanofiy=[nafaY_1 POS:V +IV MOOD:I +S:2MS BW:ta/IV2MS+nofiy/IV+(null)/IVSUFF_MOOD:I]=disavow/deny/reject
_0.92308 tanofiya=[nafaY_1 POS:V +IV MOOD:S +S:3FS BW:ta/IV3FS+nofiy/IV+a/IVSUFF_MOOD:S]=disavow/deny/reject
--------------
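The scores in the output above look like fractions of agreeing features (for example 0.92308 is close to 12/13), which suggests a simple way to picture the ranking step. The sketch below ranks candidate analyses by how many classifier-predicted feature values they match. It is an illustration under that assumption, with made-up feature dictionaries; the actual MADA ranker may weight features differently.

```python
def agreement_score(predicted: dict, analysis: dict) -> float:
    """Fraction of predicted feature values that the analysis agrees with."""
    matches = sum(1 for feat, val in predicted.items() if analysis.get(feat) == val)
    return matches / len(predicted)

def rank(predicted: dict, analyses: list) -> list:
    """Order candidate analyses from best to worst agreement."""
    return sorted(analyses, key=lambda a: agreement_score(predicted, a), reverse=True)

# Hypothetical classifier output and analyzer candidates for one word.
predicted = {"pos": "V", "per": "3", "num": "SG", "mood": "I"}
candidates = [
    {"gloss": "you deny (subjunctive)", "pos": "V", "per": "2", "num": "SG", "mood": "SJ"},
    {"gloss": "she denies", "pos": "V", "per": "3", "num": "SG", "mood": "I"},
]
best = rank(predicted, candidates)[0]
print(best["gloss"], agreement_score(predicted, best))
```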

Road Map
Introduction
Orthography
Morphology

Derivational Morphology
Inflectional Morphology
Morphological Ambiguity
Arabic Computational Morphology

Syntax
61

Arabic Computational Morphology


Representation units
Natural token
White space separated strings (as is)
Can include extra characters (e.g. tatweel/kashida)

Word
Segmented word
Can include any degree of morphological analysis
Pure segmentation
Arabic Treebank tokens (with recovery of some deleted/modified letters)

Arabic Computational Morphology

Representation units (continued)

Prefix + Stem + Suffix
Can create more ambiguity

Lexeme + Features (e.g. +Plural +Def)

Root + Pattern + Features
Very abstract

Root + Pattern + Vocalism + Features
Very very abstract

(Habash&Sadat 2006)

TOKAN
A generalized tokenizer
Assumes disambiguated morphological analysis à la MADA
Declarative specification of tokenization scheme

Input: wsyktbhA=[katab_1 POS:V +IV w+ s+ +S:3MS +O:3FS]

Scheme  Specification                               Example
D1      w+ f+ REST                                  w+ syktbhA
D2      w+ f+ b+ k+ l+ s+ REST                      w+ s+ yktbhA
D3      w+ f+ b+ k+ l+ s+ Al+ REST +P: +O:          w+ s+ yktb +hA
TB      w+ f+ b+ k+ l+ REST +P: +O:                 w+ syktb +hA
EN      w+ f+ b+ k+ l+ s+ Al+ LEXEME + BIESPOS +S:  w+ s+ ktb/VBZ S:3MS +hA

Uses generator (Habash 2004)
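The idea of a declarative scheme can be sketched as a set of clitic types to detach, applied to an already disambiguated analysis. The code below is a toy illustration covering the D1, D2, D3 and TB rows for the example word; the data layout and the `tokenize` function are assumptions for this sketch, not TOKAN's interface, and spelling adjustments at morpheme boundaries are ignored.

```python
# The example word as (surface, clitic type) pairs in left-to-right order.
ANALYSIS = [("w", "w+"), ("s", "s+"), ("yktb", "REST"), ("hA", "+O:")]

# Each scheme lists the clitic types it splits off; everything else stays attached.
SCHEMES = {
    "D1": {"w+", "f+"},
    "D2": {"w+", "f+", "b+", "k+", "l+", "s+"},
    "D3": {"w+", "f+", "b+", "k+", "l+", "s+", "Al+", "+P:", "+O:"},
    "TB": {"w+", "f+", "b+", "k+", "l+", "+P:", "+O:"},
}

def tokenize(analysis, scheme: str) -> str:
    detach = SCHEMES[scheme]
    tokens, buffer = [], ""
    for surface, kind in analysis:
        if kind in detach:
            if buffer:                      # close the token built so far
                tokens.append(buffer)
                buffer = ""
            # proclitics are marked 'x+', enclitics '+x'
            tokens.append(surface + "+" if kind.endswith("+") else "+" + surface)
        else:
            buffer += surface               # keep attached to its neighbours
    if buffer:
        tokens.append(buffer)
    return " ".join(tokens)

for name in SCHEMES:
    print(name, tokenize(ANALYSIS, name))
```

The EN scheme is left out because it replaces the stem with a lexeme and POS tag, which needs the generator mentioned on the slide rather than simple splitting.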

Arabic Computational Morphology


Approaches
Finite state machines (Beesley, 2001) (Kiraz, 2001) (Habash & Rambow, 2006)
Concatenative analysis/generation (Smrž, 2007) (Buckwalter, 2002) (Cavalli-Sforza et al., 2000)
Lexeme+Feature analysis/generation (Habash, 2004) (Habash & Rambow, 2006)
Shallow stemming (Darwish, 2002) (Aljlayl and Frieder, 2002)
Machine learning (Diab et al., 2004) (Lee et al., 2003) (Rogati et al., 2003) (Habash & Rambow, 2005a)
Survey article: (Al-Sughaiyer & Al-Kharashi, 2004)

Issues
Appropriateness of system representation for an application
Machine Translation vs. Information Retrieval
Arabic spelling vs. phonetic spelling

System coverage
System extendibility
Availability to researchers
Use for analysis and generation

65

Road Map

Introduction
Orthography
Morphology
Syntax

Morphology and Syntax


Sentence Structure
Phrase Structure
Computational Resources

66

Morphology and Syntax


Rich morphology crosses into syntax
Pro-drop / Subject conjugation
Verb sub-categorization and object clitics
Verb[transitive]+subject+object
Verb[intransitive]+subject but not Verb[intransitive]+subject+object
Verb[passive]+subject but not Verb[passive]+subject+object

Morphological interactions with syntax


Agreement
Full: e.g. Noun-Adjective on number, gender, and definiteness (for
persons)
Partial: e.g. Verb-Subject on gender (in VSO order)

Definiteness
Noun compound formation, copular sentences, etc.
Nouns+DefiniteArticle, Proper Nouns, Pronouns, etc.
67

Morphology and Syntax


Morphological interactions with syntax (continued)
Case
MSA marks case: nominative, accusative, genitive
Almost-free word order
Case is often marked with optionally written short vowels
This effectively limits the word-order freedom in published text

Agglutination
Attached prepositions create words that cross phrase boundaries
للمكتبات
for the-libraries
li+Almaktabāt
[PP li [NP Almaktabāt]]

Some morphological analysis (minimally segmentation) is necessary even for statistical approaches to parsing
68

Road Map

Introduction
Orthography
Morphology
Syntax

Morphology and Syntax


Sentence Structure
Phrase Structure
Computational Resources

69

Sentence Structure
Two types of Arabic Sentences
Verbal sentences
[Verb Subject Object] (VSO)
Wrote the-boys the-poems
The boys wrote the poems

Copular sentences (aka nominal sentences)
[Topic Complement]
the-boys poets
The boys are poets

Sentence Structure
Verbal sentences
Verb agreement with gender only
Default singular number
wrote[3MascSing] the-boy/the-boys
wrote[3FemSing] the-girl/the-girls

Pronominal subjects are conjugated
wrote-you[MascSing]
wrote-you[MascPlur]
wrote-they[MascPlur]

Passive verbs
Same structure: Verb[passive] Subject[underlyingObject]
Agreement with surface subject

71

Sentence Structure
Verbal sentences
Common structural ambiguity
Third masculine/feminine singular is structurally ambiguous
Verb[3MascSingular] Noun[Masc]
Verb subject=he object=Noun
Verb subject=Noun

Passive and active forms are often similar in standard orthography
كتب /kataba/ he wrote
كتب /kutiba/ it was written

Sentence Structure
Copular sentences
[Topic Complement]
Definite Topic, Indefinite Complement
the-boy poet
The boy is a poet

[Auxiliary Topic Complement]
Auxiliaries (kāna and her sisters)
Tense, Negation, Transformation, Persistence
was the-boy poet: The boy was a poet
is-not the-boy poet: The boy is not a poet

Inverted order is expected in certain cases
Indefinite topic
عندي كتاب /ʕindi kitābun/ at-me a-book: I have a book

Sentence Structure
Copular sentences
Types of complements
Noun/Adjective/Adverb
the-boy smart: The boy is smart

Prepositional Phrase
the-boy in the-library: The boy is in the library

Copular-Sentence
[the-boy [book-his big]]: The boy, his book is big

Verb-Sentence
[the-boys [wrote[3rdMascPlur] poems]]: The boys wrote the poems
Full agreement in this order (SVO)
[the-poems [wrote[3rdMascSing]-them the boys]]: The poems, the boys wrote

Road Map

Introduction
Orthography
Morphology
Syntax

Morphology and Syntax


Sentence Structure
Phrase Structure
Computational Resources

75

Phrase Structure
Noun Phrase
Determiner Noun Adjective PostModifier
this the-writer the-ambitious the-arriving from Japan
This ambitious writer from Japan

Noun-Adjective agreement
number, gender, definiteness
the-writer[FemSing] the-ambitious[FemSing]
the-writer[FemPlur] the-ambitious[FemPlur]

Exception: Plural non-persons
definiteness agreement; feminine singular default
the-office[MascSing] the-new[MascSing]
the-library[FemSing] the-new[FemSing]
the-offices[MascBPlur] the-new[FemSing]
the-libraries[FemPlur] the-new[FemSing]

Phrase Structure
Noun Phrase
Idafa construction (إضافة)
Noun1 of Noun2 encoded structurally
Noun1-indefinite Noun2-definite
king Jordan
the king of Jordan / Jordan's king

Noun1 becomes definite
Agrees with definite adjectives

Idafa chains
N1[indef] N2[indef] ... Nn-1[indef] Nn[def]
son uncle neighbor chief committee management the-company
The cousin of the CEO's neighbor

Phrase Structure
Morphological definiteness interacts with syntactic structure
Word 1: writer; Word 2: artist

Word 1      Word 2      Structure         Reading
definite    definite    Noun Phrase       The artist(ic) writer
indefinite  definite    Noun Compound     The writer of the artist
definite    indefinite  Copular Sentence  The writer is an artist
indefinite  indefinite  Noun Phrase       An artist(ic) writer

Agreement in Arabic

Verb-Subject agreement
Verb agrees with subject in full (gender,number)
Exception: partial agreement (number=singular) in VSO order
Exception: partial agreement (number=singular; gender=feminine) for non-person plural
subjects regardless of order

Noun-Adjective
Adjective agrees with noun in full (gender, number, definiteness and case)
Exception: partial agreement (number=singular; gender=feminine) for non-person plural
nouns

Noun-Number
Number is the syntactic-case head
for numbers [3..10]: Noun is plural+genitive (idafa); number gender is inverted gender
of noun!
for numbers [11..99]: Noun is singular+accusative (tamyiyz/specification); number
gender is even more complicated
for numbers [100,1K,1M]: Noun is singular+genitive (idafa)

Word    Gloss         Features
bnyt    was built     Fem+Sg
vlAv    three         Masc+Sg+Nom
jAmEAt  universities  Fem+PL+Gen
jdydp   new           Fem+Sg+Gen

Verbs in VSO order are always Sg and agree in gender only
Numbers agree by gender inversion
Adjectives of plural non-person nouns are Fem+Sg
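The Noun-Number rules listed on this slide can be restated as a small lookup. The sketch below encodes only the three ranges the slide gives, plus gender inversion for 3 to 10; other numbers and the "even more complicated" gender rules for 11 to 99 are deliberately left out, and the function names are hypothetical.

```python
def counted_noun_form(n: int) -> tuple:
    """Return (number, case) of the noun that follows the numeral n."""
    if 3 <= n <= 10:
        return ("plural", "genitive")        # idafa
    if 11 <= n <= 99:
        return ("singular", "accusative")    # tamyiyz / specification
    if n in (100, 1_000, 1_000_000):
        return ("singular", "genitive")      # idafa
    raise ValueError("rule not covered by this sketch")

def numeral_gender(noun_gender: str) -> str:
    """Gender of a numeral in 3..10: the inverse of the noun's gender."""
    return "masculine" if noun_gender == "feminine" else "feminine"

# 'three universities': feminine noun, so the numeral is masculine and
# the noun is plural genitive, matching the annotated example.
print(counted_noun_form(3), numeral_gender("feminine"))
```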

Road Map

Introduction
Orthography
Morphology
Syntax

Morphology and Syntax


Sentence Structure
Phrase Structure
Computational Resources

80

Computational Resources
Monolingual corpora for building language models
Arabic Gigaword

Agence France Presse


AlHayat News Agency
AnNahar News Agency
Xinhua News Agency

Arabic Newswire
United Nations Corpus (parallel with other UN languages)
Ummah Corpus (parallel with English)

Distributors
Linguistic Data Consortium (LDC)
Evaluations and Language resources Distribution Agency
(ELDA)
81

Computational Resources
Penn Arabic Treebank (PATB)
Started in 2001
Goal is 1 Million words
Currently 650K words
Agence France Presse , AlHayat newspaper, AnNahar
newspaper

POS tags
Buckwalter analyzer
Arabic-tailored POS list

PATB constituency
representation
Some modifications of Penn English Treebank
(e.g. Verb-phrase internal subjects)
82

Computational Resources
Prague Dependency Treebank
Partial overlap with PATB
and Arabic Gigaword
Agence France Presse,
AlHayat and Xinhua

Morphological analysis
Extends on PATB

Dependency representation
Graphic courtesy of Otakar Smrž: http://ckl.mff.cuni.cz/padt/PADT_1.0/docs/slides/2003-eacl-trees.ppt

(Habash, 2009)

CATiB:Columbia Arabic Tree Bank


CATiB Representation
Lite Dependency Syntax
Tokenization
CONJ PART BASE PRON

Part of Speech Tag set


Six tags: VRB, VRB-pass, NOM,
PROP, PRT, PNX

Syntactic Annotation
Eight dependency relations: SBJ,
OBJ, TPC, PRD, IDF, TMZ, MOD, FLAT

Practical Considerations
Less information to annotate
Dependencies easier to annotate than phrase structure
Terminology and representation close to traditional
Arabic grammar
84

Computational Resources
Applications using Arabic treebanks
Statistical parsing
Bikel's parser (Bikel 2003)
Same engine used with English, Chinese and Arabic

Nivre's MALT parser (Nivre et al. 2006)

Base-phrase Chunking
(Diab et al, 2004; Diab et al. 2007)

POS tagging and morphological disambiguation


(Diab et al, 2004; Diab et al. 2007; Habash and Rambow, 2005a; Smith et
al., 2005; Roth et al. 2008)
Other non-treebank-based POS tagging efforts: (Khoja, 2001)

Formalism conversion
Constituency to dependency (Žabokrtský and Smrž 2003; Habash et al. 2007; Tounsi et al., 2009)

Tree-adjoining grammar extraction (Habash and Rambow 2004)

Automatic diacritization
Zitouni et al. (2006); Habash&Rambow (2007); Shaalan et al (2008)
among others
Diacritization for MT (Diab et al. 2007)

85

Other Tutorial Slides


Columbia's Arabic Dialect Modeling Group (CADIM)
http://www1.ccls.columbia.edu/~cadim/
Presentations

86
