Introduction To Arabic NLP
Cairo, Egypt
April 21, 2009
Introduction to Arabic
Natural Language Processing
Nizar Habash
Columbia University
Center for Computational Learning Systems
[email protected]
CADIM
Phenomena
Concepts
Approaches
Resources
What is Arabic?
Arabic Script
Arabic Language
Modern Standard Arabic (MSA)
Arabic Dialects
Road Map
Introduction
Orthography
Morphology
Syntax
Road Map
Introduction
Orthography
Arabic Script
MSA Phonology and Spelling
Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/
Encoding Issues
Morphology
Syntax
Arabic Script
Arabic script is an alphabet with allographic variants,
optional zero-width diacritics and common ligatures.
Arabic Script
Alphabet
letter forms
letter marks
Arabic only
Other languages
Persian, Kurdish,
Urdu, Pashto, etc.
OCR output ambiguity; common spelling errors
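Arabic Script
Alphabet (MSA)
letters (form+mark)
Distinctive marks: /s/ vs. /ʃ/; /b/ vs. /t/ vs. /θ/
Non-distinctive marks: hamza on different carriers, all /ʔ/
[Arabic letter glyphs lost in extraction]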
Arabic Script
Letter Shapes
No distinction between print and handwriting
No capitalization
Right-to-left
Ambiguous shapes
Connective letters
Disconnective letters (ا د ذ ر ز و) cause word-internal visual spacing
Letter positions: stand-alone, initial, medial, final
Arabic Script
Letter shaping
ك + ت + ب = كتب /katab/ to write
ك + ت + ا + ب = كتاب /kitāb/ book
Arabic Script
Diacritics
Zero-width characters
Vowel: بَ /ba/, بُ /bu/, بِ /bi/
Nunation: بً /ban/, بٌ /bun/, بٍ /bin/
كِتَابٌ /kitābun/ a book
Arabic Script
Diacritics
No vowel: بْ /b/, e.g. مَكْتَب /maktab/ office
Double consonant: بّ /bb/
Combinable: بُّ /bbu/, بٍّ /bbin/, بًّ /bban/
Arabic Script
Putting it together
Simple combination
Arab /ʕarab/: ع + ر + ب = عرب
West /ɣarb/: غ + ر + ب = غرب
Ligatures
Peace /salām/: س + ل + ا + م = سلام (lam-alif ligature)
Arabic Script
Tatweel
Elongation character, also known as kashida
Used for text highlighting and justification
حقوق الإنسان /ħuqūq alʔinsān/ human rights
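Because diacritics and tatweel are optional in written Arabic, processing pipelines often strip them before matching or counting words. The following is a minimal sketch of that step, assuming the Unicode code points U+064B to U+0652 for the diacritics described above, U+0670 for the superscript alif, and U+0640 for tatweel. The function names are illustrative and not taken from any particular toolkit.

```python
import re

# Fathatan through sukun (U+064B-U+0652), plus superscript alif (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"


def strip_diacritics(text: str) -> str:
    """Remove the optional zero-width vowel, nunation, shadda and sukun marks."""
    return DIACRITICS.sub("", text)


def remove_tatweel(text: str) -> str:
    """Remove the elongation character (tatweel, also called kashida)."""
    return text.replace(TATWEEL, "")


# /kitAbun/ "a book", fully diacritized:
# kaf + kasra, ta + fatha, alif, ba + dammatan
diacritized = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"
bare = strip_diacritics(diacritized)
print(len(diacritized), len(bare))  # 7 4
```

Stripping is lossy by design: the bare form is what most running text contains, which is one source of the ambiguity discussed later in the morphology section.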
Arabic Script
Different styles
High fluidity
Optional ligatures
Vertical arrangements
Examples (calligraphic renderings lost in extraction): /ʕarabi/ Arabic, /muħammad/ Muhammad, /alʒabr/ algebra
Arabic Script
Arabic Numerals
Decimal system
Numbers written left-to-right in right-to-left text
Example (Arabic sentence lost in extraction): Algeria achieved its independence in 1962 after 132 years of French occupation.
Western Arabic: 0 1 2 3 4 5 6 7 8 9
Indo-Arabic (Middle East): ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Eastern Indo-Arabic (Iran, Pakistan, etc.): ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
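Since the three digit sets denote the same decimal values, a common preprocessing step is to map them onto one set. The sketch below assumes the Unicode ranges U+0660 to U+0669 (Arabic-Indic digits) and U+06F0 to U+06F9 (Extended Arabic-Indic digits, used for Persian and Urdu). The name `to_western_digits` is illustrative.

```python
# Build a translation table from both digit ranges onto ASCII 0-9.
DIGIT_MAP = {}
for base in (0x0660, 0x06F0):
    for value in range(10):
        DIGIT_MAP[base + value] = str(value)


def to_western_digits(text: str) -> str:
    """Replace Arabic-Indic and Extended Arabic-Indic digits with 0-9."""
    return text.translate(DIGIT_MAP)


print(to_western_digits("\u0661\u0669\u0666\u0662"))  # 1962
print(to_western_digits("\u06F1\u06F3\u06F2"))        # 132
```

No reordering is needed, because the digits of a number are already stored most-significant first in memory, matching the left-to-right writing of numbers noted above.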
MSA Phonology and Spelling
[Figure: MSA consonant and vowel charts with letter-to-sound mappings; contents lost in extraction]
Arabic Script
Other languages
Arabic
No more than 3 dots
Dots either above or below
Marks are 1/2/3 dots, hamza (ء) or madda (~) only
Rare borrowing for foreign words: /p/, /v/, /g/, /tʃ/ (regionally variable)
[Figures: side-by-side text samples labeled "Arabic" and "Not Arabic"; sample text lost in extraction]
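The observation that other languages add letters to the Arabic alphabet suggests a simple heuristic for flagging text that may not be Arabic. The sketch below is only that, a heuristic under stated assumptions: it treats U+0621 to U+063A and U+0641 to U+064A as the MSA letter inventory and reports any other letter from the Unicode Arabic block (U+0600 to U+06FF). Borrowed letters can also appear in Arabic text for foreign words, so a hit is evidence rather than proof. The function names are illustrative.

```python
import unicodedata


def is_msa_letter(ch: str) -> bool:
    """True for letters of the basic MSA alphabet (hamza forms included)."""
    return "\u0621" <= ch <= "\u063A" or "\u0641" <= ch <= "\u064A"


def extended_letters(text: str) -> set:
    """Return Arabic-script letters that are outside the MSA alphabet."""
    return {
        ch
        for ch in text
        if "\u0600" <= ch <= "\u06FF"
        and unicodedata.category(ch) == "Lo"  # letters only: no marks, digits, tatweel
        and not is_msa_letter(ch)
    }


persian_word = "\u067E\u062F\u0631"       # Persian "pedar" (father), starts with peh
arabic_word = "\u0643\u062A\u0627\u0628"  # Arabic "kitAb" (book)
print(len(extended_letters(persian_word)))  # 1
print(len(extended_letters(arabic_word)))   # 0
```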
Encoding Issues
Encoding Arabic
Types of Encoding
Machine character sets
Graphemic (shape insensitive, logical order)
Allographic (shape/direction sensitive) [obsolete]
Human accessible
Transliteration
Phonetic spelling (IPA)
Romanization
Encoding Issues
Many Conflicting Character Sets for Arabic
Encodings
CP-1256
Commonly used
1-byte characters
Widely supported input/display
Minimal support for extended Arabic characters
Bi-script support (Roman/Arabic)
Tri-lingual support: Arabic, French, English (à la ANSI)
Encodings
Unicode
Increasingly the standard
2-byte characters (Arabic letters take two bytes in both UTF-8 and UTF-16)
Widely supported input/display
Supports extended Arabic characters
Multi-script representation
Encodings
Unicode
Supports presentation forms (shapes and ligatures)
Encoding Issues
Arabic Display
Memory (logical order) vs. display (visual order)
[Example lost in extraction: a right-to-left Arabic sentence with embedded "(Palestine)", "(Olympics)", 2000 and 2004, shown once in logical order and once in display order]
Display Problems
Actual encoding: CP-1256, ISO-8859 or Unicode
Display encoding: CP-1256, ISO-8859, Unicode or Western
[Figure: grid of the same Arabic text rendered under each actual/display encoding pair; mismatched pairs show unreadable characters; contents lost in extraction]
Wrong encoding
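The display problem can be reproduced in a few lines. This sketch encodes one Arabic word in CP-1256 and in UTF-8, then decodes the CP-1256 bytes as if they were Western (Latin-1) text. It relies only on codecs that ship with Python (`cp1256`, `utf-8`, `latin-1`); the choice of Latin-1 to stand in for the "Western" display encoding is an assumption made for illustration.

```python
text = "\u0633\u0644\u0627\u0645"  # /salAm/ "peace": seen, lam, alif, meem

cp1256_bytes = text.encode("cp1256")
utf8_bytes = text.encode("utf-8")
print(len(cp1256_bytes))  # 4: one byte per letter
print(len(utf8_bytes))    # 8: two bytes per letter

# Wrong display encoding: the bytes are valid Latin-1, so no error is raised,
# but the result is a string of accented Latin letters, not Arabic.
garbled = cp1256_bytes.decode("latin-1")
print(garbled == text)  # False

# The bytes themselves were never damaged, so the mistake is reversible.
restored = garbled.encode("latin-1").decode("cp1256")
print(restored == text)  # True
```

The last step works only because Latin-1 maps every byte value to some character. Decoding with a stricter codec could raise an error or substitute replacement characters, at which point the original bytes are no longer recoverable from the displayed string.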
Encoding Issues
Arabic Input
Standard graphemic keyboard
Logical order input
http://www.cyrillic.com/kbd/btc.html
Encodings
Buckwalter Encoding
Romanization
One-to-one mapping to Arabic script spelling
Left-to-right
Easy to learn/use
Human & machine compatible
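Because the Buckwalter scheme is a one-to-one character mapping, it can be implemented as a pair of lookup tables. The sketch below covers the core letters and diacritics (U+0621 to U+063A, U+0640 to U+0652, U+0670, U+0671) using the commonly published Buckwalter symbols. The table layout and function names are illustrative choices, and extended variants of the scheme are not covered.

```python
# Each entry: first code point of a contiguous run, and its Buckwalter symbols.
_RANGES = [
    (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza through ghain
    (0x0640, "_fqklmnhwYyFNKaui~o"),          # tatweel through sukun
    (0x0670, "`{"),                           # superscript alif, alif wasla
]

AR2BW = {
    chr(start + offset): symbol
    for start, symbols in _RANGES
    for offset, symbol in enumerate(symbols)
}
BW2AR = {symbol: letter for letter, symbol in AR2BW.items()}


def to_buckwalter(text: str) -> str:
    """Romanize Arabic script; characters outside the table pass through."""
    return "".join(AR2BW.get(ch, ch) for ch in text)


def from_buckwalter(text: str) -> str:
    """Convert Buckwalter romanization back to Arabic script."""
    return "".join(BW2AR.get(ch, ch) for ch in text)


print(to_buckwalter("\u0643\u062A\u0627\u0628"))  # ktAb
print(to_buckwalter(from_buckwalter("wsyktbhA")))  # wsyktbhA
```

Note that `from_buckwalter` assumes its whole input is Buckwalter text. Ordinary Latin words passed through it would be converted letter by letter as well.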
Morphology
Type
Concatenative: prefix, suffix, circumfix
Templatic: root+pattern
Function
Derivational
Creating new words
Mostly templatic
Inflectional
Modifying features of words
Tense, number, person, mood, aspect
Mostly concatenative
Road Map
Introduction
Orthography
Morphology
Derivational Morphology
Inflectional Morphology
Morphological Ambiguity
Arabic Computational Morphology
Syntax
Derivational Morphology
Templatic Morphology
Root + Pattern = Lexeme
ك ت ب (k-t-b) + ma12ū3 = مكتوب /maktūb/ written
ك ت ب (k-t-b) + 1ā2i3 = كاتب /kātib/ writer
Lexeme.Meaning = (Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random
Derivational Morphology
Root Meaning
KTB = notion of "writing"
كتب /katab/ write
كتاب /kitāb/ book
كاتب /kātib/ writer
مكتوب /maktūb/ written
مكتوب /maktūb/ letter
مكتبة /maktaba/ library
مكتب /maktab/ office
Derivational Morphology
Root Meaning
LHM-1: notion of "meat"
لحم /laħm/ (laHm) meat
لحّام /laħħām/ butcher
Derivational Morphology
Root Meaning
LHM-2: notion of "battle"
ملحمة /malħama/ fierce battle, massacre, epic
Derivational Morphology
Root Meaning
LHM-3: notion of "soldering"
لحم /laħam/ weld, solder, stick, cling
التحم /ʔiltaħam/ be welded/soldered/fused
ملتحم /multaħim/ welded, soldered, fused
Derivational Morphology
Pattern Meaning
Verb Pattern Meaning is hard to define
Pattern  Template   Pattern Meaning              Example            Gloss
I        1a2a3      -                            ktb → katab        write
II       1a22a3     Intensification, causation   ktb → kattab       dictate
III      1aA2a3     -                            ktb → kaAtab       correspond with
IV       Aa12a3     Causation                    jls → Ajlas        seat
V        ta1a22a3   Reflexive of Pattern II      Elm → taEal~am     learn
VI       ta1aA2a3   -                            ktb → takaAtab     correspond
VII      Ain1a2a3   Passive of Pattern I         ktb → Ainkatab     subscribe/enroll
VIII     Ai1ta2a3   Acquiescence, exaggeration   ktb → Aiktatab     register
IX       Ai12a33    Transformation               Hmr → AiHmarr      turn red/blush
X        Aista12a3  Requirement                  ktb → Aistaktab    ask/make_write
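The templates in the pattern table use the digits 1, 2 and 3 as slots for the root radicals, so generating a stem from a root and a pattern is a direct substitution. A minimal sketch, assuming triliteral roots written in the same romanization as the table; `apply_pattern` is an illustrative name, and the sketch ignores the spelling adjustments that real morphological generators apply (for instance, writing a doubled consonant with `~`).

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill the digit slots of a pattern with the radicals of the root."""
    return "".join(
        root[int(ch) - 1] if ch.isdigit() else ch
        for ch in pattern
    )


print(apply_pattern("ktb", "1a2a3"))      # katab
print(apply_pattern("ktb", "1a22a3"))     # kattab
print(apply_pattern("Hmr", "Ai12a33"))    # AiHmarr
print(apply_pattern("ktb", "Aista12a3"))  # Aistaktab
```

The substitution yields only the form. As the meaning formula earlier in this section suggests, the gloss of the resulting lexeme cannot be computed reliably from root and pattern alone.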
Inflectional Morphology
Derivational Morphology
Lexeme ≈ Root + Pattern
Inflectional Morphology
Word = Lexeme + Features
Part-of-speech
Traditional: Noun, Verb, Particle
Computational: N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det, Aux, Pun, IJ, and others
Many tag sets exist, ranging from 3 to over 22K tags
Noun-specific Features
Verb-specific Features
Other Features
Inflectional Morphology
Noun-specific Features
Verb-specific Features
Other Features
Single-letter conjunctions
Single-letter prepositions
Inflectional Morphology
Nouns
وكبيوتنا /wakabiyūtinā/
wa+ka+biyūt+nā
conj+prep+noun(plural)+poss
and+like+houses+our
"And like our houses"
وللمكتبات /walilmaktabāt/
wa+li+al+maktaba+āt
conj+prep+article+noun+plural
and+for+the+library+plural
"And for the libraries"
Inflectional Morphology
Verbs
فقلناها /faqulnāhā/
fa+qul+na+hā
conj+verb+subj+object
so+said+we+it
"So we said it"
وسنقولها /wasanaqūluhā/
wa+sa+na+qūl+u+hā
conj+tense+subj+verb+subj+object
and+will+we+say+it
"And we will say it"
Morphotactics
Subject conjugation (suffix or circumfix)
Inflectional Morphology
katab "to write"
Suffix conjugation
Person  Singular  Dual       Plural
1       katabtu   -          katabnā
2       katabta   katabtumā  katabtum
3       kataba    katabā     katabū
Prefix/circumfix conjugation
Person  Singular  Dual       Plural
1       aktubu    -          naktubu
2       taktubu   taktubān   taktubūn
3       yaktubu   yaktubān   yaktubūn
Feminine forms and other verb moods not shown
Morphological Ambiguity
Derivational ambiguity
قاعدة: basis/principle/rule, military base, Qa'ida/Qaeda/Qaida
Inflectional ambiguity
تكتب /taktub/: you write, she writes
Segmentation ambiguity
وجد: he found; و+جد: and+grandfather
للغة: ل+لغة: for a language; ل+اللغة: for the language
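Segmentation ambiguity arises whenever a word-initial letter could be either a clitic or part of the stem. The toy sketch below enumerates every reading that a small lexicon licenses, using Buckwalter romanization (`wjd`, `jd`). The lexicon, the proclitic list and the function name are all hypothetical stand-ins for a real morphological analyzer.

```python
# Hypothetical toy resources, in Buckwalter romanization.
PROCLITICS = {"w": "and", "f": "so", "b": "with", "l": "for", "k": "like"}
LEXICON = {"wjd": "he found", "jd": "grandfather"}


def segmentations(word: str) -> list:
    """Return every way to read the word as proclitics plus a known stem."""
    results = []
    if word in LEXICON:
        results.append([word])  # reading with no clitic split off
    if len(word) > 1 and word[0] in PROCLITICS:
        for rest in segmentations(word[1:]):
            results.append([word[0] + "+"] + rest)
    return results


for analysis in segmentations("wjd"):
    print(" ".join(analysis))
# wjd
# w+ jd
```

Both readings are valid out of context, which is why segmentation is treated as part of disambiguation rather than as a deterministic preprocessing step.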
Morphological Ambiguity
Spelling ambiguity
Optional diacritics
كاتب: /kātib/ writer, /kātab/ to correspond
Suboptimal spelling
Hamza dropping: أ, إ → ا
Undotted ta-marbuta: ة → ه
Undotted final ya: ي → ى
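One common response to these spelling variations is to normalize both the careful and the careless spelling to a single form before lookup. The sketch below is one such normalization, written under the assumption that collapsing the variants is acceptable for the application. It is lossy and will conflate some genuinely different words. The mapping and the function name are illustrative.

```python
NORMALIZE = str.maketrans({
    "\u0622": "\u0627",  # alif with madda       -> bare alif
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0629": "\u0647",  # ta marbuta            -> ha (the undotted variant)
    "\u0649": "\u064A",  # alif maqsura (undotted ya) -> ya
})


def normalize_spelling(text: str) -> str:
    """Collapse hamza, ta-marbuta and final-ya spelling variants."""
    return text.translate(NORMALIZE)


careful = "\u0645\u0643\u062A\u0628\u0629"   # mktbp, written with ta marbuta
careless = "\u0645\u0643\u062A\u0628\u0647"  # mktbh, dots omitted
print(normalize_spelling(careful) == normalize_spelling(careless))  # True
```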
Multiple sources of ambiguity
بين (byn)
Pronunciation  POS          Gloss
/bayyana/      Verb         he demonstrated
/bayyanna/     Verb         they [feminine] demonstrated
/bayyin/       Adj          clear/evident/explicit
/bayna/        Prep         between/among
/biyin/        Proper Noun  in Yen
/bin/          Proper Noun  Ben
Morphological Disambiguation in English
Select a morphological tag that fully describes the morphology of a word
Complete English morphological tag set (Penn Treebank): 48 tags
Verb: VB go, VBD went, VBG going, VBN gone, VBP go, VBZ goes
Morphological Disambiguation in Arabic
Morphological tag has 14 subtags corresponding to different linguistic categories
Example: Verb
Gender(2), Number(3), Person(3), Aspect(3), Mood(3), Voice(2), Pronominal clitic(12), Conjunction clitic(3)
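The value counts listed for the verb subtags give a quick sense of why Arabic tag sets grow so large compared with the 48 Penn Treebank tags. The sketch below simply multiplies the counts from the list above. The product is an upper bound on verb tags under the assumption that the subtags combine freely, which they do not in practice.

```python
import math

# Number of values per verb subtag, as listed above.
VERB_SUBTAGS = {
    "gender": 2,
    "number": 3,
    "person": 3,
    "aspect": 3,
    "mood": 3,
    "voice": 2,
    "pronominal_clitic": 12,
    "conjunction_clitic": 3,
}

upper_bound = math.prod(VERB_SUBTAGS.values())
print(upper_bound)  # 11664
```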
MADA (Habash & Rambow 2005; Habash & Rambow 2007; Roth et al. 2008)
[Figure: MADA architecture applied to a word W0 within the context window W-4 ... W4]
Morphological analyzer: rule-based, human-created
Morphological classifiers: multiple independent classifiers, corpus-trained
Ranker: heuristic or corpus-trained; orders the candidate analyses (1st, 2nd, 3rd, 4th, 5th)
Word: [Arabic example lost in extraction]
Segmented word
Can include any degree of morphological analysis
Pure segmentation: [example lost in extraction]
Arabic Treebank tokens (with recovery of some deleted/modified letters): [example lost in extraction]
Lexeme + Features
[lexeme lost in extraction][+Plural +Def ...]
TOKAN (Habash & Sadat 2006)
A generalized tokenizer
Assumes disambiguated morphological analysis à la MADA
Scheme  Specification            Tokenized example
D1      w+ f+ REST               w+ syktbhA
D2      w+ f+ b+ k+ l+ s+ REST   w+ s+ yktbhA
D3      (not recoverable)        w+ s+ yktb +hA
TB      (not recoverable)        w+ syktb +hA
EN      (not recoverable)        w+ s+ ktb/VBZ S:3MS +hA
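The difference between the schemes is which clitics are split off from the word. The toy sketch below reproduces the D1, D2, D3 and TB outputs listed above from a single, already disambiguated analysis. It is not TOKAN's interface: the slot names, the `SCHEMES` table and the `tokenize` function are hypothetical, and the input word `wsyktbhA` with its analysis is assumed from the example outputs. The EN scheme is omitted because it needs lexeme and part-of-speech information.

```python
PROCLITIC_SLOTS = ("conj", "prep", "fut")  # outermost clitic first

# Which slots each scheme splits off (assumed from the example outputs).
SCHEMES = {
    "D1": {"conj"},
    "D2": {"conj", "prep", "fut"},
    "D3": {"conj", "prep", "fut", "pron"},
    "TB": {"conj", "prep", "pron"},
}


def tokenize(analysis: dict, scheme: str) -> list:
    """Split an analyzed word into tokens according to the chosen scheme."""
    split = SCHEMES[scheme]
    tokens, attached = [], ""
    for slot in PROCLITIC_SLOTS:
        form = analysis.get(slot)
        if not form:
            continue
        if slot in split:
            tokens.append(form + "+")  # separate token, marked as proclitic
        else:
            attached += form           # stays glued to the stem
    core = attached + analysis["stem"]
    enclitic = analysis.get("pron")
    if enclitic and "pron" in split:
        tokens += [core, "+" + enclitic]
    else:
        tokens.append(core + (enclitic or ""))
    return tokens


analysis = {"conj": "w", "fut": "s", "stem": "yktb", "pron": "hA"}
for scheme in ("D1", "D2", "D3", "TB"):
    print(scheme, " ".join(tokenize(analysis, scheme)))
# D1 w+ syktbhA
# D2 w+ s+ yktbhA
# D3 w+ s+ yktb +hA
# TB w+ syktb +hA
```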
Issues
Appropriateness of system representation for an application
Machine Translation vs. Information Retrieval
Arabic spelling vs. phonetic spelling
System coverage
System extendibility
Availability to researchers
Use for analysis and generation
Definiteness
Noun compound formation, copular sentences, etc.
Nouns+DefiniteArticle, Proper Nouns, Pronouns, etc.
Agglutination
Attached prepositions create words that cross phrase boundaries
للمكتبات
for the-libraries
li+Almaktabāt
[PP li [NP Almaktabāt]]
Sentence Structure
Two types of Arabic sentences
Verbal sentences
[Verb Subject Object] (VSO)
Wrote the-boys the-poems
"The boys wrote the poems"
Sentence Structure
Verbal sentences
Verb agreement with gender only
Default singular number
wrote[3MascSing] the-boy/the-boys
wrote[3FemSing] the-girl/the-girls
Passive verbs
Same structure: Verb[passive] Subject[underlying object]
Agreement with surface subject
Sentence Structure
Verbal sentences
Common structural ambiguity
Third masculine/feminine singular is structurally ambiguous
Verb[3MascSingular] Noun[Masc]
Reading 1: Verb subject=he object=Noun
Reading 2: Verb subject=Noun
Sentence Structure
Copular sentences
[Topic Complement]
Definite Topic, Indefinite Complement
the-boy poet
"The boy is a poet"
Sentence Structure
Copular sentences
Types of complements
Noun/Adjective/Adverb
the-boy smart
Prepositional Phrase
the-boy in the-library: "The boy is in the library"
Copular-Sentence
[the-boy [book-his big]]: "The boy, his book is big"
Verb-Sentence
[the-boys [wrote[3MascPlur] poems]]: "The boys wrote the poems"
Phrase Structure
Noun Phrase
Determiner Noun Adjective PostModifier
this the-writer the-ambitious the-arriving from Japan
"This ambitious writer from Japan"
Noun-Adjective agreement
number, gender, definiteness
the-writer[FemSing] the-ambitious[FemSing]
the-writer[FemPlur] the-ambitious[FemPlur]
Phrase Structure
Noun Phrase
Idafa construction (إضافة)
Noun1 of Noun2, encoded structurally
Noun1-indefinite Noun2-definite
king Jordan
"the king of Jordan / Jordan's king"
Idafa chains
N1[indef] N2[indef] ... Nn-1[indef] Nn[def]
son uncle neighbor chief committee management the-company
"The cousin of the CEO's neighbor"
Phrase Structure
Morphological definiteness interacts with syntactic structure
Word 1: writer; Word 2: artist
Word 1      Word 2      Construction      Reading
definite    definite    Noun Phrase       The artist(ic) writer
indefinite  definite    Noun Compound     The writer of the artist
definite    indefinite  Copular Sentence  The writer is an artist
indefinite  indefinite  Noun Phrase       An artist(ic) writer
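The four readings of a two-noun sequence depend only on which of the two words carries definiteness, so the interaction can be written as a lookup. This is a sketch of the pattern for the writer/artist example only; the function name and the returned labels are illustrative, and real parsing must also weigh case, agreement and context.

```python
# (word1_definite, word2_definite) -> construction, for a noun-noun sequence.
CONSTRUCTIONS = {
    (True, True): "noun phrase (definite)",
    (False, True): "noun compound",
    (True, False): "copular sentence",
    (False, False): "noun phrase (indefinite)",
}


def classify(word1_definite: bool, word2_definite: bool) -> str:
    """Return the construction signalled by the two definiteness values."""
    return CONSTRUCTIONS[(word1_definite, word2_definite)]


print(classify(False, True))  # noun compound
print(classify(True, False))  # copular sentence
```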
Agreement in Arabic
Verb-Subject agreement
Verb agrees with subject in full (gender, number)
Exception: partial agreement (number=singular) in VSO order
Exception: partial agreement (number=singular; gender=feminine) for non-person plural subjects regardless of order
Noun-Adjective
Adjective agrees with noun in full (gender, number, definiteness and case)
Exception: partial agreement (number=singular; gender=feminine) for non-person plural nouns
Noun-Number
Number is the syntactic-case head
for numbers [3..10]: Noun is plural+genitive (idafa); number gender is the inverse of the noun's gender
for numbers [11..99]: Noun is singular+accusative (tamyiyz/specification); number gender is even more complicated
for numbers [100, 1K, 1M]: Noun is singular+genitive (idafa)
Example: vlAv jAmEAt jdydp
vlAv (three): Masc+Sg+Nom; the number agrees by gender inversion
jAmEAt (universities): Fem+PL+Gen (the noun itself is Fem+Sg)
jdydp (new): Fem+Sg+Gen
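The noun-number rules listed above can be stated as a small function. The sketch encodes only the three ranges given in the list and deliberately refuses anything else, since the text does not specify those cases. The function names and the string labels are illustrative.

```python
def counted_noun_form(n: int) -> tuple:
    """Return (number, case) required of a noun counted by n."""
    if 3 <= n <= 10:
        return ("plural", "genitive")       # idafa
    if 11 <= n <= 99:
        return ("singular", "accusative")   # tamyiyz/specification
    if n in (100, 1_000, 1_000_000):
        return ("singular", "genitive")     # idafa
    raise ValueError(f"no rule stated for {n}")


def number_gender(n: int, noun_gender: str) -> str:
    """Gender of the number word for 3..10: the inverse of the noun's gender."""
    if not 3 <= n <= 10:
        raise ValueError("gender inversion is stated only for 3..10")
    return "masc" if noun_gender == "fem" else "fem"


# "three new universities": the noun is feminine, so the number is masculine.
print(counted_noun_form(3))       # ('plural', 'genitive')
print(number_gender(3, "fem"))    # masc
```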
Computational Resources
Monolingual corpora for building language models
Arabic Gigaword
Arabic Newswire
United Nations Corpus (parallel with other UN languages)
Ummah Corpus (parallel with English)
Distributors
Linguistic Data Consortium (LDC)
Evaluations and Language resources Distribution Agency (ELDA)
Computational Resources
Penn Arabic Treebank (PATB)
Started in 2001
Goal is 1 million words
Currently 650K words
Sources: Agence France Presse, AlHayat newspaper, AnNahar newspaper
POS tags
Buckwalter analyzer
Arabic-tailored POS list
PATB constituency representation
Some modifications of Penn English Treebank (e.g. verb-phrase internal subjects)
Computational Resources
Prague Arabic Dependency Treebank (PADT)
Partial overlap with PATB and Arabic Gigaword
Sources: Agence France Presse, AlHayat and Xinhua
Morphological analysis
Extends PATB
Dependency representation
Graphic courtesy of Otakar Smrž: http://ckl.mff.cuni.cz/padt/PADT_1.0/docs/slides/2003-eacl-trees.ppt
(Habash, 2009)
Syntactic Annotation
Eight dependency relations: SBJ, OBJ, TPC, PRD, IDF, TMZ, MOD, FLAT
Practical Considerations
Less information to annotate
Dependencies easier to annotate than phrase structure
Terminology and representation close to traditional Arabic grammar
Computational Resources
Applications using Arabic treebanks
Statistical parsing
Bikel's parser (Bikel 2003)
Same engine used with English, Chinese and Arabic
Base-phrase chunking
(Diab et al. 2004; Diab et al. 2007)
Formalism conversion
Constituency to dependency (Žabokrtský and Smrž 2003; Habash et al. 2007; Tounsi et al. 2009)
Automatic diacritization
Zitouni et al. (2006); Habash & Rambow (2007); Shaalan et al. (2008), among others
Diacritization for MT (Diab et al. 2007)
MEADR 2009