Title: Arabic NLP: Overview, the State of the Art, Challenges and Opportunities
1 Arabic NLP Overview, the State of the Art
Challenges and Opportunities
2 Overview (1)
- Challenges
- 1. to the Arabic language and culture
- 2. to Arabic NLP
- a. Inherent properties of Arabic
- b. Problems of Arabic linguistics
3 Overview (2)
- Inherent Opportunities for the Arabic Language
- 1. Classical Arabic has survived 15 centuries; other languages failed to do so
- 2. Arabic is capable of reinventing itself
- 3. Classical Arabic is a living language in which 1.4 billion Muslims perform their daily prayers
- 4. The significance of the Arabic language culturally, strategically and linguistically
4 Overview (3) Why is NLP important?
- Fundamental transition from the Industrial Economy to the Knowledge Economy in the 1980s and 1990s
- Knowledge is coded in language
- Necessity for NLP systems to categorize, retrieve, translate, and/or answer questions from unstructured texts
5 Overview (4)
- NLP History
- Four generations of NLP
- Disappointment with the First Generation of Machine Translation Systems: the ALPAC Report (1966)
- Second Generation of NLP Systems (1970s-1980s)
6 Overview (5)
- Third Generation NLP Systems (1990s-present)
- Success of Statistical Approaches
- Problems with Statistical Approaches
- The Emergence of the Hybrid Approach (4th generation?)
7 Overview (6)
- Future Directions in Arabic NLP
- New Attitude towards Arabic Grammar
- Focus on Constituency
- The Need for Arabic Language Planning
8 Overview (7)
- Deal with syntactic ambiguity, co-reference, unbounded dependencies, phrasal constituencies, Pro-Drop, etc.
- Clear Objectives of Arabic NLP for the Arab World
- Could be different from Arabic NLP for the Western World
- Conclusion
9 Challenges (1)
- To the Arabic language and culture
- The English language is becoming the language of the World Wide Web (emails, blogs, chats, etc.), taking away functionalities from Arabic
- The number of books and papers published in the Arab countries is minimal compared to that produced in the USA and English-speaking countries
- Thus, we consume rather than produce knowledge
- No first-class research universities in the Arab world
10 Challenges (2)
- Even when we report research, we do not use Arabic
- Globalization has intensified the influence of Western culture in the Arab World
- Almost all Arab universities teach science and mathematics in a foreign language
11 Challenges (3)
- To Arabic NLP
- Inherent properties of the Arabic language
- 1. The Arabic script (no short vowels and no capitalization)
- 2. Explosion of ambiguity (from an average of 2.3 analyses per word in other languages to 19.2 in Arabic)
- Example: 22 analyses of ??? by Buckwalter (2004)
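The effect of a script without short vowels can be sketched in a few lines of Python. The three vowelled words below, all built on the consonants k-t-b, are illustrative choices and are not taken from the slides; the sketch only shows how distinct words collapse into one written form once the vowel marks are dropped, which is one source of the ambiguity described above.

```python
import re

# Arabic short-vowel and related marks (tashkeel), U+064B through U+0652.
TASHKEEL = re.compile("[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove the vowel marks, as ordinary Arabic writing does."""
    return TASHKEEL.sub("", text)

# Letters and marks spelled out by code point so the example is unambiguous.
KAF, TA, BA = "\u0643", "\u062A", "\u0628"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"

# Illustrative vowelled words (an assumption of this sketch, not slide content).
vocalized = {
    "kataba (he wrote)": KAF + FATHA + TA + FATHA + BA + FATHA,
    "kutiba (it was written)": KAF + DAMMA + TA + KASRA + BA + FATHA,
    "kutub (books)": KAF + DAMMA + TA + DAMMA + BA,
}

surface_forms = {strip_diacritics(word) for word in vocalized.values()}
print(len(vocalized), "vowelled words ->", len(surface_forms), "written form(s)")
# prints: 3 vowelled words -> 1 written form(s)
```

A morphological analyzer that sees only the written form has to keep every one of these readings alive until context rules some of them out.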
12 Challenges (4)
- 3. Complex word structure
- e.g. ورأيتهم (wa-ra'aytu-hum) "and I saw them"
- 4. The problem of Normalization
- ? ? ? ? ? ? ? ? ?
- losing distinction ?? ? ?? ? ??
- 5. Arabic as a Pro Drop Language
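The normalization problem in item 4 can be illustrated with a small Python sketch. The characters on the original slide did not survive extraction, so the table below is an assumption: it uses the orthographic normalizations commonly applied in Arabic text processing (alef variants to bare alef, alef maqsura to ya, ta marbuta to ha). The word pair at the end is likewise an illustrative choice showing how normalization can erase a real distinction.

```python
import re

# Assumed normalization table; the slide's own character list was lost.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda, hamza above, hamza below

def normalize(text: str) -> str:
    """Collapse common orthographic variants to a single form."""
    text = ALEF_VARIANTS.sub("\u0627", text)  # any alef variant -> bare alef
    text = text.replace("\u0649", "\u064A")   # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")   # ta marbuta -> ha
    return text

# Illustrative pair (not from the slides): the preposition 'ala ("on")
# ends in alef maqsura, the name Ali ends in ya.
ala = "\u0639\u0644\u0649"
ali = "\u0639\u0644\u064A"

print(ala == ali)                        # False: distinct before normalization
print(normalize(ala) == normalize(ali))  # True: the distinction is lost
```

Normalization helps matching when writers are inconsistent, but as the last line shows, it trades that robustness for additional ambiguity.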
13 Assumptions
- The Arabic language can meet all the needs of its speakers
- The Arabs were producers of knowledge at a time when the rest of the world were consumers of knowledge
- Contemporary Arab scholars have proved their ability to produce knowledge
14 Opportunities (1) Lessons from recent history
- Unprecedented accumulation of knowledge
- 1. Dramatic increase in the number of academic publications
- 2. Huge investment in R&D companies
- 3. Fundamental changes in industry and society similar to the Industrial Revolution
- 4. Impressive progress in many fields such as medicine, space exploration, computer software and hardware development, etc.
15 The Knowledge Economy
- Fundamental Aspects of the Knowledge Economy
- 1. Strategic product is knowledge rather than manufactured goods
- 2. Industrial workers are replaced by knowledge workers
- 3. Global labor market
- 4. Democratization of knowledge
16 The Knowledge Economy & NLP (1)
- The age of on-line information, electronic communication, and the World Wide Web (WWW)
- Millions of documents are created every minute: from KB -> MB -> GB -> terabytes
- Explosion of knowledge can lead to explosion of
ignorance
17 The Knowledge Economy & NLP (2)
- Democratization of knowledge through the use of the computer/cell phone as a communication tool
- Governments, industry, academia, and individuals desperately need tools to process information
- Information is coded in natural language
18 The Knowledge Economy & NLP (3)
- Globalization -> Multilingual applications such as machine translation and cross-language applications
- Information Retrieval (IR) and Information Extraction (IE) are becoming increasingly important
- Keyword search is being replaced by question answering systems
- Knowledge is encoded in natural language
19 NLP - Flashback
- The invention of the computer and language
- 1940s
- First application: breaking the Nazis' secret code
- Second application: Russian-to-English machine translation (Warren Weaver, 1949)
20 1st Generation of MT
- Principles of the first generation
- Capitalized on the speed of lookup offered by the computer
- MT is essentially a matter of correctly pairing source language expressions with their target language equivalents
- Trivial reordering of words
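The principles above amount to dictionary lookup plus a little local reordering. A minimal sketch in Python, assuming a hypothetical four-word French-to-English lexicon and a single adjective-swap rule (neither comes from the slides, and no historical system is being reproduced here):

```python
# Hypothetical toy lexicon and adjective list, for illustration only.
LEXICON = {"le": "the", "chat": "cat", "noir": "black", "dort": "sleeps"}
ADJECTIVES = {"black"}

def translate(sentence: str) -> str:
    # Step 1: pair each source word with its dictionary equivalent;
    # words missing from the lexicon pass through unchanged.
    words = [LEXICON.get(w, w) for w in sentence.lower().split()]
    # Step 2: trivial reordering: move an adjective in front of the word before it.
    for i in range(1, len(words)):
        if words[i] in ADJECTIVES and words[i - 1] not in ADJECTIVES:
            words[i - 1], words[i] = words[i], words[i - 1]
    return " ".join(words)

print(translate("le chat noir dort"))  # the black cat sleeps
```

The sketch has no notion of sentence structure at all, which is the weakness the next slide lists.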
21 Problems with 1st Generation MT
- Naïve concept of language structure
- Heavy reliance on bilingual dictionaries
- No attempt to mimic human translation
- Unrealistic goals and promises
22 2nd Generation MT (1)
- Principles of the Transfer Approach
- Three Components
- 1. Analysis of the source language (SL)
- 2. Transfer of the SL structure to the target language (TL)
- 3. Generation of TL surface forms
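The three components can be sketched as three small functions. This is a toy illustration under stated assumptions: the source sentence is a three-word, verb-subject-object clause in transliterated Arabic, and the `Clause` type, the lexicon, and the function names are all hypothetical, not part of any system named in these slides.

```python
from dataclasses import dataclass

@dataclass
class Clause:
    verb: str
    subject: str
    obj: str

# Hypothetical toy lexicon for transliterated Arabic (illustration only).
LEXICON = {"kataba": "wrote", "al-walad": "the boy", "al-dars": "the lesson"}

def analyze(sentence: str) -> Clause:
    """1. Analysis: assume a three-word verb-subject-object source sentence."""
    verb, subject, obj = sentence.split()
    return Clause(verb, subject, obj)

def transfer(src: Clause) -> Clause:
    """2. Transfer: map each source constituent to its target equivalent."""
    return Clause(LEXICON[src.verb], LEXICON[src.subject], LEXICON[src.obj])

def generate(tgt: Clause) -> str:
    """3. Generation: linearize in English subject-verb-object order."""
    return f"{tgt.subject} {tgt.verb} {tgt.obj}"

print(generate(transfer(analyze("kataba al-walad al-dars"))))  # the boy wrote the lesson
```

Unlike word-for-word lookup, the reordering here falls out of the structure: generation decides the word order, not the order of the input.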
23 2nd Generation MT (2)
- Basic Principles
- Linguistic knowledge is essential for understanding the source text
- Target specific domains for better translation
- More realistic goals and promises
24 2nd Generation MT (3)
- Positive developments in NLP technology: chart parsing (Woods 1970), unification grammar (Shieber 1986), definite clause grammar (Pereira 1980)
- Driven by the commercial market: the Georgetown System, Pan American Health Organization, EUROTRA Project (Interlingua approach), etc.
- Emergence of lexical approaches to grammar
25 Problems with 2nd Generation MT
- Limitations
- Linguistic knowledge is expensive
- Explosion of syntactic ambiguity (300 parses for each input sentence)
- Needed huge computing power
- Limited successes: the METAL system and the Canadian weather forecast translation system
26 Statistical Approaches to NLP
- Built on probability theory
- Works well for specific domains
- Relies on training data (machine learning)
- Very fast
- Does not require linguistic knowledge
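As one small instance of the points above, a bigram model estimates the probability of a word given the previous word purely by counting in training data. The sketch below is a minimal maximum-likelihood version with a made-up three-sentence corpus; the function names and the corpus are assumptions for illustration, and real systems add smoothing and far more data.

```python
from collections import Counter

def train_bigram(corpus):
    """Count contexts and bigrams; '<s>' marks the start of a sentence."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams

def prob(contexts, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

# Made-up toy training data.
corpus = ["the cat sleeps", "the cat eats", "the dog sleeps"]
contexts, bigrams = train_bigram(corpus)

print(prob(contexts, bigrams, "the", "cat"))     # 2 of the 3 words after "the" are "cat"
print(prob(contexts, bigrams, "cat", "sleeps"))  # 0.5
```

Nothing in the code knows what a noun or a verb is, which is the sense in which the approach does not require linguistic knowledge.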
27 3rd Generation of MT Systems (1)
- Principles
- Relies on the machine learning approach
- Benefits from the existence of huge corpora through the Internet
- Low development cost
- Rapid development time
28 3rd Generation of MT Systems (2)
- Heavy reliance on parallel corpora at several levels
- Does not require any linguistic knowledge: "Give me enough parallel corpus, and I will give you a machine translation system in hours"
- Represents an empirical approach to language: "The proof is in the pudding" (Manning 2000)
- Unlike the transfer approach, does not attempt to mimic human translators
29 3rd Generation MT Systems (3)
- Benefited from
- Computers becoming much faster, more powerful and less costly
- Accumulation of huge corpora on the Internet
- Availability of annotated Treebanks for training (Linguistic Data Consortium)
30 Problems with 3rd Generation MT Systems
- Limitations
- Performs well when dealing with data similar to the training set
- Performance deteriorates when documents are different from the training set
- There comes a point when adding more training data does not improve performance (the Threshold Problem)
31 Problems with 3rd Generation MT Systems
- There are domains where data is sparse
- Sometimes the training data itself is noisy (full of errors)
- Does not provide any insight into language, linguistics or the translation process
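One rough way to see why performance drops on documents unlike the training set is to measure how many test words the system has never seen. The sketch below computes that out-of-vocabulary rate on made-up weather and medical sentences; the data and the `oov_rate` helper are assumptions for illustration, and the rate is only a crude proxy for translation quality.

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that never occur in the training data."""
    vocab = {w for s in train_sentences for w in s.lower().split()}
    test_tokens = [w for s in test_sentences for w in s.lower().split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for w in test_tokens if w not in vocab)
    return unseen / len(test_tokens)

# Made-up training data from a narrow domain (weather forecasts).
train = ["rain is expected tomorrow", "snow is expected tonight"]

in_domain = ["rain is expected tonight"]
out_of_domain = ["the patient is expected to recover"]

print(oov_rate(train, in_domain))      # 0.0: every word was seen in training
print(oov_rate(train, out_of_domain))  # 4 of the 6 words are unseen
```

A system trained on forecasts has nothing to say about the unseen words, however well it handles the domain it was trained on.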
32 Arabic NLP Goals
- Goals
- Transfer of knowledge and technology to the Arab World
- Modernize and fertilize the Arabic language
- Improve and modernize Arabic linguistics
- Make information retrieval, extraction, summarization and translation available to the Arab user
33 Arabic NLP History (1)
- Followed and integrated with mainstream NLP
- 1978 - 1989
- Kuwait: Mohammed Al-Sharikh, Nabil Ali, Sakhr
- Morocco: Hlaal (1979, 1985) on Arabic morphology
- Holland: Everhard Ditters on MSA
- US: The Weidner English/Arabic MT system
34 Arabic NLP History (2)
- IBM Scientific Centers in Kuwait and Cairo
- France: The Dinar Lexical Database, Joseph Dichy
- Language Resources and Human Language Technology work (ELRA/ELDA, Choukri)
35 Arabic NLP History (3)
- The Language Weaver statistical Arabic-to-English MT system
- The SYSTRAN Arabic-to-English MT system
- The AppTek Arabic-to-English hybrid MT system
- The LDC Arabic Treebank, University of Pennsylvania
36 Arabic NLP History (4)
- The Prague Arabic Dependency Treebank
- Arabic Entity Extraction (Shaalan 2007; Zitouni 2008)
- Arabic Dialects Modeling Project at Columbia University, USA (Diab and Habash, 2007)
37 Future Directions in Arabic NLP (1)
- New Attitude toward Arabic Grammar
- The need for explicit description of MSA
- Consider the idafa
- ???? ?????
- ??? ??????
- ??? ??????
38 Future Directions in Arabic NLP (2)
- The first is a noun phrase
- The second is an adjectival phrase
- The third is a prepositional phrase
- Describing all three as idafa is not helpful to Arabic NLP
39 Future Directions in Arabic NLP (3)
- We need to focus on constituency without case endings. Consider
- ??? ????? ?? ?????? ?? ???????
- ??? ????? ?? ?????? ?? ????? ??????
- In the first, alwaziir is a subject and in the second it is an object. In both sentences it is marked accusative
40 Future Directions in Arabic NLP (4)
- We need to describe rules for Arabic anaphoric relations
- Subjectless sentences (Pro-Drop)
- Discourse Analysis
- Arabic love of nominalization
41 Future Directions in Arabic NLP (5)
- Defining MSA
- Mark differences between MSA and CA
- New Arabic grammars: acknowledging the heritage while being liberated from the paradigm
- A grammar that is more relevant to Arabic Information Retrieval and Arabic MT
42 Conclusion (1)
- Arabic NLP can help in transforming Arab societies
- Good progress has been achieved in Arabic NLP
- A more explicit grammar of MSA will enhance and speed the development of NLP systems
- Arabic needs to be restored as the language of science and research
43 Conclusion (2)
- Standards of usage need to be enforced to preserve Arabic as the expression of the Arabic identity
- Linguists need to do their homework by writing explicit grammars for discourse analysis, anaphoric relations, syntactic structures, etc.