Arabic NLP: Overview, the State of the Art, Challenges and Opportunities

Slides: 44
Provided by: aci5
Learn more at: https://acit2k.org
Transcript and Presenter's Notes


1
Arabic NLP Overview, the State of the Art
Challenges and Opportunities
  • Ali Farghaly

2
Overview (1)
  • Challenges
  • 1. To the Arabic language and culture
  • 2. To Arabic NLP
  • a. Inherent properties of Arabic
  • b. Problems of Arabic linguistics

3
Overview (2)
  • Inherent Opportunities for the Arabic Language
  • 1. Classical Arabic has survived 15 centuries, while
    other languages failed to do so
  • 2. Arabic is capable of reinventing itself
  • 3. Classical Arabic is a living language in
    which 1.4 billion Muslims perform their daily
    prayers
  • 4. The significance of the Arabic language
    culturally, strategically and linguistically

4
Overview (3): Why is NLP important?
  • Fundamental transition from the Industrial
    Economy to the Knowledge Economy in the 1980s and
    1990s
  • Knowledge is coded in Language
  • Necessity for NLP Systems to categorize,
    retrieve, translate, and/or answer questions from
    unstructured texts


5
Overview (4)
  • NLP History
  • Four generations of NLP
  • Disappointment with the First Generation of
    Machine Translation Systems, ALPAC Report (1966)
  • Second Generation of NLP Systems (1970s-1980s)

6
Overview (5)
  • Third Generation NLP Systems (1990s-present)
  • Success of Statistical Approaches
  • Problems with Statistical Approaches
  • The Emergence of the Hybrid Approach (4th
    generation?)

7
Overview (6)
  • Future Directions in Arabic NLP
  • New Attitude towards Arabic Grammar
  • Focus on Constituency
  • The Need for Arabic Language Planning

8
Overview (7)
  • Deal with syntactic ambiguity, co-reference,
    unbounded dependencies, phrasal constituency,
    pro-drop, etc.
  • Clear Objectives of Arabic NLP for the Arab
    World
  • Could be different from Arabic NLP for the
    Western World
  • Conclusion

9
Challenges (1)
  • To the Arabic language and culture
  • The English language is becoming the language of
    the World Wide Web (emails, blogs, chats, etc.),
    taking away functionalities from Arabic
  • The number of books and papers published in the Arab
    countries is minimal compared to that produced
    in the USA and other English-speaking countries
  • Thus, we consume rather than produce knowledge
  • No first-class research universities in the
    Arab world

10
Challenges (2)
  • Even when we report research, we do not use
    Arabic
  • Globalization has intensified the influence of
    the Western culture in the Arab World
  • Almost all Arab universities teach science and
    mathematics in a foreign language

11
Challenges (3)
  • To Arabic NLP
  • Inherent properties of the Arabic language
  • 1. The Arabic script (no short vowels and no
    capitalization)
  • 2. Explosion of ambiguity (from an average of 2.3 analyses
    per word in other languages to 19.2 in Arabic)
  • Example: 22 analyses of ??? by Buckwalter
    (2004)
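
The ambiguity created by unwritten short vowels can be illustrated with a minimal Python sketch. The three vowelized words are illustrative choices (assumptions, not the example analysed on the slide). Once the short-vowel marks are stripped, they collapse into one written form that a reader or parser must disambiguate from context.

```python
import unicodedata

# Arabic letters and short-vowel marks, given as code points.
KAF, TA, BA = "\u0643", "\u062a", "\u0628"
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"

def strip_diacritics(word: str) -> str:
    """Drop combining marks such as the short vowels."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))

# Assumed example readings that share the consonant skeleton k-t-b.
vowelized = {
    KAF + FATHA + TA + FATHA + BA + FATHA: "kataba (he wrote)",
    KAF + DAMMA + TA + DAMMA + BA: "kutub (books)",
    KAF + DAMMA + TA + KASRA + BA + FATHA: "kutiba (it was written)",
}

surface_forms = {strip_diacritics(word) for word in vowelized}
print(len(vowelized), "readings,", len(surface_forms), "written form")
```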

12
Challenges (4)
  • 3. Complex word structure
  • e.g. ورأيتهم ("and I saw them")
  • 4. The problem of Normalization
  • ? ? ? ? ? ? ? ? ?
  • losing distinction ?? ? ?? ? ??
  • 5. Arabic as a Pro Drop Language
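
A minimal sketch of the normalization problem follows. The mapping is an assumption: it is a commonly used table (hamza and madda forms of alef to bare alef, alef maqsura to ya, ta marbuta to ha), not necessarily the one shown on the slide. The example pair shows how two distinct words become identical after normalization.

```python
# Assumed normalization table, written with code points.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064a",  # alef maqsura          -> ya
    "\u0629": "\u0647",  # ta marbuta            -> ha
})

def normalize(text: str) -> str:
    """Apply the character-level normalization table."""
    return text.translate(NORMALIZE)

ALA = "\u0639\u0644\u0649"  # preposition 'ala ("on"), ends in alef maqsura
ALI = "\u0639\u0644\u064a"  # proper name Ali, ends in ya

print("distinct before:", ALA != ALI)
print("identical after:", normalize(ALA) == normalize(ALI))
```

Normalization helps retrieval by matching inconsistent spellings, at the price of merging words that were distinct in the original text.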

13
Assumptions
  • The Arabic language can meet all the needs of
    its speakers
  • The Arabs were producers of knowledge at a time
    when the rest of the world were consumers
    of knowledge
  • Contemporary Arab scholars have proved their ability
    to produce knowledge

14
Opportunities (1): Lessons from recent history
  • Unprecedented accumulation of knowledge
  • 1. Dramatic increase in the number of academic
    publications
  • 2. Huge investment in R&D companies
  • 3. Fundamental changes in industry and society
    similar to the Industrial Revolution
  • 4. Impressive progress in many fields such as
    medicine, space exploration, computer
    software and hardware development, etc.

15
The Knowledge Economy
  • Fundamental Aspects of the Knowledge Economy
  • 1. Strategic product is knowledge rather than
    manufactured goods
  • 2. Industrial workers are replaced by
    knowledge workers
  • 3. Global labor market
  • 4. Democratization of knowledge

16
The Knowledge Economy & NLP (1)
  • The age of on-line information, electronic
    communication, World Wide Web (www)
  • Millions of documents are created every minute:
    from KB -> MB -> GB -> terabytes
  • Explosion of knowledge can lead to explosion of
    ignorance

17
The Knowledge Economy & NLP (2)
  • Democratization of knowledge through the use of
    the computer/cell phone as a communication tool
  • Governments, industry, academia, and
    individuals desperately need tools to process
    information
  • Information is coded in natural language

18
The Knowledge Economy & NLP (3)
  • Globalization -> multilingual applications such
    as machine translation and cross-language
    applications
  • Information Retrieval (IR) and Information
    Extraction (IE) are becoming increasingly
    important
  • Keyword search is being replaced by question
    answering systems
  • Knowledge is encoded in natural language

19
NLP - Flashback
  • The invention of the computer and language
  • 1940s
  • - First application: breaking the Nazis' secret
    code
  • - Second application: Russian-to-English
    machine translation (Weaver, 1949)

20
1st Generation of MT
  • Principles of the first generation
  • Capitalized on the fast lookup offered by the
    computer
  • MT is essentially a matter of correct pairing of
    the source language expressions with the target
    language equivalents
  • Trivial reordering of words
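
The first-generation principles above (dictionary pairing plus trivial reordering) can be sketched in a few lines of Python. The three-word lexicon and the single reordering rule are hypothetical, chosen only to illustrate the idea. They are not taken from any historical system.

```python
# Hypothetical bilingual dictionary: source word -> (target word, part of speech).
LEXICON = {
    "la": ("the", "DET"),
    "casa": ("house", "NOUN"),
    "blanca": ("white", "ADJ"),
}

def translate(sentence: str) -> str:
    # Word-for-word lookup; unknown words pass through unchanged.
    pairs = [LEXICON.get(word, (word, "UNK")) for word in sentence.lower().split()]
    # Trivial local reordering: NOUN ADJ becomes ADJ NOUN.
    i = 0
    while i < len(pairs) - 1:
        if pairs[i][1] == "NOUN" and pairs[i + 1][1] == "ADJ":
            pairs[i], pairs[i + 1] = pairs[i + 1], pairs[i]
            i += 2
        else:
            i += 1
    return " ".join(word for word, _ in pairs)

print(translate("la casa blanca"))  # the white house
```

Nothing in this design models sentence structure or meaning, so any input outside the dictionary and the one rule is passed through or mistranslated.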

21
Problems with 1st Generation MT
  • Naïve concept of language structure
  • Heavy reliance on bilingual dictionaries
  • No attempt to mimic human translation
  • Unrealistic goals and promises

22
2nd Generation MT (1)
  • Principles of the Transfer Approach
  • Three Components
  • 1. Analysis of the source language (SL)
  • 2. Transfer of the SL structure to the target language (TL)
  • 3. Generation of target language surface forms
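
The three components can be sketched as a pipeline of three functions. This is a toy illustration under strong assumptions: the input is a transliterated three-word Arabic sentence in verb-subject-object order, and the `Clause` type and transfer lexicon are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Clause:
    verb: str
    subject: str
    obj: str

# Hypothetical transfer lexicon (transliterated Arabic -> English).
LEXICON = {"kataba": "wrote", "alwaladu": "the boy", "risaalatan": "a letter"}

def analyze(sentence: str) -> Clause:
    """Analysis: read a three-word verb-subject-object sentence into a structure."""
    verb, subject, obj = sentence.split()
    return Clause(verb, subject, obj)

def transfer(source: Clause) -> Clause:
    """Transfer: map each source constituent to its target-language equivalent."""
    return Clause(LEXICON[source.verb], LEXICON[source.subject], LEXICON[source.obj])

def generate(target: Clause) -> str:
    """Generation: linearize in English subject-verb-object order."""
    return f"{target.subject} {target.verb} {target.obj}"

print(generate(transfer(analyze("kataba alwaladu risaalatan"))))
```

Unlike word-for-word lookup, the reordering here falls out of the intermediate structure: analysis records grammatical roles, and generation decides the surface order.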

23
2nd Generation MT (2)
  • Basic Principles
  • Linguistic knowledge is essential for the
    understanding of the source text
  • Target specific domains for better translation
  • More realistic goals and promises

24
2nd Generation MT (3)
  • Positive developments in NLP technology: chart
    parsing (Woods 1970), unification grammar (Shieber
    1986), definite clause grammar (Pereira 1980)
  • Driven by the commercial market: The Georgetown
    System, Pan American Health Organization, EUROTRA
    Project (Interlingua approach), etc.
  • Emergence of lexical approaches to grammar

25
Problems with 2nd Generation MT
  • Limitations
  • Linguistic knowledge is expensive
  • Explosion of syntactic ambiguity (300 parses for
    each input sentence)
  • Needed huge computing power
  • Limited successes: the METAL system and the
    Canadian weather forecast translation system
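
The explosion of syntactic ambiguity can be made concrete with a small chart-based parse counter. This is a sketch under assumptions: the grammar is a toy one in Chomsky normal form, and the input is a sequence of category labels rather than words. It counts how many distinct trees the grammar assigns as prepositional phrases are appended to a sentence.

```python
from collections import defaultdict

# Toy grammar (an assumption for illustration): parent -> left right.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]

def count_parses(tags, goal="S"):
    n = len(tags)
    # chart[(i, j)][A] = number of distinct trees rooted in A over tags[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, tag in enumerate(tags):
        chart[(i, i + 1)][tag] = 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][goal]

base = ["NP", "V", "NP"]
for pps in range(5):
    print(pps, "PPs ->", count_parses(base + ["P", "NP"] * pps), "parses")
```

Each added prepositional phrase can attach to the verb phrase or to any preceding noun phrase, so the count grows much faster than sentence length.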

26
Statistical Approaches to NLP
  • Built on Probability theory
  • Works well for specific domains
  • Relies on training data (machine learning)
  • Very fast
  • Does not require linguistic knowledge

27
3rd Generation of MT Systems (1)
  • Principles
  • Relies on the machine learning approach
  • Benefits from the existence of huge corpora
    through the Internet
  • Low development cost
  • Rapid development time

28
3rd Generation of MT Systems (2)
  • Heavy reliance on parallel corpora at several
    levels
  • Does not require any linguistic knowledge: "Give
    me a large enough parallel corpus, and I will give
    you a machine translation system in hours"
  • Represents an empirical approach to language:
    "The proof is in the pudding" (Manning 2000)
  • Unlike the transfer approach, does not attempt
    to mimic human translators
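
How translation knowledge can be learned from a parallel corpus alone is sketched below. The technique is expectation-maximization in the style of IBM Model 1, a classic word-alignment model named here for illustration. The slides do not say which models the systems they mention used, and the three-sentence corpus is hypothetical.

```python
from collections import defaultdict

# Tiny hypothetical parallel corpus: (source sentence, target sentence).
corpus = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]
pairs = [(src.split(), tgt.split()) for src, tgt in corpus]
target_vocab = {word for _, tgt in pairs for word in tgt}

# prob[(e, f)] = P(target word e | source word f), initialised uniformly.
prob = defaultdict(lambda: 1.0 / len(target_vocab))

for _ in range(20):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    # E-step: collect expected alignment counts.
    for src, tgt in pairs:
        for e in tgt:
            norm = sum(prob[(e, f)] for f in src)
            for f in src:
                frac = prob[(e, f)] / norm
                count[(e, f)] += frac
                total[f] += frac
    # M-step: re-estimate translation probabilities.
    for (e, f), c in count.items():
        prob[(e, f)] = c / total[f]

def best_translation(f):
    """Most probable target word for source word f (unseen pairs score 0)."""
    return max(target_vocab, key=lambda e: prob.get((e, f), 0.0))

for f in ["das", "haus", "buch", "ein"]:
    print(f, "->", best_translation(f))
```

No dictionary or grammar is supplied: the word pairings emerge only from co-occurrence across sentence pairs, which is also why such systems depend so heavily on the size and domain of the training data.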

29
3rd Generation MT Systems (3)
  • Benefited from
  • Computers becoming much faster, more powerful
    and less costly
  • Accumulation of huge corpora on the Internet
  • Availability of annotated treebanks for training
    (Linguistic Data Consortium)

30
Problems with 3rd Generation MT Systems
  • Limitations
  • Performs well when dealing with data similar
    to the training set
  • Performance deteriorates when documents
    are different from the training set
  • There comes a point when adding more training
    data does not improve performance (The
    Threshold Problem)

31
Problems with 3rd Generation MT Systems
  • There are domains where data is sparse
  • Sometimes the training data itself is noisy
    (full of errors)
  • Does not provide any insight into language,
    linguistics or the translation process

32
Arabic NLP Goals
  • Goals
  • Transfer of knowledge and technology to the Arab
    World
  • Modernize and fertilize the Arabic language
  • Improve and modernize Arabic linguistics
  • Make information retrieval, extraction,
    summarization and translation available to the
    Arab user

33
Arabic NLP History (1)
  • Followed and integrated with mainstream NLP
  • 1978 - 1989
  • Kuwait: Mohammed Al-Sharikh, Nabil Ali (Sakhr)
  • Morocco: Hlaal (1979, 1985) on Arabic morphology
  • Holland: Everhard Ditters on Modern Standard Arabic (MSA)
  • US: The Weidner English/Arabic MT system

34
Arabic NLP History (2)
  • IBM Scientific Centers in Kuwait and Cairo
  • France: The Dinar Lexical Data Base (Joseph Dichy)
  • Language Resources and Human Language Technology
    work (ELRA/ELDA, Choukri)

35
Arabic NLP History (3)
  • The Language Weaver statistical Arabic-to-English
    MT system
  • The SYSTRAN Arabic-to-English MT system
  • The AppTek Arabic-to-English hybrid MT system
  • The LDC Arabic Treebank (University of
    Pennsylvania)

36
Arabic NLP History (4)
  • The Prague Dependency Arabic Treebank
  • Arabic Entity Extraction (Shaalan 2007; Zitouni
    2008)
  • Arabic Dialects Modeling Project at Columbia
    University, USA (Diab and Habash, 2007)

37
Future Directions in Arabic NLP (1)
  • New Attitude toward Arabic Grammar
  • The need for explicit description of MSA
  • Consider the idafa
  • ???? ?????
  • ??? ??????
  • ??? ??????

38
Future Directions in Arabic NLP (2)
  • The first is a noun phrase
  • The second is an adjectival phrase
  • The third is a prepositional phrase
  • The description of all three as idafa is not helpful
    to Arabic NLP

39
Future Directions in Arabic NLP (3)
  • We need to focus on constituency without
    case endings. Consider:
  • ??? ????? ?? ?????? ?? ???????
  • ??? ????? ?? ?????? ?? ????? ??????
  • In the first, alwaziir is a subject, and in the
    second it is an object. In both sentences it is
    marked accusative

40
Future Directions in Arabic NLP (4)
  • We need to describe rules for Arabic anaphoric
    relations
  • Subjectless sentences (Pro Drop)
  • Discourse Analysis
  • Arabic love of nominalization

41
Future Directions in Arabic NLP (5)
  • Defining MSA
  • Mark differences between MSA and Classical Arabic (CA)
  • New Arabic grammars - acknowledging the heritage
    while being liberated from the paradigm
  • A grammar that is more relevant to Arabic
    Information Retrieval and Arabic MT

42
Conclusion (1)
  • Arabic NLP can help in transforming Arab
    societies
  • Good progress has been achieved in Arabic NLP
  • A more explicit grammar of MSA will enhance
    and speed the development of NLP systems
  • Arabic needs to be restored as the language
    of science and research

43
Conclusion (2)
  • Standards of usage need to be enforced to
    preserve Arabic as the expression of the
    Arabic identity
  • Linguists need to do their homework by writing
    explicit grammars for discourse analysis,
    anaphoric relations, syntactic structures, etc.