Rule-based approach in Arabic NLP: Tools, Systems and Resources

Slides: 84
Provided by: khaleds
Learn more at: http://www.emi.ac.ma
Transcript and Presenter's Notes



1
Rule-based approach in Arabic NLP: Tools, Systems
and Resources
CITALA'09
  • Dr Khaled Shaalan
  • Professor, Faculty of Computers and Information,
    Cairo University
  • On secondment to BUiD, UAE
  • Khaled.shaalan_at_buid.ac.ae,
  • gmail.com

2
Agenda
  • Objective
  • Language Tasks
  • NLP Approaches
  • Rule-based Arabic Analysis and generation tools
  • Rule-based Arabic NLP applications
  • Some Arabic NLP Free Resources
  • Major and Arabic mailing lists
  • Conclusion

3
Objective
  • To show how the rule-based approach has been
    successfully used to develop Arabic natural
    language processing tools and applications.

4
Separating Language Tasks
  • English vs. French vs. Arabic vs . . .
  • spoken language (dialogue) vs written text vs
    handwritten script
  • Genuine Script vs transliterated (Romanized)
    script
  • Vocalized (vowelized) vs non-vocalized
  • Understanding vs. generation
  • First language learner vs second language learner
  • Classical or Quranical Arabic vs Modern Standard
    Arabic vs colloquial (dialects)
  • Stem-based vs root-based

5
Rules
  • Situation/Action
  • If match(stem.prefix, def_article) then
    remove(stem.prefix, Stem_FS)
  • If match(stem.definiteness, indefinite) then
    morph_gen(stem.definiteness, Stem_FS)
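A minimal sketch of the first situation/action rule, assuming the stem's feature structure (Stem_FS) can be modelled as a dictionary. The dictionary layout, the sample word, and the helper name `remove_prefix` are assumptions made for illustration; only `match` and the shape of the rule come from the slide.

```python
# Hypothetical feature structure for a stem. The field names follow the
# slide; the dictionary layout and the sample word are assumptions.
DEF_ARTICLE = "ال"

def match(value, expected):
    """Situation part: true when the feature holds the expected value."""
    return value == expected

def remove_prefix(stem_fs):
    """Action part: strip the prefix from the form and clear the feature."""
    result = dict(stem_fs)
    prefix = result["prefix"]
    if prefix and result["form"].startswith(prefix):
        result["form"] = result["form"][len(prefix):]
    result["prefix"] = None
    return result

def apply_rule(stem_fs):
    # If match(stem.prefix, def_article) then remove(stem.prefix, Stem_FS)
    if match(stem_fs["prefix"], DEF_ARTICLE):
        return remove_prefix(stem_fs)
    return stem_fs

stripped = apply_rule({"form": "الطالب", "prefix": "ال"})
unchanged = apply_rule({"form": "طالب", "prefix": None})
```

The rule is applied once, in a fixed place in the program, which is the point of the next slide: no working memory and no conflict resolution are involved.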

6
Common Mistake
  • The rule-based approach is not the same as a
    rule-based expert system!
  • Both consist of rules.
  • A rule-based expert system solves the problem
    through a recognize-act cycle
  • Loop
  • Conflict resolution strategy

7
Recognize-Act Cycle
  • loop
  • Match: rules are compared to working memory to
    determine matches; if no rule matches, then stop
  • Conflict resolution: select or enable a single
    rule for execution
  • Execute: fire the selected rule
  • Add new fact, or
  • Learn a new rule
  • end loop
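The cycle above can be sketched as a small loop. This is an illustration only: the rule format (name, condition, action), the "first matching rule wins" conflict resolution strategy, and the toy facts are all assumptions, and the "learn a new rule" branch is left out.

```python
def recognize_act(rules, working_memory, max_cycles=100):
    """Run a recognize-act loop over a set of facts.

    rules: list of (name, condition, action), where condition(memory)
    returns a bool and action(memory) returns a set of new facts.
    """
    memory = set(working_memory)
    fired = []
    for _ in range(max_cycles):
        # Match: keep rules whose condition holds and that add a new fact
        candidates = [
            (name, action) for name, cond, action in rules
            if cond(memory) and not action(memory) <= memory
        ]
        if not candidates:
            break  # no rule matches: stop
        # Conflict resolution (assumed strategy): first rule in listed order
        name, action = candidates[0]
        # Execute: fire the selected rule and add its facts
        memory |= action(memory)
        fired.append(name)
    return memory, fired

rules = [
    ("noun-from-article", lambda m: "has_definite_article" in m, lambda m: {"is_noun"}),
    ("no-tense-for-nouns", lambda m: "is_noun" in m, lambda m: {"no_tense"}),
]
memory, fired = recognize_act(rules, {"has_definite_article"})
```

Here the second rule only becomes applicable after the first one has fired, which is the behaviour that distinguishes an expert-system rule engine from rules called in a fixed order.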

8
NLP Approaches
  • Rule-based
  • Statistical-based

9
NLP Approaches (1)
Rule-based
  • Relies on hand-constructed rules that are to be
    acquired from language specialists
  • requires only small amount of training data
  • development could be very time consuming
Statistical-based
  • developers do not need language specialists
    expertise
  • requires large amount of annotated training data
    (very large corpora)
  • automated
10
NLP Approaches (2)
Rule-based
  • some changes may be hard to accommodate
  • not easy to obtain high coverage of the
    linguistic knowledge
  • useful for limited domain
  • Can be used with both well-formed and ill-formed
    input
  • High quality based on solid linguistic
Statistical-based
  • some changes may require re-annotation of the
    entire training corpus
  • Coverage depends on the training data
  • Not easy to work with ill-formed input as both
    well-formed and ill-formed are still probable
  • Less quality - does not explicitly deal with
    syntax

11
Rule-based Arabic NLP tools
  • Morphological Analyzers
  • Morphological Generators
  • Syntactic Analyzers
  • Syntactic Generators

12
Rule-based Arabic Morphological Analyzer
13
Morphological Analysis
  • Break down the inflected Arabic word into a
    root/stem, affixes, and features.
  • Example: sa- uEty- kumA (سأعطيكما) - I will
    give you (both)…
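As a rough illustration of this breakdown, the sketch below strips one prefix and one suffix from the slide's transliterated example. The two affix tables and their feature labels are assumptions for this example only; a real analyzer needs full affix inventories and a stem lexicon.

```python
# Toy affix tables in the slide's transliteration. The entries and their
# feature labels are assumptions made for this example only.
PREFIXES = {"sa": "future"}
SUFFIXES = {"kumA": "object pronoun, 2nd person dual"}

def analyze(word):
    """Return (prefix, stem, suffix, features) for one transliterated word."""
    features = []
    prefix = suffix = ""
    for p, label in PREFIXES.items():
        if word.startswith(p):
            prefix, word = p, word[len(p):]
            features.append(label)
            break
    for s, label in SUFFIXES.items():
        # Require a non-empty stem to remain after stripping the suffix
        if word.endswith(s) and len(word) > len(s):
            suffix, word = s, word[:-len(s)]
            features.append(label)
            break
    return prefix, word, suffix, features

prefix, stem, suffix, features = analyze("sauEtykumA")
```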

14
Rules - Augmented Transition Network (ATN)
technique
  • Rules associated with arcs represent the
    context-sensitive knowledge about the relation
    between a root and inflections.
  • More than one rule may be associated with one
    arc.
  • Conditions associated with the arcs are placed in
    such a way that the arc to be traversed first is
    the one that leads to the most probable solution.

15
Arabic Morphology using ATN Technique
16
Types of Rules
  • Remove Prefix or Suffix
  • Remove doubled letter
  • Add/change Hamza, Weak letter,…
  • …

17
Analysis of the verb "??????" (I saw you): remove suffixes
ATN diagram (states S0, S1, S2, S3, S10): the arcs last1 and last2 each remove one suffix, reducing ?????? to ????? and then to ????
  • stem "????" (saw)
  • perfect
  • 1st person sg pronoun "?"
  • 2nd person sg pronoun "?"

18
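Analysis of the verb ?????? (they are
playing): remove prefix and suffix
ATN diagram (states S0, S1, S2, S3, S10): the arc Begin2 removes the prefix ? and the arc last2 removes the suffix ??, reducing the word step by step to ???
  • stem "???" (played)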
  • imperfect
  • Plural subject

19
Issues in the morphological analysis
  • Overgeneration (too many outputs)
  • Ambiguity
  • Reconstruction of vowels
  • MultiWord/compound Expressions
  • Out-of-Vocabulary (OOV)
  • Handling ill-formed input
  • Detection (spell checking)
  • Correction- relaxation ? instead of ?
  • Prevent ill-formed output
  • Check the compatibility (the prefix ? cannot
    come after the prefix ? (or ?)).

20
Rule-based Arabic Morphological Generator
21
Morphological generation
  • Synthesis of an inflected Arabic word from a
    given root/stem according to a combination of
    morphological properties that include
  • definiteness (definite article ال),
  • gender (masculine, feminine),
  • number (singular, dual, plural),
  • case (nominative, genitive, accusative,…),
  • person (first, second, third)
  • …

22
Types of Rules
  • synthesis of inflected
  • Noun
  • Verb
  • particle

23
Synthesis of inflected Nouns
  • definite noun
  • feminine noun
  • pluralize noun
  • dual noun
  • attach a prefix preposition
  • attach a suffix pronoun
  • end case
  • ….

24
Synthesis of feminine noun
  • If noun.gender = masculine then attach the
    feminine suffix letter
  • Example
  • زوج (husband) → زوجة (wife)

25
Synthesis of suffix pronoun
  • If pronoun.person = first and pronoun.number =
    singular then attach the first person singular
    suffix pronoun
  • Example
  • زوجة (wife) → زوجتي (my wife)
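The feminine-noun rule and the suffix-pronoun rule can be chained, as in the husband / wife / my wife example. The sketch below assumes a noun is a small dictionary with `form` and `gender`; the function names are hypothetical. The rewrite of the final ة to ت before an attached pronoun is standard Arabic orthography, but it is not stated on the slides.

```python
TA_MARBUTA = "ة"        # feminine suffix letter
TA_MAFTUHA = "ت"        # shape the feminine letter takes before a suffix
FIRST_SG_PRONOUN = "ي"  # first person singular suffix pronoun

def synthesize_feminine(noun):
    # If noun.gender = masculine then attach the feminine suffix letter
    if noun["gender"] == "masculine":
        return {"form": noun["form"] + TA_MARBUTA, "gender": "feminine"}
    return noun

def attach_suffix_pronoun(noun, person, number):
    # If pronoun.person = first and pronoun.number = singular
    # then attach the first person singular suffix pronoun
    if person == "first" and number == "singular":
        form = noun["form"]
        if form.endswith(TA_MARBUTA):
            form = form[:-1] + TA_MAFTUHA
        return {**noun, "form": form + FIRST_SG_PRONOUN}
    return noun  # other persons and numbers are not covered by this sketch

husband = {"form": "زوج", "gender": "masculine"}
wife = synthesize_feminine(husband)
my_wife = attach_suffix_pronoun(wife, "first", "singular")
```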

26
Synthesis of inflected Verbs (very complex: rich
in form and meaning)
  • conjugate a verb with tense
  • conjugate a verb with number
  • conjugate a verb with prefix pronoun
  • conjugate a verb with suffix pronoun
  • ….

27
Rule: synthesize first person plural of
assimilated verbs
  • Input: first person singular past verb
  • Output: inflected verb
  • Example: ???- ???? - ?????
  • If verb.tense = future
  • then remove first weak letter, attach_prefix("??")
  • else if verb.tense = present
  • then remove first weak letter, attach_prefix("?")
  • else attach_suffix(verb.stem, "??")
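A sketch of this rule follows. Because the Arabic strings in the transcript are garbled, the affixes used here (سن for future, ن for present, نا for past) and the sample stem وعد (promised) are assumptions based on standard first person plural morphology, not values read from the slide. Only an initial و is treated as the weak letter to drop.

```python
WEAK_INITIAL = "و"  # assumption: only an initial waw is dropped

def drop_first_weak(stem):
    """Remove the weak first radical of an assimilated verb, if present."""
    return stem[1:] if stem.startswith(WEAK_INITIAL) else stem

def first_person_plural(stem, tense):
    """Inflect an assimilated verb stem for first person plural."""
    if tense == "future":
        return "سن" + drop_first_weak(stem)
    if tense == "present":
        return "ن" + drop_first_weak(stem)
    # Past tense: keep the stem intact and attach the suffix
    return stem + "نا"

forms = {t: first_person_plural("وعد", t) for t in ("past", "present", "future")}
```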

28
Issues in the morphological generation
  • MultiWord/compound Expressions
  • Out-of-Vocabulary (OOV)
  • Some forms need special handling
  • Substitution This man ??? ?????
  • literal numbers (complex nouns)
  • Arabic script
  • ? ?? ? ???
  • ????? ? ? ?????? ? ??????
  • ???? ? ??????

29
Rule-based Arabic Syntactic Analyzer
30
Types of Rules
  • Grammatical rules
  • Describe sentence and phrase structures, and
    ensure the agreement relations between various
    elements in the sentence.
  • Parsing
  • Accepts the input and generates the sentence
    structure (parse tree)

31
Parsing of the sentence ??????? ?????? (The
student (sg,f) is diligent (sg,f))
  • Word level: noun (definite, fem, sg) followed by
    noun (indefinite, fem, sg)
  • Phrase level: inchoative (definite, fem, sg)
    followed by enunciative (indefinite, fem, sg),
    together forming a nominal sentence
  • Agreement
  • Number
  • Gender

Nominal sentence -> definite_inchoative(Number, Gender) indefinite_enunciative(Number, Gender)
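The nominal sentence rule can be read as a unification check: both constituents must carry the same Number and Gender values, the inchoative must be definite, and the enunciative indefinite. The sketch below makes that check explicit. The `Word` class, the tuple-shaped parse tree, and the Arabic sample forms are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Word:
    form: str
    definite: bool
    number: str
    gender: str

def parse_nominal_sentence(inchoative, enunciative):
    """Return a parse tree if the two nouns form a nominal sentence, else None."""
    # definite_inchoative followed by indefinite_enunciative
    if not inchoative.definite or enunciative.definite:
        return None
    # Shared Number and Gender variables in the rule mean both must agree
    if (inchoative.number, inchoative.gender) != (enunciative.number, enunciative.gender):
        return None
    return ("nominal_sentence",
            ("inchoative", inchoative.form),
            ("enunciative", enunciative.form))

student = Word("الطالبة", definite=True, number="sg", gender="f")
diligent_f = Word("مجتهدة", definite=False, number="sg", gender="f")
diligent_m = Word("مجتهد", definite=False, number="sg", gender="m")
tree = parse_nominal_sentence(student, diligent_f)
rejected = parse_nominal_sentence(student, diligent_m)
```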
32
Issues in the syntactic analysis
  • Ambiguity (more than one parse tree)
  • Disambiguation techniques
  • Handling ill-formed input
  • Detection (grammar checking)
  • Recovery (partial parsing: parses chunks to
    be related)

33
Rule-based Arabic Syntactic Generator
34
Types of Rules
  • Determine phrase structures
  • Determine syntactic structure
  • Ensure the agreement relations between various
    elements in the sentence.

35
Rule: verb-subject agreement
  • Input: verb and inflected subject (a pre-verbal
    NP)
  • Output: inflected verb agreeing with its
    inflected subject
  • synthesize_verb(Subject.number,verb.stem)
  • synthesize_verb(Subject.gender,verb.stem)
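One way to picture the two `synthesize_verb` calls is a single lookup keyed by the subject's number and gender. The sketch below is an assumption-laden illustration: it covers only third person perfect forms of a sound verb, uses the stem كتب (wrote) as a sample, and merges the slide's two calls into one function with a different argument list.

```python
# Third person perfect suffixes for a sound verb, keyed by (number, gender).
# The table is an assumption for this sketch; weak and hollow verbs need
# stem changes that are not modelled here.
PERFECT_SUFFIXES = {
    ("sg", "m"): "",
    ("sg", "f"): "ت",
    ("dual", "m"): "ا",
    ("dual", "f"): "تا",
    ("pl", "m"): "وا",
    ("pl", "f"): "ن",
}

def synthesize_verb(subject, verb_stem):
    """Inflect the verb so that it fully agrees with a pre-verbal subject."""
    return verb_stem + PERFECT_SUFFIXES[(subject["number"], subject["gender"])]

boys = {"form": "الأولاد", "number": "pl", "gender": "m"}
girl = {"form": "البنت", "number": "sg", "gender": "f"}
verb_for_boys = synthesize_verb(boys, "كتب")
verb_for_girl = synthesize_verb(girl, "كتب")
```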

36
An agreement example
  • ??????? ????? ??? ????? ?????
  • the-boys visited-they five museum old
  • The boys visited five old museums

37
Issues in the syntactic generation
  • Word order (VSO,SVO, etc.)
  • Agreement (full/partial)
  • dropping the subject pronoun (called Pro-drop),
    i.e., to have a null subject, when the inflected
    verb includes subject affixes.
  • Syntax that captures the source/intended meaning
  • My son is 8 ???? ???? ????? ?????
  • I did not understand the last sentence ??? ??
    ???? ?????? ???????

38
Rule-based Arabic NLP applications
  • Named Entity Recognition
  • Machine translation
  • Transferring Egyptian Colloquial Dialect into
    Modern Standard Arabic

39
What is entity recognition?
  • Identifying, extracting, and normalizing entities
    from documents such as names of people,
    locations, or companies.
  • Makes unstructured data more structured

40
Politics of Ukraine In July 1994, Leonid Kuchma
was elected as Ukraine's second president in free
and fair elections. Kuchma was reelected in
November 1999 to another five-year term, with 56
percent of the vote. International observers
criticized aspects of the election, especially
slanted media coverage; however, the outcome of
the vote was not called into question. In March
2002, Ukraine held its most recent parliamentary
elections, which were characterized by the
Organization for Security and Cooperation in
Europe (OSCE) as flawed, but an improvement over
the 1998 elections. The pro-presidential For a
United Ukraine bloc won the largest number of
seats, followed by the reformist Our Ukraine bloc
of former Prime Minister Viktor Yushchenko, and
the Communist Party. There are 450 seats in
parliament, with half chosen from party lists by
proportional vote and half from individual
constituencies.
Entity Extractor: Person, Date, Location
41
Person Entity Recognition (1)
  • Example ????? ??????? ??? ???? ?????? The
    Jordanian king Abdullah II
  • We want to have a rule that recognizes a person
    name composed of a first name followed by
    optional last names, based on a preceding person
    indicator pattern.

42
Person Entity Recognition (2)
  • The Rule component of this example
  • Name Entity ??? ???? Abdullah
  • indicator pattern
  • an honorific such as "?????" The king
  • Nasab (optional) inflected from a location name
    "???????" Jordanian.
  • The rule also matches an optional ordinal number
    appearing at the end of some names such as
    "??????" II.

43
Person Entity Recognition (3)
((honorific(location(???))?) first_Name(last_Name)?(number)?)
  • This (Regular Expression) rule can recognize
  • ????? ??? ????
  • ????? ??????? ??? ????
  • ????? ??????? ??? ???? ??????
  • ?????? ???????? ?????
  • …
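The person rule can be approximated with Python's `re` module. This is a sketch under stated assumptions: the word lists stand in for the real gazetteers, their entries are illustrative guesses (the transcript's Arabic is garbled), and the Nasab is matched from a list rather than derived from a location name.

```python
import re

# Illustrative gazetteer entries (assumptions, not the system's word lists)
HONORIFICS = ["الملك", "الرئيس"]
NASAB = ["الأردني", "المصري"]
FIRST_NAMES = ["عبد الله", "حسين"]
LAST_NAMES = ["مبارك"]
ORDINALS = ["الثاني", "الأول"]

def alt(words):
    """Build a non-capturing alternation from a word list."""
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"

# honorific (nasab)? first_name (last_name)? (number)?
PERSON_RULE = re.compile(
    rf"(?P<indicator>{alt(HONORIFICS)}(?:\s+{alt(NASAB)})?)\s+"
    rf"(?P<name>{alt(FIRST_NAMES)}(?:\s+{alt(LAST_NAMES)})?(?:\s+{alt(ORDINALS)})?)"
)

def find_persons(text):
    """Return the name part of every match; the indicator is context only."""
    return [m.group("name") for m in PERSON_RULE.finditer(text)]

full = find_persons("زار الملك الأردني عبد الله الثاني القاهرة")
short = find_persons("الملك عبد الله")
```

Keeping the indicator in its own group lets the rule use the honorific as evidence while returning only the name itself.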

44
Issues in the Arabic NER
  • Complex Morphological System (inflections)
  • Non-casing language (No initial capital for
    proper nouns)
  • Non-standardization and inconsistency in Arabic
    written text (typos, and spelling variants)
  • Ambiguity

45
Machine Translation
  • Direct
  • Transfer
  • Interlingua

46
MT Approaches: MT Pyramid
Pyramid diagram: analysis ascends from source word to source syntax; generation descends from target syntax to target word
47
English-to-Arabic Transfer based Approach
  • Source sentence (English)
  • Sentence analysis (morphological and syntactic
    analysis rules of English, English dictionary)
    produces the English parse tree
  • Transfer (English-to-Arabic transformation
    rules, bilingual dictionary) produces the Arabic
    parse tree
  • Sentence synthesis (morphological generation and
    synthesis rules of Arabic, Arabic dictionary)
    produces the target sentence (Arabic)
48
Transfer approach
  • Involves analysis, transfer, and generation
    components
  • If you have an Arabic parser and an Arabic
    syntactic generator, all you need is to acquire
    the transfer rules and build the transfer component

49
Simple Transfer
  • (1) $w_1, w_2, \dots, w_k \rightarrow w_k, w_{k-1}, \dots, w_1$,
    i.e. the word at position $i$ moves to position
    $k - i + 1$ ($1 \le i \le k$)

50
Networks performance evaluation → (transfer) → ????? ???? ????
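Reading the simple transfer rule as a reversal of word order (an interpretation, since the formula is damaged in the transcript), a sketch looks like this. The lexicon values are placeholders standing in for bilingual dictionary lookups, not real translations.

```python
def simple_transfer(source_words, lexicon):
    """Translate word by word, then emit the words in reverse order."""
    return [lexicon[w] for w in reversed(source_words)]

# Placeholder target words; a real system would consult the bilingual dictionary
lexicon = {
    "networks": "ar:networks",
    "performance": "ar:performance",
    "evaluation": "ar:evaluation",
}
target = simple_transfer(["networks", "performance", "evaluation"], lexicon)
```

The head noun "evaluation" ends up first, which matches the head-initial order of Arabic noun compounds.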
51
Issues in the Transfer-based MT approach
  • Synonyms of a word
  • Acquisition ? ?????? or ???????.
  • Agreement
  • intelligent tutoring systems ? ??? ???????
    ?????? or ??? ??????? ?????
  • Problems with prepositions
  • did you do fungal analysis? ?
  • ?? ??? ??????? ??????
  • …

52
Interlingua MT: Multilingual translation
  • Interlingua: semantic representation
  • Deep analysis
  • No need for a transfer component
  • Only analysis and generation components
  • Add an Arabic analyzer to translate to other
    languages
  • Add an Arabic generator to translate from other
    languages

53
Analysis of Arabic to Interlingua
?????? ??? ???? ?? ??? ???? ?? ??????
Parse Tree
Interlingua (IF): c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote, identifiability=yes), disposition=(desire, who=i))
54
Generating Arabic from Interlingua
Interlingua (IF): c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote, identifiability=yes), disposition=(desire, who=i))
?????? ??? ???? ?? ??? ???? ?? ??????
55
Issues in the interlingua approach
  • Interlingua
  • language-neutral representation
  • captures the intended meaning of the source
    sentence
  • Requires a fully-disambiguating parser

56
Transferring Egyptian Colloquial Dialect into
Modern Standard Arabic
  • Be able to reuse MSA processing tools with
    colloquial Arabic by transferring colloquial
    Arabic words into their corresponding MSA words.
  • Facilitate the communication with colloquial
    Arabic speakers
  • Restore the Arabic dialect to the standard
    language in use nowadays.

57
A one-to-one transfer example
????? → (mapping) → ???? (when?)
58
A one-to-many transfer example
??? (on-the) → (mapping) → ??? (on) + ?? (the)
59
A complete sentence example
??? ????? (You-came when?)
  • Step (1): mapping
  • ??? → ???
  • ???? → ???
  • Result after mapping: ??? ????
  • Step (2): reordering
  • the new segment position for the word ???? is
    start of sentence (SoS)
  • Result after reordering: ??? ???? (When
    did-you-come?)
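The two steps, mapping and then reordering, can be sketched as below. The lexicon entries are illustrative guesses at Egyptian colloquial and MSA forms (the transcript's Arabic is garbled), and the single "move to start of sentence" position rule is an assumption covering only the example's question word.

```python
# One-to-one and one-to-many mappings; entries are illustrative assumptions.
MAPPING = {
    "جيت": ["جئت"],         # you-came
    "امتى": ["متى"],        # when
    "عال": ["على", "ال"],   # on-the -> on + the
}
# MSA words whose new segment position is start of sentence (SoS)
MOVE_TO_SOS = {"متى"}

def to_msa(colloquial_words):
    # Step 1: map each colloquial word to its MSA word(s)
    mapped = []
    for word in colloquial_words:
        mapped.extend(MAPPING.get(word, [word]))
    # Step 2: reorder, placing SoS words first and keeping the rest in order
    front = [w for w in mapped if w in MOVE_TO_SOS]
    rest = [w for w in mapped if w not in MOVE_TO_SOS]
    return front + rest

msa = to_msa(["جيت", "امتى"])
split = to_msa(["عال"])
```

Words missing from the mapping pass through unchanged, so the output can still be handed to MSA tools even when lexicon coverage is partial.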
60
Issues in the transfer to MSA
  • More investigations are needed

61
Arabic NLP Free Resources
  • Arabic NLP Free Resources

62
Arabic Morphological Analyzers
  • Tim Buckwalter Morphological Analyzer
  • http://www.qamus.org/
  • http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogIdLDC2002L49
  • Xerox
  • http://www.cis.upenn.edu/cis639/arabic/input/keyboard_input.html

63
Arabic Morphological Analyzers
  • Aramorph
  • http://www.nongnu.org/aramorph/english/index.html

64
Arabic spell checker
  • Aspell
  • http://aspell.net/
  • http://www.freshports.org/arabic/aspell

65
Arabic Morphological Generation
  • Sarf
  • http://sourceforge.net/projects/sarf

66
Tokenization and POS tagging
  • ArabicSVMTools: the tools utilize the Yamcha SVM
    tools to tokenize, POS tag and Base Phrase Chunk
    Arabic text
  • http://www1.cs.columbia.edu/mdiab/
  • http://www1.cs.columbia.edu/mdiab/software/AMIRA-1.0.tar.gz

67
Tokenization and POS tagging
  • MADA: a full morphological tagger for Modern
    Standard Arabic.
  • http://www1.cs.columbia.edu/rambow/software-downloads/MADA_Distribution.html

68
POS tagging
  • Stanford Log-linear Part-Of-Speech Tagger
  • http://nlp.stanford.edu/software/tagger.shtml
  • http://nlp.stanford.edu/software/stanford-arabic-tagger-2008-09-28.tar.gz

69
Tokenization and POS tagging
  • Attia's Finite State Tools for Modern Standard
    Arabic
  • http://www.attiaspace.com/getrec.asp?rechtmFiles/fsttools

70
Arabic Parsers
  • Dan Bikel's Parser
  • http://www.cis.upenn.edu/dbikel/
  • http://www.cis.upenn.edu/dbikel/software.html
  • Attia Arabic Parser
  • http://www.attiaspace.com/
  • http://decentius.aksis.uib.no/logon/xle.xml

71
Arabic wordnet
  • Arabic WordNet
  • http://www.globalwordnet.org/AWN/
  • http://personalpages.manchester.ac.uk/staff/paul.thompson/AWNBrowser.zip

72
Translation resources
  • Tools: GIZA, MOSES, Pharaoh, Rewrite and BLEU
  • http://www.statmt.org/
  • APIs
  • http://code.google.com/apis/ajax/playground/translate
  • http://code.google.com/apis/ajax/playground/batch_translate

73
Transliterate
  • Transliterate
  • http://code.google.com/apis/ajax/playground/transliterate_arabic

74
Mailing lists: just to be connected to the NLP
community
  • corpora_at_uib.no
  • http://mailman.uib.no/listinfo/corpora
  • linguist_at_LINGUISTLIST.ORG
  • http://www.linguistlist.org/
  • semitic_at_cs.haifa.ac.il
  • http://www.semitic.tk/
  • caasl-list_at_arabicscript.org
  • http://www.arabicscript.org/CAASL3/index.html

75
Conclusion (1)
  • Arabic requires the treatment of the language
    constituents at all levels: morphology, syntax,
    and semantics.
  • Most research in Arabic NLP concentrates on the
    analysis side, aiming at automated understanding
    of the Arabic language.

76
Conclusion (2)
  • Arabic NLP in general is significantly
    underdeveloped.
  • In order to bridge this gap and help Arabic NLP
    research catch up with the many recent
    advances for Latin languages, we need
    collaborative efforts from the Arabic research
    community.

77
Conclusion (3)
  • We need public-domain resources (in electronic
    form):
  • Linguistic resources such as large Arabic
    (bilingual) corpora and treebanks
  • Machine-readable (bilingual) dictionaries
  • Morphological analyzers
  • Parsers
  • …

78
Conclusion (4)
  • We need to secure funding for
  • Exchanging visits (experience Expert Network)
  • Buying software
  • Securing dedicated RAs and/or PhD students for
    the NLP task.

83
Thank you! Merci!
  • Shukran!
  • ????