Corpus Linguistics for Understanding the Quran

1
Corpus Linguistics for Understanding the Quran
  • Eric Atwell, Kais Dukes, Nora Abbas, Abdul-Baquee
    Muhammad
  • I-AIBS Institute for Artificial Intelligence and
    Biological Systems
  • School of Computing
  • University of Leeds

2
The Challenge: An interdisciplinary approach to
understanding the Quran
3
(1) What is the Quran?
The last in a series of 5 religious texts:
Holy Book                  Prophet           Text Dated
Suhuf Ibrahim (Scrolls)    Abraham           ?
The Tawrat (Torah)         Moses             1500 BCE?
The Zabur (Psalms)         David             1000 BCE?
The Injil (Gospel)         Jesus             1 CE
The Quran                  Muhammad (PBUH)   610-632 CE
4
(1) What is the Quran?
The central religious text of Islam
  • Classical Arabic
  • Islamic Law (legal logic)
  • Divine guidance and direction
  • Scientific and philosophical knowledge
  • Has inspired many scientific achievements, e.g.
    algebra and linguistics

5
(2) Traditional Arabic Linguistics
Originated with Arabs studying the language of the
Quran (detailed analysis for at least 1000 years)
  • Orthography (diacritics and vowelization)
  • Etymology (Semitic roots)
  • Morphology (derivation and inflection)
  • Syntax (origins of dependency grammar)
  • Discourse Analysis and Rhetoric
  • Semantics and Pragmatics
6
(3) Computational Linguistics
Where are we now?
  • Current use of computing to analyze the Quran is
    mostly:
  • Keyword search (useful)
  • Frequency analysis (numerology?)

7
(3) Computational Linguistics
  • How far can we go?
  • Is an artificial intelligence system realistic?
  • Example: question-answering dialog system
  • Question: How long should I breastfeed my child for?
  • Answer: Mothers should suckle their offspring for
    two years, if the father wishes to complete the
    term (The Holy Quran, Verse 2:233).

8
An AI approach to understanding the Quran
Central Hypothesis: Augmenting the text of the
Quran with rich annotation will lead to a more
accurate AI system.
  • Prepare the data by annotating the Quran.
  • Use the data to build an AI system for concept
    search and question-answering.
9
Annotating the Quran
Challenges:
  • Orthography - Complex script verified in Unicode?
  • Morphology - Arabic is highly inflected, which is
    challenging to model by computer
  • Syntax - Phrase structure or dependency grammar?
  • Semantics - Lexical semantics, ontology, logic,
    lexical frames?
10
Annotating the Quran
Solutions:
  • Recent computational advances have made it
    possible to annotate the Quran to very high
    accuracy
  • Community effort using volunteers
  • Leverage existing resources from Traditional
    Arabic Grammar
  • Automatic annotation followed by manual
    verification
11
Recent Advances: Orthography
Does an accurate digital copy of the Quran exist?
  • Encoding Issues
  • Missing diacritics
  • Simplified script (not Uthmani)
  • Windows code page 1256, not Unicode
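
The encoding issues listed above can be illustrated with a small sketch. It assumes Python's standard `cp1256` codec and the `unicodedata` module; the sample word is only an illustration (two letters with two diacritics), not a quotation from any particular digital copy. It shows how legacy bytes are converted to Unicode and how a copy with "missing diacritics" differs from the vowelized form.

```python
import unicodedata

def decode_cp1256(raw: bytes) -> str:
    """Decode legacy Windows code page 1256 bytes into a Unicode string."""
    return raw.decode("cp1256")

def strip_diacritics(text: str) -> str:
    """Drop combining marks (Arabic diacritics), keeping the base letters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

# Illustrative vowelized word: RA + FATHA, BEH + SHADDA
vowelized = "\u0631\u064E\u0628\u0651"

# Simulate a legacy file by encoding to cp1256, then recover Unicode.
legacy_bytes = vowelized.encode("cp1256")
roundtrip = decode_cp1256(legacy_bytes)

# A copy with "missing diacritics" keeps only the base letters.
bare = strip_diacritics(vowelized)

print(len(legacy_bytes), roundtrip == vowelized, len(bare))
```

A keyword search that compares the bare and vowelized forms directly would not match, which is one reason a single verified Unicode text matters.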

Google Search for verse (68:38) on Jan 21, 2008
shows many typos
12
Recent Advances: Orthography
  • Tanzil Project (http://tanzil.info)
  • Stable version released May 2008
  • Uses Unicode XML encoding, including the special
    characters designed for the complex Arabic script
    of the Quran
  • Manually verified to 100% accuracy by a group of
    experts who have memorized the entire text of the
    Quran

13
Recent Advances: Orthography
  • Java Quran API (http://jqurantree.org)
  • March 2009
  • Java classes for querying the Tanzil XML of the
    Quran
  • First step towards software package for
    analyzing the Quran
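
As a rough illustration of what "querying the XML of the Quran" involves, here is a minimal sketch in Python using the standard library. It is not the Java API named above. The element and attribute names (`sura`, `aya`, `index`, `text`) are an assumption about the file layout, and the verse texts are placeholders rather than real content.

```python
import xml.etree.ElementTree as ET

# Placeholder document; the layout (sura/aya elements with index and
# text attributes) is assumed, and the texts are not real verses.
SAMPLE = """<quran>
  <sura index="1" name="placeholder chapter name">
    <aya index="1" text="verse one text"/>
    <aya index="2" text="verse two text"/>
  </sura>
</quran>"""

def get_verse(root, chapter, verse):
    """Return the text of (chapter, verse), or None if it is not present."""
    for sura in root.findall("sura"):
        if int(sura.get("index")) != chapter:
            continue
        for aya in sura.findall("aya"):
            if int(aya.get("index")) == verse:
                return aya.get("text")
    return None

root = ET.fromstring(SAMPLE)
print(get_verse(root, 1, 2))
```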

14
Recent Advances: Morphology
  • Buckwalter Arabic Morphological Analyzer (2002)
  • Morphological Analysis of the Quran at the
    University of Haifa, Israel (2004)
  • Lexeme and feature based morphological
    representation of Arabic (Nizar Habash, 2006)

15
The Haifa Corpus (2004)
  • Multiple analyses for each word (up to 5)
  • rbbfalNounTriptoticMascSgPronDependent1P
    Sg
  • rbbfalNounTriptoticMascSgGen
  • Not a manually verified corpus
  • Authors report an F-measure of 86%
  • Non-standard annotation scheme not familiar to
    traditional Arabic linguists (e.g. extracting a
    list of all verbs in the corpus is non-trivial)
  • Arabic text is encoded only phonetically instead
    of in the original Arabic script, so searching
    for the possible morphological analyses of a
    specific word is not easy
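
For readers unfamiliar with the F-measure quoted above, the following sketch shows the usual definition as a weighted harmonic mean of precision and recall. The precision and recall values in the example are made up for illustration; the slides do not report them.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative numbers only, not taken from any reported evaluation.
example = f_measure(0.9, 0.8)
print(round(example, 3))
```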

16
The Quranic Arabic Corpus
  • http://corpus.quran.com
  • Manually verified (99% accuracy)
  • Popular website with very positive feedback
  • Million(s) of visitors

1. Initial tagging using the Buckwalter Analyzer
2. Paid annotator working for 3 months
3. Community of volunteers verifying against existing
   books of Traditional Arabic Grammar which analyse
   the Quran

Shows Arabic and English morphological analysis
side-by-side, with phonetic transcription, search
and translation.
17
The Quranic Arabic Corpus http://corpus.quran.com/
  • Kais Dukes: Arabic Language Computing Applied to
    the Quran, PhD (part-time)
  • an open-source online focus for linguistic
    research on Classical Arabic
  • morphology - each word shows colour-coded
    morphological analysis
  • syntax - each verse shows a dependency parse
    following Arabic tradition
  • semantics - entities and concepts are linked
    to an ontology
  • translation - word-for-word English
    translations to aid understanding
  • Machine Learning - annotations provide training
    data for a parser
  • Impact on society - dozens of researchers
    collaborated on the analysis, and over a million
    visitors have used the website this year.

18
The Quranic Arabic Corpus: Part-of-speech Tagging
  • Part-of-speech tags adapted from Traditional
    Arabic Grammar, and mapped to English equivalents
    (not the other way around)
  • These tags apply to words in the Quran, as well
    as to individual morphological segments in the
    text

Part-of-speech Tag   Name
N                    Noun
PN                   Proper noun
PRON                 Personal pronoun
DEM                  Demonstrative pronoun
REL                  Relative pronoun
ADJ                  Adjective
V                    Verb
P                    Preposition
PART                 Particle
INTG                 Interrogative particle
VOC                  Vocative particle
NEG                  Negative particle
FUT                  Future particle
CONJ                 Conjunction
NUM                  Number
T                    Time adverb
LOC                  Location adverb
EMPH                 Emphatic lam prefix
PRP                  Purpose lam prefix
IMPV                 Imperative lam prefix
INL                  Quranic initials
[Arabic Name column: Arabic script not preserved in this transcript]
19
The Quranic Arabic Corpus: Verified Uthmani
Script
  • Unicode Uthmani Script
  • Sourced from the verified Tanzil project

20
The Quranic Arabic Corpus: Phonetics
(faja'alnahumu)
  • Phonetic transcription generated algorithmically
  • Guided by Arabic vowelized diacritics

21
The Quranic Arabic Corpus: Interlinear
translation
  • Word-for-word translation from accepted sources
  • Interlinear translation scheme

22
The Quranic Arabic Corpus: Location Reference
(21:70:4)
  • Common standard for verses (Chapter:Verse)
  • Extended in the QAC corpus to include word
    numbers and segment numbers, e.g. (21:70:4:2)
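
A location reference of this kind is easy to handle in code. The sketch below assumes the components are written as colon-separated integers in the order chapter, verse, word, segment, with the last two optional; the `Location` type and `parse_location` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional

class Location(NamedTuple):
    chapter: int
    verse: int
    word: Optional[int] = None      # word number within the verse
    segment: Optional[int] = None   # morphological segment within the word

def parse_location(ref: str) -> Location:
    """Parse '(chapter:verse[:word[:segment]])' into a Location."""
    parts = ref.strip().strip("()").split(":")
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2 to 4 components, got {len(parts)}")
    return Location(*(int(p) for p in parts))

print(parse_location("(21:70:4:2)"))
print(parse_location("2:233"))
```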

23
The Quranic Arabic Corpus: Morphological
Segmentation
  • Division of a single word into multiple segments
  • Part-of-speech tag assigned to each segment
  • Traditional Arabic Grammar rules used for
    division

24
The Quranic Arabic Corpus: Morphological segment
features
25
The Quranic Arabic Corpus: Arabic Grammar Summary
26
The Quranic Arabic Treebank: Syntactic Annotation
  • Dependency Grammar based on إعراب (i'rab)
  • Syntactico-semantic roles for each word

27
The Quranic Arabic Treebank: What's new about this
research?
  • First Treebank of Classical Arabic
  • Free Treebank of the Quran
  • Well-defined formal representation of
    Traditional Arabic Grammar using hybrid
    constituency/dependency graphs

28
Automatic Annotation: Classical Arabic Dependency
Parser
  • Joakim Nivre (2009): dependency parsing using a
    shift/reduce queue/stack architecture with
    machine learning
  • Following a similar architecture, but with
    hand-written rules, the custom parser has an
    F-measure of 77.2%
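
To make the shift/reduce queue/stack idea concrete, here is a toy sketch of a transition-based dependency parser driven by hand-written rules. It is not the authors' parser or Nivre's system: the three actions follow the common arc-standard scheme, and the rules in `toy_rule` are invented purely for illustration, using part-of-speech tags V, N and P on placeholder words.

```python
from collections import deque

def parse(tokens, rule):
    """Run a shift/reduce loop; return dependency arcs as (head, dependent)."""
    stack = []                          # indices of partially processed words
    queue = deque(range(len(tokens)))   # indices of words not yet read
    arcs = []
    while queue or len(stack) > 1:
        action = rule(stack, queue, tokens)
        if action == "SHIFT" and queue:
            stack.append(queue.popleft())
        elif action == "LEFT" and len(stack) >= 2:
            dep = stack.pop(-2)         # second item depends on the top item
            arcs.append((stack[-1], dep))
        elif action == "RIGHT" and len(stack) >= 2:
            dep = stack.pop()           # top item depends on the item below
            arcs.append((stack[-1], dep))
        else:
            raise ValueError(f"cannot apply {action!r}")
    return arcs

def toy_rule(stack, queue, tokens):
    """Invented rules: nouns attach to a preceding verb or preposition."""
    if len(stack) >= 2:
        below, top = tokens[stack[-2]][1], tokens[stack[-1]][1]
        if top == "N" and below in ("V", "P"):
            return "RIGHT"
        if top == "P" and below == "V" and not queue:
            return "RIGHT"
    return "SHIFT" if queue else "RIGHT"

# Placeholder words with tags: verb, noun, preposition, noun
tokens = [("w1", "V"), ("w2", "N"), ("w3", "P"), ("w4", "N")]
arcs = parse(tokens, toy_rule)
print(arcs)
```

In a machine-learning parser the `rule` function would be replaced by a trained classifier that predicts the next action; the loop itself stays the same.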

29
Quran Search for a Concept Tool
  • Nora Abbas developed the first Quran "search for
    a concept" tool and website, Qurany
  • Noorhan Abbas. Qurany: A Tool to Search for
    Concepts in the Quran (PDF). MSc by Research
    Thesis, School of Computing, Leeds University,
    2009

30
Quran Search for a Concept Tools
  • The SearchTruth tool [48]
  • Search Truth: http://www.searchtruth.com/
  • The Holy Quran Viewer tool [34]
  • Holy Quran Viewer: http://www.2muslims.com/directory/Detailed/223253.shtml
  • The University of Southern California tool [49]
  • MSA-USC Quran Database: http://www.usc.edu/dept/MSA/reference/searchquran.html
  • What do the available Quran tools on the net
    provide?
  • What is the main problem with these tools?
  • What about the Recall value of their results?
  • What is the main reason for these poor results?

31
Quran Search for a Concept Tool
  • What is a CONCEPT?
  • NOT just a keyword
  • index term in a textbook?

32
Quran Search for a Concept Tool
  • General/Abstract Concepts
  • Women's financial status
  • Main pillars of Islam
  • Characteristics of Paradise
  • Concrete Concepts
  • Names of places
  • (Makkah, Mecca, Meccah)
  • Names of prophets, angels, etc.
  • (Musa, Moses)
  • Names of Holy Books
  • (The Book (Bible), Bible, New Testament)

33
Quran Search for a Concept Tool
  • What does my tool look like?
[Screenshot with numbered callouts 1-6]
34
Quran Search for a Concept Tool
  • Handling the Concrete Concepts
  • Eight Parallel English Translations
  • Search for one English word or a group of words
    in one search request
  • Search for one Arabic word or a group of words in
    one search request
  • Search for a mixed list of Arabic and English
    words in one search request
  • Offers a list of synonyms for the English words
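
The handling of concrete concepts described above can be pictured as keyword search with synonym expansion. The sketch below is an assumption about how such a lookup might work, not the tool's actual implementation: the synonym groups reuse the spelling variants listed on an earlier slide, and the verse texts are placeholders rather than real translations.

```python
import re

# Spelling variants treated as one concept (groups taken from the slides).
SYNONYMS = [
    {"makkah", "mecca", "meccah"},
    {"musa", "moses"},
]

def expand(term):
    """Return the term together with any known synonyms, lower-cased."""
    term = term.lower()
    for group in SYNONYMS:
        if term in group:
            return set(group)
    return {term}

def search(verses, query_terms):
    """Return references of verses containing any expanded query term."""
    wanted = set()
    for term in query_terms:
        wanted |= expand(term)
    hits = []
    for ref, text in verses.items():
        words = set(re.findall(r"\w+", text.lower()))
        if words & wanted:
            hits.append(ref)
    return hits

# Placeholder texts standing in for parallel translations.
VERSES = {
    "A": "placeholder text mentioning Musa",
    "B": "placeholder text mentioning Moses",
    "C": "placeholder text about something else",
}
print(search(VERSES, ["Moses"]))
```

A plain keyword search for "Moses" would return only verse B; expanding the query also finds verse A, which is the kind of recall gain the tool is aiming for.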

35
Quran Search for a Concept Tool
  • General/Abstract Concepts
  • Imported from the Mushaf Al Tajweed index of
    topics published by Dar Al-Maarifa in Syria.
  • The tool has 15 main concepts.
  • The tool covers all the concepts in both
    languages Arabic and English.
  • The total number of concepts covered is 1170.
  • For example, to represent
  • Women's financial status
  • Main pillars of Islam
  • Characteristics of Paradise

36
Knowledge representation and text mining of the
Qur'an
  • Abdul-Baquee Muhammad
  • http://www.comp.leeds.ac.uk/scsams/
  • http://www.textminingthequran.com/wiki

37
Qur'anic Applications: Text Mining The Quran
  • Verse similarity: Allows you to see all verses
    that share a certain percent of characters with
    your input verse.
  • Quranic Chapter Relatives: Allows you to see the
    strongest relatives of a given Quran chapter.
  • Word Cloud: See word clouds of a sura or group of
    suras of the Qur'an.
  • Qur'an Concordance: Concordance over lemma.
  • Part-of-Speech Display of Sura: View a sura of
    the Qur'an with color-coded part-of-speech tags.
  • Quranic word co-occurrence: Allows you to enter a
    Quranic term to find its most frequent
    neighbors.
  • N-gram Search: Search up to 5-gram phrases of the
    Quran with a frequency of 5 or more.
  • Pronoun References: Given a verse, see all
    pronoun references within this verse.
  • List of Concepts: See a list of concepts arising
    from pronoun referents in the Quran.
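
As an illustration of the N-gram Search feature, the sketch below counts all word sequences up to a maximum length and keeps those that reach a minimum frequency. The function name, parameters and toy token lists are assumptions made for this example; the slides do not say how the actual tool is implemented.

```python
from collections import Counter

def frequent_ngrams(verses, max_n=5, min_count=5):
    """Count 1..max_n word n-grams within each verse; keep the frequent ones."""
    counts = Counter()
    for tokens in verses:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy data: token lists standing in for tokenized verses.
toy = [["a", "b", "c"]] * 5 + [["a", "b"]]
result = frequent_ngrams(toy, max_n=5, min_count=6)
print(result)
```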

38
AI for understanding the Quran
  • Kais Dukes developed the first online annotated
    linguistic resource which shows the Arabic "i'rab"
    (morphology and grammar) for each word and verse in
    the Holy Quran: the Quranic Arabic Corpus,
    including word-by-word morphology and English
    gloss, and an Ontology of Quranic concepts
  • Nora Abbas developed the first Quran "search for
    a concept" tool and website, Qurany
  • Abdul-Baquee Sharaf developed tools and resources
    for text mining the Quran including verse
    similarity, lemma concordance and collocation,
    and text mining the Hadeeth