Title: Corpus Linguistics for Understanding the Quran
1 Corpus Linguistics for Understanding the Quran
- Eric Atwell, Kais Dukes, Nora Abbas, Abdul-Baquee Muhammad
- I-AIBS Institute for Artificial Intelligence and Biological Systems
- School of Computing
- University of Leeds
2 The Challenge: An interdisciplinary approach to understanding the Quran
3 (1) What is the Quran?
The last in a series of 5 religious texts
Holy Book                Prophet          Text Dated
Suhuf Ibrahim (Scrolls)  Abraham          ?
The Tawrat (Torah)       Moses            1500 BCE?
The Zabur (Psalms)       David            1000 BCE?
The Injil (Gospel)       Jesus            1 CE
The Quran                Muhammad (PBUH)  610-632 CE
4 (1) What is the Quran?
The central religious text of Islam
- Classical Arabic
- Islamic Law (legal logic)
- Divine guidance and direction
- Scientific and philosophical knowledge
- Has inspired many scientific achievements, e.g. algebra and linguistics
5 (2) Traditional Arabic Linguistics
Originated with Arabs studying the language of the Quran (detailed analysis for at least 1000 years)
- Orthography (diacritics and vowelization)
- Etymology (Semitic roots)
- Morphology (derivation and inflection)
- Syntax (origins of dependency grammar)
- Discourse Analysis and Rhetoric
- Semantics and Pragmatics
6 (3) Computational Linguistics
Where are we now?
- Current use of computing to analyze the Quran is mostly:
  - Keyword search (useful)
  - Frequency analysis (numerology?)
7 (3) Computational Linguistics
- How far can we go?
- Is an artificial intelligence system realistic?
- Example: question-answering dialog system
  - Question: How long should I breastfeed my child for?
  - Answer: Mothers should suckle their offspring for two years, if the father wishes to complete the term (The Holy Quran, Verse 2:233).
8 An AI approach to understanding the Quran
Central Hypothesis: Augmenting the text of the Quran with rich annotation will lead to a more accurate AI system.
- Prepare the data by annotating the Quran.
- Use the data to build an AI system for concept search and question-answering.
9 Annotating the Quran
Challenges
- Orthography: complex script verified in Unicode?
- Morphology: Arabic is highly inflected, which is challenging to model by computer
- Syntax: phrase structure or dependency grammar?
- Semantics: lexical semantics, ontology, logic, lexical frames?
10 Annotating the Quran
Solutions
- Recent computational advances have made it possible to annotate the Quran to very high accuracy
- Community effort using volunteers
- Leverage existing resources from Traditional Arabic Grammar
- Automatic annotation followed by manual verification
11 Recent Advances: Orthography
Does an accurate digital copy of the Quran exist?
- Encoding issues
- Missing diacritics
- Simplified script (not Uthmani)
- Windows code page 1256, not Unicode
A Google search for verse (68:38) on Jan 21, 2008 shows many typos
12 Recent Advances: Orthography
- Tanzil Project (http://tanzil.info)
- Stable version released May 2008
- Uses Unicode XML encoding, including the special characters designed for the complex Arabic script of the Quran
- Manually verified to 100% accuracy by a group of experts who have memorized the entire text of the Quran
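As a rough illustration of what querying an XML encoding of the text can look like, here is a minimal Python sketch. The element and attribute names (`quran`, `sura`, `aya`, `index`, `text`) and the placeholder verse strings are assumptions made for this example; they are not taken from the slides and should be checked against the actual Tanzil download.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the tag names, attribute names and
# verse strings below are placeholders, not real Tanzil data.
SAMPLE_XML = """<quran>
  <sura index="1" name="sample-one">
    <aya index="1" text="placeholder verse 1:1"/>
    <aya index="2" text="placeholder verse 1:2"/>
  </sura>
  <sura index="2" name="sample-two">
    <aya index="1" text="placeholder verse 2:1"/>
  </sura>
</quran>"""


def load_verses(xml_text):
    """Return a dict mapping (chapter, verse) to the verse text."""
    root = ET.fromstring(xml_text)
    verses = {}
    for sura in root.findall("sura"):
        chapter = int(sura.get("index"))
        for aya in sura.findall("aya"):
            verses[(chapter, int(aya.get("index")))] = aya.get("text")
    return verses


verses = load_verses(SAMPLE_XML)
print(verses[(1, 2)])
```

Keying the dictionary on (chapter, verse) mirrors the Chapter:Verse convention used for citing verses, so a lookup reads the same way as a citation.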
13 Recent Advances: Orthography
- Java Quran API (http://jqurantree.org)
- March 2009
- Java classes for querying the Tanzil XML of the Quran
- First step towards a software package for analyzing the Quran
14 Recent Advances: Morphology
- Buckwalter Arabic Morphological Analyzer (2002)
- Morphological Analysis of the Quran at the University of Haifa, Israel (2004)
- Lexeme and feature based morphological representation of Arabic (Nizar Habash, 2006)
15 The Haifa Corpus (2004)
- Multiple analyses for each word (up to 5)
  - rbbfalNounTriptoticMascSgPronDependent1P Sg
  - rbbfalNounTriptoticMascSgGen
- Not a manually verified corpus
- The authors report an F-measure of 86%
- Non-standard annotation scheme, not familiar to traditional Arabic linguists (e.g. extracting a list of all verbs in the corpus is non-trivial)
- Arabic text is only encoded phonetically instead of using the original Arabic script, so searching for the possible morphological analyses of a specific word is not easy
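For readers unfamiliar with the F-measure quoted above, the sketch below computes it from counts of correct and incorrect analyses. It assumes the standard definition (the weighted harmonic mean of precision and recall, balanced when `beta` is 1); the slides do not say which variant the Haifa authors used, and the counts in the usage line are invented for illustration.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """F-measure from raw counts; beta=1 gives the balanced F1 score."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted  # share of proposed analyses that are right
    recall = true_positives / actual        # share of correct analyses that were found
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented counts, purely for illustration.
print(round(f_measure(8, 2, 2), 3))
```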
16 The Quranic Arabic Corpus
- http://corpus.quran.com
- Manually verified (99% accuracy)
- Popular website with very positive feedback
- Million(s) of visitors
1. Initial tagging using the Buckwalter Analyzer
2. Paid annotator working for 3 months
3. Community of volunteers verifying against existing books of Traditional Arabic Grammar which analyse the Quran
Shows Arabic and English morphological analysis side-by-side, with phonetic transcription, search and translation.
17 The Quranic Arabic Corpus http://corpus.quran.com/
- Kais Dukes: Arabic Language Computing Applied to the Quran, PhD (part-time)
- An open-source online focus for linguistic research on Classical Arabic
- Morphology: each word shows colour-coded morphological analysis
- Syntax: each verse shows a dependency parse following Arabic tradition
- Semantics: entities and concepts are linked to an ontology
- Translation: word-for-word English translations to aid understanding
- Machine Learning: annotations provide training data for a parser
- Impact on society: dozens of researchers collaborated on the analysis, and over a million visitors have used the website this year.
18 The Quranic Arabic Corpus: Part-of-speech Tagging
- Part-of-speech tags adapted from Traditional Arabic Grammar, and mapped to English equivalents (not the other way around)
- These tags apply to words in the Quran, as well as to individual morphological segments in the text
Tag   Name                     Arabic Name
N     Noun                     ???
PN    Proper noun              ????? ???
PRON  Personal pronoun         ????
DEM   Demonstrative pronoun    ??? ?????
REL   Relative pronoun         ??? ?????
ADJ   Adjective                ???
V     Verb                     ???
P     Preposition              ??? ??
PART  Particle                 ???
INTG  Interrogative particle   ??? ???????
VOC   Vocative particle        ??? ????
NEG   Negative particle        ??? ???
FUT   Future particle          ??? ???????
CONJ  Conjunction              ??? ???
NUM   Number                   ???
T     Time adverb              ??? ????
LOC   Location adverb          ??? ????
EMPH  Emphatic lam prefix      ??? ???????
PRP   Purpose lam prefix       ??? ???????
IMPV  Imperative lam prefix    ??? ?????
INL   Quranic initials         ???? ?????
19 The Quranic Arabic Corpus: Verified Uthmani Script
- Unicode Uthmani Script
- Sourced from the verified Tanzil project
20 The Quranic Arabic Corpus: Phonetics (faja'alnahumu)
- Phonetic transcription generated algorithmically
- Guided by Arabic vowelized diacritics
21 The Quranic Arabic Corpus: Interlinear translation
- Word-for-word translation from accepted sources
- Interlinear translation scheme
22 The Quranic Arabic Corpus: Location Reference (21:70:4)
- Common standard for verses (Chapter:Verse)
- Extended in the QAC corpus to include word numbers and segment numbers, e.g. (21:70:4:2)
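A location reference of this kind is easy to handle programmatically. The following minimal sketch assumes the components are separated by colons, as in (21:70:4:2), and that the word and segment numbers are optional; the `Location` type and `parse_location` function are hypothetical names introduced for this example.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    """Chapter:Verse reference, optionally extended with word and segment."""
    chapter: int
    verse: int
    word: Optional[int] = None
    segment: Optional[int] = None


def parse_location(ref):
    """Parse '(21:70:4:2)' or '21:70' into a Location."""
    parts = ref.strip().strip("()").split(":")
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2 to 4 colon-separated numbers: {ref!r}")
    return Location(*(int(part) for part in parts))


print(parse_location("(21:70:4:2)"))
```

Because `Location` is a tuple, references sort naturally in reading order (chapter, then verse, then word, then segment) as long as they have the same number of components.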
23 The Quranic Arabic Corpus: Morphological Segmentation
- Division of a single word into multiple segments
- Part-of-speech tag assigned to each segment
- Traditional Arabic Grammar rules used for division
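One possible way to represent segmented words in code is sketched below. The `Segment` and `Word` classes are hypothetical, and the three-way split and tags given for the example word are illustrative guesses using tags from the tag set above, not the corpus's verified annotation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    form: str  # transliterated form of the segment
    pos: str   # part-of-speech tag, e.g. "V" or "PRON"


@dataclass(frozen=True)
class Word:
    location: str                 # Chapter:Verse:Word reference
    segments: Tuple[Segment, ...]


def segments_with_tag(words, tag):
    """List (location, form) pairs for every segment carrying the given tag."""
    return [(word.location, segment.form)
            for word in words
            for segment in word.segments
            if segment.pos == tag]


# Illustrative segmentation only; the split and the tags are assumptions.
words = [
    Word("21:70:4", (Segment("fa", "CONJ"),
                     Segment("ja'alna", "V"),
                     Segment("humu", "PRON"))),
]
print(segments_with_tag(words, "V"))
```

With one tag per segment, a query such as "list all verbs" becomes a simple filter over segments.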
24 The Quranic Arabic Corpus: Morphological segment features
25 The Quranic Arabic Corpus: Arabic Grammar Summary
26 The Quranic Arabic Treebank: Syntactic Annotation
- Dependency Grammar based on إعراب (i'rab)
- Syntactico-semantic roles for each word
27 The Quranic Arabic Treebank: What's new about this research?
- First Treebank of Classical Arabic
- Free Treebank of the Quran
- Well-defined formal representation of Traditional Arabic Grammar using hybrid constituency/dependency graphs
28 Automatic Annotation: Classical Arabic Dependency Parser
- Joakim Nivre (2009): dependency parsing using a shift/reduce queue/stack architecture with machine learning
- Following a similar architecture, but with hand-written rules, the custom parser has an F-measure of 77.2%
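The general shape of a shift/reduce parser with a stack and a queue can be sketched as follows. This is a generic arc-standard style loop, not the authors' parser: the `toy_rules` function is a made-up stand-in for the decision component (hand-written rules in the custom parser, a learned classifier in Nivre's approach), and the three-token input is invented.

```python
from collections import deque


def parse(tokens, choose_action):
    """Run a shift/reduce loop; return (head_index, dependent_index) arcs."""
    stack = []
    queue = deque(range(len(tokens)))
    arcs = []
    while queue or len(stack) > 1:
        action = choose_action(stack, queue, tokens)
        if action == "SHIFT" and queue:
            stack.append(queue.popleft())
        elif action == "LEFT" and len(stack) >= 2:
            dependent = stack.pop(-2)      # second item depends on the top
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT" and len(stack) >= 2:
            dependent = stack.pop()        # top depends on the item below it
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"action {action!r} is not valid in this state")
    return arcs


def toy_rules(stack, queue, tokens):
    """Hypothetical hand-written rules over part-of-speech tags."""
    if len(stack) >= 2:
        below, top = tokens[stack[-2]][1], tokens[stack[-1]][1]
        next_pos = tokens[queue[0]][1] if queue else None
        if below == "N" and top == "ADJ":
            return "RIGHT"                 # adjective attaches to preceding noun
        if below == "V" and top == "N" and next_pos != "ADJ":
            return "RIGHT"                 # noun attaches to preceding verb
    return "SHIFT" if queue else "RIGHT"


tokens = [("w1", "V"), ("w2", "N"), ("w3", "ADJ")]  # invented input
print(parse(tokens, toy_rules))
```

Separating the loop from `choose_action` is what lets the same architecture run with either rules or a trained model.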
29 Quran Search for a Concept Tool
- Nora Abbas developed the first Quran "search for a concept" tool and website, Qurany
- Noorhan Abbas. Qurany: A Tool to Search for Concepts in the Quran (PDF). MSc by Research Thesis, School of Computing, Leeds University, 2009
30 Quran Search for a Concept Tools
- The SearchTruth tool 48
  - Search Truth http://www.searchtruth.com/
- The Holy Quran Viewer tool 34
  - Holy Quran Viewer http://www.2muslims.com/directory/Detailed/223253.shtml
- The University of Southern California tool 49
  - MSA-USC Quran Database http://www.usc.edu/dept/MSA/reference/searchquran.html
- What do the available Quran tools on the net provide?
- What is the main problem with these tools? What about the Recall value of their results?
- What is the main reason for these poor results?
31 Quran Search for a Concept Tool
- What is a CONCEPT?
- NOT just a keyword
- index term in a textbook?
32 Quran Search for a Concept Tool
- General/Abstract Concepts
  - Women's financial status
  - Main pillars of Islam
  - Characteristics of Paradise
- Concrete Concepts
  - Names of places (Makkah, Mecca, Meccah)
  - Names of prophets, angels, etc. (Musa, Moses)
  - Names of Holy Books (The Book (Bible), Bible, New Testament)
33 Quran Search for a Concept Tool
- What does my tool look like?
[Screenshot of the tool with numbered callouts 1 to 6]
34 Quran Search for a Concept Tool
- Handling the Concrete Concepts
- Eight parallel English translations
- Search for one English word or a group of words in one search request
- Search for one Arabic word or a group of words in one search request
- Search for a mixed list of Arabic and English words in one search request
- Offers a list of synonyms for the English words
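A search over parallel translations with synonym expansion might be sketched as below. This is not Qurany's implementation: the function names are hypothetical, the synonym groups reuse the spelling variants listed in this deck, and the "translations" are made-up placeholder sentences rather than real translation text. Arabic input would also need script normalization, which is not shown.

```python
import re

# Spelling variants taken from the examples in this deck.
SYNONYM_GROUPS = [
    {"musa", "moses"},
    {"makkah", "mecca", "meccah"},
]


def expand(term):
    """Return the term together with any known synonyms (lower-cased)."""
    term = term.lower()
    for group in SYNONYM_GROUPS:
        if term in group:
            return set(group)
    return {term}


def search(translations, terms):
    """Return sorted verse references matched in any translation."""
    wanted = set()
    for term in terms:
        wanted |= expand(term)
    hits = set()
    for verses in translations.values():
        for ref, text in verses.items():
            words = set(re.findall(r"\w+", text.lower()))
            if words & wanted:
                hits.add(ref)
    return sorted(hits)


# Placeholder sentences, NOT real translation text.
TRANSLATIONS = {
    "translation_a": {"x:1": "a sentence mentioning Moses",
                      "x:2": "a sentence about something else"},
    "translation_b": {"x:1": "a sentence mentioning Musa",
                      "x:2": "a sentence naming Mecca"},
}
print(search(TRANSLATIONS, ["Makkah"]))
```

Searching every translation and taking the union of hits raises recall: a verse is found even when only one translator chose the spelling the user typed or one of its synonyms.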
35 Quran Search for a Concept Tool
- General/Abstract Concepts
- The concept index is imported from the Mushaf Al Tajweed index of topics published by Dar Al-Maarifa in Syria.
- The tool has 15 main concepts.
- The tool covers all the concepts in both languages, Arabic and English.
- The total number of concepts covered is 1170.
- For example, to represent:
  - Women's financial status
  - Main pillars of Islam
  - Characteristics of Paradise
36 Knowledge representation and text mining of the Qur'an
- Abdul-Baquee Muhammad
- http://www.comp.leeds.ac.uk/scsams/
- http://www.textminingthequran.com/wiki
37 Qur'anic Applications: Text Mining the Quran
- Verse similarity: see all verses that share a certain percentage of characters with your input verse.
- Quranic Chapter Relatives: see the strongest relatives of a given Quran chapter.
- Word Cloud: see word clouds of a sura or group of suras of the Qur'an.
- Qur'an Concordance: concordance over lemma.
- Part-of-Speech Display of Sura: view a sura of the Qur'an with color-coded part-of-speech tags.
- Quranic word co-occurrence: enter a Quranic term to find its most frequent neighbors.
- N-gram Search: search up to 5-gram phrases of the Quran with a frequency of 5 or more.
- Pronoun References: given a verse, see all pronoun references within this verse.
- List of Concepts: see a list of concepts arising from pronoun referents in the Quran.
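Two of the applications listed above, n-gram search and verse similarity, can be approximated in a few lines. This is only a sketch under stated assumptions: whitespace tokenization stands in for proper Arabic tokenization, `difflib.SequenceMatcher.ratio()` stands in for whatever character-overlap measure the site actually uses, and the input strings are invented tokens rather than Quranic text.

```python
from collections import Counter
from difflib import SequenceMatcher


def frequent_ngrams(verses, max_n=5, min_count=5):
    """Count word n-grams up to max_n; keep those seen at least min_count times."""
    counts = Counter()
    for verse in verses:
        tokens = verse.split()
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[tuple(tokens[start:start + n])] += 1
    return {gram: count for gram, count in counts.items() if count >= min_count}


def similar_verses(query, verses, threshold=0.8):
    """Return verses whose character-level similarity to query meets the threshold."""
    return [verse for verse in verses
            if SequenceMatcher(None, query, verse).ratio() >= threshold]


# Invented tokens, not Quranic text.
verses = ["a b c", "a b d", "x y"]
print(frequent_ngrams(verses, max_n=2, min_count=2))
print(similar_verses("a b c", verses, threshold=0.75))
```

The defaults `max_n=5` and `min_count=5` follow the limits stated for the N-gram Search tool.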
38 AI for understanding the Quran
- Kais Dukes developed the first online annotated linguistic resource which shows the Arabic "i'rab" morphology and grammar for each word and verse in the Holy Quran: the Quranic Arabic Corpus, including word-by-word morphology and English gloss, and an ontology of Quranic concepts
- Nora Abbas developed the first Quran "search for a concept" tool and website, Qurany
- Abdul-Baquee Sharaf developed tools and resources for text mining the Quran, including verse similarity, lemma concordance and collocation, and text mining the Hadeeth