CS460/626: Natural Language Processing / Language Technology for the Web (Lecture 1)

1
CS460/626: Natural Language Processing/Language
Technology for the Web (Lecture 1: Introduction)
  • Pushpak Bhattacharyya, CSE Dept., IIT Bombay

2
Persons involved
  • Faculty instructors: Dr. Pushpak Bhattacharyya
    (www.cse.iitb.ac.in/pb) and Dr. Om Damani
    (www.cse.iitb.ac.in/damani)
  • TAs: to be decided
  • Course home page (to be created):
    www.cse.iitb.ac.in/cs626-460-2009

3
Perspectivising NLP: Areas of AI and their
inter-dependencies
  • Knowledge Representation
  • Search
  • Logic
  • Machine Learning
  • Planning
  • Expert Systems
  • Vision
  • Robotics
  • NLP
4
Web brings new perspectives: the QSA Triangle
  • Query
  • Search
  • Analytics
5
Web 2.0 tasks
  • Business Intelligence on the Internet Platform
  • Opinion Mining
  • Reputation Management
  • Sentiment Analysis (some observations at the end)
  • NLP is thought to play a key role

6
Books etc.
  • Main Text(s)
  • Natural Language Understanding: James Allen
  • Speech and Language Processing: Jurafsky and
    Martin
  • Foundations of Statistical NLP: Manning and
    Schutze
  • Other References
  • NLP: a Paninian Perspective: Bharati, Chaitanya
    and Sangal
  • Statistical NLP: Charniak
  • Journals
  • Computational Linguistics, Natural Language
    Engineering, AI, AI Magazine, IEEE SMC
  • Conferences
  • ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT,
    ICON, SIGIR, WWW, ICML, ECML

7
Allied Disciplines
Philosophy: Semantics, meaning of meaning, logic (syllogism)
Linguistics: Study of syntax, lexicon, lexical semantics etc.
Probability and Statistics: Corpus linguistics, testing of hypotheses, system evaluation
Cognitive Science: Computational models of language processing, language acquisition
Psychology: Behaviouristic insights into language processing, psychological models
Brain Science: Language processing areas in the brain
Physics: Information theory, entropy, random fields
Computer Sc. & Engg.: Systems for NLP
8
Topics proposed to be covered
  • Shallow Processing
  • Part of Speech Tagging and Chunking using HMM,
    MEMM, CRF, and Rule Based Systems
  • EM Algorithm
  • Language Modeling
  • N-grams
  • Probabilistic CFGs
  • Basic Linguistics
  • Morphemes and Morphological Processing
  • Parse Trees and Syntactic Processing: Constituent
    Parsing and Dependency Parsing
  • Deep Parsing
  • Classical Approaches: Top-Down, Bottom-Up and
    Hybrid Methods
  • Chart Parsing, Earley Parsing
  • Statistical Approach: Probabilistic Parsing, Tree
    Bank Corpora

9
Topics proposed to be covered (contd.)
  • Knowledge Representation and NLP
  • Predicate Calculus, Semantic Net, Frames,
    Conceptual Dependency, Universal Networking
    Language (UNL)
  • Lexical Semantics
  • Lexicons, Lexical Networks and Ontology
  • Word Sense Disambiguation
  • Applications
  • Machine Translation
  • IR
  • Summarization
  • Question Answering

10
Grading
  • Based on
  • Midsem
  • Endsem
  • Assignments
  • Seminar
  • Except for the first two, everything else is in
    groups of 4. Weightages will be revealed soon.

11
Definitions etc.
12
What is NLP?
  • Branch of AI
  • 2 Goals
  • Science Goal: Understand the way language
    operates
  • Engineering Goal: Build systems that analyse and
    generate language; reduce the man-machine gap

13
The famous Turing Test: Language-Based Interaction
Test conductor
Machine
Human
Can the test conductor find out which is the
machine and which the human?
14
Inspired Eliza
  • http://www.manifestation.com/neurotoys/eliza.php3

15
Inspired Eliza (another sample interaction)
  • A Sample of Interaction

16
The "what is it?" question: NLP is concerned with
Grounding
  • Ground the language into perceptual, motor and
    cognitive capacities.

17
Grounding
  • Chair
  • Computer

18
Two Views of NLP and the Associated Challenges
  1. Classical View
  2. Statistical/Machine Learning View

19
Stages of processing
  • Phonetics and phonology
  • Morphology
  • Lexical Analysis
  • Syntactic Analysis
  • Semantic Analysis
  • Pragmatics
  • Discourse

20
Phonetics
  • Processing of speech
  • Challenges
  • Homophones: bank (finance) vs. bank (river bank)
  • Near-homophones: maatraa vs. maatra (Hindi)
  • Word boundary
  • aajaayenge (aa jaayenge, "will come", or aaj
    aayenge, "will come today")
  • I got uaplate ("I got up late" vs. "I got a
    plate")
  • Phrase boundary
  • "M.Tech-1 students are especially exhorted to
    attend, as such seminars are integral to one's
    post-graduate education"
  • Disfluency: ah, um, ahem etc.

21
Morphology
  • Word formation rules from root words
  • Nouns: plural (boy-boys); gender marking
    (czar-czarina)
  • Verbs: tense (stretch-stretched); aspect (e.g.
    perfective: sit-had sat); modality (e.g. request:
    khaanaa? khaaiie)
  • Crucial first step in NLP
  • Languages rich in morphology: e.g., Dravidian,
    Hungarian, Turkish
  • Languages poor in morphology: Chinese, English
  • Languages with rich morphology have the advantage
    of easier processing at higher stages of
    processing
  • A task of interest to computer science: Finite
    State Machines for word morphology (see the
    sketch below)
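
A minimal sketch (an illustration, not from the lecture) of how a
finite-state analyser for English noun plurals might look; the two-state
machine and the tiny lexicon below are hypothetical:

    # Minimal finite-state sketch for plural morphology (illustrative only).
    # The lexicon and suffix arcs are hypothetical stand-ins.
    LEXICON = {"boy", "box", "czar"}

    def analyse(word):
        """Return (root, features) by simulating a 2-state FSM over suffixes."""
        if word in LEXICON:                             # accepted in state "stem"
            return word, {"number": "singular"}
        for suffix, stem_end in (("es", "x"), ("s", "")):   # plural arcs
            if word.endswith(suffix):
                root = word[: len(word) - len(suffix)]
                if root in LEXICON and root.endswith(stem_end):
                    return root, {"number": "plural"}
        return None                                     # rejected

    print(analyse("boys"))   # ('boy', {'number': 'plural'})
    print(analyse("boxes"))  # ('box', {'number': 'plural'})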

22
Lexical Analysis
  • Essentially refers to dictionary access and
    obtaining the properties of the word
  • e.g. dog
  • noun (lexical property)
  • takes -s in plural (morph property)
  • animate (semantic property)
  • 4-legged (semantic property)
  • carnivore (semantic property)
  • Challenge: lexical or word sense disambiguation

23
Lexical Disambiguation
  • First step: part-of-speech disambiguation
  • Dog as a noun (animal)
  • Dog as a verb (to pursue)
  • Sense disambiguation
  • Dog (as animal)
  • Dog (as a very detestable person)
  • Needs word relationships in a context
  • The chair emphasised the need for adult education
  • Very common in day-to-day communications
  • Satellite channel ad: "Watch what you want, when
    you want" (two senses of watch)
  • e.g., ground-breaking ceremony/research

24
Technological developments bring in new terms,
additional meanings/nuances for existing terms
  • Justify, as in "justify the right margin" (word
    processing context)
  • Xeroxed: a new verb
  • Digital trace: a new expression
  • Communifaking: pretending to talk on a mobile when
    you are actually not
  • Discomgooglation: anxiety/discomfort at not being
    able to access the internet
  • Helicopter parenting: over-parenting

25
Syntax Processing Stage
  • Structure Detection

(S (NP (PRON I))
   (VP (V like)
       (NP (N mangoes))))
26
Parsing Strategy
  • Driven by grammar (see the sketch below)
  • S → NP VP
  • NP → N | PRON
  • VP → V NP | V PP
  • N → mangoes
  • PRON → I
  • V → like
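
A minimal sketch of this grammar-driven parsing using NLTK's chart parser
(assuming the nltk package is available; the unused "V PP" branch is dropped
because the toy lexicon has no prepositions):

    import nltk

    # The slide's toy grammar, written in NLTK's CFG notation.
    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | PRON
    VP -> V NP
    N -> 'mangoes'
    PRON -> 'I'
    V -> 'like'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("I like mangoes".split()):
        tree.pretty_print()   # prints the S -> (NP I) (VP like mangoes) tree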

27
Challenges in Syntactic Processing: Structural
Ambiguity
  • Scope
  • 1. The old men and women were taken to safe
    locations
  • (old men and women) vs. ((old men) and women)
  • 2. No smoking areas will allow Hookas inside
  • Preposition Phrase Attachment
  • I saw the boy with a telescope
  • (who has the telescope?)
  • I saw the mountain with a telescope
  • (world knowledge: a mountain cannot be an
    instrument of seeing)
  • I saw the boy with the pony-tail
  • (world knowledge: a pony-tail cannot be an
    instrument of seeing)
  • Very ubiquitous; newspaper headline: "20 years
    later, BMC pays father 20 lakhs for causing son's
    death"

28
Structural Ambiguity
  • Overheard: "I did not know my PDA had a phone for
    3 months"
  • An actual sentence in the newspaper: "The camera
    man shot the man with the gun when he was near
    Tendulkar"
  • (P.G. Wodehouse, Ring for Jeeves) "Jill had rubbed
    ointment on Mike the Irish Terrier, taken a look
    at the goldfish belonging to the cook, which had
    caused anxiety in the kitchen by refusing its
    ants' eggs"
  • (Times of India, 26/2/08) "Aid for kin of cops
    killed in terrorist attacks"

29
Headache for parsing: Garden Path sentences
  • Garden pathing
  • The horse raced past the garden fell.
  • The old man the boat.
  • Twin Bomb Strike in Baghdad kill 25 (Times of
    India 05/09/07)

30
Semantic Analysis
  • Representation in terms of
  • Predicate calculus/Semantic Nets/Frames/Conceptual
    Dependencies and Scripts
  • John gave a book to Mary
  • Give action: Agent: John, Object: book,
    Recipient: Mary
  • Challenge: ambiguity in semantic role labeling
  • (Eng) Visiting aunts can be a nuisance
  • (Hin) aapko mujhe mithaai khilaanii padegii
    (ambiguous in Marathi and Bengali too, but not in
    Dravidian languages)

31
Pragmatics
  • Very hard problem
  • Model user intention
  • Tourist (in a hurry, checking out of the hotel,
    motioning to the service boy): "Boy, go upstairs
    and see if my sandals are under the divan. Do not
    be late. I just have 15 minutes to catch the
    train."
  • Boy (running upstairs and coming back panting):
    "Yes sir, they are there."
  • World knowledge
  • WHY INDIA NEEDS A SECOND OCTOBER (ToI, 2/10/07)

32
Discourse
  • Processing of sequence of sentences
  • Mother to John: "John, go to school. It is open
    today. Should you bunk? Father will be very
    angry."
  • Ambiguity of "open"
  • bunk what?
  • Why will the father be angry?
  • Complex chain of reasoning and application of
    world knowledge
  • Ambiguity of father
  • father as parent
  • or
  • father as headmaster

33
Complexity of Connected Text
  • John was returning from school dejected; today
    was the math test

He couldn't control the class
Teacher shouldn't have made him responsible
After all he is just a janitor
34
Giving a flavour of what is done: Structure
Disambiguation
  • Scope, Clause and Preposition/Postposition

35
Structure Disambiguation is as critical as Sense
Disambiguation
  • Scope (portion of text in the scope of a
    modifier)
  • Old men and women will be taken to safe locations
  • No smoking areas allow hookas inside
  • Clause
  • I told the child that I liked that he came to the
    game on time
  • Preposition
  • I saw the boy with a telescope

36
Structure Disambiguation is as critical as Sense
Disambiguation (contd.)
  • Semantic role
  • Visiting aunts can be a nuisance
  • Mujhe aapko mithaai khilaani padegii ("I have to
    give you sweets" or "You have to give me sweets")
  • Postposition
  • unhone teji se bhaagte hue chor ko pakad liyaa
    ("he caught the thief that was running fast" or
    "he ran fast and caught the thief")
  • All these ambiguities lead to the construction of
    multiple parse trees for each sentence and need
    semantic, pragmatic and discourse cues for
    disambiguation

37
Higher level knowledge needed for disambiguation
  • Semantics
  • I saw the boy with a pony tail (pony tail cannot
    be an instrument of seeing)
  • Pragmatics
  • ((old men) and women) as opposed to (old men and
    women) in "Old men and women were taken to safe
    locations", since women, both young and old, were
    very likely taken to safe locations
  • Discourse
  • No smoking areas allow hookas inside, except the
    one in Hotel Grand.
  • No smoking areas allow hookas inside, but not
    cigars.

38
Preposition
39
Problem definition
  • 4-tuples of the form V N1 P N2
  • saw (V) boys (N1) with (P) telescopes (N2)
  • Attachment choice is between the matrix verb V
    and the object noun N1

40
Lexical Association Table (Hindle and Rooth, 1991
and 1993)
  • From a large corpus of parsed text
  • first find all noun phrase heads
  • then record the verb (if any) that precedes the
    head
  • and the preposition (if any) that follows it
  • as well as some other syntactic information about
    the sentence.
  • Extract attachment information from this table of
    co-occurrences

41
Example lexical association
  • A table entry is considered a definite instance
    of the prepositional phrase attaching to the verb
    if
  • the verb definitely licenses the prepositional
    phrase
  • E.g., from Propbank, the frames for absolve:
  • absolve.XX: NP-ARG0 NP-ARG2-of obj-ARG1
  • "On Friday, the firms filed a suit ICH-1 against
    West Virginia in New York state court asking for
    [ARG0 a declaratory judgment] [rel absolving]
    [ARG1 them] [ARG2-of of liability]."

42
Core steps
  • Seven different procedures for deciding whether a
    table entry is an instance of no attachment, sure
    noun attach, sure verb attach, or ambiguous
    attach
  • From these, frequency information can be
    extracted, counting the number of times a
    particular verb or noun attaches with a particular
    preposition

43
Core steps (contd.)
  • These frequencies serve as the training data for
    the statistical model used to predict correct
    attachment
  • To disambiguate a sentence, compute the
    likelihood of the particular preposition given the
    particular verb and contrast it with the
    likelihood of the preposition given the particular
    noun
  • i.e., compare P(with | saw) with
    P(with | telescope), as in "I saw the boy with a
    telescope" (see the sketch below)
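
A toy sketch of this comparison; the counts below are invented for
illustration and merely stand in for Hindle and Rooth's lexical
association table:

    from collections import Counter

    # Hypothetical co-occurrence counts from a large parsed corpus.
    verb_prep = Counter({("saw", "with"): 200})      # f(verb, prep)
    verb_tot  = Counter({"saw": 1000})               # f(verb)
    noun_prep = Counter({("telescope", "with"): 20}) # f(noun, prep)
    noun_tot  = Counter({"telescope": 400})          # f(noun)

    def p_prep_given_verb(p, v):
        return verb_prep[(v, p)] / verb_tot[v]

    def p_prep_given_noun(p, n):
        return noun_prep[(n, p)] / noun_tot[n]

    v, p, n = "saw", "with", "telescope"
    # 0.20 vs. 0.05 with these toy counts, so the PP attaches to the verb,
    # matching the "telescope as instrument of seeing" reading.
    attach = "verb" if p_prep_given_verb(p, v) > p_prep_given_noun(p, n) else "noun"
    print("attach PP to the", attach)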

44
Critique
  • Limited by the number of relationships in the
    training corpora
  • Too large a parameter space
  • Model acquired during training is represented in
    a huge table of probabilities, precluding any
    straightforward analysis of its workings

45
Approach based on Transformation-Based Error-Driven
Learning (Brill and Resnik, COLING 1994)
46
Example Transformations
Initial attachments by default are predominantly to N1.
47
Transformation rules with word classes: WordNet
synsets and semantic classes are used.
48
Accuracy of the transformation-based approach (12,000
training and 500 test examples):

Method                         | Accuracy (%) | # Transformation rules
Hindle and Rooth (baseline)    | 70.4 to 75.8 | NA
Transformations                | 79.2         | 418
Transformations (word classes) | 81.8         | 266
49
Maximum Entropy Based Approach (Ratnaparkhi, Reynar,
Roukos, 1994)
  • Use more features than (V N1) bigram and (N1 P)
    bigram
  • Apply Maximum Entropy Principle

50
Core formulation
  • We denote
  • the partially parsed verb phrase, i.e., the verb
    phrase without the attachment decision, as a
    history h, and
  • the conditional probability of an attachment as
    P(d | h),
  • where d = 0 or 1 corresponds to a noun or verb
    attachment, respectively.

51
Maximize the training data log likelihood:

    L(p) = sum over (h, d) of p~(h, d) log p(d | h)           --(1)

with the log-linear model

    p(d | h) = (1 / Z(h)) exp( sum_i lambda_i f_i(h, d) )     --(2)
52
Equating the model's expected feature values and the
training data feature values:

    E_p~[f_i] = sum over (h, d) of p~(h, d) f_i(h, d)         --(3)

    E_p[f_i] = sum over (h, d) of p~(h) p(d | h) f_i(h, d)
             = E_p~[f_i]                                      --(4)
53
Features
  • Two types of binary-valued questions:
  • Questions about the presence of any n-gram of the
    four head words, e.g., a bigram may be
    (V = "is", P = "of")
  • Features comprised solely of questions on words
    are denoted as word features

54
Features (contd.)
  • Questions that involve the class membership of a
    head word
  • Binary hierarchy of classes derived by mutual
    information

55
Features (contd.)
  • Given a binary class hierarchy, we can associate
    a bit string with every word in the vocabulary
  • Then, by querying the value of certain bit
    positions, we can construct binary questions
  • Features comprised solely of questions about
    class bits are denoted as class features, and
    features containing questions about both class
    bits and words are denoted as mixed features (see
    the sketch below)
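
As a sketch of how word, class and mixed features can drive a conditional
model, the snippet below uses scikit-learn's LogisticRegression as a
stand-in for the binary maximum-entropy model; the training tuples and
class bits are hypothetical, not the paper's data:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Hypothetical class bits, standing in for the mutual-information
    # hierarchy of Brown et al. (1992).
    CLASS_BITS = {"joined": "01", "saw": "00", "board": "10",
                  "boy": "10", "telescope": "11", "director": "10"}

    def features(v, n1, p, n2):
        bits = CLASS_BITS.get(v, "??")
        return {
            f"V={v}": 1, f"P={p}": 1,                # word features
            f"V={v},P={p}": 1, f"N1={n1},P={p}": 1,  # word-bigram features
            f"Vclass={bits}": 1,                     # class feature
            f"Vclass={bits},P={p}": 1,               # mixed feature
        }

    # d = 1 for noun attachment, 0 for verb attachment (toy data).
    train = [(("joined", "board", "as", "director"), 1),
             (("saw", "boy", "with", "telescope"), 0)]

    vec = DictVectorizer()
    X = vec.fit_transform([features(*h) for h, d in train])
    y = [d for h, d in train]
    model = LogisticRegression().fit(X, y)   # conditional model P(d | h)

    test = features("saw", "boy", "with", "telescope")
    print(model.predict_proba(vec.transform([test])))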

56
Word classes (Brown et al., 1992)
57
Experimental data size
58
Performance of ME Model on Test Events
59
Examples of Features Chosen for Wall St. Journal
Data
60
Average Performance of Human and ME Model on 300
Events of WSJ Data
61
Human and ME model performance on consensus set
for WSJ
62
Average Performance of Human and ME Model on 200
Events of Computer Manuals Data
63
Back-off model based approach (Collins and
Brooks, 1995)
  • NP-attach:
    (joined ((the board) (as a non-executive
    director)))
  • VP-attach:
    ((joined (the board)) (as a non-executive
    director))
  • Correspondingly,
  • NP-attach: (1, joined, board, as, director)
  • VP-attach: (0, joined, board, as, director)
  • Quintuple (attachment A = 0/1, V, N1, P, N2):
    5 random variables

64
Probabilistic formulation: estimate
p(A = 1 | V = v, N1 = n1, P = p, N2 = n2), or briefly
p(1 | v, n1, p, n2).

If

    p(1 | v, n1, p, n2) >= 0.5

then the attachment is to the noun, else to the verb.
65
Maximum Likelihood estimate:

    p^(1 | v, n1, p, n2) = f(1, v, n1, p, n2) / f(v, n1, p, n2)

where f(.) denotes frequency counts in the training
data.
66
The Back-off estimate
  • Inspired by speech recognition
  • Prediction of the Nth word from previous (N-1)
    words

Data sparsity problem:
f(w1, w2, w3, ..., wn) will frequently be 0 for large
values of n
67
Back-off estimate (contd.)
The cut-off frequencies (c1, c2, ...) are thresholds
determining whether to back off or not at each level:
counts lower than ci at stage i are deemed too low to
give an accurate estimate, so in that case backing off
continues. A toy sketch of the scheme follows.
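
A minimal sketch of such a backed-off estimator for PP attachment in the
spirit of Collins and Brooks, with all cut-off thresholds ci set to zero
and invented counts:

    from collections import Counter

    f1, f = Counter(), Counter()   # f1: count with A = 1 (noun attach); f: total

    def add(a, v, n1, p, n2):
        """Register one training quintuple at every back-off level."""
        levels = [(v, n1, p, n2),
                  (v, n1, p), (v, p, n2), (n1, p, n2),
                  (v, p), (n1, p), (p, n2),
                  (p,)]
        for ctx in levels:
            f[ctx] += 1
            if a == 1:
                f1[ctx] += 1

    def p_noun_attach(v, n1, p, n2):
        """Backed-off estimate of P(A = 1 | v, n1, p, n2); every level keeps p."""
        for group in [[(v, n1, p, n2)],
                      [(v, n1, p), (v, p, n2), (n1, p, n2)],
                      [(v, p), (n1, p), (p, n2)],
                      [(p,)]]:
            num = sum(f1[c] for c in group)
            den = sum(f[c] for c in group)
            if den > 0:            # cut-off threshold c_i = 0 in this sketch
                return num / den
        return 1.0                 # default: noun attachment

    add(1, "joined", "board", "as", "director")
    print(p_noun_attach("joined", "board", "as", "director"))  # 1.0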
68
Back-off for PP attachment
Note: the back-off tuples always retain the
preposition.
69
The backoff algorithm
70
Lower and upper bounds on performance
Lower bound: always choose the most frequent attachment
Upper bound: human experts looking at the 4 head words
only
71
Results
72
Comparison with other systems
Maxent: Ratnaparkhi et al.
Transformation Learning: Brill et al.
73
Flexible Unsupervised PP Attachment using WSD and
Data Sparsity Reduction (Medimi Srinivas and
Pushpak Bhattacharyya, IJCAI 2007)
  • Unsupervised approach (somewhat similar to
    Ratnaparkhi, 1998): the training data is extracted
    from raw text
  • The unambiguous training data of the form V-P-N
    and N1-P-N2 teach the system how to resolve
    PP-attachment in the ambiguous test data
    V-N1-P-N2
  • Refinement of the extracted training data, and
    use of N2 in the PP-attachment resolution process

74
Flexible Unsupervised PP Attachment using WSD and
Data Sparsity Reduction (contd.)
  • PP-attachment is determined by the semantic
    property of lexical items in the context of the
    preposition, using WordNet
  • An iterative graph-based unsupervised approach is
    used for word sense disambiguation (similar to
    Mihalcea, 2005)
  • Use of a Data Sparsity Reduction (DSR) process
    (DSRP), which uses lemmatization, synset
    replacement and a form of inferencing; DSRP uses
    WordNet
  • Flexible use of the WSD and DSR processes for
    PP-attachment

75
Graph-based disambiguation: PageRank-based algorithm
(Mihalcea, 2005); a toy sketch follows.
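
A flavour of graph-based sense ranking, assuming the networkx package; the
sense graph and similarity weights below are invented for illustration and
do not reproduce Mihalcea's exact construction:

    import networkx as nx

    # Hypothetical sense graph: nodes are (word, sense) pairs, edges weighted
    # by a WordNet-style similarity between senses of co-occurring words.
    G = nx.Graph()
    G.add_weighted_edges_from([
        (("bank", "finance"), ("deposit", "money"), 0.9),
        (("bank", "river"),   ("deposit", "money"), 0.1),
        (("bank", "finance"), ("interest", "rate"), 0.8),
    ])

    rank = nx.pagerank(G, weight="weight")

    # Pick the highest-ranked sense for each word.
    best = {}
    for (word, sense), score in rank.items():
        if score > best.get(word, ("", 0.0))[1]:
            best[word] = (sense, score)
    print(best)   # e.g. 'bank' resolves to its 'finance' sense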
76
Experimental setup
  • Training Data
  • Brown corpus (raw text). Corpus size is 6 MB:
    51,763 sentences, nearly 1,027,000 words.
  • Most frequent prepositions in the syntactic
    context N1-P-N2: of, in, for, to, with, on, at,
    from, by
  • Most frequent prepositions in the syntactic
    context V-P-N: in, to, by, with, on, for, from,
    at, of
  • Extracted unambiguous tuples: 54,030 N1-P-N2 and
    22,362 V-P-N
  • Test Data
  • Penn Treebank Wall Street Journal (WSJ) data
    extracted by Ratnaparkhi
  • It consists of 20,801 (training), 4,039
    (development) and 3,097 (test) V-N1-P-N2 tuples

77
Experimental setup contd.
  • Baseline
  • The unsupervised approach by Ratnaparkhi, 1998
    (Base-RP)
  • Preprocessing
  • Upper case converted to lower case
  • Any four-digit number less than 2100 is treated
    as a year
  • Any other number or sign is converted to "num"
  • Experiments are performed with different stages
    of DSRP
  • Experiments are performed using GuWSD and DSRP
    with different senses

78
The process of extracting training data: Data
Sparsity Reduction

Tools/process: Output
Raw text: The professional conduct of the doctors is guided by Indian Medical Association.
POS tagger: The_DT professional_JJ conduct_NN of_IN the_DT doctors_NNS is_VBZ guided_VBN by_IN Indian_NNP Medical_NNP Association_NNP ._.
Chunker: The_DT professional_JJ conduct_NN of_IN the_DT doctors_NNS (is_VBZ guided_VBN) by_IN Indian_NNP Medical_NNP Association_NNP. After replacing each chunk by its head word, this results in: conduct_NN of_IN doctors_NNS guided_VBN by_IN Association_NNP
Extraction heuristics: N1-P-N2: conduct of doctors, and V-P-N: guided by Association
Morphing: N1-P-N2: conduct of doctor, and V-P-N: guide by association
DSRP (synset replacement): N1-P-N2: {conduct, behavior} of {doctor, physician} results in 4 combinations with the same sense; similarly, V-P-N: {guide, direct} by association results in 2 combinations with the same sense
79
Data Sparsity Reduction: Inferencing
  • If
  • V1-P-N1 and V2-P-N1 exist, as also do V1-P-N2
    and V2-P-N2, then
  • if
  • V3-P-Ni exists (i = 1, 2), then
  • we can infer the existence of V3-P-Nj (i ≠ j)
    with a frequency count equal to that of V3-P-Ni,
    which can be added to the corpus.

80
Example of DSR by inferencing
  • V1-P-N1: play in garden, and V2-P-N1: sit in
    garden
  • V1-P-N2: play in house, and V2-P-N2: sit in house
  • V3-P-N2: jump in house exists
  • Infer the existence of V3-P-N1: jump in garden
    (see the sketch below)
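
A small sketch of this inference rule over toy V-P-N counts (the triples
and counts are illustrative):

    from collections import Counter
    from itertools import combinations

    counts = Counter({("play", "in", "garden"): 12, ("sit", "in", "garden"): 9,
                      ("play", "in", "house"): 7,   ("sit", "in", "house"): 5,
                      ("jump", "in", "house"): 3})

    def infer(counts):
        """Infer V3-P-Nj from V3-P-Ni when two other verbs license (Ni, Nj)."""
        inferred = Counter()
        verbs = {v for v, p, n in counts}
        nouns = {n for v, p, n in counts}
        preps = {p for v, p, n in counts}
        for p in preps:
            for na, nb in combinations(sorted(nouns), 2):
                # Two distinct verbs seen with BOTH nouns license the pattern.
                witnesses = [v for v in verbs
                             if (v, p, na) in counts and (v, p, nb) in counts]
                if len(witnesses) < 2:
                    continue
                for v3 in verbs:
                    if (v3, p, na) in counts and (v3, p, nb) not in counts:
                        inferred[(v3, p, nb)] += counts[(v3, p, na)]
                    elif (v3, p, nb) in counts and (v3, p, na) not in counts:
                        inferred[(v3, p, na)] += counts[(v3, p, nb)]
        return inferred

    print(infer(counts))   # adds ('jump', 'in', 'garden') with count 3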

81
Results
82
Effect of various processes on FlexPPAttach
algorithm
83
Precision vs. various processes
84
Is NLP Really Needed?
85
Post-1
  • POST----5 TITLE: "Wants to invest in IPO? Think
    again"
    Here's a sobering thought for those who believe in
    investing in IPOs. Listing gains - the return on
    the IPO scrip at the close of listing day over the
    allotment price - have been falling substantially
    in the past two years. Average listing gains have
    fallen from 38% in 2005 to as low as 2% in the
    first half of 2007. Of the 159 book-built initial
    public offerings (IPOs) in India between 2000 and
    2007, two-thirds saw listing gains. However, these
    gains have eroded sharply in recent years. Experts
    say this trend can be attributed to the aggressive
    pricing strategy that investment bankers adopt
    before an IPO. "While the drop in average listing
    gains is not a good sign, it could be due to the
    fact that IPO issue managers are getting
    aggressive with pricing of the issues," says Sujan
    Hajra, chief economist, Anand Rathi. While the
    listing gain was 38% in 2005 over 34 issues, it
    fell to 30% in 2006 over 61 issues and to 2% in
    2007 till mid-April over 34 issues. The overall
    listing gain for 159 issues listed since 2000 has
    been 23%, according to an analysis by Anand Rathi
    Securities. Aggressive pricing means the scrip has
    often been priced at the high end of the pricing
    range, which would restrict the upward movement of
    the stock, leading to reduced listing gains for
    the investor. It also tends to suggest investors
    should not indiscriminately pump money into IPOs.
    But some market experts point out that India fares
    better than other countries. "Internationally,
    there have been periods of negative returns, and
    low positive returns in India should not be
    considered a bad thing."

86
Post-2
  • POST----7 TITLE: "IIM-Jobs: Bank - International
    Projects Group - Manager"
    Please send your CV & cover letter to
    anup.abraham_at_bank.com. Bank, through its
    International Banking Group (IBG), is expanding
    beyond the Indian market with an intent to become
    a significant player in the global marketplace.
    The exciting growth in the overseas markets is
    driven not only by India-linked opportunities, but
    also by opportunities of impact that we see as a
    local player in these overseas markets and/or as a
    bank with global footprint. IBG comprises Retail
    banking, Corporate banking & Treasury in the 17
    overseas markets we are present in. Technology is
    seen as a key part of the business strategy, and
    critical to business innovation & capability
    scale-up. The International Projects Group in IBG
    takes ownership of defining & delivering
    business-critical IT projects, and directly
    impacts business growth. Role: Manager -
    International Projects Group. Purpose of the role:
    Define IT initiatives and manage IT projects to
    achieve business goals. The project domain will be
    retail, corporate & treasury. The incumbent will
    work with teams across functions (including
    internal technology teams & IT vendors for
    development/implementation) and locations to
    deliver significant & measurable impact to the
    business. Location: Mumbai (short travel to
    overseas locations may be needed). Key
    deliverables: Conceptualize IT initiatives, define
    business requirements

87
Sentiment Classification
  • Positive, negative, neutral: 3-class
  • Sports, economics, literature: multi-class
  • Create a representation for the document
  • Classify the representation
  • The most popular way of representing a document
    is as a feature vector (indicator sequence)

88
Established Techniques
  • Naïve Bayes Classifier (NBC)
  • Support Vector Machines (SVM)
  • Neural Networks
  • K nearest neighbor classifier
  • Latent Semantic Indexing
  • Decision Tree ID3
  • Concept based indexing

89
Successful Approaches
  • The following are successful approaches as
    reported in the literature:
  • NBC: simple to understand and implement
  • SVM: complex, requires foundations of perceptrons

90
Mathematical Setting
  • We have a training set:
  • A: positive sentiment docs
  • B: negative sentiment docs
  • Let the class of positive and negative documents
    be C+ and C-, respectively.
  • Given a new document D, label it positive if

P(C+|D) > P(C-|D)

Indicator/feature vectors are to be formed.
91
Prior Probability

Document | Vector | Classification
D1       | V1     | +
D2       | V2     | -
D3       | V3     | +
...      | ...    | ...
D4000    | V4000  | -

Let T = total number of documents, and let the number
of positive documents be M, so the number of negative
documents is T - M. The prior probability is
calculated without considering any features of the new
document:

P(D being positive) = M/T
92
Apply Bayes' Theorem
  • Steps followed for the NBC algorithm:
  • Calculate the prior probabilities of the classes,
    P(C+) and P(C-)
  • Calculate the feature probabilities of the new
    document, P(D|C+) and P(D|C-)
  • The probability of a document D belonging to a
    class C can be calculated by Bayes' theorem as
    follows:

P(C|D) = P(C) P(D|C) / P(D)

  • The document belongs to C+ if

P(C+) P(D|C+) > P(C-) P(D|C-)
93
Calculating P(D|C)
  • P(D|C) is the probability of document D given
    class C. This is calculated as follows:
  • Identify a set of features/indicators to evaluate
    a document and generate a feature vector VD:
    VD = <x1, x2, x3, ..., xn>
  • Hence,
    P(D|C) = P(VD|C)
           = P(<x1, x2, x3, ..., xn> | C)
           = count(<x1, x2, x3, ..., xn>, C) / count(C)
  • Based on the assumption that all features are
    Independently Identically Distributed (IID):
    P(<x1, x2, x3, ..., xn> | C)
      = P(x1|C) P(x2|C) P(x3|C) ... P(xn|C)
      = product over i = 1..n of P(xi|C)
  • P(xi|C) can now be calculated as
    count(xi, C) / count(C) (see the sketch below)
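
Putting the pieces together, a minimal NBC sketch over a toy corpus; the
add-one smoothing is an addition beyond the slides' formula, so that unseen
words do not zero out the product:

    import math
    from collections import Counter

    # Toy labelled corpus; the real training sets A (positive) and
    # B (negative) would hold thousands of documents.
    train = [("good great gains", "+"), ("strong listing gains", "+"),
             ("falling returns bad", "-"), ("bad aggressive pricing", "-")]

    docs_per_class = Counter(c for _, c in train)
    word_counts = {"+": Counter(), "-": Counter()}
    for text, c in train:
        word_counts[c].update(text.split())

    V = {w for wc in word_counts.values() for w in wc}   # vocabulary

    def log_posterior(doc, c):
        # log P(C) + sum_i log P(x_i | C), with add-one smoothing.
        lp = math.log(docs_per_class[c] / len(train))
        total = sum(word_counts[c].values())
        for w in doc.split():
            lp += math.log((word_counts[c][w] + 1) / (total + len(V)))
        return lp

    doc = "gains falling"
    label = max(["+", "-"], key=lambda c: log_posterior(doc, c))
    print(label)   # '+' for this toy corpus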

94
Baseline Accuracy
  • Just with tokens as features: 80% accuracy
  • 20% probability of a document being misclassified
  • On large sets this is significant

95
To improve accuracy
  • Clean corpora
  • POS tag
  • Concentrate on critical POS tags (e.g. adjective)
  • Remove objective sentences ('of' ones)
  • Do aggregation
  • Use minimal to sophisticated NLP