A Probabilistic Model for Classification of Multiple-Record Web Documents - PowerPoint PPT Presentation

About This Presentation
Title:

A Probabilistic Model for Classification of Multiple-Record Web Documents

Description:

A Probabilistic Model for Classification of ... Classification Rule Multivariate Normal Density Functions Linear Classification Rule Linear Discrimination ... – PowerPoint PPT presentation

Slides: 21
Provided by: LiXu8
Transcript and Presenter's Notes

Title: A Probabilistic Model for Classification of Multiple-Record Web Documents


1
A Probabilistic Model for Classification of
Multiple-Record Web Documents
  • June Tang
  • Yiu-Kai Ng

2
Overview
  • Probabilistic Model
  • Bayes decision theory
  • Document and query representations
  • Ranking-function construction
  • Multivariate Statistical Analysis

3
Approach
  • Constructing a ranking function for a probabilistic
    model based on multivariate statistical analysis
  • Minimizing expected cost of misclassification
  • Deriving a classification rule
  • Deriving a linear classification rule
  • Deriving a sample linear classification rule

4
Application Ontology
5
Document Representation
Attributes: (Year, Make, Model, Mileage, Price, Feature, PhoneNr)
Total records: 60
Attribute counts: Year = 62, Make = 58, Model = 48, Mileage = 12, Price = 58, Feature = 49, PhoneNr = 33
Count vector: (62, 58, 48, 12, 58, 49, 33)
Normalized vector: (1.03, 0.97, 0.80, 0.20, 0.97, 0.82, 0.55)
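The second vector on this slide appears to be the count vector divided by the total number of records (for example, 62 / 60 ≈ 1.03). The slide does not state this explicitly, so the normalization step in the sketch below is an assumption inferred from the numbers; the function name `normalize` is illustrative.

```python
# Attribute order follows the slide.
attributes = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]
counts = [62, 58, 48, 12, 58, 49, 33]  # attribute counts from the slide
total_records = 60                      # total records from the slide


def normalize(counts, total_records):
    """Divide each attribute count by the record count, rounded to 2 decimals.

    Assumption: this is how the slide's second vector was obtained.
    """
    return [round(c / total_records, 2) for c in counts]


x = normalize(counts, total_records)
print(dict(zip(attributes, x)))
```

A value above 1 (Year) simply means the attribute was matched more often than there are records in the document.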
6
Elementary Concepts
  • Variables are things that we measure, control, or
    manipulate in research
  • Multivariate analysis considers multiple
    variables together as a single unit
  • Normal distribution represents one of the
    empirically verified elementary "truths about the
    general nature of reality"

7
Multivariate Statistical Analysis
  • Let
  • A be an application ontology
  • D be a set of Web documents
  • R be the set of relevant documents
  • R̄ be the set of irrelevant documents
  • X = (X1, X2, ..., Xp) represent a document
  • Ω be the set of all possible values that X
    can take
  • Ω = Ω1 ∪ Ω2

8
Expected Cost of Misclassification (ECM)
Here,
Two density functions f1 and f2
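The formula on this slide was an image and did not survive in the transcript. For reference, the standard textbook definition of ECM for two classes is given below. The notation is an assumption, not taken from the slide: class 1 is relevant, class 2 is irrelevant, $p_1, p_2$ are prior probabilities, $c(i \mid j)$ is the cost of assigning a class-$j$ document to class $i$, and $\Omega_1, \Omega_2$ are the two classification regions.

```latex
\mathrm{ECM} = c(2 \mid 1)\, P(2 \mid 1)\, p_1 + c(1 \mid 2)\, P(1 \mid 2)\, p_2,
\qquad
P(2 \mid 1) = \int_{\Omega_2} f_1(\mathbf{x})\, d\mathbf{x},
\quad
P(1 \mid 2) = \int_{\Omega_1} f_2(\mathbf{x})\, d\mathbf{x}
```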
9
Classification Rule
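The rule itself was an image and is missing from the transcript. In the standard formulation, the regions that minimize ECM are defined by a density-ratio test; the version below uses assumed notation (class 1 relevant, class 2 irrelevant, priors $p_1, p_2$, costs $c(i \mid j)$) and may differ in detail from the slide.

```latex
\text{Classify } \mathbf{x} \text{ as relevant } (\mathbf{x} \in \Omega_1) \text{ if }
\quad
\frac{f_1(\mathbf{x})}{f_2(\mathbf{x})}
\;\ge\;
\frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1},
\qquad
\text{otherwise as irrelevant } (\mathbf{x} \in \Omega_2).
```

With equal costs and equal priors the right-hand side is 1, so a document is assigned to whichever class gives it the higher density.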
10
Multivariate Normal Density Functions
Assume that density functions are normal
Where
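The density formula was an image and is not in the transcript. As a minimal sketch, the standard p-dimensional normal density can be evaluated as follows; the function name `mvn_density` and its arguments are illustrative, and NumPy is an assumed dependency.

```python
import numpy as np


def mvn_density(x, mu, sigma):
    """Standard multivariate normal density.

    f(x) = (2*pi)^(-p/2) * |sigma|^(-1/2)
           * exp(-0.5 * (x - mu)' sigma^{-1} (x - mu))
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = mu.shape[0]
    diff = x - mu
    # Solve sigma * z = diff rather than forming the inverse explicitly.
    mahalanobis_sq = diff @ np.linalg.solve(sigma, diff)
    norm_const = np.sqrt((2.0 * np.pi) ** p * np.linalg.det(sigma))
    return float(np.exp(-0.5 * mahalanobis_sq) / norm_const)


# Density of a 2-D standard normal at its mean.
print(mvn_density([0.0, 0.0], [0.0, 0.0], np.eye(2)))
```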
11
Linear Classification Rule
Assume that the density functions are normal and that
Σ1, Σ2, and Σ are equal
Document x is classified as relevant if
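The inequality was an image and is missing from the transcript. Under the stated assumptions (normal densities with a common covariance matrix), taking the logarithm of the density ratio cancels the quadratic terms in $\mathbf{x}$ and leaves a linear rule. The standard form is shown below; the symbols $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$ (class means), $p_1, p_2$ (priors), and $c(i \mid j)$ (costs) are assumed notation.

```latex
(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top} \Sigma^{-1} \mathbf{x}
\;-\; \tfrac{1}{2}\, (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top} \Sigma^{-1} (\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)
\;\ge\;
\ln\!\left[ \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1} \right]
```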
12
Linear Discrimination Function
Threshold
?
13
Parameter Estimations
Suppose we have n1 relevant documents
and n2 irrelevant documents
such that n1 + n2 > p, where p is the dimension of
vector x
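The estimators themselves appear only as images in the original slides. A common choice, sketched here as an assumption, is to estimate the class means by the sample means and the shared covariance matrix by the pooled sample covariance. The function name `estimate_parameters` is illustrative and NumPy is an assumed dependency.

```python
import numpy as np


def estimate_parameters(X1, X2):
    """Return (mean1, mean2, S_pooled) for two samples of document vectors.

    X1: n1 x p array of relevant documents.
    X2: n2 x p array of irrelevant documents.
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    n1, n2 = X1.shape[0], X2.shape[0]
    mean1, mean2 = X1.mean(axis=0), X2.mean(axis=0)
    # np.cov divides by (n - 1), giving the unbiased sample covariance.
    S1 = np.atleast_2d(np.cov(X1, rowvar=False))
    S2 = np.atleast_2d(np.cov(X2, rowvar=False))
    # Pooled estimate of the common covariance matrix.
    S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    return mean1, mean2, S_pooled


# Toy data, invented for illustration only.
relevant = [[0, 0], [2, 0], [0, 2], [2, 2]]
irrelevant = [[4, 4], [6, 4], [4, 6], [6, 6]]
mean1, mean2, S_pooled = estimate_parameters(relevant, irrelevant)
print(mean1, mean2)
print(S_pooled)
```

The pooled matrix must be invertible for the linear rule to be usable, which is why the sample needs more documents than dimensions.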
14
Parameter Estimations (Cont.)
15
Sample Classification Rule
Document x is classified as relevant if
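The inequality on this slide was an image. A minimal sketch of a sample linear rule, assuming it follows the standard form with sample means and pooled covariance substituted for the population parameters, is given below. The names `fit_linear_rule` and `log_threshold` are illustrative; `log_threshold` stands for the logarithm of the cost and prior ratio and defaults to 0 (equal costs and priors).

```python
import numpy as np


def fit_linear_rule(X1, X2, log_threshold=0.0):
    """Fit a sample linear classification rule and return a predicate.

    X1: relevant training vectors, X2: irrelevant training vectors.
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    n1, n2 = X1.shape[0], X2.shape[0]
    mean1, mean2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.atleast_2d(np.cov(X1, rowvar=False))
    S2 = np.atleast_2d(np.cov(X2, rowvar=False))
    S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    # Coefficient vector a = S_pooled^{-1} (mean1 - mean2).
    a = np.linalg.solve(S_pooled, mean1 - mean2)
    midpoint = 0.5 * (a @ (mean1 + mean2))

    def is_relevant(x):
        score = a @ np.asarray(x, dtype=float) - midpoint
        return bool(score >= log_threshold)

    return is_relevant


# Toy data, invented for illustration only.
relevant = [[0, 0], [2, 0], [0, 2], [2, 2]]
irrelevant = [[4, 4], [6, 4], [4, 6], [6, 6]]
is_relevant = fit_linear_rule(relevant, irrelevant)
print(is_relevant([1, 1]), is_relevant([5, 5]))
```

Raising `log_threshold` above 0 makes the rule more reluctant to label a document relevant, which corresponds to a higher cost or prior for irrelevant documents.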
16
Misclassification Probabilities
Lachenbruch's holdout procedure
where
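The formulas on this slide are missing from the transcript. Lachenbruch's holdout procedure is commonly described as leave-one-out estimation: each document is omitted in turn, the rule is fitted on the rest, and the omitted document is classified. The sketch below follows that description with a linear rule, equal costs, and equal priors, all of which are assumptions; the function names are illustrative.

```python
import numpy as np


def _fit(X1, X2):
    """Return (a, midpoint) of the sample linear rule for two groups."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(np.atleast_2d(S), m1 - m2)
    return a, 0.5 * (a @ (m1 + m2))


def holdout_error_rates(X1, X2):
    """Leave-one-out estimates of P(2|1), P(1|2), and the overall error rate."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    miss1 = 0  # relevant documents classified as irrelevant
    for i in range(len(X1)):
        a, mid = _fit(np.delete(X1, i, axis=0), X2)
        if a @ X1[i] - mid < 0:
            miss1 += 1
    miss2 = 0  # irrelevant documents classified as relevant
    for i in range(len(X2)):
        a, mid = _fit(X1, np.delete(X2, i, axis=0))
        if a @ X2[i] - mid >= 0:
            miss2 += 1
    n1, n2 = len(X1), len(X2)
    return miss1 / n1, miss2 / n2, (miss1 + miss2) / (n1 + n2)


# Toy 1-D data, invented for illustration; 10 sits among the irrelevant values.
rates = holdout_error_rates([[0], [1], [2], [10]], [[8], [9], [11], [12]])
print(rates)
```

Because the held-out document never influences the rule that classifies it, these estimates are less optimistic than error rates measured on the training data.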
17
Precision Measure
18
Experimental Result (Relevant)
19
Experimental Result (Irrelevant)
20
Conclusion
  • Precision: 85% (VSM: 77.5%)
  • Multivariate Statistical Analysis
  • Extensibility to multiple-category
    classification