# A Probabilistic Model for Classification of Multiple-Record Web Documents


Slides: 21
Provided by: LiXu8
Transcript and Presenter's Notes


1
A Probabilistic Model for Classification of
Multiple-Record Web Documents
• June Tang
• Yiu-Kai Ng

2
Overview
• Probabilistic Model
• Bayes decision theory
• Document and query representations
• Ranking-function construction
• Multivariate Statistical Analysis

3
Approach
• Constructing a ranking function for a probabilistic model based on multivariate statistical analysis
• Minimizing expected cost of misclassification
• Deriving a classification rule
• Deriving a linear classification rule
• Deriving a sample linear classification rule

4
Application Ontology
5
Document Representation
• Attributes: (Year, Make, Model, Mileage, Price, Feature, PhoneNr)
• Total records: 60
• Attribute counts: Year 62, Make 58, Model 48, Mileage 12, Price 58, Feature 49, PhoneNr 33
• Count vector: (62, 58, 48, 12, 58, 49, 33)
• Normalized vector: (1.03, 0.97, 0.80, 0.20, 0.97, 0.82, 0.55)
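
The second vector on this slide appears to be the first one divided by the record count (60) and rounded to two decimals. The sketch below checks that reading; the variable and function names are illustrative and do not come from the slides.

```python
# Attribute counts and record total as given on the slide.
# ATTRIBUTES, counts, and normalize() are illustrative names (assumptions).
ATTRIBUTES = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]
counts = [62, 58, 48, 12, 58, 49, 33]
total_records = 60


def normalize(counts, total_records):
    """Divide each attribute count by the number of records (2 decimals)."""
    return [round(c / total_records, 2) for c in counts]


document_vector = normalize(counts, total_records)
print(dict(zip(ATTRIBUTES, document_vector)))
```

A value above 1, such as the one for Year, simply means the attribute was matched more often than there are records.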
6
Elementary Concepts
• Variables are things that we measure, control, or manipulate in research
• Multivariate analysis considers multiple variables together as a single unit
• The normal distribution represents one of the empirically verified elementary "truths about the general nature of reality"

7
Multivariate Statistical Analysis
• Let
• $A$ be an application ontology
• $D$ be a set of Web documents
• $R$ be a set of relevant documents
• $\bar{R}$ be a set of irrelevant documents
• $X = (X_1, X_2, \ldots, X_p)$ represent a document
• $\Omega$ be the set of all possible values that $X$ can take
• $\Omega = \Omega_1 \cup \Omega_2$

8
Expected Cost of Misclassification (ECM)
Here,
Two density functions f1 and f2
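
The formula on this slide did not survive extraction. For reference, the standard two-class form of the expected cost of misclassification is sketched below. The notation is an assumption, not recovered from the slide: class 1 is taken to be the relevant documents and class 2 the irrelevant ones, $p_1$ and $p_2$ are prior probabilities, $c(i \mid j)$ is the cost of assigning a class-$j$ document to class $i$, and $\Omega_1$, $\Omega_2$ are the regions in which a document is classified as relevant or irrelevant.

```latex
\mathrm{ECM} = c(2 \mid 1)\, P(2 \mid 1)\, p_1 + c(1 \mid 2)\, P(1 \mid 2)\, p_2,
\qquad
P(2 \mid 1) = \int_{\Omega_2} f_1(x)\, dx,
\quad
P(1 \mid 2) = \int_{\Omega_1} f_2(x)\, dx
```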
9
Classification Rule
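
The rule itself is missing from the transcript. The textbook rule that minimizes the expected cost of misclassification for two classes is given below as a reference sketch. The symbols are assumptions: $f_1$ and $f_2$ are the densities of relevant and irrelevant documents, $p_1$ and $p_2$ are the priors, and $c(i \mid j)$ is the cost of assigning a class-$j$ document to class $i$.

```latex
\text{Classify } x \text{ as relevant if}
\quad
\frac{f_1(x)}{f_2(x)} \;\ge\; \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1},
\quad
\text{and as irrelevant otherwise.}
```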
10
Multivariate Normal Density Functions
Assume that density functions are normal
Where
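
The density formula was lost in extraction. The standard $p$-dimensional normal density is shown below for reference. The notation is an assumption: $\mu_i$ is the mean vector and $\Sigma_i$ the covariance matrix of class $i$, with $i = 1$ for relevant and $i = 2$ for irrelevant documents.

```latex
f_i(x) = \frac{1}{(2\pi)^{p/2}\, |\Sigma_i|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right),
\qquad i = 1, 2
```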
11
Linear Classification Rule
Assume that the density functions are normal and that $\Sigma_1$, $\Sigma_2$, and $\Sigma$ are equal
Document x is classified as relevant if
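
The inequality is missing from the transcript. With normal densities and a common covariance matrix $\Sigma$, the minimum-cost rule reduces to the standard linear form sketched below. The symbols are assumptions: $\mu_1$ and $\mu_2$ are the class means, $p_1$ and $p_2$ the priors, and $c(i \mid j)$ the cost of assigning a class-$j$ document to class $i$.

```latex
(\mu_1 - \mu_2)^{\top} \Sigma^{-1} x
\;-\; \tfrac{1}{2} (\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 + \mu_2)
\;\ge\;
\ln\!\left[ \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1} \right]
```

The left-hand side is linear in $x$, which is why the rule is called linear.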
12
Linear Discrimination Function
Threshold
?
13
Parameter Estimations
Suppose we have $n_1$ relevant documents and $n_2$ irrelevant documents such that $n_1 + n_2 > p$, where $p$ is the dimension of vector $x$
14
Parameter Estimations (Cont.)
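
The estimators on these two slides were lost. The usual sample estimates for this setting are sketched below as an assumption about what the slides showed: $x_{ij}$ denotes the $j$-th training document of class $i$, $\bar{x}_i$ the sample mean, $S_i$ the sample covariance, and $S_{\text{pooled}}$ the pooled estimate of the common covariance matrix.

```latex
\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij},
\qquad
S_i = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^{\top},
\qquad
S_{\text{pooled}} = \frac{(n_1 - 1)\, S_1 + (n_2 - 1)\, S_2}{n_1 + n_2 - 2}
```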
15
Sample Classification Rule
Document x is classified as relevant if
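
The sample rule is missing from the transcript. A minimal sketch of the usual construction follows: replace the population means and covariance in the linear rule with the sample means and the pooled sample covariance. Everything here is illustrative rather than the authors' implementation. The function names, the `cost_ratio` ($c(1 \mid 2) / c(2 \mid 1)$) and `prior_ratio` ($p_2 / p_1$) parameters, and the two-attribute training data are all invented for the example.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter(rows, mean):
    """Sum of outer products of deviations from the mean."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [a - m for a, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_rule(relevant, irrelevant, cost_ratio=1.0, prior_ratio=1.0):
    """Return (weights, threshold) for the sample linear rule."""
    m1, m2 = mean_vector(relevant), mean_vector(irrelevant)
    s1, s2 = scatter(relevant, m1), scatter(irrelevant, m2)
    dof = len(relevant) + len(irrelevant) - 2
    p = len(m1)
    pooled = [[(s1[i][j] + s2[i][j]) / dof for j in range(p)] for i in range(p)]
    # weights = pooled^-1 (m1 - m2)
    weights = solve(pooled, [a - b for a, b in zip(m1, m2)])
    midpoint = 0.5 * sum(w * (a + b) for w, a, b in zip(weights, m1, m2))
    return weights, midpoint + math.log(cost_ratio * prior_ratio)


def is_relevant(x, weights, threshold):
    """Classify x as relevant when the linear score reaches the threshold."""
    return sum(w * xi for w, xi in zip(weights, x)) >= threshold


# Invented two-attribute training vectors, for illustration only.
relevant_docs = [[1.0, 0.9], [0.9, 0.8], [1.1, 0.7], [0.8, 1.0]]
irrelevant_docs = [[0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.0, 0.0]]
weights, threshold = fit_linear_rule(relevant_docs, irrelevant_docs)
print(is_relevant([0.9, 0.9], weights, threshold))
print(is_relevant([0.1, 0.2], weights, threshold))
```

Solving the linear system directly avoids forming the inverse of the pooled covariance matrix, which is enough here because only the product with the mean difference is needed.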
16
Misclassification Probabilities
Lachenbruch's holdout procedure
where
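
The formulas on this slide were lost. The holdout idea is to leave out one training document at a time, rebuild the rule from the remaining documents, classify the held-out one, and report the fraction misclassified in each class as estimates of $P(2 \mid 1)$ and $P(1 \mid 2)$. The sketch below is generic: `fit` and `classify` are hypothetical callables, and the one-dimensional midpoint rule and the data are invented stand-ins, not the linear rule from the slides.

```python
def holdout_error_rates(relevant, irrelevant, fit, classify):
    """Leave-one-out estimates of P(2|1) and P(1|2).

    fit(relevant, irrelevant) builds a rule from training data;
    classify(rule, x) returns True when x is classified as relevant.
    """
    missed_relevant = 0
    for i in range(len(relevant)):
        rule = fit(relevant[:i] + relevant[i + 1:], irrelevant)
        if not classify(rule, relevant[i]):
            missed_relevant += 1

    false_relevant = 0
    for i in range(len(irrelevant)):
        rule = fit(relevant, irrelevant[:i] + irrelevant[i + 1:])
        if classify(rule, irrelevant[i]):
            false_relevant += 1

    return missed_relevant / len(relevant), false_relevant / len(irrelevant)


# Stand-in rule for illustration: cut at the midpoint of the class means.
def fit_midpoint(relevant, irrelevant):
    m1 = sum(relevant) / len(relevant)
    m2 = sum(irrelevant) / len(irrelevant)
    return (m1 + m2) / 2, m1 >= m2


def classify_midpoint(rule, x):
    cut, relevant_is_high = rule
    return x >= cut if relevant_is_high else x < cut


# Invented one-dimensional scores, for illustration only.
relevant_scores = [0.9, 1.0, 0.8, 0.35]
irrelevant_scores = [0.1, 0.2, 0.3, 0.0]
p_2_given_1, p_1_given_2 = holdout_error_rates(
    relevant_scores, irrelevant_scores, fit_midpoint, classify_midpoint
)
print(p_2_given_1, p_1_given_2)
```

Because each document is classified by a rule that never saw it, the estimates are less optimistic than error rates measured on the training set itself.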
17
Precision Measure
18
Experimental Result (Relevant)
19
Experimental Result (Irrelevant)
20
Conclusion
• Precision: 85% (VSM: 77.5%)
• Multivariate Statistical Analysis
• Extensibility to multiple-category classification