Web Text Classification

Slides: 18

Provided by: Lay96

1
Web Text Classification
2
Outline of Presentation

Background Study
Literature Review
Evaluation Metrics
References

3
Background Study: Text Classification

Text Classification
A sub-field of information retrieval and text mining.
Text documents are automatically organized into pre-specified categories to facilitate document retrieval and subsequent analysis.
Text documents are represented as a term-frequency matrix built by text indexing methods.
Classification uses soft-computing techniques, such as:
Naïve Bayesian Classifier
Support Vector Machine
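
The representation and classification steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the toy documents, the labels, and all function names are assumptions, and the classifier is a plain multinomial Naïve Bayes with Laplace smoothing.

```python
from collections import Counter
import math

def tokenize(text):
    # Whitespace tokenization; real indexing would also stem and drop stop words.
    return text.lower().split()

def term_frequency_matrix(docs):
    # One row per document, one column per vocabulary term.
    vocab = sorted({t for d in docs for t in tokenize(d)})
    rows = []
    for d in docs:
        counts = Counter(tokenize(d))
        rows.append([counts[t] for t in vocab])
    return vocab, rows

def train_naive_bayes(docs, labels):
    vocab, rows = term_frequency_matrix(docs)
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        log_prior[c] = math.log(len(class_rows) / len(rows))
        totals = [sum(col) for col in zip(*class_rows)]
        denom = sum(totals) + len(vocab)  # Laplace (add-one) smoothing
        log_likelihood[c] = {t: math.log((n + 1) / denom)
                             for t, n in zip(vocab, totals)}
    return log_prior, log_likelihood

def classify(text, model):
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t in tokenize(text):
            if t in log_likelihood[c]:  # terms unseen in training are ignored
                score += log_likelihood[c][t]
        scores[c] = score
    return max(scores, key=scores.get)

# Toy training data (assumed for illustration only).
docs = ["cheap flights and hotel deals", "hotel booking and flights",
        "football match score", "league football results"]
labels = ["travel", "travel", "sport", "sport"]
model = train_naive_bayes(docs, labels)
```

Calling `classify("football score", model)` on this toy model assigns the text to the `sport` category.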

4
Background Study: Text Classification (Cont.)

Limitations of representing documents using term features
Polysemy
One word can carry different meanings.
Synonymy
Several words can carry the same meaning.
Local or global context
The main concept may be expressed by the whole document, a whole paragraph, or a single sentence.
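
A small sketch shows how these limitations surface when documents are compared purely by their terms. The phrases and the choice of Jaccard overlap as the similarity measure are assumptions made for illustration.

```python
def term_set(text):
    return set(text.lower().split())

def jaccard(a, b):
    # Share of distinct terms the two texts have in common.
    ta, tb = term_set(a), term_set(b)
    return len(ta & tb) / len(ta | tb)

# Synonymy: same meaning, no shared terms, so the overlap is zero.
synonymy = jaccard("buy car", "purchase automobile")

# Polysemy: different meanings of "bank", yet the overlap is non-zero.
polysemy = jaccard("river bank", "bank loan")
```

Term features alone therefore rate the synonymous pair as unrelated and the unrelated pair as partly similar.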

5
Background Study: Information Extraction (IE)

The task of locating relevant and specific information in text using statistical and linguistic analysis (semantic and syntactic analysis).
Key element: a set of text extraction rules or natural language processing (NLP) patterns that identify the specific information to be extracted.
Knowledge engineering approach
Rules are hand-crafted by domain experts.
Automatic training approach
A training algorithm is run on an annotated training corpus.
Instead of a set of keywords, documents can be represented by features based on the NLP patterns and the extracted information.
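
As a rough sketch of the knowledge engineering approach, extraction rules can be approximated with hand-written regular expressions. The two rules, the slot names, and the sample advertisement below are hypothetical; real IE systems rely on syntactic and semantic analysis rather than plain regexes.

```python
import re

# Hypothetical hand-crafted extraction rules, one per slot.
RULES = {
    "rooms": re.compile(r"(\d+)\s+rooms?"),
    "price": re.compile(r"\$(\d+)"),
}

def extract(text):
    # Return the slots whose rule matched, with the extracted values.
    found = {}
    for slot, pattern in RULES.items():
        values = pattern.findall(text)
        if values:
            found[slot] = values
    return found

ad = "Flat for rent, 3 rooms, $950 per month"
extracted = extract(ad)

# The document can then be represented by the rules that fired
# instead of by its raw keywords.
features = sorted(extracted)
```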

6
Background Study: Association Rule Mining (ARM)

Identify interesting relationships among items in a given data set.
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of $m$ items. Let $D$ be a set of database transactions, where each transaction $T$ is a set of items such that $T \subseteq I$.
An association rule is an implication of the form $A \Rightarrow B$, where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$.
Example of an association rule generated from a transactional database:
vanilla_coke $\Rightarrow$ ice-cream
support = 5%, confidence = 83%
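
Support and confidence follow the standard definitions: the support of a rule is the fraction of transactions that contain every item of both sides, and the confidence is that support divided by the support of the left-hand side. The sketch below computes both on a toy transaction set, which is assumed for illustration and does not reproduce the figures on the slide.

```python
def support(itemset, transactions):
    # Fraction of transactions that contain every item in itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(A and B together) / support(A)
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Toy transactional database (assumed data).
transactions = [
    {"vanilla_coke", "ice-cream"},
    {"vanilla_coke", "ice-cream", "bread"},
    {"vanilla_coke"},
    {"bread"},
    {"ice-cream", "milk"},
]

rule_support = support({"vanilla_coke", "ice-cream"}, transactions)
rule_confidence = confidence({"vanilla_coke"}, {"ice-cream"}, transactions)
```

Here two of the five transactions contain both items, and two of the three transactions with vanilla_coke also contain ice-cream.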

8
Literature Review: Text Classification

Web-page Classification through Summarization (Shen et al., 2004)
The content body of each Web page is located and used as the basic summarization component.
Full-text summarization is performed on the page content.
The summaries are then passed to standard text classification algorithms, such as Naïve Bayesian and Support Vector Machine.
Average precision achieved is 80% to 81%.

9
Literature Review: IE for Text Classification

Using IE to Classify Newspapers Advertisements (Peleato et al., 2000)
All advertisements are classified into categories using a Naïve Bayesian classifier.
Fillers are extracted for the templates of each category, where each slot carries a manually defined weight. A score is then computed for each filled template.
The results of the two modules are compared; an advertisement is treated as unclassified if the two methods produce different results.
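
The combination of the two modules can be sketched as follows. This is only an illustration of the idea described on the slide, not the system of Peleato et al.: the categories, slot names, weights, and the rule "score = sum of the weights of the filled slots" are all assumptions.

```python
# Hypothetical templates: category -> {slot: manually defined weight}.
TEMPLATES = {
    "real_estate": {"rooms": 2.0, "price": 1.0, "location": 1.5},
    "vehicles": {"make": 2.0, "price": 1.0, "mileage": 1.5},
}

def template_score(filled_slots, weights):
    # Assumed scoring: add up the weights of the slots that were filled.
    return sum(w for slot, w in weights.items() if slot in filled_slots)

def ie_category(filled_by_category):
    # Pick the category whose filled template scores highest.
    scores = {c: template_score(filled_by_category.get(c, {}), w)
              for c, w in TEMPLATES.items()}
    return max(scores, key=scores.get)

def combine(classifier_label, ie_label):
    # Keep the label only when both modules agree.
    return classifier_label if classifier_label == ie_label else "unclassified"

# Fillers extracted for one advertisement (assumed data).
filled = {
    "real_estate": {"rooms": "3", "price": "950"},
    "vehicles": {"price": "950"},
}
ie_label = ie_category(filled)
```

With these fillers the IE module chooses `real_estate`, so `combine("vehicles", ie_label)` yields `"unclassified"` while `combine("real_estate", ie_label)` keeps the label.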

10
Literature Review: NLP for Text Classification

Ntoulas et al. (2005) applied linguistic analysis to searching and ranking Web pages.
Linguistic analysis operations: POS tagging, phrase identification and named entity recognition.
Classification of results is performed to address word sense disambiguation.
No full sentence parsing is performed; the nature of the concepts extracted differs from our proposal.

11
Literature Review: IE for Text Classification

IE as a Basis for High-Precision Text Classification (Riloff and Lehnert, 1994)
Documents are represented by NLP annotations/patterns and extracted attribute values.
Each NLP pattern is associated with a category.
A document is assigned to a category as long as it contains at least one relevant NLP pattern.
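
The decision rule described above, where a single relevant pattern is enough, can be sketched like this. The pattern strings and the category name are hypothetical placeholders and are not taken from Riloff and Lehnert's system.

```python
# Hypothetical mapping from NLP pattern to the category it signals.
PATTERN_CATEGORY = {
    "<subject> was kidnapped": "terrorism",
    "bombing of <object>": "terrorism",
}

def classify_by_patterns(doc_patterns, pattern_category=PATTERN_CATEGORY):
    # A document receives a category as soon as one of its patterns
    # is associated with that category.
    return {pattern_category[p] for p in doc_patterns if p in pattern_category}

relevant = classify_by_patterns(["<subject> was kidnapped", "<subject> said"])
irrelevant = classify_by_patterns(["<subject> said"])
```

A document with no relevant pattern gets an empty set, that is, it is left out of every category.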

13
Evaluation Metrics

Evaluating the classification results
Precision
Recall
F-measure
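
For a single category, these metrics follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and the F-measure (taken here as the balanced F1, an assumption since the slide does not give a weighting) is the harmonic mean of the two. The labels below are toy data.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == category and p == category for t, p in pairs)
    fp = sum(t != category and p == category for t, p in pairs)
    fn = sum(t == category and p != category for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy gold labels and classifier output (assumed data).
true_labels = ["sport", "sport", "travel", "travel", "sport"]
predicted_labels = ["sport", "travel", "travel", "sport", "sport"]
precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels, "sport")
```

For the `sport` category there are two true positives, one false positive, and one false negative, so all three metrics come out at 2/3.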

15
References

D. Shen, Z. Chen, Q. Yang, H.J. Zeng, B. Zhang, Y. Lu and W.Y. Ma, Web-page Classification through Summarization, Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, United Kingdom, 2004, pp. 242-246.
A. Ntoulas, G. Chao and J. Cho, The Infocious Web Search Engine: Improving Web Searching Through Linguistic Analysis, Proceedings of the 14th International Conference on World Wide Web, May 2005, Chiba, Japan, ACM Press, pp. 840-849.
M.L. Antonie and O.R. Zaiane, Text Document Categorization by Term Association, Proceedings of the IEEE International Conference on Data Mining, 2002, pp. 19-26.
R.A. Peleato, J.C. Chappelier and M. Rajman, Using Information Extraction to Classify Newspapers Advertisements, Proceedings of the 5th International Conference on the Statistical Analysis of Textual Data, Lausanne, Switzerland, March 2000.

16
References

B. Liu, W. Hsu and Y. Ma, Integrating Classification and Association Rule Mining, Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, New York, 1998, AAAI Publication, pp. 80-86.
B. Liu, Y. Ma and C.K. Wong, Classification Using Association Rules: Weaknesses and Enhancements, in Vipin Kumar et al. (eds), Data Mining for Scientific Applications, 2001.
A.A. Freitas, Understanding the Crucial Differences Between Classification and Discovery of Association Rules: A Position Paper, ACM SIGKDD Explorations, July 2000, Vol. 2, Issue 1, pp. 65-69.
L. Eikvil, Information Extraction from World Wide Web: A Survey, Norwegian Computing Centre, Norway, July 1999.
E. Riloff and W. Lehnert, Information Extraction as a Basis for High-Precision Text Classification, ACM Transactions on Information Systems, Vol. 12, No. 3, July 1994, pp. 296-333.

17
References

D. Barbará, C. Domeniconi and N. Kang, Mining Relevant Text from Unlabelled Documents, Proceedings of the Third IEEE International Conference on Data Mining (ICDM '03), 2003, pp. 489-492.
E. Riloff and W. Phillips, An Introduction to the Sundance and Autoslog Systems, Technical Report, School of Computing, University of Utah, USA, November 2004.
R. Kosala and H. Blockeel, Web Mining Research: A Survey, ACM SIGKDD Explorations, July 2000, Vol. 2, Issue 1, pp. 1-15.
X.Y. Gao, M.J. Zhang and P. Andreae, Automatic Pattern Construction for Web Information Extraction, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 12, No. 4, 2004, pp. 447-470.
R.J. Mooney and R. Bunescu, Mining Knowledge from Text Using Information Extraction, ACM SIGKDD Explorations, June 2005, Vol. 7, Issue 1, pp. 3-10.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, United States of America, 2001.
