Web Text Classification

Slides: 18
Provided by: Lay96

1
Web Text Classification
2
Outline of Presentation
  • Background Study
  • Literature Review
  • Evaluation Metrics
  • References

3
Background Study: Text Classification
  • A sub-field of information retrieval and text
    mining.
  • Text documents are automatically organized into
    pre-specified categories to facilitate document
    retrieval and subsequent analysis.
  • Text documents are represented as a term-frequency
    matrix built with text indexing methods.
  • Soft-computing techniques are used for
    classification, such as:
      • Naïve Bayesian classifier
      • Support Vector Machine
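
As a minimal sketch of the representation and classifier named on this slide, the following builds a term-frequency matrix and trains a multinomial Naïve Bayes classifier with add-one smoothing. The whitespace tokenizer, the toy documents, the labels, and all function names are assumptions made for illustration; they are not taken from the presentation.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real text indexing.
    return text.lower().split()


def term_frequency_matrix(docs):
    """Return (vocab, matrix) where matrix[d][t] counts term vocab[t] in document d."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


def train_naive_bayes(matrix, labels):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    n_terms = len(matrix[0])
    model = {}
    for label in set(labels):
        rows = [row for row, y in zip(matrix, labels) if y == label]
        term_totals = [sum(col) for col in zip(*rows)]
        total = sum(term_totals)
        log_prior = math.log(len(rows) / len(matrix))
        log_likelihood = [math.log((t + 1) / (total + n_terms)) for t in term_totals]
        model[label] = (log_prior, log_likelihood)
    return model


def classify(model, vocab, text):
    counts = Counter(tokenize(text))
    vector = [counts[t] for t in vocab]  # terms outside the vocabulary are ignored

    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(f * ll for f, ll in zip(vector, log_likelihood))

    return max(model, key=score)


# Toy corpus (made up for this example).
docs = [
    "cheap flights and hotel deals",
    "football match score",
    "hotel booking discount",
    "league football results",
]
labels = ["travel", "sports", "travel", "sports"]

vocab, matrix = term_frequency_matrix(docs)
model = train_naive_bayes(matrix, labels)
print(classify(model, vocab, "hotel deals"))
```

Each row of `matrix` is one document and each column one vocabulary term, which is the shape a Support Vector Machine would also take as input.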

4
Background Study: Text Classification (Cont.)
  • Limitations of representing documents using term
    features:
  • Polysemy: one word can carry different meanings.
  • Synonymy: several words can carry the same meaning.
  • Local or global context: the main concept may be
    expressed across the whole content, one whole
    paragraph, or one full sentence.
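
The first two limitations can be seen in a small sketch that compares documents by the cosine similarity of their raw term counts. The example phrases and the function name are assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Synonymy: same meaning, no shared terms, so the similarity is zero.
synonymy = cosine_similarity("buy a car", "purchase an automobile")

# Polysemy: "bank" has a different sense in each text, yet the shared
# term makes the two texts look partly similar (about 0.5 here).
polysemy = cosine_similarity("river bank", "bank loan")

print(synonymy, polysemy)
```

Term features alone therefore miss related documents in the first case and overstate similarity in the second.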

5
Background Study: Information Extraction (IE)
  • The task of locating relevant and specific
    information in text using statistical and
    linguistic analysis (semantic and syntactic
    analysis).
  • Key element: a set of text extraction rules or
    natural language processing (NLP) patterns that
    identify the specific information to be extracted.
  • Knowledge engineering approach: rules are
    hand-crafted by domain experts.
  • Automatic training approach: a training algorithm
    is run on an annotated training corpus.
  • Instead of a set of keywords, documents can be
    represented using the NLP patterns and the
    extracted information as features.
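
A rough sketch of the knowledge engineering approach follows, with regular expressions standing in for hand-crafted extraction rules. The two rules, their slot names, and the sample text are hypothetical. Real IE systems rely on syntactic and semantic analysis rather than plain regular expressions.

```python
import re

# Hypothetical hand-crafted rules; each named group is a slot to be filled.
RULES = {
    "acquisition": re.compile(r"(?P<buyer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+)"),
    "appointment": re.compile(
        r"(?P<person>[A-Z]\w+ [A-Z]\w+) was appointed "
        r"(?P<role>[a-z ]+?) of (?P<org>[A-Z]\w+)"
    ),
}


def extract(text):
    """Return (rule_name, slots) for every rule match found in the text."""
    results = []
    for name, pattern in RULES.items():
        for match in pattern.finditer(text):
            results.append((name, match.groupdict()))
    return results


def pattern_features(text):
    """Represent a document by the set of rules that fired, instead of keywords."""
    return {name for name, _ in extract(text)}


text = (
    "Acme acquired Globex last year. "
    "Jane Smith was appointed chief executive of Globex."
)
print(extract(text))
print(pattern_features(text))
```

In the automatic training approach, the contents of `RULES` would be learned from an annotated corpus instead of being written by hand.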

6
Background Study: Association Rule Mining (ARM)
  • Identify interesting relationships among items in
    a given data set.
  • Let I = {i1, i2, ..., im} be a set of m items. Let
    D be a set of database transactions where each
    transaction T is a set of items such that T ⊆ I.
  • An association rule is an implication of the form
    A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅.
  • Example of an association rule generated from a
    transactional database:
      • vanilla_coke ⇒ ice-cream
      • support = 5%, confidence = 83%
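
Support and confidence for a rule can be computed directly from the transactions, as in the sketch below. The five toy transactions are made up for illustration and do not reproduce the figures on the slide. Support is taken here as the fraction of transactions containing every item of the rule, and confidence as that support divided by the support of the antecedent.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Toy transactional database (hypothetical).
transactions = [
    {"vanilla_coke", "ice-cream"},
    {"vanilla_coke", "ice-cream", "bread"},
    {"vanilla_coke", "bread"},
    {"ice-cream"},
    {"bread", "milk"},
]

rule_support = support(transactions, {"vanilla_coke", "ice-cream"})
rule_confidence = confidence(transactions, {"vanilla_coke"}, {"ice-cream"})
print(rule_support, rule_confidence)
```

In this toy data the two items appear together in 2 of 5 transactions, and 2 of the 3 transactions containing vanilla_coke also contain ice-cream.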

8
Literature Review: Text Classification
  • Web-page Classification through Summarization
    (Shen et al., 2004)
  • Locate the content body of the Web pages as the
    basic summarization component.
  • Full-text summarization is performed on the page
    content.
  • The summaries are then passed to standard text
    classification algorithms, such as Naïve Bayesian
    and Support Vector Machine.
  • Average precision achieved is 80–81%.

9
Literature Review: IE for Text Classification
  • Using IE to Classify Newspapers Advertisements
    (Peleato et al., 2000)
  • All advertisements are classified into different
    categories using a Naïve Bayesian classifier.
  • Fillers are extracted for the templates of each
    category, where each slot carries a manually
    defined weight. A score is then given to each
    filled template.
  • The results of both modules are compared; an
    advertisement is considered unclassified if the
    two methods produce different results.
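
The agreement check described above might be sketched as follows. The categories, slot names, weights, and function names are hypothetical, and the classifier label is passed in as given rather than computed. This is an illustration of the idea, not the authors' implementation.

```python
# Hypothetical, manually defined slot weights for each category template.
SLOT_WEIGHTS = {
    "real_estate": {"rooms": 2.0, "rent": 1.5, "location": 1.0},
    "vehicles": {"make": 2.0, "mileage": 1.5, "price": 1.0},
}


def template_score(category, filled_slots):
    """Sum the weights of the slots that the IE module managed to fill."""
    weights = SLOT_WEIGHTS[category]
    return sum(w for slot, w in weights.items() if slot in filled_slots)


def best_template_category(filled_by_category):
    """Category whose filled template receives the highest score."""
    return max(
        filled_by_category,
        key=lambda c: template_score(c, filled_by_category[c]),
    )


def combine(classifier_label, filled_by_category):
    """Keep the label only when the classifier and the IE module agree."""
    ie_label = best_template_category(filled_by_category)
    return classifier_label if classifier_label == ie_label else "unclassified"


# Fillers extracted from one advertisement (made up for this example).
filled = {
    "real_estate": {"rooms": "3", "rent": "1200"},
    "vehicles": {"price": "1200"},
}
print(combine("real_estate", filled))
print(combine("vehicles", filled))
```

Requiring agreement trades coverage for precision: advertisements on which the two modules disagree are left without a label.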

10
Literature Review: NLP for Text Classification
  • Ntoulas et al. (2005) applied linguistic
    analysis to searching and ranking Web pages.
  • Linguistic analysis operations: POS tagging,
    phrase identification and named entity
    recognition.
  • Classification of results is performed to handle
    word sense disambiguation.
  • No full sentence parsing is performed, so the
    nature of the concepts extracted differs from our
    proposal.

11
Literature Review: IE for Text Classification
  • IE as a Basis for High-Precision Text
    Classification (Riloff and Lehnert, 1994)
  • Documents are represented by NLP
    annotations/patterns and extracted attribute
    values.
  • Each NLP pattern is associated with a category.
  • A document is assigned to a category as long as
    it contains at least one relevant NLP pattern.
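
The "one relevant pattern is enough" rule can be sketched as a lookup from patterns to categories. The pattern strings and the mapping below are hypothetical placeholders, and the patterns are assumed to have already been found in the document by an IE system.

```python
# Hypothetical mapping from NLP patterns to the category each one signals.
PATTERN_CATEGORY = {
    "<subject> was kidnapped": "terrorism",
    "bombing of <object>": "terrorism",
    "<subject> acquired <object>": "business",
}


def classify_by_patterns(document_patterns):
    """Return every category for which at least one relevant pattern occurs."""
    return {
        PATTERN_CATEGORY[p] for p in document_patterns if p in PATTERN_CATEGORY
    }


print(classify_by_patterns(["bombing of <object>", "<subject> said"]))
print(classify_by_patterns(["<subject> said"]))
```

A document with no relevant pattern receives no category, which favours precision over recall.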

13
Evaluation Metrics
  • Evaluating the classification results
  • Precision
  • Recall
  • F-measure
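
The slide lists the metrics without defining them. Under the standard definitions, with TP, FP, and FN counted per category, precision is TP / (TP + FP), recall is TP / (TP + FN), and the F-measure is their weighted harmonic mean. The sketch below assumes those definitions; the function name, the `beta` parameter, and the toy labels are illustrative.

```python
def precision_recall_f(true_labels, predicted_labels, category, beta=1.0):
    """Per-category precision, recall, and F-measure (F1 when beta == 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == category and p == category for t, p in pairs)
    fp = sum(t != category and p == category for t, p in pairs)
    fn = sum(t == category and p != category for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f


# Toy gold labels and predictions (made up for this example).
true = ["sports", "sports", "travel", "travel", "sports"]
pred = ["sports", "travel", "travel", "sports", "sports"]

p, r, f = precision_recall_f(true, pred, "sports")
print(p, r, f)
```

For the "sports" category in this toy data there are 2 true positives, 1 false positive, and 1 false negative.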

15
References
  • D. Shen, Z. Chen, Q. Yang, H.J. Zeng, B. Zhang,
    Y. Lu and W.Y. Ma, Web-page Classification
    through Summarization, Proceedings of the 27th
    Annual International ACM SIGIR Conference on
    Research and Development in Information
    Retrieval, Sheffield, United Kingdom, 2004,
    pp. 242–246.
  • A. Ntoulas, G. Chao and J. Cho, The Infocious
    Web Search Engine: Improving Web Searching
    Through Linguistic Analysis, Proceedings of the
    14th International Conference on World Wide Web,
    Chiba, Japan, May 2005, ACM Press, pp. 840–849.
  • M.L. Antonie and O.R. Zaiane, Text Document
    Categorization by Term Association, Proceedings
    of the IEEE International Conference on Data
    Mining, 2002, pp. 19–26.
  • R.A. Peleato, J.C. Chappelier and M. Rajman,
    Using Information Extraction to Classify
    Newspapers Advertisements, Proceedings of the
    5th International Conference on the Statistical
    Analysis of Textual Data, Lausanne, Switzerland,
    March 2000.

16
References
  • B. Liu, W. Hsu and Y. Ma, Integrating
    Classification and Association Rule Mining,
    Proceedings of the 4th International Conference
    on Knowledge Discovery and Data Mining, New York,
    1998, AAAI Publication, pp. 80–86.
  • B. Liu, Y. Ma and C.K. Wong, Classification
    Using Association Rules: Weaknesses and
    Enhancements, in Vipin Kumar et al. (eds.), Data
    Mining for Scientific Applications, 2001.
  • A.A. Freitas, Understanding the Crucial
    Differences Between Classification and Discovery
    of Association Rules: A Position Paper, ACM
    SIGKDD Explorations, July 2000, Vol. 2, Issue 1,
    pp. 65–69.
  • L. Eikvil, Information Extraction from World
    Wide Web: A Survey, Norwegian Computing Centre,
    Norway, July 1999.
  • E. Riloff and W. Lehnert, Information Extraction
    as a Basis for High-Precision Text
    Classification, ACM Transactions on Information
    Systems, Vol. 12, No. 3, July 1994, pp. 296–333.

17
References
  • D. Barbará, C. Domeniconi and N. Kang, Mining
    Relevant Text from Unlabelled Documents,
    Proceedings of the Third IEEE International
    Conference on Data Mining (ICDM 03), 2003,
    pp. 489–492.
  • E. Riloff and W. Phillips, An Introduction to
    the Sundance and Autoslog Systems, Technical
    Report, School of Computing, University of Utah,
    USA, November 2004.
  • R. Kosala and H. Blockeel, Web Mining Research:
    A Survey, ACM SIGKDD Explorations, July 2000,
    Vol. 2, Issue 1, pp. 1–15.
  • X.Y. Gao, M.J. Zhang and P. Andreae, Automatic
    Pattern Construction For Web Information
    Extraction, International Journal of
    Uncertainty, Fuzziness and Knowledge-Based
    Systems, Vol. 12, No. 4, 2004, pp. 447–470.
  • R.J. Mooney and R. Bunescu, Mining Knowledge
    from Text Using Information Extraction, ACM
    SIGKDD Explorations, June 2005, Vol. 7, Issue 1,
    pp. 3–10.
  • J. Han and M. Kamber, Data Mining: Concepts and
    Techniques, Morgan Kaufmann Publishers, United
    States of America, 2001.