Slide file metadata: author ccoughli, created 4/16/2002, 40 slides. Provided by: ccou2 (http://www.cs.uvm.edu)

Title: Web Mining Research: A Survey


1
Web Mining Research: A Survey
  • Raymond Kosala and Hendrik Blockeel
  • ACM SIGKDD Explorations, July 2000
  • Presented by Drew DeHaas

2
Outline
  • Introduction
  • Web Mining
  • Web Content Mining
  • Web Structure Mining
  • Web Usage Mining
  • Conclusion and Exam Questions

3
Motivation for Web Mining
  • The World Wide Web is a popular, interactive
    medium for disseminating information
  • It is also huge, diverse, and dynamic, which
    raises issues of scalability, multimedia data,
    and temporal information
  • Both information users and information providers
    face problems due to the nature of the web.

4
Problems: Information Users
  • Finding relevant information
      • Relevant search results are hard to come by
      • Search services cannot index all of the
        information on the Web
  • Creating new knowledge out of the information
    available on the Web
      • Extracting knowledge from collected data
  • Personalizing the information available
      • Catering to personal preferences in content
        and presentation

5
Problems: Information Providers
  • The main problem that information providers face
    is learning about consumers/users
      • What does the customer do?
      • What does the customer want to do?
  • Personalizing to individual users
  • Using Web data to effectively market products
    and/or services

6
Other Approaches
  • Web mining is not the only approach
  • Database approach
  • Information retrieval
  • Natural language processing
      • In-depth syntactic and semantic analysis
  • Web document community
      • Standards, manually appended meta-information,
        maintained directories, etc.

7
Direct vs. Indirect Web Mining
  • Web mining techniques can be used to solve the
    information overload problems
  • Directly
      • Attack the problem with Web mining techniques
      • E.g. a newsgroup agent that classifies news as
        relevant
  • Indirectly
      • Used as part of a bigger application that
        addresses the problems
      • E.g. used to create index terms for a Web
        search service

8
The Research
  • Converging research from databases, information
    retrieval, and artificial intelligence
    (specifically NLP and machine learning)
  • The paper focuses on research from the machine
    learning point of view

10
Web Mining: Definition
  • Web mining refers to the overall process of
    discovering potentially useful and previously
    unknown information or knowledge from Web data
  • Can be viewed as four subtasks
  • Not the same as information retrieval
  • Not the same as information extraction

11
Web Mining: Subtasks
  • Resource finding
      • Retrieving intended documents
  • Information selection/pre-processing
      • Select and pre-process specific information
        from selected documents
  • Generalization
      • Discover general patterns within and across
        Web sites
  • Analysis
      • Validation and/or interpretation of mined
        patterns

12
Web Mining: Not IR or IE
  • Information retrieval (IR) is the automatic
    retrieval of all relevant documents while
    retrieving as few non-relevant documents as
    possible
  • Web document classification, which is a Web
    mining task, could be part of an IR system (e.g.
    indexing for a search engine)
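
The IR goal above (retrieve all relevant documents, and few non-relevant ones) is usually measured with recall and precision. The following is a minimal sketch of those two measures; the function name and the document ids are hypothetical and are not taken from the slides or the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids, for illustration only.
p, r = precision_recall(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
print(p, r)
```

Retrieving everything drives recall to 1 but precision down, which is why the definition on the slide asks for both at once.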

13
Web Mining: Not IR or IE
  • Information extraction (IE) aims to extract the
    relevant facts from given documents, while IR
    aims to select the relevant documents
  • IE systems for the general Web are not feasible
      • Most focus on specific Web sites or content

14
Web Mining and Machine Learning
  • Web mining is not the same as learning from the
    Web
  • Some applications of machine learning on the Web
    are not Web mining
  • Web mining also uses methods other than machine
    learning
  • However, there is a close relationship between
    Web mining and machine learning

15
Web Mining: Categories
  • Web Content Mining
      • Discovering useful information from Web
        contents/data/documents
  • Web Structure Mining
      • Discovering the model underlying link
        structures on the Web
  • Web Usage Mining
      • Trying to make sense of data generated by Web
        surfers' sessions or behaviors

16
Web Mining: The Agent Paradigm
  • User Interface Agents
      • Information retrieval agents, information
        filtering agents, personal assistant agents
  • Distributed Agents
      • Distributed agents for knowledge discovery or
        data mining
      • Problem solving by a group of agents

17
Web Mining: The Agent Paradigm
  • Content-based approach
      • The system searches for matching items by
        analyzing item content against the user's
        preferences
  • Collaborative approach
      • The system tries to find users with similar
        interests
      • Recommendations are based on what similar
        users did
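
The two approaches above can be contrasted in a few lines. This is a minimal sketch under stated assumptions: cosine similarity is one common choice of similarity measure (the slides do not name one), and all function names, users, pages, and weights are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def content_based(user_profile, item_features):
    """Rank items by how well their content matches the user's preferences."""
    return sorted(
        item_features,
        key=lambda item: cosine(user_profile, item_features[item]),
        reverse=True,
    )


def collaborative(target, ratings):
    """Recommend items the most similar other user has that `target` lacks."""
    others = [u for u in ratings if u != target]
    nearest = max(others, key=lambda u: cosine(ratings[target], ratings[u]))
    return sorted(i for i in ratings[nearest] if i not in ratings[target])


# Hypothetical data, for illustration only.
item_features = {
    "page_a": {"web": 1, "mining": 1},
    "page_b": {"cooking": 1},
}
user_profile = {"mining": 1, "survey": 1}
ratings = {
    "ann": {"p1": 1, "p2": 1},
    "bob": {"p1": 1, "p2": 1, "p3": 1},
    "cat": {"p4": 1},
}

print(content_based(user_profile, item_features))
print(collaborative("ann", ratings))
```

The content-based function never looks at other users, and the collaborative function never looks at page content; that separation is the point of the distinction on the slide.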

19
Web Content Mining: Intro
  • Motivations
      • Most of the data on the Internet is accessible
        through the Web
      • Digital libraries are becoming prevalent
      • Businesses and services are moving online
      • Applications are moving from the desktop to
        the Web

20
Web Content Mining: Intro
  • Types of data dealt with
      • Textual, image, audio, video, metadata,
        hyperlinks
  • Multimedia mining
      • Can be an instance of Web mining
  • Hidden data
      • Dynamic or private
  • Unstructured (free text), semi-structured (HTML,
    etc.), and structured (data in tables, or pages
    generated from a database)

21
Web Content Mining: IR View
  • Unstructured Documents
      • Bag-of-words or phrase-based feature
        representation
      • Features can be boolean or frequency based
      • The number of features can be reduced using
        different feature selection techniques
      • Word stemming combines morphological
        variations into one feature
      • n-gram representations can be used to encode
        some context
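
The representations listed above can be sketched as follows. This is an illustration only: `crude_stem` is a toy suffix stripper standing in for a real stemmer, the tokenizer and all names are assumptions, and feature selection is not shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, n=1, boolean=False):
    """Map a document to n-gram features with boolean or frequency values."""
    tokens = [crude_stem(t) for t in tokenize(text)]
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    return {g: 1 for g in counts} if boolean else dict(counts)


doc = "Mining the Web: web mining mines web pages"
freq = bag_of_words(doc)                # frequency-based unigrams
flags = bag_of_words(doc, boolean=True)  # boolean unigrams
bigrams = bag_of_words(doc, n=2)         # bigrams keep some word order
print(freq)
print(bigrams)
```

Here "mining" and "mines" collapse into one feature after stemming, while the bigram features distinguish "web min" from "min web", which a plain bag of words cannot.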

23
Web Content Mining: IR View
  • Semi-Structured Documents
      • Uses richer representations for features,
        based on information from the document
        structure (typically HTML and hyperlinks)
      • Uses common data mining methods (whereas
        unstructured might use more text mining
        methods)
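
As a rough illustration of structure-based features, the sketch below groups the words of an HTML page by the element they appear in and collects hyperlink targets, using Python's standard `html.parser` module. The choice of fields (`title`, `heading`, `anchor`, `body`), the class name, and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags whose text is kept as a separate feature field (an assumed selection).
TRACKED = {"title": "title", "h1": "heading", "h2": "heading",
           "h3": "heading", "a": "anchor"}


class StructureFeatures(HTMLParser):
    """Collect words per structural field, plus outgoing link targets."""

    def __init__(self):
        super().__init__()
        self.open_fields = []  # tracked fields currently open, innermost last
        self.features = {"title": [], "heading": [], "anchor": [], "body": []}
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            self.open_fields.append(TRACKED[tag])
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        field = TRACKED.get(tag)
        if field in self.open_fields:
            self.open_fields.remove(field)

    def handle_data(self, data):
        words = data.lower().split()
        field = self.open_fields[-1] if self.open_fields else "body"
        self.features[field].extend(words)


# Hypothetical page, for illustration only.
page = (
    "<html><head><title>Web Mining</title></head><body>"
    "<h1>Survey</h1><p>See <a href='http://example.org/kdd'>the paper</a>"
    " here.</p></body></html>"
)
parser = StructureFeatures()
parser.feed(page)
parser.close()
print(parser.features)
print(parser.links)
```

A learner could then weight title or anchor words differently from body words, which is the kind of richer representation the slide refers to.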

25
Web Content Mining: DB View
  • Tries to infer the structure of a Web site or
    transform a Web site into a database
      • Better information management
      • Better querying on the Web
  • Can be achieved by
      • Finding the schema of Web documents
      • Building a Web warehouse
      • Building a Web knowledge base
      • Building a virtual database

26
Web Content Mining: DB View
  • Mainly uses the Object Exchange Model (OEM)
      • Represents semi-structured data (some
        structure, no rigid schema) as a labeled graph
  • The process typically starts with manual
    selection of Web sites for content mining
  • Main application: building a structural summary
    of semi-structured data (schema extraction or
    discovery)
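
To make the idea of a structural summary concrete, the sketch below stores a small OEM-style labeled graph as nested Python dicts and lists (a simplification that only covers tree-shaped data) and summarizes it as the set of distinct label paths. The data, the function name, and the choice of label paths as the summary are assumptions for illustration, not the method of any specific system.

```python
def label_paths(node, prefix=()):
    """Return the set of distinct label paths reachable from `node`."""
    paths = set()
    if isinstance(node, dict):
        # Each key is an edge label leading to a child object.
        for label, child in node.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(node, list):
        # A list models several edges that share the same label.
        for child in node:
            paths |= label_paths(child, prefix)
    return paths


# Hypothetical semi-structured data: the two entries do not share a schema.
site = {
    "restaurant": [
        {"name": "Chez A",
         "address": {"street": "1 Main St", "city": "Burlington"}},
        {"name": "B Diner", "address": "Church St", "phone": "555-0100"},
    ]
}

summary = label_paths(site)
for path in sorted(summary):
    print(".".join(path))
```

The two restaurant objects disagree on whether `address` is atomic and whether `phone` exists, yet the path summary describes both, which is what "some structure, no rigid schema" means in practice.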

29
Web Structure Mining
  • Interested in the structure between Web documents
    (not within a document)
  • Inspired by the study of social networks and
    citation analysis
  • Example: PageRank (Google)
  • Applications
      • Discovering micro-communities on the Web
      • Measuring the completeness of a Web site
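
Since PageRank is the example named above, here is a minimal power-iteration sketch of it on a toy link graph. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the graph itself are assumptions for this illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a graph given as {page: [linked pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base score (the random-jump term).
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Assumed convention: a page with no out-links spreads its
                # rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank


# Hypothetical link graph, for illustration only.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Page `c` ends up with the highest score because three pages link to it, and `d` with the lowest because nothing links to it; the score depends only on links between documents, not on their content.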

31
Web Usage Mining
  • Tries to predict user behavior from interaction
    with the Web
  • Wide range of data (logs)
      • Web client data
      • Proxy server data
      • Web server data
  • Two common approaches
      • Map usage data into relational tables and use
        adapted data mining techniques
      • Use log data directly by utilizing special
        pre-processing techniques
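
The first approach above (mapping usage data into relational tables) might look like the sketch below, which parses server log lines into an in-memory SQLite table. It assumes the logs are in Common Log Format; the table layout, the function name, and the sample lines are hypothetical.

```python
import re
import sqlite3

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)'
)


def load_log(lines):
    """Parse log lines into a relational table; unparseable lines are skipped."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE hits "
        "(host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            host, time, method, path, status, _size = match.groups()
            db.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                (host, time, method, path, int(status)),
            )
    return db


# Hypothetical log lines, for illustration only.
sample = [
    '10.0.0.1 - - [16/Apr/2002:16:21:03 -0400] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [16/Apr/2002:16:21:40 -0400] "GET /papers.html HTTP/1.0" 200 2310',
    '10.0.0.2 - - [16/Apr/2002:16:25:10 -0400] "GET /index.html HTTP/1.0" 200 1043',
]
db = load_log(sample)
top = db.execute(
    "SELECT path, COUNT(*) AS n FROM hits GROUP BY path ORDER BY n DESC, path"
).fetchall()
print(top)
```

Once the hits are in a table, ordinary queries or standard data mining tools can be applied to them without any log-specific code.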

32
Web Usage Mining
  • Typical problem: distinguishing among unique
    users, server sessions, episodes, etc. in the
    presence of caching and proxy servers
  • Usage mining often uses some background or domain
    knowledge
      • E.g. site topology, Web content, etc.
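
A common heuristic for the session problem above is to treat requests from the same user key as one session until a gap of inactivity occurs. The sketch below assumes a 30-minute timeout and a user key built from IP address and browser name; both are assumptions, and the key is exactly what caching and shared proxies make unreliable.

```python
from collections import defaultdict


def sessionize(hits, timeout=1800):
    """Group (user_key, timestamp_seconds, path) hits into sessions.

    A new session starts when the same user key has been idle for more
    than `timeout` seconds.
    """
    by_user = defaultdict(list)
    for user, ts, path in sorted(hits, key=lambda h: (h[0], h[1])):
        by_user[user].append((ts, path))

    sessions = []
    for user, visits in by_user.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, path) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(path)
        sessions.append((user, current))
    return sessions


# Hypothetical hits; the user key combines IP address and browser name.
hits = [
    ("10.0.0.1|Mozilla", 0, "/"),
    ("10.0.0.1|Mozilla", 60, "/a"),
    ("10.0.0.1|Mozilla", 4000, "/"),
    ("10.0.0.2|Lynx", 30, "/b"),
]
sessions = sessionize(hits)
for user, pages in sessions:
    print(user, pages)
```

Two people behind one proxy would be merged under a single key here, and pages served from a browser cache never reach the log at all, which is why background knowledge such as site topology is often used to repair the sessions.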

33
Web Usage Mining
  • Two main categories
  • Learning a user profile (personalized)
      • Web users would be interested in techniques
        that learn their needs and preferences
        automatically
  • Learning user navigation patterns
    (impersonalized)
      • Information providers would be interested in
        techniques that improve the effectiveness of
        their Web site or bias users towards the
        goals of the site

35
Conclusions
  • Tried to resolve confusion regarding the term
    Web mining
      • Differentiated it from IR and IE
  • Suggested three Web mining categories
      • Content, Structure, and Usage Mining
  • Briefly described approaches for the three
    categories
  • Explored the connection with the agent paradigm

36
Exam Question 1
  • Question: Outline the main characteristics of Web
    information.
  • Answer: Web information is huge, diverse, and
    dynamic.

37
Exam Question 2
  • Question: How can data mining techniques be used
    in Web information analysis? Give at least two
    examples.
  • Answer:
      • Classification: classification on server logs
        using a decision tree or Naïve Bayes
        classifier can discover the profiles of users
        belonging to a particular class.
      • Clustering: clustering can be used to group
        users exhibiting similar browsing patterns.
      • Association analysis: association analysis can
        be used to relate pages that are most often
        referenced together in a single server
        session.
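
As a small illustration of the association analysis answer, the sketch below counts pairs of pages that occur in the same session and reports support and confidence for each frequent pair. The function name, the support threshold, and the sessions are hypothetical; real association rule miners handle larger itemsets and prune the search more carefully.

```python
from collections import Counter
from itertools import combinations


def page_pair_rules(sessions, min_support=0.5):
    """Return (page, other_page, support, confidence) for frequent page pairs."""
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for pages in sessions:
        unique = sorted(set(pages))  # count each page once per session
        page_count.update(unique)
        pair_count.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of sessions containing both pages
        if support >= min_support:
            # Confidence of "a implies b" is count(a and b) / count(a).
            rules.append((a, b, support, count / page_count[a]))
            rules.append((b, a, support, count / page_count[b]))
    return sorted(rules)


# Hypothetical server sessions, for illustration only.
sessions = [
    ["/", "/papers", "/kdd"],
    ["/", "/papers"],
    ["/", "/contact"],
    ["/papers", "/kdd"],
]
rules = page_pair_rules(sessions)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f} confidence={confidence:.2f}")
```

In this toy data every session that visits `/kdd` also visits `/papers`, so that rule has confidence 1.0, while the reverse rule is weaker.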

38
Exam Question 3
  • Question: What are the three main areas of
    interest for Web mining?
  • Answer:
      • (1) Web Content
      • (2) Web Structure
      • (3) Web Usage

39
Questions?