Title: Web Mining Research: A Survey

Web Mining Research: A Survey
  • Authors
  • Raymond Kosala
  • Hendrik Blockeel
  • Heverlee, Belgium
  • Presented by
  • Devesh Sinha

A Survey in Web Mining
  • Web mining is the use of data mining techniques
    to automatically discover and extract information
    from Web documents/services (Etzioni, 1996).
  • Web mining research is at the crossroads of
    research from several research communities
    (Kosala and Blockeel, July 2000), such as:
  • database (DB)
  • information retrieval (IR)
  • sub-areas of machine learning (ML)
  • natural language processing (NLP)

Mining the World-Wide Web
  • Motivation and opportunity
  • The WWW is a huge, widely distributed, global
    information service center for:
  • Information services: news, advertisements,
    consumer information, financial management,
    education, government, e-commerce, etc.
  • Hyper-link information
  • Access and usage information
  • WWW provides rich sources for data mining

Mining the World-Wide Web
  • Growing and changing very rapidly
  • Broad diversity of user communities
  • Only a small portion of the information on the
    Web is truly relevant or useful
  • 99% of the Web information is useless to 99% of
    Web users
  • How can we find high-quality Web pages on a
    specified topic?

Challenges in WWW interactions
  • Finding Relevant Information
  • Creating knowledge from Information available
  • Personalization of the information
  • Learning about customers / individual users

Web Mining: A more challenging task
  • Searches for
  • Web access patterns
  • Web structures
  • Regularity and dynamics of Web contents
  • Problems
  • The abundance problem
  • Limited coverage of the Web: hidden Web sources,
    majority of data in DBMS
  • Limited query interface based on keyword-oriented
    search
  • Limited customization to individual users

Web Mining Subtasks
  • Resource Finding
  • Task of retrieving intended web-documents
  • Information Selection and Pre-processing
  • Automatic selection and pre-processing of specific
    information from retrieved web resources
  • Generalization
  • Automatic discovery of patterns in web sites
  • Analysis
  • Validation and/or interpretation of mined
    patterns

Discussion Question
  • What is the difference between Information
    Retrieval and Information Extraction?

  • Information Retrieval
  • Automatic retrieval of relevant documents
  • Primary Goals
  • Indexing Text
  • Searching for useful documents in a collection
  • Bag of unordered words
  • Web document classification task is an
    instance of IR
  • Information Extraction
  • Extract relevant facts from documents
  • Primary Goals
  • Transform collection of retrieved documents to
  • Structure or representation of a document
  • Web document classification task is an
    instance of IR
  • IE has a higher level of granularity
  • Result
  • Structured Database
  • Compression or summary of Text or documents
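
The "bag of unordered words" view used in IR can be sketched in a few lines of Python. This is only an illustrative sketch: the tokenizer (lowercase runs of letters and digits) and the two sample documents are assumptions, not part of the survey.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for a document, ignoring word order."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Hypothetical sample documents.
doc_a = "Web mining applies data mining to the Web"
doc_b = "to the Web data mining applies Web mining"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# Same words in a different order give the same bag.
same_bag = bag_a == bag_b
```

Because word order is discarded, the two documents are indistinguishable to an indexer that works on this representation, which is one reason IE needs a finer-grained view of the text.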

Types of IE
  • IE from unstructured texts (Classical)
  • Unstructured: free texts, e.g. news stories
  • Basic to deep linguistic processing
  • IE from semi-structured texts (Structural)
  • Semi-structured: HTML
  • Uses meta-information, e.g. HTML tags
  • Wrapper induction
  • Machine learning used to build systems
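
As a rough illustration of structural IE, the sketch below uses HTML tags as meta-information to pull records out of a page. It is a hand-written wrapper, whereas wrapper induction would learn such rules from examples. The class name `PriceWrapper`, the table layout, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

class PriceWrapper(HTMLParser):
    """Hand-written wrapper: collect (name, price) pairs from table rows."""

    def __init__(self):
        super().__init__()
        self.rows = []          # extracted records
        self._row = None        # cells of the row being read
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Keep only rows that match the assumed two-column layout.
            if len(self._row) == 2:
                self.rows.append(tuple(self._row))
            self._row = None

# Hypothetical semi-structured page.
page = """
<table>
  <tr><td>Modem</td><td>49.00</td></tr>
  <tr><td>Monitor</td><td>199.00</td></tr>
</table>
"""

wrapper = PriceWrapper()
wrapper.feed(page)
records = wrapper.rows
```

The output is a small structured table, which matches the "structured database" result that the slides attribute to IE.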

Discussion Question
  • Is web mining the same as learning from the web, or
    machine learning techniques applied to the web?

Agent Paradigm
  • Software / Intelligent Agents
  • User Interface Agents
  • Maximize productivity of current user interaction
    by adapting behaviour
  • Distributed Agents
  • Problem Solving by group of agents Relevant
  • Mobile Agents

Web Mining Taxonomy
Web Content Mining
  • Discovery of useful information from web contents
    / data / documents
  • Information Retrieval View ( Structured
  • Assist / Improve information finding
  • Filtering information for users based on user profiles
  • Database View
  • Model Data on the web
  • Integrate them for more sophisticated queries

A Survey in Web Mining
  • What has been done in Web content mining?
  • 1. Developing intelligent tools for IR
    - Finding keywords and keyphrases
    - Discovering grammatical rules and collocations
    - Hypertext classification/categorization
    - Extracting keyphrases from text documents
    - Learning extraction models/rules
    - Hierarchical
    - Predicting (words)
  • 2. Developing Web query systems
    - Many applications such as WebLog (Lakshmanan, et al., 1996)
  • 3. Mining multimedia data
    - Fayyad, et al. (1996): mining images from satellite
    - Smyth, et al. (1996): mining images to identify small
      volcanoes on Venus

Multiple Layered Web Architecture
  • Figure labels: Generalized Descriptions; More Generalized Descriptions

Web Structure Mining
  • Finding authoritative Web pages
  • Retrieving pages that are not only relevant, but
    also of high quality, or authoritative on the
    topic
  • Hyperlinks can infer the notion of authority
  • The Web consists not only of pages, but also of
    hyperlinks pointing from one page to another
  • These hyperlinks contain an enormous amount of
    latent human annotation
  • A hyperlink pointing to another Web page
    can be considered the author's endorsement of
    the other page
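
One way to turn "a hyperlink is an endorsement" into scores is a hub/authority iteration in the style of Kleinberg's HITS algorithm. The slides do not spell out an algorithm at this point, so the sketch below is an assumption about how the idea could be made concrete. The function name `hits`, the iteration count, and the four-page link graph are hypothetical.

```python
def hits(links, iterations=20):
    """links: dict mapping a page to the list of pages it links to.

    Returns (hub, authority) score dicts, each normalized to unit length.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages that link to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score of a page: sum of authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph: three pages endorse "c", one endorses "a".
links = {"a": ["c"], "b": ["c"], "d": ["c", "a"], "c": []}
hub, auth = hits(links)
most_authoritative = max(auth, key=auth.get)
```

In this toy graph the page with the most incoming endorsements ends up with the highest authority score, and a page with no incoming links gets an authority of zero.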

A Survey in Web Mining
  • What has been done in Web structure mining?
    - Calculating the quality and relevancy of each Web page
    - Web page categorization (Chakrabarti, et al., 1998)
    - Discovering micro communities on the web
    - Example: Clever system (Chakrabarti, et al., 1999)
    - Example: Google (Brin and Page, 1998)
    - Mining context of Web warehouse (Madria, et al., 1999)
    - Measuring the completeness of the Web sites
    - Measuring the replication of Web documents

Web Usage Mining
  • Web usage mining, also known as Web log mining,
    is the process of discovering interesting patterns in
    Web access logs.
  • Commonly used approaches (Borges and Levene)
    - Map the log data into relational tables before an
      adapted data mining technique is performed.
    - Use the log data directly by utilizing special
      pre-processing techniques.
  • Typical problems
    - Distinguishing among unique users, server sessions,
      episodes, etc. in the presence of caching and proxy servers
      (McCallum, et al., 2000; Srivastava, et al., 2000)
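
A common heuristic for separating server sessions is to group requests by client address and start a new session after a period of inactivity. The sketch below assumes a 30-minute timeout and treats each IP address as one user. Both are assumptions, and the second is exactly what caching and proxy servers break. The function name `sessionize` and the sample requests are hypothetical.

```python
from collections import defaultdict

def sessionize(requests, timeout=1800):
    """Split requests into sessions.

    requests: iterable of (ip, unix_time, url) tuples.
    timeout:  assumed inactivity gap in seconds that ends a session.
    Returns a list of (ip, [urls]) tuples.
    """
    by_ip = defaultdict(list)
    for ip, ts, url in requests:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, hits in by_ip.items():
        hits.sort()                      # order each client's requests by time
        current = [hits[0][1]]
        last = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last > timeout:      # long gap: close the current session
                sessions.append((ip, current))
                current = []
            current.append(url)
            last = ts
        sessions.append((ip, current))
    return sessions

# Hypothetical requests.
requests = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.1", 60, "/products"),
    ("10.0.0.2", 90, "/"),
    ("10.0.0.1", 4000, "/"),
]
sessions = sessionize(requests)
```

Two users behind the same proxy would be merged into one session here, and pages served from a cache would be missing entirely, which is why the slides list this as a typical problem.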

Mining the World-Wide Web
  • Taxonomy (diagram): Web Content Mining, Web Structure
    Mining, Web Usage Mining
  • Web Page Content Mining
  • Web Page Summarization
  • WebLog (Lakshmanan et al., 1996), WebOQL (Mendelzon
    et al., 1998)
  • Web structuring query languages
  • Can identify information within given web pages
  • Ahoy! (Etzioni et al., 1997): uses heuristics to
    distinguish personal home pages from other web
    pages
  • ShopBot (Etzioni et al., 1997): looks for product
    prices within web pages
Mining the World-Wide Web
  • Search Result Mining
  • Search Engine Result Summarization
  • Clustering Search Result (Leouski and Croft,
    1996, Zamir and Etzioni, 1997)
  • Categorizes documents using phrases in titles and
    snippets

Mining the World-Wide Web
  • Web Structure Mining
  • Using Links
  • PageRank (Brin et al., 1998)
  • CLEVER (Chakrabarti et al., 1998)
  • Use interconnections between web pages to give
    weight to pages.
  • Using Generalization
  • MLDB (1994), VWV (1998)
  • Uses a multi-level database representation of the
    Web. Counters (popularity) and link lists are
    used for capturing structure.
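
The idea of using interconnections between pages to give weight to pages can be illustrated with a simplified PageRank-style power iteration. This is a sketch under stated assumptions rather than the exact published formulation: the damping factor of 0.85, the even redistribution of rank from pages without out-links, and the sample graph are all illustrative choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping a page to the list of pages it links to.

    Returns a dict of page -> rank; ranks sum to 1.
    """
    pages = sorted(set(links) | {q for ts in links.values() for q in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on in equal parts to its targets.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Assumed handling of pages with no out-links:
                # spread their rank evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank

# Hypothetical link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
top_page = max(rank, key=rank.get)
```

A page that nothing links to keeps only the baseline share, while a page linked from several others accumulates weight from each of them.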

Mining the World-Wide Web
  • General Access Pattern Tracking
  • Web Log Mining (Zaïane, Xin and Han, 1998)
  • Uses KDD techniques to understand general access
    patterns and trends.
  • Can shed light on better structure and grouping
    of resource providers.

Mining the World-Wide Web
  • Customized Usage Tracking
  • Adaptive Sites (Perkowitz and Etzioni, 1997)
  • Analyzes access patterns one user at a time.
  • Web site restructures itself automatically by
    learning from user access patterns.

Web Usage Mining
  • Mining Web log records to discover user access
    patterns of Web pages
  • Applications
  • Target potential customers for electronic
    commerce
  • Enhance the quality and delivery of Internet
    information services to the end user
  • Improve Web server system performance
  • Identify potential prime advertisement locations
  • Web logs provide rich information about Web
    dynamics
  • Typical Web log entry includes the URL requested,
    the IP address from which the request originated,
    and a timestamp
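
To make the log entry concrete, the sketch below parses one line in Common Log Format into the three fields named above (URL, IP address, timestamp) plus the status code. The regular expression is a simplified assumption that covers typical lines only, and the sample line and the `parse_entry` name are hypothetical.

```python
import re
from datetime import datetime

# Simplified pattern for Common Log Format:
# host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_entry(line):
    """Return a dict with ip, time, url and status, or None if no match."""
    m = CLF.match(line)
    if m is None:
        return None
    return {
        "ip": m.group("ip"),
        "time": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "url": m.group("url"),
        "status": int(m.group("status")),
    }

# Hypothetical log line.
line = '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_entry(line)
```

Records in this shape are a convenient starting point for the filtering, OLAP and mining steps described on the later slides.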

Discussion Question
  • What are the four subtasks of Web Mining ?
  • 1.
  • 2.
  • 3.
  • 4.

Techniques for Web usage mining
  • Construct multidimensional view on the Weblog
    database
  • Perform multidimensional OLAP analysis to find
    the top N users, top N accessed Web pages, most
    frequently accessed time periods, etc.
  • Perform data mining on Weblog records
  • Find association patterns, sequential patterns,
    and trends of Web accessing
  • May need additional information, e.g., user
    browsing sequences of the Web pages in the Web
    server buffer
  • Conduct studies to
  • Analyze system performance, improve system design
    by Web caching, Web page prefetching, and Web
    page swapping
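
Two of the analyses listed above, top-N summaries and association patterns, can be sketched with simple counting. This is an illustrative sketch, not the method of any cited system: the record fields (`user`, `url`, `hour`), the support threshold, and the sample data are assumptions.

```python
from collections import Counter
from itertools import combinations

def top_n(records, field, n=3):
    """Most frequent values of one field, e.g. top N users or pages."""
    return Counter(r[field] for r in records).most_common(n)

def co_visited_pairs(sessions, min_support=2):
    """Page pairs that occur together in at least min_support sessions."""
    counts = Counter()
    for pages in sessions:
        # Count each unordered pair once per session.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

# Hypothetical log records and sessions.
records = [
    {"user": "u1", "url": "/", "hour": 9},
    {"user": "u1", "url": "/products", "hour": 9},
    {"user": "u2", "url": "/", "hour": 14},
    {"user": "u3", "url": "/", "hour": 9},
    {"user": "u3", "url": "/cart", "hour": 9},
    {"user": "u1", "url": "/cart", "hour": 10},
]
sessions = [["/", "/products", "/cart"], ["/"], ["/", "/cart"]]

top_pages = top_n(records, "url", n=1)
top_users = top_n(records, "user", n=1)
busiest_hours = top_n(records, "hour", n=1)
pairs = co_visited_pairs(sessions)
```

Pairs that pass the support threshold are candidates for association rules, and the top-N lists correspond to the OLAP-style summaries on the slide.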

Mining the World-Wide Web
  • Design of a Web Log Miner
  • Web log is filtered to generate a relational
    database
  • A data cube is generated from the database
  • OLAP is used to drill-down and roll-up in the
    cube
  • OLAM is used for mining interesting knowledge
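
The cube, roll-up and slice operations can be illustrated with a minimal in-memory sketch in which a "cube" is just a count per combination of dimension values. The dimension names (`user`, `url`, `hour`), the helper names and the sample records are assumptions, and a real log miner would use a database or OLAP engine instead.

```python
from collections import Counter

DIMS = ("user", "url", "hour")  # hypothetical dimensions of the cube

def build_cube(records, dims=DIMS):
    """Count records per combination of dimension values."""
    return Counter(tuple(r[d] for d in dims) for r in records)

def roll_up(cube, keep, dims=DIMS):
    """Aggregate the cube onto the dimensions listed in keep."""
    idx = [dims.index(d) for d in keep]
    rolled = Counter()
    for cell, count in cube.items():
        rolled[tuple(cell[i] for i in idx)] += count
    return rolled

def slice_cube(cube, dim, value, dims=DIMS):
    """Keep only the cells where one dimension has the given value."""
    i = dims.index(dim)
    return Counter({cell: c for cell, c in cube.items() if cell[i] == value})

# Hypothetical cleaned log records.
records = [
    {"user": "u1", "url": "/", "hour": 9},
    {"user": "u1", "url": "/", "hour": 9},
    {"user": "u1", "url": "/cart", "hour": 10},
    {"user": "u2", "url": "/", "hour": 9},
]

cube = build_cube(records)
by_url = roll_up(cube, keep=("url",))             # roll up to one dimension
morning = slice_cube(cube, "hour", 9)             # slice on hour = 9
morning_by_user = roll_up(morning, keep=("user",))
```

Drilling down is the reverse direction: going from `by_url` back to the finer cells in `cube`.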

Figure (Web log miner pipeline), surviving labels: Web log; 1 Data Cleaning; 2 Data Cube Creation; Data Cube; Sliced and diced cube; 4 Data Mining

Website Usage Analysis (SUA)
  • Why develop a Website usage/utilization
    analysis tool?
    Knowledge about how visitors use a Website could
    - prevent disorientation and help designers place
      important information/functions exactly where
      visitors look for them and in the way users need them
    - especially help to build an adaptive Website server

Website Usage Analysis (SUA)
  • What does the SUA do?
    - Discover user navigation patterns in using the Website
    - Establish an aggregated log structure as a
      preprocessor to reduce the search space before
      the actual log mining phase
    - Introduce a model for Website usage pattern
      discovery by extending the classical mining model,
      and establish the processing framework of this model

Website Usage Analysis (SUA)
  • Website client-server architecture facilitates
    recording user behavior at every step by
    - submitting client-side log files to the server
      when users use clear functions or exit
  • The special design for local and universal
    back/forward/clear functions makes users'
    navigation patterns clearer for the designer by
    - analyzing local back/forward history and
      incorporating it with universal back/forward history

Website Usage Analysis (SUA)
  • What will be included in SUA
    1. Identify and collect log data
    2. Transfer the data to the server side and save them
       in a structure desired for analysis
    3. Prepare mined data by establishing a
       customized aggregated log tree/frame
    4. Use modifications of the typical data mining methods,
       particularly an extension of a traditional
       sequence discovery algorithm, to mine user
       navigation patterns
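
The slides do not specify the sequence discovery algorithm, so the sketch below is only a simplified stand-in: it counts contiguous navigation paths of a fixed length and keeps those that appear in enough sessions. The function name `frequent_paths`, the support threshold, and the sample sessions are hypothetical.

```python
from collections import Counter

def frequent_paths(sessions, length=2, min_support=2):
    """Contiguous page sequences found in at least min_support sessions."""
    counts = Counter()
    for pages in sessions:
        seen = set()
        for i in range(len(pages) - length + 1):
            seen.add(tuple(pages[i:i + length]))
        counts.update(seen)  # count each path at most once per session
    return {path: c for path, c in counts.items() if c >= min_support}

# Hypothetical navigation sessions.
sessions = [
    ["home", "search", "item", "cart"],
    ["home", "search", "item"],
    ["home", "help"],
    ["search", "item", "cart"],
]

patterns = frequent_paths(sessions)
longer_patterns = frequent_paths(sessions, length=3)
```

A full sequence miner would also allow gaps between pages and grow patterns level by level, and the aggregated log tree from step 3 would serve to shrink the search space before this step.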

Website Usage Analysis (SUA)
  • Problems that need to be considered
  • - How to identify the log data when a user goes
    through an uninteresting function/module
  • - What marks the end of a user session?
  • - Clients connect to the Website through proxy servers
  • Differences between Website usage analysis and common
    Web usage mining
  • - Client-side log files are available
  • - Log file format (Web log files follow the Common
    Log Format specified as a part of the HTTP protocol)
  • - No need for log file cleaning/filtering
    (which is usually performed in preprocessing of Web log
    mining)

WebSift Project

References
  • Cooley, R., Mobasher, B., and Srivastava, J. Web
    mining: Information and pattern discovery on the
    World Wide Web. IEEE Computer, pages 558-566.
  • Etzioni, O. The World Wide Web: Quagmire or gold
    mine? Communications of the ACM, 39(11):65-68, 1996.
  • Fayyad, U., Djorgovski, S., and Weir, N.
    Automating the analysis and cataloging of sky
    surveys. In Advances in Knowledge Discovery and
    Data Mining, pages 471-493. AAAI Press, 1996.
  • Kosala, R. and Blockeel, H. Web mining research:
    A survey. SIGKDD Explorations, 2(1):1-15, 2000.

  • Langley, P. User modeling in adaptive
    interfaces. In Proceedings of the Seventh
    International Conference on User Modeling, pages
    357-370, 1999.
  • Madria, S.K., Bhowmick, S.S., Ng, W.K., and Lim,
    E.-P. Research issues in web data mining. In
    Proceedings of Data Warehousing and Knowledge
    Discovery, First International Conference, DaWaK
    99, pages 303-312, 1999.
  • Masand, B. and Spiliopoulou, M. WebKDD-99:
    Workshop on web usage analysis and user
    profiling. SIGKDD Explorations, 1(2), 2000.
  • Mobasher, B., Jain, N., Han, E.H., and Srivastava,
    J. Web mining: Pattern discovery from World Wide
    Web transactions. Technical Report TR 96-060,
    University of Minnesota, Dept. of Computer
    Science, Minneapolis, 1996.

  • Smyth, P., Fayyad, U.M., Burl, M.C., and Perona,
    P. Modeling subjective uncertainty in image
    annotation. In Advances in Knowledge Discovery
    and Data Mining, pages 517-539, 1996.
  • Spiliopoulou, M. Data mining for the web. In
    Principles of Data Mining and Knowledge
    Discovery, Second European Symposium, PKDD 99,
    pages 588-589, 1999.
  • Srivastava, J., Cooley, R., Deshpande, M., and
    Tan, P.-N. Web usage mining: Discovery and
    applications of usage patterns from web data.
    SIGKDD Explorations, 1(2), 2000.
  • Zaïane, O.R., Xin, M., and Han, J. Discovering
    Web access patterns and trends by applying OLAP
    and data mining technology on Web logs. IEEE,
    pages 19-29, 1998.