Web Mining Research : A survey presentation

1
Web Mining Research: A Survey

Authors
Raymond Kosala
Hendrik Blockeel
Heverlee, Belgium
Presented by
Devesh Sinha

2
A Survey in Web Mining

Web mining is the use of data mining techniques
to automatically discover and extract information
from Web documents/services (Etzioni, 1996).
Web mining research sits at the crossroads of
research from several communities
(Kosala and Blockeel, July 2000), such as
database (DB)
information retrieval (IR)
artificial intelligence (AI), especially the
sub-areas of machine learning (ML) and
natural language processing (NLP)

3
Mining the World-Wide Web

Motivation, Opportunity
The WWW is a huge, widely distributed, global
information service center for
Information services: news, advertisements,
consumer information, financial management,
education, government, e-commerce, etc.
Hyper-link information
Access and usage information
WWW provides rich sources for data mining

4
Mining the World-Wide Web

Growing and changing very rapidly
Broad diversity of user communities
Only a small portion of the information on the
Web is truly relevant or useful
99% of the Web information is useless to 99% of
Web users
How can we find high-quality Web pages on a
specified topic?

5
Challenges in WWW interactions

Finding Relevant Information
Creating knowledge from Information available
Personalization of the information
Learning about customers / individual users

6
Web Mining: A more challenging task

Searches for
Web access patterns
Web structures
Regularity and dynamics of Web contents
Problems
The abundance problem
Limited coverage of the Web: hidden Web sources,
majority of data in DBMS
Limited query interface based on keyword-oriented
search
Limited customization to individual users

7
Web Mining Subtasks

Resource Finding
Task of retrieving intended web-documents
Information Selection and Pre-processing
Automatic selection and pre-processing of specific
information from retrieved web resources
Generalization
Automatic Discovery of patterns in web sites
Analysis
Validation and / or interpretation of mined
patterns

8
Discussion Question

What is the difference between Information
Retrieval and Information Extraction?

9
IE - IR

Information Retrieval
Automatic retrieval of relevant documents
Primary Goals
Indexing Text
Searching for useful documents in a collection
Bag of unordered words
Web document classification task is an
instance of IR
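
The "bag of unordered words" view can be illustrated with a small sketch. It is only an illustration, not a method from the slides: the tokenizer, the `overlap_score` function, and the two sample documents are assumptions made up for this example.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase, split on non-letters, and count terms; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def overlap_score(query, document):
    """Score a document by how many query-term occurrences it shares with the query."""
    q, d = bag_of_words(query), bag_of_words(document)
    return sum(min(q[t], d[t]) for t in q)

# Hypothetical two-document collection.
docs = {
    "d1": "Web mining applies data mining to Web documents",
    "d2": "Volcanoes on Venus were found in radar images",
}
ranked = sorted(docs, key=lambda name: overlap_score("web data mining", docs[name]), reverse=True)
```

Because only term counts are kept, "mining the Web" and "Web the mining" map to the same representation, which is exactly the information IE tries to preserve and IR gives up.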

Information Extraction
Extract relevant facts from documents
Primary Goals
Transform a collection of retrieved documents into
information.
Structure or representation of a document
IE has a higher level of granularity
Result
Structured Database
Compression or summary of Text or documents

10
Types of IE

IE from unstructured texts (Classical)
Unstructured: free texts, e.g. news stories
Basic to deep linguistic processing
IE from semi-structured texts (Structural)
Semi-structured: e.g. HTML
Uses meta-information, e.g. HTML tags
Wrapper induction
Machine learning used to build systems
(semi-)automatically
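
To make the idea of a wrapper concrete, here is a minimal hand-written one that uses HTML tags as meta-information. It is a sketch under assumptions: the class name `PriceWrapper`, the `name`/`price` cell classes, and the sample page are invented for illustration. Wrapper induction would learn such extraction rules from examples rather than have them coded by hand.

```python
from html.parser import HTMLParser

class PriceWrapper(HTMLParser):
    """Collect text inside <td class="name"> and <td class="price"> cells."""

    def __init__(self):
        super().__init__()
        self.field = None   # field currently being read, if any
        self.current = {}   # record under construction for the current row
        self.rows = []      # extracted records

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "td" and cls in ("name", "price"):
            self.field = cls

    def handle_data(self, data):
        if self.field:
            self.current[self.field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self.field = None
        elif tag == "tr" and self.current:
            self.rows.append(self.current)
            self.current = {}

# Hypothetical semi-structured page.
page = """
<table>
<tr><td class="name">Modem</td><td class="price">49.00</td></tr>
<tr><td class="name">Scanner</td><td class="price">120.50</td></tr>
</table>
"""
wrapper = PriceWrapper()
wrapper.feed(page)
records = wrapper.rows
```

The output is a list of structured records, matching the "Structured Database" result that the IE slide describes.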

11
Discussion Question

Is web mining the same as learning from the web, or
as machine learning techniques applied to the web?

12
Agent Paradigm

Software / Intelligent Agents
User Interface Agents
Maximize productivity of current user interaction
by adapting behaviour
Distributed Agents
Problem solving by a group of agents
Relevant Agents
Mobile Agents

13
Web Mining Taxonomy

14
Web Content Mining

Discovery of useful information from web contents
/ data / documents
Information Retrieval View (Structured and
Semi-Structured)
Assist / Improve information finding
Filtering information for users based on user profiles
Database View
Model Data on the web
Integrate them for more sophisticated queries

15
A Survey in Web Mining

What has been done in Web content mining?
1. Developing intelligent tools for IR
- Finding keywords and keyphrases
- Discovering grammatical rules and collocations
- Hypertext classification/categorization
- Extracting keyphrases from text documents
- Learning extraction models/rules
- Hierarchical clustering
- Predicting (words) relationship
2. Developing Web query systems
- Many applications, such as WebLog
(Lakshmanan, et al., 1996)
3. Mining multimedia data
- Fayyad, et al. (1996): mining images from
satellites
- Smyth, et al. (1996): mining images to
identify small volcanoes on Venus

16
Multiple Layered Web Architecture

Layer n: More Generalized Descriptions
...
Layer 1: Generalized Descriptions
Layer 0

17
Web Structure Mining

Finding authoritative Web pages
Retrieving pages that are not only relevant, but
also of high quality, or authoritative on the
topic
Hyperlinks can be used to infer the notion of authority
The Web consists not only of pages, but also of
hyperlinks pointing from one page to another
These hyperlinks contain an enormous amount of
latent human annotation
A hyperlink pointing to another Web page can be
considered the author's endorsement of the other
page
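
One way to turn link endorsements into authority scores is a hubs-and-authorities iteration in the style of HITS. The slides do not spell out an algorithm here, so the sketch below is an assumption chosen for illustration; the function name `hits` and the toy link graph are hypothetical.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A page is a good hub if it points to good authorities.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Toy graph: pages a and b both endorse c, and c endorses d.
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
hub_scores, auth_scores = hits(toy_links)
best_authority = max(auth_scores, key=auth_scores.get)
```

In this toy graph page c collects two endorsements and ends up as the strongest authority, while a and b receive equal hub scores for pointing to it.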

18
A Survey in Web Mining

What has been done in Web structure mining?
1. Calculating the quality and relevancy of each
Web page
- Web page categorization
(Chakrabarti, et al., 1998)
- Discovering micro communities on the web
- Example: Clever system (Chakrabarti, et al., 1999)
- Example: Google (Brin and Page, 1998)
2. Mining context of Web warehouse
(Madria, et al., 1999)
- Measuring the completeness of the Web sites
- Measuring the replication of Web documents

19
Web Usage Mining

Web usage mining, also known as Web log mining,
is the process of discovering interesting patterns
in Web access logs.
Commonly used approaches (Borges and Levene, 1999)
- Map the log data into relational tables before
an adapted data mining technique is performed.
- Use the log data directly by utilizing special
pre-processing techniques.
Typical problems
- Distinguishing among unique users, server
sessions, episodes, etc. in the presence of
caching and proxy servers (McCallum, et al., 2000;
Srivastava, et al., 2000).
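
A common pre-processing heuristic for the session problem is to split each client's requests wherever the gap between them exceeds a timeout. The sketch below assumes a 30-minute timeout and treats the IP address as the user, both of which are assumptions for illustration; the function name `sessionize` and the sample log are hypothetical.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides

def sessionize(requests, gap=SESSION_GAP_SECONDS):
    """Group (ip, timestamp_seconds, url) records into per-IP sessions.

    A new session starts whenever the same IP is silent for longer than `gap`.
    """
    by_ip = defaultdict(list)
    for ip, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        by_ip[ip].append((ts, url))
    sessions = []
    for ip, visits in by_ip.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append((ip, current))
                current = []
            current.append(url)
        sessions.append((ip, current))
    return sessions

# Hypothetical log records: (ip, seconds since start, url).
log = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 120, "/products.html"),
    ("10.0.0.1", 7200, "/index.html"),
    ("10.0.0.2", 60, "/about.html"),
]
sessions = sessionize(log)
```

Keying on the IP address is precisely what breaks down behind proxy servers, where many users share one address, and with caching, where some page views never reach the server log.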

20
Mining the World-Wide Web
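Web Content Mining: Web Page Content Mining,
Search Result Mining
Web Structure Mining
Web Usage Mining: General Access Pattern Tracking,
Customized Usage Tracking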

Web Page Content Mining
Web Page Summarization
WebLog (Lakshmanan et al., 1996), WebOQL (Mendelzon
et al., 1998)
Web structuring query languages
Can identify information within given web pages
Ahoy! (Etzioni et al., 1997): uses heuristics to
distinguish personal home pages from other web
pages
ShopBot (Etzioni et al., 1997): looks for product
prices within web pages

21
Mining the World-Wide Web

Search Result Mining
Search Engine Result Summarization
Clustering Search Results (Leouski and Croft,
1996; Zamir and Etzioni, 1997)
Categorizes documents using phrases in titles and
snippets
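
As a rough illustration of grouping results by phrases in titles and snippets, the sketch below groups results that share a two-word phrase. This is a deliberately simplified stand-in and not the method of the cited papers; the helper names and the sample snippets are assumptions.

```python
from collections import defaultdict
import re

def bigrams(text):
    """Return the set of two-word phrases in a title or snippet."""
    words = re.findall(r"[a-z]+", text.lower())
    return {" ".join(pair) for pair in zip(words, words[1:])}

def group_by_shared_phrase(results, min_size=2):
    """Map each two-word phrase to the result ids whose text contains it."""
    groups = defaultdict(set)
    for rid, text in results.items():
        for phrase in bigrams(text):
            groups[phrase].add(rid)
    # Keep only phrases shared by at least `min_size` results.
    return {p: ids for p, ids in groups.items() if len(ids) >= min_size}

# Hypothetical search results for the ambiguous query "jaguar".
results = {
    1: "Jaguar cars for sale",
    2: "Used jaguar cars and parts",
    3: "Jaguar habitat in the rain forest",
    4: "Rain forest animals",
}
clusters = group_by_shared_phrase(results)
```

The shared phrases double as readable cluster labels, which is the main appeal of phrase-based grouping over purely numeric similarity.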

22
Mining the World-Wide Web

Web Structure Mining
Using Links
PageRank (Brin et al., 1998)
CLEVER (Chakrabarti et al., 1998)
Use interconnections between web pages to give
weight to pages.
Using Generalization
MLDB (1994), VWV (1998)
Uses a multi-level database representation of the
Web. Counters (popularity) and link lists are
used for capturing structure.
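
The idea of using interconnections to give weight to pages can be sketched as a PageRank-style power iteration. The damping factor of 0.85, the even redistribution of rank from pages without out-links, and the four-page graph are assumptions for this illustration, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {q for ts in links.values() for q in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on in equal parts to its targets.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical four-page web.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

Page c is linked from three pages and ends up with the largest weight, while d, which nobody links to, keeps only the baseline share.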

23
Mining the World-Wide Web

General Access Pattern Tracking
Web Log Mining (Zaïane, Xin and Han, 1998)
Uses KDD techniques to understand general access
patterns and trends.
Can shed light on better structure and grouping
of resource providers.

24
Mining the World-Wide Web

Customized Usage Tracking
Adaptive Sites (Perkowitz and Etzioni, 1997)
Analyzes the access patterns of one user at a time.
Web site restructures itself automatically by
learning from user access patterns.

25
Web Usage Mining

Mining Web log records to discover user access
patterns of Web pages
Applications
Target potential customers for electronic
commerce
Enhance the quality and delivery of Internet
information services to the end user
Improve Web server system performance
Identify potential prime advertisement locations
Web logs provide rich information about Web
dynamics
Typical Web log entry includes the URL requested,
the IP address from which the request originated,
and a timestamp
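
A minimal sketch of pulling those three fields out of a log line, assuming the server writes the Common Log Format. The regular expression, the function name `parse_entry`, and the sample line are illustrative assumptions.

```python
import re
from datetime import datetime

# host ident authuser [date] "method url protocol" status bytes
CLF = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_entry(line):
    """Return the IP, URL, timestamp and status of one log line, or None."""
    m = CLF.match(line)
    if m is None:
        return None
    return {
        "ip": m.group("ip"),
        "url": m.group("url"),
        "timestamp": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "status": int(m.group("status")),
    }

# Hypothetical log line.
sample = '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_entry(sample)
```

Lines that do not match the pattern return `None`, so malformed entries can be dropped during data cleaning instead of stopping the run.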

26
Discussion Question

What are the four subtasks of Web Mining?
1.
2.
3.
4.

27
Techniques for Web usage mining

Construct multidimensional view on the Weblog
database
Perform multidimensional OLAP analysis to find
the top N users, top N accessed Web pages, most
frequently accessed time periods, etc.
Perform data mining on Weblog records
Find association patterns, sequential patterns,
and trends of Web accessing
May need additional information, e.g., user
browsing sequences of the Web pages in the Web
server buffer
Conduct studies to
Analyze system performance, improve system design
by Web caching, Web page prefetching, and Web
page swapping
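
The "top N" style of analysis reduces to counting along one dimension of the log. The sketch below assumes log entries have already been parsed into dictionaries; the field names `user`, `url`, and `hour` and the sample entries are hypothetical.

```python
from collections import Counter

def top_n(entries, dimension, n=2):
    """Return the n most frequent values of one log dimension."""
    return Counter(e[dimension] for e in entries).most_common(n)

# Hypothetical pre-parsed log entries.
entries = [
    {"user": "u1", "url": "/index.html", "hour": 9},
    {"user": "u1", "url": "/products.html", "hour": 9},
    {"user": "u2", "url": "/index.html", "hour": 14},
    {"user": "u3", "url": "/index.html", "hour": 9},
]
top_pages = top_n(entries, "url")
top_users = top_n(entries, "user", n=1)
busiest_hours = top_n(entries, "hour", n=1)
```

The same helper answers all three questions from the slide (top users, top pages, busiest periods) because each is a count over a different dimension.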

28
Mining the World-Wide Web

Design of a Web Log Miner
Web log is filtered to generate a relational
database
A data cube is generated from the database
OLAP is used to drill-down and roll-up in the
cube
OLAM is used for mining interesting knowledge
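
The cube operations named above can be sketched with plain counters. This is an illustration under assumptions rather than the design of any particular system: the dimensions `day`, `section`, and `domain`, the helper names, and the sample rows are invented for the example.

```python
from collections import Counter

def build_cube(rows, dims):
    """Count hits for every combination of the given dimensions."""
    return Counter(tuple(row[d] for d in dims) for row in rows)

def roll_up(cube, dims, keep):
    """Aggregate the cube down to the dimensions listed in `keep`."""
    idx = [dims.index(d) for d in keep]
    rolled = Counter()
    for cell, count in cube.items():
        rolled[tuple(cell[i] for i in idx)] += count
    return rolled

def slice_cube(cube, dims, dim, value):
    """Keep only the cells where one dimension has a fixed value."""
    i = dims.index(dim)
    return Counter({cell: c for cell, c in cube.items() if cell[i] == value})

DIMS = ("day", "section", "domain")
# Hypothetical cleaned log rows.
rows = [
    {"day": "Mon", "section": "/docs", "domain": ".edu"},
    {"day": "Mon", "section": "/docs", "domain": ".com"},
    {"day": "Mon", "section": "/shop", "domain": ".com"},
    {"day": "Tue", "section": "/docs", "domain": ".edu"},
]
cube = build_cube(rows, DIMS)
hits_per_day = roll_up(cube, DIMS, ["day"])
monday = slice_cube(cube, DIMS, "day", "Mon")
```

Drill-down is the reverse of `roll_up`: going back from `hits_per_day` to the full cube restores the section and domain detail.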

Web log -> 1 Data Cleaning -> Database
Database -> 2 Data Cube Creation -> Data Cube
Data Cube -> 3 OLAP -> Sliced and diced cube
Sliced and diced cube -> 4 Data Mining -> Knowledge

29
Website Usage Analysis (SUA)

Why develop a Website usage/utilization
analysis tool?
Knowledge about how visitors use a Website could
- prevent disorientation and help designers place
important information/functions exactly where the
visitors look for it and in the way users need it
- especially help to build up an adaptive Website
server

30
Website Usage Analysis (SUA)

What does the SUA do?
Discover user navigation patterns in using the
Website
- Establish an aggregated log structure as a
preprocessor to reduce the search space before
the actual log mining phase
- Introduce a model for Website usage pattern
discovery by extending the classical mining model,
and establish the processing framework of this
model

31
Website Usage Analysis (SUA)

The Website client-server architecture facilitates
recording user behavior at every step by
- submitting client-side log files to the server
when users use clear functions or exit
windows/modules
The special design for local and universal
back/forward/clear functions makes users'
navigation patterns clearer for the designer by
- analyzing the local back/forward history and
incorporating it with the universal back/forward
history

32
Website Usage Analysis (SUA)

What will be included in SUA
1. Identify and collect log data
2. Transfer the data to the server side and save
them in a structure desired for analysis
3. Prepare mined data by establishing a
customized aggregated log tree/frame
4. Use modifications of the typical data mining
methods, particularly an extension of a traditional
sequence discovery algorithm, to mine user
navigation patterns
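
The slides do not say which sequence discovery algorithm is extended, so the following is only a simple stand-in for step 4: it counts contiguous page sequences that occur in enough sessions. The function name `frequent_paths`, the support threshold, and the sample sessions are assumptions.

```python
from collections import Counter

def frequent_paths(sessions, length=2, min_support=2):
    """Count contiguous page sequences of a given length across sessions.

    Support is the number of sessions containing the path at least once.
    """
    support = Counter()
    for pages in sessions:
        # Use a set so a path repeated within one session counts once.
        seen = {tuple(pages[i:i + length]) for i in range(len(pages) - length + 1)}
        support.update(seen)
    return {path: n for path, n in support.items() if n >= min_support}

# Hypothetical navigation sessions.
sessions = [
    ["home", "search", "results", "item"],
    ["home", "search", "help"],
    ["search", "results", "item"],
]
patterns = frequent_paths(sessions)
```

Raising `length` looks for longer navigation paths, and an aggregated log structure like the one in step 3 would let these counts be read off without rescanning every session.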

33
Website Usage Analysis (SUA)

Problems that need to be considered
- How to identify the log data when a user goes
through an uninteresting function/module
- What marks the end of a user session?
- Clients connecting to the Website through proxy
servers
Differences between Website usage analysis and
common Web usage mining
- Client-side log files are available
- Log file format (Web log files follow the Common
Log Format used by HTTP servers)
- Log file cleaning/filtering is not necessary
(it is usually performed in the preprocessing
step of Web log mining)

34
WebSift Project

35
Reference

Cooley, R., Mobasher, B., and Srivastava, J. Web
Mining: Information and Pattern Discovery on the
World Wide Web. IEEE Computer, pages 558-566,
1997.
Etzioni, O. The World Wide Web: Quagmire or gold
mine. Communications of the ACM, 39(11):65-68,
1996.
Fayyad, U., Djorgovski, S., and Weir, N.
Automating the analysis and cataloging of sky
surveys. In Advances in Knowledge Discovery and
Data Mining, pages 471-493. AAAI Press, 1996.
Kosala, R. and Blockeel, H. Web Mining Research:
A Survey. SIGKDD Explorations, 2(1):1-15, 2000.

36
Reference

Langley, P. User modeling in adaptive
interfaces. In Proceedings of the Seventh
International Conference on User Modeling, pages
357-370, 1999.
Madria, S.K., Bhowmick, S.S., Ng, W.K., and Lim,
E.-P. Research issues in web data mining. In
Proceedings of Data Warehousing and Knowledge
Discovery, First International Conference, DaWaK
'99, pages 303-312, 1999.
Masand, B. and Spiliopoulou, M. WebKDD-99:
Workshop on web usage analysis and user
profiling. SIGKDD Explorations, 1(2), 2000.
Mobasher, B., Jain, N., Han, E.H., and Srivastava,
J. Web mining: Pattern discovery from World Wide
Web transactions. Technical Report TR 96-060,
University of Minnesota, Dept. of Computer
Science, Minneapolis, 1996.

37
Reference

Smyth, P., Fayyad, U.M., Burl, M.C., and Perona,
P. Modeling subjective uncertainty in image
annotation. In Advances in Knowledge Discovery
and Data Mining, pages 517-539, 1996.
Spiliopoulou, M. Data mining for the web. In
Principles of Data Mining and Knowledge
Discovery, Second European Symposium, PKDD '99,
pages 588-589, 1999.
Srivastava, J., Cooley, R., Deshpande, M., and
Tan, P.-N. Web usage mining: Discovery and
applications of usage patterns from web data.
SIGKDD Explorations, 1(2), 2000.
Zaïane, O.R., Xin, M., and Han, J. Discovering
Web access patterns and trends by applying OLAP
and data mining technology on Web logs. IEEE,
pages 19-29, 1998.
