Integrating Automatic Web Page Clustering into Web Log Association Mining - PowerPoint PPT Presentation

1
Integrating Automatic Web Page Clustering into
Web Log Association Mining
  • Jiayun Guo
  • MCS candidate at FCS of Dal

2
Outline
  • Introduction
  • Related Work
  • System Design
  • Results and Evaluations
  • Conclusion and Future Work

3
Introduction
  • Motivation
  • -- Current Web usage mining methods are all
    based on the URL directory and sub-directories
    and do not use the content of the actual Web
    pages
  • -- Although some Web sites are organized by
    content, many others are not. That is, the URL
    directory and sub-directories do not always
    indicate the content

4
Motivation
  • Example of different Web construction basis

Table-1 Different Web Site Organizations
1 In fact, the Web page organizations of some
large corporations such as Microsoft and IBM are
much more complex, combining several means of
categorization.
2 The real usernames of the authors of Web pages
are replaced with anonymous usernames such as
prof01 throughout this thesis for privacy reasons.
5
Proposed Solution Strategy
Figure-2 System Structure
6
Introduction
  • Research Goals
  • Instead of paying attention only to the Web URL
    directories, consider the Web page content when
    mining the Web server log files
  • Research Contribution
  • -- Proposing a new initiative to combine Web
    content mining and Web usage mining
  • -- Providing more accurate Web usage mining
    methods especially when the Web site is not
    organized according to the content

7
Related Work
  • Web Mining
  • Web Mining Categories
  • Combining Web Mining Categories
  • Web Mining Applications
  • Document Clustering
  • Vector-Space Model
  • TF-IDF
  • Similarity/dissimilarity measures
  • Association Rule Mining
  • Apriori

8
Web Mining
  • Web Content Mining
  • discovery of useful information from the Web
    content, including documents
  • Web Structure Mining
  • discovery of the model underlying the link
    structures of the Web
  • Web Usage Mining
  • discovery of useful patterns from the data
    generated by Web surfers' sessions or
    behaviors

9
Web Content Mining
  • Web content consists of a wide range of
    different types of data from various sources
  • Gopher, FTP, Usenet
  • text, hypertext, multimedia, metadata
  • Two characteristics of Web content data
  • Large volume
  • Heterogeneity

10
Web Content Mining
  • IR view
  • To develop more intelligent IR system in
    information finding and filtering
  • Document Clustering
  • DB view
  • To integrate and organize the heterogeneous and
    semi-structured Web data

Figure-1 Categorization of Web Content Mining
11
Web Usage Mining
  • Utilizes the secondary data derived from users'
    interactions with the Web
  • Includes data from Web server access logs, proxy
    server logs, browser logs, user profiles,
    registration data, user sessions or transactions,
    cookies, user queries, bookmark data, mouse
    clicks and scrolls, and any other data that
    results from these interactions
  • Different concerns include CRM, Fraud Detection,
    Web Cache Prediction and Enhancement, Web Site
    construction

12
Combining Web Mining Categories
  • Web content and structure mining are often
    combined together
  • Web content mining sometimes utilizes the user
    profiles from survey or registration
  • Web usage mining is relatively isolated from the
    other two

13
Web Mining Tools
  • General Data Mining Packages
  • DBMiner, WEKA, SAS, etc.
  • Web Statistical or Mining Packages
  • WEBMINER, AWStats, Webalizer, Analog, etc.

14
Document Clustering
  • Clustering
  • To group objects into clusters
  • Objects in the same cluster are similar to one
    another
  • Objects in different clusters are dissimilar
  • Clustering Algorithms
  • Partitioning: k-means
  • Hierarchical
  • Agglomerative (bottom-up): AGNES (AGglomerative
    NESting)
  • Divisive (top-down): DIANA (DIvisive ANAlysis)
  • Density-based: DBSCAN (Density-Based Spatial
    Clustering of Applications with Noise)

15
Document Clustering
  • Takes textual documents as data objects for
    clustering
  • Vector-Space Model
  • each document is represented by a vector of
    frequencies of features
  • TF-IDF weighting scheme
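
The vector-space model with TF-IDF weighting can be sketched as follows. This is a minimal illustration, assuming the common variant weight = tf × log(N / df); the slides do not state which variant was used, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: raw term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Hypothetical toy corpus of already-tokenized documents.
docs = [["web", "log", "mining"], ["web", "page", "clustering"], ["log", "file"]]
vectors = tf_idf(docs)
```

Terms that occur in few documents ("mining") receive a higher weight than terms shared by several documents ("web").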

16
Distance Measure
  • sin/cos measure
  • Euclidean measure
  • Normalized Euclidean measure
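
The measures listed above can be sketched on plain feature vectors. This is an illustration under assumptions: "normalized Euclidean" is read here as the Euclidean distance between unit-length vectors, which the slides do not confirm, and all function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalized_euclidean(a, b):
    """Euclidean distance after scaling both vectors to unit length (assumed reading)."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v] if n else list(v)
    return euclidean(unit(a), unit(b))
```

Length normalization matters for documents: a long page and a short page on the same topic point in the same direction but are far apart in raw Euclidean terms.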

17
Evaluation for Document Clustering
  • Internal quality measure
  • Compares different sets of clusters without
    reference to external knowledge
  • External quality measure
  • Evaluates how well the clustering is working by
    comparing the groups produced by clustering
    techniques to known classes

18
Association Rule Mining
  • One of the most important techniques in Data
    Mining
  • To find interesting association or correlation
    relationships among a large set of data items
  • Two important measures: support and confidence
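
The measures conventionally used for association rules are support and confidence, the same two thresholds that appear in the Apriori slide. Their standard definitions (as given in textbooks such as [DMCT01]) for a rule A ⇒ B over a set of transactions D are shown below; the symbols A, B, T and D are notation chosen for this sketch.

```latex
\[
\mathrm{support}(A \Rightarrow B)
  = P(A \cup B)
  = \frac{\lvert \{\, T \in D : A \cup B \subseteq T \,\} \rvert}{\lvert D \rvert},
\qquad
\mathrm{confidence}(A \Rightarrow B)
  = P(B \mid A)
  = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
\]
```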

19
System Design and Implementation
  • Preprocessing
  • reading in the Web log file, and reformatting
  • fetching the Web pages, and storing to local
    host
  • Document Clustering
  • calculating the frequencies of features in
    documents
  • clustering the Web documents using k-means
    algorithm
  • summarizing clusters manually based on the
    clustering results
  • evaluating clustering results.
  • Integration
  • Web Log Mining
  • applying the Apriori algorithm to the two data
    collections
  • evaluating the results of the association rules
    obtained.

20
Figure-3 System Architecture
21
Preprocessing
  • Re-formatting Web server log file

Table-4 Web Log File Re-formatting

1 After the step of Integration, the URL in log
file is substituted by the directory root only,
/user for example.
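
Reading in a Web server log line can be sketched as below. This is an assumption-laden illustration: it assumes the Apache Common Log Format described in [WWW2], and the field names (`host`, `time`, `method`, `url`, `status`, `size`) and the sample line are hypothetical, since the actual columns of Table-4 are not reproduced here.

```python
import re

# Assumed layout: host ident authuser [time] "METHOD url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical sample record using an anonymized username.
sample = ('192.0.2.1 - - [10/Oct/2004:13:55:36 -0300] '
          '"GET /~prof01/index.html HTTP/1.0" 200 2326')
record = parse_line(sample)
```

Lines that do not match the assumed layout return `None`, so they can be counted or skipped explicitly rather than silently mis-parsed.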
22
Preprocessing
  • Retrieving Web page documents
  • Web document type screening
  • .html, .htm, .shtml, .xml, .php, .cgi (hypertext
    files)
  • .txt (plain text files)
  • pages from professors' directories only
  • Web access status screening
  • Screen out records whose access status code
    begins with 4
  • Exception: 401 (Unauthorized)
  • Web page storage
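
The two screening steps can be sketched as a single filter. This is a minimal illustration: it reads the 401 exception as "records with status 401 are kept", which is an interpretation of the slide, and it omits the restriction to professors' directories because the directory names are not given. The function name `keep_record` is hypothetical.

```python
import os
from urllib.parse import urlsplit

# Extensions listed on the slide: hypertext files plus plain text.
TEXT_EXTENSIONS = {".html", ".htm", ".shtml", ".xml", ".php", ".cgi", ".txt"}

def keep_record(url, status):
    """Return True if a log record passes both screens."""
    # Status screen: drop 4xx codes, except 401 (assumed to be retained).
    if 400 <= status < 500 and status != 401:
        return False
    # Type screen: look only at the path, ignoring any query string.
    path = urlsplit(url).path
    extension = os.path.splitext(path)[1].lower()
    return extension in TEXT_EXTENSIONS
```

For example, `keep_record("/~prof01/pic.jpg", 200)` rejects an image, while `keep_record("/~prof01/a.html", 401)` keeps the unauthorized-access record.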

23
Document Clustering
  • Document Preprocessing
  • HTML, XML, SGML tag cleaning
  • Eliminating punctuation and numbers
  • Eliminating stop words (word-rep only)
  • Word Stemming (word-rep only)
  • Frequency Calculation
  • TF-IDF weighting measure used
  • Document Clustering
  • K-means
  • Summarization
  • Evaluation
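
The document preprocessing steps for the word representation can be sketched as below. This is only an illustration: the regex-based tag removal, the tiny stop-word list, and the suffix-stripping `naive_stem` are hypothetical stand-ins, not the tag cleaner [WWW4] or the stemmer actually used.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for"}

def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(html):
    """Tag cleaning, punctuation/number removal, stop-word removal, stemming."""
    text = re.sub(r"<[^>]+>", " ", html)        # remove markup tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # remove punctuation and digits
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess(
    "<html><body><h1>Clustering the Web logs</h1>"
    "<p>42 pages visited!</p></body></html>"
)
```

The resulting token list is what a frequency calculation such as TF-IDF would take as input.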

24
K-means
  • K-means Algorithm
  • Partition objects into k non-empty subsets
    randomly
  • repeat
  • Compute the centroids of the clusters
  • Assign each object to the cluster with the
    nearest centroid
  • until a stop criterion is met.
  • Stop Criteria
  • when no object is moved from one cluster to
    another
  • when the system has reached the maximum number of
    iterations
  • when some evaluation shows that the system
    performance has reached maximum
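
The k-means loop above can be sketched in a few lines. This is a minimal illustration under assumptions: it uses Euclidean distance, implements only the first two stop criteria (no object moves, or a maximum number of iterations), and re-seeds a cluster that becomes empty, a choice the slides do not specify. It requires Python 3.8+ for `math.dist`.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Return a cluster label (0..k-1) for each point."""
    rng = random.Random(seed)
    # Random initial partition into k non-empty subsets:
    # shuffle the indices and deal them round-robin into the clusters.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k

    for _ in range(max_iter):
        # Compute the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Assumed handling: re-seed an emptied cluster at a random point.
                centroids.append(list(rng.choice(points)))
        # Assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # stop: no object moved
            break
        labels = new_labels
    return labels
```

On document data the points would be TF-IDF vectors, and a cosine-based measure could replace `math.dist`.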

25
Document Clustering Evaluation
  • where m is the total number of clusters, N is
    the total number of documents, E(tj) is
    inter-cluster entropy, and Ei(tj) is
    intra-cluster entropy
  • where ni is the number of documents in cluster
    Ci, nij is the number of documents including
    feature tj in cluster Ci, fimax is the maximum
    frequency of feature tj in cluster Ci, is the
    average frequency of feature tj in cluster Ci,
    and mj is the number of clusters in which
    feature tj appears [EF01]

26
Integration
  • Integrating the Web document cluster information
    into log files
  • Obtaining two data sets, one with the cluster
    information, the other without

Table-5 Segment of Datasets

1 Another option is to use cluster information
without the root.
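
The integration step can be sketched as follows, assuming each log record is a dict and that a mapping from page URL to cluster label is available from the clustering stage. The field names `url` and `cluster`, the helper names, and the rule "first path component is the directory root" are all hypothetical choices for this illustration.

```python
def directory_root(url):
    """Return the first path component, e.g. '/~prof01' for '/~prof01/a/b.html'."""
    parts = [p for p in url.split("/") if p]
    return "/" + parts[0] if parts else "/"

def integrate(records, page_cluster):
    """Build two data sets: one without and one with cluster information."""
    log_origin, log_integ = [], []
    for rec in records:
        root = directory_root(rec["url"])
        # Data set without cluster information: URL reduced to its root.
        log_origin.append({**rec, "url": root})
        # Data set with cluster information: root plus the page's cluster label.
        enriched = {**rec, "url": root}
        cluster = page_cluster.get(rec["url"])
        if cluster is not None:
            enriched["cluster"] = cluster
        log_integ.append(enriched)
    return log_origin, log_integ
```

Dropping the `url` field from the enriched records would give the alternative mentioned in the footnote, where cluster information is used without the root.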
27
Web Log Association Mining
  • Treat each attribute value as an item
  • Each item is either present or absent, so it is
    treated as a Boolean value
  • Apriori algorithm was applied
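
Treating each attribute value as a Boolean item can be sketched as below. The "attribute=value" item naming and the function names are assumptions made for this illustration, not a format taken from the slides.

```python
def to_transaction(record):
    """Turn one log record into a set of 'attribute=value' items."""
    return frozenset(f"{attr}={value}" for attr, value in record.items())

def boolean_matrix(transactions):
    """Return the sorted item list and one row of present/absent flags per transaction."""
    items = sorted(set().union(*transactions))
    rows = [[item in t for item in items] for t in transactions]
    return items, rows

# Hypothetical records after integration.
records = [
    {"host": "h1", "url": "/~prof01"},
    {"host": "h2", "url": "/~prof01"},
]
transactions = [to_transaction(r) for r in records]
items, rows = boolean_matrix(transactions)
```

Each row of `rows` marks which items a record contains, which is the market-basket form that Apriori expects.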

28
Apriori
  • Apriori Algorithm
  • Obtain the candidate 1-itemsets C1
  • Scan the database to count the frequency of all
    the itemsets in C1
  • Prune those that do not exceed the support
    threshold α to get the frequent 1-itemsets L1.
  • repeat
  • for each k, k = 1, 2, ...
  • use Lk to generate Ck+1 by merging two frequent
    itemsets in Lk
  • for each (k+1)-itemset in Ck+1
  • if all its k-subsets are in the frequent
    k-itemsets Lk
  • then keep it in Ck+1, else prune it
  • end of for
  • scan the database to count the support for all
    the itemsets in Ck+1
  • prune those that do not exceed the support
    threshold α to get Lk+1
  • end of for
  • until no itemset can be generated for Ck+1.
  • for each set of items A in Lk, for each k = 2, 3, ...
  • for each subset A' of A
  • calculate the confidence of the rule A' ⇒ (A − A')
    and print out all the rules that exceed the
    confidence threshold β
  • end of for
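
The steps above can be sketched as a small, unoptimized implementation. This is an illustration under assumptions: itemsets at or above a threshold are kept (the slide says "exceed"), supports are fractions of the transaction count, and the function names `apriori` and `rules` are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets (L1).
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 1
    while current:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        # Count support and keep the frequent (k+1)-itemsets.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for each qualifying rule."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is frequent, so the lookup is safe.
                conf = supp / frequent[lhs]
                if conf >= min_confidence:
                    out.append((lhs, itemset - lhs, conf))
    return out
```

A production implementation would count candidate supports in a single database scan per level rather than rescanning for each candidate, as done here for brevity.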

29
Results and Evaluations
  • Dataset
  • Original log file size: 230 MB
  • Number of Web pages: 4981
  • Total size of Web pages: 71 MB
  • Number of access records: 161499
  • Environment
  • Sun Solaris (SunOS sparc SUNW, Sun-Fire-880)
  • Perl
  • C

30
Document Clustering
Table-6 Document Cluster summaries with k = 8 and
word as feature
31
Table-7 Document Cluster summaries with k = 10 and
word as feature
32
Table-8 Document Cluster summaries with k = 12 and
word as feature
33
Table-9 Document Cluster summaries with k = 10 and
N-gram with N = 6 as feature
34
Document Clustering Evaluation
Table-10 Comparison of Results from Document
Clustering
  • Ngram-rep performs better than word-rep
  • k = 10 has the best performance for both word-rep
    and ngram-rep

35
Association Rules Obtained
Table-11 Comparison of Numbers of Rules Obtained
  • More rules were obtained from log_integ than from
    log_origin
  • Rules from log_origin are all included in those
    from log_integ

36
Association Rules Obtained
Table-12 Association Rules from log_origin with
k = 10, word, support = 1, confidence = 50
37
Table-13 Association Rules from log_integ with
k = 10, word, support = 1, confidence = 50
38
Association Rules Obtained
  • The extra association rules obtained from the
    integrated Web server log data set can benefit
    various applications
  • These rules are related to Web page content and
    were not obtained from the original Web log data
    set

39
Conclusion and Future Work
  • Conclusion
  • Content-related rules were found when Web URL
    directories do not strongly indicate content
  • These rules may contribute to various Web usage
    mining applications, including user profiling and
    Web site construction and improvement

40
Conclusion and Future Work
  • Future Work
  • Improve the algorithm for larger datasets
  • Consider more types of Web documents
  • .pdf, .ps, .doc
  • .java, .cpp
  • Replace manual summarization of document
    clusters with automatic approaches
  • Deploy hierarchy concepts
  • Apply other Data Mining techniques in Web log
    analysis

41
References
  • [CDC04] Y.Miao, V.Keselj, E.E.Milios,
    "Comparing Document Clustering Using N-grams,
    Terms and Words", Master's Thesis, 2004
  • [DBM96] J.Han, Y.Fu, W.Wang, J.Chiang, W.Gong,
    K.Koperski, D.Li, Y.Lu, A.Rajan, N.Stepanovic,
    B.Xia, and O.R.Zaiane, "DBMiner: A System for
    Mining Knowledge in Large Spatial Databases",
    Proc. of Int'l Data Mining and Knowledge
    Discovery, P.250-255, Portland, Ore., Aug. 1996
  • [DMCT01] J.Han, M.Kamber, "Data Mining: Concepts
    and Techniques", Morgan Kaufmann Publishers, 2001
  • [EF01] T.C.Jo, "Evaluation Function of Document
    Clustering based on Term Entropy", Proc. of 2nd
    International Symposium on Advanced Intelligent
    System, P.95-100, 2001
  • [IPD97] R.Cooley, B.Mobasher, and J.Srivastava,
    "Web Mining: Information and Pattern Discovery on
    the World Wide Web", Proc. of the 9th IEEE
    International Conference on Tools with Artificial
    Intelligence (ICTAI'97), P.558-567, 1997
  • [VSM75] G.Salton, A.Wong, and C.S.Yang, "A Vector
    Space Model for Automatic Indexing",
    Communications of the ACM, 18(11):613-620, 1975
  • [WMR00] R.Kosala, H.Blockeel, "Web Mining
    Research: A Survey", ACM SIGKDD, 2(1):1-15, 2000
  • [WUM01] J.R.Punin, M.S.Krishnamoorthy, and
    M.J.Zaki, "Web Usage Mining Languages and
    Algorithms", Workshop for WEBKDD01, Mining Web
    Log Data Across All Customers Touch Points,
    P.88-112, 2001
  • [WWW1] "Analog logfile analyzer",
    http://www.analog.cx, last accessed in Oct. 2004
  • [WWW2] "Apache server log format",
    http://httpd.apache.org/docs-2.0/logs.html, last
    accessed in Oct. 2004
  • [WWW3] "AWStats log analyzer",
    http://www.awstats.org, last accessed in Oct. 2004
  • [WWW4] V.Keselj, "Cleaning HTML Tags",
    http://www.cs.dal.ca/vlado/srcperl/clean_html,
    last accessed in Oct. 2004
  • [WWW5] G.Karypis, "CLUTO: A Software Package for
    Clustering High-Dimensional Datasets",
    http://www-users.cs.umn.edu/karypis/cluto/, last
    accessed in Oct. 2004
  • [WWW6] "SAS business analysis system",
    http://www.sas.org, last accessed in Oct. 2004
  • [WWW7] "Webalizer log file analysis program",
    http://www.mrunix.net/webalizer/, last accessed
    in Sep. 2004
  • [WWW8] B.Mobasher, "Web Content Mining",
    http://maya.cs.depaul.edu/mobasher/webminer/survey/node3.html,
    last accessed in Oct. 2004
  • [WWW9] "WEKA",
    http://www.cs.waikato.ac.nz/ml/weka/, last
    accessed in Oct. 2004
  • [WWW10] V.Keselj, "Perl Package Text::Ngrams,
    2003", http://www.cs.dal.ca/vlado/srcperl/Ngrams,
    last accessed in Oct. 2004

42
Questions and Comments
  • Thanks