Title: Integrating Automatic Web Page Clustering into Web Log Association Mining
1Integrating Automatic Web Page Clustering into
Web Log Association Mining
- Jiayun Guo
- MCS candidate at FCS of Dal
2Outline
- Introduction
- Related Work
- System Design
- Results and Evaluations
- Conclusion and Future Work
3Introduction
- Motivation
- -- Current Web usage mining methods are all based on the URL directory and sub-directories, and do not use the actual Web pages to take content into account
- -- Although some Web sites are organized by content, many others are not. That is, the URL directory and sub-directories do not always indicate the content
4Motivation
- Examples of different bases for organizing a Web site
Table-1 Different Web Site Organizations
1 In fact, the Web page organizations of some large corporations, such as Microsoft and IBM, are much more complex and combine several categorization schemes.
2 The real usernames of the authors of Web pages are replaced with anonymous usernames such as prof01 throughout this thesis for privacy reasons.
5Proposed Solution Strategy
Figure-2 System Structure
6Introduction
- Research Goals
- Instead of paying attention only to the Web URL directories, consider the Web page content when mining the Web server log files
- Research Contribution
- -- Proposing a new approach that combines Web content mining and Web usage mining
- -- Providing more accurate Web usage mining methods, especially when the Web site is not organized according to content
7Related Work
- Web Mining
- Web Mining Categories
- Combining Web Mining Categories
- Web Mining Applications
- Document Clustering
- Vector-Space Model
- TF-IDF
- Similarity/dissimilarity measures
- Association Rule Mining
- Apriori
8Web Mining
- Web Content Mining
- discovery of useful information from Web content, including documents
- Web Structure Mining
- discovery of the model underlying the link structures of the Web
- Web Usage Mining
- discovery of useful patterns from the data generated by Web surfers' sessions or behaviors
9Web Content Mining
- Web content consists of a wide range of different types of data from various sources
- Gopher, FTP, Usenet
- text, hypertext, multimedia, metadata
- Two characteristics of Web content data
- Large volume
- Heterogeneity
10Web Content Mining
- IR view
- To develop more intelligent IR systems for information finding and filtering
- Document Clustering
- DB view
- To integrate and organize the heterogeneous and semi-structured Web data
Figure-1 Categorization of Web Content Mining
11Web Usage Mining
- Utilizes the secondary data derived from users' interactions with the Web
- Includes data from Web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data resulting from interactions
- Application areas include CRM, fraud detection, Web cache prediction and enhancement, and Web site construction
12Combining Web Mining Categories
- Web content mining and Web structure mining are often combined
- Web content mining sometimes utilizes user profiles from surveys or registration
- Web usage mining is relatively isolated from the other two
13Web Mining Tools
- General Data Mining Packages
- DBMiner, WEKA, SAS, etc.
- Web Statistical or Mining Packages
- WEBMINER, AWStats, Webalizer, Analog, etc.
14Document Clustering
- Clustering
- To group objects into clusters
- Objects are similar to those in the same cluster
- Objects are dissimilar to those in different clusters
- Clustering Algorithms
- Partitioning: k-means
- Hierarchical
- Agglomerative (bottom-up): AGNES (AGglomerative NESting)
- Divisive (top-down): DIANA (DIvisive ANAlysis)
- Density-based: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
15Document Clustering
- Takes textual documents as data objects for clustering
- Vector-Space Model
- each document is represented by a vector of feature frequencies
- TF-IDF weighting scheme
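The vector-space representation with TF-IDF weighting can be sketched as follows. This is a minimal illustration, assuming the common variant weight = tf * log(N / df); the slides do not state which TF-IDF variant was used, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: weight = term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["web", "log", "mining"],
    ["web", "page", "clustering"],
    ["web", "log", "rules"],
]
vectors = tf_idf(docs)
```

A term that occurs in every document, such as "web" here, receives weight zero, so it does not influence the clustering.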
16Distance Measure
- sin/cos measure
- Euclidean measure
- Normalized Euclidean measure
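The formulas for these measures are not preserved on this slide, so the sketch below uses textbook definitions rather than the author's exact ones. Two points are assumptions: the sine measure is taken as sqrt(1 - cos^2), and "normalized Euclidean" is interpreted as the Euclidean distance between unit-length vectors. All function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def sine_dissimilarity(a, b):
    """Assumed definition: sine of the angle, derived from the cosine."""
    c = cosine_similarity(a, b)
    return math.sqrt(max(0.0, 1.0 - c * c))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalized_euclidean_distance(a, b):
    """Assumed interpretation: scale both vectors to unit length first."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm else list(v)
    return euclidean_distance(unit(a), unit(b))
```

With unit-length vectors the Euclidean distance depends only on the angle between the documents, which makes it insensitive to document length.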
17Evaluation for Document Clustering
- Internal quality measure
- Compares different sets of clusters without reference to external knowledge
- External quality measure
- Evaluates how well the clustering works by comparing the groups produced by clustering techniques to known classes
18Association Rule Mining
- One of the most important techniques in data mining
- To find interesting association or correlation relationships among a large set of data items
- Two important measures: support and confidence
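The two measures used with association rules are support and confidence, which serve as the thresholds in the Apriori slides. A minimal sketch under the standard definitions follows; the function names and the toy transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy transactions (hypothetical item names).
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
sup_ab = support(transactions, {"a", "b"})        # 2 of 4 transactions
conf_a_b = confidence(transactions, {"a"}, {"b"})  # 0.5 / 0.75
```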
19System Design and Implementation
- Preprocessing
- reading in the Web log file and reformatting it
- fetching the Web pages and storing them on the local host
- Document Clustering
- calculating the frequencies of features in documents
- clustering the Web documents using the k-means algorithm
- summarizing clusters manually based on the clustering results
- evaluating the clustering results
- Integration
- Web Log Mining
- applying the Apriori algorithm to the two data collections
- evaluating the association rules obtained
20Figure-3 System Architecture
21Preprocessing
- Re-formatting Web server log file
Table-4 Web Log File Re-formatting
1 After the Integration step, the URL in the log file is replaced by the directory root only, for example /user.
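Reading a raw log line into named fields might look like the sketch below. It assumes the Apache Common Log Format (the format documented in WWW2); the fields kept in Table-4 are not preserved here, so the chosen field names, the regular expression, and the sample line are all hypothetical.

```python
import re

# Common Log Format: host ident authuser [time] "method url protocol" status size
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match the format."""
    m = CLF.match(line)
    if m is None:
        return None
    record = m.groupdict()
    record["status"] = int(record["status"])
    return record

# Hypothetical sample line using the anonymized username style from the slides.
line = ('192.0.2.1 - - [10/Oct/2004:13:55:36 -0300] '
        '"GET /~prof01/index.html HTTP/1.0" 200 2326')
record = parse_line(line)
```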
22Preprocessing
- Retrieving Web page documents
- Web document type screening
- .html, .htm, .shtml, .xml, .php, .cgi (hypertext files)
- .txt (plain text files)
- pages from professors' directories only
- Web access status screening
- Screen out records whose access status code begins with 4
- Exception: 401 (Unauthorized)
- Web page storage
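The two screening rules above can be sketched as a single filter. This is an illustration only: the function name `keep_record` is hypothetical, the restriction to professors' directories is left out because the slides do not say how those directories are identified, and URLs without a file extension (such as a bare directory) are dropped here by assumption.

```python
import os

# Extensions listed on the slide for hypertext and plain text files.
ALLOWED_EXTENSIONS = {".html", ".htm", ".shtml", ".xml", ".php", ".cgi", ".txt"}

def keep_record(url, status):
    """Return True if a log record passes type and status screening."""
    path = url.split("?", 1)[0]              # ignore any query string
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False
    # Drop 4xx responses, except 401 (Unauthorized), which is kept.
    if 400 <= status < 500 and status != 401:
        return False
    return True
```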
23Document Clustering
- Document Preprocessing
- HTML, XML, SGML tag cleaning
- Eliminating punctuation and numbers
- Eliminating stop words (word-rep only)
- Word Stemming (word-rep only)
- Frequency Calculation
- TF-IDF weighting measure used
- Document Clustering
- K-means
- Summarization
- Evaluation
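The document preprocessing steps for the word representation can be sketched as below. This is a simplified illustration: the stop-word list is a tiny hypothetical sample, the regular-expression tag stripping is much cruder than a dedicated cleaner such as the one cited in WWW4, and stemming is omitted because it would need an external stemmer.

```python
import re

# Tiny illustrative stop-word list (hypothetical; a real list is much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def preprocess(markup):
    """Strip tags, punctuation, and numbers, then remove stop words."""
    text = re.sub(r"<[^>]+>", " ", markup)     # remove HTML/XML/SGML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove punctuation and digits
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("<html><body><h1>Mining the Web, 2004!</h1></body></html>")
```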
24K-means
- K-means Algorithm
- Partition objects into k non-empty subsets randomly
- repeat
- Compute the centroids of the clusters
- Assign each object to the cluster with the nearest centroid
- until some stop criterion is met
- Stop Criteria
- when no object is moved from one cluster to another
- when the system has reached the maximum number of iterations
- when some evaluation shows that the system performance has reached its maximum
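The loop above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the thesis: it assumes Euclidean distance on dense numeric vectors, uses only the first two stop criteria, and re-seeds an empty cluster with a random point. The function name `kmeans` and its parameters are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Return one cluster label per point (requires Python 3.8+ for math.dist)."""
    rng = random.Random(seed)
    # Random partition into k non-empty subsets: shuffle, then deal round-robin.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k

    for _ in range(max_iter):                  # stop criterion 2: max iterations
        # Compute the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                centroids.append(list(rng.choice(points)))  # re-seed empty cluster
        # Assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:               # stop criterion 1: no object moved
            break
        labels = new_labels
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans(points, k=2)
```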
25Document Clustering Evaluation
- where m is the total number of clusters, N is the total number of documents, E(tj) is inter-cluster entropy, and Ei(tj) is intra-cluster entropy
- where ni is the number of documents in cluster Ci, nij is the number of documents including feature tj in cluster Ci, fimax is the maximum frequency of feature tj in cluster Ci, is the average frequency of feature tj in cluster Ci, and mj is the number of clusters in which feature tj appears [EF01]
26Integration
- Integrating the Web document cluster information into the log files
- Obtaining two data sets, one with the cluster information, the other without
Table-5 Segment of Datasets
1 Another option is to use cluster information
without the root.
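The integration step can be sketched as producing two record sets from the same log: one with only the directory root, and one with the root plus the cluster of the requested page. The data set names log_origin and log_integ follow the slides; the record fields, the `directory_root` helper, and the `cluster<N>` labels are hypothetical, since the layout of Table-5 is not preserved here.

```python
def directory_root(url):
    """Return the first path component, e.g. '/~prof01' for '/~prof01/a/b.html'."""
    parts = [p for p in url.split("/") if p]
    return "/" + parts[0] if parts else "/"

def integrate(records, url_to_cluster):
    """Build log_origin (root only) and log_integ (root plus cluster label)."""
    log_origin, log_integ = [], []
    for rec in records:
        base = {"host": rec["host"], "root": directory_root(rec["url"])}
        log_origin.append(dict(base))
        enriched = dict(base)
        cluster = url_to_cluster.get(rec["url"])
        if cluster is not None:                # pages without a cluster stay as-is
            enriched["cluster"] = f"cluster{cluster}"
        log_integ.append(enriched)
    return log_origin, log_integ

records = [{"host": "h1", "url": "/~prof01/course/index.html"}]
log_origin, log_integ = integrate(records, {"/~prof01/course/index.html": 3})
```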
27Web Log Association Mining
- Treat each attribute value as an item
- Each item is either present or not, and is treated as a Boolean value
- The Apriori algorithm was applied
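Turning log records into Boolean items can be sketched as follows. Each attribute=value pair becomes one item, and a record is the set of items it contains. The attribute names and the `attr=value` string encoding are hypothetical choices for illustration.

```python
def record_to_items(record):
    """Encode each attribute value of a record as one item."""
    return frozenset(f"{attr}={value}" for attr, value in record.items())

def to_boolean_rows(transactions):
    """Return the sorted item list and one present/absent row per transaction."""
    items = sorted(set().union(*transactions))
    rows = [[item in t for item in items] for t in transactions]
    return items, rows

records = [
    {"root": "/~prof01", "cluster": "c3"},
    {"root": "/~prof02"},
]
transactions = [record_to_items(r) for r in records]
items, rows = to_boolean_rows(transactions)
```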
28Apriori
- Apriori Algorithm
- Obtain the candidate 1-itemsets C1
- Scan the database to count the frequency of all the itemsets in C1
- Prune those that do not exceed the support threshold α to get the frequent 1-itemsets L1
- repeat
- for each k, k = 1, 2, ...
- use Lk to generate Ck+1 by merging pairs of frequent itemsets in Lk
- for each (k+1)-itemset in Ck+1
- if all of its k-subsets are in the frequent k-itemsets Lk
- then keep it in Ck+1, else prune it
- end of for
- scan the database to count the support of all the itemsets in Ck+1
- prune those that do not exceed the support threshold α to get Lk+1
- end of for
- until no itemset can be generated for Ck+1
- for each set of items A in Lk, for each k = 2, 3, ...
- for each subset A' of A
- calculate the confidence of the rule A' → A - A' and print out all the rules that exceed the confidence threshold β
- end of for
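The steps above can be sketched compactly as below. This is an illustrative version under stated assumptions, not the thesis implementation (which the slides say was written in Perl and C): thresholds are fractions, "exceed" is read as strictly greater than, and the function names and toy data are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support exceeds min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Scan the database and keep candidates above the support threshold.
        result = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s > min_support:
                result[c] = s
        return result

    current = frequent({frozenset([i]) for t in transactions for i in t})  # L1
    all_frequent = dict(current)
    k = 1
    while current:
        # Join: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune: keep a candidate only if all of its k-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        current = frequent(candidates)                                     # L(k+1)
        all_frequent.update(current)
        k += 1
    return all_frequent

def rules(all_frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules above the threshold."""
    found = []
    for itemset, sup in all_frequent.items():
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                conf = sup / all_frequent[lhs]  # every subset is itself frequent
                if conf > min_confidence:
                    found.append((set(lhs), set(itemset - lhs), conf))
    return found

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
freq = apriori(transactions, min_support=0.4)
found = rules(freq, min_confidence=0.6)
```

In this toy run the candidate {a, b, c} is discarded in the prune step because its subset {b, c} is not frequent, so the database is never scanned for it.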
29Results and Evaluations
- Dataset
- Original log file size: 230 MB
- Number of Web pages: 4981
- Total size of Web pages: 71 MB
- Number of access records: 161499
- Environment
- Sun Solaris (SunOS sparc SUNW, Sun-Fire-880)
- Perl
- C
30Document Clustering
Table-6 Document Cluster summaries with k = 8 and word as feature
31Table-7 Document Cluster summaries with k = 10 and word as feature
32Table-8 Document Cluster summaries with k = 12 and word as feature
33Table-9 Document Cluster summaries with k = 10 and N-gram with N = 6 as feature
34Document Clustering Evaluation
Table-10 Comparison of Results from Document
Clustering
- Ngram-rep performs better than word-rep
- k = 10 has the best performance for both word-rep and ngram-rep
35Association Rules Obtained
Table-11 Comparison of Numbers of Rules Obtained
- More rules were obtained from log_integ than from log_origin
- Rules from log_origin are all included in those from log_integ
36Association Rules Obtained
Table-12 Association Rules from log_origin with k = 10, word, support = 1, confidence = 50
37Table-13 Association Rules from log_integ with k = 10, word, support = 1, confidence = 50
38Association Rules Obtained
- The extra association rules obtained from the integrated Web server log data set can benefit various applications
- These rules are related to Web page content and were not obtained from the original Web log data set
39Conclusion and Future Work
- Conclusion
- Content-related rules are found when Web URL directories do not strongly indicate content
- These rules may contribute to various Web usage mining applications, including user profiling and Web site construction and improvement
40Conclusion and Future Work
- Future Work
- Improve the algorithm for larger datasets
- Consider more types of Web documents
- .pdf, .ps, .doc
- .java, .cpp
- Replace manual summarization of document clusters with automatic approaches
- Deploy hierarchy concepts
- Apply other data mining techniques in Web log analysis
41References
- [CDC04] Y. Miao, V. Keselj, E. E. Milios, "Comparing Document Clustering Using N-grams, Terms and Words", Master's Thesis, 2004
- [DBM96] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stepanovic, B. Xia, and O. R. Zaiane, "DBMiner: A System for Mining Knowledge in Large Spatial Databases", Proc. of Int'l Data Mining and Knowledge Discovery, pp. 250-255, Portland, Ore., Aug. 1996
- [DMCT01] J. Han, M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2001
- [EF01] T. C. Jo, "Evaluation Function of Document Clustering based on Term Entropy", Proc. of 2nd International Symposium on Advanced Intelligent System, pp. 95-100, 2001
- [IPD97] R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", Proc. of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), pp. 558-567, 1997
- [VSM75] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing", Communications of the ACM, 18(11):613-620, 1975
- [WMR00] R. Kosala, H. Blockeel, "Web Mining Research: A Survey", ACM SIGKDD, 2(1):1-15, 2000
- [WUM01] J. R. Punin, M. S. Krishnamoorthy, and M. J. Zaki, "Web Usage Mining: Languages and Algorithms", Workshop for WEBKDD01, Mining Web Log Data Across All Customers Touch Points, pp. 88-112, 2001
- [WWW1] "Analog logfile analyzer", http://www.analog.cx, Last accessed in Oct. 2004
- [WWW2] "Apache server log format", http://httpd.apache.org/docs-2.0/logs.html, Last accessed in Oct. 2004
- [WWW3] "AWStats log analyzer", http://www.awstats.org, Last accessed in Oct. 2004
- [WWW4] V. Keselj, "Cleaning HTML Tags", http://www.cs.dal.ca/vlado/srcperl/clean_html, Last accessed in Oct. 2004
- [WWW5] G. Karypis, "CLUTO: A Software Package for Clustering High-Dimensional Datasets", http://www-users.cs.umn.edu/karypis/cluto/, Last accessed in Oct. 2004
- [WWW6] "SAS business analysis system", http://www.sas.org, Last accessed in Oct. 2004
- [WWW7] "Webalizer log file analysis program", http://www.mrunix.net/webalizer/, Last accessed in Sep. 2004
- [WWW8] B. Mobasher, "Web Content Mining", http://maya.cs.depaul.edu/mobasher/webminer/survey/node3.html, Last accessed in Oct. 2004
- [WWW9] "WEKA", http://www.cs.waikato.ac.nz/ml/weka/, Last accessed in Oct. 2004
- [WWW10] V. Keselj, "Perl Package Text::Ngrams, 2003", http://www.cs.dal.ca/vlado/srcperl/Ngrams, Last accessed in Oct. 2004
42Questions and Comments