Title: Data Mining meets the Internet: Techniques for Web Information Retrieval
1. Data Mining meets the Internet: Techniques for Web Information Retrieval
Rajeev Rastogi, Internet Management Research, Bell Laboratories, Murray Hill
2. The Web
- Over 1 billion HTML pages, 15 terabytes
- Wealth of information
  - Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow/white pages, maps, markets, ...
- Diverse media types: text, images, audio, video
- Heterogeneous formats: HTML, XML, postscript, pdf, JPEG, MPEG, MP3
- Highly dynamic
  - 1 million new pages each day
  - Average page changes in a few weeks
- Graph structure with links between pages
  - Average page has 7-10 links
- Hundreds of millions of queries per day
3. Why is Web Information Retrieval Important?
- According to most predictions, the majority of human information will be available on the Web in ten years
- Effective information retrieval can aid in
  - Research: find all papers that use the primal-dual method to solve the facility location problem
  - Health/Medicine: what could be the reason for symptoms of yellow eyes, high fever and frequent vomiting?
  - Travel: find information on the tropical island of St. Lucia
  - Business: find companies that manufacture digital signal processors (DSPs)
  - Entertainment: find all movies starring Marilyn Monroe between 1960 and 1970
  - Arts: find all short stories written by Jhumpa Lahiri
4. Web Information Retrieval Model
[Diagram: a crawler feeds Web pages through a storage server into a repository; an indexer builds an inverted index over the repository (terms such as "engine", "jaguar", "cat"); clustering/classification organizes documents into a topic hierarchy (Root: Business, News, Science; Science: Computers, Automobiles, Plants, Animals); search queries (e.g., "jaguar") are answered against the inverted index. Example documents: "The jaguar has a 4 liter engine" vs. "The jaguar, a cat, can run at speeds reaching 50 mph".]
5. Why is Web Information Retrieval Difficult?
- The abundance problem (99% of the information is of no interest to 99% of the people)
  - Hundreds of irrelevant documents returned in response to a search query
- Limited coverage of the Web (Internet sources hidden behind search interfaces)
  - The largest crawlers cover less than 18% of Web pages
- The Web is extremely dynamic
  - 1 million pages added each day
- Very high dimensionality (thousands of dimensions)
- Limited query interface based on keyword-oriented search
- Limited customization to individual users
6. How can Data Mining Improve Web Information Retrieval?
- Latent Semantic Indexing (LSI)
  - SVD-based method to improve precision and recall
- Document clustering to generate topic hierarchies
  - Hypergraph partitioning, STIRR, ROCK
- Document classification to assign topics to new documents
  - Naive Bayes, TAPER
- Exploiting hyperlink structure to locate authoritative Web pages
  - HITS, Google, Web trawling
- Collaborative searching
  - SearchLight
- Image retrieval
  - QBIC, Virage, Photobook, WBIIS, WALRUS
7. Latent Semantic Indexing
8. Problems with the Inverted Index Approach
- Synonymy
  - Many ways to refer to the same object
- Polysemy
  - Most words have more than one distinct meaning
[Table: term-by-document incidence matrix over the terms animal, jaguar, speed, car, engine, porsche, automobile for Docs 1-3, illustrating synonymy (car vs. automobile) and polysemy (jaguar as animal vs. jaguar as car).]
9. LSI - Key Idea [DDF 90]
- Apply SVD to the t x d term-by-document matrix X:
  X = T0 S0 D0', where T0 (t x m) and D0 (d x m) have orthonormal columns and S0 (m x m) is diagonal
- Ignore the very small singular values in S0, keeping only the k largest:
  X^ = T S D', with T (t x k), S (k x k) and D (d x k)
- The new matrix X^ of rank k is the closest to X in the least-squares sense
10. Comparing Documents and Queries
- Comparing two documents
  - Essentially the dot product of two column vectors of X^: X^' X^ = D S^2 D'
  - So one can consider the rows of the D S matrix as coordinates for documents and take dot products in this space
- Finding documents similar to a query q with term vector Xq
  - Derive a representation Dq for the query: Dq = Xq' T S^-1
  - The dot product of Dq S and the appropriate row of the D S matrix yields the similarity between the query and a specific document
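The SVD manipulations above can be sketched in Python with numpy. The toy term-by-document matrix, the term labels and the query are hypothetical, chosen only to illustrate the mechanics of truncation and query folding:

```python
import numpy as np

# Toy term-by-document matrix X (t terms x d documents); hypothetical counts.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],   # "jaguar"
    [1.0, 0.0, 0.0, 0.0],   # "engine"
    [0.0, 1.0, 1.0, 0.0],   # "cat"
    [0.0, 0.0, 1.0, 1.0],   # "speed"
])

# Full SVD: X = T0 @ diag(s0) @ D0t, with orthonormal columns in T0 and D0.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values: rank-k least-squares approximation.
k = 2
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]
X_k = T @ S @ Dt

# Fold a query into the reduced space: Dq = Xq' T S^-1.
Xq = np.array([1.0, 1.0, 0.0, 0.0])        # query mentions "jaguar" and "engine"
Dq = Xq @ T @ np.linalg.inv(S)

# Rows of D S are document coordinates; similarity = dot product with Dq S.
doc_coords = Dt.T @ S
scores = doc_coords @ (Dq @ S)
```

Ranking the documents by `scores` retrieves them in LSI space rather than by raw keyword overlap.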
11. LSI - Benefits
- Reduces the dimensionality of documents
  - From tens of thousands (one dimension per keyword) to a few hundred
  - Decreases the storage overhead of index structures
  - Speeds up retrieval of documents similar to a query
  - Makes search less brittle
- Captures the semantics of documents
  - Addresses the problems of synonymy and polysemy
  - Transforms the document space from discrete to continuous
  - Improves both search precision and recall
12. Document Clustering
13. Improve Search Using Topic Hierarchies
- Web directories (or topic hierarchies) provide a hierarchical classification of documents (e.g., Yahoo!)
- Searching in the context of a topic restricts the search to only the subset of web pages related to that topic
- Clustering can be used to generate topic hierarchies
[Diagram: Yahoo home page with child topics Recreation, Science, Business and News; Recreation expands to Sports and Travel, Business to Companies, Finance and Jobs.]
14. Clustering
- Given
  - Data points (documents) and the number of desired clusters k
- Group the data points (documents) into k clusters
  - Data points (documents) within a cluster are more similar than across clusters
- Document similarity measure
  - Each document can be represented by a vector with a 0/1 value along each word dimension
  - The cosine of the angle between document vectors is a measure of their similarity; so is the (Euclidean) distance between the vectors
- Other applications
  - Customer segmentation
  - Market basket analysis
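The cosine measure above is straightforward to compute over 0/1 term vectors; a minimal sketch (the example vectors are hypothetical):

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two 0/1 term vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

# Hypothetical 0/1 vectors over the dimensions (jaguar, engine, cat, speed):
doc1 = [1, 1, 0, 0]
doc2 = [1, 0, 1, 1]
print(cosine_similarity(doc1, doc2))   # 1 / sqrt(6), about 0.408
```

Identical vectors give similarity 1; documents sharing no terms give 0.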
15. k-means Algorithm
- Choose k initial means
- Assign each point to the cluster with the closest mean
- Compute a new mean for each cluster
- Iterate until the k means stabilize
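The four steps above can be sketched directly for 2-D points. The deterministic initialization (first k points) and the toy data are simplifying assumptions; a real run would randomize the initial means:

```python
def k_means(points, k, iters=100):
    """Plain k-means on 2-D points; deterministic initialization for brevity."""
    means = list(points[:k])                       # choose k initial means
    for _ in range(iters):
        # assign each point to the cluster with the closest mean
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda i: (x - means[i][0]) ** 2 + (y - means[i][1]) ** 2)
            clusters[j].append((x, y))
        # compute a new mean for each cluster
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else means[j] for j, c in enumerate(clusters)]
        if new == means:                           # iterate until the means stabilize
            return means, clusters
        means = new
    return means, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = k_means(points, 2)
```

On this well-separated data the algorithm converges to the two obvious groups of three points.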
16. Agglomerative Hierarchical Clustering Algorithms
- Initially each point is a distinct cluster
- Repeatedly merge the closest clusters until the number of clusters becomes k
- "Closest" can be measured by d_mean(Ci, Cj) or d_min(Ci, Cj); likewise d_ave(Ci, Cj) and d_max(Ci, Cj)
17. Agglomerative Hierarchical Clustering Algorithms (Continued)
- d_mean: centroid approach; d_min: Minimum Spanning Tree (MST) approach
[Figure: (a) clusters found by the centroid approach, (b) clusters found by the MST approach, (c) the correct clusters.]
18. Drawbacks of Traditional Clustering Methods
- Traditional clustering methods are ineffective for clustering documents
  - Cannot handle thousands of dimensions
  - Cannot scale to millions of documents
- The centroid-based method splits large and non-hyperspherical clusters
  - Centers of subclusters can be far apart
- The MST-based algorithm is sensitive to outliers and slight changes in position
  - Exhibits a chaining effect on strings of outliers
- Using other similarity measures, such as the Jaccard coefficient instead of Euclidean distance, does not help
19. Example - Centroid Method for Clustering Documents
- As cluster size grows
  - The number of dimensions appearing in the mean goes up
  - Their value in the mean decreases
- Thus, it becomes very difficult to distinguish two points that differ on few dimensions - a ripple effect
- In the slide's example, documents 1, 4 and 6 are merged even though they have no elements in common!
20. Itemset Clustering using Hypergraph Partitioning [HKK 97]
- Build a weighted hypergraph from frequent itemsets
  - Hyperedge: each frequent itemset
  - Weight of a hyperedge: the average of the confidences of all association rules generated from the itemset
- A hypergraph partitioning algorithm is used to cluster items
  - Minimize the sum of the weights of cut hyperedges
- Label customers with item clusters by scoring
- Assumes that the items defining clusters are disjoint!
21. STIRR - A System for Clustering Categorical Attribute Values [GKR 98]
- Motivated by spectral graph partitioning, a method for clustering undirected graphs
- Each distinct attribute value becomes a separate node v with weight w(v)
- Node weights w(v) are updated in each iteration: for each tuple, propagate weight among the tuple's values, then update the set of weights so that it is orthonormal
- Positive and negative weights in the non-principal basins tend to represent good partitions of the data
22. ROCK [GRS 99]
- Hierarchical clustering algorithm for categorical attributes
  - Example: market basket customers
- Uses the novel concept of links for merging clusters
  - sim(pi, pj): similarity function that captures the closeness between pi and pj
  - pi and pj are said to be neighbors if sim(pi, pj) is at least a threshold
  - link(pi, pj): the number of common neighbors of pi and pj
- At each step, merge the clusters/points with the largest number of links
  - Points belonging to a single cluster will in general have a large number of common neighbors
- Random sampling is used for scale-up
- In the final labeling phase, each point on disk is assigned to the cluster with the maximum neighbors
23ROCK
lt1, 2, 3, 4, 5gt 1, 2, 3 1, 4, 5 1, 2, 4
2, 3, 4 1, 2, 5 2, 3, 5 1, 3, 4 2, 4,
5 1, 3, 5 3, 4, 5
lt1, 2, 6, 7gt 1, 2, 6 1, 2, 7 1, 6, 7 2, 6,
7
- 1, 2, 6 and 1, 2, 7 have 5 links.
- 1, 2, 3 and 1, 2, 6 have 3 links.
24. Clustering Algorithms for Numeric Attributes
- Scalable clustering algorithms (from the database community)
  - CLARANS
  - DBSCAN
  - BIRCH
  - CLIQUE
  - CURE
- The above algorithms can be used to cluster documents after reducing their dimensionality using SVD
25. BIRCH [ZRL 96]
- Pre-clusters data points using a CF-tree
  - The CF-tree is similar to an R-tree
- For each point
  - The CF-tree is traversed to find the closest cluster
  - If the cluster is within epsilon distance, the point is absorbed into the cluster
  - Otherwise, the point starts a new cluster
- Requires only a single scan of the data
- Cluster summaries stored in the CF-tree are given to a main-memory hierarchical clustering algorithm
26. CURE [GRS 98]
- Hierarchical algorithm for discovering arbitrary-shaped clusters
- Uses a small number of representatives per cluster
- Note
  - Centroid-based methods use 1 point to represent a cluster => too little information; only hyper-spherical clusters
  - MST-based methods use every point to represent a cluster => too much information; easily misled
- Uses random sampling
- Uses partitioning
- Labeling using representatives
27. Cluster Representatives
- A representative set of points
  - Small in number: c points
  - Distributed over the cluster
  - Each point in the cluster is close to one representative
- Distance between clusters
  - The smallest distance between representatives
28. Computing Cluster Representatives
- Finding scattered representatives; we want to
  - Distribute them around the center of the cluster
  - Spread them well out over the cluster
  - Capture the physical shape and geometry of the cluster
- Use the farthest-point heuristic to scatter the points over the cluster
- Shrink them uniformly around the mean of the cluster
29. Computing Cluster Representatives (Continued)
- Shrinking the representatives
  - Why do we need to alter the representative set? Some points lie too close to the boundary of the cluster
  - Shrink uniformly around the mean (center) of the cluster
30. Document Classification
31. Classification
- Given
  - A database of tuples (documents), each assigned a class label
- Develop a model/profile for each class
  - Example profile (good credit): (25 < age < 40 and income > 40k) or (married = YES)
  - Example profile (automobile): document contains a word from {car, truck, van, SUV, vehicle, scooter}
- Other applications
  - Credit card approval (good, bad)
  - Bank locations (good, fair, poor)
  - Treatment effectiveness (good, fair, poor)
32. Naive Bayesian Classifier
- The class c for a new document d is the one for which Pr[c|d] is maximum
- Assume independent term occurrences in a document
  - Pr[t|c]: the fraction of documents in class c that contain term t
- Then, by Bayes rule, Pr[c|d] is proportional to Pr[c] times the product of Pr[t|c] over the terms t in d
33. Hierarchical Classifier (TAPER) [CDA 97]
- The class of a new document d is the leaf node c* of the topic hierarchy such that Pr[c*|d] is maximum
- Pr[c|d] for a leaf can be computed by applying Bayes rule along the root-to-leaf path
- The problem of computing c* reduces to finding the leaf node c* with the least-cost path from the root to c*
34. k-Nearest Neighbor Classifier
- Assign to a point the majority label among its k nearest neighbors
- For k = 1, the error rate is never worse than twice the Bayes rate (with an unlimited number of samples)
- Scalability issues
  - Use an index to find the k nearest neighbors
    - The R-tree family works well up to 20 dimensions
    - The Pyramid tree handles high-dimensional data
  - Use SVD to reduce the dimensionality of the data set
  - Use clusters to reduce the dataset size
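The majority-vote rule can be sketched with a brute-force neighbor search (a real system would use one of the indexes above; the labeled points are hypothetical):

```python
from collections import Counter

def knn_classify(point, labeled_points, k):
    """Majority label among the k nearest neighbors (squared Euclidean,
    brute force over all labeled points)."""
    by_dist = sorted(labeled_points,
                     key=lambda lp: sum((a - b) ** 2 for a, b in zip(point, lp[0])))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

data = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
        ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify((1, 1), data, 3))   # A
```

Sorting all points is O(n log n) per query, which is exactly the scalability problem the indexing bullets address.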
35. Decision Trees
[Figure: credit-analysis decision tree. Root test: salary < 20000; if no, accept; if yes, test education in {graduate}: if yes, accept, else reject.]
36. Decision Tree Algorithms
- Building phase
  - Recursively split nodes using the best splitting attribute for each node
- Pruning phase
  - A smaller, imperfect decision tree generally achieves better accuracy
  - Prune leaf nodes recursively to prevent over-fitting
37. Decision Tree Algorithms
- Classifiers from machine learning community
- ID3
- C4.5
- CART
- Classifiers for large databases
- SLIQ, SPRINT
- PUBLIC
- SONAR
- Rainforest, BOAT
38. Decision Trees
- Pros
  - Fast execution time
  - Generated rules are easy for humans to interpret
  - Scale well to large data sets
  - Can handle high-dimensional data
- Cons
  - Cannot capture correlations among attributes
  - Consider only axis-parallel cuts
39. Feature Selection
- Choose a collection of keywords that help discriminate between two or more sets of documents
- Fewer keywords help to speed up classification
- Improves classification accuracy by eliminating noise from documents
- Fisher's discriminant: the ratio of between-class to within-class scatter,
  Fisher(t) = [sum over class pairs (c1, c2) of (mu(c1, t) - mu(c2, t))^2] / [sum over classes c of (1/|Dc|) sum over d in Dc of (x(d, t) - mu(c, t))^2],
  where x(d, t) = 1 if d contains t and 0 otherwise, and mu(c, t) is the mean of x(d, t) over the documents of class c
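A two-class version of the score can be sketched as follows; the toy document sets are hypothetical:

```python
def fisher_score(docs_a, docs_b, term):
    """Two-class Fisher discriminant for a term: between-class scatter of the
    0/1 term indicator divided by within-class scatter."""
    xa = [1.0 if term in d else 0.0 for d in docs_a]
    xb = [1.0 if term in d else 0.0 for d in docs_b]
    mu_a, mu_b = sum(xa) / len(xa), sum(xb) / len(xb)
    within = (sum((x - mu_a) ** 2 for x in xa) / len(xa)
              + sum((x - mu_b) ** 2 for x in xb) / len(xb))
    return (mu_a - mu_b) ** 2 / within if within else float("inf")

autos = [{"car", "engine"}, {"car", "speed"}]
animals = [{"cat", "speed"}, {"cat", "jungle"}]
# "car" perfectly separates the classes; "speed" carries no class signal.
```

Ranking terms by this score and keeping the top few implements the keyword selection described above.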
40. Exploiting Hyperlink Structure
41. HITS (Hyperlink-Induced Topic Search) [Kle 98]
- HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery
- Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages
  - Authorities: highly-referenced pages on a topic
  - Hubs: pages that point to authorities
- A good authority is pointed to by many good hubs; a good hub points to many good authorities
[Figure: bipartite picture of hubs pointing to authorities.]
42. HITS - Discovering Web Communities
- Discovering the community for a specific topic/query involves the following steps
  - Collect a seed set of pages S (returned by a search engine)
  - Expand the seed set to contain pages that point to, or are pointed to by, pages in the seed set
  - Iteratively update the hub weight h(p) and authority weight a(p) of each page
  - After a fixed number of iterations, the pages with the highest hub/authority weights form the core of the community
- Extensions proposed in Clever
  - Assign links different weights based on the relevance of the link's anchor text
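The iterative update step can be sketched as follows; the four-page mini-web is hypothetical:

```python
def hits(pages, links, iters=20):
    """HITS iteration; links maps each page to the set of pages it points to."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # a(p): sum of hub weights of the pages that point to p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, set()))
                for p in pages}
        # h(p): sum of authority weights of the pages that p points to
        hub = {p: sum(auth[q] for q in links.get(p, set())) for p in pages}
        # normalize so the weights stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

# Hypothetical mini-web: h1 and h2 both point at a1 and a2.
links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "a1": set(), "a2": set()}
hub, auth = hits(list(links), links)
```

The mutual reinforcement shows up immediately: h1 and h2 get all the hub weight, a1 and a2 all the authority weight.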
43. Google [BP 98]
- Search engine that uses link structure to calculate a quality ranking (PageRank) for each page
- PageRank
  - Can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix
  - Intuition: PageRank is the probability that a random surfer visits a page
    - The parameter p is the probability that the surfer gets bored and starts on a new random page
    - (1 - p) is the probability that the random surfer follows a link on the current page
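The random-surfer iteration can be sketched directly from this description. The three-page web is hypothetical, and the sketch assumes every page has at least one outgoing link (it omits the dangling-page correction a production implementation needs):

```python
def pagerank(links, p=0.15, iters=50):
    """Iterate: with probability p jump to a random page, with probability
    1 - p follow one of the current page's outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {q: 1.0 / n for q in pages}
    for _ in range(iters):
        new = {q: p / n for q in pages}          # random-jump mass
        for q in pages:
            out = links[q]
            for target in out:                   # link-following mass
                new[target] += (1 - p) * rank[q] / len(out)
        rank = new
    return rank

# Hypothetical 3-page web: a and b both link to c; c links back to both.
links = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
rank = pagerank(links)
```

Page c, pointed to by both others, ends up with the highest rank, and the ranks still sum to 1.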
44. Google - Features
- In addition to PageRank, Google also weighs keyword matches to improve search
  - Anchor text
    - Provides more accurate descriptions of Web pages
    - Anchors exist for un-indexable documents (e.g., images)
  - Font sizes of words in the text
    - Words in a larger or bolder font are assigned higher weights
- Google vs. HITS
  - Google: PageRanks are computed for Web pages up front, independent of the search query
  - HITS: hub and authority weights are computed for different root sets in the context of a particular search query
45. Trawling the Web for Emerging Communities [KRR 98]
- Co-citation: pages that are related are frequently referenced together
- Web communities are characterized by dense directed bipartite subgraphs
- Computing (i, j) bipartite cores
  - Sort the edge list by source id and detect all source pages s with out-degree j (let D be the set of destination pages that s points to)
  - Compute the intersection S of the sets of source pages pointing to the destination pages in D (using an index on destination id to generate each source set)
  - Output the bipartite core (S, D)
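For intuition, an (i, j) core can be found by brute force on a toy edge list; the sort/intersect method above is what makes this feasible at Web scale (the edges below are hypothetical):

```python
from itertools import combinations

def bipartite_cores(edges, i, j):
    """Enumerate (i, j) bipartite cores: groups of at least i source pages
    that all link to the same j destination pages (brute force)."""
    dests = sorted({d for _, d in edges})
    out = {s: {d for s2, d in edges if s2 == s} for s, _ in edges}
    cores = []
    for D in combinations(dests, j):
        S = [s for s, targets in out.items() if set(D) <= targets]
        if len(S) >= i:
            cores.append((sorted(S), list(D)))
    return cores

edges = [("s1", "d1"), ("s1", "d2"), ("s2", "d1"), ("s2", "d2"),
         ("s3", "d1"), ("s3", "d2"), ("s4", "d3")]
print(bipartite_cores(edges, 3, 2))   # [(['s1', 's2', 's3'], ['d1', 'd2'])]
```

The three sources co-citing d1 and d2 form a (3, 2) core; s4's lone edge does not.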
46. Using Hyperlinks to Improve Classification [CDI 98]
- Using text from neighbors when classifying a Web page
  - Ineffective, because referenced pages may belong to a different class
- Use class information from pre-classified neighbors instead
  - Choose the class ci for which Pr[ci|Ni] is maximum (Ni is the set of class labels of all the neighboring documents)
  - By Bayes rule, we choose ci to maximize Pr[Ni|ci] Pr[ci]
  - Assume independence of the neighbor classes
47. Collaborative Search
48. SearchLight
- Key idea: improve search by sharing information on the URLs visited by members of a community during search
- Based on the concept of search sessions
  - A search session is the search engine query (a collection of keywords) and the URLs visited in response to the query
  - It is possible to extract search sessions from proxy logs
- SearchLight maintains a database of (query, target URL) pairs
  - The target URL is heuristically chosen to be the last URL in the search session for the query
- In response to a search query, SearchLight displays the URLs stored in its database for the specified query
49. Image Retrieval
50. Similar Images
- Given
  - A set of images
- Find
  - All images similar to a given image
  - All pairs of similar images
- Sample applications
  - Medical diagnosis
  - Weather prediction
  - Web search engine for images
  - E-commerce
51. Similar Image Retrieval Systems
- QBIC, Virage, Photobook
  - Compute a feature signature for each image
    - QBIC uses color histograms
    - WBIIS and WALRUS use wavelets
  - Use a spatial index to retrieve the database image whose signature is closest to the query's signature
- QBIC drawbacks
  - Computes a single signature for the entire image
    - Thus it fails when images contain similar objects, but at different locations or in varying sizes
  - Color histograms cannot capture shape, texture and location information (wavelets can!)
52. WALRUS Similarity Model [NRS 99]
- WALRUS decomposes an image into regions
- A single signature is stored for each region
- Two images are considered similar if they have enough similar region pairs
53. WALRUS (Step 1)
- Generation of signatures for sliding windows
  - Each image is broken into sliding windows
  - The signature of each sliding window uses the coefficients from the lowest frequency band of the Haar wavelet
- A naive algorithm and a dynamic programming algorithm are compared, in terms of N (the number of pixels in the image), S, and the maximum window size
54. WALRUS (Step 2)
- Clustering sliding windows
  - Cluster the windows in the image using the pre-clustering phase of BIRCH
  - Each cluster defines a region in the image
  - For each cluster, the centroid is used as a signature (cf. bounding box)
55. WALRUS - Retrieval Results
[Figure: a query image and the similar images retrieved for it.]
56. Automated Schema Extraction for XML Data: The XTRACT System
57. XML Primer I
- Standard for data representation and data exchange
  - Unified, self-describing format for publishing/exchanging management data across heterogeneous network/NM platforms
- Looks like HTML but it isn't
- Collection of elements
  - Atomic (raw character data)
  - Composite (sequence of nested sub-elements)
- Example
  <book>
    <title>A Relational Model for Large Shared Data Banks</title>
    <author>
      <name>E.F. Codd</name>
      <affiliation>IBM Research</affiliation>
    </author>
  </book>
58. XML Primer II
- XML documents can be accompanied by Document Type Descriptors (DTDs)
- DTDs serve the role of the schema of the document
  - Specify a regular expression for every element
- Example
  <!ELEMENT book (title, author)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (name, affiliation)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT affiliation (#PCDATA)>
59. The XTRACT System [GGR 00]
- DTDs are of great practical importance
  - Efficient storage of XML data collections
  - Formulation and optimization of XML queries
- However, DTDs are not mandatory, so XML data may not be accompanied by a DTD
  - Automatically-generated XML documents (e.g., from relational databases or flat files)
  - DTD standards for many communities are still evolving
- Goal of the XTRACT system
  - Automated inference of DTDs from XML-document collections
60. Problem Formulation
- Element types => alphabet
  - Infer the DTD for each element type separately
- Example sequences: instances of nested sub-elements
  - Only one level down in the hierarchy
- Problem statement
  - Given a set of example sequences for element e
  - Infer a good regular expression for e
- Hard problem!!
  - DTDs can comprise general, complex regular expressions
  - Must quantify the notion of goodness for regular expressions
61. Example XML Documents
  <book>
    <title></title>
    <author>
      <name></name>
      <affiliation></affiliation>
    </author>
  </book>

  <book>
    <title></title>
    <author>
      <name></name>
      <address></address>
    </author>
    <author>
      <name></name>
      <address></address>
    </author>
    <editor></editor>
  </book>

  <paper>
    <title></title>
    <author>
      <name></name>
      <affiliation></affiliation>
    </author>
    <conference></conference>
    <year></year>
  </paper>
62. Example (Continued)
- Simplified example sequences
  - <book>: title author, title author author editor
  - <paper>: title author conference year
  - <author>: name affiliation, name affiliation, name address, name address
- Desirable solution
  <!ELEMENT book (title, author*, editor?)>
  <!ELEMENT paper (title, author, conference, year)>
  <!ELEMENT author (name, affiliation?, address?)>
63. DTD Inference Requirements
- Requirements for a good DTD
  - Generalizes to intuitively correct but previously unseen examples
  - It should be concise (i.e., small in size)
  - It should be precise (i.e., not cover too many sequences not contained in the set of examples)
- Example: consider the example sequences for element p
  - ta, taa, taaa, taaaa
  - Candidate DTDs range from enumerating the examples (precise but not concise) to accepting everything over {t, a} (concise but not precise)
64. The XTRACT Approach: the MDL Principle
- The Minimum Description Length (MDL) principle quantifies and resolves the tradeoff between DTD conciseness and preciseness
- MDL principle: the best theory to infer from a set of data is the one which minimizes the sum of
  - (A) the length of the theory, in bits, plus
  - (B) the length of the data, in bits, when encoded with the help of the theory
- Part (A) captures conciseness; part (B) captures preciseness
65. Overview of the XTRACT System
- XTRACT consists of 3 subsystems: generalization, factoring, and MDL-based selection
- Example
  - Input sequences: I = {ab, abab, ac, ad, bc, bd, bbd, bbbe}
  - Generalization: SG = I U {(ab)*, (a|b)*, b*d, b*e}
  - Factoring: SF = SG U {(a|b)(c|d), b*(d|e)}
  - Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)
66. MDL Subsystem
- MDL principle: minimize the sum of
  - the theory description length, plus
  - the data description length given the theory
- In order to use MDL, we need to
  - Define the theory description length (the candidate DTD)
  - Define the data description length (the input sequences) given the theory (the candidate DTD)
  - Solve the resulting minimization problem
67. MDL Subsystem - Encoding Scheme
- Description length of a DTD
  - The number of bits required to encode the DTD, with each symbol drawn from the element alphabet plus the metacharacters (, ), |, *
- Description length of a sequence given a candidate DTD
  - The number of bits required to specify the sequence given the DTD
  - Use a sequence of encoding indices
    - The encoding of a given a is the empty string
    - The encoding of a given (a|b|c) is the index 0
    - The encoding of aaa given a* is the index 3
  - Example: the encoding of ababcabc given ((ab)*c)* is the sequence 2, 2, 1
68. MDL Encoding Example
- Consider again the example sequences for p: ta, taa, taaa, taaaa
- Candidate theories, with description lengths in bits (theory + data = total):
  - ta | taa | taaa | taaaa: data encoded as 0, 1, 2, 3 (17 + 7 = 24)
  - (t|a)*: data encoded as 201, 3011, 40111, 501111 (6 + 21 = 27)
  - ta*: data encoded as 1, 2, 3, 4 (3 + 7 = 10)
- ta* minimizes the total description length
69. MDL Subsystem - Minimization
[Figure: bipartite graph mapping the input sequences (ta, taa, taaa, taaaa) to candidate DTDs c1, c2, c3 with encoding-cost edge weights w11, w12, ...]
- Choosing the candidate DTDs maps to the Facility Location Problem (NP-hard)
- XTRACT employs fast heuristic algorithms proposed by the Operations Research community
70. References
- [BP 98] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. WWW7, 1998.
- [CDA 97] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, 1998.
- [CDI 98] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. ACM SIGMOD, 1998.
- [CGR 00] K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. VLDB, 2000.
- [DDF 90] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 1990.
- [GGR 00] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: A system for extracting document type descriptors from XML documents. ACM SIGMOD, 2000.
71. References (Continued)
- [GKR 98] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB, 1998.
- [GRS 98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD, 1998.
- [GRS 99] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Data Engineering (ICDE), 1999.
- [HKK 97] E. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on association rule hypergraphs. DMKD Workshop, 1997.
- [Kle 98] J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA, 1998.
- [KRR 98] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. WWW8, 1999.
- [ZRL 96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD, 1996.