Title: Data Mining meets the Internet: Techniques for Web Information Retrieval and Network Data Management

1. Data Mining meets the Internet: Techniques for Web Information Retrieval and Network Data Management
Minos Garofalakis, Rajeev Rastogi
Internet Management Research, Bell Laboratories, Murray Hill
2. The Web
- Over 1 billion HTML pages, 15 terabytes
- Wealth of information
  - Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow/white pages, maps, markets, ...
- Diverse media types: text, images, audio, video
- Heterogeneous formats: HTML, XML, PostScript, PDF, JPEG, MPEG, MP3
- Highly dynamic
  - 1 million new pages each day
  - Average page changes in a few weeks
- Graph structure with links between pages
  - Average page has 7-10 links
- Hundreds of millions of queries per day
3. Why is Web Information Retrieval Important?
- According to most predictions, the majority of human information will be available on the Web in ten years
- Effective information retrieval can aid in
  - Research: find all papers that use the primal-dual method to solve the facility location problem
  - Health/Medicine: what could be the reason for symptoms of yellow eyes, high fever and frequent vomiting
  - Travel: find information on the tropical island of St. Lucia
  - Business: find companies that manufacture digital signal processors (DSPs)
  - Entertainment: find all movies starring Marilyn Monroe between the years 1960 and 1970
  - Arts: find all short stories written by Jhumpa Lahiri
4. Web Information Retrieval Model
[Figure: a Crawler feeds documents into a Repository managed by a Storage Server; an Indexer builds an Inverted Index (e.g., engine, jaguar, cat) and Clustering/Classification builds a Topic Hierarchy (Root -> Business, News, Science, with subtopics such as Computers, Automobiles, Plants, Animals); a Web Server answers search queries (e.g., "jaguar") against the index. Example documents in the repository: "The jaguar has a 4 liter engine" vs. "The jaguar, a cat, can run at speeds reaching 50 mph".]
5. Why is Web Information Retrieval Difficult?
- The abundance problem (99% of the information is of no interest to 99% of the people)
  - Hundreds of irrelevant documents returned in response to a search query
- Limited coverage of the Web (Internet sources hidden behind search interfaces)
  - The largest crawlers cover less than 18% of Web pages
- The Web is extremely dynamic
  - 1 million pages added each day
- Very high dimensionality (thousands of dimensions)
- Limited query interface based on keyword-oriented search
- Limited customization to individual users
6. How can Data Mining Improve Web Information Retrieval?
- Latent Semantic Indexing (LSI)
  - SVD-based method to improve precision and recall
- Document clustering to generate topic hierarchies
  - Hypergraph partitioning, STIRR, ROCK
- Document classification to assign topics to new documents
  - Naive Bayes, TAPER
- Exploiting hyperlink structure to locate authoritative Web pages
  - HITS, Google, Web trawling
- Collaborative searching
  - SearchLight
- Image retrieval
  - QBIC, Virage, Photobook, WBIIS, WALRUS
7. Latent Semantic Indexing

8. Problems with the Inverted Index Approach
- Synonymy
  - Many ways to refer to the same object (e.g., car and automobile)
- Polysemy
  - Most words have more than one distinct meaning (e.g., jaguar the animal vs. the car)
[Table: term-document matrix over the terms animal, jaguar, speed, car, engine, porsche, automobile for Docs 1-3, illustrating synonymy (car vs. automobile) and polysemy (jaguar).]
9. LSI - Key Idea [DDF 90]
- Apply SVD to the t x d term-by-document matrix X:
  X = T0 S0 D0^T, where T0 (t x m) and D0 (d x m) have orthonormal columns and S0 (m x m) is diagonal
- Ignore the very small singular values in S0 (keep only the k largest values):
  X_hat = T S D^T, with T (t x k), S (k x k), and D^T (k x d)
- The new matrix X_hat of rank k is closest to X in the least-squares sense
10. Comparing Documents and Queries
- Comparing two documents
  - Essentially the dot product of two column vectors of X_hat: X_hat^T X_hat = D S^2 D^T
  - So one can consider the rows of the DS matrix as coordinates for documents and take dot products in this space
- Finding documents similar to a query q with term vector Xq
  - Derive a representation Dq for the query: Dq = Xq^T T S^(-1)
  - The dot product of DqS and the appropriate row of the DS matrix yields the similarity between the query and a specific document (a small numeric sketch follows)
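As a concrete illustration of the decomposition and query folding described above, here is a minimal LSI sketch. It assumes a tiny hand-made term-by-document matrix and k = 2; names such as `term_doc` and `fold_in_query` are illustrative, not from the slides.

```python
import numpy as np

# t x d term-by-document matrix X (rows = terms, columns = documents)
term_doc = np.array([
    [1, 0, 1],   # "jaguar"
    [1, 0, 0],   # "engine"
    [0, 1, 0],   # "cat"
    [1, 0, 1],   # "car"
], dtype=float)

k = 2
# Full SVD: X = T0 @ diag(s0) @ D0t
T0, s0, D0t = np.linalg.svd(term_doc, full_matrices=False)

# Keep only the k largest singular values
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]

# Document coordinates in LSI space: rows of D @ S
doc_coords = Dt.T @ S

def fold_in_query(query_vec):
    """Map a query term vector into LSI space: Dq = Xq^T T S^-1."""
    return query_vec @ T @ np.linalg.inv(S)

q = np.array([1, 0, 0, 1], dtype=float)    # query containing "jaguar" and "car"
q_coords = fold_in_query(q) @ S            # compare Dq*S against rows of D*S

# Cosine similarity between the query and each document
sims = doc_coords @ q_coords / (
    np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_coords) + 1e-12)
print(sims)
```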
11. LSI - Benefits
- Reduces the dimensionality of documents
  - From tens of thousands of dimensions (one per keyword) to a few hundred
  - Decreases the storage overhead of index structures
  - Speeds up retrieval of documents similar to a query
  - Makes search less brittle
- Captures the semantics of documents
  - Addresses the problems of synonymy and polysemy
  - Transforms the document space from discrete to continuous
  - Improves both search precision and recall
12. Document Clustering

13. Improve Search Using Topic Hierarchies
- Web directories (or topic hierarchies) provide a hierarchical classification of documents (e.g., Yahoo!)
- Searches performed in the context of a topic are restricted to only the subset of web pages related to that topic
- Clustering can be used to generate topic hierarchies
[Figure: Yahoo home page -> Recreation, Science, Business, News; Recreation -> Sports, Travel; Business -> Companies, Finance, Jobs.]
14. Clustering
- Given
  - Data points (documents) and the number of desired clusters k
- Group the data points (documents) into k clusters
  - Data points (documents) within a cluster are more similar to each other than to points in other clusters
- Document similarity measure
  - Each document can be represented by a vector with a 0/1 value along each word dimension
  - The cosine of the angle between document vectors, or the (Euclidean) distance between the vectors, is a measure of their similarity
- Other applications
  - Customer segmentation
  - Market basket analysis
15. k-means Algorithm
- Choose k initial means
- Assign each point to the cluster with the closest mean
- Compute the new mean for each cluster
- Iterate until the k means stabilize (see the sketch below)
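A minimal k-means sketch of the four steps above, assuming 2-D numpy points and Euclidean distance; in practice a library implementation (e.g., scikit-learn's KMeans) would normally be used.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Choose k initial means (random distinct points)
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the cluster with the closest mean
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Compute the new mean of each cluster (keep old mean if a cluster is empty)
        new_means = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)])
        # Stop when the k means stabilize
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
centers, labels = kmeans(pts, k=2)
print(centers, labels)
```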
16. Agglomerative Hierarchical Clustering Algorithms
- Initially each point is a distinct cluster
- Repeatedly merge the closest clusters until the number of clusters becomes k
- "Closest" can be measured by d_mean(Ci, Cj) (distance between cluster centroids) or d_min(Ci, Cj) (distance between the closest pair of points)
- Likewise d_ave(Ci, Cj) and d_max(Ci, Cj); a small merging sketch follows
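A minimal agglomerative-merging sketch, assuming a small numpy point set and the centroid distance d_mean; swapping in d_min would give the MST/single-link behavior discussed on the next slide.

```python
import numpy as np
from itertools import combinations

def agglomerative(points, k):
    clusters = [[i] for i in range(len(points))]   # each point starts as its own cluster
    while len(clusters) > k:
        # d_mean: distance between the centroids of two clusters
        def d_mean(a, b):
            return np.linalg.norm(points[a].mean(axis=0) - points[b].mean(axis=0))
        # Find and merge the pair of clusters with the smallest d_mean
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: d_mean(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

pts = np.array([[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.8], [9, 0]])
print(agglomerative(pts, k=3))
```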
17. Agglomerative Hierarchical Clustering Algorithms (Continued)
- d_mean: centroid approach; d_min: Minimum Spanning Tree (MST) approach
[Figure: (a) centroid approach, (b) MST approach, (c) correct clusters.]
18. Drawbacks of Traditional Clustering Methods
- Traditional clustering methods are ineffective for clustering documents
  - Cannot handle thousands of dimensions
  - Cannot scale to millions of documents
- The centroid-based method splits large and non-hyperspherical clusters
  - Centers of subclusters can be far apart
- The MST-based algorithm is sensitive to outliers and to slight changes in position
  - Exhibits a chaining effect on strings of outliers
- Using other similarity measures, such as the Jaccard coefficient instead of Euclidean distance, does not help
19. Example - Centroid Method for Clustering Documents
- As cluster size grows
  - The number of dimensions appearing in the mean goes up
  - Their value in the mean decreases
  - Thus, it becomes very difficult to distinguish two points that differ on few dimensions
- Ripple effect
  - 1, 4 and 6 are merged even though they have no elements in common!
20. Itemset Clustering using Hypergraph Partitioning [HKK 97]
- Build a weighted hypergraph from frequent itemsets
  - Hyperedge: each frequent itemset
  - Weight of a hyperedge: average of the confidences of all association rules generated from the itemset
- A hypergraph partitioning algorithm is used to cluster items
  - Minimize the sum of the weights of cut hyperedges
- Label customers with item clusters by scoring
- Assumes that the items defining clusters are disjoint!
21. STIRR - A System for Clustering Categorical Attribute Values [GKR 98]
- Motivated by spectral graph partitioning, a method for clustering undirected graphs
- Each distinct attribute value becomes a separate node v with weight w(v)
- Node weights w(v) are updated in each iteration: for each tuple, update the weights of its values, then normalize the weight set so that it remains orthonormal
- Positive and negative weights in non-principal basins tend to represent good partitions of the data
22. ROCK [GRS 99]
- Hierarchical clustering algorithm for categorical attributes
  - Example: market basket customers
- Uses the novel concept of links for merging clusters
  - sim(pi, pj): similarity function that captures the closeness between pi and pj
  - pi and pj are said to be neighbors if sim(pi, pj) exceeds a threshold
  - link(pi, pj): the number of common neighbors of pi and pj
- At each step, merge the clusters/points with the largest number of links
  - Points belonging to a single cluster will in general have a large number of common neighbors
- Random sampling is used for scale-up
- In the final labeling phase, each point on disk is assigned to the cluster with the maximum number of neighbors
23. ROCK - Example
Baskets (group 1): {1, 2, 3} {1, 4, 5} {1, 2, 4} {2, 3, 4} {1, 2, 5} {2, 3, 5} {1, 3, 4} {2, 4, 5} {1, 3, 5} {3, 4, 5}
Baskets (group 2): {1, 2, 6} {1, 2, 7} {1, 6, 7} {2, 6, 7}
- {1, 2, 6} and {1, 2, 7} have 5 links.
- {1, 2, 3} and {1, 2, 6} have 3 links.
- A small sketch for computing links appears below.
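A minimal sketch of ROCK-style links between points. It assumes points are item sets, uses Jaccard similarity as sim(), and a neighbor threshold theta = 0.5; the actual similarity function and threshold are whatever the application chooses.

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def links(points, theta=0.5):
    """link(p, q) = number of common neighbors of p and q."""
    n = len(points)
    neighbors = [{j for j in range(n)
                  if j != i and jaccard(points[i], points[j]) >= theta}
                 for i in range(n)]
    return {(i, j): len(neighbors[i] & neighbors[j])
            for i, j in combinations(range(n), 2)}

baskets = [frozenset(s) for s in
           [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 2, 7}, {1, 6, 7}]]
for (i, j), l in links(baskets).items():
    print(sorted(baskets[i]), sorted(baskets[j]), "links:", l)
```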
24. Clustering Algorithms for Numeric Attributes
- Scalable clustering algorithms (from the database community)
  - CLARANS
  - DBSCAN
  - BIRCH
  - CLIQUE
  - CURE
  - ...
- The above algorithms can be used to cluster documents after reducing their dimensionality using SVD
25. BIRCH [ZRL 96]
- Pre-cluster data points using a CF-tree
  - The CF-tree is similar to an R-tree
- For each point
  - The CF-tree is traversed to find the closest cluster
  - If the cluster is within epsilon distance, the point is absorbed into the cluster
  - Otherwise, the point starts a new cluster
- Requires only a single scan of the data
- The cluster summaries stored in the CF-tree are handed to a main-memory hierarchical clustering algorithm
26. CURE [GRS 98]
- Hierarchical algorithm for discovering arbitrarily shaped clusters
- Uses a small number of representatives per cluster
- Note
  - Centroid-based methods use 1 point to represent a cluster: too little information, biased toward hyper-spherical clusters
  - MST-based methods use every point to represent a cluster: too much information, easily misled
- Uses random sampling
- Uses partitioning
- Labeling using representatives
27. Cluster Representatives
- A representative set of points
  - Small in number: c
  - Distributed over the cluster
  - Each point in the cluster is close to one representative
- Distance between clusters
  - The smallest distance between representatives
28. Computing Cluster Representatives
- Finding scattered representatives
  - We want the representatives to
    - Be distributed around the center of the cluster
    - Spread well out over the cluster
    - Capture the physical shape and geometry of the cluster
  - Use the farthest-point heuristic to scatter the points over the cluster
  - Shrink them uniformly around the mean of the cluster
29. Computing Cluster Representatives (Continued)
- Shrinking the representatives
  - Why do we need to alter the representative set?
    - The chosen points may lie too close to the boundary of the cluster
  - Shrink them uniformly around the mean (center) of the cluster (a small sketch follows)
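A minimal sketch of CURE-style cluster representatives, assuming numpy points, c representatives, and a shrink factor alpha; the farthest-point scattering and uniform shrinking toward the centroid follow the description on the preceding slides, while the parameter values are illustrative.

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.3):
    centroid = points.mean(axis=0)
    # Farthest-point heuristic: start from the point farthest from the centroid,
    # then repeatedly add the point farthest from the already-chosen representatives.
    reps = [points[np.linalg.norm(points - centroid, axis=1).argmax()]]
    while len(reps) < min(c, len(points)):
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[dists.argmax()])
    reps = np.array(reps)
    # Shrink the representatives uniformly toward the cluster mean
    return reps + alpha * (centroid - reps)

cluster = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
print(cure_representatives(cluster, c=3))
```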
30. Document Classification

31. Classification
- Given
  - A database of tuples (documents), each assigned a class label
- Develop a model/profile for each class
  - Example profile (good credit): (25 ... 40k) or (married = YES)
  - Example profile (automobile): the document contains a word from {car, truck, van, SUV, vehicle, scooter}
- Other applications
  - Credit card approval (good, bad)
  - Bank locations (good, fair, poor)
  - Treatment effectiveness (good, fair, poor)
32. Naive Bayesian Classifier
- The class c for a new document d is the one for which Pr[c | d] is maximum
- Assume independent term occurrences in a document
  - Pr[t | c]: fraction of documents in class c that contain term t
- Then, by Bayes rule, Pr[c | d] is proportional to Pr[c] times the product over terms t in d of Pr[t | c] (a small sketch follows)
33. Hierarchical Classifier (TAPER) [CDA 97]
- The class of a new document d is the leaf node c of the topic hierarchy for which Pr[c | d] is maximum
  - Pr[c | d] can be computed using Bayes rule
- The problem of computing c reduces to finding the leaf node c with the least-cost path from the root to c
34. k-Nearest Neighbor Classifier
- Assign to a point the label of the majority of its k nearest neighbors
- For k = 1, the error rate is never worse than twice the Bayes rate (with an unlimited number of samples)
- Scalability issues (a brute-force sketch follows this list)
  - Use an index to find the k nearest neighbors
    - The R-tree family works well up to about 20 dimensions
    - Pyramid trees for high-dimensional data
  - Use SVD to reduce the dimensionality of the data set
  - Use clusters to reduce the dataset size
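A minimal brute-force k-nearest-neighbor classifier sketch, assuming a small numpy dataset and Euclidean distance; at scale an index such as an R-tree or pyramid tree would replace the linear scan, as noted above.

```python
import numpy as np
from collections import Counter

def knn_classify(train_pts, train_labels, query, k=3):
    # Compute distances to all training points and pick the k closest
    dists = np.linalg.norm(train_pts - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k nearest neighbors
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(X, y, np.array([0.5, 0.5]), k=3))   # prints "a"
```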
35. Decision Trees
[Figure: example credit-analysis decision tree with splits on salary and on education ("in graduate"), leading to accept/reject leaves.]
36. Decision Tree Algorithms
- Building phase
  - Recursively split nodes using the best splitting attribute for each node
- Pruning phase
  - A smaller, imperfect decision tree generally achieves better accuracy
  - Prune leaf nodes recursively to prevent over-fitting
37. Decision Tree Algorithms
- Classifiers from the machine learning community
  - ID3
  - C4.5
  - CART
- Classifiers for large databases
  - SLIQ, SPRINT
  - PUBLIC
  - SONAR
  - RainForest, BOAT
38. Decision Trees
- Pros
  - Fast execution time
  - Generated rules are easy for humans to interpret
  - Scale well for large data sets
  - Can handle high-dimensional data
- Cons
  - Cannot capture correlations among attributes
  - Consider only axis-parallel cuts
39. Feature Selection
- Choose a collection of keywords that help discriminate between two or more sets of documents
  - Fewer keywords help to speed up classification
  - Improves classification accuracy by eliminating noise from documents
- Fisher's discriminant: the ratio of between-class scatter to within-class scatter of a term, computed from per-class term statistics (e.g., whether a document d contains term t); a sketch of one such score follows
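A sketch of a Fisher's-discriminant feature score for a term. It assumes binary term indicators (1 if a document contains the term) and scores a term by the ratio of between-class scatter of the per-class means to the within-class scatter; this is one common way to instantiate the criterion named on the slide, since the exact formula on the original slide was not preserved.

```python
import numpy as np

def fisher_score(docs_by_class, term):
    """docs_by_class: dict mapping class name -> list of documents (sets of terms)."""
    means, variances, overall = {}, {}, []
    for c, docs in docs_by_class.items():
        x = np.array([1.0 if term in d else 0.0 for d in docs])
        means[c], variances[c] = x.mean(), x.var()
        overall.extend(x)
    mu = np.mean(overall)
    between = sum(len(docs) * (means[c] - mu) ** 2
                  for c, docs in docs_by_class.items())
    within = sum(len(docs) * variances[c]
                 for c, docs in docs_by_class.items())
    return between / (within + 1e-12)

corpus = {
    "auto":   [{"car", "engine"}, {"car", "jaguar"}, {"truck"}],
    "animal": [{"cat", "jaguar"}, {"cat", "speed"}, {"jaguar"}],
}
for t in ["car", "cat", "jaguar"]:
    print(t, round(fisher_score(corpus, t), 3))   # "jaguar" scores low: poor discriminator
```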
40. Exploiting Hyperlink Structure

41. HITS (Hyperlink-Induced Topic Search) [Kle 98]
- HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery
- Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages
  - Authorities: highly-referenced pages on a topic
  - Hubs: pages that point to authorities
- A good authority is pointed to by many good hubs; a good hub points to many good authorities
[Figure: hubs pointing to authorities.]
42. HITS - Discovering Web Communities
- Discovering the community for a specific topic/query involves the following steps (a sketch of the iteration follows this list)
  - Collect a seed set of pages S (returned by a search engine)
  - Expand the seed set to contain pages that point to, or are pointed to by, pages in the seed set
  - Iteratively update the hub weight h(p) and authority weight a(p) for each page
  - After a fixed number of iterations, the pages with the highest hub/authority weights form the core of the community
- Extensions proposed in Clever
  - Assign links different weights based on the relevance of the link's anchor text
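A minimal sketch of the HITS weight iteration, assuming the expanded page set is given as a dict mapping each page to the set of pages it links to; weights are renormalized each round, as in Kleinberg's formulation.

```python
import math

def hits(out_links, n_iter=20):
    pages = set(out_links) | {q for ps in out_links.values() for q in ps}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(n_iter):
        # a(p) = sum of hub weights of pages pointing to p
        auth = {p: sum(hub[q] for q in pages if p in out_links.get(q, ()))
                for p in pages}
        # h(p) = sum of authority weights of pages that p points to
        hub = {p: sum(auth[q] for q in out_links.get(p, ())) for p in pages}
        for w in (auth, hub):                      # normalize to unit length
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            for p in w:
                w[p] /= norm
    return hub, auth

links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "h3": {"a1", "a2"}}
hub, auth = hits(links)
print(sorted(auth.items(), key=lambda kv: -kv[1]))
```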
43. Google [BP 98]
- Search engine that uses the link structure to calculate a quality ranking (PageRank) for each page
- PageRank
  - Can be calculated using a simple iterative algorithm (sketched after this list), and corresponds to the principal eigenvector of the normalized link matrix
  - Intuition: PageRank is the probability that a random surfer visits a page
  - Parameter p is the probability that the surfer gets bored and starts on a new random page
  - (1 - p) is the probability that the random surfer follows a link on the current page
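A minimal PageRank iteration sketch. It assumes the graph is a dict mapping a page to its list of out-links and uses p as the "boredom"/random-jump probability described on the slide; handling of dangling pages (spreading their rank evenly) is an implementation assumption.

```python
def pagerank(out_links, p=0.15, n_iter=50):
    pages = set(out_links) | {q for ps in out_links.values() for q in ps}
    n = len(pages)
    rank = {pg: 1.0 / n for pg in pages}
    for _ in range(n_iter):
        new_rank = {pg: p / n for pg in pages}       # random jump with probability p
        for pg, targets in out_links.items():
            if targets:                              # follow a link with probability 1 - p
                share = (1 - p) * rank[pg] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:                                    # dangling page: spread its rank evenly
                for t in pages:
                    new_rank[t] += (1 - p) * rank[pg] / n
        rank = new_rank
    return rank

g = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(g))
```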
44. Google - Features
- In addition to PageRank, Google also weighs keyword matches to improve search
  - Anchor text
    - Provides more accurate descriptions of Web pages
    - Anchors exist for un-indexable documents (e.g., images)
  - Font sizes of words in the text
    - Words in larger or bolder fonts are assigned higher weights
- Google vs. HITS
  - Google: PageRanks are computed up front for all Web pages, independent of the search query
  - HITS: hub and authority weights are computed for different root sets in the context of a particular search query
45. Trawling the Web for Emerging Communities [KRR 98]
- Co-citation: pages that are related are frequently referenced together
- Web communities are characterized by dense directed bipartite subgraphs
- Computing (i, j) bipartite cores
  - Sort the edge list by source id and detect all source pages s with out-degree j (let D be the set of destination pages that s points to)
  - Compute the intersection S of the sets of source pages pointing to the destination pages in D (using an index on destination id to generate each source set)
  - Output the bipartite core (S, D)
[Figure: example bipartite core.]
46. Using Hyperlinks to Improve Classification [CDI 98]
- Using text from neighbors when classifying a Web page
  - Ineffective, because referenced pages may belong to a different class
- Use class information from pre-classified neighbors
  - Choose the class ci for which Pr[ci | Ni] is maximum (Ni is the set of class labels of all the neighboring documents)
  - By Bayes rule, we choose ci to maximize Pr[Ni | ci] Pr[ci]
  - Assume independence of the neighbor classes
47. Collaborative Search

48. SearchLight
- Key idea: improve search by sharing information on URLs visited by members of a community during search
- Based on the concept of search sessions
  - A search session is a search-engine query (a collection of keywords) together with the URLs visited in response to the query
  - Search sessions can be extracted from proxy logs
- SearchLight maintains a database of (query, target URL) pairs
  - The target URL is heuristically chosen to be the last URL in the search session for the query
- In response to a search query, SearchLight displays the URLs stored in its database for the specified query
49. Image Retrieval

50. Similar Images
- Given
  - A set of images
- Find
  - All images similar to a given image
  - All pairs of similar images
- Sample applications
  - Medical diagnosis
  - Weather prediction
  - Web search engine for images
  - E-commerce
51. Similar Image Retrieval Systems
- QBIC, Virage, Photobook
  - Compute a feature signature for each image
    - QBIC uses color histograms
    - WBIIS and WALRUS use wavelets
  - Use a spatial index to retrieve the database image whose signature is closest to the query's signature
- QBIC drawbacks
  - Computes a single signature for the entire image
    - Thus, it fails when images contain similar objects, but at different locations or in varying sizes
  - Color histograms cannot capture shape, texture and location information (wavelets can!)
52. WALRUS Similarity Model [NRS 99]
- WALRUS decomposes an image into regions
  - A single signature is stored for each region
- Two images are considered similar if they have enough similar region pairs
53. WALRUS (Step 1)
- Generation of signatures for sliding windows
  - Each image is broken into sliding windows
  - For the signature of each sliding window, use the coefficients from the lowest frequency band of the Haar wavelet
- Naive algorithm vs. dynamic programming algorithm for computing the signatures
  - N: number of pixels in the image
  - S, max window size: remaining complexity parameters
54. WALRUS (Step 2)
- Clustering sliding windows
  - Cluster the windows in the image using the pre-clustering phase of BIRCH
  - Each cluster defines a region in the image
  - For each cluster, the centroid is used as its signature (c.f. bounding box)
55. WALRUS - Retrieval Results
[Figure: example query image and retrieved results.]
56. Network-Data Management and Analysis

57. Networks Create Data
- To effectively manage their networks, Internet/telecom service providers continuously gather utilization and traffic data
- Managed IP network elements collect huge amounts of traffic data
  - Switch/router-level monitoring (SNMP, RMON, NetFlow, etc.)
  - A typical IP router maintains several thousand SNMP counters
  - Service-Level Agreements (SLAs) and Quality-of-Service (QoS) guarantees require finer-grain monitoring (per IP flow!)
- Telecom networks: Call-Detail Records (CDRs) for every phone call
  - Each CDR comprises hundreds of bytes of data with several tens of fields/attributes (e.g., endpoint exchanges, timestamps, tariffs)
- End result: massive collections of Network-Management (NM) data (can grow on the order of several terabytes per year!)
58. Why Data Management?
- Massive NM data sets hide knowledge that is crucial to key management tasks
  - Application/user profiling, proactive/reactive resource management, traffic engineering, capacity planning, etc.
- Data Mining research can help!
  - Develop novel tools for the effective storage, exploration, and analysis of massive Network-Management data
- Several challenging research themes
  - Semantic data compression, approximate query processing, XML, mining models for event correlation and fault analysis, network-recommender systems, ...
- Long-term goal :-)
  - Intelligent, self-tuning, self-healing communication networks
59. Mining Techniques for Network Data
- Automated schema extraction for XML data: the XTRACT system
- Data reduction techniques for massive data tables
  - Lossless semantic compression with simple data dependencies: the pzip algorithm
  - Lossy, guaranteed-error semantic compression
    - Fascicles
    - Model-Based Semantic Compression: the SPARTAN system
- Approximate query processing over data synopses
- Mining techniques for event correlation and root-cause analysis
- Managing and mining data streams
60. Automated Schema Extraction for XML Data: The XTRACT System

61. XML Primer I
- Standard for data representation and data exchange
  - Unified, self-describing format for publishing/exchanging management data across heterogeneous network/NM platforms
- Looks like HTML, but it isn't
- Collection of elements
  - Atomic (raw character data)
  - Composite (a sequence of nested sub-elements)
- Example: an element describing a paper, with nested sub-elements
  - "A relational Model for Large Shared Data Banks"
  - E.F. Codd
  - IBM Research
62. XML Primer II
- XML documents can be accompanied by Document Type Descriptors (DTDs)
- DTDs serve the role of the schema of the document
  - They specify a regular expression for every element (i.e., the pattern its nested sub-elements must match)
63. The XTRACT System [GGR 00]
- DTDs are of great practical importance
  - Efficient storage of XML data collections
  - Formulation and optimization of XML queries
- However, DTDs are not mandatory; XML data may not be accompanied by a DTD
  - Automatically-generated XML documents (e.g., from relational databases or flat files)
  - DTD standards for many communities are still evolving
- Goal of the XTRACT system
  - Automated inference of DTDs from XML-document collections
64. Problem Formulation
- Element types => alphabet of symbols
- Infer the DTD for each element type separately
- Example sequences: instances of nested sub-elements
  - => only one level down in the hierarchy
- Problem statement
  - Given a set of example sequences for element e
  - Infer a good regular expression for e
- Hard problem!!
  - DTDs can comprise general, complex regular expressions
  - Need to quantify the notion of "goodness" for regular expressions
65. Example XML Documents
[Figure: sample XML documents used as input for DTD inference.]

66. Example (Continued)
- Simplified example sequences for an element (lists of nested sub-element names)
- Desirable solution: a concise DTD regular expression that generalizes over the example sequences
67. DTD Inference Requirements
- Requirements for a good DTD
  - It should generalize to intuitively correct but previously unseen examples
  - It should be concise (i.e., small in size)
  - It should be precise (i.e., not cover too many sequences not contained in the set of examples)
- Example: consider the example sequences for an element p
  - ta, taa, taaa, taaaa
  - Candidate DTDs for p trade off conciseness against preciseness (compared on a later slide)
68. The XTRACT Approach: MDL Principle
- The Minimum Description Length (MDL) principle quantifies and resolves the tradeoff between DTD conciseness and preciseness
- MDL principle: the best theory to infer from a set of data is the one which minimizes the sum of
  - (A) the length of the theory, in bits, plus
  - (B) the length of the data, in bits, when encoded with the help of the theory
- Part (A) captures conciseness, and part (B) captures preciseness
69. Overview of the XTRACT System
- XTRACT consists of 3 subsystems: generalization, factoring, and MDL-based selection
- Input sequences: I = {ab, abab, ac, ad, bc, bd, bbd, bbbe}
- Generalization: SG = I U {(ab)*, (a|b)*, b*d, b*e}
- Factoring: SF = SG U {(ab)*(c|d), b*(d|e)}
- Inferred DTD: (ab)* | (ab)*(c|d) | b*(d|e)
70. MDL Subsystem
- MDL principle: minimize the sum of
  - The theory description length, plus
  - The data description length given the theory
- In order to use MDL, we need to
  - Define the theory description length (candidate DTD)
  - Define the data description length (input sequences) given the theory (candidate DTD)
  - Solve the resulting minimization problem
71. MDL Subsystem - Encoding Scheme
- Description length of a DTD
  - The number of bits required to encode the DTD
  - Roughly, the length of the DTD times the log of the size of the alphabet extended with the metacharacters (, ), |, *
- Description length of a sequence given a candidate DTD
  - The number of bits required to specify the sequence given the DTD
  - Use a sequence of encoding indices
    - The encoding of a given a is the empty string
    - The encoding of a given (a|b|c) is the index 0
    - The encoding of aaa given a* is the index 3
    - Example: the encoding of ababcabc given ((ab)*c)* is the sequence 2, 2, 1, ...
72. MDL Encoding Example
- Consider again the example sequences ta, taa, taaa, taaaa

  Candidate DTD        Theory description   Data description (given the theory)     Total
  ta|taa|taaa|taaaa    17                   7  (encodings 0, 1, 2, 3)               24
  (t|a)*               6                    21 (encodings 201, 3011, 40111, 501111)  27
  ta*                  3                    7  (encodings 1, 2, 3, 4)               10
73. MDL Subsystem - Minimization
[Figure: bipartite graph connecting the input sequences (ta, taa, taaa, taaaa) to candidate DTDs (c1, c2, c3), with edge weights wij giving the encoding costs.]
- Selecting the best set of candidate DTDs maps to the Facility Location Problem (NP-hard)
- XTRACT employs fast heuristic algorithms proposed by the Operations Research community
74. Semantic Compression of Massive Network-Data Tables

75. Compressing Massive Tables: A New Direction in Data Compression
- The benefits of data compression are well established
  - Optimize storage, I/O, and network bandwidth (e.g., data transfers, disconnected operation for mobile users) over the lifetime of the data
  - Faster query processing over synopses
- Several generic compression tools and algorithms exist (e.g., gzip, Huffman, Lempel-Ziv)
  - Syntactic methods: they operate at the byte level and view the data as a large byte string
  - Lossless compression only
- Effective compression of massive alphanumeric tables
  - Needs novel methods that are semantic: account for and exploit the meaning and data dependencies of attributes in the table
  - Lossless or lossy compression: flexible mechanisms for users to specify acceptable information loss
76. The pzip Table Compressor [BCC 00]
- Key ideas
  - Lossless compression via training: use a small sample of table records to learn simple dependency patterns
  - Build a compression plan that exploits the discovered dependencies (e.g., column grouping)
  - Leverage existing compression tools (e.g., gzip, bzip) to losslessly compress the entire table
- Based on discovering and exploiting simple dependency patterns among table columns
  - Combinational dependencies
  - Differential dependencies
- Also uses simple differential coding for low-frequency columns
- Outperforms naive gzip by factors of up to 2 in compression ratio/time
77. Combinational Dependencies in pzip
- Some notation
  - T[i,j]: the portion of table T between columns i and j (T[i]: the i-th column of T)
  - S(T[i,j]): the size of the compressed (e.g., gzipped) representation of T[i,j]
- The ranges T[i,j] and T[j+1,k] are combinationally dependent iff S(T[i,j]) + S(T[j+1,k]) > S(T[i,k])
  - Grouping the two ranges results in better compression
- Optimum partitioning: find the column groupings that result in the minimum overall storage requirement (each column group is compressed individually)
  - Solved optimally using dynamic programming (sketched below): OPT[1,i] = min over j of { OPT[1,j] + S(T[j+1,i]) }
  - Complexity is O(n^2) assuming the S(T[i,j]) values are known (remember, these are computed over a sample of T)
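A minimal sketch of the column-partitioning dynamic program above, assuming columns are lists of strings and S() is approximated by the length of the zlib-compressed concatenation of a column group (the real pzip measures sizes with its chosen compressor over a sample).

```python
import zlib

def compressed_size(columns):
    """Approximate S(T[i,j]): compressed size of a group of columns."""
    payload = "\n".join("|".join(row) for row in zip(*columns)).encode()
    return len(zlib.compress(payload))

def optimal_partition(table_columns):
    n = len(table_columns)
    opt = [0] * (n + 1)              # opt[i]: best cost for the first i columns
    cut = [0] * (n + 1)
    for i in range(1, n + 1):
        # OPT[1,i] = min_j { OPT[1,j] + S(T[j+1,i]) }
        costs = [(opt[j] + compressed_size(table_columns[j:i]), j)
                 for j in range(i)]
        opt[i], cut[i] = min(costs)
    # Recover the column groupings from the cut points
    groups, i = [], n
    while i > 0:
        groups.append((cut[i] + 1, i))
        i = cut[i]
    return opt[n], list(reversed(groups))

cols = [["http"] * 50, ["80"] * 50, [str(i) for i in range(50)]]
print(optimal_partition(cols))
```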
78. Differential Dependencies in pzip
- Column T[j] is differentially dependent on T[i] iff S(T[j]) > S(T[i] - T[j])
  - Compressing the difference with respect to T[i], rather than T[j] itself, results in better compression
  - A more explicit form of dependency
- Differential compression problem: partition T's columns into "source" and "derived" columns, and find the differential encoding for each derived column such that the overall storage is minimized
  - Maps naturally to the Facility Location Problem (NP-hard)
  - Greedy local-search heuristics are used in the pzip implementation
79. Semantic Compression with Fascicles [JMN 99]
- Key observation
  - Often, numerous subsets of records in T have similar values for many attributes
- Compress the data by storing representative values (e.g., the centroid) only once for each such attribute cluster
- Lossy compression: the information loss is controlled by the (user-defined) notion of "similar values" for attributes
80. Problem Formulation
- k-dimensional fascicle F(k, t): a subset of records with k "compact" attributes
- The user-defined compactness tolerance t (a vector) specifies the allowable loss in the compression per attribute
  - E.g., t[Duration] = 3 means that all Duration values in a fascicle are within 3 of the centroid value
  - Flexible, per-attribute specification of compression loss
- Problem statement
  - Given a table T and a compactness-tolerance vector t, find fascicles within the specified tolerances such that the total storage is minimized
  - (1) Finding candidate fascicles in T; (2) selecting the best fascicles to compress T
81. Finding Candidate Fascicles
- Efficient, randomized algorithm
  - Use (memory-resident) random samples of T to choose an initial collection of "tip sets" (maximal fascicles based on the sampled records)
  - Grow the tip sets with all qualifying records in one pass over T
  - Not guaranteed to find all fascicles!
- Exact, level-wise (Apriori-like) procedures are possible (fascicles are anti-monotone), BUT
  - Inordinately expensive
  - Not necessarily better (they require static pre-binning of numeric attributes)
82. Selecting Fascicles for Compression
- Selecting the optimal subset among all candidate fascicles is hard!
  - A generalization of the Weighted Set Cover Problem (NP-hard)
- Use an efficient, greedy heuristic
  - Always select the fascicle that gives the maximum compression benefit
- Fascicles give significantly improved compression ratios (factors of 2-3) compared to naive gzip
83. SPARTAN: A Model-Based Semantic Compressor [BGR 01]
- New, general paradigm: Model-Based Semantic Compression (MBSC)
  - Extract Data Mining models and use them to compress
  - Lossless or lossy compression (with guaranteed per-attribute error bounds)
- The SPARTAN system is a specific instantiation of the MBSC framework
  - Key observation: row-wise attribute clusters (a la fascicles) are not sufficient (e.g., Y = aX + b)
  - Idea: use a carefully-selected collection of Classification and Regression Trees (CaRTs) to capture such vertical correlations and predict the values of entire columns
84. SPARTAN Example CaRT Models

  Protocol  Duration  Bytes  Packets
  http      12        20K    3
  http      16        24K    5
  http      15        20K    8
  http      19        40K    11
  http      26        58K    18
  ftp       27        100K   24
  ftp       32        300K   35
  ftp       18        80K    15

- Can use two compact trees (one decision tree, one regression tree) to eliminate two data columns (the predicted attributes)
85. SPARTAN Architecture

86. SPARTAN's CaRTSelector
- The heart of the SPARTAN semantic-compression engine
- Uses the Bayesian network constructed on T to drive the construction and selection of the best subset of CaRT predictors
- Hard optimization problem: a strict generalization of Weighted Maximum Independent Set (WMIS) (NP-hard!)
- The CaRTSelector employs a novel algorithm that iteratively uses a near-optimal WMIS heuristic to determine a good subset of CaRTs for compression
- SPARTAN's compression ratios outperform gzip and fascicles by wide margins (even for lossless compression)
- Higher, but reasonable, compression times (8 min for a 14-attribute, 30MB table); samples are used to learn the CaRT models
- SPARTAN's model predictors can be useful in other NM contexts
  - e.g., event-correlation filtering, root-cause analysis (more later...)
87. Approximate Query Processing Over Synopses

88. Data Exploration in Traditional Decision Support Systems (DSS)
[Figure: an SQL query issued against a data warehouse (GB/TB) returns exact answers, but with long response times.]
89. Exact Answers NOT Always Required
- Interactive exploration of massive data sets
  - Early feedback giving a rough idea of the results would help to quickly find the interesting regions in the data space
  - Data visualization
- Aggregate queries: approximate answers often suffice
  - How does the total sales of product X in NJ compare to that in CA? Precision to the penny is not needed
- Base data may be remote/unavailable
  - Locally-cached synopses of the data may be the only option
90. Solution: Approximate Query Processing
[Figure: compact relations (MB) are constructed in advance from the data warehouse (GB/TB); SQL queries are transformed (via a transformation algebra) to run on the compact relations, yielding approximate answers with fast response times.]
91. Approximate Query Processing Using Wavelets [CGR 00]
- Construct compact synopses of the data table(s) using multi-dimensional Haar-wavelet decomposition
  - Fast: takes just a single pass over the data if it is chunked, otherwise a logarithmic number of passes
- SQL queries are answered by working just on the compact synopses (collections of wavelet coefficients), i.e., entirely in the wavelet (compressed) domain
  - Fast response times
  - Results are converted back to the relational domain (rendering) at the end
  - All types of queries supported: aggregate, set-valued, GROUP-BY, ...
- Fast, accurate, general
92. Query Processing Architecture
- The entire processing happens in the compressed (wavelet) domain

93. Query Execution
- Each operator (e.g., select, project, join, aggregates)
  - Input: a set of Haar coefficients
  - Output: a set of Haar coefficients
- Finally, a rendering step
  - Input: a set of Haar coefficients
  - Output: a (multi)set of tuples
- (A one-dimensional Haar decomposition sketch follows.)
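A minimal one-dimensional Haar wavelet decomposition sketch, assuming the data vector length is a power of two; the multi-dimensional decompositions used for table synopses apply the same averaging/differencing idea along each dimension, and a synopsis keeps only the largest coefficients.

```python
def haar_decompose(values):
    """Return the Haar wavelet coefficients of `values` (length must be a power of 2)."""
    data = [float(v) for v in values]
    coeffs = []
    while len(data) > 1:
        averages = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        details = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        coeffs = details + coeffs      # prepend so coarser-level details come first
        data = averages
    return data + coeffs               # overall average, then detail coefficients

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```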
94. Mining Techniques for Event Correlation and Root-Cause Analysis

95. Network Event Correlation and Root-Cause Analysis
- The problem: alarm floods!!
[Figure: routers emitting floods of alarms.]
96. NM System Architecture
- Event Correlation (EC): use fault propagation rules to improve information quality and filter secondary alarms
- Root-Cause Analysis (RCA): employ the EC output to produce a set of possible root causes and associated degrees of confidence
97. Event Correlation Engine
- Driven by fault propagation rules (causal relationships between alarm signals): a CAUSAL BAYESIAN MODEL!!
- Given a set of observed alarms A, find a minimal subset P of causes such that Pr[A | P] >= threshold
98. State-of-the-Art
- SMARTS InCharge
  - Network elements are modeled as objects with hard-coded fault propagation rules
  - Uses the causal graph to produce binary signatures for each failure (a "codebook")
- HP OpenView ECS, Cisco InfoCenter, GTE Impact, ...
  - Graphics- or language-based specification of global rules for event filtering
- Hand-coding of the causal model!!
  - Tedious, error-prone, non-incremental
  - Ignores probabilistic aspects (dependency strength)
99. Data Mining can Help Automate
- Data Mining techniques for inferring and maintaining causal models from network alarm data
  - Maintenance (on-line)
- Challenges: incorporate temporal aspects, topology, and domain knowledge into the data-mining process
100. Root Cause Analysis
- Use data mining (e.g., classification techniques) for RCA (a database of field data to learn failure signatures)
- Exploit domain knowledge (e.g., topology) in the data-mining process
- Refine the RCA models as more data from the field becomes available
101. References
- [BCC 00] A.L. Buchsbaum, D.F. Caldwell, K.W. Church, G.S. Fowler, and S. Muthukrishnan. Engineering the Compression of Massive Tables: An Experimental Approach. SODA, 2000.
- [BGR 01] S. Babu, M. Garofalakis, and R. Rastogi. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. ACM SIGMOD, 2001.
- [BP 98] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. WWW7, 1998.
- [CDA 97] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. ACM SIGMOD, 1998.
- [CDI 98] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, 1998.
- [CGR 00] K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate Query Processing Using Wavelets. VLDB, 2000.
102. References (Continued)
- [DDF 90] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the Society for Information Science, 41(6), 1990.
- [GGR 00] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: A System for Extracting Document Type Descriptors from XML Documents. ACM SIGMOD, 2000.
- [GKR 98] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB, 1998.
- [GRS 98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD, 1998.
- [GRS 99] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Data Engineering (ICDE), 1999.
- [HKK 97] E. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on association rule hypergraphs. DMKD Workshop, 1997.
103. References (Continued)
- [JMN 99] H.V. Jagadish, J. Madar, and R.T. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB, 1999.
- [Kle 98] J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA, 1998.
- [KRR 98] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. WWW8, 1999.
- [ZRL 96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD, 1996.