1
Data Mining meets the Internet: Techniques for Web Information Retrieval and Network Data Management
Minos Garofalakis and Rajeev Rastogi
Internet Management Research, Bell Laboratories, Murray Hill
2
The Web
  • Over 1 billion HTML pages, 15 terabytes
  • Wealth of information
  • Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow/white pages, maps, markets, ...
  • Diverse media types: text, images, audio, video
  • Heterogeneous formats: HTML, XML, PostScript, PDF, JPEG, MPEG, MP3
  • Highly Dynamic
  • 1 million new pages each day
  • Average page changes in a few weeks
  • Graph structure with links between pages
  • Average page has 7-10 links
  • Hundreds of millions of queries per day

3
Why is Web Information Retrieval Important?
  • According to most predictions, the majority of
    human information will be available on the Web in
    ten years
  • Effective information retrieval can aid in
  • Research: Find all papers that use the primal-dual method to solve the facility location problem
  • Health/Medicine: What could be the reason for symptoms of yellow eyes, high fever, and frequent vomiting?
  • Travel: Find information on the tropical island of St. Lucia
  • Business: Find companies that manufacture digital signal processors (DSPs)
  • Entertainment: Find all movies starring Marilyn Monroe between the years 1960 and 1970
  • Arts: Find all short stories written by Jhumpa Lahiri

4
Web Information Retrieval Model
[Architecture diagram] A crawler fetches pages from Web servers into a repository managed by a storage server. An indexer builds an inverted index over terms (e.g., engine, jaguar, cat), and clustering/classification modules organize the repository documents into a topic hierarchy (Root; Business, News, Science; lower-level topics such as Computers, Automobiles, Plants, Animals). A search query (e.g., jaguar) is answered using the inverted index and the topic hierarchy. Example documents: "The jaguar has a 4 liter engine" vs. "The jaguar, a cat, can run at speeds reaching 50 mph".
5
Why is Web Information Retrieval Difficult?
  • The Abundance Problem (99% of information is of no interest to 99% of people)
  • Hundreds of irrelevant documents returned in
    response to a search query
  • Limited Coverage of the Web (Internet sources
    hidden behind search interfaces)
  • The largest crawlers cover less than 18% of Web pages
  • The Web is extremely dynamic
  • 1 million pages added each day
  • Very high dimensionality (thousands of
    dimensions)
  • Limited query interface based on keyword-oriented
    search
  • Limited customization to individual users

6
How can Data Mining Improve Web Information
Retrieval?
  • Latent Semantic Indexing (LSI)
  • SVD-based method to improve precision and recall
  • Document clustering to generate topic hierarchies
  • Hypergraph partitioning, STIRR, ROCK
  • Document classification to assign topics to new
    documents
  • Naive Bayes, TAPER
  • Exploiting hyperlink structure to locate
    authoritative Web pages
  • HITS, Google, Web Trawling
  • Collaborative searching
  • SearchLight
  • Image Retrieval
  • QBIC, Virage, Photobook, WBIIS, WALRUS

7
Latent Semantic Indexing
8
Problems with Inverted Index Approach
  • Synonymy
  • Many ways to refer to the same object
  • Polysemy
  • Most words have more than one distinct meaning

[Term-document matrix] Columns: animal, jaguar, speed, car, engine, porsche, automobile; rows: Doc 1, Doc 2, Doc 3, with an X marking each term that occurs in a document. The term jaguar occurs both in documents about cats and in documents about cars (polysemy), while car, automobile, and porsche spread one concept across different terms (synonymy).
9
LSI - Key Idea DDF 90
  • Apply SVD to the term-by-document (t x d) matrix X:
      X = T0 S0 D0^T, where T0 (t x m) and D0 (d x m) have orthonormal columns and S0 (m x m) is diagonal
  • Ignore the very small singular values in S0 (keeping only the k largest):
      X ≈ X_k = T S D^T, with T (t x k), S (k x k), and D (d x k)
  • The new matrix X_k of rank k is the closest rank-k matrix to X in the least-squares sense
10
Comparing Documents and Queries
  • Comparing two documents
  • Essentially the dot product of two column vectors of X_k: X_k^T X_k = D S^2 D^T
  • So one can take the rows of the DS matrix as coordinates for documents and compute dot products in this k-dimensional space
  • Finding documents similar to a query q with term vector X_q
  • Derive a representation D_q for the query: D_q = X_q^T T S^(-1)
  • The dot product of D_q S and the appropriate row of the DS matrix yields the similarity between the query and a specific document
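
A minimal numpy sketch of the LSI pipeline on the last two slides: truncated SVD of a toy term-by-document matrix, followed by folding a query into the k-dimensional space. The matrix values, the value of k, and the variable names are illustrative assumptions, not taken from the original system.

    import numpy as np

    # toy term-by-document matrix X (t terms x d documents)
    X = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

    k = 2
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # X = T0 S0 D0^T
    T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T    # keep the k largest singular values
    X_k = T @ S @ D.T                                     # rank-k approximation of X

    # fold a query term vector into the k-dimensional space: D_q = X_q^T T S^-1
    X_q = np.array([1, 1, 0, 0], dtype=float)
    D_q = X_q @ T @ np.linalg.inv(S)

    # similarity = dot product of D_q S with each row of D S
    scores = (D @ S) @ (D_q @ S)
    print(np.argsort(-scores))        # documents ranked by similarity to the query
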
11
LSI - Benefits
  • Reduces Dimensionality of Documents
  • From tens of thousands (one dimension per keyword) to a few hundred
  • Decreases storage overhead of index structures
  • Speeds up retrieval of documents similar to a
    query
  • Makes search less brittle
  • Captures semantics of documents
  • Addresses problems of synonymy and polysemy
  • Transforms document space from discrete to
    continuous
  • Improves both search precision and recall

12
Document Clustering
13
Improve Search Using Topic Hierarchies
  • Web directories (or topic hierarchies) provide a
    hierarchical classification of documents (e.g.,
    Yahoo!)
  • Searches performed in the context of a topic restrict the search to only the subset of web pages related to that topic
  • Clustering can be used to generate topic
    hierarchies

[Example topic hierarchy] Yahoo home page, with top-level topics Recreation, Science, Business, News and lower-level topics such as Sports, Travel, Companies, Finance, and Jobs
14
Clustering
  • Given
  • Data points (documents) and number of desired
    clusters k
  • Group the data points (documents) into k clusters
  • Data points (documents) within clusters are more
    similar than across clusters
  • Document similarity measure
  • Each document can be represented by a vector with a 0/1 value along each word dimension
  • The cosine of the angle between document vectors (or, alternatively, the Euclidean distance between them) is a measure of their similarity; see the sketch after this list
  • Other applications
  • Customer segmentation
  • Market basket analysis
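
A minimal sketch of the cosine similarity measure mentioned above, applied to two 0/1 document vectors (the vectors and names are illustrative):

    import numpy as np

    def cosine_sim(d1, d2):
        # cosine of the angle between two document vectors
        return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

    doc1 = np.array([1, 1, 0, 1, 0])   # 0/1 presence of each word
    doc2 = np.array([1, 0, 0, 1, 1])
    print(cosine_sim(doc1, doc2))      # ~0.67
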

15
k-means Algorithm
  • Choose k initial means
  • Assign each point to the cluster with the closest
    mean
  • Compute new mean for each cluster
  • Iterate until the k means stabilize
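
A minimal k-means sketch following the four steps above. The toy data, the parameter choices, and the names are illustrative, and empty clusters are not handled.

    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        means = points[rng.choice(len(points), size=k, replace=False)]   # choose k initial means
        for _ in range(iters):
            # assign each point to the cluster with the closest mean
            labels = np.argmin(
                np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2), axis=1)
            # compute the new mean of each cluster
            new_means = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_means, means):       # stop once the k means stabilize
                break
            means = new_means
        return means, labels

    pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
    print(kmeans(pts, k=2))
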

16
Agglomerative Hierarchical Clustering Algorithms
  • Initially each point is a distinct cluster
  • Repeatedly merge closest clusters until the
    number of clusters becomes k
  • Closest pairs can be defined via d_mean(Ci, Cj): the distance between the centroids of Ci and Cj
  • or via d_min(Ci, Cj): the minimum distance between a point in Ci and a point in Cj
  • Likewise d_ave(Ci, Cj) (average pairwise distance) and d_max(Ci, Cj) (maximum pairwise distance)
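
A minimal sketch of the merge loop using the centroid distance d_mean; swapping in d_min, d_ave, or d_max only changes the distance computation (toy data, illustrative names):

    import numpy as np

    def agglomerative(points, k):
        clusters = [[i] for i in range(len(points))]       # each point starts as its own cluster
        while len(clusters) > k:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # d_mean: distance between the cluster centroids
                    d = np.linalg.norm(points[clusters[a]].mean(axis=0) -
                                       points[clusters[b]].mean(axis=0))
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            clusters[a].extend(clusters.pop(b))             # merge the two closest clusters
        return clusters

    pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
    print(agglomerative(pts, k=3))
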

17
Agglomerative Hierarchical Clustering Algorithms
(Continued)
d_mean: centroid approach; d_min: Minimum Spanning Tree (MST) approach

[Figure: (a) centroid-based clusters, (b) MST-based clusters, (c) the correct clusters]
18
Drawbacks of Traditional Clustering Methods
  • Traditional clustering methods are ineffective
    for clustering documents
  • Cannot handle thousands of dimensions
  • Cannot scale to millions of documents
  • Centroid-based method splits large and
    non-hyperspherical clusters
  • Centers of subclusters can be far apart
  • MST-based algorithm is sensitive to outliers and
    slight change in position
  • Exhibits a chaining effect on strings of outliers
  • Using other similarity measures such as Jaccard
    coefficient instead of euclidean distance does
    not help

19
Example - Centroid Method for Clustering Documents
  • As cluster size grows
  • The number of dimensions appearing in the mean goes up
  • Their value in the mean decreases
  • Thus, it becomes very difficult to distinguish two points that differ on only a few dimensions
  • Ripple effect
  • Documents 1, 4 and 6 end up merged even though they have no elements in common!

20
Itemset Clustering using Hypergraph Partitioning
HKK 97
  • Build a weighted hypergraph from the frequent itemsets
  • Hyperedge: each frequent itemset
  • Weight of a hyperedge: the average of the confidences of all association rules generated from the itemset
  • A hypergraph partitioning algorithm is used to cluster the items
  • Minimize the sum of the weights of the cut hyperedges
  • Label customers with item clusters by scoring
  • Assumes that the items defining the clusters are disjoint!

21
STIRR - A System for Clustering Categorical
Attribute Values GKR 98
  • Motivated by spectral graph partitioning, a
    method for clustering undirected graphs
  • Each distinct attribute value becomes a separate
    node v with weight w(v)
  • Node weights w(v) are updated in each iteration as follows: for each tuple containing v, combine the weights of the other attribute values in that tuple and add the result to the new weight of v; the updated set of weights is then normalized so that it is orthonormal
  • Positive and negative weights in non-principal
    basins tend to represent good partitions of the
    data

22
ROCK GRS 99
  • Hierarchical clustering algorithm for categorical
    attributes
  • Example market basket customers
  • Uses the novel concept of links for merging clusters (see the sketch below)
  • sim(pi, pj): a similarity function that captures the closeness between pi and pj
  • pi and pj are said to be neighbors if sim(pi, pj) exceeds a user-specified threshold
  • link(pi, pj): the number of common neighbors of pi and pj
  • At each step, merge the clusters/points with the largest number of links
  • Points belonging to a single cluster will in general have a large number of common neighbors
  • Random sampling is used for scale-up
  • In the final labeling phase, each point on disk is assigned to the cluster with the maximum number of neighbors
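
A minimal sketch of ROCK-style link computation, assuming Jaccard similarity and an illustrative threshold theta; the baskets and names are made up for illustration and are not the slide's data:

    from itertools import combinations

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def links(points, theta):
        # points: list of sets (e.g., market baskets); theta: similarity threshold
        n = len(points)
        # neighbors of i: all points whose similarity with i is at least theta
        nbrs = [{j for j in range(n) if j != i and jaccard(points[i], points[j]) >= theta}
                for i in range(n)]
        # link(i, j): number of common neighbors of points i and j
        return {(i, j): len(nbrs[i] & nbrs[j]) for i, j in combinations(range(n), 2)}

    baskets = [{1, 2, 3}, {1, 2, 6}, {1, 2, 7}, {1, 6, 7}, {2, 6, 7}]
    print(links(baskets, theta=0.5))
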

23
ROCK
{1, 2, 3}  {1, 4, 5}  {1, 2, 4}  {2, 3, 4}  {1, 2, 5}  {2, 3, 5}  {1, 3, 4}  {2, 4, 5}  {1, 3, 5}  {3, 4, 5}

{1, 2, 6}  {1, 2, 7}  {1, 6, 7}  {2, 6, 7}
  • {1, 2, 6} and {1, 2, 7} have 5 links.
  • {1, 2, 3} and {1, 2, 6} have 3 links.

24
Clustering Algorithms for Numeric Attributes
  • Scalable Clustering Algorithms
  • (From Database Community)
  • CLARANS
  • DBSCAN
  • BIRCH
  • CLIQUE
  • CURE
  • Above algorithms can be used to cluster documents
    after reducing their dimensionality using SVD

25
BIRCH ZRL 96
  • Pre-cluster data points using CF-tree
  • CF-tree is similar to R-tree
  • For each point
  • CF-tree is traversed to find the closest cluster
  • If the cluster is within epsilon distance, the
    point is absorbed into the cluster
  • Otherwise, the point starts a new cluster
  • Requires only single scan of data
  • Cluster summaries stored in CF-tree are given to
    main memory hierarchical clustering algorithm

26
CURE GRS 98
  • Hierarchical algorithm for discovering arbitrary-shaped clusters
  • Uses a small number of representatives per cluster
  • Note
  • Centroid-based methods use 1 point to represent a cluster: too little information... hyper-spherical clusters only
  • MST-based methods use every point to represent a cluster: too much information... easily misled
  • Uses random sampling
  • Uses Partitioning
  • Labeling using representatives

27
Cluster Representatives
  • A representative set of points
  • Small in number: c points
  • Distributed over the cluster
  • Each point in the cluster is close to one representative
  • Distance between clusters: the smallest distance between their representatives

28
Computing Cluster Representatives
  • Finding Scattered Representatives
  • We want to
  • Distribute around the center of the cluster
  • Spread well out over the cluster
  • Capture the physical shape and geometry of the
    cluster
  • Use farthest point heuristic to scatter the
    points over the cluster
  • Shrink uniformly around the mean of the cluster

29
Computing Cluster Representatives (Continued)
  • Shrinking the Representatives
  • Why do we need to alter the Representative Set?
  • Too close to the boundary of cluster
  • Shrink uniformly around the mean (center) of the
    cluster

30
Document Classification
31
Classification
  • Given
  • Database of tuples (documents), each assigned a
    class label
  • Develop a model/profile for each class
  • Example profile (good credit)
  • (25 < age and salary > 40k) or (married = YES)
  • Example profile (automobile)
  • Document contains a word from {car, truck, van, SUV, vehicle, scooter}
  • Other applications
  • Credit card approval (good, bad)
  • Bank locations (good, fair, poor)
  • Treatment effectiveness (good, fair, poor)

32
Naive Bayesian Classifier
  • The class c for a new document d is the one for which Pr[c | d] is maximum
  • Assume independent term occurrences in the document
  • Pr[t | c]: the fraction of documents in class c that contain term t
  • Then, by Bayes rule, Pr[c | d] is proportional to Pr[c] times the product, over the terms t in d, of Pr[t | c]
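
A minimal Naive Bayes sketch along these lines. The toy documents, the names, and the add-one smoothing are illustrative assumptions, not taken from the slide.

    import math
    from collections import defaultdict

    def train(docs):
        # docs: list of (set_of_terms, class_label) pairs
        class_count = defaultdict(int)
        term_count = defaultdict(lambda: defaultdict(int))
        for terms, c in docs:
            class_count[c] += 1
            for t in terms:
                term_count[c][t] += 1
        return class_count, term_count

    def classify(terms, class_count, term_count):
        n = sum(class_count.values())
        best, best_score = None, -math.inf
        for c, nc in class_count.items():
            # log Pr[c] + sum over terms of log Pr[t | c], with simple add-one smoothing
            score = math.log(nc / n)
            for t in terms:
                score += math.log((term_count[c][t] + 1) / (nc + 2))
            if score > best_score:
                best, best_score = c, score
        return best

    docs = [({"jaguar", "engine", "car"}, "auto"), ({"jaguar", "cat", "speed"}, "animal")]
    counts = train(docs)
    print(classify({"engine", "car"}, *counts))   # -> auto
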

33
Hierarchical Classifier (TAPER) CDA 97
  • The class of a new document d is the leaf node c of the topic hierarchy for which Pr[c | d] is maximum
  • Pr[c | d] can be computed using Bayes rule along the path of the topic hierarchy
  • The problem of computing c reduces to finding the leaf node c with the least-cost path from the root to c
34
k-Nearest Neighbor Classifier
  • Assign to a point the label of the majority of its k nearest neighbors
  • For k = 1, the error rate is never worse than twice the Bayes rate (with an unlimited number of samples)
  • Scalability issues
  • Use an index to find the k nearest neighbors
  • The R-tree family works well up to about 20 dimensions
  • Pyramid tree for high-dimensional data
  • Use SVD to reduce the dimensionality of the data set
  • Use clusters to reduce the dataset size
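
A minimal brute-force k-NN sketch; a linear scan stands in for the index structures and dimensionality-reduction steps listed above, and the data and names are illustrative:

    import numpy as np
    from collections import Counter

    def knn_classify(query, points, labels, k=3):
        # find the k nearest neighbors by Euclidean distance (brute-force linear scan)
        dists = np.linalg.norm(points - query, axis=1)
        nearest = np.argsort(dists)[:k]
        # assign the majority label among the k neighbors
        return Counter(labels[i] for i in nearest).most_common(1)[0][0]

    points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
    labels = ["A", "A", "B", "B"]
    print(knn_classify(np.array([0.2, 0.1]), points, labels, k=3))   # -> A
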

35
Decision Trees
Credit Analysis

[Decision-tree figure] The root node tests a condition on salary; one branch leads directly to an accept leaf, while the other leads to a node testing whether education is graduate, whose yes/no branches lead to accept and reject leaves.
36
Decision Tree Algorithms
  • Building phase
  • Recursively split nodes using best splitting
    attribute for node
  • Pruning phase
  • Smaller imperfect decision tree generally
    achieves better accuracy
  • Prune leaf nodes recursively to prevent
    over-fitting

37
Decision Tree Algorithms
  • Classifiers from machine learning community
  • ID3
  • C4.5
  • CART
  • Classifiers for large databases
  • SLIQ, SPRINT
  • PUBLIC
  • SONAR
  • Rainforest, BOAT

38
Decision Trees
  • Pros
  • Fast execution time
  • Generated rules are easy to interpret by humans
  • Scale well for large data sets
  • Can handle high dimensional data
  • Cons
  • Cannot capture correlations among attributes
  • Consider only axis-parallel cuts

39
Feature Selection
  • Choose a collection of keywords that help
    discriminate between two or more sets of
    documents
  • Fewer keywords help to speed up classification
  • Improves classification accuracy by eliminating
    noise from documents
  • Fisher's discriminant: the ratio of between-class to within-class scatter of a term's occurrences, computed per term t from the 0/1 indicator of whether a document d contains t (see the sketch below)
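
A minimal per-term Fisher-score sketch for two classes. It assumes the common two-class form, squared difference of class means over the sum of within-class variances; the exact formula on the original slide did not survive, so this form and the toy matrices are assumptions.

    import numpy as np

    def fisher_score(X_pos, X_neg):
        # X_pos, X_neg: 0/1 document-term matrices for the two classes (docs x terms)
        mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
        var_p, var_n = X_pos.var(axis=0), X_neg.var(axis=0)
        # between-class scatter over within-class scatter, per term
        return (mu_p - mu_n) ** 2 / (var_p + var_n + 1e-9)

    X_pos = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1]])
    X_neg = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 1]])
    print(fisher_score(X_pos, X_neg))   # term 0 separates the classes best
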

40
Exploiting Hyperlink Structure
41
HITS (Hyperlink-Induced Topic Search) Kle 98
  • HITS uses hyperlink structure to identify
    authoritative Web sources for broad-topic
    information discovery
  • Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages
  • Authorities: highly-referenced pages on a topic
  • Hubs: pages that point to authorities
  • A good authority is pointed to by many good hubs; a good hub points to many good authorities

[Figure: a set of hub pages, each linking to several pages in a set of authority pages]
42
HITS - Discovering Web Communities
  • Discovering the community for a specific topic/query involves the following steps
  • Collect a seed set of pages S (returned by a search engine)
  • Expand the seed set to contain pages that point to, or are pointed to by, pages in the seed set
  • Iteratively update the hub weight h(p) and authority weight a(p) for each page
  • After a fixed number of iterations, the pages with the highest hub/authority weights form the core of the community
  • Extensions proposed in Clever
  • Assign links different weights based on the relevance of the link anchor text
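
A minimal sketch of the hub/authority iteration, using the standard HITS update in which a(p) sums the hub weights of pages pointing to p and h(p) sums the authority weights of pages p points to, with normalization each round; the graph and names are illustrative:

    def hits(graph, iters=20):
        # graph: page -> list of pages it links to
        pages = set(graph) | {q for qs in graph.values() for q in qs}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iters):
            # a(p): sum of the hub weights of pages pointing to p
            auth = {p: sum(hub[q] for q in pages if p in graph.get(q, [])) for p in pages}
            # h(p): sum of the authority weights of pages that p points to
            hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
            # normalize so the weights stay bounded
            na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / na for p, v in auth.items()}
            hub = {p: v / nh for p, v in hub.items()}
        return hub, auth

    g = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
    print(hits(g))
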

43
Google BP 98
  • Search engine that uses link structure to
    calculate a quality ranking (PageRank) for each
    page
  • PageRank
  • Can be calculated using a simple iterative
    algorithm, and corresponds to principal
    eigenvector of the normalized link matrix
  • Intuition: PageRank is the probability that a random surfer visits a page
  • Parameter p is probability that the surfer gets
    bored and starts on a new random page
  • (1-p) is the probability that the random surfer
    follows a link on current page
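
A minimal power-iteration PageRank sketch. The damping convention follows the slide (p is the probability of jumping to a random page, 1 - p the probability of following a link); the graph, parameter values, and handling of dangling pages are illustrative assumptions.

    def pagerank(graph, p=0.15, iters=50):
        # graph: page -> list of pages it links to
        pages = list(set(graph) | {q for qs in graph.values() for q in qs})
        n = len(pages)
        rank = {pg: 1.0 / n for pg in pages}
        for _ in range(iters):
            new = {pg: p / n for pg in pages}                 # get bored: jump to a random page
            for src in pages:
                outs = graph.get(src, [])
                if outs:                                      # with probability 1 - p, follow a link
                    share = (1 - p) * rank[src] / len(outs)
                    for dst in outs:
                        new[dst] += share
                else:                                         # dangling page: spread its rank uniformly
                    for dst in pages:
                        new[dst] += (1 - p) * rank[src] / n
            rank = new
        return rank

    g = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(g))
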

44
Google - Features
  • In addition to PageRank, in order to improve
    search Google also weighs keyword matches
  • Anchor text
  • Provide more accurate descriptions of Web pages
  • Anchors exist for un-indexable documents (e.g.,
    images)
  • Font sizes of words in text
  • Words in larger or bolder font are assigned
    higher weights
  • Google vs. HITS
  • Google: PageRanks are computed for Web pages up front, independent of the search query
  • HITS: hub and authority weights are computed for different root sets, in the context of a particular search query

45
Trawling the Web for Emerging Communities KRR 98
  • Co-citation: pages that are related are frequently referenced together
  • Web communities are characterized by dense directed bipartite subgraphs
  • Computing (i, j) bipartite cores
  • Sort the edge list by source id and detect all source pages s with out-degree j (let D be the set of destination pages that s points to)
  • Compute the intersection S of the sets of source pages pointing to the destination pages in D (using an index on destination id to generate each source set)
  • Output the bipartite core (S, D)
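
A minimal in-memory sketch of the two-step core computation described above; a real trawler streams sorted edge lists from disk, and the edge data and names here are illustrative:

    from collections import defaultdict

    def ij_cores(edges, i, j):
        # edges: list of (source_page, destination_page) links
        out = defaultdict(set)       # source -> set of destinations
        inc = defaultdict(set)       # destination -> set of sources
        for s, d in edges:
            out[s].add(d)
            inc[d].add(s)
        cores = set()
        for s, dests in out.items():
            if len(dests) == j:                                  # source with out-degree j
                S = set.intersection(*(inc[d] for d in dests))   # sources pointing to ALL of D
                if len(S) >= i:
                    cores.add((frozenset(S), frozenset(dests)))
        return cores

    edges = [("s1", "a"), ("s1", "b"), ("s2", "a"), ("s2", "b"), ("s3", "a"), ("s3", "b")]
    print(ij_cores(edges, i=3, j=2))
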

[Figure: an (i, j) bipartite core, with i source pages each linking to the same j destination pages]
46
Using Hyperlinks to Improve Classification CDI
98
  • Use text from neighbors when classifying a Web page
  • Ineffective, because referenced pages may belong to a different class
  • Use class information from pre-classified neighbors
  • Choose the class ci for which Pr[ci | Ni] is maximum (Ni is the set of class labels of all the neighboring documents)
  • By Bayes rule, we choose ci to maximize Pr[Ni | ci] Pr[ci]
  • Assuming independence of the neighbor classes, Pr[Ni | ci] factors into a product over the individual neighbors' class labels
47
Collaborative Search
48
SearchLight
  • Key idea: improve search by sharing information on the URLs visited by members of a community during search
  • Based on the concept of search sessions
  • A search session is the search engine query
    (collection of keywords) and the URLs visited in
    response to the query
  • Possible to extract search sessions from the
    proxy logs
  • SearchLight maintains a database of (query,
    target URL) pairs
  • Target URL is heuristically chosen to be last URL
    in search session for the query
  • In response to a search query, SearchLight
    displays URLs from its database for the specified
    query

49
Image Retrieval
50
Similar Images
  • Given
  • A set of images
  • Find
  • All images similar to a given image
  • All pairs of similar images
  • Sample applications
  • Medical diagnosis
  • Weather prediction
  • Web search engine for images
  • E-commerce

51
Similar Image Retrieval Systems
  • QBIC, Virage, Photobook
  • Compute feature signature for each image
  • QBIC uses color histograms
  • WBIIS, WALRUS use wavelets
  • Use a spatial index to retrieve the database image whose signature is closest to the query's signature
  • QBIC drawbacks
  • Computes single signature for entire image
  • Thus, fails when images contain similar objects,
    but at different locations or in varying sizes
  • Color histograms cannot capture shape, texture
    and location information (wavelets can!)

52
WALRUS Similarity Model NRS 99
  • WALRUS decomposes an image into regions
  • A single signature is stored for each region
  • Two images are considered to be similar if they
    have enough similar region pairs

53
WALRUS (Step 1)
  • Generation of signatures for sliding windows
  • Each image is broken into sliding windows
  • For the signature of each sliding window, use the coefficients from the lowest frequency band of its Haar wavelet decomposition
  • Naive algorithm vs. dynamic-programming algorithm for computing the signatures, with costs expressed in terms of N (the number of pixels in the image), S, and the maximum window size

54
WALRUS (Step 2)
  • Clustering Sliding Windows
  • Cluster the windows in the image using
    pre-clustering phase of BIRCH
  • Each cluster defines a region in the image.
  • For each cluster, the centroid is used as a
    signature. (c.f. bounding box)

55
WALRUS - Retrieval Results
Query image
56
Network-Data Management and Analysis
57
Networks Create Data
  • To effectively manage their networks, Internet/telecom service providers continuously gather utilization and traffic data
  • Managed IP network elements collect huge amounts of traffic data
  • Switch/router-level monitoring (SNMP, RMON, NetFlow, etc.)
  • A typical IP router maintains several thousands of SNMP counters
  • Service-Level Agreements (SLAs) and Quality-of-Service (QoS) guarantees require finer-grain monitoring (per IP flow!)
  • Telecom networks: Call-Detail Records (CDRs) for every phone call
  • Each CDR comprises 100s of bytes of data with several 10s of fields/attributes (e.g., endpoint exchanges, timestamps, tariffs)
  • End result: massive collections of Network-Management (NM) data (can grow on the order of several terabytes/year!)

58
Why Data Management??
  • Massive NM data sets hide knowledge that is
    crucial to key management tasks
  • Application/user profiling, proactive/reactive resource management, traffic engineering, capacity planning, etc.
  • Data Mining research can help!
  • Develop novel tools for the effective storage,
    exploration, and analysis of massive
    Network-Management data
  • Several challenging research themes
  • semantic data compression, approximate query
    processing, XML, mining models for event
    correlation and fault analysis,
    network-recommender systems, . . .
  • Loooooong-term goal :-)
  • Intelligent, self-tuning, self-healing
    communication networks

59
Mining Techniques for Network Data
  • Automated schema extraction for XML data: the XTRACT system
  • Data reduction techniques for massive data tables
  • Lossless semantic compression with simple data dependencies: the pzip algorithm
  • Lossy, guaranteed-error semantic compression
  • Fascicles
  • Model-Based Semantic Compression: the SPARTAN system
  • Approximate query processing over data synopses
  • Mining techniques for event correlation and
    root-cause analysis
  • Managing and mining data streams

60
Automated Schema Extraction for XML Data
The XTRACT System
61
XML Primer I
  • Standard for data representation and data
    exchange
  • Unified, self-describing format for
    publishing/exchanging management data across
    heterogeneous network/NM platforms
  • Looks like HTML, but it isn't
  • Collection of elements
  • Atomic (raw character data)
  • Composite (sequence of nested sub-elements)
  • Example: a paper element with nested title, author, and affiliation sub-elements
  • A Relational Model for Large Shared Data Banks
  • E.F. Codd
  • IBM Research
62
XML Primer II
  • XML documents can be accompanied by Document Type
    Descriptors (DTDs)
  • DTDs serve the role of the schema of the document
  • Specify a regular expression for every element
  • Example

63
The XTRACT System GGR 00
  • DTDs are of great practical importance
  • Efficient storage of XML data collections
  • Formulation and optimization of XML queries
  • However, DTDs are not mandatory: XML data may not be accompanied by a DTD
  • Automatically-generated XML documents (e.g., from
    relational databases or flat files)
  • DTD standards for many communities are still
    evolving
  • Goal of the XTRACT system
  • Automated inference of DTDs from XML-document
    collections

64
Problem Formulation
  • Element types => alphabet
  • Infer the DTD for each element type separately
  • Example sequences: instances of nested sub-elements
  • => Only one level down in the hierarchy
  • Problem statement
  • Given a set of example sequences for element e
  • Infer a good regular expression for e
  • Hard problem!!
  • DTDs can comprise general, complex regular expressions
  • Need to quantify the notion of goodness for regular expressions

65
Example XML Documents

66
Example (Continued)
  • Simplified example sequences of nested sub-elements drawn from the example documents
  • Desirable solution: a single concise regular expression covering all of the sequences (ending in ... year)

67
DTD Inference Requirements
  • Requirements for a good DTD
  • Generalizes to intuitively correct but previously
    unseen examples
  • It should be concise (i.e., small in size)
  • It should be precise (i.e., not cover too many
    sequences not contained in the set of examples)
  • Example: consider the case
  • p -> ta, taa, taaa, taaaa

Candidate DTDs for p range from simply listing the example sequences (precise but not concise) to broad generalizations such as (t|a)* (concise, but covering many unseen sequences)
68
The XTRACT Approach MDL Principle
  • Minimum Description Length (MDL) quantifies and
    resolves the tradeoff between DTD conciseness and
    preciseness
  • MDL principle The best theory to infer from a
    set of data is the one which minimizes the sum of
  • (A) the length of the theory, in bits, plus
  • (B) the length of the data, in bits, when
    encoded with the help of the theory.
  • Part (A) captures conciseness, and
  • Part (B) captures preciseness

69
Overview of the XTRACT System
  • XTRACT consists of 3 subsystems: generalization (producing the candidate set SG), factoring (producing SF), and MDL-based selection of the final DTD
  • Input sequences

I  = {ab, abab, ac, ad, bc, bd, bbd, bbbe}
SG = I ∪ {(ab)*, (a|b)*, b*d, b*e}
SF = SG ∪ {(a|b)(c|d), b*(d|e)}
Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)
70
MDL Subsystem
  • MDL principle Minimize the sum of
  • Theory description length, plus
  • Data description length given the theory
  • In order to use MDL, need to
  • Define theory description length (candidate
    DTD)
  • Define data description length (input sequences)
    given the theory (candidate DTD)
  • Solve the resulting minimization problem

71
MDL Subsystem - Encoding Scheme
  • Description length of a DTD
  • Number of bits required to encode the DTD
  • Size of DTD x log of the alphabet size, with the alphabet extended by the metacharacters (, ), |, *
  • Description length of a sequence given a candidate DTD
  • Number of bits required to specify the sequence given the DTD
  • Use a sequence of encoding indices
  • Encoding of a given a is the empty string ε
  • Encoding of a given (a|b|c) is the index 0
  • Encoding of aaa given a* is the index 3
  • Example: the encoding of ababcabc given ((ab)*c)* is the sequence 2, 2, 1

72
MDL Encoding Example
  • Consider again the case
  • p -> ta, taa, taaa, taaaa

    Candidate theory           Data encoding (given the theory)             Theory + Data = Total
    ta | taa | taaa | taaaa    0, 1, 2, 3                                   17 + 7  = 24
    (t|a)*                     2 0 1,  3 0 1 1,  4 0 1 1 1,  5 0 1 1 1 1     6 + 21 = 27
    ta*                        1, 2, 3, 4                                    3 + 7  = 10

  • The theory ta* minimizes the total description length and is the DTD the MDL subsystem selects
73
MDL Subsystem - Minimization
[Figure] A bipartite graph: the input sequences (ta, taa, taaa, taaaa) on one side and the candidate DTDs (c1, c2, c3) on the other, with an edge of weight wij connecting a sequence to each candidate DTD that can encode it (the weight is the encoding cost).
  • Maps to the Facility Location Problem (NP-hard)
  • XTRACT employs fast heuristic algorithms
    proposed by the Operations Research community

74
Semantic Compression of Massive Network-Data
Tables
75
Compressing Massive Tables A New Direction in
Data Compression
  • Benefits of data compression are well established
  • Optimize storage, I/O, network bandwidth (e.g.,
    data transfers, disconnected operation for mobile
    users) over the lifetime of the data
  • Faster query processing over synopses
  • Several generic compression tools and algorithms exist (e.g., gzip, Huffman, Lempel-Ziv)
  • Syntactic methods: operate at the byte level and view the data as a large byte string
  • Lossless compression only
  • Effective compression of massive alphanumeric tables
  • Needs novel methods that are semantic: they account for and exploit the meaning of, and the data dependencies among, the attributes in the table
  • Lossless or lossy compression, with flexible mechanisms for users to specify the acceptable information loss

76
The pzip Table Compressor BCC 00
  • Key ideas
  • Lossless compression via training: use a small sample of table records to learn simple dependency patterns
  • Build a compression plan that exploits the
    discovered dependencies (e.g., column grouping)
  • Leverage existing compression tools (e.g., gzip,
    bzip) to losslessly compress the entire table
  • Based on discovering and exploiting simple
    dependency patterns among table columns
  • Combinational dependencies
  • Differential dependencies
  • Also, use simple differential coding for
    low-frequency columns
  • Outperforms naive gzip by factors of up to 2 in
    compression ratio/time

77
Combinational Dependencies in pzip
  • Some notation
  • T[i,j]: the portion of table T between columns i and j (T[i]: the i-th column of T)
  • S(T[i,j]): the size of the compressed (e.g., gzipped) representation of T[i,j]
  • The ranges T[i,j] and T[j+1,k] are combinationally dependent iff S(T[i,j]) + S(T[j+1,k]) > S(T[i,k])
  • Grouping the two ranges then results in better compression
  • Optimum partitioning: find the column grouping that results in minimum overall storage requirements (each column group is compressed individually)
  • Solved optimally using dynamic programming: OPT[1,i] = min over j < i of ( OPT[1,j] + S(T[j+1,i]) )
  • Complexity is O(n^2), assuming the S(T[i,j]) values are known (remember, these are computed over a sample of T)
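
A minimal sketch of the optimum-partitioning dynamic program, with zlib standing in for the compressor behind the cost function S; the sample data, the row-wise serialization of a column group, and the function names are illustrative assumptions:

    import zlib
    from functools import lru_cache

    def S(cols):
        # compressed size of a group of columns, serialized row by row
        data = "\n".join("|".join(row) for row in zip(*cols)).encode()
        return len(zlib.compress(data))

    def optimal_grouping(columns):
        n = len(columns)

        @lru_cache(maxsize=None)
        def opt(i):
            # minimum total compressed size of columns[0:i], plus the chosen groups
            if i == 0:
                return 0, ()
            best = None
            for j in range(i):                       # last group is columns[j:i]
                prev_cost, prev_groups = opt(j)
                cost = prev_cost + S(columns[j:i])
                if best is None or cost < best[0]:
                    best = (cost, prev_groups + ((j, i),))
            return best

        return opt(n)

    cols = [["a", "a", "b"], ["a", "a", "b"], ["x", "y", "z"]]   # column 1 mirrors column 0
    print(optimal_grouping(cols))
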

78
Differential Dependencies in pzip
  • Column Tj is differentially dependent on column Ti iff S(Tj) > S(Ti - Tj), i.e., the column of differences compresses better than Tj itself
  • Compressing the difference with respect to Ti, rather than Tj itself, results in better compression
  • A more explicit form of dependency
  • Differential compression problem: partition T's columns into source and derived columns, and find the differential encoding for each derived column such that the overall storage is minimized
  • Maps naturally to the Facility Location Problem (NP-hard)
  • Greedy local-search heuristics are used in the pzip implementation

79
Semantic Compression with Fascicles JMN 99
  • Key observation
  • Often, numerous subsets of records in T have
    similar values for many attributes
  • Compress data by storing representative
    values (e.g., centroid) only once for each
    attribute cluster
  • Lossy compression: the information loss is controlled by the (user-defined) notion of similar values for attributes

80
Problem Formulation
  • k-dimensional fascicle F(k, t): a subset of records with k compact attributes
  • The user-defined compactness tolerance t (a vector) specifies the allowable loss in the compression per attribute
  • E.g., t_Duration = 3 means that all Duration values in a fascicle are within 3 of the centroid value
  • Flexible, per-attribute specification of
    compression loss
  • Problem Statement
  • Given a table T and a compactness-tolerance
    vector t, find fascicles within the specified
    tolerances such that the total storage is
    minimized
  • (1) Finding candidate fascicles in T

    (2) Selecting the best fascicles to
    compress T

81
Finding Candidate Fascicles
  • Efficient, randomized algorithm
  • Use (memory-resident) random samples of T to choose an initial collection of tip sets (maximal fascicles based on the sampled records)
  • Grow tip sets with all qualifying records in
    one pass over T
  • Not guaranteed to find all fascicles!
  • Exact, level-wise (Apriori-like) procedures are
    possible (fascicles are anti-monotone), BUT
  • Inordinately expensive
  • Not necessarily better (require static
    pre-binning of numeric attributes)

82
Selecting Fascicles for Compression
  • Selecting the optimal subset among all
    candidate fascicles is hard!
  • Generalization of Weighted Set Cover Problem
    (NP-hard)
  • Use an efficient, greedy heuristic
  • Always select the fascicle that gives maximum
    compression benefit
  • Fascicles give significantly improved compression
    ratios (factors of 2-3) compared to naive gzip

83
SPARTAN: A Model-Based Semantic Compressor
BGR 01
  • New, general paradigm: Model-Based Semantic Compression (MBSC)
  • Extract data mining models and use them to compress
  • Lossless or lossy compression (with guaranteed per-attribute error bounds)
  • SPARTAN system: a specific instantiation of the MBSC framework
  • Key observation: row-wise attribute clusters (a la fascicles) are not sufficient (e.g., they miss column correlations such as Y = aX + b)
  • Idea: use a carefully-selected collection of Classification and Regression Trees (CaRTs) to capture such vertical correlations and predict the values of entire columns

84
SPARTAN Example CaRT Models
    Protocol   Duration   Bytes    Packets
    http          12       20K        3
    http          16       24K        5
    http          15       20K        8
    http          19       40K       11
    http          26       58K       18
    ftp           27      100K       24
    ftp           32      300K       35
    ftp           18       80K       15
  • Can use two compact trees (one decision,
    one regression) to eliminate two data columns
    (predicted attributes)

85
SPARTAN Architecture
86
SPARTANs CaRTSelector
  • Heart of the SPARTAN semantic-compression
    engine
  • Uses the constructed Bayesian network on T to
    drive the construction and selection of the
    best subset of CaRT predictors
  • Hard optimization problem -- Strict
    generalization of Weighted Maximum Independent
    Set (WMIS) (NP-hard!)
  • CaRTSelector employs a novel algorithm that
    iteratively uses a near-optimal WMIS heuristic
    to determine a good subset of CaRTs for
    compression
  • SPARTAN's compression ratios outperform gzip and fascicles by wide margins (even for lossless compression)
  • Higher, but reasonable, compression times (8 min for a 14-attribute, 30 MB table); samples are used to learn the CaRT models
  • SPARTAN's model predictors can be useful in other NM contexts
  • e.g., event correlation filtering, root cause
    analysis (more later...)

87
Approximate Query Processing Over Synopses
88
Data Exploration in Traditional Decision Support
Systems (DSS)
[Diagram] SQL queries posed against a GB/TB data warehouse return exact answers, but with long response times.
89
Exact Answers NOT Always Required
  • Interactive exploration of massive data sets
  • Early feedback giving a rough idea of the results would help to quickly find the interesting regions in the data space
  • Data visualization
  • Aggregate queries: approximate answers often suffice
  • How do the total sales of product X in NJ compare to those in CA? Precision to the penny is not needed
  • Base data may be remote/unavailable: locally-cached synopses of the data may be the only option

90
Solution Approximate Query Processing
[Diagram] Compact relations (MB) are constructed in advance from the data warehouse (GB/TB); an SQL query is transformed (via a transformation algebra) into a query over the compact relations, yielding approximate answers with fast response times.
91
Approximate Query Processing Using Wavelets CGR
00
  • Construct compact synopses of data table(s) using
    multi-dimensional Haar-wavelet decomposition
  • Fast: takes just a single pass over the data if it is chunked, otherwise logarithmically many passes
  • SQL queries are answered by working just on the compact synopses (collections of wavelet coefficients), i.e., entirely in the wavelet (compressed) domain
  • Fast response times
  • Results are converted back to the relational domain (rendering) at the end
  • All types of queries supported: aggregate, set-valued, GROUP-BY, . . .
  • Fast, accurate, general

92
Query Processing Architecture
  • Entire processing in compressed (wavelet) domain

93
Query Execution
  • Each operator (e.g., select, project, join, aggregates)
  • Input: set of Haar coefficients
  • Output: set of coefficients
  • Finally, the rendering step
  • Input: set of Haar coefficients
  • Output: (multi)set of tuples

94
Mining Techniques for Event Correlation and
Root-Cause Analysis
95
Network Event Correlation and Root-Cause Analysis
  • The problem: alarm floods!!

[Diagram: routers and other network elements flooding the management station with alarms]
96
NM System Architecture
  • EC: use fault propagation rules to improve information quality and filter out secondary alarms
  • RCA: employ the EC output to produce a set of possible root causes and associated degrees of confidence

97
Event Correlation Engine
  • Driven by fault propagation rules (causal relationships between alarm signals)

A CAUSAL BAYESIAN MODEL !!
Given the set of observed alarms A, find a minimal subset of causes P such that Pr[A | P] exceeds a threshold
98
State-of-the-art
  • SMARTS InCharge
  • Network elements modeled as objects with
    hard-coded fault propagation rules
  • Use causal graph to produce binary signatures
    for each failure (codebook)
  • HP OpenView ECS, CISCO InfoCenter, GTE Impact, . . .
  • Graphics- or language-based specification of global rules for event filtering
  • Hand-coding of causal model !!
  • tedious, error-prone, non-incremental
  • ignores probabilistic aspects (dependency
    strength)

99
Data Mining can Help Automate
  • Data Mining techniques for inferring and maintaining causal models from network alarm data

Maintenance (on-line)
  • Challenges: incorporate temporal aspects, topology, and domain knowledge in the data-mining process

100
Root Cause Analysis
  • Use data mining (e.g., classification techniques) for RCA (learn failure signatures from a field-data DB)
  • Exploit domain knowledge (e.g., topology) in
    the data-mining process
  • Refine the RCA models as more data from the field
    becomes available

101
References
  • BCC 00 A.L. Buchsbaum, D.F. Caldwell, K.W. Church, G.S. Fowler, and S. Muthukrishnan. Engineering the Compression of Massive Tables: An Experimental Approach. SODA, 2000.
  • BGR 01 S. Babu, M. Garofalakis, and R. Rastogi. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. ACM SIGMOD, 2001.
  • BP 98 S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. WWW7, 1998.
  • CDA 97 S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, 1998.
  • CDI 98 S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. ACM SIGMOD, 1998.
  • CGR 00 K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate Query Processing Using Wavelets. VLDB, 2000.

102
References (Continued)
  • DDF 90 S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 1990.
  • GGR 00 M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: A System for Extracting Document Type Descriptors from XML Documents. ACM SIGMOD, 2000.
  • GKR 98 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB, 1998.
  • GRS 98 S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD, 1998.
  • GRS 99 S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Data Engineering (ICDE), 1999.
  • HKK 97 E. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on association rule hypergraphs. DMKD Workshop, 1997.

103
References (Continued)
  • JMN 99 H.V. Jagadish, J. Madar, and R.T. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB, 1999.
  • Kle 98 J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA, 1998.
  • KRR 98 R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. WWW8, 1999.
  • ZRL 96 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD, 1996.