
Data Mining meets the Internet: Techniques for Web Information Retrieval and Network Data Management

Minos Garofalakis, Rajeev Rastogi
Internet Management Research, Bell Laboratories, Murray Hill

The Web

- Over 1 billion HTML pages, 15 terabytes
- Wealth of information
  - Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow/white pages, maps, markets, ...
- Diverse media types: text, images, audio, video
- Heterogeneous formats: HTML, XML, PostScript, PDF, JPEG, MPEG, MP3
- Highly dynamic
  - 1 million new pages each day
  - Average page changes in a few weeks
- Graph structure with links between pages
  - Average page has 7-10 links
- Hundreds of millions of queries per day

Why is Web Information Retrieval Important?

- According to most predictions, the majority of human information will be available on the Web in ten years
- Effective information retrieval can aid in
  - Research: find all papers that use the primal-dual method to solve the facility location problem
  - Health/Medicine: what could be the reason for symptoms of yellow eyes, high fever and frequent vomiting?
  - Travel: find information on the tropical island of St. Lucia
  - Business: find companies that manufacture digital signal processors (DSPs)
  - Entertainment: find all movies starring Marilyn Monroe between 1960 and 1970
  - Arts: find all short stories written by Jhumpa Lahiri

Web Information Retrieval Model

[Architecture diagram: a crawler fetches pages from Web servers into a repository managed by a storage server. An indexer builds an inverted index over the documents in the repository (e.g., the terms "engine", "jaguar", "cat" from sentences such as "The jaguar has a 4 liter engine" and "The jaguar, a cat, can run at speeds reaching 50 mph"), while clustering/classification modules organize documents into a topic hierarchy (Root -> Business, News, Science; with subtopics such as Computers, Automobiles, Plants, Animals). Search queries are answered from the inverted index and topic hierarchy.]

Why is Web Information Retrieval Difficult?

- The abundance problem (99% of the information is of no interest to 99% of the people)
  - Hundreds of irrelevant documents returned in response to a search query
- Limited coverage of the Web (Internet sources hidden behind search interfaces)
  - Largest crawlers cover less than 18% of Web pages
- The Web is extremely dynamic
  - 1 million pages added each day
- Very high dimensionality (thousands of dimensions)
- Limited query interface based on keyword-oriented search
- Limited customization to individual users

How can Data Mining Improve Web Information Retrieval?

- Latent Semantic Indexing (LSI)
  - SVD-based method to improve precision and recall
- Document clustering to generate topic hierarchies
  - Hypergraph partitioning, STIRR, ROCK
- Document classification to assign topics to new documents
  - Naive Bayes, TAPER
- Exploiting hyperlink structure to locate authoritative Web pages
  - HITS, Google, Web trawling
- Collaborative searching
  - SearchLight
- Image retrieval
  - QBIC, Virage, Photobook, WBIIS, WALRUS

Latent Semantic Indexing

Problems with Inverted Index Approach

- Synonymy
  - Many ways to refer to the same object
- Polysemy
  - Most words have more than one distinct meaning

[Term-document matrix example over the terms animal, jaguar, speed, car, engine, porsche, automobile for Docs 1-3: "jaguar" occurs both in a document about the animal and in documents about the car (polysemy), while "car" and "automobile" name the same object in different documents (synonymy).]

LSI - Key Idea [DDF+90]

- Apply SVD to the term-by-document (t x d) matrix X:
  X = T0 S0 D0^T
  where T0 (t x m) and D0 (d x m) have orthonormal columns and S0 (m x m) is diagonal
- Ignore very small singular values in S0, keeping only the k largest:
  X ~ X' = T S D^T
  where T is t x k, S is k x k, and D^T is k x d
- The new matrix X' of rank k is the closest rank-k matrix to X in the least-squares sense
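The truncation step above can be sketched in a few lines of NumPy; the toy term-document matrix and its term labels are illustrative, not from the slides:

```python
import numpy as np

# Hypothetical 5-term x 4-document matrix (rows: jaguar, cat, engine,
# speed, car); entries are term occurrences.
X = np.array([
    [1, 1, 0, 1],   # jaguar
    [1, 0, 0, 1],   # cat
    [0, 1, 1, 0],   # engine
    [1, 1, 0, 0],   # speed
    [0, 1, 1, 0],   # car
], dtype=float)

# Full SVD: X = T0 @ diag(s0) @ D0t, with orthonormal columns in T0, D0.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values.
k = 2
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]

# Rank-k approximation X': closest rank-k matrix to X in least squares.
X_hat = T @ S @ Dt
assert np.linalg.matrix_rank(X_hat) == k
```

By the Eckart-Young theorem, the approximation error ||X - X'|| equals the norm of the discarded singular values, which is what makes this the best rank-k choice.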

Comparing Documents and Queries

- Comparing two documents
  - Essentially the dot product of two column vectors of X':
    X'^T X' = D S^2 D^T
  - So one can take the rows of the DS matrix as coordinates for documents and compute dot products in this space
- Finding documents similar to a query q with term vector Xq
  - Derive a representation Dq for the query: Dq = Xq^T T S^-1
  - The dot product of DqS with the appropriate row of the DS matrix yields the similarity between the query and a specific document

LSI - Benefits

- Reduces dimensionality of documents
  - From tens of thousands (one dimension per keyword) to a few hundred
  - Decreases storage overhead of index structures
  - Speeds up retrieval of documents similar to a query
  - Makes search less brittle
- Captures semantics of documents
  - Addresses problems of synonymy and polysemy
  - Transforms document space from discrete to continuous
  - Improves both search precision and recall

Document Clustering

Improve Search Using Topic Hierarchies

- Web directories (or topic hierarchies) provide a hierarchical classification of documents (e.g., Yahoo!)
- Searches performed in the context of a topic restrict the search to only the subset of Web pages related to the topic
- Clustering can be used to generate topic hierarchies

[Example hierarchy: Yahoo! home page -> Recreation, Science, Business, News; Recreation -> Sports, Travel; Business -> Companies, Finance, Jobs]

Clustering

- Given
  - Data points (documents) and the number of desired clusters k
- Group the data points (documents) into k clusters
  - Data points (documents) within clusters are more similar than across clusters
- Document similarity measure
  - Each document can be represented by a vector with a 0/1 value along each word dimension
  - The cosine of the angle between document vectors is a measure of their similarity, as is the (Euclidean) distance between the vectors
- Other applications
  - Customer segmentation
  - Market basket analysis
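For 0/1 word vectors, the cosine measure reduces to counting shared words; a minimal sketch, with the documents represented as word sets (an illustrative choice, not the slides' notation):

```python
import math

def cosine_sim(d1, d2):
    """Cosine of the angle between two 0/1 word-presence vectors,
    represented here as sets of words: |d1 & d2| / sqrt(|d1| * |d2|)."""
    if not d1 or not d2:
        return 0.0
    return len(d1 & d2) / math.sqrt(len(d1) * len(d2))

doc_a = {"jaguar", "cat", "speed"}
doc_b = {"jaguar", "engine", "car", "speed"}
print(round(cosine_sim(doc_a, doc_b), 3))   # -> 0.577
```

Identical documents score 1.0 and disjoint documents score 0.0, so higher values mean more similar.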

k-means Algorithm

- Choose k initial means
- Assign each point to the cluster with the closest mean
- Compute new mean for each cluster
- Iterate until the k means stabilize
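The four steps above can be sketched directly; this is a plain illustrative implementation on 2-D points (the sample data is made up):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic k-means on 2-D points, following the slide's four steps."""
    rng = random.Random(seed)
    means = rng.sample(points, k)               # choose k initial means
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assign each point to the cluster with the closest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - means[j][0]) ** 2
                                  + (p[1] - means[j][1]) ** 2)
            clusters[i].append(p)
        # compute new mean for each cluster (keep old mean if empty)
        new_means = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else means[j]
            for j, c in enumerate(clusters)
        ]
        if new_means == means:                  # means have stabilized
            break
        means = new_means
    return means, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(pts, 2)
```

On these two well-separated blobs the algorithm converges in a couple of iterations to one cluster per blob.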

Agglomerative Hierarchical Clustering Algorithms

- Initially each point is a distinct cluster
- Repeatedly merge the closest clusters until the number of clusters becomes k
- Closeness measures: dmean(Ci, Cj), dmin(Ci, Cj), and likewise dave(Ci, Cj) and dmax(Ci, Cj)

Agglomerative Hierarchical Clustering Algorithms (Continued)

- dmean: centroid approach
- dmin: Minimum Spanning Tree (MST) approach

[Figure: (a) centroid clustering vs. (b) MST clustering vs. (c) the correct clusters]

Drawbacks of Traditional Clustering Methods

- Traditional clustering methods are ineffective for clustering documents
  - Cannot handle thousands of dimensions
  - Cannot scale to millions of documents
- Centroid-based method splits large and non-hyperspherical clusters
  - Centers of subclusters can be far apart
- MST-based algorithm is sensitive to outliers and slight changes in position
  - Exhibits chaining effect on strings of outliers
- Using other similarity measures such as the Jaccard coefficient instead of Euclidean distance does not help

Example - Centroid Method for Clustering Documents

- As cluster size grows
  - The number of dimensions appearing in the mean goes up
  - Their value in the mean decreases
  - Thus, it is very difficult to distinguish two points that differ on few dimensions: a ripple effect
- Documents 1, 4 and 6 are merged even though they have no elements in common!

Itemset Clustering using Hypergraph Partitioning [HKK97]

- Build a weighted hypergraph from frequent itemsets
  - Hyperedge: each frequent itemset
  - Weight of hyperedge: average of the confidences of all association rules generated from the itemset
- A hypergraph partitioning algorithm is used to cluster items
  - Minimize the sum of the weights of cut hyperedges
- Label customers with item clusters by scoring
- Assumes that the items defining clusters are disjoint!

STIRR - A System for Clustering Categorical Attribute Values [GKR98]

- Motivated by spectral graph partitioning, a method for clustering undirected graphs
- Each distinct attribute value becomes a separate node v with weight w(v)
- Node weights w(v) are updated in each iteration: for each tuple, each node's weight is updated from the weights of the other values co-occurring in the tuple, and the set of weights is then normalized so that it is orthonormal
- Positive and negative weights in non-principal basins tend to represent good partitions of the data

ROCK [GRS99]

- Hierarchical clustering algorithm for categorical attributes
  - Example: market basket customers
- Uses the novel concept of links for merging clusters
  - sim(pi, pj): similarity function that captures the closeness between pi and pj
  - pi and pj are said to be neighbors if sim(pi, pj) is at least a user-specified threshold
  - link(pi, pj): the number of common neighbors
- At each step, merge the clusters/points with the largest number of links
  - Points belonging to a single cluster will in general have a large number of common neighbors
- Random sampling used for scale-up
- In the final labeling phase, each point on disk is assigned to the cluster with the maximum neighbors

ROCK

- Baskets: {1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5} and {1,2,6} {1,2,7} {1,6,7} {2,6,7}
- {1,2,6} and {1,2,7} have 5 links.
- {1,2,3} and {1,2,6} have 3 links.
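The link counts can be reproduced with a small script; here the similarity is taken to be the Jaccard coefficient with threshold 0.5 (an assumption consistent with the example, not stated on the slide):

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def links(points, theta=0.5):
    """link(p, q) = number of common neighbors, where two points are
    neighbors iff their Jaccard similarity is at least theta."""
    nbrs = {i: {j for j in range(len(points))
                if i != j and jaccard(points[i], points[j]) >= theta}
            for i in range(len(points))}
    return {(i, j): len(nbrs[i] & nbrs[j])
            for i, j in combinations(range(len(points)), 2)}

baskets = [frozenset(s) for s in
           [{1,2,3}, {1,4,5}, {1,2,4}, {2,3,4}, {1,2,5}, {2,3,5}, {1,3,4},
            {2,4,5}, {1,3,5}, {3,4,5}, {1,2,6}, {1,2,7}, {1,6,7}, {2,6,7}]]
lk = links(baskets)
print(lk[(10, 11)])   # {1,2,6} vs {1,2,7} -> 5
print(lk[(0, 10)])    # {1,2,3} vs {1,2,6} -> 3
```

Two 3-element baskets sharing two items have Jaccard similarity 2/4 = 0.5, so they count as neighbors, which is what drives the link counts above.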

Clustering Algorithms for Numeric Attributes

- Scalable clustering algorithms (from the database community)
  - CLARANS
  - DBSCAN
  - BIRCH
  - CLIQUE
  - CURE
- The above algorithms can be used to cluster documents after reducing their dimensionality using SVD

BIRCH [ZRL96]

- Pre-cluster data points using a CF-tree
  - The CF-tree is similar to an R-tree
- For each point
  - The CF-tree is traversed to find the closest cluster
  - If the cluster is within epsilon distance, the point is absorbed into the cluster
  - Otherwise, the point starts a new cluster
- Requires only a single scan of the data
- Cluster summaries stored in the CF-tree are given to a main-memory hierarchical clustering algorithm

CURE [GRS98]

- Hierarchical algorithm for discovering arbitrary-shaped clusters
- Uses a small number of representatives per cluster
- Note
  - Centroid-based: uses 1 point to represent a cluster; too little information, biased toward hyper-spherical clusters
  - MST-based: uses every point to represent a cluster; too much information, easily misled
- Uses random sampling
- Uses partitioning
- Labeling using representatives

Cluster Representatives

- A representative set of points
  - Small in number: c
  - Distributed over the cluster
  - Each point in the cluster is close to one representative
- Distance between clusters: smallest distance between representatives

Computing Cluster Representatives

- Finding scattered representatives; we want to
  - Distribute them around the center of the cluster
  - Spread them well out over the cluster
  - Capture the physical shape and geometry of the cluster
- Use the farthest-point heuristic to scatter the points over the cluster
- Shrink uniformly around the mean of the cluster

Computing Cluster Representatives (Continued)

- Shrinking the representatives
  - Why do we need to alter the representative set? The scattered points may be too close to the boundary of the cluster
  - Shrink uniformly around the mean (center) of the cluster

Document Classification

Classification

- Given
  - Database of tuples (documents), each assigned a class label
- Develop a model/profile for each class
  - Example profile (good credit): (25 ... 40k) or (married = YES)
  - Example profile (automobile): document contains a word from {car, truck, van, SUV, vehicle, scooter}
- Other applications
  - Credit card approval (good, bad)
  - Bank locations (good, fair, poor)
  - Treatment effectiveness (good, fair, poor)

Naive Bayesian Classifier

- The class c for a new document d is the one for which Pr[c|d] is maximum
- Assume independent term occurrences in a document
  - Pr[t|c]: fraction of documents in class c that contain term t
- Then, by Bayes rule, Pr[c|d] is proportional to Pr[c] times the product of the per-term probabilities Pr[t|c]

Hierarchical Classifier (TAPER) [CDA97]

- The class of a new document d is the leaf node c of the topic hierarchy such that Pr[c|d] is maximum
  - Pr[c|d] can be computed using Bayes rule
- The problem of computing c reduces to finding the leaf node c with the least-cost path from the root to c

k-Nearest Neighbor Classifier

- Assign to a point the label of the majority of its k nearest neighbors
- For k = 1, the error rate is never worse than twice the Bayes rate (with an unlimited number of samples)
- Scalability issues
  - Use an index to find the k nearest neighbors
  - The R-tree family works well up to 20 dimensions
  - Pyramid tree for high-dimensional data
  - Use SVD to reduce the dimensionality of the data set
  - Use clusters to reduce the dataset size
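The majority-vote rule itself is compact; a brute-force illustrative sketch (a real system would use an index, as the slide notes, rather than sorting all points):

```python
from collections import Counter

def knn_classify(query, labeled_points, k=3):
    """Assign the majority label among the k nearest neighbors
    (squared Euclidean distance; ties broken by Counter order)."""
    by_dist = sorted(
        labeled_points,
        key=lambda pl: sum((q - x) ** 2 for q, x in zip(query, pl[0])))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

data = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
        ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b")]
print(knn_classify((0.5, 0.5), data))   # -> a
```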

Decision Trees

Credit Analysis

[Example decision tree: the root tests salary; one outcome leads directly to accept, while the other leads to a test on education (graduate -> accept, otherwise reject).]

Decision Tree Algorithms

- Building phase
  - Recursively split nodes using the best splitting attribute for the node
- Pruning phase
  - A smaller, imperfect decision tree generally achieves better accuracy
  - Prune leaf nodes recursively to prevent over-fitting

Decision Tree Algorithms

- Classifiers from machine learning community
- ID3
- C4.5
- CART
- Classifiers for large databases
- SLIQ, SPRINT
- PUBLIC
- SONAR
- Rainforest, BOAT

Decision Trees

- Pros
- Fast execution time
- Generated rules are easy to interpret by humans
- Scale well for large data sets
- Can handle high dimensional data
- Cons
- Cannot capture correlations among attributes
- Consider only axis-parallel cuts

Feature Selection

- Choose a collection of keywords that help discriminate between two or more sets of documents
- Fewer keywords help to speed up classification
- Improves classification accuracy by eliminating noise from documents
- One measure: Fisher's discriminant (the ratio of between-class to within-class scatter), computed per term from each term's presence in the documents of each class

Exploiting Hyperlink Structure

HITS (Hyperlink-Induced Topic Search) [Kle98]

- HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery
- Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages
  - Authorities: highly-referenced pages on a topic
  - Hubs: pages that point to authorities
- A good authority is pointed to by many good hubs; a good hub points to many good authorities

[Figure: a bipartite pattern of hubs pointing to authorities]

HITS - Discovering Web Communities

- Discovering the community for a specific topic/query involves the following steps
  - Collect a seed set of pages S (returned by a search engine)
  - Expand the seed set to contain pages that point to or are pointed to by pages in the seed set
  - Iteratively update the hub weight h(p) and authority weight a(p) for each page
  - After a fixed number of iterations, the pages with the highest hub/authority weights form the core of the community
- Extensions proposed in Clever
  - Assign links different weights based on the relevance of the link's anchor text
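The iterative update can be sketched as follows, assuming unit-weight links (the standard HITS update: a page's authority is the sum of the hub weights of pages pointing to it, and its hub weight is the sum of the authority weights it points to):

```python
def hits(graph, iters=50):
    """graph: node -> set of nodes it links to. Returns (hub, auth)
    weight dicts after a fixed number of normalized iterations."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # a(p) = sum of h(q) over pages q that link to p
        auth = {n: sum(hub[u] for u in nodes if n in graph.get(u, ()))
                for n in nodes}
        # h(p) = sum of a(q) over pages q that p links to
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {n: v / na for n, v in auth.items()}
        hub = {n: v / nh for n, v in hub.items()}
    return hub, auth

# Toy web: h1 and h2 act as hubs pointing at authorities a1 and a2.
g = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "a1": set(), "a2": set()}
hub, auth = hits(g)
```

On this toy graph the iteration cleanly separates the two roles: h1/h2 end up with all the hub weight and a1/a2 with all the authority weight.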

Google [BP98]

- Search engine that uses link structure to calculate a quality ranking (PageRank) for each page
- PageRank
  - Can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix
  - Intuition: PageRank is the probability that a random surfer visits a page
  - Parameter p is the probability that the surfer gets bored and starts on a new random page
  - (1-p) is the probability that the random surfer follows a link on the current page
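The simple iterative algorithm mentioned above is power iteration over the random-surfer transition; a minimal sketch with p = 0.15 (a common choice, not specified on the slide):

```python
def pagerank(graph, p=0.15, iters=100):
    """Power iteration for PageRank. graph: node -> list of out-links.
    With probability p the surfer jumps to a random page; with
    probability 1-p it follows a random out-link of the current page."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: p / n for u in nodes}          # random-jump mass
        for u in nodes:
            out = graph[u] or nodes              # dangling page: jump anywhere
            share = (1 - p) * rank[u] / len(out)
            for v in out:
                new[v] += share
        rank = new
    return rank

g = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
r = pagerank(g)
assert abs(sum(r.values()) - 1.0) < 1e-9   # ranks form a distribution
```

Here "b" ends up ranked highest because both "a" and "c" link to it, matching the intuition that in-links from well-visited pages raise a page's rank.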

Google - Features

- In addition to PageRank, Google also weighs keyword matches to improve search
- Anchor text
  - Provides more accurate descriptions of Web pages
  - Anchors exist for un-indexable documents (e.g., images)
- Font sizes of words in text
  - Words in larger or bolder fonts are assigned higher weights
- Google vs. HITS
  - Google: PageRanks are computed for Web pages in advance, independent of the search query
  - HITS: hub and authority weights are computed for different root sets in the context of a particular search query

Trawling the Web for Emerging Communities [KRR98]

- Co-citation: pages that are related are frequently referenced together
- Web communities are characterized by dense directed bipartite subgraphs
- Computing (i,j) bipartite cores
  - Sort the edge list by source id and detect all source pages s with out-degree j (let D be the set of destination pages that s points to)
  - Compute the intersection S of the sets of source pages pointing to the destination pages in D (using an index on destination id to generate each source set)
  - Output the bipartite core (S, D)

Using Hyperlinks to Improve Classification [CDI98]

- Using text from neighbors when classifying a Web page is ineffective, because referenced pages may belong to a different class
- Instead, use class information from pre-classified neighbors
  - Choose the class ci for which Pr[ci|Ni] is maximum (Ni is the set of class labels of all the neighboring documents)
  - By Bayes rule, we choose ci to maximize Pr[Ni|ci] Pr[ci]
  - Assuming independence of the neighbor classes, Pr[Ni|ci] factors into a product over the individual neighbor labels

Collaborative Search

SearchLight

- Key idea: improve search by sharing information on URLs visited by members of a community during search
- Based on the concept of search sessions
  - A search session is the search-engine query (collection of keywords) and the URLs visited in response to the query
  - It is possible to extract search sessions from proxy logs
- SearchLight maintains a database of (query, target URL) pairs
  - The target URL is heuristically chosen to be the last URL in the search session for the query
- In response to a search query, SearchLight displays URLs from its database for the specified query

Image Retrieval

Similar Images

- Given
  - A set of images
- Find
  - All images similar to a given image
  - All pairs of similar images
- Sample applications
  - Medical diagnosis
  - Weather prediction
  - Web search engines for images
  - E-commerce

Similar Image Retrieval Systems

- QBIC, Virage, Photobook
  - Compute a feature signature for each image
  - QBIC uses color histograms
  - WBIIS and WALRUS use wavelets
  - Use a spatial index to retrieve the database image whose signature is closest to the query's signature
- QBIC drawbacks
  - Computes a single signature for the entire image
  - Thus, it fails when images contain similar objects, but at different locations or in varying sizes
  - Color histograms cannot capture shape, texture and location information (wavelets can!)

WALRUS Similarity Model [NRS99]

- WALRUS decomposes an image into regions
- A single signature is stored for each region
- Two images are considered similar if they have enough similar region pairs

WALRUS (Step 1)

- Generation of signatures for sliding windows
  - Each image is broken into sliding windows
  - For the signature of each sliding window, use coefficients from the lowest frequency band of the Haar wavelet
- Naive algorithm vs. a dynamic programming algorithm (parameters: N, the number of pixels in the image; S; and the max window size)

WALRUS (Step 2)

- Clustering sliding windows
  - Cluster the windows in the image using the pre-clustering phase of BIRCH
  - Each cluster defines a region in the image
  - For each cluster, the centroid is used as a signature (cf. bounding box)

WALRUS - Retrieval Results

[Figure: a query image and the similar images retrieved for it]

Network-Data Management and Analysis

Networks Create Data

- To effectively manage their networks, Internet/telecom service providers continuously gather utilization and traffic data
- Managed IP network elements collect huge amounts of traffic data
  - Switch/router-level monitoring (SNMP, RMON, NetFlow, etc.)
  - A typical IP router maintains several thousand SNMP counters
  - Service-Level Agreements (SLAs) and Quality-of-Service (QoS) guarantees require finer-grain monitoring (per IP flow!!)
- Telecom networks: Call-Detail Records (CDRs) for every phone call
  - Each CDR comprises hundreds of bytes of data with several tens of fields/attributes (e.g., endpoint exchanges, timestamps, tariffs)
- End result: massive collections of Network-Management (NM) data (can grow on the order of several terabytes/year!!)

Why Data Management??

- Massive NM data sets hide knowledge that is crucial to key management tasks
  - Application/user profiling, proactive/reactive resource management, traffic engineering, capacity planning, etc.
- Data mining research can help!
  - Develop novel tools for the effective storage, exploration, and analysis of massive Network-Management data
- Several challenging research themes
  - Semantic data compression, approximate query processing, XML, mining models for event correlation and fault analysis, network-recommender systems, . . .
- Long-term goal :-)
  - Intelligent, self-tuning, self-healing communication networks

Mining Techniques for Network Data

- Automated schema extraction for XML data: the XTRACT system
- Data reduction techniques for massive data tables
  - Lossless semantic compression with simple data dependencies: the pzip algorithm
  - Lossy, guaranteed-error semantic compression
    - Fascicles
    - Model-Based Semantic Compression: the SPARTAN system
- Approximate query processing over data synopses
- Mining techniques for event correlation and root-cause analysis
- Managing and mining data streams

Automated Schema Extraction for XML Data: The XTRACT System

XML Primer I

- Standard for data representation and data exchange
  - Unified, self-describing format for publishing/exchanging management data across heterogeneous network/NM platforms
- Looks like HTML, but it isn't
- Collection of elements
  - Atomic (raw character data)
  - Composite (sequence of nested sub-elements)
- Example: a paper element with nested sub-elements for the title ("A Relational Model of Data for Large Shared Data Banks"), the author (E. F. Codd), and the affiliation (IBM Research)

XML Primer II

- XML documents can be accompanied by Document Type Descriptors (DTDs)
- DTDs serve the role of the schema of the document
  - They specify a regular expression for every element

The XTRACT System [GGR00]

- DTDs are of great practical importance
  - Efficient storage of XML data collections
  - Formulation and optimization of XML queries
- However, DTDs are not mandatory; XML data may not be accompanied by a DTD
  - Automatically-generated XML documents (e.g., from relational databases or flat files)
  - DTD standards for many communities are still evolving
- Goal of the XTRACT system
  - Automated inference of DTDs from XML-document collections

Problem Formulation

- Element types => alphabet
- Infer a DTD for each element type separately
  - Example sequences: instances of nested sub-elements
  - => Only one level down in the hierarchy
- Problem statement
  - Given a set of example sequences for element e
  - Infer a good regular expression for e
- Hard problem!!
  - DTDs can comprise general, complex regular expressions
  - Need to quantify the notion of goodness for regular expressions

Example XML Documents

[Slide shows example XML documents, the simplified example sequences of nested sub-elements extracted from them, and a desirable concise DTD inferred from those sequences.]

DTD Inference Requirements

- Requirements for a good DTD
  - Generalizes to intuitively correct but previously unseen examples
  - It should be concise (i.e., small in size)
  - It should be precise (i.e., not cover too many sequences not contained in the set of examples)
- Example: consider the input sequences ta, taa, taaa, taaaa and candidate DTDs for them

The XTRACT Approach: The MDL Principle

- Minimum Description Length (MDL) quantifies and resolves the tradeoff between DTD conciseness and preciseness
- MDL principle: the best theory to infer from a set of data is the one which minimizes the sum of
  - (A) the length of the theory, in bits, plus
  - (B) the length of the data, in bits, when encoded with the help of the theory
- Part (A) captures conciseness, and part (B) captures preciseness

Overview of the XTRACT System

- XTRACT consists of 3 subsystems (generalization, factoring, and MDL-based selection); e.g., for input sequences
  I = {ab, abab, ac, ad, bc, bd, bbd, bbbe}
  generalization produces SG = I U {(ab)*, b*d, b*e}, factoring produces SF = SG U {(a|b)(c|d), b*(d|e)}, and the inferred DTD is (ab)* | (a|b)(c|d) | b*(d|e)

MDL Subsystem

- MDL principle: minimize the sum of
  - Theory description length, plus
  - Data description length given the theory
- In order to use MDL, we need to
  - Define the theory description length (candidate DTD)
  - Define the data description length (input sequences) given the theory (candidate DTD)
  - Solve the resulting minimization problem

MDL Subsystem - Encoding Scheme

- Description length of a DTD
  - Number of bits required to encode the DTD: proportional to the DTD's size times the log of the alphabet size (including the metacharacters)
- Description length of a sequence given a candidate DTD
  - Number of bits required to specify the sequence given the DTD
  - Use a sequence of encoding indices
  - The encoding of a given a is the empty string
  - The encoding of a given (a|b|c) is the index 0
  - The encoding of aaa given a* is the index 3
  - Example: the encoding of ababcabc given ((ab)*c)* is the sequence 2, 2, 1

MDL Encoding Example

- Consider again the input sequences ta, taa, taaa, taaaa
- Candidate theories include the enumeration ta | taa | taaa | taaaa, the overly general (t|a)*, and ta*
- For each candidate, the slide tabulates the theory description length and the data description length given the theory; ta* minimizes their sum (its data encoding is simply the repetition counts 1, 2, 3, 4)

MDL Subsystem - Minimization

[Figure: a bipartite graph linking the input sequences (ta, taa, taaa, taaaa) to candidate DTDs, with edge weights giving the per-sequence encoding costs.]

- Selecting the candidate DTDs maps to the Facility Location Problem (NP-hard)
- XTRACT employs fast heuristic algorithms proposed by the Operations Research community

Semantic Compression of Massive Network-Data Tables

Compressing Massive Tables: A New Direction in Data Compression

- The benefits of data compression are well established
  - Optimize storage, I/O, network bandwidth (e.g., data transfers, disconnected operation for mobile users) over the lifetime of the data
  - Faster query processing over synopses
- Several generic compression tools and algorithms exist (e.g., gzip, Huffman, Lempel-Ziv)
  - Syntactic methods: operate at the byte level and view the data as a large byte string
  - Lossless compression only
- Effective compression of massive alphanumeric tables
  - Needs novel methods that are semantic: account for and exploit the meaning and data dependencies of attributes in the table
  - Lossless or lossy compression: flexible mechanisms for users to specify acceptable information loss

The pzip Table Compressor [BCC00]

- Key ideas
  - Lossless compression via training: use a small sample of table records to learn simple dependency patterns
  - Build a compression plan that exploits the discovered dependencies (e.g., column grouping)
  - Leverage existing compression tools (e.g., gzip, bzip) to losslessly compress the entire table
- Based on discovering and exploiting simple dependency patterns among table columns
  - Combinational dependencies
  - Differential dependencies
- Also uses simple differential coding for low-frequency columns
- Outperforms naive gzip by factors of up to 2 in compression ratio/time

Combinational Dependencies in pzip

- Some notation
  - T[i,j]: portion of table T between columns i and j (T[i]: i-th column of T)
  - S(T[i,j]): size of the compressed (e.g., gzipped) representation of T[i,j]
- The ranges T[i,j] and T[j+1,k] are combinationally dependent iff S(T[i,j]) + S(T[j+1,k]) > S(T[i,k])
  - Grouping the two ranges results in better compression
- Optimum partitioning: find the column groupings that result in minimum overall storage requirements (each column group is compressed individually)
  - Solved optimally using dynamic programming:
    OPT[1,i] = min over j of { OPT[1,j] + S(T[j+1,i]) }
  - Complexity is O(n^2) assuming the S(T[i,j]) values are known (remember these are computed over a sample of T)
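The dynamic program above can be sketched as follows; the cost table is a made-up toy in which columns 1-2 compress well together, standing in for the sampled S() estimates:

```python
def optimal_grouping(n, S):
    """Optimal partition of columns 1..n into contiguous groups.
    S(i, j) gives the compressed size of the column range [i, j]
    (in pzip, estimated from a sample of the table).
    Recurrence: OPT[i] = min over j < i of OPT[j] + S(j+1, i)."""
    OPT = [0.0] * (n + 1)
    choice = [0] * (n + 1)
    for i in range(1, n + 1):
        OPT[i], choice[i] = min(
            (OPT[j] + S(j + 1, i), j) for j in range(i))
    # walk the choice array backwards to recover the groups
    groups, i = [], n
    while i > 0:
        groups.append((choice[i] + 1, i))
        i = choice[i]
    return OPT[n], list(reversed(groups))

# Toy costs: S(1,2) = 5 beats S(1,1) + S(2,2) = 8, so group columns 1-2.
costs = {(1, 1): 4, (2, 2): 4, (3, 3): 3,
         (1, 2): 5, (2, 3): 8, (1, 3): 9}
best, groups = optimal_grouping(3, lambda i, j: costs[(i, j)])
print(best, groups)   # -> 8 [(1, 2), (3, 3)]
```

With n columns the double loop evaluates O(n^2) candidate splits, matching the complexity claimed on the slide.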

Differential Dependencies in pzip

- Column T[j] is differentially dependent on T[i] iff the difference of T[j] with respect to T[i] compresses to a smaller size than T[j] itself
  - A more explicit form of dependency
- Differential compression problem: partition T's columns into source and derived columns, and find the differential encoding for each derived column such that the overall storage is minimized
  - Maps naturally to the Facility Location Problem (NP-hard)
  - Greedy local-search heuristics are used in the pzip implementation

Semantic Compression with Fascicles [JMN99]

- Key observation
  - Often, numerous subsets of records in T have similar values for many attributes
  - Compress the data by storing representative values (e.g., the centroid) only once for each attribute cluster
- Lossy compression: information loss is controlled by the (user-defined) notion of similar values for attributes

Problem Formulation

- k-dimensional fascicle F(k,t): subset of records with k compact attributes
- A user-defined compactness tolerance t (a vector) specifies the allowable loss in the compression per attribute
  - E.g., t[Duration] = 3 means that all Duration values in a fascicle are within 3 of the centroid value
  - Flexible, per-attribute specification of compression loss
- Problem statement
  - Given a table T and a compactness-tolerance vector t, find fascicles within the specified tolerances such that the total storage is minimized
  - (1) Finding candidate fascicles in T; (2) selecting the best fascicles to compress T

Finding Candidate Fascicles

- Efficient, randomized algorithm
  - Use (memory-resident) random samples of T to choose an initial collection of tip sets (maximal fascicles based on the sampled records)
  - Grow the tip sets with all qualifying records in one pass over T
  - Not guaranteed to find all fascicles!
- Exact, level-wise (Apriori-like) procedures are possible (fascicles are anti-monotone), BUT
  - Inordinately expensive
  - Not necessarily better (they require static pre-binning of numeric attributes)

Selecting Fascicles for Compression

- Selecting the optimal subset among all candidate fascicles is hard!
  - Generalization of the Weighted Set Cover Problem (NP-hard)
- Use an efficient, greedy heuristic
  - Always select the fascicle that gives the maximum compression benefit
- Fascicles give significantly improved compression ratios (factors of 2-3) compared to naive gzip

SPARTAN: A Model-Based Semantic Compressor [BGR01]

- New, general paradigm: Model-Based Semantic Compression (MBSC)
  - Extract data mining models and use them to compress
  - Lossless or lossy compression (with guaranteed per-attribute error bounds)
- The SPARTAN system is a specific instantiation of the MBSC framework
- Key observation: row-wise attribute clusters (a la fascicles) are not sufficient (e.g., they miss relationships such as Y = aX + b)
- Idea: use a carefully-selected collection of Classification and Regression Trees (CaRTs) to capture such vertical correlations and predict the values of entire columns

SPARTAN Example CaRT Models

Protocol  Duration  Bytes  Packets
http      12        20K    3
http      16        24K    5
http      15        20K    8
http      19        40K    11
http      26        58K    18
ftp       27        100K   24
ftp       32        300K   35
ftp       18        80K    15

- Can use two compact trees (one decision tree, one regression tree) to eliminate two data columns (the predicted attributes)

SPARTAN Architecture

SPARTAN's CaRTSelector

- The heart of the SPARTAN semantic-compression engine
- Uses a Bayesian network constructed on T to drive the construction and selection of the best subset of CaRT predictors
- Hard optimization problem: a strict generalization of Weighted Maximum Independent Set (WMIS) (NP-hard!)
  - The CaRTSelector employs a novel algorithm that iteratively uses a near-optimal WMIS heuristic to determine a good subset of CaRTs for compression
- SPARTAN's compression ratios outperform gzip and fascicles by wide margins (even for lossless compression)
- Higher, but reasonable, compression times (8 min for a 14-attribute, 30MB table); samples are used to learn the CaRT models
- SPARTAN's model predictors can be useful in other NM contexts
  - E.g., event correlation filtering, root-cause analysis (more later...)

Approximate Query Processing Over Synopses

Data Exploration in Traditional Decision Support Systems (DSS)

[Diagram: SQL queries against a GB/TB data warehouse return exact answers, but with long response times.]

Exact Answers NOT Always Required

- Interactive exploration of massive data sets
  - Early feedback giving a rough idea of the results would help to quickly find the interesting regions in the data space
  - Data visualization
- Aggregate queries: approximate answers often suffice
  - How does the total sales of product X in NJ compare to that in CA? Precision to the penny is not needed
- Base data may be remote/unavailable
  - Locally-cached synopses of the data may be the only option

Solution: Approximate Query Processing

[Diagram: compact relations (MB) are constructed in advance from the data warehouse (GB/TB); SQL queries are transformed via a transformation algebra to run against the compact relations, yielding approximate answers with fast response times.]

Approximate Query Processing Using Wavelets CGR 00

- Construct compact synopses of data table(s) using multi-dimensional Haar-wavelet decomposition
- Fast: takes just a single pass over the data if it is chunked, otherwise logarithmic passes
- SQL queries are answered by working just on the compact synopses (collections of wavelet coefficients), i.e., entirely in the wavelet (compressed) domain
- fast response times
- results converted back to relational domain (rendering) at the end
- all types of queries supported: aggregate, set-valued, GROUP-BY, ...
- Fast, accurate, general
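A minimal one-dimensional sketch of the idea: compute the Haar decomposition (pairwise averages and differences), keep only the k largest-magnitude coefficients as the synopsis, and reconstruct an approximate answer from them. The data values and the simple top-k retention rule are illustrative; the actual system uses multi-dimensional decompositions over data tables:

```python
# 1-D Haar wavelet decomposition (unnormalized averages/differences),
# top-k coefficient synopsis, and reconstruction (rendering).

def haar_decompose(data):
    """Return [overall average] + detail coefficients, coarsest first."""
    coeffs = []
    while len(data) > 1:
        avgs = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        coeffs = details + coeffs   # finer-level details go later in the list
        data = avgs
    return data + coeffs

def haar_reconstruct(coeffs):
    """Inverse transform: rebuild the data array from the coefficients."""
    data = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(data)]
        pos += len(details)
        data = [x for a, d in zip(data, details) for x in (a + d, a - d)]
    return data

def synopsis(coeffs, k):
    """Zero out all but the k largest-magnitude coefficients."""
    keep = set(sorted(range(len(coeffs)),
                      key=lambda i: abs(coeffs[i]), reverse=True)[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

data = [2, 2, 0, 2, 3, 5, 4, 4]
c = haar_decompose(data)          # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
approx = haar_reconstruct(synopsis(c, 4))   # close to the original data
```

Keeping half the coefficients (4 of 8) already reproduces most values exactly; the error is confined to the regions whose detail coefficients were dropped.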

Query Processing Architecture

- Entire processing in compressed (wavelet) domain

(Diagram: query-execution pipeline -- each operator maps sets of coefficients to sets of coefficients, followed by a final render step)

- Each operator (e.g., select, project, join, aggregates)
- Input: set of Haar coefficients
- Output: set of coefficients
- Finally, rendering step
- Input: set of Haar coefficients
- Output: (multi)set of tuples

Mining Techniques for Event Correlation and

Root-Cause Analysis

Network Event Correlation / Root-Cause Analysis

- The problem: Alarm floods!!

Router

Router

NM System Architecture

- EC: use fault propagation rules to improve information quality and filter secondary alarms
- RCA: employ EC output to produce a set of possible root causes and associated degrees of confidence

Event Correlation Engine

- Driven by fault propagation rules (causal relationships between alarm signals)

CAUSAL BAYESIAN MODEL !!

Given a set of observed alarms A, find a minimal subset P of root events such that P(A | P) >= threshold
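The minimal-explanation search can be sketched greedily. This is a toy stand-in: the rule sets and event names are invented, and a real engine performs probabilistic causal inference rather than plain set cover:

```python
# Hedged sketch: greedily pick a small set of root events that "explains"
# the observed alarms, in the spirit of the minimal-subset formulation.

RULES = {  # root event -> alarms it can cause (toy fault propagation rules)
    "link_down":   {"if_down", "bgp_flap", "ping_loss"},
    "fan_failure": {"temp_high", "cpu_throttle"},
    "dos_attack":  {"cpu_high", "ping_loss"},
}

def explain(observed, rules=RULES):
    """Greedy set cover: add root events until all alarms are explained."""
    uncovered, chosen = set(observed), []
    while uncovered:
        # Pick the root event explaining the most still-uncovered alarms.
        best = max(rules, key=lambda r: len(rules[r] & uncovered))
        if not rules[best] & uncovered:
            break                      # remaining alarms match no rule
        chosen.append(best)
        uncovered -= rules[best]
    return chosen, uncovered

roots, unexplained = explain({"if_down", "ping_loss", "bgp_flap"})
print(roots)   # a single root event covers all three alarms
```

Here the flood of three alarms collapses to one candidate root cause, which is exactly the filtering effect EC is after.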

State-of-the-art

- SMARTS InCharge
- Network elements modeled as objects with hard-coded fault propagation rules
- Use causal graph to produce binary signatures for each failure (codebook)
- HP OpenView ECS, CISCO InfoCenter, GTE Impact, ...
- Graphics- or language-based specification of global rules for event filtering
- Hand-coding of causal model !!
- tedious, error-prone, non-incremental
- ignores probabilistic aspects (dependency strength)
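The codebook idea above can be sketched in a few lines: each failure gets a binary signature over the alarm set, and an observed alarm vector is decoded to the failure with the nearest signature in Hamming distance. The signatures below are invented, not from any real codebook:

```python
# Hedged sketch of codebook-style correlation (nearest-signature decoding).

CODEBOOK = {            # failure -> binary signature over alarms (a1..a4)
    "router_crash": (1, 1, 0, 1),
    "link_cut":     (1, 0, 1, 0),
    "card_failure": (0, 1, 1, 1),
}

def hamming(x, y):
    """Number of positions where the two bit vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def decode(observed, codebook=CODEBOOK):
    """Return the failure whose signature best matches the observed alarms."""
    return min(codebook, key=lambda f: hamming(codebook[f], observed))

# One alarm bit was lost (a4 missing), yet decoding is still unambiguous.
print(decode((1, 1, 0, 0)))   # -> router_crash
```

Decoding to the nearest codeword is what gives the approach its robustness: a lost or spurious alarm shifts the observed vector by one bit but usually not past the halfway point to another signature.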

Data Mining can Help Automate

- Data Mining techniques for inferring and maintaining causal models from network alarm data
- Maintenance (on-line)
- Challenges: incorporate temporal aspects, topology, and domain knowledge in the data-mining process
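One simple way to start inferring such a model from alarm data: estimate a pairwise "dependency strength" as the fraction of occurrences of alarm a that are followed by alarm b within a short time window. The log contents and window size below are invented; real mining must also fold in topology and domain knowledge, as noted above:

```python
# Hedged sketch: mine candidate causal edges a -> b from a time-sorted
# alarm log by estimating P(b within w seconds after a).

from collections import defaultdict

LOG = [  # (timestamp_sec, alarm_type), sorted by timestamp (toy data)
    (0, "a"), (2, "b"), (10, "a"), (11, "b"), (30, "a"), (60, "c"),
]

def dependency_strengths(log, window=5):
    """For each pair (a, b): fraction of a's followed by b within window."""
    occurrences = defaultdict(int)
    followed = defaultdict(int)
    for i, (t1, a) in enumerate(log):
        occurrences[a] += 1
        seen = set()                      # count each b at most once per a
        for t2, b in log[i + 1:]:
            if t2 - t1 > window:
                break                     # log is time-sorted: stop early
            if b != a and b not in seen:
                followed[(a, b)] += 1
                seen.add(b)
    return {pair: n / occurrences[pair[0]] for pair, n in followed.items()}

print(dependency_strengths(LOG))   # {('a', 'b'): 0.666...}: candidate edge a -> b
```

A high strength (here 2 of 3 occurrences of "a" are followed by "b") makes a -> b a candidate causal rule, addressing exactly the probabilistic aspect that hand-coded models ignore.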

Root Cause Analysis

- Use data mining (e.g., classification techniques) for RCA (learn failure signatures from a field-data DB)
- Exploit domain knowledge (e.g., topology) in the data-mining process
- Refine the RCA models as more data from the field becomes available

References

- BCC 00: A.L. Buchsbaum, D.F. Caldwell, K.W. Church, G.S. Fowler, and S. Muthukrishnan. Engineering the Compression of Massive Tables: An Experimental Approach. SODA, 2000.
- BGR 01: S. Babu, M. Garofalakis, and R. Rastogi. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. ACM SIGMOD, 2001.
- BP 98: S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. WWW7, 1998.
- CDA 97: S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. ACM SIGMOD, 1998.
- CDI 98: S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, 1998.
- CGR 00: K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate Query Processing Using Wavelets. VLDB, 2000.

References (Continued)

- DDF 90: S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 1990.
- GGR 00: M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: A System for Extracting Document Type Descriptors from XML Documents. ACM SIGMOD, 2000.
- GKR 98: D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB, 1998.
- GRS 98: S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD, 1998.
- GRS 99: S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Data Engineering, 1999.
- HKK 97: E. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on association rule hypergraphs. DMKD Workshop, 1997.

References (Continued)

- JMN 99: H.V. Jagadish, J. Madar, and R.T. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB, 1999.
- Kle 98: J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA, 1998.
- KRR 99: R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. WWW8, 1999.
- ZRL 96: T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD, 1996.