1
Data Mining meets the Internet: Techniques for Web Information Retrieval and Network Data Management
Minos Garofalakis and Rajeev Rastogi
Internet Management Research, Bell Laboratories, Murray Hill
2
The Web
  • Over 1 billion HTML pages, 15 terabytes
  • Wealth of information
  • Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow/white pages, maps, markets, ...
  • Diverse media types: text, images, audio, video
  • Heterogeneous formats: HTML, XML, PostScript, PDF, JPEG, MPEG, MP3
  • Highly Dynamic
  • 1 million new pages each day
  • Average page changes in a few weeks
  • Graph structure with links between pages
  • Average page has 7-10 links
  • Hundreds of millions of queries per day

3
Why is Web Information Retrieval Important?
  • According to most predictions, the majority of
    human information will be available on the Web in
    ten years
  • Effective information retrieval can aid in
  • Research: Find all papers that use the primal-dual method to solve the facility location problem
  • Health/Medicine: What could be the reason for symptoms of yellow eyes, high fever, and frequent vomiting?
  • Travel: Find information on the tropical island of St. Lucia
  • Business: Find companies that manufacture digital signal processors (DSPs)
  • Entertainment: Find all movies starring Marilyn Monroe between the years 1960 and 1970
  • Arts: Find all short stories written by Jhumpa Lahiri

4
Web Information Retrieval Model
[Architecture diagram] A crawler fetches pages from Web servers into a repository managed by a storage server. An indexer builds an inverted index over terms (e.g., engine, jaguar, cat), and clustering/classification modules organize the repository documents into a topic hierarchy (Root; Business, News, Science; lower-level topics such as Computers, Automobiles, Plants, Animals). A search query (e.g., jaguar) is answered using the inverted index and the topic hierarchy. Example documents: "The jaguar has a 4 liter engine" vs. "The jaguar, a cat, can run at speeds reaching 50 mph".
5
Why is Web Information Retrieval Difficult?
  • The Abundance Problem (99% of information is of no interest to 99% of people)
  • Hundreds of irrelevant documents returned in
    response to a search query
  • Limited Coverage of the Web (Internet sources
    hidden behind search interfaces)
  • The largest crawlers cover less than 18% of Web pages
  • The Web is extremely dynamic
  • 1 million pages added each day
  • Very high dimensionality (thousands of
    dimensions)
  • Limited query interface based on keyword-oriented
    search
  • Limited customization to individual users

6
How can Data Mining Improve Web Information
Retrieval?
  • Latent Semantic Indexing (LSI)
  • SVD-based method to improve precision and recall
  • Document clustering to generate topic hierarchies
  • Hypergraph partitioning, STIRR, ROCK
  • Document classification to assign topics to new
    documents
  • Naive Bayes, TAPER
  • Exploiting hyperlink structure to locate
    authoritative Web pages
  • HITS, Google, Web Trawling
  • Collaborative searching
  • SearchLight
  • Image Retrieval
  • QBIC, Virage, Photobook, WBIIS, WALRUS

7
Latent Semantic Indexing
8
Problems with Inverted Index Approach
  • Synonymy
  • Many ways to refer to the same object
  • Polysemy
  • Most words have more than one distinct meaning

[Term-document matrix] Columns: animal, jaguar, speed, car, engine, porsche, automobile; rows: Doc 1, Doc 2, Doc 3, with an X marking each term that occurs in a document. The term jaguar occurs both in documents about cats and in documents about cars (polysemy), while car, automobile, and porsche spread one concept across different terms (synonymy).
9
LSI - Key Idea DDF 90
  • Apply SVD to the term-by-document (t x d) matrix X:
      X = T0 S0 D0^T, where T0 (t x m) and D0 (d x m) have orthonormal columns and S0 (m x m) is diagonal
  • Ignore the very small singular values in S0 (keeping only the k largest):
      X ≈ X_k = T S D^T, with T (t x k), S (k x k), and D (d x k)
  • The new matrix X_k of rank k is the closest rank-k matrix to X in the least-squares sense
10
Comparing Documents and Queries
  • Comparing two documents
  • Essentially the dot product of two column vectors of X_k: X_k^T X_k = D S^2 D^T
  • So one can take the rows of the DS matrix as coordinates for documents and compute dot products in this k-dimensional space
  • Finding documents similar to a query q with term vector X_q
  • Derive a representation D_q for the query: D_q = X_q^T T S^(-1)
  • The dot product of D_q S and the appropriate row of the DS matrix yields the similarity between the query and a specific document
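
A minimal numpy sketch of the LSI pipeline on the last two slides: truncated SVD of a toy term-by-document matrix, followed by folding a query into the k-dimensional space. The matrix values, the value of k, and the variable names are illustrative assumptions, not taken from the original system.

    import numpy as np

    # toy term-by-document matrix X (t terms x d documents)
    X = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

    k = 2
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # X = T0 S0 D0^T
    T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T    # keep the k largest singular values
    X_k = T @ S @ D.T                                     # rank-k approximation of X

    # fold a query term vector into the k-dimensional space: D_q = X_q^T T S^-1
    X_q = np.array([1, 1, 0, 0], dtype=float)
    D_q = X_q @ T @ np.linalg.inv(S)

    # similarity = dot product of D_q S with each row of D S
    scores = (D @ S) @ (D_q @ S)
    print(np.argsort(-scores))        # documents ranked by similarity to the query
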
11
LSI - Benefits
  • Reduces Dimensionality of Documents
  • From tens of thousands (one dimension per keyword) to a few hundred
  • Decreases storage overhead of index structures
  • Speeds up retrieval of documents similar to a
    query
  • Makes search less brittle
  • Captures semantics of documents
  • Addresses problems of synonymy and polysemy
  • Transforms document space from discrete to
    continuous
  • Improves both search precision and recall

12
Document Clustering
13
Improve Search Using Topic Hierarchies
  • Web directories (or topic hierarchies) provide a
    hierarchical classification of documents (e.g.,
    Yahoo!)
  • Searches performed in the context of a topic restrict the search to only the subset of web pages related to that topic
  • Clustering can be used to generate topic
    hierarchies

[Example topic hierarchy] Yahoo home page, with top-level topics Recreation, Science, Business, News and lower-level topics such as Sports, Travel, Companies, Finance, and Jobs
14
Clustering
  • Given
  • Data points (documents) and number of desired
    clusters k
  • Group the data points (documents) into k clusters
  • Data points (documents) within clusters are more
    similar than across clusters
  • Document similarity measure
  • Each document can be represented by a vector with a 0/1 value along each word dimension
  • The cosine of the angle between document vectors (or, alternatively, the Euclidean distance between them) is a measure of their similarity; see the sketch after this list
  • Other applications
  • Customer segmentation
  • Market basket analysis
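
A minimal sketch of the cosine similarity measure mentioned above, applied to two 0/1 document vectors (the vectors and names are illustrative):

    import numpy as np

    def cosine_sim(d1, d2):
        # cosine of the angle between two document vectors
        return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

    doc1 = np.array([1, 1, 0, 1, 0])   # 0/1 presence of each word
    doc2 = np.array([1, 0, 0, 1, 1])
    print(cosine_sim(doc1, doc2))      # ~0.67
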

15
k-means Algorithm
  • Choose k initial means
  • Assign each point to the cluster with the closest
    mean
  • Compute new mean for each cluster
  • Iterate until the k means stabilize
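
A minimal k-means sketch following the four steps above. The toy data, the parameter choices, and the names are illustrative, and empty clusters are not handled.

    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        means = points[rng.choice(len(points), size=k, replace=False)]   # choose k initial means
        for _ in range(iters):
            # assign each point to the cluster with the closest mean
            labels = np.argmin(
                np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2), axis=1)
            # compute the new mean of each cluster
            new_means = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_means, means):       # stop once the k means stabilize
                break
            means = new_means
        return means, labels

    pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
    print(kmeans(pts, k=2))
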

16
Agglomerative Hierarchical Clustering Algorithms
  • Initially each point is a distinct cluster
  • Repeatedly merge closest clusters until the
    number of clusters becomes k
  • Closest pairs can be defined via d_mean(Ci, Cj): the distance between the centroids of Ci and Cj
  • or via d_min(Ci, Cj): the minimum distance between a point in Ci and a point in Cj
  • Likewise d_ave(Ci, Cj) (average pairwise distance) and d_max(Ci, Cj) (maximum pairwise distance)
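
A minimal sketch of the merge loop using the centroid distance d_mean; swapping in d_min, d_ave, or d_max only changes the distance computation (toy data, illustrative names):

    import numpy as np

    def agglomerative(points, k):
        clusters = [[i] for i in range(len(points))]       # each point starts as its own cluster
        while len(clusters) > k:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # d_mean: distance between the cluster centroids
                    d = np.linalg.norm(points[clusters[a]].mean(axis=0) -
                                       points[clusters[b]].mean(axis=0))
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            clusters[a].extend(clusters.pop(b))             # merge the two closest clusters
        return clusters

    pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
    print(agglomerative(pts, k=3))
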

17
Agglomerative Hierarchical Clustering Algorithms
(Continued)
d_mean: centroid approach; d_min: Minimum Spanning Tree (MST) approach

[Figure: (a) centroid-based clusters, (b) MST-based clusters, (c) the correct clusters]
18
Drawbacks of Traditional Clustering Methods
  • Traditional clustering methods are ineffective
    for clustering documents
  • Cannot handle thousands of dimensions
  • Cannot scale to millions of documents
  • Centroid-based method splits large and
    non-hyperspherical clusters
  • Centers of subclusters can be far apart
  • MST-based algorithm is sensitive to outliers and
    slight change in position
  • Exhibits a chaining effect on strings of outliers
  • Using other similarity measures such as Jaccard
    coefficient instead of euclidean distance does
    not help

19
Example - Centroid Method for Clustering Documents
  • As cluster size grows
  • The number of dimensions appearing in the mean goes up
  • Their value in the mean decreases
  • Thus, it becomes very difficult to distinguish two points that differ on only a few dimensions
  • Ripple effect
  • Documents 1, 4 and 6 end up merged even though they have no elements in common!

20
Itemset Clustering using Hypergraph Partitioning
HKK 97
  • Build a weighted hypergraph from the frequent itemsets
  • Hyperedge: each frequent itemset
  • Weight of a hyperedge: the average of the confidences of all association rules generated from the itemset
  • A hypergraph partitioning algorithm is used to cluster the items
  • Minimize the sum of the weights of the cut hyperedges
  • Label customers with item clusters by scoring
  • Assumes that the items defining the clusters are disjoint!

21
STIRR - A System for Clustering Categorical
Attribute Values GKR 98
  • Motivated by spectral graph partitioning, a
    method for clustering undirected graphs
  • Each distinct attribute value becomes a separate
    node v with weight w(v)
  • Node weights w(v) are updated in each iteration as follows: for each tuple containing v, combine the weights of the other attribute values in that tuple and add the result to the new weight of v; the updated set of weights is then normalized so that it is orthonormal
  • Positive and negative weights in non-principal
    basins tend to represent good partitions of the
    data

22
ROCK GRS 99
  • Hierarchical clustering algorithm for categorical
    attributes
  • Example market basket customers
  • Uses the novel concept of links for merging clusters (see the sketch below)
  • sim(pi, pj): a similarity function that captures the closeness between pi and pj
  • pi and pj are said to be neighbors if sim(pi, pj) exceeds a user-specified threshold
  • link(pi, pj): the number of common neighbors of pi and pj
  • At each step, merge the clusters/points with the largest number of links
  • Points belonging to a single cluster will in general have a large number of common neighbors
  • Random sampling is used for scale-up
  • In the final labeling phase, each point on disk is assigned to the cluster with the maximum number of neighbors
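
A minimal sketch of ROCK-style link computation, assuming Jaccard similarity and an illustrative threshold theta; the baskets and names are made up for illustration and are not the slide's data:

    from itertools import combinations

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def links(points, theta):
        # points: list of sets (e.g., market baskets); theta: similarity threshold
        n = len(points)
        # neighbors of i: all points whose similarity with i is at least theta
        nbrs = [{j for j in range(n) if j != i and jaccard(points[i], points[j]) >= theta}
                for i in range(n)]
        # link(i, j): number of common neighbors of points i and j
        return {(i, j): len(nbrs[i] & nbrs[j]) for i, j in combinations(range(n), 2)}

    baskets = [{1, 2, 3}, {1, 2, 6}, {1, 2, 7}, {1, 6, 7}, {2, 6, 7}]
    print(links(baskets, theta=0.5))
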

23
ROCK
{1, 2, 3}  {1, 4, 5}  {1, 2, 4}  {2, 3, 4}  {1, 2, 5}  {2, 3, 5}  {1, 3, 4}  {2, 4, 5}  {1, 3, 5}  {3, 4, 5}

{1, 2, 6}  {1, 2, 7}  {1, 6, 7}  {2, 6, 7}
  • {1, 2, 6} and {1, 2, 7} have 5 links.
  • {1, 2, 3} and {1, 2, 6} have 3 links.

24
Clustering Algorithms for Numeric Attributes
  • Scalable Clustering Algorithms
  • (From Database Community)
  • CLARANS
  • DBSCAN
  • BIRCH
  • CLIQUE
  • CURE
  • Above algorithms can be used to cluster documents
    after reducing their dimensionality using SVD

25
BIRCH ZRL 96
  • Pre-cluster data points using CF-tree
  • CF-tree is similar to R-tree
  • For each point
  • CF-tree is traversed to find the closest cluster
  • If the cluster is within epsilon distance, the
    point is absorbed into the cluster
  • Otherwise, the point starts a new cluster
  • Requires only single scan of data
  • Cluster summaries stored in CF-tree are given to
    main memory hierarchical clustering algorithm

26
CURE GRS 98
  • Hierarchical algorithm for discovering arbitrary-shaped clusters
  • Uses a small number of representatives per cluster
  • Note
  • Centroid-based methods use 1 point to represent a cluster: too little information... hyper-spherical clusters only
  • MST-based methods use every point to represent a cluster: too much information... easily misled
  • Uses random sampling
  • Uses Partitioning
  • Labeling using representatives

27
Cluster Representatives
  • A representative set of points
  • Small in number: c points
  • Distributed over the cluster
  • Each point in the cluster is close to one representative
  • Distance between clusters: the smallest distance between their representatives

28
Computing Cluster Representatives
  • Finding Scattered Representatives
  • We want to
  • Distribute around the center of the cluster
  • Spread well out over the cluster
  • Capture the physical shape and geometry of the
    cluster
  • Use farthest point heuristic to scatter the
    points over the cluster
  • Shrink uniformly around the mean of the cluster

29
Computing Cluster Representatives (Continued)
  • Shrinking the Representatives
  • Why do we need to alter the Representative Set?
  • Too close to the boundary of cluster
  • Shrink uniformly around the mean (center) of the
    cluster

30
Document Classification
31
Classification
  • Given
  • Database of tuples (documents), each assigned a
    class label
  • Develop a model/profile for each class
  • Example profile (good credit)
  • (25 < age and salary > 40k) or (married = YES)
  • Example profile (automobile)
  • Document contains a word from {car, truck, van, SUV, vehicle, scooter}
  • Other applications
  • Credit card approval (good, bad)
  • Bank locations (good, fair, poor)
  • Treatment effectiveness (good, fair, poor)

32
Naive Bayesian Classifier
  • The class c for a new document d is the one for which Pr[c | d] is maximum
  • Assume independent term occurrences in the document
  • Pr[t | c]: the fraction of documents in class c that contain term t
  • Then, by Bayes rule, Pr[c | d] is proportional to Pr[c] times the product, over the terms t in d, of Pr[t | c]
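
A minimal Naive Bayes sketch along these lines. The toy documents, the names, and the add-one smoothing are illustrative assumptions, not taken from the slide.

    import math
    from collections import defaultdict

    def train(docs):
        # docs: list of (set_of_terms, class_label) pairs
        class_count = defaultdict(int)
        term_count = defaultdict(lambda: defaultdict(int))
        for terms, c in docs:
            class_count[c] += 1
            for t in terms:
                term_count[c][t] += 1
        return class_count, term_count

    def classify(terms, class_count, term_count):
        n = sum(class_count.values())
        best, best_score = None, -math.inf
        for c, nc in class_count.items():
            # log Pr[c] + sum over terms of log Pr[t | c], with simple add-one smoothing
            score = math.log(nc / n)
            for t in terms:
                score += math.log((term_count[c][t] + 1) / (nc + 2))
            if score > best_score:
                best, best_score = c, score
        return best

    docs = [({"jaguar", "engine", "car"}, "auto"), ({"jaguar", "cat", "speed"}, "animal")]
    counts = train(docs)
    print(classify({"engine", "car"}, *counts))   # -> auto
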

33
Hierarchical Classifier (TAPER) CDA 97
  • The class of a new document d is the leaf node c of the topic hierarchy for which Pr[c | d] is maximum
  • Pr[c | d] can be computed using Bayes rule along the path of the topic hierarchy
  • The problem of computing c reduces to finding the leaf node c with the least-cost path from the root to c
34
k-Nearest Neighbor Classifier
  • Assign to a point the label of the majority of its k nearest neighbors
  • For k = 1, the error rate is never worse than twice the Bayes rate (with an unlimited number of samples)
  • Scalability issues
  • Use an index to find the k nearest neighbors
  • The R-tree family works well up to about 20 dimensions
  • Pyramid tree for high-dimensional data
  • Use SVD to reduce the dimensionality of the data set
  • Use clusters to reduce the dataset size
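
A minimal brute-force k-NN sketch; a linear scan stands in for the index structures and dimensionality-reduction steps listed above, and the data and names are illustrative:

    import numpy as np
    from collections import Counter

    def knn_classify(query, points, labels, k=3):
        # find the k nearest neighbors by Euclidean distance (brute-force linear scan)
        dists = np.linalg.norm(points - query, axis=1)
        nearest = np.argsort(dists)[:k]
        # assign the majority label among the k neighbors
        return Counter(labels[i] for i in nearest).most_common(1)[0][0]

    points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
    labels = ["A", "A", "B", "B"]
    print(knn_classify(np.array([0.2, 0.1]), points, labels, k=3))   # -> A
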

35
Decision Trees
Credit Analysis

[Decision-tree figure] The root node tests a condition on salary; one branch leads directly to an accept leaf, while the other leads to a node testing whether education is graduate, whose yes/no branches lead to accept and reject leaves.
36
Decision Tree Algorithms
  • Building phase
  • Recursively split nodes using best splitting
    attribute for node
  • Pruning phase
  • Smaller imperfect decision tree generally
    achieves better accuracy
  • Prune leaf nodes recursively to prevent
    over-fitting

37
Decision Tree Algorithms
  • Classifiers from machine learning community
  • ID3
  • C4.5
  • CART
  • Classifiers for large databases
  • SLIQ, SPRINT
  • PUBLIC
  • SONAR
  • Rainforest, BOAT

38
Decision Trees
  • Pros
  • Fast execution time
  • Generated rules are easy to interpret by humans
  • Scale well for large data sets
  • Can handle high dimensional data
  • Cons
  • Cannot capture correlations among attributes
  • Consider only axis-parallel cuts

39
Feature Selection
  • Choose a collection of keywords that help
    discriminate between two or more sets of
    documents
  • Fewer keywords help to speed up classification
  • Improves classification accuracy by eliminating
    noise from documents
  • Fisher's discriminant: the ratio of between-class to within-class scatter of a term's occurrences, computed per term t from the 0/1 indicator of whether a document d contains t (see the sketch below)
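
A minimal per-term Fisher-score sketch for two classes. It assumes the common two-class form, squared difference of class means over the sum of within-class variances; the exact formula on the original slide did not survive, so this form and the toy matrices are assumptions.

    import numpy as np

    def fisher_score(X_pos, X_neg):
        # X_pos, X_neg: 0/1 document-term matrices for the two classes (docs x terms)
        mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
        var_p, var_n = X_pos.var(axis=0), X_neg.var(axis=0)
        # between-class scatter over within-class scatter, per term
        return (mu_p - mu_n) ** 2 / (var_p + var_n + 1e-9)

    X_pos = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1]])
    X_neg = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 1]])
    print(fisher_score(X_pos, X_neg))   # term 0 separates the classes best
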

40
Exploiting Hyperlink Structure
41
HITS (Hyperlink-Induced Topic Search) Kle 98
  • HITS uses hyperlink structure to identify
    authoritative Web sources for broad-topic
    information discovery
  • Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages
  • Authorities: highly-referenced pages on a topic
  • Hubs: pages that point to authorities
  • A good authority is pointed to by many good hubs; a good hub points to many good authorities

[Figure: a set of hub pages, each linking to several pages in a set of authority pages]
42
HITS - Discovering Web Communities
  • Discovering the community for a specific topic/query involves the following steps
  • Collect a seed set of pages S (returned by a search engine)
  • Expand the seed set to contain pages that point to, or are pointed to by, pages in the seed set
  • Iteratively update the hub weight h(p) and authority weight a(p) for each page
  • After a fixed number of iterations, the pages with the highest hub/authority weights form the core of the community
  • Extensions proposed in Clever
  • Assign links different weights based on the relevance of the link anchor text
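
A minimal sketch of the hub/authority iteration, using the standard HITS update in which a(p) sums the hub weights of pages pointing to p and h(p) sums the authority weights of pages p points to, with normalization each round; the graph and names are illustrative:

    def hits(graph, iters=20):
        # graph: page -> list of pages it links to
        pages = set(graph) | {q for qs in graph.values() for q in qs}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iters):
            # a(p): sum of the hub weights of pages pointing to p
            auth = {p: sum(hub[q] for q in pages if p in graph.get(q, [])) for p in pages}
            # h(p): sum of the authority weights of pages that p points to
            hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
            # normalize so the weights stay bounded
            na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / na for p, v in auth.items()}
            hub = {p: v / nh for p, v in hub.items()}
        return hub, auth

    g = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
    print(hits(g))
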

43
Google BP 98
  • Search engine that uses link structure to
    calculate a quality ranking (PageRank) for each
    page
  • PageRank
  • Can be calculated using a simple iterative
    algorithm, and corresponds to principal
    eigenvector of the normalized link matrix
  • Intuition: PageRank is the probability that a random surfer visits a page
  • Parameter p is probability that the surfer gets
    bored and starts on a new random page
  • (1-p) is the probability that the random surfer
    follows a link on current page
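
A minimal power-iteration PageRank sketch. The damping convention follows the slide (p is the probability of jumping to a random page, 1 - p the probability of following a link); the graph, parameter values, and handling of dangling pages are illustrative assumptions.

    def pagerank(graph, p=0.15, iters=50):
        # graph: page -> list of pages it links to
        pages = list(set(graph) | {q for qs in graph.values() for q in qs})
        n = len(pages)
        rank = {pg: 1.0 / n for pg in pages}
        for _ in range(iters):
            new = {pg: p / n for pg in pages}                 # get bored: jump to a random page
            for src in pages:
                outs = graph.get(src, [])
                if outs:                                      # with probability 1 - p, follow a link
                    share = (1 - p) * rank[src] / len(outs)
                    for dst in outs:
                        new[dst] += share
                else:                                         # dangling page: spread its rank uniformly
                    for dst in pages:
                        new[dst] += (1 - p) * rank[src] / n
            rank = new
        return rank

    g = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(g))
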

44
Google - Features
  • In addition to PageRank, in order to improve
    search Google also weighs keyword matches
  • Anchor text
  • Provide more accurate descriptions of Web pages
  • Anchors exist for un-indexable documents (e.g.,
    images)
  • Font sizes of words in text
  • Words in larger or bolder font are assigned
    higher weights
  • Google vs. HITS
  • Google: PageRanks are computed for Web pages up front, independent of the search query
  • HITS: hub and authority weights are computed for different root sets, in the context of a particular search query

45
Trawling the Web for Emerging Communities KRR 98
  • Co-citation: pages that are related are frequently referenced together
  • Web communities are characterized by dense directed bipartite subgraphs
  • Computing (i, j) bipartite cores
  • Sort the edge list by source id and detect all source pages s with out-degree j (let D be the set of destination pages that s points to)
  • Compute the intersection S of the sets of source pages pointing to the destination pages in D (using an index on destination id to generate each source set)
  • Output the bipartite core (S, D)
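
A minimal in-memory sketch of the two-step core computation described above; a real trawler streams sorted edge lists from disk, and the edge data and names here are illustrative:

    from collections import defaultdict

    def ij_cores(edges, i, j):
        # edges: list of (source_page, destination_page) links
        out = defaultdict(set)       # source -> set of destinations
        inc = defaultdict(set)       # destination -> set of sources
        for s, d in edges:
            out[s].add(d)
            inc[d].add(s)
        cores = set()
        for s, dests in out.items():
            if len(dests) == j:                                  # source with out-degree j
                S = set.intersection(*(inc[d] for d in dests))   # sources pointing to ALL of D
                if len(S) >= i:
                    cores.add((frozenset(S), frozenset(dests)))
        return cores

    edges = [("s1", "a"), ("s1", "b"), ("s2", "a"), ("s2", "b"), ("s3", "a"), ("s3", "b")]
    print(ij_cores(edges, i=3, j=2))
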

[Figure: an (i, j) bipartite core, with i source pages each linking to the same j destination pages]
46
Using Hyperlinks to Improve Classification CDI
98
  • Use text from neighbors when classifying a Web page
  • Ineffective, because referenced pages may belong to a different class
  • Use class information from pre-classified neighbors
  • Choose the class ci for which Pr[ci | Ni] is maximum (Ni is the set of class labels of all the neighboring documents)
  • By Bayes rule, we choose ci to maximize Pr[Ni | ci] Pr[ci]
  • Assuming independence of the neighbor classes, Pr[Ni | ci] factors into a product over the individual neighbors' class labels
47
Collaborative Search
48
SearchLight
  • Key idea: improve search by sharing information on the URLs visited by members of a community during search
  • Based on the concept of search sessions
  • A search session is the search engine query
    (collection of keywords) and the URLs visited in
    response to the query
  • Possible to extract search sessions from the
    proxy logs
  • SearchLight maintains a database of (query,
    target URL) pairs
  • Target URL is heuristically chosen to be last URL
    in search session for the query
  • In response to a search query, SearchLight
    displays URLs from its database for the specified
    query

49
Image Retrieval
50
Similar Images
  • Given
  • A set of images
  • Find
  • All images similar to a given image
  • All pairs of similar images
  • Sample applications
  • Medical diagnosis
  • Weather prediction
  • Web search engine for images
  • E-commerce

51
Similar Image Retrieval Systems
  • QBIC, Virage, Photobook
  • Compute feature signature for each image
  • QBIC uses color histograms
  • WBIIS, WALRUS use wavelets
  • Use a spatial index to retrieve the database image whose signature is closest to the query's signature
  • QBIC drawbacks
  • Computes single signature for entire image
  • Thus, fails when images contain similar objects,
    but at different locations or in varying sizes
  • Color histograms cannot capture shape, texture
    and location information (wavelets can!)

52
WALRUS Similarity Model NRS 99
  • WALRUS decomposes an image into regions
  • A single signature is stored for each region
  • Two images are considered to be similar if they
    have enough similar region pairs

53
WALRUS (Step 1)
  • Generation of signatures for sliding windows
  • Each image is broken into sliding windows
  • For the signature of each sliding window, use the coefficients from the lowest frequency band of its Haar wavelet decomposition
  • Naive algorithm vs. dynamic-programming algorithm for computing the signatures, with costs expressed in terms of N (the number of pixels in the image), S, and the maximum window size

54
WALRUS (Step 2)
  • Clustering Sliding Windows
  • Cluster the windows in the image using
    pre-clustering phase of BIRCH
  • Each cluster defines a region in the image.
  • For each cluster, the centroid is used as a
    signature. (c.f. bounding box)

55
WALRUS - Retrieval Results
Query image
56
Network-Data Management and Analysis
57
Networks Create Data
  • To effectively manage their networks, Internet/telecom service providers continuously gather utilization and traffic data
  • Managed IP network elements collect huge amounts of traffic data
  • Switch/router-level monitoring (SNMP, RMON, NetFlow, etc.)
  • A typical IP router maintains several thousands of SNMP counters
  • Service-Level Agreements (SLAs) and Quality-of-Service (QoS) guarantees require finer-grain monitoring (per IP flow!)
  • Telecom networks: Call-Detail Records (CDRs) for every phone call
  • Each CDR comprises 100s of bytes of data with several 10s of fields/attributes (e.g., endpoint exchanges, timestamps, tariffs)
  • End result: massive collections of Network-Management (NM) data (can grow on the order of several terabytes/year!)

58
Why Data Management??
  • Massive NM data sets hide knowledge that is
    crucial to key management tasks
  • Application/user profiling, proactive/reactive resource management, traffic engineering, capacity planning, etc.
  • Data Mining research can help!
  • Develop novel tools for the effective storage,
    exploration, and analysis of massive
    Network-Management data
  • Several challenging research themes
  • semantic data compression, approximate query
    processing, XML, mining models for event
    correlation and fault analysis,
    network-recommender systems, . . .
  • Loooooong-term goal :-)
  • Intelligent, self-tuning, self-healing
    communication networks

59
Mining Techniques for Network Data
  • Automated schema extraction for XML data: the XTRACT system
  • Data reduction techniques for massive data tables
  • Lossless semantic compression with simple data dependencies: the pzip algorithm
  • Lossy, guaranteed-error semantic compression
  • Fascicles
  • Model-Based Semantic Compression: the SPARTAN system
  • Approximate query processing over data synopses
  • Mining techniques for event correlation and
    root-cause analysis
  • Managing and mining data streams

60
Automated Schema Extraction for XML Data
The XTRACT System
61
XML Primer I
  • Standard for data representation and data
    exchange
  • Unified, self-describing format for
    publishing/exchanging management data across
    heterogeneous network/NM platforms
  • Looks like HTML, but it isn't
  • Collection of elements
  • Atomic (raw character data)
  • Composite (sequence of nested sub-elements)
  • Example: a paper element with nested title, author, and affiliation sub-elements
  • A Relational Model for Large Shared Data Banks
  • E.F. Codd
  • IBM Research
62
XML Primer II
  • XML documents can be accompanied by Document Type
    Descriptors (DTDs)
  • DTDs serve the role of the schema of the document
  • Specify a regular expression for every element
  • Example

63
The XTRACT System GGR 00
  • DTDs are of great practical importance
  • Efficient storage of XML data collections
  • Formulation and optimization of XML queries
  • However, DTDs are not mandatory: XML data may not be accompanied by a DTD
  • Automatically-generated XML documents (e.g., from
    relational databases or flat files)
  • DTD standards for many communities are still
    evolving
  • Goal of the XTRACT system
  • Automated inference of DTDs from XML-document
    collections

64
Problem Formulation
  • Element types => alphabet
  • Infer the DTD for each element type separately
  • Example sequences: instances of nested sub-elements
  • => Only one level down in the hierarchy
  • Problem statement
  • Given a set of example sequences for element e
  • Infer a good regular expression for e
  • Hard problem!!
  • DTDs can comprise general, complex regular expressions
  • Need to quantify the notion of goodness for regular expressions

65
Example XML Documents

66
Example (Continued)
  • Simplified example sequences of nested sub-elements drawn from the example documents
  • Desirable solution: a single concise regular expression covering all of the sequences (ending in ... year)

67
DTD Inference Requirements
  • Requirements for a good DTD
  • Generalizes to intuitively correct but previously
    unseen examples
  • It should be concise (i.e., small in size)
  • It should be precise (i.e., not cover too many
    sequences not contained in the set of examples)
  • Example: consider the case
  • p -> ta, taa, taaa, taaaa

Candidate DTDs for p range from simply listing the example sequences (precise but not concise) to broad generalizations such as (t|a)* (concise, but covering many unseen sequences)
68
The XTRACT Approach MDL Principle
  • Minimum Description Length (MDL) quantifies and
    resolves the tradeoff between DTD conciseness and
    preciseness
  • MDL principle The best theory to infer from a
    set of data is the one which minimizes the sum of
  • (A) the length of the theory, in bits, plus
  • (B) the length of the data, in bits, when
    encoded with the help of the theory.
  • Part (A) captures conciseness, and
  • Part (B) captures preciseness

69
Overview of the XTRACT System
  • XTRACT consists of 3 subsystems: generalization (producing the candidate set SG), factoring (producing SF), and MDL-based selection of the final DTD
  • Input sequences

I  = {ab, abab, ac, ad, bc, bd, bbd, bbbe}
SG = I ∪ {(ab)*, (a|b)*, b*d, b*e}
SF = SG ∪ {(a|b)(c|d), b*(d|e)}
Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)
70
MDL Subsystem
  • MDL principle Minimize the sum of
  • Theory description length, plus
  • Data description length given the theory
  • In order to use MDL, need to
  • Define theory description length (candidate
    DTD)
  • Define data description length (input sequences)
    given the theory (candidate DTD)
  • Solve the resulting minimization problem

71
MDL Subsystem - Encoding Scheme
  • Description length of a DTD
  • Number of bits required to encode the DTD
  • Size of DTD x log of the alphabet size, with the alphabet extended by the metacharacters (, ), |, *
  • Description length of a sequence given a candidate DTD
  • Number of bits required to specify the sequence given the DTD
  • Use a sequence of encoding indices
  • Encoding of a given a is the empty string ε
  • Encoding of a given (a|b|c) is the index 0
  • Encoding of aaa given a* is the index 3
  • Example: the encoding of ababcabc given ((ab)*c)* is the sequence 2, 2, 1

72
MDL Encoding Example
  • Consider again the case
  • p -> ta, taa, taaa, taaaa

    Candidate theory           Data encoding (given the theory)             Theory + Data = Total
    ta | taa | taaa | taaaa    0, 1, 2, 3                                   17 + 7  = 24
    (t|a)*                     2 0 1,  3 0 1 1,  4 0 1 1 1,  5 0 1 1 1 1     6 + 21 = 27
    ta*                        1, 2, 3, 4                                    3 + 7  = 10

  • The theory ta* minimizes the total description length and is the DTD the MDL subsystem selects
73
MDL Subsystem - Minimization
[Figure] A bipartite graph: the input sequences (ta, taa, taaa, taaaa) on one side and the candidate DTDs (c1, c2, c3) on the other, with an edge of weight wij connecting a sequence to each candidate DTD that can encode it (the weight is the encoding cost).
  • Maps to the Facility Location Problem (NP-hard)
  • XTRACT employs fast heuristic algorithms
    proposed by the Operations Research community

74
Semantic Compression of Massive Network-Data
Tables
75
Compressing Massive Tables A New Direction in
Data Compression
  • Benefits of data compression are well established
  • Optimize storage, I/O, network bandwidth (e.g.,
    data transfers, disconnected operation for mobile
    users) over the lifetime of the data
  • Faster query processing over synopses
  • Several generic compression tools and algorithms exist (e.g., gzip, Huffman, Lempel-Ziv)
  • Syntactic methods: operate at the byte level and view the data as a large byte string
  • Lossless compression only
  • Effective compression of massive alphanumeric tables
  • Needs novel methods that are semantic: they account for and exploit the meaning of, and the data dependencies among, the attributes in the table
  • Lossless or lossy compression, with flexible mechanisms for users to specify the acceptable information loss

76
The pzip Table Compressor BCC 00
  • Key ideas
  • Lossless compression via training: use a small sample of table records to learn simple dependency patterns
  • Build a compression plan that exploits the
    discovered dependencies (e.g., column grouping)
  • Leverage existing compression tools (e.g., gzip,
    bzip) to losslessly compress the entire table
  • Based on discovering and exploiting simple
    dependency patterns among table columns
  • Combinational dependencies
  • Differential dependencies
  • Also, use simple differential coding for
    low-frequency columns
  • Outperforms naive gzip by factors of up to 2 in
    compression ratio/time

77
Combinational Dependencies in pzip
  • Some notation
  • T[i,j]: the portion of table T between columns i and j (T[i]: the i-th column of T)
  • S(T[i,j]): the size of the compressed (e.g., gzipped) representation of T[i,j]
  • The ranges T[i,j] and T[j+1,k] are combinationally dependent iff S(T[i,j]) + S(T[j+1,k]) > S(T[i,k])
  • Grouping the two ranges then results in better compression
  • Optimum partitioning: find the column grouping that results in minimum overall storage requirements (each column group is compressed individually)
  • Solved optimally using dynamic programming: OPT[1,i] = min over j < i of ( OPT[1,j] + S(T[j+1,i]) )
  • Complexity is O(n^2), assuming the S(T[i,j]) values are known (remember, these are computed over a sample of T)
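
A minimal sketch of the optimum-partitioning dynamic program, with zlib standing in for the compressor behind the cost function S; the sample data, the row-wise serialization of a column group, and the function names are illustrative assumptions:

    import zlib
    from functools import lru_cache

    def S(cols):
        # compressed size of a group of columns, serialized row by row
        data = "\n".join("|".join(row) for row in zip(*cols)).encode()
        return len(zlib.compress(data))

    def optimal_grouping(columns):
        n = len(columns)

        @lru_cache(maxsize=None)
        def opt(i):
            # minimum total compressed size of columns[0:i], plus the chosen groups
            if i == 0:
                return 0, ()
            best = None
            for j in range(i):                       # last group is columns[j:i]
                prev_cost, prev_groups = opt(j)
                cost = prev_cost + S(columns[j:i])
                if best is None or cost < best[0]:
                    best = (cost, prev_groups + ((j, i),))
            return best

        return opt(n)

    cols = [["a", "a", "b"], ["a", "a", "b"], ["x", "y", "z"]]   # column 1 mirrors column 0
    print(optimal_grouping(cols))
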

78
Differential Dependencies in pzip
  • Column Tj is differentially dependent on column Ti iff S(Tj) > S(Ti - Tj), i.e., the column of differences compresses better than Tj itself
  • Compressing the difference with respect to Ti, rather than Tj itself, results in better compression
  • A more explicit form of dependency
  • Differential compression problem: partition T's columns into source and derived columns, and find the differential encoding for each derived column such that the overall storage is minimized
  • Maps naturally to the Facility Location Problem (NP-hard)
  • Greedy local-search heuristics are used in the pzip implementation

79
Semantic Compression with Fascicles JMN 99
  • Key observation
  • Often, numerous subsets of records in T have
    similar values for many attributes
  • Compress data by storing representative
    values (e.g., centroid) only once for each
    attribute cluster
  • Lossy compression: the information loss is controlled by the (user-defined) notion of similar values for attributes

80
Problem Formulation
  • k-dimensional fascicle F(k, t): a subset of records with k compact attributes
  • The user-defined compactness tolerance t (a vector) specifies the allowable loss in the compression per attribute
  • E.g., t_Duration = 3 means that all Duration values in a fascicle are within 3 of the centroid value
  • Flexible, per-attribute specification of
    compression loss
  • Problem Statement
  • Given a table T and a compactness-tolerance
    vector t, find fascicles within the specified
    tolerances such that the total storage is
    minimized
  • (1) Finding candidate fascicles in T

    (2) Selecting the best fascicles to
    compress T

81
Finding Candidate Fascicles
  • Efficient, randomized algorithm
  • Use (memory-resident) random samples of T to choose an initial collection of tip sets (maximal fascicles based on the sampled records)
  • Grow tip sets with all qualifying records in
    one pass over T
  • Not guaranteed to find all fascicles!
  • Exact, level-wise (Apriori-like) procedures are
    possible (fascicles are anti-monotone), BUT
  • Inordinately expensive
  • Not necessarily better (require static
    pre-binning of numeric attributes)

82
Selecting Fascicles for Compression
  • Selecting the optimal subset among all
    candidate fascicles is hard!
  • Generalization of Weighted Set Cover Problem
    (NP-hard)
  • Use an efficient, greedy heuristic
  • Always select the fascicle that gives maximum
    compression benefit
  • Fascicles give significantly improved compression
    ratios (factors of 2-3) compared to naive gzip

83
SPARTAN: A Model-Based Semantic Compressor
BGR 01
  • New, general paradigm: Model-Based Semantic Compression (MBSC)
  • Extract data mining models and use them to compress
  • Lossless or lossy compression (with guaranteed per-attribute error bounds)
  • SPARTAN system: a specific instantiation of the MBSC framework
  • Key observation: row-wise attribute clusters (a la fascicles) are not sufficient (e.g., they miss column correlations such as Y = aX + b)
  • Idea: use a carefully-selected collection of Classification and Regression Trees (CaRTs) to capture such vertical correlations and predict the values of entire columns

84
SPARTAN Example CaRT Models
    Protocol   Duration   Bytes    Packets
    http          12       20K        3
    http          16       24K        5
    http          15       20K        8
    http          19       40K       11
    http          26       58K       18
    ftp           27      100K       24
    ftp           32      300K       35
    ftp           18       80K       15
  • Can use two compact trees (one decision,
    one regression) to eliminate two data columns
    (predicted attributes)

85
SPARTAN Architecture
86
SPARTANs CaRTSelector
  • Heart of the SPARTAN semantic-compression
    engine
  • Uses the constructed Bayesian network on T to
    drive the construction and selection of the
    best subset of CaRT predictors
  • Hard optimization problem -- Strict
    generalization of Weighted Maximum Independent
    Set (WMIS) (NP-hard!)
  • CaRTSelector employs a novel algorithm that
    iteratively uses a near-optimal WMIS heuristic
    to determine a good subset of CaRTs for
    compression
  • SPARTAN's compression ratios outperform gzip and fascicles by wide margins (even for lossless compression)
  • Higher, but reasonable, compression times (8 min for a 14-attribute, 30 MB table); samples are used to learn the CaRT models
  • SPARTAN's model predictors can be useful in other NM contexts
  • e.g., event correlation filtering, root cause
    analysis (more later...)

87
Approximate Query Processing Over Synopses
88
Data Exploration in Traditional Decision Support
Systems (DSS)
[Diagram] SQL queries posed against a GB/TB data warehouse return exact answers, but with long response times.
89
Exact Answers NOT Always Required
  • Interactive exploration of massive data sets
  • Early feedback giving a rough idea of the results would help to quickly find the interesting regions in the data space
  • Data visualization
  • Aggregate queries: approximate answers often suffice
  • How do the total sales of product X in NJ compare to those in CA? Precision to the penny is not needed
  • Base data may be remote/unavailable: locally-cached synopses of the data may be the only option

90
Solution Approximate Query Processing
[Diagram] Compact relations (MB) are constructed in advance from the data warehouse (GB/TB); an SQL query is transformed (via a transformation algebra) into a query over the compact relations, yielding approximate answers with fast response times.
91
Approximate Query Processing Using Wavelets CGR
00
  • Construct compact synopses of data table(s) using
    multi-dimensional Haar-wavelet decomposition
  • Fast: takes just a single pass over the data if it is chunked, otherwise logarithmically many passes
  • SQL queries are answered by working just on the compact synopses (collections of wavelet coefficients), i.e., entirely in the wavelet (compressed) domain
  • Fast response times
  • Results are converted back to the relational domain (rendering) at the end
  • All types of queries supported: aggregate, set-valued, GROUP-BY, . . .
  • Fast, accurate, general

92
Query Processing Architecture
  • Entire processing in compressed (wavelet) domain

93
Query Execution
  • Each operator (e.g., select, project, join, aggregates)
  • Input: set of Haar coefficients
  • Output: set of coefficients
  • Finally, the rendering step
  • Input: set of Haar coefficients
  • Output: (multi)set of tuples

94
Mining Techniques for Event Correlation and
Root-Cause Analysis
95
Network Event Correlation and Root-Cause Analysis
  • The problem: alarm floods!!

[Diagram: routers and other network elements flooding the management station with alarms]
96
NM System Architecture
  • EC: use fault propagation rules to improve information quality and filter out secondary alarms
  • RCA: employ the EC output to produce a set of possible root causes and associated degrees of confidence

97
Event Correlation Engine
  • Driven by fault propagation rules (causal relationships between alarm signals)

A CAUSAL BAYESIAN MODEL !!
Given the set of observed alarms A, find a minimal subset of causes P such that Pr[A | P] exceeds a threshold
98
State-of-the-art
  • SMARTS InCharge
  • Network elements modeled as objects with
    hard-coded fault propagation rules
  • Use causal graph to produce binary signatures
    for each failure (codebook)
  • HP OpenView ECS, CISCO InfoCenter, GTE Impact, . . .
  • Graphics- or language-based specification of global rules for event filtering
  • Hand-coding of causal model !!
  • tedious, error-prone, non-incremental
  • ignores probabilistic aspects (dependency
    strength)

99
Data Mining can Help Automate
  • Data Mining techniques for inferring and maintaining causal models from network alarm data

Maintenance (on-line)
  • Challenges: incorporate temporal aspects, topology, and domain knowledge in the data-mining process

100
Root Cause Analysis
  • Use data mining (e.g., classification techniques) for RCA (learn failure signatures from a field-data DB)
  • Exploit domain knowledge (e.g., topology) in
    the data-mining process
  • Refine the RCA models as more data from the field
    becomes available

101
References
  • BCC 00 A.L. Buchsbaum, D.F. Caldwell, K.W. Church, G.S. Fowler, and S. Muthukrishnan. Engineering the Compression of Massive Tables: An Experimental Approach. SODA, 2000.
  • BGR 01 S. Babu, M. Garofalakis, and R. Rastogi. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. ACM SIGMOD, 2001.
  • BP 98 S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. WWW7, 1998.
  • CDA 97 S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, 1998.
  • CDI 98 S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. ACM SIGMOD, 1998.
  • CGR 00 K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate Query Processing Using Wavelets. VLDB, 2000.

102
References (Continued)
  • DDF 90 S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 1990.
  • GGR 00 M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: A System for Extracting Document Type Descriptors from XML Documents. ACM SIGMOD, 2000.
  • GKR 98 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB, 1998.
  • GRS 98 S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD, 1998.
  • GRS 99 S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Data Engineering (ICDE), 1999.
  • HKK 97 E. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on association rule hypergraphs. DMKD Workshop, 1997.

103
References (Continued)
  • JMN 99 H.V. Jagadish, J. Madar, and R.T. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB, 1999.
  • Kle 98 J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA, 1998.
  • KRR 98 R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. WWW8, 1999.
  • ZRL 96 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD, 1996.