Data Mining meets the Internet: Techniques for Web Information Retrieval
1
Data Mining meets the Internet: Techniques for Web Information Retrieval
Rajeev Rastogi
Internet Management Research, Bell Laboratories, Murray Hill
2
The Web
  • Over 1 billion HTML pages, 15 terabytes
  • Wealth of information
  • Bookstores, restaurants, travel, malls,
    dictionaries, news, stock quotes, yellow/white
    pages, maps, markets, .........
  • Diverse media types: text, images, audio, video
  • Heterogeneous formats: HTML, XML, PostScript,
    PDF, JPEG, MPEG, MP3
  • Highly Dynamic
  • 1 million new pages each day
  • Average page changes in a few weeks
  • Graph structure with links between pages
  • Average page has 7-10 links
  • Hundreds of millions of queries per day

3
Why is Web Information Retrieval Important?
  • According to most predictions, the majority of
    human information will be available on the Web in
    ten years
  • Effective information retrieval can aid in
  • Research: Find all papers that use the
    primal-dual method to solve the facility location
    problem
  • Health/Medicine: What could be the reason for
    symptoms of yellow eyes, high fever and
    frequent vomiting?
  • Travel: Find information on the tropical island
    of St. Lucia
  • Business: Find companies that manufacture digital
    signal processors (DSPs)
  • Entertainment: Find all movies starring Marilyn
    Monroe between the years 1960 and 1970
  • Arts: Find all short stories written by Jhumpa
    Lahiri

4
Web Information Retrieval Model
[Diagram: Web information retrieval model. A Crawler fetches pages
from Web servers into a Storage Server and Repository (example
documents: "The jaguar, a cat, can run at speeds reaching 50 mph";
"The jaguar has a 4 liter engine"). An Indexer builds an Inverted
Index over terms such as engine, jaguar, cat; Clustering and
Classification organize the repository documents into a Topic
Hierarchy (Root, with topics such as Business, News, Science,
Computers, Automobiles, Plants, Animals). Search queries (e.g.,
"jaguar") are answered against the index.]
5
Why is Web Information Retrieval Difficult?
  • The Abundance Problem (99% of the information is
    of no interest to 99% of the people)
  • Hundreds of irrelevant documents returned in
    response to a search query
  • Limited Coverage of the Web (Internet sources
    hidden behind search interfaces)
  • Largest crawlers cover less than 18% of Web pages
  • The Web is extremely dynamic
  • 1 million pages added each day
  • Very high dimensionality (thousands of
    dimensions)
  • Limited query interface based on keyword-oriented
    search
  • Limited customization to individual users

6
How can Data Mining Improve Web Information
Retrieval?
  • Latent Semantic Indexing (LSI)
  • SVD-based method to improve precision and recall
  • Document clustering to generate topic hierarchies
  • Hypergraph partitioning, STIRR, ROCK
  • Document classification to assign topics to new
    documents
  • Naive Bayes, TAPER
  • Exploiting hyperlink structure to locate
    authoritative Web pages
  • HITS, Google, Web Trawling
  • Collaborative searching
  • SearchLight
  • Image Retrieval
  • QBIC, Virage, Photobook, WBIIS, WALRUS

7
Latent Semantic Indexing
8
Problems with Inverted Index Approach
  • Synonymy
  • Many ways to refer to the same object
  • Polysemy
  • Most words have more than one distinct meaning

[Term-by-document incidence table: rows Doc 1, Doc 2, Doc 3; columns
animal, jaguar, speed, car, engine, porsche, automobile; an X marks
each term that occurs in a document. The jaguar column illustrates
polysemy (animal vs. car), while the car/automobile columns
illustrate synonymy.]
9
LSI - Key Idea [DDF 90]
  • Apply SVD to the term-by-document (t x d) matrix X:
    X = T0 S0 D0^T, where T0 and D0 have orthonormal
    columns and S0 is diagonal
  • Ignoring very small singular values in S0 (keeping
    only the first k largest values):
    X ≈ X̂ = T S D^T
  • The new matrix X̂ of rank k is the closest to X in the
    least-squares sense

(Dimensions: X is t x d; T0 is t x m, S0 is m x m, D0^T is m x d;
T is t x k, S is k x k, D^T is k x d.)
10
Comparing Documents and Queries
  • Comparing two documents
  • Essentially the dot product of two column vectors of
    X̂: X̂^T X̂ = D S^2 D^T
  • So one can consider the rows of the D S matrix as
    coordinates for documents and take dot products in
    this space
  • Finding documents similar to a query q with term
    vector Xq
  • Derive a representation Dq for the query:
    Dq = Xq^T T S^-1
  • The dot product of Dq S and the appropriate row of the
    D S matrix yields the similarity between the query and
    a specific document
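A minimal sketch of this computation on a toy term-by-document matrix, using numpy; the term list, the matrix values, the example query, and the choice k = 2 are illustrative assumptions, not taken from the slides.

    # LSI sketch: truncated SVD of a toy term-by-document matrix,
    # then folding a query into the reduced space.
    import numpy as np

    terms = ["jaguar", "cat", "speed", "engine", "car"]
    # X[i, j] = 1 if term i occurs in document j (3 toy documents)
    X = np.array([
        [1, 1, 0],   # jaguar
        [1, 0, 0],   # cat
        [1, 0, 1],   # speed
        [0, 1, 1],   # engine
        [0, 1, 1],   # car
    ], dtype=float)

    # Full SVD: X = T0 S0 D0^T
    T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

    # Keep only the k largest singular values: X_hat = T S D^T is the
    # best rank-k approximation of X in the least-squares sense
    k = 2
    T, S, Dt = T0[:, :k], S0[:k], D0t[:k, :]

    # Rows of D S serve as k-dimensional coordinates for the documents
    doc_coords = Dt.T * S                 # shape (num_docs, k)

    # Fold the query in: Dq = Xq^T T S^-1, so Dq S = Xq^T T
    Xq = np.array([1, 0, 1, 0, 0], dtype=float)   # query: "jaguar speed"
    query_coords = Xq @ T                          # = Dq S

    # Dot product of Dq S with each row of D S gives query-document similarity
    sims = doc_coords @ query_coords
    print("document ranking (best first):", np.argsort(-sims))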
11
LSI - Benefits
  • Reduces Dimensionality of Documents
  • From tens of thousands (one dimension per
    keyword) to a few hundred
  • Decreases storage overhead of index structures
  • Speeds up retrieval of documents similar to a
    query
  • Makes search less brittle
  • Captures semantics of documents
  • Addresses problems of synonymy and polysemy
  • Transforms document space from discrete to
    continuous
  • Improves both search precision and recall

12
Document Clustering
13
Improve Search Using Topic Hierarchies
  • Web directories (or topic hierarchies) provide a
    hierarchical classification of documents (e.g.,
    Yahoo!)
  • Searches performed in the context of a topic
    restrict the search to only the subset of web
    pages related to that topic
  • Clustering can be used to generate topic
    hierarchies

[Example Yahoo! directory: the home page branches into top-level
topics such as Recreation, Science, Business, and News; Recreation
includes Sports and Travel, and Business includes Companies,
Finance, and Jobs.]
14
Clustering
  • Given
  • Data points (documents) and number of desired
    clusters k
  • Group the data points (documents) into k clusters
  • Data points (documents) within clusters are more
    similar than across clusters
  • Document similarity measure
  • Each document can be represented by a vector with
    a 0/1 value along each word dimension
  • The cosine of the angle between document vectors is
    a measure of their similarity; alternatively, the
    (Euclidean) distance between the vectors can be used
    (see the sketch after this list)
  • Other applications
  • Customer segmentation
  • Market basket analysis
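A minimal sketch of the similarity measure described above, assuming documents are 0/1 vectors over a small fixed vocabulary; the vocabulary and example documents are illustrative.

    # Document similarity: cosine of the angle between 0/1 term vectors.
    import numpy as np

    vocab = ["jaguar", "cat", "speed", "engine", "car", "automobile"]

    def to_vector(doc_words):
        """0/1 vector: 1 along each word dimension occurring in the document."""
        words = set(doc_words)
        return np.array([1.0 if w in words else 0.0 for w in vocab])

    def cosine_similarity(u, v):
        """Cosine of the angle between two document vectors."""
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom > 0 else 0.0

    d1 = to_vector(["jaguar", "cat", "speed"])
    d2 = to_vector(["jaguar", "engine", "car"])
    print(cosine_similarity(d1, d2))   # similarity in [0, 1]
    print(np.linalg.norm(d1 - d2))     # alternative: Euclidean distance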

15
k-means Algorithm
  • Choose k initial means
  • Assign each point to the cluster with the closest
    mean
  • Compute new mean for each cluster
  • Iterate until the k means stabilize
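A sketch of these four steps with numpy; the toy points, the random choice of initial means, and the iteration cap are illustrative assumptions.

    # k-means sketch: choose means, assign points, recompute means, repeat.
    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Choose k initial means (here: k random points)
        means = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # 2. Assign each point to the cluster with the closest mean
            dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. Compute the new mean of each cluster
            new_means = np.array([
                points[labels == c].mean(axis=0) if np.any(labels == c) else means[c]
                for c in range(k)])
            # 4. Iterate until the k means stabilize
            if np.allclose(new_means, means):
                break
            means = new_means
        return means, labels

    points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
    means, labels = kmeans(points, k=2)
    print(labels)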

16
Agglomerative Hierarchical Clustering Algorithms
  • Initially each point is a distinct cluster
  • Repeatedly merge closest clusters until the
    number of clusters becomes k
  • Closest pair can be defined via dmean(Ci, Cj) =
    distance between the centroids of Ci and Cj
  • or via dmin(Ci, Cj) = minimum distance between a
    point in Ci and a point in Cj
  • Likewise dave(Ci, Cj) (average inter-point distance)
    and dmax(Ci, Cj) (maximum inter-point distance)

17
Agglomerative Hierarchical Clustering Algorithms
(Continued)
dmean: centroid approach; dmin: Minimum Spanning Tree (MST) approach
[Figure panels: (a) centroid approach, (b) MST approach,
(c) correct clusters]
18
Drawbacks of Traditional Clustering Methods
  • Traditional clustering methods are ineffective
    for clustering documents
  • Cannot handle thousands of dimensions
  • Cannot scale to millions of documents
  • Centroid-based method splits large and
    non-hyperspherical clusters
  • Centers of subclusters can be far apart
  • MST-based algorithm is sensitive to outliers and to
    slight changes in position
  • Exhibits a chaining effect on a string of outliers
  • Using other similarity measures such as Jaccard
    coefficient instead of euclidean distance does
    not help

19
Example - Centroid Method for Clustering Documents
  • As cluster size grows
  • The number of dimensions appearing in the mean goes up
  • Their value in the mean decreases
  • Thus, it becomes very difficult to distinguish two
    points that differ on only a few dimensions
  • ripple effect
  • 1,4 and 6 are merged even though they have no
    elements in common!

20
Itemset Clustering using Hypergraph Partitioning
[HKK 97]
  • Build a weighted hypergraph with frequent
    itemsets
  • Hyperedge: each frequent itemset
  • Weight of hyperedge: average of the confidences of
    all association rules generated from the itemset
  • Hypergraph partitioning algorithm is used to
    cluster items
  • Minimize sum of weights of cut hyperedges
  • Label customers with item clusters by scoring
  • Assumes that the items defining clusters are disjoint!

21
STIRR - A System for Clustering Categorical
Attribute Values [GKR 98]
  • Motivated by spectral graph partitioning, a
    method for clustering undirected graphs
  • Each distinct attribute value becomes a separate
    node v with weight w(v)
  • Node weights w(v) are updated in each iteration as
    follows: for each tuple t containing value v, combine
    the current weights of the other values in t (e.g., by
    adding them), and set the new w(v) to the sum of these
    combined weights over all tuples containing v
  • The set of weights is then normalized so that it is
    orthonormal
  • Positive and negative weights in non-principal
    basins tend to represent good partitions of the
    data

22
ROCK [GRS 99]
  • Hierarchical clustering algorithm for categorical
    attributes
  • Example market basket customers
  • Use novel concept of links for merging clusters
  • sim(pi, pj): a similarity function that captures
    the closeness between pi and pj
  • pi and pj are said to be neighbors if sim(pi, pj) >= θ,
    a user-specified threshold
  • link(pi, pj): the number of common neighbors of pi and pj
  • At each step, merge the clusters/points with the
    largest number of links
  • Points belonging to a single cluster will in
    general have a large number of common neighbors
  • Random sampling is used for scale-up
  • In the final labeling phase, each point on disk is
    assigned to the cluster with the maximum number of
    neighbors

23
ROCK
Points drawn from <1, 2, 3, 4, 5>: {1, 2, 3}, {1, 4, 5}, {1, 2, 4},
{2, 3, 4}, {1, 2, 5}, {2, 3, 5}, {1, 3, 4}, {2, 4, 5}, {1, 3, 5},
{3, 4, 5}
Points drawn from <1, 2, 6, 7>: {1, 2, 6}, {1, 2, 7}, {1, 6, 7},
{2, 6, 7}
  • {1, 2, 6} and {1, 2, 7} have 5 links.
  • {1, 2, 3} and {1, 2, 6} have 3 links.
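A small sketch that reproduces these link counts. The slide does not state the similarity function or threshold; Jaccard similarity with threshold 0.5 is an assumption used here for illustration, and with it the counts come out to 5 and 3 as above.

    # ROCK link computation on the example point sets.
    from itertools import combinations

    group_a = [frozenset(s) for s in combinations({1, 2, 3, 4, 5}, 3)]
    group_b = [frozenset(s) for s in combinations({1, 2, 6, 7}, 3)]
    points = group_a + group_b

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def neighbors(p, theta=0.5):
        # p and q are neighbors if sim(p, q) >= theta
        return {q for q in points if q != p and jaccard(p, q) >= theta}

    def links(p, q):
        """link(p, q) = number of common neighbors of p and q."""
        return len(neighbors(p) & neighbors(q))

    print(links(frozenset({1, 2, 6}), frozenset({1, 2, 7})))   # 5
    print(links(frozenset({1, 2, 3}), frozenset({1, 2, 6})))   # 3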

24
Clustering Algorithms for Numeric Attributes
  • Scalable Clustering Algorithms
  • (From Database Community)
  • CLARANS
  • DBSCAN
  • BIRCH
  • CLIQUE
  • CURE
  • Above algorithms can be used to cluster documents
    after reducing their dimensionality using SVD

25
BIRCH [ZRL 96]
  • Pre-cluster data points using CF-tree
  • CF-tree is similar to R-tree
  • For each point
  • CF-tree is traversed to find the closest cluster
  • If the cluster is within epsilon distance, the
    point is absorbed into the cluster
  • Otherwise, the point starts a new cluster
  • Requires only single scan of data
  • Cluster summaries stored in CF-tree are given to
    main memory hierarchical clustering algorithm

26
CURE [GRS 98]
  • Hierarchical algorithm for discovering arbitrarily
    shaped clusters
  • Uses a small number of representatives per
    cluster
  • Note:
  • The centroid-based approach uses 1 point to represent
    a cluster => too little information; handles only
    hyper-spherical clusters
  • The MST-based approach uses every point to represent
    a cluster => too much information; easily misled
  • Uses random sampling
  • Uses Partitioning
  • Labeling using representatives

27
Cluster Representatives
  • A representative set of points:
  • Small in number (c points)
  • Distributed over the cluster
  • Each point in the cluster is close to some
    representative
  • Distance between clusters:
  • The smallest distance between their representatives

28
Computing Cluster Representatives
  • Finding Scattered Representatives
  • We want to
  • Distribute around the center of the cluster
  • Spread well out over the cluster
  • Capture the physical shape and geometry of the
    cluster
  • Use farthest point heuristic to scatter the
    points over the cluster
  • Shrink uniformly around the mean of the cluster

29
Computing Cluster Representatives (Continued)
  • Shrinking the Representatives
  • Why do we need to alter the Representative Set?
  • Too close to the boundary of cluster
  • Shrink uniformly around the mean (center) of the
    cluster

30
Document Classification
31
Classification
  • Given
  • Database of tuples (documents), each assigned a
    class label
  • Develop a model/profile for each class
  • Example profile (good credit):
  • (25 < age < 40 and income > 40k) or (married = YES)
  • Example profile (automobile):
  • Document contains a word from {car, truck, van, SUV,
    vehicle, scooter}
  • Other applications
  • Credit card approval (good, bad)
  • Bank locations (good, fair, poor)
  • Treatment effectiveness (good, fair, poor)

32
Naive Bayesian Classifier
  • The class c for a new document d is the one for which
    Pr[c | d] is maximum
  • Assume independent term occurrences in the document,
    and let Pr[t | c] be the fraction of documents in
    class c that contain term t
  • Then, by Bayes rule, Pr[c | d] is proportional to
    Pr[c] multiplied by the product, over all terms t
    in d, of Pr[t | c]
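A minimal sketch of this classifier; the tiny training set, add-one smoothing, and the use of log-probabilities are illustrative implementation choices, not details given on the slide.

    # Naive Bayes text classifier sketch.
    import math
    from collections import defaultdict

    train = [
        ("auto", {"jaguar", "engine", "car"}),
        ("auto", {"car", "speed", "engine"}),
        ("animal", {"jaguar", "cat", "speed"}),
        ("animal", {"cat", "jaguar"}),
    ]

    # Pr[t | c]: fraction of documents in class c that contain term t
    doc_count = defaultdict(int)
    term_count = defaultdict(lambda: defaultdict(int))
    for c, terms in train:
        doc_count[c] += 1
        for t in terms:
            term_count[c][t] += 1

    def classify(doc_terms):
        total = sum(doc_count.values())
        best_class, best_score = None, float("-inf")
        for c in doc_count:
            # log Pr[c] + sum over terms t in d of log Pr[t | c] (smoothed)
            score = math.log(doc_count[c] / total)
            for t in doc_terms:
                p = (term_count[c][t] + 1) / (doc_count[c] + 2)
                score += math.log(p)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    print(classify({"jaguar", "cat"}))   # -> "animal"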

33
Hierarchical Classifier (TAPER) [CDA 97]
  • The class of a new document d is the leaf node c of
    the topic hierarchy for which Pr[c | d] is maximum
  • Pr[c | d] can be computed using Bayes rule
  • The problem of computing c reduces to finding the
    leaf node c with the least-cost path from the root
    to c
34
k-Nearest Neighbor Classifier
  • Assign to a point the label of the majority of its
    k nearest neighbors
  • For k = 1, the error rate is never worse than twice
    the Bayes rate (with an unlimited number of samples)
  • Scalability issues
  • Use index to find k-nearest neighbors
  • R-tree family works well up to 20 dimensions
  • Pyramid tree for high-dimensional data
  • Use SVD to reduce dimensionality of data set
  • Use clusters to reduce the dataset size
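A brute-force sketch of the classifier; the points and labels are illustrative, and at scale a spatial index or dimensionality reduction, as listed above, would replace the linear scan.

    # k-nearest-neighbor classification sketch.
    import numpy as np
    from collections import Counter

    def knn_classify(query, points, labels, k=3):
        # Distances from the query to every labeled point
        dists = np.linalg.norm(points - query, axis=1)
        nearest = np.argsort(dists)[:k]
        # Majority label among the k nearest neighbors
        return Counter(labels[i] for i in nearest).most_common(1)[0][0]

    points = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    labels = ["animal", "animal", "auto", "auto"]
    print(knn_classify(np.array([0.1, 0.0]), points, labels, k=3))  # -> "animal"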

35
Decision Trees
Credit Analysis (example decision tree)

    salary < 20000?
      no  -> accept
      yes -> education in {graduate}?
               yes -> accept
               no  -> reject
36
Decision Tree Algorithms
  • Building phase
  • Recursively split nodes using best splitting
    attribute for node
  • Pruning phase
  • Smaller imperfect decision tree generally
    achieves better accuracy
  • Prune leaf nodes recursively to prevent
    over-fitting

37
Decision Tree Algorithms
  • Classifiers from machine learning community
  • ID3
  • C4.5
  • CART
  • Classifiers for large databases
  • SLIQ, SPRINT
  • PUBLIC
  • SONAR
  • Rainforest, BOAT

38
Decision Trees
  • Pros
  • Fast execution time
  • Generated rules are easy to interpret by humans
  • Scale well for large data sets
  • Can handle high dimensional data
  • Cons
  • Cannot capture correlations among attributes
  • Consider only axis-parallel cuts

39
Feature Selection
  • Choose a collection of keywords that help
    discriminate between two or more sets of
    documents
  • Fewer keywords help to speed up classification
  • Improves classification accuracy by eliminating
    noise from documents
  • Fisher's discriminant (ratio of between-class to
    within-class scatter) for a term t:
    Fisher(t) = (mu1(t) - mu2(t))^2 / (var1(t) + var2(t)),
    where muC(t) and varC(t) are the mean and variance,
    over documents d in class C, of the indicator
    x(d, t) = 1 if d contains t and 0 otherwise
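A sketch of per-term scoring using the two-class form given above; the two document sets are illustrative, and the small constant in the denominator just guards against division by zero.

    # Fisher's discriminant score for keyword (feature) selection.
    import numpy as np

    def fisher_score(term, docs_class1, docs_class2):
        x1 = np.array([1.0 if term in d else 0.0 for d in docs_class1])
        x2 = np.array([1.0 if term in d else 0.0 for d in docs_class2])
        between = (x1.mean() - x2.mean()) ** 2     # between-class scatter
        within = x1.var() + x2.var() + 1e-12       # within-class scatter
        return between / within

    autos = [{"car", "engine", "jaguar"}, {"car", "speed"}]
    animals = [{"cat", "jaguar"}, {"cat", "speed"}]
    for t in ["car", "cat", "jaguar", "speed"]:
        print(t, fisher_score(t, autos, animals))
    # "car" and "cat" score high (discriminative); "jaguar" and "speed" do not.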

40
Exploiting Hyperlink Structure
41
HITS (Hyperlink-Induced Topic Search) [Kle 98]
  • HITS uses hyperlink structure to identify
    authoritative Web sources for broad-topic
    information discovery
  • Premise: Sufficiently broad topics contain
    communities consisting of two types of
    hyperlinked pages
  • Authorities: highly-referenced pages on a topic
  • Hubs: pages that point to authorities
  • A good authority is pointed to by many good hubs;
    a good hub points to many good authorities

[Figure: hub pages pointing to authority pages]
42
HITS - Discovering Web Communities
  • Discovering the community for a specific
    topic/query involves the following steps
  • Collect seed set of pages S (returned by search
    engine)
  • Expand seed set to contain pages that point to or
    are pointed to by pages in seed set
  • Iteratively update hub weight h(p) and authority
    weight a(p) for each page
  • After a fixed number of iterations, pages with
    highest hub/authority weights form core of
    community
  • Extensions proposed in Clever
  • Assign links different weights based on relevance
    of link anchor text
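A sketch of the iterative hub/authority update; the standard HITS rules a(p) = sum of h(q) over pages q -> p and h(p) = sum of a(q) over pages p -> q, with normalization each round, are assumed here, and the page ids and links are illustrative.

    # HITS hub/authority iteration on a small expanded page set.
    import numpy as np

    pages = ["h1", "h2", "h3", "a1", "a2"]
    links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"),
             ("h3", "a2"), ("a1", "a2")]

    idx = {p: i for i, p in enumerate(pages)}
    A = np.zeros((len(pages), len(pages)))        # A[i, j] = 1 if page i -> page j
    for src, dst in links:
        A[idx[src], idx[dst]] = 1.0

    h = np.ones(len(pages))
    a = np.ones(len(pages))
    for _ in range(20):                           # fixed number of iterations
        a = A.T @ h                               # authority: sum of in-link hub weights
        h = A @ a                                 # hub: sum of out-link authority weights
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)

    print("top authority:", pages[int(np.argmax(a))])
    print("top hub:", pages[int(np.argmax(h))])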

43
Google [BP 98]
  • Search engine that uses link structure to
    calculate a quality ranking (PageRank) for each
    page
  • PageRank
  • Can be calculated using a simple iterative
    algorithm, and corresponds to principal
    eigenvector of the normalized link matrix
  • Intuition: PageRank is the probability that a
    random surfer visits a page
  • Parameter p is the probability that the surfer gets
    bored and starts on a new random page
  • (1 - p) is the probability that the random surfer
    follows a link on the current page
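A sketch of the simple iterative algorithm on a toy link graph, using the slide's convention that p is the probability of jumping to a random page; the graph and the value p = 0.15 are illustrative.

    # PageRank power iteration sketch.
    import numpy as np

    n = 4
    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> pages it links to

    # Normalized link matrix M: M[j, i] = 1/outdeg(i) if i -> j
    M = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            M[j, i] = 1.0 / len(outs)

    p = 0.15                          # probability of getting bored
    rank = np.full(n, 1.0 / n)
    for _ in range(100):              # iterate toward the principal eigenvector
        rank = p / n + (1 - p) * (M @ rank)

    print(rank, rank.sum())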

44
Google - Features
  • In addition to PageRank, in order to improve
    search Google also weighs keyword matches
  • Anchor text
  • Provide more accurate descriptions of Web pages
  • Anchors exist for un-indexable documents (e.g.,
    images)
  • Font sizes of words in text
  • Words in larger or bolder font are assigned
    higher weights
  • Google vs. HITS
  • Google: PageRanks are computed for Web pages up
    front, independent of the search query
  • HITS: hub and authority weights are computed for
    different root sets in the context of a
    particular search query

45
Trawling the Web for Emerging Communities [KRR 98]
  • Co-citation: pages that are related are
    frequently referenced together
  • Web communities are characterized by dense
    directed bipartite subgraphs
  • Computing (i,j) Bipartite cores
  • Sort edge list by source id and detect all source
    pages s with out-degree j (let D be the set of
    destination pages that s points to)
  • Compute intersection S of sets of source pages
    pointing to destination pages in D (using an
    index on dest id to generate each source set)
  • Output Bipartite (S,D)

[Figure: an (i, j) bipartite core, with i source pages each pointing
to all of j destination pages]
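An in-memory sketch of the core-detection steps above; the edge list is illustrative, and the slide's sorted-edge-list and destination-id index are replaced here by dictionaries.

    # (i, j) bipartite core detection sketch.
    from collections import defaultdict

    edges = [("s1", "d1"), ("s1", "d2"), ("s2", "d1"), ("s2", "d2"),
             ("s3", "d1"), ("s3", "d2"), ("s4", "d3")]

    out_links = defaultdict(set)   # source -> destinations it points to
    in_links = defaultdict(set)    # destination -> sources pointing to it
    for s, d in edges:
        out_links[s].add(d)
        in_links[d].add(s)

    def bipartite_cores(i, j):
        cores = []
        # Detect each source page s with out-degree j; D = its destinations
        for s, dests in out_links.items():
            if len(dests) != j:
                continue
            # S = intersection of the source sets pointing to each page in D
            sources = set.intersection(*(in_links[d] for d in dests))
            core = (frozenset(sources), frozenset(dests))
            if len(sources) >= i and core not in cores:
                cores.append(core)
        return cores

    print(bipartite_cores(3, 2))   # {s1, s2, s3} x {d1, d2} forms a (3, 2) core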
46
Using Hyperlinks to Improve Classification [CDI 98]
  • Use text from neighbors when classifying Web page
  • Ineffective because referenced pages may belong
    to different class
  • Use class information from pre-classified
    neighbors
  • Choose the class ci for which Pr(ci | Ni) is maximum
    (Ni is the set of class labels of all the neighboring
    documents)
  • By Bayes rule, we choose ci to maximize
    Pr(Ni | ci) Pr(ci)
  • Assuming independence of the neighbor classes,
    Pr(Ni | ci) is the product of Pr(cj | ci) over the
    classes cj of the neighbors
47
Collaborative Search
48
SearchLight
  • Key Idea: Improve search by sharing information
    on URLs visited by members of a community during
    search
  • Based on the concept of search sessions
  • A search session is the search engine query
    (collection of keywords) and the URLs visited in
    response to the query
  • Possible to extract search sessions from the
    proxy logs
  • SearchLight maintains a database of (query,
    target URL) pairs
  • Target URL is heuristically chosen to be last URL
    in search session for the query
  • In response to a search query, SearchLight
    displays URLs from its database for the specified
    query

49
Image Retrieval
50
Similar Images
  • Given
  • A set of images
  • Find
  • All images similar to a given image
  • All pairs of similar images
  • Sample applications
  • Medical diagnosis
  • Weather prediction
  • Web search engine for images
  • E-commerce

51
Similar Image Retrieval Systems
  • QBIC, Virage, Photobook
  • Compute feature signature for each image
  • QBIC uses color histograms
  • WBIIS, WALRUS use wavelets
  • Use a spatial index to retrieve the database image
    whose signature is closest to the query's
    signature
  • QBIC drawbacks
  • Computes single signature for entire image
  • Thus, fails when images contain similar objects,
    but at different locations or in varying sizes
  • Color histograms cannot capture shape, texture
    and location information (wavelets can!)

52
WALRUS Similarity Model [NRS 99]
  • WALRUS decomposes an image into regions
  • A single signature is stored for each region
  • Two images are considered to be similar if they
    have enough similar region pairs

53
WALRUS (Step 1)
  • Generation of Signatures for Sliding Windows
  • Each image is broken into sliding windows
  • For the signature of each sliding window, use
  • coefficients from lowest frequency band
    of the Haar wavelet
  • Complexity: naive algorithm vs. dynamic programming
    algorithm (expressed in terms of N, the number of
    pixels in the image, and the maximum window size)

54
WALRUS (Step 2)
  • Clustering Sliding Windows
  • Cluster the windows in the image using
    pre-clustering phase of BIRCH
  • Each cluster defines a region in the image.
  • For each cluster, the centroid is used as a
    signature. (c.f. bounding box)

55
WALRUS - Retrieval Results
[Figure: query image and the similar images retrieved by WALRUS]
56
Automated Schema Extraction for XML Data
The XTRACT System
57
XML Primer I
  • Standard for data representation and data
    exchange
  • Unified, self-describing format for
    publishing/exchanging management data across
    heterogeneous network/NM platforms
  • Looks like HTML but it isn't
  • Collection of elements
  • Atomic (raw character data)
  • Composite (sequence of nested sub-elements)
  • Example
  • <book>
  •   <title>A Relational Model for Large Shared Data
      Banks</title>
  •   <author> <name>E.F. Codd</name>
  •     <affiliation>IBM Research</affiliation>
  •   </author>
  • </book>

58
XML Primer II
  • XML documents can be accompanied by Document Type
    Descriptors (DTDs)
  • DTDs serve the role of the schema of the document
  • Specify a regular expression for every element
  • Example
  • <!ELEMENT book (title, author)>
  • <!ELEMENT title (#PCDATA)>
  • <!ELEMENT author (name, affiliation)>
  • <!ELEMENT name (#PCDATA)>
  • <!ELEMENT affiliation (#PCDATA)>

59
The XTRACT System [GGR 00]
  • DTDs are of great practical importance
  • Efficient storage of XML data collections
  • Formulation and optimization of XML queries
  • However, DTDs are not mandatory => XML data may
    not be accompanied by a DTD
  • Automatically-generated XML documents (e.g., from
    relational databases or flat files)
  • DTD standards for many communities are still
    evolving
  • Goal of the XTRACT system
  • Automated inference of DTDs from XML-document
    collections

60
Problem Formulation
  • Element types => alphabet
  • Infer a DTD for each element type separately
  • Example sequences: instances of nested
    sub-elements
  • => Only one level down in the hierarchy
  • Problem statement
  • Given a set of example sequences for element e
  • Infer a good regular expression for e
  • Hard problem!!
  • DTDs can comprise general, complex regular
    expressions
  • Need to quantify the notion of goodness for regular
    expressions

61
Example XML Documents
  • <book> <title></title> <author> <name></name>
    <affiliation></affiliation> </author> </book>
  • <book> <title></title> <author> <name></name>
    <address></address> </author> <author> <name></name>
    <address></address> </author> <editor></editor>
    </book>
  • <paper> <title></title> <author> <name></name>
    <affiliation></affiliation> </author>
    <conference></conference> <year></year> </paper>
62
Example (Continued)
  • Simplified example sequences
  • <book> -> <title><author>,
    <title><author><author><editor>
  • <paper> -> <title><author><conference><year>
  • <author> -> <name><affiliation>,
    <name><affiliation>,
    <name><address>,
    <name><address>
  • Desirable solution
  • <!ELEMENT book (title, author*, editor?)>
  • <!ELEMENT paper (title, author, conference,
    year)>
  • <!ELEMENT author (name, affiliation?, address?)>

63
DTD Inference Requirements
  • Requirements for a good DTD
  • Generalizes to intuitively correct but previously
    unseen examples
  • It should be concise (i.e., small in size)
  • It should be precise (i.e., not cover too many
    sequences not contained in the set of examples)
  • Example: Consider the case
  • p -> ta, taa, taaa, ta, taaaa

[Figure: candidate DTDs for p, compared for conciseness and
preciseness]
64
The XTRACT Approach MDL Principle
  • Minimum Description Length (MDL) quantifies and
    resolves the tradeoff between DTD conciseness and
    preciseness
  • MDL principle The best theory to infer from a
    set of data is the one which minimizes the sum of
  • (A) the length of the theory, in bits, plus
  • (B) the length of the data, in bits, when
    encoded with the help of the theory.
  • Part (A) captures conciseness, and
  • Part (B) captures preciseness

65
Overview of the XTRACT System
  • XTRACT consists of 3 subsystems: generalization,
    factoring, and MDL-based selection

[Example pipeline on input sequences
I = {ab, abab, ac, ad, bc, bd, bbd, bbbe}: the generalization step
adds candidate regular expressions to I to form S_G, the factoring
step adds factored expressions to form S_F, and the MDL step picks
the inferred DTD from among these candidates.]
66
MDL Subsystem
  • MDL principle Minimize the sum of
  • Theory description length, plus
  • Data description length given the theory
  • In order to use MDL, need to
  • Define theory description length (candidate
    DTD)
  • Define data description length (input sequences)
    given the theory (candidate DTD)
  • Solve the resulting minimization problem

67
MDL Subsystem - Encoding Scheme
  • Description Length of a DTD
  • Number of bits required to encode the DTD
  • Size of DTD: (number of symbols in the DTD) x
    log(size of the alphabet), where the alphabet
    consists of the element names plus metacharacters
    such as ( ) , | * ?
  • Description length of a sequence given a
    candidate DTD
  • Number of bits required to specify the sequence
    given DTD
  • Use a sequence of encoding indices
  • The encoding of a given a is the empty string ε
  • The encoding of a given (a|b|c) is the index 0
  • The encoding of aaa given a* is the repetition
    count 3
  • Example: the encoding of ababcabc given ((ab)*c)*
    is the sequence 2, 2, 1

68
MDL Encoding Example
  • Consider again the case
  • p -> ta, taa, taaa, taaaa

    Candidate DTD        Theory descr.  Data descr. (given the theory)   Total
    ta|taa|taaa|taaaa    17             7   (indices 0, 1, 2, 3)         24
    (t|a)*               6              21  (2 0 1, 3 0 1 1,             27
                                             4 0 1 1 1, 5 0 1 1 1 1)
    ta*                  3              7   (counts 1, 2, 3, 4)          10
69
MDL Subsystem - Minimization
Input Sequences and Candidate DTDs

[Figure: a bipartite graph connecting the input sequences ta, taa,
taaa, taaaa to the candidate DTDs c1, c2, c3; an edge weight wij
(e.g., w11, w12) is the description length of encoding sequence i
with candidate DTD cj]
  • Maps to the Facility Location Problem (NP-hard)
  • XTRACT employs fast heuristic algorithms
    proposed by the Operations Research community

70
References
  • [BP 98] S. Brin and L. Page. The anatomy of a
    large-scale hypertextual Web search engine. WWW7,
    1998.
  • [CDA 97] S. Chakrabarti, B. Dom, R. Agrawal, and
    P. Raghavan. Scalable feature selection,
    classification and signature generation for
    organizing large text databases into hierarchical
    topic taxonomies. VLDB Journal, 1998.
  • [CDI 98] S. Chakrabarti, B. Dom, and P. Indyk.
    Enhanced hypertext categorization using
    hyperlinks. ACM SIGMOD, 1998.
  • [CGR 00] K. Chakrabarti, M. Garofalakis, R.
    Rastogi, and K. Shim. Approximate Query
    Processing Using Wavelets. VLDB, 2000.
  • [DDF 90] S. Deerwester, S. T. Dumais, G. W.
    Furnas, T. K. Landauer, and R. Harshman. Indexing
    by latent semantic analysis. Journal of the
    American Society for Information Science, 41(6),
    1990.
  • [GGR 00] M. Garofalakis, A. Gionis, R. Rastogi,
    S. Seshadri, and K. Shim. XTRACT: A System for
    Extracting Document Type Descriptors from XML
    Documents. ACM SIGMOD, 2000.

71
References (Continued)
  • [GKR 98] D. Gibson, J. Kleinberg, and P.
    Raghavan. Clustering categorical data: An
    approach based on dynamical systems. VLDB, 1998.
  • [GRS 98] S. Guha, R. Rastogi, and K. Shim. CURE:
    An efficient clustering algorithm for large
    databases. ACM SIGMOD, 1998.
  • [GRS 99] S. Guha, R. Rastogi, and K. Shim. ROCK:
    A robust clustering algorithm for categorical
    attributes. ICDE, 1999.
  • [HKK 97] E. Han, G. Karypis, V. Kumar, and B.
    Mobasher. Clustering based on association rule
    hypergraphs. DMKD Workshop, 1997.
  • [Kle 98] J. Kleinberg. Authoritative sources in a
    hyperlinked environment. SODA, 1998.
  • [KRR 98] R. Kumar, P. Raghavan, S. Rajagopalan,
    and A. Tomkins. Trawling the Web for emerging
    cyber-communities. WWW8, 1999.
  • [ZRL 96] T. Zhang, R. Ramakrishnan, and M. Livny.
    BIRCH: An efficient data clustering method for
    very large databases. ACM SIGMOD, 1996.