Data Mining: Concepts and Techniques

Title: Data Mining: Concepts and Techniques
Description: Chapter 10, 10.3.1 Mining Text and Web Data (I). Jiawei Han and Micheline Kamber, Department of Computer Science. Slides: 91. Provided by: DuoZ. Learn more at: http://www.cs.uiuc.edu

Transcript and Presenter's Notes
1
Data Mining: Concepts and Techniques
Chapter 10, 10.3.1 Mining Text and Web Data (I)
  • Jiawei Han and Micheline Kamber
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign
  • www.cs.uiuc.edu/hanj
  • Acknowledgements: based on the slides by students at CS512 (Spring 2009)

11/13/2013
2
Outline
  • Introduction to Information Retrieval (Rui Li)
  • Text categorization (Parikshit Sondhi)
  • Web link analysis (Kavita Ganesan)
  • Mining and Searching Structured Data on the Web
    (Bo Zhao)

3
Information Retrieval
  • Rui Li
  • ruili1@uiuc.edu

4
What is Information Retrieval?
  • Information Retrieval
  • There exists a collection of text documents
  • User gives a query to express the information
    need
  • A retrieval system returns relevant documents to
    users
  • Typical IR systems
  • Online library catalogs
  • Online document management systems
  • Web Search Engine (Google)
  • Information Retrieval vs. Database System
  • Unstructured/free text vs. structured data
  • Ambiguous vs. well-defined semantics
  • Incomplete vs. complete specification
  • Relevant documents vs. matched records
  • No transaction management vs. transaction management

5
Typical IR System Architecture
[Architecture diagram: the user's query and the docs pass through a Tokenizer; the Indexer builds the Doc Rep (Index); the Scorer matches the Query Rep against the Index and returns results to the user]
6
Document Representation
  • A document can be described by a set of
    representative keywords called index terms.
  • Different index terms have varying relevance when
    used to describe document contents.
  • Steps
  • Tokenize the document into words
  • Remove stop words (from a stop word list), e.g., "is", "a", "or"
  • Word stemming: several words are small syntactic variants of each other since they share a common word stem, e.g., "drug", "drugs", "drugged"
  • Calculate the term weight based on the word frequency
  • Query representation is a similar process
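The steps above can be sketched as follows (a toy illustration: the stop word list and suffix-stripping stemmer are tiny stand-ins for real resources such as the Porter stemmer):

```python
import re
from collections import Counter

STOP_WORDS = {"is", "a", "or", "the", "and"}  # toy list; real systems use larger ones

def crude_stem(word):
    # Toy suffix stripping; a real system would use the Porter stemmer.
    for suffix in ("ged", "ings", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [crude_stem(t) for t in tokens]               # stem

def term_weights(text):
    # Raw term frequency as the term weight (TF-IDF comes later in the deck).
    return Counter(preprocess(text))
```

For example, `preprocess("A drug or drugs drugged")` maps all three variants to the common stem "drug".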

7
Indexing
  • Inverted index
  • Maintains two hash- or B-tree indexed tables
  • document_table: a set of document records <doc_id, postings_list>
  • term_table: a set of term records <term, postings_list>
  • Answer a query: find all docs associated with one or a set of terms
  • Pro: easy to implement
  • Pro: effective for fetching documents with a specific term
  • Con: does not handle synonymy and polysemy well, and postings lists can be very long (storage can be very large)
  • Other index techniques e.g., signature file
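A minimal inverted index along these lines (a sketch of the term_table side: each term maps to a postings list of doc ids, and a conjunctive query intersects postings lists):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: text}; returns {term: sorted postings list of doc ids}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def query_and(index, terms):
    # Docs containing ALL query terms: intersection of the postings lists.
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

This illustrates why fetching by term is cheap (one lookup per term) while synonymy ("car" vs. "automobile") is invisible to the index.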

8
Ranking Model
  • The basic question: given a query, how do we know if document A is more relevant than B?
  • Relevance ≈ Similarity
  • Query and document are represented similarly
  • A query can be regarded as a document
  • Relevance(d, q) ≈ similarity(d, q)
  • Key issues
  • How to represent a query/document?
  • How to define the similarity measure?
  • Typical Models
  • Boolean Model
  • Vector Space Model
  • Language Model

9
The Notion of Relevance
10
Vector Space Model
  • Represent a doc/query by a term vector
  • Term: a basic concept, e.g., a word or phrase
  • Each term defines one dimension, and N terms define a high-dimensional space
  • Each element of the vector corresponds to a term weight, i.e., the importance of the term

11
How to Assign Weights
  • Two-fold heuristics based on frequency
  • TF (term frequency)
  • More frequent within a document → more relevant to semantics
  • e.g., "query" vs. "commercial"
  • IDF (inverse document frequency)
  • Less frequent among documents → more discriminative
  • e.g., "algebra" vs. "science"
  • TF-IDF weighting: weight(t, d) = TF(t, d) × IDF(t)
  • Frequent within doc → high TF → high weight
  • Selective among docs → high IDF → high weight
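The TF-IDF weighting above can be sketched as follows (a minimal version using raw term frequency and a log inverse document frequency; real systems often use dampened TF and smoothed IDF):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; returns one sparse {term: weight} vector per doc.
    N = len(docs)
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # weight(t, d) = TF(t, d) * log(N / DF(t))
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors
```

Note that a term appearing in every document gets IDF = log(1) = 0, so it carries no weight, exactly the "selective among docs" heuristic.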

12
How to Measure Similarity?
  • Given two documents (as term vectors)
  • Similarity definition
  • Dot product
  • Normalized dot product (cosine)
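Both similarity definitions can be sketched on sparse term-weight vectors (a minimal illustration):

```python
import math

def dot(u, v):
    # u, v: sparse vectors as {term: weight} dicts.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cosine(u, v):
    # Normalized dot product: dot(u, v) / (|u| * |v|).
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0
```

The normalization makes the score length-invariant: a document repeated twice has the same cosine similarity to a query as the original.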

13
Advantages and Disadvantages of VS Model
  • Advantages
  • Empirically effective! (Top TREC performance)
  • Intuitive
  • Easy to implement
  • Well-studied/most evaluated
  • Disadvantages
  • Assume term independence
  • Assume query and document be the same
  • Lack of predictive adequacy
  • Arbitrary term weighting
  • Arbitrary similarity measure

14
Language Models for Retrieval (Ponte & Croft 98)
[Figure: each document (a text mining paper, a food nutrition paper) induces a language model; the query "data mining algorithms" is scored against each document's model]
15
Text Generation with Unigram LM
A (unigram) language model θ specifies word probabilities p(w | θ); sampling from θ generates a document.
Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food 0.00001
Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ...
16
Estimation of Unigram LM
Estimate the (unigram) language model θ, i.e., p(w | θ), from a document's word counts.
A text mining paper (total words = 100): text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1, ...
The maximum-likelihood estimate is p(w | θ) = count(w) / 100.
17
Ranking Docs by Query Likelihood
[Figure: a query q is scored against documents d1, d2, ..., dN by query likelihood p(q | d)]
18
Retrieval as Language Model Estimation
  • Document ranking based on query likelihood
  • Retrieval problem → estimation of p(wi | d)
  • Smoothing is an important issue, and distinguishes different approaches
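As a concrete sketch of query-likelihood scoring with smoothing (Jelinek-Mercer interpolation with the collection model is just one smoothing choice; the λ value here is illustrative):

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    # log p(q | d) under a unigram model with Jelinek-Mercer smoothing:
    #   p(w | d) = (1 - lam) * count(w, d)/|d|  +  lam * count(w, C)/|C|
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    dlen, clen = len(doc), len(collection)
    score = 0.0
    for w in query:
        p = (1 - lam) * doc_tf[w] / dlen + lam * coll_tf[w] / clen
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

Without smoothing, a single query word absent from a document would drive its likelihood to zero; the collection model keeps every seen word's probability positive.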

19
Basic Measures for Text Retrieval
  • Precision the percentage of retrieved documents
    that are in fact relevant to the query (i.e.,
    correct responses)
  • Recall the percentage of documents that are
    relevant to the query and were, in fact, retrieved
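These two measures can be computed directly from the retrieved and relevant document sets (a minimal sketch; the doc ids are arbitrary):

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved docs that are relevant.
    # Recall: fraction of relevant docs that were retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall
```

E.g., retrieving {1, 2, 3, 4} when {2, 4, 5} are relevant gives precision 2/4 and recall 2/3.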

20
Acknowledgements
  • Some slides come from Professor Jiawei Han's CS512 course slides and from Professor ChengXiang Zhai's CS410 course slides (language model part)

21
Text Categorization
  • By
  • Parikshit Sondhi
  • Computer Science
  • University of Illinois at Urbana Champaign

Some slides have been adapted from Prof. Han's
presentation
22
Document Classification Motivation
  • News article classification
  • Automatic email filtering
  • Webpage classification
  • Word sense disambiguation

23
Text Categorization
  • Pre-given categories and labeled document
    examples (Categories may form hierarchy)
  • Classify new documents
  • A standard classification (supervised learning) problem

Data Mining Principles and Algorithms
11/13/2013
23
24
Document Classification Problem Definition
  • Need to assign a boolean value {0, 1} to each entry of the decision matrix
  • C = {c1, ..., cm} is a set of pre-defined categories
  • D = {d1, ..., dn} is a set of documents to be categorized
  • aij = 1: dj belongs to ci
  • aij = 0: dj does not belong to ci

A Tutorial on Automated Text Categorisation,
Fabrizio Sebastiani, Pisa (Italy)
25
Flavors of Classification
  • Single Label
  • For a given di at most one (di, ci) is true
  • Train a system which takes a di and C as input
    and outputs a ci
  • Multi-label
  • For a given di zero, one or more (di, ci) can be
    true
  • Train a system which takes a di and C as input and outputs C', a subset of C
  • Binary
  • Build a separate system for each ci, such that it
    takes in as input a di and outputs a boolean
    value for (di, ci)
  • The most general approach
  • Based on assumption that decision on (di, ci) is
    independent of (di, cj)

26
Classification Methods
  • Manual: typically rule-based (knowledge engineering approach)
  • Does not scale up (labor-intensive, rule inconsistency)
  • May be appropriate for special data in a particular domain
  • Automatic: typically exploiting machine learning techniques
  • Vector space model based
  • Prototype-based (Rocchio)
  • K-nearest neighbor (KNN)
  • Decision-tree (learn rules)
  • Neural Networks (learn non-linear classifier)
  • Support Vector Machines (SVM)
  • Probabilistic or generative model based
  • Naïve Bayes classifier

27
Steps in Document Classification
  • Classification Process
  • Data preprocessing
  • E.g., Term Extraction, Dimensionality Reduction,
    Feature Selection, etc.
  • Definition of training set and test sets
  • Creation of the classification model using the
    selected classification algorithm
  • Classification model validation
  • Classification of new/unknown text documents

28
An Example: TFIDF Classifiers
29
Vector Space Model
  • Represent a doc by a term vector
  • Term: a basic concept, e.g., a word or phrase
  • Each term defines one dimension
  • N terms define an N-dimensional space
  • Each element of the vector corresponds to a term weight
  • E.g., d = (x1, ..., xN), where xi is the importance of term i
  • A new document is assigned to the most likely category based on vector similarity (e.g., based on the cosine formula).

30
VS Model Illustration
31
TFIDF Classifier
  • The basic idea of the algorithm is to represent each document d as a vector d = (d(1), ..., d(F)) in a vector space, so that documents with similar content have similar vectors.
  • Each dimension of the vector space represents a word selected by the feature selection process.
  • d(i) for a document d is calculated as a combination of the statistics TF(w, d) and DF(w); d(i) is called the weight of word wi in document d.

A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA
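A prototype-based classifier in this spirit can be sketched as follows (an illustrative Rocchio-style version, not Joachims' exact formulation; TF-IDF weighting of the input vectors is assumed to happen upstream):

```python
import math
from collections import defaultdict

def cosine(u, v):
    # Cosine similarity on sparse {term: weight} vectors.
    d = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return d / (nu * nv) if nu and nv else 0.0

def train_prototypes(vectors, labels):
    # Class prototype = centroid (mean) of the class's training doc vectors.
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        counts[label] += 1
        for t, w in vec.items():
            sums[label][t] += w
    return {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}

def classify(prototypes, vec):
    # Assign the class whose prototype is most cosine-similar to the document.
    return max(prototypes, key=lambda c: cosine(prototypes[c], vec))
```

Training is a single pass over the labeled documents; classification is one similarity computation per class.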
32
Representation
  • Each distinct word is a feature, with the number of times the word occurs in the document as its value. This value is usually a function of TF(w, d) and IDF(w).
  • To avoid unnecessarily large feature vectors, words are considered as features only if they occur in the training data at least m times (e.g., m = 3).

33
Preprocessing Feature Selection
  • All available features vs. "good" subset
  • The problem of finding a "good" subset of
    features is called feature selection
  • Feature selection methods
  • 1. Pruning of infrequent words
  • Words are only considered as features if they occur at least a few times in the training data.
  • 2. Pruning of high-frequency words
  • This technique is supposed to eliminate non-content words like "the", "and", "for".
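Both pruning heuristics can be sketched together (the thresholds `min_count` and `max_doc_frac` are illustrative assumptions, not values from the slide):

```python
from collections import Counter

def select_features(token_docs, min_count=3, max_doc_frac=0.5):
    # Keep words that occur at least min_count times overall
    # (prune infrequent words) AND appear in at most max_doc_frac
    # of the documents (prune high-frequency, non-content words).
    total = Counter(t for doc in token_docs for t in doc)
    df = Counter(t for doc in token_docs for t in set(doc))
    n = len(token_docs)
    return {t for t, c in total.items()
            if c >= min_count and df[t] / n <= max_doc_frac}
```

A word like "the" appears in nearly every document, so it fails the document-fraction test even though it is very frequent.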

34
Classification: TFIDF Classifier
35
Evaluations
  • Effectiveness measure
  • Classic: precision and recall

36
Evaluation (cont)
  • Benchmarks
  • Classic Reuters collection
  • A set of newswire stories classified under
    categories related to economics
  • Effectiveness
  • Difficulties of strict comparison
  • different parameter setting
  • different split (or selection) between training
    and testing
  • various optimizations
  • However, the results are widely recognized
  • Best: boosting-based committee classifiers and SVM
  • Worst: Naïve Bayes classifier
  • Need to consider other factors, especially
    efficiency

37
Document Classification Approach Comparisons
38
Document Clustering
  • Motivation
  • Automatically group related documents based on
    their contents
  • No predetermined training sets or taxonomies
  • Generate a taxonomy at runtime
  • Most popular clustering methods are
  • K-Means clustering
  • Agglomerative hierarchical clustering
  • EM (Gaussian Mixture)

39
The Steps and Algorithms
  • Clustering Process
  • Data preprocessing: remove stop words, stemming, feature extraction, lexical analysis, etc.
  • Hierarchical clustering: compute similarities by applying clustering algorithms
  • Model-based clustering (neural network approach): clusters are represented by exemplars (e.g., SOM)

40
K-Means clustering
  • Given
  • set of documents (e.g., TFIDF vectors),
  • distance measure (e.g., cosine)
  • K (number of groups)
  • For each of K groups, initialize its centroid
    with a random document
  • While not converging
  • Each document is assigned to the nearest group
    (represented by its centroid)
  • For each group, calculate new centroid (group
    mass point, average document in the group)
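The loop above can be sketched as follows (a minimal version using Euclidean distance on dense vectors; the slide's cosine distance on TF-IDF vectors would be a drop-in change, and the fixed iteration count stands in for a convergence test):

```python
import random

def kmeans(docs, k, iters=20, seed=0):
    # docs: list of numeric vectors (e.g., dense TF-IDF vectors).
    rng = random.Random(seed)
    # Initialize each centroid with a random document.
    centroids = [list(d) for d in rng.sample(docs, k)]
    assign = [0] * len(docs)
    for _ in range(iters):
        # Assignment step: each document goes to the nearest centroid.
        assign = [min(range(k),
                      key=lambda j: sum((x - c) ** 2
                                        for x, c in zip(d, centroids[j])))
                  for d in docs]
        # Update step: each centroid becomes the mean of its assigned docs.
        for j in range(k):
            members = [d for d, a in zip(docs, assign) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign, centroids
```

On well-separated data the assignments stabilize after a few iterations.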

41
Slide adapted from Dr. Andrew Moore's presentation
42
Summary Text Categorization
  • Wide application domain
  • Comparable effectiveness to professionals
  • Manual TC is not 100% accurate and is unlikely to improve substantially
  • Automatic TC is growing at a steady pace
  • Prospects and extensions
  • Very noisy text, such as text from OCR
  • Speech transcripts

43
References
  • Fabrizio Sebastiani, Machine Learning in
    Automated Text Categorization, ACM Computing
    Surveys, Vol. 34, No.1, March 2002
  • Yiming Yang, An evaluation of statistical
    approaches to text categorization, Journal of
    Information Retrieval, 167-88, 1999.
  • Yiming Yang and Xin Liu, A re-examination of text categorization methods, Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, 1999.

44
Thank You
45
Web Link Analysis
  • By Kavita Ganesan

46
RECAP
  • What is ranking in information retrieval?

[Figure: a search on Google returns Doc 1, Doc 2, Doc 3, Doc 4]
47
RECAP
  • What is ranking in information retrieval?

[Figure: the same search with results ordered: Doc 1 ranked 1st, Doc 2 ranked 2nd, Doc 3 ranked 3rd, Doc 4 ranked 4th]
48
Why is ranking important?
  • Users tend to look at top few results
  • make sure that good matches are at the very top
  • Fast access to information!
  • savvy users want results immediately
  • What happens if pages are poorly ranked?
  • important matches missed
  • poor user retention

49
Ranking in Text Information Retrieval
  • Before web existed
  • Each document treated as a bag of words
  • Minimal structure
  • Ranking heuristics
  • Solely based on words in the documents
  • E.g., term frequency, inverse document frequency
  • After the web was born
  • Documents
  • have structure
  • contain hyperlinks
  • contain components like title, author, abstract,
    sections, references
  • The question: can we leverage this information to improve ranking?

50
Exploiting inter-document links
What does a link tell us?
  • The anchor text gives a description of the target page
  • Links indicate the utility of a doc
  • Pages can act as authorities (pointed to by many pages) or hubs (pointing to many pages)
51
Link Analysis Algorithms
  • PageRank
  • HITS
Both use hyperlink analysis to rank documents.
52
PageRank
  • Based on the idea of a random surfer
  • the likelihood that a person randomly clicking on
    links will arrive at any particular page
  • Pages represented as Markov Chain states
  • Probability of moving from one page to another is
    modelled as a state transition probability

53
PageRank
  • Example: four pages A, B, C, D linking to one another
[Figure: state transition matrix over pages A, B, C, D; entries are 0, 1/2, or 1]
From the matrix: PR(A) = ½·PR(B) + 1·PR(C) + ½·PR(D)
54
PageRank
  • The PageRank value for any page u can be expressed as

PR(u) = Σ_{v ∈ Bu} PR(v) / L(v)

where L(v) is the number of outbound links of page v, PR(v) is the PageRank of page v, and Bu is the set of pages linking to page u.
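A power-iteration sketch of this computation (illustrative; it adds the standard damping factor d = 0.85, which the simplified formula on this slide omits, and spreads a dangling node's mass evenly):

```python
def pagerank(links, d=0.85, iters=50):
    # links: {page: [pages it links to]}; every page must appear as a key.
    # Iterates PR(u) = (1 - d)/N + d * sum(PR(v) / L(v) for v in B_u).
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for v, outs in links.items():
            if outs:
                share = pr[v] / len(outs)  # PR(v) / L(v)
                for u in outs:
                    new[u] += d * share
            else:
                # Dangling node: distribute its rank evenly over all pages.
                for u in pages:
                    new[u] += d * pr[v] / n
        pr = new
    return pr
```

The scores form a probability distribution (they sum to 1), matching the random-surfer interpretation from the previous slide.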
55
HITS
  • HITS: Hyperlink-Induced Topic Search
  • Developed by Jon Kleinberg at Cornell
  • The algorithm identifies two types of pages:
  • Authority: pages that provide important, trustworthy information on a given topic
  • Hub: pages that contain links to authorities
  • Authorities and hubs exhibit a mutually reinforcing relationship:
  • a better hub points to many good authorities, and
  • a better authority is pointed to by many good hubs

56
HITS algorithm
  • Start with each node (page) having a hub score and authority score of 1
  • Run the Authority Update Rule
  • Run the Hub Update Rule
  • Normalize the values
  • divide each Hub score by the sum of all Hub
    scores
  • divide each Authority score by the sum of all
    Authority scores
  • Repeat from the second step as necessary
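The update loop above can be sketched as follows (normalization divides by the sum of scores, as the slide describes; L2 normalization is also common in practice):

```python
def hits(links, iters=50):
    # links: {page: [pages it links to]}
    pages = set(links) | {u for outs in links.values() for u in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority update: sum of hub scores of pages pointing to p.
        auth = {p: sum(hub[v] for v, outs in links.items() if p in outs)
                for p in pages}
        # Hub update: sum of authority scores of pages p points to.
        hub = {p: sum(auth[u] for u in links.get(p, [])) for p in pages}
        # Normalize each score vector by its sum.
        asum, hsum = sum(auth.values()), sum(hub.values())
        auth = {p: s / asum for p, s in auth.items()}
        hub = {p: s / hsum for p, s in hub.items()}
    return hub, auth
```

In a small graph where two pages both link to a third, the target ends up with the top authority score and the two linkers get equal hub scores, illustrating the mutual reinforcement.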

57
HITS algorithm: Authority Update
  • A node's authority score = the sum of the hub scores of the nodes that point to it.
  • A page has high authority if it is linked to by pages that are recognized as hubs for information.

[Figure: B, C, and D all point to A]
authority(A) = h(B) + h(C) + h(D)
58
HITS algorithm: Hub Update
  • A node's hub score = the sum of the authority scores of the nodes that it points to.
  • A page is a good hub if it links to pages that have high authority.

[Figure: A points to E, F, and G]
hub(A) = a(E) + a(F) + a(G)
59
PageRank vs. HITS
  • Both are iterative algorithms based on the linkage of documents on the web
  • HITS is executed at query time (authority and hub scores are query-specific), which takes a performance hit; PageRank is pre-computed
  • HITS computes two scores per document (hub and authority); PageRank computes a single score
60
End
61
AUTHORITY PAGE
[Figure: a page linked to by many pages (Kevin Chang, Stanford.edu, ibm.com, Cheng Zhai, Marianne Winslet, berkeley.edu)]
If a page is popular, then it must be an important page.
62
Mining and Searching Structured Data on the Web
  • Bo Zhao (bozhao3@illinois.edu)
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

63
Structured Data are EVERYWHERE!
  • Deep Web: databases behind websites (aa.com)
  • Web 2.0 content: Flickr, Del.icio.us tags
  • Google Base: structured data portals
  • Surface Web: entities such as emails, organizations, countries

64
Solutions
  • Deep Web Data Integration
  • Vertical Search Engines
  • On-the-fly Meta-querying Systems
  • Pay-As-You-Go Integration
  • Deep Web Surfacing
  • Entity Search on the Surface Web

65
Vertical Search Engines: Warehousing Approach
  • Academic Search
  • Libra@MSRA
  • DBLife@WISC
  • ArnetMiner@Tsinghua
  • Many other domains
  • Shopping
  • Events
  • Apartments

66
Vertical Search Engines: Warehousing Approach
e.g., Libra Academic Search NieZW05 (courtesy MSRA)
  • Integrating information from multiple types of
    sources
  • Ranking papers, conferences, and authors for a
    given query
  • Handling structured queries

67
On-the-fly Meta-querying Systems
  • MetaQuerier@UIUC
  • WISE-Integrator
  • http://www.data.binghamton.edu:8080/wise-integrator/
  • Commercial Systems
  • http://www.cheaptickets.com
  • http://pipl.com

68
On-the-fly Meta-querying Systems, e.g., WISE HeMYW03, MetaQuerier ChangHZ05
MetaQuerier@UIUC
[Figure: a "db of dbs" is used to FIND sources (Amazon.com, Cars.com, Apartments.com, 411localte.com) and to QUERY sources through a unified query interface]
69
Technical Challenges
  • Source Modeling Selection
  • How to describe a source and find right sources
    for query answering?
  • Schema Matching
  • How to match the schematic structures between
    sources?
  • Source Querying, Crawling, and Object Ranking
  • How to query a source? How to crawl all objects
    and to search them?
  • Data Extraction
  • How to extract result pages into relations?

70
Source Modeling and Selection: How to describe a source and find the right sources for query answering?
  • Focus: Discovery of sources
  • Focused crawling to collect query interfaces BarbosaF05, ChangHZ05
  • Focus: Extraction of source models
  • Hidden grammar-based parsing ZhangHC04
  • Proximity-based extraction HeMY04
  • Classification to align with a given taxonomy HessK03, Kushmerick03
  • Focus: Organization of sources and query routing
  • Offline clustering HeTC04, PengMH04
  • Online search for query routing KabraLC05

71
Form Extraction: the Problem
  • Output all the conditions; for each one:
  • Group elements (into query conditions)
  • Tag elements with their semantic roles
[Figure: form elements tagged as attribute, operator, and value]
72
Schema Matching: How to match the schematic structures between sources?
  • Focus: Matching a large number of interface schemas, often in a holistic way
  • Statistical model discovery HeC03; correlation mining HeCH04, HeC05
  • Query probing WangWL04
  • Clustering HeMY03, WuYD04
  • Corpus-assisted MadhavanBD05; Web-assisted WuDY06
  • Focus: Constructing unified interfaces
  • As a global generative model HeC03
  • Cluster-merge-select HeMY03

73
WISE-Integrator: Cluster-Merge-Represent HeMY03
74
Source Querying: How to query a source? How to crawl all objects and search them?
  • Metaquerying model
  • Focus: On-the-fly querying
  • MetaQuerier Query Assistant ZhangHC05
  • Vertical-search-engine model
  • Focus: Source crawling to collect objects
  • Form submission by query generation/selection, e.g., RaghavanG01, WuWLM06
  • Focus: Object search and ranking NieZW05

75
On-the-fly Querying ZhangHC05
Type-locality based Predicate Translation
  • Correspondences occur within localities
  • Translation by type-handler

76
Source Crawling by Query Selection WuWL06
Author | Title       | Category
Ullman | Compiler    | System
Ullman | Data Mining | Application
Ullman | Automata    | Theory
Han    | Data Mining | Application
[Figure: the database as a graph with attribute-value nodes (Ullman, Han, Compiler, Data Mining, Automata, System, Application, Theory) connected by occurrence edges]
  • Conceptually, the DB is a graph
  • Nodes: attribute values
  • Edges: occurrence relationships
  • Crawling is transformed into a graph traversal problem
  • Find a set of nodes N in the graph G such that for every node i in G there exists a node j in N with an edge j→i, and the total cost of the nodes in N is minimized.

77
Object Ranking - Object Relationship Graph
NieZW05
  • Popularity Propagation Factor for each type of
    relationship link
  • Popularity of an object is also affected by the
    popularity of the Web pages containing the object

78
Data Extraction: How to extract result pages into relations?
  • Focus
  • Semi-automatic wrapper construction
  • Techniques
  • Wrapper-mediator architecture Wiederhold92
  • Manual construction
  • Semi-automatic, learning-based:
  • HLRT KushmerickWD97
  • Stalker MusleaMK99
  • Softmealy HsuD98

[Figure: wrapper-mediator architecture, with a Mediator on top of several Wrappers]
79
Data Extraction: How to extract result pages into relations?
  • Focus
  • Even more automatic approaches
  • Techniques
  • Semi-automatic, learning-based:
  • ZhaoMWRY05, IRMKS06
  • Automatic, syntax-based:
  • RoadRunner MeccaCM01
  • ExAlg ArasuG03
  • DEPTA LiuGZ03, ZhaiL05

80
You can only afford to Pay As You Go
  • Data Integration Solution
  • Build data integration systems with deep web
    sources
  • Reformulate user queries at search-time
  • Build data integration for every domain of
    interest
  • Impractical for web search!
  • Cannot query sources too often
  • Precise content description required
  • Too many domains of interest?
  • Mediated schema design is infeasible!

81
Web Search Queries and Users
  • Web Queries are typically keyword queries
  • Data integration solutions assume structured
    queries
  • Web users do not typically care if results are
    structured or unstructured
  • User attention restricted to small number of
    portals (1)

82
PAYGO Architecture
  • There can be many, potentially ill-defined, domains
  • Mediated Schema → Schema Clusters
  • Precise mappings cannot be created to all data sources
  • Exact Mappings → Approximate Mappings
  • Users prefer keyword queries to structured queries
  • Query Reformulation → Query Routing
  • Data sources are diverse and mappings are approximate
  • Exact Answers → Heterogeneous Result Ranking

Uncertainty everywhere!
83
Pay As You Go in PAYGO
  • Integration is a continuous process
  • A priori integration is impossible
  • Understanding of mappings/sources/ranking/etc. evolves over time
  • Mechanisms to facilitate evolution over time
  • Automatic schema clustering and matching
  • Implicit use of user feedback, e.g., from result
    clicks
  • Result variations to elicit disambiguating user
    feedback
  • Queries always answered with best effort
  • Pay more by correcting/creating semantic
    mappings

84
Query Routing Example
  • Keyword Analysis
  • Domain Selection
  • Query Construction
  • Source Selection
  • Result Ranking

85
Surfacing the Deep Web A More Practical
Solution?
  • Pre-compute all interesting form submissions for each HTML form
  • Each form submission corresponds to a distinct
    URL
  • Add URLs for each form submission into search
    engine index
  • Enables the reuse of existing search engine
    infrastructure
  • Deep-web URLs are like any other URL (GET method)
  • Reduced load on deep-web sites
  • Accessed only in response to user clicks on search results
  • Search engine performance not dependent on
    deep-web source

86
Surfacing Challenges
  • Predicting the appropriate values for text inputs
  • Valid input values are required for retrieving data
  • E.g., ingredients in recipes.com and zip codes in borderstores.com
  • Predicting the correct input combinations
  • Generating all possible URLs is wasteful and unnecessary
  • Cars.com has 500K listings, but 250M possible queries

87
Google's Deep-Web crawling system
  • Affects more than 1000 queries per second
  • Enables access to more than a million Deep-Web
    sites
  • Spans 50 languages and 100 domains
  • Results served from 400K distinct forms per day
  • Results validate the utility of Deep-Web content
  • Other systems
  • http://www.deeppeep.org/

88
Searching for Structured Data on the Surface Web
EntitySearch@UIUC
  • Entity Extraction and Indexing
  • Ranking Entities Directly
  • Contextual: utilize the entities' surrounding context
  • Uncertain: extractions are imperfect
  • Holistic: combine evidence from multiple sources
  • Discriminative: web pages are of varying quality
  • Associative: tell true associations from accidental ones
  • Other systems
  • NAGA (http://www.mpi-inf.mpg.de/kasneci/naga/)
  • Correlator (http://correlator.sandbox.yahoo.net/)

89
References
  • Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web. K. C.-C. Chang, tutorial at SIGMOD 2006.
  • EntityRank: Searching Entities Directly and Holistically. T. Cheng, X. Yan, and K. C.-C. Chang. VLDB 2007.
  • Web-scale Data Integration: You Can Only Afford to Pay As You Go. Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy. CIDR 2007.
  • Google's Deep-Web Crawl. Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy. VLDB 2008.

90
Thank you!