Workshop on Web Mining Technology and Applications Panel Web Mining: Recent Development and Trends - PowerPoint PPT Presentation

Slides: 119
Provided by: whlu
Learn more at: http://myweb.ncku.edu.tw
Transcript and Presenter's Notes

Title: Workshop on Web Mining Technology and Applications Panel Web Mining: Recent Development and Trends


1
Web Mining for Unknown Term Translation
Wen-Hsiang Lu (???) Department of Computer
Science and Information Engineering
whlu_at_mail.ncku.edu.tw http://myweb.ncku.edu.tw/whlu
2
Web Mining
3
Research Problems
  • Difficulties in automatic construction of
    multilingual translation lexicons
  • Techniques: Parallel/comparable corpora
  • Bottlenecks: Lacking diverse/multilingual
    resources
  • Difficulties in query translation for
    cross-language information retrieval (CLIR)
  • Techniques: Bilingual dictionary/machine
    translation/parallel corpora
  • Bottlenecks: Multiple-sense/short/diverse/unknown
    queries
  • Challenges
  • Web queries are often
  • Short: 2-3 words (Silverstein et al. 1998)
  • Diverse: wide-scoped topics
  • Unknown (out of vocabulary): 74% is unavailable
    in the CEDICT Chinese-English electronic dictionary
    containing 23,948 entries.
  • E.g.
  • Proper name: ???? (Einstein), ?? (Hussein)
  • New terminology: ?????????? (SARS), ????
    (Nosocomial infections)

4
Cross-Language Information Retrieval
  • Query in source language and retrieve relevant
    documents in target languages

?
SARS ???? ????? National Palace Museum
Query Translation
Information Retrieval
Target Translation
Source Query
Target Documents
5
Difficulties in Web Query Translation Using
Machine Translation
Chinese translation: ???????
English source query: National Palace Museum
6
Research Paradigm
Live Translation Lexicon
New approach
Web Mining
Anchor-Text Mining
Term-Translation Extraction
Applications
Internet
Search-Result Mining
Cross-Language Information Retrieval
Cross-Language Web Search
7
Multilingual Anchor-Texts
8
Language-Mixed Texts in Search Result Pages
9-18
(No Transcript)
19
Anchor-Text Mining with Probabilistic Inference
Model
Conventional translation model
  • Asymmetric translation models
  • Symmetric model with link information

Co-occurrence
Page authority
20
Transitive Translation Model for Multilingual
Translation
  • Direct Translation Model
  • Indirect Translation Model
  • Transitive Translation Model

Direct Translation
t
s
??(Traditional Chinese)
??? (Japanese)
Indirect Translation
m
Sony (English)

s: source term; t: target translation; m: intermediate translation
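The relationship between the three models on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not give the formulas, so the indirect score is taken here to be the sum over intermediate terms m of P(s, m) x P(m, t), and the transitive score is assumed to fall back to the indirect one when the direct estimate is below a threshold. The names `direct`, `indirect_prob`, `transitive_prob`, the threshold value, and the placeholder terms `zh:SONY` and `ja:SONY` (standing in for the garbled Chinese and Japanese terms) are all hypothetical.

```python
def indirect_prob(direct, s, t):
    """Indirect translation: sum over intermediate terms m of P(s, m) * P(m, t)."""
    return sum(
        p_sm * direct.get(m, {}).get(t, 0.0)
        for m, p_sm in direct.get(s, {}).items()
    )


def transitive_prob(direct, s, t, threshold=0.1):
    """Assumed combination rule: trust the direct estimate when it is strong
    enough, otherwise fall back to the indirect estimate via intermediates."""
    p_direct = direct.get(s, {}).get(t, 0.0)
    if p_direct > threshold:
        return p_direct
    return indirect_prob(direct, s, t)


# Hypothetical direct-translation scores, e.g. mined from anchor texts.
direct = {
    "zh:SONY": {"Sony": 0.8, "ja:SONY": 0.05},
    "Sony": {"ja:SONY": 0.9},
}

# The direct Chinese-to-Japanese score is weak, so English acts as the bridge.
print(transitive_prob(direct, "zh:SONY", "ja:SONY"))
```

Using English as the intermediate language reflects the diagram on the slide, where Sony (English) sits between the Traditional Chinese source term and the Japanese target.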
21
Promising Results for Automatic Construction of
Multilingual Translation Lexicons
Source terms (Traditional Chinese) English Simplified Chinese Japanese
?? ?? ??? ?? ???? ?? ?? ?? ??? ?? Sony Nike Stanford Sydney internet network homepage computer database information ?? ?? ??? ?? ??? ?? ?? ??? ??? ?? ??? ??? ??????? ???? ??????? ?????? ?????? ??????? ?????? ?????????

22
Search-Result Mining
  • Goal: Improve translation coverage for diverse
    queries
  • Idea
  • Chi-square test: co-occurrence relation
  • Context-vector analysis: context information
  • Context-vector similarity measure
  • Weighting scheme: TF-IDF
  • Chi-square similarity measure
  • 2-way contingency table

      t    ¬t
 s    a    b
¬s    c    d
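Both similarity measures named on this slide can be written down compactly. The sketch below assumes the usual reading of the 2-way contingency table (a = pages containing both s and t, b = s without t, c = t without s, d = neither) and the standard chi-square statistic for a 2x2 table; the slide itself does not show the formula. The context-vector measure is assumed to be cosine similarity over term weights such as TF-IDF. The function names are illustrative only.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table of page counts."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def cosine(u, v):
    """Cosine similarity between two sparse context vectors (term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Terms that always co-occur score high; independent terms score zero.
print(chi_square(10, 0, 0, 10))
print(chi_square(5, 5, 5, 5))
```

The chi-square score needs only page counts, which a search engine can report, while the context-vector score needs the surrounding text of each term.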
23
Workshop on Web Mining Technology and
Applications (Dec. 13, 2006) Panel: Web Mining:
Recent Development and Trends
??? ?? (Vincent S. Tseng) ???? ?????
24
Main Categories of Web Mining
  • Web content mining
  • Web usage mining
  • Web structure mining

25
Web Content Mining
  • Trends
  • Deep web mining
  • Semantic web mining
  • Vertical search
  • Web multimedia content mining
  • Web image/video search
  • Web image/video annotation/classification/clustering
  • Web multimedia content filtering
  • Example: YouTube
  • Integration with web log mining

26
Web Usage Mining
  • Developed techniques
  • Mining of frequent usage patterns
  • Association rules, sequential patterns, traversal
    patterns, etc.
  • Trends
  • Personalization
  • Recommendation
  • Web Ads
  • Incorporation of content semantics/ontology
  • Considerations of Temporality
  • Extension to mobile web applications
  • Multidiscipline integration
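The "mining of frequent usage patterns" item above can be made concrete with a small sketch. It assumes each session is simply a list of visited URLs, counts page pairs that occur together in at least `min_support` sessions, and computes the confidence of a one-to-one association rule. The session data and function names are hypothetical; practical systems would use an algorithm such as Apriori or a sequential-pattern miner rather than this brute-force count.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(sessions, min_support):
    """Return page pairs that co-occur in at least min_support sessions."""
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def confidence(sessions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent over the sessions."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)


# Hypothetical clickstream sessions.
sessions = [
    ["/home", "/cart", "/checkout"],
    ["/home", "/cart"],
    ["/home", "/search"],
]

print(frequent_pairs(sessions, min_support=2))
print(confidence(sessions, "/cart", "/checkout"))
```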

27
Problems: Under-utilization of Clickstream Data
  • Shop.org: U.S.-based visits to retail Web sites
    exceeded 10% of total Internet traffic for the
    first time ever on Thanksgiving, 2004
  • Top five sites: eBay, Amazon.com, Dell.com,
    Walmart.com, BestBuy.com, and Target.com
  • Aberdeen Group
  • 70% of site companies use Clickstream data only
    for basic website management!

28
Challenges for Clickstream Data Mining - Arun Sen
et al., Communications of the ACM, Nov. 2006
  • Problems with data
  • Data incompleteness
  • Very large data size
  • Messiness in the data
  • Integration problems with Enterprise Data
  • Too Many Analytical Methodologies
  • Web Metric-based Methodologies
  • Basic Marketing Metric-based Methodologies
  • Navigation-based Methodologies
  • Traffic-based Methodologies
  • Data Analysis Problems
  • Across-dimension analysis problems
  • Timeliness of data mining under very large data
    size
  • Determination of useful/actionable analysis under
    thousands of metrics

29
Web Information Extraction: The Issues for
Unsupervised Approaches
  • Dr. Chia-Hui Chang (???)
  • Department of Computer Science and Information
    Engineering,
  • National Central University, Taiwan
  • (Talk given at 2006 ???????????? )

30
Outline
  • Web Information Extraction
  • The key to web information integration
  • Three Dimensions
  • Task definition
  • Automation degree
  • Technology
  • Focused on Template Pages IE task
  • Issues for record-level IE
  • Techniques for solving these issues

31
Introduction
  • The coverage of Web information is very wide and
    diverse
  • The Web has changed the way we obtain
    information.
  • Information search on the Web is not enough
    anymore.
  • The need for Web information integration is
    stronger than ever (for both businesses and
    individuals).
  • Understanding those Web pages and discovering
    valuable information from them is called Web
    content mining.
  • Information extraction is one of the keys for web
    content mining.

32
Web Information Integration
  • From information search to information
    extraction, to information mapping
  • Focused crawling / Web page gathering
  • Information search
  • Information (Data) extraction
  • Discovering structured information from input
  • Schema matching
  • With a unified interface / single ontology

33
Three Dimensions to See IE
  • Task Definition
  • Input (Unstructured free texts, semi-structured
    Web pages)
  • Output Targets (record-level, page-level,
    site-level)
  • Automation Degree
  • Programmer-involved, annotation-based or
    annotation-free approaches
  • Techniques
  • Learning algorithm: specific-to-general or
    general-to-specific
  • Rule type: regular expression rules vs. logic
    rules
  • Deterministic finite-state transducers vs.
    probabilistic hidden Markov models

34
IE from Nearly-structured Documents
  • Multiple-records Web page

Google search result
35
IE from Nearly-structured Documents
Single-record Pages
Amazon.com book pages
36
IE from Semi-structured Documents
Ungrammatical snippets
A publication list
Selected articles
37
Information Extraction From Free Texts
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
Named entity extraction,
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Excerpted from Cohen & McCallum's talk.
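Slot filling of this kind can be illustrated with a rule-based sketch. The three regular expressions below are hand-written for a shortened version of the passage and are purely hypothetical; they are not the technique from the excerpted talk, and real systems learn such rules or use statistical models. Each pattern fills the NAME, TITLE and ORGANIZATION slots of one record.

```python
import re

NAME = r"[A-Z]\w+ [A-Z]\w+"          # two capitalized words
ORG = r"[A-Z]\w+(?: [A-Z]\w+)*"      # one or more capitalized words

PATTERNS = [
    # e.g. "Microsoft Corporation CEO Bill Gates"
    re.compile(rf"(?P<org>{ORG}) (?P<title>CEO|VP) (?P<name>{NAME})"),
    # e.g. "Bill Veghte, a Microsoft VP"
    re.compile(rf"(?P<name>{NAME}), an? (?P<org>{ORG}) (?P<title>CEO|VP)"),
    # e.g. "Richard Stallman, founder of the Free Software Foundation"
    re.compile(rf"(?P<name>{NAME}), (?P<title>founder) of the (?P<org>{ORG})"),
]

TEXT = (
    "For years, Microsoft Corporation CEO Bill Gates railed against "
    "open-source software. \"We love the concept of shared source,\" said "
    "Bill Veghte, a Microsoft VP. Richard Stallman, founder of the Free "
    "Software Foundation, countered."
)


def extract_slots(text):
    """Return (name, title, organization) records found by the patterns."""
    records = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            records.append(
                (match.group("name"), match.group("title"), match.group("org"))
            )
    return records


for record in extract_slots(TEXT):
    print(record)
```

Hand-written rules like these are brittle, which is one motivation for the learning-based approaches discussed in the following slides.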
38
Information Extraction From Free Texts
As a family of techniques
Information Extraction = segmentation +
classification + association + clustering
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
Microsoft Corporation CEO Bill Gates Microsoft
Gates Microsoft Bill Veghte Microsoft VP Richard
Stallman founder Free Software Foundation
Excerpted from Cohen & McCallum's talk.
39
Dimension 1: Task Definition - Input
40
Dimension 1: Task Definition - Output
  • Attribute level (single-slot)
  • Named entity extraction, concept annotation
  • Record level
  • Relation between slots
  • Page level
  • All data embedded in a dynamic page
  • Site level
  • All information about a web site

41
Template Page Generation & Extraction
  • Generation/Encoding
  • Extraction/Decoding: A reverse engineering

42
Dimension 2: Automation Degree
  • Programming-based
  • For programmers
  • Supervised learning
  • A bunch of labeled examples
  • Semi-supervised learning/Active learning
  • Interactive wrapper induction
  • Unsupervised learning
  • Mostly for template pages only

43
Tasks vs. Automation Degree
  • High Automation Degree (Unsupervised)
  • Template page IE
  • Semi-Automatic / Interactive
  • Semi-structured document IE
  • Low Automation Degree (Supervised)
  • Free text IE

44
Dimension 3: Technologies
  • Learning Technology
  • Supervised: rule generalization, hypothesis
    testing, statistical modeling
  • Unsupervised learning: pattern mining, clustering
  • Features used
  • Plain text information: tokens, token class, etc.
  • HTML information: DOM tree path, sibling, etc.
  • Visual information: font, style, position, etc.
  • Rule Types (Expressiveness of the rules)
  • Regular expressions, first-order logic rules, HMM
    models

45
Issues for Unsupervised Approaches
  • For Record-level Extraction
  • Data-rich Section Discovery
  • Record Boundary (Separator) Mining
  • Schema Detection & Data Annotation
  • For Page-level Extraction
  • Schema Detection - differentiate template from
    data tokens

46
Data-Rich Section
Record Boundary
Attribute
Attribute
47
Some Related Works on Unsupervised Approaches
  • Record-level
  • IEPAD: Chang and Lui, WWW 2001
  • DeLa: Wang and Lochovsky, WWW 2003
  • DEPTA: Zhai and Liu, WWW 2005
  • ViPER: Simon and Lausen, CIKM 2005
  • ViNT: Zhao et al., WWW 2005
  • Page-level
  • RoadRunner: Crescenzi et al., VLDB 2001
  • EXALG: Arasu and Garcia-Molina, SIGMOD 2003
  • MSE: Zhao et al., VLDB 2006

48
Issue 1: Data-Rich Section Discovery
  • Comparing a normal page with no-result page
  • Comparing two normal pages
  • Locate static text lines, e.g.
  • Books
  • Related Searches
  • Narrow or Expand Results
  • Showing
  • Results

MSE: Zhao, et al., VLDB 2006
ViNT: Zhao, et al., WWW 2005
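The idea of comparing two result pages to locate static text lines can be sketched as follows. This is a simplification under stated assumptions: each page is already reduced to a list of visible text lines, a line is treated as static if it appears verbatim in both pages, and everything else is treated as query-dependent data. The sample pages and the function name are hypothetical, and the cited systems use richer visual and structural cues than exact string matching.

```python
def split_static_dynamic(page_a, page_b):
    """Split the lines of page_a into static lines (also present in page_b)
    and dynamic lines (present only in page_a)."""
    lines_b = set(page_b)
    static = [line for line in page_a if line in lines_b]
    dynamic = [line for line in page_a if line not in lines_b]
    return static, dynamic


# Hypothetical text lines from two result pages of the same site.
page_a = ["Books", "Data Mining Concepts", "Related Searches", "Web Data Mining"]
page_b = ["Books", "Learning Python", "Related Searches", "Fluent Python"]

static, dynamic = split_static_dynamic(page_a, page_b)
print(static)
print(dynamic)
```

The static lines then act as boundary markers, and the regions between them are candidates for the data-rich section.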
49
Issue 1: Data-Rich Section Discovery (Cont.)
  • Similarity between two adjacent leaf nodes
  • 1-dimension clustering
  • Pitch Estimation

50
Issue 2: Record Boundary Mining
  • Tree Pattern Mining
  • String Pattern Mining

<html><body><b>T</b><ol> <li><b>T</b>T<b>T</b>T</li>
<li><b>T</b>T<b>T</b></li> </ol></body><html>
IEPAD: Chang and Lui, WWW 2001
<P><A>T</A><A>T</A> T</P><P><A>T</A>T</P> ?
<P><A>T</A>T</P> <P><A>T</A>T</P>
DeLa: Wang and Lochovsky, WWW 2003
DEPTA: Zhai and Liu, WWW 2005
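String pattern mining over a tag sequence can be illustrated with a brute-force sketch. It encodes a page as a token string (each tag kept as is, each text run replaced by `T`, as in the examples above) and counts every n-gram to find repeated patterns. This is not the algorithm of any of the cited systems, which rely on more efficient structures; the function names and the `min_len` and `min_count` parameters are assumptions for illustration, and overlapping occurrences are counted.

```python
import re
from collections import Counter


def tokenize(html):
    """Keep each tag as a lower-cased token and map each text run to 'T'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            tokens.append(tag.lower())
        elif text.strip():
            tokens.append("T")
    return tokens


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Return every token n-gram that occurs at least min_count times."""
    counts = Counter()
    for n in range(min_len, len(tokens) // min_count + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


tokens = tokenize("<P><A>T</A>T</P><P><A>T</A>T</P>")
patterns = repeated_patterns(tokens)

# The longest repeated pattern is a candidate record template.
print(max(patterns, key=len))
```

In this sketch the longest repeated n-gram corresponds to one whole record, so its start positions serve as record boundaries.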
51
Issue 2: Record Boundary Mining (Cont.)
  • Finding repeated separators from visually encoded
    context lines
  • Heuristics
  • The line following an HR-LINE
  • A unique line in a block that starts with a
    number
  • The line in a block that has the smallest position
    code (only one)
  • The line following a BLANK line is the first line
  • Visual cues

ViPER: Simon and Lausen, CIKM 2005
ViNT: Zhao, et al., WWW 2005
52
Issue 3: Data Schema Detection
  • Alignment of the multiple records found
  • Handling missing attributes, multiple-value
    attributes
  • String alignment or tree alignment
  • Examining two records at a time
  • Differentiate template from data tokens with some
    assumptions
  • Tag tokens are considered part of templates
  • Text lines are usually part of data except for
    static text lines
  • Similar to the problem of page-level IE tasks

53
Page-level IE: EXALG
Arasu and Garcia-Molina, SIGMOD 2003
  • Identifying static markers (tag and word tokens)
    from multiple pages
  • Occurrence vector for each token
  • Differentiating token roles
  • By DOM tree path
  • By position in the equivalence class
  • Equivalence class (EC)
  • Group tokens with the same occurrence vector
  • LFECs form the template
  • e.g. <1,1,1,1>: {<html>, <body>, <table>,
    </table>, </body>, </html>}

Critical point: Tags are not as easy to
differentiate as the text lines used in
Zhao, et al., VLDB 2006
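The occurrence-vector idea can be shown with a short sketch. It assumes each page is already a list of tokens, builds for every token the vector of its counts across the pages, and groups tokens with identical vectors into one equivalence class. It deliberately omits the role-differentiation step (DOM path, position) and the size and frequency filtering that selects LFECs, so it is a simplification rather than the EXALG algorithm itself. The sample pages and function name are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(pages):
    """Group tokens whose occurrence vectors (count per page) are identical."""
    vocabulary = {token for page in pages for token in page}
    classes = defaultdict(set)
    for token in vocabulary:
        vector = tuple(page.count(token) for page in pages)
        classes[vector].add(token)
    return dict(classes)


# Two hypothetical pages generated from the same template.
pages = [
    ["<html>", "<body>", "<li>", "A", "</li>", "</body>", "</html>"],
    ["<html>", "<body>", "<li>", "B", "</li>", "<li>", "C", "</li>",
     "</body>", "</html>"],
]

for vector, tokens in sorted(equivalence_classes(pages).items()):
    print(vector, sorted(tokens))
```

Tokens that occur exactly once on every page form the page-level template skeleton, while the `<li>` pair, whose count varies with the page, marks a repeated record.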
54
On the use of techniques
  • From supervised to unsupervised approaches
  • From string alignment (IEPAD, RoadRunner) to tree
    alignment (DEPTA, Thresher)
  • From two page summarization (MSE) to multiple
    page summarization (EXALG)

55
Summary
  • Content of this talk
  • Web Information Extraction
  • Three Dimensions
  • Focused on the IE task for template pages
  • Issues for unsupervised approaches
  • Techniques for solving these issues
  • Content not in this talk
  • Probabilistic model for free text IE tasks

56
Personal Vision
  • From information search to information
    integration
  • Better UI for information integration
  • Information collection: focused crawling
  • Information extraction
  • Schema matching and integration
  • Not only for business but also for individuals

57
References: Record Level
  • C.-H. Chang, S.-C. Lui, IEPAD: Information
    Extraction based on Pattern Discovery, WWW'01
  • B. Liu, R. Grossman and Y. Zhai, Mining Data
    Records in Web Pages, SIGKDD'03
  • Y. Zhai, B. Liu, Web Data Extraction Based on
    Partial Tree Alignment, WWW'05
  • K. Simon and G. Lausen, ViPER: Augmenting
    Automatic Information Extraction with Visual
    Perceptions, CIKM'05
  • H. Zhao, W. Meng, V. Raghavan, and C. Yu, Fully
    Automatic Wrapper Generation for Search Engines,
    WWW'05

58
References: Page Level and Survey
  • A. Arasu, H. Garcia-Molina, Extracting Structured
    Data from Web Pages, SIGMOD'03
  • V. Crescenzi, G. Mecca, P. Merialdo, RoadRunner:
    Towards Automatic Data Extraction from Large Web
    Sites, VLDB'01
  • H. Zhao, W. Meng, and C. Yu, Automatic Extraction
    of Dynamic Record Sections From Search Engine
    Result Pages, VLDB'06
  • A. Laender, B. Ribeiro-Neto, A. da Silva, J.
    Teixeira, A Brief Survey of Web Data Extraction
    Tools, ACM SIGMOD Record'02
  • C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan,
    A Survey of Web Information Extraction Systems,
    IEEE TKDE'06

59
Taxonomic Information Integration: Challenges and
Applications
  • Cheng-Zen Yang (???)
  • Department of Computer Sci. and Eng.
  • Yuan Ze University
  • czyang_at_syslab.cse.yzu.edu.tw

60
Outline
  • Introduction
  • Problem statement
  • Integration approaches
  • Flattened catalog integration
  • Hierarchical catalog integration
  • Applications
  • Conclusions and future work

61
Introduction
  • As the Internet develops rapidly, the number of
    on-line Web pages has become very large.
  • Many Web portals offer taxonomic information
    (catalogs) to facilitate information search
    [AS2001].
  • These catalogs may need to be integrated if Web
    portals are merged.
  • B2B electronic marketplaces bring together many
    online suppliers and buyers.
  • An integrated Web catalog service can help users
  • gain more relevant and organized information in
    one catalog, and
  • save much of the time otherwise spent surfing
    among different Web catalogs.

62
B2C e-commerce: Amazon
63
(No Transcript)
64
The taxonomic information integration problem
  • Taxonomic information integration is more than a
    simple classification task.
  • When some implicit source information is
    exploited, the integration accuracy can be greatly
    improved.
  • Past studies have shown that
  • the Naïve Bayes classifier, SVMs, and the Maximum
    Entropy model enhance the accuracy of Web catalog
    integration in a flattened catalog integration
    structure.

65
The problem statement (1/2)
  • Flattened catalog integration
  • The source catalog S containing a set of
    categories S1, S2, ..., Sm is to be integrated
    into the destination catalog D consisting of
    categories D1, D2, ..., Dn.

66
The problem statement (2/2)
  • Hierarchical catalog integration

67
Integration Approaches for Flattened Catalogs
68
The enhanced naïve Bayes approach
  • The pioneering work [AS2001]
  • They exploit the implicit source information and
    improve the integration accuracy.
  • Naïve Bayes Approach
  • The Enhanced Naïve Bayes Approach

d: test document in the source catalog
Ci: category in the destination catalog
S: category in the source catalog
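The formulas for the two approaches did not survive in this transcript, so the following is only a sketch of how the enhancement is commonly described: plain naive Bayes scores a document d with a prior proportional to the size of Ci, while the enhanced variant replaces that prior with one that also depends on how many documents of the same source category S were predicted to fall into Ci, raised to a weight w. That form of the prior is an assumption here, as are the function name, the input format, and the sample numbers.

```python
def enhanced_posteriors(likelihoods, dest_sizes, w=1.0):
    """likelihoods: one dict per document of source category S, mapping a
    destination category Ci to Pr(d | Ci). dest_sizes: Ci -> |Ci|."""
    # Step 1: plain naive Bayes prediction, prior proportional to |Ci|.
    predicted = {c: 0 for c in dest_sizes}
    for lik in likelihoods:
        best = max(dest_sizes, key=lambda c: dest_sizes[c] * lik.get(c, 0.0))
        predicted[best] += 1
    # Step 2: assumed source-aware prior, |Ci| * (docs of S predicted in Ci)^w.
    prior = {c: dest_sizes[c] * predicted[c] ** w for c in dest_sizes}
    # Step 3: rescore every document with the source-aware prior.
    results = []
    for lik in likelihoods:
        scores = {c: prior[c] * lik.get(c, 0.0) for c in dest_sizes}
        total = sum(scores.values())
        results.append(
            {c: (s / total if total else 0.0) for c, s in scores.items()}
        )
    return results


# Hypothetical likelihoods for three documents sharing one source category.
dest_sizes = {"Cars": 10, "Sports": 10}
likelihoods = [
    {"Cars": 0.9, "Sports": 0.1},
    {"Cars": 0.8, "Sports": 0.2},
    {"Cars": 0.4, "Sports": 0.6},
]

posteriors = enhanced_posteriors(likelihoods, dest_sizes)
print(posteriors[2])
```

In this toy run the third document would go to Sports under plain naive Bayes, but because its two siblings from the same source category land in Cars, the source-aware prior pulls it into Cars as well.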
69
Probabilistic enhancement and topic restriction
  • NB and SVM [TCCL2003]
  • Probabilistic Enhancement
  • Topic Restriction

x: test document in the source catalog
vt: label of a class in the destination catalog
s: the class label of x in the source catalog
70
The pseudo relevance feedback approach
  • Iterative-Adapting SVM [CHY2005]

71
An Application Example
72
Searching for multi-lingual news articles
  • Many Web portals provide monolingual news
    integration services.
  • Unfortunately, users cannot effectively find the
    related news in other languages.

73
The basic idea
  • Web portals have grouped related news articles.
  • These articles should be about the same main
    story.
  • Can we discover these mappings?

74
Techniques in our current work
  • Machine translation
  • Taxonomy integration

Mapping Finding
75
Taxonomy integration
  • The cross-training process [SCG2003]
  • To make better inferences about label assignments
    in another taxonomy

English News Features
Chinese News Features
1st SVM
Semantically Overlapped Features
2nd SVM
English-Chinese News Category Mappings
76
Mapping decision
  • The SVM-BCT classifiers then calculate the
    ratio of positively mapped documents as the
    mapping score (MSi) to predict the semantic
    overlap [YCC2006].
  • The mapping score MSi of Si → Dj
  • Then we can rank the mappings according to their
    scores.
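The ranking step can be sketched as below. The formula for MSi is missing from this transcript, so the sketch assumes the plain reading of "positively mapped ratio": the fraction of documents in source category Si that a classifier labels as positive for destination category Dj. The input format, category names, and function name are hypothetical, and the classifier itself is not modelled.

```python
def mapping_scores(predictions):
    """predictions maps a (source category, destination category) pair to a
    list of booleans, True when a source document was classified as positive.
    Returns the pairs ranked by their positively mapped ratio."""
    scores = {
        pair: sum(flags) / len(flags)
        for pair, flags in predictions.items()
        if flags
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical classifier outputs for one source category and two candidates.
predictions = {
    ("S1", "D1"): [True, True, False, True],
    ("S1", "D2"): [False, False, True, False],
}

for pair, score in mapping_scores(predictions):
    print(pair, score)
```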

77
Performance evaluation
  • NLP resources
  • Standard Segmentation Corpus from ACLCLP
  • 42,023 segmented words
  • Bilingual wordlists (version 2.0) from the
    Linguistic Data Consortium (LDC)
  • Chinese-to-English version 2 (ldc2ce) with about
    120K records
  • English-to-Chinese (ldc2ec) with 110K records

78
Experimental datasets
  • Properties
  • News reports in the international news category
    of Google News, Taiwan and U.S. versions
  • May 10, 2005 - May 23, 2005
  • 20 news event categories per day
  • Chinese-to-English
  • 46.9 MB
  • English-to-Chinese
  • 80.2 MB
  • 29,182 news stories

79
Conclusions and Future Work
80
Conclusions
  • Taxonomic information integration is an emerging
    issue for Web information mining.
  • New approaches for flattened catalog integration
    and hierarchical catalog integration are still
    needed.
  • Our approaches are a first step toward
    taxonomic information integration.

81
Future work
  • Taxonomy alignment
  • Heterogeneous catalog integration [JUNG2006]
  • Incorporating more conceptual information
  • WordNet, Sinica BOW, etc.
  • Evaluation on other classifiers
  • EM, ME, etc.

82
References
  • [AS2001] Agrawal, R., Srikant, R. On
    Integrating Catalogs. Proc. the 10th WWW Conf.
    (WWW10), (May 2001) 603-612
  • [BOYAPATI2002] Boyapati, V. Improving
    Hierarchical Text Classification Using Unlabeled
    Data. Proc. the 25th Annual ACM Conf. on Research
    and Development in Information Retrieval
    (SIGIR'02), (Aug. 2002) 363-364
  • [CHY2005] Chen, I.-X., Ho, J.-C., Yang, C.-Z.
    An Iterative Approach for Web Catalog Integration
    with Support Vector Machines. Proc. of Asia
    Information Retrieval Symposium 2005 (AIRS 2005),
    (Oct. 2005) 703-708
  • [DC2000] Dumais, S., Chen, H. Hierarchical
    Classification of Web Content. Proc. the 23rd
    Annual ACM Conf. on Research and Development in
    Information Retrieval (SIGIR'00), (Jul. 2000)
    256-263
  • [HCY2006] Ho, J.-C., Chen, I.-X., Yang, C.-Z.
    Learning to Integrate Web Catalogs with
    Conceptual Relationships in Hierarchical
    Thesaurus. Proc. the 3rd Asia Information
    Retrieval Symposium (AIRS 2006), (Oct. 2006)
    217-229
  • [JOACHIMS1998] Joachims, T. Text Categorization
    with Support Vector Machines: Learning with Many
    Relevant Features. Proc. the 10th European Conf.
    on Machine Learning (ECML'98), (1998) 137-142

83
  • [JUNG2006] Jung, J. J. Taxonomy Alignment for
    Interoperability Between Heterogeneous Digital
    Libraries. Proc. the 9th Int'l Conf. on Asian
    Digital Library (ICADL 2006), (Nov. 2006)
    274-282
  • [KELLER1997] Keller, A. M. Smart Catalogs and
    Virtual Catalogs. In Ravi Kalakota and Andrew
    Whinston, editors, Readings in Electronic
    Commerce. Addison-Wesley. (1997)
  • [KKL2002] Kim, D., Kim, J., Lee, S. Catalog
    Integration for Electronic Commerce through
    Category-Hierarchy Merging Technique. Proc. the
    12th Int'l Workshop on Research Issues in Data
    Engineering: Engineering e-Commerce/e-Business
    Systems (RIDE'02), (Feb. 2002) 28-33
  • [MLW2003] Marron, P. J., Lausen, G., Weber, M.
    Catalog Integration Made Easy. Proc. the 19th
    Int'l Conf. on Data Engineering (ICDE'03), (Mar.
    2003) 677-679
  • [RR2001] Rennie, J. D. M., Rifkin, R. Improving
    Multiclass Text Classification with the Support
    Vector Machine. Tech. Report AI Memo AIM-2001-026
    and CCL Memo 210, MIT (Oct. 2001)
  • [SCG2003] Sarawagi, S., Chakrabarti, S., Godbole,
    S. Cross-Training: Learning Probabilistic
    Mappings between Topics. Proc. the 9th ACM SIGKDD
    Int'l Conf. on Knowledge Discovery and Data
    Mining, (Aug. 2003) 177-186

84
  • [SH2001] Stonebraker, M., Hellerstein, J. M.
    Content Integration for e-Commerce. Proc. of the
    2001 ACM SIGMOD Int'l Conf. on Management of
    Data, (May 2001) 552-560
  • [SLN2003] Sun, A., Lim, E.-P., Ng, W.-K.
    Performance Measurement Framework for
    Hierarchical Text Classification. Journal of the
    American Society for Information Science and
    Technology (JASIST), Vol. 54, No. 11, (June 2003)
    1014-1028
  • [TCCL2003] Tsay, J.-J., Chen, H.-Y., Chang,
    C.-F., Lin, C.-H. Enhancing Techniques for
    Efficient Topic Hierarchy Integration. Proc. the
    3rd Int'l Conf. on Data Mining (ICDM'03), (Nov.
    2003) 657-660
  • [WTH2005] Wu, C.-W., Tsai, T.-H., Hsu, W.-L.
    Learning to Integrate Web Taxonomies with
    Fine-Grained Relations: A Case Study Using
    Maximum Entropy Model. Proc. of Asia Information
    Retrieval Symposium 2005 (AIRS 2005), (Oct. 2005)
    190-205
  • [YCC2006] Yang, C.-Z., Chen, C.-M., Chen, I.-X.
    A Cross-Lingual Framework for Web News
    Taxonomy Integration. Proc. the 3rd Asia
    Information Retrieval Symposium (AIRS 2006),
    (Oct. 2006) 270-283
  • [YL1999] Yang, Y., Liu, X. A Re-examination of
    Text Categorization Methods. Proc. the 22nd
    Annual ACM Conference on Research and Development
    in Information Retrieval, (Aug. 1999) 42-49

85
  • [ZADROZNY2002] Zadrozny, B. Reducing Multiclass
    to Binary by Coupling Probability Estimates. In
    Dietterich, T. G., Becker, S., Ghahramani, Z.
    (eds) Advances in Neural Information Processing
    Systems 14 (NIPS 2001). MIT Press. (2002)
  • [ZL2004WWW] Zhang, D., Lee, W. S. Web Taxonomy
    Integration using Support Vector Machines. Proc.
    WWW 2004, (May 2004) 472-481
  • [ZL2004SIGIR] Zhang, D., Lee, W. S. Web Taxonomy
    Integration through Co-Bootstrapping. Proc.
    SIGIR'04, (July 2004) 410-417

86
Mining in the Middle: From Search to Integration
on the Web
  • Kevin C. Chang
  • Joint with the UIUC and Cazoodle Teams

87
To Begin With
What is the Web? Or How do search engines
view the Web?
88
Version 0.1: Web is a SET of PAGES.
89
Version 1.1: Web is a GRAPH of PAGES.
90
But,
What have you been searching lately?
91
Structured Data--- Prevalent but ignored!
92
Version 2.1: Our View: Web is Distributed
Bases of Data Entities.
93
Challenges on the Web come in dual: Getting
access to the structured information!
  • Kevin's 4-quadrants

Surface Web
Deep Web
?
?
Access
?
?
Structure
94
We are inspired: From search to
integration, mining in the middle!
Mining
Search
Integration
95
Challenge of the Deep Web
Access: How to Get There?
MetaQuerier: Holistic Integration over the
Deep Web.
96
The previous Web Search used to be crawl and
index
97
The current Web Search must eventually resort to
integration
98
MetaQuerier: Exploring and integrating the deep
Web
  • Explorer
  • source discovery
  • source modeling
  • source indexing

FIND sources
Amazon.com
Cars.com
db of dbs
  • Integrator
  • source selection
  • schema integration
  • query mediation

Apartments.com
QUERY sources
411localte.com
unified query interface
99
The challenge: How to deal with deep semantics
across a large scale?
  • Semantics is the key in integration!
  • How to understand a query interface?
  • Where is the first condition? What's its
    attribute?
  • How to match query interfaces?
  • What does author on this source match on that?
  • How to translate queries?
  • How to ask this query on that source?

100
Survey the frontier before going to the battle.
  • Challenge reassured
  • 450,000 online databases
  • 1,258,000 query interfaces
  • 307,000 deep web sites
  • 3-7 times increase in 4 years
  • Insight revealed
  • Web sources are not arbitrarily complex
  • Amazon effect: convergence and regularity
    naturally emerge

101
Amazon effect in action
Attributes converge in a domain!
Condition patterns converge even across domains!
102
Search moves on to integration. Don't believe me?
See what Google has to say
DB People: Buckle Up! To embrace the burgeoning
of structured data on the Web.
103
Challenge of the Surface Web
Structure: What to look for?
WISDM: Holistic Search over the Surface Web.
104
Challenge of the surface Web: Despite all the
glorious search engines
Are we searching for what we want?
105
What have you been searching lately?
  • What is the email of Marc Snir?
  • What is Marc Snir's research area?
  • Who are Marc Snir's coauthors?
  • What are the phones of CS database faculty?
  • How much is Canon PowerShot A400?
  • Where is SIGMOD 2006 to be held?
  • When is the due date of SIGMOD 2006?
  • Find PDF files of SIGMOD 2006?

106
Regardless of what you want, you are searching
for pages
NO!
107
Your creativity is amazing: A few examples
  • WSQ/DSQ at Stanford
  • use page counts to rank term associations
  • QXtract at Columbia
  • generate keywords to retrieve docs useful for
    extraction
  • KnowItAll at Washington
  • both ideas in one framework
  • And there must be many I don't know yet
  • Time to distill to build a better mining engine?

108
What is an entity? Your target of information,
or anything.
  • Phone number
  • Email address
  • PDF
  • Image
  • Person name
  • Book title, author,
  • Price (of something)

109
We take an entity view of the Web
110
How different is entity search? How to define
such searches?
111
Let's motivate by contrasting
Page Retrieval
Entity Search
112
Consider the entire process
Page Retrieval
4. Output: one page per result.
3. Scope: each page itself.
2. Criteria: content keywords.
Marc Snir
Marc Snir
1. Input: pages.
113
Entity search is thus different
Entity Search
4. Output: associative results.
3. Scope: holistic aggregates.
2. Criteria: contextual patterns.
1. Input: probabilistic entities.
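The four steps of entity search can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the system described in the talk: each page carries pre-extracted entities with an extraction probability (the probabilistic input), the contextual pattern is simply "the entity occurs within a few tokens of the keyword", and the holistic aggregate is the sum of probabilities over all matching pages. The data layout, the `window` parameter, and the example.edu addresses are invented for illustration.

```python
from collections import defaultdict


def entity_search(pages, keyword, entity_type, window=5):
    """Rank entity values of the given type that occur near the keyword,
    aggregating extraction probabilities across all pages."""
    scores = defaultdict(float)
    for page in pages:
        hits = [i for i, tok in enumerate(page["tokens"]) if tok == keyword]
        for entity in page["entities"]:
            if entity["type"] != entity_type:
                continue
            # Contextual pattern: the entity lies within `window` tokens.
            if any(abs(entity["pos"] - i) <= window for i in hits):
                scores[entity["value"]] += entity["prob"]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical pages with already-tagged email entities.
pages = [
    {"tokens": ["Snir", "email", "snir@example.edu"],
     "entities": [{"type": "email", "value": "snir@example.edu",
                   "pos": 2, "prob": 0.9}]},
    {"tokens": ["contact", "Snir", "at", "snir@example.edu"],
     "entities": [{"type": "email", "value": "snir@example.edu",
                   "pos": 3, "prob": 0.8}]},
    {"tokens": ["Snir", "and", "others", "a", "b", "c", "d", "e",
                "other@example.edu"],
     "entities": [{"type": "email", "value": "other@example.edu",
                   "pos": 8, "prob": 0.95}]},
]

results = entity_search(pages, "Snir", "email")
print(results)
```

The output is a ranked list of entity values supported by several pages, in contrast to page retrieval, which returns one page per result.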
114
What are technical challenges?



Or, how to write
(reviewer-friendly) papers?
115
More issues
  • Tagging/merging of basic entities?
  • Application-driven tagging
  • Web's redundancy will alleviate accuracy demand.
  • Powerful pattern language
  • Linguistic & visual
  • Advanced statistical analysis
  • correlation sampling
  • Scalable query processing
  • new components scale?

116
Promises of the Concepts
  • From page at a time to entity-tuple at a time
  • getting directly to target info and evidences
  • From IR to a mining engine
  • not only page retrieval but also construction
  • From offline to online Web mining and integration
  • enable large scale ad-hoc mining over the web
  • From Web to controlled corpus
  • enhance not only efficiency but also
    effectiveness
  • From passive to active application-driven
    indexing
  • enable mining applications

117
Conclusion: Mining in just the middle!
  • Dual Challenges
  • Getting access to the deep Web.
  • Getting structure from the surface Web.
  • Central Techniques
  • Holistic mining for both search and integration.

Mining
Search
Integration
118
What will such a Mining Engine be?

You tell me!
Students' imagination knows no bounds.