Web Taxonomy Integration through Co-Bootstrapping - PowerPoint PPT Presentation

1
Web Taxonomy Integration through Co-Bootstrapping
  • Dell Zhang, Wee Sun Lee
  • SIGIR2004

2
Outline
  • Introduction
  • Problem Statement
  • State-of-the-Art
  • Our Approach
  • Experiments
  • Conclusion

3
Introduction
  • Taxonomy
  • A taxonomy, or directory or catalog, is a
    division of a set of objects (e.g. documents)
    into a set of categories.

Adopted from a slide of Srikant.
4
Introduction
  • Taxonomy Integration
  • Integrating objects from a source taxonomy N into
    a master taxonomy M.

Adopted from a slide of Srikant.
5
Introduction
  • Taxonomy Integration
  • Integrating objects from a source taxonomy N into
    a master taxonomy M.

[Figure: master taxonomy M with categories C1, C2, C3; object labels a, b, c, d, e, f, x, y, z, w.]
Adopted from a slide of Srikant.
6
Introduction
  • Applications
  • Today
  • Web Marketplaces (e.g. Amazon)
  • Web Portals (e.g., NCSTRL)
  • Personal Bookmarks
  • Organizational Resources

7
Introduction
  • Applications
  • Future: Semantic Web
  • Ontology Merging
  • Ontology Mapping
  • Content-based similarity between two concepts
    (categories)
  • Doan, A., Madhavan, J., Domingos, P. and Halevy,
    A. Learning to Map between Ontologies on the
    Semantic Web. in Proceedings of WWW2002.
  • Need to do taxonomy integration first.

8
Introduction
  • Why Machine Learning?
  • Correspondences between taxonomies are inevitably
    noisy and fuzzy.
  • Arts/Music/Styles/ ↔
    Entertainment/Music/Genres/
  • Computers_and_Internet/Software/Freeware ↔
    Computers/Open_Source/Software
  • Manual taxonomy integration is tedious,
    error-prone, and unscalable.

9
Problem Statement
  • Taxonomy Integration as a Classification Problem
  • A master taxonomy M with a set of categories
    C1, C2, ..., Cm, each containing a set of objects
  • Classes: master categories C1, C2, ..., Cm
  • Training examples: objects in M
  • A source taxonomy N with a set of categories
    S1, S2, ..., Sn, each containing a set of objects
  • Test examples: objects in N

10
Problem Statement
  • Characteristics
  • It's a multi-class, multi-label classification
    problem.
  • There are usually more than two possible classes.
  • One object may belong to more than one class.
  • The test examples are already known to the
    machine learning algorithm.
  • The test examples are already labeled with a set
    of categories which are not identical with but
    relevant to the set of categories to be assigned.

11
State-of-the-Art
  • Conventional Learning Algorithms
  • Do not exploit information in the source
    taxonomy.
  • NB (Naïve Bayes)
  • SVM (Support Vector Machine)
  • Enhanced Learning Algorithms
  • Exploit information in the source taxonomy to
    build better classifiers for the master taxonomy.

12
State-of-the-Art
  • Enhanced Learning Algorithms
  • Agrawal, R. and Srikant, R. On Integrating
    Catalogs. in Proceedings of WWW2001.
  • ENB (Enhanced Naïve Bayes)

13
State-of-the-Art
  • Enhanced Learning Algorithms
  • Sarawagi, S., Chakrabarti, S. and Godbole, S.
    Cross-Training: Learning Probabilistic Mappings
    between Topics. in Proceedings of KDD2003.
  • EM2D (Expectation Maximization in 2-Dimensions)
  • CT-SVM (Cross-Training SVMs)

14
State-of-the-Art
  • Enhanced Learning Algorithms
  • Zhang, D. and Lee, W.S. Web Taxonomy Integration
    using Support Vector Machines. in Proceedings of
    WWW2004.
  • CS-TSVM (Cluster Shrinkage Transductive SVM)

15
Our Approach
  • Motivation
  • Possible useful semantic relationships between a
    master category C and a source category S
    include
  • identical,
  • mutual-exclusive,
  • superset,
  • subset,
  • partially-overlapping.

16
Our Approach
  • Motivation
  • In addition, semantic relationships may involve
    multiple master and source categories.
  • For example, a master category C may be a subset
    of the union of two source categories Sa and Sb,
    so if an object does not belong to either Sa or
    Sb, it cannot belong to C.

17
Our Approach
  • Motivation
  • The real-world semantic relationships are noisy
    and fuzzy, but they can still provide valuable
    information for classification.
  • For example, knowing that most (80%) objects in a
    source category S belong to one master category
    Ca and the rest (20%) belong to another
    master category Cb is obviously helpful.

18
Our Approach
  • Idea
  • The difficulty is that the knowledge about those
    semantic relationships is not explicit but hidden
    in the data.
  • If we have indicator functions for each category
    in N, we can imagine taking those indicator
    functions as features when we learn the
    classifier for M. This allows us to exploit the
    semantic relationship among the categories of M
    and N without explicitly figuring out what the
    semantic relationships are.

19
Our Approach
  • Idea
  • Specifically, for each object in M, we augment
    the set of conventional term-features F_T with a
    set of source category-features F_N = {f_N^j}. The
    j-th source category-feature f_N^j of a given
    object x is a binary feature indicating whether x
    belongs to the j-th source category, i.e., S_j.
  • In the same way, we can get a set of master
    category-features F_M.
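
The augmentation described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the function name `augment`, the dictionary representation, and the `term:` / `src:` prefixes are all hypothetical.

```python
def augment(terms, source_categories, all_source_categories):
    """Combine term-features with binary source category-features.

    terms: set of terms present in the object (hypothetical representation).
    source_categories: set of source categories the object belongs to.
    all_source_categories: ordered list S_1..S_n of the source taxonomy.
    """
    # Conventional binary term-features: present terms get value 1.
    features = {f"term:{t}": 1 for t in terms}
    # One binary category-feature per source category S_j.
    for s in all_source_categories:
        features[f"src:{s}"] = 1 if s in source_categories else 0
    return features


example = augment({"jazz", "album"}, {"Genres/Jazz"},
                  ["Genres/Jazz", "Genres/Rock"])
```

Master category-features for objects of the source taxonomy would be built the same way, with the roles of the two taxonomies swapped.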

20
Our Approach
  • Two Major Considerations (1/2)
  • Q1: How can we train the classifier using two
    different kinds of features (term-features and
    category-features)?
  • A1: Boosting.
  • Inspired by
  • Cai, L. and Hofmann, T. Text Categorization by
    Boosting Automatically Extracted Concepts. in
    Proceedings of SIGIR2003.

21
Our Approach
  • Boosting
  • A boosting learning algorithm combines many weak
    hypotheses/learners (moderately accurate
    classification rules) into a highly accurate
    classifier.
  • A boosting learning algorithm can utilize
    different kinds of weak hypotheses/learners for
    different kinds of features, and weight them
    optimally.
  • For example, boosting enables us to use the
    decision tree hypotheses/learners for the
    category-features and the Naïve Bayes
    hypotheses/learners for the term-features.

22
Our Approach
  • Boosting
  • The most popular boosting algorithm is AdaBoost
    introduced in 1995 by Freund and Schapire.
  • Our work is based on its multi-class multi-label
    version, AdaBoost.MH.
  • The weak hypotheses we used in this paper are
    simple decision stumps, each corresponding to a
    binary term-feature or category-feature.
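
To make the idea of boosting decision stumps over binary features concrete, here is a minimal sketch. It is a simplification, not the method used in the talk: it implements plain discrete AdaBoost for a single binary label, whereas AdaBoost.MH handles the multi-class, multi-label case by making binary decisions over (object, category) pairs. The function names and the dictionary feature representation are assumptions.

```python
import math


def stump_output(x, feature, polarity):
    """A stump predicts `polarity` if the feature is present, else -polarity."""
    return polarity if x.get(feature, 0) else -polarity


def train_adaboost(X, y, rounds=5):
    """X: list of dicts of binary features; y: list of labels in {-1, +1}.
    Returns a list of (feature, polarity, alpha) weighted stumps."""
    n = len(X)
    weights = [1.0 / n] * n
    features = sorted({f for x in X for f in x})
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted training error.
        best = None
        for f in features:
            for polarity in (+1, -1):
                err = sum(w for x, label, w in zip(X, y, weights)
                          if stump_output(x, f, polarity) != label)
                if best is None or err < best[0]:
                    best = (err, f, polarity)
        err, f, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((f, polarity, alpha))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * label * stump_output(x, f, polarity))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_output(x, f, p) for f, p, alpha in ensemble)
    return 1 if score >= 0 else -1
```

Because every stump looks at exactly one binary feature, term-features and category-features can be mixed freely in the same feature dictionary.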

23
Our Approach
  • Two Major Considerations (2/2)
  • Q2: How can we train the classifier while the
    values of the source category-features F_N of the
    training examples are unknown?
  • A2: Co-Bootstrapping.
  • Inspired by
  • Blum, A. and Mitchell, T. Combining Labeled and
    Unlabeled Data with Co-Training. in Proceedings
    of COLT1998.
  • It is named Cross-Training by Chakrabarti et al.

24
Our Approach
  • Co-Bootstrapping
  • Train two classifiers symmetrically, one for the
    master categories and the other for the source
    categories.
  • These two classifiers collaborate to mutually
    bootstrap themselves together.
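
One plausible reading of this alternating scheme is sketched below. The transcript does not preserve the algorithm itself, so the order of the steps, the number of rounds, and every name here are assumptions. In particular, `train_stumps` is a toy single-feature learner standing in for AdaBoost.MH. The sketch relies on one observation: each object's categories in its own taxonomy are known, so only the category-features of the other taxonomy have to be predicted.

```python
def train_stumps(X, Y, categories):
    """Toy learner (stand-in for AdaBoost.MH): for each category, pick the
    single feature whose presence best matches membership in that category."""
    features = sorted({f for x in X for f in x})
    rule = {}
    for c in categories:
        def errors(f):
            return sum((f in x) != (c in y) for x, y in zip(X, Y))
        rule[c] = min(features, key=errors)
    return lambda x: {c for c, f in rule.items() if f in x}


def with_cats(terms, cats, prefix):
    """Term-features plus category-features, all encoded as strings."""
    return set(terms) | {prefix + c for c in cats}


def co_bootstrap(M_terms, M_cats, N_terms, N_cats, master, source, rounds=2):
    # Features that are known: each object's categories in its own taxonomy.
    M_known = [with_cats(t, c, "M:") for t, c in zip(M_terms, M_cats)]
    N_known = [with_cats(t, c, "N:") for t, c in zip(N_terms, N_cats)]
    # Round 0: source-category classifier from term-features only.
    clf_N = train_stumps([set(t) for t in N_terms], N_cats, source)
    for _ in range(rounds):
        # Predict the unknown source category-features of master objects,
        # then train the master-category classifier on the augmented data.
        M_train = [with_cats(t, clf_N(k), "N:") for t, k in zip(M_terms, M_known)]
        clf_M = train_stumps(M_train, M_cats, master)
        # Symmetric step for the source-category classifier.
        N_train = [with_cats(t, clf_M(k), "M:") for t, k in zip(N_terms, N_known)]
        clf_N = train_stumps(N_train, N_cats, source)
    # Final output: predicted master categories for the source objects.
    return [clf_M(k) for k in N_known]


result = co_bootstrap(
    M_terms=[{"sax"}, {"guitar"}, {"sax", "live"}, {"guitar", "live"}],
    M_cats=[{"Jazz"}, {"Rock"}, {"Jazz"}, {"Rock"}],
    N_terms=[{"sax", "cd"}, {"guitar", "cd"}],
    N_cats=[{"JazzMusic"}, {"RockMusic"}],
    master=["Jazz", "Rock"],
    source=["JazzMusic", "RockMusic"],
)
```

In this toy data the source categories line up one-to-one with the master categories, so after the first round both classifiers rely on the category-features rather than on individual terms.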

25
Our Approach
26
Experiments
  • Datasets
  • Tasks
  • Features
  • Measures
  • Settings
  • Results

27
Experiments Datasets
  • Taxonomies

Dataset   Google                                  Yahoo
Book      /Top/Shopping/Publications/Books/       /Business_and_Economy/Shopping_and_Services/Books/Bookstores/
Disease   /Top/Health/Conditions_and_Diseases/    /Health/Diseases_and_Conditions/
Movie     /Top/Arts/Movies/Genres/                /Entertainment/Movies_and_Film/Genres/
Music     /Top/Arts/Music/Styles/                 /Entertainment/Music/Genres/
News      /Top/News/By_Subject/                   /News_and_Media/
28
Experiments Datasets
  • Taxonomies
  • Categories
  • For example
  • Movie: Action, Comedy, Horror, etc.
  • Objects
  • Each object is considered as a text document
    which is composed of the title and annotation of
    its corresponding webpage.

29
Experiments Datasets
  • Number of categories
  • The number of categories per object in these
    datasets is 1.54 on average.

Dataset   Google   Yahoo
Book      49       41
Disease   30       51
Movie     34       25
Music     47       24
News      27       34
30
Experiments Datasets
  • Number of objects
  • The set of objects in G∩Y covers only a small
    portion (usually less than 10%) of the set of
    objects in Google or Yahoo alone, which suggests
    the great benefit of automatically integrating
    them.

Dataset   Google   Yahoo    G∪Y      G∩Y
Book      10,842   11,268   21,111   999
Disease   34,047   9,785    41,439   2,393
Movie     36,787   14,366   49,744   1,409
Music     76,420   24,518   95,971   4,967
News      31,504   19,419   49,303   1,620
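
As a quick sanity check on these counts, the sketch below verifies that the union column follows from inclusion-exclusion and computes how large the overlap is relative to each directory. The numbers are copied from the table; the variable and function names are illustrative.

```python
# (|Google|, |Yahoo|, |union|, |intersection|) per dataset, from the table.
counts = {
    "Book": (10842, 11268, 21111, 999),
    "Disease": (34047, 9785, 41439, 2393),
    "Movie": (36787, 14366, 49744, 1409),
    "Music": (76420, 24518, 95971, 4967),
    "News": (31504, 19419, 49303, 1620),
}


def overlap_report(counts):
    """Return {dataset: (overlap share of Google, overlap share of Yahoo)}."""
    report = {}
    for name, (g, y, union, inter) in counts.items():
        # Inclusion-exclusion: |G ∪ Y| = |G| + |Y| - |G ∩ Y|.
        assert g + y - inter == union, name
        report[name] = (inter / g, inter / y)
    return report


report = overlap_report(counts)
for name, (share_g, share_y) in report.items():
    print(f"{name}: {share_g:.1%} of Google, {share_y:.1%} of Yahoo")
```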
31
Experiments Datasets
  • Category Distribution
  • Highly skewed

[Figure: category distributions of Google's Book taxonomy and Yahoo's Book taxonomy.]
32
Experiments Tasks
  • Two symmetric tasks for each dataset
  • G←Y (integrating objects from Yahoo into Google)
  • Y←G (integrating objects from Google into Yahoo)

33
Experiments Tasks
  • Test data: G∩Y
  • We do not need to manually label them because
    their categories in both taxonomies are known.
  • Training data: randomly sampled subsets of G − Y
  • For G←Y tasks, we randomly sample n objects from
    the set G − Y as training examples, where n is
    the number of test examples. For Y←G tasks,
  • For each task, we repeat this random sampling 5
    times and report the averaged performance.

34
Experiments Tasks
  • Training phase
  • We hide the test examples' master categories but
    expose their source categories to the learning
    algorithm.
  • Test phase
  • We compare the hidden master categories of the
    test examples with the predictions of the
    learning algorithm.

35
Experiments Features
  • Term-Features
  • A document: a bag-of-words.
  • Pre-processing:
  • removal of stop-words and stemming.
  • Each term corresponds to a binary feature whose
    value indicates the presence or absence of that
    term in the given document.
  • Category-Features
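
The term-feature extraction described above can be sketched as follows. The slides do not say which stop-word list or stemmer was used, so both are placeholders here: `STOP_WORDS` is a tiny illustrative list and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative only


def naive_stem(word):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_features(title, annotation):
    """Binary bag-of-words: the set of stemmed, non-stop-word terms present."""
    tokens = re.findall(r"[a-z]+", f"{title} {annotation}".lower())
    return {naive_stem(t) for t in tokens if t not in STOP_WORDS}


features = term_features("Horror Movies", "Reviews of the classics")
```

Representing the document as a set captures the presence-or-absence semantics directly: a term is a feature with value 1 exactly when it is in the set.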

36
Experiments Measures
  • For one category
  • F score (F1 measure), which is the harmonic
    mean of precision (p) and recall (r).
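
For reference, the standard definition can be written out as below. The counts TP, FP, and FN (true positives, false positives, and false negatives for the category) are notation introduced here, not taken from the slides.

```latex
\[
p = \frac{TP}{TP + FP}, \qquad
r = \frac{TP}{TP + FN}, \qquad
F = \frac{2\,p\,r}{p + r}
\]
```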

37
Experiments Measures
  • For all categories
  • Macro-averaged F score (maF)
  • Compute the F scores for the binary decisions on
    each individual category first and then average
    them over categories.
  • Micro-averaged F score (miF)
  • Compute the F scores globally over all the
    binary decisions.
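
The two averaging schemes can be illustrated with a short sketch. The function names and the convention of scoring a category with no positives and no predictions as 0 are assumptions made for this example.

```python
def f1(tp, fp, fn):
    """F1 from counts; equals 2pr/(p+r). Returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f(true_sets, pred_sets, categories):
    """true_sets / pred_sets: one set of categories per object."""
    counts = {c: [0, 0, 0] for c in categories}  # [tp, fp, fn] per category
    for truth, pred in zip(true_sets, pred_sets):
        for c in categories:
            if c in pred and c in truth:
                counts[c][0] += 1
            elif c in pred:
                counts[c][1] += 1
            elif c in truth:
                counts[c][2] += 1
    # Macro: average the per-category F scores.
    ma_f = sum(f1(*v) for v in counts.values()) / len(categories)
    # Micro: pool all binary decisions, then compute a single F score.
    tp, fp, fn = (sum(v[i] for v in counts.values()) for i in range(3))
    return ma_f, f1(tp, fp, fn)


truth = [{"Comedy"}] * 4 + [{"Horror"}]
preds = [{"Comedy"}] * 5  # the classifier never predicts the rare category
ma_f, mi_f = macro_micro_f(truth, preds, ["Comedy", "Horror"])
# ma_f is 4/9 (about 0.44); mi_f is 0.8
```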

38
Experiments Measures
  • For all categories
  • The miF tends to be dominated by the
    classification performance on common categories,
    whereas the maF is more influenced by the
    classification performance on rare categories.
  • Providing both kinds of scores is more
    informative than providing either alone,
    especially when the category distributions are
    highly skewed.
  • Yang, Y. and Liu, X. A Re-examination of Text
    Categorization Methods. in Proceedings of
    SIGIR1999.

39
Experiments Settings
  • NB and ENB
  • We use Lidstone's smoothing with parameter 0.1.
  • We run ENB with a series of exponentially
    increasing values of the parameter omega (0, 1,
    3, 10, 30, 100, 300, 1000), and report the best
    experimental results.
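
Lidstone smoothing adds a small constant to every count before normalizing. The sketch below shows it for a multinomial-style term probability estimate; whether the talk's NB implementation used exactly this event model is not stated, so treat the setup (and the names `lidstone`, `lam`) as assumptions.

```python
def lidstone(count, total, vocab_size, lam=0.1):
    """Smoothed estimate of P(term | category).

    count: occurrences of the term in the category.
    total: occurrences of all terms in the category.
    vocab_size: number of distinct terms in the vocabulary.
    lam: Lidstone parameter (0.1 in the settings above; 1.0 gives Laplace).
    """
    return (count + lam) / (total + lam * vocab_size)


# A three-term vocabulary with counts 2, 1, 0 in some category.
probs = [lidstone(c, total=3, vocab_size=3) for c in (2, 1, 0)]
```

The unseen term gets a small non-zero probability, and the estimates still sum to one over the vocabulary.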

40
Experiments Settings
  • AB and CB-AB
  • BoosTexter
  • http://www.research.att.com/~schapire/BoosTexter/
  • It implements AdaBoost on top of "decision
    stumps".

41
Experiments Results
  • ENB > NB

macro-averaged F scores
micro-averaged F scores
42
Experiments Results
  • AB > NB

macro-averaged F scores
micro-averaged F scores
43
Experiments Results
  • CB-AB > AB

macro-averaged F scores
micro-averaged F scores
44
Experiments Results
  • CB-AB iteratively improves AB

45
Experiments Results
  • CB-AB > ENB

macro-averaged F scores
micro-averaged F scores
46
Conclusion
  • Main Contribution
  • The CB-AB approach to taxonomy integration
  • It achieves multi-class multi-label
    classification.
  • It is a discriminative learning algorithm.
  • It enhances the AdaBoost algorithm via exploiting
    information in the source taxonomy.
  • It does not require a tune set (a set of objects
    whose source categories and master categories are
    all known).
  • It enables usage of different weak
    hypotheses/learners for term-features and
    category-features.

47
Conclusion
  • Comparison

Co-Training vs. Co-Bootstrapping
  • Classes
  • Co-Training: one set of classes.
  • Co-Bootstrapping: two sets of classes: (1) one set
    of source categories; (2) one set of master
    categories.
  • Features
  • Co-Training: two disjoint sets of features V1 and
    V2.
  • Co-Bootstrapping: two sets of features: (1)
    conventional-features plus source
    category-features; (2) conventional-features plus
    master category-features.
  • Assumption
  • Co-Training: V1 and V2 are compatible and
    uncorrelated (conditionally independent).
  • Co-Bootstrapping: the source and master taxonomies
    have some semantic overlap, i.e., they are
    somewhat correlated.
48
Conclusion
  • Future Work
  • Theoretical analysis of the Co-Bootstrapping
    algorithm.
  • Can we refine the Co-Bootstrapping algorithm to
    make it theoretically justified, like the
    Greedy-Agreement algorithm for Co-Training?
  • Abney, S.P. Bootstrapping. in Proceedings of
    ACL2002.

49
Conclusion
  • Future Work
  • Empirical comparison with CS-TSVM, EM2D and
    CT-SVM.
  • Exploiting the hierarchical structure of
    taxonomies.
  • Incorporating commonsense knowledge and domain
    constraints.
  • Extending to fully functional ontology mapping
    systems.

50
Questions, Comments, Suggestions?
51
Thank You