Castanet: Using WordNet to Build Facet Hierarchies

Emilia Stoica and Marti Hearst, School of Information, Berkeley

1
Castanet: Using WordNet to Build Facet Hierarchies
  • Emilia Stoica and Marti Hearst
  • School of Information, Berkeley

2
Motivation
  • Want to assign labels from multiple hierarchies

3
Motivation
  • Hot and Sweet Chicken: 1 pepper, 2 apricots,
    1 pound chicken breast, 1 Tbsp gingerroot
  • Example label: Meat > Chicken

4
Castanet
  • Carves out a structure from the hypernym (IS-A)
    relations within WordNet
  • Produces surprisingly good results for a wide
    range of subjects
  • e.g., arts, medicine, recipes, math, news,
    bibliographical records

5
WordNet Challenges
  • A word may have more than one sense
  • Fine granularity of word sense distinctions
  • e.g., newspaper (1): daily publication on folded sheets
  • newspaper (3): physical object
  • Ambiguity for the same sense

6
WordNet Challenges (cont.)
  • The hypernym path may be quite long (e.g., sense
    3 of tuna has 14 nodes)
  • Sparse coverage of proper names and noun phrases
    (not addressed)

7
Algorithm Goals
  • Build a set of facet hierarchies
  • Balance depth and breadth
  • Avoid skinny paths
  • Don't go too deep or too broad
  • Choose understandable labels
  • Disambiguate words
  • Currently a word can take on only one sense

8
Our Approach
[Diagram: overview of the approach, starting from Documents]
9
1. Select Terms
  • Select well-distributed terms from the collection
  • Eliminate stopwords
  • Retain only those terms with a distribution
    higher than a threshold
  • (default top 10)
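
The term selection step can be sketched as follows. This is a minimal illustration, not the authors' code: it reads "distribution" as document frequency, treats the default as a fraction of the ranked terms (the slide only says "top 10"), and uses a tiny made-up stopword list. `select_terms` and `top_fraction` are hypothetical names.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the one used by Castanet).
STOPWORDS = {"a", "an", "and", "of", "the", "with", "in", "to"}

def select_terms(docs, top_fraction=0.10):
    """Return the terms that occur in the most documents.

    top_fraction is an assumed reading of the slide's "top 10" default.
    """
    doc_freq = Counter()
    for text in docs:
        # Count each term once per document (document frequency).
        terms = {w for w in text.lower().split() if w not in STOPWORDS}
        doc_freq.update(terms)
    # Rank by descending document frequency, breaking ties alphabetically.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    keep = max(1, round(len(ranked) * top_fraction))
    return ranked[:keep], doc_freq

docs = [
    "chicken with apricots and pepper",
    "chicken breast with gingerroot",
    "sherbet and parfait",
]
selected, doc_freq = select_terms(docs, top_fraction=0.10)
```

With these three toy documents, only the most widely distributed term survives the cut.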

[Pipeline diagram: Documents -> Select terms -> Build core tree -> Augment core tree -> Compress tree -> Remove top-level categories, with WordNet as an input]
10
2. Build Core Tree
  • Build a backbone
  • Create paths from unambiguous terms only
  • Bias the structure towards appropriate senses of
    words
  • Get hypernym path if term
    - has only one sense, or
    - matches a pre-selected WordNet domain
  • Adding a new term increases the count at each node
    on its path by the number of docs with the term.

11
2. Build Core Tree (cont.)
  • Merge hypernym paths to build a tree
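
Merging hypernym paths into a tree while accumulating document counts might look like the sketch below. The paths and counts are invented for illustration, and `add_path` is a hypothetical helper; in practice the paths would come from WordNet.

```python
def add_path(tree, path, doc_count):
    """Merge one root-to-term hypernym path into the tree.

    tree maps a node name to {"count": int, "children": {...}}.
    Each node on the path has its count increased by doc_count.
    """
    level = tree
    for name in path:
        node = level.setdefault(name, {"count": 0, "children": {}})
        node["count"] += doc_count
        level = node["children"]

# Hypothetical hypernym paths (root first) with the number of docs per term.
paths = [
    (["food", "dessert", "frozen dessert", "parfait"], 3),
    (["food", "dessert", "frozen dessert", "sherbet"], 2),
    (["food", "meat", "chicken"], 5),
]
core = {}
for path, n_docs in paths:
    add_path(core, path, n_docs)
```

Shared prefixes such as food > dessert are stored once, so the count at each node ends up as the total over all terms below it.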

12
3. Augment Core Tree
  • Attach terms with more than one sense to the core tree
  • Favor the more common path over other alternatives
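
One possible reading of "favor the more common path" is to score each candidate sense by how much of its hypernym path already exists in the core tree. The sketch below assumes that reading; the scoring function, the toy tree, and the candidate paths are all illustrative assumptions rather than the published method.

```python
def path_score(tree, path):
    """Sum the counts of the nodes on `path` that already exist in the tree."""
    score, level = 0, tree
    for name in path:
        node = level.get(name)
        if node is None:
            break  # the rest of the path is not in the core tree yet
        score += node["count"]
        level = node["children"]
    return score

def choose_path(tree, candidate_paths):
    """Pick the candidate sense path that overlaps most with the core tree."""
    return max(candidate_paths, key=lambda p: path_score(tree, p))

# A toy core tree in the form {name: {"count": int, "children": {...}}}.
core = {
    "food": {"count": 10, "children": {
        "meat": {"count": 5, "children": {}},
    }},
    "shape": {"count": 1, "children": {}},
}
# Two hypothetical sense paths for one ambiguous term.
candidates = [
    ["shape", "space", "angle", "dip"],
    ["food", "ingredient", "flavorer", "dip"],
]
best = choose_path(core, candidates)
```

Here the food path wins because the food node already carries far more documents than the shape node.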

13
Augment Core Tree (cont.)
14
Optional Step: Domains
  • To disambiguate, use Domains
  • WordNet has 212 Domains
  • medicine, mathematics, biology, chemistry,
    linguistics, soccer, etc.
  • A better collection has been developed by Magnini
    (2000)
  • Assigns a domain to every noun synset
  • Automatically scan the collection to see which
    domains apply
  • The user selects which of the suggested domains
    to use, or may add their own
  • Paths for terms that match the selected domains
    are added to the core tree

15
Using Domains
dip glosses:
  • Sense 1: A depression in an otherwise level surface
  • Sense 2: The angle that a magnet needle makes with the horizon
  • Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms:
  • Sense 1: solid > concave shape > depression
  • Sense 2: shape, form > space > angle
  • Sense 3: food > ingredient, fixings > flavorer
Given domain food, choose sense 3
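
The domain-based choice for "dip" can be sketched as a simple lookup. The hypernym paths follow the slide; the domain labels for senses 1 and 2 are placeholders invented for this example, and `disambiguate` is a hypothetical helper.

```python
# Sense number -> (domain label, hypernym path).
# The "geology" and "physics" labels are placeholders, not taken from WordNet Domains.
DIP_SENSES = {
    1: ("geology", ["solid", "concave shape", "depression", "dip"]),
    2: ("physics", ["shape, form", "space", "angle", "dip"]),
    3: ("food", ["food", "ingredient, fixings", "flavorer", "dip"]),
}

def disambiguate(senses, selected_domains):
    """Return (sense number, path) if exactly one sense matches a selected domain."""
    matches = [n for n, (domain, _path) in sorted(senses.items())
               if domain in selected_domains]
    if len(matches) == 1:
        sense = matches[0]
        return sense, senses[sense][1]
    return None  # zero or several matches: leave the term for a later step

choice = disambiguate(DIP_SENSES, {"food"})
```

Returning `None` when no single sense matches keeps ambiguous terms out of the core tree in this sketch.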
16
4. Compress Tree
  • Rule 1
  • Eliminate a parent with fewer than k children
    unless it is the root or its distribution is
    larger than 0.1 * maxdist
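
A minimal sketch of Rule 1 follows, assuming that the children of an eliminated parent are promoted to the grandparent. The tree shape is reconstructed from the node names on this slide, and the values of k, maxdist, and the per-node distributions are made up for illustration.

```python
class Node:
    def __init__(self, name, dist, children=None):
        self.name = name
        self.dist = dist                  # number of documents under this node
        self.children = children or []

def compress_rule1(node, k, maxdist):
    """Remove internal nodes with fewer than k children, promoting their children.

    The node passed in at the top is the root and is never removed.
    Nodes whose dist exceeds 0.1 * maxdist are kept regardless.
    """
    new_children = []
    for child in node.children:
        compress_rule1(child, k, maxdist)         # compress the subtree first
        is_parent = bool(child.children)
        if is_parent and len(child.children) < k and child.dist <= 0.1 * maxdist:
            new_children.extend(child.children)   # drop child, keep its children
        else:
            new_children.append(child)
    node.children = new_children
    return node

root = Node("dessert", 40, [
    Node("frozen dessert", 30, [
        Node("ice cream sundae", 4, [Node("sundae", 4)]),
        Node("parfait", 8),
        Node("sherbet, sorbet", 5, [Node("sherbet", 5)]),
    ]),
])
compress_rule1(root, k=2, maxdist=100)
frozen = root.children[0]
```

In this toy run the single-child nodes "ice cream sundae" and "sherbet, sorbet" disappear, leaving sundae, parfait, and sherbet directly under frozen dessert.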

[Figure: example tree with nodes dessert, frozen dessert, ice cream sundae, parfait, sherbet/sorbet, sundae, sherbet]
17
4. Compress Tree (cont.)
  • Rule 2
  • Eliminate a child whose name appears within the
    parent's name
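
Read literally, Rule 2 can be sketched with a substring test. This is an assumption about how "appears within" is checked (a real implementation might compare whole words), and promoting the dropped child's own children is also an assumption. The example tree is hypothetical.

```python
def compress_rule2(tree):
    """tree is (name, [subtrees]); drop children whose name occurs in the parent's name."""
    name, children = tree
    kept = []
    for child in children:
        child_name, grandchildren = compress_rule2(child)
        if child_name.lower() in name.lower():
            kept.extend(grandchildren)    # drop the child, keep its subtree
        else:
            kept.append((child_name, grandchildren))
    return (name, kept)

# Hypothetical input tree for illustration.
tree = ("frozen dessert", [
    ("ice cream sundae", [("sundae", []), ("banana split", [])]),
    ("parfait", []),
])
result = compress_rule2(tree)
```

Here "sundae" is removed because it repeats part of its parent's label, while "banana split" stays.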

[Figure: example tree with nodes dessert, frozen dessert, sundae, parfait, sherbet]
18
5. Divide into Facets
19
5. Divide into Facets (Remove top levels)
  • Rule 1: Eliminate very general categories (e.g.,
    entity, abstraction). If no paths are longer
    than threshold t, then done.
  • Rule 2: Otherwise, undo the first step. Then eliminate
    all top levels until the maximum length of any path in
    the resulting hierarchy is t.
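
The two rules can be sketched over a forest of nested dicts, as below. This is an illustration under stated assumptions: path length is counted in nodes, "eliminating a top level" means replacing every root by its children, and the set of general categories contains only the two examples from the slide. All function names are hypothetical.

```python
GENERAL = {"entity", "abstraction"}  # the two examples given on the slide

def depth(forest):
    """Number of nodes on the longest root-to-leaf path in a forest."""
    if not forest:
        return 0
    return 1 + max(depth(children) for children in forest.values())

def strip_top_level(forest):
    """Replace every root by its children; childless roots are kept as they are."""
    result = {}
    for name, children in forest.items():
        if children:
            result.update(children)   # assumes node names are unique
        else:
            result[name] = {}
    return result

def remove_general(forest):
    """Rule 1: drop very general roots, promoting their children."""
    result = {}
    for name, children in forest.items():
        if name in GENERAL:
            result.update(remove_general(children))
        else:
            result[name] = children
    return result

def divide_into_facets(forest, t):
    assert t >= 1, "t is a path length in nodes and must be positive"
    candidate = remove_general(forest)
    if depth(candidate) <= t:
        return candidate
    # Rule 2: undo Rule 1 and strip whole top levels instead.
    candidate = forest
    while depth(candidate) > t:
        candidate = strip_top_level(candidate)
    return candidate

wordnet_tree = {"entity": {"food": {
    "dessert": {"frozen dessert": {"parfait": {}}},
    "meat": {"chicken": {}},
}}}
facets_t4 = divide_into_facets(wordnet_tree, 4)
facets_t3 = divide_into_facets(wordnet_tree, 3)
```

With t = 4, Rule 1 alone suffices and food remains the single facet; with t = 3, Rule 2 strips two levels and dessert and meat become separate facets.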
20
Example: Recipes (3500 docs)
21
Castanet Output (shown in Flamenco)
22
Castanet Output
23
Castanet Output
24
Castanet Output
25
Castanet Output
26
(No Transcript)
27
Castanet Evaluation
  • This is a tool for information architects, so
    people of this type did the evaluation
  • We compared output on
  • Recipes
  • Biomedical journal titles
  • We compared to two state-of-the-art algorithms
  • LDA (Blei et al. 04)
  • Subsumption (Sanderson & Croft 99)

28
Subsumption Output
29
Subsumption Output
30
Subsumption Output
31
Subsumption Output
32
LDA Output
33
LDA Output
34
LDA Output
35
Evaluation Method
  • Information architects assessed the category
    systems
  • For each of 2 systems' output
  • Examined and commented on top-level
  • Examined and commented on two sub-levels
  • Then commented on overall properties
  • Meaningful?
  • Systematic?
  • Likely to use in your work?

36
Evaluation (cont.)
  • Sample questions for top-level categories
    - Would you add/remove/rename any category?
    - Did this category match your expectations?
  • Sample questions for a specific category
    - Would you add/move/remove any sub-categories?
    - Would you promote any sub-category to top level?
  • General questions
    - Would you use Castanet?
    - Would you use LDA?
    - Would you use Subsumption?
    - Would you use a list of most frequent terms?

37
Evaluation Results
  • Results on recipes collection for
    "Would you use this system in your work?"
  • Answered "yes in some cases" or "yes, definitely":
  • Castanet 29/34
  • LDA 0/18
  • Subsumption 6/16
  • Baseline 25/34
  • Average response to questions about quality
    (4 = strongly agree)

38
Evaluation Results
  • Average responses for top-level categories
  • 4 = no changes, 1 = change many
  • Average responses for 2 subcategories

39
Needed Improvements
  • Take spelling variations and morphological
    variants into account
  • Use verbs and adjectives, not just nouns
  • Normalize noun phrases
  • Allow terms to have more than one sense
  • Improve algorithm for assigning documents to
    categories.

40
Opportunities for Tagging
  • New opportunity: tagging, folksonomies
  • (flickr, del.icio.us)
  • People are creating facets in a decentralized
    manner
  • They are assigning multiple facets to items
  • This is done on a massive scale
  • This leads naturally to meaningful associations

41
Conclusions
  • Flexible application of hierarchical faceted
    metadata is a proven approach for navigating
    large information collections.
  • Midway in complexity between simple hierarchies
    and deep knowledge representation.
  • Currently in use on e-commerce sites; spreading
    to other domains
  • Systems are needed to help create faceted
    metadata structures
  • Our WordNet-based algorithm, while not perfect,
    seems like it will be a useful tool for
    Information Architects.

42
Conclusions
  • Castanet builds a set of faceted hierarchies by
    finding IS-A relations between terms using
    WordNet.
  • The method has been tested on various domains
  • medicine, recipes, math, news, arts,
    bibliographical records
  • Usability study shows
  • Castanet is preferred to other state-of-the-art
    solutions.
  • Information architects want to use the tool in
    their work.

43
Learn More
  • Funding
  • This work was supported in part by NSF (IIS-9984741)
  • For more information
  • Stoica, E., Hearst, M., and Richardson, M.,
    Automating Creation of Hierarchical Faceted
    Metadata Structures, NAACL/HLT 2007
  • See http://flamenco.berkeley.edu