Automating Creation of Hierarchical Faceted Metadata Structures


1
Automating Creation of Hierarchical Faceted
Metadata Structures
  • Emilia Stoica, Marti Hearst and Megan
    Richardson
  • School of Information, Berkeley
  • Dept. of Mathematical Sciences, NMSU

2
Focus: Browse Large Datasets
  • Standard search interface (query box, retrieved
    results) is not suited for browsing and navigation
  • User interfaces need to group and organize the
    results

10
How do we Create Faceted Hierarchies?
  • Goals
  • Help an information architect to create the
    hierarchy
  • Currently they do it all by hand!
  • Balance depth and breadth
  • Avoid skinny paths
  • Don't go too deep or too broad
  • Choose understandable labels
  • Disambiguate between word senses

11
Related Work
  • Automated text categorization
  • LOTS of work on this
  • Assumes that a set of categories is already
    created
  • Little if any work on building facet hierarchies

12
Castanet
  • Carves out a structure from the hypernym (IS-A)
    relations within WordNet
  • Semi-automatic algorithm for creating
    hierarchical faceted metadata
  • Produces surprisingly good results for a wide
    range of subjects
  • e.g., recipes, medicine, math, news, fine arts
    image description

13
WordNet Challenges
  • A word may have more than one sense
  • - Fine granularity of word sense distinctions
  • e.g., newspaper (1) - daily publication on folded sheets
  • newspaper (3) - physical object
  • - Ambiguity for the same sense

14
WordNet Challenges (cont.)
  • The hypernym path may be quite long (e.g., sense
    3 of tuna has 14 nodes)
  • Sparse coverage of proper names and noun phrases
    (not addressed)

15
Our Approach
[Diagram: Documents]
16
1. Select Terms
  • Select well-distributed terms from the collection
  • Eliminate stopwords
  • Retain only those terms with a distribution
    higher than a threshold (default: top 10)
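
The term-selection step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that a term's "distribution" means the number of documents containing it, and the stopword list, whitespace tokenization, and the name select_terms are all hypothetical.

```python
from collections import Counter

# Illustrative stopword list only; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "with", "in", "to"}

def select_terms(docs, threshold):
    """Return {term: document count} for terms found in more than `threshold` docs."""
    doc_freq = Counter()
    for text in docs:
        # Count each term at most once per document (assumed meaning of "distribution").
        terms = {w for w in text.lower().split() if w not in STOPWORDS}
        doc_freq.update(terms)
    return {t: n for t, n in doc_freq.items() if n > threshold}

docs = [
    "sundae with ice cream",
    "sherbet and parfait",
    "parfait with berries",
]
selected = select_terms(docs, threshold=1)
print(selected)
```

With this toy collection only "parfait" occurs in more than one document, so it is the only term retained.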

[Pipeline diagram: Documents > Select terms > Build core tree > Augment core tree > Compress tree > Remove top-level categories; WordNet appears as an additional input]
17
2. Build Core Tree
  • Build a backbone
  • Create paths from unambiguous terms only
  • Bias the structure towards appropriate senses of
    words
  • Get hypernym path if term
  • - has only one sense, or
  • - matches a pre-selected WordNet domain
  • Adding a new term increases the count at each node
    on its path by the number of docs with the term.

18
2. Build Core Tree (cont.)
  • Merge hypernym paths to build a tree
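
A rough sketch of building the core tree by merging hypernym paths and accumulating counts. The paths below are hand-written stand-ins rather than real WordNet output, the document counts are made up, and build_core_tree is a hypothetical name. Nodes are identified by label alone, which is a simplification.

```python
from collections import defaultdict

# Hypothetical root-to-term hypernym paths for unambiguous terms.
PATHS = {
    "sundae": ["food", "dessert", "frozen dessert", "sundae"],
    "parfait": ["food", "dessert", "frozen dessert", "parfait"],
    "salsa": ["food", "condiment", "salsa"],
}

def build_core_tree(paths, doc_freq):
    """Merge paths into one tree and add each term's doc count along its path."""
    children = defaultdict(set)   # parent label -> set of child labels
    counts = defaultdict(int)     # node label -> accumulated document count
    for term, path in paths.items():
        for parent, child in zip(path, path[1:]):
            children[parent].add(child)   # shared prefixes merge automatically
        for node in path:
            counts[node] += doc_freq[term]
    return children, counts

children, counts = build_core_tree(PATHS, {"sundae": 5, "parfait": 3, "salsa": 4})
print(dict(counts))
```

Because "sundae" and "parfait" share the prefix food > dessert > frozen dessert, their paths merge into a single branch whose nodes carry the combined count.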

19
3. Augment Core Tree
  • Attach to the core tree the terms with more than
    one sense
  • Favor the more common path over other alternatives
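
One way to read "favor the more common path" is to compare the core-tree counts at the node each candidate path would attach to. The sketch below assumes exactly that scoring rule, which the slides do not spell out; the counts 78 and 18 and the two paths come from the slide's "date" example, and pick_path is a hypothetical name.

```python
def pick_path(candidate_paths, counts):
    """Return the candidate path whose attachment node has the highest count."""
    def score(path):
        parent = path[-2]              # node the ambiguous term would hang from
        return counts.get(parent, 0)   # nodes missing from the core tree score zero
    return max(candidate_paths, key=score)

# Counts already accumulated in the core tree (from the "date" example).
counts = {"edible fruit": 78, "calendar day": 18}

date_paths = [
    ["entity", "substance, matter", "food, nutrient", "nutriment",
     "food", "edible fruit", "date"],
    ["abstraction", "measure, quantity", "fundamental quality",
     "time period", "calendar day", "date"],
]

chosen = pick_path(date_paths, counts)
print(" > ".join(chosen))
```

For a recipe collection the fruit sense wins, since "edible fruit" is far better represented in the core tree than "calendar day".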

20
Augment Core Tree (cont.)
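[Figure: two candidate hypernym paths for the ambiguous term "date"; the numbers in parentheses appear to be core-tree counts, and "?" marks the two possible attachment points]
Date (p1): entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > date ? (sibling: berries)
Date (p2): abstraction > measure, quantity > fundamental quality > time period > calendar day (18) > date ? (sibling: Sunday)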
21
Optional Step: Domains
  • To disambiguate, use Domains
  • WordNet has 212 Domains
  • medicine, mathematics, biology, chemistry,
    linguistics, soccer, etc.
  • A better collection has been developed by Magnini
    (2000)
  • Assigns a domain to every noun synset
  • Automatically scan the collection to see which
    domains apply
  • The user selects which of the suggested domains
    to use, or may add their own
  • Paths for terms that match the selected domains
    are added to the core tree
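
Domain-based disambiguation can be sketched as a simple filter over a term's senses. The sense inventory below is hand-written for illustration: the glosses follow the "dip" example, but the domain labels for senses 1 and 2 are hypothetical, and senses_in_domains is not a WordNet API.

```python
# Hypothetical sense inventory: term -> list of (sense number, domain, gloss).
SENSES = {
    "dip": [
        (1, "geography", "a depression in an otherwise level surface"),
        (2, "physics", "the angle that a magnetic needle makes with the horizon"),
        (3, "food", "tasty mixture into which bite-size foods are dipped"),
    ],
}

def senses_in_domains(term, selected_domains, senses=SENSES):
    """Return the sense numbers of `term` whose domain was selected by the user."""
    return [num for num, domain, _gloss in senses.get(term, [])
            if domain in selected_domains]

print(senses_in_domains("dip", {"food"}))
```

Only senses that match a selected domain would have their hypernym paths added to the core tree; a term with no matching sense is left for the later augmentation step.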

22
Using Domains
dip glosses
  Sense 1: A depression in an otherwise level surface
  Sense 2: The angle that a magnetic needle makes with the horizon
  Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms
  Sense 1: solid > concave shape > depression
  Sense 2: shape, form > space > angle
  Sense 3: food > ingredient, fixings > flavorer
Given domain food, choose sense 3
23
4. Compress Tree
  • Rule 1
  • Eliminate a parent with fewer than k children
    unless it is the root or its distribution is
    larger than 0.1 * maxdist
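
A minimal sketch of Rule 1, assuming that eliminating a parent means promoting its children to the grandparent and that the rule is applied bottom-up. The tree arrangement is assumed from the node names in the dessert example, the distribution values are made up, and apply_rule1 is a hypothetical name.

```python
def apply_rule1(root, children, dist, k, maxdist):
    """Return a new parent -> children dict with under-populated parents removed."""
    compressed = {}

    def visit(node, is_root):
        kids = []
        for child in children.get(node, []):
            kids.extend(visit(child, False))   # a child may be replaced by its own children
        is_leaf = not children.get(node)       # leaves are not parents, so they stay
        keep = (is_root or is_leaf or len(kids) >= k
                or dist.get(node, 0) > 0.1 * maxdist)
        if not keep:
            return kids                        # drop this node, promote its children
        if kids:
            compressed[node] = kids
        return [node]

    visit(root, True)
    return compressed

tree = {
    "dessert": ["frozen dessert"],
    "frozen dessert": ["ice cream sundae", "parfait", "sherbet, sorbet"],
    "ice cream sundae": ["sundae"],
    "sherbet, sorbet": ["sherbet"],
}
dist = {name: 1 for name in ["ice cream sundae", "sherbet, sorbet"]}  # made-up values
result = apply_rule1("dessert", tree, dist, k=2, maxdist=100)
print(result)
```

Here "ice cream sundae" and "sherbet, sorbet" each have a single child and a small distribution, so they are removed and their children attach directly to "frozen dessert".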

[Figure: example tree with nodes dessert, frozen dessert, ice cream sundae, parfait, sherbet/sorbet, sundae, sherbet]
24
4. Compress Tree (cont.)
  • Rule 2
  • Eliminate a child whose name appears within the
    parent's name
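
Rule 2 can be sketched with a plain substring test. This is an assumption-laden illustration: case-insensitive substring matching is a naive stand-in for "name appears within" (it would also match "ice" inside "rice"), a removed child's own children are promoted to the parent, and both the example tree and the name apply_rule2 are hypothetical.

```python
def apply_rule2(root, children):
    """Return a new parent -> children dict without children named inside the parent."""
    compressed = {}

    def kids_of(node):
        kids = []
        for child in children.get(node, []):
            if child.lower() in node.lower():
                kids.extend(kids_of(child))   # drop redundant child, promote its children
            else:
                kids.append(child)
                build(child)
        return kids

    def build(node):
        kids = kids_of(node)
        if kids:
            compressed[node] = kids

    build(root)
    return compressed

tree = {
    "frozen dessert": ["sherbet, sorbet", "parfait"],
    "sherbet, sorbet": ["sherbet"],
}
result = apply_rule2("frozen dessert", tree)
print(result)
```

The child "sherbet" adds nothing beyond its parent "sherbet, sorbet", so it is dropped and the parent becomes a leaf.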

[Figure: example tree with nodes dessert, frozen dessert, sundae, parfait, sherbet]
25
5. Divide into Facets
26
5. Divide into Facets (Remove top levels)
  • Rule 1: Eliminate the top t levels (t = 4 for
    the recipe collection).
  • Rule 2: For each resulting tree, test if it has
    at least n children (n = 2).
  • If yes, stop. If not, delete the root and repeat.
  • Manual cleaning: remove facets that don't make
    sense
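
The two facet-division rules can be sketched as below. The tree is a small hypothetical one (so t = 1 is used instead of the t = 4 quoted for recipes), divide_into_facets is a made-up name, and the reading that a deleted root's children are then tested in turn is an assumption.

```python
def divide_into_facets(root, children, t, n):
    """Return the facet roots left after removing the top t levels (Rule 1)
    and deleting roots with fewer than n children (Rule 2)."""
    # Rule 1: nodes at depth t become the candidate facet roots.
    frontier = [root]
    for _ in range(t):
        frontier = [c for node in frontier for c in children.get(node, [])]

    # Rule 2: keep a root with at least n children; otherwise delete it
    # and test each of its children as a new root.
    facets = []
    stack = list(reversed(frontier))
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if len(kids) >= n:
            facets.append(node)
        else:
            stack.extend(reversed(kids))
    return facets

tree = {
    "entity": ["substance", "abstraction"],
    "substance": ["food"],
    "food": ["dessert", "condiment"],
    "abstraction": ["time period"],
    "time period": ["calendar day"],
    "calendar day": ["Sunday", "Monday"],
}
facets = divide_into_facets("entity", tree, t=1, n=2)
print(facets)
```

Single-child chains such as abstraction > time period are peeled away until a node with enough children is reached; the manual cleaning step then remains with the information architect.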
27
Example: Recipes (13,500 docs)
28
Castanet Output (shown in Flamenco)
29
Castanet Output
31
Castanet Evaluation
  • This is a tool for information architects (IA),
    so people of this type did the evaluation
  • Each IA compared Castanet to other
    state-of-the-art algorithms
  • LDA (Blei et al. 04)
  • Subsumption (Sanderson & Croft 99)
  • Baseline: most frequent terms in the collection
  • Datasets
  • 13,000 recipes from Southwestcooking.com

32
Subsumption Output
33
Subsumption Output
34
LDA Output
35
LDA Output
36
Evaluation Method
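[Diagram: labels C, L, S with the numbers 16 and 18; presumably Castanet vs. LDA (16) and Castanet vs. Subsumption (18)]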
  • For each of the 2 systems' output
  • Examined and commented on top-level
  • Examined and commented on two sub-levels
  • Then comment on overall properties
  • Meaningful?
  • Systematic?
  • Likely to use in your work?

37
Evaluation (cont.)
  • Sample questions for top-level categories
  • - Would you add/remove/rename any category?
  • - Did this category match your expectations?
  • Sample questions for a specific category
  • - Would you add/move/remove any sub-categories?
  • - Would you promote any sub-category to top level?
  • General questions
  • - Would you use Castanet?
  • - Would you use LDA?
  • - Would you use Subsumption?
  • - Would you use a list of most frequent terms?

38
Evaluation Results
  • Would you use this system in your work?
  • ("yes, definitely" or "yes, in some cases")
  • Castanet: 85
  • LDA: 0
  • Subsumption: 37
  • Baseline: 74
  • Average response to questions about quality
  • (4 = strongly agree, 3 = agree somewhat,
    2 = disagree somewhat, 1 = strongly disagree)

39
Evaluation Results
  • Average responses for top-level categories
  • (4 = no changes, 3 = one or two, 2 = a few,
    1 = many)
  • Average responses for 2 subcategories

40
Needed Improvements
  • Take spelling variations and morphological
    variants into account
  • Use verbs and adjectives, not just nouns
  • Normalize noun phrases
  • Allow terms to have more than one sense
  • Improve algorithm for assigning documents to
    categories.

41
Conclusions
  • Flexible application of hierarchical faceted
    metadata is a proven approach for navigating
    large information collections.
  • Midway in complexity between simple hierarchies
    and deep knowledge representation.
  • Currently in use on e-commerce sites; spreading
    to other domains
  • Systems are needed to help create faceted
    metadata structures
  • Our WordNet-based algorithm, while not perfect,
    seems like it will be a useful tool for
    Information Architects.

42
Conclusions
  • Castanet builds a set of faceted hierarchies by
    finding IS-A relations between terms using
    WordNet.
  • The method has been tested on various domains
  • medicine, recipes, math, news, description of
    images
  • Usability study shows
  • Castanet is preferred to other state-of-the-art
    solutions.
  • Information architects want to use the tool in
    their work.
  • Future work
  • Apply to tags (flickr, delicious)

43
Learn More
  • Funding
  • This work was supported in part by NSF (IIS-9984741)
  • For more information
  • Stoica, E., Hearst, M., and Richardson, M.,
    Automating Creation of Hierarchical Faceted
    Metadata Structures, NAACL/HLT 2007
  • See http://flamenco.berkeley.edu