CLiMB: Computational Linguistics for Metadata Building

Title: Tutorial on Latent Semantic Indexing Author: ptd7 Last modified by: Matt Hampel Created Date: 11/26/2002 8:19:37 PM Document presentation format – PowerPoint PPT presentation

Slides: 51
Provided by: ptd7
Learn more at: http://www.columbia.edu
Transcript and Presenter's Notes

1
CLiMB: Computational Linguistics for Metadata Building
  • Center for Research on Information Access
  • Columbia University Libraries

3
Overall Goals
  • Research: Development of richer retrieval through
    increased numbers of descriptors
  • Research and Practice: Creation of enabling
    technologies for new large digitization projects
  • Research and Practice: Expand capability for
    cross-collection searching
  • Practice: Development of a suite of CLiMB tools
  • Resources: Vocabulary list which can be used by
    other visual resource professionals
  • The essence of CLiMB:
  • Use scholars themselves as catalogers by
    utilizing scholarly publications
  • Enhance existing descriptive metadata

4
Computational Linguistic Techniques
  • What techniques have we tried?
  • How well have they worked?
  • What else do we want to try?

5
Computational Linguistic Techniques
  • What techniques have we tried?
  • Goal: Identify high-quality metadata terms
  • Goal: Use metadata for finding images
  • How well have they worked?
  • What else do we want to try?

6
Text about Images
  • The Blacker House is known for its porte
    cochère and adjacent terraces. Samuel Parker
    Williams, an occasional Greene collaborator,
    worked on the site, particularly on the sandstone
    boulder foundation for the sleeping porch.
  • -- Based on Bosley

7
Techniques We Have Tried
  • Supervised (using existing resources)
  • Matching algorithms for proper name variants
  • Back-of-book index analysis
  • Composite list of terms from authoritative lists
  • Unsupervised
  • Part of speech tagging
  • Noun phrase identification
  • Proper noun identification

8
What about LSI?
  • Latent Semantic Indexing
  • Builds a representation of a document
  • Effective in information retrieval
  • Why not for CLiMB?
  • LSI is useful for text query and document
    retrieval
  • LSI, a statistical technique, removes phrasal
    info
  • CLiMB needs high quality phrases
  • May be useful in later stages

9
Indexing for What Purpose
  • Index: find important terms and phrases
  • Index: characterize a document with a set of
    terms that occur in the document

10
Indexing for What Purpose
  • Index: find important terms and phrases
  • sleeping porch
  • occasional collaborator
  • sandstone boulder foundation
  • Index: characterize a document with a set of
    terms that occur in the document
  • sleep, porch, occas, collaborat, foundat
  • Enables location of documents with a similar profile

11
Finding Similar Documents
  • Linear Algebra Techniques
  • Latent Semantic Indexing
  • Singular Value Decomposition (SVD)
  • Semidiscrete Decomposition
  • Vector Space Models
  • Term by Document matrices
  • Term Weighting
  • Polysemy and Synonymy
  • Clustering Techniques
  • K-means
  • EM Clustering
  • Wavelet
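As a minimal sketch of the vector space idea listed above, the following builds a term-by-document count matrix and compares documents by cosine similarity. The three toy documents and all function names are illustrative assumptions, not CLiMB data or code, and raw counts stand in for a real term-weighting scheme.

```python
import math
from collections import Counter

def term_vectors(docs):
    """Build one raw term-count vector per document over a shared vocabulary."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy documents, already stemmed and lower-cased.
docs = [
    "sleep porch sandstone foundat",
    "porch terrac sandstone",
    "kitchen god paper print",
]
vocab, vectors = term_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # two architecture texts share terms
sim_02 = cosine(vectors[0], vectors[2])  # no shared terms
print(round(sim_01, 3), round(sim_02, 3))
```

Documents that share stemmed terms score above zero, while documents with no terms in common score exactly zero. That is the "similar profile" notion of indexing, as opposed to phrase extraction.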

12
Computational Linguistic Techniques
  • What techniques have we tried?
  • Goal: Identify high-quality metadata terms
  • Goal: Load metadata into image search database
  • Goal: Use enriched metadata for finding images
  • How well have they worked?
  • What else do we want to try?

13
Art Object Identification (AO-ID)
  • Need Unique Identifiers
  • Key of database records
  • Varies from collection to collection
  • Greene & Greene: Project Names
  • Chinese Paper Gods: God Names
  • South Asian Temples: Temple Names

15
  • Compile list of subject vocabulary
  • Find meaningful terms in texts
  • Segment relevant texts
  • Collect terms from all sources. Identify and
    link AO-ID described in text.
  • Determine term relationships
  • Extract metadata
  • Insert into existing metadata records. Mount in
    image search platform.
  • Process queries and evaluate
16
Create Composite List of Subject Terms
  • Philosophy: Use whatever resources exist
  • Catalog records
  • Robert R. Blacker house (Pasadena, Calif.)
  • Greene, Charles Sumner
  • Blacker, Robert R.
  • Art and Architecture Thesaurus
  • porte cochère
  • Back of the book index
  • Blacker house

17
Progress: Composite List
  • Greene & Greene
  • Extracted back-of-the-book indexes
  • Direct matching of index terms to the text
  • Terms found (highlighted in yellow):
  • David Gamble
  • Pasadena
  • Westmoreland Place
  • furniture

20
Three Term Types and Approaches
  • 1) Art Object ID names and other proper nouns
    important to the domain (Charles Pratt)
  • Named Entity noun phrase finders, POS taggers
  • 2) Common noun terms, semantically significant to
    the domain (V-shaped plan)
  • List of domain terms from authority sources
  • 3) Common noun phrases in a generic domain
    vocabulary (chimney)
  • Statistical methods for identifying relevant terms

21
Part of Speech (POS) taggers
  • Why use a part of speech tagger?
  • To identify nouns, verbs and proper nouns
  • The Blacker House is known for its porte cochère
  • <Determiner>The
  • <Proper_Noun>
  • <Singular_Proper_Noun>Blacker
  • <Singular_Proper_Noun>House
  • <Verb_Present>is
  • <Verb_Past_Participle>known
  • <Preposition>for
  • <Possessive_Pronoun>its
  • <Adjective>adjacent
  • <Noun_Plural>terraces

22
Part of Speech (POS) taggers
  • Strength: An essential step that allows the rest of
    the system to work
  • Weakness: The best POS taggers have 95% accuracy
  • A typical 20-word sentence is likely to have a
    mistake!
  • But some errors do not matter much
  • E.g., sleeping porch
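
The claim about a 20-word sentence follows from a short calculation. The sketch below assumes, as a simplification, that tagging errors on different words are independent; the function name is illustrative.

```python
def prob_at_least_one_error(per_word_accuracy, sentence_length):
    """Probability that a sentence contains one or more tagging errors,
    assuming errors on different words are independent (a simplification)."""
    return 1.0 - per_word_accuracy ** sentence_length

p = prob_at_least_one_error(0.95, 20)
print(f"{p:.2f}")  # roughly 0.64
```

Under that assumption, nearly two out of three 20-word sentences contain at least one mistagged word, even though each individual word is tagged correctly 95% of the time.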

23
What We Tried: POS Taggers
  • Mitre Alembic WorkBench
  • Freeware from Mitre Corporation
  • Strong for proper nouns
  • Average for common nouns
  • IBM's Nominator
  • Accurate for both
  • Restrictive licensing

24
Proper Nouns
  • Alembic WorkBench Results
  • 91.2% recall
  • Misses: the senior Pratt, Hall brothers
  • 97.5% precision using Alembic
  • Successfully finds: William Issac Ott, University
    of California
  • This is very good!
  • Highlighted in light green
  • Mary
  • Greene
  • Persian
  • Etc.
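
For readers less familiar with the two measures reported above, here is a minimal sketch of how precision and recall are computed from a tagger's output and a hand-made answer key. The name sets below are hypothetical and are not the CLiMB evaluation data.

```python
def precision_recall(found, gold):
    """Precision = correct / found; recall = correct / gold."""
    found, gold = set(found), set(gold)
    correct = found & gold
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical tagger output and hand-made answer key (not CLiMB data).
gold = {"Mary", "Greene", "University of California", "Hall brothers"}
found = {"Mary", "Greene", "University of California", "Known"}
precision, recall = precision_recall(found, gold)
print(precision, recall)
```

A missed name such as "Hall brothers" lowers recall, while a spurious hit such as "Known" lowers precision.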

26
Noun Phrase Chunking
  • The Blacker House is known for
  • its Porte Cochère and adjacent terraces .
  • Samuel Parker Williams,
  • an occasional Greene collaborator,
  • worked on the site, particularly on
  • the sandstone boulder foundation
  • for the sleeping porch .
  • -- Based on Bosley

27
NP Chunkers
  • Columbia's LinkIT
  • Regular expression grammar over POS tags
  • Improves WorkBench results by finding
    simplex NPs
  • LTChunk
  • By LTG Group, University of Edinburgh
  • Finds fewer NPs
  • Arizona: commercialized
  • IBM: also commercial
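
To illustrate what a regular expression grammar over POS tags can look like, here is a minimal sketch. It is not LinkIT's actual grammar: the Penn Treebank style tag names, the one-letter encoding, and the hand-tagged input are all assumptions made for the example.

```python
import re

# One-letter codes for a small, assumed tag set (Penn Treebank style names).
TAG_CODES = {"DT": "D", "PRP$": "D", "JJ": "A", "NN": "N", "NNS": "N", "NNP": "N"}

# Simplex NP: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?A*N+")

def chunk_noun_phrases(tagged):
    """Return noun phrases from a list of (word, tag) pairs."""
    # Each token becomes exactly one character, so match offsets are token offsets.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in NP_PATTERN.finditer(codes)]

# Hand-tagged input (an assumption; a real pipeline would use a POS tagger).
tagged = [("The", "DT"), ("Blacker", "NNP"), ("House", "NNP"), ("is", "VBZ"),
          ("known", "VBN"), ("for", "IN"), ("its", "PRP$"), ("porte", "NN"),
          ("cochère", "NN"), ("and", "CC"), ("adjacent", "JJ"), ("terraces", "NNS")]
phrases = chunk_noun_phrases(tagged)
print(phrases)
```

Because the grammar runs over tags rather than words, a tagging error only matters when it changes the tag class. That is one reason some tagger errors "do not matter much" for chunking.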

28
Results: Proper Nouns
29
Results: Proper Nouns
30
Results: NP Chunking
  • Highlighted in purple
  • The design process
  • The southwest adobe-stucco
  • July 1907

32
Experiments with Algorithms
  • TF/IDF and term frequency ratios
  • Filter technical terms from frequent common nouns
  • Term frequency ratio algorithm to improve
    accuracy
  • Co-occurrence
  • Useful terms may appear near other good ones
  • Machine learning
  • Use learning algorithms to discover complex
    associational context
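
The frequency ratio idea above can be sketched as follows: compare how often a term occurs in the domain text with how often it occurs in a general reference text. This is an illustration under assumed details (the smoothing constant, the function names, and the toy token lists are all hypothetical), not the algorithm the project used.

```python
from collections import Counter

def relative_freq(tokens):
    """Map each token to its share of all tokens in the list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

def frequency_ratios(domain_tokens, general_tokens, smoothing=1e-6):
    """Ratio of a term's relative frequency in the domain text to its
    relative frequency in a general reference text. High ratios suggest
    domain-specific terms; ratios near 1 suggest ordinary vocabulary."""
    dom = relative_freq(domain_tokens)
    gen = relative_freq(general_tokens)
    return {t: f / (gen.get(t, 0.0) + smoothing) for t, f in dom.items()}

# Hypothetical token lists; a real run would use full texts.
domain = "the porch and the terrace face the gable of the house".split()
general = "the face of the house and the dog and the cat".split()
ratios = frequency_ratios(domain, general)
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked[:3])
```

Terms that are frequent in both texts, such as "the" or "house", get a ratio near 1 and sink in the ranking, which is how the ratio filters technical terms from frequent common nouns.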

34
What is Segmentation?
  • Divide texts into cohesive chunks
  • Needed for determining associational context
  • Needed to determine what terms are related to an
    art object

35
Results: Segmentation
  • Use the frequency with which our terms appear within a
    document to estimate where the document is about
    each term
  • This graph shows where different names are
    mentioned in Bosley on Greene & Greene, Ch. 5
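
The kind of data behind such a graph can be sketched by counting, paragraph by paragraph, how often each name is mentioned. The three-paragraph "chapter" and the function names below are hypothetical and stand in for the real text.

```python
import re

def mention_profile(paragraphs, names):
    """For each name, count its mentions in every paragraph, in order."""
    profile = {}
    for name in names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        profile[name] = [len(pattern.findall(p)) for p in paragraphs]
    return profile

def best_paragraph(counts):
    """Index of the paragraph where a name is mentioned most often."""
    return max(range(len(counts)), key=counts.__getitem__)

# Hypothetical three-paragraph chapter.
paragraphs = [
    "The Blacker house has broad terraces. The Blacker commission was large.",
    "The Gamble house followed. Gamble asked for a sleeping porch.",
    "Later work in Pasadena echoed the Gamble design.",
]
profile = mention_profile(paragraphs, ["Blacker", "Gamble"])
print(profile)
```

Runs of paragraphs in which one name dominates are candidate segments about that art object.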

36
What We've Tried: Segmenters
  • Marti Hearst's TextTiling
  • Performs well for a general algorithm, but not
    sufficient for this specialized task
  • M. Hearst, ACL, 1993
  • F. Choi's C99 segmenter
  • Performance comparable to TextTiling
  • F. Y. Y. Choi, NAACL, 2000
  • Frequency ratio approach outperformed TextTiling
  • In-house tool to be tested
  • Kan & Klavans, WVLC-6, 1998, Segmenter

37
Meronymy as Part-Of
  • Why is this potentially useful?
  • A method for identifying hot paragraphs
  • Descriptive text contains part-of relations
  • Details that correlate to the whole
  • A porch is a part of a house
  • An early hypothesis in testing stages

38
Meronymy for Cohesion
The Spinks house design is an elaboration of the
rectangular, large-gabled form of the California
House ... has porches and terraces. In front, an
expanse of lawn rises nearly to the level of the
entry terrace. The front door is approached
obliquely in the shaded recess of the terrace.
39
Meronymy and Other Relations
  • Houses: The California House, Other Houses, Spinks House
  • Parts: entry terrace, front entry, terrace, porch, front door
41
Progress: Project Name Matching
  • Finding project names in Greene & Greene
  • Challenge: finding variations
  • AO-ID: Robert Roe Blacker House
  • RRB House
  • The house
  • 1214 Fairlawn Terrace.
  • Possible techniques to improve matching
  • Developing a semi-automatic technique
  • Use existing information to label text
  • An iterative platform for manual intervention
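
One simple way to start on the variation problem is to generate surface variants from the AO-ID itself and look for them in the text. The sketch below is an assumption about how such a step could work, not the project's matcher. The second sentence of the sample text is invented, and variants such as "the house" or a street address cannot be generated this way, which is why manual intervention is still needed.

```python
def name_variants(full_name, kind="house"):
    """Generate simple surface variants of a project name built from a
    person's name, e.g. 'Robert Roe Blacker' + 'house'."""
    parts = full_name.split()
    surname = parts[-1]
    initials = "".join(p[0] for p in parts)
    first_initials = " ".join(p[0] + "." for p in parts[:-1])
    return {
        f"{full_name} {kind}",
        f"{parts[0]} {surname} {kind}",
        f"{first_initials} {surname} {kind}",
        f"{surname} {kind}",
        f"{initials} {kind}",
    }

def find_variants(text, variants):
    """Return the variants that occur in the text, ignoring case."""
    lowered = text.lower()
    return {v for v in variants if v.lower() in lowered}

variants = name_variants("Robert Roe Blacker")
# First sentence from the sample text; second sentence is hypothetical.
text = ("The Blacker House is known for its porte cochère. "
        "The RRB house sits on a large lot.")
hits = find_variants(text, variants)
print(sorted(hits))
```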

42
Variants of The Culbertson House
  • Cordelia A. Culbertson house (Pasadena, Calif.)
  • Francis F. Prentiss house (Pasadena, Calif.)
  • Culbertson sisters house (Pasadena, Calif.)
  • Prentiss, Francis F.
  • Culbertson, Cordelia A.
  • Allen, Elizabeth S.
  • Allen, Mrs. Dudley P.
  • The house was purchased by the Allens; Mrs. Allen
    remarried and became Mrs. Prentiss!

43
Zaoshen (Chinese deity)
  • USE FOR Dingfuzhenjun (Chinese deity)
  • USE FOR Kitchen God (Chinese deity)
  • USE FOR Simingzaojun (Chinese deity)
  • USE FOR Simingzaoshen (Chinese deity)
  • USE FOR Ssu-ming-tsao-chün (Chinese deity)
  • USE FOR Ssu-ming-tsao-shen (Chinese deity)
  • USE FOR Ting-fu-chen-chün (Chinese deity)
  • USE FOR Tsao-chün (Chinese deity)
  • USE FOR Tsao-shen (Chinese deity)
  • USE FOR Tsao-wang (Chinese deity)
  • USE FOR Tsao-wang-yeh (Chinese deity)
  • USE FOR Zaojun (Chinese deity)
  • USE FOR Zaowang (Chinese deity)
  • REFERENCE Encyc. Britannica (Tsao Shen, pinyin
    Zao Shen, in Chinese mythology, the god of the
    kitchen (god of the hearth), who is believed to
    report to the celestial gods on family conduct
    and have it within his power to bestow poverty or
    riches on individual families; has also been
    confused with Ho Shen (god of fire) and Tsao
    Chün (Furnace Prince))
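
An authority record like this one can be used to map any variant found in a text back to the preferred heading. A minimal sketch follows; it uses only a few of the USE FOR variants listed above, and the dictionary layout and function names are assumptions.

```python
# Preferred heading and a few of its USE FOR variants, taken from the slide.
AUTHORITY = {
    "Zaoshen": ["Kitchen God", "Zaojun", "Zaowang", "Tsao-shen", "Tsao-wang"],
}

def build_lookup(authority):
    """Map every variant (and the heading itself) to its preferred heading,
    keyed in lower case so that matching ignores capitalization."""
    lookup = {}
    for preferred, variants in authority.items():
        for name in [preferred, *variants]:
            lookup[name.lower()] = preferred
    return lookup

def normalize(name, lookup):
    """Return the preferred heading for a name, or None if it is unknown."""
    return lookup.get(name.strip().lower())

lookup = build_lookup(AUTHORITY)
print(normalize("kitchen god", lookup))
```

Names that the record only mentions in its reference note, such as Ho Shen, are deliberately not mapped, since the note says they are confused with Zaoshen rather than equivalent to it.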

44
Some Data to Illustrate
  • Unaltered Project Names
  • 0 matches (both case sensitive and insensitive)
  • Case Insensitive Project Name matching
  • 4 matches
  • Theodore Irwin house occurs 1 time
  • California Institute of Technology occurs 1
    time
  • William R. Thorsen house occurs 1 time
  • William T. Bolton house occurs 1 time
  • The chapter actually contains at least twice as many mentions

45
A Future Solution
  • Bootstrapping algorithm
  • Seed terms hand labelled
  • Terms mapped into multi-dimensional feature space
  • Other terms that are close to the seed terms are
    added to the set
  • Features
  • Window size
  • Headedness
  • Modifier similar to that of a seed term
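
The bootstrapping loop described above can be sketched in a few lines. Everything specific here is an assumption made for illustration: features are represented as sets of strings, closeness is measured with Jaccard overlap, and the threshold and candidate terms are invented.

```python
def jaccard(a, b):
    """Overlap between two feature sets, from 0.0 (disjoint) to 1.0 (equal)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def bootstrap(features, seeds, threshold=0.5, max_rounds=10):
    """Grow a set of accepted terms from hand-labelled seeds.

    features maps each candidate term to a set of feature strings.
    A candidate joins the set when it is at least `threshold` similar
    to some term already accepted; rounds repeat until nothing is added.
    """
    accepted = set(seeds)
    for _ in range(max_rounds):
        added = {
            term for term in features
            if term not in accepted
            and any(jaccard(features[term], features[s]) >= threshold
                    for s in accepted)
        }
        if not added:
            break
        accepted |= added
    return accepted

# Hypothetical features: head noun, modifier type, and context of each phrase.
features = {
    "Blacker house": {"head=house", "mod=proper"},
    "Gamble house": {"head=house", "mod=proper"},
    "Thorsen house": {"head=house", "mod=proper", "ctx=built"},
    "design process": {"head=process", "mod=common"},
}
accepted = bootstrap(features, seeds={"Blacker house"})
print(sorted(accepted))
```

Because newly accepted terms can attract further terms in later rounds, the threshold controls how far the set drifts from the original seeds.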

46
Summary: Research Tools Tested
  • Part of Speech Taggers
  • Noun Phrase Chunkers
  • Merging techniques
  • Proper Noun Finders
  • Proper Name Variant Finder
  • Segmenters

48
Future: Determine Relationships
  • The Blacker House is related to Greene
  • The Greenes built the house.
  • Porte Cochère is related to Blacker House
  • because it is directly a part of the house.
  • William Issac Ott is related to
  • Blacker House (on which he worked)
  • Greene (with whom he worked).
  • Detecting these semantic relationships
    statistically is a challenge for our next steps
  • Co-occurrence
  • Use of subject headings
  • Meronymy and other relations (WordNet)
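
Of the three approaches listed, co-occurrence is the easiest to sketch: count how often two terms appear in the same sentence. The sentence splitter and function name below are simplifying assumptions, and the text is abridged from the sample passage used earlier.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_counts(text, terms):
    """Count how often each pair of terms appears in the same sentence."""
    counts = Counter()
    # Naive sentence split on end punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        present = sorted(t for t in terms if t.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts

text = ("The Blacker House is known for its porte cochère and adjacent terraces. "
        "Samuel Parker Williams, an occasional Greene collaborator, worked on the site.")
terms = ["Blacker House", "porte cochère", "Greene", "Samuel Parker Williams"]
pairs = cooccurrence_counts(text, terms)
print(pairs.most_common())
```

Co-occurrence only says that two terms are related, not how. Telling "part of" from "worked on" is the harder step the slide describes as a challenge.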

50
Thank you!
  • Any questions?
  • www.columbia.edu/cu/cria/climb