Toward Large-Scale Shallow Semantics for Higher-Quality NLP - PowerPoint PPT Presentation

About This Presentation
Title:

Toward Large-Scale Shallow Semantics for Higher-Quality NLP

Description:

What is this "Semantics" in the Semantic Web, and How can You Get It? Toward Large-Scale Shallow Semantics for Higher-Quality NLP. Eduard Hovy – PowerPoint PPT presentation

Slides: 72
Provided by: Edua132
Learn more at: http://kmi.open.ac.uk
Transcript and Presenter's Notes

Title: Toward Large-Scale Shallow Semantics for Higher-Quality NLP


1
Toward Large-Scale Shallow Semantics for
Higher-Quality NLP
What is this "Semantics" in the Semantic Web, and How can You Get It?
  • Eduard Hovy
  • Information Sciences Institute
  • University of Southern California
  • www.isi.edu/natural-language

2
The Knowledge Base of the World
  • We live in the infosphere
  • but it's unstructured, inconsistent, often outdated,
  • in other words, a mess!

3
Frank's two Semantic Webs
  • 1. The Semantic Web as data definer
  • Applies to circumscribed, structured data types: numbers, lists, tables, inventories, picture annotations
  • Suitable for constrained, context-free semantics
  • Amenable to OWL etc.: closed vocabularies and controllable relations
  • 2. The Semantic Web as text enhancer
  • Applies to open-ended, unstructured information
  • Requires open-ended, context-sensitive semantics
  • Requires... what exactly? Where to find it?

4
Where's the semantics?
  • It's in the words: insert standardized symbols for each (content?) word
  • Need: symbols, vocabularies, ontologies
  • It's in the links: create a standardized set of links and use (only?) them
  • Need: links, operational semantics, link interpreters
  • It will somehow emerge, by magic, if we just do enough stuff with OWL and RDF
  • Need: formalisms, definitions, operational semantics, notation interpreters

5
NO to controlled vocabulary, says IR!
  • 1960s: Cleverdon and the Cranfield aeronautics evaluations of text retrieval engines (Cleverdon '67)
  • Tested algorithms and lists of controlled vocabularies, also all words
  • SURPRISE: all words better than controlled vocabs!
  • ...which led to Salton's vector-space approach to IR
  • ...which led to today's web search engines
  • The IR position: forget ontologies and controlled lists; the semantics lies in multi-word combinations!
  • There's no benefit in artificial or controlled languages
  • Multi-word combinations ("kitchen knife") are good enough
  • Build language models: frequency distributions of words in corpus/doc (Callan et al. '99; Ponte and Croft '98)

Nonetheless, for Semantic Web uses, we need semantics. But WHAT is it? And how do we obtain it?
6
Toward semantics: Layers of interpretation 1
7
Layers of interpretation 2
syntax
POS: PN PN PRO AUX ADV DT PN P DT PN V P DT N N PUN PRO V V PN DT AJ N N PUN
surface: Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony "we want to make Dubai a new trading center"
8
Layers of interpretation 3
shallow semantics:
  P0: act announce1 | agent P1(Sheikh Mohammed) | theme P9 | time present
  P9: act want3 | agent P6(we) | theme P10
  P10: act make8 | theme P7(Dubai) | result P8(center)
  entities: P1(Sheikh Mohammed), P2(who), P3(Defense Minister), P4(United Arab Emirates), P5(inaug. ceremony), P6(we), P7(Dubai), P8(trading center)
  coref: P1(Sheikh Mohammed) = P2(who); P2(who) = P3(Defense Minister); P4(United Arab Emirates); P6(we)
syntax
POS: PN PN PRO AUX ADV DT PN P DT PN V P DT N N PUN PRO V V PN DT AJ N N PUN
surface: Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony "we want to make Dubai a new trading center"
9
Layers of interpretation 4
deep(er) semantics:
  P0: act say-act3 | agent P1(Sheikh) | theme P9 | author-time T1 | event-time T2 < T1
  P9: state desire1 | experiencer P1(Sheikh) | theme P10 | state-time T2
  P10: act change-state | theme P7(Dubai) | old-state ? | new-state P11 | event-time T3 > T2
  P11: state essence1 | experiencer P7(Dubai) | theme P8(center) | state-time T4 > T3
info structure: topic (theme), rheme, focus
shallow semantics:
  P0: act announce1 | agent P1(Sheikh Mohammed) | theme P9 | time present
  P9: act want3 | agent P6(we) | theme P10
  P10: act make8 | theme P7(Dubai) | result P8(center)
  entities: P1(Sheikh Mohammed), P2(who), P3(Defense Minister), P4(United Arab Emirates), P5(inaug. ceremony), P6(we), P7(Dubai), P8(trading center)
  coref: P1(Sheikh Mohammed) = P2(who); P2(who) = P3(Defense Minister); P4(United Arab Emirates); P6(we)
syntax
POS: PN PN PRO AUX ADV DT PN P DT PN V P DT N N PUN PRO V V PN DT AJ N N PUN
surface: Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony "we want to make Dubai a new trading center"
10
Shallow and deep semantics
  • She sold him the book / He bought the book from her
  • He has a headache / He gets a headache
  • Though it's not perfect, democracy is the best system

(X1 act:Sell agent:She patient:(X1a type:Book) recip:He)
(X2a act:Transfer agent:She patient:(X2c type:Book) recip:He) (X2b act:Transfer agent:He patient:(X2d type:Money) recip:She)
(X3a prop:Headache patient:He) (?)
(X4a type:State object:(X4c type:Head owner:He) state:-3) (X4b type:StateChange object:X4c fromstate:0 tostate:-3)
(X4 type:Contrast arg1:(X4a ?) arg2:(X4b ?))
11
Some semantic phenomena
  • Somewhat easier
  • Bracketing (scope) of predications
  • Word sense selection (incl. copula)
  • NP structure: genitives, modifiers
  • Concepts: ontology definition
  • Concept structure (incl. frames and thematic
    roles)
  • Coreference (entities and events)
  • Pronoun classification (ref, bound, event,
    generic, other)
  • Identification of events
  • Temporal relations (incl. discourse and aspect)
  • Manner relations
  • Spatial relations
  • Direct quotation and reported speech
  • Opinions and subjectivity

More difficult:
  • Quantifier phrases and numerical expressions
  • Comparatives
  • Coordination
  • Information structure (theme/rheme)
  • Focus
  • Discourse structure
  • Other adverbials (epistemic modals, evidentials)
  • Identification of propositions (modality)
  • Pragmatics/speech acts
  • Polarity/negation
  • Presuppositions
  • Metaphors
12
Improving NL applications with semantics
  • How to improve accuracy of IR / web search?
  • TREC '98–'01: around 40%
  • Understand user query; expand query terms by meaning
  • How to achieve conceptual summarization?
  • Never been done yet, at non-toy level
  • Interpret topic, fuse concepts according to meaning, re-generate
  • How to improve QA?
  • TREC '99–'02: around 65%
  • Understand Q and A; match their meanings; know common info
  • How to improve MT quality?
  • MT Eval '94: around 70%, depending on what you measure
  • Disambiguate word senses to find correct meaning

13
Talk overview
  • Introduction: Semantics and the Semantic Web
  • Approach: General methodology for building the resources
  • Ontology framework: Terminology ontology as start
  • Creating Omega: recent work on connecting ontologies
  • Concept level: terms and relations
  • Learning concepts by clustering
  • Learning and using concept associations
  • Instance level: instances and more
  • Harvesting instances from text
  • Harvesting relations
  • Corpus: manual shallow semantic annotation
  • OntoNotes project
  • Conclusion

14
2. Approach: General methodology for building the resources
15
What's needed?
  • Set of semantic symbols: democracy, eat, ...
  • For each symbol, some kind of definition, or at least rules for its combination and treatment during notation transformations
  • Notational conventions for each phenomenon of meaning: comparatives, time/tense, negation, number, etc.
  • A collection of examples, as training data for learning systems to learn to do the work
  • A body of world knowledge for use in processing

16
Credo and methodology
  • Ontologies (and even concepts) are too complex to build all in one step
  • ...so build them bit by bit, testing each new (kind of) addition empirically
  • ...and develop appropriate learning techniques for each bit, so you can automate the process
  • ...so next time (since there's no ultimate truth) you can build a new one more quickly

17
Large standardized metadata collections
What is an ontology? My def: a collection of terms denoting entities, events, and relationships in the domain, taxonomized and interrelated so as to express the sharing of properties. It's a formalized model of the domain, focusing on the aspects of interest for computation.
  • The need is there: everybody's making lists
  • SIC and NAICS and other codes
  • Yahoo!'s topic classification
  • Semantic Web termbanks / ontologies
  • But how do you...
  • Guarantee the freshness and accuracy of the list?
  • Guarantee its completeness?
  • Ensure commensurate detail in levels of the list?
  • Cross-reference elements of the list?

Need automated procedure for creating lists /
metadata / ontologies
18
Plan Stepwise accretion of knowledge
  • Initial framework
  • Start with existing (terminological) ontologies
    as pre-metadata
  • Weave them together
  • Build metadata/concepts
  • Define/extract concept cores
  • Extract/learn inter-concept relationships
  • Extract/learn definitional and other info
  • Build (large) data/instance base
  • Extract instance cores
  • Link into ontology; store in databases
  • Extract more information, guided by parent
    concept

19
Omega ontology: Content and framework
  • Concepts: 120,604 concept/term entries; 76 MB
  • Upper: own Penman Upper Model (ISI; Bateman et al.)
  • Upper: SUMO (Pease et al.); DOLCE (Guarino et al.)
  • Middle: WordNet (Princeton; Miller, Fellbaum)
  • Upper Middle: Mikrokosmos (NMSU; Nirenburg et al.)
  • Middle: 25,000 noun-noun compounds (ISI; Pantel)
  • Lexicon / sense space
  • 156,142 English words; 33,822 Spanish words
  • 271,243 word senses
  • 13,000 frames of verb argument structure with case roles
  • LCS case roles (Dorr); 6.3 MB
  • PropBank roleframes (Palmer et al.); 5.3 MB
  • FrameNet roleframes (Fillmore et al.); 2.8 MB
  • WordNet verb frames (Fellbaum); 1.8 MB
  • Associated information (not all complete)
  • WordNet subject domains (Magnini & Cavaglia); 1.2 MB
  • Various relations learned from text (ISI; Pantel)
  • TAP domain groupings (Stanford; Guha)
  • SemCor term frequencies; 7.5 MB
  • Instances: 10.1 GB
  • 1.1 million persons harvested from text
  • 900,000 facts harvested from text
  • 5.7 million locations from USGS and NGA
  • Framework (over 28 million statements of concepts, relations, instances)
  • Available in PowerLoom
  • Instances in RDF
  • With database/MySQL
  • Online browser
  • Clustering software
  • Term and ontology alignment software
http://omega.isi.edu
20
Talk overview
21
3. Framework: Terminology ontology as starting point; semi-automated alignment and merging
(This work with Andrew Philpot, Michael
Fleischman, and Jerry Hobbs)
22
Example application EDC (Hovy et al. 02)
23
Omega (Hovy et al. 03)
WordNet 1.7 (Princeton) 110,000 nodes
Our own new work (ISI) 400 nodes
Mikrokosmos (New Mexico State U) 6,000 nodes
Penman Upper Model (ISI) 300 nodes
24
General alignment and merging problem
  • Goal: find attachment point(s) in ontology for node/term from somewhere else (ontology, website, metadata schema, etc.)
  • It's hard to do manually, and very hard to do automatically: the system needs to understand the semantics of the entities to be aligned
25
Ontology alignment and merging
  • Goal: find attachment point in ontology for node/term from somewhere else (ontology, website, metadata schema, etc.)
  • Procedure:
  • 1. For a new term/concept, extract and format name, definition, associated text, local taxonomy cluster, etc.
  • 2. Apply alignment suggestion heuristics (NAME, DEFINITION, HIERARCHY, DISPERSAL match) against the big ontology, to get proposed attachment points with strengths (Hovy '98); test with numerous parameter combinations, see http://edc.isi.edu/alignment/ (Hovy et al. '01)
  • 3. Automatically combine proposals (Fleischman et al. '03)
  • 4. Apply verification checks
  • 5. Bless or reject proposals manually
  • Process developed in early 1990s (Agirre et al. '94; Knight & Luk '94; Okumura & Hovy '96; Hovy '98; Hovy et al. '01)
  • Not stunningly accurate, but can speed up manual alignment markedly
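The combine-proposals step can be illustrated with a toy sketch. The scoring functions, weights, and mini-ontology below are hypothetical stand-ins for the NAME and DEFINITION match heuristics, not the actual ISI implementation:

```python
def name_score(term, concept_name):
    """Jaccard token overlap between a new term and a concept name
    (a stand-in for the NAME-match heuristic)."""
    a, b = set(term.lower().split()), set(concept_name.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def definition_score(definition, concept_gloss):
    """Fraction of definition words shared with the concept's gloss
    (a stand-in for the DEFINITION-match heuristic)."""
    a, b = set(definition.lower().split()), set(concept_gloss.lower().split())
    return len(a & b) / len(a) if a else 0.0

def propose_attachments(term, definition, ontology, w_name=0.6, w_def=0.4):
    """Combine heuristic scores into ranked attachment proposals with strengths."""
    proposals = []
    for name, gloss in ontology.items():
        s = w_name * name_score(term, name) + w_def * definition_score(definition, gloss)
        if s > 0:
            proposals.append((name, round(s, 3)))
    return sorted(proposals, key=lambda p: -p[1])

# Toy ontology: concept name -> gloss.
ontology = {
    "library building": "a building that houses a collection of books",
    "library institution": "an institution that lends books and can buy things",
}
print(propose_attachments("public library",
                          "a building holding books for lending", ontology))
```

A real version would add the HIERARCHY and DISPERSAL heuristics and learn the combination weights rather than fixing them.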

26
Alignment for Omega
  • Created Upper Region (400 nodes) manually
  • Manually snipped tops off Mikro and WordNet, then
    attached them to fringe of Upper Region
  • Automatically aligned bottom fringe of Mikro into
    WordNet
  • Automatically aligned sides of bubbles

27
(No Transcript)
28
A puzzle
  • Is Amber Decomposable or Nondecomposable?
  • The stone sense of it (Mikro) is; the resin sense (WordNet) is not
  • What to do??

29
Shishkebobs (Hovy et al. '03)
  • Library ISA Building (and hence can't buy things)
  • Library ISA Institution (and hence can buy things)
  • SO: Building + Institution + Location: a Library is all of these
  • Also: Field-of-Study + Activity + Result-of-Process
  • (Science, Medicine, Architecture, Art)
  • Allowing shishkebobs makes merging ontologies easier (possible?): you respect each ontology's perspective
  • Continuum from on-the-fly shadings to metonymy
  • (see Guarino's identity conditions; Pustejovsky's qualia)
  • We found about 400 shishkebobs

30
http://omega.isi.edu
31
Talk overview
32
4a. Concept level: Learning terms/concepts by clustering web information
(This work by Patrick Pantel, Marco
Pennacchiotti, and Dekang Lin)
33
Where/how to find new concepts/terms?
  • Potential sources
  • Existing ontologies (AI efforts, Yahoo!, etc.) and lists (SIC codes, etc.)
  • Manual entry, esp. with reference to foreign-language text (EuroWordNet, IL-Annot, etc.)
  • Dictionaries and thesauri (Webster's, Roget's, etc.)
  • Automated discovery by text clustering (Pantel and Lin, etc.)
  • Issues
  • How large do you want it? Tradeoff: size vs. consistency and ease of use
  • How detailed? Tradeoff: granularity/domain-specificity vs. portability and wide acceptance (Semantic Web)
  • How language-independent? Tradeoff: independence vs. utility for non/shallow-semantic NLP applications

34
Clustering By Committee (Pantel and Lin '02)
  • CBC clustering procedure
  • Parse entire corpus using MINIPAR (D. Lin)
  • Define syntactic/POS patterns as features: N-N, N-subj-V, Adj-N, etc.
  • Cluster words, using pointwise mutual information PMI(e, f) between word e and pattern feature f
  • Disambiguate
  • Find cluster centroids: word committee
  • For non-centroid words, match their pattern features to committee words' features; if match, include word in cluster, remove features
  • If no match, then word has remaining features, so try to include in other clusters as well: split ambiguous words' senses
  • Complexity: O(n²k) for n words in corpus, k features
  • Results: no clustering is perfect, but CBC is quite good
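The PMI-on-features step can be sketched minimally. The (word, feature) pairs below are invented toy data standing in for MINIPAR parse output:

```python
import math
from collections import Counter

# Toy (word, syntactic-feature) co-occurrence pairs; feature names like
# "obj-of:eat" are illustrative placeholders for parser-derived patterns.
pairs = [
    ("apple", "obj-of:eat"), ("apple", "nn:pie"), ("apple", "obj-of:eat"),
    ("orange", "obj-of:eat"), ("car", "obj-of:drive"), ("car", "nn:engine"),
]

word_c, feat_c, pair_c = Counter(), Counter(), Counter(pairs)
for w, f in pairs:
    word_c[w] += 1
    feat_c[f] += 1
N = len(pairs)

def pmi(word, feature):
    """Pointwise mutual information: log P(w, f) / (P(w) P(f))."""
    joint = pair_c[(word, feature)] / N
    if joint == 0:
        return float("-inf")
    return math.log(joint / ((word_c[word] / N) * (feat_c[feature] / N)))

print(pmi("apple", "obj-of:eat"))   # shared, less informative feature
print(pmi("car", "obj-of:drive"))   # exclusive, more informative feature
```

CBC then clusters words by the similarity of their PMI feature vectors; the exclusive feature scoring higher than the widely shared one is the property the algorithm exploits.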

35
www.isi.edu/pantel/
36
(No Transcript)
37
(No Transcript)
38
From words to concepts
  • How to find a name for a cluster?
  • Given term instances, search for frequently co-occurring terms, using apposition patterns
  • "the President, Thomas Jefferson, ..."
  • "Kobe Bryant, famous basketball star"
  • Extract terms, check if present in ontology
  • Examples for Lincoln:
  • PRESIDENT(N891) - 0.187331
  • BORROWER / THRIFT(N724) - 0.166958
  • CAR / DIVISION(N257) - 0.137333
  • Works OK for nouns, less so for others

39
Problems with clustering
  • No text-based clustering is ever perfect
  • How many concepts are there?
  • How are they arranged? (there is no reason to
    expect that a clustering taxonomy should
    correspond with an ISA hierarchy!)
  • What interrelationships exist between them?
  • Clustering is only the start

40
Talk overview
41
4b. Concept level: Learning and using concept associations
(This work with Chin-Yew Lin, Mike Junk, Michael
Fleischman, and Tom Murray)
42
Topic signature
Related words in texts show a Poisson distribution. In a large set of texts, topic keywords concentrate around topics, so families of related words appear in bursts. To find a family, compare topical word frequency distributions against global background counts.
  • Word family built around inter-word relations.
  • Def: Head word (or concept), plus set of related words (or concepts), each with strength: {Tk, (tk1,wk1), (tk2,wk2), ..., (tkn,wkn)}
  • Problem: Scriptal co-occurrence, etc.: how to find it?
  • Approximate by simple textual term co-occurrence...

43
Learning signatures
  • Procedure:
  • 1. Collect texts, sorted by topic (need texts, sorted)
  • 2. Identify families of co-occurring words (how to count co-occurrence?)
  • 3. Evaluate their purity (how to evaluate?)
  • 4. Find the words' concepts in the Ontology (need disambiguator)
  • 5. Link together the concept signatures
44
Calculating weights
  • tf.idf: w_jk = tf_jk · idf_j
  • χ²: w_jk = (tf_jk − m_jk)² / m_jk if tf_jk > m_jk, else 0 (Hovy & Lin, 1997)
  • tf_jk = count of term j in text k ("waiter": often only in some texts)
  • idf_j = log(N / n_j), the within-collection frequency ("the": often in all texts); n_j = number of docs with term j; N = total number of documents
  • tf.idf is the best for IR, among 287 methods (Salton & Buckley, 1988)
  • m_jk = (Σ_j tf_jk · Σ_k tf_jk) / Σ_jk tf_jk = mean count for term j in text k
  • likelihood ratio λ: −2 log λ = 2N · I(R; T) (Lin & Hovy, 2000) (more appropriate for sparse data; −2 log λ is asymptotic to χ²)
  • N = total number of terms in corpus
  • I = mutual information between text relevance R and given term T: I(R; T) = H(R) − H(R | T), for H(R) = entropy of terms over relevant texts R and H(R | T) = entropy of term T over relevant and non-relevant texts
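The tf.idf and χ² weights above can be computed directly. The toy term-by-document counts below are invented for illustration:

```python
import math

# Toy counts: docs[k] maps term j -> tf_jk, the count of term j in text k.
docs = [
    {"the": 5, "waiter": 3, "menu": 2},
    {"the": 4, "stock": 3, "market": 2},
    {"the": 6, "waiter": 1, "tip": 2},
]
N = len(docs)

def idf(term):
    """idf_j = log(N / n_j), where n_j = number of docs containing term j."""
    n_j = sum(1 for d in docs if term in d)
    return math.log(N / n_j) if n_j else 0.0

def tfidf(k, term):
    """w_jk = tf_jk * idf_j."""
    return docs[k].get(term, 0) * idf(term)

def chi2_weight(k, term):
    """w_jk = (tf_jk - m_jk)^2 / m_jk if tf_jk > m_jk, else 0,
    where m_jk is the expected (mean) count of term j in text k."""
    total = sum(sum(d.values()) for d in docs)
    term_total = sum(d.get(term, 0) for d in docs)
    doc_total = sum(docs[k].values())
    m_jk = term_total * doc_total / total
    tf_jk = docs[k].get(term, 0)
    return (tf_jk - m_jk) ** 2 / m_jk if tf_jk > m_jk else 0.0

print(tfidf(0, "waiter"), tfidf(0, "the"))  # topical term outweighs "the"
print(chi2_weight(0, "waiter"))
```

Note how "the", appearing in every text, gets idf = log(3/3) = 0 and so zero weight, exactly the behavior the slide's "waiter" vs. "the" contrast describes.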

45
Early signature study (Hovy & Lin '97)
  • Corpus
  • Training set: WSJ 1987, 16,137 texts (32 topics)
  • Test set: WSJ 1988, 12,906 texts (31 topics)
  • Texts indexed into categories by humans
  • Signature data
  • 300 terms each, using tf.idf
  • Word forms: single words, demorphed words, multi-word phrases
  • Topic distinctness...
  • Topic hierarchy

46
Evaluating signatures
  • Solution: Perform text categorization task
  • Create N sets of texts, one per topic
  • Create N topic signatures TSk
  • For each new document, create document signature DSi
  • Compare DSi against all TSk; assign document to best match
  • Match function: vector space similarity measure
  • Cosine similarity: cos θ = (TSk · DSi) / (|TSk| |DSi|)
  • Test 1 (Hovy & Lin, 1997, 1999)
  • Training set: 10 topics; 3,000 texts (TREC)
  • Contrast set (background): 3,000 texts
  • Conclusion: tf.idf and χ² signatures work OK but depend on signature length
  • Test 2 (Lin & Hovy, 2000)
  • 4 topics; 6,194 texts; uni/bi/trigram signatures
  • Evaluated using SUMMARIST: λ > tf.idf
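The categorization loop amounts to a nearest-signature match under cosine similarity. A minimal sketch, with invented signature weights:

```python
import math

def cosine(a, b):
    """cos θ = (a · b) / (|a| |b|) for sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def categorize(doc_sig, topic_sigs):
    """Assign a document signature DSi to the best-matching topic signature TSk."""
    return max(topic_sigs, key=lambda t: cosine(doc_sig, topic_sigs[t]))

# Toy topic signatures (term -> weight), e.g. from tf.idf or chi-square.
topic_sigs = {
    "restaurant": {"waiter": 2.1, "menu": 1.7, "tip": 1.2},
    "finance": {"stock": 2.4, "market": 1.9, "rates": 1.1},
}
doc = {"waiter": 1.0, "tip": 0.5, "market": 0.2}
print(categorize(doc, topic_sigs))   # -> restaurant
```
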

47
Text pollution on the web
Goal: Create word families (signatures) for each concept in the Ontology. Get texts from the Web. Main problem: text pollution. What's the search term?
Purifying: in later work, used Latent Semantic Analysis
48
Purifying with Latent Semantic Analysis
  • Technique used by psychologists to determine basic cognitive conceptual primitives (Deerwester et al., 1990; Landauer et al., 1998).
  • Singular Value Decomposition (SVD) used for text categorization, lexical priming, language learning
  • LSA automatically creates collections of items that are correlated or anti-correlated, with strengths
  • ice cream, drowning, sandals → summer
  • Each such collection is a semantic primitive in terms of which objects in the world are understood.
  • We tried LSA to find the most reliable signatures in a collection and reduce the number of signatures in the contrast set.

49
LSA for signatures
  • Create matrix A, one signature per column (words × topics).
  • Apply SVDPACK to compute the decomposition A = U Σ Vᵀ.
  • Use only the first k of the new "concepts": Σ′ = σ1, σ2, ..., σk.
  • Create matrix A′ out of these k vectors: A′ = U Σ′ Vᵀ ≈ A.
  • A′ is a new (words × topics) matrix, with different weights and new "topics". Each column is a purified signature.
  • U = m × n orthonormal matrix of left singular vectors that span the space
  • Vᵀ = n × n orthonormal matrix of right singular vectors
  • Σ = diagonal matrix with exactly rank(A) nonzero singular values σ1 > σ2 > ... > σn
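The rank-k purification step can be sketched with NumPy's SVD (assuming numpy is available; the words × topics matrix below is invented):

```python
import numpy as np

def purify(A, k):
    """Rank-k SVD reconstruction A' = U_k S_k V_kᵀ; each column of A'
    is a 'purified' signature in the (words x topics) matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Toy weight matrix: 3 words (rows) x 3 topic signatures (columns).
A = np.array([[2.0, 1.9, 0.1],
              [1.5, 1.4, 0.0],
              [0.1, 0.2, 2.2]])
A_purified = purify(A, 2)
print(np.round(A_purified, 2))
```

Keeping only the top k singular values discards the low-strength directions that mostly carry noise, which is the "purification" effect the slide describes.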
50
Some results with LSA (Hovy and Junk '99)
  • Contrast set (for idf and χ²): set of documents on a very different topic, for good idf.
  • Partitions: collect documents within each topic set into partitions, for faster processing. /n is a collecting parameter.
  • U function: function for creation of LSA matrix.
  • Results:
  • Demorphing helps.
  • χ² better than tf and tf.idf.
  • LSA improves results, but not dramatically.

TREC texts
51
Weak semantics: Signature for every concept
  • Procedure:
  • 1. Create query from Ontology concept (word + defn. words)
  • 2. Retrieve 5,000 documents (8 web search engines)
  • 3. Purify results (remove duplicates, html, etc.)
  • 4. Extract word family (using tf.idf, χ², LSA, etc.)
  • 5. Purify
  • 6. Compare to siblings and parents in the Ontology
  • Problem: raw signatures overlap
  • average parent-child node overlap: 50%
  • Bakery/Edifice: 35%: too far; missing generalization
  • Airplane/Aircraft: 80%: too close?
  • Remaining problem: web signatures still not pure...
  • WordNet: In 2002–04, Agirre and students (U of the Basque Country) built signatures for all WordNet nouns

52
Recent work using signatures
  • Multi-document summarization (Lin and Hovy, 2002)
  • Create λ signature for each set of texts
  • Create IR query from signature terms; use IR to extract sentences
  • (Then filter and reorder sentences into single summary)
  • Performance: DUC-01 tied first; DUC-02 tied second place
  • Wordsense disambiguation (Agirre, Ansa, Martinez, Hovy, 2001)
  • Try to use WordNet concepts to collect text sets for signature creation (word+synonym > def-words > word .AND. synonym .NEAR. def-word > etc.)
  • Built competing signatures for various noun senses: (a) WordNet synonyms; (b) SemCor tagged corpus (χ²); (c) web texts (χ²); (d) WSJ texts (χ²)
  • Performance: Web signatures > random, WordNet baseline
  • Email clustering (Murray and Hovy, 2004)
  • Social Network Analysis: Cluster emails and create signatures
  • Infer personal expertise, project structure, experts omitted, etc.
  • Corpora: ENRON (240K emails), ISI corpus, NSF eRulemaking corpus

53
Semantics from signatures
  • Assuming we can create signatures and use them in
    some applications
  • How to integrate signatures into an ontology?
  • How to employ signatures in inheritance,
    classification, inference, and other operations?
  • How to compose signatures into new concepts?
  • How to match signatures across languages?
  • How do signatures change?

54
Talk overview
55
5a. Instance level: Harvesting instances from text
(This work with Michael Fleischman)
56
What kinds of knowledge?
  • Goal 1: Add instantial knowledge
  • Sofia is a city
  • Sofia is a woman's name
  • Cleopatra was a queen
  • Everest is a mountain
  • Varig is an airline company
  • Goal 2: Add definitional / descriptive knowledge
  • Mozart was born in 1756
  • Bell invented the telephone
  • Pisa is in Italy
  • The Leaning Tower is in Pisa
  • Columbus discovered America

Create links between concepts
  • Uses
  • QA (answer suggestion and validation)
  • Wordsense disambiguation for MT
  • Sources
  • Existing lists (CIA factbook, atlases, phone
    books)
  • Dictionaries and encyclopedias
  • The Web

Classify instances under types
57
Learning about locations (Fleischman '01)
  • Challenge ex.: region, state/territory, or city?
  • "The company, which is based in Dpiyj Dsm Gtsmdodvp, Vsaog., said an antibody prevented development of paralysis."
  • "The situation has been strained ever since Yplup began waging war in Dpiyj Rsdy Sdos."
  • "The pulp and paper operations moved to Dpiyj Vstpaomos in 1981."
  • Try to learn instances of 8 types (country, region, territory, city, street, artifact, mountain, water)
  • (we have lists of these already, so finding sentences for training data is easy).
  • Uses
  • QA: corroborating evidence for answer.
  • IR: query expansion and signature enrichment.
  • South San Francisco = region; Calif. = state; Tokyo = city; South East Asia = region; South Carolina = state

58
Learning procedure
  • Approach
  • Training: For each location, identify features in context; try to learn features that indicate each type
  • Usage: For new material, use learned features to classify type of location; place results with high confidence into ontology
  • Training
  • Applied BBN's IdentiFinder to bracket locations
  • Chose 88 features (unigrams, bigrams, trigrams in fixed positions before and after location instance; later added signatures, etc.)
  • 3 approaches: Bayesian classifier, neural net, decision tree (C4.5)
  • MemRun procedure: store examples if good (> THRESH1) and prefer stored info later if unsure (< THRESH2)

59
Memrun
  • Initial results
  • Bayesian classifier not very accurate; neural net OK.
  • D-tree better, but still multiple classes for each instance.
  • Memrun: record best example of each instance.
  • Algorithm with Memrun
  • Pass 1: for each text,
  • preprocess with POS tagger and IdentiFinder
  • apply D-tree to classify instance
  • if score > THRESH1, save (instance, tag, score) in Memrun
  • Pass 2: for each text,
  • again apply D-tree
  • if score < THRESH2, replace tag by Memrun value
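The two-pass procedure can be sketched as follows. The classify() function, its confidence scores, and the thresholds are hypothetical stand-ins for the decision-tree classifier in the slides:

```python
THRESH1, THRESH2 = 0.8, 0.5   # illustrative values, not the tuned ones

def memrun(mentions, classify):
    """Two-pass MemRun sketch over (instance, context) mentions."""
    memory = {}                               # instance -> (tag, best score)
    # Pass 1: remember classifications scoring above THRESH1.
    for instance, context in mentions:
        tag, score = classify(instance, context)
        if score > THRESH1 and score > memory.get(instance, ("", 0.0))[1]:
            memory[instance] = (tag, score)
    # Pass 2: reclassify; if unsure (below THRESH2), fall back on memory.
    results = []
    for instance, context in mentions:
        tag, score = classify(instance, context)
        if score < THRESH2 and instance in memory:
            tag = memory[instance][0]
        results.append((instance, tag))
    return results

# Toy classifier: confident in one context, unsure in the other.
table = {("Tokyo", "based in"): ("city", 0.95),
         ("Tokyo", "talks with"): ("country", 0.30)}
def classify(instance, context):
    return table[(instance, context)]

print(memrun([("Tokyo", "based in"), ("Tokyo", "talks with")], classify))
# Both mentions end up tagged "city": the confident mention overrides the unsure one.
```
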

60
Examples
Water Abuna River Adriatic Adriatic Sea Adriatic
sea Aegean Sea Aguapey river Aguaray
River Akhtuba River Akpa Yafe River Akrotiri salt
lake Aksu River Alma-Atinka River Almendares
River Alto Maranon River Amazon River Amur
river Andaman Sea Angara River Angrapa river Anna
River Arabian Gulf Arabian Sea ...
City Aachen Abadan Abassi Madani Abbassi
Madani Abbreviations AZ Abdullojonov Aberdeen Abid
jan Abidjan Radio Cote d'Ivoire
Chaine Abiko Abrahamite Abramenkov Abu
Dhabi Abuja Abyssinia Acari Accom Accordance ...
Territory General Robles Ghanaians Gilan
Province Gilan Province Sha'ban Gitega
Province Glencore Goias State Goias State of
Brazil Gongola State Granma Province Great
Brotherly Russia Greytown Guanacaste
Province Guandong province Guangdong
Province Guangxi Province Guangzhou
Shipyards Guantanamo Province Guayas
Province Guerrero State Guiliano Amato Guizhou
Province Gwent ...
Mountain Wicklow Mountains Wudang
Mountain Wudangshan Mountain Wuling
mountains Wuyi Mountains Xiao Hinggan
Mountains Yimeng Mountains Zamborak
mountain al-Marakishah mountain al-Maraqishah
mountains al-Nubah mountains al-Qantal mountain
61
Results for locations
NB: test samples are small
THRESH1: 77; THRESH2: 98
62
People (Fleischman and Hovy '02)
  • Goal: Collecting training data about 8 types of people: politicians, entertainers (movie stars, etc.), athletes, businesspeople...
  • Procedure as before, with added features using signature of each category and WordNet hypernyms.

athlete 458.029 398.626 perez 392.904 rogers 368.441 carlos 351.686 points 333.083 roy 311.042 andres 284.927 scored 273.197 chris 252.172 hardee's 239.747 george 223.879 games 222.202 mark 217.711 mike ...
cleric 1133.793 rabbi 1074.785 cardinal 1011.190 paul 809.128 archbishop 798.372 john 748.170 bishop 714.173 catholic 688.291 church 613.625 roman 610.287 tutu 584.720 desmond 460.057 pope 309.923 kahane 300.236 meir ...
entertainer 1902.178 " 1573.695 actor 1083.622 actress 721.929 movie 618.947 george 607.466 film 553.659 singer 541.235 president 536.962 her 536.856 keating 528.226 star 448.524 ( 433.065 ) 404.008 said ...
businessperson 4428.267 greenspan 3999.135 alan 2774.682 reserve 2429.129 chairman 1786.783 federal 1709.120 icahn 1665.358 fed 1252.701 carl 827.0291 board 682.420 rates 662.510 investor 651.529 twa 531.907 kerkorian 522.072 interest ...
63
Some results for people
Total count: 1030; Total correct: 839 (0.815); Total incorrect: 191 (0.185)
misc: 0/20 (0.0); lawyer: 13/44 (0.295); police: 11/17 (0.647); doctor: 48/50 (0.96); entertainer: 150/173 (0.867); athlete: 11/13 (0.846); business: 120/166 (0.722); military: 14/21 (0.666); clergy: 11/11 (1.0); politician: 461/515 (0.895)
  • Best results using signatures and WordNet hypernyms (but no synset expansion).
  • Problems:
  • Training and test data skewed.
  • Genuine ambiguity: often, politician = military leader.

64
Instance extraction (Fleischman & Hovy '03)
  • Goal: extract all instances from the web
  • Method:
  • Download text from web (15 GB)
  • Identify named entities (BBN's IdentiFinder (Bikel et al. '93))
  • Extract ones with descriptive phrases (<APOS>, <CN/PN>)
  • ("the vacuum manufacturer Horeck" / "Saddam's physician Abdul")
  • Cluster them, and categorize in ontology
  • Result: over 900,000 instances
  • Average 2 mentions per instance; 40 for George W. Bush
  • Evaluation:
  • Tested with 200 "who is X?" questions
  • Better than TextMap: 25% more
  • Faster: 10 sec vs. 9 hr!
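A rough sketch of the <CN/PN>-style harvesting step. The regular expression and sentences below are illustrative only, not the actual ISI extraction grammar:

```python
import re

# "the <common-noun phrase> <ProperName>" apposition-style pattern:
# one or more lowercase words after "the", followed by a capitalized name.
CN_PN = re.compile(r"\bthe ([a-z]+(?: [a-z]+)*) ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")

def harvest(sentences):
    """Return (instance, category term) pairs found by the pattern."""
    pairs = []
    for s in sentences:
        for cn, pn in CN_PN.findall(s):
            pairs.append((pn, cn))
    return pairs

sents = [
    "He met the vacuum manufacturer Horeck on Tuesday.",
    "Reporters questioned the physician Abdul after the visit.",
]
print(harvest(sents))
# [('Horeck', 'vacuum manufacturer'), ('Abdul', 'physician')]
```

A real pipeline would run a named-entity tagger first and use the bracketed entities rather than a capitalization heuristic, then cluster the harvested pairs before categorizing them in the ontology.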

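The descriptive-phrase extraction step above can be sketched with a toy regular expression; the pattern below is a simplification I am assuming for illustration, not the actual IdentiFinder-based pipeline (which used a real named-entity tagger):

```python
import re

# Toy stand-in for the <CN/PN> pattern: a lowercase descriptive
# common-noun phrase immediately preceding a capitalized proper name,
# as in "the vacuum manufacturer Horeck".
CN_PN = re.compile(
    r"\bthe ([a-z]+(?: [a-z]+)*) ((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)"
)

def extract_instances(text):
    """Return (description, name) pairs found by the shallow pattern."""
    return CN_PN.findall(text)

pairs = extract_instances("They met the vacuum manufacturer Horeck in Texas.")
# pairs -> [("vacuum manufacturer", "Horeck")]
```

A real system would, as the slide says, then cluster the extracted descriptions and categorize each instance in the ontology.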
65
Talk overview
  • Introduction Semantics and the Semantic Web
  • Approach General methodology for building the
    resources
  • Ontology framework Terminology ontology as start
  • Creating Omega recent work on connecting
    ontologies
  • Concept level terms and relations
  • Learning concepts by clustering
  • Learning and using concept associations
  • Instance level instances and more
  • Harvesting instances from text
  • Harvesting relations
  • Corpus manual shallow semantic annotation
  • OntoNotes project
  • Conclusion

66
5b. Instance level: Harvesting relations
(This work with Deepak Ravichandran, Donghui
Feng, and Patrick Pantel)
67
Shallow patterns for information
  • Goal: learn relationship data from the web
  • (when was someone born? where does he live?)
  • Procedure: automatically learn word-level
    patterns
  • When was Mozart born?
  • Mozart (1756–1791)
  • <NAME> ( <BIRTHYEAR> – <DEATHYEAR> )
  • Apply patterns to Omega concepts/instances
  • Evaluation: test in TREC QA competition
  • Main problem: learning patterns
  • (In TREC QA 2001, Soubbotin and Soubbotin got a
    very high score with over 10,000 patterns built
    by hand)

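Applying one learned surface pattern such as "<NAME> ( <BIRTHYEAR>" can be sketched as follows; the helper name and the regex are assumptions for illustration:

```python
import re

def find_birthyear(name, text):
    """Instantiate the <NAME> slot of the learned pattern
    '<NAME> ( <BIRTHYEAR>' and capture the year after the paren."""
    pattern = re.escape(name) + r"\s*\(\s*(\d{4})"
    m = re.search(pattern, text)
    return m.group(1) if m else None

year = find_birthyear("Mozart",
                      "Wolfgang Amadeus Mozart (1756-1791) was a composer.")
# year -> "1756"
```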
68
Learning extraction patterns from the web
(Ravichandran and Hovy 02)
  • Prepare
  • Select example for target relation: Q term
    (Mozart) and A term (1756)
  • Collect data
  • Submit Q and A terms as queries to a search
    engine (AltaVista)
  • Download top 1000 web documents
  • Preprocess
  • Apply a sentence breaker to the documents
  • Retain only sentences with both Q and A terms
  • Pass retained sentences through suffix tree
    constructor
  • Select and create patterns
  • Filter each phrase in the suffix tree to retain
    only those phrases that contain both Q and A
    terms
  • Replace the Q term by the tag <NAME> and the A
    term by the tag <ANSWER>

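The collect/filter/generalize steps above can be sketched in miniature. This toy version keeps whole sentences and substitutes the tags directly, instead of building a suffix tree over phrases as the real procedure does:

```python
def learn_patterns(sentences, q_term, a_term):
    """Toy stand-in for the suffix-tree step: keep each sentence
    containing both the Q and A terms, and generalize it by
    substituting the <NAME> and <ANSWER> tags."""
    patterns = set()
    for s in sentences:
        if q_term in s and a_term in s:
            patterns.add(s.replace(q_term, "<NAME>")
                          .replace(a_term, "<ANSWER>"))
    return patterns

pats = learn_patterns(
    ["Mozart (1756-1791) was a composer.", "Mozart was born in 1756."],
    "Mozart", "1756",
)
```

The real algorithm extracts the shortest recurring substrings around the two terms, which is why it needs the suffix tree rather than whole sentences.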
69
Some results
  • BIRTHYEAR
  • 1.0  <NAME> (<ANS>
  • 0.85 <NAME> was born on <ANS>
  • 0.6  <NAME> was born in <ANS>
  • DEFINITION
  • 1.0  <NAME> and related <ANS>s
  • 1.0  <ANS> (<NAME>,
  • 0.9  as <NAME>, <ANS> and
  • LOCATION
  • 1.0  <ANS>s <NAME>.
  • 1.0  regional <ANS> <NAME>
  • 0.9  the <NAME> in <ANS>,
  • Testing (TREC-10 questions):

Question type   Num Qs   TREC MRR   Web MRR
BIRTHYEAR            8     0.479      0.688
INVENTOR             6     0.167      0.583
DISCOVERER           4     0.125      0.875
DEFINITION         102     0.345      0.386
WHY-FAMOUS           3     0.667      0.0
LOCATION            16     0.75       0.864
70
Regular expressions (Ravichandran et al. 2004)
  • New process: learn regular expression patterns
  • Results: over 2 million instances from 15GB
    corpus
  • Complexity: O(y^2), for max string length y
  • Later work: downloaded and cleaned 1 TB of text
    from the web; created 119MB corpus, used for
    additional learning of N-N compounds

71
Comparing clustering and surface patterns
  • Precision: took 50 random words, each with the
    system's learned superconcepts (top 3 of system);
    added top 3 from WordNet, 1 human superconcept.
    Used 2 judges (Kappa 0.78–0.85)
  • Recall: Relative Recall = Recall_Patt /
    Recall_Co-occ = C_P / C_C
  • TREC-03 defns: Patt up to 52%, Co-Occ up to 44%
    MRR

Precision (correct + partial) and Relative Recall:

             Pattern System          Co-occurrence System
Training   Prec   Top-3   MRR      Prec   Top-3   MRR
1.5MB      56.6   60.0    60.0     12.4   20.0    15.2
15MB       57.3   63.0    61.0     23.2   50.0    37.3
150MB      50.7   56.0    55.0     60.6   78.0    73.2
1.5GB      52.6   51.0    51.0     69.7   93.0    85.8
15GB       61.8   69.0    67.5     78.7   92.0    86.2
150GB      67.8   67.0    65.0     Too large to process
(Ravichandran and Pantel 2004)
72
Relation extraction from a small corpus
  • The challenge: apply RegExp pattern induction to
    a small corpus (Chemistry textbook) (Pantel and
    Pennacchiotti 06)

73
Espresso procedure (Pantel and
Pennacchiotti 06)
  • Phase 1: Pattern Extraction, like Ravichandran's,
    using MI (mutual information)
  • Measure reliability based on an approximation of
    pattern recall
  • Phase 2: Instance Extraction
  • Instantiate all patterns to extract all possible
    instances
  • Identify generic patterns using Google redundancy
    check with previously accepted patterns
  • Measure reliability of each instance
  • Select top-K instances
  • Phase 3: Instance Expansion (if too few instances
    extracted in phase 2)
  • Syntactic: drop nominal mods
  • proton is-a small particle → proton is-a
    particle
  • WordNet: expand using hypernyms
  • hydrogen is-a element → nitrogen is-a element
  • Web: apply patterns to the Web to extract
    additional instances
  • Phase 4: Axiomatization (transform relations into
    axioms in HNF)
  • e.g., R is-a S becomes R(x) ⇒ S(x)
  • e.g., R part-of S becomes (∀x)R(x) ⇒
    (∃y)S(y) ∧ part-of(x,y)

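The reliability scoring in Phases 1 and 2 can be sketched as a PMI-weighted average in the spirit of Espresso; the exact formula, function names, and counts below are illustrative assumptions, not the paper's implementation:

```python
import math

def pmi(c_pi, c_p, c_i, total):
    """Pointwise mutual information between a pattern p and an
    instance i, from joint count c_pi and marginal counts c_p, c_i."""
    return math.log((c_pi * total) / (c_p * c_i))

def pattern_reliability(counts, instance_rel, total, max_pmi):
    """Espresso-style pattern reliability: average the PMI between the
    pattern and each instance, weighted by that instance's own
    reliability and normalized by the maximum observed PMI."""
    score = 0.0
    for (c_pi, c_p, c_i), r_i in zip(counts, instance_rel):
        score += (pmi(c_pi, c_p, c_i, total) / max_pmi) * r_i
    return score / len(counts)

# One pattern seen 10 times with a fully reliable instance.
r = pattern_reliability([(10, 20, 50)], [1.0], 1000, math.log(10.0))
# r -> 1.0
```

Instance reliability is defined symmetrically over patterns, so the two scores bootstrap each other across iterations.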
74
IE by pattern
(Feng, Ravichandran, Hovy 2005)
  • Why not Gorbachev? (gender)
  • Why not Mrs. Roosevelt? (period)
  • Why not Maggie Thatcher? (home?)
  • Which semantics to check?
75
Talk overview
  • Introduction Semantics and the Semantic Web
  • Approach General methodology for building the
    resources
  • Ontology framework Terminology ontology as start
  • Creating Omega recent work on connecting
    ontologies
  • Concept level terms and relations
  • Learning concepts by clustering
  • Learning and using concept associations
  • Instance level instances and more
  • Harvesting instances from text
  • Harvesting relations
  • Corpus manual shallow semantic annotation
  • OntoNotes project
  • Conclusion

76
6. OntoNotes Creating a Semantic Corpus by
Manual Annotation
(This work with Ralph Weischedel (BBN), Martha
Palmer (U Colorado), Mitch Marcus (UPenn), and
various colleagues)
77
Corpus creation by annotation
  • Goal create corpus of (sentence semantic rep)
    pairs
  • Use enable machine learning algorithms to do
    this
  • Process humans add information into sentences
    (and their parses)
  • Recent projects

(figure: recent annotation projects, arranged from shallow to deep
annotation levels)
  • Annotation levels: syntax, word senses, verb frames, noun frames,
    coref links, ontology
  • Projects: Penn Treebank (Marcus et al. 99), NomBank (Myers et al.
    03), PropBank (Palmer et al. 03), Prague Dependency Treebank
    (Hajic et al. 02), TIGER/SALSA Bank (Pinkal et al. 04), FrameNet
    (Fillmore et al. 04), I-CAB, Greek banks, Interlingua Annotation
    (Dorr et al. 04), OntoNotes (Weischedel et al. 05)
78
Antecedents
(figure: resources feeding into OntoNotes: Treebank, PropBank2,
PropBank frames, NomBank frames, VerbNet, WordNet, FrameNet,
Salsa-German, Prague-Czech, with Chinese and Arabic variants,
contributing sense tags, coreference, and ontology links)
79
OntoNotes large-scale annotation
  • Partners: BBN (Weischedel), U of Colorado
    (Palmer), U of Penn (Marcus), ISI (Hovy)
  • Goal: in 4 years, annotate nouns, verbs, and
    corefs in 1 million words of English, Chinese, and
    Arabic text
  • Manually provide semantic symbols for nouns,
    verbs, adjs, advs
  • Manually connect sentence structure in verb and
    noun frames
  • Manually link anaphoric references
  • Validation: inter-annotator agreement of 90%
  • Outcomes (2004):
  • PropBank verb annotation procedure developed
  • Pilot corpus built, with coref annotation
  • New project started October 2005 (English,
    Chinese; Arabic in 2006)
  • Potential for the near future: semantics bank
  • May energize lots of research on semantic
    analysis, reps, etc.
  • May enable semantics-based IR, QA, MT, etc.

80
OntoNotes representation of literal meaning
"The founder of Pakistan's nuclear
department Abdul Qadeer Khan has admitted he
transferred nuclear technology to Iran, Libya,
and North Korea"

P1: type: Person3, name: Abdul Qadeer Khan
P2: type: Person3, gender: male
P3: type: Know-How4
P4: type: Nation2, name: Iran
P5: type: Nation2, name: Libya
P6: type: Nation2, name: N. Korea
X0: act: Admit1, speaker: P1, saying: X1
X1: act: Transfer2, agent: P2, patient: P3,
    dest: (P4 P5 P6)
coref: P1 = P2
(slide credit to M. Marcus and R. Weischedel,
2004)
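This predicate-argument structure can be encoded as plain data to show how the coref link does real work; the dict encoding is my assumption, and the "saying" slot is read here as pointing at the transfer event X1:

```python
# Entities, events, and coref links from the slide, as plain dicts.
entities = {
    "P1": {"type": "Person3", "name": "Abdul Qadeer Khan"},
    "P2": {"type": "Person3", "gender": "male"},
    "P3": {"type": "Know-How4"},
    "P4": {"type": "Nation2", "name": "Iran"},
    "P5": {"type": "Nation2", "name": "Libya"},
    "P6": {"type": "Nation2", "name": "N. Korea"},
}
events = {
    "X0": {"act": "Admit1", "speaker": "P1", "saying": "X1"},
    "X1": {"act": "Transfer2", "agent": "P2", "patient": "P3",
           "dest": ["P4", "P5", "P6"]},
}
coref = [("P1", "P2")]

def resolve_name(pid):
    """Follow coref links until an entity carrying a name is found."""
    seen = {pid}
    ent = entities[pid]
    while "name" not in ent:
        for a, b in coref:
            other = b if a == pid else a if b == pid else None
            if other and other not in seen:
                pid, ent = other, entities[other]
                seen.add(other)
                break
        else:
            return None
    return ent["name"]

# P2 (the transfer's agent) has no name; coref to P1 supplies it.
transfer_agent = resolve_name(events["X1"]["agent"])
# transfer_agent -> "Abdul Qadeer Khan"
```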
81
Even so: many words untouched!
  • WSJ1428
  • OPEC's ability to produce more petroleum than it
    can sell is beginning to cast a shadow over world
    oil markets. Output from the Organization of
    Petroleum Exporting Countries is already at a
    high for the year and most member nations are
    running flat out. But industry and OPEC
    officials agree that a handful of members still
    have enough unused capacity to glut the market
    and cause an oil-price collapse a few months from
    now if OPEC doesn't soon adopt a new quota system
    to corral its chronic cheaters. As a result, the
    effort by some oil ministers to get OPEC to
    approve a new permanent production-sharing
    agreement next month is taking on increasing
    urgency. The organization is scheduled to meet
    in Vienna beginning Nov. 25. So far this year,
    rising demand for OPEC oil and production
    restraint by some members have kept prices firm
    despite rampant cheating by others. But that
    could change if demand for OPEC's oil softens
    seasonally early next year as some think may
    happen. OPEC is currently producing more than 22
    million barrels a day, sharply above its nominal,
    self-imposed fourth-quarter ceiling of 20.5
    million, according to OPEC and industry officials
    at an oil conference here sponsored by the Oil
    Daily and the International Herald Tribune. At
    that rate, a majority of OPEC's 13 members have
    reached their output limits, they said.

82
OntoNotes annotation: The 90% Solution
  • 1. Sense creation
  • Expert creates meaning options (shallow semantic
    senses) for verbs, nouns, adjs, advs; follows
    PropBank (Palmer et al.)
  • At same time, creates concepts and
    organizes/refines Omega ontology content and
    structure
  • 2. Sense annotation: process goes by word, across
    docs. Process developed in PropBank. Annotators
    manually:
  • See each sentence in corpus containing the
    current word (noun, verb, adjective, adverb) to
    annotate
  • Select appropriate senses (= ontology concepts)
    for each one
  • Connect frame structure (for each verb and
    relational noun)
  • 3. Coref annotation: process goes by doc.
    Annotators:
  • Connect co-references within each doc
  • Constant validation: require 90% inter-annotator
    agreement

83
Sense annotation procedure
  • Sense creator first creates senses for a word
  • Loop 1
  • Manager selects next nouns from sensed list and
    assigns annotators
  • Programmer randomly selects 50 sentences and
    creates initial Task File
  • Annotators (at least 2) do the first 50
  • Manager checks their performance
  • If ≥90% agreement and few or no NoneOfAbove: send
    on to Loop 2
  • Else: Adjudicator and Manager identify reasons,
    send back to Sense creator to fix senses and defs
  • Loop 2
  • Annotators (at least 2) annotate all the
    remaining sentences
  • Manager checks their performance
  • If ≥90% agreement and few or no NoneOfAbove: send
    to Adjudicator to fix the rest
  • Else: Adjudicator annotates differences
  • If Adj agrees with one Annotator ≥90%, then
    ignore the other Annotator's work (assume a bad
    day for the other); else if Adj agrees with both
    about equally often, then assume bad senses and
    send the problematic ones back to Sense creator

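The 90% gates in both loops rest on a simple inter-annotator agreement measure; a minimal sketch (the sense labels here are made up):

```python
def percent_agreement(tags_a, tags_b):
    """Fraction of tokens on which two annotators chose the same sense."""
    matches = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return matches / len(tags_a)

# Ten sentences of the same word, tagged by two annotators.
a = ["s1", "s1", "s2", "s1", "s1", "s1", "s1", "s1", "s1", "s1"]
b = ["s1", "s1", "s2", "s2", "s1", "s1", "s1", "s1", "s1", "s1"]
agreement = percent_agreement(a, b)   # 0.9
passes_gate = agreement >= 0.90       # True: send on to Loop 2
```

Published OntoNotes work also reports chance-corrected measures like Cohen's kappa alongside raw agreement; the gate sketched here uses the raw figure from the slide.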
84
Pre-OntoNotes test: can it be done?
  • Annotation process and tools developed and tested
    in PropBank (Palmer et al., U Colorado)
  • Typical results (10 words of each type, 100
    sentences each):

        tagger agreement     senses            time (min/100 tokens)
        R1 → R2 → R3         R1 → R2 → R3      R1 → R2 → R3
verbs   .76 → .86 → .91      4.5 → 5.2 → 3.8   30 → 25 → 25
nouns   .71 → .85 → .95      7.3 → 5.1 → 3.3   28 → 20 → 15
adjs    .87 →     → .90      2.8 →     → 5.5   24 →     → 18

(by comparison: agreement using WordNet senses is
~70%)
85
Creating the senses
  • Use the 90% rule to limit degree of delicacy
  • See if annotators can agree
  • Perform manual insertion
  • After manual creation, get annotator feedback:
  1. Should you create the sense? How many must there
    be?
  2. Is the term definition adequate?
  3. Where should the term go relative to the other
    terms? (species)
  4. What is unique/different about this term?
    (differentium/-ae)

How to do this systematically? Developed method
of graduated refinement using creation of sense
treelets with differentiae
86
Noun and verb sense creation
  • Performed by Ann Houston in Boston (who also does
    verb sense creation)
  • Sense groupings created:
  • 4 nouns per day sense-created
  • Max: "head", with 15 senses
  • Verb procedure creates senses by grouping WordNet
    senses (PropBank)
  • Noun procedure taxonomizes senses into treelets,
    with differentiae at each level, for insertion
    into ontology

<inventory lemma="price-n">
  <sense n="1" type="" name="cost or monetary value of goods or services" group="1">
    <diff> quantity monetary_value </diff>
    <comment> PRICE of NP -> NP's good/service; PRICE = exchange_value </comment>
    <examples>
      The price of gasoline has soared lately.
      I don't know the prices of these two fur coats.
      The museum would not sell its Dutch Masters collection for any price.
      The cattle thief has a price on his head in Maine.
      They say that every politician has a price.
    </examples>
    <mappings>
      <wn version="2.1">1,2,4,5,6</wn>
      <omega> </omega>
    </mappings>
  </sense>
  <sense n="2" type="" name="sacrifice required to achieve something" group="1">
    <diff> activity complex effort </diff>
    <comment> PRICE = effort; PREP(of/for)/SCOMP NP = goal/result </comment>
    <examples>
      John has paid a high price for his risky life style.
    </examples>
  </sense>
</inventory>

Sense treelet with differentiae:
PRICE
  abstract > quantity > monetary_value (group 1)
  physical > activity > complex (not a single event or action) > effort (group 2)
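A sense-inventory entry like this can be read with a standard XML parser; the snippet below uses a trimmed stand-in entry rather than the full inventory:

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for one sense-inventory entry.
xml = """<inventory lemma="price-n">
  <sense n="1" type="" name="cost or monetary value of goods or services" group="1">
    <mappings><wn version="2.1">1,2,4,5,6</wn></mappings>
  </sense>
</inventory>"""

root = ET.fromstring(xml)
# Map each sense number to its gloss, and pull the WordNet mapping.
senses = {s.get("n"): s.get("name") for s in root.iter("sense")}
wn_mapping = root.find(".//wn").text.split(",")
# wn_mapping -> ["1", "2", "4", "5", "6"]
```

Grouping several WordNet senses under one OntoNotes sense, as in the `<wn>` element, is exactly what makes the 90% agreement target reachable.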
87
Word senses: from lexemes to concepts
  • Lexical space
  • hang
  • call
  • Sense space
  • hang-hanged
  • hang-hung
  • summon: "they called them home"
  • name: "he is called Joe"
  • phone: "she called her mother"
  • name2: "he called her a liar"
  • describe: "she called him ugly"
  • Concept space
  • Cause-to-die
  • Suspend-body
  • Summon
  • Name-Describe
  • Phone
  • How many concepts?
  • How to relate senses to concepts?

88
Omega after OntoNotes
  • Current Omega
  • 120,000 concepts; Middle Model mostly WordNet
  • Essentially no formally defined features
  • Post-OntoNotes Omega
  • 60,000 concepts? (the 90% rule)
  • Each concept a sense cluster, defined with
    features
  • Each concept linked to many example sentences
  • What problems do we face?
  • Sense-to-concept compression
  • Cross-sense identification
  • Multiple languages' senses
  • etc.

89
7. Conclusion
90
Summary: Obtaining semantics
  • Ingredients
  • small ontologies and metadata sets
  • concept families (signatures)
  • information from dictionaries, etc.
  • additional info from text and the web

Method:
  1. Into a large database, pour all ingredients
  2. Stir together in the right way
  3. Bake
Evaluate: IR, QA, MT, and so on!
91
My recipe for SW research
  • Take two large portions of KR
  • one of ontology work,
  • one of reasoning
  • Add a big slice of databases
  • for all the non-text collections,
  • and 1 1/2 slices of NL
  • for the text collections, to insert the
    semantics.
  • Mix with a medium pinch of Correctness /
    Authority / Recency validation,
  • and add a large helping of Interfaces
  • to make the results presentable.
  • Combine, using creativity and good methodology,
  • (taste frequently to evaluate!)
  • and deliver to everyone.

92
Extending your ontology
No ontology is ever static: need to develop
methods to handle change
  • congratulations to the people of Montenegro!

93
Thank you!