Lexical knowledge schemes for modeling words and expressions in communication

About This Presentation
Title:

Lexical knowledge schemes for modeling words and expressions in communication

Description:

Lexical knowledge schemes for modeling words and expressions in communication. Computational Lexicology & Terminology Lab: Wauter Bosma, Isa Maks, Roxane Segers

Slides: 42
Provided by: PiekV6
Transcript and Presenter's Notes



1
Lexical knowledge schemes for modeling words and
expressions in communication
  • Computational Lexicology & Terminology Lab:
    Wauter Bosma, Isa Maks, Roxane Segers, Hennie van
    der Vliet, Piek Vossen
  • LCC meeting, October 9th, 2008, VU University
    Amsterdam

2
Overview
  • Genre as a knowledge scheme
  • What do we do at CLTL?
  • How does it relate to genre?
  • Projects at CLTL
  • Discussion

3
A view on genre
  • Genre is an abstract knowledge scheme that
    natural language speakers can apply to
    effectively structure communication.
  • How and where is such a scheme stored?
  • How is this knowledge activated and applied in a
    communicative setting?
  • How can we benefit from these insights in
    computerized information and communication
    systems?

4
(No Transcript)
5
Focus of Computational Lexicology and Terminology
Lab (CLTL)
  • Lexicon model of abstract knowledge to
    efficiently process and produce natural language
    in communicative settings
  • Symbolic abstract representation of forms
    related to concepts
  • forms are variants that can refer to more-or-less
    the same semantic content
  • shoot (V), shooting (N), aggression (N) - fight (N) -
    conflict (N), war (N), WOII (Name)
  • pay (V), exchange (V) - buy (V), sell (V),
    merchandise (N) - trade (N) - business (N)
  • Also encode pragmatic aspects of use
  • Sentiment, subjectivity & attitude
  • Perspective
  • Domain restrictions

6
Focus of Computational Lexicology and Terminology
Lab (CLTL)
  • Broad notion of knowledge
  • words & expressions (what is a word, what is a
    concept?)
  • phrases, sentences and text (incorporating
    grammar)
  • genres
  • Abstract symbolic representations related to
    statistical expectation patterns
  • Tagged corpus represents an 'experience' of
    language use
  • "X drinks beer", "Y drinks wine", "Z drinks milk"
  • Lexicon is the highest abstraction of these
    experiences that gives the most effective
    prediction of how words and expressions behave
  • "XYZ drink beverages"
  • Corpus-based lexicon or corpus data represented
    as a lexicon
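The step from corpus "experiences" to a lexicon entry can be illustrated with a minimal sketch. It assumes a tiny hand-made hypernym table (`HYPERNYM` is hypothetical; a real system would take these links from a wordnet) and picks the most specific concept that covers all observed objects of a verb. It is only an illustration of the idea on this slide, not the lab's actual method.

```python
# Hypothetical mini-hierarchy (child -> parent), for illustration only.
HYPERNYM = {
    "beer": "beverage",
    "wine": "beverage",
    "milk": "beverage",
    "beverage": "substance",
}

def ancestors(word):
    """Return the word followed by its hypernym chain, most specific first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def generalize(objects):
    """Most specific concept that covers every observed object, or None."""
    chains = [ancestors(o) for o in objects]
    for candidate in chains[0]:
        if all(candidate in chain for chain in chains):
            return candidate
    return None

# Corpus "experiences": (subject, verb, object) observations from the slide.
observations = [("X", "drink", "beer"), ("Y", "drink", "wine"), ("Z", "drink", "milk")]
objects = [obj for _, verb, obj in observations if verb == "drink"]

# Lexicon entry: the verb paired with the abstraction over its objects.
lexicon_entry = ("drink", generalize(objects))
print(lexicon_entry)
```

Choosing the lowest common ancestor keeps the entry as predictive as possible: "beverage" still excludes most non-drinkable things, whereas "substance" would not.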

7
Focus of Computational Lexicology and Terminology
Lab (CLTL)
  • Validation of models and databases with lexical
    knowledge
  • Can we define types of structures (lexical and
    compositional expressions) that correctly predict
    their behavior in language use? ->
    pluriform-object-count-noun (police),
    object-count-noun (police officer),
    group-object-count-noun (eikenbos (oak forest)),
    mass-object-uncount-noun (bos (forest))
  • Can we build a comprehensive database using these
    types?
  • Use the database in corpus research and analysis
  • import corpus data into the lexical database
  • apply the database to textual corpora in computer
    applications
  • Automatic tagging of corpora with features
  • Automatically mine textual data using the lexicon
    as a background knowledge resource, e.g. to find
    facts of causal relations for environmental
    phenomena

8
  • Ontology
  • concepts instead of words
  • identity criteria
  • language neutral
  • domain and perspective neutral
  • no genre dependency
  • logically valid
  • for inferencing
  • Lexical database
  • generic list of words and terms
  • abstracts from various text corpora
  • differentiation for different domains
  • and genres
  • most generic representation
  • in a language community

Map
Validate
Integrate
  • Text corpus with empirical data
  • linear text
  • every word occurrence is unique
  • domain and genre specific
  • Term database
  • generic list of terms
  • derived from text corpus
  • patterns and features that
  • are dominant in domain and genre

Derive
9
Projects at CLTL
  • Cornetto (Stevin project STE05039)
  • Kyoto (FP7 ICT Work Programme 2007 under
    Challenge 4 - Digital libraries and Content,
    project ICT-211423)
  • Camera projects
  • From sentiments and opinions in text to positions
    of political parties
  • The semantics of history
  • A term bank for the Belastingdienst (Steunpunt
    Terminologie)
  • DutchSemCor (NWO investeringssubsidie)

10
Cornetto
  • COmbinatorial Relational NEtwork voor Taal
    TOepassingen (Combinatorial Relational Network for
    Language Applications)
  • Goal: to develop a lexical semantic database for
    Dutch
  • 90K entries: the generic and central part of the
    language
  • Rich horizontal and vertical semantic relations
  • Combinatoric information
  • Ontological information

11
Lexical Units & Synsets
  • Lexical Unit: a form-meaning relation, such that
  • form: abstract representation of certain
    realizations
  • part-of-speech is the same
  • meaning is the same, where meaning is defined by
    a reference to a unique Synset
  • Synset: set of synonyms (LUs) that refer to the
    same entities in most contexts.
  • Defined by lexical semantic relations
  • Defined by reference to ontology Terms or logical
    expressions involving Terms from the ontology
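The definitions on this slide can be sketched as two small data structures. This is a minimal sketch under assumptions: the class and field names, the synset identifiers and the example relation target are all hypothetical and are not the Cornetto database schema.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LexicalUnit:
    form: str       # abstract form covering its realizations
    pos: str        # part of speech
    synset_id: str  # meaning = reference to exactly one synset

@dataclass
class Synset:
    synset_id: str
    members: list = field(default_factory=list)         # synonymous lexical units
    relations: dict = field(default_factory=dict)       # relation name -> target synset id
    ontology_terms: list = field(default_factory=list)  # references into the ontology

def add_unit(synset, unit):
    """Attach a lexical unit to the synset it refers to."""
    if unit.synset_id != synset.synset_id:
        raise ValueError("lexical unit refers to a different synset")
    synset.members.append(unit)

# Hypothetical identifiers, for illustration only.
bicycle = Synset("syn-bicycle", relations={"has_hyperonym": "syn-vehicle"})
add_unit(bicycle, LexicalUnit("fiets", "noun", "syn-bicycle"))
add_unit(bicycle, LexicalUnit("rijwiel", "noun", "syn-bicycle"))
print([lu.form for lu in bicycle.members])
```

Keeping the meaning as a single `synset_id` on the lexical unit mirrors the constraint that every lexical unit points to a unique synset, while one synset can collect several units.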

12
Data Organization
Lexical Unit
Internal relations
Correspond to word-meaning pair
Synonyms
form morphology syntax semantics pragmatics usage
examples
Synset
Model meaning relations
Collection of Terms and Axioms
Princeton Wordnet
Czech Wordnet
German Wordnet
SUMO MILO
Korean Wordnet
Wordnet Domains
Spanish Wordnet
Arabic Wordnet
French Wordnet
13
Data overview
                       ALL       NOUNS    VERBS    ADJ.     ADV.   Other
Synsets                70,434    52,888   9,053    7,703    220    570
Lexical Units          118,466   85,278   17,363   15,731   73     21
Lemmas (form+pos)      91,991    70,556   9,055    12,307   73     n.a.
Synonyms in synsets    102,572   74,893   14,091   12,899   84     605
CID records            103,668   75,812   14,093   13,089   484    190
Synonyms per synset    1.46      1.42     1.56     1.67     0.38   1.06
Senses per lemma       1.29      1.21     1.92     1.28     1.00   n.a.
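The two ratio rows of this table appear to be derived from the count rows: synonyms per synset as "Synonyms in synsets" divided by "Synsets", and senses per lemma as "Lexical Units" divided by "Lemmas". That reading is an assumption (the slide does not state the formulas), but the short check below reproduces the figures in the table.

```python
COLUMNS = ["ALL", "NOUNS", "VERBS", "ADJ.", "ADV.", "Other"]

# Count rows copied from the table; None stands for "n.a.".
synsets       = [70434, 52888, 9053, 7703, 220, 570]
lexical_units = [118466, 85278, 17363, 15731, 73, 21]
lemmas        = [91991, 70556, 9055, 12307, 73, None]
synonyms      = [102572, 74893, 14091, 12899, 84, 605]

def ratio(numerator, denominator):
    """Ratio rounded to two decimals, or None when a count is missing."""
    if numerator is None or denominator is None:
        return None
    return round(numerator / denominator, 2)

synonyms_per_synset = [ratio(s, n) for s, n in zip(synonyms, synsets)]
senses_per_lemma = [ratio(lu, lm) for lu, lm in zip(lexical_units, lemmas)]

for name, a, b in zip(COLUMNS, synonyms_per_synset, senses_per_lemma):
    print(name, a, b)
```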
14
(No Transcript)
15
Integrating the ontology: SUMO terms and axioms
16
Lexicon versus Ontology
LABELS for ROLES: bluswater (fire-fighting water), theewater (tea water),
koffiewater (coffee water)
Ontology
?
Abstract
Physical
Element
Organism
Process
Possession Transaction
Dog
H2O
CO2
buy
PoodleDog
receiver
subj
obj
giver
ind obj
goods
LABELS for ROLES: watchdog (EN), waakhond (NL),
banken (JP): ((instance x Canine) (role x
GuardingProcess))
NAMES for TYPES: poodle (EN), poedel (NL), pudoru (JP):
((instance x Poodle))
sell
subj
obj
ind obj
17
Kyoto
  • Knowledge Yielding Ontologies for Transition-Based
    Organization
  • Funded
  • 7th Framework Program-ICT of the European Union,
    Intelligent Content and Semantics
  • Goal
  • Platform for knowledge sharing across languages
    and cultures
  • Enables knowledge transition and information
    search across different target groups,
    transgressing linguistic, cultural and geographic
    boundaries.
  • Open text mining and deep semantic search
  • Wiki environment that allows people in the field
    to maintain their knowledge and agree on meaning
    without knowledge engineering skills
  • URL: http://www.kyoto-project.eu/
  • Duration: March 2008 - March 2011
  • Effort: 364 person months of work

18
KYOTO (ICT-211423) Overview
  • Languages
  • English, Dutch, Italian, Spanish, Basque,
    Chinese, Japanese
  • Domain
  • Environmental domain, BUT usable in any domain
  • Global
  • Both European and non-European languages
  • Available
  • Free as open source system and data (GPL)
  • Future perspective
  • Content standardization that supports world wide
    communication
  • Global Wordnet Grid

19
Sudden increase of CO2 emissions in 2008 in Europe
Domain
20
User perspective
  • Ecosystem services
  • nature as a resource: food, transport,
    recreation, medicine, material
  • nature for waste absorption
  • economic dependency
  • state of nature
  • footprint
  • poverty

21
Lexicon versus Ontology
  • Ecosystem services
  • Nature as a resource
  • Nature for waste absorption
  • State of nature
  • Threats to nature

Ontology
?
Abstract
Physical
Element
Organism
Artifacts
Process
Spider
H2O
CO2
Possession Transaction
alien invasive species
species migration
green house gas
green roof
ecosystem-based drinking water production
branding rural products
sustainable products
22
System components
  • Wikyoto: wiki environment for a social group
  • to model the terms and concepts of a domain and
    agree on their meaning, within the group, across
    languages and cultures
  • to define the types of knowledge and facts of
    interest
  • Tybots: term extraction robots, extract term
    data from a text corpus
  • Kybots: knowledge yielding robots, extract facts
    from a text corpus
  • Linguistic processors
  • tokenizers, segmentizers, taggers, grammars
  • named entity recognition
  • word sense disambiguation
  • generate a layered text annotation in Kyoto
    Annotation Format (KAF)

23
Capture Server
Document Base Linear KAF
Tybot server (Term Extraction)
Semantic Annotation
Document Base Linear KAF
Extracted Terms Generic K-TMF
Kybot Editor
Kybot Profiles
Kybot Server (Fact Extraction)
Term Editor (Wikyoto)
Document Base Linear Generic KAF
24
What Tybots do...
  • Input: text documents
  • Green house gases, such as CO2
  • CO2 and other green house gases
  • Linguistic processors generate KAF annotation
    (sequential)
  • morpho-syntactic analysis
  • semantic roles
  • named entities
  • wordnet and ontology mappings
  • Output: term hierarchies in TMF (generic)
  • structural parent relations: CO2 is a green
    house gas, which is a gas
  • quantified structural and semantic relations
  • statistical data
  • generalized semantic mappings
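The two input phrases on this slide are enough to illustrate how structural parent relations could be derived. The sketch below is an assumption, not the Tybot implementation: it uses two lexical patterns ("X, such as Y" and "Y and other X") plus the head word of a multi-word term as its parent, and a toy lemma table (`SINGULAR`, hypothetical) in place of the linguistic processors.

```python
import re

# Toy lemma table; a real pipeline would use a lemmatizer instead.
SINGULAR = {"gases": "gas"}

PATTERNS = [
    re.compile(r"^(?P<parent>.+?),? such as (?P<child>.+)$"),
    re.compile(r"^(?P<child>.+?) and other (?P<parent>.+)$"),
]

def normalize(term):
    """Lower-case ordinary words, keep acronyms such as CO2, singularize the head."""
    words = [w if w.isupper() else w.lower() for w in term.strip().split()]
    words[-1] = SINGULAR.get(words[-1], words[-1])
    return " ".join(words)

def extract_is_a(text):
    """Return (child, parent) if the text matches one of the patterns, else None."""
    for pattern in PATTERNS:
        match = pattern.match(text.strip())
        if match:
            return normalize(match["child"]), normalize(match["parent"])
    return None

def build_hierarchy(texts):
    """Map each term to its parent term."""
    parent_of = {}
    for text in texts:
        pair = extract_is_a(text)
        if pair is None:
            continue
        child, parent = pair
        parent_of[child] = parent
        words = parent.split()
        if len(words) > 1:
            # Head-word heuristic: "green house gas" is a kind of "gas".
            parent_of[parent] = words[-1]
    return parent_of

def chain(term, parent_of):
    """Follow parent links upward, stopping if a term would repeat."""
    path = [term]
    while path[-1] in parent_of and parent_of[path[-1]] not in path:
        path.append(parent_of[path[-1]])
    return path

hierarchy = build_hierarchy([
    "Green house gases, such as CO2",
    "CO2 and other green house gases",
])
print(" -> ".join(chain("CO2", hierarchy)))
```

Both input phrases yield the same relation, which is the kind of evidence the "quantified" relations and statistical data in the output could count.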

25
Conceptual modeling
Source Documents
[[the emission]NP [of greenhouse gases]PP [in
agricultural areas]PP]NP
TYBOT Concept Miners
Linguistic Processors
Morpho-syntactic analysis
English Wordnet
Term hierarchy
Ontology
?
location3
substance1
naturalprocess1
of
Abstract
Physical
region3
emission
gas
area
emission3
Substance
Process
area1
geographical area1
emission2
gas1
CO2
greenhouse gas
agricultural area
H2O
CO2
Chemical Reaction
GreenhouseGas
greenhouse gas1
rural area1
in
CO2
GlobalWarming
CO2Emission
farmland2
(instance s1 Substance) (instance e1 Warming)
(katalyist s1 e1)
WaterPollution
26
What Kybots do
  • Input
  • KAF annotations of text (sequential), encoded by
    language
  • Conceptual frame from the ontology
  • Expression rules for frame to language mapping
  • Wordnet in a language
  • Morpho-syntactic mapping rules
  • Output is a database of facts in KAF/FactAF
    (generic)
  • aggregated facts
  • inferred facts
  • language neutral

27
Fact mining by Kybots
Linguistic Processors
Ontology
Wordnets Linguistic Expressions
Logical Expressions
?
Generic
Abstract
Physical
Fact analysis
Patient
[the emission]NP Process: e1 [of greenhouse
gases]PP Patient: s2 [in agricultural areas]PP
Location: a3
Substance
Process
H2O
CO2
Chemical Reaction
Domain
Patient
CO2 emission
water pollution
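The role assignment shown for the example phrase on this slide can be sketched in a few lines. This is only an illustration under assumptions: the preposition-to-role table and the `Fact` structure are hypothetical, and the real Kybots work on KAF annotations with ontology frames rather than on hand-chunked strings.

```python
from dataclasses import dataclass, field

# Assumed expression rules: which preposition introduces which role.
ROLE_BY_PREPOSITION = {"of": "Patient", "in": "Location"}

@dataclass
class Fact:
    process: str
    roles: dict = field(default_factory=dict)

def analyse(chunks):
    """Turn (label, text) chunks, head NP first, into a Fact with role fillers."""
    label, head = chunks[0]
    if label != "NP":
        raise ValueError("expected the head noun phrase first")
    # Drop the determiner: "the emission" -> "emission".
    fact = Fact(process=head.split()[-1])
    for label, text in chunks[1:]:
        if label != "PP":
            continue
        preposition, _, filler = text.partition(" ")
        role = ROLE_BY_PREPOSITION.get(preposition)
        if role is not None:
            fact.roles[role] = filler
    return fact

fact = analyse([
    ("NP", "the emission"),
    ("PP", "of greenhouse gases"),
    ("PP", "in agricultural areas"),
])
print(fact)
```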
28
pdf
Hidden
Shown
A.. ... decline ... population ... ..Z
Do populations always consist of marine species?
Are terrestrial species never marine species?
29
Axiomatize
Maximal abstraction integrity
Language neutral integrity
(instance s1 Substance) (instance e1 Warming)
(katalyist s1 e1)
Generic text based
Linear text
30
From sentiments and opinions in text to positions
of political parties
  • Most language use does not express facts but
    personal opinions and positions with respect to
    facts or issues, often disguised for some
    communicative or manipulative goal.
  • CAMERA project involving 2 AIOs (PhD students)
    from FdL (Faculty of Arts) and 1 AIO from
    Political Sciences
  • Combines contemporary theories and methods in
    linguistics and political science to develop an
    automated research tool for rich text-mining
  • Complexity of language use, the linguistic
    modeling of subjectivity and the representation
    of this knowledge in a lexicon.
  • Complex dimensionality of competition between
    political parties.
  • Mining tool for language-meaning research can be
    applied to enhance the Kieskompas (Electoral
    Compass).

31
aio-1
Modeling
Lexical database
Lexical Analysis
Derivation
Co-occurrence
Lexical acquisition
aio-2
Corpus Linguistics
Linguistic rules
Quantitative Text Analysis
Concordance
Search
Manual Coding & Tagging
  • Omstreden democratie
  • Jan Kleinnijenhuis
  • Wouter van Atteveldt

Political Text Corpus
Automated Tagging & Analysis
Morpho-syntactic Parsers
Political Database
system integrator-4
Manual Coding
Search
Quantitative Data Analysis
Political Analysis
aio-3
Interpretation rules
32
AIO-1: Lexical model and acquisition for
sentiment and opinion analysis in Dutch text
  • Words & expressions in political text
  • Model sentiment, subjectivity, lexical framing
    and attitudinal implications
  • Build a lexicon encoding these layers
  • Validate the lexicon in the mining application
    applied to the text corpus

33
Levels of subjectivity
  • sentiment orientation, e.g.
  • small (neutral), splendid (positive), dull
    (negative)
  • funeral (negative), birthday party (positive),
    meeting (neutral)
  • explicit attitudinal and deontic implications
  • hate, love, favour, desire, want
  • impossible, possible, can, cannot
  • demand, beg, hope, wish
  • implicit attitudinal and deontic implications
  • neutral: describe, cite, quote
  • subjective: tell my story, shout, cry out,
    suggest
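The layers listed on this slide could be encoded in a lexicon roughly as follows. This is a minimal sketch: the field names and category labels are assumptions, the entries are the slide's own examples, and a polarity is only filled in where the slide gives one.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SubjectivityEntry:
    polarity: Optional[str] = None  # "positive", "negative", "neutral", or unknown
    attitude: Optional[str] = None  # "explicit", "implicit-neutral", "implicit-subjective"

LEXICON = {
    # sentiment orientation
    "small": SubjectivityEntry(polarity="neutral"),
    "splendid": SubjectivityEntry(polarity="positive"),
    "dull": SubjectivityEntry(polarity="negative"),
    "funeral": SubjectivityEntry(polarity="negative"),
    "birthday party": SubjectivityEntry(polarity="positive"),
    "meeting": SubjectivityEntry(polarity="neutral"),
    # explicit attitudinal and deontic implications
    "hate": SubjectivityEntry(attitude="explicit"),
    "demand": SubjectivityEntry(attitude="explicit"),
    # implicit attitudinal and deontic implications
    "describe": SubjectivityEntry(attitude="implicit-neutral"),
    "shout": SubjectivityEntry(attitude="implicit-subjective"),
}

def polarity_counts(terms):
    """Count the known sentiment orientations among the given terms."""
    counts = Counter()
    for term in terms:
        entry = LEXICON.get(term)
        if entry is not None and entry.polarity is not None:
            counts[entry.polarity] += 1
    return counts

print(polarity_counts(["splendid", "funeral", "meeting", "hate"]))
```

Keeping polarity and attitude as separate fields lets a word carry an attitudinal implication without forcing a sentiment orientation on it.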

34
Some concepts of saying
  • The reporter expresses attitude towards the
    subject (is not aware)
  • nazeggen1, herhalen4, echoën2 meesmuilen1
    herkauwen2 toesnauwen1, aanblaffen2, sissen2,
    toebijten1, toeblaffen1 toesmijten2, toevoegen4
    uitputten3
  • verzuchten1
  • pretenderen1, beweren1
  • Subject of speech act has attitude towards (is
    aware)
  • afzeggen1, cancellen1 ontkennen1, miskennen1,
    ontveinzen1 toewensen1, wensen2 verbieden1
    aanzetten12, beklemtonen2, hameren2,
    tamboereren2 onderstrepen2, onderlijnen1,
    accentueren1 toezeggen1, beloven1 uitlaten5,
    beoordelen1 distantiëren1 erkennen2,
    toegeven1
  • opmerken2, aantekenen4

35
Synsets or lexical units
  • brilliant3, glorious4, magnificent1,
    splendid2
  • bus4, jalopy1, heap3
  • has_hyperonym car1, auto1, automobile1,
    machine4, motorcar1
  • fiets1, brik7, kar3, karretje2, rijwiel1,
    velo1

36
The semantics of history
  • Camera project involving 1 AIO from FdL and 1 AIO
    from FEW (Exact Science)
  • Goal: an ontology and lexicon for a historical
    multimedia archive of the Rijksmuseum.
  • Applied to an innovative information system for
    accessing the historical archive.

37
The semantics of history: semantics of change
  • Represent different realities
  • related through causal changes over time
  • representing different views or perspectives on
    the same reality, e.g. from a different
    historical angle or from different geographical
    or social parties.
  • Changes are typed as events

38
Events as key notions
  • Historical events
  • events considered from a distance in time and
    abstraction of detail.
  • referenced by names (WOII, de Val van
    Srebrenica), nouns (war) or nominalizations (the
    violation of human rights)
  • News events
  • Reports on (the same) reality but more in the
    active verbal form: US soldiers shoot Iraqi
    citizens.
  • Close to the actual event
  • lacking a historical abstraction and filtering.
  • Both news and historical accounts imply
    subjectivity and perspective on these events but
    probably make different selections and use
    different genres to convey this information.
  • News becomes history over time, and we therefore
    expect a smooth transition in the use of language
    to refer to the same events, adding more and more
    historical perspective.

39
Val van Srebrenica in Wikipedia
  • Headings
  • 1992 ethnic cleansing campaign
  • The conflict in eastern Bosnia
  • Struggle for Srebrenica
  • Text
  • A fierce struggle for territorial control then
    ensued among the three major groups in Bosnia:
    Bosniak (commonly known as 'Bosnian Muslims'),
    Serb and Croat. In the eastern part of Bosnia,
    close to Serbia, conflict was particularly fierce
    between Serbs and Bosniaks
  • Serb military and paramilitary forces from the
    area and neighboring parts of eastern Bosnia and
    Serbia gained control of Srebrenica for several
    weeks in early 1992, killing and expelling
    Bosniak civilians. In May 1992, Bosnian
    government forces under the leadership of Naser
    Oric recaptured the town
  • thus proceeded with the ethnic cleansing of
    Bosniaks from Bosniak ethnic territories in
    Eastern Bosnia and Central Podrinje

40
Letter from the Dutch minister of defense
  • Over the past six months, the execution of these
    tasks was considerably hampered by the Bosnian
    Serb refusal to allow the enclave to be adequately
    resupplied. Because of a lack of fuel, patrols had
    to be carried out on foot. Since May, the Bosnian
    Serbs also blocked the rotation of Dutchbat
    personnel, which reduced the force from 630 to
    430 blue helmets. Hostilities gradually increased,
    as a result of which an observation post in the
    south-eastern part of the enclave had to be
    abandoned on 3 June (translated from the Dutch
    original)
  • Historical terms: blokkade (blockade), val (fall),
    opgave (abandonment), overgave (surrender)

41
conflict, struggle, ethnic cleansing ... killing,
expelling, gain control
42
AIO at FdL
  • Lexical framing of events in news reporting and
    historical descriptions.
  • Use historical thesaurus to group all the words
    and expressions in a lexicon relative to the same
    events
  • Differentiate implications of the lexical
    variation packaging of events
  • Classification of news

43
Thank you for your attention