Linguistic Processing of Classification Hierarchies

1
Linguistic Processing of Classification Hierarchies
Bernardo Magnini
ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica
Trento, Italy
2
Current Research Topics on Text Processing at ITC-irst
  • Question/Answering
      • TREC style
  • Information Extraction
      • ML approach, DOT.KOM project
  • Lexical Acquisition and Linguistic Resources
      • MultiWordNet, WordNet Domains, corpora for
        Italian
  • Word Sense Disambiguation
      • Based on domains, MEANING project
  • NLP for Knowledge Management
      • Edamok project
  • Evaluation of NLP Technologies
      • QA at CLEF-2003, Senseval-3

4
Outline
  • Classification Hierarchies (CH)
  • Concept hierarchies
  • Approaches toward interoperability of CHs
  • Semantic interpretation of CHs
  • Making the information explicit: the role of
    linguistic and world knowledge
  • Experimental setting
  • Preliminary results with CTXMATCH algorithm

5
Organizing papers: A senior researcher
  • Knowledge about the domain is used
  • Classification schemas are repeated
  • Labels are interpreted in their context

[Diagram: classification hierarchy rooted at Work, with nodes WSD, QA, Papers, Projects, Experiments, Senseval-2, ACL-02, Submission, Camera ready, Submission]
6
Organizing papers: A young researcher
  • A different view for the same documents
  • Redundant information
  • Different labels for the same concept

[Diagram: classification hierarchy rooted at Home, with nodes Articles, Code, 2002, 2001, 2000, workshops, Int. conferences, journals, Senseval-2, ACL-02]
7
Organizing papers: A student
  • Less structure corresponds to more complex labels
  • Any kind of document is allowed (text, images,
    code, ...)

[Diagram: classification hierarchy rooted at Disambiguation, with nodes Results-all-word-Eng., Senseval-Call-for-paper, Senseval-article, Meaning-project, Algorithm-description, Acl-article-final-version, Lexical-sample-training-data]
8
Questions
  • Can a system automatically discover similarities
    among different views of the same documents?
  • Example: retrieving documents in classification B
    using the schema of classification A
  • How much reasoning is involved?
  • Labels are expressed in a natural language.
  • Is there a role for NLP technologies?

9
Classification Hierarchies (CH) (1)
  • Taxonomic organization of documents
  • Easy to build: no formal language is required
  • Widely used
      • Web directories (Google, Yahoo!, Looksmart,
        portals)
      • Marketplace catalogues for product
        classifications
      • File systems
      • Local ontologies
  • Documents are classified at all levels of the
    hierarchy
  • The structure of a CH reflects both the documents
    and world knowledge

10
Classification Hierarchies (2)
  • Semi-structured: relations among nodes are not
    formally defined.
  • Document dependent: CHs are organized according
    to the documents that have to be classified.
  • Specificity criterion: a document is classified
    in the most specific node of the hierarchy.

[Diagram: classification hierarchy rooted at Vacation, with nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA]
11
Interoperability among CHs
  • Commercial interest: Distributed Knowledge
    Management in corporations
  • Scientific interest: various terms have been
    recently used, including
      • Meaning negotiation
      • Semantic coordination
      • Mapping between domain models
      • Semantic mediation
      • Ontology merging, integration or alignment
      • Integration of hierarchical categorization
  • Fits well in the Semantic Web perspective
  • Common goal: find mappings between nodes of two
    classification hierarchies

12
Interoperability among CHs
[Diagram: source CH rooted at Vacation, with nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA; target CH rooted at Sea holidays, with nodes Italy, in Europe]
14
Interoperability among CHs
[Diagram: source CH rooted at Vacation, with nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA; target CH rooted at Sea holidays, with nodes Italy, in Europe; marked with a question mark (?)]
15
Qualitative Mapping
[Diagram: source CH rooted at Vacation, with nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA; target CH rooted at Sea holidays, with nodes Italy, in Europe; relation label shown: "More general"]
16
Qualitative mapping
[Diagram: source CH rooted at Vacation, with nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA; target CH rooted at Sea holidays, with nodes Italy, in Europe, 2001, Tuscany; relation label shown: "More specific"]
17
Qualitative mapping
[Diagram: source CH rooted at Vacation, with nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA; target CH rooted at Sea holidays, with nodes Italy, in Europe, 2001, Tuscany; relation label shown: "Equivalent"]
18
Qualitative mapping
[Diagram: source CH rooted at Vacation, with nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA; target CH rooted at Sea holidays, with nodes Italy, in Europe, 2001, Tuscany; relation label shown: "Not compatible"]
19
Qualitative mapping
[Diagram: source CH rooted at Vacation, with nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA; target CH rooted at Sea holidays, with nodes Italy, in Europe, 2001, Tuscany; relation label shown: "Compatible"]
20
Qualitative mapping
[Diagram: source CH rooted at Vacation, with nodes 2001, 2000, Sea, Lake, Sea, Mountains, Tuscany, Spain, USA; target CH rooted at Sea holidays, with nodes Italy, in Europe, 2001, Tuscany]
21
Approaches to CH mapping
  • Approaches to CH mapping can be grouped in four
    classes, according to the kind of information
    used
      • Based on document content
      • Based on document classifications
      • Based on structural information
      • Based on semantic interpretation of labels
        (CTXMATCH)

22
1. Mapping based on Documents
  • Consider the content of the document
  • Procedure (Madhavan et al., AAAI-2002)
      • Train a classifier on documents of source CH
      • Apply the classifier to documents of target CH
  • Drawbacks
      • Needs the documents
      • Only textual documents can be considered
      • Does not consider structural information
      • Does not produce qualitative mappings
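
The two-step procedure above (train on the source CH, apply to the target CH) can be sketched as follows. This is only an illustration: a toy word-overlap classifier stands in for the learner, the voting step in `map_nodes` is an assumption about how per-document predictions become node mappings, and all node names and texts are hypothetical.

```python
from collections import Counter, defaultdict

def train(source_docs):
    """Build one bag-of-words profile per source node from (node, text) pairs."""
    profiles = defaultdict(Counter)
    for node, text in source_docs:
        profiles[node].update(text.lower().split())
    return profiles

def classify(profiles, text):
    """Assign a text to the source node whose profile shares the most word occurrences."""
    words = Counter(text.lower().split())

    def overlap(node):
        return sum(min(count, profiles[node][w]) for w, count in words.items())

    # sorted() makes tie-breaking deterministic
    return max(sorted(profiles), key=overlap)

def map_nodes(profiles, target_docs):
    """Map each target node to the source node most of its documents are assigned to."""
    votes = defaultdict(Counter)
    for t_node, text in target_docs:
        votes[t_node][classify(profiles, text)] += 1
    return {t_node: counts.most_common(1)[0][0] for t_node, counts in votes.items()}

# Hypothetical documents classified in the source CH
source_docs = [
    ("Sea", "beach sand swimming sun"),
    ("Sea", "sailing boat beach"),
    ("Mountains", "hiking snow ski peak"),
    ("Mountains", "climbing peak trail"),
]
# Hypothetical documents classified in the target CH
target_docs = [
    ("Beach", "sun and sand on the beach"),
    ("Beach", "swimming near the boat"),
    ("Alps", "ski trail with snow"),
]

profiles = train(source_docs)
print(map_nodes(profiles, target_docs))  # {'Beach': 'Sea', 'Alps': 'Mountains'}
```

The output pairs nodes but says nothing about whether one is more general or more specific than the other, which is the last drawback listed above.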

23
2. Mapping based on Classifications
  • Consider the number of documents in common with
    nodes of different CHs
  • Procedure (Ichise et al., IJCAI-2003)
      • Compute a statistical model of classification
        criteria of source and target CHs
      • Determine similarity between pairs of nodes in
        source and target
  • Drawbacks
      • Needs documents in common
      • Does not produce qualitative mappings
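
To make the idea of node similarity from shared documents concrete, here is a minimal sketch. It uses Jaccard overlap as a simple stand-in and is not the statistical model of the cited work; the node paths and document identifiers are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Share of documents two nodes have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matches(source: dict, target: dict) -> dict:
    """For each source node, return the target node with the highest overlap and its score."""
    result = {}
    for s_node, s_docs in source.items():
        scored = [(jaccard(s_docs, t_docs), t_node) for t_node, t_docs in target.items()]
        score, t_node = max(scored)
        result[s_node] = (t_node, score)
    return result

# Hypothetical document identifiers classified under each node
source = {
    "Vacation/2001/Sea": {"u1", "u2", "u3"},
    "Vacation/2001/Lake": {"u4"},
}
target = {
    "Sea holidays/Italy": {"u1", "u2"},
    "Sea holidays/in Europe": {"u3", "u4", "u5"},
}

for s_node, (t_node, score) in best_matches(source, target).items():
    print(f"{s_node} -> {t_node} ({score:.2f})")
```

The sketch only works when the two hierarchies classify some of the same documents, and it yields a similarity score rather than a qualitative relation, matching the two drawbacks listed above.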

24
3. Mapping Based on Structural Information (1)
  • Consider node definitions and their lexical
    expansions
  • Procedure (Calvanese et al., ISWC 2001)
      • Automatically propose candidate mappings based
        on lexicographic criteria
      • Correct mappings are validated by a domain
        expert
  • Drawbacks
      • Requires human intervention
      • Feasible for ontology integration, not for CHs

25
3. Mapping Based on Structural Information (2)
  • Consider structural constraints among nodes
  • Procedure (Daude et al., ACL-2000, this
    conference)
      • Select candidate pairs with lexicographic
        criteria
      • Select structural constraints
      • Use relaxation labelling to choose the best
        candidate
  • Drawbacks
      • Good for WordNet, but CHs have a lot of
        implicit knowledge
      • Does not produce qualitative mappings

26
4. Mapping Based on Semantic Interpretation
  • Consider linguistic processing of nodes and world
    knowledge
  • Procedure (Bouquet et al., ISWC-2003, to appear)
      • Build a logical interpretation for the source
        and the target nodes
      • Compute the relation between the two logical
        forms
  • Drawbacks
      • Requires world knowledge
      • Requires tuning of linguistic tools for CHs

27
Semantic Interpretation (1)
[Diagram: nodes Images, Italy, Beach, Mountain, with the relation label "More specific" shown twice]
  • World Knowledge is necessary

28
Semantic Interpretation (2)
[Diagram: nodes Images, Italy, Beach, Mountain, with the relation label "More specific" shown three times and "Equivalent" once]
29
Linguistic Processing of CHs
  • How do linguistic techniques work on CHs?
      • Tokenization and Part of Speech Tagging
      • Multiword recognition
      • Named entity recognition
      • Word sense disambiguation
  • Which particular problems do CHs pose for
    semantic interpretation?
  • How much implicit information can be extracted
    from CHs?

30
Part of Speech Tags (1)
  • Nouns are prevalent
  • Limited context available for solving ambiguities

[Diagram: classification hierarchy rooted at Vacation, with nodes 2001, 2000, Sea, Lake, Beach, Mountains, Tuscany, Spain, USA]
31
Part of Speech Tags (2)
  • POS tagger: TnT (Brants, ANLP-2000)
  • CH: 5k tokens extracted from a balanced set of CHs
    (web directories, file systems, product
    catalogues, ontologies), both for English and
    Italian
  • Text
      • English: training over 1M words (BNC)
      • Italian: training over 50k words (Elsnet)

32
Tokenization
  • Parentheses and acronyms

Credit agencies
Business credit agencies
Business credit gathering or reporting services
Value added network (VAN) services
From UNSPSC
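
One possible way to handle labels like the UNSPSC example above is to pull a parenthesized acronym out of the label before splitting it into words. The sketch below is an assumption for illustration, not the tokenizer used in the experiments; the function name and the regular expressions are hypothetical choices.

```python
import re

def tokenize_label(label: str):
    """Return (tokens, acronyms): the words of the label and any parenthesized
    all-caps acronym, which is kept apart from the word tokens."""
    acronyms = re.findall(r"\(([A-Z]{2,})\)", label)
    # Drop every parenthesized span so it does not pollute the word tokens
    without_parens = re.sub(r"\([^)]*\)", " ", label)
    tokens = re.findall(r"[A-Za-z]+", without_parens)
    return tokens, acronyms

print(tokenize_label("Value added network (VAN) services"))
# (['Value', 'added', 'network', 'services'], ['VAN'])
```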
33
Abbreviations
  • Abbreviations

Potato, pot. product
Semi-instant product (veg.)
From EClass
34
Multiwords
  • Multiword on two contiguous levels
  • Multiword on one level

Sport
Billiards
Players
United States
From Google
35
Coordination
  • Conjunction
  • Disjunction

Healthcare Services
Alternative and Holistic medicine
Witch doctors or voodoo services
From UNSPSC
36
Multilinguality
  • Spanish
  • English
  • Mixed

37
Lexical Ambiguity
  • Structural information provides context for word
    sense disambiguation
  • The connections between WSD and web directories
    have been investigated by Gonzalo et al. (2003)

Plants
Trees
Apple tree
From Google
38
Arc Interpretation
  • Relations among nodes are not formally defined
      • Instance-of
  • In CHs, documents classified under a certain node
    A are a subset of the documents classified under
    a parent node of A.
  • According to our world knowledge, the relation
    between two nodes can be interpreted in various
    ways.
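
The subset property stated above holds regardless of how an arc is interpreted, as this small sketch shows. The tree shape, node names and file names are hypothetical, and the hierarchy is assumed to be a plain tree stored as parent-to-children lists.

```python
def documents_under(tree, docs, node):
    """All documents classified at `node` or at any of its descendants."""
    found = set(docs.get(node, ()))
    for child in tree.get(node, ()):
        found |= documents_under(tree, docs, child)
    return found

# Hypothetical hierarchy: parent -> children
tree = {"Images": ["Tuscany"], "Tuscany": ["Pisa", "Florence"]}
# Hypothetical documents classified directly at each node
docs = {
    "Images": {"map.png"},
    "Tuscany": {"hills.jpg"},
    "Pisa": {"tower.jpg"},
    "Florence": {"duomo.jpg"},
}

child_docs = documents_under(tree, docs, "Tuscany")
parent_docs = documents_under(tree, docs, "Images")
print(child_docs <= parent_docs)  # True
```

The containment says nothing about what the arc means: the same check succeeds for instance-of, part-of or a generic association, which is why the interpretation has to come from world knowledge.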

39
Arc Interpretation
  • Relations among nodes are not formally defined
  • Part-of

Images
Tuscany
Pisa
Florence
From Google
40
Arc Interpretation
  • Relations among nodes are not formally defined
  • Generic Associations

Television
Cable_TV
Satellite
Public_Access
Guides
From Google
41
Arc Interpretation
  • Relations among nodes are not formally defined
  • Meta-level criteria

World Languages
A
B
Afrikaans
Bali
From Google
43
Implicit Negation
  • Trentino is part of North Italy

Origin of ITC-irst employees
Italy
North except Trentino
Center
South
Trentino
From ITC-irst personnel office
44
CTXMATCH Algorithm
  • Semantic explicitation
      • Linguistic analysis of labels
          • Shallow parsing, access to WordNet,
            multiwords
      • Contextualization
          • Sense filtering (use WordNet as knowledge
            repository)
          • Sense composition (use WordNet as knowledge
            repository)
  • Semantic comparison
      • Build a logical form (description logics)
      • Compute the logical relation between two
        formulas (SAT solver)
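
The semantic comparison step can be illustrated with a minimal sketch. It is not the CTXMATCH implementation: formulas are plain propositional functions rather than description logic forms, a brute-force truth-table check stands in for the SAT solver, and the atoms, the background axiom and the two node formulas are hypothetical.

```python
from itertools import product

def entails(axioms, premise, conclusion, atoms):
    """True if every assignment satisfying the axioms and the premise
    also satisfies the conclusion (brute-force stand-in for a SAT check)."""
    for values in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if axioms(v) and premise(v) and not conclusion(v):
            return False
    return True

def relation(axioms, a, b, atoms):
    """Qualitative relation of formula a with respect to formula b."""
    a_to_b = entails(axioms, a, b, atoms)
    b_to_a = entails(axioms, b, a, atoms)
    if a_to_b and b_to_a:
        return "equivalent"
    if a_to_b:
        return "more specific"
    if b_to_a:
        return "more general"
    return "no relation"

ATOMS = ["images", "tuscany", "italy", "beach"]

def world(v):
    # Hypothetical world knowledge: whatever is about Tuscany is about Italy
    return (not v["tuscany"]) or v["italy"]

def node_a(v):
    # Hypothetical logical form for a path such as Images/Tuscany/Beach
    return v["images"] and v["tuscany"] and v["beach"]

def node_b(v):
    # Hypothetical logical form for a path such as Images/Italy/Beach
    return v["images"] and v["italy"] and v["beach"]

print(relation(world, node_a, node_b, ATOMS))  # more specific
print(relation(world, node_b, node_a, ATOMS))  # more general
```

Without the `world` axiom the two formulas would come out as "no relation", which is the sense in which world knowledge is necessary.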

45
An Experimental Setting: Matching Web Directories
  • Task: automatically discover qualitative mappings
    among corresponding directories of Google and
    Yahoo
  • CTXMATCH
      • Input: a pair of nodes <N1, N2>, with N1 in CH1
        and N2 in CH2
      • Output: a relation holding between N1 and N2
          • more general, more specific, equivalent, no
            relation
  • Evaluation: define a metric considering the
    documents (URLs) classified both by Google and
    Yahoo. Define a mapping between this metric and
    the CTXMATCH relations.
  • Baseline: string match of the paths of the two
    nodes.
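
The slides do not spell out how the string-match baseline works, so the following is one plausible reading, given only as a sketch: two paths are compared label by label after normalization, and a path that extends the other is treated as more specific. The function names and the normalization rules are assumptions.

```python
def normalize(path: str) -> list:
    """Split a directory path into lowercase labels, treating '_' as a space."""
    return [label.replace("_", " ").strip().lower() for label in path.strip("/").split("/")]

def baseline_relation(path1: str, path2: str) -> str:
    """Relation of path1 with respect to path2, using string comparison only."""
    p1, p2 = normalize(path1), normalize(path2)
    if p1 == p2:
        return "equivalent"
    if p1[:len(p2)] == p2:
        return "more specific"  # path1 extends path2
    if p2[:len(p1)] == p1:
        return "more general"   # path2 extends path1
    return "no relation"

print(baseline_relation("Architecture/History", "Architecture/History/Medieval"))
# more general
print(baseline_relation("Architecture/History/Periods_and_Styles/Gothic",
                        "Architecture/History/Medieval"))
# no relation
```

The second call uses the Google and Yahoo paths from the preliminary results: pure string matching finds no relation there, because relating Gothic to Medieval takes world knowledge rather than matching labels.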

46
Matching Google and Yahoo!: Linguistic Analysis
47
Matching Google and Yahoo!: Preliminary Results
Google: Architecture/History/Periods_and_Styles/Gothic
is more specific than
Yahoo: Architecture/History/Medieval
48
Ongoing and Future Experiments
  • Web directories: build a reference benchmark for
    evaluating matching algorithms.
      • Include Looksmart
      • Google English vs Google Italian
  • File systems
      • Collaboration: Edamok, SWAP, MEANING
  • Domain specific applications
      • Medical classification: integration of UML in
        the algorithm
      • Public Administration: matching document
        classification hierarchies for automatic
        routing
  • Edamok project: www.edamok.itc.it
      • Papers, algorithm specifications, case studies

49
Conclusions
  • Interoperability of Classification Hierarchies
      • Scientific interest: Semantic Web community
      • Application oriented interest
  • NLP can play a crucial role
  • A proper experimental setting is necessary for
    comparing different approaches
  • CTXMATCH
      • Qualitative mappings
      • Semantic interpretation based on linguistic
        analysis
      • Preliminary results