1
Using bilingual LSA for FN annotation of French
text from generic resources
  • Guillaume Pitel - LORIA/LED
  • FR.FrameNet Project
  • Funded by France-Berkeley Fund

2
Outline
  • The (small) FR.FrameNet project
  • The projection problem
  • Realizations
  • French Frames database
  • Annotated reference sub-corpus
  • English semantic clusters from FEs
  • Projection into French
  • Other potential applications

3
The (small) FR.FrameNet project
  • A Berkeley-Nancy collaboration funded by the
    France-Berkeley Fund - ICSI, ATILF, LORIA
  • French participants: Susanne Alt, Benoît Crabbé,
    Christiane Jadelot, Guillaume Pitel, Laurent
    Romary
  • Setting the foundations for cheap bootstrapping
    of a French FrameNet
  • Reusing existing French lexical semantic
    resources
  • Reusing any available resources
  • Focus on automatic methods

4
The projection problem
  • Use a semantic lexicon in language A to annotate
    a corpus in language B
  • The resulting data is expected to be of much
    lower quality than a handcrafted lexicon
  • It is a bootstrapping process that requires
    manual correction
  • Important question: does it really speed up the
    final production?

5
The Padó & Lapata approach
  • Uses a Source-language/Target-language parallel
    corpus
  • The Source side of the corpus must be
    FN-annotated
  • The roles are projected onto the Target corpus
  • Train a statistical semantic role parser for the
    Target language
  • Automatic annotation of any corpus in the Target
    language

6
The Padó & Lapata approach
  • Problems:
  • translation is not frame-preserving in many cases
    (20-30%)
  • parallel corpora are a rare resource
  • Berkeley's FrameNet is not built on the English
    side of a parallel corpus
  • But very useful with a resource like Europarl

7
The main bottleneck
  • Existence of parallel AND annotated corpora:
    rare and expensive to build
  • But:
  • Annotated corpora are available
  • Parallel, aligned corpora are available

8
The Semantic Space based approach (using LSA)
  • Pure semantic annotation:
  • no grammatical function
  • no POS
  • Use a bilingual LSA space to make the projection
  • Preparation:
  • Find the lexical units in the Target language
    that fit each frame
  • Use an available resource
  • Or compute them automatically
  • Compute the semantic clusters of each frame
    element

9
The Semantic Space based approach (using LSA)
  • Usage: automatic pre-annotation (or selection)
  • For each sentence in the Target corpus:
  • Find potential frames from LUs
  • Compare each word (or head of constituent) of the
    sentence with the computed semantic clusters of
    the (core) roles of the candidate frames (or of
    the corresponding roles in parent frames if
    training data is missing)
  • Candidate frames and FEs are rated by their
    semantic distance (see the sketch below)
  • What we can expect:
  • Can't deal with anaphora
  • Can't deal with FEs that are not semantically
    narrow
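A minimal sketch of this rating loop, assuming the LSA word vectors are available as numpy arrays; lu_to_frames and role_clusters are hypothetical stand-ins for the French LU lists and the precomputed FE clusters:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def preannotate(sentence_lemmas, lu_to_frames, role_clusters, word_vectors):
    """Rate candidate frames and FEs for one sentence by semantic distance.

    lu_to_frames : hypothetical map, French LU lemma -> set of frame names
    role_clusters: hypothetical map, (frame, role) -> list of cluster centroids
    word_vectors : lemma -> vector in the (bilingual) LSA space
    """
    candidates = []
    for lemma in sentence_lemmas:
        for frame in lu_to_frames.get(lemma, ()):
            scores = {}
            for word in sentence_lemmas:
                if word == lemma or word not in word_vectors:
                    continue
                # compare the word with each cluster of each role of the frame
                for (f, role), centroids in role_clusters.items():
                    if f != frame:
                        continue
                    best = max(cosine(word_vectors[word], c) for c in centroids)
                    scores[(word, role)] = max(best, scores.get((word, role), -1.0))
            candidates.append((frame, lemma, scores))
    # crude overall rating: the best role score found for the frame
    candidates.sort(key=lambda c: max(c[2].values(), default=0.0), reverse=True)
    return candidates
```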

10
Subprojects
  • Convert frames to French
  • Using the ISC Semantic Atlas (built from 2
    synonym dictionaries and a minimal FR//EN corpus)
  • Annotation of a reference subcorpus
  • 1000 sentences from Europarl
  • Projection using LSA

11
Convert Frames to French
12
English LUs to French LUs
  • For each frame in Berkeley FrameNet:
  • For each LU, find potential translations into
    French, using the Semantic Atlas (Ploux & Ji,
    2003) - other languages?
  • Compute the French profile of the frame (see the
    sketch below)
  • Manually check that a lemma can actually evoke
    the frame (a purely subjective judgment)
  • A frame-by-frame procedure
  • Must be validated later by corpus evidence
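A minimal sketch of the profile computation, with a plain dictionary standing in for the Semantic Atlas lookup (the real resource is queried separately; the names here are hypothetical):

```python
from collections import Counter

def french_profile(frame_lus, translations):
    """Collect French translation candidates for a frame's English LUs.

    translations: hypothetical map, English lemma -> list of French lemmas,
    standing in for a Semantic Atlas query.
    """
    profile = Counter()
    for lu in frame_lus:
        lemma = lu.rsplit(".", 1)[0]       # strip the POS suffix of "fill.v"
        for fr in translations.get(lemma, []):
            profile[fr] += 1               # lemmas shared by several LUs rank higher
    return profile.most_common()

# french_profile(["fill.v", "cover.v"],
#                {"fill": ["remplir", "garnir"], "cover": ["couvrir", "garnir"]})
# -> [("garnir", 2), ("remplir", 1), ("couvrir", 1)]
```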

13
Lexical units in the Filling frame
  • adorn.v, anoint.v, asphalt.v, brush.v, butter.v,
    coat.v, cover.v, cram.v, crowd.v, dab.v, daub.v,
    douse.v, drape.v, drizzle.v, dust.v, embellish.v,
    fill.v, flood.v, gild.v, glaze.v, hang.v, heap.v,
    inject.v, jam.v, load.v, pack.v, paint.v,
    panel.v, pave.v, pile.v, plant.v, plaster.v,
    pump.v, scatter.v, seed.v, shower.v, smear.v,
    sow.v, spatter.v, splash.v, splatter.v, spray.v,
    spread.v, sprinkle.v, squirt.v, strew.v, stuff.v,
    suffuse.v, surface.v, tile.v, varnish.v,
    wallpaper.v, wrap.v

14
Translations 1/4
  • Adorn: Chamarrer, embellir, enjoliver, orner,
    parer, revêtir
  • Anoint: Oindre
  • Asphalt: Asphalter, bitumer
  • Brush: Badigeonner, brosser, effleurer
  • Butter: Beurrer
  • Coat: Empâter, enduire, enrober, revêtir
  • Cover: badigeonner, barbouiller, couvrir,
    franchir, gainer, garnir, habiller, monter,
    parcourir, quadriller, recouvrir, revêtir,
    saillir, se couvrir, subvenir, tapisser
  • Cram: bachoter, bâfrer, bûcher, chauffer,
    engraisser, lester, potasser
  • Crowd: foule (should also include peupler)
  • Dab: bassiner, tamponner, toucher
  • Daub: badigeonner, barbouiller, peinturlurer
  • Douse: ???
  • Drape: Draper
  • Drizzle: brouillasser, bruiner, crachiner,
    pleuvasser, pleuviner
  • Dust: enlever la poussière, essuyer, poussière,
    saupoudrer, épousseter
  • Embellish: broder, embellir, enjoliver, orner
  • Fill: appliquer un enduit, boucher, bourrer,
    calfeutrer, combler, devenir plein, emplir,
    enfler, fourrer, garnir, gonfler, gorger,
    imprégner, lester, mastiquer, meubler, obturer,
    occuper, peupler, plomber, pourvoir, pourvoir à,
    pénétrer, remplir, s'enfler, se gonfler, se
    peupler, se remplir

15
Manual selection 1/4
  • The same candidate list as on the previous slide,
    with the manually selected translations marked in
    color (the color highlighting is not preserved in
    this transcript)

16
Frame building: Conclusion
  • Quite inexpensive compared to introspection from
    scratch or a corpus-based approach (Filling is a
    big frame with a lot of LUs; it took me 30
    minutes to select good instances, with manual
    color marking)
  • Probably far from perfect coverage, low precision
  • Several annotators are needed to duplicate the
    work

17
Our approach to cross-language semantic annotation
  • The goal:
  • A lemma can be related to several frames
  • We want to disambiguate between the possible
    choices
  • And also try to attribute roles (at least core
    roles) once the choice is made
  • All of this in French, while the training data is
    in English

18
Bilingual LSA approach
19
Latent Semantic Analysis
  • An improvement over co-occurrence matrices
  • Reduces the number of dimensions
  • Example (see the sketch below):
  • A occurs in documents (or contexts) 1, 2, 3
  • B in 2, 3, 4, 5
  • C in 4, 5, 6
  • A and C never occur in the same document
  • LSA allows reducing documents 1-6 to one
    dimension
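A minimal illustration of the reduction with plain SVD on the toy matrix above (the project itself used Infomap-NLP, but the principle is the same):

```python
import numpy as np

# Term-document matrix: rows = terms A, B, C; columns = documents 1..6
X = np.array([[1., 1., 1., 0., 0., 0.],   # A occurs in documents 1, 2, 3
              [0., 1., 1., 1., 1., 0.],   # B in 2, 3, 4, 5
              [0., 0., 0., 1., 1., 1.]])  # C in 4, 5, 6

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 1                            # keep one latent dimension, as on the slide
terms = U[:, :k] * s[:k]         # term positions in the reduced space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(X[0], X[2]))          # 0.0: A and C never co-occur
print(cosine(terms[0], terms[2]))  # 1.0: both load on the B-dominated dimension
```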

20
Evaluating the semantic position of Frame
Elements in LSA
  • Computing an English LSA space
  • Tools: TreeTagger + Infomap-NLP
  • Corpus: BNC + English part of Europarl +
    a translation of Balzac
  • Terms: POS+lemma, e.g. NN+year (see the sketch
    below)
  • Keep only verbs, adjectives, nouns, adverbs
  • Other combinations (no POS, all POS, raw form)
    don't perform as well
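A sketch of the term construction, assuming TreeTagger's usual tab-separated output (word, tag, lemma); the tag prefixes used for filtering and the "+" separator are assumptions about the exact format:

```python
# Penn-style tag prefixes for verbs, adjectives, nouns, adverbs (assumption)
CONTENT_PREFIXES = ("V", "J", "N", "R")

def pos_lemma_terms(treetagger_lines):
    """Turn TreeTagger output lines into POS+lemma terms for the LSA toolkit."""
    terms = []
    for line in treetagger_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue                             # skip malformed lines
        word, tag, lemma = parts
        if tag.startswith(CONTENT_PREFIXES):
            terms.append(f"{tag[:2]}+{lemma}")   # e.g. "NN+year"
    return terms

# pos_lemma_terms(["The\tDT\tthe", "years\tNNS\tyear"]) -> ["NN+year"]
```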

21
Example
  • FE annotations for Filling.Theme:
  • with water.
  • with a fungicide such as green or yellow sulphur.
  • with a soft brush and malathion dust.
  • with a little cayenne pepper.
  • Terms used for the FE representation:
  • NN+water, NN+fungicide, JJ+such, JJ+green,
    JJ+yellow, NN+sulphur, JJ+soft, NN+brush,
    NN+malathion, NN+dust, JJ+little, NN+cayenne,
    NN+pepper

22
Evaluating FEs' semantic coherence
  • Compute the semantic center of the FE: the
    centroid of the positions of the FE's terms
  • Find the N nearest neighbors of this center
  • If the center is in a semantically coherent
    region, the average similarity between the
    neighbors and the center is high (see the sketch
    below)
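A sketch of this coherence measure, assuming the LSA vectors are numpy arrays; it returns the same statistics reported on the next slide (average, std, min, max):

```python
import numpy as np

def fe_coherence(fe_term_vectors, vocabulary_vectors, n=10):
    """Similarity statistics between an FE's semantic center and its neighbors.

    fe_term_vectors   : vectors of the POS+lemma terms annotated for the FE
    vocabulary_vectors: lemma -> vector for the whole LSA vocabulary
    """
    center = np.mean(fe_term_vectors, axis=0)
    center /= np.linalg.norm(center)
    # cosine similarity of every vocabulary item to the center, keep the top n
    sims = sorted((float(v @ center / np.linalg.norm(v))
                   for v in vocabulary_vectors.values()), reverse=True)[:n]
    sims = np.array(sims)
    return sims.mean(), sims.std(), sims.min(), sims.max()
```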

23
FEs of Filling
  Frame.FE            Average    Std        Min       Max       Nb annot
  Filling.Agent       0.604941   0.0413504  0.563591  0.717469  279
  Filling.Cause       -          -          -         -         -
  Filling.Degree      0.595513   0.0431123  0.552401  0.697830  4
  Filling.Depictive   0.683302   0.0502735  0.633029  0.804053  1
  Filling.Goal        0.6483     0.0510976  0.597202  0.793063  543
  Filling.Instrument  0.646028   0.0715617  0.574466  0.844308  4
  Filling.Manner      0.647012   0.0795992  0.567413  0.896142  25
  Filling.Means       0.67356    0.0502949  0.623265  0.820630  1
  Filling.Path        0.708096   0.069683   0.638413  0.925448  2
  Filling.Place       0.562765   0.0364663  0.526299  0.683526  2
  Filling.Purpose     0.631099   0.0585047  0.572594  0.761788  5
  Filling.Result      0.734567   0.0585102  0.676057  0.825459  37
  Filling.Source      0.611222   0.0447367  0.566485  1.000000  1
  Filling.Subregion   0.782659   0.0756196  0.707039  0.944916  2
  Filling.Theme       0.747146   0.0485786  0.698567  0.890307  450
  Filling.Time        0.474269   0.0474972  0.426772  0.628049  16

24
Neighbors of Filling.Theme
  • powder 0.890307
  • spray 0.836283
  • dry 0.821666
  • crushed 0.820905
  • charcoal 0.813571
  • plastic 0.806768
  • copper 0.804459
  • paste 0.802643
  • foam 0.802201
  • brush 0.799847
  • Computed from: with fake diamonds; with pictures
    of cute white bunnies; with jewels and fine
    gowns; with one of these pegs; with pictures,
    flowers, and messages of peace; with wreaths of
    flowers and garlands of feathers; with the finest
    furniture from a firm in London's New Bond
    Street; with a crown; with beautifully hooked
    melodies and harmonies; with chrism, the sacred
    ointment; with gel; with such a leaden armour of
    expectations; with the poison; with these
    substances; with vaseline; with his pungent
    urine; with holy oil; in bulb fibre; in whipped
    cream and honey; with a foot of topsoil; with her
    hand.

25
Neighbors of Filling.Agent
  • oliver 0.717469
  • jack 0.696716
  • joe 0.691628
  • marie 0.686812
  • harry 0.684113
  • charlie 0.681887
  • billy 0.680378
  • tom 0.678887
  • jane 0.676179
  • rose 0.669748
  • Computed from: Your man; I; They; The priests;
    He; the wife of Cnut's henchman Tofi the Proud;
    The Reclusiarch; she; What father; The Indians;
    Over 200 species of birds; He; He; Father Peter;
    Viktor; by ecclesiastics; We; One girl; She; she;
    he; the white gravel; the reluctant soldier; I;
    Eva; he; Two people; he; the good beachcombers;
    Sylvester; he; He; Two girls; you; Cecil Beaton;
    you; Larsen; you; He; you; you; He; he; she; Mina
    and K.; She; you; she; the programme that turns
    the cameras on teenagers and let's them do the
    talking and the interviews; Baldwin; by Molly
    Fletcher; She; I; They; she; Endill; They; He;
    the BBC and official propaganda

26
FE clusters
  • Grouping the terms of the FE by a minimal
    similarity (arbitrarily set, i.e. 0.874) - see
    the sketch below
  • Keeping clusters with more than 5% of the terms
  • http://guillaume.work.free.fr/Frames.en.3
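A sketch of the grouping step, implemented here as single-link clustering over a union-find structure; the similarity cutoff and the 5% size filter follow the slide, the rest is an assumption:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cluster_terms(term_vectors, threshold=0.874, min_share=0.05):
    """Group FE terms whose pairwise similarity exceeds a fixed threshold.

    term_vectors: term -> vector; returns the clusters holding more than
    min_share of the FE's terms.
    """
    terms = list(term_vectors)
    parent = {t: t for t in terms}              # union-find over terms

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]       # path halving
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if cosine(term_vectors[a], term_vectors[b]) >= threshold:
                parent[find(a)] = find(b)       # merge the two groups

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return [c for c in clusters.values() if len(c) > min_share * len(terms)]
```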

27
Clusters of the Filling frame
  • Agent: 2 cluster(s)
  • Degree: 4 cluster(s)
  • Depictive: 6 cluster(s)
  • Goal: 2 cluster(s)
  • Instrument: 6 cluster(s)
  • Manner: 2 cluster(s)
  • Means: 2 cluster(s)
  • Path: 1 cluster(s)
  • Place: 5 cluster(s)
  • Purpose: 1 cluster(s)
  • Result: 2 cluster(s)
  • Source: 1 cluster(s)
  • Subregion: 1 cluster(s)
  • Theme: 2 cluster(s)
  • Time: 0 cluster(s)

28
Clusters of Filling.Agent
  • rachel 0.867663
  • sara 0.863332
  • ellen 0.856612
  • lily 0.855513
  • sally 0.853933
  • alice 0.849205
  • emily 0.847480
  • dad 0.845598
  • jenny 0.844003
  • kate 0.839664
  • maggie 0.836391

  • Second cluster: tom 0.924026, john 0.908828,
    hugh 0.898049, michael 0.897622, scott 0.892861,
    sir 0.891623, david 0.889539, frank 0.889324,
    murray 0.879660, anthony 0.879149, geoffrey
    0.876748
29
Clusters of Filling.Goal
  • tin 0.924426
  • pot 0.908988
  • jar 0.908169
  • cake 0.893367
  • bottle 0.888083
  • bag 0.871596
  • jug 0.866099
  • bowl 0.860658
  • basket 0.858857
  • plastic 0.852992
  • dish 0.846176
  • peel 0.834313

  • Second cluster: wall 0.911646, wooden 0.864492,
    entrance 0.851708, front 0.846124, floor
    0.834214, porch 0.834039, staircase 0.827131,
    roof 0.823297, rear 0.815847, corner 0.815765,
    rear 0.813187, front 0.813136
30
Clusters of Filling.Theme
  • powder 0.913015
  • salt 0.907773
  • dry 0.900202
  • aromatic 0.886529
  • vegetable 0.870903
  • spray 0.867004
  • bean 0.860508
  • herb 0.858321
  • meat 0.852165
  • apple 0.848998
  • vinegar 0.848045
  • pea 0.845492

  • Second cluster: shiny 0.915945, red 0.908281,
    pink 0.905748, tint 0.900729, grey 0.899490,
    yellow 0.882565, blue 0.882097, white 0.877434,
    ribbon 0.876266, brown 0.875334, pale 0.875016,
    silk 0.865824
31
Projection
  • Compute French clusters from the English clusters
  • Corpus collection:
  • Europarl (French-English)
  • Parallel French-English Balzac from Project
    Gutenberg
  • French//English: 50M lemmas
  • Shakespeare and the Hansard Corpus to be included

32
Training data
  • Lemmas interleaved on a sentence-alignment basis
    (see the sketch below)
  • Training with a larger window
  • Parallel corpus only: experiments that introduce
    bits of purely monolingual corpus show a quality
    loss
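A minimal sketch of the interleaving, assuming sentence-aligned and already lemmatized input:

```python
def interleave(fr_sentences, en_sentences):
    """Build bilingual pseudo-sentences from a sentence-aligned corpus.

    Each aligned pair contributes its French and English lemmas to the same
    training unit, so translation pairs share contexts in the LSA space.
    """
    assert len(fr_sentences) == len(en_sentences)
    for fr, en in zip(fr_sentences, en_sentences):
        yield fr + en        # one interleaved unit per alignment pair

# list(interleave([["manger", "pomme"]], [["eat", "apple"]]))
# -> [["manger", "pomme", "eat", "apple"]]
```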

33
Similarity between translations in the bilingual
semantic space
  • Results:
  • eat / manger: 0.98 (32)
  • fleuve / river: 0.94 (55)
  • green / vert: 0.83 (92)
  • bleu / blue: 0.87 (81)
  • eat / fleuve: 0.77 (107)
  • drink / écran: 0.82 (96)

34
Neighborhood in Bilingual Semantic Space
  • Eat/Manger

eat 0.976250, manger 0.976250, consommer 0.823532
(to consume), boire 0.818577 (to drink), feed
0.784077, fumeur 0.777815 (smoker), consume
0.775385, fumer 0.757367 (to smoke), cream 0.744859
35
Neighborhood in Bilingual Semantic Space
  • Fleuve/River

river 0.938150, fleuve 0.938150, coastline 0.810345,
rivière 0.807991, alp 0.801064, sea 0.774821, lake
0.771523, coast 0.761910, littoral 0.756541
(seashore), bassin 0.755235 (basin)
36
Neighborhood in Bilingual Semantic Space
  • Vert/Green

vert 0.825634, green 0.825634, green 0.748835,
biotechnology 0.745683, mandelkern 0.675176, hatch
0.664682, taslima 0.633138, cote 0.628252, converter
0.624423, orée 0.616550 (forest border), hydrogen
0.611002
37
Projection: Conclusion
  • Projecting whole clusters gives variable results
  • The results of the projection are very
    disappointing
  • Unusable in this state
  • It seems this may simply come from alignment
    mistakes
  • Can we improve the projected clusters with a
    bilingual dictionary?
  • Relating clusters to synsets? Not necessarily a
    good idea: Champagne and Caviar are not related
    in WordNet
  • More generally, simple translation may cause an
    undesired broadening of the cluster

38
Potential applications
  • Statistical processing is interesting because it
    can capture usage-based regularities
  • Clusters built with LSA can be interesting
    information sources for the lexicographer
  • They can also, more simply, be used to
    automatically find new semantic types/selectional
    preferences emerging from the annotation of a new
    domain (frequently occurring metaphors, for
    instance)
  • In a multilingual, collaborative annotation task,
    they could be useful for transferring work
    between languages without requiring annotation of
    a parallel corpus.