1
CoNLL-X Shared Task on Multilingual Dependency
Parsing
  • Sabine Buchholz, Speech Technology Group,
    Cambridge Research Lab, Toshiba Research Europe
    Ltd, UK
  • Erwin Marsi, Tilburg University, The Netherlands
  • Amit Dubey, University of Edinburgh, UK
  • Yuval Krymolowski, University of Haifa, Israel

2
Overview
  • Introduction
  • Dependency structures
  • Data format
  • Treebanks used
  • Evaluation metric
  • Parsing approaches
  • Conclusion
  • Results
  • Analysis
  • The Future

3
Dependency structures
  • No constituents (unlike phrase structure)
  • Dependency relations between two lexical items
    (tokens)
  • One possible graphical representation

4
Dependency structure terminology
  [Figure: an arc labelled "subj" drawn from the child "This" to the
  head "is"]
  • Label (the name on the arc, here: subj)
  • Child = Dependent = Modifier (the token the arrow starts from)
  • Parent = Governor = Head (the token the arrow points to)

Note: Other people may draw arrows from head to
child!
5
Dependency structures in the shared task
  • Virtual root node
  • Each token except BOS has exactly one head
  • More than one token can link to BOS
  • Crossing arcs are allowed, i.e. structures can be
    non-projective

  [Figure: "BOS This is a test ." with tokens numbered 0-5 and arcs
  labelled ROOT, subj, det, comp and punc]

  Do you need it for something ?
  What do you need it for ?
  [the slide contrasts these two questions to illustrate crossing arcs]

An arc (i,j) is projective iff all nodes
occurring between i and j are dominated by i
(where "dominates" is the transitive closure of the
arc relation); a sketch of this test follows.
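A minimal sketch of this projectivity test in Python (names and list
encoding are illustrative, assuming heads[j] gives the head of token j
and 0 is the virtual BOS root):

  def is_projective(heads):
      # heads[j] = head of token j; index 0 is the virtual root (BOS),
      # so heads[0] is unused. Assumes a well-formed tree (no cycles).
      def dominates(i, k):
          # i dominates k iff following head links upward from k reaches i.
          while k != 0:
              k = heads[k]
              if k == i:
                  return True
          return False
      for j in range(1, len(heads)):
          i = heads[j]
          for k in range(min(i, j) + 1, max(i, j)):
              if not dominates(i, k):
                  return False
      return True

  # The "BOS This is a test ." example above is fully projective:
  print(is_projective([0, 2, 0, 4, 2, 2]))  # True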
6
Data format
  [Slide: the "BOS This is a test ." example from the previous slide,
  rendered in the column-based data format]
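A plausible reconstruction of that slide (HEAD and DEPREL follow the
arcs in the figure; the LEMMA, CPOSTAG, POSTAG and FEATS values here
are illustrative, not taken from the slide):

  ID  FORM  LEMMA  CPOSTAG  POSTAG  FEATS  HEAD  DEPREL
  1   This  this   D        DT      _      2     subj
  2   is    be     V        VBZ     _      0     ROOT
  3   a     a      D        DT      _      4     det
  4   test  test   N        NN      _      2     comp
  5   .     .      P        .       _      2     punc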
7
Data format details
  • ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL are
    guaranteed to contain a non-dummy value
    • except Spanish DEPREL bug ... ?
    • although CPOSTAG and POSTAG may be identical
      (German and Swedish)
  • LEMMA and FEATS are allowed to contain the dummy value ( _ )
    • if the information is not available in the original treebank
  • Additional columns PHEAD and PDEPREL (in training
    data) were not used by anybody
  • Unicode (UTF-8); a minimal reader is sketched below
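For concreteness, a minimal reader for this format (a sketch, not
official shared-task code; columns are tab-separated and sentences are
separated by blank lines):

  FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
            "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

  def read_conllx(path):
      # Yield one sentence at a time as a list of {column: value} dicts;
      # "_" marks a dummy value. Blind test data has only the first six
      # columns, which zip() handles by truncation.
      sentence = []
      with open(path, encoding="utf-8") as f:  # the data is UTF-8
          for line in f:
              line = line.rstrip("\n")
              if not line:
                  if sentence:
                      yield sentence
                      sentence = []
              else:
                  sentence.append(dict(zip(FIELDS, line.split("\t"))))
      if sentence:
          yield sentence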

8
Treebanks used
  • Czech: Prague Dependency Treebank (PDT)
  • Arabic: Prague Arabic Dependency Treebank (PADT)
  • Slovene: Slovene Dependency Treebank (SDT)
  • Danish: Danish Dependency Treebank (DDT)
  • Swedish: Talbanken05
  • Turkish: METU-Sabanci treebank
  • German: TIGER treebank
  • Japanese: Japanese Verbmobil treebank
  • Portuguese: the Bosque part of the Floresta
    sintá(c)tica
  • Dutch: Alpino treebank
  • Chinese: Sinica treebank
  • Spanish: Cast3LB
  • Bulgarian: BulTreeBank

  [Slide groups the treebanks by original annotation type: dependency
  format vs. constituents and functions vs. constituents and some
  functions]
9
Data format: some examples
  [Slide: a raw Sinica treebank sentence of the form
  55.39031 VP(evaluation:Dbb:...|Head:V_11:...|range:NP(property:Nv3:...|
  Head:Nab:...))#(PERIODCATEGORY); the Chinese word forms were lost in
  extraction]
10
  [Slide: a sentence from the Prague Arabic Dependency Treebank in its
  native FS format: a header with tag=HEADLINE, ord=0,
  comment=Sun Oct 3 05:02:28 2004 [SyntaxFS.pl 1.06],
  x_comment=ALH20010911.0001_story Wed Jul 21 12:51:09 2004
  [MorphoFS.pl 1.09], followed by per-word entries such as
  (<Arabic form>,ExD,<Arabic>_1,N-------1R,...,x_lookup=gyAb,giyAbu,
  absence/disappearance [def.nom.]) and names like Kan'an and
  Fuad/Fouad; the Arabic script was lost in extraction]
11
<node id="0" rel="top" cat="top" begin="0" end="6">
  <node id="1" rel="--" cat="sv1" begin="0" end="5">
    <node id="2" rel="hd" pos="verb" begin="0" end="1" root="ben"
          word="Ben"/>
    <node id="3" rel="su" pos="noun" begin="1" end="2" root="je"
          word="je"/>
    <node id="4" rel="predc" cat="mwu" begin="2" end="5">
      <node id="5" rel="mwp" pos="adj" begin="2" end="3" root="op"
            word="op"/>
      <node id="6" rel="mwp" pos="adj" begin="3" end="4" root="de"
            word="de"/>
      <node id="7" rel="mwp" pos="adj" begin="4" end="5" root="hoogte"
            word="hoogte"/>
    </node>
  </node>
  <node id="8" rel="--" pos="punct" begin="5" end="6" root="?"
        word="?"/>
</node>
<sentence>Ben je op de hoogte ?</sentence>
12
<S No="2">
  <W IX="1" LEM="" MORPH=" "
     IG='(1,"kurtul+Verb+Pos")(2,"Noun+Inf+A3sg+Pnon+Nom")'
     REL="2,1,(OBJECT)"> Kurtulmak </W>
  <W IX="2" LEM="" MORPH=" "
     IG='(1,"iste+Verb+Neg+Prog1+A1sg")'
     REL="3,1,(SENTENCE)"> istemiyorum </W>
  <W IX="3" LEM="" MORPH=" " IG='(1,".+Punc")'
     REL=",( )"> . </W>
</S>
13
SOURCE: CETEMPúblico n=22 sec=eco sem=92a
CP22-4 Mas se falhar?
A1
UTT:acl
=CO:conj-c('mas')  Mas
=ADVL:fcl
==SUB:conj-s('se')  se
==P:v-fin('falhar' FUT 3S SUBJ)  falhar
=?
  • Head table:
    • acl: COM, PRD, P, leftmost non-punctuation
    • fcl: P, PAUX, ...

14
Data format: training data and test data
  • Training data
    • contains all columns
  • Blind test data (given to participants)
    • contains only the first six columns
    • participants predict HEAD and DEPREL
  • Approximately 5000 scoring tokens for each
    language

15
Evaluation metric
  • Official metric: Labelled attachment score (LAS)
    • the percentage of scoring tokens for which the
      system predicted the correct HEAD and DEPREL
      values
  • A token is non-scoring if all characters of the
    FORM value have the Unicode category property
    Punctuation
    • e.g.  .  ,  ?  (  --  _
  • Also computed, for error analysis and system
    comparison (both sketched below):
    • Unlabelled attachment score (UAS)
      • the percentage of scoring tokens for which the
        system predicted the correct HEAD value
    • Label accuracy
      • the percentage of scoring tokens for which the
        system predicted the correct DEPREL value
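A sketch of all three metrics (assuming gold-standard and system
tokens come pre-aligned as (FORM, HEAD, DEPREL) triples; Python's
unicodedata module supplies the Unicode category test):

  import unicodedata

  def is_scoring(form):
      # Non-scoring iff every character of FORM has category Punctuation.
      return not all(unicodedata.category(c).startswith("P") for c in form)

  def attachment_scores(gold, pred):
      # gold, pred: aligned lists of (FORM, HEAD, DEPREL) tuples.
      pairs = [(g, p) for g, p in zip(gold, pred) if is_scoring(g[0])]
      n = len(pairs)
      las = sum(g[1:] == p[1:] for g, p in pairs) / n  # HEAD and DEPREL
      uas = sum(g[1] == p[1] for g, p in pairs) / n    # HEAD only
      la = sum(g[2] == p[2] for g, p in pairs) / n     # DEPREL only
      return las, uas, la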

16
Parsing approaches
  • Many different approaches!
  • How to deal with non-projectivity
  • Different machine learners
    • perceptron, Maximum Entropy, SVM, MLE, ...
  • Search
    • deterministic, n-best, approximate, optimal, ...
  • FORM versus LEMMA, CPOSTAG versus POSTAG
    • always use one, always use both, one or the
      other, ...
  • FEATS (see the sketch after this list)
    • ignore, treat as atomic, split into components,
      cross-product, ...
  • Unlabelled parsing (HEAD) versus labelling
    (DEPREL)
    • interleaved or separate step
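As an illustration of the FEATS options above, "treat as atomic" keeps
the whole string while "split into components" breaks it at the "|"
separators used in the data format (function names are illustrative):

  def feats_atomic(feats):
      # One feature for the whole FEATS string (or none for dummy "_").
      return [] if feats == "_" else [feats]

  def feats_components(feats):
      # Split e.g. "gen=f|num=s|case=n" into its component features.
      return [] if feats == "_" else feats.split("|")

  print(feats_components("gen=f|num=s"))  # ['gen=f', 'num=s']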

17
Parsing approaches: parsing order
  • Four clusters (1 long + several spotlight
    talks each) of talks today
  • All pairs (cluster 4) versus stepwise
    (clusters 1-3)
  • Chart-parsing (most of cluster 3) versus
    classifier-based (clusters 1+2)
  • Child-focused (cluster 1) versus
    direction-focused (cluster 2)