Title: Analyzing Czech and English Parses: Core NLP Technology Applicable to Multiple Languages
1Analyzing Czech and English Parses: Core NLP Technology Applicable to Multiple Languages
- Cynthia Kuo
- Stanford University
- 20 August 1998
2Purpose
- Evaluation tool
- Compare a parser's output with the "truth"
- Use dependency structures in SGML format
- Output recall and precision
- Parsing by translation
- Explore the use of a bilingual corpus to modify
English parsers for Czech
3Truth Vs. Guess
<f>Kanadské<r>1<MDg>2 <f>firmy<r>2<MDg>3 <f>mají<r>3<MDg>0 <f>zájem<r>4<MDg>0 <f>o<r>5<MDg>0 <f>spolupráci<r>6<MDg>0 <f>&ast;<r>7<MDg>0
<f>Kanadské<r>1<g>2 <f>firmy<r>2<g>3 <f>mají<r>3<g>0 <f>zájem<r>4<g>3 <f>o<r>5<g>4 <f>spolupráci<r>6<g>5 <d>&ast;<r>7<g>0
4Evaluation
- Recall = correct dependencies / dependencies in truth = 4/7 = 57.1%
- Precision = correct dependencies / guesses by parser = 4/7 = 57.1%
- 7 dependencies in the truth
- 7 "guesses" by the parser
- 4 correct dependencies in parser's output
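The counts above can be reproduced with a short Python sketch. It is not the evaluation tool itself: the regular expression, the function names, and the choice to treat the <g> line as the truth and the <MDg> line as the parser's guess are assumptions made for illustration, as is writing token 7 as &ast;.

```python
import re

# One token: <f> (word) or <d> (punctuation), the form, its position <r>,
# and its governor under <g> or <MDg>. Governor 0 marks the root.
TOKEN = re.compile(r"<[fd]>([^<\s]+)<r>(\d+)<(?:MDg|g)>(\d+)")

def read_dependencies(sgml):
    """Map each token position to the position of its governor."""
    return {int(pos): int(gov) for _form, pos, gov in TOKEN.findall(sgml)}

def recall_precision(truth, guess):
    """Return (recall, precision, number of correct dependencies)."""
    correct = sum(1 for pos, gov in guess.items() if truth.get(pos) == gov)
    return correct / len(truth), correct / len(guess), correct

# Assumption: the <g> line is the truth, the <MDg> line is the parser's guess.
truth = read_dependencies(
    "<f>Kanadské<r>1<g>2 <f>firmy<r>2<g>3 <f>mají<r>3<g>0 <f>zájem<r>4<g>3 "
    "<f>o<r>5<g>4 <f>spolupráci<r>6<g>5 <d>&ast;<r>7<g>0"
)
guess = read_dependencies(
    "<f>Kanadské<r>1<MDg>2 <f>firmy<r>2<MDg>3 <f>mají<r>3<MDg>0 <f>zájem<r>4<MDg>0 "
    "<f>o<r>5<MDg>0 <f>spolupráci<r>6<MDg>0 <f>&ast;<r>7<MDg>0"
)

recall, precision, correct = recall_precision(truth, guess)
print(f"{correct} correct, recall {recall:.1%}, precision {precision:.1%}")
# prints: 4 correct, recall 57.1%, precision 57.1%
```

Recall and precision coincide here only because the truth and the guess both contain 7 dependencies; they would differ if the parser left some tokens unattached.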
5Parsing by Translation
with Doug Jones
- Ideally, the Czech and English sentences would
look identical
6Tag Affinities
Part of speech   Czech tag   English tag
adjective        A-          J-
adverb           D-          R-
conjunction      J-          CC
determiner       --, (P-)    D-
noun             N-          N-
preposition      R-          IN
pronoun          P-          PRP
verb             V-          V-
- Manually created
- Tags from
- Penn Treebank
- Prague Dependency Treebank
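One way to use the affinity table is as a lookup that says whether a Czech tag and an English tag may correspond. The Python sketch below is illustrative only: the dictionary layout and the function name have_affinity are assumptions, tags are matched by prefix, and the determiner row is left out because the slide lists no regular Czech counterpart for it.

```python
# Part of speech -> (Czech tag prefix, English tag prefix), copied from the table.
# The determiner row is omitted: Czech has "--" there, with P- only in parentheses.
AFFINITIES = {
    "adjective": ("A", "J"),
    "adverb": ("D", "R"),
    "conjunction": ("J", "CC"),
    "noun": ("N", "N"),
    "preposition": ("R", "IN"),
    "pronoun": ("P", "PRP"),
    "verb": ("V", "V"),
}

def have_affinity(czech_tag, english_tag):
    """True if some row of the table matches both tags by prefix."""
    return any(
        czech_tag.startswith(cz) and english_tag.startswith(en)
        for cz, en in AFFINITIES.values()
    )

print(have_affinity("NNFS1", "NNS"))  # noun with noun
print(have_affinity("VB", "NN"))      # verb with noun
```

Prefix matching is a deliberate simplification; for example, the English prefix R would also accept the Penn particle tag RP alongside the adverb tags.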
7Tag Affinities: Unigram Frequencies
- Content words
POS                     Czech   English
adjective               655     643
adverb                  833     920
noun                    2112    2868
number                  197     203
verb                    2420    2442
total number of words   9317    11214
- Set of aligned Reader's Digest sentences
- Similar distribution of tags
- Slightly more English words than Czech words
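Because the two texts differ in length, raw counts are easier to compare once each is divided by its language's total. The following Python sketch does that for the content-word counts; the variable and function names are hypothetical, and only the numbers come from the slide.

```python
# (Czech count, English count) per content-word class, from the slide.
CONTENT_COUNTS = {
    "adjective": (655, 643),
    "adverb": (833, 920),
    "noun": (2112, 2868),
    "number": (197, 203),
    "verb": (2420, 2442),
}
TOTAL_WORDS = (9317, 11214)  # (Czech, English)

def shares(counts, totals):
    """Turn raw counts into fractions of each language's total word count."""
    cz_total, en_total = totals
    return {pos: (cz / cz_total, en / en_total) for pos, (cz, en) in counts.items()}

for pos, (cz_share, en_share) in shares(CONTENT_COUNTS, TOTAL_WORDS).items():
    print(f"{pos:<10} Czech {cz_share:6.1%}   English {en_share:6.1%}")

# Ratio of English to Czech running words, about 1.20.
length_ratio = TOTAL_WORDS[1] / TOTAL_WORDS[0]
```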
8More Unigram Frequencies
- Function words
POS                     Czech   English
conjunction             433     231
determiner              --      892
modal                   --      173
particle                74      4
preposition             658     684
pronoun                 1297    1262
unknown                 531     22
total number of words   9317    11214
- Some words in English are not in Czech
- Determiners
- Modal verbs
- More unknown tags and particles in Czech
9Transformations
- Eliminate punctuation
- Change word order / paraphrase
- Drop determiners
- Collapse modal verbs and infinitives
- Drop prepositions
- Drop subject
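Several of these transformations can be approximated on a part-of-speech-tagged English sentence. The Python sketch below is one possible reading, not the procedure used in this work: it assumes Penn Treebank tags (DT, IN, MD, VB), joins a modal to an immediately following base-form verb with an underscore, and leaves out word-order changes and subject dropping, which need a parse rather than tags. The function name and the example sentence are invented for illustration.

```python
# Penn Treebank tags treated as punctuation in this sketch.
PUNCT_TAGS = {".", ",", ":", "``", "''", "-LRB-", "-RRB-"}

def transform(tagged):
    """Reduce a list of (word, Penn tag) pairs toward a Czech-like shape."""
    out = []
    for word, tag in tagged:
        # Eliminate punctuation, drop determiners, drop prepositions.
        # Note: Penn IN also covers subordinating conjunctions.
        if tag in PUNCT_TAGS or tag in ("DT", "IN"):
            continue
        # Collapse a modal and the verb right after it into one token.
        if tag == "VB" and out and out[-1][1] == "MD":
            modal, _ = out.pop()
            out.append((modal + "_" + word, "VB"))
            continue
        out.append((word, tag))
    return out

sentence = [("The", "DT"), ("firms", "NNS"), ("can", "MD"), ("cooperate", "VB"),
            ("with", "IN"), ("us", "PRP"), (".", ".")]
print(transform(sentence))
# [('firms', 'NNS'), ('can_cooperate', 'VB'), ('us', 'PRP')]
```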
10Transformations 2
- Eliminate punctuation
- Change word order / paraphrase
- Drop determiners
(Figure: English and Czech structures side by side; node labels Word, Root, Punc, 1, 2, 3)
11Transformations 3
- Collapse modal verbs and infinitives
- Drop prepositions
- Drop subject
(Figure: English and Czech structures side by side; node labels Modal, Verb, Verb)
12Future Research: Looking for Matches
13Automating the Search?
- Compare domains?
- Grouping phrases
- Use domains containing descendants of uppermost ancestor
- "of, recording, music", not "that, recording, music"
- Grouping around verbs
- Verbs play a central role
- Verbs usually appear in both sentences
- Matching groups by probability
- Identify important words in the phrases
- Make educated guesses about matching phrases
- Loosen affinities
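The idea of a domain can be made concrete on a dependency structure. The Python sketch below takes a domain to be a word together with all of its direct and indirect dependents; that definition, the function name, and the reuse of the governor numbers from the second structure on the Truth Vs. Guess slide are assumptions for illustration.

```python
def domain(governors, head):
    """Positions of `head` and of every token that depends on it, directly or not."""
    # Invert the {position: governor} map into {governor: [dependents]}.
    children = {}
    for pos, gov in governors.items():
        children.setdefault(gov, []).append(pos)
    found, stack = set(), [head]
    while stack:
        node = stack.pop()
        if node in found:  # guards against malformed, cyclic input
            continue
        found.add(node)
        stack.extend(children.get(node, []))
    return found

# Governors for "Kanadské firmy mají zájem o spolupráci" plus the final token.
governors = {1: 2, 2: 3, 3: 0, 4: 3, 5: 4, 6: 5, 7: 0}
print(sorted(domain(governors, 4)))  # [4, 5, 6]: zájem o spolupráci
print(sorted(domain(governors, 3)))  # [1, 2, 3, 4, 5, 6]: everything under the verb
```

Comparing such position sets across the two languages would be one starting point for the phrase grouping described above.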