Analyzing Czech and English Parses Core NLP Technology Applicable to Multiple Languages - PowerPoint PPT Presentation

About This Presentation

Slides: 14
Provided by: jacobla3

Transcript and Presenter's Notes

1
Analyzing Czech and English Parses: Core NLP
Technology Applicable to Multiple Languages
  • Cynthia Kuo
  • Stanford University
  • 20 August 1998

2
Purpose
  • Evaluation tool
  • Compare a parser's output with the "truth"
  • Use dependency structures in SGML format
  • Output recall and precision
  • Parsing by translation
  • Explore the use of a bilingual corpus to modify
    English parsers for Czech

3
Truth Vs. Guess
<f>Kanadské<r>1<MDg>2 <f>firmy<r>2<MDg>3 <f>mají<r>3<MDg>0 <f>zájem<r>4<MDg>0 <f>o<r>5<MDg>0 <f>spolupráci<r>6<MDg>0 <f>ast<r>7<MDg>0

<f>Kanadské<r>1<g>2 <f>firmy<r>2<g>3 <f>mají<r>3<g>0 <f>zájem<r>4<g>3 <f>o<r>5<g>4 <f>spolupráci<r>6<g>5 <d>ast<r>7<g>0
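
A minimal sketch of how one annotated sentence could be read into governor indices, assuming the markup uses angle-bracket tags: `<f>` or `<d>` opens a token, `<r>` carries its position, and `<g>` or `<MDg>` carries the position of its governor (0 for the root). That reading of the tags, the regular expression, and the function name `parse_governors` are assumptions made for illustration, not something the slides state.

```python
import re

# Assumed token shape: <f>form<r>position<g>governor (or <d>... and <MDg>...).
TOKEN = re.compile(r"<(?:f|d)>([^<\s]+)<r>(\d+)<(?:g|MDg)>(\d+)")

def parse_governors(line):
    """Return {position: (form, governor)} for one annotated sentence."""
    return {int(r): (form, int(g)) for form, r, g in TOKEN.findall(line)}

# The <MDg> line from the slide, copied token by token.
mdg_line = ("<f>Kanadské<r>1<MDg>2 <f>firmy<r>2<MDg>3 <f>mají<r>3<MDg>0 "
            "<f>zájem<r>4<MDg>0 <f>o<r>5<MDg>0 <f>spolupráci<r>6<MDg>0 "
            "<f>ast<r>7<MDg>0")

parsed = parse_governors(mdg_line)
print(parsed[1])  # ('Kanadské', 2)
```

The last token is kept as `ast`, exactly as it appears in the transcript.
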
4
Evaluation
  • Recall = correct dependencies / dependencies in truth
  • 4/7 = 57.1%
  • Precision = correct dependencies / guesses by parser
  • 4/7 = 57.1%
  • 7 dependencies in the truth
  • 7 "guesses" by the parser
  • 4 correct dependencies in parser's output
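
The arithmetic on this slide can be reproduced with a short sketch. The two dictionaries hold the governor indices from the example sentence on the previous slide; which of the two lines is the "truth" is an assumption here (the `<g>` line is treated as truth and the `<MDg>` line as the parser's guess), and the function name `evaluate` is hypothetical.

```python
def evaluate(truth, guess):
    """truth, guess: dicts mapping word position -> governor position."""
    # A guessed dependency is correct when the truth assigns the same governor.
    correct = sum(1 for pos, gov in guess.items() if truth.get(pos) == gov)
    recall = correct / len(truth)
    precision = correct / len(guess)
    return correct, recall, precision

truth = {1: 2, 2: 3, 3: 0, 4: 3, 5: 4, 6: 5, 7: 0}  # assumed: the <g> line
guess = {1: 2, 2: 3, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0}  # assumed: the <MDg> line

correct, recall, precision = evaluate(truth, guess)
print(f"{correct} correct, recall {recall:.1%}, precision {precision:.1%}")
# 4 correct, recall 57.1%, precision 57.1%
```

Recall and precision coincide in this example because the parser proposes exactly one governor per word, so both denominators are 7.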

5
Parsing by Translation
with Doug Jones
  • Ideally, the Czech and English sentences would
    look identical

6
Tag Affinities
Part of speech   Czech tag   English tag
adjective        A-          J-
adverb           D-          R-
conjunction      J-          CC
determiner       --, (P-)    D-
noun             N-          N-
preposition      R-          IN
pronoun          P-          PRP
verb             V-          V-
  • Manually created
  • Tags from the Penn Treebank (English) and the Prague Dependency Treebank (Czech)
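
One way the affinity table might be encoded is as a list of tag-prefix pairs. This is only a sketch: treating the trailing `-` as "any tag starting with this prefix" is an assumption, and the names `AFFINITIES` and `affinity` are hypothetical.

```python
# (part of speech, Czech tag prefix, English tag prefix), copied from the table.
AFFINITIES = [
    ("adjective", "A", "J"),
    ("adverb", "D", "R"),
    ("conjunction", "J", "CC"),
    ("determiner", "P", "D"),   # the Czech column reads "--, (P-)"
    ("noun", "N", "N"),
    ("preposition", "R", "IN"),
    ("pronoun", "P", "PRP"),
    ("verb", "V", "V"),
]

def affinity(czech_tag, english_tag):
    """Return the parts of speech under which the two tags are paired."""
    return [pos for pos, cz, en in AFFINITIES
            if czech_tag.startswith(cz) and english_tag.startswith(en)]

print(affinity("N", "NNS"))  # ['noun']
print(affinity("V", "NN"))   # []
```

Plain prefix matching is deliberately loose; a stricter variant would list the exact tags that count for each side.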

7
Tag Affinities: Unigram Frequencies
Content words
POS                      Czech   English
adjective                  655       643
adverb                     833       920
noun                      2112      2868
number                     197       203
verb                      2420      2442
total number of words     9317     11214
  • Set of aligned Reader's Digest sentences
  • Similar distribution of tags
  • Slightly more English words than Czech words
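
To compare the two distributions on equal footing, the raw counts can be turned into shares of each language's total word count. A small sketch using only the numbers from this slide; the function name `shares` is hypothetical.

```python
# (Czech count, English count) per content-word category, from the slide.
COUNTS = {
    "adjective": (655, 643),
    "adverb": (833, 920),
    "noun": (2112, 2868),
    "number": (197, 203),
    "verb": (2420, 2442),
}
TOTAL_CZECH, TOTAL_ENGLISH = 9317, 11214

def shares(counts, total_cz, total_en):
    """Convert raw counts into fractions of each language's total word count."""
    return {pos: (cz / total_cz, en / total_en)
            for pos, (cz, en) in counts.items()}

for pos, (cz, en) in shares(COUNTS, TOTAL_CZECH, TOTAL_ENGLISH).items():
    print(f"{pos:<10} Czech {cz:6.1%}   English {en:6.1%}")

# Ratio of English to Czech word counts in the aligned sample.
print(f"English/Czech word ratio: {TOTAL_ENGLISH / TOTAL_CZECH:.2f}")
```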

8
More Unigram Frequencies
Function words
POS                      Czech   English
conjunction                433       231
determiner                  --       892
modal                       --       173
particle                    74         4
preposition                658       684
pronoun                   1297      1262
unknown                    531        22
total number of words     9317     11214
  • Some words in English are not in Czech: determiners, modal verbs
  • More unknown tags and particles in Czech

9
Transformations
  • Eliminate punctuation
  • Change word order / paraphrase
  • Drop determiners
  • Collapse modal verbs and infinitives
  • Drop prepositions
  • Drop subject
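
A rough sketch of how four of these transformations might be applied to a Penn-tagged English sentence. It covers only the steps that can be decided from tags alone; changing word order and dropping the subject need structural information and are left out. The function name, the punctuation tag set, and the example sentence are all made up for illustration.

```python
# Penn Treebank punctuation tags (a partial set, assumed sufficient here).
PUNCT_TAGS = {".", ",", ":", "``", "''", "-LRB-", "-RRB-"}

def transform(tagged):
    """tagged: list of (word, Penn tag) pairs. Returns a reduced list."""
    out = []
    for word, tag in tagged:
        if tag in PUNCT_TAGS:   # eliminate punctuation
            continue
        if tag == "DT":         # drop determiners
            continue
        if tag == "IN":         # drop prepositions (IN also covers
            continue            # subordinating conjunctions in Penn tags)
        if tag.startswith("V") and out and out[-1][1] == "MD":
            # collapse a modal and the verb that follows it into one unit
            modal, _ = out.pop()
            out.append((f"{modal}_{word}", tag))
            continue
        out.append((word, tag))
    return out

sentence = [("The", "DT"), ("firms", "NNS"), ("can", "MD"),
            ("cooperate", "VB"), ("with", "IN"), ("Canada", "NNP"), (".", ".")]
print(transform(sentence))
# [('firms', 'NNS'), ('can_cooperate', 'VB'), ('Canada', 'NNP')]
```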

10
Transformations 2
  • Eliminate punctuation
  • Change word order / paraphrase
  • Drop determiners

[Diagram: English and Czech trees side by side; node labels: Word, Root, Punc, 1, 2, 3]

11
Transformations 3
  • Collapse modal verbs and infinitives
  • Drop prepositions
  • Drop subject

[Diagram: English and Czech trees side by side; node labels: Modal, Verb, Verb]

12
Future Research: Looking for Matches
13
Automating the Search?
  • Compare domains?
  • Grouping phrases
  • Use domains containing descendants of uppermost
    ancestor
  • "Of, recording, music", not "that, recording, music"
  • Grouping around verbs
  • Verbs play a central role
  • Verbs usually appear in both sentences
  • Matching groups by probability
  • Identify important words in the phrases
  • Make educated guesses about matching phrases
  • Loosen affinities
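
One possible reading of "domains containing descendants of uppermost ancestor" is that a word's domain is the word plus everything below it in the dependency tree. The sketch below computes that set from governor indices. The interpretation and the function name `domain` are assumptions; the governor values are the ones on the `<g>` line of the "Truth Vs. Guess" slide.

```python
def domain(governors, head):
    """Return the sorted positions of `head` and all of its descendants.

    governors maps word position -> governor position (0 = root).
    """
    members = {head}
    changed = True
    while changed:                      # repeat until no new descendant is found
        changed = False
        for pos, gov in governors.items():
            if gov in members and pos not in members:
                members.add(pos)
                changed = True
    return sorted(members)

governors = {1: 2, 2: 3, 3: 0, 4: 3, 5: 4, 6: 5, 7: 0}
print(domain(governors, 4))  # [4, 5, 6]
print(domain(governors, 3))  # [1, 2, 3, 4, 5, 6]
```

Domains computed this way for the Czech and the English sentence could then be compared group by group, for instance starting from the verbs.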