Interlingual Annotation of Multilingual Text Corpora (IAMTC): Project Overview for ITIC, November 13, 2003

1
Interlingual Annotation of Multilingual Text
Corpora (IAMTC)
Project Overview for ITIC
November 13, 2003
Carnegie Mellon University
  • Lori Levin, Teruko Mitamura, Simon Fung

2
Principal investigators and senior personnel
  • Bonnie Dorr, University of Maryland
  • Nizar Habash, University of Maryland and Columbia
  • Stephen Helmreich, NMSU
  • Eduard Hovy, USC
  • David Farwell, NMSU
  • Lori Levin, CMU
  • Keith Miller, MITRE
  • Teruko Mitamura, CMU
  • Owen Rambow, Columbia University
  • Florence Reeder, MITRE

3
Wiki
[Comment (Owen Rambow): Does everyone know what a wiki
is? Add "cooperative website"?]
  • http://sparky.umiacs.umd.edu:8000/IAMTC/IAMTC.wiki
  • Corpora
  • Documents and manuals
  • Discussion

4
Goals of IAMTC
  • A practical interlingua for unrestricted text
  • Based on mismatch resolution between languages
    and between multiple English translations
    [Comment: remove dash]
  • Goal: Feasible human coding
  • Speed
  • Inter-coder agreement
  • Feasible parsing and generation [Comment: This is
    unclear. Will we develop parsers and generators?
    No.]

5
Benefits of IAMTC
  • Usable by many research communities, and by
    researchers using different approaches, working
    at different levels [Comment: I would make this
    a separate slide; very important.]
  • MT, information extraction, summarization,
    question-answering, etc.
  • Corpus-based, rule-based, machine learning-based,
    statistical approaches, etc. (note: heterogeneous
    list, not mutually exclusive)
  • Multiple levels of representation
  • Syntactic dependency structure
  • Language-specific predicate argument structure
  • Interlingua (with resolution of some mismatches)

6
Products of IAMTC
  • A coding manual for the interlingua
  • A multilingual tagged corpus
  • 25 original texts in each of six languages:
    French, Spanish, Japanese, Korean, Arabic, Hindi
  • Three English translations of each text
  • An evaluation metric for the interlingua

7
Representations
  • IL0: Language-specific dependency syntax
  • IL1: Language-specific semantic structure, a
    dependency structure with
  • Labeling of nodes using ontology
  • Labeling of arcs with semantic role names
  • IL2: Interlingua [Comment: New slide? This is the
    holy grail! Then you can properly itemize the
    points below.]
  • Neutralize support verbs, some multi-word
    expressions and non-literal language, some
    lexical converses (buy-sell), some text planning
    differences [Comment: I think you mean "John, who
    is blond, likes apples" <-> "John is blond and
    likes apples", right? I would call that sentence
    planning; text planning refers to content
    determination and structuring in a
    language-independent way.], conflational
    mismatches [Comment: give example?],
    head-switching mismatches, etc.
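
The step from IL0 to IL1 described on this slide can be pictured as relabeling a dependency tree: nodes receive ontology symbols and arcs receive semantic role names. The following is a minimal sketch under stated assumptions, not the project's markup format. The `Node` class, the `il0_to_il1` function, the syntactic labels `subj`/`obj`, and the role names `agent`/`theme` are hypothetical; only the symbol TRANSFER-MONEY appears elsewhere in these slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    word: str
    label: str = ""  # ontology symbol; empty at IL0
    # Outgoing arcs as (arc label, child) pairs.
    arcs: List[Tuple[str, "Node"]] = field(default_factory=list)


def il0_to_il1(
    node: Node,
    concepts: Dict[str, str],
    roles: Dict[Tuple[str, str], str],
) -> Node:
    """Return a copy of an IL0 tree with node and arc labels filled in."""
    out = Node(node.word, concepts.get(node.word, ""))
    for relation, child in node.arcs:
        # Keep the syntactic label when no semantic role is known.
        role = roles.get((node.word, relation), relation)
        out.arcs.append((role, il0_to_il1(child, concepts, roles)))
    return out


# Hypothetical IL0 fragment for "the program had granted ... loans".
il0 = Node("granted", arcs=[("subj", Node("program")), ("obj", Node("loans"))])

il1 = il0_to_il1(
    il0,
    concepts={"granted": "TRANSFER-MONEY"},  # assumed mapping
    roles={("granted", "subj"): "agent", ("granted", "obj"): "theme"},  # assumed roles
)
```

In this sketch the tree shape is unchanged between IL0 and IL1; only the labels differ, which matches the slide's description of IL1 as still language-specific.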

8
Examples (from Nizar Habash)
[Comment: What is the role of this example as opposed
to the next?]
  • http://www.umiacs.umd.edu/habash/artb_004.idg.5.IL.1
  • The minister, who has his own website, also said
    "I want Dubai to be the best place in the world
    for state-of-the-art technology companies."
  • http://www.umiacs.umd.edu/habash/artb_004.idg.5.IL.2
  • The minister who has a personal website on the
    internet, further said that he wanted Dubai to
    become the best place in the world for the
    advanced (hitech) technological companies.

9
Example 1
[Comment: I've corrected 2 accents below without
marking them.]
  • Original English
  • In its first five years of operation, PRODEM
    financed loans to over 13,300 microentrepreneurs,
    77 per cent of whom were women, disbursing over
    27 million in loans averaging 273.
  • Original French
  • Au bout de cinq ans, le programme avait consenti
    plus de 27 millions de dollars de prêts d'un
    montant moyen de 273 dollars, à plus de 13 300
    entrepreneurs, dont 77 % de femmes ....
  • English Translation from French
  • At the end of five years, the program had granted
    more than 27 million dollars in loans with an
    average amount of 273 dollars, to more than 13
    300 entrepreneurs, of which 77% were women,....

10
Example 1
[Comment: 2 accents corrected]
  • Original English
  • financed
  • loans
  • to over 13,300 microentrepreneurs,
  • disbursing
  • over 27 million
  • in loans
  • Original French
  • consenti
  • plus de 27 millions de dollars
  • de prêts
  • à plus de 13 300 entrepreneurs,
  • English Translation from French
  • granted
  • more than 27 million dollars
  • in loans
  • to more than 13 300 entrepreneurs

11
Example 2
[Comment: more accents]
  • Original English
  • Its network of eighteen independent organizations
    in Latin America has lent ..
  • Original French
  • le réseau regroupe dix-huit organisations
    indépendantes qui ont déboursé ..
  • English Translation from French
  • the network comprises eighteen independent
    organizations which have disbursed ..

12
Example 2
[Comment: more accents; also make sure this green is
legible when projected]
  • Original English
  • has lent
  • Its network
  • of eighteen independent organizations
  • ..
  • Original French
  • regroupe
  • le réseau
  • dix-huit organisations indépendantes
  • ont déboursé
  • English Translation from French
  • comprises
  • the network
  • eighteen independent organizations
  • have disbursed

13
Interlingua Merging
[Comment: accents]
  • Language-faithful interlinguas
  • Original English
  • financed
  • loans
  • to over 13,300 microentrepreneurs
  • disbursing
  • over 27 million
  • in loans
  • Original French
  • consenti
  • plus de 27 millions de dollars
  • de prêts
  • à plus de 13 300 entrepreneurs
  • English Translation from French
  • granted
  • more than 27 million dollars
  • in loans
  • Merged Interlingua
  • TRANSFER-MONEY
  • over 27 million
  • to over 13,300 microentrepreneurs
  • SOME-RELATION
  • over 27 million
  • loans
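
The slide gives the merged labels but not a merging procedure, so the following is only an illustrative sketch of the idea that different surface predicates in the three versions end up under one shared concept. The lookup table `CONCEPT_OF` and the function `merge_predicates` are assumptions made for this example; the slide itself supplies only the predicates and the label TRANSFER-MONEY.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup from surface predicates to a shared concept.
CONCEPT_OF: Dict[str, str] = {
    "financed": "TRANSFER-MONEY",
    "disbursing": "TRANSFER-MONEY",
    "consenti": "TRANSFER-MONEY",
    "granted": "TRANSFER-MONEY",
}

# Top-level predicates from the three language-faithful versions on the slide.
versions: Dict[str, List[str]] = {
    "Original English": ["financed", "disbursing"],
    "Original French": ["consenti"],
    "English Translation from French": ["granted"],
}


def merge_predicates(
    versions: Dict[str, List[str]], concept_of: Dict[str, str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (version, predicate) pairs under the concept each predicate maps to."""
    merged: Dict[str, List[Tuple[str, str]]] = {}
    for version, predicates in versions.items():
        for predicate in predicates:
            merged.setdefault(concept_of[predicate], []).append((version, predicate))
    return merged


merged = merge_predicates(versions, CONCEPT_OF)
```

Here every predicate falls under a single concept, which is the situation the merged interlingua on the slide depicts; a real merge would also have to align the arguments.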

14
Interlingua Merging
[Comment: accents]
  • Original English
  • has lent
  • Its network
  • of eighteen independent organizations
  • Original French
  • regroupe
  • le réseau
  • dix-huit organisations indépendantes
  • ont déboursé
  • English Translation from French
  • comprises
  • the network
  • eighteen independent organizations
  • have disbursed
  • Merged Interlingua
  • HAS-AS-PART
  • the network
  • eighteen independent organizations
  • TRANSFER-MONEY
  • the network
  • ..

15
Example 3
[Comment: I gave up on accents here]
  • Original English
  • Three of the most advanced institutions in the
    ACCION network started their programmes as
    non-profit organizations and have, in the last
    five years, converted into
  • Original French
  • Trois des institutions les plus performantes
    rattachees a ACCION International qui etaient au
    depart des organisations a but nonlucratif sont
    devenues ces cinq dernieres annees
  • English Translation from French
  • Three of the most successful institutions
    connected to ACCION International, which were
    non-profit organizations in the beginning, have
    become, in these last five years,

16
Example 3
  • Original English
  • Started
  • their programmes
  • Institutions
  • as non-profit organizations
  • Converted
  • Institutions
  • ..
  • Original French
  • sont devenues
  • Institutions
  • relative-clause etaient au depart
  • institutions
  • English Translation from French
  • Have become
  • Institutions
  • Relative-clause Were in the beginning
  • institutions

17
Meetings and Workshops
  • Meetings
  • September 2003: New Orleans
  • November 2003: CMU
  • January 18 and 19, 2004: ISI
  • Workshops
  • September 2003: MT Summit
  • May 2004: Plan for a panel in the workshop
    organized by Adam Meyer
  • July 2004: Plan to propose ACL workshop

18
Timeline
  • November 10 to December 1
  • Assembly of ENGLISH tools and knowledge sources
  • Tools committee: Hovy, Rambow, Miller
  • Omega ontology, ISI
  • LCS verb lexicon (connect to Omega via Propbank)
  • LDA (Lightweight Dependency Analyzer, Srinivas
    Bangalore)
  • Graph tool from Prague
  • New annotation tool (Dependency tree, Omega,
    Lexicon)
  • Draft of coding manual for IL1
  • Annotation Committee: Rambow, Mitamura, Levin,
    Dorr, Habash, Helmreich
  • Ontology symbols: Hovy
  • IL0 dependency structure: Rambow
  • IL1 markup format: Rambow and Habash
  • Semantic roles: Dorr, Habash, Mitamura, Levin
  • Nouns and compounds: Mitamura
  • Adverbs and adjectives: Helmreich
  • Prepositions: Miller
  • Named entities: Reeder
  • Modification vs. Predication: Habash

19
Annotation Procedure (English)
  • Run LDA parser
  • Use tree editing tool to convert syntactic
    dependency parse into IL1
  • Correct parsing errors
  • Choose symbols from the ontology as node labels
  • For verbs:
  • Look the verb up in the lexicon to get a list of
    semantic role names
  • Match phrases to roles
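
The verb step of the procedure above (look up the role names, then match phrases to them) can be sketched as a checklist for the annotator. This is a minimal illustration, not the project's annotation tool: the `LEXICON` entry, its role names, and the function `unmatched_roles` are hypothetical and are not taken from the LCS verb lexicon.

```python
from typing import Dict, List

# Hypothetical lexicon entry: verb -> semantic role names.
LEXICON: Dict[str, List[str]] = {
    "grant": ["agent", "theme", "recipient"],
}


def unmatched_roles(verb: str, assigned: Dict[str, str]) -> List[str]:
    """Return the roles listed for `verb` that have no phrase matched to them yet."""
    return [role for role in LEXICON.get(verb, []) if role not in assigned]


# Phrases the annotator has matched so far (from Example 1).
assigned = {
    "agent": "the program",
    "theme": "more than 27 million dollars in loans",
}

remaining = unmatched_roles("grant", assigned)
```

With these assumed entries, `remaining` holds the one role still waiting for a phrase, which the annotator would fill with "to more than 13 300 entrepreneurs".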

20
Timeline
  • December 1 to January 19
  • Annotation development cycle
  • Procedure committee: Hovy, Farwell, Mitamura
  • For each week, for each language
  • Pick a text and two English translations of the
    text
  • Annotator 1: Annotate the original and two
    English translations
  • Annotator 2: Annotate the two English
    translations and one English translation from
    another site.
  • Each week
  • Conference call on Friday at 1:00 pm Eastern Time
  • Revise annotation manuals as necessary
  • Development of inter-coder agreement metric
  • Evaluation committee: Reeder and Habash, leaders
  • Proposal for IL2 based on comparison of IL1s for
    different translations of the same text
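
The inter-coder agreement metric is listed above as still under development, so the slides do not define it. As a stand-in only, the sketch below computes simple percent agreement between two annotators' node labels on the same text; the function name `percent_agreement` and the sample label sequences are assumptions, and the project's eventual metric may differ.

```python
from typing import Sequence


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of aligned positions where both annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical node labels from two annotators for the same three nodes.
annotator_1 = ["TRANSFER-MONEY", "HAS-AS-PART", "SOME-RELATION"]
annotator_2 = ["TRANSFER-MONEY", "HAS-AS-PART", "TRANSFER-MONEY"]

score = percent_agreement(annotator_1, annotator_2)
```

Percent agreement does not correct for chance agreement and assumes the two annotations are already aligned node by node, which is one reason a dedicated metric would be needed.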

21
Timeline
  • January 19-February 23
  • Development of foreign language analysis tools
  • Large inter-coder agreement evaluation (IL1)
  • Small inter-coder agreement evaluation of IL2
  • March 1: Mid-year report
  • March 1, 2004 to September 2004
  • Annotation of full corpus
  • 25 original texts in each of the six languages
    (French, Spanish, Hindi, Korean, Arabic,
    Japanese)
  • 3 translations of each text into English

22
Plans for year 2
  • Argument-taking predicates other than verbs
  • Additional tools for automatic construction of
    IL1 and IL2
  • More comprehensive set of divergences resolved in
    IL2
  • Additional annotation topics
  • Coreference
  • Scope
  • Tense and aspect
  • Etc.
  • Larger annotated corpus
  • Suitable for corpus-based methods and machine
    learning