Corpora for machine translation - PowerPoint PPT Presentation

Slides: 17
Provided by: din55

Transcript and Presenter's Notes

Title: Corpora for machine translation


1
Corpora for machine translation
2
Structure
  • Very quick introduction to corpus linguistics
  • Tools
  • Building corpora for MT
  • Bilingual corpora for MT

3
What is a corpus?
  • A collection of naturally occurring language
    text, chosen to characterise a state or variety
    of language (Sinclair)
  • A collection of linguistic data, either written
    text or a transcription of recorded data, which
    can be used as a starting-point of linguistic
    description or as a means of verifying hypotheses
    about a language (Dictionary of linguistics and
    phonetics)

4
What is a corpus? (II)
  • Large body of evidence typically composed of
    attested language use (McEnery)
  • Usually a corpus is in machine-readable format
    and is ideally viewable and analysable through a
    (single) software package
  • The word corpus comes from the Latin for "body";
    the plural is corpora

5
What can we do with a corpus?
  • Corpus-based approaches: hypotheses are checked
    against a corpus
  • Corpus-driven approaches: hypotheses are drawn
    from the corpus

6
Parameters of a corpus
  • Language
      • Monolingual
      • Multilingual (comparable corpora)
      • Parallel
  • Type of source
      • Written
      • Spoken
      • Mix

7
Concordances
  • show words in the contexts in which they appear
  • usually they are obtained using special programs
    that let the user manipulate the lists of
    concordances
  • KWIC (Key Word In Context) is the most common
    format
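
A KWIC display can be sketched in a few lines of Python. This is only an illustration, not how any particular concordancer works: the function name `kwic`, the `width` parameter and the sample text are assumptions made for the example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit: left context, [keyword], right context."""
    lines = []
    # Whole-word, case-insensitive match of the keyword.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

sample = ("A corpus is a collection of texts. The corpus can be searched "
          "with a concordancer, and every corpus has its own design.")
for line in kwic(sample, "corpus", width=20):
    print(line)
```

Because every keyword starts in the same column, the lines can then be sorted by the left or right context, which is the kind of manipulation a concordancer offers.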

8
Collocations
  • collocation: the occurrence of two or more
    words within a short space of each other in a text
  • the collocates are extracted using a window to
    the left and right of a specified word
  • can be used to further analyse the context of a
    word
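
The window-based extraction of collocates could look roughly like the sketch below. The function name `collocates`, the simple tokenisation and the sample sentence are assumptions for illustration; real tools usually add a statistical association measure on top of raw counts.

```python
from collections import Counter
import re

def collocates(text, node, window=2):
    """Count the words found up to `window` positions left and right of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

sample = "she drinks strong tea in the morning and strong tea at night"
print(collocates(sample, "tea", window=2).most_common(3))
```

A wider window finds more distant collocates but also lets in more unrelated words, so the window size is a trade-off.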

9
Available corpora
  • BNC
  • Web as a corpus: http://www.webcorp.org.uk/
  • Bank of English:
    http://www.collins.co.uk/corpus/CorpusSearch.aspx
  • Online corpora in different languages:
    http://corp.hum.sdu.dk/corpustop.html
  • More links can be found at
    http://clg.wlv.ac.uk/dinel/corpora.html

10
Building corpora for MT
  • Multilingual (comparable) corpora are easier to
    build than parallel corpora
  • Parallel corpora require segments to be aligned
    between the texts, and are therefore quite
    difficult to obtain

11
Alignment of segments
  • Paragraphs can usually be aligned without
    problems
  • Sentences are more difficult to align
  • It is impossible to have a word-by-word alignment
  • Cognates, function words and other signposts can
    help the alignment
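
As a rough illustration of how cognates and other signposts can help, the sketch below scores sentence pairs by shared tokens (numbers, names) and shared word beginnings, then picks the best-scoring target sentence near the same position. Everything here is an assumption made for the example: the names `cognate_score` and `align`, the 4-letter prefix test and the two-sentence sample. Real aligners typically combine such cues with sentence lengths and dynamic programming.

```python
import re

def tokens(sentence):
    return re.findall(r"\w+", sentence.lower())

def cognate_score(src, tgt, prefix=4):
    """Fraction of source tokens with a crude cognate in the target:
    an identical token, or a shared beginning of `prefix` letters."""
    src_toks, tgt_toks = tokens(src), tokens(tgt)
    if not src_toks:
        return 0.0
    tgt_set = set(tgt_toks)
    tgt_prefixes = {t[:prefix] for t in tgt_toks if len(t) >= prefix}
    hits = sum(
        1 for s in src_toks
        if s in tgt_set or (len(s) >= prefix and s[:prefix] in tgt_prefixes)
    )
    return hits / len(src_toks)

def align(src_sents, tgt_sents, window=1):
    """Pair each source sentence with the best-scoring target sentence
    found within `window` positions of the same index."""
    pairs = []
    for i, s in enumerate(src_sents):
        candidates = range(max(0, i - window), min(len(tgt_sents), i + window + 1))
        if not candidates:
            continue
        best = max(candidates, key=lambda j: cognate_score(s, tgt_sents[j]))
        pairs.append((i, best))
    return pairs

english = ["The parliament adopted the resolution in 1999.",
           "The commission presented a new proposal."]
french = ["Le parlement a adopté la résolution en 1999.",
          "La commission a présenté une nouvelle proposition."]
print(align(english, french))
```

The prefix test is deliberately crude: it misses pairs such as "resolution" and "résolution" because of the accent, and it can match unrelated words that happen to start alike.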

12
This alignment cannot be obtained: examples
13
Corpora in MT
  • Monolingual corpora can be employed to see how a
    word is used in the target language
  • Multilingual corpora are mainly used to extract
    linguistic information about the languages
  • Parallel corpora can help the translation process

14
Parallel corpora
  • Give examples of how a word is translated
  • Can be used in machine-aided human translation
    because they indicate how words/phrases were
    translated in the past → translation by analogy
  • Can be used in statistical MT
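
To show how aligned sentence pairs indicate the way a word has been translated, the sketch below ranks target-language words by how consistently they co-occur with a source word, using the Dice coefficient. This is a simple association measure chosen for illustration, not a statistical MT model; the function name `translation_candidates` and the toy sentence pairs are assumptions.

```python
from collections import Counter

def translation_candidates(pairs, word):
    """Rank target words by Dice score with `word` over aligned sentence pairs."""
    src_count = 0           # number of pairs whose source side contains `word`
    tgt_counts = Counter()  # number of pairs containing each target word
    joint = Counter()       # number of pairs containing both
    for src, tgt in pairs:
        src_toks = set(src.lower().split())
        tgt_toks = set(tgt.lower().split())
        tgt_counts.update(tgt_toks)
        if word in src_toks:
            src_count += 1
            joint.update(tgt_toks)
    scores = {t: 2 * joint[t] / (src_count + tgt_counts[t]) for t in joint}
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

pairs = [
    ("the house is small", "la maison est petite"),
    ("the house is old", "la maison est vieille"),
    ("the car is small", "la voiture est petite"),
    ("a house", "une maison"),
]
print(translation_candidates(pairs, "house")[:3])
```

Frequent function words such as "la" co-occur with almost everything, which is why a normalised score is used here rather than raw co-occurrence counts.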

15
Paraconc
16
Examples of parallel corpora
  • English-Chinese parallel corpus:
    http://www.edict.com.hk/concordance/
  • Open source parallel corpus: http://logos.uio.no/opus/
  • BAF corpus -- no longer available online
  • UN multilingual terminology database can be used
    as a parallel corpus:
    http://157.150.197.21/dgaacs/unterm.nsf