Parallel corpora and contrastive studies

1
Parallel corpora and contrastive studies
  • Hilde Hasselgård
  • University of Oslo

2
From monolingual to multilingual corpus
linguistics
  • Corpus linguistics: the study of language by
    means of large(ish), structured databases of text
    compiled and prepared for use in linguistic
    research.
  • Largely developed within English linguistics,
    with the Brown corpus as the first (1960s),
    followed by the Lancaster-Oslo/Bergen (LOB)
    corpus.
  • Greatly facilitated the access to material.
  • Opened up new possibilities for quantitative
    studies and variation studies.
  • Parallel corpora: a more recent development
    (1990s), requiring new technology and new
    research methods.

3
Structure of talk
  • Multilingual corpus linguistics
  • Multilingual corpora
  • The English-Norwegian Parallel Corpus
  • Contrastive analysis
  • The use of parallel corpora in contrastive
    studies
  • The contribution of parallel corpora
  • Methodology
  • The Oslo Multilingual Corpus and the work of
    Språk i Kontrast (Languages in Contrast) in
    Oslo
  • Case study: two future-referring expressions
  • Summing up

4
What is a parallel corpus?
  • original texts with translations into one or more
    other languages → A translation corpus
  • comparable original texts in different languages
    → A comparable corpus
  • bi-directional translation corpus → Parallel
    corpus

5
Translation corpus
  • A corpus that contains the same texts in more
    than one language, in other words a corpus with
    both original and translated texts.

6
Comparable corpus
  • a corpus that contains original texts in more
    than one language and where the texts in each
    language have been selected according to the same
    criteria (genre, content, publication date etc.)

Language 1: criterion A, criterion B, criterion C, criterion D
Language 2: criterion A, criterion B, criterion C, criterion D
Language 3: criterion A, criterion B, criterion C, criterion D
7
Parallel corpus (ENPC model)
  • Combination of translation and comparable corpus
  • The original texts are comparable (genre, number
    of words)
  • The translations go in both directions: a
    bidirectional translation corpus

8
The English-Norwegian Parallel Corpus (ENPC):
Some facts
  • Started as a research project at the Department
    of British and American Studies in 1994 and
    completed in 1997. Prof. Stig Johansson initiated
    and directed the project.
  • Original texts with translations
    (English-Norwegian and Norwegian-English)
  • Fiction and non-fiction
  • Compiled for use in applied and theoretical
    linguistic research
  • Development of software for alignment of the
    texts (Knut Hofland, UiB) and for searching the
    corpus (Jarle Ebeling, UiO)
  • Sister projects: the English-Swedish Parallel
    Corpus (Lund/Göteborg) and the English-Finnish
    Parallel Corpus (Jyväskylä/Savonlinna/Tampere);
    same principle of compilation and, to some
    extent, shared texts.
  • Other corpora built on the ENPC model in Germany
    (Chemnitz), France/Belgium
    (Poitiers/Louvain-la-Neuve: the PLECI corpus),
    Spain (University of León).

9
Contrastive analysis
  • Contrastive analysis is the systematic comparison
    of two or more languages, with the aim of
    describing their similarities and differences.
    (Johansson 2007: 1)
  • CA [contrastive analysis] is a linguistic
    enterprise aimed at producing inverted (i.e.
    contrastive, not comparative) two-valued
    typologies (a CA is always concerned with a pair
    of languages), and founded on the assumption that
    languages can be compared. (James 1980: 3)
  • Executing a CA involves two steps: description
    and comparison; and the steps are taken in that
    order. (James 1980: 63)

10
Contrastive analysis
  • A CA presupposes a tertium comparationis, i.e. a
    measure by which we can be fairly certain we are
    comparing like with like.
  • The items to be compared across languages are
    selected on the basis of perceived similarity
    (Chesterman 1998), such as translation
    equivalence, semantic/etymological similarity,
    grammatical or functional categories.
  • A frequently suggested tertium comparationis is
    translation equivalence (e.g. James 1980,
    Chesterman 1998) which implies that the items in
    the two languages convey (more or less) the same
    meaning.

11
What can multilingual corpora contribute?
  • They give insights into the languages compared,
    insights that are likely to be unnoticed in
    studies of monolingual corpora.
  • They can be used for a range of comparative
    purposes and increase our understanding of
    language-specific, typological and cultural
    differences, as well as of universal features.
  • They illuminate differences between source texts
    and translations, and between native and
    non-native texts.
  • They can be used for a number of practical
    applications, e.g. in lexicography, language
    teaching, and translation.
  • (Aijmer & Altenberg 1996: 12)

12
Other benefits of a parallel corpus such as the
ENPC
  • Ready access to (relatively) large quantities of
    bilingual data
  • Sentence alignment
  • Comparable original and translated texts in both
    languages
  • Control for translation bias
  • In-built tertium comparationis through
    translation equivalence and text comparability
  • the paired texts reveal the interlingual
    identifications made by translators (Johansson
    1999: 117)

13
Methodology: Classifying correspondences
  • Correspondence
    • expressed
      • congruent (same realisation type)
      • divergent (different realisation type)
    • zero

Example: English correspondences of imidlertid
(however) in the ENPC
  • Alle "innrømmelsene" hadde imidlertid en pris.
    (GL1) → However, all these "concessions" had a
    price.
  • Det endte imidlertid godt (...) (UD1) → But it
    ended well (...)
  • Reguleringstiltakene har imidlertid gitt
    resultater (...). (ABJH1) → The regulations have
    shown results (...).
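
The split into congruent, divergent and zero correspondences can be sketched in a few lines of code. This is only an illustration, assuming each aligned pair has already been annotated by hand with the word class of the source item and of its counterpart (or nothing, when no counterpart is expressed). The function name and the word-class labels are hypothetical and not part of the ENPC software.

```python
from typing import Optional


def classify_correspondence(source_class: str, target_class: Optional[str]) -> str:
    """Label one aligned pair as 'zero', 'congruent' or 'divergent'."""
    if target_class is None:
        # Nothing in the translation corresponds to the source item.
        return "zero"
    if target_class == source_class:
        # Same realisation type in both languages.
        return "congruent"
    # A counterpart exists, but with a different realisation type.
    return "divergent"


# Hand-annotated counterparts of imidlertid (an adverb) in the three
# example pairs: (English counterpart, assumed word class).
counterparts = [
    ("however", "adverb"),
    ("but", "conjunction"),
    (None, None),
]

labels = [classify_correspondence("adverb", cls) for _, cls in counterparts]
print(labels)
```

In practice the word-class decision is the hard part and is made by the analyst; the code only records the resulting three-way label.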
14
Paradigms of correspondences
  • Swedish translations of however
    • emellertid: 51 (47%)
    • men (but): 36 (33%)
    • dock: 14 (13%)
    • ändå: 2
    • däremot: 1
    • i alla fall: 1
    • Ø: 4
  • English translations of emellertid
    • however: 83 (81%)
    • but: 3
    • yet: 3
    • anyway: 1
    • Ø: 13

(Altenberg 1999)
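
The percentages in such a paradigm follow directly from the raw counts. The sketch below recomputes them from the figures on this slide, reading the first number in each pair as a raw count; the function name is a hypothetical helper, not something from Altenberg (1999).

```python
def paradigm_percentages(counts: dict[str, int]) -> dict[str, int]:
    """Turn raw correspondence counts into rounded percentages of the total."""
    total = sum(counts.values())
    return {form: round(100 * n / total) for form, n in counts.items()}


# Raw counts as given on the slide (Ø = zero correspondence).
swedish_for_however = {
    "emellertid": 51, "men": 36, "dock": 14, "ändå": 2,
    "däremot": 1, "i alla fall": 1, "Ø": 4,
}
english_for_emellertid = {
    "however": 83, "but": 3, "yet": 3, "anyway": 1, "Ø": 13,
}

sv = paradigm_percentages(swedish_for_however)
en = paradigm_percentages(english_for_emellertid)
print(sum(swedish_for_however.values()), sv["emellertid"], sv["men"], sv["dock"])
print(sum(english_for_emellertid.values()), en["however"])
```

The totals (109 and 103) are the source-text frequencies that reappear in the mutual correspondence calculation.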
15
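Mutual correspondence (MC) (Altenberg 1999)
  • The frequency with which different (grammatical,
    semantic and lexical) expressions are translated
    into each other.
  • Calculated and expressed as a percentage by means
    of the formula
    $\mathrm{MC} = \dfrac{(A_t + B_t) \times 100}{A_s + B_s}$
    where $A_t$ and $B_t$ are the number of times A
    and B are translated by each other, and $A_s$ and
    $B_s$ are the number of occurrences of A and B in
    the source texts.
  • The MC of however and emellertid in the ESPC is
    thus
    $\dfrac{(51 + 83) \times 100}{109 + 103} \approx 63.2$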
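
As a check on the arithmetic, the calculation can be written out as a small function. This sketch assumes the formula reads (At + Bt) × 100 / (As + Bs); the function and parameter names are illustrative only.

```python
def mutual_correspondence(a_t: int, b_t: int, a_s: int, b_s: int) -> float:
    """Mutual correspondence of items A and B, as a percentage.

    a_t, b_t: times A is translated by B, and B by A.
    a_s, b_s: occurrences of A and of B in the source texts.
    """
    if a_s + b_s == 0:
        raise ValueError("no source-text occurrences to compare")
    return (a_t + b_t) * 100 / (a_s + b_s)


# however -> emellertid: 51 of 109; emellertid -> however: 83 of 103
mc = mutual_correspondence(a_t=51, b_t=83, a_s=109, b_s=103)
print(f"{mc:.1f}")  # 63.2
```

An MC of 100 would mean the two items are always translated by each other; 0 would mean they never are.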

16
Lexicogrammar
  • Paradigms of correspondence highlight the fuzzy
    borderlines between lexis and grammar and grammar
    and discourse.
  • Example: a modal verb will have a wide range of
    correspondences
  • Norwegian kan (can)
  • Valget av tidspunkt kan også inneholde et stenk
    av egoisme. (KH1)
  • Maybe his choice of timing also contained a touch
    of egotism.

English correspondences of kan (Løken 2007):
Modal auxiliaries: can, could, may, might, 'll, will, would, should
Other verbs: know, enable, have, have to, had better
Adjectives: possible, able, capable
Adverbs: maybe, perhaps
Suffix: -able
17
From ENPC to OMC under the SPRIK umbrella (SPRåk
I Kontrast)
  • New languages have been added, first (mainly)
    German, then French
  • Focus on English, Norwegian and German in the
    first phase of the SPRIK project: original texts
    in each language with translations into the other
    two.
  • Same principles for text selection, text sampling
    and preparation as for the ENPC (exception: even
    more biased towards fiction because of the lack
    of translated non-fiction)
  • Same (or later versions of same) software for
    alignment, searching etc.
  • Expanded search facilities and research
    possibilities
  • Three-way comparison of translations and
    originals
  • Possibilities of investigating two different
    translations of the same text (translation
    strategies, translationese)

18
Current stock of multilingual corpora at Oslo
  • OMC
    • Parallel corpora: English-Norwegian,
      French-Norwegian, German-Norwegian; three-way
      English-German-Norwegian.
    • Translation corpora:
      Norwegian-English-French-German,
      Norwegian-French-German, English-Dutch,
      English-Portuguese.
    • Multiple translations corpus (English-Norwegian)
  • Outside OMC
    • Russian-English-Norwegian (RuN)
    • Multilingual corpora of historical texts (two
      projects)

19
Trilingual parallel corpus model
20
Searching in the OMC (En-Ge-No)
Search terms: however in English originals, doch in
German originals
  • Now, however, our father wears jackets and ties
    and white shirts, and a tweed overcoat and a
    scarf. (MA1)
  • Jetzt jedoch trägt unser Vater Jacken und
    Krawatten und weiße Hemden und einen Tweedmantel
    und einen Schal. (MA1TD)
  • Nå går faren vår imidlertid med jakker og slips
    og hvite skjorter og tweedfrakk og skjerf.
    (MA1TN)
  • "However, the ex-Royal Family will be protected
    by the laws of the land. (ST1)
  • Doch die Ex-Königsfamilie genießt auch den Schutz
    der Gesetze dieses Landes. (ST1TD)
  • Ikke desto mindre vil den eks-kongelige familie
    kunne påberope seg beskyttelse under landets lov.
    (ST1TN)
  • Und er war doch noch da. (ME1)
  • And after all he was still there. (ME1TE)
  • Og han var jo fortsatt til. (ME1TN)
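
A search of this kind can be pictured as a filter over sentence-aligned units. The sketch below is a toy illustration using the three examples from this slide; the `AlignedUnit` class and `search_originals` function are hypothetical and do not describe the actual OMC search software.

```python
import re
from dataclasses import dataclass


@dataclass
class AlignedUnit:
    text_id: str              # corpus text code, e.g. "MA1"
    original_lang: str        # language of the original text
    versions: dict[str, str]  # language code -> aligned sentence


units = [
    AlignedUnit("MA1", "en", {
        "en": "Now, however, our father wears jackets and ties and white "
              "shirts, and a tweed overcoat and a scarf.",
        "de": "Jetzt jedoch trägt unser Vater Jacken und Krawatten und "
              "weiße Hemden und einen Tweedmantel und einen Schal.",
        "no": "Nå går faren vår imidlertid med jakker og slips og hvite "
              "skjorter og tweedfrakk og skjerf.",
    }),
    AlignedUnit("ST1", "en", {
        "en": '"However, the ex-Royal Family will be protected by the '
              "laws of the land.",
        "de": "Doch die Ex-Königsfamilie genießt auch den Schutz der "
              "Gesetze dieses Landes.",
        "no": "Ikke desto mindre vil den eks-kongelige familie kunne "
              "påberope seg beskyttelse under landets lov.",
    }),
    AlignedUnit("ME1", "de", {
        "de": "Und er war doch noch da.",
        "en": "And after all he was still there.",
        "no": "Og han var jo fortsatt til.",
    }),
]


def search_originals(units: list[AlignedUnit], lang: str, word: str) -> list[AlignedUnit]:
    """Return units whose ORIGINAL text is in `lang` and contains `word`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [u for u in units
            if u.original_lang == lang and pattern.search(u.versions[lang])]


hits_en = search_originals(units, "en", "however")
hits_de = search_originals(units, "de", "doch")
for hit in hits_en + hits_de:
    print(hit.text_id, hit.versions)
```

Restricting the search to originals keeps translated occurrences (such as Doch in the German translation of ST1) out of the hit list, so each hit comes with its two translations for comparison.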

21
Translation corpus with four languages
No-En-Fr-Ge
22
Searching in No-En-Fr-Ge
  • Jeg kommer til å si det til ham likevel. (KF1)
  • Ich werde es ihm sowieso sagen. (KF1TD)
  • I 'll tell him about it anyway. (KF1TE)
  • De toute façon, je le lui dirai. (KF1TF)
  • "You're going to have a book reissued (BHH1TE)
  • Du skal få en bok trykt opp igjen ... (BHH1)
  • "Ein Buch von dir wird neu aufgelegt, ...
    (BHH1TD)
  • Un de tes livres va être réédité ... (BHH1TF)

23
Using the ENPC/OMC for research
  • Particularly well suited for studies of lexis /
    lexico-grammar (or phenomena that can take lexis
    as their starting point)
  • A broad range of phenomena have been (are being)
    investigated, e.g. the use of individual verbs
    (bli, få, take, give, see), modality, particular
    syntactic constructions, connectives, sentence
    openings and other discourse phenomena.
  • The methodology is not tied to any particular
    theoretical approach
  • A range of theoretical approaches, e.g. SFL,
    cognitive linguistics, pattern grammar,
    lexis-based approach à la Sinclair, traditional
    grammar / basic linguistic theory.

24
Limitations
  • (As with corpus linguistics in general) you can
    only search for something that is explicit in the
    text
  • Restricted to texts / text types that have been
    translated
  • The size of the corpus restricts studies of less
    frequent lexical/grammatical constructions
  • Faulty and less successful translations
  • The corpus has been word-class tagged, but not
    parsed (syntactically annotated), i.e. it is not
    possible to search for grammatical constructions,
    patterns of word order etc.
  • Tagging errors

25
Ways around the limitations?
  • Identify typical (and searchable!) expressions of
    a grammatical construction, e.g. presentatives,
    clefting, phrasal verbs, inversion.
  • Use a combination of word class tagging, filters
    and wildcards. Example: tense/aspect,
    participle clauses (e.g. BE + Ving).
  • In any case: a lot of work involved in tidying
    up the search results (precision).
  • Possibility of searching with regular expressions
  • Errors in the tagging: never possible to make
    sure that you have found all the relevant
    instances (recall).
  • Errors/idiosyncrasies in the translation: weed
    out? Ignore translations that occur only once, or
    in only one text?
  • Manual searches in running text, e.g. for Theme,
    subjects.
  • Supplement results of parallel corpus study with
    (larger) monolingual corpora.
  • Supplement corpus study with e.g. experimental
    data.
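
The BE + Ving idea and the precision/recall trade-off can be illustrated with a plain regular expression. This is a rough sketch on untagged text, not the corpus search tool itself; the pattern, the stoplist and the sample sentences are all made up for illustration.

```python
import re

# A form of BE followed directly by a word ending in -ing.
BE_VING = re.compile(
    r"\b(am|is|are|was|were|be|been|being)\s+(\w+ing)\b",
    re.IGNORECASE,
)

# Filter: frequent -ing words that are not participles (improves precision).
NOT_PARTICIPLES = {"nothing", "something", "anything", "everything",
                   "thing", "morning", "evening"}


def find_be_ving(sentence: str) -> list[tuple[str, str]]:
    """Return (BE form, -ing word) pairs, minus stoplisted false hits."""
    return [
        (m.group(1).lower(), m.group(2).lower())
        for m in BE_VING.finditer(sentence)
        if m.group(2).lower() not in NOT_PARTICIPLES
    ]


sentences = [
    "She is reading a book.",
    "They were waiting outside.",
    "It was nothing.",          # matched by the pattern, removed by the filter
    "He reads every evening.",  # no BE form, so no match
]
results = [find_be_ving(s) for s in sentences]
print(results)
```

The pattern still misses contracted forms ('s, 're) and cases with an adverb between BE and the participle, which is the recall problem noted above, and it lets through adjectives in -ing, which is why the hits need manual tidying.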

26
Examples of studies based on ENPC/ OMC / ESPC
  • Bengt Altenberg: work on adverbial connectors,
    sentence openings, subject selection etc. in
    English and Swedish.
  • Karin Aijmer: work on modality and discourse
    markers in English and Swedish.
  • Åke Viberg: work on verbs of motion and cognition
    in English and Swedish.
  • Helge Dyvik: translations as semantic mirrors;
    ENPC as basis for bilingual wordnet.
  • Jarle Ebeling (2000): Presentative constructions
    in English and Norwegian: a corpus-based
    contrastive study (PhD, University of Oslo)
  • Mats Johansson (2002): Clefts in English and
    Swedish: A contrastive study of IT-clefts and
    WH-clefts in original texts and translations.
    (PhD, Lund University)
  • Signe Oksefjell Ebeling (2003): The Norwegian
    verbs bli and få and their correspondences in
    English: a corpus-based contrastive study (PhD,
    University of Oslo)

27
  • Berit Løken: Beyond modals: A corpus-based study
    of English and Norwegian expressions of
    possibility (PhD, Oslo, 2007)
  • Lene Nordrum: English lexical nominalizations in
    a Norwegian-Swedish contrastive perspective.
    (PhD, Göteborg, 2007)
  • Wiebke Ramm: Sentence boundary adjustments in
    translation (German / Norwegian): Consequences on
    information distribution and discourse structure
    (PhD, Oslo, ongoing)
  • Astrid Nome: ongoing PhD work on connectors in
    Norwegian and French. (Oslo)
  • Cathrine Fabricius-Hansen et al.: Big Events,
    Small Clauses. The Grammar of Elaboration.
    (Forthcoming book with multiple authors and
    multiple languages)
  • Master theses (English, German, French) studying
    individual verbs, syntactic constructions,
    connectors, metaphor

28
My own contrastive work
  • 2009. A textual perspective on the pragmatic
    markers in fact and faktisk. In S. Slembrouck,
    M. Taverniers & M. Van Herreweghe (eds.) From will
    to well: Studies in Linguistics offered to
    Anne-Marie Simon-Vandenbergen. Ghent: Academia
    Press.
  • 2007. Using the ENPC and the ESPC as a parallel
    translation corpus: adverbs of frequency and
    usuality. Nordic Journal of English Studies 6:1,
    http://ojs.ub.gu.se/ojs/index.php/njes/issue/view/6
  • 2006. Not now: on non-correspondence between
    the cognate adverbs now and nå. In K. Aijmer &
    A.-M. Simon-Vandenbergen (eds.) Pragmatic Markers
    in Contrast. Elsevier, 93-114.
  • 2005. Theme in Norwegian. In K.L. Berge & E.
    Maagerø (eds.). Semiotics from the North: Nordic
    Approaches to Systemic Functional Linguistics.
    Oslo: Novus, 35-48.
  • 2004. Spatial linking in English and Norwegian.
    In K. Aijmer & H. Hasselgård (eds.). Translation
    and Corpora. Göteborg: Acta Universitatis
    Gothoburgensis, 163-188.
  • 2004. Thematic choice in English and Norwegian.
    Functions of Language 11:2, 187-212.
  • 2000. English multiple Themes in translation. In
    A. Klinge (ed.) Contrastive Studies in Syntax.
    Special issue of Copenhagen Studies in Language,
    Vol 25. Copenhagen: Samfundslitteratur, 11-38.

29
Case study: be going to and komme til å (come
to)
  • Future-referring expressions based on motion verb
    + infinitive
  • Both described in grammars as common expressions,
    though less common than expressions with English
    will, Norwegian skal

30
Meanings
  • be going to
    • future fulfilment of the present: present
      intention or present cause (Quirk et al. 1985)
    • associated with present intention or
      arrangement; was going to quite often has an
      implicature of non-actualisation. (Huddleston &
      Pullum 2002)
    • Two meanings: futurish, linked to a present
      situation, and future tense, simply expressing
      future time reference. (Declerck 2006)
  • komme til å
    • the speaker predicts what will happen based on
      his knowledge at the moment of speaking
      (Faarlund et al. 1997)
    • Past tense kom til å V: also accidentally V or
      was led to V / grew to V (Vannebo 1979 and
      Engelsk Stor Ordbok)

31
Examples
  1. I know what he's going to say even before he says
    it. (FW1)
  2. Jeg vet hva han kommer til å si selv før han sier
    det. (FW1T)
  3. "I was going to wait until another time we met,
    but I may as well tell you now. (AH1)
  4. Meningen var å vente til en annen gang, men jeg
    kan like godt si det nå. (AH1T)
  5. Ingen av dem visste hva som kom til å skje.
    (TTH1)
  6. Neither of them knew what was going to happen.
    (TTH1T)
  7. Kanskje hun kom til å svelge dem ved et uhell?
    (LSC1)
  8. Maybe she happened to swallow them by accident?
    (LSC1T)
  9. Og siden ble det jeg som kom til å se mest til
    henne. (EHA1)
  10. And then I became the one who ended up seeing her
    most often. (EHA1T)

32
be going to and komme til å in ENPC fiction (raw
frequencies)
33
Preliminary observations
  • Be going to is more common than komme til å in
    original texts
  • Be going to is more common in original texts
    than in translations
  • Komme til å is less common in original texts than
    in translations
  • i.e. translations in both directions can be
    assumed to be coloured by the source texts.
  • The frequency differences between originals and
    translations (particularly with komme til å)
    indicate that the two expressions can often be
    used in the same contexts, but may tend not to
    be.

34
Correspondences of be going to (percentages)
35
Correspondences of komme til å (percentages)
36
Correspondences
  • The mutual correspondence between be going to and
    komme til å is surprisingly low: 12.6%
  • The correspondence is asymmetrical:
  • 15% of be going to are translated as komme til å
  • 7% of komme til å are translated as be going to
  • Komme til å has meanings not covered by be going
    to (accidentally, grow to, be led to).
  • The present cause/intention meaning works
    differently for the two expressions; apparently
    also speaker certainty/non-actualisation.
  • What are we going to do, says Ruth, (BV2T)
  • Hva skal vi gjøre, sier Rut (BV2)
  • Hun kommer bare til å bli redd." (THA1)
  • She 'll only be frightened." (THA1T)
  • "Are you going to run a hotel?" enquired
    Frederick reasonably, (DL1)
  • "Har dere tenkt å drive hotell?" spurte Frederick
    fornuftig, (DL1T)

Uncertain outcome, no intentionality
Confident prediction, speaker knowledge
Intention, but uncertain outcome
37
  • Thus, in spite of shared meanings, English be
    going to and Norwegian komme til å differ as
    to:
  • The frequency with which the item is chosen
  • The extent to which they compete with other
    future-referring expressions
  • The extent to which they convey confident
    predictions, present intention and actualised
    future in past.
  • Some other explanations may be:
  • Translators in both directions tend to normalize
    be going to / komme til å into a more common
    future-referring expression (will/would + INF and
    skal/skulle + INF). Will/would and skal/skulle are
    also the most common sources of komme til å / be
    going to.
  • Sometimes more lexically explicit forms have been
    used to translate be going to / komme til å: ha
    tenkt å / intend to (subject's intention), was to
    (was led/destined to)
  • Be going to may be needed for syntactic reasons,
    as English modals lack non-finite forms and do
    not show tense clearly.
  • Norwegian modal auxiliaries are more flexible,
    having non-finite and tensed forms → skal/skulle
    + INF fits into more syntactic environments than
    will/would + INF

38
The verb forms
39
  • The present tense be going to occurs to a great
    extent in direct speech.
  • The meanings 'accidentally do' and 'grow to /
    be led to' of komme til å occur mainly with the
    past tense, the former also with modalisation.
  • Hun kjenner at hun er søvnig, at hun kan komme
    til å sovne mot fars jakke, hun vil ikke det.
    (BV2)
  • She feels that she is sleepy, that she might fall
    asleep against father's jacket, but she doesn't
    want to do that. (BV2T)
  • og at den kvinnen jeg leter efter egentlig var
    et barn den gangen hun kom til å bety noe for
    meg. (FC1)
  • and that the woman I'm searching for was really
    a child when she came to mean something to me.
    (FC1T)

40
Some reflections on findings and further work
  • The picture of correspondence is a complex one,
    in spite of the rather similar descriptions in
    grammars of be going to and komme til å.
  • Syntactic differences between will/skal-future
    expressions may go some way towards explaining
    the difference in distribution.
  • Correspondence types will have to be correlated
    with tense forms.
  • Subtle differences of meaning regarding speaker
    certainty and present cause/intention come to the
    surface when studying correspondences.
  • Be going to is closer to a neutral future meaning
    than komme til å, i.e. further grammaticalized as
    a future tense.

41
Summing up
  • Parallel corpora enhance contrastive studies in a
    number of ways
  • by ensuring that observations are based on
    authentic language use
  • by yielding paradigms and patterns of
    correspondences
  • thus often revealing meanings and nuances we
    might not have thought of
  • and showing how the same meaning may be expressed
    by means of different linguistic categories
  • by providing quantitative data
  • thus also giving insights into preferred ways
    of putting things
  • (if the corpus is bidirectional) by providing
    control for translation bias
  • (if the corpus is representative) by controlling
    for the idiosyncrasies of individual
    authors/translators

42
Why undertake corpus-based contrastive
investigations?
  • The importance of multilingual corpora extends
    beyond contrastive studies. It is up to the user
    to define fruitful research questions and use the
    corpora creatively. In this process we learn not
    only about individual languages and their
    relationships, about translation and
    foreign-language acquisition, but also about
    language in general, provided that the study
    becomes truly multilingual. Seeing through
    corpora we can see through language.
  • Stig Johansson (2007: 316)

43
Information on the OMC / ENPC
  • About the corpora
  • OMC: www.hf.uio.no/ilos/english/originalfiler/services/omc/
  • ENPC: www.hf.uio.no/ilos/english/originalfiler/services/omc/enpc/
  • www.helsinki.fi/varieng/CoRD/corpora/ENPC/
  • About publications based on the OMC (up to 2006)
  • www.hf.uio.no/ilos/forskning/prosjekter/sprik/english/publications/

44
References
  • Aijmer, K. & B. Altenberg. 1996. Introduction. In
    K. Aijmer, B. Altenberg & M. Johansson (eds.)
    Languages in Contrast. Lund University Press,
    11-16.
  • Altenberg, B. 1999. Adverbial connectors in
    English and Swedish: Semantic and lexical
    correspondences. In Hasselgård & Oksefjell (eds.)
    Out of Corpora. Amsterdam: Rodopi, 249-268.
  • Berglund, Y. 2005. Expressions of Future in
    Present-day English. A Corpus-based Approach.
    Uppsala University.
  • Chesterman, A. 1998. Contrastive Functional
    Analysis. Amsterdam/Philadelphia: John Benjamins
    Publishing Company.
  • Declerck, R. 2006. The Grammar of the English
    Verb Phrase, Vol. 1. Berlin: Mouton de Gruyter.
  • Faarlund, J. T., S. Lie & K. I. Vannebo. 1997.
    Norsk Referansegrammatikk. Oslo:
    Universitetsforlaget.
  • Huddleston, R. & G. K. Pullum. 2002. The
    Cambridge Grammar of the English Language.
    Cambridge: Cambridge University Press.
  • James, C. 1980. Contrastive Analysis. London:
    Longman.
  • Johansson, S. 1999. Corpora and contrastive
    studies. In P. Pietilä & O-P. Salo (eds.)
    Multiple Languages - Multiple Perspectives.
    AFinLA Yearbook 1999 / No. 57, 116-125.
  • Johansson, S. 2007. Seeing through multilingual
    corpora. Amsterdam: Benjamins.
  • Quirk, R., S. Greenbaum, G. Leech & J. Svartvik.
    1985. A Comprehensive Grammar of the English
    Language. London: Longman.
  • Vannebo, K. I. 1979. Tempus og tidsreferanse.
    Oslo: Novus.