Corpus linguistics an introduction - PowerPoint PPT Presentation



Slides: 44
Provided by: Comp66


Corpus linguistics: an introduction
What is a corpus?
  • A collection of naturally occurring language
    text, chosen to characterise a state or variety
    of language (Sinclair)
  • A collection of linguistic data, either written
    text or a transcription of recorded data, which
    can be used as a starting-point of linguistic
    description or as a means of verifying hypotheses
    about a language (Dictionary of linguistics and

What is a corpus? (II)
  • Large body of evidence typically composed of
    attested language use (McEnery)
  • Usually a corpus is in machine-readable format
    and is ideally viewable and analysable through (a
    single) software package
  • The word corpus comes from the Latin for "body";
    the plural is corpora

What is not a corpus
  • Lists of words
  • Lists of sentences produced with the purpose of
    creating a corpus
  • Archive: a repository of readable electronic
    texts not linked in any coordinated way, e.g.
    the Internet Archive: "Internet Archive is
    building a digital library of Internet sites
    and other cultural artifacts in digital form.
    Like a paper library, we provide free access to
    researchers, historians, scholars, and the
    general public."

What can we do with a corpus?
  • Corpus-based approaches: hypotheses are checked
    against a corpus
  • Corpus-driven approaches: hypotheses are drawn
    from the corpus

What can we do with a corpus? (II)
  • 'Alright,' said the computer Deep Thought. 'The
    Answer to the Great Question...'
  • 'Yes...!'
  • 'Of Life, the Universe and Everything ... ' said
    Deep Thought.
  • 'Yes ... !'
  • 'Is ...'
  • 'Yes...!!!...?'
  • 'Forty-two,' said Deep Thought, with infinite
    majesty and calm.
  • It was a long time before anyone spoke.
  • 'Forty-two!' yelled someone in the audience. 'Is
    that all you've got to show for seven and a half
    million years' work?'
  • 'I checked it very thoroughly,' said the
    computer, 'and that quite definitely is the
    answer. I think the problem, to be quite honest
    with you, is that you've never actually known
    what the question is.'
  • The Hitchhiker's Guide to the Galaxy by Douglas Adams

Fields where corpora are used
  • Lexicography: to design dictionaries
  • Language studies (relations between languages,
    differences between genre, evolution of the
  • Computational linguistics (training and testing
  • Language teaching (learner corpora)
  • Cultural studies, psycholinguistics

The characteristics of analysis using corpora
(Biber, 1998)
  • It is empirical, analysing the actual patterns
    of use from natural texts
  • It utilises a large and principled collection of
    natural texts as the basis for analysis
  • It makes extensive use of computers for
    analysis, using both automatic and interactive
    techniques
  • It depends on both quantitative and qualitative
    analytical techniques

  • We have to split the history into two periods:
    before Chomsky and after Chomsky
  • Before Chomsky, methods similar to the ones in
    corpus linguistics were used (empiricism)

Early corpus linguistics
  • Before Chomsky
  • Computers were not available so it was difficult
    to analyse large collections of text
  • Studies of child language using diaries kept by
  • Spelling conventions in a German corpus of 11
    million words
  • Foreign language pedagogy

Early corpus linguistics (II)
  • All the work of early corpus linguistics was
    underpinned by two fundamental, yet flawed
    assumptions:
  • The sentences of a natural language are finite.
  • The sentences of a natural language can be
    collected and enumerated.
  • Most linguists saw the corpus as the only source
    of linguistic evidence in the formation of
    linguistic theories

  • Between 1957 and 1965 Chomsky changed the
    direction of linguistics from empiricism towards
    rationalism
  • "Any natural corpus will be skewed. Some
    sentences won't occur because they are obvious,
    others because they are false, still others
    because they are impolite. The corpus, if
    natural, will be so wildly skewed that the
    description would be no more than a mere list."
    (Chomsky, 1962)
  • Introspection started to be used instead

Problems with introspection
  • Naturally occurring data is observable and
    verifiable by everyone.
  • Introspective data is artificial.
  • Human beings have only the vaguest notion of the
    frequency of a construct or a word.

The revival of corpus linguistics
  • The research in corpus linguistics was continued
    in small centres
  • The hardware still imposed some restrictions;
    the real development would only start in the
    1980s
  • Fields like computational linguistics were not
    interested in using corpora

The revival of corpus linguistics (II)
  • 1960s: Brown Corpus (at Brown University;
    American English)
  • 1970s: LOB corpus (British English)
  • 1980s: Bank of English in Birmingham
  • 1990s (BNC, LDC, ICE corpus, ELRA, TRACTOR,

Why bother with corpora?
  • Even expert speakers have only a partial
    knowledge of a language. A corpus can be more
    comprehensive and balanced
  • Even expert speakers tend to notice the unusual
    and think of what is possible. A corpus can show
    us what is common and typical
  • Even expert speakers cannot quantify their
    knowledge of language. A corpus can give us
    accurate statistics

Why bother with corpora? (II)
  • Even expert speakers cannot remember everything
    they know. A corpus can store and recall all the
    information that has been input
  • Even expert speakers cannot make up natural
    examples. A corpus can provide us with a vast
    number of real examples
  • Even expert speakers have prejudices and
    preferences, and every language has cultural
    connotations and underlying ideology. A corpus
    can give you more objective evidence

Why bother with corpora? (III)
  • Even expert speakers are not always available to
    be consulted. A corpus can be made permanently
    accessible to all
  • Even expert speakers cannot keep up with language
    change. A constantly updated corpus can reflect
    even recent changes in the language
  • Even expert speakers lack authority: they can be
    challenged by other expert speakers. A corpus can
    encompass the actual language use of many expert
    speakers

Parameters of a corpus
  • Language: monolingual, multilingual (comparable
    corpora), parallel
  • Type of source: written, spoken, mixed

Parameters of a corpus (II)
  • Size: the size of the corpus is not all-important;
    it depends very much on the type of texts used
  • Annotated/not annotated (type of encoding used:
    plain text, SGML/XML)
  • Static/monitor corpus
  • Corpus/sub-corpus
  • Number of words/types

Type/token ratio
  • From the Brown corpus: 1m tokens (written only) -
    50,406 types
  • From the 1980s Birmingham/Cobuild corpora: 1m
    tokens (spoken only) - 36,807 types - 17,459
    occur only once
  • NB: fewer types than Brown (written only);
    spoken language is more repetitive and a smaller
    vocabulary is used
  • 4m tokens (Times newspapers only) - 122,773 types
    - 54,144 occur only once
  • 18m tokens (general corpus) - 228,323 types -
    131,299 occur only once

Type/token ratio (II)
  • 121m tokens (general corpus) - 475,633 types -
    213,684 occur only once
  • 211m tokens (general corpus) - 638,901 types
  • 323m tokens (general corpus) - 812,467 types
  • 418m tokens (general corpus) - 938,914 types -
    438,647 occur only once
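
The figures above can be turned into ratios with a few lines of Python. This is only a sketch: the function name corpus_stats is hypothetical, and the round token counts ("1m", "4m", ...) are assumed to be exact, which the slides do not claim.

```python
from collections import Counter

def corpus_stats(tokens):
    """Count tokens, types and hapaxes (types occurring once) in a token list."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    ttr = n_types / n_tokens if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types, "hapaxes": hapaxes, "ttr": ttr}

# (label, tokens, types) as quoted on the slides; token counts assumed exact.
slide_figures = [
    ("Brown, written", 1_000_000, 50_406),
    ("Birmingham/Cobuild, spoken", 1_000_000, 36_807),
    ("Times newspapers", 4_000_000, 122_773),
    ("general corpus, 18m", 18_000_000, 228_323),
    ("general corpus, 418m", 418_000_000, 938_914),
]

for label, n_tokens, n_types in slide_figures:
    # The ratio shrinks as the corpus grows, so compare equal-sized samples only.
    print(f"{label}: type/token ratio = {n_types / n_tokens:.4f}")
```

For a corpus held in memory, corpus_stats("a b a c".split()) gives the same three counts plus the ratio for that token list.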

Ways to exploit a corpus
  • Word (token)/type frequency lists
  • N-grams
  • Concordances
  • Collocations/colligations
  • Specially designed programs (especially when the
    corpus is annotated)

Frequency lists
  • are lists which indicate the words that appear
    in a corpus and their frequency
  • they provide a survey of the corpus
  • a frequency list becomes more meaningful when
    compared with other lists
  • they remove a word from its context
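
A frequency list can be sketched in Python as follows. The tokeniser is a deliberately crude assumption (lowercased runs of letters and apostrophes); real corpus tools make more careful choices, and the names tokenize and frequency_list are illustrative only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()

sample = "The cat sat on the mat. The dog sat on the log."
for word, count in frequency_list(sample)[:3]:
    print(word, count)
```

Note that the list records only counts: the position of each word in the text is lost.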

N-grams
  • groups of N words which appear in sequence in
    the text
  • they are presented using frequency lists
  • good way to identify recurring/specific
    expressions for a corpus
  • provide limited context for the words
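
Extracting groups of N consecutive words and counting them might look like the sketch below. It assumes the text is already tokenised (here by whitespace, for brevity); the function name ngrams is illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in text order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be that is the question".split()

# Presenting the n-grams as a frequency list, as described above.
bigram_counts = Counter(ngrams(tokens, 2))
for gram, count in bigram_counts.most_common(3):
    print(" ".join(gram), count)
```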

Concordances
  • show words in the context in which they appear
  • usually they are obtained using special programs
    which allow to manipulate the lists of
  • KWIC (Key Word In Context) is the most common
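
A minimal KWIC display can be sketched as below. It assumes a whitespace-tokenised text and a context width counted in words; dedicated concordancers offer far more (sorting, regular-expression search, annotation-aware queries). The names kwic and format_kwic are hypothetical.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def format_kwic(lines):
    """Right-align the left contexts so the keywords line up in one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  [{kw}]  {right}" for left, kw, right in lines]

tokens = "the cat sat on the mat and the dog sat on the log".split()
for line in format_kwic(kwic(tokens, "sat", width=2)):
    print(line)
```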

Collocations
  • collocation: the occurrence of two or more
    words within a short space of each other in a
    text
  • the collocates are extracted using a window to
    the left and right of a specified word
  • can be used to further analyse the context of a
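
Window-based extraction of collocates can be sketched as follows. The sketch only counts raw co-occurrences; it assumes whitespace tokens and a symmetric window, and it does not apply any association measure. The name collocates is illustrative.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens to the left and right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

tokens = "strong tea and strong coffee but weak tea".split()
print(collocates(tokens, "strong", window=1).most_common())
```

Widening the window brings in more context but also more unrelated words.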

The word gamut
Building corpora
  • Ways to acquire corpora
  • Direct conversion from electronic format
  • Optical scanning
  • Keyboarding
  • Speech transcription

Building corpora (II)
  • Criteria in corpus design
  • Size (small corpora are for genre-specific
    studies, whereas big corpora allow robust,
    general statements about a language)
  • Genre (domain, distribution, age, ...)
  • The structure of the corpus can be decided
    a priori (Brown, LOB, ...) or a posteriori
  • Old material is replaced with new material
    (monitor corpus)

Building corpora (III)
  • Selection, permission, acquisition
  • Data conversion, optical scanning, keyboarding,
    speech transcription
  • Cleaning, spell-checking, encoding (annotation),
  • Writing documentation
  • Evaluation of corpora
  • Distribution of corpora

Possible problems when building a corpus
  • A sampling frame designed to allow the
    exploitation of certain linguistic properties
  • Balance and representativeness
  • Information can be lost through cleaning
  • Duplication
  • When working with speech, information can be
    lost through transcription

Web as a corpus
  • The Web can be a very useful source of texts
  • The Web is very helpful for languages other than
  • Quite often there is no control over the
    language being investigated; therefore filtering
    (if possible) is necessary
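
One very rough way to filter downloaded pages by language is to check how many of their tokens are common function words of the target language. Everything in this sketch is an assumption made for illustration: the word list, the 0.2 threshold and the function names are not taken from the slides, and a real project would more likely use a dedicated language identifier.

```python
import re

# Tiny illustrative list of English function words (an assumption, not a standard).
ENGLISH_FUNCTION_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that",
    "it", "for", "with", "as", "on", "was",
}

def function_word_share(text, function_words=ENGLISH_FUNCTION_WORDS):
    """Return the share of tokens in `text` that belong to `function_words`."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(tok in function_words for tok in tokens) / len(tokens)

def filter_by_language(pages, function_words=ENGLISH_FUNCTION_WORDS, threshold=0.2):
    """Keep the pages whose function-word share reaches the threshold."""
    return [p for p in pages if function_word_share(p, function_words) >= threshold]

pages = [
    "The cat is on the mat in the house.",
    "Der Hund ist in dem Haus und schläft.",
]
print(filter_by_language(pages))
```

Passing a different word list targets a different language; very short pages make the heuristic unreliable.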

Corpus annotation
  • Enrichment of a corpus with various types of
  • It can be done at every level
  • Word: part of speech, sense
  • Sentence: sentence boundaries, syntactic tree
  • Discourse: coreferential chains, discourse
  • Certain expressions: named entities

Annotation scheme
  • A standard used to annotate certain
  • Gives meaning to a tag
  • Nowadays it is in XML
  • Usually in addition to an annotation scheme, a
    set of guidelines is produced to assist the
    annotators

Examples (II)
  • <P><S><W POS="PRON" NUM="PL" LEMMA="we">We</W>
    <W POS="V" LEMMA="have">have</W>
    <W POS="EN" LEMMA="develop">developed</W>
    <NP><W POS="DET" LEMMA="a">a</W>
    <W POS="A" LEMMA="computational">computational</W>
    <W POS="N" NUM="SG" LEMMA="paradigm">paradigm</W>
    <W POS="PUNCT">,</W> ... </S></P>
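
Annotation of this kind can be read back with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a shortened, well-formed version of the example; the position of the closing </NP> tag is an assumption, because the slide elides the rest of the sentence, and the function name annotated_words is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened example; the placement of </NP> is assumed, not taken from the slide.
annotated = (
    '<P><S>'
    '<W POS="PRON" NUM="PL" LEMMA="we">We</W>'
    '<W POS="V" LEMMA="have">have</W>'
    '<W POS="EN" LEMMA="develop">developed</W>'
    '<NP>'
    '<W POS="DET" LEMMA="a">a</W>'
    '<W POS="A" LEMMA="computational">computational</W>'
    '<W POS="N" NUM="SG" LEMMA="paradigm">paradigm</W>'
    '</NP>'
    '</S></P>'
)

def annotated_words(xml_text):
    """Return (word form, POS, lemma) for every <W> element, in document order."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("POS"), w.get("LEMMA")) for w in root.iter("W")]

for form, pos, lemma in annotated_words(annotated):
    print(form, pos, lemma)

# Example query: the lemmas of all words tagged as nouns.
nouns = [lemma for _, pos, lemma in annotated_words(annotated) if pos == "N"]
print(nouns)
```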

What are the advantages of corpus annotation?
  • Ease of exploitation
  • Reusability
  • Multi-functionality
  • Explicit analyses
  • Once a corpus is annotated it can be used in
    further research

Annotation of a corpus
  • Can be done automatically, semi-automatically
    or manually
  • Sometimes the method is automatic and the
    results are then post-processed
  • Usually special tools are used to minimise the
    human error

Criticism of corpus annotation
  • Corpus annotation produces impure corpora
  • Sometimes annotation can hide certain features
  • Consistency versus accuracy
  • Measures to compute the reliability of an
  • Sometimes the annotation scheme can cover a
    phenomenon only partially.

Existing corpora
  • Brown Corpus/LOB corpus
  • Bank of English
  • Wall Street Journal, Penn Treebank, BNC, ANC,
    ICE, WBE, Reuters Corpus
  • Canadian Hansard: parallel corpus, English-French
  • York-Helsinki Parsed Corpus of Old English Poetry
  • Tiger corpus: German
  • CORIS/CODIS: contemporary written Italian
  • MULTEX: 1984 and The Republic in many languages

Distributors of corpora
  • LDC (Linguistic Data Consortium)
  • ELRA (European Language Resources Association)
  • TRACTOR (TELRI Research Archive of Computational
    Tools and Resources)
  • ICAME (International Computer Archive of Modern
    and Medieval English)

References
  • Karin Aijmer and Bengt Altenberg (1991) English
    corpus linguistics, Longman
  • Douglas Biber, Susan Conrad and Randi Reppen (1998)
    Corpus linguistics, Cambridge University Press
  • Graeme D. Kennedy (1998) An introduction to
    corpus linguistics, Longman
  • Tony McEnery and Andrew Wilson (1996) Corpus
    linguistics, Edinburgh University Press

References (II)
  • Geoff Barnbrook (1996) Language and Computers,
    Edinburgh University Press
  • Tony McEnery (2003) Corpus linguistics. In
    Ruslan Mitkov (ed.) The Oxford Handbook of
    Computational Linguistics, Oxford University
    Press