Introduction to corpus linguistics - PowerPoint PPT Presentation

Description:

Bank of English: currently 450 m words. http://www.cobuild.collins.co.uk. BTANT 129 w5. British National Corpus: 100 m words, careful selection, 10% spoken material ...

Slides: 27
Provided by: vra70
Learn more at: http://corpus.nytud.hu
Transcript and Presenter's Notes


1
Introduction to corpus linguistics
2
Corpus
  • The old-school concept:
  • "A collection of texts, especially if complete and
    self-contained: the corpus of Anglo-Saxon verse"
  • (The Oxford Companion to the English Language)
  • The modern view:
  • "A collection of naturally occurring language text,
    chosen to characterize a state or variety of a
    language"
  • (John Sinclair: Corpus, Concordance, Collocation, OUP)

3
Corpus vs. archive
  • Text archive
  • Collection of texts in their original format
  • (Oxford Text Archive: http://ota.ox.ac.uk/)
  • Corpus
  • Texts collected and processed in a
    unified, systematic manner
  • British National Corpus: http://www.natcorp.ox.ac.uk/

4
(No Transcript)
5
(No Transcript)
6
Short history
  • Brief mention of just a select few!
  • Brown Corpus (Brown University)
  • 1 m words
  • 15 genres
  • 500 samples, 2000 words each
  • Area: US
  • Time: 1961
  • LOB Corpus (Lancaster-Oslo/Bergen)
  • GB replica of Brown

7
Cobuild
  • Major corpus initiative by Collins and Birmingham
    Univ. (John Sinclair)
  • 1991: 20 m words
  • -> Bank of English: currently 450 m words
  • http://www.cobuild.collins.co.uk

8
British National Corpus
  • 100 m words, careful selection
  • 10% spoken material
  • time span: from 1960 (fiction), from 1975 (non-fiction)
  • 40-50 000 word texts
  • TEI-compliant SGML coding
  • http://www.comp.lancs.ac.uk/ucrel/bncindex/

9
(No Transcript)
10
International Corpus of English
  • 20 corpora of 1 m words each, devoted to varieties of
    English around the world
  • 500 texts (300 spoken, 200 written) of 2000 words
    each
  • time span: 1990-1996
  • ICE-GB available in demo version
  • syntactic annotation, graphical tool: ICECUP

11
(No Transcript)
12
Corpus processing: tokenization
  • Preprocessing
  • tokenization: segmenting the text into sentences
  • sometimes tricky: sentence delimiters can occur in
    mid-sentence positions (e.g. after abbreviations)
  • and into words
  • problem: multi-word units
  • Normalization
  • expanding clitics and abbreviations ("can't", "I've")
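
The preprocessing steps on this slide can be illustrated with a minimal Python sketch. It is not a production tokenizer: the abbreviation list and the clitic table below are small hypothetical examples chosen for illustration, and multi-word units are not handled at all.

```python
import re

# Hypothetical abbreviation list; a real tokenizer needs a far larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "univ.", "etc.", "e.g.", "i.e."}

# Illustrative clitic expansions only (assumed, not exhaustive).
CLITICS = {"can't": ["can", "not"], "i've": ["I", "have"], "don't": ["do", "not"]}


def split_sentences(text):
    """Close a sentence at . ! ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final delimiter
        sentences.append(" ".join(current))
    return sentences


def tokenize_words(sentence):
    """Separate words (keeping internal apostrophes) from punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


def normalize(tokens):
    """Expand the clitic forms listed in CLITICS; leave other tokens as they are."""
    expanded = []
    for token in tokens:
        expanded.extend(CLITICS.get(token.lower(), [token]))
    return expanded


if __name__ == "__main__":
    text = "Dr. Smith can't come. I've seen the corpus!"
    for sentence in split_sentences(text):
        print(normalize(tokenize_words(sentence)))
```

The full stop after "Dr." is the kind of mid-sentence delimiter the slide warns about: without the abbreviation check, the first sentence would be cut after its first token.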

13
Corpus processing: tagging
  • Tagging
  • labelling every word with its part-of-speech (POS)
    category
  • Problem: ambiguity
  • out of context, words can belong to different
    parts of speech or have different analyses within
    the same POS
  • set N vs. set V
  • Hungarian bánt: past tense of 'bánik' (VBD) or present tense of 'bánt' (VBZ)

14
Corpus processing: disambiguation
  • Disambiguation
  • selecting the correct analysis in context
  • Two approaches
  • both need a manually corrected training corpus
  • statistical
  • Hidden Markov model
  • calculating probability within a span of usually
    one or two words
  • rate of success can be around 98%
  • rule-based
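
To make the statistical approach concrete, here is a minimal sketch of Viterbi decoding for a bigram Hidden Markov model (a span of one preceding tag). All probabilities in the toy model are invented for illustration; in practice they would be estimated from the manually corrected training corpus. The example reuses the "set N vs. set V" ambiguity.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under a bigram HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p].get(t, 0.0) * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(reversed(path))


# Toy model: every number below is a made-up assumption, not corpus data.
TAGS = ["DET", "PRON", "N", "V"]
START = {"DET": 0.5, "PRON": 0.5}
TRANS = {
    "DET": {"N": 0.9, "V": 0.1},
    "PRON": {"V": 0.9, "N": 0.1},
    "N": {"V": 0.5, "N": 0.5},
    "V": {"DET": 0.5, "N": 0.5},
}
EMIT = {
    "DET": {"the": 1.0},
    "PRON": {"they": 1.0},
    "N": {"set": 0.5},
    "V": {"set": 0.5},
}

if __name__ == "__main__":
    print(viterbi(["the", "set"], TAGS, START, TRANS, EMIT))
    print(viterbi(["they", "set"], TAGS, START, TRANS, EMIT))
```

In this toy model "set" is equally likely as a noun or a verb, so only the transition probabilities from the preceding tag decide the analysis: a noun after a determiner, a verb after a pronoun.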

15
Syntactic annotation
  • Difficult to do on such a scale
  • shallow parsing
  • Treebank: a collection of syntactically analyzed
    sentences
  • Penn Treebank
  • http://www.cis.upenn.edu/treebank/
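
As an illustration of what a treebank entry can look like, the sketch below reads one sentence in a Penn-style bracketed notation and recovers the tagged words. The example sentence and its analysis are made up for this sketch, and the helper names are hypothetical; real treebank files carry more detail (traces, function tags) that is not handled here.

```python
def parse_tree(text):
    """Parse a bracketed tree such as "(S (NP ...) (VP ...))" into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # word at a leaf
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def tagged_leaves(tree):
    """Collect (word, POS) pairs from the pre-terminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    leaves = []
    for child in children:
        if isinstance(child, tuple):
            leaves.extend(tagged_leaves(child))
    return leaves


if __name__ == "__main__":
    # Invented example sentence, annotated by hand for this sketch.
    tree = parse_tree("(S (NP (DT the) (NN corpus)) (VP (VBZ grows)))")
    print(tagged_leaves(tree))
```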

16
Recent trends
  • Word sense disambiguation (SENSEVAL)
  • http://www.itri.brighton.ac.uk/events/senseval/
  • Message understanding
  • http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
  • Semantic Web
  • making information on the web understandable for
    machines
  • a vision requiring a huge effort, not clear
    whether feasible at all

17
Representative sample?
  • A corpus of any size is inevitably a sample
  • Of what?
  • Two approaches
  • sampling speakers: demographic sampling
  • sampling their output: text-type sampling

18
The notion of representativeness
  • Sample vs. population
  • sample should be proportional to the population
    for a given feature
  • example for demographic sampling
  • if we know from census figures that 48% of people
    living in Budapest are male
  • we should compile our sample so that 48% of the
    informants are male
  • -> our sample is representative of Budapest
    residents for gender
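
The arithmetic behind proportional sampling can be sketched as a quota calculation. The function name and the sample size of 200 are assumptions made for illustration, and the 52% female share is simply taken as the complement of the 48% on the slide. Shares are given as whole percentages so that the calculation stays in integers.

```python
def proportional_quota(shares_percent, sample_size):
    """Allocate informants to groups in proportion to their population share.

    shares_percent maps group name -> whole-number percentage (summing to 100).
    Leftover places go to the groups with the largest remainders.
    """
    if sum(shares_percent.values()) != 100:
        raise ValueError("shares must sum to 100")
    scaled = {g: pct * sample_size for g, pct in shares_percent.items()}
    quota = {g: value // 100 for g, value in scaled.items()}
    leftover = sample_size - sum(quota.values())
    by_remainder = sorted(scaled, key=lambda g: scaled[g] % 100, reverse=True)
    for g in by_remainder[:leftover]:
        quota[g] += 1
    return quota


if __name__ == "__main__":
    # 48% male as on the slide; 52% female assumed as the complement.
    print(proportional_quota({"male": 48, "female": 52}, 200))
```

With 200 informants this yields 96 male and 104 female speakers, so the sample matches the population for gender, and only for gender: nothing follows about age, education or any other feature.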

19
Trouble with representativeness
  • What should be the units of sampling?
  • Registers, text types, genres etc.
  • But there is no independent evidence about their ratio in
    the totality of language output
  • -> representativeness is an ideal, but impossible
    to implement

20
Approaches to Representativeness
  • Douglas Biber
  • Rejects notion of proportional sampling
  • Sample should be as varied as possible
  • Representativeness measured in terms of wide
    variety of text types included in the sample

21
The Web as a corpus?
  • Pro
  • immense database
  • dynamically growing
  • ideal 'quick and dirty' method
  • Cons
  • lots of rubbish, irrelevant data
  • difficult to extract hits
  • no language analysis
  • only string query, which is crude

22
One quick example
  • "Representativity" or "representativeness"?
  • Throw the two words at Google and have a look at
    the figures
  • Think about the conclusions
  • There are special front-end sites

23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
(No Transcript)