Introduction to Corpora and Corpus Linguistics (COGS 523)

Introduction to Corpora and Corpus Linguistics. COGS 523, Lecture 1: General Introduction. Bilge Say.

Provided by: ocwMetuE

Transcript and Presenter's Notes

1
Introduction to Corpora and Corpus Linguistics
  • COGS 523-Lecture 1
  • General Introduction

2
Related Readings
  • Course Pack
  • Meyer (2002). Corpus Analysis and Linguistic
    Theory. Ch 1
  • Abney (1996) Statistical Methods and Linguistics
  • Extra Material (entirely optional; part of the
    presentation draws on these materials)
  • McEnery and Wilson (2001) Ch1
  • McEnery et al. (2006) A1 and B2
  • Tognini-Bonelli (2001). Corpus Linguistics at
    Work. Ch 3
  • Corpora Discussion List Archives: Chomsky/Harris
    Discussion, April 2001
  • Borsley and Ingham vs Stubbs Discussion. Lingua 112
    (2002)
  • Schönefeld (1999) Corpus Linguistics and
    Cognitivism, International Journal of Corpus
    Linguistics 4(1)

3
What is a Corpus?
  • Turkish: derlem (alt. bütünce)
  • Text / Speech / Video
  • Annotation
  • Digital media
  • Written / Spoken Language
  • Design Criteria

4
Questions of the Week
  • Is working with corpora a methodology within
    linguistics or a distinctive subfield (corpus
    linguistics)?
  • What potential is there for empirical analysis of
    corpora to contribute to linguistic theory?
  • What are the dangers involved in corpus-based
    linguistics? How can these dangers be reduced?

5
What is a Corpus, again?
  • A body of written text or transcribed speech
    which can serve as a basis for linguistic
    analysis or description, designed or required for
    a particular representative function.
  • An electronic collection of texts in a uniform
    representation
  • Corpus vs text archive vs database

6
Sinclair's Definition
  • A corpus is a collection of pieces of language
    that are selected and ordered according to
    explicit linguistic criteria in order to be used
    as a sample of the language.

7
Should a Corpus Necessarily Be
  • Large?
  • Authentic?
  • Compiled for linguistic analysis?
  • Saturated in terms of lexical growth?
  • Representative?
  • Machine readable?
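
Saturation in terms of lexical growth can be made concrete by tracking how many distinct word types have been seen as more tokens are read. The following is only a minimal sketch: the function name `vocabulary_growth`, the whitespace tokenization, the lowercasing, and the toy sentence are all assumptions made for illustration, not part of the lecture. A curve that flattens suggests that new text is adding few new words.

```python
from typing import Iterable, List, Tuple


def vocabulary_growth(tokens: Iterable[str], step: int = 5) -> List[Tuple[int, int]]:
    """Return (tokens_seen, distinct_types_seen) pairs, sampled every `step` tokens.

    Hypothetical helper for illustration; types are compared case-insensitively.
    """
    seen = set()
    curve = []
    count = 0
    for count, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        if count % step == 0:
            curve.append((count, len(seen)))
    # Add the final point when the text length is not a multiple of `step`.
    if count and count % step != 0:
        curve.append((count, len(seen)))
    return curve


# Toy input (an assumption): a real study would read a tokenized corpus instead.
sample = "the cat sat on the mat and the dog sat on the cat".split()
for n_tokens, n_types in vocabulary_growth(sample, step=4):
    print(n_tokens, n_types)
```

On real corpora the curve rarely becomes completely flat, which is one reason the question on this slide is open to debate.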

8
A History of Corpora
  • Pre-computer era (before the 1960s)
  • Transition era (1960s to the beginning of the 1990s)
  • Maturation era (1990s onwards)
  • What did technology bring?
  • Increased accuracy, speed, accountability,
    replicability, large volumes of better annotated
    data.

9
Linguistics
  • Phonology, Morphology, Lexicon, Syntax, Semantics,
    Discourse, Pragmatics
  • Introspection, Experimental Methods, Formal
    Linguistic Analysis, Computational Modeling,
    Corpus-Based Methods?
  • Computational Linguistics, Psycholinguistics,
    Sociolinguistics, Historical Linguistics, Applied
    Linguistics, Corpus Linguistics?

10
Corpus Linguistics
  • The term emerged in the 1980s, although the use of
    corpora has a long history.
  • Modern perspectives contain a number of opposing
    positions.

11
Linguistic Subdisciplines with a tradition for
corpora
  • Historical Linguistics
  • Phonetics
  • Language Acquisition
  • Statistical Natural Language Processing/Language
    Engineering/Computational Linguistics

12
Corpus Linguistics: a Methodology, Theory, or
Subfield of Linguistics?
  • Rationalism vs Empiricism
  • Formalists vs Functionalists
  • Competence vs Performance
  • Core vs Periphery
  • Applied Linguistics vs Theoretical Linguistics
  • Corpus-Based vs Corpus-Driven Approaches
    (Tognini-Bonelli)

13
False Assumptions
  • That all corpus linguists are descriptivists,
    interested only in counting and categorizing
    occurrences in a corpus, and that all generative
    grammarians are theoreticians unconcerned with
    the data on which their theories are based.
  • That the complexity of structure is of no
    interest to corpus linguists. (Meyer, 2002)

14
Evaluating Linguistic Theories
  • Observational vs explanatory vs descriptive
    adequacy
  • Falsifiability, Completeness, Simplicity,
    Objectivity etc...

15
Chomskyan quotes
  • The corpus could never be a useful tool for the
    linguist, as the linguist must seek to model
    language
  • Corpus Linguistics does not exist.
  • Any natural corpus will be skewed and
    incomplete. Some sentences won't occur, because
    they are obvious, others because they are false,
    still others because they are impolite. The
    corpus, if natural, will be so wildly skewed that
    the description would be no more than a mere
    list.
  • Indeed, Chomsky contributed to the modern view of
    corpus linguistics by advancing language
    technology and by helping to overcome the
    structuralist-behaviourist view of language as
    something that could be enumerated, by way of
    formal language theory.

16
Why Do Statistics Help? (Abney)
  • Language Acquisition
  • Language Change
  • Language Variation
  • Grammaticality- Ambiguity Computation
  • Modularity is not in isolation

17
Grammaticality Judgements
  • *He shines Tony books.
  • He gives Tony books.
  • If intuitions do, why bother with corpus analysis?
  • Artificial data is artificial and creates another
    kind of skewedness.
  • "Yes, I could say that, but I never would":
    gradedness in grammaticality judgements
  • Intuitions are perceptions....

18
Alternative Views
  • Leech (1992): Computer Corpus Linguistics is a new
    research enterprise, a new philosophical approach
    that:
  • concentrates on linguistic performance
  • leads to a more empirical view of scientific
    inquiry
  • exploits qualitative as well as quantitative
    methodology to produce a quantitatively oriented
    language model, such as Bayesian language models.
  • Not everyone agrees!

19
Further Remarks
  • Corpus Linguistics contributed to blurring the
    distinction between grammar and lexicon.
  • Sinclair's open-choice principle vs idiom principle
  • Cognitive linguists can accommodate data and
    facts revealed by corpus linguistic analysis

20
Corpus Linguistics vs Corpus Based Linguistics
  • There is no inherent incompatibility between
    theoretical generative linguistics and corpus
    linguistics (Seegmiller)
  • Generative and corpus linguistics are two
    approaches to the same problem, and must meet
    somewhere. Generative theories should match or be
    backed up by real data. (Schiffrin)
  • What is possible and what is probable? Corpus
    linguistics offers a way of describing things
    that we do regularly and frequently with
    greater confidence and reliability than by using
    introspection alone. (Krishnamurthy)

21
Corpus-Based Linguistics vs Corpus-Driven
Linguistics
  • Corpus-based: take existing theory as a starting
    point and correct and revise the theory in light
    of corpus evidence.
  • Corpus-driven: favour very large, full-text
    corpora, with the idea of cumulative
    representativeness and no annotation, so as to
    free oneself of preconceived theories.
  • e.g. collocations rather than colligations
  • "Without a corpus, there is no meaningful work to
    be done" (attributed to Sinclair and Stubbs, but
    see their own writings)
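
Collocations, which this slide names as a typical corpus-driven interest, are word pairs that co-occur more often than their individual frequencies would predict. One common way to score them is pointwise mutual information (PMI). The sketch below is an illustration under stated assumptions: the function name `bigram_pmi`, the `min_count` threshold, the restriction to adjacent words, and the toy text are all hypothetical choices, and PMI is only one of several association measures in use.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bigram_pmi(tokens: List[str], min_count: int = 2) -> Dict[Tuple[str, str], float]:
    """Score adjacent word pairs by pointwise mutual information (log base 2).

    Hypothetical helper: pairs seen fewer than `min_count` times are skipped,
    because PMI is unreliable for very rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = max(n_tokens - 1, 1)
    scores = {}
    for (w1, w2), freq in bigrams.items():
        if freq < min_count:
            continue
        p_pair = freq / n_pairs
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy text (an assumption); a real study would use a large corpus.
text = ("strong tea and strong tea with strong coffee but "
        "powerful computers and powerful computers").split()
for pair, score in sorted(bigram_pmi(text).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than independence would predict; with raw, unannotated text this is all the information the method needs, which fits the corpus-driven preference for no annotation.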

22
Reconciling Views
  • Corpora are excellent resources for verifying the
    falsifiability, completeness, simplicity,
    strength, and objectivity of linguistic
    hypotheses (Meyer, 2002).
  • They can provide additional linguistic
    perspectives which improve our knowledge of
    language and our ability to use it (a weaker
    position)

23
The Rise of Corpora
Years        No. of corpus-based studies
To 1965      10
1966-1970    20
1971-1975    30
1976-1980    80
1981-1985    160
1986-1991    320
(McEnery and Wilson, 2001)

24
Range of Activities in Corpus-based Linguistics
  1. Corpus Design, Compilation and Annotation
  2. Developing Tools for (1) or Analysis of Corpora
  3. Linguistic Studies or Applications using corpora
    developed in (1) using tools developed in (2)

25
Types of Corpora
  • General (typically balanced and made available
    for general linguistic use) vs Specialized
    (dialect corpora, language acquisition
    corpora, learner corpora)
  • Core Corpora
  • Written vs Spoken Corpora
  • Full-text vs Sample-text Corpora

26
More Typology
  • Finite-size (Static) vs Dynamic/Monitor Corpora
  • Monolingual vs Multilingual Corpora (Parallel
    corpora, Comparable Corpora)
  • Rather Graded Distinctions
  • Raw vs Annotated
  • Balanced vs Pyramidal vs Opportunistic Corpora
  • Synchronic vs Diachronic

27
Some Examples of Corpora
  • Pre-electronic corpora
  • Biblical and Literary Studies
  • Lexicographical
  • Dialect Studies
  • Language Education
  • Grammatical
  • Quirk's Survey of English Usage Corpus (later
    computerized) had 200 samples of 5000 words each,
    half spoken, half written, tagged manually with
    65 grammatical features.

28
More Examples
  • Major Electronic Corpora
  • Brown Corpus (Francis and Kucera, 1965): Brown
    University Standard Corpus of Present-Day
    American English, 1 million words, 1961-64, 500
    samples of 2000 words each
  • Lancaster-Oslo-Bergen Corpus (LOB corpus): a
    comparable corpus of British English (fewer
    westerns exist, though!)
  • Frown (Freiburg-Brown) and FLOB: comparable
    corpora of the 1990s
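
The sample-based designs quoted in these slides fix the total corpus size as number of samples times words per sample. As a quick check of the arithmetic, here is a minimal sketch; the function name `corpus_size` and the dictionary layout are assumptions for illustration, while the figures are the ones given above for the Survey of English Usage and the Brown Corpus.

```python
def corpus_size(n_samples: int, words_per_sample: int) -> int:
    """Total word count of a corpus built from equal-sized text samples."""
    return n_samples * words_per_sample


# (number of samples, words per sample), as quoted in the slides.
designs = {
    "Survey of English Usage": (200, 5000),
    "Brown Corpus": (500, 2000),
}

for name, (n_samples, words_per_sample) in designs.items():
    total = corpus_size(n_samples, words_per_sample)
    print(f"{name}: {total:,} words")
```

Both designs arrive at one million words, but by different routes: fewer, longer samples versus more, shorter ones.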

29
Major Electronic Corpora
  • Also modeled after Brown
  • Kolhapur Corpus of Indian English
  • Wellington Corpus of New Zealand English...
  • London-Lund Corpus (1975): 100 samples of 5000
    words each of spoken data; the major spoken corpus
    until the mid-1990s; predominantly highly educated
    adult speakers
  • Lancaster/IBM Spoken English Corpus (SEC): better
    balance, 11 categories, detailed prosodic annotation

30
Major Electronic Corpora
  • Longman Dictionary of Contemporary English
    (LDOCE); COBUILD Project, Bank of English: 524
    million words as of 2004.
  • International Corpus of English
  • International Corpus of Learner English: 2M
    words, 500-word essays, different English
    backgrounds
  • Longman Learners' Corpus, HKUST Learner Corpus
  • CHILDES (Child Language Data Exchange System)
  • European Corpus Initiative (ECI): 93 million
    words
  • Many corpora are available from LDC and
    ELDA/ELRA.

31
Major Natural Language Processing Corpora
  • Penn Treebank (1993): 4.9 million words, tagged
    and parsed, not balanced (optional paper in
    course pack)
  • TIPSTER corpus: AP Newswire and Wall Street
    Journal, mainly used for Information Retrieval
  • More variety is provided by national corpora and
    dependency treebanks

32
National Corpora
  • British National Corpus (BNC Corpus)
  • 100 million words, 90% written, 10% spoken; BNC
    Baby, a 2-million-word sampler; SARA and Xaira,
    its own corpus query tools; wholly tagged by the
    CLAWS tagger
  • American National Corpus (ANC)
  • In progress, preliminary releases available
  • Czech National Corpus (optional paper in course
    pack)
  • 12 full-time staff working for 5 years in a
    specialized institute
  • 100 million words
  • Partially tagged and parsed in the Prague
    Dependency School tradition
  • See METU Online links

33
Lecture 2
  • Corpus Design Issues
  • Readings
  • Tognini-Bonelli (2001) Corpus Issues. Ch3
  • McEnery et al. (2006) Units A7-A9, B1 (all appear
    as one article in the course pack)
  • Meyer (2002) Planning the Construction of a
    corpus. Ch 2.