Title: Introduction to Corpora and Corpus Linguistics COG
1 Introduction to Corpora and Corpus Linguistics
- COGS 523, Lecture 1
- General Introduction
2 Related Readings
- Course Pack
- Meyer (2002). Corpus Analysis and Linguistic Theory. Ch 1
- Abney (1996). Statistical Methods and Linguistics
- Extra Material (entirely optional; part of the presentation draws on this material)
- McEnery and Wilson (2001) Ch 1
- McEnery et al. (2006) A1 and B2
- Tognini-Bonelli (2001). Corpus Linguistics at Work. Ch 3
- Corpora Discussion List Archives: Corpora Chomsky/Harris Discussion, April 2001
- Borsley and Ingham vs Stubbs Discussion. Lingua 112 (2002)
- Schönefeld (1999). Corpus Linguistics and Cognitivism. International Journal of Corpus Linguistics 4(1)
3 What is a Corpus?
Derlem (alt. Bütünce) [Turkish terms for "corpus"]
Text/Speech/Video
Annotation
Digital media
Written/Spoken Language
Design Criteria
4 Questions of the Week
- Is working with corpora a methodology within linguistics or a distinctive subfield (corpus linguistics)?
- What potential is there for empirical analysis of corpora to contribute to linguistic theory?
- What are the dangers involved in corpus-based linguistics? How can these dangers be reduced?
5 What is a Corpus, again?
- A body of written text or transcribed speech which can serve as a basis for linguistic analysis or description, designed or required for a particular representative function.
- An electronic collection of texts in a uniform representation
- Corpus vs text archive vs database
6 Sinclair's definition
- A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language
7 Should a Corpus Necessarily Be
- Large?
- Authentic?
- Compiled for linguistic analysis?
- Saturated in terms of lexical growth?
- Representative?
- Machine readable?
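
One way to make "saturated in terms of lexical growth" concrete is to track how many new word types appear as more tokens are read. The sketch below is only an illustration under stated assumptions: the function names, the crude regular-expression tokenizer, and the toy text are all hypothetical, and a real study would use a proper tokenizer and a much larger sample.

```python
import re


def vocabulary_growth(text, step=1000):
    """Return (tokens_seen, types_seen) pairs, sampled every `step` tokens.

    Tokenization is a deliberately crude assumption: lowercase runs of
    letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        # Record a point at each step boundary and at the very last token.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


def new_types_per_step(curve):
    """Number of previously unseen word types added in each interval."""
    gains = []
    previous = 0
    for _, types_seen in curve:
        gains.append(types_seen - previous)
        previous = types_seen
    return gains


# Toy text (hypothetical): the same 12-token sentence pair repeated 3 times.
sample = "the cat sat on the mat the dog sat on the log " * 3
curve = vocabulary_growth(sample, step=12)
print(curve)
print(new_types_per_step(curve))
```

When the gains per interval shrink towards zero, the sample is approaching saturation for that text type. In natural text new types rarely stop appearing altogether, which is part of why the question on this slide is open.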
8 A History of Corpora
- Pre-computer era (before the 1960s)
- Transition era (1960s to the beginning of the 1990s)
- Maturation era (1990s onwards)
- What did technology bring?
- Increased accuracy, speed, accountability,
replicability, large volumes of better annotated
data.
9 Linguistics
Phonology, Morphology, Lexicon, Syntax, Semantics, Discourse, Pragmatics
Introspection, Experimental Methods, Formal Linguistic Analysis, Computational Modeling, Corpus-Based Methods?
Computational Linguistics, Psycholinguistics, Sociolinguistics, Historical Linguistics, Applied Linguistics, Corpus Linguistics?
10 Corpus Linguistics
- The term emerged in the 1980s, although the use of corpora has a long history.
- Modern perspectives contain a number of opposing positions.
11 Linguistic Subdisciplines with a Tradition of Using Corpora
- Historical Linguistics
- Phonetics
- Language Acquisition
- Statistical Natural Language Processing/Language
Engineering/Computational Linguistics
12 Corpus Linguistics: a Methodology, Theory, or Subfield of Linguistics?
- Rationalism vs Empiricism
- Formalists vs Functionalists
- Competence vs Performance
- Core vs Periphery
- Applied Linguistics vs Theoretical Linguistics
- Corpus-Based vs Corpus-Driven Approaches
(Tognini-Bonelli)
13 False Assumptions
- All corpus linguists are descriptivists, interested only in counting and categorizing occurrences in a corpus, and all generative grammarians are theoreticians unconcerned with the data on which their theories are based.
- The complexity of structure is of no interest to the corpus linguist. (Meyer, 2002)
14 Evaluating Linguistic Theories
- Observational vs explanatory vs descriptive adequacy
- Falsifiability, Completeness, Simplicity, Objectivity, etc.
15 Chomskyan quotes
- The corpus could never be a useful tool for the linguist, as the linguist must seek to model language
- Corpus Linguistics does not exist.
- Any natural corpus will be skewed and incomplete. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list.
- Indeed, Chomsky contributed to the modern view of corpus linguistics by improving language technology and by helping to overcome, by way of formal language theory, the structuralist-behaviourist view of language as something that could be enumerated.
16 Why Do Statistics Help? (Abney)
- Language Acquisition
- Language Change
- Language Variation
- Grammaticality, Ambiguity, Computation
- Modularity is not in isolation
17 Grammaticality Judgements
- *He shines Tony books.
- He gives Tony books.
- If intuitions suffice, why bother with corpus analysis?
- Artificial data is artificial and creates another kind of skewedness.
- "Yes, I could say that, but I never would": gradedness in grammaticality judgements
- Intuitions are perceptions...
18 Alternative Views
- Leech (92)
- Computer Corpus Linguistics is a new research enterprise, a new philosophical approach that
- Concentrates on linguistic performance
- Leads to a more empirical view of scientific inquiry
- Exploits qualitative as well as quantitative methodology to produce quantitatively oriented language models, such as Bayesian language models.
- Not everyone agrees!
19 Further Remarks
- Corpus Linguistics contributed to blurring the distinction between grammar and lexicon.
- Sinclair's open choice vs idiom principle
- Cognitive linguists can accommodate data and facts revealed by corpus linguistic analysis
20 Corpus Linguistics vs Corpus-Based Linguistics
- There is no inherent incompatibility between theoretical generative linguistics and corpus linguistics (Seegmiller)
- Generative and corpus linguistics are two approaches to the same problem, and must meet somewhere. Generative theories should match or be backed up by real data. (Schiffrin)
- What is possible and what is probable? Corpus linguistics offers a way of describing things that we do regularly and frequently with greater confidence and reliability than by using introspection alone. (Krishnamurthy)
21 Corpus-Based Linguistics vs Corpus-Driven Linguistics
- Corpus-based: take existing theory as a starting point, and correct and revise the theory in light of corpus evidence.
- Corpus-driven: favour very large, full-text corpora, with the idea of cumulative representativeness, and no annotation, so as to free oneself of preconceived theories.
- e.g. collocations rather than colligations
- "Without a corpus, there is no meaningful work to be done" (attributed to Sinclair and Stubbs, but see their own writings)
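
Collocations suit the corpus-driven approach because they can be read off raw, unannotated text as word-to-word co-occurrence, whereas colligations relate words to grammatical categories and so presuppose some annotation. The sketch below scores adjacent word pairs by pointwise mutual information (PMI). PMI is an assumption chosen here for illustration, not a measure named on the slide, and the function name, tokenizer, threshold, and toy text are hypothetical.

```python
import math
import re
from collections import Counter


def bigram_pmi(text, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from raw counts.
    Pairs seen fewer than `min_count` times are skipped, because PMI
    overrates rare pairs.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring candidate collocations first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy text (hypothetical); a real study would use a large corpus.
text = ("strong tea and strong coffee but powerful computers "
        "and strong tea again")
for pair, score in bigram_pmi(text):
    print(pair, round(score, 2))
```

On a toy text this small the scores mean little, and on a corpus of realistic size the frequency threshold and the choice of association measure both change the ranking noticeably.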
22 Reconciling Views
- Corpora are excellent resources for verifying the falsifiability, completeness, simplicity, strength, and objectivity of linguistic hypotheses (Meyer, 2002).
- They can provide additional linguistic perspectives which improve our knowledge of language and our ability to use it (a weaker position)
23 The Rise of Corpora
Years        No. of corpus-based studies
To 1965      10
1966-1970    20
1971-1975    30
1976-1980    80
1981-1985    160
1986-1991    320
(McEnery and Wilson, 2001)
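
The figures above roughly double from one period to the next. A short sketch makes the period-to-period ratios explicit. The counts are taken from the table, while the variable and function names are hypothetical. The periods are not of equal length (the first is open-ended and the last spans six years), so the ratios are only a rough indication of growth.

```python
# Counts of corpus-based studies per period, as given in the table.
studies = [
    ("To 1965", 10),
    ("1966-1970", 20),
    ("1971-1975", 30),
    ("1976-1980", 80),
    ("1981-1985", 160),
    ("1986-1991", 320),
]


def growth_ratios(rows):
    """Ratio of each period's count to the preceding period's count."""
    return [
        (later_label, later_count / earlier_count)
        for (_, earlier_count), (later_label, later_count) in zip(rows, rows[1:])
    ]


for period, ratio in growth_ratios(studies):
    print(f"{period}: x{ratio:.2f} relative to the previous period")
```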
24 Range of Activities in Corpus-Based Linguistics
- (1) Corpus Design, Compilation and Annotation
- (2) Developing Tools for (1) or for the Analysis of Corpora
- (3) Linguistic Studies or Applications using corpora developed in (1) and tools developed in (2)
25 Types of Corpora
- General (typically balanced and made available for general linguistic use) vs Specialized (dialect corpora, language acquisition corpora, learner corpora)
- Core Corpora
- Written vs Spoken Corpora
- Full-text vs Sample-text Corpora
26 More Typology
- Finite-size (Static) vs Dynamic/Monitor Corpora
- Monolingual vs Multilingual Corpora (Parallel Corpora, Comparable Corpora)
- Rather Graded Distinctions
- Raw vs Annotated
- Balanced vs Pyramidal vs Opportunistic Corpora
- Synchronic vs Diachronic
27 Some Examples of Corpora
- Pre-electronic corpora
- Biblical and Literary Studies
- Lexicographical
- Dialect Studies
- Language Education
- Grammatical
- Quirk's Survey of English Usage Corpus (later computerized) had 200 samples of 5000 words each, half spoken, half written, tagged manually with 65 grammatical features.
28 More Examples
- Major Electronic Corpora
- Brown Corpus (Francis and Kucera, 1965): Brown University Standard Corpus of Present-Day American English; 1 million words, 1961-64, 500 samples of 2000 words each
- Lancaster-Oslo-Bergen Corpus (LOB corpus): a comparable corpus of British English (fewer westerns exist, though!)
- Frown and FLOB: comparable corpora of the 1990s
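
The headline size of a sample-text corpus follows directly from its design: the number of samples times the words per sample. The sketch below checks the Brown figure quoted on this slide. The function name is hypothetical, and the calculation assumes every sample has exactly the stated length, so the result is a nominal size rather than an exact token count.

```python
def nominal_size(samples, words_per_sample):
    """Nominal corpus size, assuming every sample has exactly the stated length."""
    return samples * words_per_sample


# Brown Corpus design as quoted on the slide: 500 samples of 2000 words each.
brown = nominal_size(500, 2000)
print(f"Brown (nominal): {brown:,} words")
```

The Brown-style comparable corpora follow the same sampling frame, which is what makes word counts across them directly comparable.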
29 Major Electronic Corpora
- Also modeled after Brown
- Kolhapur Corpus of Indian English
- Wellington Corpus of New Zealand English...
- London-Lund Corpus (1975): 100 samples of 5000 words each of spoken data; the major spoken corpus until the mid-1990s; predominantly highly educated adult speakers
- Lancaster/IBM Spoken English Corpus (SEC): better balance, 11 categories, detailed prosodic annotation
30 Major Electronic Corpora
- Longman Dictionary of Contemporary English (LDOCE)
- COBUILD Project: Bank of English, 524 million words as of 2004
- International Corpus of English
- International Corpus of Learner English: 2M words, 500-word essays, different English backgrounds
- Longman Learners' Corpus, HKUST Learners' Corpus
- CHILDES: Child Language Data Exchange System
- European Corpus Initiative (ECI), 93 million words
- Many corpora are available from LDC and ELDA/ELRA.
31 Major Natural Language Processing Corpora
- Penn Treebank (1993): 4.9 million words, tagged and parsed, not balanced (optional paper in course pack)
- TIPSTER corpus: AP Newswire and Wall Street Journal, mainly used for Information Retrieval
- More variety is provided by national corpora and dependency treebanks
32 National Corpora
- British National Corpus (BNC)
- 100 million words, 90% written, 10% spoken; BNC Baby 2-million-word sampler; SARA and Xaira are its own corpus query tools; wholly tagged by the CLAWS tagger
- American National Corpus (ANC)
- In progress, preliminary releases available
- Czech National Corpus (optional paper in course pack)
- 12 full-time people working for 5 years in a specialized institute
- 100 million words
- Partially tagged and parsed in the Prague Dependency School tradition
- See METU Online links
33 Lecture 2
- Corpus Design Issues
- Readings
- Tognini-Bonelli (2001) Corpus Issues. Ch 3
- McEnery et al. (2006) Unit A7-A9, B1 (all appear to be one article in the course pack)
- Meyer (2002) Planning the Construction of a Corpus. Ch 2.