Corpus linguistics: an introduction

1
Corpus linguistics: an introduction
2
What is a corpus?

A collection of naturally occurring language
text, chosen to characterise a state or variety
of language (Sinclair)
A collection of linguistic data, either written
text or a transcription of recorded data, which
can be used as a starting-point of linguistic
description or as a means of verifying hypotheses
about a language (Dictionary of linguistics and
phonetics)

3
What is a corpus? (II)

Large body of evidence typically composed of
attested language use (McEnery)
Usually a corpus is in machine-readable format
and is ideally viewable and analysable through a
(single) software package
The word "corpus" comes from the Latin for "body";
the plural is "corpora"

4
What is not a corpus

Lists of words
Lists of sentences produced with the purpose of
creating a corpus
Archive: a repository of readable electronic
texts not linked in any coordinated way, e.g.
the Internet Archive (http://www.archive.org):
"The Internet Archive is building a digital
library of Internet sites and other cultural
artifacts in digital form. Like a paper library,
we provide free access to researchers,
historians, scholars, and the general public."

5
What can we do with a corpus?

Corpus-based approaches: hypotheses are checked
against a corpus
Corpus-driven approaches: hypotheses are drawn
from the corpus

6
What can we do with a corpus? (II)

'Alright,' said the computer Deep Thought. 'The
Answer to the Great Question...'
'Yes...!'
'Of Life, the Universe and Everything ... ' said
Deep Thought.
'Yes ... !'
'Is ...'
'Yes...!!!...?'
'Forty-two,' said Deep Thought, with infinite
majesty and calm.
It was a long time before anyone spoke.
'Forty-two!' yelled someone in the audience. 'Is
that all you've got to show for seven and a half
million years' work?'
'I checked it very thoroughly,' said the
computer, 'and that quite definitely is the
answer. I think the problem, to be quite honest
with you, is that you've never actually known
what the question is.'
(The Hitchhiker's Guide to the Galaxy, Douglas Adams)

7
Fields where corpora are used

Lexicography: to design dictionaries
Language studies (relations between languages,
differences between genres, evolution of the
language)
Computational linguistics (training and testing
methods)
Language teaching (learner corpora)
Cultural studies, psycholinguistics

8
The characteristics of analysis using corpora
(Biber, 1998)

It is empirical, analysing the actual patterns
of use from natural texts
It utilises a large and principled collection of
natural texts as the basis for analysis
It makes extensive use of computers for
analysis, using both automatic and interactive
techniques
It depends on both quantitative and qualitative
analytical techniques

9
History

We have to split the history into two periods:
before Chomsky and after Chomsky
Before Chomsky, methods similar to the ones in
corpus linguistics were used (empiricism)
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus1/1fra1.htm

10
Early corpus linguistics

Before Chomsky
Computers were not available so it was difficult
to analyse large collections of text
Studies of child language using diaries kept by
parents
Spelling conventions in a German corpus of 11
million words
Foreign language pedagogy

11
Early corpus linguistics (II)

All the work of early corpus linguistics was
underpinned by two fundamental, yet flawed,
assumptions:
The sentences of a natural language are finite.
The sentences of a natural language can be
collected and enumerated.
Most linguists saw the corpus as the only source
of linguistic evidence in the formation of
linguistic theories

12
Chomsky

Between 1957 and 1965 Chomsky changed the
direction of linguistics from empiricism towards
rationalism
"Any natural corpus will be skewed. Some
sentences won't occur because they are obvious,
others because they are false, still others
because they are impolite. The corpus, if
natural, will be so wildly skewed that the
description would be no more than a mere list"
(Chomsky, 1962)
Introspection started to be used instead

13
Problems with introspection

Naturally occurring data is observable and
verifiable by everyone.
Introspective data is artificial.
Human beings have only the vaguest notion of the
frequency of a construct or a word.

14
The revival of corpus linguistics

Research in corpus linguistics continued in
small centres
The hardware still imposed some restrictions;
real development started only in the 1980s
Fields like computational linguistics were not
interested in using corpora

15
The revival of corpus linguistics (II)

1960s: Brown Corpus (Brown University, American
English)
1970s: LOB corpus (British English)
1980s: Bank of English in Birmingham
1990s: BNC, LDC, ICE corpus, ELRA, TRACTOR,
ICAME

16
Why bother with corpora?

Even expert speakers have only a partial
knowledge of a language.
A corpus can be more comprehensive and balanced.
Even expert speakers tend to notice the unusual
and think of what is possible.
A corpus can show us what is common and typical.
Even expert speakers cannot quantify their
knowledge of language.
A corpus can give us accurate statistics.

17
Why bother with corpora? (II)

Even expert speakers cannot remember everything
they know.
A corpus can store and recall all the
information that has been input.
Even expert speakers cannot make up natural
examples.
A corpus can provide us with a vast number of
real examples.
Even expert speakers have prejudices and
preferences, and every language has cultural
connotations and underlying ideology.
A corpus can give you more objective evidence.

18
Why bother with corpora? (III)

Even expert speakers are not always available to
be consulted.
A corpus can be made permanently accessible to
all.
Even expert speakers cannot keep up with language
change.
A constantly updated corpus can reflect even
recent changes in the language.
Even expert speakers lack authority: they can be
challenged by other expert speakers.
A corpus can encompass the actual language use of
many expert speakers.

19
Parameters of a corpus

Language: monolingual, multilingual (comparable
corpora), parallel
Type of source: written, spoken, mixed

20
Parameters of a corpus (II)

Size: the size of the corpus is not all-important
and depends very much on the type of texts used
Annotated/not annotated (type of encoding used:
plain text, SGML/XML)
Static corpus/monitor corpus
Corpus/sub-corpus
Number of words (tokens)/types

21
Type/token ratio

Brown corpus, 1m tokens (written only): 50,406
types
1980s Birmingham/Cobuild corpora, 1m tokens
(spoken only): 36,807 types, of which 17,459
occur only once
NB: fewer types than Brown (written only);
spoken language is more repetitive and uses a
smaller vocabulary
4m tokens (Times newspapers only): 122,773 types,
of which 54,144 occur only once
18m tokens (general corpus): 228,323 types, of
which 131,299 occur only once

22
Type/token ratio (II)

121m tokens (general corpus): 475,633 types, of
which 213,684 occur only once
211m tokens (general corpus): 638,901 types
323m tokens (general corpus): 812,467 types
418m tokens (general corpus): 938,914 types, of
which 438,647 occur only once
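
Sketch (not part of the original slides): the counts listed above can be reproduced for any text by counting tokens, distinct types and words that occur only once. The tokenisation rule, the function names and the sample sentence are assumptions made for illustration; real corpus tools use more careful tokenisers.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_stats(tokens):
    """Return token count, type count, once-only count and type/token ratio."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    ratio = n_types / n_tokens if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types, "hapaxes": hapaxes, "ttr": ratio}

# Hypothetical sample text, used only to exercise the functions.
sample = "the cat sat on the mat and the dog sat on the rug"
stats = type_token_stats(tokenize(sample))
print(stats)

# Ratios implied by two of the 1m-token figures quoted on the slides.
brown_ttr = 50_406 / 1_000_000
spoken_ttr = 36_807 / 1_000_000
print(f"Brown (written): {brown_ttr:.4f}  Cobuild (spoken): {spoken_ttr:.4f}")
```

The ratio falls as a corpus grows (about 0.050 for 1m tokens of Brown against about 0.002 for the 418m-token general corpus), so it is best compared only between samples of similar size.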

23
Ways to exploit a corpus

Word (token) / types frequency lists
N-grams
Concordances
Collocations/colligations
Specially designed programs (especially when the
corpus is annotated)

24
Frequency lists

lists that indicate the words which appear in a
corpus and their frequency
they provide a survey of the corpus
a frequency list becomes more meaningful when
compared with other lists
they remove a word from its context
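
Sketch (not part of the original slides): a minimal frequency list, assuming a crude regular-expression tokeniser. The helper per_million is an illustrative assumption; it normalises a count so that lists from corpora of different sizes can be compared.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

# Hypothetical two-sentence "corpus".
text = "The corpus shows what is common. The corpus shows what is typical."
for word, count in frequency_list(text)[:5]:
    print(f"{word:<10}{count}")
```

Note that the list says nothing about where each word occurred, which is the loss of context mentioned above.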

25
N-grams

groups of N words which appear in sequence in
the text
they are presented using frequency lists
good way to identify recurring/specific
expressions for a corpus
provide limited context for the words
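
Sketch (not part of the original slides): n-grams can be extracted by sliding a window of N tokens over the text and counting the results. The function name and the sample sentence are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

# Hypothetical sample, already split into tokens on whitespace.
tokens = "as a matter of fact it is a matter of taste".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

for gram, count in bigram_counts.most_common(3):
    print(" ".join(gram), count)
```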

26
Concordances

show words in the context in which they appear
usually they are obtained using special programs
which allow the lists of concordances to be
manipulated
KWIC (Key Word In Context) is the most common
format
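
Sketch (not part of the original slides): a KWIC display can be produced by collecting a fixed number of tokens on each side of every occurrence of the keyword. The function name, the context width and the padding of 25 characters are assumptions chosen for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return one line per occurrence of keyword, with `width` tokens of context each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>25}  {token}  {right}")
    return lines

# Hypothetical sample text.
tokens = ("a corpus can show us what is common and a corpus "
          "can give us accurate statistics").split()
for line in kwic(tokens, "corpus"):
    print(line)
```

Dedicated concordancers add sorting by left or right context, which is what makes recurring patterns visible.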

27
Collocations

collocation: the occurrence of two or more
words within a short space of each other in text
the collocates are extracted using a window to
the left and right of a specified word
can be used to further analyse the context of a
word
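
Sketch (not part of the original slides): collocates of a node word can be collected with a window of a few tokens to its left and right. Only raw co-occurrence counts are shown here; ranking collocates with a statistical association measure is a further step that this sketch leaves out. The names and the sample text are assumptions.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens to the left or right of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Hypothetical sample text.
tokens = ("strong tea and strong coffee but never powerful tea "
          "only strong tea").split()
tea_collocates = collocates(tokens, "tea", window=1)
print(tea_collocates.most_common(3))
```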

28
The word gamut
29
Building corpora

Ways to acquire corpora
Direct conversion from electronic format
Optical scanning
Keyboarding
Speech transcription

30
Building corpora (II)

Criteria in corpus design
Size (small corpora are for genre-specific
studies, whereas big corpora support robust,
general statements about a language)
Genre (domain, distribution, age, ...)
The structure of the corpus can be decided
A priori (Brown, LOB, ...)
A posteriori
Old material is replaced with new material
(monitor corpus)

31
Building corpora (III)

Selection, permission, acquisition
Data conversion, optical scanning, keyboarding,
speech transcription
Cleaning, spell-checking, encoding (annotation),
indexing
Writing documentation
Evaluation of corpora
Distribution of corpora

32
Possible problems when building a corpus

A sampling frame designed to allow the
exploitation of certain linguistic properties
Balance and representativeness
Information can be lost through cleaning
Duplication
When working with speech, information can be lost
through transcription

33
Web as a corpus

The Web can be a very useful source of texts
The Web is very helpful for languages other than
English
Quite often there is no control over the language
being investigated, so filtering (if possible)
is necessary

34
Corpus annotation

Enrichment of a corpus with various types of
information
It can be done at every level
Word: part of speech, sense
Sentence: sentence boundaries, syntactic tree
Discourse: coreferential chains, discourse
segments
Certain expressions: named entities

35
Annotation scheme

A standard used to annotate certain
characteristics
Gives meaning to a tag
Nowadays it is usually expressed in XML
Usually, in addition to an annotation scheme, a
set of guidelines is produced to assist the
annotation

36
Examples (II)

<P><S>
<W POS="PRON" NUM="PL" LEMMA="we">We</W>
<W POS="V" LEMMA="have">have</W>
<W POS="EN" LEMMA="develop">developed</W>
<NP>
<W POS="DET" LEMMA="a">a</W>
<W POS="A" LEMMA="computational">computational</W>
<W POS="N" NUM="SG" LEMMA="paradigm">paradigm</W>
<W POS="PUNCT">,</W> ...
</NP> ... </S></P>
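
Sketch (not part of the original slides): once the annotation is well-formed XML, a standard parser can recover the word forms together with their tags. The fragment below follows the element and attribute names of the example above, with the elided parts left out so that it parses; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# A complete, well-formed fragment in the style of the slide's example.
annotated = (
    '<P><S>'
    '<W POS="PRON" NUM="PL" LEMMA="we">We</W>'
    '<W POS="V" LEMMA="have">have</W>'
    '<W POS="EN" LEMMA="develop">developed</W>'
    '<NP>'
    '<W POS="DET" LEMMA="a">a</W>'
    '<W POS="A" LEMMA="computational">computational</W>'
    '<W POS="N" NUM="SG" LEMMA="paradigm">paradigm</W>'
    '</NP>'
    '</S></P>'
)

root = ET.fromstring(annotated)

# iter("W") visits every word element, however deeply it is nested.
words = [(w.text, w.get("POS"), w.get("LEMMA")) for w in root.iter("W")]
nouns = [text for text, pos, _ in words if pos == "N"]

print(words)
print(nouns)
```

This is the kind of query that plain text does not support directly, for example retrieving every noun, or every form of the lemma "develop".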

37
What are the advantages of corpus annotation?

Ease of exploitation
Reusability
Multi-functionality
Explicit analyses
Once a corpus is annotated it can be used in
further research

38
Annotation of a corpus

Can be done automatically, semi-automatically
or manually
Sometimes the method is automatic and the
results are then post-processed
Usually special tools are used to minimise
human error

39
Criticism of corpus annotation

Corpus annotation produces impure corpora
Sometimes annotation can hide certain features
Consistency versus accuracy
Measures to compute the reliability of an
annotation
Sometimes the annotation scheme can cover a
phenomenon only partially.
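
Sketch (not part of the original slides): the slides do not name a particular reliability measure. Cohen's kappa is assumed here as one widely used option for two annotators; it corrects the observed agreement for the agreement expected by chance. The label sequences are invented for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical part-of-speech tags from two annotators for six tokens.
annotator_1 = ["N", "V", "N", "DET", "N", "V"]
annotator_2 = ["N", "V", "A", "DET", "N", "N"]
print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on 4 of 6 items (0.667) and chance agreement is 12/36 (0.333), which gives a kappa of 0.5.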

40
Existing corpora

Brown Corpus/LOB corpus
Bank of English
Wall Street Journal, Penn Treebank, BNC, ANC,
ICE, WBE, Reuters Corpus
Canadian Hansard: parallel corpus (English-French)
York-Helsinki Parsed Corpus of Old English Poetry
Tiger corpus (German)
CORIS/CODIS: contemporary written Italian
MULTEX: 1984 and The Republic in many languages

41
Distributors of corpora

LDC (Linguistic Data Consortium)
ELRA (European Language Resources Association)
TRACTOR (TELRI Research Archive of Computational
Tools and Resources)
ICAME (International Computer Archive of Modern
and Medieval English)

42
References

Karin Aijmer and Bengt Altenberg (1991) English
corpus linguistics, Longman
Douglas Biber, Susan Conrad and Randi Reppen (1998)
Corpus linguistics, Cambridge University Press
Graeme D. Kennedy (1998) An introduction to
corpus linguistics, Longman
Tony McEnery and Andrew Wilson (1996) Corpus
linguistics, Edinburgh University Press

43
References (II)

Geoff Barnbrook (1996) Language and Computers,
Edinburgh University Press
Tony McEnery (2003) Corpus linguistics. In
Ruslan Mitkov (ed.) The Oxford Handbook of
Computational Linguistics, Oxford University
Press
