Title: Corpus Linguistics and Language Variation
 1Corpus Linguistics and Language Variation
- Michael P. Oakes 
 - University of Sunderland
 
  2Contents
- Introduction to Corpora and Language Variation 
 - The Chi-Squared Test 
 - British versus U.S. English 
 - Social Differentiation in the Use of Vocabulary 
 - Genre Analysis 
 - Yule's Distinctiveness Coefficient 
 - Hierarchical Clustering 
 - Factor Analysis 
 - Linguistic Facets and Support Vector Machines 
 - Computational Stylometry 
 - Conclusions
 
  3Introduction
- Computer analyses of linguistic variation are 
restricted to comparisons of the use of 
frequently occurring, objectively countable 
linguistic features.
 - Not hapax legomena 
 - Not Semitisms in the Greek New Testament 
 - A corpus is a large body of (usually electronic) 
text, sampled for a purpose.
 - What are the differences in the ways that feature 
x is used in corpus y and corpus z?
 - In order to carry out comparisons, we need 
corpora that are matched as far as possible in 
every way but one, e.g. same genre, same country, 
different gender.  
  4The LOB Corpus Family
- A well-known family of corpora, based on the 
Brown corpus of one million words of American 
English (1962).
 - The Lancaster-Oslo-Bergen corpus is the British 
equivalent.
 - Carefully constructed using the same sampling 
model (balanced) to carry out studies of 
diachronic and synchronic variation.  
  5
Broad Text Category   Genre                                    Texts in Brown   Texts in LOB
Press                 A Reportage                              44               44
                      B Editorial                              27               27
                      C Reviews                                17               17
General Prose         D Religion                               17               17
                      E Skills, Trades, Hobbies                36               38
                      F Popular Lore                           48               44
                      G Belles Lettres, Biographies, Essays    75               77
                      H Miscellaneous                          30               30
                      J Academic Prose                         80               80
Fiction               K General Fiction                        29               29
                      L Mystery and Detective                  24               24
                      M Science Fiction                        6                6
                      N Adventure and Western                  29               29
                      P Romance and Love Story                 29               29
                      R Humour                                 9                9
 6Diachronic Corpora
- LOB and Brown represent English of the 1960s. 
 - FLOB and Frown are balanced with respect to LOB 
and Brown, designed to reflect English of the 
1990s.
 - Lancaster1931 Corpus (Leech and Smith, 2005) 
 - All annotated with a common grammatical tagset 
known as C8, but no demographic information.
 - Equal 30-year gaps enable researchers to 
determine whether linguistic change is speeding 
up, constant or slowing down.
 - BE06 (Baker, 2006). 
 
  7Sources of Linguistic Variation
- Biber (1998) uses the term genre for classes of 
texts that are determined on the basis of 
external criteria relating to the author's or 
speaker's purpose.
 - Text types are grouped by similarities in intrinsic 
linguistic form, irrespective of their genre 
classifications, e.g. press reportage, 
biographies and academic prose all have a 
narrative linguistic form.
 - Register refers to variation in language arising 
from the situation it is used in, depending on 
such things as interactiveness and who the 
addressee is.
 - Dialect is defined by association with different 
speaker groups, based on region, social group or 
other demographic factors.
 - Topic, genre and stylistic preference may swamp 
language change proper. 
  8Feature Selection
- An early stage in many text classification tasks 
is to decide which features (called attributes 
in machine learning applications) should be used 
to characterise the texts.
 - Words (non-trivial) 
 - Lemmas and Stems 
 - Interpretive codes, e.g. AB (abbreviation), SF 
(neologism from science fiction), FO (foreign 
origin)
 - Part-of-Speech and Semantic tags 
 - E.g. does American English use any of these more 
than British English?  
  9The Chi-Squared (χ²) Test (1)
- Is "mother" more typical of female speech than 
male speech?
 - Rayson, Leech and Hodges (1997) used the BNC 
Conversational Corpus
 - Start with a contingency table of observed values 

                 Female      Male        Row totals
Mother           627         272         899
Any other word   2,592,825   1,714,161   4,306,986
Column totals    2,593,452   1,714,433   Grand total 4,307,885
 10The Chi-Squared (χ²) Test (2)
- Expected values are found using the formula 
 - E = row total × column total / grand total 

                 Female        Male
Mother           541.2         357.8
Any other word   2,592,910.8   1,714,075.2
 11The Chi-Squared (χ²) Test (3)
- $\chi^2 = \sum (O - E)^2 / E$
 - $\chi^2 = 34.2$
 - For one degree of freedom, if $\chi^2 > 10.8$ we can be 
99.9% confident that women really do say "mother" 
more than men. 
 - Contribution of each cell, $(O - E)^2 / E$:

                 Female   Male
Mother           13.6     20.6
Any other word   0.0      0.0
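
The arithmetic on the three chi-squared slides can be checked with a few lines of Python. This is a minimal sketch using only the observed counts from the contingency table; the function name `chi_squared` and the dictionary layout are illustrative choices, not part of the original study.

```python
# Observed counts from the contingency table (rows: word, columns: gender).
observed = {
    "mother": {"female": 627, "male": 272},
    "any other word": {"female": 2_592_825, "male": 1_714_161},
}


def chi_squared(table):
    """Return (chi2, expected, contributions) for a two-way table of raw counts."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    grand_total = sum(row_totals.values())

    expected, contributions = {}, {}
    chi2 = 0.0
    for r in rows:
        for c in cols:
            # Expected value: row total x column total / grand total.
            e = row_totals[r] * col_totals[c] / grand_total
            # Each cell contributes (O - E)^2 / E to the statistic.
            contrib = (table[r][c] - e) ** 2 / e
            expected[(r, c)] = e
            contributions[(r, c)] = contrib
            chi2 += contrib
    return chi2, expected, contributions


chi2, expected, contributions = chi_squared(observed)
print(f"chi-squared = {chi2:.1f}")
for cell, e in expected.items():
    print(cell, f"expected = {e:.1f}", f"contribution = {contributions[cell]:.1f}")
```

Almost all of the statistic comes from the "mother" row; the "any other word" cells have such large expected values that their contributions round to 0.0.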
 12Gender Variation in the Use of Vocabulary
- Words most characteristic of male speech: fing, er, the, yeah, aye, right, hundred, f, is, of, two, three, a, four, ah, no, number, quid, mate
 - Words most characteristic of female speech: she, her, said, n't, I, and, to, cos, oh, Christmas, thought, lovely, nice, mm, had, did, going, because, him, really 
 13Common Pitfalls
- The corpora need not be the same size 
 - Expected values should be at least 5 
 - Log-likelihood / G² also requires E ≥ 5 
 - Express data as counts, not ratios (e.g. words 
per million)
 - Need for a dispersion measure (thalidomide 
example)
 - Bonferroni correction for multiple comparisons 
reduces Type I errors
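
Two of these points lend themselves to a short illustration. The sketch below computes the log-likelihood statistic in its usual form, G² = 2 Σ O ln(O / E), for a 2×2 table of raw counts, and shows the Bonferroni adjustment as a division of the significance level by the number of tests. The function names and the example figures of 0.05 and 100 tests are assumptions made for illustration only.

```python
import math


def log_likelihood(a, b, c, d):
    """G2 for a 2x2 table [[a, b], [c, d]] of raw counts (not rates per million)."""
    n = a + b + c + d
    cells = [
        (a, (a + b) * (a + c) / n),
        (b, (a + b) * (b + d) / n),
        (c, (c + d) * (a + c) / n),
        (d, (c + d) * (b + d) / n),
    ]
    # A cell with an observed count of zero contributes nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


# The "mother" counts from the chi-squared example.
g2 = log_likelihood(627, 272, 2_592_825, 1_714_161)
print(f"G2 = {g2:.1f}")

# Hypothetical case: testing 100 words at an overall level of 0.05.
print(bonferroni_alpha(0.05, 100))
```

Because every word tested is a separate comparison, a study that screens thousands of words needs a far stricter per-word threshold than a study of a single word.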
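  14Yule's Distinctiveness Coefficient (Q)
- $Q = (AD - BC) / (AD + BC)$, where A and B are the 
top-row cells and C and D the bottom-row cells of a 
2 × 2 table
 - Q measures strength and direction of a 
relationship
 - $Q = -1$ or $+1$ for both complete and absolute 
relationships
 - $\phi^2 = \chi^2 / \text{sample size}$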
 
Complete relationship:
             Brown   LOB
Theatre      0       63
Theater      95      30

Absolute relationship:
             Brown   LOB
South-west   0       10
southwest    16      0
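
As a quick check, Yule's Q can be computed for both tables. This is a minimal sketch; the function name `yules_q` is illustrative, and the cells are read as A, B in the top row and C, D in the bottom row.

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2x2 table [[a, b], [c, d]]: (AD - BC) / (AD + BC)."""
    return (a * d - b * c) / (a * d + b * c)


# Complete relationship: one cell is zero.
q_theatre = yules_q(0, 63, 95, 30)

# Absolute relationship: both cells on one diagonal are zero.
q_southwest = yules_q(0, 10, 16, 0)

print(q_theatre, q_southwest)
```

Both tables give Q = -1: the sign is negative only because the British spelling is in the first row and the American corpus in the first column, so swapping either would give +1. The coefficient is undefined when AD + BC is zero.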
 15Comparison of Corpora representing English spoken 
in 5 Countries (Oakes & Farrow, 2007)
- ACE (Australia): labour rights (unions, unemployed, superannuation)
 - FLOB (UK): aristocratic titles (royal, Lord)
 - Frown (United States): spelling differences (color, theater), terms for transportation (railroad, highway), diversity (black, gender, white, gay)
 - Kolhapur (India): crores (ten millions), lakhs (hundred thousands), caste system (dalit, caste), religion (Hindu, Krishna, temple), function words (the, upto)
 - Wellington (New Zealand): sports (rugby), the natural world (bay, beach, cliff) 
 16Genre Analysis
- Univariate techniques such as chi-squared and 
Yule's Q can be used to compare the vocabulary 
used in different genres.
 - Here we will look at multivariate techniques: 
hierarchical clustering, factor analysis and support 
vector machines.
 - Multivariate statistical analysis is used when several 
variables are observed for each sample unit, e.g. 
the frequencies of many different words in a 
genre. 
  17Hierarchical Clustering
- Cluster analysis is a type of automatic 
categorisation: similar things (such as related 
genres) are brought together, and dissimilar 
things are kept apart.
 - The starting point for many clustering algorithms 
is the similarity matrix, a square table of 
numbers showing how much each of the items (such 
as texts) to be clustered has in common with 
each of the others.
  18The ranks of ten common words in four corpora
       LOB   Brown   Carroll   Jones & Sinclair
The    1     1       1         1
Of     2     2       2         6
And    3     3       3         2
To     4     4       5         4
A      5     5       4         3
In     6     6       6         7
That   7     7       8         8
Is     8     8       7         9
Was    9     9       10        10
It     10    10      9         5
 19Spearman's Rank Correlation Coefficient
- $C = 1 - \frac{6 \sum (R - S)^2}{n (n^2 - 1)}$
 - R is rank in LOB, S is rank in Jones & Sinclair, 
n = number of words
 - $C = 1 - \frac{6 \times 50}{10 \times 99} = 0.697$

       R    S    R-S   (R-S)²
The    1    1    0     0
Of     2    6    -4    16
And    3    2    1     1
To     4    4    0     0
A      5    3    2     4
In     6    7    -1    1
That   7    8    -1    1
Is     8    9    -1    1
Was    9    10   -1    1
It     10   5    5     25
$\sum (R - S)^2 = 50$
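
The same calculation can be repeated for every pair of corpora to build a similarity matrix. The following is a minimal sketch that applies the rank-difference formula to the ranks tabulated for the ten common words; the function name `spearman` and the data layout are illustrative, and the formula assumes there are no tied ranks.

```python
from itertools import combinations

# Ranks of The, Of, And, To, A, In, That, Is, Was, It in each corpus.
ranks = {
    "LOB": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Brown": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Carroll": [1, 2, 3, 5, 4, 6, 8, 7, 10, 9],
    "Jones & Sinclair": [1, 6, 2, 4, 3, 7, 8, 9, 10, 5],
}


def spearman(r, s):
    """Spearman's coefficient: 1 - 6 * sum((R - S)^2) / (n * (n^2 - 1))."""
    n = len(r)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(r, s))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# One coefficient per unordered pair of corpora.
similarity = {
    (a, b): spearman(ranks[a], ranks[b]) for a, b in combinations(ranks, 2)
}
for pair, value in similarity.items():
    print(pair, f"{value:.3f}")
```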
 20Similarity matrix for the four corpora
                   LOB     Brown   Carroll   Jones & Sinclair
LOB                -       1.000   0.964     0.697
Brown              1.000   -       0.964     0.697
Carroll            0.964   0.964   -         0.757
Jones & Sinclair   0.697   0.697   0.757     -
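 21Production of a Dendrogram by Nearest Neighbour 
linkage
[Figure: dendrogram. LOB and Brown join at similarity 1.000; Carroll (C) joins them at 0.964; Jones & Sinclair (SJ) joins last at 0.757.]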
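
The merge levels in the dendrogram can be reproduced from the similarity matrix for the four corpora. The sketch below is one straightforward way to implement nearest neighbour (single) linkage in plain Python; the function name `single_linkage` and the short labels C and SJ are illustrative, and no claim is made that this is the program used for the original analysis.

```python
# Pairwise similarities from the similarity matrix (C = Carroll, SJ = Jones & Sinclair).
sim = {
    frozenset(("LOB", "Brown")): 1.000,
    frozenset(("LOB", "C")): 0.964,
    frozenset(("Brown", "C")): 0.964,
    frozenset(("LOB", "SJ")): 0.697,
    frozenset(("Brown", "SJ")): 0.697,
    frozenset(("C", "SJ")): 0.757,
}


def single_linkage(labels, sim):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, similarity)."""
    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Nearest neighbour: similarity of the closest pair of members.
                s = max(
                    sim[frozenset((a, b))]
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), s))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


merges = single_linkage(["LOB", "Brown", "C", "SJ"], sim)
for a, b, s in merges:
    print(a, "+", b, "at", s)
```

Under nearest neighbour linkage a newcomer joins a cluster at the level of its most similar member, which is why Jones & Sinclair attaches at 0.757 (its similarity to Carroll) rather than at 0.697.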
 22Dendrogram for 15 LOB Genres using 89 most 
frequent words (1) 
 23Dendrogram for 15 LOB genres (2)
- If our cut-off point is 0.2 difference, then we 
get 4 clusters:
 - K (general fiction), P (romance and love story), 
L (mystery and detective), N (adventure and 
western)
 - A (press reportage), C (press reviews) 
 - M (science fiction), R (humour), F (popular 
lore), G (belles lettres)
 - B (press editorial), D (religion), E (skills, 
trades and hobbies), H (government documents), J 
(learned and scientific writings).
 - Hofland and Johansson (1982): the two major 
groups of texts were imaginative and informative 
prose, bridged by essayistic prose. 
  24Factor Analysis
- Decathlon analogy: running, jumping and throwing. 
 - Biber (1988): groups of countable features which 
consistently co-occur in texts are said to define 
a linguistic dimension.
 - Such features are said to have positive loadings 
with respect to that dimension, but dimensions 
can also be defined by features which are in 
complementary distribution, i.e. negatively 
loaded.
 - Example: at one pole are many pronouns and 
contractions, near which lie conversational 
texts and panel discussions. At the other pole, 
with few pronouns and contractions, are scientific 
texts and fiction.
  25Factor Analysis Methodology
- The use of computer corpora, such as LOB, which 
classify texts by a wide range of genres.
 - The use of computer programs to count the 
frequencies of linguistic features throughout the 
range of genres, e.g. Perl scripts to count the 
frequencies of the 50 most common words in LOB.
 - Use of factor analysis to determine co-occurrence 
relations among the features, e.g. using Matlab.
 - Use of the linguist's intuition to interpret the 
linguistic dimensions discovered.
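
The counting step might look something like the following. The slides mention Perl scripts; this sketch uses Python instead, and the tokeniser, the function name `genre_word_table` and the two toy sentences are all assumptions made for illustration, not LOB data.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokeniser: lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def genre_word_table(genre_texts, n=50):
    """Return (vocab, table): the n most common words overall and, for each
    genre, the count of each of those words in that genre's text."""
    counts = {genre: Counter(tokenize(text)) for genre, text in genre_texts.items()}
    overall = Counter()
    for genre_counts in counts.values():
        overall.update(genre_counts)
    vocab = [word for word, _ in overall.most_common(n)]
    table = {genre: [c[word] for word in vocab] for genre, c in counts.items()}
    return vocab, table


# Toy input standing in for genre A (press reportage) and genre K (general fiction).
sample = {
    "A": "The minister said the talks had ended",
    "K": "She said the door was open and she left the room",
}
vocab, table = genre_word_table(sample, n=3)
print(vocab)
print(table)
```

The resulting genre-by-word table of counts is the kind of input that a factor analysis or clustering routine would then take.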
 27Main Findings
- The singular personal pronouns (I, you, he, she, 
his, him, her) were clustered very closely 
together, and between K (general fiction) and L 
(mystery and detective), showing that these 
pronouns are characteristic of fictional texts.
 - As in the dendrogram, various genres were 
closely related to each other:
 - Four types of fiction: K (general), L (mystery 
and detective), N (adventure and western) and P 
(romance).
 - M (science fiction) and R (humour). 
 - F (popular lore), G (belles lettres), C (press 
reviews).
 - D (religion), E (skills, trades and hobbies), H 
(government docs) and J (science).
 - B (press editorial) is the only genre to be 
positively loaded on both factors, but its 
nearest neighbour is D (religion).
 - Only A (press reportage) is differently located 
with respect to its neighbours compared with the 
dendrogram.
  28Support Vector Machine (Santini, 2007)
- Extant web genres, e.g. editorials, 
do-it-yourself, mini-guides, biographies
 - Variant web genres, on the other hand, have 
arisen since the advent of the web: blogs, 
e-shops, FAQs, listings (e.g. site maps), 
personal home pages, search pages.
 - First stage is to count automatically extractable 
linguistic features, e.g. function words, 
punctuation, POS trigrams, and linguistic facets, 
e.g. genre-specific referential vocabulary 
(basket, buy, cart, catalogue, checkouts, cost 
for e-shops) and a functionality facet (tags such as 
<button> and <form>, which indicate an interactive 
web page).
 - Application to search engines. 
 
  29Equation of Hyperplane: $w_0^T x + b_0 = 0$
[Figure: scatter of data points (marked X) with the separating hyperplane.]
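
Once a separating hyperplane has been found, classifying a new text reduces to checking which side of it the feature vector falls on. The sketch below reads the slide's equation as w·x + b = 0 and uses a made-up weight vector and bias; the values of `w0` and `b0`, the two-feature vectors and the function names are hypothetical, and training the SVM itself is not shown.

```python
import math


def decision_value(w, x, b):
    """Signed value of w.x + b; zero means x lies exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Assign +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def distance_to_hyperplane(w, x, b):
    """Perpendicular distance from x to the hyperplane."""
    return abs(decision_value(w, x, b)) / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical two-feature example (not taken from Santini's data).
w0 = [1.0, -1.0]
b0 = 0.0

print(classify(w0, [2.0, 0.5], b0))
print(classify(w0, [0.5, 2.0], b0))
print(distance_to_hyperplane(w0, [2.0, 0.0], b0))
```

A support vector machine chooses the weights and bias so that the distance from the hyperplane to the nearest training points on either side is as large as possible.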
 30Computational Stylometry 
- Writing styles of individual authors 
 - Forsyth (1999) used χ² to find substring 
discriminators for the younger (s, an,  whi, 
 with) and the older Yeats (what,  can, 
?)
 - Holmes (1992) used hierarchical clustering to 
compare Mormon Scripture, Joseph Smith's personal 
writings and Old Testament sections.
 - Holmes, Gordon and Wilson (2001) used Principal 
Components Analysis, similar to Factor Analysis, 
to determine the authorship of The Heart of a 
Soldier
 - Popescu and Dinu (2007) used an SVM to decide who 
wrote the disputed Federalist Papers (Madison rather than 
Hamilton).
  31Hierarchical Clustering of Mormon Scripture, 
Joseph Smith's personal writings and samples 
from Isaiah (Holmes, 1992) 
[Figure: dendrogram. Leaf order: J1 J2 J3 I1 I3 I2 JB N2 A2 A1 M2 M4 R2 D2 M3 N3 M5 R1 AB D3 L1 D1 N1 M1]
 32Principal Components Plot
[Figure: principal components plot. Text samples plotted: LaSalle Autobiography, LaSalle Letters (o), George Personal, George War (x), Harrison, Heart, Inman papers.]
 33Conclusions
- Statistical methods can distinguish between 
different text types, arising through such 
factors as demographic variation, genre 
differences, topic differences, or individual 
writing styles.
 - One difficulty is that each of these differences 
can obscure the others.
 - Cultural corpora versus balancing text-for-text. 
 - Search for linguistic features that are good at 
identifying one form of linguistic variation, 
without being indicative of others. 
  34Source
- Oakes, M. P. Corpus Linguistics and Language 
Variation, in Baker, P. (ed.) Contemporary 
Approaches to Corpus Linguistics, Continuum (to 
appear).