Title: Investigating speech, thought and writing presentation in a corpus of spoken British English
1. Investigating speech, thought and writing presentation in a corpus of spoken British English
- An AHRB-funded project under the supervision of Mick Short, Elena Semino and Tony McEnery
- Research Assistants: John Heywood and Dan McIntyre
2. Project outline
- To compare speech, thought and writing presentation in spoken and written English.
- To build a new corpus of 260,000 words of spoken British English to compare with the STWP Written English Corpus (1995-99).
- To investigate the presentation of speech, thought and writing in the STWP Spoken Corpus by tagging with the Leech and Short (1981) category set.
- To further test and adapt the Leech and Short (1981) model of STP.
- The project is funded until February 2003.
3. Construction of the corpus
- 120 texts, approximately 260,000 words.
- Texts rich in STWP taken from the British National Corpus (BNC) and the Centre for North West Regional Studies (CNWRS) oral history archives at Lancaster University.
- CNWRS interview tapes digitised to be time-aligned with text.
4. Number and distribution of NWRS files in the corpus
NWRS Archive
Family and Social Life Archive | Childhood and Schooling Archive
Male | Female | Male | Female
1890-1940 | 1940-1970 | 1890-1940 | 1940-1970
7 records | 7 records | 8 records | 8 records | 15 records | 15 records
i.e. 60 files with an equal balance of male
and female speakers in each age-range
5. Number and distribution of BNC files in the corpus
BNC spoken data: Spoken Demographic; Spoken Context-Governed
Age range   Male      Female
0-14        5 files   5 files
15-24       5 files   5 files
25-34       5 files   5 files
35-44       5 files   5 files
45-59       5 files   5 files
60+         5 files   5 files
i.e. 60 files with an equal balance of male
and female speakers in each age-range
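The arithmetic behind the 60-file total can be sketched as follows. This is only an illustration, assuming the slide's layout is two genders crossed with the six age bands listed (the last band taken here to be 60 and over), with 5 files per cell; the names `GENDERS`, `AGE_BANDS` and `FILES_PER_CELL` are hypothetical and not part of the project's own tooling.

```python
from itertools import product

# Assumed sampling grid: gender x age band, 5 files in every cell.
GENDERS = ["Male", "Female"]
AGE_BANDS = ["0-14", "15-24", "25-34", "35-44", "45-59", "60+"]
FILES_PER_CELL = 5

# One entry per (gender, age band) cell of the grid.
grid = {cell: FILES_PER_CELL for cell in product(GENDERS, AGE_BANDS)}

# Sum over all cells to get the total number of BNC files.
total = sum(grid.values())
print(len(grid), "cells,", total, "files")
```

Under these assumptions the grid has 12 cells and 60 files, matching the total given on the slide.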
6. The development of the tag-set
Leech and Short (1981)
NRA NRSA NRS/IS FIS NRS/DS FDS
NRTA NRT/IT FIT NRT/DT FDT
- The STWP Written Project (1995)
- 3 main genres: Fiction, Biography and Autobiography, and Newspaper Journalism, each divided into Serious and Popular sections.
N NV NRSA-P NRS/IS FIS NRS/DS FDS
N NI NRTA-P NRT/IT FIT NRT/DT FDT
N NW NRWA-P NRW/IW FIW NRW/DW FDW
embedded, hypothetical, inferred, quote
7. The development of the tag-set: new tags
The STWP Spoken Project (2001): BNC spoken demographic data and NWRS oral history interviews
RM
A RV RSA-P RS/IS FIS RS/DS FDS
A RI RTA-P RT/IT FIT RT/DT FDT
A RN RWA-P RW/IW FIW RW/DW FDW
embedded, negative / absence, hypothetical,
inferred, quote, reiterated, interrogative,
imperative, uncompleted, 2 / 3 / 4
8. A 15-field tag-set: 5 main categories
FIELD   CHARACTER                VALUE
1       x, A, F                  Anything!, Free
2       x, , R, I, D             Representation, Indirect, Direct
3       x, S, T, W, V, I, N, M   Speech, Thought, Writing, Voice, Internal state, WritiNg, Mention
4       x, A                     Act
5       x, P                     toPic
9. A 15-field tag-set: 10 category attributes
FIELD   CHARACTER          VALUE
6       x, , 1, 2, 3, 4    odd, interesting, borderline cases; no.s: repeated (-ing or -ed) adjacent categories
7       xe                 embedded
8       xxg/a              negative action etc., e.g. 'we weren't allowed to go'; absence, e.g. 'I didn't say anything'
9       xxxh               hypothetical
10      xxxxi              inferred
11      xxxxxq             quote
12      xxxxxxr            iterative
13      xxxxxxxv/p         interrogative, imperative
14      xxxxxxxxu          uncompleted
15      xexxxxxxx2         level of embedding (2, 3, 4)
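To make the positional layout concrete, here is a minimal sketch of how such a tag might be decoded. It assumes, as the tables suggest, that a tag is a 15-character string with one character per field and "x" marking an unset field. The function `decode_tag` and the two lookup tables are hypothetical illustrations, not the project's software. The pairing of g/a with negative/absence and of v/p with interrogative/imperative follows the order given in the table and is also an assumption. Field 6 is passed through undecoded because its values are not fully recoverable from the slide.

```python
# Hypothetical decoder for the 15-field tag (one character per field, "x" = unset).
MAIN = {
    1: {"A": "Anything!", "F": "Free"},
    2: {"R": "Representation", "I": "Indirect", "D": "Direct"},
    3: {"S": "Speech", "T": "Thought", "W": "Writing", "V": "Voice",
        "I": "Internal state", "N": "WritiNg", "M": "Mention"},
    4: {"A": "Act"},
    5: {"P": "toPic"},
}
ATTRIBUTES = {
    7: {"e": "embedded"},
    8: {"g": "negative", "a": "absence"},             # assumed pairing
    9: {"h": "hypothetical"},
    10: {"i": "inferred"},
    11: {"q": "quote"},
    12: {"r": "iterative"},
    13: {"v": "interrogative", "p": "imperative"},    # assumed pairing
    14: {"u": "uncompleted"},
    15: {"2": "embedding level 2", "3": "embedding level 3",
         "4": "embedding level 4"},
}

def decode_tag(tag):
    """Return the main categories and attributes set in a 15-character tag."""
    if len(tag) != 15:
        raise ValueError("expected 15 characters, got %d" % len(tag))
    main, attributes = [], []
    for field, char in enumerate(tag, start=1):
        if char == "x":
            continue  # field not set
        if field == 6:
            attributes.append("field 6: " + char)  # left undecoded
            continue
        table, target = (MAIN, main) if field <= 5 else (ATTRIBUTES, attributes)
        if char not in table[field]:
            raise ValueError("unexpected %r in field %d" % (char, field))
        target.append(table[field][char])
    return {"main": main, "attributes": attributes}

# FIS in fields 1-3, embedded in field 7, embedding level 2 in field 15.
example = decode_tag("FISxx" + "xe" + "x" * 7 + "2")
print(example)
```

With this sketch, the example tag decodes to the main categories Free, Indirect, Speech and the attributes embedded and embedding level 2.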
10. Issues arising
- Technical issues
  - Legibility.
  - Comparability between NWRS and BNC data.
- Tagging issues
  - Comparability between written and spoken corpora.
  - What counts as STWP?
  - Functional and formal criteria.
  - Embedding.
  - Repetition (e.g. he said he said well he said).
  - Report of mention.
  - Reading, hearing, listening and singing dogs!