Corpus search

About This Presentation
Title:

Corpus search

Description:

What are the most common words in English? What are the most common verbs? What is the most common pronoun? What is the most common proper noun? What are the most common ...

Slides: 35
Provided by: byu59
Transcript and Presenter's Notes

1
Corpus search
  • What are the most common words in English?
  • What are the most common verbs?
  • What is the most common pronoun?
  • What is the most common proper noun?
  • What are the most common noun-noun collocations?

2
Whom in British and American
  • Fill in the tables using COCA and BNC

            Freq. whom    Freq. prep. whom    Freq. whom (prep whom)
British
American
3
Whom in British and American
  • COCA has 450M words and the BNC has 100M
  • Calculate per million frequencies

            Freq. whom    Freq. prep. whom    Freq. whom (prep whom)
British
American
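
The per-million calculation can be sketched in a few lines of Python. This is a minimal sketch: the corpus sizes are the ones given on the slide, while the function name and the raw counts are hypothetical placeholders, not real search results.

```python
# Corpus sizes in millions of words, as given on the slide.
CORPUS_SIZE_MILLIONS = {"COCA": 450, "BNC": 100}


def per_million(raw_count: int, corpus: str) -> float:
    """Convert a raw hit count into a frequency per million words."""
    return raw_count / CORPUS_SIZE_MILLIONS[corpus]


# Hypothetical raw counts, for illustration only (not real search results).
example_counts = {"COCA": 9000, "BNC": 3000}

for corpus, count in example_counts.items():
    print(f"{corpus}: {per_million(count, corpus):.1f} per million")
```

Dividing by corpus size is what makes counts from a 450M-word corpus comparable with counts from a 100M-word corpus.
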
4
Simple past vs. present perfect
  • Use the BNC and COCA to fill in this chart with
    frequencies per million
  • Past participle (vh v?n = auxiliary have +
    verb PP)
  • Simple past (v?d = past tense verb)

           US Simple    US Pres. Perf.    UK Simple    UK Pres. Perf.
just
already
yet
ever
6
Browse corpora at BYU
  • Corpus.byu.edu

7
Corpus Applications
8
What is corpus linguistics good for?
  • Making a concordance
  • List of all words in a text and where they are
    found
  • Scriptures
  • Works of Shakespeare
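
A concordance in the sense above (every word plus where it is found) can be sketched as a simple index. This is a minimal sketch, assuming a crude regular-expression tokenizer; the function name and the sample sentence are illustrative choices.

```python
import re
from collections import defaultdict


def build_concordance(text: str) -> dict[str, list[int]]:
    """Map each word (lowercased) to the token positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(re.findall(r"[a-z']+", text.lower())):
        index[token].append(position)
    return dict(index)


sample = "To be, or not to be, that is the question."
index = build_concordance(sample)

# List every word alphabetically with the positions where it is found.
for word in sorted(index):
    print(word, index[word])
```

For a real text such as the scriptures or Shakespeare, the positions would more usefully be book, chapter, and verse, or act, scene, and line, rather than token offsets.
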

9
What is corpus linguistics good for?
  • Finding word frequencies
  • Psycholinguistic experiments
  • Language instruction
  • Put most common words in L2 vocabulary
  • toxicomano
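
Finding word frequencies amounts to counting tokens. The sketch below uses Python's standard `collections.Counter`; the tokenizer and the sample text are simplifying assumptions, and a real frequency list would be built from a full corpus.

```python
import re
from collections import Counter


def word_frequencies(text: str, top_n: int) -> list[tuple[str, int]]:
    """Return the top_n most frequent words with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)


# Invented sample text, for illustration only.
sample = "the cat saw the dog and the dog saw the cat"

for word, count in word_frequencies(sample, 3):
    print(word, count)
```

A list like this, built from a large corpus, is what lets a course put the most common words into the L2 vocabulary first.
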

10
What is corpus linguistics good for?
  • Lexicography
  • What words to include in a dictionary?
  • What do words mean?
  • How are meanings changing?
  • How are spellings changing?
  • Blowtorch
  • Blow-torch
  • Blow torch
  • Identifying regionalisms

11
What is corpus linguistics good for?
  • Computer systems development
  • Text to speech
  • Text messaging
  • If you have typed gla-, frequency data says glass
    is highly probable, so the system fills it in for you
  • Speech synthesis
  • Natural language processing
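
The gla- example can be sketched as a lookup over a frequency table: among the words that start with what has been typed, suggest the most frequent one. The frequencies and the function name below are hypothetical, invented for illustration; a real system would draw them from corpus data.

```python
from typing import Optional

# Hypothetical per-million frequencies, invented for illustration only.
WORD_FREQ = {
    "glass": 120.0,
    "glad": 60.0,
    "glance": 35.0,
    "glare": 8.0,
    "gland": 4.0,
}


def complete(prefix: str, freq: dict[str, float] = WORD_FREQ) -> Optional[str]:
    """Return the most frequent word starting with prefix, or None."""
    candidates = {word: f for word, f in freq.items() if word.startswith(prefix)}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


print(complete("gla"))
print(complete("glan"))
```
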

12
What is corpus linguistics good for?
  • Testing linguistic theories
  • Generativists relied on personal introspection
  • So what if Dayton is less frequent than New York
    in a corpus?
  • I'm a native speaker and know what sounds right
    and wrong

13
What is corpus linguistics good for?
  • Problems with introspections
  • They're subjective
  • They can't be verified
  • Your introspections probably go along with your
    theory

14
What is corpus linguistics good for?
  • Corpus data . . .
  • Are objective
  • Can be verified
  • Can be shared
  • Can be used to test theories
  • Can be used to get ideas for theories

15
Limitations of corpora
  • They can't contain every sentence
  • Some data aren't interesting
  • Frequency of Dayton versus New York
  • They have mistakes

16
Lexical
  • Word lists
  • General Service List
  • 2,000 most frequent words in English
  • Academic Word List (Coxhead)
  • 570 words in English academic writing
  • Academic Vocabulary List (Davies & Gardner)
  • 3,000 words
  • High frequency in ACAD, low frequency in other
    registers
  • Measure of dispersion (Juilland's D)
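
Juilland's D is commonly computed as D = 1 - V / sqrt(n - 1), where V is the coefficient of variation (standard deviation divided by mean) of a word's frequency across n corpus parts. The sketch below assumes equally sized parts and uses the population standard deviation; the function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def juilland_d(part_frequencies: list[float]) -> float:
    """Juilland's D for one word, given its frequency in each corpus part.

    Assumes the parts are equal in size. A value near 1 means the word is
    spread evenly; a value near 0 means it is concentrated in one part.
    """
    n = len(part_frequencies)
    if n < 2:
        raise ValueError("need at least two corpus parts")
    mu = mean(part_frequencies)
    if mu == 0:
        raise ValueError("word does not occur in any part")
    variation = pstdev(part_frequencies) / mu  # coefficient of variation V
    return 1 - variation / sqrt(n - 1)


# Hypothetical counts across five equal-sized parts, for illustration only.
print(juilland_d([5, 5, 5, 5, 5]))   # evenly spread
print(juilland_d([10, 0, 0, 0, 0]))  # concentrated in one part
```

For parts of unequal size, relative frequencies (for example per million words) would be passed in instead of raw counts.
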


18
Phraseology
  • Formulaic sequences (lexical bundles)
  • Corpus-driven
  • Frequency
  • Function
  • Fixedness
  • at the ___ of
  • What do you think most often fills the blank?
  • Check in COCA
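
Outside the COCA interface, the same kind of frame search can be sketched over any tokenized text: find every "at the X of" sequence and count what fills the open slot. The function name, the tokenizer, and the sample text are illustrative assumptions; the counts say nothing about what COCA returns.

```python
import re
from collections import Counter


def frame_fillers(text: str, before=("at", "the"), after=("of",)) -> Counter:
    """Count the words that fill the single open slot between before and after."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n_before, n_after = len(before), len(after)
    fillers = Counter()
    for i in range(n_before, len(tokens) - n_after):
        left = tuple(tokens[i - n_before:i])
        right = tuple(tokens[i + 1:i + 1 + n_after])
        if left == tuple(before) and right == tuple(after):
            fillers[tokens[i]] += 1
    return fillers


# Invented sample text, for illustration only.
sample = (
    "We met at the end of the day. At the end of the road, "
    "she waited at the top of the hill."
)

for word, count in frame_fillers(sample).most_common():
    print(word, count)
```
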

19
Grammar
  • Descriptive reference grammars
  • Give descriptions of how language is actually
    used rather than prescriptions about how language
    should be used
  • Longman Grammar of Spoken and Written English

20
Lexicogrammar
  • Certain words are more likely to occur in some
    grammatical structures than others
  • E.g., some verbs (e.g., deem, base, subject) are
    much more common in the passive than active voice
  • The material was deemed faulty.
  • Her choice was based solely on ...
  • The matter may be subjected to ...
  • Collostructional analysis is a means of measuring
    the strength of a relationship between a word and
    a grammatical structure
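
One common way to measure that strength is Fisher's exact test on a 2x2 table of word by construction. The sketch below computes the one-tailed p-value with the standard library only; the function and parameter names are hypothetical, and all counts are invented for illustration.

```python
from math import comb


def fisher_attraction_p(
    word_in_cx: int, word_total: int, cx_total: int, corpus_total: int
) -> float:
    """One-tailed Fisher exact p-value for word/construction attraction.

    Probability of seeing at least word_in_cx co-occurrences if the word
    and the construction were independent (hypergeometric distribution).
    """
    upper = min(word_total, cx_total)
    numerator = sum(
        comb(word_total, k) * comb(corpus_total - word_total, cx_total - k)
        for k in range(word_in_cx, upper + 1)
    )
    return numerator / comb(corpus_total, cx_total)


# Hypothetical counts: a verb occurs 100 times, 90 of them in the passive,
# in a sample of 10,000 verb tokens of which 1,500 are passive.
p = fisher_attraction_p(word_in_cx=90, word_total=100,
                        cx_total=1500, corpus_total=10_000)
print(p)
```

The smaller the p-value, the stronger the attraction between the word and the construction; published studies often report it on a negative log scale.
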

21
Register variation
  • Does general English exist?

22
Frequent phrases in conversation
Phraseological feature                    Examples
Personal pronoun + lexical verb phrase    I don't know what, I don't want to, I was going to
Yes-no question fragments                 do you want to, are you going to
Wh-question fragments                     what are you doing, what do you mean, what do you think, what do you want
23
Frequent phrases in academic writing
Phraseological feature                                            Examples
Noun phrase with of-phrase fragment                               the end of the, one of the most
Prepositional phrases with embedded of-phrase fragments           in the case of
Other prepositional phrase fragments                              on the other hand
24
Register variation: complexity
  • Which is more complex: speech or writing?
  • Define the type(s) of complexity we find in each.

25
Multi-Dimensional analysis
  • Identify a comprehensive set of relevant
    linguistic features
  • Identify and quantify those features in a corpus
    of texts
  • Use factor analysis to identify dimensions based
    on co-occurrence among linguistic features
  • Interpret dimensions functionally
  • Calculate scores for each text on each dimension
  • Compare mean scores of registers/varieties
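
The score calculation step can be sketched as follows: standardize each feature's rate across texts, then for each text sum the z-scores of the positive-loading features and subtract those of the negative-loading features. This is a minimal sketch that skips the factor analysis itself; the feature names, the rates, and the choice of population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev


def z_scores(values: list[float]) -> list[float]:
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def dimension_scores(
    rates: dict[str, list[float]], positive: list[str], negative: list[str]
) -> list[float]:
    """Score each text: sum of positive-feature z-scores minus negative ones.

    rates maps a feature name to its normalized rate in each text.
    """
    z = {feature: z_scores(values) for feature, values in rates.items()}
    n_texts = len(next(iter(rates.values())))
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(n_texts)
    ]


# Hypothetical rates per 1,000 words for three texts, for illustration only.
rates = {
    "general_adverbs": [30.0, 20.0, 10.0],
    "nouns": [200.0, 250.0, 300.0],
}
scores = dimension_scores(rates, positive=["general_adverbs"], negative=["nouns"])
print(scores)
```

Averaging these text scores within each register gives the mean scores that are compared in the final step.
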

26
Involved vs. Informational
27
  • Non-technical Synthesis vs. Specialized
    Information Density
  • Positive features
  • Verbs: verb HAVE (.36)
  • Adverbs: general adverbs (.59), amplifiers (.43),
    certainty adverbs (.37), emphatics (.36)
  • Coordination: adverbial conjuncts (.51), phrasal
    coordinating conjunctions (.39)
  • Nominal modifiers: that-relative clauses (.36)
  • Lexical features: COCA Core Vocabulary (1-500)
    (.61)
  • Negative features
  • Nouns: pre-nominal modifiers (-.73), nouns
    (-.73), technical concrete nouns (-.31)
  • Verbs: agentless passive voice (-.42)

28
Study 1: Dimension 1 Results

29
Register variation
  • Does general English exist?

30
Dialect variation activity
  • In what country is this expression permitted?
  • Allow to Verb
  • Permit to Verb
  • Where is the word banjaxed used? Meaning?
  • UK vs. US use of
  • Different from/to
  • Which do they use in Australia?
  • Needn't vs. don't need
  • Haven't a Noun vs. don't have a Noun

31
Diachronic change
  • whom
  • be v?n
  • get v?n
  • end up v?g
  • needn't
  • Others?

32
Data-driven Learning
  • Language learners actually use corpora in the
    classroom
  • Research is mixed
  • It seems to be more useful/effective for advanced
    learners

33
Corpus-informed materials
34
Political discourse
  • Www.speechwars.com