1
Corpus Linguistics with and for Computational Linguistics (Korpuslinguistik mit und für Computerlinguistik)
  • Martin Volk
  • Universität Zürich
  • Eurospider Information Technology AG

2
Sources for linguistic information
  • Introspection (own usage and judgement)
  • Usage and judgement by others
  • Questioning (goal-driven)
  • interview
  • questionnaire
  • Observation ('involuntary' utterances)
  • spoken utterances (→ corpora)
  • written utterances (→ corpora)

3
What is a corpus?
  • a text collection
  • a representative text collection
  • a representative and structured text collection
  • a representative, structured and annotated text
    collection
  • ...

4
Example
  • Is 'ob' used as a preposition in German?
  • Introspection: Rothenburg ob der Tauber
  • Dictionary (Wahrig. Deutsches Wörterbuch. 1996):
    Präp. mit Dativ, veraltet (preposition with dative,
    obsolete): ob dem Wasserfall
  • Web: Google search for 'ob dem'
  • Legend (Sage): Der Wilde Jäger ob dem Neuenburgersee
  • Corpus

5
Corpus Examples
  • CZ94: ... fiel schier vom Stuhl ob der Äusserung
    eines Ozeanologen ...
  • CZ94: Bei manchem Ölgiganten kam ob der
    Ergebnisse gar Euphorie auf.
  • CZ94: ... rieben sich vergnügt die Hände ob des
    zu erwartenden Schlagabtauschs.
  • 'ob' is a preposition that takes the genitive!
  • In the CZ corpus, 'ob' is tagged as a preposition 21
    times (some of these tags are obviously incorrect)
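
A count like the one above can be reproduced with a few lines of code. The following is only a minimal sketch: the "word/TAG" format, the STTS-style tags (APPR for preposition, KOUS for subordinating conjunction) and the two sentences are assumptions made for illustration, not the actual format or content of the CZ corpus.

```python
from collections import Counter

# Hypothetical miniature corpus in "word/TAG" format. The STTS-style tags
# (APPR = preposition, KOUS = subordinating conjunction) and the sentences
# are illustrative assumptions, not actual CZ corpus data.
TAGGED = (
    "Er/PPER fiel/VVFIN vom/APPRART Stuhl/NN ob/APPR der/ART Äusserung/NN ./$. "
    "Sie/PPER fragt/VVFIN ,/$, ob/KOUS er/PPER kommt/VVFIN ./$."
)

def tag_counts(tagged_text, word):
    """Count the tags assigned to `word` (case-insensitive)."""
    counts = Counter()
    for token in tagged_text.split():
        # Split at the last slash so that the word form itself may contain "/".
        form, _, tag = token.rpartition("/")
        if form.lower() == word.lower():
            counts[tag] += 1
    return counts

counts = tag_counts(TAGGED, "ob")
print(counts)
```

Listing the hits tagged APPR and reading them in context is then the manual step that reveals which of the tags are incorrect.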

6
History of Corpus Linguistics
  • collections of text were widely used in the 19th
    century and in the first half of the 20th century
  • language acquisition
  • orthography (letter frequency)
  • field linguistics
  • → American Structuralism (influential until 1960)

7
History of Corpus Linguistics
  • Chomsky's criticism: speakers produce and
    understand infinitely many new sentences/words.
  • Therefore the new research goal is to describe
    the underlying language faculty of a speaker
    (universal grammar): competence rather than
    performance.

8
History of Corpus Linguistics
  • Chomsky's criticism: every collection of texts is
    a collection of performance data, and so many
    factors contribute to it that it cannot be used
    to model competence.
  • A corpus is necessarily skewed. Some sentences
    won't occur because they are obvious, false or
    impolite.

9
History of Corpus Linguistics
  • theoretical linguistics
  • competence (what is grammatical?)
  • introspection
  • indefinitely many types, productivity
  • grammatical vs. ungrammatical
  • corpus linguistics
  • performance (what is attested?)
  • instances
  • finite number of types
  • degrees of grammaticality
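
The distinction between instances and types in a corpus can be made concrete with a short count. This is a rough sketch only: the sample text is taken from the quotation on the "The goal" slide, and the regular-expression tokenisation is a simplifying assumption that a real corpus study would replace with a proper tokeniser.

```python
import re
from collections import Counter

# Sample text only; any corpus text could be substituted here.
text = "If you can walk, you can dance. If you can talk, you can sing."

# Very rough tokenisation: lowercase alphabetic strings (an assumption).
tokens = re.findall(r"[a-zäöüß]+", text.lower())
types = Counter(tokens)

print(len(tokens), "instances (tokens)")
print(len(types), "types")
print(types.most_common(2))
```

However large the corpus, the number of types it contains stays finite, which is the point of the contrast with the productivity assumed in theoretical linguistics.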

10
Corpus research in Linguistics
  • Lexicography (Dictionaries)
  • Grammaticography (Reference grammars)
  • Learner corpora: language acquisition
  • Parallel corpora: translation

11
Construction of Corpora
  • Written text is easier to obtain than spoken
    text. Some examples:
  • Newspapers
  • Fiction (e.g. fairy tales)
  • Technical literature (e.g. manuals, medicine)
  • Personal letters, email
  • Advertising (incl. political propaganda)
  • Belief and thought (e.g. the Bible)

12
Corpora of spoken language
  • Spontaneous spoken language
  • recording of dialogues (e.g. telephone
    conversation)
  • Prepared spoken language
  • Public speeches (e.g. in parliament)
  • Radio or TV news
  • Spoken utterances must be transcribed for
    linguistic research.

13
Size of corpora
  • Brown Corpus for English (1964, 1 million words)
  • LIMAS Corpus for German (1970, 1 million words)
  • British National Corpus (1995, 100 million words)
  • Cosmas corpus (2002, > 100 million words)

14
Brown Corpus (1964)
  • 500 texts
  • out of 15 different text types
  • with 2000 words each

15
British National Corpus
  • 90% written English, 10% spoken English
  • 3209 texts
  • out of 10 different written text types and
    6 spoken text types
  • with < 40,000 words each
  • → multi-purpose corpus

16
Other considerations
  • Time frame of the corpus
  • Native and non-native speakers
  • Sociolinguistic variables
  • Gender
  • Age
  • Education
  • Dialect
  • Social context and relationships

17
Types of corpora
  • Raw texts
  • Automatically annotated corpora
  • Texts with Part-of-Speech tags
  • Partially parsed texts
  • Manually annotated corpora
  • Treebank
  • FrameNet

18
Types of Corpora
  • Balanced Corpora vs. special corpora
  • Spoken vs. written language
  • Monolingual vs. Multilingual Corpora
  • Parallel vs. comparable corpora

19
Corpora in Computational Linguistics
  • [Diagram with the labels: Corpora; annotation; learning; Facts, Rules, Preferences]

20
My Motivation for Corpus Linguistics
  • Attempt to build a parser for German
  • But: problems with ambiguities!
  • Therefore: learn attachment preferences from a
    corpus!
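
What learning attachment preferences from a corpus might look like can be sketched with simple counts. This is not the method used in the parser project mentioned here; it is a minimal illustration under stated assumptions: the training tuples are invented, attachment is reduced to a binary choice between verb ("V") and noun ("N"), and raw counts are compared without any smoothing.

```python
from collections import Counter

# Hypothetical training observations: (verb, noun, preposition, attachment),
# where attachment is "V" (PP modifies the verb) or "N" (PP modifies the noun).
# The tuples are invented for illustration; in practice they would come from
# a treebank or from unambiguous corpus examples.
TRAINING = [
    ("kaufen", "Computer", "mit", "N"),
    ("kaufen", "Software", "für", "N"),
    ("bezahlen", "Rechnung", "mit", "V"),
    ("bezahlen", "Computer", "mit", "V"),
]

verb_prep = Counter()  # how often a PP with this preposition attached to the verb
noun_prep = Counter()  # how often a PP with this preposition attached to the noun

for verb, noun, prep, attachment in TRAINING:
    if attachment == "V":
        verb_prep[(verb, prep)] += 1
    else:
        noun_prep[(noun, prep)] += 1

def prefer(verb, noun, prep):
    """Return "V" or "N", whichever attachment was observed more often.

    Ties (including unseen pairs) fall back to noun attachment; this default
    is an arbitrary choice made for the sketch.
    """
    v = verb_prep[(verb, prep)]
    n = noun_prep[(noun, prep)]
    return "V" if v > n else "N"

print(prefer("bezahlen", "Computer", "mit"))
print(prefer("kaufen", "Computer", "mit"))
```

A parser could consult such a function whenever a prepositional phrase may attach either to the verb or to the preceding noun.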

21
Corpora vs. Test suites
  • A test suite
  • is a collection of manually constructed and
    selected sentences.
  • is used for testing computational grammars and
    parsers.
  • reduces the amount of testing.
  • points to specific problems of the NLP system.

22
Basic problems in CL
  • Knowledge is missing (too little information)
  • e.g. unknown words
  • Ambiguities (too much information)
  • e.g. in syntax: attachment preferences

23
Corpora in Computational Linguistics
  • Widespread use of (manually) annotated material
    for measuring progress!
  • Some examples from COLING 2002
  • Treebanks to train and test probabilistic
    grammars
  • Enriching treebanks with dependency information
  • Automatic error detection in PoS-Tagged Corpora
  • SENSEVAL data to train and test word sense
    disambiguation programs

24
Possible Student Tasks
  1. Which German prepositions take a noun without a
    determiner? (e.g. pro, via)
  2. When is 'mit' used as an adverb? (e.g. )
  3. What is the distribution of separable verb
    prefixes in German?
  4. How often are relative clauses introduced with
    'welche(r)'?
  5. How often are present participle forms used in
    German?
  6. What kind of foreign language material is in the
    corpus?
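
As a starting point for a task such as number 4, a raw-text search can give a first estimate. The sketch below is a heuristic under loud assumptions: the sentences are invented, and a relative clause is approximated as a form of 'welch-' or 'der/die/das' directly after a comma, which also matches indirect questions and misses relative pronouns preceded by a preposition.

```python
import re

# Invented example sentences; a real study would read corpus text instead.
SENTENCES = [
    "Die Firma, welche den Rechner baut, wächst.",
    "Der Rechner, der hier steht, ist neu.",
    "Das Programm, welches wir testen, ist langsam.",
]

# Heuristic patterns: a candidate relative pronoun directly after a comma.
WELCH = re.compile(r",\s+welche[rsmn]?\b", re.IGNORECASE)
DER = re.compile(r",\s+(?:der|die|das|dem|den|dessen|deren|denen)\b", re.IGNORECASE)

welch_hits = sum(len(WELCH.findall(s)) for s in SENTENCES)
der_hits = sum(len(DER.findall(s)) for s in SENTENCES)

print("welche(r):", welch_hits)
print("der/die/das:", der_hits)
```

On a PoS-tagged corpus the same question could be answered more reliably by counting relative-pronoun tags, with the regular-expression count serving as a sanity check.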

25
Possible Student Tasks
  1. Create a small parallel corpus (e.g. with various
    versions of 'Alice in Wonderland' or National
    Geographic)
  2. Create a small corpus of spoken language (e.g. by
    transcribing one episode of 'Big Brother').
  3. Create a small treebank with the ANNOTATE tool.

26
What corpora do we have for German?
  • Raw text
  • ComputerZeitung 1993-97 (about 1.3 million words
    per year)
  • ComputerZeitung iX
  • Tages-Anzeiger 2000

27
Information in the Tages-Anzeiger corpus
  • Date
  • Category (Sport, Politics, Culture, Economics
    etc.)
  • Author
  • Title vs. Text

28
What corpora do we have for German?
  • Syntactically Annotated Text (Treebanks)
  • NEGRA treebank (20,000 sentences)
  • ComputerZeitung treebank (3,000 sentences)
  • Text with manually corrected PoS tags
  • 50,000 sentences from University speeches
  • others

29
The goal
  • If you can walk, you can dance.
  • If you can talk, you can sing.
  • If you can parse, you can understand.
  • (Hans Uszkoreit, COLING 2002)

30
Acknowledgement
  • Some slides were highly influenced by or even
    copied from Anke Lüdeling's course "Introduction
    to Corpus Linguistics" at
    http://www.cl-ki.uni-osnabrueck.de/~aluedeli/Corpuslinguistik.html