
Corpus-Based Work
  • Chapter 4
  • Foundations of Statistical Natural Language
    Processing

  • Requirements of NLP work
  • Computers
  • Corpora
  • Application/Software
  • This section covers some issues concerning the
    formats and problems encountered in dealing with
    raw data
  • Low-level processing before actual work
  • Word/Sentence extraction

Getting Set Up
  • Computers
  • Memory requirements for large corpora
  • Statistical NLP methods involve counts that must
    be accessed quickly
  • Corpora
  • A corpus is a special collection of textual
    material collected according to a certain set of
    criteria
  • Licensing
  • Most of the time free sources are not
    linguistically marked-up

  • Corpora
  • Representative sample
  • What we find for the sample also holds for the
    general population
  • Balanced corpus
  • Each subtype of text matching predetermined
    criterion of importance
  • Importance in statistical NLP
  • Representative corpus
  • When reporting results, the type/domain of the
    corpus should be stated

  • Software
  • Text editors
  • TextPad, Emacs, BBedit
  • Regular expressions
  • Patterns as regular language
  • Programming languages
  • C/C++ widely used (efficient)
  • Perl for text preparation and formatting
  • Built-in database and easy handling of
    complicated structures make Prolog important
  • Java, as a pure object-oriented language,
    provides automatic memory management

Looking at Text
  • Either in raw format or marked-up
  • Markup puts codes into the data file that give
    information about the text
  • Issues in automatic processing
  • Junk formatting/content (Corpus Cleaning)
  • Case sensitivity (lowercase everything?)
  • Proper Nouns?
  • Stress through capitalization
  • Loss of contextual information

  • Tokenization
  • Text is divided into units called tokens
  • Treatment of punctuation marks?
  • What is a word?
  • Graphic word (Kučera and Francis 1967)
  • A string of contiguous alphanumeric characters
    with white space on either side.
  • This is not a practical definition, even for
    English text
  • Especially in a news corpus, odd entries such as
    Micro$oft or C|net can be present
  • Apart from these oddities there are some other
    issues
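The graphic-word definition can be tried directly with a regular expression. Below is a minimal sketch in Python (the pattern is one illustrative reading of "contiguous alphanumeric characters", not the original authors' code), showing how it mangles the odd entries mentioned above:

```python
import re

def graphic_words(text):
    # One reading of the graphic-word definition: maximal runs of
    # alphanumeric characters, everything else treated as a separator.
    return re.findall(r"[A-Za-z0-9]+", text)

tokens = graphic_words("Send e-mail to C|net; it costs $2.")
print(tokens)
# ['Send', 'e', 'mail', 'to', 'C', 'net', 'it', 'costs', '2']
```

Note how e-mail, C|net, and $2 each fall apart into fragments, which is exactly why the definition is not practical on real text.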

  • Periods
  • Words are not always bounded by white space;
    punctuation such as commas, semicolons and periods
    attaches to them
  • Periods appear at the end of sentences and also
    at the end of abbreviations
  • In abbreviations they should stay attached to the
    word (Wash. vs. wash)
  • When an abbreviation occurs at the end of a
    sentence, only one period is present, performing
    both roles
  • Within morphology, this phenomenon is referred to
    as haplology

  • Single Apostrophes
  • Difficulties in dealing with constructions such
    as I'll or isn't
  • The graphic-word count is 1 according to the
    basic definition, but they should be counted as 2
    words
  • 1. S → NP VP
  • 2. If we split, then some funny words such as 'll
    or n't occur in the data
  • Closing single quotation marks look the same as
    apostrophes
  • Possessive forms of words ending with s or z
  • Charles' law, Muaz' book
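Splitting clitics off a graphic word can be done with a small pattern. A minimal sketch follows (the list of clitic endings is illustrative, not complete):

```python
import re

# Illustrative list of English clitics; a real tokenizer would use a fuller one.
CLITIC = re.compile(r"^(\w+)('ll|'re|'ve|'d|'m|'s|n't)$")

def split_clitics(token):
    # "I'll" -> ["I", "'ll"], so one graphic word counts as two words.
    m = CLITIC.match(token)
    return [m.group(1), m.group(2)] if m else [token]

print(split_clitics("I'll"))   # ['I', "'ll"]
print(split_clitics("isn't"))  # ['is', "n't"]
```

This still cannot tell a possessive 's from a contracted "is", nor an apostrophe from a closing single quote; those ambiguities need context.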

  • Hyphenation
  • Does a sequence of letters with a hyphen
    in-between count as one word or two?
  • Line ending hyphens
  • Remove hyphen at the end of line and join both
    parts together
  • If the hyphen at the end of a line is a true
    lexical hyphen (as in text-based), removing it is
    wrong (a haplology-like case)
  • In most electronic text line-breaking hyphens are
    not present, but other hyphen issues remain

  • Some things with hyphens are clearly treated as
    one word
  • E-mail, A-1-plus and co-operate
  • Other cases are arguable
  • Non-lawyer, pro-Arab and so-called
  • The hyphens here are called lexical hyphens
  • Inserted before or after small word formatives to
    split vowel sequence in some cases
  • Third class of hyphens is inserted to indicate
    correct grouping
  • A text-based medium
  • A final take-it-or-leave-it offer
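One rough way to handle line-ending hyphens is to rejoin the halves and keep the hyphen only when the hyphenated form is independently attested. The sketch below is our illustrative heuristic under that assumption, not a method from the chapter:

```python
import re

def join_line_hyphens(text, attested):
    # Rejoin word halves split across a line break; keep the hyphen only
    # if the hyphenated form appears in a set of attested words.
    def repair(match):
        first, second = match.group(1), match.group(2)
        hyphenated = first + "-" + second
        return hyphenated if hyphenated in attested else first + second
    return re.sub(r"(\w+)-\n(\w+)", repair, text)

attested = {"text-based"}  # illustrative vocabulary
print(join_line_hyphens("a text-\nbased medium; to co-\noperate", attested))
# a text-based medium; to cooperate
```

The vocabulary lookup decides between a lexical hyphen (text-based) and a pure line-break hyphen (co-operate joined to cooperate); in practice the attested set would come from the corpus itself.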

  • Inconsistencies in hyphenation
  • cooperate vs. co-operate
  • So we can have multiple forms treated as either
    one word or two
  • Lexemes
  • Single dictionary entry with single meaning
  • Homographs
  • Two lexemes with overlapping (identical) written
    forms
  • e.g. saw (noun) vs. saw (past tense of see)

  • Word segmentation in other languages
  • Opposite issue
  • White spaces but not word boundary
  • the New York-New Haven railroad
  • I couldn't work the answer out
  • In spite of, in order to, because of
  • Variant coding of information of certain semantic
    types
  • Phone numbers: 42-111-128-128
  • Problem in information extraction

  • Speech Corpora Issues
  • More contractions
  • Various phonetic representations
  • Pronunciation variants
  • Sentence fragments
  • Filler words
  • Morphology
  • Keep various forms separately or collapse them?
    e.g. sit, sits, sat
  • Grouping them together and working with lexemes
    (Initially looks easier)

  • Stemming
  • Strips off affixes
  • Lemmatization
  • To extract the lemma or lexeme from an inflected
    form
  • Empirical research within IR shows that stemming
    does not help performance
  • Information loss (operating → operate)
  • Closely related tokens are grouped in chunks,
    which are more useful
  • Not good for morphologically rich languages
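A crude suffix-stripping stemmer makes the trade-offs concrete. This is nothing like a full algorithm such as Porter's, just an illustration of affix stripping and its limits:

```python
def crude_stem(word, suffixes=("ing", "ed", "es", "s")):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("sits"))       # 'sit'
print(crude_stem("operating"))  # 'operat' -- a stem, not a dictionary word
print(crude_stem("sat"))        # 'sat'  -- irregular forms are not collapsed
```

The last two lines show both problems from the slide: the stem need not be a real word (information loss), and irregular morphology (sit/sat) is not grouped at all.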

  • Sentences
  • What is a sentence?
  • In English, something ending with ., ? or !
  • Abbreviations issues
  • Other issues
  • "You reminded me," she remarked, "of your
    mother."
  • Nested things are classified as clauses
  • Quotation marks after punctuation
  • . is not sentence boundary in this case

  • Sentence boundary (SB) detection
  • Place tentative SB after all occurrences of .?!
  • Move the boundary after quotation mark (if any)
  • Disqualify a period boundary if it is
  • Preceded by a known abbreviation that is usually
    capitalized and not sentence-final, e.g. Prof., Dr.
  • Or preceded by an abbreviation like etc. or jr.
    and not followed by a capitalized word
  • Disqualify a boundary with ? or !
  • If followed by a lower-case letter
  • Regard all others as correct SBs
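The steps above can be sketched in code. This is a rough rendering: the abbreviation set is illustrative, and the two abbreviation rules are collapsed into a single lookup plus a capitalization check:

```python
import re

ABBREVS = {"Prof.", "Dr.", "etc.", "jr."}  # illustrative, not exhaustive

def split_sentences(text):
    """Rough sentence splitter following the heuristic steps above."""
    sentences, start = [], 0
    # Tentative boundary after . ? !, moved past a following quotation mark.
    for match in re.finditer(r'[.?!]["\']?', text):
        end = match.end()
        before = text[:end].split()[-1]          # word ending at the boundary
        after = text[end:].lstrip(" \"'")[:1]    # first letter after it
        if match.group(0).startswith("."):
            # Disqualify a period after a known abbreviation,
            # or when the next word is not capitalized.
            if before in ABBREVS or (after and not after.isupper()):
                continue
        elif after.islower():
            # Disqualify ? or ! followed by a lower-case letter.
            continue
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences('Dr. Smith arrived. "Sit down!" she said.'))
# ['Dr. Smith arrived.', '"Sit down!" she said.']
```

The example exercises all three rules: the period after Dr. is disqualified, the ! inside the quotation is disqualified because "she" follows in lower case, and the final period is kept.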

  • Riley (1989) used classification trees for SB
    detection
  • Features of trees included case and length of
    words preceding or following a period and
    probabilities of words to occur before and after
    a sentence boundary
  • It required large quantity of labeled data
  • Palmer and Hearst used the POS of such words and
    implemented this with neural networks (98-99%
    accuracy)
  • In other languages?

  • Marked-up Data
  • Some sort of code is used to provide information
    (mostly SGML, XML)
  • It can be done automatically, manually, or by a
    mixture of both (semi-automatic)
  • Some texts mark up just sentence and paragraph
    boundaries
  • Others mark up more than this basic information
  • e.g. Penn Treebank (full syntactic structure)
  • Common mark up is POS tagging
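Tagged corpora are often distributed as word/tag pairs, one token per slash. A minimal sketch of reading one line of such data (the tags follow the Penn Treebank convention; the helper itself is ours):

```python
def parse_tagged(line):
    # Split "word/TAG" tokens, using the last "/" so forms like "./."
    # or "1/2/CD" are handled correctly.
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

print(parse_tagged("The/DT dog/NN barks/VBZ ./."))
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('.', '.')]
```

Splitting on the last slash rather than the first matters because the word itself may contain slashes while the tag never does.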

  • Grammatical Tagging
  • Generally done with conventional POS tagging like
    Noun, Verbs etc.
  • Also some information regarding nature of the
    words like Plurality of nouns or Superlative
    forms of adjectives
  • Tag set
  • The most influential tag sets have been the ones
    used to tag the American Brown Corpus and the
    Lancaster-Oslo-Bergen corpus

  • Size of tag sets
  • Brown: 87 (179 total tags)
  • Penn: 45
  • CLAWS1: 132
  • Penn tag set is widely used in computational work
  • Tags are different in different tag sets
  • Larger tag sets obviously make more fine-grained
    distinctions
  • The level of detail depends on the domain of the
    corpora

  • The design of tag set
  • Grammatical class of word
  • Features to tell the behavior of the word
  • Part of Speech
  • Semantic grounds
  • Syntactic distributional grounds
  • Morphological grounds
  • Splitting tags into further categories gives more
    information but makes classification harder
  • There is no simple relationship between tag set
    size and tagger performance