CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 9 (03/02/06) Prof. Pushpak Bhattacharyya IIT Bombay Dealing With Corpus - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
CS460/IT632 Natural Language Processing/Language Technology for the Web
Lecture 9 (03/02/06)
Prof. Pushpak Bhattacharyya, IIT Bombay
Dealing With Corpus
2
Sources of the Corpus
  • Mainly in the US and Europe
  • LDC: www.ldc.upenn.edu
  • ELRA: www.elra.info
  • Oxford Text Archive: www.ota.ahds.ac.uk
  • Brown Corpus (1961): American English (1 million
    words)
  • British National Corpus (BNC): British English
    (100 million words)

3
Dealing with Corpus - Challenges
  • Word recognition (Tokenization)
  • Sentence recognition

4
Problems in Tokenization
  • Uppercase/lowercase
  • Uppercase may mark a proper noun, a sentence
    beginning, emphasis, or a title
  • I told you to SUBMIT the report. (emphasizer)
  • Proper nouns: named entity detection
  • Dates: they have no standard format
  • 2/8/2006
  • 8 February 2006
  • 8-Feb-06
  • February 8, 2006, and many more
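
The date variants on this slide can be mapped to one normalized value by trying a list of known formats in turn. The following is only a minimal sketch: the function name `parse_date` and the format list are illustrative assumptions, and reading 2/8/2006 as month/day is also an assumption (the slide does not say which convention is meant).

```python
from datetime import date, datetime

# Candidate formats, tried in order. Reading "2/8/2006" as month/day
# is an assumption; a day/month corpus would need "%d/%m/%Y" instead.
DATE_FORMATS = ["%m/%d/%Y", "%d %B %Y", "%d-%b-%y", "%B %d, %Y"]

def parse_date(text):
    """Return a date for the first matching format, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # this format did not match, try the next one
    return None

if __name__ == "__main__":
    for s in ["2/8/2006", "8 February 2006", "8-Feb-06", "February 8, 2006"]:
        print(s, "->", parse_date(s))
```

A fixed format list only covers the variants it names; the "many more" on the slide is exactly why date recognition stays a tokenization problem.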

5
Problems in Tokenization (Contd.1)
  • Phone numbers: no standard format
  • 25767718
  • 2576 7718
  • 22-25767718
  • 022-25767718
  • 022.25767718
  • 91-22-25767718
  • 01711380647 (UK format)
  • (44-171)8301007 (UK format)
  • 45 43 48606 (Denmark format)
  • (94-1)866854 (Sri Lanka format)
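
One simple way to compare the phone number variants above is to strip every non-digit character before matching. This is a rough sketch under that assumption; the helper name `normalize_phone` is hypothetical, and the approach does not resolve missing area or country codes (so 25767718 and 022-25767718 still differ).

```python
import re

def normalize_phone(text):
    """Keep only the digits, dropping spaces, dots, hyphens and brackets."""
    return re.sub(r"\D", "", text)

if __name__ == "__main__":
    for s in ["2576 7718", "022-25767718", "022.25767718", "(44-171)8301007"]:
        print(s, "->", normalize_phone(s))
```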

6
Problems in Tokenization (Contd.2)
  • Periods (full stops): a period can act as a sentence
    delimiter or mark an abbreviation. Some examples
    are given below.
  • U.N.O. stopped aid to Afghanistan.
  • Ex.
  • Apostrophe: genitive or shortening device
  • Ram's (genitive) brother isn't (shortening) well
    today.
  • Haplology: multiple roles played by a
    punctuation mark.
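
The double role of the period can be illustrated with a sentence splitter that consults an abbreviation list before treating a period as a delimiter. This is a minimal sketch, not the method used in the lecture; the abbreviation list and the name `split_sentences` are assumptions for illustration.

```python
# Hypothetical abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"U.N.O.", "Ex.", "Prof.", "Dr."}

def split_sentences(text):
    """Split on tokens ending in . ? or ! unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without a final delimiter
        sentences.append(" ".join(current))
    return sentences

if __name__ == "__main__":
    print(split_sentences("U.N.O. stopped aid to Afghanistan. Prof. Rao agreed."))
```

The sketch fails when an abbreviation also ends a sentence, where one period plays both roles at once; that is the haplology case named on the slide.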

7
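Precision and Recall (False Hit and False Miss)
$\mathrm{Precision} = \dfrac{\mathrm{size}(\mathrm{Actual} \cap \mathrm{Hypothesis})}{\mathrm{size}(\mathrm{Hypothesis})}$
$\mathrm{Recall} = \dfrac{\mathrm{size}(\mathrm{Actual} \cap \mathrm{Hypothesis})}{\mathrm{size}(\mathrm{Actual})}$
[Figure: overlapping Actual Set and Hypothesis Set. The part of the Actual Set outside the overlap is the False Miss region; the part of the Hypothesis Set outside the overlap is the False Hit region.]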
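
The two ratios can be computed directly with set operations. The sketch below assumes the actual and hypothesized items (for example, token boundaries) are given as Python sets; the function name `precision_recall` is illustrative.

```python
def precision_recall(actual, hypothesis):
    """Return (precision, recall, false_hits, false_misses) for two sets."""
    actual, hypothesis = set(actual), set(hypothesis)
    correct = actual & hypothesis            # Actual intersect Hypothesis
    precision = len(correct) / len(hypothesis) if hypothesis else 0.0
    recall = len(correct) / len(actual) if actual else 0.0
    false_hits = hypothesis - actual         # proposed but not actual
    false_misses = actual - hypothesis       # actual but not proposed
    return precision, recall, false_hits, false_misses

if __name__ == "__main__":
    print(precision_recall({1, 2, 3, 4}, {3, 4, 5}))
```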
8
Further Challenges
  • Hyphen: can occur in a compound word or as a
    word continuer at a line break.
  • Play-mate
  • I would like to watch the game play-
    fully.
  • Conjoining and compounding (sandhii and samaas)
  • Vidhyaalaya = vidhyaa + aalay (sandhii)
  • Raajaputra = raajaa + putra (samaas)
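
To tell a word-continuer hyphen from a compound hyphen, one option is to join the two halves and check the result against a word list. The following is a sketch under that assumption; `join_hyphenated` and the tiny lexicon are hypothetical.

```python
def join_hyphenated(lines, lexicon):
    """Rejoin words split across lines; drop the hyphen only if the
    joined form (without the hyphen) is a known word."""
    out, carry = [], ""
    for line in lines:
        words = line.split()
        if carry and words:
            first = words.pop(0)
            joined = carry[:-1] + first              # candidate without hyphen
            core = joined.rstrip(".,;:!?").lower()   # ignore trailing punctuation
            out.append(joined if core in lexicon else carry + first)
            carry = ""
        if words and words[-1].endswith("-"):
            carry = words.pop()                      # hold until the next line
        out.extend(words)
    if carry:
        out.append(carry)
    return " ".join(out)

if __name__ == "__main__":
    text = ["I would like to watch the game play-", "fully."]
    print(join_hyphenated(text, {"playfully"}))
```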

9
Multiword Recognition
  • A multiword is a single token composed of words
    separated by blanks.
  • United Nations Organization (proper noun)
  • Golf club, cricket bat (common nouns)
  • What kind of relationship holds between the words
    of a multiword or compound?
  • Raajaarshi = raajaa + rishi
  • Meaning: a king (raajaa) who also has the qualities
    of a saint (rishi)
  • Raajaputra = raajaa + putra
  • Meaning: son (putra) of the king (raajaa)
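
Multiwords such as those above can be merged into single tokens with a greedy longest-match lookup against a multiword list. This is only an illustrative sketch; the list contents, the `max_len` limit, and the name `merge_multiwords` are assumptions.

```python
def merge_multiwords(tokens, multiwords, max_len=3):
    """Greedily merge the longest known multiword starting at each position."""
    result, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in multiwords:
                result.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:  # no multiword starts here, keep the single token
            result.append(tokens[i])
            i += 1
    return result

if __name__ == "__main__":
    known = {("united", "nations", "organization"), ("golf", "club"), ("cricket", "bat")}
    print(merge_multiwords("He joined the golf club".split(), known))
```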

10
Multiword (Contd.)
  • There are NOUN + VERB combinations:
  • sit down
  • jamhaayii lenaa (to yawn)
  • gir paRnaa (to fall down)
  • A typical multiword in German:
  • Donau·dampf·schiff·fahrts·gesellschafts·kapitäns·
    mützen·fabrikant
  • Danube·steam·ship·voyage·company·captain·cap·
    producer (gloss in English)
  • NOTE: The character · is inserted to show the
    different constituents of the compound word.
    Otherwise the word is written without any
    separator in between.
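
Since such compounds are written without separators, recovering the constituents needs a lexicon lookup. The sketch below splits the English gloss with a backtracking search that prefers the longest constituent first; the lexicon and the name `split_compound` are assumptions, and a real German splitter would also have to handle linking elements such as the "s" in "fahrts".

```python
from functools import lru_cache

def split_compound(word, lexicon):
    """Return a list of lexicon entries that concatenate to `word`, or None."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(word):
            return ()
        # Try the longest candidate first, backtracking if the rest fails.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in lexicon:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None

if __name__ == "__main__":
    lex = {"danube", "steam", "ship", "voyage", "company", "captain", "cap", "producer"}
    print(split_compound("Danubesteamshipvoyagecompanycaptaincapproducer", lex))
```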

11
Techniques for Parsing
  • There are different techniques for parsing:
  • Top Down Parsing
  • Bottom Up (Bottom Up Chart Parsing)
  • Top Down combined with Bottom Up (Top Down Chart
    Parsing)
  • Parsing can be deterministic or probabilistic
  • It can be based on phrase structure grammar or
    dependency grammar.
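
As an illustration of the top-down approach only, here is a small backtracking recognizer for a toy phrase structure grammar. The grammar, lexicon, and function names are made up for this sketch and are not from the lecture; it is deterministic, has no chart, and would loop on a left-recursive grammar.

```python
# Toy grammar and lexicon (hypothetical, for illustration only).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "chased": "V", "sleeps": "V"}

def parse(symbol, tokens, pos):
    """Yield every position reachable after deriving `symbol` from tokens[pos:]."""
    if symbol not in GRAMMAR:  # preterminal: match one word by its category
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:  # expand top-down, trying each rule
        positions = [pos]
        for child in production:
            positions = [q for p in positions for q in parse(child, tokens, p)]
        yield from positions

def recognizes(sentence):
    """True if the whole sentence can be derived from S."""
    tokens = sentence.lower().split()
    return len(tokens) in parse("S", tokens, 0)

if __name__ == "__main__":
    print(recognizes("The dog chased the cat"))
```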