CSA3180: Natural Language Processing

Provided by: michael307

1
CSA3180: Natural Language Processing
  • Text Processing 1
  • Language Encoding Issues
  • Common Corpora
  • Handling Large Document Collections
  • Applications: Anatomy of a Search Engine
  • NLTK

2
Language Encoding Issues
  • Different encoding methods
  • Different languages
  • Unicode Standard
  • Further information
  • Unicode Consortium
  • Jukka Korpela Tutorial
  • http://www.cs.tut.fi/~jkorpela/chars.html

3
Language Encoding Issues
  • Character Repertoire: set of distinct characters
  • Character Code: mapping between characters and
    non-negative integers
  • Character Encoding: algorithm for representing
    characters using a particular code
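
The three terms above can be told apart with a few lines of Python. This is only a minimal sketch: the sample word and the helper name describe are illustrative choices, not part of the slides, and Unicode is assumed as the character code.

```python
import unicodedata

def describe(ch):
    """Return the Unicode name, code number and UTF-8 octets of one character.

    describe is a hypothetical helper written for this illustration.
    """
    return {
        "name": unicodedata.name(ch),   # identifies the abstract character
        "code": ord(ch),                # character code: character -> integer
        "octets": ch.encode("utf-8"),   # character encoding: integer -> octets
    }

# Maltese word spelled g, h-with-stroke, a, z-with-dot, z-with-dot, i, n.
word = "g\u0127a\u017c\u017cin"

# Character repertoire: the set of distinct characters that occur in the word.
repertoire = sorted(set(word))
print(len(word), "characters,", len(repertoire), "distinct")

for ch in repertoire:
    print(describe(ch))
```

The same character keeps the same code number whichever encoding is chosen; only the octets change.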

4
Language Encoding Issues
  • Encoding using octets
  • Common Encodings
  • ASCII
  • ISO Latin I (ISO 8859-1)
  • ISO Latin II and III Extensions (for Maltese)
  • Unicode UTF-8
  • ANSI
  • Cyrillic and Chinese Encodings
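
To see why the choice of encoding matters, the sketch below tries to turn one short Maltese word into octets under several of the encodings listed above. The helper try_encode and the sample word are assumptions made for this illustration; the codec names are the ones Python's standard library uses.

```python
def try_encode(text, encoding):
    """Return the octets for text in the given encoding.

    Returns None when the encoding has no code for some character.
    try_encode is a hypothetical helper, not a library function.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# Maltese word beginning with h-with-stroke (U+0127).
word = "\u0127obb"

for encoding in ("ascii", "iso-8859-1", "iso-8859-3", "utf-8"):
    octets = try_encode(word, encoding)
    if octets is None:
        print(encoding, "cannot represent this word")
    else:
        print(encoding, "uses", len(octets), "octets:", octets)
```

ASCII and ISO Latin I have no code for the first letter, ISO Latin III stores every character in one octet, and UTF-8 needs two octets for the non-ASCII letter.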

5
Language Encoding Issues
  • Text encoding on the Web
  • MIME Standard
  • Content-Type: text/html; charset=iso-8859-1
  • Used in Email and Web Servers
  • Problems in implementation: few encodings
    properly supported
  • UTF-8 recommended
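
A receiver can use the charset parameter of the Content-Type header to decide how to decode the octets it was sent. The following is a minimal sketch using Python's standard email package to parse the header. The function name decode_body is hypothetical, and falling back to UTF-8 when no charset is declared is an assumption made here, not a rule stated in the slides.

```python
from email.message import Message

def decode_body(content_type_header, body, default="utf-8"):
    """Decode body octets using the charset named in a Content-Type header.

    decode_body is a hypothetical helper; the UTF-8 fallback is an assumption.
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_charset() returns None when the header has no charset.
    charset = msg.get_content_charset() or default
    return body.decode(charset)

# The octet 0xE9 is e-with-acute in ISO 8859-1.
print(decode_body("text/html; charset=iso-8859-1", b"caf\xe9"))
```

If the declared charset does not match the octets actually sent, decoding either fails or silently produces the wrong characters, which is one of the implementation problems the slide refers to.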

6
Common Corpora
  • WordNet
  • TREC/ACE/TIDES Corpora
  • Linguistic Data Consortium (LDC)
  • GigaWord (News)
  • Tree Banks
  • MUC (Message Understanding Conference)
  • TIPSTER (Information Retrieval)

7
Handling Large Document Collections
  • Special issues involved in processing
  • Hierarchical directory structures
  • File indexes
  • Batch processing: start, resume, pause, end
  • Job scheduling
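
One simple way to make a batch job over a directory tree pausable and resumable is to record which files have already been processed. The sketch below is an illustration under stated assumptions, not the method used in the course: the names list_documents and run_batch, the JSON checkpoint file, and the max_files parameter standing in for "pause" are all hypothetical.

```python
import json
import os

def list_documents(root):
    """Walk a hierarchical directory tree and return file paths in a stable order."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so every run sees the same sequence
        for name in sorted(filenames):
            paths.append(os.path.join(dirpath, name))
    return paths

def run_batch(root, process, checkpoint_path, max_files=None):
    """Process every file under root once, skipping files done in earlier runs.

    The checkpoint file should live outside root, otherwise it would be
    picked up as a document. Returns the number of files handled in this run.
    """
    done = set()
    if os.path.exists(checkpoint_path):  # resume: reload earlier progress
        with open(checkpoint_path, encoding="utf-8") as f:
            done = set(json.load(f))

    handled = 0
    for path in list_documents(root):
        if path in done:
            continue
        if max_files is not None and handled >= max_files:
            break  # pause: stop early, the checkpoint already holds our progress
        process(path)
        done.add(path)
        handled += 1
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump(sorted(done), f)  # record progress after each file
    return handled
```

Calling run_batch again with the same checkpoint path continues where the previous run stopped. Rewriting the whole checkpoint after every file is easy to follow but slow for very large collections, where an append-only log or a database would be a more realistic choice.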

8
Applications
  • Anatomy of a Search Engine (Larry Page and Sergey
    Brin)
  • Describes the internals of Google
  • NLP in everyday life!

9
Next Sessions
  • Natural Language Toolkit (NLTK)
  • http://nltk.sourceforge.net/
  • Please download and install!