Introduction to Natural Language Processing and Text Mining and The basic building blocks - PowerPoint PPT Presentation

Slides: 27
Provided by: DanJur1

Transcript and Presenter's Notes


1
Introduction to Natural Language Processing and
Text Mining and The basic building blocks
  • Sudeshna Sarkar
  • Professor
  • Computer Science Engineering Department
  • Indian Institute of Technology Kharagpur

2
What is speech and language processing?
  • Computational Linguistics deals with the modeling
    of natural language from a computational
    perspective.
  • Natural Language Processing
  • Process information contained in natural language
    text / speech
  • Getting computers to perform useful tasks
    involving human languages
  • Enabling human-machine communication
  • Improving human-human communication
  • Doing stuff with language objects
  • Can machines understand human language?
  • What does one mean by understand?
  • Understanding is the ultimate goal. However, one
    doesn't need to fully understand to be useful.

3
Natural Language Processing
  • What is it?
  • We're going to study what goes into getting
    computers to perform useful and interesting tasks
    involving human languages.
  • We will be secondarily concerned with the
    insights that such computational work gives us
    into human processing of language.

4
Importance of studying NLP
  • A hallmark of human intelligence.
  • Text is the largest repository of human knowledge
    and is growing quickly.
  • emails, news articles, web pages, scientific
    articles, insurance claims, customer complaint
    letters, transcripts of phone calls, technical
    documents, government documents, patent
    portfolios, court decisions, contracts,
  • Are we reading any faster than before?
  • How do we keep up?

5
Goals of NLP
  • Scientific Goal
  • Identify the computational machinery needed for
    an agent to exhibit various forms of linguistic
    behaviour
  • Engineering Goal
  • Design, implement, and test systems that process
    natural languages for practical applications

6
Computer Speech and Language Processing
  • Goals can be very ambitious
  • True text understanding
  • Good quality translation
  • Or goals can be practical
  • Web search engines
  • Question Answering
  • Machine Translation services on the Web
  • Speech synthesis
  • Voice recognition
  • Conversational Agents
  • Summarization
  • Natural language technology not yet perfected
  • But still good enough for several useful
    applications

7
Text Mining
  • Text mining
  • deriving high quality information from text.
  • Text mining usually involves
  • the process of structuring the input text
  • deriving patterns within the structured data
  • evaluation and interpretation of the output.
  • 'High quality' in text mining usually refers to
    some combination of relevance, novelty, and
    interestingness.
  • Typical text mining tasks include
  • text categorization, text clustering
  • concept/entity extraction
  • generation of taxonomies
  • sentiment analysis
  • document summarization
  • entity relation modeling
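
The three steps listed on this slide (structuring the input text, deriving patterns, evaluating the output) can be illustrated with a minimal sketch. It is not taken from the original slides: the function names, the tiny stop-word list, and the two sample documents are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def structure(text):
    """Step 1: turn raw text into a structured form (a bag of lowercase words)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def derive_patterns(bags, top_n=3):
    """Step 2: derive a simple pattern, here the most frequent terms overall."""
    total = Counter()
    for bag in bags:
        total.update(bag)
    return total.most_common(top_n)

# Made-up sample documents (assumption, not from the slides).
docs = [
    "The camera is great and the battery is great.",
    "The battery of the camera drains quickly.",
]
bags = [structure(d) for d in docs]
patterns = derive_patterns(bags)

# Step 3 (evaluation and interpretation) is left to a human reader here.
print(patterns)
```

Term counting is only one possible "pattern"; real text mining systems typically use richer representations, but the structure, derive, evaluate shape stays the same.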

8
Big Applications
  • These kinds of applications require a tremendous
    amount of knowledge of language.
  • Consider the following interaction with HAL, the
    computer from 2001: A Space Odyssey
  • HAL
  • Dave: Open the pod bay doors, Hal.
  • HAL: I'm sorry Dave, I'm afraid I can't do that.

9
What's needed?
  • Speech recognition and synthesis
  • Knowledge of the English words involved
  • What they mean
  • How they combine (bay vs. pod bay)
  • How groups of words clump
  • What the clumps mean
  • Dialog
  • It is polite to respond, even if you're planning
    to kill someone.
  • It is polite to pretend to want to be cooperative
    (I'm afraid, I can't)

10
Real Example
  • What is the Fed's current position on interest
    rates?
  • What or who is the Fed?
  • What does it mean for it to have a position?
  • How does "current" modify that?

11
Caveat
  • NLP has an AI aspect to it.
  • We're often dealing with ill-defined problems
  • We don't often come up with perfect
    solutions/algorithms
  • We can't let either of those facts get in our way

12
Preparation
  • Basic algorithm and data structure analysis
  • Ability to program
  • Some exposure to logic
  • Exposure to basic concepts in probability
  • Interest in Language

13
Commercial World
  • Lots of exciting stuff going on
  • Some samples
  • Machine translation
  • Question answering
  • Buzz analysis

14
Google/Arabic
15
Web Q/A
16
Summarization
  • Current web-based Q/A is limited to returning
    simple fact-like (factoid) answers (names, dates,
    places, etc).
  • Multi-document summarization can be used to
    address more complex kinds of questions.
  • Circa 2002
  • What's going on with the Hubble?

17
NewsBlaster Example
  • The U.S. orbiter Columbia has touched down at the
    Kennedy Space Center after an 11-day mission to
    upgrade the Hubble observatory. The astronauts on
    Columbia gave the space telescope new solar
    wings, a better central power unit and the most
    advanced optical camera. The astronauts added an
    experimental refrigeration system that will
    revive a disabled infrared camera. ''Unbelievable
    that we got everything we set out to do
    accomplished,'' shuttle commander Scott Altman
    said. Hubble is scheduled for one more servicing
    mission in 2004.

18
Weblog Analytics
  • Text mining weblogs, discussion forums, user
    groups, and other forms of user-generated media.
  • Product marketing information
  • Political opinion tracking
  • Social network analysis
  • Buzz analysis (what's hot, what topics are people
    talking about right now).

19
Google/Arabic Translation
20
Forms of Natural Language
  • The input/output of an NLP system can be
  • Written text: newspaper articles, letters,
    manuals, prose,
  • Speech: read speech (radio, TV, dictations),
    conversational speech, commands,
  • To process written text, we need
  • lexical,
  • syntactic,
  • semantic
  • knowledge about the language
  • discourse information,
  • real world knowledge
  • To process spoken language, we need additionally
  • speech recognition
  • speech synthesis

21
Components of NLP
  • Natural Language Understanding
  • Mapping the given input in the natural language
    into a useful representation.
  • Different levels of analysis are required
  • morphological analysis,
  • syntactic analysis,
  • semantic analysis,
  • discourse analysis,
  • Natural Language Generation
  • Producing output in the natural language from
    some internal representation.
  • Different levels of synthesis are required
  • deep planning (what to say),
  • syntactic generation

Which is harder?
22
Natural language understanding
  • Uncovering the mappings between the linear
    sequence of words (or phonemes) and the meaning
    that it encodes.
  • Representing this meaning in a useful (usually
    symbolic) representation.
  • By definition - heavily dependent on the target
    task
  • Words and structures mean different things in
    different contexts
  • The required target representation is different
    for different tasks.
  • Why is NLU hard?
  • The mapping between words, their linguistic
    structure and the meaning that they encode is
    extremely complex and difficult to model and
    decompose.
  • Natural language is very ambiguous
  • The goal of understanding is itself task
    dependent and very complex.

23
Why is NL Understanding hard?
  • Natural language is extremely rich in form and
    structure, and
  • very ambiguous.
  • How to represent meaning,
  • Which structures map to which meaning structures.
  • Ambiguity: one input can mean many different
    things
  • Lexical (word level) ambiguity -- different
    meanings of words
  • Syntactic ambiguity -- different ways to parse
    the sentence
  • Interpreting partial information -- how to
    interpret pronouns
  • Contextual information -- context of the
    sentence may affect the meaning of that sentence.
  • Many inputs can mean the same thing.
  • Interaction among components of the input.
  • Noisy input (e.g. speech)
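
Lexical ambiguity compounds quickly, which a small sketch can make concrete. The example below is not from the original slides: the toy lexicon, its part-of-speech labels, and the function name `tag_sequences` are illustrative assumptions. It simply enumerates every combination of word-level readings for one sentence.

```python
from itertools import product

# Toy lexicon: each word maps to the parts of speech it could take.
# The entries are illustrative assumptions, not a real dictionary.
LEXICON = {
    "time": ["noun", "verb"],
    "flies": ["noun", "verb"],
    "like": ["verb", "preposition"],
    "an": ["determiner"],
    "arrow": ["noun"],
}

def tag_sequences(sentence):
    """Enumerate every combination of part-of-speech tags for the sentence."""
    words = sentence.lower().split()
    options = [LEXICON[w] for w in words]
    return [list(zip(words, tags)) for tags in product(*options)]

readings = tag_sequences("Time flies like an arrow")
print(len(readings))  # 2 * 2 * 2 * 1 * 1 = 8 candidate tag sequences
```

Most of the eight sequences are not sensible readings. Ruling them out is where the syntactic, semantic, and contextual knowledge named on this slide comes in.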

24
Linguistics Levels of Analysis
  • Phonology: sounds / letters / pronunciation.
    Concerns how words are related to the sounds that
    realize them.
  • Morphology: the structure of words and the laws
    concerning the formation of new words from pieces
    (morphs)
  • Syntax: how sequences of words are structured, e.g.,
    structures of sentences and the ways individual
    words are connected within them
  • Semantics: concerns what words mean and how these
    meanings combine in sentences to form sentence
    meaning. The study of context-independent meaning.
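
As a rough illustration of the morphology level, the sketch below splits a word into a stem and a suffix by naive suffix stripping. It is an assumption-laden toy, not a method from the slides: the suffix table and the function name `split_morphs` are hypothetical, and real morphological analysis needs a lexicon and spelling rules.

```python
# Hypothetical suffix table for a toy analyzer; real English morphology
# needs a lexicon and spelling rules (e.g. "running" -> "run" + "ing").
SUFFIXES = ["ing", "ed", "s"]

def split_morphs(word):
    """Split a word into (stem, suffix) using naive suffix stripping."""
    for suffix in SUFFIXES:
        # Require a stem of at least three letters to avoid splitting "is" or "red".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)], suffix
    return word, ""

for w in ["walked", "walking", "walks", "red"]:
    print(w, split_morphs(w))
```

The minimum stem length is an arbitrary guard chosen for this sketch. It shows why purely surface-level rules are fragile and why the higher levels of analysis are needed.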

25
Linguistics Levels of Analysis
  • Pragmatics: concerns how sentences are used in
    different situations and how use affects the
    interpretation of the sentence
  • Discourse: concerns how the immediately preceding
    sentences affect the interpretation of the next
    sentence. For example, interpreting pronouns and
    interpreting the temporal aspects of the
    information.
  • World Knowledge: includes general knowledge
    about the world. What each language user must
    know about the other's beliefs and goals.

26
Knowledge needed
  • Speech recognition and synthesis
  • Dictionaries (how words are pronounced)
  • Phonetics (how to recognize/produce each sound of
    the language)
  • Natural language understanding
  • Knowledge of the natural language words involved
  • What they mean
  • How they combine
  • Knowledge of syntactic structure
  • Dialog and pragmatic knowledge