Statistische Methoden in der Computerlinguistik / Statistical Methods in Computational Linguistics

1
Statistische Methoden in der Computerlinguistik
Statistical Methods in Computational Linguistics
  • 1. Course Overview
  • Jonas Kuhn
  • Universität Potsdam, 2007

2
Outline
  • Course Overview: Introduction
  • Some Python Programming

3
Course Overview
  • Simple Python Programming
  • Basic Probability Theory
  • N-Gram Language Modeling
  • Basic Information Theory: Entropy
  • Data Sparseness: Smoothing Techniques
  • Machine Learning Paradigms
  • Part-of-Speech Tagging with Statistical and ML
    Techniques
  • Probabilistic Grammars: Parsing
  • Statistical Machine Translation

4
The Status of Statistical Methods
  • Eric Brill and Raymond J. Mooney (1997):
    An Overview of Empirical Natural Language
    Processing.
    In: AI Magazine 18(4), Winter 1997, 13-24.
  • The linguistic knowledge-acquisition problem
  • Rationalist methods
  • Empirical or corpus-based methods

5
Rationalist methods

6
Empirical or corpus-based methods

7
History of NLP
  • 1950s: empirical and statistical analyses of
    natural language (compare behaviorism in
    psychology: Skinner)
  • Mid-1950s:
  • Chomsky's program
  • Observational and explanatory adequacy
  • Arguments against learnability of language from
    data: innateness hypothesis
  • Rationalist methods in AI research in NLP
  • Hand coding of rules
  • Starting in early 1980s:
  • Some work on induction of lexical and syntactic
    information from text
  • Empirical methods in speech recognition (hidden
    Markov models, HMMs)

8
History of NLP
  • Late 1980s/1990s: statistical techniques in
    various areas of NLP
  • POS tagging
  • Machine translation
  • Probabilistic context-free grammars
  • Word sense disambiguation
  • Anaphora resolution

9
Reasons for the Resurgence of Empiricism
  • Empirical methods offer potential solutions to
    several related, long-standing problems in NLP
  • (1) Acquisition, automatically identifying and
    coding all the necessary knowledge
  • (2) Coverage, accounting for all the phenomena in
    a given domain or application
  • (3) Robustness, accommodating real data that
    contain noise and aspects not accounted for by
    the underlying model
  • (4) Extensibility, easily extending or porting a
    system to a new set of data or a new task or
    domain

10
Reasons for the Resurgence of Empiricism
  • Additional factors
  • (1) computing resources, the availability of
    relatively inexpensive workstations with
    sufficient processing and memory resources to
    analyze large amounts of data
  • (2) data resources, the development and
    availability of large corpora of linguistic and
    lexical data for training and testing systems
  • (3) emphasis on applications and evaluation,
    industrial and government focus on the
    development of practical systems that are
    experimentally evaluated on real data

11
Categories of Empirical Methods (1)
  • Probabilistic methods
  • Symbolic learning methods
  • Neural network/connectionist methods

12
Categories of Empirical Methods (2)
  • Different dimension: type of training data
  • Supervised learning
  • Annotated text
  • Unsupervised learning
  • Indirect feedback
  • Important: combination of rationalist and
    empirical methods

13
An Interdisciplinary Field
[Diagram of related fields: Electrical Engineering,
Computational Neuroscience, Artificial Intelligence,
Philosophy, Pattern/Speech Recognition, Mathematics,
Machine Learning, Information Theory, Neural Networks,
Clustering, Empirical Sciences, Information Retrieval,
Computer Science, Statistics, Search Algorithms,
Probability Theory, Computational Linguistics,
Algorithms & Data Structures, Natural Language Parsing,
Statistical NLP, Linguistics, Grammar Formalisms,
Complexity Theory, Psycholinguistics,
Formal Language Theory, Corpus Linguistics]

14
Practical Aspects
  • We will use:
  • Python for small programming exercises
  • http://www.python.org/
  • NLTK library (in Python): Natural Language
    Toolkit
  • http://nltk.sourceforge.net/
  • (probably) WEKA for small Machine Learning
    experiments
  • http://www.cs.waikato.ac.nz/ml/weka/

15
Python
  • Tutorial introduction in an NLP context
  • http://nltk.sourceforge.net/docs.html
  • Chapter 2: Programming

16
Python: Key Features
  • Simple yet powerful, shallow learning curve
  • Object-oriented: encapsulation, re-use
  • Scripting language, facilitates interactive
    exploration
  • Excellent functionality for processing linguistic
    data
  • Extensive standard library, incl. graphics, web,
    numerical processing
  • Downloaded for free from http://www.python.org/

Slide taken from Bird/Loper/Klein: NLTK Introduction to NLP

17
Python example
    import sys

    # Print every whitespace-separated word on standard input that ends in "ing".
    for line in sys.stdin.readlines():
        for word in line.split():
            if word.endswith('ing'):
                print(word)

  • whitespace: nesting lines of code; scope
  • object-oriented: attributes, methods (e.g. line)
  • readable

Slide taken from Bird/Loper/Klein: NLTK Introduction to NLP

18
Comparison with Perl
    while (<>) {
        foreach my $word (split) {
            if ($word =~ /ing$/) {
                print "$word\n";
            }
        }
    }

  • syntax is obscure: what are <> $ my split ?
  • "it is quite easy in Perl to write programs that
    simply look like raving gibberish, even to
    experienced Perl programmers" (Hammond, Perl
    Programming for Linguists, 2003:47)
  • large programs difficult to maintain, reuse

Slide taken from Bird/Loper/Klein: NLTK Introduction to NLP

19
What NLTK adds to Python
  • NLTK defines a basic infrastructure that can be
    used to build NLP programs in Python. It
    provides:
  • Basic classes for representing data relevant to
    natural language processing
  • Standard interfaces for performing tasks, such as
    tokenization, tagging, and parsing
  • Standard implementations for each task, which can
    be combined to solve complex problems
  • Extensive documentation, including tutorials and
    reference documentation

Slide taken from Bird/Loper/Klein: NLTK Introduction to NLP

20
Installing Python and NLTK
  • Install Python, Numeric
  • Install NLTK-Lite, NLTK-Lite-Corpora
  • Set environment variable NLTK_LITE_CORPORA
  • For detailed instructions, see
  • http://nltk.sourceforge.net/install.html

21
Running Project Idea
  • Language Identification
  • In what language is a given text document?
  • First ideas?
  • (Using simple text processing techniques)
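
One possible first idea, sketched in Python: count how many very common function words of each candidate language occur in the text and pick the language with the most hits. The word lists, the name COMMON_WORDS and the function guess_language are illustrative assumptions, not part of the course material; a real system would derive its statistics from corpora.

```python
import re

# Hypothetical, hand-picked function-word lists (an assumption for this
# sketch); real systems would estimate such lists from large corpora.
COMMON_WORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "les", "et", "est", "un", "une", "des"},
}


def guess_language(text):
    """Return the language whose common words occur most often in text."""
    # Lowercase the text and split it into word tokens.
    tokens = re.findall(r"\w+", text.lower())
    # Score each language by the number of tokens found in its word list.
    scores = {
        language: sum(1 for token in tokens if token in words)
        for language, words in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # Return None when no listed word occurs at all.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("Die Katze ist nicht in dem Haus"))
```

Word lists this small fail on short or mixed-language texts; character n-gram statistics, a later topic in the course overview, are a natural refinement.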