Title: Statistische Methoden in der Computerlinguistik / Statistical Methods in Computational Linguistics
1. Statistische Methoden in der Computerlinguistik / Statistical Methods in Computational Linguistics
- 1. Course Overview
- Jonas Kuhn
- Universität Potsdam, 2007
2. Outline
- Course Overview and Introduction
- Some Python Programming
3. Course Overview
- Simple Python Programming
- Basic Probability Theory
- N-Gram Language Modeling
- Basic Information Theory: Entropy
- Data Sparseness: Smoothing Techniques
- Machine Learning Paradigms
- Part-of-Speech Tagging with Statistical and ML Techniques
- Probabilistic Grammars and Parsing
- Statistical Machine Translation
4. The Status of Statistical Methods
- Eric Brill and Raymond J. Mooney (1997)
- An Overview of Empirical Natural Language Processing
- In: AI Magazine, 18(4), Winter 1997, 13-24.
- The linguistic knowledge-acquisition problem
- Rationalist methods
- Empirical or corpus-based methods
5. Rationalist methods
6. Empirical or corpus-based methods
7. History of NLP
- 1950s: empirical and statistical analyses of natural language (compare behaviorism in psychology: Skinner)
- Mid-1950s
- Chomsky's program
- Observational and explanatory adequacy
- Arguments against learnability of language from data: Innateness hypothesis
- Rationalist methods in AI research in NLP
- Hand coding of rules
- Starting in early 1980s
- Some work on induction of lexical and syntactic information from text
- Empirical methods in speech recognition (hidden Markov models, HMMs)
8. History of NLP
- Late 1980s/1990s: Statistical techniques in various areas of NLP
- POS tagging
- Machine translation
- Probabilistic context-free grammars
- Word sense disambiguation
- Anaphora resolution
9. Reasons for the Resurgence of Empiricism
- Empirical methods offer potential solutions to several related, long-standing problems in NLP:
- (1) Acquisition, automatically identifying and coding all the necessary knowledge
- (2) Coverage, accounting for all the phenomena in a given domain or application
- (3) Robustness, accommodating real data that contain noise and aspects not accounted for by the underlying model
- (4) Extensibility, easily extending or porting a system to a new set of data or a new task or domain
10. Reasons for the Resurgence of Empiricism
- Additional factors:
- (1) Computing resources, the availability of relatively inexpensive workstations with sufficient processing and memory resources to analyze large amounts of data
- (2) Data resources, the development and availability of large corpora of linguistic and lexical data for training and testing systems
- (3) Emphasis on applications and evaluation, industrial and government focus on the development of practical systems that are experimentally evaluated on real data
11. Categories of Empirical Methods (1)
- Probabilistic methods
- Symbolic learning methods
- Neural network/connectionist methods
12. Categories of Empirical Methods (2)
- Different dimension: type of training data
- Supervised learning
- Annotated text
- Unsupervised learning
- Indirect feedback
- Important: combination of rationalist and empirical methods
13. An Interdisciplinary Field
[Figure: diagram of related fields. Labels: Electrical Engineering, Computational Neuroscience, Artificial Intelligence, Philosophy, Pattern/Speech Recognition, Mathematics, Machine Learning, Information Theory, Neural Networks, Clustering, Empirical Sciences, Information Retrieval, Computer Science, Statistics, Search Algorithms, Probability Theory, Computational Linguistics, Algorithms & Data Structures, Natural Language Parsing, Statistical NLP, Linguistics, Grammar Formalisms, Complexity Theory, Psycholinguistics, Formal Language Theory, Corpus Linguistics]
14. Practical Aspects
- We will use
- Python for small programming exercises
- http://www.python.org/
- NLTK library (in Python): Natural Language Toolkit
- http://nltk.sourceforge.net/
- (probably) WEKA for small Machine Learning experiments
- http://www.cs.waikato.ac.nz/ml/weka/
15. Python
- Tutorial introduction in an NLP context
- http://nltk.sourceforge.net/docs.html
- Chapter 2: Programming
16. Python: Key Features
- Simple yet powerful, shallow learning curve
- Object-oriented: encapsulation, re-use
- Scripting language, facilitates interactive exploration
- Excellent functionality for processing linguistic data
- Extensive standard library, incl. graphics, web, numerical processing
- Downloaded for free from http://www.python.org/
Slide taken from Bird/Loper/Klein: NLTK Introduction to NLP
17. Python example
    import sys

    # Print every whitespace-separated word on standard input that ends in "ing".
    for line in sys.stdin.readlines():
        for word in line.split():
            if word.endswith('ing'):
                print(word)
- whitespace: nesting of lines of code, scope
- object-oriented: attributes, methods (e.g. line)
- readable
Slide taken from Bird/Loper/Klein: NLTK Introduction to NLP
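The same filter can be wrapped in a function so that it can be tried on a string instead of standard input. This is a minimal sketch: the function name `words_ending_in` and its `suffix` parameter are illustrative assumptions, not part of the original slide.

```python
def words_ending_in(text, suffix="ing"):
    """Return the whitespace-separated words in `text` that end with `suffix`."""
    matches = []
    for line in text.splitlines():      # each nesting level is one indentation step
        for word in line.split():       # split() breaks the line at whitespace
            if word.endswith(suffix):   # endswith() is a method of the string object
                matches.append(word)
    return matches


if __name__ == "__main__":
    sample = "singing in the rain\nthe king is walking"
    print(words_ending_in(sample))
```

Note that the test is purely string-based: on the sample text it also returns `king`, which ends in "ing" without being an -ing form.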
18. Comparison with Perl
    while (<>) {
        foreach my $word (split) {
            if ($word =~ /ing$/) {
                print "$word\n";
            }
        }
    }
- syntax is obscure: what are <> $ my split ?
- "it is quite easy in Perl to write programs that simply look like raving gibberish, even to experienced Perl programmers" (Hammond, Perl Programming for Linguists, 2003:47)
- large programs difficult to maintain, reuse
Slide taken from Bird/Loper/Klein: NLTK Introduction to NLP
19. What NLTK adds to Python
- NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides:
- Basic classes for representing data relevant to natural language processing
- Standard interfaces for performing tasks, such as tokenization, tagging, and parsing
- Standard implementations for each task, which can be combined to solve complex problems
- Extensive documentation, including tutorials and reference documentation
Slide taken from Bird/Loper/Klein: NLTK Introduction to NLP
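The idea of standard interfaces with interchangeable implementations can be sketched in plain Python. This is an illustration only: the names `whitespace_tokenize`, `suffix_tag` and `run_pipeline` are hypothetical and are not NLTK's API, and the tagging rule is a toy.

```python
def whitespace_tokenize(text):
    """Tokenizer convention: string in, list of token strings out."""
    return text.split()


def suffix_tag(tokens):
    """Tagger convention: list of tokens in, list of (token, tag) pairs out.

    Toy rule (an assumption for illustration): words ending in 'ing'
    are tagged VBG, everything else UNK.
    """
    return [(tok, "VBG" if tok.endswith("ing") else "UNK") for tok in tokens]


def run_pipeline(text, tokenize=whitespace_tokenize, tag=suffix_tag):
    """Combine any tokenizer and tagger that follow the two conventions above."""
    return tag(tokenize(text))


if __name__ == "__main__":
    print(run_pipeline("she is reading a book"))
```

Because `run_pipeline` only relies on the input and output conventions, either component can be swapped for a better one without touching the other.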
20. Installing Python and NLTK
- Install Python, Numeric
- Install NLTK-Lite, NLTK-Lite-Corpora
- Set environment variable NLTK_LITE_CORPORA
- For detailed instructions, see
- http://nltk.sourceforge.net/install.html
21. Running Project Idea
- Language Identification
- In what language is a given text document?
- First ideas?
- (Using simple text processing techniques)
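One possible first idea, sketched under stated assumptions: compare the character trigrams of a document with trigram profiles built from sample text in each language. The function names, the use of cosine similarity, and the two tiny sample sentences are all assumptions made for illustration; they are not taken from the slides, and realistic profiles would need far more text.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count overlapping character trigrams in lower-cased text."""
    text = " ".join(text.lower().split())   # normalise whitespace
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(p, q):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def identify(text, profiles):
    """Return the language whose profile is most similar to `text`."""
    target = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))


# Tiny made-up training samples (assumption for illustration only).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat is sleeping in the house",
    "german": "der schnelle braune fuchs springt auf den faulen hund "
              "und die katze liegt in dem haus",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(identify("the dog is in the house", PROFILES))
    print(identify("der hund ist in dem haus", PROFILES))
```

With these samples the two test sentences come out as `english` and `german` respectively. Short or mixed-language inputs will be much less reliable.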