L3S Overview Visit in Sweden
Transcript and Presenter's Notes



1
Introduction: Machine Learning
December 16, 2008
Avaré Stewart
2
Outline
  • Motivation
  • Learner Input
  • Selected Learning Techniques
  • Supervised Learning
  • Unsupervised Learning
  • Summary
  • Tools
  • Further Reading

3
Motivation
  • Data Volume
  • Terabytes of data
  • How do we detect patterns?
  • Structure, classify
  • Too difficult for humans; how can automation
    help?
  • Information Goal
  • Recognize faces and objects in a picture
  • Recognize speech
  • Filter email
  • Humans do this with ease... but how do we
    encode such expertise in an algorithm?

4
Example Machine Learning Application - Text
Mining
  • Machine Learning
  • has many practical applications
  • can be applied to different domains,
    e.g. personalized search, ranking
  • Online Text Documents (e.g. blogs)
  • Classify blogs into known classes
  • Group blogs into clusters
  • Discover salient terms
  • Compact representations

5
What is Machine Learning?
  • Machine Learning is
  • programming computers to optimize a performance
    criterion using example data or past experience
  • A machine has learnt when
  • it changes its structure, program, or data,
    based on its inputs, in such a manner that its
    expected future performance improves

6
Supervised Learning
Class Variable
Values
Attributes
  • Technique for creating a target function relating
    the inputs (attribute values) to the output
    (class variable)
  • Predict the value of the function for unseen test
    data (any valid input object) after having seen a
    number of training examples

7
Unsupervised Learning
Input
Output
  • Input: an a-priori partition of the data is not
    given
  • Goal: discover intrinsic groupings of the input
    data based on
  • a similarity function
  • a distance function

8
Semi-Supervised Learning
  • Labeled Data Tradeoff
  • Labeled data guides the machine learning
    algorithm
  • Labeled data requires high manual effort
  • Most real data is not labeled
  • Semi-Supervised Learning
  • Some labeled, mostly unlabeled data
  • Labels: positive or negative

Semi-supervised
9
Our Roadmap
  • Learner Inputs
  • Case: Text Documents
  • Supervised Learning
  • Case: Naïve Bayes
  • Case: Bayesian Network
  • Unsupervised Learning
  • EM Algorithm
  • Case: Latent Topic Model

10
Learner Preprocessing / Input (Text Documents)
  • Regardless of the learning method
  • text documents are not directly processable by a
    learner
  • preprocessing influences the results of the
    learner
  • Partitioning Data
  • Training
  • Testing
  • Validation
  • Indexing
  • Interpretation of a term
  • Interpretation of a weight
  • Stop Words
  • Stemming
  • Pruning

Information Retrieval
Machine Learning
11
Data Partitioning
  • Training set: build the learner
  • Validation set: tune the learner
  • Test set: evaluate the learner, e.g. with
    perplexity or k-fold cross validation
    (see the sketch below)

(Figure: the data divided into training, test, and validation sets.)
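The slides do not fix exact proportions, so the following is a minimal sketch only, assuming a 60/20/20 train/validation/test split and hypothetical documents and labels lists.

# A minimal sketch (not from the slides) of a 60/20/20 data split.
import random

def split_data(documents, labels, seed=42):
    indices = list(range(len(documents)))
    random.Random(seed).shuffle(indices)              # reproducible shuffle
    n_train = int(0.6 * len(indices))
    n_valid = int(0.2 * len(indices))
    train_idx = indices[:n_train]
    valid_idx = indices[n_train:n_train + n_valid]
    test_idx = indices[n_train + n_valid:]
    pick = lambda idx: ([documents[i] for i in idx], [labels[i] for i in idx])
    return pick(train_idx), pick(valid_idx), pick(test_idx)

train, valid, test = split_data(["doc a", "doc b", "doc c", "doc d", "doc e"],
                                ["pos", "neg", "pos", "neg", "pos"])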
12
Filtering: Stop Words / Stemming
  • Stop word removal: eliminate non-discriminating
    terms
  • Prepositions: over, up
  • Articles: the, a
  • Conjunctions: and, thus
  • Stemming: group words that share the same
    morphological root (see the sketch below)
  • e.g. play, plays
  • e.g. teacher, teaching
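A minimal preprocessing sketch, not from the slides: stop-word removal with a small hand-made list, and stemming via NLTK's PorterStemmer (assumes the nltk package is installed; the stop-word list and sentence are illustrative only).

from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "and", "thus", "over", "up"}   # tiny illustrative list
stemmer = PorterStemmer()

def preprocess(text):
    tokens = text.lower().split()                        # naive whitespace tokenizer
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stemmer.stem(t) for t in tokens]             # group morphological variants

print(preprocess("The teacher plays and the teachers play"))
# e.g. ['teacher', 'play', 'teacher', 'play'] -- exact output depends on the stemmer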

13
Terms and Weights
  • Terms can be
  • Bag of words (most popular)
  • Syntactic phrases: grammatical sequences, e.g.
    noun phrases
  • phantom limb, retinal ganglion, eating disorder
  • Statistical phrases: sequences of significant,
    co-occurring words
  • asthma-lungs, cholesterol-arteries
  • Weights can be
  • Computed: term frequency, tf-idf, etc.
  • Valued as
  • binary, i.e. 0 or 1
  • normalized / probabilistic, i.e. 0..1
    (see the tf-idf sketch below)
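A minimal tf-idf sketch, not from the slides, using the common tf * log(N / df) weighting; the tiny corpus is hypothetical.

import math
from collections import Counter

corpus = [["asthma", "affects", "lungs"],
          ["cholesterol", "clogs", "arteries"],
          ["asthma", "and", "cholesterol", "studies"]]

def tf_idf(corpus):
    n_docs = len(corpus)
    df = Counter()                                   # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)                            # raw term frequency
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

for w in tf_idf(corpus):
    print(w)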

14
Pruning
  • Term Space Reduction / Dimensionality Reduction
  • Reduce the set of terms used by the learner
  • Reasons
  • Some learners don't scale to a huge number of
    terms
  • Improved performance, less noise
  • Reduced overfitting
  • (an overfitted learner cannot generalize; it
    idiosyncratically builds a model for the given
    training data)
  • Overfitting may be avoided even if a smaller
    number of training examples is used
  • Risk: removing potentially useful terms that
    elucidate the meaning of a document

15
Pruning Approaches
  • Document frequency: keep the terms that receive
    the highest score according to a function that
    measures the importance of the term
  • can reduce the number of dimensions by a factor
    of 10
  • e.g. drop terms occurring at most 1-3 times in
    the training documents, or at most 1-5 times in
    the training set (see the sketch below)
  • Term clustering
  • group words with a high degree of semantic
    relatedness
  • represent each set of words as an abstraction /
    concept
  • use the groups or centroids as the learner's
    dimensions
  • Information theoretic
  • keep the best terms, i.e. those that distribute
    most differently across the classes
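A minimal document-frequency pruning sketch, not from the slides: drop terms that occur in very few training documents; the threshold of 3 is illustrative.

from collections import Counter

def prune_by_document_frequency(corpus, min_df=3):
    df = Counter()
    for doc in corpus:
        df.update(set(doc))                          # count each term once per document
    kept = {t for t, c in df.items() if c >= min_df}
    return [[t for t in doc if t in kept] for doc in corpus]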

16
Information Theoretic Pruning Approaches
17
  • Supervised Learning

18
What Can We Learn? - Naïve Bayes Classifier
Class Attribute
  • Probabilistic
  • Goal: predict the class value, i.e. Pr(C_j | d)
  • Simple yet effective
  • Based on the conditional independence assumption

Input Attributes
  • Joint Probabilities can be rewritten as the
    product of individual probabilities

19
Bayes Theorem
Bayes' rule relates the posterior, the likelihood, and the prior:
Pr(C_j | d) = Pr(d | C_j) Pr(C_j) / Pr(d)
Law of total probability (normalization):
Pr(d) = sum_k Pr(d | C_k) Pr(C_k)
Substituting:
Pr(C_j | d) = Pr(d | C_j) Pr(C_j) / sum_k Pr(d | C_k) Pr(C_k)
Sometimes the unnormalized numerator Pr(d | C_j) Pr(C_j) is used instead, since
the denominator is the same for every class.
20
Making a Prediction with Naïve Bayes (1 of 3)
  • From Bayes Theorem
  • Law of Total Probability
  • Conditional Independence Assumption
  • Term 1
  • Term 2
  • Looks like we need a bunch of products and sums
    ...for 2 terms.

21
Example Naïve Bayes (2 of 3)
Example Training Data
22
Making a Prediction with Naïve Bayes (3 of 3)
From slide 2 of 3:
  • Given a new instance, assume
  • A = m, B = q, C = ?

"True" wins (see the worked sketch below)
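A worked Naïve Bayes sketch with a purely hypothetical training table (the slide's actual table is an image and is not reproduced here); the attribute names A, B and class C follow the slide, the rows are made up.

from collections import Counter

train = [  # (A, B, C) -- hypothetical rows
    ("m", "q", True), ("m", "r", True), ("n", "q", True),
    ("m", "q", False), ("n", "r", False),
]

def naive_bayes_predict(train, a, b):
    class_counts = Counter(c for _, _, c in train)
    scores = {}
    for c, n_c in class_counts.items():
        p_a = sum(1 for x, _, cc in train if cc == c and x == a) / n_c
        p_b = sum(1 for _, y, cc in train if cc == c and y == b) / n_c
        scores[c] = (n_c / len(train)) * p_a * p_b   # prior * likelihoods
    return max(scores, key=scores.get), scores

print(naive_bayes_predict(train, a="m", b="q"))
# With this made-up table, the class True gets the higher score.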
23
Summary
  • Advantages of Naïve Bayes
  • Simple technique
  • Results in high accuracy, especially when
    combined with other methods
  • Disadvantages of Naïve Bayes
  • Treats variables as independent and equally
    important, which can cause skewed results
  • Does not allow for categorical output attributes
  • Other Supervised Methods
  • SVM (Support Vector Machines)
  • Decision Trees
  • Bayesian Networks

24
What Can We Learn? - A Bayesian Network
  • A Bayesian Network can
  • overcome the independence assumption of Naïve
    Bayes
  • handle noise (misclassifications)
  • give optimal predictions for small or large data
Andrew W Moore
25
Bayesian Network
Conditional Probability Table
DAG
26
Making a Prediction with BN
  • Given evidence / an unseen case
  • Outlook = sunny
  • Temperature = cool
  • Windy = true
  • Humidity = high
  • Prediction step
  • (a) What is the probability play = no?
  • Pr(play = no | x) = 0.367 * 0.625 * 0.538 * 0.111 * 0.250
  • (b) What is the probability play = yes?
  • Whichever probability, (a) or (b), is maximum
    gives the prediction (see the sketch below)
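A minimal sketch of the prediction step: the factors for play = no are the ones on the slide; the factors for play = yes are hypothetical placeholders, since the slide does not list them.

import math

p_no_factors  = [0.367, 0.625, 0.538, 0.111, 0.250]   # from the slide
p_yes_factors = [0.633, 0.300, 0.350, 0.400, 0.350]   # hypothetical values

p_no  = math.prod(p_no_factors)    # ~0.0034 (unnormalized)
p_yes = math.prod(p_yes_factors)
print("prediction:", "no" if p_no > p_yes else "yes")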

27
Learning Bayesian Networks
  • Nice! But how do I get the DAG?
  • Nice! But how do I fill in the table?
  • Bayesian Network Learning
  • Structure learning
  • Parameter learning
  • Given a training set with no missing values

28
Parameter Learning
SELECT outlook, play, temperature, COUNT(*) AS count
FROM db.table
GROUP BY outlook, play, temperature;
  • Given the structure...
  • count the occurrences in the database (see the
    sketch below)
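A minimal Python equivalent of the counting step, not from the slides, using collections.Counter over hypothetical records; the final loop shows how the counts turn into conditional probability table entries.

from collections import Counter

records = [  # (outlook, play, temperature) -- hypothetical rows
    ("sunny", "no", "hot"), ("sunny", "no", "hot"),
    ("overcast", "yes", "mild"), ("rainy", "yes", "cool"),
]

counts = Counter(records)                              # occurrences of each configuration
total_per_parent = Counter((o, p) for o, p, _ in records)

# Conditional probability Pr(temperature | outlook, play):
for (outlook, play, temp), n in counts.items():
    print(outlook, play, temp, n / total_per_parent[(outlook, play)])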

Portion of Conditional Probability Table for the
Temperature Node
29
Learning Structure
  • Structure Learning Requires
  • Search Procedure
  • Scoring Mechanism

(Figure: a candidate Bayesian network structure is scored against the data;
the example score is based on entropy and the degrees of freedom of the
network.)
30
Search Procedure
  • The search procedure produces different possible
    Bayesian network structures
  • Example: the K2 algorithm
  • Start with an empty DAG (or a Naïve Bayes net)
  • Add, remove, or reverse edges
  • Ensure no cycles are created
  • Score, checking for improvement
  • Keep the structure if the new score is higher
    than the previous score
  • How do we score the current Bayesian network?
    (a minimal search sketch follows below)
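A minimal hill-climbing sketch of the search loop described above (add-edge moves only, not a full K2 implementation); score(dag, data) is a hypothetical scoring function, e.g. an entropy-based score.

import itertools

def creates_cycle(dag, edge):
    # Adding u -> v creates a cycle iff a directed path v -> ... -> u already exists.
    u, v = edge
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(w for (x, w) in dag if x == node)
    return False

def hill_climb(nodes, data, score):
    dag, best = set(), score(set(), data)          # start with an empty DAG
    improved = True
    while improved:
        improved = False
        for edge in itertools.permutations(nodes, 2):
            if edge in dag or creates_cycle(dag, edge):
                continue
            candidate = dag | {edge}
            s = score(candidate, data)
            if s > best:                           # keep only improving moves
                dag, best, improved = candidate, s, True
    return dag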

31
Scoring Mechanism
(Figure: counting example for the score. The child variable has r_k = 3
values (k = 1, 2, 3); for example, N_52 = 4 decomposes into N_521 = 2,
N_522 = 1, N_523 = 1. Combining the parent values gives q_i = 6 parent
configurations (i = 1, ..., 6), from 2 x 3 parent value combinations.)
32
  • Unsupervised Learning

Using Tags to Cluster Blogs
33
What Can We Learn? - Model Based Clustering
  • Clustering divides the data into groups
  • Has wide use in many applications
  • Why cluster?
  • Enhance understanding
  • Group web search results
  • Segment customers for targeted marketing
  • Utility
  • Summarization
  • Compression

34
Model Based Clustering
  • Model Based Learner
  • gives the probability with which a specific
    object belongs to a particular cluster
  • Assumptions
  • the data is generated by a mixture model
  • there is a one-to-one correspondence between
    mixture components and classes (clusters)
  • the component probabilities sum to 1

Mixture Model: a combination of probability
distributions
(Figure: data drawn from two clusters, Cluster 1 and Cluster 2.)
35
What is a Model ?
  • A model is
  • a distribution
  • plus its parameters
  • Learning Process
  • decide on a statistical model for the data
  • learn the parameters of the model from the data

36
Maximum Likelihood Estimation
For the moment, assume a single Gaussian with parameters µ (mean) and s
(standard deviation):
L(µ, s) = prod_i N(x_i | µ, s^2)
Reformulation, taking the log:
log L(µ, s) = sum_i [ -log(s * sqrt(2 * pi)) - (x_i - µ)^2 / (2 * s^2) ]
  • To find the MLE parameters, maximize the
    (log-)likelihood
  • Take the derivative of the likelihood function
    w.r.t. each parameter
  • Set the result to zero and solve
    (a minimal sketch follows below)
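A minimal sketch, not from the slides, of the resulting closed-form MLE for a single Gaussian: the sample mean and the (biased) sample standard deviation; the data points are hypothetical.

import math

def gaussian_mle(xs):
    n = len(xs)
    mu = sum(xs) / n                                        # d/dmu log L = 0  =>  mu = mean
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)   # d/dsigma log L = 0
    return mu, sigma

print(gaussian_mle([1.0, 2.0, 2.5, 3.5, 4.0]))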

37
Learning Parameters of the Mixture Model
Cluster 1
Cluster 2
  • Now, let's go back to 2 clusters...
  • Which points were generated by which
    distribution? Here is where the unsupervised
    part comes in!
  • Guess? Or calculate the probability that each
    point belongs to each distribution
  • Use these probabilities to compute a new estimate
    for the parameters

Data
  • Expectation
  • Step
  • Maximization
  • Step

38
EM Algorithm Step 1 Initialization
  • Step 1: Select an initial set of model parameters
  • Assume we know some of the parameters (e.g. the
    variances)
  • Make initial guesses for the rest (e.g. the means)

39
EM Algorithm Step 2: E-Step
The probability that a point came from a particular distribution:
Pr(c | x) = Pr(x | c) Pr(c) / sum_k Pr(x | k) Pr(k)
Recall Bayes' rule; compute this for all points and all distributions
(e.g. evaluated at the point x = 0).
40
EM Algorithm Step 3 M-Step
  • Given the probabilities from the expectation
    step, find the new estimates of the parameters
    that maximize the expected likelihood
  • Repeat from the E-step until the values stabilize
    (a complete EM sketch follows below)
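A minimal EM sketch, not from the slides, for a mixture of two 1-D Gaussians with equal, fixed mixing weights; the data points and initial guesses are hypothetical.

import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, mus=(0.0, 1.0), sigmas=(1.0, 1.0), n_iter=50):
    mus, sigmas = list(mus), list(sigmas)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [gaussian_pdf(x, mus[c], sigmas[c]) for c in range(2)]
            z = sum(p)
            resp.append([pc / z for pc in p])
        # M-step: weighted mean and standard deviation per component
        for c in range(2):
            w = sum(r[c] for r in resp)
            mus[c] = sum(r[c] * x for r, x in zip(resp, xs)) / w
            sigmas[c] = math.sqrt(sum(r[c] * (x - mus[c]) ** 2 for r, x in zip(resp, xs)) / w) or 1e-6
    return mus, sigmas

print(em_two_gaussians([-2.1, -1.9, -2.0, 3.0, 3.2, 2.8]))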

41
Topic Model
Latent Topic
  • Generative process for document creation
  • Topic: a mixture of words
  • Document: a mixture of topics
  • Predefined distributions govern the selection
    process

http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf
42
Example Topic Model: pLSA
(Figure: an example pLSA model with its E-step and M-step update equations;
see the sketch of the standard updates below.)
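The slide's update equations are shown as an image; as a hedged reference, not taken from the slide itself, the standard pLSA EM updates in Hofmann's formulation, assuming observed document-word counts n(d, w), are:

E-step:
Pr(z | d, w) = Pr(z) Pr(d | z) Pr(w | z) / sum_z' Pr(z') Pr(d | z') Pr(w | z')

M-step:
Pr(w | z) proportional to sum_d n(d, w) Pr(z | d, w)
Pr(d | z) proportional to sum_w n(d, w) Pr(z | d, w)
Pr(z)     proportional to sum_d sum_w n(d, w) Pr(z | d, w)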
43
Summary Naïve Bayes
  • Advantages
  • Simple technique
  • Results in high accuracy, especially when
    combined with other methods
  • Disadvantages
  • Treats variables as independent and equally
    important, which can cause skewed results
  • Does not allow for categorical output attributes

44
Summary Bayesian Belief Network
  • Advantages
  • Well suited to incomplete data
  • Disadvantages
  • Can be computationally intensive, especially when
    variables are not conditionally independent of
    one another

45
Summary Generative Model
  • Advantages
  • Clear, well-studied probabilistic framework
  • Can be extremely effective if the model is close
    to correct
  • Disadvantages
  • Often difficult to verify the correctness of the
    model
  • EM can get stuck in local optima
  • Unlabeled data may hurt if the generative model
    is wrong

46
Tools and Further Reading
  • Tools
  • Machine Learning: Weka, Lemur, Mallet
  • Clustering demos
  • TCT - Text Clustering Toolkit
  • CLUTO - Clustering Toolkit
  • Kevin Murphy's Bayesian Network Toolbox for
    MATLAB
  • Hugin - Bayesian Networks
  • Bibliography
  • Machine Learning. Mitchell.
  • Principles of Data Mining. Hand, Mannila, Smyth.
  • Introduction to Data Mining. Tan, Steinbach,
    Kumar.
  • Introduction to Information Retrieval. Manning,
    Raghavan, Schütze.
  • Data Mining: Practical Machine Learning Tools and
    Techniques (Weka). Witten, Frank.
  • Web Data Mining. Bing Liu.
  • Andrew Moore's tutorial slides

47
  • Semi-Supervised Learning

48
What Can We Learn? - Semi-Supervised Text
Classification
  • Assumptions
  • Documents are represented as a bag of words
  • The probability of a word is independent of its
    position in the document
  • Words are generated by a multinomial distribution
  • There is a one-to-one correspondence between
    mixture components and classes

(Figure: example words drawn from the class word distributions, e.g. tube,
dry, Abbey, Columbian, Diana.)
49
Making a Prediction w/ Naïve Bayes Text
Classifier
Classes c1, c2, c3 with document sets D1, D2, D3;
Dj is the subset of the data for class cj.
  • Term 2
  • Term 1
  • Looks like we need a bunch of products and sums
    ... for the 2 terms.
50
Closer Look at Term 1
Term 1 is the prior probability of the classes,
estimated from the data as the number of documents
in each class (Dj) divided by the total number of
documents: Pr(cj) = |Dj| / |D|
51
Closer Look at Term 2
Term 2, applying the model assumptions, is the
number of times word wt occurs in the training
data Dj (of class cj) divided by the total number
of word occurrences in the training data for that
class:
Pr(wt | cj) = sum_i Nti Pr(cj | di) / sum_s sum_i Nsi Pr(cj | di)
where
Nti = number of times word wt occurs in document di
Pr(cj | di) = 1 if di is in Dj, 0 otherwise
V = the set of all distinct words (s ranges over V)
52
Recall Making a Prediction with Naïve Bayes
  • Given a new instance, assume
  • A = m, B = q, C = ?

53
Overview: Semi-Supervised Learning with the EM Algorithm
(Figure: the labeled data D is used to train an initial Naïve Bayes classifier
(Term 1 and Term 2); the classifier probabilistically labels the unlabeled
data (E-step), the parameters are re-estimated from all the data (M-step), and
the process repeats. A minimal sketch follows below.)
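A minimal sketch, not from the slides, of semi-supervised Naïve Bayes with EM. It assumes dense count features (e.g. a bag-of-words encoding), class labels 0..n_classes-1 with every class present in the labeled set, and scikit-learn installed; MultinomialNB, predict_proba, and the sample_weight argument are standard scikit-learn APIs.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def semi_supervised_nb(X_lab, y_lab, X_unlab, n_classes, n_iter=10):
    clf = MultinomialNB().fit(X_lab, y_lab)             # initial classifier: labeled data only
    for _ in range(n_iter):
        # E-step: probabilistically label the unlabeled documents
        probs = clf.predict_proba(X_unlab)              # shape (n_unlab, n_classes)
        # M-step: retrain on the labeled data plus one weighted copy of the
        # unlabeled data per class (weights = predicted class probabilities)
        X_aug = np.vstack([X_lab] + [X_unlab] * n_classes)
        y_aug = np.concatenate([y_lab] +
                               [np.full(X_unlab.shape[0], c) for c in range(n_classes)])
        w_aug = np.concatenate([np.ones(len(y_lab))] +
                               [probs[:, c] for c in range(n_classes)])
        clf = MultinomialNB().fit(X_aug, y_aug, sample_weight=w_aug)
    return clf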