1
A Brief Survey of Machine Learning
  • Used materials from
  • William H. Hsu
  • Linda Jackson
  • Lex Lane
  • Tom Mitchell
  • Machine Learning, McGraw-Hill, 1997
  • Allan Moser
  • Tim Finin
  • Marie desJardins
  • Chuck Dyer

2
ML Lectures Outline: what will we discuss?
  • Why machine learning?
  • Brief Tour of Machine Learning
  • A case study
  • A taxonomy of learning
  • Intelligent systems engineering: specification of
    learning problems
  • Issues in Machine Learning
  • Design choices
  • The performance element: intelligent systems
  • Some Applications of Learning
  • Database mining, reasoning (inference/decision
    support), acting
  • Industrial usage of intelligent systems
  • Robotics

3
What is Learning?
definitions
  • Learning denotes changes in a system that ...
    enable a system to do the same task more
    efficiently the next time. -- Herbert Simon
  • Learning is constructing or modifying
    representations of what is being experienced. --
    Ryszard Michalski
  • Learning is making useful changes in our minds.
    -- Marvin Minsky

4
Why Machine Learning?
  • Discover new things or structures that are
    unknown to humans
  • Examples
  • Data mining,
  • Knowledge Discovery in Databases
  • Fill in skeletal or incomplete specifications
    about a domain
  • Large, complex AI systems cannot be completely
    derived by hand
  • They require dynamic updating to incorporate new
    information.
  • Learning new characteristics
  • 1. expands the domain of expertise
  • 2. lessens the "brittleness" of the system
  • Using learning, software agents can adapt
  • to their users,
  • to other software agents,
  • to the changing environment.

5
Why Machine Learning?
  • New Computational Capability
  • Database mining
  • converting (technical) records into knowledge
  • Self-customizing programs
  • learning news filters,
  • adaptive monitors
  • Learning to act
  • robot planning,
  • control optimization,
  • decision support
  • Applications that are hard to program
  • automated driving,
  • speech recognition

6
Why Machine Learning?
  • Better Understanding of Human Learning and
    Teaching
  • Understand and improve efficiency of human
    learning
  • Use to improve methods for teaching and tutoring
    people
  • e.g., better computer-aided instruction.
  • Cognitive science theories of knowledge
    acquisition (e.g., through practice)
  • Performance elements: reasoning (inference) and
    recommender systems
  • Time is Right
  • Recent progress in algorithms and theory
  • Rapidly growing volume of online data from
    various sources
  • Available computational power
  • Growth and interest of learning-based industries
    (e.g., data mining/KDD)

7
A General Model of Learning Agents
8
Three Aspects of Learning Systems
  • 1. Models
  • decision trees,
  • linear threshold units (winnow, weighted
    majority),
  • neural networks,
  • Bayesian networks (polytrees, belief networks,
    influence diagrams, HMMs),
  • genetic algorithms,
  • instance-based (nearest-neighbor)
  • 2. Algorithms (e.g., for decision trees)
  • ID3,
  • C4.5,
  • CART,
  • OC1
  • 3. Methodologies
  • supervised,
  • unsupervised,
  • reinforcement
  • knowledge-guided

9
What are the aspects of research on Learning?
  • 1. Theory of Learning
  • Computational learning theory (COLT): complexity and
    limitations of learning
  • Probably Approximately Correct (PAC) learning
  • Probabilistic, statistical, information theoretic
    results
  • 2. Multistrategy Learning
  • Combining Techniques,
  • Knowledge Sources
  • 3. Create and collect Data
  • Time Series,
  • Very Large Databases (VLDB),
  • Text Corpora
  • 4. Select good applications
  • Performance element
  • classification,
  • decision support,
  • planning,
  • control
  • Database mining and knowledge discovery in
    databases (KDD)
  • Computer inference: learning to reason

10
Some Issues in Machine Learning
  • What Algorithms Can Approximate Functions
    Well? When?
  • How Do Learning System Design Factors Influence
    Accuracy?
  • Number of training examples
  • Complexity of hypothesis representation
  • How Do Learning Problem Characteristics Influence
    Accuracy?
  • Noisy data
  • Multiple data sources
  • What Are The Theoretical Limits of Learnability?
  • How Can Prior Knowledge of Learner Help?
  • What Clues Can We Get From Biological Learning
    Systems?
  • How Can Systems Alter Their Own Representation?

11
Major Paradigms of Machine Learning
  • Rote Learning
  • One-to-one mapping from inputs to stored
    representation.
  • "Learning by memorization.
  • Association-based storage and retrieval.
  • Clustering
  • Analogy
  • Determine the correspondence between two different
    representations
  • Induction
  • Use specific examples to reach general
    conclusions
  • Discovery
  • Unsupervised, specific goal not given
  • Genetic Algorithms

12
Major Paradigms of Machine Learning
  • Neural Networks
  • Reinforcement
  • Feedback is given at the end of a sequence of steps.
  • Feedback can be a positive or negative reward.
  • Assign reward to steps by solving the credit
    assignment problem:
  • which steps should receive credit or blame for a
    final result? (A minimal sketch of spreading a final
    reward back over steps follows.)
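
One simple way to approach the credit assignment problem is to propagate a discounted share of the final reward back to earlier steps. A minimal sketch, assuming a discount factor gamma and a per-step reward list (both illustrative, not from the slides):

```python
# Sketch: spreading a terminal reward over earlier steps via discounted returns.
# The discount factor `gamma` and the episode format are illustrative assumptions.

def discounted_returns(rewards, gamma=0.9):
    """Given per-step rewards (often 0 until the final step), return the
    discounted return credited to each step of the episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Example: a 4-step episode where only the last step is rewarded (+1).
print(discounted_returns([0, 0, 0, 1]))  # approximately [0.729, 0.81, 0.9, 1.0]
```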

13
The Inductive Learning Problem
  • Induce rules that extrapolate from a given set of
    examples
  • These rules should make accurate predictions
    about future examples.
  • Supervised versus Unsupervised learning
  • Learn an unknown function f(X) = Y, where
  • X is an input example and
  • Y is the desired output.
  • Supervised learning implies we are given a
    training set of (X, Y) pairs by a "teacher."
  • Unsupervised learning means we are only given the
    Xs and some (ultimate) feedback function on our
    performance.
  • Concept learning
  • Also called classification
  • Given a set of examples of some
    concept/class/category, determine if a given
    example is an instance of the concept or not.
  • If it is an instance, we call it a positive
    example.
  • If it is not, it is called a negative example.

14
Supervised Concept Learning
  • Given a training set of positive and negative
    examples of a concept
  • Usually each example has a set of
    features/attributes
  • Construct a description that will accurately
    classify whether future examples are positive or
    negative.
  • That is,
  • learn some good estimate of function f
  • given a training set (x1, y1), (x2, y2), ...,
    (xn, yn)
  • where each yi is either + (positive) or -
    (negative).
  • f is a function of the features/attributes

15
Inductive Learning Framework
  • Raw input data from sensors are preprocessed to
    obtain a feature vector, X, that adequately
    describes all of the relevant features for
    classifying examples.
  • Each X is a list of (attribute, value) pairs. For
    example,
  • X = [Person = Sue, EyeColor = Brown, Age = Young,
    Sex = Female]
  • The number and names of attributes (aka features)
    is fixed (positive, finite).
  • Each attribute has a fixed, finite number of
    possible values.
  • Each example can be interpreted as a point in an
    n-dimensional feature space, where n is the number
    of attributes. (A minimal encoding sketch follows.)
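
A minimal sketch of this encoding, assuming a hypothetical fixed attribute ordering; the names and values below are illustrative, taken loosely from the example above:

```python
# Sketch: mapping an (attribute, value) example onto a fixed attribute ordering,
# so each example becomes a point in an n-dimensional feature space.
ATTRIBUTES = ["Person", "EyeColor", "Age", "Sex"]   # fixed, finite attribute set

def to_feature_vector(example):
    """Return the example's attribute values in the fixed attribute order."""
    return tuple(example[a] for a in ATTRIBUTES)

x = {"Person": "Sue", "EyeColor": "Brown", "Age": "Young", "Sex": "Female"}
print(to_feature_vector(x))   # ('Sue', 'Brown', 'Young', 'Female')
```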

16
Inductive Learning by Nearest-Neighbor
Classification
  • One simple approach to inductive learning is to
    save each training example as a point in feature
    space
  • Classify a new example by giving it the same
    classification (+ or -) as its nearest neighbor in
    feature space. (A minimal sketch follows.)
  • 1. One variation computes a weighted sum of the
    classes of a set of neighbors,
  • where the weights correspond to the distances
  • 2. Another variation uses the center (centroid) of
    each class
  • The problem with this approach is that it doesn't
    necessarily generalize well if the examples are not
    well "clustered."

17
Learning Decision Trees
  • Goal: Build a decision tree for classifying
    examples as positive or negative instances of a
    concept using supervised learning from a training
    set.
  • A decision tree is a tree where
  • each non-leaf node is associated with an
    attribute (feature)
  • each leaf node is associated with a
    classification (+ or -)
  • each arc is associated with one of the possible
    values of the attribute at the node from which the
    arc originates.
  • Generalization: allow for >2 classes
  • e.g., sell, hold, buy (a minimal tree sketch
    follows)
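
A minimal sketch of this structure as nested tuples/dictionaries; the attributes and values used here (Outlook, Humidity, Wind) are illustrative assumptions, not from the slides:

```python
# Sketch: a decision tree as nested data.
# Non-leaf node: ("attribute", {value: subtree, ...}); leaf node: a class label.
tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "-", "Normal": "+"}),
    "Overcast": "+",
    "Rain":     ("Wind",     {"Strong": "-", "Weak": "+"}),
})

def classify(node, example):
    """Follow the arc matching the example's value for each node's attribute
    until a leaf (class label) is reached."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # '+'
```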

18
Preference Bias: Ockham's Razor
  • Also known as Occam's Razor, the Law of Economy, or
    the Law of Parsimony
  • Principle stated by William of Ockham
    (1285-1347/49), a scholastic, that
  • non sunt multiplicanda entia praeter
    necessitatem
  • or, entities are not to be multiplied beyond
    necessity.
  • The simplest explanation that is consistent with
    all observations is the best.
  • Therefore, the smallest decision tree that
    correctly classifies all of the training examples
    is best.
  • Finding the provably smallest decision tree is
    NP-Hard
  • Therefore we do not construct the absolute
    smallest tree consistent with the training
    examples.
  • We construct a tree that is pretty small.

19
Inductive Learning and Bias
  • Suppose that we want to learn a function f(x) = y
    and we are given some sample (x, y) pairs, as in
    figure (a).
  • There are several hypotheses we could make about
    this function, e.g. (b), (c) and (d).
  • A preference for one over the others reveals the
    bias of our learning technique, e.g.
  • prefer piece-wise functions
  • prefer a smooth function
  • prefer a simple function and treat outliers as
    noise

20
Example of using probabilities to create trees:
Huffman code
  • In 1952 MIT student David Huffman devised, in the
    course of doing a homework assignment, an elegant
    coding scheme
  • This scheme is optimal in the case where all
    symbols' probabilities are integral powers of
    1/2.
  • A Huffman code can be built in the following
    manner
  • 1. Rank all symbols in order of probability of
    occurrence.
  • 2. Successively combine the two symbols of the
    lowest probability to form a new composite
    symbol
  • eventually we will build a binary tree where each
    node is the probability of all nodes beneath it.
  • 3. Trace a path to each leaf, noticing the
    direction at each node.

21
Huffman code example as a prototypical idea from
another area
  • Message  Probability
  • A        0.125
  • B        0.125
  • C        0.25
  • D        0.5

If we need to send many messages (A, B, C, or D) and
they have this probability distribution and we
use this code, then over time the average
bits/message should approach 1.75
(= 0.125·3 + 0.125·3 + 0.25·2 + 0.5·1).
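
A minimal sketch of the construction steps from the previous slide, applied to this A/B/C/D example. It uses Python's heapq; the tie-breaking order is an implementation assumption, so equally long codes may be assigned differently:

```python
import heapq

# Sketch: build a Huffman code by repeatedly combining the two symbols
# (or composite symbols) of lowest probability into a new composite symbol.
def huffman_code(probabilities):
    # Heap items: (probability, tie_breaker, {symbol: partial_code}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # lowest probability
        p2, _, codes2 = heapq.heappop(heap)   # second lowest
        merged = {s: "0" + c for s, c in codes1.items()}        # one branch gets '0'
        merged.update({s: "1" + c for s, c in codes2.items()})  # the other gets '1'
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.125, "B": 0.125, "C": 0.25, "D": 0.5}
codes = huffman_code(probs)
print(codes)  # e.g. {'D': '0', 'C': '10', 'A': '110', 'B': '111'}
print(sum(p * len(codes[s]) for s, p in probs.items()))  # 1.75 bits/message

```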
22
  • If a set T of records is partitioned into
    disjoint exhaustive classes (C1,C2,..,Ck) on the
    basis of the value of the categorical attribute,
    then the information needed to identify the class
    of an element of T is
  • Info(T) = I(P), the entropy of P:
    I(P) = -Σ pi · log2(pi)
  • where P is the probability distribution of the
    partition (C1, C2, ..., Ck)
  • P = (|C1|/|T|, |C2|/|T|, ..., |Ck|/|T|)
  • If we partition T w.r.t attribute X into sets
    T1,T2, ..,Tn then the information needed to
    identify the class of an element of T becomes the
    weighted average of the information needed to
    identify the class of an element of Ti,
  • i.e. the weighted average of Info(Ti)
  • Info(X,T) = Σi (|Ti|/|T|) · Info(Ti)

23
Gain
  • Consider the quantity Gain(X,T) defined as
  • Gain(X,T) = Info(T) - Info(X,T)
  • This represents the difference between
  • information needed to identify an element of T
    and
  • information needed to identify an element of T
    after the value of attribute X has been obtained,
  • that is, this is the gain in information due to
    attribute X.
  • We can use this to rank attributes and to build
    decision trees where each node holds the attribute
    with the greatest gain among the attributes not yet
    considered on the path from the root.
  • The intent of this ordering is twofold:
  • 1. To create small decision trees so that records
    can be identified after only a few questions.
  • 2. To match a hoped-for minimality of the process
    represented by the records being considered
    (Occam's Razor).

We will use this idea to build decision trees with ID3;
a minimal sketch of Info and Gain follows.
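
A minimal sketch of Info(T), Info(X,T), and Gain(X,T) as defined above, using base-2 entropy; the toy records and attribute names are illustrative assumptions, not real data:

```python
from collections import Counter
from math import log2

# Sketch: entropy-based information gain, as used by ID3 to rank attributes.
def info(records, class_attr):
    """Info(T) = I(P): entropy of the class distribution over the records."""
    counts = Counter(r[class_attr] for r in records)
    total = len(records)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_x(records, attr, class_attr):
    """Info(X,T): weighted average of Info(Ti) over the partition induced by attr."""
    total = len(records)
    partition_sizes = Counter(r[attr] for r in records)
    return sum((n / total) * info([r for r in records if r[attr] == v], class_attr)
               for v, n in partition_sizes.items())

def gain(records, attr, class_attr):
    """Gain(X,T) = Info(T) - Info(X,T)."""
    return info(records, class_attr) - info_x(records, attr, class_attr)

# Toy records (illustrative): the attribute with the greatest gain goes at the root.
T = [{"Ultrasound": "abnormal", "Anemia": "no",  "Class": "+"},
     {"Ultrasound": "normal",   "Anemia": "no",  "Class": "-"},
     {"Ultrasound": "abnormal", "Anemia": "yes", "Class": "+"},
     {"Ultrasound": "normal",   "Anemia": "yes", "Class": "-"}]
print(gain(T, "Ultrasound", "Class"), gain(T, "Anemia", "Class"))  # 1.0 0.0

```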
24
Rule and Decision Tree Learning
  • Example: Rule Acquisition from Historical Data
  • Data
  • Patient 103 (time 1): Age = 23, First-Pregnancy = no,
    Anemia = no, Diabetes = no, Previous-Premature-Birth = no,
    Ultrasound = unknown, Elective C-Section = unknown,
    Emergency-C-Section = unknown
  • Patient 103 (time 2): Age = 23, First-Pregnancy = no,
    Anemia = no, Diabetes = yes, Previous-Premature-Birth = no,
    Ultrasound = abnormal, Elective C-Section = no,
    Emergency-C-Section = unknown
  • Patient 103 (time n): Age = 23, First-Pregnancy = no,
    Anemia = no, Diabetes = no, Previous-Premature-Birth = no,
    Ultrasound = unknown, Elective C-Section = no,
    Emergency-C-Section = YES
  • Learned Rule
  • IF no previous vaginal delivery, AND abnormal 2nd
    trimester ultrasound, AND malpresentation at
    admission, AND no elective C-Section THEN the
    probability of emergency C-Section is 0.6
  • Training set: 26/41 = 0.634
  • Test set: 12/20 = 0.600

25
Neural Network Learning
  • Autonomous Land Vehicle In a Neural Network
    (ALVINN), Pomerleau et al.
  • http://www.cs.cmu.edu/afs/cs/project/alv/member/www/projects/ALVINN.html
  • Drives at 70 mph on highways

26
Specifying A Learning Problem
  • Learning = improving with experience at some task:
  • improve over task T,
  • with respect to performance measure P,
  • based on experience E.
  • Example: Learning to Play Checkers
  • T: play games of checkers
  • P: percent of games won in world tournament
  • E: opportunity to play against self
  • Refining the Problem Specification: Issues
  • What experience?
  • What exactly should be learned?
  • How shall it be represented?
  • What specific algorithm to learn it?
  • Defining the Problem Milieu
  • Performance element
  • How shall the results of learning be applied?
  • How shall the performance element be evaluated?
    The learning system?

27
Example Learning to Play Checkers
28
A Target Function for Learning to Play Checkers
29
A Training Procedure for Learning to Play
Checkers
  • Obtaining Training Examples
  • the target function V(b)
  • the learned function V̂(b)
  • the training value Vtrain(b)
  • One Rule for Estimating Training Values:
    Vtrain(b) ← V̂(Successor(b))
  • Choose Weight Tuning Rule
  • Least Mean Square (LMS) weight update rule: REPEAT
  • Select a training example b at random
  • Compute error(b) = Vtrain(b) - V̂(b) for this
    training example
  • For each board feature fi, update weight wi as
    follows: wi ← wi + c · fi · error(b), where c is a
    small, constant factor that adjusts the learning
    rate (a minimal sketch follows)
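
A minimal sketch of this LMS training loop for a linear evaluation function V_hat(b) = w0 + Σ wi·fi(b); the synthetic training data, feature count, learning rate c, and step count are illustrative assumptions:

```python
import random

# Sketch: LMS weight tuning for a linear evaluation function
#   V_hat(b) = w0 + w1*f1(b) + ... + wn*fn(b)
def v_hat(weights, features):
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

def lms_train(examples, num_features, c=0.01, steps=10000):
    """examples: list of (features, training_value) pairs; c is the learning rate."""
    weights = [0.0] * (num_features + 1)
    for _ in range(steps):
        features, v_train = random.choice(examples)   # select a training example b at random
        error = v_train - v_hat(weights, features)     # error(b) = Vtrain(b) - V_hat(b)
        weights[0] += c * error                        # bias weight
        for i, fi in enumerate(features):
            weights[i + 1] += c * fi * error           # wi <- wi + c * fi * error(b)
    return weights

# Example: recover V(b) = 2*f1 - 3*f2 from synthetic (features, value) pairs.
data = [((f1, f2), 2 * f1 - 3 * f2) for f1 in range(5) for f2 in range(5)]
print(lms_train(data, num_features=2))  # approaches approximately [0.0, 2.0, -3.0]
```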

30
Design Choices for Learning to Play Checkers:
Completed Design
31
Example of Interesting Application: Data Mining
32
Example: Reasoning (Inference, Decision Support)
33
Example: Planning and Control
34
Relevant Disciplines
  • Artificial Intelligence
  • Bayesian Methods
  • Cognitive Science
  • Computational Complexity Theory
  • Control Theory
  • Information Theory
  • Neuroscience
  • Philosophy
  • Psychology
  • Statistics

[Figure: Machine Learning shown at the center of contributions from these disciplines, including optimization, learning predictors, and meta-learning; entropy measures, MDL approaches, and optimal codes; the PAC formalism and mistake bounds; language learning and learning to reason; Bayes's theorem and missing-data estimators; symbolic representation, planning/problem solving, and knowledge-guided learning; the bias/variance formalism, confidence intervals, and hypothesis testing; ANN models and modular learning; Occam's razor and inductive generalization; the power law of practice and heuristic learning.]