1
Machine Learning Introduction
  • Why is machine learning important?
  • Early AI systems were brittle; learning can
    improve such a system's capabilities
  • AI systems require some form of knowledge
    acquisition; learning can reduce this effort
  • KBS research clearly shows that producing a KBS
    is extremely time consuming; dozens of man-years
    per system is the norm
  • in some cases, there is too much knowledge for
    humans to enter (e.g., common sense reasoning,
    natural language processing)
  • Some problems are not well understood but can be
    learned (e.g., speech recognition, visual
    recognition)
  • AI systems are often placed into real-world
    problem solving situations
  • the flexibility to learn how to solve new problem
    instances can be invaluable
  • A system can improve its problem solving accuracy
    (and possibly efficiency) by learning how to do
    something better

2
How Does Machine Learning Work?
  • Learning in general breaks down into one of
    three forms
  • Learning something new
  • no prior knowledge of the domain/concept, so no
    previous representation of that knowledge
  • in ML, this requires adding new information to
    the knowledge base
  • Learning something new about something you
    already knew
  • add to the knowledge base or refine the knowledge
    base
  • modification of the previous representation
  • new classes, new features, new connections
    between them
  • Learning how to do something better, either more
    efficiently or with more accuracy
  • a previous problem solving instance (case, chain
    of logic) can be chunked into a new rule (also
    called memoizing)
  • previous knowledge can be modified; typically
    this is a parameter adjustment, like a weight or
    probability in a network, indicating that
    something was more or less important than
    previously thought

3
Types of Machine Learning
  • There are many ways to implement ML
  • Supervised vs. Unsupervised vs. Reinforcement
  • is there a teacher that rewards/punishes
    right/wrong answers?
  • Symbolic vs. Subsymbolic vs. Evolutionary
  • at what level is the representation?
  • subsymbolic is the fancy name for neural networks
  • evolutionary learning is actually a subtype of
    symbolic learning
  • Knowledge acquisition vs. Learning through
    problem solving vs. Explanation-based learning
    vs. Analogy
  • We can also focus on what is being learned
  • Learning functions
  • Learning rules
  • Parameter adjustment
  • Learning classifications
  • these are not mutually exclusive; for instance,
    learning classifications is often done by
    parameter adjustment

4
Supervised Learning
  • The idea behind supervised learning is that the
    learning system is offered examples
  • The system uses what it already knows to respond
    to an input (if the system has yet to learn,
    initial values are randomly assigned)
  • If correct, the system strengthens the components
    that led to the right answer
  • If incorrect, the system weakens the components
    that led to the wrong answer
  • This is performed for each item in the training
    set
  • Repeat some number of iterations or until the
    system converges to an answer
  • Viewed this way, learning is actually a search
    problem
  • The system is searching for the representation
    that will allow it to respond correctly to every
    (or most) instance in the training set
  • There could be many correct solutions
  • Some of these will also allow the system to
    respond correctly to most instances in the
    testing set

5
Forms of Supervised Learning
  • Most ML is some form of learning a function
  • F(x) = y, where x is the input (typically
    comprised of (x1, x2, …, xn) for some
    n-dimensional space) and y is the output
  • This form of learning typically breaks down into
    one of two forms
  • classification: the training items are mapped to
    distinct elements of a set
  • regression: the training items are mapped to
    continuous values
  • In supervised learning, we have a training set of
    <x, y> pairs
  • Use the training set to teach the ML system
  • Many different approaches have been developed
  • neural networks using backpropagation
  • HMM
  • Bayesian networks
  • decision trees
  • clustering
  • Usually, once the system is trained, another data
    set (the test set) is run on the system to see
    how it performs
  • There is a danger in this approach: overtraining
    the system means that it learns the training set
    too well; it overfits to the training set such
    that it performs poorly on the test set

6
Learning a Function
  • One of the most basic ideas in learning is to
    provide examples of input/output and have the
    system learn the function
  • The system will not learn, say, f(x1, x2) = x1^2 +
    3x2 + 5, but instead will learn how to map f(xi,
    xj) to an output (hopefully reliably)
  • The function will be learned only approximately,
    based on how useful the training set is and the
    specific type of learning algorithm applied

Consider learning the function that fits the data
points plotted to the left: there are many
functions that might fit. Which one is
correct? Do we need to find a precise fit?
If not, how much error should we allow?
7
Perceptrons
  • Earliest form of neural network
  • given a series of input/output pairs, identify
    the linear separability (a hyper-plane)
  • e.g., a line in 2-d, a plane in 3-d
  • If the data points are linearly separable, the
    perceptron learning algorithm is guaranteed to
    find a separating hyper-plane
  • many functions, such as XOR, are not linearly
    separable, in which case perceptrons fail

Think of the points as items that are either in
a given class or not; the perceptron learns to
classify the items
An n-input perceptron computes output = 1 if
w1x1 + w2x2 + … + wnxn > threshold, and 0 otherwise
Weights are adjusted during learning to improve
the perceptron's performance; this amounts to
learning the function that separates the "ins"
from the "outs"
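A minimal sketch of the perceptron learning rule in Python (the AND data set, learning rate, and epoch count are illustrative assumptions, not from the slides):

```python
import random

def train_perceptron(data, epochs=100, lr=0.1):
    """data: list of (inputs, target) pairs with target in {0, 1}."""
    n = len(data[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]  # random initial weights
    b = random.uniform(-0.5, 0.5)                      # bias (threshold)
    for _ in range(epochs):
        for x, target in data:
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - out
            # strengthen or weaken each weight in proportion to its input
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# learn logical AND, which is linearly separable (XOR would never converge)
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
```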
8
Linear Regression
  • Another approach is based on the statistical
    method of regression analysis
  • Here, the strategy is to identify the
    coefficients (such as a and ß below) that fit the
    equation below, given the data set of <x, y>
    values
  • e is a random error term
  • we need to expand this to an n-dimensional
    formula, since our data will consist of elements
    X = (x1, x2, x3, …, xn) and y
  • There are a variety of ways to do regression,
    including assuming some sort of distribution
    (e.g., Gaussian), applying the method of least
    squares, applying Bayesian probabilities, etc.
  • note: neural networks are a form of non-linear
    regression

y = a + ßx + e
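For the one-dimensional case, a and ß can be estimated with the method of least squares; a minimal sketch, with made-up sample data:

```python
def least_squares(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# noisy points scattered around y = 2 + 3x
xs = [1, 2, 3, 4, 5]
ys = [5.1, 7.9, 11.2, 13.8, 17.1]
a, b = least_squares(xs, ys)   # a ≈ 2.05, b ≈ 2.99
```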
9
Classifiers
  • The more common form of supervised learning is
    that of a classifier: the goal is to learn how
    to classify the data
  • f(x) = y means that x describes some input and y
    is its proper category (again, x is actually
    (x1, x2, …, xn))
  • Much of ML has revolved around classifiers
  • Naïve Bayesian classifiers
  • Neural networks
  • K nearest neighbors
  • Boosting
  • Induction
  • version spaces
  • decision trees
  • inductive logic programming
  • Some of these forms of classifiers are used
    heavily in data mining, so we will hold off on
    discussing those until the next lecture (K
    nearest neighbors, boosting, decision trees)
  • We will skip version spaces and inductive logic
    programming as they are not as common today, but
    you might investigate them on your own

10
Bayesian Learning
  • Recall that to apply Bayesian probabilities, we
    must either
  • have an enormous number of evidential hypotheses
  • or assume that the pieces of evidence are
    independent
  • The Naïve Bayesian Classifier takes the latter
    assumption
  • thus, it is known as naïve
  • P(C | e1, e2, e3) = P(C | e1) * P(C | e2) *
    P(C | e3)
  • rather than the more complex chain of
    probabilities that we saw previously
  • We can learn the prior and evidential
    probabilities by counting occurrences of evidence
    and hypotheses amongst the data in the training
    set
  • P(A | B) = number of times that A and B both
    appear in the training set / number of times
    that B appears in the training set
  • P(A) = number of times that A appears / size of
    the training set
  • in case any of these values appears 0 times, we
    might want to smooth the probability so that no
    conditional probability is ever 0.0
  • smoothing is done by adding some hallucinated
    counts to both the numerator and denominator,
    based on the size of the training set and some
    pre-established constant
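A sketch of estimating these probabilities by counting, with a simple add-k scheme standing in for the "hallucinated" counts described above (the training data and the 2k denominator term are illustrative assumptions):

```python
from collections import Counter

def estimate_probabilities(training_set, k=1):
    """training_set: list of (evidence_set, hypothesis) pairs.
    Returns smoothed P(e | h): (count(e and h) + k) / (count(h) + 2k)."""
    hyp_counts = Counter(h for _, h in training_set)
    joint = Counter((e, h) for evid, h in training_set for e in evid)
    evidence = {e for evid, _ in training_set for e in evid}
    return {(e, h): (joint[(e, h)] + k) / (hyp_counts[h] + 2 * k)
            for e in evidence for h in hyp_counts}

train = [({"fever", "cough"}, "flu"), ({"cough"}, "cold"), ({"fever"}, "flu")]
p = estimate_probabilities(train)
# p[("fever", "flu")] = (2 + 1) / (2 + 2) = 0.75; no estimate is ever 0.0
```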

11
Example
  • Consider that I want to train an NBC on whether a
    particular text-based article is one that I would
    like to read
  • Given a set of training articles, mark each as
    yes or no
  • Create the following probabilities
  • P(wordi | yes) = probability that word i appears
    in an article I want to read
  • P(wordi | no) = probability that word i appears
    in an article I do not want to read
  • P(wordi) = probability that word i appears in an
    article
  • this is known as the bag of words approach

[Chart: accuracy of the NBC for training sets of size 0-10000]
  • Now, given an article, compute P(yes | words) and
    P(no | words), where words = (worda, wordb,
    wordc, …) for each unique word in the article
  • We can enhance this strategy by
  • removing common words
  • using phrases
  • making sure that the bag contains important words
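A sketch of the classification step; summing log probabilities is used here (a standard trick, not stated on the slide) to avoid underflow when multiplying many small word probabilities, and all probability values shown are hypothetical:

```python
import math

def classify(words, p_word_given, p_class, classes=("yes", "no")):
    """Pick the class maximizing log P(c) + sum of log P(word | c)."""
    def score(c):
        return math.log(p_class[c]) + sum(
            math.log(p_word_given[c].get(w, 1e-6))  # tiny floor for unseen words
            for w in set(words))                    # each unique word once
    return max(classes, key=score)

p_class = {"yes": 0.4, "no": 0.6}
p_word_given = {"yes": {"ai": 0.09, "learning": 0.07},
                "no":  {"ai": 0.01, "learning": 0.02}}
article = ["ai", "learning", "ai"]
print(classify(article, p_word_given, p_class))  # -> "yes"
```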

12
Learning in Bayesian Networks
  • Rather than assuming evidential independence, we
    might prefer Bayesian nets
  • We cannot learn (compute) the complex
    probabilities in a Bayesian network
  • e.g., P(A | B, C, D)
  • What we can do, given these probabilities (or
    estimates), is learn the proper (best) structure
    for the Bayesian net
  • this is done by taking our original network,
    making some minor change(s) to it, computing the
    result's probability, and selecting the network
    with the highest probability for that result
  • For instance, in the figure to the right, we want
    to know P(T = true)
  • We compute that probability on several versions
    of the Bayesian net and select the network that
    provides the highest resulting probability in
    which T was found to be true (likely)

13
Introduction to Neural Networks
  • After perceptrons were shown unable to learn XOR,
    research into connectionism died off for about 15
    years
  • A new learning algorithm, backpropagation, and a
    new type of layered network, the Artificial
    Neural Network, led to a revised interest in
    connectionism
  • To the right is a multi-layered ANN
  • I = the inputs
  • some (0 or more) intermediate levels known as
    hidden layers
  • O = the outputs
  • Each layer is completely connected to the next
    layer
  • Each edge has its own weight
  • The goal of the backprop algorithm is to train
    the ANN to learn proper weights

14
NN Supervised Learning
  • First feed forward the input
  • most NNs use a sigmoid function to compute the
    output of a given node, but otherwise it is like
    computing the result of a perceptron node
  • Determine the error (if any) by examining each
    output node and comparing the value to the
    expected value from the training set
  • Backpropagate the error from the output nodes to
    the hidden layer nodes (formula for weight
    adjustment on the next slide)
  • Continue to backpropagate the error to the
    previous level (another hidden layer or the input)
  • note that since we don't know what a given hidden
    layer node was supposed to output, we can't
    directly compute an error here; we therefore have
    to modify our formula for adjusting the weight
    (again, see the next slide)
  • Repeat the learning algorithm on the next
    training set item
  • Repeat the entire training set until the network
    converges (weights change less than some Δ)

15
How to Adjust Weights
  • For the weights connecting the hidden layer to
    the output, we adjust a weight wij as follows
  • wij = wij + sf * oj * (1 - oj) * (ej - oj) * i
  • sf is the scaling factor; this controls how
    quickly the network learns
  • oj is the output value of node j
  • ej is the expected value for output node j (as
    dictated by the training set item)
  • i is the input value (the output of node i)
  • We do not know ej for the hidden layer nodes, so
    we have to revise the formula to adjust the
    weights between hidden layer a and hidden layer
    b, or between the input layer and the hidden
    layer
  • wij = wij + sf * oi * (1 - oi) * Sum(wk * dk) * i
  • wk is the weight connecting this node to node k
    in the next layer and dk is the error term
    computed for node k during backpropagation (see
    the worked example on the next slides)
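A sketch of these two update rules in Python, following the slide's notation (function and variable names are illustrative):

```python
def update_output_weight(w_ij, sf, o_j, e_j, i_val):
    """Output-layer rule: wij = wij + sf * oj * (1 - oj) * (ej - oj) * i.
    Returns the new weight and the error term d for node j."""
    d_j = o_j * (1 - o_j) * (e_j - o_j)
    return w_ij + sf * d_j * i_val, d_j

def update_hidden_weight(w_ij, sf, o_i, downstream, i_val):
    """Hidden-layer rule: wij = wij + sf * oi * (1 - oi) * Sum(wk * dk) * i,
    where downstream is a list of (wk, dk) pairs for the nodes that
    node i feeds in the next layer."""
    d_i = o_i * (1 - o_i) * sum(w_k * d_k for w_k, d_k in downstream)
    return w_ij + sf * d_i * i_val, d_i
```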

16
Learning Example
Assume an input <10, 30, 20> and an expected
output of <1, 0> from our training set. Use a
scaling factor of 0.1.
Recall that computing a node's output uses the
sigmoid function: output = 1 / (1 + e^-x), where
x is the node's weighted input sum
Part 1 Feed forward: H1 receives 7, H2
receives -5; H1 outputs .9990, H2 outputs
.0067; O1 receives 1.0996, O2
receives 3.1047; O1 outputs .7501, O2 outputs
.9571
17
Example Continued
Part 2 Compute Error at Output: O1 should be
1.0, O2 should be 0.0
dO1 = .7501 * (1 - .7501) * (1.0 - .7501) = 0.0469
dO2 = .9571 * (1 - .9571) * (0.0 - .9571) = -0.0394
Part 3 Compute Error for Hidden Units
Backprop to H1: (w11 * dO1) + (w12 * dO2) =
(1.1 * 0.0469) + (3.1 * -0.0394) = -0.0706
Compute H1's error (multiply by h1(E) * (1 - h1(E))):
-0.0706 * (0.999 * (1 - 0.999)) = -0.0000705 = dH1
Backprop to H2: (w21 * dO1) + (w22 * dO2) =
(0.1 * 0.0469) + (1.17 * -0.0394) = -0.0414
Compute H2's error (multiply by h2(E) * (1 - h2(E))):
-0.0414 * (0.0067 * (1 - 0.0067)) = -0.000276 = dH2
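These numbers can be checked with a few lines of Python:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

print(sigmoid(7), sigmoid(-5))            # 0.9990..., 0.0067...  (H1, H2 outputs)
print(sigmoid(1.0996), sigmoid(3.1047))   # 0.7501..., 0.9571...  (O1, O2 outputs)
dO1 = 0.7501 * (1 - 0.7501) * (1.0 - 0.7501)   #  0.0469 (error term for O1)
dO2 = 0.9571 * (1 - 0.9571) * (0.0 - 0.9571)   # -0.0393 (error term for O2)
print((1.1 * dO1) + (3.1 * dO2))          # ≈ -0.070 (backprop to H1)
```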
18
Example Continued
Part 4 Adjust each weight: new weight = old
weight + scaling factor * error * input
19
Over or Under Training
  • The scaling factor controls how quickly the
    network can learn, so why not make it a large
    value?
  • What the NN is actually doing is performing a
    task called gradient descent
  • weights are adjusted based on the derivative of
    the cost function
  • the learning algorithm is searching for the
    absolute minimum value; however, because we are
    moving in small leaps, we might get stuck in a
    local minimum
  • a local minimum may fit the training set well,
    but not the testing set
  • So we control just how well the NN learns to
    classify the domain by
  • the scaling factor
  • the number of epochs
  • the training data set
  • But also impacting this is the structure and size
    of the network (which also impacts the number of
    epochs that it might take to train the network)
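A toy illustration of getting trapped: gradient descent on a one-dimensional function with two minima (the function and step size are invented for illustration):

```python
def grad_descent(df, x, lr, steps=1000):
    for _ in range(steps):
        x -= lr * df(x)   # step downhill along the derivative
    return x

# f(x) = x^4 - 3x^2 + x has a local minimum near x = 1.13
# and its absolute minimum near x = -1.30
df = lambda x: 4 * x**3 - 6 * x + 1   # derivative of f

print(grad_descent(df, x=2.0, lr=0.01))   # stuck at the local minimum ~1.13
print(grad_descent(df, x=-2.0, lr=0.01))  # finds the absolute minimum ~-1.30
```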

20
What a Neural Network Learns
  • There has been some confusion regarding what a NN
    can do and what it learns
  • The weights that a NN learns are a form of
    distributed representation, more specifically a
    distributed statistical model of what features
    are important for a given class
  • Aside from the input and output nodes, the hidden
    layer nodes do not represent any single thing;
    instead, groups of them represent intermediate
    concepts in the domain/problem being learned

The facial recognition NN (on the right) has
learned to recognize what direction a face is
turned (up, right, left, or straight). The hidden
layer's three nodes, when analyzed, are storing
the pixels that make up the three rough images
of a face turned in one of the directions
21
Problems with NNs
  • In terms of learning, NNs surpass most of the
    previously mentioned methods because they learn
    via non-linear regression
  • A NN might get stuck in a local minimum,
    resulting in excellent performance on the
    training set but poor performance on the test set
  • The number of epochs (iterations through the
    training set) is highly variable
  • it might take a few dozen epochs; in other cases,
    a million epochs
  • There is no way to predict, given the structure
    of a network, how well or how quickly it will
    learn
  • NNs are not understandable by us, so we can't
    really tell what the NN has learned or how the
    information is represented
  • NNs cannot generate explanations
  • NNs do poorly in knowledge-intensive problems
    (e.g., diagnosis) but very well in some
    recognition problems (e.g., OCR)
  • NNs have a fixed-size input, so problems that
    deal with temporal issues (e.g., speech
    recognition) are problematic, but recurrent NNs
    are one way to possibly get around this problem

22
Avoiding Some of These Problems
To avoid getting stuck in a local minimum, one
strategy is to use an additional factor called
momentum, which in effect changes the scaling
factor over time. One form of this is called
simulated annealing.
To avoid overfitting the training set, do not
judge convergence by accuracy on the training
set; instead, every so often, run the testing set
and use the accuracy on that set to judge
convergence
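A sketch of a momentum term folded into the weight update (the momentum coefficient of 0.9 is a typical but assumed value):

```python
def momentum_update(w, grad, velocity, lr=0.1, mu=0.9):
    """Blend the previous step into the new one; the effective step
    grows on long consistent slopes, which can carry the search
    through shallow local minima."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity
```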
23
HMM Learning
  • Known as the EM algorithm or Baum-Welch algorithm
  • Use one training set item with observations o1,
    o2, …, on
  • Work through the HMM, one observation at a time
  • Once you have fed forward this example
  • for each time interval t and each state
    transition from i at time t to j at time t+1,
    compute the estimator probability of a
    transition from i to j
  • ξt(i, j) = αt(i) * aij * bj(Ot+1) * βt+1(j)
  • where αt+1(i) = Σj (αt(j) * aji) * bi(Ot+1)
  • and βt(i) = Σj aij * bj(Ot+1) * βt+1(j)
  • aij is the transition probability from i to j
  • and bi(Ot) is the output probability, which is
    the probability of observable Ot being seen at
    state i
  • Now modify each transition probability aij and
    output probability bi(Ot) as follows
  • New aij = expected number of transitions from i
    to j (the summed estimators) / expected number
    of transitions out of i
  • New bi(Ot) = Σt where Ot was observed αt(i) *
    βt(i) / expected number of times in state i
  • When done with this iteration, replace the old
    transition probabilities with the new
    probabilities and repeat with the next training
    set example until either the HMM converges or
    you have depleted the examples
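A sketch of the forward/backward quantities for a tiny two-state HMM; the transition matrix, output probabilities, and observation sequence are invented:

```python
# a[i][j]: transition probability i -> j; b[i][o]: P(observation o | state i)
a = [[0.7, 0.3], [0.4, 0.6]]
b = [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}]
pi = [0.5, 0.5]                      # initial state distribution
obs = ["x", "y", "x"]
N, T = len(a), len(obs)

# forward: alpha[t][i] = P(o_1..o_t, state i at time t)
alpha = [[pi[i] * b[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * a[j][i] for j in range(N)) * b[i][obs[t]]
                  for i in range(N)])

# backward: beta[t][i] = P(o_{t+1}..o_T | state i at time t)
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(a[i][j] * b[j][obs[t+1]] * beta[t+1][j] for j in range(N))
               for i in range(N)]

# estimator of a transition i -> j at time t, as on the slide
def xi(t, i, j):
    return alpha[t][i] * a[i][j] * b[j][obs[t+1]] * beta[t+1][j]
```

Re-estimating aij and bi(Ot) then amounts to normalizing sums of these estimators, per the formulas above.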

24
Genetic Algorithms
  • Learning through manipulation of a feature space
  • The state is a vector representing features
  • binary vector - feature is present or absent
  • multi-valued vector - features represented by a
    discrete or continuous value
  • Supervised learning requiring a method of
    determining how good a given feature vector is
  • learning is viewed as a search problem: what is
    the ideal or optimal vector?
  • Natural selection techniques will (hopefully)
    improve the performance of the search during
    successive iterations (called generations)
  • this form of learning can be used to learn
    recognition knowledge, control knowledge,
    planning/design knowledge, diagnostic knowledge
  • The genetics come in by considering the vector to
    be a chromosome, which is mutated by various
    random operations and then evaluated; the most
    fit chromosomes survive to become parents of the
    next generation

25
General Procedure for GAs
  • Repeat the following until you have either
    exceeded the stated number of generations or
    found a suitable vector
  • Start with a population of parent vectors
  • Breed children through mutation operations
  • Apply the fitness function to the children
  • Select those children which will become parents
    of the next generation
  • Decisions
  • What is the fitness function? Is there a
    reasonable one available?
  • What mutation operations should be applied and
    how randomly? Should children be very similar to
    the parents or highly different?
  • How many children should be selected for the next
    generation? How many children should be produced
    by the parents?
  • How is selection going to take place?
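A minimal sketch of this loop; the keep-the-best-half selection and the convergence threshold are illustrative assumptions:

```python
import random

def genetic_algorithm(population, fitness, breed,
                      generations=100, good_enough=0.99):
    """breed(p1, p2) makes a child vector via crossover/mutation."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        if fitness(ranked[0]) >= good_enough:   # a suitable vector was found
            break
        parents = ranked[:len(ranked) // 2]     # fitness ranking: keep best half
        population = [breed(random.choice(parents), random.choice(parents))
                      for _ in range(len(population))]
    return max(population, key=fitness)
```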

26
Fitness and Selection
  • Unlike other forms of supervised learning, where
    feedback is a previously known classification or
    value, here the feedback on the worth of a
    vector is in the form of a fitness function
  • given a vector V, apply the function f(V)
  • use this value to determine the vector's worth
    toward the next generation
  • a vector that is highly rated may be selected in
    forming the next generation of vectors, whereas a
    vector that is lowly rated will probably not be
    used (unless randomly selected)
  • How do you determine which vectors to
    alter/mutate?
  • Fitness Ranking - use a fitness function to
    select the best available vector (or vectors) and
    use it (them)
  • Rank Method - use the fitness function but do not
    select the best, use probabilities instead
  • Random Selection - in addition to the top
    vector(s), some approaches randomly select some
    number of vectors from the remaining, lesser
    ranked ones
  • Diversity - determine which vectors are the most
    diverse from the top ranked one(s) and select it
    (them)
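A sketch of the Rank Method as roulette-wheel selection (assuming the fitness function returns nonnegative scores):

```python
import random

def roulette_select(vectors, fitness):
    """Selection probability proportional to fitness score."""
    scores = [fitness(v) for v in vectors]
    total = sum(scores)
    r = random.uniform(0, total)        # spin the wheel
    running = 0.0
    for v, s in zip(vectors, scores):
        running += s
        if running >= r:
            return v
    return vectors[-1]                  # guard against round-off
```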

27
Mutation and Selection Mechanisms
  • Standard mutation methods are (each is sketched
    in code after this list)
  • inversion: reversing the order of a segment of
    the vector
  • If p1 = (1, 2, 3, 4, 5, 6), then this might
    result in (1, 5, 4, 3, 2, 6)
  • mutation: changing a feature's value to another
    value
  • crossover (requires two chromosomes): randomly
    swap some portion of the two vectors
  • If p1 = (5, 4, 3, 2, 6, 1) and p2 = (1, 6, 2, 3,
    4, 5), crossover may yield the two children (5,
    4, 2, 3, 4, 1) and (1, 6, 3, 2, 6, 5)
  • How do you determine which vectors to
    alter/mutate?
  • Fitness ranking: select the best available
    vectors
  • Rank Method: rank the vectors as scored by the
    fitness function and then use a probabilistic
    mechanism for selection
  • if v1 scores .5, v2 .3, v3 .15, and v4 .05, then
    v1 has a 50% chance of being selected, v2 a 30%
    chance, v3 a 15% chance, and v4 a 5% chance
  • Random Selection: select the top vector(s) and
    select the remainder by random selection
  • Diversity: select the top vector(s) and then
    select the remainder by finding the most diverse
    from the ones already selected
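A sketch of the three mutation operations on vectors represented as Python tuples; the random segment choices are illustrative:

```python
import random

def inversion(v):
    """Reverse a random segment, e.g. (1,2,3,4,5,6) -> (1,5,4,3,2,6)."""
    i, j = sorted(random.sample(range(len(v)), 2))
    return v[:i] + v[i:j+1][::-1] + v[j+1:]

def mutate(v, values):
    """Change one feature's value to another value from its domain."""
    i = random.randrange(len(v))
    return v[:i] + (random.choice(values),) + v[i+1:]

def crossover(p1, p2):
    """Swap a random middle portion of the two parent vectors."""
    i, j = sorted(random.sample(range(len(p1)), 2))
    return (p1[:i] + p2[i:j+1] + p1[j+1:],
            p2[:i] + p1[i:j+1] + p2[j+1:])

p1, p2 = (5, 4, 3, 2, 6, 1), (1, 6, 2, 3, 4, 5)
print(crossover(p1, p2))   # e.g. (5, 4, 2, 3, 4, 1) and (1, 6, 3, 2, 6, 5)
```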

28
Genetic Programming
  • This form of learning is most commonly applied to
    programming code
  • unlike the GA approach, here the representation
    is some dynamic structure, commonly a tree
  • the process of inversion, mutation or crossover
    is applied
  • Since trees are formed out of syntactic parses of
    programs, we can manipulate a program using this
    approach
  • notice that by randomly manipulating a program,
    it may no longer be syntactically valid; however,
    if we just use crossover, the result will
    hopefully remain syntactically valid (why?)
  • What kind of fitness function might be used?

29
Other Forms of Learning
  • Reinforcement learning
  • A variation on supervised learning: a learner
    must determine what action to take in a given
    situation to maximize its reward; it does this
    through trial and error rather than through
    training examples
  • reinforcement learning is not a new learning
    technique but rather a type of problem, which can
    be solved by any of a number of techniques,
    including those already seen (NNs, HMMs, etc.)
  • Unsupervised learning
  • No training set, no feedback, a form of discovery
  • Commonly uses either a Bayesian inference to
    produce probabilities, or a statistical approach
    and clustering to produce class descriptions
  • mostly a topic for data mining, also sometimes
    referred to as discovery

30
Knowledge-based Learning
  • Back in the 1970s, machine learning mostly
    revolved around learning new concepts in a
    knowledge base
  • Version spaces: offering positive and negative
    examples of a class to learn the features that
    distinguish items that are in versus out of the
    class; see for example
  • http://www.site.uottawa.ca/nat/Courses/CSI5387/ML_Lecture_2.ppt
  • http://www.cs.cf.ac.uk/Dave/AI2/node146.html
  • Explanation-based learning: given a KB, offer
    one or more examples of a concept and have the
    system add representations that fit the new
    concepts being learned; a commonly cited example
    is to add to a chess program's capability by
    understanding the strategy of a fork; see for
    example
  • http://www.cs.cf.ac.uk/Dave/AI2/node148.html#SECTION000162000000000000000
  • Analogy: taking a model in one domain and
    applying it to another domain, often done through
    case-based reasoning
  • Discovery: finding patterns in data, what we now
    call data mining; one early example was a system
    called BACON, which analyzed data to find laws
    (and also reasoned using analogy)
  • it was able to infer Kepler's third law, Ohm's
    law, Joule's law, and the conservation of
    momentum by analyzing data