Machine Learning CS 165B Spring 2012 - PowerPoint PPT Presentation


Machine Learning, CS 165B, Spring 2012
Catalog description
  • Prerequisites: Computer Science 130A
  • Covers the most important techniques of machine
    learning (ML), including discussions of
    well-posed learning problems, artificial neural
    networks, concept learning and general-to-specific
    ordering, decision tree learning, genetic
    algorithms, Bayesian learning, analytical
    learning, and others.

Prerequisite concepts
  • Abstract data types
  • Stacks, queues, lists, trees, graphs, …
  • Discrete mathematics
  • Functions, relations, induction, logic, proofs, …
  • Programming
  • Algorithms and complexity

UCSB CS sequence: 165A and 165B
  • 165A (Artificial Intelligence) is a companion course
  • Prerequisites: Computer Science 130A; open to
    computer science majors only
  • Introduction to the field of artificial
    intelligence, which seeks to understand and build
    intelligent computational systems. Topics include
    intelligent agents, problem solving and heuristic
    search, knowledge representation and reasoning,
    uncertainty, probabilistic reasoning, and more.
  • No specific ordering of 165A and 165B is required

Course organization
  • Grading
  • 25% homework assignments
  • 25% term project
  • 20% midterm
  • 30% final examination
  • Policy
  • Cheating and plagiarism: F grade and
    disciplinary actions
  • Online info
  • Email
  • Teaching assistant: Minh Hoang
  • Reader: Erdinc Korpeoglu

  • The main focus of this course is to introduce you
    to a systematic study of machine learning (ML)
  • To teach you the main ideas of ML
  • To introduce you to a set of key techniques and
    algorithms from ML
  • To help you understand what's hard in ML and why
  • To see how ML relates to the rest of computer
    science
  • To get you thinking about how ML can be applied
    to a variety of real problems

  • Machine Learning by Tom Mitchell
  • But other material will be used, so the ultimate
    source should be my lectures.
  • Lecture notes will be posted, but they will not
    contain all the discussed material.

Course outline
  • Overview of machine learning
  • Concept Learning
  • Decision Tree Learning
  • Artificial Neural Networks
  • Probability Theory
  • Bayesian Learning
  • Computational Learning Theory
  • Instance-Based Learning
  • Genetic Algorithms
  • Support Vector Machines

This week's goals
  • Peruse the course web site
  • Assigned reading (Chapters 1 and 2)
  • Swap of Monday 4/2 lecture and Friday 4/6
    discussion section (first week)
  • Swap of Wednesday 4/11 lecture and Friday 4/13
    discussion section (second week)
  • First discussion session: Monday 4/9
  • Review of relevant prerequisites: probability and
    statistics, …

  • Near synonyms
  • Data mining
  • Pattern recognition
  • Computational statistics
  • Advanced or predictive analytics (used in …)
  • Artificial Intelligence

Machine Learning vs. Statistics
  • Differences in terminology
  • Ridge regression = weight decay
  • Fitting = learning
  • Held-out data = test data
  • The emphasis is very different
  • A good piece of statistics: clever proof that a
    relatively simple estimation procedure is
    asymptotically unbiased.
  • A good piece of machine learning: demonstration
    that a complicated algorithm produces impressive
    results on a specific task.
  • Data mining: using machine learning techniques on
    very large databases.

Machine Learning in Popular Media
  • Netflix Prize (Wikipedia entry)
  • Jeopardy Challenge video 1
  • Jeopardy Challenge video 2
  • Nokia Mobile Data Challenge

DARPA Grand Challenge (2004, 2005)
  • "DARPA intends to conduct a challenge of
    autonomous ground vehicles between Los Angeles
    and Las Vegas in March of 2004. A cash award of
    $1 million will be granted to the team that
    fields the first vehicle to complete the
    designated route within a specified time limit."

DARPA Urban Challenge (2007)
  • The DARPA Urban Challenge was held on November
    3, 2007, at the former George AFB in Victorville,
    Calif. Building on the success of the 2004 and
    2005 Grand Challenges, this event required teams
    to build an autonomous vehicle capable of driving
    in traffic, performing complex maneuvers such as
    merging, passing, parking, and negotiating
    intersections.
  • This event was groundbreaking as the first time
    autonomous vehicles interacted with both manned
    and unmanned vehicle traffic in an urban
    environment.
  • $2 million first prize.
  • A small video.

What is Machine Learning?
  • Optimize a performance criterion using example
    data or past experience.
  • Role of statistics: inference from a sample
  • Role of computer science: efficient algorithms to
  • Solve an optimization problem
  • Represent and evaluate the model for inference
  • Learning is used when
  • Human expertise does not exist (navigating on
    Mars)
  • Humans are unable to explain their expertise
    (speech recognition)
  • Solution changes with time (routing on a computer
    network)
  • Solution needs to be adapted to particular cases
    (user biometrics)
  • There is no need to learn to calculate payroll

What We Talk About When We Talk About Learning
  • Learning general models from data of particular
    examples
  • Data is cheap and abundant (data warehouses, data
    marts); knowledge is expensive and scarce.
  • Example in retail: customer transactions to
    consumer behavior
  • People who bought "The Da Vinci Code" also bought
    "The Five People You Meet in Heaven"
  • Build a model that is a good and useful
    approximation to the data.

Types of Learning Tasks
  • Association
  • Supervised learning
  • Learn to predict output when given an input
  • Reinforcement learning
  • Learn actions to maximize payoff
  • Payoff is often delayed
  • Exploration vs. exploitation
  • Online setting
  • Unsupervised learning
  • Create an internal representation of the input,
    e.g. form clusters, extract features
  • How do we know if a representation is good?
  • Big datasets do not come with labels.

Learning Associations
  • Basket analysis
  • P(Y | X): probability that somebody who buys X
    also buys Y, where X and Y are products/services.
  • Example: P(chips | beer) = 0.7
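A conditional probability like P(chips | beer) can be estimated directly from transaction counts. A minimal sketch in Python; the basket data below is made up for illustration and is not course data:

```python
# Hypothetical basket data; the items and counts are illustrative.
transactions = [
    {"chips", "beer", "salsa"},
    {"beer", "chips"},
    {"beer"},
    {"chips", "beer", "bread"},
    {"beer", "milk"},
]

def conditional_prob(transactions, x, y):
    """Estimate P(Y | X): fraction of baskets containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

print(conditional_prob(transactions, "beer", "chips"))  # 3 of 5 beer baskets -> 0.6
```

With more transactions the estimate approaches the true conditional probability, which is exactly the sampling assumption discussed later in these slides.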

Classification
  • Example: credit scoring
  • Differentiating between low-risk and high-risk
    customers from their income and savings

Discriminant: IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
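The discriminant above is a pair of threshold tests. A minimal sketch; the threshold values θ1 and θ2 below are hypothetical stand-ins for parameters that would be learned from data:

```python
# Hypothetical thresholds standing in for learned parameters theta1, theta2.
THETA1 = 30_000   # income threshold (assumed value)
THETA2 = 10_000   # savings threshold (assumed value)

def credit_risk(income, savings, theta1=THETA1, theta2=THETA2):
    """IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    return "low-risk" if income > theta1 and savings > theta2 else "high-risk"

print(credit_risk(50_000, 20_000))  # low-risk
print(credit_risk(50_000, 5_000))   # high-risk
```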
Classification Applications
  • Aka pattern recognition
  • Face recognition: pose, lighting, occlusion
    (glasses, beard), make-up, hair style
  • Character recognition: different handwriting
    styles
  • Speech recognition: temporal dependency.
  • Use of a dictionary or the syntax of the
    language
  • Sensor fusion: combine multiple modalities, e.g.,
    visual (lip image) and acoustic for speech
  • Medical diagnosis: from symptoms to illnesses
  • ...

Face Recognition
Training examples of a person
Test images
The Role of Learning
Regression
  • Example: price of a used car
  • x: car attributes
  • y: price
  • y = g(x, θ)
  • g(): model
  • θ: parameters

y = wx + w0
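The linear model y = wx + w0 can be fit by ordinary least squares. A sketch in plain Python; the used-car data below is made up (ages and prices are illustrative, not real figures):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w*x + w0 (one feature, closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var            # slope: covariance over variance
    w0 = mean_y - w * mean_x  # intercept: line passes through the means
    return w, w0

# Hypothetical car data: x = age in years, y = price in $1000s.
xs = [1, 2, 3, 4, 5]
ys = [18, 16, 14, 12, 10]   # exactly linear here: y = -2x + 20
w, w0 = fit_line(xs, ys)
print(w, w0)  # -2.0 20.0
```

With noisy data the same formulas give the line minimizing the sum of squared vertical errors.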
Supervised Learning Uses
  • Prediction of future cases: use the rule to
    predict the output for future inputs
  • Knowledge extraction: the rule is easy to
    understand
  • Compression: the rule is simpler than the data it
    explains
  • Outlier detection: exceptions that are not
    covered by the rule, e.g., fraud

Unsupervised Learning
  • Learning what normally happens
  • Clustering: grouping similar instances
  • Example applications
  • Customer segmentation in CRM (customer
    relationship management)
  • Image compression: color quantization
  • Bioinformatics: learning motifs

Displaying the structure of a set of documents
Example: Cancer Diagnosis
  • Application: automatic disease detection
  • Importance: this is modern/future medical care
  • Prediction goal: based on past patients, predict
    whether you have the disease
  • Data: past patients with and without the disease
  • Target: cancer or no-cancer
  • Features: concentrations of various proteins in
    your blood

Example: Netflix
  • Application: automatic product recommendation
  • Importance: this is modern/future shopping.
  • Prediction goal: based on past preferences,
    predict which movies you might want to watch
  • Data: past movies you have watched
  • Target: like or don't-like
  • Features: ?

Example: Zipcodes
  • Application: automatic zipcode recognition
  • Importance: this is modern/future delivery of
    small goods.
  • Goal: based on your handwritten digits, predict
    what they are and use them to route mail
  • Data: black-and-white pixel values
  • Target: which digit
  • Features: ?

What makes a 2?

Example: Google
  • Application: automatic ad selection
  • Importance: this is modern/future advertising.
  • Prediction goal: based on your search query,
    predict which ads you might be interested in
  • Data: past queries
  • Target: whether the ad was clicked
  • Features: ?

Example: Call Centers
  • Application: automatic call routing
  • Importance: this is modern/future customer
    service.
  • Prediction goal: based on your speech recording,
    predict which words you said
  • Data: past recordings of various people
  • Target: which word was intended
  • Features: ?

Example: Stock Market
  • Application: automatic program trading
  • Importance: this is modern/future finance.
  • Prediction goal: based on past patterns, predict
    whether the stock will go up
  • Data: past stock prices
  • Target: up or down
  • Features: ?

Web-based examples of machine learning
  • The web contains a lot of data. Tasks with very
    big datasets often use machine learning,
    especially if the data is noisy or
    non-stationary.
  • Spam filtering, fraud detection
  • The enemy adapts, so we must adapt too.
  • Recommendation systems
  • Lots of noisy data. Million dollar prize!
  • Information retrieval
  • Find documents or images with similar content.

What is a Learning Problem?
  • Learning involves performance improving
  • at some task T
  • with experience E
  • evaluated in terms of a performance measure P
  • Example: learn to play checkers
  • Task T: playing checkers
  • Experience E: playing against itself
  • Performance P: percent of games won
  • What exactly should be learned?
  • How might this be represented?
  • What specific algorithm should be used?

Develop methods, techniques and tools for
building intelligent learning machines that can
solve the problem in combination with an
available data set of training examples. When
a learning machine improves its performance at a
given task over time, without reprogramming, it
can be said to have learned something.
Learning Example
  • Example from the machine/computer vision field:
  • learn to recognise objects from a visual scene or
    an image
  • T: identify all objects
  • P: accuracy (e.g. the number of objects correctly
    identified)
  • E: a database of objects recorded

Components of a Learning Problem
  • Task: the behavior or task that's being improved,
    e.g. classification, object recognition, acting
    in an environment.
  • Data: the experiences that are being used to
    improve performance in the task.
  • Measure of improvement: how can the improvement
    be measured? Examples:
  • Provide more accurate solutions (e.g. increasing
    the accuracy in prediction)
  • Cover a wider range of problems
  • Obtain answers more economically (e.g. improved
    speed)
  • Simplify codified knowledge
  • New skills that were not present initially

Learning and Generalization
H. Simon: "Learning denotes changes in the system
that are adaptive in the sense that they enable
the system to do the task or tasks drawn from the
same population more efficiently and more
effectively the next time."
Generalization: the ability to perform a task in a
situation which has never been encountered before
Hypothesis Space
  • One way to think about a supervised learning
    machine is as a device that explores a
    hypothesis space.
  • Each setting of the parameters in the machine is
    a different hypothesis about the function that
    maps input vectors to output vectors.
  • If the data is noise-free, each training example
    rules out a region of hypothesis space.
  • If the data is noisy, each training example
    scales the posterior probability of each point in
    the hypothesis space in proportion to how likely
    the training example is given that hypothesis.
  • The art of supervised machine learning is in:
  • Deciding how to represent the inputs and outputs
  • Selecting a hypothesis space that is powerful
    enough to represent the relationship between
    inputs and outputs but simple enough to be
    searched.
Generalization
  • The real aim of supervised learning is to do well
    on test data that is not known during learning.
  • Choosing the values for the parameters that
    minimize the loss function on the training data
    is not necessarily the best policy.
  • We want the learning machine to model the true
    regularities in the data and to ignore the noise
    in the data.
  • But the learning machine does not know which
    regularities are real and which are accidental
    quirks of the particular set of training examples
    we happen to pick.
  • So how can we be sure that the machine will
    generalize correctly to new data?

Goodness of Fit vs. Model Complexity
  • It is intuitively obvious that you can only
    expect a model to generalize well if it explains
    the data surprisingly well given the complexity
    of the model.
  • If the model has as many degrees of freedom as
    the data, it can fit the data perfectly, but so
    what?
  • There is a lot of theory about how to measure
    model complexity and how to control it to
    optimize generalization.

A Sampling Assumption
  • Assume that the training examples are drawn
    independently from the set of all possible
    examples.
  • Assume that each time a training example is
    drawn, it comes from an identical distribution
    (i.i.d.).
  • Assume that the test examples are drawn in
    exactly the same way (i.i.d.) and from the same
    distribution as the training data.
  • These assumptions make it very unlikely that a
    strong regularity in the training data will be
    absent in the test data.

A Simple Example: Fitting a Polynomial (from Bishop)
  • The green curve is the true function (which is
    not a polynomial).
  • The data points are uniform in x but have noise
    in y.
  • We will use a loss function that measures the
    squared error in the prediction of y(x) from x.
    The loss for the red polynomial is the sum of the
    squared vertical errors.
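The sum-of-squared-errors loss described above can be computed directly. A sketch; the tiny dataset below is made up and stands in for Bishop's noisy points:

```python
def poly(coeffs, x):
    """Evaluate a polynomial with coefficients [w0, w1, ...] at x."""
    return sum(w * x ** i for i, w in enumerate(coeffs))

def sum_squared_error(coeffs, data):
    """Loss of a fit: sum of squared vertical errors over (x, y) pairs."""
    return sum((y - poly(coeffs, x)) ** 2 for x, y in data)

# Tiny hypothetical dataset (not Bishop's actual points).
data = [(0.0, 0.1), (0.5, 1.0), (1.0, -0.1)]
print(sum_squared_error([0.0, 1.0], data))  # loss of the line y = x
```

Fitting means searching for the coefficient vector that makes this loss small.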

Some fits to the data: which is best? (from Bishop)

A simple way to reduce model complexity
  • If we penalize polynomials that have large
    coefficients, we will get less wiggly fits.

(from Bishop)
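The penalty idea can be written as a ridge-style regularized loss: squared error plus λ times the sum of squared coefficients. A sketch (the data, λ, and coefficient values are all made up) showing that, of two polynomials fitting the data equally well, the penalized loss prefers the one with smaller coefficients:

```python
def poly(coeffs, x):
    """Evaluate a polynomial with coefficients [w0, w1, ...] at x."""
    return sum(w * x ** i for i, w in enumerate(coeffs))

def ridge_loss(coeffs, data, lam):
    """Squared error plus lam * sum of squared coefficients (weight penalty)."""
    sse = sum((y - poly(coeffs, x)) ** 2 for x, y in data)
    penalty = lam * sum(w ** 2 for w in coeffs)
    return sse + penalty

data = [(0.0, 0.0), (1.0, 1.0)]
wiggly = [0.0, 9.0, -8.0]    # passes through the data but with big coefficients
smooth = [0.0, 1.0]          # the line y = x, also through the data
print(ridge_loss(wiggly, data, lam=0.01) > ridge_loss(smooth, data, lam=0.01))
```

Both candidates have zero squared error here, so the penalty term alone breaks the tie toward the smoother fit.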
Ockham's Razor
What Experience E to Use?
  • Direct or indirect?
  • Direct: feedback on individual moves
  • Indirect: feedback on a sequence of moves
  • e.g., whether the game is won or not
  • Teacher or not?
  • Teacher selects board states
  • Tailored learning
  • Can be more efficient
  • Learner selects board states
  • No teacher
  • Questions
  • Is the training experience representative of the
    performance goal?
  • Does the training experience represent the
    distribution of outcomes in the world?

What Exactly Should be Learned?
  • Playing checkers
  • Alternating moves with well-defined rules
  • Choose moves using some function
  • Call this function the target function
  • Target function (TF): the function to be learned
    during the learning process
  • ChooseMove: Board → Move
  • ChooseMove is difficult to learn, e.g., with
    indirect training examples
  • A key to successful learning is to choose an
    appropriate target function
  • Strategy: reduce learning to search for the TF
  • Alternative TF for checkers
  • V: Board → ℝ
  • Measures the quality of the board state
  • Generate all moves
  • Choose the move with the largest value

A Possible Target Function V For Checkers
  • In checkers, all legal moves are known
  • From these, choose the best move in any situation
  • Possible V function for checkers:
  • if b is a final board state that is a win, then
    V(b) = 100
  • if b is a final board state that is a loss, then
    V(b) = -100
  • if b is a final board state that is a draw, then
    V(b) = 0
  • if b is not a final state in the game, then
    V(b) = V(b′), where b′ is the best final board
    state that can be achieved starting from b and
    playing optimally until the end of the game
  • This gives correct values, but is not operational
  • So we may have to find a good approximation to V
  • Call this approximation V̂

How Might the Target Function be Represented?
  • Many possibilities (subject of this course)
  • As a collection of rules?
  • As a neural network?
  • As a polynomial function of board features?
  • Example of a linear function of board features:
  • V̂(b) = w0 + w1·bp(b) + w2·rp(b) + w3·bk(b) +
    w4·rk(b) + w5·bt(b) + w6·rt(b)
  • bp(b): number of black pieces on board b
  • rp(b): number of red pieces on b
  • bk(b): number of black kings on b
  • rk(b): number of red kings on b
  • bt(b): number of red pieces threatened by black
    (i.e., which can be taken on black's next turn)
  • rt(b): number of black pieces threatened by red
  • Generally, the more expressive the
    representation, the more difficult it is to
    learn
Obtaining Training Examples
  • With the learned function V̂:
  • Search over the space of weights to estimate the wi
  • Training values that are needed: Vtrain(b)
  • Some from prior experience, some generated
  • Example of a training example: (⟨3,0,1,0,0,0⟩, +100)
  • One rule for estimating training values:
    Vtrain(b) ← V̂(successor(b))
  • successor(b) is the next board state for which it
    is the program's turn to move
  • Used for intermediate values
  • Works well in practice
  • The issue now is how to estimate the weights wi

Example of LMS Weight Update Rule
  • Choose weights to minimize the squared error
    E ≡ Σ (Vtrain(b) − V̂(b))²
  • Do repeatedly:
  • Select a training example b at random
  • 1. Compute error(b) = Vtrain(b) − V̂(b)
  • 2. For each board feature xi, update weight wi:
    wi ← wi + c · xi · error(b)
  • 3. If error > 0, wi increases, and vice versa
Gradient descent
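The LMS rule above amounts to repeated updates wi ← wi + c·error·xi; with a small enough learning rate c, repeated steps drive V̂ toward the training value. A sketch; the feature vector, training value, and c are illustrative choices, not values from the course:

```python
def v_hat(weights, features):
    """Linear evaluation: w0 + sum(wi * xi)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, c=0.1):
    """One LMS step: w0 <- w0 + c*error; wi <- wi + c*error*xi."""
    error = v_train - v_hat(weights, features)
    new = [weights[0] + c * error]
    new += [w + c * error * x for w, x in zip(weights[1:], features)]
    return new

weights = [0.0] * 7
board = [3, 0, 1, 0, 0, 0]   # hypothetical features (bp, rp, bk, rk, bt, rt)
for _ in range(100):          # repeated steps shrink the error geometrically
    weights = lms_update(weights, board, v_train=100.0)
print(round(v_hat(weights, board), 3))  # close to 100.0
```

Each step moves V̂ toward Vtrain in proportion to the error, which is exactly a gradient-descent step on the squared error.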
Some Issues in Machine Learning
  • What algorithms can approximate functions well
    (and when)?
  • How does the number of training examples
    influence accuracy?
  • How does the complexity of the hypothesis
    representation impact learning?
  • How does noisy data influence accuracy?
  • What are the theoretical limits of learnability?
  • How can prior knowledge of the learner help?
  • What clues can we get from biological learning
    systems?
  • How can systems alter their own representations?

Learning Feedback
  • Learning feedback can be provided by the system
    environment or the agents themselves.
  • Supervised learning: specifies the desired
    activities/objectives of learning; feedback from
    a teacher
  • Unsupervised learning: no explicit feedback is
    provided and the objective is to find useful
    and desired activities on the basis of
    trial-and-error and self-organization processes;
    a passive observer
  • Reinforcement learning: specifies the utility of
    the actual activity of the learner and the
    objective is to maximize this utility; feedback
    from a critic

Ways of Learning
  • Rote learning, i.e. learning from memory in a
    mechanical way
  • Learning from examples and by practice
  • Learning from instructions/advice/explanations
  • Learning by analogy
  • Learning by discovery

Inductive and Deductive Learning
  • Inductive learning: reasoning from a set of
    examples to produce general rules. The rules
    should be applicable to new examples, but there
    is no guarantee that the result will be correct.
  • Deductive learning: reasoning from a set of known
    facts and rules to produce additional rules that
    are guaranteed to be true.

Assessment of Learning Algorithms
  • The most common criteria for assessing learning
    algorithms are:
  • Accuracy (e.g. percentage of correctly
    classified + and − examples)
  • Efficiency (e.g. number of examples needed,
    computational cost)
  • Robustness (e.g. against noise, against …)
  • Special requirements (e.g. incrementality,
    concept drift)
  • Concept complexity (e.g. representational issues,
    example bookkeeping)
  • Transparency (e.g. comprehensibility for the
    human user)

Some Theoretical Settings
  • Inductive Logic Programming (ILP)
  • Probably Approximately Correct (PAC) Learning
  • Learning as Optimization (Reinforcement Learning)
  • Bayesian Learning

Key Aspects of Learning
  • Learner: who or what is doing the learning, e.g.
    an algorithm, a computer program.
  • Domain: what is being learned, e.g. a function, a
    concept.
  • Goal: why the learning is done.
  • Representation: the way the objects to be learned
    are represented.
  • Algorithmic technology: the algorithmic framework
    to be used, e.g. decision trees, lazy learning,
    artificial neural networks, support vector
    machines.

An Owed to the Spelling Checker
  • I have a spelling checker.
  • It came with my PC
  • It plane lee marks four my revue
  • Miss steaks aye can knot sea.
  • Eye ran this poem threw it.
  • your sure reel glad two no.
  • Its vary polished in it's weigh
  • My checker tolled me sew.
  • ..

The Role of Learning
  • Learning is at the core of
  • Understanding High Level Cognition
  • Performing knowledge intensive inferences
  • Building adaptive, intelligent systems
  • Dealing with messy, real world data
  • Learning has multiple purposes
  • Knowledge Acquisition
  • integration of various knowledge sources to
    ensure robust behavior
  • Adaptation (human, systems)

Why Study Learning?
  • Computer systems with new capabilities.
  • Develop systems that are too difficult or
    impossible to construct manually.
  • Develop systems that can automatically adapt and
    customize themselves to the needs of their
    users through experience.
  • Discover knowledge and patterns in databases
    (database mining), e.g. discovering purchasing
    patterns for marketing purposes.

Why Study Learning?
  • Computer systems with new capabilities.
  • Understand human and biological learning
  • Understanding teaching better.
  • Time is right.
  • Initial algorithms and theory in place.
  • Growing amounts of on-line data
  • Computational power available.
  • Necessity many things we want to do cannot be
    done by programming.

Learning is the future
  • Learning techniques will be a basis for every
    application that involves a connection to the
    messy real world
  • Basic learning algorithms are ready for use in
    limited applications today
  • Prospects for broader future applications make
    for exciting fundamental research and development
  • Many unresolved issues in theory and systems

Work in Machine Learning
  • Artificial intelligence; theory; experimental CS
  • Makes use of:
  • Probability and statistics; linear algebra;
    theory of computation
  • Related to:
  • Philosophy, psychology (cognitive,
    developmental), neurobiology, linguistics
  • Has applications in:
  • AI (natural language, vision, planning, HCI)
  • Engineering (agriculture, civil, …)
  • Computer science (compilers, architecture,
    systems, databases)
  • Very active field: lots of applications in
    industry
  • What to teach?
  • The fundamental paradigms
  • Important algorithmic ideas

Training, validation, and test datasets
  • Divide the total dataset into three subsets:
  • Training data is used for learning the parameters
    of the model.
  • Validation data is not used for learning but is
    used for avoiding overfitting.
  • Test data is used to get a final, unbiased
    estimate of how well the learning method works.
    We expect this estimate to be worse than on the
    training/validation data.
  • Often reduced to just training and testing
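A common way to produce the three subsets is to shuffle the data and slice it; the 60/20/20 fractions below are an assumed choice for illustration, not a course requirement:

```python
import random

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle and split into train/validation/test subsets (fractions assumed)."""
    items = list(data)
    random.Random(seed).shuffle(items)   # fixed seed for a reproducible split
    n_train = int(train_frac * len(items))
    n_val = int(val_frac * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # the remainder is held out for testing
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before slicing matters: it is what makes the split consistent with the i.i.d. sampling assumption discussed earlier.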

The Bayesian Framework
  • The Bayesian framework assumes that we always
    have a prior distribution for everything.
  • The prior may be very vague.
  • When we see some data, we combine our prior
    distribution with a likelihood term to get a
    posterior distribution.
  • The likelihood term takes into account how
    probable the observed data is given the
    parameters of the model.
  • It favors parameter settings that make the data
    likely.
  • It fights the prior.
  • With enough data, the likelihood terms always win.
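The prior-times-likelihood recipe can be shown on a toy coin-flipping problem, scoring a small grid of candidate coin biases; the grid, prior, and flip counts below are all made up for illustration:

```python
# Prior x likelihood -> posterior, on a small grid of coin biases.
thetas = [0.1, 0.5, 0.9]
prior = {t: 1 / 3 for t in thetas}   # vague prior: all biases equally likely

def posterior(prior, heads, tails):
    """Combine the prior with the likelihood of the observed flips, then normalize."""
    unnorm = {t: p * t ** heads * (1 - t) ** tails for t, p in prior.items()}
    z = sum(unnorm.values())         # normalizing constant
    return {t: v / z for t, v in unnorm.items()}

post = posterior(prior, heads=8, tails=2)   # data favors a heads-biased coin
print(max(post, key=post.get))  # 0.9
```

With only a few flips the prior still matters; as heads and tails grow, the likelihood term dominates, which is the "with enough data the likelihood terms always win" point above.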