
CS194-10 Fall 2011: Introduction to Machine Learning

Machine Learning: An Overview

People

- Avital Steinitz, 2nd-year CS PhD student
- Stuart Russell, 30th-year CS PhD student
- Mert Pilanci, 2nd-year EE PhD student

Administrative details

- Web page
- Newsgroup

Course outline

- Overview of machine learning (today)
- Classical supervised learning
- Linear regression, perceptrons, neural nets, SVMs, decision trees, nearest neighbors, and all that
- A little bit of theory, a lot of applications
- Learning probabilistic models
- Probabilistic classifiers (logistic regression, etc.)
- Unsupervised learning, density estimation, EM
- Bayes net learning
- Time series models
- Dimensionality reduction
- Gaussian process models
- Language models
- Bandits and other exciting topics

Lecture outline

- Goal: provide a framework for understanding all the detailed content to come, and why it matters
- Learning: why and how
- Supervised learning
- Classical: finding simple, accurate hypotheses
- Probabilistic: finding likely hypotheses
- Bayesian: updating belief in hypotheses
- Data and applications
- Expressiveness and cumulative learning
- CTBT

Learning is...

- a computational process for improving performance based on experience

Learning: Why?

- "The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion" - William James, 1890

Learning is essential for unknown environments, i.e., when the designer lacks omniscience

Learning: Why?

- "Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain. Presumably the child brain is something like a notebook as one buys it from the stationer's. Rather little mechanism, and lots of blank sheets." - Alan Turing, 1950
- Learning is useful as a system construction method, i.e., expose the system to reality rather than trying to write it down

Learning: How?

Structure of a learning agent

Design of learning element

- Key questions:
- What is the agent design that will implement the desired performance?
- Improve the performance of what piece of the agent system, and how is that piece represented?
- What data are available relevant to that piece? (In particular, do we know the right answers?)
- What knowledge is already available?

Examples

| Agent design | Component | Representation | Feedback | Knowledge |
| --- | --- | --- | --- | --- |
| Alpha-beta search | Evaluation function | Linear polynomial | Win/loss | Rules of game; coefficient signs |
| Logical planning agent | Transition model (observable envt) | Successor-state axioms | Action outcomes | Available actions; argument types |
| Utility-based patient monitor | Physiology/sensor model | Dynamic Bayesian network | Observation sequences | Gen. physiology; sensor design |
| Satellite image pixel classifier | Classifier (policy) | Markov random field | Partial labels | Coastline; continuity scales |

- Supervised learning: correct answers for each training instance
- Reinforcement learning: reward sequence, no correct answers
- Unsupervised learning: just make sense of the data

Supervised learning

- To learn: an unknown target function f
- Input: a training set of labeled examples (xj, yj) where yj = f(xj)
- E.g., xj is an image, f(xj) is the label "giraffe"
- E.g., xj is a seismic signal, f(xj) is the label "explosion"
- Output: hypothesis h that is close to f, i.e., predicts well on unseen examples (the test set)
- Many possible hypothesis families for h
- Linear models, logistic regression, neural networks, decision trees, examples (nearest-neighbor), grammars, kernelized separators, etc.
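To make the setup concrete: a minimal sketch (mine, not the slides'; data and hypothesis family are synthetic placeholders) of fitting a linear hypothesis to labeled pairs and checking prediction on held-out examples:

```python
# Minimal supervised-learning sketch: fit a linear hypothesis h_w(x) = w^T x
# to training pairs (x_j, y_j), then check prediction error on unseen data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # 100 examples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)    # y_j = f(x_j) + noise

X_train, y_train = X[:80], y[:80]                   # training set
X_test, y_test = X[80:], y[80:]                     # unseen test set

w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)   # choose h from H
test_mse = np.mean((X_test @ w - y_test) ** 2)          # generalization check
print("learned w:", w.round(2), "test MSE:", round(test_mse, 4))
```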


Example: object recognition

[Figure: training images x with their labels f(x): three giraffes, three llamas]

[Figure: the same labeled images plus a new, unlabeled image x; the task is to predict f(x)]

Example: curve fitting

[Figures: the same data set fitted with hypotheses of increasing complexity]
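The curve-fitting figures aren't reproduced in this transcript, but the effect they show can be demonstrated numerically. A hedged sketch on synthetic data: as polynomial degree grows, training error falls while held-out error eventually rises:

```python
# Curve-fitting sketch: compare training vs. held-out error for polynomial
# hypotheses of increasing degree on noisy samples of a smooth function.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)
x_tr, y_tr = x[::2], y[::2]          # half for training
x_te, y_te = x[1::2], y[1::2]        # half held out

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)          # fit hypothesis
    tr_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    te_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree}: train MSE {tr_err:.3f}, test MSE {te_err:.3f}")
```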

Basic questions

- Which hypothesis space H to choose?
- How to measure degree of fit?
- How to trade off degree of fit vs. complexity?
- Ockham's razor
- How do we find a good h?
- How do we know if a good h will predict well?

Philosophy of Science (Physics)

- Which hypothesis space H to choose?
- Deterministic hypotheses, usually mathematical formulas and/or logical sentences; implicit relevance determination
- How to measure degree of fit?
- Ideally, h will be consistent with data
- How to trade off degree of fit vs. complexity?
- Theory must be correct up to experimental error
- How do we find a good h?
- Intuition, imagination, inspiration (invent new terms!!)
- How do we know if a good h will predict well?
- Hume's Problem of Induction: most philosophers give up

Kolmogorov complexity (also MDL, MML)

- Which hypothesis space H to choose?
- All Turing machines (or programs for a UTM)
- How to measure degree of fit?
- Fit is perfect (program has to output data exactly)
- How to trade off degree of fit vs. complexity?
- Minimize size of program
- How do we find a good h?
- Undecidable (unless we bound time complexity of h)
- How do we know if a good h will predict well?
- (recent theory borrowed from PAC learning)
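A hedged illustration of the two-part-code idea (this scoring rule is my simplification, not the course's): charge each hypothesis bits for its own description plus bits for encoding the data's residuals under it, and pick the minimum:

```python
# MDL-flavored model selection: total description length =
# (bits to encode the polynomial's coefficients) +
# (Shannon code length of the residuals under a Gaussian model).
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
y = 3 * x - 1 + rng.normal(scale=0.1, size=50)       # truly linear data

def description_length(degree, bits_per_param=32):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    var = np.var(resid) + 1e-12                      # avoid log(0)
    model_bits = bits_per_param * (degree + 1)
    data_bits = 0.5 * len(y) * np.log2(2 * np.pi * np.e * var)
    return model_bits + data_bits

print(min(range(6), key=description_length))         # picks a low degree
```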

Classical stats/ML: minimize loss function

- Which hypothesis space H to choose?
- E.g., linear combinations of features: hw(x) = wTx
- How to measure degree of fit?
- Loss function, e.g., squared error Σj (yj - wTxj)2
- How to trade off degree of fit vs. complexity?
- Regularization: complexity penalty, e.g., ||w||2
- How do we find a good h?
- Optimization (closed-form, numerical); discrete search
- How do we know if a good h will predict well?
- Try it and see (cross-validation, bootstrap, etc.)
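A minimal sketch of this recipe, taking ridge regression as the concrete instance (my choice of example): minimize squared error plus a complexity penalty lam * ||w||^2 in closed form, and pick lam by "try it and see" on held-out data:

```python
# Ridge regression: w = argmin ||y - Xw||^2 + lam ||w||^2, closed form;
# the penalty weight lam is chosen by validation error.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 10))
w_true = np.zeros(10)
w_true[:3] = (2.0, -1.0, 0.5)                       # only 3 features matter
y = X @ w_true + rng.normal(scale=0.5, size=60)
X_tr, y_tr, X_va, y_va = X[:40], y[:40], X[40:], y[40:]

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (0.0, 0.1, 1.0, 10.0):
    w = ridge(X_tr, y_tr, lam)
    print(f"lam={lam}: validation MSE {np.mean((X_va @ w - y_va)**2):.3f}")
```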

Probabilistic: max. likelihood, max. a posteriori

- Which hypothesis space H to choose?
- Probability model P(y | x, h), e.g., Y ~ N(wTx, σ2)
- How to measure degree of fit?
- Data likelihood ∏j P(yj | xj, h)
- How to trade off degree of fit vs. complexity?
- Regularization or prior: argmaxh P(h) ∏j P(yj | xj, h) (MAP)
- How do we find a good h?
- Optimization (closed-form, numerical); discrete search
- How do we know if a good h will predict well?
- Empirical process theory (generalizes Chebyshev, CLT, PAC)
- Key assumption is (i)id
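A sketch of how this view reduces to the previous one, under a Gaussian noise model (assumed here for illustration): the negative log-likelihood is squared error up to constants, and a Gaussian prior on w makes MAP coincide with ridge regression:

```python
# For Y ~ N(w^T x, sigma^2): maximizing likelihood == minimizing squared
# error, and a Gaussian prior w ~ N(0, tau^2 I) turns MAP into ridge.
import numpy as np

def neg_log_likelihood(w, X, y, sigma=1.0):
    resid = y - X @ w
    return (0.5 * resid @ resid / sigma**2
            + 0.5 * len(y) * np.log(2 * np.pi * sigma**2))

def neg_log_posterior(w, X, y, sigma=1.0, tau=1.0):
    # the prior contributes the ridge penalty ||w||^2 / (2 tau^2)
    return neg_log_likelihood(w, X, y, sigma) + 0.5 * w @ w / tau**2

# argmin_w of neg_log_posterior == ridge solution with lam = sigma^2 / tau^2
```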

Bayesian: computing posterior over H

- Which hypothesis space H to choose?
- All hypotheses with nonzero a priori probability
- How to measure degree of fit?
- Data probability, as for MLE/MAP
- How to trade off degree of fit vs. complexity?
- Use prior, as for MAP
- How do we find a good h?
- Don't! Bayes predictor: P(y | x, D) ∝ Σh P(y | x, h) P(D | h) P(h)
- How do we know if a good h will predict well?
- Silly question! Bayesian prediction is optimal!!
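A minimal sketch of the Bayes predictor over a small discrete hypothesis space (the coin-flip setting is my illustration, not the slides'): no single h is selected; each h contributes with weight prior × likelihood:

```python
# Bayesian prediction by enumeration: P(y | x, D) = sum_h P(y | h) P(h | D),
# with P(h | D) proportional to P(D | h) P(h).
import numpy as np

hypotheses = np.array([0.25, 0.5, 0.75])     # h = P(heads)
prior = np.array([1/3, 1/3, 1/3])
data = [1, 1, 0, 1]                          # observed coin flips

likelihood = np.array([np.prod([h if d else 1 - h for d in data])
                       for h in hypotheses])
posterior = prior * likelihood
posterior /= posterior.sum()                 # normalize by P(D)

print("P(h | D) =", posterior.round(3))
print("P(next flip = heads | D) =", round(hypotheses @ posterior, 3))
```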


[Figure: neon sculpture at Autonomy Corp.]


Lots of data

- Web: estimated Google index 45 billion pages
- Clickstream data: 10-100 TB/day
- Transaction data: 5-50 TB/day
- Satellite image feeds: 1 TB/day/satellite
- Sensor networks/arrays
- CERN Large Hadron Collider: 100 petabytes/day
- Biological data: 1-10 TB/day/sequencer
- TV: 2 TB/day/channel; YouTube: 4 TB/day uploaded
- Digitized telephony: 100 petabytes/day


Real data are messy

[Figure: arterial blood pressure traces (high/low/mean), 1 s sampling]

Application: satellite image analysis

Application: Discovering DNA motifs

- ...TTGGAACAACCATGCACGGTTGATTCGTGCCTGTGACCGCGCGCCTCACACGGAAGACGCAGCCACCGGTTGTGATG
- TCATAGGGAATTCCCCATGTCGTGAATAATGCCTCGAATGATGAGTAATAGTAAAACGCAGGGGAGGTTCTTCAGTAGTA
- TCAATATGAGACACATACAAACGGGCGTACCTACCGCAGCTCAAAGCTGGGTGCATTTTTGCCAAGTGCCTTACTGTTAT
- CTTAGGACGGAAATCCACTATAAGATTATAGAAAGGAAGGCGGGCCGAGCGAATCGATTCAATTAAGTTATGTCACAAGG
- GTGCTATAGCCTATTCCTAAGATTTGTACGTGCGTATGACTGGAATTAATAACCCCTCCCTGCACTGACCTTGACTGAAT
- AACTGTGATACGACGCAAACTGAACGCTGCGGGTCCTTTATGACCACGGATCACGACCGCTTAAGACCTGAGTTGGAGTT
- GATACATCCGGCAGGCAGCCAAATCTTTTGTAGTTGAGACGGATTGCTAAGTGTGTTAACTAAGACTGGTATTTCCACTA
- GGACCACGCTTACATCAGGTCCCAAGTGGACAACGAGTCCGTAGTATTGTCCACGAGAGGTCTCCTGATTACATCTTGAA
- GTTTGCGACGTGTTATGCGGATGAAACAGGCGGTTCTCATACGGTGGGGCTGGTAAACGAGTTCCGGTCGCGGAGATAAC
- TGTTGTGATTGGCACTGAAGTGCGAGGTCTTAAACAGGCCGGGTGTACTAACCCAAAGACCGGCCCAGCGTCAGTGA...


Application: User website behavior from clickstream data (from P. Smyth, UCI)

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,

[Figure: inferred page-category sequences for Users 1-5]
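The model behind this figure isn't spelled out in the transcript; one standard ingredient of Smyth's mixture-of-Markov-chains approach is a transition matrix over page categories, estimated from sequences like those above. A toy sketch:

```python
# Estimate a Markov transition matrix over page categories from
# per-user clickstream sequences (toy data; categories are integers).
import numpy as np

sequences = [[3, 3, 3, 3, 1, 3, 1, 1, 1, 3], [7, 7, 7, 1, 1, 5, 1, 5, 1, 1]]
n_categories = 8
counts = np.ones((n_categories, n_categories))   # add-one smoothing
for seq in sequences:
    for a, b in zip(seq, seq[1:]):
        counts[a, b] += 1
transition = counts / counts.sum(axis=1, keepdims=True)
print(transition[3].round(2))   # next-page distribution after category 3
```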

Application: social network analysis

[Figure: HP Labs email data; 500 users, 20k connections evolving over time]

Application: spam filtering

- 200 billion spam messages sent per day
- Asymmetric cost of false positive/false negative
- Weak label: discarded without reading
- Strong label ("this is spam") hard to come by
- Standard iid assumption violated: spammers alter spam generators to evade or subvert spam filters (adversarial learning task)
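The slides list the difficulties rather than a method; a common baseline for the core classification step is Naive Bayes over word counts, sketched here on toy data (my example, not the course's):

```python
# Naive Bayes spam filter sketch with add-one smoothing.
from collections import Counter
import math

spam_docs = ["buy cheap meds now", "cheap meds cheap offer"]
ham_docs = ["meeting at noon today", "notes from the noon meeting"]
vocab = {w for d in spam_docs + ham_docs for w in d.split()}

def word_logprobs(docs):
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts.values())
    return {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}

spam_lp, ham_lp = word_logprobs(spam_docs), word_logprobs(ham_docs)

def is_spam(msg):
    words = [w for w in msg.split() if w in vocab]
    spam_score = math.log(0.5) + sum(spam_lp[w] for w in words)
    ham_score = math.log(0.5) + sum(ham_lp[w] for w in words)
    return spam_score > ham_score      # compare log-posteriors

print(is_spam("cheap meds"))           # True
```

The asymmetric-cost point on the slide would be handled by shifting the decision threshold rather than comparing scores directly.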

Learning

[Diagram, built up over several slides: data and prior knowledge flow into learning, which produces knowledge; that knowledge feeds back as prior knowledge for subsequent learning]

Crucial open problem: weak intermediate forms of knowledge that support future generalizations

Example: arriving at São Paulo, Brazil

"Bem-vindo!" ("Welcome!")

Weak prior knowledge

- In this case: people in a given country (and city) tend to speak the same language
- Where did this knowledge come from?
- Experience with other countries
- Common sense, i.e., knowledge of how societies and languages work
- And where did that knowledge come from?

"Knowledge? What is knowledge? All I know is samples!!" - V. Vapnik

- All knowledge derives, directly or indirectly, from experience of individuals
- Knowledge serves as a directly applicable shorthand for all that experience, better than requiring constant review of the entire sensory/evolutionary history of the human race

Expressiveness

The world has things in it!!

- Expressive language ⇒ concise models
- ⇒ fast learning, sometimes fast reasoning
- E.g., rules of chess:
- 1 page in first-order logic: On(color, piece, x, y, t)
- ~100,000 pages in propositional logic: WhiteKingOnC4Move12
- ~10^38 pages as atomic-state model: R.B.KB.RPPP..PPP..N..N..PP.q.pp..Q..n..n..ppp..pppr.b.kb.r
- Note: chess is a tiny problem compared to the real world

Brief history of expressiveness

Dates shown on the slide: logic from the 5th C B.C. and the 19th C; probability from the 17th C (atomic), 20th C (propositional), and 21st C (first-order/relational).

|             | atomic | propositional | first-order/relational |
| --- | --- | --- | --- |
| probability | Bernoulli, categorical, univariate Gaussian, (H)MMs | Bayes nets, MRFs, multivariate Gaussians, DBNs, Kalman filters | RPMs, BLOG, MLNs, (DBLOG) |
| logic       | Finite automata | OBDDs, k-CNF, decision trees, perceptrons, propositional STRIPS, register circuits | First-order logic, database systems, programs, first-order STRIPS, temporal logic |

CTBT: Comprehensive Nuclear-Test-Ban Treaty

- Bans testing of nuclear weapons on earth
- Allows for outside inspection of 1000 km2
- 182/195 states have signed
- 153/195 have ratified
- Need 9 more ratifications, including US, China
- US Senate refused to ratify in 1998: "too hard to monitor"

[Figure: 2053 nuclear explosions]

[Figure: 254 monitoring stations]

The problem

- Given waveform traces from all seismic stations, figure out what events occurred, when, and where
- Traces at each sensor station may be preprocessed to form detections (90% are not real):

ARID ORID STA PH BEL DELTA SEAZ ESAZ TIME TDEF AZRES ADEF SLORES SDEF WGT VMODEL LDDATE
49392708 5295499 WRA P -1.0 23.673881 342.00274 163.08123 0.19513991 d -1.2503497 d 0.24876981 d -999.0 0.61806399 IASP 2009-04-02 12:54:27
49595064 5295499 FITZ P -1.0 20.835616 4.3960142 184.18581 1.2515257 d 2.7290018 d 5.4541182 n -999.0 0.46613527 IASP 2009-04-02 12:54:27
49674189 5295499 MKAR P -1.0 58.574266 124.26633 325.35514 -0.053738765 d -4.6295428 d 1.5126035 d -999.0 0.76750542 IASP 2009-04-02 12:54:27
49674227 5295499 ASAR P -1.0 27.114852 345.18433 166.42383 -0.71255454 d -6.4901126 d 0.95510033 d -999.0 0.66453657 IASP 2009-04-02 12:54:27

What do we know?

- Events happen randomly; each has a time, location, depth, magnitude; seismicity varies with location
- Seismic waves of many kinds ("phases") travel through the Earth
- Travel time and attenuation depend on phase and source/destination
- Arriving waves may or may not be detected, depending on sensor and local noise environment
- Local noise may also produce false detections


- #SeismicEvents ~ Poisson[TIME_DURATION * EVENT_RATE]
- IsEarthQuake(e) ~ Bernoulli(.999)
- EventLocation(e) ~ If IsEarthQuake(e) then EarthQuakeDistribution() Else UniformEarthDistribution()
- Magnitude(e) ~ Exponential(log(10)) + MIN_MAG
- Distance(e,s) = GeographicalDistance(EventLocation(e), SiteLocation(s))
- IsDetected(e,p,s) ~ Logistic[SITE_COEFFS(s,p)](Magnitude(e), Distance(e,s))
- #Arrivals(site = s) ~ Poisson[TIME_DURATION * FALSE_RATE(s)]
- #Arrivals(event = e, site = s) = If IsDetected(e,s) then 1 else 0
- Time(a) ~ If (event(a) = null) then Uniform(0, TIME_DURATION) else IASPEI(EventLocation(event(a)), SiteLocation(site(a)), Phase(a)) + TimeRes(a)
- TimeRes(a) ~ Laplace(TIMLOC(site(a)), TIMSCALE(site(a)))
- Azimuth(a) ~ If (event(a) = null) then Uniform(0, 360) else GeoAzimuth(EventLocation(event(a)), SiteLocation(site(a))) + AzRes(a)
- AzRes(a) ~ Laplace(0, AZSCALE(site(a)))
- Slow(a) ~ If (event(a) = null) then Uniform(0, 20) else IASPEI-SLOW(EventLocation(event(a)), SiteLocation(site(a))) + SlowRes(site(a))
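A hedged Python rendering of the first few dependencies above, with placeholder constants and stand-in distributions (the calibrated priors of the real model aren't given in the slides):

```python
# Sample event count, type, location, and magnitude from the generative
# model's first lines. EarthQuakeDistribution is replaced by a uniform
# distribution over the sphere here, purely as a placeholder.
import numpy as np

rng = np.random.default_rng(4)
TIME_DURATION, EVENT_RATE, MIN_MAG = 3600.0, 0.001, 2.0   # toy constants

n_events = rng.poisson(TIME_DURATION * EVENT_RATE)        # #SeismicEvents
for e in range(n_events):
    is_quake = rng.random() < 0.999                       # IsEarthQuake(e)
    lon = rng.uniform(-180, 180)                          # uniform on sphere
    lat = np.degrees(np.arcsin(rng.uniform(-1, 1)))       # stand-in location
    magnitude = rng.exponential(1 / np.log(10)) + MIN_MAG # Magnitude(e)
    print(e, is_quake, (round(lon, 1), round(lat, 1)), round(magnitude, 2))
```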

Learning with prior knowledge

- Instead of learning a mapping from detection histories to event bulletins, learn local pieces of an overall structured model:
- Event location prior (A6)
- Predictive travel time model (A1)
- Phase type classifier (A2)

Event location prior (A6)

Travel time prediction (A1)

- How long does it take for a seismic signal to get from A to B? This is the travel time T(A, B)
- If we know this accurately, and we know the arrival times t1, t2, t3, ... at several stations B1, B2, B3, ..., we can find an accurate estimate of the location A and time t for the event, such that T(A, Bi) ≈ ti - t for all i (see the sketch below)
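A minimal localization sketch under an assumed constant-velocity travel-time model (a toy stand-in for the IASPEI tables): solve for A and t by nonlinear least squares on the residuals T(A, Bi) - (ti - t):

```python
# Locate an event from arrival times at known stations by minimizing
# sum_i (T(A, B_i) - (t_i - t))^2 over location A and origin time t.
import numpy as np
from scipy.optimize import least_squares

stations = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [80.0, 90.0]])
VELOCITY = 6.0                                  # km/s, toy constant velocity

def travel_time(A, B):
    return np.linalg.norm(A - B) / VELOCITY

true_A, true_t = np.array([30.0, 40.0]), 5.0
arrivals = np.array([true_t + travel_time(true_A, B) for B in stations])

def residuals(params):
    A, t = params[:2], params[2]
    return [travel_time(A, B) - (ti - t) for B, ti in zip(stations, arrivals)]

fit = least_squares(residuals, x0=[50.0, 50.0, 0.0])
print(fit.x)                                    # ~ [30, 40, 5]
```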

Earth 101

Seismic phases (wave types/paths)

- Seismic energy is emitted in different types of waves; there are also qualitatively distinct paths (e.g., direct vs. reflected from surface vs. refracted through core). P and S are the direct waves; P is faster


IASP91 reference velocity model

- Spherically symmetric Vphase(depth); from this, obtain Tpredicted(A, B)

IASP91 inaccuracy is too big!

- Earth is inhomogeneous: variations in crust thickness and rock properties (fast and slow)

Travel time residuals (Tactual - Tpredicted)

- Residual surface (w.r.t. a particular station) is locally smooth: estimate by local regression (see the sketch below)
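A minimal sketch of such a local estimate, using Nadaraya-Watson kernel smoothing as an assumed stand-in for whichever local regressor is actually used:

```python
# Estimate the travel-time residual surface at a query location as a
# Gaussian-kernel-weighted average of observed residuals nearby.
import numpy as np

def local_residual_estimate(query, locations, residuals, bandwidth=5.0):
    d2 = np.sum((locations - query) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / bandwidth**2)        # kernel weights
    return w @ residuals / w.sum()

rng = np.random.default_rng(5)
locs = rng.uniform(0, 100, size=(200, 2))       # toy event locations
resid = np.sin(locs[:, 0] / 20) + rng.normal(scale=0.1, size=200)
print(local_residual_estimate(np.array([50.0, 50.0]), locs, resid))
```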