Title: Exploring Massive Learning via a Prediction System
1. Exploring Massive Learning via a Prediction System
- Omid Madani
- Yahoo! Research
www.omadani.net
2. Goal
- Convey a taste of the motivations/considerations/assumptions/speculations/hopes
- The game, a 1st system, and its algorithms
3. Talk Overview
- Motivational part
- The approach
- The game (categories, ...)
- Algorithms
- Some experiments
4. Fill in the Blank(s)!
- Would ---- like ------ ------- ----- ------ ?
- (you, your, coffee, with, sugar)
5. What is this object?
6. Categorization is Fundamental!
- "Well, categorization is one of the most basic functions of living creatures. We live in a categorized world: table, chair, male, female, democracy, monarchy... every object and event is unique, but we act towards them as members of classes." From an interview with Eleanor Rosch (psychologist, a pioneer on the phenomenon of basic-level concepts)
- "Concepts are the glue that holds our mental world together." From The Big Book of Concepts, Gregory Murphy
7. "Rather, the formation and use of categories is the stuff of experience." Philosophy in the Flesh, Lakoff and Johnson
8. Two Questions Arise
- Repeated and rapid classification in the presence of myriad categories:
- How to categorize efficiently?
- How to efficiently learn to categorize efficiently?
9. Now, a 3rd Question...
- How can so many inter-related categories be acquired?
- Programming them is unlikely to succeed or scale:
  - Limits of our explicit/conscious knowledge
  - Unknown/unfamiliar domains
  - The required scale
  - Making the system operational
10. Learn? How?
- Supervised learning (explicit human involvement) is likely inadequate
- Required scale (or a good signpost): millions of categories and beyond, billions of weights and beyond
- Inaccessible knowledge (see last slide!)
- Other approaches likely do not meet the needs (incomplete, different goals, etc.): active learning, semi-supervised learning, clustering, density learning, RL, etc.
11. Desiderata/Requirements (or Speculations)
- Higher intelligence, such as advanced pattern recognition/generation (e.g. vision), may require:
  - Long-term learning (weeks, months, years, ...)
  - Cumulative learning (learn these first, then these, then these, ...)
  - Massive learning: myriad inter-related categories/concepts
  - Systems learning
  - Autonomy (relatively little human involvement)
- What's the learning task?
12. This Work: An Exploration
- An avenue: prediction games in infinitely rich worlds
- Exciting part:
  - World provides unbounded learning opportunity! (the world is the validator, the system is the experimenter!.. and actively builds much of its own concepts)
  - World enjoys many regularities (e.g. hierarchical)
- Based in part on supervised techniques!! (discriminative, feedback driven, but the supervisory signal doesn't originate from humans)
13. In a Nutshell
[Diagram: a Prediction System observes a stream (e.g. "0011101110000..."), predicts what comes next, and updates. Low-level inputs: text characters; in vision: edges, curves, ... Learned categories: words, digits, phrases, phone numbers, faces, visual objects, home pages, sites, ...]
14. The Game
- Repeat:
  - Hide part(s) of the stream
  - Predict (use context)
  - Update
  - Move on
- Objective: predict better, subject to efficiency constraints
- In the process, categories at different levels of size and abstraction should be learned
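The game loop above can be sketched in a few lines. This is a toy illustration, not the Expedition implementation: the sliding window, the `FrequencyModel`, and all names here are assumptions, and the toy predictor uses only the left context.

```python
from collections import Counter, defaultdict

class FrequencyModel:
    """Toy predictor: guess the target most often seen after this left context."""
    def __init__(self):
        self.table = defaultdict(Counter)  # left context -> target counts

    def predict(self, left, right):
        seen = self.table[left]
        return seen.most_common(1)[0][0] if seen else ""

    def update(self, left, right, target):
        self.table[left][target] += 1

def play(stream, model, context_size=1):
    """Slide a window over the stream; hide the target, predict it, then update."""
    correct = total = 0
    for i in range(context_size, len(stream) - context_size):
        left = stream[i - context_size:i]            # context before the target
        right = stream[i + 1:i + 1 + context_size]   # context after the target
        target = stream[i]                           # the hidden part
        guess = model.predict(left, right)           # predict (use context)
        model.update(left, right, target)            # update, then move on
        correct += (guess == target)
        total += 1
    return correct / max(total, 1)                   # fraction predicted correctly
```

On a highly regular stream such as `"abababababab"`, even this toy model quickly starts predicting correctly, which is the point of the game: regularities in the world drive learning.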
15. Research Goals
- Conjecture: there is much value to be attained from this task
- Beyond language modeling: more advanced pattern recognition/generation
- If so, should yield a wealth of new problems (=> Fun)
16. Overview
- Goal: convey a taste of the motivations/considerations, the system and algorithms, ...
- Motivation
- The approach
- The game (categories, ...)
- Algorithms
- Some experiments
17. Upshot
- Takes streams of text
- Makes categories (strings)
- Approx. three hours on 800k documents
- Large-scale discriminative learning (evidence: better than language modeling)
18. Caveat Emptor!
- Exploratory research
- Many open problems (many I'm not aware of!)
- The chosen algorithms, system organization, objective/performance measures, etc., are likely not even near the best possible
19. Categories
- Building blocks (atoms!) of intelligence?
- Patterns that frequently occur (external, internal)
- Useful for predicting other categories!
- They can have structure/regularities, in particular:
  - Composition (conjunctions) of other categories (Part-Of)
  - Grouping (disjunctions) (Is-A relations)
20. Categories
- Low-level primitive examples: 0 and 1, or characters (a, b, ..., 0, -, ...)
  - Provided to the system (easy to detect)
- Higher/composite levels:
  - Sequences of bits/characters
  - Words
  - Phrases
  - More general: phone number, contact info, resume, ...
21. Example Concept
- Area code is a concept that involves both composition and grouping:
  - Composition of 3 digits
  - A digit is a grouping, i.e., the set {0, 1, 2, ..., 9} ("2" is a digit)
- Other example concepts: phone number, address, resume page, face (in the visual domain), etc.
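The area-code example can be encoded directly: a grouping is a set of alternatives (Is-A), a composition is an ordered sequence of parts (Part-Of). This is a minimal sketch for illustration; the names `DIGIT`, `AREA_CODE`, and `matches` are made up, not taken from the system.

```python
# Grouping (disjunction / Is-A): a digit is any member of this set.
DIGIT = set("0123456789")

# Composition (conjunction / Part-Of): an area code is three digits in sequence.
AREA_CODE = [DIGIT, DIGIT, DIGIT]

def matches(composition, s):
    """True iff each character of s belongs to the corresponding grouping."""
    return len(s) == len(composition) and all(
        ch in part for ch, part in zip(s, composition))
```

So `matches(AREA_CODE, "415")` holds, while `"4a5"` fails the grouping test and `"41"` fails the composition's length.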
22. Again, our goal, informally, is to build a system that acquires millions of useful concepts on its own.
23. Questions for a First System
- Functionality? Architecture? Organization?
- Would many-class learning scale to millions of concepts?
- Choice of concept-building methods?
- How would various learning processes interact?
24. Expedition: a First System
- Plays the game in text
- Begins at the character level
- No segmentation, just a stream
- Makes and predicts larger sequences, via composition
- No grouping yet
25. Learning Episodes
[Diagram: a window over the stream "... New Jersey in ..." contains the context and the target. The middle category is the target (the category to predict); the categories on either side are the predictors (active categories). In this example, the context contains one category on each side.]
26. .. Some Time Later ..
[Diagram: a window over the stream "... loves New York life ..."; "New York" is now a single category and the target, with surrounding categories as predictors.]
- In terms of supervised learning/classification, in this learning activity (prediction games):
  - The set of concepts grows over time
  - Same for features/predictors (concepts ARE the predictors!)
  - The instance representation (segmentation of the data stream) changes/grows over time
27. Prediction/Recall
[Diagram: features f1-f4 on one side, categories c1-c5 on the other, connected by weighted edges.]
1. Features are activated
2. Edges are activated
3. Receiving categories are activated
4. Categories sorted/ranked
- Like the use of inverted indices
- Step 2 amounts to sparse dot products
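The four steps above can be sketched as sparse dot products over an inverted index: each active feature votes, through its outgoing edge weights, for the categories it predicts, and the receiving categories are ranked by total vote. The index contents and names below are illustrative, not the system's actual data structures.

```python
from collections import defaultdict

# Inverted index: feature -> {category: edge weight}.
# Only nonzero edges are stored, so scoring is a sparse dot product.
index = {
    "f1": {"c1": 0.5, "c2": 0.3},
    "f2": {"c2": 0.4, "c3": 0.2},
    "f3": {"c3": 0.6},
}

def rank(active_features):
    """Rank categories by the summed weights of edges from active features."""
    scores = defaultdict(float)
    for f in active_features:                 # 1. features are activated
        for c, w in index.get(f, {}).items(): # 2. their edges are activated
            scores[c] += w                    # 3. receiving categories accumulate
    # 4. categories sorted/ranked by score, highest first
    return sorted(scores, key=scores.get, reverse=True)
```

With `f1` and `f2` active, `c2` receives votes from both features (0.3 + 0.4) and outranks `c1` and `c3`.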
28. Updating a Feature's Connections
1. Identify the connection
2. Increase its weight
3. Normalize/weaken weights
4. Drop tiny weights
- Degrees are constrained
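The four update steps can be sketched on one feature's outgoing weight map: strengthen the edge to the observed target, renormalize (which implicitly weakens all other edges), and drop edges that fall below a threshold so node degrees stay bounded. The learning rate and threshold are made-up illustrative values, not the system's parameters.

```python
def update_edges(weights, target, rate=0.2, min_weight=0.01):
    """Update a feature's outgoing edge weights after observing `target`."""
    # 1-2. Identify the connection to the target and increase its weight.
    weights[target] = weights.get(target, 0.0) + rate
    # 3. Normalize so weights sum to 1, weakening all other edges.
    total = sum(weights.values())
    for c in list(weights):
        weights[c] /= total
        # 4. Drop tiny weights, keeping the node's degree constrained.
        if weights[c] < min_weight:
            del weights[c]
    return weights
```

Repeated updates concentrate weight on frequently observed targets while rarely seen edges eventually fall below the threshold and are pruned.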
29. Example Category Node (from Jane Austen's works)
[Diagram: a category node with weighted edges (weights between roughly 0.05 and 0.13) to related categories/fragments such as "and", "nei", "heart", "toge", "ther", "far", "love", "bro", "by".]
- A category node keeps track of various weights, such as edge (or prediction) weights and predictiveness weights, and other statistics (e.g. frequency, first/last time seen), and updates them when it is activated as a predictor or target.
30. Network
- Categories and their edges form a network (a directed weighted graph, with different kinds of edges ...)
- The network grows over time: millions of nodes and beyond
31. When and How to Compose?
- Two major approaches:
  - Pre-filter: don't compose if certain conditions are not met (simplest: only consider possibilities that you see)
  - Post-filter: compose and use, but remove if certain conditions are not met (e.g. if not seen recently enough, remove)
- I expect both are needed
32. Some Composition (Pre-filter) Heuristics
- FRAC: if you see c1 then c2 in the stream, then, with some probability, add c = c1c2
- MU: use the pointwise mutual information between c1 and c2
- IMPROVE: take string lengths into account and see whether joining is better
- BOUND: generate all strings under a length bound
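The MU heuristic can be sketched as follows: compose c1c2 only when the pointwise mutual information between adjacent occurrences of c1 and c2 is high enough. The counts, threshold, and function names here are illustrative assumptions, not the system's actual criteria.

```python
import math

def pmi(count_c1, count_c2, count_pair, n):
    """Pointwise mutual information log[ p(c1,c2) / (p(c1) p(c2)) ],
    estimated from occurrence counts over n stream positions."""
    p1, p2, p12 = count_c1 / n, count_c2 / n, count_pair / n
    return math.log(p12 / (p1 * p2))

def should_compose(count_c1, count_c2, count_pair, n, threshold=2.0):
    """MU pre-filter: compose c1c2 only if their PMI clears the threshold."""
    return pmi(count_c1, count_c2, count_pair, n) >= threshold
```

For example, if c1 and c2 each occur 10 times in 100 positions and always occur together, their PMI is ln(10) ≈ 2.3 and composition is allowed; if they co-occur only as often as chance predicts, the PMI is 0 and the pair is filtered out.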
33. Prediction Objective
- Desirable: learn higher-level categories (bigger/abstract categories are useful externally)
- Question: how does this relate to improving predictions?
  - Higher-level categories improve context and can save memory
  - Bigger categories save time in playing the game (categories are atomic)
34. Objective (evaluation criterion)
- The matching performance: number of bits (characters) correctly predicted per unit time, or per prediction action
- Subject to constraints (space, time, ...)
- How about entropy/perplexity?
  - Categories are structured, so perplexity seems difficult to use
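The matching performance can be sketched as a simple measure: for each prediction action, count the leading characters of the predicted string that agree with the stream, then average per action. What exactly counts as a match (leading agreement, here) is an assumption for illustration.

```python
def matching_performance(predictions, stream):
    """Characters correctly predicted per prediction action.

    `predictions` is a list of (position, predicted_string) pairs;
    each prediction is scored by its leading characters that match
    the stream at that position.
    """
    matched = 0
    for pos, pred in predictions:
        actual = stream[pos:pos + len(pred)]
        for a, b in zip(pred, actual):
            if a != b:
                break           # stop at the first mismatch
            matched += 1
    return matched / max(len(predictions), 1)
```

For instance, predicting "hel" at position 0 and "word" at position 6 of "hello world" yields 3 and 3 matched characters, for a matching performance of 3.0 characters per prediction action.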
35. Linearity and Non-Linearity (a motivation for new concept creation)
[Diagram: predicting what follows "new": aggregate the votes of the individual characters n, e, and w, versus using the single composed category "new". Which one predicts better (better constrains what comes next)?]
36. Data
- Reuters RCV1: 800k news articles
- Several online books of Jane Austen, etc.
- Web search query logs
37. Some Observations
- Ran on Reuters RCV1 (text body) (simply `zcat dir/file`)
- 800k articles
- > 150 million learning/prediction episodes
- Over 10 million categories built
- 3-4 hours each pass (depends on parameters)
38. Observations
- Performance on held-out data (one of the Reuters files)
- 8-9 characters long to predict on average
- Almost two characters correct on average, per prediction action
- Can overfit/memorize! (long categories)
- Currently, category generation stops after the first pass
40. Some Example Categories (in order of first-time appearance and increasing length)
- cat name "<"
- cat name " t"
- cat name ".</"
- cat name "p>- "
- cat name " the "
- cat name "ation "
- cat name "of the "
- cat name "ing the "
- cat name "&quot;The "
- cat name "company said "
- cat name ", the company "
- cat name "said on Tuesday"
- cat name " said on Tuesday"
- cat name ",&quot; said one "
- cat name ",&quot; he said.</p>"
- cat name "--------------------------------"
- cat name "--------------------------------------------------------"
41. Example Recall Paths
- From processing one month of Reuters:
- "Sinn Fei" (0.128) "n a seat" (0.527) " in the " (0.538) "talks." (0.468) "</p> <p>B" (0.0185) "rokers " (The end: connection weight less than 0.04)
- " Gas in S" (1) "cotland" (1.04) " and north" (1.18) "ern E" (0.572) "ngland" (0.165) ",&quot; a " (0.0542) "spokeswo" (0.551) "man said " (0.044) "the idea" (0.0869) " was to " (0.144) "&quot;" (0.164) "e the d" (0.0723) "ivision" (0.0671) " in N" (0.397) "ew York" (0.062) " where " (0.0557) "the main " (0.0474) "marque" (0.229) "s were " (0.253) "base" (0.264) "d. &quot;" (0.0451) "It will " (0.117) "certain" (0.0691) "ly b" (0.0892) "e New " (0.353) "York" (0.112) " party" (0.0917) "s is goin" (0.559) "g to " (0.149) "end.&quot;" (0.239) "</p> <p>T" (0.104) "wedish " (0.125) "Export" (0.0211) " Credi" (The end: connection weight less than 0.04)
42. Search Query Logs
- "bureoofi" (1) "migration" (1.13) "andci" (1.04) "tizenship." (0.31) "com www," (0.11) "ictions" (0.116) "zenship." (The end: this concept wasn't seen in the last 1000000 time points.)
- Random recall:
- "bureoofi" (1) "migration" (0.0129) "dept.com" (The end: this concept wasn't seen in the last 1000000 time points.)
43. Much Related Work!
- Online learning, cumulative learning, feature and concept induction, neural networks, clustering, Bayesian methods, language modeling, deep learning, hierarchical learning, the importance/ubiquity of predictions/anticipations in the brain (On Intelligence, natural computations, ...), models of neocortex (Circuits of the Mind), concepts and conceptual phenomena (e.g. The Big Book of Concepts), compression, ...
44. Summary
- Large-scale learning and classification (data hungry, efficiency paramount)
- A systems approach: integration of multiple learning processes
- The system makes its own classes
- Driving objective: improve prediction (currently matching performance)
- The underlying goal: effectively acquire complex concepts
- See www.omadani.net
45. Current/Future
- Much work:
- Integrate learning of groupings
- Recognize/use structural categories? (learn to parse/segment?)
- Prediction objective... is it OK?
- Control over the input stream, etc.
- Category generation: what are good methods?
- Other domains (vision, ...)
- Compare with language modeling, etc.