Title: Exploring Massive Learning via a Prediction System
1. Exploring Massive Learning via a Prediction System
- Omid Madani
- Yahoo! Research
www.omadani.net
2. Goal
- Convey a taste of the motivations/considerations/assumptions/speculations/hopes
- The game, a 1st system, and its algorithms
3. Talk Overview
- Motivational part
- The approach
- The game (categories, ...)
- Algorithms
- Some experiments
4. Fill in the Blank(s)!
- Would ---- like ------ ------- ----- ------ ?
- (you, your, coffee, with, sugar)
5. What is this object?
6. Categorization is Fundamental!
- "Well, categorization is one of the most basic functions of living creatures. We live in a categorized world: table, chair, male, female, democracy, monarchy... every object and event is unique, but we act towards them as members of classes." From an interview with Eleanor Rosch (psychologist, a pioneer on the phenomenon of basic-level concepts)
- "Concepts are the glue that holds our mental world together." From The Big Book of Concepts, Gregory Murphy
7. "Rather, the formation and use of categories is the stuff of experience." Philosophy in the Flesh, Lakoff and Johnson
8. Two Questions Arise
- Repeated and rapid classification in the presence of myriad categories:
- How to categorize efficiently?
- How to efficiently learn to categorize efficiently?
9. Now, a 3rd Question...
- How can so many inter-related categories be acquired?
- Programming them is unlikely to succeed or scale:
  - Limits of our explicit/conscious knowledge
  - Unknown/unfamiliar domains
  - The required scale
  - Making the system operational
10. Learn? How?
- Supervised learning (explicit human involvement) is likely inadequate
- Required scale (or a good signpost): millions of categories and beyond, billions of weights and beyond
- Inaccessible knowledge (see last slide!)
- Other approaches likely do not meet the needs (incomplete, different goals, etc.): active learning, semi-supervised learning, clustering, density learning, RL, etc.
11. Desiderata/Requirements (or Speculations)
- Higher intelligence, such as advanced pattern recognition/generation (e.g. vision), may require:
  - Long-term learning (weeks, months, years, ...)
  - Cumulative learning (learn these first, then these, then these, ...)
  - Massive learning: myriad inter-related categories/concepts
  - Systems learning
  - Autonomy (relatively little human involvement)
- What's the learning task?
12. This Work: An Exploration
- An avenue: prediction games in infinitely rich worlds
- Exciting part:
  - World provides unbounded learning opportunity! (the world is the validator, the system is the experimenter!.. and actively builds much of its own concepts)
  - World enjoys many regularities (e.g. hierarchical)
- Based in part on supervised techniques!! (discriminative, feedback driven, but the supervisory signal doesn't originate from humans)
13. In a Nutshell
[Diagram: a Prediction System observes a stream (e.g. "0011101110000..."), predicts what comes next, and updates. Low-level inputs: text characters; in vision: edges, curves, ... Learned categories: words, digits, phrases, phone numbers, faces, visual objects, home pages, sites, ...]
14. The Game
- Repeat:
  - Hide part(s) of the stream
  - Predict (use context)
  - Update
  - Move on
- Objective: predict better, subject to efficiency constraints
- In the process, categories at different levels of size and abstraction should be learned
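The game loop above can be sketched in a few lines. This is a toy illustration, not the Expedition implementation: the sliding window, the `FrequencyModel`, and all names here are assumptions, and the toy predictor uses only the left context.

```python
from collections import Counter, defaultdict

class FrequencyModel:
    """Toy predictor: guess the target most often seen after this left context."""
    def __init__(self):
        self.table = defaultdict(Counter)  # left context -> target counts

    def predict(self, left, right):
        seen = self.table[left]
        return seen.most_common(1)[0][0] if seen else ""

    def update(self, left, right, target):
        self.table[left][target] += 1

def play(stream, model, context_size=1):
    """Slide a window over the stream; hide the target, predict it, then update."""
    correct = total = 0
    for i in range(context_size, len(stream) - context_size):
        left = stream[i - context_size:i]            # context before the target
        right = stream[i + 1:i + 1 + context_size]   # context after the target
        target = stream[i]                           # the hidden part
        guess = model.predict(left, right)           # predict (use context)
        model.update(left, right, target)            # update, then move on
        correct += (guess == target)
        total += 1
    return correct / max(total, 1)                   # fraction predicted correctly
```

On a highly regular stream such as `"abababababab"`, even this toy model quickly starts predicting correctly, which is the point of the game: regularities in the world drive learning.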
15. Research Goals
- Conjecture: there is much value to be attained from this task
- Beyond language modeling: more advanced pattern recognition/generation
- If so, should yield a wealth of new problems (=> Fun)
16. Overview
- Goal: convey a taste of the motivations/considerations, the system and algorithms, ...
- Motivation
- The approach
- The game (categories, ...)
- Algorithms
- Some experiments
17. Upshot
- Takes streams of text
- Makes categories (strings)
- Approx. three hours on 800k documents
- Large-scale discriminative learning (evidence: better than language modeling)
18. Caveat Emptor!
- Exploratory research
- Many open problems (many I'm not aware of!)
- The chosen algorithms, system organization, objective/performance measures, etc., are likely not even near the best possible
19. Categories
- Building blocks (atoms!) of intelligence?
- Patterns that frequently occur (external, internal)
- Useful for predicting other categories!
- They can have structure/regularities, in particular:
  - Composition (conjunctions) of other categories (Part-Of)
  - Grouping (disjunctions) (Is-A relations)
20. Categories
- Low-level primitive examples: 0 and 1, or characters (a, b, ..., 0, -, ...)
  - Provided to the system (easy to detect)
- Higher/composite levels:
  - Sequences of bits/characters
  - Words
  - Phrases
  - More general: phone number, contact info, resume, ...
21. Example Concept
- Area code is a concept that involves both composition and grouping:
  - Composition of 3 digits
  - A digit is a grouping, i.e., the set {0, 1, 2, ..., 9} ("2" is a digit)
- Other example concepts: phone number, address, resume page, face (in the visual domain), etc.
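The area-code example can be encoded directly: a grouping is a set of alternatives (Is-A), a composition is an ordered sequence of parts (Part-Of). This is a minimal sketch for illustration; the names `DIGIT`, `AREA_CODE`, and `matches` are made up, not taken from the system.

```python
# Grouping (disjunction / Is-A): a digit is any member of this set.
DIGIT = set("0123456789")

# Composition (conjunction / Part-Of): an area code is three digits in sequence.
AREA_CODE = [DIGIT, DIGIT, DIGIT]

def matches(composition, s):
    """True iff each character of s belongs to the corresponding grouping."""
    return len(s) == len(composition) and all(
        ch in part for ch, part in zip(s, composition))
```

So `matches(AREA_CODE, "415")` holds, while `"4a5"` fails the grouping test and `"41"` fails the composition's length.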
22. Again, our goal, informally, is to build a system that acquires millions of useful concepts on its own.
23. Questions for a First System
- Functionality? Architecture? Organization?
- Would many-class learning scale to millions of concepts?
- Choice of concept-building methods?
- How would various learning processes interact?
24. Expedition: a First System
- Plays the game in text
- Begins at the character level
- No segmentation, just a stream
- Makes and predicts larger sequences, via composition
- No grouping yet
25. Learning Episodes
[Diagram: a window over the stream "... New Jersey in ..." contains the context and the target. The middle category is the target (the category to predict); the categories on either side are the predictors (active categories). In this example, the context contains one category on each side.]
26. .. Some Time Later ..
[Diagram: a window over the stream "... loves New York life ..."; "New York" is now a single category and the target, with surrounding categories as predictors.]
- In terms of supervised learning/classification, in this learning activity (prediction games):
  - The set of concepts grows over time
  - Same for features/predictors (concepts ARE the predictors!)
  - The instance representation (segmentation of the data stream) changes/grows over time
27. Prediction/Recall
[Diagram: features f1-f4 on one side, categories c1-c5 on the other, connected by weighted edges.]
1. Features are activated
2. Edges are activated
3. Receiving categories are activated
4. Categories sorted/ranked
- Like the use of inverted indices
- Step 2 amounts to sparse dot products
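The four steps above can be sketched as sparse dot products over an inverted index: each active feature votes, through its outgoing edge weights, for the categories it predicts, and the receiving categories are ranked by total vote. The index contents and names below are illustrative, not the system's actual data structures.

```python
from collections import defaultdict

# Inverted index: feature -> {category: edge weight}.
# Only nonzero edges are stored, so scoring is a sparse dot product.
index = {
    "f1": {"c1": 0.5, "c2": 0.3},
    "f2": {"c2": 0.4, "c3": 0.2},
    "f3": {"c3": 0.6},
}

def rank(active_features):
    """Rank categories by the summed weights of edges from active features."""
    scores = defaultdict(float)
    for f in active_features:                 # 1. features are activated
        for c, w in index.get(f, {}).items(): # 2. their edges are activated
            scores[c] += w                    # 3. receiving categories accumulate
    # 4. categories sorted/ranked by score, highest first
    return sorted(scores, key=scores.get, reverse=True)
```

With `f1` and `f2` active, `c2` receives votes from both features (0.3 + 0.4) and outranks `c1` and `c3`.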
28. Updating a Feature's Connections
1. Identify the connection
2. Increase its weight
3. Normalize/weaken weights
4. Drop tiny weights
- Degrees are constrained
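The four update steps can be sketched on one feature's outgoing weight map: strengthen the edge to the observed target, renormalize (which implicitly weakens all other edges), and drop edges that fall below a threshold so node degrees stay bounded. The learning rate and threshold are made-up illustrative values, not the system's parameters.

```python
def update_edges(weights, target, rate=0.2, min_weight=0.01):
    """Update a feature's outgoing edge weights after observing `target`."""
    # 1-2. Identify the connection to the target and increase its weight.
    weights[target] = weights.get(target, 0.0) + rate
    # 3. Normalize so weights sum to 1, weakening all other edges.
    total = sum(weights.values())
    for c in list(weights):
        weights[c] /= total
        # 4. Drop tiny weights, keeping the node's degree constrained.
        if weights[c] < min_weight:
            del weights[c]
    return weights
```

Repeated updates concentrate weight on frequently observed targets while rarely seen edges eventually fall below the threshold and are pruned.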
29. Example Category Node (from Jane Austen's works)
[Diagram: a category node with weighted edges (weights between roughly 0.05 and 0.13) to related categories/fragments such as "and", "nei", "heart", "toge", "ther", "far", "love", "bro", "by".]
- A category node keeps track of various weights, such as edge (or prediction) weights and predictiveness weights, and other statistics (e.g. frequency, first/last time seen), and updates them when it is activated as a predictor or target.
30. Network
- Categories and their edges form a network (a directed weighted graph, with different kinds of edges ...)
- The network grows over time: millions of nodes and beyond
31. When and How to Compose?
- Two major approaches:
  - Pre-filter: don't compose if certain conditions are not met (simplest: only consider possibilities that you see)
  - Post-filter: compose and use, but remove if certain conditions are not met (e.g. if not seen recently enough, remove)
- I expect both are needed
32. Some Composition (Pre-filter) Heuristics
- FRAC: if you see c1 then c2 in the stream, then, with some probability, add c = c1c2
- MU: use the pointwise mutual information between c1 and c2
- IMPROVE: take string lengths into account and see whether joining is better
- BOUND: generate all strings under a length bound
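The MU heuristic can be sketched as follows: compose c1c2 only when the pointwise mutual information between adjacent occurrences of c1 and c2 is high enough. The counts, threshold, and function names here are illustrative assumptions, not the system's actual criteria.

```python
import math

def pmi(count_c1, count_c2, count_pair, n):
    """Pointwise mutual information log[ p(c1,c2) / (p(c1) p(c2)) ],
    estimated from occurrence counts over n stream positions."""
    p1, p2, p12 = count_c1 / n, count_c2 / n, count_pair / n
    return math.log(p12 / (p1 * p2))

def should_compose(count_c1, count_c2, count_pair, n, threshold=2.0):
    """MU pre-filter: compose c1c2 only if their PMI clears the threshold."""
    return pmi(count_c1, count_c2, count_pair, n) >= threshold
```

For example, if c1 and c2 each occur 10 times in 100 positions and always occur together, their PMI is ln(10) ≈ 2.3 and composition is allowed; if they co-occur only as often as chance predicts, the PMI is 0 and the pair is filtered out.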
33. Prediction Objective
- Desirable: learn higher-level categories (bigger/abstract categories are useful externally)
- Question: how does this relate to improving predictions?
  - Higher-level categories improve context and can save memory
  - Bigger categories save time in playing the game (categories are atomic)
34. Objective (evaluation criterion)
- The matching performance: number of bits (characters) correctly predicted per unit time, or per prediction action
- Subject to constraints (space, time, ...)
- How about entropy/perplexity?
  - Categories are structured, so perplexity seems difficult to use
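The matching performance can be sketched as a simple measure: for each prediction action, count the leading characters of the predicted string that agree with the stream, then average per action. What exactly counts as a match (leading agreement, here) is an assumption for illustration.

```python
def matching_performance(predictions, stream):
    """Characters correctly predicted per prediction action.

    `predictions` is a list of (position, predicted_string) pairs;
    each prediction is scored by its leading characters that match
    the stream at that position.
    """
    matched = 0
    for pos, pred in predictions:
        actual = stream[pos:pos + len(pred)]
        for a, b in zip(pred, actual):
            if a != b:
                break           # stop at the first mismatch
            matched += 1
    return matched / max(len(predictions), 1)
```

For instance, predicting "hel" at position 0 and "word" at position 6 of "hello world" yields 3 and 3 matched characters, for a matching performance of 3.0 characters per prediction action.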
35. Linearity and Non-Linearity (a motivation for new concept creation)
[Diagram: predicting what follows "new": aggregate the votes of the individual characters n, e, and w, versus using the single composed category "new". Which one predicts better (better constrains what comes next)?]
36. Data
- Reuters RCV1: 800k news articles
- Several online books of Jane Austen, etc.
- Web search query logs
37. Some Observations
- Ran on Reuters RCV1 (text body) (simply `zcat dir/file`)
- 800k articles
- > 150 million learning/prediction episodes
- Over 10 million categories built
- 3-4 hours each pass (depends on parameters)
38. Observations
- Performance on held-out data (one of the Reuters files)
- 8-9 characters long to predict on average
- Almost two characters correct on average, per prediction action
- Can overfit/memorize! (long categories)
- Currently, category generation stops after the first pass
40. Some Example Categories (in order of first-time appearance and increasing length)
- cat name "<"
- cat name " t"
- cat name ".</"
- cat name "p>- "
- cat name " the "
- cat name "ation "
- cat name "of the "
- cat name "ing the "
- cat name "&quot;The "
- cat name "company said "
- cat name ", the company "
- cat name "said on Tuesday"
- cat name " said on Tuesday"
- cat name ",&quot; said one "
- cat name ",&quot; he said.</p>"
- cat name "--------------------------------"
- cat name "--------------------------------------------------------"
41. Example Recall Paths
- From processing one month of Reuters:
- "Sinn Fei" (0.128) "n a seat" (0.527) " in the " (0.538) "talks." (0.468) "</p> <p>B" (0.0185) "rokers " (The end: connection weight less than 0.04)
- " Gas in S" (1) "cotland" (1.04) " and north" (1.18) "ern E" (0.572) "ngland" (0.165) ",&quot; a " (0.0542) "spokeswo" (0.551) "man said " (0.044) "the idea" (0.0869) " was to " (0.144) "&quot;" (0.164) "e the d" (0.0723) "ivision" (0.0671) " in N" (0.397) "ew York" (0.062) " where " (0.0557) "the main " (0.0474) "marque" (0.229) "s were " (0.253) "base" (0.264) "d. &quot;" (0.0451) "It will " (0.117) "certain" (0.0691) "ly b" (0.0892) "e New " (0.353) "York" (0.112) " party" (0.0917) "s is goin" (0.559) "g to " (0.149) "end.&quot;" (0.239) "</p> <p>T" (0.104) "wedish " (0.125) "Export" (0.0211) " Credi" (The end: connection weight less than 0.04)
42. Search Query Logs
- "bureoofi" (1) "migration" (1.13) "andci" (1.04) "tizenship." (0.31) "com www," (0.11) "ictions" (0.116) "zenship." (The end: this concept wasn't seen in the last 1000000 time points.)
- Random recall:
- "bureoofi" (1) "migration" (0.0129) "dept.com" (The end: this concept wasn't seen in the last 1000000 time points.)
43. Much Related Work!
- Online learning, cumulative learning, feature and concept induction, neural networks, clustering, Bayesian methods, language modeling, deep learning, hierarchical learning, the importance/ubiquity of predictions/anticipations in the brain (On Intelligence, natural computations, ...), models of neocortex (Circuits of the Mind), concepts and conceptual phenomena (e.g. The Big Book of Concepts), compression, ...
44. Summary
- Large-scale learning and classification (data hungry, efficiency paramount)
- A systems approach: integration of multiple learning processes
- The system makes its own classes
- Driving objective: improve prediction (currently matching performance)
- The underlying goal: effectively acquire complex concepts
- See www.omadani.net
45. Current/Future
- Much work:
- Integrate learning of groupings
- Recognize/use structural categories? (learn to parse/segment?)
- Prediction objective... is it OK?
- Control over the input stream, etc.
- Category generation: what are good methods?
- Other domains (vision, ...)
- Compare with language modeling, etc.