Exploring Massive Learning via a Prediction System (PowerPoint transcript)
1
Exploring Massive Learning via a Prediction System
  • Omid Madani
  • Yahoo! Research

www.omadani.net
2
Goal
  • Convey a taste of the motivations/considerations/
    assumptions/speculations/hopes
  • The game, a first system, and its algorithms

3
Talk Overview
  • Motivational part
  • The approach
  • The game (categories, ...)
  • Algorithms
  • Some experiments

4
Fill in the Blank(s)!
Would ---- like ------ ------- ----- ------ ?
(The hidden words, revealed one by one in the original
slide animation: you, your, coffee, with, sugar: "Would
you like your coffee with sugar?")
5
What is this object?
6
Categorization is Fundamental!
  • "Well, categorization is one of the most basic
    functions of living creatures. We live in a
    categorized world: table, chair, male, female,
    democracy, monarchy. Every object and event is
    unique, but we act towards them as members of
    classes."
  • From an interview with Eleanor Rosch
    (psychologist, a pioneer on the phenomenon of
    basic-level concepts)
  • "Concepts are the glue that holds our mental
    world together." From The Big Book of
    Concepts, Gregory Murphy

7
"Rather, the formation and use of categories is
the stuff of experience." From Philosophy in the
Flesh, Lakoff and Johnson.
8
Repeated and rapid classification
Two Questions Arise
  • In the presence of myriad categories
  • How to categorize efficiently?
  • How to efficiently learn to categorize
    efficiently?

9
Now, a 3rd Question ..
  • How can so many inter-related categories be
    acquired?
  • Programming them is unlikely to succeed/scale
  • Limits of our explicit/conscious knowledge
  • Unknown/unfamiliar domains
  • The required scale..
  • Making the system operational..

10
Learn? How?
  • Supervised learning (explicit human
    involvement) likely inadequate
  • Required scale (or a good signpost):
  • millions of categories and beyond..
  • Billions of weights, and beyond..
  • Inaccessible knowledge (see last slide!)
  • Other approaches likely do not meet the needs
    (incomplete, different goals, etc.): active
    learning, semi-supervised learning, clustering,
    density learning, RL, etc.

11
Desiderata/Requirements (or Speculations)
  • Higher intelligence, such as advanced
    pattern recognition/generation (e.g. vision), may
    require:
  • Long-term learning (weeks, months, years, ...)
  • Cumulative learning (learn these first, then
    these, then these, ...)
  • Massive learning: myriad inter-related
    categories/concepts
  • Systems learning
  • Autonomy (relatively little human involvement)

What's the learning task?
12
This Work: An Exploration
  • An avenue: prediction games in infinitely rich
    worlds
  • The exciting part:
  • The world provides unbounded learning opportunity!
    (the world is the validator, the system is the
    experimenter! ... and it actively builds much of
    its own concepts)
  • The world enjoys many regularities (e.g.
    hierarchical)
  • Based in part on supervised techniques!!
  • (discriminative, feedback driven; the
    supervisory signal doesn't originate from humans)

13
In a Nutshell
[Diagram: a Prediction System observes an input stream
(e.g. "...0011101110000..."), predicts, observes, and
updates. Low-level input: text characters, or in vision,
edges and curves. Learned categories: e.g. words, digits,
phrases, phone numbers, faces, visual objects, home
pages, sites, ...]
14
The Game
  • Repeat
  • Hide part(s) of the stream
  • Predict (use context)
  • Update
  • Move on
  • Objective: predict better ... subject to
    efficiency constraints
  • In the process, categories at different levels of
    size and abstraction should be learned (the game
    loop is sketched below)

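A minimal sketch of the game loop in Python, assuming a hypothetical `system` object with `predict` and `update` methods (the slide gives the loop informally, not as code):

```python
# Minimal sketch of the prediction game loop (hypothetical API).

def play_game(stream, system, context_size=2):
    correct = 0
    for t in range(context_size, len(stream)):
        context = stream[t - context_size:t]  # visible context
        target = stream[t]                    # hidden part of the stream
        guess = system.predict(context)       # predict (use context)
        correct += (guess == target)
        system.update(context, target)        # update, then move on
    return correct / (len(stream) - context_size)
```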
15
Research Goals
  • Conjecture: there is much value to be attained
    from this task
  • Beyond language modeling: more advanced pattern
    recognition/generation
  • If so, it should yield a wealth of new problems
    (> Fun)

16
Overview
  • Goal: convey a taste of the motivations/
    considerations, the system and algorithms, ...
  • Motivation
  • The approach
  • The game (categories, ...)
  • Algorithms
  • Some experiments

17
Upshot
  • Takes streams of text
  • Makes categories (strings)
  • Approx. three hours on 800k documents
  • Large-scale discriminative learning (evidence:
    better than language modeling)

18
Caveat Emptor!
  • Exploratory research
  • Many open problems (many I'm not aware of!)
  • The chosen algorithms, system organization, and
    objective/performance measures, etc., are
    likely not even near the best possible

19
Categories
  • Building blocks (atoms!) of intelligence?
  • Patterns that frequently occur
  • External
  • Internal..
  • Useful for predicting other categories!
  • They can have structure/regularities, in
    particular
  • Composition (conjunctions) of other categories
    (Part-Of)
  • Grouping (disjunctions) (Is-A relations)

20
Categories
  • Low-level primitive examples: 0 and 1, or
    characters (a, b, ..., 0, -, ...)
  • Provided to the system (easy to detect)
  • Higher/composite levels
  • Sequence of bits/characters
  • Words
  • Phrases
  • More general: phone number, contact info, resume,
    ...

21
Example Concept
  • Area code is a concept that involves both
    composition and grouping
  • Composition of 3 digits
  • A digit is a grouping, i.e., the set {0, 1, 2, ..., 9}
    ("2" is a digit)
  • Other example concepts: phone number, address,
    resume page, face (in the visual domain), etc.
    (an illustrative sketch follows)

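Purely as an illustration (hypothetical names, not the system's actual representation), the area-code example can be written down as a grouping plus a composition:

```python
# Illustrative sketch: "digit" as a grouping (disjunction, Is-A) and
# "area code" as a composition (conjunction, Part-Of) of 3 digits.

DIGIT = set("0123456789")   # grouping: the set {0, 1, ..., 9}

def is_area_code(s: str) -> bool:
    """Composition: exactly 3 parts, each an instance of DIGIT."""
    return len(s) == 3 and all(ch in DIGIT for ch in s)

print(is_area_code("415"))  # True ("4" is a digit, etc.)
print(is_area_code("41a"))  # False
```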
22
Again, our goal, informally, is to build a
system that acquires millions of useful concepts
on its own.
23
Questions for a First System
  • Functionality? Architecture? Organization?
  • Would many-class learning scale to millions of
    concepts?
  • Choice of concept building methods?
  • How would various learning processes interact?

24
Expedition: a First System
  • Plays the game in text
  • Begins at character level
  • No segmentation, just a stream
  • Makes and predicts larger sequences, via
    composition
  • No grouping yet

25
Learning Episodes
[Diagram: a window over the stream "... New Jersey in
..." containing context and target; the predictors
(active categories) surround the target (the category
to predict). In this example, the context contains one
category on each side.]
26
.. Some Time Later ..
[Diagram: a window over the stream "... loves New York
life ..."; the learned category "New York" now appears
as a single unit (the target to predict), with "loves"
and "life" as the predictors.]
  • In terms of supervised learning/classification,
    in this learning activity (prediction games):
  • The set of concepts grows over time
  • Same for features/predictors (concepts ARE the
    predictors!)
  • Instance representation (segmentation of the
    data stream) changes/grows over time (episode
    extraction is sketched below)

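A minimal sketch of episode extraction over an already-segmented stream (hypothetical helper; in the actual system the segmentation itself evolves as concepts are learned):

```python
# Hypothetical sketch: build a learning episode from a segmented stream.
# Early on, segments are characters; later, learned categories such as
# "New York" act as single segments.

def make_episode(segments, i, context_size=1):
    """Return (predictors, target) for the segment at position i."""
    left = segments[max(0, i - context_size):i]
    right = segments[i + 1:i + 1 + context_size]
    return left + right, segments[i]

segments = ["loves", "New York", "life"]
predictors, target = make_episode(segments, 1)
print(predictors, "->", target)  # ['loves', 'life'] -> New York
```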
27
Prediction/Recall
[Diagram: features f1-f4 on one side, categories c1-c5
on the other, connected by weighted edges.]
1. Features are activated
2. Edges are activated
3. Receiving categories are activated
4. Categories sorted/ranked
  • Like the use of inverted indices
  • Sparse dot products
(a scoring sketch follows below)

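A minimal sketch of the four steps as inverted-index lookups plus sparse dot products (toy weights; the names are hypothetical):

```python
# Sketch of prediction/recall: activated features vote for categories
# through an inverted index; scoring is a sparse dot product.
from collections import defaultdict

# inverted index: feature -> {category: edge weight} (toy numbers)
index = {
    "f1": {"c1": 0.4, "c2": 0.1},
    "f2": {"c2": 0.3, "c3": 0.2},
}

def rank(active_features):
    scores = defaultdict(float)
    for f in active_features:                  # 1. features are activated
        for c, w in index.get(f, {}).items():  # 2. edges are activated
            scores[c] += w                     # 3. categories accumulate votes
    return sorted(scores.items(), key=lambda kv: -kv[1])  # 4. sort/rank

print(rank(["f1", "f2"]))  # highest-scoring categories first
```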
28
Updating a Feature's Connections
1. Identify connection
2. Increase weight
3. Normalize/weaken weights
4. Drop tiny weights
Degrees are constrained (an update sketch follows)
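A minimal sketch of these update steps; the learning rate, weight floor, and degree cap are illustrative assumptions, not values from the talk:

```python
# Sketch of updating one feature's outgoing connections
# (lr, eps, max_degree are illustrative assumptions).

def update_connections(edges, target, lr=0.1, eps=0.01, max_degree=25):
    """edges: dict mapping category -> weight for a single feature."""
    edges[target] = edges.get(target, 0.0) + lr  # 1-2. identify, increase
    total = sum(edges.values())
    for c in list(edges):
        edges[c] /= total                        # 3. normalize/weaken
        if edges[c] < eps:
            del edges[c]                         # 4. drop tiny weights
    if len(edges) > max_degree:                  # degree constraint
        weakest = sorted(edges, key=edges.get)[:len(edges) - max_degree]
        for c in weakest:
            del edges[c]
```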
29
Example Category Node (from Jane Austen's)
[Diagram: a category node with weighted prediction
edges to other categories such as "and", "nei",
"heart", "toge"/"ther", "far", "love", "bro", and
"by"; the edge weights shown range from 0.052 to 0.13.]
A category node keeps track of various weights,
such as edge (or prediction) weights and
predictiveness weights, as well as other statistics
(e.g. frequency, first/last time seen), and
updates them when it is activated as a predictor
or target.
30
Network
  • Categories and their edges form a network
    (a directed weighted graph, with
    different kinds of edges ...)
  • The network grows over time: millions of nodes
    and beyond

31
When and How to Compose?
  • Two major approaches
  • Pre-filter: don't compose if certain conditions
    are not met (simplest: only consider
    possibilities that you see)
  • Post-filter: compose and use, but remove if
    certain conditions are not met (e.g. if not seen
    recently enough, remove)
  • I expect both are needed

32
Some Composition (Prefilter) Heuristics
  • FRAC: if you see c1 then c2 in the stream, then,
    with some probability, add c = c1c2
  • MU: use the pointwise mutual
    information between c1 and c2 (sketched below)
  • IMPROVE: take string lengths into account
    and see whether joining is better
  • BOUND: generate all strings under length Lt.

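A sketch of the MU heuristic; the counts and the threshold are illustrative assumptions (pointwise mutual information itself is standard):

```python
# MU heuristic sketch: compose c = c1c2 only if the pointwise mutual
# information of the pair is high enough.
import math

def pmi(count_pair, count_c1, count_c2, total):
    """PMI(c1, c2) = log( P(c1 c2) / (P(c1) * P(c2)) )."""
    return math.log((count_pair / total) /
                    ((count_c1 / total) * (count_c2 / total)))

def should_compose(count_pair, count_c1, count_c2, total, threshold=2.0):
    return pmi(count_pair, count_c1, count_c2, total) >= threshold

# "New" followed by "York" co-occurs far more often than chance:
print(should_compose(900, 1000, 950, 10**6))  # True
```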
33
Prediction Objective
  • Desirable: learn higher-level categories
    (bigger/abstract categories are useful
    externally)
  • Question: how does this relate to improving
    predictions?
  • Higher-level categories improve context and
    can save memory
  • Bigger categories save time in playing the game
    (categories are atomic)

34
Objective (evaluation criterion)
  • The matching performance (sketched below):
  • Number of bits (characters) correctly predicted
    per unit time, or per prediction action
  • Subject to constraints (space, time, ...)
  • How about entropy/perplexity?
  • Categories are structured, so perplexity seems
    difficult to use...

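A minimal sketch of the matching-performance measure, per prediction action (the representation of predictions as strings is a hypothetical simplification):

```python
# Sketch: characters correctly predicted per prediction action.

def matching_performance(episodes):
    """episodes: list of (predicted_string, actual_string) pairs."""
    matched = 0
    for predicted, actual in episodes:
        # credit the longest matching prefix of the prediction
        n = 0
        while n < min(len(predicted), len(actual)) and predicted[n] == actual[n]:
            n += 1
        matched += n
    return matched / len(episodes)

print(matching_performance([("the ", "the "), ("said on", "sat on")]))  # 3.0
```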
35
Linearity and Non-Linearity (a motivation for
new concept creation)
[Diagram: the characters n, e, w voting individually,
versus the composed category "new" predicting as one
unit.]
Aggregate the votes of n, e, and w to predict what
comes next, or use the composed category "new"?
Which one predicts better (better constrains what
comes next)?
36
Data
  • Reuters RCV1: 800k news articles
  • Several online books of Jane Austen, etc.
  • Web search query logs

37
Some Observations
  • Ran on Reuters RCV1 (text body) (simply zcat
    dir/file)
  • 800k articles
  • > 150 million learning/prediction episodes
  • Over 10 million categories built
  • 3-4 hours each pass (depends on parameters)

38
Observations
  • Performance on held-out data (one of the Reuters
    files)
  • The string to predict is 8-9 characters long on
    average
  • Almost two characters correct on average, per
    prediction action
  • Can overfit/memorize! (long categories)
  • Currently: stop category generation after the
    first pass

40
Some Example Categories (in order of
first appearance and increasing length)
  • cat name "<"
  • cat name " t"
  • cat name ".</"
  • cat name "p>- "
  • cat name " the "
  • cat name "ation "
  • cat name "of the "
  • cat name "ing the "
  • cat name ""The "
  • cat name "company said "
  • cat name ", the company "
  • cat name "said on Tuesday"
  • cat name " said on Tuesday"
  • cat name "," said one "
  • cat name "," he said.</p>"
  • cat name "--------------------------------"
  • cat name "--------------------------------------------------------"

41
Example Recall Paths
  • From processing one month of Reuters
  • "Sinn Fei" (0.128) "n a seat" (0.527) " in the "
    (0.538) "talks." (0.468) "</p> <p>B" (0.0185)
    "rokers " The end: connection weight less than 0.04
  • " Gas in S" (1) "cotland" (1.04) " and north" (1.18)
    "ern E" (0.572) "ngland" (0.165) "," a " (0.0542)
    "spokeswo" (0.551) "man said " (0.044) "the idea"
    (0.0869) " was to " (0.144) """ (0.164) "e the d"
    (0.0723) "ivision" (0.0671) " in N" (0.397) "ew York"
    (0.062) " where " (0.0557) "the main " (0.0474)
    "marque" (0.229) "s were " (0.253) "base" (0.264)
    "d. "" (0.0451) "It will " (0.117) "certain" (0.0691)
    "ly b" (0.0892) "e New " (0.353) "York" (0.112)
    " party" (0.0917) "s is goin" (0.559) "g to " (0.149)
    "end."" (0.239) "</p> <p>T" (0.104) "wedish " (0.125)
    "Export" (0.0211) " Credi" The end: connection weight
    less than 0.04

42
Search Query Logs
  • "bureoofi" (1) "migration" (1.13) "andci" (1.04)
    "tizenship." (0.31) "com www," (0.11) "ictions"
    (0.116) "zenship." The end: this concept wasn't
    seen in the last 1,000,000 time points.
  • Random recall:
  • "bureoofi" (1) "migration" (0.0129) "dept.com"
    The end: this concept wasn't seen in the last
    1,000,000 time points.

43
Much Related Work!
  • Online learning, cumulative learning, feature and
    concept induction, neural networks, clustering,
    Bayesian methods, language modeling, deep
    learning, hierarchical learning, the
    importance/ubiquity of predictions/anticipations
    in the brain (On Intelligence, natural
    computations, ...), models of neocortex (Circuits
    of the Mind), concepts and conceptual phenomena
    (e.g. The Big Book of Concepts), compression, ...

44
Summary
  • Large-scale learning and classification (data
    hungry, efficiency paramount)
  • A systems approach Integration of multiple
    learning processes
  • The system makes its own classes
  • Driving objective: improve prediction (currently,
    matching performance)
  • The underlying goal: effectively acquire complex
    concepts
  • See www.omadani.net

45
Current/Future
  • Much work remains:
  • Integrate learning of groupings
  • Recognize/use structural categories? (learn to
    parse/segment?)
  • Prediction objective... OK?
  • Control over input stream, etc.
  • Category generation... What are good methods?
  • Other domains (vision, ...)
  • Compare with language modeling, etc.