1
Statistical Modeling of Large Text Collections
Padhraic Smyth
Department of Computer Science, University of California, Irvine
MURI Project Kick-off Meeting, November 18th, 2008
2
The Text Revolution
  • Widespread availability of text in digital form is driving
    many new applications based on automated text analysis:
  • Categorization/classification
  • Automated summarization
  • Machine translation
  • Information extraction
  • And so on.

3
The Text Revolution
  • Widespread availability of text in digital form is driving
    many new applications based on automated text analysis:
  • Categorization/classification
  • Automated summarization
  • Machine translation
  • Information extraction
  • And so on.
  • Most of this work is happening in computing, but
    many of the underlying techniques are statistical

4
Motivation
Pennsylvania Gazette: 80,000 articles, 1728-1800
MEDLINE: 16 million articles
New York Times: 1.5 million articles
5
Problems of Interest
  • What topics do these documents span?
  • Which documents are about a particular topic?
  • How have topics changed over time?
  • What does author X write about?
  • and so on..

6
Problems of Interest
  • What topics do these documents span?
  • Which documents are about a particular topic?
  • How have topics changed over time?
  • What does author X write about?
  • and so on..
  • Key Ideas
  • Learn a probabilistic model over words and docs
  • Treat query-answering as computation of
    appropriate conditional probabilities

7
Topic Models for Documents
  • P( word | document ) = Σ_topics P( word | topic ) P( topic | document )
  • P( word | topic ): each topic is a probability distribution over words
  • P( topic | document ): mixing coefficients for each document
  • Both are automatically learned from the text corpus (a numeric sketch follows)
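As an aside not on the slide: if the learned quantities are stored as matrices, the sum over topics is just a matrix product. A minimal NumPy sketch with made-up sizes (all names and dimensions here are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_words, n_docs = 3, 1000, 50                  # arbitrary toy sizes

phi = rng.dirichlet(np.ones(n_words), size=n_topics)     # phi[t, w] = P(word w | topic t)
theta = rng.dirichlet(np.ones(n_topics), size=n_docs)    # theta[d, t] = P(topic t | doc d)

p_word_given_doc = theta @ phi                           # [d, w] = sum_t P(w | t) P(t | d)
assert np.allclose(p_word_given_doc.sum(axis=1), 1.0)    # each document's row is a distribution
```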
8
Topics = Multinomials over Words
9
Topics = Multinomials over Words
10
Basic Concepts
  • Topics = distributions over words
  • Unknown a priori, learned from data
  • Documents represented as mixtures of topics
  • Learning algorithm
  • Gibbs sampling (stochastic search)
  • Linear time per iteration
  • Provides a full probabilistic model over words,
    documents, and topics
  • Query answering = computation of appropriate conditional probabilities (a brief sketch follows this list)
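A sketch of what query answering by conditional probabilities can look like in practice; the variable names and toy data below are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
# theta[d, t] = P(topic t | document d), as estimated by the learner (toy values here)
theta = rng.dirichlet(np.ones(20), size=1000)

# "What is document 42 about?" -- topics ranked by P(topic | doc 42)
top_topics_for_doc = np.argsort(theta[42])[::-1][:3]

# "Which documents are about topic 7?" -- documents ranked by P(topic 7 | doc)
top_docs_for_topic = np.argsort(theta[:, 7])[::-1][:10]
```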

11
Enron email data
250,000 emails, 28,000 individuals, 1999-2002
12
Enron email business topics
13
Enron non-work topics
14
Enron public-interest topics...
15
Examples of Topics from New York Times
Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
16
Topic trends from New York Times
330,000 articles, 2000-2002
Tour-de-France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE
Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING
Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING
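The slides do not show how the trend curves were computed; one common approach, assumed here purely for illustration, is to average each document's topic proportion within time bins (e.g., by year) and plot the averages:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics = 10_000, 50             # toy stand-in for the 330,000 articles

theta = rng.dirichlet(np.ones(n_topics), size=n_docs)   # P(topic | doc), illustrative values
years = rng.choice([2000, 2001, 2002], size=n_docs)     # each document's publication year

anthrax_topic = 12                         # hypothetical index of the "anthrax" topic
trend = {y: theta[years == y, anthrax_topic].mean() for y in (2000, 2001, 2002)}
```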
17
What does an author write about?
  • Author: Jerry Friedman, Stanford

18
What does an author write about?
  • Author: Jerry Friedman, Stanford
  • Topic 1: regression, estimate, variance, data, series, ...
  • Topic 2: classification, training, accuracy, decision, data, ...
  • Topic 3: distance, metric, similarity, measure, nearest, ...

19
What does an author write about?
  • Author: Jerry Friedman, Stanford
  • Topic 1: regression, estimate, variance, data, series, ...
  • Topic 2: classification, training, accuracy, decision, data, ...
  • Topic 3: distance, metric, similarity, measure, nearest, ...
  • Author: Rakesh Agrawal, IBM

20
What does an author write about?
  • Author: Jerry Friedman, Stanford
  • Topic 1: regression, estimate, variance, data, series, ...
  • Topic 2: classification, training, accuracy, decision, data, ...
  • Topic 3: distance, metric, similarity, measure, nearest, ...
  • Author: Rakesh Agrawal, IBM
  • Topic 1: index, data, update, join, efficient, ...
  • Topic 2: query, database, relational, optimization, answer, ...
  • Topic 3: data, mining, association, discovery, attributes, ...

21
Examples of Data Sets Modeled
  • 1,200 Bible chapters (KJV)
  • 4,000 Blog entries
  • 20,000 PNAS abstracts
  • 80,000 Pennsylvania Gazette articles
  • 250,000 Enron emails
  • 300,000 North Carolina vehicle accident police
    reports
  • 500,000 New York Times articles
  • 650,000 CiteSeer abstracts
  • 8 million MEDLINE abstracts
  • Books by Austen, Dickens, and Melville
  • ...
  • Exactly the same algorithm was used in all cases, and in every case it produced interpretable topics automatically

22
Related Work
  • Statistical origins
  • Latent class models in statistics (late 60s)
  • Admixture models in genetics
  • LDA model: Blei, Ng, and Jordan (2003); a fitting example follows this list
  • Variational EM
  • Topic model: Griffiths and Steyvers (2004)
  • Collapsed Gibbs sampler
  • Alternative approaches
  • Latent semantic indexing (LSI/LSA)
  • less interpretable, not appropriate for count
    data
  • Document clustering
  • simpler but less powerful
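For readers who want to experiment with the LDA model cited above without writing a sampler, scikit-learn ships a variational-EM implementation (the Blei/Ng/Jordan style of inference, not the collapsed Gibbs sampler used later in these slides). A minimal sketch with a placeholder corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus -- any list of raw text documents would do here.
docs = [
    "stocks fell sharply on wall street",
    "the hidden markov model was trained with gibbs sampling",
    "the team won the tour de france",
]

X = CountVectorizer().fit_transform(docs)              # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

theta = lda.transform(X)                               # per-document topic proportions
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # topic-word distributions
```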

23
Clusters v. Topics
Hidden Markov Models in Molecular Biology: New Algorithms and Applications (Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure). Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most-likely-path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
24
Clusters v. Topics
One Cluster
[Same Baldi et al. abstract as slide 23]
Cluster 88: model, data, models, time, neural, figure, state, learning, set, parameters, network, probability, number, networks, training, function, system, algorithm, hidden, markov
25
Clusters v. Topics
Multiple Topics
One Cluster
[Same Baldi et al. abstract as slide 23]
Cluster 88: model, data, models, time, neural, figure, state, learning, set, parameters, network, probability, number, networks, training, function, system, algorithm, hidden, markov
Topic 10: state, hmm, markov, sequence, models, hidden, states, probabilities, sequences, parameters, transition, probability, training, hmms, hybrid, model, likelihood, modeling
Topic 37: genetic, structure, chain, protein, population, region, algorithms, human, mouse, selection, fitness, proteins, search, evolution, generation, function, sequence, sequences, genes
26
Extensions
  • Author-topic models
  • Authors = mixtures over topics

  • (Steyvers, Smyth, Rosen-Zvi, Griffiths, 2004)
  • Special-words model
  • Documents = mixtures of topics + idiosyncratic words

  • (Chemudugunta, Smyth, Steyvers, 2006)
  • Entity-topic models
  • Topic models that can reason about entities
  • (Newman,
    Chemudugunta, Smyth, Steyvers, 2006)
  • See also work by McCallum, Blei, Buntine,
    Welling, Fienberg, Xing, etc
  • Probabilistic basis allows for a wide range of
    generalizations

27
Combining Models for Networks and Text
28
Combining Models for Networks and Text
29
Combining Models for Networks and Text
30
Combining Models for Networks and Text
31
Technical Approach and Challenges
  • Develop flexible probabilistic network models
    that can incorporate textual information
  • e.g., ERGMs with text as node or edge covariates
  • e.g., latent space models with text-based
    covariates
  • e.g., dynamic relational models with text as edge
    covariates
  • Research challenges
  • Computational scalability
  • ERGMs not directly applicable to large text data sets
  • What text representation to use?
  • High-dimensional bag of words?
  • Low-dimensional latent topics?
  • Utility of text
  • Does the incorporation of textual information
    produce more accurate models or predictions? How
    can this be quantified?

32
Graphical Model
[Diagram: group variable z generating words Word 1 ... Word n]
33
Graphical Model
[Diagram: group variable z generating word w; plate over n words]
34
Graphical Model
[Diagram: group variable z generating word w; plates over n words and D documents]
35
Mixture Model for Documents
[Diagram: group probabilities α → group variable z → word w, with group-word distributions φ; plates over n words and D documents]
36
Clustering with a Mixture Model
[Diagram: cluster probabilities α → cluster variable z → word w, with cluster-word distributions φ; plates over n words and D documents]
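To make the diagram concrete, here is an illustrative generative sketch of this mixture (cluster) model, in which a single cluster z is drawn once per document and all of that document's words come from the same cluster-word distribution; the sizes and priors are arbitrary toy choices, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, vocab_size, doc_length = 4, 500, 100           # arbitrary toy sizes

pi = rng.dirichlet(np.ones(n_clusters))                    # cluster probabilities (alpha)
phi = rng.dirichlet(np.ones(vocab_size), size=n_clusters)  # cluster-word distributions

z = rng.choice(n_clusters, p=pi)                           # one cluster for the whole document
words = rng.choice(vocab_size, size=doc_length, p=phi[z])  # every word uses the same phi[z]
```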
37
Graphical Model for Topics
[Diagram: document-topic distributions θ → topic z → word w, with topic-word distributions φ; plates over n words and D documents]
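By contrast, the topic model draws a fresh topic z for every word position using the document's own θ. An illustrative generative sketch under the same toy assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_length = 4, 500, 100              # arbitrary toy sizes

phi = rng.dirichlet(np.ones(vocab_size), size=n_topics)     # topic-word distributions
theta = rng.dirichlet(np.ones(n_topics))                     # this document's topic mixture

words = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=theta)                        # topic drawn per word, not per document
    words.append(rng.choice(vocab_size, p=phi[z]))           # word drawn from that topic
```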
38
Learning via Gibbs sampling
[Diagram: same topic model as slide 37 (θ, z, φ, w, plates over n and D); annotation: a Gibbs sampler estimates z for each word occurrence, marginalizing over the other parameters]
39
More Details on Learning
  • Gibbs sampling for word-topic assignments (z)
  • 1 iteration = full pass through all words in all documents
  • Typically run a few hundred Gibbs iterations
  • Estimating θ and φ
  • use z samples to get point estimates
  • non-informative Dirichlet priors for θ and φ
  • Computational efficiency
  • Learning is linear in the number of word tokens
  • Can still take on the order of a day for 100k or more docs (a compact sampler sketch follows)
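The slides describe the sampler but do not list it; the following is a compact, unoptimized sketch of a collapsed Gibbs sampler for the word-topic assignments z, with symmetric Dirichlet priors on θ and φ (the function name, defaults, and hyperparameter values are assumptions, not from the talk):

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iters=200, alpha=0.1, beta=0.01):
    """docs: list of lists of word ids; returns point estimates of theta and phi.
    Illustrative sketch only -- alpha/beta values are assumed, not from the slides."""
    rng = np.random.default_rng(0)
    ndk = np.zeros((len(docs), n_topics))       # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))      # topic-word counts
    nk = np.zeros(n_topics)                     # total tokens assigned to each topic
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]   # random initialization

    def update_counts(d, w, k, delta):
        ndk[d, k] += delta
        nkw[k, w] += delta
        nk[k] += delta

    for d, doc in enumerate(docs):              # tally the initial assignments
        for i, w in enumerate(doc):
            update_counts(d, w, z[d][i], +1)

    for _ in range(n_iters):                    # one iteration = full pass over all tokens
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update_counts(d, w, z[d][i], -1)     # remove the current assignment
                # conditional for z, with theta and phi marginalized out
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                z[d][i] = rng.choice(n_topics, p=p / p.sum())
                update_counts(d, w, z[d][i], +1)     # restore counts with the new topic

    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

Each outer iteration is a single pass over every word token, which is the "linear in the number of word tokens" property noted above; the θ and φ returned at the end are point estimates formed from the final sample of z.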

40
Gibbs Sampler Stability