On Mining Massive Dynamic Data - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: On Mining Massive Dynamic Data


1
On Mining Massive Dynamic Data
  • Deepak Agarwal
  • Yahoo! Research
  • SF Bay Area Chapter, ACM
  • 13th September, 2006

2
Background
  • PhD in Statistics, University of Connecticut, '01
  • Multi-level hierarchical Bayesian models to study
    deforestation in Madagascar.
  • Working with massive data in GIS got me
    interested in Data Mining
  • Joined Statistics and Data Mining at AT&T
  • Massive dynamic networks; monitoring massive
    streams.
  • Intrigued by internet advertising; joined Y! in
    2006

3
Yahoo! Research
  • Head: Prabhakar Raghavan, 2005
  • Search
  • Economics
  • Machine Learning
  • Statistics and Data Mining
  • http://research.yahoo.com/

4
Context
  • Massive amounts of data
  • Internet wave, telecommunications pose new
    statistical challenges
  • Computer science: progress in managing data,
    computing summary statistics efficiently
  • Dynamic nature of data
  • Ubiquitous; methods for static data not optimal
  • Methods for time series, point processes germane
  • Challenge: building scalable methods

5
Focus on three problems
  • Estimating delta through time
  • Monitoring massive streams
  • Mining massive dynamic social networks
  • Effective but lossy representation
  • Sequential sampling for learning
  • optimize the explore-exploit tradeoff
  • Different from fixed sample size design

6
Other research areas
  • Detecting hotspots in massive spatial data
  • Scan statistics (SODA 06, KDD 06)
  • Forecasting long-term and short-term
  • Applications at Yahoo!
  • Data squashing
  • Scaling down data to facilitate statistical
    modeling (KDD 03)
  • Hierarchical Bayesian modeling
  • Shrinkage estimation for massive data
  • (KDD 02, SDM 04, ICDM 05)

7
Problem 1: Monitoring massive streams
  • Estimating delta through time: several apps.
  • Network monitoring: traffic volume in SNMP etc.
  • Bio-surveillance: emergency room data, events on
    websites
  • Spam detection: e-mail spam, web spam etc.
  • Business intelligence: traffic pattern to
    customer care centers; augments usual dashboard
    type statistics

8
Challenges
  • Estimate accurate baseline model
  • Change detection with good sensitivity/specificity
  • Adjusting for multiple testing, global features
  • Adaptive procedure, easy to update
  • Incorporating correlations among series

9
Application
  • Question
  • Can we detect social disruption events in China
    before they get reported in the mainstream media?
  • Our Answer
  • Probably yes, if we knew what data to use

10

Word patterns
11
  • We would like to thank Simon Urbanek for
    providing the plot

12
How did we get the patterns?
  • Patterns emerged from retrospective analysis of
    biological events (West Nile, SARS) foreseen as
    potential indicators and warnings of social
    disruption
  • Direct indicators (e.g., news reports of
    outbreaks)
  • Indirect indicators (school and factory closings
    etc.)
  • Patterns selected manually by experts.
  • Contingency table obtained daily: websites (about
    40) x patterns (about 25).

13
Data Collection, Reporting.
  • Crawler
  • Crawl a max of 1K pages per website
  • Parser
  • Each webpage parsed into sentences by a parser
  • Index
  • Converted to UTF-8 and indexed incrementally
    with Lucene-based indexing and search software
  • Anomalies reported in a newsletter form on a web
    portal every morning.

14
(No Transcript)
15
  • The number of crawled pages varies, so we monitor
    the rate of occurrence per page for each pattern.

16
Notations and transformation.
17
40 days of initial data
18
No short term seasonal effects in the rates
19
Baseline model: State space approach
20
QQ-plots of standardized residuals to test the
conditional independence assumption in the
observation equation of the baseline model.
21
State Equation, model update
22
Estimating Variance components.
23
Estimating forgetting factors
24
Detecting anomalies, intervention strategy.
  • Q-charting, monitor the EWMA of normal scores of
    p-values (Liu and Lambert, 2006).
  • CUSUM based approach using Bayes factor (West and
    Harrison, 1997; Gargallo and Salvador, 2003)
  • In this application: only detected spikes
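
A minimal sketch of the Q-charting idea on this slide: turn each observation into a p-value under the baseline model's predictive distribution, map the p-value to a normal score, and monitor an EWMA of those scores. The Gaussian predictive distribution, the smoothing weight `lam`, the control limit `limit`, and the function names are illustrative assumptions; they are not taken from the talk or from Liu and Lambert's paper. Only upward excursions are flagged, since the slide says only spikes were detected.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_score(observed, pred_mean, pred_sd):
    """p-value of `observed` under an assumed Gaussian predictive
    distribution, mapped to a standard normal score."""
    p = NormalDist(pred_mean, pred_sd).cdf(observed)
    # Clamp so the inverse CDF stays finite for extreme observations.
    p = min(max(p, 1e-12), 1 - 1e-12)
    return STD_NORMAL.inv_cdf(p)

def q_chart(scores, lam=0.3, limit=3.0):
    """EWMA of normal scores; returns one spike flag per time point."""
    # Asymptotic standard deviation of an EWMA of iid N(0, 1) scores.
    ewma_sd = (lam / (2 - lam)) ** 0.5
    ewma, flags = 0.0, []
    for z in scores:
        ewma = lam * z + (1 - lam) * ewma
        flags.append(ewma > limit * ewma_sd)  # one-sided: spikes only
    return flags
```

With a Gaussian predictive distribution the normal score reduces to the standardized residual; going through the p-value keeps the chart usable with any baseline model that can produce one.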

25
Other Issues
  • What if a spike/drop is detected? Don't use the
    data point in model updating; increase the prior
    variance by a factor of c (c = 9 has worked well
    for this application)
  • Missing data
  • Occurs when we download very few (or no) pages.
  • Draw an observation randomly from the predictive
    distribution and proceed as usual.
  • Deleting uninformative series, adding new ones
  • Delete a series if 95th percentile < 1/1000.
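
The three rules on this slide can be sketched for a single series as follows. The transcript does not include the state and observation equations, so the local-level (random walk plus noise) filter below is an assumed stand-in for the baseline model; `filter_step`, `is_uninformative`, and all parameter names are hypothetical. Only the factor c = 9 and the 1/1000 threshold come from the slide.

```python
import random
from statistics import quantiles

def filter_step(mean, var, y, obs_var, state_var,
                anomaly=False, c=9.0, rng=random):
    """One update of an assumed local-level filter.

    mean, var : posterior mean and variance of the level at time t-1
    y         : observation at time t, or None when it is missing
    anomaly   : True when the detector flagged y as a spike/drop
    """
    prior_var = var + state_var      # prior variance of the level at t
    pred_var = prior_var + obs_var   # variance of the predictive for y
    if y is None:
        # Missing data: draw from the predictive and proceed as usual.
        y = rng.gauss(mean, pred_var ** 0.5)
    if anomaly:
        # Do not learn from the flagged point; inflate the prior variance.
        return mean, c * prior_var
    gain = prior_var / pred_var
    return mean + gain * (y - mean), (1 - gain) * prior_var

def is_uninformative(rates, threshold=1 / 1000):
    """Delete rule: 95th percentile of the observed rates below threshold."""
    return quantiles(rates, n=20)[-1] < threshold
```

Inflating the variance after an anomaly lets the filter adapt quickly if the level has really shifted, without letting a single outlier drag the baseline.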

26
Multiple testing: Bayesian approach.
  • Monitoring large number of independent streams
  • need to correct for multiple testing
  • Main idea
  • Derive empirical null based on observed
    deviations
  • Flag interesting cases after adjusting for global
    characteristics in the system
  • Bayesian approach: shrink residuals
  • Shrinkage automatically builds in penalty for
    conducting multiple tests (Scott and Berger, 2003)
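
To illustrate why shrinkage acts as a multiplicity penalty, here is a deliberately simplified sketch. It is not the procedure from the talk (whose details are not in this transcript); it swaps in a textbook normal-normal empirical Bayes model: each normal score z_i ~ N(mu_i, 1), with mu_i ~ N(0, tau^2), and tau^2 estimated from all streams on that day by the method of moments. The function name is hypothetical.

```python
from statistics import fmean

def shrink_scores(z):
    """Posterior means of mu_i under z_i ~ N(mu_i, 1), mu_i ~ N(0, tau^2).

    tau^2 is estimated from the whole day's scores, so the amount of
    shrinkage depends on the global behaviour of the system.
    """
    second_moment = fmean(v * v for v in z)  # E[z^2] = 1 + tau^2
    tau2 = max(0.0, second_moment - 1.0)
    weight = tau2 / (1.0 + tau2)             # shrinkage factor in [0, 1)
    return [weight * v for v in z]
```

When most streams look like pure noise, the estimated tau^2 is close to zero and every score is pulled strongly toward zero, so an isolated moderate deviation is not flagged; on a day with many large deviations, less shrinkage is applied.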

27
Bayesian procedure.
28
Estimating hyperparameters
  • Large number of time series
  • Approximate likelihood: data squashing
  • likelihood approximated by weighted likelihood
  • MAP estimate used as plug-in
  • Moderate number of time series
  • Fully Bayesian Inference using Gauss-Hermite
    Quadrature

29
Distribution of normal scores on a randomly
selected day
30
(No Transcript)
31
(No Transcript)
32
Putting it all together
  • Build a baseline model; any model that provides a
    p-value of the observed value relative to the
    predictive distribution is appropriate. The state
    space approach provides a rich class.
  • Declare anomalies after adjusting for multiple
    testing. We use a Bayesian procedure but other
    approaches like FDR may be used.
  • Delete time series that are uninformative (based
    on a user-defined criterion). Add new series to
    the monitoring process.
  • For missing data, draw an observation randomly
    from the predictive distribution. When an anomaly
    occurs, make appropriate adjustments to maintain
    the correct variance.
  • Update the baseline distribution with arrival of
    new data. The updating process should be quick
    for large applications.

33
Dotted solid lines: days when reports appeared in
mainstream media. Dotted gray lines: days when our
system found spikes related to the reports that
appeared later.
34
Rough validation using actual media reports.
  • July 24th: mystery illness kills 17 people in
    China; we noticed several spikes on July 17th and
    18th.
  • Sept 29th and Dec 7th: on Sept 29th, news
    reports of China carving out emergency plans to
    fight bird flu and prevent it from spreading to
    humans. On Dec 7th, a confirmed case of bird flu
    in humans reported.
  • We reported several spikes on Sept 12th and 14th,
    Nov 2nd, 7th, 11th, and 16th, mostly for the
    pattern influenza, flu, pneumonia, meningitis.
  • On Nov 21st, four big spikes on bf3.syd.com.cn
    on influenza, flu, pneumonia, meningitis;
    emergency, disaster, crisis; prevention and
    quarantine.

35
Ongoing work
  • Monitoring hierarchical streams
  • Applications at Yahoo!
  • Correlation structure induced partly by
    hierarchy
  • Concise description of anomalies
  • ICDM 05 for contingency tables
  • ICDE 06 for hierarchical data (under submission)

36
Other applications
  • Monitoring emergency room visits by symptom and
    location (JSM, 2005).
  • Monitoring calls to customer care centers to
    augment usual slice and dice dashboard statistics
  • E.g. there was a 10 increase in Hang-ups for
    calls from Maryland (ICDM, 2005)

37
Relevant articles
  • Agarwal, D., Feng, J., Torres, V. (2006)
    Monitoring massive streams simultaneously: A
    holistic approach, Interface (refereed section)
  • Agarwal, D. (2005) Empirical Bayes Approach to
    Detect Anomalies in Dynamic Multidimensional
    Arrays, ICDM, Houston
  • Agarwal, D., DuMouchel, W., Goodall, G. (2005)
    KFC: A Kalman filtering approach to detect
    anomalies in massive contingency tables, JSM,
    Minneapolis

38
Problem 2: Building Efficient Representation for
Massive Dynamic Networks
39
Context
  • Transactional data: time-stamped records of
    interaction between pairs of entities,
  • e.g., telephone calls, credit card purchases,
    e-mail exchanges, hyperlinks etc.
  • Gives rise to a dynamic graph: nodes represent
    transactors, edges represent interactions over
    time.
  • Goal: find a lossy but efficient representation
  • No unique solution; depends on objective

40
Our application
  • Directed graph: phone calls on AT&T's network.
  • Massive: millions of nodes and edges
  • Dynamic: lose nodes and edges, get new ones
  • Heterogeneous: biz, res, cell, 800 etc.
  • Sparse: the 80/20 rule, power law
  • Incomplete: don't see all calls; miss calls
    originating in competitors' networks, cell calls,
    local calls etc.

41
Applications
  • Fraud detection (fraudsters compromising 800
    numbers, international numbers etc.)
  • Marketing (viral marketing: market to people with
    strong network influence)
  • Repetitive debtors (catching subscription fraud)
  • Key observations
  • Analysis at local transactor level useful;
    global not needed
  • Facilitate near-real-time applications

42
Representation
  • Create local graph centered around each node
  • captures interaction with the rest of the
    network
  • Approximation
  • Graph at time t: union of local subgraphs
  • Capturing dynamics of graph over time (Cortes et
    al.)
  • Smoothing each local subgraph over time
  • Smoothed local subgraph around node X = Community
    of interest (COI) of X.

43
Smoothing
  • Based only on yesterday's data?
  • Too narrow
  • Union of all time periods?
  • Too broad
  • A moving average of the t most recent time
    periods?
  • Better but does not capture slow drifts well,
    logistically difficult
  • Exponentially weighted moving average (EWMA)
  • G(t) = θ G(t-1) + (1 - θ) g(t)
  • Gives more weight to recent data
  • Easy to maintain and update
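
The EWMA recursion, read as G(t) = θ G(t-1) + (1 - θ) g(t), can be applied edge by edge. The sketch below assumes each graph is stored as a dictionary from (caller, callee) pairs to weights, with a missing edge treated as weight 0; that storage layout and the name `ewma_update` are illustrative choices, not details from the talk.

```python
def ewma_update(G_prev, g_today, theta):
    """Edge-wise EWMA: G(t) = theta * G(t-1) + (1 - theta) * g(t).

    G_prev  : smoothed graph up to yesterday, {(caller, callee): weight}
    g_today : graph built from today's records only, same layout
    theta   : weight on the past, between 0 and 1
    """
    edges = set(G_prev) | set(g_today)
    return {
        e: theta * G_prev.get(e, 0.0) + (1 - theta) * g_today.get(e, 0.0)
        for e in edges
    }
```

Only yesterday's smoothed graph and today's records are needed for the update, which is what makes the EWMA easier to maintain than a moving average over the most recent time periods.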

44
Weighting past calls: choosing the forgetting
factor
Calls fade out over time. The larger θ is, the
longer the call has non-negligible weight.
Selecting θ: standard problem in time
series. Derive estimates from Kalman filter or
ARIMA (0,1,1). But what's our loss function here?
Graph provided by Chris Volinsky
45
Reducing redundancy
  • Only a small fraction have high degrees
  • Introduce a parameter k (positive integer)
  • COI of X is smoothed subgraph centered around X
  • Top k called by X + other
  • Top k calling X + other
  • Weights on the edges are those derived from EWMA.
  • Still not done: this will lead to gain in more
    and more edges over time. Introduce a third
    parameter ε such that an edge below this
    threshold gets dropped.
  • Helps with noise reduction, storage savings.
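
A sketch of the pruning described above, under stated assumptions: the smoothed graph is a dictionary from (caller, callee) pairs to EWMA weights, edges with weight below ε are dropped first, and everything outside the top k is summed into a single "other" bucket. The order of those two steps, the way "other" is aggregated, and the name `build_coi` are assumptions made for illustration.

```python
def build_coi(x, G, k, eps):
    """Top-k inbound and outbound neighbours of x, plus an 'other' bucket.

    G   : smoothed graph, {(caller, callee): EWMA weight}
    k   : number of individually stored neighbours per direction
    eps : edges with weight below eps are dropped
    """
    live = {e: w for e, w in G.items() if w >= eps}
    outbound = [(w, callee) for (caller, callee), w in live.items() if caller == x]
    inbound = [(w, caller) for (caller, callee), w in live.items() if callee == x]

    def top_k_plus_other(edges):
        ranked = sorted(edges, key=lambda pair: pair[0], reverse=True)
        kept = {node: w for w, node in ranked[:k]}
        rest = sum(w for w, _ in ranked[k:])
        if rest > 0:
            kept["other"] = rest  # aggregate of the remaining neighbours
        return kept

    return {"outbound": top_k_plus_other(outbound),
            "inbound": top_k_plus_other(inbound)}
```

For example, with k = 2 and ε = 0.1, a node whose outbound weights are 5, 3, 1 and 0.01 keeps the first two neighbours, folds the third into "other", and drops the last edge.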

46
COI of X
Other inbound
X
Other outbound
47
How to select parameters?
  • Select L pre and post periods, maximize an
    average similarity measure

48
Similarity functions
[Figure: pre-period and post-period COIs side by side. Labels: TN1, TN2, p1, p2, p_other]
49
Selecting theta and k
[Figure panels: Hellinger, Wdice]
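
One of the two similarity measures named here can be sketched from its standard definition. The code treats a COI as a dictionary of node weights, normalizes each period's weights to sum to 1, and returns the Hellinger distance (0 for identical distributions, 1 for disjoint ones). How the talk converts this into a similarity score, and the exact definition of the weighted Dice measure, are not in the transcript, so neither is reproduced; the name `hellinger_distance` is illustrative.

```python
from math import sqrt

def hellinger_distance(w_pre, w_post):
    """Hellinger distance between two COIs given as {node: weight} dicts."""
    total_pre, total_post = sum(w_pre.values()), sum(w_post.values())
    if total_pre <= 0 or total_post <= 0:
        raise ValueError("each COI needs positive total weight")
    nodes = set(w_pre) | set(w_post)
    # Bhattacharyya coefficient of the two normalized weight vectors.
    overlap = sum(
        sqrt((w_pre.get(n, 0.0) / total_pre) * (w_post.get(n, 0.0) / total_post))
        for n in nodes
    )
    return sqrt(max(0.0, 1.0 - overlap))
```

Averaging such a measure over many nodes, with the pre-period COI built under candidate parameter values and compared with the post-period behaviour, gives a criterion of the kind the parameter-selection slide describes.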
50
Selecting epsilon.
51
Repetitive Debtor
  • Thomas Hanley
  • 62 Rio Robles, San Jose
  • CA
  • Disconnected non-payment
  • Deborah Hanley
  • 62 Rio Robles, San Jose
  • CA
  • ??

Key: Calling patterns don't change.
Compare new connections with fraudulent numbers
using a similarity function.
52
Validation with labels from experts
53
Enhancing COI: My friends' friends
  • Impute calls not seen on the network exploiting
    social structure
  • Other issues
  • Quantify social characteristics like tendency to
    call, tendency to receive calls, tendency to
    return calls for each node.

54
Extended COI
[Figure: extended COI around X. Labels: d0, d1, d2, d3, x, other]
55
Approach
  • Extended COI -> social network representing the
    node's interaction with the network
  • Developed a rich class of statistical models to
    do inference.

56
Parameters
  • Node i
  • α_i: expansiveness (tendency to call)
  • β_i: attractiveness (tendency of being called)
  • Global parameters
  • ?: density of COI (reduces with increasing
    sparseness)
  • ?: reciprocity of COI (tendency to return calls)
  • ?s: caller-specific effect
  • ?r: callee-specific effect
  • ?: call-specific effect

57
[Figure: graphical model. Labels: hyperparameters, Σαβ, density, coefficients for edge covariates, covariates, k, COI (w_ij, w_ji) for i < j, imputed t_ij's]
58
(No Transcript)
59
Example
  • COI with 117 nodes, 172 edges.
  • 14 missing edges: local calls from 14 non-AT&T
    local customers to the seed node.
  • One node attribute (biz/cell/res) gives rise to 9
    edge attributes.
  • cell->biz, cell->cell, cell->res collapsed into
    one block.
  • M1: uniform reciprocity; M2: differential
    reciprocity
  • Latter gives better fit, edge covariates were
    statistically significant. Results in Table 2 of
    paper.

60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
Relevant Papers
  • S. Hill, D. Agarwal, R. Bell, C. Volinsky (2006)
    Building an Effective Representation of Dynamic
    Networks, Journal of Computational and Graphical
    Statistics (to appear)
  • C. Cortes, D. Pregibon, C. Volinsky (2003)
    Computational Methods for Dynamic Graphs,
    Journal of Computational and Graphical
    Statistics, 12, 950-970
  • D. Agarwal, D. Pregibon (2004) Enhancing
    Communities of Interest using Bayesian Stochastic
    Blockmodels, SIAM Data Mining Conference,
    Orlando