Sensor and Graph Mining - PowerPoint PPT Presentation

About This Presentation
Title:

Sensor and Graph Mining

Description:

School of Computer Science. Carnegie Mellon. Sensor ... Given a semi-infinite stream of values (time series) x1, x2, ..., xt, ... Vision; Astronomy, seismology, ...

Slides: 64
Provided by: christosf
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

Title: Sensor and Graph Mining


1
Sensor and Graph Mining
  • Christos Faloutsos
  • Carnegie Mellon University & IBM
  • www.cs.cmu.edu/christos

2
Joint work with
  • Anthony Brockwell (CMU/Stat)
  • Deepayan Chakrabarti (CMU)
  • Spiros Papadimitriou (CMU)
  • Chenxi Wang (CMU)
  • Yang Wang (CMU)

3
Outline
  • Introduction - motivation
  • Problem 1: Stream Mining
  • Motivation
  • Main idea
  • Experimental results
  • Problem 2: Graphs and Virus propagation
  • Conclusions

4
Introduction
  • Sensor devices
  • Temperature, weather measurements
  • Road traffic data
  • Geological observations
  • Patient physiological data
  • Embedded devices
  • Network routers
  • Intelligent (active) disks

5
Introduction
  • Limited resources
  • Memory
  • Bandwidth
  • Power
  • CPU
  • Remote environments
  • No human intervention

6
Introduction: problem definition
  • Given a semi-infinite stream of values (time
    series) x1, x2, ..., xt, ...
  • Find patterns, forecasts, outliers

7
Introduction
  • E.g.,

Periodicity? (twice daily)
Periodicity? (daily)
8
Introduction
  • Can we capture these patterns
  • automatically
  • with limited resources?

9
Related work: Statistics, Time series forecasting
  • Main problem
  • "The first step in the analysis of any time
    series is to plot the data and inspect the
    graph" [Brockwell 91]
  • Typically
  • Resource intensive
  • Cannot update online
  • AR(I)MA and seasonal variants
  • ARFIMA, GARCH, ...

10
Related work: Databases, Continuous Queries
  • Typically, different focus
  • Compression
  • Not generative models
  • Largely orthogonal problem
  • Gilbert, Guha, Indyk et al. (STOC 2002)
  • Garofalakis, Gibbons (SIGMOD 2002)
  • Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh
    (ICDE 2003)
  • Gehrke, Korn, et al. (SIGMOD 2001); Dobra,
    Garofalakis, Gehrke et al. (SIGMOD 2002)
  • Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et
    al. (SODA 2002)
  • Madden (SIGMOD 2002, SIGMOD 2003)

11
Goals
  • Adapt and handle arbitrary periodic components
  • No human intervention/tuning
  • Also
  • Single pass over the data
  • Limited memory (logarithmic)
  • Constant-time update

12
Outline
  • Introduction - motivation
  • Problem 1: Stream Mining
  • Motivation
  • Main idea
  • Experimental results
  • Problem 2: Graphs and Virus propagation
  • Conclusions

13
Wavelets: Straight signal
[Figure: signal; axis: time]
14
Wavelets: Introduction (Haar)
[Figure: axes: frequency, time]
15
Wavelets
  • So?
  • Wavelets compress many real signals well
  • Image compression and processing
  • Vision; astronomy, seismology, ...
  • Wavelet coefficients can be updated as new points
    arrive [Kotidis]
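
To make the Haar decomposition on these slides concrete, here is a minimal batch sketch in Python. The function name `haar_transform` and the orthonormal 1/sqrt(2) scaling are assumptions chosen for the example. It recomputes everything from scratch, so it is not the incremental update scheme attributed to Kotidis above.

```python
import math


def haar_transform(signal):
    """Return (approximation, details) for a signal whose length is a power of two.

    details[0] holds the finest-scale coefficients, details[-1] the coarsest.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(v) for v in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Orthonormal Haar step: scaled pairwise differences (detail)
        # and scaled pairwise sums (smooth part passed to the next level).
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx[0], details


smooth, levels = haar_transform([4, 4, 4, 4, 8, 8, 0, 0])
```

In this toy signal the only nonzero detail coefficient (about 8) sits at the middle level, where the step from 8 to 0 occurs. The few large coefficients are what makes such signals compress well.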

16
Wavelets: Correlations
[Figure: xt; axes: frequency, time]
17
Wavelets: Correlations
[Figure: xt; axes: frequency, time]
18
Main idea: Correlations
  • Wavelets are good
  • we can do even better
  • One number
  • and the fact that they are equal/correlated

19
Proposed method
$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$
Small windows suffice (k ≈ 4)
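
A minimal sketch of how such a per-level model could be set up and used. It assumes the model regresses each wavelet coefficient on its k predecessors at the same level. The helper names `lagged_rows` and `predict_next` are hypothetical, and the coefficients passed as `betas` are made-up numbers rather than fitted values.

```python
def lagged_rows(coeffs, k):
    """Build (predictors, target) pairs for one wavelet level.

    coeffs: coefficients of a single level, oldest first.
    Each row pairs [W_{t-1}, ..., W_{t-k}] with the target W_t.
    """
    return [(coeffs[t - k:t][::-1], coeffs[t]) for t in range(k, len(coeffs))]


def predict_next(coeffs, betas):
    """Predict the next coefficient from the len(betas) most recent ones."""
    k = len(betas)
    if len(coeffs) < k:
        raise ValueError("need at least k past coefficients")
    recent = coeffs[-k:][::-1]  # W_{t-1}, W_{t-2}, ...
    return sum(b * w for b, w in zip(betas, recent))


rows = lagged_rows([1.0, 2.0, 3.0, 4.0], k=2)
forecast = predict_next([1.0, 2.0, 3.0, 4.0], betas=[2.0, -1.0])
```

The rows are exactly what a least-squares fit would consume. With k around 4, each level only needs to remember a handful of recent coefficients.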
20
More details
  • Update of wavelet coefficients (incremental)
  • Update of linear models (incremental RLS)
  • Feature selection (single-pass)
  • Not all correlations are significant
  • Throw away the insignificant ones
  • very important!!
  • see paper
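
The incremental RLS update of the linear models can be sketched as a textbook recursive least squares step. This is a generic version under stated assumptions (no forgetting factor, a large initial covariance `delta`, and a hypothetical class name `RecursiveLeastSquares`). The paper's exact variant may differ.

```python
class RecursiveLeastSquares:
    """Textbook RLS: refine weights w so that y is approximately w . x."""

    def __init__(self, k, delta=1e6):
        self.k = k
        self.w = [0.0] * k
        # P approximates the inverse of X^T X; a large delta means a weak prior.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def update(self, x, y):
        """Absorb one (x, y) pair in O(k^2) time; return the prior error."""
        k = self.k
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - sum(self.w[i] * x[i] for i in range(k))
        self.w = [self.w[i] + gain[i] * err for i in range(k)]
        # P <- P - gain (Px)^T, valid because P is symmetric.
        for i in range(k):
            for j in range(k):
                self.P[i][j] -= gain[i] * Px[j]
        return err


# Toy stream that follows s_t = 0.5 s_{t-1} + 0.3 s_{t-2} exactly.
series = [1.0, 2.0]
for _ in range(40):
    series.append(0.5 * series[-1] + 0.3 * series[-2])

model = RecursiveLeastSquares(k=2)
for t in range(2, len(series)):
    model.update([series[t - 1], series[t - 2]], series[t])
```

Each update touches only the k-by-k matrix P and the weight vector, so the cost per new point does not depend on how many points have been seen.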
21
Complexity
SKIP
  • Model update
  • Space: $O(\lg N + m k^2) \approx O(\lg N)$
  • Time: $O(k^2) \approx O(1)$
  • Where
  • N: number of points (so far)
  • k: number of regression coefficients (fixed)
  • m: number of linear models, $O(\lg N)$
  • see paper

22
Outline
  • Introduction - motivation
  • Problem 1: Stream Mining
  • Motivation
  • Main idea
  • Experimental results
  • Problem 2: Graphs and Virus propagation
  • Conclusions

23
Setup
  • First half used for model estimation
  • Models applied forward to forecast entire second
    half
  • AR, Seasonal AR (SAR): R
  • Simplest possible estimation: no maximum
    likelihood estimation (MLE), etc.
  • vs. Python scripts

24
Results: Synthetic data, Triangle pulse
  • Triangle pulse
  • AR captures wrong trend (or none)
  • Seasonal AR (SAR) estimation fails

25
Results: Synthetic data, Mix
  • Mix (sine + square pulse)
  • AR captures wrong trend (or none)
  • Seasonal AR estimation fails

26
Results: Real data, Automobile
(filtered)
  • Automobile traffic
  • Daily periodicity with rush-hour peaks
  • Bursty noise at smaller time scales

27
Results: Real data, Automobile
  • Automobile traffic
  • Daily periodicity with rush-hour peaks
  • Bursty noise at smaller time scales
  • AR fails to capture any trend (average)
  • Seasonal AR estimation fails

28
Results: Real data, Automobile
  • Automobile traffic
  • Daily periodicity with rush-hour peaks
  • Bursty noise at smaller time scales
  • AWSOM spots periodicities, automatically

29
Results: Real data, Automobile
  • Automobile traffic
  • Daily periodicity with rush-hour peaks
  • Bursty noise at smaller time scales
  • Generation with identified noise

30
Results: Real data, Sunspot
  • Sunspot intensity: slightly time-varying
    period
  • AR captures wrong trend (average)
  • Seasonal ARIMA
  • Captures immediate wrong downward trend
  • Requires human to determine seasonal component
    period (fixed)

31
Results: Real data, Sunspot
  • Sunspot intensity: slightly time-varying
    period

Estimation: 40 minutes (R) vs. 9 seconds (Python)
32
Variance
SKIP
Hurst exponent
  • Variance (log-power) vs. scale
  • Noise diagnostic (if decreasing linear)
  • Can use to estimate noise parameters

33
Running time
time (t)
stream size (N)
34
Space requirements
Equal total number of model parameters
35
Conclusion
  • Adapt and handle arbitrary periodic components
  • No human intervention/tuning
  • Single pass over the data
  • Limited memory (logarithmic)
  • Constant-time update

36
Conclusion
  • Adapt and handle arbitrary periodic components
  • No human intervention/tuning
  • Single pass over the data
  • Limited memory (logarithmic)
  • Constant-time update

no human
limited resources
37
Outline
  • Introduction - motivation
  • Problem 1: Streams
  • Problem 2: Graphs and Virus propagation
  • Motivation and problem definition
  • Related work
  • Main idea
  • Experiments
  • Conclusions

38
Introduction
Protein Interactions [genomebiology.com]
Internet Map [lumeta.com]
Food Web [Martinez 91]
Friendship Network [Moody 01]
→ Graphs are ubiquitous
39
Introduction
bridges
  • What can we do with graph analysis?
  • Immunization
  • Information Dissemination
  • "network value" of a customer [Domingos]

Needle exchange networks of drug users [Weeks
et al. 2002]
40
Problem definition
  • Q1: How does a virus spread across an arbitrary
    network?
  • Q2: Will it create an epidemic?
  • (in a sensor setting, with a gossip protocol,
    will a rumor/query spread?)

41
Framework
  • Susceptible-Infected-Susceptible (SIS) model
  • Cured nodes immediately become susceptible

42
Framework
  • β: prob. an infected neighbor attacks
  • δ: prob. an infected node heals

43
The model
  • (virus) Birth rate β: probability that an
    infected neighbor attacks
  • (virus) Death rate δ: probability that an
    infected node heals

[Figure: nodes N, N1, N2, N3; states: Healthy, Infected]
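
A minimal discrete-time simulation sketch of this SIS model. The synchronous update order, the function name `simulate_sis`, and the example parameter values are assumptions for illustration, not details taken from the slides.

```python
import random


def simulate_sis(adj, beta, delta, steps, initially_infected, seed=0):
    """Return the number of infected nodes after each synchronous step.

    adj: dict mapping each node to a list of its neighbours (undirected).
    beta: prob. that an infected neighbour attacks successfully in one step.
    delta: prob. that an infected node heals in one step.
    """
    rng = random.Random(seed)
    infected = set(initially_infected)
    history = [len(infected)]
    for _ in range(steps):
        nxt = set()
        for node in adj:
            if node in infected:
                # A healed node is immediately susceptible again (SIS).
                if rng.random() >= delta:
                    nxt.add(node)
            else:
                # Each infected neighbour attacks independently.
                if any(rng.random() < beta for nb in adj[node] if nb in infected):
                    nxt.add(node)
        infected = nxt
        history.append(len(infected))
    return history


# Star network: hub 0 with 99 spokes, starting with the hub infected.
star = {0: list(range(1, 100))}
star.update({i: [0] for i in range(1, 100)})
history = simulate_sis(star, beta=0.05, delta=0.5, steps=50, initially_infected=[0])
```

Plotting `history` for several β/δ ratios is one simple way to see whether the infection dies out or persists.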
44
Epidemic threshold τ
  • Defined as the value of τ, such that
  • if β/δ < τ
  • an epidemic cannot happen
  • Thus,
  • given a graph
  • compute its epidemic threshold

45
Epidemic threshold τ
  • What should τ depend on?
  • avg. degree? and/or highest degree?
  • and/or variance of degree?
  • and/or determinant of the adjacency matrix?

46
Basic Homogeneous Model
  • Homogeneous graphs [Kephart-White 91, 93]
  • Epidemic threshold = $1/\langle k \rangle$
  • Homogeneous connectivity $\langle k \rangle$, i.e., all nodes have
    the same degree (unrealistic)

47
Power-law Networks
  • Model for Barabási-Albert networks
  • [Pastor-Satorras & Vespignani 01, 02]
  • Epidemic threshold = $\langle k \rangle / \langle k^2 \rangle$
  • for BA-type networks, with only γ = 3 (γ = slope
    of the power law)

48
Epidemic threshold
  • Homogeneous graphs: $1/\langle k \rangle$
  • BA (γ = 3): $\langle k \rangle / \langle k^2 \rangle$
  • more complicated graphs: ?
  • arbitrary, REAL graphs: ?
  • how many parameters??

49
Epidemic threshold
  • Theorem: We have no epidemic if

$\beta/\delta < \tau = 1/\lambda_{1,A}$
50
Epidemic threshold
  • Theorem: We have no epidemic if

$\beta/\delta < \tau = 1/\lambda_{1,A}$
where β is the attack prob., δ the recovery prob., τ the epidemic threshold, and $\lambda_{1,A}$ the largest eigenvalue of the adj. matrix A
Proof: [Wang03]
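
As a sketch of how the threshold in this theorem could be computed for a given graph, the following Python code estimates the largest adjacency eigenvalue by power iteration and returns its reciprocal. The function names, the adjacency-list input format, and the shift by the identity matrix are assumptions for the example. The graph is assumed undirected and connected.

```python
import math


def largest_eigenvalue(adj, iterations=1000, tol=1e-12):
    """Estimate the largest adjacency eigenvalue of an undirected, connected graph."""
    nodes = list(adj)
    x = {v: 1.0 for v in nodes}
    lam = 0.0
    for _ in range(iterations):
        # Multiply by (A + I); the shift avoids oscillation on bipartite graphs.
        y = {v: x[v] + sum(x[u] for u in adj[v]) for v in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        y = {v: val / norm for v, val in y.items()}
        # Rayleigh quotient of A itself for the normalised vector y.
        new_lam = sum(y[v] * sum(y[u] for u in adj[v]) for v in nodes)
        if abs(new_lam - lam) < tol:
            return new_lam
        x, lam = y, new_lam
    return lam


def epidemic_threshold(adj):
    """Threshold tau = 1 / lambda_1 of the adjacency matrix."""
    return 1.0 / largest_eigenvalue(adj)


# Star network with one hub (node 0) and 99 spokes.
star = {0: list(range(1, 100))}
star.update({i: [0] for i in range(1, 100)})
tau = epidemic_threshold(star)
```

For large sparse graphs a library eigensolver would be the usual choice. The point here is only that a single number per graph is needed.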
51
Epidemic threshold for various networks
  • sanity checks / older results
  • Homogeneous networks
  • $\lambda_{1,A} = \langle k \rangle$; $\tau = 1/\langle k \rangle$
  • where $\langle k \rangle$ is the average degree
  • This is the same result as that of Kephart & White!
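
The homogeneous case can be checked in one line. In a graph where every node has degree k, the all-ones vector is an eigenvector of the adjacency matrix A with eigenvalue k. No adjacency eigenvalue can exceed the maximum degree, so k is the largest one.

```latex
% A: adjacency matrix of a graph in which every node has degree k
% \mathbf{1}: the all-ones vector
(A\mathbf{1})_i = \sum_j A_{ij} = k
\quad\Longrightarrow\quad
A\mathbf{1} = k\,\mathbf{1}
\quad\Longrightarrow\quad
\lambda_{1,A} = k = \langle k \rangle,
\qquad
\tau = \frac{1}{\lambda_{1,A}} = \frac{1}{\langle k \rangle}
```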

52
Epidemic threshold for various networks
  • sanity checks / older results
  • Star networks
  • $\lambda_{1,A} = \sqrt{d}$; $\tau = 1/\sqrt{d}$
  • where d is the degree of the central node
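
The star-network value can be verified directly by guessing an eigenvector that puts weight sqrt(d) on the hub and 1 on every spoke. Because this vector is strictly positive, its eigenvalue is the largest one (Perron-Frobenius).

```latex
% Star with hub h and d spokes; try v_h = \sqrt{d} and v_s = 1 for every spoke s.
(Av)_h = \sum_{s} v_s = d = \sqrt{d}\,v_h,
\qquad
(Av)_s = v_h = \sqrt{d} = \sqrt{d}\,v_s
\quad\Longrightarrow\quad
Av = \sqrt{d}\,v,
\qquad
\lambda_{1,A} = \sqrt{d},
\quad
\tau = \frac{1}{\sqrt{d}}
```

For a hub with d = 99 spokes this gives τ = 1/sqrt(99), roughly 0.1005.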

53
Epidemic threshold for various networks
  • sanity checks / older results
  • Infinite, power-law networks
  • $\lambda_{1,A} = \infty$; $\tau = 0$: any virus has a chance!
    [Barabasi et al]
  • Finite power-law networks
  • $\tau = 1/\lambda_{1,A}$

54
Outline
  • Introduction - motivation
  • Problem 1: Streams
  • Problem 2: Graphs and Virus propagation
  • Motivation and problem definition
  • Related work
  • Main idea
  • Experiments
  • Conclusions

55
Experiments
  • 2 graphs
  • Star network: one hub + 99 spokes
  • Oregon: Internet AS graph
  • 10,900 nodes, 31,180 edges
  • topology.eecs.umich.edu/data.html
  • More in our paper [SRDS 03]

56
Experiments (Star)
β/δ > τ (above threshold)
β/δ = τ (at the threshold)
57
Experiments (Oregon)
β/δ > τ (above threshold)
β/δ = τ (at the threshold)
β/δ < τ (below threshold)
58
Our prediction vs. previous prediction
[Figure: number of infected nodes vs. β/δ; panels: Oregon, Star; labels: "Our", "PL3"]
  • our predictions are more accurate

59
Conclusions
  • We found an epidemic threshold
  • that applies to any network topology
  • and that depends on only one parameter of the
    graph

60
Overall conclusions
  • Automatic stream mining: AWSOM
  • Graphs and virus propagation: eigenvalue

61
Ongoing / related work
  • Streams
  • how to find hidden variables on multiple streams
    w/ Spiros and Jimeng Sun
  • network tomography w/ Airoldi
  • Graphs
  • graph partitioning w/ Deepay
  • important subgraphs w/ Tomkins, McCurley
  • graph generators: RMAT, w/ Deepay

62
Thank you!
  • Contact info
  • christos _at_ cs.cmu.edu
  • spapadim _at_ cs.cmu.edu
  • deepay _at_ cs.cmu.edu

63
Main References
  • Spiros Papadimitriou, Anthony Brockwell and
    Christos Faloutsos. Adaptive, Hands-Off Stream
    Mining. VLDB 2003, Berlin, Germany, Sept. 2003.
  • [Wang03] Yang Wang, Deepayan Chakrabarti, Chenxi
    Wang and Christos Faloutsos. Epidemic Spreading
    in Real Networks: an Eigenvalue Viewpoint. SRDS
    2003, Florence, Italy.

64
Additional References
  • Connection Subgraphs, C. Faloutsos, K. McCurley,
    A. Tomkins, SIAM-DM 2004 workshop on link
    analysis
  • RMAT: A recursive graph generator, D.
    Chakrabarti, Y. Zhan, C. Faloutsos, SIAM-DM 2004
  • iFilter: Network tomography using particle
    filters, Edoardo Airoldi, Christos Faloutsos
    (submitted)