Beyond Streams and Graphs: Dynamic Tensor Analysis

1
Beyond Streams and Graphs: Dynamic Tensor Analysis
Dacheng Tao, Jimeng Sun, Christos Faloutsos
Speaker: Jimeng Sun
2
Motivation
  • Goal: incremental pattern discovery on streaming applications
  • Streams
    • E1: Environmental sensor networks
    • E2: Cluster/data center monitoring
  • Graphs
    • E3: Social network analysis
  • Tensors
    • E4: Network forensics
    • E5: Financial auditing
    • E6: fMRI brain image analysis
  • How to summarize streaming data effectively and incrementally?

3
E3: Social network analysis
  • Traditionally, people focus on static networks
    and find community structures
  • We plan to monitor the change of the community
    structure over time and identify abnormal
    individuals

4
E4: Network forensics
  • Directional network flows
  • A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]
  • 450 GB/hour with compression
  • Task: identify abnormal traffic patterns and find out the cause

[Figure: source-by-destination traffic matrices, one for normal traffic and one for abnormal traffic]
Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
5
Static Data model
  • For a timestamp, the stream measurements can be modeled using a tensor
  • Dimension: a single stream
    • E.g., <Christos, graph>
  • Mode: a group of dimensions of the same kind
    • E.g., Source, Destination, Port

[Figure: source-by-destination matrix at time 0]
6
Static Data model (cont.)
  • Tensor
    • Formally, $\mathcal{X} \in \mathbb{R}^{N_1 \times \cdots \times N_M}$
    • Generalization of matrices
    • Represented as a multi-array or data cube

Order            1st      2nd      3rd
Correspondence   Vector   Matrix   3D array
Example          [figures not preserved]
7
Dynamic Data model (our focus)
[Figure: source-by-destination matrices stacked along the time axis]
  • Streams come with structure
    • (time, source, destination, port)
    • (time, author, keyword)
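
As a rough illustration of how structured records such as (time, source, destination, port) could be turned into one tensor per timestamp, here is a minimal sketch. The function name `build_tensor_stream`, the integer-coded indices, and the use of simple counts as tensor values are all assumptions made for this example, not details from the slides.

```python
import numpy as np
from collections import defaultdict

def build_tensor_stream(records, shape):
    """Group (time, source, destination, port) records into one count tensor per timestamp.

    Assumes source, destination, and port are already integer-coded (hypothetical encoding).
    """
    stream = defaultdict(lambda: np.zeros(shape))
    for t, src, dst, port in records:
        stream[t][src, dst, port] += 1      # count flows per cell
    # Return the tensors ordered by timestamp.
    return [stream[t] for t in sorted(stream)]

# Three toy records: two identical flows at time 0, one flow at time 1.
records = [(0, 0, 1, 2), (0, 0, 1, 2), (1, 2, 0, 1)]
tensors = build_tensor_stream(records, shape=(3, 3, 4))
```

Each element of `tensors` is a 3rd order array; the list as a whole plays the role of a tensor stream.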

8
Dynamic Data model (cont.)
  • Tensor streams
    • A sequence of Mth order tensors

Order            1st                2nd                    3rd
Correspondence   Multiple streams   Time-evolving graphs   3D arrays
Example          [Figure: author-by-keyword matrices over time]


9
Dynamic tensor analysis
[Figure: a new source-by-destination tensor shown next to the old tensors and the old cores, with projection matrices labeled U_Source and U_Destination]
10
Roadmap
  • Motivation and main ideas
  • Background and related work
  • Dynamic and streaming tensor analysis
  • Experiments
  • Conclusion

11
Background: Singular value decomposition (SVD)
  • SVD
  • Best rank-k approximation in L2
  • PCA is an important application of SVD

[Figure: an m × n matrix A approximated by rank-k factors; labels include U, U^T, V^T, and Y]
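
The rank-k approximation property can be checked numerically. The sketch below uses NumPy's `np.linalg.svd`; the helper name `best_rank_k` and the random test matrix are assumptions for illustration only.

```python
import numpy as np

def best_rank_k(A, k):
    """Rank-k approximation of A built from its k largest singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
A_k = best_rank_k(A, 2)

# Eckart-Young: the spectral-norm error equals the (k+1)-th singular value.
singular_values = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k, 2)
```

Here `err` matches `singular_values[2]` up to floating-point precision, which is the sense in which the truncated SVD is the best rank-k approximation in L2.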
12
Latent semantic indexing (LSI)
  • Singular vectors are useful for clustering or
    correlation detection

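[Figure: LSI example factoring a document-term matrix into document-concept × concept-association × concept-term; terms shown are cluster, cache, frequent, query, pattern, and the concepts are DB and DM]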
13
Tensor Operation: Matricize X(d)
  • Unfold a tensor into a matrix

Acknowledgment to Tammy Kolda for this slide
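
A minimal sketch of mode-d unfolding in NumPy follows. It assumes the convention in which the unfolded matrix has N_d columns (one row per mode-d fiber), so that X(d)^T X(d) is an N_d × N_d matrix; other references put mode d on the rows instead. The name `matricize` is chosen for this example.

```python
import numpy as np

def matricize(X, d):
    """Unfold tensor X along mode d into a matrix with N_d columns.

    Each row is one mode-d fiber: all indices fixed except the d-th.
    """
    return np.moveaxis(X, d, -1).reshape(-1, X.shape[d])

X = np.arange(24).reshape(2, 3, 4)
X1 = matricize(X, 1)        # shape (2 * 4, 3)
first_fiber = X1[0]         # equals X[0, :, 0]
```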
14
Tensor Operation: Mode-product
  • Multiply a tensor with a matrix

[Figure: a source × destination × port tensor multiplied along the source mode by a matrix relating sources to groups, giving a group × destination × port tensor]
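
The mode product can be sketched with `np.tensordot`. This example assumes the matrix U has shape N_d × R, so that mode d of the result has size R; the function name `mode_product` and the toy sizes are illustrative assumptions.

```python
import numpy as np

def mode_product(X, U, d):
    """Multiply tensor X by matrix U (shape N_d x R) along mode d."""
    # Contract mode d of X with the rows of U; tensordot appends the new axis last.
    Y = np.tensordot(X, U, axes=([d], [0]))
    # Move the new axis back to position d.
    return np.moveaxis(Y, -1, d)

rng = np.random.default_rng(0)
X = rng.random((5, 4, 3))       # hypothetical source x destination x port tensor
U = rng.random((5, 2))          # maps 5 sources to 2 groups
Y = mode_product(X, U, 0)       # group x destination x port
```

Each entry of `Y` is a weighted sum over the source index, for example `Y[1, 2, 0] = sum_i X[i, 2, 0] * U[i, 1]`.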
15
Related work
Low-rank approximation: PCA, SVD (orthogonal-based projection)
Multilinear analysis: tensor decompositions (Tucker, PARAFAC, HOSVD)
Stream mining: scan data once to identify patterns
  • Sampling: Vitter85, Gibbons98
  • Sketches: Indyk00, Cormode03
Graph mining
  • Explorative: Faloutsos04, Kumar99, Leskovec05
  • Algorithmic: Yan05, Cormode05
Our Work
16
Roadmap
  • Motivation and main ideas
  • Background and related work
  • Dynamic and streaming tensor analysis
  • Experiments
  • Conclusion

17
Tensor analysis
  • Given a sequence of tensors, find the projection matrices such that the reconstruction error e is minimized

Note that this is a generalization of PCA when n is a constant
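
One way to make the objective concrete is the sketch below. It assumes the reconstruction error is the sum over the sequence of the squared Frobenius norm between each tensor and its projection onto the subspaces spanned by the projection matrices (each U_d applied as U_d U_d^T along mode d). The slide's own formula did not survive extraction, so this definition is an assumption, as are the function names.

```python
import numpy as np

def mode_product(X, U, d):
    """Multiply tensor X by matrix U (shape N_d x R) along mode d."""
    return np.moveaxis(np.tensordot(X, U, axes=([d], [0])), -1, d)

def reconstruction_error(tensors, Us):
    """Sum of squared Frobenius errors after projecting every mode onto span(U_d)."""
    total = 0.0
    for X in tensors:
        X_hat = X
        for d, U in enumerate(Us):
            X_hat = mode_product(X_hat, U @ U.T, d)   # project mode d
        total += np.sum((X - X_hat) ** 2)
    return total

I = np.eye(2)
X = np.arange(4.0).reshape(2, 2)                      # [[0, 1], [2, 3]]
e_full = reconstruction_error([X], [I, I])            # nothing is discarded
e_row = reconstruction_error([X], [I[:, :1], I])      # keeps only the first row
```

With full-rank projections nothing is lost; dropping the second row direction leaves an error of 2² + 3² = 13.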
18
Why do we care?
  • Anomaly detection
    • Reconstruction error driven
    • Multiple resolution
  • Multiway latent semantic indexing (LSI)

[Figure: multiway LSI example with authors Philip Yu and Michael Stonebraker, keyword groups Pattern and Query, and a time axis]
19
1st order DTA - problem
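  • Given $x_1, \ldots, x_n$ where each $x_i \in \mathbb{R}^N$, find $U \in \mathbb{R}^{N \times R}$ such that the error e is small

[Figure: the data matrix X has n rows (time, $x_1$ to $x_n$) and N columns (sensors, indoor and outdoor); it is mapped through $U^T$ to Y, which has R columns]
Note that $Y = XU$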
20
1st order DTA
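  • Input: new data vector $x \in \mathbb{R}^N$, old variance matrix $C \in \mathbb{R}^{N \times N}$
  • Output: new projection matrix $U \in \mathbb{R}^{N \times R}$
  • Algorithm
    • 1. Update the variance matrix: $C_{new} = x^T x + C$
    • 2. Diagonalize: $U \Lambda U^T = C_{new}$
    • 3. Determine the rank R and return U

[Figure: the new row x is appended to the old data X over time; $C_{new}$ is formed from C and $x^T x$ and factored into U and $U^T$]
Diagonalization has to be done for every new x!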
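
The three steps can be sketched in a few lines of NumPy. This is an illustration under assumptions: x is treated as a row vector (so $x^T x$ is an outer product), the rank R is passed in rather than determined from the eigenvalues, and the name `dta_first_order` is made up for this example.

```python
import numpy as np

def dta_first_order(x, C, R):
    """One 1st order DTA step: update C with x, diagonalize, keep R eigenvectors."""
    C_new = C + np.outer(x, x)                  # step 1: C_new = x^T x + C
    eigvals, eigvecs = np.linalg.eigh(C_new)    # step 2: U Lambda U^T = C_new
    order = np.argsort(eigvals)[::-1]           # eigh returns ascending eigenvalues
    U = eigvecs[:, order[:R]]                   # step 3: keep the R leading eigenvectors
    return U, C_new

# Toy stream: every point lies on the line spanned by v.
C = np.zeros((2, 2))
v = np.array([0.6, 0.8])
for t in range(1, 6):
    U, C = dta_first_order(t * v, C, R=1)
alignment = abs(U[:, 0] @ v)                    # 1.0 when U recovers the line
```

Note that `np.linalg.eigh` is called once per incoming vector, which is exactly the cost the slide points out.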
21
1st order STA
  • Adjust U smoothly when new data arrive, without diagonalization [VLDB05]
  • For each new point x
    • Project onto current line
    • Estimate error
    • Rotate line in the direction of the error and in proportion to its magnitude
  • For each new point x and for $i = 1, \ldots, k$
    • $y_i = U_i^T x$ (projection onto $U_i$)
    • $d_i \leftarrow \lambda d_i + y_i^2$ (energy $\propto$ i-th eigenvalue)
    • $e_i = x - y_i U_i$ (error)
    • $U_i \leftarrow U_i + (1/d_i) y_i e_i$ (update estimate)
    • $x \leftarrow x - y_i U_i$ (repeat with remainder)

[Figure: a new point in the Sensor 1 vs. Sensor 2 plane, the current direction U, and the error between them]
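
The update loop above can be written directly in NumPy. In this sketch the forgetting factor `lam`, the small positive starting energy, and the name `sta_update` are assumptions; the columns of U are not renormalized, so only their direction is compared in the usage example.

```python
import numpy as np

def sta_update(x, U, energy, lam=1.0):
    """Update the k tracked directions U[:, i] and their energies in place for one point x."""
    x = np.array(x, dtype=float)                # work on a copy; x becomes the remainder
    for i in range(U.shape[1]):
        y = U[:, i] @ x                         # projection onto U_i
        energy[i] = lam * energy[i] + y * y     # energy estimate
        e = x - y * U[:, i]                     # error
        U[:, i] += (y / energy[i]) * e          # update estimate
        x -= y * U[:, i]                        # repeat with remainder
    return U, energy

U = np.array([[1.0], [0.0]])                    # initial guess, k = 1
energy = np.array([1e-3])                       # small positive starting energy (assumed)
v = np.array([0.6, 0.8])                        # direction that generates the toy data
for s in np.linspace(1.0, 2.0, 50):
    sta_update(s * v, U, energy)
cosine = abs(U[:, 0] @ v) / np.linalg.norm(U[:, 0])
```

No eigendecomposition is performed; each point costs one pass over the k tracked directions.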
22
Mth order DTA
23
Mth order DTA complexity
  • Storage
    • $O(\prod_i N_i)$, i.e., the size of an input tensor at a single timestamp
  • Computation
    • $\sum_i N_i^3$ (or $\sum_i N_i^2$): diagonalization of C
    • $\sum_i N_i \prod_j N_j$: matrix multiplication $X_{(d)}^T X_{(d)}$
    • For low order tensors (< 3), diagonalization is the main cost
    • For high order tensors, matrix multiplication is the main cost
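
The two cost terms correspond to two operations performed per mode. The sketch below is one possible reading of an Mth order DTA step: for each mode, matricize the new tensor, add $X_{(d)}^T X_{(d)}$ to that mode's variance matrix, and diagonalize. The fixed ranks, the absence of a forgetting factor, and the names `matricize` and `dta_update` are assumptions for illustration.

```python
import numpy as np

def matricize(X, d):
    """Unfold X along mode d into a matrix with N_d columns."""
    return np.moveaxis(X, d, -1).reshape(-1, X.shape[d])

def dta_update(X, Cs, ranks):
    """One DTA step: update each mode's variance matrix, diagonalize, build the core."""
    Us, new_Cs = [], []
    for d, (C, R) in enumerate(zip(Cs, ranks)):
        Xd = matricize(X, d)
        C_new = C + Xd.T @ Xd                        # matrix multiplication term
        eigvals, eigvecs = np.linalg.eigh(C_new)     # diagonalization term
        order = np.argsort(eigvals)[::-1][:R]
        Us.append(eigvecs[:, order])
        new_Cs.append(C_new)
    core = X
    for d, U in enumerate(Us):                       # core = X x_1 U_1 ... x_M U_M
        core = np.moveaxis(np.tensordot(core, U, axes=([d], [0])), -1, d)
    return Us, new_Cs, core

rng = np.random.default_rng(0)
X = rng.random((4, 3, 2))
Cs = [np.zeros((n, n)) for n in X.shape]
Us, Cs_new, core = dta_update(X, Cs, ranks=(2, 2, 1))
_, _, core_full = dta_update(X, Cs, ranks=X.shape)   # full ranks lose nothing
```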

24
Mth order STA
  • Run 1st order STA along each mode
  • Complexity
    • Storage: $O(\prod_i N_i)$
    • Computation: $\sum_i R_i \prod_j N_j$, which is smaller than DTA
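
A minimal sketch of "1st order STA along each mode" follows, assuming that every mode-d fiber of the new tensor is fed to the 1st order update for that mode (a real implementation might process only a sample of the fibers). Each fiber costs about $R_d N_d$ operations and there are $\prod_{j \ne d} N_j$ fibers per mode, which is where the computation term comes from. The function names and initial values are assumptions.

```python
import numpy as np

def sta_update(x, U, energy, lam=1.0):
    """1st order STA update of directions U[:, i] and energies, in place."""
    x = np.array(x, dtype=float)
    for i in range(U.shape[1]):
        y = U[:, i] @ x
        energy[i] = lam * energy[i] + y * y
        e = x - y * U[:, i]
        U[:, i] += (y / energy[i]) * e
        x -= y * U[:, i]
    return U, energy

def sta_tensor_update(X, Us, energies, lam=1.0):
    """Run the 1st order update on every mode-d fiber of X, for each mode d."""
    for d in range(X.ndim):
        fibers = np.moveaxis(X, d, -1).reshape(-1, X.shape[d])
        for x in fibers:
            sta_update(x, Us[d], energies[d], lam)
    return Us, energies

# Toy rank-1 tensor (a matrix here): every column is a multiple of a, every row of b.
a = np.array([1.0, 2.0, 2.0])
b = np.array([1.0, 2.0, 3.0, 4.0])
X = np.multiply.outer(a, b)
Us = [np.eye(3)[:, :1].copy(), np.eye(4)[:, :1].copy()]
energies = [np.array([1e-3]), np.array([1e-3])]
sta_tensor_update(X, Us, energies)
cos_a = abs(Us[0][:, 0] @ a) / (np.linalg.norm(Us[0][:, 0]) * np.linalg.norm(a))
cos_b = abs(Us[1][:, 0] @ b) / (np.linalg.norm(Us[1][:, 0]) * np.linalg.norm(b))
```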

25
Roadmap
  • Motivation and main ideas
  • Background and related work
  • Dynamic and streaming tensor analysis
  • Experiments
  • Conclusion

26
Experiment
  • Objectives
  • Computational efficiency
  • Accurate approximation
  • Real applications
  • Anomaly detection
  • Clustering

27
Data set 1: Network data
  • TCP flows collected at CMU backbone
  • Raw data: 500 GB with compression
  • Construct 3rd order tensors with hourly windows with <source, destination, port>
  • Each tensor: 500 × 500 × 100
  • 1200 timestamps (hours)

[Figure: distribution of tensor values for 10AM to 11AM on 01/06/2005; the data are sparse and follow a power-law distribution]
28
Data set 2: Bibliographic data (DBLP)
  • Papers from VLDB and KDD conferences
  • Construct 2nd order tensors with yearly windows with <author, keywords>
  • Each tensor: 4584 × 3741
  • 11 timestamps (years)

29
Computational cost
[Figures: CPU time on the 3rd order network tensor and on the 2nd order DBLP tensor]
  • OTA is the offline tensor analysis
  • Performance metric: CPU time (sec)
  • Observations
    • DTA and STA are orders of magnitude faster than OTA
    • The slight upward trend in DBLP is due to the increasing number of papers each year (the data become denser over time)

30
Accuracy comparison
[Figures: reconstruction error ratio on the 3rd order network tensor and on the 2nd order DBLP tensor]
  • Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA to 20%
  • Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger changes

31
Network anomaly detection
  • Reconstruction error gives an indication of anomalies.
  • The prominent difference between normal and abnormal ones is mainly due to unusual scanning activity (confirmed by the campus admin).

32
Multiway LSI
Group | Authors | Keywords | Year
DB | michael carey, michael stonebraker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995
DB | surajit chaudhuri, mitch cherniack, michael stonebraker, ugur çetintemel | distribut, systems, view, storage, servic, process, cache | 2004
DM | jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | 2004
  • Two groups are correctly identified: Databases and Data mining
  • People and concepts are drifting over time

33
Conclusion
  • Tensor stream is a general data model
  • DTA/STA incrementally decompose tensors into core
    tensors and projection matrices
  • The result of DTA/STA can be used in other applications
    • Anomaly detection
    • Multiway LSI

34
Final word: Think structurally!
  • The world is not flat, neither should data mining
    be.

Contact: Jimeng Sun, jimeng_at_cs.cmu.edu