
Statistical Mining in Data Streams

- Ankur Jain
- Dissertation Defense
- Computer Science, UC Santa Barbara
- Committee
- Edward Y. Chang (chair)
- Divyakant Agrawal
- Yuan-Fang Wang

Roadmap

- The Data Stream Model
- Introduction and research issues
- Related work
- Data Stream Mining
- Stream data clustering
- Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work

Data Streams

- A data stream is an unbounded and continuous sequence of tuples
- Tuples arrive online and could be multi-dimensional
- A tuple seen once cannot be easily retrieved later
- No control over the tuple arrival order

Applications: Sensor Networks

Applications: Network Monitoring

Applications: Text Processing

Applications

- Video surveillance
- Stock ticker monitoring
- Process control in manufacturing
- Traffic monitoring and analysis
- Transaction log processing

Traditional DBMS does not work!

Data Stream Projects

- STREAM (Stanford)
- A general-purpose Data Stream Management System (DSMS)
- Telegraph (Berkeley)
- Adaptive query processing
- TinyDB: a general-purpose sensor database
- Aurora Project (Brown/MIT)
- Distributed stream processing
- Introduces new operators (map, drop, etc.)
- The Cougar Project (Cornell)
- Sensors form a distributed database system
- Cross-layer optimizations (data management layer and the routing layer)
- MAIDS (UIUC)
- Mining Alarming Incidents in Data Streams
- Streaminer: data stream mining

Data Stream Processing: Key Ingredients

- Adaptivity
- Incorporate evolutionary changes in the stream
- Approximation
- Exact results are hard to compute fast with limited memory

A Data Stream Management System (DSMS)

The Central Stream Processing System

Thesis Outline

- Develop fast, online, statistical methods for mining data streams
- Adaptive non-linear clustering in multi-dimensional streams
- Bayesian reasoning for sensor stream processing
- Filtering methods for resource conservation
- Change detection in data streams
- Video sensor data stream processing

Roadmap

- The Data Stream Model
- Introduction and research issues
- Related work
- Data Stream Mining
- Stream data clustering
- Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work

Clustering in High-Dimensional Streams

- Given a continuous sequence of points, group them into some number of clusters, such that the members of a cluster are geometrically close to each other.

Example Application: Network Monitoring

[Figure: high-dimensional connection tuples streaming in from the Internet to a monitoring point]

Stream Clustering: New Challenges

- One-pass restriction and limited memory constraint
- Fading cluster technique proposed by Aggarwal et al.
- Non-linear separation boundaries
- We propose using the kernel trick to deal with the non-linearity issue
- Data dimensionality
- We propose an effective incremental dimension-reduction technique

The 2-Tier Framework

Adaptive Non-linear Clustering

[Figure: Tier 1 segments the d-dimensional input stream; Tier 2 projects the segments into a q-dimensional LDS (q < d), where the fading clusters reside. The 2-tier clustering module uses the kernel trick on the latest point received from the stream.]

Fading Clusters

The Fading Cluster Methodology

- Each cluster Ci has a recency value Ri s.t.
- Ri = f(t - tlast), where
- t: current time
- tlast: last time Ci was updated
- f(t) = e^(-λt)
- λ: fading factor
- A cluster is erased from memory (faded) when Ri < η, where η is a user parameter
- λ controls the influence of historical data
- Total number of clusters is bounded
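The recency computation above is simple enough to sketch directly; the fading factor λ and threshold η below are illustrative assumptions, not parameters from the dissertation.

```python
import math

LAMBDA = 0.01   # fading factor (lambda); assumed value
ETA = 0.05      # fading threshold (eta); assumed value

def recency(t_now, t_last, lam=LAMBDA):
    """R_i = f(t - t_last) with f(t) = exp(-lambda * t)."""
    return math.exp(-lam * (t_now - t_last))

def prune_faded(clusters, t_now, eta=ETA, lam=LAMBDA):
    """Keep only clusters whose recency is still at or above eta."""
    return [c for c in clusters if recency(t_now, c["t_last"], lam) >= eta]

clusters = [{"id": 1, "t_last": 0.0}, {"id": 2, "t_last": 290.0}]
alive = prune_faded(clusters, t_now=300.0)
# cluster 1: exp(-0.01 * 300) = exp(-3) < 0.05, so it fades
# cluster 2: exp(-0.01 * 10) is close to 1, so it survives
```

Because f decays exponentially, every cluster's recency eventually drops below η unless it is updated, which is what bounds the total number of clusters in memory.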

Non-linearity in Data

[Figure: input space vs. feature space under the mapping φ. Traditional clustering techniques (k-means) do not perform well on non-linearly separable input data; spectral clustering methods are likely to perform better after the feature-space mapping.]

Non-linearity in Network Intrusion Data

[Figure: ipsweep attack data, input space vs. feature space. The data follows a geometrically well-behaved trend in the feature space, reached via the kernel trick.]

The Kernel Trick

- Actual projection into the higher dimension is computationally expensive
- The kernel trick does the non-linear projection implicitly!
- Given two input-space vectors x, y:
- k(x, y) = ⟨φ(x), φ(y)⟩, where k is the kernel function
- The Gaussian kernel function k(x, y) = exp(-γ‖x - y‖²) was used in the previous example!
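As a minimal sketch, the Gaussian kernel is a one-liner; the bandwidth γ below is an assumed value, not one taken from the experiments.

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2); gamma is an assumed bandwidth."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(gaussian_kernel((1.0, 2.0), (1.0, 2.0)))  # identical points give k = 1.0
print(gaussian_kernel((0.0, 0.0), (3.0, 4.0)))  # distant points give k near 0
```

Note that k(x, y) depends only on the input-space distance ‖x - y‖, which is why the (infinite-dimensional) feature map of the Gaussian kernel never needs to be materialized.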

Kernel Trick - Working Example

- φ: x = (x1, x2) → φ(x) = (x1², x2², √2·x1x2)   (φ is not required explicitly!)
- ⟨φ(x), φ(z)⟩ = ⟨(x1², x2², √2·x1x2), (z1², z2², √2·z1z2)⟩
- = x1²z1² + x2²z2² + 2x1x2z1z2
- = (x1z1 + x2z2)²
- = ⟨x, z⟩²

The kernel trick allows us to perform operations in the high-dimensional feature space using a kernel function, but without explicitly representing φ.
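The identity above is easy to check numerically: for any concrete pair of points, the explicit feature-space inner product equals the kernel value ⟨x, z⟩².

```python
import math

def phi(x):
    """Explicit feature map: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))   # inner product computed in feature space
implicit = dot(x, z) ** 2        # kernel trick: <x, z>^2, no phi needed
# both equal (1*3 + 2*4)^2 = 121
```

The same check works for any pair of 2-d points, which is exactly what the algebra on this slide proves.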

Stream Clustering: New Challenges

- One-pass restriction and limited memory constraint
- We use the fading cluster technique proposed by Aggarwal et al.
- Non-linear separation boundaries
- We propose using kernel methods to deal with the non-linearity issue
- Data dimensionality
- We propose an effective incremental dimension-reduction technique

Dimensionality Reduction

- A PCA-like kernel method is desirable
- An explicit representation (via EVD) is preferred
- KPCA is computationally prohibitive: O(n³)
- The principal components evolve with time; frequent EVD updates may be necessary
- We propose to perform EVD on grouped data instead of point data

This requires a novel kernel method.


The 2-Tier Framework

- Tier 1 captures the temporal locality in a segment
- A segment is a group of contiguous points in the stream, geometrically packed closely in the feature space
- Tier 2 adaptively selects segments to project data into the LDS
- Selected segments are called representative segments
- Implicit data in the feature space is projected explicitly into the LDS such that the feature-space distances are preserved

The 2-Tier Framework

[Flowchart: processing a point x from the stream.
- TIER 1: Obtain a point x from the stream. If (φ(x) is novel w.r.t. the current segment S and |S| > smin) OR |S| = smax, hand S to Tier 2 and start a new segment; otherwise add x to S.
- TIER 2: Is S a representative segment? If yes, add S to memory and update the LDS; if no, clear the contents of S.
- Clustering: Obtain the projection of x in the LDS. If it is close to an active cluster, assign x to its nearest cluster; otherwise create a new cluster with x. Update cluster centers and recency values, and delete faded clusters.]
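The Tier-1 segmentation loop can be sketched roughly as follows. This is a hedged skeleton on scalar data: the novelty test here is a simple distance-from-the-segment-mean stub, whereas the dissertation's test operates on φ(x) in the kernel-induced feature space, and the segment-size bounds are assumed values.

```python
S_MIN, S_MAX = 2, 4  # assumed segment-size bounds s_min, s_max

def is_novel(x, segment):
    """Stub novelty test: x is far from the segment mean. The actual
    framework applies its test to phi(x) in the feature space instead."""
    if not segment:
        return False
    mean = sum(segment) / len(segment)
    return abs(x - mean) > 1.0

def segment_stream(stream):
    """Cut the stream into segments of temporally local, similar points."""
    segment, closed = [], []
    for x in stream:
        if (is_novel(x, segment) and len(segment) > S_MIN) or len(segment) == S_MAX:
            closed.append(segment)  # Tier 2 would vet this segment here
            segment = []
        segment.append(x)
    return closed, segment

closed, open_segment = segment_stream([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
# the jump from 0.2 to 5.0 is novel, so the first segment is closed there
```

The point of the skeleton is the control flow: a segment is closed either when a novel point arrives (and the segment is already large enough) or when the segment reaches its maximum size.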

Network Intrusion Stream

- Simulated data from MIT Lincoln Labs
- 34 continuous attributes (features)
- 10.5K records
- 22 types of intrusion attacks + 1 normal class

Network Intrusion Stream

Clustering accuracy at LDS dimensionality q = 10

Efficiency - EVD Computations

Image data: 5K records, 576 features, 10 digits

Newswire data: 3.8K records, 16.5K features, 10 news topics

In Retrospect

- We proposed an effective stream clustering framework
- We use the kernel trick to delineate non-linear boundaries efficiently
- We use a stream segmentation approach to continuously project data into a low-dimensional space

Roadmap

- The Data Stream Model
- Introduction and research issues
- Related work
- Contributions Towards Stream Mining
- Stream data clustering
- Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work

Bayesian Reasoning for Sensor Data Processing

- Users submit queries with precision constraints
- Resource conservation is of prime concern to prolong system life
- Data acquisition
- Data communication

"Find the temperature with 80% confidence."

Use probabilistic models at the central site for approximate predictions, preventing actual acquisitions.

Dependencies in Sensor Attributes

Acquisition costs: Temperature 50 J, Voltage 5 J

[Figure: for a "Get Temperature" query, the dependency model (a Bayesian network) recommends "Acquire Voltage!"; the system gets Voltage and reports Temperature inferred from it.]
Using Correlation Models [Deshpande et al., VLDB '04]

- Correlation models ignore conditional dependency
- Intel Lab (real sensor-network data); attributes: Voltage (V), Temperature (T), Humidity (H)
- Voltage is correlated with temperature
- Given humidity (in the 35-40 range), voltage is conditionally independent of temperature!

BN vs. Correlations

- Correlation model [Deshpande et al.]
- Maintains all dependencies
- Search space for finding the best possible alternative sensor attribute is large
- Joint probability is represented in O(n²) cells
- Bayesian Network
- Maintains vital dependencies only
- Lower search complexity: O(n)
- Storage: O(nd), where d is the avg. node degree
- Intuitive dependency structure

(Evaluated on the NDBC Buoy and Intel Lab datasets.)

Bayesian Networks (BN)

- Qualitative part: Directed Acyclic Graph (DAG)
- Nodes: sensor attributes
- Edges: attribute influence relationships
- Quantitative part: Conditional Probability Tables (CPTs)
- Each node X has its own CPT, P(X | parents(X))
- Together, the BN represents the joint probability in factored form: P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T)
- The influence relationship is captured by the conditional entropy function H:
- H(Xi) = -Σ_{l=1..k} P(Xi = xil) log P(Xi = xil)
- We learn the BN by minimizing H(Xi | Parents(Xi))
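To make the factorization and the entropy formulas concrete, here is a toy binary version of the network above; every CPT number is made up for illustration and does not come from the dissertation's datasets.

```python
import math

# Toy CPTs for P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T); values are invented.
P_T = {0: 0.7, 1: 0.3}
P_H_given_T = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
P_V_given_H = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
P_L_given_T = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}

def joint(t, h, v, l):
    """Factored joint probability of one full assignment."""
    return P_T[t] * P_H_given_T[t][h] * P_V_given_H[h][v] * P_L_given_T[t][l]

# Sanity check: the factored joint sums to 1 over all 16 assignments.
total = sum(joint(t, h, v, l)
            for t in (0, 1) for h in (0, 1) for v in (0, 1) for l in (0, 1))

def cond_entropy(cpt, parent_marginal):
    """H(X | Parent) = sum_p P(p) * [-sum_x P(x|p) log P(x|p)] (in nats)."""
    return sum(parent_marginal[p] *
               -sum(q * math.log(q) for q in dist.values() if q > 0)
               for p, dist in cpt.items())

h_H_given_T = cond_entropy(P_H_given_T, P_T)  # H(H | T) for the toy CPTs
```

Structure learning as described on this slide would compare such H(Xi | Parents(Xi)) values across candidate parent sets and keep the parents that minimize them.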

System Architecture

[Figure: system architecture. A group query Q enters the query processor, which consults the stored Bayesian network to produce an acquisition plan; the acquired values are then used to answer the query.]

Finding the Candidate Attributes

- For any attribute in the group-query Q, analyze candidate attributes in the Markov blanket recursively
- Selection criterion: select candidates in a greedy fashion, trading information gain (conditional entropy) against acquisition cost
- Meet precision constraints
- Maximize resource conservation
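A greedy gain-versus-cost selection of the kind described above might be sketched as follows. This is a hedged illustration only: the scoring rule (gain/cost ratio against a fixed gain budget), the attribute names, and all numbers are assumptions, not the dissertation's actual algorithm or data.

```python
def greedy_select(candidates, required_gain):
    """Pick attributes by best information-gain-to-cost ratio until the
    (assumed) gain budget needed for the precision constraint is covered.

    candidates: {attribute_name: (info_gain, acquisition_cost)}
    """
    chosen, gained = [], 0.0
    remaining = dict(candidates)
    while gained < required_gain and remaining:
        best = max(remaining, key=lambda a: remaining[a][0] / remaining[a][1])
        gain, _cost = remaining.pop(best)
        chosen.append(best)
        gained += gain
    return chosen

plan = greedy_select(
    {"voltage": (0.8, 5.0), "humidity": (0.6, 20.0), "temperature": (1.0, 50.0)},
    required_gain=1.2,
)
# voltage has the best gain/cost ratio (0.16), so it is acquired first,
# mirroring the earlier example of acquiring cheap Voltage for Temperature
```

The greedy heuristic matches the slide's intent: prefer cheap, informative surrogates (like voltage) over expensive direct acquisitions (like temperature) while still meeting the precision constraint.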

Experiments: Resource Conservation

NDBC dataset, 7 attributes

Effect of using the MB property with δmin = 0.90

Effect of using group-queries: varying the group-query size |Q|

Results - Selectivity

[Figure: selectivity per attribute: Wave Period (WP), Wind Speed (SP), Air Pressure (AP), Wind Direction (DR), Water Temperature (WT), Wave Height (WH), Air Temperature (AT)]

In Retrospect

- Bayesian networks can encode the sensor dependencies effectively
- Our method provides significant resource conservation for group-queries

Contribution Summary

- Adaptive stream resource management using Kalman filters [SIGMOD '04]
- Adaptive sampling for sensor networks [DMSN '04]
- Adaptive non-linear clustering for data streams [CIKM '06]
- Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention [CVPR '06]
- Filtering the data streams [in submission]
- Efficient diagnostic and aggregate queries on sensor networks [in submission]
- OCODDS: An On-line Change-Over Detection framework for tracking evolutionary changes in Data Streams [in submission]

Future Work

- Develop non-linear techniques for capturing temporal correlations in data streams
- The Bayesian framework can be extended to address what-if queries with counterfactual evidence
- The clustering framework can be extended for developing stream visualization systems
- Incremental EVD techniques can improve the performance further

- Thank You !

- BACKUP SLIDES!

Back to Stream Clustering

- We propose a 2-tier stream clustering framework
- Tier 1: a kernel method that continuously divides the stream into segments
- Tier 2: a kernel method that uses the segments to project data into a low-dimensional space (LDS)
- The fading clusters reside in the LDS

Clustering LDS Projection

Clustering LDS Update

Network Intrusion Stream

Clustering accuracy at LDS dimensionality q = 10

Cluster strengths at LDS dimensionality q = 10

Effect of dimensionality

Query Plan Generation

- Given a group query, the query plan computes the candidate attributes that will actually be acquired to successfully address the query.
- We exploit the Markov Blanket (MB) property to select candidate attributes.
- Given a BN G, the Markov blanket of a node Xi comprises the node and its immediate parents and children.

Exploiting the MB Property

- Given a node Xi and a set of arbitrary nodes Y in a BN s.t. Y ⊆ X \ {Xi}, the conditional entropy of Xi given Y is at least as high as that given its Markov blanket: H(Xi|Y) ≥ H(Xi|MB(Xi)).
- Proof: Separate MB(Xi) into two parts, MB1 = MB(Xi) ∩ Y and MB2 = MB(Xi) - MB1, and denote Z = Y - MB(Xi). Then:
- H(Xi|Y) = H(Xi|Z, MB1)  [Y = Z ∪ MB1]
- ≥ H(Xi|Z, MB1, MB2)  [additional information cannot increase entropy]
- = H(Xi|Z, MB(Xi))  [MB(Xi) = MB1 ∪ MB2]
- = H(Xi|MB(Xi))  [Markov-blanket definition]

Bayesian Reasoning -More Results

Effect of using the MB property with δmin = 0.90

Query-answer quality loss (50-node synthetic-data BN)

Bayesian Reasoning for Group Queries

- More accurate in addressing group queries
- Q = {(Xi, δi) | Xi ∈ X ∧ (0 < δi ≤ 1) ∧ (1 ≤ i ≤ n)}, s.t. δi ≤ max_l P(Xi = xil)
- X = {X1, X2, X3, ..., Xn}: sensor attributes
- δi: confidence parameters
- P(Xi = xil): probability with which Xi assumes the value xil
- Bayesian reasoning is helpful in detecting abnormalities

Bayesian Reasoning: Candidate Attribute Selection Algorithm