Transcript and Presenter's Notes

Title: Anomaly Detection Through a Bayesian SVM


1
Anomaly Detection Through a Bayesian SVM
  • Vasilis A. Sotiris
  • AMSC 664 Final Presentation
  • May 6, 2008
  • Advisor: Dr. Michael Pecht
  • University of Maryland
  • College Park, MD 20783

2
Objectives
  • Develop an algorithm to detect anomalies in
    electronic systems (large multivariate datasets)
  • Perform detection in the absence of negative
    class data (one-class classification)
  • Predict future system performance
  • Develop an application toolbox, CALCEsvm, to
    implement a proof of concept on simulated and
    real data
  • Simulated degradation
  • Lockheed Martin data set

3
Motivation
  • With the increasing functional complexity of on-board
    autonomous systems, there is an increasing demand
    for system-level
  • Health assessment,
  • Fault diagnostics,
  • Failure prognostics
  • This is of special importance for analyzing
    intermittent failures, some of the most common
    failure modes in today's electronics
  • There is a need for efficient and reliable
    prognostics for electronic systems using
    algorithms that can
  • fuse sensor data,
  • discriminate false alarms from actual failures,
  • correlate faults with relevant system events,
  • and reduce redundant processing elements, which
    are subject to common-mode failures

4
Algorithm Objectives
  • Develop a machine learning approach to
  • detect anomalies in large multivariate systems
  • detect anomalies in the absence of reliable
    failure data
  • Mitigate false alarms and intermittent faults and
    failures
  • Predict future system performance

  [Figure: fault space in the x1-x2 plane, showing the
  distribution of training data and the distribution of
  fault/failure data]
5
Data Setup
  • Data is collected at times T_i from a multivariate
    distribution of random variables x_1i, ..., x_mi
  • The x's are the system covariates
  • The X_i's are independent random vectors
  • Class y ∈ {-1, 1}
  • Class probability p(class | X)

  (Estimate p(class | X), given the observations X)
6
Data Decomposition (Models)
  • Extract features from the data by constructing
    lower-dimensional models
  • X: training data, X ∈ R^(n x m)
  • Singular Value Decomposition (SVD): X = U S V^T
  • With the projection matrix H, project the data onto
    the model (M) and residual (R) spaces
  • k: number of principal components (k = 2)
  • x_M: the projection of x onto the model space M
  • x_R: the projection of x onto the residual space R
    (a sketch of this decomposition follows below)
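A minimal sketch of this SVD-based decomposition in Python (an illustration,
not the CALCEsvm code; the centering step and the choice H = V_k V_k^T built
from the leading right singular vectors are assumptions):

    import numpy as np

    def svd_decompose(X, k=2):
        """Split each row of X (n x m) into model-space and residual-space parts."""
        mu = X.mean(axis=0)                 # center so the principal directions
        Xc = X - mu                         # describe variation about the mean
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        Vk = Vt[:k].T                       # m x k basis of the model space M
        H = Vk @ Vk.T                       # projection matrix onto M
        X_M = Xc @ H                        # projection onto the model space
        X_R = Xc - X_M                      # remainder lives in the residual space R
        return X_M, X_R, H, mu

    # Example: 200 observations of 5 correlated covariates
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
    X_M, X_R, H, mu = svd_decompose(X, k=2)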

7
Two Class Support Vector Machines
  [Figure: the mapping Φ from the input space to the feature
  space, the linear solution in the feature space, and the
  resulting nonlinear decision boundary D(x) in the input space]
  • Given nonlinearly separable labeled data x_i with
    labels y_i ∈ {1, -1}
  • Solve a linear optimization problem to find w and b
    in the feature space
  • Form a nonlinear decision function by mapping
    back to the input space
  • The result is a decision boundary on the given
    training set that can be used to classify new
    observations

8
Two Class Support Vector Machines
  • Interested in a function that best separates two
    classes of data
  • The margin M = 2/||w|| can be maximized by
    minimizing ||w||
  • The learning problem is stated as
    min (1/2)||w||^2
  • subject to y_i (w^T x_i + b) ≥ 1 for all i
  • The classifier function D(x) = w^T x + b is constructed
    with the appropriate w and b (b sets the distance from
    the origin to D(x) = 0); a fitting sketch follows below
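For concreteness, a small sketch of fitting a linear two-class SVM and reading
off w and b (this uses scikit-learn purely as an illustration; it is not the
presenter's CALCEsvm implementation, and the toy data are made up):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X_pos = rng.normal(loc=+2.0, size=(50, 2))   # class y = +1
    X_neg = rng.normal(loc=-2.0, size=(50, 2))   # class y = -1
    X = np.vstack([X_pos, X_neg])
    y = np.hstack([np.ones(50), -np.ones(50)])

    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]       # D(x) = w . x + b
    print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))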

9
Two Class Support Vector Machines
  • Lagrangian function L_P of the primal problem
  • Instead of minimizing L_P w.r.t. w and b,
    minimize the dual L_D(a) = (1/2) a^T H a - p^T a
    w.r.t. a (a sketch of this dual problem follows below)
  • where H is the Hessian matrix,
    H_ij = y_i y_j x_i^T x_j
  • a = (a_1, ..., a_n)
  • and p is a unit vector

  KKT conditions: dL_P/dw = 0 ⇒ w = Σ_i a_i y_i x_i,
  dL_P/db = 0 ⇒ Σ_i a_i y_i = 0
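A sketch of this dual problem in Python (illustrative only, not the CALCEsvm
code; the box constraint C, the support-vector tolerance, and the toy data are
assumptions):

    import numpy as np
    from scipy.optimize import minimize

    def svm_dual(X, y, C=1.0):
        n = len(y)
        G = y[:, None] * X                              # rows y_i * x_i
        H = G @ G.T                                     # H_ij = y_i y_j x_i^T x_j
        p = np.ones(n)                                  # the "unit" vector in L_D

        obj  = lambda a: 0.5 * a @ H @ a - p @ a        # minimize (1/2) a^T H a - p^T a
        grad = lambda a: H @ a - p
        cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i a_i y_i = 0
        bnds = [(0.0, C)] * n                           # 0 <= a_i <= C
        res = minimize(obj, np.zeros(n), jac=grad, bounds=bnds, constraints=cons)
        a = res.x
        w = (a * y) @ X                                 # w = sum_i a_i y_i x_i
        sv = (a > 1e-6) & (a < C - 1e-6)                # on-margin support vectors
        b = np.mean(y[sv] - X[sv] @ w) if sv.any() else 0.0
        return a, w, b

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(+1.5, 1.0, (20, 2)), rng.normal(-1.5, 1.0, (20, 2))])
    y = np.hstack([np.ones(20), -np.ones(20)])
    a, w, b = svm_dual(X, y)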
10
Two Class Support Vector Machines
  • In the nonlinear case, use a kernel function Φ
    centered at each x
  • Form the same optimization problem, where the
    Hessian becomes H_ij = y_i y_j K(x_i, x_j)
  • Argument: the resulting function D(x) is the best
    classifier for the given training set

  [Figure: decision boundary D(x) = 0 with margins D(x) = -1
  and D(x) = 1, the support vectors, and the distributions of
  training data and fault/failure data in the x1-x2 plane]
11
Bayesian Interpretation of D(x)
  • The classification y ∈ {-1, 1} for any x is
    equivalent to asking whether p(Y = 1 | X = x) is
    greater or less than p(Y = -1 | X = x)
  • An optimal classifier y_MAP maximizes the
    conditional probability
  • The quadratic optimization problem yields D(x)
  • It can be shown that D(x) is the maximum a
    posteriori (MAP) solution to P(Y = y | X = x) ∝
    P(class | data), and therefore the optimal
    classifier of the given two classes

  y_MAP = +1 if p(Y = 1 | X = x) > p(Y = -1 | X = x),
  y_MAP = -1 if p(Y = 1 | X = x) < p(Y = -1 | X = x)
12
One Class Training
  • In the absence of negative class data (fault or
    failure information), a one-class-classification
    approach is used
  • X = (X1, X2): bivariate distribution
  • Likelihood of the positive class: L = p(X = x_i | y = 1)
  • Class label y ∈ {-1, 1}
  • Use the margin of this likelihood to construct
    the negative class

  [Figure: likelihood surface L of the positive class over the
  (X1, X2) plane]
13
Nonparametric Likelihood Estimation
  • If the probability that any data point x_i falls
    into the kth bin is r, then the probability that m
    of the data points x_1, ..., x_n fall into the kth
    bin is given by a binomial distribution
  • Total sample size: n
  • Number of samples in the kth bin: m
  • Region defined by the bin: R
  • MLE of r: r_hat = m/n
  • Density estimate: f_hat(x) = m / (n V), where V is the
    volume of R (a sketch follows below)
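A one-dimensional sketch of this histogram-style estimate (my illustration;
the bin placement and half-width are assumptions):

    import numpy as np

    def histogram_density(x, data, half_width=0.5):
        """Estimate the density at x from 1-D samples using a bin centered at x."""
        n = len(data)
        V = 2.0 * half_width                          # volume (length) of the bin R
        m = np.sum(np.abs(data - x) <= half_width)    # number of samples in the bin
        return m / (n * V)                            # f_hat(x) = m / (n V)

    samples = np.random.default_rng(2).normal(size=1000)
    print(histogram_density(0.0, samples))            # ~0.4 for a standard normal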

14
Estimate the Likelihood with a Gaussian Kernel φ
  • The volume of R is V
  • For a uniform kernel, m is the number of data points in R
  • Kernel function φ
  • Points x_i that are close to the sample point x
    receive higher weight
  • The resulting density f_φ(x) is smooth
  • The bandwidth h is selected according to a
    nearest-neighbor algorithm
  • Each bin R contains k_n data points
    (a sketch of this kernel estimate follows below)
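A sketch of a Gaussian-kernel density estimate with a nearest-neighbor
bandwidth (my illustration; the exact bandwidth rule and normalization used in
CALCEsvm are not specified on the slides):

    import numpy as np

    def gaussian_kde_knn(x, data, k=10):
        """Estimate the density at point x (d-dim) from samples `data` (n x d)."""
        n, d = data.shape
        dists = np.linalg.norm(data - x, axis=1)
        h = np.sort(dists)[k]                            # bandwidth from the k-th neighbor
        weights = np.exp(-0.5 * (dists / h) ** 2)        # Gaussian kernel weights
        norm = (2 * np.pi) ** (d / 2) * h ** d           # kernel normalizing constant
        return weights.sum() / (n * norm)

    data = np.random.default_rng(3).normal(size=(500, 2))
    print(gaussian_kde_knn(np.zeros(2), data, k=20))     # ~1/(2*pi) ~ 0.16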

15
Estimate of Negative Class
  • The negative class is estimated based on the
    likelihood of the positive class (training data)
  • A threshold t is used to estimate the likelihood
    ratio of positive to negative class probability
    for the given training data (sketched below)
  • A 1-D cross-section of the density illustrates the
    idea of the threshold ratio
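One plausible reading of this construction, as a sketch (an assumption on my
part, not the exact CALCEsvm procedure): training points whose estimated
likelihood falls below the threshold t serve as the artificial negative class.

    import numpy as np

    def build_negative_class(data, likelihood, t):
        """Split training data by a likelihood threshold t."""
        L = np.asarray([likelihood(x) for x in data])
        return data[L >= t], data[L < t]   # (positive class, estimated negative class)

    # Tiny usage example with a standard bivariate Gaussian likelihood
    rng = np.random.default_rng(6)
    data = rng.normal(size=(300, 2))
    likelihood = lambda x: np.exp(-0.5 * x @ x) / (2 * np.pi)
    pos, neg = build_negative_class(data, likelihood, t=0.02)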

  [Figure: 1-D cross-section of the density, with the threshold
  separating the positive (high-likelihood) region from the
  negative (low-likelihood) region]
16
D(x) as a Sufficient Statistic
  • D(x) can be used as a sufficient statistic to
    classify a data point x
  • Argument: since D(x) is the optimal classifier,
    the posterior class probabilities are related to the
    data's distance to the boundary D(x) = 0
  • These probabilities can be modeled by a logistic
    distribution centered at D(x) = 0


  [Figure: logistic posterior class probability as a function
  of D(x), centered at D(x) = 0]
17
Posterior Class Probability
  • The positive posterior class probability is given
    by a logistic (sigmoid) function of D(x):
    p(y = 1 | x) = 1 / (1 + exp(A D(x) + B))
  • Use D(x) as the sufficient statistic for the
    classification of x_i, replacing a_i by D(x_i)
  • Simplify
  • Get the MLE for the parameters A and B
    (a sketch of this posterior follows below)
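A sketch of this logistic (Platt-style) posterior, assuming the standard form
p(y = 1 | x) = 1 / (1 + exp(A D(x) + B)); the values of A and B below are made
up for illustration and would come from the MLE fit:

    import numpy as np

    def posterior_positive(D, A, B):
        """Map the SVM decision value D(x) to a positive-class probability."""
        return 1.0 / (1.0 + np.exp(A * D + B))

    # With A negative, large positive D(x) maps to a probability near 1
    print(posterior_positive(np.array([-1.0, 0.0, 2.0]), A=-2.0, B=0.0))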

18
Joint Probability Model
  • Interested in P = P(Y | X_M, X_R), the joint
    probability of classification given the two models
  • X_M: model space M
  • X_R: residual space R
  • Assume X_M and X_R are independent
  • After some algebra, obtain the joint positive and
    negative posterior class probabilities P(+) and
    P(-) (see the sketch below)
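A sketch of one way this fusion can work out, under the stated independence
assumption and (my added assumption) equal class priors; this is my reading of
the "after some algebra" step, not the exact expression from the slides:

    def joint_positive(pM, pR):
        """Fuse the model-space and residual-space positive-class posteriors."""
        num = pM * pR
        return num / (num + (1.0 - pM) * (1.0 - pR))

    print(joint_positive(0.9, 0.8))   # 0.973: both models agree the point is healthy
    print(joint_positive(0.9, 0.3))   # 0.794: the residual model weakens the evidence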

19
Case Studies
  • Simulated degradation
  • Lockheed Martin Dataset

20
Case Study I: Simulated Degradation
  • Given
  • Simulated correlated data:
    X1 ~ gamma, X2 ~ Student's t, X3 ~ beta
  • Degradation modeling
  • A period of healthy data
  • Three successive periods of increasingly larger
    changes in the mean of each parameter
  • Expect the posterior classification probability
    to reflect these four periods accordingly
  • For the first period, a probability close to 1
  • For the three successive periods, a decreasing trend
    (a data-simulation sketch follows below)
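A sketch of how such degradation data could be simulated (illustrative only;
the distribution parameters, period lengths, and mean shifts are assumptions,
and the correlation between the simulated parameters is not reproduced here):

    import numpy as np

    rng = np.random.default_rng(4)
    n_per_period = 200
    shifts = [0.0, 0.5, 1.0, 2.0]        # healthy period, then three growing mean shifts

    periods = []
    for s in shifts:
        X1 = rng.gamma(shape=2.0, scale=1.0, size=n_per_period) + s
        X2 = rng.standard_t(df=5, size=n_per_period) + s
        X3 = rng.beta(a=2.0, b=5.0, size=n_per_period) + s
        periods.append(np.column_stack([X1, X2, X3]))

    X = np.vstack(periods)               # simulated data set spanning the four periods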

  [Figure: simulated parameter x1 plotted against observation
  index, showing the four periods]
21
Case Study I Results: Simulated Degradation
  • Results: a plot of the joint positive
    classification probability

  [Figure: joint positive classification probability over the
  observations, annotated with the four periods P1-P4]
22
Case Study II: Lockheed Martin Data (Known
Faulty Periods)
  • Given: data set from Lockheed Martin
  • Type of data: server data, unknown parameters
  • Multivariate: 22 parameters, 2741 observations
  • Healthy period (T): observations 0 - 800
  • Fault periods: F1 = observations 912 - 1040,
    F2 = 1092 - 1106, F3 = 1593 - 1651
  • Training data constructed from a sample of period
    T, with size n = 140
  • Goal
  • Detect the onset of the known faulty periods without
    knowledge of the unhealthy system characteristics

23
Case Study II - Results
  [Figure: classification results for the Lockheed Martin data,
  showing the healthy period T (through observation 800) and the
  fault periods F1 (beginning at observation 912) and F2]
24
Comparison Metrics of Code Accuracy (LibSVM vs.
CALCEsvm)
  • An established and widely used C++ SVM software
    package (LibSVM) was used to test the accuracy
    of the code
  • LibSVM features used: two-class SVM
  • LibSVM does not provide classification probabilities
    for one-class SVM
  • Input to LibSVM
  • Positive class: the same training data
  • Negative class: the estimated negative-class data
    from CALCEsvm
  • Metric: detection accuracy
  • The count of correct classifications, based on two
    criteria (sketched below)
  • Classification label y
  • Correct classification probability estimate
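A sketch of these two accuracy criteria as I read them (the acceptable
probability bands are taken from the following slides; the function and its
defaults are otherwise assumptions):

    import numpy as np

    def detection_accuracy(y_true, y_pred, p_pos, lo=0.8, hi=1.0):
        """Accuracy by class label, and by the probability estimate
        falling inside the acceptable band [lo, hi]."""
        label_acc = np.mean(y_pred == y_true)
        prob_acc = np.mean((p_pos >= lo) & (p_pos <= hi))
        return label_acc, prob_acc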

25
Detection Accuracy, LibSVM vs. CALCEsvm (Case Study
1: Degradation Simulation)
  • Description of test
  • Period 1 should be captured with a probability
    estimate ranging from 80% to 100% (positive class)
  • Period 2: equivalently, between 70% and 85%
  • Period 3: between 30% and 70%
  • Period 4: between 0% and 40%
  • Based on just the class index, the detection
    accuracy of both algorithms was almost identical
  • Based on the ranges of probabilities, LibSVM performs
    better in determining the early stages, where the
    system is healthy, but performs worse in detecting
    degradation in comparison to CALCEsvm

  [Figures: classification probabilities from LibSVM and from
  CALCEsvm for the degradation simulation, each annotated with
  the four periods P1-P4]
26
Detection Accuracy, LibSVM vs. CALCEsvm (Case Study
2: Lockheed Data)
  • Description of test
  • The acceptable probability estimate for a correct
    positive classification should lie between 80% and
    100%
  • Similarly, the acceptable probability estimate for
    a negative classification should not exceed 40%
  • Based on the class index, both LibSVM and
    CALCEsvm perform almost identically, with slightly
    better performance for CALCEsvm
  • Based on the acceptable probability estimates,
  • LibSVM
  • does a poor job of identifying the healthy state
    between each successive faulty period
  • has much better performance at detecting the
    anomalies
  • CALCEsvm
  • seems to perform much better overall, and
    correctly identifies, both by index and by
    acceptable probability range, the faulty and
    healthy periods in the data

27
Summary
  • For the given data, and on some additional data
    sets, the CALCEsvm algorithm has accomplished the
    objectives:
  • Detected the times of known anomaly events
  • Identified trends of degradation
  • A first-hand comparison of its detection accuracy
    with LibSVM is favorable

28
Backups
29
Dual Form of Lagrangian Function
  • Dual form of the Lagrangian function for the
    optimization problem, obtained through the KKT conditions:
    L_D(a) = Σ_i a_i - (1/2) Σ_i Σ_j a_i a_j y_i y_j x_i^T x_j
  • subject to a_i ≥ 0 and Σ_i a_i y_i = 0
30
Karush-Kuhn-Tucker (KKT) Conditions
  • An optimal solution (w*, b*, a*) exists if and only
    if the KKT conditions are satisfied. In other words,
    the KKT conditions are necessary and sufficient for
    solving for w, b, and a in a convex problem

31
Posterior Class Probability
  • Interested in finding the maximum likelihood
    estimates of the parameters A and B
  • The classification probability of a set of test
    data X = {x_1, ..., x_k} into classes c ∈ {1, 0} is
    given by a product of Bernoulli distributions:
    L(A, B) = Π_i p_i^(c_i) (1 - p_i)^(1 - c_i)
  • where p_i is the probability of classification
    when c = 1 (y = 1) and 1 - p_i is the probability of
    classification when c = 0 (class y = -1)

32
Posterior Class Probability
  • Maximize the likelihood of correct classification
    y for each x_i (MLE); a fitting sketch follows below
  • Determine the parameters A_MLE and B_MLE from the
    maximum likelihood equation (above)
  • Use A_MLE and B_MLE to compute p_i,MLE
  • where p_i,MLE is
  • the maximum likelihood estimator of the posterior
    class probability p_i (due to the invariance
    property of the MLE)
  • the best estimate of the classification probability
    of each x_i
  • Currently implemented is
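A sketch of this MLE fit for A and B (my illustration using SciPy; the slides
do not specify the currently implemented optimizer, and the toy decision
values below are made up). It minimizes the negative log-likelihood of the
product-Bernoulli model with p_i = 1 / (1 + exp(A D_i + B)):

    import numpy as np
    from scipy.optimize import minimize

    def fit_platt(D, c):
        """D: decision values D(x_i); c: targets, 1 for class y = +1, 0 for y = -1."""
        def neg_log_lik(params):
            A, B = params
            p = 1.0 / (1.0 + np.exp(A * D + B))
            p = np.clip(p, 1e-12, 1 - 1e-12)          # guard against log(0)
            return -np.sum(c * np.log(p) + (1 - c) * np.log(1 - p))
        res = minimize(neg_log_lik, x0=np.array([-1.0, 0.0]))
        return res.x                                   # (A_MLE, B_MLE)

    rng = np.random.default_rng(7)
    D = np.concatenate([rng.normal(+1.5, 0.5, 100), rng.normal(-1.5, 0.5, 100)])
    c = np.concatenate([np.ones(100), np.zeros(100)])
    A_mle, B_mle = fit_platt(D, c)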