Title: Hamsa: Fast Signature Generation for Zero-day Polymorphic Worms with Provable Attack Resilience
1. Hamsa: Fast Signature Generation for Zero-day Polymorphic Worms with Provable Attack Resilience
Lab for Internet Security Technology (LIST), Northwestern University
2. The Spread of Sapphire/Slammer Worms
3. Desired Requirements for Polymorphic Worm Signature Generation
- Network-based signature generation
  - Worms spread at exponential speed, so detecting them in their early stage is crucial
  - However, at the early stage there are only limited worm samples
  - A high-speed network router may see more worm samples
  - But it must keep up with the network speed!
  - Can only use network-level information
4. Desired Requirements for Polymorphic Worm Signature Generation
- Noise tolerant
  - Most network flow classifiers suffer false positives
  - Even host-based approaches can be injected with noise
- Attack resilience
  - Attackers always try to evade the detection systems
- Efficient signature matching for high-speed links
No existing work satisfies all these requirements!
5. Outline
- Motivation
- Hamsa Design
- Model-based Signature Generation
- Evaluation
- Related Work
- Conclusion
6. Choice of Signatures
- Two classes of signatures
  - Content based
    - Token: a substring with reasonable coverage of the suspicious traffic
    - Signature: a conjunction of tokens
  - Behavior based
- Our choice: content based
  - Fast signature matching; ASIC-based approaches can achieve 6-8 Gb/s
  - Generic, independent of any protocol or server
7. Unique Invariants of Worms
- Protocol frame
  - The code path to the vulnerable part, usually infrequently used
  - Code-Red II: '.ida?' or '.idq?'
- Control data: leads to control-flow hijacking
  - Hard-coded value to overwrite a jump target or a function call
- Worm executable payload
  - CLET polymorphic engine: '0\x8b', '\xff\xff\xff' and 't\x07\xeb'
- Possible to have worms with no such invariants, but very hard
8. Hamsa Architecture
9. Components from Existing Work
- Worm flow classifiers
  - Scan-based detector: Autograph
  - Byte-spectrum-based approach: PAYL
  - Honeynet/honeyfarm sensors: Honeycomb
10. Hamsa Design
- Key idea: model the uniqueness of worm invariants
- Greedy algorithm for finding token-conjunction signatures
  - Highly accurate yet much faster
  - Both analytically and experimentally
  - Compared with the latest work, Polygraph
- Suffix-array-based token extraction
- Provable attack resilience guarantee
- Noise tolerant
11. Outline
- Motivation
- Hamsa Design
- Model-based Signature Generation
- Evaluation
- Related Work
- Conclusion
12. Hamsa Signature Generator
- Core part: model-based greedy signature generation
- Iterative approach for multiple worms
13. Problem Formulation
[Diagram: the suspicious pool, the normal pool, and a false positive bound r feed the Signature Generator, which outputs a Signature]
- Without noise, the problem can be solved in linear time using token extraction
- With noise: NP-Hard!
14. Model Uniqueness of Invariants
- u(1): upper bound of FP(t1)
- u(2): upper bound of FP(t1, t2)
- The total number of tokens is bounded by k
15. Signature Generation Algorithm
[Diagram: tokens are extracted from the suspicious pool and ordered by coverage; the first token t1 is selected subject to the bound u(1) = 15%]
16. Signature Generation Algorithm
[Diagram: the remaining tokens are ordered by joint coverage with t1; the second token t2 is added to the signature subject to the bound u(2) = 7.5%]
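The greedy selection illustrated on these two slides can be sketched compactly. This is a minimal Python sketch, not Hamsa's implementation: `fp` is an assumed helper returning the false-positive rate of a token set on the normal pool, and `u` is the list of U-bounds.

```python
def greedy_signature(tokens, suspicious, fp, u):
    """At step i, among the remaining tokens, pick the one with the
    highest joint coverage of the still-covered suspicious flows whose
    joint false positive stays under the U-bound u[i]."""
    sig = []
    covered = list(suspicious)
    candidates = set(tokens)
    for bound in u:  # at most k = len(u) tokens
        best, best_cov = None, -1
        for t in candidates:
            joint_cov = sum(t in flow for flow in covered)
            if joint_cov > best_cov and fp(sig + [t]) <= bound:
                best, best_cov = t, joint_cov
        if best is None:  # no token satisfies the current bound
            break
        sig.append(best)
        candidates.discard(best)
        covered = [flow for flow in covered if best in flow]
    return sig
```

On a toy pool, the invariant token wins the first round (highest coverage under u(1)), and the next round adds the token with the best joint coverage under u(2).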
17. Algorithm Runtime Analysis
- Preprocessing needs O(m + n + T·l + T·(M+N))
- Running time: O(T·(M+N))
- In most cases M < N, so it reduces to O(T·N)

T: the # of tokens; l: the maximum length of tokens
M: the # of flows in the suspicious pool; N: the # of flows in the normal pool
m: the # of bytes in the suspicious pool; n: the # of bytes in the normal pool
18. Provable Attack Resilience Guarantee
- Proved the worst-case bound on false negatives given the false positive bound
- Analytically bounds the worst the attackers can do!
- Example: k=5, u(1)=0.2, u(2)=0.08, u(3)=0.04, u(4)=0.02, u(5)=0.01 and r=0.01
- The better the flow classifier, the lower the false negatives

Noise ratio (%)   FP upper bound (%)   FN upper bound (%)
5                 1                    1.84
10                1                    3.89
20                1                    8.75
19. Attack Resilience Assumptions
- Common assumptions for any signature generation system
  - The attacker cannot control which worm samples are encountered by Hamsa
  - The attacker cannot control which of the encountered worm samples will be classified as worm samples by the flow classifier
- Unique assumptions for token-based schemes
  - The attacker cannot change the frequency of tokens in normal traffic
  - The attacker cannot control which of the encountered normal samples are classified as worm samples by the worm flow classifier
20. Attack Resilience Assumptions
- Attacks on the flow classifier
  - Our approach does not depend on perfect flow classifiers
  - But with 99% noise, no approach can work!
  - High noise injection makes the worm propagate less efficiently
- Enhance flow classifiers
  - Cluster suspicious flows by return messages
  - Information-theory-based approaches (DePaul Univ.)
21. Generalizing Signature Generation with Noise
- BEST signature: the balanced signature
  - Balance sensitivity with specificity
- Create a scoring function score(cov, fp, ...) to evaluate the goodness of a signature
- Currently used:
  - Intuition: it is better to reduce the coverage by 1/a if the false positive becomes 10 times smaller
- Add some weight for the length of the signature (LEN) to break ties between signatures with the same coverage and false positive
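The exact scoring formula is elided on the slide; one hypothetical function consistent with the stated intuition (dividing the false positive by 10 exactly offsets dividing the coverage by a) might be:

```python
import math

def score(cov, fp, a=5.0, eps=1e-12):
    # Hypothetical scoring function, NOT the one used by Hamsa:
    # it is constructed so that score(cov / a, fp / 10) == score(cov, fp),
    # i.e. a 10x smaller false positive is worth a 1/a loss of coverage.
    # eps avoids log(0); a tie-breaking LEN term is omitted for brevity.
    return math.log(cov + eps) - math.log(a) * math.log10(fp + eps)
```

Higher coverage or lower false positive both raise the score, and the trade-off ratio is governed by the parameter a.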
22. Hamsa Signature Generator
- Next: token extraction and token identification
23. Token Extraction
- Problem formulation
  - Input: a set of strings, a minimum length l, and a minimum coverage COVmin
  - Output:
    - A set of tokens (substrings) that meet the minimum length and coverage requirements
      - Coverage: the portion of strings containing the token
    - Corresponding sample vectors for each token
- Main techniques
  - Suffix array
  - LCP (Longest Common Prefix) array and LCP intervals
  - Token Extraction Algorithm (TEA)
24. Suffix Array
- Illustration by an example
  - String1: abrac, String2: adabra
  - Concatenate them: abracadabra
  - All suffixes: a, ra, bra, abra, dabra, ...
  - Sort all the suffixes (shown below with their start positions)
- 4n space
- Sorting can be done in 4n space and O(n log n) time
a 10
abra 7
abracadabra 0
acadabra 3
adabra 5
bra 8
bracadabra 1
cadabra 4
dabra 6
ra 9
racadabra 2
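A naive construction reproducing the sorted list above (illustration only; production code would use the 4n-space O(n log n) construction the slide mentions):

```python
def suffix_array(s):
    # Sort suffix START POSITIONS by the text of the suffix they begin.
    # This naive version is O(n^2 log n) but matches the slide's example.
    return sorted(range(len(s)), key=lambda i: s[i:])

sa = suffix_array("abracadabra")
# Sorted suffixes and start positions, as on the slide:
# a(10), abra(7), abracadabra(0), acadabra(3), adabra(5),
# bra(8), bracadabra(1), cadabra(4), dabra(6), ra(9), racadabra(2)
```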
25LCP Array and LCP Intervals
Suffixes sufarr lcparr idx str
a 10 - (0) 0 2
abra 7 1 1 2
abracadabra 0 4 2 1
acadabra 3 1 3 1
adabra 5 1 4 2
bra 8 0 5 2
bracadabra 1 3 6 1
cadabra 4 0 7 1
dabra 6 0 8 2
ra 9 0 9 2
racadabra 2 2 10 1
LCP intervals gt tokens
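The lcparr column above can be reproduced with a direct (quadratic, illustration-only) LCP computation over adjacent ranks of the suffix array:

```python
def lcp_array(s, sa):
    # lcp[i] = length of the longest common prefix between the suffixes
    # at ranks i-1 and i of the suffix array (lcp[0] is set to 0).
    def common(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n
    return [0] + [common(sa[i - 1], sa[i]) for i in range(1, len(sa))]

s = "abracadabra"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
# lcp_array(s, sa) == [0, 1, 4, 1, 1, 0, 3, 0, 0, 0, 2], the lcparr column
```

Runs of lcp values >= l delimit the LCP intervals from which TEA reads off candidate tokens.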
26. Token Extraction Algorithm (TEA)
- Find eligible LCP intervals first
- Then find the tokens
27. Token Extraction Algorithm (TEA) [walkthrough continued]
28. Token Extraction Algorithm (TEA) [walkthrough continued]
29. Token Identification
- For normal traffic, pre-compute and store the suffix array offline
- For a given token, a binary search in the suffix array gives the corresponding LCP intervals
  - O(log(n)) time complexity
- A more sophisticated O(1) algorithm is possible, but may require more space
30. Implementation Details
- Token extraction: extract a set of tokens with minimum length l and minimum coverage COVmin
  - Polygraph uses a suffix-tree-based approach: 20n space and time consuming
  - Our approach: enhanced suffix array, 8n space and much faster (at least 20 times)
- Calculating false positives when checking the U-bounds (token identification)
  - Again suffix-array-based, but for a 300MB normal pool the 1.2GB suffix array is still large!
  - Optimization using MMAP: memory usage 150-250MB
31. Hamsa Signature Generator
- Next: signature refinement
32. Signature Refinement
- Why refinement?
  - To produce a signature with the same sensitivity but better specificity
- How?
  - After the core algorithm produces the greedy signature, we assume the samples matched by the greedy signature are all worm samples
  - This reduces to signature generation without noise: do another round of token extraction
33. Extend to Detect Multiple Worms
- Iteratively use the single-worm detector to detect multiple worms
- At the first iteration, the algorithm finds the signature for the most popular worm in the suspicious pool
- All other worms and normal traffic are treated as noise
34. Practical Issues on Data Normalization
- Typical cases that need data normalization
  - IP packet fragmentation
  - TCP flow reassembly (defends against fragroute)
  - RPC fragmentation
  - URL obfuscation
  - HTML obfuscation
  - Telnet/FTP evasion by \backspace or \delete keys
- Normalization translates data into a canonical form
35. Practical Issues on Data Normalization (II)
- Hamsa with data normalization works better
- Without or with weak data normalization, Hamsa still works
  - But because the data may have different forms of encoding, it may produce multiple signatures for a single worm
  - Needs sufficient samples for each form of encoding
36. Outline
- Motivation
- Hamsa Design
- Model-based Signature Generation
- Evaluation
- Related Work
- Conclusion
37. Experiment Methodology
- Experimental setup
  - Suspicious pool
    - Three pseudo-polymorphic worms based on real exploits (Code-Red II, Apache-Knacker and ATPhttpd)
    - Two polymorphic engines from the Internet (CLET and TAPiON)
  - Normal pool: 2-hour departmental HTTP trace (326MB)
- Signature evaluation
  - False negatives: 5000 generated worm samples per worm
  - False positives:
    - 4-day departmental HTTP trace (12.6 GB)
    - 3.7GB web crawl including .mp3, .rm, .ppt, .pdf, .swf etc.
    - /usr/bin of Linux Fedora Core 4
38. Results on Signature Quality

Worm         Training FN (%)  Training FP (%)  Evaluation FN (%)  Evaluation FP (%)  Binary evaluation FP (%)
Code-Red II  0                0                0                  0                  0
CLET         0                0.109            0                  0.06236            0.268

Signatures generated:
- Code-Red II: '.ida?': 1, 'u780': 1, ' HTTP/1.0\r\n': 1, 'GET /': 1, 'u': 2
- CLET: '0\x8b': 1, '\xff\xff\xff': 1, 't\x07\xeb': 1

- Single worm with noise
  - Suspicious pool sizes: 100 and 200 samples
  - Noise ratios: 0%, 10%, 30%, 50%, 70%
  - Noise samples randomly picked from the normal pool
  - Always obtained the signatures and accuracy above
39. Results on Signature Quality (II)
- Suspicious pool with high noise ratio
  - For noise ratios 50% and 70%, sometimes we produce two signatures: one is the true worm signature, the other comes solely from noise, due to the locality of the noise
  - The false positives of these noise signatures have to be very small
    - Mean: 0.09%
    - Maximum: 0.7%
- Multiple worms with noise give similar results
40. Experiment: U-bound Evaluation
- To be conservative we chose k = 15
  - u(k) = u(15) = 9.16×10^-6
- u(1) and ur evaluation
  - We tested u(1) ∈ {0.02, 0.04, 0.06, 0.08, 0.10, 0.20, 0.30, 0.40, 0.5}
  - and ur ∈ {0.20, 0.40, 0.60, 0.8}
  - The minimum (u(1), ur) that works for all our worms was (0.08, 0.20)
  - In practice, we use the conservative values (0.15, 0.5)
41. Speed Results
- Implementation in C/Python
  - 500 samples with 20% noise and a 100MB normal traffic pool: 15 seconds on a Xeon 2.8GHz, 112MB memory consumption
- Speed comparison with Polygraph
  - Asymptotic runtime: O(T) vs. O(M^2); when M increases, T won't increase as fast as M!
  - Experimental: 64 to 361 times faster (Polygraph vs. ours, both in Python)
42. Experiment: Sample Requirement
- Coincidental-pattern attack [Polygraph]
- Results
  - For the three pseudo worms, 10 samples can get good results
  - CLET and TAPiON need at least 50 samples
- Conclusion
  - For better signatures, to be conservative, at least 100 samples are needed
  - This requires scalable and fast signature generation!
43. Token-fit Attack Can Fail Polygraph
- Polygraph: hierarchical clustering to find signatures with the smallest false positives
- Using the token distribution of the noise in the suspicious pool, the attacker can make the worm samples look more like the noise traffic
  - Different worm samples encode different noise tokens
- Our approach can still work!
44. Token-fit Attack Could Make Polygraph Fail
[Diagram: the clusters CANNOT merge further, so NO true signature is found!]
45. Experiment: Token-fit Attack
- Suspicious pool of 50 samples with 50% noise
  - Crafted different worm samples to resemble different noise samples
- Results
  - Polygraph: 100% false negatives
  - Hamsa: still gets the correct signature as before!
46. Outline
- Motivation
- Hamsa Design
- Model-based Signature Generation
- Evaluation
- Related Work
- Conclusion
47. Related Works

                             Hamsa    Polygraph   CFG       PADS     Nemean             COVERS           Malware Detection
Network or host based        Network  Network     Network   Host     Host               Host             Host
Content or behavior based    Content  Content     Behavior  Content  Content            Behavior         Behavior
Noise tolerance              Yes      Yes (slow)  Yes       No       No                 Yes              Yes
Multi worms in one protocol  Yes      Yes (slow)  Yes       No       Yes                Yes              Yes
On-line sig matching         Fast     Fast        Slow      Fast     Fast               Fast             Slow
Generality                   General  General     General   General  Protocol-specific  Server-specific  General
Provable attack resilience   Yes      No          No        No       No                 No               No
Information exploited        egp      egp         p         egp      e                  eg               p
48. Conclusion
- Network-based signature generation and matching are important and challenging
- Hamsa: automated signature generation
  - Fast
  - Noise tolerant
  - Provable attack resilience
  - Capable of detecting multiple worms in a single application protocol
- Proposed a model to describe the worm invariants
49. Questions?
51. Normal Traffic Poisoning Attack
- We found our approach is not sensitive to the normal traffic pool used
- History: use the last 6 months as the time window
  - The attacker has to poison the normal traffic 6 months ahead!
  - In 6 months the vulnerability may have been patched!
- Poisoning a popular protocol is very difficult
52. Red Herring Attack
- Hard to implement
- Dynamic updating problem: again, our approach is fast
- Partial signature matching: in the extended version
53. Coincidental Attack
- As mentioned in the Polygraph paper, it increases the sample requirement
- Again, our approach is scalable and fast
54. Model Uniqueness of Invariants
- Let the worm have a set of invariants; determine their order by:
  - t1: the token with the minimum false positive in normal traffic; u(1) is the upper bound of the false positive of t1
  - t2: the token with the minimum joint false positive with t1; FP(t1, t2) is bounded by u(2)
  - ti: the token with the minimum joint false positive with t1, t2, ..., t(i-1); FP(t1, t2, ..., ti) is bounded by u(i)
- The total number of tokens is bounded by k
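The joint false positive FP(t1, ..., ti) used in this ordering is just the fraction of normal flows containing all of the tokens; a minimal sketch:

```python
def joint_fp(tokens, normal_pool):
    # FP(t1, ..., ti): fraction of normal-pool flows that contain
    # EVERY token, i.e. flows the conjunction signature would match.
    hits = sum(all(t in flow for t in tokens) for flow in normal_pool)
    return hits / len(normal_pool)
```

Checking joint_fp against u(i) at each greedy step is exactly the U-bound test from the algorithm slides.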
55. Problem Formulation
- Noisy Token Multiset Signature Generation Problem
  - INPUT: suspicious pool M, normal traffic pool N, and a value r < 1
  - OUTPUT: a multiset of tokens, the signature S = {(t1, n1), ..., (tk, nk)}, such that the signature maximizes the coverage of the suspicious pool while its false positive in the normal pool is less than r
- Without noise, a polynomial-time algorithm exists
- With noise, NP-Hard
56. Generalizing Signature Generation with Noise
- BEST signature: the balanced signature
  - Balance sensitivity with specificity
- But how? Create a scoring function score(cov, fp, ...) to evaluate the goodness of a signature
- Currently used:
  - Intuition: it is better to reduce the coverage by 1/a if the false positive becomes 10 times smaller
- Add some weight for the length of the signature (LEN) to break ties between signatures with the same coverage and false positive
57. Generalizing Signature Generation with Noise
- Algorithm: similar
- Running time: same as the previous simple form
- Attack resilience guarantee: similar
58. Extension to Multiple Worms
- Iteratively use the single-worm detector to detect multiple worms
- At the first iteration, the algorithm finds the signature for the most popular worm in the suspicious pool; all other worms and normal traffic are treated as noise
- Though the analysis for a single worm can apply to multiple worms, the bounds are not very promising. Reason: high noise ratio
59. Token Extraction
- Extract a set of tokens with minimum length lmin and coverage COVmin, and for each token output the frequency vector
- Polygraph uses a suffix-tree-based approach: 20n space and time consuming
- Our approach
  - Enhanced suffix array: 4n space
  - Much faster: at least 50 times!
  - Can be applied to Polygraph as well
60. Calculating the False Positive
- We need the false positive to check the U-bounds
- Again suffix-array-based, but for a 300MB normal pool the 1.2GB suffix array is still large!
- Improvements
  - Caching
  - MMAP the suffix array: true memory usage 150-250MB
  - 2-level normal pool
  - Hardware-based fast string matching
  - Compress the normal pool and run string-matching algorithms directly over the compressed strings
61. Future Works
- Enhance the flow classifiers
  - Cluster suspicious flows by return messages
  - Malicious flow verification by replaying to Address-Space-Randomization-enabled servers