Hamsa: Fast Signature Generation for Zero-day Polymorphic Worms with Provable Attack Resilience

Transcript and Presenter's Notes

1
Hamsa: Fast Signature Generation for Zero-day
Polymorphic Worms with Provable Attack Resilience
Lab for Internet Security Technology (LIST)
Northwestern University
2
The Spread of Sapphire/Slammer Worms
3
Desired Requirements for Polymorphic Worm
Signature Generation
  • Network-based signature generation
  • Worms spread at exponential speed, so detecting
    them in their early stage is crucial. However:
  • At their early stage there are limited worm
    samples.
  • A high-speed network router may see more worm
    samples. But:
  • It needs to keep up with the network speed!
  • It can only use network-level information.

4
Desired Requirements for Polymorphic Worm
Signature Generation
  • Noise tolerance
  • Most network flow classifiers suffer from false
    positives.
  • Even host-based approaches can be injected with
    noise.
  • Attack resilience
  • Attackers always try to evade detection systems.
  • Efficient signature matching for high-speed links

No existing work satisfies these requirements!
5
Outline
  • Motivation
  • Hamsa Design
  • Model-based Signature Generation
  • Evaluation
  • Related Work
  • Conclusion

6
Choice of Signatures
  • Two classes of signatures
  • Content-based
  • Token: a substring with reasonable coverage of
    the suspicious traffic
  • Signature: a conjunction of tokens
  • Behavior-based
  • Our choice: content-based
  • Fast signature matching: an ASIC-based approach
    can achieve 6-8 Gb/s
  • Generic, independent of any protocol or server
    (a minimal matching sketch follows below)
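
As a rough illustration of what matching a token-conjunction signature means (a minimal sketch, not the ASIC or production matcher), consider:

    # A flow matches a token-conjunction signature when every token
    # occurs in its payload. The example tokens are the CLET invariants
    # cited on the next slide.
    def matches(signature: list[bytes], payload: bytes) -> bool:
        return all(token in payload for token in signature)

    sig = [b'0\x8b', b'\xff\xff\xff', b't\x07\xeb']
    print(matches(sig, b'..0\x8b..\xff\xff\xff..t\x07\xeb..'))  # True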

7
Unique Invariants of Worms
  • Protocol frame
  • The code path to the vulnerability, usually
    infrequently used
  • Code-Red II: '.ida?' or '.idq?'
  • Control data: leads to control-flow hijacking
  • A hard-coded value to overwrite a jump target or
    a function call
  • Worm executable payload
  • CLET polymorphic engine: '0\x8b', '\xff\xff\xff'
    and 't\x07\xeb'
  • Possible to have worms with no such invariants,
    but very hard

8
Hamsa Architecture
9
Components from Existing Work
  • Worm flow classifiers
  • Scan-based detector [Autograph]
  • Byte-spectrum-based approach [PAYL]
  • Honeynet/honeyfarm sensors [Honeycomb]

10
Hamsa Design
  • Key idea: model the uniqueness of worm invariants
  • Greedy algorithm for finding token-conjunction
    signatures
  • Highly accurate while much faster
  • Both analytically and experimentally
  • Compared with the latest work, Polygraph
  • Suffix-array-based token extraction
  • Provable attack resilience guarantee
  • Noise tolerant

11
Outline
  • Motivation
  • Hamsa Design
  • Model-based Signature Generation
  • Evaluation
  • Related Work
  • Conclusion

12
Hamsa Signature Generator
  • Core part: Model-based Greedy Signature
    Generation
  • Iterative approach for multiple worms

13
Problem Formulation
  • Input: a suspicious pool, a normal traffic pool,
    and a false positive bound r
  • Output: a token-conjunction signature produced by
    the signature generator
  • Without noise, can be solved in linear time using
    token extraction
  • With noise: NP-Hard!
14
Model Uniqueness of Invariants
  • u(1): upper bound of FP(t1)
  • u(2): upper bound of FP(t1, t2)
  • The total number of tokens is bounded by k
15
Signature Generation Algorithm
[Figure: token extraction produces a set of tokens from
the suspicious pool; the tokens are ordered by coverage,
and t1 is selected subject to the bound u(1) (shown as
15 in the illustration)]
16
Signature Generation Algorithm
[Figure: the remaining tokens are ordered by joint
coverage with t1, and t2 is selected subject to the
bound u(2) (shown as 7.5 in the illustration); the
signature grows one token per step. A condensed sketch
of this greedy loop follows]
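
The sketch below condenses the greedy loop illustrated above; it is a simplified sketch, not the authors' implementation. The direct pool scans in coverage() and false_positive() stand in for Hamsa's suffix-array machinery.

    # Greedy token-conjunction selection under the u-bound model: at
    # step i, candidates are ordered by joint coverage with the tokens
    # chosen so far, and the best one whose joint false positive obeys
    # u(i) is added.
    def coverage(tokens, pool):
        # Fraction of pool flows containing every token.
        return sum(all(t in f for t in tokens) for f in pool) / len(pool)

    def false_positive(tokens, pool):
        # Fraction of normal flows matched by the conjunction.
        return sum(all(t in f for t in tokens) for f in pool) / len(pool)

    def greedy_signature(tokens, suspicious_pool, normal_pool, u, k):
        # u[i] = bound on the joint FP after choosing i tokens.
        sig = []
        for i in range(1, k + 1):
            ranked = sorted((t for t in tokens if t not in sig),
                            key=lambda t: coverage(sig + [t], suspicious_pool),
                            reverse=True)
            for t in ranked:
                if false_positive(sig + [t], normal_pool) <= u[i]:
                    sig.append(t)
                    break
            else:
                break  # no candidate satisfies u(i); stop growing
        return sig
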
17
Algorithm Runtime Analysis
  • Preprocessing needs O(m + n + T·l + T·(M + N))
  • Running time: O(T·(M + N))
  • In most cases M < N, so it reduces to O(T·N)

T: the # of tokens; l: the maximum length of tokens
M: the # of flows in the suspicious pool; N: the # of flows in the normal pool
m: the # of bytes in the suspicious pool; n: the # of bytes in the normal pool
18
Provable Attack Resilience Guarantee
  • Proved the worst-case bound on false negatives,
    given the false positive bound
  • Analytically bounds the worst attackers can do!
  • Example: k = 5, u(1) = 0.2, u(2) = 0.08,
    u(3) = 0.04, u(4) = 0.02, u(5) = 0.01, and r = 0.01
  • The better the flow classifier, the lower the
    false negatives

Noise ratio | FP upper bound | FN upper bound
         5% |             1% |          1.84%
        10% |             1% |          3.89%
        20% |             1% |          8.75%
19
Attack Resilience Assumptions
  • Common assumptions for any signature generation
    system:
  • The attacker cannot control which worm samples
    are encountered by Hamsa
  • The attacker cannot control which worm samples
    encountered will be classified as worm samples by
    the flow classifier
  • Unique assumptions for token-based schemes:
  • The attacker cannot change the frequency of
    tokens in normal traffic
  • The attacker cannot control which normal samples
    encountered are classified as worm samples by the
    worm flow classifier

20
Attack Resilience Assumptions
  • Attacks on the flow classifier
  • Our approach does not depend on perfect flow
    classifiers
  • But with 99% noise, no approach can work!
  • High noise injection makes the worm propagate
    less efficiently.
  • Enhancing flow classifiers:
  • Cluster suspicious flows by return messages
  • Information-theory-based approaches (DePaul Univ.)

21
Generalizing Signature Generation with Noise
  • BEST signature: Balanced signature
  • Balance sensitivity with specificity
  • Create a scoring function score(cov, fp, ...) to
    evaluate the goodness of a signature
  • Currently used: [formula not preserved in this
    transcript]
  • Intuition: it is better to reduce the coverage by
    a factor 1/a if the false positive becomes 10
    times smaller.
  • Add some weight for the length of the signature
    (LEN) to break ties between signatures with the
    same coverage and false positive
    (one possible encoding of this score is sketched
    below)
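
The formula image did not survive this transcript. One scoring function consistent with the stated intuition (indifferent between cutting coverage by a factor 1/a and cutting the false positive 10x, plus a small LEN tie-breaker) would be the following sketch; the constants a, len_weight, and fp_floor are assumptions, not the paper's values:

    import math

    # Hypothetical score: unchanged if coverage drops by 1/a while the
    # false positive rate drops 10x, since
    # (cov/a) * a**(-log10(fp/10)) == cov * a**(-log10(fp)).
    def score(cov: float, fp: float, length: int, a: float = 5.0,
              len_weight: float = 1e-4, fp_floor: float = 1e-9) -> float:
        base = cov * a ** (-math.log10(max(fp, fp_floor)))
        return base + len_weight * length  # LEN breaks ties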

22
Hamsa Signature Generator
  • Next: token extraction and token identification

23
Token Extraction
  • Problem formulation
  • Input: a set of strings, a minimum length l, and
    a minimum coverage COVmin
  • Output:
  • A set of tokens (substrings) that meet the
    minimum length and coverage requirements
  • Coverage: the portion of strings containing the
    token
  • Corresponding sample vectors for each token
  • Main techniques:
  • Suffix array
  • LCP (Longest Common Prefix) array and LCP
    intervals
  • Token Extraction Algorithm (TEA)

24
Suffix Array
  • Illustration by an example
  • String 1: abrac, String 2: adabra
  • Concatenated: abracadabra
  • All suffixes: a, ra, bra, abra, dabra, ...
  • Sort all the suffixes
  • 4n space
  • Sorting can be done in 4n space and O(n log n)
    time (a small construction sketch follows the
    table)

Sorted suffix   Offset
a               10
abra            7
abracadabra     0
acadabra        3
adabra          5
bra             8
bracadabra      1
cadabra         4
dabra           6
ra              9
racadabra       2
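
A minimal construction sketch reproducing the table above (an O(n^2 log n) toy version; the 4n-space O(n log n) construction the slide mentions is more involved):

    def suffix_array(s: str) -> list[int]:
        """Return the start offsets of the suffixes of s in sorted order."""
        return sorted(range(len(s)), key=lambda i: s[i:])

    print(suffix_array("abracadabra"))
    # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]  -- matches the table above
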
25
LCP Array and LCP Intervals

Suffix        sufarr  lcparr  idx  str
a                 10   - (0)    0    2
abra               7       1    1    2
abracadabra        0       4    2    1
acadabra           3       1    3    1
adabra             5       1    4    2
bra                8       0    5    2
bracadabra         1       3    6    1
cadabra            4       0    7    1
dabra              6       0    8    2
ra                 9       0    9    2
racadabra          2       2   10    1

LCP intervals -> tokens (an LCP-array sketch follows
below)
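
The lcparr column can be computed in O(n) from the suffix array with Kasai's algorithm (a standard construction, shown here to make the table concrete; uses suffix_array from the previous slide):

    def lcp_array(s: str, sa: list[int]) -> list[int]:
        n = len(s)
        rank = [0] * n
        for i, suf in enumerate(sa):
            rank[suf] = i
        lcp = [0] * n   # lcp[i] = LCP of sa[i-1] and sa[i]; lcp[0] = 0
        h = 0
        for i in range(n):
            if rank[i] > 0:
                j = sa[rank[i] - 1]
                while i + h < n and j + h < n and s[i + h] == s[j + h]:
                    h += 1
                lcp[rank[i]] = h
                if h > 0:
                    h -= 1
            else:
                h = 0
        return lcp

    print(lcp_array("abracadabra", suffix_array("abracadabra")))
    # [0, 1, 4, 1, 1, 0, 3, 0, 0, 0, 2]  -- the lcparr column above
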
26
Token Extraction Algorithm (TEA)
  • Find eligible LCP intervals first
  • Then find the tokens
    (a rough sketch follows below)
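
The sketch below is a rough simplification of the TEA idea, not the paper's exact algorithm: nested LCP intervals and tokens spanning flow boundaries are glossed over, and flow_id (a list mapping each byte offset of the concatenated string to its source flow) is an assumed input. It uses suffix_array and lcp_array from the previous slides.

    def tea(s, flow_id, l_min, cov_min, n_flows):
        sa = suffix_array(s)
        lcp = lcp_array(s, sa)
        tokens = set()
        i = 1
        while i < len(sa):
            if lcp[i] >= l_min:
                # Maximal run of suffixes sharing a prefix of length >= l_min:
                # an (eligible) LCP interval.
                flows = {flow_id[sa[i - 1]]}
                depth = lcp[i]
                j = i
                while j < len(sa) and lcp[j] >= l_min:
                    flows.add(flow_id[sa[j]])
                    depth = min(depth, lcp[j])
                    j += 1
                if len(flows) >= cov_min * n_flows:  # coverage requirement
                    tokens.add(s[sa[i]:sa[i] + depth])
                i = j
            else:
                i += 1
        return tokens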

27
Token Extraction Algorithm (TEA) [algorithm illustration]
28
Token Extraction Algorithm (TEA) [algorithm illustration]
29
Token Identification
  • For normal traffic, pre-compute and store the
    suffix array offline
  • For a given token, binary search in the suffix
    array gives the corresponding LCP interval
  • O(log n) time complexity
  • A more sophisticated O(1) algorithm is possible,
    but may require more space
    (see the binary-search sketch below)
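
A sketch of the binary search (uses suffix_array from the earlier slide; requires Python 3.10+ for bisect's key argument). Note it counts raw occurrences, whereas the real system needs per-flow counts to compute false positive rates:

    import bisect

    # Locate the suffix-array interval of suffixes prefixed by the token;
    # its width is the token's occurrence count in the normal pool.
    def count_occurrences(normal: str, sa: list[int], token: str) -> int:
        key = lambda start: normal[start:start + len(token)]  # truncated suffix
        lo = bisect.bisect_left(sa, token, key=key)
        hi = bisect.bisect_right(sa, token, key=key)
        return hi - lo

    sa = suffix_array("abracadabra")
    print(count_occurrences("abracadabra", sa, "abra"))  # 2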

30
Implementation Details
  • Token extraction: extract a set of tokens with
    minimum length l and minimum coverage COVmin.
  • Polygraph uses a suffix-tree-based approach: 20n
    space and time-consuming.
  • Our approach: enhanced suffix array, 8n space and
    much faster! (at least 20 times)
  • Calculate the false positive when checking
    u-bounds (token identification)
  • Again a suffix-array-based approach, but for a
    300MB normal pool, a 1.2GB suffix array is still
    large!
  • Optimization using MMAP; memory usage 150-250MB

31
Hamsa Signature Generator
  • Next: signature refinement

32
Signature Refinement
  • Why refinement?
  • To produce a signature with the same sensitivity
    but better specificity
  • How?
  • After the core algorithm produces the greedy
    signature, we believe the samples matched by the
    greedy signature are all worm samples
  • This reduces to signature generation without
    noise: do another round of token extraction

33
Extend to Detect Multiple Worms
  • Iteratively use the single-worm detector to
    detect multiple worms (sketched below)
  • In the first iteration, the algorithm finds the
    signature for the most popular worm in the
    suspicious pool.
  • All other worms and normal traffic are treated as
    noise
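
A sketch of this iteration, where generate_signature (a stand-in for the single-worm generator, assumed to return None when no signature meets the FP bound) and matches (from the earlier slide) are the assumed building blocks:

    def generate_all_signatures(suspicious_pool, normal_pool):
        signatures = []
        pool = list(suspicious_pool)
        while pool:
            sig = generate_signature(pool, normal_pool)  # most popular worm
            if sig is None:        # residue looks like noise; stop
                break
            signatures.append(sig)
            # Remove matched flows; remaining worms were treated as noise
            # in this round and are targeted in the next iteration.
            pool = [f for f in pool if not matches(sig, f)]
        return signatures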

34
Practical Issues on Data Normalization
  • Typical cases that need data normalization:
  • IP packet fragmentation
  • TCP flow reassembly (defends against fragroute)
  • RPC fragmentation
  • URL obfuscation
  • HTML obfuscation
  • Telnet/FTP evasion by \backspace or \delete keys
  • Normalization translates data into a canonical
    form

35
Practical Issues on Data Normalization (II)
  • Hamsa works better with data normalization
  • Without or with weak data normalization, Hamsa
    still works
  • But because the data may have different forms of
    encoding, it may produce multiple signatures for
    a single worm
  • Needs sufficient samples for each form of
    encoding

36
Outline
  • Motivation
  • Hamsa Design
  • Model-based Signature Generation
  • Evaluation
  • Related Work
  • Conclusion

37
Experiment Methodology
  • Experimental setup
  • Suspicious pool:
  • Three pseudo-polymorphic worms based on real
    exploits (Code-Red II, Apache-Knacker and
    ATPhttpd)
  • Two polymorphic engines from the Internet (CLET
    and TAPiON)
  • Normal pool: a 2-hour departmental HTTP trace
    (326MB)
  • Signature evaluation
  • False negatives: 5,000 generated worm samples per
    worm
  • False positives:
  • A 4-day departmental HTTP trace (12.6GB)
  • 3.7GB of web crawling, including .mp3, .rm, .ppt,
    .pdf, .swf, etc.
  • /usr/bin of Linux Fedora Core 4

38
Results on Signature Quality

Worm        | Training FN | Training FP | Evaluation FN | Evaluation FP | Binary evaluation FP
Code-Red II |           0 |           0 |             0 |             0 |                    0
CLET        |           0 |      0.109% |             0 |      0.06236% |               0.268%

Code-Red II signature: {'.ida?': 1, '%u780': 1, ' HTTP/1.0\r\n': 1, 'GET /': 1, '%u': 2}
CLET signature: {'0\x8b': 1, '\xff\xff\xff': 1, 't\x07\xeb': 1}

  • Single worm with noise
  • Suspicious pool sizes: 100 and 200 samples
  • Noise ratios: 0%, 10%, 30%, 50%, 70%
  • Noise samples randomly picked from the normal
    pool
  • We always obtained the above signatures and
    accuracy.

39
Results on Signature Quality (II)
  • Suspicious pool with a high noise ratio
  • For noise ratios of 50% and 70%, sometimes we
    produce two signatures: one is the true worm
    signature, the other comes solely from noise, due
    to the locality of the noise.
  • The false positives of these noise signatures
    have to be very small
  • Mean: 0.09%
  • Maximum: 0.7%
  • Multiple worms with noise give similar results

40
Experiment: u-bound Evaluation
  • To be conservative we chose k = 15.
  • u(k) = u(15) = 9.16×10^-6.
  • u(1) and u_r evaluation:
  • We tested u(1) = 0.02, 0.04, 0.06, 0.08, 0.10,
    0.20, 0.30, 0.40, 0.5
  • and u_r = 0.20, 0.40, 0.60, 0.8.
  • The minimum (u(1), u_r) that worked for all our
    worms was (0.08, 0.20)
  • In practice, we use the conservative values
    (0.15, 0.5)

41
Speed Results
  • Implementation in C/Python
  • 500 samples with 20% noise and a 100MB normal
    traffic pool: 15 seconds on a Xeon 2.8GHz, 112MB
    memory consumption
  • Speed comparison with Polygraph
  • Asymptotic runtime: O(T) vs. O(M^2); as M
    increases, T won't increase as fast as M!
  • Experimentally: 64 to 361 times faster (Polygraph
    vs. ours, both in Python)

42
Experiment: Sample Requirement
  • Coincidental-pattern attack [Polygraph]
  • Results:
  • For the three pseudo worms, 10 samples give good
    results
  • CLET and TAPiON need at least 50 samples
  • Conclusion:
  • For better signatures, to be conservative, at
    least 100 samples are needed. This requires
    scalable and fast signature generation!

43
Token-fit Attack Can Fail Polygraph
  • Polygraph: hierarchical clustering to find
    signatures with the smallest false positives
  • Using the token distribution of the noise in the
    suspicious pool, the attacker can make the worm
    samples look more like noise traffic
  • Different worm samples encode different noise
    tokens
  • Our approach still works!

44
Token-fit Attack Could Make Polygraph Fail
[Figure: hierarchical clustering gets stuck - CANNOT
merge further! NO true signature found!]
45
Experiment: Token-fit Attack
  • Suspicious pool of 50 samples with 50% noise
  • Crafted different worm samples to resemble
    different noise samples.
  • Results:
  • Polygraph: 100% false negatives
  • Hamsa can still get the correct signature as
    before!

46
Outline
  • Motivation
  • Hamsa Design
  • Model-based Signature Generation
  • Evaluation
  • Related Work
  • Conclusion

47
Related Works

                            | Hamsa           | Polygraph       | CFG             | PADS            | Nemean            | COVERS          | Malware Detection
Network or host based       | Network         | Network         | Network         | Host            | Host              | Host            | Host
Content or behavior based   | Content-based   | Content-based   | Behavior-based  | Content-based   | Content-based     | Behavior-based  | Behavior-based
Noise tolerance             | Yes             | Yes (slow)      | Yes             | No              | No                | Yes             | Yes
Multi worms in one protocol | Yes             | Yes (slow)      | Yes             | No              | Yes               | Yes             | Yes
On-line sig matching        | Fast            | Fast            | Slow            | Fast            | Fast              | Fast            | Slow
Generality                  | General purpose | General purpose | General purpose | General purpose | Protocol specific | Server specific | General purpose
Provable atk resilience     | Yes             | No              | No              | No              | No                | No              | No
Information exploited       | e g p           | e g p           | p               | e g p           | e                 | e g             | p
48
Conclusion
  • Network-based signature generation and matching
    are important and challenging
  • Hamsa: automated signature generation
  • Fast
  • Noise tolerant
  • Provable attack resilience
  • Capable of detecting multiple worms in a single
    application protocol
  • Proposed a model to describe the worm invariants

49
Questions?
50
Results on Signature Quality (II)
  • Suspicious pool with a high noise ratio
  • For noise ratios of 50% and 70%, sometimes we
    produce two signatures: one is the true worm
    signature, the other solely from noise.
  • The false positives of these noise signatures
    have to be very small
  • Mean: 0.09%
  • Maximum: 0.7%
  • Multiple worms with noise give similar results

51
Normal Traffic Poisoning Attack
  • We found our approach is not sensitive to the
    normal traffic pool used
  • History: use a time window of the last 6 months
  • The attacker has to poison the normal traffic 6
    months ahead!
  • In 6 months the vulnerability may have been
    patched!
  • Poisoning a popular protocol is very difficult.

52
Red Herring Attack
  • Hard to implement
  • Dynamic updating problem: again, our approach is
    fast
  • Partial signature matching: see the extended
    version.

53
Coincidental Attack
  • As mentioned in the Polygraph paper, it increases
    the sample requirement
  • Again, our approach is scalable and fast

54
Model Uniqueness of Invariants
  • Let the worm have a set of invariants. Determine
    their order as follows:
  • t1: the token with the minimum false positive in
    normal traffic; u(1) is the upper bound of the
    false positive of t1
  • t2: the token with the minimum joint false
    positive with t1; FP(t1, t2) is bounded by u(2)
  • ti: the token with the minimum joint false
    positive with t1, t2, ..., t(i-1);
    FP(t1, t2, ..., ti) is bounded by u(i)
  • The total number of tokens is bounded by k

55
Problem Formulation
  • Noisy Token Multiset Signature Generation Problem
    INPUT: suspicious pool M, normal traffic pool N,
    and a value r < 1.
    OUTPUT: a multiset-of-tokens signature
    S = {(t1, n1), ..., (tk, nk)} such that the
    signature maximizes the coverage of the
    suspicious pool while its false positive in the
    normal pool is less than r
  • Without noise, a polynomial-time algorithm exists
  • With noise: NP-Hard

56
Generalizing Signature Generation with Noise
  • BEST signature: Balanced signature
  • Balance sensitivity with specificity
  • But how? Create a scoring function
    score(cov, fp, ...) to evaluate the goodness of a
    signature
  • Currently used: [formula not preserved in this
    transcript]
  • Intuition: it is better to reduce the coverage by
    a factor 1/a if the false positive becomes 10
    times smaller.
  • Add some weight for the length of the signature
    (LEN) to break ties between signatures with the
    same coverage and false positive

57
Generalizing Signature Generation with Noise
  • Algorithm: similar
  • Running time: same as the previous simple form
  • Attack resilience guarantee: similar

58
Extension to Multiple Worms
  • Iteratively use the single-worm detector to
    detect multiple worms
  • In the first iteration, the algorithm finds the
    signature for the most popular worm in the
    suspicious pool. All other worms and normal
    traffic are treated as noise.
  • Though the analysis for the single worm applies
    to multiple worms, the bounds are not very
    promising. Reason: high noise ratio

59
Token Extraction
  • Extract a set of tokens with minimum length lmin
    and coverage COVmin, and for each token output
    its frequency vector.
  • Polygraph uses a suffix-tree-based approach: 20n
    space and time-consuming.
  • Our approach:
  • Enhanced suffix array: 4n space
  • Much faster: at least 50 times!
  • Can be applied to Polygraph as well.

60
Calculating the False Positive
  • We need the false positive rate to check the
    u-bounds
  • Again a suffix-array-based approach, but for a
    300MB normal pool, a 1.2GB suffix array is still
    large!
  • Improvements:
  • Caching
  • MMAP the suffix array; true memory usage
    150-250MB
  • 2-level normal pool
  • Hardware-based fast string matching
  • Compress the normal pool and run string matching
    algorithms directly over the compressed strings

61
Future Work
  • Enhance the flow classifiers
  • Cluster suspicious flows by return messages
  • Malicious flow verification by replaying to
    Address Space Randomization enabled servers