Pretty Pictures Are Good: Towards Better Tools For Data Comprehension Maybe - PowerPoint PPT Presentation

1
Pretty Pictures Are Good: Towards Better Tools
For Data Comprehension (Maybe?)
  • Dan Kaminsky, Director of Penetration Testing,
    IOActive

2
What's This Talk About?
  • Work In Progress
  • Problem: Analysis Of Large Datasets Is Hard
  • General Solution: Let the computer do the
    analysis, give the human a report
  • Potential Solution: Let the computer transform
    the data into a form that a human can analyze
  • Theory: Humans have wetware pattern recognition
    capabilities that we do not yet know how to code
  • By tapping into this wetware, we can find new
    things.
  • Theory: Best pathway to human pattern matching
    skills is through the visual system
  • Sonification has been tried; no successes here yet.

3
Types of Data
  • Static
  • Given a sequence of bytes, comprehend patterns
    that relate to whatever the sequence of bytes was
    meant to encode
  • This may not be known
  • Dynamic
  • Given an ongoing stream, detect change that may
    correlate with active software behavior

4
History: Phase Space Analysis of Static
Documents (Zalewski)
5
Mechanism
  • Treat a dataset as a sequence of integers
  • Collect four consecutive numbers
  • Take the differences between adjacent numbers,
    giving three deltas
  • First delta is X
  • Second delta is Y
  • Third delta is Z
  • Plot a point
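
The mechanism above can be sketched in a few lines of Python. This is a minimal sketch, assuming each byte of the dataset is one integer and that the window slides forward by one value at a time; the function name is hypothetical and not taken from Phentropy.

```python
def phase_space_points(data: bytes):
    """Turn a byte sequence into 3D points built from adjacent differences."""
    values = list(data)  # assumption: one integer per byte
    points = []
    # Slide a window of four values across the sequence.
    for i in range(len(values) - 3):
        a, b, c, d = values[i:i + 4]
        # Three deltas between the four numbers become X, Y, Z.
        points.append((b - a, c - b, d - c))
    return points
```

Feeding the resulting points to any 3D scatter plot gives the kind of picture the next slide shows; structured data tends to cluster, while random data fills the space more evenly.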

6
Phentropy (Part of Original Paketto Keiretsu!)
7
Limitation
  • Phentropy only worked on static data
  • Actually used a medical renderer for MRI data
  • Advantage: Could handle arbitrary amounts of
    input data, because everything was inserted into
    a fixed size structure
  • Disadvantage: Couldn't stream data in
  • Possible to stream?
  • Yes, wrote an OpenGL version a few years ago; it
    was too slow to be cool then, but heh, machines
    are faster now!
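
Why a fixed size structure copes with arbitrary input can be shown with a small sketch. The grid size, the coordinate range, and the name bin_points are assumptions made for illustration; the actual Phentropy data layout is not described in the slides.

```python
def bin_points(points, grid=64, span=256):
    """Accumulate 3D points into a fixed grid*grid*grid array of hit counts.

    Assumes every coordinate lies in [-span, span); values outside are clamped.
    Memory use depends only on `grid`, never on how many points arrive.
    """
    counts = [0] * (grid ** 3)

    def cell(v):
        # Map a coordinate from [-span, span) onto a cell index in [0, grid).
        return min(grid - 1, max(0, (v + span) * grid // (2 * span)))

    for x, y, z in points:
        counts[(cell(x) * grid + cell(y)) * grid + cell(z)] += 1
    return counts
```

A volume renderer can then draw the counts as densities, which is presumably why a medical MRI renderer was a workable fit.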

8
Demos (For Those Not Here)
9
Interesting Streams
  • Packets
  • Watchpoints / strace / ltrace
  • Process memory?

10
Interesting Findings
  • Can summarize a shocking amount of data to a
    human eye per second
  • 200 Kbytes of data on screen, with full updates
    several times a second, equals megabyte-scale
    summarization
  • Glanceable
  • Patterns in motion reconstruct to 3D surprisingly
    well; we're much better at integrating moving
    point clouds than static ones
  • Can encode several pattern types
  • Easy to see when pattern types go through
    significant changes
  • Hard to backreference to what causes a particular
    pattern
  • Can't even do Zalewski's RNG analysis, as this
    wasn't designed to do that
  • Points don't age out yet, as per Xovi simple
    loop model
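
One way points could age out of a streaming display is a bounded queue that drops the oldest point when a new one arrives. This is only a sketch of that idea, not something the talk's tool does; the class name and the max_points parameter are hypothetical.

```python
from collections import deque


class AgingPointCloud:
    """Keep only the most recent max_points points of a stream."""

    def __init__(self, max_points):
        # A deque with maxlen discards the oldest entry on overflow.
        self.points = deque(maxlen=max_points)

    def add(self, point):
        self.points.append(point)

    def snapshot(self):
        # Copy of the points currently on screen, oldest first.
        return list(self.points)
```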

11
What About Sequitur?
  • Sequitur: Linear Time Pattern Finder
  • Give it data, it gives you strings that it found
    repeating inside
  • Actually a compression format you can peek
    inside
  • Does not find all combinations
  • Is very fast
  • Rather than trying to come up with a global
    context, can we start from repeated strings and
    go from there?
  • Part of the "better hex editing through hackery"
    effort
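
The idea of a compression format you can peek inside can be illustrated with a much simpler grammar builder. The sketch below is NOT Sequitur: it uses greedy offline pair replacement (in the style of Re-Pair), is not linear time, and will generally produce different rules. The names build_grammar and expand are hypothetical.

```python
from collections import Counter


def build_grammar(text):
    """Greedy pair replacement (Re-Pair style), not the Sequitur algorithm."""
    seq = list(text)   # terminals are single characters
    rules = {}         # rule name -> pair of symbols it stands for
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats any more
        name = f"R{len(rules)}"  # multi-character, so never equal to a terminal
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Peek inside: recover the string a symbol stands for."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

Expanding each rule lists the repeated strings the grammar found, which is the starting point the slide suggests for analysis.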

12
Step One: XML-ize the Sequence Generator
  • echo aabbabc | ./sequitur_simple.exe
  • Why translate? Gives us much easier to
    manipulate output
  • C is very good for generating the tree
  • Other languages are very good for analyzing /
    modifying the tree
  • XML is a (shockingly) good machine format for
    representing structure
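
As a rough illustration of why XML suits this, the sketch below turns a grammar into a nested tree with Python's xml.etree.ElementTree. The element names (grammar, rule, t) and the hand-written grammar for aabbabc are assumptions; the XML actually emitted by the patched sequitur_simple is not shown in the slides.

```python
import xml.etree.ElementTree as ET


def grammar_to_xml(start, rules):
    """Expand a grammar into a nested XML tree (element names are made up)."""
    root = ET.Element("grammar")

    def emit(parent, symbol):
        if symbol in rules:
            # A rule becomes a container holding its expansion.
            node = ET.SubElement(parent, "rule", name=symbol)
            for child in rules[symbol]:
                emit(node, child)
        else:
            # A terminal becomes a leaf holding the literal character.
            ET.SubElement(parent, "t").text = symbol

    for symbol in start:
        emit(root, symbol)
    return root


# Hand-written grammar for "aabbabc": S -> a A b A c, A -> a b
root = grammar_to_xml(["a", "A", "b", "A", "c"], {"A": ["a", "b"]})
xml_text = ET.tostring(root, encoding="unicode")
```

Once the tree is in this form, any language with an XML library can walk it, count nesting depth, or select every occurrence of one rule.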

13
Step Two: Render Depth Transitions Around The
Compression Format
14
Global View
15
Step Three: Allow Point-And-Click Selection Of
Repeats
16
Could We Do Sequitur Realtime?
  • Short Version: Yes
  • It's an online algorithm
  • Reads left-to-right
  • Could emit its grammar as it went
  • Must be patched to do so
  • Probably will be issues with XML output
  • Other Issues
  • Misses repeats
  • Sequitur builds a compression grammar that is
    useful to it. Sequitur does not solve the All
    Common Substrings problem

17
Substrings
  • When analyzing a protocol on the micro scale, hex
    is awful.
  • What we want is to be able to chunk bytes into
    tokens, as we're trying to reverse engineer the
    parser which sees the bytestream as a sequence of
    tokens
  • A token repeated is a token potentially detected
  • If not across the same file, then across
    different instances of a file format
  • Sequitur is not necessarily going to tell you
    about every repeat
  • Many, but not all

18
N-grams
  • Often used for language ID; can work for file
    formats (google "Fileprinting")
  • Given a dataset, find all 4-byte repeats, 5-byte
    repeats, 10-byte repeats, etc.
  • Not ideal. Take the following example:
    abcabcdabcdeabcde
  • 3-grams: abc (4 instances), bcd
  • 4-grams: abcd
  • Now repeat the string twice:
    abcabcdabcdeabcdeabcabcdabcdeabcde
  • Lots and lots of repeated substrings
  • We'd want to actually collapse them together w/
    exclusivity
  • Sequitur does some of this, at the cost of
    missing potential matches
  • (abcabcdabcdeabcde) -> (abcd) (abc) (bcd)
  • Best approach for finding all repeats, with the
    caveat that the repeats aren't fully elucidated
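
The n-gram counting described above fits in a few lines. This is a minimal sketch using a sliding window and collections.Counter; the function name is hypothetical, and nothing here collapses overlapping repeats, which is exactly the weakness the slide points out.

```python
from collections import Counter


def repeated_ngrams(data, n):
    """Return every length-n substring that occurs more than once, with counts."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram: c for gram, c in counts.items() if c > 1}


example = "abcabcdabcdeabcde"
three = repeated_ngrams(example, 3)  # includes abc (4 instances) and bcd
four = repeated_ngrams(example, 4)   # includes abcd
```

The same function works on bytes objects, so it can be pointed at a file's contents directly. Running it on the doubled string shows how quickly the list of repeats grows.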

19
Ngrams v. Sequitur on PNG: My Ngram highlighter
is buggy right now
20
Suffix Trees
  • Converts dataset to a directed graph
  • Nodes hold variable-length runs of characters
  • Lots of good linear time properties for searching
    and transformation
  • Libstree is the definitive implementation
  • Haven't played too much with these yet
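
A full suffix tree is too long to show here, so the sketch below swaps in a naive sorted-suffix list (a simple suffix array) to show one kind of query these structures answer: the longest repeated substring. It is not linear time and it is not how libstree works; the function name is hypothetical.

```python
def longest_repeated_substring(data):
    """Find the longest substring that occurs at least twice (overlaps allowed)."""
    # Sort suffix start positions by the suffix text (naive, not linear time).
    suffixes = sorted(range(len(data)), key=lambda i: data[i:])
    best = data[:0]  # empty str or bytes, matching the input type
    # Repeats show up as common prefixes of neighbouring suffixes.
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while a + k < len(data) and b + k < len(data) and data[a + k] == data[b + k]:
            k += 1
        if k > len(best):
            best = data[a:a + k]
    return best
```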

21
Dotplots
  • Compare chunks of a file to other chunks of a
    file
  • If the chunk is similar, drop a light pixel
  • If the chunk is different, drop a dark pixel
  • Leverages significant uptake of texture data by
    human visual system
  • Graphs collapse into textures after only a few
    dozen nodes/edges
  • Slow to compute
  • Streaming live engine was created, too slow for
    production use
  • Challenging to map back to real data
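
The dotplot construction can be sketched as a chunk-against-chunk comparison. This sketch assumes fixed-size, non-overlapping chunks and treats "similar" as exact equality, which is a simplification; the function names are hypothetical. The quadratic number of comparisons is also why it is slow to compute.

```python
def dotplot(data, chunk=1):
    """Compare every chunk of data against every other chunk.

    Returns a square matrix: 1 (light pixel) where chunks match,
    0 (dark pixel) where they differ.
    """
    chunks = [data[i:i + chunk] for i in range(0, len(data) - chunk + 1, chunk)]
    return [[1 if a == b else 0 for b in chunks] for a in chunks]


def render(matrix):
    """Crude text rendering: '#' for light pixels, '.' for dark ones."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in matrix)
```

Repeated regions show up as diagonal lines parallel to the main diagonal, and runs of identical chunks show up as solid squares.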

22
Realtime Dotplot Hex Exploration (Law)
23
Dotplots In Alternate Domains: Video
24
Dotplots In Alternate Domains: Audio
25
Breaking Audio Captchas w/ Similarity Detection
26
Conclusion
  • Many alternate domains to express complex data
  • Hex sucks; let's do better