Title: Pretty Pictures Are Good: Towards Better Tools For Data Comprehension Maybe
1. Pretty Pictures Are Good: Towards Better Tools For Data Comprehension (Maybe?)
- Dan Kaminsky, Director of Penetration Testing, IOActive
2. What's This Talk About?
- Work In Progress
- Problem: Analysis Of Large Datasets Is Hard
- General Solution: Let the computer do the analysis, give the human a report
- Potential Solution: Let the computer transform the data into a form that a human can analyze
- Theory: Humans have wetware pattern recognition capabilities that we do not yet know how to code
- By tapping into this wetware, we can find new things.
- Theory: Best pathway to human pattern matching skills is through the visual system
- Sonification has been tried; no successes here yet.
3. Types of Data
- Static
- Given a sequence of bytes, comprehend patterns that relate to whatever the sequence of bytes was meant to encode
- This may not be known
- Dynamic
- Given an ongoing stream, detect change that may correlate with active software behavior
4. History: Phase Space Analysis of Static Documents (Zalewski)
5. Mechanism
- Treat a dataset as a sequence of integers
- Collect four consecutive numbers
- Take the three differences between neighbouring numbers
- First difference is X
- Second difference is Y
- Third difference is Z
- Plot a point at (X, Y, Z)
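The steps above can be sketched in a few lines of Python. This is an illustration only, not the Phentropy code: the function name `phase_space_points` is made up here, and it assumes the integers are bytes, that the four-number window slides forward one position at a time, and that differences are plain signed subtraction with no wraparound.

```python
def phase_space_points(data: bytes):
    """Map each run of four consecutive values to one 3D point of deltas."""
    points = []
    for i in range(len(data) - 3):
        a, b, c, d = data[i:i + 4]
        # Three differences between neighbours become X, Y and Z.
        points.append((b - a, c - b, d - c))
    return points


sample = bytes([10, 12, 15, 19, 24])
print(phase_space_points(sample))  # [(2, 3, 4), (3, 4, 5)]
```

Feeding the resulting triples to any 3D scatter plot gives the point cloud; structured data tends to cluster, while random data fills the space more evenly.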
6. Phentropy (Part of Original Paketto Keiretsu!)
7. Limitation
- Phentropy only worked on static data
- Actually used a medical renderer for MRI data
- Advantage: Could handle arbitrary amounts of input data, because everything was inserted into a fixed size structure
- Disadvantage: Couldn't stream data in
- Possible to stream?
- Yes, wrote an OpenGL version a few years ago; it was too slow to be cool then, but heh, machines are faster now!
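One way to picture the "fixed size structure" idea is a grid of hit counters, so memory use stays constant no matter how many points arrive. The sketch below is an assumption about the general approach, not a description of how Phentropy or the OpenGL version stored data: the grid resolution `GRID`, the helper `bin_index`, and the function `accumulate` are all hypothetical names chosen for illustration, and the input is assumed to be byte deltas in the range -255 to 255.

```python
GRID = 64  # hypothetical resolution per axis; memory is GRID**3 counters


def bin_index(delta: int) -> int:
    """Scale a byte delta in [-255, 255] to a cell index in [0, GRID - 1]."""
    return (delta + 255) * GRID // 511


def accumulate(points, grid=None):
    """Add (x, y, z) points to a flat GRID^3 array of hit counts."""
    if grid is None:
        grid = [0] * (GRID ** 3)
    for x, y, z in points:
        cell = (bin_index(x) * GRID + bin_index(y)) * GRID + bin_index(z)
        grid[cell] += 1
    return grid


grid = accumulate([(0, 0, 0), (0, 0, 0), (255, 255, 255)])
print(len(grid), sum(grid), max(grid))  # 262144 3 2
```

Because `accumulate` accepts an existing grid, the same structure can be fed chunk by chunk, which is one plausible route from a static view to a streaming one.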
8. Demos (For Those Not Here)
9. Interesting Streams
- Packets
- Watchpoints / strace / ltrace
- Process memory?
10. Interesting Findings
- Can summarize a shocking amount of data to a human eye per second
- 200 Kbytes of data on screen, with full updates several times a second, equals megabyte-scale summarization
- Glanceable
- Patterns in motion reconstruct to 3D surprisingly well; we're much better at integrating moving point clouds than static ones
- Can encode several pattern types
- Easy to see when pattern types go through significant changes
- Hard to backreference to what causes a particular pattern
- Can't even do Zalewski's RNG analysis, as this wasn't designed to do that
- Points don't age out yet, as per Xovi simple loop model
11. What About Sequitur?
- Sequitur: Linear Time Pattern Finder
- Give it data, it gives you strings that it found repeating inside
- Actually a compression format you can peek inside
- Does not find all combinations
- Is very fast
- Rather than trying to come up with a global context, can we start from repeated strings and go from there?
- Part of the "better hex editing through hackery" effort
12. Step One: XML-ize the Sequence Generator
- echo aabbabc | ./sequitur_simple.exe
- Why translate? Gives us much easier to manipulate output
- C is very good for generating the tree
- Other languages are very good for analyzing / modifying the tree
- XML is a (shockingly) good machine format for representing structure
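To make the "generate in C, analyze elsewhere" split concrete, here is a small sketch of what consuming such XML could look like in Python. The actual output format of the patched Sequitur is not shown in these slides, so the element names `grammar`, `rule`, `ref` and `lit` are invented for illustration, and the grammar for "aabbabc" (S -> a A b A c, A -> a b) is written by hand rather than produced by the tool.

```python
import xml.etree.ElementTree as ET

# Hand-written grammar for "aabbabc": S -> a A b A c, A -> a b
grammar = {"S": ["a", "A", "b", "A", "c"], "A": ["a", "b"]}


def grammar_to_xml(grammar):
    """Serialize rules as <rule> elements holding <ref> and <lit> children."""
    root = ET.Element("grammar")
    for name, body in grammar.items():
        rule = ET.SubElement(root, "rule", name=name)
        for symbol in body:
            tag = "ref" if symbol in grammar else "lit"
            ET.SubElement(rule, tag).text = symbol
    return root


def expand(root, name="S"):
    """Rebuild the original string by walking the XML tree."""
    rule = root.find(f"rule[@name='{name}']")
    return "".join(
        expand(root, child.text) if child.tag == "ref" else child.text
        for child in rule
    )


root = grammar_to_xml(grammar)
print(ET.tostring(root, encoding="unicode"))
print(expand(root))  # aabbabc
```

Once the grammar is a tree, questions like "which rule is referenced most often" become ordinary tree queries rather than custom parsing.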
13. Step Two: Render Depth Transitions Around The Compression Format
14. Global View
15. Step Three: Allow Point-And-Click Selection Of Repeats
16. Could We Do Sequitur Realtime?
- Short Version: Yes
- It's an online algorithm
- Reads left-to-right
- Could emit its grammar as it went
- Must be patched to do so
- Probably will be issues with XML output
- Other Issues
- Misses repeats
- Sequitur builds a grammar that is useful for its own compression. Sequitur does not solve the All Common Substrings problem
17. Substrings
- When analyzing a protocol on the micro scale, hex is awful.
- What we want is to be able to chunk bytes into tokens, as we're trying to reverse engineer the parser, which sees the bytestream as a sequence of tokens
- A token repeated is a token potentially detected
- If not across the same file, then across different instances of a file format
- Sequitur is not necessarily going to tell you about every repeat
- Many, but not all
18. N-grams
- Often used for language ID; can work for file formats, google "Fileprinting"
- Given a dataset, find all 4-byte repeats, 5-byte repeats, 10-byte repeats, etc.
- Not ideal. Take the following example: abcabcdabcdeabcde
- 3-grams: abc (4 instances), bcd
- 4-grams: abcd
- Now repeat the string twice: abcabcdabcdeabcdeabcabcdabcdeabcde
- Lots and lots of repeated substrings
- We'd want to actually collapse them together w/ exclusivity
- Sequitur does some of this, at the cost of missing potential matches
- (abcabcdabcdeabcde) -> (abcd) (abc) (bcd)
- Best approach for finding all repeats, with the caveat that the repeats aren't fully elucidated
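The slide's example is easy to reproduce. The sketch below counts every overlapping n-gram and keeps those seen more than once; the function name `repeated_ngrams` is just a label for this illustration, and the approach is plain brute-force counting rather than anything from the talk's highlighter.

```python
from collections import Counter


def repeated_ngrams(data, n):
    """Count every overlapping n-gram and keep those seen more than once."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram: c for gram, c in counts.items() if c > 1}


text = "abcabcdabcdeabcde"
print(repeated_ngrams(text, 3))  # {'abc': 4, 'bcd': 3, 'cde': 2}
print(repeated_ngrams(text, 4))  # {'abcd': 3, 'bcde': 2}

# Doubling the input makes many more n-grams count as repeats.
print(len(repeated_ngrams(text * 2, 3)), len(repeated_ngrams(text * 2, 4)))
```

The overlap is visible right away: every repeat of abcd also contributes repeats of abc and bcd, which is the redundancy that collapsing "with exclusivity" would remove.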
19. Ngrams v. Sequitur on PNG (My Ngram highlighter is buggy right now)
20. Suffix Trees
- Converts dataset to a directed graph
- Nodes have variable-length character strings
- Lots of good linear time properties for searching and transformation
- Libstree is the definitive implementation
- Haven't played too much with these yet
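For a feel of what suffix structures buy you, here is a deliberately naive stand-in: a sorted list of suffixes (a suffix array) rather than a real suffix tree, and not libstree. It runs in roughly quadratic time instead of linear, so it only suits small inputs, but it does find the longest substring that occurs at least twice. The name `longest_repeat` is chosen for this sketch.

```python
def longest_repeat(data):
    """Return the longest substring occurring at least twice (may overlap)."""
    # Sort suffix start positions by the suffix text they point at.
    suffixes = sorted(range(len(data)), key=lambda i: data[i:])
    best = data[:0]  # empty str or bytes, matching the input type
    for a, b in zip(suffixes, suffixes[1:]):
        # Neighbouring suffixes in sorted order share the longest prefixes.
        n = 0
        while (a + n < len(data) and b + n < len(data)
               and data[a + n] == data[b + n]):
            n += 1
        if n > len(best):
            best = data[a:a + n]
    return best


print(longest_repeat("abcabcdabcdeabcde"))  # abcde
```

Unlike fixed-length n-gram counting, this needs no guess about n up front; a real suffix tree provides the same kind of answer with better time bounds.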
21. Dotplots
- Compare chunks of a file to other chunks of a file
- If the chunk is similar, drop a light pixel
- If the chunk is different, drop a dark pixel
- Leverages significant uptake of texture data by human visual system
- Graphs collapse into textures after only a few dozen node/edges
- Slow to compute
- Streaming live engine was created, too slow for production use
- Challenging to map back to real data
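A toy dotplot shows both the idea and why it is slow. The sketch below simplifies "similar" to "exactly equal" and uses fixed, non-overlapping chunks; both are assumptions made for brevity, and `dotplot` and `render` are illustrative names rather than anything from the tools shown in the talk.

```python
def dotplot(data: bytes, chunk: int = 4):
    """Return a square matrix: 1 (light) where chunks match, 0 (dark) otherwise."""
    chunks = [data[i:i + chunk] for i in range(0, len(data) - chunk + 1, chunk)]
    # Every chunk is compared with every other chunk: quadratic work.
    return [[1 if a == b else 0 for b in chunks] for a in chunks]


def render(matrix):
    """Draw the matrix as text, '#' for light pixels and '.' for dark ones."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in matrix)


print(render(dotplot(b"HEADdataHEADmoreHEAD")))
# #.#.#
# .#...
# #.#.#
# ...#.
# #.#.#
```

The main diagonal is always lit because each chunk matches itself; off-diagonal marks are the repeats. The all-pairs comparison is what makes large inputs expensive.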
22. Realtime Dotplot Hex Exploration (Law)
23. Dotplots In Alternate Domains: Video
24. Dotplots in Alternate Domains: Audio
25. Breaking Audio Captchas w/ Similarity Detection
26. Conclusion
- Many alternate domains to express complex data
- Hex sucks; let's do better