Pretty Pictures Are Good: Towards Better Tools For Data Comprehension Maybe

1
Pretty Pictures Are Good: Towards Better Tools
For Data Comprehension (Maybe?)

Dan Kaminsky, Director of Penetration Testing,
IOActive

2
What's This Talk About?

Work In Progress
Problem: Analysis Of Large Datasets Is Hard
General Solution: Let the computer do the
analysis, give the human a report
Potential Solution: Let the computer transform
the data into a form that a human can analyze
Theory: Humans have wetware pattern recognition
capabilities that we do not yet know how to code
By tapping into this wetware, we can find new
things.
Theory: Best pathway to human pattern matching
skills is through the visual system
Sonification has been tried; no successes here yet.

3
Types of Data

Static
Given a sequence of bytes, comprehend patterns
that relate to whatever the sequence of bytes was
meant to encode
This may not be known
Dynamic
Given an ongoing stream, detect change that may
correlate with active software behavior

4
History: Phase Space Analysis of Static
Documents (Zalewski)
5
Mechanism

Treat a dataset as a sequence of integers
Collect four consecutive numbers
Take the three differences between adjacent numbers
First difference is X
Second difference is Y
Third difference is Z
Plot a point at (X, Y, Z)
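
The mechanism above can be sketched in a few lines of Python. This is a minimal illustration, not the original tool's code: the function name phase_space_points is hypothetical, and it assumes a sliding window (each new integer yields a new point) rather than non-overlapping groups of four.

```python
from typing import Iterable, Iterator, Tuple

def phase_space_points(values: Iterable[int]) -> Iterator[Tuple[int, int, int]]:
    """Yield one (x, y, z) point per sliding window of four integers.

    Assumption: windows overlap, so every integer after the third
    produces a new point.
    """
    window = []
    for v in values:
        window.append(v)
        if len(window) > 4:
            window.pop(0)          # keep only the four most recent numbers
        if len(window) == 4:
            a, b, c, d = window
            # Three differences between adjacent numbers become X, Y, Z.
            yield (b - a, c - b, d - c)

# Iterating over a bytes object gives integers in 0..255.
points = list(phase_space_points(bytes([1, 4, 9, 16, 25])))
print(points)  # [(3, 5, 7), (5, 7, 9)]
```

Because the function is a generator, it accepts a file, a socket reader, or any other iterable of integers without holding the whole dataset in memory.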

6
Phentropy (Part of Original Paketto Keiretsu!)
7
Limitation

Phentropy only worked on static data
Actually used a medical renderer for MRI data
Advantage: Could handle arbitrary amounts of
input data, because everything was inserted into
a fixed size structure
Disadvantage: Couldn't stream data in
Possible to stream?
Yes, wrote an OpenGL version a few years ago. It
was too slow to be cool then, but heh, machines
are faster now!
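
One way to picture the "fixed size structure" is a voxel grid that counts how many points land in each cell. The sketch below is an assumption about the general idea, not Phentropy's actual data layout: the names VoxelGrid and cell_for, the 64-cell resolution, and the -255..255 coordinate range (differences of byte values) are all chosen for illustration.

```python
from collections import Counter

GRID = 64  # hypothetical resolution: 64 x 64 x 64 cells

def cell_for(point, lo=-255, hi=255, grid=GRID):
    """Map an (x, y, z) point with coordinates in [lo, hi] to a grid cell."""
    span = hi - lo + 1
    return tuple(min(grid - 1, (c - lo) * grid // span) for c in point)

class VoxelGrid:
    """Accumulates points into a bounded number of cells."""

    def __init__(self):
        # Never more than GRID**3 keys, however much input arrives.
        self.counts = Counter()

    def add(self, point):
        self.counts[cell_for(point)] += 1

grid = VoxelGrid()
for p in [(-255, 0, 255), (-255, 0, 255), (10, 20, 30)]:
    grid.add(p)   # points can be added one at a time, as they arrive
```

Since each point only increments a counter, the same structure accepts points incrementally, which is the property a streaming version needs.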

8
Demos (For Those Not Here)
9
Interesting Streams

Packets
Watchpoints / strace / ltrace
Process memory?

10
Interesting Findings

Can summarize a shocking amount of data to a
human eye per second
200Kbytes of data on screen, with full updates
several times a second, equals megabyte-scale
summarization
Glanceable
Patterns in motion reconstruct to 3D surprisingly
well; we're much better at integrating moving
point clouds than static
Can encode several pattern types
Easy to see when pattern types go through
significant changes
Hard to backreference to what causes a particular
pattern
Can't even do Zalewski's RNG analysis, as this
wasn't designed to do that
Points don't age out yet, as per Xovi simple
loop model
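
Aging points out is not implemented in the tool described here. As a hedged sketch of one way it could work, a bounded queue drops the oldest point whenever a new one arrives. The class name AgingPointCloud and the max_points parameter are hypothetical.

```python
from collections import deque

class AgingPointCloud:
    """Keeps only the most recent max_points points."""

    def __init__(self, max_points):
        # A deque with maxlen discards from the left when it is full.
        self.points = deque(maxlen=max_points)

    def add(self, point):
        self.points.append(point)

    def snapshot(self):
        """Return the points a renderer would draw for the current frame."""
        return list(self.points)

cloud = AgingPointCloud(max_points=3)
for i in range(5):
    cloud.add((i, i, i))
print(cloud.snapshot())  # [(2, 2, 2), (3, 3, 3), (4, 4, 4)]
```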

11
What About Sequitur?

Sequitur: Linear Time Pattern Finder
Give it data, it gives you strings that it found
repeating inside
Actually a compression format you can peek
inside
Does not find all combinations
Is very fast
Rather than trying to come up with a global
context, can we start from repeated strings and
go from there?
Part of the "better hex editing through hackery"
effort
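
To show what "a compression format you can peek inside" looks like, here is a small grammar builder. It is deliberately NOT Sequitur: it uses greedy pair replacement (in the style of Re-Pair), which is offline and not linear time, but it produces the same kind of output, a set of rules that each expand to a repeated string. The function names are hypothetical.

```python
from collections import Counter

def build_grammar(data):
    """Greedy pair replacement. Returns (start_sequence, rules)."""
    seq = list(data)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                      # no adjacent pair repeats any more
        name = f"R{len(rules)}"        # multi-character names never clash with input characters
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)       # replace the pair with the rule name
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(symbol, rules):
    """Recursively expand a symbol back into the text it stands for."""
    if symbol not in rules:
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])

start, rules = build_grammar("abcabcdabcdeabcde")
repeats = sorted(expand(name, rules) for name in rules)
print(repeats)
```

Expanding every rule gives the repeated strings the grammar found. Like Sequitur, this approach reports only the repeats that were useful for compression, not every repeated substring.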

12
Step One: XML-ize the Sequence Generator

echo aabbabc | ./sequitur_simple.exe
Why translate? Gives us much easier to
manipulate output
C is very good for generating the tree
Other languages are very good for analyzing /
modifying the tree
XML is a (shockingly) good machine format for
representing structure
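
As a rough picture of what XML-ized grammar output might look like, the sketch below serializes a tiny grammar with Python's xml.etree.ElementTree. The element names (grammar, rule, lit) and the example grammar are assumptions made for illustration. The actual output format of the patched Sequitur is not shown in these slides.

```python
import xml.etree.ElementTree as ET

def grammar_to_xml(start, rules):
    """Nest each rule's expansion under a <rule> element."""
    root = ET.Element("grammar")

    def emit(parent, symbol):
        if symbol in rules:
            node = ET.SubElement(parent, "rule", name=symbol)
            for child in rules[symbol]:
                emit(node, child)
        else:
            ET.SubElement(parent, "lit").text = symbol   # literal input character

    for symbol in start:
        emit(root, symbol)
    return root

# Hypothetical grammar for the string "abab": S -> R0 R0, R0 -> a b
example_rules = {"R0": ("a", "b")}
tree = grammar_to_xml(["R0", "R0"], example_rules)
print(ET.tostring(tree, encoding="unicode"))

# Once the structure is XML, analysis is a query: how many top-level repeats?
print(len(tree.findall("rule")))  # 2
```

The point of the design is the split of labor: the generator only has to print nested tags, and any language with an XML parser can then walk or rewrite the tree.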

13
Step Two: Render Depth Transitions Around The
Compression Format
14
Global View
15
Step Three: Allow Point-And-Click Selection Of
Repeats
16
Could We Do Sequitur Realtime?

Short Version: Yes
It's an online algorithm
Reads left-to-right
Could emit its grammar as it went
Must be patched to do so
Probably will be issues with XML output
Other Issues
Misses repeats
Sequitur builds whatever grammar is useful for
its own compression. Sequitur does not solve the
All Common Substrings problem

17
Substrings

When analyzing a protocol on the micro scale, hex
is awful.
What we want is to be able to chunk bytes into
tokens, as we're trying to reverse engineer the
parser, which sees the bytestream as a sequence of
tokens
A token repeated is a token potentially detected
If not across the same file, then across
different instances of a file format
Sequitur is not necessarily going to tell you
about every repeat
Many, but not all

18
N-grams

Often used for language ID; can work for file
formats, google "Fileprinting"
Given a dataset, find all 4-byte repeats, 5-byte
repeats, 10-byte repeats, etc.
Not ideal. Take the following example:
abcabcdabcdeabcde
3-grams: abc (4 instances), bcd
4-grams: abcd
Now repeat the string twice:
abcabcdabcdeabcdeabcabcdabcdeabcde
Lots and lots of repeated substrings
We'd want to actually collapse them together w/
exclusivity
Sequitur does some of this, at the cost of
missing potential matches
(abcabcdabcdeabcde) -> (abcd) (abc) (bcd)
Best approach for finding all repeats, with the
caveat that the repeats aren't fully elucidated
19
Ngrams v. Sequitur on PNG (My Ngram highlighter
is buggy right now)
20
Suffix Trees

Converts dataset to a directed graph
Edges are labeled with variable length substrings
Lots of good linear time properties for searching
and transformation
Libstree is the definitive implementation
Haven't played too much with these yet
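
For a feel of what suffix structures buy you, the sketch below finds the longest repeated substring using a naively built suffix array. This is a stand-in, not a suffix tree and not libstree: it sorts whole suffixes, so it is far from linear time and only suits small inputs. The function name is hypothetical.

```python
def longest_repeated_substring(data: bytes) -> bytes:
    """Longest substring occurring at least twice (occurrences may overlap)."""
    # Naive suffix array: start offsets sorted by the suffix they begin.
    suffixes = sorted(range(len(data)), key=lambda i: data[i:])
    best = b""
    # Any repeated substring is a common prefix of two suffixes that end
    # up adjacent in sorted order, so only neighbors need comparing.
    for a, b in zip(suffixes, suffixes[1:]):
        n = 0
        while a + n < len(data) and b + n < len(data) and data[a + n] == data[b + n]:
            n += 1
        if n > len(best):
            best = data[a:a + n]
    return best

print(longest_repeated_substring(b"abcabcdabcdeabcde"))  # b'abcde'
```

Unlike the grammar-based approach, this considers every suffix, so no repeat is skipped. A real suffix tree gives the same answer with linear-time construction.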

21
Dotplots

Compare chunks of a file to other chunks of a
file
If the chunk is similar, drop a light pixel
If the chunk is different, drop a dark pixel
Leverages significant uptake of texture data by
human visual system
Graphs collapse into textures after only a few
dozen nodes/edges
Slow to compute
Streaming live engine was created, too slow for
production use
Challenging to map back to real data

22
Realtime Dotplot Hex Exploration (Law)
23
Dotplots In Alternate Domains: Video
24
Dotplots in Alternate Domains: Audio
25
Breaking Audio Captchas w/ Similarity Detection
26
Conclusion