Title: From Dirt to Shovels: Inferring PADS descriptions from ASCII Data
1 From Dirt to Shovels: Inferring PADS descriptions from ASCII Data
Kathleen Fisher, David Walker, Peter White, Kenny Zhu
2 Learning: Goals and Approach
[Diagram labels: Raw Data (Email, ASCII log files, Binary Traces); Data Description (struct ...); Standard formats and schema (CSV, XML); Visual Information; End-user tools]
Problem: Producing useful tools for ad hoc data takes a lot of time.
Solution: A learning system to generate data descriptions and tools automatically.
3 PADS Reminder
Inferred data formats are described using PADS types.
- Base types describe atomic data
  - Pint8, Puint8, ... // -123, 44
  - Pstring() // hello
  - Pstring_FW(3) // catdog
  - Pdate, Ptime, Pip, ...
- Type constructors describe structured data
  - sequences: Pstruct, Parray, ...
  - choices: Punion, Penum, Pswitch
  - constraints: arbitrary predicates that describe expected properties.
4 Format inference overview
[Diagram: Raw Data -> Chunking Process -> Tokenization -> Structure Discovery -> Format Refinement (with Scoring Function) -> IR to PADS Printer -> PADS Description -> PADS Compiler -> XMLifier (XML) and Accumulator (Analysis Report)]
5 Chunking Process
- Convert raw input into a sequence of chunks.
- Supported divisions
  - Various forms of newline
  - File boundaries
- Also possible: user-defined paragraphs
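A minimal sketch of the chunking step, assuming one chunk per line or one chunk per file. The helper names `chunk_lines` and `chunk_files` are hypothetical and are not taken from the learning system itself.

```python
import re

# Accept the common newline forms: \r\n, \r, and \n.
# \r\n is listed first so it is never split into two separators.
NEWLINE = re.compile(r"\r\n|\r|\n")

def chunk_lines(raw):
    """One chunk per line; empty chunks are dropped."""
    return [c for c in NEWLINE.split(raw) if c]

def chunk_files(files):
    """One chunk per file; `files` maps a file name to its content."""
    return [text for _, text in sorted(files.items())]

chunks = chunk_lines("0, 24\r\nfoo, beg\nbar, end\r0, 56\n")
print(chunks)
```

User-defined paragraphs could be handled the same way by swapping in a different separator pattern.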
6 Tokenization
- Tokens expressed as regular expressions.
- Basic tokens
  - Integer, white space, punctuation, strings
- Distinctive tokens
  - IP addresses, dates, times, MAC addresses, ...
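A rough illustration of regular-expression tokens, with distinctive tokens tried before the basic ones they overlap with. The token names and patterns below are assumptions made for this sketch, not the system's actual lexer definitions.

```python
import re

# Assumed token definitions, tried in order: distinctive tokens first,
# then basic tokens.
TOKEN_SPECS = [
    ("IP",     r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("DATE",   r"\d{4}-\d{2}-\d{2}"),
    ("TIME",   r"\d{2}:\d{2}:\d{2}"),
    ("INT",    r"-?\d+"),
    ("STRING", r"[A-Za-z][A-Za-z0-9_]*"),
    ("WHITE",  r"[ \t]+"),
    ("PUNCT",  r"[^\sA-Za-z0-9]"),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))

def tokenize(chunk):
    """Return (token name, matched text) pairs for one chunk."""
    return [(m.lastgroup, m.group()) for m in LEXER.finditer(chunk)]

tokens = tokenize("10.0.0.1 2008-01-15 12:30:00 -42 ok,")
print([name for name, _ in tokens])
```

Because alternatives are tried left to right, "10.0.0.1" becomes a single IP token rather than a run of integers and punctuation.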
7 Histograms
8 Clustering
Group tokens with similar frequency distributions into clusters.
[Figure labels: Cluster 1, Cluster 2, Cluster 3]
Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
Rank clusters by a metric that rewards high coverage and narrow distributions. Choose the cluster with the highest score.
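The histogram and clustering steps might be sketched as follows. This assumes a histogram records, for each token, how many chunks contain it exactly k times. The similarity test (largest column difference within a tolerance) and the `score` formula are stand-ins chosen for illustration; the slides do not give the actual metric.

```python
from collections import Counter

def histogram(token, chunks):
    """Map k -> number of chunks in which `token` occurs exactly k times."""
    return Counter(chunk.count(token) for chunk in chunks)

def shape(hist, n_chunks):
    """Column heights as fractions of all chunks, tallest first."""
    return sorted((c / n_chunks for c in hist.values()), reverse=True)

def similar(h1, h2, n_chunks, tol=0.1):
    """Same shape within `tol` once the columns are sorted by height."""
    s1, s2 = shape(h1, n_chunks), shape(h2, n_chunks)
    width = max(len(s1), len(s2))
    s1 += [0.0] * (width - len(s1))
    s2 += [0.0] * (width - len(s2))
    return all(abs(a - b) <= tol for a, b in zip(s1, s2))

def cluster(hists, n_chunks, tol=0.1):
    """Greedily group tokens whose histograms have similar shapes."""
    clusters = []
    for name, hist in hists.items():
        for group in clusters:
            if similar(hists[group[0]], hist, n_chunks, tol):
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def score(hist, n_chunks):
    """Hypothetical metric: coverage times mass of the tallest non-zero column."""
    coverage = 1 - hist.get(0, 0) / n_chunks
    peak = max((c for k, c in hist.items() if k > 0), default=0) / n_chunks
    return coverage * peak

chunks = [
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "INT"],
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "INT"],
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "STRING"],
]
names = ["QUOTE", "COMMA", "WHITE", "INT", "STRING"]
hists = {name: histogram(name, chunks) for name in names}
clusters = cluster(hists, len(chunks))
best = max(clusters, key=lambda g: min(score(hists[t], len(chunks)) for t in g))
print(clusters, best)
```

In this toy input, the tokens that occur a fixed number of times in every chunk end up together and score highest.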
9 Partition chunks
In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.
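One way to picture the partitioning step: group chunks by the order in which the selected tokens occur, so that each group becomes one branch of a union. The function name `partition_by_order` and the token names are assumptions made for this sketch.

```python
from collections import defaultdict

def partition_by_order(chunks, selected):
    """Group chunk indices by the order of their selected tokens."""
    groups = defaultdict(list)
    for i, chunk in enumerate(chunks):
        order = tuple(t for t in chunk if t in selected)
        groups[order].append(i)
    return dict(groups)

chunks = [
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "INT"],
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "INT"],
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "STRING"],
]
groups = partition_by_order(chunks, {"QUOTE", "COMMA", "WHITE"})
print(groups)
```

A single group, as here, corresponds to the degenerate union: every chunk follows the same token order, so one struct suffices.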
10 Find subcontexts
Tokens in selected cluster: Quote(2), Comma, White
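A small sketch of how subcontexts could be extracted: each chunk is split around the selected tokens, and the pieces at the same position across chunks form the input to the next round of inference. The helper `split_contexts` is hypothetical.

```python
def split_contexts(chunk, order):
    """Split a chunk's tokens around the selected tokens listed in `order`."""
    contexts, current, pending = [], [], list(order)
    for tok in chunk:
        if pending and tok == pending[0]:
            contexts.append(current)   # close the context before this token
            current = []
            pending.pop(0)
        else:
            current.append(tok)
    contexts.append(current)           # whatever follows the last selected token
    return contexts

chunks = [
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "INT"],
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "INT"],
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "STRING"],
]
order = ["QUOTE", "QUOTE", "COMMA", "WHITE"]
# Column i holds the i-th subcontext of every chunk.
columns = list(zip(*(split_contexts(c, order) for c in chunks)))
print(columns[1], columns[4])
```

Four selected tokens yield five subcontexts per chunk; empty ones need no further analysis, and the rest are handled recursively.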
11 Then Recurse...
12 Inferred type
13 Finding arrays
Single cluster with high coverage, but wide distribution.
14 Partitioning
Selected tokens for array cluster: String, Pipe
Context 1, 2: String Pipe
Context 3: String
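One possible reading of the three contexts, sketched under the assumption that context 1 is the first array element, context 2 the remaining repeated elements, and context 3 whatever follows the last separator. The helper `split_array` is hypothetical.

```python
def split_array(chunk, group):
    """Split tokens into first element, later elements, and the remainder.

    `group` is the repeating token sequence, e.g. ["STRING", "PIPE"].
    """
    if not group:
        raise ValueError("group must not be empty")
    n, i, elements = len(group), 0, []
    while chunk[i:i + n] == group:
        elements.append(chunk[i:i + n])
        i += n
    return elements[:1], elements[1:], chunk[i:]

first, middle, last = split_array(
    ["STRING", "PIPE", "STRING", "PIPE", "STRING"], ["STRING", "PIPE"]
)
print(first, middle, last)
```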
15 Scoring
- Goal: A quantitative metric to evaluate the quality of inferred descriptions and drive refinement.
- Challenges
  - Underfitting. Pstring(Peof) describes data, but is too general to be useful.
  - Overfitting. Type that exhaustively describes data (H, e, r, m, i, o, n, e, ...) is too precise to be useful.
- Sweet spot: Reward compact descriptions that predict the data well.
16 Minimum Description Length (MDL)
- Standard metric from machine learning.
- Cost of transmitting the syntax of a description plus the cost of transmitting the data given the description:
  - cost(T, d) = complexity(T) + complexity(d | T)
- Functions defined inductively over the structure of the type T and data d respectively.
- Normalized MDL gives compression factor.
- Scoring function triggers rewriting rules.
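A toy illustration of the MDL trade-off, comparing two candidate descriptions of a column of strings: a generic string type and an enumeration of the values seen. The bit costs are simplifying assumptions made for this sketch (8 bits per character, one character of overhead per value or type node); they are not the cost functions used by the system.

```python
import math

BITS_PER_CHAR = 8

def raw_bits(values):
    """Bits to send the values verbatim, one terminator per value."""
    return sum((len(v) + 1) * BITS_PER_CHAR for v in values)

def string_cost(values):
    """Generic string type: cheap type, expensive data."""
    type_bits = BITS_PER_CHAR            # assumed cost of one type node
    return type_bits + raw_bits(values)

def enum_cost(values):
    """Enumeration of the distinct values: costly type, cheap data."""
    distinct = set(values)
    type_bits = BITS_PER_CHAR + raw_bits(distinct)
    data_bits = len(values) * math.log2(len(distinct)) if len(distinct) > 1 else 0.0
    return type_bits + data_bits

def normalized(cost_bits, values):
    """Cost relative to sending the raw data (a compression factor)."""
    return cost_bits / raw_bits(values)

repetitive = ["GET", "POST"] * 50
unique = [f"user{i}" for i in range(100)]
print(string_cost(repetitive), enum_cost(repetitive))
print(string_cost(unique), enum_cost(unique))
```

The enumeration wins on the repetitive column and loses on the all-distinct one, where spelling out every value in the type is the overfitting case.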
17 Format Refinement
- Format refinement rewrites descriptions to
  - Optimize information-theoretic complexity
    - Simplify presentation
      - Merge adjacent structures and unions
    - Improve precision
      - Identify constant values
      - Introduce enumerations
  - Refine types
    - Find termination conditions for strings
    - Determine integer sizes (8 bit, 32 bit, etc)
    - Identify array separators and terminators
  - Detect dependencies
    - Array lengths
    - Switched unions
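Two of the refinements listed above, identifying constant values and determining integer sizes, can be sketched for a single column of integers. The function `refine_ints` and its return strings are illustrative assumptions; only the PADS base type names (Pint8, Puint8, and so on) come from PADS.

```python
def refine_ints(values):
    """Pick the narrowest description for a non-empty column of integers."""
    if len(set(values)) == 1:
        return f"constant {values[0]}"
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32, 64):
        if lo >= 0 and hi < 2 ** bits:
            return f"Puint{bits}"        # fits an unsigned integer of this width
        if -(2 ** (bits - 1)) <= lo and hi < 2 ** (bits - 1):
            return f"Pint{bits}"         # fits a signed integer of this width
    return None                          # no fixed-size integer type fits

print(refine_ints([0, 0, 0]))
print(refine_ints([24, 56, 12, 33]))
print(refine_ints([-123, 44]))
```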
18 Testing and Evaluation
- Evaluated overall results qualitatively
  - Compared with Excel -- a manual process with limited facilities for representation of hierarchy or variation
  - Compared with hand-written descriptions -- performance variable depending on tokenization choices and complexity
- Evaluated accuracy quantitatively
  - Implemented infrastructure to use generated accumulator programs to determine inferred description error rates
- Evaluated performance quantitatively
  - Tokenization and rough structure inference perform well: less than 1 second on 300K
  - Dependency analysis can take a long time on complex formats (but can be cut down easily).
19 Benchmark Formats
20 Execution Times
SD: structure discovery; Ref: refinement; Tot: total; HW: hand-written
21 Training Time
22 Normalized MDL Scores
SD: structure discovery; Ref: refinement; HW: hand-written
23 Training Accuracy
24 Type Complexity and Min. Training Size
29 Technical Summary
- Format inference is feasible for many ASCII data formats
- Our current tools infer sufficient structure that descriptions may be piped into the PADS compiler and used to generate tools for XML conversion and simple statistical analysis.
[Diagram labels: Email, ASCII log files, Binary Traces; struct ...; CSV, XML]
30 Thanks and Acknowledgements
- Collaborators
  - Kenny Zhu (Princeton)
  - Peter White (Galois)
- Other contributors
  - Alex Aiken (Stanford)
  - David Blei (Princeton)
  - David Burke (Galois)
  - Vikas Kedia (Stanford)
  - John Launchbury (Galois)
  - Rob Schapire (Princeton)
31 Problem: Tokenization
- Technical problem
  - Different data sources assume different tokenization strategies
  - Useful token definitions sometimes overlap, can be ambiguous, and aren't always easily expressed using regular expressions
  - Matching the tokenization of the underlying data source can make a big difference in structure discovery.
- Current solution
  - Parameterize learning system with customizable configuration files
  - Automatically generate lexer file and basic token types
- Future solutions
  - Use existing PADS descriptions and data sources to learn probabilistic tokenizers
  - Incorporate probabilities into sophisticated back-end rewriting system
    - Back end has more context for making final decisions than the tokenizer, which reads 1 character at a time without look ahead
32 Scaling to Larger Data Sets (In progress)
- Original algorithm keeps entire data set in memory, so won't scale to large data sets.
- Proposed conceptual architecture to permit incremental learning
33
0, 24
foo, beg
bar, end
0, 56
baz, middle
0, 12
0, 33
34
[Diagram: structure discovery on the records 0, 24 / foo, beg / bar, end / 0, 56 / baz, middle / 0, 12 / 0, 33 yields a struct of two unions separated by ","; each union is either int or alpha.]
35
[Diagram: structure discovery yields a struct of two unions separated by ","; a tagging / table generation step then labels the unions (id1, id2) and their int and alpha branches (id3 to id6) and records each data record as a table row.]
id1  id2  id3  id4  id5  id6
1    1    0    --   --   24
2    2    --   foo  beg  --
...  ...  ...  ...  ...  ...
36
[Diagram: the same struct of two tagged unions and the same table of ids (id1 to id6), with a constraint inference step added.]
Constraint inference: id3 = 0, id1 = id2 (first union is int whenever second union is int)
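The constraint inference step can be pictured as a scan over the table for constant columns and for pairs of columns that always agree. In the sketch below, the first two rows follow the table on the slide; the other two are an assumption, built the same way from the records "bar, end" and "0, 56". The function names are hypothetical.

```python
from itertools import combinations

MISSING = None  # shown as "--" in the table

COLUMNS = ["id1", "id2", "id3", "id4", "id5", "id6"]
TABLE = [
    (1, 1, 0, MISSING, MISSING, 24),
    (2, 2, MISSING, "foo", "beg", MISSING),
    (2, 2, MISSING, "bar", "end", MISSING),   # assumed row
    (1, 1, 0, MISSING, MISSING, 56),          # assumed row
]

def constants(table, columns):
    """Columns whose present values are all identical."""
    found = {}
    for j, name in enumerate(columns):
        present = {row[j] for row in table if row[j] is not MISSING}
        if len(present) == 1:
            found[name] = present.pop()
    return found

def equalities(table, columns):
    """Pairs of columns that are present and equal in every row."""
    return [
        (columns[a], columns[b])
        for a, b in combinations(range(len(columns)), 2)
        if all(r[a] is not MISSING and r[a] == r[b] for r in table)
    ]

print(constants(TABLE, COLUMNS))
print(equalities(TABLE, COLUMNS))
```

On this table the scan reports id3 as the constant 0 and id1 as equal to id2, matching the two constraints listed above.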
37
[Diagram: the same struct of two tagged unions, table of ids, and inferred constraints (id3 = 0, id1 = id2), with a rule-based structure rewriting step added. The rewritten type is a union of two structs: (0 "," int) and (alpha-string "," alpha-string).]
More accurate: the first int is 0, and "int, alpha-string" records are ruled out.