From Dirt to Shovels: Inferring PADS descriptions from ASCII Data

1
From Dirt to Shovels: Inferring PADS descriptions from ASCII Data
Kathleen Fisher, David Walker, Peter White, Kenny Zhu

December 2008

2
Learning: Goals & Approach
[Figure: raw data (email, ASCII log files, binary traces) -> data description (struct ...) -> end-user tools (visual information, CSV, XML, standard formats & schema)]
Problem: Producing useful tools for ad hoc data takes a lot of time.
Solution: A learning system to generate data descriptions and tools automatically.
3
PADS Reminder
Inferred data formats are described using PADS types.

Base types describe atomic data:
Pint8, Puint8, ... // -123, 44
Pstring() // hello
Pstring_FW(3) // catdog
Pdate, Ptime, Pip, ...
Type constructors describe structured data:
Sequences: Pstruct, Parray, ...
Choices: Punion, Penum, Pswitch
Constraints: arbitrary predicates to describe expected properties.

4
Format inference overview
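[Figure: inference pipeline. Raw Data -> Chunking Process -> Tokenization -> Structure Discovery -> Format Refinement (guided by the Scoring Function) -> IR to PADS Printer -> PADS Description -> PADS Compiler -> generated tools: XMLifier (produces XML) and Accumulator (produces Analysis Report)]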
5
Chunking Process

Convert raw input into a sequence of chunks.
Supported divisions:
Various forms of newline
File boundaries
Also possible: user-defined paragraphs
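
A minimal sketch of the chunking step, assuming the input is already in memory as one string. The function names and the blank-line definition of a "paragraph" are illustrative assumptions, not part of the PADS tools.

```python
import re

def chunk_by_newline(text):
    """Split raw input into chunks at \\r\\n, \\n, or \\r."""
    chunks = re.split(r"\r\n|\n|\r", text)
    if chunks and chunks[-1] == "":
        chunks.pop()  # a trailing newline does not start a new chunk
    return chunks

def chunk_by_paragraph(text):
    """Assumed user-defined division: paragraphs separated by blank lines."""
    normalized = re.sub(r"\r\n|\r", "\n", text)
    parts = re.split(r"\n\s*\n", normalized)
    return [p.strip("\n") for p in parts if p.strip()]
```

Chunking by file boundary would simply treat each file's contents as one chunk.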

6
Tokenization

Tokens are expressed as regular expressions.
Basic tokens: integer, white space, punctuation, strings
Distinctive tokens: IP addresses, dates, times, MAC addresses, ...
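
A sketch of a regular-expression tokenizer in this spirit. The token names and patterns below are assumptions chosen for illustration; they are not the token set shipped with the learning system. Distinctive tokens are listed first so that an IP address, for example, is not split into integers and dots.

```python
import re

# Ordered token definitions: earlier entries win when several could match.
TOKEN_SPECS = [
    ("IP",     r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("Date",   r"\d{4}-\d{2}-\d{2}"),
    ("Time",   r"\d{2}:\d{2}:\d{2}"),
    ("Int",    r"-?\d+"),
    ("White",  r"[ \t]+"),
    ("String", r"[A-Za-z][A-Za-z0-9_]*"),
    ("Punct",  r"[^\sA-Za-z0-9]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))

def tokenize(chunk):
    """Return the (token name, matched text) pairs found in one chunk."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(chunk)]
```

For instance, `tokenize("10.0.0.1 GET 200")` yields an IP token, white space, a String, white space, and an Int.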

7
Histograms
8
Clustering
Group tokens with similar frequency distributions into clusters.
[Figure: token histograms grouped into Cluster 1, Cluster 2, and Cluster 3]
Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
Rank clusters by a metric that rewards high coverage and narrow distributions. Choose the cluster with the highest score.
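
The histogram and clustering steps can be sketched as follows. This is a simplified illustration: the greedy grouping, the tolerance value, and the exact form of `score` (coverage divided by width) are assumptions, since the slides only say that the metric rewards high coverage and narrow distributions.

```python
from collections import Counter

def histograms(token_chunks):
    """token_chunks: list of chunks, each a list of token names.
    Returns {token: Counter{occurrences per chunk: number of chunks}}."""
    names = {t for chunk in token_chunks for t in chunk}
    hists = {t: Counter() for t in names}
    for chunk in token_chunks:
        counts = Counter(chunk)
        for t in names:
            hists[t][counts[t]] += 1  # counts[t] is 0 when t is absent
    return hists

def shape(hist, total):
    """Column heights as fractions of all chunks, tallest first."""
    return sorted((n / total for n in hist.values()), reverse=True)

def similar(h1, h2, total, tol=0.1):
    """Same shape, within tol, once columns are sorted by height."""
    s1, s2 = shape(h1, total), shape(h2, total)
    width = max(len(s1), len(s2))
    s1 += [0.0] * (width - len(s1))
    s2 += [0.0] * (width - len(s2))
    return all(abs(a - b) <= tol for a, b in zip(s1, s2))

def cluster(hists, total, tol=0.1):
    """Greedy grouping: join the first cluster whose first member is similar."""
    clusters = []
    for t in sorted(hists):
        for c in clusters:
            if similar(hists[c[0]], hists[t], total, tol):
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters

def score(hist, total):
    """Assumed metric: coverage (share of chunks containing the token)
    divided by width (number of distinct nonzero counts)."""
    nonzero = {k: n for k, n in hist.items() if k > 0}
    if not nonzero:
        return 0.0
    return (sum(nonzero.values()) / total) / len(nonzero)
```

A token that appears exactly twice in every chunk has a single tall column and scores 1.0, while a token whose count varies from chunk to chunk scores lower.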
9
Partition chunks
In our example, all the tokens appear in the same
order in all chunks, so the union is degenerate.
10
Find subcontexts
Tokens in selected cluster: Quote(2), Comma, White
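
A sketch of partitioning chunks and extracting subcontexts once a cluster has been selected. The data layout (lists of `(name, text)` pairs) and the function names are assumptions made for illustration.

```python
def split_on_cluster(chunk, cluster):
    """chunk: list of (name, text) tokens; cluster: set of selected names.
    Returns the order of cluster tokens seen and the token runs that lie
    before, between, and after them (the subcontexts)."""
    order, contexts, current = [], [], []
    for name, text in chunk:
        if name in cluster:
            order.append(name)
            contexts.append(current)
            current = []
        else:
            current.append((name, text))
    contexts.append(current)
    return tuple(order), contexts

def partition(chunks, cluster):
    """Group chunks by the order of their cluster tokens (one union branch
    per distinct order) and collect subcontexts column by column."""
    branches = {}
    for chunk in chunks:
        order, contexts = split_on_cluster(chunk, cluster)
        columns = branches.setdefault(order, [[] for _ in contexts])
        for column, ctx in zip(columns, contexts):
            column.append(ctx)
    return branches
```

When every chunk has the cluster tokens in the same order, `partition` returns a single branch, which corresponds to the degenerate union. Structure discovery then recurses on each column of subcontexts.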
11
Then Recurse...
12
Inferred type
13
Finding arrays
Single cluster with high coverage, but wide
distribution.
14
Partitioning
Selected tokens for array cluster: String, Pipe
Context 1, 2: String Pipe
Context 3: String
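
A sketch of array partitioning, under the assumption that contexts 1 and 2 are the first and the middle elements (each ending with the separator) and context 3 is the last element. The function name and the choice of `Pipe` as default separator are illustrative.

```python
def split_array(chunk, separator="Pipe"):
    """Split a list of token names into (first, middle, last) contexts.
    First and middle elements end with the separator; the last does not."""
    elements, current = [], []
    for name in chunk:
        current.append(name)
        if name == separator:
            elements.append(current)
            current = []
    if not elements:
        return None  # separator never occurs: not an array candidate
    return elements[0], elements[1:], current
```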
15
Scoring

Goal: A quantitative metric to evaluate the quality of inferred descriptions and drive refinement.
Challenges:
Underfitting. Pstring(Peof) describes data, but is too general to be useful.
Overfitting. Type that exhaustively describes data (H, e, r, m, i, o, n, e, ...) is too precise to be useful.
Sweet spot: Reward compact descriptions that predict the data well.

16
Minimum Description Length (MDL)

Standard metric from machine learning.
Cost of transmitting the syntax of a description plus the cost of transmitting the data given the description:
cost(T, d) = complexity(T) + complexity(d | T)
Functions defined inductively over the structure of the type T and data d, respectively.
Normalized MDL gives compression factor.
Scoring function triggers rewriting rules.
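
A toy MDL score over a five-constructor type language, to show how the two terms trade off. The bit costs (32 bits per integer, 8 bits per character, a fixed tag cost per constructor) are assumptions for illustration and are not the cost model used by the actual scoring function.

```python
import math

TAG_BITS = math.log2(5)  # int, string, const, struct, union

def type_cost(t):
    """complexity(T): bits needed to transmit the description itself."""
    kind = t[0]
    if kind in ("int", "string"):
        return TAG_BITS
    if kind == "const":
        return TAG_BITS + 8 * len(t[1])
    # struct or union: tag plus the cost of every component type
    return TAG_BITS + sum(type_cost(c) for c in t[1])

def data_cost(t, d):
    """complexity(d | T): bits needed to transmit d given the description."""
    kind = t[0]
    if kind == "int":
        return 32
    if kind == "string":
        return 8 * (len(d) + 1)  # characters plus a terminator
    if kind == "const":
        return 0                 # fully predicted by the type
    if kind == "struct":
        return sum(data_cost(c, x) for c, x in zip(t[1], d))
    branch, value = d            # union: branch choice, then its payload
    return math.log2(len(t[1])) + data_cost(t[1][branch], value)

def cost(t, records):
    return type_cost(t) + sum(data_cost(t, d) for d in records)
```

With 100 records of the form `GET 200`, the underfitting type `("string",)` costs more than `("struct", [("const", "GET "), ("int",)])`, because the constant is paid for once in the description rather than once per record.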

17
Format Refinement

Format refinement rewrites descriptions to:
Optimize information-theoretic complexity
Simplify presentation
Merge adjacent structures and unions
Improve precision
Identify constant values
Introduce enumerations
Refine types
Find termination conditions for strings
Determine integer sizes (8 bit, 32 bit, etc.)
Identify array separators & terminators
Detect dependencies
Array lengths
Switched unions

18
Testing and Evaluation

Evaluated overall results qualitatively
Compared with Excel: a manual process with limited facilities for representation of hierarchy or variation
Compared with hand-written descriptions: performance variable depending on tokenization choices & complexity
Evaluated accuracy quantitatively
Implemented infrastructure to use generated accumulator programs to determine inferred description error rates
Evaluated performance quantitatively
Tokenization & rough structure inference perform well: less than 1 second on 300K
Dependency analysis can take a long time on complex formats (but can be cut down easily).

19
Benchmark Formats
20
Execution Times
SD: structure discovery; Ref: refinement; Tot: total; HW: hand-written
21
Training Time
22
Normalized MDL Scores
SD: structure discovery; Ref: refinement; HW: hand-written
23
Training Accuracy
24
Type Complexity and Min. Training Size
29
Technical Summary

Format inference is feasible for many ASCII data
formats
Our current tools infer sufficient structure that
descriptions may be piped into the PADS compiler
and used to generate tools for XML conversion and
simple statistical analysis.

[Figure: raw data (email, ASCII log files, binary traces) -> data description (struct ...) -> CSV, XML]
30
Thanks & Acknowledgements

Collaborators:
Kenny Zhu (Princeton)
Peter White (Galois)
Other contributors:
Alex Aiken (Stanford)
David Blei (Princeton)
David Burke (Galois)
Vikas Kedia (Stanford)
John Launchbury (Galois)
Rob Schapire (Princeton)

31
Problem: Tokenization

Technical problem:
Different data sources assume different tokenization strategies
Useful token definitions sometimes overlap, can be ambiguous, and aren't always easily expressed using regular expressions
Matching the tokenization of the underlying data source can make a big difference in structure discovery.
Current solution:
Parameterize learning system with customizable configuration files
Automatically generate lexer file & basic token types
Future solutions:
Use existing PADS descriptions and data sources to learn probabilistic tokenizers
Incorporate probabilities into sophisticated back-end rewriting system
Back end has more context for making final decisions than the tokenizer, which reads 1 character at a time without lookahead
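
To illustrate why the choice of token definitions matters, here is a sketch of a tokenizer built from an ordered list of definitions, as a configuration file might supply them. The `make_tokenizer` helper and both token sets are hypothetical.

```python
import re

def make_tokenizer(specs):
    """specs: ordered (name, regex) pairs; earlier entries take priority."""
    pattern = re.compile("|".join(f"(?P<{n}>{p})" for n, p in specs))
    return lambda text: [m.lastgroup for m in pattern.finditer(text)]

# Two configurations that disagree on how to read the same text.
basic = [("Int", r"\d+"), ("Punct", r"[^\w\s]")]
with_time = [("Time", r"\d{2}:\d{2}:\d{2}")] + basic

tokens_basic = make_tokenizer(basic)("12:30:45")
tokens_time = make_tokenizer(with_time)("12:30:45")
```

The first configuration sees five tokens (Int, Punct, Int, Punct, Int) and the second sees a single Time token, so structure discovery starts from very different inputs.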

32
Scaling to Larger Data Sets (In progress)

The original algorithm keeps the entire data set in memory, so it won't scale to large data sets.
Proposed a conceptual architecture to permit incremental learning.

33
0, 24
foo, beg
bar, end
0, 56
baz, middle
0, 12
0, 33
34
[Figure: structure discovery on the records "0, 24", "foo, beg", "bar, end", "0, 56", "baz, middle", "0, 12", "0, 33" yields a struct of two unions separated by ","; each union is either int or alpha.]
35
[Figure: the same records and inferred struct after structure discovery, followed by tagging / table generation. The two unions are tagged id1 and id2; the int and alpha leaves are tagged id3 to id6.]
id1  id2  id3  id4  id5  id6
1    1    0    --   --   24
2    2    --   foo  beg  --
...  ...  ...  ...  ...  ...
36
[Figure: the tagged struct and table from the previous slide, repeated.]
Constraint inference: id3 = 0 and id1 = id2 (first union is int whenever second union is int)
37
[Figure: the tagged struct, table, and inferred constraints from the previous slides, repeated.]
Constraint inference: id3 = 0 and id1 = id2 (first union is int whenever second union is int)
Rule-based structure rewriting: the struct of two unions becomes a union of two structs, one with leaves 0 and int and one with leaves alpha-string and alpha-string, each separated by ",".
More accurate: the first int is 0, which rules out "int, alpha-string" records.
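
The constraint inference step over the tagged table can be sketched as a search for constant columns and for pairs of columns that always agree. The table rows below follow the two rows shown on the slide, extended to the remaining example records; the function name and the dict encoding (with `None` for "--") are assumptions.

```python
from itertools import combinations

COLUMNS = ["id1", "id2", "id3", "id4", "id5", "id6"]

def row(*values):
    return dict(zip(COLUMNS, values))

# One row per record: 0, 24 / foo, beg / bar, end / 0, 56 / baz, middle / 0, 12 / 0, 33
ROWS = [
    row(1, 1, 0, None, None, 24),
    row(2, 2, None, "foo", "beg", None),
    row(2, 2, None, "bar", "end", None),
    row(1, 1, 0, None, None, 56),
    row(2, 2, None, "baz", "middle", None),
    row(1, 1, 0, None, None, 12),
    row(1, 1, 0, None, None, 33),
]

def infer_constraints(rows, columns):
    """Report columns with one observed value and column pairs that are
    present and equal in every row."""
    constraints = []
    for c in columns:
        present = {r[c] for r in rows if r[c] is not None}
        if len(present) == 1:
            constraints.append(f"{c} = {present.pop()}")
    for a, b in combinations(columns, 2):
        if all(r[a] is not None and r[a] == r[b] for r in rows):
            constraints.append(f"{a} = {b}")
    return constraints
```

On these rows the sketch reports `id3 = 0` and `id1 = id2`, the two facts that the rewriting rules then use to turn the struct of unions into a union of structs.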
