From Dirt to Shovels: Inferring PADS descriptions from ASCII Data

1
From Dirt to Shovels: Inferring PADS descriptions
from ASCII Data
Kathleen Fisher, David Walker, Peter White, Kenny Zhu
  • December 2008

2
Learning: Goals & Approach
[Figure labels: Raw Data (Email, ASCII log files, Binary Traces); Data Description (struct ...); End-user tools (Visual Information, CSV, XML, Standard formats & schema)]
Problem: Producing useful tools for ad hoc data
takes a lot of time.
Solution: A learning system to generate data
descriptions and tools automatically.
3
PADS Reminder
Inferred data formats are described using PADS
types.
  • Base types describe atomic data:
    • Pint8, Puint8, ... // -123, 44
    • Pstring() // hello
    • Pstring_FW(3) // catdog
    • Pdate, Ptime, Pip, ...
  • Type constructors describe structured data:
    • sequences: Pstruct, Parray, ...
    • choices: Punion, Penum, Pswitch
    • constraints: arbitrary predicates to describe
      expected properties.
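
To make the vocabulary above concrete, here is a minimal Python sketch of a type representation that mirrors it. The classes `Base`, `Const`, `Struct`, `Union_`, `Array` and the `show` printer are hypothetical names invented for illustration; they are not the PADS API, and the printed notation only imitates PADS syntax.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical mini-representation of PADS-like types (illustration only).

@dataclass(frozen=True)
class Base:
    name: str                      # atomic data, e.g. "Pint8", "Pstring", "Pip"

@dataclass(frozen=True)
class Const:
    text: str                      # literal text such as "," or " "

@dataclass(frozen=True)
class Struct:
    fields: Tuple["Ty", ...]       # sequence: fields appear in this order

@dataclass(frozen=True)
class Union_:
    branches: Tuple["Ty", ...]     # choice: exactly one branch matches

@dataclass(frozen=True)
class Array:
    element: "Ty"                  # repeated element type
    separator: str                 # literal text between elements

Ty = Union[Base, Const, Struct, Union_, Array]

def show(t: Ty) -> str:
    """Render a type in a PADS-flavoured (unofficial) notation."""
    if isinstance(t, Base):
        return t.name
    if isinstance(t, Const):
        return repr(t.text)
    if isinstance(t, Struct):
        return "Pstruct { " + "; ".join(show(f) for f in t.fields) + " }"
    if isinstance(t, Union_):
        return "Punion { " + " | ".join(show(b) for b in t.branches) + " }"
    return "Parray(" + show(t.element) + ", sep=" + repr(t.separator) + ")"

# An integer, a comma, then a pipe-separated list of strings.
record = Struct((Base("Pint8"), Const(","), Array(Base("Pstring"), "|")))
print(show(record))
```

Keeping sequences, choices, and arrays as separate node kinds is what lets later stages (scoring, refinement) be defined by recursion over the type.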

4
Format inference overview
[Figure: inference pipeline. Raw Data -> Chunking Process -> Tokenization -> Structure Discovery -> Format Refinement (guided by the Scoring Function) -> IR to PADS Printer -> PADS Description -> PADS Compiler -> generated tools: Accumulator (Analysis Report) and XMLifier (XML)]
5
Chunking Process
  • Convert raw input into a sequence of chunks.
  • Supported divisions:
    • Various forms of newline
    • File boundaries
  • Also possible: user-defined paragraphs
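
The chunking step can be sketched in a few lines of Python. The function names are hypothetical, and two choices here are assumptions rather than documented behaviour of the tool: empty chunks are dropped, and a "paragraph" is taken to mean text separated by blank lines.

```python
import re
from typing import List

def chunk_by_newline(raw: str) -> List[str]:
    """Split at \\r\\n, \\n, or \\r; empty chunks are dropped (an assumption)."""
    return [c for c in re.split(r"\r\n|\n|\r", raw) if c]

def chunk_by_paragraph(raw: str) -> List[str]:
    """Alternative division: paragraphs separated by one or more blank lines."""
    parts = re.split(r"(?:\r\n|\n|\r){2,}", raw)
    return [p.strip() for p in parts if p.strip()]

raw = "0, 24\r\nfoo, beg\nbar, end\r0, 56\n"
chunks = chunk_by_newline(raw)
print(chunks)
```

Listing `\r\n` first in the alternation matters: otherwise a Windows line ending would be split into two separators with an empty chunk between them.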

6
Tokenization
  • Tokens expressed as regular expressions.
  • Basic tokens:
    • Integer, white space, punctuation, strings
  • Distinctive tokens:
    • IP addresses, dates, times, MAC addresses, ...
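
A regular-expression tokenizer of this kind might look as follows. The token names and patterns in `TOKEN_SPECS` are illustrative assumptions, not the definitions used by the actual system. Distinctive tokens are listed before basic ones so that, for example, an IP address is not broken into integers and dots.

```python
import re
from typing import List, Tuple

# Hypothetical token table; earlier entries win when several could match.
TOKEN_SPECS = [
    ("IP",     r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("DATE",   r"\d{4}-\d{2}-\d{2}"),
    ("TIME",   r"\d{2}:\d{2}:\d{2}"),
    ("INT",    r"-?\d+"),
    ("STRING", r"[A-Za-z][A-Za-z0-9_]*"),
    ("WHITE",  r"[ \t]+"),
    ("PUNCT",  r"[^\sA-Za-z0-9]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))

def tokenize(chunk: str) -> List[Tuple[str, str]]:
    """Return (kind, text) pairs covering the whole chunk, left to right."""
    tokens = []
    pos = 0
    while pos < len(chunk):
        m = MASTER.match(chunk, pos)
        if m is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

kinds = [kind for kind, _ in tokenize("10.0.0.1 - 2008-12-01 GET")]
print(kinds)
```

Because the alternatives are tried in order, reordering `TOKEN_SPECS` changes the token stream, which is one reason tokenization choices affect the structure that is later discovered.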

7
Histograms
8
Clustering
Group tokens with similar frequency distributions
into clusters.
[Figure: histograms grouped into Cluster 1, Cluster 2, Cluster 3]
Two frequency distributions are similar if they
have the same shape (within some error tolerance)
when the columns are sorted by height.
Rank clusters by a metric that rewards high
coverage and narrow distributions. Choose the
cluster with the highest score.
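
The histogram and clustering steps can be sketched as below. This is a simplified illustration under stated assumptions: chunks are given as lists of token kinds, similarity is a per-column comparison with a hypothetical tolerance `tol`, clustering is greedy, and the score is coverage divided by histogram width. The real system's similarity measure and scoring formula may differ.

```python
from collections import Counter
from typing import Dict, List, Sequence

Chunk = Sequence[str]  # a chunk is represented by its sequence of token kinds

def histogram(chunks: Sequence[Chunk], token: str) -> Dict[int, int]:
    """Map k -> number of chunks in which `token` occurs exactly k times."""
    return dict(Counter(chunk.count(token) for chunk in chunks))

def shape(hist: Dict[int, int], n_chunks: int) -> List[float]:
    """Column heights as fractions of all chunks, sorted tallest first."""
    return sorted((n / n_chunks for n in hist.values()), reverse=True)

def similar(a: List[float], b: List[float], tol: float = 0.1) -> bool:
    """Same shape within `tol`, after padding the shorter one with zeros."""
    width = max(len(a), len(b))
    pa = a + [0.0] * (width - len(a))
    pb = b + [0.0] * (width - len(b))
    return all(abs(x - y) <= tol for x, y in zip(pa, pb))

def cluster_tokens(chunks: Sequence[Chunk], tol: float = 0.1) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose first member is similar."""
    n = len(chunks)
    tokens = sorted({t for c in chunks for t in c})
    shapes = {t: shape(histogram(chunks, t), n) for t in tokens}
    clusters: List[List[str]] = []
    for t in tokens:
        for cl in clusters:
            if similar(shapes[cl[0]], shapes[t], tol):
                cl.append(t)
                break
        else:
            clusters.append([t])
    return clusters

def score(chunks: Sequence[Chunk], cluster: List[str]) -> float:
    """Reward high coverage and narrow histograms (hypothetical formula)."""
    n = len(chunks)
    coverage = min(sum(1 for c in chunks if t in c) / n for t in cluster)
    width = max(len(histogram(chunks, t)) for t in cluster)
    return coverage / width

def best_cluster(chunks: Sequence[Chunk], tol: float = 0.1) -> List[str]:
    return max(cluster_tokens(chunks, tol), key=lambda cl: score(chunks, cl))

chunks = [
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "STRING"],
    ["QUOTE", "STRING", "STRING", "QUOTE", "COMMA", "WHITE", "STRING"],
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "STRING", "STRING", "STRING"],
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "STRING"],
]
print(best_cluster(chunks))
```

In this toy input QUOTE, COMMA, and WHITE each occur a fixed number of times per chunk, so their histograms are single columns and they cluster together; STRING varies from chunk to chunk and ends up alone with a lower score.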
9
Partition chunks
In our example, all the tokens appear in the same
order in all chunks, so the union is degenerate.
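
A possible way to express the partitioning step: group chunks by the order in which the selected cluster's tokens appear. One group suggests a struct; several groups suggest a union with one branch per ordering. The function names and the string labels returned here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def partition_by_order(
    chunks: Sequence[Sequence[str]], cluster: Sequence[str]
) -> Dict[Tuple[str, ...], List[Sequence[str]]]:
    """Group chunks by the sequence of selected token kinds they contain."""
    selected = set(cluster)
    groups: Dict[Tuple[str, ...], List[Sequence[str]]] = defaultdict(list)
    for chunk in chunks:
        order = tuple(t for t in chunk if t in selected)
        groups[order].append(chunk)
    return dict(groups)

def top_level_shape(groups: Dict[Tuple[str, ...], list]) -> str:
    """A single ordering means the union is degenerate, i.e. a plain struct."""
    return "struct" if len(groups) == 1 else "union of structs"

cluster = ["QUOTE", "COMMA", "WHITE"]
same_order = [
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "STRING"],
    ["QUOTE", "STRING", "STRING", "QUOTE", "COMMA", "WHITE", "STRING"],
]
mixed = same_order + [["STRING", "COMMA", "WHITE", "STRING"]]
print(top_level_shape(partition_by_order(same_order, cluster)))
print(top_level_shape(partition_by_order(mixed, cluster)))
```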
10
Find subcontexts
Tokens in selected cluster: Quote(2), Comma, White
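
One way to picture subcontext discovery: the selected tokens act as cut points, and whatever lies between them in each chunk forms a column that is analysed recursively. The sketch below is a simplified illustration with hypothetical function names; it assumes every chunk contains the selected tokens in the same order, so that all rows have the same number of gaps.

```python
from typing import List, Sequence

def split_contexts(chunk: Sequence[str], cluster: Sequence[str]) -> List[List[str]]:
    """Cut a chunk at every selected token; return the gaps between cuts."""
    selected = set(cluster)
    contexts: List[List[str]] = [[]]
    for tok in chunk:
        if tok in selected:
            contexts.append([])        # a selected token starts a new gap
        else:
            contexts[-1].append(tok)
    return contexts

def subcontexts(chunks: Sequence[Sequence[str]], cluster: Sequence[str]) -> List[List[List[str]]]:
    """Column i holds the i-th gap of every chunk; each column is recursed on."""
    rows = [split_contexts(c, cluster) for c in chunks]
    return [list(col) for col in zip(*rows)]  # transpose rows into columns

cluster = ["QUOTE", "COMMA", "WHITE"]
chunks = [
    ["QUOTE", "STRING", "QUOTE", "COMMA", "WHITE", "STRING"],
    ["QUOTE", "STRING", "STRING", "QUOTE", "COMMA", "WHITE", "STRING"],
]
columns = subcontexts(chunks, cluster)
print(columns)
```

With four selected tokens per chunk (two quotes, a comma, a white space) there are five gaps; the empty ones contribute nothing, and the non-empty ones are the subcontexts on which inference recurses.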
11
Then Recurse...
12
Inferred type
13
Finding arrays
Single cluster with high coverage, but wide
distribution.
14
Partitioning
Selected tokens for array cluster: String, Pipe
Context 1, 2: String Pipe
Context 3: String
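
The array partitioning above can be read as splitting a chunk into a first element, the middle elements, and a final element that lacks the trailing separator. That reading of "Context 1, 2, 3" is an assumption, and `split_array` is a hypothetical helper, sketched here on token kinds only.

```python
from typing import List, Sequence, Tuple

def split_array(
    tokens: Sequence[str], separator: str = "PIPE"
) -> Tuple[List[str], List[List[str]], List[str]]:
    """Return (first element, middle elements, last element).

    Every element except the last ends with the separator token.
    """
    elements: List[List[str]] = [[]]
    for tok in tokens:
        elements[-1].append(tok)
        if tok == separator:
            elements.append([])       # separator closes the current element
    first, *rest = elements
    *middle, last = rest if rest else [[]]
    return first, middle, last

# Token kinds for a chunk such as "a|b|c".
first, middle, last = split_array(["STRING", "PIPE", "STRING", "PIPE", "STRING"])
print(first, middle, last)
```

The first and middle contexts both have the shape String Pipe, and the last has the shape String, which matches the three contexts listed on the slide.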
15
Scoring
  • Goal: A quantitative metric to evaluate the
    quality of inferred descriptions and drive
    refinement.
  • Challenges:
    • Underfitting. Pstring(Peof) describes the data, but
      is too general to be useful.
    • Overfitting. A type that exhaustively describes the
      data ('H', 'e', 'r', 'm', 'i', 'o', 'n', 'e', ...)
      is too precise to be useful.
  • Sweet spot: Reward compact descriptions that
    predict the data well.

16
Minimum Description Length (MDL)
  • Standard metric from machine learning.
  • Cost of transmitting the syntax of a description
    plus the cost of transmitting the data given the
    description:
    cost(T, d) = complexity(T) + complexity(d | T)
  • Functions defined inductively over the structure
    of the type T and data d respectively.
  • Normalized MDL gives compression factor.
  • Scoring function triggers rewriting rules.
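
As a rough illustration of the two-part cost, the sketch below compares two candidate descriptions of a column of strings: a generic string type and an enumeration of the distinct values. The bit costs (`BITS_PER_CHAR`, `NODE_BITS`) and the functions themselves are assumptions chosen for simplicity, not the formulas used by the actual scoring function. The data is assumed to be non-empty.

```python
import math
from typing import List, Tuple

BITS_PER_CHAR = 8   # assumption: one byte per ASCII character
NODE_BITS = 8       # assumption: flat cost for naming one type constructor

def cost_as_string(data: List[str]) -> Tuple[int, float]:
    """(complexity(T), complexity(d | T)) for a generic terminated string."""
    type_bits = NODE_BITS
    data_bits = sum((len(s) + 1) * BITS_PER_CHAR for s in data)  # +1: terminator
    return type_bits, data_bits

def cost_as_enum(data: List[str]) -> Tuple[int, float]:
    """(complexity(T), complexity(d | T)) for an enumeration of all values."""
    values = sorted(set(data))
    # The type must spell out every value once.
    type_bits = NODE_BITS + sum((len(v) + 1) * BITS_PER_CHAR for v in values)
    # Each record then only needs to name which value it is.
    data_bits = len(data) * math.log2(len(values)) if len(values) > 1 else 0.0
    return type_bits, data_bits

def normalized_mdl(type_bits: float, data_bits: float, data: List[str]) -> float:
    """Total cost divided by the cost of sending the raw data unstructured."""
    raw_bits = sum((len(s) + 1) * BITS_PER_CHAR for s in data)
    return (type_bits + data_bits) / raw_bits

repetitive = ["GET", "POST"] * 50
unique = [f"user{i}" for i in range(100)]
for name, data in (("repetitive", repetitive), ("unique", unique)):
    s, e = sum(cost_as_string(data)), sum(cost_as_enum(data))
    print(name, "string:", s, "enum:", round(e, 1))
```

For the repetitive column the enumeration is far cheaper overall, while for the all-distinct column the enumeration pays for every value in the type and again per record, so the generic string wins. This is the trade-off between overfitting and underfitting that the metric is meant to balance.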

17
Format Refinement
  • Format refinement rewrites descriptions to:
  • Optimize information-theoretic complexity
  • Simplify presentation
  • Merge adjacent structures and unions
  • Improve precision
  • Identify constant values
  • Introduce enumerations
  • Refine types
  • Find termination conditions for strings
  • Determine integer sizes (8-bit, 32-bit, etc.)
  • Identify array separators & terminators
  • Detect dependencies
  • Array lengths
  • Switched unions
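
Three of the rewrites listed above (constants, enumerations, integer sizes) can be illustrated on a single column of values. This is a hedged sketch: `refine_column` and the `max_enum` threshold are hypothetical, and the type names beyond `Pint8` and `Puint8` simply follow the naming pattern shown earlier in the deck.

```python
import re
from typing import List

def is_int(text: str) -> bool:
    return re.fullmatch(r"-?\d+", text) is not None

def refine_column(values: List[str], max_enum: int = 4) -> str:
    """Pick a more precise description for one column of observed values."""
    distinct = sorted(set(values))
    if len(distinct) == 1:
        return f"const {distinct[0]!r}"            # identify constant values
    if all(is_int(v) for v in values):
        nums = [int(v) for v in values]
        lo, hi = min(nums), max(nums)
        for bits in (8, 16, 32, 64):               # determine integer sizes
            if lo >= 0 and hi < 2 ** bits:
                return f"Puint{bits}"
            if lo < 0 and lo >= -(2 ** (bits - 1)) and hi < 2 ** (bits - 1):
                return f"Pint{bits}"
        return "Pstring"                           # too large for 64 bits
    if len(distinct) <= max_enum:
        return "Penum { " + ", ".join(distinct) + " }"   # introduce enumerations
    return "Pstring"

print(refine_column(["0", "0", "0"]))
print(refine_column(["24", "56", "12", "33"]))
print(refine_column(["beg", "end", "middle", "beg"]))
```

Each rule makes the description more precise; whether a rewrite is kept would, in the approach described here, depend on whether it lowers the overall score.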

18
Testing and Evaluation
  • Evaluated overall results qualitatively
    • Compared with Excel: a manual process with
      limited facilities for representing
      hierarchy or variation
    • Compared with hand-written descriptions:
      performance varies depending on tokenization
      choices & complexity
  • Evaluated accuracy quantitatively
    • Implemented infrastructure to use generated
      accumulator programs to determine inferred
      description error rates
  • Evaluated performance quantitatively
    • Tokenization & rough structure inference perform
      well: less than 1 second on 300K
    • Dependency analysis can take a long time on
      complex formats (but can be cut down easily).

19
Benchmark Formats
20
Execution Times
SD = structure discovery, Ref = refinement,
Tot = total, HW = hand-written
21
Training Time
22
Normalized MDL Scores
SD = structure discovery, Ref = refinement,
HW = hand-written
23
Training Accuracy
24
Type Complexity and Min. Training Size
29
Technical Summary
  • Format inference is feasible for many ASCII data
    formats
  • Our current tools infer sufficient structure that
    descriptions may be piped into the PADS compiler
    and used to generate tools for XML conversion and
    simple statistical analysis.

[Figure labels: Email, ASCII log files, Binary Traces; struct ...; CSV, XML]
30
Thanks & Acknowledgements
  • Collaborators
    • Kenny Zhu (Princeton)
    • Peter White (Galois)
  • Other contributors
    • Alex Aiken (Stanford)
    • David Blei (Princeton)
    • David Burke (Galois)
    • Vikas Kedia (Stanford)
    • John Launchbury (Galois)
    • Rob Schapire (Princeton)

31
Problem: Tokenization
  • Technical problem
    • Different data sources assume different
      tokenization strategies
    • Useful token definitions sometimes overlap, can
      be ambiguous, and aren't always easily expressed
      using regular expressions
    • Matching the tokenization of the underlying data
      source can make a big difference in structure
      discovery.
  • Current solution
    • Parameterize learning system with customizable
      configuration files
    • Automatically generate lexer file & basic token
      types
  • Future solutions
    • Use existing PADS descriptions and data sources
      to learn probabilistic tokenizers
    • Incorporate probabilities into sophisticated
      back-end rewriting system
    • Back end has more context for making final
      decisions than the tokenizer, which reads 1
      character at a time without lookahead

32
Scaling to Larger Data Sets (In progress)
  • Original algorithm keeps the entire data set in
    memory, so it won't scale to large data sets.
  • Proposed conceptual architecture to permit
    incremental learning

33
Example data (one record per line):
0, 24
foo, beg
bar, end
0, 56
baz, middle
0, 12
0, 33
34
[Figure: structure discovery]
Example data: 0, 24 / foo, beg / bar, end / 0, 56 / baz, middle / 0, 12 / 0, 33
Inferred type:
struct { union { int; alpha }  ","  union { int; alpha } }
35
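[Figure: structure discovery, then tagging / table generation]
Example data: 0, 24 / foo, beg / bar, end / 0, 56 / baz, middle / 0, 12 / 0, 33
Inferred type:
struct { union { int; alpha }  ","  union { int; alpha } }
Tagged type:
struct { union (id1) { int (id3); alpha (id4) }  ","  union (id2) { int (id5); alpha (id6) } }
Table (one row per record; -- means no value):
id1  id2  id3  id4  id5  id6
1    1    0    --   --   24
2    2    --   foo  beg  --
...  ...  ...  ...  ...  ...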
36
[Figure: same example data, tagged type, and table as on the previous slide, followed by constraint inference]
Constraint inference:
id3 = 0, id1 = id2 (first union is int
whenever second union is int)
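
A simplified sketch of how such constraints could be found from the table: look for columns that always hold the same value, and for pairs of columns that always agree. The helper names are hypothetical, and only the columns id1 to id4 are built here (the value columns of the second union are omitted for brevity). The records are the example data from the preceding slides.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]

def build_table(records: List[str]) -> List[Row]:
    """Tag each 'left, right' record with union branches and first-field values."""
    rows: List[Row] = []
    for rec in records:
        left, right = [part.strip() for part in rec.split(",")]
        rows.append({
            "id1": "1" if left.isdigit() else "2",     # branch of first union
            "id2": "1" if right.isdigit() else "2",    # branch of second union
            "id3": left if left.isdigit() else None,   # value if first is int
            "id4": None if left.isdigit() else left,   # value if first is alpha
        })
    return rows

def constant_columns(rows: List[Row]) -> Dict[str, str]:
    """Columns that hold one single value wherever they are defined."""
    found: Dict[str, str] = {}
    for col in rows[0]:
        seen = {r[col] for r in rows if r[col] is not None}
        if len(seen) == 1:
            found[col] = seen.pop()
    return found

def equal_columns(rows: List[Row]) -> List[Tuple[str, str]]:
    """Pairs of columns that are defined and equal in every row."""
    cols = list(rows[0])
    return [(a, b) for a, b in combinations(cols, 2)
            if all(r[a] is not None and r[a] == r[b] for r in rows)]

records = ["0, 24", "foo, beg", "bar, end", "0, 56", "baz, middle", "0, 12", "0, 33"]
table = build_table(records)
print(constant_columns(table))
print(equal_columns(table))
```

On this data the sketch reports that id3 is always 0 and that id1 always equals id2, the same two facts stated on the slide.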
37
[Figure: same example data, tagged type, and table as on the previous slides, followed by constraint inference and rule-based structure rewriting]
Constraint inference:
id3 = 0, id1 = id2 (first union is int
whenever second union is int)
Rule-based structure rewriting yields:
union { struct { 0  ","  int }; struct { alpha-string  ","  alpha-string } }
More accurate: the first int is always 0, and
"int, alpha-string" records are ruled out.