PadsML: A Functional Data Description Language

1
PADS/ML: A Functional Data Description Language
  • David Walker
  • Princeton University
  • with Yitzhak Mandelbaum (Princeton),
  • Kathleen Fisher and Mary Fernandez (AT&T)

2
Data, data everywhere!
  • Incredible amounts of data stored in well-behaved
    formats (databases, XML)
  • Tools
  • Schema
  • Browsers
  • Query languages
  • Standards
  • Libraries
  • Books, documentation
  • Conversion tools
  • Vendor support
  • Consultants

3
Ad hoc Data
  • Vast amounts of data in ad hoc formats.
  • Ad hoc data is semi-structured
  • Not free text.
  • Not as rigid as data in relational databases.
  • Examples from many different areas
  • Physics
  • Computer system maintenance and administration
  • Biology
  • Finance
  • Government
  • Healthcare
  • More!

4
Ad Hoc Data in Biology
format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824,PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
www.geneontology.org
5
Ad Hoc Data in Chemistry
OC(C@@H2OC(C)O)C@@3(C)C@(C@(CO4) (OC(C)
O)C@H4CC@@H3O)(H)C@H (OC(C7CCCCC7)O)C
@@1(O)C@@(C)(C)C2C(C) C@@H(OC(C@H(O)C@@H
(NC(C6CCCCC6)O) C5CCCCC5)O)C1
6
Ad Hoc Data in Finance
HA00000000START OF TEST CYCLE aA00000001BXYZ
U1AB0000040000100B0000004200 HL00000002START OF
OPEN INTEREST d 00000003FZYX G1AB0000030000300000
HM00000004END OF OPEN INTEREST HE00000005START OF
SUMMARY f 00000006NYZX B1QB00052000120000070000B00
0050000000520000 0049000000510000000100B0000000
5300000052500000535000 HF00000007END OF SUMMARY
www.opradata.com
7
Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -
8
Ad Hoc Data: DNS Packets
00000000  9192 d8fb 8480 0001 05d8 0000 0000 0872  ...............r
00000010  6573 6561 7263 6803 6174 7403 636f 6d00  esearch.att.com.
00000020  00fc 0001 c00c 0006 0001 0000 0e10 0027  ...............'
00000030  036e 7331 c00c 0a68 6f73 746d 6173 7465  .ns1...hostmaste
00000040  72c0 0c77 64e5 4900 000e 1000 0003 8400  r..wd.I.........
00000050  36ee 8000 000e 10c0 0c00 0f00 0100 000e  6...............
00000060  1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00  ......linux.....
00000070  0f00 0100 000e 1000 0c00 0a07 6d61 696c  ............mail
00000080  6d61 6ec0 0cc0 0c00 0100 0100 000e 1000  man.............
00000090  0487 cf1a 16c0 0c00 0200 0100 000e 1000  ................
000000a0  0603 6e73 30c0 0cc0 0c00 0200 0100 000e  ..ns0...........
000000b0  1000 02c0 2e03 5f67 63c0 0c00 2100 0100  ......_gc...!...
000000c0  0002 5800 1d00 0000 640c c404 7068 7973  ..X.....d...phys
000000d0  0872 6573 6561 7263 6803 6174 7403 636f  .research.att.co
9
Properties of Ad hoc Data
  • Data arrives as is: you don't choose the
    format
  • Documentation is often out-of-date or
    nonexistent.
  • Hijacked fields.
  • Undocumented missing value representations.
  • Data is buggy.
  • Missing data, extra data,
  • Human error, malfunctioning machines, software
    bugs (e.g. race conditions on log entries),
  • Errors are sometimes the most interesting portion
    of the data.
  • Data sources often have high volume.
  • Data might not fit into main memory.
  • Data can be created by malicious sources
    attempting to exploit software vulnerabilities
  • cf. the Ethereal network monitoring system

10
The Goal(s)
  • What can we do about ad hoc data?
  • how do we read it into programs?
  • how do we detect errors?
  • how do we correct errors?
  • how do we query it?
  • how do we discover its structure and properties?
  • how do we view it?
  • how do we transform it into standard formats like
    CSV, XML?
  • how do we merge multiple data sources?
  • In short: how do we do all the things we take
    for granted when dealing with standard formats,
    in a fault-tolerant and efficient, yet nearly
    effortless way?

11
Enter Pads
  • PADS: a system for Processing Ad hoc Data Sources
  • Three main components
  • a data description language
  • for concise and precise specifications of ad hoc
    data formats and semantic properties
  • a compiler that automatically generates a suite
    of programming libraries and end-to-end
    applications
  • a visual interface to support both novice and
    expert users

12
One Description, Many Tools
[Diagram: a data description (type T) is fed to the compiler, which generates a parser, printer, query engine, visual data browser, XML translator, and more, delivered as a programming library or a complete application.]
13
Some Advantages Over Ad Hoc Methods
  • Big bang for buck: 1 description, many tools
  • Descriptions document data sources
  • the documentation IS the tool generator, so
    documentation is automatically kept up-to-date
    with implementation
  • Descriptions are easy to write, easy to
    understand.
  • descriptions are high-level and declarative
  • description syntax exploits programmer intuition
    concerning types
  • Tools are robust
  • Error handling code is generated automatically
    and doesn't clutter documentation.
  • Descriptions and generated tools can be analyzed
    and reasoned about
  • e.g. data size, tool termination and safety
    properties, coherence of generated parsers and
    printers

14
The PADS Project
  • PADS/C [PLDI 05, POPL 06]
  • Based on C type structure.
  • Generates C libraries.
  • too bad C doesn't actually support libraries ...
  • LaunchPADS visual interface [Daly et al., SIGMOD
    06]
  • PADS/ML (Mandelbaum's thesis)
  • Based on the ML type structure.
  • polymorphic, dependent datatypes
  • Generates ML modules.
  • better reuse and library structure
  • functional data processing: far greater
    programmer productivity
  • New framework for tool development.
  • Format-independent algorithms architected using
    functors vs. macros
  • Implementation status.
  • Version 1.0 up and running
  • Many more exciting things to do
  • Describe real formats
  • Newick tree-structured data
  • Reglens galaxy catalogues
  • Palm PDA databases
  • AT&T call-detail data
  • AT&T billing data
  • Web server logs
  • Gene ontologies
  • DNS packets
  • OPRA data
  • More

15
Outline
  • Motivation and PADS Overview
  • Data Description in PADS/ML
  • Implementation architecture
  • The Semantics of PADS
  • Conclusions

16
Base Types and Records
  • Base types: C (e).
  • Describe atomic portions of data.
  • Parameterized by host-language expression.
  • Examples:
  • Pint, Pchar, Pstring_FW(n), Pstring(c).
  • Tuples and records: t * t' and {x : t; y : t'}.
  • Record fields are dependent: field names can be
    referenced by types of later fields.
  • Example to follow.

17
Base Types and Records
Movie-director Bowling Score (MBS) Format
122JoeWright459579
n/aEdWood104731
124ChrisNolan809385
125TimBurton308271
126GeorgeLucas326240
[Figure: the record for Tim Burton (id 125) annotated with the base types Pint, Pstring() and Pchar.]
18
Base Types and Records
Bookshelf Listing (BL) Format
13C Programming31Types and Programming
Languages20Twenty Years of PLDI36Modern Compiler
Implementation in ML 27Elements o f ML Programming
[Figure: the first entry, "13C Programming", annotated as width : Pint (here 13) followed by title : Pstring_FW(width) (here "C Programming").]
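The dependency between width and title in the Bookshelf Listing format can be sketched in Python as follows. This is a hand-written illustration, not code generated by PADS/ML; the function names are hypothetical, and the input is the first three entries of the sample data shown on this slide.

```python
DIGITS = "0123456789"


def parse_pint(data, pos):
    """Parse an unsigned decimal integer at pos; return (value, new_pos)."""
    end = pos
    while end < len(data) and data[end] in DIGITS:
        end += 1
    if end == pos:
        raise ValueError(f"expected a digit at offset {pos}")
    return int(data[pos:end]), end


def parse_pstring_fw(data, pos, width):
    """Parse a fixed-width string of exactly `width` characters."""
    end = pos + width
    if end > len(data):
        raise ValueError(f"expected {width} characters at offset {pos}")
    return data[pos:end], end


def parse_bl_entry(data, pos):
    """Parse one entry: a width field, then a title whose length depends on it."""
    width, pos = parse_pint(data, pos)
    title, pos = parse_pstring_fw(data, pos, width)
    return {"width": width, "title": title}, pos


def parse_bl(data):
    """Parse entries back to back until the input is exhausted."""
    entries, pos = [], 0
    while pos < len(data):
        entry, pos = parse_bl_entry(data, pos)
        entries.append(entry)
    return entries


shelf = parse_bl(
    "13C Programming31Types and Programming Languages20Twenty Years of PLDI"
)
titles = [entry["title"] for entry in shelf]
print(titles)
```

The point of the sketch is that the parser for title cannot be chosen until width has been read, which is what "record fields are dependent" means.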
19
Constraints
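  • Constrained types: [x : t | e].
  • Enforce the constraint e on the underlying type t.

[Figure: the MBS record for Tim Burton (id 125, scores 30, 82, 71) annotated with types; the separator characters are described by a constrained Pchar.]
ptype Scores = { min : Pint;
                 max : [m : Pint | min <= m];
                 avg : [a : Pint | min <= a && a <= max] }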
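A minimal Python sketch of how constraints on the Scores type might be checked. It assumes, purely for illustration, that the three scores are separated by "|" and that the comparisons are non-strict; the function name and the error messages are hypothetical. Constraint violations are recorded in a parse descriptor (here a plain list) next to the parsed data instead of aborting the parse.

```python
def parse_scores(text, sep="|"):
    """Return (rep, pd): the parsed scores and a list of constraint violations.

    Syntactic errors (wrong field count, non-integer field) raise ValueError;
    semantic errors are reported in pd and the data is still returned.
    """
    lo, hi, avg = (int(field) for field in text.split(sep))
    rep = {"min": lo, "max": hi, "avg": avg}
    pd = []
    if not lo <= hi:
        pd.append("max: constraint min <= max violated")
    if not lo <= avg <= hi:
        pd.append("avg: constraint min <= avg <= max violated")
    return rep, pd


good = parse_scores("30|82|71")
bad = parse_scores("30|82|90")
print(good)
print(bad)
```

Keeping the offending record together with its error list reflects the earlier remark that errors are sometimes the most interesting portion of the data.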
20
Datatypes
  • Describe alternatives in data source with
    datatypes.
  • Parser tries each alternative in order.

n/aEdWood104731 124ChrisNolan809385
122JoeWright459579 n/aEdWood104731 124Chri
sNolan809385 125TimBurton308271 126George
Lucas326240
pdatatype Id = None of "n/a"
             | Some of Pint
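The "tries each alternative in order" behaviour can be sketched in Python like this. The helper names are hypothetical and the code is an illustration of ordered choice, not the generated PADS/ML parser.

```python
DIGITS = "0123456789"


def parse_none(data, pos):
    """Alternative 1: the literal "n/a"."""
    if data.startswith("n/a", pos):
        return ("None", "n/a"), pos + 3
    raise ValueError("expected n/a")


def parse_some(data, pos):
    """Alternative 2: an unsigned integer."""
    end = pos
    while end < len(data) and data[end] in DIGITS:
        end += 1
    if end == pos:
        raise ValueError("expected an integer")
    return ("Some", int(data[pos:end])), end


def parse_id(data, pos=0):
    """Try the alternatives in declaration order; the first success wins."""
    for alternative in (parse_none, parse_some):
        try:
            return alternative(data, pos)
        except ValueError:
            continue
    raise ValueError(f"no alternative of Id matched at offset {pos}")


missing = parse_id("n/aEdWood104731")
present = parse_id("124ChrisNolan809385")
print(missing, present)
```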
21
Recursive Datatypes
  • Describe inductively-defined formats.

[Figure: the integers 79, 40, 31, 71 parsed as an IntList.]
pdatatype IntList =
    Cons of Pint * IntList
  | Last of Pint
22
Polymorphic Types
  • Parameterize types by other types.

ptype IntList = Pint List
ptype CharList = Pchar List
23
Dependent Types
  • Parameterize types by values.

pdatatype IntList =
    Cons of Pint * IntList
  | Nil of Pint
pdatatype (Elt) List (x : char) =
    Cons of Elt * x * (Elt) List(x)
  | Nil of Elt
ptype IntListBar = Pint List('|')
ptype CharListComma = Pchar List(',')
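A Python sketch of a list type parameterized both by an element parser (the polymorphic part) and by a separator value (the dependent part). The names p_int, p_char and p_list are hypothetical, and the loop is an iterative stand-in for the recursive Cons/Nil definition.

```python
DIGITS = "0123456789"


def p_int(data, pos):
    """Element parser for unsigned integers."""
    end = pos
    while end < len(data) and data[end] in DIGITS:
        end += 1
    if end == pos:
        raise ValueError("expected an integer")
    return int(data[pos:end]), end


def p_char(data, pos):
    """Element parser for a single character."""
    if pos >= len(data):
        raise ValueError("expected a character")
    return data[pos], pos + 1


def p_list(elt, sep):
    """Build a parser for one or more `elt` items separated by `sep`."""
    def parse(data, pos=0):
        item, pos = elt(data, pos)
        items = [item]
        while data.startswith(sep, pos):
            try:
                item, next_pos = elt(data, pos + len(sep))
            except ValueError:
                break  # behave like falling back to the final (Nil) case
            items.append(item)
            pos = next_pos
        return items, pos
    return parse


int_list_bar = p_list(p_int, "|")
char_list_comma = p_list(p_char, ",")
ints = int_list_bar("79|40|31|71")
chars = char_list_comma("a,b,c")
print(ints, chars)
```

One definition of p_list serves both instantiations, which is the reuse that polymorphic and dependent types are meant to provide.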
24
More Dependent Types
  • Switched datatypes

pdatatype GuidedOption (tag : int) =
  pmatch tag of
    0 => Zero of Pstring
  | 1 => One of Pint
  | 2 => Two of Pint * Pint
  | _ => None
ptype source = { tag : Pint; payload : GuidedOption(tag) }
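A short Python sketch of a switched datatype: the tag read first selects how the payload is parsed. The whitespace-separated input layout is an assumption made only for this illustration, and parse_source is a hypothetical name.

```python
def parse_source(line):
    """Parse a tag, then a payload whose shape is chosen by the tag value."""
    fields = line.split()
    tag = int(fields[0])
    rest = fields[1:]
    if tag == 0:
        payload = ("Zero", rest[0])
    elif tag == 1:
        payload = ("One", int(rest[0]))
    elif tag == 2:
        payload = ("Two", (int(rest[0]), int(rest[1])))
    else:
        # Default branch: any other tag carries no payload.
        payload = ("None", None)
    return {"tag": tag, "payload": payload}


two = parse_source("2 10 20")
zero = parse_source("0 hello")
other = parse_source("7")
print(two, zero, other)
```

Unlike an ordinary datatype, no alternative is tried speculatively here: the earlier field decides the branch.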
25
PADS/ML Regulus Format
ptype Semicolon = Pcharlit(';')
ptype Vbar = Pcharlit('|')
pdatatype Info(alarm_code : int) =
  Pmatch alarm_code with
    5074 -> Details of Details
  | _ -> Generic of (Nvp_a,Semicolon,Vbar) Plist
pdatatype Service =
    Dom of "DOMESTIC"
  | Int of "INTERNATIONAL"
  | Spec of "SPECIAL"
ptype Raw_alarm = {
  alarm : [i : Puint32 | i = 2 or i = 3];
  ':'; start : Timestamp Popt;
  '|'; clear : Timestamp Popt;
  '|'; code : Puint32;
  '|'; src_dns : SVString Nvp("dns1");
  ';'; dest_dns : SVString Nvp("dns2");
  '|'; info : Info(code);
  '|'; service : Service
}
let checkCorr ra = ...
ptype Alarm = [x : Raw_alarm | checkCorr x]
ptype Source = (Alarm,Peor,Peof) Plist
ptype Timestamp = Ptimestamp_explicit_FW(8, "%H:%M:%S", gmt)
ptype Pip = Puint8 * '.' * Puint8 * '.' * Puint8 * '.' * Puint8
ptype (Alpha) Pnvp(p : string -> bool) = {
  name : [name : Pstring('=') | p name];
  '=';
  value : Alpha
}
ptype (Alpha) Nvp(name : string) = Alpha Pnvp(fun s -> s = name)
ptype SVString = Pstring_SE("/;|\\|/")
ptype Nvp_a = SVString Pnvp(fun _ -> true)
ptype Details = {
  source : Pip Nvp("src_addr");
  ';'; dest : Pip Nvp("dest_addr");
  ';'; start_time : Timestamp Nvp("start_time");
  ';'; end_time : Timestamp Nvp("end_time");
  ';'; cycle_time : Puint32 Nvp("cycle_time")
}
Sample Regulus Data
2:3004092508||5001|dns1=abc.com;dns2=xyz.com|c=slow link;w=lost packets|INTERNATIONAL
3:|3004097201|5074|dns1=bob.com;dns2=alice.com|src_addr=192.168.0.10;dst_addr=192.168.23.10;start_time=1234567890;end_time=1234568000;cycle_time=17412|SPECIAL
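To make the shape of a Regulus record concrete, here is a hand-written Python sketch that parses a single alarm line. It assumes the layout alarm:start|clear|code|dns pairs|info pairs|service, with ";" between name=value pairs; all function and constant names are hypothetical, timestamps are kept as plain integers, and the correlation check (checkCorr) is omitted because the slide elides its body.

```python
SERVICES = {"DOMESTIC", "INTERNATIONAL", "SPECIAL"}
DETAIL_NAMES = ["src_addr", "dest_addr", "start_time", "end_time", "cycle_time"]


def parse_nvps(text):
    """Split 'k1=v1;k2=v2' into an ordered list of (name, value) pairs."""
    pairs = []
    for pair in text.split(";"):
        name, _, value = pair.partition("=")
        pairs.append((name, value))
    return pairs


def parse_alarm(line):
    """Return (rep, errors); errors plays the role of a parse descriptor."""
    head, clear, code, dns, info, service = line.split("|")
    alarm, _, start = head.partition(":")
    errors = []

    alarm = int(alarm)
    if alarm not in (2, 3):
        errors.append("alarm: expected 2 or 3")

    code = int(code)
    dns_pairs = parse_nvps(dns)
    if [name for name, _ in dns_pairs] != ["dns1", "dns2"]:
        errors.append("dns: expected names dns1 and dns2")

    info_pairs = parse_nvps(info)
    if code == 5074:
        # Dependent branch: this alarm code carries a fixed set of details.
        if [name for name, _ in info_pairs] != DETAIL_NAMES:
            errors.append("info: unexpected detail names")
        info_rep = ("Details", dict(info_pairs))
    else:
        info_rep = ("Generic", info_pairs)

    if service not in SERVICES:
        errors.append("service: unknown service")

    rep = {
        "alarm": alarm,
        "start": int(start) if start else None,   # optional timestamp
        "clear": int(clear) if clear else None,   # optional timestamp
        "code": code,
        "src_dns": dict(dns_pairs).get("dns1"),
        "dest_dns": dict(dns_pairs).get("dns2"),
        "info": info_rep,
        "service": service,
    }
    return rep, errors


rep, errors = parse_alarm(
    "2:3004092508||5001|dns1=abc.com;dns2=xyz.com"
    "|c=slow link;w=lost packets|INTERNATIONAL"
)
print(rep, errors)
```

The hand-written version is already longer than the description it mirrors, and it still lacks printing, querying, and systematic error reporting, which is the gap the generated tools are meant to fill.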
26
Outline
  • Motivation and PADS Overview
  • Data Description in PADS/ML
  • Implementation architecture
  • The Semantics of PADS
  • Conclusions

27
Parsing With PADS
[Diagram: the compiler turns a data description (type T) into a parser; applied to raw data (01001001 00111...), the parser hands user code a data representation (type T) and a parse descriptor (type T).]
28
Example MBS Representation
n/aEdWood104731
29
Tool Generation With PADS/ML
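[Diagram: the compiler turns a data description (type T) into a parser and a format-specific traversal functor; applying the functor to a format-independent tool module yields a tool that works over the data representation (type T) and parse descriptor (type T) that the parser produces from raw data (01001001 00111...).]
Tools in this pattern: accumulator, debugger, histograms, clusters, format converters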
30
Types as Modules
  • PADS/ML generates a module for each
    type/description
  • Parameterized types => Functors
  • Recursive types => Recursive modules
  • (sigh) the combination of recursive modules and
    functors is not supported in OCaml, so we're
    reduced to a bit of a hack for recursion

sig
  type rep
  type pd
  fun parser : Pads.handle -> rep * pd
  module Traverse (tool : TOOL) : sig ... end
end
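The split between a format-independent tool and a traversal can be approximated in Python as below. This is only an analogy: PADS/ML generates a typed traversal functor per description, whereas this sketch uses one dynamically typed traversal over nested dicts and lists. CountingTool, traverse and the sample entry are hypothetical.

```python
class CountingTool:
    """A format-independent tool: counts leaf values by Python type name."""

    def __init__(self):
        self.counts = {}

    def on_leaf(self, path, value):
        kind = type(value).__name__
        self.counts[kind] = self.counts.get(kind, 0) + 1


def traverse(tool, rep, path=()):
    """Walk a representation made of dicts (records), lists, and leaves."""
    if isinstance(rep, dict):
        for name, child in rep.items():
            traverse(tool, child, path + (name,))
    elif isinstance(rep, list):
        for index, child in enumerate(rep):
            traverse(tool, child, path + (index,))
    else:
        tool.on_leaf(path, rep)


entry = {
    "id": 125,
    "first": "Tim",
    "last": "Burton",
    "scores": {"min": 30, "max": 82, "avg": 71},
}
tool = CountingTool()
traverse(tool, [entry])
print(tool.counts)
```

The tool knows nothing about the MBS format and the traversal knows nothing about counting, so either side can be swapped independently.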
31
Outline
  • Motivation and PADS Overview
  • Data Description in PADS/ML
  • Implementation architecture
  • The Semantics of PADS
  • Conclusions

32
Motivation
  • To crystallize design principles.
  • Example: error counting methodology in PADS/C.
  • To ensure system correctness.
  • Example: parsers return data of expected type.
  • As basis for evolution and experimentation.
  • Critical to design of PADS/ML.
  • To communicate core ideas.
  • Designing the next 700 data description languages.

33
PADS and DDC
  • Developed semantic framework based on Data
    Description Calculus (DDC).
  • Explain PADS/ML and other languages with DDC.
  • Give denotational semantics to DDC.

[Diagram: PADS/ML, PADS/C, and "The Next 700" data description languages are each explained in terms of DDC.]
34
Data Description Calculus
  • DDC: a calculus of dependent types for describing
    data.
  • Expressions e with type σ drawn from F-omega.
  • A kinding judgment specifies well-formed
    descriptions.

τ ::= unit | bottom | C(e) | Σx:τ.τ
    | τ + τ | τ & τ | {x:τ | e} | τ seq(τ,e,τ)
    | λx.τ | τ e | λα.τ | τ τ
    | α | μα.τ | compute(e:σ) | absorb(τ)
    | scan(τ)
35
Choosing a Semantics
  • The semantics of REs and CFGs is given as sets of
    strings, but this fails to account for:
  • Relationship between internal and external data.
  • Error handling.
  • Types of representation and parse descriptor.
  • DDC
  • Denotational semantics of types as parsers in
    F-omega

36
A 3-Fold Semantics
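Description τ, with three interpretations:
Representation type [τ]rep
  [{x:τ | e}]rep = [τ]rep + [τ]rep
  [Σx:τ.τ']rep = [τ]rep * [τ']rep
Parse descriptor type [τ]pd
  [{x:τ | e}]pd = hdr * [τ]pd
  [Σx:τ.τ']pd = hdr * [τ]pd * [τ']pd
Parser [[τ]]: maps bits (0100100100...) to a representation and a parse descriptor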
37
Type Correctness
Theorem: [[τ]] : bits → [τ]rep * [τ]pd
[Diagram: the description τ is interpreted as a representation type [τ]rep, a parse descriptor type [τ]pd, and a parser [[τ]] from bits (0100100100...) to values of those two types.]
38
Outline
  • Motivation and PADS Overview
  • Data Description in PADS/ML
  • Implementation architecture
  • The Semantics of PADS
  • Conclusions

39
Related Work
  • parser generator technology
  • Lex and Yacc
  • no dependency
  • semantic actions entwined with data description
  • no higher-level tools
  • Parser combinators
  • semantic actions entwined with data description
  • no higher-level tools

40
Reminder: One Description, Many Tools
[Diagram: a data description (type T) is fed to the compiler, which generates a parser, printer, query engine, visual data browser, XML translator, and more, delivered as a programming library or a complete application.]
41
Parser combinators: One Algorithm, One Tool
parser
42
Related Work
  • Other data description languages
  • Data Format Description Language (DFDL)
  • Binary Format Description Language (BFD)
  • PacketTypes [SIGCOMM 00]
  • DataScript [GPCE 02]
  • None have a well-defined semantics or PADS-like
    tool support

43
Current and Future Work
  • Tools and Applications
  • Description inference.
  • Support for specific domains (microbiology)
  • Language Design
  • Transformation language for ad hoc data.
  • Description language for distributed
  • Describe locations, versions, timing,
    relationships, etc.
  • Theory
  • Analyze data descriptions for interesting
    properties, e.g. equivalence, data size,
    termination, emptiness (always fails).
  • Coherence of parsing and printing

44
Summary
  • The PADS vision: reliable, efficient and
    effortless ad hoc data processing
  • PADS/ML
  • Data description based on polymorphic, dependent
    datatypes
  • Types as modules implementation
  • Solid theoretical basis.
  • Visit www.padsproj.org

45
The End
Questions?
46
Cut slides follow
47
Switched Datatypes
  • Choose branch based on parameter.

pdatatype Id = match t with
    0 -> Name of Pstring_FW(3)
  | 1 -> Num of Pint
48
PADS/C vs. PADS/ML
  • Next-generation PADS language and compiler.
  • Based on PADS/C experience and insights from
    semantics.
  • Targeted at ML.
  • Functional languages better suited to data
    transformation.
  • Higher level of abstraction than PADS/C.
  • Key features: polymorphic, recursive datatypes.
  • Improved compiler design
  • New framework for tool development.
  • Greater focus on modularity.

49
Existing Approaches
  • C, Perl, or shell scripts most popular.
  • Time consuming and error prone to hand-code
    parsers.
  • Difficult to maintain (worse than the ad hoc data
    itself in some cases!).
  • Often incomplete, particularly with respect to
    errors.
  • Error code, if written, swamps main-line
    computation.
  • If not written, errors can corrupt good data.
  • Lex and Yacc
  • Good match for programming languages.
  • Bad match for ad hoc data.
  • Compiler converts descriptions into robust,
    format-specific tools.

50
Parsing With PADS
  • Robust parser at the core of generated tools.

51
Using Ad hoc Data
  • Can Ed Wood bowl?

122JoeWright459579124ChrisNolan80938512
5TimBurton308271126GeorgeLucas326240
n/aEdWood104731
  • Parsing only brings you part way.
  • Queries must be written in ML.
  • A lot of work.
  • What about a declarative query?

52
From Ad hoc Data To XML
  • XML
  • Encoding for semi-structured data.
  • Good match!
  • XQuery
  • Declarative XML query language for
    semi-structured sources.
  • Standardized by W3C, many implementations.

53
PADX = PADS + XQuery
  • Galax [Fernandez, et al.]
  • Complete, open-source XQuery implementation.
  • PADX
  • Integrates PADS and Galax.
  • Supports declarative queries over ad hoc data
    sources.

54
Using PADX
  • User describes format in PADS.
  • PADX provides
  • XML view of data in XML Schema.
  • Customized XQuery engine.
  • Query PADS-specific and other XML sources.
  • User provides
  • Ad hoc data
  • Queries expressed in XQuery.

55
Describing MBS Format
  • Example: Movie-director Bowling Score data
  • PADS/ML Description

n/aEdWood104731
ptype MBS-Entry = { id : Id;
                    first : Pstring();
                    last : Pstring();
                    scores : Scores }

56
Viewing and Querying MBS
  • Virtual XML view
  • Query: What is Ed Wood's maximum score?
  • pads/Psource/MBS-Entry[first = "Ed"][last = "Wood"]/scores/max
  • Query: Which directors have scored less than 50?
  • pads/Psource/MBS-Entry[scores/min < 50]

<MBS-Entry>
  <id><None>n/a</None></id>
  <first>Ed</first> <last>Wood</last>
  <scores>
    <min>10</min> <max>47</max> <avg>31</avg>
  </scores>
</MBS-Entry>
n/aEdWood104731
ptype MBS-Entry = { id : Id;
                    first : Pstring();
                    last : Pstring();
                    scores : Scores }
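The two queries can be tried out on a materialized copy of the XML view using Python's standard xml.etree.ElementTree module, as a stand-in for PADX and Galax. ElementTree supports only a small XPath subset, so the numeric comparison in the second query is done in Python. The enclosing Psource element, the Some wrapper, and the second entry (built from the Chris Nolan sample record) are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

XML_VIEW = """
<Psource>
  <MBS-Entry>
    <id><None>n/a</None></id>
    <first>Ed</first><last>Wood</last>
    <scores><min>10</min><max>47</max><avg>31</avg></scores>
  </MBS-Entry>
  <MBS-Entry>
    <id><Some>124</Some></id>
    <first>Chris</first><last>Nolan</last>
    <scores><min>80</min><max>93</max><avg>85</avg></scores>
  </MBS-Entry>
</Psource>
"""

root = ET.fromstring(XML_VIEW)

# Query 1: Ed Wood's maximum score.
ed_max = [
    int(node.text)
    for node in root.findall("./MBS-Entry[first='Ed'][last='Wood']/scores/max")
]

# Query 2: directors whose minimum score is below 50.
below_50 = [
    (entry.findtext("first"), entry.findtext("last"))
    for entry in root.findall("./MBS-Entry")
    if int(entry.findtext("scores/min")) < 50
]

print(ed_max, below_50)
```

Unlike this sketch, PADX presents the XML view virtually, so the ad hoc data does not have to be converted and loaded as a whole before it can be queried.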

57
Challenges and Solutions
  • Semantics
  • Map PADS language to XML Schema.
  • Re-engineer Galax Data Model
  • Create abstract data model.
  • Generate description-specific concrete data
    models.
  • Efficiently query large-scale data sources.
  • Provide lazy access to data.
  • Implement custom memory-management.