1
The Next 700 Data Description Languages
  • Yitzhak Mandelbaum
  • Princeton University
  • Computer Science

Collaborators: Kathleen Fisher and David Walker
2
The Next 700
3
What Data Needs Describing?
  • There's much data in databases and common formats
    like XML, but there's also much data that's ad hoc.
  • Ad hoc data lacks readily available parsing,
    querying, analysis or transformation tools.
  • It's all over the place: financial, telecom,
    chemistry, physics, biology, etc.

4
Ad Hoc Data in Biology
!autogenerated-by: DAG-Edit version 1.419 rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: Revision 3.223
!type: is_a is a
!type: < part_of part of
!type: inverse_of inverse of
!type: disjoint_from disjoint from
Gene_Ontology GO:0003673
<biological_process GO:0008150
behavior GO:0007610 synonym:behaviour
adult behavior GO:0030534 synonym:adult behaviour
adult feeding behavior GO:0008343 synonym:adult feeding behaviour
feeding behavior GO:0007631
adult locomotory behavior GO:0008344 ...
from www.geneontology.org
5
Ad Hoc Data in Chemistry
OC(C@@H2OC(C)O)C@@3(C)C@(C@(CO4)(OC(C)O)C@H4CC@@H3O)(H)C@H(OC(C7CCCCC7)O)C@@1(O)C@@(C)(C)C2C(C)C@@H(OC(C@H(O)C@@H(NC(C6CCCCC6)O)C5CCCCC5)O)C1
6
Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
7
Ad Hoc Data: DNS Packets
00000000 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
00000010 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
00000020 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
00000030 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
00000040 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
00000050 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
00000060 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
00000070 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
00000080 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............
00000090 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................
000000a0 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........
000000b0 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...
000000c0 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys
000000d0 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co
8
Data Description Languages
  • Data description languages describe many ad hoc
    formats and provide the following features:
  • The description serves as documentation, including
    the semantics of the data.
  • The compiler generates tools from the description:
    parser, printer, query engine, converter to XML,
    statistical profiler, etc.
  • The parser includes robust error detection and
    recovery.
  • Parsers can handle high data volume.
  • > 1GB/second Netflow traffic from Cisco routers.

9
Many Data Description Languages
  • Logical descriptions: ASN.1, ASDL
  • Physical descriptions: PacketTypes (SIGCOMM '00),
    DataScript (GPCE '02), PADS (PLDI '05)
  • PADS is the basis for current work.

[Figure: physical data (001010101 101001001 111001010 001010100) versus logical representation]
10
Contributions
  • A core data description calculus (DDC)
  • Based on dependent type theory
  • Simple, orthogonal, composable types
  • Types are transducers from external data source
    to internal data representation.
  • Encodings of high-level DDLs in low-level DDC
  • Explain semantics of PADS language in particular.

11
Base Types and Sequences
  • C(e): base type; can be parameterized by
    expression e.
  • Σx:T.T': dependent product describes a sequence of
    values.
  • Variable x gives a name to the first value in the
    sequence.
  • Examples
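
A minimal Python sketch of these two constructs, not the DDC itself. It assumes a parser is a function from (data, offset) to (new offset, value, error list), with the error list standing in for a parse descriptor. The names uint, chars, dep_pair and length_prefixed are hypothetical.

```python
from typing import Any, Callable, List, Tuple

# Assumed parser shape: (data, offset) -> (new offset, value, errors).
Parser = Callable[[str, int], Tuple[int, Any, List[str]]]

def uint(data: str, off: int):
    """Base type: an unsigned decimal integer."""
    end = off
    while end < len(data) and "0" <= data[end] <= "9":
        end += 1
    if end == off:
        return off, None, [f"expected digit at offset {off}"]
    return end, int(data[off:end]), []

def chars(n: int) -> Parser:
    """Base type parameterized by an expression: exactly n characters."""
    def parse(data, off):
        if off + n > len(data):
            return off, None, [f"expected {n} characters at offset {off}"]
        return off + n, data[off:off + n], []
    return parse

def dep_pair(first: Parser, rest: Callable[[Any], Parser]) -> Parser:
    """Dependent pair: the second type may mention the first value x."""
    def parse(data, off):
        off1, x, errs1 = first(data, off)
        if errs1:
            # Without x there is no way to build the second parser.
            return off1, (x, None), errs1
        off2, y, errs2 = rest(x)(data, off1)
        return off2, (x, y), errs2
    return parse

# A length-prefixed string: the count read first fixes the string's width.
length_prefixed = dep_pair(uint, lambda n: chars(n))
```

For example, length_prefixed("5hello!", 0) consumes six characters and yields the pair (5, "hello") with no errors.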

12
Constraints
  • {x:T | e}: set types allow you to constrain the
    type T and express relationships between elements
    of the data.
  • Examples
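
One way to picture a set type, sketched in Python under the assumption that a parser returns (new offset, value, error list). The helper names uint, constrain and octet are hypothetical. The underlying parser runs first, then the predicate is checked; a violation is reported as an error while the parsed value is still returned.

```python
def uint(data, off):
    """Base type: an unsigned decimal integer."""
    end = off
    while end < len(data) and "0" <= data[end] <= "9":
        end += 1
    if end == off:
        return off, None, [f"expected digit at offset {off}"]
    return end, int(data[off:end]), []

def constrain(parser, pred, label):
    """Set type: values of the underlying type that satisfy pred."""
    def parse(data, off):
        new_off, value, errs = parser(data, off)
        if not errs and not pred(value):
            # The data parsed, but the constraint does not hold.
            errs = [f"{label} violated at offset {off}"]
        return new_off, value, errs
    return parse

# An integer restricted to the range of one IP address component.
octet = constrain(uint, lambda v: v <= 255, "octet range")
```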

13
Unions and the Empty String
  • true: matches the empty string.
  • T + T': deterministic, exclusive or: try T; on
    failure, try T'.
  • Examples
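
A small Python sketch of the union and the empty-string type, assuming parsers return (new offset, value, error list). The names literal, union and optional_minus are hypothetical, and tagging the value with "left" or "right" is an assumption of this sketch.

```python
def true(data, off):
    """Matches the empty string: consumes nothing and never fails."""
    return off, None, []

def literal(text):
    """Matches exactly the given text."""
    def parse(data, off):
        if data.startswith(text, off):
            return off + len(text), text, []
        return off, None, [f"expected {text!r} at offset {off}"]
    return parse

def union(left, right):
    """Deterministic choice: try left; only on failure, try right."""
    def parse(data, off):
        new_off, value, errs = left(data, off)
        if not errs:
            return new_off, ("left", value), []
        # Restart the second branch from the original offset.
        new_off, value, errs = right(data, off)
        return new_off, ("right", value), errs
    return parse

# An optional sign: a minus character or nothing at all.
optional_minus = union(literal("-"), true)
```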

14
Array Features
  • What features do we need to handle data
    sequences?
  • Elements
  • Separator between elements
  • Termination condition (are we done yet?)
  • Terminator after sequence
  • Examples
  • 192.168.1.1
  • BillCathyJaneBob

15
False and Arrays
  • T seq(Ts; e, Tt) specifies:
  • Element type T.
  • Separator type Ts.
  • Termination condition e.
  • Terminator type Tt.
  • false: reads nothing, flagging an error.
  • Example: IP address.
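
The IP address example can be sketched in Python as follows. This is an illustration under stated assumptions, not the calculus's definition: parsers return (new offset, value, error list), the termination condition is a predicate over the elements read so far, and the terminator is only looked for, not consumed. The names uint, literal, seq and ip_addr are hypothetical.

```python
def uint(data, off):
    """Base type: an unsigned decimal integer."""
    end = off
    while end < len(data) and "0" <= data[end] <= "9":
        end += 1
    if end == off:
        return off, None, [f"expected digit at offset {off}"]
    return end, int(data[off:end]), []

def literal(text):
    """Matches exactly the given text."""
    def parse(data, off):
        if data.startswith(text, off):
            return off + len(text), text, []
        return off, None, [f"expected {text!r} at offset {off}"]
    return parse

def false(data, off):
    """Reads nothing and always flags an error."""
    return off, None, [f"false cannot match (offset {off})"]

def seq(elem, sep, done, term):
    """Elements of type elem, separated by sep, read until done(values)
    holds or term matches at the current offset."""
    def parse(data, off):
        values, errs = [], []
        while not done(values) and term(data, off)[2]:
            if values:
                off, _, errs = sep(data, off)
                if errs:
                    break
            off, value, errs = elem(data, off)
            if errs:
                break
            values.append(value)
        return off, values, errs
    return parse

# Four integers separated by dots. Using false as the terminator means
# the sequence can only end through the length condition.
ip_addr = seq(uint, literal("."), lambda vs: len(vs) == 4, false)
```

Applied to "192.168.1.1" at offset 0, ip_addr returns offset 11 and the list [192, 168, 1, 1] with no errors.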

16
Abstraction and Application
  • Can parameterize types over values: λx.T
  • Correspondingly, can apply types to values: T e
  • Example: IP address with terminator
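
In a host language such as Python, a type parameterized over a value can be pictured as a function that returns a parser, and application as calling that function. The sketch below uses a simpler example than the slide's: a text field that ends just before a given terminator. The name text_until and the (new offset, value, error list) result shape are assumptions.

```python
def text_until(term):
    """Abstraction: a type parameterized by the terminator string."""
    def parse(data, off):
        end = data.find(term, off)
        if end < 0:
            return off, None, [f"terminator {term!r} not found after offset {off}"]
        # Stop just before the terminator; it is not consumed.
        return end, data[off:end], []
    return parse

# Application: supplying a value for the parameter yields a parser.
pipe_field = text_until("|")
comma_field = text_until(",")
```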

17
Absorb, Compute and Scan
  • Absorb, Compute and Scan are active types.
  • absorb(T): consume data from source; produce
    nothing.
  • compute(e:σ): consume nothing; output result of
    computation e.
  • scan(T): scan data source for type T.
  • Examples
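
A hedged Python sketch of the three active types, assuming parsers return (new offset, value, error list). Passing the computation as a zero-argument function and having scan report the number of skipped characters alongside the value are choices made for this sketch. The name literal is hypothetical.

```python
def literal(text):
    """Matches exactly the given text."""
    def parse(data, off):
        if data.startswith(text, off):
            return off + len(text), text, []
        return off, None, [f"expected {text!r} at offset {off}"]
    return parse

def absorb(parser):
    """Consume what the underlying type consumes, but produce no value."""
    def parse(data, off):
        new_off, _, errs = parser(data, off)
        return new_off, None, errs
    return parse

def compute(fn):
    """Consume nothing; the value is the result of calling fn."""
    def parse(data, off):
        return off, fn(), []
    return parse

def scan(parser):
    """Skip forward until the underlying type matches."""
    def parse(data, off):
        for start in range(off, len(data) + 1):
            new_off, value, errs = parser(data, start)
            if not errs:
                return new_off, (start - off, value), []
        return off, None, [f"scan found no match from offset {off}"]
    return parse
```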

18
Type Kinding
  • Kinding ensures types are well formed.

19
Parsing Semantics of Types
  • Semantics expressed as parsing functions written
    in the polymorphic λ-calculus.
  • Sem(T): DDC Type → Function
  • Input: data and offset. Output: new offset, value
    and parse descriptor.
  • For specifics, see upcoming technical report.
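
To make the shape of such a semantic function concrete, here is a toy interpreter in Python. It covers only two made-up type constructors (Lit and a non-dependent Pair), so it is a simplification and not the DDC semantics. A list of error strings stands in for the parse descriptor, and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Lit:
    """Toy type: a fixed piece of text."""
    text: str

@dataclass
class Pair:
    """Toy type: one type followed by another (no dependency)."""
    first: object
    second: object

def sem(t):
    """Map a type description to a parsing function.
    The function takes (data, offset) and returns
    (new offset, value, error list)."""
    if isinstance(t, Lit):
        def parse_lit(data, off):
            if data.startswith(t.text, off):
                return off + len(t.text), t.text, []
            return off, None, [f"expected {t.text!r} at offset {off}"]
        return parse_lit
    if isinstance(t, Pair):
        p1, p2 = sem(t.first), sem(t.second)
        def parse_pair(data, off):
            off1, v1, errs1 = p1(data, off)
            off2, v2, errs2 = p2(data, off1)
            # Errors from both components are collected.
            return off2, (v1, v2), errs1 + errs2
        return parse_pair
    raise TypeError(f"unknown type description: {t!r}")

request_prefix = sem(Pair(Lit("GET"), Lit(" ")))
```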

20
Types of Parser Output
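  • Parsers produce values with the following types in
    the host language:

[Table: host-language representation types for base types, products, abstraction and application, unions, and set types. Slide annotations: unrecoverable error, dependency erased, semantic error.]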
21
Properties of the Calculus
  • Theorem: If Γ ⊢ T : k then
  • Sem(T) = F: well-formed types yield parsers.
  • Γ ⊢ F : bits * offset → offset * Trep * Tpd:
    a T-parser returns values with types that
    correspond to T.
  • Theorem: Parsers report errors accurately.
  • Errors in the parse descriptor correspond to actual
    errors in the data.
  • Parsers check all semantic constraints.
  • More

22
Making Use of the Calculus
IPADS
t ::= C(e) | Pfun(x:s) = t | t e
    | Pstruct{fields} | Punion{fields}
    | Pswitch e of {alts tdef} | Popt t
    | t Pwhere x.e | Palt{fields} | t [t; e, t]
    | Pcompute e | Plit c
fields ::= fields x:t
alts ::= alts e => t
? - t ? T
23
Example: Popt and Plit
[Figure: encodings of Popt and Plit using the DDC constructs true, T1 + T2, C(e), {x:T | e}, absorb(T) and scan(T)]
24
Example Pswitch
25
Future work
  • What is the set of languages recognized by the
    DDC?
  • How does the expressive power of the DDC relate
    to CFGs and regular expressions?
  • Implement recursive types in PADS system based on
    the recursive types of the DDC.
  • Add polymorphism to DDC and PADS.

26
Summary
  • Data description languages are well-suited to
    describing ad hoc data.
  • No one DDL will ever be right - different domains
    and applications will demand different languages
    with differing levels of expressiveness and
    abstraction.
  • Our work defines the first semantics for data
    description languages.
  • For more information, visit www.padsproj.org.

27
Cut slides follow
28
A Brief History
  • In the beginning, there was just one program
    (maybe two).
  • No need for programming language.
  • That program was copied and changed until there
    were many programs.
  • High-level programming language was invented.
  • Nice, but not right for all situations - many new
    programming languages appeared.
  • How do these languages relate to each other?
  • Programming language semantics was born.

29
A Brief History
  • In the beginning, there was just one data format
    (binary).
  • No need for data description language.
  • That format evolved until there were many
    formats.
  • Data description language was invented.
  • One language did not suit all and many new data
    description languages appeared.
  • This is where we are today.
  • We'd like to help answer that question by
    devising the first data description language
    semantics.