Title: Tracking Metadata and Lineage of the Data Processing Chain for Mapping Snow Cover Properties with the NASA MODIS
1 Tracking Metadata and Lineage of the Data Processing Chain for Mapping Snow Cover Properties with the NASA MODIS
- James Frew (1), Thomas H. Painter (2), Peter Slaughter (1), Jeff Dozier (1)
(1) Donald Bren School of Environmental Science and Management, University of California, Santa Barbara
(2) National Snow and Ice Data Center, University of Colorado, Boulder
2 Outline
- Motivation
- Snow mapping product
- Implications for hydrologic modeling
- Lineage Capture
- Wrapping: the ESSW experience
- Instrumenting, overriding, monitoring: the (ongoing) ES3 experience
3 MODIS image: Sierra Nevada
EOS Terra MODIS, 07 March 2004, MOD09 Surface Reflectance (bands at 0.555, 0.645, 0.858 µm)
4 Snow-covered area and grain size
5 Hindu Kush
2003, DOY 070
6 Colorado Rockies: CLPX, 13 March 2002
7 Model structure: MODIS snow-area / albedo
8 Lineage Capture, Take 1
9 Using Existing Science Applications
- No standard Earth science computing environment
- commercial packages (ArcInfo, MATLAB, ...)
- public packages/models (MM5, MODTRAN, ...)
- locally-developed codes
- arbitrary combinations of the above
- Example: SST from AVHRR
- commercial, standalone programs
- parameters highly customized for UCSB
- How do we get these programs to
- communicate
- cooperate
- with ESSW, without rewriting them?
[Diagram: AVHRR SST processing chain with stages Receive; Ingest and Calibrate; Navigate (Manual/Automatic); Sea Surface Temp (SST); Rectify; SST Maps]
10 Lineage: Current Best Practice
11 Earth System Science Workbench (ESSW)
- Producer and consumer issues can both be addressed by a laboratory metaphor
- Experiment
- Network of models
- ingesting / synthesizing data
- generating products
- Laboratory
- Experiment execution environment
- Computing, storage, accessibility, scalability
- Lab Notebook
- Persistent storage that can be queried
- Keeps track of all experiments
- Documentation, lineage, accountability
12 Wrap Your App: Scripts Talk to ESSW
- No changes, just additions
- Wrapper scripts
- Make program (groups) look like ESSW experiments
- use Perl API
- Lab Notebook daemon
- Accepts API commands
- Creates XML documents
- Sends to database
- ESSW database
- XML metadata DTDs
- Tabular metadata
- XML search terms
- Lineage links
[Diagram: the processing chain (Receive; Ingest and Calibrate; Navigate (Manual/Automatic); Sea Surface Temp (SST); Rectify; SST Maps) wrapped for ESSW. Component labels: Perl, Perl API, Lab Notebook daemon, Java, XML, SQL, JDBC, MySQL, ESSW Database]
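The wrapping idea above can be sketched in a few lines. This is only an illustration in Python, not the ESSW Perl API (which the slides do not show): `run_wrapped`, the record fields, and the in-memory `notebook` list are hypothetical stand-ins for a wrapper script and the Lab Notebook daemon. The wrapped program runs unchanged; the wrapper adds the bookkeeping around it.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for the Lab Notebook daemon: a list of step records.
notebook = []

def run_wrapped(experiment, argv, inputs, outputs):
    """Run an unmodified program and record one experiment step."""
    started = time.time()
    result = subprocess.run(argv, capture_output=True, text=True)
    record = {
        "experiment": experiment,
        "command": list(argv),
        "inputs": list(inputs),    # declared by the script author
        "outputs": list(outputs),  # declared by the script author, not observed
        "exit_status": result.returncode,
        "started": started,
        "elapsed": time.time() - started,
    }
    notebook.append(record)
    return record

if __name__ == "__main__":
    # The "science program" here is a trivial Python one-liner.
    rec = run_wrapped(
        "sst_rectify",
        [sys.executable, "-c", "print('rectified')"],
        inputs=["sst.raw"],
        outputs=["sst.map"],
    )
    print(rec["experiment"], rec["exit_status"])
```

Note that the input and output names are whatever the wrapper author writes down, so the record is only as accurate as the script.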
13 ESSW Metadata management
- Lab Notebook daemon verifies XML metadata document
- Experiment step metadata stored for product lineage tracking
- Complete metadata document stored in custom database table
- XML DTD maps 1:1 to a database table
- (n+1)th column is document itself
- Some metadata values extracted into database tables
- DTD contains column names and types for some elements
- Always save all the XML, even if you don't know how to columnize all of it
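A minimal sketch of the "n extracted columns plus the whole document" layout, assuming SQLite in place of MySQL and a made-up metadata document. The element names, the `COLUMNS` list, and the `store` helper are hypothetical; in ESSW the column names and types came from the DTD. `ElementTree` only checks that the XML is well-formed, it does not validate against a DTD.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata document; element names are illustrative only.
DOC = """<dataset>
  <name>sst_map_001</name>
  <sensor>AVHRR</sensor>
  <comment>free-form notes that have no column</comment>
</dataset>"""

# Elements we know how to "columnize" (assumed; ESSW read these from the DTD).
COLUMNS = ["name", "sensor"]

def store(conn, xml_text):
    """Extract known elements into columns and keep the full document too."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    values = [root.findtext(col) for col in COLUMNS]
    placeholders = ", ".join("?" for _ in range(len(COLUMNS) + 1))
    # The last, (n+1)th, column always holds the complete XML document.
    conn.execute(f"INSERT INTO dataset VALUES ({placeholders})",
                 values + [xml_text])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (name TEXT, sensor TEXT, document TEXT)")
store(conn, DOC)
row = conn.execute("SELECT name, sensor, document FROM dataset").fetchone()
```

The `comment` element has no column, but it is still recoverable from the stored document.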
14 Wrapper Example: Input Dataset
15 Wrapper Example: Output Dataset
16 Wrapper Example: Process
17 Wrapper Example: Lineage Links
18 Process graph reconstructed from ESSW database
19 ESSW Lessons
- Providers are customers
- ESIPs aren't much good unless scientists are happy to put information in them
- A light touch is the right touch
- Wrapping is easier for scientists and their programmers to deal with than complete re-engineering
- Scientists do write scripts, but not necessarily Perl
- Scripting (gluing stuff together) comes naturally to scientists
- Scientists don't write DTDs
- Nobody calls metadata APIs
- ESSW was automatic, but not automatic enough
20 Lineage Capture, Take 2
21 ES3: Earth System Science Server
[Diagram labels: ESSW data lineage tracking; MODster; OpenDAP; Watershed-scale snow product; MODIS; Microsoft TerraServer; AVHRR; Global-scale snow product; Alexandria Digital Library; Corona; BUB data storage; ROCKS processing clusters]
22 From ESSW to ES3: Summary
- Perl wrappers → Probulators
- Perl API → web services, XML messages
- MySQL → XML database(s)
23 From Wrappers to Probulators
- Wrappers: Active Lineage
- Pros
- Complete control over what gets recorded
- Single language/API for all wrapped events
- Not tied to execution
- You can even lie about what happened
- Cons
- Must explicitly script everything
- Scripts can drift from reality
- You can even lie about what happened
24 From Wrappers to Probulators
- Probulators: Passive Lineage
- Pros
- Record what actually happened
- Not just what you think happened
- Not what didn't happen
- Automatic: don't have to write new scripts for everything
- Cons
- Different flavors for different environments
- Can't just do everything in Perl
25 Probulator patterns
- Instrumentation
- Insert lineage capture instructions directly into science codes
- e.g. "I just created file foo"
- Typical implementation: preprocessor/precompiler
- Overriding
- Replace standard routines/libraries with lineage-capturing versions
- e.g. open() → snoopy_open()
- Typical implementation: modify execution environment
- environment variables
- configuration files
- Passive monitoring
- Trace program execution
- e.g. "called open() with args foo, bar, ..."
- Typical implementation: strace'd shell
26 ES3 Lineage Architecture
[Diagram: probulator 1 ... probulator n; logger; transmitter; ES3 core]
27 Probulating IDL: Instrumenting the code
- edit
    pro modscag_cleanse,prefix=prefix,ns=ns,nl=nl
    HELP, NAMES="", OUTPUT=ES3_ENVIROMENT & ES3_LOG, ENTER="modscag_cleanse", ENVIROMENT=ES3_ENVIROMENT
    ; clean up under,overflow of MODSCAG run
    ;
    ; Input:  prefix - prefix for all of the MODSCAG output filenames
    ;         ns - number of samples
    ;         nl - number of lines
    ; Output: rewrite of the MODSCAG files
    ;
    ; t.h.painter / 1.19.2005
    ; open snow file
    ES3_openr,1,string(prefix,'snow.pic')
    snow=fltarr(ns,nl)
    readu,1,snow
28 Probulating IDL: Results
    <init time="20050522T234606Z" pid="31002" stime="20050522T234604Z"
          pstime="20050522T234256Z" ppid="30920" language="idl" user="haavar"
          hostname="spitting-duck.bren.ucsb.edu">
      <enviroment>
        <variable name="!PATH" value="/home/haavar/probulator//idl
          /home/rsi/idl_6.1/lib/hook
        ...
      </enviroment>
      <mount-points>
        <mount share="dab15/ed15/rsi" type="nfs">/home/rsi</mount>
      </mount-points>
    </init>
    <enter region="modscag_cleanse">
      <enviroment>
        <variable type="INT" name="NL" value="2"/>
        <variable type="INT" name="NS" value="2"/>
        ...
      </enviroment>
    </enter>
    <exec time="20050522T234610Z" routine="OPENR">
29 Probulating bash: Passive Monitoring
    cat /etc/passwd | grep haavar | sed -n 's/\(.\)\2\\(0-9\\)./\2/p'
    25232 1138336174.480079 open("/etc/ld.so.cache", O_RDONLY) = 3
    25232 1138336174.480215 open("/lib/libm.so.6", O_RDONLY) = 3
    ...
    25234 1138336178.887267 dup2(3, 255) = 255
    25234 1138336178.887912 pipe([3, 4]) = 0
    25234 1138336178.888257 clone(child_stack=0, ..., child_tidptr=0xb7f2e708) = 25235
    25235 1138336178.889366 dup2(4, 1) = 1
    25235 1138336178.889975 pipe([3, 4]) = 0
    25235 1138336178.890326 clone(child_stack=0, ..., child_tidptr=0xb7f2e708) = 25236
    25235 1138336178.891260 pipe([4, 5]) = 0
    25235 1138336178.891756 clone(child_stack=0, ..., child_tidptr=0xb7f2e708) = 25237
    25235 1138336178.892753 clone(child_stack=0, ..., child_tidptr=0xb7f2e708) = 25238
    25238 1138336178.894266 dup2(4, 0) = 0
    25236 1138336178.894726 dup2(4, 1) = 1
    25237 1138336178.894763 dup2(3, 0) = 0
    25237 1138336178.895581 dup2(5, 1) = 1
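A trace like this can be turned into lineage events with very little code. The sketch below assumes the usual strace layout of `pid timestamp call(args) = return`; the `parse` and `children` helpers are hypothetical names, and the sample lines are adapted from the slide with the elided `clone()` arguments simply left out. It recovers which shell process spawned which child, which is the skeleton of the pipeline.

```python
import re

# pid, timestamp, syscall name, raw argument text, integer return value
LINE = re.compile(r"^(\d+)\s+(\d+\.\d+)\s+(\w+)\((.*)\)\s+=\s+(-?\d+)")

def parse(trace_text):
    """Turn strace-style lines into a list of event dictionaries."""
    events = []
    for line in trace_text.splitlines():
        m = LINE.match(line.strip())
        if m:
            pid, ts, call, args, ret = m.groups()
            events.append({"pid": int(pid), "time": float(ts),
                           "call": call, "args": args, "ret": int(ret)})
    return events

def children(events):
    """Map each parent pid to the child pids it created with clone()."""
    tree = {}
    for e in events:
        if e["call"] == "clone" and e["ret"] > 0:
            tree.setdefault(e["pid"], []).append(e["ret"])
    return tree

# Sample adapted from the slide; elided clone() arguments are omitted.
SAMPLE = """\
25234 1138336178.888257 clone(child_stack=0, child_tidptr=0xb7f2e708) = 25235
25235 1138336178.890326 clone(child_stack=0, child_tidptr=0xb7f2e708) = 25236
25235 1138336178.891756 clone(child_stack=0, child_tidptr=0xb7f2e708) = 25237
25236 1138336178.894726 dup2(4, 1) = 1
"""

events = parse(SAMPLE)
tree = children(events)
```

Matching the `pipe()` and `dup2()` calls to these processes in the same way would show which child's output feeds which child's input.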
30 Probulating bash: Results
    <init> (same as IDL)
    <exec time="20060027T042938.900117Z" routine="/bin/cat" pid="25236" ppid="25235">
      <arguments>
        <argument>/etc/passwd</argument>
      </arguments>
      <io>
        <pipe read="true" id="std-in"/>
        <pipe write="true" id="3"/>
        <pipe write="true" id="std-err"/>
        <file read="true">/etc/ld.so.cache</file>
        ...
        <file read="true">/etc/passwd</file>
      </io>
    </exec>
    <exec time="20060027T042938.903342Z" routine="/bin/grep" pid="25237" ppid="25235">
      <arguments>
        <argument>haavar</argument>
      </arguments>
      <io>
31 Now What?
- Probulator reports not universally unique
- Q: How to hook separate reports together?
- A: Logger assigns UUIDs to
- Data streams
- Processes
- Jobs (workflows)
- Lineage not explicit
- Q: How to publish lineage?
- A: ES3 Core builds serialized graph
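One way to picture both answers is sketched below, with loud caveats: the `Logger` class, its method names, the relation labels, and the JSON layout are all hypothetical, and the sketch folds the logger's UUID assignment and the ES3 core's graph building into one object for brevity. Local names such as pids and pipe ids are only unique within one report; mapping each to a UUID lets separate reports refer to the same stream.

```python
import json
import uuid

class Logger:
    """Hypothetical logger: gives each locally named entity a UUID."""

    def __init__(self):
        self.ids = {}     # (kind, local name) -> UUID string
        self.edges = []   # (source UUID, relation, target UUID)

    def uid(self, kind, name):
        """Return the UUID for an entity, creating it on first sight."""
        key = (kind, name)
        if key not in self.ids:
            self.ids[key] = str(uuid.uuid4())
        return self.ids[key]

    def report(self, job, process, reads, writes):
        """Record one process report as lineage edges."""
        j = self.uid("job", job)
        p = self.uid("process", process)
        self.edges.append((p, "part_of", j))
        for r in reads:
            self.edges.append((self.uid("stream", r), "input_to", p))
        for w in writes:
            self.edges.append((p, "wrote", self.uid("stream", w)))

    def serialize(self):
        """Serialize the lineage graph as JSON (an assumed format)."""
        nodes = [{"id": u, "kind": k, "name": n}
                 for (k, n), u in self.ids.items()]
        edges = [{"from": s, "rel": r, "to": t} for s, r, t in self.edges]
        return json.dumps({"nodes": nodes, "edges": edges}, indent=2)

log = Logger()
log.report("job-1", "cat:25236", reads=["/etc/passwd"], writes=["pipe:3"])
log.report("job-1", "grep:25237", reads=["pipe:3"], writes=["pipe:5"])
graph = json.loads(log.serialize())
```

Because both reports name `pipe:3`, they share one stream node, and the graph links the output of `cat` to the input of `grep`.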
32 Thanks to
- Current
- Mike Colee
- Stephane Maritorena
- Dominic Metzger
- Karl Rittger
- Dave Siegel
- Former
- Anurag Acharya
- Rajendra Bose
- Scott Denning
- Debbie Donahue
- Jim Duff
- Calin Duma
- Erik Fields
- Jim Gray
- Steve Miley
- Jordan Morris
- Mark Pelletier
- Pete Peterson
- Walter Rosenthal
- Klaus Schauser
- Håvar Valeur
33 To Probulate Further: http://www.snow.ucsb.edu
Publications
- Bose, R. and Frew, J., 2005. Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys, vol. 37, no. 1, pp. 1-28. doi:10.1145/1057977.1057978
- Dozier, J. and Painter, T.H., 2004. Multispectral and hyperspectral remote sensing of alpine snow properties. Annual Review of Earth and Planetary Sciences, vol. 32, pp. 465-494. doi:10.1146/annurev.earth.32.101802.120404
- Molotch, N.P., Painter, T.H., Bales, R.C., and Dozier, J., 2004. Incorporating remotely sensed snow albedo into spatially distributed snowmelt modeling. Geophysical Research Letters, vol. 31, L03501. doi:10.1029/2003GL019063
- Frew, J. and Bose, R., 2001. Earth System Science Workbench: a data management infrastructure for Earth science products. In Kerschberg, L. and Kafatos, M. (eds.), Proceedings, 13th International Conference on Scientific and Statistical Database Management (SSDBM 2001), pp. 180-189. doi:10.1109/SSDM.2001.938550