Transcript and Presenter's Notes

Title: Monday 17th July
1
Session 25: Monday 17th July
Malcolm Atkinson
2
Distributed Structured Data Management: Introduction, Principles & Foundations
3
Introduction to Structured Data in Grids
  • Reminders: Distributed Systems & Data scale
  • Significance of Structure
  • Strategies for Data Integration
  • Metadata Challenges
  • A view of OGSA-DAI

4
Reminders: Distributed Systems, Data scale & Data opportunities
5
Foundations of Collaboration
  • Strong commitment by individuals
  • To work together
  • To take on communication challenges
  • Mutual respect & mutual trust
  • Distributed technology
  • To support information interchange
  • To support resource sharing
  • To support data integration
  • To support trust building
  • Sufficient time
  • Common goals
  • Complementary knowledge, skills & data

Can we predict when it will work? Can we find remedies when it doesn't?
6
A strategy that works well
  • Collaboratively constructed
  • Shared access
  • Data Resources
  • Sequence databases
  • Protein structure and Crystallography databases
  • Sky Surveys
  • Census data
  • Zoo DB
  • Mouse Atlas

Works better when linked to Funding & Publication. But funding the maintenance?
7
Works better with an organising nucleus
  • EBI
  • BIRN
  • GEON
  • SEEK / Species 2000
  • IVOA
  • CaBIG

Helping to Organise: Giving user support, Establishing standards, Sharing methods
8
Principles of Distributed Computing
  • Issues you can't avoid
  • Lack of Complete Knowledge (LOCK)
  • Latency
  • Heterogeneity
  • Autonomy
  • Unreliability
  • Change
  • A Challenging goal
  • balance technical feasibility
  • against virtual homogeneity, stability and
    reliability
  • Appropriate balance between usability and
    productivity
  • while remaining affordable, manageable and
    maintainable

This is NOT easy
9
Compound Causes of Data Growth
  • Faster devices
  • Cheaper devices
  • Higher-resolution
  • all Moore's law
  • Increased processor throughput
  • → more derived data
  • Cheaper higher-volume storage
  • Remote data more accessible
  • Public policy to make research data available
  • Bandwidth increases
  • Latency doesn't get less though

10
Motivation: Data Curation, shared Data integration, Data opportunities
11
Interpretational Opportunities & Challenges
  • Finding & Accessing data
  • Variety of mechanisms & policies
  • Interpreting data
  • Variety of forms, value systems & ontologies
  • Independent provision & ownership
  • Autonomous changes in availability, form, policy, ...
  • Processing data
  • Understanding how it may be related
  • Devising models that expose the relationships
  • Presenting results
  • Humans need either
  • Small volumes of derived statistics
  • Visualisations

12
Interpretational Opportunities & Challenges
Variety & Autonomy: Essential
13
Interpretational Opportunities & Challenges
Standards & Collaboration: Essential
14
Data Access and Integration: motives
  • Key to Integration of Scientific Methods
  • Publication and sharing of results
  • Primary data from observation, simulation & experiment
  • Encourages novel uses
  • Allows validation of methods and derivatives
  • Enables discovery by combining data independently
    collected

and Decisions!
15
Data Access and Integration: motives
  • Key to Large-scale Collaboration
  • Economies: data production, publication & management
  • Sharing cost of storage, management and curation
  • Many researchers contributing increments of data
  • Pooling annotation → rapid incremental publication
  • And criticism
  • Accommodates global distribution
  • Data & code travel faster and more cheaply
  • Accommodates temporal distribution
  • Researchers assemble data
  • Later (other) researchers access data

16
Data Access and Integration: challenges
A Petabyte of Digital Data per Hospital per Year
  • Scale
  • Many sites, large collections, many uses
  • Longevity
  • Research requirements outlive technical decisions
  • Diversity
  • No 'one size fits all' solution will work
  • Primary Data, Data Products, Metadata, Administrative data, ...
  • Many Data Resources
  • Independently owned & managed
  • No common goals
  • No common design
  • Work hard for agreements on foundation types and ontologies
  • Autonomous decisions change data, structure, policy, ...
  • Geographically distributed

17
Data Integration & Scientific discovery
  • Choosing data sources
  • How do you find them?
  • How do they describe and advertise them?
  • Is the equivalent of Google possible?
  • Obtaining access to that data
  • Overcoming administrative barriers
  • Overcoming technical barriers
  • Understanding that data
  • The parts you care about for your research
  • Extracting nuggets from multiple sources
  • Pieces of your jigsaw puzzle
  • Combining them using sophisticated models
  • The picture of reality in your head
  • Analysis on scales required by statistics
  • Coupling data access with computation
  • Repeated Processes
  • Examining variations, covering a set of
    candidates
  • Monitoring the emerging details
  • Coupling with scientific workflows
  • You're an innovator
  • Your model ≠ their model
  • → Negotiation & patience needed from both sides

18
Scientific Data: Opportunities & Challenges
  • Opportunities
  • Global Production of Published Data
  • Volume? Diversity?
  • Combination → Analysis → Discovery
  • Challenges
  • Data Huggers
  • Meagre metadata
  • Ease of Use
  • Optimised integration
  • Dependability

A Cornucopia of Research Challenges
  • Opportunities
  • Specialised Indexing
  • New Data Organisation
  • New Algorithms
  • Varied Replication
  • Shared Annotation
  • Intensive Data Computation
  • Challenges
  • Fundamental Principles
  • Approximate Matching
  • Multi-scale optimisation
  • Autonomous Change
  • Legacy structures
  • Scale and Longevity
  • Privacy and Mobility
  • Sustained Support / Funding

19
Requirements: User's viewpoint
  • Find Data
  • Registries & Human communication
  • Understand data
  • Metadata description; Standard / familiar formats & representations; Standard value systems & ontologies
  • Data Access
  • Find how to interact with data resource
  • Obtain permission (authority)
  • Make connection
  • Make selection
  • Move Data
  • In bulk or streamed (in increments); see the sketch below

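The user's access steps can be made concrete with a small JDBC sketch (my illustration, not part of the original slides): make a connection to a data resource, make a selection, then move the data streamed in increments rather than materialised in one piece. The database URL, table name and credentials are invented placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamedAccess {
    public static void main(String[] args) throws Exception {
        // Make connection (placeholder resource and credentials).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db.example.org/census", "alice", "secret");
             Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(1000);              // move data in increments
            // Make selection.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT region, population FROM census_2001")) {
                while (rs.next()) {               // stream rows to the client
                    System.out.println(rs.getString("region") + ": "
                            + rs.getLong("population"));
                }
            }
        }
    }
}
```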
20
Requirements: User's viewpoint (2)
  • Transform Data
  • To the format, organisation & representation required for computation or integration
  • Combine data
  • Standard DB operations & operations relevant to the application model
  • Present results
  • To humans: data movement & transform for viewing
  • To application code: data movement & transform to the required format
  • To standard analysis tools, e.g. R
  • To standard visualisation tools, e.g. Spotfire

21
Requirements: Owner's viewpoint
  • Create Data
  • Automated generation, Accession Policies,
    Metadata generation
  • Storage Resources: SRM, SRB, ...
  • Preserve Data
  • Archiving
  • Replication
  • Metadata
  • Protection
  • Provide Services with available resources
  • Definition & implementation: costs & stability
  • Resources: storage, compute & bandwidth

22
Requirements: Owner's viewpoint (2)
  • Protect Services
  • Authentication, Authorisation, Accounting, Audit
  • Reputation
  • Protect data
  • Comply with owner requirements: encryption for privacy, ...
  • Monitor and Control use
  • Detect and handle failures, attacks, misbehaving
    users
  • Plan for future loads and services
  • Establish case for Continuation
  • Usage statistics
  • Discoveries enabled

23
Significance of Data Structure
24
Why structure data?
  • It always is structured
  • Without structure it is just a bag of bits
  • Are the next 32 bits
  • An integer
  • Two integers
  • Part of a double
  • 4 characters
  • 2 characters in Unicode
  • Is this a 1D, 2D or 3D array?
  • How big is it?
  • Where is the UUID?

Of course the Author of the Application knows this
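A minimal sketch of the ambiguity (my example, not from the slides): the same four bytes read as one integer, two short integers, four ASCII characters or two Unicode characters, depending entirely on the structure the reader assumes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class AmbiguousBits {
    public static void main(String[] args) {
        byte[] raw = {0x41, 0x42, 0x43, 0x44};   // four arbitrary bytes
        ByteBuffer buf = ByteBuffer.wrap(raw);

        System.out.println("one int:    " + buf.getInt(0));
        System.out.println("two shorts: " + buf.getShort(0) + ", " + buf.getShort(2));
        System.out.println("four chars: " + new String(raw, StandardCharsets.US_ASCII));
        System.out.println("two UTF-16: " + buf.getChar(0) + ", " + buf.getChar(2));
        // ...or these bytes could be half of an IEEE double, a length
        // field, or a fragment of a UUID: the bits alone don't say.
    }
}
```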
25
More interesting questions
  • How do you discover the structure?
  • If the application developer isn't available
  • They are virtually never available
  • There were lots of them who made changes
  • Perhaps a community has defined the structure
  • Then communicated it among themselves
  • How do you find that community?

26
More interesting questions 2
  • Perhaps the structure description is written with the data
  • Binary data at start of file(s)
  • Binary data in another file
  • How do you know the relationship between the files?
  • Binary data among the other data
  • How do you find it?
  • How do you find these binary structure descriptions?
  • How do you interpret them?

27
Structure Described textually
  • Binary data is efficient
  • TRY: a separate textual description
  • E.g. MIME types
  • Bespoke structural description language
  • Product specific
  • Computing language specific
  • Application community specific
  • Attempt a standard data structure description language
  • E.g. GGF DFDL
  • Still have to discover which description applies to which data
  • A binding problem (see the sketch below)
  • Still have to understand the names & interpretation
  • E.g. a field described 'Distance IEEE64bitFloat'
  • Which distance?
  • What units?
  • When measured?

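A toy illustration of the idea (the one-line-per-field description language here is invented for this sketch; DFDL itself is a richer, XML-based standard): a textual description fixes the names and binary encodings of the fields, yet still says nothing about which distance, in what units, measured when.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

public class DescribedBinary {
    // Decode binary data according to a textual field-by-field description.
    static Map<String, Object> decode(String description, ByteBuffer data) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (String line : description.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            String name = parts[0], format = parts[1];
            switch (format) {
                case "IEEE64bitFloat": record.put(name, data.getDouble()); break;
                case "Int64":          record.put(name, data.getLong());   break;
                case "Int32":          record.put(name, data.getInt());    break;
                default: throw new IllegalArgumentException("unknown format " + format);
            }
        }
        return record;
    }

    public static void main(String[] args) {
        // The binding problem in miniature: we must already know that
        // this description belongs to these bytes, and we still have to
        // guess what 'distance' means semantically.
        ByteBuffer data = ByteBuffer.allocate(16)
                .putDouble(42.195).putLong(1153094400L);
        data.flip();
        System.out.println(decode("distance IEEE64bitFloat\ntimestamp Int64", data));
    }
}
```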
28
Textual data is easy to use
  • Humans can read & write it
  • Though there is a limit as to how much!
  • Humans can edit it
  • Though they make errors & break structure
  • It allows structural flexibility & extension
  • The structure may be implicit
  • E.g. a standard natural language text
  • A popular format maintained by user discipline
  • A format maintained by tools
  • E.g. mail message headers
  • That then make the structure explicit & maintained

29
Structured textual data
  • Semistructured data
  • May use layout and tags to code structure
  • E.g. field-name: text, newline
  • E.g. column names, newline, comma-separated values, newline, ...
  • E.g. XML tag pairs
  • Structure may be applied more or less consistently
  • This may be improved with a schema
  • AND schema checking (sketched below)
  • E.g. XML schema, e.g. XSD
  • Another binding problem: which schema controls which document?
  • May be some implicit rules
  • E.g. XML tag pairing
  • Structure may be partially inferred
  • E.g. recognise integers
  • With textual exceptions, e.g. 'not yet known'

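A sketch of schema checking using the standard Java XML validation API; the schema and document file names are placeholders. Note that the binding problem shows up here too: it is the program, not the document, that decides which schema applies.

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.File;

public class SchemaCheck {
    public static void main(String[] args) throws Exception {
        // Load the XML Schema (placeholder file name).
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("observations.xsd"));
        Validator validator = schema.newValidator();
        try {
            // Check the document's structure against the schema.
            validator.validate(new StreamSource(new File("observations.xml")));
            System.out.println("document conforms to schema");
        } catch (org.xml.sax.SAXException e) {
            System.out.println("structure violation: " + e.getMessage());
        }
    }
}
```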
30
Databases provide some structure
  • Manage data
  • Manage description of structure
  • Schema (logical and physical metadata)
  • Constraints
  • Authorisation rules
  • Manage storage
  • Often efficient layout: binary / compressed
  • Manage Privacy
  • E.g. guarantee encryption
  • Provide operations
  • Queries, updates, bulk loads, rule checks, stored procedures

Interpretation challenges remain
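For example, a database can be asked to describe its own structure instead of leaving the reader to guess at the layout of the stored bytes. This sketch uses standard JDBC metadata calls; the connection URL, credentials and table name are invented.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class DescribeTable {
    public static void main(String[] args) throws Exception {
        // Placeholder resource and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/sky_survey", "reader", "secret")) {
            DatabaseMetaData meta = conn.getMetaData();
            // Ask the database for the columns of a (hypothetical) table.
            try (ResultSet cols = meta.getColumns(null, null, "observations", null)) {
                while (cols.next()) {
                    System.out.printf("%-20s %s%n",
                            cols.getString("COLUMN_NAME"),
                            cols.getString("TYPE_NAME"));
                }
            }
        }
    }
}
```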
31
Exploit structure
  • Go directly to parts of data (sketched below)
  • Extract relevant parts
  • Transform during this process
  • Generate descriptions of data structure
  • Store bindings between
  • Structure description and data
  • Transfer smaller volumes of data
  • Compress by exploiting structure
  • Aids to interpretation
  • Require a structural foundation

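A small sketch of going directly to the relevant part (the file name and record layout are invented): with a known fixed-size record structure, a single seek replaces a scan of the whole file, so only the needed bytes are read or transferred.

```java
import java.io.RandomAccessFile;

public class DirectAccess {
    static final int RECORD_SIZE = 16;   // one double + one long per record

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile("observations.bin", "r")) {
            long index = 1_000_000;              // the one record we want
            f.seek(index * RECORD_SIZE);         // jump straight to it
            double flux = f.readDouble();        // structure tells us the types
            long timestamp = f.readLong();
            System.out.println("flux=" + flux + " at t=" + timestamp);
        }
    }
}
```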
32
Strategies for Data integration
33
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Use a self-administered workflow
  • Use a scripted workflow
  • Use data virtualisation services

34
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Easiest as pre-packaged
  • Web-based form interfaces
  • E.g. for BLAST jobs at EBI
  • Now may be provided as Web Services
  • Accessed by client portal
  • E.g. Initiating BLAST runs in BRIDGES project
  • No multi-source data integration
  • Unless provided by Data Owner
  • Opportunity for discovery restricted to that data
  • Use a self-administered workflow
  • Use a scripted workflow
  • Use data virtualisation services

35
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Use a self-administered workflow
  • Use a sequence of Services
  • Plus own data
  • Organise each step
  • Collect and manage intermediate results
  • Organise integration processes manually
  • Common strategy
  • Very laborious
  • Error prone
  • Tedious repetition
  • Hard to provide to other researchers
  • Use a scripted workflow
  • Use data virtualisation services

36
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Use a self-administered workflow
  • Use a scripted workflow
  • Describe the steps in a Scripting Language (sketched below)
  • Steps performed by a Workflow Enactment Engine
  • Many languages in use
  • Trade off familiarity & availability
  • Trade off detailed control versus abstraction
  • Incrementally develop a correct process
  • Sharable & Editable
  • Basis for scientific communication & validation
  • Valuable IPR asset
  • Repetition is now easy
  • Parameterised explicitly & implicitly
  • Use data virtualisation services

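The essence of a scripted workflow, reduced to a few lines of Java (an illustration only; the step names are invented, and real enactment engines such as those in the table below add distribution, provenance and fault handling): an explicit, editable, parameterised sequence of named steps that can be shared and re-run at will.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class TinyWorkflow {
    public static void main(String[] args) {
        // Named steps in order; each placeholder step maps an input
        // file to an output file, standing in for a real tool run.
        Map<String, UnaryOperator<Path>> steps = new LinkedHashMap<>();
        steps.put("fetch sequences", in -> Path.of("sequences.fasta"));
        steps.put("run alignment",   in -> Path.of("alignment.out"));
        steps.put("summarise",       in -> Path.of("summary.txt"));

        Path data = Path.of("query.txt");          // parameterised input
        for (Map.Entry<String, UnaryOperator<Path>> step : steps.entrySet()) {
            System.out.println("enacting: " + step.getKey());
            data = step.getValue().apply(data);    // each step feeds the next
        }
        System.out.println("final result in " + data);
    }
}
```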
37
Workflow Systems
Language      | WF Enactment     | Comments
Shell scripts | Shell & OS       | Common, but not often thought of as WF. Depends on context, e.g. NFS across all sites
Perl          | Perl runtime     | Popular in bioinformatics. Similar context dependence; distribution has to be coded
Java          | JVM              | Popular target because of JVM ubiquity; similar dependence; distribution has to be coded
BPEL          | BPEL Enactment   | OASIS standard for industry; coordinating use of multiple Web Services; low-level detail - tools
Taverna       | Scufl            | Tuesday & Wednesday this week; http://taverna.sourceforge.net/index.php
VDT / Pegasus | Chimera & DAGman | High-level abstract formulation of workflows, automated mapping towards executable forms, cached result re-use
Kepler        | Kepler           | Tuesday & Wednesday this week; http://kepler-project.org/
38
Example Grid3 Application: NVO Mosaic Construction
Construct custom mosaics on demand from multiple data sources. The user specifies projection, coordinates, size, rotation and spatial sampling. NVO/NASA Montage: a small (1200 node) workflow.
Work by Ewa Deelman et al., USC/ISI and Caltech
39
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Use a self-administered workflow
  • Use a scripted workflow
  • Use data virtualisation services
  • Form a federation
  • Set of data resources; incremental addition
  • Registration & description of collected resources
  • Warehouse data or access dynamically to obtain updated data
  • Virtual data warehouses automating the division between collection and dynamic access
  • Describe relevant relationships between data sources
  • Incremental description refinement / correction
  • Run jobs, queries & workflows against the combined set of data resources
  • Automated distribution & transformation
  • Example systems
  • IBM's Information Integrator
  • GEON, BIRN & SEEK
  • OGSA-DAI is an extensible framework for building such systems

40
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Use a self-administered workflow
  • Use a scripted workflow
  • Use data virtualisation services
  • Arrange that multiple data services have common
    properties
  • Arrange federations of these
  • Arrange access presenting the common properties
  • Expose the important differences
  • Support integration accommodating those
    differences

41
Virtualisation variations
  • Extent to which homogeneity is obtained
  • Regular representation choices, e.g. units
  • Consistent ontologies
  • Consistent data model
  • Consistent schema & integrated super-schema
  • DB operations supported across the federation
  • Ease of adding federation elements
  • Ease of accommodating change as federation members change their schema and policies
  • Drill-through to primary forms supported

42
Metadata
43
Metadata Definition
  • Metadata is data that describes other data
  • Any property of the other data
  • Structure
  • Physical organisation
  • Usage and storage policies
  • Destruction policies
  • Privacy and legal constraints
  • Provenance
  • Aids to interpretation
  • Known uses and users

One person's metadata can be another person's data
44
Challenges for metadata
  • All the challenges of Data
  • E.g. authorisation, privacy, dependable storage, ...
  • Managing changes, quality, ...
  • The binding between Data & Metadata
  • What metadata describes this data?
  • What data does this metadata describe?
  • Specific data
  • All the data about a particular topic
  • All the data that will be produced in a particular way
  • Good abstractions for using data & metadata together
  • Good mechanisms for generating metadata
  • Automation & incentives

45
Metadata modes of use & creation
  • Generate Metadata
  • Then generate and store data that complies
  • Generate Metadata & Data
  • At the same time
  • Atomic operation
  • Have already a collection of data
  • And some metadata, e.g. structural
  • Mine or generate further information about the data
  • Store that as additional metadata
  • Note: constructing bindings in each case
  • Must maintain stable and accurate bindings

46
Modes of using metadata
  • Query or search metadata
  • Use this to find specific parts
  • Browse metadata (after query)
  • To understand data
  • To consider exploitation strategies
  • Create indexes
  • Use these to accelerate algorithms
  • This should be done more often!
  • Applications & tools read metadata (sketched below)
  • Use it to drive selections, mappings, presentations, ...
  • E.g. use it to generate detailed workflows from abstract workflows
  • E.g. construct wrappers and data transformers

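A toy metadata catalogue to make the pattern concrete (all dataset names and URLs are invented): querying the small metadata first determines which large data to fetch, and each entry carries the binding from dataset identifier to location and units.

```java
import java.util.List;

public class MetadataCatalogue {
    // One catalogue entry: the binding from a dataset identifier to
    // descriptive properties and a location.
    record Entry(String dataset, String topic, String units, String url) {}

    static final List<Entry> CATALOGUE = List.of(
            new Entry("sky-2006-03", "sky survey", "mJy",
                      "http://archive.example.org/sky-2006-03"),
            new Entry("census-2001", "census", "persons",
                      "http://archive.example.org/census-2001"));

    public static void main(String[] args) {
        // Query the small metadata first; only then touch the large data.
        CATALOGUE.stream()
                 .filter(e -> e.topic().equals("sky survey"))
                 .forEach(e -> System.out.println(
                         "fetch " + e.url() + " (values in " + e.units() + ")"));
    }
}
```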
47
Views of OGSA-DAI
48
Simple Intermediary Pattern
49
Persistent Intermediary Pattern
50
Redirector Pattern
51
Redirector: OGSA-DAI as the consumer
52
Coordinator Pattern
53
Data Assembly Pattern
54
Pattern | Features | Facilities provided
Intermediary | Data service interposed between client applications and data resource | Consistent interface for different kinds of data; data filtering, sampling, transformation, composition and transport; movement of computation to data; latency reduction from multiple actions per request; authorisation gateways; sessions and concurrency via pipelining and parallelism.
Persistent Intermediary | Data storage permits results to be used by subsequent requests | As above, plus the assembly and caching of results for use by subsequent requests, providing replication, snapshots and acceleration.
Redirector | Third-party data transfers | As above, plus reduction in data transport costs through (a) using protocols suited to the data and recipient and (b) avoiding transfer via intermediaries and double handling.
Coordinator | Multiple data resources per data service | As above, plus integration of data from these resources, efficient movement of data between them and transactional integrity of multiple-resource operations.
Data Assembler | Data services using other data services as well as data resources | As above, plus data federation, distributed query and distributed transactions.
55
Integrated service for Data & Metadata
56
?
Picture composition by Luke Humphry, based on prior art