Transcript and Presenter's Notes

Title: Monday 17th July
1
Session 25: Monday 17th July
Malcolm Atkinson
2
Distributed Structured Data Management: Introduction, Principles & Foundations
3
Introduction to Structured Data in Grids
  • Reminders: Distributed Systems & Data scale
  • Significance of Structure
  • Strategies for Data Integration
  • Metadata Challenges
  • A view of OGSA-DAI

4
Reminders: Distributed Systems, Data scale & Data opportunities
5
Foundations of Collaboration
  • Strong commitment by individuals
  • To work together
  • To take on communication challenges
  • Mutual respect & mutual trust
  • Distributed technology
  • To support information interchange
  • To support resource sharing
  • To support data integration
  • To support trust building
  • Sufficient time
  • Common goals
  • Complementary knowledge, skills & data

Can we predict when it will work? Can we find remedies when it doesn't?
6
A strategy that works well
  • Collaboratively constructed
  • Shared access
  • Data Resources
  • Sequence databases
  • Protein structure and Crystallography databases
  • Sky Surveys
  • Census data
  • Zoo DB
  • Mouse Atlas

Works better when linked to Funding & Publication. But funding the maintenance?
7
Works better with an organising nucleus
  • EBI
  • BIRN
  • GEON
  • SEEK / Species 2000
  • IVOA
  • CaBIG

Helping to Organise: Giving user support, Establishing standards, Sharing methods
8
Principles of Distributed Computing
  • Issues you can't avoid
  • Lack of Complete Knowledge (LOCK)
  • Latency
  • Heterogeneity
  • Autonomy
  • Unreliability
  • Change
  • A Challenging goal
  • balance technical feasibility
  • against virtual homogeneity, stability and
    reliability
  • Appropriate balance between usability and
    productivity
  • while remaining affordable, manageable and
    maintainable

This is NOT easy
9
Compound Causes of Data Growth
  • Faster devices
  • Cheaper devices
  • Higher-resolution
  • all Moore's law
  • Increased processor throughput
  • → more derived data
  • Cheaper higher-volume storage
  • Remote data more accessible
  • Public policy to make research data available
  • Bandwidth increases
  • Latency doesn't get less though

10
Motivation: Data Curation, shared Data integration, Data opportunities
11
Interpretational Opportunities & Challenges
  • Finding & Accessing data
  • Variety of mechanisms & policies
  • Interpreting data
  • Variety of forms, value systems & ontologies
  • Independent provision & ownership
  • Autonomous changes in availability, form, policy, ...
  • Processing data
  • Understanding how it may be related
  • Devising models that expose the relationships
  • Presenting results
  • Humans need either
  • Small volumes of derived statistics
  • Visualisations

12
Interpretational Opportunities & Challenges
Variety & Autonomy: Essential
13
Interpretational Opportunities & Challenges
Standards & Collaboration: Essential
14
Data Access and Integration: motives
  • Key to Integration of Scientific Methods
  • Publication and sharing of results
  • Primary data from observation, simulation & experiment
  • Encourages novel uses
  • Allows validation of methods and derivatives
  • Enables discovery by combining data independently
    collected

and Decisions!
15
Data Access and Integration: motives
  • Key to Large-scale Collaboration
  • Economies: data production, publication & management
  • Sharing cost of storage, management and curation
  • Many researchers contributing increments of data
  • Pooling annotation → rapid incremental publication
  • And criticism
  • Accommodates global distribution
  • Data & code travel faster and more cheaply
  • Accommodates temporal distribution
  • Researchers assemble data
  • Later (other) researchers access data

16
Data Access and Integration: challenges
A Petabyte of Digital Data per Hospital per Year
  • Scale
  • Many sites, large collections, many uses
  • Longevity
  • Research requirements outlive technical decisions
  • Diversity
  • No 'one size fits all' solution will work
  • Primary Data, Data Products, Metadata, Administrative data, ...
  • Many Data Resources
  • Independently owned & managed
  • No common goals
  • No common design
  • Work hard for agreements on foundation types and ontologies
  • Autonomous decisions change data, structure, policy, ...
  • Geographically distributed

17
Data Integration & Scientific discovery
  • Choosing data sources
  • How do you find them?
  • How do they describe and advertise them?
  • Is the equivalent of Google possible?
  • Obtaining access to that data
  • Overcoming administrative barriers
  • Overcoming technical barriers
  • Understanding that data
  • The parts you care about for your research
  • Extracting nuggets from multiple sources
  • Pieces of your jigsaw puzzle
  • Combining them using sophisticated models
  • The picture of reality in your head
  • Analysis on scales required by statistics
  • Coupling data access with computation
  • Repeated Processes
  • Examining variations, covering a set of
    candidates
  • Monitoring the emerging details
  • Coupling with scientific workflows
  • You're an innovator
  • Your model ≠ their model
  • → Negotiation & patience needed from both sides

18
Scientific Data: Opportunities & Challenges
  • Opportunities
  • Global Production of Published Data
  • Volume? Diversity?
  • Combination → Analysis → Discovery
  • Challenges
  • Data Huggers
  • Meagre metadata
  • Ease of Use
  • Optimised integration
  • Dependability

A Cornucopia of Research Challenges
  • Opportunities
  • Specialised Indexing
  • New Data Organisation
  • New Algorithms
  • Varied Replication
  • Shared Annotation
  • Intensive Data Computation
  • Challenges
  • Fundamental Principles
  • Approximate Matching
  • Multi-scale optimisation
  • Autonomous Change
  • Legacy structures
  • Scale and Longevity
  • Privacy and Mobility
  • Sustained Support / Funding

19
Requirements: User's viewpoint
  • Find Data
  • Registries & Human communication
  • Understand data
  • Metadata description; Standard / familiar formats & representations; Standard value systems & ontologies
  • Data Access
  • Find how to interact with data resource
  • Obtain permission (authority)
  • Make connection
  • Make selection
  • Move Data
  • In bulk or streamed (in increments); see the sketch below

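The user's access steps can be made concrete with a small JDBC sketch (my illustration, not part of the original slides): make a connection to a data resource, make a selection, then move the data streamed in increments rather than materialised in one piece. The database URL, table name and credentials are invented placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamedAccess {
    public static void main(String[] args) throws Exception {
        // Make connection (placeholder resource and credentials).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db.example.org/census", "alice", "secret");
             Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(1000);              // move data in increments
            // Make selection.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT region, population FROM census_2001")) {
                while (rs.next()) {               // stream rows to the client
                    System.out.println(rs.getString("region") + ": "
                            + rs.getLong("population"));
                }
            }
        }
    }
}
```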
20
Requirements: User's viewpoint (2)
  • Transform Data
  • To the format, organisation & representation required for computation or integration
  • Combine data
  • Standard DB operations & operations relevant to the application model
  • Present results
  • To humans: data movement & transform for viewing
  • To application code: data movement & transform to the required format
  • To standard analysis tools, e.g. R
  • To standard visualisation tools, e.g. Spotfire

21
Requirements: Owner's viewpoint
  • Create Data
  • Automated generation, Accession Policies,
    Metadata generation
  • Storage Resources: SRM, SRB, ...
  • Preserve Data
  • Archiving
  • Replication
  • Metadata
  • Protection
  • Provide Services with available resources
  • Definition & implementation: costs & stability
  • Resources: storage, compute & bandwidth

22
Requirements: Owner's viewpoint (2)
  • Protect Services
  • Authentication, Authorisation, Accounting, Audit
  • Reputation
  • Protect data
  • Comply with owner requirements: encryption for privacy, ...
  • Monitor and Control use
  • Detect and handle failures, attacks, misbehaving
    users
  • Plan for future loads and services
  • Establish case for Continuation
  • Usage statistics
  • Discoveries enabled

23
Significance of Data Structure
24
Why structure data?
  • It always is structured
  • Without structure it is just a bag of bits
  • Are the next 32 bits
  • An integer
  • Two integers
  • Part of a double
  • 4 characters
  • 2 characters in Unicode
  • Is this a 1D, 2D or 3D array?
  • How big is it?
  • Where is the UUID?

Of course the Author of the Application knows this
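A minimal sketch of the ambiguity (my example, not from the slides): the same four bytes read as one integer, two short integers, four ASCII characters or two Unicode characters, depending entirely on the structure the reader assumes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class AmbiguousBits {
    public static void main(String[] args) {
        byte[] raw = {0x41, 0x42, 0x43, 0x44};   // four arbitrary bytes
        ByteBuffer buf = ByteBuffer.wrap(raw);

        System.out.println("one int:    " + buf.getInt(0));
        System.out.println("two shorts: " + buf.getShort(0) + ", " + buf.getShort(2));
        System.out.println("four chars: " + new String(raw, StandardCharsets.US_ASCII));
        System.out.println("two UTF-16: " + buf.getChar(0) + ", " + buf.getChar(2));
        // ...or these bytes could be half of an IEEE double, a length
        // field, or a fragment of a UUID: the bits alone don't say.
    }
}
```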
25
More interesting questions
  • How do you discover the structure?
  • If the application developer isn't available
  • They are virtually never available
  • There were lots of them who made changes
  • Perhaps a community has defined the structure
  • Then communicated it among themselves
  • How do you find that community?

26
More interesting questions 2
  • Perhaps the structure description is written with the data
  • Binary data at start of file(s)
  • Binary data in another file
  • How do you know the relationship between the files?
  • Binary data among the other data
  • How do you find it?
  • How do you find these binary structure descriptions?
  • How do you interpret them?

27
Structure Described textually
  • Binary data is efficient
  • TRY: a separate textual description
  • E.g. MIME types
  • Bespoke structural description language
  • Product specific
  • Computing language specific
  • Application community specific
  • Attempt a standard data structure description language
  • E.g. GGF DFDL
  • Still have to discover which description applies to which data
  • A binding problem (see the sketch below)
  • Still have to understand the names & interpretation
  • E.g. a field described 'Distance IEEE64bitFloat'
  • Which distance?
  • What units?
  • When measured?

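A toy illustration of the idea (the one-line-per-field description language here is invented for this sketch; DFDL itself is a richer, XML-based standard): a textual description fixes the names and binary encodings of the fields, yet still says nothing about which distance, in what units, measured when.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

public class DescribedBinary {
    // Decode binary data according to a textual field-by-field description.
    static Map<String, Object> decode(String description, ByteBuffer data) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (String line : description.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            String name = parts[0], format = parts[1];
            switch (format) {
                case "IEEE64bitFloat": record.put(name, data.getDouble()); break;
                case "Int64":          record.put(name, data.getLong());   break;
                case "Int32":          record.put(name, data.getInt());    break;
                default: throw new IllegalArgumentException("unknown format " + format);
            }
        }
        return record;
    }

    public static void main(String[] args) {
        // The binding problem in miniature: we must already know that
        // this description belongs to these bytes, and we still have to
        // guess what 'distance' means semantically.
        ByteBuffer data = ByteBuffer.allocate(16)
                .putDouble(42.195).putLong(1153094400L);
        data.flip();
        System.out.println(decode("distance IEEE64bitFloat\ntimestamp Int64", data));
    }
}
```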
28
Textual data is easy to use
  • Humans can read & write it
  • Though there is a limit as to how much!
  • Humans can edit it
  • Though they make errors & break structure
  • It allows structural flexibility & extension
  • The structure may be implicit
  • E.g. a standard natural language text
  • A popular format maintained by user discipline
  • A format maintained by tools
  • E.g. mail message headers
  • That then make the structure explicit & maintained

29
Structured textual data
  • Semistructured data
  • May use layout and tags to code structure
  • E.g. field-name: text, newline
  • E.g. column names, newline, comma-separated values, newline, ...
  • E.g. XML tag pairs
  • Structure may be applied more or less consistently
  • This may be improved with a schema
  • AND schema checking (sketched below)
  • E.g. XML schema, e.g. XSD
  • Another binding problem: which schema controls which document?
  • May be some implicit rules
  • E.g. XML tag pairing
  • Structure may be partially inferred
  • E.g. recognise integers
  • With textual exceptions, e.g. 'not yet known'

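A sketch of schema checking using the standard Java XML validation API; the schema and document file names are placeholders. Note that the binding problem shows up here too: it is the program, not the document, that decides which schema applies.

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.File;

public class SchemaCheck {
    public static void main(String[] args) throws Exception {
        // Load the XML Schema (placeholder file name).
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("observations.xsd"));
        Validator validator = schema.newValidator();
        try {
            // Check the document's structure against the schema.
            validator.validate(new StreamSource(new File("observations.xml")));
            System.out.println("document conforms to schema");
        } catch (org.xml.sax.SAXException e) {
            System.out.println("structure violation: " + e.getMessage());
        }
    }
}
```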
30
Databases provide some structure
  • Manage data
  • Manage description of structure
  • Schema (logical and physical metadata)
  • Constraints
  • Authorisation rules
  • Manage storage
  • Often efficient layout: binary / compressed
  • Manage Privacy
  • E.g. guarantee encryption
  • Provide operations
  • Queries, updates, bulk loads, rule checks, stored procedures

Interpretation challenges remain
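For example, a database can be asked to describe its own structure instead of leaving the reader to guess at the layout of the stored bytes. This sketch uses standard JDBC metadata calls; the connection URL, credentials and table name are invented.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class DescribeTable {
    public static void main(String[] args) throws Exception {
        // Placeholder resource and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/sky_survey", "reader", "secret")) {
            DatabaseMetaData meta = conn.getMetaData();
            // Ask the database for the columns of a (hypothetical) table.
            try (ResultSet cols = meta.getColumns(null, null, "observations", null)) {
                while (cols.next()) {
                    System.out.printf("%-20s %s%n",
                            cols.getString("COLUMN_NAME"),
                            cols.getString("TYPE_NAME"));
                }
            }
        }
    }
}
```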
31
Exploit structure
  • Go directly to parts of data (sketched below)
  • Extract relevant parts
  • Transform during this process
  • Generate descriptions of data structure
  • Store bindings between
  • Structure description and data
  • Transfer smaller volumes of data
  • Compress by exploiting structure
  • Aids to interpretation
  • Require a structural foundation

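A small sketch of going directly to the relevant part (the file name and record layout are invented): with a known fixed-size record structure, a single seek replaces a scan of the whole file, so only the needed bytes are read or transferred.

```java
import java.io.RandomAccessFile;

public class DirectAccess {
    static final int RECORD_SIZE = 16;   // one double + one long per record

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile("observations.bin", "r")) {
            long index = 1_000_000;              // the one record we want
            f.seek(index * RECORD_SIZE);         // jump straight to it
            double flux = f.readDouble();        // structure tells us the types
            long timestamp = f.readLong();
            System.out.println("flux=" + flux + " at t=" + timestamp);
        }
    }
}
```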
32
Strategies for Data integration
33
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Use a self-administered workflow
  • Use a scripted workflow
  • Use data virtualisation services

34
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Easiest as pre-packaged
  • Web-based form interfaces
  • E.g. for BLAST jobs at EBI
  • Now may be provided as Web Services
  • Accessed by client portal
  • E.g. Initiating BLAST runs in BRIDGES project
  • No multi-source data integration
  • Unless provided by Data Owner
  • Opportunity for discovery restricted to that data
  • Use a self-administered workflow
  • Use a scripted workflow
  • Use data virtualisation services

35
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Use a self-administered workflow
  • Use a sequence of Services
  • Plus own data
  • Organise each step
  • Collect and manage intermediate results
  • Organise integration processes manually
  • Common strategy
  • Very laborious
  • Error prone
  • Tedious repetition
  • Hard to provide to other researchers
  • Use a scripted workflow
  • Use data virtualisation services

36
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Use a self-administered workflow
  • Use a scripted workflow
  • Describe the steps in a Scripting Language (sketched below)
  • Steps performed by a Workflow Enactment Engine
  • Many languages in use
  • Trade off familiarity & availability
  • Trade off detailed control versus abstraction
  • Incrementally develop a correct process
  • Sharable & Editable
  • Basis for scientific communication & validation
  • Valuable IPR asset
  • Repetition is now easy
  • Parameterised explicitly & implicitly
  • Use data virtualisation services

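The essence of a scripted workflow, reduced to a few lines of Java (an illustration only; the step names are invented, and real enactment engines such as those in the table below add distribution, provenance and fault handling): an explicit, editable, parameterised sequence of named steps that can be shared and re-run at will.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class TinyWorkflow {
    public static void main(String[] args) {
        // Named steps in order; each placeholder step maps an input
        // file to an output file, standing in for a real tool run.
        Map<String, UnaryOperator<Path>> steps = new LinkedHashMap<>();
        steps.put("fetch sequences", in -> Path.of("sequences.fasta"));
        steps.put("run alignment",   in -> Path.of("alignment.out"));
        steps.put("summarise",       in -> Path.of("summary.txt"));

        Path data = Path.of("query.txt");          // parameterised input
        for (Map.Entry<String, UnaryOperator<Path>> step : steps.entrySet()) {
            System.out.println("enacting: " + step.getKey());
            data = step.getValue().apply(data);    // each step feeds the next
        }
        System.out.println("final result in " + data);
    }
}
```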
37
Workflow Systems
Language      | WF Enactment     | Comments
Shell scripts | Shell & OS       | Common, but not often thought of as WF. Depends on context, e.g. NFS across all sites
Perl          | Perl runtime     | Popular in bioinformatics. Similar context dependence; distribution has to be coded
Java          | JVM              | Popular target because of JVM ubiquity; similar dependence; distribution has to be coded
BPEL          | BPEL Enactment   | OASIS standard for industry; coordinating use of multiple Web Services; low-level detail - tools
Taverna       | Scufl            | Tuesday & Wednesday this week; http://taverna.sourceforge.net/index.php
VDT / Pegasus | Chimera & DAGman | High-level abstract formulation of workflows, automated mapping towards executable forms, cached result re-use
Kepler        | Kepler           | Tuesday & Wednesday this week; http://kepler-project.org/
38
Example Grid3 Application: NVO Mosaic Construction
Construct custom mosaics on demand from multiple data sources. The user specifies projection, coordinates, size, rotation and spatial sampling. NVO/NASA Montage: a small (1200 node) workflow.
Work by Ewa Deelman et al., USC/ISI and Caltech
39
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Use a self-administered workflow
  • Use a scripted workflow
  • Use data virtualisation services
  • Form a federation
  • Set of data resources; incremental addition
  • Registration & description of collected resources
  • Warehouse data or access dynamically to obtain updated data
  • Virtual data warehouses automating the division between collection and dynamic access
  • Describe relevant relationships between data sources
  • Incremental description refinement / correction
  • Run jobs, queries & workflows against the combined set of data resources
  • Automated distribution & transformation
  • Example systems
  • IBM's Information Integrator
  • GEON, BIRN & SEEK
  • OGSA-DAI is an extensible framework for building such systems

40
Basic Strategies for Users
  • Use a Service provided by a Data Owner
  • Use a self-administered workflow
  • Use a scripted workflow
  • Use data virtualisation services
  • Arrange that multiple data services have common
    properties
  • Arrange federations of these
  • Arrange access presenting the common properties
  • Expose the important differences
  • Support integration accommodating those
    differences

41
Virtualisation variations
  • Extent to which homogeneity is obtained
  • Regular representation choices, e.g. units
  • Consistent ontologies
  • Consistent data model
  • Consistent schema & integrated super-schema
  • DB operations supported across the federation
  • Ease of adding federation elements
  • Ease of accommodating change as federation members change their schema and policies
  • Drill-through to primary forms supported

42
Metadata
43
Metadata Definition
  • Metadata is data that describes other data
  • Any property of the other data
  • Structure
  • Physical organisation
  • Usage and storage policies
  • Destruction policies
  • Privacy and legal constraints
  • Provenance
  • Aids to interpretation
  • Known uses and users

One person's metadata can be another person's data
44
Challenges for metadata
  • All the challenges of Data
  • E.g. authorisation, privacy, dependable storage, ...
  • Managing changes, quality, ...
  • The binding between Data & Metadata
  • What metadata describes this data?
  • What data does this metadata describe?
  • Specific data
  • All the data about a particular topic
  • All the data that will be produced in a particular way
  • Good abstractions for using data & metadata together
  • Good mechanisms for generating metadata
  • Automation & incentives

45
Metadata modes of use & creation
  • Generate Metadata
  • Then generate and store data that complies
  • Generate Metadata & Data
  • At the same time
  • Atomic operation
  • Have already a collection of data
  • And some metadata, e.g. structural
  • Mine or generate further information about the data
  • Store that as additional metadata
  • Note: constructing bindings in each case
  • Must maintain stable and accurate bindings

46
Modes of using metadata
  • Query or search metadata
  • Use this to find specific parts
  • Browse metadata (after query)
  • To understand data
  • To consider exploitation strategies
  • Create indexes
  • Use these to accelerate algorithms
  • This should be done more often!
  • Applications & tools read metadata (sketched below)
  • Use it to drive selections, mappings, presentations, ...
  • E.g. use it to generate detailed workflows from abstract workflows
  • E.g. construct wrappers and data transformers

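A toy metadata catalogue to make the pattern concrete (all dataset names and URLs are invented): querying the small metadata first determines which large data to fetch, and each entry carries the binding from dataset identifier to location and units.

```java
import java.util.List;

public class MetadataCatalogue {
    // One catalogue entry: the binding from a dataset identifier to
    // descriptive properties and a location.
    record Entry(String dataset, String topic, String units, String url) {}

    static final List<Entry> CATALOGUE = List.of(
            new Entry("sky-2006-03", "sky survey", "mJy",
                      "http://archive.example.org/sky-2006-03"),
            new Entry("census-2001", "census", "persons",
                      "http://archive.example.org/census-2001"));

    public static void main(String[] args) {
        // Query the small metadata first; only then touch the large data.
        CATALOGUE.stream()
                 .filter(e -> e.topic().equals("sky survey"))
                 .forEach(e -> System.out.println(
                         "fetch " + e.url() + " (values in " + e.units() + ")"));
    }
}
```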
47
Views of OGSA-DAI
48
Simple Intermediary Pattern
49
Persistent Intermediary Pattern
50
Redirector Pattern
51
Redirector: OGSA-DAI as the consumer
52
Coordinator Pattern
53
Data Assembly Pattern
54
Pattern | Features | Facilities provided
Intermediary | Data service interposed between client applications and data resource | Consistent interface for different kinds of data; data filtering, sampling, transformation, composition and transport; movement of computation to data; latency reduction from multiple actions per request; authorisation gateways; sessions and concurrency via pipelining and parallelism.
Persistent Intermediary | Data storage permits results to be used by subsequent requests | As above, plus the assembly and caching of results for use by subsequent requests, providing replication, snapshots and acceleration.
Redirector | Third-party data transfers | As above, plus reduction in data transport costs through (a) using protocols suited to the data and recipient and (b) avoiding transfer via intermediaries and double handling.
Coordinator | Multiple data resources per data service | As above, plus integration of data from these resources, efficient movement of data between them and transactional integrity of multiple-resource operations.
Data Assembler | Data services using other data services as well as data resources | As above, plus data federation, distributed query and distributed transactions.
55
Integrated service for Data & Metadata
56
?
Picture composition by Luke Humphry, based on prior art