Exploiting Diverse Sources of Scientific Data: the vision, what has been achieved and what next

Slides: 121
Provided by: jessi161
Learn more at: http://www.nesc.ac.uk
Transcript and Presenter's Notes

Title: Exploiting Diverse Sources of Scientific Data the vision, what has been achieved and what next


1
Exploiting Diverse Sources of Scientific Data
the vision, what has been achieved and what
next…
  • Prof. Jessie Kennedy

2
Science and Scientific Data
  • Science and Scientific Data are Complex…

3
(No Transcript)
4
(No Transcript)
5
Scientific Community complex
Small Scientific Community
Individual Scientist
Large Scientific Community
Scientific Laboratory
6
(No Transcript)
7
Science and Scientific Data
  • Are continually changing
  • Conclusions become foundations for new hypotheses
  • New experiments invalidate existing knowledge
  • Knowledge is open to interpretation
  • Different opinions
  • World continually changing

8
Exploiting Diverse Sources of Scientific Data
the vision
  • To provide scientists with technological
    solutions to exploit the wealth and diversity of
    Scientific Data
  • Discovery
  • Access
  • Sharing
  • Integration/Linking
  • Analysis
  • Which would thereby improve the potential for new
    scientific discovery

9
Projects in most sciences
10
SEEK (Scientific Environment for Ecological
Knowledge) Vision
  • Research, develop, and capitalize upon advances
    in information technology to radically improve
    the type and scale of ecological science that can
    be addressed
  • Scalable synthesis

Michener
11
Data Dispersion Challenges
  • Data are massively dispersed
  • Ecological field stations and research centers
    (100s)
  • Natural history museums and biocollection
    facilities (100s)
  • Agency data collections (10s to 100s)
  • Individual scientists (1000s)
  • Maintenance must be local

Michener
12
Data Integration Challenges
  • Data are heterogeneous
  • Syntax
  • (format)
  • Schema
  • (model)
  • Semantics
  • (meaning)

Jones
13
Ecological Modeling Challenges
  • Analysis and modeling tools are
  • Specialized
  • Disconnected
  • Proprietary
  • It is
  • Difficult to revise analyses
  • Hard to document analyses
  • Impossible to reliably publish models to share
    with colleagues
  • Hard to re-use models and analyses from
    colleagues
  • Difficult to use grid-computing for demanding
    computations
  • Labor-intensive to manage data in popular
    analysis software

Michener
14
Exploiting Diverse Sources of Scientific Data
the approaches
  • Data Discovery/Access
  • Metadata
  • To describe the data sets
  • Ontologies
  • To define the terminology used
  • Standardisation of formats
  • For the exchange of data
  • Life Science Identifiers (LSIDs)
  • To uniquely identify and resolve data objects
  • Provenance of data
  • To record where the data has come from
  • And what has happened to it en route.
  • GRID/Web technology
  • Distributed data management

15
Exploiting Diverse Sources of Scientific Data
the approaches
  • Data Integration/Linking
  • Metadata
  • To know how to interpret the data sets
  • Ontologies
  • To know how data in the data sets might be
    related
  • To aid automatic transformation of the data
  • Standardisation of formats
  • To ease integration
  • Life Science Identifiers (LSIDs)
  • To know when 2 things are the same
  • Workflows
  • To enable refinement and repetition of integration

16
Exploiting Diverse Sources of Scientific Data
the approaches
  • Data Analysis
  • Metadata
  • To know how to interpret the data sets
  • Ontologies
  • To know analytical/transformation processes
    appropriate
  • Workflow Tools
  • To ease analytical processes
  • Recording/reuse of analytical processes
  • Provenance
  • Recording life history of data
  • To enable validation

17
Exploiting Diverse Sources of Scientific Data
the technologies
  • Standardisation of formats
  • Metadata
  • Ontologies
  • Life Science Identifiers (LSIDs)
  • Provenance
  • Workflow Tools
  • GRID/Web technology

19
Meta Data the vision
  • Meta data - "data about data"
  • keywords, title, creator ….
  • If scientists marked up their data with the
    agreed meta data it would be trivial to find
    highly relevant data (sub-)sets for analysis…
  • Meta-utopia….

20
Meta-utopia
  • A world of complete, reliable metadata.
  • In meta-utopia,
  • Everyone uses the same language
  • and means the same thing…
  • The guardians of epistemology have rationally
    mapped out a schema or hierarchy of ideas.
  • that everyone adheres to…
  • Scientists accurately describe their methods,
    processes and results.
  • so anyone can do anything with it in the future…

Cory Doctorow
21
Meta Data the approach
  • Common language
  • XML Schemas to describe data/meta data
  • Domain specific exchange schemas
  • Explosion of these in every domain
  • Exchanging data
  • Archiving data

22
Ecological Metadata Language
  • A look inside the meta-utopia of ecology

23
Identification dataset elements
24
Identification resource elements
25
Identification party elements
26
Discovery coverage elements
Geographic Temporal Taxonomic
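As a rough illustration of the three coverage dimensions named on this slide, the sketch below builds a coverage fragment with Python's standard `xml.etree.ElementTree`. The element names follow the EML coverage module as best recalled and should be checked against the EML schema before use; the site description, coordinates, dates and taxon are invented for the example.

```python
import xml.etree.ElementTree as ET


def build_coverage(description, west, east, north, south, begin, end, rank, taxon):
    """Build an EML-style <coverage> element (illustrative, not schema-validated)."""
    coverage = ET.Element("coverage")

    # Geographic coverage: free-text description plus a bounding box.
    geo = ET.SubElement(coverage, "geographicCoverage")
    ET.SubElement(geo, "geographicDescription").text = description
    box = ET.SubElement(geo, "boundingCoordinates")
    for tag, value in (
        ("westBoundingCoordinate", west),
        ("eastBoundingCoordinate", east),
        ("northBoundingCoordinate", north),
        ("southBoundingCoordinate", south),
    ):
        ET.SubElement(box, tag).text = str(value)

    # Temporal coverage: a date range.
    temporal = ET.SubElement(coverage, "temporalCoverage")
    dates = ET.SubElement(temporal, "rangeOfDates")
    for tag, value in (("beginDate", begin), ("endDate", end)):
        ET.SubElement(ET.SubElement(dates, tag), "calendarDate").text = value

    # Taxonomic coverage: one rank/value pair.
    taxonomic = ET.SubElement(coverage, "taxonomicCoverage")
    classification = ET.SubElement(taxonomic, "taxonomicClassification")
    ET.SubElement(classification, "taxonRankName").text = rank
    ET.SubElement(classification, "taxonRankValue").text = taxon
    return coverage


# All values below are hypothetical.
coverage = build_coverage(
    "Hypothetical field site", -3.3, -3.1, 56.0, 55.9,
    "2005-01-01", "2005-12-31", "Species", "Picea rubens",
)
print(ET.tostring(coverage, encoding="unicode"))
```

Even this small fragment shows how quickly descriptive metadata grows once each dimension is spelled out.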
27
Evaluation Level Information
28
Evaluation Method Information
29
Evaluation Project Information
L3
30
Access Permissions Information
L4
31
Access Physical Information
32
Access Physical formatting details
33
Access Distribution Information
L4
34
Integration Level Information
35
Integration Level Attribute structure
36
Integration Level attribute domains
37
Integration Level attribute domains
38
Integration Level measurementScale
39
Meta Data the approach
  • Common language
  • XML Schemas to describe data/meta data
  • Domain specific exchange schemas
  • Explosion of these in every domain
  • Exchanging data
  • Archiving data
  • Turned into extensive specifications
  • Difficult to know where to stop…

40
  • but even this wasn't enough…..
  • It's not good enough to have meta-data, we need
    to know what the terms in the meta-data (schema
    or data values) mean.

41
Ontologies the vision
  • If we understood the meaning of the schema and
    the terms used in the meta-data or databases we
    would be able to
  • find things more reliably,
  • integrate things more easily,
  • reason about what things are comparable….
  • because we have support for automatic inference

42
Ontologies the approach
  • Common Language…
  • OWL?
  • RDF, OWL lite, OWL DL, OWL full…..
  • Domain specific ontologies
  • or project specific?
  • Map different ontologies
  • Modularise the ontologies
  • Reuse..
  • Build upper ontologies to which domain ontologies
    extend/link

43
Biodiversity Base Ontology
44
Core Layer
45
BDI Core Taxon Name
46
BDI Core Taxon Concept
47
BDI Core BioSpecimen
48
BDI Core BioObservation
Similar to…
49
SEEK Observation ontology
Josh Madin
50
entity
An extension point for domain-specific terms
Josh Madin
51
Characteristic
Josh Madin
52
Measurement standard
Similar to…
All the units, scales, indices, classifications,
and lists used for measuring a characteristic
Josh Madin
53
Biosphere, Data, Data Center, Human Activity,
Material Thing, Numerics, Sensor, Space, Time,
Units,
Earth Realm, Physical Phenomena, Physical Process,
Physical Property, Physical Substance, Sun Realm
Takes us back to…
54
BDI Taxon Concept Ontology
…is really just a schema for representing…
55
Biological Taxonomy
  • Classify and name all organisms in the world
  • So we can talk about them, experiment with them
  • Do life science…
  • The longest running attempt at building an
    ontology?
  • Linnaeus' binomial system of nomenclature started
    in 1758
  • An attempt to resolve a long standing problem in
    biology
  • Many ways to classify things
  • Understanding continually changes with new
    discoveries and technologies
  • Classifications continually being redone
  • New things defined, New definitions given for
    things in existence
  • Lots of classifications over time
  • Many compete at any one point in time

56
Taxonomic history of imaginary genus Aus L. 1758
[Figure: classification trees for 5 revisions of Aus (Linnaeus 1758, Archer 1965, Fry 1989, Tucker 1991, Pargiter 2003) and 1 name spelling change (Pyle 1990).]
Names appearing in the trees: Aus L.1758, Aus aus L.1758, Aus bea Archer 1965, Aus cea BFry 1989, Aus ceus BFry 1989, Xus Pargiter 2003, Xus beus (Archer) Pargiter 2003.
Pyle 1990: Aus bea and Aus cea noted as invalid names and
replaced with Aus beus and Aus ceus.
57
Taxonomic history of imaginary genus Aus L. 1758
[Figure: classification trees for the revisions Linnaeus 1758, Archer 1965, Fry 1989, Tucker 1991 and Pargiter 2003, with the name spelling change of Pyle 1990.]
Names appearing in the trees: Aus L.1758, Aus aus L.1758, Aus bea Archer 1965, Aus cea BFry 1989, Aus ceus BFry 1989, Xus Pargiter 2003, Xus beus (Archer) Pargiter 2003.
  • 8 Names
  • 2 genus
  • 6 species

Aus bea and Aus cea noted as invalid names and
replaced with Aus beus and Aus ceus.
58
Results in many concepts for each name
N0 - Aus L.1758
  • C0.1 - Aus L.1758 sec. Linnaeus 1758
  • C0.2 - Aus L.1758 sec. Archer 1965
  • C0.3 - Aus L.1758 sec. Fry 1989
  • C0.4 - Aus L.1758 sec. Tucker 1991
  • C0.5 - Aus L.1758 sec. Pargiter 2003
N1 - Aus aus L.1758
  • C1.1 - Aus aus L.1758 sec. Linnaeus 1758
  • C1.2 - Aus aus L.1758 sec. Archer 1965
  • C1.3 - Aus aus L.1758 sec. Fry 1989
  • C1.4 - Aus aus L.1758 sec. Tucker 1991
  • C1.5 - Aus aus L.1758 sec. Pargiter 2003
N2 - Aus bea Archer 1965
  • C2.2 - Aus bea Archer 1965 sec. Archer 1965
  • C2.3 - Aus bea Archer 1965 sec. Fry 1989
N3 - Aus cea Fry 1989
  • C3.3 - Aus cea Fry 1989 sec. Fry 1989
  • C3.4 - Aus cea Fry 1989 sec. Tucker 1991
N4 - Aus beus Archer 1965
N5 - Aus ceus Fry 1989
  • C5.5 - Aus ceus Fry 1989 sec. Fry 1989
N6 - Xus beus Pargiter 2003
  • C6.5 - Xus beus Pargiter 2003 sec. Pargiter 2003
N7 - Xus Pargiter 2003
  • C7.5 - Xus Pargiter 2003 sec. Pargiter 2003
8 Names 17 Concepts
59
Possible interpretations of Aus aus L. 1758
  • Request data sets about Aus aus (N1)
  • what's returned?
  • Original concept: C1.1
  • Most recent concept: C1.5
  • Preferred Authority (e.g. Fry 1989): C1.3
  • Everything ever named N1:
    Union(C1.1,C1.2,C1.3,C1.4,C1.5)
  • Best fit according to some matching algorithm:
    Best(C1.1,C1.2,C1.3,C1.4,C1.5)
  • New concept containing only those features common
    to all concepts with the name N1:
    Intersection(C1.1,C1.2,C1.3,C1.4,C1.5)
  • Is it appropriate to link or merge data on this?
  • Depends on the user's purpose
  • Level of precision required

[Figure: name N1 (Aus aus L.1758) linked to its concepts C1.1, C1.2, C1.3, C1.4 and C1.5]
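The interpretations listed on this slide can be sketched as set operations. The example below is only an illustration: it assumes each concept can be represented by the set of specimens it circumscribes, and the specimen identifiers (`s1` to `s5`) and the `resolve` function are invented. The "best fit" option is left out because the slide does not say which matching algorithm would be used.

```python
# Hypothetical specimen sets circumscribed by each concept of the name N1
# (Aus aus L.1758). The memberships are invented for illustration.
concepts = {
    "C1.1": {"s1", "s2", "s3"},  # sec. Linnaeus 1758 (original concept)
    "C1.2": {"s1", "s2"},        # sec. Archer 1965
    "C1.3": {"s1", "s2", "s4"},  # sec. Fry 1989
    "C1.4": {"s1", "s2", "s4"},  # sec. Tucker 1991
    "C1.5": {"s1", "s4", "s5"},  # sec. Pargiter 2003 (most recent concept)
}
revision_order = ["C1.1", "C1.2", "C1.3", "C1.4", "C1.5"]


def resolve(strategy, preferred=None):
    """Return the specimens matched by a request for N1 under one interpretation."""
    if strategy == "original":
        return concepts[revision_order[0]]
    if strategy == "most_recent":
        return concepts[revision_order[-1]]
    if strategy == "preferred":
        return concepts[preferred]
    if strategy == "union":
        return set().union(*concepts.values())
    if strategy == "intersection":
        return set.intersection(*concepts.values())
    raise ValueError(f"unknown strategy: {strategy!r}")


for strategy in ("original", "most_recent", "union", "intersection"):
    print(strategy, sorted(resolve(strategy)))
print("preferred (Fry 1989)", sorted(resolve("preferred", preferred="C1.3")))
```

The same request returns between one and five specimens depending on the interpretation chosen, which is why the appropriate choice depends on the user's purpose.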
60
Classifications synonymy relationships between
concepts and names.
Parent child relationships in 5 revisions
Names for each of the concepts
In the literature taxonomists tell us names that
are synonymous with their concepts
61
Classifications synonymy relationships between
concepts and names.
Which can result in anything being returned for
Aus aus by traversing the synonymy links
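A minimal sketch of why traversal can return almost anything: if synonymy is stored as plain name-to-name links and followed transitively, a query for one name reaches every name connected to it. The links below are hypothetical (the slide does not list the actual synonymy assertions), and `reachable_names` is an illustrative helper, not part of any existing tool.

```python
from collections import deque

# Hypothetical synonymy assertions between names from the Aus example.
# Each pair means "some taxonomist listed these two names as synonyms".
synonym_links = [
    ("Aus aus", "Aus bea"),
    ("Aus bea", "Aus beus"),
    ("Aus beus", "Xus beus"),
    ("Aus aus", "Aus cea"),
    ("Aus cea", "Aus ceus"),
]


def build_graph(links):
    """Turn pairs of names into an undirected adjacency map."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable_names(start, graph):
    """Breadth-first traversal: every name reachable through synonymy links."""
    seen = {start}
    queue = deque([start])
    while queue:
        name = queue.popleft()
        for neighbour in graph.get(name, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


graph = build_graph(synonym_links)
print(sorted(reachable_names("Aus aus", graph)))
```

With these invented links a query for Aus aus reaches all six species names, including a name in a different genus.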
62
Classifications with set relationships between
concepts.
We can build systems to return data fit for
purpose
What we need are the set relationships from
concepts in a revision to earlier concepts
and name changes related to earlier names
[Figure: concepts C0.1 to C0.5, C1.1 to C1.5, C2.2, C2.3, C3.3, C3.4, C5.5, C6.5 and C7.5 and names N0 to N7, with the relationships between them marked "?"]
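One way to picture a "set relationship" between two concepts is to compare the members each concept circumscribes. The sketch below assumes concepts can be modelled as sets of specimens; the relationship labels are a commonly used vocabulary assumed here, not taken from the slides, and the specimen identifiers are invented.

```python
def set_relationship(a, b):
    """Name the relationship of concept a to concept b, given their member sets."""
    if a == b:
        return "congruent"
    if a > b:          # a is a proper superset of b
        return "includes"
    if a < b:          # a is a proper subset of b
        return "is included in"
    if a & b:          # some shared members, neither contains the other
        return "overlaps"
    return "excludes"


# Hypothetical circumscriptions of Aus aus in three revisions.
fry_1989 = {"s1", "s2", "s4"}
tucker_1991 = {"s1", "s2", "s4"}
pargiter_2003 = {"s1", "s4", "s5"}

print(set_relationship(tucker_1991, fry_1989))       # same members
print(set_relationship(pargiter_2003, tucker_1991))  # partly shared members
```

Recording such a relationship for each concept in a revision against the earlier concepts is what would let a system decide which data can safely be merged.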
63
Real Taxonomic Revisions
  • German mosses
  • 14 classifications in 73 years
  • covering 1548 taxa
  • only 35% thought to be stable concepts
  • 65% of names used in legacy data sets are
    ambiguous
  • and we don't know which ones??
  • we need computers to help understand this…
  • Smaller classifications are combined into large
    classifications
  • ITIS integrated taxonomy (also changing)
    approx. 250,000 taxa
  • Taxonomic Revision of genus Alteromonas
  • 34 years from 1972 to 2006
  • Thanks to George Garrity, Michigan State Univ.

64
1972
Alteromonas
macleodii(T)
communis
vaga
65
1972 1973
Alteromonas
macleodii(T)
communis
vaga
haloplanktis
66
1972 1973 1976
Alteromonas
macleodii(T)
communis
vaga
haloplanktis
rubra
67
1972 1973 1976 1977
Alteromonas
macleodii(T)
communis
vaga
haloplanktis
rubra
citrea
68
1972 1973 1976 1977 1978
Alteromonas
macleodii(T)
communis
vaga
haloplanktis
rubra
citrea
esperjiana
undina
69
1972 1973 1976 1977 1978 1979
Alteromonas
macleodii(T)
communis
vaga
haloplanktis
rubra
citrea
esperjiana
undina
aurantia
70
1972 1973 1976 1977 1978 1979 1981
Alteromonas
macleodii(T)
communis
vaga
haloplanktis
rubra
citrea
esperjiana
undina
aurantia
putrifaciens
hanedai
71
1972 1973 1976 1977 1978 1979 1981 1982
Alteromonas
macleodii(T)
communis
vaga
haloplanktis
rubra
citrea
esperjiana
undina
aurantia
putrifaciens
hanedai
luteoviolaceae
72
1972 1973 1976 1977 1978 1979 1981 1982 1984
Alteromonas
macleodii(T)
communis
vaga
vaga
haloplanktis
rubra
citrea
esperjiana
undina
aurantia
putrifaciens
hanedai
luteoviolaceae
commune
vagum
73
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
Marinomonas
Alteromonas
Oceanosprillum
communis(T)
linum(T)
macleodii(T)
communis
vaga
benthica
japonicum
vaga
hanedai
minutium
haloplanktis
biejerinckii
rubra
maris maris
citrea
maris williamsae
esperjiana
undina
hiroshimense
aurantia
multiglobiferum
putrifaciens
pelagicum
hanedai
pusillum
luteoviolaceae
commune
jannaschii
kreigii
vagum
74
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987
Marinomonas
Alteromonas
Shewanella
Oceanosprillum
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
communis
vaga
benthica
japonicum
vaga
hanedai
minutium
haloplanktis
biejerinckii
rubra
maris maris
citrea
maris williamsae
esperjiana
undina
hiroshimense
aurantia
multiglobiferum
putrifaciens
pelagicum
hanedai
pusillum
luteoviolaceae
commune
denitrificans
jannaschii
kreigii
vagum
75
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988
Marinomonas
Alteromonas
Shewanella
Oceanosprillum
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
communis
vaga
benthica
japonicum
vaga
hanedai
minutium
haloplanktis
biejerinckii
rubra
maris maris
citrea
maris williamsae
esperjiana
undina
hiroshimense
aurantia
multiglobiferum
putrifaciens
pelagicum
hanedai
pusillum
luteoviolaceae
commune
denitrificans
jannaschii
colwelliana
kreigii
vagum
76
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990
Marinomonas
Shewanella
Oceanosprillum
Alteromonas
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
japonicum
communis
hanedai
minutium
vaga
biejerinckii
haloplanktis
colwelliana
maris maris
rubra
citrea
maris williamsae
esperjiana
undina
hiroshimense
aurantia
multiglobiferum
putrifaciens
pelagicum
hanedai
pusillum
luteoviolaceae
commune
denitrificans
jannaschii
colwelliana
kreigii
tetradonis
vagum
biejerinckii pelagicum
maris hiroshimense
77
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990 1992
Marinomonas
Shewanella
Alteromonas
Oceanosprillum
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
communis
japonicum
hanedai
vaga
minutium
colwelliana
haloplanktis
biejerinckii
algae
rubra
maris maris
citrea
maris williamsae
esperjiana
undina
hiroshimense
aurantia
multiglobiferum
putrifaciens
pelagicum
hanedai
pusillum
luteoviolaceae
commune
denitrificans
jannaschii
colwelliana
kreigii
tetradonis
vagum
atlantica
biejerinckii pelagicum
carageenovora
maris hiroshimense
78
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990 1992 1995
Marinomonas
Shewanella
Alteromonas
Oceanosprillum
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
communis
japonicum
hanedai
vaga
minutium
colwelliana
haloplanktis
biejerinckii
algae
rubra
maris maris
citrea
maris williamsae
esperjiana
undina
hiroshimense
aurantia
multiglobiferum
putrifaciens
pelagicum
hanedai
pusillum
luteoviolaceae
commune
denitrificans
jannaschii
colwelliana
kreigii
tetradonis
vagum
atlantica
biejerinckii pelagicum
carageenovora
distincta
maris hiroshimense
fuliginea
79
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990 1992 1995
Marinomonas
Shewanella
Alteromonas
Oceanosprillum
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
communis
japonicum
haloplanktis tetradonis
hanedai
vaga
minutium
colwelliana
haloplanktis
biejerinckii
atlantica
algae
rubra
maris maris
aurantia
citrea
maris williamsae
carrageenovora
esperjiana
citrea
undina
hiroshimense
esperjiana
aurantia
multiglobiferum
luteoviolacea
putrifaciens
pelagicum
hanedai
pusillum
luteoviolaceae
commune
rubra
denitrificans
jannaschii
undina
colwelliana
kreigii
tetradonis
vagum
atlantica
biejerinckii pelagicum
carageenovora
distincta
maris hiroshimense
fuliginea
80
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990 1992 1995 1997
Marinomonas
Shewanella
Alteromonas
Oceanosprillum
Pseudoalteromonas
haloplanktis haloplanktis(T)
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
communis
japonicum
haloplanktis tetradonis
hanedai
vaga
minutium
colwelliana
haloplanktis
biejerinckii
atlantica
algae
rubra
maris maris
aurantia
citrea
maris williamsae
carrageenovora
esperjiana
citrea
undina
hiroshimense
esperjiana
aurantia
multiglobiferum
luteoviolacea
putrifaciens
pelagicum
nigrifaciens
hanedai
pusillum
pisicida
luteoviolaceae
commune
rubra
denitrificans
jannaschii
undina
colwelliana
kreigii
antartica
tetradonis
vagum
atlantica
biejerinckii pelagicum
carageenovora
distincta
maris hiroshimense
fuliginea
elyakoviii
81
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990 1992 1995 1997 2000
Marinomonas
Shewanella
Alteromonas
Oceanosprillum
Pseudoalteromonas
haloplanktis haloplanktis(T)
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
communis
japonicum
haloplanktis tetradonis
hanedai
vaga
minutium
mediterannea
colwelliana
haloplanktis
biejerinckii
atlantica
algae
rubra
maris maris
aurantia
citrea
fridgidimarina
maris williamsae
carrageenovora
esperjiana
geldimarina
citrea
undina
hiroshimense
esperjiana
aurantia
multiglobiferum
luteoviolacea
putrifaciens
baltica
pelagicum
nigrifaciens
hanedai
pusillum
pisicida
luteoviolaceae
commune
rubra
denitrificans
jannaschii
undina
colwelliana
kreigii
antartica
tetradonis
vagum
bacteriolytica
atlantica
biejerinckii pelagicum
prydzensis
carageenovora
tunicata
distincta
maris hiroshimense
distincta
fuliginea
elyakovii
elyakoviii
peptidolytica
82
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990 1992 1995 1997 2000 2001
Marinomonas
Shewanella
Alteromonas
Oceanosprillum
Pseudoalteromonas
haloplanktis haloplanktis(T)
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
communis
japonicum
haloplanktis tetradonis
hanedai
vaga
minutium
mediterannea
colwelliana
haloplanktis
biejerinckii
algae
rubra
maris maris
atlantica
citrea
aurantia
fridgidimarina
maris williamsae
esperjiana
carrageenovora
geldimarina
undina
citrea
woodyii
hiroshimense
aurantia
esperjiana
amazonensis
multiglobiferum
putrifaciens
luteoviolacea
baltica
pelagicum
hanedai
nigrifaciens
oneidensis
pusillum
luteoviolaceae
pisicida
pealeana
commune
denitrificans
rubra
violacea
jannaschii
colwelliana
undina
japonica
kreigii
tetradonis
antartica
vagum
atlantica
bacteriolytica
biejerinckii pelagicum
carageenovora
prydzensis
distincta
tunicata
maris hiroshimense
fuliginea
distincta
elyakoviii
elyakovii
peptidolytica
tetrodonis
83
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990 1992 1995 1997 2000 2001 2002
Marinomonas
Shewanella
Alteromonas
Oceanosprillum
Pseudoalteromonas
haloplanktis haloplanktis(T)
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
communis
japonicum
haloplanktis tetradonis
hanedai
vaga
minutium
mediterannea
colwelliana
haloplanktis
biejerinckii
algae
rubra
maris maris
atlantica
citrea
fridgidimarina
aurantia
maris williamsae
esperjiana
geldimarina
carrageenovora
undina
woodyii
citrea
hiroshimense
aurantia
amazonensis
esperjiana
multiglobiferum
putrifaciens
baltica
luteoviolacea
pelagicum
hanedai
oneidensis
nigrifaciens
pusillum
luteoviolaceae
pealeana
pisicida
commune
denitrificans
violacea
rubra
jannaschii
colwelliana
japonica
undina
kreigii
tetradonis
denitrificans
antartica
vagum
atlantica
livingstonensis
bacteriolytica
biejerinckii pelagicum
carageenovora
alleyanna
prydzensis
distincta
tunicata
maris hiroshimense
fuliginea
distincta
elyakoviii
elyakovii
peptidolytica
tetrodonis
84
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990 1992 1995 1997 2000 2001 2002 2004
Marinomonas
Shewanella
Alteromonas
Oceanosprillum
Pseudoalteromonas
haloplanktis haloplanktis(T)
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
communis
japonicum
haloplanktis tetradonis
hanedai
vaga
minutium
mediterannea
colwelliana
haloplanktis
biejerinckii
primoryensis
algae
rubra
maris maris
atlantica
citrea
fridgidimarina
aurantia
maris williamsae
esperjiana
geldimarina
carrageenovora
undina
woodyii
citrea
hiroshimense
aurantia
amazonensis
esperjiana
multiglobiferum
putrifaciens
baltica
luteoviolacea
pelagicum
hanedai
oneidensis
nigrifaciens
pusillum
luteoviolaceae
pealeana
pisicida
commune
denitrificans
violacea
rubra
jannaschii
colwelliana
japonica
undina
kreigii
tetradonis
denitrificans
antartica
vagum
atlantica
livingstonensis
bacteriolytica
biejerinckii pelagicum
carageenovora
alleyanna
prydzensis
distincta
tunicata
mariniintestina
maris hiroshimense
fuliginea
distincta
saire
elyakoviii
elyakovii
schlegeliana
peptidolytica
gaetbuli
stellipolaris
tetrodonis
litorea
5 others
12 others
85
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990 1992 1995 1997 2000 2001 2002 2004
2005
Marinomonas
Shewanella
Alteromonas
Oceanosprillum
Pseudoalteromonas
haloplanktis haloplanktis(T)
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
communis
japonicum
haloplanktis tetradonis
hanedai
vaga
minutium
mediterannea
colwelliana
haloplanktis
biejerinckii
primoryensis
algae
rubra
maris maris
atlantica
citrea
fridgidimarina
aurantia
maris williamsae
esperjiana
geldimarina
carrageenovora
undina
woodyii
citrea
hiroshimense
aurantia
amazonensis
esperjiana
multiglobiferum
putrifaciens
baltica
luteoviolacea
pelagicum
hanedai
oneidensis
nigrifaciens
pusillum
luteoviolaceae
pealeana
pisicida
commune
denitrificans
violacea
rubra
jannaschii
colwelliana
japonica
undina
kreigii
tetradonis
denitrificans
antartica
vagum
atlantica
livingstonensis
bacteriolytica
biejerinckii pelagicum
carageenovora
alleyanna
prydzensis
distincta
tunicata
mariniintestina
maris hiroshimense
fuliginea
distincta
saire
elyakoviii
elyakovii
schlegeliana
peptidolytica
gaetbuli
stellipolaris
tetrodonis
litorea
8 others
14 others
2 others
86
1972 1973 1976 1977 1978 1979 1981 1982 1984 1986
1987 1988 1990 1992 1995 1997 2000 2001 2002 2004
2005 2006
Marinomonas
Shewanella
Alteromonas
Oceanosprillum
Pseudoalteromonas
haloplanktis haloplanktis(T)
putrifaciens(T)
communis(T)
linum(T)
macleodii(T)
vaga
benthica
communis
japonicum
haloplanktis tetradonis
hanedai
vaga
minutium
mediterannea
colwelliana
haloplanktis
biejerinckii
primoryensis
algae
rubra
maris maris
atlantica
citrea
fridgidimarina
aurantia
maris williamsae
esperjiana
geldimarina
carrageenovora
undina
woodyii
citrea
hiroshimense
aurantia
amazonensis
esperjiana
multiglobiferum
putrifaciens
baltica
luteoviolacea
pelagicum
hanedai
oneidensis
nigrifaciens
pusillum
luteoviolaceae
pealeana
pisicida
commune
denitrificans
violacea
rubra
jannaschii
colwelliana
japonica
undina
kreigii
tetradonis
denitrificans
antartica
vagum
atlantica
livingstonensis
bacteriolytica
biejerinckii pelagicum
carageenovora
alleyanna
prydzensis
distincta
tunicata
mariniintestina
maris hiroshimense
fuliginea
distincta
saire
elyakoviii
elyakovii
schlegeliana
peptidolytica
gaetbuli
stellipolaris
tetrodonis
litorea
13 others
14 others
2 others
87
May 2004
November 2004
Gammaproteobacteria
Alteromonadales
Colwelliaceae
Alteromonadacea
Idiomarinaceae
Alteromonas
Colwelliaceae
Idiomarina
Aestuariibacter
Thalassomonas
Alishewanella
Ferrimonadacea
Colwellia
Psychromonadaceae
Ferrimonas
At the species level:
  • 18 emendations
  • 21 new species
  • 19 species reassigned to 4 genera
  • 3 new combinations
  • 6 synonyms
  • 2 species to subspecies
  • 2 subspecies to species
  • 50 names, five genera, five families, and two classes
  • but…. only 5 validly published species.
At the higher level:
  • 1 family, 16 genera -> 8 families, 12 genera
  • 1 unclassified genus -> 7 unclassified genera
Which is correct? Which is supported/recorded in the data?
What is the impact on analysis?
Ferrimonas
Psychromonas
Glaciecola
Idiomarina
Pseudoalteromonadaceae
Marinobacter
Incertae sedis
Pseudoalteromonas
Marinobacterium
Agarvorans
Algicola
Microbulbifer
Alishewanella
Moritella
Marinobacter
Shewanellaceae
Pseudoalteromonas
Marinobacterium
Shewanella
Psychromonas
Microbulbifer
Shewanella
Salinomonas
Moritellaceae
Thalassomonas
Teredinibacter
Moritella
Incertae sedis
Teredinibacter
88
Meta-utopia - a pipe dream?
  • What is meta-data?
  • Your meta data is my data…
  • Depends on your perspective
  • How you see the world
  • What's important to you
  • What you want to do with the data
  • It's all data anyway…..
  • But it's useful to differentiate for certain
    purposes

89
Meta-utopia - a pipe dream?
  • Schemas aren't neutral
  • Presumes there is a "correct" way of modelling or
    categorising ideas
  • that, given enough time and incentive, people can
    agree on the correct way…
  • Any hierarchy of concepts necessarily implies the
    importance of some axes over others.

90
  • Geographic/cartographic perspective
  • Instance of Picea rubens is-a feature that can be
    mapped
  • Features inherently have geospatial coordinates.
  • Taxonomic perspective
  • Instance of Picea rubens is a specimen of some
    biological taxon
  • Taxa inherently have characteristics used in
    classification

91
Meta-utopia - a pipe dream?
  • There's more than one way to describe something

92
(No Transcript)
93
Meta-utopia - a pipe dream?
  • There's more than one way to describe something
  • Reasonable people can disagree forever on how to
    describe something.
  • Requiring scientists to use the same vocabulary
    to describe their data enforces homogeneity in
    ideas.
  • Which could limit science…

94
Meta-utopia - a pipe dream?
  • Metrics influence results
  • Agreeing to a common metric for measuring
    important things in a domain necessarily
    privileges the items that score high on that
    metric, regardless of those items' overall
    suitability.
  • Ranking axes are mutually exclusive
  • software that scores high for security scores low
    for convenience,
  • Everyone wants to emphasize their high-scoring
    axes
  • and de-emphasize (or, if possible, ignore
    altogether) their low-scoring axes.

95
Meta-utopia - a pipe dream?
  • People are not altruistic
  • Scientists have their own immediate deliverables
  • Doesn't leave time for thinking about who else
    might do what with their data
  • Metadata exists in a competitive world.
  • People want their work cited and will (ab)use
    meta-data to do so.
  • People are busy
  • e-Scientists understand the importance of
    excellent metadata
  • Jo-scientist is mainly concerned about publishing
    the results.
  • No time for added extras

96
Meta-utopia - a pipe dream?
  • People make mistakes
  • Even when there's a positive benefit to creating
    good metadata, people don't exercise enough care
    and diligence in their metadata creation.
  • Mission Impossible?
  • Simple observation demonstrates people are poor
    observers of their own behaviours.
  • Therefore any meta data will be a poor
    representation

97
Life Science Identifiers (LSIDs) the vision
  • WWW provides a globally distributed communication
    framework
  • LSID and the LSID Resolution System
  • will provide a simple mechanism to globally
    resolve locally named objects distributed over
    the WWW.
  • LSIDs will allow us to know
  • what kind of object it is,
  • who originated it,
  • who is responsible for it,
  • how to interface to it and
  • what computations might be carried out on it.
  • Adoption of LSIDs
  • will facilitate more reliable integration of
    multiple knowledge bases,
  • each of which has partial information of a shared
    domain
  • will encourage stronger global collaboration in
    life sciences.

Clark T., Martin S., Liefeld T. Globally
Distributed Object Identification for Biological
Knowledgebases. Briefings in Bioinformatics
5(1):59-70, March 1, 2004.
98
Life Science Identifiers
  • An LSID has data
  • gene sequence in GenBank
  • ecological data set (in Excel, or in a text file)
  • image
  • The data should never change (can version)
  • URI based naming scheme
  • urn:lsid:ipni.org:names:1234-1
  • retrieval framework
  • http://lsid.sourceforge.net/
  • An LSID has metadata
  • format of the data
  • display title for clients
  • Dublin Core metadata
  • anything you want
  • The metadata can change
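An LSID is a URN of the form `urn:lsid:authority:namespace:object`, optionally followed by `:revision`. As a small sketch, the parser below splits the slide's IPNI example, read as `urn:lsid:ipni.org:names:1234-1`, into those parts. The `LSID` tuple and `parse_lsid` function are illustrative names, not part of the LSID resolution software, and the validation is deliberately minimal.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID string into its components, or raise ValueError."""
    parts = text.split(":")
    # Expect "urn", "lsid", then three mandatory parts and an optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    return LSID(*parts[2:])


example = parse_lsid("urn:lsid:ipni.org:names:1234-1")
print(example.authority, example.namespace, example.object_id, example.revision)
```

The authority part is what a resolver would use to locate the service that returns the data and metadata for the object.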

99
Issues For Each Community
  • What gets an LSID?
  • Real life objects
  • Biological specimen
  • Abstract concepts
  • Taxon concept or name Bellis perennis
  • Electronic representations of things
  • Image of specimen, description of specimen or
    concept
  • For each thing, what's the data and metadata?
  • LSIDs
  • Data doesn't change but Meta data can
  • Should all data become meta data?
  • Maybe it implies a temporal database approach

100
Issues For Each Community
  • Who issues LSIDs?
  • Owner of data
  • Not always clear who owns data especially legacy
    data
  • A central authority
  • One authority responsible for issuing LSID for
    specific types of information
  • This would help enforce a 1:1 mapping of LSIDs
    and data items
  • It MAY also reduce the likelihood of LSIDs
    becoming unresolvable
  • A respected authority
  • This would help enforce a 1:1 mapping for those
    who use the authority
  • It may also be more feasible
  • Free for all (possibly with an index)
  • List your LSID authority in an index so your
    LSIDs are easy to find
  • Perhaps structured delegation has best potential
    to globally unite science

101
Organizations Using LSIDs
  • Biopathways consortium
  • National Center for Biotech Information (NCBI)
  • Pubmed, Genbank
  • European Bioinformatics Institute (EBI)
  • BioMOBY a biological database interoperability
    program (biomoby.org)
  • represent all entities in MOBY Ontologies
    (Object, Service, and Namespace), as well as all
    instances of BioMOBY services.
  • myGrid (mygrid.org.uk)
  • used throughout as object naming device
  • TDWG (tdwg.org)
  • IPNI plant names
  • Index Fungorum fungi names
  • US Long Term Ecological Research Network (LTER)
  • SEEK (seek.ecoinformatics.org)
  • Used in Kepler actors, components, TOS taxon
    concepts…

102
Use of LSIDs
[Figure: Ecological Data Sets and a taxonomic service (TAX). Hippocampus erectus Perry 1810 is identified as urn:lsid:biocast.org:concept:347; the labels Hippocampus tetragonous Mitchill, 1814, Hippocampus marginalis Kaup, 1856, Hippocampus erectus and Lined seahorse appear alongside the identifier 347.]
103
Moving to a world of LSIDs
  • Using LSIDs alone will not address all issues of
    data sharing
  • Data repositories must (re)use LSIDs to cross
    reference data
  • within and outwith their own repository.
  • it is important that we use the same LSID to
    refer to the same entity
  • If multiple LSIDs exist for the same entity we
    would be required to decide whether or not two
    LSIDs were really the same thing.
  • We would be in a worse situation than we are
    today,
  • for example when trying to decide if two
    taxonomic names mean the same.
  • Generating LSIDs for any self contained data set
    is a fairly trivial task
  • Appointing LSIDs to existing data from an
    authoritative repository to re-use them is more
    challenging
  • Investigate what's involved…
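The benefit of reusing one LSID per entity can be sketched in a few lines. The parser below follows the general LSID layout `urn:lsid:authority:namespace:object[:revision]`. The function names, the repository names, and the `example.org` identifiers are hypothetical. When repositories cite the same LSID, cross-referencing reduces to grouping by that identifier, with no name matching needed.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID string into its components, or raise ValueError."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component: {text!r}")
    return Lsid(*parts[2:])


def cross_reference(records):
    """Group (repository, local_id, lsid_text) records by the LSID cited."""
    groups = defaultdict(list)
    for repository, local_id, lsid_text in records:
        groups[parse_lsid(lsid_text)].append((repository, local_id))
    return dict(groups)
```

If two repositories had minted their own LSIDs for the same concept, they would land in different groups, and deciding whether they are equivalent would again need human judgement.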

104
Convert Data Provider to use LSIDs
[Diagram: the original data repository (target) holds RDF data to be updated
with LSIDs from authority providers. Labels: Hexacorallia Data Provider, Map
to ontology, Hexacorallia Data Triple Store, Linker Tool, and authority
providers for Name, Specimen, Concept, Publication and Person, each labelled
LSID RDF.]
105
Linking….
106
Linking….
107
Linking….
Request possible LSIDs
108
Confirm/Skip Annotations
109
Issues in converting to LSIDs
  • Mapping to ontology
  • LSIDs → RDF → schema? → ontology?
  • agreement on ontology - problem?
  • Replace or annotate existing data?
  • If we replace an author with a person LSID
  • what is returned when resolving that LSID
    probably won't be the data that was stored in
    the DB for an author.
  • Dependencies between objects with LSIDs
  • If you link via a taxon name LSID the resolved
    name should have embedded an LSID for a
    publication so there shouldn't be any need (in
    principle) to match publications for names
  • What about authorities that issue LSIDs but
    don't map to other authorities?
  • e.g. name providers not mapping to either
    publication or specimen providers

110
Issues in converting to LSIDs
  • What support would a linking tool need to provide
    end users?
  • How would users want to process this data
  • How much automation?
  • E.g. above a certain confidence level
  • Would this be trusted?
  • Order of matching
  • E.g. match all instances of persons at once
  • Match of persons by publication?
  • Other Issues…
  • Performance of existing linking tool approach
  • Lots of data passing going on
  • Need more efficient approach which matches user
    needs
  • Finding authorities that provide linking services
  • How do scientists find out about authorities with
    linking services?
  • How do they know which ones to use?
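The "automate above a certain confidence level" idea might look like the sketch below. It is not the linker tool's actual method: string similarity from Python's `difflib` stands in for whatever scoring a real linking service would use, and the function names, thresholds, and `example.org` LSIDs are assumptions. A match is accepted automatically only when exactly one candidate clears the threshold. Everything else is queued for the user to confirm or skip.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage_matches(local_value, candidates, auto_accept=0.95, min_score=0.6):
    """Score (lsid, label) candidates against a local value.

    Returns (accepted, to_review): `accepted` is a (score, lsid, label)
    tuple or None, and `to_review` lists plausible candidates, best first.
    """
    scored = sorted(
        ((similarity(local_value, label), lsid, label)
         for lsid, label in candidates),
        reverse=True,
    )
    scored = [item for item in scored if item[0] >= min_score]
    confident = [item for item in scored if item[0] >= auto_accept]
    if len(confident) == 1:
        return confident[0], []
    # Zero or several confident candidates: leave the decision to the user.
    return None, scored
```

Whether users would trust such automation is exactly the open question raised above, so the thresholds are best treated as user-adjustable settings.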

111
To Summarise….
  • We have seen that (Life) Science is
  • Complex and changing
  • The fundamental challenges of science that have
    always been there are still here
  • Now we have additional opportunities associated
    with the explosion of scientific information and
    the move to a virtual world
  • And now the challenge is how best to exploit
    these….
  • e-Science uses computation to aid scientists
  • By providing appropriate infrastructure and tool
    support
  • Speed up scientific processes
  • Do them repeatedly
  • Re-evaluation
  • Can give scientists time for more thoughtful
    science…
  • May require a change of emphasis in how
    scientists work
  • Must support the inherent features of science,
    scientists and scientific data

112
e-Science and Complex Science
  • Support decomposition of scientific domains,
    problems and associated data
  • Fundamental to data and software analysis and
    design
  • Support re-composition, linking or building on
    the components
  • Need to know when components or links have
    changed
  • Identify the overlaps/linkages in the different
    domains
  • Need useful approximations of things to simplify
    linked domain
  • Need to understand the approximations or linking
    points well
  • Raise level of abstraction
  • Artefact of storage mechanisms
  • Implies lingua franca
  • Need more evaluation of the different approaches

113
e-Science and Changing Science
  • Science is full of legacy data
  • Today's scientific research is tomorrow's legacy
    data
  • Provide long-term persistent storage
  • Any published scientific discovery should store
    the data as evidence
  • Data needs to be accurately annotated
  • Sufficient to repeat analyses to test hypotheses
  • e-Science already changing the way scientists do
    science
  • But to be effective it needs to change even more…
  • More emphasis on well curated, accessible,
    persistent data
  • Evidence for results

114
Meta Data and Ontologies?
  • Do we throw out meta data/ontologies, then?
  • No…
  • To benefit from stored data we need to know what
    it means!
  • However, there are no large-scale benefits while
    there is insufficient coverage of meta data
  • if only 10% of data has meta data, people won't
    use meta data…
  • Need to reach the tipping point…
  • Controlled vocabulary and schemas shown useful
    for large projects or small communities with
    common goal
  • Need long-term projects to see if they sustain
    their value as the community and the science
    evolves.

115
Describe or Prescribe?
  • Descriptions become vocabularies used by others
  • Folksonomy or ontologies?
  • Informal versus formal or free versus constrained
  • Informal can be basis for something formal
  • Move towards common vocabularies
  • with built in flexibility and extensibility
  • Issue of what language(s)…
  • Need more research evaluating these issues…

116
Reliability of Meta Data
  • Automatic recording of meta data
  • From machines, software, workflows…
  • Avoids labour
  • Starting to happen
  • Helps reach critical mass of available meta data
  • Still need to decide what it is that the
    machines/software are collecting…
  • Human input still needed
  • Purpose of experiment, deviations from planned
    protocol etc.
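Automatic recording by software could be as simple as the sketch below. The decorator `record_metadata`, the in-memory `PROVENANCE_LOG`, and the example step `mean_length` are hypothetical names. It captures what the machine can know (step name, arguments, timestamps, interpreter version) and leaves the human-only fields empty for a scientist to complete.

```python
import functools
import platform
from datetime import datetime, timezone

PROVENANCE_LOG = []   # hypothetical in-memory store of metadata records


def record_metadata(step):
    """Wrap an analysis step so each call leaves a metadata record."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc)
        result = step(*args, **kwargs)
        PROVENANCE_LOG.append({
            "step": step.__name__,
            "arguments": {"args": args, "kwargs": kwargs},
            "started_at": started.isoformat(),
            "finished_at": datetime.now(timezone.utc).isoformat(),
            "python_version": platform.python_version(),
            # Fields only a person can supply stay empty until filled in.
            "purpose": None,
            "deviations_from_protocol": None,
        })
        return result
    return wrapper


@record_metadata
def mean_length(lengths_mm):
    return sum(lengths_mm) / len(lengths_mm)
```

Calling `mean_length([10.0, 20.0])` returns the result as usual and appends one record to `PROVENANCE_LOG`. Deciding which fields are worth collecting remains the design question noted above.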

117
Support
  • Community ontologies need to be easily available
    to all scientists
  • Listing the known ontologies on a web site is not
    enough
  • Need to understand when (meta) data is fit for
    purpose
  • Accurate enough, not overly precise
  • Need collaborative approaches to extending
    ontologies
  • Allow users to be involved to achieve community
    buy-in
  • Ontologies are difficult for people to comprehend
  • Need good visualisation
  • Need to trust system

118
Tools
  • Simple tools would go a long way to help
  • Contextual data is consistent for many data sets
  • e.g. observer/location
  • Tools should support collection and re-use of
    this data
  • Make use of (incorporate) existing ontologies
    into tools
  • Get the software to do as much work as possible
  • Good at repetitive tasks, faster than humans
  • Personalisation
  • How application specific do tools have to be to
    be useful?
  • Generic/ Domain specific/ Individual?
  • The more generic the more widely applicable
  • Pluggable components for personalisation?

119
Finally…
  • It will take time and commitment for any of these
    approaches to work.
  • Focus on central important resources that are
    reused in many (sub-)domains
  • Ensure the data are well managed and curated,
    identified, described, easily available, lasting
    and evolving
  • Observe whether they benefit the community or act
    as a straitjacket
  • A good test case for this approach is the
    development of a taxon concept name resolution
    service
  • To allow scientists to find correct names for the
    concepts they are working with,
  • Mark up their data,
  • Resolve their concepts against other scientists'
    data so they know they are talking about the same
    thing.
  • Is central to communication in all life sciences
  • Poses many computational, social and data
    research issues

120
Acknowledgements
  • E-Science Institute for sponsoring theme
    leadership
  • Malcolm Atkinson
  • For support and many interesting discussions on
    exploiting scientific data.
  • Collaborators
  • on SEEK project,
  • Matt Jones, Bill Michener, Aimee Stewart, Robert
    Gales, Josh Madin, Shaun Bowers
  • Collaborators in TDWG/GBIF
  • Robert Kukla, Roger Hyam,
  • funding, slides, interesting problems