Title: The Tree of Life: Challenges for Discrete Mathematics and Theoretical Computer Science
1The Tree of Life: Challenges for Discrete
Mathematics and Theoretical Computer Science
Fred S. Roberts DIMACS Rutgers University
2The tree of life problem raises new challenges
for mathematics and computer science just as it
does for biological science.
3- For math and CS to become more effectively
utilized, we need to:
- develop new tools
- establish working partnerships between
mathematical scientists and biological
scientists
- introduce the two communities to each other's
problems, language, and tools
4- introduce outstanding junior researchers from
both sides to the issues, problems, and
challenges arising from the tree of
life
5- involve biological and mathematical scientists
together to define the agenda and develop the
tools of this field.
6These are some of the motivations for this
meeting. I will lay out some of the challenges
for math and CS, with emphasis on discrete math
and theoretical CS.
7What are DM and TCS?
- DM deals with
- arrangements
- designs
- codes
- patterns
- schedules
- assignments
8TCS deals with the theory of computer algorithms.
- During the first 30-40 years of the computer age,
TCS, aided by powerful mathematical methods, had
a direct impact on technology, by developing
models, data structures, algorithms, and lower
bounds that are now at the core of computing.
9DM and TCS have found extensive use in many areas
of science and public policy, for example in
Molecular Biology. These tools seem
especially relevant to problems of the tree of
life.
10DM and TCS Continued
- These tools are made especially relevant to the
tree of life problem because of:
- Geographic Information Systems
11DM and TCS Continued
- Availability of large and disparate computerized
databases on subjects relating to species and the
relevance of modern methods of data mining.
12Outline
- Phylogenetic Tree Reconstruction
- Database Issues
- Nomenclature
- Setting up a Species Bank
- Digitization of Natural History Collections
- Interoperability
- The Many Applications of Research on the Tree of
Life
13Phylogenetic Tree Reconstruction
14Phylogeny (continued)
- New methods of phylogenetic tree reconstruction
owe a significant amount to modern methods of
DM/TCS.
- Trees, supertrees, and consensus trees will all be
discussed at length in this meeting.
- I will only make a few brief remarks about them.
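As a small illustration of one of the objects named above, here is a minimal sketch of a strict consensus computation. It assumes rooted trees written as nested tuples of leaf names, which is an illustrative encoding and not one taken from the talk; the function names are hypothetical. The strict consensus keeps only the clusters (leaf sets below a vertex) that appear in every input tree.

```python
def clusters(tree):
    """Collect the leaf set below every internal vertex of a rooted tree.

    The tree is a nested tuple whose leaves are strings (assumed encoding).
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        # Union of the leaf sets of all children of this internal vertex.
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def strict_consensus_clusters(trees):
    """Return the clusters shared by all input trees."""
    return set.intersection(*(clusters(t) for t in trees))


# Two hypothetical trees on the same four leaves.
t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "B"), "D"), "C")
shared = strict_consensus_clusters([t1, t2])
```

Here the two trees agree only on the pair {A, B} and on the full leaf set, so those are the clusters that survive.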
15Phylogenetic Challenges for DM/TCS
- Tailoring phylogenetic methods to describe the
idiosyncrasies of viral evolution -- going beyond
a binary tree with a small number of
contemporaneous species appearing as leaves.
- Dealing with trees of thousands of vertices, many
of high degree.
- Making use of data about species at internal
vertices (e.g., when data comes from serial
sampling of patients).
16Phylogenetic Challenges for DM/TCS Continued
- Network representations of evolutionary history -
if recombination has taken place.
- Modeling viral evolution by a collection of trees
-- to recognize the quasispecies nature of
viruses.
- Devising fast methods to average the quantities
of interest over all likely trees.
- Thanks to Eddie Holmes and Mike Steel for ideas.
- DIMACS Working Group on Phylogenetic Trees and
Rapidly Evolving Diseases, Sept. 3-6, 2003
17Database Issues
- Assembling the tree of life requires collecting
massive amounts of data about the world's
scientific species.
- Making it a collaborative project requires making
such data universally available.
- There are great challenges for Math and CS,
specifically DM and TCS.
- Thanks to the Global Biodiversity Information
Facility (GBIF) for many of the following ideas.
18Complexity of Data
- In many ways, data about the world's species are
far more complex than genetic or protein sequence
data. (GBIF)
19Complexity of Data (contd)
- There are databases of images, databases in
numerous forms, etc.
- Data is heterogeneous.
- Data has errors and inconsistencies.
20Nomenclature
- There are some 1.75M named species
- By some estimates, there are up to 10M actual
species.
21Nomenclature (contd)
- The same species is often named more than once.
- On average, each species has two additional
names (synonyms) besides its own name. (GBIF)
22Nomenclature (contd)
- Thus, there is a need to assemble names in an
electronic catalogue, with synonyms and common
misspellings.
- This would be of fundamental importance in aiding
research on biodiversity.
23Nomenclature (contd)
- Because of errors, one major challenge for TCS is
data cleaning.
24Nomenclature (contd)
- Another challenge is to search a database to see
if two entries are similar.
- This is a standard problem in database theory.
- TCS algorithms involving k-nearest neighbor and
other methods are very helpful here.
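To make the similarity-search idea concrete, here is a minimal sketch of a k-nearest-neighbor lookup over a name catalogue. It uses Levenshtein edit distance as the similarity measure, which is an assumption made for illustration; the talk does not specify a metric. The function names and the tiny catalogue are hypothetical, and a linear scan like this would not scale to millions of names without an index.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one dynamic-programming row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_names(query: str, catalogue: list[str], k: int = 3) -> list[tuple[int, str]]:
    """Return the k catalogue names closest to the query as (distance, name)."""
    q = query.lower()
    scored = [(edit_distance(q, name.lower()), name) for name in catalogue]
    return sorted(scored)[:k]


# A tiny illustrative catalogue; a real one would hold millions of entries.
CATALOGUE = [
    "Phoneutria fera",
    "Phoneutria nigriventer",
    "Latrodectus mactans",
    "Loxosceles reclusa",
]

# A misspelled query still finds the intended entry first.
matches = nearest_names("Phonutria fera", CATALOGUE, k=2)
```

The same lookup serves both data cleaning (flagging entries that are near-duplicates of a catalogued name) and search with misspelled queries.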
25Setting up a Species Bank
26Setting up a Species Bank (contd)
- A species bank would provide not only names, but
also data about a species:
- Type
- Distribution
- Ecological role
- Phylogenetic history
- Physiology
- Genomics
- This involves issues about huge datasets.
27Setting up a Species Bank (contd)
- NASA earth science satellites alone beam home
image data at the rate of 1.2 terabytes a day.
- By 2010, this is expected to grow to 10 petabytes
a day. (Kathleen Bergen, U. Michigan)
28Name       Equal to          Size in Bytes
Bit          1 bit             1/8
Nibble       4 bits            1/2 (rare)
Byte         8 bits            1
Kilobyte     1,024 bytes       1,024
Megabyte     1,024 kilobytes   1,048,576
Gigabyte     1,024 megabytes   1,073,741,824
Terabyte     1,024 gigabytes   1,099,511,627,776
Petabyte     1,024 terabytes   1,125,899,906,842,624
Exabyte      1,024 petabytes   1,152,921,504,606,846,976
Zettabyte    1,024 exabytes    1,180,591,620,717,411,303,424
Yottabyte    1,024 zettabytes  1,208,925,819,614,629,174,706,176
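The table follows a single rule: each unit is 1,024 times the previous one. The short sketch below reproduces the byte counts from that rule and works out how large the jump from 1.2 terabytes a day to 10 petabytes a day is. The two daily rates are the figures quoted on the earlier slide; the variable and function names are illustrative.

```python
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def unit_size(name: str) -> int:
    """Size of a unit in bytes, using the binary (1,024-based) convention."""
    return 1024 ** UNITS.index(name)


# Daily satellite image volumes quoted in the talk.
daily_current = 1.2 * unit_size("terabyte")
daily_projected = 10 * unit_size("petabyte")

# How many times larger the projected daily volume is.
growth_factor = daily_projected / daily_current
```

Under these figures the projected daily volume is roughly 8,500 times the current one.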
29Setting up a Species Bank (contd)
- The problem is even worse: we need to combine
information from many databases.
- There is no known way to catalogue all species of
plants in one place given current database
systems techniques. (Jessie Kennedy, Napier
University, Edinburgh)
30Setting up a Species Bank (contd)
- One possible approach: Tree and graph methods to
support overlapping classifications as directed
acyclic graphs or with complex objects (taxa or
specimens) as nodes. (Jessie Kennedy)
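A minimal sketch of the directed-acyclic-graph idea follows. It assumes each taxon may be placed under more than one parent, so that two classifications can coexist in one structure. The class, its methods, and the placeholder taxon names are hypothetical and are not drawn from any system mentioned in the talk.

```python
from collections import defaultdict


class ClassificationDAG:
    """Taxa as nodes; an edge child -> parent records one placement."""

    def __init__(self):
        self.parents = defaultdict(set)

    def ancestors(self, taxon):
        """All taxa reachable by following parent edges upward."""
        seen = set()
        stack = [taxon]
        while stack:
            node = stack.pop()
            for parent in self.parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add_placement(self, child, parent):
        """Place child under parent, refusing edges that would create a cycle."""
        if child == parent or child in self.ancestors(parent):
            raise ValueError(f"placing {child!r} under {parent!r} creates a cycle")
        self.parents[child].add(parent)


# Placeholder names: two classifications disagree on the genus of one species.
dag = ClassificationDAG()
dag.add_placement("Genus A", "Family X")
dag.add_placement("Genus B", "Family X")
dag.add_placement("Species s", "Genus A")  # placement in classification 1
dag.add_placement("Species s", "Genus B")  # placement in classification 2
lineage = dag.ancestors("Species s")
```

A tree would force a choice between the two genera; the graph keeps both placements and still answers ancestry queries.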
31Digitizing Natural History Collections
- It has been estimated that there are between 1.5
and 3 billion specimens in the world's natural
history collections, including herbaria, living
microorganism stock centers, and other
repositories. (GBIF)
32Digitizing Natural History Collections (contd)
- If we could digitize information about these
specimens, and make it available, we would
have a treasure trove of information about the
world's biota. (GBIF)
- Pilot projects have shown that utilizing
digitized data from several institutions'
databases can be a powerful tool. (GBIF)
33Digitizing Natural History Collections (contd)
- Challenge: Digitization and reference of
non-standard data (photos, sonograms, field
notes).
34Digitizing Natural History Collections (contd)
- Challenge: Develop methods for visualizing the
data (e.g., species distributions)
35Digitizing Natural History Collections (contd)
- Challenge: Develop search engines for real-time
searching of such extremely large data sets.
36Digitizing Natural History Collections (contd)
- Challenge: Make information access on the web
more knowledge-based so humans and intelligent
software can work together. (Susan Gauch, U.
Kansas)
37Digitizing Natural History Collections (contd)
- Challenge: Use intelligent agents to organize
and present relevant information on the web.
(Susan Gauch)
38Digitizing Natural History Collections (contd)
- Challenge: Use partial information as training
data for classification algorithms. (Susan Gauch)
- One approach: Use training data and
classification algorithms with learning
capabilities.
- (See DIMACS project on Monitoring Message
Streams)
39Digitizing Natural History Collections (contd)
- Another approach to problems posed by
digitization: Use tools of knowledge
inferencing. (Yannis Ioannidis, University of
Wisconsin)
- Still another approach: Use methods of
spatio-temporal data mining. (Ioannidis; see work
of Muthukrishnan at Rutgers)
40Interoperability
- Goal: Devise standards for datasets so as to
allow researchers to collaborate across datasets;
develop standards leading to database
interoperability. (GBIF)
41Interoperability
- Challenge: How do we develop ways to more
accurately represent observational or
experimental data so that others may use them?
(Jessie Kennedy)
- Challenge: Deal with issues of inconsistency and
scalability.
- Challenge: Formalize issues of policy with regard
to others' databases.
- Challenge: Interoperability over a diversity of
users and types of equipment.
42Interoperability
- One approach: the Semantic Web, a term used to
express the growing desire to make information
access on the Web more knowledge-based so humans
and intelligent software can work together.
(Susan Gauch)
43Interoperability
- Another approach: Make use of languages such as
XML, developed to aid interoperability in business
and military collaborations.
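As a rough sketch of what XML-based exchange could look like, the example below writes a species record to XML and reads it back using Python's standard library. The element names (speciesRecord, acceptedName, synonym) and the placeholder species names are invented for illustration; they do not follow any published biodiversity schema.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record: dict) -> str:
    """Serialize a record with an accepted name and a list of synonyms."""
    root = ET.Element("speciesRecord")  # hypothetical element names throughout
    ET.SubElement(root, "acceptedName").text = record["accepted_name"]
    synonyms = ET.SubElement(root, "synonyms")
    for name in record.get("synonyms", []):
        ET.SubElement(synonyms, "synonym").text = name
    return ET.tostring(root, encoding="unicode")


def xml_to_record(text: str) -> dict:
    """Parse the XML produced by record_to_xml back into a dictionary."""
    root = ET.fromstring(text)
    return {
        "accepted_name": root.findtext("acceptedName"),
        "synonyms": [element.text for element in root.iter("synonym")],
    }


# Placeholder names, not real taxa.
record = {"accepted_name": "Examplus primus", "synonyms": ["Examplus prior"]}
xml_text = record_to_xml(record)
round_trip = xml_to_record(xml_text)
```

Two databases that agree on such a schema can exchange records without knowing each other's internal storage, which is the interoperability point made above.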
44The Many Applications of Research on the Tree of
Life
- Side benefits in many fields
- Agriculture
- Biomedicine
- Biotechnology
- Natural resource management
- Pest control
- Control of emergent diseases
- Sustainable use of biodiversity resources
- Global climate change
45The Many Applications of Research on the Tree of
Life
- Let's say you're importing bananas from South
America.
46The Many Applications of Research on the Tree of
Life
- A camera in the hold of the ship sees a spider.
- What kind of spider is it?
- Is it safe to unload your cargo of bananas?
47The Many Applications of Research on the Tree of
Life
- Luckily, you have a digitized natural history
database.
- With an efficient search feature.
- (Thanks to Diana Lipscomb for this example)
48The Many Applications of Research on the Tree of
Life