The Tree of Life: Challenges for Discrete Mathematics and Theoretical Computer Science

1
The Tree of Life: Challenges for Discrete
Mathematics and Theoretical Computer Science
Fred S. Roberts DIMACS Rutgers University
2
The tree of life problem raises new challenges
for mathematics and computer science just as it
does for biological science.
3
  • For math. and CS to become more effectively
    utilized, we need to
  • develop new tools
  • establish working partnerships between
    mathematical scientists and biological
    scientists
  • introduce the two communities to each other's
    problems, language, and tools

4
  • introduce outstanding junior researchers from
    both sides to the issues, problems, and
    challenges arising from the tree of life

5
  • involve biological and mathematical scientists
    together to define the agenda and develop the
    tools of this field.

6
These are some of the motivations for this
meeting. I will lay out some of the challenges
for math and CS, with emphasis on discrete math
and theoretical CS.
7
What are DM and TCS?
  • DM deals with
  • arrangements
  • designs
  • codes
  • patterns
  • schedules
  • assignments

8
TCS deals with the theory of computer algorithms.
  • During the first 30-40 years of the computer age,
    TCS, aided by powerful mathematical methods, had
    a direct impact on technology, by developing
    models, data structures, algorithms, and lower
    bounds that are now at the core of computing.

9
DM and TCS have found extensive use in many areas
of science and public policy, for example in
Molecular Biology. These tools seem
especially relevant to problems of the tree of
life.
10
DM and TCS Continued
  • These tools are made especially relevant to the
    tree of life problem because of
  • Geographic Information Systems

11
DM and TCS Continued
  • Availability of large and disparate computerized
    databases on subjects relating to species and the
    relevance of modern methods of data mining.

12
Outline
  • Phylogenetic Tree Reconstruction
  • Database Issues
  • Nomenclature
  • Setting up a Species Bank
  • Digitization of Natural History Collections
  • Interoperability
  • The Many Applications of Research on the Tree of
    Life

13
Phylogenetic Tree Reconstruction
14
Phylogeny (continued)
  • New methods of phylogenetic tree reconstruction
    owe a significant amount to modern methods of
    DM/TCS.
  • Trees, supertrees, consensus trees will all be
    discussed at length in this meeting
  • I will only make a few brief remarks about them.
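
As a small illustration of one of the objects named above: a strict consensus tree keeps only the clades (sets of leaves below one internal vertex) that appear in every input tree. The sketch below is a toy under stated assumptions, not a method from the talk. Trees are written as nested Python tuples of species labels, and the helper names `clades` and `strict_consensus_clades` are hypothetical.

```python
def clades(tree):
    """Return the clades of a nested-tuple tree as a set of frozensets of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):              # a leaf is a species label
            return frozenset([node])
        group = frozenset().union(*(leaves(child) for child in node))
        found.add(group)                       # every internal vertex defines a clade
        return group

    leaves(tree)
    return found


def strict_consensus_clades(trees):
    """Clades present in every input tree (the strict consensus)."""
    shared = clades(trees[0])
    for tree in trees[1:]:
        shared &= clades(tree)
    return shared


# Two hypothetical trees on the same four species that disagree about C and D.
t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "B"), "D"), "C")
common = strict_consensus_clades([t1, t2])
```

Here only the grouping of A with B, plus the trivial clade of all four leaves, survives in `common`. The two trees disagree on everything else, so the consensus is unresolved there.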

15
Phylogenetic Challenges for DM/TCS
  • Tailoring phylogenetic methods to describe the
    idiosyncrasies of viral evolution -- going beyond
    a binary tree with a small number of
    contemporaneous species appearing as leaves.
  • Dealing with trees of thousands of vertices, many
    of high degree.
  • Making use of data about species at internal
    vertices (e.g., when data comes from serial
    sampling of patients).

16
Phylogenetic Challenges for DM/TCS Continued
  • Network representations of evolutionary history,
    if recombination has taken place.
  • Modeling viral evolution by a collection of trees
    -- to recognize the quasispecies nature of
    viruses.
  • Devising fast methods to average the quantities
    of interest over all likely trees.
  • Thanks to Eddie Holmes and Mike Steel for ideas.
  • DIMACS Working Group on Phylogenetic Trees and
    Rapidly Evolving Diseases, Sept. 3-6, 2003

17
Database Issues
  • Assembling the tree of life requires collecting
    massive amounts of data about the world's
    scientific species.
  • Making it a collaborative project requires making
    such data universally available.
  • There are great challenges for Math and CS,
    specifically DM and TCS.
  • Thanks to the Global Biodiversity Information
    Facility (GBIF) for many of the following ideas.

18
Complexity of Data
  • In many ways, data about the world's species are
    far more complex than genetic or protein sequence
    data. (GBIF)

19
Complexity of Data (contd)
  • There are databases of images, databases in
    numerous forms, etc.
  • Data is heterogeneous.
  • Data has errors and inconsistencies.

20
Nomenclature
  • There are some 1.75M named species
  • By some estimates, there are up to 10M actual
    species.

21
Nomenclature (contd)
  • The same species is often named more than once.
  • On the average, each species has two additional
    names (synonyms) besides its own name. (GBIF)

22
Nomenclature (contd)
  • Thus, there is need to assemble names in an
    electronic catalogue, with synonyms and common
    misspellings.
  • This would be of fundamental importance in aiding
    research on biodiversity.
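
A minimal sketch of what such a catalogue might look like in code, assuming a simple in-memory dictionary. The structure, the function names `build_index` and `resolve`, and the sample entries (including the misspelling) are illustrative assumptions, not GBIF's design.

```python
# Hypothetical catalogue: accepted name -> known synonyms and common misspellings.
CATALOGUE = {
    "Puma concolor": ["Felis concolor", "Puma concolour"],
    "Panthera leo": ["Felis leo"],
}


def build_index(catalogue):
    """Map every known spelling (lower-cased) to its accepted name."""
    index = {}
    for accepted, variants in catalogue.items():
        index[accepted.lower()] = accepted
        for variant in variants:
            index[variant.lower()] = accepted
    return index


INDEX = build_index(CATALOGUE)


def resolve(name):
    """Return the accepted name for a spelling, or None if it is not catalogued."""
    return INDEX.get(name.strip().lower())
```

For example, `resolve("felis concolor")` returns `"Puma concolor"`, while a spelling that is not in the catalogue returns `None` and would have to be handled by approximate matching.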

23
Nomenclature (contd)
  • Because of errors, one major challenge for TCS is
    data cleaning.

24
Nomenclature (contd)
  • Another challenge is to search a database to see
    if two entries are similar.
  • This is a standard problem in database theory.
  • TCS algorithms involving k-nearest neighbor and
    other methods are very helpful here.
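
One way to make the similarity search concrete is a k-nearest-neighbor lookup over names under edit (Levenshtein) distance. The sketch below assumes a brute-force scan of a small list. The function names are hypothetical, and a real system with millions of names would need an index rather than a linear pass.

```python
import heapq


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def nearest_names(query, names, k=2):
    """Return the k catalogued names closest to the query (case-insensitive)."""
    q = query.lower()
    return heapq.nsmallest(k, names, key=lambda n: edit_distance(q, n.lower()))


names = ["Panthera leo", "Panthera tigris", "Puma concolor"]
best = nearest_names("Pnthera leo", names, k=1)
```

The misspelled query is one deletion away from "Panthera leo", so that name is returned as the nearest neighbor.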

25
Setting up a Species Bank
26
Setting up a Species Bank (contd)
  • A species bank would provide not only names, but
    also data about a species
  • Type
  • Distribution
  • Ecological role
  • Phylogenetic history
  • Physiology
  • Genomics
  • This involves issues about huge datasets.

27
Setting up a Species Bank (contd)
  • NASA earth science satellites alone beam home
    image data at the rate of 1.2 terabytes a day.
  • By 2010, this is expected to grow to 10 petabytes
    a day. (Kathleen Bergen, U. Michigan)

28
Name        Equal to           Size in Bytes
Bit         1 bit              1/8
Nibble      4 bits             1/2 (rare)
Byte        8 bits             1
Kilobyte    1,024 bytes        1,024
Megabyte    1,024 kilobytes    1,048,576
Gigabyte    1,024 megabytes    1,073,741,824
Terabyte    1,024 gigabytes    1,099,511,627,776
Petabyte    1,024 terabytes    1,125,899,906,842,624
Exabyte     1,024 petabytes    1,152,921,504,606,846,976
Zettabyte   1,024 exabytes     1,180,591,620,717,411,303,424
Yottabyte   1,024 zettabytes   1,208,925,819,614,629,174,706,176
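
Each unit in the table is 1,024 times the one before it, so the sizes are powers of 1,024. As a quick check of the arithmetic, and of the scale of the projected growth from 1.2 terabytes to 10 petabytes a day, here is a short sketch. The names `UNITS` and `size_in_bytes` are hypothetical helpers.

```python
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def size_in_bytes(unit):
    """Size of one unit in bytes, using the binary convention of the table."""
    return 1024 ** (UNITS.index(unit) + 1)


# Ratio of the projected daily volume (10 petabytes) to the quoted one (1.2 terabytes).
growth = (10 * size_in_bytes("petabyte")) / (1.2 * size_in_bytes("terabyte"))
```

Under these figures the projected daily volume is roughly 8,500 times the quoted one.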
29
Setting up a Species Bank (contd)
  • The problem is even worse: we need to combine
    information from many databases.
  • There is no known way to catalogue all species of
    plants in one place given current database
    systems techniques. (Jessie Kennedy, Napier
    University, Edinburgh)

30
Setting up a Species Bank (contd)
  • One possible approach: tree and graph methods to
    support overlapping classifications as directed
    acyclic graphs or with complex objects (taxa or
    specimens) as nodes. (Jessie Kennedy)
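
To make the directed-acyclic-graph idea concrete: in a tree each taxon has exactly one parent, whereas a DAG lets the same taxon sit under different parents in different classifications. The sketch below is one possible reading of the idea, not Kennedy's actual model. The class `ClassificationGraph`, its methods, and the placeholder taxon names are all hypothetical.

```python
from collections import defaultdict


class ClassificationGraph:
    """Taxa as nodes; an edge child -> parent records one placement."""

    def __init__(self):
        self.parents = defaultdict(set)

    def ancestors(self, taxon):
        """All taxa reachable by following placements upward."""
        seen, stack = set(), [taxon]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add_placement(self, taxon, parent):
        """Place taxon under parent, refusing any edge that would close a cycle."""
        if taxon == parent or taxon in self.ancestors(parent):
            raise ValueError("placement would create a cycle")
        self.parents[taxon].add(parent)


g = ClassificationGraph()
g.add_placement("Genus X", "Family F")
g.add_placement("Genus Y", "Family F")
g.add_placement("species s", "Genus X")   # placement in one classification
g.add_placement("species s", "Genus Y")   # competing placement in another
```

Here `g.ancestors("species s")` contains both genera and the family, which a single tree could not express. An attempt to place "Family F" under "species s" is rejected because it would create a cycle.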

31
Digitizing Natural History Collections
  • It has been estimated that there are between 1.5
    and 3 billion specimens in the world's natural
    history collections, including herbaria, living
    microorganism stock centers, and other
    repositories (GBIF).

32
Digitizing Natural History Collections (contd)
  • If we could digitize information about these
    specimens and make it available, we would
    have a treasure trove of information about the
    world's biota. (GBIF)
  • Pilot projects have shown that utilizing
    digitized data from several institutions'
    databases can be a powerful tool. (GBIF)

33
Digitizing Natural History Collections (contd)
  • Challenge: Digitization and referencing of
    non-standard data (photos, sonograms, field
    notes)

34
Digitizing Natural History Collections (contd)
  • Challenge: Develop methods for visualizing the
    data (e.g., species distributions)

35
Digitizing Natural History Collections (contd)
  • Challenge: Develop search engines for real-time
    searching of such extremely large data sets.

36
Digitizing Natural History Collections (contd)
  • Challenge: Make information access on the web
    more knowledge-based so humans and intelligent
    software can work together. (Susan Gauch, U.
    Kansas)

37
Digitizing Natural History Collections (contd)
  • Challenge: Use intelligent agents to organize
    and present relevant information on the web.
    (Susan Gauch)

38
Digitizing Natural History Collections (contd)
  • Challenge: Use partial information as training
    data for classification algorithms. (Susan Gauch)
  • One approach: Use training data and
    classification algorithms with learning
    capabilities.
  • (See DIMACS project on Monitoring Message
    Streams)

39
Digitizing Natural History Collections (contd)
  • Another approach to problems posed by
    digitization: Use tools of knowledge
    inferencing. (Yannis Ioannidis, University of
    Wisconsin)
  • Still another approach: Use methods of
    spatio-temporal data mining. (Ioannidis; see work
    of Muthukrishnan at Rutgers)

40
Interoperability
  • Goal: Devise standards for datasets so as to
    allow researchers to collaborate across datasets;
    develop standards leading to database
    interoperability. (GBIF)

41
Interoperability
  • Challenge: How do we develop ways to more
    accurately represent observational or
    experimental data so that others may use them?
    (Jessie Kennedy)
  • Challenge: Deal with issues of inconsistency and
    scalability.
  • Challenge: Formalize issues of policy with regard
    to others' databases.
  • Challenge: Interoperability over a diversity of
    users and types of equipment.

42
Interoperability
  • One approach: the Semantic Web, the idea used to
    express the growing desire to make information
    access on the Web more knowledge-based so humans
    and intelligent software can work together.
    (Susan Gauch)

43
Interoperability
  • Another approach: Make use of languages such as
    XML developed to aid interoperability in business
    and military collaborations.
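
As a rough illustration of how XML can carry a record between databases, the sketch below serializes a specimen record to XML and reads it back using Python's standard `xml.etree.ElementTree` module. The element names (`specimen`, `scientificName`, and so on) and the sample values are made up for this example and are not taken from any published standard.

```python
import xml.etree.ElementTree as ET


def specimen_to_xml(record):
    """Serialize a flat record (field -> value) as a <specimen> element."""
    root = ET.Element("specimen")
    for field, value in record.items():
        ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_specimen(text):
    """Parse a <specimen> element back into a field -> text dictionary."""
    return {child.tag: child.text for child in ET.fromstring(text)}


# Hypothetical record; field names and values are illustrative only.
record = {
    "scientificName": "Panthera leo",
    "institution": "Example Museum",
    "catalogNumber": "12345",
}
xml_text = specimen_to_xml(record)
round_trip = xml_to_specimen(xml_text)
```

The receiving database needs only an XML parser and an agreement on the element names, which is where shared standards come in.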

44
The Many Applications of Research on the Tree of
Life
  • Side benefits in many fields
  • Agriculture
  • Biomedicine
  • Biotechnology
  • Natural resource management
  • Pest control
  • Control of emergent diseases
  • Sustainable use of biodiversity resources
  • Global climate change

45
The Many Applications of Research on the Tree of
Life
  • Let's say you're importing bananas from South
    America

46
The Many Applications of Research on the Tree of
Life
  • A camera in the hold of the ship sees a spider.
  • What kind of spider is it?
  • Is it safe to unload your cargo of bananas?

47
The Many Applications of Research on the Tree of
Life
  • Luckily, you have a digitized natural history
    database.
  • With an efficient search feature.
  • (Thanks to Diana Lipscomb for this example)

48
The Many Applications of Research on the Tree of
Life