The MoBIoS Project: Molecular Biological Information System

Slides: 41
Provided by: danielm7

Transcript and Presenter's Notes

1
The MoBIoS Project: Molecular Biological
Information System
  • Daniel P. Miranker
  • Dept. of Computer Sciences
  • Center for Computational Biology and
    Bioinformatics
  • University of Texas

Weijia Xu, Rui Mao, Will Briggs, Smriti
Ramakrishnan, Shu Wang, Lulu Zhang
2
  • Problem: In the life sciences, database management
    systems (DBMS) serve as glorified file managers.
  • Little use of sophisticated data and
    pattern-based retrieval
  • Real scientific and technological problems

3
When biological data is put into an RDBMS
  • Primary data is stored in text or blob fields
  • Annotations may be relational
  • Data retrieval
  • Filter the DB, then dump the result sequentially, O(n), to utilities
  • e.g., BLAST

Organism  Function  Sequence
Yeast     membrane  AACCGGTTT
Yeast     mitosis   TATCGAAA
E. coli   membrane  AGGCCTA
4
Linear Data Scans, O(n), Endemic in Life Sciences
  • Sequences
  • DNA, RNA, Protein databases
  • Mass Spectra
  • proteomics
  • Small Molecules, Protein Structure
  • Protein interaction
  • Rational drug design
  • Pathways (graphs)
  • Phylogenies (graphs, trees in particular)

5
Scope: To Find Common Ground, Both Biology and
DBMS Have to Move
DBMS
Biological Information System
Metric-Space Database as the Common Ground
6
A Metric Space is
  • a pair, M = (D, d),
  • where
  • D is a set of points
  • d is a metric distance function with the
    following properties
  • d(x, y) = d(y, x)
    (symmetry)
  • d(x, y) > 0 for x ≠ y, d(x, x) = 0
    (non-negativity)
  • d(x, z) ≤ d(x, y) + d(y, z)
    (triangle inequality)

x y z
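
As a small illustration of the three properties, the following sketch checks them by brute force on a finite point set. The choice of Hamming distance over DNA 3-mers and the helper name `is_metric` are assumptions made for the example, not something taken from the slides.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

def is_metric(points, d) -> bool:
    """Brute-force check of the three metric properties on a finite set."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):                          # symmetry
            return False
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):   # non-negativity
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):                 # triangle inequality
            return False
    return True

# Hypothetical test set: all 64 DNA 3-mers.
points = ["".join(p) for p in product("ACGT", repeat=3)]
print(is_metric(points, hamming))
```

A check like this only covers the sampled points, so it can refute a candidate distance function but cannot prove it is a metric in general.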
7
Definition - By Analogy
  • A Spatial Database Management System
  • Extend relational DBMS
  • Special indexes for 2D and 3D data: k-d trees and
    R-trees
  • New data types
  • Geographic information systems
  • Topographic maps
  • Buildings and the like
  • A Metric-Space Database Management System
  • Extend Relational DBMS
  • Special indexes for metric-spaces
  • New data types
  • Biological information system
  • Life science data types

8
Develop index structures to support distance and
nearest-neighbor queries
  • Well studied in main-memory
  • But by no means a closed problem
  • In databases (external/disk based methods)
  • Embryonic
  • Many myths
  • Often assumed to be the basis of multimedia
    database systems

9
How to build a metric-space index
  • Three algorithmic classes [Tasan, Ozsoyoglu 04]
  • Vantage points
  • Hyperplanes
  • Bounding spheres

10
Vantage Point Method [Burkhard, Keller 73]
11
Vantage Point Method
Choose a point, VP,
and a radius, R
12
Vantage Point Method
  • Given VP, R
  • The predicates
  • d(VP, x) < R
  • d(VP, x) ≥ R
  • divide the set into two equal halves
  • Apply recursively

Choose a point, VP,
and a radius, R
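
The recursive split can be sketched roughly as follows. This is a minimal illustration, not the MoBIoS implementation: picking VP at random, taking R as the median distance, the dictionary node layout, and the names `build_vp_tree` and `tree_points` are all assumptions made for the example.

```python
import random
from statistics import median

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(ca != cb for ca, cb in zip(a, b))

def build_vp_tree(points, dist, leaf_size=2, rng=None):
    """Recursively split points by d(VP, x) < R and d(VP, x) >= R."""
    rng = rng or random.Random(0)       # fixed seed so the sketch is repeatable
    if len(points) <= leaf_size:
        return {"leaf": list(points)}
    i = rng.randrange(len(points))      # assumption: VP is picked at random
    vp, rest = points[i], points[:i] + points[i + 1:]
    dists = [dist(vp, x) for x in rest]
    R = median(dists)                   # assumption: median radius balances the halves
    inside = [x for x, d in zip(rest, dists) if d < R]
    outside = [x for x, d in zip(rest, dists) if d >= R]
    return {"vp": vp, "R": R,
            "inside": build_vp_tree(inside, dist, leaf_size, rng),
            "outside": build_vp_tree(outside, dist, leaf_size, rng)}

def tree_points(node):
    """Collect every point stored in the tree, vantage points included."""
    if "leaf" in node:
        return list(node["leaf"])
    return [node["vp"]] + tree_points(node["inside"]) + tree_points(node["outside"])

# Hypothetical input: a handful of DNA 3-mers.
frags = ["ACA", "CAA", "AAC", "ACA", "ATC", "TCA", "CAA", "AAA"]
tree = build_vp_tree(frags, hamming)
```

With a discrete distance such as Hamming distance, many points tie at the median, so the two halves are only approximately equal in this sketch.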
13
Query, q, range r
(Figure: query point q with range r)
14
Query, q, range r
  • if
  • d(q, VP) > R + r
  • then
  • all neighbors are outside the sphere

(Figure: sphere of radius R around VP, with the query q and its range r lying outside the sphere)
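
The pruning rule follows from the triangle inequality: any x with d(q, x) ≤ r satisfies d(VP, x) ≥ d(q, VP) - r, so when d(q, VP) > R + r no match can lie inside the sphere. A minimal single-level sketch is below. The function name `range_query`, the symmetric test for skipping the outside half, and the 1-D integer data are assumptions for illustration. A real index would store the two halves rather than recompute them for each query.

```python
def range_query(points, vp, R, q, r, dist):
    """Range query over one vantage-point split.

    Returns (matches, number of candidate points actually examined).
    """
    inside = [x for x in points if dist(vp, x) < R]
    outside = [x for x in points if dist(vp, x) >= R]
    d_q = dist(q, vp)
    candidates = []
    if d_q <= R + r:       # otherwise all neighbors are outside the sphere
        candidates += inside
    if d_q + r >= R:       # otherwise all neighbors are inside the sphere
        candidates += outside
    return [x for x in candidates if dist(q, x) <= r], len(candidates)

# Hypothetical 1-D data with absolute difference as the metric.
points = list(range(0, 100, 5))
hits, examined = range_query(points, vp=50, R=25, q=90, r=5,
                             dist=lambda a, b: abs(a - b))
print(hits, examined)
```

In this toy run d(q, VP) = 40 exceeds R + r = 30, so the inside half is never compared against the query.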
15
Multi-vantage point method
16
Multi-vantage point method
  • Consider d(VPi, x) a projection onto an axis
  • Looks like a k-d tree
  • Choose number k d

17
Myths
  • Solved problem: M-trees [Ciaccia et al. 96, 97]
  • I can't get them to work on anything but their
    original synthetic data generator
  • A good choice for vantage points is to find
    corners [Yianilos 93] (farthest-first clustering)
  • Might be true for Euclidean spaces
  • Early result: not true for our data
  • High-dimensional indexing always asymptotically
    reduces to linear scans.
  • Formal result based on an assumption of uniform
    data distributions.

18
Comparison of Three Methods of Metric-Space
Indexing
19
Open problems
  • Is there a general metric-space index structure
    that is good for most workloads?
  • We are optimistic that mvp-trees plus further tuning
    will be a useful answer
  • Hyperplane methods are fair game: there is
    circumstantial evidence that they are a key
    component in Google's search engine.
  • No work addresses clustering data pages on disk.
  • Metric-space join algorithms

20
Biological Models are Usually Based on Similarity
  • Similarity
  • Biologists like scoring functions that reward each
    similar feature with a positive number
  • Intuitive
  • Distance
  • More similar → smaller numbers
  • Identical → 0

21
But Do Metric Models Capture Biology?
  • Metrics are a subset of possible mathematical
    models

22
Sequence Problem 1
  • Sequence similarity based on weighted edit
    distance
  • Accepted weight matrices, PAM and BLOSUM, are not
    metric
  • Log-odds matrices: negative values
  • Defy simple algebraic normalization [Taylor, Jones 93;
    Linial et al. 97]

23
Our First Result: mPAM [Xu, Miranker 04]
  • PAM derivation [Dayhoff et al. 74]
  • Took a set of closely related protein sequences
  • Developed a phylogenetic tree
  • Counted substitutions to transform one sequence
    to another
  • Tree determines a measure of time

24
PAM vs. mPAM: t = 1/f
  • Using original substitution counts
  • PAM: frequency of substitution
  • S(a, b; t) = log( P(b | a, t) / q_b )
  • mPAM: expected time between substitutions
  • D(a,b) 1/log(1 ?(P(a,x)P(b,x))

x
25
Sequence Problem 2
  • Sequences: long units (identity for storage and
    retrieval)
  • Genes
  • Chromosomes
  • Analysis comprises comparing small substrings

26
Solution: Sequence View
  • New view type
  • Breaks sequences into q-grams

create SEQUENCEVIEW rice_sview as SELECT CREATE
FRAGMENTS (, 3, 1) FROM WHERE USING
HAMMING-DISTANCE
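
To make the q-gram idea concrete, here is a small sketch of breaking a sequence into overlapping fragments. Reading the two numeric arguments of the view definition as fragment length 3 and step 1 is an assumption, as are the function name `fragments` and the 1-based offsets.

```python
def fragments(seq, length=3, step=1):
    """Break seq into overlapping q-grams, returned as (1-based offset, fragment) pairs."""
    return [(off + 1, seq[off:off + length])
            for off in range(0, len(seq) - length + 1, step)]

def hamming(a, b):
    """Number of positions at which two equal-length fragments differ."""
    return sum(ca != cb for ca, cb in zip(a, b))

for offset, frag in fragments("ATCAAA"):
    print(offset, frag, hamming(frag, "AAA"))
```

Because every fragment has the same length, Hamming distance is well defined between any two of them, which is what lets the fragments be indexed as points in a metric space.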
27
Materialize as an Index
D(AAA) 2
Rowid  Offset  Logical Fragment
R1     1       A C A
R1     2       C A A
R1     3       A A C
R1     4       A C A

R2     1       A T C
R2     2       T C A
R2     3       C A A
R2     4       A A A

D(ACA) 1 D(CAA) 0 D(ATC) 1
Genomes
Rowid  Seq
R1     CAACA
R2     ATCAAA
R3

28
Status
  • Started with McKoi
  • A Java open source object-relational DBMS
  • (Think of Postgres written in Java)
  • Added
  • Biological data types
  • Metric-space index
  • Extending SQL engine (in progress)

29
Computed in MoBIoS
  • Compare Arabidopsis Genome X Rice Genome
  • Locate nucleotide patterns of form
  • primer pair candidate
  • Eliminate non-unique primer candidates
  • Merge overlapping primer candidates
  • Usual implementations O(n^2), n = 10^9

(Figure: a primer pair candidate. In both Rice and Arab., two runs of ≥18 matching nucleotides are separated by a gap 400 to 3000 long.)
30
mSQL Query to locate candidate primer pairs
  • SELECT merge(R1.fragment, A1.fragment)
  • FROM
  • G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
  • WHERE
  • distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) < 1.0 AND
  • distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) < 1.0 AND
  • (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) > 400 AND
  • (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) < 3000 AND
  • (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) > 400 AND
  • (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) < 3000
  • GROUP BY R1.fragment, A1.fragment
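
The join conditions of the query can be sketched on toy data as follows. Since Hamming distance is integer-valued, a distance below 1.0 means the fragments are identical, so this sketch substitutes a plain hash lookup for the metric-space index. The function names, the toy sequences and the small gap limits are assumptions for illustration, and the merge and GROUP BY steps are left out.

```python
from collections import defaultdict

def qgram_index(seq, q):
    """Map each q-gram of seq to the list of offsets where it occurs."""
    index = defaultdict(list)
    for off in range(len(seq) - q + 1):
        index[seq[off:off + q]].append(off)
    return index

def candidate_primer_pairs(rice, arab, q, min_gap, max_gap):
    """Pairs of matching fragments whose offsets are min_gap..max_gap apart in both genomes."""
    arab_index = qgram_index(arab, q)
    # Every (rice offset, arab offset) whose fragments are identical.
    matches = [(r, a) for r in range(len(rice) - q + 1)
               for a in arab_index.get(rice[r:r + q], ())]
    pairs = []
    for r1, a1 in matches:
        for r2, a2 in matches:
            if min_gap < r2 - r1 < max_gap and min_gap < a2 - a1 < max_gap:
                pairs.append(((r1, a1), (r2, a2)))
    return pairs

# Toy sequences; real genomes would use the gap limits 400 and 3000.
rice = "ACGTTTTGCA"
arab = "ACGCCCCCGCA"
print(candidate_primer_pairs(rice, arab, q=3, min_gap=4, max_gap=10))
```

The nested loop over matches is quadratic in the number of matching fragments, so this only illustrates the semantics of the query, not an efficient plan.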

31
Query Plan
  • Arab. genome, O(n); Rice genome, O(m)
  • Offline: build sequence view, O(n log n)
  • Compare, O(m log n): indexed nested loop
  • Eliminate duplicates
  • Eliminate low-complexity primers (LZ compression)
  • Merge overlapping primers
  • 10,000 conserved

32
Preliminary Results
  • Found 13,418 possible primer pairs from MoBIoS
  • 100 best candidates BLASTed for matches in
    GenBank
  • 15 matched other plant genes and the primers
  • At least 2 of 15 showed potential after PCR
    amplification against Helianthus and
    Phalaenopsis.

33
MoBIoS Architecture (Molecular Biological
Information System)
34
Analysing Mass-Spectra
  • Spectrum: histogram of mass/charge ratios of a
    collection of peptides
  • Similarity: shared peaks count = inner product
  • (0100101) · (0111100) = 2

35
Cosine Distance Approx. Inner Product
  • D_rs = 1 - (x_r · x_s) / ((x_r · x_r)^(1/2) (x_s · x_s)^(1/2))
  • Shown: we can store and retrieve mass spectra
    using cosine distance, and it scales
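
A short sketch of the two quantities on binary peak vectors, reusing the two example spectra from the shared-peaks slide. The function names and the error raised for an empty spectrum are assumptions for illustration.

```python
import math

def inner(x, y):
    """Inner product; on 0/1 peak vectors this is the shared peaks count."""
    return sum(a * b for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 minus the cosine of the angle between x and y."""
    norm = math.sqrt(inner(x, x)) * math.sqrt(inner(y, y))
    if norm == 0:
        raise ValueError("cosine distance is undefined for an all-zero spectrum")
    return 1.0 - inner(x, y) / norm

r = [0, 1, 0, 0, 1, 0, 1]
s = [0, 1, 1, 1, 1, 0, 0]
print(inner(r, s))            # shared peaks count
print(cosine_distance(r, s))
```

Dividing by the two norms keeps a spectrum with many peaks from looking similar to everything merely because it shares a few peaks with each.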

36
mSQL Query for Protein Identification by
Mass-Spec. Signature Database Lookup
  • SELECT Prot.accesion_id, Prot.sequence
  • FROM protein_sequences Prot, digested_sequences DS,
  • mass_spectra MS
  • WHERE
  • MS.enzyme = DS.enzyme = E and
  • Cosine_Distance(S, MS.spectrum, range1) and
  • DS.accession_id = MS.accession_id = Prot.accesion_id and
  • DS.ms_peak = P and MPAM250(PS, DS.sequence, range2)

37
Matching Electrostatic Shape of Molecules
38
Still benefit from grid-services
  • Intermittently, but regularly, compile (recluster)
    the indices: O(n log n), n > 10^6
  • Rational drug design: O(log n) finite-element
    solutions to traverse the search tree.
  • Make a service call to the grid for these
    operations only
  • Mirror data contents to minimize I/O
  • Since the need is intermittent, one grid serves many
    MoBIoS servers

(Figure: GRID and MoBIoS Server. Labels: recluster, new index, shape match (FEM), distance (real), high-speed I/O, mirror DB contents.)
39
Hyperplanes [Uhlmann 91]
  • If d(x, h1) < d(x, h2) then x is assigned to h1

h1
x
h2
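
The assignment rule can be sketched as below. The range-query pruning test is not on the slide. It is the standard consequence of the triangle inequality for this kind of split: a match within r of q on the h1 side forces d(q, h1) - d(q, h2) < 2r, and symmetrically for h2. Function names and the 1-D data are assumptions for illustration.

```python
def split(points, h1, h2, dist):
    """Assign each point to the closer of the two pivots (ties go to h2)."""
    near_h1 = [x for x in points if dist(x, h1) < dist(x, h2)]
    near_h2 = [x for x in points if dist(x, h1) >= dist(x, h2)]
    return near_h1, near_h2

def range_query(points, h1, h2, q, r, dist):
    """Range query over one hyperplane split.

    Returns (matches, number of candidate points actually examined).
    """
    near_h1, near_h2 = split(points, h1, h2, dist)
    gap = dist(q, h1) - dist(q, h2)
    candidates = []
    if gap <= 2 * r:       # otherwise no match can be on the h1 side
        candidates += near_h1
    if -gap <= 2 * r:      # otherwise no match can be on the h2 side
        candidates += near_h2
    return [x for x in candidates if dist(q, x) <= r], len(candidates)

# Hypothetical 1-D data with absolute difference as the metric.
points = list(range(10))
hits, examined = range_query(points, h1=2, h2=7, q=9, r=1,
                             dist=lambda a, b: abs(a - b))
print(hits, examined)
```

As in the vantage-point sketch, a real index would store the two sides instead of recomputing the split for every query.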
40
Develop a Hierarchical Clustering
(Figure: points A, B, C, D, E, F grouped into bounding spheres)
  • Hierarchy of bounding spheres, each a (center, radius) pair
  • Bounding spheres may overlap
  • Inspired by R-trees
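
A minimal sketch of a single bounding sphere and the test that decides whether a range query needs to look inside it. Choosing the center as the member point with the smallest maximum distance to the others is an assumption made for the example, as are the function names and the 1-D data. The hierarchy itself is not built here.

```python
def bounding_sphere(points, dist):
    """Return (center, radius), where center is the member that minimises the radius."""
    center = min(points, key=lambda c: max(dist(c, x) for x in points))
    radius = max(dist(center, x) for x in points)
    return center, radius

def may_contain_match(center, radius, q, r, dist):
    """False only if the triangle inequality rules out every point in the sphere."""
    return dist(q, center) <= radius + r

# Hypothetical 1-D clusters with absolute difference as the metric.
d = lambda a, b: abs(a - b)
clusters = [[1, 2, 3, 4, 5], [20, 22, 24]]
spheres = [bounding_sphere(c, d) for c in clusters]
to_search = [c for c, (center, radius) in zip(clusters, spheres)
             if may_contain_match(center, radius, q=6, r=1, dist=d)]
print(spheres, to_search)
```

Because spheres may overlap, a query can pass this test for several siblings at once, and all of them then have to be searched.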