Parallel Detection of Regulatory Elements with gMP - PowerPoint PPT Presentation

About This Presentation
Slides: 29
Provided by: Csu48
Learn more at: http://www.cs.umd.edu

Transcript and Presenter's Notes

Title: Parallel Detection of Regulatory Elements with gMP


1
Parallel Detection of Regulatory Elements with gMP
Bertil Schmidt, Lin Feng, Amey Laud, Yusdi Santoso
Damayanti Gupta CMSC 838 Presentation
2
Motivation
  • Fundamental question
  • How are expression levels of thousands of genes
    regulated?
  • Very important
  • Understanding of gene function
  • Response to environment
  • Understand genetic causes of diseases
  • Evaluate effects of drugs
  • Detect mutations
  • Remember
  • Sets of genes -> Pathways -> Genetic Networks
  • Gene regulation
  • Control decisions turn genes on/off
  • Gene Regulation Network

3
Talk Overview
  • Overview of talk
  • Motivation
  • Technique
  • Experiment
  • Related work
  • Conclusions

4
Technique
  • Motifs upstream of genes regulate gene expression
  • Motifs are sites of regulatory activity
  • Identify regulatory motifs by combining
  • Gene expression data
  • Detect common motifs occurring upstream of genes
  • Huge datasets
  • Utilise parallel computing

5
Technique
  • gRNA
  • Java development framework
  • gMP
  • Java communication library
  • REDUCE
  • Algorithm to identify regulatory motifs
  • REDUCE parallelised with gMP
  • Increase computing power
  • Get motifs ranked in statistical significance

6
gRNA framework
  • Consists of APIs

7
gRNA - APIs
  • Interact with data sources
  • Provide functionality from biology
  • Pipeline tasks into a unified process
  • Repository of resources
  • Distributed programming

8
gRNA environment
  • gRNA Grid
  • Clustered computing environment
  • Application written for gRNA
  • Multiple-tier application
  • Applications operate from client computer
  • Communicates with cluster through single computer
  • Hosts EJB server
  • Server identifies processing nodes
  • Each of these performs tasks

9
gRNA Grid
10
gMP
  • Java based message passing tool
  • Built on top of sockets
  • Manages virtual processors to run on available
    machines
  • Scalable
  • Machines added/removed easily

11
gMP
  • Processes are grouped
  • Communication primitives provided for sending and
    receiving data
  • Collective communication to several nodes enabled
    modularly and efficiently
  • Enables functions to be implemented on data
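
The send/receive primitives and the collective communication described on these two slides can be illustrated with a minimal sketch. This is not gMP's API (gMP is a Java library and its interface is not shown in the slides); it is Python for brevity, and every name here (`send_msg`, `recv_msg`, `broadcast`) is hypothetical. It only shows the general idea of message passing layered directly on sockets.

```python
import pickle
import socket
import struct

def send_msg(sock, obj):
    """Send one Python object as a length-prefixed pickled message."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before message was complete")
        buf.extend(chunk)
    return bytes(buf)

def recv_msg(sock):
    """Receive one message written by send_msg."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))

def broadcast(socks, obj):
    """Naive collective operation: send the same object to every peer in turn."""
    for s in socks:
        send_msg(s, obj)

# Two connected endpoints in one process stand in for two virtual processors.
left, right = socket.socketpair()
send_msg(left, {"motif": "ACG", "counts": [2, 1, 0]})
received = recv_msg(right)
left.close()
right.close()
```

A real library would add process groups, non-blocking transfers and tree-shaped collectives; the loop in `broadcast` is the simplest possible stand-in.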

12
REDUCE algorithm
  • Based on model
  • Upstream motifs contribute additively to
    expression level of each gene
  • Quantify the extent to which these motifs
    contribute to expression data
  • Fit log of expression ratio to sum of activating
    and inhibitory terms
  • Find statistically most significant motifs
  • Plots of fitting parameters suggest biological
    function

13
REDUCE algorithm
  • Terms
  • Occurrence vector
  • Measure of how often a motif is found
  • Expression vector
  • Measure of gene expression

14
REDUCE method
  • Consists of
  • 1) Motif frequency counter
  • counts occurrences of DNA motifs upstream of each
    ORF
  • motifs are about 7-11 nucleotides in length
  • get occurrence vectors
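
The counting step above can be sketched as follows. This is a minimal illustration, assuming overlapping matches are counted and only the given strand is scanned (the slides do not say either way); the function names and the toy sequences are hypothetical.

```python
from itertools import product

def all_motifs(max_len, alphabet="ACGT"):
    """Enumerate every DNA word of length 1..max_len."""
    return ["".join(p) for k in range(1, max_len + 1)
            for p in product(alphabet, repeat=k)]

def count_occurrences(upstream, motif):
    """Count (possibly overlapping) occurrences of motif in one upstream region."""
    return sum(1 for i in range(len(upstream) - len(motif) + 1)
               if upstream[i:i + len(motif)] == motif)

def occurrence_vectors(upstream_regions, motifs):
    """Map each motif to its occurrence vector (one count per ORF)."""
    return {m: [count_occurrences(seq, m) for seq in upstream_regions]
            for m in motifs}

# Toy upstream regions for three hypothetical ORFs.
regions = ["ACGTACGT", "TTTTACG", "GGGG"]
vectors = occurrence_vectors(regions, ["ACG", "GG", "T"])
# vectors["ACG"] == [2, 1, 0]
```

Each occurrence vector has one entry per ORF, so its length equals the number of genes in the expression vector it is later compared with.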

15
REDUCE algorithm
  • 2) Significant motif finder
  • Use
  • i) Normalised occurrence vector for each motif:
    n_µ
  • ii) Normalised vector of logs of gene expression
    ratios: a
  • Take the dot product of these (a · n_µ) and square it.
  • Can be considered as frequency of occurrence ×
    expressive power of regulatory motif
  • It is squared to get rid of negatives
  • Correlate gene expression with occurrence of motif
  • Motif with the largest squared dot product is the most significant
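
A minimal sketch of this scoring step follows. The slides do not define "normalised"; the code assumes it means subtracting the mean and scaling to unit Euclidean length. The function names and the toy data are hypothetical.

```python
import math

def normalise(v):
    """Centre a vector on zero mean and scale it to unit length (assumed convention)."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centred))
    if norm == 0.0:
        return [0.0] * len(v)  # a constant vector carries no signal
    return [x / norm for x in centred]

def score(a, n_mu):
    """Squared dot product (a . n_mu)^2 of two normalised vectors."""
    d = sum(x * y for x, y in zip(a, n_mu))
    return d * d

def most_significant(a, occurrence):
    """Return the motif whose normalised occurrence vector scores highest against a."""
    return max(occurrence, key=lambda m: score(a, normalise(occurrence[m])))

# Hypothetical log expression ratios for four genes and two motif count vectors.
a = normalise([1.0, 0.5, -0.5, -1.0])
occurrence = {"ACG": [2, 1, 0, 0], "TTT": [1, 0, 1, 0]}
best = most_significant(a, occurrence)  # "ACG"
```

Under this normalisation the dot product is a correlation coefficient, so the squared score lies between 0 and 1 and treats activating and inhibitory motifs alike.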

16
....
  • a is modified to remove effect of this motif
  • residual gene expression vector
  • Process repeated until motifs are ranked
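
The repeat-until-ranked loop can be sketched as below. How the effect of a motif is removed is not spelled out on the slide; this sketch assumes the residual is obtained by subtracting the projection of a onto the chosen motif's normalised vector, without renormalising. All names are hypothetical.

```python
import math

def normalise(v):
    """Centre on zero mean and scale to unit length (assumed convention)."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centred))
    return [x / norm for x in centred] if norm > 0 else [0.0] * len(v)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def rank_motifs(expression, occurrence, top_k):
    """Greedy ranking: pick the best motif, remove its contribution, repeat."""
    a = normalise(expression)
    unit = {m: normalise(v) for m, v in occurrence.items()}
    ranked = []
    for _ in range(min(top_k, len(unit))):
        best = max((m for m in unit if m not in ranked),
                   key=lambda m: dot(a, unit[m]) ** 2)
        ranked.append(best)
        c = dot(a, unit[best])
        # Residual expression vector: a with the chosen motif's component removed.
        a = [x - c * y for x, y in zip(a, unit[best])]
    return ranked, a

ranked, residual = rank_motifs([1.0, 0.5, -0.5, -1.0],
                               {"ACG": [2, 1, 0, 0], "TTT": [1, 0, 1, 0]},
                               top_k=2)
# ranked == ["ACG", "TTT"]
```

Because each choice is made against the current residual only, the procedure is greedy: an early pick is never revisited.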

17
Table: Finding significant motifs
  • Uses a = (.5816, .2522, .2886, -.5947, -.1595,
    -.3683)

18
REDUCE parallelised with gMP...
  • Parallel motif frequency counter
  • Split set of ORFs equally
  • Distribute across available nodes
  • Each node calculates in parallel to get occurrence
    vectors
  • Matrix transposition
  • Occurrence vectors scattered across nodes
  • Advantageous to store each vector in single node
  • Transpose motif frequency matrix
  • For each ORF can only calculate fraction of
    occurrence frequencies for all motifs
  • But the entire occurrence frequency is needed
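
The split-count-transpose pattern on this slide can be simulated in a single process. The sketch below is an assumption-laden illustration, not the gMP implementation: "nodes" are just list entries, the all-to-all exchange is a plain Python regrouping, and every function name is hypothetical.

```python
def split_evenly(items, n_nodes):
    """Split items into n_nodes contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_nodes)
    chunks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def count(seq, motif):
    """Overlapping occurrences of motif in seq."""
    return sum(seq[i:i + len(motif)] == motif
               for i in range(len(seq) - len(motif) + 1))

def local_counts(orf_chunk, motifs):
    """Work done on one node: one row of motif counts per local ORF."""
    return [[count(seq, m) for m in motifs] for seq in orf_chunk]

def transpose_across_nodes(per_node_rows, motifs, n_nodes):
    """Simulated all-to-all exchange: afterwards each node owns the complete
    occurrence vectors (over all ORFs) for its share of the motifs."""
    all_rows = [row for rows in per_node_rows for row in rows]
    owned = split_evenly(list(range(len(motifs))), n_nodes)
    return [{motifs[j]: [row[j] for row in all_rows] for j in cols}
            for cols in owned]

regions = ["ACGTACGT", "TTTTACG", "GGGG", "ACGG"]
motifs = ["ACG", "GG", "T"]
n_nodes = 2
per_node = [local_counts(chunk, motifs)
            for chunk in split_evenly(regions, n_nodes)]
by_motif = transpose_across_nodes(per_node, motifs, n_nodes)
# by_motif[1] == {"T": [2, 4, 0, 0]}
```

Before the exchange a node holds every motif's count for a few ORFs; afterwards it holds a few motifs' counts for every ORF, which is the layout the dot products need.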

19
...
  • Parallel significant motif finder
  • Normalises occurrence vector within each node
  • At each node, most significant motif calculated
  • Global most significant motif calculated
  • Process iterated to rank occurrence vectors
  • Interface in gRNA allows ease of implementation
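
One round of the parallel motif finder can be sketched as a local maximum per node followed by a global reduction. As before this is a single-process simulation under assumptions: normalisation means zero mean and unit length, nodes are dictionaries of the motifs they own, and all names are hypothetical.

```python
import math

def normalise(v):
    """Centre on zero mean and scale to unit length (assumed convention)."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centred))
    return [x / norm for x in centred] if norm > 0 else [0.0] * len(v)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def local_best(a, node_vectors):
    """On one node: best (score, motif) among the motifs this node owns."""
    return max((dot(a, normalise(v)) ** 2, m) for m, v in node_vectors.items())

def global_best(a, nodes):
    """Simulated reduction: compare the per-node winners and keep the overall best."""
    return max(local_best(a, vecs) for vecs in nodes if vecs)

# Hypothetical expression vector and two nodes, each owning whole occurrence vectors.
a = normalise([1.0, 0.5, -0.5, -1.0])
nodes = [{"ACG": [2, 1, 0, 0], "GG": [0, 0, 3, 1]},
         {"TTT": [1, 0, 1, 0]}]
best_score, best_motif = global_best(a, nodes)  # best_motif == "ACG"
```

Only one (score, motif) pair per node crosses the network in each round, so the communication volume of this step is small; the winning motif's vector then presumably has to be shared so that every node can update a.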

20
Experiment
  • Use Compaq Alpha system
  • Consists of cluster of 8 AlphaServer SC/ES45
  • Connected by high-speed Alpha SC 16-Port switch
    and ELAN PCI adapter cards.
  • Each server contains 4 Alpha EV68 processors

21
Results
  • Use 7090 gene expressions of yeast
  • ORFs of length 600
  • Motifs up to length 7
  • Throughput (in MBytes/s) also shown
  • 20 most significant motifs computed.

22
Analysis
  • Runtime scales well with number of processing
    nodes
  • Frequency counter scales perfectly
  • Motif finder also scales
  • Cannot achieve perfect scaling because of
    communication overhead.

23
Related work
  • DiscoveryLink
  • Provides configurable wrappers as interfaces to
    multiple data sources
  • Kleisli system
  • Systematically manages and integrates external
    databases
  • Uses functional query language to perform
    correlation across databases
  • Toolkits designed with functionality for
    specialised areas
  • BioJava, BioPerl, PAL
  • Sequence Analysis
  • Ensembl initiative, DAS
  • provide extensible approach to issue of
    annotating genomic data

24
Related work
  • Previous approaches using Java for high
    performance computing
  • Bindings into native message-passing
    APIs (e.g. MPI)
  • Does not allow easy integration into larger Java
    applications
  • Pure Java message passing interfaces
  • JMPI, CCJ
  • Both implemented on top of Java RMI
  • Slower than using raw sockets
  • CCJ tries to overcome
  • optimised RMI implementation
  • not portable
  • Neither can handle integration

25
Comparison
  • According to authors ...
  • gRNA distinguishes itself
  • Addresses the whole range of requirements for
    applications in computational biology
  • Provides decoupled, yet inter-related subsystems
  • Ease of 3rd party implementation

26
Observations
  • REDUCE surpasses traditional clustering approach
  • REDUCE algorithm has high runtime
  • Complexity depends on the product of the number of
    possible motifs and the number of genes.
  • Grows exponentially with length of sequences
  • So length of motif is restricted
  • REDUCE algorithm is greedy
  • suboptimal
  • REDUCE is simplistic
  • lacks parameters for interactions between motifs
  • does not consider impact of other biological
    knowledge
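
The runtime observation above can be made concrete with a little arithmetic. The sketch below assumes a four-letter DNA alphabet, counts every word of length 1 up to the maximum motif length, and ignores any merging of reverse complements; the gene count of 7090 is taken from the results slide as an assumed vector length. The function names are hypothetical.

```python
def motif_count(max_len, alphabet_size=4):
    """Number of distinct words of length 1..max_len over the alphabet."""
    return sum(alphabet_size ** k for k in range(1, max_len + 1))

def multiply_adds_per_pass(max_len, n_genes):
    """Multiply-add operations needed to score every motif once against a."""
    return motif_count(max_len) * n_genes

print(motif_count(7))                    # 21844
print(motif_count(8))                    # 87380
print(multiply_adds_per_pass(7, 7090))   # 154873960
```

Each extra nucleotide of motif length roughly quadruples the work, which is why the motif length has to be capped.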

27
...
  • Not clear that results of REDUCE are biologically
    significant
  • Experiment does not effectively show how higher
    computation power helps results
  • Analysis covers only 9 to 16 processors; is this
    sufficient to determine good scaling?

28
Conclusions
  • Finally...
  • gRNA demonstrates efficient mechanism for
    development of genome-centric applications
  • Further...
  • Extensions to REDUCE have been proposed
  • require higher computing power
  • more specialised programming interfaces required
  • Identifying communication patterns
  • Use of data structures e.g. sequences, trees,
    matrices