Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly - PowerPoint PPT Presentation

Slides: 36
Provided by: cse5
Learn more at: https://www.cse.unr.edu
Transcript and Presenter's Notes

1
Utilizing Fuzzy Logic for Gene Sequence
Construction from Sub Sequences and
Characteristic Genome Derivation and Assembly
2
Contents
  • Team
  • Bioinformatics
  • Genome Sequencing
  • Research Problem
  • Fuzzy Logic
  • Ongoing Work
  • Future Work

3
Team
  • Advisors
  • PI: Dr. Gregory Vert (Dept. of Computer Science,
    University of Nevada Reno)
  • Co-PI: Dr. Alison Murray (Desert Research
    Institute, Reno)
  • Co-PI: Dr. Monica Nicolescu (Dept. of Computer
    Science, University of Nevada Reno)
  • Student
  • Sara Nasser (Dept. of Computer Science,
    University of Nevada Reno)

4
Bioinformatics: Genome Sequencing
  • Genome sequencing is figuring out the order of
    DNA nucleotides, or bases, in a genome: the order
    of As, Cs, Gs, and Ts that make up an organism's
    DNA.
  • Sequencing the genome is an important step
    towards understanding it.
  • The whole genome can't be sequenced all at once
    because available methods of DNA sequencing can
    only handle short stretches of DNA at a time.

5
Genome Sequencing
  • Much of the work involved in sequencing lies in
    putting together this giant biological jigsaw
    puzzle.
  • Various problems occur, such as
  • Errors in reading
  • Flips

6
Shotgun Sequencing
  • The "whole-genome shotgun" method involves
    breaking the genome up into small pieces,
    sequencing the pieces, and reassembling the
    pieces into the full genome sequence.

7
Environmental Genomics
  • Multiple sequence alignment is an important first
    step in many bioinformatics applications such as
    structure prediction, phylogenetic analysis and
    detection of key functional residues.
  • The accuracy of these methods relies heavily on
    the quality of the underlying alignment.

[1]
8
Multiple Sequence Alignment
  • The traditional multiple sequence alignment
    problem is NP-hard, which means that no efficient
    exact algorithm is known and exact solutions are
    impractical for more than a few sequences [1].
  • In order to align a large number of sequences,
    many different approaches have been developed.

9
Tools and Techniques
  • MUMMER
  • Phrap, Phred, Consed
  • TIGR
  • The Smith-Waterman Algorithm
  • Tree-Based Algorithms

10
Meta-genomics
  • Meta-genomics is the application of modern
    genomics techniques to the study of communities
    of microbial organisms directly in their natural
    environments, bypassing the need for isolation
    and lab cultivation of individual species.

[2]
11
Meta-genomics
  • Bacteria can often have minor variations in their
    DNA that can result in different metabolic
    characteristics.
  • The differences can make it difficult to classify
    bacteria taxonomically.
  • What has been needed is a method of creating a
    characteristic representation (characteristic
    genome) from the subsequences of DNA found in
    several subvariants of bacteria of the same
    species.
  • Such a genome could be used for more efficient
    classification at a molecular level through the
    process of controlled generalization.

12
Research Goals
  • Given a collection of nucleotide sequences from
    multiple organisms, develop techniques based on
    fuzzy set theory and other methods for assembly
    of the sequences into the original full genome
    for each organism.
  • Use the above techniques to develop a
    generalized approach for creating a
    characteristic genome that represents a
    generalization of the original organisms that
    donated sequence data.

13
The Data
  • SYM (Original Raw Data)
  • Contains 302K Sequences
  • Average length of 450 base pairs (bp)
  • It was obtained from a community of bacteria
  • An estimated 100 organisms are present
  • Suppose, for example, that 75% of the data is
    repeated; we would still need to reassemble a
    sequence of roughly 34 million bp
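The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the sequence count and average length are the rounded figures quoted on the slide, the 75% repeat fraction is the slide's own hypothetical, and the variable names are illustrative.

```python
# Rounded figures from the slide; the repeat percentage is hypothetical.
num_sequences = 302_000   # "302K sequences"
avg_length_bp = 450       # average read length in base pairs
repeat_percent = 75       # assumed share of redundant data

total_bp = num_sequences * avg_length_bp
unique_bp = total_bp * (100 - repeat_percent) // 100

print(f"{total_bp:,} bp of raw data")        # 135,900,000 bp of raw data
print(f"{unique_bp:,} bp left to assemble")  # 33,975,000 bp left to assemble
```

With these rounded inputs the non-redundant portion comes to just under 34 million bp, which matches the order of magnitude quoted on the slide.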

14
Motivation
  • Current tools could not solve the problem
  • Complexity of the dataset, since the sequences
    are from the same species.
  • We are sequencing environmental genomes, not a
    single organism.
  • Few tools sequence environmental genomes.
  • Algorithm
  • Underlying algorithm determines the accuracy of
    match.
  • Performance can be highly improved.
  • Interfaces could be improved.

15
Problem
  • Genome assembly is an O(2^k) problem.
  • Using dynamic programming, this cost can be
    reduced.
  • Example in seconds
  • An assembly that takes around 1,125,899,906,842,624
    (2^50) seconds can be reduced to 2,500 seconds!
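The two figures in the example are consistent with a problem size of k = 50: the first equals 2^50 and the second equals 50^2. The slide does not state k or the reduced complexity, so reading the improvement as O(2^k) to O(k^2) is an inference from the numbers, not the authors' statement. A short check:

```python
# Assumed problem size, chosen because 2**50 matches the slide's first figure.
k = 50

exhaustive_steps = 2 ** k   # exponential cost, O(2^k)
quadratic_steps = k ** 2    # quadratic cost, O(k^2), inferred from "2500"

print(exhaustive_steps)  # 1125899906842624
print(quadratic_steps)   # 2500
```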

16
A Start
  • We divide the problem in two steps
  • Acquiring subsets such that each subset
    represents an organism
  • Assembling this into a characteristic genome
    sequence
  • These two steps are carried out by
  • Clustering
  • Assembly

17
Steps

  • Diagram labels: Raw Data, Clustering, Assembly,
    Assembled Sequences, Characteristic Genome
18
The Data
  • CAVEEG (Cleaned Dataset)
  • Contains 128K Sequences
  • Assembled Singletons
  • Lengths range from 200 bp to 1,000 bp

19
D2 Cluster
  • Software for clustering genome sequences
  • The technique is based on distance.
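The slides say only that the technique is distance based. As an illustration of what a sequence distance can look like, the sketch below compares how often each short word (k-mer) occurs in two sequences. The function names and the word length are hypothetical, and the code is not taken from the D2 Cluster software.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def word_count_distance(a: str, b: str, k: int = 3) -> int:
    """Sum of squared differences between the k-mer counts of a and b."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    words = counts_a.keys() | counts_b.keys()
    # A Counter returns 0 for words it has never seen.
    return sum((counts_a[w] - counts_b[w]) ** 2 for w in words)


print(word_count_distance("ACGT", "ACGT"))       # 0
print(word_count_distance("AAAA", "CCCC", k=2))  # 18
```

A distance of this kind ignores where the shared words occur, which is one reason closely related sequences can end up in the same cluster even when they do not overlap end to end.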

20
Clustering with D2 Cluster
  • Clustering was performed on 128K CAVEEG Dataset
  • One dataset with 100K Sequences was obtained
  • Majority of the data falls into one cluster
  • This makes the process of separating organisms
    hard
  • The clustering/assembly failed to assemble the
    sequences (the number of organisms was estimated
    manually and compared)

21
Problem with D2 clustering
  • Does not look for contigs
  • Example: the same cluster may contain
  • AATGCGTATTCGATGCGC
  • CATACTTAGTCGATC AG
  • When we assemble, we want overlapping sequences
    such as
  • AATGCGTATTCGATGCGC
  • TGCGCATCGTATCG
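The desired pair above overlaps end to start: the first sequence ends with TGCGC and the second begins with it. A minimal sketch of detecting such an overlap and merging the two reads follows. The function names and the minimum overlap length are assumptions for illustration, not the authors' implementation.

```python
from typing import Optional


def suffix_prefix_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Join two reads on their overlap, or return None if there is none."""
    size = suffix_prefix_overlap(left, right, min_len)
    return left + right[size:] if size else None


a = "AATGCGTATTCGATGCGC"
b = "TGCGCATCGTATCG"
print(suffix_prefix_overlap(a, b))  # 5
print(merge(a, b))                  # AATGCGTATTCGATGCGCATCGTATCG
```

Two sequences that are merely similar in composition, like the first pair on the slide, return 0 here and would not be merged.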

22
Problems
  • Because the sequences are closely related, the
    clustering technique assigns them to the same
    cluster.
  • Existing tools are unable to assemble the data
    correctly.
  • The clustering software can only perform one
    round of clustering.

23
Ongoing Work
  • Genome assembly using dynamic programming
  • Uses Longest Common Sub-Sequence
  • LCS is commonly used (e.g., Mummer)
  • We added restrictions
  • Enforce strict matches
  • Encoding of data
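For reference, the textbook dynamic-programming solution to the longest common subsequence problem is sketched below. It shows only the standard algorithm; the restrictions, strict-match rule, and data encoding mentioned above are not specified in the slides and are not reproduced here.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y.

    Runs in O(len(x) * len(y)) time and keeps only two rows of the table.
    """
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Extend the best subsequence that ended before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4 (the subsequence GTAB)
```

The table has one cell per pair of positions, so the cost grows with the product of the two sequence lengths rather than exponentially.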

24
Results
25
Then
  • We added clustering.
  • Instead of comparing every sequence with every
    other sequence, we can compare each one with a
    group.
  • This is faster because it needs fewer comparisons.
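A rough count shows how large the saving can be. The sketch below assumes each sequence is compared once against one representative per group. The group count of 100 is hypothetical, borrowed from the estimated number of organisms, and 128,000 is the rounded size of the cleaned dataset.

```python
def pairwise_comparisons(n: int) -> int:
    """Comparisons needed to compare every sequence with every other one."""
    return n * (n - 1) // 2


def group_comparisons(n: int, groups: int) -> int:
    """Comparisons needed if each sequence is checked against each group once."""
    return n * groups


n = 128_000    # rounded size of the cleaned dataset
groups = 100   # hypothetical number of groups

print(f"{pairwise_comparisons(n):,}")       # 8,191,936,000
print(f"{group_comparisons(n, groups):,}")  # 12,800,000
```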

26
Clustering
[3]
27
Comparison of Clustering
28
Performance Comparison
29
How much does it matter?
  • Obtaining an exact full-length sequence is not
    essential
  • A sequence that is very close to the original is
    desired

30
Snapshot 1
31
Snapshot 2
32
Fuzzy Logic
  • Fuzzy Logic has been used extensively in
    approximate string matching using distance
    measures, etc.
  • However, very little work has been done on
    applying it to building genomes from subsequences
    of nucleotides.
  • The concept of similarity and the application of
    fuzzy logic, a relatively new area in nucleotide
    sequencing, will be defined.
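The slides leave the fuzzy definition of similarity to future work. One possible reading, shown purely as an illustration, is to turn a distance measure into a degree of membership between 0 and 1 in the fuzzy set "similar sequences". The sketch uses edit distance normalized by the longer length; both the measure and the function names are assumptions, not the authors' definition.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity_degree(a: str, b: str) -> float:
    """Membership degree in [0, 1]: 1.0 means identical, 0.0 means nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(similarity_degree("ACGT", "ACGT"))  # 1.0
print(similarity_degree("ACGT", "ACGA"))  # 0.75
```

A graded score like this lets a near match count as mostly similar instead of being rejected outright, which fits the earlier point that a sequence very close to the original is acceptable.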

33
In Future
  • Compare the technique with alignment software
    (Phrap, Mummer)
  • Improve clustering
  • Define Similarity using Fuzzy Logic
  • Define Dissimilarity
  • Parallelize the process

34
References
  • [1] http://bioinformatics.oxfordjournals.org/cgi/content/full/21/8/1408FIG1
    (accessed May 2006).
  • [2] DeLong EF (2002) Microbial population
    genomics and ecology. Curr Opin Microbiol 5:
    520-524.
  • [3] http://www.togaware.com/datamining/survivor/kmeans04.png

35
Questions