Bioinformatics PhD. Course

Provided by: lcl2

1
Bioinformatics PhD. Course
Summary (approximate)
  • 1. Biological introduction
  • 2. Comparison of short sequences (< 10,000 bp)
  • 3. Comparison of large sequences (up to 250,000,000 bp)
  • 4. Sequence assembly
  • 5. Efficient data search structures and algorithms
  • 6. Proteins...

2
3. Comparison of large sequences
Summary (more or less)
  • 3.1 Overview
  • 3.2 Suffix trees
  • 3.3 MUMs

3
Sequence assembly
It has two applications:
  • DNA sequencing: determining the bases of a DNA sequence.
  • EST assembly: using mRNA fragments to find the genes expressed in a cell.

4
DNA sequencing
Techniques employed:
  • Hybridization: allows the tuples of a given length in a sequence to be found.
  • Shotgun: breaks a sequence into small pieces.

6
Hybridization
Imagine we want to determine the sequence
xxxxxxxxxxxxx
and we know that it contains the following triplets:
AAC GAT TGC ACG CGG GCC TTG GGA ATT
How can the sequence be established?
7
Hybridization
We create a graph based on suffix-prefix overlaps:
AAC GAT TGC ACG CGG GCC TTG GGA ATT
The sequence is deduced by following the path in the graph:
AACGGATTGCC
What is the cost of finding the path?
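The graph construction and path search on this slide can be sketched as follows. This is a minimal illustration, not the course's own code: the function names `overlap_graph`, `hamiltonian_path` and `spell` are hypothetical, and the path search is plain backtracking, which is exponential in the worst case.

```python
def overlap_graph(tuples):
    """Edge u -> v when the last L-1 letters of u equal the first L-1 of v."""
    return {u: [v for v in tuples if v != u and u[1:] == v[:-1]] for u in tuples}


def hamiltonian_path(graph):
    """Backtracking search for a path that visits every node exactly once."""
    def extend(path, visited):
        if len(path) == len(graph):
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in graph:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(path):
    """First tuple, then the last letter of every following tuple."""
    return path[0] + "".join(t[-1] for t in path[1:])


triplets = "AAC GAT TGC ACG CGG GCC TTG GGA ATT".split()
path = hamiltonian_path(overlap_graph(triplets))
print(spell(path))  # AACGGATTGCC
```

For these nine triplets every node has at most one successor, so the search is trivial; the cost question on the slide concerns the general case.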
8
Hybridization
Let us consider a more realistic case.
In the general case we must find a Hamiltonian path (NP-complete).
What is the cost of the entire hybridization technique?
9
Hybridization technique
Cost
1. Find the L-tuples: AAC, CAA, ACG, ...
All $4^L$ possible tuples are constructed and searched for.
2. Find the overlaps.
If there are $m$ pieces of length $L$, then there are $O(m^2 L^2)$ comparisons.
3. Create the graph and find the Hamiltonian path.
NP-complete.
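Step 1 can be imitated in software as a rough sketch. The function name `spectrum` is hypothetical, and the sketch assumes the sequence is already known, which is only true for testing: in the laboratory the search is carried out by hybridization against an unknown sequence.

```python
from itertools import product


def spectrum(sequence, L):
    """Construct all 4**L tuples and keep those that occur in the sequence."""
    all_tuples = ("".join(letters) for letters in product("ACGT", repeat=L))
    return [t for t in all_tuples if t in sequence]


found = spectrum("AACGGATTGCC", 3)
print(len(found), "of", 4 ** 3, "possible triplets occur")  # 9 of 64 possible triplets occur
```

The number of candidate tuples grows as $4^L$, which is why the tuple length cannot be made arbitrarily large.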
10
Note
Linear cost: $O(m)$
Quadratic cost: $O(m^2)$
Exponential cost: $O(2^m)$
11
Hybridization technique
Cost
1. Find the L-tuples: AAC, CAA, ACG, ...
All $4^L$ possible tuples are constructed and searched for.
2. Find the overlaps.
If there are $m$ pieces of length $L$, then there are $O(m^2 L^2)$ comparisons.
3. Create the graph and find the Hamiltonian path.
NP-complete.
How can we avoid NP-completeness?
12
Hybridization: two reductions
Find the Hamiltonian path (NP-complete)
or find the Eulerian path (linear).
[Figure: graph nodes AA, AC, CG, GG]
13
Hybridization: Eulerian path
14
Hybridization: Eulerian path
Algorithm: create a random path from a starting node to an ending node,
then add circuits at balanced nodes.
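One common way to realize this idea is Hierholzer's algorithm, sketched below on the triplets from the earlier slides. This is an illustrative assumption, not necessarily the exact procedure shown in the lecture: each tuple becomes an edge from its first L-1 letters to its last L-1 letters, and the stack-based walk splices in circuits whenever it backs up to a node that still has unused edges. The name `eulerian_path` is hypothetical, and the sketch does not check that the graph is connected or that an Eulerian path exists.

```python
from collections import defaultdict


def eulerian_path(tuples):
    """Hierholzer-style walk over the graph whose edges are the tuples."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for t in tuples:
        u, v = t[:-1], t[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: emit node, back up
    return path[::-1]


triplets = "AAC GAT TGC ACG CGG GCC TTG GGA ATT".split()
nodes = eulerian_path(triplets)
print(nodes[0] + "".join(n[-1] for n in nodes[1:]))  # AACGGATTGCC
```

Every edge is pushed and popped once, so the running time is linear in the number of tuples.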
16
Hybridization technique
Cost
1. Find the L-tuples: AAC, CAA, ACG, ...
All $4^L$ possible tuples are constructed and searched for.
2. Find the overlaps.
If there are $m$ pieces of length $L$, then there are $O(m^2 L^2)$ comparisons.
3. Create the graph and find the Eulerian path.
Linear.
What is the limiting factor?
17
Hybridization: limitations of the technique
Repeated fragments:
CAACGGATTGCC
CAACGGACGGATTGCC
What is the probability that a fragment repeats?
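The effect of a repeat on the two example sequences can be checked with a short sketch. The helper name `repeated_tuples` is hypothetical. It counts how often each L-tuple occurs; hybridization only reports whether a tuple is present, so these counts are exactly the information that is lost.

```python
from collections import Counter


def repeated_tuples(sequence, L):
    """Return the L-tuples that occur more than once, with their counts."""
    counts = Counter(sequence[i:i + L] for i in range(len(sequence) - L + 1))
    return {t: n for t, n in counts.items() if n > 1}


print(repeated_tuples("CAACGGATTGCC", 3))      # {}
print(repeated_tuples("CAACGGACGGATTGCC", 3))  # {'ACG': 2, 'CGG': 2, 'GGA': 2}
```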
18
Hybridization
We estimate the probability that a fragment repeats.
Model: a random sequence of length $N$ in which the four bases are equally likely (probability 1/4 each).
Given 2 fragments, the probability that they are identical is $4^{-L}$.
Given 3 fragments, the probability that two of them are identical is approximately $\binom{3}{2} 4^{-L}$.
Given $m$ fragments, the probability that two of them are identical is approximately $\binom{m}{2} 4^{-L}$.
If $L = 8$ and we want this probability to be about 1%, then $m \approx 32$.
Conclusion: the hybridization technique can only be applied to short sequences.
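The estimate can be evaluated numerically with a small sketch. The function name `repeat_estimate` is hypothetical, and the formula is the pair-counting approximation from the slide under the equiprobable model, not an exact probability.

```python
from math import comb


def repeat_estimate(m, L):
    """Pair-counting estimate: (m choose 2) * 4**(-L)."""
    return comb(m, 2) * 4 ** (-L)


for m in (2, 32, 64, 128):
    print(m, repeat_estimate(m, 8))
```

For $L = 8$ and $m = 32$ the estimate is $496 / 65536 \approx 0.0076$, which is of the order of 1%, and it grows roughly quadratically with $m$.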
19
Excursion: the equiprobability hypothesis
To what extent are sequences equiprobable?
21
DNA sequencing
Techniques available:
  • Hybridization: tells us which words of a fixed length occur in the sequence.
  • Shotgun: fires at the sequence and breaks it into pieces.

22
Shotgun
Imagine we want to determine the sequence
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
and our technique allows us to:
  • copy it
  • split it at random into pieces of different lengths, without knowing their order

What can we do?
23
Shotgun: algorithm
Imagine xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx
xxxx
The algorithm will be:
1st. Compare all the pieces pairwise to find out how they overlap (discarding inclusions).
2nd. Build the suffix-prefix graph.
3rd. Find the path.
24
Shotgun
We copy it three times:
xxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxx
and obtain the pieces accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt
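As a rough illustration, the twelve pieces can be put back together with a greedy merge. This is a swapped-in heuristic, not the graph-path method of the lecture: it discards pieces contained in other pieces, then repeatedly merges the pair with the longest suffix-prefix overlap. The function names are hypothetical, and overlaps are exact matches, so sequencing errors are not handled.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def remove_inclusions(pieces):
    """Drop duplicates and pieces that occur inside another piece."""
    unique = list(dict.fromkeys(pieces))
    return [p for p in unique if not any(p != q and p in q for q in unique)]


def greedy_assemble(pieces):
    """Repeatedly merge the two contigs with the largest overlap."""
    contigs = remove_inclusions(pieces)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        if best_k == 0:
            break  # no overlaps left: several separate contigs remain
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


pieces = ["accgt", "aggt", "acgatac", "accttta", "tttaac", "gataca",
          "accgtacc", "ggt", "acaggt", "taacgat", "accg", "tacctt"]
print(greedy_assemble(pieces))  # ['accgtacctttaacgatacaggt']
```

On this small example the greedy choice happens to recover a single sequence containing every piece; with repeats it can merge the wrong pair, which is one reason to reason about the overlap graph instead.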
25
Shotgun
The pieces must be compared to see which ones join suffix to prefix:
  • Directly with dynamic programming (quadratic cost):
    all against all, and most pairs will not join.
  • In two steps:
    1. Detect the pieces that join (linear cost with the hash algorithm).
    2. Apply dynamic programming only to the pieces that join.
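The first of the two steps can be sketched with a hash table of short words. This is an assumption about what the hash algorithm refers to, since the transcript does not spell it out: every word of length `k` is indexed, and only pieces sharing a word become candidates for the expensive comparison. The name `candidate_pairs` and the default `k=4` are hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(pieces, k=4):
    """Index every k-letter word, then pair up pieces that share a word."""
    index = defaultdict(set)
    for idx, piece in enumerate(pieces):
        for start in range(len(piece) - k + 1):
            index[piece[start:start + k]].add(idx)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


pieces = ["accgtacc", "tacctt", "accttta", "gataca"]
print(sorted(candidate_pairs(pieces)))  # [(0, 1), (1, 2)]
```

Building the index is linear in the total length of the pieces. Listing the pairs is cheap only when each word is shared by few pieces, which is the expected situation when most pieces do not join.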

26
Excursion: the hash algorithm
27
Shotgun
We build the graph (quadratic cost).
28
Shotgun: problems
Problems:
  • Consecutive repeats
  • Short repeats far apart
  • Lack of coverage (problems during sequencing)
  • Errors in the pieces (problems during sequencing)

29
Shotgun: properties of the coverage
We study the coverage.
30
Shotgun: percentage of coverage
What percentage of the sequence is covered?
Degree of coverage of the sequence: $N d / L$
Assume the segments are uniformly distributed.
31
Excursion: the binomial distribution
Binomial distribution $B(n, p)$
32
Excursion: the Poisson distribution
Then the probability that at least one ball falls is
$\mathrm{Prob}(X > 0) = 1 - \mathrm{Prob}(X = 0) = 1 - e^{-\lambda}$
33
Shotgun: percentage of coverage
If we want 99% coverage, we need $N d / L \approx 4.6$.
If we want 99.9% coverage, we need $N d / L \approx 6.9$.
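These two figures can be reproduced with a short calculation. It assumes that the degree of coverage $N d / L$ plays the role of the Poisson parameter, so that a position is covered with probability $1 - e^{-N d / L}$. The transcript does not state this link explicitly, but it matches the numbers on the slide. The function names are hypothetical.

```python
import math


def covered_fraction(coverage):
    """Expected fraction covered when the degree of coverage is N*d/L."""
    return 1.0 - math.exp(-coverage)


def required_coverage(target_fraction):
    """Degree of coverage needed to cover the given fraction."""
    return -math.log(1.0 - target_fraction)


for target in (0.99, 0.999):
    print(target, round(required_coverage(target), 1))
```

Each additional factor of ten in the uncovered fraction costs about $\ln 10 \approx 2.3$ more units of coverage.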
34
EST assembly
We have thousands of pieces, each about 500 bases long, that belong to different ...
The algorithm will be:
1st. Compare all the pieces pairwise to find out which are related (discarding inclusions).
2nd. Build the suffix-prefix graph
(many small graphs result).
3rd. Find the path.