Mapping Mutations Patterns in the HIV DNA

Provided by: csTauAc

1
Mapping Mutations Patterns in the HIV DNA
  • By Nimrod Bar-Yaakov nimrod-b_at_orbotech.com
  • In cooperation with Dr. Zehava Grossman of
    Israel's Multi-Center AIDS Study Group, National
    HIV Reference Laboratory in Tel-Hashomer.

2
Today's Topics
  • HIV introduction.
  • What is so important about the HIV DNA mutations?
  • Early stages: exploring RNA mutations.
  • From RNA to amino acid mutations.
  • Recent studies and current work.
  • Results and future research.

3
Virus Overview
  • Viruses may be defined as acellular organisms
    whose genomes consist of nucleic acid, and which
    obligately replicate inside host cells using host
    metabolic machinery and ribosomes to form a pool
    of components which assemble into particles
    called VIRIONS, which serve to protect the genome
    and to transfer it to other cells.

4
Virus Animation
5
What is HIV?
  • Human immunodeficiency virus (HIV): a type of
    retrovirus that is responsible for the fatal
    illness Acquired Immunodeficiency Syndrome (AIDS).
  • Retrovirus: a virus that carries its genetic
    material in the form of RNA rather than DNA and
    has the enzyme reverse transcriptase, which can
    transcribe it into DNA.
  • In most animals and plants, DNA is usually
    transcribed into RNA; hence "retro" is used to
    indicate the opposite direction.

6
How does HIV infect the body's cells?
  • HIV begins its infection of a susceptible host
    cell by binding to the CD4 receptor on the host
    cell
  • The genetic material of the virus, which is RNA,
    is released and undergoes reverse transcription
    into DNA, which enters the host cell nucleus
    where it can be integrated into the genetic
    material of the cell.
  • Activation of the host cells results in the
    transcription of viral DNA into messenger RNA
    (mRNA), which is then translated into viral
    proteins.
  • The viral RNA and viral proteins assemble at the
    cell membrane into a new virus.
  • The virus then buds forth from the cell and is
    released to infect another cell.

7
Treatment related to the active RNA sites
  • The HIV DNA encodes proteins that are essential
    to the virus life cycle.
  • Medical treatments interfere with or block the
    operation of these proteins.
  • Reverse transcriptase inhibitors: inhibit the
    reverse transcription of the HIV RNA into DNA,
    which would otherwise be integrated into the
    cell's DNA.
  • The HIV protease protein is required to process
    other HIV proteins into their functional forms.
  • Protease inhibitors act by blocking this
    critical maturation step.

8
RNA mutations
  • Environmental/Biological processes may cause
    mutations in the HIV RNA.
  • The mutated HIV RNA is reverse-transcribed and
    integrated into the infected cell's DNA.
  • The generated amino-acid sequence is then
    altered.
  • A different protein is generated by the cell.
  • The altered protein may resist the medical
    treatment!

9
Mutation families
  • The HIV RNA has a high mutation rate (about 1000
    times higher than that of a regular cell).
  • Fast evolutionary processes cause the fittest
    mutated viruses to increase their population in
    the infected body.
  • We'll focus on 3 main mutation families:
  • Resistance mutations
  • Clade mutations
  • Other (noise/random)

10
The importance of identifying the resistance mutations
  • Selecting the best medicine treatments.
  • Understanding the way different medicines
    interact with HIV.
  • Understanding the functional interpretation of
    the RNA sequence.

11
Data Processing
  • DNA Sequence Extraction
  • DNA Sequence Alignment
  • Identifying and Filtering mutations
  • Creating consensus sequence and mutation matrix.
  • Find correlations between treatment and mutation
    patterns.

12
Extracting the RNA Sequence
  • The RNA sequences are reverse-transcribed into
    DNA sequences.
  • The DNA sequences are then multiplied several
    times.
  • A DNA sequencer reads the aligned DNA sequences.
  • The decision how to interpret a specific DNA
    segment is based on image-processing algorithms
    (define the segment boundaries and find the best
    match for the segment pattern) and isn't
    deterministic!

13
Sequence Alignment (from Ron Shamir's course)
15
Sequence Alignment
  • Before alignment:
    AtaaagakagggggacagctaaaagaggctctcTTAGACACAGGAGCAGATGATACA
    ACTCTTTGGCAGCGaCCCCGTTGTCACaATAAAAATagGGGGACAGCTAAgGGagGc
    TAAAAGAGGCTCTCTTAGCACACAGGMGCAGAYGAYACAGTMCTTASCAAGAAATAA
    ACTCTTTGGCAGCGACCCCTTGTcACAATAAAAGTAGAGGGACAGCTAAGGGAKGCT
    ACTCTTTGGCAGCGaCCCCTTGTCACAATAAAAATAGGGGACAGCTAAGGGAGGCTC
    ACTCTTTGGcAGCGACCCCTtGTCACAATAAAAGtAGGGGGaCAGCTAAAgGAGGCT
    aCTnTTnGRCAGCGaCCCCTTgTCYCARtAAAAATAGGGGGGCAGRTAARGGAGGCt
  • After alignment:
    ------------------------------ATAAAGAKAGGGGG-ACAG-CTAAAAGAGG
    ------------C-GACCCC--TTGTCACAATAARAATAGGGGG-ACAG-CTAAAAGAGG
    ACTCTTTGGCAAC-GACCCC--TTGTCACAATAAGAGTAGGGGG-ACAG-CTAAAAGAGG
    -CTCTTTGGCAAC-GA-CCCC-TTGTCACAGTAAAAATAGRAGG-ACAG-CTAAAAGAAG
    ACTCTTTGGCAAC-GA-CCCC-TTGTCACAGTAAAAATAGGAGG-ACAG-CTAAAAGAAG
    ACTCTTTGGCAAC-GA-CCCC-TTGTCACAGTAAAAATAGGAGG-ACAG-CTMAAAGAAG
    ACTCTTTGGCAAC-GA-CCCC-TTGTCACAGTAAGAATAGGAGG-ACAG-CTAAAAGAAG
  • Degapping

16
From Sequences to Mutation Matrix
17
Data Overview
18
Finding mutations and treatment correlation
  • For each RNA index i, we want to find whether
    P(Mut_in_i) is significantly different from
    P(Mut_in_i | Treatment).
  • We'll use the chi-square test at each index to
    find that.
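
A minimal sketch of the per-index test, assuming each index is summarized as a 2x2 table (treated/untreated versus mutated/not mutated). The function name, the argument order and the counts in the usage line are hypothetical; the p-value uses the exact upper tail of the chi-square distribution with 1 degree of freedom, and no continuity correction is applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Chi-square test of independence for one index.

    a: treated and mutated        b: treated and not mutated
    c: untreated and mutated      d: untreated and not mutated
    Returns (chi2, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # An empty row or column carries no information about association.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for a single index (not taken from the study data).
chi2, p = chi_square_2x2(30, 70, 10, 90)
print(chi2, p < 0.05)
```

Running the same function over every index gives one p-value per index; with many indices tested at once, some correction for multiple testing would normally be considered.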

19
Chi Square Overview
  • We will use the chi-square test to check the
    probability that our observed results came
    from the same statistical population as the
    expected (chance) results.
  • A probability of less than 0.05 means that the
    results are significant, i.e. the populations are
    significantly different.

20
Chi Square Calculations
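  • Calculating the chi-square statistic.
  • The probability Q that a χ² value calculated for
    an experiment with d degrees of freedom (where
    d = k − 1 and k is the number of categories) is
    due to chance is given by the upper tail of the
    chi-square distribution with d degrees of freedom.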
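
The formulas from this slide are not in the transcript. Assuming the standard textbook definitions, with O_i the observed count and E_i the expected count in category i (symbols introduced here for illustration), the statistic and the probability Q can be written as:

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad d = k - 1

Q(\chi^2 \mid d) = \frac{1}{2^{d/2}\,\Gamma(d/2)}
                   \int_{\chi^2}^{\infty} t^{\,d/2 - 1}\, e^{-t/2}\, dt
```

For d = 1 the integral reduces to erfc(sqrt(χ²/2)), which is the case that applies to a single treated/untreated versus mutated/not-mutated table.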

21
Example: Mutation V82A
22
DNA processing problems
  • Curse of dimensionality
  • Noisy data
  • Sequenced data are of stochastic nature
  • Small number of samples
  • Clades and sub-clades
  • Vague definitions of the independent variables'
    values.
  • Silent mutations
  • Talk Bio language!

23
Mutation Table
24
Results: Mutation D30N
  • D30N is an important resistance mutation, but it
    appears at a frequency of 0.0258 in clade C
    compared with 0.0945 in clade B. What is the
    explanation for this?
  • Correlation analysis reveals that in clade B,
    D30N is highly correlated with other resistance
    mutations. In clade C it is not.
  • One hypothesis is that the clade B structure
    influences the connections between resistance
    mutations.

25
Moving from DNA to Amino Acid mutations
  • Because DNA is translated to the AA that form the
    protein, protein functional studies focus only on
    the AA aspects of the DNA.
  • Because 3 DNA nucleotides correspond to 1 AA, we
    reduce our dimensionality 3 times (though each
    dimension can take 22 AA values).
  • Several codons correspond to the same AA, which
    reduces variability and noise.
  • HIV mutation research focuses mainly on AA and
    therefore provides more comparison data.

26
Moving from DNA to Amino Acid mutations
(Figure: codon table; axis label "Second nucleotide")
  • Translating a DNA sequence to an AA sequence is
    straightforward.

DNA sequences:
ATAAAGAKAGGGGGACAGCTAAAAGAGGC
ATAARAATAGGGGGACAGCTAAAAGAGGC
ATAAGAGTAGGGGGACAGCTAAAAGAGGC
GTAAAAATAGRAGGACAGCTAAAAGAAGC
GTAAAAATAGGAGGACAGCTAAAAGAAGC
GTAAAAATAGGAGGACAGCTMAAAGAAGC
GTAAGAATAGGAGGACAGCTAAAAGAAGC
GTAAGAATAGGAGGACAGCTAAAAGAAGC

AA sequences:
KWKPKIIGGIGGFVKVRQYDEVVVEICGK
KWKPKMIG?IGGFIKVRQYDQILIEICGK
KWKPKMIGGIG?FI?VRQYEEILIEICGK
?WKPKMIGGIGGFIKVRQYDQV?IEIC?K
KWKPKMIGGIGGF?KVRQYDQIPIEICGK
RWKPKMIGGIGGFIKVRQY?QI?IEICGK
KWKPKMIGGIGGFIKVRQYDQILIEICGK
KWKPKMIGGIGGF?KVRQYDQILIEICGK
KWKPKIIGGIGGLIKV?QYDNISIEICGK
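
A small sketch of the translation step, assuming the standard genetic code and a reading frame that starts at the first nucleotide. The function and constant names are illustrative. Codons containing an ambiguous base (such as K, R or M in the reads above) are mapped to "?", which is one plausible reading of the "?" symbols in the AA sequences, not a confirmed description of the original pipeline.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA to one-letter AA codes; unknown codons become '?'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "?") for i in range(0, usable, 3)
    )


print(translate("ATGGCC"))     # MA
print(translate("AAAGAKAGG"))  # K?R (GAK contains the ambiguity code K)
print(translate("GGAGGG"))     # GG (two codons, same AA: a silent change)
```

The last call illustrates why working at the AA level reduces noise: GGA and GGG differ at the DNA level but encode the same amino acid.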
27
Moving from DNA to Amino Acid mutations
  • A consensus AA sequence is then calculated, noisy
    data is filtered, and ambiguous AA are converted
    to the consensus values.
  • Although an AA index can take up to 20 mutated
    values, filtering and comparison with the
    literature set a maximum of 4 mutations per AA
    index.
  • A mutation frequency matrix is then calculated,
    in which every mutation, even at the same index,
    adds a frequency column to the mutation matrix.

Samples   Mutations
0A000     010000
V000L     100001
0P00L     001001
VA000     110000
000G0     000010
0P000     001000
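
One way to build such a matrix can be sketched as follows, assuming each sample is an AA sequence of the same length as the consensus and that "?" marks an ambiguous residue to be ignored. The function name, the consensus string and the sample sequences are made up for illustration.

```python
def mutation_matrix(consensus, samples):
    """Return (columns, matrix) for AA samples compared with a consensus.

    columns: sorted list of (index, mutated_aa) pairs, one per mutation.
    matrix:  one 0/1 row per sample, one entry per column.
    """
    def mutations(sample):
        return {
            (i, aa)
            for i, (ref, aa) in enumerate(zip(consensus, sample))
            if aa != ref and aa != "?"
        }

    per_sample = [mutations(s) for s in samples]
    columns = sorted(set().union(*per_sample))
    matrix = [[1 if col in muts else 0 for col in columns]
              for muts in per_sample]
    return columns, matrix


# Hypothetical consensus and samples (not taken from the study data).
columns, matrix = mutation_matrix("LVTIM", ["LATIM", "VVTIL", "LPTIL"])
print(columns)  # [(0, 'V'), (1, 'A'), (1, 'P'), (4, 'L')]
print(matrix)   # [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 1]]
```

Note that index 1 contributes two columns (A and P), matching the point that every mutation, even at the same index, gets its own column.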
28
Why search for patterns?
  • The 1-dimensional AA sequence folds into a 3D
    protein structure.
  • The protein's active sites, located along its
    folds, usually contain more than one AA.
  • Mutated protein behavior arises at its active
    sites.
  • The 3D proximity of the AA differs from their
    proximity in the sequence.

29
Active sites pattern (from http://www.rcsb.org/pdb/index.html)
Protease Protein 3D Structure
AA Sequence
ADDTVLEEINLPEKWTPKMIGGIGGFVKVRQYDQIPIEICGKKVIGAVLVGPTPANVIGRNL
ADDTMLEEINLPEKWTPKMIGGIGGFVKVRQMDQIPICICGKKVIGAVLVGPTPANVIGRNL
Sequence mutation pattern
Active site changing
30
Problem definition
  • Find a correlation between a specific mutation
    pattern across all the samples and a specific
    treatment.

Mutations (one row per sample)   NFV Treated
01110100                         1
10000100                         0
00100101                         1
11100000                         0
00000000                         0
00100110                         0
00010000                         0
00100100                         1
Both columns feed the Chi Calculation.
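
The counting step behind this chi calculation can be sketched as below. It assumes the eight bit strings and the eight NFV flags on the slide pair up row by row, which is a reading of the flattened figure, and it picks the pattern 00100100 purely as an example. A sample "contains" the pattern when it carries every mutation the pattern marks with 1.

```python
def pattern_treatment_table(rows, treated, pattern):
    """Count samples by (contains pattern?, treated?).

    rows:    bit strings, one per sample, one bit per mutation
    treated: 0/1 flags, one per sample
    pattern: bit string marking the mutations that form the pattern
    Returns (with_pattern_treated, with_pattern_untreated,
             without_pattern_treated, without_pattern_untreated).
    """
    mask = int(pattern, 2)
    a = b = c = d = 0
    for row, is_treated in zip(rows, treated):
        has_pattern = (int(row, 2) & mask) == mask
        if has_pattern and is_treated:
            a += 1
        elif has_pattern:
            b += 1
        elif is_treated:
            c += 1
        else:
            d += 1
    return a, b, c, d


rows = ["01110100", "10000100", "00100101", "11100000",
        "00000000", "00100110", "00010000", "00100100"]
nfv_treated = [1, 0, 1, 0, 0, 0, 0, 1]
print(pattern_treatment_table(rows, nfv_treated, "00100100"))  # (3, 1, 0, 4)
```

The four counts form the 2x2 table on which the chi-square statistic for that pattern and treatment is computed.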
31
Recent Biological studies
  • Thorough research and data gathering are done at
    Stanford University: the HIV Drug Resistance
    Database.
  • Each sample can contribute only one vote to a
    pattern's count, though many sub-patterns can be
    located in one sample.

32
Recent research: Bayesian networks
  • K. Deforche et al. studied the dependency
    relationships between treatment type and AA
    mutations using Bayesian networks.

33
Recent research: Bayesian networks
  • Though it seems a promising way of finding
    relationships between a mutation and the
    treatment, a Bayesian network looks for
    connections between one variable and another,
    whereas in our case we may want to look at the
    relationship between one group of variables and
    another.
  • Interpreting the Bayesian network is also a hard
    task, and it may only give us directions or clues
    toward regions where we must examine the data
    again in order to prove statistical significance
    between the variables.

34
Data Challenges
  • Because each sample can contain interesting and
    non-interesting mutations, or mutations from
    different patterns, we must treat every
    mutation pattern in the sample as a candidate.
  • We then sum the number of appearances of each
    pattern candidate in order to calculate the
    chi-square statistic.

Samples total patterns
00100100110  1
00100000011  1
10100101110  2
00010010000  2
00100100110  3
00110100100  4
00000100100  4
00110100110  5
10100100000  5
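
The candidate counting described above can be sketched as follows, assuming each sample is given as a set of mutation names and that candidate patterns are capped at a small size to keep the enumeration tractable. The cap, the function name and the toy samples are assumptions for illustration. Each sample gives at most one vote to any pattern.

```python
from collections import Counter
from itertools import combinations


def count_pattern_candidates(samples, max_size=3):
    """Count, for every sub-pattern, how many samples contain it."""
    counts = Counter()
    for mutations in samples:
        ordered = sorted(set(mutations))
        for size in range(1, min(max_size, len(ordered)) + 1):
            for pattern in combinations(ordered, size):
                counts[pattern] += 1  # one vote per sample per pattern
    return counts


# Toy samples with placeholder mutation names.
counts = count_pattern_candidates([{"a", "b"}, {"a", "b", "c"}, {"c"}])
print(counts[("a", "b")])       # 2
print(counts[("a", "b", "c")])  # 1
```

Without the size cap, a sample with K mutations produces 2^K - 1 candidates, which is the source of the exponential cost of a naïve search.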
35
Data Challenges
  • The complexity of naïvely traversing all the
    patterns is O(N^2 · 2^K), where K is the largest
    number of mutations in a single sample and N is
    the number of samples.
  • In our data K can reach 30 and N is 1000, so
    naïve search is not feasible.
  • Since p(a,b|T) is hard to predict from p(a|T) and
    p(b|T), gradient-descent methods of traversing
    the mutation pattern space (where in every step
    we add a mutation to the pattern) may be
    fruitless.
  • There is also no apparent property of the
    statistic function we want to maximize that
    might ease our search.
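
A quick check of the scale involved, reading the bound as N^2 · 2^K (the exponents are assumed to have been lost in the transcript) and plugging in the values quoted above. The function name is illustrative.

```python
def naive_pattern_checks(n_samples, max_mutations):
    """Order-of-magnitude operation count for the naive search."""
    return n_samples ** 2 * 2 ** max_mutations


# N = 1000 samples, K = 30 mutations in the largest sample.
print(f"{naive_pattern_checks(1000, 30):.2e}")  # 1.07e+15
```

At roughly 10^15 elementary steps, the naïve traversal is out of reach, which motivates the pruning strategies that follow.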

36
Branch and Bound (Little et al., 1963)
  • An algorithmic technique to find the optimal
    solution by keeping the best solution found so
    far. If a partial solution cannot improve on the
    best, it is abandoned.
  • Sometimes we can determine that a given node in
    the solution space does not lead to the optimal
    solution, either because the given solution and
    all its successors are infeasible or because we
    have already found a solution that is guaranteed
    to be better than any successor of the given
    solution. In such cases, the given node and its
    successors need not be considered.
  • In effect, we can prune the solution tree,
    thereby reducing the number of solutions to be
    considered.

37
Branch and Bound
38
Branch and Bound
  • Cache pattern results in order to avoid
    recalculation.
  • Lower bounds:
  • If (ab) < 3
  • If p(A) > 0.5 and p(B) > 0.5, no need to check
    (AB): empirically studied, probably has
    biological reasoning.
  • Upper bound:
  • Statistical significance
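
A minimal sketch of the pruned search, implementing only one of the bounds above and reading "(ab) < 3" as "the pattern occurs in fewer than 3 samples", which is an assumption. Because adding a mutation to a pattern can never increase the number of samples containing it, a branch that falls below the threshold can be abandoned together with all its extensions. The function name and the toy data are illustrative; the other bounds on the slide are not implemented here.

```python
def frequent_patterns(samples, min_support=3):
    """Depth-first pattern search with support-based pruning.

    samples: list of sets of mutation names.
    Returns {pattern (tuple of names): number of samples containing it}.
    """
    items = sorted(set().union(*samples))
    results = {}

    def extend(pattern, start, covering):
        for j in range(start, len(items)):
            item = items[j]
            new_cover = [s for s in covering if item in s]
            if len(new_cover) < min_support:
                continue  # prune: no extension can regain support
            new_pattern = pattern + (item,)
            results[new_pattern] = len(new_cover)
            extend(new_pattern, j + 1, new_cover)

    extend((), 0, list(samples))
    return results


# Toy data; the mutation names are used only as labels.
samples = [{"D30N", "N88D"}, {"D30N", "N88D"},
           {"D30N", "N88D", "L90M"}, {"L90M"}, {"D30N"}]
print(frequent_patterns(samples))
# {('D30N',): 4, ('D30N', 'N88D'): 3, ('N88D',): 3}
```

Each surviving pattern would then be scored with the chi-square statistic against the treatment of interest; storing the counts in `results` also serves as the cache of pattern results.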

39
Branch and Bound results
  • Discovered all single major mutations that appear
    in the data.
  • Discovered three major pattern groups: two of
    them are known and one is new. We need to find
    out whether it has any biological meaning.

40
Branch and Bound Results
41
Branch and Bound Results
  • Pros
  • Exhaustive: good patterns cannot escape
  • Simple to understand and implement
  • Cons
  • The probability lower bound isn't well defined
  • Can take too long in DNA pattern calculations

42
Biclustering (Cheng and Church, 2000)
  • A clustering process that mines columns and rows
    simultaneously (say, rows for observations/genes
    and columns for dimensions/samples).
  • A bi-cluster is a subset of rows that exhibit
    similar behavior across a subset of columns, and
    vice versa.
  • Each node can relate to several bi-clusters at a
    time.
  • Originally developed for mining gene expression
    data.

43
Biclustering: SAMBA (A. Tanay, R. Sharan, and R. Shamir, 2002)
  • Statistical-Algorithmic Method for Bicluster
    Analysis.
  • Create a bipartite graph from the data, where
    the left side is the genes and the right is the
    conditions.
  • Connect the vertices on the two sides with edges
    according to their expression level, and weight
    the edges accordingly.

44
Biclustering: SAMBA (A. Tanay, R. Sharan, and R. Shamir, 2002)
  • Tanay et al. have shown how to assign weights to
    the vertex pairs so that a maximum-weight
    bicluster corresponds to a maximum-likelihood
    bicluster.
  • Therefore we can reduce the problem to finding
    the heaviest subgraph in a bipartite graph, a
    known combinatorial problem.

45
Adapting SAMBA
  • For each treatment:
  • Samples on one side, mutations on the other.
  • Add an edge if the sample contains the mutation.
  • Modify the weighting scheme so that it relates to
    the chi-square statistic.
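
The graph construction above can be sketched as follows. The scoring function uses a placeholder weighting (+1 for an edge, -1 for a missing edge) that is purely hypothetical: it is neither SAMBA's likelihood-based scheme nor the chi-square-related scheme the slide proposes, and it only illustrates how a candidate sub-graph would be scored once weights are chosen. All names and the toy matrix are illustrative.

```python
def build_bipartite_edges(matrix):
    """Edges (sample_index, mutation_index) for every 1 in the matrix."""
    return {
        (i, j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value
    }


def bicluster_weight(edges, sample_ids, mutation_ids,
                     edge_weight=1.0, non_edge_weight=-1.0):
    """Total weight of the sub-graph induced by the chosen vertices.

    The default weights are placeholders, not SAMBA's actual weights.
    """
    return sum(
        edge_weight if (i, j) in edges else non_edge_weight
        for i in sample_ids
        for j in mutation_ids
    )


# Toy mutation matrix: rows are samples, columns are mutations.
matrix = [[1, 1, 0],
          [1, 1, 0],
          [0, 1, 1]]
edges = build_bipartite_edges(matrix)
print(bicluster_weight(edges, {0, 1}, {0, 1}))        # 4.0
print(bicluster_weight(edges, {0, 1, 2}, {0, 1, 2}))  # 3.0
```

Under any such scheme, a dense block of samples sharing the same mutations scores higher than the whole matrix, which is what a heaviest-subgraph search exploits.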

46
The End!
  • Thank you for listening.
  • Any Questions?