Difference Between Expected and Observed Frequencies

Slides: 22
Provided by: ZAM92
Learn more at: http://www.cs.cmu.edu
1
GENOME SIGNATURES OF MICROBIAL ORGANISMS IDENTIFIED BY AMINO ACID N-GRAM ANALYSIS
B. Suman Bharathi
Advisor: Judith Klein-Seetharaman
Forschungszentrum Juelich, Germany
2
Genome Signatures
  • Peptide sequences that occur with unusually high
    frequency in a particular organism or pathogen,
    unlike in others
  • Potential applications
  • Drug development: synthesize drugs that target
    the genome signature in a pathogen
  • Sensor development: use the genome signature to
    identify an organism quickly using antibodies


3
Approach
  • Linguistic approach
  • N-gram analysis using toolkit
  • What the BLMT toolkit provides
  • N-gram statistical analysis
  • Definition of signature sequences
  • Use of toolkit on Neisseria meningitidis

[Figure: Neisseria meningitidis versus other species, n = 4. Y-axis: occurrence of n-gram, 0 to 0.09. X-axis: n-gram sequence of length n (AAAL, SDGI, LAAA, ALAA, LAAL, AALA, AALL, LLAA, ALLA, AVLA, AAAA, MPSE, AVAA, AAAV, GRLK, EAAA, AEAA, AAEA, AAVA, AAAE).]
4
Use of BLMT
  • N-gram statistical analysis gives detailed
    statistics on the frequency of n-grams, together
    with their means and standard deviations.
  • We considered 45 organisms: bacteria, archaea,
    mycoplasmas and human.
  • Search for n-grams whose frequencies lie many
    standard deviations away from the mean.
  • This indicates the difference between the expected
    and observed frequencies of the n-grams.
  • It ultimately shows how unusual an n-gram is in one
    organism compared with the others.
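The analysis described on this slide can be sketched in a few lines of Python. This is not the BLMT toolkit itself, only an illustration under stated assumptions: n-grams are counted as overlapping windows over protein sequences, and the "standard deviations away from the mean" score of an n-gram in one organism is taken relative to the mean and standard deviation of its frequency in the other organisms. The function names `ngram_frequencies` and `deviations_from_others` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def ngram_frequencies(proteins, n=4):
    """Relative frequency of each overlapping n-gram in a list of protein strings."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def deviations_from_others(freqs_by_organism, target):
    """Score each n-gram of `target` in standard deviations from the other organisms.

    Assumption (not confirmed by the slides): the mean and standard deviation
    are computed over all organisms except the target. At least one other
    organism is required.
    """
    others = [f for name, f in freqs_by_organism.items() if name != target]
    scores = {}
    for gram, freq in freqs_by_organism[target].items():
        values = [f.get(gram, 0.0) for f in others]
        mu = mean(values)
        sigma = pstdev(values)
        if sigma > 0:  # n-grams with identical frequency in all others are skipped
            scores[gram] = (freq - mu) / sigma
    return scores
```

Calling `ngram_frequencies` once per organism and passing the resulting dictionary of dictionaries to `deviations_from_others` gives one score per n-gram. Excluding the target from the baseline is a design choice: it keeps a single extreme organism from inflating its own mean and standard deviation.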

5
Difference Between Expected and Observed
frequencies
Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)
X-axis: n-gram
Positive values indicate over-represented n-grams, while negative values
indicate under-represented n-grams.
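A minimal sketch of how such a difference value could be computed is given here. The slides do not state how the expected frequency is obtained, so this example assumes the simplest model: amino acids occur independently, and the expected frequency of an n-gram is the product of the frequencies of its residues in the same organism. The function name `observed_minus_expected` is hypothetical.

```python
from collections import Counter


def observed_minus_expected(proteins, n=4):
    """Observed minus expected relative frequency for every n-gram that occurs.

    Assumption: expected frequency = product of single-residue frequencies
    (independence model). N-grams that never occur are not scored here.
    """
    residue_counts = Counter()
    ngram_counts = Counter()
    for seq in proteins:
        residue_counts.update(seq)
        for i in range(len(seq) - n + 1):
            ngram_counts[seq[i:i + n]] += 1

    total_residues = sum(residue_counts.values())
    total_ngrams = sum(ngram_counts.values())

    diffs = {}
    for gram, count in ngram_counts.items():
        observed = count / total_ngrams
        expected = 1.0
        for aa in gram:
            expected *= residue_counts[aa] / total_residues
        diffs[gram] = observed - expected  # > 0: over-represented, < 0: under-represented
    return diffs
```

Sorting the returned values in descending order gives a curve of the kind plotted on this slide, with the most over-represented n-grams at the start.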
6
Initial points of the difference between expected and
observed frequency graph
Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)
Ureaplasma shows high difference values (approx.
0.00021), indicating over-representation of
n-grams compared to their expected probability of
occurrence in the organism.
7
Standard deviation away from the mean
  • Mycoplasma genitalium(black)
  • M.tuberculosis(red)
  • M.leprae(green)
  • Mesorhizobium(blue)
  • Lactococcus(yellow)

Shows the distribution of n-gram standard deviations,
with both high and low difference values
indicating the over-represented and
under-represented n-grams.
8
Highest standard deviations away from the mean
  • Mycoplasma genitalium(black)
  • M.tuberculosis(red)
  • M.leprae(green)
  • Mesorhizobium(blue)
  • Lactococcus(yellow)

Shows the initial (highest) values of standard
deviations away from the mean. The n-grams of
M. tuberculosis score much higher than those of M. leprae.
9
Comparison of genome size with varying standard
deviations
  • Examine the relationship between genome size and
    distribution of n-gram standard deviations for
    each organism
  • Human genome taken as reference.
  • Compare genome size and standard deviations
    within same genus but across different species.

10
Size Distribution of Genomes
1. Human 22889476
2. Bacteria_Mesorhizobium_loti 4080256
3. Bacteria_Pseudomonas_aeruginosaPA01 3730192
4. Bacteria_Escherichia_coliO157H7 3229098
5. Bacteria_Escherichia_coliO157H7EDL933 3228100
6. Bacteria_Escherichia_coliK12 2726558
7. Bacteria_Mycobacterium_tuberculosisH37Rv 2666338
8. Bacteria_Bacillus_subtilis 2442200
9. Bacteria_Bacillus_halodurans_C125 2384352
10. Bacteria_SynechocystisPCC6803 2072748
11. Bacteria_Vibrio_cholerae_chr1 1725852
12. Bacteria_Deinococcus_radioduransR1_chr1 1559376
13. Bacteria_Xylella_fastidiosa 1490262
14. Archaea_Archaeoglobus_fulgidus 1343990
15. Bacteria_Pasteurella_multocida 1340102
16. Bacteria_Lactococcus_lactis_subsp_lactis 1335222
17. Archaea_Aeropyrum_pernix 1280062
18. B_Neisseria_meningitidis_serogroupBstrainMC58 1178096
19. Archaea_Halobacterium_spNRC1 1178038
20. B_Neisseria_meningitidis_serogroupAstrainZ2491 1176104
21. Bacteria_thermotoga_maritima 1167344
22. Bacteria_Pyrococcus_horikoshiiOT3 1141216
23. Bacteria_Mycobacterium_leprae_strinTN 1080756
24. A_Methanobacterium_thermoautotrophicum_deltaH 1054752
25. Bacteria_Haemophilus_influenzaeRd 1045572
26. Bacteria_Campylobacter_jejuni 1020944
27. Bacteria_Helicobacter_pylori_strianJ99 990942
28. Bacteria_Helicobacter_pylori26695 986258
29. Archaea_Methanococcus_jannaschii 970558
30. Bacteriae_Aquifex_aeolicus 968068
31. Archaea_Thermoplasma_acidophilum 909164
32. Archaea_thermoplasma_volcanium 903228
33. Bacteria_Chlamydophila_pneumonieaeJ138 735350
34. Bacteria_Chlamydophila_pneumonieaCWL029 725492
35. Bacteria_Chlamydophila_pneumonieaeAR39 729896
36. Bacteria_Treponema_pallidum 703414
37. Bacteria_Chlamydia_muridarum 646712
38. Bacteria_Chlamydia_trachomatis 626142
39. Bacteria_Rickettsia_prowazekii_strain_MadridE 559828
40. Bacteria_Mycoplasma_pneumoniae 480870
41. Bacteria_Ureaplasma_urealyticum 457608
42. Bacteria_Buchnera_sp_APS 371470
43. mycoplasma genitalium 352826
44. Bacteria_Borrelia_burgdorferi 300106
11
Genome size graph and varying standard deviation values
  • Human (black, 22889476)
  • Mesorhizobium (red, 4080256)
  • P. aeruginosa (green, 3730192)
  • E_coliO157H7 (blue, 3229098)
  • E_coliO157H7EDL933 (yellow, 3228100)

The organisms are listed in descending order of
genome size. The relation between distribution of
n-gram standard deviations and size is compared.
12
Tail end of Genome size and n-gram distribution
of standard deviations
Human (black, 22889476), Mesorhizobium (red, 4080256), P. aeruginosa (green, 3730192), E_coliO157H7 (blue, 3229098), E_coliO157H7EDL933 (yellow, 3228100)

The human genome, though the largest, has n-grams that
lie fewer standard deviations away from the mean
than those of the smaller genomes.
13
Initial points Genome size and n-gram
distribution of standard deviations
Human (black, 22889476), Mesorhizobium (red, 4080256), P. aeruginosa (green, 3730192), E_coliO157H7 (blue, 3229098), E_coliO157H7EDL933 (yellow, 3228100)
Human n-gram standard deviation values are almost
equal to those of Mesorhizobium, although Mesorhizobium
has a much smaller genome.
14
Genome size and n-gram distribution of standard
deviations
  • Human (black,22889476)
  • E_coliK12(red,2726558)
  • M.tuberculosis(green,2666338)
  • B.subtilis(blue,2442200)
  • B.halodurans(yellow,2384352)
  • Synechocystis(brown,2072748)

M. tuberculosis has very high n-gram standard
deviation values. They exceed those of human,
despite its smaller genome size.
15
Initial points of Genome size and n-gram
distribution of standard deviations
Human (black, 22889476), E_coliK12 (red, 2726558), M. tuberculosis (green, 2666338), B. subtilis (blue, 2442200), B. halodurans (yellow, 2384352), Synechocystis (brown, 2072748)
The thickness of the lines indicates the genome
size. The thinnest line represents
E_coliK12. Mycobacterium tuberculosis shows the
highest values.
16
Final points of Genome size and n-gram
distribution of standard deviations
Human (black, 22889476), E_coliK12 (red, 2726558), M. tuberculosis (green, 2666338), B. subtilis (blue, 2442200), B. halodurans (yellow, 2384352), Synechocystis (brown, 2072748)
M. tuberculosis and all other organisms here have
n-grams with higher difference values than human.
17
Same genus / different species
  • 4-grams in M. tuberculosis lie many more standard
    deviations from the mean than those in M. leprae

18
Mycobacterium
M. tuberculosis
M. leprae
19
Other Organisms
Neisseria meningitidis
Thermotoga maritima
Synechocystis spec.
Haemophilus influenzae
Human
20
Conclusions
  • n-grams which are at least 30 standard deviations
    away from the mean are significant candidates for
    genome signatures.
  • Difference graphs estimate the likelihood of an
    n-gram being observed in an organism.
  • Genome size graphs: there is no specific
    relationship between the size of a genome and its
    standard deviation values.
  • Same genus and different species, where genome
    size is specified: there is a noticeable
    difference between the Mycobacterium species
    (M. leprae and M. tuberculosis).
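The first conclusion amounts to a simple threshold filter, sketched here for illustration. The input is assumed to be a mapping from n-gram to its score in standard deviations from the mean. Only positive deviations are kept, on the assumption that a signature is an over-represented n-gram. The function name `signature_candidates` and its `threshold` parameter are hypothetical.

```python
def signature_candidates(z_scores, threshold=30.0):
    """Return (n-gram, score) pairs scoring at least `threshold`, highest first.

    Assumption: only over-represented n-grams (positive scores) count as
    signature candidates; the cut-off of 30 is the one named on this slide.
    """
    hits = [(gram, z) for gram, z in z_scores.items() if z >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

The threshold is exposed as a parameter so that a lower or higher cut-off can be tried without changing the code.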


21
Current and future work
  • Find n-gram signatures in E. coli.
  • Explore the relationship between genome size and
    the distribution of n-gram standard deviations
    across different species of the same organism.
  • Find more specific targets to differentiate
    species in terms of signature peptides for all
    44 organisms taken for study.