About This Presentation
Title:

Pattern Recognition

Description:

Pattern Recognition: Primary Sequences with Functional Significance; Gene Finding; Recognition of Coding Regions; What type of information is present within a primary ...

Slides: 144
Provided by: genomeUab
Transcript and Presenter's Notes

Title: Pattern Recognition


1
Pattern Recognition
  • Primary Sequences with Functional Significance

2
Gene Finding
3
Recognition of Coding Regions
  • What type of information is present within a
    primary genomic nucleotide sequence which might
    provide a hint as to which genomic sequences code
    for proteins?

4
Gene Signals
  • Promoters
  • Terminators
  • Ribosome binding sites
  • Start/Stop codons

5
Context Searches
  • Open reading frames
  • Codon usage preference
  • Species-specific preferences
  • Non-random nucleotide patterns
  • GC base composition
  • Species-specificity GC percentage
  • Third position GC bias
  • General base frequency
  • Non-random triplet organization
  • Hidden Markov Models
  • Neural Networks

6
Reading Frame Prediction and Codon Analysis
  • Frames
  • CodonFrequency
  • Correspond

7
Gene Finding Programs
  • CodonPreference
  • TestCode
  • Genemark
  • Glimmer
  • GrailPro
  • Many Others

8
Translation
  • Translate
  • BackTranslate
  • PepData

9
Sequence Patterns
  • Composition
  • Terminator
  • Repeat
  • Window
  • StatPlot

10
Finding Open Reading Frames
11
Frames
  • Identification of Reading Frames
  • Open frames only
  • All start and stop codons
  • Necessary for viruses and eukaryotic genes
  • Rare codon display
  • Mark boxes
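
The idea of scanning all six reading frames for open frames can be sketched in a few lines of Python. This is only an illustration, not the Frames program itself: the function names are hypothetical, an ORF is assumed to run from an ATG to the first in-frame stop codon, and coordinates on the reverse strand are reported relative to the reverse-complemented sequence.

```python
# Minimal sketch: find ATG..stop open reading frames in all six frames.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons=1):
    """Yield (start, end) 0-based half-open coordinates of ORFs in one frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first start codon opens the frame
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield (start, i + 3)       # include the stop codon
            start = None


def six_frame_orfs(seq, min_codons=1):
    """Map (strand, frame number 1..3) to the list of ORFs found there."""
    seq = seq.upper()
    result = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            result[(strand, frame + 1)] = list(orfs_in_frame(s, frame, min_codons))
    return result
```

For example, `six_frame_orfs("CCATGAAATGACC")` reports a single ORF in forward frame 3, covering ATG AAA TGA.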

12
analyze% frames -check humhbb.gb_pr1

Frames shows open reading frames for the six translation frames of a DNA
sequence. Frames can superimpose the pattern of rare codon choices if you
provide it with a codon frequency table.

Minimal Syntax: frames -INfile1=Bacteria:EcoOmpa -Default

Prompted Parameters:
  -BEGin=1 -END=2270         range of interest

Local Data Files:
  -TRANSlate=translate.txt   defines start and stop codons
  -MARk1=ecoompa.mrk         marks regions of interest on the plot
13
Optional Parameters:
  -ALL                  shows all start and stop signals, not just open frames
  -RARe=ecohigh.cod     marks rare codons according to codon frequency table
  -THReshold=0.0        sets threshold at or below which rare codons are shown
  -DENsity=2270         sets the number of bases per 100 platen units

All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.
  -FIGure=FileName      stores plot in a file for later input to FIGURE

Add what to the command line ?

Process set to plot with VT340 attached to term using the regd graphic interface.

Begin ( 1 ) ?
End ( 73308 ) ?

When your VT340 attached to tty is ready, press <Return>.
14
E. coli ompA Gene
15
E. coli ompA Gene /rare
16
β-Actin gene ORF
17
β-Actin gene /all
18
Human Fetal β-Globin Gamma Gene ORF
19
Human Fetal β-Globin Gamma Gene /all
20
CodonPreference
  • Preference for particular codons within a
    synonymous group
  • Due to utilization of specific isoaccepting tRNA
    species
  • GC genomic biases
  • Organism and gene differences
  • High and low expressers

21
Codon Frequency Tables
  • Table of codon usage frequencies
  • Organisms
  • Classes of genes
  • Particular genes

22
analyze% more ecohigh.cod
!!CODON_FREQUENCY 1.0
Codon usage for enteric bacterial (highly expressed) genes 7/19/83

AmAcid  Codon   Number    /1000   Fraction ..
Gly     GGG      13.00     1.89     0.02
Gly     GGA       3.00     0.44     0.00
Gly     GGU     365.00    52.99     0.59
Gly     GGC     238.00    34.55     0.38
Glu     GAG     108.00    15.68     0.22
Glu     GAA     394.00    57.20     0.78
Asp     GAU     149.00    21.63     0.33
Asp     GAC     298.00    43.26     0.67
Val     GUG      93.00    13.50     0.16
Val     GUA     146.00    21.20     0.26
Val     GUU     289.00    41.96     0.51
Val     GUC      38.00     5.52     0.07
Ala     GCG     161.00    23.37     0.26
Ala     GCA     173.00    25.12     0.28
Ala     GCU     212.00    30.78     0.35
Ala     GCC      62.00     9.00     0.10
23
Available Tables - GCG
  • analyze% to genrundata
  • /usr1/gcg/gcgcore/data/rundata
  • analyze% ls *.cod
  • ecohigh.cod
  • analyze% to genmoredata
  • /usr1/gcg/gcgcore/data/moredata
  • analyze% ls *.cod
  • celegans_high.cod
  • drosophila_high.cod
  • ecolow.cod
  • maize_high.cod
  • celegans_low.cod
  • human_high.cod
  • yeast_high.cod

24
Codon Preference Statistic
  • Measure of how likely the use of a particular
    codon is, in comparison to its frequency in a
    random sequence of the same composition

25
Codon Preference Statistic
  • p = F/R
  • F = expected frequency of a codon's occurrence,
    taken from the .cod file
  • R = frequency of a codon's occurrence in a random
    sequence of the same composition

26
Output
  • Codon preference statistic is averaged over a
    window of 25 codons (default)
  • Output is a curve of the statistic vs. nucleotide
    position
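
A rough Python sketch of the statistic p = F/R and its window average follows. It rests on several assumptions that the slides do not state: F is taken as the codon's fraction within its synonymous family, R as the share the codon would have within that family if bases were drawn independently at the sequence's own base frequencies, and the window average is computed on a log scale. The function names are hypothetical, and the small table only holds the Gly and Glu rows of the ecohigh.cod excerpt (with T written for U).

```python
import math
from collections import Counter

# Fractions within the Gly and Glu families, from the ecohigh.cod excerpt.
TABLE = {
    "GGG": ("Gly", 0.02), "GGA": ("Gly", 0.00),
    "GGT": ("Gly", 0.59), "GGC": ("Gly", 0.38),
    "GAG": ("Glu", 0.22), "GAA": ("Glu", 0.78),
}


def random_fraction(codon, base_freq, table):
    """R: share of `codon` within its family under independent base draws."""
    def prob(c):
        return math.prod(base_freq[b] for b in c)
    family = [c for c, (aa, _) in table.items() if aa == table[codon][0]]
    return prob(codon) / sum(prob(c) for c in family)


def preference_profile(seq, table, window=25, floor=0.01):
    """Window-averaged p = F/R along one reading frame (log-scale mean)."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(seq)
    base_freq = {b: counts[b] / len(seq) for b in "ACGT"}
    logs = []
    for codon in codons:
        f = max(table[codon][1], floor)    # floor avoids log(0) for unused codons
        r = random_fraction(codon, base_freq, table)
        logs.append(math.log(f / r))
    return [math.exp(sum(logs[i:i + window]) / window)
            for i in range(len(logs) - window + 1)]
```

Values above 1 mean the window uses codons that the table prefers more often than base composition alone would predict. A complete table covering all 64 codons would be needed for real sequences.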

27
Third Position Nucleotide Bias
  • Preference for particular nucleotides in the
    third (wobble) position of the codon
  • Based upon overall GC content of the genome (?)
  • /bias=GC
  • (or AT, or any desired nucleotides)
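
The third position bias can be illustrated with a short Python sketch. The function name and the `frame` argument are hypothetical; the default window of 25 codons mirrors the default bias window quoted for CodonPreference.

```python
def third_position_bias(seq, bases="GC", window=25, frame=0):
    """Fraction of codons in each window whose third base is in `bases`."""
    seq = seq.upper()
    # True where the third (wobble) base of a codon is one of the chosen bases.
    thirds = [seq[i + 2] in bases for i in range(frame, len(seq) - 2, 3)]
    return [sum(thirds[i:i + window]) / window
            for i in range(len(thirds) - window + 1)]
```

Running it for each of the three frames and comparing the curves gives a frame-specific picture similar in spirit to the bias plot.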

28
analyze% codonpreference -check bg.seq

CodonPreference is a frame-specific gene finder that tries to recognize
protein coding sequences by virtue of the similarity of their codon usage
to a codon frequency table or by the bias of their composition (usually GC)
in the third position of each codon.

Minimal Syntax: codonpreference -INfile1=Bacterial:EcoOmpa -Default

Prompted Parameters:
  -BEGin=1 -END=2270         range of interest
  -REVerse                   use the back strand
  -FREQuency=ecohigh.cod     codon frequency table
  -PWINdow=25                preference window in codons
  -RARe=0.1                  rare codon display threshold (-NORARe suppresses)
  -DENsity=74.48             density in bases per centimeter (11 x 17 paper)

Local Data Files:
  -TRANSlate=translate.txt   defines the start and stop codons
  -MARk=ecoompa.mrk          defines regions of known interest
29
Optional Parameters:
  -BIAS=GC          shows third position bias curves for GC (-NOBIAS suppresses)
  -NOPREFerence     suppresses the codon preference curves
  -BWINdow=25       bias window in codons
  -FILe=FName       makes an output file of the preference curve values
  -TABle=FName      creates a table with the statistics for each codon
  -ALLFrames        shows all start and stop codons
  -NOFRAmes         suppresses the reading frame part of the plot
  -NOPLOt           suppresses the whole plot
  -PHEIght=77.0     sets the height of the vertical axis in platen units
  -PLENgth=120.0    sets the length of the horizontal axis in platen units
  -PSCAlemax=2.2    sets the maximum value on the codon preference scale
  -BSCAlemax=1.1    sets the maximum value on the third position bias scale

All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.
  -FIGure=FileName  stores plot in a file for later input to FIGURE
30
Add what to the command line ?

Process set to plot with VT340 attached to term using the regd graphic interface.

Begin ( 1 ) ?
End ( 5000 ) ?
Reverse ( No ) ?
What codon frequency file ( GenRunData:ecohigh.cod ) ?
What codon preference window size (in codons) ( 25 ) ?

The minimum density for a one-page plot is 164.04 bases/cm.
What density would you like ( 164.04 ) ?

When your VT340 attached to tty is ready, press <Return>.

Average codon preference for frame 1: 0.4181
Average codon preference for frame 2: 0.4013
Average codon preference for frame 3: 0.4446
Average codon preference for a random sequence: 0.4337
43
CodonFrequency
  • Creates a codon frequency table
  • Nucleotide sequence
  • Other codon frequency tables
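
What such a table contains can be sketched in Python. This is an illustration under stated assumptions, not the CodonFrequency program: the function name is hypothetical, the standard genetic code is assumed, and each input string is assumed to be an in-frame coding sequence. The columns mirror those of a .cod file (count, per thousand, fraction within the synonymous family).

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_table(seqs):
    """Return {codon: (amino acid, count, per-thousand, family fraction)}."""
    counts = Counter()
    for s in seqs:
        s = s.upper().replace("U", "T")
        triplets = (s[i:i + 3] for i in range(0, len(s) - 2, 3))
        counts.update(t for t in triplets if t in CODE)   # skip ambiguous codons
    total = sum(counts.values())
    family = Counter()
    for codon, n in counts.items():
        family[CODE[codon]] += n
    return {codon: (CODE[codon], n, 1000 * n / total, n / family[CODE[codon]])
            for codon, n in counts.items()}
```

Passing several exon sequences of one gene, already joined in frame, corresponds to the interactive exon-by-exon session shown on the following slides.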

44
analyze% codonfrequency -check

CodonFrequency tabulates codon usage from sequences and/or existing codon
usage tables. The output file is correctly formatted for input to the
CodonPreference, Correspond, and Frames programs.

Minimal Syntax: codonfrequency -INfile=@Hsp70DNA.List -Default

Prompted Parameters:
  -OUTfile1=hsp70.cod        output file name

Local Data Files:
  -TRANSlate=translate.txt   contains the genetic code
45
Optional Parameters:
  -CODonfile=ecohigh.cod   input file of codon frequencies
  -BEGin=1 -END=100        range of interest for each sequence
                           (non-interactive mode only)
  -REVerse                 strand for each sequence
                           (non-interactive mode only)
  -ONEPEPtide              concatenate all DNA fragments in a multiple
                           sequence specification before processing
  -CONtinue                makes CODONFREQUENCY continue after you write
                           out the file
  -NOMONitor               suppresses screen trace for each input sequence

Add what to the command line ?
46
You can add codon frequencies from either
  E)xisting codon usage files
  S)equence files
Please select one ( S )

Count codons from what sequence(s) ? humhbb.gb_pr1

Begin ( 1 ) ? 19541
End ( 73308 ) ? 19632
Reverse ( No ) ?
That begins ATGGT and ends GGCAG. Is this correct ( Yes ) ?
Get another exon from this gene ( No ) ? Yes
Begin ( 1 ) ? 19755
End ( 73308 ) ? 19977
Reverse ( No ) ?
That begins ACTCC and ends TCAAG. Is this correct ( Yes ) ?
47
Get another exon from this gene ( No ) ?

That's done, now would you like to
  1) Get new sequence input file(s)
  2) Get a new codon table input file
  3) Specify another gene from humhbb.gb_pr1
  W)rite the frequencies to your output file
Please choose one ( W )

What should I call the output file ( humhbb.cod ) ?

analyze%
49
Correspond
  • Compares codon frequency tables
  • How closely does the codon usage for one organism
    or gene match that of another?
  • The lower the D² statistic, the closer the
    correlation
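
The slides do not give the formula behind D², so the following Python sketch is an assumption: D² is taken as the sum of squared differences between the two tables' codon frequencies, and D as its square root (the square-root relation does agree with the sample output, where 7.288287 pairs with 2.699683). The function name and the handling of missing codons are hypothetical.

```python
import math


def d_squared(table_a, table_b):
    """Compare two {codon: frequency} tables.

    Returns (D-squared, D, number of codons present in only one table).
    """
    shared = table_a.keys() & table_b.keys()
    d2 = sum((table_a[c] - table_b[c]) ** 2 for c in shared)
    not_counted = len(table_a.keys() ^ table_b.keys())
    return d2, math.sqrt(d2), not_counted
```

Identical tables give D² = 0, matching the zero entries on the diagonal of the Correspond output.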

50
analyze% correspond -check

Correspond looks for similar patterns of codon usage by comparing codon
frequency tables.

Optional Parameters:
  -CONtinue   makes Correspond start over again automatically.

Add what to the command line ?

Do you want to file the results ( No ) ? Yes
What should I call the output file ( correspond.cor ) ?
CORRESPOND of what frequency file(s) ? genmoredata:*.cod
to what other frequency file(s) ( genmoredata:*.cod ) ?
51
Between               and                   D-Squared   D          Not-Counted ..
celegans_high.cod     celegans_high.cod      0.000000   0.000000   0
celegans_high.cod     celegans_low.cod       7.288287   2.699683   0
celegans_high.cod     drosophila_high.cod    3.594880   1.896017   0
celegans_high.cod     ecolow.cod             6.176582   2.485273   3
celegans_high.cod     human_high.cod         4.627194   2.151091   0
celegans_high.cod     maize_high.cod         5.393408   2.322371   0
celegans_high.cod     yeast_high.cod         5.480386   2.341022   0
celegans_low.cod      celegans_high.cod      7.288287   2.699683   0
celegans_low.cod      celegans_low.cod       0.000000   0.000000   0
celegans_low.cod      drosophila_high.cod    9.466039   3.076693   0
celegans_low.cod      ecolow.cod             1.814690   1.347104   3
celegans_low.cod      human_high.cod         7.144900   2.672995   0
celegans_low.cod      maize_high.cod        10.524807   3.244196   0
celegans_low.cod      yeast_high.cod         5.146653   2.268624   0
drosophila_high.cod   celegans_high.cod      3.594880   1.896017   0
drosophila_high.cod   celegans_low.cod       9.466039   3.076693   0
drosophila_high.cod   drosophila_high.cod    0.000000   0.000000   0
drosophila_high.cod   ecolow.cod             4.938439   2.222260   3
drosophila_high.cod   human_high.cod         1.574406   1.254753   0
drosophila_high.cod   maize_high.cod         2.583944   1.607465   0
drosophila_high.cod   yeast_high.cod         8.846043   2.974230   0
ecolow.cod            celegans_high.cod      6.176582   2.485273   3
ecolow.cod            celegans_low.cod       1.814690   1.347104   3
ecolow.cod            drosophila_high.cod    4.938439   2.222260   3
ecolow.cod            ecolow.cod             0.000000   0.000000   3
ecolow.cod            human_high.cod         3.370797   1.835973   3
ecolow.cod            maize_high.cod         5.637442   2.374330   3
ecolow.cod            yeast_high.cod         5.821329   2.412743   3
52
TestCode
  • How random is the pattern of nucleotides in a
    sequence?
  • Coding regions may have a non-random nucleotide
    order
  • TestCode uses Fickett's algorithm to measure
    non-randomness of the sequence composition at
    every third base

53
TestCode Predictions
  • Only accurate for sequence lengths greater than
    200 bases
  • Does not predict the reading frame or strand used

54
TestCode Statistic
  • Measure of the base frequency between triplet
    nucleotides
  • Averaged over a window size of 200 (default)
  • Window shifted in increments of 3 (default)

55
TestCode Calculation
  • Determine the base composition at every third
    position
  • For each base select its maximum and minimum
    frequency of occurrence within a run of triplets
  • Use these values to determine the asymmetry of
    the distribution of that base

56
TestCode Statistic
  Max(n1, n2, n3) / Min(n1, n2, n3)

  n  = base composition for each base (A, T, C, G)
  n1 = composition at positions 1, 4, 7, ...
  n2 = composition at positions 2, 5, 8, ...
  n3 = composition at positions 3, 6, 9, ...

Calculate the Max/Min for each base within a window of 200 bases (default).
Slide the window 3 bases (default) and repeat the calculation.
Plot the results on a graph (TestCode statistic vs. position).
57
TestCode Calculation
  • Use the asymmetry values and the base frequencies
    to determine the coding probability
  • Lookup the probabilities in a table of
    probability-of-coding values
  • Coding probabilities based on empirical evidence
    from known coding and non-coding sequence biases
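
The asymmetry part of the calculation can be sketched in Python. This covers only the Max/Min step; the lookup of coding probabilities is not reproduced because the slides do not list the probability tables. The function names are hypothetical, and adding 1 to the denominator is an assumption made here to avoid division by zero when a base never occurs at one codon position.

```python
from collections import Counter


def position_asymmetry(window_seq):
    """For each base, max/(min + 1) of its counts at codon positions 1, 2, 3."""
    # window_seq[k::3] collects every third base starting at offset k.
    counts = [Counter(window_seq[k::3]) for k in range(3)]
    result = {}
    for base in "ACGT":
        n = [c[base] for c in counts]
        result[base] = max(n) / (min(n) + 1)
    return result


def asymmetry_profile(seq, window=200, increment=3):
    """Slide a window along the sequence and report the asymmetry at each step."""
    seq = seq.upper()
    return [(start + 1, position_asymmetry(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, increment)]
```

A strongly periodic stretch such as ATGATGATG gives a high asymmetry for A, T and G, while a base spread evenly over the three positions scores close to 1 for long windows.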

58
TestCode Output
  • Plot of the TestCode statistic vs. sequence
    position
  • Top region: coding regions predicted at a 95%
    level of confidence
  • Bottom region: non-coding regions predicted at a
    95% confidence level
  • Middle region: no significant prediction

59
analyze% testcode -check bg.seq

TestCode helps you identify protein coding sequences by plotting a measure
of the non-randomness of the composition at every third base. The statistic
does not require a codon frequency table.

Minimal Syntax: testcode -INfile=Bacterial:EcoOmpa -Default

Prompted Parameters:
  -BEGin=1            first base in plot
  -END=2270           last base in plot
  -REVerse            use the reverse strand
  -WINdow=200         sets the window size
  -DENsity=2270       sets the density in bp per 100 platen units

Local Data Files:
  -MARk=ecoompa.mrk   marks the plot with regions of known interest
60
Optional Parameters:
  -INCrement=3       lets you set the window slide increment
  -POInts            makes points instead of a curve

All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.
  -FIGure=FileName   stores plot in a file for later input to FIGURE

Add what to the command line ?

Process set to plot with VT340 attached to term using the regd graphic interface.

Begin ( 1 ) ?
End ( 5000 ) ?
Reverse ( No ) ?
What window size in bp ( 200 ) ?

The minimum density for a one page plot is 5000.0 bases/page.
A typical density is about 3000.0 bases/page.
What density would you like ( 5000.0 ) ?

When your VT340 attached to tty is ready, press <Return>.
66
Vesicular Stomatitis Virus
  • Evaluating the coding potential of a virus

80
Human Beta-Globin
  • Intron/Exon Determination

81
bg mark file

FEATURES   Location/Qualifiers ..
!beta-globin and beta-globin thalassemia
2187 2278
!beta-globin thalassemia
2390 2408
!beta-globin
2409 2631
3482 3610
85
Gene Prediction Programs
86
GeneMark
  • Mark Borodovsky
  • School of Biology
  • Georgia Institute of Technology
  • http://genemark.biology.gatech.edu/GeneMark/
  • Hidden Markov Model of Gene structure

87
Grail
  • Neural Network-based model
  • Part of GCG
  • Use within SeqLab

88
Other Gene Prediction Programs
  • Glimmer
  • Steven Salzberg
  • TIGR
  • Johns Hopkins University
  • http://www.tigr.org/softlab/glimmer/glimmer.html
  • Interpolated Markov Models
  • GenScan
  • Christopher Burge
  • Stanford/MIT
  • http://genes.mit.edu/GENSCAN.html
  • General probabilistic model of Gene structure

89
Markov Chains and Hidden Markov Models
  • Statistical model of a biological feature
  • Compile a training set for some sequence pattern
  • Identify non-random patterns in primary sequence
    composition
  • Look for matches to the pattern in an unknown
    sequence

90
Markov Chains and Hidden Markov Models
91
Simple Markov Chain for Start Codon
State A:  a - 0.80   c - 0.02   g - 0.10   t - 0.08
State T:  a - 0.01   c - 0.01   g - 0.01   t - 0.97
State G:  a - 0.01   c - 0.01   g - 0.97   t - 0.01

Courtesy: Computational Methods in Molecular Biology - Elsevier
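
How such a model scores a candidate start codon can be shown with a small Python sketch. The emission probabilities are the ones on the slide; the function names, and the uniform background of 0.25 per base used for the log-odds score, are assumptions for illustration.

```python
import math

# Emission probabilities for the three positions of the start codon model.
START_MODEL = [
    {"a": 0.80, "c": 0.02, "g": 0.10, "t": 0.08},   # position 1 (A)
    {"a": 0.01, "c": 0.01, "g": 0.01, "t": 0.97},   # position 2 (T)
    {"a": 0.01, "c": 0.01, "g": 0.97, "t": 0.01},   # position 3 (G)
]


def start_probability(triplet):
    """Probability that the model emits this triplet."""
    p = 1.0
    for position, base in zip(START_MODEL, triplet.lower()):
        p *= position[base]
    return p


def log_odds(triplet, background=0.25):
    """Log2 ratio of the model probability to a uniform background."""
    return math.log2(start_probability(triplet) / background ** 3)
```

Under this model ATG scores 0.80 × 0.97 × 0.97, and the alternative start GTG scores 0.10 × 0.97 × 0.97, so both come out well above the uniform background of 1/64.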
92
GeneMark Prediction HumHBB
GeneMark.hmm (Version 2.2a)
Sequence name: HUMBB
Sequence length: 73308 bp
GC content: 39.46 %
Matrix: Homo sapiens
Tue Apr 24 11:05:00 2001

Predicted genes/exons

Gene  Exon  Strand  Exon Type   Exon Range     Exon Length  Start/End Frame
 1     1      +     Terminal     6168   6449      282          1 3
 2     3      -     Terminal    13450  13528       79          3 3
 2     2      -     Internal    16097  16311      215          2 1
 2     1      -     Initial     16436  16468       33          3 1
93
Gene  Exon  Strand  Exon Type   Exon Range     Exon Length  Start/End Frame
 7     1      +     Initial     54790  54881       92          1 2
 7     2      +     Internal    55010  55232      223          3 3
 7     3      +     Terminal    60474  60557       84          1 3
 8     1      +     Initial     62187  62278       92          1 2
 8     2      +     Internal    62409  62631      223          3 3
 8     3      +     Terminal    63482  63610      129          1 3
 9     1      +     Initial     68183  68396      214          1 1
 9     2      +     Terminal    68586  68746      161          2 3
94
bg mark file

FEATURES   Location/Qualifiers ..
!beta-globin and beta-globin thalassemia
2187 2278
!beta-globin thalassemia
2390 2408
!beta-globin
2409 2631
3482 3610
98
GrailPro
  • SeqLab

99
GrailPro
  • Neural network-based gene prediction
  • Requires a gene model for each species
  • Derived from a training set of known genes
  • Also provides detection of sequence features
  • CpG islands
  • Pol II promoters
  • polyA sites
  • Simple repeats
  • Known complex repetitive elements
  • Alu and LINE elements

100
Neural Networks
Input Layer
HiddenLayer
Output Layer (Model)
101
analyze% grail
Grail Error. No arguments supplied.
Usage: grail cegoprs (options)
For further help, do grail -h.

analyze% grail -h
grail cegoprs -b file -d file -e file -f -g file -h -i file -l (organism)
      -o file -p output_type -r range -s strand -t -v

Requests
--------
c   CpG Islands (All organisms)
e   Protein Coding Exons (All organisms)
g   Gene Models (Human/Mouse/Arab/Droso)
o   Pol II Promoters (Human/Mouse only)
p   PolyaA Sites (Human/Mouse only)
r   Complex Repeats (Human/Mouse/Arab/Droso)
s   Simple Repeats (All organisms)
102
Options
-------
-b file          Specifies location of Blast 2.0. Default is
                 GRAILHOME/blast/blastall.
-d file          Specifies location of database. Default is
                 GRAILHOME/databases/<orgcode>.rpt.
-e file          Sets exon file for reading (no e request) or writing
                 (e request).
-f               Filter exons against complex repeat database. Not done
                 by default.
-g file          Specifies location of .gnr file. Default is
                 GRAILHOME/generation/<orgcode>.gnr.
-h               Prints help and exits.
-i file          Sets input file. Default is stdin.
-l (organism)    With no flag, prints organism list. Otherwise, sets
                 organism.
-o file          Sets output file. Default is stdout.
-p output_flag   Sets output format. Valid options are txt, raw, and rsf.
-r beg-end       Sets the range of operation. Uses reverse index with
                 -s r set.
-s strand        Sets the strand. f forward, r reverse, a auto (default)
-t               Print protein translations of exons and genes. Default
                 is no.
-v               Print version information.
103
GrailPro - SeqLab
  • Generates rsf file containing predicted features
  • Graphical representation of predicted elements
  • Graphical representation of gene predictions

104
LOCUS       HUMHBB   73308 bp   DNA   PRI   28-MAR-2001
DEFINITION  Human beta globin region on chromosome 11.
     mRNA         join(62137..62278,62409..62631,63482..63742)
                  /gene="HBB"
                  /product="beta-globin"
     gene         62137..63742
                  /gene="HBB"
     variation    62156
                  /gene="HBB"
                  /note="c in 62,76 t in 48"
     gene         <62187..62389
                  /gene="HBB thalassemia"
     CDS          join(62187..62278,62390..62408)
                  /db_xref="GI:455998"
                  /gene="HBB thalassemia"
                  /note="beta-globin thalassemia"
                  /codon_start="1"
                  /protein_id="AAA16335.1"
     CDS          join(62187..62278,62409..62631,63482..63610)
                  /db_xref="GI:455997"
                  /gene="HBB"
                  /note="beta-globin"
                  /codon_start="1"
                  /protein_id="AAA16334.1"
105
GrailPro HBB Gene Prediction
feature 62187 62278 1 diamond solid GRAIL_GENE_EXON
  Type Initial   Frame 2   Score 96
feature 62279 62408 8 hat solid GRAIL_GENE_INTRON
feature 62409 62631 1 diamond solid GRAIL_GENE_EXON
  Type Internal  Frame 0   Score 100
feature 62632 63481 8 hat solid GRAIL_GENE_INTRON
feature 63482 63610 1 diamond solid GRAIL_GENE_EXON
  Type Terminal  Frame 1   Score 94
106
Translation
107
Translate
  • Translate a nucleotide sequence to an amino acid
    sequence
  • Uses a translation table
  • Translates a single sequence
  • Will piece together multiple exons into one
    protein
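
Piecing exons together before translating can be sketched as follows in Python. The function name is hypothetical and the standard genetic code is assumed; exon coordinates are 1-based and inclusive, in the style of the Begin/End prompts used by the GCG programs.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_exons(seq, exons):
    """Join exons given as (begin, end) pairs and translate up to the first stop."""
    coding = "".join(seq[begin - 1:end] for begin, end in exons).upper()
    protein = []
    for i in range(0, len(coding) - 2, 3):
        aa = CODE.get(coding[i:i + 3], "X")   # X marks codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For instance, `translate_exons("ccATGGTGnnnnCACTAAgg", [(3, 8), (13, 18)])` joins ATGGTG and CACTAA and yields MVH.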

108
analyze% translate -check bg.seq

Translate translates nucleotide sequences into peptide sequences.

Minimal Syntax: translate -INfile=@Hsp70DNA.List -Default

Prompted Parameters:
  -OUTfile=hsp70.pep         output file name (single output sequence only)

Local Data Files:
  -TRANSlate=translate.txt   contains the genetic code
109
Optional Parameters:
  -BEGin=1 -END=100          range of interest for each sequence
  -REVerse                   strand for each sequence
  -ONEPEPtide                translate all concatenated DNA fragments into
                             a single peptide
  -NOJOIN                    ignore all "join" sequence attributes
                             specified in a list file
  -LIStfile=translate.list   writes a list file of output sequence names
  -EXTension=.pep            sets the file name extension for output
                             sequence files

Press q to quit or <Return> for more

  -NOMONitor                 suppresses the screen monitor

Add what to the command line ?
110
Begin ( 1 ) ? 2187
End ( 5000 ) ? 2278
Reverse ( No ) ?
Range begins ATGGT and ends GGCAG. Is this correct ( Yes ) ?

That is done, now would you like to
  A) Add another exon from this sequence
  B) Add another exon from a new sequence
  C) Translate and then add more genes from this sequence
  D) Translate and then add more genes from a new sequence
  W) Translate assembly and write everything into a file
Please choose one ( W ) a

Begin ( 1 ) ? 2409
End ( 5000 ) ? 2631
Reverse ( No ) ?
Range begins GCTGC and ends TCAGG. Is this correct ( Yes ) ?
111
That is done, now would you like to
  A) Add another exon from this sequence
  B) Add another exon from a new sequence
  C) Translate and then add more genes from this sequence
  D) Translate and then add more genes from a new sequence
  W) Translate assembly and write everything into a file
Please choose one ( W ) A

Begin ( 1 ) ? 3482
End ( 5000 ) ? 3610
Reverse ( No ) ?
Range begins CTCCT and ends ACTAA. Is this correct ( Yes ) ?

That is done, now would you like to
  A) Add another exon from this sequence
  B) Add another exon from a new sequence
  C) Translate and then add more genes from this sequence
  D) Translate and then add more genes from a new sequence
  W) Translate assembly and write everything into a file
Please choose one ( W )

What should I call the output file ( bg.pep ) ?
112
analyze% cat bg.pep
!!AA_SEQUENCE 1.0
TRANSLATE of: bg.seq check: 3171 from: 2187 to: 2278
and of: bg.seq check: 3171 from: 2409 to: 2631
and of: bg.seq check: 3171 from: 3482 to: 3610
generated symbols 1 to: 148.
beta globin region of humhbb
!beta-globin and beta-globin thalassemia 62187 62278
!beta-globin thalassemia 62390 62408
!beta-globin . . .

bg.pep  Length: 148  April 27, 1998 16:16  Type: P  Check: 5016  ..

  1  MVHLTPEEKS AVTALWGKVN VDEVGGEALG RLLVVYPWTQ RFFESFGDLS
 51  TPDAVMGNPK VKAHGKKVLG AFSDGLAHLD NLKGTFATLS ELHCDKLHVD
101  PENFRLLGNV LVCVLAHHFG KEFTPPVQAA YQKVVAGVAN ALAHKYH

analyze%
115
BackTranslate
  • Derive a nucleotide sequence from an amino acid
    sequence
  • Uses a codon frequency table
  • Creates the most probable sequence
  • Creates the most ambiguous sequence
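
Both kinds of output can be illustrated with a short Python sketch. The codon fractions are the ones shown for the first seven beta-globin residues in the back-translation table later in this transcript (human_high.cod); the function names are hypothetical, and the "most ambiguous" sequence is assumed to be built position by position from the union of bases seen in the amino acid's codons.

```python
# Codon fractions per amino acid (one-letter code), limited to seven residues.
USAGE = {
    "M": {"ATG": 1.00},
    "V": {"GTG": 0.64, "GTC": 0.25, "GTT": 0.07, "GTA": 0.05},
    "H": {"CAC": 0.79, "CAT": 0.21},
    "L": {"CTG": 0.58, "CTC": 0.26, "TTG": 0.06,
          "CTT": 0.05, "CTA": 0.03, "TTA": 0.02},
    "T": {"ACC": 0.57, "ACG": 0.15, "ACA": 0.14, "ACT": 0.14},
    "P": {"CCC": 0.48, "CCT": 0.19, "CCG": 0.17, "CCA": 0.16},
    "E": {"GAG": 0.75, "GAA": 0.25},
}

# IUPAC ambiguity codes keyed by the sorted set of bases they stand for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "AG": "R", "CT": "Y",
         "CG": "S", "AT": "W", "GT": "K", "AC": "M", "CGT": "B",
         "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N"}


def most_probable(peptide, usage):
    """Pick the highest-fraction codon for every residue."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in peptide)


def most_ambiguous(peptide, usage):
    """Encode, per codon position, every base that any synonymous codon uses."""
    out = []
    for aa in peptide:
        for pos in range(3):
            bases = "".join(sorted({codon[pos] for codon in usage[aa]}))
            out.append(IUPAC[bases])
    return "".join(out)
```

For the peptide MVHLTPE this gives ATGGTGCACCTGACCCCCGAG and ATGGTNCAYYTNACNCCNGAR, which agree with the first 21 bases of the two bg.back listings.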

116
analyze% backtranslate -check

BackTranslate backtranslates an amino acid sequence into a nucleotide
sequence. The output helps you recognize minimally ambiguous regions that
might be good for constructing synthetic probes.

Minimal Syntax: backtranslate -INfile1=ilvhiaa.pep -Default

Prompted Parameters:
  -BEGin=1 -END=6
  -MENu=A                    menu for what kind of output you want, where
                               A is for table of all back-translations and
                                 most probable sequence
                               B is for table of all back-translations and
                                 most ambiguous sequence
                               C is for most probable sequence only
                               D is for most ambiguous sequence only
  -INfile2=ecohigh.cod       codon frequency table
  -OUTfile=ilvhiaa.seq       output file name

Local Data Files:
  -TRANSlate=translate.txt   defines most ambiguous representation for
                             each codon family
117
Optional Parameters:
  -WINdow=4   shows probability of the preferred codons for next "4"
              amino acids occurring together by chance

Add what to the command line ?

BACKTRANSLATE what sequence ? bg.pep

Begin ( 1 ) ?
End ( 148 ) ?

Would you like to see
  a) table of back-translations and most probable sequence
  b) table of back-translations and most ambiguous sequence
  c) most probable sequence only
  d) most ambiguous sequence only
Please choose one ( b ) ab

Use what codon frequency file ( GenRunData:ecohigh.cod ) ? genmoredatahumad
What should I call the output file ( bg.seq ) ? bg.back
118
analyze% more bg.back
!!NA_SEQUENCE 1.0
BACKTRANSLATE of: bg.pep check: 5016 from: 1 to: 148
TRANSLATE of: bg.seq check: 3171 from: 2187 to: 2278
and of: bg.seq check: 3171 from: 2409 to: 2631
and of: bg.seq check: 3171 from: 3482 to: 3610
generated symbols 1 to: 148.
beta globin region of humhbb
!beta-globin and beta-globin thalassemia . . .

Using codon frequencies from:
/usr1/gcg/gcgcore/data/moredata/human_high.cod CheckFile: 786
.goto below
CODONFREQUENCY January 24, 1991 16:55
From an existing codon frequency file: Humprb4l_217_1054.Cod FileCheck: 8577
From an existing codon frequency file: Humprb4m_217_928.Cod FileCheck: 7623
From an existing codon frequency file: Humprb1s_51_763.Cod FileCheck: 8371
From an existing codon frequency file: Humprb1_51_946.Cod FileCheck: 119
. . .
119
Met        Val        His        Leu        Thr        Pro        Glu
ATG 1.00   GTG 0.64   CAC 0.79   CTG 0.58   ACC 0.57   CCC 0.48   GAG 0.75
           GTC 0.25   CAT 0.21   CTC 0.26   ACG 0.15   CCT 0.19   GAA 0.25
           GTT 0.07              TTG 0.06   ACA 0.14   CCG 0.17
           GTA 0.05              CTT 0.05   ACT 0.14   CCA 0.16
                                 CTA 0.03
                                 TTA 0.02
293        167        125        119        154        221        157

8 - 14
Glu        Lys        Ser        Ala        Val        Thr        Ala
GAG 0.75   AAG 0.82   AGC 0.34   GCC 0.53   GTG 0.64   ACC 0.57   GCC 0.53
GAA 0.25   AAA 0.18   TCC 0.28   GCG 0.17   GTC 0.25   ACG 0.15   GCG 0.17
                      TCT 0.13   GCT 0.17   GTT 0.07   ACA 0.14   GCT 0.17
                      AGT 0.10   GCA 0.13   GTA 0.05   ACT 0.14   GCA 0.13
                      TCG 0.09
                      TCA 0.05
111        95         66         102        112        175        154
120
bg.back  Length: 444  April 27, 1998 16:19  Type: N  Check: 9960  ..

  1  ATGGTGCACC TGACCCCCGA GGAGAAGAGC GCCGTGACCG CCCTGTGGGG
 51  CAAGGTGAAC GTGGACGAGG TGGGCGGCGA GGCCCTGGGC CGCCTGCTGG

bg.back  Length: 444  April 27, 1998 16:22  Type: N  Check: 961  ..

  1  ATGGTNCAYY TNACNCCNGA RGARAARWSN GCNGTNACNG CNYTNTGGGG
 51  NAARGTNAAY GTNGAYGARG TNGGNGGNGA RGCNYTNGGN MGNYTNYTNG
123
PepData
  • Translates a DNA sequence in all six frames
  • Each translation frame follows the previous frame
    in the same output file
  • Use for database searches when starting with a
    sequence of unknown coding potential
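
A six-frame translation of this kind can be sketched in Python as follows. The function names are hypothetical and the standard genetic code is assumed; the frame order (three forward frames, then three frames of the reverse complement) follows the order listed in the PepData output.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate every complete codon; stops appear as '*', unknowns as 'X'."""
    return "".join(CODE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(seq):
    """Return the six translations: forward frames 1-3, then reverse frames 1-3."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (seq, reverse)
            for offset in range(3)]
```

Each frame of a 5,000 bp sequence holds 1,666 complete codons, which is why the six concatenated frames in the PepData output add up to 9,996 symbols.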

124
analyze% pepdata

PepData translates DNA sequence(s) in all six frames.

PEPDATA from what sequence(s) ? bg.seq

bg   5,000 bp   9,996 aa

PEPDATA complete with:
  Input files:        1
  Amino acids:        9,996
  Output files:       1
  Output file names:  .pdt

analyze%
125
!!AA_SEQUENCE 1.0
PEPDATA from: bg.seq check: 3171 from: 1 to: 5,000
beta globin region of humhbb

bases 1 to 5000 translated into 1 to 1666
bases 2 to 5000 translated into 1667 to 3332
bases 3 to 5000 translated into 3333 to 4998
reverse of bases 1 to 5000 translated into 4999 to 6664
reverse of bases 1 to 4999 translated into 6665 to 8330
reverse of bases 1 to 4998 translated into 8331 to 9996

bg.pdt  Length: 9996  April 27, 1998 17:19  Type: P  Check: 8970  ..

  1  KALALTILVF QNTINITYII ISSLCIFFD PGYLQKTYSN FRRTLYFTYT
 51  CLLYQGCETG SKLSKSKTM LMQVINK IQNLTAKSNL YVLTFKIFR
101  RLFPGFNMN LFSGIHVCLD PHCFSFLQRN EYKKKILKFY PSYLYNHTA
151  FFNLGSRP KNQTLSACV RIIRVRFFHK YLMRVETGRK SERSLFIQ
201  RKHLRESN GNKKFVNFLL ITRNRGSSFF WLTILFHFI VLFYFILFYF
126
Sequence Patterns
127
Composition
  • Determine the composition of a sequence
  • DNA
  • Base composition
  • Di- and tri- nucleotide content
  • Protein
  • Amino acid content
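
Counting base, dinucleotide and trinucleotide content amounts to counting overlapping k-mers, as in this Python sketch. The function name is hypothetical; overlapping counts are assumed because the totals in the Composition output drop by one per extra base (11,161, then 11,160, then 11,159).

```python
from collections import Counter


def composition(seq, k=1):
    """Count every overlapping k-mer: k=1 bases, k=2 dinucleotides, k=3 trinucleotides."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

The same function applied with k=1 to a protein sequence gives its amino acid content.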

128
analyze% composition

Composition determines the composition of sequence(s). For nucleotide
sequence(s), Composition also determines dinucleotide and trinucleotide
content.

COMPOSITION on what sequence(s) ? vi:vsvcg

Begin ( 1 ) ?
End ( 11161 ) ?

What should I call the output file ( vsvcg.composition ) ?

COMPOSITION complete
  Sequences:     1
  Total Length:  11,161
  CPU time:      00.28
  Output file:   /ob2/users/class000/vsvcg.composition

analyze%
129
analyze% more vsvcg.composition
COMPOSITION of: vi:vsvcg Check: 4851 from: 1 to: 11,161
April 27, 1998 17:21

A      3,467    C      2,227    G      2,431    T      3,036
Other      0    Total 11,161
130
GG     609    GA     941    GT     506    GC     375
AG     789    AA   1,085    AT   1,004    AC     589
TG     800    TA     604    TT     875    TC     756
CG     233    CA     836    CT     651    CC     507
Other    0    Total 11,160

GGG    131    GGA    255    GGT    122    GGC    101
GAG    213    GAA    285    GAT    264    GAC    179
GTG    114    GTA    103    GTT    162    GTC    126
GCG     27    GCA    146    GCT    122    GCC     80
AGG    187    AGA    320    AGT    177    AGC    105
AAG    265    AAA    368    AAT    298    AAC    154
ATG    294    ATA    179    ATT    280    ATC    251
ACG     65    ACA    230    ACT    167    ACC    127
TGG    225    TGA    266    TGT    166    TGC    143
TAG     95    TAA    155    TAT    219    TAC    135
TTG    242    TTA    182    TTT    237    TTC    214
TCG     74    TCA    271    TCT    217    TCC    194
CGG     66    CGA    100    CGT     41    CGC     26
CAG    216    CAA    277    CAT    223    CAC    120
CTG    150    CTA    140    CTT    196    CTC    165
CCG     67    CCA    189    CCT    145    CCC    106
Other    0    Total 11,159
131
analyze% cat bg.composition
(Peptide) COMPOSITION of: bg.pep April 27, 1998 17:24

A  15    C   2    D   7    E   8
F   8    G  13    H   9    K  11
L  18    M   2    N   6    P   7
Q   3    R   3    S   5    T   7
V  18    W   2    Y   3    *   1
Other  0    Total  148
132
Repeat
  • Finds direct sequence repeats within a single
    sequence
  • Settings
  • Length of the repeat
  • Stringency (number of matches)
  • Range within the sequence
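
The three settings can be made concrete with a brute-force Python sketch. It is not the Repeat algorithm: the function name is hypothetical, "range" is interpreted here as the maximum distance between the start positions of the two copies, and hits are reported per window without being merged or extended into longer alignments.

```python
def direct_repeats(seq, window=7, stringency=7, max_range=50):
    """Return (start1, start2, matches) for window pairs meeting the stringency.

    Positions are 1-based. `stringency` is the minimum number of matching
    positions within the window.
    """
    seq = seq.upper()
    hits = []
    last_start = len(seq) - window
    for i in range(last_start + 1):
        for j in range(i + 1, min(i + max_range, last_start) + 1):
            matches = sum(a == b for a, b in zip(seq[i:i + window],
                                                 seq[j:j + window]))
            if matches >= stringency:
                hits.append((i + 1, j + 1, matches))
    return hits
```

With window 7 and stringency 7 only perfect 7-base repeats are reported; lowering the stringency to 6 would also admit copies with one mismatch.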

133
analyze% repeat -check gamma.seq

Repeat finds direct repeats in sequences. You must set the size,
stringency, and range within which the repeat must occur; all the repeats
of that size or greater are displayed as short alignments.

Local Data Files:
  -MATRix=repeatdna.cmp   scoring matrix for nucleic acids
  -MATRix=blosum62.cmp    scoring matrix for peptides

Optional Parameters:
  -LIMit                  limits the number of repeats written into the
                          output file
  -SORt                   sorts the repeats on quality
  -PAIr=5                 match threshold for displaying '|'

Add what to the command line ?
134
Begin ( 1 ) ?
End ( 11375 ) ?

What minimum repeat window ( 7 ) ?
What minimum stringency ( 7 ) ?
Find repeats through what range ( 50 ) ?

There are 92 repeats, would you like to
  1) File the repeats
  2) Set new parameters
Please choose one ( 1 )

What should I call the output file ( gamma.rpt ) ?

analyze%
135
REPEAT of: gamma.seq check: 6474 from: 1 to: 11375

Human fetal beta globins G and A gamma from Shen, Slightom and Smithies,
Cell 26 191-203. Analyzed by Smithies et al. Cell 26 345-353.

Window: 7  Stringency: 7  Range: 50  Repeats: 92 ..

 172 TACAAAAAT 180
                    9 9
 180 TACAAAAAT 188

 352 AGATTCA 358
                    7 7
 361 AGATTCA 367

 935 AGAAAAAA 942
                    8 8
 983 AGAAAAAA 990

 965 AAAAAATAAA 974
                    10 10
 985 AAAAAATAAA 994
136
Terminator
  • Searches for prokaryotic rho-independent
    terminators

137
Window
  • Looks for simple, short sequence patterns
  • For example GC, R, or ATG
  • Creates a table of the frequencies with which the
    pattern occurs as a window of defined size is
    moved across a sequence
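
A sliding-window pattern count of this kind can be sketched in Python. The function name and the choice of denominator for the percentage (occurrences per window position) are assumptions; patterns may use IUPAC ambiguity codes such as S (G or C) or R (A or G), and overlapping occurrences are counted.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}


def window_counts(seq, pattern, window=100, shift=3):
    """Return (1-based window start, count, percent) for each window position."""
    seq = seq.upper()
    # A lookahead lets overlapping occurrences be counted.
    regex = re.compile("(?=" + "".join(IUPAC[c] for c in pattern.upper()) + ")")
    rows = []
    for start in range(0, len(seq) - window + 1, shift):
        chunk = seq[start:start + window]
        n = len(regex.findall(chunk))
        rows.append((start + 1, n, 100.0 * n / window))
    return rows
```

With the pattern S and a window of 100, the percent column is simply the local GC content, which is what the StatPlot graph of such a table would show.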

138
StatPlot
  • Reads the table created by Window and plots the
    results on a graph

139
analyze% window gamma.seq

Window makes a table of the frequencies of different sequence patterns
within a window as it is moved along a sequence. A pattern is any short
sequence like GC or R or ATG. You can plot the output with the program
StatPlot.

Begin ( 1 ) ?
End ( 11375 ) ?
Reverse ( No ) ?
What window size ( 100 ) ?
What shift increment ( 3 ) ?
What should I call the output file ( gamma.wdw ) ?
140
What functions do you want
  a) number of patterns observed
  b) percent of patterns observed
  c) fraction of patterns observed
  d) number of observed - expected(local) patterns
  e) number of observed - expected(global) patterns
  f) percent of observed - expected(local) patterns
  g) percent of observed - expected(global) patterns
  h) percent difference between two patterns
  q)uit
Please select up to 6 functions ( ae ) a

What is the pattern for the "a" stat in column 1 ? S
(Note: S = G or C)
143
Next
  • Protein Analysis