Varia - PowerPoint PPT Presentation

Provided by: Asatisfied332
Title: Varia


1
(No Transcript)
2
Lesson 3 (theory + exercises)
  • Varia
  • http://biochema.rug.ac.be/
  • Review
  • Demos (sequence retrieval, dot plot, pairwise
    alignment (SW)) during class
  • Installation
  • Windows versus Linux
  • Perl
  • Review
  • Arrays/Hashes/File I/O
  • Exercises

3
Overview
  • Fri 4/10: Introduction, History, Database,
    Biology, Sequence Formats
  • Fri 11/10: Pairwise comparison, scoring, matrices
  • Thu 17/10 (B10): Theory review (demo), Perl
    review, exercises
  • Fri 18/10: Multiple alignment, basic and advanced
    database searching
  • Thu: no practical / exercises
  • Fri 25/10: no class
  • Thu: no practical / exercises
  • Fri 1/11: no class
  • Fri 8/11: Phylogenetics, gene prediction, junk
    mining (RNA prediction)
  • Fri 15/11: no class
  • Fri 22/11: Protein structure, classification and
    engineering

4
Genetic Code Matrix
5
Other similarity scoring matrices might be
constructed from any property of amino acids
that can be quantified:
  • partition coefficients between hydrophobic and
    hydrophilic phases
  • charge
  • molecular volume
Unfortunately, ...
Overview
6
Principles of Scoring Matrix Construction
  • Residue 1978 1991
  • A 100 100
  • C 20 44
  • D 106 86
  • E 102 77
  • F 41 51
  • G 49 50
  • H 66 91
  • I 96 103
  • K 56 72
  • L 40 54
  • M 94 93
  • N 134 104
  • P 56 58
  • Q 93 84
  • R 65 83
  • S 120 117
  • T 97 107
  • V 74 98

7

Which matrix should I use?
  • When comparing sequences that were not known in
    advance to be related, for example when scanning
    a database:
  • the default scoring matrix is BLOSUM62
  • if one is restricted to PAM scoring matrices,
    PAM120 is recommended for general protein
    similarity searches
  • When using a local alignment method, Altschul
    suggests that three matrices should ideally be
    used: PAM40, PAM120 and PAM250. The lower PAM
    matrices tend to find short alignments of highly
    similar sequences, while the higher PAM matrices
    find longer, weaker local alignments.

8
Overview
  • Matrix (columns: C K H V F C R V C I, rows: C K K C F C K C V)
  •   C K H V F C R V C I
  • --------------------
  • C 5 3 3 3 2 2 1 1 1 0
  • K 4 4 3 3 2 1 1 1 0 0
  • K 3 4 3 3 2 1 1 1 0 0
  • C 4 3 3 3 2 2 1 1 1 0
  • F 3 2 2 2 3 1 1 1 0 0
  • C 4 2 2 2 2 2 1 1 1 0
  • K 2 3 2 2 2 1 1 1 0 0
  • C 2 1 1 1 1 2 1 0 1 0
  • V 0 0 0 1 0 0 0 1 0 0
  • Alignment 1
  • C K H V F C R V C I
  • C K K C F C - K C V
  • Alignment 2
  • C K H V F C R V C I
  • C K K C F C K - C V
  • Alignment 3
  • C - K H V F C R V C I
  • C K K C - F C - K C V
  • Alignment 4
  • C K H - V F C R V C I
  • C K K C - F C - K C V
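
The numbers in this matrix can be reproduced with a short Perl sketch. It assumes the scheme the values suggest: each cell scores 1 for identical residues (0 otherwise) plus the best score found anywhere below and to the right of it, filled in from the bottom-right corner. The sub name sum_matrix and the helper table @best are illustrative names, not taken from the slides.

```perl
use strict;
use warnings;

# Sketch: fill a match matrix from the bottom-right corner.
# Each cell = 1 if the residues match (else 0) plus the best
# score found anywhere below and to the right of the cell.
sub sum_matrix {
    my ($top, $side) = @_;
    my @t = split //, $top;     # residues along the columns
    my @s = split //, $side;    # residues along the rows
    my @m;                      # cell scores
    my @best;                   # $best[$i][$j] = max of @m over rows >= $i, cols >= $j
    for my $i (reverse 0 .. $#s) {
        for my $j (reverse 0 .. $#t) {
            my $below_right = ($i < $#s && $j < $#t) ? $best[$i + 1][$j + 1] : 0;
            $m[$i][$j] = ($s[$i] eq $t[$j] ? 1 : 0) + $below_right;
            my $max = $m[$i][$j];
            $max = $best[$i + 1][$j] if $i < $#s && $best[$i + 1][$j] > $max;
            $max = $best[$i][$j + 1] if $j < $#t && $best[$i][$j + 1] > $max;
            $best[$i][$j] = $max;
        }
    }
    return \@m;
}

my $matrix = sum_matrix("CKHVFCRVCI", "CKKCFCKCV");
print join(" ", @$_), "\n" for @$matrix;
```

The top-left value (5) is the number of matches in the best alignment; the alignments themselves are read off by following the highest values from that corner downwards and to the right.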

10
Get Sequences
  • Entrez
  • Simple
  • ZK822 (genomic)
  • ZK822.4 (gene)
  • Limits / Details
  • E.g. all ion channels, complete CDS
  • Batch Entrez
  • SRS (Sequence Retrieval System)
  • Swiss-Prot

11
SQL structured query language
  • SQL is powerful yet compact: everyday data
    handling needs only 4 statements, sometimes
    referred to as CRUD
  • 1) Create - INSERT - to store new data
  • 2) Read - SELECT - to retrieve data
  • 3) Update - UPDATE - to change or modify data
  • 4) Delete - DELETE - to delete or remove data
  • SELECT * FROM table WHERE ...

12
  • SELECT * FROM table WHERE ...
    (http://sqlcourse2.com/select2.html)
  • ((ion channel[All Fields] AND "homo
    sapiens"[Organism]) AND ((((((1900[MDAT] :
    3000[MDAT]) NOT gbdiv_est[PROP]) AND ((1900[MDAT] :
    3000[MDAT]) NOT gbdiv_sts[PROP])) AND
    ((1900[MDAT] : 3000[MDAT]) NOT gbdiv_gss[PROP]))
    AND ((1900[MDAT] : 3000[MDAT]) NOT
    gbdiv_htg[PROP])) AND ((1900[MDAT] : 3000[MDAT])
    NOT gbdiv_pat[PROP])))

13
Get Sequences
  • Entrez
  • Simple
  • ZK822 (genomic)
  • ZK822.4 (gene)
  • Limits / Details
  • E.g. all ion channels, complete CDS
  • Batch Entrez
  • SRS (Sequence Retrieval System)
  • Swiss-Prot

14
  • Perform multiple alignment
  • BLOSUM62 12 -2
  • Change gap opening and extension to get the
    correct alignment
  • Goal: align two similar proteins

15
Extensions to basic dynamic programming method
Overview
  • use a similarity function in the initialization
    step -> scoring tables
  • use gap penalties
  • constant gap penalty for gap > 1
  • gap penalty proportional to gap size
  • one penalty for starting a gap (gap opening
    penalty)
  • different (lower) penalty for adding to a gap
    (gap extension penalty)
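
The three gap-penalty schemes can be compared with a minimal Perl sketch. The sub names and the numbers are illustrative assumptions. The affine version charges the opening penalty for the first gap position and the extension penalty for every further position; other programs may define the split slightly differently.

```perl
use strict;
use warnings;

# Constant penalty: the same cost for any gap, whatever its length.
sub constant_gap {
    my ($len, $penalty) = @_;
    return $len > 0 ? $penalty : 0;
}

# Proportional penalty: cost grows linearly with the gap length.
sub linear_gap {
    my ($len, $per_position) = @_;
    return $len * $per_position;
}

# Affine penalty: opening cost once, extension cost for each extra position.
sub affine_gap {
    my ($len, $open, $extend) = @_;
    return 0 if $len <= 0;
    return $open + ($len - 1) * $extend;
}

for my $len (1 .. 4) {
    printf "length %d: constant %d, linear %d, affine %d\n",
        $len, constant_gap($len, 12), linear_gap($len, 2), affine_gap($len, 12, 2);
}
```

With an opening penalty much higher than the extension penalty, one long gap becomes cheaper than several short ones, which is usually the biologically preferred outcome.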

16
Sequence comparison with dot matrices
Dot matrices
  • Goal: graphically display regions of similarity
    between two sequences (e.g., domains in common
    between two proteins of suspected similar
    function)
  • Extremely useful!
  • Remember: DNA is double-stranded (plot again RCC)

17
Overview
18
Overview
  • Window size changes with goal of analysis
  • size of average exon
  • size of average protein structural element
  • size of gene promoter
  • size of enzyme active site
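
How the window size enters a dot matrix can be sketched in Perl as follows. This is a minimal illustration, not the program used in the demo: the sub name dot_matrix and the threshold parameter $min_match are assumptions. A dot is drawn when at least $min_match of the $window residues starting at the two positions are identical.

```perl
use strict;
use warnings;

# Sketch: windowed dot matrix of sequence $x (columns) against $y (rows).
sub dot_matrix {
    my ($x, $y, $window, $min_match) = @_;
    my @rows;
    for my $j (0 .. length($y) - $window) {
        my $row = "";
        for my $i (0 .. length($x) - $window) {
            my $hits = 0;
            for my $k (0 .. $window - 1) {
                $hits++ if substr($x, $i + $k, 1) eq substr($y, $j + $k, 1);
            }
            $row .= $hits >= $min_match ? "*" : ".";
        }
        push @rows, $row;
    }
    return @rows;
}

# Self-comparison: the main diagonal lights up.
print "$_\n" for dot_matrix("GATTACA", "GATTACA", 3, 3);
```

A larger window with a lower threshold suppresses random single-residue matches, at the price of blurring short features.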

19
  • Dotmatrix

20
Lesson 3 (theory + exercises)
  • Varia
  • http://biochema.rug.ac.be/
  • Review
  • Demos (sequence retrieval, dot plot, pairwise
    alignment (SW)) during class
  • Installation
  • Windows versus Linux
  • Perl
  • Review
  • Arrays/Hashes/File I/O
  • Exercises

21
Perl installation
  • Perl
  • Perl is available for various operating systems.
    To download Perl and install it on your computer,
    have a look at the following resources:
  • www.perl.com (O'Reilly).
  • Downloading Perl Software
  • ActiveState. ActivePerl for Windows, as well as
    for Linux and Solaris.
  • ActivePerl binary packages.
  • CPAN
  • PHPTriad
  • contains Apache, PHP and MySQL:
    http://sourceforge.net/projects/phptriad

22
BioPerl installation
  • BioPerl
  • Download BioPerl (bioperl-1.0.2.zip) from
    http://bioperl.org/Core/Latest/
  • perl Makefile.PL
  • (n)make (i.e. nmake on Windows; download it from
    Microsoft:
    http://download.microsoft.com/download/vc15/Patch/1.52/W95/EN-US/Nmake15.exe)
  • (n)make install
  • For Bundle-BioPerl
  • Download the bundle from CPAN:
    http://search.cpan.org/author/CRAFFI/Bundle-BioPerl-2.03/BioPerl.pm
  • perl Makefile.PL
  • (n)make (i.e. nmake on Windows, same download as
    above)
  • (n)make install
  • DBI and DBD::mysql via PPM (the Perl Package
    Manager of ActivePerl)

23
What is Perl ?
  • Perl is a High-level Scripting language
  • Faster than sh or csh, slower than C
  • No need for sed, awk, tr, wc, cut, ...
  • Compiles at run-time
  • Perl is a computer language that is
  • Interpreted
  • Loosely typed
  • String/text oriented
  • Capable of using multiple syntax formats
  • In Perl, "there's more than one way to do it"

24
Why use Perl for bioinformatics ?
  • Ease of use by novice programmers
  • Fast software prototyping
  • Flexible language
  • Compact code
  • Powerful pattern matching via regular
    expressions (Best Regular Expressions on Earth)
  • Availability of Perl modules for Bioinformatics
    and Internet.
  • Available for Unix, PC, Mac
  • Portability, Best for CGI-programming.
  • Open source: easy to extend and customize
  • No licensing fees
  • Some tasks are still better done with other
    languages (heavy computations / graphics).
  • With perl you can write simple programs fast, but
    on the other hand it is also suitable for large
    and complex programs. (yet, it is not adequate
    for very large projects).

25
What bioinformatics tasks are suited to Perl ?
  • Sequence manipulation and analysis
  • Parsing results of sequence analysis programs
    (Blast, Genscan, Hmmer etc)
  • Parsing database (eg Genbank) files
  • Obtaining multiple database entries over the
    internet

26
General Remarks
  • Perl is mostly a free-format language: add
    spaces, tabs or new lines wherever you want.
  • For clarity, it is recommended to write each
    statement on a separate line, and to use
    indentation in nested structures.
  • Comments: anything from the # sign to the end of
    the line is a comment. (There are no multi-line
    comments).
  • A perl program consists of all of the Perl
    statements of the file taken collectively as one
    big routine to execute.

27
What does a real Perl program look like?
  • #!/usr/local/bin/perl
  • print "Hello everyone\n";
The first line (#!...) is needed when the script is
run directly as a command on Unix-like systems.
How to run it
  • 1. Save the text of your code as a file, e.g.
    program.pl
  • 2. Execute it: perl program.pl
  • Output: Hello everyone
28
2+2 = ?
  • $ - indicates a variable
  • ; - ends every command
  • = - assigns a value to a variable
  • $a = 2; $b = 2; $c = $a + $b;
  • or $c = 2 + 2;
  • or $c = 2 * 2;
  • or $c = 2 / 2;
  • or $c = 2 ** 4; (2 to the power 4, i.e. 16)
  • or $c = 1.35 * 2 - 3 / (0.12 + 1);
29
Ok, $c is 4. How do we know it?
  • $c = 4; print "$c";
print command
  • " " - quotes enclose the output expression
  • print "Hello \n";
  • \n - prints an end-of-line character (equivalent
    to pressing Enter)
String concatenation
  • print "Hello everyone\n";
  • print "Hello" . " everyone" . "\n";
Expressions and strings together
  • print "2 + 2 = " . (2+2) . "\n";
  • (2+2) is an expression; the line prints 2 + 2 = 4
30
Loops and cycles (for statement)
Output all the numbers from 1 to 100:
  • for ($n=1; $n<=100; $n+=1) { print "$n \n"; }
1. Initialization
  • for ( $n=1; ... ; ... )
2. Increment
  • for ( ... ; ... ; $n+=1 )
3. Termination (the loop runs as long as the
condition holds)
  • for ( ... ; $n<=100; ... )
4. Body of the loop - commands inside curly
brackets
  • for ( ... ) { ... }
31
FOR + IF -- all the even numbers from 1 to 100
  • for ($n=1; $n<=100; $n+=1) {
  • if (($n % 2) == 0) { print "$n\n"; }
  • }
Note: $a % $b -- modulus -- the remainder
when $a is divided by $b
32
Text Processing Functions
  • The substr function
  • Definition
  • The substr function extracts a substring out of a
    string and returns it. The function receives 3
    arguments a string value, a position on the
    string (starting to count from 0) and a length.
  • Example
  • $a = "university";
  • $k = substr ($a, 3, 5);
  • $k is now "versi"; $a remains unchanged.
  • If length is omitted, everything to the end of
    the string is returned.

33
  • #!/usr/local/bin/perl
  • use strict;
  • use warnings;
  • my ($sp_file, $line, $id, $ac, $de);
  • $sp_file = "sp.txt";
  • open (SP, $sp_file) || die "cannot open
    \"$sp_file\": $!";
  • while ($line = <SP>) {
  • chomp ($line);
  • my $field = substr ($line, 0, 2); # two-letter line code
  • my $value = substr ($line, 5); # rest of the line
  • if ($field eq "ID") { $id = $value; }
  • if ($field eq "AC") { $ac = $value; }
  • }

34
Text Processing Functions
  • The split function
  • The split function splits a string into a list of
    substrings according to the positions of a given
    delimiter. The delimiter is written as a pattern
    enclosed by slashes: /PATTERN/. Examples
  • $string = "programming::course::for::bioinformatics";
  • @list = split (/::/, $string);
  • @list is now ("programming", "course", "for",
    "bioinformatics"); $string remains unchanged.
  • $string = "protein kinase C\t450 Kilodaltons\t120
    Kilobases";
  • @list = split (/\t/, $string); # \t indicates tab
  • @list is now ("protein kinase C", "450
    Kilodaltons", "120 Kilobases")

35
Text Processing Functions
  • The join function
  • The join function does the opposite of split. It
    receives a delimiter and a list of strings, and
    joins the strings into a single string, such that
    they are separated by the delimiter.
  • Note that the delimiter is written inside quotes.
  • Examples
  • @list = ("programming", "course", "for",
    "bioinformatics");
  • $string = join ("::", @list);
  • $string is now
    "programming::course::for::bioinformatics"
  • $name = "protein kinase C"; $mol_weight = "450
    Kilodaltons"; $seq_length = "120 Kilobases";
  • $string = join ("\t", $name, $mol_weight,
    $seq_length);
  • $string is now "protein kinase C\t450
    Kilodaltons\t120 Kilobases"

36
Regular Expressions
  • Match to a sequence of characters
  • The EcoRI restriction enzyme cuts at the
    consensus sequence GAATTC.
  • To find out whether a sequence contains a
    restriction site for EcoRI, write
  • if ($sequence =~ /GAATTC/) { ... }

37
Regular Expressions
  • Match to a character class
  • Example
  • The BstYI restriction enzyme cuts at the
    consensus sequence rGATCy, namely A or G in the
    first position, then GATC, and then T or C. To
    find out whether a sequence contains a
    restriction site for BstYI, write
  • if ($sequence =~ /[AG]GATC[TC]/) { ... } This
    will match all of AGATCT, GGATCT, AGATCC, GGATCC.
  • Definition
  • When a list of characters is enclosed in square
    brackets [ ], one and only one of these characters
    must be present at the corresponding position of
    the string in order for the pattern to match. You
    may specify a range of characters using a hyphen
    -.
  • A caret ^ at the front of the list negates the
    character class.
  • Examples
  • if ($string =~ /[AGTC]/) { ... } matches any
    nucleotide
  • if ($string =~ /[a-z]/) { ... } matches any
    lowercase letter
  • if ($string =~ /chromosome[1-6]/) { ... }
    matches chromosome1, chromosome2 ... chromosome6
  • if ($string =~ /[^xyzXYZ]/) { ... } matches any
    character except x, X, y, Y, z, Z
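
A character class can also be used to locate every site rather than just test for one. The following is a minimal sketch: the sub name bstyi_sites and the test sequence are made up for illustration. The lookahead (?=...) is used so that overlapping sites would be reported as well.

```perl
use strict;
use warnings;

# Return the 0-based start position of every BstYI site ([AG]GATC[TC]).
sub bstyi_sites {
    my ($seq) = @_;
    my @positions;
    # A zero-width lookahead matches at the site start without consuming it.
    while ($seq =~ /(?=[AG]GATC[TC])/g) {
        push @positions, pos($seq);
    }
    return @positions;
}

my @sites = bstyi_sites("TTAGATCTGGGGATCCAA");   # made-up sequence
print "BstYI sites start at: @sites\n";
```

Here the sites AGATCT and GGATCC are found at positions 2 and 10.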

38
  • We would like to find out whether the consensus
    sequence is contained (somewhere) in a given
    sequence $a.
  • Without quantifiers
  • if ($a =~ /ACCCCAGAGAGGTGT/) { ... }
  • With quantifiers
  • if ($a =~ /AC{4}(AG){3}(GT){2}/) { ... }

39
Regular Expressions
  • Case-insensitive pattern matching
  • To achieve a case-insensitive pattern matching,
    add the i modifier after the closing slash of the
    regular expression.
  • Example
  • When searching for HTML tags, we preferably do a
    case-insensitive search, e.g. if ($doc =~
    /<TABLE.*>/i)
  • This would match <TABLE...>, <table>, <TAblE>,
    etc.

40
Alternation
  • Alternation allows matching any one of several
    subexpressions. The alternative subexpressions
    are separated by vertical bar(s) |.
  • Example 1
  • extract all lines including either human, rat or
    mouse proteins
  • if ($line =~ /HUMAN|RAT|MOUSE/) # match $line
    against either HUMAN, RAT or MOUSE
  • Example 2
  • In the same file, let us now restrict our search
    to the ACM1 receptors in either human, rat
    or mouse: if ($line =~ /ACM1_(HUMAN|RAT|MOUSE)/)
  • We enclosed the alternative subexpressions
    in parentheses (HUMAN|RAT|MOUSE) and added the
    receptor name prefix ACM1_ before them.
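
Because the alternatives are enclosed in parentheses, Perl also captures which one matched in $1. A small sketch under that assumption; the sub name acm1_species and the sample lines are invented for illustration:

```perl
use strict;
use warnings;

# Return the species part of an ACM1 entry name, or undef if there is none.
sub acm1_species {
    my ($line) = @_;
    return $line =~ /ACM1_(HUMAN|RAT|MOUSE)/ ? $1 : undef;
}

for my $line ("ID   ACM1_HUMAN", "ID   ACM2_HUMAN", "ID   ACM1_RAT") {
    my $species = acm1_species($line);
    print "$line -> ", (defined $species ? $species : "no match"), "\n";
}
```

The second line is rejected because the prefix ACM1_ is required in front of the species name.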

41
Anchoring a pattern to the beginning or end of a
string
  • To force matching of your pattern to the
    beginning of the string, write a caret ^ as the
    first character of the regular expression.
  • To force the matching to the end of the string,
    write a dollar sign $ as the last character of
    the regular expression.
  • To print the "description" line, which starts
    with DE, we write
  • #!/usr/local/bin/perl
  • my $sp_file = "sources/sp_entry";
  • open (SP, $sp_file) || die "cannot open
    \"$sp_file\": $!";
  • while (my $line = <SP>) {
  • if ($line =~ /^DE/) { print $line; }
  • }
  • Result
  • DE MUSCARINIC ACETYLCHOLINE RECEPTOR M1.
  • Note: if we omitted the caret from the regular
    expression
  • if ($line =~ /DE/)
  • every line containing DE anywhere would match,
    not only the lines that start with DE.

42
Regex
  • O'Reilly book: Mastering Regular Expressions (2nd
    edition)
  • Regular Expressions Tutorial

43
Substitutions
44
Translations
45
GC
  • sub gc_content {
  • my $seq = shift;
  • print "\$seq: ", $seq, "\n";
  • my $win = shift;
  • print "length ", length($seq), "\n";
  • for (my $i = 0; $i <= length($seq) - $win; $i++) {
  • my $segment = substr($seq, $i, $win);
  • my $gc_count = ($segment =~ tr/GCgc/GCgc/); # tr returns the number of G/C characters
  • print $i+1, "\t", $segment, "\t", $gc_count, "\n";
  • }
  • }
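
The same sliding-window idea can be written so that the counts are returned instead of printed, which makes them easier to reuse. This is a sketch only: the sub name gc_windows and the example sequence are assumptions, and the window is moved one base at a time.

```perl
use strict;
use warnings;

# Sketch: GC count in every window of $win bases along $seq.
# Returns a list of [start (1-based), segment, GC count].
sub gc_windows {
    my ($seq, $win) = @_;
    my @result;
    for (my $i = 0; $i <= length($seq) - $win; $i++) {
        my $segment = substr($seq, $i, $win);
        my $gc = ($segment =~ tr/GCgc//);   # empty replacement: count only
        push @result, [$i + 1, $segment, $gc];
    }
    return @result;
}

printf "%d\t%s\t%d\n", @$_ for gc_windows("ATGCGCTA", 4);
```

Dividing each count by the window size gives the GC fraction per window.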

46
Lesson 3 (theory + exercises)
  • Varia
  • http://biochema.rug.ac.be/
  • Review
  • Demos (sequence retrieval, dot plot, pairwise
    alignment (SW)) during class
  • Installation
  • Windows versus Linux
  • Perl
  • Review
  • Arrays/File I/O/Hashes
  • Exercises

47
Arrays
  • Definitions
  • A scalar variable contains a scalar value: one
    number or one string. A string might contain many
    words, but Perl regards it as one unit.
  • An array variable contains a list of scalar data:
    a list of numbers or a list of strings or a mixed
    list of numbers and strings. The order of
    elements in the list matters.
  • Syntax
  • Array variable names start with an @ sign.
  • You may use in the same program a variable named
    $var and another variable named @var, and they
    will mean two different, unrelated things.
  • Example
  • Assume we have a list of numbers which were
    obtained as a result of some measurement. We can
    store this list in an array variable as
    follows:
  • @msr = (3, 2, 5, 9, 7, 13, 16);

48
The foreach construct
  • The foreach construct iterates over a list of
    scalar values (e.g. that are contained in an
    array) and executes a block of code for each of
    the values.
  • Example
  • foreach $i (@some_array) {
  • statement_1;
  • statement_2;
  • statement_3;
  • }
  • Each element in @some_array is aliased to the
    variable $i in turn, and the block of code inside
    the curly brackets is executed once for each
    element.
  • The variable $i (or give it any other name you
    wish) is local to the foreach loop and regains
    its former value upon exiting the loop.
  • Remark: when no loop variable is given, each
    element is aliased to $_ instead.

49
Binary assignment operators
  • A shorthand for
  • $k = $k - 2;
  • is
  • $k -= 2;
  • Similarly, you may use
  • $k += 2; same as $k = $k + 2;
  • $k *= 2; same as $k = $k * 2;
  • $k /= 2; same as $k = $k / 2;
  • or even
  • $k .= "some string"; same as $k = $k . "some
    string";
  • These are called binary assignment
    operators, and are very useful in iterative
    (looping) constructs.

50
Examples for using the foreach construct - cont.
  • Calculate the sum of all array elements
  • #!/usr/local/bin/perl
  • @msr = (3, 2, 5, 9, 7, 13, 16);
  • $sum = 0;
  • foreach $i (@msr) { $sum += $i; }
  • print "sum is: $sum\n";

51
Accessing individual array elements
  • Individual array elements may be accessed by
    indicating their position in the list (their
    index).
  • Example
  • @msr = (3, 2, 5, 9, 7, 13, 16);
  • index: 0 1 2 3 4 5 6
  • value: 3 2 5 9 7 13 16
  • First element: $msr[0] (here has the value of 3),
  • Third element: $msr[2] (here has the value of 5),
  • and so on.

52
The sort function
  • The sort function receives a list of variables
    (or an array) and returns the sorted list.
  • @array2 = sort (@array1);
  • #!/usr/local/bin/perl
  • @countries = ("Israel", "Norway", "France",
    "Argentina");
  • @sorted_countries = sort (@countries);
  • print "ORIG: @countries\n", "SORTED:
    @sorted_countries\n";
  • Output
  • ORIG: Israel Norway France Argentina
  • SORTED: Argentina France Israel Norway
  • #!/usr/local/bin/perl
  • @numbers = (1, 2, 4, 16, 18, 32, 64);
  • @sorted_num = sort (@numbers);
  • print "ORIG: @numbers \n", "SORTED: @sorted_num
    \n";
  • Output
  • ORIG: 1 2 4 16 18 32 64
  • SORTED: 1 16 18 2 32 4 64
  • Note: by default sort compares the elements as
    strings, which is why 16 comes before 2.
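
To sort numbers by value, sort can be given a comparison block that uses the numeric comparison operator <=>. A minimal sketch; the sub name sort_numerically is only illustrative:

```perl
use strict;
use warnings;

# Sort a list of numbers by value rather than as strings.
sub sort_numerically {
    return sort { $a <=> $b } @_;
}

my @numbers = (1, 2, 4, 16, 18, 32, 64);
my @default = sort @numbers;                 # string comparison
my @numeric = sort_numerically(@numbers);    # numeric comparison
print "DEFAULT: @default\n";
print "NUMERIC: @numeric\n";
```

Swapping $a and $b in the block ({ $b <=> $a }) sorts in descending order.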

53
The push and shift functions
  • The push function adds a variable or a list of
    variables to the end of a given array.
  • Example
  • $a = 5;
  • $b = 7;
  • @array = ("David", "John", "Gadi");
  • push (@array, $a, $b);
  • @array is now ("David", "John", "Gadi", 5, 7)
  • The shift function removes the first element of a
    given array and returns this element.
  • Example
  • @array = ("David", "John", "Gadi");
  • $k = shift (@array);
  • @array is now ("John", "Gadi"); $k is now
    "David"
  • Note that after both the push and shift
    operations the given array @array is changed!

54
How can I know the length of a given array?
  • You have three options:
  • Assign the array variable to a scalar variable,
    as in the previous slide. This is not
    recommended, because the code is confusing.
  • Use the scalar function. Example:
  • $x = scalar (@array); # $x now contains the
    number of elements in @array.
  • Use the special variable $#array_name to get the
    index value of the last element of @array_name.
    Example:
  • @fruits = ("apple", "orange", "banana", "melon");
  • $a = $#fruits;
  • $a is now 3
  • $b = $#fruits + 1;
  • $b is now 4, i.e. the number of elements in
    @fruits.

55
File input / output
  • Opening a filehandle
  • In order to use a filehandle other than STDIN,
    STDOUT and STDERR, the filehandle needs to be
    opened. The open function opens a file or device
    and associates it with a filehandle.
  • It returns 1 upon success and undef otherwise.
  • Examples
  • open a filehandle for reading: open
    (SOURCE_FILE, "filename");
  • or: open (SOURCE_FILE, "<filename");
  • open a filehandle for writing: open
    (RESULT_FILE, ">filename");
  • open a filehandle for appending: open (LOGFILE,
    ">>filename");

56
File input / output
  • Closing a filehandle
  • When you are finished with a filehandle, you may
    close it with the close function. The close
    function closes the file or device associated
    with the filehandle.
  • Example
  • close (MY_FILE_HANDLE);
  • Filehandles are automatically closed when the
    program exits, or when the filehandle is reopened.

57
File input / output
  • The die function
  • Sometimes the open function fails. For example,
    opening a file for input might fail because the
    file does not exist, and opening a file for
    output might fail because the file does not have
    a write permission. A perl program will
    nevertheless use the filehandle, and will not
    warn you that all input and output activities are
    actually meaningless.
  • Therefore, it is recommended to explicitly check
    the result of the open command, and if it fails
    to print an error message and exit the program.
  • This is easily done using the die function.
  • Example
  • my $k = open (FILEHANDLE, "filename");
  • unless ($k) { die ("cannot open file filename: $!"); }
  • In case the file "filename" cannot be opened, the
    argument of die is printed on the screen
    and the program exits. $! is a special
    variable that contains the respective error
    message sent by the operating system.
  • A shorthand:
  • open (FILEHANDLE, "filename") || die "cannot open
    file filename: $!";
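
The same check is often written with the three-argument form of open and a lexical filehandle, which is a common alternative to the bareword filehandles used on these slides. A sketch under that assumption; the sub name read_lines and the file name are hypothetical:

```perl
use strict;
use warnings;

# Sketch: read all lines of a file; die with the system error on failure.
sub read_lines {
    my ($filename) = @_;
    open(my $fh, "<", $filename) or die "cannot open file $filename: $!";
    my @lines = <$fh>;
    chomp(@lines);
    close($fh);
    return @lines;
}

# eval catches the die so that the message can be inspected.
my @lines = eval { read_lines("no_such_file.txt") };   # hypothetical file name
print "error: $@" if $@;
```

Without the eval, the die would end the program at that point, which is usually what is wanted in a small script.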

58
Using filehandles for writing
  • Example
  • #!/usr/local/bin/perl
  • use strict;
  • use warnings;
  • open (OUTF, ">out_file") || die "cannot open
    out_file: $!";
  • open (LOGF, ">>log_file") || die "cannot open
    log_file: $!";
  • print OUTF "Here is my program output\n";
  • print LOGF "First task of my program
    completed\n";
  • print "Nice, isn't it?\n"; # will be printed on
    the screen
  • close (OUTF);
  • close (LOGF);

59
Using filehandles for reading (1/3)
  • #!/usr/local/bin/perl
  • use strict;
  • use warnings;
  • my $infile = "CEACAM3.txt";
  • my ($line1, $line2, $line3);
  • open (FH, $infile) || die "cannot open
    \"$infile\": $!";
  • $line1 = <FH>; # read first line
  • print $line1; # process line (here we only
    print it)
  • $line2 = <FH>; # read next line
  • print $line2; # process line (here we only
    print it)
  • $line3 = <FH>; # read next line
  • print $line3; # process line (here we only
    print it)
  • close (FH);

60
Using filehandles for reading (2/3)
  • When <FILEHANDLE> is assigned to an array
    variable, all lines up to the end of the file are
    read at once. Each line becomes a separate
    element of the array.
  • #!/usr/local/bin/perl
  • use strict;
  • use warnings;
  • my $infile = "CEACAM3.txt";
  • open (FH, $infile) || die "cannot open
    \"$infile\": $!";
  • my @lines = <FH>;
  • chomp (@lines); # chomp each element of @lines
  • close (FH);
  • # to process the lines you might wish to iterate
  • # over the @lines array with a foreach loop
  • my $line;
  • foreach $line (@lines) {
  • # process line. here we just print it.
  • print "$line\n";
  • }

61
Using filehandles for reading (3/3)
  • Using a while loop, read one line at a time and
    assign it to a scalar variable, as long as a
    line could be read (at end-of-file the read
    returns undef, which ends the loop).
  • Note that a blank line read from the file will
    not end the loop, since it still
    contains the terminating \n.
  • #!/usr/local/bin/perl
  • use strict;
  • use warnings;
  • my $infile = "CEACAM3.txt";
  • open (FH, $infile) || die "cannot open
    \"$infile\": $!";
  • my $line;
  • while ($line = <FH>) {
  • (or, in one line: while (my $line = <FH>) { )
  • chomp ($line);
  • print "$line\n"; # process line. here we just
    print it.
  • }
  • close (FH);

62
Hashes
  • Definition
  • A hash variable contains a collection of
    key/value pairs, arranged such that you can
    easily use any key to find its associated value.
    The order of the key/value pairs in the hash is
    not important.
  • Hashes are also called associative arrays.
  • Hash variable names start with a % sign.
  • You may use in the same program a variable named
    $var, another variable named @var, and a third
    variable named %var, and they will mean three
    different, unrelated things.

63
A Hash Is a Lookup Table
  • A hash is a lookup table.
  • We use a key to find an associated value.
  • my %translate;
  • $translate{'atg'} = 'M';
  • $translate{'taa'} = '*';
  • $translate{'ctt'} = 'K'; # oops
  • $translate{'ctt'} = 'L'; # fixed
  • print $translate{'atg'};
  • Getting All Keys
  • keys %translate
  • Removing Key, Value Pairs
  • delete $translate{'taa'}; keys %translate
  • Initializing From a List
  • %translate = ( 'atg' => 'M', 'taa' => '*', 'ctt'
    => 'L', 'cct' => 'P', );

64
  • %AA1 = ("TGT" => "Cys",
  • "TGC" => "Cys",
  • "GAT" => "Asp",
  • "GAC" => "Asp",
  • "GAA" => "Glu",
  • "GAG" => "Glu",
  • "TTT" => "Phe");

65
  • Accessing individual hash elements
  • Whereas array elements are accessed by their
    (numerical) index, hash elements (values) are
    accessed by their keys.
  • Example
  • #!/usr/local/bin/perl
  • use strict;
  • use warnings;
  • my (%prices, $s, $t);
  • %prices = ("shirt" => 45,
  • "pullover" => 90,
  • "trousers" => 120,
  • "socks" => 15);
  • $s = $prices{"shirt"};
  • $t = $prices{"trousers"};

66
  • Adding an element to a hash
  • Simply assign a value to an individual hash
    element, e.g. $prices{"coat"} = 250;
  • The pair ("coat", 250) will be added to the
    %prices hash
  • Deleting an element from a hash
  • Use the delete function, e.g. delete
    $prices{"coat"};
  • Checking whether a hash is empty
  • if (%hash_name) { ... } # false if the hash is
    empty
  • Using a Hash for Counting
  • $number_of_nuc{"g"}++;
  • Using a Hash for Eliminating Duplicates
  • $genbank{$accession} = 1; In this case, the keys
    in the hash are what's important. The values may
    be irrelevant.
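
Both idioms can be sketched in a few lines of Perl. The sub names count_nucleotides and unique_accessions, and the accession strings, are made up for illustration:

```perl
use strict;
use warnings;

# Count how often each nucleotide occurs in a sequence (case-insensitive).
sub count_nucleotides {
    my ($seq) = @_;
    my %number_of_nuc;
    $number_of_nuc{$_}++ for split //, lc $seq;
    return %number_of_nuc;
}

# Keep the first occurrence of each accession and drop later duplicates.
sub unique_accessions {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my %count = count_nucleotides("GGATCC");
print "$_: $count{$_}\n" for sort keys %count;
print join(" ", unique_accessions("X1", "Y2", "X1")), "\n";   # made-up accessions
```

In unique_accessions the hash values are just counters; only the fact that a key has been seen before matters.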

67
The keys function
  • The keys function yields a list of all the
    current keys in a given hash.
  • Example
  • #!/usr/local/bin/perl
  • use strict;
  • use warnings;
  • my %prices = ("shirt" => 45, "pullover" => 90,
    "trousers" => 120, "socks" => 15);
  • my @items = keys (%prices);
  • print "ITEMS: @items\n";
  • Result: ITEMS: pullover shirt socks trousers

68
Iterating over all hash elements using the keys
function
  • Example - printing all keys and values of the
    %prices hash
  • my @items_list = keys (%prices);
  • foreach my $item (@items_list) {
  • print "$item $prices{$item} NIS\n";
  • }
  • or, shorter
  • foreach my $item (keys (%prices)) {
  • print "$item $prices{$item} NIS\n";
  • }
  • Result
  • pullover 90 NIS
  • shirt 45 NIS
  • socks 15 NIS
  • trousers 120 NIS
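
Since the order of the pairs in a hash is not fixed, the order in which keys returns them should not be relied on. When a particular order matters, the keys can be sorted first. A sketch with two illustrative helper subs, by_name and by_price (assumed names):

```perl
use strict;
use warnings;

my %prices = ("shirt" => 45, "pullover" => 90, "trousers" => 120, "socks" => 15);

# Keys in alphabetical order.
sub by_name {
    my (%h) = @_;
    return sort keys %h;
}

# Keys ordered from the lowest to the highest value.
sub by_price {
    my (%h) = @_;
    return sort { $h{$a} <=> $h{$b} } keys %h;
}

print "$_ $prices{$_} NIS\n" for by_name(%prices);
print "cheapest first: ", join(" ", by_price(%prices)), "\n";
```

Sorting the keys makes the output reproducible from run to run.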

69
The values function
  • The values function yields a list of all the
    current values in a given hash.
  • Example
  • #!/usr/local/bin/perl
  • use strict;
  • use warnings;
  • my %prices = ("shirt" => 45, "pullover" => 90,
    "trousers" => 120, "socks" => 15);
  • my @EURO = values (%prices);
  • print "PRICES: @EURO\n";
  • Result: PRICES: 90 45 15 120

70
Exercises http://biochema.rug.ac.be/
  • Which genes are involved in the PRADER-WILLI
    SYNDROME?
  • How many different human PDEs (phosphodiesterases)
    are available in GenBank?
  • How big is the anthrax genome and how many genes
    are present?
  • Which of the 4 sequences (seq1/2/3/4)
  • contains a hexokinase signature
    ([LIVM]-G-F-[TN]-F-S-[FY]-P-x(5)-[LIVM]-[DNST]-x(3)-[LIVM]-x(2)-W-T-K-x-[LF])?
  • How many of them?
  • Where (hint)?
  • Write a program (random.pl) to generate 10 random
    sequences of 1000 bp and write them to a file in
    FASTA format
  • Find the answer in ultimate-sequence.txt
  • (hint: use %AA1 to perform the translation(s))
  • Which restriction enzyme has the longest
    recognition site?

71
  • >SEQ1
  • MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYIST
    TIFVISGILNFYCLFIALYT YYFLDNETRKHYVFVLSRFLSSILVIISL
    LVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS
    LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQ
    GPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSL
    VFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT
    NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAM
    CFLAVLVDTYCLLVTISILK SLKKQSRKQYIFVVVRLSAAILIALCIII
    IQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM
    MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDG
    PIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKM
    GNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK
    SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRIS
    KVFSSQVSMFSIFFCGKR
  • >SEQ2
  • MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE
    MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG
    GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE
    MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID
    KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV
    AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN
    VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES
    SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA
    SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS
    TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED
    VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE
    SEEGSGRGAA LVSAVACKKA CMLGQ
  • >SEQ3
  • MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIY
    VLVFLLSLLGNSLVMLVILY SRVGRSVTDVYLLNLALADLLFALTLPIW
    AASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY
    LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPV
    CYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMG
    QKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET
    CERRNDIDRALEATEILGILHSCLNPLIYAFIGQKFRHGLLKILAIHGLI
    SKDSLPKDSRPSFVGSSSGH TSTTL
  • >SEQ4
  • MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK
    NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ
    LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF
    IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW
    TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG
    TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG
    KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ
    IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA
    WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL
    WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS
    LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA
    ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF
    DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG
    GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII
    DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV
    SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR
    IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE
    QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL
    DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE
    GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR
    FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG
    KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG
    IAKDGSGIGA ALCALQAVKE KKGLA

72
  • >ultimate-sequence
  • ACTCGTTATGATATTTTTTTTGAACGTGAAAATACTTTTCGTGCTATGGA
    AGGACTCGTTATCGTGAAGTTGAACGTTCTGAATGTATGCCTCTTGAAAT
    GGAAAATACTCATTGTTTATCTGAAATTTGAATGGGAATTTTATCTACAA
    TGTTTTATTCTTACAGAACATTAAATTGTGTTATGTTTCATTTCACATTT
    TAGTAGTTTTTTCAGTGAAAGCTTGAAAACCACCAAGAAGAAAAGCTGGT
    ATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTG
    ATAAACCTTCTCTGTTGGCTCCAAGTATAAGTACGAAAAGAAATACGTTC
    CCAAGAATTAGCTTCATGAGTAAGAAGAAAAGCTGGTATGCGTAGCTATG
    TATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAA

73
  • my %AA1 = (
  • 'UUU','F',
  • 'UUC','F',
  • 'UUA','L',
  • 'UUG','L',
  • 'UCU','S',
  • 'UCC','S',
  • 'UCA','S',
  • 'UCG','S',
  • 'UAU','Y',
  • 'UAC','Y',
  • 'UAA','*',
  • 'UAG','*',
  • 'UGU','C',
  • 'UGC','C',
  • 'UGA','*',
  • 'UGG','W',
  • 'CUU','L',