Introduction to Perl: part 1

Provided by: isrecI
1
Introduction to Perl: part 1
  • What is Perl?
  • How is Perl used?
  • Why is Perl useful in biology?
  • Basics of Perl
  • Text file parsing
  • Arrays
  • Input and output
  • Basic operators
  • Control structures
  • Perl regular expressions
By the way: Perl stands for Practical Extraction and Report Language.
2
What is Perl?
  • Interpreted programming language
      • Perl code is compiled on the fly.
      • As a result, Perl programs tend to be slower than
        equivalent programs written in a compiled language.
  • Complicated and rich programming language
      • Many commands, functions, and data types.
      • The same command may do different things in
        different contexts.
  • Powerful but dangerous language
      • A lot can be done with a few lines of Perl code.
      • You are allowed to do almost everything you want,
        at your own risk.
      • Typos often lead to unexpected actions rather
        than compilation errors.
      • Perl code is very compact but often difficult to
        read.
  • A Unix shell
      • Unix system commands are accessible within Perl
        code.

3
Pros and Cons of Perl
  • Advantages
      • Fast development: it often takes less than an
        hour to write a Perl script.
      • Powerful text processing functions: advanced
        support for regular expressions, especially
        useful for input parsing and reformatting tasks.
      • Easy access to system commands.
      • Support for web-based applications through the
        CGI interface.
  • Disadvantages
      • Code is very concise and usually difficult to
        read.
      • Slow for computationally intensive tasks.
      • Little control over system resources.
      • Not very transparent: many trivial tasks, like
        initialization of variables, are done
        automatically, but you don't know exactly how.

4
For what purposes is Perl used in biology?
  • To parse and reformat structured text files, e.g.
    nucleotide sequence entry files.
  • Web-based program interfaces.
  • For implementing computationally inexpensive
    algorithms.
  • For testing computationally expensive algorithms
    on small data sets.
  • For piloting complex data processing pipelines
    invoking several compiled programs in succession.
  • For automating any repetitive simple task.
  • For educational purposes: Perl allows fast
    implementation of standard algorithms (you will
    not escape writing the Smith-Waterman algorithm
    in Perl).

5
Why do we have to know Perl?
  • Perl has become the standard language for light
    programming tasks in bioinformatics.
  • Perl is often essential for reformatting public
    biological data.
  • Basic knowledge of Perl is essential for
    realistic bioinformatics exercises.
  • Perl allows you to quickly implement the
    arithmetic procedures underlying the paper and
    pencil exercises of this course. You can in this
    way verify your results.
  • Perl scripts are powerful tools for exploring
    voluminous program output files.
  • Perl allows you to apply your programming skills
    to realistic biological data analysis problems
    with reasonable time investment.
  • In your future career, you will almost certainly
    encounter data processing or analysis problems
    that can be efficiently solved with Perl.

6
How is Perl used?
On a Unix system, write a text file named e.g. myprog.pl:

    #!/usr/bin/perl
    print "Hello!\n";

The first line indicates the location of the Perl interpreter on the local
machine. \n represents a new line (line feed) character.
Then make this file executable:

    chmod +x myprog.pl

and call it like any Unix command:

    myprog.pl

Alternatively, Perl commands can also be submitted directly from the Unix
command line:

    perl -e 'print "hello!\n"'
7
Basics of Perl
  • Lines starting with # are ignored, except the
    line starting with #!, which indicates the path
    to the Perl interpreter on the system.
  • Individual commands are separated by ;
  • Variable names start with $
  • Blocks of commands are enclosed by { }

Example of a Perl script which computes and
prints the square roots of the integers 1 to 10.

    #!/usr/bin/perl
    for ($i = 1; $i <= 10; $i = $i + 1) {
        $sqrt = $i ** 0.5;
        print "square-root of $i is $sqrt\n";
    }
8
Result
    square-root of 1 is 1
    square-root of 2 is 1.4142135623731
    square-root of 3 is 1.73205080756888
    square-root of 4 is 2
    square-root of 5 is 2.23606797749979
    square-root of 6 is 2.44948974278318
    square-root of 7 is 2.64575131106459
    square-root of 8 is 2.82842712474619
    square-root of 9 is 3
    square-root of 10 is 3.16227766016838
9
Example of a simple text file parsing script
    #!/usr/bin/perl
    $prt = 0;
    while (<STDIN>) {
        if (/^ID/) { $text = "$_"; $prt = 0; }
        if (/^OS   Homo sapiens/) { $prt = 1; }
        if (/^DE/) { $text = $text . "$_"; }
        if (/^\/\// and $prt) { print "$text"; }
    }

This script scans a Swiss-Prot sequence library file and prints, for each
human entry, the ID and DE lines. The Swiss-Prot library file is read from
the standard input. The script may be called as follows:

    text_parsing.pl < swissprot.dat

Parsing means identifying and extracting elements and objects from a
structured text file.
10
Example of a simple text file parsing script
  • while(expr){block} repeats block as long as expr
    returns true (1).
  • <STDIN> reads one line from standard input, stores
    its content in the predefined variable $_ and
    returns true if successful.
  • /^string/ returns true (1) if string is found at
    the beginning (^) of $_.
  • $text . "$_" concatenates two character strings.
  • \/\/ backslashes are needed to force slashes to be
    interpreted as components of a character string
    rather than syntactic elements of the Perl
    language.
  • $prt: this user-defined variable indicates whether
    the current entry is from Homo sapiens.
11
Example of a simple text file parsing script
What will be the output if the Swiss-Prot entry
LCK_HUMAN is given as input?
Example of a Swiss-Prot entry:

    ID   LCK_HUMAN               Reviewed;         508 AA.
    AC   P06239; P07100; Q12850; Q13152; Q5TDH8; Q5TDH9; Q96DW4; Q9NYT8;
    DT   01-JAN-1988, integrated into UniProtKB/Swiss-Prot.
    DT   01-FEB-1994, sequence version 5.
    DT   12-DEC-2006, entry version 103.
    DE   Proto-oncogene tyrosine-protein kinase LCK (EC 2.7.10.2) (p56-LCK)
    DE   (Lymphocyte cell-specific protein-tyrosine kinase) (LSK) (T cell-
    DE   specific protein-tyrosine kinase).
    GN   Name=LCK;
    OS   Homo sapiens (Human).
    ...
    //
12
Arrays
Example:

    #!/usr/bin/perl
    @numbers = ("one", "two", "three", "four");
    print scalar(@numbers) . "\n";
    print "$numbers[0]\n";

Output:

    4
    one

Notes: Array names start with @. References to array elements start with $.
Square brackets enclose the indices of array elements. The numbering of
array elements starts with 0, as in C. The function scalar returns the size
of the array.
13
Two-dimensional array (matrix)
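    #!/usr/local/bin/perl
    @table = ( ["aa", "ab", "ac"], ["ba", "bb", "bc"] );
    print scalar(@table) . "\n";
    print "$#table\n";
    print "$#{$table[0]}\n";
    print "$table[1][2]\n";

Output:

    2
    1
    2
    bc

Note: $#array returns the subscript of the last element of @array. Each row
of the table is itself an array, so $#{$table[0]} returns the subscript of
the last element of the first row.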
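Since $#array gives the last valid index, it is a natural loop bound for walking through a two-dimensional array. The following is a minimal sketch, not part of the original slides; the helper name row_strings is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join the elements of each row into one string.
# Each argument is a reference to one row of the table.
sub row_strings {
    my (@rows) = @_;
    return map { join(" ", @$_) } @rows;
}

my @table = ( ["aa", "ab", "ac"], ["ba", "bb", "bc"] );

# Visit every cell, using the last index of the table and of each row.
foreach my $row (0 .. $#table) {
    foreach my $col (0 .. $#{ $table[$row] }) {
        printf "row %d, column %d: %s\n", $row, $col, $table[$row][$col];
    }
}

# Print one line per row.
print "$_\n" foreach row_strings(@table);
```

The rows are stored as array references, which is why each row has to be dereferenced with @$_ or @{ ... } before it can be treated as an array.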
14
Alternative ways of generating a two-dimensional
array
As a list of anonymous arrays:

    #!/usr/bin/perl
    @table = ( ["aa", "ab", "ac"], ["ba", "bb", "bc"] );
    print scalar(@table) . "\n";
    print "$#table\n";
    print "$#{$table[0]}\n";
    print "$table[1][2]\n";

Row by row:

    #!/usr/bin/perl
    @{$table[0]} = ("aa", "ab", "ac");
    @{$table[1]} = ("ba", "bb", "bc");

Element by element:

    #!/usr/bin/perl
    $table[0][0] = "aa"; $table[0][1] = "ab"; $table[0][2] = "ac";
    $table[1][0] = "ba"; $table[1][1] = "bb"; $table[1][2] = "bc";
15
Basic input and output mechanisms
Files are accessed via so-called filehandles:

    open(FILE, "<filename");     read from existing file
    open(FILE, ">filename");     create file and write to it
    open(FILE, ">>filename");    append to existing file
    close(FILE);                 closes connection to FILE

Here, FILE is the name chosen for the filehandle.

Input/output commands:

    $line = <FILE>;        reads one line from FILE, returns true if successful
    while(<FILE>)          only within a while loop condition: reads one line
                           from FILE and (if successful) stores its contents in $_
    print FILE "Hello";    writes a character string to FILE
    $line = <>;            reads one line from standard input (or from the
                           files named on the command line)
    print "Hello";         writes to standard output
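The filehandle operations can be combined into a small write-then-read round trip. The following is a minimal sketch, not part of the original slides: the file name demo_io.txt and the helper names write_lines and read_lines are invented for illustration, and it uses the three-argument form of open with a lexical filehandle (my $fh), a common alternative to the bareword FILE form.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: write each given string as one line of a file.
sub write_lines {
    my ($filename, @lines) = @_;
    open(my $out, '>', $filename) or die "Cannot write $filename: $!";
    print $out "$_\n" foreach @lines;
    close($out);
}

# Hypothetical helper: read a file back, one array element per line.
sub read_lines {
    my ($filename) = @_;
    open(my $in, '<', $filename) or die "Cannot read $filename: $!";
    my @lines;
    while (<$in>) {      # each line is stored in $_
        chomp;           # remove the trailing newline
        push @lines, $_;
    }
    close($in);
    return @lines;
}

my $file = "demo_io.txt";    # assumed scratch file name
write_lines($file, "Hello", "World");
my @back = read_lines($file);
print scalar(@back) . " lines read\n";
unlink $file;                # delete the scratch file again
```

Checking the return value of open with "or die" is optional in Perl, but without it a missing file only shows up later as an empty read.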
16
Formatted Output
The function

    printf FILEHANDLE FORMAT, LIST
    printf FORMAT, LIST

uses format descriptors like the C programming language.

Example script:

    $e = exp(1.0);
    $text = "e =";
    printf "%s %i\n", $text, $e;
    printf "%.1s %5i\n", $text, $e;
    printf "%5s %f\n", $text, $e;
    printf "%-5s %5.3f\n", $text, $e;
    printf "e = %5.3f\n", $e;

Output:

    e = 2
    e     2
      e = 2.718282
    e =   2.718
    e = 2.718

Notes: formatting specifications have the general structure
%Length.PrecisionType. Length and precision are integers. They are both
optional. Type is a single character: s character string, i integer,
f floating point. A minus sign preceding the length means that the formatted
output should be left-adjusted.
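The effect of length, precision, and the minus sign is easiest to see when the formatted string is captured with sprintf, which takes the same format descriptors as printf but returns the result instead of printing it. This is a minimal sketch, not part of the original slides; the variable names and the square brackets used to make the padding visible are illustrative choices.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $e = exp(1.0);

my $as_int    = sprintf("%i", $e);         # integer part only
my $rounded   = sprintf("%5.3f", $e);      # three digits after the point
my $right     = sprintf("[%8.3f]", $e);    # padded to length 8, right-adjusted
my $left      = sprintf("[%-8.3f]", $e);   # padded to length 8, left-adjusted
my $truncated = sprintf("%.1s", "e =");    # precision limits a string's length

print "$as_int\n$rounded\n$right\n$left\n$truncated\n";
```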
17
Reading command line arguments
The command line arguments supplied with a Perl script are automatically
stored in an array named @ARGV.

Example script argv_demo.pl:

    $title = $ARGV[0];
    $n = $ARGV[1];
    print "$title\n\n";
    for ($i = 1; $i <= $n; $i = $i + 1) {
        printf "square-root of %2i is %6.4f\n", $i, $i ** 0.5;
    }

Command:

    ./argv_demo.pl "Table 1" 5

Output:

    Table 1

    square-root of  1 is 1.0000
    square-root of  2 is 1.4142
    square-root of  3 is 1.7321
    square-root of  4 is 2.0000
    square-root of  5 is 2.2361
18
Control structures
Loops:

    $i = 1;
    while ($i <= 3) { print "$i\n"; $i++; }

    for ($i = 1; $i <= 3; $i++) { print "$i\n"; }

    @numbers = (1, 2, 3);
    foreach $i (@numbers) { print "$i\n"; }

Note: the three loops perform the same action. $i++ is a shortcut for
$i = $i + 1.

If-then-else constructs:

    $i = $ARGV[0];
    if ($i == 1) { print "i=1\n"; }
    elsif ($i == 2) { print "i=2\n"; }
    else { print "i must be 1 or 2\n"; }

Note: == is the logical equality operator.
19
What is the following script mystery_1.pl doing?

    #!/usr/bin/perl
    $m = 0;
    foreach $x (@ARGV) { $m = $m + $x; }
    $m = $m / scalar(@ARGV);
    $v = 0;
    foreach $x (@ARGV) { $v = $v + ($x - $m) ** 2; }
    $s = ($v / scalar(@ARGV)) ** 0.5;
    print "$m, $s\n";

What will be the output if the script is called with the following command
line arguments?

    mystery_1.pl 1 3 1 3
20
Important Operators
    +            addition
    -            subtraction
    *            multiplication
    /            division
    **           exponentiation
    ==  (eq)     logical equal (== compares numbers, eq compares strings)
    !=  (ne)     not equal (!= compares numbers, ne compares strings)
    >, >=        greater than, greater than or equal to
    <, <=        less than, less than or equal to
    &&  (and)    logical and
    ||  (or)     logical or
    !   (not)    logical not
    .            string concatenation
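One detail worth trying out is the difference between the symbolic comparison operators, which compare numbers, and the word forms eq and ne, which compare character strings. The following is a minimal sketch, not part of the original slides; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "1.0" and "1" are the same number but different character strings.
my $numeric_equal = ("1.0" == 1)   ? "true" : "false";
my $string_equal  = ("1.0" eq "1") ? "true" : "false";

my $power  = 2 ** 10;                           # exponentiation
my $joined = "LCK" . "_" . "HUMAN";             # string concatenation
my $both   = (3 > 2 && 2 >= 2) ? "true" : "false";   # logical and

print "numeric: $numeric_equal, string: $string_equal\n";
print "$power $joined $both\n";
```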
21
String matching regular expressions
Elements of a regular expression:

    A           matches character A
    [ABC]       A, B, or C
    [^ABC]      any character except A, B, or C
    [A-Z0-9]    upper case character between A and Z, or digit
                between 0 and 9
    \           a backslash before a special character (e.g. \/)
                is needed to prevent that character from having
                syntactic meaning
    \S          any non-white-space character
    .           any character (except a new line)
    .+          any character one to many times
    .*          any character zero to many times
    \S{10}      any non-white-space character repeated 10 times
    \S{8,12}    any non-white-space character repeated 8-12 times
    \n          new line character
    \t          tab
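These elements can be tried out on short test strings. The following is a minimal sketch, not part of the original slides: the helper name matches is invented for illustration, and the anchors ^ and $ (beginning and end of the string) are added so that each pattern must describe the whole string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return 1 if $string matches $pattern, else 0.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

# Character classes: upper case letters or digits, an underscore, letters.
print matches("LCK_HUMAN", qr/^[A-Z0-9]+_[A-Z]+$/), "\n";

# The backslash makes the dot a literal dot instead of "any character".
print matches("508 AA.", qr/^[0-9]+ AA\.$/), "\n";

# Exactly three non-white-space characters: a blank does not qualify.
print matches("abc", qr/^\S{3}$/), "\n";
print matches("a c", qr/^\S{3}$/), "\n";

# Between 8 and 12 non-white-space characters.
print matches("LCK_HUMAN", qr/^\S{8,12}$/), "\n";
print matches("P06239", qr/^\S{8,12}$/), "\n";
```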
22
Pattern matching operators
    /regex/
        returns true if regex occurs in $_
    /(regex)/
        returns true if regex occurs in $_ and stores the target string
        in variable $1
    /(regex).*(regex)/
        returns true if regex occurs twice in $_ and stores the two
        target strings in variables $1 and $2, respectively
    s/regex/char_string/
        replaces the first occurrence of regex in $_ by char_string
    s/regex/char_string/g
        replaces all occurrences of regex in $_ by char_string
    $str =~ s/regex/char_string/
        replaces the first occurrence of regex in $str by char_string
    $str =~ tr/A-Z/a-z/
        replaces all upper case characters in $str by the corresponding
        lower case characters
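The operators can be exercised on a couple of short strings. The following is a minimal sketch, not part of the original slides: the test strings and variable names are illustrative, and the idiom (my $copy = $str) =~ s/.../.../ is used so that the original string stays unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Capture: the text matched inside the parentheses is stored in $1.
my $line = "GN   Name=LCK;";
my $gene = "";
if ($line =~ /Name=([^,;]+)/) {
    $gene = $1;
}

my $str = "ACGT ACGT";

# Copy $str, then substitute in the copy.
(my $first = $str) =~ s/ACGT/N/;      # first occurrence only
(my $all   = $str) =~ s/ACGT/N/g;     # all occurrences
(my $lower = $str) =~ tr/A-Z/a-z/;    # upper case to lower case

print "$gene\n$first\n$all\n$lower\n";
```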

23
Text parsing with regular expressions an example
Input:

    ID   LCK_HUMAN               Reviewed;         508 AA.
    AC   P06239; P07100; Q12850; Q13152; Q5TDH8; Q5TDH9; Q96DW4; Q9NYT8;
    DT   01-JAN-1988, integrated into UniProtKB/Swiss-Prot.
    DT   01-FEB-1994, sequence version 5.
    DT   12-DEC-2006, entry version 103.
    DE   Proto-oncogene tyrosine-protein kinase LCK (EC 2.7.10.2) (p56-LCK)
    DE   (Lymphocyte cell-specific protein-tyrosine kinase) (LSK) (T cell-
    DE   specific protein-tyrosine kinase).
    GN   Name=LCK;

Script:

    #!/usr/bin/perl
    while (<STDIN>) {
        if (/^ID   (\S+).* ([0-9]+) AA/) {
            print "ID $1\n";
            print "Length $2\n";
        }
        if (/^GN   Name=([^,;]+)/) { print "Gene $1\n"; }
    }

Output:

    ID LCK_HUMAN
    Length 508
    Gene LCK