Welcome to lecture 2: Feeling at home in *nix

Learn more at: http://www.chem.ucla.edu

1
Welcome to lecture 2: Feeling at home in *nix
  • IGERT Sponsored Bioinformatics Workshop Series
  • Michael Janis and Max Kopelevich, Ph.D.
  • Dept. of Chemistry & Biochemistry, UCLA

2
Last time
  • We covered a bit of material
  • Try to keep up with the reading; it's all in
    there!
  • How's it coming along?
  • BioKnoppix
  • Remote logins, navigation
  • Unix / Linux concepts?
  • General questions?

3
The CLI and YOU
  • Most of bioinformatics is accomplished through
    command-line tools
  • Command line interaction is easily batched
  • Command line interaction is easily integrated
  • Command line interaction is a form of
    PROGRAMMING
  • It's therefore worthwhile to become familiar
    with your *nix environment in a non-graphical
    interface

4
Commands
  • In bioinformatics, we are mostly concerned with
    TEXT PROCESSING; the CLI is well suited for this
    type of work
  • Specific commands are used to perform functions
    in the shell
  • Each command is itself a program and takes
    command-line arguments
  • The syntax order is: program -options filename
  • For help on a specific command, type one of:
  • man command
  • apropos topic
  • command --help

5
Some review of system tools
  • who
  • w
  • uname
  • pwd
  • find
  • top

6
Another example of a pipe
[Diagram: file -> Command 1 (cut) -> pipe -> Command 2 (sort) -> STDOUT]
> cut -d: -f1 < /etc/passwd | sort
  • The file /etc/passwd stores information about
    user accounts on the system
  • Let's get a sorted listing of all user names
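The pipeline on this slide can be tried safely on a small stand-in file. This is a minimal sketch: the three user entries and the file name passwd.sample are made up for illustration, so the real /etc/passwd is never touched.

```shell
# Build a small passwd-style file (hypothetical users) in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
mary:x:1001:1001:Mary:/home/mary:/bin/bash
john:x:1000:1000:John:/home/john:/bin/bash
EOF

# cut keeps field 1 (the user name) of each colon-delimited line;
# the pipe hands cut's STDOUT to sort's STDIN.
sorted_users=$(cut -d: -f1 < "$workdir/passwd.sample" | sort)
echo "$sorted_users"

rm -r "$workdir"
```

Swapping the sample file for /etc/passwd gives the listing the slide describes.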

7
Example: redirecting STDOUT
> cut -d: -f1 < /etc/passwd | sort > output_file
> more output_file
[Diagram: the redirection operator > sends STDOUT to OUTPUT_FILE]
8
Process Control
  • Each specific job / command is called a process
  • Each process runs in a shell
  • BEFORE: prompt available
  • DURING: prompt NOT available
  • AFTER: prompt available
  • Control keys
  • CTRL-C -> stop current command
  • CTRL-D -> end of input

9
Two Ways to monitor Processes
  • top
  • Lists all jobs
  • Uses a table format
  • Dynamically changes
  • ps
  • man ps
  • static content
  • Command options

10
What are you doing, Dave?
11
Background / Foreground
  • Commands running in foreground prevent prompt
    from being used until command completes
  • Commands can also run in BACKGROUND
  • Backgrounded commands DO NOT AFFECT the prompt

12
Two Ways to Background jobs
  • Running a command with & automatically sends it
    to the background
  • Backgrounded commands return the prompt
  • bg
  • Once a command is run from the prompt
  • Stop the command (CTRL-Z)
  • Then background it with bg
  • Starts the command again
  • Returns the prompt for use
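The first route (a trailing &) can be sketched in a script. The sleep command stands in for any long-running job, and the variable names are only illustrative. The CTRL-Z plus bg route needs an interactive terminal, so it is not shown here.

```shell
# Start a slow command in the background; the shell continues immediately.
sleep 1 &
bg_pid=$!            # $! holds the PID of the most recent background job
echo "prompt is free while job $bg_pid runs"

# wait blocks until that job finishes and returns its exit status.
wait "$bg_pid"
bg_status=$?
echo "background job finished with status $bg_status"
```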

13
File System Navigation
  • Absolute filepaths begin with the root /
  • Relative filepaths don't have a preceding slash;
    they begin from the cwd
  • What is the absolute path to cd from john to
    mary?
  • What is the relative path to cd from john to
    mary?
  • Once you are in mary, and your username is john,
    what are two ways to return to your home
    directory?
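A sandbox makes the difference between the two kinds of path easy to see. This sketch assumes john and mary are sibling directories under a common home directory (the slide's diagram did not survive, so that layout is a guess), and it builds them under a temporary root instead of the real /home.

```shell
# Hypothetical layout: <root>/home/john and <root>/home/mary
root=$(mktemp -d)
mkdir -p "$root/home/john" "$root/home/mary"

# Absolute path: spelled out from the top, works from any cwd.
cd "$root/home/john"
cd "$root/home/mary"
abs_result=$(pwd)

# Relative path: no leading slash, resolved from the cwd (.. is the parent).
cd "$root/home/john"
cd ../mary
rel_result=$(pwd)

echo "$abs_result"
echo "$rel_result"

cd /
rm -r "$root"
```

Both commands land in the same directory; only the starting point of the path differs.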

14
The society for anti-defamation of computer
mouses opposes this slide
  • There's very little reason to leave the CLI
  • Most tasks can be written within the shell
  • The user-friendliness becomes self-limiting

15
Let's take an example
  • Suppose you wanted to do some biological analysis
    like motif searching through a database of
    biological sequences. What do you need to do
    this?
  • You need to retrieve the sequences
  • You need to describe the motif
  • You need to search the sequences

16
I want to search for zinc-finger motifs
genomically in yeast (S.c.)
  • I'm going to need the genomic sequence for
    Saccharomyces cerevisiae (http://www.yeastgenome.org)
  • I'm going to need the motif that describes the
    zinc finger I'd like to search for (ProSite).
  • I'm going to need to do this search many times
    across every chromosome.

17
A brief overview of some databases / biological
information repositories
  • NCBI
  • Genome-specific databases (SGD)
  • SMD: http://genome-www5.stanford.edu/
    The Stanford Microarray Database. Repository of
    microarray analysis from a wide variety.
  • PROSITE: http://au.expasy.org/prosite/
    Used to rapidly search your protein sequences for
    catalogued motifs.
  • SWISSPROT: http://www.ebi.ac.uk/swissprot/
    SWISSPROT is a "one stop shop" for protein sequence
    information. Use it to extend your knowledge of
    your proteins.
  • PDB (The Protein Data Bank): http://www.rcsb.org/pdb/
    The Protein Data Bank is the single worldwide
    archive of structural data of biological
    macromolecules. Structure implies function in
    general.
  • PFAM: http://www.sanger.ac.uk/Software/Pfam/search.shtml
    This database is a collection of protein motifs.
  • PRODOM: http://protein.toulouse.inra.fr/prodom/current/html/home.php
    PRODOM is similar to PFAM in that it is a set of
    curated protein domain families. However, the
    underlying computational engine is different.
  • BLOCKS: http://blocks.fhcrc.org/
    Blocks are multiply aligned ungapped segments
    corresponding to the most highly conserved regions
    of proteins. The blocks for the Blocks Database are
    made automatically by looking for the most highly
    conserved regions in groups of proteins
    documented in InterPro.
  • COG: http://www.ncbi.nlm.nih.gov/COG/
    COG stands for Clusters of Orthologous Groups of
    proteins. This is a tool for phylogenetic
    classification of proteins encoded in complete
    genomes. COGs were delineated by comparing protein
    sequences encoded in complete genomes, representing
    major phylogenetic lineages.

18
Retrieving data
19
Retrieving data
  • You don't have to leave the CLI. Really.
  • If you need to do something, chances are there's
    a utility to do so
  • Debian is your friend (search packages FIRST!!!)

Introducing wget:
> wget ftp://genome-ftp.stanford.edu/pub/yeast/data_download/protein_info/hypothetical_peptides/*.gz
Of course you can use ftp:
> ftp genome-ftp.stanford.edu
  - login: anonymous, use your email address as passwd
  - traverse the filesystem like any Linux CLI
  - bin, get, prompt, mget
20
A note about file archives
  • Most files will be compressed, usually with
    gzip (uncompress them with gunzip).
  • Most files will be bundled into a single
    archive using tar.

Introducing gunzip:
> gunzip *.gz
Introducing tar (tape archive):
> tar xvf *.tar
Or to create a tar:
> tar cvf output.tar .
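A round trip shows how the two tools fit together. This is a sketch with made-up file names (chr01.fsa, chr02.fsa, sequences.tar) in a scratch directory; the v flag is left out only to keep the output quiet.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two tiny stand-in sequence files (hypothetical contents).
printf '>seq1\nACGT\n' > chr01.fsa
printf '>seq2\nTTGA\n' > chr02.fsa

tar cf sequences.tar chr01.fsa chr02.fsa   # c = create, f = archive file name
gzip sequences.tar                         # produces sequences.tar.gz
rm chr01.fsa chr02.fsa

gunzip sequences.tar.gz                    # back to sequences.tar
tar xf sequences.tar                       # x = extract
restored=$(cat chr01.fsa chr02.fsa)
echo "$restored"

cd /
rm -r "$workdir"
```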
21
A brief note about the biological file format
called FASTA
  • In bioinformatics, FASTA format is a file format
    used to exchange information between genetic
    sequence databases. Its format looks like this:
  • >SEQUENCE_1 comment line 1 (optional)
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAA
    KKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAE
    LEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEE
  • It consists of a header line (beginning with a
    '>') which gives a name and/or a unique
    identifier for the sequence. Many different
    sequence databases use FASTA files.
  • After the header line and comments, one or more
    sequence lines may follow. Sequences may be
    protein sequences or DNA sequences.
  • Sequence lines should be shorter than 80
    characters and can contain gaps or alignment
    characters.
  • FASTA format files often have file extensions
    like .fa or .fsa
  • The simple format of FASTA files makes them easy
    to manipulate using text processing tools and
    scripting languages like Perl.
  • From http://en.wikipedia.org/wiki/Fasta_format
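Because every record starts with a '>' header line, ordinary text tools can already answer simple questions about a FASTA file. A minimal sketch, using a two-record file whose names and descriptions are invented for illustration:

```shell
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>SEQUENCE_1 first hypothetical peptide
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAA
KKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAE
>SEQUENCE_2 second hypothetical peptide
MNYILLLLLIKLLIIINMKLIKIL
EOF

# One header line per record, so counting '>' lines counts sequences.
seq_count=$(grep -c '^>' "$fasta")

# The identifier is the first word of the header, minus the '>'.
seq_names=$(grep '^>' "$fasta" | cut -d' ' -f1 | tr -d '>')

echo "$seq_count sequences:"
echo "$seq_names"
rm "$fasta"
```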

22
ProSite motif
23
Describing the motif - GREP
  • GREP searches the contents of a file or
    directory of files
  • "Get Regex": grep uses regular expressions
  • File wildcards can be used, like with ls
  • grep 1sq /DATA/*.CEL -> array type used
  • We explored this last time (briefly!)

24
Regular expressions
  • A regular expression, often called a pattern, is
    an expression that describes a set of strings.
    They are usually used to give a concise
    description of a set, without having to list all
    elements.
  • For example, the set containing the three strings
    Mike, Mark, and Matt can be described by the
    pattern "M(ike|ark|att)"
  • Alternatively, it is said that the pattern
    "M(ike|ark|att)" matches each of the three
    strings.
  • There are usually multiple different patterns
    describing any given set. Most formalisms provide
    the following operations to construct regular
    expressions.

25
Formalisms of regular expressions
  • alternation
  • A vertical bar separates alternatives. For
    example, "gray|grey" matches grey or gray.
  • grouping
  • Parentheses are used to define the scope and
    precedence of the operators. For example,
    "gray|grey" and "gr(a|e)y" are different
    patterns, but they both describe the set
    containing gray and grey.
  • quantification
  • A quantifier after a character or group specifies
    how often that preceding expression is allowed to
    occur. The most common quantifiers are ?, *, and +
  • ?
  • The question mark indicates that the preceding
    character may be present at most once. For
    example, "colou?r" matches color and colour.
  • *
  • The asterisk indicates that the preceding
    character may be present zero, one, or more
    times. For example, "0*42" matches 42, 042, 0042,
    etc.
  • +
  • The plus sign indicates that the preceding
    character must be present at least once. For
    example, "go+gle" matches the infinite set gogle,
    google, gooogle, etc. (but not ggle).
  • These constructions can be combined to form
    arbitrarily complex expressions, very much like
    one can construct arithmetical expressions from
    the numbers and the operations +, -, *, and /.
  • From http://en.wikipedia.org/wiki/Regular_expression
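Each of the examples above can be checked with grep -E. In this sketch the helper name matches is made up; it uses -x so that a pattern has to describe the whole string, which is the sense of "matches" used on this slide.

```shell
# matches PATTERN STRING -> prints yes if the whole STRING fits PATTERN.
matches() {
    if printf '%s\n' "$2" | grep -E -x -q "$1"; then echo yes; else echo no; fi
}

r1=$(matches 'gray|grey' grey)    # alternation
r2=$(matches 'gr(a|e)y' gray)     # grouping
r3=$(matches 'colou?r' color)     # ? : at most once
r4=$(matches '0*42' 0042)         # * : zero or more
r5=$(matches 'go+gle' ggle)       # + : at least once, so ggle fails

echo "$r1 $r2 $r3 $r4 $r5"
```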

26
The real world is fuzzy and complex
  • What if we just want to search for a string in
    the format of a phone number?
  • E.g. 825 8901 (no area code)
  • 213 487 0353 (area code)
  • Obviously we can't check for each possible phone
    number (some 10^10 possibilities makes for a very
    long set of statements).
27
This is where regular expressions come in
  • Regular expressions describe generalised patterns
    of strings instead of exact strings.
  • (clearly this is a little more complex as an
    example)

> grep -E '([0-9]{3} ){0,1}[0-9]{3} [0-9]{4}' filename
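The phone-number pattern can be tried on a few sample lines. This sketch adds ^ and $ anchors, which are not on the slide, so that only lines consisting of exactly one phone number pass; the sample strings are invented.

```shell
# Optional 3-digit area code, then 3 digits, a space, and 4 digits.
phone_re='^([0-9]{3} )?[0-9]{3} [0-9]{4}$'

phones=$(printf '%s\n' '825 8901' '213 487 0353' '82 58901' 'call me' \
    | grep -E "$phone_re")
echo "$phones"
```

The ? after the group means the same as the {0,1} quantifier in the slide's version.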
28
Special characters(metacharacters)
  • . is a wildcard and matches any single character

> grep .ed filename
If file contains bed - will find
If file contains red - will find
If file contains head - will not find
If file contains edward at the start of a line - will not find (. needs a character before ed)
29
Special characters(metacharacters)
  • * means zero or more of the previous
    character.

> grep 'be*d' filename
If file contains bed - will find
If file contains red - will not find
If file contains beeeed - will find
If file contains bd - will find
30
Special characters(metacharacters)
  • + means one or more of the previous character.

> grep -E 'be+d' filename
If file contains bed - will find
If file contains red - will not find
If file contains beeeed - will find
If file contains bd - will not find
31
Start and end of line
  • ^ designates the start of the line, $ the
    end.

> grep bed filename
If file contains bed - will find
If file contains bedbed - will find
If file contains xxxbedxxx - will find
> grep '^bed$' filename
If file contains bed on a line by itself - will find
If file contains bedbed - will not find
If file contains xxxbedxxx - will not find
32
Grouping with parentheses
  • Parentheses group characters

> grep -E '^(bed)+$' filename
If file contains bed - will find
If file contains bedbed - will find
If file contains beddd - will not find
33
Character classes
  • The square brackets are used to denote whole
    groups of characters

> grep '[brf]ed' filename
If file contains bed - will find
If file contains red - will find
If file contains led - will not find
34
Character classes (cont)
  • A hyphen designates a range

> grep '[a-z]ed' filename
If file contains bed - will find
If file contains fed - will find
If file contains Bed - will NOT find (why not?)
35
Character class shortcuts
  • Some character classes are so common there are
    in-built shortcuts
  • [0-9] = \d
  • [A-Za-z0-9_] = \w
  • [ \f\t\n\r] = \s

36
Quantifying
  • Curly brackets quantify repeats more precisely
    than * (0 or more) or + (1 or more)
  • a{3,5}: three, four or five a's.

> grep -E 'la{3,5}d' filename
If file contains laaaad - will find
If file contains laaaaaaad - will not find
37
Referencing
  • A back-reference matches the substring previously
    matched by the nth parenthesized subexpression of
    the regular expression.
  • The back-reference is denoted '\n', where n is a
    single digit

> grep -E '(a)\1' filename
If file contains laaaad - will find
If file contains lad - will not find
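The metacharacter slides can be replayed in one go. This is a sketch: the helper name hits is made up, each word is fed to grep on its own line, and LC_ALL=C is set as an assumption so that [a-z] means exactly the lowercase ASCII letters on every system.

```shell
# hits PATTERN WORD... -> prints the words that grep -E finds, space separated.
hits() {
    pattern=$1
    shift
    printf '%s\n' "$@" | LC_ALL=C grep -E "$pattern" | paste -sd ' ' -
}

any_char=$(hits '.ed'      bed red head edward)    # . = any one character
star=$(hits     'be*d'     bed red beeeed bd)      # * = zero or more e
plus=$(hits     'be+d'     bed red beeeed bd)      # + = one or more e
anchored=$(hits '^bed$'    bed bedbed xxxbedxxx)   # whole line must be bed
grouped=$(hits  '^(bed)+$' bed bedbed beddd)       # repeats of the group
class=$(hits    '[brf]ed'  bed red led)            # b, r or f
range=$(hits    '[a-z]ed'  bed fed Bed)            # lowercase only
counted=$(hits  'la{3,5}d' laaaad laaaaaaad)       # 3 to 5 a's

printf '%s\n' "$any_char" "$star" "$plus" "$anchored" \
    "$grouped" "$class" "$range" "$counted"
```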
38
Back to our ProSite motif
  • We can use regular expressions to describe the
    motif
  • The motif is actually a REGULAR EXPRESSION!

> grep -n -E --color -B2 'C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C' *.fsa
chr04.peptides.20040928.fsa-4202->Annotated0413560551357359 frame 1 YDR448W/ADA2 Verified this gene contains 1 exon
chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL
39
Did it work?
40
Let's try this
  • Download the genomic DNA sequence from SGD
  • Search for any variant of the TATA box promoter
  • TATAAA
  • TATAAT
  • TATATT
  • TAATAA
  • TAATAT
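One possible approach to this exercise is a single alternation over the five variants. The DNA string below is invented for illustration; on real data the file from SGD would take its place. The -o option (available in GNU and BSD grep) prints only the matched text.

```shell
# Hypothetical stretch of DNA containing three of the five variants.
dna='GGCTATAAAGGCCTAATATCCGTATATTAA'

tata_re='TATAAA|TATAAT|TATATT|TAATAA|TAATAT'
tata_hits=$(printf '%s\n' "$dna" | grep -E -o "$tata_re" | paste -sd ' ' -)
echo "$tata_hits"
```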

41
More more more
  • Many MS tools allow for wildcard searching
  • The shell allows variables, interpolation, and
    control structures
  • For example, attempt to find a palindrome of
    length 4 within genomic sequences (hint: use
    back-references!)
  • Variables allow for persistence and control
    structures

> myVar=$(grep -n -E --color 'C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C' *.fsa)
mako@subi$ echo $myVar
chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL
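A sketch of the palindrome hint, reading "palindrome" as a string that is the same forwards and backwards (not a reverse-complement site). Two groups capture the first two characters, and the back-references require them again in reverse order. It uses basic grep syntax, where back-references are standard; the sample string is invented.

```shell
dna='ACGTTGCA'

# \(.\)\(.\) captures characters 1 and 2; \2\1 demands them mirrored.
palindromes=$(printf '%s\n' "$dna" | grep -o '\(.\)\(.\)\2\1')
echo "$palindromes"
```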
42
A better variable interpolation
  • The variable is allowed to change
  • We can set the variable to the Prosite Pattern

mako@subi$ myVar='C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C'
mako@subi$ echo "$myVar"
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
mako@subi$ grep -n -E --color "$myVar" *.fsa
chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL
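The same idea can be run end to end on a small file. This sketch assumes the bracket and brace placement shown in the pattern below (the punctuation was lost from the slide text, so the grouping is a reconstruction), and the two-record FASTA file is a stand-in built from the peptide lines quoted in this lecture.

```shell
# Pattern held in a variable; single quotes keep the shell from touching it.
myVar='C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C'

fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>YDR448W/ADA2
MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL
>hypothetical_control
MNYILLLLLIKLLIIINMKLIKIL
EOF

# Double quotes let the shell substitute the variable but nothing else.
hit_lines=$(grep -n -E "$myVar" "$fasta" | cut -d: -f1)
motif=$(grep -o -E "$myVar" "$fasta")

echo "matching line number: $hit_lines"
echo "matched motif: $motif"
rm "$fasta"
```

Keeping the pattern in a variable means the grep command stays the same when the pattern changes.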
43
Variables can be overwritten
  • The variable is allowed to change
  • We can set the variable to the Prosite Pattern

mako@subi$ function afun {
> for i in 1 2 3 4 5
> do
> echo $i
> echo $myVar
> done
> }
mako@subi$ afun
1
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
2
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
3
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
4
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
5
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
44
Functions
  • What if we wanted to search every ProSite pattern
    against our genomic database?
  • We'd have to repeatedly do our search
  • This is called a loop
  • We have to write this so the computer knows
    exactly what to repeat, how many times to repeat,
    and where to find the next ProSite pattern to
    match
  • We would store the what and where in VARIABLES
  • We would utilize a CONTROL STRUCTURE to handle
    the how
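A minimal sketch of that idea: the "what" (a pattern) and the "where" (a file) live in variables, and a for loop handles the "how". The three patterns and the two short sequences are illustrative only and are not actual ProSite entries.

```shell
fasta=$(mktemp)
printf '%s\n' '>seq_a' 'MKCAACHHHH' '>seq_b' 'MKLLNGSKKK' > "$fasta"

# For each pattern, count the lines of the file that contain a match.
report=$(
    for pattern in 'C.{2}C' 'N[^P][ST]' 'WWW'
    do
        count=$(grep -c -E "$pattern" "$fasta")
        echo "$pattern $count"
    done
)
echo "$report"
rm "$fasta"
```

In practice the pattern list could be read from a file, one pattern per line.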

45
Control structures
  • All our programs so far have run from start to
    finish. Each line has been executed in turn.
  • What if we only want to run some lines some of
    the time?
  • This is where control structures come in.

46
Control structures
  • Programming languages generally have a number of
    control structures.
  • Basic structures
  • if
  • while
  • for / foreach
  • There are others (e.g. unless)

47
for example
> afunction() { for i in 1 2 3 4 5; do echo "Looping ... number $i"; done; }
48
Variables can be interpolated
  • The command is substituted from the system
  • It's like a pipe, but we are allowed to operate
    on the result

mako@subi$ afun() {
> myvar=$(ls -1 *.fsa)
> for i in $myvar
> do
> echo $i
> done
> }
mako@subi$ afun
chr01.fsa
chr01.peptides.20040928.fsa
chr02.peptides.20040928.fsa
chr03.peptides.20040928.fsa
chr04.peptides.20040928.fsa
chr05.peptides.20040928.fsa
chr06.peptides.20040928.fsa
chr07.peptides.20040928.fsa
chr08.peptides.20040928.fsa
chr09.peptides.20040928.fsa
chr10.peptides.20040928.fsa
chr11.peptides.20040928.fsa
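Command substitution can be tried in a scratch directory. In this sketch the file names are made up, and the loop counts the files as it prints them to show that the captured output can be worked on like any other variable.

```shell
workdir=$(mktemp -d)
touch "$workdir/chr01.fsa" "$workdir/chr02.fsa" "$workdir/notes.txt"
cd "$workdir"

# $( ... ) runs the command and substitutes its output into the assignment.
myvar=$(ls -1 *.fsa)

n=0
for i in $myvar          # unquoted on purpose: split into one word per file
do
    echo "found $i"
    n=$((n + 1))
done
echo "$n sequence files"

cd /
rm -r "$workdir"
```

Splitting on whitespace like this breaks on file names that contain spaces; looping over *.fsa directly avoids that.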
49
The while control structure (combined with
opening files)
  • The while control structure keeps looping while
    a given condition is satisfied
  • while and open files go together very well

mako@subi$ afun() {
> while read f
> do
> echo $f
> done
> }
mako@subi$ afun < chrmt.peptides.20040928.fsa
>Notannotatedmt385459 frame 1
MNYILLLLLIKLLIIINMKLIKIL
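The loop body can do more than echo. As a sketch, this version reads a small FASTA file line by line, counting header lines and adding up residues; the file contents and variable names are invented for illustration.

```shell
fasta=$(mktemp)
printf '%s\n' '>pep1' 'MNYILLLLLIKLLIIINMKLIKIL' '>pep2' 'MSNKFH' > "$fasta"

headers=0
residues=0
# read returns false at end of file, which ends the while loop.
while read -r line
do
    case "$line" in
        '>'*) headers=$((headers + 1)) ;;             # header line
        *)    residues=$((residues + ${#line})) ;;    # sequence line
    esac
done < "$fasta"

echo "$headers sequences, $residues residues"
rm "$fasta"
```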
50
Editors
  • Shell programming is like a batch file
  • Commands are linked together in a procedure
  • The procedure is accessed via a file
  • We need an editor that will allow us to construct
    that file
  • We'll use Emacs (or you can use vi, pico, ...)
  • Comprehensive, extensible working environment
  • Complete (arguable!) IDE
  • Integration
  • Extensible (elisp)

51
Emacs
  • Invoking Emacs is easy: emacs -nw filename
  • In many cases, Emacs will work out the mode
    appropriate for your file (.cpp, .pl, etc)
  • The mode allows Emacs to become sensitive to the
    task
  • There is a biomode for reverse complement, etc.
  • You can write your own!
  • Emacs has many tools
  • Search, replace, cut, paste, mail
  • File navigation, ftp, remote shells

52
The Emacs survival guide
  • Notation
  • Emacs uses the control key and escape key
    heavily. We write it like this
  • C-x: pronounced "Control-x"
  • Hold down the Ctrl key (usually in the lower left
    corner of the keyboard) while pressing the x key.
  • Both Ctrl and x must be down at the same time.
  • M-x: pronounced "Meta-x"
  • Press the Esc key (usually in the upper left
    corner of the keyboard), release it, then press
    the x key.
  • Esc and x should not be down at the same time.
  • So C-x C-f means hold down the control key, then
    type x and then f while holding it down. (This is
    the command to load a file into emacs.)
  • Typing
  • Just type. All the regular keys, arrow keys,
    delete, backspace, and page up/down keys should
    work. Alternatively, you can try these commands:
    C-f cursor forward, C-b cursor back, C-p previous
    line, C-n next line, M-v page up, C-v page down.
  • Exiting
  • Type C-x C-c. If you have any unsaved work, emacs
    will ask you if you want to save it. Type y.
  • Other commands
  • Most control or escape sequences are commands.
    Usually a prompt appears in the command line at
    the bottom of the window. Here are a few:
  • C-x C-f: load file, prompt for filename
  • C-x C-s: save file without exiting
  • C-x C-c: exit, prompt to save files
  • C-s: search forward, prompt for search string
  • C-r: search backward, prompt for search string
  • C-h ?: show help options, prompt for choice
  • C-h t: start emacs tutorial
  • If you make a mistake or change your mind you
    can always escape
  • C-g: abandon command and resume typing

53
Command line editing
  • Learning the keybindings can be difficult
  • But it will increase your speed
  • Faster than using a mouse
  • Transferable! The Emacs keybindings for
    editing are the default set of commands for
    line editing in the Bash shell!

54
Let's try it
  • Open up the file that we found contained the
    ProSite Motif
  • Open a second window
  • Go to the line that contains the motif (hint: use
    grep with -n!)
  • Copy and paste that line into a new file
  • Save and close that file

55
AWK is your pre-perl friend
  • Used to print a subset of fields
  • Default field delimiter is white space
  • Useful for grabbing a subset of fields
  • Useful for rearranging fields
  • field1 field2 field3 field4 . . .
  • $1 $2 $3 $4 . . .

56
Using AWK
pipe
  • awk -F<sep> '{print $1}' (<sep> is the field
    separator)
  • awk -F<sep> '{print $1 $2}'
  • awk -F<sep> '{print $1 "\t" $2}'
  • \t TAB
  • \n newline
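A few one-liners show the field variables in action. The input lines are invented; the colon separator in the last command is just one possible choice for -F, picked because /etc/passwd-style lines use it.

```shell
line='field1 field2 field3 field4'

# Default delimiter is white space; $1 is the first field.
first=$(echo "$line" | awk '{print $1}')

# Rearranging: print field 2, a TAB, then field 1.
swapped=$(echo "$line" | awk '{print $2 "\t" $1}')

# -F sets the field separator (a colon here).
user=$(echo 'john:x:1000:1000' | awk -F: '{print $1}')

echo "$first"
echo "$swapped"
echo "$user"
```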

57
Overwrite versus Append
  • > OVERWRITE: delete and replace
  • >> APPEND: add to end of existing file
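The difference is easy to see on a scratch file. In this sketch the file comes from mktemp and the three words are arbitrary.

```shell
out=$(mktemp)

echo first   > "$out"    # > creates or truncates, then writes
echo second  > "$out"    # > again: "first" is gone
echo third  >> "$out"    # >> keeps what is there and adds to the end

contents=$(cat "$out")
echo "$contents"
rm "$out"
```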

58
Example microarray data tracking
  • grep 1sq /DATA/*.CEL (gives array info)
  • grep 1sq /DATA/*.CEL | awk '{print $12}' (gives
    array type only)
  • grep 1sq /DATA/*.CEL | awk '{print $12}' >
    arrayTypes.txt (store results in file)
  • ls /DATA/*.DAT | wc (gives a count)