Programming for Bioinformatics - PowerPoint PPT Presentation

Provided by: biotecTu

1
Programming for Bioinformatics
2
The module
  • will teach students basic programming skills
    relevant to bioinformatics, which will enable
    them to actively develop bioinformatics tools.
  • will take a problem-driven approach.
  • will present bioinformatics problems and show how
    to solve them using existing online tools and how
    to implement such tools.
  • will revisit some of the problems and databases
    discussed in applied bioinformatics.
  • will take a very practical, hands-on approach to
    basic computer science tools such as
    command-line operating systems, programming in
    Python, and relational databases.

3
Objectives
  • Students will have an understanding of different
    operating systems
  • Students will be able to automate simple
    repetitive information retrieval tasks
  • Students will be able to write simple programs in
    Python
  • Students will be able to work with relational
    databases
  • Students will appreciate the principles, limits,
    and possibilities of programming
  • Students will be able to formulate biological
    questions as information processing problems
  • Students will understand when and how programming
    can help to automate bioinformatics problems

4
Module Structure
  • Introduction
  • Databases
  • Introduction to SQL
  • A Little Exercise
  • A Little Science
  • Introduction to Python
  • Data types and loops
  • Sequences and lists
  • Patterns and functions
  • Dictionaries
  • Object Orientation
  • Advanced topics
  • More Python
  • Dynamic programming
  • Clustering
  • Revision Class

5
Books
  • You will need two books for the module: a
    reference book on MySQL and a book on Python

6
Books: Python
  • We will follow a number of online resources
    (see course web page).
  • In addition, we will draw on Python in a Nutshell by
    Alex Martelli (O'Reilly),
  • Wesley Chun's Core Python Programming, and
  • the Python Cookbook (O'Reilly).
  • The publisher O'Reilly has many general
    programming books on Linux, Python, etc.
  • They allow you to read all books online for 2 weeks
    for free, which is a good way to decide what
    to buy.
  • You can also buy electronic copies of the books.

7
Books: MySQL
  • There are many, many books on MySQL
  • The following two are just suggestions, as there
    are many other books covering the same material
  • MySQL Cookbook by Paul DuBois, O'Reilly, or
  • MySQL by Paul DuBois, Michael Widenius, O'Reilly

8
Structure of Labs
  • Databases
  • Lab 1, 2: Simple SQL
  • Lab 3, 4: SQL to answer interesting scientific
    questions
  • Python
  • Lab 5: Data types and loops, accessing a DB from
    Python
  • Lab 6: Sequences and lists
  • Lab 7: Patterns and functions
  • Lab 8: Dictionaries
  • Lab 9: BioPython
  • Lab 10: Python and PyMOL
  • More Python
  • Lab 11: Dynamic programming revisited
  • Lab 12: Clustering revisited
  • Lab 13: Revision

9
Assessment
  • Lab
  • Exercises
  • Each week during the lab you get exercises which
    you have to do during the lab and finish on your
    own during the week
  • These exercises need to be handed in at the next
    lecture
  • Results are discussed during the labs and as part
    of the assessment you will have to present a
    solution once
  • Doing the exercises is compulsory, but there are
    no marks
  • Project/Test
  • You will demonstrate your programming skills by
    implementing a comprehensive software project.
  • The test will include a thorough code review and
    you will be asked to present the features of your
    programme.
  • Exam
  • Pen and paper exam on material covered in lecture

10
Motivation: Databases
  • In the last term,
  • we accessed most information online via the web
  • we interacted directly and manually with
    databases and tools
  • we had to manually submit queries, interpret
    results, select interesting results, cut and paste
    them, and submit queries again
  • Pro
  • Reasonably easy to get hold of information
  • Con
  • Not possible to ask many queries
  • Queries limited by interface provided by web page
  • Difficult/impossible to integrate information
    from different sites
  • In this term, we will look at the databases
    underlying the online front ends
  • How is the data internally stored?
  • How can we, and more importantly computer programs,
    interact directly with the underlying data, so
    that we can ask more powerful queries, ask larger
    queries, and integrate different systems?

11
What actually happens
  • You are limited by what the web server allows you to
    ask
  • Example: CATH can be queried by
  • PDB ID,
  • CATH code, or
  • General text
  • But you cannot ask
  • In how many different PDB structures is there a
    P-loop domain?
  • Is there a PDB entry with a P-loop and a
    DNA-binding domain?
  • How many different superfamilies does the largest
    structure in PDB have?
  • With direct access to the underlying database you
    could answer all these questions (and many more)

12
Motivation: SCOP as Relational Database
  • We worked with SCOP, the Structural
    Classification of Proteins
  • Family: >30% sequence identity
  • Superfamily: similar structure and function
    (possibly lower than 30% sequence identity)

13
Motivation: Databases
  • We wish to answer the following questions
  • How many families and superfamilies are there?
  • Do all superfamilies roughly have the same number
    of families?
  • How many families does the immunoglobulin
    superfamily have?
  • Which superfamily has the most families and how
    many?
  • What percentage of superfamilies have only one
    family?
  • Which PDB structure has the largest number of
    distinct superfamilies?
  • What percentage of PDB structures have only one
    type of superfamily, and what percentage have at
    least two?
  • Which is the most popular superfamily?
  • Are all superfamilies equally likely to co-occur
    or do they have preferences?
  • Which superfamily has the most co-occurrence
    partners?
  • Is the number of co-occurrence partners and the
    frequency of the superfamily correlated?

14
What is a Database
  • SCOP contains relevant information, but we cannot
    answer the above questions through the
    web-interface of SCOP
  • The problem is that we do not have access to the
    underlying database
  • What is a database anyway?
  • A database provides
  • Logical organization of data
  • data models, schema design, dictionaries
  • Physical organization of data
  • Fast retrieval, indexing, compact storage of data

15
Relational Database
  • Central idea: data as relations in a table
  • E.g. Employee

+-------+------+--------+----------+
| id    | name | salary | role     |
+-------+------+--------+----------+
| 46457 | pete | 50.000 | director |
| 46458 | jane | 60.000 | nurse    |
| 46459 | asif | 70.000 | driver   |
+-------+------+--------+----------+

16
Relational Database
  • Central idea: data as relations in a table
  • E.g. SCOP, Structural Classification of Proteins

+-------+------+---------+---------+--------------------------------------+
| id    | type | sccs    | sid     | description                          |
+-------+------+---------+---------+--------------------------------------+
| 46457 | cf   | a.1     | -       | Globin-like                          |
| 46458 | sf   | a.1.1   | -       | Globin-like                          |
| 46459 | fa   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46460 | dm   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46461 | sp   | a.1.1.1 | -       | Ciliate (Paramecium caudatum)        |
| 14982 | px   | a.1.1.1 | d1dlwa_ | 1dlw A                               |
| 46462 | sp   | a.1.1.1 | -       | Green alga (Chlamydomonas eugametos) |
| 14983 | px   | a.1.1.1 | d1dlya_ | 1dly A                               |
| 63437 | sp   | a.1.1.1 | -       | Mycobacterium tuberculosis           |
| 62301 | px   | a.1.1.1 | d1idra_ | 1idr A                               |
+-------+------+---------+---------+--------------------------------------+

17
SCOP Tables
mysql> select * from cla limit 1;
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid     | pdb_id | sccs    | cl    | cf    | sf    | fa    | dm    | sp    | px    |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw   | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+

mysql> select * from des limit 1;
+-------+------+------+-----+--------------------+
| id    | type | sccs | sid | description        |
+-------+------+------+-----+--------------------+
| 46456 | cl   | a    | -   | All alpha proteins |
+-------+------+------+-----+--------------------+

mysql> select * from astral limit 1;
+---------+---------+------------------------------------------------------------+
| sid     | sccs    | seq                                                        |
+---------+---------+------------------------------------------------------------+
| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg... |
+---------+---------+------------------------------------------------------------+

19
Querying Relational Databases
  • SQL: Structured Query Language
  • select (which attributes?) from (which tables?)
    where (which conditions?)
  • Select ... from ... where
  • Distinct
  • Like
  • Union/intersect
  • Join
  • Count/average/sum/min/max
  • Group by
  • Having
  • Show tables
  • Show databases
  • Use
  • Create database
  • Create table as
  • Drop table
  • Load data
  • Insert into

20
Databases
  • Given SCOP as relational database, we can answer
    all the questions raised above using the SQL
    constructs of the previous slide!
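
As a minimal sketch of how such questions translate into queries run from a Python script, the example below uses Python 3 and the standard-library sqlite3 module as a stand-in for MySQL. The table is a cut-down, hypothetical version of cla (only the sid, pdb_id, sf and fa columns), and every row in it is invented for illustration; the function names are likewise assumptions, not part of the course material.

```python
import sqlite3

# Toy stand-in for the SCOP "cla" table. All rows are invented for illustration.
ROWS = [
    # sid,      pdb_id, sf,  fa
    ("d1aaaa_", "1aaa", 100, 1000),
    ("d1aaab_", "1aaa", 200, 2000),
    ("d1bbba_", "1bbb", 100, 1001),
    ("d1ccca_", "1ccc", 100, 1002),
    ("d1ddda_", "1ddd", 200, 2000),
    ("d1eeea_", "1eee", 300, 3000),
]


def build_db():
    """Create an in-memory database holding the toy cla table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cla (sid TEXT, pdb_id TEXT, sf INTEGER, fa INTEGER)"
    )
    conn.executemany("INSERT INTO cla VALUES (?, ?, ?, ?)", ROWS)
    return conn


def families_per_superfamily(conn):
    """How many families does each superfamily have? (count, distinct, group by)"""
    sql = """SELECT sf, COUNT(DISTINCT fa) AS n_fa
             FROM cla GROUP BY sf ORDER BY n_fa DESC, sf"""
    return conn.execute(sql).fetchall()


def single_family_superfamilies(conn):
    """Which superfamilies have only one family? (group by, having)"""
    sql = """SELECT sf FROM cla GROUP BY sf
             HAVING COUNT(DISTINCT fa) = 1 ORDER BY sf"""
    return [row[0] for row in conn.execute(sql)]


def pdb_with_most_superfamilies(conn):
    """Which PDB structure has the largest number of distinct superfamilies?"""
    sql = """SELECT pdb_id, COUNT(DISTINCT sf) AS n_sf
             FROM cla GROUP BY pdb_id ORDER BY n_sf DESC, pdb_id LIMIT 1"""
    return conn.execute(sql).fetchone()


conn = build_db()
print(families_per_superfamily(conn))     # [(100, 3), (200, 1), (300, 1)]
print(single_family_superfamilies(conn))  # [200, 300]
print(pdb_with_most_superfamilies(conn))  # ('1aaa', 2)
```

The first row of the first result also answers "which superfamily has the most families and how many?"; percentages follow by dividing the length of the second result by the number of superfamilies.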

21
Programming
  • We will use Python (created by Guido van Rossum and
    named after Monty Python) as a convenient extension
    to the operating system
  • Easy to write quick programs
  • More than just a scripting language
  • Interpreted, interactive, indented
  • Supports string processing well
  • Widely used in bioinformatics
  • Object oriented, general purpose
  • Many nice libraries for database access,
    graphics, web, GUI, R
  • Scientific orientation: Numerical Python (math),
    Scientific Python, Biopython
  • Beware: Python is slow compared with compiled
    languages, but computationally expensive parts can
    be included as C libraries

22
Motivation: Families and Identity
  • We said that SCOP families share >30% identity
  • What does that mean?
  • Do any two structures in a family share >30%?
  • Or does each member have at least one other
    member in the family with >30%?
  • What is the average sequence similarity within a
    family? Within a superfamily?
  • Given a sequence whose superfamily we already
    know, can we find the family within that
    superfamily that is best suited for the sequence?

23
Two approaches: Blast vs. DIY
  • We can answer the above easily
  • We use the SCOP database and run database queries
    from a Python script
  • For a given superfamily, select all corresponding
    sequences from the astral table
  • For all pairs of selected sequences
  • Call Blast and record the sequence identity
  • Or run your own dynamic programming algorithm and
    record the sequence identity
  • For the second problem: compare the sequence to all
    family sequences and assign it to the family
    which shares the highest similarity (must be >30%)
    with the sequence
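
The two loops described above can be sketched as follows, in Python 3. This is only an illustration under stated assumptions: percent_identity is a deliberately naive, ungapped position-by-position comparison that stands in for a Blast call or a proper dynamic programming alignment, and the function names, family names and sequences are all hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Naive stand-in for Blast: ungapped identity over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def pairwise_identities(seqs):
    """seqs maps sid -> sequence; returns {(sid1, sid2): identity in percent}."""
    return {
        (s, t): percent_identity(seqs[s], seqs[t])
        for s, t in combinations(sorted(seqs), 2)
    }


def assign_family(query, families, threshold=30.0):
    """families maps family name -> list of member sequences.

    Returns (family, best identity); family is None when no member
    exceeds the threshold.
    """
    best_family, best_score = None, 0.0
    for name in sorted(families):
        for member in families[name]:
            score = percent_identity(query, member)
            if score > best_score:
                best_family, best_score = name, score
    if best_score > threshold:
        return best_family, best_score
    return None, best_score


# Invented toy data, not taken from SCOP.
families = {
    "fam_a": ["acgtacgt", "acgtacga"],
    "fam_b": ["ttttgggg"],
}
print(pairwise_identities({"x": "acgt", "y": "gggt"}))
print(assign_family("acgtacgg", families))
print(assign_family("cccccccc", families))
```

In a real script the sequences would come from a query against the astral table, and percent_identity would be replaced by the identity reported by Blast or by your own alignment code.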

24
Motivation: Sequence vs. Structure
  • Can we verify the plot below?
  • Can we create a similar plot for specific
    superfamilies? E.g. DNA-binding domains?

25
Motivation: Sequence vs. Structure
  • Again, select the relevant sequences from the
    astral table. Besides computing the sequence
    identity, compute the structural similarity of the
    corresponding structures using an algorithm like
    Dali or CE
  • Then plot the two similarities against each other
    in a scatter plot

26
Motivation: Amino Acid Composition of Families
  • Can we characterise the amino acid composition of
    different families/superfamilies?
  • Again select the relevant sequences from astral
    and count the frequencies of amino acids
  • Is the amino acid composition at the interface of
    a domain different from the rest of the domain?
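
Counting amino acid frequencies is a short exercise once the sequences have been selected. The sketch below (Python 3) assumes the sequences are already available as plain strings, as in the seq column of the astral table; the function name and the two example sequences are made up.

```python
from collections import Counter


def aa_composition(sequences):
    """Return the relative frequency of each residue over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.lower())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in sorted(counts.items())}


# Invented example sequences.
composition = aa_composition(["slfe", "slla"])
for aa, freq in composition.items():
    print(aa, round(freq, 3))
```

Running the same function on the sequences of two families, or on interface residues versus the remaining residues of a domain, gives two dictionaries that can be compared residue by residue.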

27
Motivation: Let's rebuild SCOP families
  • Given a SCOP superfamily and its sequences, how
    can we divide it into families?
  • First, we need dynamic programming to determine
    the sequence similarity
  • Then we do the following
  • For all pairs of sequences, call the sequence
    similarity algorithm and record the result
    in a distance matrix
  • Next, run hierarchical clustering to cluster the
    sequences.
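
The recipe above can be sketched in a few dozen lines of Python 3. Several choices here are assumptions made for illustration and are not prescribed by the slides: the dynamic programming step computes a plain edit distance (unit costs) rather than a scored alignment, distances are normalised by the longer sequence length, the clustering is single linkage, and merging stops at an arbitrary cutoff. The sequences are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of edit distances normalised by the longer length."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            longest = max(len(seqs[i]), len(seqs[j]), 1)
            d = edit_distance(seqs[i], seqs[j]) / longest
            dist[i][j] = dist[j][i] = d
    return dist


def single_linkage(dist, cutoff):
    """Merge the two closest clusters until the closest pair exceeds cutoff."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        if best[0] > cutoff:
            break
        _, p, q = best
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return sorted(clusters)


# Invented toy sequences: two obvious groups.
seqs = ["acgtacgt", "acgtacga", "ttttgggg", "ttttgggc"]
print(single_linkage(distance_matrix(seqs), cutoff=0.3))
```

Each resulting cluster is a list of indices into seqs and plays the role of a candidate family; comparing the clusters with the fa column of cla shows how well the cutoff reproduces SCOP.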

28
What's needed
  • programming in Python

29
Python Programming Constructs
  • Variables, strings,
  • For/while Loops
  • If statements
  • File I/O
  • Regular expressions
  • Data structures: lists, hashes
  • Code structure: objects, classes, modules

30
Hello World in Python
  • Given a file helloworld.py
  • Open a shell and type helloworld.py at the
    command prompt
  • The shell then executes your programme
  • From the first line, the shell realises that the
    Python interpreter needs to be loaded and that
    what follows is a Python program
  • The line below it prints a message

File helloworld.py
  #!/usr/bin/env python
  print "Hello World"
31
Read a text file in Python
  • The command open opens a text file and creates
    a file object
  • "r" as second argument after the filename
    indicates that the file is read (this is the
    default, i.e. it can be left out)
  • "w" as second argument indicates that the file is
    written to
  • "a" as second argument indicates that the file is
    appended to
  • The for-loop reads all lines of the file one by
    one (requires Python 2.2 or later)
  • The body of the loop prints them on the screen
    (note that print adds a newline automatically;
    avoid that by adding a trailing , )

File fileIO.py
  data = open("seq.txt", "r")
  for line in data:
      print "Line:", line,

File seq.txt
  acgt
  gggt

Output
  Line: acgt
  Line: gggt
32
Variables in Python
  • The symbol = is used to assign values to
    variables
  • The symbol + is also used to concatenate strings

File fileIO.py
  lineNo = 1
  for line in open("seq.txt"):
      print lineNo, line,
      lineNo = lineNo + 1

File seq.txt
  acgt
  gggt

Output
  1 acgt
  2 gggt
33
If-then-else and strings in Python
File seqcomp.py
  data = open("seq.txt")
  line1 = data.readline().rstrip()
  line2 = data.readline().rstrip()
  len1 = len(line1)
  len2 = len(line2)
  # compare only up to the length of the shorter sequence
  if len1 < len2:
      minLen = len1
  else:
      minLen = len2
  line3 = ""
  for i in range(minLen):
      if line1[i] == line2[i]:
          # match marker; the original character was lost, "|" is assumed
          line3 = line3 + "|"
      else:
          line3 = line3 + " "
  print "Sequence comparison"
  print line1
  print line2
  print line3

File seq.txt
  acgt
  gggt

Output
  Sequence comparison
  acgt
  gggt
    ||
34
Programming Example