Some research at the Bioinformatics Research Centre - PowerPoint PPT Presentation

Provided by: davidg77
Transcript and Presenter's Notes


1
Some research at the Bioinformatics Research Centre
  • brc_at_brc.dcs.gla.ac.uk
  • www.brc.dcs.gla.ac.uk
  • Department of Computing Science
  • 416 Davidson Building (Biochemistry and Molecular
    Biology, Institute of Biomedical and Life Sciences)
  • University of Glasgow

2
Bioinformatics Research Centre
  • Provides an environment for collaborative
    interdisciplinary research in Bioinformatics.
  • Hosts researchers from
  • Department of Computing Science (5 RAE)
  • Institute of Biomedical and Life Sciences.
    (5/5 RAE)
  • Physically located in the Institute of Biomedical
    and Life Sciences (Davidson Building -
    Biochemistry and Molecular Biology)
  • Strong links with
  • Sir Henry Wellcome Functional Genomics Facility.
  • Statistical Bioinformatics
  • Mathematical Biology
  • Protein Crystallography
  • Outreach programme (visitors etc)

3
What is Bioinformatics?
  • Bio - molecular biology
  • Informatics - computer science
  • Development and application of computing,
    mathematical and statistical methods to analyse
    biological, biochemical and biophysical data, and
    to guide biological assays
  • Called Computational Biology in the USA
Aims of research in Bioinformatics
  • Understand the functioning of living things - to
    improve the quality of life
  • drug design
  • identification of genetic risk factors
  • gene therapy
  • genetic modification of food crops and animals,
    etc.

4
Bioinformatics in context
5
Bioinformatics in context (applications)
6
Computational techniques we use
  • Knowledge discovery and machine learning
  • Algorithm design
  • Computations over graphs, constraint
    computation
  • Reasoning under uncertainty
  • Databases and database indexing for new data
    types
  • Software and data integration
  • Distributed computations and data use over the
    GRID
  • Innovative visualisation and query mechanisms
    for large data sets

7
Applications
  • Biological areas
  • sequence and structure analysis,
  • biochemical networks,
  • experimental design.
  • Specifically
  • disease gene finding,
  • protein function prediction,
  • protein structure modelling,
  • microarray analysis.

8
BRC Members
  • Investigators
  • David Gilbert (Director, Machine learning,
    Biochemical networks, protein structure) c
  • David Leader (Visualisation tools) b
  • Rod Page (Phylogenetic trees) b
  • Pawel Herzyk (Protein structure) b
  • Ela Hunt (Database indexing) c
  • Gerhard May (Signalling pathways) b
  • Olivier Sand (Transcriptional regulatory
    regions) c
  • Richard Sinnott (Grid computing / eScience) c
  • Juris Viksna (Graph algorithms) c
  • Research Assistants: Brian Ross, Neil Hanlon,
    Nigel Harding, Gilleain Torrance
  • Research students: Ali Al-Shahib, Iain Darroch,
    Susan Fairley, Eilidh Grant, Andrew Jones, Aik
    Choon Tan, Tim Troup, Mallika Veeramalai
  • Associated: Malcolm Atkinson, Ernst Wit, John
    McClure

9
Some funded projects
  • Malcolm Atkinson (National e-Science Centre),
    David Gilbert, Ela Hunt, Anna Dominiczak and
    David White (IBM UK Life Sciences), "BRIDGES:
    BioMedical Research Informatics Delivered by Grid
    Enabled Services", EPSRC / E-Science Core
    Programme
  • Mathis Riehle, David Gilbert, Chris Wilkinson and
    Stephen Yarwood, "Engineering Cell Form and
    Function with Nanometric Surfaces", a
    collaboration between the Centre for Cell
    Engineering and the Bioinformatics Research
    Centre. Funded by the Royal Society Wolfson
    Foundation Laboratory Refurbishment Scheme,
    February 2003 for 1 year
  • Yves Deville and David Gilbert, "Application of
    constraint satisfaction for the analysis of
    biochemical networks", September 2002 for 1 year,
    funded by the EPSRC
  • "A Software Tool for the Simulation and Analysis
    of Biochemical Networks", DTI, David Gilbert,
    Muffy Calder, Walter Kolch
  • "Cardiovascular Functional Genomics", Wellcome
    Trust, Anna Dominiczak, Malcolm Atkinson, David
    Gilbert, and Ela Hunt

10
Some funded projects (2)
  • MRC Special Research Training Fellowship in
    Bioinformatics, Ela Hunt, MRC
  • "SAILS: Self-Adjusting Indexes for Large
    Sequences", EPSRC, Malcolm Atkinson (Nigel Harding
    RA)
  • "Structural patterns in the composite regulatory
    regions of genomes", Olivier Sand, European Union
  • "Patterns, functions and structures on a protein
    topology database", BBSRC/EPSRC Bioinformatics
    Initiative, David Gilbert (Gilleain Torrance
    RA), joint project with David Westhead, Leeds
    University (Department of Biochemistry and
    Molecular Biology).
  • "Automatic extraction of rules for protein
    classification", Juris Viksna, Wellcome Trust.

11
Where we are
Functional Genomics (Joseph Black)
12
BRC location
13
BRC Phase 1
14
Davidson level 4
BRC Phase 1
Gardiner lab
BRC Phase 2
15
Bioinformatics Research Centre Davidson Building
20 workstations, visitors' facilities
Webserver
Fileserver
Unix Appserver
Microsoft App server
Cluster (Scotgrid)
17 Lilybank Gardens
Boyd-Orr Building (backup)
16
DNA to proteins
compute
compute
?how?
17
Database Growth
PDB protein structures
18
How can we analyse the flood of data?
  • Data: don't just store it, analyze it! By
    comparing sequences, one can find out about
    things like
  • how organisms are related (evolution)
  • protein structures
  • how proteins function
  • population variability
  • diseases

19
Computational bottlenecks
  • Caused by
  • Data characteristics
  • Lots of it
  • heterogeneous
  • distributed
  • incomplete
  • dirty
  • (Traditional) complexity issues: time, space
  • Induction: constructing discriminatory/descriptive
    functions from large data sets

20
Computational bottlenecks
  • Data representation
  • sequences (DNA, RNA, amino-acid)
  • trees (phylogenetic, ...)
  • graphs (protein structure, biochemical networks)
  • matrices (micro-arrays, metabolic pathways)

21
What kind of computational approaches do we use?
  • Operations over
  • sequences (match)
  • trees (e.g. suffix trees, supertree, joining,
    ...)
  • graphs (sub-graph isomorphism, maximal common
    subgraph, path searching)
  • Data modelling, databases, data conversion
  • Machine learning, knowledge discovery, pattern
    discovery,...
  • Clustering
  • Theorem proving, concurrency analysis
  • Integration: data, knowledge
  • Data visualisation
  • Web services, Grid, Coarse Grain parallelism,
    eScience,...

22
Data, information, knowledge
  • data: nucleotide sequence
  • information: where the genes are

Found using a classifier, pattern or rule which has
been mined/discovered
  • knowledge: facts and rules
  • If a gene X has a weak psi-blast assignment to a
    function F
  • and that gene is in an expression cluster
  • and sufficient members of that cluster are known
    to have function F,
  • then believe the assignment of F to X.
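The rule above can be sketched as a small predicate. This is only an illustration: the `Gene` record, the score scale and the threshold values `weak` and `sufficient` are hypothetical, since the slide leaves "weak" and "sufficient" open.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    # Hypothetical record: names and score scale are assumptions.
    name: str
    cluster: str                                   # expression cluster id
    psi_blast: dict = field(default_factory=dict)  # function -> score in [0, 1]
    known_functions: frozenset = frozenset()       # experimentally known functions


def believe_assignment(gene, function, genes, weak=0.3, sufficient=0.5):
    """Return True if the rule accepts the assignment of `function` to `gene`."""
    # Condition 1: at least a weak psi-blast assignment of F to X.
    if gene.psi_blast.get(function, 0.0) < weak:
        return False
    # Condition 2: X sits in an expression cluster with other genes.
    neighbours = [g for g in genes if g.cluster == gene.cluster and g is not gene]
    if not neighbours:
        return False
    # Condition 3: a sufficient fraction of the cluster is known to have F.
    support = sum(function in g.known_functions for g in neighbours) / len(neighbours)
    return support >= sufficient


genes = [
    Gene("X", "c1", {"kinase": 0.4}),
    Gene("A", "c1", known_functions=frozenset({"kinase"})),
    Gene("B", "c1", known_functions=frozenset({"kinase"})),
    Gene("C", "c1"),
]
print(believe_assignment(genes[0], "kinase", genes))
```

Raising `sufficient` makes the rule more conservative; with `sufficient=0.9` the same gene is no longer annotated.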

23
An abstract view
  • Given: p9, p1, q8, p3, q2, q6, p5, q4,
    p7, q0
  • Cluster: {p9, p1, p3, p5, p7}, {q8, q2,
    q6, q4, q0}
  • Background knowledge: >, -
  • Induce: 0 is q; X is q if X-2 is q and X > 0
  • X is p if not(X is q)
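As a check on the induced rules, the following sketch encodes them directly (reading "gt" as "greater than") and confirms that they reproduce the labels of all ten given items. The function names are illustrative only.

```python
def is_q(x):
    """Induced rule: 0 is q; X is q if X-2 is q and X > 0."""
    if x == 0:
        return True
    return x > 0 and is_q(x - 2)


def label(x):
    """X is p if not (X is q)."""
    return "q" if is_q(x) else "p"


given = ["p9", "p1", "q8", "p3", "q2", "q6", "p5", "q4", "p7", "q0"]
# Every given item should be labelled consistently by the induced rules.
consistent = all(label(int(item[1:])) == item[0] for item in given)
print(consistent)
```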

24
To Learn
to acquire knowledge of (a subject) or skill
in (an art, etc.) as a result of study,
experience, or teaching (OED)
What is Machine Learning?
a computer program that can learn from
experience with respect to some class of tasks
and performance measure (Mitchell, 1997)
25
Learning Approaches
(Ramaswamy and Golub 2002)
26
Machine learning tasks (in bioinformatics as
elsewhere)
  • Classification: predicting the class of an item
  • Clustering: finding groups of items
  • Characterisation: describing a group
  • Deviation Detection: finding changes
  • Linkage Analysis: finding relationships and
    associations
  • Visualisation: presenting data visually to
    facilitate knowledge discovery by humans (human
    in the loop)

27
Types of Machine Learning
  • Symbolic approaches
  • Employ some kind of description language in which
    the learned pattern is expressed.
  • Much more transparent and easier to interpret.

28
Single Machine Learning Approach
Machine Learning
Classifier
C4.5 SVM k-NN ANN
29
Decision Trees (Quinlan, 1993)
If gene_1671 < 56.9 Then Normal
If gene_1671 >= 56.9 and gene_682 < 107.4 Then Normal
If gene_1671 >= 56.9 and gene_682 >= 107.4 and gene_201 < 3145.5 Then Tumour
If gene_1671 >= 56.9 and gene_682 >= 107.4 and gene_201 >= 3145.5 Then Normal
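The four rules form a single path-ordered decision tree, so they can be written as nested tests. The sketch below assumes the second comparison in each rule means "greater than or equal to"; the sample values are made up for illustration.

```python
def classify(sample):
    """Apply the decision-tree rules to a dict of expression values."""
    if sample["gene_1671"] < 56.9:
        return "Normal"
    # From here on gene_1671 >= 56.9
    if sample["gene_682"] < 107.4:
        return "Normal"
    # From here on gene_682 >= 107.4
    if sample["gene_201"] < 3145.5:
        return "Tumour"
    return "Normal"


# Hypothetical samples, not taken from any real data set.
print(classify({"gene_1671": 80.0, "gene_682": 150.0, "gene_201": 900.0}))
```

Writing the tree this way makes its transparency concrete: each prediction can be traced to at most three threshold tests.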
30
Ensemble Machine Learning
Combined Classifier
31
Why Ensemble Learning?
  • Advantages
  • Complement each other's weaknesses
  • Increase predictive power
  • Approximate the true hypothesis
  • Disadvantages
  • Difficult to combine - lack of coherence
  • Increase computational time
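One simple way to combine classifiers is majority voting. The slides do not say which combination scheme is used, so the sketch below is an assumption, and the three threshold classifiers in it are hypothetical stand-ins for learners such as C4.5, SVM or k-NN.

```python
from collections import Counter


def majority_vote(classifiers, sample):
    """Return the class predicted by most of the base classifiers."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Three hypothetical base classifiers, each looking at one feature.
base = [
    lambda s: "Tumour" if s["a"] > 1.0 else "Normal",
    lambda s: "Tumour" if s["b"] > 5.0 else "Normal",
    lambda s: "Tumour" if s["c"] > 0.2 else "Normal",
]

# Two of the three classifiers vote Tumour for this sample.
print(majority_vote(base, {"a": 2.0, "b": 1.0, "c": 0.5}))
```

An odd number of base classifiers avoids ties in a two-class problem. The cost noted above is visible here too: every prediction runs all base classifiers.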

32
Cross-validation
33
Confusion matrix / Contingency Table
Training set / test set
True Positives (TP): x ∈ X+ and h(x) = TRUE
True Negatives (TN): x ∈ X- and h(x) = FALSE
False Positives (FP): x ∈ X- and h(x) = TRUE
False Negatives (FN): x ∈ X+ and h(x) = FALSE
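A minimal sketch of how the four cells are counted for a hypothesis h on a labelled test set. It also derives sensitivity and specificity, which the slides mention later for rating patterns. The toy hypothesis and data are invented for illustration.

```python
def confusion_counts(examples, h):
    """examples: iterable of (x, is_positive); h: hypothesis returning True/False."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for x, is_positive in examples:
        predicted = h(x)
        if is_positive and predicted:
            counts["TP"] += 1
        elif not is_positive and not predicted:
            counts["TN"] += 1
        elif not is_positive and predicted:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


def sensitivity(c):
    return c["TP"] / (c["TP"] + c["FN"])


def specificity(c):
    return c["TN"] / (c["TN"] + c["FP"])


# Toy test set: numbers labelled positive when they are even.
test_set = [(n, n % 2 == 0) for n in range(10)]
counts = confusion_counts(test_set, lambda n: n < 5)  # deliberately poor hypothesis
print(counts, sensitivity(counts), specificity(counts))
```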
34
Classification and conservation problems
Classification: + and - examples
Characterisation: + examples only
[Figure: regions labelled S, S-, F and F- with unclassified items (?), drawn for clean and for noisy training data]
35
The challenge of increasing data
Language of the pattern L(P)
36
Protein family analysis
  • Collect sequences (structures) in family
  • Analyze
  • local multiple alignment
  • global multiple alignment
  • pattern discovery
  • Make family description
  • Pick up more family members?
  • Analyze extended set

37
String or structure comparison and motif discovery
Str Database
Eidhammer, Jonassen and Taylor, Structure
Comparison and Structure Patterns, JCB, 7(5), pp
685-716, 2000.
38
What is a pattern?
39
Types of Pattern
  • Deterministic
  • is a boolean function which either matches a
    given object (i.e. sequence, structure) or not
  • R-x-Y-[ST]
  • (e.g. regular expression for sequence pattern)

      1    2    3    4    5    6    7    8    9    10
S1    R    V    Q    R    A    Y    S    Y    V    N
S2    P    L    M    R    A    Y    S    I    A    S
S3    L    V    I    R    P    Y    T    P    V    S
S4    L    C    M    R    A    Y    T    P    T    S
S5    E    K    L    R    L    Y    S    I    A    S

      R.2  V.4  Q.2  R1   A.6  Y1   S.6  Y.2  V.4  N.2
      P.2  L.2  M.4       P.2       T.4  I.4  A.4  S.8
      L.4  V.2  I.2       L.2            P.4  T.2
      E.2
  • Probabilistic
  • Assigns each sequence a probability that it was
    generated by the model. The higher the
    probability, the better the match between a
    sequence and a pattern
  • (e.g. Profile for sequence pattern)
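The two kinds of pattern can be contrasted on the five aligned sequences of this slide. The sketch reads the deterministic pattern as R-x-Y-[ST] and translates it to a regular expression; the profile is recomputed from the alignment as per-column residue frequencies, and a sequence is scored by multiplying the frequencies of its residues. Treating the columns as independent, and giving unseen residues probability zero, are simplifying assumptions.

```python
import re
from collections import Counter

ALIGNMENT = {
    "S1": "RVQRAYSYVN",
    "S2": "PLMRAYSIAS",
    "S3": "LVIRPYTPVS",
    "S4": "LCMRAYTPTS",
    "S5": "EKLRLYSIAS",
}

# Deterministic pattern R-x-Y-[ST]: matches or does not match.
DETERMINISTIC = re.compile(r"R.Y[ST]")


def build_profile(sequences):
    """Per-column residue frequencies of equal-length aligned sequences."""
    n = len(sequences)
    return [{aa: count / n for aa, count in Counter(column).items()}
            for column in zip(*sequences)]


def profile_probability(profile, sequence):
    """Probability of the sequence under the profile (independent columns)."""
    p = 1.0
    for column, aa in zip(profile, sequence):
        p *= column.get(aa, 0.0)  # unseen residue: probability 0 (no pseudocounts)
    return p


profile = build_profile(list(ALIGNMENT.values()))
for name, seq in ALIGNMENT.items():
    print(name, bool(DETERMINISTIC.search(seq)), profile_probability(profile, seq))
```

The deterministic pattern gives the same yes/no answer for all five sequences, while the profile ranks them by how typical each one is of the family.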

40
Approaches to pattern discovery
  • Pattern driven
  • enumerate all (or some) patterns up to certain
    complexity (length), for each calculate the
    score, and report the best
  • Sequence driven
  • look for patterns by aligning the given sequences

Brazma et al, Approaches to the automatic
discovery of patterns in biosequences, Journal of
Computational Biology, 5(2):277-303, 1998
41
Sequence driven algorithms
  • Group similar sequences together (e.g., in
    pairs)
  • For each group find a common pattern (e.g., by
    dynamic programming)
  • Group similar patterns together and repeat the
    previous step until there is only one group left
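The loop described above can be sketched for the simplest case of equal-length, already aligned sequences. This is a deliberate simplification: instead of dynamic programming, two patterns are merged position by position (taking the union of allowed residues), and similarity is the number of positions the two patterns can agree on. All function names are illustrative.

```python
from itertools import combinations


def to_pattern(sequence):
    """A pattern is a tuple with one set of allowed residues per position."""
    return tuple(frozenset(ch) for ch in sequence)


def similarity(p, q):
    """Number of positions where the two patterns share a residue."""
    return sum(1 for a, b in zip(p, q) if a & b)


def merge(p, q):
    """Common pattern of a pair: allow either residue at each position."""
    return tuple(a | b for a, b in zip(p, q))


def sequence_driven(sequences):
    groups = [to_pattern(s) for s in sequences]
    while len(groups) > 1:
        # Group the two most similar patterns together ...
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda ij: similarity(groups[ij[0]], groups[ij[1]]))
        merged = merge(groups[i], groups[j])
        # ... replace them by their common pattern, and repeat.
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0]


def render(pattern, max_set=2):
    """Write small residue sets as [..] and larger ones as the wildcard x."""
    parts = []
    for residues in pattern:
        if len(residues) == 1:
            parts.append(next(iter(residues)))
        elif len(residues) <= max_set:
            parts.append("[" + "".join(sorted(residues)) + "]")
        else:
            parts.append("x")
    return "-".join(parts)


seqs = ["RVQRAYSYVN", "PLMRAYSIAS", "LVIRPYTPVS", "LCMRAYTPTS", "EKLRLYSIAS"]
print(render(sequence_driven(seqs)))
```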

42
Sequence driven approach
[Figure: tree combining sequences s1 to s5 through intermediate patterns p1 to p4]
43
Pattern driven approach
  • Given a set of examples E
  • Set pattern P := ø
  • While (match_all(P,E) = true) do
  • P := P + c
  • Return P
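A runnable reading of this loop, under stated assumptions: patterns are plain substrings, a component c is a single character, and an extension is kept only if the extended pattern still matches every example (so the returned pattern always matches all of E). The greedy choice of the first workable character is also an assumption; a full pattern-driven search would enumerate and score the alternatives.

```python
def match_all(pattern, examples):
    """True if the pattern occurs as a substring of every example."""
    return all(pattern in example for example in examples)


def pattern_driven(examples, alphabet):
    """Grow a pattern one character at a time while it matches all examples."""
    pattern = ""  # the empty pattern matches everything
    extended = True
    while extended:
        extended = False
        for c in alphabet:
            if match_all(pattern + c, examples):
                pattern += c
                extended = True
                break
    return pattern


examples = ["TTACGTA", "GACGTCC", "CCACGTT"]
print(pattern_driven(examples, "ACGT"))
```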

44
Topological pattern discovery (pattern extension
and repeated matching)
Repeat
- Find new sheet
- Extend current sheet
- Find circuits
Works (in theory) on set of any size
45
Rating patterns
  • Size (e.g. number of characters)
  • Compression
  • measure of how much of each of the items in the
    learning set is described
  • Sensitivity, Specificity etc
  • requires evaluation against learning/training and
    test sets

46
Compression
  • Send the pattern once, and
  • Send the uncovered parts of each structure
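The idea can be made concrete with strings standing in for structures. In this sketch the cost of a description is the length of the pattern (sent once) plus the length of whatever the pattern does not cover in each item; sizes measured in characters and coverage by simple substring match are assumptions made for illustration.

```python
def compressed_size(pattern, items):
    """Send the pattern once, then the uncovered part of each item."""
    size = len(pattern)
    for item in items:
        covered = len(pattern) if pattern in item else 0
        size += len(item) - covered
    return size


def compression(pattern, items):
    """Fraction of the original description saved by using the pattern."""
    original = sum(len(item) for item in items)
    return 1 - compressed_size(pattern, items) / original


items = ["TTACGTA", "GACGTCC", "CCACGTT"]
print(compressed_size("ACGT", items), round(compression("ACGT", items), 3))
```

A pattern that covers more of every item gives a higher compression, which is why compression can be used to rate candidate patterns.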

Pattern
Domain 1
Domain 2
Special case: when there are 2 examples, compression
gives comparison
47
TOPS: Protein topology
David Gilbert, Juris Viksna, Gilleain Torrance (BRC, Glasgow),
David Westhead and Ioannis Michalopoulos (Leeds)
BBSRC/EPSRC funded
48
TOPS website
http://www.tops.leeds.ac.uk/
49
Topological structure comparison: 1 against all
Structure comparison server http://tops.ebi.ac.uk/tops
50
Comparing structures - NADP binding domains
dihydrofolate reductase
51
Dendrogram of comparisons
Pairwise comparison, all x all: n (n-1) / 2
14 x 13 / 2 = 91, hierarchical clustering
52
Pairwise comparison
  • Pairwise comparison: all x all
  • n (n-1) / 2
  • 14,000 protein domains: approximately 10^8 comparisons
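The quadratic growth is easy to verify. The short sketch below evaluates n (n-1) / 2 for 14 structures and for 14,000 domains.

```python
def pairwise_comparisons(n):
    """Number of unordered pairs among n items: n (n-1) / 2."""
    return n * (n - 1) // 2


small = pairwise_comparisons(14)       # a small family of structures
large = pairwise_comparisons(14_000)   # all protein domains
print(small, large)
```

For 14,000 domains this gives just under one hundred million comparisons, so an all-against-all run is only practical when each comparison is cheap or the work is distributed.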

53
Coverage vs Error
54
Topological representation of transcriptional
regulatory regions
Olivier Sand, osand_at_brc.dcs.gla.ac.uk
55
Hierarchical Machine Learning of Patterns for
Characterising Protein Families
Aik Choon TAN actan_at_brc.dcs.gla.ac.uk
Research Aim: To construct a novel approach to
induce invariant relationships between
distributed heterogeneous biological databases
using knowledge discovery and hierarchical
machine learning techniques.
56
Biological Data: Distributed and Heterogeneous!!
57
Development of Twilight Friendly Software Using
a Rule-Based System that Provides Functional
Annotation with Varying Levels of Uncertainty
Ali Al-Shahib www.brc.dcs.gla.ac.uk/alshahib
alshahib_at_brc.dcs.gla.ac.uk
If a gene X has a weak psi-blast assignment to a
function F and that gene is in an expression
cluster and sufficient members of that cluster
are known to have function F, then believe the
assignment of F to X (for some suitable values
of weak and sufficient).
  • WEIGHTED LOGICAL RULES
  • Rules that will limit the measures of uncertainty

58
Data storage, integration and indexing
Ela Hunt ela_at_brc.dcs.gla.ac.uk
EXTERNAL DATABASES: NCBI, Ensembl, TIGR, Mouse
database, Drosophila database, Mouse microarrays,
chromosome 5 db
Experimental data: images, images converted to
numbers or strings
59
Indexing
Ela Hunt ela_at_brc.dcs.gla.ac.uk
  • String indexing structures can be used to index
    DNA, proteins, XML and phylogenetic trees
  • All data is read once, index is created on disk
  • Index reduces the search space of the query (we
    read only part of the disk)
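To illustrate how a string index narrows the search space, here is a minimal in-memory sketch using a suffix array. This is a stand-in chosen for brevity, not the on-disk index structure used in the project: the text is read once to build the index, and a query then inspects only a logarithmic number of suffixes plus the matches themselves.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes.

    Naive construction, fine for a small demonstration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, query):
    """Return all start positions of query, using binary search on the index."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # find the first suffix whose prefix is >= query
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Matching suffixes are contiguous in the suffix array.
    while lo < len(suffix_array) and text.startswith(query, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


genome = "GATTACAGATTACA"
index = build_suffix_array(genome)
print(find_occurrences(genome, index, "TTA"))
```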

60
New database technologies for storing the output
from high-throughput biological experiments
Andrew Jones
  • Proteomics: the study of the set of proteins
    expressed in a sample
  • Complex, variable output
  • High-Resolution images
  • Numerical data generated by lab. equipment and
    software
  • Human Annotation
  • The data is not suitable for storage in a
    standard relational database
  • Storage, retrieval and exchange of data is
    important
  • XML (Extensible Markup Language) is being
    investigated for storing such data

61
SAILS: SELF ADJUSTING INDEXES FOR LARGE SEQUENCES
Nigel Harding hardingn_at_dcs.gla.ac.uk
  • EPSRC, 30 months from August 2001, Malcolm
    Atkinson P.I.
  • Developing indexes for very large collections of
    reference data (e.g. mammalian genomes) that will
    tune themselves automatically in response to the
    queries being submitted against them.
  • However, before introducing self adjustment we need
    to establish how query performance depends upon
    index structure.

62
BugView - A Genome Visualization Tool
David P. Leader d.leader_at_bio.gla.ac.uk
Loading GenBank files, Viewing Gene Information
63
Comparing Genes from different Bacterial Genomes
64
Molecular Evolution: A Phylogenetic Approach
Rod Page r.page_at_bio.gla.ac.uk
Locating genome duplications. Q: did one or more
genome-wide events affect all gene families?
65
Supertrees: combining small trees into one large
tree
Input: k trees
Q: can we do this in polynomial time?
supertree
graph
all minimum cuts
66
Data complexity: Methionine Biosynthesis in E. coli
67
Biochemical networks
  • Pathway navigation
  • Pathway comparison
  • Pathway motif discovery
  • Pathway simulation
  • High-level abstraction inferred from low-level
    descriptions
  • Novel pathways from gene expression experiments

68
Biochemical Pathway Simulator: A Software Tool
for Simulation and Analysis of Biochemical Networks
DTI Beacon project, £0.9M, 4 years
  • Muffy Calder muffy_at_dcs.gla.ac.uk
  • David Gilbert drg_at_brc.dcs.gla.ac.uk
  • Walter Kolch wkolch_at_beatson.gla.ac.uk
  • Keith van Rijsbergen keith_at_dcs.gla.ac.uk
  • Brian Ross rossbs_at_dcs.gla.ac.uk

69
Complexity: real bioinformatics
Closing the loop from wet lab to in-silico
Abstract model
Human feedback (in-the-loop)
Simulator
Database MAPK
Lab MAPK
Analysis
Web portal
DATA
User Interface
Pathway Editor
Rules
Database Apoptosis
Text miner
Simulator (Calder, Ross): Concurrency theory
Bioinformatics (Gilbert): Tools, database, interface
Bio (Kolch): Lab/Literature
70
Distributed databases and computation
Cardiovascular Functional Genomics
  • £5.4 million project, 5 UK Universities.
  • Combined studies
  • scientific models of disease (Rat)
  • parallel studies of patients
  • large family and population DNA collections
  • 3 pronged approach
  • Targeted transcript sequencing
  • Microarray gene expression profiling
  • Comparative genome analysis.
  • Data generated at each of the 5 sites made
    available for analysis
  • Issues of distributed data and computation.
  • Mapping gene sequences: Rat <-> Mouse <-> Human
  • an added layer of complexity in the computation.

71
Cardiovascular Functional Genomics
  • Glasgow: Anna Dominiczak, John Connell, David
    Gilbert, Malcolm Atkinson, Ela Hunt
  • Leicester: Nilesh Samani, Richard Trembath, Paul
    Burton
  • Edinburgh: John Mullins, Ian Wilmut, Jonathan
    Seckl
  • Imperial/MRC Clinical Sciences Centre: Timothy
    Aitman, James Scott, Helen Causton
  • Oxford: Dominique Gauguier, Hugh Watkins
  • Maastricht: Henry Struijker Boudier

72
Distributed data and computation
73
Wellcome Trust Cardiovascular Functional
Genomics
74
BRIDGES: BioMedical Research Informatics
Delivered by Grid Enabled Services
  • National e-Science Centre, Bioinformatics
    Research Centre, IBM UK Life Sciences
  • Incrementally develop and explore database
    integration over 6 geographically distributed
    research sites within the framework of the large
    Wellcome Trust biomedical research project
    Cardiovascular Functional Genomics.
  • Three classes of integration will be developed to
    support a sophisticated bioinformatics
    infrastructure supporting
  • data sources (both public and project generated),
  • bioinformatics analysis and visualisation tools,
  • research activities combining shared and private
    data.
  • The inclusion of patient records and animal
    experiment data means that privacy and access
    control are particular concerns.
  • An exploration of index factories accelerating
    sequence processing will test the hypothesis that
    the Grid makes a new class of e-Science indexes
    feasible. Both OGSA-DAI and IBM DiscoveryLink
    technology will be employed and a report will
    identify how each performed in this context.

75
The Scottish Bioinformatics Forum (SBF)
  • Network of Bioinformatics researchers and
    industries in Scotland
  • A vehicle for developing Scotland as a Centre of
    Bioinformatics Excellence
  • Nodes in Glasgow, Edinburgh, Dundee, Aberdeen,
    ...
  • Promoting collaborative research
  • Development of a Bioinformatics educational
    programme
  • www.sbforum.org, sbforum-general_at_sbforum.org

76
The Future
GPCVIII, ECCB04 and ISMB04 at Glasgow
Scottish Bioinformatics Forum (SBF)
  • Closing the loop from wet lab to in silico!

Collaboration!
http://www.brc.dcs.gla.ac.uk