Bioinformatics Tools - PowerPoint PPT Presentation

About This Presentation
Title:

Bioinformatics Tools

Slides: 48
Provided by: researchco3

Transcript and Presenter's Notes

Title: Bioinformatics Tools


1
  • Bioinformatics Tools

2
What is Bioinformatics?
  • The use of computers to collect, analyze, and
    interpret biological information at the molecular
    level.
  • "The mathematical, statistical and computing
    methods that aim to solve biological problems
    using DNA and amino acid sequences and related
    information."
  • A set of software tools for molecular sequence
    analysis

3
Major Types of Tools
  • Database design and query tools for biology
  • String alignment (pairwise and multiple)
  • Similarity searching
  • (scoring alignments vs. hash tables)
  • Pattern searching
  • finding genes in genomes
  • promoters - combinatorial regulators
  • functional regions of proteins
  • Clustering
  • phylogenetics (evolutionary trees)
  • gene expression profiling
  • genetic profiling

4
GenBank
  • All DNA sequence data is stored in a database
    called GenBank managed by the National Center for
    Biotechnology Information (NCBI)
  • The NCBI is a branch of the National Library of
    Medicine, which is part of the NIH (National
    Institutes of Health).
  • http://ncbi.nlm.nih.gov

5
(No Transcript)
6
Entrez is a Tool for Finding Sequences
  • NCBI has created a Web-based tool called Entrez
    for finding sequences in GenBank.
  • Each sequence in GenBank has a unique accession
    number.
  • Entrez can also search for keywords such as gene
    names, protein names, and the names of organisms
    or biological functions
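
A keyword or accession search like the one described above can also be expressed as a URL. The sketch below only builds the URL and does not contact NCBI. It assumes NCBI's E-utilities `esearch` endpoint, which the slides do not mention (they describe the Entrez web page only); the function name `build_entrez_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities search endpoint. The slides describe only
# the Entrez web interface, not this programmatic one.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_query(term, db="nucleotide", retmax=5):
    """Return a search URL for a keyword or an accession number."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Search by accession number (the one shown in the BLAST example slide).
by_accession = build_entrez_query("BE588357")

# Search by organism name, using an Entrez field tag.
by_organism = build_entrez_query("Bos taurus[Organism]")

print(by_accession)
print(by_organism)
```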

7

8
Finding Genes in GenBank
  • GenBank contains approximately 10 billion bases
    in 8.2 million sequence records (as of December
    2000).
  • These billions of G, A, T, and C letters would be
    almost useless without descriptions of what genes
    they contain, the organisms they come from, etc.
  • All of this information is contained in the
    "annotation" part of each sequence record.

9
(No Transcript)
10
Flatfile vs. Relational
  • GenBank itself is a flatfile database with
    keyword-delimited fields.
  • GenBank is also indexed into a much larger
    relational database that has links to published
    papers, protein structures and much more.
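
As a rough illustration of what "keyword-delimited fields" means, the sketch below parses a made-up record laid out in the GenBank style: a keyword at the start of a line, indented continuation lines, and `//` as the record terminator. The record `TOY0001` is hypothetical, and the parser handles only this simplified layout, not the full GenBank format.

```python
# A hypothetical toy record in a GenBank-like layout (not real data).
RECORD = """\
LOCUS       TOY0001      12 bp    DNA
DEFINITION  Hypothetical example record
            used only for illustration.
ACCESSION   TOY0001
ORIGIN
        1 gattacagat ta
//
"""


def parse_flatfile(text):
    """Map each top-level keyword to its text, joining continuation lines."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line and not line[0].isspace():  # a new keyword starts in column 0
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        elif key is not None:               # indented line continues the field
            fields[key] = (fields[key] + " " + line.strip()).strip()
    return fields


def sequence_of(fields):
    """Drop position numbers and spaces from the ORIGIN block."""
    return "".join(ch for ch in fields.get("ORIGIN", "") if ch.isalpha())


fields = parse_flatfile(RECORD)
print(fields["ACCESSION"], sequence_of(fields))
```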

11
(No Transcript)
12
Entrez is Internally Cross-linked
  • DNA and protein sequences are linked to other
    similar sequences
  • Medline citations are linked to other citations
    that contain similar keywords
  • 3-D structures are linked to similar structures

13
(No Transcript)
14
Pairwise Alignment
  • The alignment of two sequences (DNA or protein)
    is a relatively straightforward computational
    problem.
  • The best solution seems to be an approach called
    Dynamic Programming.
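
A minimal sketch of dynamic programming for pairwise alignment, assuming a global (Needleman-Wunsch style) alignment with a simple linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) are arbitrary choices for illustration, not values from the slides, and only the optimal score is returned, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap              # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap              # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            up = score[i - 1][j] + gap     # gap inserted in b
            left = score[i][j - 1] + gap   # gap inserted in a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACT"))
```

Each cell depends only on its three neighbours, so the table is filled in time proportional to the product of the two sequence lengths.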

15
(No Transcript)
16
(No Transcript)
17
Multiple Alignments
  • Aligning a large number of sequences using
    dynamic programming is essentially impossible
  • The problem increases exponentially with the
    number of sequences involved
  • Current multiple alignment programs use a
    progressive method that makes pairwise
    alignments, then adds new sequences one at a time
    to these aligned groups.
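
The exponential growth can be made concrete with a little arithmetic. The sketch below assumes a naive dynamic-programming table with one axis per sequence and (length + 1) entries along each axis; the function name `dp_cells` and the sequence length of 300 are illustrative choices.

```python
def dp_cells(lengths):
    """Number of cells in a full DP table with one axis per sequence."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


# Table size for 2, 3, 5 and 10 sequences of 300 residues each.
for count in (2, 3, 5, 10):
    print(count, dp_cells([300] * count))
```

Two sequences need about ninety thousand cells; every additional sequence multiplies that figure by 301, which is why progressive methods fall back on repeated pairwise alignments.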

18
(No Transcript)
19
Similarity Searching
  • There are a variety of computer programs that are
    used for making comparisons between DNA
    sequences.
  • The most popular is called BLAST (Basic Local
    Alignment Search Tool)
  • BLAST is free at the NCBI website
  • BLAST compares your sequence to all of the
    sequences in GenBank and finds the best matches.

20
(No Transcript)
21
  • >gb|BE588357.1|BE588357 194087 BARC 5BOV Bos
    taurus cDNA 5'.
  • Length = 369
  • Score = 272 bits (137), Expect = 4e-71
  • Identities = 258/297 (86%), Gaps = 1/297 (0%)
  • Strand = Plus / Plus

  • Query 17 aggatccaacgtcgctccagctgctcttgacgactccac
    agataccccgaagccatggca 76

  • Sbjct 1 aggatccaacgtcgctgcggctacccttaaccact-cgc
    agaccccccgcagccatggcc 59

  • Query 77 agcaagggcttgcaggacctgaagcaacaggtggagggg
    accgcccaggaagccgtgtca 136

  • Sbjct 60 agcaagggcttgcaggacctgaagaagcaagtggagggg
    gcggcccaggaagcggtgaca 119

  • Query 137 gcggccggagcggcagctcagcaagtggtggaccaggcc
    acagaggcggggcagaaagcc 196

22
BLAST is Complex
  • Similarity searching relies on both alignment and
    distance between pairs of sequences.
  • Distances can only be measured between aligned
    sequences (match vs. mismatch at each position).
  • A similarity search is a process of scoring the
    alignment of a query sequence with every sequence
    in a database.
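
Counting matches and mismatches position by position, as described above, can be sketched as follows. The function assumes the two sequences are already aligned (equal length, with `-` marking gaps); the name `identity_and_distance` and the toy sequences are hypothetical.

```python
def identity_and_distance(query, subject, gap="-"):
    """Count matches, mismatches and gapped columns in an alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for q, s in zip(query, subject):
        if q == gap or s == gap:
            gaps += 1
        elif q == s:
            matches += 1
        else:
            mismatches += 1
    # Fraction of alignment columns that are identical.
    identity = matches / len(query) if query else 0.0
    return matches, mismatches, gaps, identity


print(identity_and_distance("ACGT-A", "ACCTGA"))
```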

23
BLAST is Approximate
  • BLAST makes similarity searches very quickly
    because it takes shortcuts.
  • It also makes errors
  • missing some important similarities
  • making many incorrect matches
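
One kind of shortcut can be sketched with a hash table of short words. This is a simplified illustration of the word-lookup idea, not BLAST's actual implementation: it indexes every word of length k in a toy database and reports only exact word hits, which a real program would then try to extend into alignments. All names and sequences here are hypothetical.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Map every k-letter word to the (sequence name, position) pairs it occurs at."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k):
    """Return (query position, sequence name, subject position) for exact word hits."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, spos))
    return hits


database = {"s1": "ACGTACGT", "s2": "TTTTGGGG"}
index = build_word_index(database, k=4)

print(find_seeds("GTAC", index, k=4))      # one exact word hit in s1
print(find_seeds("ACGAACGA", index, k=4))  # similar to s1, yet no seed is found
```

The second query matches s1 at 6 of 8 positions but shares no exact 4-letter word with it, so a word-based search never examines that pair. This is one way a fast search can miss a real similarity.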

24
Gene Finding
  • How can we find genes on chromosomes (sequenced
    genomic DNA)?
  • Genome project data is just huge chunks of DNA.
  • Does automatic annotation work?

25
Raw Genome Data
26
Finding Genes is Not Easy
  • About 1% of human DNA encodes functional genes.
  • Genes are interspersed among long stretches of
    non-coding DNA.
  • Repeats, pseudo-genes, and introns confound
    matters

27
Pattern Finding Tools
  • It is possible to use DNA sequence patterns to
    predict genes
  • promoters
  • translational start and stop codons (ORFs)
  • intron splice sites
  • codon bias
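
The simplest of these patterns, an open reading frame (ORF), can be searched for directly. The sketch below scans the three forward reading frames for an ATG start codon followed in frame by a stop codon. It is deliberately minimal: it ignores the reverse strand, introns, promoters and codon bias, and the `min_codons` threshold is an arbitrary illustrative parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of forward-strand ORFs, stop codon included."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                          # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                       # look for the next ORF
    return orfs


sequence = "CCATGAAATTTTAGGG"
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])
```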

28
Similarity to Known Genes
  • It is also possible to scan new DNA sequence for
    known genes
  • Can look for annotated genes/proteins
  • Or just for RNAs (ESTs)

29
(No Transcript)
30
Patterns in Proteins
31
Motifs
  • Some structures can be recognized as sequence
    patterns
  • transmembrane domains
  • coiled coils
  • helix-turn-helix
  • signal peptides

32
Functional Motifs
  • Other functional portions of proteins can be
    recognized by their sequence, even if their 3-D
    structure is not known.
  • There are many databases of protein
    motifs/domains ProSite, Pfam, ProDom, etc.
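
Motif databases such as ProSite describe many sites as short sequence patterns, which map naturally onto regular expressions. The sketch below converts a small subset of the PROSITE pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats) and scans a protein for matches. The converter `prosite_to_regex` and the test sequence are hypothetical; the example pattern is the N-glycosylation site pattern, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        parts.append(element)
    return "".join(parts)


def find_motif(pattern, protein):
    """Return (position, matched text) for every match, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(protein)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "MNGSAANPTA"))
```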

33
(No Transcript)
34
Protein 3-D Structure
35
Structure = Function
  • Proteins function by 3-D interactions with other
    molecules (i.e. physical chemistry).
  • So for a protein, 3-D structure is function.
  • But we can't accurately determine 3-D structure
    from gene sequence.

36
Structure Prediction
  • Predicting a protein's 3-D structure from its
    amino acid sequence is incredibly complex
  • proteins are polypeptides (long chains of amino
    acids)
  • can fold and rotate around bonds within each
    amino acid as well as the bonds between them
  • it is not possible to evaluate every possible
    folding pattern for an amino acid sequence

37
Chemical Properties
  • Some chemical properties of a protein can be
    calculated from its amino acid sequence
  • molecular weight
  • charge/pH
  • hydrophobicity
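
Two of these properties can be sketched as simple sums over the sequence. The tables below are supplied for illustration and are not from the slides: average residue masses in daltons (rounded to two decimals) and the Kyte-Doolittle hydropathy scale. The function names and the example peptide are hypothetical.

```python
# Average residue masses in Da (amino acid minus one water), rounded.
RESIDUE_MASS = {
    "A": 71.08, "R": 156.19, "N": 114.10, "D": 115.09, "C": 103.14,
    "E": 129.12, "Q": 128.13, "G": 57.05, "H": 137.14, "I": 113.16,
    "L": 113.16, "K": 128.17, "M": 131.19, "F": 147.18, "P": 97.12,
    "S": 87.08, "T": 101.11, "W": 186.21, "Y": 163.18, "V": 99.13,
}
WATER_MASS = 18.02  # one water is added back for the free chain ends

# Kyte-Doolittle hydropathy values (positive means hydrophobic).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def molecular_weight(protein):
    """Approximate molecular weight of an unmodified peptide chain."""
    return sum(RESIDUE_MASS[aa] for aa in protein) + WATER_MASS


def mean_hydropathy(protein):
    """Average hydropathy over all residues of the sequence."""
    return sum(HYDROPATHY[aa] for aa in protein) / len(protein)


peptide = "MKTAYIAKQR"
print(round(molecular_weight(peptide), 2), round(mean_hydropathy(peptide), 2))
```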

38
Secondary Structure
  • The local structure of the amino acids in a
    protein can also be predicted to some extent.
  • Each amino acid has a tendency to form either an
    alpha helix or a beta sheet

39
Threading
  • Rather than computing a 3-D structure from
    scratch, it may be possible to find a similar
    structure
  • Must have at least 25% aa sequence identity
  • Uses a process called threading to create a new
    structure based on a known structure

40
(No Transcript)
41
Protein Data Bank
  • There is a database of all known protein
    structures called the PDB.
  • These have been determined by X-ray
    crystallography and/or NMR.
  • Anyone can download and view these structures
    with a PDB viewer program.

42
Clustering
  • Clustering is used in many different ways in
    biology
  • Grouping things based on shared properties is a
    fundamental aspect of biology
  • There are many different clustering algorithms;
    it is not at all clear which ones perform best
    for the various applications.
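
As one example among the many algorithms, the sketch below implements a naive agglomerative clustering with average linkage: it repeatedly merges the two clusters with the smallest average pairwise distance. The distance values and item names are made up for illustration, and this is not presented as the best choice for any particular application.

```python
from itertools import combinations


def cluster_distance(c1, c2, dist):
    """Average of all pairwise distances between members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


def average_linkage(items, dist):
    """Return the list of merges, in the order they were made."""
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], dist),
        )
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


# Hypothetical pairwise distances between three items.
distances = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 5.0,
}
merge_order = average_linkage(["A", "B", "C"], distances)
print(merge_order)
```

The order of merges defines a tree, which is the same basic idea behind distance-based evolutionary trees and expression-profile dendrograms.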

43
Phylogenetics
  • Evolution = mutation of DNA (and protein)
    sequences
  • Can we define evolutionary relationships between
    organisms by comparing DNA sequences?
  • is there one molecular clock?
  • phenetic vs. cladistic approaches
  • lots of methods and software, what is the
    "correct" analysis?

44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
Computer Exercise
  • Now you will have a chance to try out some of
    these bioinformatics tools
  • Use Entrez to search for sequences and journal
    articles
  • Find genes in human DNA
  • Use BLAST to search for similar sequences
  • Look at a family of protein sequences in a motif
    database
  • Think about how these tools are built -
    underlying databases and algorithms, interface,
    etc.