GenBank Files - PowerPoint PPT Presentation

Provided by: MartinS48
1
GenBank Files
2
GenBank Files
  • Arrays vs. Scalars
  • Using the input separator variable $/
  • Using offsets to control file access

3
GenBank Files
  • Beginning Perl for Bioinformatics
  • The O'Reilly web site has online resources for this
    book: http://perl.oreilly.com/
  • Download Examples and Exercises
  • Be sure to get the module BeginPerlBioinfo.pm

4
GenBank Files
  • To use the subroutines inside BeginPerlBioinfo.pm, include this
    line in your code:
      use BeginPerlBioinfo;
  • This assumes the module is in the same directory as the Perl
    code that uses it, and that Perl searches that directory (by
    default, Perl 5.26 and later do not search the current directory).
    Otherwise, name the module's directory before that line:
      use lib 'path/to/lib';
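
A minimal sketch of the search-path setup, assuming the module is kept in a `lib` directory next to the script (that layout is an assumption, not something the slides specify). `FindBin` is a core Perl module. The `use BeginPerlBioinfo;` line is commented out so the snippet runs even where the module has not been downloaded yet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# FindBin (core module) sets $Bin to the directory containing this script.
use FindBin qw($Bin);

# Assumed layout: BeginPerlBioinfo.pm sits in a "lib" directory beside the script.
use lib "$Bin/lib";

# use BeginPerlBioinfo;   # uncomment once the module is in place

# Confirm that the directory is now on Perl's module search path (@INC).
my $on_path = grep { $_ eq "$Bin/lib" } @INC;
print "lib directory on search path: ", ($on_path ? "yes" : "no"), "\n";
```

Building the path from `$Bin` keeps the script working no matter which directory it is started from.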

5
GenBank Files
  • Parsing a file: arrays vs. scalars
  • Each has its advantages and disadvantages
  • Arrays
  • each line is an element
  • useful if each line is processed the same way
  • not so useful when info you want spans multiple
    lines

6
GenBank Files
  • Parsing a file: arrays vs. scalars
  • Scalars
  • entire file is a scalar
  • useful when info you want spans multiple lines
  • info extracted with regular expressions
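
The two approaches can be sketched as follows. This is an illustration rather than code from the slides: the two tiny records are made up, and an in-memory filehandle stands in for a real file. It also shows the input separator variable `$/`, which controls how much one read returns.

```perl
use strict;
use warnings;

# Made-up stand-in for a GenBank file holding two records.
my $text = "LOCUS A\nORIGIN\n 1 acgt\n//\nLOCUS B\nORIGIN\n 1 ttga\n//\n";

# Array: with the default $/ ("\n"), each line becomes one element.
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
my @lines = <$fh>;
close($fh);

# Scalar: with $/ undefined, one read returns the entire file.
open($fh, '<', \$text) or die "Cannot open in-memory file: $!";
my $whole = do { local $/; <$fh> };
close($fh);

# In between: with $/ set to "//\n", each read returns one whole record.
open($fh, '<', \$text) or die "Cannot open in-memory file: $!";
my @records = do { local $/ = "//\n"; <$fh> };
close($fh);

print "lines:   ", scalar(@lines), "\n";
print "records: ", scalar(@records), "\n";
print "length of whole file: ", length($whole), "\n";
```

Using `local` restores the previous value of `$/` when the block ends, so later reads in the program are unaffected.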

7
GenBank Files
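  • Parsing a file as an array: chronological order
      if (at_beginning)
          keep going
      elsif (in_middle)
          extract_info
      elsif (at_end)
          finish_up

This order seems logical, but it doesn't work well: once the
in_middle test starts succeeding, it also succeeds on the
end-of-record line, so the at_end branch is never reached.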
8
GenBank Files
  • Parsing a file as an array: use reverse
    chronological order
      if (at_end)
          finish_up
      elsif (in_middle)
          extract_info
      elsif (at_beginning)
          keep going

9
GenBank Files
    #!/usr/bin/perl -w
    # example 10-1
    use strict;
    use BeginPerlBioinfo;

    my @annotation = ( );
    my $sequence = '';
    my $filename = 'record.gb';

    parse1(\@annotation, \$sequence, $filename);

    print @annotation;
    print_sequence($sequence, 50);
    exit;

10
GenBank Files
    sub parse1 {
        my($annotation, $dna, $filename) = @_;
        # $annotation - reference to array
        # $dna        - reference to scalar
        # $filename   - scalar

        # declare and initialize variables
        my $in_sequence = 0;
        my @GenBankFile = ( );

        # Get the GenBank data into an array from a file
        @GenBankFile = get_file_data($filename);

11
GenBank Files
        foreach my $line (@GenBankFile) {
            if ( $line =~ /^\/\/\n/ ) {       # end-of-record line
                last;                          # break out of the foreach loop.
            } elsif ( $in_sequence ) {         # flag true
                $$dna .= $line;
            } elsif ( $line =~ /^ORIGIN/ ) {
                $in_sequence = 1;              # set the flag.
            } else {                           # otherwise
                push( @$annotation, $line );
            }
        }
        $$dna =~ s/[\s0-9]//g;                 # strip whitespace and numbers
    }

12
GenBank Files
  • sub parse1
  • Note: no return value
  • References to variables declared in the main program are
    passed into the sub
  • The sub writes through those references into the variables in
    main, so there is no need to return anything
  • sub call:
      parse1(\@annotation, \$sequence, $filename);
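
The reference-passing pattern can be tried without a file or the BeginPerlBioinfo module. The sketch below uses a hypothetical variant, `parse_lines`, that receives the lines directly instead of calling `get_file_data`. The LOCUS line and the truncated sequence are placeholders for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant of parse1: takes the record lines as arguments.
sub parse_lines {
    my ($annotation, $dna, @lines) = @_;
    my $in_sequence = 0;
    foreach my $line (@lines) {
        if ($line =~ /^\/\/\n/) {          # end of record: tested first
            last;
        } elsif ($in_sequence) {           # inside the sequence block
            $$dna .= $line;                # write through the scalar reference
        } elsif ($line =~ /^ORIGIN/) {     # sequence starts on the next line
            $in_sequence = 1;
        } else {                           # everything before ORIGIN
            push @$annotation, $line;      # write through the array reference
        }
    }
    $$dna =~ s/[\s0-9]//g;                 # strip whitespace and position numbers
    return;                                # nothing to return
}

my @annotation = ();
my $sequence   = '';
my @record = (
    "LOCUS       DEMO\n",                  # placeholder line
    "ACCESSION   BC013459\n",
    "ORIGIN\n",
    "        1 gagagggtgg tgctggaggg\n",   # truncated sequence
    "//\n",
);

parse_lines(\@annotation, \$sequence, @record);

print "annotation lines: ", scalar(@annotation), "\n";
print "sequence: $sequence\n";
```

After the call, `@annotation` and `$sequence` in the main program hold the results even though the sub returned nothing.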

13
GenBank Files
  • Parsing a GenBank file as a scalar
  • Entire record is a scalar
  • Extract info with regular expressions

14
GenBank Files
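    sub get_annotation_and_dna {
        my($record) = @_;      # GenBank record as a scalar
        my($ann) = '';
        my($dna) = '';
        # split the record into annotation and sequence
        ($ann, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
        # strip whitespace, slashes, and position numbers from the sequence
        $dna =~ s/[\s\/\d]//g;
        return($ann, $dna);
    }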
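
The key step in this sub is a match in list context, which returns the captured groups so both pieces can be assigned at once. A small sketch on a shortened record (the LOCUS line is a placeholder and the sequence is truncated) shows the idea, along with a check for a failed match that the slide code does not include:

```perl
use strict;
use warnings;

my $record = "LOCUS       DEMO\n"
           . "ACCESSION   BC013459\n"
           . "ORIGIN\n"
           . "        1 gagagggtgg tgctggaggg\n"
           . "//\n";

# In list context the match returns ($1, $2).
my ($ann, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

# If the pattern fails, both variables are undefined, so check before use.
die "record did not match\n" unless defined $dna;

# Remove whitespace, slashes, and position numbers from the sequence.
$dna =~ s/[\s\/\d]//g;

print $ann;
print "$dna\n";
```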

15
GenBank Files
  • Using regular expressions to match multiline
    scalars
  • pattern modifiers /s and /m
  • $file =~ /pattern1(.*)pattern2/s
  • /s allows the . to match any character,
    including the newline character (\n)
  • /s does NOT affect the anchors ^ and $

16
GenBank Files
  • Using regular expressions to match multiline
    scalars
  • pattern modifiers /s and /m
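  • $file =~ /pattern1(.*)pattern2/m
  • /m permits ^pat to match right after any newline (\npat) in a
    multiline scalar, and pat$ to match right before any newline
  • /m does NOT let the . match the newline character (\n); only /s
    does that
  • /m affects the ^ and $ anchors; /s does not
  • The two can be combined (/sm) when both behaviors are needed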
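
A short sketch of what each modifier changes, using a made-up four-line string (not data from the slides):

```perl
use strict;
use warnings;

my $text = "start\nline one\nline two\nend\n";

# /s: the dot crosses newlines, so the capture spans several lines.
my ($between) = $text =~ /start\n(.*)end/s;

# Without /s the same pattern fails, because . stops at the first newline.
my $matched_without_s = ($text =~ /start\n(.*)end/) ? 1 : 0;

# /m: ^ matches at the start of every line, not only at the start of the string.
my @first_words = $text =~ /^(\w+)/mg;

print "captured with /s:\n$between";
print "matched without /s: $matched_without_s\n";
print "first word of each line: @first_words\n";
```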

17
GenBank Files
  • How we think of files
  • my list
  • cat
  • horse
  • mouse
  • car
  • house

18
GenBank Files
  • How a computer thinks of a file
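  • my list\ncat\nhorse\nmouse\ncar\nhouse(eof)
  • $file =~ /^horse.*ar$/s    no match (^ matches only at the start
    of the string)
  • $file =~ /^horse.*ar$/m    no match (. does not cross the newline)
  • $file =~ /^horse.*ar$/sm   MATCH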
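
The behavior on this list can be checked directly. The sketch below runs an anchored pattern, `/^horse.*ar$/`, against the file contents with each modifier. The exact pattern is an assumption, since the slide text has lost its special characters.

```perl
use strict;
use warnings;

# The file as the computer sees it: one string with embedded newlines.
my $file = "my list\ncat\nhorse\nmouse\ncar\nhouse";

my %result = (
    # /s alone: . crosses newlines, but ^ still means "start of string".
    's'  => ($file =~ /^horse.*ar$/s  ? 'MATCH' : 'no match'),
    # /m alone: ^ and $ work per line, but . stops at the newline after "horse".
    'm'  => ($file =~ /^horse.*ar$/m  ? 'MATCH' : 'no match'),
    # Both: ^ finds the "horse" line and .* runs on to the end of "car".
    'sm' => ($file =~ /^horse.*ar$/sm ? 'MATCH' : 'no match'),
);

print "/${_}: $result{$_}\n" for qw(s m sm);
```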

19
GenBank Files
  • Using regular expressions to extract information
    from the GenBank File
  • Accession
  • DNA Sequence
  • Protein Sequence
  • Gene
  • Organism

20
GenBank Files
  • Using regular expressions to grab the Accession
    number from file
  • ACCESSION BC013459
    sub get_accession {
        my $GB_file = shift;
        if ($GB_file =~ /ACCESSION\s+(\w+)/) {
            return $1;
        } else {
            return 'error';
        }
    }

21
GenBank Files
  • Using regular expressions to grab the Gene name
    from file
  • /gene="Bmp4"
    sub get_gene {
        my $GB_file = shift;
        if ($GB_file =~ /gene="(.*?)"/s) {
            return $1;
        } else {
            return 'unknown';
        }
    }

22
GenBank Files
  • Using regular expressions to grab the DNA
    sequence from file
  • ORIGIN
  • 1 gagagggtgg tgctggaggg tgggaaggca
    agagcgcgag
  • ...
  • 1741 tggactttta tcttaaaaaa aaaaaaaaaa aaaaaa
  • //

23
GenBank Files
  • Using regular expressions to grab the DNA
    sequence from file
    sub get_dna_sequence {
        my $GB_file = shift;
        my $seq;
        if ($GB_file =~ /ORIGIN\s*(.*)\/\//s) {
            $seq = $1;
        } else {
            return "error";
        }
        $seq =~ s/[\s\d]//g;    # strip whitespace and position numbers
        return uc($seq);
    }

24
GenBank Files
  • Using regular expressions to grab the protein
    sequence from file
  • /translation="MIPGNRMLMVVLLCQVLLGGASHASLIPETGKKKVAEIQGHAGG
    RRSGQSHELLRDFEATLLQMFGLRRRPQPSKSAVIPDYMRDLYRLQSGEEEEEEQSQG
    TGLEYPERPASRANTVRSFHHEEHLENIPGTSESSAFRFLFNLSSIPENEVISSAELR
    LFREQVDQGPDWEQGFHRINIYEVMKPPAEMVPGHLITRLLDTRLVHHNVTRWETFDV
    SPAVLRWTREKQPNYGLAIEVTHLHQTRTHQGQHVRISRSLPQGSGDWAQLRPLLVTF
    GHDGRGHTLTRRRAKRSPKHHPQRSRKKNKNCRRHSLYVDFSDVGWNDWIVAPPGYQA
    FYCHGDCPFPLADHLNSTNHAIVQTLVNSVNSSIPKACCVPTELSAISMLYLDEYDKV
    VLKNYQEMVVEGCGCR"

25
GenBank Files
  • Using regular expressions to grab the protein
    sequence from file
    sub get_protein_sequence {
        my $GB_file = shift;
        my $pro;
        if ($GB_file =~ /translation="(.*?)"/s) {
            $pro = $1;
        } else {
            return "error";
        }
        $pro =~ s/\s//g;        # strip whitespace, including line breaks
        return uc($pro);
    }
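
To see the four extraction patterns work together, the sketch below applies them to an abbreviated record assembled from the fragments shown on the slides. The LOCUS line is a placeholder, both sequences are truncated, and collecting the results in a `%info` hash is an illustrative choice rather than something the slides do.

```perl
use strict;
use warnings;

# Abbreviated record: placeholder LOCUS line, truncated sequences.
my $GB_file = <<'END_RECORD';
LOCUS       DEMO
ACCESSION   BC013459
                     /gene="Bmp4"
                     /translation="MIPGNRMLMVVLLCQVLLGG
                     ASHASLIPETGKKK"
ORIGIN
        1 gagagggtgg tgctggaggg tgggaaggca agagcgcgag
//
END_RECORD

my %info;

# Each pattern mirrors one of the subs; a failed match leaves the value undefined.
($info{accession}) = $GB_file =~ /ACCESSION\s+(\w+)/;
($info{gene})      = $GB_file =~ /gene="(.*?)"/s;
($info{protein})   = $GB_file =~ /translation="(.*?)"/s;
($info{dna})       = $GB_file =~ /ORIGIN\s*(.*)\/\//s;

# Clean up the multi-line fields.
$info{protein} =~ s/\s//g if defined $info{protein};
if (defined $info{dna}) {
    $info{dna} =~ s/[\s\d]//g;
    $info{dna} = uc $info{dna};
}

for my $field (qw(accession gene protein dna)) {
    my $value = defined($info{$field}) ? $info{$field} : 'not found';
    printf "%-10s %s\n", $field, $value;
}
```

Testing with `defined` separates a missing field from a real value, which the string return values 'error' and 'unknown' cannot do if a field happens to contain that text.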