The Shocking Details of Genome.ucsc.edu - PowerPoint PPT Presentation

Slides: 65
Provided by: jimk88
1
The Shocking Details of Genome.ucsc.edu
2
History of the Code
  • Started in 1999 in C after Java proved hopelessly
    unportable across browsers.
  • Early modules include a worm genome browser
    (Intronerator) and GigAssembler, which produced
    the working draft of the human genome.
  • In 2001 a few other grad students started working
    on the code.
  • In 2002 we hired staff to help with the Genome Browser.
  • Currently the project employs 20 full-time people.

3
The Genome Browser Staff
  • 5 programmers - Mark, Angie, Hiram, Kate, Rachel,
    Fan, Jim
  • 4 quality assurance engineers - Heather, Bob,
    Mike, Galt
  • 3 post-docs - Terry, Gill, Katie
  • 9 grad students - Chuck, Daryl, Brian, Robert,
    Yontao, Krish, Adam, Ryan, Andy
  • 3 system administrators - Paul, Jorge, Patrick
  • 1 writer - Donna
  • David Haussler and CBSE Staff
  • About 1/3 of staff (including me 3 days a week)
    telecommutes.

4
The Goal
Make the human genome understandable by humans.
5
Prognosis
Maybe we'll understand it one of these days.
6
(No Transcript)
7
Cardiac Troponin T2
8
Comparative Genomics at BMP10
9
Normalized eScores
10
Conservation Levels of Regulatory Regions
11
Complex Transcription
12
Add Your Own Tracks
  • Users can extend the browser with their own
    tracks.
  • User tracks can be private or public.
  • No programming required.
  • GFF, GTF, PSL or BED formats supported, for example:
      chrom  start    end      name  strand  score
      chr1   1302347  1302357  SP1   +       800
      chr1   1504778  1504787  SP2   +       980
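
Lines like the ones above are easy to pull apart in C. The following is a minimal sketch, not the browser's own parser: `struct trackItem` and `trackItemParse` are hypothetical names, and it assumes whitespace-separated fields and reads only the first four columns (chrom, start, end, name).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct trackItem
/* One line of a custom track (hypothetical structure for illustration). */
    {
    char chrom[32];       /* Chromosome name, such as chr1. */
    unsigned start, end;  /* Start and end position. */
    char name[64];        /* Item name. */
    };

int trackItemParse(const char *line, struct trackItem *item)
/* Fill in item from the first four whitespace-separated fields of line.
 * Returns 1 on success, 0 if the line has too few fields or start > end. */
{
assert(line != NULL && item != NULL);
if (sscanf(line, "%31s %u %u %63s",
        item->chrom, &item->start, &item->end, item->name) != 4)
    return 0;
return item->start <= item->end;
}
```

Any columns after the name are ignored here; a fuller parser would go on to read them as well.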

13
The Underlying Database
  • Power users and bioinformaticians sometimes want
    the underlying database.
  • There is a table for each track.
  • Larger tracks have a table for each chromosome.
  • Format of a track table generally similar to
    add-your-own track formats.
  • Pieces of database available from tables
    browser.
  • Whole database available as tab-separated files.
  • Most of database served via DAS.

14
Parasol and Kilo Cluster
  • UCSC cluster has 1000 CPUs running Linux
  • 1,000,000 BLASTZ jobs in 25 hours for mouse/human
    alignment
  • We wrote Parasol job scheduler to keep up.
  • Very fast and free.
  • Jobs are organized into batches.
  • Error checking at job and at batch level.

15
Science is Hard
16
Coding Discipline Is Required
  • "While software development is immune from almost
    all physical laws, entropy hits us hard." - The
    Pragmatic Programmer
  • To keep the system from devolving into disorder
    we have to follow code conventions and insist on
    a lot of testing.
  • We use CVS (Concurrent Versions System) to help
    all of us work on the same code at once.

17
Obtaining the Code from CVS
  • See http://genome.ucsc.edu/admin/cvs.html
  • This gets you a sandbox - a local copy of the
    source to compile and edit.
  • Type make in the lib and utilities directory.
  • You can do a cvs update to get our updates to
    the code base.
  • To add permanently to the code base, email me to
    enable cvs commit.

18
Expand Your Mental Capacity With
19
Lagging Edge Software
  • C language - compilers still available!
  • CGI Scripts - portable if not pretty.
  • SQL database - at least MySQL is free.

20
Problems with C
  • Missing booleans and strings.
  • No real objects.
  • Must free things

21
Advantages of C
  • Very fast at runtime.
  • Very portable.
  • Language is simple.
  • No tangled inheritance hierarchy.
  • Excellent free tools are available.
  • Libraries and conventions can compensate for
    language weaknesses.

22
Coping with Missing Data Types in C
  • #define boolean int
  • Fixing lack of real string type much harder
  • lineFile/common modules and autoSql code
    generator make parsing files relatively painless
  • dyString module not a horrible string class
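
To give a feel for what a string module has to do in C, here is a minimal sketch of a growable string buffer. It is not the dyString API: `struct growString` and its functions are hypothetical names made up for this illustration, and the doubling strategy is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct growString
/* A string buffer that grows as needed (hypothetical, for illustration). */
    {
    char *string;     /* Current contents, always null terminated. */
    size_t size;      /* Length of contents, not counting the terminator. */
    size_t capacity;  /* Bytes allocated for string. */
    };

struct growString *growStringNew(void)
/* Allocate an empty string. Returns NULL if out of memory. */
{
struct growString *gs = calloc(1, sizeof(*gs));
if (gs == NULL)
    return NULL;
gs->capacity = 16;
gs->string = calloc(gs->capacity, 1);
if (gs->string == NULL)
    {
    free(gs);
    return NULL;
    }
return gs;
}

int growStringAppend(struct growString *gs, const char *text)
/* Append text, doubling the buffer until it fits. Returns 0 if out of memory. */
{
size_t len, needed;
assert(gs != NULL && text != NULL);
len = strlen(text);
needed = gs->size + len + 1;
if (needed > gs->capacity)
    {
    size_t newCapacity = gs->capacity;
    char *bigger;
    while (newCapacity < needed)
        newCapacity *= 2;
    bigger = realloc(gs->string, newCapacity);
    if (bigger == NULL)
        return 0;
    gs->string = bigger;
    gs->capacity = newCapacity;
    }
memcpy(gs->string + gs->size, text, len + 1);  /* Copies the terminator too. */
gs->size += len;
return 1;
}

void growStringFree(struct growString **pGs)
/* Free string and set the caller's pointer to NULL. */
{
if (*pGs != NULL)
    {
    free((*pGs)->string);
    free(*pGs);
    *pGs = NULL;
    }
}
```

The caller never has to compute a buffer size up front, which removes the most common source of string bugs in C.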

23
Object Oriented Programming in C
  • Build objects around structures.
  • Make families of functions with names that start
    with the structure name, and that take the
    structure as the first argument.
  • Implement polymorphism/virtual functions with
    function pointers in structure.
  • Inheritance is still difficult. Perhaps this is
    not such a bad thing.

24
struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's.  Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string);
/* Convert string containing sequence and possibly
 * white space and numbers to a dnaSeq. */

void dnaSeqFree(struct dnaSeq **pSeq);
/* Free dnaSeq and set pointer to NULL. */

void dnaSeqFreeList(struct dnaSeq **pList);
/* Free list of dnaSeqs. */
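
One way the declarations on this slide could be filled in is sketched below. This is not the library's implementation: which characters count as bases (a, c, g, t and n in either case, stored as lower case) is an assumption made for the example, and the sequence name is left unset.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name (left NULL in this sketch). */
    char *dna;            /* Bases, null terminated. */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(const char *string)
/* Keep only base letters from string, dropping white space and numbers.
 * Returns NULL if out of memory. */
{
struct dnaSeq *seq;
size_t i, len;
assert(string != NULL);
len = strlen(string);
seq = calloc(1, sizeof(*seq));
if (seq == NULL)
    return NULL;
seq->dna = calloc(len + 1, 1);   /* Zeroed, so the result is terminated. */
if (seq->dna == NULL)
    {
    free(seq);
    return NULL;
    }
for (i = 0; i < len; ++i)
    {
    int c = tolower((unsigned char)string[i]);
    if (c == 'a' || c == 'c' || c == 'g' || c == 't' || c == 'n')
        seq->dna[seq->size++] = (char)c;
    }
return seq;
}

void dnaSeqFree(struct dnaSeq **pSeq)
/* Free dnaSeq and set pointer to NULL. */
{
struct dnaSeq *seq = *pSeq;
if (seq != NULL)
    {
    free(seq->name);
    free(seq->dna);
    free(seq);
    *pSeq = NULL;
    }
}

void dnaSeqFreeList(struct dnaSeq **pList)
/* Free list of dnaSeqs. */
{
struct dnaSeq *el, *next;
for (el = *pList; el != NULL; el = next)
    {
    next = el->next;   /* Save the link before el is freed. */
    dnaSeqFree(&el);
    }
*pList = NULL;
}
```

Every function starts with the structure name, so the whole family is easy to find with a text search.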

25
struct screenObj
/* A two dimensional object in a sleazy video game. */
    {
    struct screenObj *next;   /* Next in list. */
    char *name;               /* Object name. */
    int x,y,width,height;     /* Bounds of object. */
    void (*draw)(struct screenObj *obj);   /* Draw object */
    boolean (*in)(struct screenObj *obj, int x, int y);
                              /* Return true if x,y is in object */
    void *custom;             /* Custom data for a particular type */
    void (*freeCustom)(struct screenObj *obj);
                              /* Free custom data. */
    };

#define screenObjDraw(obj) (obj->draw(obj))
/* Draw object. */

void screenObjFree(struct screenObj **pObj);
/* Free up screen object including custom part. */
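
The function-pointer approach can be tried out with a trimmed-down version of this structure. The sketch below is an illustration under assumptions, not code from the library: the box and circle types, `boxNew`, `circleNew` and the `screenObjIn` macro are all hypothetical, and drawing is left out so the example stays short.

```c
#include <assert.h>
#include <stdlib.h>

#define boolean int

struct screenObj
/* Trimmed-down screen object with one virtual function. */
    {
    int x,y,width,height;   /* Bounds of object. */
    boolean (*in)(struct screenObj *obj, int x, int y);  /* True if x,y is in object. */
    void *custom;           /* Custom data for a particular type. */
    void (*freeCustom)(struct screenObj *obj);  /* Free custom data, may be NULL. */
    };

#define screenObjIn(obj, x, y) ((obj)->in((obj), (x), (y)))
/* Call whichever hit test the object was built with. */

static boolean boxIn(struct screenObj *obj, int x, int y)
/* Hit test for a box: anywhere inside the bounds. */
{
return x >= obj->x && x < obj->x + obj->width
    && y >= obj->y && y < obj->y + obj->height;
}

static boolean circleIn(struct screenObj *obj, int x, int y)
/* Hit test for a circle: within radius of the center of the bounds. */
{
int radius = *(int *)obj->custom;
int dx = x - (obj->x + radius), dy = y - (obj->y + radius);
return dx*dx + dy*dy <= radius*radius;
}

static void circleFreeCustom(struct screenObj *obj)
/* Free the radius stored in custom. */
{
free(obj->custom);
obj->custom = NULL;
}

struct screenObj *boxNew(int x, int y, int width, int height)
/* Make a box. Returns NULL if out of memory. */
{
struct screenObj *obj = calloc(1, sizeof(*obj));
if (obj == NULL)
    return NULL;
obj->x = x;
obj->y = y;
obj->width = width;
obj->height = height;
obj->in = boxIn;
return obj;
}

struct screenObj *circleNew(int centerX, int centerY, int radius)
/* Make a circle, reusing boxNew for the bounds and overriding the hit test. */
{
struct screenObj *obj;
int *pRadius;
assert(radius > 0);
obj = boxNew(centerX - radius, centerY - radius, 2*radius, 2*radius);
pRadius = malloc(sizeof(*pRadius));
if (obj == NULL || pRadius == NULL)
    {
    free(obj);
    free(pRadius);
    return NULL;
    }
*pRadius = radius;
obj->custom = pRadius;
obj->in = circleIn;
obj->freeCustom = circleFreeCustom;
return obj;
}

void screenObjFree(struct screenObj **pObj)
/* Free up screen object including custom part. */
{
struct screenObj *obj = *pObj;
if (obj != NULL)
    {
    if (obj->freeCustom != NULL)
        obj->freeCustom(obj);
    free(obj);
    *pObj = NULL;
    }
}
```

Code that calls `screenObjIn` does not need to know which kind of object it holds, which is all that polymorphism amounts to here.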

26
Naming Conventions
  • Code is constrained by few natural laws.
  • There are many ways to do things, so programmers
    make arbitrary decisions.
  • Arbitrary decisions are hard to remember.
  • Conventions make decisions less arbitrary.
  • varName vs. VarName vs. varname vs. var_name. We
    use varName.
  • variable vs. var vs. vrbl vs. vble vs. varible: if
    you need to abbreviate, keep it short.

27
Commenting Conventions
  • Each module has a comment describing its overall
    purpose.
  • Each function also has an overall comment.
  • Each field in a structure has a comment.
  • Longer functions broken into paragraphs that
    each begin with a comment.
  • The module, function, and structure comments are
    replicated in the .h file, which serves as an
    index to the module.

28
Error Handling
  • Code prints out a message and aborts (via the
    errAbort function) when there is a problem.
  • This saves loads of error handling code and is
    generally the right thing to do.
  • You can catch an errAbort if necessary, though
    it rarely is.
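
The print-and-abort pattern, along with one way of catching an abort, can be sketched with setjmp and longjmp from the standard C library. This is an illustration only, not the errAbort implementation: `sketchAbort`, `parseUnsigned` and `tryParseUnsigned` are hypothetical names, and the single global catcher is an assumption that keeps the example short.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf *abortCatcher = NULL;  /* Where to jump on abort, or NULL to exit. */

void sketchAbort(const char *format, ...)
/* Print message to stderr, then jump to the catcher if one is
 * installed, otherwise exit the program. */
{
va_list args;
va_start(args, format);
vfprintf(stderr, format, args);
va_end(args);
fputc('\n', stderr);
if (abortCatcher != NULL)
    longjmp(*abortCatcher, 1);
exit(EXIT_FAILURE);
}

unsigned parseUnsigned(const char *s)
/* Convert s to unsigned, aborting if it is not all digits.
 * Overflow is not checked in this sketch. */
{
unsigned val = 0;
const char *p;
assert(s != NULL);
if (*s == 0)
    sketchAbort("empty string where a number was expected");
for (p = s; *p != 0; ++p)
    {
    if (*p < '0' || *p > '9')
        sketchAbort("invalid unsigned number: %s", s);
    val = val*10 + (unsigned)(*p - '0');
    }
return val;
}

int tryParseUnsigned(const char *s, unsigned *retVal)
/* Catch an abort from parseUnsigned. Returns 1 and sets *retVal
 * on success, 0 if the parse aborted. */
{
jmp_buf catcher;
jmp_buf *oldCatcher = abortCatcher;
volatile int ok = 0;
abortCatcher = &catcher;
if (setjmp(catcher) == 0)
    {
    *retVal = parseUnsigned(s);
    ok = 1;
    }
abortCatcher = oldCatcher;   /* Restore the previous catcher either way. */
return ok;
}
```

Note that `parseUnsigned` itself contains no error-return plumbing at all, which is the saving the slide describes; only the rare caller that wants to recover pays for the catch.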

29
Memory
  • Uninitialized memory leads to difficult bugs.
  • Compiler set to warn of uninitialized vars
  • Dynamic memory goes through needMem. It is
    always zeroed.
  • Memory usually freed with freez(), which sets
    pointer to null as well as freeing it.
  • Careful memory handler can be pushed to help
    track down memory bugs
  • Sentinel values to detect writing past end of
    array
  • Detects memory freed twice or not freed
  • Detects heap corruption in general.
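
The two habits at the top of this list, always-zeroed allocation and freeing that also nulls the pointer, fit in a few lines. The sketch below is not needMem or freez themselves: `zeroedAlloc` and `freeAndNull` are hypothetical stand-ins, and exiting on allocation failure is an assumption that follows the print-and-abort style described earlier.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

void *zeroedAlloc(size_t size)
/* Allocate size bytes of zeroed memory, or print a message and exit. */
{
void *pt;
assert(size > 0);
pt = calloc(1, size);
if (pt == NULL)
    {
    fprintf(stderr, "Out of memory allocating %lu bytes\n", (unsigned long)size);
    exit(EXIT_FAILURE);
    }
return pt;
}

#define freeAndNull(pPt) do { free(*(pPt)); *(pPt) = NULL; } while (0)
/* Free what *pPt points to and set *pPt to NULL.  Pass the address
 * of the pointer.  Safe to call twice since free(NULL) does nothing. */
```

Because the pointer is NULL after freeing, a later use fails immediately instead of silently reading stale memory.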

30
(No Transcript)
31
Generally Useful Modules
  • String handling - common, dystring, wildcmp
  • Collections - common (singly linked lists), hash,
    dlist, binRange, rbTree
  • DNA - dnautils, dnaseq
  • Web - htmshell, cheapcgi, htmlPage
  • I/O - linefile, xap (XML), fa, nib, twoBit,
    blastParse, blastOut, maf, chain, gff
  • Graphics - memgfx, gifwrite, psGfx, vGfx

32
Anatomy of a CGI Script
  • Gets called by Web Server when user clicks submit
    or follows a cgi link.
  • Input is in environment variables and sometimes
    also stdin. Routines in cheapCgi move this to a
    hash table.
  • Output is to stdout. Routines in htmshell help
    with output formatting.
  • In the middle it often accesses a database.
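
The input side can be sketched as a lookup in the query string, which a web server hands to a CGI program in the QUERY_STRING environment variable. This is a simplified illustration, not the cheapCgi code: `queryVarFind` is a hypothetical name, it scans the string on each call rather than building a hash table, and it does not URL-decode values.

```c
#include <assert.h>
#include <string.h>

int queryVarFind(const char *query, const char *name, char *buf, size_t bufSize)
/* Look up name in a query string such as "chrom=chr22&start=14600000".
 * Copies the raw (not URL-decoded) value into buf.  Returns 1 if found,
 * 0 if the variable is missing or its value does not fit in buf. */
{
size_t nameLen;
const char *p;
assert(query != NULL && name != NULL && buf != NULL);
nameLen = strlen(name);
p = query;
while (*p != 0)
    {
    const char *end = strchr(p, '&');
    size_t pairLen = (end != NULL) ? (size_t)(end - p) : strlen(p);
    if (pairLen > nameLen && strncmp(p, name, nameLen) == 0 && p[nameLen] == '=')
        {
        size_t valLen = pairLen - nameLen - 1;
        if (valLen >= bufSize)
            return 0;
        memcpy(buf, p + nameLen + 1, valLen);
        buf[valLen] = 0;
        return 1;
        }
    p += pairLen;        /* Move to the end of this name=value pair. */
    if (*p == '&')
        ++p;             /* Skip the separator. */
    }
return 0;
}
```

In a real CGI program the first argument would come from `getenv("QUERY_STRING")`, and the program would print a `Content-Type` header and a blank line to stdout before any HTML.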

33
Challenges of CGI
  • Each click launches program anew.
  • User state can be kept in cart variables
  • Run from Web Server, harder to debug
  • Use cgiSpoof to run from command line
  • Push an error handler that will close out the web
    page, so you can see your error messages. htmShell
    does this, but webShell may not.
  • Ideally should run in less than 2 seconds.

34
Relational Databases
  • Relational databases consist of tables, indices,
    and the Structured Query Language (SQL).
  • Tables are much like tab-separated files:
      chrom  start     end       name   strand  score
      chr22  14600000  14612345  ldlr   +       0.989
      chr21  18283999  18298577  vldlr  -       0.998
  • Fields are simple - no lists or substructures.
  • Can join tables based on a shared field. This is
    flexible, but only as fast as the index.
  • Tables and joins are accessed a row at a time.
  • The row is represented as an array of strings.

35
Converting A Row to Object
struct exoFish *exoFishLoad(char **row)
/* Load a exoFish from row fetched with select * from exoFish
 * from database.  Dispose of this with exoFishFree(). */
{
struct exoFish *ret;

AllocVar(ret);
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}
36
Motivation for AutoSql
  • Row to object code is tedious at best.
  • Also have save object, free object code to write.
  • SQL create statement needs to match C structure.
  • Lack of lists without doing a join can seriously
    impact performance and complicate schema.
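
The save and free routines mentioned above look much like the load routine run backwards. The sketch below shows roughly what has to be written by hand for each table; it is not the output of autoSql. `exoFishNew` and `exoFishTabLine` are hypothetical names, the field list follows the row-to-object slide, and the values in any example call are made up.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct exoFish
/* Fields as listed on the row-to-object slide. */
    {
    char *chrom;          /* Chromosome name. */
    unsigned chromStart;  /* Start position. */
    unsigned chromEnd;    /* End position. */
    char *name;           /* Item name. */
    unsigned score;       /* Score. */
    };

static char *copyString(const char *s)
/* Return a heap copy of s, or NULL if out of memory. */
{
char *copy = malloc(strlen(s) + 1);
if (copy != NULL)
    strcpy(copy, s);
return copy;
}

void exoFishFree(struct exoFish **pEl)
/* Free exoFish including its strings and set pointer to NULL. */
{
struct exoFish *el = *pEl;
if (el != NULL)
    {
    free(el->chrom);
    free(el->name);
    free(el);
    *pEl = NULL;
    }
}

struct exoFish *exoFishNew(const char *chrom, unsigned chromStart,
    unsigned chromEnd, const char *name, unsigned score)
/* Make an exoFish with its own copies of the strings.
 * Returns NULL if out of memory. */
{
struct exoFish *el = calloc(1, sizeof(*el));
assert(chrom != NULL && name != NULL);
if (el == NULL)
    return NULL;
el->chrom = copyString(chrom);
el->name = copyString(name);
el->chromStart = chromStart;
el->chromEnd = chromEnd;
el->score = score;
if (el->chrom == NULL || el->name == NULL)
    exoFishFree(&el);
return el;
}

int exoFishTabLine(const struct exoFish *el, char *buf, size_t bufSize)
/* Format el as one tab-separated line in buf.  Returns 1 if it fit. */
{
int n = snprintf(buf, bufSize, "%s\t%u\t%u\t%s\t%u\n",
    el->chrom, el->chromStart, el->chromEnd, el->name, el->score);
return n >= 0 && (size_t)n < bufSize;
}
```

The field order appears three times here (structure, constructor, format string) and must also match the SQL create statement, which is exactly the duplication a code generator removes.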

37
AutoSql Data Declaration
table exoFish
"An evolutionarily conserved region (ecore) with Tetroadon"
    (
    string chrom;       "Human chromosome or FPC contig"
    uint chromStart;    "Start position in chromosome"
    uint chromEnd;      "End position in chromosome"
    string name;        "Ecore name in Genoscope database"
    uint score;         "Score from 0 to 1000"
    )
See autoSql.doc for more details. See also autoXml.
38
Coding Conclusion
  • It's always safer on the lagging edge
  • Consider redesigning system as COBOL
    character-based application

39
UCSC Gene Family Browser
Expression and other information on genes in a
big sorted, linked table
51
Up in Testes, Down in Brain
64
Conclusions
  • Genome browser - good for exploring genome and
    displaying your custom tracks
  • kent code base - a good starting point for many
    programming projects
  • Family browser - a fine way to collect data sets.
  • Browser staff - helpful but overworked.