Title: Analyzing cDNA microarray data using Python and the C clustering library: Why scripts are better than GUIs
1. Analyzing cDNA microarray data using Python and
the C clustering library: Why scripts are better
than GUIs
- Michiel de Hoon, Seiya Imoto, Satoru Miyano
- Human Genome Center, University of Tokyo
The Third Bioinformatics Open Source Conference,
BOSC 2002, August 1-2, 2002, Edmonton, Canada
2. Scripting languages are already heavily used in
bioinformatics
- Bioperl: http://www.bioperl.org/
- Biopython: http://www.biopython.org/
- Bioruby: http://www.bioruby.org/
- G-language: http://www.g-language.org/ (uses Perl)
However, numerical analysis of cDNA microarray
data is still dominated by GUI-based codes.
Why?
Because excellent GUI-based codes are available
for gene expression data analysis (such as
Cluster/TreeView by Michael Eisen, and
GeneCluster by Pablo Tamayo).
3. What scripting languages can do for you (Perl,
Python, Ruby, Tcl, ...)
- Easier to write, less prone to bugs, ideal for developing new algorithms
  - Avoid checking your algorithm and chasing pointer errors at the same time
- More flexible than GUIs
- Allow batch processing
- Can run on any platform (Windows, Cygwin, Macintosh, Unix)
  - Including supercomputers!
- Often compiler-independent (unlike GUI-based code)
  - Makes open source software development easier
- A large number of people have contributed to scripting languages, so software packages are often already available:
  - File handling
  - Text parsing
  - Graphics
  - Numerics: data structures, algorithms, random number generators
  - Beats writing C/Fortran code from scratch
- One script can contain a complete data analysis
  - Downloading data from a database, file parsing, numerical data analysis, drawing figures
  - Invaluable for replication (see the example on our website)
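As a rough illustration of the "one script, complete analysis" idea, the sketch below parses a small table, computes log ratios, and prints a crude text figure. It is not the example from the website: the embedded table, the gene names, and the helper names (`parse`, `mean_log_ratio`, `text_bar`) are all hypothetical, and the download step is replaced by an inline string so the script runs offline.

```python
import csv
import io
import math

# Hypothetical tab-delimited data: gene name, then red/green intensities
# for two arrays. A real script might download this text instead.
RAW = """gene\tred1\tgreen1\tred2\tgreen2
YAL001C\t400\t100\t800\t100
YAL002W\t100\t100\t50\t100
YAL003W\t25\t100\t\t100
"""

def parse(text):
    """Parse the table into {gene: [log2(red/green) or None, ...]}."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row
    ratios = {}
    for row in reader:
        gene, values = row[0], row[1:]
        logs = []
        # Columns alternate red, green, red, green, ...
        for red, green in zip(values[0::2], values[1::2]):
            if red and green:
                logs.append(math.log2(float(red) / float(green)))
            else:
                logs.append(None)  # missing measurement
        ratios[gene] = logs
    return ratios

def mean_log_ratio(logs):
    """Average the log ratios that are present, ignoring missing values."""
    present = [x for x in logs if x is not None]
    return sum(present) / len(present)

def text_bar(value, scale=4):
    """Crude text 'figure': '+' for induced genes, '-' for repressed ones."""
    symbol = "+" if value >= 0 else "-"
    return symbol * round(abs(value) * scale)

ratios = parse(RAW)
summary = {gene: mean_log_ratio(logs) for gene, logs in ratios.items()}
for gene, value in summary.items():
    print(f"{gene}\t{value:+.2f}\t{text_bar(value)}")
```

Because every step lives in one file, rerunning the script reproduces the whole analysis, which is the replication point made above.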
4. Scripting languages make code development easier
- Write your new algorithm in Python
- Test it
- Improve the algorithm (then test again; repeat as needed)
- Implement the numerically intensive routines in C
  - which can be called from Python
  - thus combining the speed of C with the flexibility of Python
- If needed, the C routines can be used in other programs as well
  - so scripting languages can make development of GUI-based codes easier too
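A minimal sketch of this workflow, assuming a k-means prototype written first in plain Python. This is not the C clustering library's implementation; the function names (`squared_distance`, `centroids_of`, `kmeans`) and the sample data are hypothetical. The distance routine is kept separate and passed in as a parameter, so that it could later be swapped for a version implemented in C without touching the rest of the algorithm.

```python
def squared_distance(a, b):
    """Inner loop: the natural candidate for a later rewrite in C."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroids_of(data, assignment, k):
    """Mean of the rows assigned to each cluster (None if a cluster is empty)."""
    ncols = len(data[0])
    sums = [[0.0] * ncols for _ in range(k)]
    counts = [0] * k
    for row, cluster in zip(data, assignment):
        counts[cluster] += 1
        for j, value in enumerate(row):
            sums[cluster][j] += value
    return [[s / counts[c] for s in sums[c]] if counts[c] else None
            for c in range(k)]

def kmeans(data, assignment, k, distance=squared_distance, max_iter=100):
    """Plain k-means starting from a given assignment (so it is deterministic)."""
    assignment = list(assignment)
    for _ in range(max_iter):
        centers = centroids_of(data, assignment, k)
        new = []
        for row in data:
            # Reassign each row to the nearest non-empty cluster center.
            candidates = [(distance(row, centers[c]), c)
                          for c in range(k) if centers[c] is not None]
            new.append(min(candidates)[1])
        if new == assignment:
            break  # converged: no row changed cluster
        assignment = new
    return assignment

# Hypothetical data: two obvious groups, deliberately mixed at the start.
data = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]
result = kmeans(data, [0, 1, 0, 1], k=2)
print(result)
```

Once the prototype behaves correctly, only `squared_distance` (or the whole inner loop) needs to move to C; the tests written against the Python version can then be reused to check the C version.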
5. An example: The C clustering library
- The C clustering library contains routines for commonly used clustering methods:
  - hierarchical clustering: pairwise single, maximum, centroid, and average linkage
  - k-means clustering
  - self-organizing maps on a 2D rectangular grid
  - principal component analysis
- The C clustering library can be used in three ways:
  - by calling routines in the library from other programs
  - as an extension module for Python
  - through the improved version of the GUI-code Cluster/TreeView (which calls routines in the C clustering library)
- All three are available from our website.
The C clustering library as a Python extension
module has been compiled successfully on Windows,
Linux, and Unix (SGI-Cray Origin2000) systems
using GNU's gcc. No commercial compiler is needed,
even for Cluster/TreeView. The library was
released under the GNU Lesser General Public
License.
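To make the linkage names concrete, here is a small pure-Python sketch of pairwise single, maximum, and average linkage. It is an illustration only and does not use or reproduce the C clustering library's routines; the function names, the `LINKAGE` table, and the sample points are hypothetical, and the naive search shown here is far slower than a C implementation would be.

```python
from itertools import combinations
from statistics import mean

def euclidean(a, b):
    """Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# How the member-to-member distances of two clusters are combined.
LINKAGE = {"single": min, "maximum": max, "average": mean}

def hierarchical(data, method="average"):
    """Return the list of merges as (members_a, members_b, distance)."""
    combine = LINKAGE[method]
    clusters = [(i,) for i in range(len(data))]  # start: one row per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = combine([euclidean(data[i], data[j]) for i in a for j in b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical one-dimensional data.
points = [[0.0], [1.0], [5.0], [7.0]]
merges = hierarchical(points, "average")
for a, b, d in merges:
    print(a, b, round(d, 2))
```

The three methods differ only in the function used to combine pairwise distances, which is why they fit in one lookup table here. Centroid linkage is left out because it compares cluster means rather than member-to-member distances.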
6. At http://bonsai.ims.u-tokyo.ac.jp, follow the
link to
7. At our poster, you will find more examples of
using the C clustering library
- Downloading, analyzing, and clustering of gene expression data with Python
- An implementation in Python of a bootstrap calculation of hierarchical clustering
- Cluster/TreeView 3.0
How to find us: Go to http://bonsai.ims.u-tokyo.ac.jp, click on