Data-intensive Computing: Case Study Area 1: Bioinformatics presentation

1
Data-intensive Computing Case Study Area 1
Bioinformatics

2
Human Genetics

3
Human cell

Base pairs of DNA: C-G and A-T.
C: cytosine, G: guanine, A: adenine, T: thymine.
Each human cell contains approximately 3 billion base pairs.
The DNA of a single cell contains so much information that, if it were represented in printed words, simply listing the first letter of each base would require over 1.5 million pages of text!
If laid end to end, the DNA strand measures about 2 to 3 meters.
DNA is a single large molecule in the nucleus of the cell.
It is coiled as a double helix.
Each strand of the DNA molecule is made of A, C, G and T, for example AAAGTTCTTAATTA, which is matched on the other strand by the complementary bases TTTCAAGAATTAAT.
These strings of letters contain all the codes needed for human functions.
Ref. text: Bioinformatics: Databases, Tools and Algorithms, by O. Bosu and S. K. Thukral.
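The base-pairing rule on this slide (A with T, C with G) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the names PAIRS and complement are hypothetical, and the sketch pairs bases position by position without modelling strand direction.

```python
# Pairing rules from the slide: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the matching base for each position of a DNA strand."""
    return "".join(PAIRS[base] for base in strand)


# The example strand from the slide and its matching strand.
print(complement("AAAGTTCTTAATTA"))  # TTTCAAGAATTAAT
```

Applying complement twice returns the original strand, which is why either strand is enough to reconstruct the other.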

4
More details

Sequences of base pairs are grouped into meaningful units: genes.
When a gene needs to be activated, the DNA molecule in the cell nucleus uncoils and unfurls just far enough to expose that gene.
From the exposed section of the DNA an RNA is formed: mRNA, or messenger RNA, which carries with it the print of the open DNA section.
RNA and DNA differ in one respect that matters here: RNA does not contain T (thymine); it has U (uracil) instead. RNA is also short-lived.
Once the mRNA is formed, the open sections of the DNA close off.
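The step from an exposed DNA section to mRNA can be sketched as a string operation. This is a simplified illustration under stated assumptions: it ignores strand direction and any later RNA processing, and the function names are hypothetical. It shows two equivalent views: building the mRNA by pairing against the template strand, or copying the other (coding) strand with U in place of T.

```python
# Assumption: simplified model that ignores strand direction and RNA processing.
# Pairing used when mRNA is built against the DNA template strand:
# RNA has U (uracil) where DNA would have T (thymine).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_from_template(template: str) -> str:
    """Build the mRNA by pairing each base of the template strand."""
    return "".join(DNA_TO_RNA[base] for base in template)


def transcribe_from_coding(coding: str) -> str:
    """Equivalent view: the mRNA reads like the coding strand with T replaced by U."""
    return coding.replace("T", "U")


# The two strands from the earlier slide give the same mRNA.
print(transcribe_from_template("TTTCAAGAATTAAT"))  # AAAGUUCUUAAUUA
print(transcribe_from_coding("AAAGTTCTTAATTA"))    # AAAGUUCUUAAUUA
```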

5
Protein formation

mRNA travels to the cytoplasm, where it meets the ribosome (rRNA).
The ribosome reads the code in the mRNA three bases at a time (each triplet is a codon) and selects the corresponding amino acid.
Twenty amino acids are prevalent in human cells.
Example: the codons GCU, GCC and GCA all correspond to alanine.
In effect, the ribosome is a process-control computer that takes codons as input and produces amino acids as output.
Amino acids polymerize and form polypeptide chains called proteins.
Proteins fold and form basic structures such as skin and hair.
Even though the brain controls major human functions, at the cell level it is the DNA that has command and control.
DNA is a fixed code for a given human (WORM, write once read many, characteristics).
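The "ribosome as a computer" picture, codons in and amino acids out, can be sketched as a table lookup. This is an illustrative sketch only: CODON_TABLE below is deliberately partial (the full genetic code has 64 codons), and the names translate and CODON_TABLE are hypothetical.

```python
# Partial codon table for illustration only; the full genetic code has 64 entries.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}


def translate(mrna: str) -> list[str]:
    """Read the mRNA three bases at a time and emit amino acids until a stop codon."""
    chain = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        codon = mrna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "?")  # "?" marks codons missing from this partial table
        if amino_acid == "STOP":
            break
        chain.append(amino_acid)
    return chain


print(translate("AUGGCUGCCGCAUAA"))  # ['Met', 'Ala', 'Ala', 'Ala']
```

Three different codons map to the same amino acid here, which matches the alanine example on the slide.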

6
Life's processes

DNA is the program that controls the functions, operations and structure of a cell and, in turn, our life processes.
Life processes in fact depend on the program in the DNA and on the hundreds of millions of ribosomes.
Life in this context appears as an immense distributed system.

7
Bioinformatics

Can we study, understand and analyze this immensely complex system, its structure and its programs?
University of Arizona's Tree of Life project (ToL): http://tolweb.org
Human Genome Project (NIH and DOE): identifying the approximately 30,000 genes in human DNA and determining the sequence of the three billion bases that make up human DNA.
Out of the 30,000 genes, we do not know the functions of more than 50% of them.
99.9% of the nucleotide sequence is the same for all of us.
0.1% is attributed to individual differences such as race, color of skin and disposition to diseases.
High-throughput sequencing is generating ultra-scale biological data. How do we analyze this data?
That is a data-intensive problem.
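A back-of-the-envelope calculation gives a feel for the scale behind these figures. The genome length and the 0.1% figure come from the slide; the 2-bits-per-base encoding (four letters fit in two bits) is an assumption of this sketch, not something the slides state.

```python
GENOME_BASES = 3_000_000_000   # about three billion bases (from the slide)

# 0.1% of the sequence differs between individuals (from the slide): 1 base in 1000.
differing_bases = GENOME_BASES // 1000

# Assumption: each base (A, C, G or T) is stored in 2 bits, with no metadata or compression.
BITS_PER_BASE = 2
genome_bytes = GENOME_BASES * BITS_PER_BASE // 8

print(f"{differing_bases:,} bases differ between two people")    # 3,000,000 bases differ between two people
print(f"{genome_bytes / 1e6} MB per genome at 2 bits per base")  # 750.0 MB per genome at 2 bits per base
```

One genome is modest by itself; the data-intensive part comes from sequencing many individuals and from the raw instrument output, which is far larger than the finished sequence.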

8
Existing solutions?

Traditional databases to store, retrieve, analyze and/or make predictions from huge volumes of biological data.
Software tools for implementing algorithms and developing applications for in-silico experiments.
Visualization tools, user interfaces and web accessibility for searching through the data.
Machine learning and data mining methodologies.

9
Databases

10
Tools

11
How can we help?

How can we leverage our knowledge of large-scale data management to address bioinformatics problems? DC methods.
There are a large number of tools and data sets. How do we standardize the efforts so that they are complementary rather than repetitive? Cloud computing.

12
Text Mining vs Genetic Sequence Mining (Dot plot)

Text dot plot: CORRELATIONS (horizontal axis) against RELATIONSHIP (vertical axis).
Sequence dot plot: ACTCTAGGAGTC (horizontal axis) against GATAATTCGATC (vertical axis).
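A dot plot marks every cell where the character on the horizontal axis equals the character on the vertical axis; shared substrings then show up as diagonal runs. The sketch below is a minimal, assumed implementation (the function name dot_plot and the "*" and "." symbols are choices made here) applied to the two pairs of strings from this slide.

```python
def dot_plot(horizontal: str, vertical: str) -> list[str]:
    """Return one row per vertical character, with '*' wherever the two characters match."""
    return [
        "".join("*" if h == v else "." for h in horizontal)
        for v in vertical
    ]


def show(horizontal: str, vertical: str) -> None:
    """Print the dot plot with axis labels."""
    print("  " + horizontal)
    for label, row in zip(vertical, dot_plot(horizontal, vertical)):
        print(label + " " + row)
    print()


# Text example from the slide: the shared substring "RELATIONS" appears as a diagonal.
show("CORRELATIONS", "RELATIONSHIP")

# Genetic sequence example from the slide.
show("ACTCTAGGAGTC", "GATAATTCGATC")
```

The same comparison works for text and for DNA, which is the point of the slide: sequence mining can reuse ideas from text mining. The full grid costs time and memory proportional to the product of the two lengths, so genome-scale comparisons need more scalable methods.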
