Title: Data Quality Issues: Traps & Pitfalls
1. Data Quality Issues: Traps & Pitfalls
- Ashok Kolaskar
- Vice-Chancellor
- University of Pune, Pune 411 007. India
- puvc_at_unipune.ernet.in
2. Cancer cell growth appears to be related to the evolutionary development of plump fruits and vegetables
- Large tomatoes can evolve from wild, blueberry-size tomatoes. The genetic mechanism responsible for this is similar to the one that proliferates cancer cells in mammals.
- This is a case where we found a connection between agricultural research, in how plants make edible fruit, and how humans become susceptible to cancer. That's a connection nobody could have made in the past.
Cornell University News, July 2000
3. Size of Tomato Fruit
A single gene, ORFX, responsible for the QTL, has sequence and structural similarity to the human oncogene c-H-ras p21.
Fruit size alterations imparted by fw2.2 alleles are most likely due to changes in regulation rather than in the sequence/structure of the protein.
- fw2.2: A Quantitative Trait Locus (QTL) Key to the Evolution of Tomato Fruit Size. Anne Frary et al. (2000) Science 289: 85-88.
4. Genome Update: Public Domain
- Published Complete Genomes: 59
  - Archaeal: 9
  - Bacterial: 36
  - Eukaryal: 14
- Ongoing Genomes: 335
  - Prokaryotic: 203
  - Eukaryotic: 132
- The private sector holds data on more than 100 finished and unfinished genomes.
5. Challenges in the Post-Genomic Era: Unlocking the Secrets of Quantitative Variation
- For even after genomes have been sequenced and the functions of most genes revealed, we will have no better understanding of the naturally occurring variation that determines why one person is more disease-prone than another, or why one variety of tomato yields more fruit than the next.
- Identifying genes like fw2.2 is a critical first step toward attaining this understanding.
6. Value of Genome Sequence Data
- Genome sequence data provides, in a rapid and cost-effective manner, the primary information used by each organism to carry on all of its life functions.
- This data set constitutes a stable, primary resource for both basic and applied research.
- This resource is the essential link required to efficiently utilize the vast amounts of potentially applicable data and expertise available in other segments of the biomedical research community.
7. Challenges
- Genome databases have individual genes with relatively limited functional annotation (enzymatic reaction, structural role).
- Molecular reactions need to be placed in the context of higher-level cellular functions.
8. Nature of Biological Data
- Biomolecular Sequence Data
- Nucleic acids
- Protein
- Carbohydrates
- Genes and Genome
- Biomolecular structure data
- Pathways/wire diagrams
- DNA array data
- Protein array data
9. Bioinformatics Databases
- Usually organised in flat files
- Huge collections of data
- Include alpha-numeric and pictorial data
- The latest databases include gene/protein expression data (images)
- Demands:
  - High-quality curated data
  - Interconnectivity between data sets
  - Fast and accurate data retrieval tools
  - Queries using fuzzy logic
  - Excellent data mining tools for sequence and structural patterns (a small sketch follows below)
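As an illustration of the kind of pattern-mining tool demanded above, the sketch below scans protein sequences for a PROSITE-style motif written as a regular expression. The motif and the sequences are hypothetical, chosen only to show the idea.

```python
# Toy pattern-mining sketch: search protein sequences for a PROSITE-style
# motif expressed as a regular expression. Motif and sequences are made up.
import re

MOTIF = re.compile(r"C..C[HK]")   # hypothetical motif: Cys-x-x-Cys-(His or Lys)

sequences = {
    "protein_A": "MKTAYCLVCHGARNDSW",
    "protein_B": "MGLSDQERTAAKK",
}

for name, seq in sequences.items():
    for match in MOTIF.finditer(seq):
        print(f"{name}: {match.group()} at position {match.start() + 1}")
```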
10. What is CODATA?
- CODATA is the Committee on Data for Science and Technology of the International Council of Scientific Unions.
- It was established in 1966 to improve the quality, reliability, processing, management and accessibility of data for science and technology.
- The CODATA Task Group on Biological Macromolecules has recently surveyed quality control issues of archival databanks in molecular biology.
11. Task Group on Biological Macromolecules
12. Quality Control Issues
- The quality of archived data can, of course, be no better than the data determined in the contributing laboratories.
- Nevertheless, careful curation of the data can help to identify errors.
- Disagreement between duplicate determinations is, as always, a clear warning of an error in one or the other.
- Similarly, results that disagree with established principles may contain errors.
- It is useful, for instance, to flag deviations from expected stereochemistry in protein structures, but such "outliers" are not necessarily wrong.
13. Quality Control Issues (contd.)
- The state of the experimental art is the most important determinant of data quality.
- Quality control procedures provide the second level of protection. Indices of quality, even if they do not permit error correction, can help scientists avoid basing conclusions on questionable data.
14. Typical Databank Record: Journey from Entry to Distribution
- A. Sequence in journal publication; nucleic acid sequence not found in the EMBL Data Library.
- Data input: sequence and journal information keyboarded three times.
- Data verification: different keyboardings compared.
- Release of data: directly after verification, sequences were added to the public dataset.
15. Typical Databank Record: Journey from Entry to Distribution
- B. Nucleic acid sequence submitted to the EMBL Data Library with no associated publication.
- Data input: nucleic acid sequence translated into protein sequence.
- Data verification: none.
- Release of data: directly after data input, sequences were added to the public dataset.
16. Typical Databank Record: Journey from Entry to Distribution
- C. Nucleic acid sequence submitted to the EMBL Data Library with an associated publication; protein sequence displayed in the paper.
- Data input: nucleic acid sequence translated into protein sequence.
- Data verification: sequence and journal information keyboarded once; comparison of the translation with the published sequence.
- Release of data: directly after verification, sequences were added to the public dataset.
17. Typical Databank Record: Journey from Entry to Distribution
- D. Nucleic acid sequence submitted to the EMBL Data Library with no associated publication; protein sequence NOT displayed in the paper.
- Data input: nucleic acid sequence translated into protein sequence.
- Data verification: journal information keyboarded once; comparison of journal information.
- Release of data: directly after verification, sequences were added to the public dataset.
18. Errors in DNA Sequence and Data Annotation
- Current technology should reduce error rates to as low as 1 base in 10,000, since every base is sequenced 6-10 times, with at least one reading per strand.
- Therefore, in a prokaryote, an error of one isolated wrong base would result in one amino acid error in every 10-15 proteins (a back-of-envelope sketch follows below).
- In the human genome, gene-dense regions contain about 1 gene per 10,000 bases, with the average estimated at 1 gene per 30,000 bases.
- Therefore, the corresponding error rate would be roughly one amino acid substitution in 100 proteins.
- But large-scale errors in sequence assembly can also occur: missing a nucleotide causes a frameshift error.
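The arithmetic above can be made explicit with a small back-of-envelope calculation. The sketch below uses the slide's 1-in-10,000 error rate; the average coding length per gene is an assumed, illustrative value, and the answer changes substantially with that assumption.

```python
# Back-of-envelope sketch: how a per-base sequencing error rate translates into
# amino acid errors in predicted proteins. Error rate from the slide; the
# coding length per gene is an assumption for illustration only.

ERROR_RATE = 1.0 / 10_000          # one wrong base per 10,000 bases

def expected_errors_per_protein(coding_bases_per_gene: float) -> float:
    """Expected wrong amino acids per predicted protein, assuming every
    isolated base error inside a coding region changes one codon."""
    return ERROR_RATE * coding_bases_per_gene

# Prokaryote: densely packed genes, roughly 1,000 coding bases per gene (assumed).
errors = expected_errors_per_protein(1_000)
print(f"Prokaryote: roughly 1 amino acid error per {1 / errors:.0f} proteins")
# The corresponding human figure depends strongly on the assumed gene density
# and coding fraction, which is why such estimates vary.
```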
19. DNA Data
- The DNA databases (EMBL/GenBank/DDBJ) carry out quality checks on every sequence submitted.
- No general quality control algorithm is yet in widespread use.
- Some annotations are hypothetical because they are inferences derived from the sequences, e.g., identification of coding regions. These inferences have error rates of their own (see the toy ORF scan below).
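To make concrete why coding-region identification is an inference, here is a deliberately naive single-strand ORF scan. Real gene finders use codon-usage statistics, splice-site models and homology, so a toy predictor like this has a high error rate of its own; the example sequence is synthetic.

```python
# Naive ORF scan on one strand: every reported ORF is a prediction, not an
# observation. Minimum length and the example sequence are illustrative only.

def find_orfs(seq: str, min_codons: int = 100):
    """Return (start, end) positions of ATG...stop open reading frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in ("TAA", "TAG", "TGA") and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

print(find_orfs("ATG" + "GCT" * 120 + "TAA"))   # synthetic test sequence
```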
20. Policies of PIR
- Entries in the PIR database are subject to continual update, correction, and modification. At least 20-25% of entries are updated during each quarterly release cycle.
- Every entry added or revised is run through a battery of checking programs. Some fields have a controlled vocabulary, and others are linked in ways that can be checked. For example, enzymes identified by EC number are required to have certain appropriate keywords; scientific and common names for an organism are required to be consistent.
21. Policies of PIR (contd.)
- Features are checked for the identity of the amino acids involved; e.g., disulfide bonds must involve only Cys residues (a toy version of such a check follows below).
- Standard lists and auxiliary databases used in the checking procedures include databases for enzymes, human genes, taxonomy, and residue modifications, and standard lists for journal abbreviations, keywords, superfamily names, and some other fields.
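A toy version of the disulfide-bond check mentioned above is sketched below: a feature listing a disulfide bridge must reference two Cys residues. The sequence, the feature format and the function name are simplifications invented for this illustration; real databank checking programs are far more extensive.

```python
# Toy feature-consistency check: every disulfide bond must link two Cys residues.
# Positions are 1-based, as in databank feature tables. Data are made up.

def check_disulfide_bonds(sequence: str, bonds: list) -> list:
    """Return error messages for bonds that do not link two Cys residues."""
    errors = []
    for pos1, pos2 in bonds:
        for pos in (pos1, pos2):
            if pos < 1 or pos > len(sequence) or sequence[pos - 1] != "C":
                errors.append(f"Disulfide bond {pos1}-{pos2}: residue {pos} is not Cys")
    return errors

seq = "MKTACLLVCGHAC"                       # hypothetical protein sequence
print(check_disulfide_bonds(seq, [(5, 9), (5, 13), (3, 9)]))   # flags the 3-9 bond
```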
22. Indices of Quality Maintained by the Databank
- When data from different sources are merged into a single entry, any differences in the reported sequences are explicitly shown unless they are too extensive.
23. Policies of SWISS-PROT
- An annotated protein sequence database established in 1986 and maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library.
- SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation, a minimal level of redundancy, integration with other databases, and extensive external documentation.
24. SWISS-PROT
- Contributors provide: the sequence (99.9% translated from the DNA database), bibliographic references, and cross-references to the DNA database.
- Databank staff add: annotations, keywords, the feature table, and cross-references to the DNA database.
- Processing of an entry from point of arrival through to distribution: sequence, references, annotations.
25. Yeast Genome Data: Different Centres Announce Different Numbers on the Same Day!
- MIPS: http://www.mips.biochem.mpg.de
- SGD: http://genome-www.stanford.edu
- YPD: http://www.proteome.com
- Total proteins:
  - MIPS: 4344 ORFs
  - YPD: 6149; out of these, 4270 were reported to be characterized experimentally.
  - MIPS: 6368 ORFs; out of these, about 178 correspond to small proteins of length < 100.
26. Yeast Genome Data (contd.)
- In brief, because of different definitions of "unknown" or "hypothetical" and "uncoding" or "questionable" ORFs, the number of yeast proteins whose function remains to be identified is estimated at 300 (the Cebrat "uncoding"), 1568 (the MIPS "hypothetical"), or 1879 (the YPD "unknown").
27. Annotation of Genome Data
- In general, annotation of bacterial genomes is more complete and accurate than that of eukaryotes.
- The types of errors that tend to appear are entries with frameshift sequencing errors, which lead to truncation of predicted reading frames, or even double errors leading to a mistranslated internal fragment.
- Small genes, indeed any small functionally important sequences, are likely to be missed, as they may fall below statistically significant limits.
- In higher organisms, identifying genes is harder and, in consequence, database annotation is more dubious. Experimental studies can improve the annotations.
- Alternative splicing patterns present a particular difficulty.
28. Annotation of the Human Genome
- In contrast, the sequence of the human genome is being determined in many labs, and its annotation varies from nothing, for certain regions, to gene predictions that are based on different methods and that reflect different thresholds of accepted significance.
- Therefore, the annotation of DNA sequences must be frequently updated and not frozen. It is a challenge for databanks to find ways to link primary sequence data to new and updated annotations.
29. Quantitating the Signals from DNA Arrays
- A linear response that covers two or three orders of magnitude is often needed to detect low and high copy-number transcripts on the same array.
- In cases where this is not possible, it may be necessary to scan the chip at different wavelengths, or to amplify the signal with an immune sandwich on top of the bound sample.
30. Standardization of DNA Microarrays
- Comparison of data obtained from independent arrays and from different laboratories requires standardization. The Affymetrix chips and the custom-made cDNA chips use different methods for standardization.
- The Affymetrix chips have approximately 20 probes per gene, and standardization is based either on the expression level of selected genes, like actin and GAPDH, or on a setting of the global chip intensity to approximately 150 units per gene on the chip (a toy version of global scaling follows below).
- In this way, chip data from different experiments can be compared to each other.
- In our hands, the data obtained with the two standardization methods differ only by approximately 10% (unpublished observations).
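The global-intensity method described above can be sketched as a simple rescaling: each chip is multiplied by a factor that brings its mean per-gene signal to a common target (150 units here). This is only an illustrative normalization under that stated assumption, not Affymetrix's actual algorithm, and the signal values are invented.

```python
# Minimal global-scaling sketch: rescale each chip so its mean per-gene signal
# equals a fixed target, making chips from different experiments comparable.
import numpy as np

def global_scale(chip_signal: np.ndarray, target: float = 150.0) -> np.ndarray:
    """Rescale one chip's per-gene signals to a common mean intensity."""
    return chip_signal * (target / chip_signal.mean())

chip_a = np.array([50.0, 200.0, 800.0, 120.0])      # toy per-gene signals
chip_b = np.array([90.0, 350.0, 1500.0, 210.0])     # brighter chip, same biology

print(global_scale(chip_a))
print(global_scale(chip_b))
```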
31. Samples for Expression Monitoring
- The analysis of relatively homogeneous cell populations (cloned cell lines, yeast, etc.) has proven much simpler than the analysis of tissue biopsies, as the latter often contain many cell types (epithelial, endothelial, inflammatory, nerve, muscle, and connective tissue cells) that are present in variable amounts.
- Standardization may require microdissection of the tissue to isolate specific cell types, although the number of cells needed for the assay is well above a million. Sampling of specific cell types using laser capture microdissection (LCM) can be a time-consuming task, and given that mRNA is prone to degradation, the processing time must be kept to a minimum.
32. Quantitation of Protein Array Data
- Even though there are several tools available for the quantitation of protein spots, there is at present no available procedure for quantitating all of the proteins resolved in a complex mixture.
- Part of the problem lies in the large dynamic range of protein expression, lack of resolution, post-translational modifications, and the staining behavior of the protein, as well as in the fact that many abundant proteins streak over less abundant components, interfering with the measurements.
- At present, fluorescence technology seems to be way ahead: with the fluorescent stain Sypro Ruby there is a linear response with respect to the sample amount over a wide range of abundance.
- Quantitative fluorescence measurements can be performed with CCD-camera-based systems as well as with laser scanner systems.
33. Gene Expression Profiling Techniques: Challenges & Perspectives
- A major challenge in the near future will be to define a baseline for the normal gene expression phenotype of a given cell type, tissue or body fluid.
- This is not a trivial task, however, as it will require the analysis of hundreds or even thousands of samples.
34. Current Limitations of Gene Expression Profiling Techniques
- Technical problems associated with the analysis of expression profiles derived from tissues that are composed of different cell types.
- Lack of procedures for identifying targets that lie in the pathway of disease.
- Need for bioinformatics tools for rapidly assessing the function of putative targets.
- The latter is of paramount importance to the pharmaceutical industry, as the identification of disease-deregulated targets alone is not sufficient to start a costly drug screening process.
35. Protein Arrays: Statistical Issues in the Data Collection Phase
- Within labs
  - Signal-to-noise ratio: quantifying it and making it as high as possible
  - Identifying and controlling sources of variability
  - Reproducibility
- Between labs
  - Inter-lab variability and biases
  - Reproducibility
- These issues tend to have been ignored in the excitement (cost? really obvious/big effects?) but become important when dealing with more subtle effects.
- Lab effects and scanning effects
- Systematic, designed experiments are needed to quantify sources of variation (a variance-components sketch follows below).
- Strategies for optimizing and monitoring processes
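One way to quantify the within-lab and between-lab contributions mentioned above is a variance-components estimate from a balanced designed experiment. The sketch below uses the standard one-way ANOVA estimator on invented replicate measurements of a single spot; it is illustrative only.

```python
# Variance-components sketch: split the variability of one spot's intensity
# into within-lab and between-lab components (balanced one-way design).
# All measurement values are made up for illustration.
import numpy as np

def variance_components(groups: list) -> tuple:
    """Return (within-lab variance, between-lab variance) from replicate
    measurements grouped by lab, assuming equal replicates per lab."""
    n = len(groups[0])                                          # replicates per lab
    group_means = np.array([g.mean() for g in groups])
    within = float(np.mean([g.var(ddof=1) for g in groups]))    # MS within
    ms_between = n * group_means.var(ddof=1)                    # MS between
    between = max((ms_between - within) / n, 0.0)
    return within, between

lab1 = np.array([102.0, 98.0, 105.0, 101.0])
lab2 = np.array([115.0, 118.0, 112.0, 117.0])
lab3 = np.array([99.0, 103.0, 97.0, 100.0])
print(variance_components([lab1, lab2, lab3]))
```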
36. Protein Arrays: Statistical Issues in the Data Analysis Phase
- What's being done now
  - Visualization of the data as an image
  - Clustering of rows and columns to interpret arrays (a minimal sketch follows below)
- Some limitations
  - Visualizations tend to be of raw expression data
  - Methods tend to ignore structure on rows (genes) and columns (samples)
  - Methods involve rectangular clusters
  - Genes are usually restricted to lie in one cluster
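The current practice listed above, clustering both rows (genes) and columns (samples), can be sketched with standard hierarchical clustering. The expression matrix below is random toy data; the point is only the mechanics of reordering rows and columns for display.

```python
# Sketch of two-way hierarchical clustering of an expression matrix:
# cluster genes (rows) and samples (columns), then reorder for display.
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)
expression = rng.normal(size=(50, 8))        # 50 genes x 8 samples (toy data)

gene_order = leaves_list(linkage(expression, method="average"))
sample_order = leaves_list(linkage(expression.T, method="average"))

reordered = expression[gene_order][:, sample_order]
print(reordered.shape)   # same matrix, rows and columns reordered for viewing
```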
37. Protein Arrays: Statistical Issues in the Data Analysis Phase (contd.)
- What's needed?
  - Other ways of visualizing the data which can also use information about rows and columns
  - Local clustering which is not restricted to rectangles
  - Genes in more than one cluster
  - Clustering with prior information
  - Analysis of experimental designs where the response is a vector of microarray data
  - Dimension reduction (see the PCA sketch below)
  - Methods for finding associations between large numbers of predictor and response variables
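As a sketch of the dimension-reduction item in the list above, the snippet below projects samples onto their first two principal components via an SVD of a centred toy expression matrix; it is illustrative only, not a recommendation of PCA over other methods.

```python
# Dimension-reduction sketch: principal components of a toy expression matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 500))          # 20 samples x 500 genes (toy data)

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = U[:, :2] * S[:2]               # sample coordinates on the first 2 PCs
print(scores.shape)                     # (20, 2): far easier to visualize
```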
38. Quality Control Issues Related to 3-D Structure Data Determined Using X-rays
- The reported parameter called the "B-factor" of each atom describes its effective size, and for proteins it should be treated as an empirical value.
- Because every atom contributes to every observation, it is difficult to estimate errors in individual atomic positions.
39. Resolution of Structures in the PDB
- Resolution in Å (low to high):         4.0   3.5   3.0   2.5   2.0   1.5
- Ratio of observations to parameters:   0.3   0.4   0.6   1.1   2.2   3.8
- The median resolution of structures in the Protein Data Bank is about 2.0 Å.
40. R-factor (contd.)
- The R-factor measures how well the model fits the data. If the set of observed X-ray amplitudes is Fo, and the corresponding predicted amplitudes calculated from the model are Fc, the R-factor is defined as R = Σ|Fo - Fc| / Σ|Fo|. (The set of F's may contain a list of tens of thousands of numbers; a small numerical sketch follows below.)
- For high-resolution models, values around 0.18-0.22 are good. For low-resolution studies, however, "good" R-factor values may be obtained even for models that are largely or entirely wrong. A more sophisticated quality measure is the cross-validation R-factor, Rfree.
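The definition above translates directly into a few lines of code. The amplitudes in the sketch are invented; real datasets contain tens of thousands of reflections.

```python
# Numerical sketch of the R-factor: R = sum(|Fo - Fc|) / sum(Fo).
import numpy as np

def r_factor(f_obs: np.ndarray, f_calc: np.ndarray) -> float:
    """Agreement between observed and model-calculated amplitudes."""
    return float(np.abs(f_obs - f_calc).sum() / f_obs.sum())

f_obs = np.array([120.0, 85.0, 230.0, 40.0, 310.0])    # made-up amplitudes
f_calc = np.array([110.0, 90.0, 210.0, 55.0, 300.0])
print(f"R-factor: {r_factor(f_obs, f_calc):.3f}")
```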
41. R-factor and Rfree (contd.)
- Murshudov and Dodson estimate overall uncertainties of atomic positions in macromolecules from the Rfree values, giving in a typical case values of about 0.05 Å at 1.5 Å resolution and 0.15 Å at 2 Å resolution.
- They approximate uncertainties of individual atomic positions from B-factors, giving values of about 0.16 Å for an atom with B = 20 Å² and 0.3 Å for an atom with B = 60 Å².
42. Methods to Detect Outliers: Type I
- Nomenclature and convention-related checks.
- Examples include incorrect chirality, and the naming of chemically equivalent side-chain atoms (e.g., in phenylalanine and tyrosine rings).
- Such errors can be corrected confidently without reference to experimental data, and current submissions can be fixed at the time of deposition. Checking of old datasets is in progress.
43. Methods to Detect Outliers: Type II
- Self-consistency tests.
- Many stereochemical features of macromolecular models are restrained during refinement. Bond lengths and angles are restrained to ideal values, planarity is imposed on aromatic rings and carboxylate groups, non-bonded atoms are prevented from clashing, temperature factors of atoms bonded to each other are forced to be similar, etc. Methods that assess how well these restraints are satisfied are an important part of the arsenal of structure verification tools.
- Nevertheless, their inadequacy in detecting genuine shortcomings in models has been demonstrated.
44. Type II (contd.)
- Proper assessment of outliers (as features or errors) requires access to the experimental data. Sometimes, outliers warn of more serious problems and may require careful inspection of the electron-density maps, and even model rebuilding by an experienced crystallographer. Unfortunately, not all errors can be fixed, even by appeal to structure factors and maps; some regions are fatally disordered.
45. Methods to Detect Outliers: Type III
- Orthogonal tests.
- Most revealing and useful are verification methods independent of the restraints used during model refinement. Such methods use database-derived information to assess how usual or unusual an atom, residue, or entire molecule is.
- Examples include the analysis of torsion angles of the protein main chain (Ramachandran analysis) and side-chain atoms (rotamer analysis), the orientation of the peptide plane (peptide-flip analysis), atomic volumes, geometry of the Cα backbone, non-bonded contacts, and the use of sequence-structure profiles (a torsion-angle sketch follows below).
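The torsion-angle analyses mentioned above start from a single geometric operation: computing a dihedral angle from four atom positions. The sketch below shows that building block; the coordinates are invented, and a real Ramachandran check would read them from a PDB entry and compare each residue's phi/psi pair against favoured regions.

```python
# Building block of Ramachandran/rotamer analysis: the torsion angle defined
# by four consecutive atoms. Coordinates below are made up for illustration.
import numpy as np

def dihedral(p0, p1, p2, p3) -> float:
    """Torsion angle in degrees defined by four points."""
    b0, b1, b2 = p0 - p1, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1         # component of b0 perpendicular to b1
    w = b2 - np.dot(b2, b1) * b1         # component of b2 perpendicular to b1
    return float(np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))))

# e.g., C(i-1), N(i), CA(i), C(i) would give phi(i); these points are synthetic.
atoms = [np.array(a) for a in
         [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0), (3.4, 1.6, 1.0)]]
print(f"torsion = {dihedral(*atoms):.1f} degrees")
```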
46. Quality of NMR Structure Determination
- NMR is the second major technique for determining macromolecular structure.
- The experiments determine approximate values of a set of inter-atomic distances and conformational angles.
- These distances, derived from the Nuclear Overhauser Effect (NOE), identify pairs of atoms close together in space, including those from residues distant in the sequence, which are essential for assembling the overall folding pattern.
- Calculations then produce sets of structures that are consistent, as far as possible, with the experimental constraints on distances and angles, and that have proper stereochemistry.
47. Quality Control Issues for NMR Data
- None of these measures really relates to accuracy, i.e., the similarity of the calculated structure to the "true" structure.
- One can determine, however, whether a calculated structure is consistent with experimental data not used to constrain it.
- One such approach is cross-validation. A proportion of the constraints is omitted from the structure calculation, and the consistency of the resulting structure with the unused constraints is taken as a measure of accuracy. (This is analogous to the procedure used by crystallographers in measuring Rfree; a minimal sketch follows below.)
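The cross-validation idea above can be sketched with a few made-up NOE constraints: hold some out, refine against the rest, then score violations of the held-out set. The constraint list, the model distances and the one-third hold-out are all invented for illustration; real NMR refinement is far more involved.

```python
# Sketch of cross-validating NOE distance constraints. All values are made up.
import random

# (atom_i, atom_j, upper_bound_in_angstroms)
constraints = [("HA_3", "HN_7", 5.0), ("HB_10", "HG_21", 4.5),
               ("HN_2", "HN_3", 3.5), ("HA_15", "HN_40", 5.5),
               ("HB_8", "HD_30", 5.0), ("HN_12", "HA_16", 4.0)]

random.seed(0)
random.shuffle(constraints)
free_set, working_set = constraints[:2], constraints[2:]   # hold out ~1/3 here

# Stand-in for the structure calculated from the working set only: a table of
# inter-proton distances measured on the resulting model (invented numbers).
model_distances = {("HA_3", "HN_7"): 5.8, ("HB_10", "HG_21"): 4.2,
                   ("HN_2", "HN_3"): 3.4, ("HA_15", "HN_40"): 5.1,
                   ("HB_8", "HD_30"): 5.6, ("HN_12", "HA_16"): 3.9}

violations = [max(0.0, model_distances[(i, j)] - bound) for i, j, bound in free_set]
print(f"mean violation of held-out constraints: {sum(violations) / len(violations):.2f} A")
```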
48. Conclusions
- Two factors dominate current developments in bioinformatics.
- The amount of raw data is increasing in quantity, spectacularly so, and in quality. Methods for annotation are improving, but by no means at a comparable rate. Tools for identification of errors are improving, both through enhanced understanding of what to expect and from a better statistical base from which to flag outliers.
- A proliferation of web sites provides different views or slices or means of access to these data, and an increasingly dense reticulation of these sites provides links among databanks and information-retrieval engines. These links provide useful avenues to applications, but they also provide routes for propagation of errors: in raw or immature data (subsequently corrected in the databanks, but with the corrections not passed on), and in annotation.
49. Conclusions (contd.)
- Annotation is a weak component of the enterprise.
- Automation of annotation is possible only to a limited extent, and getting annotation right remains labor-intensive.
- The importance of proper annotation, however, cannot be overestimated.
- P. Bork has commented that, for people interested in analysing the protein sequences implicit in genome sequence information, errors in gene assignment vitiate the high quality of the sequence data.
- The only possible solution is a distributed and dynamic error-correction and annotation process.
50. Conclusions (contd.)
- The workload must be distributed, because databank staff have neither the time nor the expertise for the job; specialists will have to act as curators.
- The process must be dynamic, in that progress in automation of annotation and error identification/correction will permit re-annotation of databanks.
- As a result, we will have to give up the "safe" idea of a stable databank composed of entries that are correct when they are first distributed in mature form and stay fixed thereafter.
- Databanks will become a seething broth of information, both growing in size and maturing, we must hope, in quality.
51. Conclusions (contd.)
- This will create problems, however, in organizing applications.
- Many institutions maintain local copies of databanks. At present, "maintain" means "top up", yet this will no longer be sufficient.
- In the face of dynamically changing databanks, how can we avoid proliferation of various copies in various states?
- How will it be possible to reproduce a scientific investigation based on a database search?
- One possible solution is to maintain adequate history records in each databank itself, in order to be able to reconstruct its form at any time.
- This is analogous to the information in the Oxford English Dictionary, which permits reconstruction of an English dictionary appropriate for 1616 or 1756.