Title: The HAP webserver: Tools for the Discovery of Genetic Basis of Human Disease
HYUN MIN KANG Computer Science and
Engineering University of California, San Diego
NOAH ZAITLEN Bioinformatics Program University of
California, San Diego
TAURIN TAN-ATICHAT Electrical and Computer
Engineering University of California, San Diego
EDWARD SHYU Computer Science and
Engineering University of California, San Diego
GRACE SHAW Computer Science and
Engineering University of California, San Diego
DAFNA BITTON Computer Science and
Engineering University of California, San Diego
ELEAZAR ESKIN Computer Science and
Engineering University of California, San Diego
ELAD HAZAN Department of Computer
Science Princeton University
ERAN HALPERIN International Computer Science
Institute, Berkeley
4. Identifying Association via Statistical Tests
Leveraging haplotype structure
Quantitative phenotypes
Dose-effects
Nonparametric tests
Covariates
1. Introduction
Understanding the structure of human variation is
important for understanding the genetic basis of
human diseases. Recent advances in high-throughput
genotyping technology are generating a tremendous
amount of high-density single nucleotide
polymorphism (SNP) data, which holds great promise
for discovering genetic risk factors associated
with disease. To identify associations between
disease and variations in an individual's
chromosomes, the genotype data must be phased
into haplotypes.
The HAP webserver builds on HAP, a very efficient
tool for haplotype resolution based on imperfect
phylogeny. It provides an integrated method to
reconstruct haplotype structure and to identify
genetic variants associated with complex
phenotypes, which can give insight into the
genetic factors of complex diseases. Our methods
leverage the interplay between genotype phasing,
haplotype phylogeny, association analysis, and
functional SNP prediction. They also leverage new
insights into the structure of human variation,
which allow us to observe phenotype associations
directly from genotype and phenotype data. We
demonstrate our methods via an analysis of two
genes implicated in hypertension. Our methods are
easily accessible via the webserver, which
provides complete results of the association
analysis, including graphical visualizations. We
expect that our methods will facilitate current
association studies.
CHGA HAPLOTYPE ID  NUCLEOTIDE AT POSITION                                  STATISTICAL TESTS
                   -1106  -1018  -1014  -988   -462   -415   -89    -57    Linear Regression  Unpaired t-test  Mann-Whitney  Jonckheere-Terpstra
A                  G      A      T      T      G      T      C      C      .948 ()            .963 ()          .969 ()       .963 ()
B                  A      A      T      T      G      T      C      C      .977 (?)           .999 (?)         .999 (?)      .996 (?)
C                  G      A      C      G      A      T      A      C      .175 (?)           .209 (?)         .505 (?)      .485 (?)
D                  G      A      T      T      G      C      C      C      .999 (?)           .990 ()          .983 ()       .997 ()
E                  G      T      T      T      G      C      C      T      .004 ()            .004 ()          .011 ()       .011 ()
F                  G      A      C      G      A      T      C      C      .836 (?)           .836 (?)         .978 (?)      .986 (?)
2. HAP haplotype resolution
HAP is a haplotype analysis system aimed at
helping geneticists perform disease association
studies. The main feature of HAP is a phasing
method based on the assumption of imperfect
phylogeny. The phasing method is very efficient,
which allows HAP to work with very large data
sets and to perform other operations, such as
finding a partition of the region into blocks of
limited diversity or performing association tests
on each of these blocks. HAP takes as input a set
of genotypes over a region, taken from a
population, and returns the haplotype phase of
each individual's genotype. In our studies, we
observed that HAP is very accurate when the
sample contains at least a couple of dozen
individuals. In addition to phasing, HAP also
produces a partition of the region into blocks of
correlated SNPs, chosen so that the number of tag
SNPs is minimized. HAP leverages a new insight
into the underlying structure of haplotypes,
which shows that SNPs are organized in highly
correlated blocks (Daly et al. 01, Patil et al.
01). HAP has been shown to have accuracy
competitive with state-of-the-art software (such
as PHASE and HAPLOTYPER). At the same time, HAP
is extremely fast and can be used on very large
datasets. Recently, HAP was successfully used to
reveal whole-genome haplotype structure (Hinds et
al. 05).
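To make the phasing problem concrete, the following minimal sketch enumerates every haplotype pair that is compatible with one unphased genotype. It is not HAP's algorithm, which resolves this ambiguity with an imperfect-phylogeny model; the 0/1/2 genotype encoding and the function name `compatible_haplotype_pairs` are assumptions made only for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with a genotype.

    Assumed (hypothetical) encoding per SNP:
    0 = homozygous for allele 0, 1 = homozygous for allele 1,
    2 = heterozygous (one copy of each allele, phase unknown).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    # Try every assignment of alleles to the first haplotype at the
    # heterozygous sites; the second haplotype gets the other allele.
    for bits in product((0, 1), repeat=len(het_sites)):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, bit in zip(het_sites, bits):
            h1[site] = bit
            h2[site] = 1 - bit
        # Sort the two haplotypes so (h1, h2) and (h2, h1) count once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


# A genotype with two heterozygous sites has two possible phasings.
pairs = compatible_haplotype_pairs([2, 0, 2, 1])
for h1, h2 in pairs:
    print(h1, h2)
```

With k heterozygous sites (k at least 1) there are 2^(k-1) candidate pairs per individual, which is why a population-level model is needed to choose among them.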
Table 1. Haplotype analysis between CHGA
promoter region and CHGA284-301 plasma levels
Statistical p-values for the association between
the haplotypes in the CHGA promoter region and
CHGA284-301 plasma levels in 221 African
Americans, under various statistical tests. Each
haplotype ID and its sequence are identical to
those of Figure 2. The p-values are evaluated by
permutation tests with 10^5 random shufflings of
the phenotypes. The p-values are also adjusted
for multiple comparisons, so no further
conservative adjustment is required. The plus or
minus sign next to each p-value denotes whether
the haplotype variant shows a positive or
negative effect on the phenotype for each
statistical test. Single and double asterisks by
a p-value denote that it is less than 0.05 and
0.01, respectively. This table is automatically
generated by our webserver.
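The caption above describes permutation p-values that are already adjusted for multiple comparisons. The poster does not say which adjustment the webserver applies; the sketch below assumes a max-statistic permutation scheme and uses the absolute difference in mean phenotype between carriers and non-carriers as the test statistic. The function names, the carrier encoding, and the synthetic data are all hypothetical.

```python
import random
from statistics import mean


def mean_difference(phenotypes, carriers):
    """Absolute difference in mean phenotype, carriers vs. non-carriers."""
    with_hap = [p for p, c in zip(phenotypes, carriers) if c]
    without = [p for p, c in zip(phenotypes, carriers) if not c]
    if not with_hap or not without:
        return 0.0
    return abs(mean(with_hap) - mean(without))


def adjusted_permutation_pvalues(phenotypes, carrier_table, n_perm=1000, seed=0):
    """Permutation p-values adjusted for testing several haplotypes at once.

    carrier_table maps a haplotype ID to a list of booleans (carrier or not).
    Each shuffle contributes the maximum statistic over all haplotypes, so a
    haplotype is compared against the best result any haplotype achieves by
    chance (assumed adjustment, not necessarily the webserver's).
    """
    rng = random.Random(seed)
    observed = {h: mean_difference(phenotypes, c) for h, c in carrier_table.items()}
    exceed = {h: 0 for h in carrier_table}
    shuffled = list(phenotypes)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any phenotype/haplotype link
        max_stat = max(mean_difference(shuffled, c) for c in carrier_table.values())
        for h in exceed:
            if max_stat >= observed[h]:
                exceed[h] += 1
    # Add-one correction keeps p-values strictly positive.
    return {h: (exceed[h] + 1) / (n_perm + 1) for h in exceed}


# Synthetic data: carriers of hypothetical haplotype "X" have a shifted
# phenotype; hypothetical haplotype "Y" is unrelated to the phenotype.
data_rng = random.Random(1)
carriers_x = [i < 10 for i in range(40)]
carriers_y = [i % 2 == 0 for i in range(40)]
phenotypes = [data_rng.gauss(5.0 if c else 0.0, 1.0) for c in carriers_x]
pvals = adjusted_permutation_pvalues(phenotypes, {"X": carriers_x, "Y": carriers_y})
print(pvals)
```

The number of shuffles here is kept small for speed; the table above reports 10^5 shuffles, which gives finer p-value resolution.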
Figure 5. CHGA functional SNP prediction
Results of predicting how each SNP contributes to
the association identified in Table 1. The y-axis
is a score that represents the degree of
functional contribution. The SNP at position -89
makes the highest functional contribution, and
those at positions -1014, -988, and -462 share
the second-highest score. This result is
consistent with previously published in vitro
experiments. This figure is automatically
generated by our webserver.
Figure 4. CHGA association visualization
A histogram of CHGA284-301 levels grouped by the
number of copies of haplotype E in Table 1. The
x-axis represents plasma levels, and the y-axis
represents the fraction of individuals with a
given plasma level. Haplotype E is significantly
associated with increased plasma levels. This
figure is automatically generated by our
webserver.
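The grouping behind this figure, and the dose-effect view listed under the statistical tests, can be sketched as follows. The sequences of haplotypes A and E are taken from Table 1; the individuals, their phased haplotype pairs, and the phenotype values are made up for illustration, and the helper names are hypothetical. The slope is an ordinary least-squares fit of phenotype on copy number, which is one simple way to express an additive dose effect and is not claimed to be the webserver's implementation.

```python
from collections import defaultdict
from statistics import mean

HAPLOTYPE_A = "GATTGTCC"  # sequence of haplotype A in Table 1
HAPLOTYPE_E = "GTTTGCCT"  # sequence of haplotype E in Table 1


def copies_of(haplotype, phased_pair):
    """Number of copies (0, 1, or 2) of a haplotype carried by one individual."""
    return sum(1 for h in phased_pair if h == haplotype)


def dose_effect_slope(copies, phenotypes):
    """Least-squares slope of phenotype on copy number (additive dose model)."""
    x_bar, y_bar = mean(copies), mean(phenotypes)
    sxx = sum((x - x_bar) ** 2 for x in copies)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(copies, phenotypes))
    return sxy / sxx


# Hypothetical individuals: (phased haplotype pair, made-up phenotype value).
individuals = [
    ((HAPLOTYPE_A, HAPLOTYPE_A), 1.0),
    ((HAPLOTYPE_A, HAPLOTYPE_A), 1.2),
    ((HAPLOTYPE_A, HAPLOTYPE_E), 2.1),
    ((HAPLOTYPE_E, HAPLOTYPE_A), 1.9),
    ((HAPLOTYPE_E, HAPLOTYPE_E), 3.0),
]

copies = [copies_of(HAPLOTYPE_E, pair) for pair, _ in individuals]
levels = [level for _, level in individuals]

# Group phenotype values by copy number, as in the histogram.
groups = defaultdict(list)
for c, level in zip(copies, levels):
    groups[c].append(level)
group_means = {c: mean(values) for c, values in sorted(groups.items())}

slope = dose_effect_slope(copies, levels)
print(group_means, slope)
```

A positive slope corresponds to a plus sign in Table 1: each additional copy of the haplotype is associated with a higher phenotype value.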
Figure 1. HAP webserver
(a) HAP was used to reveal whole-genome haplotype
structure. The article "Whole-Genome Patterns of
Common DNA Variation in Three Human Populations"
appeared on the cover of Science. (b) Screenshot
of the HAP webserver main page, available at
http://research.calit2.net/hap
5. Functional SNP Prediction
Once associated haplotypes are identified using
rigorous statistical tests, our methods estimate
the likelihood that each SNP contributes to the
association. To make this prediction, we iterate
over several groupings of the haplotypes in an
attempt to isolate the functional SNPs. The
outcome of this second step is a score
distribution over the SNPs that estimates how
likely each SNP is to be functional.
3. Inferring Phylogenetic Relationships between
Haplotypes
Recent studies have shown that within short
regions there is limited genetic variability, and
only a small number of haplotypes account for the
entire population. In a typical region of 20kb,
three or four common haplotypes account for 80%
of the population. Furthermore, most rare
variants appear to be minor variants of common
haplotypes. Using these results, phylogeny is
inferred by identifying the most likely ancestor
of each rare haplotype given the frequent ones.
Then, ancestral haplotypes are found by searching
for similar common variants.
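The idea of attaching each rare haplotype to its most likely common ancestor can be illustrated with a nearest-neighbor sketch. It assumes that the most likely ancestor is the common haplotype with the fewest differing SNPs (Hamming distance), which is a simplification and not necessarily the likelihood model used by the webserver. The sequences are those of Table 1, but treating A and C as the common haplotypes is a hypothetical choice, since the poster gives no haplotype frequencies.

```python
def hamming(h1, h2):
    """Number of SNP positions at which two haplotypes differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return sum(a != b for a, b in zip(h1, h2))


def nearest_common(rare, common):
    """Map each rare haplotype ID to (closest common ID, distance)."""
    result = {}
    for rare_id, rare_seq in rare.items():
        # Ties are broken alphabetically by haplotype ID.
        best = min(common, key=lambda cid: (hamming(rare_seq, common[cid]), cid))
        result[rare_id] = (best, hamming(rare_seq, common[best]))
    return result


# Sequences from Table 1; the common/rare split is assumed for illustration.
common = {"A": "GATTGTCC", "C": "GACGATAC"}
rare = {"B": "AATTGTCC", "D": "GATTGCCC", "E": "GTTTGCCT", "F": "GACGATCC"}

ancestors = nearest_common(rare, common)
for rare_id, (common_id, distance) in sorted(ancestors.items()):
    print(rare_id, "<-", common_id, "differences:", distance)
```

A distance of one corresponds to a single mutation from the assumed ancestor; haplotypes that are far from every common haplotype may instead be better explained by recombination.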
6. Whole Genome Association Studies with HDL
Mouse Phenome Database
Figure 3. Linkage disequilibrium plot
Results of running the HAP webserver with linkage
disequilibrium data. The example data is
available via the webserver. The axes represent
SNP positions. Red regions indicate high
disequilibrium, while blue regions indicate low
disequilibrium.
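A plot of this kind is built from a pairwise disequilibrium value for every pair of SNPs. The poster does not state which measure the webserver plots; the sketch below assumes the common r^2 statistic computed from phased haplotypes, with D = p_AB - p_A * p_B and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). The haplotypes and function names are hypothetical.

```python
def r_squared(haplotypes, i, j):
    """r^2 linkage disequilibrium between biallelic SNPs i and j."""
    n = len(haplotypes)
    ref_i = haplotypes[0][i]  # arbitrary reference allele at SNP i
    ref_j = haplotypes[0][j]  # arbitrary reference allele at SNP j
    p_a = sum(h[i] == ref_i for h in haplotypes) / n
    p_b = sum(h[j] == ref_j for h in haplotypes) / n
    p_ab = sum(h[i] == ref_i and h[j] == ref_j for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP has undefined r^2; reported as 0 here
    d = p_ab - p_a * p_b
    return d * d / denom


def ld_matrix(haplotypes):
    """Symmetric matrix of pairwise r^2 values, one row per SNP."""
    m = len(haplotypes[0])
    return [[r_squared(haplotypes, i, j) for j in range(m)] for i in range(m)]


# Hypothetical phased haplotypes over three SNPs: SNPs 0 and 1 always
# co-occur, while SNP 2 varies independently of both.
haplotypes = ["ACG", "ACT", "GTG", "GTT"]
matrix = ld_matrix(haplotypes)
for row in matrix:
    print(row)
```

In a heat map of this matrix, the SNP 0 and SNP 1 cell would fall at the high end of the color scale and the cells involving SNP 2 at the low end.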
Figure 2. Predicted CHGA phylogeny
Each symbol denotes a haplotype variant of the
CHGA promoter. Each haplotype variant is
classified into one of three groups: ancestral,
common, or recent. A solid line denotes mutation,
and a dashed line denotes recombination. This
figure is automatically generated by our
webserver.
Figure 7. Random
The association test results for the randomly
permuted HDL phenotype of Figure 6.
Figure 6. HDL Phenotype
The association test results for the level of HDL
cholesterol in the different mouse strains.