Title: Stata commands for moving data between PHASE and HaploView, Stata Conference DC 09
1. Stata commands for moving data between PHASE and HaploView
Stata Conference DC 09, July 30-31, 2009
- John Charles "Chuck" Huber Jr, PhD
- Assistant Professor of Biostatistics
- Department of Epidemiology and Biostatistics
- School of Rural Public Health
- Texas A&M Health Science Center
- jchuber_at_tamu.edu
2. Motivation
- Many rapidly growing areas of research utilize multiple specialty "boutique" computer programs to conduct highly specialized analyses.
- The Stata user is faced with two choices:
  - Write new Stata commands that do the same analyses
  - Write Stata commands that efficiently export and import data for these boutique programs
3. Stata for Genetic Data Analysis
- Outline
  - Genetic Data Analysis using Stata
  - Genetics Background
  - The file commands in Stata
  - The phasein and phaseout commands
  - The HaploView program
  - The haploviewout command
  - Summary
4. Stata for Genetic Data Analysis
- 2007 UK Stata Users Group meeting
- http://www.stata.com/meeting/13uk/
- A brief introduction to genetic epidemiology using Stata
- Neil Shephard, University of Sheffield
- An overview of using Stata to perform candidate gene association analysis will be presented. Areas covered will include data manipulation, Hardy-Weinberg equilibrium, calculating and plotting linkage disequilibrium, estimating haplotypes, and interfacing with external programs.
5. User Written Genetics Commands
- Programs written by David Clayton
  - ginsheet - Read genotype data from text files.
  - gloci - Make a list of loci.
  - greshape - Reshape a file containing genotypes to a file of alleles.
  - gtab - Tabulate allele frequencies within genotypes and generate indicators (performs Hardy-Weinberg Equilibrium testing).
  - gtype - Create a single genotype variable from two allele variables.
  - htype - Create a haplotype variable from allele variables.
  - mltdt - Multiple locus TDT for haplotype tagging SNPs (htSNPs).
  - origin - Analysis of parental origin effect in TDT trios.
  - pseudocc - Create a pseudo-case-control study from case-parent trios.
  - pscc - Experimental version of pseudocc in which there may be several groups of linked loci.
  - pwld - Pairwise linkage disequilibrium measures.
  - rclogit - Conditional logistic regression with robust standard errors.
  - snp2hap - Infer haplotypes of 2-locus SNP markers.
  - tdt - Classical TDT test.
  - trios - Tabulate genotypes of parent-offspring trios.
6. User Written Genetics Commands
- Programs written by Adrian Mander
  - gipf - Graphical representation of log-linear models.
  - hapipf - Haplotype frequency estimation using an EM algorithm and log-linear modelling.
  - pedread - Reads a pedigree data file (in pre-Makeped LINKAGE format), similar to ginsheet.
  - pedsumm - Summarises a pre-Makeped LINKAGE file that is currently in Stata's memory.
  - pedraw - Draws one pedigree in the graphics window.
  - plotmatrix - Produces LD heatmaps displaying graphically the strength of LD between markers.
  - profhap - Calculates profile likelihood confidence intervals for results from hapipf.
  - swblock - A step-wise hapipf routine to identify the parsimonious model to describe the haplotype block pattern.
  - qhapipf - Analysis of quantitative traits using regression and log-linear modelling when phase is unknown.
  - hapblock - Attempts to find the edge of areas containing high LD within a set of loci.
7. User Written Genetics Commands
- Programs written by Mario Cleves
  - gencc - Genetic case-control tests
  - genhw - Hardy-Weinberg Equilibrium tests
  - qtlsnp - A program for testing associations between SNPs and a quantitative trait.
- Programs written by Catherine Saunders
  - co_power - Power calculations for Case-only study designs.
  - gei_matching -
  - geipower - Power calculations for Gene-Environment interactions.
  - ggipower - Power calculations for Gene-Gene interactions.
  - tdt_geipower - Power calculations for Gene-Environment interactions via TDT analysis.
  - tdt_ggipower - Power calculations for Gene-Gene interactions via TDT analysis.
- Programs written by Neil Shephard
  - genass - Performs a number of statistical tests on your genotypic data and collates the results into a Stata formatted data set for browsing.
8. The Post-Genome Era
February 16, 2001
February 15, 2001
9. Scientific Method: Observe
Hartl & Jones (1998) pg 18, Figure 1.13
10. Scientific Method: Predict
Watson et al. (2004) pg 29, Box 2-2
11. Scientific Method: Manipulate
12. The Structure of DNA
Hartl & Jones (1998) pg 9, Figure 1.5
13. The Structure of DNA
Watson et al. (2004) pg 23, Figure 2.5
14. What is a SNP?
- A SNP is a single nucleotide polymorphism (the individual nucleotides are called alleles)

Person 1 Chromosome 1   ataagtcgatactgatgcatagctagctgactgacgcgat
Person 1 Chromosome 2   ataagtccatactgatgcatagctagctgactgaagcgat
Person 2 Chromosome 1   ataagtccatactgatgcatagctagctgactgacgcgat
Person 2 Chromosome 2   ataagtcgatactgatgcatagctagctgactgaagcgat
SNP1 = position 8 (g or c), SNP2 = position 35 (c or a)
15. Allelic Association
- Simple 2x2 table
- One table per SNP
- Compute a simple chi-squared statistic or odds ratio for each SNP

           SNP1 Allele
           g      c
Case       250    750
Control    650    350
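
As a minimal sketch of the allelic test, the counts in the table above can be typed directly into Stata's immediate commands. `tabi` and `cci` are standard Stata commands; treating allele g as the "exposed" category in `cci` is an assumption made only for illustration.

```stata
* Allelic association for SNP1, using the counts from the 2x2 table
* Rows: Case, Control; columns: allele g, allele c
tabi 250 750 \ 650 350, chi2

* Odds ratio for allele g versus allele c (g treated as "exposed" here)
* cci takes: exposed cases, unexposed cases, exposed controls, unexposed controls
cci 250 750 650 350
```

For these counts the odds ratio is (250 x 350) / (750 x 650), or about 0.18. With one table per SNP, the same two commands would be repeated for each marker.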
16. Genotypic Association
- Compute chi-squared tests
- Allows testing of various disease models (dominant, recessive, additive)

           SNP1 Genotype
           gg     gc     cc
Case       100    250    150
Control    300    150     50
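
A hedged sketch of how the genotype table above might be tested in Stata: the first command tests the full 2x3 table, and the second collapses it to one possible dominant model. Choosing g as the allele of interest is an assumption for illustration; the slides do not say which allele is modelled.

```stata
* Genotypic association for SNP1: 2x3 table
* Rows: Case, Control; columns: gg, gc, cc
tabi 100 250 150 \ 300 150 50, chi2

* Dominant model for allele g (assumed): carriers (gg + gc) versus cc
* Case:    100 + 250 = 350 carriers, 150 non-carriers
* Control: 300 + 150 = 450 carriers,  50 non-carriers
tabi 350 150 \ 450 50, chi2
```

A recessive model would instead compare gg against gc + cc, using the same collapsing approach.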
17. What is a Haplotype?
- A haplotype is the combination of one or more alleles found on the same chromosome
- Person 1 has a gc haplotype and a ca haplotype
- Person 2 has a cc haplotype and a ga haplotype

Person 1 Chromosome 1   ataagtcgatactgatgcatagctagctgactgacgcgat
Person 1 Chromosome 2   ataagtccatactgatgcatagctagctgactgaagcgat
Person 2 Chromosome 1   ataagtccatactgatgcatagctagctgactgacgcgat
Person 2 Chromosome 2   ataagtcgatactgatgcatagctagctgactgaagcgat
SNP1 = position 8 (g or c), SNP2 = position 35 (c or a)
18. Haplotypic Association
- Compute chi-squared tests
- Two SNPs with alleles a/g and c/t respectively

           SNP1-SNP2 Haplotype
           ac     at     gc     gt
Case       100    250     75     75
Control    300    100     50     50
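
The haplotype table above can also be held as a small dataset of counts and tested with frequency weights, which is closer to how inferred haplotypes would sit in Stata's memory. This is a minimal sketch; the variable names `case`, `haplotype` and `n` are hypothetical.

```stata
* Enter the haplotype counts from the table as one row per cell
clear
input byte case str2 haplotype int n
1 "ac" 100
1 "at" 250
1 "gc"  75
1 "gt"  75
0 "ac" 300
0 "at" 100
0 "gc"  50
0 "gt"  50
end

* Chi-squared test of case status by haplotype, weighting each row by its count
tabulate case haplotype [fweight = n], chi2
```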
19. Why are haplotypes important?
2009 Oxford and Cambridge Boat Race
http://www.theboatrace.org/gallery/2009?page7
20. Why are haplotypes important?
[Figure: Chromosome R and Chromosome D, each with five positions (SNP1-SNP5) labeled President, VP, State, Defense, Treasury]
21. Why are haplotypes important?
[Figure: Chromosome R and Chromosome D, each with five positions (SNP1-SNP5) labeled President, VP, State, Defense, Treasury]
Rearranging the members of each chromosome could have a profound effect!
22. Why are haplotypes important?
Hartl & Jones (1998) pg 18, Figure 1.13
23. Hartl & Jones (1998) pg 18, Figure 1.13
24. Why are haplotypes important?
Watson et al. (2004) pg 29, Box 2-2
25. The PHASE Program
- Unfortunately, haplotypes are not observed directly using modern, high-throughput lab techniques
- We observe genotypes and must infer the haplotype structure using algorithms
- PHASE is a very popular program for inferring haplotypes from many SNPs simultaneously (Stephens, Smith & Donnelly, 2001)
26. The phaseout Command
Raw Genotype Data in Stata
27. The phaseout Command
Input file format for PHASE
28. The phaseout Command
I need to get my data from here (raw genotype data in Stata) to here (the PHASE input file)
29. The file commands in Stata
- Using file open, file write and file close
    file open Example1 using "ExampleFile.txt", write replace
    file write Example1 "Hello World" _newline(1)
    file write Example1 "Why so blue?" _newline(1)
    file close Example1
30. The file commands in Stata
- Using file open, file read and file close
    . file open Example2 using "ExampleFile.txt", read
    . file read Example2 Line1
    . file read Example2 Line2
    . file close Example2
    . disp "Line1 `Line1'"
    Line1 Hello World
    . disp "Line2 `Line2'"
    Line2 Why so blue?
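
When the number of lines is not known in advance, the usual pattern is to keep reading until `file read` reports end of file in `r(eof)`. The following is a minimal sketch of that loop; the handle name `demo` and the macro names are hypothetical.

```stata
* Write a small text file, then read it back one line at a time
file open demo using "ExampleFile.txt", write text replace
file write demo "Hello World" _newline
file write demo "Why so blue?" _newline
file close demo

file open demo using "ExampleFile.txt", read text
file read demo line                  // first line goes into local macro line
local n = 0
while r(eof) == 0 {                  // r(eof) becomes 1 once the end of file is reached
    local ++n
    display `"Line`n' `line'"'
    file read demo line              // read the next line (or hit end of file)
}
file close demo
```

An import command can use the same loop, parsing each line as it is read rather than displaying it.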
31. The phaseout Command
- Syntax for phaseout
    phaseout SNPlist , idvariable(string) filename(string) missing(string) separator(string) positions(string)
- Example
    local SNPList "rs1413711 rs3024987 rs3024989"
    local PositionsList "674 836 1955"
    phaseout `SNPList' , idvariable("id") filename("VEGF.inp") missing("X/X 9/9") positions(`PositionsList') separator("/")
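
To illustrate the kind of work an export command has to do, here is a minimal sketch that writes a PHASE-style input file with the file commands. It is not the author's phaseout code. The layout (number of individuals, number of loci, a "P" line of positions, a line of locus types, then an ID line and two allele lines per individual) is assumed from the PHASE documentation, the genotype values are made up, and missing genotypes are not handled.

```stata
* Hypothetical genotype data: one string variable per SNP, alleles separated by "/"
clear
input str4 id str3 rs1413711 str3 rs3024987 str3 rs3024989
"P001" "A/G" "C/C" "T/C"
"P002" "G/G" "C/T" "T/T"
end

local SNPList "rs1413711 rs3024987 rs3024989"
local PositionsList "674 836 1955"
local nsnp : word count `SNPList'

* Build the locus-type line: one "S" (biallelic SNP) per marker
local types ""
foreach s of local SNPList {
    local types "`types'S"
}

file open ph using "VEGF_sketch.inp", write text replace
file write ph "`=_N'" _n                  // number of individuals
file write ph "`nsnp'" _n                 // number of loci
file write ph "P `PositionsList'" _n      // marker positions
file write ph "`types'" _n                // locus types

forvalues i = 1/`=_N' {
    file write ph (id[`i']) _n            // ID line for individual i
    foreach hap in 1 2 {                  // one line per allele of each genotype
        local line ""
        foreach s of local SNPList {
            local a = word(subinstr(`s'[`i'], "/", " ", .), `hap')
            local line = trim("`line' `a'")
        }
        file write ph "`line'" _n
    }
}
file close ph
```

A real command would also need to recode missing genotypes (the `missing()` option above suggests how they are identified) before writing them out.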
34. The phasein Command
- Syntax for phasein
    phasein PhaseOutputFile , markers(string) positions(string)
- Example
    phasein VEGF.out, markers("MarkerList.txt") positions("PositionList.txt")
35. The phasein Command
Output file format from PHASE
38. The HaploView Program
- Once we have inferred our haplotypes, we can conduct further association analyses using the full complement of Stata commands.
- We might also want to explore our data in the popular program HaploView (Barrett et al., 2005)
39. The haploviewout Command
- Syntax for haploviewout
    haploviewout SNPlist , idvariable(string) filename(string) positions(string) familyid(string) poslabel
- Example
    local MarkerList "rs1413711 rs3024987 rs3024989"
    haploviewout `MarkerList', idvariable(id) filename("VEGF") poslabel
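
Haploview reads marker names and positions from a separate marker information file, so an export command needs to write one alongside the genotype file. The sketch below writes such a file with one marker name and position per line. It is an illustration only: the output file name, the tab separator, and the reuse of the positions from the phaseout example are assumptions, and this is not the haploviewout source.

```stata
local MarkerList "rs1413711 rs3024987 rs3024989"
local PositionsList "674 836 1955"     // assumed: same positions as the phaseout example

file open hv using "VEGF_sketch.info", write text replace
local i = 1
foreach m of local MarkerList {
    local pos : word `i' of `PositionsList'    // position matching the i-th marker
    file write hv "`m'" _tab "`pos'" _n        // marker name, tab, position
    local ++i
}
file close hv
```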
45. Summary
- Compared to recreating boutique programs in Stata, it is relatively easy to create programs for exporting and importing data.
46. Acknowledgements
- Grant 1-R01DK073618-02 from the National Institute of Diabetes and Digestive and Kidney Diseases
- Grant 2006-35205-16715 from the United States Department of Agriculture.
- Drs. Loren Skow, Krista Fritz, and Candice Brinkmeyer-Langford of the Texas A&M College of Veterinary Medicine
- Roger Newson of the Imperial College London
47. References
- Barrett, J., Fry, B., Maller, J., & Daly, M. (2005). Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263-265.
- Hartl, D.L., & Jones, E.W. (1998). Genetics: Principles and Analysis, 4th Ed. Jones & Bartlett Publishers.
- Stephens, M., & Donnelly, P. (2003). A Comparison of Bayesian Methods for Haplotype Reconstruction from Population Genotype Data. American Journal of Human Genetics, 73, 1162-1169.
- Stephens, M., Smith, N. J., & Donnelly, P. (2001). A New Statistical Method for Haplotype Reconstruction from Population Data. American Journal of Human Genetics, 68, 978-989.
- Watson, J.D., Baker, T.A., Bell, S.P., Gann, A., Levine, M., & Losick, R. (2004). Molecular Biology of the Gene, 5th Ed. Benjamin Cummings.