Stata commands for moving data between PHASE and HaploView, Stata Conference DC

Learn more at: http://fmwww.bc.edu

Transcript and Presenter's Notes

1
Stata commands for moving data between PHASE and HaploView
Stata Conference DC 09
July 30-31, 2009
  • John Charles "Chuck" Huber Jr, PhD
  • Assistant Professor of Biostatistics
  • Department of Epidemiology and Biostatistics
  • School of Rural Public Health
  • Texas A&M Health Science Center
  • jchuber_at_tamu.edu

2
Motivation
  • Many rapidly growing areas of research utilize
    multiple specialty "boutique" computer programs
    to conduct highly specialized analyses.
  • The Stata user is faced with two choices:
  • Write new Stata commands that do the same
    analyses, or
  • Write Stata commands that efficiently export and
    import data for these boutique programs

3
Stata for Genetic Data Analysis
  • Outline
  • Genetic Data Analysis using Stata
  • Genetics Background
  • The file commands in Stata
  • The phasein and phaseout commands
  • The HaploView program
  • The haploviewout command
  • Summary

4
Stata for Genetic Data Analysis
  • 2007 UK Stata Users Group meeting
  • http://www.stata.com/meeting/13uk/
  • A brief introduction to genetic epidemiology
    using Stata
  • Neil Shephard, University of Sheffield
  • An overview of using Stata to perform candidate
    gene association analysis will be presented.
    Areas covered will include data manipulation,
    Hardy-Weinberg equilibrium, calculating and
    plotting linkage disequilibrium, estimating
    haplotypes, and interfacing with external
    programs.

5
User Written Genetics Commands
  • Programs written by David Clayton
  • ginsheet - Read genotype data from text files.
  • gloci - Make a list of loci.
  • greshape - Reshape a file containing genotypes to
    a file of alleles.
  • gtab - Tabulate allele frequencies within
    genotypes and generate indicators (performs
    Hardy-Weinberg Equilibrium testing).
  • gtype - Create a single genotype variable from
    two allele variables.
  • htype - Create a haplotype variable from allele
    variables.
  • mltdt - Multiple locus TDT for haplotype tagging
    SNPs (htSNPs).
  • origin - Analysis of parental origin effect in
    TDT trios.
  • pseudocc - Create a pseudo-case-control study
    from case-parent trios.
  • pscc - Experimental version of pseudocc in which
    there may be several groups of linked loci.
  • pwld - Pairwise linkage disequilibrium measures.
  • rclogit - Conditional logistic regression with
    robust standard errors.
  • snp2hap - Infer haplotypes of 2-locus SNP
    markers.
  • tdt - Classical TDT test.
  • trios - Tabulate genotypes of parent-offspring
    trios.

6
User Written Genetics Commands
  • Programs written by Adrian Mander
  • gipf - Graphical representation of log-linear
    models.
  • hapipf - Haplotype frequency estimation using an
    EM algorithm and log-linear modelling.
  • pedread - Reads a pedigree data file (in
    pre-Makeped LINKAGE format), similar to ginsheet
  • pedsumm - Summarises a pre-Makeped LINKAGE file
    that is currently in Stata's memory.
  • pedraw - Draws one pedigree in the graphics
    window
  • plotmatrix - Produces LD heatmaps displaying
    graphically the strength of LD between markers.
  • profhap - Calculates profile likelihood
    confidence intervals for results from hapipf
  • swblock - A step-wise hapipf routine to identify
    the parsimonious model to describe the Haplotype
    block pattern.
  • qhapipf - Analysis of quantitative traits using
    regression and log-linear modelling when phase is
    unknown.
  • hapblock - attempts to find the edge of areas
    containing high LD within a set of loci

7
User Written Genetics Commands
  • Programs written by Mario Cleves
  • gencc - Genetic case-control tests
  • genhw - Hardy-Weinberg Equilibrium tests
  • qtlsnp - A program for testing associations
    between SNPs and a quantitative trait.
  • Programs written by Catherine Saunders
  • co_power - Power calculations for Case-only study
    designs.
  • gei_matching -
  • geipower - Power calculations for
    Gene-Environment interactions.
  • ggipower - Power calculations for Gene-Gene
    interactions.
  • tdt_geipower - Power calculations for
    Gene-Environment interactions via TDT analysis.
  • tdt_ggipower - Power calculations for Gene-Gene
    interactions via TDT analysis.
  • Programs written by Neil Shephard
  • genass - Performs a number of statistical tests on
    your genotypic data and collates the results into
    a Stata formatted data set for browsing.

8
The Post-Genome Era
February 16, 2001
February 15, 2001
9
Scientific Method: Observe
Hartl & Jones (1998) pg 18, Figure 1.13
10
Scientific Method: Predict
Watson et al. (2004) pg 29, Box 2-2
11
Scientific Method: Manipulate
12
The Structure of DNA
Hartl & Jones (1998) pg 9, Figure 1.5
13
The Structure of DNA
Watson et al. (2004) pg 23, Figure 2.5
14
What is a SNP?
  • A SNP is a single nucleotide polymorphism (the
    individual nucleotides are called alleles)
  • Person 1 Chromosome 1: ataagtcgatactgatgcatagctagctgactgacgcgat
  • Person 1 Chromosome 2: ataagtccatactgatgcatagctagctgactgaagcgat
  • Person 2 Chromosome 1: ataagtccatactgatgcatagctagctgactgacgcgat
  • Person 2 Chromosome 2: ataagtcgatactgatgcatagctagctgactgaagcgat
  • SNP1 is nucleotide 8 (g or c); SNP2 is nucleotide 35 (c or a)

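To make the definition concrete, here is a minimal Stata sketch that compares Person 1's two chromosomes from this slide and lists the positions where the nucleotides differ. The local macro names (seq1, seq2, snps) are illustrative and are not taken from the talk.

```stata
* Person 1's two chromosomes, copied from the slide
local seq1 "ataagtcgatactgatgcatagctagctgactgacgcgat"
local seq2 "ataagtccatactgatgcatagctagctgactgaagcgat"

* Walk along the sequence and collect every position where the alleles differ
local snps
forvalues k = 1/`=strlen("`seq1'")' {
    if substr("`seq1'", `k', 1) != substr("`seq2'", `k', 1) {
        local snps `snps' `k'
    }
}
display "Positions that differ: `snps'"
```

For these two sequences the loop reports positions 8 and 35, the two SNPs marked on the slide.
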
15
Allelic Association
  • Simple 2x2 table
  • One table per SNP
  • Compute a simple chi-squared statistic or odds
    ratio for each SNP

            SNP1 Allele
            g      c
Case        250    750
Control     650    350
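As a rough illustration, the allele counts on this slide can be tested with Stata's immediate commands, without loading a dataset. This is a sketch and not code from the talk; treating allele g as the "exposure" in cci is an assumption made only to define the odds ratio.

```stata
* Chi-squared test on the 2x2 allele table: rows are case/control, columns are alleles g/c
tabi 250 750 \ 650 350, chi2

* Odds ratio for the same table. cci expects the cell counts in the order
* exposed cases, unexposed cases, exposed controls, unexposed controls
* (assumption: allele g is treated as the exposure)
cci 250 750 650 350
```

With one table per SNP, the same two commands would be repeated for each SNP's allele counts.
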
16
Genotypic Association
  • Compute chi-squared tests
  • Allows testing of various disease models
    (dominant, recessive, additivity)

            SNP1 Genotype
            gg     gc     cc
Case        100    250    150
Control     300    150     50
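One way to test different disease models with these counts is to collapse the genotype columns before running the chi-squared test. The sketch below is illustrative only and assumes c is the risk allele, which the slide does not state.

```stata
* General (2 degrees of freedom) test on the full genotype table: gg, gc, cc
tabi 100 250 150 \ 300 150 50, chi2

* Dominant model for c (assumed risk allele): carriers (gc + cc) versus gg
tabi 400 100 \ 200 300, chi2

* Recessive model for c: cc versus (gg + gc)
tabi 150 350 \ 50 450, chi2
```
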
17
What is a Haplotype?
  • A haplotype is the combination of one or more
    alleles found on the same chromosome
  • Person 1 has a gc haplotype and a ca
    haplotype
  • Person 2 has a cc haplotype and a ga
    haplotype
  • Person 1 Chromosome 1: ataagtcgatactgatgcatagctagctgactgacgcgat
  • Person 1 Chromosome 2: ataagtccatactgatgcatagctagctgactgaagcgat
  • Person 2 Chromosome 1: ataagtccatactgatgcatagctagctgactgacgcgat
  • Person 2 Chromosome 2: ataagtcgatactgatgcatagctagctgactgaagcgat
  • SNP1 is nucleotide 8; SNP2 is nucleotide 35

18
Haplotypic Association
  • Compute chi-squared tests
  • Two SNPs with alleles a/g and c/t respectively

            SNP1-SNP2 Haplotype
            ac     at     gc     gt
Case        100    250     75     75
Control     300    100     50     50
19
Why are haplotypes important?
2009 Oxford and Cambridge Boat Race
http://www.theboatrace.org/gallery/2009?page7
20
Why are haplotypes important?
[Figure: SNP1 through SNP5 on Chromosome R and Chromosome D, labeled President, VP, State, Defense, Treasury]
21
Why are haplotypes important?
[Figure: SNP1 through SNP5 on Chromosome R and Chromosome D, labeled President, VP, State, Defense, Treasury]
Rearranging the members of each chromosome could have a profound effect!
22
Why are haplotypes important?
Hartl & Jones (1998) pg 18, Figure 1.13
23
Hartl & Jones (1998) pg 18, Figure 1.13
24
Why are haplotypes important?
Watson et al. (2004) pg 29, Box 2-2
25
The PHASE Program
  • Unfortunately, haplotypes are not observed
    directly using modern, high-throughput lab
    techniques
  • We observe genotypes and must infer the haplotype
    structure using algorithms
  • PHASE is a very popular program for inferring
    haplotypes from many SNPs simultaneously
    (Stephens, Smith & Donnelly, 2001)

26
The phaseout Command
Raw Genotype Data in Stata
27
The phaseout Command
Input file format for PHASE
28
The phaseout Command
I need to get my data from here
to here
29
The file commands in Stata
  • Using file open, file write and file close
  • file open Example1 using "ExampleFile.txt", write
    replace
  • file write Example1 "Hello World" _newline(1)
  • file write Example1 "Why so blue?" _newline(1)
  • file close Example1
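
The same three commands are enough to export genotypes in a PHASE-style layout. The sketch below is not the phaseout source code. It assumes genotypes are stored as strings such as "g/c", and it follows the default PHASE input format as described in the PHASE documentation (number of individuals, number of loci, a "P" line of positions, a locus-type line, then an ID line and two allele lines per individual), which should be checked against the PHASE manual before use. The variable names, file name, and toy data are hypothetical.

```stata
* Toy data (hypothetical): one row per person, one string variable per SNP
clear
input str2 id str3 snp1 str3 snp2
"P1" "g/c" "c/a"
"P2" "c/g" "c/a"
end

file open Phase using "toy.inp", write replace
* Header: number of individuals, number of loci, positions, locus types
file write Phase "`=_N'" _newline
file write Phase "2" _newline
file write Phase "P 8 35" _newline
* S marks a biallelic (SNP) locus, one letter per locus
file write Phase "SS" _newline

forvalues i = 1/`=_N' {
    * ID line for this individual
    file write Phase (id[`i']) _newline
    * First allele line: the allele before the "/" at each SNP
    file write Phase (substr(snp1[`i'], 1, 1)) " " (substr(snp2[`i'], 1, 1)) _newline
    * Second allele line: the allele after the "/" at each SNP
    file write Phase (substr(snp1[`i'], 3, 1)) " " (substr(snp2[`i'], 3, 1)) _newline
}
file close Phase
```

A real export command would also need to handle missing genotype codes and an arbitrary list of SNP variables, which this sketch leaves out.
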

30
The file commands in Stata
  • Using file open, file read and file close
  • . file open Example2 using "ExampleFile.txt",
    read
  • . file read Example2 Line1
  • . file read Example2 Line2
  • . file close Example2
  • . disp "Line1 `Line1'"
  • Line1 Hello World
  • . disp "Line2 `Line2'"
  • Line2 Why so blue?
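
When the number of lines is not known in advance, as with output from an external program, file read can be placed in a loop that stops at end of file. This is a minimal sketch using the r(eof) result that file read sets. The handle names Writer and Reader are arbitrary, and the file is written first only so the example is self-contained.

```stata
* Write a small file so the example is self-contained
file open Writer using "ExampleFile.txt", write replace
file write Writer "Hello World" _newline
file write Writer "Why so blue?" _newline
file close Writer

* Read line by line; file read sets r(eof) to 1 once the end of the file is reached
local n = 0
file open Reader using "ExampleFile.txt", read
file read Reader Line
while r(eof) == 0 {
    local ++n
    display "Line`n': `Line'"
    file read Reader Line
}
file close Reader
display "Lines read: `n'"
```
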

31
The phaseout Command
  • Syntax for phaseout
  • phaseout SNPlist , idvariable(string)
    filename(string) missing(string)
    separator(string) positions(string)
  • Example
  • local SNPList "rs1413711 rs3024987 rs3024989"
  • local PositionsList "674 836 1955"
  • phaseout `SNPList', idvariable("id")
    filename("VEGF.inp") missing("X/X 9/9")
    positions(`PositionsList') separator("/")

34
The phasein Command
  • Syntax for phasein
  • phasein PhaseOutputFile , markers(string)
    positions(string)
  • Example
  • phasein VEGF.out, markers("MarkerList.txt")
    positions("PositionList.txt")

35
The phasein Command
Output file format from PHASE
38
The HaploView Program
  • Once we have inferred our haplotypes, we can
    conduct further association analyses using the
    full complement of Stata commands.
  • We might also want to explore our data in the
    popular program HaploView (Barrett et al., 2005)

39
The haploviewout Command
  • Syntax for haploviewout
  • haploviewout SNPlist , idvariable(string)
    filename(string) positions(string)
    familyid(string) poslabel
  • Example
  • local MarkerList "rs1413711 rs3024987 rs3024989"
  • haploviewout `MarkerList', idvariable(id)
    filename("VEGF") poslabel

45
Summary
  • Compared to recreating boutique programs in
    Stata, it is relatively easy to create programs
    for exporting and importing data.

46
Acknowledgements
  • Grant 1-R01DK073618-02 from the National
    Institute of Diabetes and Digestive and Kidney
    Diseases
  • Grant 2006-35205-16715 from the United States
    Department of Agriculture.
  • Drs. Loren Skow, Krista Fritz, and Candice
    Brinkmeyer-Langford of the Texas A&M College of
    Veterinary Medicine
  • Roger Newson of Imperial College London

47
References
  • Barrett, J., Fry, B., Maller, J., & Daly, M.
    (2005). Haploview: analysis and visualization of
    LD and haplotype maps. Bioinformatics, 21,
    263-265.
  • Hartl, D.L., & Jones, E.W. (1998). Genetics:
    Principles and Analysis, 4th Ed. Jones &
    Bartlett Publishers.
  • Stephens, M., & Donnelly, P. (2003). A Comparison
    of Bayesian Methods for Haplotype Reconstruction
    from Population Genotype Data. American Journal
    of Human Genetics, 73, 1162-1169.
  • Stephens, M., Smith, N. J., & Donnelly, P.
    (2001). A New Statistical Method for Haplotype
    Reconstruction from Population Data. American
    Journal of Human Genetics, 68, 978-989.
  • Watson, J.D., Baker, T.A., Bell, S.P., Gann, A.,
    Levine, M., & Losick, R. (2004). Molecular Biology
    of the Gene, 5th Ed. Benjamin Cummings.