Linear Reduction Method for Tag SNPs Selection

- Jingwu He
- Alex Zelikovsky

Outline

- SNPs, haplotypes and genotypes
- Haplotype tagging problem
- Linear reduction method for tagging
- Maximizing tagging separability
- Conclusions and future work

Human Genome and SNPs

- Length of human genome ≈ 3 × 10^9 base pairs
- Difference between any two people ≈ 0.1% of genome ≈ 3 × 10^6 SNPs
- Total single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7
- SNPs are mostly bi-allelic, e.g., alleles A and C
- Minor allele frequency should be considerable, e.g., > 1%
- Diploid: two different copies of each chromosome
- Haplotype: description of a single copy (0, 1)
- Genotype: description of the mixed two copies (0 = 00, 1 = 11, 2 = 01)

Figure: two haplotypes per individual and the genotype for the individual

Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype:    2 1 2 1 0 0 1 2 0
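
A minimal sketch of how a genotype follows from two haplotypes, assuming the encoding 0 = both copies 0, 1 = both copies 1, 2 = the copies differ. The function name `genotype` is illustrative and not from the slides; the input values are the two haplotypes shown in the figure.

```python
def genotype(hap_a, hap_b):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype.

    Assumed encoding: 0 = both copies 0, 1 = both copies 1,
    2 = heterozygous (the two copies differ).
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNPs")
    return [a if a == b else 2 for a, b in zip(hap_a, hap_b)]


hap_1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
hap_2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype(hap_1, hap_2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

The mapping is not invertible: a 2 in the genotype does not say which copy carries which allele.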

Haplotype and Disease Association

- Haplotypes/genotypes define our individuality
- Genetically engineered athletes might win at the Beijing Olympics (Time, 07/2004)
- Haplotypes contribute to risk factors of complex diseases (e.g., diabetes)
- International HapMap project: http://www.hapmap.org
- SNPs causing disease are hidden among 10 million SNPs
- Too expensive to search
- HapMap tries to identify 1 million tag SNPs providing almost as much mapping information as the entire 10 million SNPs

Tagging Reduces Cost

- Decrease SNP haplotyping cost:
- sequence only a small number of SNPs (tag SNPs)
- infer the rest of the (certain) SNPs based on the sequenced tag SNPs
- Cost-saving ratio = m / k (infinite population), where m = number of SNPs and k = number of tags
- Traditional tagging by linkage disequilibrium (LD) needs too many SNPs; its cost-saving ratio is too small (≈ 2)
- Proposed linear reduction method: cost-saving ratio ≈ 20

Haplotype Tagging Problem

- Given the full pattern of all SNPs for a sample
- Find the minimum number of tag SNPs that will allow reconstructing the complete haplotype for each individual

Linear Rank of Recombinations

- Human haplotype evolution:
- Mutations introduce SNPs
- Recombinations propagate SNPs over the entire population
- Replace notation (0, 1) with (-1, 1)
- Theorem: a haplotype population generated from l haplotypes with recombinations at k spots has linear rank ≤ (l-1)(k+2)
- It is much less than the number of all haplotypes, l^k
- Conclusion: use only linearly independent SNPs as tags
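
A small sketch illustrating the low linear rank of a recombinant population, under assumptions not in the slides: two made-up founder haplotypes in (-1, 1) notation, two fixed recombination spots, and every combination of founder segments included in the population. The helper names `recombinants` and `rank` are hypothetical; rank is computed by exact Gaussian elimination over fractions.

```python
from fractions import Fraction
from itertools import product


def recombinants(founders, spots):
    """All haplotypes built by picking one founder per segment between spots."""
    bounds = [0, *spots, len(founders[0])]
    segments = list(zip(bounds, bounds[1:]))
    population = set()
    for choice in product(range(len(founders)), repeat=len(segments)):
        hap = []
        for founder_index, (start, end) in zip(choice, segments):
            hap.extend(founders[founder_index][start:end])
        population.add(tuple(hap))
    return sorted(population)


def rank(matrix):
    """Rank of a matrix via exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    found = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(found, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for i in range(found + 1, len(rows)):
            factor = rows[i][c] / rows[found][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[found])]
        found += 1
    return found


# Illustrative founders (l = 2) and recombination spots (k = 2).
founders = [[1, -1, 1, 1, -1, 1], [-1, -1, 1, -1, 1, 1]]
population = recombinants(founders, spots=[2, 4])
print(len(population), rank(population))  # 8 4
```

In this toy case 8 distinct haplotypes span only a 4-dimensional space, so 4 linearly independent SNPs determine the other 2.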

Tag SNPs Selection

- Tag selecting algorithm:
- Using Gauss-Jordan elimination, find the row reduced echelon form (RREF) X of sample matrix S
- Extract the basis T of sample S
- Factorize sample S = T × X
- Output the set of tags T
- Fact: in the sample, each SNP is a linear combination of tag SNPs
- Conjecture: in the entire population, each SNP is the same linear combination of tags as in the sample

Figure: sample S = tags T × RREF X
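
The steps of the tag selecting algorithm can be sketched as follows. This is a minimal illustration, assuming the sample S is stored with one haplotype per row and one SNP per column in (-1, 1) notation, and assuming the basis T consists of the pivot columns of the RREF. The helper names `rref`, `select_tags`, and `matmul` and the toy sample are hypothetical, not from the slides.

```python
from fractions import Fraction


def rref(matrix):
    """Gauss-Jordan elimination: return (nonzero RREF rows, pivot columns)."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    pivots = []
    r = 0
    for c in range(len(rows[0])):
        pivot_row = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot_row is None:
            continue
        rows[r], rows[pivot_row] = rows[pivot_row], rows[r]
        pivot_value = rows[r][c]
        rows[r] = [v / pivot_value for v in rows[r]]
        # Clear column c in every other row.
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    return rows[:r], pivots


def select_tags(sample):
    """Return (tag SNP indices, tag matrix T, RREF X) with sample = T x X."""
    X, tags = rref(sample)
    T = [[row[c] for c in tags] for row in sample]
    return tags, T, X


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Toy sample: 3 haplotypes (rows) over 4 SNPs (columns).
S = [
    [1, 1, 1, -1],
    [1, -1, 1, 1],
    [-1, -1, -1, 1],
]
tags, T, X = select_tags(S)
print(tags)                # [0, 1]
print(matmul(T, X) == S)   # True
```

Here SNP 2 duplicates SNP 0 and SNP 3 is the negation of SNP 1, so two tags suffice for this sample.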

Haplotype Reconstruction

- Given tags t of unknown haplotype h
- and RREF X of sample matrix S
- Find unknown haplotype h
- Predict h = t × X
- We may have errors, since the predicted h may not equal the unknown haplotype h; we assign -1 if a predicted value is negative and 1 otherwise (RLRP)
- Variant: randomly reshuffle SNPs before choosing tags (RLR)

Figure: unknown haplotype h, its tag set t, RREF X, and the predicted haplotype h
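
A minimal sketch of the prediction step, assuming (-1, 1) notation and an RREF X with one row per tag and one column per SNP. The function name `predict_haplotype` and the example matrix are illustrative, not from the slides.

```python
def predict_haplotype(tag_values, X):
    """Predict all SNPs as t x X, then round by sign to -1 or 1."""
    raw = [sum(t * x for t, x in zip(tag_values, column)) for column in zip(*X)]
    return [-1 if value < 0 else 1 for value in raw]


# Illustrative RREF with 3 tags; SNP 3 = tag 0 + tag 1 - tag 2.
X = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, -1],
]
t = [1, 1, -1]
print(predict_haplotype(t, X))  # [1, 1, -1, 1]
```

For this t the raw value of SNP 3 is 3, which is not a valid allele code, so the sign rule maps it to 1. Whether the rounded value matches the true allele depends on the linear relation carrying over from the sample to the unknown haplotype.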

Results for Simulated Data

- Cost-saving ratio for 2% error: 3.9 for LR and 13 for RLRP
- P = 1000 different haplotypes
- m = 25000 sites
- Sample size k (number of tag SNPs): 50, 100, ..., 750

Results for Real Data

- Cost-saving ratio for 5% error: 2.1 for LR and 2.8 for RLRP
- P = 158 different haplotypes (Daly et al.)
- m = 103 sites
- Sample size k (number of tag SNPs): 10, 15, 20, ..., 90

Tag Separability

- Correlation between the number of zeros for a SNP in RREF X and the number of errors in its prediction column
- Greedy heuristic gives a more separable basis: for 5% error, the cost-saving ratio is 3.3, versus 2.8 for RLRP
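
The quantity in the first bullet, the number of zeros per SNP in the RREF X, can be computed as in the sketch below. It assumes X has one row per tag and one column per SNP; the function name `zeros_per_snp` and the example matrix are illustrative. The greedy heuristic itself is not specified on the slide and is not implemented here.

```python
def zeros_per_snp(X):
    """Count, for each SNP (column of X), how many tag coefficients are zero."""
    return [sum(1 for value in column if value == 0) for column in zip(*X)]


# Illustrative RREF with 3 tags and 4 SNPs.
X = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, -1],
]
print(zeros_per_snp(X))  # [2, 2, 2, 0]
```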

Conclusions and Future work

- Our contributions:
- new SNP tagging problem formulation
- linear reduction method for SNP tagging
- enhancement of linear reduction using a separable basis
- Future work:
- application of tagging for genotype and haplotype disease association

Thank you