CSE182-L17 - PowerPoint PPT Presentation

About This Presentation
Title:

CSE182-L17

Description:

CSE182-L17 Clustering Population Genetics: Basics – PowerPoint PPT presentation

Slides: 39
Provided by: Vine47
Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

Title: CSE182-L17


1
CSE182-L17
  • Clustering
  • Population Genetics: Basics

2
Unsupervised Clustering
  • Given a set of points (in n-dimensions), and k,
    compute the k best clusters.
  • In k-means, clustering is done by choosing k
    centers (means).
  • Each point is assigned to the closest center.
  • The notion of best is defined by distances to
    the center.
  • Question: How can we compute the k best centers?

3
Distance
  • Given a data point v and a set of points X, define the distance
    from v to X, d(v, X), as the (Euclidean) distance from v to the
    closest point in X.
  • Given a set of n data points $V = \{v_1, \dots, v_n\}$ and a set
    of k points X, define the Squared Error Distortion
  • $d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
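
The two definitions above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `dist_to_set` and `distortion` and the sample points are assumptions, not part of the slides.

```python
import math

def dist_to_set(v, X):
    """d(v, X): Euclidean distance from point v to the closest point in X."""
    return min(math.dist(v, x) for x in X)

def distortion(V, X):
    """Squared error distortion: mean over V of d(v_i, X) squared."""
    return sum(dist_to_set(v, X) ** 2 for v in V) / len(V)

# Hypothetical data: four points and two candidate centers.
V = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
X = [(0.0, 1.0), (10.0, 1.0)]
print(distortion(V, X))  # every point is at distance 1 from its nearest center
```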

4
K-Means Clustering Problem: Formulation
  • Input: A set, V, consisting of n points and a
    parameter k
  • Output: A set X consisting of k points (cluster
    centers) that minimizes the squared error
    distortion d(V,X) over all possible choices of X
  • This problem is NP-hard in general.

5
1-Means Clustering Problem: an Easy Case
  • Input: A set, V, consisting of n points.
  • Output: A single point X that minimizes d(V,X)
    over all possible choices of X.
  • This problem is easy.
  • However, it becomes very difficult for more
    than one center.
  • An efficient heuristic method for k-Means
    clustering is Lloyd's algorithm.
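
One way to see why the single-center case is easy: assuming Euclidean distance, as in the Squared Error Distortion defined earlier, the distortion is a convex function of the center x, and setting its gradient to zero gives the centroid of V. The derivation below is a sketch under that assumption.

```latex
\begin{align*}
d(V, x) &= \frac{1}{n} \sum_{i=1}^{n} \lVert v_i - x \rVert^2 \\
\nabla_x\, d(V, x) &= -\frac{2}{n} \sum_{i=1}^{n} (v_i - x) = 0
\quad\Longrightarrow\quad
x = \frac{1}{n} \sum_{i=1}^{n} v_i
\end{align*}
```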

6
K-means: Lloyd's algorithm
  • Choose k centers at random:
  • $X' = \{x_1, x_2, x_3, \dots, x_k\}$
  • Repeat
  • $X = X'$
  • Assign each $v \in V$ to the closest cluster j:
  • $d(v, x_j) = d(v, X) \Rightarrow C_j = C_j \cup \{v\}$
  • Recompute X':
  • $x'_j = \left(\sum_{v \in C_j} v\right) / |C_j|$
  • until (X' = X)
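
The loop above can be sketched in plain Python as follows. This is a minimal sketch, not a tuned implementation: drawing the initial centers from the data points, the `seed` and `max_iter` parameters, and keeping the old center when a cluster becomes empty are all assumptions that the slide does not specify.

```python
import math
import random

def lloyd(V, k, seed=0, max_iter=100):
    """Lloyd's heuristic for k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(V, k)  # assumption: initial centers are data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for v in V:
            j = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[j].append(v)
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for j, C in enumerate(clusters):
            if C:
                new_centers.append(tuple(sum(xs) / len(C) for xs in zip(*C)))
            else:
                new_centers.append(centers[j])  # assumption: keep an empty cluster's center
        if new_centers == centers:  # X' = X, so the assignment is stable
            break
        centers = new_centers
    return centers, clusters

# Hypothetical one-dimensional data with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers, clusters = lloyd(points, k=2)
print(sorted(centers))
```

Because the result depends on the random initial centers, runs with different seeds can end in different local optima on less clear-cut data.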

11
Conservative K-Means Algorithm
  • Lloyd's algorithm is fast, but in each iteration it
    moves many data points, which does not necessarily
    improve convergence.
  • A more conservative method is to move one
    point at a time, and only if the move improves the
    overall clustering cost.
  • The smaller the clustering cost of a partition of
    the data points, the better the clustering.
  • Different measures can be used for the
    clustering cost (for example, the previous
    algorithm used the squared error distortion).
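
A minimal sketch of the one-point-at-a-time idea, using the squared error distortion as the clustering cost. The slide does not fix the details, so the scan order, the rule that a move never empties a cluster, and the names `cost` and `conservative_kmeans` are assumptions.

```python
import math

def centroid(C):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(xs) / len(C) for xs in zip(*C))

def cost(clusters):
    """Squared error distortion of a partition, measured to each cluster's centroid."""
    n = sum(len(C) for C in clusters)
    total = 0.0
    for C in clusters:
        if C:
            m = centroid(C)
            total += sum(math.dist(v, m) ** 2 for v in C)
    return total / n

def conservative_kmeans(clusters):
    """Move one point at a time, keeping a move only if it lowers the cost."""
    clusters = [list(C) for C in clusters]
    improved = True
    while improved:
        improved = False
        best = cost(clusters)
        for a in range(len(clusters)):
            for v in list(clusters[a]):
                if len(clusters[a]) == 1:
                    continue  # assumption: never empty a cluster
                for b in range(len(clusters)):
                    if b == a:
                        continue
                    clusters[a].remove(v)
                    clusters[b].append(v)
                    new = cost(clusters)
                    if new < best - 1e-12:
                        best, improved = new, True
                        break  # keep the move and go on to the next point
                    clusters[b].remove(v)  # undo a move that did not help
                    clusters[a].append(v)
    return clusters

# Hypothetical starting partition with one misplaced point.
start = [[(0.0,), (1.0,), (10.0,)], [(11.0,)]]
result = conservative_kmeans(start)
print(result, cost(result))
```

Every accepted move strictly lowers the cost and there are finitely many partitions, so the loop terminates, though it can still stop at a local optimum.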

12
Microarray summary
  • Microarrays (like MS) are a technology for
    probing the dynamic state of the cell.
  • We answered questions like the following:
  • Which genes are coordinately regulated (i.e., have
    similar expression patterns in different
    conditions)?
  • How can we reduce the dimensionality of the
    system?
  • Using gene expression values from a sample, can
    you predict if the sample is normal (state A) or
    diseased (state B)?
  • The techniques employed for classification/clustering
    etc. are general and can be employed in a
    number of contexts.

13
Microarray non-summary
  • We did not cover
  • How are the gene expression values measured (the
    technology)? (CSE183)
  • How do you control variability across different
    experiments (normalization)? (CSE183)
  • What controls the expression of a gene (gene
    regulation), or a set of genes? (CSE 181)

14
Population Genetics
  • The sequence of an individual does not say
    anything about the diversity of a population.
  • Small individual genetic differences can have a
    profound impact on phenotypes
  • Response to drugs
  • Susceptibility to diseases
  • Soon, we will have sequences of many individuals
    from the same species. Studying the differences
    will be a major challenge.

15
Population Structure
  • 377 locations (loci) were sampled in 1000 people
    from 52 populations.
  • 6 genetic clusters were obtained, 5 of which
    corresponded to major geographic regions (Rosenberg
    et al., Science 2002)

[Figure: genetic clusters labeled by region: Oceania, Eurasia, East Asia, America, Africa]
16
Population Genetics
  • What is it about our genetic makeup that makes us
    measurably different?
  • These genetic differences are correlated with
    phenotypic differences
  • With cost reduction in sequencing and genotyping
    technologies, we will know the sequence for
    entire populations of individuals.
  • Here, we will study the basics of this
    polymorphism data, and tools that are being
    developed to analyze it.

17
What causes variation in a population?
  • Mutations (may lead to SNPs)
  • Recombinations
  • Other genetic events (e.g., microsatellite
    repeats)
  • Deletions, inversions

18
Single Nucleotide Polymorphisms
Infinite Sites Assumption: Each site mutates at most once
  00000101011
  10001101001
  01000101010
  01000000011
  00011110000
  00101100110
19
Short Tandem Repeats
  Sequence                     Repeats
  GCTAGATCATCATCATCATTGCTAG    4
  GCTAGATCATCATCATTGCTAGTTA    3
  GCTAGATCATCATCATCATCATTGC    5
  GCTAGATCATCATCATTGCTAGTTA    3
  GCTAGATCATCATCATTGCTAGTTA    3
  GCTAGATCATCATCATCATCATTGC    5
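
The repeat counts above can be reproduced with a short script. This is a sketch only: the slide does not name the repeat unit, so treating `CAT` as the unit is an assumption (the shifted units `ATC` and `TCA` give the same counts here), and `max_tandem_repeats` is a hypothetical helper name.

```python
import re

def max_tandem_repeats(seq, unit):
    """Number of copies in the longest run of back-to-back copies of unit."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)

seqs = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]
print([max_tandem_repeats(s, "CAT") for s in seqs])  # [4, 3, 5, 3, 3, 5]
```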
20
STR can be used as a DNA fingerprint
  • Consider a collection of regions with variable
    length repeats.
  • Variable length repeats will lead to variable
    length DNA
  • Vector of lengths is a finger-print

[Figure: matrix of repeat counts indexed by individuals and positions; extracted values: 4 2 3 3 5 1 3 2 3 1 5 3]
21
Recombination
[Figure: parental sequences 00000000 and 11111111 recombine to give 00011111]
22
What if there were no recombinations?
  • Life would be simpler
  • Each sequence would have a single parent
  • The relationship is expressed as a tree.

23
The Infinite Sites Assumption
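[Figure: tree of sequences. The root 0 0 0 0 0 0 0 0 acquires a mutation at site 3, giving 0 0 1 0 0 0 0 0. From there, a mutation at site 5 gives 0 0 1 0 1 0 0 0, and a separate mutation at site 8 gives 0 0 1 0 0 0 0 1.]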
  • The different sites are linked. A 1 in position
    8 implies 0 in position 5, and vice versa.
  • Some phenotypes could be linked to the
    polymorphisms
  • Some of the linkage is destroyed by
    recombination

24
Infinite sites assumption and Perfect Phylogeny
  • Each site is mutated at most once in the history.
  • All descendants must carry the mutated value, and
    all others must carry the ancestral value

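[Figure: a mutation at site i on one edge of the tree; the subtree below that edge has 1 in position i, and the rest of the tree has 0 in position i]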
25
Perfect Phylogeny
  • Assume an evolutionary model in which no
    recombination takes place, only mutation.
  • The evolutionary history is explained by a tree
    in which every mutation is on an edge of the
    tree. All the species in one sub-tree contain a
    0, and all species in the other contain a 1. Such
    a tree is called a perfect phylogeny.
  • How can one reconstruct such a tree?

26
The 4-gamete condition
  • A column i partitions the set of species into two
    sets, $i_0$ and $i_1$.
  • A column is homogeneous w.r.t. a set of species
    if it has the same value for all species in the set.
    Otherwise, it is heterogeneous.
  • Ex: i is heterogeneous w.r.t. {A, D, E}

       i
  A    0
  B    0
  C    0
  D    1
  E    1
  F    1
$i_0$ = {A, B, C}
$i_1$ = {D, E, F}
27
4-Gamete Condition
  • 4-Gamete Condition:
  • There exists a perfect phylogeny if and only if,
    for all pairs of columns (i, j), j is homogeneous
    w.r.t. $i_0$ or w.r.t. $i_1$.
  • Equivalent to:
  • There exists a perfect phylogeny if and only if,
    for all pairs of columns (i, j), the following 4
    rows do not all occur:
  • (0,0), (0,1), (1,0), (1,1)
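
The pairwise test above can be sketched directly. This is an illustrative sketch: `four_gamete_ok` is a hypothetical name, each species is assumed to be given as a string of 0s and 1s, and both input matrices are made up for the example.

```python
from itertools import combinations

def four_gamete_ok(rows):
    """True if no pair of columns shows all of (0,0), (0,1), (1,0), (1,1)."""
    m = len(rows[0])
    for i, j in combinations(range(m), 2):
        gametes = {(r[i], r[j]) for r in rows}
        if len(gametes) == 4:
            return False
    return True

print(four_gamete_ok(["110", "001", "100"]))      # no pair shows all four rows
print(four_gamete_ok(["00", "01", "10", "11"]))   # columns 1 and 2 show all four rows
```

The check looks at every pair of the m columns across n rows, so it takes on the order of n·m² steps.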

28
4-gamete condition proof
  • Depending on which edge the mutation j occurs on,
    either $i_0$ or $i_1$ should be homogeneous.
  • (only if) Every perfect phylogeny satisfies the
    4-gamete condition.
  • (if) If the 4-gamete condition is satisfied, does
    a perfect phylogeny exist?

29
An algorithm for constructing a perfect phylogeny
  • We will consider the case where 0 is the
    ancestral state, and 1 is the mutated state. This
    restriction will be lifted later.
  • In any tree, each node (except the root) has a
    single parent.
  • It is sufficient to construct a parent for every
    node.
  • In each step, we add a column and refine some of
    the nodes containing multiple children.
  • Stop if all columns have been considered.

30
Inclusion Property
  • For any pair of columns i, j:
  • i < j if and only if $i_1 \supseteq j_1$
  • Note that if i < j, then the edge containing i is an
    ancestor of the edge containing j.

[Figure: edge i lies above edge j in the tree]
31
Example
     1  2  3  4  5
  A  1  1  0  0  0
  B  0  0  1  0  0
  C  1  1  0  1  0
  D  0  0  1  0  1
  E  1  0  0  0  0
Initially, there is a single clade r, and each
node has r as its parent.
32
Sort columns
  • Sort columns according to the inclusion property
    (note that the columns are already sorted here).
  • This can be achieved by considering the columns
    as binary representations of numbers (most
    significant bit in row 1) and sorting in
    decreasing order

     1  2  3  4  5
  A  1  1  0  0  0
  B  0  0  1  0  0
  C  1  1  0  1  0
  D  0  0  1  0  1
  E  1  0  0  0  0
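
The sorting step on this slide can be sketched as follows: each column is read top to bottom as a binary number, with row 1 as the most significant bit, and the columns are ordered by decreasing value. The function name `sort_columns` and the string-per-row input format are assumptions.

```python
def sort_columns(rows):
    """Return 0-based column indices ordered by decreasing binary value."""
    m = len(rows[0])

    def as_number(c):
        # Read column c top to bottom; the first row is the most significant bit.
        return int("".join(r[c] for r in rows), 2)

    return sorted(range(m), key=as_number, reverse=True)

# Rows A..E of the example matrix.
rows = ["11000", "00100", "11010", "00101", "10000"]
print(sort_columns(rows))  # [0, 1, 2, 3, 4]: the columns are already sorted
```

For this matrix the column values are 21, 20, 10, 4 and 2, which is why the order is unchanged.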
33
Add first column
     1  2  3  4  5
  A  1  1  0  0  0
  B  0  0  1  0  0
  C  1  1  0  1  0
  D  0  0  1  0  1
  E  1  0  0  0  0
  • In adding column i:
  • Check each edge and decide which side each species belongs on.
  • Finally, add a node if you can resolve a clade.

[Figure: tree after adding column 1. Root r has children u, B and D; the new node u has children A, C and E.]
34
Adding other columns
     1  2  3  4  5
  A  1  1  0  0  0
  B  0  0  1  0  0
  C  1  1  0  1  0
  D  0  0  1  0  1
  E  1  0  0  0  0
  • Add other columns on edges using the ordering
    property

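[Figure: final tree. From root r, edge 1 leads to a node with child E and edge 2; edge 2 leads to a node with child A and edge 4, which leads to C. Also from r, edge 3 leads to a node with child B and edge 5, which leads to D.]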
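
The construction walked through on the last few slides can be sketched as a parent-pointer builder. This is one possible reading of the procedure, not necessarily the exact bookkeeping used in the lecture: it assumes 0 is the ancestral state, names the node below the edge for column c as `col<c>`, and reports failure when two species disagree about a column's parent.

```python
def perfect_phylogeny(names, rows):
    """Return a dict child -> parent for the rooted tree, with root 'r'."""
    m = len(rows[0])
    # Columns with larger sets of 1s come first (binary value, row 1 most significant).
    order = sorted(range(m),
                   key=lambda c: int("".join(r[c] for r in rows), 2),
                   reverse=True)
    parent = {}
    for name, row in zip(names, rows):
        prev = "r"
        for c in order:
            if row[c] == "1":
                node = f"col{c + 1}"  # node just below the edge carrying mutation c+1
                if parent.setdefault(node, prev) != prev:
                    raise ValueError("no perfect phylogeny with 0 as the ancestral state")
                prev = node
        parent[name] = prev  # the species hangs below its last mutation
    return parent

names = ["A", "B", "C", "D", "E"]
rows = ["11000", "00100", "11010", "00101", "10000"]
tree = perfect_phylogeny(names, rows)
for child in sorted(tree):
    print(child, "<-", tree[child])
```

On the example matrix this places E below edge 1, A below edge 2, C below edge 4, B below edge 3 and D below edge 5.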
35
Unrooted case
  • Switch the values in each column, so that 0 is
    the majority element.
  • Apply the algorithm for the rooted case
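
The column-switching step can be sketched as below. The name `to_majority_zero` is hypothetical, and leaving a column unchanged when exactly half of its entries are 1 is an assumption, since the slide does not say how to break ties.

```python
def to_majority_zero(rows):
    """Flip every column in which 1 is the strict majority value."""
    n, m = len(rows), len(rows[0])
    flip = [2 * sum(r[c] == "1" for r in rows) > n for c in range(m)]
    swap = {"0": "1", "1": "0"}
    return ["".join(swap[ch] if flip[c] else ch for c, ch in enumerate(r))
            for r in rows]

# Hypothetical input: column 1 is all 1s, so it is the only column flipped.
print(to_majority_zero(["110", "100", "101"]))  # ['010', '000', '001']
```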

36
Handling recombination
  • A tree is not sufficient as a sequence may have 2
    parents
  • Recombination leads to loss of correlation
    between columns

37
Linkage (Dis)-equilibrium (LD)
  • Consider sites A and B
  • Case 1: No recombination
  • Pr[A,B = 0,1] = 0.25
  • Linkage disequilibrium
  • Case 2: Extensive recombination
  • Pr[A,B = 0,1] = 0.125
  • Linkage equilibrium

  A  B
  0  1
  0  1
  0  0
  0  0
  1  0
  1  0
  1  0
  1  0
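
The two probabilities on this slide can be checked numerically. The sketch assumes the flattened values are read row by row as eight (A, B) pairs. Under that reading the observed frequency of (0,1) is 0.25, while the product of the marginals, which is what linkage equilibrium would predict, is 0.125. The variable names are hypothetical.

```python
from fractions import Fraction

# (A, B) values for the eight sampled sequences, read row by row.
pairs = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]
n = len(pairs)

p_a0 = Fraction(sum(a == 0 for a, _ in pairs), n)         # Pr[A = 0]
p_b1 = Fraction(sum(b == 1 for _, b in pairs), n)         # Pr[B = 1]
p_joint = Fraction(sum(p == (0, 1) for p in pairs), n)    # observed Pr[A,B = 0,1]

expected_if_independent = p_a0 * p_b1
print(p_joint, expected_if_independent)   # 1/4 1/8
print(p_joint - expected_if_independent)  # 1/8, a nonzero gap indicates disequilibrium
```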