CSE182-L17 - PowerPoint PPT Presentation

About This Presentation

Title:

CSE182-L17

Description:

CSE182-L17 Clustering Population Genetics: Basics – PowerPoint PPT presentation

Slides: 39

Provided by: Vine47

Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

Title: CSE182-L17

1
CSE182-L17

Clustering
Population Genetics Basics

2
Unsupervised Clustering

Given a set of points (in n-dimensions), and k,
compute the k best clusters.
In k-means, clustering is done by choosing k
centers (means).
Each point is assigned to the closest center.
The notion of best is defined by distances to
the center.
Question: How can we compute the k best centers?

3
Distance

Given a data point v and a set of points X,
define the distance from v to X
d(v, X)
as the (Euclidean) distance from v to
the closest point from X.
Given a set of n data points $V = \{v_1, \dots, v_n\}$ and a set
of k points X,
define the Squared Error Distortion
$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
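
The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming points are tuples of floats; the function names and the sample points are made up for the example and do not come from the slides.

```python
import math

def dist_to_set(v, X):
    """Euclidean distance from point v to the closest point in X."""
    return min(math.dist(v, x) for x in X)

def squared_error_distortion(V, X):
    """Mean of the squared distances from each point in V to its closest point in X."""
    return sum(dist_to_set(v, X) ** 2 for v in V) / len(V)

# Illustrative data: four points, two candidate centers.
V = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
X = [(0.0, 1.0), (4.0, 1.0)]
print(squared_error_distortion(V, X))  # every point is at distance 1 from its center, so this prints 1.0
```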

4
K-Means Clustering Problem Formulation

Input: A set V consisting of n points, and a
parameter k.
Output: A set X consisting of k points (cluster
centers) that minimizes the squared error
distortion d(V,X) over all possible choices of X.
This problem is NP-hard in general.

5
1-Means Clustering Problem: an Easy Case

Input: A set V consisting of n points.
Output: A single point X that minimizes d(V,X)
over all possible choices of X.
This problem is easy: the optimal X is the
centroid (mean) of the points in V.
However, it becomes very difficult for more
than one center.
An efficient heuristic method for k-Means
clustering is the Lloyd algorithm.
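
For the single-center case, a short derivation shows why the problem is easy. It uses the squared error distortion with one center x and sets the gradient to zero; the notation (norm bars, gradient symbol) is added here for illustration.

```latex
d(V, x) = \frac{1}{n} \sum_{i=1}^{n} \lVert v_i - x \rVert^2

\nabla_x \, d(V, x) = -\frac{2}{n} \sum_{i=1}^{n} (v_i - x) = 0
\quad \Longrightarrow \quad
x = \frac{1}{n} \sum_{i=1}^{n} v_i
```

Because d(V, x) is a convex quadratic in x, this stationary point is the unique minimizer.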

6
K-means: Lloyd's algorithm

Choose k centers at random:
X' = {x_1, x_2, x_3, ..., x_k}
Repeat
  X = X'
  Assign each v ∈ V to the closest cluster j:
    d(v, x_j) = d(v, X) ⇒ C_j = C_j ∪ {v}
  Recompute X':
    x'_j = (Σ_{v ∈ C_j} v) / |C_j|
until (X' = X)
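
The loop above can be sketched in Python as follows. This is a hedged illustration rather than the course's implementation: it assumes points are tuples of floats, picks the initial centers from the data points, keeps the old center when a cluster ends up empty, and adds a max_iter safety cap. The function names and parameters are illustrative.

```python
import math
import random

def closest_index(v, centers):
    """Index of the center closest to v (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd_kmeans(V, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(V, k)  # assumption: initial centers are k of the data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for v in V:
            clusters[closest_index(v, centers)].append(v)
        # Update step: each center moves to the centroid of its cluster.
        new_centers = [centroid(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:  # X' = X: nothing moved, so stop
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = lloyd_kmeans(points, k=2)
```

Lloyd's algorithm only finds a local optimum, so in practice it is common to rerun it from several random starts and keep the result with the lowest distortion.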

11
Conservative K-Means Algorithm

Lloyd's algorithm is fast, but in each iteration it
moves many data points, which does not necessarily
improve convergence.
A more conservative method is to move one
point at a time, and only if the move improves the
overall clustering cost.
The smaller the clustering cost of a partition of
the data points, the better the clustering.
Different measures can be used for the
clustering cost (for example, Lloyd's algorithm
above uses the squared error distortion).
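
One way to sketch the conservative idea in Python is shown below. It assumes the clustering cost is the squared error distortion with each cluster's centroid as its center, applies the first single-point move that lowers the cost, and never empties a cluster. These choices and all names are assumptions made for illustration.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def clustering_cost(clusters):
    """Squared error distortion of a partition, using each cluster's centroid as center."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        if c:
            center = centroid(c)
            total += sum(math.dist(v, center) ** 2 for v in c)
    return total / n

def find_improving_move(clusters):
    """Return the partition after the first single-point move that lowers the cost, or None."""
    current = clustering_cost(clusters)
    for a, source in enumerate(clusters):
        if len(source) <= 1:
            continue  # assumption: never empty a cluster
        for i in range(len(source)):
            for b in range(len(clusters)):
                if b == a:
                    continue
                trial = [list(c) for c in clusters]
                trial[b].append(trial[a].pop(i))
                if clustering_cost(trial) < current - 1e-12:
                    return trial
    return None

def conservative_kmeans(clusters):
    """Move one point at a time while some move strictly improves the cost."""
    clusters = [list(c) for c in clusters]
    while True:
        better = find_improving_move(clusters)
        if better is None:
            return clusters
        clusters = better

start = [[(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)], [(10.0, 11.0)]]
result = conservative_kmeans(start)
```

The loop terminates because every accepted move strictly lowers the cost and there are only finitely many partitions.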

12
Microarray summary

Microarrays (like MS) are a technology for
probing the dynamic state of the cell.
We answered questions like the following:
Which genes are coordinately regulated (i.e., have
similar expression patterns in different
conditions)?
How can we reduce the dimensionality of the
system?
Using gene expression values from a sample, can
you predict if the sample is normal (state A) or
diseased (state B)?
The techniques employed for classification,
clustering, etc. are general and can be employed
in a number of contexts.

13
Microarray non-summary

We did not cover
How are the gene expression values measured (the
technology)? (CSE183)
How do you control variability across different
experiments (normalization)? (CSE183)
What controls the expression of a gene (gene
regulation), or a set of genes? (CSE 181)

14
Population Genetics

The sequence of an individual does not say
anything about the diversity of a population.
Small individual genetic differences can have a
profound impact on phenotypes
Response to drugs
Susceptibility to diseases
Soon, we will have sequences of many individuals
from the same species. Studying the differences
will be a major challenge.

15
Population Structure

377 locations (loci) were sampled in 1000 people
from 52 populations.
6 genetic clusters were obtained, which
corresponded to 5 geographic regions (Rosenberg
et al., Science 2002)

(Figure: genetic clusters by geographic region. Region labels: Oceania, Eurasia, East Asia, America, Africa.)
16
Population Genetics

What is it about our genetic makeup that makes us
measurably different?
These genetic differences are correlated with
phenotypic differences
With cost reduction in sequencing and genotyping
technologies, we will know the sequence for
entire populations of individuals.
Here, we will study the basics of this
polymorphism data, and tools that are being
developed to analyze it.

17
What causes variation in a population?

Mutations (may lead to SNPs)
Recombinations
Other genetic events (e.g., microsatellite
repeats)
Deletions, inversions

18
Single Nucleotide Polymorphisms
Infinite Sites Assumption: Each site mutates at
most once

00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
19
Short Tandem Repeats

Sequence                    Repeat count
GCTAGATCATCATCATCATTGCTAG   4
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATCATCATTGC   5
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATCATCATTGC   5
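
The repeat counts on this slide can be reproduced with a short Python sketch. The repeat unit is not named on the slide; the code assumes it is ATC, and the function name is illustrative.

```python
import re

def max_tandem_repeats(seq, unit):
    """Largest number of consecutive copies of `unit` found anywhere in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)

sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]
# Assumption: the repeated unit is "ATC".
counts = [max_tandem_repeats(s, "ATC") for s in sequences]
print(counts)  # [4, 3, 5, 3, 3, 5]
```

Collecting such counts over several repeat regions gives the vector of lengths that serves as a fingerprint.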
20
STR can be used as a DNA fingerprint

Consider a collection of regions with variable
length repeats.
Variable length repeats will lead to variable
length DNA
Vector of lengths is a finger-print

(Figure: matrix of repeat counts with axes labeled "individuals" and "positions". Entries as extracted: 4 2 3 3 5 1 3 2 3 1 5 3)
21
Recombination
Parent 1:    00000000
Parent 2:    11111111
Recombinant: 00011111
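
A minimal Python sketch of a single crossover reproduces the sequences on this slide. The function name and the crossover parameter are illustrative assumptions; real recombination can involve several crossover points.

```python
def recombine(parent_a, parent_b, crossover):
    """Take the first `crossover` sites from parent_a and the remaining sites from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return parent_a[:crossover] + parent_b[crossover:]

child = recombine("00000000", "11111111", 3)
print(child)  # 00011111
```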
22
What if there were no recombinations?

Life would be simpler
Each sequence would have a single parent
The relationship is expressed as a tree.

23
The Infinite Sites Assumption
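0 0 0 0 0 0 0 0 (root)
0 0 1 0 0 0 0 0 (after a mutation at site 3)
0 0 1 0 1 0 0 0 (one descendant, after a mutation at site 5)
0 0 1 0 0 0 0 1 (another descendant, after a mutation at site 8)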

The different sites are linked. A 1 in position
8 implies 0 in position 5, and vice versa.
Some phenotypes could be linked to the
polymorphisms
Some of the linkage is destroyed by
recombination

24
Infinite sites assumption and Perfect Phylogeny

Each site is mutated at most once in the history.
All descendants must carry the mutated value, and
all others must carry the ancestral value

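(Figure: a tree with mutation i on one edge. The subtree below that edge has 1 in position i; the rest of the tree has 0 in position i.)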
25
Perfect Phylogeny

Assume an evolutionary model in which no
recombination takes place, only mutation.
The evolutionary history is explained by a tree
in which every mutation is on an edge of the
tree. All the species in one sub-tree contain a
0, and all species in the other contain a 1. Such
a tree is called a perfect phylogeny.
How can one reconstruct such a tree?

26
The 4-gamete condition

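A column i partitions the set of species into two
sets, i0 and i1 (the species with value 0 and the
species with value 1 in column i).
A column is homogeneous w.r.t. a set of species
if it has the same value for all species in the set.
Otherwise, it is heterogeneous.
Example: i is heterogeneous w.r.t. {A, D, E}.

Species  i
A        0
B        0
C        0
D        1
E        1
F        1

i0 = {A, B, C}
i1 = {D, E, F}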
27
4 Gamete Condition

There exists a perfect phylogeny if and only if,
for every pair of columns (i,j), j is homogeneous
w.r.t. at least one of i0 and i1.
Equivalent to:
There exists a perfect phylogeny if and only if,
for every pair of columns (i,j), the following 4
rows do not all occur:
(0,0), (0,1), (1,0), (1,1)
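
The second form of the condition translates directly into a pairwise check. The Python sketch below assumes the data is a list of rows of 0/1 values (one row per species); the function names and the two small matrices are illustrative, not from the slides.

```python
from itertools import combinations

def violates_four_gamete(col_i, col_j):
    """True if all four combinations (0,0), (0,1), (1,0), (1,1) occur in the two columns."""
    return len(set(zip(col_i, col_j))) == 4

def satisfies_four_gamete(matrix):
    """True if no pair of columns shows all four gametes."""
    columns = list(zip(*matrix))
    return not any(violates_four_gamete(a, b) for a, b in combinations(columns, 2))

# Illustrative matrices (rows are species, columns are sites).
compatible = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
conflicting = [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
]
print(satisfies_four_gamete(compatible))   # True
print(satisfies_four_gamete(conflicting))  # False
```

The check compares every pair of columns, so it takes time proportional to the number of species times the square of the number of columns.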

28
4-gamete condition proof

Depending on the edge on which mutation j occurs,
either i0 or i1 must be homogeneous.
(only if) Every perfect phylogeny satisfies the
4-gamete condition.
(if) If the 4-gamete condition is satisfied, does
a perfect phylogeny exist?

29
An algorithm for constructing a perfect phylogeny

We will consider the case where 0 is the
ancestral state, and 1 is the mutated state. This
will be fixed later.
In any tree, each node (except the root) has a
single parent.
It is sufficient to construct a parent for every
node.
In each step, we add a column and refine some of
the nodes containing multiple children.
Stop if all columns have been considered.
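
Since it suffices to find a parent for every node, the idea can be sketched directly in Python. This is not the incremental column-by-column refinement described on the slide but a simpler equivalent computation: each mutation's parent is the mutation with the smallest strictly larger set of carriers, and each species hangs below the smallest mutation set containing it. It assumes 0 is ancestral, the 4-gamete condition holds, and all columns are distinct and non-empty. The node names c1, c2, ... and r, the function name, and the 5 x 5 matrix are illustrative assumptions.

```python
def build_parents(matrix, species):
    """Map every mutation node (c1, c2, ...) and every species to its parent; 'r' is the root."""
    columns = list(zip(*matrix))
    # ones[c] = set of species carrying the mutation of column c
    ones = {
        f"c{j + 1}": frozenset(s for s, bit in zip(species, col) if bit)
        for j, col in enumerate(columns)
    }
    parent = {}
    for name, members in ones.items():
        # Candidate ancestors: mutations whose carrier set strictly contains this one.
        supersets = [k for k, other in ones.items() if members < other]
        parent[name] = min(supersets, key=lambda k: len(ones[k]), default="r")
    for s in species:
        # A species hangs below the most specific mutation it carries.
        carriers = [k for k, members in ones.items() if s in members]
        parent[s] = min(carriers, key=lambda k: len(ones[k]), default="r")
    return parent

species = ["A", "B", "C", "D", "E"]
matrix = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
parents = build_parents(matrix, species)
```

For this matrix the sketch places c1 and c3 under the root, c2 under c1, c4 under c2, and c5 under c3.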

30
Inclusion Property

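For any pair of columns i,j
i < j if and only if i1 ⊇ j1 (every species with a 1 in column j also has a 1 in column i)
Note that if i < j then the edge containing i is an
ancestor of the edge containing j

(Figure: edge i lies above edge j in the tree.)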
31
Example

   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0

Initially, there is a single clade r, and each
node has r as its parent
32
Sort columns

Sort columns according to the inclusion property
(note that the columns are already sorted here).
This can be achieved by considering the columns
as binary representations of numbers (most
significant bit in row 1) and sorting in
decreasing order

   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0
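
The sorting step can be sketched in Python as follows, reading each column as a binary number with row 1 as the most significant bit. The function names are illustrative; the matrix is the one shown on this slide.

```python
def column_value(column):
    """Interpret a 0/1 column as a binary number, first row = most significant bit."""
    value = 0
    for bit in column:
        value = value * 2 + bit
    return value

def sort_columns(matrix):
    """Return (order, sorted_matrix) with columns in decreasing order of binary value."""
    columns = list(zip(*matrix))
    order = sorted(range(len(columns)), key=lambda j: column_value(columns[j]), reverse=True)
    sorted_matrix = [[row[j] for j in order] for row in matrix]
    return order, sorted_matrix

matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
order, sorted_matrix = sort_columns(matrix)
print([column_value(c) for c in zip(*matrix)])  # [21, 20, 10, 4, 2]
print(order)  # [0, 1, 2, 3, 4]: the columns are already sorted
```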
33
Add first column
   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0