Two-stage Cluster Sampling When Clusters are of Unequal Size

Title: Two-stage Cluster Sampling When Clusters are of Unequal Size. Last modified by: nkfust. Created: 5/13/2002 2:19:22 PM. PowerPoint presentation.

Slides: 87

Transcript and Presenter's Notes

1
Cluster Analysis
2

Cluster analysis, a term first used by Tryon (1939), encompasses a
number of different algorithms and methods for
grouping objects of a similar kind into their respective
categories.

3
(No Transcript; the only surviving term is "homogeneity")
4
(No Transcript)
5

(No Transcript)

7
Euclidean Distance

With N objects each measured on M variables, X is the N x M data matrix.

8
dij
9
Matching coefficient

10
(No Transcript)
11
(No Transcript)
12

Non-hierarchical methods

a. Sequential threshold
13
b. Parallel threshold
c. Optimizing partitioning (based on a criterion measure)
14
d. K-means method (K clusters)
15
Hierarchical methods
16
(No Transcript)
17
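K-means procedure
1. Partition the objects into K initial clusters. 2. Assign each object to the cluster whose centroid (mean) is nearest (by distance), and recompute the centroids of the clusters that gain or lose an object. 3. Repeat step 2 until no object changes cluster.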
18
Ex: K-means with four items
Initial partition: cluster (1,2) and cluster (3,4)
19
Squared distance from item 1 to the centroid of cluster (1,2):
D^2(1, (1,2)) = (12 - 2)^2 + (8 - 6)^2 = 104
After reassignment the clusters are (1) and (2,3,4).
20
Cluster (1)      Cluster (2,3,4)
X1 = 12          X1
X2 = 8           X2
21
With K = 2, the final clusters are (1) and (2,3,4).
22
Two-stage Cluster Sampling When Clusters are of
Unequal Size

Desired sample proportion: p = n/N
a = desired number of clusters selected in the 1st stage
A = total number of clusters
b = sample size within each cluster selected
N_i = number of elements in cluster i

23
Simple Two-stage Cluster Sampling

The first-stage probability: p1 = a/A
The second-stage probability: p2 = p / (a/A)
Sample size in cluster i: n_i = p2 N_i

24
Probability Proportional to Size
where
25
Example

Draw a sample of 1,000 households from a city
that contains about 200,000 households
distributed among 2000 blocks of unequal but
known size.
The desired sample proportion: p = 1/200
The desired number of clusters selected in the 1st stage: a = 100
How do we conduct the two-stage cluster sampling?
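
One way to work the example through is sketched below. The simple two-stage part uses the deck's own definitions (p1 = a/A, p2 = p / p1, n_i = p2 N_i). The PPS formulas did not survive in this transcript, so the sketch assumes the usual textbook form (p1 = a N_i / N, p2 = b / N_i with b = n/a); the function names and the three block sizes are hypothetical.

```python
from fractions import Fraction

def simple_two_stage(p, a, A):
    """Return (p1, p2) when every cluster has the same chance of selection."""
    p1 = Fraction(a, A)      # first stage: a of the A clusters
    p2 = p / p1              # second stage chosen so that p1 * p2 == p
    return p1, p2

def pps_two_stage(n, a, N, N_i):
    """Return (b, p1, p2) for one cluster of size N_i under PPS (assumed textbook form)."""
    b = Fraction(n, a)           # fixed take per selected cluster
    p1 = Fraction(a * N_i, N)    # selection chance grows with cluster size
    p2 = b / N_i                 # within-cluster chance shrinks with cluster size
    return b, p1, p2

n, N, a, A = 1000, 200_000, 100, 2000
p = Fraction(n, N)
p1, p2 = simple_two_stage(p, a, A)
print("simple two-stage:", p, p1, p2)

for N_i in (50, 100, 200):       # hypothetical block sizes
    b, q1, q2 = pps_two_stage(n, a, N, N_i)
    print(N_i, "simple n_i =", p2 * N_i, "| PPS take =", b, "| overall =", q1 * q2)
```

Under the simple design the take per block varies with block size (one household in ten), while under the assumed PPS design every selected block contributes the same b = 10 households and the overall probability stays at 1/200.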

26
What is Cluster Analysis?

Cluster Analysis is a class of statistical
techniques that can be applied to data that
exhibit natural groupings.
CA is an interdependence technique that makes no
distinction between dependent and independent
variables.
There is NO statistical significance testing in
CA.
CA is more a group of different algorithms that
put objects into clusters following well-defined
similarity rules.

27
What is a Cluster?

A cluster is a group of relatively homogeneous
cases and observations.
Clusters exhibit high internal homogeneity and
high external heterogeneity.

28
A Cluster Diagram: Drinkers' Perceptions of Alcohol
29
Characteristics of CA

Cluster Analysis is a tool of discovery.
It discovers structures in data but does NOT
explain why they exist.
CA is used when we do not have an a priori
hypothesis, but when we are in the exploratory
phase.

30
How does CA differ?

From Discriminant Analysis
Discriminant analysis is a dependence technique.
It predicts the probability that an object will fall
into one of two or more mutually exclusive
categories based on several independent
variables.
It finds a linear combination of independent
variables.
Cluster analysis instead finds natural groupings based on
distances among objects.

From Factor Analysis
Factor analysis is similar to cluster analysis in that it is an
interdependence technique.
The primary difference lies in the focus: variables
versus objects.
Factor analysis reduces variables to a few
factors. Cluster analysis reduces objects to a
few clusters.

32
Cluster Analysis Methods

Three Cluster Analysis Methods
Joining (Tree Clustering)
Two-way Joining
K-means Clustering

33
Joining (Tree Clustering)

A type of hierarchical clustering: agglomerative
Each unit starts as its own cluster.
Results are displayed as a dendrogram.
Many other methods exist.

34
The first level shows all samples x_i as singleton
clusters. As the level increases, more samples are
clustered together in a hierarchical manner.
35
It is based on sets where each cluster level may
contain sets that are subclusters as shown in the
Venn diagram.
36
Two-way Joining (Hartigan, 1975)

Two-way Joining tries to cluster both variables
and objects.
Only useful if you think clustering along BOTH
lines will be useful.
Very rare in application.

37
k-Means Clustering

Begin with a preconception about the number of
clusters (k).
Can be thought of as ANOVA in reverse.
ANOVA evaluates between-group variance against
within-group variance when computing the statistical
significance of the hypothesis that groups are different.
In k-Means, the computer tries to move objects
in and out of the groups to get the most
significant ANOVA results.

38
It's all about distance

Distance Measures
Euclidean Distance
Squared Euclidean Distance
Manhattan Distance
Chebychev Distance
Power Distance

39
EQUATION Euclidean Distance

Basic equation for determining distance measure.
distance(x, y) = \left\{ \sum_i (x_i - y_i)^2 \right\}^{1/2}
A standard formula for determining the distance
between two points on a plane
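
The five distance measures listed on the previous slide can be sketched in a few lines. The function names are illustrative, and the two-parameter form of the power distance, (sum |x_i - y_i|^p)^(1/r), is an assumption taken from the usual textbook definition rather than from the slides.

```python
def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def squared_euclidean(x, y):
    """Euclidean distance without the square root; weights distant objects more."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Largest absolute difference on any single coordinate."""
    return max(abs(a - b) for a, b in zip(x, y))

def power_distance(x, y, p, r):
    """Assumed form: (sum |x_i - y_i|^p)^(1/r); p = r = 2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / r)

x, y = (0, 0), (3, 4)
print(euclidean(x, y), squared_euclidean(x, y), manhattan(x, y),
      chebychev(x, y), power_distance(x, y, 2, 2))
```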

40
Fairly simple, right?
41
In other words, how do we get from this
42
To this
43
To this
44
How to Determine Clusters.

Use a computer.
Call a professional.

Clusters in the
Real World

46
Why is Cluster Analysis Important?

Relatively new and evolving technique
Highly useful for market segmentation
Segmentation: identifying groupings of customers
using statistical multivariate analysis, often
based on perceptions and attitudes as well as
demographics and behavior.
Segmentation is helpful to small companies
attempting to carve out a niche
and to large companies trying to tailor their
products/services to different segments

47
In addition to segmentation, clusters are used to

Design products and establish brands
Target direct mail
Make decisions about customer conversion and
retention
Decide on marketing cost levels

48
Ex: Luxury Car Customers

Demographic examples easier to illustrate
Demographics
Gender
Education
Age
149 customers (objects) of a luxury car dealership

49
Using SPSS for Clustering

Chose TwoStep Cluster Analysis
Basically, the agglomerative technique
(dendrogram).
Step One: creates very small (individual)
sub-clusters.
Step Two: clusters the sub-clusters into the desired
number of clusters.
Automatically finds the optimum number of clusters.

50
Two-Step CA Output
What are these clusters?
51
Two-Step CA Output
52
(No Transcript)
53
(No Transcript)
54
What does this mean?

Cluster 5
Age: 36 - 65
Education: High School graduate or above
Gender: Female
We could have used k-Means, which would have generated
different results.
Clustering is a powerful marketing research tool.

55
Claritas: Clustering Experts

Example: Claritas Corporation
Claritas founded the U.S. geodemographic industry
when it launched the first PRIZM segmentation
system in 1974.
PRIZM (Potential Rating Index for Zip Markets)
categorizes every U.S. neighborhood into 1 of 62
clusters.
Descriptive Names
Money and Brains
Young Literati
Shotguns and Pickups

56
Money and Brains

Sophisticated Urban Fringe Couples
The cluster is a mix of family types: singles,
married couples with children and married couples
without children. These families own their homes
in upscale neighborhoods near cities. Dual
incomes provide luxuries, travel and
entertainment.
Demographics
Affluent
Age Groups: 55-64, 65+
Predominantly White, High Asian

57
Clusters Work!

At a conservative estimate, more than 20,000
companies in the United States and Canada alone
used clusters as part of their marketing
information mix last year.

58
Web Sources

http://cwis.livjm.ac.uk/bus/busrmccl/ae230/lect10.ppt
http://www.clusterbigip1.claritas.com/claritas/Default.jsp?main3submenusegsubcatsegprizm
http://www.clusterbigip1.claritas.com/claritas/Default.jsp?main3submenusegsubcatsegprizmne
http://www.insightsc.ie/newsletter7.htm
http://www.directionsmag.com/article.asp?article_id12
http://fun.supereva.it/scoleri.freeweb/cern/biografie/hawking.jpg
http://www.statsoft.com/textbook/stcluan.html
http://www-db.stanford.edu/ullman/mining/cluster1.pdf
http://www.snr.missouri.edu/multivariate/ClusterAnalysis.pdf

59
Print Sources

Recent Developments in Clustering and Data Analysis. Edited by Chikio Hayashi, Edwin Diday, Michel Jambu, Noboru Ohsumi. Academic Press, Inc. 1988.
Finding Groups in Data: An Introduction to Cluster Analysis. Leonard Kaufman, Peter J. Rousseeuw. John Wiley and Sons, Inc. 1990.
Marketing Research: An Aid to Decision Making. Dr. Alan T. Shao. South-Western. 2002.
Exploring Marketing Research. William G. Zikmund. South-Western. 2003.

60
Ex 7 Hypothetical Data
Subject Id. Income (1000) Education (years)
S1 5 5
S2 6 6
S3 15 14
S4 16 15
S5 25 20
S6 30 19
61
Similarity Matrix (Squared Euclidean Distances)
Id    S1    S2    S3    S4    S5    S6
S1     0     2   181   221   625   821
S2     2     0   145   181   557   745
S3   181   145     0     2   136   250
S4   221   181     2     0   106   212
S5   625   557   136   106     0    26
S6   821   745   250   212    26     0
d(S1, S3) = (15-5)^2 + (14-5)^2 = 181
d(S1, S2) = 2 is the smallest distance, so S1 and S2 are joined first.
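
The matrix can be recomputed from the six hypothetical subjects as a check. This is a minimal sketch; the helper name sq_dist is illustrative, and the values are squared Euclidean distances (no square root), which is what the entries in the matrix correspond to.

```python
# Hypothetical data from the deck: (income in 1000s, education in years)
data = {
    "S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
    "S4": (16, 15), "S5": (25, 20), "S6": (30, 19),
}

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

names = list(data)
matrix = [[sq_dist(data[r], data[c]) for c in names] for r in names]

print("Id  " + "".join(f"{name:>6}" for name in names))
for name, row in zip(names, matrix):
    print(f"{name:<4}" + "".join(f"{value:>6}" for value in row))
```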
62
Centroid Method: Five Clusters
Data for Five Clusters
Cluster  Cluster Members        Income (1000)      Education (years)
1        S1 & S2 (5,5) (6,6)    5.5 = (5+6)/2      5.5 = (5+6)/2
2        S3                     15                 14
3        S4                     16                 15
4        S5                     25                 20
5        S6                     30                 19
63
Similarity Matrix (Squared Euclidean Distances)
Id       S1&S2    S3      S4      S5      S6
S1&S2    0        162.5   200.5   590.5   782.5
S3       162.5    0       2       136     250
S4       200.5    2       0       106     212
S5       590.5    136     106     0       26
S6       782.5    250     212     26      0
d(S1&S2, S3) = (5.5-15)^2 + (5.5-14)^2 = 162.5
d(S3, S4) = 2 is the smallest distance, so S3 and S4 are joined next.
64
Centroid Method: Four Clusters
Data for Four Clusters
Cluster  Cluster Members            Income (1000)        Education (years)
1        S1 & S2 (5,5) (6,6)        5.5 = (5+6)/2        5.5 = (5+6)/2
2        S3 & S4 (15,14) (16,15)    15.5 = (15+16)/2     14.5 = (14+15)/2
3        S5                         25                   20
4        S6                         30                   19
65
Similarity Matrix (Squared Euclidean Distances)
Id       S1&S2    S3&S4   S5      S6
S1&S2    0        181     590.5   782.5
S3&S4    181      0       120.5   230.5
S5       590.5    120.5   0       26
S6       782.5    230.5   26      0
d(S1&S2, S5) = (5.5-25)^2 + (5.5-20)^2 = 590.5
d(S5, S6) = 26 is the smallest distance, so S5 and S6 are joined next.
66
Centroid Method: Three Clusters
Data for Three Clusters
Cluster  Cluster Members            Income (1000)        Education (years)
1        S1 & S2 (5,5) (6,6)        5.5 = (5+6)/2        5.5 = (5+6)/2
2        S3 & S4 (15,14) (16,15)    15.5 = (15+16)/2     14.5 = (14+15)/2
3        S5 & S6 (25,20) (30,19)    27.5 = (25+30)/2     19.5 = (20+19)/2
67
Similarity Matrix (Squared Euclidean Distances)
Id       S1&S2    S3&S4   S5&S6
S1&S2    0        181     680
S3&S4    181      0       169
S5&S6    680      169     0
d(S1&S2, S5&S6) = (5.5-27.5)^2 + (5.5-19.5)^2 = 680
d(S3&S4, S5&S6) = 169 is the smallest distance, so these two clusters are joined next.
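
The whole centroid-method sequence can be sketched as a short loop: at each step, join the two clusters whose centroids are closest, then recompute the centroid. This is an illustrative sketch rather than the procedure SAS runs; the function names are hypothetical, and ties (S1-S2 and S3-S4 both at 2) are broken by taking the first pair found.

```python
from itertools import combinations

data = {
    "S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
    "S4": (16, 15), "S5": (25, 20), "S6": (30, 19),
}

def centroid(members):
    """Mean of each variable over the named cluster members."""
    points = [data[m] for m in members]
    return tuple(sum(col) / len(points) for col in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid_linkage(names):
    """Return the merge history as (cluster_a, cluster_b, squared_distance)."""
    clusters = [[name] for name in names]
    history = []
    while len(clusters) != 1:
        cents = [centroid(c) for c in clusters]
        # min() on (distance, i, j) tuples picks the closest pair, first pair on ties
        d, i, j = min((sq_dist(cents[i], cents[j]), i, j)
                      for i, j in combinations(range(len(clusters)), 2))
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

history = centroid_linkage(list(data))
for left, right, d in history:
    print(left, "+", right, "squared distance:", d, "distance:", round(d ** 0.5, 4))
```

Taking square roots of the merge distances gives the plain Euclidean centroid distances that hierarchical clustering software typically reports.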
68
Exhibit 7.1 SAS Output for cluster analysis on
data in Table 7.1

Simple statistics
           Mean      Std Dev   Skewness   Kurtosis   Bimodality
INCOME     16.1667   9.9883     0.2684    -1.4015    0.2211
EDUC       13.1667   6.3692    -0.4510    -1.8108    0.2711
Root-Mean-Square Total-Sample Standard Deviation = 8.376555

69
Root-Mean-Square Total-Sample Standard Deviation = 8.376555 (RMSSTD)

Step  Number of  Clusters    Frequency of  RMS STD of   Semipartial  R-Squared  Centroid
      Clusters   Joined      New Cluster   New Cluster  R-Squared               Distance
1     5          S1   S2     2             0.707107     0.001425     0.998575    1.4142
2     4          S3   S4     2             0.707107     0.001425     0.997150    1.4142
3     3          S5   S6     2             2.549510     0.018527     0.978622    5.0990
4     2          CL4  CL3    4             5.522681     0.240855     0.737767   13.0000
5     1          CL5  CL2    6             8.376555     0.737767     0.000000   19.7041
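
The R-Squared, Semipartial R-Squared and RMS STD columns of the cluster history can be reproduced from sums of squares. The sketch below assumes the usual definitions (R-squared = 1 - SS_within / SS_total, semipartial R-squared = the drop in R-squared caused by a merge, RMSSTD = pooled standard deviation over all variables); the function names are illustrative.

```python
data = {
    "S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
    "S4": (16, 15), "S5": (25, 20), "S6": (30, 19),
}

def ss_within(points):
    """Sum of squared deviations from the cluster mean, over all variables."""
    n = len(points)
    means = [sum(col) / n for col in zip(*points)]
    return sum((v - m) ** 2 for p in points for v, m in zip(p, means))

def rmsstd(points):
    """Pooled standard deviation: sqrt(SS / (number of variables * (n - 1)))."""
    n_vars = len(points[0])
    return (ss_within(points) / (n_vars * (len(points) - 1))) ** 0.5

# Cluster memberships after each merge, in the order of the cluster history.
partitions = [
    [["S1", "S2"], ["S3"], ["S4"], ["S5"], ["S6"]],
    [["S1", "S2"], ["S3", "S4"], ["S5"], ["S6"]],
    [["S1", "S2"], ["S3", "S4"], ["S5", "S6"]],
    [["S1", "S2"], ["S3", "S4", "S5", "S6"]],
    [["S1", "S2", "S3", "S4", "S5", "S6"]],
]

ss_total = ss_within(list(data.values()))
r_squared = []
previous = 1.0
for clusters in partitions:
    ss_w = sum(ss_within([data[m] for m in c]) for c in clusters)
    r2 = 1 - ss_w / ss_total
    r_squared.append(r2)
    # number of clusters, semipartial R-squared, R-squared
    print(len(clusters), round(previous - r2, 6), round(r2, 6))
    previous = r2

print("total RMSSTD:", round(rmsstd(list(data.values())), 6))
```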
70

CLUSTER=1               CLUSTER=2               CLUSTER=3
OBS SID INCOME EDUC     OBS SID INCOME EDUC     OBS SID INCOME EDUC
1   S1  5      5        3   S3  15     14       5   S5  25     20
2   S2  6      6        4   S4  16     15       6   S6  30     19

71
Exhibit 7.2 Non-hierarchical Clustering on Data

Replace=FULL Radius=0 Maxclusters=3 Maxiter=20
Converge=0.02
Initial Seeds
Cluster    INCOME     EDUC
-------------------------------
1           5.0000    5.0000
2          30.0000   19.0000
3          16.0000   15.0000

The initial seeds are S1, S6, S4.
72
Exhibit 7.2 (continued)

Minimum Distance Between Seeds = 14.56022
Iteration    Change in Cluster Seeds
             1           2          3
---------------------------------------------
1            0.707107    2.54951    0.707107
2            0           0          0
Statistics for Variables
Variable    Total STD   Within STD   R-Squared   RSQ/(1-RSQ)
--------------------------------------------------------------
INCOME      9.988327    2.121320     0.972937     35.950617
EDUC        6.369197    0.707107     0.992605    134.222222
OVER-ALL    8.376555    1.581139     0.978622     45.777778

73
Exhibit 7.2 (continued)

Pseudo F Statistic = 68.67
Approximate Expected Over-All R-Squared = .
Cubic Clustering Criterion = .
WARNING: The two above values are invalid for
correlated variables.
Cluster Means
Cluster    INCOME     EDUC
-------------------------------
1           5.5000    5.5000
2          27.5000   19.5000
3          15.5000   14.5000
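
The cluster means and the pseudo F statistic in this exhibit can be reproduced with a plain batch k-means started from the same three seeds (S1, S6, S4). This is a sketch under stated assumptions: it uses the textbook assign-then-update loop, which is not necessarily how the SAS procedure updates its seeds, and the pseudo F is computed as (R2 / (k - 1)) / ((1 - R2) / (n - k)). All function names are illustrative.

```python
data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 20), (30, 19)]   # S1..S6
seeds = [(5, 5), (30, 19), (16, 15)]                               # S1, S6, S4

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, seeds, max_iter=20):
    """Batch k-means: assign every point to its nearest center, then update centers."""
    centers = [tuple(float(v) for v in s) for s in seeds]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        updated = [mean(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:        # no center moved: converged
            break
        centers = updated
    return centers, groups

centers, groups = kmeans(data, seeds)
ss_total = sum(sq_dist(p, mean(data)) for p in data)
ss_within = sum(sq_dist(p, c) for g, c in zip(groups, centers) for p in g)
r2 = 1 - ss_within / ss_total
k, n = len(centers), len(data)
pseudo_f = (r2 / (k - 1)) / ((1 - r2) / (n - k))
print(centers, round(r2, 6), round(pseudo_f, 2))
```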
74
Exhibit 7.4 Hierarchical Cluster Analysis For
Food Data

SINGLE LINKAGE CLUSTER ANALYSIS
SIMPLE STATISTICS
            MEAN      STD DEV   SKEWNESS   KURTOSIS   BIMODALITY
CALORIES    207.407   101.208    0.542     -0.675     0.478
PROTEIN      19.000     4.252   -0.824      1.327     0.357
FAT          13.481    11.257    0.790     -0.624     0.589
CALCIUM      43.963    78.034    3.159     11.345     0.746
IRON          2.381     1.461    1.230      1.469     0.518

75
Exhibit 7.4 (continued)

COMPLETE LINKAGE CLUSTER ANALYSIS
NUMBER OF  CLUSTERS JOINED            FREQUENCY OF  RMS STD OF   SEMIPARTIAL  R-SQUARED  MAXIMUM
CLUSTERS                              NEW CLUSTER   NEW CLUSTER  R-SQUARED               DISTANCE
10         CL15  CANNED CRABMEAT      4             11.32324     0.003476     0.985594    50.6665
9          CL17  ROAST LAMB SHOUL     3             12.59929     0.003226     0.982367    55.6611
8          CL14  CANNED SHRIMP        3             16.10565     0.005231     0.977136    71.1677
7          CL13  ROAST BEEF           6             14.34190     0.009755     0.967381    80.9343
6          CL10  CL8                  7             22.14096     0.023782     0.943599   108.1758
5          CL9   CL11                 11            20.22234     0.039103     0.904496   141.7814
4          CL6   CL12                 9             30.07489     0.048662     0.855835   154.4447
3          CL7   CL5                  17            38.73570     0.220433     0.635402   262.5666
2          CL4   CANNED SARDINES      10            51.36181     0.192623     0.442779   364.8934
1          CL3   CL2                  27            57.40958     0.442779     0.000000   433.7617

76
Exhibit 7.4 (continued)

ROOT-MEAN-SQUARE TOTAL-SAMPLE STANDARD DEVIATION = 57.4096
NUMBER OF  CLUSTERS JOINED                        FREQUENCY OF  RMS STD OF   SEMIPARTIAL  R-SQUARED  MINIMUM
CLUSTERS                                          NEW CLUSTER   NEW CLUSTER  R-SQUARED               DISTANCE
10         CANNED MACKEREL  CANNED SALMON         2             11.16786     0.001455     0.973438    35.3159
9          CL14             ROAST LAMB SHOULDER   3             12.59929     0.003226     0.970211    35.4131
8          CL11             CANNED CRABMEAT       12            16.80697     0.014701     0.955510    39.5267
7          CL15             CL9                   8             20.48901     0.028341     0.927169    40.1627
6          CL7              CL8                   20            40.04817     0.285060     0.642109    40.2746
5          CL12             CANNED SHRIMP         3             16.10565     0.005231     0.636878    44.8504
4          CL6              ROAST BEEF            21            43.49500     0.085924     0.550954    45.7642
3          CL4              CL5                   24            48.72189     0.189548     0.361406    48.7139
2          CL3              CL10                  26            50.53988     0.106595     0.254811    62.2624
1          CL2              CANNED                27            57.40958     0.254811     0.000000   211.5691

77
Exhibit 7.4 (continued)

CENTROID HIERARCHICAL CLUSTER ANALYSIS
NUMBER OF  CLUSTERS JOINED              FREQUENCY OF  RMS STD OF   SEMIPARTIAL  R-SQUARED  CENTROID
CLUSTERS                                NEW CLUSTER   NEW CLUSTER  R-SQUARED               DISTANCE
10         CL15  CANNED CRABMEAT        4             11.32324     0.003476     0.985594    44.5633
9          CL16  ROAST LAMB SHOULDER    3             12.59929     0.003226     0.982367    45.5370
8          CL14  CANNED SHRIMP          3             16.10565     0.005231     0.977136    57.9815
7          CL13  CL10                   12            16.80697     0.026857     0.950279    65.6901
6          CL12  ROAST BEEF             6             14.34190     0.009755     0.940524    70.8222
5          CL6   CL9                    9             24.36751     0.039727     0.900797    92.2533
4          CL8   CL11                   5             26.85628     0.026158     0.874639    96.6423
3          CL7   CL4                    17            31.36108     0.113709     0.760930   117.4906
2          CL5   CL3                    26            50.53988     0.506119     0.254811   191.9655
1          CL2   CANNED SARDINES        27            57.40958     0.254811     0.000000   336.7134

78
Exhibit 7.4 (continued)

WARD'S MINIMUM VARIANCE CLUSTER ANALYSIS
NUMBER OF  CLUSTERS JOINED          FREQUENCY OF  RMS STD OF   SEMIPARTIAL  R-SQUARED  BETWEEN-CLUSTER
CLUSTERS                            NEW CLUSTER   NEW CLUSTER  R-SQUARED               SUM OF SQUARES
10         CL14  CANNED CRABMEAT    4             11.32324     0.003476     0.985908     1489.42
9          CL16  CL20               8              7.75641     0.003541     0.982367     1517.12
8          CL15  CANNED SHRIMP      3             16.10565     0.005231     0.977136     2241.24
7          CL12  ROAST BEEF         6             14.34190     0.009755     0.967381     4179.83
6          CL10  CL8                7             22.14096     0.023782     0.943599    10189.5
5          CL11  CL9                11            20.22234     0.039103     0.904496    16754.1
4          CL6   CL13               9             30.07489     0.048662     0.855835    20849.7
3          CL5   CL4                20            36.22080     0.158726     0.697109    68007.8
2          CL3   CANNED SARDINES    21            47.72546     0.240715     0.456394   103137
1          CL7   CL2                27            57.40958     0.456394     0.000000   195548

79
Exhibit 7.5 Non-Hierarchical Analysis For
Food-Nutrient Data

INITIAL SEEDS
CLUSTER    CALORIES   PROTEIN   FAT      CALCIUM   IRON
----------------------------------------------------------
1          331.111    19.000    27.556     8.778   2.467
2          161.667    20.500     7.500    14.250   1.925
3          100.000    14.800     3.400   114.000   3.000

80
Exhibit 7.5 (continued)

MINIMUM DISTANCE BETWEEN SEEDS = 117.4876
ITERATION    CHANGE IN CLUSTER SEEDS
             1          2          3
------------------------------------------
1            10.8475    6.46446     0.3
2            0          6.85281    12.7855
3            0          0           0

CLUSTER SUMMARY
CLUSTER  FREQUENCY  RMS STD     MAXIMUM DISTANCE FROM  NEAREST  CENTROID
NUMBER              DEVIATION   SEED TO OBSERVATION    CLUSTER  DISTANCE
--------------------------------------------------------------------------
1        8          20.8936     78.8882                2        168.5
2        12         16.3651     70.9576                3        117.9
3        6          27.8059     79.6672                2        117.9

82
STATISTICS FOR VARIABLES
VARIABLE    TOTAL STD    WITHIN STD   R-SQUARED   RSQ/(1-RSQ)
---------------------------------------------------------------
CALORIES    103.06085    39.89286     0.86216     6.25453
PROTEIN       4.29257     3.58590     0.35798     0.55758
FAT          11.44357     4.52989     0.85584     5.93681
CALCIUM      44.70188    22.76009     0.76150     3.19291
IRON          1.49005     1.51663     0.04688     0.04919
OVER-ALL     50.53988    20.71299     0.84547     5.47135
PSEUDO F STATISTIC = 62.92
APPROXIMATE EXPECTED OVER-ALL R-SQUARED = 0.78678
CUBIC CLUSTERING CRITERION = 2.186
83
Exhibit 7.5 (continued)

CLUSTER MEANS
CLUSTER    CALORIES   PROTEIN   FAT      CALCIUM   IRON
----------------------------------------------------------
1          341.875    18.750    28.875     8.750   2.437
2          174.583    21.083     8.750    11.833   2.083
3           98.333    14.667     3.167   101.333   2.883