Title: 1.2 Displaying and Describing Categorical and Quantitative Data
11.2 Displaying and Describing Categorical and Quantitative Data
2You should be able to
- Recognize when a variable is categorical or quantitative
- Choose an appropriate display for a categorical variable and for a quantitative variable
- Summarize a distribution with a bar chart, pie chart, stem-and-leaf plot, histogram, dot plot, or box plot
- Know how to make a contingency table
- Describe the distribution of categorical variables in terms of relative frequencies
- Describe the distribution of a quantitative variable in terms of its shape, center, and spread
- Describe abnormalities or extraordinary features of a distribution
- Discuss outliers and how they deviate from the overall pattern
3THREE RULES of EXPLORATORY DATA ANALYSIS
- MAKE A PICTURE: find patterns that are difficult to see in the raw data
- MAKE A PICTURE: show important features in a graph
- MAKE A PICTURE: communicate your data to others
4Concepts to know!
- Bar graph
- Histogram
- Dot plot
- Stem-and-leaf plot
- Scatterplots
- Boxplots
5Which graph to use?
- Depends on type of data
- Depends on what you want to illustrate
6Categorical Data
- The objects being studied are grouped into categories based on some qualitative trait.
- The resulting data are merely labels or categories.
7Categorical Data (Single Variable)
Eye Color             BLUE           BROWN          GREEN
Frequency (COUNTS)    20             50             5
Relative Frequency    20/75 = .27    50/75 = .67    5/75 = .07
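The relative frequencies in the eye-color table can be recomputed directly from the counts. The following is a minimal sketch; the variable names are illustrative and not part of the original slides.

```python
# Counts taken from the eye-color table above.
counts = {"BLUE": 20, "BROWN": 50, "GREEN": 5}

total = sum(counts.values())  # total number of observations

# Relative frequency = category count / total count.
relative = {color: n / total for color, n in counts.items()}

for color, share in relative.items():
    print(f"{color:<6} {counts[color]:>3}  {share:.2f}")
```

The unrounded relative frequencies always sum to 1; rounded values may sum to slightly more or less.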
8Pie Chart (Data is Counts or Percentages)
9Bar Graph (Shows distribution of data)
10Bar Graph
- Summarizes categorical data.
- Horizontal axis represents categories, while vertical axis represents either counts (frequencies) or percentages (relative frequencies).
- Used to illustrate the differences in percentages (or counts) between categories.
11Contingency Table (How data is distributed across multiple variables)
                      Class
Survival    First   Second   Third   Crew   Total
ALIVE         203      118     178    212     711
DEAD          122      167     528    673    1490
Total         325      285     706    885    2201
12What can go wrong when working with categorical
data?
- Pay attention to the variables and what the percentages represent
- (Using the contingency table: "62.5% of passengers who were in first class survived" (203/325) is different from "28.6% of survivors were first-class passengers" (203/711)!!!)
- Make sure you have a reasonably large data set ("67% of the rats tested died and 1 lived" means only three rats were tested)
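The difference between the two kinds of percentages comes down to which total goes in the denominator. The sketch below uses the counts from the contingency table (slide 11); the variable names are illustrative only.

```python
# Counts from the survival-by-class contingency table.
classes = ["First", "Second", "Third", "Crew"]
alive = [203, 118, 178, 212]
dead = [122, 167, 528, 673]

class_totals = [a + d for a, d in zip(alive, dead)]  # column totals
total_alive = sum(alive)                             # row total for ALIVE

# Condition on class: of the people in each class, what percent survived?
pct_survived_within_class = {
    c: 100 * a / t for c, a, t in zip(classes, alive, class_totals)
}

# Condition on survival: of all survivors, what percent came from each class?
pct_of_survivors_by_class = {
    c: 100 * a / total_alive for c, a in zip(classes, alive)
}

for c in classes:
    print(f"{c:<6} survived within class: {pct_survived_within_class[c]:5.1f}%   "
          f"share of survivors: {pct_of_survivors_by_class[c]:5.1f}%")
```

The same cell count (203 first-class survivors) gives two very different percentages depending on whether it is divided by the class total or by the survivor total.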
13Analogy
Bar chart is to categorical data as histogram is
to ...
quantitative data.
14Histogram
15Histogram
- Divide the measurement scale into equal-sized categories (the BIN WIDTH).
- Determine the number (or percentage) of measurements falling into each category.
- Draw a bar for each category so that bar heights represent the number (or percent) falling into the categories.
- Label and title appropriately.
- http://www.stat.sc.edu/west/javahtml/Histogram.html
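The binning steps above can be sketched in a few lines. The data set, the starting value, and the bin width below are hypothetical, chosen only for illustration; the convention of including the right edge in the last bin is also an assumption, since the slides do not specify one.

```python
def bin_counts(values, low, bin_width, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Bins are [low, low + w), [low + w, low + 2w), ...; the last bin also
    includes its right edge. Values outside the range are ignored.
    """
    counts = [0] * n_bins
    high = low + n_bins * bin_width
    for v in values:
        if v == high:
            idx = n_bins - 1          # put the maximum edge in the last bin
        else:
            idx = int((v - low) // bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

# Hypothetical test scores (not from the slides).
scores = [45, 52, 55, 61, 63, 64, 68, 70, 71, 71,
          74, 75, 77, 78, 80, 82, 85, 88, 91, 100]

low, bin_width, n_bins = 40, 10, 6
counts = bin_counts(scores, low, bin_width, n_bins)

# Text rendering: one row per bin, one '#' per observation.
for i, c in enumerate(counts):
    left = low + i * bin_width
    print(f"{left:>3}-{left + bin_width:<3} {'#' * c}")
```

Changing `bin_width` and `n_bins` and rerunning is a quick way to see how the apparent shape depends on the choice of bins.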
16Histogram
Use common sense in determining the number of categories to use. Between 6 and 15 intervals is preferable.
(Trial-and-error works fine, too.)
17Too few categories
18Too many categories
19Dot Plot
20Dot Plot
- Summarizes quantitative data.
- Horizontal axis represents measurement scale.
- Plot one dot for each data point.
21Stem-and-Leaf Plot
Stem-and-leaf of Shoes   N = 139
Leaf Unit = 1.0

  12   0  223334444444
  63   0  555555555555566666666677777778888888888888999999999
 (33)  1  000000000000011112222233333333444
  43   1  555555556667777888
  25   2  0000000000023
  12   2  5557
   8   3  0023
   4   3
   4   4  00
   2   4
   2   5  0
   1   5
   1   6
   1   6
   1   7
   1   7  5
22Stem-and-Leaf Plot
- Summarizes quantitative data.
- Each data point is broken down into a stem and a leaf.
- First, stems are aligned in a column.
- Then, leaves are attached to the stems.
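The construction can be sketched as follows, assuming non-negative integer data with a leaf unit of 1 (tens digit as stem, ones digit as leaf). The function name and the sample values are hypothetical. Unlike the Shoes display on slide 21, which splits each stem over two lines, this sketch uses one line per stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted list of leaves} for non-negative integers."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # tens part is the stem, ones digit is the leaf
        stems[stem].append(leaf)
    lo, hi = min(stems), max(stems)
    # Keep empty stems so that gaps in the data stay visible.
    return {s: stems.get(s, []) for s in range(lo, hi + 1)}

# Hypothetical data (not from the slides).
data = [23, 12, 37, 15, 21, 52, 28, 30, 23, 34]
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

Because the leaves are sorted within each stem, the display doubles as a sorted list of the data, which is why it is handy for finding the median.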
23Box Plot
24Box Plot
- Summarizes quantitative data.
- Vertical (or horizontal) axis represents measurement scale.
- Lines in box represent the 25th percentile (first quartile), the 50th percentile (median), and the 75th percentile (third quartile), respectively.
25Five-Number Summary
- Minimum
- Q1 (25th percentile)
- Median (50th percentile)
- Q3 (75th percentile)
- Maximum
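A five-number summary can be computed by hand or with a short function. Several slightly different quartile conventions are in use; the sketch below assumes the "median of each half" convention (the middle value is excluded from both halves when the count is odd), which is an assumption and not something the slides specify. The data are hypothetical.

```python
def median(sorted_vals):
    """Median of an already-sorted, non-empty list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2:                      # odd count: the single middle value
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                # values below the median position
    upper = data[half + (n % 2):]      # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical data (not from the slides).
sample = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 30]
summary = five_number_summary(sample)
print("min, Q1, median, Q3, max =", summary)
```

Statistical software may report slightly different quartiles for the same data because it interpolates between data points.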
26An aside...
- Roughly speaking:
- The 25th percentile is the number such that 25% of the data points fall below the number.
- The median, or 50th percentile, is the number such that half of the data points fall below the number.
- The 75th percentile is the number such that 75% of the data points fall below the number.
27Box Plot (contd)
- Whiskers are drawn to the most extreme data points that are not more than 1.5 times the length of the box (the IQR) beyond either quartile.
- IQR = Q3 - Q1
- Outliers (upper) > Q3 + 1.5 × IQR
- Outliers (lower) < Q1 - 1.5 × IQR
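The 1.5 × IQR rule is straightforward to apply once Q1 and Q3 are known. In this sketch the quartiles are passed in as arguments, and the data and quartile values are hypothetical.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) under the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def split_outliers(values, q1, q3):
    """Return (outliers, whisker ends) for the given quartiles."""
    lower, upper = outlier_fences(q1, q3)
    outliers = [v for v in values if v < lower or v > upper]
    inside = [v for v in values if lower <= v <= upper]
    # Whiskers end at the most extreme points that are not outliers.
    return outliers, (min(inside), max(inside))

# Hypothetical data with Q1 = 4 and Q3 = 12 (not from the slides).
sample = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 30]
fences = outlier_fences(4, 12)
outliers, whiskers = split_outliers(sample, 4, 12)

print("fences:", fences)
print("outliers:", outliers)
print("whisker ends:", whiskers)
```

Here the IQR is 8, so any value below -8 or above 24 is flagged; only 30 qualifies, and the upper whisker stops at 15.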
28Using Box Plots to Compare
Outliers
29Strengths and Weaknesses of Graphs for
Quantitative Data
- Histograms
  - Use intervals
  - Good for judging the shape of the data
  - Not good for small data sets
- Stem-and-Leaf Plots
  - Good for sorting data (finding the median)
  - Not good for large data sets
30Strengths and Weaknesses of Graphs for
Quantitative Data
- Dotplots
  - Use individual data points
  - Good for showing general descriptions of center and variation
  - Not good for judging shape for large data sets
- Boxplots
  - Good for showing an exact look at center, spread, and outliers
  - Not good for judging shape
31Analogy
Contingency table is to categorical data with two
variables as scatterplot is to ...
quantitative data with two variables.
32Scatter Plots
33Scatter Plots
- Summarizes the relationship between two quantitative variables.
- Horizontal axis represents one variable and vertical axis represents second variable.
- Plot one point for each pair of measurements.
34No relationship
35Summary
- Many possible types of graphs.
- Use common sense in reading graphs.
- When creating graphs, don't summarize your data too much or too little.
- When creating graphs, label everything for others.
- Remember you are trying to communicate something to others!
38Graphical Analysis
Center (Location)
Spread (Variation)
Shape
39Interesting Features Identified by Graphs
- Center (Location)
- Spread (Variability)
- Shape
- Individual Values
- Compare Groups
- Identify Outliers
- LOOK FOR PATTERNS, CLUSTERS, GAPS!!!!!
- LOOK FOR DEVIATIONS FROM THE GENERAL PATTERN!!!!!
- http://bcs.whfreeman.com/bps3e/
40Shape of Graph
41Shape of Graph
- Symmetry-Skewness
- Modes(peaks)
42SYMMETRY SKEWNESS
- Skewness - a measure of symmetry, or more precisely, the lack of symmetry.
- A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
43Kurtosis
- Kurtosis - a measure of whether the data are peaked or flat relative to a normal distribution.
- Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails.
- Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.
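Skewness and kurtosis can also be put into numbers. The slides give no formula, so the sketch below assumes the common moment-based definitions (third and fourth central moments divided by powers of the second), with kurtosis reported as "excess" relative to the normal distribution's value of 3. The function names and data are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 when skewed right."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical data sets (not from the slides).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print("symmetric:    skew =", skewness(symmetric),
      " excess kurtosis =", excess_kurtosis(symmetric))
print("right-skewed: skew =", skewness(right_skewed))
```

Under these definitions, positive excess kurtosis corresponds to leptokurtic data, negative to platykurtic, and values near zero to mesokurtic.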
44Symmetric -- Mesokurtic
45Symmetric -- Platykurtic
46Symmetric -- Leptokurtic
47Nonsymmetric Skewed Positive -- Skewed Right
48Nonsymmetric Skewed Negative -- Skewed Left
49Symmetric -- Bimodal
50Describing Distributions SOCS Rock!
- Shape
- Outliers
- Center
- Spread
51Graphical Analysis - Scale
[Figure: SOL Scores, Class 1 and Class 2]
52Graphical Analysis - Scale
[Figure: SOL Scores, Class 1 and Class 2]
53[Figure: SOL Scores, Class 1 and Class 2]