Data analysis - PowerPoint PPT Presentation

About This Presentation
Title:

Data analysis


Slides: 41
Provided by: 5686908

Transcript and Presenter's Notes

Title: Data analysis


1
Data analysis
2
  • The first step in any data analysis strategy is
    to calculate summary measures to get a general
    feel for the data. Summary measures for a data
    set are often referred to as descriptive
    statistics. Descriptive statistics fall into
    three main categories:
  • measures of position (or central tendency)
  • measures of variability
  • measures of skewness

3
  • The purpose of descriptive statistics is to
    describe the data.
  • The type of data will determine which descriptive
    statistic is appropriate.
  • Specifically, one can only calculate a mean with
    interval or ratio data, whereas a mode can be
    calculated with nominal, ordinal, interval or
    ratio data.

4
Measures of Position
  • Measures of position (or central tendency)
    describe where the data are concentrated.
  • Mean: The mean is simply the arithmetic average
    of the data.
  • The mean provides you with a quick way of
    describing your data, and is probably the most
    used measure of central tendency.
  • However, the mean is greatly influenced by
    outliers. For example, consider the following
    set: 1, 1, 2, 4, 5, 5, 6, 6, 7, 150.
  • While the mean for this data set is 18.7, nine
    out of ten of the observations lie below the
    mean because of the large final observation.
  • Consequently, the mean is not always the best
    measure of central tendency.
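
The effect of the outlier can be checked directly. The following is a minimal Python sketch using the data set from this slide; the helper name `mean` is just an illustrative choice.

```python
# Data set from the slide: nine small values and one large outlier.
data = [1, 1, 2, 4, 5, 5, 6, 6, 7, 150]

def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)

data_mean = mean(data)
below_mean = sum(1 for x in data if x < data_mean)  # observations under the mean
mean_without_outlier = mean(data[:-1])              # drop the final value, 150

print(f"mean = {data_mean}")
print(f"observations below the mean = {below_mean} of {len(data)}")
print(f"mean without the outlier = {mean_without_outlier:.2f}")
```

Dropping the single value 150 moves the mean from 18.7 to about 4.11, which is much closer to where most of the observations sit.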

5
  • Median: The median is the middle observation in a
    data set. That is, 50% of the observations are
    above the median and 50% are below the median
    (for sets with an even number of observations,
    the median is the average of the middle two
    observations).
  • The median is often used when a data set is not
    symmetrical, or when there are outlying
    observations.
  • For example, median income is generally reported
    rather than mean income because of the outlying
    observations.

6
  • To get the median, first put your numbers in
    ascending or descending order. Then check which
    of the following two rules applies:
  • Rule One. If you have an odd number of numbers,
    the median is the center number (e.g., three is
    the median for the numbers 1, 1, 3, 4, 9).
  • Rule Two. If you have an even number of numbers,
    the median is the average of the two innermost
    numbers (e.g., 2.5 is the median for the numbers
    1, 2, 3, 7).
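
The two rules above can be sketched as a short Python function. The function name `median` and the error raised for an empty list are illustrative choices, not part of the slides.

```python
def median(values):
    """Median by the two rules: sort first, then pick or average the middle."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)          # ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Rule One: odd count, take the center number.
        return ordered[mid]
    # Rule Two: even count, average the two innermost numbers.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 1, 3, 4, 9]))  # odd count
print(median([1, 2, 3, 7]))     # even count
```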

7
  • Mode: The mode is the value around which the
    greatest number of observations are concentrated,
    or quite simply the most common observation.
  • The mode is often used with nominal data, but is
    not the preferred measure for other types of
    data.
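
Because the mode only needs counts, it works on labels as well as numbers. A minimal Python sketch follows; the eye-colour data are hypothetical, and the helper `modes` returns every value tied for the highest count, since a data set can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical nominal data: a mean or median would be meaningless here.
eye_colours = ["blue", "brown", "brown", "green", "brown", "blue"]
print(modes(eye_colours))
print(modes([1, 1, 2, 2, 3]))  # two modes
```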

8
  • The mean, median, and mode are affected
    differently by skewness (i.e., lack of symmetry)
    in the data.

9
  • When a variable is normally distributed, the
    mean, median, and mode are the same number.  

10
  • When the variable is skewed to the left (i.e.,
    negatively skewed), the mean is pulled to the
    left the most, the median is pulled to the left
    the second most, and the mode is the least
    affected.
  • Therefore, mean < median < mode.

11
  • When the variable is skewed to the right (i.e.,
    positively skewed), the mean is pulled to the
    right the most, the median is pulled to the right
    the second most, and the mode is the least
    affected.
  • Therefore, mean > median > mode.
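
The ordering of the three measures under skew can be illustrated with a small Python sketch. Both data sets are hypothetical: the first has a long right tail, and the second is its mirror image.

```python
from statistics import mean, median, mode

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]  # hypothetical, long right tail
left_skewed = [-x for x in right_skewed]         # mirror image, long left tail

for name, values in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: mean = {mean(values)}, "
          f"median = {median(values)}, mode = {mode(values)}")
```

For the right-skewed set the mean is largest and the mode smallest; mirroring the data reverses the order.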

12
Measures of Variability
  • While measures of position describe where the
    data points are concentrated, measures of
    variability measure the dispersion (or spread) of
    the data set.
  • Range
  • The range is the difference between the largest
    and the smallest observations in the data set.
    However, this is a limited measure because it
    depends on only two of the numbers in the data
    set. Using the data set from the mean example
    again (1, 1, 2, 4, 5, 5, 6, 6, 7, 150), the range
    is 149, but that does not provide any information
    regarding the concentration of the data at the
    low end of the scale. Another limitation of range
    is that it is affected by the number of
    observations in the data set.
  • Generally, the more observations there are, the
    more spread out they will be. One use of range in
    everyday life is in newspaper stock market
    summaries, which give the day's high and low
    numbers.
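
A short Python sketch shows why the range says little about where the data are concentrated. The first list is the data set from the slides; the second, `evenly_spread`, is a hypothetical set invented to have the same smallest and largest values.

```python
def value_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)

data = [1, 1, 2, 4, 5, 5, 6, 6, 7, 150]                      # from the slides
evenly_spread = [1, 18, 34, 51, 67, 84, 100, 117, 133, 150]  # hypothetical

print(value_range(data))
print(value_range(evenly_spread))
```

Both sets have the same range even though one is clustered at the low end and the other is spread across the whole scale.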

13
  • Measures of Variability
  • Measures of variability tell you how "spread out"
    or how much variability is present in a set of
    numbers.
  • For example, which set of the following numbers
    appears to be the most spread out?
  • Set A.  93, 96, 98, 99, 99, 99, 100
  • Set B.  10, 29, 52, 69, 87, 92, 100
  • Right! The numbers in set B are more "spread
    out."
  • One crude indicator of variability is the range
    (i.e., the difference between the highest and
    lowest numbers).

14
  • Two commonly used indicators of variability are
    the variance and the standard deviation.
  • Variance: Unlike range, variance takes into
    consideration all the data points in the data
    set. If all the observations are the same, the
    variance would be zero. The more spread out the
    observations are, the larger the variance.
  • The variance tells you (exactly) the average
    squared deviation from the mean, so it is
    expressed in "squared units."

15
  • Standard Deviation: Standard deviation is the
    positive square root of the variance, and is the
    most common measure of variability. Standard
    deviation indicates how far the numbers tend to
    lie from the mean. The larger the standard
    deviation, the more variation there is in the
    data set.
  • (If the standard deviation is 7, then the
    numbers tend to be about 7 units from the mean.
    If the standard deviation is 1500, then the
    numbers tend to be about 1500 units from the
    mean.)
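
Variance and standard deviation can be compared for Set A and Set B from the earlier slide with a minimal Python sketch. It assumes the population formulas (divide by N), which match the "average squared deviation" wording; statistical packages typically report the sample versions (divide by N - 1), available in Python as `statistics.variance` and `statistics.stdev`.

```python
from statistics import pstdev, pvariance

set_a = [93, 96, 98, 99, 99, 99, 100]
set_b = [10, 29, 52, 69, 87, 92, 100]
constant = [3, 3, 3, 3, 3, 3]   # no variability at all

for name, values in [("Set A", set_a), ("Set B", set_b), ("Constant", constant)]:
    # pvariance: average squared deviation from the mean
    # pstdev: positive square root of that variance
    print(f"{name}: variance = {pvariance(values):.2f}, SD = {pstdev(values):.2f}")
```

Set B, which looked more spread out, has the larger variance and standard deviation, and the constant data give zero for both.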

16
  • Virtually everyone in education is already
    familiar with the normal curve.
  • An easy rule applying to data that follow the
    normal curve is the "68, 95, 99.7 percent rule."
    That is:
  • Approximately 68% of the cases will fall within
    one standard deviation of the mean.
  • Approximately 95% of the cases will fall within
    two standard deviations of the mean.
  • Approximately 99.7% of the cases will fall within
    three standard deviations of the mean.
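
The three percentages in the rule follow from the normal distribution itself. As a minimal Python sketch, the share of a normal distribution lying within k standard deviations of the mean equals erf(k / sqrt(2)); the function name `share_within` is an illustrative choice.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```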

17
  • Higher values for both of these indicators stand
    for a larger amount of variability. Zero stands
    for no variability at all (e.g., for the data 3,
    3, 3, 3, 3, 3, the variance and standard
    deviation will equal zero).

18
  • Frequency Distributions
  • One useful way to view information in a variable
    is to construct a frequency distribution (i.e.,
    an arrangement in which the frequencies, and
    sometimes percentages, of the occurrence of each
    unique data value are shown).
  • When a variable has a wide range of values, you
    may prefer using a grouped frequency distribution
    (i.e., where the data values are grouped into
    intervals, 0-9, 10-19, 20-29, etc., and the
    frequencies of the intervals are shown).
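
Both kinds of frequency distribution can be built in a few lines of Python. This is a sketch only: the `scores` list is hypothetical, and the interval width of 10 simply mirrors the 0-9, 10-19, 20-29 example above.

```python
from collections import Counter

scores = [3, 7, 12, 15, 15, 21, 24, 28, 28, 28]  # hypothetical data

# Ungrouped: frequency and percentage of each unique value.
frequencies = Counter(scores)
for value in sorted(frequencies):
    count = frequencies[value]
    print(f"{value:>3}  n = {count}  ({count / len(scores):.0%})")

def grouped_frequencies(values, width=10):
    """Count values per interval: 0-9, 10-19, 20-29, ... for width 10."""
    groups = Counter((v // width) * width for v in values)
    return {f"{low}-{low + width - 1}": groups[low] for low in sorted(groups)}

print(grouped_frequencies(scores))
```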

19
  • Graphic Representations of Data
  • Another excellent way to clearly describe your
    data (especially for visually oriented learners)
    is to construct graphical representations of the
    data (i.e., pictorial representations of the data
    in two-dimensional space).  
  • A bar graph uses vertical bars to represent the
    data. The height of the bars usually represents
    the frequencies for the categories shown on the X
    axis (i.e., the horizontal axis). (By the way,
    the Y axis is the vertical axis.)

20
  • A line graph uses one or more lines to depict
    information about one or more variables.  
  • A simple line graph might be used to show a trend
    over time (e.g., with the years on the X axis and
    the population sizes on the Y axis).
  • Line graphs are used for many different purposes
    in research. For example, a line graph can show a
    frequency distribution, with GPA on the X axis
    and frequency on the Y axis.

21
  • A scatterplot is used to depict the relationship
    between two quantitative variables.
  • Typically, the independent or predictor variable
    is represented by the X axis (i.e., on the
    horizontal axis) and the dependent variable is
    represented by the Y axis (i.e., on the vertical
    axis).

22
  • The relationship is not always positive.
  • The correlation coefficient ranges between -1 and
    +1.
  • Interpretation of Pearson r:
  • Close to +1: highly positively correlated
  • Close to -1: highly negatively correlated
  • Close to zero: no linear correlation

23
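  • Correlation does not necessarily indicate
    causation.
  • For example, r = .82 between two tests tells us
    that a person's standing on one test is a good
    guide to their standing on the other: a person
    with an average score on one test will probably
    obtain an average score on the other test.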

24
  • How to Interpret the Values of Correlations.
  • The correlation coefficient (r) represents the
    linear relationship between two variables. If the
    correlation coefficient is squared, then the
    resulting value (r², the coefficient of
    determination) will represent the proportion of
    common variation in the two variables (i.e., the
    "strength" or "magnitude" of the relationship).
  • In order to evaluate the correlation between
    variables, it is important to know this
    "magnitude" or "strength" as well as the
    significance of the correlation.
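
The relationship between r and the coefficient of determination can be made concrete with a minimal Python sketch of the Pearson formula. The data `xs` and `ys` are hypothetical, and `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]   # hypothetical predictor scores
ys = [2, 4, 5, 4, 5]   # hypothetical outcome scores

r = pearson_r(xs, ys)
print(f"r = {r:.3f}, r squared = {r ** 2:.2f}")
```

Here r is about .77, so r squared is .60: the two variables share about 60% of their variation.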

25
  • Outliers.
  • Outliers are atypical (by definition), infrequent
    observations.
  • Outliers have a profound influence on the slope
    of the regression line and consequently on the
    value of the correlation coefficient.
  • A single outlier is capable of considerably
    changing the slope of the regression line and,
    consequently, the value of the correlation, as
    demonstrated in the following example.
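
A small numerical illustration, written as a Python sketch with hypothetical data: six points lie on a perfectly negative line, and a single added outlier at (20, 20) is enough to reverse the sign of both the slope and the correlation. The helper `slope_and_r` is an illustrative name.

```python
import math

def slope_and_r(xs, ys):
    """Least-squares slope of y on x, and the Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]
ys = [6, 5, 4, 3, 2, 1]          # hypothetical: perfectly negative relationship

slope_without, r_without = slope_and_r(xs, ys)
slope_with, r_with = slope_and_r(xs + [20], ys + [20])   # add one outlier

print(f"without outlier: slope = {slope_without:.2f}, r = {r_without:.2f}")
print(f"with outlier:    slope = {slope_with:.2f}, r = {r_with:.2f}")
```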

26
(No Transcript)
27
  • Analyses for Comparison
  • Nominal data: Chi-square
  • Interval data: t-test
  • Interval data: One-way ANOVA
  • Interval data: Factorial ANOVA
  • Analyses for Association
  • Interval data: Pearson product-moment correlation
    (r)
  • Nominal data: Phi coefficient
  • Ordinal data: Spearman rank-order correlation

28
Parametric methods                                                        Nonparametric methods
t-test for independent samples                                            Mann-Whitney U test
ANOVA/MANOVA (multiple groups)                                            Kruskal-Wallis analysis of ranks and the Median test
t-test for dependent samples (two variables measured in the same sample)  Sign test and Wilcoxon's matched pairs test
29
  • t-test for independent samples
  • Purpose, Assumptions.
  • The t-test is the most commonly used method to
    evaluate the differences in means between two
    groups.
  • For example, the t-test can be used to test for a
    difference in test scores between a group of
    patients who were given a drug and a control
    group who received a placebo.
  • Theoretically, the t-test can be used even if the
    sample sizes are very small (e.g., as small as
    10; some researchers claim that even smaller n's
    are possible), as long as the variables are
    normally distributed within each group and the
    variation of scores in the two groups is not
    reliably different.

30
  • The normality assumption can be evaluated by
    looking at the distribution of the data (via
    histograms) or by performing a normality test.
  • The equality of variances assumption can be
    verified with the F test, or you can use the more
    robust Levene's test.
  • If these conditions are not met, then you can
    evaluate the differences in means between two
    groups using one of the nonparametric
    alternatives to the t-test.

31
  • Independent-samples t test

IV: stress condition (low vs. high). DV: Talk.

Talk           Mean    N    Std. Deviation   Std. Error Mean
Low stress     45.20   15   24.97            6.45
High stress    22.07   15   27.14            7.01

Std. Error Mean = SD/√N (here SD/√15): the standard deviation of the sample means.

Talk                           F      Sig.   t       df       Sig. (2-tailed)   Mean diff   Std. error diff
Equal variance assumed         .023   .881   2.43    28       .022
Equal variance not assumed                   2.430   27.808   .022

F and Sig. are from Levene's test for equality of variance, tested at α = .05.
  • Here you want the variances to be equal, so you
    want a small F.
  • The larger the F value, the more dissimilar the
    variances are.
  • In this case, the variances are similar.
32
  • An independent-samples t test was conducted to
    evaluate the hypothesis that students talk
    differently (in amount of talking) under
    different stress conditions. The test was
    significant, t(28) = 2.43, p = .022. Students in
    the high-stress condition talked less (M = 22.07,
    SD = 27.14) than students in the low-stress
    condition (M = 45.20, SD = 24.97).
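
The reported t value can be reproduced from the summary statistics alone. The following Python sketch assumes the equal-variances (pooled) form of the independent-samples t test, which is the row the write-up uses; the function name `independent_t` is an illustrative choice.

```python
import math

def independent_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic and degrees of freedom from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # std. error of the difference
    return (mean1 - mean2) / se_diff, df

# Values from the write-up: low stress vs. high stress, 15 students each.
t, df = independent_t(45.20, 24.97, 15, 22.07, 27.14, 15)

# Standard error of each group mean: SD / sqrt(N).
se_low = 24.97 / math.sqrt(15)
se_high = 27.14 / math.sqrt(15)

print(f"t({df}) = {t:.2f}")
print(f"standard errors: {se_low:.2f}, {se_high:.2f}")
```

The result agrees with the reported t(28) = 2.43, and the two standard errors match the 6.45 and 7.01 in the output table.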

33
  • t-test for dependent samples (paired-samples
    t-test)
  • Used when the two groups of observations that are
    to be compared are based on the same sample of
    subjects who were tested twice (e.g., before and
    after a treatment).

            Mean   N    Std. Deviation   Std. Error Mean
Pay         5.67   30   1.49             .27
Security    4.50   30   1.83             .33

Std. Error Mean = SD/√N (here SD/√30): the standard deviation of the sample means.
34
                 Mean   Std. Dev.   Std. Err.   Lower   Upper   t       df   Sig. (2-tailed)
Pay - Security   1.17   2.26        .41         .32     2.01    2.827   29   .008
  • A paired-samples t test was conducted to evaluate
    whether employees were more concerned with pay or
    job security. The results indicated that the mean
    concern for pay (M = 5.67, SD = 1.49) was
    significantly greater than the mean concern for
    security (M = 4.50, SD = 1.83), t(29) = 2.83,
    p = .008.
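
For paired data the t statistic is computed on the differences: the mean difference divided by its standard error. A minimal Python sketch using the rounded values from the output (mean difference 1.17, standard deviation of the differences 2.26, N = 30) follows; `paired_t` is an illustrative name.

```python
import math

def paired_t(mean_diff, sd_diff, n):
    """t statistic, standard error, and df for a paired-samples t test."""
    se = sd_diff / math.sqrt(n)      # standard error of the mean difference
    return mean_diff / se, se, n - 1

t, se, df = paired_t(1.17, 2.26, 30)
print(f"standard error = {se:.2f}, t({df}) = {t:.2f}")
```

Because the inputs are already rounded to two decimals, the sketch gives a t of about 2.84 rather than the 2.827 in the output; the standard error and degrees of freedom match.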

35
  • It was suggested (Marija J. Norusis) that:
  • When reporting your results, give the exact
    observed significance level. It will help the
    reader evaluate your findings.
  • E.g., p = .008: there are 8 chances in 1,000 that
    you would observe a difference this large
    between the two samples if there were no
    difference in the population.
  • E.g., p = .08: there are 8 chances in 100, but you
    have set in advance that you will only accept a
    result if it is 5 chances in 100 or fewer, so
    this result is not significant.

36
  • Pearson Chi-square.
  • The Pearson Chi-square is the most common test
    for significance of the relationship between
    categorical variables.
  • This measure is based on the fact that we can
    compute the expected frequencies in a two-way
    table (i.e., frequencies that we would expect if
    there was no relationship between the variables).
  • For example, suppose we ask 20 males and 20
    females to choose between two brands of jeans
    (brands A and B).
  • If there is no relationship between preference
    and gender, then we would expect about an equal
    number of choices of brand A and brand B for each
    sex.
  • The Chi-square test becomes increasingly
    significant as the numbers deviate further from
    this expected pattern; that is, the more the
    pattern of choices differs between males and
    females.
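
Expected frequencies in a two-way table come from the row and column totals: expected = row total x column total / grand total. The Python sketch below applies this to the jeans example; the observed counts are hypothetical, since the slides give only the group sizes, and the function names are illustrative.

```python
def expected_counts(table):
    """Expected cell frequencies if the row and column variables were unrelated."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

# Rows: 20 males, 20 females. Columns: brand A, brand B. Counts are hypothetical.
observed = [[15, 5],
            [5, 15]]

print(expected_counts(observed))
print(pearson_chi_square(observed))
print(pearson_chi_square([[10, 10], [10, 10]]))  # matches the expected pattern
```

The statistic is zero when the observed counts match the expected pattern exactly and grows as the two sexes' choices diverge.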

37
  • The goodness-of-fit test is used to find out
    whether the population under study follows a
    hypothesized distribution.
  • H0: the population distribution is uniform, that
    is, each brand of cola drink is preferred by an
    equal percentage of the population.
  • Ha: the population distribution is not uniform,
    that is, the brands of cola drink are not all
    preferred by an equal percentage of the
    population.

38
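Brand   O     E     O-E    (O-E)²   (O-E)²/E
A       50    60    -10    100      1.67
B       65    60      5     25      0.42
C       45    60    -15    225      3.75
D       70    60     10    100      1.67
E       70    60     10    100      1.67
Total   300   300                   9.18

χ² (df = 4) = 9.18. Suppose the critical value is 9.49 (α = .05, df = 4):
because 9.18 is smaller than 9.49, H0 cannot be rejected, and we cannot
say that the cola brands are preferred by unequal percentages of the
population. df = (r - 1)(c - 1) = (5 - 1)(2 - 1) = 4.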
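
The goodness-of-fit statistic for the cola table can be recomputed with a short Python sketch. It uses the observed counts from the table and a uniform expected count of 60 per brand; with five categories the degrees of freedom are 5 - 1 = 4, and 9.49 is the .05 critical value for 4 degrees of freedom.

```python
observed = {"A": 50, "B": 65, "C": 45, "D": 70, "E": 70}

total = sum(observed.values())
expected = total / len(observed)     # uniform H0: 300 / 5 = 60 per brand

# Each brand's contribution: (O - E)^2 / E
terms = {brand: (o - expected) ** 2 / expected for brand, o in observed.items()}
chi_square = sum(terms.values())
df = len(observed) - 1
critical_value = 9.49                # chi-square critical value, alpha = .05, df = 4

for brand, term in terms.items():
    print(f"{brand}: {term:.2f}")
print(f"chi-square({df}) = {chi_square:.2f}")
print("reject H0" if chi_square > critical_value else "fail to reject H0")
```

The unrounded sum is about 9.17; adding the terms after rounding each to two decimals gives the 9.18 shown on the slide. Either way the statistic is below 9.49, so the uniform hypothesis is not rejected.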
39
  • Test of independence: we can test the
    relationship between nominal variables.
  • The data are obtained from a random sample.
  • We use count data (frequencies).
  • We want to test whether perception of life is
    independent of gender, that is, whether men and
    women find life equally exciting.

40
Life excitement   Male   Female   Total
Excited           300    384      684
Not excited       296    481      777
Total             596    865      1461

Chi-square = 4.76, df = 1, p = .0290
What can you conclude?
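
One way to check the reported figures is to recompute the test from the four cell counts. The Python sketch below rests on an assumption: the reported value appears to match the chi-square with Yates' continuity correction, which is commonly applied to 2 x 2 tables, so the correction is switched on by default. For 1 degree of freedom the p value can be obtained from the complementary error function.

```python
import math

def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square test of independence for a 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    correction = 0.5 if yates else 0.0   # Yates' continuity correction
    chi2 = sum(max(abs(o - e) - correction, 0) ** 2 / e
               for o, e in zip(observed, expected))
    p = math.erfc(math.sqrt(chi2 / 2))   # upper-tail probability for df = 1
    return chi2, p

# Rows: excited, not excited. Columns: male, female.
chi2, p = chi_square_2x2(300, 384, 296, 481)
chi2_uncorrected, p_uncorrected = chi_square_2x2(300, 384, 296, 481, yates=False)

print(f"with correction:    chi-square = {chi2:.2f}, p = {p:.4f}")
print(f"without correction: chi-square = {chi2_uncorrected:.2f}, p = {p_uncorrected:.4f}")
```

With the correction the statistic is about 4.77 with p of about .029, in line with the reported values. Since p is below .05, the hypothesis that life excitement is independent of gender would be rejected at that level: in this sample, a larger share of men than women report finding life exciting.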