Association - PowerPoint PPT Presentation


Slides: 58
Provided by: Hung72


Transcript and Presenter's Notes

Title: Association

  • Reference
  • Brown's Lecture Note 1
  • Grading on Curve

  • Method for studying relationships among several
    variables
  • Scatter plot
  • Correlation coefficient
  • Association and causation.
  • Regression
  • Examine the distribution of a single variable.
  • QQplot

  • Sir Francis Galton, in his 1885 Presidential
    address before the anthropology section of the
    British Association for the Advancement of
    Science, described a study he had made of
    how tall children are compared to their parents.
  • He thought he had made a discovery when he found
    that children's heights tend to be more moderate
    than those of their parents.
  • For example, if the parents were very tall, their
    children tended to be tall, but shorter than their
    parents.
  • This discovery he called a regression to the
    mean.
  • The term regression has come to be applied to the
    least squares technique that we now use to
    produce results of the type he found (but which
    he did not use to produce his results).
  • Association between variables
  • Two variables measured on the same individuals
    are associated if some values of one variable
    tend to occur more often with some values of the
    second variable than with other values of that
    variable.

Study relationships among several variables
  • Associations are possible between
  • Two quantitative variables.
  • A quantitative and a categorical variable.
  • Two categorical variables.
  • Quantitative and categorical variables
  • Regression
  • Response variable and explanatory variable
  • A response variable measures an outcome of a
    study.
  • An explanatory variable explains or causes
    changes in the response variables.
  • If one sets values of one variable, what effect
    does it have on the other variable?
  • Other names
  • Response variable: dependent variable.
  • Explanatory variable: independent variable.

Principles for studying association
  • Start with a graphical display: scatterplots
  • Display the relationship between two quantitative
    variables.
  • The values of one variable appear on the
    horizontal axis (the x axis) and the values of
    the other variable on the vertical axis (the y
    axis).
  • Each individual is the point in the plot fixed by
    the values of both variables for that individual.
  • In regression, we usually call the explanatory
    variable x and the response variable y.
  • Look for overall patterns and for striking
    deviations from the pattern: interpreting
    scatterplots
  • Overall pattern: the relationship has ...
  • form (linear relationships, curved relationships,
    etc.)
  • direction (positive/negative association)
  • strength (how closely the points follow a clear
    form)
  • Outliers
  • For a categorical x and quantitative y, show the
    distributions of y for each category of x.
  • When the overall pattern is quite regular, use a
    compact mathematical model to describe it.

Positive/negative association
  • Two variables are positively associated when
    above-average values of one tend to accompany
    above-average values of the other and
    below-average values also tend to occur together.
  • Two variables are negatively associated when
    above-average values of one accompany
    below-average values of the other and vice
    versa.

Association or Causation
Add numerical summaries - the correlation
Straight-line (linear) relations are particularly
interesting. (correlation)
Our eyes are not good judges of how strong a
relationship is: they are affected by the plotting
scales and the amount of white space around the
cloud of points.
  • The correlation r measures the direction and
    strength of the linear relationship between two
    quantitative variables.
  • For data on n individuals on variables x and
    y, with means $\bar{x}$, $\bar{y}$ and standard
    deviations $s_x$, $s_y$:
    $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$
  • Calculation
  • Begins by standardizing the observations.
  • Standardized values have no units.
  • r is an average of the products of the
    standardized x and y values for the n
    individuals.
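
The calculation described above can be sketched in R. This is a minimal illustration on made-up data (the vectors x and y are hypothetical, not from the lecture), and it assumes the usual sample convention of dividing by n - 1, which the slide only describes as "an average".

```r
# Hypothetical illustration data (not from the lecture).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)
n <- length(x)

# Standardize each variable: subtract the mean, divide by the sample SD.
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# r is the sum of the products of standardized values, divided by n - 1.
r_manual <- sum(zx * zy) / (n - 1)

# Compare with R's built-in correlation.
r_builtin <- cor(x, y)
print(c(manual = r_manual, builtin = r_builtin))

# Changing the units of x (here a hypothetical inches-to-cm factor)
# leaves r unchanged, because standardized values have no units.
r_rescaled <- cor(2.54 * x, y)
print(r_rescaled)
```

Because the standardized values are unitless, the rescaled correlation agrees with the original one.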

Properties of r
  • Makes no use of distinction between explanatory
    and response variables.
  • Requires both variables be quantitative.
  • Does not change when the units of measurements
    are changed.
  • r > 0 for a positive association and r < 0 for
    a negative association.
  • $-1 \le r \le 1$.
  • Near-zero r indicates a weak linear relationship;
    the strength of the relationship increases as r
    moves away from 0 toward either -1 or 1.
  • The extreme values r = -1 or r = 1 occur only when
    the points lie exactly along a straight line.
  • It measures the strength of only the linear
    relationship between two variables.
  • Scatterplots and correlations
  • It is not so easy to guess the value of r from a
    scatterplot.

Various data and their correlations
Cautions about correlation
  • Correlation is not a complete description of
    two-variable data.
  • A high correlation means a strong linear
    relationship, not that the two variables take
    similar values.
  • Summary: If a scatterplot shows a linear
    relationship, we'd like to summarize the overall
    pattern by drawing a line on the scatterplot.
  • Use a compact mathematical model to describe it -
    least squares regression.
  • A regression line
  • It summarizes the relationship between two
    variables, one explanatory and another response.
  • It is a straight line that describes how a
    response variable y changes as an explanatory
    variable x changes.
  • Often used to predict the value of y for a
    given value of x.

Mean height of children against age
  • Strong, positive, linear relationship (r = 0.994).

Fitting a line to data
  • It means to draw a line that comes as close as
    possible to the points.
  • The equation of the line gives a compact
    description of the dependence of the response
    variable y on the explanatory variable x.
  • A mathematical model for the straight-line
    relationship
  • A straight line relating y to x has an equation
    of the form y = a + bx, where b is the slope and
    a is the intercept.

  • Height = 64.93 + (0.635 × Age)
  • Predict the mean height of the children at 32, 0
    and 240 months of age.
  • Can we do extrapolation?

  • The accuracy of predictions from a regression
    line depends on how much scatter the data show
    around the line.
  • Extrapolation is the use of regression line for
    prediction far outside the range of values of the
    explanatory variable x that you used to obtain
    the line.
  • Such predictions are often not accurate.
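
As a small sketch, the fitted line quoted on the slide can be evaluated at the three ages it asks about. The function name predict_height is hypothetical. The slides do not state the units of height or the range of ages in the data, so which of these predictions counts as extrapolation is an assumption left to the reader.

```r
# Fitted line from the slide: Height = 64.93 + 0.635 * Age (age in months).
# predict_height is a hypothetical helper name used only for illustration.
predict_height <- function(age_months) 64.93 + 0.635 * age_months

ages <- c(32, 0, 240)
pred <- predict_height(ages)
print(data.frame(age_months = ages, predicted_height = pred))
```

The line happily returns a number for any age, including 0 and 240 months. Whether that number is believable depends on whether the age lies inside the range used to fit the line.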

Which line??
Least-squares regression
  • We need a way to draw a regression line that does
    not depend on our eyeball guess.
  • We want a regression line that makes the
    prediction errors as small as possible.
  • The least-squares idea.
  • The least-squares regression line of y on x is
    the line that makes the sum of the squares of the
    vertical distances of the data points from the
    line as small as possible.
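  • Find a and b such that
    $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$
    is the smallest ($\hat{y}_i = a + b x_i$ is the
    predicted response for the given $x_i$).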
Equation of the LS regression line
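  • The equation of the least-squares regression line
    of y on x is $\hat{y} = a + bx$, with slope
    $b = r \, s_y / s_x$ and intercept
    $a = \bar{y} - b \bar{x}$.
  • Interpreting the regression line and its slope
  • A change of one standard deviation in x
    corresponds to a change of r standard deviations
    in y.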
  • It always passes through the point (x-bar, y-bar).
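
A short sketch of these properties in R, on made-up data (x and y are hypothetical). It uses the standard result that the least-squares slope is r times s_y / s_x and that the intercept makes the line pass through (x-bar, y-bar), then compares with R's lm().

```r
# Hypothetical illustration data (not from the lecture).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(3.2, 4.1, 5.9, 6.8, 8.1, 9.7, 10.2, 12.0)

# Slope and intercept from the summary statistics.
b <- cor(x, y) * sd(y) / sd(x)
a <- mean(y) - b * mean(x)

# The same line from R's least-squares fit.
fit <- lm(y ~ x)
print(c(intercept = a, slope = b))
print(coef(fit))

# The fitted line evaluated at x-bar returns y-bar.
at_mean <- a + b * mean(x)
print(c(at_mean = at_mean, ybar = mean(y)))
```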

The height-age data
Correlation and regression
  • In regression, x and y play different roles.
  • In correlation, they don't.
  • Comparing the regression of y on x and x on y.
  • The slope of the LS regression involves r.
  • $r^2$ is the fraction of the variance of y that is
    explained by the LS regression of y on x.
  • If r = 0.7 or -0.7, $r^2$ = 0.49 and about half the
    variation is accounted for by the linear
    relationship.
  • Quantify the success of regression in explaining
    y. There are two sources of variation in y, one
    systematic and the other random.
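
The split of the variation in y into a systematic part and a random part can be sketched as follows. The data are hypothetical; the names sst and sse (total and residual sums of squares) are introduced only for this illustration.

```r
# Hypothetical illustration data (not from the lecture).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.9, 4.6, 4.8, 7.1, 7.4, 9.9, 10.1, 12.3)

fit <- lm(y ~ x)

# Total variation of y around its mean.
sst <- sum((y - mean(y))^2)
# Variation left over around the fitted line (the "random" part).
sse <- sum(residuals(fit)^2)

# Fraction of the variation explained by the regression.
r2_from_sums <- 1 - sse / sst
r2_from_cor  <- cor(x, y)^2
print(c(from_sums = r2_from_sums, from_cor = r2_from_cor))
```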

  • $r^2$ = 0.989
  • $r^2$ = 0.849

Scatterplot smoothers
  • Systematic methods of extracting the overall
    pattern.
  • Help us see overall patterns.
  • Reveal relationships that are not obvious from a
    scatterplot alone.

Categorical explanatory variable
  • Make a side-by-side comparison of the
    distributions of the response for each category.
  • back-to-back stemplots, side-by-side boxplots.
  • If the categorical variable is not ordinal,
    i.e. has no natural order, it's hard to speak of
    the direction of the association.

  • Francis Galton (1822-1911) measured the heights
    of about 1,000 fathers and sons.
  • The following plot summarizes the data on sons'
    heights.
  • The curve on the histogram is a $N(68.2, 2.6^2)$
    density curve.

Data is often normally distributed
  • The following table summarizes some aspects of
    the data
  • Quantiles: 100% (maximum), 90%, 75% (quartile),
    50% (median), 25% (quartile), 10%, 0% (minimum)
  • Moments: Mean 68.20, Std Dev 2.60, N 952

Normal Quantile Plot
  • A normal quantile plot provides a better way of
    determining whether data is well fitted by a
    normal distribution.
  • How are these plots formed and interpreted?
  • The plot for the Galton data on sons' heights

Normal Quantile Plot
  • The data points very nearly follow a straight
    line on this plot.
  • This verifies that the data is approximately
    normally distributed.
  • This is data from the population of all adult,
    English, male heights.
  • The fact that the sample is approximately normal
    is a reflection of the fact that this population
    of heights is normally distributed or at least
    approximately so.
  • If the POPULATION is really normal, how close to
    normal should the SAMPLE histogram be, and how
    straight should the normal probability plot be?
  • Empirical Cumulative Distribution Function
  • Suppose that $x_1, x_2, \ldots, x_n$ is a batch of
    numbers (the word sample is often used in the
    case that the $x_i$ are independently and
    identically distributed with some distribution
    function; the word batch will imply no such
    commitment to a stochastic model).
  • The empirical cumulative distribution function
    (ecdf) is defined as
    $F_n(x) = \frac{1}{n} \#\{x_i \le x\}$
    (with this definition, $F_n$ is right-continuous).
Empirical Cumulative Distribution Function
  • The random variables $I(X_i \le x)$ are independent
    Bernoulli random variables.
  • $nF_n(x)$ is a binomial random variable (n trials,
    probability F(x) of success) and so
  • $E[F_n(x)] = F(x)$,
    $\mathrm{Var}[F_n(x)] = \frac{1}{n} F(x)[1 - F(x)]$
  • $F_n(x)$ is an unbiased estimate of F(x) and has a
    maximum variance at that value of x such that
    F(x) = 0.5, that is, at the median.
  • As x becomes very large or very small, the
    variance tends to zero.
  • The Survival Function
  • It is equivalent to a distribution function and
    is defined as
  • $S(t) = P(T > t) = 1 - F(t)$
  • Here T is a random variable with cdf F.
  • In applications where the data consist of times
    until failure or death and are thus nonnegative,
    it is often customary to work with the survival
    function rather than the cumulative distribution
    function, although the two give equivalent
    information.
  • Data of this type occur in medical and
    reliability studies. In these cases, S(t) is
    simply the probability that the lifetime will be
    longer than t. We will be concerned with the
    sample analogue of S, $S_n(t) = 1 - F_n(t)$.
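
Two of the statements above can be sketched numerically: the variance F(x)[1 - F(x)]/n is largest where F(x) = 0.5, and the sample survival function is one minus the ecdf. The sample size, the grid of probabilities, and the lifetimes t_obs are all hypothetical values chosen for illustration.

```r
# Variance of Fn(x) as a function of p = F(x), for a hypothetical n.
n <- 50
p <- c(0.05, 0.25, 0.5, 0.75, 0.95)
var_Fn <- p * (1 - p) / n
print(rbind(p, var_Fn))

# The variance peaks at p = 0.5 (the median) and shrinks in the tails.
p_at_max <- p[which.max(var_Fn)]

# Sample survival function for hypothetical lifetimes.
t_obs <- c(2, 5, 5, 8, 13)
Sn <- function(t) 1 - ecdf(t_obs)(t)

# Proportion of lifetimes strictly longer than 5.
s5 <- Sn(5)
print(s5)
```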

Quantile-Quantile Plots
  • Q-Q plots are useful for comparing distribution
    functions.
  • If X is a continuous random variable with a
    strictly increasing distribution function, F, the
    pth quantile of F is defined to be that value of
    x such that $F(x) = p$, or $x_p = F^{-1}(p)$.
  • In a Q-Q plot, the quantiles of one distribution
    are plotted against those of another.
  • For two batches of equal size, a Q-Q plot is
    simply constructed by plotting the points
    $(X_{(i)}, Y_{(i)})$, the ordered values of the
    two batches.
  • If the batches are of unequal size, an
    interpolation process can be used.
  • Suppose that one cdf (F) is a model for
    observations (x) of a control group and another
    cdf (G) is a model for observations (y) of a
    group that has received some treatment.
  • The simplest effect that the treatment could have
    would be to increase the expected response of
    every member of the treatment group by the same
    amount, say h units.
  • Both the weakest and the strongest individuals
    would have their responses changed by h. Then
    $y_p = x_p + h$, and the Q-Q plot would be a
    straight line with slope 1 and intercept h.

Quantile-Quantile Plots
  • The cdfs are related as $G(y) = F(y - h)$.
  • Another possible effect of a treatment would be
    multiplicative: the response (such as lifetime or
    strength) is multiplied by a constant, c.
  • The quantiles would then be related as
    $y_p = c x_p$, and the Q-Q plot would be a straight
    line with slope c and intercept 0. The cdfs would
    be related as $G(y) = F(y/c)$.
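
The additive and multiplicative treatment effects can be sketched with idealized "control" quantiles. Everything here is hypothetical: the exponential control distribution, h = 2, and the multiplier c_mult = 3 are chosen only to show the slope and intercept of the resulting Q-Q lines.

```r
# Idealized control-group values: exact quantiles of an exponential
# distribution (a hypothetical choice of F).
x <- qexp(ppoints(99))

h <- 2        # hypothetical additive treatment effect
c_mult <- 3   # hypothetical multiplicative treatment effect

y_add  <- x + h
y_mult <- c_mult * x

# qqplot() sorts both batches and plots one against the other;
# it returns the sorted values invisibly.
qq_add <- qqplot(x, y_add, xlab = "Control quantiles",
                 ylab = "Treatment quantiles")
abline(a = h, b = 1)

# Recover intercept and slope of each Q-Q line.
fit_add  <- coef(lm(qq_add$y ~ qq_add$x))
fit_mult <- coef(lm(sort(y_mult) ~ sort(x)))
print(fit_add)
print(fit_mult)
```

With real data the points would scatter around these lines rather than fall exactly on them.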

  • Here is a histogram and probability plot for a
    sample of size 1000 from a perfectly normal
    population with mean 68 and SD 2.6.

Moments Mean 67.92 Std Dev 2.60 N
Summary on parents' heights
Another Data Set
  • R. A. Fisher (1890-1962) (who many claim was
    the greatest statistician ever) analyzed a series
    of measurements of Iris flowers in some of his
    important developmental papers.
  • Histogram of the sepal lengths of 50 iris setosa

These data have mean 5.0 and S.D. 0.35 (in cm). The
curve is the density of a $N(5, 0.35^2)$ distribution.
Normal Quantile Plot Sepal length
  • Why are the dots on this plot arranged in neat
    little rows?
  • Apart from this, the data nicely follows a
    straight line pattern on the plot.

Fisher's Iris Data
  • Array giving 4 measurements on 50 flowers from
    each of 3 species of iris.
  • Sepal length and width, and petal length and
    width are measured in centimeters.
  • Species are Setosa, Versicolor, and Virginica.
  • R. A. Fisher, "The Use of Multiple Measurements
    in Taxonomic Problems", Annals of Eugenics, 7,
    Part II, 1936, pp. 179-188. Republished by
    permission of Cambridge University Press.
  • The data were collected by Edgar Anderson, "The
    irises of the Gaspe Peninsula", Bulletin of the
    American Iris Society, 59, 1935, pp. 2-5.

Not all real data is approximately normal
  • Histogram and normal probability plot for the
    salaries (in thousands of dollars) of all major
    league baseball players in 1987.
  • Only position players (not pitchers) who were
    on a major league roster for the entire season
    are included.

Moments: Mean 529.7, S.D. 441.6, N 260
Normal Quantile Plot
  • This distribution is skewed to the right.
  • How is this skewness reflected in the normal
    quantile plot?
  • Both the largest salaries and the smallest
    salaries are much too large to match an ideal
    normal pattern. (They can be called outliers.)
  • This histogram seems something like an
    exponential density. Further investigation
    confirms a reasonable agreement with an
    exponential density truncated below at 67.5.

Judging whether a distribution is approximately
normal or not
  • Personal incomes, survival times, etc. are usually
    skewed and not normal.
  • It is risky to assume that a distribution is
    normal without actually inspecting the data.
  • Stemplots and histograms are useful.
  • A still more useful tool is the normal quantile
    plot.

Normal quantile plots
  • Arrange the data in increasing order. Record
    the percentile of each data value.
  • Do normal distribution calculations to find the
    z-scores at these same percentiles.
  • Plot each data point x against the corresponding
    z.
  • If the data distribution is standard normal, the
    points will lie close to the 45-degree line x = z.
  • If it is close to any normal distribution, the
    points will lie close to some straight line.
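
The steps above can be sketched by hand in R and compared with qqnorm(). The data are simulated (not Galton's measurements), and ppoints() is used as one common choice of percentile for each ordered value; the slides do not say which percentile convention they use.

```r
set.seed(1)
# Simulated stand-in for height data (not the lecture's data set).
x <- rnorm(200, mean = 68, sd = 2.6)

# Step 1: arrange the data in increasing order.
xs <- sort(x)
# Step 2: a percentile for each ordered value (plotting positions).
p <- ppoints(length(xs))
# Step 3: z-scores at those same percentiles.
z <- qnorm(p)
# Step 4: plot each data value against its z-score.
plot(z, xs, xlab = "z-score", ylab = "Sorted data")

# qqnorm() performs the same construction.
qq <- qqnorm(x, plot.it = FALSE)
print(head(cbind(manual_z = z, qqnorm_z = sort(qq$x))))
```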

  • granularity

  • Right-skewed distribution

qqline (R-function)
  • Plots a line through the first and third quartile
    of the data, and the corresponding quantiles of
    the standard normal distribution.
  • Provides a good straight line that helps us see
    whether the points lie close to a straight line.
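
What qqline does can be sketched by computing the line through the quartile points directly. The data are simulated and right-skewed on purpose; the variable names are hypothetical, and the calculation assumes qqline's default settings (first and third quartiles, default quantile type).

```r
set.seed(2)
# Simulated right-skewed data (hypothetical, for illustration only).
y <- rexp(100)

# Quartiles of the data and of the standard normal distribution.
qy <- quantile(y, c(0.25, 0.75), names = FALSE)
qz <- qnorm(c(0.25, 0.75))

# Line through the two quartile points.
slope <- diff(qy) / diff(qz)
intercept <- qy[1] - slope * qz[1]

qqnorm(y)
qqline(y)                                 # R's reference line
abline(a = intercept, b = slope, lty = 2) # hand-computed line
print(c(intercept = intercept, slope = slope))
```

Because the line is anchored at the quartiles rather than fitted to all points, a few extreme values in the tails do not pull it around.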

  • Pulse data

  • Density curves: relative frequencies.
  • The mean ($\mu$), median, quantiles, standard
    deviation ($\sigma$).
  • The normal distributions $N(\mu, \sigma^2)$.
  • Standardizing: z-score, $z = (x - \mu)/\sigma$.
  • 68-95-99.7 rule; standard normal distribution.
  • Normal quantile plots/lines.

Another non-normal pattern
  • The data here is the number of runs scored in the
    1986 season by each of the players in the above
    data set.

Moments: Mean 55.33, S.D. 25.02, N 261
Note: n = 261 here, but in the preceding data
n = 260. The discrepancy results from the fact
that one player in the data set has a missing
salary figure.
Gamma Quantile Plot
  • These data are fairly well fit by a gamma density
    with parameters $\alpha = 4.55$ and
    $\lambda = 12.16$. (How do we find those two
    numbers?)
  • What is the gamma density curve?
  • How do we plot a quantile plot to check on gamma
    density?
  • The data points form a fairly straight line on
    this plot; hence there is reasonable agreement
    between the data and a theoretical
    $\Gamma(4.55, 12.16)$ distribution.
Methods of Estimation
  • Basic approach to parameter estimation
  • The observed data will be regarded as realizations
    of random variables $X_1, X_2, \ldots, X_n$, whose
    joint distribution depends on an unknown
    parameter $\theta$.
  • $\theta$ may be a vector, such as
    $(\alpha, \lambda)$ in the gamma density function.
  • Often the $X_i$ can be modeled as independent
    random variables all having the same distribution
    $f(x \mid \theta)$, in which case their joint
    distribution is
    $f(x_1 \mid \theta) f(x_2 \mid \theta) \cdots f(x_n \mid \theta)$.
  • Refer to such $X_i$ as independent and identically
    distributed, or i.i.d.
  • An estimate of $\theta$ will be a function of
    $X_1, X_2, \ldots, X_n$ and will hence be a random
    variable with a probability distribution called
    its sampling distribution.
  • We will use approximations to the sampling
    distribution to assess the variability of the
    estimate, most frequently through its standard
    deviation, which is commonly called its standard
    error.
  • The Method of Moments
  • The Method of Maximum Likelihood

The Method of Moments
  • The kth moment of a probability law is defined
    as $\mu_k = E(X^k)$.
  • Here X is a random variable following that
    probability law (of course, this is defined only
    if the expectation exists).
  • $\mu_k$ is a function of $\theta$ when the $X_i$
    have the distribution $f(x \mid \theta)$.
  • If $X_1, X_2, \ldots, X_n$ are i.i.d. random
    variables from that distribution, the kth sample
    moment is defined as
    $\hat{\mu}_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k$.
  • According to the law of large numbers, the
    sample moment $\hat{\mu}_k$ converges to the
    population moment $\mu_k$ in probability.
  • If the functions relating the estimates to the
    sample moments are continuous, the estimates will
    converge to the parameters as the sample moments
    converge to the population moments.
  • The method of moments estimates parameters by
    finding expressions for them in terms of the
    lowest possible order moments and then
    substituting sample moments into the expressions.
The Method of Maximum Likelihood
  • It can be applied to a great variety of other
    statistical problems, such as regression, for
    example. This general utility is one of the major
    reasons of the importance of likelihood methods
    in statistics.
  • The maximum likelihood estimate (mle) of $\theta$
    is that value of $\theta$ that maximizes the
    likelihood, that is, makes the observed data most
    probable or most likely.
  • Rather than maximizing the likelihood itself, it
    is usually easier to maximize its natural
    logarithm (which is equivalent since the
    logarithm is a monotonic function).
  • For an i.i.d. sample, the log likelihood is
    $l(\theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$.
  • The large sample distribution of a maximum
    likelihood estimate is approximately normal with
    mean $\theta_0$ and variance $1/[n I(\theta_0)]$,
    where $\theta_0$ is the true parameter value and
    $I(\theta_0)$ is the Fisher information.
  • Since this is merely a limiting result, which
    holds as the sample size tends to infinity, we say
    that the mle is asymptotically unbiased and refer
    to the variance of the limiting normal
    distribution as the asymptotic variance of the
    mle.
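
A minimal sketch of maximum likelihood, using an exponential model because its mle has a closed form (1 / sample mean) that the numerical answer can be checked against. The model, the simulated data, and the true rate 0.5 are all assumptions made for illustration. For the exponential distribution the Fisher information is 1/rate², so the asymptotic standard error is rate / sqrt(n).

```r
set.seed(5)
# Simulated lifetimes (hypothetical; true rate 0.5).
x <- rexp(400, rate = 0.5)
n <- length(x)

# Log likelihood of an i.i.d. sample: sum of log densities.
loglik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))

# Maximize the log likelihood numerically over a search interval.
opt <- optimize(loglik, interval = c(1e-6, 10), maximum = TRUE)
rate_mle <- opt$maximum

# Closed-form mle for the exponential rate.
rate_closed <- 1 / mean(x)

# Asymptotic standard error: sqrt(1 / (n * I(rate))), with I = 1 / rate^2.
se_asym <- rate_mle / sqrt(n)
print(c(numeric = rate_mle, closed_form = rate_closed, se = se_asym))
```

The same pattern (write the log likelihood as a function, hand it to a numerical optimizer) carries over to models such as the gamma, where no closed form for the shape parameter exists; for more than one parameter, optim() would replace optimize().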

  • x <- qgamma(seq(.001, .999, len = 100), 1.5)
    # compute a vector of quantiles
  • plot(x, dgamma(x, 1.5), type = "l")
    # density plot for shape 1.5
  • QQplots are used to assess
  • whether data have a particular distribution, or
  • whether two datasets have the same distribution.
  • If the distributions are the same, then the
    QQplot will be approximately a straight line.
  • The extreme points have more variability than
    points toward the center.
  • A plot with a "U" shape means that one
    distribution is skewed relative to the other.
  • An "S" shape implies that one distribution has
    longer tails than the other.
  • In the default configuration a plot from qqnorm
    that is bent down on the left and bent up on the
    right means that the data have longer tails than
    the Gaussian.
  • plot(qlnorm(ppoints(y)), sort(y))
    # log normal Q-Q plot for a numeric vector y