The Mean Regression or Regression to the Mean - PowerPoint PPT Presentation

1
The Mean Regression or Regression to the Mean
  • Rama C. Nair
  • Professor
  • Epidemiology and Community Medicine

2
Outline
  • Why Data Analysis?
  • Qualitative and quantitative data
  • Role of Statistics
  • Quantitative information
  • Variability in data
  • Bias and Random error
  • Population vs sample
  • Regression Analysis
  • Definition
  • Types of regressions
  • Benefits and pitfalls of regression analyses

3
Why data analysis
  • Research Question
  • Collection of information
  • Qualitative and quantitative
  • Only quantitative information (or information
    that can be quantified) considered in this
    presentation
  • Analyzing the information collected to arrive at
    a conclusion (decision) about the research
    question

4
Example
  • Is childhood obesity an increasing problem in the
    community?
  • What might be the major causes for this
    increasing trend?
  • Would introduction of supervised physical
    activities in the school address the issue?
  • What are the important considerations for a good
    school program?

5
Role of Statistics in Data Analysis
  • Variability in data
  • Measured by Probability Distributions
  • Understanding the variability
  • Reasons for variation
  • Effects of variability on inference
  • Reliable and valid inference
  • Random error and Bias

6
Role of statistics in data analysis
  • Analyzing variability in observed data and
    arriving at inferences (answers to research
    questions) that are reliable and valid, in the
    presence of the bias and random errors that are
    inherent in all data
  • A tall order
  • Statisticians start with
  • Let $X_1, X_2, \ldots, X_n$ be i.i.d. (independent
    and identically distributed) $N(\mu, \sigma)$, to
    describe the n observations
  • Epidemiologists put a context and see if the
    statistical model fits

7
Population vs sample
  • Truly, statistics only come into play when we are
    observing only a sample of the whole population
  • Inference from sample to population is based on
    the statistical properties of sample statistics
    based on the probability distribution of the
    variables
  • Descriptive analysis involving estimation of
    characteristics (mean, relative risk, odds ratio,
    or simply probabilities of events)
  • Statistical tests of hypotheses about the
    characteristics (alone or in combination with
    some estimation)

8
Bias and Random Error
  • How does one deal with bias and random error in
    statistical analyses?
  • Try to minimize bias by choosing appropriate
    design
  • Alternatively, using mathematical modeling, one
    may adjust for (reduce) bias in the analysis
  • Random error cannot be avoided, but effect on
    inference can be minimized by increasing sample
    size
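
The effect of sample size on random error can be illustrated with a small simulation. This is only a sketch under assumed values: the function name `sd_of_sample_means`, the Normal population (mean 50, standard deviation 10) and the seed are all made up for illustration and are not from the presentation.

```python
import random
import statistics

def sd_of_sample_means(n, n_repeats=500, mu=50.0, sigma=10.0, seed=1):
    """Empirical standard deviation of the sample mean for samples of size n.

    mu, sigma, n_repeats and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(n_repeats)
    ]
    return statistics.stdev(means)

# The spread of the sample mean shrinks roughly like sigma / sqrt(n).
for n in (25, 100, 400):
    print(n, round(sd_of_sample_means(n), 2), "theory:", round(10.0 / n ** 0.5, 2))
```

Quadrupling the sample size roughly halves the random error of the mean. Bias, in contrast, is not reduced by a larger sample.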

9
Regression analysis
  • What is regression?

10
Regression - Wikipedia
  • In statistics, regression analysis examines the
    relation of a dependent variable (response
    variable) to specified independent variables
    (explanatory variables). The mathematical model
    of their relationship is the regression equation.
    The dependent variable is modeled as a random
    variable because of uncertainty as to its value,
    given only the value of each independent
    variable. A regression equation contains
    estimates of one or more hypothesized regression
    parameters ("constants"). These estimates are
    constructed using data for the variables, such as
    from a sample. The estimates measure the
    relationship between the dependent variable and
    each of the independent variables. They also
    allow estimating the value of the dependent
    variable for a given value of each respective
    independent variable.
  • Uses of regression include curve fitting,
    prediction (including forecasting of time-series
    data), modeling of causal relationships, and
    testing scientific hypotheses about relationships
    between variables.

11
The MEAN Regression
  • Relating (regressing) the dependent variable to
    the independent variable(s)
  • Simply a way of characterizing a relationship
    through a mathematical (statistical) model
  • Simple linear regression
  • Logistic regression
  • Cox regression (proportional hazards model for
    survival analysis)
  • Time series analysis of recurrent data

12
The mathematical model
  • All starts with a simple observation
  • If two variables are related to each other, can
    one predict the value of one of the variables if
    the value of the other variable is known?
  • $Y = f(X)$, where f is a known mathematical function
  • The quest is to find the correct form of f

13
The mathematical model
  • Where does f come from?
  • Observation
  • Plotting values of X and Y in a bivariate plot to
    see if there is any obvious pattern
  • Theoretical considerations
  • Area (rectangle) = length × width
  • A combination of the two

14
The simple linear regression
  • The simplest form of regression
  • One dependent variable, Y is related to one
    independent variable, X
  • Plot X and Y (scatterplot)
  • Is there a straight line relationship (Is Y
    changing proportionally to X)?
  • $Y = \alpha + \beta X$
  • Two parameters determine the equation
  • Slope of the line ($\beta$) and the intercept of the
    line ($\alpha$) on the Y (dependent variable) axis

15
Example of a simple linear regression
  • (Figure: data points with a fitted straight line,
    shown in blue)

16
Simple Linear Regression
  • Notice that not all data points are on the line,
    so obviously the equation does not fit all the
    observations
  • Not a perfect relationship
  • The actual relationship is something more
    complicated
  • Can we use this relationship as an approximation?
  • What are the risks in using this equation to
    estimate the relationship?

17
Simple Linear Regression
  • What is the purpose of identifying this
    relationship?
  • Predict values of Y for any given X?
  • Predict trends in Y based on trends in X?
  • Predict gain/loss if we introduced a program to
    change values of Y in the population, by changing
    values of X?

18
Simple Linear Regression
  • If $Y_i$ is an actual observation in the previous
    picture, and the equation of the blue line is
    $Y = \alpha + \beta X$, then
  • $Y_i = \alpha + \beta X_i + e_i$
  • The $e_i$ is the deviation (error) of the
    observed value from the fitted value, a
    measure of uncertainty about the model being a
    good fit to the data
  • Clearly many possible lines (other than the blue
    line) can be drawn, and each of them will have a
    different distribution of $e_i$
  • Which line do we choose as the best fit (the one
    with the least error)?
  • Since there are many data points, we want a
    cumulative measure of error
  • Does mean squared error seem reasonable?
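
To make the idea of a cumulative error concrete, the sketch below computes the mean squared error of a few candidate lines on a tiny made-up data set. The data, the candidate lines and the function name `mean_squared_error` are illustrative assumptions, not taken from the slides.

```python
def mean_squared_error(xs, ys, a, b):
    """Mean of the squared deviations e_i = y_i - (a + b * x_i)."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return sum(e * e for e in residuals) / len(residuals)

# Made-up observations that lie close to the line Y = 2X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Hypothetical candidate lines, given as (intercept a, slope b).
candidates = {"line A": (0.0, 2.0), "line B": (1.0, 1.5), "line C": (3.0, 1.0)}
for name, (a, b) in candidates.items():
    print(name, round(mean_squared_error(xs, ys, a, b), 3))
```

Among these three candidates, line A has the smallest mean squared error, so by this criterion it is the best of the three. Least squares looks for the best line among all possible lines.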

19
Least squares regression
  • Using minimum mean squared error as the criterion
  • What is the straight line that best fits the
    data?
  • Estimates of $\alpha$ (a) and $\beta$ (b)
  • Sample vs Population
  • a and b are the best estimates of $\alpha$ and
    $\beta$ based on the observations, and these values
    will differ from sample to sample, even if the
    straight-line relationship is fixed for the
    population
  • Sampling variation of a and b
  • Measured by standard error of these estimates
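
A minimal sketch of the least squares calculation for one independent variable, using the usual closed-form formulas. The function name `fit_simple_linear_regression` and the data are made up for illustration. The standard errors assume independent errors with constant variance.

```python
import math

def fit_simple_linear_regression(xs, ys):
    """Return (a, b, se_a, se_b) for the least squares line Y = a + b * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                # estimated slope
    a = y_bar - b * x_bar        # estimated intercept
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s2 = residual_ss / (n - 2)   # estimated error variance
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    return a, b, se_a, se_b

# Made-up sample; a different sample would give different a and b.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, se_a, se_b = fit_simple_linear_regression(xs, ys)
print(round(a, 3), round(b, 3), round(se_a, 3), round(se_b, 3))
```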

20
Inference on regression
  • The regression coefficient
  • Slope of the regression line signifies the
    magnitude of change in Y expected with changes in
    X
  • For prediction, one needs to know the value of $\beta$
  • Estimated by b
  • Standard error of b allows one to draw
    conclusions as to possible true values of $\beta$
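
As a sketch of how the standard error is used, assuming independent, Normally distributed errors with constant variance, the usual formulas for the standard error of b, a 95% confidence interval for $\beta$ and the test of $\beta = 0$ are shown below. The symbols $s^2$, $S_{xx}$ and $t_{n-2,\,0.975}$ are notation introduced here, not on the slides.

```latex
\[
  SE(b) = \sqrt{\frac{s^2}{S_{xx}}}, \qquad
  s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - a - b X_i)^2, \qquad
  S_{xx} = \sum_{i=1}^{n} (X_i - \bar{X})^2
\]
\[
  b \pm t_{n-2,\,0.975} \, SE(b), \qquad
  t = \frac{b}{SE(b)} \quad \text{for testing } \beta = 0
\]
```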

21
Assumptions for the linear regression
  • As with many statistical procedures, the first
    assumption is that the observations are
    statistically independent of each other
  • This is essential in constructing the probability
    distribution of the sample values
  • It is also assumed that the random errors e are
    Normally distributed
  • This is essential in calculating the actual
    probability distribution; as long as the
    distributional form is known, one can do this
    even if the distribution is not Normal (though
    it is more difficult)
  • However, the least squares method of estimating
    the parameters that we used is optimal when the
    distribution is Normal

22
Assumptions
  • A third assumption for the estimated regression
    equation to be reliable and valid is that the
    deviation of the observed values from the fitted
    values remains similar for all values of X
    (homoscedasticity)
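  • This is essential for the standard errors, and
    hence the tests and confidence intervals, to be
    valid (reliable); without it the least squares
    estimates are still unbiased but no longer
    efficient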

23
Multiple Linear Regression
  • That was simple.
  • Now what happens if there is more than one
    independent variable that might have something to
    do with the dependent variable?
  • Can fit a simple linear regression for each
    variable, but that is wasteful and can create
    confusion, especially if many of the Xs
    themselves are related to each other
  • A comprehensive equation, relating all of them in
    one equation to the dependent variable
  • $Y = X\beta + e$
  • (matrix notation)
  • $Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + e_i$,
    for the ith observation
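
The matrix form can be sketched in code by solving the normal equations $(X'X)\beta = X'Y$ directly. This is an illustrative sketch only: the helper names and the data are made up, and statistical software normally uses more numerically stable methods.

```python
def solve_linear_system(a_matrix, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(a_matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def fit_multiple_regression(rows, ys):
    """Least squares estimate of beta from the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2 (no error term).
# The leading column of 1s lets the first coefficient act as the intercept.
rows = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
ys = [1, 3, -2, 0, 2]
print([round(b, 6) for b in fit_multiple_regression(rows, ys)])
```

Because the made-up data contain no error, the fit should recover the generating coefficients up to rounding.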

24
Multiple Linear Regression
  • Essentially the same as simple linear regression
  • The regression coefficients are now partial, in
    that each signifies the amount of linear
    relationship of one independent variable to the
    dependent variable, with all the others held in
    the equation
  • The methods of estimation and hypothesis testing
    are essentially the same as in simple linear
    regression
  • Assumptions are also similar

25
Linear Regression
  • Goodness of Fit
  • How does one assess how good the relationship is?
  • Are the $\beta$s significantly different from 0?
  • Back to the purpose of the regression
  • Explain the variability in Y as a result of
    variability in X (in other words, Y and X are
    related)
  • Amount of variability in Y (variance of Y,
    function of Mean squared deviation from the mean)
  • Amount of variability still unexplained after
    the regression (mean squared deviation of the
    residuals from the fitted line)

26
Linear Regression
  • Unexplained variation
  • If perfect fit, the sum of squares of the
    residuals is zero
  • If completely random ($\beta = 0$), then this sum of
    squares is the same as the total sum of squares
    of Y (about its mean)
  • The difference between the two is a measure of
    variability explained by the relationship,
    called regression SS
  • Therefore the ratio of regression SS to the Total
    SS serves as a criterion for how good the fit is
  • This ratio is $R^2$, with $0 \le R^2 \le 1$, known
    as the coefficient of determination
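
The decomposition above can be sketched for a simple linear regression as follows. The function name `r_squared` and the data are made up for illustration.

```python
def r_squared(xs, ys):
    """Coefficient of determination: regression SS divided by total SS."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    total_ss = sum((y - y_bar) ** 2 for y in ys)
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    regression_ss = total_ss - residual_ss
    return regression_ss / total_ss

# Made-up data lying close to a straight line, so R^2 is close to 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(r_squared(xs, ys), 4))
```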

27
Regression
  • In the linear regression, notice that we assumed
    Y has a Normal distribution (by virtue of the
    linear regression equation and the distribution
    of random errors)
  • So the dependent variable has to be a continuous
    variable
  • What if it is dichotomous, as with most
    epidemiologic studies where we are looking at
    illness or similar entities measured on a
    dichotomous scale?

28
Logistic Regression
  • Y is now a dichotomous variable
  • Clearly we can only talk about proportions
    (probabilities) of Y being 1 or 0 as something we
    can predict
  • Transforming the probability of Y with the
    logistic function helps here (the mathematical
    derivation of why this is feasible or desirable
    is available in many texts, e.g. Hosmer and
    Lemeshow, Applied Logistic Regression)

29
Derivation of the logistic model
  • Let $\pi(x) = e^{\beta_0 + \beta_1 x} / (1 + e^{\beta_0 + \beta_1 x})$
  • The logit transformation
  • $g(x) = \ln[\pi(x) / (1 - \pi(x))]$
  • $g(x) = \beta_0 + \beta_1 x$
  • linear regression for g(x)
  • Original outcome y
  • Distribution of y not Normal
  • $y = \pi(x) + e$
  • $e = 1 - \pi(x)$ with prob. $\pi(x)$, when $y = 1$
  • $e = -\pi(x)$ with prob. $1 - \pi(x)$, when $y = 0$
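
A short sketch of the logistic function and the logit transformation, showing that the logit of the modeled probability is linear in x. The function names and the coefficient values are made up for illustration.

```python
import math

def logistic(x, beta0, beta1):
    """pi(x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)), in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def logit(p):
    """g = ln(p / (1 - p)), the log odds of a probability p."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 0.5
for x in (0.0, 2.0, 4.0):
    p = logistic(x, beta0, beta1)
    # The logit of pi(x) recovers the linear predictor beta0 + beta1 * x.
    print(x, round(p, 4), round(logit(p), 4), beta0 + beta1 * x)
```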

30
Logistic regression
  • Analysis steps
  • n independent pairs $(x_i, y_i)$
  • estimate regression coefficients and goodness of
    fit of the model
  • linear regression: least squares (the same as
    maximum likelihood if y is Normally distributed)
  • logistic regression: maximum likelihood

31
Logistic regression
  • Maximum likelihood method
  • Given a parametric model, the maximum likelihood
    estimates of a set of parameters are the values
    that maximize the probability of obtaining the
    observed data
  • The likelihood function = joint probability of
    the observations under the given probability
    distribution

32
Logistic Regression
  • $P(Y = 1 \mid x) = \pi(x)$
  • $P(Y = 0 \mid x) = 1 - \pi(x)$
  • Probability for observation $(x_i, y_i)$
  • $\pi(x_i)$ if $y_i = 1$
  • $1 - \pi(x_i)$ if $y_i = 0$
  • $\pi(x_i)^{y_i} (1 - \pi(x_i))^{1 - y_i}$ in general
  • For n observations, the joint probability
    (because independent)
  • $\prod_{i=1}^{n} \pi(x_i)^{y_i} (1 - \pi(x_i))^{1 - y_i}$
  • This is the likelihood function $l$
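
The likelihood can be written directly as code. The sketch below assumes a single predictor and uses made-up data; the function names `likelihood` and `log_likelihood` are illustrative.

```python
import math

def pi_of_x(x, beta0, beta1):
    """Modeled probability that y = 1 at a given x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def likelihood(xs, ys, beta0, beta1):
    """Product over observations of pi^y * (1 - pi)^(1 - y)."""
    value = 1.0
    for x, y in zip(xs, ys):
        p = pi_of_x(x, beta0, beta1)
        value *= p ** y * (1.0 - p) ** (1 - y)
    return value

def log_likelihood(xs, ys, beta0, beta1):
    """Log of the likelihood, written as a sum for numerical stability."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = pi_of_x(x, beta0, beta1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Made-up data: the outcome tends to be 1 for larger x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
print(likelihood(xs, ys, 0.0, 0.0), log_likelihood(xs, ys, 0.0, 0.0))
print(likelihood(xs, ys, -2.0, 0.5), log_likelihood(xs, ys, -2.0, 0.5))
```

Different coefficient values give different likelihoods for the same data. The maximum likelihood estimates are the values that make this quantity as large as possible.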

33
Logistic regression
  • Maximizing the likelihood function is achieved by
    maximizing its log
  • Unlike linear regression, one cannot get a linear
    equation to calculate the regression coefficients
  • Need to obtain estimates by iteration because the
    equation is nonlinear
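
One common way to carry out the iteration is Newton-Raphson, sketched below for a single predictor. This is an illustrative sketch, not necessarily the method any particular package uses. The function name, the starting values, the stopping rule and the data are all assumptions.

```python
import math

def fit_logistic_newton(xs, ys, max_iter=25, tol=1e-10):
    """Maximum likelihood estimates (beta0, beta1) by Newton-Raphson."""
    beta0, beta1 = 0.0, 0.0  # assumed starting values
    for _ in range(max_iter):
        pis = [1.0 / (1.0 + math.exp(-(beta0 + beta1 * x))) for x in xs]
        # Score: first derivatives of the log likelihood.
        u0 = sum(y - p for y, p in zip(ys, pis))
        u1 = sum(x * (y - p) for x, y, p in zip(xs, ys, pis))
        # Information matrix: minus the second derivatives.
        w = [p * (1.0 - p) for p in pis]
        i00 = sum(w)
        i01 = sum(x * wi for x, wi in zip(xs, w))
        i11 = sum(x * x * wi for x, wi in zip(xs, w))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information) * step = score.
        step0 = (i11 * u0 - i01 * u1) / det
        step1 = (i00 * u1 - i01 * u0) / det
        beta0 += step0
        beta1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return beta0, beta1

# Made-up data in which the 0s and 1s overlap, so finite estimates exist.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(xs, ys)
print(round(b0, 4), round(b1, 4))
```

If the 0s and 1s were perfectly separated by x, the likelihood would have no finite maximum and an iteration like this would not settle.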

34
Logistic regression
  • Inferences on the regression coefficients follow
    the same rules as in linear regression
  • Estimates and standard errors of $\beta$ are
    calculated and approximate Normal distributions
    are used (Wald test)
  • Interpretation of $\beta$
  • Related to the odds ratio as $e^{\beta}$
  • Calculate odds ratio and its standard error
  • Goodness of fit
  • Again, not as simple as in simple linear regression
  • Many methods available
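
A sketch of the Wald-type calculations for one coefficient, using the fact that the odds ratio for a one-unit increase in X is the exponential of the coefficient. The function name `odds_ratio_summary` and the numerical values are hypothetical, and the interval relies on the approximate Normal distribution of the estimate.

```python
import math

def odds_ratio_summary(beta, se, z_crit=1.96):
    """Wald z, two-sided p-value, odds ratio and an approximate 95% CI."""
    wald_z = beta / se
    # Two-sided p-value from the standard Normal distribution.
    p_value = math.erfc(abs(wald_z) / math.sqrt(2.0))
    odds_ratio = math.exp(beta)
    # The interval is built on the log odds scale, then exponentiated.
    lower = math.exp(beta - z_crit * se)
    upper = math.exp(beta + z_crit * se)
    return wald_z, p_value, odds_ratio, lower, upper

# Hypothetical estimate and standard error, for illustration only.
z, p, odds, lo, hi = odds_ratio_summary(beta=math.log(2.0), se=0.25)
print(round(z, 3), round(p, 4), round(odds, 3), round(lo, 3), round(hi, 3))
```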

35
Regression
  • In summary
  • Regression is a simple way of relating variables
    by the use of mathematical functions, allowing
    one to examine the variability in one variable as
    a function of variability in the other
  • Relationship could be one-one or one-many
  • Allows for adjustment for confounding (assuming
    a general linear model)
  • Some allowance for testing effect modification
  • Need to be careful of the assumptions regarding
    data collection, data format, patterns of
    variability, study design

36
Regression
  • In summary
  • Any model can be used to fit the data
  • Interpretation depends primarily on the
    theoretical foundation for the model
  • Parameters of the model may have identifiable
    characteristics (for example the odds ratio in
    logistic regression) and meaningful definitions
    when the theoretical foundation is solid
  • While confounding can be adjusted for and effect
    modification detected, this is very much model
    dependent