Biostatistics and Computer Applications - PowerPoint PPT Presentation

Provided by: dafen
1
Biostatistics and Computer Applications
Dafeng Hui
Introduction, Descriptive Statistics, SAS
2
Introduction
  • Introduction to the basic concepts of statistics
    as applied to problems in biological science.
  • Goal of the course
  • Understand statistical concepts (population,
    sample, t-test, slope, significant etc.)
  • Identify appropriate methods for your data (e.g.,
    paired t-test or independent t-test, one-way
    block or two-way ANOVA)
  • Select correct SAS procedures to analyze data
    (several SAS procedures may serve one purpose;
    decide which is more suitable)
  • Scientific reading and interpretation.

3
Biostatistics and Computer Applications
  • Why study Biostatistics?
  • Statistical methods are widely used in the
    biological sciences
  • Examples are drawn from biology and are practical
    and useful
  • Focus on application instead of mathematical
    derivation
  • Helps you evaluate published papers in an
    intelligent manner.
  • Statistics - the science and art of obtaining
    reliable results and conclusions from data that
    are subject to variation.
  • Biostatistics (biometry) - the application of
    statistics to the biological sciences.

4
Biostatistics and Computer Applications
  • Why Computer Applications?
  • Statistical methods are often difficult and
    complicated (ANOVA, regression, etc.)
  • Advances in computer technology and statistical
    software make applying statistical methods much
    easier today than before
  • Software such as SAS takes time to learn.

5
Is Biostatistics hard to study?
  • Factors that make it hard for some students to
    learn statistics
  • The terminology is deceptive. To understand
    statistics, you have to understand that the
    statistical meanings of terms such as significant,
    error and hypothesis are distinct from the
    ordinary uses of these words.
  • Statistics requires mastering abstract concepts.
    It is not easy to think about theoretical
    concepts such as populations, probability
    distributions, and null hypotheses.

6
Is Biostatistics hard? (cont)
  • Statistics is at the interface of mathematics and
    science. To really grasp the concepts of
    statistics, you need to be able to think about it
    from both angles.
  • The derivation of many statistical tests involves
    difficult math. However, you can learn to use
    statistical tests and interpret the results even
    if you do not fully understand how they work. You
    only need to know enough about how the tools work
    to avoid using them in inappropriate
    situations.
  • Basically, you can calculate statistical tests
    and interpret results even if you don't
    understand how the equations were derived, as
    long as you know enough to use the statistical
    tests appropriately.

7
Questions about this class
  • Is this class going to be hard?
  • No. The concepts are easy and the procedures are
    clear.
  • Why do we spend time on theoretical material?
  • It helps in understanding the applications
  • Do we need to know all of it?
  • You may not need all of it, but be prepared

8
Role of statistics in Biological Science
  • Science
  • 1. Idea or question
  • 2. Collect data / make observations
  • 3. Describe data / observations
  • 4. Assess the strength of evidence for / against
    the hypothesis
  • Statistics
  • 1. Mathematical model / hypothesis
  • 2. Study design
  • 3. Descriptive statistics
  • 4. Inferential statistics

9
Contents of the course
  • Descriptive statistics
  • Graph, table, mean and standard deviation
  • Inferential statistics
  • Probability and distribution
  • Hypothesis test
  • Analysis of variance (ANOVA)
  • Correlation and regression analysis
  • Other special topics

10
Basic Concept
  • Data
  • numerical facts, measurements, or observations
    obtained from an investigation or experiment
    aimed at answering a question
  • Statistical analyses deal with numbers
  • Variable
  • a characteristic that can take on different
    values for different persons, places or things
  • Statistical analyses need variability otherwise
    there is nothing to study
  • Examples
  • Concentration of a substance, pH values obtained
    from atmospheric precipitation, birth weight of
    babies whose mothers are smokers, etc.

11
Basic Concept (cont.)
  • Type of Variable
  • Continuous variable
  • Between any two values of a variable, there is
    another possible value
  • Examples: height, weight, concentration
  • Discrete variable
  • Values can only be integers
  • Examples: number of people, number of plants, etc.

12
Basic Concept (cont.)
  • Population
  • Population: a set or collection of objects we are
    interested in (finite or infinite).
  • Parameter: a descriptive measure associated with
    a variable of an entire population, usually
    unknown because the whole population cannot be
    enumerated.
  • For example,
  • Plant height under warming conditions
  • Graduates in the US
  • Smokers in the world

13
Basic Concept (cont.)
  • Sample
  • Sample: a small number of subjects drawn from a
    population, used to make inferences about the
    population
  • Random sample: a sample of size n drawn from a
    population of size N in such a way that every
    possible sample of size n has the same chance of
    being selected.
  • Statistic: a descriptive measure associated with
    a random variable of a sample.

14
Basic Concept (cont.)
  • Population and Sample
  • Sample → Population, Statistic → Parameter
  • The population, described by its parameters,
    predicts the properties of a sample
  • The sample, described by its statistics, is
    generalized to the population
15
Descriptive Statistics
  • Graphical Summaries
  • Frequency distribution
  • Histogram
  • Stem and Leaf plot
  • (Barplot, Boxplot)
  • Numerical Summaries
  • Location - mean, median, mode.
  • Spread - range, variance, standard deviation
  • (Shape - skewness, kurtosis)

16
Frequency Distribution (discrete var.)
  • Example: number of sedge plants (Carex flacca)
    found in each of 800 sample quadrats (1 m²) in an
    ecological study of grasses
  • 1, 4, 1, 0, 0, 1, 0, 0, 2, 3, 1, 2, 3, 1, 0, 2,
    0, 1, 2,
  • ...
  • 1, 2, 3, 2, 1, 1, 0, 5, 0, 0, 1, 0, 1, 0, 2, 4,
    7, 2, 1, 0
  • How is the number of plants per quadrat distributed?

17
Frequency Distribution (discrete var.)
  • Table 1. The frequency, relative frequency and
    cumulative frequencies of sedge plants in a
    quadrat.
  • frequency - number of times a value occurs in
    the data.
  • relative frequency - the % of the time that the
    value occurs (frequency/n); its counterpart in
    the population is probability.
  • cumulative relative frequency - the % of the
    sample that is equal to or smaller than the value
    (cumulative frequency/n).
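
A minimal Python sketch of these three definitions, applied only to the first 19 quadrat counts listed in the sedge example (the full 800-value data set is not shown, so the resulting numbers are illustrative and are not Table 1). The function name `frequency_table` is hypothetical.

```python
from collections import Counter

# First row of quadrat counts as listed in the example; the remaining
# values of the 800-quadrat data set are elided in the source.
counts = [1, 4, 1, 0, 0, 1, 0, 0, 2, 3, 1, 2, 3, 1, 0, 2, 0, 1, 2]


def frequency_table(values):
    """Return rows of (value, frequency, relative %, cumulative relative %)."""
    n = len(values)
    freq = Counter(values)
    rows = []
    cumulative = 0
    for value in sorted(freq):
        cumulative += freq[value]
        rows.append((value, freq[value],
                     100 * freq[value] / n,
                     100 * cumulative / n))
    return rows


table = frequency_table(counts)
print("value frequency  relative cumulative")
for value, f, rel, cum in table:
    print(f"{value:>5} {f:>9} {rel:>9.1f} {cum:>10.1f}")
```

The cumulative column always ends at 100%, which is a quick check that no observation was dropped.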

18
Frequency Distribution (continuous var.)
  • Grouping of continuous outcomes
  • Examples: weight, height.
  • Gives a better understanding of what the data show
    than individual values do
  • Example: fiber length of a cotton variety (n = 106)
  • Data
  • 27.5, 28.6, 29.4, 30.5, 31.4, 29.8, 27.6, 28.7, 27.6
  • 31.8, 32.0, 27.8

19
Frequency Distribution
Table 2. Frequency and relative frequency
distribution of fiber length (mm) of a cotton
variety (n = 106)
20
Frequency Distribution (cont. var.)
  • Calculate the range: R = max(X) - min(X) = 5.13
  • Set the number of intervals g and the interval
    width i
  • Some rules exist, but generally create 8-15
    equal-sized intervals; here g = 11
  • i = R/(g-1) ≈ 0.5
  • Set the interval limits
  • L1 = min(X) - i/2 = 27.0, L2 = L1 + i = 27.5, ...
  • Count the number of observations in each interval
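
The steps above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the minimum of 27.25 mm is an assumption inferred from L1 = 27.0 and i = 0.5, and only the 12 fiber lengths printed on the earlier slide are counted (the full data set has 106 values).

```python
def class_limits(x_min, data_range, g):
    """Interval width and the g + 1 limits of g equal-width classes.

    Follows the rule on this slide: width i = R / (g - 1), rounded to one
    decimal, and first limit L1 = min(X) - i / 2.
    """
    width = round(data_range / (g - 1), 1)
    first = x_min - width / 2
    return width, [round(first + k * width, 2) for k in range(g + 1)]


def count_per_class(values, limits, width):
    """Count the values falling in each class [L_k, L_k + width)."""
    counts = [0] * (len(limits) - 1)
    for x in values:
        k = int((x - limits[0]) // width)
        counts[k] += 1
    return counts


# x_min = 27.25 is an assumption, inferred from L1 = 27.0 and i = 0.5.
width, limits = class_limits(x_min=27.25, data_range=5.13, g=11)

# Only the 12 fiber lengths actually listed in the example.
listed = [27.5, 28.6, 29.4, 30.5, 31.4, 29.8, 27.6, 28.7, 27.6,
          31.8, 32.0, 27.8]
counts = count_per_class(listed, limits, width)

for lower, c in zip(limits, counts):
    print(f"{lower:.1f} - {lower + width:.1f}: {c}")
```

Treating each class as closed on the left and open on the right is one common convention; the source does not say which one the course uses.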

21
Frequency Distribution
Table 2. Frequency and relative frequency
distribution of fiber length (mm) of a cotton
variety (n = 106)
22
Histogram (Bar graph) and polygon
  • Histogram: a graph of frequencies
  • Can be used to visually compare frequencies
  • Easier to assess magnitude of differences rather
    than trying to judge numbers
  • Frequency polygon - similar to histogram

Fig. 1. Frequency distribution of plants in a
quadrat.
23
Histogram (Bar graph) and polygon
Fig. 2. Frequency distribution in fiber length of
a cotton.
24
Stem-and-Leaf Displays
  • Another way to assess frequencies
  • Preserves the individual measurements, so it is
    not practical for large data sets
  • Stems are the first digit(s) of the measurements;
    leaves are the last digit
  • Most useful for two-digit numbers, more
    cumbersome for three digits

Tally by tens: 20 X, 30 XXX, 40 XXXX, 50 XX, 60 X

Stem  Leaf
2     1
3     244
4     2468
5     26
6     4
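
As a small illustration, the display above can be rebuilt in Python from the eleven two-digit values it encodes (21, 32, 34, ...). The helper name `stem_and_leaf` is hypothetical, and the sketch handles only non-negative integers with the tens digit as stem.

```python
from collections import defaultdict

# Values read back from the stem-and-leaf display on this slide.
values = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]


def stem_and_leaf(data):
    """Group two-digit values by tens digit (stem) and units digit (leaf)."""
    plot = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(values)
for stem in sorted(plot):
    print(stem, "|", "".join(str(leaf) for leaf in plot[stem]))
```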
25
Summary
  • In practice, descriptive statistics play a major
    role
  • Usually the first one or two tables or figures in
    a paper
  • A statistician needs to know about each variable
    before deciding how to analyze the data to answer
    the research questions
  • In any analysis, 90% of the effort goes into
    setting up the data
  • Descriptive statistics are part of that 90%

26
Descriptive Statistics: Measures of Location
  • Descriptive measure computed from population data
    - parameter
  • Descriptive measure computed from sample data -
    statistic
  • Most common measures of location
  • Mean
  • Median
  • Mode
  • Geometric Mean, harmonic mean

27
Arithmetic mean (population)
  • Suppose we have N measurements of a particular
    variable in a population. We denote these N
    measurements as
  • X1, X2, X3, ..., XN
  • where X1 is the first measurement, X2 is the
    second, etc.
  • Definition
  • $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
  • More accurately called the arithmetic mean, it is
    defined as the sum of the measurements observed
    divided by the number of observations.

28
Arithmetic mean (sample)
  • Sample: suppose we have n measurements of a
    particular variable, drawn from a population
    with N measurements. The n measurements are
  • X1, X2, X3, ..., Xn
  • where X1 is the first measurement, X2 is the
    second, etc.
  • Definition
  • $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$

29
Arithmetic mean (sample)
  • Some Properties of the Arithmetic Mean
  • 1. ,
  • Prove 1.
  • 2.
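
Two properties commonly stated for the sample mean are sketched here in LaTeX, on the assumption that they are the ones this slide lists: deviations from the mean sum to zero, and the sum of squared deviations is smallest when taken about the mean. Here $\bar{X}$ is the sample mean and $a$ is any constant; the first line also gives the short proof of property 1.

```latex
% Property 1: deviations from the mean sum to zero
\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n\bar{X}
  = n\bar{X} - n\bar{X} = 0

% Property 2: the sum of squared deviations is minimal about the mean
\sum_{i=1}^{n} (X_i - \bar{X})^2 \le \sum_{i=1}^{n} (X_i - a)^2
  \quad \text{for any constant } a
```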

30
Median
  • Frequently used if there are extreme values in a
    distribution or if the distribution is non-normal
  • Definition
  • That value that divides the ordered array into
    two equal parts
  • If there is an odd number of observations, the
    median Md is the (n+1)/2-th observation
  • e.g. the median of 11 observations is the 6th
    observation
  • If there is an even number of observations, the
    median Md is the midpoint between the middle two
    observations
  • e.g. the median of 12 observations is the midpoint
    between the 6th and 7th

31
Mode
  • Definition
  • Value that occurs most frequently in data set
  • Example
  • 2 3 4 5 3 4 5 6 7 5 3 2 5; mode Mo = 5
  • If all values are different, there is no mode
  • May be more than one mode
  • Bimodal or multimodal
  • Not used very frequently in practice

32
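Example: Central Location
Suppose the ages of the 10 trees you are studying
are 34, 24, 56, 52, 21, 44, 64, 34, 42, 46.
Then the mean age of this group is 417/10 = 41.7 years.
To find the median, first order the data:
21, 24, 34, 34, 42, 44, 46, 52, 56, 64.
The median is the midpoint of the 5th and 6th values,
Md = (42 + 44)/2 = 43 years.
The mode is 34 years, Mo = 34 (occurred twice).
The mean is the most commonly used.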
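
The three measures can be checked with a short Python sketch on the same ten tree ages. The function names are illustrative; `median` follows the odd/even rule from the Median slide, and `modes` returns an empty list when every value is unique.

```python
from collections import Counter

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages (years)


def mean(values):
    """Arithmetic mean: sum of the values divided by their number."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data; midpoint of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the (n + 1)/2-th observation
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """All values sharing the highest frequency; empty if every value is unique."""
    freq = Counter(values)
    top = max(freq.values())
    if top == 1:
        return []
    return sorted(v for v, f in freq.items() if f == top)


print(mean(ages), median(ages), modes(ages))
```

Returning a list from `modes` covers the bimodal and multimodal cases mentioned on the Mode slide.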
33
Geometric mean
  • Used to calculate mean growth rate
  • Definition
  • Antilog of the mean of the log xi
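
A minimal Python sketch of this definition, using natural logs. The function name and the two input values are made up for illustration; they are not data from the course.

```python
import math


def geometric_mean(values):
    """Antilog of the mean of the logs; all values must be positive."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs strictly positive values")
    mean_log = sum(math.log(x) for x in values) / len(values)
    return math.exp(mean_log)


# Made-up growth factors, for illustration only.
growth = [2.0, 8.0]
g = geometric_mean(growth)  # close to 4.0, since sqrt(2 * 8) = 4
print(g)
```

The arithmetic mean of the same two values is 5, so the geometric mean is the smaller of the two whenever the values differ.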

34
Geometric mean
  • Example: root growth at 25 °C; calculate the mean
    growth rate (mm/d).

35
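Descriptive Statistics: Measures of Dispersion
  • Look at these two data sets
  • Set 1: 100, 30, 20, 7, -20, -30, -100
  • Set 2: 10, 3, 2, 7, -2, -3, -10
  • If we calculate the mean
  • Set 1: $\bar{X}$ = 7/7 = 1
  • Set 2: $\bar{X}$ = 7/7 = 1
  • How to measure dispersion (spread, variability)?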
  • How to measure dispersion (spread, variability)?

36
Descriptive Statistics: Measures of Dispersion
  • Common measures
  • Range
  • Variance and Standard deviation
  • Coefficient of variation
  • Many distributions are well described by measures
    of location and dispersion

37
Range
  • Range is the difference between the largest and
    smallest values in the data set
  • R = Max(Xi) - Min(Xi)
  • Heavily influenced by the two most extreme values
    and ignores the rest of the distribution
  • Set 1: 100, 30, 20, 7, -20, -30, -100
  • Set 2: 10, 3, 2, 7, -2, -3, -10
  • R1 = 100 - (-100) = 200
  • R2 = 10 - (-10) = 20

38
Variance and standard deviation (population)
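  • Suppose we have N measurements of a particular
    variable in a population: X1, X2, X3, ..., XN
  • The mean is $\mu$; as $\sum_{i=1}^{N}(X_i - \mu) = 0$,
    we define
  • $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$
    as the variance (its unit is the unit of X, squared)
  • $\sigma = \sqrt{\sigma^2}$ as the standard deviation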

39
Variance and standard deviation (sample)
  • Suppose we have n measurements of a particular
    variable in a sample: X1, X2, X3, ..., Xn
  • The mean is $\bar{X}$; we define
  • $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$
  • as the mean square, or sample variance
  • $s = \sqrt{s^2}$
  • as the standard deviation

40
Variance and standard deviation
  • Corrected Sum of Squares (CSS):
    $\sum_{i=1}^{n}(X_i - \bar{X})^2$
  • Degrees of freedom: n - 1
  • n-1 is used because if we know n-1 deviations, the
    nth deviation is known
  • Deviations have to sum to zero

41
Example
  • Suppose the ages of the 10 trees you are studying
    are 34, 24, 56, 52, 21, 44, 64, 34, 42, 46. We
    calculated the mean, $\bar{X}$ = 41.7 y.
  • Calculate the range, variance, standard deviation
    and CV.

R = 64 - 21 = 43 y, $s^2$ = 1692.1/9 = 188.01 y², s = 13.71 y,
CV = 13.71/41.7 = 0.329 (32.9%).
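
The same quantities can be reproduced with a few lines of Python. This is a sketch using only the ten ages from the example; the CV is computed as s divided by the mean, which is the usual definition and is assumed here.

```python
import math

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages (years)

n = len(ages)
mean = sum(ages) / n
data_range = max(ages) - min(ages)

# Corrected sum of squares: squared deviations from the sample mean.
css = sum((x - mean) ** 2 for x in ages)
variance = css / (n - 1)   # n - 1 degrees of freedom
std_dev = math.sqrt(variance)
cv = std_dev / mean        # as a ratio; multiply by 100 for percent

print(f"R = {data_range}, CSS = {css:.1f}, s2 = {variance:.2f}, "
      f"s = {std_dev:.2f}, CV = {100 * cv:.1f}%")
```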
42
Coefficient of Variation
  • Relative variation rather than absolute variation
    such as standard deviation
  • Definition of C.V.: CV = s / $\bar{X}$, often
    multiplied by 100 and expressed as a percentage
  • Useful in comparing variation between two
    distributions
  • Used particularly in comparing laboratory
    measures to identify those determinations with
    more variation

43
Example
  • Set 1: 100, 30, 20, 7, -20, -30, -100
  • Set 2: 10, 3, 2, 7, -2, -3, -10
  • Calculate $\bar{X}$, $s^2$, s and CV (here s / $\bar{X}$).

Set   $\bar{X}$   $s^2$    s      CV
1     1           3773.7   61.4   61.4
2     1           44.7     6.7    6.7
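
A short Python check of this comparison, using the standard-library `statistics` module. Set 1 is entered with the negative signs that reproduce the tabulated mean of 1 and range of 200 (an assumption about the intended data), and the CV is left as the plain ratio s / mean, as in the table.

```python
import statistics

# Set 1 is taken with the signs that reproduce the table (mean 1, range 200).
set1 = [100, 30, 20, 7, -20, -30, -100]
set2 = [10, 3, 2, 7, -2, -3, -10]


def summarize(values):
    """Return (mean, sample variance, sample SD, CV as SD / mean)."""
    m = statistics.mean(values)
    s2 = statistics.variance(values)  # divides by n - 1
    s = statistics.stdev(values)
    return m, s2, s, s / m


summary1 = summarize(set1)
summary2 = summarize(set2)
for label, row in (("Set 1", summary1), ("Set 2", summary2)):
    print(label, [round(v, 1) for v in row])
```

Both sets share the same mean, so only the dispersion measures tell them apart.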

44
Descriptive Statistics (Summary)
  • Graphical Summaries
  • Frequency distribution
  • Histogram
  • Stem and Leaf plot
  • Boxplot
  • Numerical Summaries
  • Location - mean, median, mode.
  • Dispersion - range, variance, standard deviation
  • Shape (lab)

45
Software
  • Statistical software
  • SAS
  • SPSS
  • Stata
  • BMDP
  • MINITAB
  • Graphical software
  • Sigmaplot
  • Harvard Graphics
  • PowerPoint
  • Excel

46
SAS
  • Statistical Analysis System (SAS)
  • World leader in business-intelligence software
    and services
  • Founded in 1976, SAS serves more than 39,000
    business, government and university sites in 118
    countries.

47
SAS introduction
  • SAS has grown far beyond its origins as a
    "statistics package" and has positioned itself as
    "enterprise software", i.e. a complete system to
    manage, analyze, and present information,
    especially in a business environment.
  • Standard statistical software

48
Design and function of SAS
  • Provides tools so that scientists can spend less
    time on the mechanics of data analysis and more
    on data collection and interpretation of results.
  • Four data-driven tasks: data access, data
    management, data analysis and data presentation

49
SAS programming
  • SAS windows
  • A simple SAS program
  • DATA step
  • PROC step
  • SAS procedures for descriptive statistics
  • UNIVARIATE, MEANS, SUMMARY

50
  • Merry Christmas!

51
Box Plots (explain later)
  • Descriptive method to convey information about
    measures of location and dispersion
  • Box-and-Whisker plots
  • Construction of boxplot
  • Box is IQR
  • Line at median
  • Whiskers at smallest and largest observations
  • Other conventions can be used, especially to
    represent extreme values

52
Box Plots
[Figure: box plot example; axis label: Drug]