IT 223 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: IT 223


1
IT 223
  • Data Analysis

2
About me
  • Nedjla Ougouag
  • Office: Room 701, CTI building
  • Phone: (312) 362-5166
  • Email: ntiouririne_at_cs.depaul.edu
  • Homepage:
  • http://condor.depaul.edu/ntiourir/Homepage.html
  • Office hours:
  • Without appointment: Tuesday 3:30-5:00
  • With appointment: after class.

3
About this course
  • The course will discuss simple statistical
    methods and basic concepts of probability theory.
  • The topics of the course are:
  • Descriptive statistics and representing data
    using graphs.
  • Linear regression models.
  • Sampling and experimental design.
  • An introduction to statistical inference:
  • confidence intervals and
  • hypothesis testing.
  • We will use the statistical package SAS
    (Statistical Analysis System).

4
About this course
  • The statistical software SAS runs
  • on UNIX
  • (accounts on Hawk are available to students)
  • on PCs with Windows
  • (available in the computer labs)

5
About this course
  • Required textbook:
  • D.S. Moore and G.P. McCabe, "Introduction to the
    Practice of Statistics", Fourth Edition, 2003.
    ISBN 0-7167-9657-0
  • Optional:
  • Michael Evans, "SAS Manual for Moore and
    McCabe's Introduction to the Practice of
    Statistics", Third Edition, 1999. ISBN
    0-7167-3657-8

6
Grading
  • Homework assignments: 40%,
  • Midterm: 25%,
  • Final: 35%.
  • Assignment grading is based on meeting the goals
    defined as well as on effort.
  • Syllabus (online)

7
Assignment rules
  • Only legible, organized homework will be graded.
    Always include your name, date, section, and
    homework number at the top of your assignment.
  • How to submit your assignments:
  • Via the DLWeb (collate into ONE Word file).
  • In class (staple pages together).
  • No e-mail submissions will be accepted.

8
Class time
  • Attendance: my notes are not enough to get by.
    Plan on attending to do well.
  • Since you're here, participate!
  • Your homework assignments are the most important
    way for me to view your progress.

9
Important information
  • Update your email on Campus Connect
  • Go to ID Services to apply for your student
    computer account if you do not already have one.
    For more information, go to
    http://is.depaul.edu/communication/web/personal_students.asp

10
Course material
  • All lectures will be posted on the Distance
    Learning Web (DLWeb):
    https://dlweb.cti.depaul.edu/login/login.asp
  • Assignments and grades will be posted on the
    DLWeb.
  • A class discussion forum is available for you on
    the DLWeb.

11
About you
  • Your contact info: name, e-mail
  • Major. Concentration if CS major. Graduate or
    undergraduate?
  • Have you ever taken a statistics or probability
    course? If so, which ones and how long ago?
  • Have you ever used any statistical software like
    SAS before? If so, which tool(s) and how familiar
    are you with it?
  • What do you hope to get from this class?
  • How familiar are you with Unix?

This questionnaire is online. Fill it out and e-mail
it to me.
12
Lecture 1: Exploratory Data Analysis
13
Outline
  • Quick Math assessment/review
  • Exploratory data analysis (Sec. 1.1, 1.2)
  • Discovering information from the data through
    graphs and numbers.

14
Math review
  • 300 is what % of 2,000?
  • 100,000 families -> 0.1 of 1% of these have
    income greater than $75K. How many families is
    that?
  • There are 100 million eligible voters in the US.
    The Gallup poll interviews 5,000 of them. This
    is equivalent to 1 out of every ?
  • In the US (hypothetically), 1 in 500 is in the
    Army and 3 in 10,000 are officers. What % of
    Army personnel are officers?
  • 1 out of 1,500 people is a Marine and 1/10 of
    Marines are officers. What % of the population are
    officers in the Marines?
  • Without a calculator: Sqrt(100,000) is closest to
    30, 100, 300, or 1,000?
  • Is Sqrt(0.5) < 0.5?
  • Is Sqrt(2) < 2?
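
The arithmetic behind these review questions can be checked with a short Python sketch. It assumes the questions ask for percentages (the percent signs appear to have been lost in the transcript), and the variable names are purely illustrative.

```python
from fractions import Fraction
from math import sqrt

# 300 as a percentage of 2,000
pct_300_of_2000 = Fraction(300, 2000) * 100

# 0.1 of 1% of 100,000 families
rich_families = Fraction(1, 10) * Fraction(1, 100) * 100_000

# 5,000 interviewed out of 100 million voters: "1 out of every ..."
one_out_of = Fraction(100_000_000, 5_000)

# Army officers (3 in 10,000 people) as a % of Army personnel (1 in 500 people)
pct_army_officers = Fraction(3, 10_000) / Fraction(1, 500) * 100

# Marine officers as a % of the whole population
pct_marine_officers = Fraction(1, 1500) * Fraction(1, 10) * 100

# Which candidate is closest to the square root of 100,000?
closest = min([30, 100, 300, 1000], key=lambda c: abs(c - sqrt(100_000)))

print(pct_300_of_2000, rich_families, one_out_of, pct_army_officers)
print(float(pct_marine_officers), closest)
print(sqrt(0.5) < 0.5, sqrt(2) < 2)
```

Exact fractions are used so that "0.1 of 1%" is not blurred by floating-point rounding.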

15
  • 9. Solve for x and y:
  • x + 3y = 1
  • 2x + y = -3
  • A quart of vodka is 40% alcohol. A drink is made
    of OJ and vodka (V quarts of vodka for J quarts
    of OJ). What is the % alcohol in the drink?
    (Formula: a function of V and J.)
  • Tom's mother is 3 times as old as he is. Next
    year, their ages will add up to 50. How old is
    Tom?
  • Probability
  • Throw one die. What is the chance of getting a
    1?
  • Throw two dice. What is the chance of getting
    two 1's?
  • Consider two situations:
  • A) A coin is tossed 100 times. If it comes up
    heads 60 times or more, you win.
  • B) It is tossed 1,000 times. If it comes up
    heads 600 times or more, you win. Which is
    better, A or B?
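
One way to compare situations A and B is to compute both chances exactly. The sketch below assumes a fair coin and independent tosses; `prob_at_least` is a helper name made up for this illustration.

```python
from math import comb

def prob_at_least(n, k):
    """Exact P(at least k heads in n tosses of a fair coin)."""
    favorable = sum(comb(n, j) for j in range(k, n + 1))
    return favorable / 2**n

p_a = prob_at_least(100, 60)     # situation A
p_b = prob_at_least(1000, 600)   # situation B

print(f"A: {p_a:.4f}")
print(f"B: {p_b:.2e}")
print("A is better" if p_a > p_b else "B is better")
```

Both situations ask for 60% heads, but with more tosses the percentage of heads stays closer to 50%, so the chance of winning is much smaller in B.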

16
Summary of course content
  • Statistics: the science of assembling, organizing,
    and analyzing data
  • First gather data (available or to be sampled)
  • Then analyze the data (graphs, numerical
    analysis)
  • Probability theory
  • Distributions
  • Making predictions: inference
  • Assessing prediction accuracy: hypothesis testing

17
Exploratory Data Analysis
  • The goal of statistics is to gain information
    from the data.
  • First: data are collected.
  • Data come from several sources:
  • Available data
  • Census data, federal agencies, governmental
    statistical offices (www.fedstats.gov), the General
    Social Survey at the University of Chicago's
    NORC.
  • Several databases are available on the Internet
    or at the DePaul library!
  • New data
  • Sampling from the population of interest.
    Observational studies.
  • Conducting statistical experiments: medical
    trials, controlled experiments. When well
    designed, these provide the most reliable source
    of information!

18
  • Next step after data collection?
  • Long listings of data are of little value.
  • Statistical methods come to help us.
  • Exploratory data analysis: a set of methods to
    display and summarize the data.
  • In this course we will deal with data on one
    variable at a time.
  • The distribution of the observations is analyzed
    by:
  • Displaying the data in a graph that shows overall
    patterns and unusual observations (histogram, box
    plot, density curve)
  • Computing descriptive statistics that summarize
    specific aspects of the data (center and spread).

19
To designate data with values: random variables
  • Data contain information about a group of
    individuals / subjects.
  • A variable is a characteristic of an observed
    individual that takes different values for
    different individuals.
  • A quantitative variable (continuous) takes
    numerical values.
  • Ex.: height, weight, age, income, measurements
  • A qualitative/categorical variable classifies an
    individual into categories or groups.
  • Ex.: sex, religion, occupation, age (in classes,
    e.g. 10-20, 20-30, 30-40)
  • The distribution of a variable tells us what
    values it takes and how often it takes those
    values.
  • Different statistical methods are used to analyze
    quantitative and categorical variables.

20
Graphs for categorical variables
  • The values of a categorical variable are labels.
  • The distribution of a categorical variable lists
    the count or percentage of individuals in each
    category.

Counts: 212, 168, 20
A sample of 400 wireless internet users.
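
As a small illustration, the counts can be turned into the percentage distribution of the categorical variable. The category labels did not survive in the transcript, so the names used below are placeholders, not the ones from the slide.

```python
# Counts from the slide; the labels are hypothetical placeholders.
counts = {"category 1": 212, "category 2": 168, "category 3": 20}

total = sum(counts.values())  # size of the sample
percentages = {label: 100 * c / total for label, c in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {counts[label]} users, {pct:.1f}%")
```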
21
(No Transcript)
22
Example: On the morning of April 10, 1912, the
Titanic sailed from the port of Southampton (UK),
bound for New York. Altogether there were 2,201
passengers and crew members on board. This is the
table of the survivors of the famous tragic
accident.
Define the categorical variables.
23
Bar chart representing the data in the table
above (in percentages)
24
Graphs for quantitative variables: the histogram
Example: CEO salaries. Forbes magazine published
data on the best small firms in 1993. These were
firms with annual sales of more than $5 million and
less than $350 million. Firms were ranked by
five-year average return on investment. The data
extracted are the age and annual salary of the
chief executive officer for the first 60 ranked
firms. (Data at
http://lib.stat.cmu.edu/DASL/DataArchive.html )
Salary of chief executive officer (including
bonuses), in thousands of dollars:
145 621 262 208 362 424 339 736 291 58
498 643 390 332 750 368 659 234 396 300
343 536 543 217 298 1103 406 254 862 204
206 250 21 298 350 800 726 370 536 291
808 543 149 350 242 198 213 296 317 482
155 802 200 282 573 388 250 396 572
25
  • Drawing a histogram
  • Construct a distribution table:
  • Define class intervals or bins (choose intervals
    of equal width!)
  • Count the percentage of observations in each
    interval
  • End-point convention: the left endpoint of the
    interval is included, and the right endpoint is
    excluded, i.e. [a, b)
  • Draw the horizontal axis.
  • Construct the blocks:
  • Height of block = percentages!
  • The total area under a histogram must be 100%
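
The distribution table described above can be sketched in Python for the CEO salary data. The bin width of 200 (thousand dollars) and the helper name `distribution_table` are assumptions chosen for illustration; they are not necessarily the bins used on the slides.

```python
# CEO salaries in thousands of dollars, as listed on the slide (59 values).
salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 58,
    498, 643, 390, 332, 750, 368, 659, 234, 396, 300,
    343, 536, 543, 217, 298, 1103, 406, 254, 862, 204,
    206, 250, 21, 298, 350, 800, 726, 370, 536, 291,
    808, 543, 149, 350, 242, 198, 213, 296, 317, 482,
    155, 802, 200, 282, 573, 388, 250, 396, 572,
]

def distribution_table(data, start, width, n_bins):
    """Rows of (left, right, count, percent) for equal-width bins [a, b)."""
    counts = [0] * n_bins
    for x in data:
        k = int((x - start) // width)  # bin index; left endpoint included
        if 0 <= k < n_bins:
            counts[k] += 1
    rows = []
    for k, c in enumerate(counts):
        left = start + k * width
        rows.append((left, left + width, c, 100.0 * c / len(data)))
    return rows

# Assumed bins: [0, 200), [200, 400), ..., [1000, 1200)
table = distribution_table(salaries, start=0, width=200, n_bins=6)
for left, right, count, pct in table:
    print(f"[{left:4d}, {right:4d})  count = {count:2d}  percent = {pct:5.2f}")
```

A salary of exactly 200 is counted in [200, 400), which is the left-endpoint convention. Changing `width` is a quick way to try other binning levels.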

26
(No Transcript)
27
[Figure: histogram with block labels 30.50, 23.73,
3.39, 1.70 (percent)]
The area of each block represents the percentage
of cases in the corresponding class interval (or
bin).
28
  • Remarks
  • A histogram represents percent by area. The area
    of each block represents the percentage of cases
    in the corresponding class interval.
  • The total area under a histogram is 100%.
  • There is no fixed choice for the number of
    classes in a histogram
  • If class intervals are too small, the histogram
    will have spikes
  • If class intervals are too large, some
    information will be missed.
  • Use your judgment!
  • Typically statistical software will choose the
    class intervals for you, but you can modify them.
  • Let's try various binning levels.

29
  • Example: Smoking
  • In a Public Health Service study, a histogram was
    plotted showing the number of cigarettes smoked
    per day by each subject (male current smokers),
    as shown below. The density is marked in
    parentheses. The class intervals include the
    left endpoint, but not the right.
  • The percentage who smoked less than two packs a
    day but at least a pack is around (there are 20
    cigarettes in a pack):
  • 1.5%, 15%, 30%, 50%
  • The percentage who smoked at least a pack a day is
    around:
  • 1.5%, 15%, 30%, 50%
  • The percentage who smoked at least 3 packs a day is
    around:
  • 0.25 of 1%, 0.5 of 1%, 10%
  • The percentage who smoked 20 cigarettes a day is
    around:
  • 0.35 of 1%, 0.5 of 1%, 1.5%, 3.5%, 10%

30
  • Answers
  • The percentage who smoked less than two packs a
    day but at least a pack is given by the area of
    the third block (note there are 20 cigarettes in
    a pack): 1.5 x (40 - 20) = 1.5 x 20 = 30%.
  • The percentage who smoked at least a pack a day is
    given by the area of the third and fourth blocks:
    30% + 0.5 x 40 = 50%.
  • The percentage who smoked at least 3 packs a day is
    the area of the block for number of cigarettes
    greater than or equal to 60. This is half of the
    fourth block: 10%.
  • The percentage who smoked 20 cigarettes a day: use
    the left endpoint convention, so 20 belongs to
    the third block. The answer is 1.5%.
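
These answers all come from "percent = area = density x width". A minimal sketch follows, using only the two blocks whose heights can be read off the answers: 1.5% per cigarette on [20, 40) and 0.5% per cigarette on [40, 80). The first two blocks of the histogram are not in the transcript, and `percent_between` is a made-up helper name.

```python
# Blocks as (left, right, density in percent per cigarette).
# Only the third and fourth blocks are known from the answers.
blocks = [(20, 40, 1.5), (40, 80, 0.5)]

def percent_between(blocks, lo, hi):
    """Area under the histogram over [lo, hi), in percent."""
    total = 0.0
    for left, right, density in blocks:
        overlap = min(right, hi) - max(left, lo)  # width of the block inside [lo, hi)
        if overlap > 0:
            total += density * overlap
    return total

one_to_two_packs = percent_between(blocks, 20, 40)
at_least_one_pack = percent_between(blocks, 20, 80)
at_least_three_packs = percent_between(blocks, 60, 80)
exactly_twenty = percent_between(blocks, 20, 21)

print(one_to_two_packs, at_least_one_pack, at_least_three_packs, exactly_twenty)
```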

31
Using histograms for comparisons
Fuel economy for model year 2001 compact and
two-seater cars (Table 1.8, p. 38): city
consumption and highway consumption.
32
(No Transcript)
33
(No Transcript)
34
Describing distributions with numbers
  • A distribution can be described through the
    measures of its center and of its spread.
  • Measuring the center
  • The most common measures are the mean or average
    and the median.
  • The Mean or Average
  • To calculate the average of a set of
    observations, add their values and divide by the
    number of observations.

Data: number of home runs hit by Babe Ruth as a
Yankee: 54, 59, 35, 41, 46, 25, 47, 60, 54, 46,
49, 46, 41, 34, 22. The mean number of home runs
hit in a year is 659/15, or about 43.9.
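
The same calculation as a short Python sketch (the variable names are illustrative):

```python
# Home runs hit by Babe Ruth in each of his 15 seasons as a Yankee.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

total = sum(ruth)        # add the values
n = len(ruth)            # number of observations
mean = total / n         # divide the sum by the count

print(total, n, round(mean, 1))
```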
35
  • The median
  • The median M is the midpoint of a distribution,
    the number such that half the observations are
    smaller and the other half are larger.
  • To find the median:
  • Sort all the observations in order of size, from
    smallest to largest.
  • If the number of observations n is odd, the
    median M is the center observation in the ordered
    list, i.e. M is the (n+1)/2-th observation.
  • If the number of observations n is even, the
    median M is the mean of the two center
    observations in the ordered list.

Example 1: Ordered list of home runs hit by Babe
Ruth: 22 25 34 35 41 41 46 46 46 47 49 54 54 59
60. n = 15, median = 46 (the 8th observation).
Example 2: Ordered list of season home run totals
for Roger Maris (61 in 1961): 8 13 14 16 23 26 28
33 39 61. n = 10, median = (23 + 26)/2 = 24.5.
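
The rule for finding the median can be sketched as a small Python function and checked against both examples. The function is written out by hand to mirror the steps above; it is an illustration, not code from the course.

```python
def median(values):
    """Median by the sort-and-pick-the-middle rule."""
    ordered = sorted(values)   # smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n+1)/2-th observation, which is index n // 2 when counting from 0
        return ordered[mid]
    # n even: mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

print(median(ruth))   # n = 15, odd
print(median(maris))  # n = 10, even
```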
36
Symmetric distribution
  • The mean and median of a symmetric distribution
    are close together

[Figure: symmetric distribution, with the mean and
median marked together at the 50% point]
  • In skewed distributions, the mean is farther out
    in the long tail than is the median. The mean is
    more sensitive to extreme values.

[Figures: a left-skewed distribution and a
right-skewed distribution, each with the median
(50% point) and the mean marked]
37
Mean or median?
  • The mean is a good measure for the center of a
    symmetric distribution
  • The median is a resistant measure and should be
    used for skewed distributions. Its value is only
    slightly affected by the presence of extreme
    observations, no matter how large these
    observations are.
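
A quick experiment shows what "resistant" means. The sketch below reuses the Roger Maris home run list from the median example and replaces the largest value with a made-up extreme value (610, purely hypothetical) to see how each measure reacts.

```python
from statistics import mean, median

maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

# Hypothetical data: the largest observation is made ten times larger.
with_outlier = maris[:-1] + [610]

mean_before, median_before = mean(maris), median(maris)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean is pulled far toward the extreme value, while the median does not move at all.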

38
Example: Shopping in a supermarket
A marketing consultant observed 50 consecutive
shoppers at a supermarket. The histogram below
shows how much each shopper spent in the store.
Summary statistics (in dollars): mean = 34.70,
median = 27.855.
The mean does not say much. The median says that
about 50% of the shoppers spent less than 28
dollars. What else would you like to know?
39
Spread of a Distribution
Two measures of spread: 1. The quartiles. First
quartile Q1: the value such that 25% of the
observations fall at or below it (Q1 is often
called the 25th percentile). Third quartile Q3:
the value such that 75% of the observations
fall at or below it (Q3 is often called the 75th
percentile). Typically used if the distribution
of the observations is skewed.
The inter-quartile range IQR is defined as the
distance between the two quartiles: IQR = Q3 - Q1.
[Figure: Q1, M, and Q3 marked on a distribution,
with the IQR spanning Q1 to Q3]
40
Example: Shopping in a supermarket (continued)
A marketing consultant observed 50 consecutive
shoppers at a supermarket. The histogram below
shows how much each shopper spent in the store.
Summary statistics (in dollars): mean = 34.70,
median = 27.855, Q1 = 19.27, Q3 = 45.40, IQR =
45.40 - 19.27 = 26.13.
About 50% of the shoppers spent less than 28
dollars, 25% spent less than 20 dollars, and 25%
of the customers of the store spent more than 45
dollars. Moreover, 50% of the customers spent
between 20 and 45 dollars! Extreme values for
purchases: > Q3 + 1.5 x IQR = 84.59.
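
The IQR and the cutoff for extreme values can be recomputed from the two quartiles. This is only a sketch working from the rounded quartiles printed on the slide, so the cutoff comes out as 84.595, which agrees with the slide's 84.59 up to rounding.

```python
# Quartiles of the amounts spent (in dollars), as given on the slide.
q1 = 19.27
q3 = 45.40

iqr = q3 - q1                    # inter-quartile range
upper_cutoff = q3 + 1.5 * iqr    # purchases above this are flagged as extreme

print(f"IQR = {iqr:.2f}")
print(f"Upper cutoff = {upper_cutoff:.3f}")
```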