Financial Data Mining and Analysis

Transcript and Presenter's Notes

1
Financial Data Mining and Analysis
  • References
  • Prof. Hua Chen's lecture notes (at National
    Taiwan University)
  • U.S. News and World Report's Business &
    Technology section, 12/21/98, by William J.
    Holstein
  • Prof. Juran's lecture note 1 (at Columbia
    University)
  • J.H. Friedman (1999), Data Mining and Statistics.
    Technical report, Dept. of Statistics, Stanford
    University

2
Main Goal
  • Study statistical tools useful in managerial
    decision making.
  • Most management problems involve some degree of
    uncertainty.
  • People have poor intuitive judgment of
    uncertainty.
  • IT revolution: abundance of available
    quantitative information
  • data mining: large databases of info, ...
  • market segmentation & targeting
  • stock market data
  • almost anything else you may want to know...
  • What conclusions can you draw from your data?
  • How much data do you need to support your
    conclusions?

3
Applications in Management
  • Operations management
  • e.g., model uncertainty in demand, production
    function...
  • Decision models
  • portfolio optimization, simulation,
    simulation-based optimization...
  • Capital markets
  • understand risk, hedging, portfolios, betas...
  • Derivatives, options, ...
  • it is all about modeling uncertainty
  • Operations and information technology
  • dynamic pricing, revenue management, auction
    design, ...
  • Data mining... many applications

4
Portfolio Selection
  • You want to select a stock portfolio of companies
    A, B, C, ...
  • Information: stock annual returns by year
  • A: 10%, 14%, 13%, 27%, ...
  • B: 16%, 27%, 42%, 23%, ...
  • Questions (see the sketch below)
  • How do we measure the volatility of each stock?
  • How do we quantify the risk associated with a
    given portfolio?
  • What is the tradeoff between risk and returns?
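
A minimal sketch of these calculations in Python; it assumes the figures
above are annual percentage returns and uses an illustrative 50/50
weighting (the weights are not from the slides):

```python
import statistics

# Annual returns (percent) for stocks A and B, from the slide.
returns_a = [10, 14, 13, 27]
returns_b = [16, 27, 42, 23]

# Volatility of each stock: the sample standard deviation of its returns.
vol_a = statistics.stdev(returns_a)
vol_b = statistics.stdev(returns_b)

# Risk of a 50/50 portfolio:
# var = w^2 var_A + (1 - w)^2 var_B + 2 w (1 - w) cov(A, B)
w = 0.5
cov_ab = statistics.covariance(returns_a, returns_b)  # Python 3.10+
port_var = (w**2 * statistics.variance(returns_a)
            + (1 - w)**2 * statistics.variance(returns_b)
            + 2 * w * (1 - w) * cov_ab)
print(vol_a, vol_b, port_var**0.5)  # volatilities and portfolio risk
```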

5
(No Transcript)
6
Currency Value (Relative to Jan 2 1998)
7
Introduction
  • Premise: All business becomes information-driven.
  • The concept of data mining is becoming
    increasingly popular as a business information
    management tool, where it is expected to reveal
    knowledge structures that can guide decisions in
    conditions of limited certainty.
  • Competitiveness: How do you collect and exploit
    information to your advantage?
  • The challenges
  • Most corporate data systems are not ready.
  • Can they share information?
  • What is the quality of the input information?
  • Most data techniques come from the empirical
    sciences; the world is not a laboratory.
  • Defining good metrics: abandoning gut rules of
    thumb may be too "risky" for the manager.
  • Communicating success, setting the right
    expectations.

8
A visualization of a Naive Bayes model for
predicting who in the U.S. earns more than
$50,000 in yearly salary. The higher the bar,
the greater the evidence that a person with
this attribute value earns a high salary.
9
Data Mining and Statistics
  • Data Mining is used to discover patterns and
    relationships in data with an emphasis on large
    observational data bases.
  • It sits at the common frontiers of several fields
    including Data Base Management, Artificial
    Intelligence, Machine Learning, Pattern
    Recognition and Data Visualization.
  • From a statistical perspective, it can be viewed
    as computer-automated exploratory data analysis
    of large, complex data sets.
  • Many organizations have large transaction-
    oriented data bases used for inventory, billing,
    accounting, etc. These data bases were very
    expensive to create and are costly to maintain.
    For a relatively small additional investment, DM
    tools offer to discover highly profitable nuggets
    of information hidden in these data.
  • Data, especially large amounts of it, reside in
    data base management systems (DBMS).
  • Conventional DBMS are focused on online
    transaction processing (OLTP), that is, the
    storage and fast retrieval of individual records
    for purposes of data organization. They are used
    to keep track of inventory, payroll records,
    billing records, invoices, etc.

10
Data Mining Techniques
  • Data mining is an analytic process designed to
  • explore data (usually large amounts of,
    typically business- or market-related, data) in
    search of consistent patterns and/or systematic
    relationships between variables, and
  • to validate the findings by applying the detected
    patterns to new subsets of data.
  • The ultimate goal of data mining is prediction,
    and predictive data mining is the most common
    type of data mining and the one that has the most
    direct business applications.
  • The process of data mining consists of three
    stages:
  • the initial exploration.
  • model building or pattern identification with
    validation and verification.
  • deployment (i.e., the application of the model to
    new data in order to generate predictions).

11
Stage 1: Exploration
  • It usually starts with data preparation, which
    may involve cleaning data, data transformations,
    selecting subsets of records and, in the case of
    data sets with large numbers of variables
    ("fields"), performing some preliminary feature
    selection operations to bring the number of
    variables to a manageable range (depending on the
    statistical methods which are being considered).
  • Depending on the nature of the analytic problem,
    this first stage of the process of data mining
    may involve anything from a simple choice of
    straightforward predictors for a regression
    model to elaborate exploratory analyses using a
    wide variety of graphical and statistical methods
    in order to identify the most relevant variables
    and determine the complexity and/or the general
    nature of models that can be taken into account
    in the next stage (see the sketch below).
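
A rough illustration of this preparation stage, sketched with pandas;
the file name and column names (transactions.csv, annual_spend,
response) are hypothetical stand-ins, and correlation-based screening
is just one simple way to shrink the variable count:

```python
import numpy as np
import pandas as pd

# Hypothetical customer file; all column names are illustrative only.
df = pd.read_csv("transactions.csv")

# Cleaning: drop duplicate records and fill missing numeric fields.
df = df.drop_duplicates()
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Transformation: log-scale a heavily skewed field (assumed to exist).
df["log_spend"] = np.log1p(df["annual_spend"])

# Preliminary feature selection: keep the k numeric fields most
# correlated with the response, to bring the count of variables
# down to a manageable range.
k = 10
corr = df[num_cols].corrwith(df["response"]).abs()
selected = corr.sort_values(ascending=False).head(k).index.tolist()
print(selected)
```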

12
Stage 2: Model building and validation
  • This stage involves considering various models
    and choosing the best one based on their
    predictive performance, i.e., models that
  • explain the variability in question, and
  • produce stable results across samples.
  • This may sound like a simple operation, but in
    fact it sometimes involves a very elaborate
    process.
  • "competitive evaluation of models," that is,
    applying different models to the same data set
    and then comparing their performance to choose
    the best.
  • These techniques - which are often considered the
    core of predictive data mining - include Bagging
    (Voting, Averaging), Boosting, Stacking (Stacked
    Generalizations), and Meta-Learning.
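
A sketch of competitive evaluation with scikit-learn, using synthetic
stand-in data; the particular models and the 5-fold cross-validation
are illustrative choices, not prescribed by the slides:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

# Apply different models to the same data set and compare their
# out-of-sample performance to choose the best.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```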

13
Models for Data Mining
  • In the business environment, complex data mining
    projects may require the coordinated efforts of
    various experts, stakeholders, or departments
    throughout an entire organization.
  • In the data mining literature, various "general
    frameworks" have been proposed to serve as
    blueprints for how to organize the process of
    gathering data, analyzing data, disseminating
    results, implementing results, and monitoring
    improvements.
  • CRISP (Cross-Industry Standard Process for data
    mining) was proposed in the mid-1990s by a
    European consortium of companies to serve as a
    non-proprietary standard process model for data
    mining.
  • The Six Sigma methodology - is a well-structured,
    data-driven methodology for eliminating defects,
    waste, or quality control problems of all kinds
    in manufacturing, service delivery, management,
    and other business activities.

14
CRISP
  • CRISP postulates the following general sequence
    of steps for data mining projects:
  • Business Understanding → Data Understanding →
    Data Preparation → Modeling → Evaluation →
    Deployment

15
Six Sigma
  • This model has recently become very popular (due
    to its successful implementations) in various
    American industries, and it appears to be gaining
    favor worldwide. It postulates a sequence of
    so-called DMAIC steps.
  • The categories of activities: Define (D), Measure
    (M), Analyze (A), Improve (I), Control (C).
  • It postulates the following general sequence of
    steps for data mining projects:
  • Define (D) → Measure (M) → Analyze (A)
    → Improve (I) → Control (C)
  • It grew out of the manufacturing, quality
    improvement, and process control traditions and
    is particularly well suited to production
    environments (including "production of services,"
    i.e., service industries).
  • Define. It is concerned with the definition of
    project goals and boundaries, and the
    identification of issues that need to be
    addressed to achieve the higher sigma level.
  • Measure. The goal of this phase is to gather
    information about the current situation, to
    obtain baseline data on current process
    performance, and to identify problem areas.
  • Analyze. The goal of this phase is to identify
    the root cause(s) of quality problems, and to
    confirm those causes using the appropriate data
    analysis tools.
  • Improve. The goal of this phase is to implement
    solutions that address the problems (root causes)
    identified during the previous (Analyze) phase.
  • Control. The goal of the Control phase is to
    evaluate and monitor the results of the previous
    phase (Improve).

16
Sampling
  • Objective: Determine the average amount of money
    spent in the Central Mall.
  • Sampling: A Central City official randomly
    samples 12 people as they exit the mall.
  • He asks them the amount of money spent and
    records the data.
  • Data for the 12 people (amount spent, in $):

        Person  Spent   Person  Spent   Person  Spent
          1      132      5      123      9      449
          2      334      6        5     10      133
          3       33      7        6     11       44
          4       10      8       14     12        1

  • The official is trying to estimate the mean and
    variance of the population based on a sample of
    12 data points.

17
Population versus Sample
  • A population is usually a group we want to know
    something about
  • all potential customers, all eligible voters, all
    the products coming off an assembly line, all
    items in inventory, etc....
  • Finite population (u1, u2, ..., uN) versus
    infinite population
  • A population parameter is a number (θ) relevant
    to the population that is of interest to us
  • the proportion (in the population) that would buy
    a product, the proportion of eligible voters who
    will vote for a candidate, the average number of
    M&M's in a pack....
  • A sample is a subset of the population that we
    actually do know about (by taking measurements of
    some kind)
  • a group who fill out a survey, a group of voters
    that are polled, a number of randomly chosen
    items off the line....
  • x1, x2, ... , xn
  • A sample statistic g(x1, x2, ... , xn) is often
    the only practical estimate of a population
    parameter.
  • We will use g(x1, x2, ... , xn) as a proxy for θ,
    but remember the difference.

18
Average Amount of Money Spent in the Central Mall
  • A sample (x1, x2, ... , xn)
  • Its mean is the sum of the values divided by
    the number of observations.
  • The sample mean, the sample variance, and the
    sample standard deviation are $107, 20,854, and
    $144.40, respectively.
  • The claim: on average, $107 is spent per
    shopper, with a standard deviation of $144.40
    (checked in the sketch below).
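
A quick check of these figures, using the 12 observations from the
sampling slide:

```python
import statistics

spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]  # dollars

print(statistics.mean(spent))      # 107.0
print(statistics.variance(spent))  # sample variance: 20854.0
print(statistics.stdev(spent))     # about 144.41
```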

19
  • The variance s² of a set of observations is the
    average of the squares of the deviations of the
    observations from their mean.
  • The standard deviation s is the square root of
    the variance s².
  • How far are the observations from the mean? s²
    and s will be
  • large if the observations are widely spread about
    their mean,
  • small if they are all close to the mean.

20
Stock Market Indexes
  • A stock market index is a statistical measure
    that shows how the prices of a group of stocks
    change over time.
  • Price-weighted index: DJIA
  • Market-value-weighted index: Standard & Poor's
    500 Composite Index
  • Equally weighted index: Wilshire 5000 Equity
    Index
  • Price-weighted index: it shows the change in the
    average price of the stocks that are included in
    the index (see the sketch below).
  • Price per share in the current period: P0; price
    per share in the next period: P1.
  • Number of shares outstanding in the current
    period: Q0; in the next period: Q1.
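
A small sketch of a price-weighted index; the three share prices are
invented for illustration, and the real DJIA uses a divisor that is
adjusted for splits and composition changes rather than the plain
count used here:

```python
# A price-weighted index averages share prices, ignoring shares outstanding.
p0 = {"A": 50.0, "B": 20.0, "C": 80.0}  # current-period prices P0
p1 = {"A": 55.0, "B": 18.0, "C": 88.0}  # next-period prices P1

divisor = len(p0)  # simplification: the DJIA adjusts this divisor
index_0 = sum(p0.values()) / divisor
index_1 = sum(p1.values()) / divisor
print(index_0, index_1, f"{(index_1 - index_0) / index_0:.2%}")
```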

21
Data Analysis
  • Statistical thinking is understanding variation
    and how to deal with it.
  • Move as far as possible to the right on this
    continuum:
  • Ignorance → Uncertainty → Risk → Certainty
  • Information science: learning from data
  • Probabilistic inference based on mathematics
  • What is Statistics?
  • What is the connection, if any?
  • Fields including Data Base Management, Artificial
    Intelligence, ...

22
Probability: the study of randomness
  • It is based on a lecture given by Professor
    Costis Maglaras at Columbia University.

23
Randomness
  • Coin tossing.
  • A phenomenon is random
  • if individual outcomes are uncertain but there is
    a regular distribution of outcomes in a large
    number of repetitions.

24
Probability
  • The probability of any outcome of a random
    phenomenon is its
  • long-term relative frequency, i.e.,
  • the proportion of times the outcome would
    occur in a very long series of repetitions.
    (empirical)
  • Trials need to be independent.
  • Computer simulation is a good tool for studying
    random behavior (see the sketch below).
  • The uses of probability
  • It began with gambling.
  • Now applied to analyze data in astronomy,
    mortality data, traffic flow, telephone
    interchange, genetics, epidemics, investment...
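
A minimal simulation of long-run relative frequency for independent
fair coin tosses (the heads probability of 0.5 is the modeling
assumption):

```python
import random

random.seed(1)

# The proportion of heads should settle near 0.5 as the number of
# independent tosses grows.
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)
```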

25
Probability Terms
  • Random Experiment: A process leading to at least
    2 possible outcomes, with uncertainty as to which
    will occur.
  • Event: An event is a subset of all possible
    outcomes of an experiment.
  • Intersection of Events: Let A and B be two
    events. Then the intersection of the two events,
    denoted A ∩ B, is the event that both A and B
    occur.
  • Union of Events: The union of the two events,
    denoted A ∪ B, is the event that A or B (or both)
    occurs.
  • Complement: Let A be an event. The complement of
    A (denoted Aᶜ) is the event that A does not occur.
  • Mutually Exclusive Events: A and B are said to be
    mutually exclusive if at most one of the events A
    and B can occur.
  • Basic Outcomes: The simple, indecomposable
    possible results of an experiment. One and
    exactly one of these outcomes must occur. The set
    of basic outcomes is mutually exclusive and
    collectively exhaustive.
  • Sample Space: The totality of basic outcomes of
    an experiment.

26
Basic Probability Rules
  • 1. For any event A, 0 ≤ P(A) ≤ 1.
  • 2. If A and B can never both occur (they are
    mutually exclusive), then
  • P(A and B) = P(A ∩ B) = 0.
  • 3. P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
  • 4. If A and B are mutually exclusive events, then
    P(A or B) = P(A ∪ B) = P(A) + P(B).
  • 5. P(Aᶜ) = 1 − P(A).
  • Independent Events
  • Two events A and B are said to be independent if
    the fact that A has occurred or not does not
    affect your assessment of the probability of B
    occurring. Conversely, the fact that B has
    occurred or not does not affect your assessment
    of the probability of A occurring.
  • 6. If A and B are independent events, then
  • P(A and B) = P(A ∩ B) = P(A) × P(B).
    (Markov?)

27
Probability models
  • Two parts in coin tossing.
  • A list of possible outcomes.
  • A probability for each outcome.
  • The sample space S of a random phenomenon is the
    set of all possible outcomes.
  • Example: S = {heads, tails} = {H, T}
  • General analysis is possible.

28
Event
  • An event is an outcome or a set of outcomes
    (it is a subset of the sample space).
  • Example (exactly two heads in four tosses):
  • A = {HHTT, HTHT, HTTH, THHT, THTH, TTHH}
  • Two events A and B are independent if knowing
    that one occurs does not change the probability
    that the other occurs.
  • If A and B are independent, P(A and B) = P(A)P(B)
  • The heads of successive coin tosses are
    independent.
  • The colors of successive cards dealt from the
    same deck are not independent.

29
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Conditional Probability
  • In these simple calculations, we are making use
    of the conditional probability formula:
  • P(A|B) = P(A holds given that B holds)
    = P(A ∩ B)/P(B)
    (a numeric illustration follows below)
  • This relationship is known as Bayes' Law, after
    the English clergyman Thomas Bayes (1702-1761),
    who first derived it. Bayes' Law was later
    generalized by the French mathematician
    Pierre-Simon Laplace (1749-1827).
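
A numeric illustration of the conditional-probability formula and
Bayes' Law; the disease-testing numbers are invented for the example:

```python
# Invented inputs: 2% base rate, 95% sensitivity, 10% false-positive rate.
p_d = 0.02              # P(disease)
p_pos_given_d = 0.95    # P(positive | disease)
p_pos_given_not = 0.10  # P(positive | no disease)

# P(positive) by the law of total probability.
p_pos = p_pos_given_d * p_d + p_pos_given_not * (1 - p_d)

# Bayes' Law: P(disease | positive)
#   = P(positive | disease) P(disease) / P(positive)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # about 0.162
```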

30
Random Variables
  • A random variable is a variable whose value is a
    numerical outcome of a random phenomenon.
  • Sample spaces need not consist of numbers.
  • Example: the number of heads in 4 coin tosses.

31
Random Variable
  • A random variable is called discrete if it has
    countably many possible values; otherwise, it is
    called continuous.
  • The following quantities would typically be
    modeled as discrete random variables
  • The number of defects in a batch of 20 items.
  • The number of people preferring one brand over
    another in a market research study.
  • The following would typically be modeled as
    continuous random variables
  • The yield on a 10-year Treasury bond three years
    from today.
  • The proportion of defects in a batch of 10,000
    items.
  • Sometimes we approximate a discrete random
    variable with a continuous one if the possible
    values are very close together; e.g., stock
    prices are often treated as continuous random
    variables.

32
Distribution discrete
  • The rule that assigns specific probabilities to
    specific values of a discrete random variable is
    called its probability mass function, or pmf.
  • If X is a discrete random variable, then we
    denote its pmf by P_X.
  • For any value x, P_X(x) is the probability of the
    event that X = x; i.e.,
  • P_X(x) = P(X = x) = probability that the value
    of X is x.
  • We always use capital letters for random
    variables. Lower-case letters like x and y stand
    for possible values (i.e., numbers).
  • The pmf gives us one way to describe the
    distribution of a random variable. Another way is
    provided by the cumulative probability function,
    denoted by F_X and defined by F_X(x) = P(X ≤ x).
  • It is the probability that X is less than or
    equal to x.
  • Where the pmf gives the probability that the
    random variable lands on a particular value, the
    cpf gives the probability that it lands on or
    below a particular value. In particular, F_X is
    always a non-decreasing function.

33
Distribution continuous
  • The distribution of a continuous random variable
    cannot be specified through a probability mass
    function, because if X is continuous, then
    P(X = x) = 0 for all x; i.e., the probability of
    any particular value is zero. Instead, we must
    look at probabilities of ranges of values.
  • The probabilities of ranges of values of a
    continuous random variable are determined by a
    density function, denoted by f_X. The area
    under a density is always 1.
  • The probability that X falls between two points a
    and b is the area under f_X between the points a
    and b. The familiar bell-shaped normal curve is
    an example of a density.
  • The cumulative distribution function, or cdf, of
    a continuous random variable is obtained from the
    density in much the same way a cpf is obtained
    from the pmf of a discrete distribution.
  • The cdf of X, denoted by F_X, is given by
    F_X(x) = P(X ≤ x).
  • F_X(x) is the area under the density f_X to the
    left of x.

34
Expectation
  • The expected value of a random variable X is
    denoted by E[X].
  • It can be thought of as the average value
    attained by the random variable.
  • The expected value of a random variable is also
    called its mean, in which case we use the
    notation μ_X.
  • The formula for the expected value of a discrete
    random variable is: E[X] = Σ_x x·P_X(x).
  • The expected value is the sum, over all possible
    values x, of x times its probability P_X(x).
  • The expected value of a continuous random
    variable cannot be expressed as a sum; instead it
    is an integral involving the density.
  • If g is a function (for example, g(x) = x²), then
    the expected value of g(X) is
    E[g(X)] = Σ_x g(x)·P_X(x).
  • The variance of a random variable X is denoted by
    either Var[X] or σ_X².
  • The variance is defined by σ_X² = E[(X − μ_X)²]
    = E[X²] − (E[X])².
  • For a discrete distribution, we can write the
    variance as Σ_x (x − μ_X)²·P_X(x)
    (see the check below).
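
A small check of these discrete formulas; the fair-die pmf is an
assumed example, not from the slides:

```python
# Fair die: P_X(x) = 1/6 for x = 1, ..., 6.
pmf = {x: 1 / 6 for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())    # E[X] = 3.5
ex2 = sum(x**2 * p for x, p in pmf.items())  # E[X^2]
var = ex2 - mean**2                          # E[X^2] - (E[X])^2
print(mean, var)                             # 3.5, about 2.917
```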

35
Discrete random variable
  • A discrete random variable X has a finite number
    of possible values.
  • The probability distribution of X lists the
    values and their probabilities.
  • The probabilities p_k must satisfy:
  • Every probability p_i is a number between 0 and 1.
  • p_1 + p_2 + ... + p_k = 1.
  • Probability histogram
  • Possible values of X and corresponding
    probability.

36
Commonly Used Continuous Distribution
  • The Normal Distribution
  • History
  • Abraham de Moivre (1667-1754) first described the
    normal distribution in 1733.
  • Adolphe Quetelet (1796-1874) used the normal
    distribution to describe the concept of l'homme
    moyen (the average man), thus popularizing the
    notion of the bell-shaped curve.
  • Carl Friedrich Gauss (1777-1855) used the normal
    distribution to describe measurement errors in
    geodesy and astronomy.

37
Bernoulli Processes and the Binomial Distribution
  • An airline reservations switchboard receives
    calls for reservations, and it is found that:
  • When a reservation is made, there is a good
    chance that the caller will actually show up for
    the flight. In other words, there is some
    probability p (say, for now, p = 0.9) that the
    caller will show up and buy the ticket the day of
    departure.
  • Consider a single person making a reservation.
    This particular reservation can either result in
    the person on the flight (a "success") or a
    no-show (a "failure"). Let X (a random variable)
    represent the result of a particular reservation.
    That is, we could assign a value of 1 to X if the
    person shows up for the flight (X = 1), and let
    X = 0 if the person does not. Then, P(X = 0) =
    1 − p and P(X = 1) = p.
  • The airline is not particularly interested in the
    decision made by any one individual, but is more
    concerned with the behavior of the total number
    of people with reservations.
  • Suppose each passenger carried on the plane
    provides a revenue of $100 for the airline, and
    each bumped passenger (a passenger who does not
    find a seat due to overbooking) results in a loss
    of $200 for the airline.
  • If a plane holds 16 people, not including pilots
    and crew, how many reservations should be taken?

38
Bernoulli process
  • This is an example of a Bernoulli process, named
    for the Swiss mathematician James Bernoulli
    (1654-1705).
  • A Bernoulli process is a sequence of n identical
    trials of a random experiment such that each
    trial
  • (1) produces one of two possible complementary
    outcomes, conventionally called success and
    failure, and
  • (2) is independent of any other trial, so that
    the probability of success or failure is constant
    from trial to trial.
  • Note that the success and failure probabilities
    are assumed to be constant from trial to trial,
    but they are not necessarily equal to each other.
  • In our example, the probability of a success is
    0.9 and the probability of a failure is 0.1.
  • The number of successes in a Bernoulli process is
    a binomial random variable.
  • Random Variable: A numerical value determined by
    the outcome of an experiment.

39
Analysis
  • If the airline takes 16 reservations, what is the
    probability that there will be at least one empty
    seat?
  • P(at least one empty seat) = 1 − (0.9)^16 =
    0.815.
  • An 81.5% chance of having at least one empty
    seat! So the airline would be foolish not to
    overbook.
  • Suppose we take 20 reservations for a particular
    flight, and let Y be the number of people who
    show up.
  • Y is a binomial random variable that takes on an
    integer value between 0 and 20.
  • What is the probability function or distribution
    of Y?
  • What is the probability of getting exactly 16
    passengers? Answer: 0.08978
  • P(Y ≤ 16) = 0.133, P(Y = 17) = 0.190,
    P(Y = 18) = 0.285, P(Y = 19) = 0.270,
    P(Y = 20) = 0.122
  • Consider B = the number of people bumped. The
    load is L = Y − B.
  • The airline's total expected revenue (call this
    R; then R = 100L − 200B) is
  • E(R) = E(100L − 200B) = 100E(L) − 200E(B) =
    $1,182.81 (these figures are checked in the
    sketch below).
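
These binomial figures can be reproduced with scipy.stats.binom; a
minimal check:

```python
from scipy.stats import binom

n, p = 20, 0.9  # 20 reservations, each shows up with probability 0.9

print(binom.pmf(16, n, p))  # P(Y = 16), about 0.0898
print(binom.cdf(16, n, p))  # P(Y <= 16), about 0.133
for y in (17, 18, 19, 20):
    print(y, binom.pmf(y, n, p))  # 0.190, 0.285, 0.270, 0.122
```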

40
How many reservations?
  • Reservations   20       19       18       17       16
  • E(Load)        15.943   15.839   15.599   15.132   14.396
  • E(Bumps)        2.057    1.261    0.600    0.167    0.000
  • E(Revenue)     $1,183   $1,332   $1,440   $1,480   $1,440
  • In this case, the best strategy is to take 17
    reservations (the sketch below reproduces this
    table).
  • Expected Value: The expected value (or mean or
    expectation) of a random variable X with
    probability function P(X = x) is
  • E(X) = Σ x·P(X = x)
  • where the summation is over all x that have
    P(X = x) > 0. It is sometimes denoted μ_X or μ.
  • Variance: The variance of a random variable X
    with probability function P(X = x) is
  • Var(X) = Σ (x − E(X))²·P(X = x),
  • where the summation is over all x such that
    P(X = x) > 0. It is sometimes denoted σ²(X) or σ².
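
A sketch that reproduces the table above from the binomial model,
under the slides' stated assumptions (16 seats, $100 revenue per
carried passenger, $200 cost per bumped passenger, p = 0.9); small
differences from the table are rounding:

```python
from scipy.stats import binom

SEATS, FARE, BUMP_COST, P_SHOW = 16, 100, 200, 0.9

# For each reservation level n: load L = min(Y, 16),
# bumps B = max(Y - 16, 0), and E[R] = 100 E[L] - 200 E[B].
for n in (20, 19, 18, 17, 16):
    e_load = sum(min(y, SEATS) * binom.pmf(y, n, P_SHOW)
                 for y in range(n + 1))
    e_bump = sum(max(y - SEATS, 0) * binom.pmf(y, n, P_SHOW)
                 for y in range(n + 1))
    print(n, round(e_load, 3), round(e_bump, 3),
          round(FARE * e_load - BUMP_COST * e_bump))
```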

41
Inference
  • Mean, Proportion, CLT
  • Bootstrap

42
From Probability to Statistics
  • In all our probability calculations, we have
    assumed that we know all the quantities needed to
    solve the problem:
  • To find the expected return and standard
    deviation of a portfolio, we assumed we knew the
    mean and standard deviation of the returns of the
    underlying stocks.
  • To find the proportion of bags below the 8-ounce
    minimum, we assumed we knew the mean and standard
    deviation of the weight of chips in each bag.
  • In practice, these types of parameters are not
    given to us; we must estimate them from data.
  • Statistical analysis usually proceeds along the
    following lines:
  • Postulate a probability model (usually including
    unknown parameters) for a situation involving
    uncertainty; e.g., assume that a certain quantity
    follows a normal distribution.
  • Use data to estimate the unknown parameters in
    the model.
  • Plug the estimated parameters into the model in
    order to make predictions from the model.

43
How do we start?
  • The first step, picking a model, must be based on
    an understanding of the situation to be modeled.
  • Which assumptions are plausible?
  • Which are not?
  • These questions are answered by judgment, not by
    precise statistical techniques.
  • Examples
  • Assume that daily changes in a stock price follow
    a normal distribution.
  • Use historical data to estimate the mean and
    standard deviation.
  • Once we have estimates, we might use the model to
    predict future price ranges or to value an option
    on the stock.
  • Assume that demand for a fashion item is normally
    distributed.
  • Use historical data to estimate the mean and
    standard deviation.
  • Once we have estimates, we might use the model to
    set production levels.

44
How do we get data and make inference?
  • The first step in understanding the process of
    estimation is understanding basic properties of
    sampled data and sample statistics, since these
    are the basis of estimation.
  • When we talk about sampling, it is always in the
    context of a fixed underlying population:
  • If we look at 50 daily changes in IBM stock, we
    are looking at a sample of size 50 from the
    population of all daily changes in IBM stock.
  • If the population is very large (as in these
    examples), we generally treat it as though it
    were infinite; this simplifies matters. Thus, we
    are primarily concerned with finite samples from
    infinite populations.
  • A single sample from a population is a random
    variable. Its distribution is the population
    distribution; e.g.,
  • the distribution of a randomly selected daily
    change in IBM stock is the distribution over all
    daily changes.

45
Random Sample
  • A random sample from a population is a set of
    randomly selected observations from that
    population. If X1, ..., Xn are a random sample,
    then
  • they are independent, and
  • they are identically distributed, all with the
    distribution of the underlying population.
  • A sample statistic is any quantity calculated
    from a random sample. The most familiar example
    of a sample statistic is the sample mean X̄,
    given by
  • X̄ = (X1 + X2 + ... + Xn)/n
  • The sample mean gives an estimate of the
    population mean μ = E[Xi].

46
Distribution of the Sample Mean
  • Every sample statistic is a random variable.
  • Randomness is introduced through the sampling
    mechanism.
  • As noted above, the sample mean of a random
    sample X1, ..., Xn is an estimate of the
    population mean μ = E[Xi].
  • How good an estimate is it?
  • How can we assess the uncertainty in the
    estimate?
  • To answer these questions, we need to examine the
    sampling distribution of the sample mean, that
    is, the distribution of the random variable X̄.
  • Assume that the underlying population is normal
    with mean μ and variance σ².
  • This means that Xi ~ N(μ, σ²) for all i.
  • The Xi's are independent, since we assume we have
    a random sample.
  • The sum of independent normal random variables is
    normally distributed. The usual rules for means
    and variances apply:
  • The expected value of the sum is the sum of the
    expected values.
  • The variance of the sum is the sum of the
    variances (by independence).
  • Any linear transformation of a normal random
    variable is normal; in particular, multiplication
    by a constant preserves normality.

47
Distribution of the Sample Mean
  • Using these two facts, we find that if Xi ~
    N(μ, σ²) for all i, then
  • X1 + X2 + ... + Xn ~ N(nμ, nσ²),
  • and hence X̄ ~ N(μ, σ²/n): the sample mean from a
    normal population has a normal distribution.
  • First consequence:
  • The expected value of the sample mean is the
    population mean: "on average" the sample mean
    correctly estimates the underlying mean.
  • The standard deviation of a sample statistic is
    called its standard error. Thus, we have shown
    that the standard error of the sample mean is
    σ/√n, where σ is the underlying standard
    deviation and n is the sample size.
  • Second consequence:
  • Because the standard error of the sample mean is
    σ/√n, the uncertainty in this estimate decreases
    as the sample size n increases. (That's good.)
  • The uncertainty (as measured by the standard
    deviation) decreases rather slowly: to cut the
    standard deviation in half, we need to collect
    four times as much data, because of the square
    root. (That's not so good, but that's life.)
    A simulation check follows below.
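
A simulation check of both consequences; the standard normal
population and n = 25 are assumptions for the illustration:

```python
import random
import statistics

random.seed(0)
mu, sigma, n = 0.0, 1.0, 25

# Draw many samples of size n; the spread of the sample mean should
# be close to sigma / sqrt(n) = 0.2.
means = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(10_000)]
print(statistics.mean(means))   # near mu = 0
print(statistics.stdev(means))  # near sigma / n**0.5 = 0.2
```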

48
Example
  • Suppose the number of miles driven each week by
    US car owners is normally distributed with a
    standard deviation of σ = 75 miles.
  • Suppose we plan to estimate the population mean
    number of miles driven per week by US car owners
    using a random sample of size n = 100.
  • What is the probability that our estimate will
    differ from the true value by more than 10 miles?
  • Denote the population mean by μ and the sample
    mean by X̄.
  • We need to find P(|X̄ − μ| > 10).
  • By symmetry of the normal distribution, it is
    2·P(X̄ − μ > 10) = 2·P(Z > 10/(σ/√n)) =
    2·P(Z > 1.33) ≈ 0.1836, where Z ~ N(0, 1).

Thus, the probability that our estimate will be off
by more than 10 miles is 18.36% (see the check
below).
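
A quick check of this calculation with scipy.stats.norm (the 18.36%
on the slide comes from the z-table value 1.33; full precision gives
about 18.24%):

```python
from scipy.stats import norm

sigma, n = 75, 100
se = sigma / n**0.5  # standard error: 7.5 miles

# P(|Xbar - mu| > 10) = 2 P(Z > 10 / 7.5)
print(2 * norm.sf(10 / se))  # about 0.1824 (0.1836 with z = 1.33 tables)
```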
  • If the underlying population is not normal, what
    can be done?

49
Central Limit Theorem
  • By the central limit theorem, regardless of the
    underlying population, the distribution of the
    sample mean tends towards N(μ, σ²/n) as n becomes
    large.
  • If we accept the use of this approximation, we
    don't need to assume that the number of miles
    driven per week in the example is normally
    distributed (as long as our sample size n is
    large).
  • We will use this approximation repeatedly to
    assess the error in X̄ as an estimate of μ.
  • How large should n be for the normal
    approximation to be accurate?
  • There is no simple answer (it depends on the
    underlying distribution), but n ≥ 30 is a
    reasonable rule of thumb.
  • If the underlying population is finite of size N,
    and if the sample size n is not a small
    proportion of N, we use the following small-
    sample correction to the standard error:
    multiply σ/√n by √((N − n)/(N − 1)).

50
Sampling Distribution of the Sample Proportion
  • Consider estimating any of the following
    quantities
  • Proportion of voters who will vote for a
    third-party candidate in the next election.
  • Proportion of visits to a web site that result
    in a sale.
  • Proportion of shoppers who prefer crunchy over
    creamy.
  • In each of these examples, we are trying to
    estimate a population proportion. Denote a
    generic population proportion by the symbol p.
  • Estimate a population proportion using a sample
    proportion.
  • For example, if a poll surveys 1000 voters and
    finds that 85 of those surveyed plan to vote for
    a third-party candidate, then the sample
    proportion is 8.5%.
  • The population proportion is what the poll would
    find if it could ask every voter in the
    population.
  • Denote the sample proportion by the symbol p̂.
  • Once we have collected a random sample, the
    sample proportion is known. We use it to
    estimate the true, unknown population proportion
    p.

51
EXAMPLE
  • Suppose that the true, unknown proportion p of
    voters who will vote for a third-party candidate
    in the next election is 9%.
  • What is the probability that a poll of 1000
    voters will find a sample proportion that differs
    from the true proportion by more than 2%?
  • We need to find P(|p̂ − p| > 0.02) ≈
    2·P(Z > 0.02/√(p(1 − p)/n)) = 2·P(Z > 2.21).
  • We conclude that the probability that the poll
    will be off by more than two percentage points is
    0.027 (see the check below).
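
A quick check under the normal approximation:

```python
from scipy.stats import norm

p, n = 0.09, 1000
se = (p * (1 - p) / n) ** 0.5  # standard error of the sample proportion

# P(|phat - p| > 0.02), using the normal approximation
print(round(2 * norm.sf(0.02 / se), 3))  # about 0.027
```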

52
Bootstrap
  • As a general term, bootstrapping describes any
    operation which allows a system to generate
    itself from its own small, well-defined subsets
    (e.g., compilers, software to read tapes written
    in computer-independent form).
  • The word is borrowed from the saying "pull
    yourself up by your own bootstraps."
  • In statistics, the bootstrap is a method allowing
    one to judge the uncertainty of estimators
    obtained from small samples, without prior
    assumptions about the underlying probability
    distributions.
  • The method consists of forming many new samples
    of the same size as the observed sample, by
    drawing a random selection of the original
    observations with replacement, i.e., usually
    including some of the observations several times.
  • The estimator under study (e.g., a mean,
    a correlation coefficient) is then formed for
    every one of the samples thus generated, and will
    show a probability distribution of its own.
  • From this distribution, confidence limits can be
    given (see the sketch below).
  • For details, see B. Efron, "Computers and the
    Theory of Statistics," SIAM Review 21 (1979) 460,
    or B. Efron, The Jackknife, the Bootstrap and
    Other Resampling Plans, SIAM, Bristol, 1982.
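
A minimal bootstrap sketch, reusing the mall-spending sample from
slide 16 as illustrative data:

```python
import random
import statistics

random.seed(0)
sample = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]  # mall data

# Form many new samples of the same size by drawing observations with
# replacement, and record the estimator (here, the mean) for each.
boot_means = sorted(
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(10_000)
)

# Percentile confidence limits from the bootstrap distribution.
lo, hi = boot_means[249], boot_means[9749]  # central 95%
print(statistics.mean(sample), (lo, hi))
```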

53
Jackknife
  • The jackknife is a method in statistics allowing
    one to judge the uncertainties of estimators
    derived from small samples, without assumptions
    about the underlying probability distributions.
  • The method consists of forming new samples by
    omitting, in turn, one of the observations of the
    original sample.
  • For each of the samples thus generated, the
    estimator under study can be calculated, and the
    probability distribution thus obtained will allow
    one to draw conclusions about the estimator's
    sensitivity to individual observations (see the
    sketch below).
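
A minimal jackknife sketch, again reusing the mall-spending data for
illustration:

```python
import statistics

sample = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]  # mall data

# Form new samples by omitting, in turn, one observation, and recompute
# the estimator (here, the mean) on each leave-one-out sample.
jack_means = [statistics.mean(sample[:i] + sample[i + 1:])
              for i in range(len(sample))]

# Large swings between leave-one-out means flag influential observations.
for x, m in zip(sample, jack_means):
    print(f"omit {x:>4}: mean {m:7.2f}")
```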