Data Mining (Prof. Juran's lecture note 1, Columbia University)

Transcript and Presenter's Notes

1
Data Mining
  • References
  • U.S. News and World Report's Business
    Technology section, 12/21/98, by William J.
    Holstein
  • Prof. Juran's lecture note 1 (at Columbia
    University)
  • J.H. Friedman (1999), Data Mining and Statistics.
    Technical report, Dept. of Statistics, Stanford
    University

2
Main Goal
  • Study statistical tools useful in managerial
    decision making.
  • Most management problems involve some degree of
    uncertainty.
  • People have poor intuitive judgment of
    uncertainty.
  • IT revolution... abundance of available
    quantitative information
  • data mining: large databases of info, ...
  • market segmentation and targeting
  • stock market data
  • almost anything else you may want to know...
  • What conclusions can you draw from your data?
  • How much data do you need to support your
    conclusions?

3
Applications in Management
  • Operations management
  • e.g., model uncertainty in demand, production
    function...
  • Decision models
  • portfolio optimization, simulation, simulation
    based optimization...
  • Capital markets
    understand risk, hedging, portfolios, betas...
  • Derivatives, options, ...
  • it is all about modeling uncertainty
  • Operations and information technology
  • dynamic pricing, revenue management, auction
    design, ...
  • Data mining... many applications

4
Portfolio Selection
  • You want to select a stock portfolio of companies
    A, B, C, ...
  • Information: annual returns (by year) for each stock
      Stock A: 10, 14, 13, 27, ...
      Stock B: 16, 27, 42, 23, ...
  • Questions
  • How do we measure the volatility of each stock?
  • How do we quantify the risk associated with a
    given portfolio?
  • What is the tradeoff between risk and returns?
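
  A minimal Python sketch of these questions (the 50/50 weights and the
  use of sample standard deviation as the volatility measure are
  assumptions for illustration, not part of the slide):

      import statistics

      # Annual returns (%) by year, from the table above
      returns_a = [10, 14, 13, 27]
      returns_b = [16, 27, 42, 23]

      # Volatility of each stock = sample standard deviation of its returns
      vol_a = statistics.stdev(returns_a)
      vol_b = statistics.stdev(returns_b)

      # Risk of a 50/50 portfolio:
      # var(wA*A + wB*B) = wA^2 var(A) + wB^2 var(B) + 2 wA wB cov(A, B)
      w_a = w_b = 0.5
      cov_ab = statistics.covariance(returns_a, returns_b)  # Python 3.10+
      port_var = (w_a**2 * statistics.variance(returns_a)
                  + w_b**2 * statistics.variance(returns_b)
                  + 2 * w_a * w_b * cov_ab)
      print(vol_a, vol_b, port_var ** 0.5)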

5
(No Transcript)
6
Currency Value (Relative to Jan. 2, 1998)
7
Introduction
  • Premise: all business becomes information-driven.
  • The concept of Data Mining is becoming
    increasingly popular as a business information
    management tool where it is expected to reveal
    knowledge structures that can guide decisions in
    conditions of limited certainty.
  • Competitiveness: how do you collect and exploit
    information to your advantage?
  • The challenges
  • Most corporate data systems are not ready.
  • Can they share information?
  • What is the quality of the information going in?
  • Most data techniques come from the empirical
    sciences; the world is not a laboratory.
  • Cutting through vendor hype, info-topia.
  • Defining good metrics; abandoning gut rules of
    thumb may be too "risky" for the manager.
  • Communicating success, setting the right
    expectations.

8
Wal-Mart
  • U.S. News and World Report's Business
    Technology section, 12/21/98, by William J.
    Holstein
  • Data-Crunching Santa
  • Wal-Mart knows what you bought last Christmas
  • Wal-Mart is expected to finish the year with $135
    billion in sales, up from $118 billion last year.
  • It hurts department stores such as Sears, J. C.
    Penney, and Federated's Macy's and Bloomingdale's
    units, which have been slower to link all their
    operations from stores directly to manufacturers.
  • For example, Sears stocked too many winter coats
    this season and was surprised by warmer than
    average weather.
  • The field of business analytics has improved
    significantly over the past few years, giving
    business users better insights, particularly from
    operational data stored in transactional systems;
    Wal-Mart uses business analytics in its everyday
    activities.
  • Analytics are now routinely used in sales,
    marketing, supply chain optimization, and fraud
    detection.

9
A visualization of a Naive Bayes model for
predicting who in the U.S. earns more than
$50,000 in yearly salary. The higher the bar,
the greater the evidence that a person with
this attribute value earns a high salary.
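
  The figure itself is not reproduced in this transcript. As a hedged
  illustration of the kind of model described (the records, features,
  and labels below are invented, not the census data behind the figure):

      from sklearn.naive_bayes import CategoricalNB
      from sklearn.preprocessing import OrdinalEncoder

      # Invented census-style records: [education, occupation, marital status]
      X_raw = [["Bachelors", "Exec", "Married"],
               ["HS-grad", "Service", "Single"],
               ["Masters", "Prof", "Married"],
               ["HS-grad", "Manual", "Single"]]
      y = [1, 0, 1, 0]  # 1 = earns more than $50,000 per year

      enc = OrdinalEncoder()
      X = enc.fit_transform(X_raw)

      model = CategoricalNB().fit(X, y)
      new = enc.transform([["Masters", "Exec", "Married"]])
      print(model.predict_proba(new))  # evidence for low vs. high salary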
10
Telecommunications
  • Data mining flourishes in telecommunications due
    to the availability of vast quantities of
    high-quality data.
  • A significant stream of it consists of call
    records collected at network switches; used
    primarily for billing, it enables data mining
    applications in toll-fraud detection and consumer
    marketing.
  • The best-known marketing application of data
    mining, albeit via unconfirmed anecdote, concerns
    MCI's Friends & Family promotion launched in
    the domestic U.S. market in 1991.
  • As the anecdote goes, market researchers observed
    relatively small subgraphs in this long-distance
    phone company's large call-graph of network
    activity.
  • This revealed the promising strategy of adding
    entire calling circles to the company's
    subscriber base, rather than the traditional and
    costly approach of seeking individual customers
    one at a time. Indeed, MCI increased its domestic
    U.S. market share in the succeeding years by
    exploiting the viral capabilities of calling
    circles: one infected member causes others to
    become infected.
  • Interestingly, the plan was abandoned some years
    later (not available since 1997), possibly
    because the virus had run its course but more
    likely due to other competitive forces.

11
Telecommunications
  • In toll-fraud detection, data mining has been
    instrumental in completely changing the landscape
    for how anomalous behaviors are detected.
  • Nearly all fraud detection systems in the
    telecommunications industry 10 years ago were
    based on global threshold models.
  • They can be expressed as rule sets of the form:
    "If a customer makes more than X calls per hour
    to country Y, then apply treatment Z."
  • The placeholders X, Y, and Z are parameters of
    these rule sets, applied to all customers.
  • Given the range of telecommunication customers,
    blanket application of these rules produces many
    false positives.
  • Data mining methods for customized monitoring of
    land and mobile phone lines were subsequently
    developed by leading service providers, including
    AT&T, MCI, and Verizon, whereby each customer's
    historic calling patterns are used as a baseline
    against which all new calls are compared.
  • For customers routinely calling country Y more
    than X times a day, such alerts would be
    suppressed, but if they ventured to call a
    different country Y', an alert might be generated
    (a small sketch of this per-customer baseline
    follows).
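
  A minimal sketch of this customized monitoring (the "mean plus three
  standard deviations" rule and the call counts are assumptions, not the
  providers' actual algorithms):

      import statistics

      def alert(history_calls_per_day, todays_calls, k=3.0):
          """Flag today's volume only if it exceeds this customer's own
          baseline (historical mean plus k standard deviations)."""
          mu = statistics.mean(history_calls_per_day)
          sd = statistics.stdev(history_calls_per_day)
          return todays_calls > mu + k * sd

      # A customer who routinely makes ~20 such calls a day is not flagged...
      print(alert([15, 25, 20, 18, 22], 25))   # False
      # ...but the same 25 calls are flagged for a customer who rarely calls.
      print(alert([0, 1, 0, 2, 1], 25))        # True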

12
Risk management and targeted marketing
  • Insurance and direct mail are two industries that
    rely on data analysis to make profitable business
    decisions.
  • Insurers must be able to accurately assess the
    risks posed by their policyholders to set
    insurance premiums at competitive levels.
  • For example, overcharging low-risk policyholders
    would motivate them to seek lower premiums
    elsewhere; undercharging high-risk policyholders
    would attract more of them due to the lower
    premiums.
  • In either case, costs would increase and profits
    inevitably decrease.
  • Effective data analysis leading to the creation
    of accurate predictive models is essential for
    addressing these issues.
  • In direct-mail targeted marketing, retailers must
    be able to identify subsets of the population
    likely to respond to promotions in order to
    offset mailing and printing costs.
  • Profits are maximized by mailing only to those
    potential customers most likely to generate net
    income for a retailer in excess of the retailer's
    mailing and printing costs (a short expected-value
    sketch follows).
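
  The mailing decision reduces to an expected-value comparison; a short
  sketch with invented figures (response probability, margin, and unit
  mailing cost are assumptions):

      def worth_mailing(p_response, expected_margin, mailing_cost):
          """Mail only if expected net income exceeds printing/mailing cost."""
          return p_response * expected_margin > mailing_cost

      print(worth_mailing(0.02, 40.0, 0.60))   # True:  0.80 expected > 0.60 cost
      print(worth_mailing(0.005, 40.0, 0.60))  # False: 0.20 expected < 0.60 cost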

13
Medical applications (diabetic screening)
  • Preprocessing and postprocessing steps are often
    the most critical elements determining the
    effectiveness of real-life data-mining
    applications, as illustrated by the following
    recent medical application in diabetic patient
    screening.
  • In the 1990s in Singapore, about 10% of the
    population was diabetic, a disease with many side
    effects, including increased risk of eye disease,
    kidney failure, and other complications.
  • However, early detection and proper care
    management can make a difference in the health
    and longevity of individual sufferers.
  • To combat the disease, the government of
    Singapore introduced a regular screening program
    for diabetic patients in its public hospitals in
    1992.
  • Patient information (clinical symptoms,
    eye-disease diagnoses, treatments, and other
    details) was captured in a database maintained
    by government medical authorities.
  • After almost 10 years of collecting data, a
    wealth of medical information is available. This
    vast store of historical data leads naturally to
    the application of data mining techniques to
    discover interesting patterns.
  • The objective is to find rules physicians can use
    to understand more about diabetes and how it
    might be associated with different segments of
    the population.

14
Christmas Season Georgia Stores
  • Store at Decatur (just east of Atlanta)
  • A black middle-income community
  • Decoration display: African-American angels and
    ethnic Santas aplenty
  • Music section: promoting seasonal disks like
    "Christmas on Death Row," which features rapper
    Snoop Doggy Dogg
  • Toy department: a large selection of
    brown-skinned dolls
  • Store at Dunwoody (20 miles away from Decatur)
  • An affluent, mostly white suburb (north of
    Atlanta)
  • Music section: showcasing Christmas tunes by
    country superstar Garth Brooks
  • Toy department: a few expensive toys that aren't
    available in the Decatur store. Out of the
    hundreds of dolls in stock, only two have brown
    skin.
  • How to determine the kinds of products that are
    carried by various Wal-Marts across the land?

15
Wal-Mart system
  • Every item in the store has a laser bar code, so
    when customers pay for their purchases, a scanner
    captures information about
  • what is selling on what day of the week and at
    what price.
  • The scanner also records what other products were
    in each shopper's basket.
  • Wal-Mart analyzes what is in the shopping cart
    itself.
  • The combination of what's in a purchaser's cart
    gives a good indication of the age of that
    consumer and of their preferences in terms of
    ethnic background.
  • Wal-Mart combines the in-store data with
    information about the demographics of communities
    around each store.
  • The end result is surprisingly different
    personalities for Wal-Marts.
  • It also helps Wal-Mart figure out how to place
    goods on the floor to get what retailers call
    "affinity sales," or sales of related products.

16
Wal-Mart system (Cont.)
  • One big strength of the system is that about
    5,000 manufacturers are tied into it through the
    company's Retail Link program, which they access
    via the Internet.
  • Pepsi, Disney, or Mattel, for example, can tap
    into Wal-Mart's data warehouse to see how well
    each product is selling at each Wal-Mart.
  • They can look at how things are selling in
    individual areas and make decisions about
    categories where there may be an opportunity to
    expand.
  • That tight information link helps Wal-Mart work
    with its suppliers to replenish stock of products
    that are selling well and to quickly pull those
    that aren't.

17
Data Mining and Statistics
  • Data Mining is used to discover patterns and
    relationships in data, with an emphasis on large
    observational databases.
  • It sits at the common frontiers of several fields
    including Data Base Management, Artificial
    Intelligence, Machine Learning, Pattern
    Recognition and Data Visualization.
  • From a statistical perspective it can be viewed
    as computer automated exploratory data analysis
    of large complex data sets.
  • Many organizations have large transaction-oriented
    databases used for inventory, billing, accounting,
    etc. These databases were very expensive to create
    and are costly to maintain. For a relatively small
    additional investment, DM tools offer to discover
    highly profitable nuggets of information hidden in
    these data.
  • Data, especially in large amounts, reside in
    database management systems (DBMS).
  • Conventional DBMS are focused on online
    transaction processing (OLTP), that is, the
    storage and fast retrieval of individual records
    for purposes of data organization. They are used
    to keep track of inventory, payroll records,
    billing records, invoices, etc.

18
Data Mining Techniques
  • Data Mining is an analytic process designed to
  • explore data (usually large amounts of -
    typically business or market related - data) in
    search of consistent patterns and/or systematic
    relationships between variables, and then
  • to validate the findings by applying the detected
    patterns to new subsets of data.
  • The ultimate goal of data mining is prediction -
    and predictive data mining is the most common
    type of data mining and the one with the most
    direct business applications.
  • The process of data mining consists of three
    stages:
  • the initial exploration,
  • model building or pattern identification with
    validation and verification, and it is concluded
    with
  • deployment (i.e., the application of the model to
    new data in order to generate predictions).
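
  A compact sketch of the three stages on a synthetic dataset
  (scikit-learn assumed available; the data and model choice are
  illustrative, not prescribed by the slide):

      from sklearn.datasets import make_classification
      from sklearn.model_selection import train_test_split
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score

      # Stage 1 - exploration: inspect and prepare the data (synthetic stand-in)
      X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

      # Stage 2 - model building and validation on held-out data
      model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
      print("validation accuracy:", accuracy_score(y_te, model.predict(X_te)))

      # Stage 3 - deployment: apply the fitted model to new records
      print("predictions for new data:", model.predict(X_te[:5]))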

19
Stage 1 Exploration
  • It usually starts with data preparation which may
    involve cleaning data, data transformations,
    selecting subsets of records and - in case of
    data sets with large numbers of variables
    ("fields") - performing some preliminary feature
    selection operations to bring the number of
    variables to a manageable range (depending on the
    statistical methods which are being considered).
  • Depending on the nature of the analytic problem,
    this first stage of the data mining process
    may involve anything from a simple choice of
    straightforward predictors for a regression
    model to elaborate exploratory analyses using a
    wide variety of graphical and statistical methods
    in order to identify the most relevant variables
    and determine the complexity and/or the general
    nature of models that can be taken into account
    in the next stage.
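
  One possible preliminary feature-selection step of the kind mentioned
  above (univariate screening with scikit-learn; the synthetic data and
  the choice of k are assumptions):

      from sklearn.datasets import make_classification
      from sklearn.feature_selection import SelectKBest, f_classif

      # Many candidate "fields"; keep only the 5 that score highest
      X, y = make_classification(n_samples=500, n_features=50,
                                 n_informative=5, random_state=0)
      selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
      X_reduced = selector.transform(X)
      print(X.shape, "->", X_reduced.shape)          # (500, 50) -> (500, 5)
      print("kept columns:", selector.get_support(indices=True))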

20
Stage 2 Model building and validation
  • This stage involves considering various models
    and choosing the best one based on their
    predictive performance:
  • explaining the variability in question, and
  • producing stable results across samples.
  • How do we achieve these goals?
  • This may sound like a simple operation, but in
    fact it sometimes involves a very elaborate
    process.
  • One approach is "competitive evaluation of
    models," that is, applying different models to
    the same data set and then comparing their
    performance to choose the best.
  • These techniques - which are often considered the
    core of predictive data mining - include Bagging
    (Voting, Averaging), Boosting, Stacking (Stacked
    Generalizations), and Meta-Learning.
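
  A hedged sketch of "competitive evaluation of models," including one
  of the ensemble techniques named above (bagging); the learners, data,
  and scoring are illustrative choices, not the slide's prescription:

      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

      X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

      # Apply different models to the same data set and compare performance
      candidates = {
          "single tree": DecisionTreeClassifier(random_state=0),
          "bagging (voting/averaging)": BaggingClassifier(
              DecisionTreeClassifier(), n_estimators=50, random_state=0),
          "boosting": GradientBoostingClassifier(random_state=0),
      }
      for name, clf in candidates.items():
          print(name, cross_val_score(clf, X, y, cv=5).mean())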

21
Models for Data Mining
  • In the business environment, complex data mining
    projects may require the coordinated efforts of
    various experts, stakeholders, or departments
    throughout an entire organization.
  • In the data mining literature, various "general
    frameworks" have been proposed to serve as
    blueprints for how to organize the process of
    gathering data, analyzing data, disseminating
    results, implementing results, and monitoring
    improvements.
  • CRISP (Cross-Industry Standard Process for data
    mining) was proposed in the mid-1990s by a
    European consortium of companies to serve as a
    non-proprietary standard process model for data
    mining.
  • The Six Sigma methodology is a well-structured,
    data-driven methodology for eliminating defects,
    waste, or quality-control problems of all kinds
    in manufacturing, service delivery, management,
    and other business activities.

22
CRISP
  • This general approach postulates the following
    (perhaps not particularly controversial) general
    sequence of steps for data mining projects:
    business understanding, data understanding, data
    preparation, modeling, evaluation, and deployment.

23
Six Sigma
  • This model has recently become very popular (due
    to its successful implementations) in various
    American industries, and it appears to be gaining
    favor worldwide. It postulates a sequence of
    so-called DMAIC steps:
  • The categories of activities: Define (D), Measure
    (M), Analyze (A), Improve (I), Control (C).
  • It postulates the following general sequence of
    steps for data mining projects:
  • Define (D) → Measure (M) → Analyze (A)
    → Improve (I) → Control (C)
  • It grew out of the manufacturing, quality
    improvement, and process control traditions and
    is particularly well suited to production
    environments (including "production of services,"
    i.e., service industries).
  • Define. It is concerned with the definition of
    project goals and boundaries, and the
    identification of issues that need to be
    addressed to achieve the higher sigma level.
  • Measure. The goal of this phase is to gather
    information about the current situation, to
    obtain baseline data on current process
    performance, and to identify problem areas.
  • Analyze. The goal of this phase is to identify
    the root cause(s) of quality problems, and to
    confirm those causes using the appropriate data
    analysis tools.
  • Improve. The goal of this phase is to implement
    solutions that address the problems (root causes)
    identified during the previous (Analyze) phase.
  • Control. The goal of the Control phase is to
    evaluate and monitor the results of the previous
    phase (Improve).

24
Six Sigma Process
  • A six sigma process is one that can be expected
    to produce only 3.4 defects per one million
    opportunities.
  • The concept of the six sigma process is important
    in Six Sigma quality improvement programs.
  • The term Six Sigma derives from the goal of
    achieving process variation small enough that 6
    sigma (sigma being the estimate of the population
    standard deviation) will "fit" inside the lower
    and upper specification limits for the process.
  • In that case, even if the process mean shifts by
    1.5 sigma in one direction (e.g., by 1.5 sigma
    toward the upper specification limit), the
    process will still produce very few defects.
  • For example, suppose we expressed the area above
    the upper specification limit in terms of one
    million opportunities to produce defects. The
    6-sigma process shifted upward by 1.5 sigma
    will produce only 3.4 defects (i.e., "parts" or
    "cases" greater than the upper specification
    limit) per one million opportunities.
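
  The 3.4 figure can be checked directly from the normal distribution:
  with the mean shifted 1.5 sigma toward a limit that is 6 sigma away,
  the limit sits 4.5 sigma above the shifted mean (SciPy assumed
  available):

      from scipy.stats import norm

      p_defect = norm.sf(4.5)        # one-sided tail probability, P(Z > 4.5)
      print(p_defect * 1_000_000)    # about 3.4 defects per million opportunities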

25
A statistician's remarks on DM paradigms
  • The DM community may have to moderate its romance
    with "big."
  • A prevailing attitude seems to be that unless an
    analysis involves gigabytes or terabytes of data,
    it cannot possibly be worthwhile.
  • It seems to be a requirement that all of the data
    that has been collected must be used in every
    aspect of the analysis.
  • Sophisticated procedures that cannot
    simultaneously handle data sets of such size are
    not considered relevant to DM.
  • Most DM applications routinely require data sets
    that are considerably larger than those that have
    been addressed by traditional statistical
    procedures (kilobytes).
  • It is often the case that the questions being
    asked of the data can be answered to sufficient
    accuracy with less than the entire giga- or
    terabyte database.
  • Sampling methodology, which has a long tradition
    in statistics, can profitably be used to improve
    accuracy while mitigating computational
    requirements.
  • Also, a powerful, computationally intense procedure
    operating on a subsample of the data may in fact
    provide superior accuracy to that of a less
    sophisticated one using the entire database.

26
Sampling
  • Objective: determine the average amount of money
    spent in the Central Mall.
  • Sampling: a Central City official randomly
    samples 12 people as they exit the mall.
  • He asks them the amount of money spent and
    records the data.
  • Data for the 12 people (amount spent, in dollars):

      Person  Spent     Person  Spent     Person  Spent
      1       132       5       123       9       449
      2       334       6       5         10      133
      3       33        7       6         11      44
      4       10        8       14        12      1
  • The official is trying to estimate mean and
    variance of the population based on a sample of
    12 data points.
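
  A quick check of these estimates with the Python standard library:

      import statistics

      spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]  # dollars

      print(statistics.mean(spent))      # 107.0
      print(statistics.variance(spent))  # 20854.0 (sample variance, n - 1 divisor)
      print(statistics.stdev(spent))     # about 144.4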

27
Population versus Sample
  • A population is usually a group we want to know
    something about
  • all potential customers, all eligible voters, all
    the products coming off an assembly line, all
    items in inventory, etc....
  • Finite population u1, u2, ... , uN versus
    Infinite population
  • A population parameter is a number (theta) relevant
    to the population that is of interest to us
  • the proportion (in the population) that would buy
    a product, the proportion of eligible voters who
    will vote for a candidate, the average number of
    M&M's in a pack....
  • A sample is a subset of the population that we
    actually do know about (by taking measurements of
    some kind)
  • a group who fill out a survey, a group of voters
    that are polled, a number of randomly chosen
    items off the line....
  • x1, x2, ... , xn
  • A sample statistic g(x1, x2, ... , xn) is often
    the only practical estimate of a population
    parameter.
  • We will use g(x1, x2, ... , xn) as a proxy for
    theta, but remember the difference.

28
Average Amount of Money spent in the Central Mall
  • A sample: (x1, x2, ... , xn)
  • Its mean is the sum of the values divided by
    the number of observations.
  • The sample mean, the sample variance, and the
    sample standard deviation are 107, 20,854, and
    144.40, respectively.
  • The official can claim that, on average, $107 is
    spent per shopper, with a standard deviation of
    $144.40.
  • Why can we claim so?

29
  • The variance s2 of a set of observations is the
    average of the squares of the deviations of the
    observations from their mean.
  • The standard deviation s is the square root of
    the variance s2 .
  • How far are the observations from the mean? s2
    and s will be
  • large if the observations are widely spread about
    their mean,
  • small if they are all close to the mean.
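
  Written out explicitly (using the n - 1 divisor, as in the mall
  example above):

      s^2 = [ (x1 - xbar)^2 + (x2 - xbar)^2 + ... + (xn - xbar)^2 ] / (n - 1)
      s   = sqrt(s^2)

  (The slide's phrase "the average of the squares of the deviations"
  corresponds to dividing by n; the mall calculation divides by n - 1.)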

30
Stock Market Indexes
  • A stock market index is a statistical measure that
    shows how the prices of a group of stocks change
    over time.
  • Price-Weighted Index: DJIA
  • Market-Value-Weighted Index: Standard and Poor's
    500 Composite Index
  • Equally Weighted Index: Wilshire 5000 Equity
    Index
  • Price-Weighted Index: it shows the change in the
    average price of the stocks that are included in
    the index.
  • Price per share in the current period: P0; price
    per share in the next period: P1.
  • Number of shares outstanding in the current
    period: Q0; in the next period: Q1.

31
DJIA
  • Dow Jones industrial average (DJIA)
  • Charles Dow first concocted his 12-stock
    industrial average in 1896 (expanding to 30 in
    1928)
  • Original: it is an arithmetic average of the
    thirty stock prices that make up the index.
  • DJIA(0) = (P0,1 + P0,2 + ... + P0,30) / 30,
    DJIA(1) = (P1,1 + P1,2 + ... + P1,30) / 30
  • Current: it is adjusted for stock splits and the
    issuance of stock dividends.
  • DJIA(0) = (P0,1 + P0,2 + ... + P0,30) / AD0,
    DJIA(1) = (P1,1 + P1,2 + ... + P1,30) / AD1,
  • where AD1 is the appropriate (adjusted) divisor.
  • How do we adjust AD1 to account for stock splits,
    adding new stocks,...?
  • The adjustment process is designed to keep the
    index value the same as it would have been if the
    split had not occurred.
  • Suppose stock X30 splits 2-for-1, so its price
    drops from 100 to 50. Then change the divisor c
    to c' such that
    (X1 + X2 + ... + 100) / c = (X1 + X2 + ... + 50) / c'
  • The change to c' < c keeps the index constant
    before and after the split.
  • How about when new stocks are added and others
    are removed?
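
  A small numeric sketch of the divisor adjustment (the three-stock
  "mini-Dow" and its prices are invented; the mechanics follow the
  equation above):

      prices_before = [40, 60, 100]   # third stock about to split 2-for-1
      prices_after = [40, 60, 50]

      c = 3.0                                   # old divisor
      index_before = sum(prices_before) / c     # 66.67

      # choose c' so the index is unchanged by the split
      c_new = sum(prices_after) / index_before  # 2.25 < c
      print(c_new, sum(prices_after) / c_new)   # index still 66.67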

32
DJIA
  • How each stock in the Dow performed during the
    period when the Dow rose 100 percent (from its
    close above 5,000 on Nov. 21, 1995 until it
    closed above 10,000 on March 29, 1999).
  • Companies not in the Dow when it crossed 5,000.
  • Adjusted for spinoffs. Does not reflect
    performance of stocks spun off to shareholders.
    Company               Weight in the Dow (%)   Change in Price (%)
    Alcoa                        1.9                      52
    AlliedSignal                 2.3                     129
    Amer. Express                5.5                     185
    AT&T                         3.6                      87
    Boeing                       1.5                      -5
    Caterpillar                  2.1                      59
    Chevron                      4.0                      77
    Citigroup                    2.8                     262
    Coca-Cola                    3.0                      69
    Du Pont                      2.5                      76
    Eastman Kodak                2.9                      -6

33
DJIA
    Company               Weight in the Dow (%)   Change in Price (%)
    Exxon                        3.2                      83
    General Electric             5.3                     232
    General Motors               3.9                      89
    Goodyear                     2.2                      23
    Hewlett-Packard              3.1                      66
    I.B.M.                       1.9                     276
    International Paper          2.0                      24
    J. P. Morgan                 5.0                      63
    Johnson & Johnson            4.2                     120
    McDonald's                   2.0                     102
    Merck                        3.6                     175
    Minnesota Mining             3.2                      15
    Philip Morris                1.8                      37
    Procter & Gamble             4.5                     134
    Sears, Roebuck               2.1                      18
    Union Carbide                2.1                      19
    United Technologies          6.0                     196
    Wal-Mart                     4.2                     288

34
S&P 500
  • The S&P 500, which started in 1957, weights
    stocks on the basis of their total market value.
  • (For contrast, recall the price-weighted case:
    suppose X30 splits 2-for-1, from 100 to 50; the
    divisor c must change to c' such that
    (X1 + X2 + ... + 100) / c = (X1 + X2 + ... + 50) / c',
    with c' < c, to keep the index constant before
    and after the split.)
  • How about when new stocks are added and others
    are removed?
  • The S&P 500 is computed by
  • S&P 500 = (w1 X1 + w2 X2 + ... + w500 X500) / c
  • where Xi = price of the ith stock and wi = number
    of shares outstanding of the ith stock.
  • What happens when a stock splits?
  • It is a weighted average.
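
  A hedged sketch of the market-value-weighted calculation (three
  made-up stocks stand in for 500); note that a 2-for-1 split doubles
  w and halves X, so w*X and the index are unchanged:

      shares = [1000, 500, 200]     # w_i: shares outstanding (invented)
      prices = [10.0, 40.0, 100.0]  # X_i: price per share (invented)
      c = 100.0                     # divisor chosen to set the base level

      index = sum(w * x for w, x in zip(shares, prices)) / c
      print(index)                  # 500.0

      shares[2], prices[2] = shares[2] * 2, prices[2] / 2    # 2-for-1 split
      print(sum(w * x for w, x in zip(shares, prices)) / c)  # still 500.0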

35
Sample vs Population
  • For both problems, we try to infer properties of
    a large group (population) by analyzing a small
    subgroup (the sample).
  • The population is the group we are trying to
    analyze, e.g., all eligible voters.
  • A sample is a subset of the total population that
    we have observed or collected data from, e.g.,
    the voters that are actually polled.
  • How to draw a sample which can be used to make
    statements about the population?
  • The sample must be representative of the
    population.
  • Sampling is the way to obtain reliable
    information in a cost-effective way (why not a
    census?).

36
Issues in sampling
  • Representativeness
  • Interviewer discretion
  • Respondent discretion - non-response
  • Key question: is the reason for non-response
    related to the attribute you are trying to
    measure? Examples: illegal aliens and the Census;
    start-up companies not in the phone book; a
    library exit survey.
  • Good samples
  • Good samples are probability samples: each unit in
    the population has a known probability of being
    in the sample.
  • Simplest case: an equal-probability sample, where
    each unit has the same chance of being in the
    sample.

37
Utopian Sample for Analysis
  • You have a complete and accurate list of ALL the
    units in the target population (sampling frame)
  • From this you draw an equal probability sample
    (generate a list of random numbers)
  • Reality check: incomplete frame, impossible
    frame, practical constraints on the simple random
    sample (cost and time of sampling)
  • Precision considerations
  • How large a sample do I need?
  • Focus on the confidence interval: choose a
    coverage rate (90%, 95%, 99%) and a margin of
    error (half the width). Typically, width is
    traded off against coverage rate.
  • Simple rule of thumb for a population proportion:
    if it's a 95% CI, then use n = 1 / (margin of
    error)^2.
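
  The rule of thumb, worked through for an invented 3-point margin:

      margin = 0.03            # want the estimate within +/- 3 percentage points
      n = 1 / margin ** 2
      print(round(n))          # about 1,111 respondents at 95% confidence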

38
Data Analysis
  • Statistical Thinking is understanding variation
    and how to deal with it.
  • Move as far as possible to the right on this
    continuum:
  • Ignorance → Uncertainty → Risk → Certainty
  • Information science: learning from data
  • Probabilistic inference based on mathematics
  • What is Statistics?
  • What is the connection, if any, with fields
    including Data Base Management, Artificial
    Intelligence, Machine Learning, Pattern
    Recognition, and Data Visualization?