Business System Analysis - PowerPoint PPT Presentation

Business System Analysis

Slides: 58
Provided by: peggyca6

Transcript and Presenter's Notes

1
Business System Analysis Decision Making Data
Mining and Web Mining
  • Zhangxi Lin
  • ISQS 5340
  • Summer II 2006

2
Outline
  • Introduction to data mining and text mining
  • Constructing a decision tree using SAS Enterprise
    Miner
  • Web mining

3
Data Mining and Text Mining
4
Review - Decision Tree (1)
Root: Total 10, Accept 4, Reject 6, Accuracy 40%, Coverage 100%
Split on Gender:
  • Female: Total 5, Accept 3, Reject 2, Accuracy 60%, Coverage 75%
    Split on Credit Card Insurance:
      • Yes: Total 2, Accept 2, Reject 0, Accuracy 100%, Coverage 50%
      • No: Total 3, Accept 1, Reject 2, Accuracy 33.3%, Coverage 25%
  • Male: Total 5, Accept 1, Reject 4, Accuracy 20%, Coverage 25%
5
Review - Decision Tree (2)
Root: Total 10, Accept 4, Reject 6, Accuracy 40%, Coverage 100%
Split on Credit Card Insurance:
  • Yes: Total 4, Accept 3, Reject 1, Accuracy 75%, Coverage 75%
    Split on Gender:
      • Female: Total 2, Accept 2, Reject 0, Accuracy 100%, Coverage 50%
      • Male: Total 2, Accept 1, Reject 1, Accuracy 50%, Coverage 25%
  • No: Total 6, Accept 1, Reject 5, Accuracy 16.7%, Coverage 25%
How does this decision tree differ from the
previous one?
6
Confusion Matrix (Rule: Gender = Female)
               Computed Accept  Computed Reject
Actual Accept  3                1
Actual Reject  2                4
Total          5                5
Coverage = 3 / (3 + 1) = 0.75
Accuracy = 3 / (2 + 3) = 0.6
7
Confusion Matrix (Rule: Credit Promotion = Yes)
               Computed Accept  Computed Reject
Actual Accept  3                1
Actual Reject  1                5
Total          4                6
Coverage = 3 / (3 + 1) = 0.75
Accuracy = 3 / (1 + 3) = 0.75
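
The accuracy and coverage figures in the two confusion matrices above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `rule_metrics` and its parameter names are illustrative choices, not part of the original slides.

```python
def rule_metrics(true_accept, false_accept, missed_accept):
    """Return (accuracy, coverage) of an 'accept' rule as defined on the slides.

    true_accept   - actual accepts that the rule also labels accept
    false_accept  - actual rejects that the rule labels accept
    missed_accept - actual accepts that the rule labels reject
    """
    # Accuracy: share of computed accepts that are actual accepts.
    accuracy = true_accept / (true_accept + false_accept)
    # Coverage: share of actual accepts that the rule captures.
    coverage = true_accept / (true_accept + missed_accept)
    return accuracy, coverage


gender_female = rule_metrics(true_accept=3, false_accept=2, missed_accept=1)
credit_yes = rule_metrics(true_accept=3, false_accept=1, missed_accept=1)
print(gender_female)  # (0.6, 0.75)
print(credit_yes)     # (0.75, 0.75)
```

Both rules cover the same share of actual accepts, but the credit rule is more accurate because it produces fewer false accepts.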
8
Generalizing data analysis ideas
  • Question: How can useful rules be found in the
    large amount of data generated by business
    operations?
  • Answer: By applying data mining techniques and
    tools

9
What is Data Mining? (See Wikipedia: data mining)
  • Many Definitions
  • Non-trivial extraction of implicit, previously
    unknown and potentially useful information from
    data
  • Exploration and analysis, by automatic or
    semi-automatic means, of large quantities of
    data in order to discover meaningful patterns

10
Origins of Data Mining
  • Draws ideas from machine learning/AI, pattern
    recognition, statistics, and database systems
  • Traditional techniques may be unsuitable due to:
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data

[Diagram: Data Mining shown at the overlap of
Statistics/AI, Machine Learning/Pattern
Recognition, and Database Systems]
11
Why Mine Data? Commercial Viewpoint
  • Lots of data is being collected and warehoused
  • Web data, e-commerce
  • purchases at department/grocery stores
  • Bank/Credit Card transactions
  • Computers have become cheaper and more powerful
  • Competitive Pressure is Strong
  • Provide better, customized services for an edge
    (e.g. in Customer Relationship Management)

12
Why Mine Data? Scientific Viewpoint
  • Data collected and stored at enormous speeds
    (GB/hour)
  • remote sensors on a satellite
  • telescopes scanning the skies
  • microarrays generating gene expression data
  • scientific simulations generating terabytes of
    data
  • Traditional techniques infeasible for raw data
  • Data mining may help scientists
  • in classifying and segmenting data
  • in Hypothesis Formation

13
Data Mining Tasks
  • Prediction Methods
  • Use some variables to predict unknown or future
    values of other variables.
  • Description Methods
  • Find human-interpretable patterns that describe
    the data.

From Fayyad et al., Advances in Knowledge
Discovery and Data Mining, 1996
14
Data Mining Tasks...
  • Classification (Predictive)
  • Clustering (Descriptive)
  • Association Rule Discovery (Descriptive)
  • Sequential Pattern Discovery (Descriptive)
  • Regression (Predictive)
  • Deviation Detection (Predictive)

15
What Text Mining Is (See Wikipedia: text mining)
  • Text mining is a process that employs a set of
    algorithms for converting unstructured text into
    structured data objects and the quantitative
    methods used to analyze these data objects.
  • SAS defines text mining as the process of
    investigating a large collection of free-form
    documents in order to discover and use the
    knowledge that exists in the collection as a
    whole. (SAS Text Miner: Distilling Textual Data
    for Competitive Business Advantage)

16
A simple text mining example
  • A tiny case - 9 documents
  • deposit the cash and check in the bank - Fin
  • the river boat is on the bank - Riv
  • borrow based on credit - Fin
  • river boat floats up the river - Riv
  • boat is by the dock near the bank - Riv
  • with credit, I can borrow cash from the bank -
    Fin
  • boat floats by dock near the river bank - Riv
  • check the parade route to see the floats - Par
  • along the parade route - Par
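
The nine documents above are small enough to count terms by hand, but a short Python sketch shows how a text mining tool might begin. It counts, for each term, how many documents of each label (Fin, Riv, Par) contain it. The stop-word list and the helper name `tokenize` are assumptions made for this illustration, and this is not how SAS Text Miner works internally.

```python
from collections import Counter

# The nine documents and their labels, as listed on the slide.
documents = [
    ("deposit the cash and check in the bank", "Fin"),
    ("the river boat is on the bank", "Riv"),
    ("borrow based on credit", "Fin"),
    ("river boat floats up the river", "Riv"),
    ("boat is by the dock near the bank", "Riv"),
    ("with credit, I can borrow cash from the bank", "Fin"),
    ("boat floats by dock near the river bank", "Riv"),
    ("check the parade route to see the floats", "Par"),
    ("along the parade route", "Par"),
]

# Hypothetical stop-word list; a real tool would use a fuller one.
STOP_WORDS = {"the", "and", "in", "is", "on", "up", "by", "near", "with",
              "i", "can", "from", "to", "see", "based", "along"}


def tokenize(text):
    """Lowercase, strip commas, split on whitespace, drop stop words."""
    words = text.lower().replace(",", " ").split()
    return [w for w in words if w not in STOP_WORDS]


# For every term, count how many documents of each label contain it.
term_by_label = {}
for text, label in documents:
    for term in set(tokenize(text)):
        term_by_label.setdefault(term, Counter())[label] += 1

for term in sorted(term_by_label):
    print(f"{term:8s} {dict(term_by_label[term])}")
```

In the output, "river" appears only in Riv documents and "parade" only in Par documents, while "bank" appears in both Fin and Riv documents, so "bank" alone does not separate the two categories.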

17
Text Mining Strengths
  • Clustering documents in a corpus
  • Investigating word (token) distribution across
    documents within a corpus
  • Identifying words with the highest discriminatory
    power
  • Classifying documents into predefined categories
  • Integrating text data with structured data to
    enrich predictive modeling endeavors

18
Text Mining Deficiencies
  • Text mining algorithms perform poorly in
    distinguishing negations, for example
  • Herman was involved in a motor vehicle accident.
  • Herman was NOT involved in a motor vehicle
    accident
  • Text mining cannot generally make value
    judgments, for example, classifying an article as
    positive or negative with respect to any tokens
    it contains.
  • Text mining algorithms do not work well with
    large documents.
  • Performance is slow.
  • Increased term occurrence across documents
    decreases separation of documents.
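
The negation problem can be illustrated with the two Herman sentences above. In this minimal sketch, each sentence is reduced to a bag of word counts and the two bags are compared with cosine similarity. The bag-of-words representation and the cosine measure are assumptions chosen for illustration; the slides do not say which representation the tools use.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, drop a trailing period, and count the words."""
    return Counter(text.lower().rstrip(".").split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


s1 = bag_of_words("Herman was involved in a motor vehicle accident.")
s2 = bag_of_words("Herman was NOT involved in a motor vehicle accident")
similarity = cosine(s1, s2)
print(round(similarity, 3))  # 0.943
```

The two sentences have opposite meanings, yet they differ by a single token, so a word-count representation treats them as nearly identical.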

19
Using Data Mining Tools
  • Statistical Analysis System (SAS,
    http://www.sas.org): SAS9 is the most recent
    release of SAS. It delivers analytical, data
    manipulation and reporting capabilities within a
    completely new framework.
  • SPSS (http://www.spss.com): SPSS customers
    include telecommunications, banking, finance,
    insurance, healthcare, manufacturing, retail,
    consumer packaged goods, higher education,
    government, and market research.
  • Weka, an open source software product
    (http://www.cs.waikato.ac.nz/ml/weka/)
  • Microsoft SQL Server comes with major data mining
    utilities
  • There are more

20
Using SAS Enterprise Miner to Construct a Decision
Tree
21
SAS Enterprise Miner 4.3
  • Basic
  • How to use the application main menu
  • Using the pop-up menus
  • Enterprise Miner documentation
  • Project Diagram
  • The SEMMA methodology
  • Sample
  • Explore
  • Modify
  • Model
  • Assess

22
Exercise 5.0
  • Explore SAS and SAS Enterprise Miner

23
Decision Tree Example
  • Life Insurance Promotion
  • Dataset: CreditProm

24
Life Insurance Promotion Data
Income Range  Magazine Promo  Watch Promo  Life Ins Promo  Credit Card Ins.  Sex     Age
40-50,000     Yes             No           No              No                Male    45
30-40,000     Yes             Yes          Yes             No                Female  40
40-50,000     No              No           No              No                Male    42
30-40,000     Yes             Yes          Yes             Yes               Male    43
50-60,000     Yes             No           Yes             No                Female  38
20-30,000     No              No           No              No                Female  55
30-40,000     Yes             No           Yes             Yes               Male    35
20-30,000     No              Yes          No              No                Male    27
30-40,000     Yes             No           No              No                Male    43
30-40,000     Yes             Yes          Yes             No                Female  41
40-50,000     No              Yes          Yes             No                Female  43
20-30,000     No              Yes          Yes             No                Male    29
50-60,000     Yes             Yes          Yes             No                Female  39
40-50,000     No              Yes          No              No                Male    55
20-30,000     No              No           Yes             Yes               Female  19
25
Tree Algorithm: Find Best Split for Input
Consider that the consumers in the life insurance
promotion dataset have two attributes: credit
card promotion and gender.
[Figure: training data and the best split on x1
(Credit Prom); labels: 0.7, missing in left
branch, missing in right branch]
26
Tree Algorithm: Repeat for Other Inputs
[Figure: training data and the best split on x2
(Gender), scored by Kass-adjusted logworth;
labels: 0.7, missing in left branch, missing in
right branch]
27
Tree Algorithm: Compare Best Splits
[Figure: training data with the best split on x1
and the best split on x2 side by side; labels:
0.7, missing in left branch, missing in right
branch]
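
The comparison of candidate splits can be sketched in Python on the 15 rows of the Life Insurance Promotion table. The sketch assumes that Life Ins Promo is the target and that the two candidate inputs are Credit Card Ins. and Sex. It scores each input by logworth, that is, -log10 of the chi-square p-value. It computes the plain logworth only; the Kass adjustment named on the slides is not applied, and SAS Enterprise Miner may report different numbers.

```python
import math

# (Credit Card Ins., Sex, Life Ins Promo) for the 15 rows of the table.
rows = [
    ("No", "Male", "No"), ("No", "Female", "Yes"), ("No", "Male", "No"),
    ("Yes", "Male", "Yes"), ("No", "Female", "Yes"), ("No", "Female", "No"),
    ("Yes", "Male", "Yes"), ("No", "Male", "No"), ("No", "Male", "No"),
    ("No", "Female", "Yes"), ("No", "Female", "Yes"), ("No", "Male", "Yes"),
    ("No", "Female", "Yes"), ("No", "Male", "No"), ("Yes", "Female", "Yes"),
]


def chi_square_2x2(rows, input_index, target_index=2, positive="Yes"):
    """Pearson chi-square for a binary input against a binary target."""
    levels = sorted({r[input_index] for r in rows})
    # counts[level] = [target positive, target negative]
    counts = {level: [0, 0] for level in levels}
    for r in rows:
        counts[r[input_index]][0 if r[target_index] == positive else 1] += 1
    (a, b), (c, d) = (counts[level] for level in levels)
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def logworth(chi2):
    """-log10 of the chi-square p-value with one degree of freedom."""
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return -math.log10(p_value)


chi2_credit = chi_square_2x2(rows, input_index=0)
chi2_sex = chi_square_2x2(rows, input_index=1)
print("Credit Card Ins.:", round(chi2_credit, 3), round(logworth(chi2_credit), 3))
print("Sex:             ", round(chi2_sex, 3), round(logworth(chi2_sex), 3))
```

On these 15 rows the split on Sex has the larger logworth, so under these assumptions it would be chosen as the first partition.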
28
Tree Algorithm: Partition with Best Split
[Figure: training data on axes x1 and x2,
partitioned by the best split]
29
Tree Algorithm: Repeat within Partitions
[Figure: training data on axes x1 and x2]
30
Tree Algorithm: Partition with Best Split
[Figure: training data on axes x1 and x2]
31
Tree Algorithm: Construct Maximal Tree
[Figure: training data on axes x1 and x2]
32
Overfitting
We use a training dataset to find the decision
rules. These rules must also be applicable to
other datasets. To test the validity of the
rules, a test dataset is used. By comparing the
outcomes on these two datasets, we can identify
any inconsistency and create a good decision
tree.
Overfitting: The tree is split too many times,
and the classification error rate on the test
dataset gets higher.
33
Overfitting due to Insufficient Examples
A lack of data points in the lower half of the
diagram makes it difficult to correctly predict
the class labels of that region. With too few
training records in the region, the decision tree
predicts the test examples using other training
records that are irrelevant to the classification
task.
34
How to Address Overfitting
  • Pre-Pruning (Early Stopping Rule)
  • Stop the algorithm before it becomes a
    fully-grown tree
  • We typically use two datasets
  • Training dataset for growing the decision tree
    and obtaining rules
  • Test dataset for testing whether the rules are
    good enough, judged by the error rate when the
    rules from the training dataset are applied to
    the test dataset.
  • If there is no test dataset, the original dataset
    will be partitioned into two subsets for the
    above purpose.
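
Partitioning a single dataset into training and test subsets can be sketched as follows. The 70/30 ratio, the fixed seed, and the function name `partition` are assumptions for this illustration; the slides do not specify a split ratio.

```python
import random


def partition(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into (training, test)."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row numbers 0..14 stand in for the 15 records of a small dataset.
train, test = partition(range(15))
print(len(train), len(test))  # 10 5
```

The tree is grown on `train` only, and its rules are then applied to `test` to measure the error rate on records the tree has not seen.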

35
Exercise 5
  • Download the Life Insurance Promotion dataset
    (CreditProm)
  • Import the data to SAS
  • Try out SAS Decision Tree modeling

36
SAS Data Mining Example
  • A German bank's credit data
  • Online SAS materials (View PDF (2.24MB))
  • P70, dataset description
  • P71, decision matrix

37
Web Mining
38
Case study CarPort.com
  • CarPort.com is
  • a fictitious Web site that is used to illustrate
    components of Web site design and Web log
    analysis
  • a services Web site.

39
CarPort.com
  • Visitor profile could be any of the following
  • 1. buyer looking for a car
  • 2. seller looking to sell a used car
  • 3. curious information seeker
  • 4. competitor
  • 5. robot or spider
  • 6. lost Web surfer
  • 7. SAS course developer.

40
CarPort.com
  • Services
  • car locator (want ads)
  • car ownership information
  • Sources of revenue
  • banner ads
  • used car ads
  • partnership agreements (fee for referral)

41
How Did You Get Here?
  • Followed a link from another site
  • Clicked on a banner ad
  • Did a Google search
  • Saw an advertisement on television, or heard one
    on radio
  • Received a direct mail solicitation
  • Received a phone solicitation
  • Heard the site mentioned or recommended on a news
    or specialty program, or read about it in the
    printed media

42
[Screenshot of a Web page with callouts: Title,
URL, Images, Links, Banner Ad, Link Image]
43
[Screenshot callouts:]
Click on this link to find out more or e-mail the
seller.
Link to the dealer's Web site.
44
Web Mining for Profitability
  • Increase viewing, navigation, and transaction
    efficiency.
  • Improve the customer experience.
  • Add services and features that promote
    cross-selling and up-selling opportunities.
  • Identify problem areas.
  • Improve security.
  • Attract more high quality customers.

45
Michael Berry's Internet Business Taxonomy
  • Classification is based on an Internet company's
    business model, which may include
  • selling things that get delivered in a truck
  • selling things that get delivered through the
    ether
  • selling eyes to advertisers
  • connecting sellers and buyers
  • empowering communities and collecting donations.

46
Some Business Questions
  • Who is visiting my Web site?
  • Who is buying my product(s)?
  • Who are my repeat buyers?
  • Which customers are churning?
  • Which Web design produces the most purchases?
  • What campaign strategies are most effective in
    increasing Web site visits?

47
More Questions
  • What factors influence product purchases?
  • Time-of-day effects
  • Gender, Age, Income, and so forth
  • Latent factors: e-shopper, Web expert, and so
    forth
  • Which sales channels produce the most profitable
    customers?
  • Do any site-visit patterns correlate with
    outcomes that can be exploited for business
    advantage?

48
Web Log Fields
  • User's IP address, also called
  • Remote host name
  • Client IP address
  • User name, also called
  • Remote user log name (may be different)
  • Authenticated user name
  • Date and time of request, with or without a UTC
    offset
  • Request type, also called method
  • HTTP request with (CLF) or without (IIS) argument
  • Status: HTTP three-digit status code
  • Number of bytes sent to client

continued...
49
Web Log Fields
  • The URL path requested, if request type has no
    argument
  • The port to which the request was served
  • The name of the server
  • The IP address of the server
  • The time taken to serve the request
  • Number of bytes in the request received from the
    client
  • User agent, which is usually a text string with
    the name and version number of Web browser used
    by the client and the operating system of the
    client machine
  • The domain name or IP address of the referring
    URL
  • Query information in a text string
  • Cookie information in a text string

50
Common Log Format
Value                 Example
Remote Host Name      111.22.333.44
Remote User Log Name  -
Username              IRVINE/terry
Date                  15/Apr/2000
Time and UTC Offset   11:28:14 -0700
Request Type          GET /index.html HTTP/1.1
Service Status Code   200
Bytes Sent            2792
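
A Common Log Format record built from the example values above can be split into its fields with a regular expression. This is a minimal sketch: the sample line is assembled from the slide's example values in the usual CLF order, and the pattern and group names are illustrative rather than taken from any particular log analysis tool.

```python
import re

# One named group per Common Log Format field.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^:]+):(?P<time>\S+) (?P<offset>[+-]\d{4})\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

# Sample record assembled from the example values on the slide.
line = ('111.22.333.44 - IRVINE/terry [15/Apr/2000:11:28:14 -0700] '
        '"GET /index.html HTTP/1.1" 200 2792')

match = CLF_PATTERN.match(line)
fields = match.groupdict()

# The request field holds the method, the URL path, and the protocol.
method, path, protocol = fields["request"].split()

for name, value in fields.items():
    print(f"{name:8s} {value}")
print(method, path, protocol)
```

Each parsed record becomes one row of a dataset, which is the form needed before the log can be loaded into an analysis tool.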
51
The User Session
User requests index.htm.
Server sends copy of index.htm.
Browser parses index.htm, finds references to
image files, and requests image files.
Web Server
Browser
...
52
Association Rule Mining
  • Given a set of transactions, find rules that will
    predict the occurrence of an item based on the
    occurrences of other items in the transaction

[Table: Market-Basket transactions]
Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
53
Definition: Association Rule
  • Association Rule
  • An implication expression of the form X → Y,
    where X and Y are itemsets
  • Example: {Milk, Diaper} → {Beer}
  • Rule Evaluation Metrics
  • Support (s)
  • Fraction of transactions that contain both X and
    Y
  • Confidence (c)
  • Measures how often items in Y appear in
    transactions that contain X
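
Support and confidence can be computed directly from a list of transactions. The market-basket table itself did not survive in this transcript, so the five transactions below are hypothetical example data chosen to use the same items as the rules on the slides; the function names are also illustrative.

```python
# Hypothetical market-basket transactions (not recovered from the slides).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Of the transactions that contain X, the fraction that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)


x, y = {"Milk", "Diaper"}, {"Beer"}
s = support(x | y, transactions)
c = confidence(x, y, transactions)
print(f"support = {s:.2f}, confidence = {c:.2f}")
# support = 0.40, confidence = 0.67
```

With this example data, two of the five transactions contain Milk, Diaper, and Beer together, and two of the three transactions that contain Milk and Diaper also contain Beer.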

54
Obtaining a Dataset from Web Log for SAS Data
Analysis
  • Example IMWs Web Log Data (raw data, SAS
    dataset)
  • Data Processing Skills
  • Converting the dataset into an Excel file
  • Importing the data into SAS

55
SAS Association Model
56
Association Rules from IMWs Dataset
57
Exercise 6
  • Download IMWs Web Log raw data (raw data)
  • Data conversion within Excel
  • Import the dataset to SAS
  • Try out SAS Association Analysis model