Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type of customer who makes a good candidate for a credit-card, and what level of credit to extend to that customer. - PowerPoint PPT Presentation

Loading...

PPT – Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type of customer who makes a good candidate for a credit-card, and what level of credit to extend to that customer. PowerPoint presentation | free to download - id: 61fd89-MzQ0M



Loading


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation
Title:

Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type of customer who makes a good candidate for a credit-card, and what level of credit to extend to that customer.

Description:

Credit Card Analysis of Czech Bank Goals Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type ... – PowerPoint PPT presentation

Number of Views:179
Avg rating:3.0/5.0
Slides: 36
Provided by: a540
Learn more at: http://cciweb.uncc.edu
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type of customer who makes a good candidate for a credit-card, and what level of credit to extend to that customer.


1
  • Our goal for this project is to analyze customer
    and credit-card information, from the Berka
    dataset, to extrapolate the type of customer who
    makes a good candidate for a credit-card, and
    what level of credit to extend to that customer.
  • The Berka dataset is from the 1999 PKDD Discovery
    Challenge.
  • The Berka dataset is a collection of financial
    information from a Czech bank.

2
  • Entity-Relationship
  • Description
  • Each account has both static characteristics
    (e.g. date of creation, address of the branch)
    given in relation "account" and dynamic
    characteristics (e.g. payments debited or
    credited, balances) given in relations
    "permanent order" and "transaction".
  • Relation "client" describes characteristics of
    persons who can manipulate with the accounts.
  • Relations "loan" and "credit card" describe some
    services which the bank offers to its clients
  • Relation "demographic data" gives some publicly
    available information about the districts (e.g.
    the unemployment rate) additional information
    about the clients can be deduced from this.

3
  • The dataset contains the following tables
  • Accounts
  • Each record describes static characteristics of
    an account
  • Size 4500 records
  • Clients
  • Each record describes characteristics of a client
  • Size 5369 records
  • Disposition (Disp)
  • Each record relates a client with an account and
    describes the clients right to operate that
    account
  • Size 5369 records

4
  • Dataset Table Description, Continued
  • Permanent Orders, Debit Only (Orders)
  • Each record describes characteristics of a
    payment order
  • Size 6471 records
  • Transactions (Trans)
  • Each record describes one transaction on an
    account
  • Size 1056320 records
  • Loans
  • Each record describes a loan granted for a given
    account
  • Size 682 records

5
  • Dataset Table Description, Continued
  • Credit Cards (Cards)
  • Each record describes a credit card issued to an
    account
  • Size 892 records
  • Demographic Data (District)
  • Each record describes demographic characteristics
    of a district
  • Size 77 records

6
  • Converted the ascii files to
  • MS Excel and/or MS Word files for cleaning data
  • MS Access database for use in
  • data mining,
  • de-normalizing or flattening files, and
  • basic querying to learn more about the data.
  • Put all modified files into file types recognized
    by Weka.
  • These files are comma delimited with a heading
    of attribute definition information.


7
  • Verified all table relationships
  • Every account has an Owner via Disp and Account
    tables
  • Order and Loan records are duplicated in
    transaction records. That is, the transactions
    include Order records and Loan payments.
  • Loan records in Trans are identified by
    k_symbolLP


8
  • De-normalize, or flatten, files for mining. Our
    database is relational. In order to mine or
    cluster attributes, those attributes must be in a
    single table. We have created a de-normalized
    table based on our goals.
  • Goal Analyze credit-card information to
    extrapolate the type of customer who makes a good
    candidate for a credit-card.
  •                   i.      Account-Client-Disp-Card
    -District-Loan-Transaction Table
  • Using information we discovered about accounts
    from previous clustering, cluster customer
    information
  •         Using card type as clustering attribute
  •        Added "N" (None) as a possible value to
    the Loan Status attribute
  • In order to better understand customers, we
    looked at this table in two ways
  • To identify, from all customers, which were
    credit card holders and which were not.
  • To examine the variances that exist between all
    credit card holding customers.
  •              


9
  • Add, change, remove, and descretize attributes as
    necessary. The tables shown below describe the
    changes made
  • Preprocessing Activities Account Table


Changes Made With Excel
10
  • Preprocessing Activities Client Table

11
  • Preprocessing Activities Client Table,
    Continued

12
  • Preprocessing Activities Disposition Table

Changes Made With Excel
  • Preprocessing Activities Order Table

Changes Made With Excel
13
  • Preprocessing Activities Loan Table

Changes Made With Excel
14
  • Preprocessing Activities Loan Table, Continued

15
  • Preprocessing Activities Transaction Table

Changes Made With Excel
16
  • Preprocessing Activities Demographic Table

Changes Made With Excel
17
  • Preprocessing Activities Credit Card Table

Changes Made With Excel
18
  • Methodology Overview
  • Attribute Ranking
  • Classification Analysis
  • Clustering Analysis

19
  • Classification is a process where a model is
    built describing a predetermined set of data
    classes.
  • The model is constructed by analyzing all the
    records in the database.
  • Each tuple is assumed to belong to a predefined
    class, as determined by one of the attributes
    called the class label.
  • The tuples analyzed to build the model form the
    training data set.
  • Typically, the learned model is expressed in
    terms of decision trees or classsification rules.
  • These rules can be used to predict the test data
    set.

20
  • We used See5 as the classification tool for our
    project. A brief description of the algorithm
    is as follows
  • The tree starts as a single node representing the
    training samples.
  • If the samples are all of the same class, then
    the node becomes a leaf and is labeled with that
    class.
  • Otherwise, the algorithm uses an entropy-based
    measure known as Information Gain as a heuristic
    for selecting the attribute that will best
    separate the samples into individual classes.
    This attribute becomes the "test" or "decision"
    attribute at the node.
  • A branch is created for each known value of the
    test attribute and the samples are partitioned
    accordingly.

21
  • See5 Algorithm, cont.
  • The algorithm uses the same process recursively
    to form a decision tree for the samples at each
    partition.
  • The recursive partition stops when all the
    samples for a given node belong to the same class
    or if there are no remaining attributes on which
    samples may be further partitioned.

22
We ran See5 using the RuleSets option and the
Boost option with 3 trials. This represents
simplified version of the decision tree generated
by See5. To build this tree, we used Rules
obtained, which had high confidence (gt85).
Trans_Avg_Balance lt 37097.18
Class 0 (1828.3/93.4)
Client_District_ID 10
Class 1 (6.5)
Client_District ID 38
Class 1 (8.5 /0.9)
Trans_Avg_Balance gt 46123.76
Class 1 (8.3)
Account_Opened 1995
Client_District ID 1
Client_District ID 15
Client_District ID 13
Class 1 (5.3)
Class 1 (5.3)
Trans_Avg_Balance gt 66899.77
Class 1 (8.3)
Client_Age Senior
Class 1 (5.3)
23
See5 also has the ability to express the
clasifiers as Rule sets, which are easier to
understand. 96 rules were generated, out of which
we have listed 3 rules. All of these have
confidence of more than 90 Rule 1 Rule 1/1
(1828.4/93.4, lift 1.3)     Trans_Avg_Balance lt
37097.18 --gt class 0 0.948 If
Trans_Avg_Balance lt 37097.18, Then Class
Prediction is taken as Non-CardHolder
24
Rule 2 Rule 1/3 (8.3, lift 3.7)     Client_Age
M     Client_District_ID 1     Trans_Avg_Balan
ce gt 66899.77 --gt class 1 0.903 If
(Client_Age Middle-Age) and (Client_District_ID
1) and (Trans_Avg_Balance 66899.77), Then
Class Prediction is that of a Card Holder.
25
Rule 3 Rule 1/4 (8.3, lift 3.7)     Client_Distr
ict_ID 60     Trans_Avg_Balance gt
46123.76     Loan_Status none --gt class 1
0.903 If (Client_District_ID 60) and
(Trans_Avg_Balance gt 46123.76) and (Loan_Status
none), Then Class Prediction is that of a Card
Holder.
26
Training Set Evaluation
  • The estimated predictive error (64 390)/3600
    12.6
  • The percentage of instances that were correctly
    classified as Non-card holders (2816/2880)
    97
  • The percentage of instances that were incorrectly
    classified as Non-card holders (64/2880)
    22.2
  • The percentage of instances that were
    in-correctly classified as Card holders
    (390/720) 54 The percentage of instances that
    were correctly classified as Card holders
    (330/720) 45

27
Test Set Evaluation
  • The estimated predictive error on the Test set is
    ( 59 119)/900 19.7
  • The percentage of instances that were correctly
    classified as Non-card holders (669/728) 91
  • The percentage of instances that were incorrectly
    classified as Non-card holders (59/728) 8.1
  • The percentage of instances that were
    in-correctly classified as Card holders
    (119/172) 69
  • The percentage of instances that were correctly
    classified as Card holders (53/172) 73.6

28
  • We used cluster analysis to partition the data
    into a set of classes, grouping together
    customers or attributes with similar
    characteristics.
  • The purpose of cluster analysis is to place
    observations into groups or clusters suggested by
    the data such that observations in a given
    cluster tend to be similar to each other in some
    sense, and objects in different clusters tend to
    be dissimilar.
  • The COBWEB Algorithm was chosen for this task.
    The COBWEB algorithm used was implemented in the
    Weka Toolkit.

29
  • The COBWEB algorithm was run twice. Once with a
    cutoff value of 0.18 and once with a cutoff value
    of 0.17.
  • This is done because Weka returns the cluster
    information only for the leaf clusters and prunes
    the tree by levels.
  • The results of the COBWEB Clustering analysis are
    interpreted in the following chart.

30
Description All Card Holders Cluster 2 Cluster 3 Cluster 5 Cluster 6 Cluster 7 Cluster 57 Cluster 59 Cluster 60
Number of Instances 892  75  182  83  83  469  277  94  83 
Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE
Youth 21.00  12.00  0.00  98.00  0.00  20.00  0.00  100.00  0.00 
Adult 21.00  24.00  0.00  0.00  94.00  19.00  0.00  0.00  93.00 
Middle Age 56.00  60.00  100.00  0.00  0.00  59.00  100.00  0.00  0.00 
Senior 2.00  4.00  0.00  1.00  6.00  2.00  0.00  0.00  6.00 
Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER
Female 47.00  100.00  100.00  100.00  100.00  0.00  0.00  0.00  0.00 
Male 53.00  0.00  0.00  0.00  0.00  100.00  100.00  100.00  100.00 
Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE
Mean 51653.23  54304.88  51668.22  46762.24  51136.43  52179.92  53217.44  48965.39  51136.43 
Std. Deviation 11109.32  12241.39  9367.23  13780.03  10064.00  10983.18  11055.47  11241.49  10064.01 
Median 51292.00  54205.72  51580.11  43661.80  50401.42  51449.19  52323.52  48804.16  50401.42 
Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT
Single 83.00  0.00  100.00  100.00  100.00  83.00  82.00  88.00  100.00 
Joint 17.00  100.00  0.00  0.00  0.00  17.00  18.00  12.00  0.00 
Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE
Junior 17.00  10.00  0.00  74.00  1.00  16.00  0.00  78.00  1.00 
Classic 74.00  80.00  91.00  21.00  88.00  73.00  86.00  18.00  88.00 
Gold 10.00  10.00  9.00  4.00  11.00  11.00  13.00  3.00  11.00 
31
  • The distinction of each cluster was not obvious
    beyond the coupling of age group and card type.
  • To further investigate the characteristics of
    customers, we ran another clustering analysis
    using partitioning methods. We used the SAS
    Enterprise Miner product. SAS uses a partitioning
    clustering tool implementing the WARD (Minimum
    variance) method.

32
Description All Card Holders Cluster 1 Cluster 2 Cluster 4 Cluster 6 Cluster 19 Cluster 21
Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE Percentage Distributions within Clusters by AGE
Youth 21.00  80.00  18.18  17.65  62.50  0.00  0.00 
Adult 21.00  0.00  18.00  11.76  0.00  100.00  50.00 
Middle Age 56.00  20  63.64  70.59  37.50  0.00  50.00 
Senior 2.00  0.00  0.00  0.00  0.00  0.00  0.00 
Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER Percentage Distributions within Clusters by GENDER
Female 47.00  20.00  45.45  29.41  37.50  0.00  50.00 
Male 53.00  80.00  54.55  70.59  62.50  100.00  50.00 
Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE Mean Value of AVERAGE BALANCE
Mean 51653.23  26887.36  75684.47  73667.43  30521.51  78237.63  80549.51 
Std. Deviation 11109.32  328.09  508.51  579.88  601.73  0.00  909.67 
Minimum 0.00  26522.30  74787.91  72794.25  29640.67  78237.63  79906.27 
Maximum 0.00  27362.37  76453.60  7424.22  31253.80  78237.63  81192.75 
Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT
Single 83.00  80.00  54.55  70.59  100.00  100.00  100.00 
Joint 17.00  20.00  45.45  29.41  0.00  0.00  0.00 
Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE Percentage Distributions Within Clusters by CARDTYPE
Junior 16.00  60.00  9.09  11.76  62.50  0.00  0.00 
Classic 74.00  20.00  45.45  47.06  37.50  100.00  100.00 
Gold 10.00  20.00  45.45  41.18  0.00  0.00  0.00 
33
  • The following rules represent our findings in
    analyzing whether or not a customer will/will not
    have a credit card
  • If the average account balance is
    less-than-or-equal 37097.18,
  • then the client will not be a card holder.
  • If the client is middle-age and lives in Prague,
    and his/her average
  • account balance is greater-than 66899.77, then
    that client will
  • be a card holder.
  • If the client lives in Prostejov, and has an
    average account balance
  • in excess of 46123.76, then that client will be a
    card holder.

34
  • The following are characterizations of
    cardholders within the current population of bank
    clients
  • There is no apparent difference between men
    and women in the
  • issuance of credit cards.
  • Middle aged single females have a higher
    percentage of classic
  • and gold cards than the general
    population.
  • All credit card customers that have taken a
    loan have good
  • status in repayment.
  • Customers who are in the age bracket less
    than 24 and whose
  • average balance is under 50000.00 are
    likely candidates for a
  • JUNIOR Card.
  • Otherwise, it is not easily determined from
    customer information
  • which card type has been selected.
    Perhaps further examination of
  • customer behavior is warranted.

35
  • Future continuation of this analysis might
    include
  • identifying customers with a junior or classic
    credit card to whom the bank can offer a
    higher-limit card.
  • Other Opportunities
  • Which accounts are likely to default on loan
    payments? Why?
  • What are the characteristics of a good bank
    client ("good" is defined here, in the most
    general of terms, as who will make payments on
    their outstanding balances in a timely manner)?
  • What are the identifying characteristics of a
    good bank branch (again, we define "good" only in
    the most general terms of which branches are most
    successful in the collection of loan payments)?
About PowerShow.com