# Exploratory Data Analysis (EDA) in the data analysis process - PowerPoint PPT Presentation


## Exploratory Data Analysis (EDA) in the data analysis process


Slides: 31
Provided by: CarlosB59
Transcript and Presenter's Notes

Title: Exploratory Data Analysis (EDA) in the data analysis process

1
Exploratory Data Analysis (EDA) in the data
analysis process
• Module B2 Session 13

2
Learning Objectives
• students should be able to
• Construct a dot plot for a numeric variable
• split by a categorical variable
• Apply EDA concepts to a large dataset
• Explain the use of Excel's pivot tables
• and filters, in the EDA process
• Explain the importance of EDA
• for data checking and at the start of the
analysis
• Relate EDA
• to the principles of official statistics.

3
EDA with small and large data sets
• Session 12
• Stressed the importance of EDA
• Introduced 2 new tools (dot and stem)
• Practiced with small data sets
• In this session we scale up
• Look at large data sets
• The tools do not scale up easily
• But the concepts do scale up
• EDA becomes even more crucial
• Most data sets are large!
• at least compared with teaching examples

4
The essence of a stem and leaf plot
The leaf shows the next digit. This can be
useful in the exploration phase
Data: 5.3, 5.4, 6.0, ..., 11.1, 11.9
Stem and leaf plot
Stacked dot plot
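
The idea of a stem and leaf plot can be sketched in a few lines of Python. This is a minimal illustration, assuming non-negative values recorded to one decimal place; the function name `stem_and_leaf` and the sample values are made up for the example and are not the data shown on the slide.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf lines for values recorded to one decimal place."""
    stems = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)           # work in integer tenths to avoid float noise
        stem, leaf = divmod(tenths, 10)  # stem = whole number, leaf = the next digit
        stems[stem].append(str(leaf))
    lines = []
    # Walk every stem between the smallest and largest so that gaps stay visible.
    for stem in range(min(stems), max(stems) + 1):
        lines.append(f"{stem:>3} | {''.join(stems.get(stem, []))}")
    return lines

# Made-up illustrative values (hypothetical, not the slide's data).
sample = [5.3, 5.4, 6.0, 6.2, 6.2, 7.1, 7.5, 7.8, 8.0, 9.4, 11.1, 11.9]
plot = stem_and_leaf(sample)
print("\n".join(plot))
```

Each leaf is an actual digit from the data, so the individual numbers can still be read off the plot.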
5
What are the key points?
• We look at individual data points
• not summaries at this stage
• this is general for EDA
• The stem and leaf plot in particular
• keeps the actual numbers as far as possible
• This can be important
• An example uses the Tanzania survey

6
Tanzania agriculture survey
This is the variable we wish to explore. It is a
value between 0 and 100
7
The data in Excel
The variable to explore before analysis
8
How to explore this value
• Can we do a stem and leaf plot?
• By hand in Excel? But there are 16628 values!
• Even if automated, that is too many!
• The essence of a stem and leaf plot
• is to look at all the possible values
• Try a pivot table
• a powerful feature in Excel
• used previously on categorical data
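
Outside Excel, the same "look at all the possible values" idea is a frequency table of the distinct values. The sketch below uses Python's standard `collections.Counter`; the function name `frequency_table` and the percentages are hypothetical stand-ins, not the Tanzania survey values.

```python
from collections import Counter

def frequency_table(values):
    """Count how often each distinct value occurs, sorted by value."""
    return sorted(Counter(values).items())

# Hypothetical percentages between 0 and 100 (not the survey data).
percentages = [0, 0, 2, 5, 10, 10, 10, 25, 33, 50, 50, 50, 50, 75, 100, 100]
table = frequency_table(percentages)
for value, count in table:
    print(f"{value:>5} {count:>5}")
```

With 16628 records the table still has at most 101 rows for a whole-number percentage, which is why this approach scales where the stem and leaf plot does not.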

9
The pivot table
10
Some results
12
What do you deduce?
• There are oddities in rounding
• Perhaps enumerator differences
• Can this question be answered to 1%?
• So what should be done before analysis?
• First look further at the data
• Excel can help: it can drill down to examine
individual records
• The concept
• Use the table to look for oddities
• Then examine them in more detail
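
One possible way to put a number on "oddities in rounding" is to count how many answers fall on round figures. This is only a sketch under assumptions: it treats the variable as a whole-number percentage, and the function name `rounding_profile` and the sample values are invented for illustration.

```python
from collections import Counter

def rounding_profile(values):
    """Classify each whole-number percentage by the coarsest step it sits on."""
    profile = Counter()
    for v in values:
        if v % 10 == 0:
            profile["multiple of 10"] += 1
        elif v % 5 == 0:
            profile["multiple of 5"] += 1
        else:
            profile["other"] += 1
    return profile

# Hypothetical percentages (not the survey data).
percentages = [0, 0, 2, 5, 10, 10, 10, 25, 33, 50, 50, 50, 50, 75, 100, 100]
profile = rounding_profile(percentages)
for label, count in profile.most_common():
    print(f"{label:<15} {count}")
```

A few values in the "other" class among many round figures are the ones worth drilling down to.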

13
Drilling down: an example
Make the 6 corresponding to 2 the active cell,
then double-click to give the detail.
4 of these values are from the same village, so
the same enumerator.
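
Drilling down amounts to selecting the records behind one cell of the table and then looking at their other columns. The sketch below assumes records held as dictionaries; the village labels, household numbers and column names are hypothetical, chosen only to mirror the pattern described on the slide.

```python
from collections import Counter

def drill_down(records, column, value):
    """Return the individual records behind one cell of the frequency table."""
    return [r for r in records if r[column] == value]

# Hypothetical records (village labels and values are invented).
records = [
    {"village": "V03", "household": 11, "percent": 2},
    {"village": "V03", "household": 12, "percent": 2},
    {"village": "V03", "household": 15, "percent": 2},
    {"village": "V03", "household": 19, "percent": 2},
    {"village": "V07", "household": 4, "percent": 2},
    {"village": "V12", "household": 8, "percent": 2},
    {"village": "V07", "household": 5, "percent": 50},
    {"village": "V12", "household": 9, "percent": 10},
]
detail = drill_down(records, "percent", 2)
by_village = Counter(r["village"] for r in detail)
print(len(detail), "records;", by_village.most_common(1)[0])
```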
15
What do you conclude? Technique and results
• Technique
• Stem and leaf plots when looking at small
datasets
• Pivot tables when datasets are large
• But the principle is general
• Numbers must be looked at carefully!
• The principle can be adapted for the data
• and explored effectively in Excel
• Results
• Did enumerators have different interpretations
• of the precision required in the percentages?
• This needs further exploration
• and the analysis needs to take account of this

16
Another new element in this session
• Exploratory analysis includes
• looking for oddities in the data
• Unexplained oddities cause variation
• that can make it difficult to detect the pattern
• because they add unnecessary noise to the data
• How do you tame the variation?
• One way is to examine related variables
• This is important in the analysis
• the next slide is a repeat from Session 3
• It is also a key weapon in data exploration
• and is covered in the practical

17
Slide from Module B2 Session 3
• To do good statistics you must
• fight the curse of variation
• Two main strategies to overcome variation
• 1. Take enough observations
• In the Tanzania survey there were 3223 households
just from this one region
• 2. Measure characteristics that explain variation
• Variation itself is not necessarily the problem
• Variation you do not understand is the problem
• Here we start understanding variation
• at the exploration stage
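
The second strategy can be illustrated numerically: once a related variable is measured, the variation left within its groups is what remains unexplained. This is a minimal sketch using Python's `statistics` module; the function name `within_group_variance`, the yields and the variety labels are all hypothetical.

```python
from statistics import mean, pvariance

def within_group_variance(values, groups):
    """Average squared deviation of each value from its own group mean."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    squared = sum((v - mean(vs)) ** 2 for vs in by_group.values() for v in vs)
    return squared / len(values)

# Hypothetical yields for two varieties (invented for illustration).
yields = [2.0, 2.2, 1.8, 4.0, 4.2, 3.8]
variety = ["local"] * 3 + ["improved"] * 3

overall = pvariance(yields)                       # variation ignoring variety
within = within_group_variance(yields, variety)   # variation left after splitting
print(f"overall {overall:.3f}, within variety {within:.3f}")
```

In this made-up case almost all of the variation is explained by variety, so what is left is small enough for individual oddities to stand out.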

18
Practical: three parts
• Tanzania data
• practice what has been done in these slides
• Dot plots split by a factor
• demonstration and practice
• Swaziland data
• apply the concepts
• checking factors
• as well as numeric columns
• Then the key points are reviewed

19
Points for review after the practical
• Looking for individual problems
• And surprising patterns
• Exploratory graphics
• need to help the analyst and data checker
• see dot plots on next slide
• Tables are also useful
• especially with the facility to drill down
• Look at individual variables
• and at records as a whole
• It is useful to estimate results
• And question the computer if they are very
different

20
Dot plots - yield by variety
Outliers (typing errors) are clear, but only
because of the 2nd variable. They are not outliers
overall.
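
A dot plot split by a factor can be sketched as text, with every group drawn on the same axis. The code below is an illustration only: it assumes whole-number values, and the function names, yields and variety labels are hypothetical rather than taken from the slide.

```python
from collections import Counter

def stacked_dots(values, lo, hi):
    """Rows of a text stacked dot plot for whole-number values on the axis lo..hi."""
    counts = Counter(values)
    rows = []
    for level in range(max(counts.values()), 0, -1):   # top row first
        row = "".join("o" if counts[x] >= level else " " for x in range(lo, hi + 1))
        rows.append(row.rstrip())
    return rows

def dot_plot_by_group(values, groups):
    """One stacked dot plot per group, all sharing the same axis."""
    lo, hi = min(values), max(values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    lines = []
    for g in sorted(by_group):
        lines.append(g)
        lines.extend(stacked_dots(by_group[g], lo, hi))
        lines.append("-" * (hi - lo + 1))   # axis from lo to hi
    return lines

# Hypothetical yields: the 9 in variety A and the 3 in variety B are planted oddities.
yields = [2, 2, 3, 3, 3, 4, 9, 8, 9, 9, 10, 10, 10, 11, 3]
variety = ["A"] * 7 + ["B"] * 8
lines = dot_plot_by_group(yields, variety)
print("\n".join(lines))
```

Both planted values lie inside the overall range of 2 to 11, so they only stand apart once the plot is split by variety.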
21
EDA is a continuous process
• EDA effectively is a continuation of the data
checking process
• The example on the previous slide shows
• how some oddities only become clear once the
analysis is undertaken
• This continues into the formal analysis
• where it involves looking at the residuals
• They are the unexplained variation
• As discussed in Session 3!
• So analysis is not just a set of rules
• It is a thoughtful process
• Where you become the data detective!
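
Looking at residuals can be sketched with the simplest possible model, in which each value is fitted by the mean of its own group. This is an assumption made for illustration, not the analysis used in the course; the function name `group_mean_residuals` and the data are hypothetical.

```python
from statistics import mean

def group_mean_residuals(values, groups):
    """Residual of each value from the mean of its own group."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    fitted = {g: mean(vs) for g, vs in by_group.items()}
    return [v - fitted[g] for v, g in zip(values, groups)]

# Hypothetical yields by variety (invented for illustration).
yields = [2, 2, 3, 3, 3, 4, 9, 8, 9, 9, 10, 10, 10, 11, 3]
variety = ["A"] * 7 + ["B"] * 8

residuals = group_mean_residuals(yields, variety)
# Rank observations by the size of their residual, largest first.
ranked = sorted(zip(residuals, yields, variety), key=lambda t: abs(t[0]), reverse=True)
for res, y, v in ranked[:2]:
    print(f"variety {v}: yield {y}, residual {res:+.2f}")
```

The largest residuals point back to individual records, which is where the detective work continues.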

22
Swaziland data was for checking
23
Investigating the column called Presence
What does 0 mean? Why are there blanks?
Next steps:
1. Look at the questionnaire
2. Select these records
You are becoming detectives!
24
Codes for the column
Seems clear enough. Zeros and blanks are still a
puzzle
25
Selecting the blank records
Missing also
Too young and all the same
Crop code not recognised
Areas too large
i.e. serious problems with the whole record
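
Checking a coded column and then selecting the odd records can be sketched as follows. The codebook, column names and records here are all hypothetical; the real codes come from the questionnaire.

```python
from collections import Counter

VALID_PRESENCE = {"1", "2", "3"}  # hypothetical codebook, not the survey's real codes

def tabulate(records, column):
    """Frequency of every code in a column, with blanks made visible."""
    return Counter(r[column].strip() or "(blank)" for r in records)

def select_invalid(records, column, valid):
    """Whole records whose code is not in the codebook (blanks included)."""
    return [r for r in records if r[column].strip() not in valid]

# Hypothetical records (invented for illustration).
records = [
    {"id": 1, "presence": "1", "crop": "maize", "area": 0.4},
    {"id": 2, "presence": "2", "crop": "maize", "area": 0.6},
    {"id": 3, "presence": "", "crop": "??", "area": 950.0},
    {"id": 4, "presence": "0", "crop": "beans", "area": 0.3},
    {"id": 5, "presence": " ", "crop": "??", "area": 870.0},
    {"id": 6, "presence": "3", "crop": "beans", "area": 0.5},
]
codes = tabulate(records, "presence")
odd = select_invalid(records, "presence", VALID_PRESENCE)
print(dict(codes))
for r in odd:
    print(r)   # look at the whole record, not just the odd column
```

Printing the whole record matters: in this made-up example the blank codes travel together with unrecognised crop codes and implausible areas.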
26
Dot plot of area by Presence
Odd crop areas were ALL associated with odd codes
for the column PRESENCE
It was found to be a data transfer problem with
one byte missing in these records
27
Checking data quality and EDA
| Where | Why | How | By whom |
|---|---|---|---|
| Before data entry | To ensure complete data set received | Manual check | Supervisor |
| During data entry | To highlight anomalies | Filter, dot plots etc. | Supervisor and helpers |
| Before analysis | Double check | As above | Analyst/statistician |
| During analysis | Remain critical | Residuals | Analyst/statistician |
28
Importance: principles of official statistics
• Principle 2: Professional standards
• It is unprofessional to analyse the data and
report results without exploring critically at
all stages
• Principle 4: Prevention of misuse
• We risk misusing the data unless we explore the
data critically
• Principle 5: Sources of statistics
• Includes a requirement to avoid undue burden on
respondents
• We must process the data fully and effectively.
This needs EDA
• Otherwise the burden imposed on respondents is to
some extent wasted

29
Can you now
• Apply EDA concepts to a large dataset
• Explain the importance of EDA for data checking
and at the start of the analysis
• Relate EDA to the principles of official
statistics

30
Now you can organise the data for analysis, and
then do an exploratory analysis.
We show next how the analysis is easy IF your
objectives are clear.