Quality Assurance - PowerPoint PPT Presentation

Title:

Quality Assurance

Slides: 69
Provided by: wmic
Transcript and Presenter's Notes

Title: Quality Assurance


1
Quality Assurance / Quality Control
  • Kristin Vanderbilt, Ph.D.
  • Sevilleta LTER
  • New Mexico, USA

2
What is data quality?
  • fitness of a dataset or record for a particular
    use

Chrisman, 1991
3
Two Types of Errors
  • Commission: incorrect or inaccurate data are
    entered into a dataset
    • Transcription error
    • Malfunctioning instrumentation
  • Omission: data or metadata are not recorded
    • Difficult or impossible to find
    • Inadequate documentation of data values, sampling
      methods, anomalies in field, human errors

4
Loss of data quality can occur at many stages
  • At the time of collection
  • During digitisation
  • During documentation
  • During archiving

5
QA/QC
  • mechanisms that are designed to prevent the
    introduction of errors into a data set, a process
    known as data contamination
  • Brunt 2000

6
Scientist
Database
  • Quality Assurance
    • Graphics
    • Statistics
    • Documentation
  • Quality Control
    • Datasheet Design
    • Data Entry Constraint checks
    • Documentation

Cost of error correction increases
7
QA/QC Activities
  • Defining and enforcing standards for formats,
    codes, measurement units and metadata.
  • Checking for unusual or unreasonable patterns in
    data.
  • Checking for comparability of values between data
    sets.
  • Documenting all QA/QC activities
  • Brunt 2000

8
Outline
  • QC procedures
    • Designing data sheets
    • Data entry using validation rules, filters,
      lookup tables
  • QA procedures
    • Graphics and Statistics
    • Outlier detection
  • Archiving data

9
The Cost of Errors
  • Error prevention is considered to be far superior
    to error detection, since detection is often
    costly and can never be guaranteed to be 100%
    successful (Dalcin 2004).

10
Quality Control starts with people
  • Assign responsibility for the quality of data to
    those who create them. If this is not possible,
    assign responsibility as close to data creation
    as possible (Redman 2001)
  • Poor training lies at the root of many data
    quality problems.

11
Flowering Plant Phenology Data Collection Form Design
  • Three sites, each with 3 transects
  • On each transect, every species will have its
    phenological class recorded

Sites: Deep Well, Five Points, Goat Draw
12
Data Collection Form Development
What's wrong with this data sheet?
Plant              Life Stage
_______________    ______________________________
_______________    ______________________________
_______________    ______________________________
13
PHENOLOGY DATA SHEET
Collectors: ____________________   Date: ____________   Time: _________
Location: deep well, five points, goat draw
Transect: 1 2 3
Notes: _________________________________________

Plant    Life Stage
ardi     P/G  V  B  FL  FR  M  S  D  NP
arpu     P/G  V  B  FL  FR  M  S  D  NP
atca     P/G  V  B  FL  FR  M  S  D  NP
bamu     P/G  V  B  FL  FR  M  S  D  NP
zigr     P/G  V  B  FL  FR  M  S  D  NP
____     P/G  V  B  FL  FR  M  S  D  NP
____     P/G  V  B  FL  FR  M  S  D  NP

Key:
P/G = perennating or germinating    M  = dispersing
V   = vegetating                    S  = senescing
B   = budding                       D  = dead
FL  = flowering                     NP = not present
FR  = fruiting
14
PHENOLOGY DATA ENTRY INTERFACE
Collectors: Mike Friggens
Date: 16 May 1998
Time: 1312
Location: Deep Well
Transect: 1
Notes: Cloudy day, 3 gopher burrows on transect
15
Validation Rules
  • Control the values that a user can enter into a
    field
  • Example in Access
  • Between 1/1/70 and Date()
  • Coordinate data (latitude)
  • Degrees 0 and
  • Minutes 0 and
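The same kind of rule can be sketched outside Access. This is a minimal Python illustration, not the presenter's implementation. The function names are hypothetical, and the latitude limits (degrees 0 to 90, minutes 0 to just under 60) are assumptions, because the limits on the slide did not survive in this transcript. The date rule mirrors "Between 1/1/70 and Date()".

```python
from datetime import date

EARLIEST = date(1970, 1, 1)  # lower bound taken from the Access example


def valid_date(d, today=None):
    """Accept dates between 1 Jan 1970 and today, inclusive."""
    today = today or date.today()
    return EARLIEST <= d <= today


def valid_latitude(degrees, minutes):
    """Assumed limits: 0 <= degrees <= 90 and 0 <= minutes < 60."""
    return 0 <= degrees <= 90 and 0 <= minutes < 60


print(valid_date(date(1998, 5, 16)))  # a date from the data entry example
print(valid_latitude(34, 21))
print(valid_latitude(34, 75))
```

Rules like these are most useful when they run at entry time, so that an illegal value is rejected before it reaches the table.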

16
Validation rules in MS Access: enter in Table
Design View
17
Look-up Fields
  • Display a list of values from which entry can be
    selected

18
Look-up tables in MS Access: enter in
Table Design View
19
Other methods for preventing data contamination
  • Double-keying of data by independent data entry
    technicians followed by computer verification for
    agreement
  • Use text-to-speech program to read data back
  • Filters for illegal data
  • Statistical/database/spreadsheet programs
  • Legal range of values
  • Sanity checks
  • Unit counts
  • Edwards 2000
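The computer verification step of double-keying can be sketched as a cell-by-cell comparison of the two keyed copies. This is a minimal Python illustration. The function name and the sample rows are hypothetical, and each copy is assumed to be a list of rows in the same order.

```python
def compare_keyings(first, second):
    """Return (row, column, value1, value2) for every cell where two
    independently keyed copies of the same data sheet disagree."""
    mismatches = []
    for row_no, (row_a, row_b) in enumerate(zip(first, second), start=1):
        for col_no, (a, b) in enumerate(zip(row_a, row_b), start=1):
            if a != b:
                mismatches.append((row_no, col_no, a, b))
    if len(first) != len(second):
        # A differing row count is itself a disagreement to resolve.
        mismatches.append(("row count", None, len(first), len(second)))
    return mismatches


# Hypothetical entries keyed by two technicians from the same sheet.
entry_1 = [["ardi", "FL"], ["arpu", "V"], ["atca", "NP"]]
entry_2 = [["ardi", "FL"], ["arpu", "B"], ["atca", "NP"]]
print(compare_keyings(entry_1, entry_2))
```

Each reported mismatch is then checked against the paper data sheet to decide which keyed value is correct.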

20
Flow of Information in Filtering Illegal Data
Raw Data File and Table of Possible Values and Ranges
  -> Illegal Data Filter
  -> Report of Probable Errors
Edwards 2000
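A minimal Python sketch of this flow follows. It is an illustration under stated assumptions, not the filter described by Edwards. The variable names and the mass range are hypothetical. The site names and transect numbers come from the phenology example earlier in the talk.

```python
# Table of possible values and ranges (hypothetical variables and limits).
LEGAL = {
    "site": {"deep well", "five points", "goat draw"},
    "transect": {1, 2, 3},
    "mass_g": (0.0, 50.0),  # assumed legal range, inclusive
}


def illegal_data_filter(records, legal=LEGAL):
    """Return a report of probable errors as (record number, field, value)."""
    report = []
    for n, rec in enumerate(records, start=1):
        for field, rule in legal.items():
            value = rec.get(field)
            if isinstance(rule, tuple):  # numeric range
                low, high = rule
                ok = value is not None and low <= value <= high
            else:  # list of legal values
                ok = value in rule
            if not ok:
                report.append((n, field, value))
    return report


# Hypothetical raw data file contents.
raw = [
    {"site": "deep well", "transect": 1, "mass_g": 21.5},
    {"site": "deep wel", "transect": 2, "mass_g": 19.0},
    {"site": "goat draw", "transect": 4, "mass_g": 210.0},
]
for line in illegal_data_filter(raw):
    print(line)
```

The filter only reports probable errors. Deciding whether a reported value is really wrong is left to a person who knows the data.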
21
Tree Growth Data
22
Spreadsheet column statistics: Peromyscus truei
example
23
Spreadsheet range checks
if(mass50,1,0)
24
Unit counts: was everything measured?
[Figure: study layout showing Web 2 and Web 3, with compass points N, E, S, W]
This study design is replicated at 3 sites.
At each site, ensure that there are 16 quads per
web
25
SQL:
SELECT site, season, web, COUNT(DISTINCT plot, quad)
FROM palmtop.new_npp
WHERE year = 2003
GROUP BY site, season, web
ORDER BY site ASC, season ASC, web ASC;
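The same unit count can be sketched in Python for data held outside a database. This is a minimal illustration. The function names are hypothetical, and the sample rows assume four plots (N, E, S, W) of four quads each per web, which is one way to arrive at the 16 quads named in the study design.

```python
from collections import defaultdict

EXPECTED_QUADS = 16  # quads per web, from the study design


def quads_per_web(rows):
    """Count distinct (plot, quad) pairs for each (site, season, web)."""
    seen = defaultdict(set)
    for r in rows:
        seen[(r["site"], r["season"], r["web"])].add((r["plot"], r["quad"]))
    return {key: len(pairs) for key, pairs in seen.items()}


def incomplete_webs(rows, expected=EXPECTED_QUADS):
    """Return webs whose distinct quad count differs from the expected count."""
    return {k: n for k, n in quads_per_web(rows).items() if n != expected}


# Hypothetical rows: web 2 is complete, web 3 is missing plot S, quad 4.
rows = [
    {"site": "Five Points", "season": "spring", "web": 2, "plot": p, "quad": q}
    for p in "NESW" for q in range(1, 5)
]
rows += [
    {"site": "Five Points", "season": "spring", "web": 3, "plot": p, "quad": q}
    for p in "NESW" for q in range(1, 5) if (p, q) != ("S", 4)
]
print(incomplete_webs(rows))
```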
27
Other methods for preventing data contamination
  • Double-keying of data by independent data entry
    technicians followed by computer verification for
    agreement
  • Use text-to-speech program to read data back
  • Filters for illegal data
  • Statistical/database/spreadsheet programs
  • Legal range of values
  • Sanity checks
  • Properly designing database (taxon names,
    localities, persons are only entered once)
  • Edwards 2000

28
Good database design can improve data quality
  • Atomize data
  • Use consistent terminologies
    • e.g., not both FLOWER COLOUR = RED and
      FLOWER COLOUR = CRIMSON
  • Document changes
    • Record level: what changes have been made and by
      whom
    • Dataset level

29
Atomize data: one cell, one piece of information
30
Avoid Domain schizophrenia
  • Fields used for purposes for which they were not
    intended

Dalcin 2000
31
Document changes to data
  • Why?
    • Avoid duplication of error checking
    • Users can determine fitness of data for use
  • How?
    • Include data quality/accuracy fields in database
      design
    • Develop an audit trail, so that changes can be
      undone: record the who, when, how and why of
      record updates
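An audit trail of this kind can be sketched as a log that keeps the old value with every change. This is a minimal Python illustration. The record layout, field names and function names are all hypothetical, and a real database would normally keep the log in its own table.

```python
from datetime import datetime, timezone

audit_log = []  # one entry per change, oldest first


def update_value(record, field, new_value, who, why, how="manual edit"):
    """Change one field and log who, when, how and why, keeping the old value."""
    audit_log.append({
        "record_id": record["id"],
        "field": field,
        "old": record[field],
        "new": new_value,
        "who": who,
        "when": datetime.now(timezone.utc).isoformat(),
        "how": how,
        "why": why,
    })
    record[field] = new_value


def undo_last(record):
    """Undo the most recent logged change to this record, if there is one."""
    for i in range(len(audit_log) - 1, -1, -1):
        entry = audit_log[i]
        if entry["record_id"] == record["id"]:
            record[entry["field"]] = entry["old"]
            del audit_log[i]
            return True
    return False


# Hypothetical phenology record corrected after checking the data sheet.
rec = {"id": 1, "species": "ardi", "stage": "FL"}
update_value(rec, "stage", "FR", who="data manager",
             why="transcription error found on data sheet")
print(rec["stage"], len(audit_log))
```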

32
Metadata for bad data
  • Variable 9 Name: Average Wind Speed
  • Label: Avg_Windspeed
  • Definition: Average wind speed during the hour
    at 3 m
  • Units of Measure: meters/second
  • Precision of Measurements: .11 m/s
  • Range or List of Values: 0-50
  • Data Type: Real
  • Column Format: .
  • Field Position: Columns 51-58
  • Missing Data Code: -999 (bad), -888 (not measured)
  • Computational Method for Derived Data: na
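Metadata like this can drive an automatic check. The minimal Python sketch below uses the missing data codes and the 0-50 range from the metadata above. The function name and the sample readings are hypothetical.

```python
MISSING_CODES = {-999.0: "bad", -888.0: "not measured"}  # from the metadata
VALID_RANGE = (0.0, 50.0)  # m/s, from "Range or List of Values"


def classify_wind_speed(value):
    """Return (value or None, flag) for one Avg_Windspeed reading."""
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    low, high = VALID_RANGE
    if not low <= value <= high:
        return value, "out of range"
    return value, "ok"


readings = [3.2, -999, 0.0, -888, 61.4]  # hypothetical hourly values
for v in readings:
    print(v, classify_wind_speed(v))
```

Keeping "bad" and "not measured" as separate flags preserves the distinction that the two missing data codes were designed to record.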

33
Flagging Data Values
34
Outline
  • QC procedures
    • Designing data sheets
    • Data entry using validation rules, filters,
      lookup tables
  • QA procedures
    • Graphics and Statistics
    • Consistency checks
    • Unusual patterns
    • Outliers
  • Archiving data

35

Consistency check to identify sensor errors:
comparison of data from three meteorology
stations, Sevilleta LTER
36
Identification of sensor errors: comparison of
data from three met stations, Sevilleta LTER
37
Manual QC/QA Flagging
Sheldon, GCE LTER
38
Outliers
  • An outlier is an unusually extreme value for a
    variable, given the statistical model in use
  • The goal of QA is NOT to eliminate outliers!
    Rather, we wish to detect unusually extreme
    values.
  • Attempt to determine if contamination is
    responsible and, if so, flag the contaminated
    value.
  • Edwards 2000

39
Methods for Detecting Outliers
  • Graphics
    • Scatter plots
    • Box plots
    • Histograms
    • Normal probability plots
  • Formal statistical methods
    • Grubbs' test
  • Edwards 2000

40
X-Y scatter plots of gopher tortoise
morphometrics (Michener 2000)
41
Box Plot Interpretation
  • Inter-quartile range: IQR = Q(75%) - Q(25%)
  • Upper adjacent value = largest observation ≤
    (Q(75%) + (1.5 × IQR))
  • Lower adjacent value = smallest observation ≥
    (Q(25%) - (1.5 × IQR))
  • Extreme outlier: more than 3 × IQR beyond upper or
    lower adjacent values
  • Figure labels: inter-quartile range, median
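The box plot quantities can be computed directly. This is a minimal Python sketch under stated assumptions. The function name and the sample data are hypothetical, and the quartiles use the "inclusive" method of the standard library, which is only one of several conventions, so other software may give slightly different quartiles.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Compute IQR, adjacent values, outliers and extreme outliers."""
    data = sorted(values)
    q25, median, q75 = quantiles(data, n=4, method="inclusive")
    iqr = q75 - q25
    # Adjacent values: most extreme observations still inside the 1.5 x IQR fences.
    upper_adjacent = max(v for v in data if v <= q75 + 1.5 * iqr)
    lower_adjacent = min(v for v in data if v >= q25 - 1.5 * iqr)
    outliers = [v for v in data if v > upper_adjacent or v < lower_adjacent]
    # Extreme outliers: more than 3 x IQR beyond the adjacent values.
    extreme = [v for v in outliers
               if v > upper_adjacent + 3 * iqr or v < lower_adjacent - 3 * iqr]
    return {
        "q25": q25, "median": median, "q75": q75, "iqr": iqr,
        "upper_adjacent": upper_adjacent, "lower_adjacent": lower_adjacent,
        "outliers": outliers, "extreme_outliers": extreme,
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical measurements
summary = box_plot_summary(sample)
print(summary)
```

As with any outlier screen, the values returned here are candidates for inspection and flagging, not for automatic deletion.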
42
Box Plots Depicting Statistical Distribution of
Soil Temperature
43
Normal density and Cumulative Distribution
Functions
[Figure annotations: 95%, 99%]
Edwards 2000
If data are normally distributed, a data point
falling more than 3 standard deviations away from
the mean is an outlier
44
Normal Plot of 30 Observations from a Normal
Distribution
Edwards 2000
45
Normal Plots from Non-normally Distributed Data
Edwards 2000
46
Statistical tests for outliers assume that the
data are normally distributed.
CHECK THIS ASSUMPTION!
47
Grubbs' test for outlier detection in a
univariate data set
Tn = (Yn - Ybar)/S, where Yn is the possible
outlier, Ybar is the mean of the sample, and S
is the standard deviation of the sample.
Contamination exists if Tn is greater than
T(.01, n), the critical value at the 0.01
significance level for a sample of size n.
Grubbs, Frank (February 1969), Procedures for
Detecting Outlying Observations in Samples,
Technometrics, Vol. 11, No. 1, pp. 1-21.
48
Example of Grubbs' test for outliers: rainfall
in acre-feet from seeded clouds (Simpson et al.
1975)
  • 4.1 7.7 17.5 31.4 32.7 40.6 92.4 115.3 118.3
    119.0 129.6 198.6 200.7 242.5 255.0 274.7 274.7
    302.8 334.1 430.0 489.1 703.4 978.0 1656.0 1697.8
    2745.6
  • T26 = 3.539 > 3.029: contaminated
  • Edwards 2000

But Grubbs' test is sensitive to non-normality
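The test statistic for the rainfall example can be reproduced with a few lines of Python. This is a minimal sketch. The function name is hypothetical, the sample standard deviation (n - 1 denominator) is assumed, and the critical value 3.029 is taken from the slide rather than computed.

```python
from statistics import mean, stdev

# Rainfall in acre-feet from seeded clouds, as listed on the slide.
rainfall = [
    4.1, 7.7, 17.5, 31.4, 32.7, 40.6, 92.4, 115.3, 118.3,
    119.0, 129.6, 198.6, 200.7, 242.5, 255.0, 274.7, 274.7,
    302.8, 334.1, 430.0, 489.1, 703.4, 978.0, 1656.0, 1697.8,
    2745.6,
]


def grubbs_statistic(values):
    """Tn = (Yn - Ybar) / S, with Yn taken as the largest observation."""
    return (max(values) - mean(values)) / stdev(values)


CRITICAL_T = 3.029  # critical value quoted on the slide for n = 26

t26 = grubbs_statistic(rainfall)
print(len(rainfall), round(t26, 2), t26 > CRITICAL_T)
```

The statistic comes out at about 3.54, in line with the 3.539 on the slide, so the largest value is declared contaminated. Because the data are strongly skewed, that conclusion depends on the normality assumption discussed next.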
49
Checking Assumptions on Rainfall Data
Skewed distribution: Grubbs' test detects
contaminating points
Normal distribution: Grubbs' test detects no
contamination
Edwards 2000
50
References about outliers
  • Barnett, V. and Lewis, T. 1994. Outliers in
    Statistical Data. John Wiley & Sons, New York.
  • Iglewicz, B. and Hoaglin, D. C. 1993. How to
    Detect and Handle Outliers. American Society for
    Quality Control, Milwaukee, WI.

51
QA/QC in the Lab Using Control Charts
52

Laboratory quality control using statistical
process control charts
  • Determine whether analytical system is in
    control by examining:
    • Mean
    • Variability (range)

53
Mean Control Chart
N concentration in sample of known concentration
UCL = Mean + 3 SD
[Figure: control chart of N concentration (y-axis) against time (x-axis), with UCL, mean, and LCL lines]
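A mean control chart can be sketched numerically. This minimal Python illustration assumes the control limits are the mean plus and minus 3 SD of a set of baseline runs on the known-concentration sample. The lower limit, the function names and all concentration values are assumptions for illustration.

```python
from statistics import mean, stdev


def control_limits(baseline):
    """Return (LCL, mean, UCL), assuming limits of mean +/- 3 SD."""
    m, sd = mean(baseline), stdev(baseline)
    return m - 3 * sd, m, m + 3 * sd


def out_of_control(measurements, baseline):
    """Return (run number, value) for runs outside the control limits."""
    lcl, _, ucl = control_limits(baseline)
    return [(i, x) for i, x in enumerate(measurements, start=1)
            if x < lcl or x > ucl]


# Hypothetical N concentrations measured on a standard of known concentration.
baseline = [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0]
new_runs = [5.05, 4.95, 5.9, 5.0]
print(control_limits(baseline))
print(out_of_control(new_runs, baseline))
```

A run outside the limits suggests the analytical system is out of control, so samples analyzed in that batch would be candidates for re-analysis.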
54
Linear trend
55
Control Charts for QA of weather data
Hourly air temperatures (1-24) for each month
(calculated for many years of historical data)
Hourly air temperature, 26th day of each month
2002
Eching and Snyder 2005
56
Daily average air temperature for each month
Daily average air temperature for each day in 2002
57
North Temperate Lakes QA Algorithm
Figure 7. Hourly wind speed plot of 1997 and 2000
with 0 speed highlighted in dots
Hu and Benson, unpublished
58
Number of wind speeds in 0-1 bin / total number of
wind speeds recorded
Figure 9. Hourly average wind speed PDF by month
for 1995.
59
Average frequency of 0 wind speed in each month
based on 1995 data 3 SD
Figure 10. Detecting abnormal wind speed
distribution.
60
Outline
  • QC procedures
    • Designing data sheets
    • Data entry using validation rules, filters,
      lookup tables
  • QA procedures
    • Graphics and Statistics
    • Consistency checks
    • Unusual patterns
    • Outliers
  • Archiving data

61
Archiving high quality data for easy reuse
  • Avoid using the same column title more than once
  • Avoid inconsistencies (e.g. different date ranges
    in title vs. the data)

Figure courtesy of Christine Laney, JRN LTER
62
Avoid formatting errors, cryptic data, and
metadata interspersed with the data
Figure courtesy of Christine Laney, JRN LTER
63
The details
  • Dates as an example of what not to do
    • 2-digit years
    • Range of dates in a single cell (e.g.,
      02/01-03/2006 or 02/01/2006,02/03/2006)
    • Date with a letter appended to the end (e.g.,
      02/01/1999A)
    • Single-digit day and month, especially when there
      are no delimiters between month, day, and year
      (e.g., 1212005)

courtesy of Christine Laney, JRN LTER
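Problem dates like these can be caught before archiving with a strict format check. This is a minimal Python sketch. The YYYY-MM-DD target format is an assumption (the slides do not name a preferred format), and the function name is hypothetical.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed target format


def check_date(text):
    """Return None if text is an unambiguous YYYY-MM-DD date, else a reason."""
    if not ISO_DATE.fullmatch(text):
        return "not in YYYY-MM-DD form"
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return "not a real calendar date"
    return None


# The bad examples from the slide, plus one good and one impossible date.
for text in ["2006-02-01", "02/01-03/2006", "02/01/1999A",
             "1212005", "99-02-01", "2006-02-30"]:
    print(text, "->", check_date(text) or "ok")
```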
64
Preferred data formats for synthesis
  • Simple ASCII, delimited with commas, spaces, tabs,
    etc., with headers, or very simple Excel
    spreadsheets. If fixed-width, give widths and
    spaces.
  • Metadata in separate file
  • All data in single file, not separated by year.
    If not possible, each file in exactly the same
    format.
  • Complex formatting systems, like multiple sheets or
    several tables in one sheet, are more difficult
    to interpret and extract information from.

65
More archival suggestions
  • Assign descriptive file names
    • File names should be unique and reflect the file
      contents
    • Bad file names
      • Mydata
      • 2001_data
    • Good file name
      • Sevilleta_LTER_NM_2001_NPP.asc
      • Sevilleta_LTER is the project name
      • NM is the state abbreviation
      • 2001 is the calendar year
      • NPP represents Net Primary Productivity data
      • asc stands for the file type (ASCII)
  • Assign descriptive data set titles (similar to
    file names)
    • Shrub net primary productivity at the Sevilleta
      LTER, New Mexico, 2001

66
Best practices reference
  • Cook, R. B., R. J. Olson, P. Kanciruk, and L. A.
    Hook. 2001. Best practices for preparing
    ecological and ground-based data sets to share
    and archive. Ecol. Bulletins 82:138-141.

67
References
  • Michener and Brunt (2000) Ecological Data:
    Design, Management and Processing. Blackwell
    Science.
    • Edwards (2000), Data Quality Assurance
    • Brunt (2000) Ch. 2, Data Management Principles,
      Implementation, and Administration
    • Michener (2000) Ch. 7, Transforming Data into
      Information and Knowledge
  • Redman, T.C. 2001. Data Quality: The Field Guide.
    Boston, MA: Digital Press.
  • Chapman, A.D. 2005a. Principles of Data Quality.
    Report for the Global Biodiversity Information
    Facility 2004. Copenhagen: GBIF.
    http://www.gbif.org/prog/digit/data_quality/data_quality
  • Dalcin, E.C. 2005. Data Quality Concepts Applied
    to Taxonomic Databases. Ph.D. Dissertation,
    University of Southampton, UK.
    (http://www.dalcin.org/eduardo/downloads/edalcin_thesis_submission.pdf)

68
Questions?