Title: Auto-Validating Your Data Store: A do-it-yourself approach to data integrity and anomaly detection. Evan Davies Office of Strategic Planning and Analysis The College of William and Mary in Virginia
1Auto-Validating Your Data Store A
do-it-yourself approach to data integrity and
anomaly detection.Evan DaviesOffice of
Strategic Planning and AnalysisThe College of
William and Mary in Virginia
2Overview
- Institutional researchers increasingly rely on
data marts, stores, and warehouses for management
information. - Given the perpetually developing status of
these environments, both commercial and
institutional, you can spend significant time
dealing with data that does not meet even
shifting standards for table logic, variable
conventions, and values. - Such anomalies can stop production programs or
lead to inaccurate information until detected.
3Does this ever happen to you?
- NOTE Table WORK.ENROLL created, with 8241 rows
and 19 columns. ? - NOTE Table WORK.PERSON created, with 8241 rows
and 13 columns. ? - NOTE Table WORK.MAJOR created, with 8253 rows
and 30 columns. ? - NOTE Table WORK.MAJOR_PERSON created, with 8264
rows and 30 columns. ? - Who are these extra people? Why are they in here?
- ????????
4Agenda
- This presentation demonstrates a simple yet
sophisticated way of using SAS Enterprise Guide
to check tables, variables, and values
automatically to find out if that data meets IR
standards and premises before you start
analytical work (and to let others know that
things may need fixing).
5Things You Should Know
- Simple is a relative term, as is
sophisticated - This involves more coding than mouse clicking
- You should have at some concepts of SAS coding,
SQL, and relational databases - To make this work back at your campus, you need
to know (or find out) how to access your data - If you have significant structured programming
and SQL experience, please refrain from laughing
out loud. Snickering is acceptable.
6The College of William Mary
- The only royally chartered colonial institution,
1693, by King William III and Queen Mary II - Making it the second oldest college in the United
States - Phi Beta Kappa, the first Greek honor society,
was founded here in 1776 - Became state-supported in 1906 and coeducational
in 1918 - The Alma Mater of George Washington and Thomas
Jefferson , as well as Jon Stewart and Secretary
of Defense Robert Gates - Named one of Intel's 50 Most Unwired College
Campuses for our campus-wide wireless network - The Colonial Campus section of the 1,200 acre
campus is restored to its 18th-century appearance
7The Wren Building (1700)The Oldest Academic
Building in Continuous Use in the U.S.
8The College Today
- 5,800 undergraduates and 1,950 graduate students
from all 50 states and 30 foreign countries - 22 percent are students of color
- 79 percent of freshmen graduated in the top ten
percent of their class - Highest SAT middle 50th range of all public
institutions in Virginia - 111 student-faculty ratio
- WM has more recipients of the Commonwealth's
Outstanding Faculty Award than any other
institution - 5 undergraduate and graduate schools Arts
Sciences, Business, Education, Law, and Marine
Science - 36 undergraduate programs, 12 masters , doctoral,
and professional degrees - is a Highly Selective Public Liberal Arts
University
9The College of William and Mary
External System Interfaces/Entities
College Systems that Interface with
Administrative Computing Systems
Bank ACH Direct Deposits Check
Verification Cancelled Checks Lock-Box Payments
Federal Govt IRS/SSA W2s/1099s
VA APA
VA SCHEV
VA Dept of Taxation W2s/1099s CDS Vendors
FRS Financial Record System Accounts
Payable Purchasing General Ledger Budget CARS
Interface/Recon EDI CDS Vendors Financial
Statements Grants Account Fixed Assets (6/99)
Old SIS Student Info System Residence
Life Student Billing Transcripts
Office Supply Vendor
Office Supplies Systems
Cash Receipts System
VA DOA CARS System Payroll Acct EDI to WM CDS
Pymts EDI Vendor Pymts Expense/Cash
Acct Agency/CPRS Tape
SAT Scores
Warehouse Systems
FAACS (Old Fixed Assets)
Data History We had many homegrown systems, some
based on Information Associates architecture,
but extensively modified.
ACT Scores
GRE Scores
Work Order System
Federal Dept Labor Employment Compliance
MAT Scores
MCAT Scores
VA DPT BES System Benefits Eligibility
WORCS Student Web
DARS
NCAA
VA VRS Retire, Life Ins, Opt Life
HRS Human Resource System Applications Personnel
Mgmt Position Control Benefits Mgmt Work
Study Payroll/Account A21 Certification CARS
Interface
Faculty Salary Tracking
New SIS Student Info System Prospects Admissions
Student Records Registration Course
Schedule PO Box Management
Old Financial Aid
Prospective Student Search
VA VEC Unemployment Benefits
Schedule 25 Resource 25
Leave Account
FSA Administrator
Petersons Pros. Student Data
Athlete Tracking
Old Campus Police
TIAA - CREF Annuity Eligibility
1500 Hour Tracking
Web Applications (IP)
Benefits Vendors
National Clearing House Loan Verification
VA DPT PMIS Personnel Mgmt
College Systems that Interface with
Administrative Computing Systems
VA DPB Budget Admin
SSA/FICA Deposits Payroll Office
Checks 1-2-3 Pay Checks Direct Deposit
Stubs W2s AP checks 1099s General Accounting
Office
New Campus Police
Parking System
SWEM Patron Info System
Mysoft Call Account Sys Telecom
Power FAIDS New Financial Aid
PELL Grants
Social Security Administration
ISIR F.A. Input
Identification System Food Services
US Savings Bonds Admin Payroll Office
Student Health System
Door Access System
Alumni Development System
Fed Reserve US Savings Bonds
Wiz Kid
10Recent Data History
v.6, v.7, v.8 v.1, v.2 v.1, v.2,
v.3.
- After false starts, in 2003 we bought a large
Banner enterprise system, and then added a
datamart, and then a data store. We are now five
years into our two year installation period. - Or put another way, we are current on
version(s) of 8.2 and 3.1, with new version(s)
around the corner. - When the IR staff start to understand the
relationships and foibles of a particular
version, it is time to upgrade to a newer version
in which some things are fixed and some other
things are broken differently enabled.
11Things To Realize
- SungardHE Banner products are not a bad system.
Nor are any other commercial vendors products. - Any enterprise-level system with a data store is
a permanently evolving, almost organic entity, - with multiple input opportunities for breaking
constraints and premises - and for finding new ways to induce unexpected
results through changing business rules,
institutional decentralization, flexibility, and
collegiality.
12Data Integrity in Pre- and Post-Enterprise Systems
- Old History I.T. used to do a system General
Edit, with general meaning edited for
operational purposes, not analytical purposes.
If the data got the payroll to run today or
allowed you to admit a student, it was valid. - New History Data validity is still measured
against operational standards.
13Data Integrity Post-Enterprise System
- And I.T. is now even busier than ever just making
the production system run, without markedly
larger numbers of staff to commit to the data
warehousing activity. They actually do less
general editing because more transactions are
interactive rather than process-oriented. - At the end of a day, data is off-loaded into the
data store, in different forms and with different
premises from how the data is held from the
production system.
14Data Integrity Pre- and Post-Enterprise System
- This means that there is now an even bigger split
between operational and analytical data
integrity, especially in terms of the differing
forms of the data. - The data store is not as rigorously evaluated for
data integrity or meaning precisely because it is
not production. And since the operational
offices of the institution are satisfied with
their pieces of the data pie, everything is
working fine.
15The Institutional Research Role
- IR has always had premises for data reporting
that go beyond general edit. - We analyze data relationships that make up the
whole picture. - We work in the aggregate, rather than by
individual transaction. - We are ideally poised to discover the anomalies
that occur between multiple institutional sources
-- between one offices interpretation of a
transaction and another offices idea of the same
information.
16A Data Stores Added Task to IR
- IR now has to do the same comprehensive
validation tasks that we used to do, plus
identify and deal with the newer introduced
problem of tables that violate their own premises
or have unexpected values due to complex dynamics
among
UPGRADE
17The Complex Dynamics
- imperfect translation of data between the
production transaction system and the data
store/warehouse, - vendor maintenance or institutional business
rules changes that intentionally induce changes
in tables. - generally caused by the inability to predict all
of the systemic results of making unspecified or
unimagined changes in a system, sometimes known
as the Butterfly Effect. - the continuous upgrade cycles applied to the
system, - and data that results from imperfectly recorded
transactions in uncertain environments with less
than adequate collaboration and training.
18So What Can We Do?
- Write a program to
- check the logic premises of frequently used
tables - check for missing or out-of-range values in data
that affects IR - run the program frequently to uncover problems in
time for the current census and prevent future
term anomalies. Keep the results for
documentation. - communicate findings promptly and efficiently in
order to effect change - build in flexibility to test different things at
different times in different ways
19Limitations in Place
- Do it on your own, since it is for IR purposes.
Remember, the data already meets everybody elses
needs. ? - Use existing resources. Keep it simple.
- Someone else is going to have to be notified to
deal with the data anomaly once it is identified. - Dont weaponize the process. This is why we
choose to have anomalies, not mistakes.
20How To Accomplish?
- The use of SAS EG on a PC platform allows for
the remote submission of a premise and
data-checking program during off-hours, when
demand on the system is lessened. - It also allows a convenient environment in which
to store the program and results, access
validation and history tables, and send automatic
e-mails to interested parties such as admissions,
registrar, IR and IT staff.
21- SAS Enterprise Guide, a powerful Microsoft
Windows client application that provides a guided
mechanism to exploit the power of SAS and publish
dynamic results throughout your institution. Its
the preferred interface to SAS for analysts,
statisticians and programmers SAS Enterprise
Guide saves time by automatically generating
computer code with an easy point-and-click
interface. - Think of it as a graphical office environment for
the SAS language. As Microsoft Outlook ties MS
Office products together, Enterprise Guide has a
similar role for SAS
SAS and all other SAS Institute Inc. product or
service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and
other countries. indicates USA registration.
22Still not clear on SAS/EG ?
- It is an environment in which SAS 9.1 runs
- It can bring together data views and any type of
files from any network data servers, including
Oracle. - It contains SAS program(s), the larger project,
notes, and a graphical description of how the
project, processes, and programs relate to each
other. - It documents how any processes or programs have
been run, and the results. - It can generate code for procedures and
datasteps. - It inherits libraries, autoexecs, etc. from 9.1
23We Use Enterprise Guide To
- Schedule and launch the anomaly tracking job
- Provide a comprehensive project environment for
IR staff to be able to visit independently and
add new anomaly checks - Be able to see and modify the associated tables
and data in one place - Provide our novice IR staff a more centralized
and friendlier view of the process
24What Does It Look Like?
25(No Transcript)
26(No Transcript)
27(No Transcript)
28(No Transcript)
29(No Transcript)
30(No Transcript)
31Can You Just Use SAS Itself?
- YES!?
- Provided you schedule the job through MS Task
Scheduler and have no desire for the previously
mentioned features or facilities. - Programmatically, all process features are part
of base SAS v9.1 on an XP or Vista platform.
32Beginning Steps
- Survey yourself and other staff (IR and other
offices) to make a preliminary list of tables and
values that have issues - Establish the names, emails, and hierarchy of
those who will receive automated communications. - Make a calendar of when you want to test certain
items due to different functional cycles
(admissions, registration, etc)
33Lets Go Coding
- I cant show you the entire code for my anomaly
program today. ? - It would need to be modified to fit your data
structures and needs anyway. - Instead I will concentrate on imparting some
- Key Program Ideas
- SQL techniques
- to enable you to develop your own program
34The Overall Design Concept
35The Overall Design Concept (2)
36(No Transcript)
37Set up a SAS program which launches automatically
using SAS/EG
38- let term_start_limit200325
- let term_end_limit 201030
- let highestssn'772'
- let es_validset'EL','MW','WD','WM','WW' /all
valid statuses encountered by enrolled students
AFTER enrollment/dropadd/ - let es_bad'QW','WB','AW' /all bad or
ineligible statuses not to be counted by census
date/ - let tooold'1900' out-of-range birthday year
cutoff
39- Macro variable assignment increases flexibility
as table names change. - let table_enrollenrollment
- let table_academic_studyacademic_study
- let table_addcurrentaddress_current
- let table_prevedslotprevious_education_slot
- let table_scheduleschedule_offering
- let table_course_catalogcourse_catalog
- let table_coursestudent_course
40- format table for anomalies
- proc format
- value anom
- 1 'withdrawn with classes'
- 2 'duplicate person recs '
- 3-4 'missing value '
- 5 'ssn out-of-range '
- 6 'Two ids, one ssn '
- 7 'dup recs course_catalo '
-
-
- run
41(No Transcript)
42- Create table todayterm as
- select stvterm_code as studyterm,
- datepart(stvterm_start_date) as begd ,
- datepart(stvterm_end_date) as endd
- from (your data source)
- where today() gt datepart(stvterm_start_date) and
- today() lt datepart(stvterm_end_date)
- set a global study term variable based on value
in todayterm - data _null_
- set todayterm
- call symput('study',studyterm)
43- Use up to 3 variables to show people what is
anomalous about any particular situation. Name
them A, B, and C. The variables you will pass to
these variables will differ with each problem. - You will need to put both the value and the name
of the variable into your anomaly report, plus
some identifier(s) for the student or entity,
plus some anomaly details.
44- Macro to help transfer of anomalies
- macro keep
- (keep id person_uid aval aval_desc
- bval bval_desc cval cval_desc
- anom studyterm
- first_anom_date suspend_date data_own)
- mend keep
45- macro pass (aval,bval,cval,dsn,anom,studyter
m,suspend_date,data_own) - aval_desc "aval"
- bval_desc "bval"
- cval_desc "cval"
- aval put(aval,25.)
- bval put(bval,25.)
- cval put(cval,25.)
- anomanom
- studytermput(studyterm,6.)
- first_anom_datetoday()
- suspend_datesuspend_date
- data_owndata_own
- mend pass
PASSES THE NAME OF THE VARIABLE
PASSES char VALUE OF THE VARIABLE
PASSES OTHER VALUES
46- proc sql
- connect to odbc as mydb (datasrc"datasrc"
useruser passwordpassword) - create table addressc as select from
connection to mydb - (
- select m.PERSON_UID, m.id, n.address_type,
n.postal_code, n.city, n.county,
n.state_province, n.nation - from table_enroll m
- inner join
- table_addcurrent n
- on m.person_uid n.entity_uid
- where m.ACADEMIC_PERIOD in (study) and
- ((m.ENROLLED_IND'Y' and
m.REGISTERED_IND'Y' ) or - (m.ENROLLED_IND'Y' and
m.ENROLLMENT_STATUS in (es_set))) - and n.address_type in ('IN', 'P1', 'MA')
- )
47- from table_enroll m
- inner join
- table_addcurrent n
- on m.person_uid n.entity_uid
- where m.ACADEMIC_PERIOD in (study) and
- and n.address_type in ('IN', 'P1', 'MA')
STUDENTS THIS TERM
ADDRESSES TO CHECK
ALL ADDRESSES
48Let your server do the heavy data work
Result Set
Oracle
SQL call
49- The most useful technique for detecting a table
that violates record premises is by joining the
table back to a copy of itself that has been
summarized by the number of expected rows,
keeping only those records that dont meet
expectations. - A SQL statement that uses a count() function, a
group by clause, and a having clause does
this job effectively.
50- select l. from
- (select PERSON_UID, ID, ACADEMIC_PERIOD,
PROGRAM, - PRIMARY_PROGRAM_IND, ADMISSIONS_POPULATION
from table_academic_study - where academic_period in (study) ) l
- inner join
- (select person_uid
- from table_academic_study
- where academic_period in (study)
- group by person_uid, program
- having count(person_uid) gt 1 ) r
- on l.person_uid r.person_uid
51- select
- person_uid
- from table_academic_study
- where academic_period in (study)
- group by person_uid, program
- having count(person_uid) gt 1
-
PERSON_UID PROGRAM ACADEMIC_PERIOD COUNT
1234 BA-GOVT 200910 1
1234 BA-GOVT 200910 1
PERSON_UID PROGRAM ACADEMIC_PERIOD COUNT
1234 BA-GOVT 200910 2
52- select l.person_uid, r.person_uid as other_uid,
- l.tax_id, r.tax_id as other_tax_id,
l.full_name_lfmi - from person l inner join person r
- on l.TAX_ID r. TAX_ID
- AND l.person_uid ltgt r.person_uid
- where l.tax_id is not null
- and r.tax_id is not null
- This SQL will find two different university ids
that share the same SSN. This generally occurs
when the institution has issued two ids to the
same person without an adequate search of
records. The faster this is spotted, the better. -
53- In addition to testing table logic, once you have
the datasets established, any variety or
combination of values can be tested. Here are
four conditions to get you started thinking about
what can be tested - - missing values
- if citizenship_type '' then do
- - withdrawn with classes
- if enrolled_ind 'Y' and registered_ind 'Y'
and enrollment_status 'WB' then do
54- if state_province in ('AA','AE','AP','PR','VI','AL
','AK','AZ','AR','CA', 'WI','WY' ) and nation gt
'' - or
- state_province in ('AB','BC','MB','NB','NL','NT',
'NS','NU','ON','PE','QC','SK','YT') and nation
'CA' - or
- nation 'CA' and state_province not in
('AB','BC','MB','NB','NL','NT','NS','NU','ON','PE'
,'QC','SK','YT') - or
- state_province in ('FC' ,'HK', 'RQ', 'XX') then
do
55- - ssn out of range
- if tax_id '' then
- do
- ssnverf indexc(substr(tax_id,1,9),'
','-abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQ
RSTUVWXYZ') - if substr(tax_id,6,4) '0000' or
substr(tax_id,1,3) lt '001' - or substr(tax_id,4,2) '00' or
substr(tax_id,1,3) gt highestssn or ssnverf gt 0
then do
56- data T_person_v keep
- set person
- test3 - missing value
- anom0
- if citizenship_type '' then
- do
- pass(avalcitizenship_type, bvalfull_name_lfmi,
cval, dsnperson, anom3, studytermstudy,
suspend_date., data_own'reg') - end
- if anom gt 0
- run
The KEEP Macro in place
The Anomaly Testing
The PASS Macro in place
57- test6 - two ids, one ssn
- data T_person_v3 keep
- set anom_person2
- pass(avaltax_id, bvalother_tax_id,
cvalother_id, dsnperson, anom6,
studytermstudy, suspend_date.,data_own'reg') - run
The KEEP Macro
PASS
58- Nomenclature of Test Datasets -
- T_area_XN
- where
- T Test
- area broad anomaly category
- X v(ariable based)
- a(ssumption of table logic violated)
- N incremental test set number
- dataset T_person_v3 is the third datset testing
the demographic (person) table for variable-based
anomalies such as missing values or incompatible
statuses
59- create temp work dataset for todays
transactions - data today
- length aval_desc bval_desc cval_desc 25
- set
- T_addressc_v1
- T_addressc_v2
- T_addressc_a
- T_acadstudy_a
- T_ccat_a
- T_person_a
- T_person_v1
- T_person_v2
- T_person_v3
- all the transaction sets
- run
60- Remember to unduplicate all the duplicated
records found! You only need one example record
per anomaly. - proc sort NODUPKEY datatoday
- by person_uid anom aval bval cval studyterm
- run
- Failure to do so may result in multiple joins in
the next step, when you update the master dataset.
61Add New Records Keep Matching RecordsDelete
Non-Matches
Data master Update master today------------or-
------------Proc SQL Select from oldmaster
l right join today r on criteria
Todays Set
Master Set
62- proc sql
- create table newmaster as
- select r.person_uid, r.id, r.anom, r.aval,
r.aval_desc , other variables,
coalesce(l.first_anom_date, r.first_anom_date) as
first_anom_date - from
- oldmaster l right join today r
- on
- l.person_uid r.person_uid and l.anom r.anom
- and l.aval r.aval and l.bval r.bval and l.cval
r.cval - and l.studyterm r.studyterm
- quit
63data reg ban ir adm grr soe bur law set
master output ir for all records if
data_own'reg' then output reg else if data_own
'ban' then output ban else if data_own
'adm' then output adm else if data_own 'grr'
then output grr else if data_own 'soe' then
output soe else if data_own 'bur' then output
bur else if data_own 'law' then output law
64proc export datareg dbmsexcel2002
outfile 'g\temp\reg.xls' replace proc export
databan dbmsexcel2002 outfile
'g\temp\ban.xls' replace proc export datair
dbmsexcel2002 outfile 'g\temp\ir.xls'
replace
65filename reports email "esdav2_at_wm.edu" data
_null_ file reports set
departments put '!EM_TO! ' name
put '!EM_SUBJECT! Report for ' dept put
Hi fname -' put 'Here is the latest
report of anomalies for the' dept'.' if
dept'ban' then put '!EM_ATTACH!
g\temp\ban.xls' else if dept 'reg' then
put '!EM_ATTACH! g\temp\reg.xls'
else if dept 'ir' then put '!EM_ATTACH!
g\temp\adm.xls' put '!EM_SEND!'
put '!EM_NEWMSG!' put
'!EM_ABORT!' run
66Success!
Subject line
Attached xls
Name
Department
67Technical Hurdles Along The Way
- Dont use your Outlook mailer. Specify your SMTP
mail service as the mailer. You may have to pass
your authentication through the SAS sasv9.cfg
file to allow STMP mailing. - Depending on your network, you may have to use
Cscript as the keyword in the scheduler to
launch the program, rather than the implicit
Wscript. C stands for console.
68Lessons Learned
- Send all output to yourself for several days to
review, before allowing it to be sent out
automatically - Send lower priority anomalies infrequently, and
high priority ones weekly or daily - Send notice of table violations infrequently to
IT, once they have identified a problem and
resolution path. (Dont bug them if they, too,
are waiting for a vendor patch or fix)
69Lessons Learned
- Be aware of the length of time the job takes to
execute. You may need to adjust what and when you
test if it starts taking too much time - Bring your users into the process by asking
them what you can do differently to help them. Do
they need another variable to help isolate
problems? Alternate ID?
70Auto-Validating Your Data Store Evan
DaviesPresentation available online
athttp//web.wm.edu/ir/conferencepres.html
71(No Transcript)