Auto-Validating Your Data Store: A do-it-yourself approach to data integrity and anomaly detection. Evan Davies Office of Strategic Planning and Analysis The College of William and Mary in Virginia - PowerPoint PPT Presentation

1 / 71
About This Presentation

Auto-Validating Your Data Store: A do-it-yourself approach to data integrity and anomaly detection. Evan Davies Office of Strategic Planning and Analysis The College of William and Mary in Virginia


Auto-Validating Your Data Store: A do-it-yourself approach to data integrity and anomaly detection. Evan Davies Office of Strategic Planning and Analysis – PowerPoint PPT presentation

Number of Views:485
Avg rating:3.0/5.0


Transcript and Presenter's Notes

Title: Auto-Validating Your Data Store: A do-it-yourself approach to data integrity and anomaly detection. Evan Davies Office of Strategic Planning and Analysis The College of William and Mary in Virginia

Auto-Validating Your Data Store A
do-it-yourself approach to data integrity and
anomaly detection.Evan DaviesOffice of
Strategic Planning and AnalysisThe College of
William and Mary in Virginia
  • Institutional researchers increasingly rely on
    data marts, stores, and warehouses for management
  • Given the perpetually developing status of
    these environments, both commercial and
    institutional, you can spend significant time
    dealing with data that does not meet even
    shifting standards for table logic, variable
    conventions, and values.
  • Such anomalies can stop production programs or
    lead to inaccurate information until detected.

Does this ever happen to you?
  • NOTE Table WORK.ENROLL created, with 8241 rows
    and 19 columns. ?
  • NOTE Table WORK.PERSON created, with 8241 rows
    and 13 columns. ?
  • NOTE Table WORK.MAJOR created, with 8253 rows
    and 30 columns. ?
  • NOTE Table WORK.MAJOR_PERSON created, with 8264
    rows and 30 columns. ?
  • Who are these extra people? Why are they in here?
  • ????????

  • This presentation demonstrates a simple yet
    sophisticated way of using SAS Enterprise Guide
    to check tables, variables, and values
    automatically to find out if that data meets IR
    standards and premises before you start
    analytical work (and to let others know that
    things may need fixing).

Things You Should Know
  • Simple is a relative term, as is
  • This involves more coding than mouse clicking
  • You should have at some concepts of SAS coding,
    SQL, and relational databases
  • To make this work back at your campus, you need
    to know (or find out) how to access your data
  • If you have significant structured programming
    and SQL experience, please refrain from laughing
    out loud. Snickering is acceptable.

The College of William Mary
  • The only royally chartered colonial institution,
    1693, by King William III and Queen Mary II
  • Making it the second oldest college in the United
  • Phi Beta Kappa, the first Greek honor society,
    was founded here in 1776
  • Became state-supported in 1906 and coeducational
    in 1918
  • The Alma Mater of George Washington and Thomas
    Jefferson , as well as Jon Stewart and Secretary
    of Defense Robert Gates
  • Named one of Intel's 50 Most Unwired College
    Campuses for our campus-wide wireless network
  • The Colonial Campus section of the 1,200 acre
    campus is restored to its 18th-century appearance

The Wren Building (1700)The Oldest Academic
Building in Continuous Use in the U.S.
The College Today
  • 5,800 undergraduates and 1,950 graduate students
    from all 50 states and 30 foreign countries
  • 22 percent are students of color
  • 79 percent of freshmen graduated in the top ten
    percent of their class
  • Highest SAT middle 50th range of all public
    institutions in Virginia
  • 111 student-faculty ratio
  • WM has more recipients of the Commonwealth's
    Outstanding Faculty Award than any other
  • 5 undergraduate and graduate schools Arts
    Sciences, Business, Education, Law, and Marine
  • 36 undergraduate programs, 12 masters , doctoral,
    and professional degrees
  • is a Highly Selective Public Liberal Arts

The College of William and Mary
External System Interfaces/Entities
College Systems that Interface with
Administrative Computing Systems
Bank ACH Direct Deposits Check
Verification Cancelled Checks Lock-Box Payments
Federal Govt IRS/SSA W2s/1099s
VA Dept of Taxation W2s/1099s CDS Vendors
FRS Financial Record System Accounts
Payable Purchasing General Ledger Budget CARS
Interface/Recon EDI CDS Vendors Financial
Statements Grants Account Fixed Assets (6/99)
Old SIS Student Info System Residence
Life Student Billing Transcripts
Office Supply Vendor
Office Supplies Systems
Cash Receipts System
VA DOA CARS System Payroll Acct EDI to WM CDS
Pymts EDI Vendor Pymts Expense/Cash
Acct Agency/CPRS Tape
SAT Scores
Warehouse Systems
FAACS (Old Fixed Assets)
Data History We had many homegrown systems, some
based on Information Associates architecture,
but extensively modified.
ACT Scores
GRE Scores
Work Order System
Federal Dept Labor Employment Compliance
MAT Scores
MCAT Scores
VA DPT BES System Benefits Eligibility
WORCS Student Web
VA VRS Retire, Life Ins, Opt Life
HRS Human Resource System Applications Personnel
Mgmt Position Control Benefits Mgmt Work
Study Payroll/Account A21 Certification CARS
Faculty Salary Tracking
New SIS Student Info System Prospects Admissions
Student Records Registration Course
Schedule PO Box Management
Old Financial Aid
Prospective Student Search
VA VEC Unemployment Benefits
Schedule 25 Resource 25
Leave Account
FSA Administrator
Petersons Pros. Student Data
Athlete Tracking
Old Campus Police
TIAA - CREF Annuity Eligibility
1500 Hour Tracking
Web Applications (IP)
Benefits Vendors
National Clearing House Loan Verification
VA DPT PMIS Personnel Mgmt
College Systems that Interface with
Administrative Computing Systems
VA DPB Budget Admin
SSA/FICA Deposits Payroll Office
Checks 1-2-3 Pay Checks Direct Deposit
Stubs W2s AP checks 1099s General Accounting
New Campus Police
Parking System
SWEM Patron Info System
Mysoft Call Account Sys Telecom
Power FAIDS New Financial Aid
PELL Grants
Social Security Administration
ISIR F.A. Input
Identification System Food Services
US Savings Bonds Admin Payroll Office
Student Health System
Door Access System
Alumni Development System
Fed Reserve US Savings Bonds
Wiz Kid
Recent Data History
v.6, v.7, v.8 v.1, v.2 v.1, v.2,
  • After false starts, in 2003 we bought a large
    Banner enterprise system, and then added a
    datamart, and then a data store. We are now five
    years into our two year installation period.
  • Or put another way, we are current on
    version(s) of 8.2 and 3.1, with new version(s)
    around the corner.
  • When the IR staff start to understand the
    relationships and foibles of a particular
    version, it is time to upgrade to a newer version
    in which some things are fixed and some other
    things are broken differently enabled.

Things To Realize
  • SungardHE Banner products are not a bad system.
    Nor are any other commercial vendors products.
  • Any enterprise-level system with a data store is
    a permanently evolving, almost organic entity,
  • with multiple input opportunities for breaking
    constraints and premises
  • and for finding new ways to induce unexpected
    results through changing business rules,
    institutional decentralization, flexibility, and

Data Integrity in Pre- and Post-Enterprise Systems
  • Old History I.T. used to do a system General
    Edit, with general meaning edited for
    operational purposes, not analytical purposes.
    If the data got the payroll to run today or
    allowed you to admit a student, it was valid.
  • New History Data validity is still measured
    against operational standards.

Data Integrity Post-Enterprise System
  • And I.T. is now even busier than ever just making
    the production system run, without markedly
    larger numbers of staff to commit to the data
    warehousing activity. They actually do less
    general editing because more transactions are
    interactive rather than process-oriented.
  • At the end of a day, data is off-loaded into the
    data store, in different forms and with different
    premises from how the data is held from the
    production system.

Data Integrity Pre- and Post-Enterprise System
  • This means that there is now an even bigger split
    between operational and analytical data
    integrity, especially in terms of the differing
    forms of the data.
  • The data store is not as rigorously evaluated for
    data integrity or meaning precisely because it is
    not production. And since the operational
    offices of the institution are satisfied with
    their pieces of the data pie, everything is
    working fine.

The Institutional Research Role
  • IR has always had premises for data reporting
    that go beyond general edit.
  • We analyze data relationships that make up the
    whole picture.
  • We work in the aggregate, rather than by
    individual transaction.
  • We are ideally poised to discover the anomalies
    that occur between multiple institutional sources
    -- between one offices interpretation of a
    transaction and another offices idea of the same

A Data Stores Added Task to IR
  • IR now has to do the same comprehensive
    validation tasks that we used to do, plus
    identify and deal with the newer introduced
    problem of tables that violate their own premises
    or have unexpected values due to complex dynamics

The Complex Dynamics
  • imperfect translation of data between the
    production transaction system and the data
  • vendor maintenance or institutional business
    rules changes that intentionally induce changes
    in tables.
  • generally caused by the inability to predict all
    of the systemic results of making unspecified or
    unimagined changes in a system, sometimes known
    as the Butterfly Effect.
  • the continuous upgrade cycles applied to the
  • and data that results from imperfectly recorded
    transactions in uncertain environments with less
    than adequate collaboration and training.

So What Can We Do?
  • Write a program to
  • check the logic premises of frequently used
  • check for missing or out-of-range values in data
    that affects IR
  • run the program frequently to uncover problems in
    time for the current census and prevent future
    term anomalies. Keep the results for
  • communicate findings promptly and efficiently in
    order to effect change
  • build in flexibility to test different things at
    different times in different ways

Limitations in Place
  • Do it on your own, since it is for IR purposes.
    Remember, the data already meets everybody elses
    needs. ?
  • Use existing resources. Keep it simple.
  • Someone else is going to have to be notified to
    deal with the data anomaly once it is identified.
  • Dont weaponize the process. This is why we
    choose to have anomalies, not mistakes.

How To Accomplish?
  • The use of SAS EG on a PC platform allows for
    the remote submission of a premise and
    data-checking program during off-hours, when
    demand on the system is lessened.
  • It also allows a convenient environment in which
    to store the program and results, access
    validation and history tables, and send automatic
    e-mails to interested parties such as admissions,
    registrar, IR and IT staff.

  • SAS Enterprise Guide, a powerful Microsoft
    Windows client application that provides a guided
    mechanism to exploit the power of SAS and publish
    dynamic results throughout your institution. Its
    the preferred interface to SAS for analysts,
    statisticians and programmers SAS Enterprise
    Guide saves time by automatically generating
    computer code with an easy point-and-click
  • Think of it as a graphical office environment for
    the SAS language. As Microsoft Outlook ties MS
    Office products together, Enterprise Guide has a
    similar role for SAS

SAS and all other SAS Institute Inc. product or
service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and
other countries. indicates USA registration.
Still not clear on SAS/EG ?
  • It is an environment in which SAS 9.1 runs
  • It can bring together data views and any type of
    files from any network data servers, including
  • It contains SAS program(s), the larger project,
    notes, and a graphical description of how the
    project, processes, and programs relate to each
  • It documents how any processes or programs have
    been run, and the results.
  • It can generate code for procedures and
  • It inherits libraries, autoexecs, etc. from 9.1

We Use Enterprise Guide To
  • Schedule and launch the anomaly tracking job
  • Provide a comprehensive project environment for
    IR staff to be able to visit independently and
    add new anomaly checks
  • Be able to see and modify the associated tables
    and data in one place
  • Provide our novice IR staff a more centralized
    and friendlier view of the process

What Does It Look Like?
(No Transcript)
(No Transcript)
(No Transcript)
(No Transcript)
(No Transcript)
(No Transcript)
Can You Just Use SAS Itself?
  • YES!?
  • Provided you schedule the job through MS Task
    Scheduler and have no desire for the previously
    mentioned features or facilities.
  • Programmatically, all process features are part
    of base SAS v9.1 on an XP or Vista platform.

Beginning Steps
  • Survey yourself and other staff (IR and other
    offices) to make a preliminary list of tables and
    values that have issues
  • Establish the names, emails, and hierarchy of
    those who will receive automated communications.
  • Make a calendar of when you want to test certain
    items due to different functional cycles
    (admissions, registration, etc)

Lets Go Coding
  • I cant show you the entire code for my anomaly
    program today. ?
  • It would need to be modified to fit your data
    structures and needs anyway.
  • Instead I will concentrate on imparting some
  • Key Program Ideas
  • SQL techniques
  • to enable you to develop your own program

The Overall Design Concept
The Overall Design Concept (2)
(No Transcript)
Set up a SAS program which launches automatically
using SAS/EG
  • let term_start_limit200325
  • let term_end_limit 201030
  • let highestssn'772'
  • let es_validset'EL','MW','WD','WM','WW' /all
    valid statuses encountered by enrolled students
    AFTER enrollment/dropadd/
  • let es_bad'QW','WB','AW' /all bad or
    ineligible statuses not to be counted by census
  • let tooold'1900' out-of-range birthday year

  • Macro variable assignment increases flexibility
    as table names change.
  • let table_enrollenrollment
  • let table_academic_studyacademic_study
  • let table_addcurrentaddress_current
  • let table_prevedslotprevious_education_slot
  • let table_scheduleschedule_offering
  • let table_course_catalogcourse_catalog
  • let table_coursestudent_course

  • format table for anomalies
  • proc format
  • value anom
  • 1 'withdrawn with classes'
  • 2 'duplicate person recs '
  • 3-4 'missing value '
  • 5 'ssn out-of-range '
  • 6 'Two ids, one ssn '
  • 7 'dup recs course_catalo '
  • run

(No Transcript)
  • Create table todayterm as
  • select stvterm_code as studyterm,
  • datepart(stvterm_start_date) as begd ,
  • datepart(stvterm_end_date) as endd
  • from (your data source)
  • where today() gt datepart(stvterm_start_date) and
  • today() lt datepart(stvterm_end_date)
  • set a global study term variable based on value
    in todayterm
  • data _null_
  • set todayterm
  • call symput('study',studyterm)

  • Use up to 3 variables to show people what is
    anomalous about any particular situation. Name
    them A, B, and C. The variables you will pass to
    these variables will differ with each problem.
  • You will need to put both the value and the name
    of the variable into your anomaly report, plus
    some identifier(s) for the student or entity,
    plus some anomaly details.

  • Macro to help transfer of anomalies
  • macro keep
  • (keep id person_uid aval aval_desc
  • bval bval_desc cval cval_desc
  • anom studyterm
  • first_anom_date suspend_date data_own)
  • mend keep

  • macro pass (aval,bval,cval,dsn,anom,studyter
  • aval_desc "aval"
  • bval_desc "bval"
  • cval_desc "cval"
  • aval put(aval,25.)
  • bval put(bval,25.)
  • cval put(cval,25.)
  • anomanom
  • studytermput(studyterm,6.)
  • first_anom_datetoday()
  • suspend_datesuspend_date
  • data_owndata_own
  • mend pass

  • proc sql
  • connect to odbc as mydb (datasrc"datasrc"
    useruser passwordpassword)
  • create table addressc as select from
    connection to mydb
  • (
  • select m.PERSON_UID,, n.address_type,
    n.postal_code,, n.county,
    n.state_province, n.nation
  • from table_enroll m
  • inner join
  • table_addcurrent n
  • on m.person_uid n.entity_uid
  • where m.ACADEMIC_PERIOD in (study) and
  • ((m.ENROLLED_IND'Y' and
    m.REGISTERED_IND'Y' ) or
  • (m.ENROLLED_IND'Y' and
    m.ENROLLMENT_STATUS in (es_set)))
  • and n.address_type in ('IN', 'P1', 'MA')
  • )

  • from table_enroll m
  • inner join
  • table_addcurrent n
  • on m.person_uid n.entity_uid
  • where m.ACADEMIC_PERIOD in (study) and
  • and n.address_type in ('IN', 'P1', 'MA')


Let your server do the heavy data work
Result Set
SQL call
  • The most useful technique for detecting a table
    that violates record premises is by joining the
    table back to a copy of itself that has been
    summarized by the number of expected rows,
    keeping only those records that dont meet
  • A SQL statement that uses a count() function, a
    group by clause, and a having clause does
    this job effectively.

  • select l. from
    from table_academic_study
  • where academic_period in (study) ) l
  • inner join
  • (select person_uid
  • from table_academic_study
  • where academic_period in (study)
  • group by person_uid, program
  • having count(person_uid) gt 1 ) r
  • on l.person_uid r.person_uid

  • select
  • person_uid
  • from table_academic_study
  • where academic_period in (study)
  • group by person_uid, program
  • having count(person_uid) gt 1

1234 BA-GOVT 200910 1
1234 BA-GOVT 200910 1
1234 BA-GOVT 200910 2
  • select l.person_uid, r.person_uid as other_uid,
  • l.tax_id, r.tax_id as other_tax_id,
  • from person l inner join person r
  • on l.TAX_ID r. TAX_ID
  • AND l.person_uid ltgt r.person_uid
  • where l.tax_id is not null
  • and r.tax_id is not null
  • This SQL will find two different university ids
    that share the same SSN. This generally occurs
    when the institution has issued two ids to the
    same person without an adequate search of
    records. The faster this is spotted, the better.

  • In addition to testing table logic, once you have
    the datasets established, any variety or
    combination of values can be tested. Here are
    four conditions to get you started thinking about
    what can be tested
  • - missing values
  • if citizenship_type '' then do
  • - withdrawn with classes
  • if enrolled_ind 'Y' and registered_ind 'Y'
    and enrollment_status 'WB' then do

  • if state_province in ('AA','AE','AP','PR','VI','AL
    ','AK','AZ','AR','CA', 'WI','WY' ) and nation gt
  • or
  • state_province in ('AB','BC','MB','NB','NL','NT',
    'NS','NU','ON','PE','QC','SK','YT') and nation
  • or
  • nation 'CA' and state_province not in
  • or
  • state_province in ('FC' ,'HK', 'RQ', 'XX') then

  • - ssn out of range
  • if tax_id '' then
  • do
  • ssnverf indexc(substr(tax_id,1,9),'
  • if substr(tax_id,6,4) '0000' or
    substr(tax_id,1,3) lt '001'
  • or substr(tax_id,4,2) '00' or
    substr(tax_id,1,3) gt highestssn or ssnverf gt 0
    then do

  • data T_person_v keep
  • set person
  • test3 - missing value
  • anom0
  • if citizenship_type '' then
  • do
  • pass(avalcitizenship_type, bvalfull_name_lfmi,
    cval, dsnperson, anom3, studytermstudy,
    suspend_date., data_own'reg')
  • end
  • if anom gt 0
  • run

The KEEP Macro in place
The Anomaly Testing
The PASS Macro in place
  • test6 - two ids, one ssn
  • data T_person_v3 keep
  • set anom_person2
  • pass(avaltax_id, bvalother_tax_id,
    cvalother_id, dsnperson, anom6,
    studytermstudy, suspend_date.,data_own'reg')
  • run

The KEEP Macro
  • Nomenclature of Test Datasets -
  • T_area_XN
  • where
  • T Test
  • area broad anomaly category
  • X v(ariable based)
  • a(ssumption of table logic violated)
  • N incremental test set number
  • dataset T_person_v3 is the third datset testing
    the demographic (person) table for variable-based
    anomalies such as missing values or incompatible

  • create temp work dataset for todays
  • data today
  • length aval_desc bval_desc cval_desc 25
  • set
  • T_addressc_v1
  • T_addressc_v2
  • T_addressc_a
  • T_acadstudy_a
  • T_ccat_a
  • T_person_a
  • T_person_v1
  • T_person_v2
  • T_person_v3
  • all the transaction sets
  • run

  • Remember to unduplicate all the duplicated
    records found! You only need one example record
    per anomaly.
  • proc sort NODUPKEY datatoday
  • by person_uid anom aval bval cval studyterm
  • run
  • Failure to do so may result in multiple joins in
    the next step, when you update the master dataset.

Add New Records Keep Matching RecordsDelete
Data master Update master today------------or-
------------Proc SQL Select from oldmaster
l right join today r on criteria
Todays Set
Master Set
  • proc sql
  • create table newmaster as
  • select r.person_uid,, r.anom, r.aval,
    r.aval_desc , other variables,
    coalesce(l.first_anom_date, r.first_anom_date) as
  • from
  • oldmaster l right join today r
  • on
  • l.person_uid r.person_uid and l.anom r.anom
  • and l.aval r.aval and l.bval r.bval and l.cval
  • and l.studyterm r.studyterm
  • quit

data reg ban ir adm grr soe bur law set
master output ir for all records if
data_own'reg' then output reg else if data_own
'ban' then output ban else if data_own
'adm' then output adm else if data_own 'grr'
then output grr else if data_own 'soe' then
output soe else if data_own 'bur' then output
bur else if data_own 'law' then output law
proc export datareg dbmsexcel2002
outfile 'g\temp\reg.xls' replace proc export
databan dbmsexcel2002 outfile
'g\temp\ban.xls' replace proc export datair
dbmsexcel2002 outfile 'g\temp\ir.xls'
filename reports email "" data
_null_ file reports set
departments put '!EM_TO! ' name
put '!EM_SUBJECT! Report for ' dept put
Hi fname -' put 'Here is the latest
report of anomalies for the' dept'.' if
dept'ban' then put '!EM_ATTACH!
g\temp\ban.xls' else if dept 'reg' then
put '!EM_ATTACH! g\temp\reg.xls'
else if dept 'ir' then put '!EM_ATTACH!
g\temp\adm.xls' put '!EM_SEND!'
put '!EM_NEWMSG!' put
'!EM_ABORT!' run
Subject line
Attached xls
Technical Hurdles Along The Way
  • Dont use your Outlook mailer. Specify your SMTP
    mail service as the mailer. You may have to pass
    your authentication through the SAS sasv9.cfg
    file to allow STMP mailing.
  • Depending on your network, you may have to use
    Cscript as the keyword in the scheduler to
    launch the program, rather than the implicit
    Wscript. C stands for console.

Lessons Learned
  • Send all output to yourself for several days to
    review, before allowing it to be sent out
  • Send lower priority anomalies infrequently, and
    high priority ones weekly or daily
  • Send notice of table violations infrequently to
    IT, once they have identified a problem and
    resolution path. (Dont bug them if they, too,
    are waiting for a vendor patch or fix)

Lessons Learned
  • Be aware of the length of time the job takes to
    execute. You may need to adjust what and when you
    test if it starts taking too much time
  • Bring your users into the process by asking
    them what you can do differently to help them. Do
    they need another variable to help isolate
    problems? Alternate ID?

Auto-Validating Your Data Store Evan
DaviesPresentation available online
(No Transcript)
Write a Comment
User Comments (0)