Larger than Life - PowerPoint PPT Presentation

About This Presentation
Title:

Larger than Life

Description:

Linking SIR to other software - Winzip. open inf /dsnvar=fnamein /iostat=ierr1 /write /lrecl=300 ... write // 'Unzipping file using winzip ... – PowerPoint PPT presentation

Number of Views:79
Avg rating:3.0/5.0
Slides: 22
Provided by: CGC
Category:
Tags: life | winzip

less

Transcript and Presenter's Notes

Title: Larger than Life


1
Larger than Life ?
Presented By Adrian Hodgson
2
How big ?
  • Jobcentre client activity data - 11 million
    people
  • 80 Gbytes of data per month on DLT
  • Processing 20 tables 250m records per month
  • Largest tables have 80 million records, 5 Gbytes
  • Key extract retrievals take 6 - 20 hours to run
  • Full SPSS extracts for clients - 5 Gbytes
  • Twenty five SIR databases - over 100 Gbytes
  • from 1 Mbyte to 15 Gbytes

3
Evaluation Database - Overview
  • Background project requirements
  • Setting the environment
  • Visual PQL
  • The Data Dictionary
  • Program generators
  • Fuzzy matching
  • Use of the data
  • New developments
  • Questions

4
Background
  • Government program started in January 1998
  • New Deal for Young People was set up to encourage
    and assist unemployed groups into full time
    sustainable employment. It gives unemployed
    people aged 18-24 the opportunity to develop
    their potential, gain skills and experience and
    find work.
  • Employment Service needed an evaluation database
  • March 98 - ORC issues first extract with 18000
    clients
  • June 98 - program expanded to over 25s
  • June 99 - expanded to cover all Jobcentre clients
  • October 00 - Source database migrates from six
    regional Ingres databases to single Oracle
    database

5
Contract extension -press notice
  • The new contract will run to May 2004, with an
    option to run a further two years to May 2006.
  • ORC Internationals database tools are designed
    to help the ES evaluate the service it provides
    to all clients registered on the Governments
    JobCentre computer system.They allow the ES to
    regularly monitor clients with sustainable jobs,
    the effectiveness of equal opportunities
    measures, and the relationship between job
    vacancies and the labour market skills base.
  • It is our intention to create a project web
    site, which will allow multiple level access to
    different categories of users. This will include
    project documentation and progress reports,
    access to tabulations and small data extracts,
    and customer feedback areas, as well as links to
    other related sites.
  • ORC International is part of Opinion Research
    Corporation, which was founded in 1938 with
    offices in the United States, Europe, Asia, Latin
    America and Africa, the Company provides
    integrated marketing services to both businesses
    and governments in more than 100 countries.
    http//www.orc.co.uk

6
Project Requirements
  • LMS (Labour Market System) is a multi-user
    transaction system used in Jobcentres
  • Needed new database with evaluative functionality
  • Linked to additional data-sets including clerical
  • Flexibility to change structure periodically
  • Combine cross regional records for the same
    client
  • Extracts provided to ES for statistical purposes

7
Setting the environment
  • PROGRAM
  • PQL CONNECT DATABASE 'CLMIDS PREFIX
    ''
  • END PROGRAM
  • SET DATABASE CLMIDS
  • SET PROCFILE ' reset to main
    procedure file
  • ( main procedure file held in separate .SR4
    file )
  • Midlands database prefix
  • retval globals ( CLI_MIDS ,
    '\\urmston\d\50413\sirdb\client\')
  • and so on for the procedure file other
    databases

8
Visual PQL (1)
execute dbms CALL ddict.cprog ( tablenam
, region , extract ,
newdata , editdata , dbnum
)
9
Visual PQL (2)
10
Linking SIR to other software - Winzip
open inf /dsnvarfnamein /iostatierr1 /write
/lrecl300 ifthen ( ierr1 ne 0) . write 'File
not found write // 'Unzipping file using
winzip pql escape "C\Program
Files\WinZip\WINZIP32.EXE -e zipfn fnamein
" write 'Back to sir !!!!!! ' c Build in a
loop to check that input file now unzipped and
ready c wait and repeat if not else . write
'File opened OK ' fnamein / endif
11
The Data Dictionary
12
Automatic schema generation
  • Source field names, labels data types read from
    HTML into data dictionary database
  • Program to create SIR variable names
  • strip out underscore characters _
  • trims field length to eight
  • if duplicated replaces eighth character with a
    number
  • Program sets column positions based on field
    types
  • Data types converted from Ingres / Oracle to SIR
  • Procedure for Date Integers and true date fields

13
Load programs - the report files (1)
  • Client_action record
  • Read
    78224996
  • \ fixed_width_data \ 2001_05 \north
    19156652
  • east 6949927
  • west 11515552
  • south 11774525
  • northwest 11352025
  • midlands 12147512
  • oracle 5328791
  • dodgy 12


14
Load program generation (2)
  • Develop standard file naming directory location
  • Read pre -processing report files to pick up
    numbers of records to be loaded
  • Detect any missing report files using iostat
    values
  • Write loading program code to pql files
  • Read these pql files back into the procedure file

15
Load generator -finishing touches (3)
open inf6 / dsn 'Q\ps\50413\ProgGen\ldmidlands1
.txt' / read / lrecl 250 open ouf6 / dsn
'Q\ps\50413\ProgGen\ldmidlands2.txt' / write
/ lrecl 250 write ( ouf6 ) 'PROCEDURE
INITLOAD.T add the procedure header
line / 'call initload.dropall'
add call to drop all
databases // 'call initload.connmids' /
add call to connect
midlands loop . read (inf6, iostat ierr6)
textline(a250) copy the rest of the
program . if (ierr6 ne 0) exit loop
. write (ouf6)
textline end loop write (ouf6) 'call
initload.dropmids' add
module call to drop database / 'END
PROCEDURE
add the END PROCEDURE close (inf6)

close the input and output files close
(ouf6) pread 'Q\ps\50413\ProgGen\ldmidlands2.tx
t Pread the SIR pql
16
Load program - Midlands database (4)
call initload.dropall drop all connected
databases call initload.connmids connect the
midlands database call initload.update
(\\Urmston\d\50413\, sir_input_files\lms\2001
_04\Midlands\edited\,


d\50413\sirdb\client\Midlands\l
og\2001_04\, Midlands_client2001_04ORCIDsort,
edt, 0.75, 1, 1, 181450, 0) call initload.update
(\\Urmston\d\50413\, sir_input_files\lms\2001_
04\Midlands\edited\, d\50413\sirdb\client\Mid
lands\log\2001_04\, Midlands_client_action2001
_04ORCIDsort, edt, 0.75, 2, 1, 236835, 0) .
more call to load data call
initload.dropmids
disconnect the midlands database
17
Fuzzy matching
  • Linking data by best combinations of Nino,
    name,dob
  • Stripping non significant text
  • blanks, apostrophes, hyphens
  • Methods grown organically on case by case basis
  • Variety of scoring methods eg surname matches
  • HODGSON 4 points direct matches
  • HODGESON 3points 0.9 for misaligned - 6.7/8
    84
  • Flexible generic modules applicable to all match
    types
  • Reporting of false positive and negative matches
  • Manual review of near /doubtful matches

18
Generic fuzzy matching - issues
  • Key fields different sizes and names across
    applications
  • Some key fields absent or with high missing
    values
  • Quality of key fields varies widely
  • Matching varies from a handful to millions of
    records
  • Bringing Access and SIR together - Visual PQL ?
  • How to assess false positives when no other
    common fields in data sets being matched
  • - set of core procedures with options to bypass
    ?

19
General Issues
  • Continual growth in database extract size
  • Data Irregularities
  • Embedded carriage returns in text fields
  • Date formats (American /English)
  • The team
  • Keeping the routine /repeated processing
    interesting
  • Mushrooming similar code - keeping it generic
    /black box
  • SIR
  • several large tabfiles required - largest
    currently 11Gb
  • Some retrievals crash with sir.exe error 1 time
    in 5 -why?
  • P4 /Windows 2000 mixed performance

20
New Developments (1) - Project web site
  • Management tool for display of project statistics
  • Focus for collecting project documents ( Word )
  • FAQs
  • Glossary of abbreviations acronyms ?
  • which variable should I use for this ?
  • How is leaving date derived ?
  • Where can I get the latest data dictionary for
    NDYP ?
  • Whats the ORC variable name for expct_start_date
    ?
  • Small sampling /extraction tools
  • Links to other related government sites

21
New Developments (2) - Processing
  • Full benefits data set arrives on Monday
  • Increasingly complex extracts in SPSS and SAS
  • Moving more of the data processing to Linux
  • Generic fuzzy matching tools
  • Adding other data , deprivation index , surveys
  • Sir 2002 - reading an 800 byte record into a
    single string
  • Secondary indexing /lookups - replace Case 0
  • Better linking of the processing zipping /spss
    /excel /HTML
  • Questions

22
Larger than the Evaluation Database ?
Presented By Adrian Hodgson
Write a Comment
User Comments (0)
About PowerShow.com