Multiload and ETL Strategy - PowerPoint PPT Presentation

1 / 15
About This Presentation
Title:

Multiload and ETL Strategy

Description:

After each Multiload task you should check for rows that were loaded to the error tables. ... The data is loaded as is into the staging tables. ... – PowerPoint PPT presentation

Number of Views:345
Avg rating:3.0/5.0
Slides: 16
Provided by: ReneeM63
Category:

less

Transcript and Presenter's Notes

Title: Multiload and ETL Strategy


1
Multiload and ETL Strategy
  • Jeremy Christiansen,
  • Bull Services Systems Integrator

2
What does Multiload Do?
  • Allows you to read 1 or more files from a Legacy
    system
  • For each file you read, allows you to execute 1
    or more DML operations
  • DML operations come in 3 flavors and they define
    what to do with rows in the input file.
  • The 3 flavors are
  • Insert
  • Update
  • Delete

3
Logical flow of Multiload Script
  • Initialization
  • Define the Input File
  • Define what to do with the Input File
  • Import and Use the File
  • Clean Up

4
Sample Script
  • .LogTable AdjustmentWorkArea.LG_R214
  • Database AdjustmentWorkArea
  • DELETE R214
  • Drop Table ET_R214
  • Drop Table UV_R214
  • .Begin Import MLOAD
  • Tables R214
  • Worktables WT_R214
  • Errortables ET_R214
  • UV_R214
  • Sessions 8
  • Tenacity 4
  • .Layout Layout_01
  • .Field CLAIM_REFERENCE_NUMBER
    CHAR(015)
  • .Field IP_COMMENTS
    CHAR(132)
  • .Field Filler
    CHAR(019)
  • .DML Label Insert_01
  • Insert Into R214
  • Values (

5
Step 1 Initialization
  • .LogTable AdjustmentWorkArea.LG_R214
  • Database AdjustmentWorkArea
  • DELETE R214
  • Drop Table ET_R214
  • Drop Table UV_R214
  • .Begin Import MLOAD
  • Tables R214
  • Worktables WT_R214
  • Errortables ET_R214
  • UV_R214
  • Sessions 8
  • Tenacity 4
  • The first line of code must define a log table.
  • Optional SQL commands may be included next.
  • The .Begin statement starts a Multiload task.
  • A task is simply reading a file and doing
    something with it.
  • Each task has a .End statement as well.
  • Each input file has its own .Begin and .End
    statements.

6
Step 2 Define the Input File
  • .Layout Layout_01
  • .Field CLAIM_REFERENCE_NUMBER CHAR(015)
  • .Field IP_COMMENTS CHAR(132)
  • .Field Filler CHAR(019)
  • The .Layout command begins the definition of
    the input file
  • Subsequent .Field commands define each field in
    the file.
  • A .Filler command can be used to identify
    filler space in the record.
  • Each field must have a datatype defined. This
    appears after the asterisk.

7
Step 3 Define What to Do
  • .DML Label Insert_01
  • Insert Into R214
  • Values (
  • CLAIM_REFERENCE_NUMBER
  • ,IP_COMMENTS
  • )
  • The .DML command defines either an Insert,
    Update, or Delete operation.
  • You may define multiple DML commands. Each begins
    with .DML and is followed by an Insert, Update,
    or Delete statement.
  • Fields from the input file are referenced via the
    names defined in the Layout (prefixed with a
    colon).

8
Step 4 Do the Work
  • .Import Infile /DCH/homedch/dev/data/r214
  • Format Text
  • Layout Layout_01
  • Apply Insert_01
  • The .Import command begins the actual loading
    of data.
  • This is where you specify the location of the
    input file.
  • The Layout keyword is followed by the name of
    the layout that should be used to define the
    input file.
  • The Apply keyword is followed by the name of
    the DML operation to execute on the file.
  • You may apply several DML operations on the
    same file. In this case there will be multiple
    apply commands.

9
Step 5 Clean Up
  • .End MLoad
  • .If SYSUVCNT 0 Then
  • .Logoff 9
  • .ENDIF
  • .If SYSETCNT 0 Then
  • .Logoff 9
  • .Endif
  • .Logoff
  • The .End MLoad command terminates the last load
    task, but not the Multiload utility itself.
  • You may include several load tasks in a single
    Multiload script, each task being for a different
    input file.
  • After each Multiload task you should check for
    rows that were loaded to the error tables.
  • You can set non-zero return codes if errors were
    found.
  • The Logoff command terminates the Multiload
    utility.

10
What are the Error Tables for?
  • If Multiload encounters a problem with a row,
    the row is inserted into 1 of 2 error tables.
    Things that cause errors are
  • Invalid data for the datatype
  • Duplicate rows / Unique key violations
  • Rows in the input file that were not used by one
    of the DML commands.
  • You can add code that ignores duplicate rows
    and rows in the input file that could not be
    used. This code is part of each DML command.
    Ignored errors do not get put into the error
    tables.

11
ETL Strategy
  • There are many ways to do ETL, coming up with a
    standard helps keep large systems uniform and
    understandable.
  • A good standard will keep coding complexity to a
    minimum
  • A good standard will lend itself to maximum
    flexibility in terms of meeting the needs of the
    customer.

12
DCHs ETL Strategy
  • DCHs standard is a 2-pronged approach to ETL
  • Raw data from the Legacy systems is loaded to
    staging tables using the Multiload utility.
    Staging tables are holding areas for the incoming
    data. The data looks exactly as it did in the
    Legacy system.
  • Data in the staging tables is then moved to
    production tables designed in 3rd normal form.
    SQL scripts executed within the BTEQ utility
    accomplish the movement. It is here that data
    scrubbing and transforming is done if
    necessary.

13
DCHs ETL Strategy
  • This strategy has certain inherent benefits
  • Since the Legacy data is loaded as is (to the
    Staging tables), its easy to see what the data
    looked like prior to any scrubbing or
    transformation. Its a good audit trail.
  • The handling of bad data is incorporated into the
    database itself, giving you many powerful
    options
  • Put data in error tables
  • Fix the data
  • Generate reports
  • Make the errors more visible to people who need
    to see them.

14
DCHs ETL Strategy
  • The ETL process has the full power of Teradatas
    parallel processing architecture to transform the
    staging data into the production format. The ETL
    runs faster.
  • Multiload coding is simpler as it requires no
    complex logic. The data is loaded as is into the
    staging tables.
  • Load statistics are easy to generate because
    everything is being done in the database
    environment. These statistics can then be made
    available to users so that they know how fresh
    the data is and whether or not there were any
    problems with it.

15
Final Thoughts
  • DCHs ETL strategy is just one way of doing
    things, there are others.
  • When deciding on a strategy you need to evaluate
    it based on
  • Efficiency (How fast does it run?)
  • Understandability (Is it straightforward and easy
    to follow?)
  • Flexibilty (How easy is it to change when the
    underlying business changes?)
  • Questions?
Write a Comment
User Comments (0)
About PowerShow.com