A-Introduction To ETL and DataStage
ETL Basics
• Extraction, Transformation & Load
  Extracts data from source systems
  Enforces data quality and consistency standards
  Conforms data from different sources
  Loads data into target systems
• Usually a batch process
  Involves large volumes of data
• Scenarios
  Load a data warehouse or data mart for analytical and reporting applications
  Data Integration
  Load packaged applications or external systems through their APIs or interface databases
  Data Migration
August 7, 2021 2
ETL Basics
Advantages of Tool-based ETL
• Reusability
• Metadata repository
• Incremental load
• Managed batch loading
• Simpler connectivity
• Parallel operation
• Vendor experience
ETL Basics
[Diagram: ETL Engine, with Metadata and Data stores]
IBM Information Server DataStage Overview
IBM Information Server DataStage
• Ideal tool for data integration projects – such as, data warehouses, data marts,
and system migrations
• Import, export, create, and manage metadata for use within jobs
DataStage Architecture
[Diagram: Sources → Engine → Targets, with a Metadata Repository]
Some Product Flavors
• Enterprise Edition
Includes Parallel Engine, Server Engine & MetaStage
Supports Parallel & Server Jobs in an SMP & MPP environment
• Server Edition
Lower-end version, much less expensive
Includes Server Engine, supports only Server Jobs
Sufficient for less performance critical applications
MetaStage can also be packaged with it
• MVS Edition
  An extension that generates COBOL code & JCL for execution on mainframes
  Common development environment, but involves porting & compiling the code on the mainframe
• SOA Edition
  RTI component to handle real-time interfaces
  Allows job components to be exposed as web services
  Multiple servers can service requests routed through the RTI component
Note that the web service client component is available even without purchasing the SOA Edition
DataStage Architecture
• Repository:
• Contains all the metadata, mapping rules, etc.
• DataStage applications are organized into Projects; each server can handle multiple projects
• The DataStage repository is maintained in an internal file format, not in a database
DataStage Architecture
• Windows-based client components
• Need to access the server at development time
• Designer: create DataStage 'jobs' and compile them to create the executables; import & export component definitions
• Director: validate, schedule, run, and monitor jobs
• Administrator: set up users, create and move projects, set up purging criteria, set environment variables
• Designer & Director can connect to one Project at a time
Key DataStage Components
Project
• Usually created for each application (or version of an application, e.g. Test, Dev, etc.)
• Multiple projects can exist on a single server box
• Associated with a specific directory with the same name as the Project: the "Repository", which contains all metadata associated with the project
• Consists of
  DataStage Server & Parallel Jobs
  Pre-built components (Stages, Functions, etc.)
  User-defined components
• User roles & privileges are set at this level
• Managed through the Information Server Web Console / DS Administrator client tool
• Connected to through the other client components
Key DataStage Components
Category
Table Definition
Schema Files
• External metadata definition for a sequential file, in a specific format & syntax. Associated with a data file at run time
Key DataStage Components
Job
• An executable unit of work that can be compiled & executed independently or as part of a data flow stream
• Created using the DS Designer client (compile & execute are also available through Designer)
• Managed (copy, rename, import, export) through DS Designer
• Executed and monitored through DS Director; the log is also available through Director
• Parallel Jobs (available with Enterprise Edition):
  • Have built-in functionality for pipeline and partitioning parallelism
  • Compiled into OSH (Orchestrate Scripting Language)
  • The OSH executes "Operators", which are executable C++ class instances
• Server Jobs (available with Enterprise as well as Server Editions):
  • Compiled into BASIC (interpreted pseudo-code)
  • Limited functionality and parallelism
• Can accept parameters
• Reads from & writes to one or more files/tables; may include transformations
• A collection of stages & links
Key DataStage Components
Stages
• Pre-built components that
  • Perform a frequently required operation on a record or set of records, e.g. Aggregate, Sort, Join, Transform, etc.
  • Read from or write to a source or target table or file
Links
• Depict the flow of data between stages
Data Sets
• Data is carried internally through links in the form of Data Sets
• DataStage provides a facility to "land" or store this data in the form of files
• Recommended for staging data: the data is already partitioned & sorted, so it is a fast way of sharing/passing data between jobs
• Not recommended for back-ups or for sharing between applications, as it is not readable except through DataStage
Shared Containers
• Reusable job elements comprising stages and links
Key DataStage Components
Routines
• Pre-built & custom built
• Two types
  • Before/After Job: can be executed before or after a job (or some stages); takes multiple input arguments, returns a single error code
  • Transform: called within a Transformer Stage to process a record & produce a single return value that can be assigned to, or used in the computation of, an output field
• Custom built
  • Written & compiled using a C++ utility. The object file created is registered as a routine & is invoked from within DataStage
  • Note that server jobs use routines written within the DS environment, using an extended version of the BASIC language
Job Sequence
• Definition of a workflow, executing jobs (or sub sequences), routines, OS commands, etc.
• Can accept specifications for dependencies, e.g.
  • When file A arrives, execute Job B
  • Execute Job A; on failure of Job A, execute OS command <<XXX>>; on completion of Job A, execute Jobs B & C
• Can invoke parallel as well as server jobs
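The dependency logic in the example above can be sketched in plain Python (this is not DataStage code; the job names and the echoed message are made-up placeholders):

```python
# Toy sketch of a Job Sequence: run Job A; on failure run an OS command,
# on completion run Jobs B & C.
import subprocess

def run_job(name, fail=False):
    """Stand-in for invoking a DataStage job; returns True on success."""
    return not fail

def sequence(job_a_fails=False):
    executed = []
    if run_job("JobA", fail=job_a_fails):
        executed.append("JobA")
        # "On completion" of Job A, execute Jobs B & C
        run_job("JobB"); executed.append("JobB")
        run_job("JobC"); executed.append("JobC")
    else:
        # "On failure" of Job A, execute an OS command (placeholder)
        subprocess.run(["echo", "JobA failed"], check=False)
        executed.append("os-command")
    return executed
```

A real sequence would additionally wait on file-arrival triggers and propagate job status codes.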
DS API
• SDK functions
• Can be embedded into C++ code, invoked through the command line or from shell scripts
• Can retrieve information, compile, start, & stop jobs
Key DataStage Components
Configuration File
• Defines the system size & configuration applicable to a job, in terms of nodes and node pools, mapped to disk space & assigned scratch disk space
• Details are maintained external to the job design
• Different files can be used according to individual job requirements
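As an illustrative sketch only, a minimal two-node configuration file of this kind might look like the following; the host name, node names, and paths are made-up examples:

```
{
  node "node1" {
    fastname "etlhost"
    pools ""
    resource disk "/ds/data" {pools ""}
    resource scratchdisk "/ds/scratch" {pools ""}
  }
  node "node2" {
    fastname "etlhost"
    pools ""
    resource disk "/ds/data" {pools ""}
    resource scratchdisk "/ds/scratch" {pools ""}
  }
}
```

Swapping in a file with more nodes lets the same compiled job run with a higher degree of parallelism, without any change to the job design.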
Environment Variables
• Set or defined through the Administrator at the project level
• Can be overridden at the job level
• Types
  • Standard/generic variables: control the design and running of parallel jobs, e.g. buffering, message logging, etc.
  • User-defined variables
Other DataStage Features
• Source & Target data supported:
• Text files
• Complex data structures in XML
• Enterprise application systems such as SAP, PeopleSoft, Siebel, and Oracle Applications
• Almost any database, including partitioned databases such as Oracle, IBM DB2 EE/EEE/ESE (with and without DPF), Informix, Sybase, Teradata, and SQL Server, plus access via ODBC
• Web services
• Messaging and EAI, including WebSphere MQ and SeeBeyond
• SAS
Recap
• We Saw:
• What, Why & How ETL
• DataStage
• Architecture
• Flavors
• Components & Other Features
A Quick Demo Job
Case:
• The input file contains sales data with attributes including <Region ID, Zone, Total Sales>
• Note that
  • Region ID is the unique key
  • The file contains attributes other than the 3 mentioned above
• The required calculations are to
  • Compute the Regional Total as a percentage of the Zonal Total
  • Compute the Rupee equivalent of the Regional Total by multiplying it by the exchange rate, which should be a job parameter
• e.g. if the input is

  Region ID  City    Zone ID  Regional Sales
  1          City 1  Z1       10
  2          City 2  Z1       10
  3          City 3  Z1       20
  4          City 4  Z2       20
  5          City 5  Z2       30

  then the output (shown for Regions 3-5, with an exchange rate of 40) is

  Region ID  City    Zone ID  Regional Sales  Rupee Sales  % of Zonal Total
  3          City 3  Z1       20              800          50
  4          City 4  Z2       20              800          40
  5          City 5  Z2       30              1200         60
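The demo job's logic can be expressed in plain Python as a sketch (this is not DataStage code; the tuple layout and the exchange-rate value of 40 are assumptions taken from the example figures):

```python
# Sketch of the demo job: aggregate sales per zone, then derive the rupee
# equivalent and each region's share of its zonal total.
from collections import defaultdict

def transform(rows, exchange_rate):
    """rows: list of (region_id, city, zone_id, regional_sales)."""
    # Aggregator step: total sales per zone
    zonal_total = defaultdict(float)
    for _, _, zone, sales in rows:
        zonal_total[zone] += sales
    # Transformer step: derive the two new columns per record
    out = []
    for region, city, zone, sales in rows:
        out.append((region, city, zone, sales,
                    sales * exchange_rate,             # rupee equivalent
                    100 * sales / zonal_total[zone]))  # % of zonal total
    return out

rows = [(1, "City 1", "Z1", 10), (2, "City 2", "Z1", 10),
        (3, "City 3", "Z1", 20), (4, "City 4", "Z2", 20),
        (5, "City 5", "Z2", 30)]
result = transform(rows, exchange_rate=40)
```

In the actual job, the zonal totals come from an Aggregator stage joined back to the detail rows, and the derivations live in a Transformer stage with the exchange rate passed in as a job parameter.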
A Quick Demo Job
• Step 0
• Project has been created
• User groups have been assigned appropriate roles
• Source Data is available
• ODBC connection DSNs to the source & target databases have been created <not required for this particular example>
A Quick Demo Job
• Step 2: Define metadata of the source and/or target files
  • Menu option: Import > Table Definitions > Sequential File Definitions
  • Browse to the directory & select the source file
  • Select the category under which to save the table definition & the name of the table definition
  • Click on Import
A Quick Demo Job
• Step 2 …
• Define the formatting (e.g. fixed-width/delimited, which end-of-line character is used, whether the first line contains column names, etc.)
• Set column names (if the file does not already contain them) & widths
Designer Interface
• Open Designer
  • Directly from the desktop, or through the Tools menu in Director
• Create a new "Parallel Job"
• Save it within the chosen 'Category' or folder
Designer Interface
[Screenshot: Designer window — Repository, design pane, and Palette]
A Quick Demo Job
• Step 4 Contd.
A Quick Demo Job
• Step 6 – Run
Director Interface
Director view …
Sequential File Stage
• Features
• Normally executes in sequential mode**
• Can read from multiple files with same metadata
• Can accept wild-card path & names.
• The stage needs to be told:
• How file is divided into rows (record format)
• How row is divided into columns (column format)
• Stage Rules
• Accepts 1 input link OR 1 stream output link
• Rejects record(s) that have a metadata mismatch. Options on reject:
  • Continue: ignore the record
  • Fail: the job aborts
  • Output: send to a reject link, whose metadata is a single column (not alterable); can be written to a file/table
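The three reject modes can be modeled in plain Python (this is a sketch of the behavior, not DataStage internals; the function and parameter names are assumptions):

```python
# Toy model of the Sequential File stage's reject handling for rows whose
# column count does not match the defined metadata.
def read_file(lines, n_columns, reject_mode="continue"):
    """Parse comma-delimited lines; route mismatching rows per reject_mode."""
    good, rejects = [], []
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) == n_columns:
            good.append(fields)
        elif reject_mode == "continue":   # ignore the record
            continue
        elif reject_mode == "fail":       # job aborts
            raise ValueError(f"metadata mismatch: {line!r}")
        elif reject_mode == "output":     # single-column reject link
            rejects.append(line)
    return good, rejects

good, rejects = read_file(["1,City 1,Z1,10", "bad row", "2,City 2,Z1,10"],
                          n_columns=4, reject_mode="output")
```

In the "output" mode, the reject link carries the raw record as a single column, exactly because the row could not be split against the defined metadata.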
Sequential File Stage
Sequential File Stage properties …
Copy Stage
• Features of Copy Stage
• Copies single input link dataset to a number of output datasets
• Records can be copied with or without some modifications
• Modifications can be:
• Drop columns
• Change the order of columns
• Note that this functionality is also provided by the Transformer Stage, but Copy is faster
[Screenshot: Copy stage options — drop columns, change the order of columns, rename columns]
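What the Copy stage does per record can be sketched in plain Python (not DataStage code; the column names are made-up examples):

```python
# Toy model of the Copy stage's modifications: drop, reorder, and rename
# columns while passing records through unchanged otherwise.
def copy_stage(record, output_columns):
    """record: dict of column -> value.
    output_columns: (input_name, output_name) pairs, in output order."""
    return {out_name: record[in_name] for in_name, out_name in output_columns}

rec = {"Region_ID": 3, "City": "City 3", "Zone_ID": "Z1", "Regional_Sales": 20}
out = copy_stage(rec, [("Zone_ID", "Zone"), ("Regional_Sales", "Sales")])
# 'City' and 'Region_ID' are dropped; the kept columns are reordered & renamed
```

Because no expressions are evaluated per column, this pass-through is cheaper than routing the same records through a Transformer.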
Transformer Stage
• Single input
• One or more output links
• Optional Reject link
• Column mappings: for each output link, selection of columns & creation of new derived columns is also possible
• Derivations
• Expressions written in Basic
• Final compiled code is C++ generated object code (Specified compiler must be available on the
DS Server)
• Powerful but expensive stage in terms of performance
• Stage variables
  • For readability, & for performance when the same complex expression is used in multiple derivations
  • Be aware that
    • The values are retained across rows, so the order of definition of stage variables matters
    • The values are retained across rows only within each partition
• Expressions for constraints and derivations can reference
• Input columns
• Job parameters
• Functions (built-in or user-defined)
• System variables and constants
• Stage variables – be aware that the variables are within each partition
• External routines
• Link Ordering - to use derivations from previous links
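The caveat about stage variables persisting across rows, but only within a partition, can be sketched in plain Python (not DataStage code; the running-total variable is a made-up example):

```python
# Toy model of a Transformer stage variable: a running total that persists
# across rows, but each partition gets its own independent copy.
def transformer(partition_rows):
    running_total = 0                  # stage variable, fresh per partition
    out = []
    for sales in partition_rows:
        running_total += sales         # retained across rows in this partition
        out.append((sales, running_total))
    return out

# Two partitions of the same input link: totals do not mix across them
p1 = transformer([10, 10, 20])   # e.g. the Z1 rows
p2 = transformer([20, 30])       # e.g. the Z2 rows
```

This is why a "running" derivation gives partition-local results unless the data is partitioned so that all related rows land in the same partition.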
Transformer Stage
[Screenshot: inside the Transformer Stage — link area with input and output links, expressions/transforms, and metadata area]
Transformer Stage
[Screenshot: a Properties section for each output link; column mappings (not all input columns need to be used); stage variable derivations/expressions; metadata defined for derived columns]
Transformer Stage
• Constraints
• Filter data
• Direct data down different output links
• For different processing or storage
• Output links may also be set to "Otherwise/Log" to catch records that have not passed through any of the links processed so far (link ordering is critical)
• Optional reject link to catch records that failed to be written to any output because of write errors or NULLs
[Screenshot: constraint example — do not output if Region_ID is NULL]
Join Stage
• Four types:
• Inner
• Left outer
• Right outer
• Full outer
• Join keys must have the same name; rename them in a previous stage if required
• All input link data is pre-sorted & partitioned** on the join key
• By default
  • The sort is inserted by DataStage
  • If the data is already sorted (by a previous stage), DataStage does not re-sort it
** - to be discussed shortly
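The reason both inputs must be sorted on the join key can be seen in a plain-Python sketch of a merge-based inner join, the technique this kind of stage relies on (not DataStage internals; the sample data is made up):

```python
# Merge inner join over two inputs pre-sorted on the join key: a single
# forward pass over each input, with no hashing and bounded memory.
def merge_inner_join(left, right):
    """left, right: lists of (key, value), each sorted by key."""
    i, j, out = 0, 0, []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit all pairings for this key (handles duplicate right keys)
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((lk, left[i][1], right[j2][1]))
                j2 += 1
            i += 1
    return out

sales = [("Z1", 40), ("Z2", 50)]
names = [("Z1", "North"), ("Z2", "South"), ("Z3", "East")]
# inner join drops Z3, which has no match on the left input
```

Partitioning on the join key serves the same purpose across nodes: matching keys must land in the same partition for the local merge to see them.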
Join Stage
[Screenshot: Join Stage implementation — join keys (multiple keys allowed) and join types]
Aggregator Stage
[Screenshot: Aggregator Stage properties]
Aggregator Stage
• Hash
  • Intermediate results for each group are stored in a hash table
  • Final results are written out after all input has been processed
  • No sort required
  • Use when the number of unique groups is small: the running tally for each group's aggregate calculations needs to fit into memory (roughly 1 KB of RAM per group)
• Sort
  • Requires input sorted on the grouping keys
  • Only a single aggregation group is kept in memory at a time
  • When a new group is seen, the current group is written out
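The two methods can be contrasted in a plain-Python sketch (not DataStage internals; the sample rows are made up):

```python
# Hash vs. sort aggregation: hash keeps one in-memory tally per group and
# emits at end-of-input; sort streams one group at a time over sorted input.
from collections import defaultdict

def hash_aggregate(rows):
    """rows: (group_key, value) pairs in any order."""
    totals = defaultdict(float)            # one tally per group, all in memory
    for key, value in rows:
        totals[key] += value
    return dict(totals)                    # emitted only after all input is read

def sort_aggregate(sorted_rows):
    """sorted_rows: (group_key, value) pairs pre-sorted on group_key."""
    out, current, total = [], None, 0.0
    for key, value in sorted_rows:
        if key != current:                 # new group seen: write out the old one
            if current is not None:
                out.append((current, total))
            current, total = key, 0.0
        total += value
    if current is not None:
        out.append((current, total))       # flush the final group
    return out

rows = [("Z1", 10), ("Z1", 10), ("Z1", 20), ("Z2", 20), ("Z2", 30)]
```

The memory trade-off is visible directly: `hash_aggregate` holds every group's tally at once, while `sort_aggregate` holds exactly one.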
Using Job Parameters
• Job parameters are referenced as #XXX#
  • Direct usage in expression evaluation
  • Usage as a stage property for string substitution
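The string-substitution usage can be sketched in plain Python (a toy model, not DataStage internals; the parameter names and file path are made-up examples):

```python
# Toy model of #ParamName# substitution in a stage property string.
import re

def substitute_params(text, params):
    """Replace every #Name# token with params['Name']."""
    return re.sub(r"#(\w+)#", lambda m: str(params[m.group(1)]), text)

params = {"ExchangeRate": 40, "InDir": "/data/in"}
path = substitute_params("#InDir#/sales.txt", params)   # stage property usage
rupees = 20 * params["ExchangeRate"]                    # direct expression usage
```

Supplying the values at run time (rather than hard-coding them in the job) is what lets one compiled job serve Dev, Test, and Production paths and rates.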
Recap
• We Saw:
• Table Definition
• Job
• Stages
• Sequential File as source & target
• Aggregator
• Join
• Transformer
• Job Parameters
Case Study 1