
IBM DataOps

DataStage on
IBM Cloud Pak
for Data
An automated data integration solution
on a multicloud data platform
Contents

The rise of a new AI fueled data integration strategy
Using containers for your data integration tool
The five major benefits of deploying DataStage on IBM Cloud Pak for Data
Next steps

The rise of a new AI fueled data integration strategy

According to IDC, the worldwide amount of stored data will grow
nearly 17% in 2020 to 6.8 zettabytes (ZB), with compound annual
growth rate of nearly 18% through 2024. This dramatic growth in data
increases the amount of time and money it takes to ingest and manage
enterprise-wide data, and this starts to hinder users’ productivity
and client satisfaction. But with the rise of artificial intelligence (AI)
technology, there are new solutions to combat these problems. AI
technology accelerates the pace of data discovery, broadens the
range of data that can be leveraged and automates tasks that previously
required human expertise. Gartner even states that by the end of 2024,
75% of enterprises will shift from piloting to operationalizing AI, driving
a 5x increase in streaming data and analytics infrastructures.
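
As a rough arithmetic illustration of the IDC figures above, the sketch
below compounds the 6.8 ZB base at 18% per year. It assumes the growth
rate applies uniformly each year from 2021 through 2024, and it rounds
"nearly 18%" to exactly 18%, so the results are approximations rather
than IDC projections. The function name is hypothetical.

```python
def project(base_zb, cagr, years):
    """Compound a base data volume forward at a fixed annual growth rate."""
    # Index 0 is the base year; each later entry applies one more year of growth.
    return [round(base_zb * (1 + cagr) ** n, 1) for n in range(years + 1)]


# Assumed inputs: 6.8 ZB in 2020, 18% compound annual growth, four more years.
series = project(6.8, 0.18, 4)

for year, zb in zip(range(2020, 2025), series):
    print(year, zb)
```

Under these assumptions the stored volume nearly doubles over the
period, reaching about 13 ZB by 2024.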

That being said, AI can only be effective if the full range of data is
trustworthy, accessible and compatible. The increased use of AI
highlights weaknesses and limitations that have long existed in data
systems, so enterprises must turn to new, modern strategies. Such
agility requires a new information architecture, one that allows for
seamless integration and operation across the entire data lifecycle.
This is why IBM clients are modernizing and transitioning away from
legacy systems to a modern cloud-based architecture: IBM
Cloud Pak® for Data. This data and AI platform provides improved
scalability and elasticity for varying workloads and lowers operating
costs while connecting to cloud data warehouses and real-time
analytical applications.

Many factors contribute to a major shift in how data integration tools
are deployed and used with the rise of AI, from high data variety in an
enterprise to the needs of data users. Because of these factors,
companies need to adopt a process-oriented approach to managing the
data lifecycle with DataOps in order to improve business performance
and increase competitiveness. Companies embracing AI for their products
and processes will require a highly flexible and scalable data
integration technology, which is embedded in the market-leading data
integration tool IBM® DataStage® on IBM Cloud Pak for Data. It is
equipped with features that improve the productivity of your business
and IT users:

– A best-in-breed parallel engine and automatic workload balancing
to elastically scale your workloads up to 30% faster than DataStage
on-premises
– Design once, run anywhere capabilities to bring data integration
to your data
– Automated job design and integration with Netezza®, IBM Db2®
or cloud data warehouses, data virtualization or DataOps services

IBM DataOps DataStage on IBM Cloud Pak for Data 2


By integrating this technology seamlessly with other services on
the data platform, enterprises can benefit from comprehensive and
automated data provisioning while maintaining the performance,
security and governance they need. Containerized architectures,
specifically those deployed on a cloud-enabled platform such as
IBM Cloud Pak for Data, are key to this transformation.

Using containers for your data integration tool

Cloud native by design, IBM Cloud Pak for Data unifies market-
leading services spanning the entire data and analytics lifecycle. This
includes the capabilities previously provided by the IBM InfoSphere®
Information Server platform which are now available as DataStage and
IBM Watson Knowledge Catalog cloud-ready services on IBM Cloud
Pak for Data. With IBM Cloud Pak for Data you can streamline your data
integration at a lower cost on a unified, cloud-native platform, and with
the automation capabilities included on the service, your organization
can gain business insights from your data in near real-time.

Often the IBM DataStage and Information Server platforms have
historically been deployed to handle large scale enterprise workloads
and perform mission critical functions. To ensure a seamless move
to a modernized AI and cloud ready architecture, the DataStage and
Information Server modernization upgrades provide an easy migration
that delivers access to the capabilities on the platform as well as
providing an even higher level of resiliency, scalability, automation
and operational efficiency.

By deploying DataStage and Watson Knowledge Catalog services on
IBM Cloud Pak for Data, enterprises can leverage all of the powerful
features that make up an industry-leading data platform.

Built for AI

– In-line data quality and metadata exchange with Watson
Knowledge Catalog for improved data governance
– Out-of-the-box integration with data science, event messaging,
data virtualization and data warehousing services on IBM Cloud
Pak for Data

Powered by AI

– Increased user productivity through built-in design accelerators
such as stage suggestions, schema propagation and automatic job
template generation

IBM DataStage on IBM Cloud Pak for Data is the containerized version
of IBM InfoSphere DataStage, based on a microservices architecture
and optimized for Kubernetes. Through IBM Cloud Pak for Data,
DataStage can run natively on Red Hat® OpenShift®, the world’s
leading container orchestration platform.

By breaking down the DataStage capabilities into microservices
instead of a monolithic stack, you gain several opportunities:

– Deploy within minutes; enable standard deployment and
management while retaining flexibility to modify parameters
as needed.
– Gain reliability due to out-of-the-box enhanced Kubernetes
availability and support for high availability/disaster recovery
(HADR) automated failover.
– Reduce management burden with automated updates. Service
packs, versions and mods can be deployed with one click.
– Automate management by “application group” so administrators can
use namespaces to manage access control and provisioning options.
– Monitor and manage at an application level thanks to platform
and service-level features.
– Scale microservices independently to respond to changing needs.

Containerizing your data integration technology enables you to run
DataStage as part of a hybrid cloud environment (combination of cloud
and non-cloud platforms) or multicloud environment (clouds from
different providers) that uses the appropriate infrastructure for each
type of data.

These advantages may account for the recent popularity of containers.
According to the Red Hat Global Customer Tech Outlook 2019, 57% of
organizations are already using containers and container usage is also
expected to increase by 89% in the next 2 years. With IBM Cloud Pak
for Data, you can more easily access the full scale of IBM services to
design, deploy and manage advanced analytics that help you deliver
business value.



The five major benefits of deploying DataStage on IBM Cloud Pak for Data

1. Ease of enabling hybrid cloud and multicloud on a single, unified platform

According to Gartner, the majority of enterprises use more than one
cloud provider, and historically, from the context of data integration,
the challenge has been that enterprises need to incur data latencies
and data egress costs while moving data between different cloud
platforms and their on-premises data sources. Organizations often had
to run individual applications across multiple providers to execute their
data integration jobs, and it took up more time and costs than should
have been necessary. But now with IBM DataStage on IBM Cloud Pak
for Data, users have the freedom to choose any cloud provider with one
solution. With design once, run anywhere features within DataStage,
users can design their jobs once on-premises, move runtimes to where
their data resides, and thereby avoid data latencies and millions of
dollars in egress costs. There’s no added need to move your data out
of where it’s already housed.

2. Parallel processing and automatic workload balancing

With a fully cloud-native architecture, DataStage can dynamically
scale workloads as well as optimize for large data sets with a best-in-
breed parallel engine (PX). Users have the choice to create a parallel
or an Apache Spark job in IBM DataStage Flow Designer.

Moreover, customers can expect up to around a 30% decrease
in execution time with IBM DataStage on IBM Cloud Pak for Data
compared to traditional DataStage on-premises. These performance
improvements are particularly apparent during execution windows
of resource contention due to the automatic workload balancing
that distributes workloads across the worker nodes in the OpenShift
cluster and maximizes throughput.

3. Savings on development time and costs thanks to automated job design and DevOps support

To address the challenge of managing the number of containerized
applications across different operating systems, organizations need a
robust open source tool such as Red Hat OpenShift, available on IBM
Cloud Pak for Data. The IBM Cloud Pak for Data platform helps them
scale and provision containers to support key IT initiatives such as
microservices and cloud migration strategies. DataStage containers
allow for creation and automation of continuous integration/continuous
delivery (CI/CD) pipelines for jobs from dev to test to production.
They also help streamline CI/CD pipelines by supporting source
control tools such as GitHub to frequently publish jobs and release
to production.

IBM DataStage Flow Designer has features like built-in search, a quick
tour to get companies jump-started, automatic metadata propagation,
smart palette, suggested stages and simultaneous highlighting of all
compilation errors. Developers can use these features to be more
productive while designing jobs, and their productivity can increase
to be as much as nine times faster than traditional hand coded jobs.
Users can expect up to 87% savings in development cost when using
visual and ML-assisted design, as compared to hand coding.

Many companies have thousands of jobs in a single project, and they
depend on these jobs to run 24 hours a day, 7 days a week. Rewriting
these jobs, with the likely possibility of errors and outages, is not an
option for them. Using the DataStage Flow Designer on IBM Cloud Pak
for Data, these companies can take any existing DataStage job and
render it in the thin client so there’s no need to rewrite those jobs.
Moreover, clients can save millions on license costs by eliminating the
need for purchasing thick clients for job design, by instead using the
DataStage Flow Designer thin client.

In addition to the design and development capabilities, DataStage
offers hundreds of out-of-the-box, pre-built, ready-to-use connectors
for Amazon S3, Azure, Db2, Hive and Kafka, and it also offers stages
such as transformer, encode, annotate, tail and merge. These
drastically reduce the time developers spend on preparing data for
analytics actions. With new operations added every few weeks,
developer productivity is enhanced over time.
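
The automatic workload balancing described under benefit 2 can be
pictured with a minimal sketch. The Python below is an illustration
only, not DataStage's actual scheduler: it assumes each job has a known
cost estimate and greedily assigns jobs to whichever worker node
currently carries the least load. The function names, job names, node
names and cost figures are all hypothetical.

```python
import heapq


def balance(jobs, nodes):
    """Greedily assign each job to the currently least-loaded worker node."""
    # Min-heap of (current_load, node_name); the lightest node pops first.
    heap = [(0, name) for name in nodes]
    heapq.heapify(heap)
    assignment = {name: [] for name in nodes}
    # Placing the most expensive jobs first tends to even out the final loads.
    for job, cost in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        load, name = heapq.heappop(heap)
        assignment[name].append(job)
        heapq.heappush(heap, (load + cost, name))
    return assignment


def node_loads(assignment, jobs):
    """Sum the estimated cost of the jobs placed on each node."""
    return {name: sum(jobs[j] for j in placed) for name, placed in assignment.items()}


# Hypothetical jobs with made-up cost estimates (arbitrary units).
jobs = {"load_orders": 8, "load_customers": 5, "dedupe": 4, "aggregate": 3, "export": 2}
plan = balance(jobs, ["worker-1", "worker-2"])
loads = node_loads(plan, jobs)
print(plan)
print(loads)
```

With these example figures the two nodes end up with equal total load.
A production scheduler would also have to account for live resource
usage and jobs arriving over time, which this sketch ignores.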



4. Built-in integration with data and AI services

With DataStage on IBM Cloud Pak for Data, it is easy to leverage
capabilities from the broader IBM and open source ecosystems. The
platform includes many core services ranging from data warehouses,
Watson Knowledge Catalog, data science and data virtualization to
event messaging. Colocation with Netezza and Db2 on IBM Cloud Pak
for Data system removes network bottlenecks and supports high-
speed data delivery. Easily connect cloud data warehouses with
pre-built connectors for Snowflake and Amazon Redshift to access and
transform data, no matter which cloud platform the data resides on.

To prevent data lakes from turning into “data swamps” with
ungoverned data, you can simultaneously track data lineage in ETL
jobs with IBM InfoSphere QualityStage® while data is ingested by
target environments, such as data lakes, to automatically resolve
quality issues. You can also provide metadata support for policy-
driven access to sensitive data and prevent unauthorized users from
getting access to your sensitive data. This concept of data quality can
be extended to support comprehensive data governance across the
enterprise data warehouse (EDW).

With the included data virtualization capabilities on IBM Cloud Pak for
Data, business users can discover data, query data, and experiment
with flows for data warehouses while also performing simple SQL-
based data transformations, running development and testing, and
managing both structured and unstructured data.

5. The value of Red Hat in IBM Cloud Pak for Data

The hybrid cloud and multicloud options are enhanced by the
advantages of Red Hat OpenShift, upon which IBM Cloud Pak for Data
is based. The Red Hat stack, OpenShift and Kubernetes operating
together, is particularly beneficial. It allows you to develop secure
and scalable Kubernetes applications without being overwhelmed by
the complexities of large-scale manual Kubernetes administration.
Using Kubernetes Operators, Red Hat OpenShift offers automated
installation, upgrades, and lifecycle management for every part of
the container stack: the operating system, Kubernetes and cluster
services, applications, and persistent data storage.

OpenShift provides a comprehensive platform that enables automated
operations and provides out-of-the-box support for languages such as
Java, Node.js, Ruby and Python. OpenShift also provides supporting
services such as monitoring, authentication and authorization and
network management. These OpenShift features are not in the open
source version of Kubernetes.

In addition, the included Kubernetes distribution is enterprise-grade,
and benefits from hundreds of security, defect and performance fixes
in each release. Validated popular storage and networking plug-ins for
Kubernetes are also available. And finally, open source Red Hat tools
provide additional functionality options, such as Apache Spark for
streaming data, or the popular Python and R languages for machine
learning applications. The additional functionality ensures that
enterprises leverage essential open source tools necessary to develop,
deploy and run applications through the OpenShift platform. When
these varied resources are all part of a single, unified platform in IBM
Cloud Pak for Data, they are easier to integrate and manage than they
would otherwise be.

Next steps

When deployed via IBM Cloud Pak for Data, DataStage is more than
a robust data integration tool that can process data at scale. It
becomes part of a microservices-based data platform that also helps
you organize and analyze your data, infusing AI capabilities throughout
your enterprise.

DataStage on IBM Cloud Pak for Data provides:

1. AI capabilities, built for AI projects
2. Up to 50% lower cost of operations due to automatic failure
resolution and automation of operational tasks such as backup,
recovery, and patch management
3. 30% faster workload execution compared to traditional DataStage
thanks to built-in workload balancing and best-in-breed parallel
runtime that optimize workload execution
4. 87% savings in development cost when using visual and ML-
assisted design, as compared to hand coding
5. Savings on data movement costs by bringing integration workloads
to the data using design once, run anywhere
6. Pre-built integrations with data science, data warehouse and data
virtualization services using a common UI

Existing customers can retain their investments in skills and assets
and save millions of dollars in license costs by eliminating the need
to purchase Windows or Citrix thick client licenses.

DataStage on IBM Cloud Pak for Data offers a unique combination of
containerized architecture, Red Hat infrastructure, data connectivity
and a broader IBM capability ecosystem, making it a compelling
choice for enterprises that want to prepare their data foundations
for the opportunities ahead.

To get started try IBM Cloud Pak for Data for free
Schedule a free one-on-one consultation with a data integration expert.



© Copyright IBM Corporation 2020

IBM Corporation
New Orchard Road, Armonk, NY 10504

Produced in the United States of America


October 2020

IBM, the IBM logo, ibm.com, IBM Cloud Pak, DataStage, Netezza, Db2, InfoSphere,
IBM Watson, and QualityStage are trademarks of International Business Machines
Corp., registered in many jurisdictions worldwide. Other product and service
names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available on the web at “Copyright and trademark information”
at www.ibm.com/legal/copytrade.shtml.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks
of Microsoft Corporation in the United States, other countries, or both.

Java and all Java-based trademarks and logos are trademarks or registered
trademarks of Oracle and/or its affiliates.

Red Hat and OpenShift are registered trademarks of Red Hat, Inc. or its
subsidiaries in the United States and other countries.

This document is current as of the initial date of publication and may be
changed by IBM at any time. Not all offerings are available in every country
in which IBM operates.

The performance data discussed herein is presented as derived under specific
operating conditions. Actual results may vary. It is the user’s responsibility to
evaluate and verify the operation of any other products or programs with IBM
products and programs. THE INFORMATION IN THIS DOCUMENT IS PROVIDED
“AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT
ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM
products are warranted according to the terms and conditions of the agreements
under which they are provided.

The client is responsible for ensuring compliance with laws and regulations
applicable to it. IBM does not provide legal advice or represent or warrant that
its services or products will ensure that the client is in compliance with any law
or regulation.

Statement of Good Security Practices: IT system security involves protecting
systems and information through prevention, detection and response to improper
access from within and outside your enterprise. Improper access can result in
information being altered, destroyed, misappropriated or misused or can result in
damage to or misuse of your systems, including for use in attacks on others. No IT
system or product should be considered completely secure and no single product,
service or security measure can be completely effective in preventing improper
use or access. IBM systems, products and services are designed to be part of a
lawful, comprehensive security approach, which will necessarily involve additional
operational procedures, and may require other systems, products or services to
be most effective. IBM DOES NOT WARRANT THAT ANY SYSTEMS, PRODUCTS
OR SERVICES ARE IMMUNE FROM, OR WILL MAKE YOUR ENTERPRISE IMMUNE
FROM, THE MALICIOUS OR ILLEGAL CONDUCT OF ANY PARTY.

Statements regarding IBM’s future direction and intent are subject to change
or withdrawal without notice, and represent goals and objectives only.

7EB2XONR
