Data Is Streaming Into The Cloud, Including From IBM i

November 14, 2022

Alex Woodie

We are in the midst of a vast migration of data into public clouds, as companies seek to leverage technological innovation occurring in the cloud for competitive advantage. While there are no IBM i runtimes in the public cloud yet, IBM i shops are still getting value by moving their valuable Db2 for i data into the cloud, often using streaming technologies and tools like Apache Kafka and Apache Camel, as well as an updated ETL offering from IBM.

It is tough to understate the magnitude of the transition to the public cloud is having on the $4 trillion IT industry. The analyst firm Gartner expects the cloud to account for half of enterprise IT spending by 2025, as we have previously written. Much of the action is occurring on the operational front, but even more progress has been made in migrating analytical data and analytical (OLAP) workloads into the cloud.

In the past, where companies might have used a Netezza or a Teradata machine to process and analyze data – or even an Apache Hadoop cluster if they were brave enough to go the open source route – today they’re embracing cheap and massive data lakes running on AWS, Google Cloud, and Microsoft Azure as the platform for their data analytics efforts. Thanks to the separation of storage and compute afforded by native cloud architectures, they can quickly spin up Azure Synapse, Databricks, or Snowflake instances to crunch their data and build machine learning models, and spin them down when they are done.

Need to load 100 PB of data into the cloud? AWS Snowmobile is probably your best bet.

Not surprisingly, IBM i shops would like to get in on this digital gold rush to the cloud, too. While IBM i shops can’t yet provision CPWs in the major public clouds to crunch their data in a native IBM i environment, they can still make use of the array of cloud services, ranging from traditional OLAP and advanced analytics to streaming analytics and even machine learning and AI.

The extract, transform, and load (ETL) industry has existed for as long as OLAP has been a thing, and there are established ETL vendors out there that have experience tapping into IBM i data. IBM, Informatica, Talend, Fortra (previously HelpSytsems), and Matillion all have ETL offerings that work with IBM i data. But there’s one more name that IBM i shops might want to add to their lists: DataMigrator.

DataMigrator Another ETL Option

The optional add-on to Db2 Web Query was originally unveiled back in 2015. But as part of the fall 2022 IBM i Tech Refresh and the release of Db2 Web Query version 2.4 last month, DataMigrator received a major upgrade in the area of cloud data warehouse connectivity.

John Westcott, a Db2 for i analytics consultant for IBM Technology Services (formerly IBM Lab Services) provided a compelling overview of DataMigrator’s low-code and no-code data transformation capabilities, as well as its new cloud streaming features, during the COMMON NAViGATE conference last month.

According to Westcott, the fact that DataMigrator is a native IBM i solution plays heavily in its favor.

DataMigrator provides a GUI for creating ETL data pipelines. (Image courtesy John Westcott, IBM)

“It’s an IBM i product. It’s not some open-source product that we’re using,” he says. “And because it’s an IBM i product, it understands Db2 for i. So it understands all the weirdness of Db2 – the multiple members [and] that we have bad data.”

As a SQL-based Db2 product, DataMigrator processes data sets, as opposed to processing rows of data. That’s important, Westcott says, because other ETL tools process data row by row, which can cause additional latency. It’s also significantly less expensive than competing ETL products, including the mainstream ETL product from IBM, he says.

“We’ve had a number of people come to us because they’ve gotten prices from either other IBM products or vendor products and the cost is extremely high to where DataMigrator is,” Westcott says. “Did I mention it’s low cost?” he emphasized again. “It is really low cost.”

There’s no point beating around the bush when it comes to IBM i’s bad data problem. Whether it’s dates that are characters, a single file that’s been blown up into a dozen separate files, files with multiple members, or flat files from a System/36 environment, it’s best to just admit that Db2 for i data isn’t always the easiest to work with, Westcott says.

“It’s a fact. We all talk about it. We all know it,” he says. “We’re all over the place. So maybe we need to pre-process that data before we even try to get it down to some kind of data science or BI system.”

DataMigrator sports an array of data transformation capabilities. Some of the simpler transformations can be configured in a no-code, drag-and-drop fashion using the product’s graphical interface. More advanced transformations, such as aggregations of sales or transformations of date data, may need a bit of coding.

And while DataMigrator is an IBM i product, it can also be used to extract and transform data from non-IBM i data sources, such as a SQL Server database, a JDBC connection, or an Excel spreadsheet. Westcott demonstrated how DataMigrator can be used to grab SQL Server data and join it with Db2 for i data and load it into another IBM i system during his NAViGATE session.

“If you had to develop this application writing an RPG application, how long would it take to write this and test this. Days? Weeks? Months?” he asks. “How long did it take me to write this flow? Maybe five minutes.”

Real-Time Data Migration

Westcott also demonstrated how DataMigrator can be used in combination with other tools to create data connections to cloud data warehouses, which is an increasingly popular use case.

Depending on the size and type of the data, as well as the latency requirement, users may choose to deploy DataMigrator in combination with Apache Kafka, a popular open source publish and subscribe system that runs on IBM i via PASE, or used in conjunction with Kafka and Apache Camel.

DataMigrator can be used with Kafka to stream data to cloud data warehouses. (Image courtesy John Westcott, IBM)

DataMigrator can be used to extract Db2 for i data, transform it or aggregate if it desired into JSON, then and directly load it into a Kafka topic. Once it’s on the Kafka topic, it is available for anybody who has access to that Kafka topic, whether it’s IBM Watson Studio or Azure data lake. There’s no need for the IBM i developer to know anything about the targets, which is one of the advantages of this approach, Westcott says.

“I don’t need to know how to talk to Watson. I don’t need to know how to talk to Azure,” he says. “Those services will pull from Kafka for what they want, when they want, if they want. We’re just making the data available, which in today’s world is a huge thing. It’s all about data availability.”

In the above scenario, the connection between DataMigrator and Kafka is handled through HTTP-based REST calls, including PUTS and GETS. There are certain advantages to that approach, including the ability to move just about any type of data of any size. But there is another way to integrate DataMigrator with Kafka (and thus downstream systems like Azure or Watson) by using data queues and Apache Camel, which has its own advantage, Westcott says.

Camel is an open-source routing tool that, like Kafka, runs on IBM i via the PASE runtime. The software allows users to configure rules that control what happens when certain data or workloads are triggered or completed. When DataMigrator is configured to send data to Camel via data queues, Camel essentially takes over the process of keeping the Kafka system full of data.

Each approach has its advantages, Westcott says. “We’ll be somewhat limited in the data that can go this way [data queue], because it has to meet whatever requirements of a data queue,” he says. There are also limitations in what data types can be sent via data queue.

However, performance can be better using the data queue approach, with the Camel-to-Kafka connection, “because once you pump the information into the data queue, DataMigrator is then free to go on to do the next task,” Westcott says. “So it’s a consideration for how close to near real time do you need it.”

The folks at IBM Technology Services have been getting queried recently about another open-source product called Debezium that simplifies this whole change data capture (CDC) stack. IBM typically points them to some combination of DataMigrator and Kafka or DataMigrator and Camel-Kafka.

“There is no Debezium IBM i connector,” Westcott says. “But that doesn’t mean we can’t emulate what that does.”

Data Is Streaming Into The Cloud, Including From IBM i

DataMigrator Another ETL Option

Real-Time Data Migration

RELATED STORIES