Apache Kafka is the de facto standard open-source event streaming platform. In event-driven architectures, applications publish events when data changes, allowing other systems to react in real-time rather than polling for updates.
An example is a CRM application that serves as the system of record for customer data. When a customer’s address changes, instead of having every application repeatedly query the CRM for current address data, the CRM can publish an ‘address-update’ event. Interested applications subscribe to these events and maintain their own current copy of the data.
Kafka provides native programming interfaces for Java, Python, and Scala. This article demonstrates how traditional z/OS applications can participate in Kafka-based event streaming using IBM MQ and Kafka Connect.
Native Kafka programming interfaces and Kafka Connect
Applications can interact directly with Kafka through native programming interfaces. Kafka, being Java-based, naturally supports Java applications. Other languages with native Kafka support include Python and Scala. IBM recently introduced a Kafka SDK for COBOL on z/OS, though I will not explore that approach here.
Kafka Connect bridges the gap for applications without native Kafka support. This open-source component sits between Kafka and other middleware technologies like databases and messaging systems, translating between their protocols and Kafka’s event streaming format.
Solution Architecture
Our solution enables z/OS applications to produce and consume Kafka events through IBM MQ, leveraging the well-established asynchronous messaging patterns familiar to mainframe developers.
Key Benefits:
Uses proven MQ messaging patterns
Works with both CICS online and batch applications
Supports any z/OS programming language that can create MQ messages (COBOL, PL/I, Java, Python, Node.js, Go)
No application code changes required beyond message formatting
Architecture Overview
The solution uses Kafka Connect as a bridge between MQ queues and Kafka topics.
For Event Production:
z/OS applications send messages to dedicated MQ queues
Kafka Connect reads from these queues
Messages are published to corresponding Kafka topics
Kafka broker makes events available to subscribers
For Event Consumption:
Kafka Connect subscribes to Kafka topics
Incoming events are placed on corresponding MQ queues
z/OS applications read from queues for business processing
Queue-to-Topic Mapping
Each Kafka topic has a dedicated MQ queue. This one-to-one mapping simplifies configuration and makes the data flow transparent for both operations and development teams.
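To make this concrete, here is a minimal sketch of a connector definition for the event production path. It uses the property names of IBM's open-source kafka-connect-mq-source connector; the connector, queue, topic, and queue manager names are illustrative assumptions:

name=customer-events-source
connector.class=com.ibm.eventstreams.connect.mqsource.MQSourceConnector
tasks.max=1
mq.queue.manager=MQ1
mq.connection.mode=bindings
mq.queue=APP.CUSTOMER.EVENTS.OUT
mq.record.builder=com.ibm.eventstreams.connect.mqsource.builders.DefaultRecordBuilder
topic=customer-events

With mq.connection.mode=bindings the connector uses a local MQ bindings connection rather than a client (TCP) connection, matching the local binding setup described under Production Configuration below.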
Software Components
Kafka Connect runs as a started task on z/OS. Multiple instances can serve the same workload by sharing startup parameters, providing scalability and high availability. Kafka Connect includes a REST API for:
Configuring connectors for your applications
Monitoring connector status
Integrating with provisioning and deployment processes
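As an illustration, assuming Kafka Connect listens on its default REST port 8083 and the connector definition above has been converted to JSON and saved as customer-events-source.json, the REST API can be driven with simple HTTP calls (host and connector names are hypothetical):

# Create or update the connector from a JSON configuration file
curl -X PUT -H "Content-Type: application/json" \
     --data @customer-events-source.json \
     http://connect-host:8083/connectors/customer-events-source/config

# Check the status of the connector and its tasks
curl http://connect-host:8083/connectors/customer-events-source/status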
Production Configuration
In a production environment, multiple Kafka Connect instances run across different LPARs for high availability. Each instance accesses application queues through MQ local binding connections. MQ queue sharing groups distribute workload across LPARs, ensuring both performance and resilience.
The infrastructure setup supports:
Load balancing across multiple z/OS instances
Fault tolerance through redundant components
Efficient local MQ connections
Summary
This article describes an architecture that provides a clean, straightforward path for z/OS applications to participate in event-driven systems using Apache Kafka. By leveraging existing MQ messaging patterns and Kafka Connect middleware, traditional mainframe applications can integrate with modern streaming platforms without requiring extensive code changes or new programming paradigms. The solution maintains the reliability and performance characteristics that z/OS environments demand while opening doors to real-time data integration and event-driven architectures.
I must shamefully admit I was not aware of the zopen community initiative before it recently became part of the Open Mainframe Project. The zopen community provides a great set of open-source tools ported to Z, such as the dos2unix utility I wrote about earlier here.
On z/OS UNIX, the dos2unix utility is not included. You can achieve similar functionality using other tools available on z/OS UNIX, such as sed or tr. These tools can be used to convert DOS-style line endings (CRLF) to Unix-style line endings (LF).
For example, you can use sed to strip the carriage return characters, or tr to delete them. The file names below are illustrative, and the printf form is used because support for the \r escape varies between sed implementations:
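# Strip the trailing carriage return from each line with sed
sed 's/'"$(printf '\r')"'$//' dosfile.txt > unixfile.txt
# Or simply delete all carriage-return characters with tr
tr -d '\r' < dosfile.txt > unixfile.txt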
The slide deck tells me that it was in 2006 that I created a set of slides for “Kees” with an overview of the continuous availability features of an IBM mainframe setup.
The deck’s content was interesting enough to share here, with some enhancements.
What is availability?
First, let’s talk a little bit about availability. What do we mean when we talk about availability?
A highly available computing setup should provide for the following:
A highly available fault-tolerant infrastructure that enables applications to run continuously.
Continuous operations to allow for non-disruptive backups and infrastructure and application maintenance.
Disaster recovery measures that protect against unplanned outages due to disasters caused by factors that cannot be controlled.
Definitions
Availability is the state of an application service being accessible to the end user.
An outage (unavailability) is when a system is unavailable to an end user. An outage can be planned, for example, for software or hardware maintenance, or unplanned.
What causes outages?
A 2005 research report from the Standish Group showed the various causes of outages.
Causes of outages
It is interesting to see that (cyber) security was not part of this picture, while more recent research published by Uptime Intelligence shows this growing concern. More on this later.
Causes of outages 2020 – 2021 – 2022
The myth of the nines
The table below shows the availability figures for an IBM mainframe setup versus Unix and LAN availability. As a reminder of what the nines mean: 99.999% availability corresponds to roughly 5 minutes of downtime per year, and 99.99999% to roughly 3 seconds.
Things have changed. Unix (now: Linux) server availability has gone up. Server quality has improved, and so has software quality. Unix, however, still does not provide a capability similar to a z/OS sysplex. Such a sysplex simply beats any clustering facility by providing built-in, operating system-level availability.
Availability figures for an IBM mainframe setup versus Unix and LAN
At the time of writing, IBM publishes updated figures for a sysplex setup as well (see https://www.ibm.com/products/zos/parallel-sysplex): 99.99999% application availability, for the configuration described in the footnote: “… IBM Z servers must be configured in a Parallel Sysplex with z/OS 2.3 or above; GDPS data management and middleware recovery across Metro distance systems and storage and DS888X with IBM HyperSwap. Other resiliency technology and configurations may be needed.”
Redundant hardware
The following slides show the redundant hardware of a z9 EC (Enterprise Class), the flagship mainframe of that time.
The redundant hardware of a z9 EC
Contrasting this with today’s flagship, the z16 (source https://www.vm.ibm.com/library/presentations/z16hwov.pdf), is interesting. Since the mainframe is now mounted in a standard rack, the interesting views have moved to the rear of the apparatus. (iPDUs are the power supplies in this machine.)
The redundant hardware of a z16
Redundant IO configuration
A nice, highly fault-tolerant server alone is insufficient for a truly highly available setup. The IO configuration, a.k.a. the storage configuration, must be highly available as well.
A redundant SAN setup
The following slide in the deck highlights how this can be achieved. What is amusing or annoying, depending on your mood, and what triggers me today, are the “DASD CU” terms in the storage boxes. These boxes are the storage systems housing the physical disks. Even at that time, plain terms like storage and disk would have been clearer than DASD (Direct Access Storage Device, goodness, what a code word for disk) and CU (Control Unit, just an abstraction anyway). And then I will ignore the valueless addition of CSS (Channel SubSystem) and CHPID (Channel Path ID) on this slide.
What a prick I must have been at that time.
At least the term Director did get the explanatory text “Switch.”
A redundant storage setup for mainframes
RAS features for storage
I went on to explain that a “Storage Subsystem” has the following RAS features (RAS, ugh…, Reliability, Availability, Serviceability):
Independent dual power feeds (so you could attach the storage box to two different independent power lines in the data center)
N+1 power supply technology/hot-swappable power supplies and fans
N+1 cooling
Battery backup
Non-volatile subsystem cache to protect writes that have not been hardened to DASD yet (which we jokingly referred to as non-violent storage)
Non-disruptive maintenance
Concurrent LIC activation (LIC – Licensed Internal Code, a smoke-and-mirrors term for software)
Concurrent repair and replacement actions
RAID architecture
Redundant microprocessors and data paths
Concurrent upgrade support (that is, the ability to add disks while the subsystem is online)
Redundant shared memory
Spare disk drives
Remote Copy to a second storage subsystem
Synchronous (Peer to Peer Remote Copy, PPRC)
Asynchronous (Extended Remote Copy, XRC)
Most of this is still valid today, except that we no longer have spinning disks; everything is solid-state drives nowadays.
Disk mirroring
Ensuring that data is safely stored in this redundant setup is achieved through disk mirroring at the lowest level. Every byte written to a disk in one storage system is replicated to one or more storage systems, which can be in different locations.
There are two options for disk mirroring: Peer-to-Peer Remote Copy (PPRC) or eXtended Remote Copy (XRC). PPRC is also known as a Metro Mirror solution. Data is mirrored synchronously, meaning an application receives an “I/O complete” only after both primary and secondary disks are updated. Because updates must be made to both storage systems synchronously, they can only be 15 to 20 kilometers apart; otherwise updates would take too long, the speed of light being the limiting factor.
With XRC, data is mirrored asynchronously. An application receives “I/O complete” after the primary disk is updated. The storage systems can be at an unlimited distance apart from each other. A component called System Data Mover ensures the consistency of data in the secondary storage system.
PPRC and XRC
The following slide highlights how failover and failback would work in a PPRC configuration.
PPRC failover and failback
The operating system cluster: parallel sysplex
The presentation then explains how a z/OS parallel sysplex is configured to create a cluster without any single point of failure. All servers, LPARs, operating systems, and middleware are set up redundantly in a sysplex.
Features such as Dynamic Session Balancing and Dynamic Transaction Routing ensure that workloads are spread evenly across such a cluster. Facilities in the operating system and middleware work together to ensure that all data is safely and consistently shared, locking is in place when needed, and so forth.
The Coupling Facility is highlighted, a facility for sharing memory between the different members in a cluster. Sysplex Timers are shown as well; these keep the clocks of the different members in a sysplex in sync.
A parallel sysplex
A few more facilities are discussed. Workload balancing is achieved with the Workload Manager (WLM) component of z/OS. The ability to restart applications without interfering with other applications or z/OS itself is provided by the Automatic Restart Manager (ARM). Resource Recovery Services (RRS) assist with two-phase commits across members in a sysplex.
Automation is critical for successful rapid recovery and continuity
Every operation must be automated to prevent human errors and improve recovery speed. The following slide kicks in a few open doors, stating the rather obvious benefits of automation:
Allows business continuity processes to be built on a reliable, consistent recovery time
Recovery times can remain consistent as the system scales to provide a flexible solution designed to meet changing business needs
Reduces infrastructure management costs and staffing-skill requirements
Reduces or eliminates human error during the recovery process at the time of disaster
Facilitates regular testing to help ensure repeatable, reliable, scalable business continuity
Helps maintain recovery readiness by managing and monitoring the server, data replication, workload, and network, along with the notification of events that occur within the environment
Tiers of Disaster Recovery
The following slide shows an awful picture highlighting the concept of tiers of Disaster Recovery from Zero Data Loss to the Pickup Truck method.
Tiers of Disaster Recovery
I mostly like the Pickup Truck Access Method.
GDPS
The following slide introduces GDPS (the abbreviation of the meaningless concept of Geographically Dispersed Parallel Sysplex). GDPS is a piece of software on top of z/OS that provides the automation combining all the previously discussed components into a continuously available configuration. GDPS takes care of the actions needed when failures occur in a z/OS sysplex.
GDPS
GDPS comes in two flavors: GDPS/PPRC and GDPS/XRC.
GDPS/PPRC is designed to provide continuous availability and no data loss between z/OS members in a sysplex across two sites that are at most campus distance (15 to 20 km) apart.
GDPS/XRC is designed to provide automatic failover between sites at an extended distance from each other. Since GDPS/XRC is based on asynchronous data mirroring, minimal data loss can occur for data not yet committed to the remote site.
GDPS/PPRC and GDPS/XRC can be combined into a best-in-class solution: a high-performance, zero-data-loss setup for local/metro operation, plus an automatic site-switch capability for extreme situations such as natural disasters.
In summary
The summary slide presents an overview of the capabilities of the server hardware, the Parallel Sysplex, and the GDPS setup.
Redundancy of Z server, Parallel Sysplex and GDPS
But we are not there yet: ransomware recovery
When I created this presentation, ransomware was not the big issue it is today. Nowadays, the IBM solution for continuous availability has been enriched with a capability for ransomware recovery. This solution, called IBM Z Cyber Vault, combines various capabilities of IBM Z. It can create immutable copies, or Safeguarded Copies in IBM Z Cyber Vault terms, taken at multiple points in time on production data, with rapid recovery capability. In addition, the solution enables data validation, supporting tests of the validity of each captured copy.
The IBM Z Cyber Vault environment is isolated from the production environment.
Whatever the type of mainframe configuration, this IBM Z Cyber Vault capability can provide a high degree of cyber resiliency.
Here is a small assembler program, ASCBNAME, that retrieves the name of the address space it runs in from the ASCB control block and writes it to an output dataset.
ASCBNAME CSECT
EQUATES
SAVE (14,12),,TST/NDG/&SYSDATE/&SYSTIME/
USING ASCBNAME,R12 SET UP BASE ADDRESSABILITY
LR R12,R15 LOAD BASE REG WITH ENTRY POINT
LA R14,SAVE GET ADDRESS OF REGISTER SAVE
ST R13,4(0,R14) SAVE CALLER'S SAVE AREA ADDR
ST R14,8(0,R13) SAVE MY SAVE AREA ADDRESS
LR R13,R14 LOAD SAVE AREA ADDRESS
*
INIT DS 0H
OPEN (OUT,(OUTPUT))
*
DOE DS 0H
SR R1,R1 R1 = 0
USING PSA,R1 ADDRESS PSA
L R2,PSAAOLD GET ADDRESS CURRENT ASCB
DROP R1 RELEASE PSA ADDRESSING
USING ASCB,R2 ADDRESS CURRENT ASCB
L R1,ASCBJBNS GET ADDRESS OF ADDRESS SPACE NAME
DROP R2 RELEASE ASCB ADDRESSING
MVC ADDRSPC,0(R1) GET NAME
PUT OUT,OUTREC WRITE THE RECORD
RETURN DS 0H
CLOSE OUT
SLR R15,R15
L R13,4(R13) LOAD CALLERS SAVE AREA ADDRESS
RETURN (14,12),RC=(15) RETURN TO CALLER
*
*
*
DC C'********** ************* WORK AREA ******'
SAVE DS 18F
OUTREC DS CL80
ORG OUTREC
ADDRSPC DC CL8' '
REST DC CL72' '
OUT DCB DDNAME=OUT, *
DSORG=PS, *
MACRF=(PM)
*
IHAASCB DSECT=YES
IHAPSA
END ,
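A minimal job to run this program could look like the sketch below. The load library name is an assumption, and because the DCB in the program specifies no record attributes, they are supplied on the DD statement (OUTREC is 80 bytes, hence RECFM=FB,LRECL=80):

//RUNASCB  JOB ...
//STEP1    EXEC PGM=ASCBNAME
//STEPLIB  DD DISP=SHR,DSN=YOUR.LOAD.LIBRARY
//OUT      DD SYSOUT=*,DCB=(RECFM=FB,LRECL=80)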
Not sure what I used it for, but here is a simple program in assembler to create an ABEND with a completion code of your choice.
Look here in the IBM manuals for more specifics on the ABEND macro.
ABENDIT CSECT
EQUATES
SAVE (14,12),,ABENDIT/OURDEPT/&SYSDATE/&SYSTIME/
USING ABENDIT,R11 SET UP BASE ADDRESSABILITY
LR R11,R15 LOAD BASE REG WITH ENTRY POINT
LA R14,SAVE GET ADDRESS OF REGISTER SAVE
ST R13,4(0,R14) SAVE CALLER'S SAVE AREA ADDR
ST R14,8(0,R13) SAVE MY SAVE AREA ADDRESS
LR R13,R14 LOAD SAVE AREA ADDRESS
* Business Logic
ABEND 1234 1234 or some other code up to 4095
* Epilogue
RETURN EQU *
L R13,4(R13)
RETURN (14,12) RETURN TO CALLER
LTORG
SAVE DS 18F
END ABENDIT
This article summarizes a method to copy an MVS dataset to Windows, while keeping record boundaries and converting from EBCDIC to the UTF-8 codepage.
This job copies the MVS dataset to a UNIX file, using the -F cr option to mark each record boundary with a carriage-return character (Windows typically uses CRLF and UNIX uses LF as line separators). The dataset ‘YOUR.TEST.PS’ below should be a PS – physical sequential – dataset or a PDS(E) member.
//STEP1 EXEC PGM=BPXBATCH
//STDOUT DD SYSOUT=*
//STDERR DD SYSOUT=*
//STDIN DD DUMMY
//* Values in STDENV below are kept but have no meaning for this function
//STDENV DD *
JAVA_HOME=/usr/lpp/java/J8.0_64
PATH=/usr/lpp/mqm/web/bin:/bin:/usr/sbin
LIBPATH=/usr/lpp/mqm/java/lib
//STDPARM DD *
SH cp -v -F cr "//'YOUR.TEST.PS'" /your/unixdir/tst.txt
You can now convert the EBCDIC data to UTF-8 with the iconv utility, as follows (assuming the data is in codepage IBM-1047; substitute the EBCDIC CCSID your system actually uses):
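iconv -f IBM-1047 -t UTF-8 /your/unixdir/tst.txt > /your/unixdir/tst.utf8.txt

The converted file can then be transferred to Windows in binary mode, so that the UTF-8 data and the record separators are left untouched.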
My review of the programming languages I learned during my years in IT.
BASIC
On the Texas Instruments TI-99/4A.
Could do everything with it. Especially in combination with PEEK and POKE. Nice for building small games.
Impossible to maintain.
GOTO is unavoidable.
Assembler
In various variants.
Z80, 6802, PDP-11, System/390.
Fast, furious, unreadable, unmaintainable.
Algol 68
Loved this language. REF!
I have only seen it run on the DEC 10. Mainly used in academic environments (in the Netherlands?).
Pascal
Well structured. Pretty popular in the early 90s.
Then again, was it ever widely adopted?
COBOL
Old. Never programmed extensively in it – just for year 2000.
Totally Readable.
Funny (ridiculous) numbering scheme.
Supposedly you need GOTO in some cases, which I do not believe.
Smalltalk
Beautiful language.
Should have become the de facto OO programming language but failed for unclear reasons.
Probably because it was way ahead of its time with its OO basis.
Java
Totally nitty gritty programming language.
Productivity is based on frameworks, and no one knows which ones to use.
Never understood why this language was so widely adopted – besides its openness and platform independence.
Should never have become the de facto OO programming language but did so because Sun made it open (good move).
Far too many frameworks needed. J(2)EE adds more complexity than it resolves.
Always upgrade issues. (Proud programmer: We run the application in Java! Fed up IT manager: Which Java?)
Rexx
Can do everything quickly.
But nothing in a structured way.
Ugly code. Readable but ugly.
Some very very strong concepts.
Php
Hodgepodge language of programming concepts and HTML.
Likely high programmer productivity if you maintain strict discipline in programming standards. Grave danger of creating an unmaintainable crap-code mix of HTML and PHP.
Python
Nice structured language.
Difficult to set up and reuse.
Can be productive if nitty gritty setup issues can be overcome.
Ruby (on Rails or off-track)
Nice, probably the most elegant OO language. Still too nitty gritty for my taste. Like it though.
I would start with this language if I had to start today.
What is next
Visual programming? Clicking building blocks together?
In programming, we should maybe separate the construction of applications from the coding of functions (or objects, or whatever you call the lower-level blocks of code).
Programming complex algorithms (efficiently) will probably always remain a craft for specialists.
Constructing applications from the pieces should be brought to a higher level.
The industry (well – the software-selling industry) is looking at microservices, but that brings operational issues and becomes too distributed.
We need a way to build a house from software bricks and doors and windows and roof elements.
Probably we need more standards for that.
Another bold statement.
AI systems “programming” themselves is nonsense (I have not seen a shred of evidence).
AI systems are stochastic systems.
Programming is empirical.
In summary, up to today you cannot build software without getting into the nitty gritty very quickly.
It is like building a house, but having to find your own trees and rocks first to cut wood and bricks from.
And then having to construct your own nails and screws.
A better approach to that would help.
What do you think is the programming language of the future? What need should it address?
I walked into the restroom. A mechanic stood at the sink, fixing something. I saw him holding a toilet seat. He was fooling around with the wiring of the apparatus. Then he replaced some electronic components and rewired the seat.
Toilet sensors
It never occurred to me that even toilets could be usefully equipped with electronic features. I asked the mechanic. He explained that the toilets in the building are all connected to the Internet. If there is something wrong with the antiseptic fluid produced by the toilet, it starts calling out for help. He told me that the towel dispenser was also connected to the Internet, so that when it runs out, a maintenance operator is called in. Makes sense.
Never has technology done so much to improve the loo.
To cell sensors
So all things will be supplied with sensors. And it looks like these sensorized things are getting smaller and smaller, reaching the nano scale.
Sensors are getting so small that they can flow through our blood and mend our bodies. And maybe fix cancer cells in the future. Or detect issues with blood vessels. Or measure the chemistry in our bodies. They can be injected into plants to protect them from diseases. Or be used in constructions to measure stability at smaller scales than we had ever assumed possible. Possibilities beyond imagination.
Neb sensors surveilling the body
Imagine what it would mean if we could instrument every cell we like. I would like a surveillance team of bots swimming through my body, like the Nebuchadnezzar in The Matrix flowing through the sewers and tunnels of the abandoned cities.