Resilience for Z – Past, present, future

This article is an extension of a presentation a colleague and I gave at the GS NL 2025 conference in Almere on June 5, 2025.

Introduction

In this article, I will examine today’s challenges in IT resilience and look at where we came from with mainframe technology. Today’s resilience is no longer threatened only by natural disasters or equipment failures. IT resilience must now include measures to mitigate the consequences of cyberattacks, rapid changes in the geopolitical landscape, and growing international dependencies in IT services.

IT resilience is more important than ever. Regulatory bodies respond to changes in these contexts more quickly than ever, yet our organizations should be able to anticipate these changes rather than wait for them. Where a laid-back ‘minimum viable solution’ approach may once have sufficed, the speed of change now drives us to anticipate actively and cater for disaster scenarios before regulatory bodies force us to. And even if your organization is not regulated, you may still need to pay close attention to what is happening, to safeguard your market position and even the continued existence of your organization.

In this article, I will discuss some areas where we can improve the technical capabilities of the mainframe. As we will see, the mainframe’s centralized architecture is well-positioned to further strengthen its position as the most advanced platform for data resilience.

A production system and backups

Once, we had a costly computer system. We stored our data on an expensive disk. Disk crashes happened regularly, so we made backups on tape. When data was lost or corrupted, we restored it from tape. When a computer crashed, we recovered the system and the data from our backups. Of course, downtimes were considerable: the mean time to repair, MTTR, was enormous. The risk of data loss was significant: the recovery point objective, RPO, was well over zero.

Resilience in a single data center with backup to tape

A production system and a failover system

At some point, the risk of data loss and the time required to recover our business functions in the event of a computer failure became too high. We needed to respond to computer failures faster. A second data center was built, and a second computer was installed in it. We backed up our data on tape and shipped a copy of the tapes to the other data center, allowing us to recover from the failure of an entire data center.

We still had to make backups at regular intervals that we could fall back to, leaving the RPO still significantly high. But we had better tolerance against equipment failures and even entire data center failures.

Our recovery procedures became more complex. It was always a challenge to ensure our systems would be recoverable in the secondary data center. The loss of data could not be prevented. The time it took to get the systems back up in the new data center was significant.

Resilience with two data centers, with backups on tape that can be restored in the second data center

Clustering systems with Parallel Sysplex

In the 1990s, IBM developed a clever mechanism that creates a cluster of two or more MVS (z/OS) operating system images. This included advanced facilities for middleware solutions to leverage such a cluster and build middleware clusters with unparalleled availability. Such a cluster is called a Parallel Sysplex. The members—the operating system instances—of a Parallel Sysplex can be up to 20 kilometers apart. With these facilities, you can create a Parallel Sysplex that spans two (or more) physical data centers. Data is replicated synchronously between the data centers, ensuring that any change to the data on the disk in one data center is also reflected on the disk in the secondary data center.

The strength of the Parallel Sysplex is that when one member of the Parallel Sysplex fails, the cluster continues operating, and the user does not notice. An entire data center could be lost, and the cluster member(s) in the surviving data center(s) can continue to function.

With Parallel Sysplex facilities, a configuration can be created that ensures no disruption occurs when a component or data center fails, resulting in a Recovery Time Objective (RTO) of 0. This allows operation to continue without any loss of data, with a Recovery Point Objective (RPO) of 0.

With Parallel Sysplex facilities, a configuration can be created that ensures no disruption occurs when a component or data center fails

Continuous availability with GDPS

In addition to the Parallel Sysplex, IBM developed GDPS. If you lose a data center, you eventually want the original Parallel Sysplex cluster to be fully recovered. For that, you would need to create numerous procedures. GDPS automates the task of failover for members in a sysplex and the recovery of these members in another data center. GDPS can act independently in failure situations and initiate recovery actions.

Thus, GDPS enhances the fault tolerance of the Parallel Sysplex and eliminates many tasks that engineers would otherwise have to execute manually in emergency situations.

GDPS automates the failover of sysplex members and their recovery in another data center

Walhalla reached?

With GDPS, the mainframe configuration has reached a setup that could genuinely be called continuously available. So is this a resilience walhalla?

Unfortunately not.

The GDPS configuration also has its challenges and limitations.

Performance

The first challenge is performance. In the cluster setup, we want every change to be guaranteed to be made in both data centers. At any point in time, one entire data center could collapse, and still we want to ensure that we have not lost any data. Every update must therefore be guaranteed to be persisted on both sides. To achieve this, an I/O operation must not only be performed in the local data center’s storage, but also in the secondary data center. Data can only be considered committed once the storage in the secondary data center has acknowledged to the primary data center that the update has been written to disk.

To achieve this protocol, which we call synchronous data mirroring, a signal with the update must be sent to the secondary data center, and a confirmation message must be sent back to the primary data center.

Every update requires a round trip, so the minimum theoretical latency due to distance alone is approximately 0.07 milliseconds: the time light needs to travel 10 kilometers and back. In practice, the actual update time will be higher due to network equipment latency in switches and routers, protocol overhead, and disk write times. For a distance of 10 kilometers, an update could take between 1 and 2 milliseconds. This means that for one specific application resource, you can make only 500 to 1,000 updates per second. (Fortunately, many resource managers, like database management systems, provide a lot of parallelization in their updates.)
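As a back-of-the-envelope check (assuming signalling at the vacuum speed of light of roughly 300,000 km/s, and ignoring all equipment and protocol overhead):

minimum round-trip time ≈ 2 × 10 km / 300,000 km/s ≈ 0.067 ms
serialized updates per second = 1 / update time, so 1 / 1 ms = 1,000 and 1 / 2 ms = 500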

In other words, a Parallel Sysplex cluster offers significant advantages, but it also presents challenges in terms of performance. These challenges can be overcome, but additional attention is necessary to ensure optimal application performance, which comes at the cost of increased computing capacity required to maintain this performance.

Cyber threats

Another challenge has grown in our IT context nowadays: threats from malicious attackers. We have connected our IT systems to the Internet to allow our software systems to interact with our customers and partners. Unfortunately, this has also created an attack surface for individuals with malicious intent. Several types of cyberattacks have become a reality today, and cyber threats have infiltrated the realms of politics and warfare. Any organization today must analyse its cyber threat posture and defend against threats.

One of the worst nightmares is a ransomware attack, in which a hostile party has stolen or encrypted your organization’s data and will only return control after you have paid a sum of money or met their other demands.

The most fundamental protection against ransomware attacks is to save your data in a secure location where attackers cannot access or manipulate it.

Enter Cybervault

A Cybervault is a solution that sits on top of your existing data storage infrastructure and backups. In a Cybervault, you store your data in a way that prevents physical manipulation: you create an immutable backup of your data.

For our mainframe setup, IBM has created a solution for this: IBM Z Cyber Vault. With IBM Z Cyber Vault, we add a third leg to our data mirroring setup, an asynchronous mirror. From this third copy of our data, we can create immutable copies. The solution combines IBM software and storage hardware. It lets us make an immutable copy of all our data at regular, configurable intervals: typically every half hour or every hour, although some IBM Z users take a copy just once a day. From every immutable copy, we can recover our entire system.

So now we have a highly available configuration, with a cyber vault that allows us to go back in time. Great. However, we still have more wishes on our list.

a highly available configuration, with a cyber vault that allows us to go back in time

Application forward recovery

In the great configuration we have just built, we can revert to a previous state if data corruption is detected. When the corruption is detected at time Tx, we can restore our copy from time T0, the last backup taken before the moment T1 of the data corruption.

Restoring the copy from time T0 when corruption at T1 is detected at time Tx

Data between the corruption at T1 and the detection at Tx is lost. But could we recover the data that was still intact, written after the backup was made (T0) and before the corruption occurred (T1)?

Technically, it is possible to recover data in the Db2 database management system using Db2’s image copies and transaction logs. With Db2 recovery tools, you can restore an image copy of a database and apply all changes from that backup point forward, using the transaction logs in which Db2 records every change it makes. To combine this technology with the Cybervault solution, we would need a few more facilities:

  1. A facility to store Db2 image copies and transaction logs. Immutable, of course.
  2. A facility to let Db2, when restored from an immutable copy, know that later transaction logs and archive logs exist, to which it can perform a forward recovery.

That is work, but it is very feasible.
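To illustrate, with the standard Db2 RECOVER utility, such a forward recovery could look like the control statement below. This is a minimal sketch, assuming the image copy taken at T0 and the subsequent archive logs have been made accessible again from the immutable store; the table space name and the log point value are hypothetical.

RECOVER TABLESPACE MYDB.MYTS
        TOLOGPOINT X'00D31E2B4A5C'

The utility restores the image copy and then applies the logged changes forward, up to the specified log point just before the corruption at T1.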

recover an image copy of a database and apply all changes from that backup point forward, using the transaction logs

Now, we have reached a point where we have created a highly available configuration, with a cyber vault that can recover to a point as close to the moment of corruption as possible.

Adding Linux workloads

Most of today’s mainframe users run Linux workloads on the mainframe, besides the traditional z/OS workloads. These workloads are often as business-critical as the z/OS workload. Therefore, it is great that we can now also include Linux workloads, including OpenShift container-based workloads, in the superb resilience capabilities of IBM Z.

we can now also include Linux workloads, including, not least, OpenShift container-based workloads, in the superb resilience capabilities of IBM Z

Challenges

With this, we have extended the best-in-class resilient platform. Yet today’s challenges push us further still.

What if your backup is too close to your primary data?

Data Centers 1 and 2, as discussed above, may be too close to each other. This is the case when a single disaster can affect operations in both data centers: think of natural disasters, a large-scale power outage, or a plane crash.

I have called them Data Centers so far. Yet the more generic term in the industry is Availability Zones. An Availability Zone is a data center, or part of one, that has independent power, cooling, and security. When you spread your IT landscape over availability zones or data centers at larger geographical distances, you place them in different Regions. Regions are geographic areas, often with different risk profiles for disasters.

The Data Centers, or Availability Zones, are relatively close together, especially in European organizations. They are in the same Region. With the recent changes in political and natural climate, large organizations are increasingly looking to address the risks in their IT landscape and add data center facilities in another Region.

Cross-region failover and its challenges

To cater for a cross-region failover, we need to change the data center setup. With our GDPS technology, we can do this by adding a GDPS Global ‘leg’ to our GDPS setup. The underlying data replication is Global Mirror replication, which is asynchronous.

Adding a GDPS Global ‘leg’ to the GDPS setup for cross-region failover

The setup in this last picture summarizes the state of the art of the basic infrastructure capabilities of the mainframe. In comparison to other computing platforms, including cloud, the IBM Z infrastructure technologies highlighted here provide a comprehensive resilience solution for all workloads on the platform. This simplifies the realization and management of the resilience solution.

More challenges

Yet, there remains enough to be decided and designed, such as:

  • Are we going to set up a complete active-active configuration in Region B, or do we settle for a stripped-down configuration? Our business will need to decide whether to plan for a scenario in which our data center in Region A is destroyed and we cannot switch back.
  • Where do we put our Cybervault? In one region, or in both?
  • How do we cater for the service unavailability during a region switch-over? In our neat, active-active setup, we can switch between data centers without any disruption. This is not possible with a cross-region failover. Should we design a stand-in function for our most business-critical services?
  • We could lose data between our regions, because the data replication is asynchronous. How should we cater for this potential data loss?

When tackling questions about risk, it all begins with understanding the organization’s risk appetite—the level of risk the business is willing to accept as it works toward its objectives. Leadership teams must decide which risks are best handled through technical solutions. For organizations operating in regulated spaces, minimum standards are clearly set by regulators.

Connecting from z/OS apps to Kafka via MQ

Apache Kafka is the de facto standard open-source event streaming platform. In event-driven architectures, applications publish events when data changes, allowing other systems to react in real-time rather than polling for updates.

An example is a CRM application that serves as the system of record for customer data. When a customer’s address changes, instead of having every application repeatedly query the CRM for current address data, the CRM can publish an ‘address-update’ event. Interested applications subscribe to these events and maintain their own current copy of the data.

Kafka provides native programming interfaces for Java, Python, and Scala. This article demonstrates how traditional z/OS applications can participate in Kafka-based event streaming using IBM MQ and Kafka Connect.

Native Kafka programming interfaces and Kafka Connect

Applications can interact directly with Kafka through native programming interfaces. Kafka, being Java-based, naturally supports Java applications. Other languages with native Kafka support include Python and Scala. IBM recently introduced a Kafka SDK for COBOL on z/OS, though I will not explore that approach here.

Kafka Connect bridges the gap for applications without native Kafka support. This open-source component sits between Kafka and other middleware technologies like databases and messaging systems, translating between their protocols and Kafka’s event streaming format.

Solution Architecture

Our solution enables z/OS applications to produce and consume Kafka events through IBM MQ, leveraging the well-established asynchronous messaging patterns familiar to mainframe developers.

Key Benefits:

  • Uses proven MQ messaging patterns
  • Works with both CICS online and batch applications
  • Supports any z/OS programming language that can create MQ messages (COBOL, PL/I, Java, Python, Node.js, Go)
  • No application code changes required beyond message formatting

Architecture Overview

The solution uses Kafka Connect as a bridge between MQ queues and Kafka topics.

For Event Production:

  • z/OS applications send messages to dedicated MQ queues
  • Kafka Connect reads from these queues
  • Messages are published to corresponding Kafka topics
  • Kafka broker makes events available to subscribers

For Event Consumption:

  • Kafka Connect subscribes to Kafka topics
  • Incoming events are placed on corresponding MQ queues
  • z/OS applications read from queues for business processing

Queue-to-Topic Mapping

Each Kafka topic has a dedicated MQ queue. This one-to-one mapping simplifies configuration and makes the data flow transparent for both operations and development teams.

Software Components

Kafka Connect runs as a started task on z/OS. Multiple instances can serve the same workload by sharing startup parameters, providing scalability and high availability.
Kafka Connect includes a REST API for:

  • Configuring connectors for your applications
  • Monitoring connector status
  • Integrating with provisioning and deployment processes
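As a sketch of how a connector could be defined through this REST API, the following curl call creates an MQ source connector that moves messages from one queue to one Kafka topic. It assumes IBM’s open-source kafka-connect-mq-source connector and the default Kafka Connect REST port 8083; the queue manager, queue, and topic names are illustrative:

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "mq-source-address-update",
    "config": {
      "connector.class": "com.ibm.eventstreams.connect.mqsource.MQSourceConnector",
      "tasks.max": "1",
      "mq.queue.manager": "QM1",
      "mq.connection.mode": "bindings",
      "mq.queue": "ADDRESS.UPDATE.QUEUE",
      "topic": "address-update",
      "mq.record.builder": "com.ibm.eventstreams.connect.mqsource.builders.DefaultRecordBuilder"
    }
  }'

The bindings connection mode corresponds to the local binding connections described under Production Configuration below; the consumption direction uses the companion MQ sink connector, configured analogously.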

Production Configuration

In a production environment, multiple Kafka Connect instances run across different LPARs for high availability. Each instance accesses application queues through MQ local binding connections. MQ queue sharing groups distribute workload across LPARs, ensuring both performance and resilience.

The infrastructure setup supports:

  • Load balancing across multiple z/OS instances
  • Fault tolerance through redundant components
  • Efficient local MQ connections

Summary

This article describes an architecture that provides a clean, straightforward path for z/OS applications to participate in event-driven systems using Apache Kafka. By leveraging existing MQ messaging patterns and Kafka Connect middleware, traditional mainframe applications can integrate with modern streaming platforms without requiring extensive code changes or new programming paradigms.
The solution maintains the reliability and performance characteristics that z/OS environments demand while opening doors to real-time data integration and event-driven architectures.

Automation: From Repetition to Engineering

I often realize I’m doing repetitive tasks that could easily be automated. Money transfers, reminders, invoices: these are simple, low-effort activities that don’t deserve to consume my time. Every time this happens, I tell myself, “I should automate this to save time and mental effort.” And yet, somehow, I don’t. I tell myself I have no time to automate.

Automation frees up time

In IT, especially when automating mainframe processes, we encounter the same hesitation:

“We don’t need to automate this; once it’s done, we’ll never do it again.”

Which almost never turns out to be true.

Repetitive tasks, whether personal or IT-related, are often simple to automate but remain manual due to perceived time constraints.

In IT, automation is critical. It reduces manual errors, improves consistency, and frees up time for more strategic work.

A Shift in Mindset

Automation requires a different engineering mindset. Instead of the familiar cycle:

Do → Fix → Fix → Fix

We move to:

Engineer process → Run process / Fix process → Run process → Run process

Once engineered, automated processes run with minimal intervention, saving both time and effort.

When to Automate

If you find yourself performing a task more than twice, consider automating it. Whether through shell scripting, JCL, utilities, or tools like Ansible, automation quickly pays off.

Automation is not optional; it’s essential for efficient IT operations and professional growth. Start automating today to work smarter, not harder.

Don’t waste time doing things more than twice. If you do something for the third time, automate it; you’ll likely have to do it a fourth or fifth time as well.

Automate everything.

The Cathedral Effect: Designing Engineering Spaces for Creativity and Focus

In software engineering, as in many creative and technical fields, the environment shapes how we think and work. An intriguing psychological phenomenon known as the Cathedral Effect offers valuable insights into how physical and virtual workspaces can be designed to optimize both high-level creativity and detailed execution.

What Is the Cathedral Effect?

The Cathedral Effect reveals how ceiling height influences cognition and behavior. High ceilings evoke a sense of freedom and openness, fostering abstract thinking, creativity, and holistic problem-solving. In contrast, low ceilings create an enclosure that encourages focused, detail-oriented, and analytical work.

Research shows that exposure to high ceilings activates brain regions associated with visuospatial exploration and abstract thought, and confirms that people in high-ceiling environments engage in broader, more creative thinking, while low ceilings prime them for concrete, detail-focused tasks.

Applying the Cathedral Effect to Software Engineering

Software development involves both high-level architectural design and detailed coding and testing. The Cathedral Effect suggests that these phases benefit from different environments:

  • High-level work (system architecture, brainstorming, innovation) thrives in “high ceiling” spaces, whether physical rooms with tall ceilings or metaphorical spaces that encourage free-flowing ideas and open discussion.
  • Detailed work (analysis, programming, debugging) benefits from “low ceiling” environments that support concentration, precision, and deep focus.

Matching the workspace to the task helps teams think and perform at their best.

Practical Suggestions for IT Teams and Organizations

Create Dedicated Physical and Virtual Spaces

If possible, design your office with distinct zones:

  • High-ceiling rooms for architects and strategists to collaborate and innovate. These spaces should be open, well-lit, and flexible.
  • Low-ceiling or enclosed rooms for developers and analysts to focus on detailed work without distractions.

For remote or hybrid teams, replicate this by:

  • Holding open, informal video sessions and collaborative whiteboard meetings for high-level ideation.
  • Scheduling “deep work” periods with minimal interruptions, supported by quiet virtual rooms or dedicated communication channels.

Match People to Their Preferred Environments

We should recognize that some team members excel at abstract thinking, while others thrive on details. Assign roles and tasks accordingly, and respect their preferred workspace to maximize productivity and job satisfaction.

Facilitate Transitions Between Modes

Switching between big-picture thinking and detailed work requires mental shifts. Encourage physical or virtual “room changes” to help reset focus and mindset, reducing cognitive fatigue.

Foster Cross-Pollination

While separation is beneficial, occasional collaboration between high-level thinkers and detail-oriented workers ensures ideas remain practical and grounded.

Why This Matters

Ignoring the Cathedral Effect can lead to mismatched environments that stifle creativity or undermine focus. For example, forcing detail-oriented developers into open brainstorming sessions can cause distraction and frustration. Conversely, confining architects to cramped spaces can limit innovation.

By consciously designing workspaces and workflows that respect the Cathedral Effect, organizations can foster both creativity and precision, leading to better software and more engaged teams.

dos2unix on z/OS

On z/OS UNIX, the dos2unix utility is not included. You can achieve similar functionality using other tools available on z/OS UNIX, such as sed or tr. These tools can be used to convert DOS-style line endings (CRLF) to Unix-style line endings (LF).

For example, you can use sed to remove carriage return characters:

sed 's/\r$//' inputfile > outputfile

Or you can use tr:

tr -d '\r' < inputfile > outputfile
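To check the result, you can inspect a file’s line endings with od, which is available in z/OS UNIX. DOS-style endings show up as \r \n pairs; after conversion, only \n should remain:

od -c outputfile | head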

The 4 Eyes Principle

Another principle today.

In the realm of software development, the four-eyes principle dictates that an action can only be executed when it is approved by two individuals, each providing a unique perspective and oversight. This principle is designed to safeguard against errors and misuse, ensuring the integrity and quality of the software.

The four-eyes principle can help during the construction of software systems by finding weaknesses in architecture, design, or code, and thus helps improve quality. The principle can be applied in every phase of the software development cycle, from requirements analysis to detailed coding.

Software architecture, design, and code could be co-developed by two people or peer-reviewed.

In the design of software systems, the four-eyes principle applies to validating design decisions at various levels. Pair programming is a software development technique in which two programmers work together on code, one usually doing the coding and the other doing the validation. In other engineering industries, dual or duplicate inspection is a common practice.

In regulated environments such as Financial Institutions, compliance requirements may dictate that code is always peer-reviewed to prevent backdoors in code.

Within software systems themselves, the four-eyes principle may be implemented when the supported business processes require it for security or quality validation reasons.

Change management, a critical aspect of software development, often relies on the four-eyes principle. When code changes are transitioned into production, a formal change board may mandate a signed-off peer review, ensuring that all changes meet the required standards. Change and Configuration Management tools for software systems are often designed to support this four-eyes principle process, further enhancing the quality and security of the production environment.

Further assurance can be provided by adding a (random) rotation scheme of the authorized individuals who serve as the second pair of eyes. This gives additional assurance, as it will not be known beforehand which two individuals will handle a given decision.

Related / similar: Dual Inspection, Code Review.

Continuous availability presentation in 2006, updated

Continuous availability

The slide deck tells me that it was in 2006 that I created a set of slides for “Kees” with an overview of the continuous availability features of an IBM mainframe setup.

The deck’s content was interesting enough to share here, with some enhancements.

What is availability?

First, let’s talk a little bit about availability. What do we mean when we talk about availability?

A highly available computing setup should provide for the following:

  • A highly available fault-tolerant infrastructure that enables applications to run continuously.
  • Continuous operations to allow for non-disruptive backups and infrastructure and application maintenance.
  • Disaster recovery measures that protect against unplanned outages due to disasters caused by factors that cannot be controlled.

Definitions

Availability is the state of an application service being accessible to the end user.

An outage (unavailability) is when a system is unavailable to an end user. An outage can be planned, for example, for software or hardware maintenance, or unplanned.

What causes outages?

A 2005 research report from the Standish Group showed the various causes of outages.

Causes of outages 2006

It is interesting to see that (cyber) security was not part of this picture, while more recent research published by UpTime Intelligence shows this growing concern. More on this later.

Causes of outages 2020 – 2021 – 2022

The myth of the nines

The table below shows the availability figures for an IBM mainframe setup versus Unix and LAN availability.

Things have changed. Unix (now: Linux) server availability has gone up. Server quality has improved, and so has software quality. Unix, however, still does not provide a capability similar to a z/OS sysplex. Such a sysplex simply beats any clustering facility by providing built-in, operating system-level availability.

Availability figures for an IBM mainframe setup versus Unix and LAN

At the time of writing, IBM publishes updated figures for a sysplex setup as well (see https://www.ibm.com/products/zos/parallel-sysplex): 99.99999% application availability for the configuration described in the footnote: “… IBM Z servers must be configured in a Parallel Sysplex with z/OS 2.3 or above; GDPS data management and middleware recovery across Metro distance systems and storage and DS888X with IBM HyperSwap. Other resiliency technology and configurations may be needed.”

Redundant hardware

The following slides show the redundant hardware of a z9 EC (Enterprise Class), the flagship mainframe of that time.

The redundant hardware of a z9 EC

Contrasting this with today’s flagship, the z16 (source https://www.vm.ibm.com/library/presentations/z16hwov.pdf), is interesting. Since the mainframe is now mounted in a standard rack, the interesting views have moved to the rear of the apparatus. (iPDUs are the power supplies in this machine.)

The redundant hardware of a z16

Redundant IO configuration

A nice, highly fault-tolerant server alone is insufficient for an ultimately highly available setup. The I/O configuration, a.k.a. the storage configuration, must also be highly available.

A redundant SAN setup

The following slide in the deck highlights how this can be achieved. What is amusing or annoying, depending on your mood, and what triggers me today, are the “DASD CU” terms in the storage boxes. These boxes are the storage systems housing the physical disks. Even at that time, plain terms like storage and disk were clearer than DASD (Direct Access Storage Device, goodness, what a code word for disk) and CU (Control Unit, just an abstraction anyway). And then I ignore the valueless addition of CSS (Channel SubSystem) and CHPID (Channel Path ID) on this slide.

What a prick I must have been at that time.

At least the term Director did get the explanatory text “Switch.”

A redundant storage setup for mainframes

RAS features for storage

I went on to explain that a “Storage Subsystem” has the following RAS features (RAS, ugh…, Reliability, Availability, Serviceability):

  • Independent dual power feeds (so you could attach the storage box to two different independent power lines in the data center)
  • N+1 power supply technology/hot-swappable power supplies and fans
  • N+1 cooling
  • Battery backup
  • Non-volatile subsystem cache to protect writes that have not been hardened to DASD yet (which we jokingly referred to as non-violent storage)
  • Non-disruptive maintenance
  • Concurrent LIC activation (LIC – Licensed Internal Code, a smoke-and-mirrors term for software)
  • Concurrent repair and replacement actions
  • RAID architecture
  • Redundant microprocessors and data paths
  • Concurrent upgrade support (that is, the ability to add disks while the subsystem is online)
  • Redundant shared memory
  • Spare disk drives
  • Remote Copy to a second storage subsystem
    • Synchronous (Peer-to-Peer Remote Copy, PPRC)
    • Asynchronous (Extended Remote Copy, XRC)

Most of this is still valid today, except that we no longer have spinning disks; everything is solid-state drives nowadays.

Disk mirroring

Ensuring that data is safely stored in this redundant setup is achieved through disk mirroring at the lowest level. Every byte written to a disk in one storage system is replicated to one or more storage systems, which can be in different locations.

There are two options for disk mirroring: Peer-to-Peer Remote Copy (PPRC) or eXtended Remote Copy (XRC). PPRC is also known as a Metro Mirror solution. Data is mirrored synchronously, meaning an application receives an “I/O complete” only after both primary and secondary disks are updated. Because updates must be made to both storage systems synchronously, they can only be 15 to 20 kilometers apart; otherwise, updates would take too long. The speed of light is the cause of this limitation.

With XRC, data is mirrored asynchronously. An application receives “I/O complete” after the primary disk is updated. The storage systems can be at an unlimited distance apart from each other. A component called System Data Mover ensures the consistency of data in the secondary storage system.

PPRC and XRC

The following slide highlights how failover and failback would work in a PPRC configuration.

PPRC failover and failback

The operating system cluster: parallel sysplex

The presentation then explains how a z/OS parallel sysplex is configured to create a cluster without any single point of failure. All servers, LPARs, operating systems, and middleware are set up redundantly in a sysplex.

Features such as Dynamic Session Balancing and Dynamic Transaction Routing ensure that workloads are spread evenly across such a cluster. Facilities in the operating system and middleware work together to ensure that all data is safely and consistently shared, locking is in place when needed, and so forth.

The Coupling Facility is highlighted: a facility for sharing memory between the different members in a cluster. Sysplex Timers are shown as well; these keep the clocks of the different members in a sysplex in sync.

A parallel sysplex

A few more facilities are discussed. Workload balancing is achieved with the Workload Manager (WLM) component of z/OS. Restarting applications without interfering with other applications or z/OS itself is handled by the Automatic Restart Manager (ARM). Resource Recovery Services (RRS) assist with two-phase commits across members in a sysplex.

Automation is critical for successful rapid recovery and continuity

Every operation must be automated to prevent human errors and improve recovery speed. The following slide states the somewhat obvious benefits of automation:

  • Allows business continuity processes to be built on a reliable, consistent recovery time
  • Recovery times can remain consistent as the system scales to provide a flexible solution designed to meet changing business needs
  • Reduces infrastructure management costs and required staffing skills
  • Reduces or eliminates human error during the recovery process at the time of disaster
  • Facilitates regular testing to help ensure repeatable, reliable, scalable business continuity
  • Helps maintain recovery readiness by managing and monitoring the server, data replication, workload, and network, along with the notification of events that occur within the environment

Tiers of Disaster Recovery

The following slide shows an awful picture highlighting the concept of tiers of Disaster Recovery from Zero Data Loss to the Pickup Truck method.

Tiers of Disaster Recovery

I mostly like the Pickup Truck Access Method.

GDPS

The following slide introduces GDPS (the abbreviation of the meaningless concept Geographically Dispersed Parallel Sysplex). GDPS is a piece of software on top of z/OS that provides the automation combining all the previously discussed components into a continuously available configuration. GDPS takes care of the actions needed when failures occur in a z/OS sysplex.

GDPS

GDPS comes in two flavors: GDPS/PPRC and GDPS/XRC.

GDPS/PPRC is designed to provide continuous availability and no data loss between z/OS members in a sysplex across two sites that are at most at campus distance (15 to 20 km).

GDPS/XRC is designed to provide automatic failover between sites at an extended distance from each other. Since GDPS/XRC is based on asynchronous data mirroring, minimal data loss can occur for data not yet replicated to the remote site.

GDPS/PPRC and GDPS/XRC can be combined into a best-in-class solution: a high-performance, zero-data-loss setup for local/metro operation, plus an automatic site-switch capability for extreme situations such as natural disasters.

In summary

The summary slide presents an overview of the capabilities of the server hardware, the Parallel Sysplex, and the GDPS setup.

Redundancy of the Z server, Parallel Sysplex, and GDPS

But we are not there yet: ransomware recovery

When I created this presentation, ransomware was not the big issue it is today. Nowadays, the IBM solution for continuous availability has been enriched with a capability for ransomware recovery. This solution, called IBM Z Cyber Vault, combines various capabilities of IBM Z. It can create immutable copies of production data, called Safeguarded Copies in IBM Z Cyber Vault terms, taken at multiple points in time, with a rapid recovery capability. In addition, the solution enables data validation to support testing the validity of each captured copy.

The IBM Z Cyber Vault environment is isolated from the production environment.

Whatever type of mainframe configuration you run, the IBM Z Cyber Vault capability can provide a high degree of cyber resiliency.

Source: https://www.redbooks.ibm.com/redbooks/pdfs/sg248511.pdf

IBM Z Cyber Vault

System management

The z/OS operating system is designed to host many applications on a single platform. From the beginning, efficient management of the applications and their underlying infrastructure has been an essential part of the z/OS ecosystem.

This chapter will discuss the regular system operations, monitoring processes, and tools you find on z/OS. I will also look at monitoring tools that ensure all our automated business, application, and technical processes are running as expected.

System operations

The z/OS operating system has an extensive operator interface that gives the system operator the tools to control the z/OS platform and its applications, and to intervene when issues occur. These operations facilities compare well with the operation of physical processes in factories or power plants. The operator is equipped with many knobs, buttons, switches, and meters to keep the z/OS factory running.

Operator interfaces and some history

By design, mainframe operations take place on so-called consoles. Consoles were originally physical terminal devices connected directly to the mainframe server with special equipment. Everything happening on the z/OS system was displayed on the console screens. A continuous newsfeed of messages, generated by the numerous components running on the mainframe, streamed over the console display. Warnings and failure messages were highlighted so an operator could quickly identify issues and take the necessary actions.

Nowadays, physical consoles have been replaced by software equivalents. In the chapter on z/OS, I already mentioned IBM’s SDSF and similar tools from other vendors that are available on z/OS for this purpose. SDSF is the primary tool that system operators and administrators use to view and manage the processes running on z/OS.

Additionally, z/OS has a central facility where information, warnings, and error messages from the hardware, operating system, middleware, and applications are gathered. This facility is called the system log. The system log can be viewed from the SDSF tool.

SDSF options
Executing an operator command through SDSF
The system log viewed through SDSF

An operator can intervene in the running z/OS system and its applications with operator commands. z/OS itself provides many of these operator commands, covering a wide variety of functions. The middleware installed on top of z/OS often brings its own set of operator messages and commands as well.

Operator commands are comparable to Unix commands on Unix operating systems, and to the functions provided by the Windows Task Manager and other Windows system administration tools. Operator commands can also be issued through application programming interfaces, which opens possibilities for building software that automates operations on the z/OS platform.
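To give a flavor, a few common z/OS operator commands are shown below (MYPROC is a hypothetical started task name; in SDSF, operator commands are entered with a leading slash):

/D A,L        display the active jobs and started tasks
/D IPLINFO    display information about the last IPL
/S MYPROC     start the started task MYPROC
/P MYPROC     stop the started task MYPROC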

Automated operations

In the past, a crew of operators managed the daily operations of the business processes running on a central computer like the mainframe. The operators were gathered in the control room, also called a bridge, from where they monitored and operated the processes running on the mainframe.

Nowadays, daily operations have been automated. All everyday issues are handled through automated processes, overseen by special software. When the automation tools find issues they cannot resolve, an incident process is kicked off, and a system or application administrator is called out of bed to check out the problem.

Manual versus automated operations

Several software suppliers provide automation tools for z/OS operations. All these tools monitor the messages flowing through the system log, which reports the health of everything running on z/OS. The occurrence of messages can be turned into events for which automated actions can be programmed.

For example, when a pool of disk storage is filling up, the storage management software will write a message to the system log. An automation process can be defined that expands the storage pool and sends a notification email to the storage administrator; this process is kicked off automatically when the message appears in the system log.

All automated operations tools for z/OS are based on this mechanism. Some tools provide more help with automation tasks and advanced functions than others. Solutions in the market include System Automation from IBM, CA-OPS/MVS from Broadcom, and AUTOMON for z/OS from Macro4.

Monitoring

System management aims to ensure that all automated business processes run smoothly. For this, detailed information must be made available to assess the health of the running processes. All the z/OS and middleware components provide a wide variety of data that can be used to analyze the health of these individual components. The amount of data is so large that it is necessary to use tools that help make sense of all this data. This is where monitoring tools can help.

Monitoring can be applied at different levels of the operational system. In this chapter, I differentiate between infrastructure, application, and business monitoring.

Figure 42 shows the different layers of monitoring that can be distinguished. It illustrates how application monitoring needs to be integrated to roll the information up into meaningful monitoring of the business process.

The following sections will go into the different layers of monitoring in a bit more detail.

Monitoring of different layers

Infrastructure monitoring

Infrastructure monitoring keeps an eye on the mainframe hardware, the z/OS operating system, and the middleware components running on top of z/OS. All these parts produce extensive data that can be used to monitor their health. z/OS provides standard facilities for infrastructure components to write monitoring data. The first one we have seen is the messages written to the system console, which are all saved in the system log. Additionally, z/OS has a System Management Facility (SMF) component, which provides a low-level interface through which infrastructure components can write information to the SMF datasets, a special form of event log.

There are many options for producing data, but what often does not come with these tools is the ability to make meaningful use of all that data.

To get a better grip on the health of these infrastructure components, various software vendors provide solutions to use that data to monitor and manage specific infrastructure components. Most of these tools offer particular functions for monitoring a piece of infrastructure, and integrating these tools is not always straightforward.

BMC’s MainView suite provides tools to monitor z/OS, CICS, IMS, Db2, and other infrastructure components common in a z/OS setup.

IBM has a similar suite under the OMEGAMON umbrella. The IBM suite also has tools for monitoring z/OS itself, storage, networks, and middleware such as CICS, IMS, Db2, and more.

ASG also has an extensive suite of tools, under the name ASG-TMON, for the components mentioned above and more. This software has since been acquired by Rocket Software.

Broadcom provides tools for z/OS and network monitoring under their Intelligent Operations and Automation suite.

Application monitoring

The next level of monitoring provides a view of the functioning of the applications and their components.

This monitoring level is not extensively supported by off-the-shelf tools or frameworks for applications in COBOL or PL/I. Application monitoring and logging frameworks like Java Management Extensions (JMX) and Log4j are available for Java, but no such tools exist for languages like COBOL and PL/I. Many z/OS users have therefore developed their own frameworks for application monitoring, relying on various technologies.

Some tools can provide a certain level of application monitoring. For example, Dynatrace, AppDynamics, and IBM Application Performance Management provide capabilities to examine the functioning of applications. However, this functionality is often not easily extensible by application developers, as it is with Log4j and JMX mentioned above. There remains a need for a framework (preferably open source) that allows application developers to emit specific monitoring and logging information and events at particular points in an application on z/OS.

Business Monitoring

Ideally, the application and infrastructure monitoring tooling should feed some tools that can aggregate and enrich this information with information from other tools to create a comprehensive view of the IT components supporting business processes.

Recently, tools have become available on z/OS to gather logging and monitoring information and forward it to a central application. Syncsort’s Ironstream and IBM’s Common Data Provider can collect data from different sources, such as system logs, application logs, and SMF data, and stream it to one or more destinations like Splunk or Elastic. With these tools, it is now possible to integrate the available data into a cross-platform aggregated view, as shown in Figure 42. Today, aggregated views are typically implemented in tools like Splunk, the open-source ELK stack, or other tools focused on data aggregation, analysis, and visualization.