I must shamefully admit I was not aware of the zopen community initiative before it recently became part of the Open Mainframe Project. The zopen community provides a great set of open-source tools ported to IBM Z, such as the dos2unix utility I wrote about earlier here.
By default, z/OS UNIX does not include the dos2unix utility, but you can achieve similar functionality with other tools available on z/OS UNIX, such as sed or tr. These tools can be used to convert DOS-style line endings (CRLF) to Unix-style line endings (LF).
For example, you can use sed to remove carriage return characters:
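A minimal sketch (the file names are placeholders; the exact escape-sequence support of the sed shipped with z/OS UNIX may differ from GNU sed, in which case the tr variant is the more portable choice):

```shell
# Create a sample file with DOS (CRLF) line endings
printf 'hello\r\nworld\r\n' > dosfile.txt

# Option 1: use sed to strip the trailing carriage return from each line
sed 's/\r$//' dosfile.txt > unixfile1.txt

# Option 2: use tr to delete all carriage return characters
tr -d '\r' < dosfile.txt > unixfile2.txt
```

Both commands produce a copy of the file with plain LF line endings, leaving the original untouched.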
In the realm of software development, the four-eyes principle dictates that an action can only be executed when it is approved by two individuals, each providing a unique perspective and oversight. This principle is designed to safeguard against errors and misuse, ensuring the integrity and quality of the software.
The four-eyes principle can help during the construction of software systems by finding weaknesses in architecture, design, or code, and thereby improve quality. It can be applied in every phase of the software development cycle, from requirements analysis to detailed coding.
Software architecture, design, and code could be co-developed by two people or peer-reviewed.
In the design of software systems, the four-eyes principle applies to the process of validating design decisions at various levels. Pair programming is a software development technique in which two programmers work together on code, one usually doing the coding and the other the validation. In other engineering industries, dual or duplicate inspection is a common practice.
In regulated environments such as Financial Institutions, compliance requirements may dictate that code is always peer-reviewed to prevent backdoors in code.
In software systems themselves, the four-eyes principle may be implemented when supporting business processes that require it for security or quality validation reasons.
Change management, a critical aspect of software development, often relies on the four-eyes principle. When code changes are transitioned into production, a formal change board may mandate a signed-off peer review, ensuring that all changes meet the required standards. Change and Configuration Management tools for software systems are often designed to support this four-eyes principle process, further enhancing the quality and security of the production environment.
Further assurance can be added through a (random) rotation scheme of authorized individuals serving as the second pair of eyes. Since it is not known beforehand which two individuals will handle a given decision, this provides additional protection against collusion.
The slide deck tells me that it was in 2006 that I created a set of slides for “Kees” with an overview of the continuous availability features of an IBM mainframe setup.
The deck’s content was interesting enough to share here, with some enhancements.
What is availability?
First, let’s talk a little bit about availability. What do we mean when we talk about availability?
A highly available computing setup should provide for the following:
A highly available fault-tolerant infrastructure that enables applications to run continuously.
Continuous operations to allow for non-disruptive backups and infrastructure and application maintenance.
Disaster recovery measures that protect against unplanned outages due to disasters caused by factors that cannot be controlled.
Definitions
Availability is the state of an application service being accessible to the end user.
An outage (unavailability) is when a system is unavailable to an end user. An outage can be planned, for example, for software or hardware maintenance, or unplanned.
What causes outages?
A 2005 research report from the Standish Group showed the various causes of outages.
It is interesting to see that (cyber) security was not part of this picture, while more recent research published by UpTime Intelligence shows this growing concern. More on this later.
The myth of the nines
The table below shows the availability figures for an IBM mainframe setup versus Unix and LAN availability.
Things have changed. Unix (now: Linux) server availability has gone up. Server quality has improved, and so has software quality. Unix, however, still does not provide a capability similar to a z/OS sysplex. Such a sysplex simply beats any clustering facility by providing built-in, operating system-level availability.
At the time of writing, IBM publishes updated figures for a sysplex setup as well (see https://www.ibm.com/products/zos/parallel-sysplex): 99.99999% application availability for the footnote configuration: “… IBM Z servers must be configured in a Parallel Sysplex with z/OS 2.3 or above; GDPS data management and middleware recovery across Metro distance systems and storage and DS888X with IBM HyperSwap. Other resiliency technology and configurations may be needed.”
Redundant hardware
The following slides show the redundant hardware of a z9 EC (Enterprise Class), the flagship mainframe of that time.
Contrasting this with today’s flagship, the z16 (source https://www.vm.ibm.com/library/presentations/z16hwov.pdf), is interesting. Since the mainframe is now mounted in a standard rack, the interesting views have moved to the rear of the apparatus. (iPDUs are the power supplies in this machine.)
Redundant IO configuration
A nice, highly fault-tolerant server is insufficient for an ultimately highly available setup. The IO configuration, a.k.a. the storage configuration, must also be highly available.
A redundant SAN setup
The following slide in the deck highlights how this can be achieved. What is amusing or annoying, depending on your mood, and what triggers me today, are the “DASD CU” terms in the storage boxes. These boxes are the storage systems housing the physical disks. Even at that time, terms like storage and disk were more evident than DASD (Direct Access Storage Device, goodness, what a code word for disk) and CU (Control Unit, just an abstraction anyway). And I will ignore the valueless addition of CSS (Channel SubSystem) and CHPID (Channel Path ID) on this slide.
What a prick I must have been at that time.
At least the term Director did get the explanatory text “Switch.”
RAS features for storage
I went on to explain that a “Storage Subsystem” has the following RAS features (RAS, ugh…, Reliability, Availability, Serviceability):
Independent dual power feeds (so you could attach the storage box to two different independent power lines in the data center)
N+1 power supply technology/hot-swappable power supplies and fans
N+1 cooling
Battery backup
Non-volatile subsystem cache to protect writes that have not been hardened to DASD yet (which we jokingly referred to as non-violent storage)
Non-disruptive maintenance
Concurrent LIC activation (LIC – Licensed Internal Code, a smoke-and-mirrors term for software)
Concurrent repair and replacement actions
RAID architecture
Redundant microprocessors and data paths
Concurrent upgrade support (that is, the ability to add disks while the subsystem is online)
Redundant shared memory
Spare disk drives
Remote Copy to a second storage subsystem
Synchronous (Peer to Peer Remote Copy, PPRC)
Asynchronous (Extended Remote Copy, XRC)
Most of this is still valid today, except that we no longer have spinning disks; everything is solid-state drives nowadays.
Disk mirroring
Ensuring that data is safely stored in this redundant setup is achieved through disk mirroring at the lowest level. Every byte written to a disk in one storage system is replicated to one or more storage systems, which can be in different locations.
There are two options for disk mirroring: Peer-to-Peer Remote Copy (PPRC) or eXtended Remote Copy (XRC). PPRC is also known as a Metro Mirror solution. Data is mirrored synchronously, meaning an application receives an “I/O complete” only after both the primary and secondary disks are updated. Because updates must be made to both storage systems synchronously, they can only be 15 to 20 kilometers apart; otherwise, updates would take too long. The speed of light is the inhibitor behind this limitation.
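A back-of-the-envelope calculation (my own illustration, assuming a signal travels through fiber at roughly 200,000 km/s, about two-thirds of the speed of light in vacuum) shows why distance matters for synchronous mirroring:

```latex
t_{\text{link}} \;=\; \frac{2d}{v} \;\approx\; \frac{2 \times 20\,\text{km}}{200{,}000\,\text{km/s}} \;=\; 0.2\,\text{ms}
```

Every synchronous write pays at least this round-trip delay on top of the normal I/O time, roughly 10 microseconds per kilometer of distance. That adds up quickly for I/O-intensive workloads, and at hundreds of kilometers the penalty per write would run into milliseconds.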
With XRC, data is mirrored asynchronously: an application receives “I/O complete” after the primary disk is updated. The storage systems can be an unlimited distance apart. A component called the System Data Mover ensures the consistency of the data in the secondary storage system.
The following slide highlights how failover and failback would work in a PPRC configuration.
The operating system cluster: parallel sysplex
The presentation then explains how a z/OS parallel sysplex is configured to create a cluster without any single point of failure. All servers, LPARs, operating systems, and middleware are set up redundantly in a sysplex.
Features such as Dynamic Session Balancing and Dynamic Transaction Routing ensure that workloads are spread evenly across such a cluster. Facilities in the operating system and middleware work together to ensure that all data is safely and consistently shared, locking is in place when needed, and so forth.
The slide highlights the Coupling Facility, a facility for sharing memory between the different members of a cluster. Sysplex Timers are also shown; these keep the clocks of the different members in a sysplex in sync.
A few more facilities are discussed. Workload balancing is achieved with the Workload Manager (WLM) component of z/OS. The ability to restart applications without interfering with other applications or z/OS itself is provided by the Automatic Restart Manager (ARM). Resource Recovery Services (RRS) assists with two-phase commits across members in a sysplex.
Automation is critical for successful rapid recovery and continuity
Every operation must be automated to prevent human errors and improve recovery speed. The following slide states some obvious truths about the benefits of automation:
Allows business continuity processes to be built on a reliable, consistent recovery time
Recovery times can remain consistent as the system scales to provide a flexible solution designed to meet changing business needs
Reduces infrastructure management costs and the need for specialized staffing skills
Reduces or eliminates human error during the recovery process at the time of disaster
Facilitates regular testing to help ensure repeatable, reliable, scalable business continuity
Helps maintain recovery readiness by managing and monitoring the server, data replication, workload, and network, along with the notification of events that occur within the environment
Tiers of Disaster Recovery
The following slide shows an awful picture highlighting the concept of tiers of Disaster Recovery from Zero Data Loss to the Pickup Truck method.
I mostly like the Pickup Truck Access Method.
GDPS
The following slide introduces GDPS (the abbreviation of the meaningless concept of Geographically Dispersed Parallel Sysplex). GDPS is a piece of software on top of z/OS that provides the automation solution that combines all the previously discussed components to provide a Continuously Available configuration. GDPS takes care of the actions needed when failures occur in a z/OS sysplex.
GDPS comes in two flavors: GDPS/PPRC and GDPS/XRC.
GDPS/PPRC is designed to provide continuous availability and no data loss between z/OS members in a sysplex across two sites that are at most campus distance (15 to 20 km) apart.
GDPS/XRC is designed to provide automatic failover between sites at an extended distance from each other. Since GDPS/XRC is based on asynchronous data mirroring, minimal data loss can occur for data not yet committed to the remote site.
GDPS/PPRC and GDPS/XRC can be combined into a best-in-class solution: a high-performance, zero-data-loss setup for local/metro operation, plus an automatic site-switch capability for extreme situations such as natural disasters.
In summary
The summary slide presents an overview of the capabilities of the server hardware, the Parallel Sysplex, and the GDPS setup.
But we are not there yet: ransomware recovery
When I created this presentation, ransomware was not the big issue it is today. Nowadays, the IBM solution for continuous availability has been enriched with a capability for ransomware recovery. This solution, called IBM Z Cyber Vault, combines various capabilities of IBM Z. It can create immutable copies, or Safeguarded Copies in IBM Z Cyber Vault terms, taken at multiple points in time on production data, with rapid recovery capability. In addition, this solution can enable data validation to support testing the validity of each captured copy.
The IBM Z Cyber Vault environment is isolated from the production environment.
Whatever type of mainframe configuration you run, this IBM Z Cyber Vault capability can provide a high degree of cyber resiliency.
The z/OS operating system is designed to host many applications on a single platform. From the beginning, efficient management of the applications and their underlying infrastructure has been an essential part of the z/OS ecosystem.
This chapter will discuss the regular system operations, monitoring processes, and tools you find on z/OS. I will also look at monitoring tools that ensure all our automated business, application, and technical processes are running as expected.
System operations
The z/OS operating system has an extensive operator interface that gives the system operator the tools to control the z/OS platform and its applications and to intervene when issues occur. You can compare these operations facilities with the operation of physical processes in factories or power plants: the operator is equipped with many knobs, buttons, switches, and meters to keep the z/OS factory running.
Operator interfaces and some history
By design, mainframe operations are performed on so-called consoles. Consoles were originally physical terminal devices directly connected to the mainframe server with special equipment. Everything happening on the z/OS system was displayed on the console screens: a continuous newsfeed of messages generated by the numerous components running on the mainframe streamed over the console display. Warnings and failure messages were highlighted so an operator could quickly identify issues and take the necessary actions.
Nowadays, physical consoles have been replaced by software equivalents. In the chapter on z/OS, I have already mentioned the tool SDSF from IBM or similar tools from other vendors available on z/OS for this purpose. SDSF is the primary tool system operators and administrators use to view and manage the processes running on z/OS.
Additionally, z/OS has a central facility where information, warnings, and error messages from the hardware, operating system, middleware, and applications are gathered. This facility is called the system log. The system log can be viewed from the SDSF tool.
An operator can intervene with the running z/OS system and applications with operator commands. z/OS itself provides many of these operator commands for a wide variety of functions. The middleware tools installed on top of z/OS often also bring their own set of operator messages and commands.
Operator commands are comparable to Unix commands on Unix operating systems, or to the functions provided by the Windows Task Manager and other Windows system administration tools. Operator commands can also be issued through application programming interfaces, which opens up possibilities for building software for automated operations of the z/OS platform.
Automated operations
In the past, a crew of operators managed the daily operations of the business processes running on a central computer like the mainframe. The operators were gathered in the control room, also called a bridge, from where they monitored and operated the processes running on the mainframe.
Nowadays, daily operations have been automated. All everyday issues are handled through automated processes; special software oversees these operations. When the automation tools find issues they cannot resolve, an incident process is kicked off, and a system or application administrator is called out of bed to check out the problem.
Several software suppliers provide automation tools for z/OS operations. All these tools monitor the messages flowing through the system log, which reports the health of everything running on z/OS. The occurrence of messages can be turned into events for which automated actions can be programmed.
For example, if z/OS signals that a pool of disk storage is filling up, the storage management software will write a message to the system log. An automation process that increases the storage pool and sends a notification email to the storage administrator can be defined. The automation process is kicked off automatically when the message appears in the system log.
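The mechanism can be sketched in a toy form as follows. The message ID, log file, and action are invented for illustration; real z/OS automation products hook into the console message stream directly rather than polling a file, and react with real operator commands and notifications.

```shell
# Hypothetical copy of the system log and an invented message ID
# signalling that a storage pool is nearly full.
LOGFILE=syslog.txt
MSGID="XYZ123I"

# Simulate the storage management software writing its warning message
printf '%s STORAGE POOL SGPROD1 IS 95%% FULL\n' "$MSGID" >> "$LOGFILE"

# The "automation rule": when the message appears, kick off the
# programmed action (here just a placeholder echo).
if grep -q "^$MSGID" "$LOGFILE"; then
  echo "ACTION: expand storage pool SGPROD1 and notify storage admin"
fi
```

The essence is exactly what the text describes: a message occurrence is turned into an event, and the event triggers a predefined action.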
All automated operations tools for z/OS are based on this mechanism. Some tools provide more help with automation tasks and advanced functions than others. Solutions in the market include System Automation from IBM, CA-OPS/MVS from Broadcom, and AUTOMON for z/OS from Macro4.
Monitoring
System management aims to ensure that all automated business processes run smoothly. For this, detailed information must be made available to assess the health of the running processes. All the z/OS and middleware components provide a wide variety of data that can be used to analyze the health of these individual components. The amount of data is so large that it is necessary to use tools that help make sense of all this data. This is where monitoring tools can help.
Monitoring can be applied at different levels of the operational system. In this section, I differentiate between infrastructure, application, and business monitoring.
Figure 42 shows the different layers of monitoring that can be distinguished. It illustrates how application monitoring needs to be integrated to roll the information up into meaningful monitoring of the business process.
The following sections will go into the different layers of monitoring in a bit more detail.
Infrastructure monitoring
Infrastructure monitoring is needed to keep an eye on the mainframe hardware, the z/OS operating system, and the middleware components running on top of z/OS. All these parts produce extensive data that can be used to monitor their health. z/OS provides standard facilities for infrastructure components to write monitoring data. The first one we have seen is the messages written to the system console, which are all saved in the system log. Additionally, z/OS has the System Management Facility (SMF), a component providing a low-level interface through which infrastructure components can write event records to SMF datasets.
There are many options for producing data, but what often does not come with these tools is the ability to make meaningful use of all that data.
To get a better grip on the health of these infrastructure components, various software vendors provide solutions to use that data to monitor and manage specific infrastructure components. Most of these tools offer particular functions for monitoring a piece of infrastructure, and integrating these tools is not always straightforward.
BMC’s Mainview suite provides tools to monitor z/OS, CICS, IMS, Db2, and other infrastructure components common in a z/OS setup.
IBM has a similar suite under the umbrella of Omegamon. The IBM suite also has tools for monitoring z/OS itself, storage, networks, and middleware such as CICS, IMS, Db2, and more.
ASG offered a similarly extensive suite of tools, under the name ASG-TMON, for the components mentioned above and more. This software has since been acquired by Rocket Software.
Broadcom provides tools for z/OS and network monitoring under its Intelligent Operations and Automation suite.
Application monitoring
The next level of monitoring provides a view of the functioning of the applications and their components.
This monitoring level is not extensively supported by off-the-shelf tools or frameworks for applications written in COBOL or PL/I. Application monitoring and logging frameworks like Java Management Extensions (JMX) and Log4j are available for Java, but no such tools exist for languages like COBOL and PL/I. Many z/OS users have therefore developed their own frameworks for application monitoring, relying on various technologies.
Some tools can provide a certain level of application monitoring. For example, Dynatrace, AppDynamics, and IBM Application Performance Management provide capabilities to examine the functioning of applications. However, their functionality is often not easily extensible by application developers, as it is with Log4j and JMX mentioned above. There remains a need for a framework (preferably open source) that allows application developers to emit specific monitoring and logging information and events at particular points in an application on z/OS.
Business Monitoring
Ideally, the application and infrastructure monitoring tooling should feed some tools that can aggregate and enrich this information with information from other tools to create a comprehensive view of the IT components supporting business processes.
Recently, tools have become available on z/OS to gather logging and monitoring information and forward it to a central application. Syncsort’s Ironstream and IBM’s Common Data Provider can collect data from different sources, such as system logs, application logs, and SMF data, and stream it to one or more destinations like Splunk or Elastic. With these tools, it is now possible to integrate the available data into a cross-platform aggregated view, as shown in Figure 42. Today, aggregated views are typically implemented in tools like Splunk, the open-source ELK stack with Elastic, or other tools focused on data aggregation, analysis, and visualization.
The principle of noise reduction in software systems improves software systems by removing inessential parts and options and/or making them invisible or only visible to selected users.
Reducing the options in a software solution increases usability. This goes for user interfaces as well as technical interfaces. We decide what an interface looks like and stick to it. All-too-famous examples of noise reduction are the Apple iPod and the Google search page.
Adding features for selected users means adding features and under-the-hood complexity for all users.
Reducing options also makes the software more robust. If we build fewer interfaces, we can improve them. We can focus on really doing well with the limited set of interfaces.
In practice, we see hardware and software tools have many options and features. That is not because software suppliers desperately want to give their customers all the options but because we, their customers, are requesting these options. Software suppliers may view all these requests more critically. Some do.
Let’s aim to settle for less. We shouldn’t build more every time we can do with less just because we can. Also, we shouldn’t ask our suppliers to create features that are nice to have.
There are always more options, but let’s limit the options to four, or better: one.
Some companies have made a business model out of technology hypes. These are the same companies that tell the market what it needs by asking the market. Of course, this comes with an invoice mentioning generous compensation. These companies write classy reports with colorful graphics in which they advise organizations to do what the organizations tell them to do.
But hypes are for techies. Techies may feast on technology, but for organizations, jumping on hypes can be a risky and costly pastime.
There are two types of hypes. Hypes can be about something new. Other hypes are just reformulations of existing things, recycled ideas.
But hypes are hypes: they will go away. The vast majority of hypes disappear into thin air. The techie may have learned from them. Some remain. It might be valuable if a technology is still around after a few years. But usually, the stuff will not be as groundbreaking and revolutionary as predicted when announced by the hype cycle company, that is, by the market itself.
Blockchain, anyone?
Some hypes are recycled ideas. We have no memory, and we don’t read textbooks. SOA, AI, microservices, and other technical advancements are wrapped in shiny new names and gift paper, so they appear to be a gift from your software supplier or consultancy company.
Photographers tend to suffer from Gear Acquisition Syndrome (GAS). They believe they will take better pictures with new gear, and they buy new lenses, cameras, and flashes.
Then they find their work does not improve.
In IT, we do the same.
We have our old relational database management system.
But now we have this great Spark, MongoDB, CouchDB, or what have you. (I’m just taking a not-so-random example.) So now everything must be converted to Spark or Mongo.
We even forget that this old technology, the relational DBMS in this example, was so good at reliably processing transactions. It worked!
The new database is massively scalable, which is great. Unfortunately, it does not improve the reliability of processing our transactions.
But it’s hot, so we want it—because Google has it. Errr, but will you also use it to process web page indexes? Ah, no. You want to store your customer records in it. So, is it reliable? No. But it is satisfying our GAS.