The 4 Eyes Principle


Another principle today.

In the realm of software development, the four-eyes principle dictates that an action can only be executed when it is approved by two individuals, each providing a unique perspective and oversight. This principle is designed to safeguard against errors and misuse, ensuring the integrity and quality of the software.

The four-eyes principle can help during the construction of software systems by finding weaknesses in architecture, design, or code, and so improve quality. It can be applied in every phase of the software development cycle, from requirements analysis to detailed coding.

Software architecture, design, and code could be co-developed by two people or peer-reviewed.

In the design of software systems, the four-eyes principle applies to the process of validating design decisions on various levels. Pair programming is a software development technique in which two programmers work together on code, one usually doing the coding and the other doing the validation. In other engineering industries, dual or duplicate inspection is a common practice.

In regulated environments such as Financial Institutions, compliance requirements may dictate that code is always peer-reviewed to prevent backdoors in code.

In software systems themselves, the four-eyes principle may be implemented to support business processes that require it for security or quality validation reasons.

Change management, a critical aspect of software development, often relies on the four-eyes principle. When code changes are transitioned into production, a formal change board may mandate a signed-off peer review, ensuring that all changes meet the required standards. Change and Configuration Management tools for software systems are often designed to support this four-eyes principle process, further enhancing the quality and security of the production environment.

Further assurance can be gained from a (random) rotation scheme of authorized individuals who serve as the second pair of eyes. With such a scheme, it is not known beforehand which two individuals will be dealing with a given decision.
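As a minimal sketch of how a software system could enforce this, the Python below models a change that can only be executed after someone other than its author has approved it, with the reviewer drawn at random from an authorized pool. The names `ChangeRequest` and `pick_reviewer` are hypothetical and chosen for illustration only; a real change management tool would add authentication, audit logging, and persistence.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A change that needs sign-off from a second person before execution."""
    description: str
    author: str
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The author can never be their own second pair of eyes.
        if reviewer == self.author:
            raise PermissionError("author cannot approve their own change")
        self.approvals.add(reviewer)

    def can_execute(self) -> bool:
        # The author plus at least one independent approver makes four eyes.
        return len(self.approvals) >= 1


def pick_reviewer(author, authorized, rng=random):
    """Randomly pick a reviewer from the authorized pool, excluding the author."""
    candidates = sorted(set(authorized) - {author})
    if not candidates:
        raise ValueError("no independent reviewer available")
    return rng.choice(candidates)


# Usage: the reviewer is not known before the request is made.
change = ChangeRequest("Raise payment limit", author="alice")
change.approve(pick_reviewer("alice", ["alice", "bob", "carol"]))
print(change.can_execute())
```

Excluding the author from the candidate pool is the essential check; the random draw only adds the unpredictability described above.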

Related / similar: Dual Inspection, Code Review.

Continuous availability presentation in 2006, updated


Continuous availability

The slide deck tells me that it was in 2006 that I created a set of slides for “Kees” with an overview of the continuous availability features of an IBM mainframe setup.

The deck’s content was interesting enough to share here, with some enhancements.

What is availability?

First, let’s talk a little bit about availability. What do we mean when we talk about availability?

A highly available computing setup should provide for the following:

  • A highly available fault-tolerant infrastructure that enables applications to run continuously.
  • Continuous operations to allow for non-disruptive backups and infrastructure and application maintenance.
  • Disaster recovery measures that protect against unplanned outages due to disasters caused by factors that cannot be controlled.

Definitions

Availability is the state of an application service being accessible to the end user.

An outage (unavailability) is when a system is unavailable to an end user. An outage can be planned, for example, for software or hardware maintenance, or unplanned.

What causes outages?

A research report from Standish Group from 2005 showed the various causes of outages.

Causes of outages 2006

It is interesting to see that (cyber) security was not part of this picture, while more recent research published by UpTime Intelligence shows this growing concern. More on this later.

Causes of outages 2020 - 2021 - 2022

The myth of the nines

The table below shows the availability figures for an IBM mainframe setup versus Unix and LAN availability.

Things have changed. Unix (now: Linux) server availability has gone up. Server quality has improved, and so has software quality. Unix, however, still does not provide a capability similar to a z/OS sysplex. Such a sysplex simply beats any clustering facility by providing built-in, operating system-level availability.

Availability figures for an IBM mainframe setup versus Unix and LAN

At the time of writing, IBM publishes updated figures for a sysplex setup as well (see https://www.ibm.com/products/zos/parallel-sysplex): 99.99999% application availability for the footnote configuration: “… IBM Z servers must be configured in a Parallel Sysplex with z/OS 2.3 or above; GDPS data management and middleware recovery across Metro distance systems and storage and DS888X with IBM HyperSwap. Other resiliency technology and configurations may be needed.”
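To make such percentages tangible, the sketch below converts an availability figure into the downtime it allows per year. It assumes a non-leap year of 365 days; the function name is made up for this illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year: 31,536,000 seconds


def downtime_seconds_per_year(availability_percent):
    """Downtime per year that a given availability percentage still allows."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999, 99.99999):
    print(f"{pct}% -> {downtime_seconds_per_year(pct):,.1f} seconds per year")
```

Three nines still allow close to nine hours of downtime per year, while the seven nines quoted above leave a little over three seconds.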

Redundant hardware

The following slides show the redundant hardware of a z9 EC (Enterprise Class), the flagship mainframe of that time.

The redundant hardware of a z9 EC

Contrasting this with today’s flagship, the z16 (source https://www.vm.ibm.com/library/presentations/z16hwov.pdf), is interesting. Since the mainframe is now mounted in a standard rack, the interesting views have moved to the rear of the apparatus. (iPDUs are the power supplies in this machine.)

The redundant hardware of a z16

Redundant IO configuration

A nice, highly fault-tolerant server alone is insufficient for a truly highly available setup. The I/O configuration, a.k.a. the storage configuration, must be highly available as well.
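Why the weakest component matters can be shown with a little arithmetic. The sketch below assumes that components fail independently and uses made-up availability figures; components that are all needed multiply their availabilities, while redundant components only fail together.

```python
from math import prod


def series(*parts):
    """All parts are needed: the availabilities multiply."""
    return prod(parts)


def parallel(*parts):
    """Any one part suffices: the setup fails only if all parts fail."""
    return 1 - prod(1 - a for a in parts)


server = 0.9999   # assumed figure, for illustration only
storage = 0.999   # assumed figure, for illustration only

single = series(server, storage)
mirrored = series(server, parallel(storage, storage))

print(f"server + one storage system:  {single:.7f}")
print(f"server + two storage systems: {mirrored:.7f}")
```

With a single storage system, the combination is less available than the storage system alone. With two, the server becomes the limiting factor again.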

A redundant SAN setup

The following slide in the deck highlights how this can be achieved. What is amusing or annoying, depending on your mood, and what triggers me today are the “DASD CU” terms in the storage boxes. These boxes are the storage systems housing the physical disks. Even at that time, terms like storage and disk were clearer than DASD (Direct Access Storage Device, goodness, what a code word for disk) and CU (Control Unit, just an abstraction anyway). And then I am not even mentioning the valueless addition of CSS (Channel SubSystem) and CHPID (Channel Path ID) on this slide.

What a prick I must have been at that time.

At least the term Director did get the explanatory text “Switch.”

A redundant storage setup for mainframes

RAS features for storage

I went on to explain that a “Storage Subsystem” has the following RAS features (RAS, ugh…, Reliability, Availability, Serviceability):

  • Independent dual power feeds (so you could attach the storage box to two different independent power lines in the data center)
  • N+1 power supply technology/hot-swappable power supplies and fans
  • N+1 cooling
  • Battery backup
  • Non-volatile subsystem cache to protect writes that have not been hardened to DASD yet (which we jokingly referred to as non-violent storage)
  • Non-disruptive maintenance
  • Concurrent LIC activation (LIC – Licensed Internal Code, a smoke-and-mirrors term for software)
  • Concurrent repair and replacement actions
  • RAID architecture
  • Redundant microprocessors and data paths
  • Concurrent upgrade support (that is, the ability to add disks while the subsystem is online)
  • Redundant shared memory
  • Spare disk drives
  • Remote Copy to a second storage subsystem
    • Synchronous (Peer to Peer Remote Copy, PPRC)
    • Asynchronous (Extended Remote Copy, XRC)

Most of this is still valid today, except that spinning disks are gone: everything is on Solid State Drives nowadays.

Disk mirroring

Ensuring that data is safely stored in this redundant setup is achieved through disk mirroring at the lowest level. Every byte written to a disk in one storage system is replicated to one or more storage systems, which can be in different locations.

There are two options for disk mirroring: Peer-to-Peer Remote Copy (PPRC) or eXtended Remote Copy (XRC). PPRC is also known as a Metro Mirror solution. Data is mirrored synchronously, meaning an application receives an “I/O complete” only after both primary and secondary disks are updated. Because updates must be made to both storage systems synchronously, they can only be 15 to 20 kilometers apart. Otherwise, updates would take too long. The speed of light is the limiting factor here.
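A back-of-the-envelope calculation shows why distance hurts synchronous mirroring. The sketch below assumes that light travels through fiber at roughly 200,000 km per second (about two thirds of its speed in a vacuum) and ignores switch, protocol, and storage system overhead, so real figures will be higher. The `round_trips` parameter is an assumption too: a synchronous write may need more than one exchange between the sites.

```python
FIBER_KM_PER_SECOND = 200_000  # assumption: about two thirds of c in a vacuum


def round_trip_microseconds(distance_km, round_trips=1):
    """Propagation delay added to every synchronous write, in microseconds."""
    one_way_seconds = distance_km / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * round_trips * 1_000_000


for km in (20, 100, 1000):
    delay = round_trip_microseconds(km)
    print(f"{km} km: about {delay:.0f} microseconds added per write")
```

At 20 km the penalty is about 0.2 milliseconds per write under these assumptions; at 1,000 km it grows to about 10 milliseconds, which every single write would have to wait for.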

With XRC, data is mirrored asynchronously. An application receives “I/O complete” after the primary disk is updated. The storage systems can be at an unlimited distance apart from each other. A component called System Data Mover ensures the consistency of data in the secondary storage system.
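The difference between the two modes can be illustrated with a toy model. This is not how PPRC or XRC are implemented; `MirroredVolume` and its methods are hypothetical names that only mimic when “I/O complete” is signalled and which writes would be lost if the primary site failed at that moment.

```python
from collections import deque


class MirroredVolume:
    """Toy model of a primary disk with one remote secondary copy."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = {}
        self.secondary = {}
        self.pending = deque()  # updates not yet applied remotely (async only)

    def write(self, block, data):
        """Return once the application would see 'I/O complete'."""
        self.primary[block] = data
        if self.synchronous:
            # PPRC-style: the remote copy is updated before completion.
            self.secondary[block] = data
        else:
            # XRC-style: completion is signalled now, the copy follows later.
            self.pending.append((block, data))
        return "I/O complete"

    def drain(self):
        """Apply queued updates in their original order (the data mover role)."""
        while self.pending:
            block, data = self.pending.popleft()
            self.secondary[block] = data

    def data_at_risk(self):
        """Number of completed writes the secondary has not seen yet."""
        return len(self.pending)


volume = MirroredVolume(synchronous=False)
volume.write(1, "debit account A")
volume.write(2, "credit account B")
print(volume.data_at_risk())  # writes that a site failure right now would lose
```

Applying the queued updates in their original order is what keeps the secondary copy consistent: it may lag behind, but it never shows the second write without the first.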

PPRC and XRC

The following slide highlights how failover and failback would work in a PPRC configuration.

PPRC failover and failback

The operating system cluster: parallel sysplex

The presentation then explains how a z/OS parallel sysplex is configured to create a cluster without any single point of failure. All servers, LPARs, operating systems, and middleware are set up redundantly in a sysplex.

Features such as Dynamic Session Balancing and Dynamic Transaction Routing ensure that workloads are spread evenly across such a cluster. Facilities in the operating system and middleware work together to ensure that all data is safely and consistently shared, locking is in place when needed, and so forth.

The Coupling Facility is highlighted, which is a facility for sharing memory between the different members in a cluster. Sysplex Timers are shown; these ensure that the time of the different members in a sysplex is kept in sync.

A parallel sysplex

A few more facilities are discussed. Workload balancing is achieved with the Workload Manager (WLM) component of z/OS. The Automatic Restart Manager (ARM) provides the ability to restart applications without interfering with other applications or z/OS itself. The Resource Recovery Services (RRS) assist with two-phase commits across members in a sysplex.
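For readers unfamiliar with two-phase commit, here is a generic sketch of the protocol. It does not use or represent the RRS interfaces; `Participant` and `two_phase_commit` are hypothetical names, and failure handling such as timeouts and coordinator recovery is left out.

```python
class Participant:
    """A resource manager taking part in a two-phase commit."""

    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "idle"

    def prepare(self):
        # Phase 1: promise to commit if asked, or refuse.
        self.state = "prepared" if self.will_vote_yes else "refused"
        return self.will_vote_yes

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    """Commit on all participants, or on none of them."""
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes, otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


database = Participant("database")
queue = Participant("message queue", will_vote_yes=False)
print(two_phase_commit([database, queue]), database.state, queue.state)
```

The point of the two phases is that no participant makes its update permanent until every participant has promised it can do so.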

Automation is critical for successful rapid recovery and continuity

Every operation must be automated to prevent human errors and improve recovery speed. The following slide states the obvious about the benefits of automation:

  • Allows business continuity processes to be built on a reliable, consistent recovery time
  • Recovery times can remain consistent as the system scales to provide a flexible solution designed to meet changing business needs
  • Reduces infrastructure management costs and staffing skills
  • Reduces or eliminates human error during the recovery process at the time of disaster
  • Facilitates regular testing to help ensure repeatable, reliable, scalable business continuity
  • Helps maintain recovery readiness by managing and monitoring the server, data replication, workload, and network, along with the notification of events that occur within the environment

Tiers of Disaster Recovery

The following slide shows an awful picture highlighting the concept of tiers of Disaster Recovery from Zero Data Loss to the Pickup Truck method.

Tiers of Disaster Recovery

I like the Pickup Truck Access Method the most.

GDPS

The following slide introduces GDPS (the abbreviation of the meaningless concept of Geographically Dispersed Parallel Sysplex). GDPS is a piece of software on top of z/OS that provides the automation solution that combines all the previously discussed components to provide a Continuously Available configuration. GDPS takes care of the actions needed when failures occur in a z/OS sysplex.

GDPS

GDPS comes in two flavors: GDPS/PPRC and GDPS/XRC.

GDPS/PPRC is designed to provide continuous availability and no data loss between z/OS members in a sysplex across two sites that are at most at campus distance (15-20 km) from each other.

GDPS/XRC is designed to provide automatic failover between sites that are at extended distance from each other. Since GDPS/XRC is based on asynchronous data mirroring, minimal data loss can occur for data not yet committed to the remote site.

GDPS/PPRC and GDPS/XRC can be combined, providing a best-in-class solution with a high-performance, zero-data-loss setup for local/metro operation and an automatic site switch capability for extreme situations such as natural disasters.

In summary

The summary slide presents an overview of the capabilities of the server hardware, the Parallel Sysplex, and the GDPS setup.

Redundancy of Z server, Parallel Sysplex and GDPS

But we are not there yet: ransomware recovery

When I created this presentation, ransomware was not the big issue it is today. Nowadays, the IBM solution for continuous availability has been enriched with a capability for ransomware recovery. This solution, called IBM Z Cyber Vault, is a combination of various capabilities from IBM Z. The IBM Z Cyber Vault solution can create immutable copies (Safeguarded Copies, in IBM Z Cyber Vault terms) of production data, taken at multiple points in time, with rapid recovery capability. In addition, the solution supports data validation to test the validity of each captured copy.
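The idea of recovering from the most recent copy that passes validation can be sketched as follows. This is a conceptual illustration only and not the IBM Z Cyber Vault interface: `SafeguardedCopy`, `latest_valid_copy`, and the validation function are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a copy cannot be modified after creation
class SafeguardedCopy:
    taken_at: int   # point in time, for example an epoch timestamp
    data: bytes


def latest_valid_copy(copies, is_valid):
    """Return the most recent copy that passes validation, or None."""
    for candidate in sorted(copies, key=lambda c: c.taken_at, reverse=True):
        if is_valid(candidate):
            return candidate
    return None


# Illustrative data: the newest copy was taken after the attack started.
copies = [
    SafeguardedCopy(taken_at=1, data=b"ledger v1"),
    SafeguardedCopy(taken_at=2, data=b"ledger v2"),
    SafeguardedCopy(taken_at=3, data=b"ENCRYPTED"),
]


def looks_healthy(copy_to_check):
    # Stand-in for real data validation, which would inspect the content.
    return copy_to_check.data != b"ENCRYPTED"


restore_point = latest_valid_copy(copies, looks_healthy)
print(restore_point.taken_at)
```

Taking copies at multiple points in time is what makes this work: if the newest copy is already compromised, an older one is still available to fall back on.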

The IBM Z Cyber Vault environment is isolated from the production environment.

Whatever the type of mainframe configuration, this IBM Z Cyber Vault capability can provide a high degree of cyber resiliency.

Source: https://www.redbooks.ibm.com/redbooks/pdfs/sg248511.pdf

IBM Z Cyber Vault