In software engineering, as in many creative and technical fields, the environment shapes how we think and work. An intriguing psychological phenomenon known as the Cathedral Effect offers valuable insights into how physical and virtual workspaces can be designed to optimize both high-level creativity and detailed execution.
What Is the Cathedral Effect?
The Cathedral Effect reveals how ceiling height influences cognition and behavior. High ceilings evoke a sense of freedom and openness, fostering abstract thinking, creativity, and holistic problem-solving. In contrast, low ceilings create an enclosure that encourages focused, detail-oriented, and analytical work.
Research shows that exposure to high ceilings activates brain regions associated with visuospatial exploration and abstract thought, and confirms that people in high-ceiling environments engage in broader, more creative thinking, while low ceilings prime them for concrete, detail-focused tasks.
Applying the Cathedral Effect to Software Engineering
Software development involves both high-level architectural design and detailed coding and testing. The Cathedral Effect suggests that these phases benefit from different environments:
High-level work (system architecture, brainstorming, innovation) thrives in “high ceiling” spaces, whether physical rooms with tall ceilings or metaphorical spaces that encourage free-flowing ideas and open discussion.
Detailed work (analysis, programming, debugging) benefits from “low ceiling” environments that support concentration, precision, and deep focus.
Matching the workspace to the task helps teams think and perform at their best.
Practical Suggestions for IT Teams and Organizations
Create Dedicated Physical and Virtual Spaces
If possible, design your office with distinct zones:
High-ceiling rooms for architects and strategists to collaborate and innovate. These spaces should be open, well-lit, and flexible.
Low-ceiling or enclosed rooms for developers and analysts to focus on detailed work without distractions.
For remote or hybrid teams, replicate this by:
Holding open, informal video sessions and collaborative whiteboard meetings for high-level ideation.
Scheduling “deep work” periods with minimal interruptions, supported by quiet virtual rooms or dedicated communication channels.
Match People to Their Preferred Environments
We should recognize that some team members excel at abstract thinking, while others thrive on details. Assign roles and tasks accordingly, and respect their preferred workspace to maximize productivity and job satisfaction.
Facilitate Transitions Between Modes
Switching between big-picture thinking and detailed work requires mental shifts. Encourage physical or virtual “room changes” to help reset focus and mindset, reducing cognitive fatigue.
Foster Cross-Pollination
While separation is beneficial, occasional collaboration between high-level thinkers and detail-oriented workers ensures ideas remain practical and grounded.
Why This Matters
Ignoring the Cathedral Effect can lead to mismatched environments that stifle creativity or undermine focus. For example, forcing detail-oriented developers into open brainstorming sessions can cause distraction and frustration. Conversely, confining architects to cramped spaces can limit innovation.
By consciously designing workspaces and workflows that respect the Cathedral Effect, organizations can foster both creativity and precision, leading to better software and more engaged teams.
The slide deck tells me that it was in 2006 that I created a set of slides for “Kees” with an overview of the continuous availability features of an IBM mainframe setup.
The deck’s content was interesting enough to share here, with some enhancements.
What is availability?
First, let’s talk a little bit about availability. What do we mean when we talk about availability?
A highly available computing setup should provide for the following:
A highly available fault-tolerant infrastructure that enables applications to run continuously.
Continuous operations to allow for non-disruptive backups and infrastructure and application maintenance.
Disaster recovery measures that protect against unplanned outages due to disasters caused by factors that cannot be controlled.
Definitions
Availability is the state of an application service being accessible to the end user.
An outage (unavailability) is when a system is unavailable to an end user. An outage can be planned, for example, for software or hardware maintenance, or unplanned.
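A common way to quantify availability (my illustration, not from the deck) is the ratio of uptime to total time, for example MTBF / (MTBF + MTTR). A minimal Python sketch with made-up numbers:

# Availability expressed as the fraction of time the service is up.
# MTBF = mean time between failures, MTTR = mean time to repair.
# The numbers below are illustrative, not measured values.
mtbf_hours = 8760.0  # roughly one failure per year
mttr_hours = 4.0     # four hours to repair per incident

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"Availability: {availability:.5%}")  # about 99.954%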
What causes outages?
A 2005 research report from the Standish Group showed the various causes of outages.
Causes of outages
It is interesting to see that (cyber) security was not part of this picture, while more recent research published by UpTime Intelligence shows it as a growing concern. More on this later.
Causes of outages 2020 – 2021 – 2022
The myth of the nines
The table below shows the availability figures for an IBM mainframe setup versus Unix and LAN availability.
Things have changed. Unix (now: Linux) server availability has gone up. Server quality has improved, and so has software quality. Unix, however, still does not provide a capability similar to a z/OS sysplex. Such a sysplex simply beats any clustering facility by providing built-in, operating system-level availability.
Availability figures for an IBM mainframe setup versus Unix and LAN
At the time of writing, IBM publishes updated figures for a sysplex setup as well (see https://www.ibm.com/products/zos/parallel-sysplex): 99.99999% application availability, with the footnoted configuration: “… IBM Z servers must be configured in a Parallel Sysplex with z/OS 2.3 or above; GDPS data management and middleware recovery across Metro distance systems and storage and DS888X with IBM HyperSwap. Other resiliency technology and configurations may be needed.”
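To put these nines into perspective, here is a small illustrative Python calculation (mine, not part of the original deck) that converts an availability percentage into the maximum downtime per year:

# Convert an availability percentage into the allowed downtime per year.
minutes_per_year = 365 * 24 * 60

for nines in ("99.9", "99.99", "99.999", "99.99999"):
    downtime_minutes = (1 - float(nines) / 100) * minutes_per_year
    print(f"{nines}% availability allows {downtime_minutes:.2f} minutes of downtime per year")

At 99.99999% availability, that works out to roughly three seconds of downtime per year.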
Redundant hardware
The following slides show the redundant hardware of a z9 EC (Enterprise Class), the flagship mainframe of that time.
The redundant hardware of a z9 EC
Contrasting this with today’s flagship, the z16 (source https://www.vm.ibm.com/library/presentations/z16hwov.pdf), is interesting. Since the mainframe is now mounted in a standard rack, the interesting views have moved to the rear of the apparatus. (iPDUs are the power supplies in this machine.)
The redundant hardware of a z16
Redundant IO configuration
A nice, highly fault-tolerant server alone is insufficient for a truly highly available setup. The IO configuration, a.k.a. the storage configuration, must also be highly available.
A redundant SAN setup
The following slide in the deck highlights how this can be achieved. What is amusing or annoying, depending on your mood, and what triggers me today, are the “DASD CU” terms in the storage boxes. These boxes are the storage systems housing the physical disks. Even at that time, terms like storage and disk were clearer and more common than DASD (Direct Access Storage Device, goodness, what a code word for disk) and CU (Control Unit, just an abstraction anyway). And I will ignore the valueless addition of CSS (Channel SubSystem) and CHPID (Channel Path ID) to this slide.
What a prick I must have been at that time.
At least the term Director did get the explanatory text “Switch.”
A redundant storage setup for mainframes
RAS features for storage
I went on to explain that a “Storage Subsystem” has the following RAS features (RAS, ugh…, Reliability, Availability, Serviceability):
Independent dual power feeds (so you could attach the storage box to two different independent power lines in the data center)
N+1 power supply technology/hot-swappable power supplies and fans
N+1 cooling
Battery backup
Non-volatile subsystem cache to protect writes that have not been hardened to DASD yet (which we jokingly referred to as non-violent storage)
Non-disruptive maintenance
Concurrent LIC activation (LIC – Licensed Internal Code, a smoke-and-mirrors term for software)
Concurrent repair and replacement actions
RAID architecture
Redundant microprocessors and data paths
Concurrent upgrade support (that is, the ability to add disks while the subsystem is online)
Redundant shared memory
Spare disk drives
Remote Copy to a second storage subsystem
Synchronous (Peer to Peer Remote Copy, PPRC)
Asynchronous (Extended Remote Copy, XRC)
Most of this is still valid today, except that spinning disks have been replaced by Solid State Drives.
Disk mirroring
Ensuring that data is safely stored in this redundant setup is achieved through disk mirroring at the lowest level. Every byte written to a disk in one storage system is replicated to one or more storage systems, which can be in different locations.
There are two options for disk mirroring: Peer-to-Peer Remote Copy (PPRC) or eXtended Remote Copy (XRC). PPRC is also known as a Metro Mirror solution. Data is mirrored synchronously, meaning an application receives an “I/O complete” only after both primary and secondary disks are updated. Because updates must be made to both storage systems synchronously, they can only be 15 to 20 kilometers apart; otherwise, updates would take too long. The speed of light is the cause of this limitation.
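As a back-of-the-envelope illustration (my numbers, not from the deck): a signal in fiber travels at roughly 200,000 km/s, so every synchronous mirror write pays a round-trip propagation delay that grows with distance, on top of the normal I/O time. A small Python sketch:

# Round-trip propagation delay added to every synchronous mirror write.
# Assumes a signal speed of ~200,000 km/s in fiber; switching and
# storage-controller overhead are ignored.
SPEED_IN_FIBER_KM_PER_S = 200_000

for distance_km in (20, 100, 1000):
    round_trip_s = 2 * distance_km / SPEED_IN_FIBER_KM_PER_S
    print(f"{distance_km:>5} km -> {round_trip_s * 1e6:.0f} microseconds per mirrored write")

At campus distance the penalty is a few hundred microseconds per write; at hundreds of kilometers it becomes milliseconds, which is why synchronous mirroring is kept short-range.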
With XRC, data is mirrored asynchronously. An application receives “I/O complete” after the primary disk is updated. The storage systems can be at an unlimited distance apart from each other. A component called System Data Mover ensures the consistency of data in the secondary storage system.
PPRC and XRC
The following slide highlights how failover and failback would work in a PPRC configuration.
PPRC failover and failback
The operating system cluster: parallel sysplex
The presentation then explains how a z/OS parallel sysplex is configured to create a cluster without any single point of failure. All servers, LPARs, operating systems, and middleware are set up redundantly in a sysplex.
Features such as Dynamic Session Balancing and Dynamic Transaction Routing ensure that workloads are spread evenly across such a cluster. Facilities in the operating system and middleware work together to ensure that all data is safely and consistently shared, locking is in place when needed, and so forth.
The Coupling Facility, a facility for sharing memory between the different members of a cluster, is highlighted. Sysplex Timers are shown; these keep the clocks of the different members of a sysplex in sync.
A parallel sysplex
A few more facilities are discussed. Workload balancing is achieved with the Workload Manager (WLM) component of z/OS. The ability to restart applications without interfering with other applications or z/OS itself is provided by the Automatic Restart Manager (ARM). Resource Recovery Services (RRS) assist with two-phase commits across members of a sysplex.
Automation is critical for successful rapid recovery and continuity
Every operation must be automated to prevent human errors and improve recovery speed. The following slide states some rather obvious points about the benefits of automation:
Allows business continuity processes to be built on a reliable, consistent recovery time
Recovery times can remain consistent as the system scales to provide a flexible solution designed to meet changing business needs
Reduces infrastructure management costs and required staffing skills
Reduces or eliminates human error during the recovery process at the time of disaster
Facilitates regular testing to help ensure repeatable, reliable, scalable business continuity
Helps maintain recovery readiness by managing and monitoring the server, data replication, workload, and network, along with the notification of events that occur within the environment
Tiers of Disaster Recovery
The following slide shows an awful picture highlighting the concept of tiers of Disaster Recovery from Zero Data Loss to the Pickup Truck method.
Tiers of Disaster Recovery
I especially like the Pickup Truck Access Method.
GDPS
The following slide introduces GDPS (the abbreviation of the meaningless concept of Geographically Dispersed Parallel Sysplex). GDPS is software on top of z/OS that provides the automation combining all the previously discussed components into a continuously available configuration. GDPS takes care of the actions needed when failures occur in a z/OS sysplex.
GDPS
GDPS comes in two flavors: GDPS/PPRC and GDPS/XRC.
GDPS/PPRC is designed to provide continuous availability and no data loss between z/OS members in a sysplex across two sites that are at most campus distance (15-20 km) apart.
GDPS/XRC is designed to provide automatic failover between sites that are at an extended distance from each other. Since GDPS/XRC is based on asynchronous data mirroring, minimal data loss can occur for data not yet committed to the remote site.
GDPS/PPRC and GDPS/XRC can be combined, providing a best-in-class solution: a high-performance, zero-data-loss setup for local/metro operation, plus an automatic site-switch capability for extreme situations such as natural disasters.
In summary
The summary slide presents an overview of the capabilities of the server hardware, the Parallel Sysplex, and the GDPS setup.
Redundancy of the Z server, Parallel Sysplex, and GDPS
But we are not there yet: ransomware recovery
When I created this presentation, ransomware was not the big issue it is today. Nowadays, the IBM solution for continuous availability has been enriched with a capability for ransomware recovery. This solution, called IBM Z Cyber Vault, combines various capabilities of IBM Z. It can create immutable copies, or Safeguarded Copies in IBM Z Cyber Vault terms, taken at multiple points in time on production data, with rapid recovery capability. In addition, this solution can enable data validation to support testing the validity of each captured copy.
The IBM Z Cyber Vault environment is isolated from the production environment.
Whatever type of mainframe configuration you run, this IBM Z Cyber Vault capability can provide a high degree of cyber resiliency.
The notion of beauty also applies to technical designs. This often surprises a non-technical audience, but techies will recognize the beauty that can be present in technical solutions.
For example, symmetrical diagrams not only give a quick insight into an orderly, robust solution but are often also very appealing to the eye.
Symmetrical and well-colored diagrams are easier to read and understand.
Old PowerPoint presentations using the standard suggested colors were horrendously ugly, and I am sure the people using these colors did not want to be understood. (Nowadays, PowerPoint comes with more pleasing color schemes.)
The success of the Python programming language is due not least to its enforced readability. No crazy abbreviations as in C that make code unreadable (but make programmers look very smart).
Beautiful code (yes, such a thing exists) is easier to read and understand.
If a Then b
Else If c Then d
Else If e Then f
…

versus

Case a b
Case c d
Case e f
…
It is pretty evident.
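To make this concrete, here is a small Python illustration (my example, not from the original text) comparing a chain of conditionals with the match statement available since Python 3.10:

# A chain of conditionals: the reader has to scan every branch.
def describe_if(status):
    if status == 200:
        return "OK"
    elif status == 404:
        return "Not Found"
    elif status == 500:
        return "Server Error"
    else:
        return "Unknown"

# The same logic as a match statement: the structure is visible at a glance.
def describe_match(status):
    match status:
        case 200:
            return "OK"
        case 404:
            return "Not Found"
        case 500:
            return "Server Error"
        case _:
            return "Unknown"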
But do we care about the quality and beauty of code nowadays? Throw-away software is abundant. Software systems are built with the idea to throw them out and replace them within a few years.
Good programming is a profession that should be appreciated as such. Bad coding may be cheap, but only in the short run.
We don’t hire a moonlighter to build our house. We employ an architect and a construction professional who can make a comfortable house that can be used for generations.
Anyone in the product chain can pull the Andon Cord to stop production when he notices that the product’s quality is poor.
Stopping a system when a defect is suspected goes back to Toyota. The idea is that by stopping the system, you get an immediate opportunity for improvement, or to find the root cause, instead of letting the defect move further down the line unresolved.
A crucial aspect of Toyota’s “Andon Cord” process was that when the team leader arrived at the workstation, they thanked the team member who pulled the Cord.
The incident would not become a paper report or a drawn-out bureaucratic process. The problem would be addressed immediately, and the team member who pulled the cord would fix it.
For software systems, this practice is beneficial as well. However, in our drive for quick results, the opposite is often what we see in practice.
We don’t stop the process in case of issues. We apply a quick fix, and ‘we will resolve it later’.
The person noticing an issue is regarded as a whistle-blower. Issues may get covered up in this culture, leading to even more severe problems.
When serious issues occur, we start a bureaucratic process that quickly becomes political, resulting in watered-down solutions and covering up the fundamental problems.
In software systems, backward compatibility is a blessing and a curse. While backward compatibility relieves users of mandatory software updates, it is also an excuse to ignore maintenance. For software vendors, omitting backward compatibility is a means to get users to buy new stuff: “enjoy our latest innovations!”
1980s software on 64-bit hardware
Backward compatibility
You cannot run Windows 95 software on Windows 11.
You cannot run Mac OS X software on a PowerBook G4 from 2006.
You cannot use Java version 5 software on a Java 11 runtime.
You can, however, run mainframe software compiled in 1980 for 24-bit hardware on the latest z/OS 64-bit operating system and the latest IBM Z hardware. This compatibility is one of the reasons for the success of the IBM mainframe.
Backward compatibility in software has significant benefits. The most significant benefit is that you do not need to change applications for technology upgrades. This saves large amounts of effort, and thus money, on changes that bring no business benefit.
The dangers of backward compatibility
Backward compatibility also has very significant drawbacks:
Because you do not need to fix software for technology upgrades, backward compatibility leads to laziness in maintenance. Because the software just keeps running, its very existence drops out of sight. Development teams lose knowledge of the functionality and sometimes even of the supporting business processes. Minor changes may be made haphazardly, leading to slowly increasing code complexity. Horrific additions are made to applications, using tools like screen scraping, further complicating the IT landscape. Then, significant changes suddenly become necessary, and you are in big trouble.
Backward compatibility hinders innovation. Not only can you not take advantage of modern hardware capabilities, but you also get stuck with programming and interfacing paradigms of the past. You can not exploit functionality trapped inside old programs, and it is tough to integrate through modern technologies like REST APIs.
The problem may be even more significant. Because you do not touch your code, other issues may appear.
Over the years, you will switch source code management tools. During these transitions, code can get lost, or insight into the correct versions of programs gets lost.
Also, compilers are upgraded all the time, and the specifications of the programming languages may change. Consequently, the code you have, which belongs to the programs running in your production environment, can no longer be compiled. When changes are necessary, your code suddenly needs to catch up with all these changes, and that makes the change a lot riskier.
How to avoid backward compatibility complacency?
Establish a policy to recompile, test, and deploy programs every two or three years, even if the code needs no functional change. This prevents a pile of technical debt from building up.
Is that a lot of work? It does not need to be. You could automate most, if not all, of the compilation and testing process. If nothing functionally changes, modern test tools can help support this process. With these tools, you can automate running tests, compare results with the expected output, and pinpoint issues.
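A minimal sketch of such an automated regression check, assuming a hypothetical build.sh and run_tests.sh and an expected/ directory of reference outputs (all names are mine, purely illustrative):

# Recompile, rerun the test suite, and compare the output with reference results.
# build.sh, run_tests.sh, expected/ and actual/ are hypothetical names.
import filecmp
import pathlib
import subprocess

subprocess.run(["./build.sh"], check=True)      # recompile with the current compiler
subprocess.run(["./run_tests.sh"], check=True)  # write fresh output into actual/

mismatches = []
for expected in pathlib.Path("expected").glob("*.out"):
    actual = pathlib.Path("actual") / expected.name
    if not actual.exists() or not filecmp.cmp(expected, actual, shallow=False):
        mismatches.append(expected.name)

if mismatches:
    raise SystemExit(f"Regression in: {', '.join(mismatches)}")
print("All outputs match the reference results.")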
This process also has a benefit: your recompiled code will run faster because it can use the latest hardware features. You can save money if your software bill is based on CPU consumption.
Don’t let backward compatibility make you backward.
My proverbial neighbor asked me some time ago if he could have a zero data loss ransomware recovery solution for his IT shop. He is not a very technical guy, yet he is responsible for the IT in his department, and he is wise enough to seek advice on such matters. My man next door could very well be your boss, prompted by a salesperson from your software vendor.
What is a zero data loss ransomware recovery solution?
A ransomware recovery solution is a tool that provides you the ability to recover your IT systems from an incident in which a ransomware criminal has encrypted a crucial part of your IT systems. A zero data loss solution promises such a recovery without the loss of any data. The promise of zero data loss must be approached with the necessary skepticism. A zero data loss solution requires you to be able to decrypt the data that the ransomware criminal has encrypted, using the keys that he offers to give you for a nice sum of money. To get these keys, you have two options:
Pay the criminal and hope he will send you the keys.
Create the keys yourself. This would require some highly advanced algorithm, possibly using a tool based on quantum computing technology. This is a fantasy, of course. The first person to know about the practical application of such technology would be the ransomware criminal himself, and he would have applied it in his encryption tooling.
So getting the keys is not an option, unless you are in a position to save up a lot of money, or to find an insurer that will carry your ransomware risk, although I expect that will come at an excruciating premium.
The next best option is to recover your data from a point in time just before the ransomware attack. This requires a significant investment in advanced backup technology and complex recovery procedures, while giving you little guarantee about the state in which your systems can be recovered. And, to set expectations, it will come with the loss of all data that the ransomware criminal managed to encrypt. We cannot make it prettier than it is.
Over the last couple of days, I have been working on a new setup for software development. I was surprised (actually somewhat irritated) by the effort needed to get things working.
The components I needed did not seem to work together: Eclipse, the PHP plugin, the Git plugin, the HTML editor.
The same happened earlier when setting up for a Python project and some APIs (one based on Python 2, the other on Python 3).
I am still trying to think through what the core problem is. So far, I can see that the components and the platform are designed to integrate, but the tools all depend on small open-source components under the hood, which turn out to be incompatible with each other.
Maybe there should be a less granular approach to these things, and we should move to (application) platforms. Instead of picking components from GitHub while building our software, we should get an assembled platform of components. Somebody, or rather, some body, would assemble and publish these open-source platforms periodically, say every six months.
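As a small, hypothetical illustration of the idea in Python: a published platform release could be nothing more than a pinned list of component versions that a project checks itself against (the package names and versions below are made up):

# Check the installed packages against an imagined "platform release" manifest.
# The names and versions are made up for illustration only.
from importlib import metadata

PLATFORM_2025_1 = {
    "requests": "2.32.3",
    "flask": "3.0.3",
}

for package, pinned in PLATFORM_2025_1.items():
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        print(f"{package}: not installed (platform expects {pinned})")
        continue
    status = "ok" if installed == pinned else f"MISMATCH (platform expects {pinned})"
    print(f"{package} {installed}: {status}")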