Over the past three years I have had the opportunity to work on IT strategy with many different Fortune 500 companies in my role as a Staff Solutions Architect at VMware. Across this wide exposure a few patterns have emerged. One consistent pattern is the focus on initial state. Much of IT’s effort is spent on creating a consistent, stable and secure initial state. This is an heirloom from the days before technology was prevalent in every aspect of business. Initial state is a critical aspect of time to market, but it does not represent the realities of IT. Every time IT releases an initial state, it creates operational debt. This operational debt is complicated by the dynamic nature of any product, so the debt accrues compounding interest. This mountain of obligation accounts for roughly 70% of operational spend in the average IT shop. The cost of operations quickly becomes the true limiter to innovation and agility. Three things are pushing this problem to a new breaking point:
- Automation of provisioning
- Public cloud
- Dynamic nature of containers
Automation of provisioning
Being agile to the business is top of mind for most IT executives. This is usually addressed by some type of provisioning automation. This effort seeks to reduce the total time to deliver assets to development or production. Focused effort can produce dramatic reductions in total process time. However, a determined focus on enabling self-service initial state may miss the governance required to be successful long term. Accelerated self-service and a growing number of consumers only expedite the compounding operational debt.
Public Cloud
Public cloud adoption initially began as a cost-savings measure with side benefits. After the initial waves of adoption, many organizations recoiled from their stated “all in” cloud strategy once they discovered the real cost. Today organizations are adopting public cloud for three reasons:
- Removal of operational debt of infrastructure
- Services unique to public cloud (machine learning, FaaS, WAF, etc.)
- Data gravity (their data exists in the cloud)
All are valid business reasons for public cloud adoption. The removal of operational debt is enabled by software-abstracted infrastructure and limited catalog options. The cloud’s accelerated speed of consumption for infrastructure services naturally puts pressure on private cloud. The public cloud trades cost for reduced infrastructure debt. Public cloud also introduces unique features designed to limit your ability to move away from the chosen cloud; public clouds are not incentivized to ease your movement away from their service. The real benefit of public cloud is the abstraction of infrastructure components into software, which allows for automation. Once again, public cloud today is highly focused on initial state and largely ignores the long-term operational cost of a service.
Dynamic nature of containers
Containers and their immutability concepts immediately look like a liberation from operational debt. Throwing away misbehaving resources and replacing them with perfect copies appears to be a solution to operational debt. Moving to a declarative approach to IT continues to abstract away the individual value of infrastructure components. Container orchestrators add an element of maintaining a designated declarative state, which addresses some of the operational debt. Containers’ lack of persistence does not remove all debt but instead illustrates how much debt is ignored. The removal of patching and troubleshooting does not offset the increased need for observability and the added complexity. The average three-tier application replaced by microservices can balloon from twenty managed entities to hundreds. The average life of a container is measured in minutes and hours, making the operational challenge even more acute. When Google first started deploying containers at scale, they quickly identified that their operational team could not scale to meet the new operational demand. To resolve this scale issue, they created the site reliability engineer (SRE) role. The SRE is a developer who spends half their time working operations and half automating operations. This has allowed Google to operate their global platforms with only 1,500 operational staff. This is the first factor that is not singularly focused on initial state.
Lessons learned from the three factors:
The three factors provide some valuable capabilities to be considered while attempting to solve operational debt:
- Automation is required for both initial and operational state
- Software abstraction of infrastructure is critical to enable automation
- Declarative models allow us to enforce initial state post deployment
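To make the last point concrete, here is a minimal Python sketch of what checking and enforcing a declared state could look like. It is an illustration only: the `ServiceState` fields, `find_drift` and `reconcile` are hypothetical names invented for this example, and real orchestrators do considerably more than compare three fields.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ServiceState:
    """Hypothetical declared state of a service; the fields are illustrative."""
    replicas: int
    image: str
    patch_level: str


def find_drift(desired: ServiceState, actual: ServiceState) -> dict:
    """Return {field_name: (desired_value, actual_value)} for each differing field."""
    drift = {}
    for f in fields(desired):
        want, have = getattr(desired, f.name), getattr(actual, f.name)
        if want != have:
            drift[f.name] = (want, have)
    return drift


def reconcile(desired: ServiceState, actual: ServiceState, enforce: bool = False):
    """Check for drift; when enforce is True, return the desired state as the new state.

    A real system would call its provisioning layer here instead of
    simply returning the desired object (assumption for this sketch).
    """
    drift = find_drift(desired, actual)
    if drift and enforce:
        return desired, drift
    return actual, drift
```

Run on a schedule with `enforce=False`, a check like this only reports drift; switching to `enforce=True` is the step that turns an initial-state definition into an operational one.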
Moving forward
With these new climate pressures, the time to address the amplifying operational debt is now. Your strategy cannot simply include automating operations; you have to create the correct landscape to overcome your organizational inertia. Investments in reducing operational debt scale independently of the location or size of the organization. They are the only investment that will continue to reduce operating expense year over year. Operational debt can be divided into two categories: common and organizationally unique. Common operational debt includes credential management, patching, hardware refreshes, code promotion to production, monitoring, architecture changes and break-fix. Common operational debt accounts for 70% of operational cost and is shared across all applications. These tasks represent toil: work that normally has zero value when done by a human operator and only adds latency.
Quantifying toil tasks
Toil tasks should be identified based upon three factors:
- Repeated – use your ticket system to track how common the task is and identify how much time is spent on the task each iteration
- Requires no human judgement – many organizations throw out potential toil work because their complex processes seem to require humans, when in fact the human involvement is a non-essential step in the process. Do not evaluate the current process; instead focus on the desired outcome
- Is interrupt driven – if you take the action after receiving a ticket or notification, it is likely toil work
Once you have identified toil tasks, use frequency as a guide to create your hit list of things to automate.
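The three factors can be applied to a ticket export with very little code. The sketch below is a hedged illustration in Python: the `Ticket` fields and the `rank_toil` function are hypothetical, and it assumes you can tag each ticket with whether it needed judgement and whether it was interrupt driven.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Ticket:
    task: str               # normalized task name, e.g. "password reset"
    minutes_spent: int      # handling time for this occurrence
    needs_judgement: bool   # does the desired outcome truly need a human decision?
    interrupt_driven: bool  # was the work triggered by a ticket or notification?


def rank_toil(tickets, min_occurrences=2):
    """Return (task, occurrences, total_minutes) tuples, most frequent first."""
    counts = defaultdict(int)
    minutes = defaultdict(int)
    for t in tickets:
        # Keep only work that needs no judgement and is interrupt driven.
        if t.needs_judgement or not t.interrupt_driven:
            continue
        counts[t.task] += 1
        minutes[t.task] += t.minutes_spent
    # "Repeated" is approximated by a minimum occurrence count (an assumption).
    repeated = [(task, n, minutes[task])
                for task, n in counts.items() if n >= min_occurrences]
    # Frequency is the primary sort key; total time spent breaks ties.
    return sorted(repeated, key=lambda row: (-row[1], -row[2]))
```

The output is the hit list: the most frequent interrupt-driven tasks that need no judgement, with total time spent shown alongside so the expected saving is visible.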
Service Orientation
Service-oriented models treat operational debt as part of the whole service. Operations should be automated as part of the service deployment, thus removing all toil work before consumption. While I believe service-oriented models and governance are the key to removing future operational debt, they don’t address the current legacy challenge.
Roadmap
The steps include:
- Implementation of software abstraction – as mentioned, the key element of public cloud’s ability to deliver infrastructure without debt is the use of software abstraction. Without this base for automation, your debt reduction efforts will hit inertia that cannot be overcome.
- Identify repeat toil – use your ticketing system to identify how common each toil task is. This provides a prioritized hit list of tasks.
- Automate repeat toil – begin automating toil tasks by removing all human interaction. It is very simple to create a queue in your ticket system that is serviced by automation as you transition toil from humans to automation.
- Move to declarative models – declarative models provide the ability to check for and potentially enforce expected state. This shift for future development is enabled by software-abstracted infrastructure and reduces future debt.
- Continue to remove toil – declarative models do not remove all toil; they only make software enforcement easier. Continued effort to remove debt is required to avoid the exponential increase.
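As a rough illustration of the "queue serviced by automation" idea in the roadmap, the Python sketch below drains a queue of tickets, resolves the ones that have an automated handler and escalates the rest to humans. Everything here is hypothetical: the ticket dictionary shape, the handler names and `service_queue` are assumptions made for the example, not the API of any particular ticketing product.

```python
from collections import deque


def restart_service(ticket):
    # Placeholder for a real automation call (assumption for this sketch).
    return f"restarted {ticket['target']}"


def reset_password(ticket):
    # Placeholder for a real automation call (assumption for this sketch).
    return f"reset password for {ticket['target']}"


# Map task types to automation; grow this table as more toil is automated.
HANDLERS = {
    "restart_service": restart_service,
    "reset_password": reset_password,
}


def service_queue(queue, handlers):
    """Drain the queue and return (resolved, escalated).

    resolved is a list of (ticket_id, result); escalated holds tickets
    with no handler or whose handler failed, so humans can pick them up.
    """
    resolved, escalated = [], []
    while queue:
        ticket = queue.popleft()
        handler = handlers.get(ticket["task"])
        if handler is None:
            escalated.append(ticket)
            continue
        try:
            resolved.append((ticket["id"], handler(ticket)))
        except Exception:
            # A failed automation run falls back to the human queue.
            escalated.append(ticket)
    return resolved, escalated
```

Starting with a small handler table and an escalation path lets toil move from humans to automation one task type at a time, without blocking tickets that are not yet automated.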
Organizations that have implemented this operational debt reduction strategy have seen operational cost fall by up to 50%. This reduction frees capacity for innovation and agility, driving additional revenue. Debt reduction is a key step that moves IT from cost center to business partner.