Site reliability engineering (SRE) draws on Nassim Nicholas Taleb’s concept of “antifragility”. At its core, SRE involves creating systems that learn from the errors and outages that inevitably arise. It helps turn massive, complex and fragile enterprise systems into “antifragile” ones that can meet dynamic challenges.
There is confusion around what SRE means in practice, much as there was when DevOps first arrived on the scene. It’s perhaps best described by Ben Treynor, the Senior Vice President overseeing technical operations at Google and originator of the term, who famously said SRE is “what happens when you ask a software engineer to design an operations team”.
One key issue SRE exists to solve is organisational silos, particularly between development and operations teams. A site reliability engineer’s job is to balance the development of new features with keeping production systems running smoothly and reliably. SRE enables development teams to deploy faster, while using any failures to improve the overall health of the system.
Another key issue is value-add metrics. You manage what you measure, so careful consideration is given to service level indicators (SLIs): metrics used to measure service performance, such as latency or error rate. This is vitally important as it forces the team to consider how to measure performance in a way that directly correlates with customer service level agreements (SLAs).
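To make the idea of an SLI concrete, here is a minimal Python sketch of the two indicators mentioned above, error rate and latency, computed from a batch of request records. The `Request` record, the function names and the choice of a nearest-rank percentile are assumptions made for illustration; in practice these figures would normally come from a monitoring platform rather than hand-written code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request (illustrative only)."""
    latency_ms: float
    ok: bool


def error_rate(requests):
    """Error-rate SLI: the fraction of requests that failed."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def latency_percentile(requests, pct):
    """Latency SLI: the pct-th percentile latency, using the nearest-rank method."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Ten made-up requests with latencies of 100 ms to 1000 ms; the two slowest fail.
sample = [Request(latency_ms=100.0 * i, ok=(i <= 8)) for i in range(1, 11)]

print(f"error rate: {error_rate(sample):.1%}")
print(f"p90 latency: {latency_percentile(sample, 90)} ms")
```

A percentile is used rather than an average because a handful of very slow requests can hide behind a healthy-looking mean, and those slow requests are often the ones customers notice.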
ANZ, Atlassian and, of course, Google are just some of the businesses in Australia that have embraced SRE. While still a nascent field, SRE is rapidly growing in popularity among enterprise organisations, with many dipping their toes in the SRE pool.
There are several key principles of SRE that make it so effective.
First, SRE is driven by an appreciation for using errors and the metrics those errors produce to relentlessly improve the health of IT systems. It uses data to assess and report on the reliability and availability of systems at every stage of the development cycle to determine if changes are hurting or helping the business.
At the end of the day, business leaders care about SLAs with the customer, while technologists are all about SLIs. SRE forges a valuable link between these two, translating a site reliability engineer’s work into something that the business can not only understand but also rely on to ensure SLAs are met. SLIs and SLAs are linked by service level objectives (SLOs), which are target values for the performance of some aspect of the application, service or platform, as measured by an SLI.
The IT Infrastructure Library (ITIL) framework and IT service management (ITSM) have struggled to find their place in the modern world of cloud computing and a DevOps culture. What many still fail to understand is that ITIL and, most importantly, ITSM are at the heart of SRE practices.
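The relationship between the three terms can be sketched in a few lines of Python: an SLI is a measured value, an SLO is the internal target for that value, and an SLA is the level promised to the customer. The figures, the `Targets` class and the `assess` function below are hypothetical, and the sketch assumes the common (but not universal) practice of setting the SLO tighter than the SLA so that the team is alerted before a contractual breach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Hypothetical availability targets, expressed as fractions of good requests."""
    slo: float  # internal objective the engineering team aims for
    sla: float  # level contractually promised to the customer


def availability(good_requests, total_requests):
    """Availability SLI: the fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def assess(sli, targets):
    """Compare a measured SLI against the SLO and the SLA."""
    if sli < targets.sla:
        return "SLA breached"
    if sli < targets.slo:
        return "SLO missed, SLA still met"
    return "within SLO"


# Illustrative numbers only: a 99.9% internal SLO sitting above a 99.5% SLA.
targets = Targets(slo=0.999, sla=0.995)

for good in (99_950, 99_700, 99_000):
    sli = availability(good, 100_000)
    print(f"availability {sli:.2%}: {assess(sli, targets)}")
```

The middle case is the one that matters to the business: the team has missed its own objective but the customer commitment is intact, which gives engineers a window to act before the SLA is at risk.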
Not without challenges
While adoption of SRE is growing in Australia, organisations that succeed at it are few and far between. There are several reasons for this.
The first is that SRE requires cultural and organisational transformation before it can be adopted successfully. This might include the formation of cross-functional teams to break down silos, the transition to a cloud operating model and the need for different skills within the business, such as engineers who understand the full lifecycle of software development and deployment.
This is easier said than done. Australia has a significant shortage of technology talent, with the Australian Computer Society estimating an additional 100,000 workers will be needed by 2024.
The unfortunate reality is that Australia bleeds home-grown tech talent to markets like the US, where the opportunities in the industry are immense. The cream of the crop that remains in the country often opts to work for tech giants like Google, Microsoft and Facebook.
I experienced the pitfalls of this first hand while executive manager for operations for the advanced analytics and automated decisioning area at a big four bank in Australia (not ANZ!). Software developers and operations engineers have very different skill sets. It’s incredibly difficult to make software developers more operations-minded, and operations engineers more development-minded.
Then there’s the challenge of executive buy-in. Even if business leaders understand SRE, many are taken aback by the cost of introducing more automation and managing legacy infrastructure with years of capitalised costs, especially as many are grappling with the economic fallout from COVID-19.
Finally, there’s a clear divide between adoption and successful adoption. In my experience, a lot of companies are reading SRE handbooks and going out on their own. As a result, they fail to implement SRE practices successfully, which in turn costs them the executive buy-in needed to continue down this path.
Core business need
If World War III broke out and you wanted to check the internet was still operational, chances are you’d turn to Google. Why? Because Google is reliable. The tech giant realised early on that the prerequisite to success and innovation is reliability. And look where it’s gotten the business.
In business IT, there are a million different ways things can go wrong. Enterprises need to be thinking less like a business-business and more like a tech-business because, in a market of fragile competition, SRE is increasingly becoming a key differentiator.
Michael Ewald is Director of Engineering and Consulting for APAC at Contino