Site Reliability Engineering (SRE)

a workbook review

DevOps - Background

DevOps emerged as a set of guidelines, practices, and culture to break down the silos between Developers and Operations.
With the silos in place, Developers threw code over the fence and left support to handle it in whatever state it landed. Because deployments are tricky things, it was common for the entire deployment to be manual.

Rather than hunting for the individual responsible for an accident, it's better to fix the issue and move forward. A speedy recovery beats trying to prevent every accident. Hunting down the party in question only encourages people to confuse the issue, hide the truth, and blame others.

Change should be gradual and sustainable. Short bursts of automation can result in unstable changes. Make small, low-risk changes with reliable rollbacks. This leads naturally to Continuous Integration and Continuous Delivery.

The right culture is required for the right tooling to succeed. With the right culture, we have the keys to success. Culture eats strategy for breakfast, and broken tooling can be worked around with the right culture.

Measurement is Crucial

When you're engineering for reliability, make sure you have metrics to show your progress. Use objective measurements to align what you're doing with reality. You can't manage what you don't measure.
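As a rough illustration of an objective measurement, the sketch below turns a list of pass/fail check results into an availability percentage and compares it with a target. The function name `Get-Availability`, the sample results, and the 99.9% target are all assumptions made for the example, not figures from the workbook.

```powershell
# Hypothetical helper: percentage of checks that succeeded.
function Get-Availability {
    param(
        [Parameter(Mandatory)] [bool[]] $CheckResults
    )
    # Count the checks that came back $true.
    $successes = @($CheckResults | Where-Object { $_ }).Count
    return [math]::Round(100.0 * $successes / $CheckResults.Count, 2)
}

# Sample data (made up): ten checks, one of which failed.
$results = @($true, $true, $true, $false, $true, $true, $true, $true, $true, $true)
$availability = Get-Availability -CheckResults $results
$target = 99.9   # hypothetical objective

if ($availability -lt $target) {
    Write-Output "Availability $availability% is below the $target% target"
} else {
    Write-Output "Availability $availability% meets the $target% target"
}
```

A number like this only becomes useful when it is tracked over time, so the results would normally come from a recorded history of checks rather than a hard-coded array.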

SRE - Background

Minimize Toil

If machines can perform a desired operation, then a machine often should. Examples like Azure DevOps, AWS CDK, GitHub Actions and IBM DevSecOps spring to mind.

Toil should not be the job. Time spent on operational tasks is not spent on project tasks. Making our services more reliable, scalable, cost-effective, secure, and sustainable through operational excellence should be the goal.

SRE complements DevOps by reducing the cost of change (lowering the risk), measuring reality (feedback loops), and increasing the speed and efficiency of change (CI/CD). Where DevOps provides a framework, SRE delivers on some of those promises.

Automate this year's job away

In one of my first roles, I was given the opportunity to do DevOps. There, we used developer disciplines to automate the deployment of our services.

With PowerShell, I was able to automate builds and deployments. But how do we know that a deployment succeeded? PowerShell can make a curl-style call to check whether the site is up. It is the most basic of checks and incredibly useful, because a lot goes into a single call to a website: the address lookup (DNS), securing the traffic (public key cryptography), load balancing the traffic, and delivering the content closer to your doorstep (CDN).

curl "https://blog.mckie.info";

Let's wrap that up into something we can leave running in the background, like leaving a tap running to see whether the water flow is interrupted.

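# Poll the site every 5 seconds so the loop doesn't hammer it; stop with Ctrl+C.
while ($true) { curl "https://blog.mckie.info"; Start-Sleep -Seconds 5 }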
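The loop above prints whatever comes back, so a person still has to watch it. A minimal sketch of a check that returns a plain up/down answer might look like the following. It calls `Invoke-WebRequest` directly (in Windows PowerShell, `curl` is an alias for that cmdlet). The function name `Test-SiteUp`, the 10-second default timeout, and the rule that any 2xx or 3xx status counts as "up" are assumptions for illustration.

```powershell
# Hypothetical helper: returns $true when the site answers, $false otherwise.
function Test-SiteUp {
    param(
        [Parameter(Mandatory)] [string] $Uri,
        [int] $TimeoutSec = 10
    )
    try {
        # Invoke-WebRequest raises an error for 4xx/5xx responses and for
        # connection failures, so those cases land in the catch block.
        $response = Invoke-WebRequest -Uri $Uri -UseBasicParsing -TimeoutSec $TimeoutSec -ErrorAction Stop
        return ($response.StatusCode -ge 200 -and $response.StatusCode -lt 400)
    }
    catch {
        Write-Warning "Check failed for ${Uri}: $($_.Exception.Message)"
        return $false
    }
}

# Example usage (commented out because it runs until interrupted):
# while ($true) {
#     $up = Test-SiteUp -Uri "https://blog.mckie.info"
#     Write-Output ("{0:u} up={1}" -f (Get-Date), $up)
#     Start-Sleep -Seconds 30
# }
```

Returning a boolean rather than printing the page makes the check easy to reuse after a deployment or to feed into a measurement.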

Move fast by reducing the cost of failure

The cost of failure can be high, especially when the change is large and the change window requires a long period of downtime. Reduce the cost of failure by making smaller, low-risk changes that minimize downtime.
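One way to picture a small change with a cheap failure mode is a deployment that verifies itself and rolls back when the check fails. The sketch below is an assumption-laden outline, not a real pipeline: `Invoke-SafeDeploy` and its three script block parameters are hypothetical names, and the actual deploy, health check, and rollback steps are left to the caller.

```powershell
# Hypothetical wrapper: deploy, verify, and roll back if verification fails.
function Invoke-SafeDeploy {
    param(
        [Parameter(Mandatory)] [scriptblock] $Deploy,
        [Parameter(Mandatory)] [scriptblock] $HealthCheck,
        [Parameter(Mandatory)] [scriptblock] $Rollback
    )
    try {
        $null = & $Deploy                      # discard any output from the deploy step
        $healthy = [bool](& $HealthCheck)      # the health check should return $true or $false
    }
    catch {
        # A deploy or health check that throws is treated as a failed change.
        $healthy = $false
    }

    if ($healthy) {
        return 'Deployed'
    }

    $null = & $Rollback
    return 'RolledBack'
}

# Example with stand-in steps; a real script would copy files, restart services, and so on.
$outcome = Invoke-SafeDeploy `
    -Deploy      { Write-Verbose 'deploying new version' } `
    -HealthCheck { $true } `
    -Rollback    { Write-Verbose 'restoring previous version' }
Write-Output "Outcome: $outcome"
```

The smaller the change passed to the deploy step, the quicker the rollback and the less there is to investigate afterwards.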

Operations is a Software Problem

Doing operations well means treating it like software, because it is a software problem. Each script can be checked into source control. PowerShell scripts can be tested, and so can scripts in other languages. If you find your script is untestable, apply the SOLID principles.
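As a sketch of what a tested script can look like, the example below uses Pester, a widely used PowerShell test framework. It assumes Pester is installed. The helper `Test-StatusHealthy` is a hypothetical function invented for the example. It contains only decision logic with no network call, which is the kind of separation that makes a script testable.

```powershell
# Health.Tests.ps1 (hypothetical file name)
Describe 'Test-StatusHealthy' {
    BeforeAll {
        # Hypothetical helper: pure logic, so it needs no network to test.
        function Test-StatusHealthy {
            param([int] $StatusCode)
            return ($StatusCode -ge 200 -and $StatusCode -lt 400)
        }
    }

    It 'treats 200 as healthy' {
        Test-StatusHealthy -StatusCode 200 | Should -Be $true
    }

    It 'treats 503 as unhealthy' {
        Test-StatusHealthy -StatusCode 503 | Should -Be $false
    }
}
```

Saved in a file, the tests can be run with `Invoke-Pester -Path .\Health.Tests.ps1`. In a real repository the function would usually live in its own script and be loaded inside `BeforeAll` rather than defined within the test file.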

Share Ownership with Developers

With tools like Docker, the code that runs on your local machine will also run in production. Containers package the code in such a way that the developer has to check in the settings for the servers that handle the traffic. Think of building a house: you start with a cookie-cutter house and then add your customizations. That's how Docker adds a level of confidence to deployments.

Serverless is another way to achieve the same ends. The implementation details of how to host a particular piece of code are pushed onto the managed service.

Please read the original book that inspired this post, and its digital companion:

The Site Reliability Workbook: Practical ways to implement SRE (O'Reilly)

SRE Google