Lately I've been noticing how little of my work as an SRE involves actually hardening our product infrastructure. I almost never personally add redundancy to a service or eliminate a source of downtime.
Ultimately, the primary sources of system instability are process and lack of information: people aren't aware of when they're introducing unstable or unsafe architectures, they don't have the tools to verify whether their service is failing, or the application is set up so that designing stable features is difficult and time-consuming. None of these are things an SRE can fix by personally making code or cloud configs more reliable.
The only way to sustainably improve site reliability at an organization is to build systems and processes that make it easy to see when services are failing and easy to write services that don't. That looks like monitoring, training, internal tooling, and patterns to follow that make it difficult to build services that aren't reliable. It looks like getting buy-in from product so they understand their product's performance and are working with you and their team to keep it high.
I've worked at organizations where one brilliant engineer stepped in to clean up each service's stability personally. Those organizations ground to a halt whenever that engineer wasn't available, so the engineer was actively making their team less efficient.
Software is about building levers. If you always insist on personally being the lever, eventually you're going to snap in half.
#sre #devops #engineering #process #leadership