7 Pillars of Resilient Service Engineering

I recently wrote this email to my team and realized it could help others so I am posting it here…


Folks, I recently had the opportunity to interact with a number of teams across Microsoft: MSN, Windows, Universal Store, Azure, and others.  I found one thing to be universally true: these teams are not writing resilient code.  Let's not be like them. Let's deliver resilient components!

Before I started at Microsoft, I had my own ecommerce consulting and hosting company.  We had customers like the NFL, Disney, Guess Jeans, and Toys R Us.  One thing every company required at that time was help putting together its Internet development processes.  One of the things I always pushed them on was delivering resilient components.  To be resilient, every component must:

·         Be re-usable by ANY other team.  There can be no co-location requirements, no requirements for common knowledge of data stores, no dependencies other than the published interface.

·         Continue working regardless of downstream outages.  This can be challenging since downstream dependencies can result in critical errors.  What I am saying is if you have a downstream outage, figure out what is the most functionality you can provide despite the outage and keep doing that…

·         Expect bugs in upstream and downstream dependencies.  Your code must behave in a user-friendly way despite the poor behavior of your dependencies.  This needs to go so far as expecting that error codes could be wrong.

·         Plan for evolution.  Dependencies are on different release cycles. Plan for this and ensure both your component and your component's dependencies can version independently.

·         Handle unpredictable scale limitations.  Remember your dependencies may not scale at the same rate as you.  As a result, plan for your dependencies to run faster or slower than you over time and avoid race conditions.

·         Scale horizontally.  Expansion is unpredictable: new countries, new data centers, new edge nodes, new CDNs, on-prem/off-prem evolution.  Understand that where your code runs is going to change over time.  Keep this in mind and build your component to leverage the best scale developments coming down the pipe.

·         Wherever possible, be stateless.  More than any other resiliency requirement, developers do not spend enough time thinking about how to make their component stateless.  Components that are not stateless cannot easily scale horizontally or handle unpredictable scale limitations.  As a result, it is critical to think about how your component can be stateless and whether the resulting performance tradeoff is worth it.
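
To make a couple of these points concrete, here is a minimal sketch of the "continue working regardless of downstream outages" and "expect error codes could be wrong" ideas. Everything in it is hypothetical and for illustration only: the `DegradingClient` name, the `fetch` callable, and the assumption that a valid response is a dict containing a `price` field are mine, not taken from any real service. It is one possible shape under those assumptions, not a definitive implementation.

```python
from typing import Callable, Dict, Tuple


class DegradingClient:
    """Wraps a downstream call and keeps answering when that call fails.

    Hypothetical example: `fetch` stands in for any downstream dependency.
    """

    def __init__(self, fetch: Callable[[str], dict], default: dict):
        self._fetch = fetch
        self._default = default
        # Last known good response per key, used only as a fallback.
        self._last_good: Dict[str, dict] = {}

    def get(self, key: str) -> Tuple[dict, str]:
        """Return (value, freshness), where freshness is 'fresh', 'stale' or 'default'."""
        try:
            result = self._fetch(key)
            # A "successful" call can still be wrong, so validate the payload.
            # Assumption: a valid payload is a dict with a "price" field.
            if not isinstance(result, dict) or "price" not in result:
                raise ValueError("malformed response from dependency")
        except Exception:
            # Downstream outage or bad data: provide the most we still can.
            if key in self._last_good:
                return self._last_good[key], "stale"
            return self._default, "default"
        self._last_good[key] = result
        return result, "fresh"
```

A caller can use the second element of the returned tuple to decide how to present the data, for example by showing a stale price with a notice rather than failing the whole page. Note that the fallback store here is in-process state, which works against the "be stateless" point. In a real component you would likely move it to a shared store, and whether that tradeoff is worth it is exactly the kind of decision the last bullet asks you to think about.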

These are not release requirements; these are a way of being.  The code we have written is the code we have written. We will not stop our train to ensure our past code meets arbitrary resiliency requirements; we simply need to start pushing ourselves and each other ever more in this direction.  This is about every one of us continually improving as an engineer, automatically designing, building, and testing for resiliency, and getting better with every line of code we write.  You folks are a bad ass team (that means amazing if it does not translate well into Estonian).  If you start talking about these concepts and pushing each other to build resilient components, our foundation will be rock solid in the long term.  Think about it: you want other developers to look at our code and be intimidated because it is engineered so well.  Every service at Microsoft should want to leverage and integrate with our code and be like us.

Writing resilient code is hard.  Some of the folks you work with may want to avoid the pain and just do things the way they always did.  Don’t let that happen.  Please, continue to talk about this amongst yourselves over time, think about why I am asking us to adopt this way of being, and find other qualities you want to add that make code better.

Imagine what could be the most amazing outcome of these conversations and start making that a reality today.  This is about building a culture of engineering, awesome engineering!
