Fault tolerance at 100,000 requests per second

I bumped into this old post on the Netflix blog this weekend. It is an amazingly practical walk-through of an implementation of my post on the 7 Pillars of Resilient Service Engineering. My post was written in 2015 and the Netflix post was written in 2012, but now more than ever teams need to write resilient service code.

I have become increasingly aware of this since taking a new role at Microsoft as Dev Manager for Azure Identity Data Science and Engineering. In Azure Identity we receive tens of billions of requests per day. Each request is relatively small, so we store only a little over 5 terabytes of data per day, but that adds up fast when you want to run dynamic queries over trillions of records spanning hundreds of terabytes. Digging through this data for service-quality metrics, I started thinking about how Identity can achieve high availability and service quality given how critical the service is to all of Microsoft. Clearly, the challenge is preventing new and changing code from degrading service quality while, at the same time, maintaining engineering agility.

The common reaction to this problem, safe deployment, is to slow down and bake. This is the wrong approach: your service will get passed by your competition. Instead, deliver fault tolerance, then enable rapid, exceptionally low-overhead flighting, and then let engineers and teams go at it. Have an idea? Code it up and throw it out there. You can be confident the risk to the service is non-existent and user exposure is adequately contained. That’s how you innovate: remove overhead, don’t add it…
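
To make the fault-tolerance half concrete, here is a minimal sketch of the pattern the Netflix post describes: isolate each dependency call behind a small, bounded thread pool (a bulkhead), enforce a hard timeout, and serve a static fallback instead of letting the failure cascade. The class names, pool sizes, and 50 ms timeout are illustrative assumptions, not anyone's production configuration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Wraps one downstream dependency. A slow or failing dependency can only
// exhaust this small pool, never the caller's request threads, and every
// call is bounded by a timeout with a static fallback.
public class GuardedDependency {

    // Bulkhead: 8 threads and a bounded queue; extra work is rejected fast.
    private final ExecutorService pool = new ThreadPoolExecutor(
            8, 8, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(32),
            new ThreadPoolExecutor.AbortPolicy());

    public String getProfile(String userId) {
        Future<String> result = null;
        try {
            result = pool.submit(() -> callProfileService(userId)); // real network call
            return result.get(50, TimeUnit.MILLISECONDS);           // hard timeout
        } catch (RejectedExecutionException | TimeoutException
                | InterruptedException | ExecutionException e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            }
            if (result != null) {
                result.cancel(true); // free the worker thread
            }
            return fallbackProfile(); // degrade gracefully instead of cascading
        }
    }

    private String callProfileService(String userId) {
        // Placeholder for the actual remote call to the downstream service.
        return "profile:" + userId;
    }

    private String fallbackProfile() {
        // Static default keeps the overall user request succeeding.
        return "profile:default";
    }
}
```

And for the flighting half, a deterministic percentage gate is often all it takes to keep user exposure contained: the same user always lands in the same bucket, so a flight can be ramped up without flapping. The flag name and 1% rollout below are hypothetical, not the author's actual flighting system.

```java
// Low-overhead flighting: hash user + flag into one of 100 stable buckets
// and enable the new code path only for buckets below the rollout percentage.
public final class Flight {

    public static boolean isEnabled(String flagName, String userId, int percent) {
        int bucket = Math.floorMod((flagName + ":" + userId).hashCode(), 100);
        return bucket < percent;
    }

    public static void main(String[] args) {
        String userId = "user-42"; // hypothetical user id
        if (isEnabled("newTokenPath", userId, 1)) { // 1% rollout of a hypothetical flag
            System.out.println("new, experimental code path");
        } else {
            System.out.println("existing, proven code path");
        }
    }
}
```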
