Fault tolerance at 100,000 requests per second

I bumped into this old post on the Netflix blog this weekend. It is an amazingly practical walkthrough of an implementation of the ideas in my post on the 7 Pillars of Resilient Service Engineering. The Netflix post was written in 2012 and mine in 2015, but teams need to write resilient service code now more than ever.

I have become increasingly aware of this since taking a new role at Microsoft as Dev Manager for Azure Identity Data Science and Engineering. In Azure Identity we receive tens of billions of requests per day. Each request is relatively small, so we store only a little over 5 terabytes of data per day, but that adds up fast if you want to run dynamic queries over trillions of records totaling hundreds of terabytes. Digging through this data for metrics on service quality, I started to think about how Identity can achieve high availability and service quality, given how critical the service is to all of Microsoft. Clearly, the challenge is keeping new and changing code from degrading service quality while, at the same time, maintaining engineering agility.

The common reaction to this problem, safe deploy, is to slow down and let changes bake. This is the wrong approach: your service will get passed by your competition. Deliver fault tolerance first, then enable rapid, exceptionally low-overhead flighting, and then let engineers and teams go at it. Have an idea? Code it up and throw it out there. You can be confident that risk to the service is non-existent and that user exposure is adequately contained. That's how you innovate: by removing overhead, not adding to it.
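
To make the idea a little more concrete, here is a minimal Python sketch of what low-overhead flighting on top of a fault-tolerant fallback could look like. It is only an illustration under my own assumptions: the `Flight` class, its `percent` and `max_failures` parameters, and the `stable` and `broken` functions are hypothetical names, not taken from the Netflix post or from any Azure system. A fixed share of users is routed to the new code path, any failure falls back to the stable path, and the flight switches itself off after a few failures.

```python
import hashlib


class Flight:
    """Hypothetical helper: exposes a share of users to new code, with a fallback."""

    def __init__(self, name, percent, max_failures=3):
        self.name = name
        self.percent = percent            # share of users exposed, 0-100
        self.max_failures = max_failures  # failures tolerated before the flight turns off
        self.failures = 0

    def includes(self, user_id):
        # Hash flight name and user id so each user lands in a stable bucket 0-99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < self.percent

    def call(self, user_id, new_path, stable_path):
        # Users outside the flight, or everyone once the flight has tripped, get stable code.
        if self.failures >= self.max_failures or not self.includes(user_id):
            return stable_path(user_id)
        try:
            return new_path(user_id)
        except Exception:
            # Count the failure and serve the stable result so the user never sees the fault.
            self.failures += 1
            return stable_path(user_id)


def stable(user_id):
    return f"stable:{user_id}"


def broken(user_id):
    raise RuntimeError("bug in new code")


flight = Flight("new-token-path", percent=100)
print(flight.call("alice", broken, stable))  # prints "stable:alice"
```

Even with every user in the flight and the new path failing on every call, callers still get the stable answer, and after `max_failures` errors the new path is no longer attempted. A real system would also need shared state across instances, metrics, and a way to re-enable the flight, none of which this sketch attempts.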

The Tao of Todd
