When I was running my own company I hosted ecommerce sites for the likes of the NFL, Disney, Toys R Us, and Nike. We closed these deals on the strength of our platform and site design, but our hosting left something to be desired. At one point we were operating at something approaching 95% availability. That is a terrible number: just over 1 hour of downtime PER DAY. Ouch. Things finally came to a head when the CIO of the NFL called me and said fix it or else. As the founder I had recruited an executive team, and we had a VP of Operations running the data center while I spent my time setting technology direction, managing development, and closing deals. As founder, though, you have to respond to a call like this, so I transferred my sales efforts to other leaders, put all technology decisions on hold, and started flying to Denver Sunday evening and home Friday night. It was my company. I had to fix the problem.
The day I showed up to take over I spent 12 straight hours listening to the team: asking them to list the issues they were aware of, encouraging them to complain about everything they hated about their job, and asking them what they would do if they were king for a day. I made a list of everything. The list sounded like this: we have a flaky server, people are not trained on Linux, I never know what state the configurations are in, we do not have a structured customer support process, so and so is an idiot, such and such software sucks, I work too many hours. Basic stuff. The suggestions were even more basic. They were very tactical, so tactical I can no longer remember them. This is what happens when people get in a firefight: they stop seeing the big picture because they are doing everything they can to keep putting one foot in front of the other.
The next day I walked in and made the most important management decree of my career: henceforth, every issue will be tracked in our bug system, not the CRM, and every bug will get a root cause analysis and a fix.
That first day the NFL called to inform us their site had gone down again. Seriously, it was so bad they had to tell us when there was a problem. I think things might have gone worse for us, but this time I personally answered the support line, and that demonstrated enough change and commitment to earn us a stay of execution. I was actually excited: here was our first problem to put through the process! We logged 3 bugs from that call: routing table fixes, routing table change management, and site monitoring, with monitoring being the highest priority. We stayed up all night and implemented a depth site monitoring system that automatically issued escalations to the on-call engineer and, 6 minutes later, to me. Engineers knew I would be on the phone with them 6 minutes after they got their page. The next night I went home at midnight to get some rest. At 4am my phone went off and I called the on-call engineer, who had already identified the problem: a data center connectivity failure… And we were off to the races. It was really that simple: put in a continuous improvement process and hold myself personally accountable for the results by receiving escalations at 4am. Six weeks later our service was making the customers happy, running at 99.9% availability.
The Q Strategies hosting operation was experiencing exceptionally low site availability (~95%), with regular service outages leading to a loss of customer confidence. Because the circumstances were dire, employees were working ridiculous hours on a death march with no end in sight. Needless to say, employee morale was in the toilet.
Here is a list of the most significant changes we put in place. Many of these things will seem trivial today but in 1999 they were da bomb.
- Incident Tracking System and Process – this was the foundation of all our success. By tracking every operational issue like it was a code bug and tracking every operational change with a version history we immediately got control over the situation.
- Continuous Improvement Process – by completing a root cause analysis for every issue, putting procedures in place to avoid repeating the same mistake twice, then fixing the issue you make amazing progress.
- Depth Service Monitoring System – that first call was embarrassing. Having the customer notify us of a site outage was ridiculous. It was clear we had to put a monitoring system in place, but it was also clear that simply pinging the home page was not sufficient. We monitored every individual process and partner integration point, and verified the location of every asset. That meant we monitored the home page, catalogs, registration, shopping cart, checkout, and the integrations with payments, tax, shipping, etc., all with synthetic transactions. We knew of any important issue within seconds.
- Customer incident notification – now that we knew about issues before the customer, we put a process in place to notify customers of every incident regardless of the customer or employee impact. Complete transparency like this started the process of re-building their trust and confidence.
- Automated Escalation System – Monitoring is useless without an automated escalation system. Our system sent an SMS to the on-call engineer. That engineer had 1 minute to respond to the text or the system escalated to the back-up engineer who had one minute before the tertiary engineer was paged who had one minute before I was paged. Then, regardless of what happened above, I was paged 6 minutes after the first SMS was sent. This last step was the magic in the process, always paging me. This created accountability until the habits were built.
- Employee Accountability & Authority Model – In addition to holding engineers accountable for rapid response, we also gave the engineers complete latitude on the fix. Whatever the engineer felt was the appropriate fix is what was fixed. I was merely a participant in the fix review call; the engineer made the final call, always. This fundamentally changed the level of skin in the game and resulted in far happier engineers.
- Change Management for both Content AND Configuration – while we had a change management system in place for code we found the majority of issues were elsewhere. The first big set of issues was content. Catalog pictures would end up in the wrong locations. The next big set of issues were configuration settings. The response we took was to put all content in source control so we could have versioning. We also found configuration changes were required for every build and often were conflicting. We responded by putting all configurations in source control too, literally every router config, web server config, site config, service config, everything. If it had a configuration file we put it under source control. At least now we could always roll back changes.
- 1 button site build – We then found that a release or rollback was too complex, so we scripted a 1 click build process for any device on the network, any service, and any site. Simply hit the build key for it and it would be put in a known state.
- 1 button DC build – we quickly realized having to build every individual device was shit so we scripted a complete DC build. This literally put every piece of the service into a known state. This was a huge breakthrough. No longer were we managing individual network elements we were managing clusters of machines and services. Way more scalable!
- Minimal access rights – even with all of the changes above we were still getting too many failures and the configurations were drifting. This was another huge breakthrough, we took away all write permissions for every service engineer. Only two people had the credentials to log on to individual network elements or services. Availability jumped a significant % from this one change alone.
- Dev, Test, Pre-prod and Prod Environments – With write permissions taken away from every service engineer we also had to take away write permissions for all of the developers. To do that we had to have environments for people to do their work. We created identical deployments for Dev, Test, Pre, and Prod and gave the developers permission to run the 1 click build on the dev environment. This had the largest impact on availability.
- Service reporting to clients – Now that we were proactively delivering incident reports to customers we needed a system to show them the trends. Without this even 99% availability looks pretty bad, 15 minutes a day, so we had to deliver monthly trend reports to show we were making progress.
- Service health dashboard for clients – OK at this point the service is getting pretty healthy but when an end user calls one of our customer’s support lines, our customers had no visibility into the system to tell the customer what to expect. We delivered a web dashboard to provide visibility into the status of the system. This created another huge boost in customer confidence.
- Moved to HA data center – OK, so now that we were at 99% we were having problems pushing beyond that. 15 minutes of unscheduled downtime a day was still too much, so we decided to move to an HA data center with redundant egress, backup power, locked cages, background checks, etc… This DC was amazing in 1999.
- Real-time redundant fault tolerance – even with all of these we were still having availability struggles, with outages causing lost data. We fixed this by always writing simultaneously to 2 SQL backends.
Yes it is true we did not have tertiary redundancy or tectonic plate diversity but our customers were not willing to pay for more than 99.95% availability so this is where we stayed.
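For anyone who wants to picture the escalation rule from the list above, here is a minimal sketch of the paging schedule, not the system we actually ran. It assumes the chain goes on-call, backup, tertiary, then me, one minute apart, plus the unconditional page to me at minute 6. The names `CHAIN`, `pages_sent`, and `ack_minute` are made up for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical escalation chain: each person has one minute to acknowledge.
CHAIN = ["on-call engineer", "backup engineer", "tertiary engineer", "founder"]
STEP_MINUTES = 1
FOUNDER_PAGE_MINUTE = 6  # the founder is always paged here, ack or no ack


def pages_sent(ack_minute: Optional[float]) -> List[Tuple[int, str]]:
    """Return (minute, recipient) for every page sent during one incident.

    ack_minute is the time (minutes after the first SMS) at which someone
    acknowledged the page, or None if nobody ever did.
    """
    pages = [(0, CHAIN[0])]  # the first SMS always goes to the on-call engineer
    for step, who in enumerate(CHAIN[1:], start=1):
        minute = step * STEP_MINUTES
        # An acknowledgement inside the window stops further escalation.
        if ack_minute is not None and ack_minute <= minute:
            break
        pages.append((minute, who))
    # The unconditional page: this is the accountability step.
    pages.append((FOUNDER_PAGE_MINUTE, "founder"))
    return pages


if __name__ == "__main__":
    print(pages_sent(0.5))   # on-call engineer answers within the first minute
    print(pages_sent(None))  # nobody answers: the whole chain gets paged
```

Note that even in the best case, where the on-call engineer answers right away, the minute 6 page still goes out. That is the whole point of the design.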
These actions drove the service from ~95% availability to 99.9% availability in 6 weeks and to 99.95% in 6 months.
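To put those percentages in terms of minutes, here is a quick back-of-the-envelope calculation. It is a sketch that assumes downtime is spread evenly over a 24-hour day and that no scheduled maintenance is excluded; the function name is just for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes_per_day(availability_percent: float) -> float:
    """Average unavailable minutes per day at a given availability level."""
    return (1 - availability_percent / 100) * MINUTES_PER_DAY


if __name__ == "__main__":
    for pct in (95.0, 99.0, 99.9, 99.95):
        print(f"{pct}% available -> {downtime_minutes_per_day(pct):.1f} min/day down")
```

Under these assumptions 95% works out to 72 minutes a day and 99% to roughly 14 minutes, which lines up with the "just over 1 hour" and "15 minutes" figures above. At 99.95% it is well under a minute a day.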
We retained every customer thanks to the hard work of a great team.
And if you made it this far, here is your reward! Just like an after credits scene in a movie!