Google Infrastructure Czar: Cloud Gets It Done

2 thoughts on “Google Infrastructure Czar: Cloud Gets It Done”

  1. Om,

    Nice article on your session with Urs. I have indeed heard some great remarks about him!

    However, I was a little surprised by his statement – “In our apps, we’re not actually shooting for five nines because that would lower the feature velocity”.

    I understand there is a fine balance between reliability (cost) and service delivery (convenience or velocity) in everything we do in life. But that does not mean we trade off reliability to a level lower than what world-class operational excellence should be. On the contrary, for a cloud-based concept of “always available”, why wouldn’t the goal be 6-sigma, similar to what world-class physical operations companies strive for?

    Let me explain what I mean with an example. Amazon is THE world-class leader in e-commerce fulfillment and ships over 20 million SKUs to 100M+ customers globally. They do well over 500 million shipments a year to their customers. Even if they deliver at 6-sigma (3.4 defects per million) for accuracy of their shipments to the right customers, think about how many customers they deliver the wrong stuff to: it would be about 1,700 wrong shipments every year, each one ending in free return costs, product replacement costs and BIG customer dissatisfaction! If they settled for only 4-nines (99.99%), similar to what Urs is saying, they would be sending roughly 50,000 shipments to the wrong customers every year!

    Do you think they settle for 4-nines in the context of introducing new products, features or global customers that create shipping complexity? Obviously they do not, and even for a physical product delivery process like shipping, they push for constant defect reduction and a march towards 6-sigma on a daily basis for customer shipment reliability!

    So why would Google, as THE leader in online search, cloud apps and the online infrastructure space, not aim for a 6-sigma approach and an “Always On Online Availability” model? I’m not saying they should compromise feature velocity for it. They should launch frequently and make changes, but saying we compromise reliability below a certain level also does not make sense, especially in a model like Cloud where the fundamental perception is no server downtime etc., right? They, as the leader in this space, should set the world-class gold standard.

    I’m curious to hear why Urs should not be similarly pushing for it as the goal and what he thinks about it – after all, isn’t he the operational excellence leader @Google with his new title as SVP of Operations?

  2. Rhagu,

    Your example of Amazon is very illustrative.

    But the issue with the cloud is more complex. Let me explain: Amazon, like FedEx and others, has been working on optimizing its logistics service, but once you reach a certain level in logistics, you just make minor changes over time and big changes every 3 or 5 years. That's how they reach Six Sigma. But as Hölzle states, there is a continuous process at Google Labs of testing new products, and it happens on a daily basis. The companies have been investing billions of dollars in improving their server farms, for example Facebook's open project to increase the performance of its servers. But even if these companies have the money and the best engineers on earth, there are moments when they can suffer problems; just look at what is happening with WordPress.
    The companies are always looking to reach 100%, but we are humans, and we are not perfect, so even a great hacker can make a mistake and cause the failure of the landing of the Mars Pathfinder.

    A final thought: I love the definition of the connection between the server farms and the smartphones, and how to simplify it for the user.

    Greets to all
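
The shipment arithmetic in the first comment can be checked with a short sketch. It is an illustration only, not something either commenter wrote. It assumes exactly 500 million shipments a year (the comment says "well over"), the conventional six-sigma rate of 3.4 defects per million, and "N nines" meaning a success rate of 1 - 10^-N. The function names are hypothetical.

```python
# Assumed inputs, taken from the first comment or from convention.
SHIPMENTS_PER_YEAR = 500_000_000   # "well over 500 million" in the comment
SIX_SIGMA_DPMO = 3.4               # conventional six-sigma defects per million


def defects_at_dpmo(volume: int, dpmo: float) -> float:
    """Expected defects for a rate given in defects per million."""
    return volume * dpmo / 1_000_000


def defects_at_nines(volume: int, nines: int) -> float:
    """Expected defects when the success rate is 'nines' nines,
    e.g. nines=4 means 99.99% of shipments are correct."""
    return volume / 10 ** nines


if __name__ == "__main__":
    six_sigma = defects_at_dpmo(SHIPMENTS_PER_YEAR, SIX_SIGMA_DPMO)
    print(f"six sigma:  {six_sigma:,.0f} wrong shipments per year")
    for n in (4, 5):
        wrong = defects_at_nines(SHIPMENTS_PER_YEAR, n)
        print(f"{n} nines:    {wrong:,.0f} wrong shipments per year")
```

Under these assumptions, six sigma gives about 1,700 wrong shipments a year, four nines gives 50,000, and five nines gives 5,000.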
