“…Priceless”

Notable “cloud quant” Joe Weinman (purveyor of the Cloudonomics blog and generator of 10 Laws of Cloud Computing) organized a session at the 2010 Cloud Connect event in Santa Clara titled “ROI, Cost, and Economics”. [I am writing this just prior to the session, but I will speak in the past tense as it’s most likely you are reading it after the fact.] One segment of this session was a panel on “how to calculate ROI from the cloud”, in which I was proud to participate, offering (as my modest contribution) a view that contrasted bottom-line savings with top-line value.

In my turn, I gave a 5 minute (5 slides) explanation for the panel, which, if you missed it (or if you saw it and either liked it so much or hated it so much you want to drop me a comment – see link for that at the bottom), I reprise below – still about a 5-minute read if you don’t move your lips too slowly.

I listened in to a recent Gartner webcast called “Real Stories From the Front Lines”. Gartner analysts reviewed brief cloud computing case studies for companies like Eli Lilly, Fed Ex, Wipro, JohnsonDiversey, some smaller companies, and even a couple of Japanese government projects. Results and benefits were most often expressed in terms like:

  • Average server cost reduced from $2K to less than $800
  • Re-provisioning of assets reduced CAPEX 30%
  • Shrunk OPEX for email/collaboration by 70%
  • Investment payback in less than 14 months
  • Utilization improved from less than 10% to over 40%

Note that these are variations on IT cost- and savings-oriented statements. They are direct measures of reduced TCO (total cost of ownership, as if you didn’t know), capital expense (the cost of purchasing technology) and operational expense (the cost of making it go and keeping it going). Improvements in efficiency of converting those costs into services consumed by the constituent organizations are good; they mean IT is “doing more with less”.

When asked to calculate ROI, I think it’s easy for IT folks (and vendors) to think of the ‘R’ in the same terms as they think of the ‘I’, as an entry in the IT department ledger books. This is natural, but ultimately somewhat myopic, and unfortunately perpetuates the common view of IT as a cost center when it is really (and has for some time been) an indispensable part of the revenue generation engine of (almost) every enterprise.

I say myopic, because if you calculate the limit (you remember limits from calculus, don’t you?) of cost reductions and efficiency improvements only in the context of the IT budget, it is quite obviously only the size of the IT budget. If you work in IT, that may seem inordinately large to your CFO, or even very large to you (after all, it is your fiscal universe), but as a fraction of the total costs incurred by the operation of your enterprise, or of the total river of money that flows through it from the revenue headwaters into the pockets of employees, suppliers, and shareholders, it is actually pretty small (exceptions might be operations like Rackspace, where IT is the business of the company).

Sprinkled in among the savings and efficiency benefits, however, Gartner also reported some metrics stated in different terms:

  • Solution developed in 4 ½ months instead of 24
  • 50%-75% faster to change/add products
  • New server provisioning: 7.5 weeks -> 3 minutes
  • Time to bring up a new collaboration environment: 8 weeks -> 5 minutes
  • Time to provision a 64-node Linux cluster: 12 weeks -> 5 minutes

Notice that these measurements of time savings, while also expressions of efficiency, are actually efficiency benefits conferred upon IT’s users, not just IT. In these cases, constituent enterprises gain agility. Suddenly, they can do things suddenly. (Obviously, this must be important, or I wouldn’t use italics in three consecutive sentences.)

Sure, the automation required to reduce new server provisioning time also means IT is saving money on formerly-human-labor-intensive tasks, but it’s pretty easy to become target-fixated on saving a few cents per server-instance-hour and completely forget the massive investment an enterprise makes in accomplishing its mission (whether that is revenue generation or something measured in non-monetary terms), and that if IT can make that massive expenditure even a little more efficient, the benefits can be relatively enormous.

Here’s an amusing little illustration of the absurdity of obsessing about the cost efficiencies of cloud computing: Consider the cost of a standard server instance from, say, Amazon or Microsoft Azure, currently around $.12/hour. Now compare it to the “fully burdened” (meaning inclusive of benefits and the allocation of all fixed charges like office space) cost of your typical high-tech employee, around $160K/year (almost $77/hour, $1.28/minute, or a bit over $.02 per second – at least for the 8 hours a day, 5 days a week you can get them to work). This means you can afford roughly 640 server instances for every hour you pay one of these fully-burdened workers (CA, the company for which I work, has approximately 13,000 employees, which makes for pretty staggering hypothetical server-equivalents). Thinking of adding an instance to a cloud application that’s laboring under load? If you save an employee 6 seconds of waiting for a response, you’ve broken even.
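
For anyone who wants to check the arithmetic, here is a minimal Python sketch. It assumes a 2,080-hour work year (8 hours x 5 days x 52 weeks), and the function names are purely illustrative. Note that the instances-per-employee-hour figure is sensitive to the hourly price you plug in: roughly 640 at $.12/hour, roughly 770 at $.10/hour.

```python
# Back-of-the-envelope comparison of employee cost vs. cloud instance cost.
# Inputs are the post's round numbers; the 2,080-hour work year
# (8 hours x 5 days x 52 weeks) is an assumption.

WORK_HOURS_PER_YEAR = 8 * 5 * 52  # 2,080


def employee_cost_per_hour(annual_cost=160_000):
    """Fully burdened cost of one employee per working hour."""
    return annual_cost / WORK_HOURS_PER_YEAR


def instances_per_employee_hour(instance_price, annual_cost=160_000):
    """How many instance-hours one employee-hour would pay for."""
    return employee_cost_per_hour(annual_cost) / instance_price


def break_even_seconds(instance_price, annual_cost=160_000):
    """Seconds of employee time worth the same as one instance-hour."""
    per_second = employee_cost_per_hour(annual_cost) / 3600
    return instance_price / per_second


if __name__ == "__main__":
    print(f"employee: ${employee_cost_per_hour():.2f}/hour")
    for price in (0.12, 0.10):
        print(f"at ${price:.2f}/hour: "
              f"{instances_per_employee_hour(price):.0f} instances, "
              f"break-even {break_even_seconds(price):.1f} s")
```

At $.12/hour the break-even lands between five and six seconds of saved employee time per instance-hour, which is where the "6 seconds" above comes from.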

Sounds nuts, right? Well, it is. No company has a way to measure the improvement in business operations or market share or competitive advantage resulting from saving an employee six seconds an hour. What can be measured, however, is the end result from bringing a solution to market in 4 ½ months instead of 24, from bringing products to market 50%-75% faster, from reducing the time to provision IT resources – tools for the revenue-generating operations of the enterprise – by 5 orders of magnitude.

Cloud computing success stories are only starting to surface, but there are plenty of examples of cases where IT agility and efficiency produce market-changing end-results.

End results like taking massive market share from larger, entrenched competitors: One of Bill Coleman’s (Bill was Cassatt’s CEO) favorite agility stories is the pre-cloud case of MCI and AT&T and the battle for long-distance telephony market share. Upstart MCI came up with a billing trick (“Friends and Family”) that enabled them to bill differently for calls made to designated contacts than for everyone else. I know, sounds trivial, right? Turns out massive AT&T couldn’t respond with similar plans for over a year due to brittle, non-agile, billing and accounting systems. Result? The marketing domino “network effect” enables MCI to pick up 7 million new subscribers, 50% more market share, and billions of dollars — largely at AT&T’s expense — over the next couple of years.

Or, how about transforming IT efficiency into business efficiency? That’s Walmart’s model for automated supply chain management. There is no human-in-the-loop re-ordering product and scheduling delivery every time someone buys a TV or a toilet brush. Instead, Walmart’s IT systems automatically reorder and schedule delivery triggered by point-of-sale transactions. The same data is accumulated and data mined for statistics and trends, guiding finer and finer control of the entire supply chain (including their suppliers’ production), pricing, product offerings and placement, etc. Do you think Walmart regards their IT as a cost center, or as a competitive advantage?

The Internet revolution has created many examples where IT is not only the delivery medium for the value that enterprise creates, it is a fundamental part of that value. Take Amazon, whose initial “one click” bookseller business model that made them the “gorilla” of the on-line shopping market has easily stretched into sales of all manner of items, new and used, including (momentously, for us cloud watchers) even their own IT capacity.

And, it’s not just about money. US Joint Forces Command (JFCOM) coordinates joint training exercises that span our various military services and those of our allies, including real-time live, virtual, and simulated combatants. These widely-distributed scenario-based exercises are intended not only to train war-fighters, but commanders as well. The problem has been that it is so complex to set up and plan the infrastructure and logistics associated with an exercise that they’ve only been able to do one or two per year and the lead time for each can be over a year. That means any scenario (the context they are training to, like “what if Cuba invades Greenland?”) of necessity must be rather hypothetical. What they want is to be able to do 20 per year, if need be, and to do so on a two-week or less notice. This would provide the ability to actually train for a response to very relevant, real-time scenarios, thus greatly increasing the odds of success and decreasing the cost of any actual action. My alma mater, Cassatt, partnered with the Virginia Modeling and Simulation Center (VMASC) at Old Dominion University to demonstrate the ability to flexibly re-provision JFCOM’s IT infrastructure and simulation programs in literally minutes (that 5-orders-of-magnitude again), a key element in achieving their goal. The end result? Who can say? That ball is still in play, but the gain is potentially measured in human lives, a “MasterCard commercial” priceless outcome if I ever heard one.

I think it’s clear we’re in the very early days of understanding cloud computing’s true impact on enterprise and the world. Right now, we devolve to talking about the ROI of cloud computing in cost and IT-centric terms because we don’t have good ways of measuring or appreciating things like “agility” on business scales that are finer-grain than the quarterly report. Most agility metrics we do have are for technology feats-of-strength like provisioning, and it’s hard to transform those into top-line growth or advantage in a straightforward way. So, IT is left mostly trying to impress the CFO, to “do more with less”, and trying to wrestle Moore’s Law-scaling technology to the ground with only linearly-scaling human effort. To make top-line arguments for cloud computing, we need to engage the CEO or other revenue-responsible heads of business units. Their bonuses depend on finding or creating the competitive advantages cloud computing-enabled IT infrastructure and services can provide that are going to make the difference between leading their market and going out of business.

At some point in the near future these top-line advantages will be obvious to everyone, and even Geoffrey Moore’s “early majority” will see that having an agile, efficient IT infrastructure is going to have the same relative impact on business that having IT at all had in the first place, and that’s when we’ll see cloud computing swept into its “tornado” phase. When that happens, the IT money-saving aspect may be merely an interesting side-effect of cloud computing, like automobiles have the interesting side-effect of eliminating the need to feed and water the horses.

500 Words Or Less

Funny how seeing your start-up nearly do a convincing imitation of a smoking crater, having an enormous IT management software company swoop in at the 11th hour to make a surprising glove save (only spilled a little), and making the personal transition from a company of O(100) individuals or fewer at any time to one with over 13K employees makes time fly by almost unnoticed.  I am shocked and chagrined to see how long it’s been since I’ve posted.  Nothing to be done for it now, of course, but climb back in that saddle (I’m losing count; is that three or four metaphors in the mix?) and write.

Good news for me, though: Despite the elapsed time, cloud computing is still in its relative infancy, plenty yet to figure out, so at least I don’t have to change the theme of this blog.

So, writing again, but baby steps first:  Here’s a pointer to a post I guest-wrote for a new MITRE cloud blog, part of a monthly series where they ask a question (this month it’s “what do you perceive as the most significant concern for federal organizations who want to use cloud computing?”) and multiple correspondents from industry take a whack at an informative answer.  Fortunately for me, it’s multiple choice, though I was hoping for true/false to improve my odds of getting it right.  Who is MITRE?  From their web site: “As a public interest company, MITRE works in partnership with the government applying systems engineering and advanced technology to address issues of critical national importance.”  You can check them out in general at MITRE.org, or read their new “Ahead In The Clouds” blog (and my modest contribution — rules stipulate 500 words or less — for January) here.

(Did I mention I work for CA now as a “Distinguished Engineer” (“distinguished” means I’m old) working on cloud strategy and technology?  How embarrassing to have to read my own ancient posts to see where I left things…)

Everybody Lies

It’s no secret that Cassatt (my benevolent employer) has been struggling to make a go at selling internal cloud management software.  Our CEO, Bill Coleman, is on record stating that we “vastly underestimated the social and cultural challenge of Cassatt.”  Surprisingly, this “challenge” often first manifests itself, not as a conceptual “value of cloud computing” sort of issue, but instead in the degree of difficulty assessing the starting point, the current condition and usage of the data center in question, and this lack of detailed self-knowledge can slow or completely stymie efforts to follow a successful PoC (Proof of Concept) technology demonstration with a fruitful deployment into production.

How does this happen?  How can an IT organization, having recognized it has a data center efficiency problem (low utilization, high costs, rigid/brittle architecture, out of power/cooling/space, etc.) and vetted and tested potential solutions (and I’m not just talking cloud infrastructure management technology) to its satisfaction, find itself unable to productively roll it out where it will do the most good?

Perhaps a clue can be found on the tube (you know, television… like Hulu, only with more commercials).  The favorite saying of fictional TV doctor/diagnostician extraordinaire Gregory House, MD, is “everybody lies”.  In his case(s), he’s talking about patients’ universal propensity for falsifying or withholding critical medical history or symptom information that would be invaluable for diagnosing their disease, their untruthfulness generally attributable to protecting self-interest or avoiding embarrassment.  

Sometimes, life imitates art (if one is willing to call television “art”, or, for that matter, IT infrastructure management “life”).  As in any “House” episode, improving data center efficiency also involves a diagnosis phase, requiring collection of important deployment and operational data about the organization and utilization of resources and applications already running in the data center.  Whether the intended medicine is virtualization, automated provisioning, active power management, creation of a full-on internal cloud, or merely the discovery and surgical removal of orphan servers, the experience is much like Dr. House’s: discovering exactly what is running in the data center, what it’s running on, how the pieces are all connected, who’s using what, and how much can be a surprisingly difficult forensic exercise, and for reasons Dr. House would find familiar.

Take a look at these real-world examples of the kinds of roadblocks we’ve encountered as IT tried to collect and analyze their applications and environments:

  • Misleading:  Customer A, considering active power management in their data center, polled server/app owners on what periods of time their servers might be unused and could be powered off.  Cross checking revealed contradictory answers for the exact same apps/servers that varied so wildly that all data had to be thrown out.
  • Uncooperative:  Even though Customer B had a leading monitoring tool already in use in the data center capable of collecting the desired utilization and deployment data, the efficiency task force that wanted the info didn’t “own” the tool and couldn’t get the monitoring tool admins to turn it on everywhere or collect/report on results.
  • Confidently wrong:  In Customer C’s dev/test data center, management claimed that their global outsourced development team made round-the-clock use of the systems, handing off servers from one shift to the next and leaving virtually no opportunity for savings.  A subsequent Active Profiling engagement revealed a large fraction of orphans, with the majority of all systems idle most of the time.
  • Dodging the question:  Customer D’s multi-data-center IT transformation effort, resourced by external professional services hired guns, planned a comprehensive database of servers/apps to guide server consolidation and deployment of virtualization, automated provisioning, and dynamic allocation.  Their multi-page survey elicited few responses, and those few collected were inconsistently and sparsely filled out.  End result: the entire transformation effort finally ground to a halt due to their inability to determine what to do where first.  

As you can see, rarely is an inability to glean facts the result of anyone’s direct intent to mislead.  Usually, it’s more like Obi-Wan Kenobi’s lame rationalization to an outraged Luke for obscuring the relationship between young Skywalker and the dark Lord Vader; people describe the world using statements that are “true, from a certain point of view”, and often can’t even see the need for the qualifier.  This is one reason science (the practice of finding facts, not truth) can be so difficult.  People aren’t machines (a good thing, but sometimes problematic) and aren’t calibrated.  As I’ve said before, our perceptions are generalized and heavily influenced by our filters, and when we make assertions based on those generalizations, it’s probably not even fair to expect them to represent the facts, and too judgmental to call them “lies” just because they are erroneous.

Lessons learned?  Data center efficiency projects need an accurate map and reliable profile of resources, apps, and utilization to guide implementation.  Don’t underestimate the difficulty of obtaining that detailed info, nor the social and organizational-hierarchy roadblocks that can defeat seemingly-reasonable efforts at data collection.  If possible, find a passive way to capture information that doesn’t require cooperation of app/resource owners (or maybe even entrenched IT operators and their existing tools) or rely on human estimates.  They’re probably wrong, possibly lying (but then, isn’t everybody?).

The Cloud’s Green Lining

[My apologies for the April posting hiatus.  You may have heard that Cassatt, my benevolent employer, may be "nearing the end".  The best horror movie scripts consist of prolonged uncertainty and suspense punctuated by frequent protagonist near-death experiences and oft-revived monsters.  I don't know how it's going to turn out, but it's been a great movie so far, and we aren't dead, yet.]

University of Michigan and Carnegie Mellon University researchers presented a paper in March at ASPLOS ’09 (the annual Architectural Support for Programming Languages and Operating Systems conference) proposing “PowerNap”, a dynamic and very fine-grain approach to turning off servers for idle periods as short as a second or less, making use of “sleep” capabilities built into most server components derived from commodity desktops.  Unlike the way sleep works on personal computers, they propose automating activation of server “sleep” cycles between tasks.  Their hardware approach was guided by actual traces of utilization for general IT server applications (e.g., email) and “Web 2.0” applications.

I love out-of-the-box thinking (even when the thinking is about, well, boxes).  I think the paper is clever and like how the authors expose some ugly truths about power efficiency in servers, in particular, the routine over-design of power supplies to handle peak loads when fully populated with disks, memory, NICs, etc., and the resulting horrible efficiency in more common lightly-populated configurations (hmm, over-provisioning for peak loads results in low efficiency…  where have I heard that before?).  Better read the fine print (or at least the power supply efficiency curves) next time you think you’re buying a green server.

Skeptic that I am, I wonder about some of the data they collected to fuel their efficiency models (I suspect round-robin load balancer distribution of at least the Web 2.0 workload artificially made the traffic they were recording look more fine grain than may necessarily be the case) and I expect “PowerNapping” servers may take a while to materialize in your favorite commodity server provider catalog.  However, it is undeniable that wide-spread availability of commodity servers that would automatically “power nap” between transactions could save big on energy costs in even traditionally-managed data centers.  

Fortunately, there is no need to wait.  While not as fine grain as “PowerNap”, adopting cloud computing, internal or external, saves energy and reduces carbon footprint in a variety of ways, and is accessible today:

  • Pooling/sharing resources increases utilization:  Using an external cloud or creating an internal cloud from existing data center resources implicitly means multi-tenant sharing of a pool of virtual and physical resources.  Sharing is good, driving utilization of resources higher than if applications remained isolated in over-provisioned silos.  Higher utilization means fewer resources necessary to provide equivalent levels of services, saving energy and the carbon cost of the resources themselves.  Of course, unallocated resources in the pool can be kept powered off until required. 
  • Automation reduces IT staff:   IT departments don’t like to hear it and vendors don’t like to say it, but cloud computing requires fewer people than traditional IT methods, and the remaining people need to be able to think and act more like managers than admins.  Clouds are dynamic and allocation/metering of resources must be automated to be practical at anything beyond the smallest scale.  Automation increases the number of servers and scale of capacity that can be administered by a person.  This not only reduces the number of people needed to provide a given level of capacity, saving money and the relative carbon footprint of IT, but it makes the jobs of IT more interesting — and less dangerous (studies show that most service outages are due to human error, just like HAL said) as well.  Again, this is true of internal and external clouds, from the perspective of the data center operators.  If you’re using an external cloud, you benefit from the automation of your outsourced IT capacity (well, perhaps not you, particularly if you’re one of the people displaced during the automation of your data center). 
  • Cost transparency encourages thriftiness: Internal cloud granular reporting/metering/billing of dynamically-allocated resources illuminates the wasteful hiding places in traditionally-managed data centers.  Like corruption and mold, waste can’t stand the bright light of day.  Orphan servers and underutilized resources are assimilated into the pool and can be automatically deactivated if not in use.  Of course, external clouds expose costs even more directly, by eliminating capital costs and billing directly only for the resources as they are consumed.
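
To make the pooling point concrete, here is a toy Python sketch. The demand figures and app names are invented for illustration, not measurements from any real data center; the only idea being shown is that silos are sized for the sum of each app's peak, while a shared pool only has to cover the peak of the combined demand.

```python
# Toy illustration of why pooling raises utilization.
# The demand numbers below are hypothetical, made up for this example.

def siloed_capacity(demands):
    """Each app gets enough servers for its own peak."""
    return sum(max(app) for app in demands.values())


def pooled_capacity(demands):
    """One shared pool sized for the peak of the combined demand."""
    return max(sum(period) for period in zip(*demands.values()))


def average_utilization(demands, capacity):
    """Mean combined demand as a fraction of provisioned capacity."""
    periods = list(zip(*demands.values()))
    mean_demand = sum(sum(period) for period in periods) / len(periods)
    return mean_demand / capacity


# Servers needed in each 4-hour block of a day (hypothetical workloads).
DEMANDS = {
    "email":     [2, 8, 10, 9, 4, 2],
    "batch":     [10, 3, 1, 1, 2, 9],
    "web_store": [1, 2, 5, 6, 10, 4],
}

if __name__ == "__main__":
    for label, sizing in (("siloed", siloed_capacity),
                          ("pooled", pooled_capacity)):
        capacity = sizing(DEMANDS)
        utilization = average_utilization(DEMANDS, capacity)
        print(f"{label}: {capacity} servers, {utilization:.0%} utilized")
```

With these made-up workloads, whose peaks fall at different times of day, the pool needs roughly half the servers of the silos. Workloads that all peak together would see much less benefit.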

The good news is that these cloud computing infrastructure benefits are compatible and synergistic with hardware efficiency efforts, even those as extreme as PowerNap.  Dynamic allocation and sharing of resources among applications and between tenants adds complementary operational and deployment efficiencies that optimize hardware use, regardless of its inherent energy efficiency.  By all means, buy the most efficient gear you can, then deploy it in a cloud.

Amazon Introduces Inelastic Cloud

Amazon announced a new pricing structure for EC2 last Thursday based on “reserved instances”, the ability for a customer to pay an up-front fee that will set aside a Linux/Unix AWS instance (Windows reserved instances not yet available) for 1 or 3 years.  In return for the one-time fee, reserved instances carry a per-hour price tag that is only 30% of the cost of the existing on-demand variety of EC2 instance.  On-demand instance availability and pricing remains the same.

Amazon says they created reserved instances in response to customer requests for lower prices in return for a long-term commitment, as well as customer calls for a way to guarantee instance availability, particularly for disaster recovery use cases.

If you run the numbers, you see that reserving and operating a reserved instance 24/7 for a year costs 67% of what an on-demand instance would cost for a year.  The higher up-front cost of a 3-year commitment is amortized over a longer period, so 24/7 operations for 3 years would cost only 49% if using reserved instances instead of on-demand instances.  Actual savings would be somewhat less, as Amazon’s separate charges for bandwidth, storage, and IP addresses are the same for reserved and on-demand instances.  
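
Running the numbers might look like the Python sketch below. The prices are assumptions for a small Linux instance and are not quoted in this post: $0.10/hour on demand, $0.03/hour reserved, and one-time fees of $325 (1 year) and $500 (3 years). Substitute the current price list before drawing conclusions.

```python
# Reserved vs. on-demand cost for an instance running 24/7.
# All prices are assumed for illustration (small Linux instance):
# $0.10/hour on demand, $0.03/hour reserved, and one-time fees of
# $325 for a 1-year term and $500 for a 3-year term.

HOURS_PER_YEAR = 24 * 365  # 8,760


def reserved_to_on_demand_ratio(years, upfront_fee,
                                on_demand_rate=0.10, reserved_rate=0.03):
    """Total reserved cost divided by total on-demand cost at 24/7 use."""
    hours = years * HOURS_PER_YEAR
    reserved_total = upfront_fee + reserved_rate * hours
    on_demand_total = on_demand_rate * hours
    return reserved_total / on_demand_total


if __name__ == "__main__":
    for years, fee in ((1, 325), (3, 500)):
        ratio = reserved_to_on_demand_ratio(years, fee)
        print(f"{years}-year term: {ratio:.0%} of on-demand cost")
```

The ratio only covers instance charges. Bandwidth, storage, and IP address fees are the same either way, so they would pull the real-world ratio closer to 100%.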

It’s interesting to note that the cost savings ratios are exactly the same for all 5 instance sizes/prices (standard small/large/extra-large and high CPU medium/extra-large), perhaps providing hints of Amazon’s underlying cost structure.  By guaranteeing “there’s no chance of encountering any transient limitations in EC2 capacity” for reserved instances, Amazon is — at least implicitly — promising not to over-sell available reserved capacity, so the fixed-cost portion of the price should approximate Amazon’s margined TCO for a server (scaled by how they define a “Compute Unit” and the number of Compute Units provided by the instance type).  Hint or not, the announcement is providing more fodder for those arguing the nitty-gritty of the costs of external clouds vs. DIY (e.g., see Gartner’s Lydia Leong’s cautionary post).

Last week, I talked about the cost of cloud computing, arguing that the largest factor in the IaaS cost equation, for internal (private) or external (public) clouds (e.g., Amazon), is how efficiently they are employed, their utilization.  Chronic low utilization, the shame of traditionally-managed data centers world-wide, is pretty much the result of over-provisioning, reserving more capacity than an application or service needs at a given point in time, more capacity in the form of a larger-than-necessary server (for single-server apps), or in more servers than necessary (for distributed apps).  Reduce over-provisioning, and you probably increase utilization (let’s not quibble over corner cases) and do something good for data center TCO.

From the start, Amazon EC2 offered solutions to reduce both single-server and distributed over-provisioning, by providing varying-capacity individual servers as well as on-demand pay-as-you-go provisioning of servers from a shared resource pool.  However, reserved instances are something of a step back from both.  

Amazon suggests we should “think of the one-time fee as somewhat akin to acquiring hardware”, and it is, with the same kinds of limitations.  Reserved instances must be purchased/located in a particular “availability zone” (think “data center”) and region (though availability zones are US-only, so far).  Unlike on-demand instances, which can be launched in any zone, a customer must use the particular reserved resource in the original zone for which it was purchased, and there is currently no way to relocate reserved instances, so customers designing DR scenarios or trying to locate services near regional consumers should plan carefully.  Once the instance is purchased, it’s locked in place.  

Similarly, unlike on-demand instances, a reserved instance is what it is (i.e., standard small/large/extra-large or high CPU medium/extra-large).  Once purchased, the instance type cannot be changed.  If appetite for your application grows to require a larger server, or the server proves too large, reserved instances can’t be traded for a more appropriately sized resource the way on-demand instances can.  In this respect, Amazon’s resources fall further behind the arbitrary granularity of virtualization density that can be achieved with privately-operated servers.

While reserved instances and their hybrid up-front/by-the-hour cost model, used appropriately to host the most predictably-constant, high-utilization applications, can lower the cost of cloud computing (at least Amazon-based cloud computing), the inelastic aspects may be better for Amazon than for users.  Amazon gets more predictable capacity planning and revenue, and probably captures additional customers.  Users get a guarantee of instance availability (I understand the importance for a particular set of use cases, but it makes me wonder how often on-demand instance launch failures happen) and a lower price on any workload that consumes an instance more than 50-67% of the time, but only by sacrificing elasticity.  I would rather have seen a model that enabled more rather than less elasticity, like utilization-based pricing (e.g., if I’m only using 10% of the capacity of a $.10/hour instance, only charge me a penny/hour).

Werner Vogels, Amazon’s CTO, notes that reserved instances offer IT shops thinking about a move to cloud computing “a transition model that is closer to their current strategy”.  It’s possible that might not be a good thing.  “Current strategy” has produced the traditional IT management policies and practices that have filled data centers with under-utilized, over-provisioned application silos.  Should we be surprised if reserved instances entice some organizations to waste as much or more money in the cloud as they do today in their own data centers?

The Elephant in the Computer Room

I was sitting next to Jay Fry in Las Vegas (not at the tables, honest), listening to Tom Bittman’s keynote opening the Gartner Data Center Conference in December, when Tom said “[according to Gartner's analysis], if you fully utilize your own equipment, Amazon [EC2] will cost you twice as much.”  Jay kept on furiously taking notes, but I missed the next few minutes of Tom’s speech.  I was thinking “boy, that’s a big ‘if’.”

Analyses like Gartner’s have sparked many discussions on the true cost of IaaS vs. the true cost of operating your own gear, or even outsourcing operations.  A recent example of the math can be found in CIO’s Bernard Golden’s fourth part (of six, concluded this week) in the series “The Case Against Cloud Computing” (which is really about making the case for cloud computing by examining critics’ arguments then offering refuting remarks).  Bernard relays a couple of calculations of the TCO of Amazon EC2 large and small instances, summing to at-first-blush large amounts, and then advises cloud shoppers to “do the math correctly”, meaning (I take it), correctly account for all the costs of running your own gear or outsourcing, and implying that, if you are honest with your corporate self, you’ll see that Amazon’s EC2 servers are really not that expensive after all.

I think Bernard strikes a resonant note when, at the end of the article, among “the cloud cost advantages”, he lists things like “the pricing is transparent” and “the pricing is fixed”.  Business accounting sometimes seems deliberately designed to be opaque and complex.  Often IT-related costs are concealed in non-IT cost centers.  IT rarely gets the data center electric bill, for instance; instead it goes to Facilities, as does the cost of the computer room, HVAC, UPS, etc.  Sometimes it seems like accounting just punts on trying to figure out where costs should be charged, and instead uniformly allocates them across cost centers according to non-usage-related financial accounting formulae.  For example, in budget-speak, every employee comes laden with “burden”, the allocated cost of the real estate occupied by their office and common areas, G&A “overhead” like HR/accounting/receptionists/security/maintenance, etc.  And to further complicate the maze, capital costs (the cost of the assets themselves) are subject to arcane depreciation and cost-of-capital machinations by the financial wizards, making understanding “true” IT cost for a given server also subject to choices a business makes about tax treatment of assets (e.g., depreciation schedule and method), and how they pay for them.

By comparison, how refreshing to just get an invoice with a bottom-line number from your external cloud IaaS supplier, even if it might be larger than the cost of buying and running it yourself.

But Bernard’s sources are splitting hairs when they argue about the cost of servers, owned/operated by IT or rented from the cloud.  A more important question is, how much does the consumable application or end-user service provided by those servers cost?  That’s what the enterprise really cares about, and this higher-level view exposes a factor that overwhelms any of the cost-of-server factors: Utilization.   Remember that Tom Bittman’s statement was qualified “if you fully utilize…”  That “if” dominates the cost-of-service calculation and it applies to usage of external cloud IaaS sources like Amazon, as well as internal or private resources.  Utilization is the elephant in the computer room. 

Think about it.  No matter how we reasonably construct an equation for the internal cost of a server, the true cost of the end-user service or application is proportional to how efficiently IT converts that server capacity into useful work.  Utilization is a good measure of the efficiency of this conversion, and it has a huge effect on the cost of IT.  If you average 20% server utilization, for example, it’s a 5x potential multiplier on the sum of your server costs.  If you average 10% (or less — you know who you are), it’s a 10x multiplier.  So what if you get an additional 20% discount on new servers from Dell?  A drop in the bucket.  Doubled the number of servers each admin can manage?  Big deal.  Double your utilization and you can cut your IT budget in half without affecting service levels.
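To put rough numbers on that, here is a back-of-the-envelope sketch in Python.  The function names and the $100 server spend are made up for illustration; the only assumption doing any work is that the cost of useful work scales as spend divided by average utilization.

```python
def cost_multiplier(utilization):
    """Factor by which idle capacity inflates the cost of useful work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


def cost_of_useful_work(server_spend, utilization):
    """Spend needed per unit of fully used server capacity."""
    return server_spend * cost_multiplier(utilization)


# Hypothetical figures: $100 of server spend at 10% average utilization.
baseline = cost_of_useful_work(100.0, 0.10)
# Same utilization, but servers bought at a 20% discount.
with_discount = cost_of_useful_work(80.0, 0.10)
# Same price, but utilization doubled to 20%.
with_doubled_util = cost_of_useful_work(100.0, 0.20)

print(cost_multiplier(0.20), cost_multiplier(0.10))
print(baseline, with_discount, with_doubled_util)
```

Under these assumptions the discount trims the effective cost by a fifth, while doubling utilization cuts it in half, which is the whole point.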

So, does the cloud help your utilization?  Maybe, but not because of the cost or pricing structure. Computational IaaS is sold by the glass, not by the drink.  It doesn’t matter if you quaff it dry or just sip the suds at the top, you pay the same amount.  

For instance (regrettable pun intended), Amazon doesn’t charge for EC2 by the CPU cycle or by the number of actual instructions executed (this lack of granularity is the key difference between storage and computation economics).  Instead, they charge by the instance-hour, a wall-clock-timed allocation of peak capacity that it’s up to the user to employ efficiently or wastefully, just like a real server you buy, plug in, and run yourself.  Low utilization, high utilization, Amazon doesn’t care.  You get the same bill either way, but that bill might be 5 or 10 times larger than it needs to be if your utilization is still running at industry standard averages.  

(If Amazon wanted to solve the server utilization problem at a stroke, they could charge for actual processor time instead of instance hours, the way time-share computers used to be billed.  If I’m only using 10% of a $.10/hour instance, only charge me a penny/hour.  Not much chance of that happening, but it’s not the only solution.)
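A small Python sketch makes the difference between the two billing models concrete.  To be clear, the processor-time model below is purely hypothetical (it is the time-share-style billing imagined above, not anything Amazon offers), and the 720-hour month and function names are my own illustrative choices.

```python
def instance_hour_bill(rate_per_hour, hours):
    """Bill under instance-hour pricing: utilization is irrelevant."""
    return rate_per_hour * hours


def hypothetical_cpu_time_bill(rate_per_hour, hours, utilization):
    """Bill if only the processor time actually used were charged (hypothetical)."""
    return rate_per_hour * hours * utilization


def effective_rate_per_busy_hour(rate_per_hour, utilization):
    """What each hour of useful work really costs under instance-hour pricing."""
    return rate_per_hour / utilization


rate = 0.10         # dollars per instance-hour, as in the example above
hours = 720         # assumed: roughly one month of wall-clock time
utilization = 0.10  # assumed average utilization

print(instance_hour_bill(rate, hours))
print(hypothetical_cpu_time_bill(rate, hours, utilization))
print(effective_rate_per_busy_hour(rate, utilization))
```

With these inputs the instance-hour bill is ten times the hypothetical processor-time bill, and every hour of useful work effectively costs a dollar rather than a dime.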

The way that cloud computing can help utilization is by being an elastic resource, not by being a cheaper resource.  That elasticity, which comes from the ability to dynamically allocate shared computational resources to provide services in proportion to demand — and just as dynamically deallocate them again — is what fundamentally drives up utilization by reducing over-provisioning.  It cannot eliminate over-provisioning (even Amazon needs to over-provision EC2 to accommodate fluctuations in demand, and you can bet that cost is passed on to users), but it can dramatically reduce it by eliminating application silos and changing the capacity planning equation from “what is the peak this application will ever need?” to “what is the peak this pool of applications will ever need?”  By pooling applications (and pooling resources), non-coincident demand peaks are handled for free.  The more applications and resources you pool, the smaller the over-provisioning factor, and the higher the average utilization.
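The capacity planning shift described above can be sketched in a few lines of Python.  The three demand series are invented (each app peaks at a different time step), the function names are mine, and the sketch deliberately ignores the safety headroom any real pool would still need.

```python
def siloed_capacity(demands):
    """Capacity needed when every application is sized for its own peak."""
    return sum(max(series) for series in demands)


def pooled_capacity(demands):
    """Capacity needed when the pool is sized for the peak of total demand."""
    return max(sum(step) for step in zip(*demands))


def average_utilization(demands, capacity):
    """Total work done divided by total capacity provisioned over the period."""
    total_demand = sum(sum(series) for series in demands)
    steps = len(demands[0])
    return total_demand / (capacity * steps)


# Hypothetical demand (in servers) for three apps over four time steps.
demands = [
    [8, 2, 2, 2],
    [2, 8, 2, 2],
    [2, 2, 8, 2],
]

siloed = siloed_capacity(demands)
pooled = pooled_capacity(demands)

print(siloed, average_utilization(demands, siloed))
print(pooled, average_utilization(demands, pooled))
```

In this toy case the siloed design needs 24 servers and the pooled design needs 12, so average utilization doubles simply because the peaks do not coincide.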

Elasticity is the key benefit from internal or private clouds as well.  Cassatt (my benevolent employer) makes internal cloud-enabling software that dynamically allocates servers from a shared resource pool to applications in proportion to service demand.  Just like Amazon, over-provisioning cannot be eliminated, but it can be dramatically reduced, and utilization correspondingly increased, slashing the cost of IT.  Increased utilization makes all the other benefits of cloud computing, like business agility and infrastructure resiliency, basically free (as in “free beer”, not “free speech”).

“Now, that’s a knife…”

Last week I focused (somewhat critically) on two cloud taxonomy/ontology proposals that had been kicking around, arguing that they were neither taxonomy nor ontology, and in fact fell short of being very useful tools for categorizing, hence understanding, the organization of and relationships among the diverse entities in the cloud domain.  Well, I’m on vacation this week, but can’t help drawing attention to a new cloud taxonomy effort initiated by Jean-Lou Dupont.  The 451’s Rachel Chalmers called it “a real beauty”, and I’m inclined to agree, not because I think it’s complete or necessarily accurate, but because it takes a unique and so-far-surprisingly-productive approach.  Jean-Lou has cleverly created the taxonomy diagram as a wikimap in Mindmeister, allowing anyone to weigh in directly just by editing.  I grabbed the diagram during a few minutes of wifi access at the airport and immediately found it useful, even in its primitive early state.  I predict it will provide a framework for many interesting (and likely elevated-temperature) discussions and debates over coming weeks.

It’s mainly taxonomy-by-enumeration, so far, with some branch labels more-or-less self-explanatory (SaaS, PaaS, IaaS, of course, and on sub-branches, SADIST-PIMP), while others have definitions and distinctions that are probably far from being agreed by consensus.

Jean-Lou’s map is growing rapidly (“probably tripled today”, he says in a comment to Rachel), and the wiki model makes it inclusive.  I have to believe there will come a time when this tree will sorely need pruning (criteria and candidates for cutting will be yet another hot topic, I’m sure, and perhaps there’s a point where moderation may be a good idea), but for now, I’d be content to let it grow organically as contributors add companies, products/technologies, and sub-categories, just as it’s good practice to be inclusive and accepting of (nearly any) proffered brainstorms in the early stage of any creative exercise.  There are obviously too many branches and leaves that are simply populated with company names under a single taxonomic label (I’m sure those companies will add distinguishing characteristics to differentiate themselves), and too few specific technologies as leaf cells, too little mention of distinguishing characteristics that might differentiate leaves that share a branch, but these are merely the visible signs of both the early stage of the map’s development and the rush to populate it.

The taxonomy is also quite heterogeneous, including a nascent (heck, it’s all nascent) “community” family tree.  I’ll have to think about whether this “side band” of communications/analysis constitutes a worthy branch of cloud taxonomy (ideas as a service?), but it is an interesting touch, and I expect more surprises as different viewpoints are incorporated.

It’s going to be another week before I get more than transient internet access, so further exploration and experimentation will have to wait, but I’m already excited to see how far it’s expanded and refined when next I can grab an update.  Nice work Jean-Lou!

Cloud Burst

Recently, ex-Cassatter (“Cassattian”? “Cassattite”?) turned Ciscoer (uh… Ciscoan?), always-blogger James Urquhart called attention to both the need and a couple of proposals for cloud computing taxonomies/ontologies.  “That would be nice”, I thought, because taxonomies and ontologies can bring a lot of clarity and precision to an otherwise murky, poorly-specified picture.  After taking a look at the proposals and discussion, I wondered if we were talking about the same things.

Taxonomies are frameworks for understanding the often-hierarchical organization of related “things”.  Probably the most famous and familiar taxonomy is the Linnaean taxonomy of life we all learned in school (you remember: kingdom, phylum, yadda, yadda, yadda, species, sub-species), but almost everything we have and do can be, and often is, organized into taxonomies.  With so many claimants touting cloud products and services, potential customers of those products and services (which include nearly everyone producing or consuming IT products and services) could probably use a comprehensive classification system to help put things in their proper context in the cloud ecosystem.  

Ontologies offer deeper context than taxonomies.  Where a taxonomy provides a classification system, a way of distinguishing and categorizing domain members, an ontology adds formal specification of relationships and interactions between members and classes that can be used to draw inferences beyond mere categorization.  Taxonomies often are expressed as trees or tables because they are, in essence, a sorting of members into smaller and smaller groups by successive application of more and more specialized criteria.  Ontologies may be expressed as more generally connected graphs or even in one of several formal specification languages.  For instance, the World Wide Web Consortium’s Semantic Web project has defined the web ontology language, OWL, as part of an ambitious effort to enable computers to use and “understand” (i.e., reason about) the web.  A true cloud ontology could enhance interoperability by specifying roles, relationships, and interactions between cloud domain members so completely and formally as to constitute (or at least facilitate creation of) APIs.  Cool, but much more difficult to achieve than a sort into taxonomic groups.

We already have one generally-accepted taxonomy in cloud-space, the SaaS/PaaS/IaaS, or “SPI” taxonomy.  It’s simple and clear, but it’s also informal and fails to account for many dimensions and elements of cloud-dom, including such “nuances” as delineating private or internal clouds from the external variety, providing places for components like service “governors”, and criteria for sorting into sub-categories of I, P, and S.  In fact, there are a lot more “aaS” categories out there.  David Linthicum convincingly lists 10 here (wisely, he doesn’t use the words “taxonomy” or “ontology”); his aaS “framework” includes storage, database, information, process, application, platform, integration, security, management, and testing — which William Vambenepe reorders and wickedly labels “SADIST-PIMP”.  Of course, many of these unmapped aspects of the cloud domain are also controversial and/or rapidly-evolving.  That makes figuring out whether they fit in a general cloud taxonomy – and if so, how — all the more important. 

So, do the new proposed candidates for a cloud taxonomy/ontology help?  Sadly, not really.

The Youseff/Butrico/Da Silva paper, somewhat over-titled “Toward a Unified Ontology of Cloud Computing”, with authors from UCSB and IBM, falls far short of SADIST-PIMP as an enlightening taxonomy, much less blazing a trail toward an applicable ontology.  In addition to the subdivision of IaaS into “computation resources” (itself recursively labeled IaaS), “storage”, and “communications” (aren’t storage and communications also infrastructure?), the proposal adds two somewhat discordant layers to the standard SPI taxonomy: “kernel” and “HaaS” (hardware as a service).  Kernel refers to any and all software management of the underlying hardware, including hypervisors and operating systems, but the authors focus most on grid middleware, like Globus, as representative of this layer.  HaaS is epitomized, according to the authors, by hardware leases containing SLA terms, but the reference (a CNET article written from IBM’s press release) describes the complete outsourcing of Morgan Stanley’s IT to IBM.  The article calls it utility computing, and it may be (Morgan Stanley’s apps and infrastructure were all moving to centralized IBM data centers), but it doesn’t look much like HaaS.  Overall, the paper offers little new and useful classification structure and the embellishments actually detract from the clarity of SPI.

Chris Hoff, inspired by SADIST-PIMP and the Youseff/Butrico/Da Silva paper as reported by John Willis, bravely hosted something of a community effort at a “cloud taxonomy and ontology”, seeded by his own “mashup” of the predecessor material but withholding most explanations that might have clarified some otherwise very professional-looking illustrations.  Like the Youseff/Butrico/Da Silva paper, the net result is something of an embellishment of SPI, but perhaps a bit more useful as it also contains a representation of a cloud delivery stack (though not necessarily a correct or complete one).  Again, despite the title, it doesn’t really constitute a system of classification, so it’s hard to claim it’s a useful taxonomy (beyond the embedded SPI taxonomy aspect), and (probably inherently, as it is only an illustration and not a specification) it is not an ontology.  I might find more constructive criticism to offer, but the dearth of description and discussion of what it really means (beyond the blog’s comments, which were apparently truncated by TypePad) makes the diagram something of a Rorschach test.  Anyone discussing it may be revealing more about themselves than about what the concepts suggested by the diagram might actually mean.

I can’t blame the authors for not producing more useful tools.  If he were alive, I’m certain Douglas Adams would describe the width and breadth of cloud computing as “big; vastly, hugely, mind-bogglingly big”.  Like real clouds (the water-based atmospheric phenomena), it’s not a static thing, and there is tremendous variety as well as lots of confusing things that appear cloud-like, but arguably may just be smoke.  I think a genuinely useful cloud taxonomy gets built by first enumerating the fundamental classification principles and defining distinguishing attributes (something I’ll take a whack at in a future post; in the meantime, what do you think that list might include?), then by parsing the membership of the cloud domain, adjusting and adding principles and attributes as necessary as we learn and the cloud evolves.  Taxonomy isn’t a static thing, either.  Ontology, in my opinion, will just have to wait.  The general, formal specification of the many manifestations of clouds and their components (and, by association, definition of generic cloud APIs) is premature.  Things need to condense more before ontology will be a productive exercise.

I do think some blame (a mild chastisement) is owed to anyone participating in the cloud taxonomy conversation who is not exercising appropriately high levels of skepticism and insisting on well-defined and valid standards in their frameworks.  Taxonomies are thought-shaping tools, and bad tools make for bad thinking.  One commenter on one of the many blogs echoing/amplifying the taxonomy conversation remarked that some of the diagrams were mere “marketecture”, and others warned against special interests warping the framework to suit their own ends.  We should all be such critical thinkers.

The Cloudology Manifesto

I’m a skeptic about cloud computing (if you’re new to cloud computing, check out Wikipedia’s pretty-good definition — watch out for occasional gopher holes in the rest of the article, but hey, it’s Wikipedia — and sample some of the many fine cloud-related blogs).  In fact, I try to be a skeptic about most things, most of the time, though (like most well-intentioned people) sometimes I fail.  

Defining what I mean by “skeptic” can be as subtle as the Free Software Foundation’s definition of “free” (“free” as in “free speech”, not “free beer”).  I’m not a pessimist (certainly not about cloud computing), nor do I have anything against free beer.  The definition of skeptic to which I try to adhere is not “one who disbelieves”, but “one who applies rigorous principles of critical thinking” (sorry, not nearly as catchy as “free speech/not free beer”, but then I’m not Richard Stallman).

The rigorous principles of critical thinking are what is applied when following the scientific method, when carefully avoiding logical fallacies, when consuming media without being credulous, when mistrusting one’s own senses, emotions, and reactions, when keeping an open mind (even about one’s publicly-stated positions), and when divorcing one’s ego from one’s arguments and having the guts to reverse an opinion in the face of valid contrary evidence.  Critical thinking is being critical about one’s own thinking, which helps one examine the external world more accurately.  As my Communications 101 professor stated, it’s about having a sharpened “bullshit detector”, but it’s also about pointing it at yourself and even at the positions you hold dear.

I suck at critical thinking.  Almost everyone does.  Human brains are analog computers, much as we like to describe their functions today using terms from digital computing (a friend once ruefully described herself as “a 16-bit processor in a 32-bit world”, which, of course, also dates the exchange).  Analog computers suck at precise computation, but they can be absolutely fabulous at pattern recognition and classification, generalization, and approximating solutions to algorithmically-intractable problems.  Those strengths can also be fatal weaknesses.  The human brain is the result of billions of years of evolution, the most-fit survivor of competitive environments that bear little resemblance to today’s technologically enhanced society.  Our wetware was shaped by the need to find food, reproduce, and physically out-compete our predators and neighbors, not drive cars, program VCRs, or design and operate efficient IT-based applications and services infrastructure.

Worse, our poor analog computers, constantly struggling to analyze, sort, compress and re-compress, and imperfectly store a few representative bits out of a constant gigabytes-per-second stream of real-time sensory data from billions of analog input sensors, are swimming in a polluted soup of emotion-triggering hormones and self-generated mood-altering chemical messengers perpetually interfering with whatever approximations of logical high-order thought we manage to muster.  It’s a wonder we can think at all, and no wonder so much of human society is so goofy.  Pay no attention to that man behind the curtain, he’s having a continuous nervous breakdown.

Being so poorly equipped for logical thought, it’s amazing how well we’ve done as a species, augmenting our innate capabilities with language, mathematics, science, and technology.  It’s also not surprising, however, that even a field as fundamentally based on logic as computing (what is more irreducibly logical than the binary codes and Boolean logic at the heart of our digital world?) should also be subject to the same excesses of enthusiasm, hyperbole, and unexamined certitude we find in the rest of human society.  Even mathematicians can be zealots; why should IT be immune?

Which brings me back to cloud computing.  If you’ve been paying more than cursory attention recently, I’ll bet you’ve seen at least one breathless headline, press release, or blog entry that has raised your eyebrow or caused your own bullshit meter to bounce, even just a little.  I, myself, am of the opinion that cloud computing both is the greatest thing for IT since diced silicon and (too often) looks like an overinflated volume of insubstantial vapor.  Cloud computing is new, and that’s sparked something of a land rush to stake out market- and mind-share.  On the other hand, cloud computing encompasses many not-new predecessor concepts, like utility and grid computing, SOA, and Web 2.0, and there is an on-going struggle to work out just how and if it all fits together.  There are True Believers claiming practically everything is cloud computing (buy your toilet paper from Amazon?  That’s TPaaS!).  At the other end of the spectrum, there are those who say it’s all hype, all vapor (clouds are both a compelling and, sometimes, unfortunate metaphor).  Both are probably wrong, at least to a degree, and, as in judging Olympic skating, we would be wise to be suspicious of both excessively high and ridiculously low scores.

I work for a company that’s arguably been working on cloud infrastructure management software for over 5 years (though we didn’t realize that’s what it was until recently :^) and I’m doing everything I can to help propagate the cloud computing wave, yet at the same time, in conversations with customers and others, I often struggle to correct misconceptions and deflate overhyped expectations.  The efficiency, agility, and accountability that cloud computing can bring to an organization are incredibly valuable; it’s not just about saving money, it’s about growing the top line.  The potential pain, expense, and opportunity cost of a misguided or inept cloud computing adoption attempt are almost unbounded, and even doing nothing could cost you your business in the end as you are outcompeted by successful cloud adopters, efficiently running their companies on lower cost, tighter turning, higher capacity IT infrastructures.

My goal with this blog is to exercise good skepticism, examining cloud computing with a thoughtful yet questioning eye as the wave continues to build and ultimately sweeps over IT (I almost said “crashes on the beach of IT”, but “sweeps” sounds more survivable, doesn’t it?).  I do have a particular PoV that’s been shaped by my history, but I promise to try to watch my own processing as closely as I examine others’.  While I am in the industry, I won’t be an unquestioning fanboy, but I won’t be a John C. Dvorak, either.  I hope you enjoy what I hope will be a stimulating conversation as we practice “cloudology” together.