Just two more keynotes till lunch, but these are larger ones (the previous speakers were 15 minutes apiece; these are 45).  I’ll try to take good notes; every conference always says they’re going to make all the slides available afterwards but at best they usually get a 50% success rate on that.

First, Luiz Barroso from Google speaks on energy efficient operations. Server usage is only about 1% of total electricity consumption, but it doubled between 2000 and 2005. Measuring computing energy efficiency is harder than measuring a refrigerator or the like. In physics terms, efficiency is defined as work done/energy used. Efficiency for IT can be broken down into computing efficiency (work done/chip energy), server efficiency (chip energy/server energy) and server room efficiency (server energy/server room energy). Surveys show an average PUE (1/server room efficiency) of 1.83, and power supplies uselessly dissipate 25% of the power going to servers (more in PCs). Servers also have poor (computing) energy efficiency in their most common usage range.
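To make that breakdown concrete, here is a rough sketch in Python (my own, not from the talk). The three ratios multiply together into work done per unit of server room energy. The function names are made up, and treating the 25% power supply loss as the *only* loss inside the server is an assumption, so the result is an upper bound at best.

```python
def server_room_efficiency_from_pue(pue):
    """PUE is 1 / server room efficiency, so invert it."""
    return 1.0 / pue


def overall_efficiency(computing, server, server_room):
    """(work/chip energy) * (chip energy/server energy) * (server energy/room energy).

    The chip and server energy terms cancel, leaving work done per unit
    of server room energy.
    """
    return computing * server * server_room


# Figures quoted in the talk.
room_eff = server_room_efficiency_from_pue(1.83)

# ASSUMPTION: the power supply is the only loss inside the server, so at
# most 75% of server energy reaches the components. Real servers lose more.
server_eff_upper_bound = 1.0 - 0.25

# Fraction of server room energy that gets past the facility overhead and
# the power supply (computing efficiency set to 1.0 to leave it out).
fraction_delivered = overall_efficiency(1.0, server_eff_upper_bound, room_eff)

print(f"server room efficiency: {room_eff:.3f}")
print(f"delivered past the PSU (upper bound): {fraction_delivered:.3f}")
```

Under those assumptions, well under half of the energy entering the building makes it past the power supply, before the chips do any work at all.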

How do we address this?  First, the power provisioning problem in the data center. Energy isn’t the largest cost: building the center itself takes $10-$22 per watt, while 10 years of power is $9/watt. Efficiency saves on both. According to the Uptime Institute, the average cost breakdown is datacenter 28%, electricity 22%, hardware 50%. (Software dwarfs this in many shops, I’ll note.)

To provision efficiently, consolidate to the minimum number of servers (duh). Also, measure power use; don’t trust nameplates. Study trends and investigate oversubscription potential (there’s an ISCA ’07 article on this) so you can provision more tightly.

They did a six month study at Google. They used a model to measure power at the rack, PDU (500-800 machines), and cluster (5k machines) levels and characterized four different workloads over 5k servers. They wanted to find the potential of various energy saving techniques. Because of scaling, they found that the larger the group, the more room there is for oversubscription: a given rack may be at peak power 50% of the time, but a PDU only 20%, and a cluster 10%. Also, different workloads have different power consumption requirements, so mixing workloads is more efficient. So don’t handle oversubscription at the small grain (rack) level, but at the datacenter level. In other words, you don’t need to provision enough power to run everything at peak usage at once; provision less. Profile app power usage and mix workloads, and manage the risk of overload by having some lower-priority “victim” workload you can throttle.
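Here is a toy illustration of why bigger groups leave more headroom: machines rarely peak at the same moment, so the peak of the summed load is lower than the sum of the individual peaks. This is a sketch of the general idea, not Google’s model; the function name and the wattage traces are invented for the example.

```python
def oversubscription_headroom(traces):
    """Fraction of provisioned power that is never used at the group level.

    `traces` is a list of per-machine power readings (watts), all sampled
    at the same instants. Provisioning for the sum of individual peaks
    assumes every machine peaks at once; the aggregate peak is what the
    group actually drew.
    """
    sum_of_peaks = sum(max(trace) for trace in traces)
    aggregate = [sum(samples) for samples in zip(*traces)]
    peak_of_sum = max(aggregate)
    return 1.0 - peak_of_sum / sum_of_peaks


# Made-up traces: three machines that each peak at a different time.
traces = [
    [100, 200, 100],
    [200, 100, 100],
    [100, 100, 200],
]

# Sum of peaks is 600 W, but the group never draws more than 400 W.
print(f"headroom: {oversubscription_headroom(traces):.2%}")
```

A single machine has zero headroom by this measure, which is the small-grain case: the more machines (and the more varied the workloads) in the group, the less likely the peaks line up.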

This is of interest to us at NI; we’re even now building out more data center space in our HQ and will be building one in our new third manufacturing site.

Now, he switches to talking about “energy-proportional computing.” Servers aren’t often truly idle in real deployments. High performance and high availability require load balancing and wide data distribution, which means no “idle,” but lots of “low activity.” And you have to overprovision; you can’t target 90% utilization on the Web. They created GFS to distribute data, and it is replica based. Reads are load balanced but writes have to go to all replicas. So “sleep” or “power down” functionality is not really useful for servers. Don’t focus on efficiency at peak, because you’re seldom at peak. Power efficiency is generally worse when a server is underutilized. There’s an interesting new SPECpower benchmark, and it shows performance to power ratios dropping sharply with lowered target load.

So Luiz wants machines that scale power use linearly! Basically, current server power usage scales less than linearly with workload. So at low workload, it’s still using a buttload of power. Of the components (CPU, RAM, disk, other) the CPUs are actually doing OK at scaling. But that means that CPU power schemes (DVS) are hitting diminishing returns. Idle CPUs consume less than 30% of their peak power, but RAM consumes 50%, disks 75%, and networking 85%. Energy proportionality would save them lots (it doesn’t affect peak).  Now there’s nothing *you* can do about proportionality, unless you’re making computers.  But you can harass your suppliers.  (Easy to say if you’re someone like Google; for most of us when we talk to our suppliers about stuff like this they just chuckle and give us a swirlie.)
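A quick sketch of what those idle numbers mean, in Python. The idle fractions are the ones quoted in the talk; the straight-line ramp from idle power to peak power is my own simplifying assumption (real components don’t behave that neatly), and the function names are made up.

```python
def relative_power(idle_fraction, utilization):
    """Power draw as a fraction of peak.

    ASSUMPTION: power ramps linearly from the idle floor up to peak.
    """
    return idle_fraction + (1.0 - idle_fraction) * utilization


def relative_efficiency(idle_fraction, utilization):
    """Work per watt relative to running flat out (1.0 = as good as at peak)."""
    if utilization == 0:
        return 0.0  # no work done, whatever the power draw
    return utilization / relative_power(idle_fraction, utilization)


# Idle power as a fraction of peak, per the talk (CPU is "less than 30%",
# so 0.30 is the pessimistic end). 0.0 is the energy-proportional ideal.
idle_fractions = {
    "proportional ideal": 0.00,
    "CPU": 0.30,
    "RAM": 0.50,
    "disk": 0.75,
    "networking": 0.85,
}

utilization = 0.3  # hypothetical "low activity" load level
for name, idle in idle_fractions.items():
    eff = relative_efficiency(idle, utilization)
    print(f"{name:>18}: {eff:.0%} of peak efficiency at {utilization:.0%} load")
```

The ideal machine stays at 100% of its peak efficiency at any load; the higher the idle floor, the more of the power at low load is spent doing nothing.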

In conclusion: write fast code! All the infrastructure work in the world can have about a 50% effect, but software engineering impact is almost without bound. (I actually drill this point into our new programmers in new hire training.)  Consider reduction of all energy-related costs and bug suppliers about proportionality. And join up at climatesaverscomputing.org!

The second big keynote is by Javier Soltero of Hyperic. Cloud computing is nice, but you’re not going to move over to it 100%. And clouds add complexity, like any abstraction. So you are faced with questions – is the problem my app, or is it the cloud? If you can’t get the visibility, you can’t trust it. Hyperic started to try to solve this with HypericHQ. They put up http://www.cloudstatus.com/, a status of the AWS cloud, and will be adding clouds as they go. You can go see metrics like EC2 instance deployment latency (about a minute on average, for the record). So the site is kinda like Keynote for the cloud. Spiffy enough. Not too much more to write about it though.

A humorous note: their bank cut them off because their payment system monitoring was transferring a penny back and forth between accounts every minute. Synthetic monitoring can often have “unintended side effects” of this sort.  Caveat monitor!

And now – lunch.  More to come!