We’ve reached the last couple of sessions at Velocity 2008. Read me! Love me!

We hear about Capacity Planning with John Allspaw of Flickr. He says: no benchmarks! Use real production data. (How? We had to develop a program called WebReplay to do this because no one had anything. We’re open-sourcing it soon, stay tuned.)
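If you want the flavor of this before WebReplay shows up, here’s a toy version in Python – read your production access log and fire the GETs at a test box. The target host name and the Apache combined log format are my assumptions, not anything from the talk:

```python
"""Toy log replayer -- NOT WebReplay, just the idea: drive a test box
with real production traffic instead of a synthetic benchmark."""
import sys
import urllib.request

TARGET = "http://test-web01"  # hypothetical test host

def replay(logfile):
    for line in open(logfile):
        # Apache combined log: ... "GET /photos/123 HTTP/1.0" ...
        try:
            request = line.split('"')[1]      # 'GET /photos/123 HTTP/1.0'
            method, path, _ = request.split()
        except (IndexError, ValueError):
            continue                          # skip malformed lines
        if method != "GET":
            continue                          # replaying writes is dangerous
        try:
            resp = urllib.request.urlopen(TARGET + path, timeout=5)
            print(resp.status, path)
        except Exception as e:
            print("ERR", path, e)

if __name__ == "__main__":
    replay(sys.argv[1])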

Use “safety factors” (from traditional engineering) – aka a reserve, overhead, etc. (See the watermark sketch a few lines down for how that math plays out.)

They use squid a bunch. At NI we’ve been looking at Oracle’s WebCache – mainly because it supports ESI (Edge Side Includes) and we’re thinking that may be a good way to go. There’s a half-assed ESI plugin for squid, but we hear it doesn’t work; apparently Zope paid for ESI support in squid 3.0, but as best we can tell there’s been no traction on that in four years. But I’d be happy not to spend the money.

Anyway, you should do forecasting. Naive forecasting assumes growth is linear, which it never is. But you can take output from your monitoring (ganglia, whatever) and fityk can give you a curve fit.
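For the curve-fit-averse, here’s roughly what that looks like in Python with numpy instead of fityk – fit peak daily load (pulled from ganglia or wherever) and extrapolate to your ceiling. All the numbers are made up:

```python
"""Rough capacity forecast: fit a curve to peak daily load and see when
it crosses your ceiling. fityk does this interactively; numpy.polyfit
is the quick-and-dirty version."""
import numpy as np

days = np.array([0, 7, 14, 21, 28, 35])              # sample dates
peak_rps = np.array([410, 445, 495, 560, 640, 735])  # made-up peak req/s
CEILING = 1200.0                                     # measured max for the tier

coeffs = np.polyfit(days, peak_rps, 2)  # quadratic, since it's never linear
trend = np.poly1d(coeffs)

day = int(days[-1])
while trend(day) < CEILING and day < 365:
    day += 1
print("Projected to hit ceiling around day", day, "->", int(trend(day)), "req/s")
```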

They use lots of nagios for monitoring – about 10 checks per host.

Determine a ceiling and high/low water marks, and alert when you go outside the water marks.

Then have a simple capacity dashboard for everything – how close to the ceiling you are.
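Here’s a minimal sketch tying the safety factor, the ceiling, and the water marks together – all the numbers are hypothetical:

```python
"""Minimal watermark check: the ceiling is raw capacity discounted by a
traditional-engineering safety factor, and you alert outside the marks."""

RAW_MAX = 1600.0        # req/s where the tier actually falls over
SAFETY_FACTOR = 1.33    # the reserve/overhead
CEILING = RAW_MAX / SAFETY_FACTOR   # ~1200 req/s of "usable" capacity
HIGH_WATER = 0.80       # alert above 80% of ceiling
LOW_WATER = 0.20        # alert below 20% too -- something's probably broken

def check(current_rps):
    util = current_rps / CEILING
    if util > HIGH_WATER:
        return "ALERT: %.0f%% of ceiling -- deploy more capacity" % (util * 100)
    if util < LOW_WATER:
        return "ALERT: %.0f%% of ceiling -- traffic fell off a cliff?" % (util * 100)
    return "OK: %.0f%% of ceiling" % (util * 100)

print(check(1050))   # above high water -> alert
```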

Horizontal scaling is all well and good, but sometimes you should do some vertical scaling by upgrading – he calls it “diagonal.” By upgrading their image processing servers, they saw the same CPU usage but got 3x the work out of them. (We saw the same when we upgraded our Java app servers from Sun V440s to Dell 2850s a year or two ago – a 50% performance improvement.) In their case, they also got faster processing time, less power usage, and less rack space.

Memcached. You turn it on, and the DBs go idle! Yay. But then your Web servers heat up as they become the bottleneck. So beware the wandering bottleneck.
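The pattern that idles the DBs is plain cache-aside; here’s a minimal sketch assuming the python-memcached client, with a stand-in for the real DB query:

```python
"""Classic cache-aside with memcached -- this is what idles the DBs.
Assumes the python-memcached client; the DB lookup is a stub."""
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def get_user_from_db(user_id):
    # stand-in for the real (expensive) DB query
    return {"id": user_id, "name": "user%d" % user_id}

def get_user(user_id):
    key = "user:%d" % user_id
    user = mc.get(key)                      # cheap hit path
    if user is None:
        user = get_user_from_db(user_id)    # expensive miss path
        mc.set(key, user, time=300)         # cache for 5 minutes
    return user
```

Note that the serialize/deserialize work on every hit is exactly the kind of thing that heats up the Web tier once the DB stops being the bottleneck.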

Stupid Capacity Tricks! Before Puppet and Capistrano there was dsh (distributed shell). Ooo, I want it.
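dsh does this for real; the shape of it in Python is just “run one command on every host, in parallel, over ssh” (host list is hypothetical):

```python
"""The dsh idea in a few lines: one command across a host list in
parallel over ssh. dsh itself does this better; this is just the shape."""
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = ["web01", "web02", "web03"]   # hypothetical host list

def run(host, cmd):
    r = subprocess.run(["ssh", "-o", "BatchMode=yes", host, cmd],
                       capture_output=True, text=True, timeout=30)
    return host, r.returncode, r.stdout.strip()

with ThreadPoolExecutor(max_workers=10) as pool:
    for host, rc, out in pool.map(lambda h: run(h, "uptime"), HOSTS):
        print(host, rc, out)
```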

Shut Shit Off – they have software switches to disable various features when needed. (We have a lot of those switches at NI, but they’re not documented and they’re under the control of business units, not ops – sad.) Their programmers are good about this: they put flags in config files, in order of importance, to turn things on and off, and the flags are read on the fly.
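A minimal sketch of that kind of switch – a flat flags file, most important flags first, re-read on every check so ops can flip things mid-incident. The file format and flag names are invented:

```python
"""Feature switches read on the fly: re-reading the file on every check
is the point -- flip a line, and the site changes behavior immediately."""

FLAGS_FILE = "/etc/myapp/flags.conf"   # hypothetical; lines like "search=on"

def flag(name, default="on"):
    try:
        for line in open(FLAGS_FILE):
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                if key.strip() == name:
                    return value.strip() == "on"
    except IOError:
        pass                    # no flags file: fall back to defaults
    return default == "on"

if flag("related_photos"):
    print("render the expensive feature")
else:
    print("feature is shut off")
```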

Host an outage page NOT in your datacenter, and use it – users appreciate knowing what’s up.

Bake dynamic pages into static ones. Some Yahoo! properties have a big red button to bake/unbake at will. Bye-bye, DDoS attacks.
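A sketch of the “bake” half of that button: snapshot a set of dynamic URLs into flat files the web server can serve with zero backend work. The paths and host are hypothetical, and flipping the server over to the baked docroot is the other half:

```python
"""Bake dynamic pages to static files: fetch each page once from the app
and write it to a docroot the web server can serve directly."""
import os
import urllib.request

APP = "http://localhost:8080"          # the dynamic app (hypothetical)
BAKED_DIR = "/var/www/baked"           # docroot the web server flips to
PAGES = ["/", "/popular", "/about"]    # what to freeze

def bake():
    os.makedirs(BAKED_DIR, exist_ok=True)
    for path in PAGES:
        html = urllib.request.urlopen(APP + path, timeout=10).read()
        name = path.strip("/") or "index"
        with open(os.path.join(BAKED_DIR, name + ".html"), "wb") as f:
            f.write(html)
    # pointing the web server at BAKED_DIR is the "big red button" part

bake()
```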

And at the end, a plaintive “We’re Hiring…” Like everyone else here. Man, I need some good Web ops people – I have two open spots. We’re hiring too!!!

Question: you do lots of mini code pushes (20/day) – how the heck do you manage that and keep the site up? He says culture is the biggest thing: they have devs who think like ops and don’t do dumb things. They’re ganglia-addicted, and they’re the ones hitting the big red buttons. The technical parts – one-button deploys, verbose logging of changes – matter, but less.

He uses more dirty words than I do. Boss.

Artur Bergman of Wikia speaks again, on Squid vs Varnish.

PHP is a pig and wikitext is hard to parse, so they need caching. A hit is 8 ms and a miss is 200 ms, and they have a 75% hit rate – so the average request runs about 0.75×8 + 0.25×200 ≈ 56 ms. You have to get the cache hits up by making more of the site cacheable. Ooo, he says they’re playing around with ESI!

They decided to force caching for anonymous users. They’ve only gone up to 30 seconds, but no complaints. They ignore If-Modified-Since and purge. Be careful with Vary: Accept-Encoding, because there’s an annoying browser bug involving misplaced commas.
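Forcing short caching for anonymous users might look like this as WSGI middleware – no session cookie means the response gets a 30-second public Cache-Control so the cache tier can absorb the anonymous horde. The cookie name is my assumption:

```python
"""Short forced caching for anonymous users, as WSGI middleware."""

class AnonCacheMiddleware(object):
    def __init__(self, app, max_age=30):
        self.app = app
        self.max_age = max_age

    def __call__(self, environ, start_response):
        # "session_id" cookie name is hypothetical
        anonymous = "session_id" not in environ.get("HTTP_COOKIE", "")

        def sr(status, headers, exc_info=None):
            headers = [(k, v) for k, v in headers
                       if k.lower() != "cache-control"]
            if anonymous:
                headers.append(("Cache-Control",
                                "public, max-age=%d" % self.max_age))
            else:
                headers.append(("Cache-Control", "private, no-cache"))
            return start_response(status, headers, exc_info)

        return self.app(environ, sr)

# usage: application = AnonCacheMiddleware(application)
```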

MediaWiki lives and dies by squid and puts cache control in the code, which is bad because developers are stupid.

Squid – the slide actually says “Me hates it” and “Still a piece of shit.” Awesome.

Varnish – he loves varnish. He nearly cried when he read the source code (it’s C). But it’s a little unstable. He got it up to 65k hits per second (with squid doing 2,800).

Varnish has some “novel” techniques. Its configuration language, VCL, gets compiled down to C at runtime – so you can put assembly in if you want. Lawdy. (Side note: they monitor with LVS.) It segfaults from time to time under load and they’re helping fix it. In a month or two he’ll have it crackin’!

And the last one – Puppet, by Luke Kanies, Puppet developer.

Automation tools are old and bad, especially because they’re SSH-based. (Agree!) And also because there aren’t many people who cross the chasm between sysadmin and developer. They decided they had to solve the problem and create something a billion times better than anything (where “anything” is cfengine). Either you can manage many machines with little effort or you can’t, and you want to be able to. So this required abstraction. He’s using the analogy of C scaring the bejeezus out of assembly programmers – a good analogy.

It’s sad you have to do it, but he goes into why a more powerful tool should not scare people and put them all out of work. Developers seem to have gotten over this, but not sysadmins. It’s stupid, especially because “we’re understaffed” is the #1 thing I hear out of all of ours.

So they implemented Puppet with the metaphor of resources and resource providers, hiding all the file/command/UNIX admin stuff. (Well, kinda.) It’s easily extensible.

The Web 2.0 crowd has made “microformats”; your infrastructure can use that idea too. Catch up with the times – if you’re proud of doing something developers have been doing for ten years (like moving to version control in Subversion), then you’re behind. (Use git!) Anyway, you have to use polymorphism (overloading) to make a system like Puppet understand ssh on system 1 vs. system 2 vs. system 3.
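The resource/provider split he’s describing looks roughly like this – one logical Service resource, with per-platform providers that know the local incantation. This is the idea, not Puppet’s actual code:

```python
"""Resource/provider polymorphism, sketched: the same logical resource
resolves to platform-specific commands, like Puppet's providers do."""
import platform
import subprocess

class ServiceProvider(object):
    def restart_cmd(self, name):
        raise NotImplementedError

class RedHatProvider(ServiceProvider):
    def restart_cmd(self, name):
        return ["service", name, "restart"]

class DebianProvider(ServiceProvider):
    def restart_cmd(self, name):
        return ["invoke-rc.d", name, "restart"]

class SolarisProvider(ServiceProvider):
    def restart_cmd(self, name):
        return ["svcadm", "restart", name]

def pick_provider():
    # crude platform sniffing stands in for Puppet's facts
    if platform.system() == "SunOS":
        return SolarisProvider()
    return RedHatProvider()   # real logic would inspect the distro

class Service(object):
    def __init__(self, name, provider=None):
        self.name = name
        self.provider = provider or pick_provider()
    def restart(self):
        subprocess.call(self.provider.restart_cmd(self.name))

Service("sshd").restart()   # same resource declaration on every platform
```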

Also, have one solution per problem. Not multiple. And most of the problems you face are NOT unique to you or your organization – so using a common tool like this can benefit from the network effect.

And the third big principle (were there only two before?) is completeness. Everything that matters in your config should be in the config, not some minimal set. Relationships (dependencies) are important. You can, for example, have a service subscribe to a file and restart when the file changes.
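The subscribe relationship in miniature – a config file is checksummed, and a content change triggers a service restart on the next run. Paths and commands are hypothetical:

```python
"""A service 'subscribed' to a file: restart when the checksum changes."""
import hashlib
import subprocess

class WatchedFile(object):
    def __init__(self, path):
        self.path = path
        self.last = None

    def changed(self):
        digest = hashlib.md5(open(self.path, "rb").read()).hexdigest()
        dirty = self.last is not None and digest != self.last
        self.last = digest
        return dirty

httpd_conf = WatchedFile("/etc/httpd/conf/httpd.conf")

def converge():
    if httpd_conf.changed():
        # the subscribing service restarts when its file's content moves
        subprocess.call(["service", "httpd", "restart"])
```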

Puppet is mainly used as a central config management tool. Each host gets a resource catalog: machines get put into classes, and the classes get lists of resources.

Puppet clients retrieve their resource catalog, determine ordering, check each resource, fix what’s out of sync, and repeat every 30 minutes. “Like cfengine but sexier!” The completeness approach means clean management through the whole lifecycle – a freshly kickstarted box doesn’t end up different. You just kickstart enough to run Puppet and use it to do everything, so all boxes are kept 100% up to date without artifacts.
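The agent loop, sketched: fetch the catalog, order resources by their dependencies, fix whatever’s out of sync, sleep 30 minutes, repeat. Everything here is illustrative, not Puppet internals:

```python
"""A config-management convergence loop in miniature."""
import time

class Resource(object):
    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)
    def in_sync(self):
        return True            # real check: file content, package version...
    def fix(self):
        print("fixing", self.name)

def ordered(catalog):
    # simple topological sort over 'requires' edges
    done, out = set(), []
    def visit(r):
        if r.name in done:
            return
        done.add(r.name)
        for dep in r.requires:
            visit(dep)
        out.append(r)
    for r in catalog:
        visit(r)
    return out

def agent(fetch_catalog, interval=1800):
    while True:
        for resource in ordered(fetch_catalog()):
            if not resource.in_sync():
                resource.fix()
        time.sleep(interval)   # every 30 minutes, like cfengine but sexier
```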

And it has reporting underway too! They’re planning to charge for that to make some mooonay! Google, Stanford, Sony, Rackspace all use Puppet.

Why Puppet vs. Capistrano? Cap is SSH in Ruby – not something for your whole infrastructure.

Why Puppet vs. cfengine? A more open dev community, and it’s just better.

What about Puppet slowness? It scales like HTTPS.

Puppet is XMLRPC but moving to REST. It uses certs and SSL, not keypairs. It’s written in Ruby; he’s had to learn to be a developer in the process. It’s also an API to your systems. It supports VMs well and can get into the guts of the VMs, unlike pure VM provisioning tools. Buy me! It’s open source, but he sells support/training/addons. Discovery is to come! There’s Nagios integration of some sort. Vertebra, like Capistrano, is an ad hoc change tool – Puppet isn’t (though you can use ralsh for that).

That’s the last session – wrapup later once I power up my laptop and get some booze in me!