I’m kicking off the first year of Velocity, the new O’Reilly-sponsored Web Performance and Operations Conference, by watching robots dance to Beck on a video screen. The conference’s tagline is “fast, scalable, resilient, available,” which is just about identical to our Web Systems team’s charter. (And our reputation with the ladies!)

For a long time we’ve had to bottom-feed off of developer conferences, general-interest conferences, and the like to address Web site operational issues, so it’s great to see a conference specifically targeted at this growing area. The conference staff noted that demand was way above what they expected, and they were scurrying about to make sure they had enough materials. By rough headcount in the first keynote I’d estimate 400 attendees, with more arriving over time as West Coast standard wakeup time (10 AM, for the record) rolls around.

Steve Souders and Jesse Robbins, the conference chairs, kick us off with a brief pep talk and quickly introduce the first speaker – Bill Coleman, the “B” in BEA and the man who led Solaris development at Sun, who is talking about green data centers. His talk starts off with the huge and increasing complexity of computer systems. Then a semi-pointless digression into “Web 2.0!” Then he talks about getting to a true “cloud” or “dial tone” computing model where resources are there when needed but aren’t burning power (and money) when they’re not – dynamic provisioning and powering down based on utilization. On the mainframe, logical partitions (LPARs) and their policy-based workload management let you juggle different-priority jobs and run at a high, set utilization; we need the same thing in the distributed world. And… he’s done. Hrm, I was hoping for some “here’s the solution.”
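To make the dynamic-provisioning idea a bit more concrete, here’s a minimal sketch of a utilization-driven policy loop that powers capacity up and down against a target utilization – entirely my own toy, not anything Coleman showed:

```typescript
// My own toy sketch of utilization-driven provisioning -- not from the talk.
// Hold the powered-on pool at a high target utilization and power nodes up or
// down as demand moves, with some headroom as hysteresis to avoid flapping.
interface ServerNode { id: string; powered: boolean; }

const TARGET_UTIL = 0.7;
const HEADROOM = 0.1;

function reconcile(nodes: ServerNode[], currentUtil: number): void {
  const powered = nodes.filter(n => n.powered);
  const spare = nodes.filter(n => !n.powered);

  if (currentUtil > TARGET_UTIL + HEADROOM && spare.length > 0) {
    spare[0].powered = true;                      // demand up: bring a node online
  } else if (currentUtil < TARGET_UTIL - HEADROOM && powered.length > 1) {
    powered[powered.length - 1].powered = false;  // demand down: power one off
  }
}
```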

Next, two Keynote guys talk about their new tool, KITE 2.0. We’ve used KITE 1.0, being Keynote customers; it’s a nice Web page performance analysis tool. I know there are a bunch of those, but the huge benefit for us is that our performance SLAs are defined by Keynote monitors, and KITE uses the exact same technology and can upload its scripts to Keynote – so as a tool to hand out to random internal programmers and designers, it’s perfect. KITE 1.0 was very polished, and in 2.0 they’re adding some fun things like free instant tests from 5 global cities and bursting (running the same test back to back many times). Basically, in KITE you record a transaction, play it back (via IE integration), and it gives you a lovely waterfall for each page; you can save the recording as a script to replay later. KITE was Keynote-customer-only, but 2.0 (out early August) will be free to all, which is awesome.

Following this is a pretty anticipated announcement from Scott Ruthfield of whitepages.com: an open source performance tool called “Jiffy.” They do about 500 searches per second for personal information there. They use Gomez, and their graphs show the same thing Souders’ book is about – server time is by far the smallest part of their performance; it’s all front end. Their search results page integrates mapping, targeted ads, etc. “You can’t manage what you can’t measure.” The problem with synthetic measurements is granularity – Gomez hits them a couple of times in 20 minutes, which misses 99% of their traffic. So they want something akin to real user monitoring, but they came at it from a different perspective than the existing network-centric RUM vendors: measure everything, with no page performance impact, in near real time.

Jiffy (linked at code.whitepages.com as of NOW) consists of JavaScript page tagging, Apache config to log the hits, a database schema and reporting, and a Firebug plugin for consuming the data. With Jiffy you “mark” where you want timing to start and then “measure” the elapsed time since the mark, so in that way it’s like any other page-tagging solution (WebTrends, etc.). That’s my one concern with it, however. When we implemented page tagging at NI we went through a substantial process to validate the tag logs against our server logs, and there ended up being some very large bodies of omitted data that couldn’t be attributed to any of the dozen or so “known” reasons why page tags and server logs should differ; in fact WebTrends ended up missing a large percentage of traffic. The “trends” are still there, just not all of your data. That may or may not be OK depending on how you use the data (and on what exactly is causing the gaps). More on this later.
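To show the pattern – this is a toy of my own, not the actual Jiffy API, which lives at code.whitepages.com – mark/measure page tagging boils down to something like this:

```typescript
// Toy illustration of the mark/measure page-tagging pattern; the names here
// are mine, not the real Jiffy API. A beacon request carries the timing back
// to the server so Apache can log it for the reporting database.
const marks: Record<string, number> = {};

function mark(name: string): void {
  marks[name] = Date.now();                  // remember when this point was reached
}

function measure(name: string, markName: string): void {
  const elapsed = Date.now() - marks[markName];
  new Image().src = `/jiffy.gif?m=${encodeURIComponent(name)}&t=${elapsed}`;
}

// e.g. mark("results") at the top of the search results block, then
// measure("results_render", "results") once that block finishes rendering.
```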

Next is conference-circuit fave Artur Bergman, who heads up Web operations at Wikia. He talks about the value of performance and reliability to the customer and your brand, and he has a good point about user expectations: World of Warcraft has a lot of downtime, but that’s an expectation set with its users. The guy who runs WoWWiki has a much lower tolerance for downtime! The power of setting expectations is strong.

Operations is about efficient use of resources, end user performance, and reliability. Bad operations wastes R&D money and inflates cost of sale. Why don’t we all know our cost per page, or per page view? Isn’t your margin based on that? How do we make sound business decisions about operations without it?
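The cost-per-page-view math itself is trivial – the hard part is knowing your inputs. With numbers that are entirely made up:

```typescript
// Back-of-the-envelope cost per page view; every figure here is made up.
const monthlyOpsCost = 150_000;         // servers, bandwidth, ops salaries ($/month)
const monthlyPageViews = 300_000_000;   // page views per month

const costPerPageView = monthlyOpsCost / monthlyPageViews;  // $0.0005
const costPerThousand = costPerPageView * 1000;             // $0.50 per thousand views

console.log(`~$${costPerThousand.toFixed(2)} per thousand page views`);
```

Fold that into your margin per page view and operations stops being a black box.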

Wikia [NB: changed from “Wikipedia” per comment below] was having performance problems and spawned a project to address them. The ad networks were a big problem (“Ad networks suck! You should be ashamed!” <crowd applauds>). They fixed this by overloading document.write, and discovered that a good percentage of the time the ads either timed out or the user left before they ever arrived. He said some other things about the performance case, but none of us could make them out. In closing: keep it simple and loosely coupled.
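For the curious, the document.write trick looks roughly like this – my reconstruction from his description, not Wikia’s actual code – wrap the call so you can see when (or whether) the ad network ever writes its markup into the page:

```typescript
// Rough reconstruction of the document.write-overloading idea from the talk;
// Wikia's real implementation surely differs. Capturing the call lets you log
// how late each ad network writes its markup -- or notice that it never did.
const originalWrite = document.write.bind(document);
const pageStart = Date.now();

document.write = (...markup: string[]): void => {
  const elapsed = Date.now() - pageStart;
  new Image().src = `/adtiming.gif?t=${elapsed}`;  // beacon when the ad finally showed up
  originalWrite(...markup);                        // then let the markup through as before
};
```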

Another funny interstitial video from the Richter Scales. It reminds me that I need to update the funny videos in my new hire training!

Now John Fowler, EVP of systems for Sun, talks about infrastructure driving innovation. He cites horizontal scaling, universal communication, and openness as pervasive trends. They’re working on a “Web20kit” open source package with squid/varnish, apache, mongrel/rails, glassfish, php, java, and memcached, plus storage via mysql, mogile, hadoop, or local FS… Not sure what all that adds up to. He’s moving fast.

More threads, and thus more cores, are faster and more efficient, so they see more cores and more memory as the solution to computing challenges. Also Open Storage, an OpenSolaris-based storage management solution. They have new flash “SSDs,” managed through ZFS, to replace hard drives – more reliable and faster, but more expensive. They see a new server memory/storage hierarchy emerging, a “hybrid” storage pool ranging from cache to RAM to flash to disk.

I love Sun in my heart; I’m an old-school UNIX guy, and we were on Suns at Rice when I was there at the turn of the ’90s. But I’m not convinced. We had to move off Sun to Dell for all our app servers because the Dells were just plain faster. You can have a zillion cores, but on the Web, user-perceived performance often comes down to how fast a single thread can run. Thread-safe programming is rare in IT and not that much more common among the people making the app servers, tools, etc. that our apps depend on – hell, the OAS app server we use isn’t even certified to work with 64-bit JVMs. For many slower chips to outperform fewer faster ones, you have to be able to spread the workload across them in parallel, and most people can’t do that yet. Also, many of our disk performance issues come from huge-ass database instances, and I’m not sure how this new storage solution caches those.
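The underlying math is just Amdahl’s law – if the serial part of a request dominates, piling on cores doesn’t buy you much. A quick illustration with made-up numbers:

```typescript
// Amdahl's law: with a fraction p of the work parallelizable, n cores give at
// most 1 / ((1 - p) + p / n) speedup. The numbers below are illustrative only.
function amdahlSpeedup(p: number, cores: number): number {
  return 1 / ((1 - p) + p / cores);
}

console.log(amdahlSpeedup(0.5, 32).toFixed(2));  // ~1.94x: 32 cores barely double a half-serial workload
console.log(amdahlSpeedup(0.95, 8).toFixed(2));  // ~5.93x: highly parallel work scales much better
```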

And that’s it for part one of the morning keynotes! We’re moving fast…  More in Part II!