O'Reilly's Velocity conference is the only generalized Web ops and performance conference out there. We really like it; you can go to various other conferences and have 10-20% of the content useful to you as a Web Admin, or you can go here and have most of it be relevant!
They've been doing some interim freebie Web conferences and there's one coming up. Check it out. They'll be talking about performance functionality in Google Webmaster Tools, MySQL, Show Slow, provisioning tools, and dynaTrace's new AJAX performance analysis tool.
O'Reilly Velocity Online Conference: "Speed and Stability"
Thursday, March 17; 9:00am PST
OK, I'll be honest. I started out attending "Metrics that Matter - Approaches to Managing High Performance Web Sites" (presentation available!) by Ben Rushlo, Keynote proserv. I bailed after a half hour to the other one, not because the info in that one was bad but because I knew what he was covering and wanted to get the less familiar information from the other workshop. Here are my brief notes from his session:
- Online apps are complex systems
- A siloed approach of deciding to improve midtier vs CDN vs front end engineering results in a suboptimal experience for the end user - you have to take a holistic view. I totally agree with this. In our own caching project we took special care to do an analysis project first, where we evaluated the impact and benefit of each of these items not only in isolation but together, so we'd know where we should expend effort.
- Use top level/end user metrics, not system metrics, to measure performance.
- There are other metrics that correlate to your performance - "key indicators."
- It's hard to take low level metrics and take them "up" into a meaningful picture of user experience.
He's covering good stuff but it's nothing I don't know. We see the differences and benefits in point in time tools, Passive RUM, tagging RUM, synthetic monitoring, end user/last mile synthetic monitoring... If you don't, read the presentation, it's good. As for me, it's off to the scaling session.
I hopped into this session a half hour late. It's Scalable Internet Architectures (again, go get the presentation) by Theo Schlossnagle, CEO of OmniTI and author of the similarly named book.
I like his talk, it starts by getting to the heart of what Web Operations - what we call "Web Admin" hereabouts - is. It kinda confuses architecture and operations initially but maybe that's because I came in late.
He talks about knowledge, tools, experience, and discipline, and mentions that discipline is the most lacking element in the field. Like him, I'm a "real engineer" who went into IT so I agree vigorously.
What specifically should you do?
- Use version control
- Serve static content using a CDN, and behind that a reverse proxy and behind that peer based HA. Distribute DNS for global distribution.
- Dynamic content - now it's time for optimization.
Optimizing Dynamic Content
Don't pay to generate the same content twice - use caching. Generate content only when things change and break the system into components so you can cache appropriately.
Example: a PHP news site - articles are in Oracle, personalization on each page, top new forum posts in a sidebar.
Why abuse Oracle by hitting it on every page view? Updates are controlled. The page should pull user prefs from a cookie. (P.S. rewrite your query strings)
But it's still slow to pull from the db vs hardcoding it.
All blog sw does this, for example.
Check for a hardcoded PHP page - if it's not there, run something that puts it there. It still dynamically puts in user personalization from the cookie. In the preso he provides details on how to do this.
Do cache invalidation on content change, and use a message queuing system like OpenAMQ for async writes.
Apache is now the bottleneck - use APC (Alternative PHP Cache)
or use memcached - he says no timeouts! Or... be careful about them! Or something.
1. shard them
2. shoot yourself
Sharding, or breaking your data up by range across many databases, means you throw away relational constraints and that's sad. Get over it.
You may not need relations - use files, fool! Or other options like CouchDB, etc. Or Hadoop, from the previous workshop!
Vertically scale first by:
- not hitting the damn db!
- run a good db. postgres! not MySQL boo-yah!
When you have to go horizontal, partition right - more than one shard shouldn't answer an OLTP question. If that's not possible, consider duplication.
IM example. Store messages sharded by recipient. But then the sender wants to see them too and that's an expensive operation - so just store them twice!!!
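Here's a rough Python sketch of the store-it-twice idea. The shards are just in-memory dicts, and `NUM_SHARDS`, `shard_for`, `send_message`, and `messages_for` are hypothetical names for illustration, not anything from the talk.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count

# Each shard is modeled as an in-memory dict: user -> list of messages.
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]


def shard_for(user: str) -> int:
    """Pick a shard from a stable hash of the user name."""
    return zlib.crc32(user.encode()) % NUM_SHARDS


def send_message(sender: str, recipient: str, body: str) -> None:
    """Write the message twice: once on each participant's shard."""
    record = {"from": sender, "to": recipient, "body": body}
    shards[shard_for(recipient)][recipient].append(record)
    shards[shard_for(sender)][sender].append(record)


def messages_for(user: str) -> list:
    """Answer from exactly one shard, whether the user sent or received."""
    return list(shards[shard_for(user)][user])
```

The write costs double, but neither the sender's nor the recipient's view ever has to fan out across shards.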
But if it's not that simple, partitioning can hose you.
Do math and simulate it before you do it fool! Be an engineer!
Multi-master replication doesn't work right. But it's getting closer.
The network's part of it, can't forget it.
Of course if you're using Ruby on Rails the network will never make your app suck more. Heh, the random drive-by disses rile the crowd up.
A single machine can push a gig. More isn't hard with aggregated ports. Apache too, serving static files. Load balancers too. How to get to 10 or 20 Gbps though? All the drivers and firmware suck. Buy an expensive LB?
Use routing. It supports naive LB'ing. Or routing protocol on front end cache/LBs talking to your edge router. Use hashed routes upstream. User caches use same IP. Fault tolerant, distributed load, free.
Use isolation for floods. Set up a surge net. Route out based on MAC. Used vs DDoSes.
Decoupling is one of the most overlooked techniques for scalable systems. Why do now what you can postpone till later?
Break the transaction into parts. Queue info. Process queues behind the scenes. Messaging! There are different options - AMQP, Spread, JMS. Specifically good message queuing options are:
- ActiveMQ (Java)
- OpenAMQ (C)
- RabbitMQ (Erlang)
The most common protocol is STOMP, which sucks but is universal.
Combine a queue and a job dispatcher to make this happen. Side note - Gearman, while cool, doesn't do this - it dispatches work but it doesn't decouple action from outcome - should be used to scale work that can't be decoupled. (Yes it does, says dude in crowd.)
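To make the queue-plus-dispatcher idea concrete, here's a minimal Python sketch. It uses the standard library's in-process `queue.Queue` as a stand-in for a real broker like the ones listed above; `handle_request`, `dispatcher`, and the welcome-mail job are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()
results = []


def handle_request(email: str) -> str:
    """Do the minimum inline, queue the slow part, and return right away."""
    jobs.put(email)
    return "accepted"


def dispatcher() -> None:
    """Pull jobs off the queue and process them behind the scenes."""
    while True:
        email = jobs.get()
        if email is None:  # sentinel: shut down
            jobs.task_done()
            break
        # Stand-in for slow work such as sending mail or writing to the db.
        results.append(f"welcome mail sent to {email}")
        jobs.task_done()


worker = threading.Thread(target=dispatcher, daemon=True)
worker.start()
```

The caller gets "accepted" back before the work is done, which is the decoupling of action from outcome. A real setup would swap the in-process queue for a durable broker so queued work survives a restart.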
It often boils down to "don't be an idiot." His words not mine. I like this guy. Performance is easier than scaling. Extremely high perf systems tend to be easier to scale because they don't have to scale as much.
e.g. An email marketing campaign with a URL not ending in a trailing slash. Guess what, you just doubled your hits. Use the damn trailing slash to avoid 302s.
How do you stop everyone from being an idiot though? Every person who sends a mass email from your company? That's our problem - with more than fifty programmers and business people generating apps and content for our Web site, there is always a weakest link.
Caching should be controlled, not prevented, in nearly every circumstance.
Understand the problem. Going from 100k to 10MM users - don't just bucketize in small chunks and assume it will scale. Allow a margin for error. Designing for 100x or 1000x requires a profound understanding of the problem.
Example - I plan for a traffic spike of 3000 new visitors/sec. My page is about 300k. CPU bound. 8ms service time. Calculate servers needed. If I varnish the static assets, the calculation says I need 3-4 machines. But do the math and it's about 8 Gbps of throughput. No way. At 1.5MM packets/sec - the firewall dies. You have to keep the whole system in mind.
So spread out static resources across multiple datacenters, agg'd pipes.
The rest is only 350 Mbps, 75k packets per second, doable - except the 302 adds 50% overage in packets per sec.
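The back-of-the-envelope math in this example can be written out as a short Python sketch. The visitor rate, page size, and service time come from the example; the 8 cores per machine and the reading of "300k" as 300 kilobytes (1 kB = 1000 bytes) are my assumptions, and the bandwidth figure is payload only.

```python
import math

# Figures from the example above.
VISITORS_PER_SEC = 3000
PAGE_BYTES = 300 * 1000   # assumes "300k" means 300 kilobytes
SERVICE_TIME_MS = 8

# Assumption, not from the talk: 8 cores per machine.
CORES_PER_MACHINE = 8

# CPU: each request holds a core for 8 ms, so this is cores kept busy.
busy_cores = VISITORS_PER_SEC * SERVICE_TIME_MS / 1000
machines = math.ceil(busy_cores / CORES_PER_MACHINE)

# Network: page payload only, ignoring protocol overhead and redirects.
gbps = VISITORS_PER_SEC * PAGE_BYTES * 8 / 1e9

print(f"{busy_cores:.0f} busy cores -> {machines} machines")
print(f"{gbps:.1f} Gbps of page payload")
```

The CPU side comes out to a handful of machines, while the payload alone is several gigabits per second before any overhead. That is the point: the server count looks fine right up until you check the pipe and the firewall.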
Last bonus thought - use ZFS/DTrace for dbs, so run them on Solaris!
Velocity 2009 is well underway and going great! Here's my blow by blow of how it went down.
Peco, my erstwhile Bulgarian comrade, and I came in to San Jose from Austin on Sunday. We got situated at the fairly swank hotel, the Fairmont, and wandered out to find food. There was some festival going on so the area was really hopping. After a bit of wandering, we had a reasonably tasty dinner at Original Joe's. Then we walked around the cool pedestrian part of downtown San Jose and ended up watching "Terminator: Salvation" at a neighborhood movie theater.
We went down at 8 AM the next morning for registration. We saw good ol' Steve Souders, and hooked up with a big crew from BazaarVoice, a local Austin startup that's doing well. (P.S. I don't know who that hot brunette is in the lead image on their home page, but I can clearly tell that she wants me!)
This first day is an optional "workshop" day with a number of in depth 90 minute sessions. There were two tracks, operations and performance. Mostly I covered ops and Peco covered performance. Next time - the first session!
As Web Admins, we love Velocity. Usually, we have to bottom-feed at more generalized conferences looking for good relevant content on systems engineering. This is the only conference that is targeted right at us, and has a dual focus of performance and operations. The economy's hitting us hard this year and we could only do one conference - so this is the one we picked.
Look for full coverage on the sessions to come!
We knew that the historic inauguration of Barack Obama would be generating a lot more Internet traffic than usual, both in general and specifically here at NI. Being prudent Web Admin types, we checked around to make sure we thought that there wouldn't be any untoward effects on our Web site. Like many corporate sites, we use the same pipe for inbound Internet client usage and outbound Web traffic, so employees streaming video to watch the event could pose a problem. We got all thumbs up after consulting with our networking team, and decided to not even send any messaging asking people to avoid streaming. But, we monitored the situation carefully as the day unwound. Here's what we saw, just for your edification!
Our max inbound Internet throughput was 285 Mbps, about double our usual peak. We saw a ni.com Web site performance degradation of about 25% for less than two hours according to our Keynote stats. ni.com ASPs were affected proportionately which indicates the slowdown was Internet-wide and not unique to our specific Internet connection here in Austin. The slowdown was less pronounced internationally, but still visible. So in summary - not a global holocaust, but a noticeable bump.
Cacti graphs showing our Internet connection traffic:
Keynote graph of several of our Web assets, showing global response time in seconds:
Looking at the traffic specifically, there were two main standouts. We had TCP 1935, which is Flash RTMP, peaking around 85 Mbps, and UDP 8247, which is a special CNN port (they use a plugin called "Octoshape" with their Flash streaming), peaking at 50 Mbps. We have an overall presence of about 2500 people here at our Austin HQ on an average day, but we can't tell exactly how many were streaming. (Our NetQoS setup shows us there were 13,600 'flows,' but every time a stream stops and starts that creates a new one - and the streams were hiccupping like crazy. We'd have to do a bunch of Excel work to figure out max concurrent, and have better things to do.)
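For the curious, the "max concurrent" number doesn't really need Excel. Here's a small Python sketch that assumes the flow records can be reduced to (start, end) timestamp pairs; that shape is hypothetical and not the actual NetQoS export format.

```python
def max_concurrent(flows):
    """Return the peak number of overlapping (start, end) intervals."""
    events = []
    for start, end in flows:
        events.append((start, 1))   # flow opens
        events.append((end, -1))    # flow closes
    # Sort by time; at equal times, closes (-1) sort before opens (+1),
    # so a stream that restarts at the same instant is not double counted.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

Feed it one tuple per flow and it returns the largest number of flows open at the same moment, which is a closer proxy for simultaneous streamers than the raw flow count.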
In terms of the streaming provider breakdown - since everyone uses Akamai now, the vast majority showed as "Akamai". We could probably dig more to find out, but we don't really care all that much. And, many of the sources were overwhelmed, which helped some.
We just wanted to share the data, in case anyone finds it helpful or interesting.
Dave Artz has put together a simple Webcast tutorial on how to use webpagetest.org to measure and fix up your Web site. If all this talk about Web performance is a bit overwhelming, it's a great novice tutorial. He walks through the entire process visually and explains each metric. Great job Dave!
Well, I'm finally home with a spare minute to write. The two guys who went to the conference with me (Peco and Robert) and I got a lot out of it. I apologize for the brevity of style of the conference writeups, but they were notes taken on a precariously balanced laptop, under bad network and power conditions, while I was also trying to get around and participate meaningfully in a very fast-paced event. I've gone back and tried to soften them a little bit, but there's no rest for the wicked. You can access many of the slides for the sessions here.
The conference was quite a success. Everyone we spoke to was enthusiastic about the people and information there. O'Reilly is happy because attendance was above their expectations, and it looks like it's been expanded to 3 days next year, which is good - it was *so* session packed and fast paced I didn't get to talk to all the suppliers I wanted in the dealer room and at times it felt like the Bataan death march. The first day we barely had time to grab a fast food dinner, and we often found ourselves hungry and hurrying. We enjoyed talking with the people there, but it seemed less conversational than other conventions - maybe because of the pace, maybe because half the people there were from the area and thus needed to scamper off to work/home and were therefore not into small talk.
We've reached the last couple sessions at Velocity 2008. Read me! Love me!
We hear about Capacity Planning with John Allspaw of Flickr. He says: No benchmarks! Use real production data. (How? We had to develop a program called WebReplay to do this because no one had anything. We're open sourcing it soon, stay tuned.)
Use "safety factors" (from traditional engineering). Aka a reserve, overhead, etc.
They use squid a bunch. At NI we've been looking at Oracle's WebCache - mainly because it supports ESIs and we're thinking that may be a good way to go. There's a half assed ESI plugin for squid but we hear it doesn't work; apparently Zope paid for ESI support in squid 3.0 but no traction on that in 4 years best as we can tell. But I'd be happy not to spend the money.
After a tasty pseudo-Asian hotel lunch (though about anything would be tasty by now!), we move into the final stretch of afternoon sessions for Velocity. Everyone seems in a good mood after the interesting demos in the morning and the general success of the conference.
First, it's the eagerly awaited Even Faster Web Sites. Steve Souders, previously Yahoo! performance guru and now Google performance guru, has another set of recommendations regarding Web performance. His previous book with its 14 rules, and YSlow, the Firebug plugin that supported it, were among the things that really got us hooked deeply into the Web performance space.
First, he reviews why front end performance is so important. In the steady state, 80-90% of the load time a user sees on your average page is spent after the server has spit the page out. "Network time." Optimizing your code speed is therefore a smaller area of improvement than optimizing the front end. And the front end can be improved, often in simple ways.
Man, there's a wide variance in how people's pages perform with a primed cache - from no benefit (most of the Alexa top 10) to incredible benefit (Google and MS Live Search results pages). Anyway, Steve developed his original 14 best practices for optimizing front end performance, and then built YSlow to measure them.
Welcome to the second (and final) day of the new Velocity Web performance and operations conference! I'm here to bring you the finest in big-hotel-ballroom-fueled info and drama from the day.
In the meantime, Peco had met our old friend Alistair Croll, once of Coradiant and now freelance, blogging on "Bitcurrent." Oh, and also at the vendor expo yesterday we saw something exciting - an open source offering from a company called ControlTier, which is a control and deployment app. We have one in house largely written by Peco called "Monolith" - more for control (self healing) and app deploys, which is why we don't use cfengine or puppet, which have very different use cases. His initial take is that ControlTier has all the features he's implemented and all the ones on his list to implement for Monolith, so we're very intrigued.
We kick off with a video of base jumpers, just to get the adrenaline going. Then, a "quirkily humorous" video about Faceball.
Steve and Jesse kick us off again today, and announce that the conference has more than 600 attendees, which is way above predictions! Sweet. And props to the program team, Artur Bergman (Wikia), Cal Henderson (Yahoo!), Jon Jenkins (Amazon), and Eric Schurman (Microsoft). Velocity 2009 is on! This makes us happy. We believe that this niche - web admin, web systems, web operations, whatever you call it - is getting quite large and needs/deserves some targeted attention.