Velocity 2009 – Scalable Internet Architectures

Jul.06, 2009 in Application Performance Management, Conferences, Velocity 2009

OK, I’ll be honest. I started out attending “Metrics that Matter – Approaches to Managing High Performance Web Sites” (presentation available!) by Ben Rushlo of Keynote professional services. After half an hour I bailed and moved to the other workshop. The info in his session wasn't bad; I already knew what he was covering and wanted the less familiar material from the other workshop. Here are my brief notes from his session:

Online apps are complex systems
A siloed approach, where you decide in isolation whether to improve the midtier vs. the CDN vs. front end engineering, results in a suboptimal experience for the end user. You have to take a holistic view. I totally agree with this. In our own caching project we took special care to do an analysis first, evaluating the impact and benefit of each of these items not only in isolation but together, so we’d know where we should expend effort.
Use top level/end user metrics, not system metrics, to measure performance.
There are other metrics that correlate with your performance: “key indicators.”
It’s hard to take low level metrics and take them “up” into a meaningful picture of user experience.

He’s covering good stuff, but it’s nothing I don’t know. We already understand the differences and benefits of point-in-time tools, passive RUM, tagging RUM, synthetic monitoring, and end user/last mile synthetic monitoring. If you don’t, read the presentation; it’s good. As for me, it’s off to the scaling session.

I hopped into this session a half hour late. It’s Scalable Internet Architectures (again, go get the presentation) by Theo Schlossnagle, CEO of OmniTI and author of the similarly named book.

I like his talk. It starts by getting to the heart of what Web Operations (what we call “Web Admin” hereabouts) is. It kinda confuses architecture and operations initially, but maybe that’s because I came in late.

He talks about knowledge, tools, experience, and discipline, and mentions that discipline is the most lacking element in the field. Like him, I’m a “real engineer” who went into IT so I agree vigorously.

What specifically should you do?

Use version control
Monitor
Serve static content using a CDN, behind that a reverse proxy, and behind that peer-based HA. Use distributed DNS for global distribution.
Dynamic content – now it’s time for optimization.

Optimizing Dynamic Content

Don’t pay to generate the same content twice – use caching. Generate content only when things change and break the system into components so you can cache appropriately.

Example: a PHP news site. Articles are in Oracle, there’s personalization on each page, and the top new forum posts are in a sidebar.

Why abuse Oracle by hitting it on every page view? Updates are controlled. The page should pull user prefs from a cookie. (P.S. Rewrite your query strings.)
But it’s still slow to pull from the DB vs. hardcoding it.
All blog software does this, for example.
Check for a hardcoded PHP page; if it’s not there, run something that puts it there. It still dynamically puts in user personalization from the cookie. In the preso he provides details on how to do this.
Do cache invalidation on content change, and use a message queuing system like OpenAMQ for async writes.
Apache is now the bottleneck, so use APC (Alternative PHP Cache).
Or use memcached. He says no timeouts! Or… be careful about them! Or something.
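The “check for a hardcoded page, generate it if it’s missing” pattern above can be sketched as follows. This is a minimal illustration in Python rather than his PHP, assuming a file-backed page cache, a `name` cookie for personalization, and an `invalidate()` hook that something (e.g. a queue consumer) calls when an article changes. All names here are hypothetical; the presentation has his actual approach.

```python
import html
import os
import tempfile
from http.cookies import SimpleCookie

# Hypothetical cache directory; a real site would use a fixed docroot path.
CACHE_DIR = tempfile.mkdtemp(prefix="pagecache-")
STATS = {"renders": 0}  # counts expensive renders, for illustration


def render_article(article_id: int) -> str:
    """Stand-in for the expensive part: query the database, run the templates."""
    STATS["renders"] += 1
    # {name} is left as a placeholder, filled per request from the cookie.
    return f"<h1>Article {article_id}</h1><p>Hello, {{name}}!</p>"


def cache_path(article_id: int) -> str:
    return os.path.join(CACHE_DIR, f"article-{article_id}.html")


def get_page(article_id: int, cookie_header: str) -> str:
    path = cache_path(article_id)
    if not os.path.exists(path):
        # Cache miss: generate once, write to a temp file, then rename
        # so readers never see a half-written page.
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            f.write(render_article(article_id))
        os.replace(tmp, path)
    with open(path, encoding="utf-8") as f:
        page = f.read()
    # Personalization still happens per request, but only from the cookie.
    cookie = SimpleCookie(cookie_header)
    name = cookie["name"].value if "name" in cookie else "guest"
    return page.replace("{name}", html.escape(name))


def invalidate(article_id: int) -> None:
    """Call when the article changes, so the next request regenerates it."""
    try:
        os.remove(cache_path(article_id))
    except FileNotFoundError:
        pass
```

Only the first request after a change pays for `render_article`; every later hit is a file read plus a cookie lookup, which is the point of not abusing the database on every page view.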

Scaling Databases

1. shard them
2. shoot yourself

Sharding, or breaking your data up by range across many databases, means you throw away relational constraints and that’s sad. Get over it.
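To make “breaking your data up by range” concrete, here is a minimal range-sharding lookup, assuming data is keyed by a numeric user ID. The bounds and database names are hypothetical.

```python
import bisect

# Hypothetical layout: shard i holds user_ids below UPPER_BOUNDS[i];
# the last shard takes everything at or above the final bound.
UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARDS = ["db0", "db1", "db2", "db3"]


def shard_for(user_id: int) -> str:
    """Return the database that owns this user_id's range."""
    return SHARDS[bisect.bisect_right(UPPER_BOUNDS, user_id)]
```

This is also where the relational constraints go away: a foreign key from a row on `db0` to a row on `db2` can’t be enforced by either database, so the application has to take over that job.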

You may not need relations: use files, fool! Or other options like CouchDB, etc. Or Hadoop, from the previous workshop!

Vertically scale first by:

Not hitting the damn DB!
Running a good DB. Postgres! Not MySQL, boo-yah!

When you have to go horizontal, partition right: no OLTP question should need more than one shard to answer it. If that’s not possible, consider duplication.

IM example: store messages sharded by recipient. But then the sender wants to see them too, and that’s an expensive cross-shard operation, so just store them twice!!!
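The “just store them twice” trick can be sketched like this: every message is written to the recipient’s shard (inbox) and the sender’s shard (sent items), so each read touches exactly one shard. This is an illustration under stated assumptions: shards are modeled as in-memory dicts, the shard count is hypothetical, and placement uses a stable hash rather than ranges for brevity.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count

# Each shard is modeled as a dict: (mailbox kind, owner) -> list of messages.
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]


def shard_of(user: str) -> int:
    # crc32 is stable across processes (unlike Python's salted hash()).
    return zlib.crc32(user.encode("utf-8")) % NUM_SHARDS


def send(sender: str, recipient: str, body: str) -> None:
    msg = (sender, recipient, body)
    # Write twice: once on the recipient's shard, once on the sender's.
    shards[shard_of(recipient)][("inbox", recipient)].append(msg)
    shards[shard_of(sender)][("sent", sender)].append(msg)


def inbox(user: str) -> list:
    # Single-shard read: no scatter-gather across shards.
    return list(shards[shard_of(user)].get(("inbox", user), []))


def sent(user: str) -> list:
    return list(shards[shard_of(user)].get(("sent", user), []))
```

The cost is double the storage and two writes per message; the payoff is that the common read (“show me my messages”) never fans out across shards.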

But if it’s not that simple, partitioning can hose you.

Do the math and simulate it before you do it, fool! Be an engineer!

Multi-master replication doesn’t work right. But it’s getting closer.

Networking

The network is part of it; you can’t forget it.

Of course if you’re using Ruby on Rails the network will never make your app suck more. Heh, the random drive-by disses rile the crowd up.

A single machine can push a gigabit, and more isn’t hard with aggregated ports. Apache can too, serving static files, and so can load balancers. How do you get to 10 or 20 Gbps, though? All the drivers and firmware suck. Buy an expensive LB?

Use routing. It supports naive load balancing. Or run a routing protocol on your front-end caches/LBs talking to your edge router, and use hashed routes upstream. Your caches all use the same IP. It’s fault tolerant, distributes load, and is free.
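“Hashed routes” here resembles equal-cost multipath flow hashing: the router hashes each flow’s addresses and ports and picks one of several next hops that all serve the same IP, so a given connection keeps landing on the same cache. Below is a minimal Python sketch of that selection logic only; the cache names are hypothetical, and real routers do this in hardware with their own hash functions.

```python
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str


# Hypothetical caches, all announcing the same service IP to the edge router.
NEXT_HOPS = ["cache-a", "cache-b", "cache-c"]


def pick_next_hop(flow: Flow, hops: Sequence[str] = NEXT_HOPS) -> str:
    """Deterministically map a flow to one next hop by hashing its 5-tuple."""
    key = "|".join(str(field) for field in flow).encode("utf-8")
    digest = hashlib.sha1(key).digest()
    return hops[int.from_bytes(digest[:4], "big") % len(hops)]
```

Fault tolerance falls out of the routing protocol: if a cache dies, its route is withdrawn and flows are hashed over the remaining hops. With plain modulo hashing, as here, some surviving flows also move; avoiding that takes a more careful (e.g. consistent) hash.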

Use isolation for floods: set up a surge net and route traffic out to it based on MAC. This is used against DDoS attacks.

Service Decoupling

One of the most overlooked techniques for scalable systems. Why do now what you can postpone till later?

Break the transaction into parts. Queue the info. Process the queues behind the scenes. Messaging! There are different options: AMQP, Spread, JMS. Specifically good message queuing options are:

ActiveMQ (Java)
OpenAMQ (C)
RabbitMQ (Erlang)

The most common protocol is STOMP; it sucks, but it’s universal.

Combine a queue and a job dispatcher to make this happen. Side note: Gearman, while cool, doesn’t do this. It dispatches work but doesn’t decouple action from outcome, so it should be used to scale work that can’t be decoupled. (Yes it does, says a dude in the crowd.)
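Here is a minimal sketch of “a queue plus a job dispatcher”: the request handler records the work and returns immediately, and a background worker does the slow part later. Python’s in-process `queue.Queue` stands in for a real broker such as ActiveMQ, OpenAMQ, or RabbitMQ; the task name and order IDs are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()  # stands in for a real message broker
completed = []        # record of finished background work, for illustration


def handle_request(order_id: int) -> str:
    """User-facing path: enqueue the slow part and return right away."""
    jobs.put({"task": "send_confirmation_email", "order_id": order_id})
    return f"order {order_id} accepted"


def dispatcher() -> None:
    """Background worker: pulls jobs off the queue and does the deferred work."""
    while True:
        job = jobs.get()
        try:
            if job is None:  # sentinel: shut down
                return
            completed.append(job["order_id"])  # stand-in for sending the email
        finally:
            jobs.task_done()


worker = threading.Thread(target=dispatcher, daemon=True)
worker.start()


def stop() -> None:
    """Ask the worker to exit after draining earlier jobs, then wait for it."""
    jobs.put(None)
    worker.join()
```

The user sees “accepted” as soon as the job is queued, which is the decoupling of action from outcome. With a real broker the queue also survives process restarts, which this in-memory version does not.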

Scalability Problems

It often boils down to “don’t be an idiot.” His words, not mine. I like this guy. Performance is easier than scaling. Extremely high-performance systems tend to be easier to scale because they don’t have to scale as much.

For example, take an email marketing campaign with a URL that doesn’t end in a trailing slash. Guess what, you just doubled your hits. Use the damn trailing slash to avoid 302s.
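One cheap guard is to normalize campaign links before the mail goes out. This is a minimal sketch that assumes a simple heuristic: a path whose last segment has no `.` is a directory and should end in `/`, while file-like paths and query strings are left alone. The heuristic and example URLs are assumptions; check them against how your own server issues redirects.

```python
from urllib.parse import urlsplit, urlunsplit


def add_trailing_slash(url: str) -> str:
    """Append '/' to directory-style paths so the first request doesn't redirect."""
    parts = urlsplit(url)
    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    # Heuristic: no dot in the last segment means "directory", not a file.
    if not path.endswith("/") and "." not in last_segment:
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

For instance, `add_trailing_slash("http://example.com/sale?src=email")` gives `http://example.com/sale/?src=email`, so each click costs one request instead of a redirect plus a request.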

How do you stop everyone from being an idiot, though? Every person who sends a mass email from your company? That’s our problem: with more than fifty programmers and business people generating apps and content for our Web site, there is always a weakest link.

Caching should be controlled, not prevented, in nearly every circumstance.

Understand the problem. When going from 100k to 10MM users, don’t just bucketize in small chunks and assume it will scale. Allow a margin for error. Designing for 100x or 1000x requires a profound understanding of the problem.

Example: I plan for a traffic spike of 3000 new visitors/sec. My page is about 300k. It’s CPU bound, with an 8ms service time. Calculate the servers needed. If I put the static assets behind Varnish, the calculation says I need 3-4 machines. But do the math on bandwidth and it’s roughly 8 Gbps of throughput (3000 × 300 KB is about 900 MB/sec). No way. At 1.5MM packets/sec, the firewall dies. You have to keep the whole system in mind.

So spread the static resources out across multiple datacenters with aggregated pipes.
The rest is only 350 Mbps and 75k packets per second, which is doable, except that the 302 adds 50% overage in packets per second.
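The arithmetic in this example is worth writing down, because the CPU and bandwidth numbers point in very different directions. A minimal back-of-the-envelope sketch: the 3000 visitors/sec, 300k page, and 8ms figures come from the talk, while decimal kilobytes and 8 cores per machine are my assumptions. Packet counts depend on packet sizes and protocol overhead, so they are not modeled here.

```python
import math

visitors_per_sec = 3000
page_bytes = 300_000     # "about 300k"; decimal kilobytes assumed
service_time_ms = 8      # CPU time to generate one page
cores_per_machine = 8    # hypothetical hardware assumption

# CPU: each second of traffic needs visitors * service time of CPU-seconds.
busy_cores = visitors_per_sec * service_time_ms / 1000     # 24.0 at 100% utilization
machines = math.ceil(busy_cores / cores_per_machine)        # 3, with zero headroom

# Bandwidth: every visitor pulls the whole page once.
bytes_per_sec = visitors_per_sec * page_bytes               # 900,000,000
gbps = bytes_per_sec * 8 / 1e9                               # 7.2, before protocol overhead

print(f"{busy_cores:.0f} busy cores -> {machines} machines")
print(f"{bytes_per_sec / 1e6:.0f} MB/s = {gbps:.1f} Gbps")
```

This prints `24 busy cores -> 3 machines` and `900 MB/s = 7.2 Gbps`. Three machines with no headroom is where the “3-4 machines” figure comes from, but 7.2 Gbps before overhead is far beyond a handful of gigabit ports, which is exactly his point.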

Last bonus thought: use ZFS/DTrace for DBs, so run them on Solaris!

Tags: scalability, velocity, velocityconf, velocityconf09

Welcome to WebAdminBlog!

This blog site is run by Josh Sokol, the Founder and CEO of SimpleRisk, a free tool for Governance, Risk Management, and Compliance. Josh is a former Web Admin and Information Security Program Owner of National Instruments.

Web Admin Blog

Real Web Admins. Real World Experience.