Simplifying On-call Through Alert Aggregation

Mar.29, 2010 in Uncategorized

One of the coolest things about working on the Web Systems Team at National Instruments is that the company has invested in a wide variety of tools to assist us with our jobs. Since we are responsible for the availability of ni.com, we have the standard URL and content monitors (Sitescope and Nagios). We also have the ability to do real user monitoring with a tool called Coradiant TrueSight. We are also responsible for the website’s performance so we have purchased tools like Panorama to diagnose code level issues. We have Splunk for log monitoring and Gomez for a third-party performance and availability monitor. We even have a SaaS provider that does application security scanning. Having all of these tools at our disposal is quite awesome and allows us to quickly find and fix issues with the site. The problem is that every single one of those tools has it’s own alerting and reporting interface.

This isn’t a new problem by any means. I’ve seen this issue at every job that I’ve ever had where the responsibilities included operational support. You rely on multiple tools to tell you when things aren’t going quite right, but now you end up spending some non-zero portion of your time managing those tools. For example, lets say that your company has a small release that lasts a few hours once a month. You now have to log in to the control panel (GUI) for each one of those tools and disable your alerts for that time period so that your on-call device isn’t going crazy. Assume that you have only four alerting tools and it takes you approximately 5 minutes to log in to each, set the maintenance window, and log back out. You just spent 20 minutes to disable alerts! Now you’re getting to the end of the release and things didn’t go as planned so the release is running longer than expected. Now you have to spend another 20 minutes to extend the maintenance window. How frustrating is that?

The issue gets even more complicated when you have multiple people providing support in either an on-call rotation or follow-the-sun type of scenario. At NI, we have an operations team that handles alerts during normal business hours, an on-call admin who handles alerts from 5 PM to 2 AM, and then a super-awesome Hungarian Web Admin who takes over responding to pages after 2 AM (9 AM in Hungary). Most of the alerting configurations that these tools provide aren’t even able to handle this type of scenario, but let’s suppose they did. You’re still stuck logging into multiple systems every time there’s a holiday, somebody goes on vacation, etc. And what happens if you don’t have a dedicated on-call device to pass from person to person? Then you’re stuck updating the alert configurations every time the on-call person changes in your rotation.

This really got me thinking that there has to be a better way to do things. I searched the internet looking for a solution, but when I couldn’t find anything to do exactly what I wanted it to do, I ended up writing my own. It’s now my pleasure to share with you iAlertYou. The idea is quite simple. You take all of those different tools that send alerts and you aggregate them in the same place. In this case, it’s on ialertyou.com. By doing this, you gain the ability to control everything from a single, centralized, management platform. Have a maintenance window? No problem. Just log in, set it once, and it affects all of your alerts. Same thing for both alert scheduling (who should get pages and when) and contact groups (used for on-call rotations). Plus, by having all of your alerts going through a single aggregation point, it means that we can also do reporting on all of your alerts. Ever wondered how many of your alerts come from what tools? What times of the day you get the most alerts? It’s all possible through alert aggregation.

Certainly there are drawbacks to this type of scenario. Most importantly, you’re introducing another dependency in what is typically a mission critical activity. While I can’t eliminate this concern completely, I built the system on top of internet cloud technologies for superior scalability. I’ve architected the application using best-practices in availability, performance, security, and usability. Currently, the only offering is a $30/month “everything” plan, but if you spend more than 10-20 minutes a month changing alert configurations, the ROI is realized very quickly. I will also be rolling out a “free” plan (thanks to Peco) with a limited subset of the functionality. I’d like to invite you to check out http://www.ialertyou.com and see if it can help your company simplify on-call through alert aggregation.

Tags: aggregation, alert, centralized, combine, e-mail, email, monitor, Monitoring, on-call, oncall, pager, rotation, schedule, tool, triage

2 Comments on “Simplifying On-call Through Alert Aggregation”

Eric
April 13th, 2010 at 9:44 am
There is a great Real User Monitoring tool out there that is surprisingly free. Something that might be worth checking out http://www.real-user-monitoring.com

peter van der klugt
May 19th, 2010 at 7:07 am
I have been confronted many years with complex monitoring & control that all use the ‘convenient’ approach to bring all alert data to a single place. Convenient implies that it is so easy to design HMI’s to address the alerts from several task stations.
I have been convinced by now that this works only for setting up new systems. When system components become obsolete, have to be replaced by newer ones with unforeseen other data, it is a horribly expensive approach (a simple change may affect the whole system)
There are better approaches known today, but their popularity is seriously hampered by the ‘conveniency’ of today’s centralized database oriented tools.
These better approaches have in common that they separate HMI from data and that they leave data, and as much responsibility for data handling as acceptable, low in the system components.

Welcome to WebAdminBlog!

This blog site is run by Josh Sokol, the Founder and CEO of SimpleRisk, a free tool for Governance, Risk Management, and Compliance. Josh is a former Web Admin and Information Security Program Owner of National Instruments.

Categories
Recent Posts
Recent Comments
devops
Links
Security
Tags
21ct agile amazon analysis application appsec attack aws browser cloud Cloud Computing code Conferences data devops ec2 firewall google hansen internet lynxeon malware Management network Operations owasp PCI performance project rsnake SaaS secure Security strategies velocity velocity08 velocityconf velocityconf08 velocityconf09 Virtualization vpn vulnerability waf web wifi
Categories
- Advertising (2)
- Application Performance Management (14)
- Automation (4)
- Browsers (4)
- Cloud Computing (9)
  - Elastic Compute Cloud (3)
- Conferences (64)
  - BSides Austin 2013 (1)
  - BSides Austin 2016 (1)
  - OWASP AppSec DC 2009 (16)
  - OWASP AppSec NYC 2008 (18)
  - OWASP LASCON 2017 (1)
  - OWASP LASCON 2018 (1)
  - TRISC 2009 (8)
  - Velocity 2008 (8)
  - Velocity 2009 (8)
- Content Management (2)
- Featured (3)
- Green Computing (1)
- High Availability (1)
- Log Management (2)
- Management (4)
- Monitoring (4)
- Networking (12)
  - Firewalls (4)
  - NetFlow (4)
- Operating Systems (2)
  - Linux (2)
  - Mac OSX (1)
  - Unix (2)
- Operations (11)
- Popular (2)
- SaaS (2)
- Sarcasm (1)
- Search (1)
  - Enterprise Search (1)
- Security (75)
  - Access Management (1)
  - Capture the Flag (4)
  - Cloud Computing (4)
  - Compliance (1)
  - Disaster Recovery (1)
  - Malware (4)
  - Metrics (2)
  - OWASP (2)
  - PCI (2)
  - Phishing (2)
  - Physical (1)
  - Risk Management (2)
  - Virtualization (3)
  - Web Application Security (32)
    - Dynamic Analysis (1)
    - Static Analysis (1)
  - Wireless Networks (5)
- Service-Oriented Architecture (1)
- Software and Tools (15)
  - Crashplan (1)
  - Drobo (1)
  - GRC (1)
- Training (2)
- Uncategorized (1)
- Virtualization (4)

Blogroll
- Agile Operations Blog
- Agile Testing
- Agile Web Operations
- Amazon Web Services Blog
- dev2ops – Web Ops at Scale
- Gilligan on Data Web Analytics pro tips
- ISSA Home The Information Systems Security Association (ISSA)® is a not-for-profit, international organization of information security professionals and practitioners.
- Kitchen Soap, A WebOps Blog
- Michael Howard's Blog Software security guy at Microsoft.
- National Instruments Home The majority of the contributers here are current or past NI employees.
- OWASP Home The Open Web Application Security Project (OWASP) is a worldwide free and open community focused on improving the security of application software.
- RSnake's Blog ha.ckers.org web application security lab
- Server Fault
- Steve Souders’ Blog Google High Performance Guru
- The Madstop
- The Open Minded Enterprise
- The Simple Logic
- Transparent Uptime blog
Archives
- March 2019
- October 2017
- April 2016
- January 2016
- December 2015
- May 2015
- November 2014
- August 2014
- June 2014
- May 2014
- October 2013
- September 2013
- August 2013
- May 2013
- March 2013
- February 2013
- October 2012
- May 2011
- April 2011
- December 2010
- July 2010
- June 2010
- April 2010
- March 2010
- February 2010
- January 2010
- November 2009
- September 2009
- July 2009
- June 2009
- April 2009
- March 2009
- February 2009
- January 2009
- December 2008
- October 2008
- September 2008
- August 2008
- July 2008
- June 2008
- May 2008
Tag Cloud
21ct agile amazon analysis application appsec attack aws browser cloud Cloud Computing code Conferences data devops ec2 firewall google hansen internet lynxeon malware Management network Operations owasp PCI performance project rsnake SaaS secure Security strategies velocity velocity08 velocityconf velocityconf08 velocityconf09 Virtualization vpn vulnerability waf web wifi

Web Admin Blog

Real Web Admins. Real World Experience.