Web Admin Blog Real Web Admins. Real World Experience.

29 Mar 2010

Simplifying On-call Through Alert Aggregation

One of the coolest things about working on the Web Systems Team at National Instruments is that the company has invested in a wide variety of tools to assist us with our jobs. Since we are responsible for the availability of ni.com, we have the standard URL and content monitors (Sitescope and Nagios). We also have the ability to do real user monitoring with a tool called Coradiant TrueSight. We are also responsible for the website's performance, so we have purchased tools like Panorama to diagnose code-level issues. We have Splunk for log monitoring and Gomez as a third-party performance and availability monitor. We even have a SaaS provider that does application security scanning. Having all of these tools at our disposal is quite awesome and allows us to quickly find and fix issues with the site. The problem is that every single one of those tools has its own alerting and reporting interface.

This isn't a new problem by any means. I've seen this issue at every job that I've ever had where the responsibilities included operational support. You rely on multiple tools to tell you when things aren't going quite right, but now you end up spending some non-zero portion of your time managing those tools. For example, let's say that your company has a small release that lasts a few hours once a month. You now have to log in to the control panel (GUI) for each one of those tools and disable your alerts for that time period so that your on-call device isn't going crazy. Assume that you have only four alerting tools and it takes you approximately 5 minutes to log in to each, set the maintenance window, and log back out. You just spent 20 minutes disabling alerts! Now you're getting to the end of the release and things didn't go as planned, so the release is running longer than expected. Now you have to spend another 20 minutes extending the maintenance window. How frustrating is that?

The issue gets even more complicated when you have multiple people providing support in either an on-call rotation or a follow-the-sun type of scenario. At NI, we have an operations team that handles alerts during normal business hours, an on-call admin who handles alerts from 5 PM to 2 AM, and then a super-awesome Hungarian Web Admin who takes over responding to pages after 2 AM (9 AM in Hungary). Most of the alerting configurations that these tools provide aren't even able to handle this type of scenario, but let's suppose they could. You're still stuck logging in to multiple systems every time there's a holiday, somebody goes on vacation, etc. And what happens if you don't have a dedicated on-call device to pass from person to person? Then you're stuck updating the alert configurations every time the on-call person changes in your rotation.
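To make the hand-off concrete, here is a minimal sketch in Python of a time-of-day rotation like the one described above. The shift table, the function names, and the 8 AM start of "normal business hours" are all assumptions made for illustration; none of the tools mentioned here expose their schedules this way.

```python
from datetime import time

# Hypothetical shift table in local office time: (start, end, responder).
# The 8 AM start of business hours is an assumption; the 5 PM and 2 AM
# hand-offs come from the rotation described above.
SHIFTS = [
    (time(8, 0), time(17, 0), "operations team"),
    (time(17, 0), time(2, 0), "on-call admin"),        # wraps past midnight
    (time(2, 0), time(8, 0), "Hungarian web admin"),
]


def in_shift(now, start, end):
    """Return True if `now` falls in [start, end), allowing midnight wrap."""
    if start <= end:
        return start <= now < end
    # The shift crosses midnight, e.g. 17:00 to 02:00.
    return now >= start or now < end


def responder_for(now):
    """Return whoever should be paged at the given local time of day."""
    for start, end, who in SHIFTS:
        if in_shift(now, start, end):
            return who
    raise ValueError(f"no shift covers {now}")
```

Calling `responder_for(time(23, 30))` returns the on-call admin, and `responder_for(time(2, 0))` returns the Hungarian web admin. The point is that the table lives in exactly one place, so a vacation or holiday means one edit instead of one edit per tool.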

This really got me thinking that there has to be a better way to do things. I searched the internet looking for a solution, but when I couldn't find anything that did exactly what I wanted, I ended up writing my own. It's now my pleasure to share with you iAlertYou. The idea is quite simple. You take all of those different tools that send alerts and you aggregate them in the same place. In this case, it's on ialertyou.com. By doing this, you gain the ability to control everything from a single, centralized management platform. Have a maintenance window? No problem. Just log in, set it once, and it affects all of your alerts. Same thing for both alert scheduling (who should get pages and when) and contact groups (used for on-call rotations). Plus, having all of your alerts go through a single aggregation point means that we can also do reporting on all of your alerts. Ever wondered how many of your alerts come from which tools? What times of the day you get the most alerts? It's all possible through alert aggregation.
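The aggregation idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how iAlertYou is actually implemented; the `Alert` and `Aggregator` classes and all of their method names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Alert:
    source: str            # e.g. "Nagios" or "Splunk"
    message: str
    received_at: datetime


@dataclass
class Aggregator:
    """Hypothetical single point that every tool sends its alerts to."""
    windows: list = field(default_factory=list)   # (start, end) pairs
    history: list = field(default_factory=list)   # every alert received

    def add_maintenance_window(self, start, end):
        # Set once here instead of once per monitoring tool.
        self.windows.append((start, end))

    def in_maintenance(self, when):
        return any(start <= when < end for start, end in self.windows)

    def receive(self, alert):
        """Record the alert and return True if it should page someone."""
        self.history.append(alert)
        return not self.in_maintenance(alert.received_at)

    def alerts_by_source(self):
        # How many alerts came from each tool.
        return Counter(a.source for a in self.history)

    def alerts_by_hour(self):
        # Which hours of the day see the most alerts.
        return Counter(a.received_at.hour for a in self.history)
```

Because every alert passes through `receive`, suppressed alerts are still recorded, so the reports cover everything the tools sent and not just what reached the pager.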

Certainly there are drawbacks to this type of approach. Most importantly, you're introducing another dependency in what is typically a mission-critical activity. While I can't eliminate this concern completely, I built the system on top of cloud technologies for scalability, and I've architected the application using best practices in availability, performance, security, and usability. Currently, the only offering is a $30/month "everything" plan, but if you spend more than 10-20 minutes a month changing alert configurations, the ROI is realized very quickly. I will also be rolling out a "free" plan (thanks to Peco) with a limited subset of the functionality. I'd like to invite you to check out http://www.ialertyou.com and see if it can help your company simplify on-call through alert aggregation.