Web Admin Blog Real Web Admins. Real World Experience.


Analyzing NetFlow for Data Loss Detection

The 2014 Verizon Data Breach Investigation Report (DBIR) is out and it paints quite the gloomy picture of the world we live in today where cyber security is concerned.  With over 63,000 security incidents and 1,367 confirmed data breaches, the question is no longer if you get popped, but rather, when.  According to the report, data export is second only to credit card theft on the list of threat actions as a result of a breach.  And with the time to compromise typically measured in days and time to discovery measured in weeks or months, Houston, we have a problem.

I've written in the past about all of the cool tricks we've been doing to find malware and other security issues by performing NetFlow analysis using the 21CT LYNXeon tool and this time I've found another trick around data loss detection that I thought was worth writing about.  Before I get into the trick, let's quickly recap NetFlow for those who aren't familiar with it.

Think of NetFlow as the CliffsNotes of all of the network traffic that your systems handle on a daily basis.  Instead of seeing WHAT data was transmitted (a task for deep packet inspection/DPI), we see a summary of HOW the data was transmitted: things like source and destination IP, source and destination port, protocol, and bytes sent and received.  Because many network devices are capable of giving you this information for free, it only makes sense to capture it and start using it for security analytics.

So, now that we have our NetFlow and we know that we're going to be breached eventually, the real question becomes how to detect it quickly and remediate before a significant data loss occurs.  Our LYNXeon tool allows us to create patterns of what to look for within NetFlow and other data sources.  So, to help detect data loss, I've designed the following analytic:

[Image: LYNXeon Analytics for Data Loss]

This analytic searches our NetFlow for every instance of an internal IP address talking to an external IP address.  It then adds up the bytes sent for each unique set of connections (same source, destination, and port) and presents me with a top 25 list.  Something like this:

[Image: Top 25 List]

So, now we have a list of the top 25 source and destination pairs that are sending data outside of our organization.  There are also some interesting ports in this list like 12547, 22 (SSH), 443 (HTTPS), and 29234.  A system with 38.48 GB worth of data sent to a remote server seems like a bad sign and something that should be investigated.  You get the idea.  It's just a matter of analyzing the data and separating out what is typical vs what isn't and then digging deeper into those.
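For anyone without LYNXeon, the same idea can be roughed out in a few lines of Python.  This is only a sketch of the aggregation described above, not the LYNXeon analytic itself: the tuple layout of each flow, the function names, and the use of the RFC 1918 private ranges as the definition of "internal" are all assumptions made for illustration.

```python
from collections import Counter
from ipaddress import ip_address, ip_network

# Assumption: the RFC 1918 private ranges stand in for "internal".
# Swap in your own organization's address space.
INTERNAL_NETS = [
    ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_internal(addr):
    """Return True if addr falls inside one of the internal networks."""
    ip = ip_address(addr)
    return any(ip in net for net in INTERNAL_NETS)


def top_outbound(flows, limit=25):
    """Sum bytes sent per (source, destination, port) for internal-to-external flows.

    flows is an iterable of (src_ip, dst_ip, dst_port, bytes_sent) tuples,
    a hypothetical layout chosen for this example.
    """
    totals = Counter()
    for src, dst, port, sent in flows:
        if is_internal(src) and not is_internal(dst):
            totals[(src, dst, port)] += sent
    return totals.most_common(limit)


# Made-up sample flows using documentation address ranges.
sample_flows = [
    ("10.1.2.3", "203.0.113.9", 443, 1_000),
    ("10.1.2.3", "203.0.113.9", 443, 4_000),
    ("10.1.2.4", "198.51.100.7", 22, 2_500),
    ("10.1.2.5", "10.9.9.9", 445, 9_999),  # internal to internal, ignored
]

top = top_outbound(sample_flows)
for (src, dst, port), sent in top:
    print(src, dst, port, sent)
```

Feeding it a day's worth of exported flow records and eyeballing the top of the list is the whole trick; the hard part is still knowing which of those destinations are normal for your environment.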

My advice is to run this report on an automated schedule at least daily so that you can quickly detect when data loss has begun in order to squash it at the source.  You could probably argue that an attacker might take a low and slow approach to remain undetected by my report, and you'd probably be right, but I'd also argue that if this were the case, then I've hopefully slowed them enough to catch them another way within a reasonable timespan.  Remember, security is all about defense in depth and with the many significant issues that are highlighted by the Verizon DBIR, we could use all of the defense we can muster.


Visual Correlation of Security Events

I recently had the opportunity to play with a data analytics platform called LYNXeon by a local company (Austin, TX) called 21CT. The LYNXeon tool is billed as a "Big Data Analytics" tool that can assist you in finding answers among the flood of data that comes from your network and security devices and it does a fantastic job of doing just that. What follows are some of my experiences in using this platform and some of the reasons that I think companies can benefit from the visualizations which it provides.

Where I work, data on security events is in silos all over the place. First, there are the various security event notification systems that my team owns. These consist primarily of our IPS system and our malware prevention system. Next, there are our anti-virus and end-point management systems which are owned by our desktop security team. There are also event and application logs from our various data center systems which are owned by various teams. Lastly, there's our network team who owns the firewalls, the routers, the switches, and the wireless access points. As you can imagine, when trying to reconstruct what happened as part of a security event, the data from each of these systems can play a significant role. Even more important is your ability to correlate the data across these siloed systems to get the complete picture. This is where log management typically comes into play.

Don't get me wrong. I think that log management is great when it comes to correlating the siloed data, but what if you don't know what you're looking for? How do you find a problem that you don't know exists? Enter the LYNXeon platform.

The base of the LYNXeon platform is flow data obtained from your various network devices. Regardless of whether you use Juniper JFlow, Cisco NetFlow, or one of the many other flow data options, knowing what data is going from one place to another is crucial to understanding your network and any events that take place on it. Flow data consists of the following:

  • Source IP address
  • Destination IP address
  • IP protocol
  • Source port
  • Destination port
  • IP type of service

Flow data also can contain information about the size of the data on your network.
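To make those fields concrete, here is one way a single flow record might be modeled in Python.  The class name, field names, and sample values are my own invention for illustration; real NetFlow or JFlow exports carry more fields and use their own naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical in-memory form of one flow record."""
    src_ip: str
    dst_ip: str
    protocol: int    # IP protocol number, e.g. 6 = TCP, 17 = UDP
    src_port: int
    dst_port: int
    tos: int         # IP type of service
    bytes_sent: int = 0  # the optional size information

    def key(self):
        """The fields listed above, which together identify a flow."""
        return (self.src_ip, self.dst_ip, self.protocol,
                self.src_port, self.dst_port, self.tos)


# Two records for the same conversation share a key, so their sizes can be summed.
a = FlowRecord("10.0.0.5", "192.0.2.10", 6, 51515, 443, 0, bytes_sent=1200)
b = FlowRecord("10.0.0.5", "192.0.2.10", 6, 51515, 443, 0, bytes_sent=800)
same_flow = a.key() == b.key()
total_bytes = a.bytes_sent + b.bytes_sent
```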

The default configuration of LYNXeon basically allows you to visually (and textually) analyze this flow data for issues, which is immediately useful.  LYNXeon Analyst Studio comes with a bunch of pre-canned reports which allow you to quickly sort through your flow data for interesting patterns.  For example, once a system has been compromised, the next step for the attacker is often data exfiltration.  They want to get as much information out of the company as possible before they are identified and their access is squashed.  LYNXeon provides you with a report to identify the top destinations in terms of data size for outbound connections.  Some other extremely useful reporting that you can do with basic flow data in LYNXeon:

  • Identify DNS queries to non-corporate DNS servers.
  • Identify the use of protocols that are explicitly banned by corporate policy (P2P?  IM?).
  • Find inbound connection attempts from hostile countries.
  • Find outbound connections via internal protocols (SNMP?).
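As a rough illustration of the first report in that list, here is a Python sketch that flags DNS traffic going anywhere other than the corporate resolvers.  It is not how LYNXeon implements the report; the resolver addresses, the dictionary layout of a flow, and the function name are all hypothetical.

```python
# Hypothetical addresses of the sanctioned corporate DNS resolvers.
CORPORATE_DNS = {"10.0.0.53", "10.0.1.53"}


def rogue_dns_flows(flows, resolvers=CORPORATE_DNS):
    """Return flows to port 53 that bypass the corporate resolvers.

    flows is an iterable of dicts with src_ip, dst_ip and dst_port keys.
    Lookups made by the resolvers themselves are expected and are skipped.
    """
    return [
        f for f in flows
        if f["dst_port"] == 53
        and f["dst_ip"] not in resolvers
        and f["src_ip"] not in resolvers
    ]


# Made-up sample data.
sample = [
    {"src_ip": "10.2.3.4", "dst_ip": "10.0.0.53", "dst_port": 53},    # normal
    {"src_ip": "10.2.3.9", "dst_ip": "198.51.100.1", "dst_port": 53}, # suspicious
    {"src_ip": "10.0.0.53", "dst_ip": "192.0.2.1", "dst_port": 53},   # resolver recursing
    {"src_ip": "10.2.3.9", "dst_ip": "198.51.100.1", "dst_port": 443},
]
suspicious = rogue_dns_flows(sample)
```

The other reports in the list follow the same shape: a filter on port, protocol, or address range, followed by a human deciding what is worth chasing.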

It's not currently part of the default configuration of LYNXeon, but they have some very smart guys working there who can provide services around importing pretty much any data type you can think of into the visualizations as well.  Think about the power of combining the data of what is talking to what along with information about anti-virus alerts, malware alerts, intrusion alerts, and so on.  Now, not only do you know that there was an alert in your IPS system, but you can track every system that target talked with after the fact.  Did it begin scanning the network for other hosts to compromise?  Did it make a call back out to China?  These questions and more can be answered with the visual correlation of events through the LYNXeon platform.  This is something that I have never seen a SIEM or other log management company be able to accomplish.

LYNXeon probably isn't for everybody.  While the interface itself is quite easy to use, it still requires a skilled security professional at the console to be able to analyze the data that is rendered.  And while the built-in analytics help tremendously in finding the proverbial "needle in the haystack", it still takes a trained person to be able to interpret the results.  But if your company has the expertise and the time to go about proactively finding problems, it is definitely worth looking into both from a network troubleshooting (something I really didn't cover) and security event management perspective.


Physical Security FAIL :-(

Notice anything wrong with this picture?

[Image: Iron Mountain lock is unlocked.]

I was walking by one of the Iron Mountain Secure Shredding bins at work one day several months ago and noticed that the lock wasn't actually locked. Being the security conscious individual that I am, I tried to latch the lock again, but it was so rusted that it wouldn't close no matter how hard I tried. I couldn't just leave it like that, so I called the number on the bin's label and got an automated message telling me that they weren't taking local calls anymore and giving me a different number to try. I called that number and the woman who answered asked for my company ID number, which I didn't know. She informed me that without that ID number I couldn't submit a support request. I informed her that this bin contained sensitive personal and financial information and that the issue couldn't wait for some random company ID to be found. Fortunately, she gave in and created the support ticket for me, saying that I should hear back from someone within four hours.

One week later, on Friday, Iron Mountain finally calls me back and says that they will come to replace the lock the following Monday before 5 PM. When the lock hadn't been replaced yet on Monday evening, I called Iron Mountain back up. Looking at their records, they showed that a new lock had been delivered, but they had no idea where and the signature was illegible. I work on a three-building campus with 14 floors between them and almost 3,000 people. If they can't tell me where the lock is, then there's no way for me to track it down. They said that they would investigate and call me back.

After not hearing back from them again for a couple of days, I called them back. The woman I spoke with had no real update on the investigation. She said that she would send another message "downstairs" and escalate to her supervisor. At this point it had been almost three weeks with sensitive documents sitting in a bin with a malfunctioning lock. The next day they called me back and said they were never able to track down who the new lock was left with so they would bring us a new one at no charge. Finally, after a total of 24 days with an unlocked Secure Shredding bin, Iron Mountain was able to replace the lock. Iron Mountain......FAIL.


Velocity 2009 – Hadoop Operations: Managing Big Data Clusters

Hadoop Operations: Managing Big Data Clusters (see link on that page for preso) was given by Jeff Hammerbacher of Cloudera.

Other good references -
book: "Hadoop: The Definitive Guide"
preso: hadoop cluster management from USENIX 2009

Hadoop is an Apache project inspired by Google's infrastructure; it's software for programming warehouse-scale computers.

It has recently been split into three main subprojects - HDFS, MapReduce, and Hadoop Common - and sports an ecosystem of various smaller subprojects (hive, etc.).

Usually a hadoop cluster is a mess of stock 1 RU servers with 4x1TB SATA disks in them.  "I like my servers like I like my women - cheap and dirty," Jeff did not say.


HDFS

  • Pools servers into a single hierarchical namespace
  • It's designed for large files, written once/read many times
  • It does checksumming, replication, compression
  • Access is from Java, C, command line, etc.  Not usually mounted at the OS level.


MapReduce

  • Is a fault tolerant data layer and API for parallel data processing
  • Has a key/value pair model
  • Access is via Java, C++, streaming (for scripts), SQL (Hive), etc
  • Pushes work out to the data
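The key/value pair model in the list above is easiest to see with the classic word count.  This is a toy, single-process Python sketch of the map, shuffle, and reduce phases, not Hadoop's actual API; the function names are made up, and a real cluster would run the map and reduce steps in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big clusters", "big disks"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```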


Subprojects

  • Avro (serialization)
  • HBase (like Google BigTable)
  • Hive (SQL interface)
  • Pig (language for dataflow programming)
  • zookeeper (coordination for distrib. systems)

Facebook used scribe (log aggregation tool) to pull a big wad of info into hadoop, published it out to mysql for user dash, to oracle rac for internal...
Yahoo! uses it too.

Sample projects hadoop would be good for - log/message warehouse, database archival store, search team projects (autocomplete), targeted web crawls...
As boxes you can use unused desktops, retired db servers, amazon ec2...

Tools they use to make hadoop include subversion/jira/ant/ivy/junit/hudson/javadoc/forrest
It uses an Apache 2.0 license

Good configs for hadoop:

  • use 7200 rpm sata, ecc ram, 1U servers
  • use linux, ext3 or maybe xfs filesystem, with noatime
  • JBOD disk config, no raid
  • java6_14+

To manage it -

unix utes: sar, iostat, iftop, vmstat, nfsstat, strace, dmesg, friends

java utes: jps, jstack, jconsole
Get the rpm!  www.cloudera.com/hadoop

config: my.cloudera.com
modes - standalone, pseudo-distrib, distrib
"It's nice to use dsh, cfengine/puppet/bcfg2/chef for config management across a cluster; maybe use scribe for centralized logging"

I love hearing what tools people are using, that's mainly how I find out about new ones!

Common hadoop problems:

  • "It's almost always DNS" - use hostnames
  • open ports
  • distrib ssh keys (expect)
  • write permissions
  • make sure you're using all the disks
  • don't share NFS mounts for large clusters
  • set JAVA_HOME to new jvm (stick to sun's)

HDFS In Depth

1.  NameNode (master)
VERSION file shows data structs, filesystem image (in memory) and edit log (persisted) - if they change, painful upgrade

2.  Secondary NameNode (aka checkpoint node) - checkpoints the FS image and then truncates edit log, usually run on a sep node
New backup node in .21 removes need for NFS mount write for HA

3.  DataNode (workers)
stores data in local fs
stores data in blk_<id> files, round robins through dirs
heartbeat to namenode
raw socket to serve to client

4.  Client (Java HDFS lib)
other stuff (libhdfs) more unstable
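The checkpointing job of the Secondary NameNode (item 2 above) is easier to picture with a toy model.  In this Python sketch the filesystem image is just a set of paths and the edit log is a list of operations; a checkpoint replays the log onto a copy of the image and then truncates the log.  Everything here is a simplification of my own for illustration, and the real image and edit log formats look nothing like this.

```python
def apply_edit(image, edit):
    """Apply one logged operation to the image (a set of paths)."""
    op, path = edit
    if op == "create":
        image.add(path)
    elif op == "delete":
        image.discard(path)


def checkpoint(image, edit_log):
    """Fold the edit log into a new image, then truncate the log."""
    new_image = set(image)
    for edit in edit_log:
        apply_edit(new_image, edit)
    edit_log.clear()  # the log only needs to hold edits since the last checkpoint
    return new_image


fsimage = {"/logs/a"}
edits = [("create", "/logs/b"), ("delete", "/logs/a"), ("create", "/tmp/x")]
fsimage = checkpoint(fsimage, edits)
```

The point of doing this regularly is that a restart only has to replay a short log instead of every change since the cluster was built.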

hdfs operator utilities

  • safe mode - when it starts up
  • fsck - hadoop version
  • dfsadmin
  • block scanner - runs every 3 wks, has web interface
  • balancer - examines ratio of used to total capacity across the cluster
  • har (like tar) archive - bunch up smaller files
  • distcp - parallel copy utility (uses mapreduce) for big loads
  • quotas
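To show what "ratio of used to total capacity" means for the balancer, here is a small Python sketch that compares each node's utilization to the cluster-wide figure and flags the outliers.  It only classifies nodes; it does not move any blocks, and the node names, numbers, and the 10 percentage point threshold are assumptions picked for the example.

```python
def classify_nodes(nodes, threshold=0.10):
    """Split nodes into over- and under-utilized relative to the cluster.

    nodes maps a node name to (used_bytes, capacity_bytes).
    threshold is the allowed distance from the cluster-wide utilization.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_cap = sum(cap for _, cap in nodes.values())
    cluster_util = total_used / total_cap

    over, under = [], []
    for name, (used, cap) in sorted(nodes.items()):
        util = used / cap
        if util > cluster_util + threshold:
            over.append(name)
        elif util < cluster_util - threshold:
            under.append(name)
    return cluster_util, over, under


# Made-up three-node cluster.
nodes = {"dn1": (900, 1000), "dn2": (500, 1000), "dn3": (100, 1000)}
cluster_util, over, under = classify_nodes(nodes)
```

With those numbers the cluster sits at 50% overall, so a rebalance would want to shift blocks from dn1 toward dn3 and leave dn2 alone.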

has users, groups, permissions - including x; nothing gets executed, but x is used for dirs
hadoop has some access trust issues - used through gateway cluster or in trusted env
audit logs - turn on in log4j.properties

has loads of Web UIs - on namenode go to /metrics, /logLevel, /stacks
non-hdfs access - HDFS proxy to http, or thriftfs
has trash (.Trash in home dir) - turn it on

includes benchmarks - testdfsio, nnbench

Common HDFS problems

  • disk capacity, esp due to log file sizes - crank up reserved space
  • slow but not dead disks, and NICs flapping to slow mode
  • checkpointing and backing up metadata - monitor that it happens hourly
  • losing write pipeline for long lived writes - redo every hour is recommended
  • upgrades
  • many small files


MapReduce

use Fair Share or Capacity scheduler
distributed cache
jobcontrol for ordering

Monitoring - They use ganglia, jconsole, nagios and canary jobs for functionality

Question - how much admin resource would you need for hadoop?  Answer - Facebook ops team had 20% of 2 guys hadooping, estimate you can use 1 person/100 nodes

He also notes that this preso and maybe more are on slideshare under "jhammerb."

I thought this presentation was very complete and bad ass, and I may have some use cases that hadoop would be good for coming up!


Anatomy of an Attack: From Incident to Expedient Resolution

For the first session of the morning on the last day of the TRISC 2009 Conference, I decided to attend the "Anatomy of an Attack: From Incident to Expedient Resolution" talk by Chris Smithee, a Systems Engineer at Lancope.  He talked about the different types of attacks that you see on your network and how flow data can be used to monitor and eliminate some of these types of threats.  My notes from the session are below:


Consider Your Hotel Network Hostile

As I'm preparing to take my trip to New York for the OWASP AppSec Conference, I came across a timely article on the risks involved with using a hotel network.  The Center for Hospitality Research at Cornell University surveyed 147 hotels and then conducted on-site vulnerability testing at 50 of those hotels.  Approximately 20% of those hotels still run basic ethernet hub-type networks and almost 93% offer wireless.  Only six of the 39 hotels that had WiFi networks were using encryption (see my blog on why are people still using WEP for why this is necessary).  What does this mean for you, Joe User?  It means that both your personal and company information is at risk any time you connect to those networks.  The next time you're surfing the web, start paying attention to all of the non-SSL links (http:// versus https://) that you visit.  Then, think about the information that you are passing along to those sites.  Are you signing in with a user name and password?  Entering credit card information?  Whatever it is, you better make sure that it's something that you wouldn't feel bad about if it wound up on a billboard in Times Square, because that's about how risky your transmission could be.

Before you get too concerned, there are a few things you can do to try to prevent this.  First, DO NOT visit any links where you transmit information unencrypted.  This is just asking for trouble.  Since many man-in-the-middle type attacks can still be used to exploit this, my second suggestion is to use some sort of VPN tunnel.  Whether it's a corporate VPN or just a freebie software VPN to your network back home, this allows you to encrypt all traffic over the untrusted hotel network.  Make this your standard operating procedure anytime you connect to an untrusted network (not just a hotel) and you should keep your data much safer.  Lastly, please be sure to have current firewall and anti-virus software on the computer you are using to connect to the untrusted network.  The last thing you want is to get infected by some worm or virus just by plugging in to the network.

One other thing that I think deserves mentioning here is that if you don't absolutely have to use the internet on an untrusted network, then don't do it.  Obviously, there are times when you need access to do work, pay bills, etc, but if you can save those tasks until you reach a more familiar (and hopefully safer) network, that is far and away the best way to keep yourself and your data safe.