I’ve been working with Splunk a LOT lately. Over the past few months I’ve changed our Splunk configurations over and over again as I find out new and better ways to do things. I decided to put together this “Web Admin’s Guide to Splunk Best Practices” for those of you who are either considering implementing Splunk or who have already implemented Splunk and are having issues getting it to do what you need it to. Hope it helps!

Configuration

1. Indexes

  • Write your log data to the main index. This is the default index that all inputs write to, so no extra configuration is necessary.
  • Create a new index for your configuration files (see the sketch after this list). Originally I was writing these out to the main index as well, but they started getting deleted as the index hit its max size and/or max time. By writing configuration files out to a separate index, I can keep them around for as long as I need without worrying about them eventually falling off as new logs come in.
  • If you want to create dashboards that display lists of things like source names or source types, they load a lot faster if you do some pre-processing and write that information to a new index. For example, our developers write all logs to /opt/apps/logs/appname directories, and I wanted a dashboard that displayed a list of all of the appnames. I wrote a script, which Splunk calls as a scripted input, that does a find on the /opt/apps/logs filesystem and indexes the path to each file. Then I can use regular expressions to pull out the application names and display that list in a dashboard much faster than querying the main index for the same information.
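
For the separate configuration index, there are really just two pieces: an index definition on the indexing server and an index setting on the input that watches the config files. Here's a minimal sketch, assuming Apache config files under /etc/httpd/conf; the index name, sourcetype, and retention value are made-up examples, so double-check the attribute names against the indexes.conf and inputs.conf spec files that ship with your Splunk version.

      # indexes.conf on the indexing server
      [config_archive]
      homePath   = $SPLUNK_DB/config_archive/db
      coldPath   = $SPLUNK_DB/config_archive/colddb
      thawedPath = $SPLUNK_DB/config_archive/thaweddb
      # keep events around for roughly five years before they age out
      frozenTimePeriodInSecs = 157680000

      # inputs.conf in the bundle on the monitored server
      [monitor:///etc/httpd/conf]
      index = config_archive
      sourcetype = apache_config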

2. Bundles

  • Most of the Splunk documentation tells you to add your configuration files to the local bundle, since that bundle won’t get overwritten when Splunk is upgraded. My recommendation is to use the local bundle only for server-specific configurations like outputs.conf, and to create your own custom bundles for everything else (inputs.conf, props.conf, transforms.conf, etc.) that similar servers might share. This lets me use the same bundle for, say, all of my Apache web servers regardless of environment (dev, test, or prod), while the outputs.conf in the local bundle sends the data to the right indexing server (one for each environment). See the sketch after this list.
  • Create new bundles for each different type of server. You should have different bundles for your web servers, application servers, database servers, etc. The general rule of thumb should be to use the same bundle if they can use the same (or very similar) inputs.conf files.
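
To make the local-versus-custom split concrete, here's roughly how the Apache example lays out. The bundle name, sourcetype, group name, host, and port are all placeholders, and I'm assuming bundles live under $SPLUNK_HOME/etc/bundles; the [tcpout] attributes are the standard outputs.conf forwarding settings, but verify them against the spec file for your version.

      # $SPLUNK_HOME/etc/bundles/apacheWeb/inputs.conf -- shared by every Apache server
      [monitor:///var/log/httpd]
      sourcetype = access_combined    # or whatever sourcetypes fit your access/error logs

      # $SPLUNK_HOME/etc/bundles/local/outputs.conf -- server specific, points at the
      # indexing server for this particular environment
      [tcpout]
      defaultGroup = prod_indexers

      [tcpout:prod_indexers]
      server = splunk-prod.example.com:9997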

3. Deployment Server

  • If you did what I suggested earlier and created bundles for each different type of server, then the transition to using a deployment server to update your bundles remotely should be a piece of cake. Just set up your deployment.conf file on the forwarding servers and on the indexing server and then drop your bundles in the $SPLUNK_HOME/etc/modules/distributedDeployment/classes directory. Now, you modify your bundles in a single location for each environment and all of the servers are updated. It takes a little bit of extra time to set this up, but it will save you tons of time if you are constantly tweaking your configuration bundles like I do.
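
For reference, the layout on the deployment server ends up looking something like the tree below (the bundle names are just examples). I haven't reproduced deployment.conf itself here because its stanza names vary; check the deployment.conf spec file that ships with your Splunk version for the exact client and server settings.

      $SPLUNK_HOME/etc/modules/distributedDeployment/classes/
          apacheWeb/
              inputs.conf
              props.conf
              transforms.conf
          appServer/
              inputs.conf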

4. Inputs

  • We were initially seeing about 40 GB/day worth of logs on servers where I couldn’t find 40 GB total worth of log files on the entire system. Eventually we tracked the issue down to how Splunk classifies XML files by default. These files automatically get categorized with the “xml_file” sourcetype, which Splunk treats as a configuration file: if any part of the file changes, the entire file gets re-indexed. The problem comes whenever you do logging in an XML format. With each new log (XML event) written out to the file, Splunk re-indexes the entire file, so as you log more and more, the file gets bigger and bigger and Splunk’s per-day usage skyrockets. To avoid this, just make sure that any XML logs get classified as something other than “xml_file” (example after this list).
  • Splunk tells whether a log file has changed by keeping a checksum of the head and tail of each file it’s monitoring. This lets it recognize a file that has merely been renamed by something like a logrotate script, so it won’t re-index it as a new file. Unfortunately, Splunk uses zcat to evaluate the contents of compressed files, so it cannot compare checksums when a file is compressed as part of the logrotate process. It indexes these compressed files as though they were brand-new files, even though the contents were already indexed before compression. While not nearly as impactful as the “xml_file” issue above, this will still double your per-day usage for these files. My best suggestion is to blacklist all compressed files (.gz, .zip, etc.) and do a batch import of them when you begin indexing with Splunk for the first time (example after this list). This gives you the contents of the old log files without indexing the new ones twice.
  • If you are using any scripted inputs, do not place the scripts, or any files created by them, inside a bundle being deployed with the Deployment Server. Bundles deployed this way are stored in a tar archive; Splunk can read the configurations out of that format, but as far as I can tell it doesn’t actually extract the bundle from the tar archive. So while your scripts would get deployed, you won’t really be able to reference them in the inputs.conf. Instead, place the scripts and the files they output either in the local bundle on each server or in a new bundle that is not deployed with the Deployment Server. You can still use an inputs.conf inside a deployed bundle to call the scripted input, since you now know the exact path to the script (sketch after this list).
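
For the XML logging problem in the first bullet of this section, the simplest fix I can sketch is to pin those files to your own sourcetype in the inputs.conf of the relevant bundle so they never fall into the default “xml_file” classification. The sourcetype name here is made up, and the “...” is Splunk’s recursive directory wildcard, not an omission; double-check the wildcard syntax against the inputs.conf documentation for your version.

      # inputs.conf -- force XML logs onto our own sourcetype
      [monitor:///opt/apps/logs/.../*.xml]
      sourcetype = app_xml_log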
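
For the compressed files in the second bullet, the monitor stanza gets a blacklist and the historical archives get loaded once through a batch input. Another sketch; the paths are examples, and note that a sinkhole batch input deletes files after indexing them, so point it at copies of the archives rather than the originals.

      # inputs.conf -- don't watch the rotated/compressed copies
      [monitor:///opt/apps/logs]
      blacklist = \.(gz|bz2|zip)$

      # one-time load of the old archives when a server is first set up
      [batch:///tmp/old_log_archives]
      move_policy = sinkhole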
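
Finally, for the scripted inputs in the last bullet, the inputs.conf that ships in the deployed bundle can still reference the script by its full path, as long as the script itself lives outside anything the Deployment Server tars up. The script path, interval, index, and sourcetype below are all placeholders; this is also how the appname-listing script from the Indexes section gets wired in.

      # inputs.conf (deployed bundle) -- the script lives in the local bundle on each
      # server, not in this deployed bundle
      [script://$SPLUNK_HOME/etc/bundles/local/bin/list_app_logs.sh]
      interval = 3600
      index = app_inventory
      sourcetype = app_log_paths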

5. Licensing

  • Splunk provides a “Licensing” tab in the admin section where you can view your license and see your daily use. This helps you figure out where your usage stands in relation to your license, but it doesn’t do a thing to help you evaluate where that usage is coming from. I created a custom dashboard I titled “Splunk License Usage” that displays the total usage, usage by host, usage by source, and usage by source type over the past 24 hours. This lets me track down my biggest loggers, figure out what is really necessary, and spot problems more easily. You can download it off of SplunkBase here; a rough version of the underlying search is sketched after this list.
  • If you’re logging with log4j, you probably know that you can set the logging level to debug, info, warn, or error. Make sure you set expectations with your developers to log only error messages in production. The other levels are probably fine in dev or test, but they constitute excessive logging for prod and eat up valuable disk space, CPU cycles, etc.
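
If you can’t grab the dashboard, the searches behind it are easy enough to recreate from Splunk’s own metrics data. A rough version of the “usage by host” panel looks something like the search below; the per_host_thruput group and the kb/series fields come from Splunk’s internal metrics.log and may differ between versions, so verify them on your install, and swap in per_source_thruput or per_sourcetype_thruput for the other panels.

      index=_internal source=*metrics.log group=per_host_thruput earliest=-24h
      | eval GB = kb / 1024 / 1024
      | stats sum(GB) as GB_indexed by series
      | sort - GB_indexed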

Troubleshooting

  • Make sure to check the dates on your various types of files to confirm that the date in the file corresponds to the date Splunk assigns the event. In particular, do a search for events with timestamps in the future to see if you turn up results for dates that haven’t happened yet. If you do, you will need to specify time formats for those files in props.conf (example below).
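
When a file does show up with future-dated events, the fix goes in props.conf for that file’s sourcetype. A minimal sketch: the sourcetype name and format string are examples for an imaginary log that writes dd/mm/yyyy timestamps, but TIME_PREFIX, TIME_FORMAT, and MAX_TIMESTAMP_LOOKAHEAD are the standard props.conf timestamp settings.

      [app_xml_log]
      TIME_PREFIX = timestamp="
      TIME_FORMAT = %d/%m/%Y %H:%M:%S
      MAX_TIMESTAMP_LOOKAHEAD = 20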

That’s it for now, but this will be a living document that I plan on updating as new “best practices” are realized. Please feel free to leave comments or add suggestions. Thanks! – Josh