Splunk Best Practices
I've been working with Splunk a LOT lately. Over the past few months I've changed our Splunk configurations over and over again as I find out new and better ways to do things. I decided to put together this "Web Admin's Guide to Splunk Best Practices" for those of you who are either considering implementing Splunk or who have already implemented Splunk and are having issues getting it to do what you need it to. Hope it helps!
Configuration
1. Indexes
- Write your log data to the main index. This is the default index that all inputs write to, so no extra configuration should be necessary to do this.
- Create a new index for your configuration files. Originally I was writing these out to the main index as well, but they started getting deleted out of the index as it grew to its max size and/or max time. By writing configuration files out to a separate index, I am able to keep these files around for as long as I need to without worrying about them eventually falling off because of new logs coming in.
- If you want to create dashboards which display lists of things like source names, source types, etc, they load a lot faster if you do some pre-processing and load that information into a new index. For example, our developers write all logs to /opt/apps/logs/appname directories and I wanted a dashboard that displayed a list of all of the appnames. I wrote a script which Splunk calls as a scripted input that does a find on the /opt/apps/logs filesystem and indexes the path to each file. Then, I'm able to use regular expressions to pull out the application names and display that list as a dashboard much faster than querying the main index for that information.
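As a rough illustration of the scripted-input idea in the last bullet, here is a minimal Python sketch. It assumes the /opt/apps/logs/appname layout described above; the function names and the regular expression are my own for illustration, not the actual script. In the setup described, the regular expression is applied inside Splunk at search time; it appears here in Python only to show the pattern.

```python
import os
import re

# Assumed layout from the post: /opt/apps/logs/<appname>/<logfile>
LOG_ROOT = "/opt/apps/logs"

# Captures the directory name directly under LOG_ROOT as the application name.
APPNAME_RE = re.compile(r"^/opt/apps/logs/([^/]+)/")


def list_log_paths(root=LOG_ROOT):
    """Return the full path of every file under root, sorted (like `find root -type f`)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def app_names(paths):
    """Return the sorted, de-duplicated application names found in paths."""
    names = set()
    for path in paths:
        match = APPNAME_RE.match(path)
        if match:
            names.add(match.group(1))
    return sorted(names)


if __name__ == "__main__":
    # A scripted input's standard output is what gets indexed: one path per line.
    for path in list_log_paths():
        print(path)
```

Because the script only emits one short line per file, the dedicated index stays tiny, which is why a dashboard built on it loads so much faster than one that scans the main index.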
2. Bundles
- Most of the Splunk documentation tells you to add your configuration files to the local bundle since that bundle won't get overwritten when Splunk is upgraded. My recommendation is to use the local bundle for server specific configurations like the outputs.conf and create your own custom bundles for all other configurations (inputs.conf, props.conf, transforms.conf, etc) that similar servers might share. This allows me to use the same bundle for say all of my Apache web servers regardless of environment (dev, test, or prod), but use the outputs.conf in the local bundle to send it to the right indexing server (one for each environment).
- Create new bundles for each different type of server. You should have different bundles for your web servers, application servers, database servers, etc. The general rule of thumb should be to use the same bundle if they can use the same (or very similar) inputs.conf files.
3. Deployment Server
- If you did what I suggested earlier and created bundles for each different type of server, then the transition to using a deployment server to update your bundles remotely should be a piece of cake. Just set up your deployment.conf file on the forwarding servers and on the indexing server and then drop your bundles in the $SPLUNK_HOME/etc/modules/distributedDeployment/classes directory. Now, you modify your bundles in a single location for each environment and all of the servers are updated. It takes a little bit of extra time to set this up, but it will save you tons of time if you are constantly tweaking your configuration bundles like I do.
4. Inputs
- We were initially seeing about 40 GB/day worth of logs on servers where I couldn't find 40 GB total worth of log files on the entire system. Eventually, we tracked the issue down to how Splunk classifies XML files by default. These files automatically get categorized as the "xml_file" sourcetype. Splunk treats this sourcetype as a configuration file, where the entire file should get re-indexed if any part of it changes. The problem comes whenever you do logging in an XML format. With each new log (XML event) written out to the file, Splunk re-indexes the entire file. So as you log more and more, the file gets bigger and bigger, and Splunk's per-day usage skyrockets. To avoid this, just make sure that any XML logs get classified as something other than "xml_file".
- Splunk tells when a log file has changed by keeping a checksum on the head and tail of each log file it's monitoring. This lets it tell when a file has been renamed by something like a logrotate script, so it won't just re-index it as a new file. Unfortunately, Splunk uses zcat to evaluate the contents of compressed files, so it is not able to compare checksums when a file is compressed as part of the logrotate process. It indexes these compressed files as though they were brand new files (even though the contents were already indexed before compression). While not nearly as impactful as the "xml_file" issue above, this will still double your per-day usage for these files. My best suggestion here is to blacklist all compressed files (.gz, .zip, etc) and do a batch import on them when you begin indexing with Splunk for the first time. This will give you the contents of the old log files without indexing the new ones twice.
- If you are using any scripted inputs, do not place the scripts or any files created by the scripts inside of a bundle being deployed using the Deployment Server. This is because the bundles deployed using the Deployment Server are stored in a tar archive format. Splunk is able to read this format for the configurations, but as far as I can tell doesn't actually extract the bundle from the tar archive. So while your scripts would get deployed, you won't really be able to reference them in the inputs.conf. Instead, place the scripts and the files they write in either the local bundle on each server or a new bundle that is not deployed with the Deployment Server. You can still use an inputs.conf inside of a deployed bundle to call the scripted input since you now know the exact path to the script.
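To see why the "xml_file" issue in the first bullet of this section can push usage past the total size of the logs on disk, here is a back-of-the-envelope sketch in Python. The event size and count are hypothetical, and it assumes the worst case where the whole file is re-indexed after every single event; if Splunk picks up changes less often, the real number will be lower.

```python
def bytes_indexed_incremental(event_size, event_count):
    """Bytes indexed when only the newly appended events are read."""
    return event_size * event_count


def bytes_indexed_full_reindex(event_size, event_count):
    """Bytes indexed when the whole file is re-read after every appended event.

    After the k-th event the file holds k events, so the total is
    event_size * (1 + 2 + ... + event_count).
    """
    return event_size * event_count * (event_count + 1) // 2


if __name__ == "__main__":
    size, count = 500, 10_000  # hypothetical: 10,000 events of 500 bytes each
    print(bytes_indexed_incremental(size, count))   # 5000000
    print(bytes_indexed_full_reindex(size, count))  # 25002500000
```

With those made-up numbers, a 5 MB log file costs roughly 25 GB of indexing, because the cost grows with the square of the number of events rather than with the file size.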
5. Licensing
- Splunk provides a "Licensing" tab in the admin section where you can view your license and see your daily use. This helps you figure out where your usage stands in relation to your license, but doesn't do a thing to help you evaluate where that usage is coming from. I created a custom dashboard titled "Splunk License Usage" that displays the total usage, usage by host, usage by source, and usage by source type over the past 24 hours. This lets me track down my biggest loggers to figure out what is really necessary, and it makes problems easier to spot. You can download it off of SplunkBase here.
- If you're logging with log4j, you probably know that you can set logging level to debug, info, warn, and error. Make sure you set expectations with your developers to only log error messages in production. The rest of those log levels are probably fine in dev or test, but constitute excessive logging for prod and eat up valuable disk space, cpu cycles, etc.
Troubleshooting
- Check the dates on your various types of files to make sure that the date in the file corresponds correctly with the date Splunk gives it. In particular, search for all dates in the future to see if you turn up results for dates that haven't happened yet. If this happens, you will need to specify time formats for those files in the props.conf.
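Here is a small Python sketch of how a future-dated event can come about when the time format is left to guesswork. The sample timestamp and the "current" time are made up; the point is only that the same string yields two different dates depending on which format is assumed. As far as I know, the TIME_FORMAT setting in props.conf takes the same strptime-style directives used here.

```python
from datetime import datetime


def parse_with_format(raw, fmt):
    """Parse a raw timestamp string using an explicit strptime-style format."""
    return datetime.strptime(raw, fmt)


def is_future(event_time, now):
    """Return True if the parsed event time is later than the reference time."""
    return event_time > now


if __name__ == "__main__":
    raw = "03/04/2010 08:15:00"            # hypothetical log timestamp
    now = datetime(2010, 3, 10, 12, 0, 0)  # hypothetical "current" time

    as_month_first = parse_with_format(raw, "%m/%d/%Y %H:%M:%S")  # March 4
    as_day_first = parse_with_format(raw, "%d/%m/%Y %H:%M:%S")    # April 3

    print(as_month_first, is_future(as_month_first, now))  # 2010-03-04 08:15:00 False
    print(as_day_first, is_future(as_day_first, now))      # 2010-04-03 08:15:00 True
```

Stating the format explicitly removes the ambiguity, so the event lands on the date the application actually meant.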
That's it for now, but this will be a living document that I plan on updating as new "best practices" are realized. Please feel free to leave comments or add suggestions. Thanks! - Josh
February 23rd, 2010 - 11:48
Hi Josh
Great article and many thanks for taking the time to write it, and of course for your Licensing app too. But do you know how to import it into v4.09? Or have you an updated version already?
That would really help me as I’m trying to locate where all my usage is coming from.
Many thanks
Lea
February 23rd, 2010 - 14:18
Lea,
Thanks for checking that out. That’s definitely old (3.x version) and some things like Deployment server are much nicer in the new version of Splunk. Perhaps someday I’ll get around to revising my best practices. I have actually added two new bundles to Splunkbase including the Splunk License Usage bundle modified for 4.x versions of Splunk.
Splunk License Usage
This bundle provides a new dashboard which has several widgets that query to help you determine your Splunk license usage total over the past 24 hours as well as usage by host, source, and sourcetype. It contains timecharts to help you understand usage over time and see usage spikes as well as pie charts to help you to figure out which log files, sourcetypes, and hosts Splunk is indexing the most data from.
http://www.splunkbase.com/apps/Splunk+License+Usage
Splunk Monitoring
The Splunk Monitoring application can be used to monitor your Splunk forwarding nodes from your indexing node using an nmap query script. It creates a new “splunk_monitoring” index and has a single dashboard that displays the overall number of servers that are UP or DOWN as well as the status of each individual server. To use the Splunk Monitoring application, extract the files into your $SPLUNK_HOME/etc/apps directory. The actual monitoring script uses nmap so make sure you have it installed on your indexing node. Edit the $SPLUNK_HOME/etc/apps/splunk_monitoring/local/tags.conf file to include a list of your servers (the actual tag doesn’t matter) or edit the $SPLUNK_HOME/etc/apps/splunk_monitoring/bin/splunk_port_monitor.sh script to point to a different location for the tag_file variable. You will also want to edit that file if you run Splunk on a port other than 8089 or if your nmap executable is located in a location other than /usr/bin/nmap.
http://www.splunkbase.com/apps/Splunk+Monitoring
Enjoy!
February 26th, 2010 - 07:53
Hi Josh!
A very helpful article. I’m in the process of setting up Splunk and was looking at your license usage script for version 3.x. On Splunkbase, however, the script is still only for version 3.x. At least I can’t find it and the link you provide above takes me to the 3.x version of it.
Have to take a look at the monitoring after the weekend.
cheers,
madsen
February 26th, 2010 - 09:30
Madsen/Lea,
I see exactly what you’re talking about with the 4.x version of my Splunk License Usage app not showing up. It worked fine while I was logged in to Splunkbase, but now that I’m not logged in anymore, it just shows the 3.x version of the app. I’ve contacted Emma Dannin and Caleb Poterbin at Splunk support as they helped me get my app on the new Splunkbase. I will update you once it has been made available. Thanks!
March 1st, 2010 - 10:23
Alright, I think we’ve got the issues figured out with SplunkBase and you guys can download the new 4.x version of my Splunk License Usage application here:
http://www.splunkbase.com/apps/All/4.x/App/app:Splunk+License+Usage