I've been working with Splunk a LOT lately. Over the past few months I've changed our Splunk configurations over and over again as I find out new and better ways to do things. I decided to put together this "Web Admin's Guide to Splunk Best Practices" for those of you who are either considering implementing Splunk or who have already implemented Splunk and are having issues getting it to do what you need it to. Hope it helps!
- Write your log data to the main index. This is the default index that all inputs write to, so no extra configuration should be necessary to do this.
- Create a new index for your configuration files. Originally I was writing these out to the main index as well, but they started getting deleted out of the index as it grew to its max size and/or max time. By writing configuration files out to a separate index, I am able to keep these files around for as long as I need to without worrying about them eventually falling off because of new logs coming in.
- If you want to create dashboards which display lists of things like source names, source types, etc., they load a lot faster if you do some pre-processing and load that information into a new index. For example, our developers write all logs to /opt/apps/logs/appname directories and I wanted a dashboard that displayed a list of all of the appnames. I wrote a script, which Splunk calls as a scripted input, that does a find on the /opt/apps/logs filesystem and indexes the path to each file. Then I'm able to use regular expressions to pull out the application names and display that list as a dashboard much faster than querying the main index for that information.
- Most of the Splunk documentation tells you to add your configuration files to the local bundle since that bundle won't get overwritten when Splunk is upgraded. My recommendation is to use the local bundle for server-specific configurations like the outputs.conf and create your own custom bundles for all other configurations (inputs.conf, props.conf, transforms.conf, etc.) that similar servers might share. This allows me to use the same bundle for, say, all of my Apache web servers regardless of environment (dev, test, or prod), but use the outputs.conf in the local bundle to send the data to the right indexing server (one for each environment).
- Create new bundles for each different type of server. You should have different bundles for your web servers, application servers, database servers, etc. The general rule of thumb should be to use the same bundle if they can use the same (or very similar) inputs.conf files.
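The scripted input mentioned above (the one that lists everything under /opt/apps/logs) can be sketched roughly as follows. This is a minimal sketch and not my original script: it assumes the layout /opt/apps/logs/appname/..., and the function names list_log_files and app_name are made up for illustration. The regular expression is shown in Python only to illustrate the extraction; in practice that step happens at search time in Splunk.

```python
import os
import re

LOG_ROOT = "/opt/apps/logs"  # directory named in the post

# Assumed layout: /opt/apps/logs/<appname>/<anything below it>
APP_NAME_RE = re.compile(r"^/opt/apps/logs/([^/]+)/")


def list_log_files(root=LOG_ROOT):
    """Yield the full path of every file under root, roughly like `find root -type f`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def app_name(path):
    """Return the appname component of a log path, or None if the path does not match."""
    match = APP_NAME_RE.match(path)
    return match.group(1) if match else None


if __name__ == "__main__":
    # A scripted input's standard output is what gets indexed: one path per line.
    for path in list_log_files():
        print(path)
```

For example, app_name("/opt/apps/logs/billing/server.log") returns "billing" (a hypothetical application name), while a path outside /opt/apps/logs returns None.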
3. Deployment Server
- If you did what I suggested earlier and created bundles for each different type of server, then the transition to using a deployment server to update your bundles remotely should be a piece of cake. Just set up your deployment.conf file on the forwarding servers and on the indexing server and then drop your bundles in the $SPLUNK_HOME/etc/modules/distributedDeployment/classes directory. Now, you modify your bundles in a single location for each environment and all of the servers are updated. It takes a little bit of extra time to set this up, but it will save you tons of time if you are constantly tweaking your configuration bundles like I do.
- We were initially seeing about 40 GB/day worth of logs on servers where I couldn't find 40 GB total worth of log files on the entire system. Eventually, we tracked the issue down to how Splunk classifies XML files by default. These files automatically get categorized as the "xml_file" sourcetype. Splunk thinks of this sourcetype as a configuration file where the entire file should get re-indexed if any part of it changes. The problem comes whenever you do logging in an XML format. With each new log (XML event) written out to the file, Splunk re-indexes the entire file. So as you log more and more, the file gets bigger and bigger, and Splunk's per-day usage skyrockets. To avoid this, just make sure that any XML logs get classified as something other than "xml_file".
- Splunk tells when a log file has been changed by keeping a checksum on the head and tail of the log files it's monitoring. This allows it to tell if a file has been renamed by something like a logrotate script so it won't just re-index it as a new file. Unfortunately, Splunk uses zcat to evaluate the contents of compressed files, so it is not able to compare checksums in the event that a file is compressed as part of the logrotate process. It indexes these compressed files as though they were brand new files (even though the contents were already indexed before compression). While not nearly as impactful as the "xml_file" issue above, this will still double your per-day usage for these files. My best suggestion here is to blacklist all compressed files (.gz, .zip, etc.) and do a batch import on them when you begin indexing with Splunk for the first time. This will give you the contents of the old log files without indexing the new ones twice.
- If you are using any scripted inputs, do not place the scripts or any files created by the scripts inside of a bundle being deployed using the Deployment Server. This is because the bundles deployed using the Deployment Server are stored in a tar archive format. Splunk is able to read this format for the configurations, but as far as I can tell doesn't actually extract the bundle from the tar archive. So while your scripts would get deployed, you won't really be able to reference them in the inputs.conf. Instead, place the scripts and the files they write in either the local bundle on each server or a new bundle that is not deployed with the Deployment Server. You can still use an inputs.conf inside of a deployed bundle to call the scripted input since you now know the exact path to the script.
- Splunk provides a "Licensing" tab in the admin section where you can view your license and see your daily use. This helps you to figure out where your usage stands in relation to your license, but doesn't do a thing to help you evaluate where that usage is coming from. I created a custom dashboard I titled "Splunk License Usage" that displays the total usage, usage by host, usage by source, and usage by source type over the past 24 hours. This allows me to track down my biggest loggers to figure out what is really necessary, and it makes it easier to spot problems. You can download it off of SplunkBase here.
- If you're logging with log4j, you probably know that you can set the logging level to debug, info, warn, or error. Make sure you set expectations with your developers to only log error messages in production. The rest of those log levels are probably fine in dev or test, but constitute excessive logging for prod and eat up valuable disk space, CPU cycles, etc.
- Check the dates on your various types of files to make sure that the date in the file matches the date Splunk assigns to it. In particular, search for events dated in the future to see if you turn up results for dates that haven't happened yet. If this happens, you will need to specify time formats for those files in the props.conf.
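To see why the "xml_file" issue described above inflates usage so badly, here is a back-of-the-envelope sketch. It assumes the worst case, where the whole file is re-indexed after every single appended event, and the event count and event size are made-up numbers rather than measurements from our servers.

```python
def bytes_indexed_with_full_reindex(event_count, event_size):
    """Total bytes indexed if the whole file is re-read after every appended event."""
    # After the k-th event the file holds k * event_size bytes, all of it indexed again.
    return sum(k * event_size for k in range(1, event_count + 1))


def bytes_indexed_incrementally(event_count, event_size):
    """Total bytes indexed if only the newly appended event is read each time."""
    return event_count * event_size


# Hypothetical workload: 1,000 XML events of 500 bytes each.
full = bytes_indexed_with_full_reindex(1000, 500)   # 250,250,000 bytes
incremental = bytes_indexed_incrementally(1000, 500)  # 500,000 bytes
print(full / incremental)  # 500.5
```

Under those assumptions a 500 KB log file counts as roughly 250 MB of indexed data, and the gap grows with the square of the number of events, which is consistent with seeing more daily usage than there are log files on disk.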
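The future-dated events mentioned in the last item above usually come from an ambiguous date being read with the wrong format. The sketch below illustrates that in plain Python with strptime-style patterns; it is not Splunk code, and the sample timestamp, the "current" time, and the helper names are all hypothetical.

```python
from datetime import datetime


def parse_event_time(raw, time_format):
    """Parse a raw timestamp string using an explicit strptime-style format."""
    return datetime.strptime(raw, time_format)


def is_future(event_time, now):
    """Return True if the event is dated after the reference time."""
    return event_time > now


raw = "05/06/09 10:00:00"            # ambiguous: May 6 or June 5?
now = datetime(2009, 5, 20, 12, 0)   # pretend this is the current time

month_first = parse_event_time(raw, "%m/%d/%y %H:%M:%S")  # read as May 6, 2009
day_first = parse_event_time(raw, "%d/%m/%y %H:%M:%S")    # read as June 5, 2009

print(is_future(month_first, now))  # False
print(is_future(day_first, now))    # True
```

The same string lands either two weeks in the past or two weeks in the future depending on the pattern, which is why pinning the time format for each file type removes the guesswork.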
That's it for now, but this will be a living document that I plan on updating as new "best practices" are realized. Please feel free to leave comments or add suggestions. Thanks! - Josh