Thursday, May 9, 2013

Some queries related to Splunk administration/Deployment Server

Once again time flies /sigh. At times I wish there were more hours in the day but that would probably just translate into more hours working. Hopefully will get over the bubble soon (yeah right). At any rate we had our first Columbus Splunk user group the other day. Was neat to see others in the local area and talk Splunk...at least in as much as we could. The location was a bit noisy. Figured I'd share here a few of the queries I put together in a slide deck related to using (or mostly administering) Splunk's Deployment Server. I guess they aren't specifically related to the DS as much as general Universal Forwarder (local Splunk agent) health, which you can control with the DS.


Note that the key to these queries is that, by default, the Universal Forwarder sends its internal logs to your Indexer(s). Because these are internal events they don't count against your license. I'm pretty lazy so you will note just about all of them have rex commands to create fields on the fly instead of adjusting my props/transforms.

One of the first queries I created was after I found some events in the SoS dashboard that shows warning and error messages. There really isn't any formatting for the alert but /shrug. I have this running every hour as a scheduled search. 99% of the time it tells me I'm using a deprecated way to reference machine types, but the new method doesn't (or at least didn't) work. The query will also show you if you've referenced an app in your serverclass.conf file that doesn't exist. Most of the time what this really means is you have mistyped the app's name - remember those bloody things are case sensitive.

index=_internal source=*splunkd.log (component=application OR component=serverclass) warn OR error

This next one is a go-to query with a number of troubleshooting applications. Hopefully you are pushing out a new deploymentclient.conf file to all of your agents that at a minimum adjusts the default phonehome interval. The default is every 30 seconds.

index=_internal source=*splunkd_access.log POST phonehome
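
If you do change the phonehome interval, one way to eyeball the effect is to chart the same search over time. This is just a sketch built on the query above; the 5 minute span is an arbitrary pick, so adjust to taste.

```spl
index=_internal source=*splunkd_access.log POST phonehome | timechart span=5m count
```

Once a new deploymentclient.conf makes its way out, the count per bucket should drop as agents pick up the longer interval.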

Two queries that capture different aspects of "how busy are my local Splunk agents"; am sure there are others. The first gives you a feel for the number of files open at any one time. I could be wrong with the way I interpret this but by default the Splunk UF limits itself to 100 open files at any one time. It also, by default, keeps a file open for 3 seconds waiting for new events to come in. If you are getting hits I take it to mean (maybe?; probably?) during that 3 second period it has 100 files open. 

index=_internal "File descriptor cache is full" | rex "is full \((?<fd_limit>\d+)" | stats count sparkline by host, fd_limit | sort -fd_limit, -count

The other query looks for the UF saying it has hit the throughput per second limit (default of 256 kb/s) and is throttling itself. The catch with this is you aren't going to get that exact kb/s hit but will see a little over or under. To account for that I have some pretty broad ranges in my case statement. I haven't adjusted any to over 512 but figured I might as well account for that generically. I've set a minimum bound on the count, as you will likely see a few of these events whenever an agent restarts. Now that I think about it I should probably add that to the file descriptor query as well.

index=_internal "current data throughput" | rex "Current data throughput \((?<kb>\S+)" | eval rate=case(kb < 500, "256", kb > 499 AND kb < 520, "512", 1=1, "Other") | stats count by host, rate | where count > 4 | sort -rate,-count
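
As for adding a minimum count to the file descriptor query, here is roughly what that could look like. A sketch only; the threshold of 4 is simply borrowed from the throughput query, not something tuned for this message.

```spl
index=_internal "File descriptor cache is full" | rex "is full \((?<fd_limit>\d+)" | stats count sparkline by host, fd_limit | where count > 4 | sort -fd_limit, -count
```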

One of my more recent additions is a query that looks for file permission messages. Note that you will likely see only one event per file path, logged when the agent is first starting up, so low counts aren't something to ignore. This is great for those cases when you are beating your head on your desk wondering why the data you are trying to ingest isn't showing up (maybe I'm the only one who has had that issue). I don't have it list the path(s) in question because, given how the stats command works, when you click on the results you will see the raw events, and each path might have unique issues /shrug. You could do a values() I guess.

index=_internal "permission denied" | stats count by host | sort -count

The last query that I'll share for now is one that I have set to run as a realtime search with a 5 minute window. I don't have it running all the time mind you but this shows you when apps are installed or uninstalled on agents. This is useful on a number of fronts but especially if you HAVE adjusted your deploymentclient.conf phonehome timeframe so you know 1) that the new app has been installed and 2) approximately when you will see any changes to your data based on whatever it is you pushed. Generally speaking you will see these events prior to the agent restarting but sometimes on busy agents you see them after the fact. Is fun when you push out a new or updated app that goes to lots of systems.

index=_internal sourcetype=splunkd DeployedApplication "installing" OR "uninstalling" | rex "WARN\s+DeployedApplication - (?<action>\S+)\sapp\S+\s(?<app>\S+)" | table _time host action app | sort -_time
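
For the big pushes, a rolled-up variation of the search above can be easier to read than a raw table. This sketch reuses the same rex and counts distinct hosts per app and action; the hosts field name is just a label made up for the example.

```spl
index=_internal sourcetype=splunkd DeployedApplication "installing" OR "uninstalling" | rex "WARN\s+DeployedApplication - (?<action>\S+)\sapp\S+\s(?<app>\S+)" | stats dc(host) AS hosts by app, action | sort -hosts
```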

Anyone have additional searches to share along these lines? Advice on these?


4 comments:

  1. This one works... (After I named the fields)
    index=_internal sourcetype=splunkd DeployedApplication "installing" OR "uninstalling" | rex "WARN\s+DeployedApplication - (?<action>\S+)\sapp\S+\s(?<app>\S+)" | table _time host action app | sort -_time

    Great job btw. cheers

    Replies
    1. This comment has been removed by the author.

    2. Alex - sorry about that. The fields are named, but because the names sit between angle brackets Google must have stripped them out. Not sure how to work around that. Will have to beat on it a bit.

    3. I think it has been fixed!
