Note that the key to these queries is that, by default, the Universal Forwarder sends its internal logs to your indexer(s). Because these are internal events, they don't count against your license. I'm pretty lazy, so you will note that nearly all of them use rex commands to create fields on the fly instead of adjusting my props/transforms.
One of the first queries I created came after I found some events in the SoS dashboard that shows warning and error messages. There really isn't any formatting for the alert but /shrug. I have this running every hour as a scheduled search. 99% of the time it tells me I'm using a deprecated way to reference machine types, but the new method doesn't (or at least didn't) work. The query will also show you if you've referenced an app in your serverclass.conf file that doesn't exist. Most of the time what this really means is you have mistyped the app's name; remember those bloody things are case sensitive.
index=_internal source=*splunkd.log (component=application OR component=serverclass) warn OR error
This next one is a go-to query with a number of troubleshooting uses. Hopefully you are pushing out a new deploymentclient.conf file to all of your agents that, at a minimum, adjusts the default phonehome interval. The default is every 30 seconds.
index=_internal source=*splunkd_access.log POST phonehome
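For reference, the phonehome interval mentioned above is controlled by the phoneHomeIntervalInSecs setting in deploymentclient.conf. Here is a minimal sketch of what a pushed file might contain; the 300 second value and the deployment server host are illustrative assumptions, not recommendations:

```ini
# deploymentclient.conf (sketch; values are illustrative assumptions)
[deployment-client]
# How often the agent checks in with the deployment server, in seconds.
phoneHomeIntervalInSecs = 300

[target-broker:deploymentServer]
# Placeholder host and management port; substitute your own deployment server.
targetUri = deploymentserver.example.com:8089
```

A longer interval means fewer phonehome events in splunkd_access.log, but also a longer wait before agents pick up a new or changed app.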
Here are two queries that capture different aspects of "how busy are my local Splunk agents"; I'm sure there are others. The first gives you a feel for the number of files open at any one time. I could be wrong in the way I interpret this, but by default the Splunk UF limits itself to 100 open files at any one time. It also, by default, keeps a file open for 3 seconds waiting for new events to come in. If you are getting hits, I take it to mean (maybe? probably?) that during that 3 second period it had 100 files open.
index=_internal "File descriptor cache is full" | rex "is full \((?<fd_limit>\d+)" | stats count sparkline by host, fd_limit | sort -fd_limit, -count
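The two defaults described above appear to correspond to max_fd in limits.conf and time_before_close in inputs.conf. Here is a hedged sketch of how they might be set explicitly; the value 256 and the monitored path are made-up examples:

```ini
# limits.conf (sketch)
[inputproc]
# Maximum number of file descriptors kept open for monitored files; default is 100.
max_fd = 256

# inputs.conf (sketch; the monitor path is hypothetical)
[monitor:///var/log/example]
# Seconds without modification before a file can be closed at end of file; default is 3.
time_before_close = 3
```

If the "File descriptor cache is full" events stop after raising max_fd, that would support the interpretation above, though it does not prove it.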
The other query looks for the UF saying it has hit the throughput per second limit (default of 256 kb/s) and is throttling itself. The catch with this is you aren't going to get that exact kb/s figure but will see a little over or under. To account for that, I have some pretty broad ranges in my case statement. I haven't adjusted any agents to over 512 but figured I might as well account for that generically. I've set a lower bound on the count, as you will likely see these events when an agent restarts. Now that I think about it, I should probably add that to the file descriptor query as well.
index=_internal "current data throughput" | rex "Current data throughput \((?<kb>\S+)" | eval rate=case(kb < 500, "256", kb > 499 AND kb < 520, "512", 1=1, "Other") | stats count by host, rate | where count > 4 | sort -rate,-count
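The throttle this query detects is set by maxKBps in the [thruput] stanza of limits.conf. Here is a sketch, assuming you want to lift an agent from the 256 default to 512; the value is only an example:

```ini
# limits.conf (sketch)
[thruput]
# Throughput cap in kilobytes per second; the Universal Forwarder ships with 256.
# A value of 0 removes the limit entirely.
maxKBps = 512
```

Agents that keep showing up in the "256" bucket with high counts are the obvious candidates for a change like this.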
One of my more recent additions is a query that looks for file permission messages. Note that you will likely see only one event per file path when the agent is first starting up, so low counts aren't something to ignore. This is great for those cases when you are beating your head on your desk wondering why the data you are trying to ingest isn't showing up (maybe I'm the only one who has had that issue). I don't have it list the path(s) in question because, with the stats command, clicking on a result shows you the raw events, and each path might have unique issues /shrug. You could do a values() I guess.
index=_internal "permission denied" | stats count by host | sort -count
The last query that I'll share for now is one that I have set to run as a realtime search with a 5 minute window. I don't have it running all the time, mind you, but it shows you when apps are installed or uninstalled on agents. This is useful on a number of fronts, but especially if you HAVE adjusted your deploymentclient.conf phonehome interval, so you know 1) that the new app has been installed and 2) approximately when you will see any changes to your data based on whatever it is you pushed. Generally speaking you will see these events prior to the agent rebooting, but sometimes on busy agents you see them after the fact. It's fun to watch when you push out a new or updated app that goes to lots of systems.
index=_internal sourcetype=splunkd DeployedApplication "installing" OR "uninstalling" | rex "WARN\s+DeployedApplication - (?<action>\S+)\sapp\S+\s(?<app>\S+)" | table _time host action app | sort -_time
Anyone have additional searches to share along these lines? Advice on these?
This one works... (after I named the fields)
index=_internal sourcetype=splunkd DeployedApplication "installing" OR "uninstalling" | rex "WARN\s+DeployedApplication - (?<action>\S+)\sapp\S+\s(?<app>\S+)" | table _time host action app | sort -_time
Great job btw. cheers
Alex - sorry about that. The fields are named but because they are between the angled braces Google must have removed them. Not sure how to work around that. Will have to beat on it a bit.
I think it has been fixed!