Saturday, January 31, 2015

Splunk Apps: Forwarder Health

It is long past time I actually wrote a few posts on the Splunk apps I've created. I woke up far too early for a Saturday morning, and in an effort to avoid anything around the house I will rationalize this as productivity at a general level and feel I've accomplished much! Who knows - it might be of value to my ... ones ... of readers! =)

Actually it was VERY cool to have a guy come up after my presentation at the 2014 Splunk user conference and mention having read my blog while working with ArcSight and now while working with Splunk (thanks Joe!).

Forwarder Health

So our environment currently has some 2,200+ forwarders, which is certainly not the largest environment out there but is likely much larger than the average. While there are apps like Splunk on Splunk and Fire Brigade to help identify issues with your indexers and search heads, there wasn't something that helps identify issues with forwarders. Admittedly this is a hefty task, as there are innumerable issues a forwarder can have. I wondered, though, if there was a way to generically detect whether an agent was having issues. The sage-like advice from the Verizon breach reports bubbled up in my mind - start by looking at the size of the haystacks. What if you were to compare the number of internal logs a forwarder was generating to the average across all forwarders? A couple hours later the bones of the app were in place.


When you first install the app you should go to the macro section and adjust the macros that match your search head and indexer naming conventions. As the app's name implies this is about forwarders, and the logs from your sh/idx tier are very verbose, so they need to be excluded. While you are there, the other macro to look at is the blacklist of internal log messages the rest of the searches are NOT looking for - basically the routine messages showing the agent is sending logs to the indexing tier. If you want to adjust those searches to ignore other types of logs, this is the place.
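
For illustration, here is a minimal sketch of what those macros might look like in the app's macros.conf - the definitions are placeholders assuming an idx*/sh* host naming convention and a blacklist that simply drops routine metrics.log chatter, so substitute whatever the app actually ships with and your own naming:

[splunk_indexers]
definition = host=idx*

[splunk_search_heads]
definition = host=sh*

[agent_internal_logs_blacklist]
definition = NOT source=*metrics.log*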

Having already created the Data Curator app with its scoring system for your props configs and field extraction %, I wanted to come up with one for your agent health. What I came up with is pretty simplistic and might need a revisit. All agents start with a score of 10, which is reduced by 1 for each factor their logs are over the average. In other words, if a forwarder is generating 6x the average logs its score would be 4. Looking at my environment's score with my 2,240 forwarders, I have a 9.8. There are 1,871 with a score of 10 and another 100 with a score of 9. There are 24 with a score of 0, so while the scoring methodology is simplistic and not overly telling at this scale, I suppose it does show that it isn't likely most of your forwarders have issues - 24 of 2,240 is right at 1%. The first dashboard you will probably look at is the Internal Event Count Overview. Here is the top portion of mine.


The heart of the queries is:

index=_internal NOT `splunk_indexers` NOT `splunk_search_heads` `agent_internal_logs_blacklist`
| stats count as events by host
| eventstats avg(events) as avg_events
| eval avg_count = floor(events/avg_events)
| eval agent_score = 10 - avg_count
| eval agent_score = if(agent_score<0, 0, agent_score)

The last pipe there is to account for those systems that are generating a metric ton of events, as noted by the dashboard panel in the bottom right of the screenshot above. Good thing internally generated Splunk logs don't count against your license! The bottom panel of the dashboard (not shown) will let you know which systems are your trouble children, and clicking on one of them will open up the next dashboard: Internal Logs - Host View.

This dashboard shows a number of things, but in order to not have to account for every possible issue I figured it would be best to cluster the internal logs. Note the cluster command works differently than a simple stats count, or even a stats count by punct, of the internal logs. While it isn't documented particularly well in terms of the methodology used, the cluster command will group similar events. So while you might have multiple events saying the forwarder can't read individual files - an event per file it is attempting to read - the cluster command should give you an overall count of those events. In the case of the forwarder generating over 400k events, the vast majority are of the "INFO WatchedFile - Will begin reading at offset" variety. However, that is a symptom of the real issue, which shows up on the next 2 lines: there are a number of very small files that the agent is having trouble reading, so it simply ingests each file again.
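
If you want to poke at a noisy forwarder the same way outside the app, a search along these lines will roll similar internal events together and show the biggest offenders first (the host value is a placeholder and this isn't necessarily the exact query the dashboard runs):

index=_internal host=<noisy_forwarder> NOT `splunk_indexers` NOT `splunk_search_heads`
| cluster showcount=true
| table cluster_count _raw
| sort - cluster_count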

The next highly usable dashboard is Limits and Path Monitoring Issues. At a base level I believe detecting when an agent is struggling to keep up with the data it is being asked to read in and forward is a big deal, and one that is a pretty easy win (for these 2 issues, assuming the system and network resources can accommodate changes). For example, the other day we installed a second forwarder on one of our syslog receiving servers for Splunk and didn't adjust the file descriptor limits. In a 24hr period there were thousands of messages indicating the default of 100 simply wasn't enough. We also have a job that runs once a day to show us forwarder limits issues, so I created these as saved searches vs inline panels on the dashboard, which makes it easy for you all to do the same if you are so inclined.
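
If you run into that same file descriptor ceiling, the fix on the forwarder is a small limits.conf bump along these lines - 512 is just an example value for illustration, assuming the box has headroom; size it to your own file volume:

# $SPLUNK_HOME/etc/system/local/limits.conf on the forwarder
[inputproc]
max_fd = 512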

The Forwarder Version Distribution is, I think, a very cool dashboard in that it shows you lots of information in a small space. The top of ours at the moment looks like this

The other panels show you the volume of logs from the different versions, which is useful for a number of things. Part of this dashboard was born out of rediscovering the bug in agents below 5.0.4, which can send almost 2x the logs to the first indexer in your outputs.conf list compared to the other indexers on the list. Now if I can just get a couple large groups to update their old agents!
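
If you just want the raw version breakdown without the app, something like this should get you close - it leans on the tcpin_connections metrics your indexers write to _internal for each connecting forwarder, so treat it as a sketch and sanity check the field names against your own data:

index=_internal source=*metrics.log* group=tcpin_connections
| stats dc(hostname) as forwarders sum(kb) as total_kb by version
| sort - forwarders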

The last dashboard I'll briefly mention is the deployment server clientName string one. I debated putting this here, but we use the clientName string on all of our forwarder agents and there really aren't visualizations for it in other Splunk provided content/apps (that I know of). If the strings are missing or whacked there are issues. Hopefully this isn't too catered to our environment. If you aren't familiar with the clientName string, it is placed in the forwarder's deploymentclient.conf file and can be used as part of the whitelist/blacklist components of your Splunk Deployment Server's serverclass.conf file.
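
For anyone who hasn't used it, here is roughly what that pairing looks like - the clientName value and server class name below are made-up examples rather than anything from our environment:

On the forwarder ($SPLUNK_HOME/etc/system/local/deploymentclient.conf):

[deployment-client]
clientName = dmz_syslog_collectors

[target-broker:deploymentServer]
targetUri = deploymentserver.example.com:8089

On the deployment server (serverclass.conf):

[serverClass:dmz_syslog]
whitelist.0 = dmz_syslog_collectors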

Welp - somewhat long-winded. I'm very open to feedback on the app. Are there things that should be added or tweaked?
