
Saturday, January 31, 2015

Splunk Apps: Forwarder Health

It is long past time I actually wrote a few posts on the Splunk apps I've created. I woke up far too early for a Saturday morning, and in an effort to avoid doing anything around the house I'll rationalize this as productivity at a general level and feel I've accomplished much! Who knows - it might be of value to my ... ones ... of readers! =)

Actually it was VERY cool to have a guy come up after my presentation at the 2014 Splunk user conference and mention having read my blog while working with ArcSight and now while working with Splunk (thanks Joe!).

Forwarder Health

So our environment currently has some 2,200+ forwarders, which is certainly not the largest environment out there but is likely much larger than average. While there are apps like Splunk on Splunk and Fire Brigade to help identify issues with your indexers and search heads, there wasn't anything to help identify issues with forwarders. Admittedly this is a hefty task, as there are innumerable issues a forwarder can have. I wondered, though, if there was a way to generically detect whether an agent was having issues. The sage-like advice from the Verizon breach reports bubbled up in my mind - start by looking at the size of the haystacks. What if you compared the number of internal logs each forwarder was generating against the average across all forwarders? A couple hours later the bones of the app were in place; the core of the idea is sketched below.
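Here is a minimal sketch of that haystack comparison, not the app's exact search - the two-standard-deviation cutoff is an assumption you'd tune for your own environment:

index=_internal source=*splunkd.log* | stats count by host | eventstats avg(count) as avg_count, stdev(count) as stdev_count | eval deviation=round((count-avg_count)/stdev_count,2) | where abs(deviation)>2 | sort - deviation

Hosts chugging along normally land near the average; a forwarder spewing errors (or one that has gone nearly silent) floats to the top.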

Wednesday, July 16, 2014

Taming verbose Windows logs in Splunk

As you get into the world of logs you quickly realize how 'heavy' Windows logs are. By that I mean verbose. In this space verbose = length, and log length translates into increased storage and licensing costs. Many log generators simply say this did that, or this talked to that over this port. Pretty quick and dirty. Windows logs are generally along the lines of "dear reader, I've observed many events in the course of my life and here is something I thought I should bring to your attention. I will go on at length about this though really only give small pieces of important information with little to no explanation, forcing you to scour the Internet looking for others who have gone through this selfsame issue." I ran a quick search in my Splunk environment and found the average Windows event to be 630 bytes.
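If you want to measure the same thing in your environment, a search along these lines will do it - the index and sourcetype pattern here are assumptions, so point it wherever your Windows events land:

index=main sourcetype=WinEventLog* | eval bytes=len(_raw) | stats avg(bytes) as avg_bytes, count by sourcetype | eval avg_bytes=round(avg_bytes,0)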

Monday, May 5, 2014

Splunk DateParserVerbose logs - Part 2

In part 1 of this subject we talked about what Splunk's DateParserVerbose internal logs are, and I gave an example query that at its heart attempts to roll up and summarize timestamp-related issues. In this post I'll present a query that takes the sourcetypes Splunk is having timestamp trouble with and displays the relevant props configs. We've thrown both queries into the same dashboard to make things easier to work through. I should note a couple things here. The first is that the foreach command is only available in Splunk 6 (I believe). The second is that the REST endpoint I'm getting the config data from is likely only available in 6.

With that out of the way here is the query:
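At its heart it's the rest command pointed at the props endpoint. A stripped-down sketch of that piece (the hard-coded sourcetype is a placeholder for the list the part 1 search feeds in via foreach):

| rest /services/configs/conf-props | search title="your_sourcetype" | table title, TIME_PREFIX, TIME_FORMAT, MAX_TIMESTAMP_LOOKAHEAD, DATETIME_CONFIG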

Sunday, April 6, 2014

Splunk, timestamps, and the DateParserVerbose internal logs - Part 1

Splunk is a pretty powerful piece of software. There are the obvious search and analytic capabilities, but there is some robustness under the covers as well. One of those under-the-cover capabilities is detecting and understanding timestamp data. It's the sort of thing that, as users of the software, we simply accept and generally speaking don't spend a whole lot of time thinking about. From an admin perspective, as you start to put some effort into understanding your deployment and making sure things are working correctly, one of the items to look at is the DateParserVerbose logs. Why, you ask? I've recently had to deal with some timestamp issues. These internal logs generally document problems related to timestamp extraction and can tell you if, for example, logs are being dropped for a variety of timestamp-related reasons. Dropped events are certainly worthy of some of your time! What about logs that aren't being dropped but where, for one reason or another, Splunk is assigning a timestamp that isn't correct? In this writeup I will share a query you can use to bring these sorts of events to the surface and distill some quick understanding.
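The full query does more rollup, but the starting point is simply pulling the DateParserVerbose events out of the _internal index. A sketch - the rex pattern assumes the Context portion of the message uses the source::|host::|sourcetype:: layout, which may vary by version:

index=_internal source=*splunkd.log* component=DateParserVerbose | rex "Context: source::(?<ctx_source>[^\|]+)\|host::(?<ctx_host>[^\|]+)\|sourcetype::(?<ctx_sourcetype>[^\|]+)" | stats count by log_level, ctx_sourcetype | sort - count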

Saturday, February 22, 2014

Splunk - troubleshooting remote agents with the phonehome logs

An issue popped up the other day that was pretty interesting (from a Splunk admin perspective), so I figured I would share. This will likely be pretty long, but hopefully someone will benefit. We had a number of servers with Splunk universal forwarders stop sending logs, but in doing a spot check on their firewalls the server owner noticed traffic still going to our Splunk infrastructure backend. What had happened? The answer lies in the UF phone home logs - do you know where they are and how to read them?
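As a quick taste before the details: on the deployment server side, the phone homes show up in splunkd_access. A sketch, assuming the default _internal collection is in place:

index=_internal sourcetype=splunkd_access uri=*phonehome* | stats max(_time) as last_phonehome, count by clientip | eval last_phonehome=strftime(last_phonehome,"%Y-%m-%d %H:%M:%S") | sort last_phonehome

A forwarder whose last_phonehome keeps advancing is still checking in with the deployment server even if its actual log traffic has stopped - which is exactly what the firewall spot check was seeing.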

Friday, September 27, 2013

I want more time to play!

I find myself in a somewhat strange place today where, because I'm going to be at the Splunk conference next week, I don't have much scheduled that needs to be done (or staged to be done this weekend). This reminds me of a line that has come up a few times as we've been going through the interview and candidate selection process for two open slots we have in the office. We have all been working way too many hours and want some 'free time' back in our normal routine. I'm not talking about a mental health break or time away from the office as much as having a pocket or two of time where we can explore/investigate/work on the little side projects and quality-of-life things that need to be done. They, generally speaking, aren't hard or long things to do, but they get sidelined because of higher priorities.

So I'm monkeying around with a few things in Splunk and, two rabbit holes later, come up with a query that quite frankly doesn't return a whole lot of hits for me over the last month. What it DOES show is a server that wasn't able to install some config packages I was pushing from my deployment server.

index=_internal source=*metrics.log component="DeploymentMetrics" status="failed" | stats max(_time) as time by hostname event scName appName fqname | convert ctime(time)

This event is created on your deployment server. I'm not sure what fqname stands for exactly, but in my case it was showing the path the server was trying to install the app to (fully qualified path name is where my mind goes, though that doesn't quite fit the data). scName is likely server class name, and appName is obviously the app itself - both are references to your serverclass.conf contents. With over 1k agents deployed, the fact that this found issues with only 1 server is pretty cool I suppose. Will likely bake this into the app I'll never create re: first paragraph =)
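If you'd rather see the trend than just the most recent failure, a small variation on the same search (the 30-day window is just an example) rolls the failures up per server class and app:

index=_internal source=*metrics.log component="DeploymentMetrics" status="failed" earliest=-30d | stats count by hostname, scName, appName | sort - count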


Thursday, May 9, 2013

Some queries related to Splunk administration/Deployment Server

Once again time flies /sigh. At times I wish there were more hours in the day, but that would probably just translate into more hours working. Hopefully I'll get over the bubble soon (yeah right). At any rate, we had our first Columbus Splunk user group the other day. It was neat to see others in the local area and talk Splunk...at least as much as we could - the location was a bit noisy. Figured I'd share here a few of the queries I put together in a slide deck related to using (or mostly administering) Splunk's Deployment Server. I guess they aren't specifically related to the DS so much as to general Universal Forwarder (local Splunk agent) health, which you can control with the DS.
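To give a flavor of what I mean by UF health, here's one classic check - a sketch rather than a line from the deck, and it assumes all of your forwarders ship their _internal logs:

| metadata type=hosts index=_internal | eval minutes_quiet=round((now()-recentTime)/60,0) | where minutes_quiet>60 | convert ctime(recentTime) | sort - minutes_quiet

Anything that hasn't been heard from in over an hour floats to the top; adjust the threshold to match how chatty your forwarders normally are.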