Showing posts with label Administration. Show all posts

Saturday, April 9, 2016

Splunk admin tasks after you start getting data in...

I recently had the rare privilege of posting a three-part blog series on Splunk's official site. The focus was on administration tasks Splunk admins should work into their routine. There is a level of assumption when users search in Splunk: that these hosts really are those hosts, and that events observed within a time range really happened then. The series walks through a couple of methodologies for validating those assumptions:

  • Part 1 - Validating host field values: link
  • Part 2 - Validating agent host's system time: link
  • Part 3 - Getting a feel for data ingestion latency: link

Saturday, February 7, 2015

Gaining visibility into ad-hoc data exports from Splunk

Along the same lines of understanding how your users are using Splunk (and whether they might be abusing their access to the data in it) is taking a periodic look at what data they are exporting. By that I mean exporting to a CSV or perhaps generating a PDF of a dashboard. Ideally you would like to know, for example, if this Mark character has exported something, what format it was in, what the search was, and how many records or results were included in the download.
There are a couple of challenges:


  1. Search results (result count, events searched, etc.) are in the internal search completion logs, while the search parameters are in the internal search initiation logs.
  2. Those logs are separate from the web logs that indicate someone has performed one of the export actions.
  3. The various Splunk commands you might use to merge all of this data have limitations you need to keep in mind. For example, using a subsearch to get something like a search_id and pass it to a parent search is limited by default to a 60s runtime and/or 10k results. A join or append is limited to a 60s runtime and/or 50k records, again by default. In even a moderately sized deployment, over the course of several days you will have thousands of searches run once you factor in your users, scheduled content, and internal Splunk processes. One way to mitigate this is to review the detection query output every day, but that seems a little too frequent to me.
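To make the discussion concrete, here is a rough sketch of stitching the web export logs to the audit logs. The sourcetype, URI pattern, and field extractions here are assumptions and will vary by Splunk version, and note the join is subject to the default limits mentioned above:

```
index=_internal sourcetype=splunk_web_access uri="*/jobs/*/results*"
| rex field=uri "/jobs/(?<sid>[^/]+)/results"
| join type=left sid
    [ search index=_audit action=search info=completed
      | rex field=search_id "'(?<sid>[^']+)'"
      | fields sid user search event_count result_count ]
| table _time user sid search result_count
```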

Saturday, January 31, 2015

Splunk Apps: Forwarder Health

It is long past time I actually wrote a few posts on the Splunk apps I've created. I woke up far too early for a Saturday morning and, in an effort to avoid anything around the house, will rationalize this as productivity at a general level and feel I've accomplished much! Who knows - it might be of value to my ... ones ... of readers! =)

Actually it was VERY cool to have a guy come up after my presentation at the 2014 Splunk user conference and mention having read my blog while working with ArcSight and now while working with Splunk (thanks Joe!).

Forwarder Health

Our environment currently has some 2,200+ forwarders, which is certainly not the largest out there but is likely much larger than average. While there are apps like Splunk on Splunk and Fire Brigade to help identify issues with your indexers and search heads, there wasn't anything to help identify issues with forwarders. Admittedly this is a hefty task, as there are innumerable issues a forwarder can have. I wondered, though, whether there was a way to generically detect that an agent was having issues. The sage-like advice from the Verizon breach reports bubbled up in my mind: start by looking at the size of the haystacks. What if you were to take the number of internal logs a forwarder generates and compare it to the average? A couple of hours later the bones of the app were in place.
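The core comparison can be sketched in a search like this; the two-sigma threshold is arbitrary and just for illustration:

```
index=_internal source=*splunkd.log
| stats count by host
| eventstats avg(count) as avg_count, stdev(count) as stdev_count
| eval zscore = round((count - avg_count) / stdev_count, 2)
| where abs(zscore) > 2
| sort - zscore
```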

Wednesday, July 16, 2014

So how big ARE Windows Logs?


In my last post I mentioned how I was rewriting a few Windows events to cut down on Splunk license issues. When trying to size log management solutions in the past I've looked for lists or rules of thumb on the size of Windows events but never really found anything. That being the case, hopefully someone will find this useful. I ran a query in Splunk to get the average byte count per Windows event ID. If you need to figure out log management licensing, this gives you a rough order of magnitude (ROM) to multiply against a sampling of your event count (as in: number of logs on one server over 24 hours * number of similar servers). After the cut you will find a 'csv' listing the Event Viewer (sourcetype), Event ID (EventCode), and average bytes for that ID. Enjoy. Oh - the average byte count across all of our Windows logs is 630.
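The query was along these lines; the index and sourcetype names are assumptions for wherever your Windows data lives:

```
index=windows sourcetype=WinEventLog*
| eval bytes = len(_raw)
| stats avg(bytes) as avg_bytes, count by sourcetype, EventCode
| sort - avg_bytes
```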

Taming verbose Windows logs in Splunk

As you get into the world of logs you quickly realize how 'heavy' Windows logs are. By that I mean verbose. In this space verbose = length, and log length translates into increased storage and licensing costs. Many log generators simply say this did that, or this talked to that over this port. Pretty quick and dirty. Windows logs are generally along the lines of: "Dear reader, I've observed many events in the course of my life and here is something I thought I should bring to your attention. I will go on at length about this, though really only give small pieces of important information with little to no explanation, forcing you to scour the Internet looking for others who have gone through this selfsame issue." I ran a quick search in my Splunk environment and found the average Windows event to be 630 bytes.

Monday, May 5, 2014

Splunk DateParserVerbose logs - Part 2

In part 1 of this subject we talked about what Splunk's DateParserVerbose internal logs are, and I gave an example query that attempts to roll up and summarize timestamp-related issues. In this post I'll present a query that takes the sourcetypes Splunk is having timestamp trouble with and displays the relevant props configs. We've thrown both queries into the same dashboard to make things easier to work through. I should note a couple of things here. The first is that the foreach command is only available in Splunk 6 (I believe). The second is that the REST endpoint I'm getting the config data from is likely only available in 6 as well.

With that out of the way here is the query:
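As a minimal sketch of the props side, using the conf-props REST endpoint and foreach to flag unset timestamp settings (the sourcetype filter is a placeholder, and the field list assumes the usual props.conf timestamp attributes):

```
| rest /services/configs/conf-props
| table title TIME_PREFIX TIME_FORMAT MAX_TIMESTAMP_LOOKAHEAD TZ
| search title="your_sourcetype_here"
| foreach TIME_* MAX_TIMESTAMP_LOOKAHEAD TZ
    [ eval <<FIELD>> = coalesce('<<FIELD>>', "-- not set --") ]
```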

Sunday, April 6, 2014

Splunk, timestamps, and the DateParserVerbose internal logs - Part 1

Splunk is a pretty powerful piece of software. There are the obvious search and analytic capabilities, but there is some robustness under the covers as well. One of those under-the-cover capabilities is detecting and understanding timestamp data. It's the sort of thing that, as users of the software, we simply accept and generally don't spend a whole lot of time thinking about. From an admin perspective, as you start to put some effort into understanding your deployment and making sure things are working correctly, one of the items to look at is the DateParserVerbose logs. Why, you ask? I've recently had to deal with some timestamp issues. These internal logs document problems related to timestamp extraction and can tell you if, for example, logs are being dropped for a variety of timestamp-related reasons. Dropped events are certainly worthy of some of your time! And what about logs that aren't being dropped, but to which Splunk is assigning an incorrect timestamp? In this writeup I will share a query you can use to bring these sorts of events to the surface and distill some quick understanding.
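A rough rollup of the DateParserVerbose messages might look like the following; the match() patterns are assumptions, as the exact message wording varies by Splunk version:

```
index=_internal sourcetype=splunkd component=DateParserVerbose
| eval issue = case(
    match(_raw, "Failed to parse timestamp"), "failed to parse - defaulted",
    match(_raw, "outside of the acceptable time window"), "timestamp outside window",
    true(), "other")
| stats count, latest(_time) as last_seen by issue, host
| convert ctime(last_seen)
```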

Saturday, February 22, 2014

Splunk - troubleshooting remote agents with the phonehome logs

An issue popped up the other day that was pretty interesting (from a Splunk admin perspective), so I figured I would share. This will likely be pretty long, but hopefully someone will benefit. We had a number of servers with Splunk universal forwarders stop sending logs, yet in doing a spot check on their firewalls the server owner noticed traffic still going to our Splunk backend. What had happened? The answer lies in the UF phone home logs - do you know where they are and how to read them?
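For reference, the phone home requests also show up on the deployment server itself in its splunkd access logs; a sketch of summarizing them per forwarder, assuming the connection-string URI format (which may differ in your version):

```
index=_internal sourcetype=splunkd_access uri="*/services/broker/phonehome/*"
| rex field=uri "phonehome/connection_(?<fwd_ip>\d{1,3}(?:\.\d{1,3}){3})_"
| stats latest(_time) as last_phonehome, count by fwd_ip
| convert ctime(last_phonehome)
```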

Thursday, November 7, 2013

A change in log format for Splunk UF 6.x relative to tracking apps using the Deployment Server

I realized two things yesterday while troubleshooting various Splunk issues. The first relates to having multiple input configs sent to a centralized syslog server. The second relates to changes in the internal 6.x UF logs as they relate to tracking apps that have been installed or removed.

Saturday, September 28, 2013

A search on the Splunk mug is wrong!!

For those that haven't seen it, the Splunk mug is a neat little piece of practical schwag that contains queries for things ranging from finding happiness to finding Waldo, and even tracking a zombie infestation. However! I've discovered an issue with one of the searches.

The first thing to understand, if you don't already, is that the asterisk is a wildcard in Splunk. A neat little trick is that when you combine it with a field, as in field=*, your search will return only events where that field contains a value. This makes it a great little inclusive filter, and potentially you won't have to use usenull=f in a chart or timechart later in your search to filter out events where the field isn't populated.
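For example (the index and sourcetype here are hypothetical), filtering on status=* up front means every event reaching the timechart already has the field populated, so no usenull=f is needed:

```
index=web sourcetype=access_combined status=*
| timechart count by status
```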

Friday, September 27, 2013

I want more time to play!

I find myself in a somewhat strange place today: because I'm going to be at the Splunk conference next week, I don't have much scheduled that needs to be done (or staged to be done this weekend). This reminds me of a line that has come up a few times as we've gone through the interview and candidate selection process for two open slots in the office. We have all been working way too many hours and want some 'free time' back in our normal routine. I'm not talking about a mental health break or time away from the office so much as having a pocket or two of time where we can explore, investigate, and work on the little side projects and quality-of-life things that need to be done. They generally aren't hard or long tasks, but they get sidelined by higher priorities.

So I'm monkeying around with a few things in Splunk and, two rabbit holes later, come up with a query that quite frankly doesn't return a whole lot of hits for me over the last month. What it DOES show is a server that wasn't able to install some config packages I was pushing from my deployment server.

index=_internal source=*metrics.log component="DeploymentMetrics" status="failed" | stats max(_time) as time by hostname event scName appName fqname | convert ctime(time)

This event is created on your deployment server. I'm not sure what fqname stands for exactly, but in my case it was showing the path the server was trying to install the app to (fully qualified path name is where my mind goes, though it doesn't quite fit the data). scName is likely server class name, and appName is obviously the app itself - both are references to your serverclass.conf contents. With over 1k agents deployed, the fact that this found issues with only one server is pretty cool I suppose. I will likely bake this into the app I'll never create re: first paragraph =)


Thursday, May 9, 2013

Some queries related to Splunk administration/Deployment Server

Once again time flies /sigh. At times I wish there were more hours in the day, but that would probably just translate into more hours working. Hopefully I'll get over the bubble soon (yeah right). At any rate, we had our first Columbus Splunk user group the other day. It was neat to see others in the local area and talk Splunk... at least as much as we could; the location was a bit noisy. I figured I'd share here a few of the queries I put together in a slide deck related to using (or mostly administering) Splunk's Deployment Server. I guess they aren't specifically related to the DS so much as general Universal Forwarder (local Splunk agent) health, which you can control with the DS.
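One such query, sketched from memory (the field names come from the indexer's metrics.log tcpin_connections group and may differ in your version), finds forwarders that have gone quiet:

```
index=_internal source=*metrics.log group=tcpin_connections
| stats latest(_time) as last_seen by hostname
| eval minutes_silent = round((now() - last_seen) / 60)
| where minutes_silent > 60
| sort - minutes_silent
```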