Sunday, April 6, 2014

Splunk, timestamps, and the DateParserVerbose internal logs - Part 1

Splunk is a pretty powerful piece of software. There are the obvious search and analytic capabilities, but there is some robustness under the covers as well. One of those under-the-cover capabilities is detecting and understanding timestamp data. It's the sort of thing that, as users of the software, we simply accept and generally don't spend a whole lot of time thinking about. From an admin perspective, as you start to put some effort into understanding your deployment and making sure things are working correctly, one of the items to look at is the DateParserVerbose logs. Why, you ask? I've recently had to deal with some timestamp issues. These internal logs document problems related to timestamp extraction and can tell you, for example, whether events are being dropped for a variety of timestamp-related reasons. Dropped events are certainly worthy of some of your time! What about events that aren't being dropped, but where for one reason or another Splunk is assigning a timestamp that isn't correct? In this writeup I will share a query you can use to bring these sorts of events to the surface and distill some quick understanding.

To start with, you should get some familiarity with the logs themselves. To find them, run the following query. Note that unless you've done something with permissions you will likely need the admin role (maybe power - I'd have to look). On top of that, again unless you've changed things, you will need to specify the _internal index because it isn't included in the default search criteria.

index=_internal DateParserVerbose

Simple query, and there are your logs (if there are issues). What is cool about these logs is that someone in the history of Splunk figured it would be worthwhile to include source, host, and sourcetype information in the same delimited format across all of the variations of DateParserVerbose messages. I highlight this because unfortunately that isn't the case with all of Splunk's internal logs. At any rate, the field extractions I chose to use were as follows:

index=_internal DateParserVerbose | rex "source::(?<s>[^\|]+)\|host::(?<h>[^\|]+)\|(?<st>[^\|]+)"

There are a number of other field extractions you could include - one example is sketched below - but let's leave the rest for now.
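As a quick sketch of what I mean, here is the same base search with one more extraction layered on - the count of duplicate messages that Splunk suppresses, which the full query below also uses (the short field names s, h, and st are just the placeholders from above):

index=_internal DateParserVerbose
| rex "source::(?<s>[^\|]+)\|host::(?<h>[^\|]+)\|(?<st>[^\|]+)"
| rex "(?<msgs_suppressed>\d+) similar messages suppressed\."
| table _time h s st msgs_suppressed

The second rex just grabs the suppressed-message count out of the message text; events that don't report one simply won't have the field.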

Now for the meat. Having gone through the majority of the types of logs, I came up with the following query. I was mainly focused on two things. The first was boiling down each type of message into a shortened description, which became a pretty long case statement. The second was using multiple stats commands so that I could 'drill into' the logs top down, with each level showing an increasing amount of detail before getting to the raw logs themselves. I've named the fields such that as you drill down you still get very descriptive field names, but it does add to the overall length of the query.

index=_internal DateParserVerbose
| rex "source::(?<Source>[^\|]+)\|host::(?<Host>[^\|]+)\|(?<Sourcetype>[^\|]+)"
| rex "(?<msgs_suppressed>\d+) similar messages suppressed."
| eval Issue = case(
    like(_raw, "%too far away from the previous event's time%"), "Variability in date/event timestamp - Events dropped",
    like(_raw, "%suspiciously far away from the previous event's time%"), "Variability in date/event timestamp - Events accepted",
    like(_raw, "%outside of the acceptable time window%"), "Timestamp is too far outside configured bounds - Events dropped",
    like(_raw, "%Failed to parse timestamp%"), "Reverting to last known good timestamp",
    like(_raw, "%Accepted time format has changed%"), "Attempting to learn new timestamp format - Events accepted(?)",
    like(_raw, "%The same timestamp has been used%"), "More than 100k+ events have the same timestamp",
    1=1, "fixme")
| stats count sum(msgs_suppressed) as "Duplicate Messages Suppressed" by Sourcetype Issue Host Source
| stats sum(count) as count dc(Host) as Hosts dc(Source) as Sources sum("Duplicate Messages Suppressed") as "Duplicate Messages Suppressed" by Sourcetype Issue
| stats sum(count) as "Total Events" sum("Duplicate Messages Suppressed") as "Total Suppressed Events" list(Issue) as Issues list(Hosts) as Hosts list(Sources) as Sources list("Duplicate Messages Suppressed") as "Duplicate Messages Suppressed" by Sourcetype
| sort -"Total Suppressed Events"

Pretty long query, but I've broken it into sections to make it easier to get your arms around. The second rex extraction pulls out the aggregate count of suppressed messages, since these internal logs will tell you how many events had the same issue over some period of time. These logs can also include some settings information, the timestamp that was read in versus the previous date, and the number of events sharing the same timestamp (I had some in the millions range), but I didn't bother with those for this query. Next is the case statement I mentioned earlier. Here I go through each type of log message and give it a shortened description. As with all case statements, evaluation runs left to right and stops at the first match. This is why I've added the 1=1, "fixme" pair at the end. If a log makes it past all of the other conditions and reaches this one, Splunk says "yep - 1 does equal 1" and fixme is displayed, which makes it easier to spot message types I haven't accounted for. Note that unlike the normal asterisk wildcard used in search, the like() function in eval uses a more SQL-like wildcard: the percent sign.
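If you want to see that catch-all behavior in isolation, here's a minimal sketch - the match string and label are just lifted from the bigger query:

index=_internal DateParserVerbose
| eval Issue = case(
    like(_raw, "%Failed to parse timestamp%"), "Reverting to last known good timestamp",
    1=1, "fixme")
| stats count by Issue

Anything that doesn't match the first condition falls through to 1=1 and shows up as fixme, so a non-zero fixme count means there's a message variation the case statement doesn't cover yet.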

I've also used three stats commands. The point is that as you move from stats to stats the data gets more and more aggregated, so what is actually displayed is summary data. Note that as you progress through the stats commands you have more fields to the left of the (split) by clause and fewer to the right of it. This gives you an increasing amount of detail as you drill into the query results, before you ever get to the raw data. You could go directly from raw events to counts, but how many times have you displayed data in a dashboard, clicked through to the raw events, and found the data there still overwhelming - or found yourself writing another query to summarize it to a point somewhere between raw and what was on the dashboard?
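Stripped down to just the extracted fields (no case statement), the drill-down pattern looks something like this sketch:

index=_internal DateParserVerbose
| rex "source::(?<Source>[^\|]+)\|host::(?<Host>[^\|]+)\|(?<Sourcetype>[^\|]+)"
| stats count by Sourcetype Host Source
| stats sum(count) as count dc(Source) as Sources by Sourcetype Host
| stats sum(count) as "Total Events" list(Host) as Hosts list(count) as "Events per Host" by Sourcetype

Each pass drops fields from the right of the by clause and rolls them into aggregates (sums, distinct counts, list()) on the left - the same thing the three stats commands in the full query are doing with Issue, Host, and Source.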

The only thing left to do at this point is throw it in the search bar and run it over an appropriately long period of time. Actually, if the results are extensive for a 24-hour period, you might try running it over a shorter window (say 15 minutes) just to help you tackle making changes in bite-sized chunks. Obviously adjust to suit, and then maybe throw it in a dashboard or saved search and schedule it for email delivery once a week or something.
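If you'd rather pin the window in the query itself instead of the time picker, you can put time modifiers on the base search and leave the rest of the pipeline unchanged, something like:

index=_internal DateParserVerbose earliest=-15m latest=now

earliest and latest take the usual relative-time syntax (-24h, -7d@d, and so on).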

Maybe I'll say that again...

Once you get the query to something you like, or something more suited to your environment, and have maybe made a few changes to the appropriate props - have the tool work for you. Schedule it for delivery on some recurring basis so that you see it again. I would even go so far as to say schedule it for delivery even if there are no results (i.e. you don't have timestamp issues). Think of some admin duties like pulling weeds: if you don't keep on top of them, they eventually come back.
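You can set the schedule up through the UI, but for reference here is roughly what the savedsearches.conf stanza looks like - the stanza name, recipient, and cron schedule below are made-up examples, so adjust to your environment:

[DateParserVerbose timestamp issues]
# paste the full query from above as the value of search; name/recipient/schedule here are placeholders
search = index=_internal DateParserVerbose | rex ...
enableSched = 1
cron_schedule = 0 6 * * 1
dispatch.earliest_time = -7d@d
dispatch.latest_time = now
action.email = 1
action.email.to = splunk-admins@example.com
action.email.sendresults = 1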

I hope to follow this article up (soon) with a REST search that will give you visibility into which props settings and app are associated with the sourcetypes showing up in the search above.

Part 2 of this series is here.
