Tuesday, July 2, 2013

Solve for 80% - find logs needing work in Splunk

There are a couple of sayings, maxims if you will, that I try to keep in the back of my mind as I do things:
  • Most times coincidence is God acting anonymously
  • Activity != Accomplishment
  • Effectiveness and efficiency are two different concepts
  • Solve for 80%


As I try to ride herd on a large volume of logs coming in every day, that last item comes to mind often. It isn’t that you shouldn’t take things to completion; it’s that when you have no lack of things that need to be done, priority and other variables aside, there comes a point where pushing something past 80% completion takes more effort than bringing another area up to 80%. In theory, once you’ve brought multiple areas up to 80% you’ve raised the bar and can circle back to that first item and solve again for (a new) 80%.

Now that I’ve done a goodly bit of work on the higher-priority logs we have coming into Splunk, I wanted a query that could help me ID logs that still need work as they are indexed. The beauty of Splunk is you are able to throw all kinds of data at it w/o worrying too much about format. One of the downsides of Splunk is that you can throw all kinds of data at it w/o worrying too much about format =). Since one of the other really cool things with Splunk is that you can define new search-time field extractions at any time, which will be applied over any search timeframe, my primary concern is logs that aren’t coming in and being indexed correctly. There are a couple of tells for these sorts of logs:
  • Logs that don’t have the correct hostname. In our case we have a good bit of logs coming in via our syslog collector, so finding logs with THAT server as the host is generally something that should be addressed
  • The linecount field being more than a relatively low number (you have to ignore Windows logs). You should probably treat these as a separate body of work
  • Sourcetypes that end in either ‘too_small’ or a dash and a number. These are default Splunk indicators that you haven’t defined what the logs are (typically in your inputs.conf) and it has tried a few things but doesn’t know what the logs are either. There are ways to turn this off but I haven’t explored those to any great length. Besides, it would make this a harder exercise =)
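
If you want a quick way to eyeball that last tell on its own before running the full query below, something like this should surface the learned sourcetypes (just a sketch - adjust the index filter to taste):

index!=_* | regex sourcetype="-(too_small|\d+)$" | stats count by sourcetype, index | sort -count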

There are quite a number of things you can do to help Splunk ingest data more efficiently (e.g., define the timestamp location and format) but this query is more about stuff that is 'visible', and I'm after effective vs. efficient at this point. The reality is if Splunk hasn't started coughing up blood it's fine, which is more a testament to its robustness than to my ability as an application manager =).
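
For what it's worth, the timestamp and linebreaking settings I'm glossing over live in props.conf. A minimal sketch, assuming a made-up sourcetype name and a timestamp at the start of each line - fit the values to your own logs:

[my_app_log]
# hypothetical sourcetype, for illustration only
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 25
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)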

Before I post the query I came up with I should probably mention a few things. All things being equal, the results are sorted by the number of events meeting any of the search criteria focused on sourcetype. In some cases I’ve used distinct counts, in some I’ve used values, and in some both. I like that values shows you data in a pretty concise way that just getting a distinct count does not. Adjust to your needs. The last item is that we have a few syslog servers for historical reasons, which for our purposes here I’ll call bob1, bob2, and bob3. You will want to adjust that...unless you name your servers bob.

index!=_* | rex field=sourcetype "(?<source_type>.+\-)(?<the_rest>.+)" | search source_type=* | eval delay = round((_indextime-_time)/3600) | rex field=source "(?<path>[^.]+)" | rex field=source "(.+\/|.+\\\)(?<file>[^\.]+)" | eval host_issue = substr(host,1,3) | eval host_issue = if(host_issue="bob","yes","no") | stats values(the_rest) values(host_issue) avg(delay) AS avg_delay dc(linecount) values(linecount) dc(index) values(index) dc(host) dc(path) values(file) count by source_type | eval avg_delay = round(avg_delay,0) | sort -count

By the pipes
0 - Basically I'm saying I'm interested in any logs that aren't in the internal indices. This is much reduced from my own query; you will want to add exclusions as you work through your logs.
1 - So I'm looking for sourcetypes that have a dash in them, as both 'too_small' and the dash-number for learned sourcetypes contain a dash. This will certainly pick up other sourcetypes that you will need to whitelist. It throws everything in the sourcetype field up to the last dash into a field called 'source_type' and everything to the right of that last dash into a field called 'the_rest'.
2 - Limits the results to only those where the source_type field contains a value. The upshot is it doesn't include Windows event logs, and in my case at least the majority of the sourcetypes we've defined ourselves don't have dashes in them. You will need to adjust based on your environment.
3 - Figure out if there are timestamp issues, as in the logs are coming in GMT and you are in some other timezone, or some jackhole hasn't set their system for daylight savings. This value is rounded, with offsets measured in hours.
4 - I've used this a lot to figure out how many unique paths there are in a particular index or whatever. At least in our environment a number of logs are named in the format of a log name, a dot, and a date. This pulls out everything up to the first dot. There are obvious limitations, but we are after the big picture here.
5 - Here I'm pulling out the name of the file being ingested. Note how this and #4 play out in the end (guess you will have to run it to see). Works for Windows and *nix paths.
6 & 7 - there is probably a more efficient way to do this. I wanted to highlight if my log servers (the bobs) where in the results but you can't use wildcards in eval so I just trimmed the hostname field. This works because mine are all named bob*; YMMV.
8 - Now comes the meat of the query. Split out by source_type (remember I've broken up the actual sourcetype field), I'm looking at:
  1. distinct values of the rest of the sourcetype - do I need to whitelist, are the logs small, or is there a lot of variation in the logs (indicated by a high integer and/or lots of different integers)?
  2. do I need to do some work defining the host field (again this is in context of some logs coming in via syslog)
  3. any time zone issues to take care of? 0s are good. Negative numbers are things coming in from the future (timezone east of your location)
  4. is there a diversity of linecounts in the logs, indicative of logs not breaking correctly?
  5. what linecount values are there?
  6. how many indices are the logs in?
  7. what indices are the logs in?
  8. how many hosts are in play? This is one of those fields where looking at the results can help you prioritize and/or figure out the magnitude of the cleanup work you have to do
  9. how many different files are you having to work with? Similar to the above as you figure out which things you are going to tackle first
  10. last but not least - how many logs are we talking about? This goes with the final pipe, where I've sorted the results in descending order
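
As promised in pipes 6 & 7, eval's match() will do the bob check in a single pipe. I haven't beaten on this version, so treat it as a sketch and keep the substr approach if it works for you:

... | eval host_issue = if(match(host, "^bob"), "yes", "no") | ...
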
I generally run this over 15 or 30 minutes as I'm interested more in a sampling than anything else - at least anymore. When you first start this up you might want to run it over a longer window.
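
If you'd rather pin the window in the search itself instead of using the time picker, you can hang the time range off pipe 0 - purely an illustration, use whatever window makes sense for you:

index!=_* earliest=-30m@m | ... (rest of the query unchanged)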

Hopefully this has been helpful. Any thoughts for making it better either from a data visibility or efficiency perspective? Hopefully the query still works! As I've gone through it I've made some adjustments to make it more efficient.
