Saturday, July 29, 2023

Finding Log Volume Ingestion Anomalies in Splunk

 

This is for my man Destry, who I met recently in person. He was giving me a bit of good-natured grief about not posting more frequently. So Destry, this is for you!

I’m doing a Splunk tips & tricks workshop this week with some folks who, among other things, asked for a query to identify log volume anomalies. Ahh, volume anomalies. So many variations of this. Several community-developed apps for this can be found on Splunkbase. One might ask why Splunk hasn’t incorporated more of this sort of thing into the Monitoring Console /shrug.

My normal recommendation to folks is to run a few scheduled queries that capture log volume (from the license usage log in the _internal index) and event counts (via tstats) into a ‘summary index’ for long-term retention and quicker analysis. Some of that is likely available in the introspection index, but I’ve not done a deep dive there TBH. The workshop I’m doing is with folks in a multi-tenant environment where each tenant would like to do their own quick analysis.
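If you do go the summary index route, here's a rough sketch of the two scheduled searches I have in mind. Treat it as a starting point only: the summary index name (summary_ingest) and the source values are placeholders, and you'd pick your own schedule and time range.

| tstats count where index=* earliest=-1d@d latest=@d by _time index host sourcetype span=1d
| collect index=summary_ingest source="daily_event_counts"

index=_internal source=*license_usage.log type=Usage earliest=-1d@d latest=@d
| stats sum(b) as bytes by idx st h
| collect index=summary_ingest source="daily_license_volume"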

So let’s define a few goals:

  • Detect when a host is sending abnormally more or less of a data type compared to other hosts
  • Detect when a host is sending abnormally more or less of a data type compared to itself
  • Do both comparisons in one query, to keep compute down and, for simplicity, avoid intermediate steps (like populating or reading from a lookup)

So how do you detect anomalous volume versus normal ebb-and-flow variation? I’m not sure how standard it is, but people commonly lean on the 68-95-99.7 rule: roughly 95 percent of a normally distributed dataset falls within two standard deviations of the mean (average). We’ll use two standard deviations for the initial version of the query and adjust after observing the results.
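As a quick illustration of that math with made-up numbers (this runs on its own if you paste it into a search bar): an average of 10,000 daily events with a standard deviation of 1,500 gives thresholds of 7,000 and 13,000.

| makeresults
| eval avg=10000, stdev=1500, stdev_mod=2
| eval threshold_high = avg + (stdev_mod * stdev), threshold_low = avg - (stdev_mod * stdev)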

TLDR - the query is below, and after it I'll break down each section. If you have a particularly large environment you might run into limitations with eventstats; again, summary indexing is your friend. For this version I'd run the query shortly after midnight, since splitting the timespan by day buckets results midnight to midnight rather than 24 hours back from whenever you run the search. If you wanted to look by hour you could; you'd just need to adjust some of the time snapping I use. Also, based on your environment and needs, you might want to leverage the index more than I've done. A starting cron schedule could be 17 0 * * 1-5

| tstats count where index=* earliest=-8d@d latest=@d by _time index host sourcetype span=1d 
| eval day = strftime(_time, "%w") 
| search NOT day IN (0,6) 
| eventstats avg(count) as avgSTDataVolume stdev(count) as stdSTDataVolume by sourcetype 
| eventstats avg(count) as avgHostSTDataVolume stdev(count) as stdHostSTDataVolume by host sourcetype 
| sort 0 -_time index host sourcetype 
| dedup index host sourcetype 
| eval stdev_mod = 2
| eval avgHostSTThreshold_high = avgHostSTDataVolume + (stdev_mod * stdHostSTDataVolume) 
| eval avgSTThreshold_high = avgSTDataVolume + (stdev_mod * stdSTDataVolume)  
| eval avgHostSTThreshold_low = avgHostSTDataVolume - (stdev_mod * stdHostSTDataVolume) 
| eval avgSTThreshold_low = avgSTDataVolume - (stdev_mod * stdSTDataVolume) 
| where count > avgSTThreshold_high OR count > avgHostSTThreshold_high OR count < avgSTThreshold_low OR count < avgHostSTThreshold_low
| fields _time index host sourcetype count avgHostSTDataVolume avgSTDataVolume
| foreach avg* [ eval <<FIELD>> = round(<<FIELD>>) ]
| rename count as events avgSTDataVolume as "avg events for this sourcetype" avgHostSTDataVolume as "avg events for this host + sourcetype"

Ok so let’s start!

The first thing to do is grab the data. Because this is going to be a single query, we want a historical comparison, and we aren’t looking at the content of individual events, the move is to use tstats. Let’s grab index, host, and sourcetype and split the data into individual days. The earliest and latest values baked into the query will override whatever you set in the time picker.

| tstats count where index=* earliest=-8d@d latest=@d by _time index host sourcetype span=1d 

Great! Now we have the most recent full day plus seven prior days for historical context. Because we’ve used the span option, each row’s _time is bucketed to the start of its day (rendered as YYYY-MM-DD). Thinking ahead, I suspect data volumes will fall going into a weekend and rise coming out of one, so let’s pull the day of the week out of the date with strftime and ignore day 0 (Sunday) and day 6 (Saturday).

| eval day = strftime(_time, "%w") 
| search NOT day IN (0,6) 

The IN operator is a great addition from a few releases ago; it saves you from chaining multiple ORs. OK – we now need to build out the average and standard deviation of the data volume by sourcetype and by host + sourcetype. A simple stats command won’t work here because it would collapse the rows we still need. Eventstats to the rescue! Think of eventstats as stats that writes its aggregate results back onto every event instead of rolling them up into a summary table. Fair warning – the field names will be long, but hopefully they make sense. Note the difference in the split-by fields.

| eventstats avg(count) as avgSTDataVolume stdev(count) as stdSTDataVolume by sourcetype 
| eventstats avg(count) as avgHostSTDataVolume stdev(count) as stdHostSTDataVolume by host sourcetype 
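If eventstats is new to you, here's a tiny throwaway search (made-up hosts and counts) you can paste into a search bar to see the behavior: every row is kept, and the per-host average simply gets appended to each one.

| makeresults count=4
| streamstats count as n
| eval host=if(n<=2,"hostA","hostB"), count=n*100
| eventstats avg(count) as avgHostVolume by host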

What is nice is that the appropriate sourcetype and host + sourcetype calculations now sit on every row. For our purposes we only need the most recent day, so let’s sort the results with the most recent day at the top (the leading 0 removes sort’s default 10,000-result cap). Dedup sheds the rest.

| sort 0 -_time index host sourcetype 
| dedup index host sourcetype 

Now we want to establish the comparison thresholds: average ± (2 × standard deviation), which gives us a low and a high bound for each comparison. I start by defining a variable for the standard deviation modifier so it can be changed quickly if needed. This could all be folded into the where statement and might even be a hair faster, but when you or someone else looks at the query in six months, will it still be as readable?

| eval stdev_mod = 2
| eval avgHostSTThreshold_high = avgHostSTDataVolume + (stdev_mod * stdHostSTDataVolume) 
| eval avgSTThreshold_high = avgSTDataVolume + (stdev_mod * stdSTDataVolume)  
| eval avgHostSTThreshold_low = avgHostSTDataVolume - (stdev_mod * stdHostSTDataVolume) 
| eval avgSTThreshold_low = avgSTDataVolume - (stdev_mod * stdSTDataVolume) 

Now let’s zoom in on just the outliers and trim the fields to what we want to see. I chose not to show the thresholds since they simply are what they are; what I want to see is the host’s event count alongside the averages.

| where count > avgSTThreshold_high OR count > avgHostSTThreshold_high OR count < avgSTThreshold_low OR count < avgHostSTThreshold_low
| fields _time index host sourcetype count avgHostSTDataVolume avgSTDataVolume

It’s likely your averages have quite a few decimal places. A simple round() function can clean that up, but let’s use a foreach loop to save some typing; it’s also a great command to have in the toolbox. Let’s also rename the fields, since this is what people will see.

| foreach avg* [ eval <<FIELD>> = round(<<FIELD>>) ]
| rename count as events avgSTDataVolume as "avg events for this sourcetype" avgHostSTDataVolume as "avg events for this host + sourcetype"

At this point you will want to schedule the search. Again, based on how this was written, you'll want it to run at some point after midnight. If you do cut out the weekend dates, the query only needs to run Monday through Friday, with Monday picking up anomalous volumes from Friday. It does mean that if unusually high or low data comes in over the weekend, this query won't catch it. A cron I'd probably start with is 17 0 * * 1-5. I'm a fan of using odd minutes for my cron jobs because most people think in terms of even numbers or numbers divisible by 5, so your search is less likely to compete with everything else kicking off on the hour or on the fives.
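If you manage scheduled searches through the conf files rather than the UI, the stanza might look roughly like this sketch in savedsearches.conf (the stanza name is made up, the search line is truncated here, and you'd normally add an alert action or have it write results somewhere useful):

[Log Volume Ingestion Anomalies]
enableSched = 1
cron_schedule = 17 0 * * 1-5
search = | tstats count where index=* earliest=-8d@d latest=@d by _time index host sourcetype span=1d | ...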

So that's it! Hopefully the query makes sense. Feel free to pass along any feedback. You might need to account for high-volume data sources (e.g., domain controllers); that can be done at the end of the query.
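For example, a filter tacked onto the end could drop a known-noisy host and sourcetype combination before anyone sees the results; the host and sourcetype values here are hypothetical, so substitute your own.

| search NOT (host="dc*" sourcetype="WinEventLog:Security")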
