This one is for my man Destry, who I recently met in person. He was giving me some good-natured ribbing about not posting more frequently. So Destry, this is for you!
I’m doing a Splunk tips & tricks workshop this week with some folks who, among other things, asked for a query to identify log volume anomalies. Ahh, volume anomalies. There are so many variations of this, and several community-developed apps for it can be found on Splunkbase. One might ask why Splunk hasn’t incorporated more of this sort of thing into the Monitoring Console /shrug.
My normal recommendation to folks is to run a few queries that capture log volume (from the license usage log in the _internal index) and event counts (via tstats) into a summary index for long-term retention and quicker analysis. Some of that is likely available in the introspection index, but I haven’t done a deep dive there, to be honest. The workshop I’m doing is with folks in a multi-tenant environment where each tenant would like to do their own quick analysis.
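If you do go the summary-index route, a rough sketch of the two collection searches might look something like the following (your_summary_index is a placeholder, and each search would be scheduled to run once a day). Event counts by index, host, and sourcetype:
| tstats count where index=* earliest=-1d@d latest=@d by _time index host sourcetype span=1d
| collect index=your_summary_index
And license volume from the license usage log:
index=_internal source=*license_usage.log* type=Usage earliest=-1d@d latest=@d
| stats sum(b) as bytes by idx st h
| collect index=your_summary_index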
So let’s define a few goals:
- Detect when a host is sending abnormally more or less of a data type compared to other hosts
- Detect when a host is sending abnormally more or less of a data type compared to itself
- Do both comparisons in one query, to keep compute down and, for simplicity, avoid intermediate steps (like populating or reading from a lookup)
So how do you detect anomalous volume versus normal ebb-and-flow variation? I’m not sure how standard it is, but people commonly lean on the 68-95-99.7 rule: for a roughly normal distribution, about 95 percent of the data falls within 2 standard deviations of the mean (average). We’ll use that for the initial version of the query and adjust after observing the results.
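As a quick worked example with made-up numbers: if a host averages 10,000 events per day for a given sourcetype with a standard deviation of 1,500, the thresholds work out to 10,000 + (2 * 1,500) = 13,000 on the high side and 10,000 - (2 * 1,500) = 7,000 on the low side, and any day landing outside that band gets flagged.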
TL;DR - the full query is below, and after it I'll break down each section. If you have a particularly large environment you might run into limitations with eventstats; again, summary indexing is your friend. For this version I'd run the query shortly after midnight, since splitting the timespan by day buckets the data midnight to midnight rather than 24 hours back from whenever you hit run. You could look by hour instead, but you'd need to adjust some of the time snapping I use. Also, depending on your environment and needs, you might want to lean on the index field more than I've done here. A starting cron schedule could be 17 0 * * 1-5.
| tstats count where index=* earliest=-8d@d latest=@d by _time index host sourcetype span=1d
| eval day = strftime(_time, "%w")
| search NOT day IN (0,6)
| eventstats avg(count) as avgSTDataVolume stdev(count) as stdSTDataVolume by sourcetype
| eventstats avg(count) as avgHostSTDataVolume stdev(count) as stdHostSTDataVolume by host sourcetype
| sort -_time index host sourcetype
| dedup index host sourcetype
| eval stdev_mod = 2
| eval avgHostSTThreshold_high = avgHostSTDataVolume + (stdev_mod * stdHostSTDataVolume)
| eval avgSTThreshold_high = avgSTDataVolume + (stdev_mod * stdSTDataVolume)
| eval avgHostSTThreshold_low = avgHostSTDataVolume - (stdev_mod * stdHostSTDataVolume)
| eval avgSTThreshold_low = avgSTDataVolume - (stdev_mod * stdSTDataVolume)
| where count > avgSTThreshold_high OR count > avgHostSTThreshold_high OR count < avgSTThreshold_low OR count < avgHostSTThreshold_low
| fields _time index host sourcetype count avgHostSTDataVolume avgSTDataVolume
| foreach avg* [ eval <<FIELD>> = round(<<FIELD>>) ]
| rename count as events avgSTDataVolume as "avg events for this sourcetype" avgHostSTDataVolume as "avg events for this host + sourcetype"
Ok so let’s start!
The first thing to do is grab data. Because this is going to
be a single query, we want a historical comparison, and we aren’t looking at
the content of individual events, the move is to use a tstats query. Let’s grab index, host, and sourcetype and split the data into individual days. The earliest and latest designations baked into the query will override whatever you put in the time picker.
| tstats count where index=* earliest=-8d@d latest=@d by _time index host sourcetype span=1d
Great! Now we have 7 days of data for historical context. Because we’ve used the span option, the _time field in the query results will be bucketed as YYYY-MM-DD. Thinking ahead, I suspect we’ll see data volumes fall and rise going into and out of a weekend, so let’s pull the day of the week out of the date using strftime and ignore day 0 (Sunday) and day 6 (Saturday).
| eval day = strftime(_time, "%w")
| search NOT day IN (0,6)
The IN operator is a great add from a few releases ago; it saves you from stringing together multiple ORs. Ok, we now need to build out the average and standard deviation of the data volume by sourcetype and by host + sourcetype. A simple stats command won’t work here because it would collapse the rows we still need. Eventstats to the rescue! Think of eventstats as a stats command that writes its aggregate calculations back onto every event instead of replacing them. Fair warning: the field names will be long, but hopefully they make sense. Note the differences in the split-by fields between the two lines.
| eventstats avg(count) as avgSTDataVolume stdev(count) as stdSTDataVolume by sourcetype
| eventstats avg(count) as avgHostSTDataVolume stdev(count) as stdHostSTDataVolume by host sourcetype
What’s nice is that the appropriate sourcetype and host + sourcetype calculations now ride along on every row. For our purposes we only need the most recent day, so let’s sort the results to put the most recent day at the top; dedup then sheds the rest.
| sort -_time index host sourcetype
| dedup index host sourcetype
Now we want to establish the comparison thresholds: average + (2 * std deviation) on the high side and average - (2 * std deviation) on the low side. I start by defining a variable for the standard deviation multiplier so it can be changed quickly if needed. This could all be collapsed into the where statement and might even be a hair faster, but when you or someone else looks at the query in 6 months, would it read as well?
| eval stdev_mod = 2
| eval avgHostSTThreshold_high = avgHostSTDataVolume + (stdev_mod * stdHostSTDataVolume)
| eval avgSTThreshold_high = avgSTDataVolume + (stdev_mod * stdSTDataVolume)
| eval avgHostSTThreshold_low = avgHostSTDataVolume - (stdev_mod * stdHostSTDataVolume)
| eval avgSTThreshold_low = avgSTDataVolume - (stdev_mod * stdSTDataVolume)
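For the curious, collapsing everything into the where statement as mentioned above would look roughly like the sketch below; it is the same math, just harder to scan in my opinion.
| where count > avgSTDataVolume + (2 * stdSTDataVolume) OR count > avgHostSTDataVolume + (2 * stdHostSTDataVolume) OR count < avgSTDataVolume - (2 * stdSTDataVolume) OR count < avgHostSTDataVolume - (2 * stdHostSTDataVolume)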
Now let’s zoom in on just those outliers and limit the fields to what we actually want to see. I chose not to show the thresholds since they simply are what they are; what I want to see is this host’s event count and the averages it’s being compared against.
| where count > avgSTThreshold_high OR count > avgHostSTThreshold_high OR count < avgSTThreshold_low OR count < avgHostSTThreshold_low
| fields _time index host sourcetype count avgHostSTDataVolume avgSTDataVolume
It’s likely your averages have quite a few digits after the decimal. A simple round() function can clean that up, but let’s use a foreach loop to cut down on the typing. It’s also a great command to have in your toolbox. Let’s also rename the fields, since this is what people will actually see.
| foreach avg* [ eval <<FIELD>> = round(<<FIELD>>) ]
| rename count as events avgSTDataVolume as "avg events for this sourcetype" avgHostSTDataVolume as "avg events for this host + sourcetype"
At this point you will want to schedule the search. Again, based on how this was written, you will want it to run at some point after midnight. If you do cut out the weekend dates, the query only needs to run Monday through Friday, with Monday's run picking up anomalous volumes from Friday. It does mean that unusually high or low volumes coming in over the weekend won't be caught. A cron schedule I'd probably start with is 17 0 * * 1-5. I'm a fan of using odd minutes for my cron schedules because most people think in even numbers or numbers divisible by 5, so an odd minute helps avoid piling searches onto the same scheduler slots.
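If you manage scheduled searches through config files rather than the UI, the relevant savedsearches.conf settings would look roughly like this sketch (the stanza name is made up, the search key just gets the full query from above, and the dispatch times simply mirror what's already baked into the tstats line):
[Log volume anomaly check]
# placeholder stanza name; paste the full query into the search key
search = <full query from above>
enableSched = 1
cron_schedule = 17 0 * * 1-5
dispatch.earliest_time = -8d@d
dispatch.latest_time = @d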
So that's it! Hopefully the query makes sense. Feel free to pass along any feedback. One thing you might need to account for is high-volume data sources (e.g. domain controllers), which can be handled with a final filter at the end of the query.
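For example, something like the line below tacked onto the very end would carve a couple of chatty hosts out of the results (the host names are placeholders); you could then give those hosts their own copy of the search with different thresholds.
| search NOT host IN (dc01, dc02)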