Saturday, January 14, 2017

Adjusting Splunk forwarder phonehome / throughput

I was in the process of writing up a few things for a new EDU that is going to be spinning up a larger-scale Splunk environment, and figured that if I was going to the effort it might as well be posted here for others to see. While working in my own environment today I realized I was making some adjustments that I now take for granted but that we originally had to learn and bake in. This installment focuses on the following:

  1. Adjusting the forwarder to deployment server phone home interval
  2. Allowing forwarders to send more than 256 KBps
By default forwarders phone home every 60 seconds. To let a single deployment server scale as your forwarder count grows, you will want to lengthen this interval. In some environments I've heard of people having forwarders check in only once an hour. We don't go that far, but the reality is that once things are set up for a new host/unit/whatever you will rarely make changes, so very infrequent check-ins are fine. Here are a few things I'd recommend:

1.a - set a baseline interval for all forwarders that is something other than 60s
Config: deploymentclient.conf

Not only do you want to back off how often a forwarder checks in for scaling purposes, but it also helps with troubleshooting. For example, once you've deployed the adjusted phone home interval (by whatever method), if you are trying to figure out why a particular host's forwarder isn't sending logs and it is still checking in every 60 seconds, then it hasn't received the configuration package you are pushing. Our deployment is relatively old, and to account for what we thought was a bug we added the config line to both stanzas:

[deployment-client]
phoneHomeIntervalInSecs = <your setting> 
[target-broker:deploymentServer]
phoneHomeIntervalInSecs = <your setting>
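
If you want to confirm which interval a given forwarder actually picked up, and from which file, btool works well. The path below assumes a Linux universal forwarder installed under /opt/splunkforwarder; adjust for your install location.

/opt/splunkforwarder/bin/splunk btool deploymentclient list --debug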

1.b - create a package that overrides your baseline

Now that all of your forwarders are checking in less frequently, you will want a serverclass.conf stanza that causes selected forwarders to check in more frequently. This is helpful for systems like syslog servers that you interact with more often, hosts where you are onboarding new data, or forwarders you are troubleshooting. There are any number of ways to accomplish this, but since you want all forwarders to get your baseline I suggest leveraging Splunk's configuration order of precedence. In our case we typically use lower case when naming packages, so we use upper case for overrides.
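
As a rough sketch, assume your baseline package sets the interval to 600 seconds. The override package (called PHONEHOME_FAST here purely for illustration) carries its own deploymentclient.conf with a shorter interval, and a serverclass.conf stanza on the deployment server maps the hosts you care about to it. The host name and the 60-second value are placeholders, and since deploymentclient.conf changes generally need a forwarder restart to take effect, the app stanza sets restartSplunkd = true.

# deployment-apps/PHONEHOME_FAST/local/deploymentclient.conf
[deployment-client]
phoneHomeIntervalInSecs = 60

# serverclass.conf on the deployment server
[serverClass:PHONEHOME_FAST]
whitelist.0 = syslog-server-01
[serverClass:PHONEHOME_FAST:app:PHONEHOME_FAST]
restartSplunkd = true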

1.c - create an eventtype for the phonehome logs
Config: eventtypes.conf

You'll likely want a search on a search head that lets you see when forwarders are phoning home. While everyone's environment is different, we suggest creating an eventtype and pushing it to your search head(s) via the deployment server rather than saving ad hoc searches. Call it whatever you want; ours is splunk-phone-home and the config looks like the following. Stepping back a bit, you can put this in a standalone app you push to the search heads, or create a generic admin app to hold global admin configs.

[splunk-phone-home]
search = index=_internal sourcetype=splunkd_access "/services/broker/phonehome/connection" NOT 0ms
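
Once the eventtype is in place, a quick sanity check on the search head might look like the following. The rex is an assumption on my part that the forwarder's IP shows up in the phonehome URI the way it does in our environment; the exact URI format can vary by version, so adjust the pattern as needed.

eventtype=splunk-phone-home
| rex "phonehome/connection_(?<fwd_ip>\d+\.\d+\.\d+\.\d+)_"
| stats latest(_time) as last_phonehome count by fwd_ip
| convert ctime(last_phonehome)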

1.d - increase timeout on your deployment server reload command

This is somewhat related to the topic at hand. As Splunk versions have marched on, we've had to lengthen our phone home interval or else receive a socket error on the deployment server when issuing a reload deploy-server command. This probably has more to do with the number of configuration packages/apps we push from our deployment server than with how many forwarders are checking in. Sadly there isn't a config setting you can adjust, and the default timeout is 30 seconds. To help, I've created the following Linux OS alias for my account on the deployment server:

date && /opt/splunk/bin/splunk reload deploy-server -timeout 180
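
For completeness, the alias itself just lives in my account's ~/.bashrc on the deployment server and wraps that command; the alias name is whatever you like.

# ~/.bashrc on the deployment server
alias dsreload='date && /opt/splunk/bin/splunk reload deploy-server -timeout 180'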


2.a - identifying forwarders that are trying to send more than 256 KBps

By default forwarders will only send data at a rate of 256 KBps (the maxKBps setting in limits.conf). One way to see this in action is to search for data from a host only to notice a time gap in the results. The forwarder does generate internal logs indicating it is hitting the limit, and the Forwarder Health app was created, in part, to identify this. Short of installing that app, the base search we use is:

index=_internal sourcetype=splunkd "current data throughput"
| rex "Current data throughput \((?<kb>\S+)"
| eval rate=case(kb < 500, "256", kb > 499 AND kb < 520, "512", kb > 520 AND kb < 770, "768", kb > 771 AND kb < 1210, "1024", 1=1, ">1024")
| stats count as Count sparkline as Trend by host, rate
| where Count > 4
| rename rate as "Throughput rate(kb)"
| sort -"Throughput rate(kb)", -Count

The 'where' part is designed to limit false positives when a forwarder is restarted and/or there is just a slight bump in log volume. 

2.b - create multiple throughput apps/packages
Config: limits.conf

There are a couple of options when it comes to increasing forwarder throughput. If your environment can handle it, you could just open the gates and let the forwarders blast all their data. The other option is a stepped approach, which is what we've done. To help us layer these packages we again leverage Splunk's config order of precedence and generally name our packages like this:

throughput_limits_100_unlimited
throughput_limits_200_1024
throughput_limits_300_0768
throughput_limits_400_0512

This gives us a couple of options. For example, if a unit's servers generally send a lot of data we can create a serverclass.conf stanza that sets all of their forwarders to 512, but we can also set one server to something higher without having to back that server out of the 512 stanza. Implied here is that you've created a serverclass.conf stanza for each of those packages where you can add servers or units via whitelist.
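
To make that concrete, each package is little more than a limits.conf in the app's local directory, plus a matching serverclass.conf stanza on the deployment server. A rough sketch for the 512 package is below; the whitelist pattern is a placeholder, maxKBps is measured in kilobytes per second, and the forwarder needs a restart for the new limit to apply.

# deployment-apps/throughput_limits_400_0512/local/limits.conf
[thruput]
maxKBps = 512

# serverclass.conf on the deployment server
[serverClass:throughput_limits_400_0512]
whitelist.0 = unitA-*
[serverClass:throughput_limits_400_0512:app:throughput_limits_400_0512]
restartSplunkd = true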

Summary

Hopefully you've found the above useful! These are a few of the quality-of-life changes we've put in place from an administrator's perspective that help us manage a large and diverse environment.



