Saturday, February 22, 2014

Splunk - troubleshooting remote agents with the phonehome logs

An issue popped up the other day that was pretty interesting (from a Splunk admin perspective), so I figured I would share. This will likely be pretty long, but hopefully someone will benefit. A number of servers with Splunk universal forwarders stopped sending logs, yet when the server owner did a spot check of their firewalls he noticed traffic still going to our Splunk infrastructure backend. What had happened? The answer lies in the UF phone home logs - do you know where they are and how to read them?

Note this post applies to Splunk Enterprise 5.0.5 and below - I'm not sure how/if the logs will change in 6.x but will update as appropriate.
[Update 3/8/14: no apparent changes in 6.0.2]

To start I need to give you some background on our environment. We use Splunk's deployment server because we have a very distributed environment and don't have direct control over more than about 15 of the ~1,200 agents sending us data. There are three main aspects to our deployment server in terms of how we push content.

  1. We have people set the clientName string on their local deploymentclient.conf files which we leverage in the serverclass.conf file on the deployment server (DS) in terms of whitelisting/blacklisting agents for content packages (this basically names the forwarder for config management).
  2. We push a configuration setting to all agents that changes the check-in interval to the DS from every 30 seconds to every 3 minutes.
  3. Prior to individuals setting up their universal forwarders (UF), we give them the naming string referenced in item 1.
So what am I talking about? Let's say the college of Xenopsychology is going to start sending us logs and has both Windows and Linux servers. As part of the installation instructions we give them their clientName strings, which would be along the lines of:
  • uf_xeno_win_<servername>
  • uf_xeno_nix_<servername>
In our serverclass.conf file on the DS, and related to both items 1 and 2 above, this allows us to push out our phonehome setting to anything starting with 'uf' (i.e. whitelist.0 = uf*) and to push general inputs statements based on group and OS, as well as more specific ones, because the string also includes the unique server name (a rough serverclass.conf sketch follows below). Incidentally, to change the phonehome interval, create a deploymentclient.conf package on your DS and put in the following:

[deployment-client]
phoneHomeIntervalInSecs = 180

[target-broker:deploymentServer]
targetUri = <your_DS_server>:<management_port>
phoneHomeIntervalInSecs = 180

I don't know if you still need the setting in the top stanza - I read about an issue on Splunk Answers over a year ago when I first pushed this, /shrug. You also don't have to add the targetUri, but it is a good failsafe in case something happens to the local $SPLUNK_HOME/etc/system/local/deploymentclient.conf (or better yet, get that local file removed so only the pushed copy remains).
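For reference, here is a rough sketch of what the matching serverclass.conf entries on the DS might look like. The class names, app names, and whitelist patterns below are made up for illustration and will differ in your environment:

[serverClass:all_uf_agents]
# matches any forwarder whose clientName starts with 'uf'
whitelist.0 = uf*

[serverClass:all_uf_agents:app:phonehome_settings]
# the deploymentclient.conf package shown above
restartSplunkd = false

[serverClass:xeno_nix]
# more specific class keyed off the group/OS portion of the name string
whitelist.0 = uf_xeno_nix_*

[serverClass:xeno_nix:app:xeno_nix_inputs]
restartSplunkd = true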

At any rate why do I mention all this? Because the clientName string shows up in the phonehome messages!

So the first step is to find those phonehome logs. The general query I use is 

index=_internal source=*splunkd_access.log POST phonehome/connection

And you get results similar to the following:
10.10.10.10 - - [22/Feb/2014:09:45:55.081 -0500] "POST /services/broker/phonehome/connection_10.10.10.10_8089_wuzzle.foo.com_wuzzle_uf_xeno_nix_wuzzle HTTP/1.0" 200 12518 - - - 43ms
After the phonehome/connection bit you are given the following pieces of data:
  • IP address of the system
  • Management port the local agent is using
  • Fully qualified server name
  • Hostname
  • And lastly the clientName string
In hindsight, delimiting the clientName with a "-" rather than a "_" probably makes more sense when parsing these logs (the phonehome URI itself uses underscores), but since we have standardized beginnings for the name string it is still easy to create a rex statement to pull out the pertinent pieces of data (see the sketch below).
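As a rough sketch, something along these lines can pull the pieces apart. The field names are mine, and the regex assumes the FQDN and hostname contain no underscores:

index=_internal source=*splunkd_access.log POST phonehome/connection
| rex "phonehome/connection_(?<fwd_ip>\d{1,3}(?:\.\d{1,3}){3})_(?<fwd_port>\d+)_(?<fwd_fqdn>[^_]+)_(?<fwd_host>[^_]+)_(?<fwd_client>\S+)"
| table _time fwd_ip fwd_port fwd_fqdn fwd_host fwd_client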

So what happens when the clientName string ISN'T set? You end up with something like this:
10.10.10.11 - - [22/Feb/2014:09:45:55.081 -0500] "POST /services/broker/phonehome/connection_10.10.10.11_8089_bazinga.foo.com_bazinga_F9059B3D-D6F2-44EC-A50C-12BCD3D2B305 HTTP/1.0" 200 12518 - - - 43ms

Instead of the clientName string at the end you end up with the Splunk agent's GUID.
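If you want to hunt down forwarders that are phoning home with a GUID rather than a clientName, one approach (a sketch, assuming the GUID format shown above and the same made-up field names as before) is to filter on the GUID pattern:

index=_internal source=*splunkd_access.log POST phonehome/connection
| regex _raw="phonehome/connection_\S+_[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12} "
| rex "phonehome/connection_(?<fwd_ip>[^_]+)_\d+_(?<fwd_fqdn>[^_]+)_"
| stats latest(_time) as last_checkin by fwd_ip fwd_fqdn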

To investigate the issue I took one of the server names that had stopped sending logs and did a couple of things with these logs. As you might imagine, I looked for the presence of the clientName string in the phonehome logs. For the period when logs were coming in, the clientName string was there; once the agent stopped sending logs, I saw a GUID in the phonehome logs on the DS instead. Conclusion - at some point the clientName string was removed. The other piece of data to corroborate this: because we adjust the default phonehome time via content pushed based on the clientName string, when that string is removed the agent reverts back to phoning home every 30 seconds. This change in interval is VERY easy to see in the time range area at the top of the search results in Splunk and is great for general troubleshooting.
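A quick way to eyeball the check-in frequency for a single agent is something along these lines (a sketch - swap in the server name you care about). Roughly 3-4 events per 10 minutes lines up with our 3 minute interval, while ~20 per 10 minutes means the agent has fallen back to the 30 second default:

index=_internal source=*splunkd_access.log POST phonehome/connection <servername>
| timechart span=10m count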

So in the end the clientName string was removed from the deploymentclient.conf file but the target-broker line was still there. How do I know this? Because the agent was still checking in! In other words the local conf file went from

[deployment-client]
clientName = uf_xeno_nix_wuzzle

[target-broker:deploymentServer]
targetUri = ds.foo.com:8089

to something like this

[deployment-client]

[target-broker:deploymentServer]
targetUri = ds.foo.com:8089

Ok, great, but what was the catalyst for the change? Well, that is somewhat my fault...but I'll just blame Splunk. In prep for upgrading our search heads and indexers to 6.x we had to push a setting to all of our agents which required them to restart (we were using a setting that was deprecated in 5 and removed in 6). One thing to keep in mind is that many changes on Splunk agents, search heads, and indexers won't take effect until that instance of Splunk has been restarted...or the REST endpoint is tickled.
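As an aside, if you do happen to have shell access to a forwarder, two quick helpers here (assuming a standard install path) are dumping the effective deployment client config with btool to see whether clientName is actually set, and restarting the agent so pending conf changes apply:

$SPLUNK_HOME/bin/splunk btool deploymentclient list --debug
$SPLUNK_HOME/bin/splunk restart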

So the general order of events is as follows:
  1. At some point the agent was installed, the clientName string was set, the DS pushed the appropriate config packages, and data was going into the correct index.
  2. The clientName string was removed BUT the agent hadn't been restarted, so nothing changed yet.
  3. I pushed the new setting, which caused all agents to restart.
  4. The agents came back up with a partial case of amnesia.
Problem solved.

Hopefully this wall of text is helpful. There are many other things you can do with the phonehome logs to troubleshoot your agents:
  • How many agents are using the same clientName string but are different servers (get a distinct count of IPs by name string - see the sketch after this list; I have a couple of funny stories about this. Buy me a beer at .conf or something and I will tell them)
  • How many servers are running multiple agents (get a distinct count of name strings by IP)
  • It can be interesting to see what non-standard management ports Splunk is using. This is often indicative of someone installing multiple agents on a server - either by design or because they have somehow loused up the installation process (note the previous comment about stories and beer)
  • Look for agents checking in more frequently than your phonehome interval. For some reason there can be two identical phonehome logs for any particular phonehome event. To account for this you can either dedup in your query or take a larger time range and do some stats - e.g. since we have a 3 minute phonehome you could look for more than 6 logs in a 10 minute period.
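For the first two bullets, a rough sketch (same made-up field names as the rex earlier, and the threshold of 1 is just an example):

index=_internal source=*splunkd_access.log POST phonehome/connection
| rex "phonehome/connection_(?<fwd_ip>[^_]+)_(?<fwd_port>\d+)_(?<fwd_fqdn>[^_]+)_(?<fwd_host>[^_]+)_(?<fwd_client>\S+)"
| stats dc(fwd_ip) as distinct_ips by fwd_client
| where distinct_ips > 1

Flip the last two lines to | stats dc(fwd_client) as distinct_names by fwd_ip | where distinct_names > 1 to find servers running multiple agents instead.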
Takeaways 
Figuring out why logs aren't going into Splunk can be a pain - especially when you don't have access to the remote server. These phonehome logs are typically the first place I look. At a high level, here are things to look for/do:
  1. I highly recommend adjusting the phonehome time - this will also allow you to scale your deployment and still only use one DS
  2. Relative to the first point - I keep my query open with a 15 minute window, plug in the server name, and refresh. The graph at the top of the search results quickly shows how frequently the agent is checking in. You could get fancy with a drop-down menu in a dashboard, but that doesn't scale once you are past a handful of agents.
  3. If the name string is present, look for issues - e.g. if you have a multi-part string, are all elements there and in the right order?
Mark
