Thursday, October 29, 2015

Moving toward Splunk's CIM

For those that don't know, Splunk has for some time been moving toward a Common Information Model (CIM). They are using this both as a data normalization effort - what should you name fields from particular data sources - and as a layer of abstraction placed over your data to indicate what the data IS. In the end this is a worthwhile effort, though the devil is in the details for those of us with 1) older - by whatever definition - Splunk instances with local extractions and 2) larger - again by whatever definition - Splunk instances with 3) a large - sensing a trend? - number of sourcetypes. Frankly I'm big time scared of the performance implications of using what was a second or third class citizen in Splunk (tags) as my primary means of querying across 1k sourcetypes and 15B logs per day. Martin Mueller had a great talk at .conf15 looking into performancy sorts of things which brings a lot to that aspect of the discussion (search for his name here for a link to the slides. His name is actually spelled Martin Müeller and a direct link to the pdf is >here<).

Performance concerns aside, the question is how do you go about a discovery effort to figure out which of your sourcetypes should map to which CIM based data model? In theory, and depending on the number of sourcetypes you have, you could do this by manually reviewing a list. That might work for some percentage of sourcetypes but perhaps not all. At any rate some of this is addressable by the new Splunk commands: pivot and datamodel. The challenge with those is they are essentially searching across your data with the fields contained within the model on a one-off basis (one DM at a time), and if the fields don't match then there simply are no results. What I wanted was a way to take all of the fields from my data and throw them up against all of the fields in all of the models with a side of fuzzy string comparison. I *think* I have found a way.


The first bit is to know what fields you have per sourcetype. That is surprisingly challenging but I have that built out in the Data Curator app with the saved search "build sourcetype_field csv". (Note: if you are just now installing the Data Curator app based on this post you should run that search or let the app bake a bit before proceeding.) The next piece is to build a list of fields from the data models. I was going to try to do that with something like a rest search (i.e. | rest /servicesNS/-/-/datamodel/model | search title=alerts | table title description) but that proved problematic, so I knuckled down and manually created a list which I will post below. I left off a few models like Alerts, Application State, Change Analysis, Inventory, etc. as those are somewhat niche and don't generally contain unique field names. I also cut out the fields leveraged by the premium apps like ES (i.e. *_bunit, *_category, *_priority). Now 'all' that is left is to compare the two lists.

The thing I wondered about was cases where the field name from the DM is close to, but not exactly like, what we've named our fields locally. For example, suppose we have field names like MAC, mac, clientmac, etc. How do you map those back to the CIM fields, which are along the lines of src_mac or dest_mac?

Approach 1: Levenshtein distance
The Levenshtein distance is a number representing how many edits would have to be made in order to match one word/string to another. I liked this idea on a couple levels as there might be other use cases where it would be useful and there actually is an app in Splunkbase. Ultimately I didn't pursue it.
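
For anyone curious what that comparison would look like outside of Splunk, here is a minimal Python sketch of the edit distance itself. It is not the Splunkbase app's implementation; the function names levenshtein and closest_cim_field are made up for illustration, and the candidate CIM field list is whatever you choose to pass in.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits needed to turn a into b."""
    # previous[j] is the distance between the prefix of a handled so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def closest_cim_field(local_field: str, cim_fields: list[str]) -> tuple[str, int]:
    """Hypothetical helper: pick the CIM field with the smallest distance."""
    scored = [(levenshtein(local_field.lower(), cim), cim) for cim in cim_fields]
    distance, best = min(scored)
    return best, distance
```

One caveat with short names: mac is 4 edits away from src_mac, but it is also 4 edits away from user, so a raw distance on its own can rank unrelated fields just as highly as the one you want.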

Approach 2: Cluster command using n-gram (match=ngramset)
I poked at this for a bit which was kinda fun, but I started having to use lots of subsearches and joins as I was having to compare the fields, come up with a cluster ID, and then reach back through my base DM fieldset to associate that same cluster ID to the fields in a particular DM and object. In looking at the initial results though, it seemed that using wildcards in a lookup would accomplish just about the same thing AND make the results easier to manipulate for use.
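
To give a feel for what an n-gram set comparison does with field names, here is a small Python sketch. It compares sets of 3-character substrings and scores them with Jaccard similarity. The Jaccard scoring is an assumption made for illustration; it is not claimed to be how the cluster command scores matches internally, and the function names are invented.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def ngram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of two trigram sets (assumed scoring, see above).

    0.0 means no trigram in common, 1.0 means identical trigram sets.
    """
    set_a, set_b = trigrams(a.lower()), trigrams(b.lower())
    if not set_a or not set_b:
        # Names shorter than 3 characters have no trigrams to compare.
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

With this scoring, mac against src_mac shares one trigram out of five, and clientmac against src_mac shares one out of eleven, so a pair has to clear a fairly low threshold before local names like these would group with the CIM ones.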

Approach 3: Lookup with wildcards
I should get more into the habit of using KV stores, but that aside, what I did here was take my list and wrap wildcards around just about all of the field names. The ones I didn't wrap were src, dest, & dvc as those have the potential to match a great many fields and skew the results. I also added *referer* to the Web model and *mac* to the Network Traffic one as both seemed to make sense - *mac* will match things like machine though, so that is something to keep in mind. This does require that you can go under the hood to add a transform like this:

[dm_fields]
filename = data_models.csv
match_type = WILDCARD(field)
max_matches = 1
min_matches = 1
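
A rough way to picture what this transform does is the Python sketch below. It assumes that max_matches = 1 hands back only the first matching row in file order (that is how the transforms.conf documentation describes non-temporal lookups, though it is worth verifying on your version). It uses fnmatch as a stand-in for Splunk's wildcard matching, and the rows are a small sample copied from the csv in this post. The names load_rows and first_match are made up for illustration.

```python
import csv
import io
from fnmatch import fnmatchcase

# A few rows copied from data_models.csv, kept in file order.
SAMPLE_CSV = """field,model,object
"*bytes*",Web,Web
dest,Web,Web
"*src_ip*",Web,Web
"*user*",Web,Web
"*dest_mac*","Network_Traffic","All_Traffic"
"*mac*","Network_Traffic","All_Traffic"
"*src_user*",Authentication,Authentication
"""


def load_rows(text):
    """Parse the lookup csv into a list of dicts, preserving row order."""
    return list(csv.DictReader(io.StringIO(text)))


def first_match(field, rows):
    """Return (model, object) for the first pattern matching field, else None."""
    for row in rows:
        # fnmatchcase stands in for WILDCARD(field); only * is used here.
        if fnmatchcase(field, row["field"]):
            return row["model"], row["object"]
    return None


ROWS = load_rows(SAMPLE_CSV)
```

Under that assumption row order matters: first_match("src_user", ROWS) lands on Web because "*user*" appears earlier in the file than "*src_user*", and first_match("machine_name", ROWS) lands on Network_Traffic because of "*mac*".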

The search then is pretty simple

| inputlookup sourcetype_fields.csv
| eval field = lower(field)
| lookup dm_fields field as field
| search model!=none
| stats dc(field) as fields values(field) as field_list by sourcetype model object
| where fields > 1
| sort -fields
| stats max(fields) as maxFieldMatch list(model) as Model list(object) as Object list(fields) as fieldMatch by sourcetype
| sort -maxFieldMatch
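
If it helps to see the two stats passes spelled out, here is the same aggregation sketched in Python. It starts from rows that have already been through the lookup. The sourcetypes and fields in the sample are hypothetical, and summarize is an invented name; the output keys simply mirror the field names used in the search.

```python
from collections import defaultdict


def summarize(matches):
    """Summarize lookup hits per sourcetype.

    matches: iterable of (sourcetype, field, model, object) tuples, i.e. the
    rows that survive `search model!=none`.
    """
    # First stats: distinct matched fields by sourcetype, model, object.
    fields_by_key = defaultdict(set)
    for sourcetype, field, model, obj in matches:
        fields_by_key[(sourcetype, model, obj)].add(field)

    # where fields > 1, then sort -fields
    kept = [(key, len(fields)) for key, fields in fields_by_key.items()
            if len(fields) > 1]
    kept.sort(key=lambda item: item[1], reverse=True)

    # Second stats: one row per sourcetype listing every surviving model.
    summary = {}
    for (sourcetype, model, obj), count in kept:
        row = summary.setdefault(
            sourcetype,
            {"maxFieldMatch": 0, "Model": [], "Object": [], "fieldMatch": []},
        )
        row["maxFieldMatch"] = max(row["maxFieldMatch"], count)
        row["Model"].append(model)
        row["Object"].append(obj)
        row["fieldMatch"].append(count)

    # sort -maxFieldMatch
    return sorted(summary.items(),
                  key=lambda item: item[1]["maxFieldMatch"], reverse=True)


# Hypothetical sourcetypes and fields, for illustration only.
sample = [
    ("fw_log", "dest_port", "Network_Traffic", "All_Traffic"),
    ("fw_log", "packets", "Network_Traffic", "All_Traffic"),
    ("fw_log", "bytes", "Web", "Web"),
    ("proxy", "url", "Web", "Web"),
    ("proxy", "status", "Web", "Web"),
    ("proxy", "http_method", "Web", "Web"),
]
result = summarize(sample)
```

In the sample, the single bytes hit for fw_log against Web is dropped by the fields > 1 filter, which is the point of that step: one matching field name on its own says very little about which model a sourcetype belongs to.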

The results of the search in my environment are what you might call directionally correct more so than exact =). There are quite a number of good, actionable results but much of that depends on what field names you have defined already. One thing I did pre-CIM years ago was settle on src_ip and dest_ip for IP address related fields, so we already had that baked into many of our sourcetypes. If you've used source_ip or destination_ip you might want to look into adjusting the lookup. I also saw a fair number of matches for both Web and Network Traffic for the same sourcetype as there is a fair bit of field naming overlap. Not to be outdone, the Windows Security logs had 5 matches: Web, Email (filtering & email objects), Network Traffic, and Authentication.

The other cool thing I like about having multiple stats commands is if you click on a sourcetype in the results and then on "View events" you will be brought down to the query results just after the first stats command for that sourcetype. This will then show you all the fields that were matched. You could adjust the query to show the fields as part of the overall results but it will lengthen the output making it more difficult to visually consume.

Here is the csv I used. As listed in the transform above, I named it data_models.csv though you could call it whatever. If you have any feedback on the list below or the methodology I'm all ears!

field,model,object
"*bytes*",Web,Web
"*bytes_in*",Web,Web
"*bytes_out*",Web,Web
"*cached*",Web,Web
"*cookie*",Web,Web
dest,Web,Web
"*duration*",Web,Web
"*http_content_type*",Web,Web
"*http_method*",Web,Web
"*http_referrer*",Web,Web
"*http_user_agent*",Web,Web
"*http_user_agent_length*",Web,Web
"*referer*",Web,Web
"*response_time*",Web,Web
"*site*",Web,Web
src,Web,Web
"*src_ip*",Web,Web
"*status*",Web,Web
"*uri_path*",Web,Web
"*uri_query*",Web,Web
"*url*",Web,Web
"*url_length*",Web,Web
"*user*",Web,Web
"*bytes*","Network_Traffic","All_Traffic"
"*bytes_in*","Network_Traffic","All_Traffic"
"*bytes_out*","Network_Traffic","All_Traffic"
"*channel*","Network_Traffic","All_Traffic"
dest,"Network_Traffic","All_Traffic"
"*dest_interface*","Network_Traffic","All_Traffic"
"*dest_mac*","Network_Traffic","All_Traffic"
"*dest_port*","Network_Traffic","All_Traffic"
"*dest_translated_ip*","Network_Traffic","All_Traffic"
"*dest_translated_port*","Network_Traffic","All_Traffic"
"*direction*","Network_Traffic","All_Traffic"
"*duration*","Network_Traffic","All_Traffic"
dvc,"Network_Traffic","All_Traffic"
"*flow_id*","Network_Traffic","All_Traffic"
"*icmp_code*","Network_Traffic","All_Traffic"
"*icmp_type*","Network_Traffic","All_Traffic"
"*mac*","Network_Traffic","All_Traffic"
"*packets*","Network_Traffic","All_Traffic"
"*packets_in*","Network_Traffic","All_Traffic"
"*packets_out*","Network_Traffic","All_Traffic"
"*protocol*","Network_Traffic","All_Traffic"
"*protocol_version*","Network_Traffic","All_Traffic"
"*response_time*","Network_Traffic","All_Traffic"
"*rule*","Network_Traffic","All_Traffic"
"*session_id*","Network_Traffic","All_Traffic"
src,"Network_Traffic","All_Traffic"
"*src_interface*","Network_Traffic","All_Traffic"
"*src_ip*","Network_Traffic","All_Traffic"
"*src_mac*","Network_Traffic","All_Traffic"
"*src_port*","Network_Traffic","All_Traffic"
"*src_translated_ip*","Network_Traffic","All_Traffic"
"*src_translated_port*","Network_Traffic","All_Traffic"
"*ssid*","Network_Traffic","All_Traffic"
"*tcp_flag*","Network_Traffic","All_Traffic"
"*transport*","Network_Traffic","All_Traffic"
"*tos*","Network_Traffic","All_Traffic"
"*ttl*","Network_Traffic","All_Traffic"
"*user*","Network_Traffic","All_Traffic"
"*vlan*","Network_Traffic","All_Traffic"
"*wifi*","Network_Traffic","All_Traffic"
dest,Authentication,Authentication
"*dest_nt_domain*",Authentication,Authentication
"*duration*",Authentication,Authentication
"*response_time*",Authentication,Authentication
src,Authentication,Authentication
"*src_nt_domain*",Authentication,Authentication
"*src_user*",Authentication,Authentication
"*user*",Authentication,Authentication
dest,Certificates,"All_Certificates"
"*dest_port*",Certificates,"All_Certificates"
"*duration*",Certificates,"All_Certificates"
"*response_time*",Certificates,"All_Certificates"
src,Certificates,"All_Certificates"
"*transport*",Certificates,"All_Certificates"
"*ssl_end_time*",Certificates,SSL
"*ssl_engine*",Certificates,SSL
"*ssl_hash*",Certificates,SSL
"*ssl_is_valid*",Certificates,SSL
"*ssl_issuer*",Certificates,SSL
"*ssl_issuer_common_name*",Certificates,SSL
"*ssl_issuer_email*",Certificates,SSL
"*ssl_issuer_locality*",Certificates,SSL
"*ssl_issuer_organization*",Certificates,SSL
"*ssl_issuer_state*",Certificates,SSL
"*ssl_issuer_street*",Certificates,SSL
"*ssl_issuer_unit*",Certificates,SSL
"*ssl_name*",Certificates,SSL
"*ssl_policies*",Certificates,SSL
"*ssl_publickey*",Certificates,SSL
"*ssl_publickey_algorithm*",Certificates,SSL
"*ssl_serial*",Certificates,SSL
"*ssl_session_id*",Certificates,SSL
"*ssl_signature_algorithm*",Certificates,SSL
"*ssl_start_time*",Certificates,SSL
"*ssl_subject*",Certificates,SSL
"*ssl_subject_common_name*",Certificates,SSL
"*ssl_subject_email*",Certificates,SSL
"*ssl_subject_locality*",Certificates,SSL
"*ssl_subject_state*",Certificates,SSL
"*ssl_subject_street*",Certificates,SSL
"*ssl_subject_unit*",Certificates,SSL
"*ssl_validity_window*",Certificates,SSL
"*ssl_version*",Certificates,SSL
"*delay*",Email,Email
dest,Email,Email
"*duration*",Email,Email
"*file_hash*",Email,Email
"*file_name*",Email,Email
"*file_size*",Email,Email
"*internal_message_id*",Email,Email
"*message_id*",Email,Email
"*message_info*",Email,Email
"*orig_dest*",Email,Email
"*orig_recipient*",Email,Email
"*orig_src*",Email,Email
"*process*",Email,Email
"*process_id*",Email,Email
"*protocol*",Email,Email
"*recipient*",Email,Email
"*recipient_count*",Email,Email
"*recipient_status*",Email,Email
"*response_time*",Email,Email
"*retries*",Email,Email
"*return_addr*",Email,Email
"*size*",Email,Email
src,Email,Email
"*src_user*",Email,Email
"*status_code*",Email,Email
"*subject*",Email,Email
"*url*",Email,Email
"*user*",Email,Email
"*xdelay*",Email,Email
"*xref*",Email,Email
"*filter_action*",Email,Filtering
"*filter_score*",Email,Filtering
"*signature*",Email,Filtering
"*signature_extra*",Email,Filtering
"*signature_id*",Email,Filtering
dest,"Intrusion Detection","IDS_Attacks"
dvc,"Intrusion Detection","IDS_Attacks"
"*ids_type*","Intrusion Detection","IDS_Attacks"
"*severity*","Intrusion Detection","IDS_Attacks"
"*signature*","Intrusion Detection","IDS_Attacks"
src,"Intrusion Detection","IDS_Attacks"
"*user*","Intrusion Detection","IDS_Attacks"
"*date*",Malware,"Malware_Attacks"
dest,Malware,"Malware_Attacks"
"*dest_nt_domain*",Malware,"Malware_Attacks"
"*dest_requires_av*",Malware,"Malware_Attacks"
"*file_hash*",Malware,"Malware_Attacks"
"*file_name*",Malware,"Malware_Attacks"
"*file_path*",Malware,"Malware_Attacks"
"*signature*",Malware,"Malware_Attacks"
src,Malware,"Malware_Attacks"
"*user*",Malware,"Malware_Attacks"
"*vendor_product*",Malware,"Malware_Attacks"
dest,Malware,"Malware_Operations"
"*dest_nt_domain*",Malware,"Malware_Operations"
"*dest_requires_av*",Malware,"Malware_Operations"
"*product_version*",Malware,"Malware_Operations"
"*signature_version*",Malware,"Malware_Operations"
"*vendor_product*",Malware,"Malware_Operations"
dest,Performance,"All_Performance"
"*dest_should_timesync*",Performance,"All_Performance"
"*hypervisor_id*",Performance,"All_Performance"
"*resource_type*",Performance,"All_Performance"
"*cpu_load_mhz*",Performance,CPU
"*cpu_load_percent*",Performance,CPU
"*cpu_time*",Performance,CPU
"*cpu_user_percent*",Performance,CPU
"*fan_speed*",Performance,Facilities
"*power*",Performance,Facilities
"*temperature*",Performance,Facilities
"*mem*",Performance,Memory
"*mem_committed*",Performance,Memory
"*mem_free*",Performance,Memory
"*mem_used*",Performance,Memory
"*swap*",Performance,Memory
"*swap_free*",Performance,Memory
"*swap_used*",Performance,Memory
"*array*",Performance,Storage
"*blocksize*",Performance,Storage
"*cluster*",Performance,Storage
"*fd_max*",Performance,Storage
"*fd_used*",Performance,Storage
"*latency*",Performance,Storage
"*mount*",Performance,Storage
"*parent*",Performance,Storage
"*read_blocks*",Performance,Storage
"*read_latency*",Performance,Storage
"*read_ops*",Performance,Storage
"*storage*",Performance,Storage
"*storage_free*",Performance,Storage
"*storage_free_percent*",Performance,Storage
"*storage_used*",Performance,Storage
"*storage_used_percent*",Performance,Storage
"*write_blocks*",Performance,Storage
"*write_latency*",Performance,Storage
"*write_ops*",Performance,Storage
"*thruput*",Performance,Network
"*thruput_max*",Performance,Network
"*signature*",Performance,OS
"*uptime*",Performance,Uptime
