Tags and Fields and Shinken oh my..
(todo: insert funny ‘trigger’ warning image)
What is the problem?
I have my flow working. Data is generated by Collectd and sent to my monitoring server, where a Shinken module, mod_collectd, receives it. It bounces around Shinken for some actual monitoring work, and then the metrics get shipped off to an InfluxDB server for eventual visualization in Grafana. The whole thing is deployed and configured using Saltstack, with feedback as various minions come online. That is a separate series for the future :)..
To the problem - Everything was flowing fine, but now I want to start setting up some Grafana views of the flood of metrics coming in. A few test ones on Users or Load are fine, but when I hit the CPU metrics we have a bit of a problem…
Here are some of my measurements in InfluxDB
and on..
As you can see, I could get this into Grafana with one query for each. For Graphite-backed data, it looks like Grafana could use the dot notation and, with various filters, create aliases based on the above.
It looks like now, with InfluxDB 0.9+, the proper way to go is with tags, using the $tag_sometagname syntax in the Grafana Alias By section.
- How does the CPU 0 Idle metric from a particular host make it through the chain to wind up in InfluxDB?
- Where does it become the measurement name metric_cpu_0_idle?
- How do I change it into the ‘cpu’ measurement name (since that makes sense), with the tags cpu=0 and state=idle?
Start the debugging
I am going to start at the beginning: the wire protocol from CollectD to Shinken.
We could start digging into the protocol documentation, but let’s just look at the raw traffic first, since I am already on the system and I don’t want to switch to my browser to look up the protocol.
So, off to one of my server instances that has collectd -> shinken traffic in progress. The default port is UDP 25826.
Yup, data. Now I need to see the actual data.
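One quick way to see the actual payloads is a throwaway UDP listener. This is a sketch of my own (assuming collectd's default network port, 25826) - any packet sniffer works just as well:

```python
import socket

def hexdump(data, width=16):
    """Render raw bytes as offset / hex / ASCII lines (like xxd)."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexpart = ' '.join('%02x' % b for b in chunk)
        text = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        lines.append('%04x  %-*s  %s' % (off, width * 3, hexpart, text))
    return '\n'.join(lines)

def dump_collectd_packets(port=25826):
    """Bind to collectd's default network port and hex-dump each datagram.
    Not called here - run it by hand on the monitoring host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('0.0.0.0', port))
    while True:
        packet, addr = sock.recvfrom(65535)
        print('--- %d bytes from %s ---' % (len(packet), addr[0]))
        print(hexdump(packet))
```

Calling dump_collectd_packets() on the server shows the plugin and type names scattered through the binary payload, which is all I need for this eyeball test.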
So, just a quick look - the phrase ‘metric_cpu_0_wait’ is not there. It looks like the separate fields are individual variables in the protocol.
The next target will be the Shinken mod_collectd module that is getting this data. Off to the Python!
The module (on my install) is in /var/lib/shinken/modules/mod_collectd/module.py. I love being able to just see the source right there and tweak it as needed, and this applies to the entire Shinken system. This is one reason I have switched to Shinken from Nagios.
Digging into mod_collectd and having to grok a fair bit of Python, I see where the Element.get_command() method is putting it all together.
It is all getting joined together to fit within the Shinken (and Nagios) perf data layout - some_variable_name=value,next_variable=value,etc..
Each variable follows this pattern:
$type-$plugininstance-$typeinstance_$valueindex=$value
For example, the CPU #4 Interrupt collectd metric becomes:
cpu-4-interrupt=0.0
The 1, 5, and 15 minute load values become:
load_0=0.95, load_1=0.98, load_2=0.78
(the CollectD load plugin sends 3 values).
The df plugin turns into:
df_complex-root-reserved=1734762496.0, df_complex-root-used=2364227584.0, df_complex-root-free=29587415040.0
..because the actual CollectD type for the df plugin metric is called df_complex.
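To make that naming pattern concrete, here is a hypothetical helper that rebuilds those names from collectd metadata. This is my own reconstruction of the pattern, not code from mod_collectd:

```python
def perf_name(ds_type, plugin_instance=None, type_instance=None, value_index=None):
    """Rebuild a Shinken perf-data variable name from collectd metadata,
    following the $type-$plugininstance-$typeinstance_$valueindex pattern.
    A sketch of the pattern only, not the actual mod_collectd code."""
    name = ds_type
    if plugin_instance:
        name += '-%s' % plugin_instance
    if type_instance:
        name += '-%s' % type_instance
    if value_index is not None:
        # Multi-value types (like load) get a numeric suffix per value.
        name += '_%d' % value_index
    return name
```

Running it against the examples above: perf_name('cpu', '4', 'interrupt') gives cpu-4-interrupt, perf_name('load', value_index=0) gives load_0, and perf_name('df_complex', 'root', 'reserved') gives df_complex-root-reserved.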
Also, I will note that the mod_collectd module has special handling for any plugin that you label as a Grouped Plugin. This will cause the metrics shipped back into Shinken to be under the service name $plugin (from the collectd data package) instead of the service name $plugin_$plugin_instance.
I have cpu, df, disk, interface, and ntpd listed as grouped plugins in the mod_collectd.conf file (in Shinken).
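The grouping rule can be sketched in a few lines. This is my own illustration of the behavior described above, not the module's code (and 'md' in the example is just a hypothetical non-grouped plugin):

```python
# Plugins I have listed as grouped in mod_collectd.conf.
GROUPED_PLUGINS = {'cpu', 'df', 'disk', 'interface', 'ntpd'}

def service_name(plugin, plugin_instance):
    """Sketch of the grouped-plugin naming rule: grouped plugins collapse
    every instance under one service named after the plugin alone."""
    if plugin in GROUPED_PLUGINS or not plugin_instance:
        return plugin
    return '%s_%s' % (plugin, plugin_instance)
```

So service_name('cpu', '0') lands in one 'cpu' service, while a non-grouped plugin with an instance would get its own '$plugin_$plugin_instance' service.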
It looks like it is best to leave this module alone, since we don’t know what other components might be consuming the perf data that is getting injected. This means that our next place to look at changing things is in the mod_influxdb module itself:
/var/lib/shinken/modules/mod_influxdb/*.py
Note that I am currently using mod_influxdb from its master branch on GitHub, since there are bug fixes in it that deal with the fact that recent InfluxDB 0.9 subversions changed the API with respect to the timestamp value, and it now defaults to a 64-bit timestamp instead of 32-bit. (If all your shinken->influxdb values are coming up as 1970, that is probably why.)
mod_influxdb runs under the Broker Daemon, and is passed down information in the form of a ‘brok’. The Broker calls into the module as each brok arrives.
It is all done by the basemodule.manage_brok() function, which dispatches on this pattern:
'manage_' + brok.type + '_brok'
Our perf data is coming into the module via manage_service_check_result_brok.
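That dispatch pattern is easy to demonstrate standalone. A minimal sketch (my own stand-in classes, not Shinken's actual BaseModule or Brok implementations):

```python
class Brok(object):
    """Tiny stand-in for a Shinken brok: just a type name and a data dict."""
    def __init__(self, brok_type, data):
        self.type = brok_type
        self.data = data

class BrokerModuleSketch(object):
    """Sketch of the manage_brok() dispatch: the handler method name is
    built from the brok type, then called only if the module defines it."""
    def manage_brok(self, brok):
        manager = 'manage_' + brok.type + '_brok'
        if hasattr(self, manager):
            return getattr(self, manager)(brok)
        return None

    def manage_service_check_result_brok(self, brok):
        # A real module would parse brok.data['perf_data'] here.
        return brok.data.get('perf_data')
```

So a 'service_check_result' brok finds its handler, while any brok type the module doesn't care about is silently ignored.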
Right at the start, we can see the tags getting set up:
Sure enough, if you look at the raw InfluxDB data for metric_cpu_0_idle, you can see that the host_name and service_description tags are being set. We want more..
I want the metric to be called cpu, and the tags to be something like instance=0 and type=idle.
It looks like this is the module we need to hack on to make this work. It looks like a fair bit of work though.. CPU will be different from DF and Load. We need to think of a layout that will work for all plugins, or most plugins, and provide some way to act differently for certain exceptions.
Our pattern (small sample)
Well, the metric name is easy.. it is always the part before the first _.
Ah, what about the users plugin? We have not looked at that one yet.
users=2.0
So, there might not always be a _.
The patterns are name, name_type, and name_instance_type.
I feel a regex coming….(todo: insert sad face)..
So I start pondering the regex, but I don’t want to ponder the regex, so the ADHD kicks in and I start
to ponder the $INSTANCE vs $TYPE that show up in my InfluxDB data.
“Hold on!” says my brain, “I think I remember the load metrics showing up as load_1, load_5, load_15, not load_0 to load_2..”
A quick check of Influx ‘SHOW MEASUREMENTS’, and sure enough, brain was correct.
WTH? Wait a second, the ‘minutes’ setting is not coming in from the CollectD data, just an array of 3 values. Who/what is telling Shinken that it should be 1, 5, and 15 minutes?
Side Trip! Off to the Triggers!
Triggers are a unique feature of Shinken that helps make the whole Collectd->Shinken->InfluxDB->Grafana thing work. The documentation says that Triggers are ‘not for production’, but the documentation is old, and if what I need to make everything super cool is ‘not for production’, then unless my Shinken is connected to someone’s pacemaker or a furnace boiler, I am going to use it! Besides, it looks like the trigger part of the code hasn’t been touched in almost a year - sounds stable enough for me ;)
(todo: insert picture of rebel)
Triggers are little bits of Python code that will be called as a ‘result’ comes in. They are defined at the object level.
Here is my CPU service object :
The trigger is in /etc/shinken/triggers.d/collectd_cpu.trig. It is Python code.
ps.. Some of these come from the collectd pack here and here
The trigger also works like a normal Nagios module, so at this point in the data flow, you can trigger the alerts for OK, Warning, Critical, etc. Pretty cool eh? (todo: insert Miley Cyrus ‘pretty cool’)
Now, the documentation is pretty light on triggers (being beta and all), so we are going to have to dive into the code to see what things we can do. There is some logging already set up to help us as well (starting with trigger_functions.py).
Not sure what daemon it runs under, so I am going to set log_level=DEBUG in the scheduler first, as it seems like the right place.
Restart shinken, and start learning (grep’ing)…
Yup, scheduler it was.. Now the logs..
The data is coming out of our mod_collectd with basic performance data attached to it - the raw metrics. The trigger is getting called and doing whatever it needs. Our Load trigger code looks like this:
#!/usr/bin/env python
try:
    import logging
    backend = "Shinken"
    logger = logging.getLogger(backend)
    exit_code_output = {0: 'OK',
                        1: 'WARNING',
                        2: 'CRITICAL',
                        3: 'UNKNOWN',
                        }
    exit_code = 0

    # Get thresholds
    data = {}
    warn = self.host.customs.get('_LOAD_WARN', None)
    if warn is not None:
        (data['warn_1'],
         data['warn_5'],
         data['warn_15']) = [float(x) for x in warn.split(",")]
    crit = self.host.customs.get('_LOAD_CRIT', None)
    if crit is not None:
        (data['crit_1'],
         data['crit_5'],
         data['crit_15']) = [float(x) for x in crit.split(",")]

    # Get perfs
    # TODO: Check why load-_0 and not load_0
    # Check the patch grouped_plugin in collectd_arbiter module
    error = False
    try:
        data['1'] = float(perf(self, 'load_0'))
        data['5'] = float(perf(self, 'load_1'))
        data['15'] = float(perf(self, 'load_2'))
    except ValueError:
        logger.error("A required perf_data is missing for %s" % self.get_full_name())
        logger.info("Dumping perf_data : %s" % self.perf_data)
        perf_data = ""
        output = "Error : A required perf_data is missing to compute trigger"
        exit_code = 3
        error = True

    if not error:
        # Prepare output
        if warn is not None and crit is not None:
            perf_data = ("load_1=%(1)0.2f;%(warn_1)0.2f;%(crit_1)0.2f;0 "
                         "load_5=%(5)0.2f;%(warn_5)0.2f;%(crit_5)0.2f;0 "
                         "load_15=%(15)0.2f;%(warn_15)0.2f;%(crit_15)0.2f;0" % data)
        else:
            perf_data = ("load_1=%(1)0.2f;;;0 "
                         "load_5=%(5)0.2f;;;0 "
                         "load_15=%(15)0.2f;;;0 " % data)
        output = "Load: %(1)0.2f,%(5)0.2f,%(15)0.2f" % data

        # Get status
        if warn is not None and crit is not None:
            for x in ['1', '5', '15']:
                if data[x] > data['crit_' + x] and exit_code < 2:
                    exit_code = 2
                    continue
                if data[x] > data['warn_' + x] and exit_code < 1:
                    exit_code = 1
                    continue

        # Finish output
        output = " - ".join((exit_code_output[exit_code], output))

    # Set output
    set_value(self, output, perf_data, exit_code)
except Exception, e:
    set_value(self, "UNKNOWN: Trigger error: " + str(e), "", 3)
It is pretty straightforward. It collects some threshold values from the host object (self is filled with a few handy object collections), converts the load_0, load_1, and load_2 perf values into load_1, load_5, and load_15 values, checks them against the warning and critical thresholds (if set), then creates new perf data in the generally accepted Nagios perf data format (metricname=value[unit];warnlevel;critlevel;min;max), and then pushes the new perf data back, along with a new output message for OK, Warning, Critical, or Unknown.
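That perf data format can be sanity-checked with a small standalone sketch that rebuilds the same string the trigger emits. This is my own reconstruction for testing, not the trigger code itself:

```python
def format_load_perfdata(values, warn=None, crit=None):
    """Rebuild the Nagios-style perf data string for the 1/5/15-minute
    load values; warn and crit are 3-tuples of thresholds, or None.
    A reconstruction of the format the trigger emits, for testing."""
    parts = []
    for i, minutes in enumerate(('1', '5', '15')):
        if warn is not None and crit is not None:
            parts.append('load_%s=%0.2f;%0.2f;%0.2f;0'
                         % (minutes, values[i], warn[i], crit[i]))
        else:
            parts.append('load_%s=%0.2f;;;0' % (minutes, values[i]))
    return ' '.join(parts)
```

Without thresholds, format_load_perfdata((0.95, 0.98, 0.78)) produces 'load_1=0.95;;;0 load_5=0.98;;;0 load_15=0.78;;;0', matching the empty warn/crit slots in the trigger's no-threshold branch.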
Back on track to the mod_influxdb module
This is not where we can change things up to add tags, but we can inject some other information into the perf data. Perhaps some kind of pattern hint for the mod_influxdb code to create tags?
I could just add things like tag_sometagname=somevalue here, and have mod_influxdb parse that, but what about things like the CPU perf data that has a mess of different cpu numbers and types all in one perf data collection? tag_cpunumber=3 will not work, for example, when the perf data has 8 different cpus in it.
Perhaps passing an extra field after the ‘max’ that could be a ‘format’ indicator for mod_influxdb.
Let’s head back to the mod_influxdb code and see what we can do, starting with the load metric; then we will tackle the cpu metric.
manage_service_check_result_brok passes the brok to a few functions that parse its data into data points for InfluxDB.
There is some good diagnostic logging, so I am going to set log_level=DEBUG on the Broker daemon (mod_influxdb is a broker module).
One handy line is the ‘Generated Points’ debug log. Capturing that and running it through a python code formatter, this is the data that is generated for the load data brok :
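My actual debug capture is not reproduced here, but a hand-written illustration of the shape of one such generated point (the host name, timestamp, and values below are placeholders of my own, not real data):

```python
# Hypothetical illustration of one generated point for the load service.
# Host name, timestamp, and field/tag values are placeholders.
point = {
    "measurement": "metric_load_0",
    "time": 1445000000,
    "fields": {"value": 0.95},
    "tags": {"host_name": "myhost", "service_description": "load"},
}
print(point["measurement"], point["tags"])
```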
So we can see the 3 variables in the perf data being converted into fields for Influx, along with some tags. At the end of the collection, you can also see that the service state is collected and pushed to InfluxDB as well (very handy for marking point-in-time events on top of the metric graphs).
For load, the ‘service_description’ is fine. We just need to get ‘measurement’ to be just ‘load’ (or ‘metric_load’ - mod_influxdb is adding ‘metric_’ to everything), and we need to get the ‘type’ tag to be the 1, 5, or 15..
Here is my diff to the function
diff --git a/module/module.py b/module/module.py
index e7ec6cf..8ec4e34 100644
--- a/module/module.py
+++ b/module/module.py
@@ -111,14 +111,31 @@ class InfluxdbBroker(BaseModule):
                 value = float(value)
                 fields[mapping[1]] = value
 
+            # Create some extra tags for the type and instance in the e.name
+            computed_tags = tags.copy()
+            namesplit = e.name.split('_', 3)
+            if len(namesplit) > 2:
+                # name_instance_type
+                computed_tags.update({'instance': namesplit[1]})
+                computed_tags.update({'type': namesplit[2]})
+            elif len(namesplit) > 1:
+                # name_type
+                computed_tags.update({'type': namesplit[1]})
+            if len(namesplit) >= 1:
+                logger.debug("[influxdb broker] Namesplit : %s" % namesplit)
+                logger.debug("[influxdb broker] Tags : %s" % computed_tags)
+                e.name = namesplit[0]
+
             if fields:
                 point = {
                     "measurement": 'metric_%s' % self.illegal_char.sub('_', e.name),
                     "time": timestamp,
                     "fields": fields,
-                    "tags": tags,
+                    "tags": computed_tags,
                 }
                 points.append(point)
 
         return points
There might be better ways to do this - more ‘Pythonic’ - but it should be clear enough to understand what we are doing. I took a copy of the tags that are already passed into the function. Then I split up the current ‘metric’ name on the ‘_’ characters that were added by the mod_collectd module. Then, based on how many parts there are, I fill out the appropriate tags, and we create the point with the newly created tags.
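The same split logic can be tried standalone. Here it is as a plain function (a copy of the diff's approach, pulled out for experimenting):

```python
def split_metric(name):
    """Mirror of the diff's logic: split a mod_collectd-style metric name
    on '_' and derive instance/type tags (a sketch for testing)."""
    namesplit = name.split('_', 3)
    tags = {}
    if len(namesplit) > 2:
        # name_instance_type
        tags['instance'] = namesplit[1]
        tags['type'] = namesplit[2]
    elif len(namesplit) > 1:
        # name_type
        tags['type'] = namesplit[1]
    return namesplit[0], tags
```

split_metric('cpu_0_idle') yields ('cpu', {'instance': '0', 'type': 'idle'}), split_metric('load_1') yields ('load', {'type': '1'}), and a bare name like 'users' passes through untouched with no extra tags.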
Result - Well, it works..
Let’s see the measurements in InfluxDB now :
So, metric_cpu_#_???? is gone, now just metric_cpu..
Inside?
Instance and type are filled out properly.
Next is to setup Grafana to use these new tags in the alias.