Configuration
metricq_sink_nsca retrieves its configuration over the MetricQ network in the form of a JSON object. The configuration defines a set of checks and specifies the host to which the results of these checks should be sent.
For the impatient
A minimum working configuration looks like this:
{
    "nsca": {
        "host": "nsca.example.org"
    },
    "checks": {
        "bar": {
            "metrics": [
                "bar.1",
                "bar.2"
            ],
            "warning_above": 10.0,
            "critical_above": 15.0,
            "timeout": "1min"
        }
    }
}
You tell it where to send the check results ("nsca.host") and then define, for each check, a list of metrics whose values are checked for abnormal values, as given by {warning,critical}_above.
Checks need to be configured as passive checks on the host side,
otherwise reports will be dropped silently by the host.
Full reference
Top-level configuration
nsca
A dictionary of NSCA host settings.
checks
A dictionary of check configurations by name.
- Example
Suppose you have configured a passive check called foo in Centreon/Nagios and want to be alerted when values of either of the metrics foo.bar or foo.baz drop below a threshold. A possible configuration looks like this:
{
    "nsca": {
        "host": "nsca.example.org"
    },
    "checks": {
        "foo": {
            "metrics": [
                "foo.bar",
                "foo.baz"
            ],
            "warning_below": 10.0,
            "critical_below": 5.0
        }
    }
}
reporting_host
Name of the host for which check results are reported, as configured in Nagios/Centreon.
- Default:
The current hostname returned by gethostname(2).
resend_interval
Global default resend interval (see Check configuration/resend_interval).
- Default
"3min"
overrides
An object of global overrides that affect operation of all checks. See override configuration.
- Default
If omitted, no overrides are applied.
Check configuration
A single check monitors a set of metrics for abnormal behavior.
It continuously consumes new data points for these metrics
and reports an overall state: if the values of a single metric leave their allowed range, or no new values arrive within a certain time, a state of WARNING or CRITICAL is reported.
metrics
(list of strings)
A list of metrics that should be monitored.
This list is mandatory and required to be non-empty.
warning_above, warning_below, critical_above, critical_below
(number) Ranges of values which should trigger a WARNING (resp. CRITICAL) status report to be sent. We call the union of intervals \([-\infty, \mathtt{warning\_below}) \cup (\mathtt{warning\_above}, \infty]\) the warning range; values within that range trigger a WARNING. The critical range is defined similarly.
We require the following, otherwise the configuration is rejected:
\[\mathtt{critical\_below} \le \mathtt{warning\_below} < \mathtt{warning\_above} \le \mathtt{critical\_above}\]
A WARNING report is sent if the value of a metric drops below warning_below, a CRITICAL report is sent if it drops further below critical_below. Metrics exceeding warning_above or critical_above similarly trigger reports.
- Defaults
\(-\infty\) ({warning,critical}_below)
\(\infty\) ({warning,critical}_above)
You cannot put \(\pm\infty\) directly into the check configuration. Since they are the default anyway, simply omit the relevant key if necessary.
- Important
Setting any of these values forces incoming messages to be decoded and parsed, which adds significant overhead for high-volume metrics. Leave all values unset to disable packet processing and only check for timeouts.
- Example
An example check configuration with all ranges specified:
{ "checks": { "foo": { "metrics": [ "foo.bar", "foo.baz" ], "critical_below": 5.0, "warning_below": 10.0, "warning_above": 95.0, "critical_above": 100.0 } } }
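The warning and critical ranges defined above can be illustrated with a small Python sketch. This is not the code used by metricq_sink_nsca; the function name classify and its keyword arguments are hypothetical, and the sketch assumes the thresholds have already passed validation. Unset thresholds default to ±∞, mirroring the documented defaults.

```python
import math


def classify(
    value: float,
    *,
    critical_below: float = -math.inf,
    warning_below: float = -math.inf,
    warning_above: float = math.inf,
    critical_above: float = math.inf,
) -> str:
    """Map a value to "OK", "WARNING" or "CRITICAL" (hypothetical helper).

    The ranges are open at the thresholds: a value exactly equal to a
    threshold is not inside the corresponding range.
    """
    # The critical range takes precedence over the warning range.
    if value < critical_below or value > critical_above:
        return "CRITICAL"
    if value < warning_below or value > warning_above:
        return "WARNING"
    return "OK"


# Thresholds taken from the example check "foo" above.
thresholds = dict(
    critical_below=5.0, warning_below=10.0, warning_above=95.0, critical_above=100.0
)
for sample in (50.0, 7.0, 3.0, 97.0, 101.0):
    print(sample, classify(sample, **thresholds))
```

With these thresholds, 7.0 falls into the warning range and 3.0 into the critical range, while 10.0 itself would still count as normal because the ranges exclude their boundaries.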
timeout
(duration) Send a check result of severity WARNING if values arrive more than the specified period apart. This monitors two kinds of failure:
- The network is fully operational, but two consecutive data points for a metric differ by more than timeout in their timestamps. This might indicate that the source for these metrics is not fully operational.
- metricq_sink_nsca does not receive data points for these metrics for more than the specified duration, measured against the local system clock. This might happen if a source has crashed, has lost its connection to the network, or there is another issue along the way that prevents clients from consuming new values for these metrics.
- Default
Not set; no timeout checks are performed.
- Note
Timeout checks can be enabled independently from value checks. They do not require incoming messages to be parsed and can safely be enabled for high-volume metrics without incurring much overhead.
- Example
Make sure that foo.bar consistently produces values:
{ "checks": { "foo": { "metrics": ["foo.bar"], "timeout": "1min" } } }
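The two timeout conditions described for this option can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names find_gaps and is_stale are hypothetical, and the real checker may track timeouts differently (for example with timers instead of explicit comparisons).

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def find_gaps(
    timestamps: Iterable[datetime], timeout: timedelta
) -> List[Tuple[datetime, datetime]]:
    """First condition: consecutive data points more than `timeout` apart."""
    ts = list(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > timeout]


def is_stale(last_received: datetime, now: datetime, timeout: timedelta) -> bool:
    """Second condition: nothing received for longer than `timeout`,
    measured against the local clock (`now`)."""
    return now - last_received > timeout


start = datetime(2024, 1, 1, 12, 0, 0)
points = [start, start + timedelta(seconds=30), start + timedelta(seconds=150)]
print(find_gaps(points, timedelta(minutes=1)))
print(is_stale(points[-1], start + timedelta(seconds=200), timedelta(minutes=1)))
```

Neither helper looks at metric values, which matches the note above that timeout checks do not require messages to be parsed.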
ignore
(list of numbers) A list of values that never generate WARNING or CRITICAL reports.
This is intended for metrics that yield spurious but fixed values that should be ignored, even if they fall within an otherwise abnormal range. An example would be a faulty measuring device which produces the value 0.0 on encountering an internal error, but where warning_below = 5.0.
Note
Use with care. The implementation essentially only performs a floating-point equality test to filter values.
If this sounds like a bad idea to you, you are probably right. Trust me, this is here because some source cannot be fixed easily.
- Example
A source computing a power factor for an AC electrical power system reports a metric ac_system.power_factor. If the power factor is too low, a warning should be generated. It might be that the source calculates a power factor of 0.0 on low draw. Since a low power factor on low draw might not be considered a problem, ignore the value 0.0:
{ "checks": { "low-draw": { "metrics": ["ac_system.power_factor"], "warning_below": 0.8, "ignore": [0.0] } } }
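To show why the note above urges care, here is a minimal Python sketch of a filter based on plain floating-point equality. The function should_ignore is a hypothetical stand-in, not the checker's actual code; it only demonstrates which values such a test does and does not catch.

```python
from typing import Sequence


def should_ignore(value: float, ignore: Sequence[float]) -> bool:
    """Return True if `value` is exactly equal to one of the ignored values."""
    return any(value == ignored for ignored in ignore)


# A fixed sentinel such as 0.0 is matched reliably.
print(should_ignore(0.0, [0.0]))
# A computed value that is only approximately 0.3 is not matched.
print(should_ignore(0.1 + 0.2, [0.3]))
```

Exact sentinels emitted by a source work well with this kind of test, while values that result from arithmetic generally do not, so the list is best reserved for constants a source is known to emit verbatim.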
resend_interval
(duration) Period of time after which the current state of this check is sent again to the server, even though it might not have changed. This is necessary since passive checks are considered to be in an UNKNOWN state by Centreon/Nagios if they have not sent a report for a certain time.
- Default
Inherited from the global resend interval.
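The inheritance rule can be sketched as a simple lookup chain: the per-check value, then the global value, then the documented global default of "3min". The function effective_resend_interval below is hypothetical and operates on a configuration that has already been parsed into a Python dictionary.

```python
DEFAULT_RESEND_INTERVAL = "3min"  # documented default of the global setting


def effective_resend_interval(config: dict, check_name: str) -> str:
    """Resolve the resend interval that applies to one check (sketch)."""
    check = config["checks"][check_name]
    global_interval = config.get("resend_interval", DEFAULT_RESEND_INTERVAL)
    return check.get("resend_interval", global_interval)


config = {
    "resend_interval": "2min",
    "checks": {
        "foo": {"metrics": ["foo.bar"], "resend_interval": "5min"},
        "bar": {"metrics": ["bar.1"]},
    },
}
print(effective_resend_interval(config, "foo"))  # per-check value
print(effective_resend_interval(config, "bar"))  # inherited global value
```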
transition_debounce_window
(duration) If this value is set, metricq_sink_nsca tries to reduce the number of spurious WARNING or CRITICAL reports. We call this process “transition debouncing”.
If you are experiencing state transitions to WARNING or CRITICAL that only last \(x\) seconds and want to suppress them, set this value to at least \(2x\) seconds.
For each metric, a history of its state transitions is kept. This option configures how far into the past state transitions are kept in each history. If the majority of recent state transitions indicate an abnormal state, a report is sent. Otherwise it is suppressed.
- Default
Not set; transition debouncing is disabled.
- TODO
This should be called transition_history_window. Bug me about it in an issue.
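The majority rule described for this option can be approximated by the following Python sketch. It is a simplified model and not the actual implementation: the class TransitionHistory and its methods are hypothetical, each entry is reduced to a timestamp and a normal/abnormal flag, and a strict majority is assumed.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Tuple


class TransitionHistory:
    """Keeps recent state transitions of one metric within a time window."""

    def __init__(self, window: timedelta) -> None:
        self.window = window
        self._entries: Deque[Tuple[datetime, bool]] = deque()

    def record(self, timestamp: datetime, is_abnormal: bool) -> None:
        """Add a transition and drop entries older than the window."""
        self._entries.append((timestamp, is_abnormal))
        cutoff = timestamp - self.window
        while self._entries and self._entries[0][0] < cutoff:
            self._entries.popleft()

    def should_report(self) -> bool:
        """True if a strict majority of the kept transitions are abnormal."""
        abnormal = sum(1 for _, bad in self._entries if bad)
        return abnormal * 2 > len(self._entries)


history = TransitionHistory(window=timedelta(minutes=1))
start = datetime(2024, 1, 1, 12, 0, 0)
history.record(start, False)
history.record(start + timedelta(seconds=10), True)
print(history.should_report())  # one abnormal entry out of two
history.record(start + timedelta(seconds=20), True)
print(history.should_report())  # two abnormal entries out of three
```

In this model a short abnormal episode surrounded by normal states never reaches a majority within the window, which is the effect the \(2x\) rule of thumb aims for.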
plugins
A dictionary of plugin configurations. Keys in this dictionary must match the regex [a-z_]+.
NSCA host settings
These settings tell the reporter where it should send its check results and how that host is configured.
host
(string) Address of the NSCA daemon to which check results are sent. See the -H flag of send_nsca.
- Default:
"localhost"
port
(integer) Port of the NSCA daemon to which check results are sent. See the -p flag of send_nsca.
- Default:
5667
executable
(string) Path to the send_nsca executable to use for sending check results.
- Default:
"/usr/sbin/send_nsca"
config_file
(string) Path to the send_nsca configuration file. See the -c flag of send_nsca.
- Default:
"/etc/nsca/send_nsca.cfg"
Override configuration
Overrides should be used to temporarily reconfigure a checker instance, e.g. when planned maintenance affects the availability of certain metrics.
The override configuration contains the following keys to define overrides:
ignored_metrics
(list of metric patterns) Each item in this list is a metric pattern that matches either one or multiple metrics. If a check defines a metric that matches at least one of these patterns, this metric is completely ignored by that check. In particular, neither abnormal values nor timeout conditions will trigger any reports to be sent.
Put a metric on this list if you want to temporarily exclude it from all checks, without deleting it from the actual check configuration. This prevents misconfigurations where a metric had to be temporarily ignored, but later was not added back to all checks from which it was removed.
A metric pattern can be one of the following:
- An exact match
The full name of a metric. Exactly this metric will be ignored.
- A prefix match
A metric name consists of components separated by a dot (.). All metrics that share a common prefix of components can be matched at once. Write the prefix, followed by the wildcard component *.
- Example
foo.* matches foo.bar.baz, foo.qux and any other metric whose first component is foo.
Note
The exact pattern syntax might be extended in the future in an incompatible way. In particular, it is currently neither possible to match parts of components (i.e. no foo.b*r) nor non-prefix components (no foo.*.baz). This might change in the future.
Overrides should be temporary; before upgrading to a new feature release, check that your overrides are still valid.
- Default
If omitted, no metrics will be ignored for any check.
- Example
We can use exact matches to ignore individual metrics:
{ "overrides": { "ignored_metrics": [ "waldo.location.latitude", "waldo.location.longitude" ] }, "checks": { "TRACK_WALDO": { "metrics": [ "waldo.location.latitude", "waldo.location.longitude", "waldo.hidden.duration" ], "timeout": "5min" } } }
In the above example, only waldo.hidden.duration is checked by TRACK_WALDO for timeout conditions; both waldo.location.latitude and waldo.location.longitude are ignored.
- Example
To easily match multiple metrics, we can use a prefix match:
{ "overrides": { "ignored_metrics": [ "santa.*" ] }, "checks": { "LATITUDE_VALID": { "metrics": [ "waldo.location.latitude", "santa.location.latitude" ], "critical_above": 90.0, "critical_below": -90.0 }, "LONGITUDE_VALID": { "metrics": [ "waldo.location.longitude", "santa.location.longitude" ], "critical_above": 180.0, "critical_below": -180.0 } } }
In the above example, neither santa.location.latitude nor santa.location.longitude is checked by LATITUDE_VALID and LONGITUDE_VALID, respectively. In fact, any metric that has santa as its first component would be ignored.
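The two pattern kinds can be modelled with a few lines of Python. This is a sketch of the documented matching rules only; the function names matches and filter_ignored are hypothetical, and it assumes that a prefix pattern such as foo.* does not match the bare name foo, which the text above does not specify.

```python
from typing import Iterable, List


def matches(pattern: str, metric: str) -> bool:
    """Check a metric name against an exact or prefix pattern (sketch)."""
    if pattern.endswith(".*"):
        # Keep the trailing dot so that "foo.*" does not match "foobar.x".
        prefix = pattern[:-1]
        return metric.startswith(prefix)
    return metric == pattern


def filter_ignored(metrics: Iterable[str], ignored: Iterable[str]) -> List[str]:
    """Return the metrics that are not matched by any ignore pattern."""
    patterns = list(ignored)
    return [m for m in metrics if not any(matches(p, m) for p in patterns)]


print(matches("foo.*", "foo.bar.baz"))
print(
    filter_ignored(
        ["waldo.location.latitude", "santa.location.latitude"],
        ["santa.*"],
    )
)
```

Applied to the LATITUDE_VALID check above, only waldo.location.latitude remains after filtering.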
Plugin configuration
file
(string) File system path to the plugin implementation (a .py file).
This key is mandatory. Plugin configurations without it are rejected.
config
(dictionary) An arbitrary JSON object containing a plugin-specific configuration.
Duration
All durations in this configuration are strings of the form "value [unit]".
The value is an integer or (decimal) floating-point literal, the unit is one of
- d/days
- h/hours
- min/minutes
- s/seconds
- ms/milliseconds
- us/μs/microseconds
- ns/nanoseconds
If the unit is not specified, the value is interpreted as a number of seconds.
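The duration grammar above can be sketched as a small Python parser. This is an illustration, not the parser used by metricq_sink_nsca: the function parse_duration is hypothetical, it returns a number of seconds as a float, and it assumes plain non-negative decimal literals (no sign, no exponent) with optional whitespace between value and unit.

```python
import re

# Seconds per unit, covering the short and long unit names listed above.
_UNIT_SECONDS = {
    "d": 86400.0, "days": 86400.0,
    "h": 3600.0, "hours": 3600.0,
    "min": 60.0, "minutes": 60.0,
    "s": 1.0, "seconds": 1.0,
    "ms": 1e-3, "milliseconds": 1e-3,
    "us": 1e-6, "μs": 1e-6, "microseconds": 1e-6,
    "ns": 1e-9, "nanoseconds": 1e-9,
}

_DURATION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([^\d\s]*)\s*$")


def parse_duration(text: str) -> float:
    """Parse a "value [unit]" string into seconds (sketch)."""
    match = _DURATION_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    value, unit = match.groups()
    if unit == "":
        # A missing unit means the value is a number of seconds.
        return float(value)
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unknown duration unit: {unit!r}")
    return float(value) * _UNIT_SECONDS[unit]


for text in ("5s", "5", "1min", "1.5 days"):
    print(text, parse_duration(text))
```

Returning float seconds keeps the sketch short; a real implementation would likely prefer an integer nanosecond count or a dedicated duration type to avoid rounding for sub-second units.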
- Examples
"5s", "5" (5 seconds)
"42 milliseconds"
"1.5 days"
Complete Example
An example with all possible options set is given below:
{
    "$schema": "../../config.schema.json",
    "nsca": {
        "host": "nsca.example.com",
        "port": 5667,
        "executable": "/opt/nsca/bin/send_nsca",
        "config_file": "/opt/nsca/etc/send_nsca.cfg"
    },
    "overrides": {
        "ignored_metrics": [
            "foo.qux.*"
        ]
    },
    "reporting_host": "foo",
    "resend_interval": "2min",
    "checks": {
        "foo": {
            "metrics": [
                "foo.bar",
                "foo.baz",
                "foo.qux.ignored1",
                "foo.qux.ignored2"
            ],
            "critical_below": 5.0,
            "warning_below": 10.0,
            "warning_above": 95.0,
            "critical_above": 100.0,
            "timeout": "10min",
            "ignore": [
                0.0
            ],
            "resend_interval": "5min",
            "transition_debounce_window": "1min",
            "plugins": {
                "foo_plugin": {
                    "file": "/opt/metricq_sink_nsca/plugins/foo_plugin.py",
                    "config": {
                        "zorgs": 3,
                        "blargles": ["ahhh", "ouggh"]
                    }
                }
            }
        }
    }
}