Configuration

metricq_sink_nsca retrieves its configuration over the MetricQ network in the form of a JSON object. The configuration defines a set of checks and specifies the host to which the results of these checks should be sent.

For the impatient

A minimum working configuration looks like this:

{
    "nsca": {
        "host": "nsca.example.org",
    },
    "checks": {
        "bar": {
            "metrics": [
                "bar.1",
                "bar.2"
            ],
            "warning_above": 10.0,
            "critical_above": 15.0,
            "timeout": "1min"
        }
    }
}

You tell it where to send check results ("nsca.host") and then, for each check, define a list of metrics whose values are checked against the thresholds given by {warning,critical}_above. Checks need to be configured as passive checks on the host side; otherwise, reports will be dropped silently by the host.

Full reference

Top-level configuration

nsca

A dictionary of NSCA host settings.

checks

A dictionary of check configurations by name.

Example

Suppose you have configured a passive check called foo in Centreon/Nagios and want to be alerted when values of either of the metrics foo.bar or foo.baz drop below a threshold. A possible configuration looks like this:

{
    "nsca": {
        "host": "nsca.example.org",
    },
    "checks": {
        "foo": {
            "metrics": [
                "foo.bar",
                "foo.baz"
            ],
            "warning_below": 10.0,
            "critical_below": 5.0
        }
    }
}
reporting_host

Name of the host for which check results are reported, as configured in Nagios/Centreon.

Default:

The current hostname returned by gethostname(2).

resend_interval

Global default resend interval (see Check configuration/resend_interval).

Default

"3min"

overrides

An object of global overrides that affect operation of all checks. See override configuration.

Default

If omitted, no overrides are applied.

Check configuration

A single check monitors a set of metrics for abnormal behavior. It continuously consumes new data points for these metrics and reports an overall state: if the values of a single metric exceed their allowed range, or no new values arrive after a certain time, a state of WARNING or CRITICAL is reported.

metrics (list of strings)

A list of metrics that should be monitored.

This list is mandatory and required to be non-empty.

warning_above, warning_below, critical_above, critical_below (number)

Ranges of values that trigger a WARNING (resp. CRITICAL) status report to be sent. We call the intervals \([-∞, \mathtt{warning\_below}) \cup (\mathtt{warning\_above}, ∞]\) the warning range; values within that range trigger a WARNING. The critical range is defined similarly.

We require the following, otherwise the configuration is rejected:

\[\mathtt{critical\_below} ≤ \mathtt{warning\_below} < \mathtt{warning\_above} ≤ \mathtt{critical\_above}\]

A WARNING report is sent if the value of a metric drops below warning_below, a CRITICAL report is sent if it drops further below critical_below. Metrics exceeding warning_above or critical_above similarly trigger reports.

Defaults
  • \(-∞\) ({warning,critical}_below)

  • \(∞\) ({warning,critical}_above)

You cannot put \(±∞\) directly into the check configuration. Since they are the defaults anyway, simply omit the relevant keys.

Important

Setting any of these values forces incoming messages to be decoded and parsed, which adds significant overhead for high-volume metrics. Leave all values unset to disable packet processing and only check for timeouts.

Example

An example check configuration with all ranges specified:

{
    "checks": {
        "foo": {
            "metrics": [ "foo.bar", "foo.baz" ],
            "critical_below":   5.0,
            "warning_below":   10.0,
            "warning_above":   95.0,
            "critical_above": 100.0
        }
    }
}
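To make the interplay of the four thresholds concrete, here is a small illustrative sketch (hypothetical code, not taken from metricq_sink_nsca) that maps a value to a state under the documented ordering constraint:

```python
import math

# Hypothetical sketch of value checking; missing thresholds default to
# -inf/+inf, matching the documented defaults.
def check_value(value, warning_below=-math.inf, warning_above=math.inf,
                critical_below=-math.inf, critical_above=math.inf):
    # The documented ordering constraint:
    # critical_below <= warning_below < warning_above <= critical_above
    assert critical_below <= warning_below < warning_above <= critical_above

    if value < critical_below or value > critical_above:
        return "CRITICAL"
    if value < warning_below or value > warning_above:
        return "WARNING"
    return "OK"

print(check_value(7.0, warning_below=10.0, critical_below=5.0))   # WARNING
print(check_value(3.0, warning_below=10.0, critical_below=5.0))   # CRITICAL
print(check_value(50.0, warning_below=10.0, critical_below=5.0))  # OK
```

Note that the range bounds are exclusive on the inside: a value exactly equal to warning_below is still OK.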
timeout (duration)

Send a check result of severity WARNING if values arrive more than the specified period apart. This monitors two kinds of failure:

  • The network is fully operational, but two consecutive data points for a metric differ by more than timeout in their timestamps. This might indicate that the source for these metrics is not fully operational.

  • metricq_sink_nsca does not receive data points for these metrics for more than the specified duration, measured against the local system clock. This might happen if a source has crashed, has lost its connection to the network, or there is another issue along the way that prevents clients from consuming new values for these metrics.

Default

Not set; no timeout checks are performed.

Note

Timeout checks can be enabled independently from value checks. They do not require incoming messages to be parsed and can safely be enabled for high-volume metrics without incurring much overhead.

Example

Make sure that foo.bar consistently produces values:

{
    "checks": {
        "foo": {
            "metrics": ["foo.bar"],
            "timeout": "1min"
        }
    }
}
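The two timeout conditions above can be sketched as follows; the function and its arguments are illustrative, not the actual implementation:

```python
from datetime import datetime, timedelta

# Illustrative sketch of the two timeout conditions (hypothetical code).
def timed_out(timestamps, timeout, now=None):
    """Return True if consecutive data points are spaced more than
    `timeout` apart, or if no data point arrived within `timeout` of `now`."""
    # Condition 1: gap between consecutive data-point timestamps.
    for earlier, later in zip(timestamps, timestamps[1:]):
        if later - earlier > timeout:
            return True
    # Condition 2: no new data at all, measured against the local clock.
    if now is not None and timestamps and now - timestamps[-1] > timeout:
        return True
    return False

t0 = datetime(2024, 1, 1, 12, 0, 0)
minute = timedelta(minutes=1)
print(timed_out([t0, t0 + 3 * minute], timeout=minute))  # True: 3min gap
```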
ignore (list of numbers)

A list of values that are never considered to generate WARNING or CRITICAL reports.

This is intended for metrics that yield spurious but fixed values that should be ignored, even if they fall within an otherwise abnormal range. An example would be a faulty measuring device that produces the value 0.0 on encountering an internal error, but where warning_below = 5.0.

Note

Use with care. The implementation essentially only performs a floating-point equality test to filter values.

If this sounds like a bad idea to you, you are probably right. Trust me, this is here because some source cannot be fixed easily.

Example

A source computing the power factor of an AC electrical power system reports a metric ac_system.power_factor. If the power factor is too low, a warning should be generated. However, the source may calculate a power factor of 0.0 on low draw. Since a low power factor on low draw might not be considered a problem, ignore the value 0.0:

{
    "checks": {
        "low-draw": {
            "metrics": ["ac_system.power_factor"],
            "warning_below": 0.8,
            "ignore": [0.0]
        }
    }
}
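A minimal sketch of the filter described above, including the floating-point caveat from the note (names are illustrative, not the actual implementation):

```python
# The ignore filter is essentially a plain equality test against the
# configured list (hypothetical sketch).
def filtered(values, ignore):
    return [v for v in values if v not in ignore]

print(filtered([0.0, 0.5, 0.9], ignore=[0.0]))  # [0.5, 0.9]

# Caveat: only exact binary equality is filtered. Computed values that are
# "almost" equal to an ignored value slip through:
print(0.1 + 0.2 in [0.3])  # False: 0.30000000000000004 != 0.3
```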
resend_interval (duration)

Period of time after which the current state of this check is sent again to the server, even though it might not have changed. This is necessary since passive checks are considered to be in an UNKNOWN state by Centreon/Nagios if they have not sent a report for a certain time.

Default

Inherited from the global resend interval.

transition_debounce_window (duration)

If this value is set, metricq_sink_nsca tries to reduce the number of spurious WARNING or CRITICAL reports. We call this process “transition debouncing”.

If you are experiencing state transitions to WARNING or CRITICAL that only last \(x\) seconds and want to suppress them, set this value to at least \(2x\) seconds.

For each metric, a history of its state transitions is kept. This configures how far into the past state transitions are kept in each history. If the majority of recent state transitions indicate an abnormal state, a report is sent. Otherwise it is suppressed.

Default

Not set; transition debouncing is disabled.

TODO

This should be called transition_history_window. Bug me about it in an issue.
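One plausible sketch of the majority-vote idea described above (purely illustrative; the actual implementation may differ):

```python
from datetime import datetime, timedelta

# Hypothetical sketch of transition debouncing. Each entry in `history`
# is a (timestamp, is_abnormal) pair recording a state transition.
def should_report(history, window, now):
    # Keep only transitions that fall within the debounce window.
    recent = [abnormal for ts, abnormal in history if now - ts <= window]
    if not recent:
        return False
    # Report only if the majority of recent transitions were abnormal.
    return sum(recent) * 2 > len(recent)

now = datetime(2024, 1, 1, 12, 0, 0)
s = timedelta(seconds=1)
history = [(now - 50 * s, True), (now - 40 * s, False), (now - 5 * s, True)]
print(should_report(history, window=timedelta(minutes=1), now=now))  # True
```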

plugins

A dictionary of plugin configurations. Keys in this dictionary must match the regex [a-z_]+.

NSCA host settings

These settings tell the reporter where it should send its check results and how that host is configured.

host (string)

Address of the NSCA daemon to which check results are sent. See -H flag of send_nsca.

Default:

"localhost"

port (integer)

Port of the NSCA daemon to which check results are sent. See -p flag of send_nsca.

Default:

5667

executable (string)

Path to the send_nsca executable used for sending check results.

Default:

"/usr/sbin/send_nsca"

config_file (string)

Path to the send_nsca configuration file. See -c flag of send_nsca.

Default:

"/etc/nsca/send_nsca.cfg"

Override configuration

Overrides should be used to temporarily reconfigure a checker instance, e.g. when a planned maintenance affects the availability of certain metrics.

The override configuration contains the following keys to define overrides:

  • ignored_metrics (list of metric patterns)

    Each item in this list is a metric pattern that matches either one or multiple metrics. If a check defines a metric that matches at least one of these patterns, this metric is completely ignored by that check. In particular, neither abnormal values nor timeout conditions will trigger any reports to be sent.

    Put a metric on this list if you want to temporarily exclude it from all checks, without deleting it from the actual check configuration. This prevents misconfigurations where a metric was removed from several checks to ignore it temporarily, but later not added back to all of them.

    A metric pattern can be one of the following:

    An exact match

    The full name of a metric. Exactly this metric will be ignored.

    A prefix match

    A metric name consists of components separated by dots (.). All metrics that share a common prefix of components can be matched at once: write the prefix, followed by the wildcard component *.

    Example

    foo.* matches foo.bar.baz, foo.qux and any other metric whose first component is foo.

    Note

    The exact pattern syntax might be extended in the future in an incompatible way. In particular, it is currently neither possible to match parts of components (e.g. no foo.b*r) nor non-prefix components (no foo.*.baz).

    Overrides should be temporary; before upgrading to a new feature release, check that your overrides are still valid.

    Default

    If omitted, no metrics will be ignored for any check.

    Example

    We can use an exact match to ignore exactly one metric:

    {
        "overrides": {
            "ignored_metrics": [
                "waldo.location.latitude",
                "waldo.location.longitude"
            ]
        },
        "checks": {
            "TRACK_WALDO": {
                "metrics": [
                    "waldo.location.latitude",
                    "waldo.location.longitude",
                    "waldo.hidden.duration"
                ],
                "timeout": "5min"
            }
        }
    }
    

    In the above example, only waldo.hidden.duration is checked by TRACK_WALDO for timeout conditions; both waldo.location.latitude and waldo.location.longitude are ignored.

    Example

    To easily match multiple metrics, we can use a prefix match:

    {
        "overrides": {
            "ignored_metrics": [
                "santa.*"
            ]
        },
        "checks": {
            "LATITUDE_VALID": {
                "metrics": [
                    "waldo.location.latitude",
                    "santa.location.latitude",
                ],
                "critical_above": 90.0,
                "critical_below": -90.0
            },
            "LONGITUDE_VALID": {
                "metrics": [
                    "waldo.location.longitude",
                    "santa.location.longitude",
                ],
                "critical_above": 180.0,
                "critical_below": -180.0
            }
        }
    }
    

    In the above example, santa.location.latitude and santa.location.longitude are not checked by LATITUDE_VALID and LONGITUDE_VALID, respectively. In fact, any metric whose first component is santa would be ignored.
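The two pattern kinds can be sketched as follows (illustrative code, not the actual implementation):

```python
# Hypothetical sketch of metric pattern matching: either an exact match,
# or a prefix match ending in the wildcard component "*".
def matches(pattern, metric):
    if pattern.endswith(".*"):
        # Keep the trailing dot so that "santa.*" does not match
        # unrelated metrics like "santascope.latitude".
        prefix = pattern[:-1]
        return metric.startswith(prefix)
    return metric == pattern

print(matches("santa.*", "santa.location.latitude"))  # True
print(matches("santa.*", "santascope.latitude"))      # False
print(matches("waldo.location.latitude", "waldo.location.latitude"))  # True
```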

Plugin configuration

file (string)

File system path to plugin implementation (a .py file).

This key is mandatory. Plugin configurations without it are rejected.

config (dictionary)

An arbitrary JSON object containing a plugin-specific configuration.

Duration

All durations in this configuration are strings of the form "value [unit]". The value is an integer or decimal floating-point literal; the unit is one of

  • d/days

  • h/hours

  • min/minutes

  • s/seconds

  • ms/milliseconds

  • us/μs/microseconds

  • ns/nanoseconds

If the unit is not specified, the value is interpreted as a number of seconds.

Examples
  • "5s", "5" (5 seconds)

  • "42 milliseconds"

  • "1.5 days"

Complete Example

An example with all possible options set is given below:

{
    "$schema": "../../config.schema.json",
    "nsca": {
        "host": "nsca.example.com",
        "port": 5667,
        "executable": "/opt/nsca/bin/send_nsca",
        "config_file": "/opt/nsca/etc/send_nsca.cfg"
    },
    "overrides": {
        "ignored_metrics": [
            "foo.qux.*"
        ]
    },
    "reporting_host": "foo",
    "resend_interval": "2min",
    "checks": {
        "foo": {
            "metrics": [
                "foo.bar",
                "foo.baz",
                "foo.qux.ignored1",
                "foo.qux.ignored2"
            ],
            "critical_below": 5.0,
            "warning_below": 10.0,
            "warning_above": 95.0,
            "critical_above": 100.0,
            "timeout": "10min",
            "ignore": [
                0.0
            ],
            "resend_interval": "5min",
            "transition_debounce_window": "1min",
            "plugins": {
                "foo_plugin": {
                    "file": "/opt/metricq_sink_nsca/plugins/foo_plugin.py",
                    "config": {
                        "zorgs": 3,
                        "blargles": ["ahhh", "ouggh"]
                    }
                }
            }
        }
    }
}