Monitoring¶

Monitoring with Graunt¶

Moreover, finding some Truths and not-commonly-believed opinions to arise from my meditations upon these neglected Papers, I proceeded further to consider what benefit the knowledge of the same would bring to the world, … with some real fruit from those ayrie blossoms.

– John Graunt, „Natural and Political Observations Mentioned in a following Index, and made upon the Bills of Mortality.“ (1662)

Overview¶

The graunt package includes services to collect, aggregate, store, and display statistics on the performance of software and hardware (client) instances in the network.

At the core, the whisper round-robin databases provide a fixed-size storage pool for aggregated statistical data. Recent data is stored with a high time resolution. Older data is aggregated and stored more efficiently with less resolution. The amount of data is configurable.

The carbon-cache service accepts individual measurements, organized in a hierarchical namespace of whisper databases. Databases are created on demand when new names are encountered.

Access to databases is restricted through instances of the carbon-relay service. There is one such service for each protected namespace domain. Typically, there are several namespaces for the company network and one namespace for each customer.

The carbon-relay ports are made available on the internal network, or, in the case of customer access, over a dedicated SSH tunnel to an account on the customer’s machine that is locked down to only allow port forwarding of a single network port.

[Figure: Graunt Tunnel]

The aggregated data is exposed through the graphite web service, which includes a rich API to create graphs from the data, and also offers a configurable dashboard.

The default client is diamond, which comes with a variety of collectors for different system parameters and services.

For regular or ad-hoc instrumentation of deployed applications, bucky provides a statsd interface to carbon. It runs locally to the application instance and forwards pre-aggregated data to the central server.

The tools (except for SSH and library dependencies) are implemented in Python.

The following diagram gives an overview of the whole system.

[Figure: Graunt Overview]

Installation¶

graunt comes as a Git repository. After cloning, the submodules need to be initialized:

$ git clone graunt.git graunt
$ cd graunt
[graunt]$ git submodule init
[graunt]$ git submodule update

Building graunt requires Python 2.7, Bash, OpenSSH and libevent development files to be available on the system. There are makefile targets to install the prerequisites:

[graunt]$ make prepare-fedora

Finally, the virtualenv for graunt can be set up with:

[graunt]$ make

The downloaded Python dependencies are cached in the directory cache, which can be reused.

Variable data, including the RR databases, are stored in the directory var, which can be carried over from one instance of graunt to another.

The target directory contains the virtualenv where all packages are installed. It can be removed and rebuilt at will, as it does not contain any variable data.

As all configuration files use relative paths, graunt is fully relocatable. As a consequence, care must be taken when starting graunt (see Running below).

Running¶

All services in graunt are started through Mozilla’s circusd service. The configuration file uses relative paths, so circusd has to be run from the graunt root directory. The wrapper script takes care of that:

[any]$ /path/to/graunt/circusd

The script auto-detects the graunt directory, so it can be run from anywhere.

As graunt runs in a virtualenv, all tools can be run directly from the target/bin directory, which can also be added to the user’s PATH variable:

[graunt]$ target/bin/circusctl stats
[graunt]$ target/bin/circusctl stop diamond
[graunt]$ target/bin/circusctl start diamond

Database¶

graunt uses the whisper database format. The database files are located at var/carbon/whisper and organized in a hierarchical namespace. carbon-relay servers are used to restrict access to specific namespaces only.

Namespace	Description
`carbon`	Statistics about the carbon cache and relay servers
`cust.$NAME`	Namespaces for customer installations
`smc.$HOST`	Namespaces for Semantics network
`demo.$PORTAL`	Namespaces for Semantics demo portals

Customer namespaces are subdivided:

Namespace	Description
`cust.$NAME.$SERVER`	Namespaces for customer operating systems
`cust.$NAME.$PORTAL`	Namespaces for customer portals

The available databases in each namespace are dependent on the services that are monitored.

Storage and Aggregation¶

Storage rules are set in etc/carbon/storage-schemas.conf, while aggregation rules are in etc/carbon/storage-aggregation.conf.

The current default retention is:

Resolution	Duration
10s	1d
1m	10d
1h	400d
1d	10y

The resulting database files are about 436 kB each.
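The file size follows from the retention table. The sketch below reproduces the arithmetic, assuming the usual whisper on-disk layout (a 16-byte metadata header, 12 bytes of archive information per archive, and 12 bytes per data point) and the retention string notation used in storage-schemas.conf. The function names are made up for this illustration.

```python
# Estimate the size of a whisper file for a given retention string.
# Assumed layout: 16-byte header, 12 bytes per archive entry,
# 12 bytes (timestamp + value) per data point.

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 365 * 86400}


def to_seconds(spec):
    """Convert a spec such as '10s' or '400d' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def whisper_file_size(retentions):
    """Return (points per archive, total bytes) for a retention string."""
    archives = []
    for archive in retentions.split(","):
        resolution, duration = archive.split(":")
        archives.append(to_seconds(duration) // to_seconds(resolution))
    size = 16 + 12 * len(archives) + 12 * sum(archives)
    return archives, size


points, size = whisper_file_size("10s:1d,1m:10d,1h:400d,1d:10y")
print(points)  # [8640, 14400, 9600, 3650]
print(size)    # 435544
```

The same function can be used to estimate the disk usage of a different retention policy before changing the storage rules.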

The default aggregation method is average.
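As a rough illustration of what averaging means when data moves to a coarser resolution, the following sketch condenses six 10-second values into one 1-minute value. It is a simplification, not whisper's implementation: it ignores gaps in the data, and `rollup_average` is a name invented for this example.

```python
def rollup_average(points, factor):
    """Average each run of `factor` consecutive values into one value."""
    result = []
    for i in range(0, len(points), factor):
        chunk = points[i:i + factor]
        result.append(sum(chunk) / float(len(chunk)))
    return result


# Two minutes of 10 s samples become two 1 min samples.
ten_second_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
                     10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
print(rollup_average(ten_second_values, 6))  # [3.5, 10.0]
```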

Network Configuration¶

The following ports are used by graunt internally:

Port	Proto	Host	Description
2003	tcp	127.0.0.1	`carbon-cache` line interface (used by local `diamond` daemon)
2004	tcp	127.0.0.1	`carbon-cache` pickle interface (used by relays)
2033	tcp	127.0.0.1	`carbon-relay` `customer1` line interface
2043	tcp	127.0.0.1	`carbon-relay` `customer2` line interface
…			etc
5555	tcp	127.0.0.1	`circusd` ZMQ management socket (for `circusctl`)
5556	tcp	127.0.0.1	`circusd` ZMQ pub/sub event socket (for `circusctl`)
5557	tcp	127.0.0.1	`circusd` ZMQ pub/sub stats socket (for `circusctl`)
7002	tcp	127.0.0.1	`carbon-cache` query interface (for `graphite-web`)

The line interface is especially useful for manual testing.
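For such a manual test, a data point is a single newline-terminated line of the form `path value timestamp`. The sketch below builds and sends one; the metric name is hypothetical, and host and port default to the local carbon-cache line interface from the table above.

```python
import socket
import time


def format_line(path, value, timestamp=None):
    """Build one newline-terminated 'path value timestamp' line."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_line(line, host="127.0.0.1", port=2003):
    """Send one line over TCP to a carbon line interface."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical metric name, fixed timestamp for reproducibility.
print(repr(format_line("smc.testhost.manual.test", 42, 1400000000)))
# 'smc.testhost.manual.test 42 1400000000\n'
```

Calling `send_line(format_line(...))` on the graunt host should make the value appear under the given name once carbon-cache has written it.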

The pickle interface is not very secure, but it is required by the local carbon-relay daemons.

Customer relays are accessed through dedicated SSH tunnels.

The following ports are external entry points:

Port	Proto	Host	Description
2013	tcp	0.0.0.0	`carbon-relay` `smc` line interface for the Semantics network
2023	tcp	0.0.0.0	`carbon-relay` `demo` line interface for the Semantics demo portals
8000	tcp	0.0.0.0	`graphite` web service
8135	udp	0.0.0.0	`statsd` interface for the Semantics network (prefix smc. is added)
8145	udp	0.0.0.0	`statsd` interface for the Semantics demo portals (prefix demo. is added)

At each customer server, the following ports are forwarded over an SSH tunnel to one of the customer ports above (2033, 2043, etc.).

Port	Proto	Host	Description
2003	tcp	127.0.0.1	`carbon-relay` `customer` line interface (over SSH)
8125	udp	0.0.0.0	`statsd` interface (over SSH)

Configuration¶

For each customer (here: CUSTOMER), the following configuration needs to be done on the central statsd server (burge.semantics.de).

Switch to the graunt user and change to the graunt directory:

[burge]$ sudo -u graunt bash
[graunt]$ cd /home/graunt/graunt

Per-Customer Configuration¶

The per-customer carbon-relay provides isolated access to the whisper databases, such that the customer servers can only log data to the cust.CUSTOMER namespace.

For this, the following steps need to be taken:

Pick a local port number for the carbon-relay line interface (no pickle interface is allowed due to security concerns). The port must not collide with one that is already in use (2023, for example, is taken by the demo relay). Here, we choose 2033.
Create a new file graunt/etc/customers.d/CUSTOMER.ini with:
```
[customer:CUSTOMER]
  relay_port = 2033
```

See below for more information on the configuration file.

Rebuild the configuration:
```
[graunt]$ ./rebuild-config
```

An SSH key pair is generated if it doesn’t exist already. So far, we have only configured a carbon-relay server that is ready to accept data. We have not configured any SSH tunnels through which clients can actually send such data. Usually, you will want to add hosts to connect to before rebuilding the configuration again and restarting circus.
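The customer file is a plain INI file. How rebuild-config reads it is not shown here; the sketch below only illustrates that such a file can be parsed with Python's standard `configparser` module, for example to list the relay ports already assigned. The port value and the helper name are illustrative.

```python
import configparser

# Example content in the format described above (illustrative port).
EXAMPLE_INI = """
[customer:CUSTOMER]
  relay_port = 2033
"""


def customer_relay_ports(text):
    """Map customer names to relay ports from '[customer:NAME]' sections."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    ports = {}
    for section in parser.sections():
        kind, _, name = section.partition(":")
        if kind == "customer":
            ports[name] = parser.getint(section, "relay_port")
    return ports


print(customer_relay_ports(EXAMPLE_INI))  # {'CUSTOMER': 2033}
```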

Per-Host Configuration¶

A user account (here: vlstat) needs to be set up on each customer server (here: CUSTOMER.EXAMPLE.COM) and configured to accept an SSH connection for port forwarding.

The SSH public key needs to be copied to the remote server:

[graunt]$ scp -P 22 etc/ssh/id_CUSTOMER.pub CUSTOMER.EXAMPLE.COM:

Create the new user and disable login and password authentication:

[customerhost]$ sudo useradd --shell /bin/true vlstat
[customerhost]$ sudo usermod --lock vlstat
[customerhost]$ sudo mkdir /home/vlstat
[customerhost]$ sudo chown vlstat:vlstat /home/vlstat
[customerhost]$ sudo mkdir /home/vlstat/.ssh
[customerhost]$ sudo chown vlstat:vlstat /home/vlstat/.ssh
[customerhost]$ sudo chmod 0700 /home/vlstat/.ssh

For SUSE Linux:

[customerhost]$ sudo useradd --shell /bin/true -g nogroup vlstat
[customerhost]$ sudo usermod -L vlstat
[customerhost]$ sudo mkdir /home/vlstat
[customerhost]$ sudo chown vlstat:nogroup /home/vlstat
[customerhost]$ sudo mkdir /home/vlstat/.ssh
[customerhost]$ sudo chown vlstat /home/vlstat/.ssh
[customerhost]$ sudo chmod 0700 /home/vlstat/.ssh

The /home/vlstat/.ssh/authorized_keys should have a single entry with the public key and the following configuration:

[customerhost]$ echo 'no-pty,command="/bin/false",no-agent-forwarding,no-user-rc,no-X11-forwarding,permitopen="127.0.0.1:2003"' `cat id_CUSTOMER.pub` | sudo tee -a /home/vlstat/.ssh/authorized_keys
[customerhost]$ sudo chmod 0600 /home/vlstat/.ssh/authorized_keys
[customerhost]$ sudo chown vlstat:vlstat /home/vlstat/.ssh/authorized_keys

For SUSE Linux:

[customerhost]$ sudo chown vlstat:nogroup /home/vlstat/.ssh/authorized_keys

With this configuration, a compromised key can at most be used to intercept the statistics data and to interfere with its collection.

In the file graunt/etc/customers.d/CUSTOMER.ini, add a new section:

[host:CUSTOMER.EXAMPLE.COM]
  customer=CUSTOMER
  ssh_user=vlstat
  ssh_port=22 # can be omitted, 22 is default

Rebuild the configuration:
```
[graunt]$ ./rebuild-config
```
This will also retrieve the server's SSH host key and add it to graunt/etc/ssh/known_hosts, where it is cached (to detect man-in-the-middle attacks). If possible, verify the retrieved key over a second communication channel. It is also a good idea to add it to the above configuration section under the ssh_certificate key.
Reload the circusd configuration for the changes to take effect:
```
[graunt]$ ./circusctl reloadconfig
```
Normally, this should restart all changed watchers and start newly added watchers. If this does not work correctly, circusd can be restarted with:
```
[graunt]$ ./circusctl quit
[graunt]$ ./circusd
```
On the customer host, install and configure a bucky server for statsd logging from applications (do not use a branch under /opt/vlsXXX).

Install diamond and bucky:
```
[customerhost]$ cd /opt/vls
[customerhost]$ bin/vlshell
[VLS]$ paver install_stats
```
First copy the bucky configuration template, then modify the file:
```
[VLS]$ cp etc/bucky.conf.in etc/bucky.conf
[VLS]$ vi etc/bucky.conf
```
Edit the customer name in name_prefix_parts:
```
name_prefix_parts = ["cust", "CUSTOMER"]
```
This will result in all statsd metrics being prefixed by cust.CUSTOMER. Note that any period (.) in a name part is replaced with an underscore (_).
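The prefix rule can be pictured as follows. This is only a sketch of the rule as described here, not bucky's code, and bucky may add further name components of its own; `prefixed_name` and the metric names are invented for the illustration.

```python
def prefixed_name(prefix_parts, metric):
    """Join the prefix parts and the metric name with periods.

    A period inside a prefix part is replaced with an underscore so that
    it does not create an extra namespace level.
    """
    safe_parts = [part.replace(".", "_") for part in prefix_parts]
    return ".".join(safe_parts + [metric])


print(prefixed_name(["cust", "CUSTOMER"], "portal.requests"))
# cust.CUSTOMER.portal.requests
print(prefixed_name(["cust", "example.com"], "portal.requests"))
# cust.example_com.portal.requests
```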

Install (and edit) the supervisord configuration for bucky as admin:
```
[customerhost]$ cp etc/supervisord.conf.d/bucky.ini /etc/supervisord.conf.d
```
Make sure the paths are OK for your installation.

Activate and start bucky:
```
[VLS]$ supervisorctl add bucky
[VLS]$ supervisorctl start bucky
```
If that does not work, the following sequence has proven more reliable:
```
[VLS]$ supervisorctl reread
[VLS]$ supervisorctl update
[VLS]$ supervisorctl start bucky # probably already autostarted
```
On the customer host, install and configure a diamond server for system statistics.

First copy the diamond configuration template, then modify the file:
```
[VLS]$ cp etc/diamond/diamond.conf.in etc/diamond/diamond.conf
[VLS]$ vi etc/diamond/diamond.conf
```
Edit the customer name in path_prefix in the collectors.default section:
```
path_prefix = cust.CUSTOMER.host
```
This will result in all diamond metrics being prefixed by cust.CUSTOMER.host.SERVERNAME.

Currently, you also have to set the following (the reason is not documented):
```
collectors_path = /opt/vls/lib/python2.7/site-packages/diamond-3.3.506.patch4-py2.7.egg/share/diamond/collectors
```
Also, verify the settings for server.pid_file. Logging configuration is ignored and does not need to be adjusted, as logging happens through supervisord.

Install (and edit) the supervisord configuration for diamond as admin:
```
[customerhost]$ cp etc/supervisord.conf.d/diamond.ini /etc/supervisord.conf.d
```
Make sure the paths are OK for your installation.

Activate and start diamond:
```
[VLS]$ supervisorctl add diamond
[VLS]$ supervisorctl start diamond
```

Diamond¶

diamond by BrightcoveOS is a daemon that collects system metrics and publishes them to carbon. By default, the following collectors are enabled:

cpu
disk space
disk usage
load avg
memory
sockstat
vmstat

More collectors are available, too.

Security¶

carbon only aggregates data when moving it from one time resolution to another. It does not aggregate incoming data with existing data in the database. This means that if two data points arrive that fall into the same time resolution slot, the second overwrites the first. If data is generated more frequently than the highest time resolution, bucky should be used to aggregate the data locally before sending it to carbon less frequently.

Also, there is no freshness check on incoming statistics in carbon: A single write to a specific time in the past overwrites the aggregated data for that time slot.
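The overwrite behaviour can be pictured with a small model in which each timestamp is rounded down to the start of its slot and the last write to a slot wins. This is a simplified model of the behaviour described above, not carbon's code; the 10-second step matches the highest default resolution.

```python
def store(slots, timestamp, value, step=10):
    """Write a value into the slot its timestamp falls into (last write wins)."""
    slot = timestamp - (timestamp % step)
    slots[slot] = value
    return slot


slots = {}
store(slots, 1400000001, 5.0)
store(slots, 1400000007, 7.0)  # same 10 s slot: replaces 5.0
store(slots, 1400000012, 1.0)  # next slot
print(slots)  # {1400000000: 7.0, 1400000010: 1.0}

# A late write to a time in the past silently replaces the stored value.
store(slots, 1400000003, 0.0)
print(slots)  # {1400000000: 0.0, 1400000010: 1.0}
```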

carbon-cache must bind to localhost only, as otherwise anybody with access to the port can override any data in any namespace. As there is no authentication between the carbon-relay instances and carbon-cache, the machine running graunt should not be used for any other purpose.

carbon-relay instances that are forwarded over SSH must bind to localhost only for the same reason. The remote account must be dedicated to carbon and allow forwarding of that one port only, as the key is stored unprotected in the graunt SSH configuration. Each relay restricts incoming data to whitelisted namespaces only.
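Conceptually, the namespace restriction is a prefix check on the metric path. The sketch below shows such a check with a regular expression; it is an illustration of the idea under that assumption, not the relay's actual configuration or code, and the metric paths are made up.

```python
import re


def make_namespace_filter(namespace):
    """Return a predicate that accepts only metric paths below `namespace`."""
    pattern = re.compile(r"^" + re.escape(namespace) + r"\.")
    return lambda path: bool(pattern.match(path))


allowed = make_namespace_filter("cust.CUSTOMER")
print(allowed("cust.CUSTOMER.host.web1.cpu.total.idle"))  # True
print(allowed("cust.OTHER.host.web1.cpu.total.idle"))     # False
print(allowed("cust.CUSTOMER2.host.web1.load"))           # False
```

Anchoring the pattern and requiring the trailing period keeps a customer from writing into a namespace that merely starts with the same characters.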

carbon-relay ports for company data should only be accessible from the internal network.

StatsD¶

The statsd interface has the following properties compared to the carbon line interface:

It uses UDP as a transport, which means that a failing service does not impact the function of the application that sends the metrics.
The data can be high-frequency and is aggregated before sending it to carbon.
Several statistics are derived from a data series, and the resulting statistics are also sent to carbon.

There are light-weight client libraries available to add instrumentation to an existing application easily.

Upstream documentation is available.

StatsD metric types¶

The following metric types are supported (as described in the upstream documentation):

Type	Description
counter	Event count per second.
timer	Time interval measurements with various statistics.
gauge	Constant data that is already aggregated.
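On the wire, each metric is a short text datagram of the form `name:value|type`, where the type is `c` for counters, `ms` for timers, and `g` for gauges, optionally followed by `|@rate` for sampled counters. The sketch below formats and sends such datagrams; the metric names are hypothetical, and the default port is the customer-side statsd port from the network table.

```python
import socket


def format_metric(name, value, metric_type, sample_rate=None):
    """Build a statsd datagram: 'name:value|type' plus optional '|@rate'."""
    packet = "%s:%s|%s" % (name, value, metric_type)
    if sample_rate is not None:
        packet += "|@%s" % sample_rate
    return packet


def send_metric(packet, host="127.0.0.1", port=8125):
    """Send one datagram over UDP without waiting for any reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet.encode("ascii"), (host, port))
    finally:
        sock.close()


print(format_metric("portal.requests", 1, "c"))        # portal.requests:1|c
print(format_metric("portal.requests", 1, "c", 0.1))   # portal.requests:1|c|@0.1
print(format_metric("portal.render_time", 320, "ms"))  # portal.render_time:320|ms
print(format_metric("portal.sessions", 42, "g"))       # portal.sessions:42|g
```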

Counter Metrics¶

Counters are events per second. They are counted and normalized in the statsd server, so that the application only needs to report increments. To save bandwidth to the statsd server, client libraries support setting a sample rate: only a sample of the counter increments is then reported to statsd, which scales the sampled counts up accordingly.

Name	Description
`*.count`	The total number of events.
`*.rate`	The average number of events per second.
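The relationship between sampled increments, count, and rate can be sketched as follows. The flush interval of 10 seconds is an assumption for the example, and `flush_counter` is not a bucky function.

```python
def flush_counter(reports, flush_interval=10.0):
    """Aggregate (increment, sample_rate) reports into count and rate.

    Each increment is scaled up by 1 / sample_rate to compensate for the
    increments that the client did not report.
    """
    count = sum(increment / sample_rate for increment, sample_rate in reports)
    return {"count": count, "rate": count / flush_interval}


# Three reported increments at a sample rate of 0.5 stand for six events.
reports = [(1, 0.5), (1, 0.5), (1, 0.5)]
print(flush_counter(reports))  # {'count': 6.0, 'rate': 0.6}
```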

Timer Metrics¶

Timers are durations in milliseconds. From the raw data, various statistics are reported.

Name	Description
`*.mean`	Average (currently of the values within the 90th percentile)
`*.upper`	Maximum
`*.upper_90`	90th percentile
`*.lower`	Minimum
`*.count`	Number of data points
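The following sketch shows one way these statistics can be derived from the raw durations of a flush interval. It assumes a non-empty list and a simple rounding rule for the percentile cut-off; the exact rule used by bucky may differ.

```python
def timer_stats(values, percentile=90):
    """Compute the timer statistics listed above for one flush interval."""
    ordered = sorted(values)
    count = len(ordered)
    # Number of largest values that lie above the percentile threshold.
    dropped = int(round((100 - percentile) * count / 100.0))
    # Values within the percentile (always at least one value).
    kept = ordered[:max(count - dropped, 1)]
    return {
        "mean": sum(kept) / float(len(kept)),
        "upper": ordered[-1],
        "upper_%d" % percentile: kept[-1],
        "lower": ordered[0],
        "count": count,
    }


print(timer_stats([10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]))
# {'mean': 50.0, 'upper': 1000, 'upper_90': 90, 'lower': 10, 'count': 10}
```

In this example the single outlier of 1000 ms shows up in `upper` but does not distort `mean`, which is the point of averaging only the values within the percentile.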

Gauge Metrics¶

Gauges are constant values that do not change until they are updated. Because they are resubmitted by statsd in each flush interval, an unchanged gauge yields a flat line in the graph.

Gauges are stored directly under the provided name.

Example Letter¶

Dear Mr./Ms. …,

To help you ensure availability and resolve potential problems in VLS instances more quickly and easily, we are migrating our performance metrics system from the previous "collectd" to "carbon". This results in a change to the network configuration:

Until now, performance metrics were transmitted unencrypted over UDP on port 25826.

In the new system, performance metrics are transmitted in encrypted form through an SSH tunnel that is initiated from the Semantics network. For this purpose, a new user "vlstat" is created on your system. Technical safeguards restrict this user so that it can be used exclusively to establish a port-forwarding tunnel (local TCP port 2003), with no shell access. The SSH connection is kept open permanently, is re-established automatically if it drops, and is used exclusively to transmit the performance metrics.

We can use the existing SSH access for this, so no changes may be needed on your side.

The migration is still in its test phase. Once it is complete, the unencrypted UDP port is no longer needed and can be blocked in your firewall. We will notify you separately when that happens.

The benefits for you are that the connection will be fully encrypted in the future, and that the new system lets us detect and resolve performance problems more quickly. We therefore hope that the migration is in your interest as well.

Kind regards,