Monitoring host availability using Nagios.

I just want it to work!

It was with a due sense of urgency that I stumbled into my Christmas break this year, like a marathon runner barely making it over the finish line. Oh, I thought I was prepared, I had my remote access after all… what else would I need? Two days in, the warm blanket of pride was ripped from me as I attempted to check my mail.

To my horror I discovered that I couldn’t access the mail service?! Surely my rock solid Linux mail server hadn’t crashed or, worse, developed a hardware error rendering it useless? I needn’t have worried: some cursory checks revealed that either the data circuit had gone down or there had been a power cut (we get a lot of them where we are) that the router may not have recovered from. That’s when I got the call from one of my users who was also having problems…

After a prolonged period of moans and groans I finally went onsite to check things and discovered that the server was ok, but the ADSL line was dead – not even a dial tone!! After a chat with the Telco, and various checking of logs on their side, I discovered that the circuit had been down for almost 24 hours!!

This was when the sickening feeling came over me that it had been down for so long and I had no idea. Blissful at home in my smugness that everything was ok and it wasn’t. That was the moment. The moment I realised I needed the drop on my users. I needed to know there was a problem before they did… I needed some monitoring software.

Nagios? How do you pronounce that?

I bang on about using Ubuntu for everything, and it’s not that I have shares in the company or anything, but because it’s a solid platform that is fairly scalable across a range of hardware configurations. There are so many flavours of Linux out there and I don’t claim that Ubuntu is the best of the bunch – but it works really well for everything I need to do, and since money is tight, I can reuse machines too slow to run Windows (plus I feel good that I’m not creating waste 🙂 )

Nagios – an acronym for Nagios Ain’t Gonna Insist On Sainthood after the name NetSaint couldn’t be used – is a powerful monitoring system that enables you to identify and resolve IT infrastructure problems before they affect critical business processes. It sounded perfect for what I needed…

Package installation

You know what they say about assumptions…

For this exercise the assumption is that you’re installing onto a clean Ubuntu 10.04+ system with no other software installed. Nagios pulls in Apache and Postfix as dependencies, so if they aren’t already installed, they will be as part of the Nagios installation.

I won’t go into configuring Apache or Postfix in this journal entry; I’ll leave that up to you, the reader…

Install the core packages.

sudo aptitude update
sudo aptitude install nagios3

You’ll notice a number of dependencies will be listed as part of the install. Answer Y to continue.

The order of the prompts during installation may vary, but when prompted by Postfix I selected “Satellite System” and provided the name of my main internal mail server when asked for a “relay server”. This could just as easily be your ISP’s server but might require some additional tweaking of the main.cf file.

Next you’ll be prompted for a password for the “nagiosadmin” account. Select something suitable, retyping it to confirm then allow the installation to continue.

Once installed, if everything has gone ok you should be able to hit the ground running. To access Nagios, point a browser to http://your.server.name/nagios3. You will then be prompted for the nagiosadmin username and password.

In the left-hand column you’ll see a number of categories. If you click on “Service Detail” you will see that Nagios has created an entry for this server as well as detecting the default gateway for this server. The status for the very first login will probably be “Pending”, but if you check the “Status Information” column you’ll see when the check is scheduled to take place.

After a matter of minutes, the status for the two hosts as well as the services should turn green and be listed as “OK”. The default method for checking “host-alive” is to ping the host, so if you have ICMP turned off Nagios will show the host as “Down” until you change the “host-alive” method to something different.

Adding additional hosts

From what I can see Nagios is a fairly modular system in its orientation. Configuration files for each host are created in the /etc/nagios3/conf.d directory and are loaded when the service starts (or is restarted!)
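One thing that can catch you out here: as far as I understand it, Nagios only reads files in that directory (and its subdirectories) whose names end in .cfg. The little sketch below lists what would be picked up. The function name list_object_files is my own invention, and the default path assumes the Ubuntu nagios3 package layout.

```shell
#!/bin/sh
# Hypothetical helper: list the object files Nagios would load from a
# configuration directory. Assumption: only *.cfg files are read, and
# subdirectories are searched too.
list_object_files() {
    # $1 = directory to inspect (defaults to the Ubuntu nagios3 location)
    find "${1:-/etc/nagios3/conf.d}" -type f -name '*.cfg' | sort
}
```

Running list_object_files with no arguments on the Nagios server should show the stock files plus anything you have added. If a host file you created is missing from the list, check its extension.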

You can add hosts to groups (hostgroups_nagios2.cfg), and then define service “checks” for members of those groups (services_nagios2.cfg and generic-service_nagios2.cfg). This helps to avoid repeating commands across configuration files. The same concept applies to the default host settings (generic-host_nagios2.cfg): a definition pulls them in with the “use” directive, which functions the same way an “include” command would.

OK, let’s start by adding a host to monitor. At this stage I’ll assume that you’ve logged into Nagios using the web interface and that the NRPE (Nagios Remote Plugin Executor) client hasn’t been installed on anything yet (except localhost!). It’s worth noting that (depending on your requirements) the NRPE is not required in a simple setup where the Nagios server only runs basic keep-alive checks against each host.

First of all have a look at /etc/nagios3/conf.d/hostgroups_nagios2.cfg

NOTE:

We’re only going to look at some of the files. There should be no need to alter them at this stage.

sudo vi /etc/nagios3/conf.d/hostgroups_nagios2.cfg

You’ll notice a list of host groups, each with a name, alias and member list. In this first configuration example we’ll only look at adding something simple that uses PING as its “host-alive” and service check. In particular note the entry for the host group titled “ping-servers”.

define hostgroup {
    hostgroup_name   ping-servers
    alias            Pingable servers
    members          ...   ; your default gateway is probably the only thing in here
}

Next let’s have a look at /etc/nagios3/conf.d/services_nagios2.cfg.

sudo vi /etc/nagios3/conf.d/services_nagios2.cfg

define service {
    hostgroup_name         ping-servers
    service_description    PING
    check_command          check_ping!100.0,20%!500.0,60%
    use                    generic-service
    notification_interval  0 ; set > 0 if you want to be renotified
}

You can see that any member of “ping-servers” in the host groups file will have the service “PING” added to it. Using these files means that we don’t need to add specific service entries to each host definition, which simplifies deployment.

Lastly, have a look at /etc/nagios3/conf.d/generic-host_nagios2.cfg. The important thing to note in this file is the definition for check_command. The command check-host-alive basically uses ping to check if a host is there and responding. If your hosts have ICMP ping disabled then this command will result in Nagios reporting that a host is down. Think about the various ways you can check a host is alive without using ping for those “special” hosts, as you’ll need to define something here (which is preferred) or in the host definition itself.

For now, our host is “ping-able” so it’s time to create a host definition file. I use the FQDN as the file name or, for hosts on the local network, the host name with the internal domain.
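If the check_command line looks cryptic, it can help to expand it by hand. The sketch below assumes the stock Ubuntu definition of check_ping, where the text after each “!” is handed to the plugin as its warning (-w) and critical (-c) thresholds, and assumes the plugins live in /usr/lib/nagios/plugins. The helper name expand_check_ping is made up for illustration and is not part of Nagios.

```shell
#!/bin/sh
# Hypothetical helper: print the plugin command line that a
# "check_ping!WARN!CRIT" entry roughly corresponds to.
# Assumption: plugins are installed in /usr/lib/nagios/plugins.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/lib/nagios/plugins}"

expand_check_ping() {
    check_command="$1"   # e.g. check_ping!100.0,20%!500.0,60%
    address="$2"         # e.g. 192.168.1.250

    name="${check_command%%!*}"   # everything before the first "!"
    rest="${check_command#*!}"    # everything after the first "!"
    warn="${rest%%!*}"            # first argument: warning threshold
    crit="${rest#*!}"             # second argument: critical threshold

    printf '%s/%s -H %s -w %s -c %s\n' \
        "$PLUGIN_DIR" "$name" "$address" "$warn" "$crit"
}
```

Calling expand_check_ping 'check_ping!100.0,20%!500.0,60%' 192.168.1.250 prints a command you can paste into a terminal on the Nagios server. With these values the plugin warns once the average round trip passes 100 ms or packet loss reaches 20%, and goes critical at 500 ms or 60% loss.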

sudo vi /etc/nagios3/conf.d/hostname1.local.cfg

define host {
    host_name    hostname1.local
    alias        Core Switch      ; meaningful description here
    address      192.168.1.250    ; the IP address of the device/host
    use          generic-host     ; includes settings in generic-host_nagios2.cfg
}
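Once you have more than a couple of hosts to add, typing these out by hand gets old. Here is a rough sketch of a shell function that prints a definition in the same shape as the one above. The name make_host_cfg is my own, and it only covers the four directives used in this example.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal host definition.
# Arguments: host name, alias (quote it if it contains spaces), IP address.
make_host_cfg() {
    host_name="$1"
    alias_text="$2"
    address="$3"
    cat <<EOF
define host {
    host_name    $host_name
    alias        $alias_text
    address      $address
    use          generic-host
}
EOF
}
```

To write the file in one go, something like make_host_cfg hostname1.local "Core Switch" 192.168.1.250 | sudo tee /etc/nagios3/conf.d/hostname1.local.cfg should do it.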

Note:

You can also have a specific icon image appear next to the host’s entry in the Nagios monitor. To do this add the icon_image directive and refer to an image in (or relative to, as this is the root of the image’s location) /usr/share/nagios3/htdocs/images/logos. There are also more images in /usr/share/nagios3/htdocs/images/logos/base. For example:
icon_image    base/win40.png

Now let’s add this host into the “ping-servers” group…

sudo vi /etc/nagios3/conf.d/hostgroups_nagios2.cfg

define hostgroup {
    hostgroup_name    ping-servers
    alias             Pingable servers
    members           ...,hostname1.local
}

NOTE:

For the members directive, you include the name of the host exactly as you defined it with the host_name directive in its definition file. Separate multiple entries with a comma.

Save the file and restart Nagios:
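If you would rather not edit the members line by hand, the same change can be scripted. This is only a sketch: add_member is a made-up name, and it assumes that hostgroup_name appears before members inside each block, that the members line has no trailing comment, and that the host isn’t already listed.

```shell
#!/bin/sh
# Hypothetical helper: append a host to one hostgroup's members line.
# Arguments: hostgroup name, host name to add, path to the hostgroups file.
# The modified file is written to standard output; the original is untouched.
add_member() {
    awk -v group="$1" -v host="$2" '
        # A new "define" block starts: forget which group we were in.
        $1 == "define" { in_group = 0 }
        # Remember whether this block is the group we want.
        $1 == "hostgroup_name" { in_group = ($2 == group) }
        # Inside the right group, extend the members list.
        in_group && $1 == "members" {
            sub(/[ \t]+$/, "")
            $0 = $0 "," host
        }
        { print }
    ' "$3"
}
```

I would send the output to a temporary file, look it over, and only then copy it over /etc/nagios3/conf.d/hostgroups_nagios2.cfg.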

sudo /etc/init.d/nagios3 restart
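A typo in a config file can stop Nagios from coming back up, so it is worth having it check the configuration before restarting. The sketch below wraps both steps. The function name check_then_restart is my own, and the three default paths are assumptions based on the Ubuntu nagios3 package, so adjust them if yours differ.

```shell
#!/bin/sh
# Assumed defaults for the Ubuntu nagios3 package; override via the
# environment if your installation uses different paths.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/sbin/nagios3}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios3/nagios.cfg}"
NAGIOS_INIT="${NAGIOS_INIT:-/etc/init.d/nagios3}"

# Hypothetical helper: restart Nagios only if the configuration verifies.
check_then_restart() {
    # "-v" asks Nagios to verify the configuration and exit.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        "$NAGIOS_INIT" restart
    else
        echo "Configuration check failed; Nagios has not been restarted." >&2
        return 1
    fi
}
```

The function needs root privileges to do the restart, so run it from a root shell or a script started with sudo.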

Once the Nagios service has restarted, go back to the web interface (refresh if required) and you should see an entry for your host under the “Host Detail” section. It may not have a status at this stage as it depends on Nagios’ polling cycle, but it normally doesn’t take any longer than 5 minutes for the status to update. The “Status Information” column should give some hints on when the next check is scheduled.

If you now click on “Service Detail” you should see your host with the name “PING” in the service column.

If you click on “View Config”, and then choose “Commands” as the object type, you can get an idea of the vast array of checks that can be done with Nagios. Remember: some of them work entirely from the Nagios server, but there are a few that require the NRPE in order to extract machine specific information (disk usage, cpu usage, processes, etc…).

NOTE:

Be aware that checking ports directly can cause problems. Case in point (and lesson by experience) I used the check_tcp command to make sure my VNC port was open – the theory being if the port was open, then the “host” must be alive. Problem was VNC interpreted that as a scanning attack (duh) and promptly shut the service down with events stating an invalid login.