When Privacy and Reality Interconnect

His privacy being paramount, Kelly grudgingly chooses to head into Columbia every so often, rather than cede his data to Google or turn over his purchase history to another online retailer. “I’m just not sure why Google needs to know what breakfast cereal I eat,” the 51-year-old said. Washington Post

There are a couple of things to notice here.

First: Google is not the only company out there snarfing up your data. Zuckerbergland apps, Verizon (you know, AOL, Yahoo, Tumblr), and Microsoft (LinkedIn, Bing, all those Microsoft apps like Word, etc.) are only some of them.

Second: Most websites have some form of tracking software on them, and they can be related to any of the three or more listed above.

Third: Despite what the EU would have you believe, GDPR is not your salvation. Many websites outside the EU state in the small print that the site is not intended for consumption by people in the EU, which means that the GDPR has zero impact.

And realistically, if you do not want to be tracked, there is only one way to avoid it. Stay off the Internet. And that includes no smart devices (there is tracking software on them too), no credit cards (who do you think came up with the idea of tracking purchases), and no cheques. In fact, depending where you live, you are being watched by CCTV cameras, where the video is uploaded and searched for malcontents, using AI and facial recognition software. If you travel, you are tracked whether by planes, trains, or automobile (toll plazas, rest stops…). Let’s face it, unless you are a hermit, you have no privacy.

And ironically, we all know that Mr Kelly, who is 51 years old, likes to eat Bob’s Red Mill muesli cereal. So his privacy is now shot too, because he talked to a reporter, and the story ended up…on the Internet.

Monitoring with Prometheus Exercises

Time: 60 minutes

Audience: Developers & DevOps Personnel

Purpose: Introduce basic monitoring using Prometheus so you may apply new skills.


  • Instrument an application using Prometheus
  • Visualize data using Grafana


Before beginning these exercises, it is assumed that the student has a certain level of familiarity with Linux and the command-line, as well as Git commands.

Set Up The Environment

For this exercise, we will use a version of Linux (CentOS) provided by Hashicorp called a Bento Box. If you have experience with other virtual environments, you can use them.

Whichever environment you select, some Linux command-line skills are assumed.

If you selected a physical environment, skip ahead and download the correct version of Prometheus for your OS. If you selected a virtual environment, follow the steps below.

A host operating system is the OS the virtual environment will run on. The guest is the virtual environment.

Virtual Box and Vagrant

  1. Download the appropriate version of Virtual Box and the VM VirtualBox Extension Pack.
  2. Install Virtual Box according to the instructions for your host operating system.
  3. Download the version of Vagrant that matches the host operating system you installed Virtual Box on. (For Windows, you will want the 64-bit version for Windows 7 and up.)
  4. Install Vagrant according to the instructions for your host operating system. [1]
  5. Verify that Vagrant installed correctly. Open a command line (Terminal) and type vagrant --version
  6. Fetch an appropriate Vagrant box. A network connection from the host is needed for this. For this exercise, we will use the CentOS 7.6 Bento Box. To get the box, type vagrant init bento/centos-7.6 [2]
  7. Everything is ready, so start the box. Type: vagrant up
  8. Verify there are no errors. If there are none, the box started successfully. If there are errors, check them carefully. Any errors are likely related to networking issues or to limits on access to the PC.
  9. Once the box is up, log in. Open a command prompt and type vagrant ssh

It is important to put vagrant in front of the ssh command so it finds the key needed to log in. Feel free to tour around inside the guest.

  10. To access the Prometheus web port, it is necessary to modify the Vagrantfile, usually located in the root of the host’s home directory: ~/Vagrantfile on a Mac, or C:\Users\username on Windows. Look for the line: # config.vm.network "private_network", ip: "". Remove the leading comment character (#), or retype the line, and set the IP address. Any valid IP address can be used.
  11. Reload the guest. Type: vagrant reload

Add Git

You will need to add the git application on the virtual machine.

  1. Ensure you are logged into the guest with vagrant ssh
  2. Type $ sudo yum install git -y
  3. It will install about 32 packages and quit without error.

Add Docker

To demonstrate the Java metrics and Grafana, we will install a Docker container with some prebaked examples, but to run them, we need to install Docker.

  1. Install the needed packages.
    $ sudo yum install -y yum-utils device-mapper-persistent-data lvm2
  2. Configure the repository.
    $ sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
  3. Install the Docker Community Edition.
    $ sudo yum install docker-ce
  4. Add your user to the docker group.
    $ sudo usermod -aG docker $(whoami)
  5. Set Docker to start at boot.
    $ sudo systemctl enable docker.service
  6. Start Docker.
    $ sudo systemctl start docker.service

For this exercise we will not need Docker Compose.

Install Prometheus

  1. Visit the Prometheus website and download the most recent version of Prometheus for AMD64 Linux platforms (for example prometheus-2.9.2.linux-amd64.tar.gz). In the commands below, substitute the file name of the version you downloaded.
  2. Transfer the tarball to your guest. This can be done via SCP. On the host OS, install the vagrant-scp plugin. Type: vagrant plugin install vagrant-scp Then type: vagrant scp local_file_path_in_HostOS :remote_file_path_in_GuestOS. For example: vagrant scp prometheus-2.9.2.linux-amd64.tar.gz :~/ (If your guest OS is not running you will get an error message. You can type vagrant up to start it.)
  3. The SCP will put the tarball in the root of the home directory of the default (vagrant) user.
  4. Untar the file. Type tar -zxvf prometheus-2.9.2.linux-amd64.tar.gz
  5. Change into the Prometheus directory. You can test Prometheus by typing $ ./prometheus which will print a number of startup messages.

Then type <CTRL>-C to stop it, for now.


Adding Prometheus instrumentation to your Java code can be managed in a couple of ways. As we discussed in class, we have the following examples.

Counter Example

import io.prometheus.client.Counter;

class YourClass {
  // A counter only ever goes up; register() adds it to the default registry.
  static final Counter requests = Counter.build()
     .name("requests_total").help("Total requests.").register();

  void processRequest() {
    requests.inc();
    // Your code here.
  }
}

Gauge Example

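import io.prometheus.client.Gauge;
class YourClass {
  // A gauge can go up and down, so it suits "how many right now" values.
  static final Gauge inprogressRequests = Gauge.build()
     .name("inprogress_requests").help("Inprogress requests.").register();

  void processRequest() {
    inprogressRequests.inc();
    // Your code here.
    inprogressRequests.dec();
  }
}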

Putting it all together

import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.Histogram;
import io.prometheus.client.Summary;
import io.prometheus.client.exporter.HTTPServer;

import java.io.IOException;

public class Main {

   // Random value in the range [min, max).
   private static double rand(double min, double max) {
       return min + (Math.random() * (max - min));
   }

   public static void main(String[] args) {
       Counter counter = Counter.build().namespace("java").name("my_counter").help("This is my counter").register();
       Gauge gauge = Gauge.build().namespace("java").name("my_gauge").help("This is my gauge").register();
       Histogram histogram = Histogram.build().namespace("java").name("my_histogram").help("This is my histogram").register();
       Summary summary = Summary.build().namespace("java").name("my_summary").help("This is my summary").register();

       // Background thread that updates every metric with random data once a second.
       Thread bgThread = new Thread(() -> {
           while (true) {
               try {
                   counter.inc(rand(0, 5));
                   gauge.set(rand(-5, 10));
                   histogram.observe(rand(0, 5));
                   summary.observe(rand(0, 5));

                   Thread.sleep(1000);
               } catch (InterruptedException e) {
                   e.printStackTrace();
               }
           }
       });
       bgThread.start();

       try {
           // Expose the registered metrics over HTTP on port 8080.
           HTTPServer server = new HTTPServer(8080);
       } catch (IOException e) {
           e.printStackTrace();
       }
   }
}

Using io.prometheus

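You can also modify the build to pull in the client libraries as dependencies. For Maven, you edit your pom.xml something like this (replace VERSION with the client release you want to use):

<!-- The client -->
<dependency><groupId>io.prometheus</groupId><artifactId>simpleclient</artifactId><version>VERSION</version></dependency>
<!-- Hotspot JVM metrics-->
<dependency><groupId>io.prometheus</groupId><artifactId>simpleclient_hotspot</artifactId><version>VERSION</version></dependency>
<!-- Exposition HTTPServer-->
<dependency><groupId>io.prometheus</groupId><artifactId>simpleclient_httpserver</artifactId><version>VERSION</version></dependency>
<!-- Pushgateway exposition-->
<dependency><groupId>io.prometheus</groupId><artifactId>simpleclient_pushgateway</artifactId><version>VERSION</version></dependency>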

For more examples, please consult the Prometheus Java Client page.
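
If you are curious what the HTTPServer class is doing on your behalf, the following is a minimal sketch that uses only the JDK’s built-in com.sun.net.httpserver package rather than the Prometheus client. The TinyExporter class, its requests counter, and the render and start helpers are hypothetical names chosen for illustration. The real client library handles labels, escaping, and every metric type, none of which this sketch attempts.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the client library: a single counter served as plain text.
class TinyExporter {
    static final AtomicLong requests = new AtomicLong();

    // Render the counter in the Prometheus text exposition format.
    static String render() {
        return "# HELP requests_total Total requests.\n"
             + "# TYPE requests_total counter\n"
             + "requests_total " + requests.get() + "\n";
    }

    // Start an HTTP server that answers requests to /metrics with render().
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; version=0.0.4");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        requests.incrementAndGet(); // pretend one request was handled
        start(8080);                // assumed port; pick one that is free on your machine
    }
}
```

With this running, curl localhost:8080/metrics returns the three lines built by render(). That plain-text page is all Prometheus needs in order to scrape an application.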

Java Metrics and Prometheus

A Docker container with the examples above can be installed.

  1. Type $ git clone https://github.com/sysdiglabs/custom-metrics-examples
  2. Type $ sudo docker build custom-metrics-examples/prometheus/java -t prometheus-java

Depending on what Docker already has cached, the build may need to download and rebuild image layers. This will take a couple of minutes.

  3. Type $ sudo docker run -d --rm --name prometheus-java -p 8080:8080 -p 80:80 prometheus-java
  4. Once you have a running container, check the available output by typing $ curl localhost:8080 which prints the metrics the application is exporting.

The metrics are now being exported successfully by the application. Now we need to configure Prometheus.

  1. Change into the Prometheus directory
  2. Move or delete the prometheus.yml file. This is the core configuration file for Prometheus.

This file is YAML (originally Yet Another Markup Language, now YAML Ain’t Markup Language), so indentation matters. Because our Java application is exporting metrics on port 8080, Prometheus itself will listen on 9090, its default port.

global:
  scrape_interval: 10s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090

This is a basic configuration.

global: Defines the global parameters for Prometheus operation. The scrape interval is the time, in seconds, between system scrapes. Prometheus defaults to one minute [1m] as the time between scrapes. For demonstration purposes we will use ten seconds. The actual time should be a balance between need for metrics and load on the system.

scrape_configs: Defines the job name, labels, and other settings that describe a scrape. Prometheus scrapes over HTTP by default, but the scheme can be changed (for example to HTTPS). There is also the ability to pass usernames and passwords if needed. The full set of documented options is available in the Prometheus documentation.

For this exercise, we will scrape the localhost, port 9090, every 10 seconds. The job name is prometheus. Save the YAML above as prometheus.yml, then start Prometheus with $ ./prometheus.

Open a web browser and point it at port 9090 of the machine running Prometheus (for the guest, use the address you configured in Step 10 in setting up Vagrant). This is the main visualization page. From here you can examine a number of default metrics collected by Prometheus. You can verify the host status by clicking on Status | Targets, which lists each scrape target and its state.

If you click on the Prometheus name, you will return to the PromQL Expression Browser main screen, where you can type any number of PromQL (the Prometheus Query Language) queries. For example, typing up returns a value of 1 for each target that is up. This is useful for debugging.

If you click on Insert Metric at Cursor, you can get an idea of the default metrics that can be viewed. You will notice there are no Java metrics.

To add them:

  1. Stop Prometheus (<CTRL>-C)
  2. Edit the YAML file to add the 8080 port. It will look like this:

global:
  scrape_interval: 10s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
          - localhost:8080

  3. Restart Prometheus.
  4. Return to your browser and the Java metrics will show up.
  5. You can visualize them with the built-in graph feature (the graph fills in once the Prometheus server has scraped several cycles).

Prometheus configuration

There are hundreds of configuration options for Prometheus, including hosts and other library metrics. For example, if you have a number of hosts, you might see a configuration like this:

global:
  scrape_interval: 10s
  evaluation_interval: 10s
rule_files:
  - rules.yml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
  - job_name: node
    static_configs:
      - targets:
          - localhost:9100

This configuration comes from a network with four instances of Prometheus, collecting data from the node_exporter module (which exports metrics related to the OS, network, and other node-related information). With it, you can build reports such as one that aggregates traffic across network interfaces.

The Prometheus documentation provides a number of configuration suggestions, and there is a prometheus tag on Stack Overflow with just as many suggestions. There is also a great book, Prometheus: Up & Running, available through O’Reilly’s Safari subscription.


The Prometheus dashboard is handy for quick sanity checks, but it is not a great solution for long-term use. Enter Grafana, a popular tool for building just these sorts of dashboards.

The easiest way to plug into our existing configuration is to grab a copy of the Grafana Docker image. This will not use a volume mount, so it is not a good long-term solution.

$ sudo docker run -d --name=grafana --net=host grafana/grafana:5.0.0

Grafana will present itself on port 3000. Log in as admin/admin.

  1. Change the Data Source to Prometheus (you can name it whatever you want; I chose Prometheus).
  2. Make sure to set your address correctly, then click Save & Test. Grafana will confirm that the data source is working.
  3. Return to the dashboard, click edit, and type in java_my_gauge. You will see the values being pulled from your Prometheus instance.
  4. Over time, your graphs will begin to fill in.

Putting it all together

Now that you have an overview of the pieces, take some existing code and modify it to include counters or gauges as appropriate, build it, and then run it where Prometheus can capture the metrics. Once you have done that, validate that they are being picked up by looking in the Prometheus web page and then in Grafana.

Additional Exercises

  • Investigate /home/vagrant/custom-metrics-examples/prometheus/java/src/main/java/Main.java in the custom-metrics-examples repository you cloned.
  • Grab the node_exporter module, add it to your Prometheus configuration, and experiment with the metrics.
  • Try out some PromQL, for example rate(node_network_receive_bytes_total[1m])
  • Play with labels in Prometheus. How can they benefit you?
  • Look at the alerting module. What changes would you need to make to the alertmanager.yml to make it work in your environment?
  • Play with Grafana. What other visualizations can you create?
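
For the rate() query in the list above, it can help to see roughly what is being computed. The sketch below is a simplification written for this exercise, not Prometheus code: the RateSketch class and its sample data are hypothetical, and it skips the extrapolation to the window edges that the real rate() function performs. It does show the two core ideas, which are summing the increases between samples and treating a drop in value as a counter reset.

```java
// Simplified sketch of what PromQL's rate() computes for one counter series.
class RateSketch {
    // timestampsSec and values are parallel arrays of the samples inside the
    // window, oldest first. Returns the average per-second increase.
    static double rate(double[] timestampsSec, double[] values) {
        if (values.length < 2) {
            return Double.NaN; // this sketch needs at least two samples
        }
        double increase = 0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            // A counter only goes up. A drop means the process restarted and
            // the counter began again from zero, so count the new value itself.
            increase += (delta >= 0) ? delta : values[i];
        }
        double elapsed = timestampsSec[timestampsSec.length - 1] - timestampsSec[0];
        return increase / elapsed;
    }

    public static void main(String[] args) {
        double[] t = {0, 10, 20, 30};
        double[] bytes = {1000, 1500, 200, 700}; // reset between 2nd and 3rd sample
        System.out.println(rate(t, bytes));      // (500 + 200 + 500) / 30 seconds
    }
}
```

This is also why rate() should only be applied to counters. A gauge is allowed to go down, and this logic would misread every decrease as a restart.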

Notes

  1. This is not a course in how to use or manipulate Virtual Box or Vagrant. If you need more help there is plenty of documentation on the Internet to help you out. Generally, both Virtual Box and Vagrant install without issues.
  2. If you have other Vagrant boxes configured, you will get an error about an existing Vagrantfile. You can modify your existing Vagrantfile to include this new box.

Can I Get The Recipe For That – Pasta Dough

I do not have an Italian grandmother. In fact, I never really knew either of my grandmothers when I was growing up, and my mother’s favorite dinner was reservations, so learning how to cook has been an effort in trial, error, and ordering pizza.

If you are going to cook a meal that involves pasta, you really should learn how to make your own. I say this with a proviso: making your own pasta calls for some additional kitchen tools, which cost a couple of hundred dollars initially, so you are investing in more than just one or two meals. You can do all of this by hand, of course, but it will take a lot more time.

For this recipe, we will look at how I make the dough. I have used this dough recipe successfully for fettuccine as well as other noodle shapes. The recipe comes from the King Arthur Flour Pasta Flour Blend, available from King Arthur Flour. The key to good pasta dough is the flour, specifically a blend of semolina, all-purpose, and high-protein, finely milled "00" flour, so you really should try to get it. You can find it in some grocery stores and some specialty stores, or you can order it directly from King Arthur.


3 cups of pasta flour
4 large eggs
2 to 4 tbsp water
extra flour for the work surface


Place the flour in a food processor, bread machine, or bowl. Mix in eggs all at once. Knead, adding only enough water to form a smooth dough. Form dough into a rectangle, about 1” thick, wrap well and rest for 30 minutes.

After 30 minutes, cut off a chunk, flour both sides of the dough and run it through a pasta machine on the thickest setting. Repeat the process, flouring as necessary and gradually reducing the setting until the desired thickness is reached.

Cut into shapes and toss with flour to prevent sticking. Hang in individual strands or arrange in small nests and allow to dry.

To cook: Boil 4 quarts of water and 1 tablespoon of salt. Add pasta and cook for 2 to 4 minutes until pasta is still slightly firm. Fresh pasta cooks quickly. Drain and toss with oil or sauce.

A Political Stunt? Like Repealing ACA?

This got hung up somewhere, but it is just as valid today, when Congress still has not accomplished anything of note while the parties squabble with each other, as it was when I wrote it before the shutdown....

Shutdown: Congress shouldn’t get paid during standoff, Va. rep says | WTOP

But McConnell called it a “political stunt” and said it would be useless to allow a vote that wouldn’t get Trump’s signature. The GOP lawmaker said simply, “This would not produce a result.”

In 2012, I wrote a post on the House of Representatives’ 33rd vote to repeal the Affordable Care Act. At the time I questioned why the Party of No was wasting the taxpayers’ time with a show vote that would not result in anything more than, well, a show vote.

Fast forward six years, and we have a Senate that seems to think show votes are a waste of time, and therefore, they will not do it. Except that this time, the vote is not a show vote. It is an adult, bringing a bill to the floor of the Senate, to be voted on to end the government shutdown. And this time, it is likely that even if the President does not sign the bill, there will be enough votes to override any veto he might threaten. That is what is scaring the Majority Leader. He is worried to his marrow that he has lost control of the Senate on this particular issue. Especially when the same Senate, not forty-eight hours before the shutdown began, voted in favor of the same bill that the House has presented them.

I am beginning to think that this may actually be the definition of insanity.

Monitoring In the Modern Age


Continuous Integration and Continuous Delivery lead to continuous improvement, which, as we discussed in CI/CD, is a founding principle of DevOps and Site Reliability Engineering. We test, we take the results of those tests, and we improve. But how do we measure this improvement? DevOps utilizes metrics to measure its progress, and the majority of those metrics come from instrumentation that generates metrics and logs. What is monitoring, how does it differ from logging, and does it matter?

The Monitoring Problem

Does this sound familiar: “We need better monitoring!” In most shops it is rare to hear someone say, “We have too much monitoring!” Another frequent lament is: “I am getting too many alerts!” In the quest for better monitoring, the question “What exactly is better monitoring?” needs to be answered, and wrapped up in that are further questions: Is there such a thing as too much monitoring? And, most importantly, how can monitoring provide that all-important single pane of glass that all senior managers demand?

Monitoring used to be simple. Is the server up or down? Is the traffic flowing? Is there enough memory and CPU to execute the application? Companies spent as little as possible on monitoring, and it was rarely a high priority requirement, much less a demand in the problem scope, or considered in advance. Monitoring was reactive. As the transition to mission-critical applications increased, and their increased value to the business became quantified in monetary terms, the status of those applications became critically important. But administrators still had to be aware of the state of servers, disks, and other network equipment as well, which leads to an increase in complexity. Thus, the question becomes: What precisely needs to be monitored? Hint: Not everything.

Monitoring should not be approached as a single point solution. It represents multiple complex problems, and it represents different things to different people. It also means numerous tools will be involved and have to be integrated into an already crowded landscape.

A monitoring strategy should be established from a position of thoughtful construction. It is not enough to say “monitor everything,” even when management feels this is an appropriate solution. With the number of servers, applications, microservices, network devices, and even raw transactions involved, it is not even possible to monitor everything. It is barely possible to monitor most things.

In the context of DevOps/SRE, monitoring serves several purposes.

First, it is a gauge of how the systems are performing. It is critical to note that performance has two facets, the customer facet, and the systems facet. Performance monitoring of one without the other only shows part of the picture.

Second, it provides the necessary metrics to evaluate how, or if, systems are improving as changes are implemented in systems and code. These changes, unmonitored, could also introduce instability into well-behaved systems.

Third, it helps us understand our risk budget and enables success in our postmortems from a science-driven perspective.

Finally, useful metrics permit planning for the future. The one constant in all of these systems is growth. The speed of this growth can be predicted with useful metrics and modeling, but only if the monitoring and the collection of these metrics are accurate and representative.

In modern systems, we need to change the idea of monitoring, what we monitor, and why we monitor.

Modern Monitoring

In physics, the observer effect is the theory that simply observing a situation or phenomenon necessarily changes that phenomenon. In the 1990s, there was a school of thought that agents were terrible. They introduced the observer effect by adding an additional burden on the servers and systems they were supposed to monitor, oft times by masking their impact, or underreporting the server statistics. This led admins to assume their monitoring systems were unreliable because of different statistics being reported by various tools on the same measured value. If the system is unreliable, how can the metrics be used for planning? How can alerts be trusted? Which then leads management to question the value of monitoring, and the money being spent to support it.

Flash forward, and things have changed dramatically. The idea of an agent has also changed. What used to be a precompiled black box of code that seemed to chew up all the available resources is now reduced, in many cases, to small, programmable libraries that expose metrics on an as-needed or on-demand basis.

However, this introduces a more complicated situation. With each set of systems and applications generating metrics, how many monitoring modes does it take to make a system that is capable of monitoring the systems that need to be monitored? If it sounds a bit like the “how much wood can a woodchuck chuck” riddle [1], then the scope of the problem becomes more apparent. There is no easy answer to how many modes are too many modes. The short answer is no more than are necessary, but determining how many are required is an open question. There is, however, a framework to help arrive at the answer that works in the organization.

First, it is an organizational decision. As will be discussed, do not worry about how other companies solve the problem, even if they look like your company.

Second, because the problem solution affects teams differently, ensure that each team’s vision and needs are considered and included. Avoid duplication, but do not assume one team’s needs are any more or any less important.

Finally, revisit all decisions and assumptions regularly. This is not a one and done exercise but a continuous part of the software life cycle of the application or system being monitored.

The same applies when selecting monitoring tools. Choose the tools wisely but do not be afraid to add a tool if there is a solid reason for it, or remove one if it is not doing the job. The tools must be selected to solve a real problem, not just because someone wants it. It is necessary to get groups together to talk about the tools being used, what they are monitoring, and what overlap may exist. If the tool does not solve a real problem, or there are too many overlaps, selecting that particular tool should be reconsidered.

Lastly, when we look at tools, there is a decision around buy or build. This is not so clear cut, but the idea of build or buy is still based on answering the question of what works best and will do what is needed for a reasonable cost. And the cost of monitoring is not just the cost of the software as we will see more clearly when we talk about logging.

One key concern is adopting tools only because someone else is using the tool (and even possibly using it successfully). Just because Google or Netflix shows success using this or that tool does not mean adoption will be successful in another context. Why? Google, or Netflix, or anyone else has put years of development into the tools to solve the monitoring issues they were experiencing. That does not mean the tools will automatically translate to the next environment, or in fact, be usable in any other environment. Do not be dissuaded from adopting the tools of another company. Just ensure that their use case is validated in the environment in which they are going to be used.

Single Pane of Glass

Frequently, someone, usually in senior management, will demand a single pane of glass. One view for everything. The best visual is the bridge of the Enterprise. A wall full of monitors with graphs and spinning wheels. Great eye candy, but not precisely useful at the end of the day for the people that are running the systems. There are several problems with the single pane of glass requirement.

First, how big is the viewport? Imagine a door with a peephole. You can see through the door into the hallway, and see, to an extent, who may or may not be there. But it is not a complete view. Any spy novel will illustrate how limited peephole views are. There might be an assassin hiding in the blind spot, waiting to kill the person behind the door. In a single pane of glass view, it might be a problem waiting to bring down the application. The broader the view, the more confusing the data becomes and the harder it is to understand what the data is telling you, even if it is simplified or aggregated.

Second, and more importantly, it is tough to have a one-to-one mapping of tools to dashboards. One tool may feed multiple dashboards, or one dashboard may be a composite of the input of many tools. Because monitoring is a problem set solution, and a complex one at that, trying to force it into one system is going to hamper the ability to efficiently or effectively work.

Try to determine precisely what is most important to the senior manager, and provide them with that data, but remind them of the peephole problem. And the assassin that might be lurking on the other side of the door.

In this day of specialists, it might seem logical that the people that monitor systems should be specialists. The problem is that monitoring is not a job, it is a skill, and one that all team members need to have, not just for being able to understand what those metrics are trying to say, but to be able to expose the necessary metrics for consumption. After all, having a monitoring solution, just to say there is a monitoring solution, does no one any good.

Why Monitor

After the expenditure of hours, effort, and money on standing up a monitoring system, what exactly is being monitored? Or, in other words, what does up actually mean? And from whose perspective? Monitoring done right provides observability into the system being monitored.

One of the more popular reasons for monitoring is meeting a Service Level Agreement (SLA). Most SLAs involve some financial penalty, either due or refunded depending on the verbiage in the contract. And most SLA goals are so unrealistic that invoking them, or proving their deficiency, would need more than just a talented statistician and a bevy of lawyers. If the reason for monitoring is to avoid or prove an SLA violation, then the resources could be better spent elsewhere. Especially when looking at system availability from the perspective of an SRE model, which assumes no system will be up one hundred percent of the time [2].
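
To put numbers on what an availability target actually permits, here is a small arithmetic sketch. The DowntimeBudget class and the three targets are hypothetical choices for illustration, and a 30-day period is assumed; a real SLA will define its own measurement window and exclusions.

```java
// Converts an availability target into the downtime it allows over a period.
class DowntimeBudget {
    // availabilityTarget is a fraction such as 0.999 (that is, 99.9%).
    static double allowedDowntimeMinutes(double availabilityTarget, int periodDays) {
        double totalMinutes = periodDays * 24.0 * 60.0;
        return totalMinutes * (1.0 - availabilityTarget);
    }

    public static void main(String[] args) {
        for (double target : new double[] {0.99, 0.999, 0.9999}) {
            System.out.printf("%.2f%% over 30 days allows %.1f minutes of downtime%n",
                    target * 100, allowedDowntimeMinutes(target, 30));
        }
    }
}
```

Each extra nine cuts the allowance by a factor of ten: roughly 432 minutes at 99%, 43 minutes at 99.9%, and a little over 4 minutes at 99.99%. That is why targets near one hundred percent are so hard to meet, and so hard to prove were missed.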

Traditional monitoring has been done from the inside out. “Up” means a server is showing CPU activity, has memory available, and the application, such as it is, is running. What does running mean? In the UNIX world, it means that the application has not gone into zombie status [3]. In modern systems, this model is not enough. Systems can be functioning, yet the end-users are unable to access services. Even harder to track, if not adequately instrumented, is the partial disconnect of a system. The user believes a transaction has completed, yet there is no record of that transaction in the database, or worse, the transaction occurs multiple times. When it involves money, this could lead to larger than expected charges, overdrafts, etc. Understanding that the entire system is working correctly from end-user to back-end and back again is critical. A sound monitoring system will properly report the status of messages as well as give you observability, down to the thread level, into what is happening.

Hardware is expensive. Cloud services cost money per minute, or per transaction or per byte of storage. With proper monitoring, metrics can be collected to provide demand forecasting and capacity planning, which results in better cost containment and provides departments the ability to argue for more substantial expenditures based on past usage histories and predictable future loads. Proper capacity planning also allows for growth predictability for new and existing services as well as the ability to increase capacity to support the future load without the risk of being caught short.
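
As a rough illustration of demand forecasting from collected metrics, the sketch below fits a straight line to daily disk-usage samples and estimates when the line crosses the available capacity. Everything here is an assumption made for the example: the CapacityForecast class, the sample numbers, and above all the idea that growth is linear. Real capacity planning usually needs seasonality and a margin of safety on top of this.

```java
// Linear-trend capacity forecast using ordinary least squares.
class CapacityForecast {
    // Fits y = slope * x + intercept; returns {slope, intercept}.
    // Assumes at least two samples with distinct x values.
    static double[] fitLine(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;
        return new double[] {slope, intercept};
    }

    // Day on which the fitted line reaches capacity, or infinity if usage is not growing.
    static double dayCapacityReached(double[] days, double[] usedGb, double capacityGb) {
        double[] line = fitLine(days, usedGb);
        if (line[0] <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return (capacityGb - line[1]) / line[0];
    }

    public static void main(String[] args) {
        double[] days = {0, 1, 2, 3, 4};
        double[] usedGb = {100, 110, 120, 130, 140}; // hypothetical samples: 10 GB/day growth
        System.out.println(dayCapacityReached(days, usedGb, 500)); // day 40 for this data
    }
}
```

The forecast is only as good as the samples behind it, which is the practical reason the metrics need to be accurate and representative.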

With proper monitoring, we can evaluate performance, specifically around behavior. A piece of software is updated. Is the update successful? How is this measured? With metrics, of course. Was the update supposed to improve throughput? If so, did performance improve? Because of monitoring, it is simple to show whether the change was effective in achieving the goal of the upgrade. Or not, and if not, what did happen. This can be returned to the development team to resolve with the support of metrics that show the exact effect the software is having on the system. No more he said/she said arguments or anecdotal discussions.

Finally, with proper monitoring, which leads to useful metrics, a risk budget can be established. To accomplish this, service risk must be identified, and that means metrics. Risk can include costs related to redundancies or opportunity, but if there are no measures, then the risk is only a gut reaction. To be efficient, it has to be measured. Downtime has to be quantified. Application failures must be offset by success and evaluated for value. For example, is a sign-up failure the same value as a failed poll request for new email? Which is the higher risk? Which has the more significant opportunity cost? Monitoring metrics help answer questions based on business cases.
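
One way to make the sign-up versus email-poll comparison concrete is to measure each service against its own target and ask how much of its allowance has been used. The sketch below is an illustration under assumed numbers: the ErrorBudget class, the request counts, and the two targets (99.9% for sign-ups, 99% for polls) are all hypothetical.

```java
// Expresses observed failures as a fraction of what a service's target permits.
class ErrorBudget {
    // sloTarget is the fraction of requests that must succeed, such as 0.999.
    // Returns 1.0 when the budget is exactly used up, more when it is overspent.
    static double budgetConsumed(long totalRequests, long failedRequests, double sloTarget) {
        double allowedFailures = totalRequests * (1.0 - sloTarget);
        return failedRequests / allowedFailures;
    }

    public static void main(String[] args) {
        // Hypothetical week: few sign-ups with a strict target, many polls with a loose one.
        double signUps = budgetConsumed(10_000, 5, 0.999);
        double polls = budgetConsumed(1_000_000, 5_000, 0.99);
        System.out.printf("sign-up budget used: %.0f%%%n", signUps * 100);
        System.out.printf("poll budget used:    %.0f%%%n", polls * 100);
    }
}
```

With these numbers, five failed sign-ups consume as much of their budget as five thousand failed polls consume of theirs, about half in each case. Raw failure counts alone would have suggested the opposite priority.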

Profiling vs Monitoring

Monitoring allows for the collection of metrics about events from all over the environment. It is usually limited to numbers that can be aggregated, while the raw information is purged shortly after it is collected. If more specific information is needed, or it must be retained over long periods, we are generally talking about logging.

Profiling, compared to monitoring, acknowledges that capturing the full context for all events is not possible, but that it is necessary to collect the full context for limited periods. tcpdump is an example of a profiling tool. It records a slice of network traffic. While it is an essential debugging tool, if it runs for too long, it will rapidly fill available disk space. The same is true with debug builds. Large quantities of data can be collected, yet debug builds tend to negatively impact performance over time.

The best use of profiling is for tactical debugging, not for long term use.


The goal of monitoring is to gain near real-time insights into the current state of the system. By comparison, logging is best used to determine why a system changed state, or why it did not. Logging is also used to help create post-mortems and other after-the-fact incident reports. But this is not the only reason.

Logging is not done for fun. Like monitoring, logs and logging have a purpose behind them. Allocating storage, backups, and establishing search routines cost money. Usually, logging is established for regulatory and compliance reasons: to prove that a system is doing what it should, or that access is being maintained, in such a way that the logs cannot be altered by system users and can be provided as proof to external auditors.

Logging is also done for forensic analysis after the fact. People do bad things. Breaches occur, and the impact of those breaches has to be understood. Post-mortems need to be written, and lessons learned need to be documented. Logging helps with this as well.

Finally, logging is an essential part of Security Information and Event Management (SIEM), a vital piece of any system operation.

Like monitoring, a plan needs to be established early on about what will be collected, how it will be collected, and where it will be stored. The question of who can access it after the fact as well as when it can be destroyed should also be a part of the planning process and done with the cooperation of departments outside of the usual IT groups.

One of the more interesting aspects of logging is that logs tend to be ignored and deleted when storage space gets low. This is partially due to the inadequate information that applications like syslog generate, but it is also because logs are not sexy. And vendors would rather sell the latest tool than have companies use existing log tools to analyze log files, even when that new tool generates logs of its own that need to be analyzed as well.

While there are many ways to monitor systems, logging provides some specific things. A log can tell you if a system is up, but it is more valuable for determining why a system did not start correctly. Failures, both hardware and software related, will often be reported in the logs in such detail that it will help both engineers and developers resolve the situation. Most developers request logs as part of the trouble ticket, and many third-party hardware vendors will demand them.

Host logs often show failed user access attempts and other related information that becomes critical in post-mortems following intrusions. Who, how bad, and when are all questions that can be answered by a forensic review of the logs, provided they are properly stored and maintained.

And of course, logs are useful in troubleshooting issues because they contain large quantities of data beyond numbers: what happened and, if the system is appropriately configured, date-time stamps for when it happened. Determining causality, however, can be a bit more complicated. That will be discussed shortly.

Data Collection

But just what is a log? Like many IT terms, it is borrowed from other disciplines and is used loosely, or incorrectly, or even corrupted entirely. To begin, we need to determine what a log should convey. A log should be able to help answer the “what is wrong?” question, but also, like monitoring, the questions of “how well are we doing?”, “are there indications of something going wrong in the future?”, and “who did what when to whom?” A log, then, is a collection of event records [4]. As part of defining things, an audit is a process of evaluating logs within an environment. The common goal of an audit is to assess the overall status or identify any unusual or problematic activity. Remember, though, that while useful logging will help with an audit, logging alone is not auditing, and log messages do not necessarily constitute an audit trail.

There are numerous methods to collect data, but when talking about logs, they generally break down into log parsers and log scanners.

A log parser extracts specific information from log entries, such as the status code and response of requests from a web server log.
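A minimal sketch of such a parser is shown here, assuming entries in the Apache Common Log Format. The regular expression, the function name, and the sample line (which uses a documentation-only IP address) are illustrative; real access logs are often configured with extra fields that this pattern does not cover.

```python
import re
from typing import Optional

# host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line: str) -> Optional[dict]:
    """Return the fields of one Common Log Format entry, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = '192.0.2.10 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_clf(sample))
```

Returning None for lines that do not match, instead of raising, lets the caller count unparseable entries, which is itself a useful signal.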

A log scanner generally counts occurrences of strings in log files as defined by a regular expression.
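By contrast with a parser, a scanner can be sketched in a few lines: compile a set of named regular expressions and count how often each one matches. The pattern names and sample lines below are made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def scan_log(lines: Iterable[str], patterns: Dict[str, str]) -> Counter:
    """Count occurrences of each named regular expression across the given lines."""
    compiled = {name: re.compile(regex) for name, regex in patterns.items()}
    counts: Counter = Counter()
    for line in lines:
        for name, regex in compiled.items():
            counts[name] += len(regex.findall(line))
    return counts


# Hypothetical authentication log lines.
sample_lines = [
    "sshd: Failed password for root",
    "sshd: Accepted password for alice",
    "sshd: Failed password for bob",
]
print(scan_log(sample_lines, {"failed_login": r"Failed password", "accepted": r"Accepted password"}))
```

Because the function accepts any iterable of strings, it can be handed an open file object and will read it line by line without loading the whole log into memory.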

Log monitoring provides the raw data for analysis, usually utilized after the fact beyond the quick data gathered by a parser or scanner. Log analysis tools should be part of the everyday toolbox, not just broken out when something goes wrong.

As part of data collection, some form of data normalization may need to occur. Because each system can create logs in different ways, it is up to the logging storage system to address the normalization and where that normalization will occur, if it occurs. With modern database-backend log servers, data being ingested does not generally need to be normalized if it is going to be queried by any number of SQL capable applications. However, the more normalized your data is, the better your indexes will be, and the faster a return on the query can be delivered. As a developer, this is a critical concept to understand.

Generally, logging falls into the following categories:

  • Security
  • Operations
  • Compliance
  • Application

Security logging is focused on detecting and responding to attacks and other security-related issues. This often includes infections by malware, routine user authentication, or the failure of authentication, and analyzing whether the authentication was granted but should not have been.

Operational logging provides information about the routine, or not so routine status of the systems being monitored. As stated before, most monitoring systems will parse information from operations logs, as well as other sources to show operational state. The real value of operations logs is when that state changes, or fails to change, as expected, such as initialization failures related to hardware or bad configurations.

Compliance logging often overlaps security logging, and quite often, security logs are part of the compliance logs. These logs are usually wrapped up in specific audit requirements around HIPAA data, PCI compliance, IRAP, and other government-related mandates.

Application logging often takes two forms. Like operations logging, applications logging is important at the change of state time, as well as tracking other runtime events. The most famous application log is probably the sendmail log, which tells operators not only the state of the mail queue but also who sent what to whom. This log is a progenitor of the syslog. The httpd log (the Apache log) is probably the most used application log.

The other type of application log is the debug log that may or may not be enabled by default. This contains dump information about the state of memory, application handles, threads, and whatever else the coders have included in their code to help them determine where something broke.

Logging syntax and format varies, but some standards are generally acknowledged:

  • W3C Extended Log File Format (ELF) (http://www.w3.org/TR/WD-logfile.html).
  • Apache access log (http://httpd.apache.org/docs/current/logs.html).
  • Cisco SDEE/CIDEE (http://www.cisco.com/en/US/docs/security/ips/specs/CIDEE_Specification.htm).
  • ArcSight common event format (CEF) (http://www.arcsight.com/solutions/solutions-cef/).
  • Syslog/SyslogNG (RFC3195 and newer RFC5424).
  • IDMEF, an XML-based format (http://www.ietf.org/rfc/rfc4765.txt).

Syslog is commonly used to describe both a way to move messages over UDP port 514 and a log format with a few structured elements.

Here are a few examples of the various formats that log files take:


CEF:0|security|threatmanager|1.0|100|detected a \| in message|10|src= act=blocked a | dst=

Apache CLF:
- - [18/Feb/2000:13:33:37 -0600] "GET / HTTP/1.0" 200 5073
- frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326


W3C ELF:
#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html


IDMEF:
<?xml version="1.0" encoding="UTF-8"?>
<IDMEF-Message version="1.0" xmlns="urn:iana:xml:ns:idmef">
<Alert messageid="abc123456789">
<Analyzer analyzerid="hq-dmz-analyzer01">
<Node category="dns">
<location>Headquarters DMZ Network</location>
<CreateTime ntpstamp="0xbc723b45.0xef449129">
<Source ident="a1b2c3d4">
<Node ident="a1b2c3d4-001" category="dns">
<Address ident="a1b2c3d4-002" category="ipv4-net-mask">
<Target ident="d1c2b3a4">
<Node ident="d1c2b3a4-001" category="dns">
<Address category="ipv4-addr-hex">
<Classification text="Teardrop detected">
<Reference origin="bugtraqid">
<url>http://www.securityfocus.com/bid/124</url>

Unfortunately, just because there are standards does not mean that most applications or application developers actually follow them. Worse, log files that are ASCII based are not necessarily human-readable (as in the IDMEF format, which is one of the more readable XML based logs). Some companies have binary logs, which means a specific piece of software is required to read the log. Binary logs do have a purpose - performance, and space. Binary logs tend to be smaller and take less processing time to create. Also, binary logs usually require less processing to parse and have delineated fields and data types, making the analysis more efficient. An ASCII log parser has to process more data and often has to use pattern matching to extract the useful bits of information.

Compliance mandates, such as PCI, have specific requirements for what should be included in the log. The system may not collect that information by default, which means an application or process must be put in place specifically to capture it. Many other industry organizations have created their own recommendations regarding the events and details to be logged.

Overall, knowing log file syntax is critical before any kind of analysis can be done. The fact that some people can do such analysis in their head does not discount the value; it just shows that such a thing is possible. An automated system analyzing logs needs to have an understanding of the log syntax, usually encoded in some form of template.

What is most important is the content of the log, or its taxonomy. If multiple systems log the same event, it should be expected that the taxonomy of the event is identical. To make that happen, there needs to be a collection of well-defined words that can be combined in a predictable fashion: the log taxonomy. Regrettably, this is easier said than done.

Taxonomy can be further developed into these areas:

  • Change Management
  • Authentication and Authorization
  • Data and Systems Access
  • Threat Management
  • Performance and Capacity Management
  • Business Continuity and Availability Management
  • Miscellaneous Errors and Failures
  • Miscellaneous Debugging Messages

Therefore, when looking at logs, the question becomes what makes a good log? In general, a good log should tell you the five W’s of logging (appropriated from other disciplines where they are used today):

  • What happened, with appropriate detail
  • When did it happen, and if it includes when it started and when it ended, so much the better
  • Where did it happen, specifically, on what host, file system, interface, etc.
  • Who was involved, especially when talking about A&A activities
  • Where did they come from, with as much detail as possible

Also, it would be nice to know:

  • Where can more information be obtained
  • How certain is the information
  • What is affected

Some of the information will be dependent on the environment, and the messages programmed into the software by the developers. If the software does not trap the authentication information, then it cannot be recorded in the log, for example.
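To make the five W’s concrete, here is a sketch of a structured log record that carries each of them as a named field and serializes to JSON. The class, the field names, and the sample values are assumptions for illustration, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class LogEvent:
    what: str    # what happened
    when: str    # when it happened (UTC, ISO 8601)
    where: str   # host, file system, interface, etc.
    who: str     # the user or service involved
    origin: str  # where they came from


def make_event(what: str, where: str, who: str, origin: str) -> str:
    """Build a log record stamped with the current UTC time and serialize it as JSON."""
    event = LogEvent(
        what=what,
        when=datetime.now(timezone.utc).isoformat(),
        where=where,
        who=who,
        origin=origin,
    )
    return json.dumps(asdict(event), sort_keys=True)


# Hypothetical values; 203.0.113.7 is a documentation-only address.
print(make_event("login failed", "web01:/var/log/auth.log", "alice", "203.0.113.7"))
```

Making every field mandatory forces the developer to decide, at the point where the message is written, what the software actually knows about the event.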

As in monitoring, logging information should fall into two categories: does someone need to be awoken in the middle of the night to deal with the issue, or can it wait for morning? Sadly, in logging, there is usually a third category, which is that it can be ignored. This, of course, raises the question: if it can be ignored, why is it logged, and why is it taking up space? The other key point to remember is that if an application is logging, there has to be somewhere for that log information to go to be useful. If the bits fall on the floor, then did they really get logged in the first place?

If the information is being logged, and those logs are being used properly, there is a large amount of data that can be gleaned from them. The risk then becomes determining causality.


Generally, finding the root of a problem means working backward, tracing the symptoms back to the cause, then determining the appropriate mitigating actions. To get there, it makes sense to follow some logical steps.

First, find a correlation. Identify the undesired symptom. Sometimes it is evident like a compiler failure or a missing configuration file. Sometimes it requires a bit more research to gather and juxtapose time-series events from several logs, and sometimes the symptom is a red herring of what is occurring.

Second, establish direction. Make sure that the cause precedes the effect. This means ensuring not only that all systems use the same date/time codes, but that those clocks are actually correct. Many a log reading effort has been foiled because one system is set to UTC and another is set to local time.

Finally, rule out confounding factors. This is where you test your hypothesis by witnessing the event occur. Sometimes, the potential source of the problem is not the actual source of the problem, especially if the network is sufficiently complicated, and the teams that manage the various parts of the enterprise do not talk to each other when changes are made.
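For the direction step in particular, it helps to convert every timestamp to UTC before comparing events from different logs. The sketch below assumes Apache-style timestamps that carry a numeric UTC offset, and an English/C locale for the month abbreviation; the function name and the two sample timestamps are illustrative.

```python
from datetime import datetime, timezone


def to_utc(timestamp: str, fmt: str = "%d/%b/%Y:%H:%M:%S %z") -> datetime:
    """Parse a timestamp that includes a UTC offset and convert it to UTC."""
    return datetime.strptime(timestamp, fmt).astimezone(timezone.utc)


# Two hypothetical events, logged by hosts set to different time zones.
event_a = to_utc("18/Feb/2000:13:33:37 -0600")
event_b = to_utc("18/Feb/2000:19:33:36 +0000")

# On the wall clock, A (13:33) appears to come long before B (19:33),
# but once both are in UTC, B actually happened one second before A.
print(event_a.isoformat(), event_b.isoformat(), event_b < event_a)
```

Timestamps that carry no offset at all cannot be fixed this way; the zone of the originating host then has to be known from configuration.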

Log Management

As mentioned, logs take up space. Sometimes a lot of space. As discussed in the Monitoring section, pretend it is necessary to monitor HTTP requests for a given service. Further, there is a limit of a hundred fields per log entry. If the service handles a thousand requests per second, a log entry with a hundred fields that take ten bytes each consumes roughly a megabyte a second, or about 86 gigabytes a day for logging. For one service. Most systems have dozens, if not hundreds, of services, all of which might need to create logs. If the service falls under government or industry mandate, that means the logs have to be maintained for a set period. If it is even a year, that is a lot of disk, tape, and cold storage space that has to be allocated and budgeted for, along with off-site storage. Disk space and tape may be cheap, but storage at somewhere like Iron Mountain quickly becomes a budget item of significance. Therefore, log management is an important, if often overlooked, topic.
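The arithmetic above is easy to check and to rerun with different assumptions. This sketch uses the figures from the paragraph and decimal units (1 GB = 10^9 bytes); the function name is illustrative, and the one-year retention period is the example from the text, not a recommendation.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_log_bytes(requests_per_second: int, fields_per_entry: int, bytes_per_field: int) -> int:
    """Raw log volume per day for one service, before compression or indexing."""
    return requests_per_second * fields_per_entry * bytes_per_field * SECONDS_PER_DAY


per_day = daily_log_bytes(requests_per_second=1000, fields_per_entry=100, bytes_per_field=10)
per_year = per_day * 365

print(f"per second: {per_day / SECONDS_PER_DAY / 1e6:.1f} MB")
print(f"per day:    {per_day / 1e9:.1f} GB")
print(f"per year:   {per_year / 1e12:.2f} TB")
```

Multiplying the yearly figure by the number of services gives a first-order estimate to bring to the retention and budget discussion.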

Fortunately, the corporate log retention policy has likely been established, but it does not hurt to review it from time to time, as various policies change [5]. When considering log retention, some critical areas include:

  • Identify applicable compliance requirements
  • Understand the organizational risk posture
  • Review log sources and the size of logs generated
  • Review storage options
    • Internal (device)
    • External (method of transit)

While there are several storage format options, most logs will be moved to some form of central log storage, usually backed by a database.


After setting up the monitoring solution, establishing the values to collect, and ensuring that the monitoring solution is obtaining the appropriate data, the next obvious question is “Now what?”

Most teams monitor their equipment to ensure that it is up and operational, and to be alerted when it fails or performs outside of established thresholds. If monitoring is the act of observation, then alerting is one method of delivering the data. It is not the only one, and should not be the first method of providing the data. Effective alerting is about getting alerts right, making them actionable, and ensuring they are not ignored.

Alerts fall into two categories. The first is the “for your information” (FYI) alert. This alert is one where no immediate action is required, but someone should be informed. “The backup job failed” is an FYI type of alert. Someone needs to investigate, but the level of urgency does not rise to “drop everything you are doing.” This does not mean you should ignore the alert. You should follow up on all FYI alerts as soon as possible or question the value of continuing the alert.

The second type of alert is “drop everything you are doing.” This is an alert that is meant to wake people up in the dead of night [6]. It should be noted that there is occasionally a middle-range alert, somewhere between “the system is down” and “get to it while you can.” Resist the temptation to create this intermediate type of alert. Either it is a problem that demands immediate action, or it is not. This is what makes a good alert.

The question then is around the strategy for delivery and resolution of alerts.

Stop using email for alerts. Most DevOps/SREs get too much email, and most filter their alerts into files that are never reviewed. If the alert is of the FYI variety, send it to a group notification (chat) system for follow up. If the alert requires immediate action, choose the method that works best for immediate response. SMS, pager duty, flashing red lights, etc. Not everyone uses or pays attention to their SMS alerts during the day or at night, so ensure the team has an agreed-upon method for how these alerts will be sent and acted upon. Use what works. It is also important to log all alerts for later reporting. A unique sequence number that can be referred back to in reports, knowledgebase articles, and post mortems is critical. It will also aid in SLA reporting and other legal actions.
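The routing and record-keeping described above might be sketched as follows. Every alert receives a unique sequence number, is recorded for later reporting, and is then handed to the channel for its severity. The class names and the two severity labels are hypothetical, and the channels here are plain lists standing in for a chat system and a paging system.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alert:
    seq: int        # unique number to cite in reports and post mortems
    severity: str   # "fyi" or "page" in this sketch
    message: str


class AlertRouter:
    def __init__(self, channels: Dict[str, Callable[[Alert], None]]):
        self._channels = channels
        self._seq = itertools.count(1)
        self.history: List[Alert] = []

    def send(self, severity: str, message: str) -> Alert:
        if severity not in self._channels:
            raise ValueError(f"no channel configured for severity {severity!r}")
        alert = Alert(next(self._seq), severity, message)
        self.history.append(alert)       # record every alert before delivery
        self._channels[severity](alert)  # deliver to the agreed-upon channel
        return alert


chat_room: List[Alert] = []
pager: List[Alert] = []
router = AlertRouter({"fyi": chat_room.append, "page": pager.append})
router.send("fyi", "backup job failed")
router.send("page", "primary database unreachable")
print(router.history)
```

Raising an error for an unknown severity, rather than silently dropping the alert, keeps the two-category rule honest: an alert that fits neither channel needs a decision, not a default.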

Ensure that you have a runbook/checklist. The Checklist Manifesto details the value of having a checklist and what makes a good one, whether you are in surgery, trying to land a plane, or troubleshooting a problem. Common items such as:

  • What system is on what machine in what subnet and connected to what router or gateway
  • Who is responsible for the system and code on the system and how to reach them in an emergency
  • Current infrastructure diagrams
  • Metrics and their meaning, and where they are collected

should be included.

This will get people focused on the job at hand and prevent the inevitable issue of “but I thought….” Make sure to keep your checklists updated as systems change. If you are finding the runbook/checklist solution provides the actual answer, then automate the runbook! Self-healing should be the first response of any alerting system in the modern age. If a human has to respond to an alert, it is more than likely already too late.

Another useful alerting practice is to delete alerts or tune them to improve value. Do not be afraid of removing alerts. If an alert is being ignored, and ignoring the alert is not causing a system issue, consider evaluating why the alert was created in the first place. Is it still relevant? Is it still needed? Threshold alerts, in particular, should be reviewed frequently and with a critical eye. Just because an alert fires on a threshold does not mean the threshold is valid. If an alert triggers on disk utilization at 90% capacity, is there an underlying problem if the disk goes from zero to 90% in an unreasonably short amount of time? Is the monitoring system triggering on that sort of issue? Should it? When establishing threshold alerts, their reason for existence should be discussed and evaluated, and other “what if” scenarios should also be considered and, where they carry a higher risk, implemented in the alerting system. Reducing alert fatigue will lead to more effective responses and fewer false alarms.
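The disk example suggests checking how fast utilization is changing as well as where it currently stands. A minimal sketch, in which the 90% threshold comes from the text and the maximum fill rate of one percentage point per minute is an arbitrary placeholder that would need tuning per system:

```python
from typing import List


def disk_alert_reasons(
    previous_pct: float,
    current_pct: float,
    minutes_elapsed: float,
    threshold_pct: float = 90.0,
    max_rate_pct_per_min: float = 1.0,
) -> List[str]:
    """Return the reasons, if any, that this pair of disk samples should raise an alert."""
    reasons = []
    if current_pct >= threshold_pct:
        reasons.append("threshold")
    if minutes_elapsed > 0:
        rate = (current_pct - previous_pct) / minutes_elapsed
        if rate > max_rate_pct_per_min:
            reasons.append("fill_rate")
    return reasons


print(disk_alert_reasons(10.0, 60.0, 5.0))    # well below 90%, but filling fast
print(disk_alert_reasons(89.0, 91.0, 600.0))  # slow growth that crossed the threshold
```

The first case is the one a plain threshold misses: the disk is only at 60%, yet at that rate it will be full within minutes.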

It should be common sense to disable or toggle alerting during a maintenance window, but more often than not, spurious alerts are generated. Again, this can lead to alert fatigue and the misdiagnosis that because system X is under maintenance, any alert generated by the system is related to this maintenance, when in fact it may not be and should be investigated.
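One way to reduce noise during maintenance without hiding unrelated problems is to tag alerts raised inside a maintenance window instead of discarding them, so they are not paged but can still be reviewed. The sketch below assumes one window per host; the function names, host names, and times are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Tuple

Window = Tuple[datetime, datetime]  # (start, end), end exclusive


def in_maintenance(host: str, at: datetime, windows: Dict[str, Window]) -> bool:
    """True if the host has a maintenance window covering the given time."""
    window = windows.get(host)
    if window is None:
        return False
    start, end = window
    return start <= at < end


def tag_alert(host: str, message: str, at: datetime, windows: Dict[str, Window]) -> dict:
    """Keep the alert, but mark whether it should page or only be queued for review."""
    suppressed = in_maintenance(host, at, windows)
    return {"host": host, "message": message, "page": not suppressed, "review": suppressed}


windows = {
    "db01": (
        datetime(2024, 1, 3, 22, 0, tzinfo=timezone.utc),
        datetime(2024, 1, 4, 2, 0, tzinfo=timezone.utc),
    )
}
now = datetime(2024, 1, 3, 23, 0, tzinfo=timezone.utc)
print(tag_alert("db01", "replication lag high", now, windows))
print(tag_alert("web01", "5xx rate high", now, windows))
```

Alerts tagged for review can be checked when the window closes, which guards against assuming that everything raised during maintenance was caused by the maintenance.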


Always being on-call is not a recipe for success. It is a tried and true way of rapidly burning out the staff. When everyone is on call all the time, no one knows who is supposed to respond to an alert, so either everyone does, or worse, no one does, which only results in unnecessary escalations (and a lot of yelling in the C-suite).

Establish a visible and viable on-call rotation and stick to it. Whether each person gets the duty by the day or by the week (do not put someone on-call for the month, that is just cruel), rotate reasonably through the staff. Ensure that the changeover occurs during the workday or in the middle of the week so that details and updates can be shared appropriately. Then establish an escalation path so that the on-call person knows who they can count on for support in the event of an alert beyond their immediate ability to resolve. When that escalation path crosses departments, define what the alert process is and who needs to be included in the alert notification. This is especially important when developers need to be brought in to help resolve an issue.
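A weekly rotation with a mid-week handover can be computed directly from the date, which makes the schedule visible to anyone who asks. In this sketch the staff names are placeholders, and the rotation is assumed to start on a Wednesday so that every handover falls mid-week.

```python
from datetime import date
from typing import Sequence


def on_call_for(day: date, staff: Sequence[str], rotation_start: date) -> str:
    """Return who holds the duty on a given day, with a handover every seven days."""
    if not staff:
        raise ValueError("staff list must not be empty")
    weeks_elapsed = (day - rotation_start).days // 7
    return staff[weeks_elapsed % len(staff)]


staff = ["ana", "raj", "li"]          # hypothetical team members
rotation_start = date(2024, 1, 3)     # a Wednesday

for day in (date(2024, 1, 3), date(2024, 1, 9), date(2024, 1, 10), date(2024, 1, 24)):
    print(day.isoformat(), on_call_for(day, staff, rotation_start))
```

Swaps, holidays, and the escalation path are not modeled here; they would typically live in the scheduling tool the team already uses.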

Finally, acknowledge that being on-call deserves additional compensation. It is not just the duty person that is impacted when a system goes sideways at two AM on a Sunday. It is their family, and that has potential costs over and above the business costs. IT has long ignored that other industries compensate for having the duty. It should not be expected that IT personnel just do it as part of their normal course of events. If the systems are that important to the business being successful, then their support should also be treated as necessary. Especially if the business wishes to maintain quality people over time. More than one IT professional has tendered their resignation rather than face pressure from their family because they had to miss that event when the server crashed.

After the Fire

Once the event has been wrestled under control, the service restored to functionality, and everyone has had a moment to dust off, a quick review of what happened should occur. In emergency response circles this is called a hot-wash. What went right and what can be improved should be discussed, usually with a couple of quick bullet points. The team involved should also document their findings, with bug links and other articles, for the after-action and the post mortem. The after-action should account for things that need to be fixed, whether it is better alerting, more monitoring, replacing hardware, etc. These actions should be documented, ticketed, and addressed as soon as functionally possible to prevent the incident from occurring again.

A post mortem should also be held. This is different from the after-action. A successful post mortem looks at the issue from a blame-free position, using metrics, and root cause analysis. It should include action items from the after-action as well as a discussion on what can be done, that has not already been set in motion, to prevent the situation from occurring again. Punishment has no place in a post mortem.


There are numerous ways to monitor the network, servers, and services. Creating a proper monitoring solution means not approaching it as a single point solution but as multiple complex problems, and it represents different things to different people. Successful monitoring will also require numerous tools that will have to be integrated into an already crowded landscape. Sound monitoring is a group effort. Everyone should have a seat at the table and input into the discussion of the needs.

Logging, like monitoring, is about the health of the system but is also a critical factor in Security Information and Event Management (SIEM), forensic review, and compliance auditing. Logging is also an essential factor for state-change failure investigation. Because logs are needed for compliance, their growth, retention period, and media storage type need to be considered in annual budget discussions with the concerned departments.

Alerting is the action of responding to monitoring reports. Good alerting, like monitoring and logging, is also a group effort and a complex problem. Ensure that FYI alerts are followed up on and that “fix me now” alerts are actionable and closed quickly with a root cause evaluation and new code or systems as appropriate. Like monitoring, never be afraid of updating and re-evaluating the value of alerts, including deleting unneeded alerts.

Properly instrumented, a system that is well monitored allows for complete transactional observability. This leads to a better customer experience, and fewer phones ringing in the back office.

Web Links

  • Checklist Manifesto: (https://en.wikipedia.org/wiki/The_Checklist_Manifesto)
  • Nagios (https://www.nagios.com)
  • New Relic (https://newrelic.com)
  • MITRE: http://cee.mitre.org/docs/CEE_Architecture_Overview-v0.5.pdf
  • W3C Extended Log File Format (ELF) (http://www.w3.org/TR/WD-logfile.html).
  • Apache access log (http://httpd.apache.org/docs/current/logs.html).
  • Cisco SDEE/CIDEE (http://www.cisco.com/en/US/docs/security/ips/specs/CIDEE_Specification.htm).
  • ArcSight common event format (CEF) (http://www.arcsight.com/solutions/solutions-cef/).
  • Syslog/SyslogNG (RFC3195 and newer RFC5424).
  • IDMEF, an XML-based format (http://www.ietf.org/rfc/rfc4765.txt).


  1. https://en.wikipedia.org/wiki/How_much_wood_would_a_woodchuck_chuck ↩︎
  2. Many operations managers will tell you that when you drive for extreme reliability (beyond even the legendary 5 9s), it comes with ever-increasing cost. These costs are not only monetary; there are also costs in terms of how fast new features can be developed and delivered. The end result is a decrease in application deployment because of risk aversion. ↩︎
  3. On Unix and Unix-like computer operating systems, a zombie process or defunct process is a process that has completed execution (via the exit system call) but still has an entry in the process table: it is a process in the "Terminated state". Zombie Process ↩︎
  4. “An event is a single occurrence within an environment, usually involving an attempted state change. An event usually includes a notion of time, the occurrence, and any details that explicitly pertain to the event or environment that may help explain or understand the event’s causes or effects” (source: http://cee.mitre.org and specifically http://cee.mitre.org/docs/CEE_Architecture_Overview-v0.5.pdf). ↩︎
  5. I am not a lawyer and I am not going to cover the ins and out of the dozens of government oversight requirements, audit requirements, and other headaches that each company has to adhere to day in and day out. These are guidelines. Make sure to involve corporate compliance and other associated corporate entities when discussing log retention. ↩︎
  6. This type of alert should cause heads to pop up and people to move immediately to fix the problem. A wind up air raid siren sound is a great alert tone for this sort of issue. ↩︎