Monitoring In the Modern Age

Continuous Integration and Continuous Delivery lead to continuous improvement, which, as we discussed in CI/CD, is a founding principle of DevOps and Site Reliability Engineering. We test, we take the results of those tests, and we improve. But how do we measure this improvement? DevOps utilizes metrics to measure its progress, and the majority of those measurements come from instrumentation that generates metrics and logs. What is monitoring, how does it differ from logging, and does it matter?

The Monitoring Problem

Does this sound familiar: We need better monitoring! In most shops it is rare to hear someone say We have too much monitoring! Another frequent lament is: I am getting too many alerts! In the quest for better monitoring, the question What exactly is better monitoring? needs to be answered, and, wrapped up in that, questions such as: Is there such a thing as too much monitoring? And, most importantly: How can monitoring provide that all-important single pane of glass that all senior managers demand?

Monitoring used to be simple. Is the server up or down? Is the traffic flowing? Is there enough memory and CPU to execute the application? Companies spent as little as possible on monitoring; it was rarely a high-priority requirement, much less part of the problem scope or considered in advance. Monitoring was reactive. As applications became mission critical, and their value to the business was quantified in monetary terms, the status of those applications became critically important. But administrators still had to be aware of the state of servers, disks, and other network equipment as well, which led to an increase in complexity. Thus, the question becomes: What precisely needs to be monitored? Hint: Not everything.

Monitoring should not be approached as a single point solution. It represents multiple complex problems and means different things to different people. It also requires numerous tools that will have to be integrated into an already crowded landscape.

A monitoring strategy should be established from a position of thoughtful construction. It is not enough to say monitor everything, even when management feels this is an appropriate solution. It is not even possible, with the number of servers, applications, microservices, network devices, and even raw transactions, to monitor everything. It is barely possible to monitor most things.

In the context of DevOps/SRE, monitoring serves several purposes.

First, it is a gauge of how the systems are performing. It is critical to note that performance has two facets, the customer facet, and the systems facet. Performance monitoring of one without the other only shows part of the picture.

Second, it provides the necessary metrics to evaluate how, or if systems are improving as changes are implemented in systems and code. These changes, unmonitored, could also introduce instability into well-behaved systems.

Third, it helps us understand our risk budget and enables success in our postmortems from a science-driven perspective.

Finally, useful metrics permit planning for the future. The one constant in all of these systems is growth. The speed of this growth can be predicted with useful metrics and modeling, but only if monitoring and the collection of those metrics are accurate and representative.

In modern systems, we need to change the idea of monitoring, what we monitor, and why we monitor.

Modern Monitoring

In physics, the observer effect is the theory that simply observing a situation or phenomenon necessarily changes that phenomenon. In the 1990s, there was a school of thought that agents were terrible. They introduced the observer effect by adding an additional burden on the servers and systems they were supposed to monitor, oftentimes by masking their own impact or underreporting the server statistics. This led admins to assume their monitoring systems were unreliable because different tools reported different statistics for the same measured value. If the system is unreliable, how can the metrics be used for planning? How can alerts be trusted? That, in turn, leads management to question the value of monitoring and the money being spent to support it.

Fast forward, and things have changed dramatically. The idea of an agent has also changed. What used to be a precompiled black box of code that seemed to chew up all the available resources is now reduced, in many cases, to small, programmable libraries that expose metrics on an as-needed, or as-demanded, basis.
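
To make the idea concrete, here is a minimal sketch of the modern, library-style agent, using the Python prometheus_client package purely as an illustration; the metric names and port are invented, and any comparable metrics library would serve.

from prometheus_client import start_http_server, Counter, Gauge
import random
import time

# Metrics this application chooses to expose. Nothing is pushed anywhere;
# a collector scrapes http://host:8000/metrics when, and only when, it wants to.
REQUESTS = Counter("app_requests_total", "Total requests handled")
QUEUE_DEPTH = Gauge("app_queue_depth", "Items waiting in the work queue")

if __name__ == "__main__":
    start_http_server(8000)          # expose the metrics endpoint on demand
    while True:
        REQUESTS.inc()               # stand-ins for real instrumentation points
        QUEUE_DEPTH.set(random.randint(0, 50))
        time.sleep(1)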

However, this introduces a more complicated situation. With each set of systems and applications generating metrics, how many monitoring modes does it take to make a system capable of monitoring the systems that need to be monitored? If that sounds a bit like the how much wood would a woodchuck chuck1 riddle, then the scope of the problem becomes more apparent. There is no easy answer to how many modes are too many. The short answer is no more than are necessary, but determining how many are required is an open question. There is, however, a framework to help arrive at the answer that works in the organization.

First, it is an organizational decision. As will be discussed, do not worry about how other companies solve the problem, even if they look like your company.

Second, because the problem solution affects teams differently, ensure that each team’s vision and needs are considered and included. Avoid duplication, but do not assume one team’s needs are any more or any less important.

Finally, revisit all decisions and assumptions regularly. This is not a one and done exercise but a continuous part of the software life cycle of the application or system being monitored.

The same applies when selecting monitoring tools. Choose the tools wisely but do not be afraid to add a tool if there is a solid reason for it, or remove one if it is not doing the job. The tools must be selected to solve a real problem, not just because someone wants it. It is necessary to get groups together to talk about the tools being used, what they are monitoring, and what overlap may exist. If the tool does not solve a real problem, or there are too many overlaps, selecting that particular tool should be reconsidered.

Lastly, when we look at tools, there is a decision around buy or build. This is not so clear cut, but the idea of build or buy is still based on answering the question of what works best and will do what is needed for a reasonable cost. And the cost of monitoring is not just the cost of the software as we will see more clearly when we talk about logging.

One key concern is adopting tools only because someone else is using the tool (and even possibly using it successfully). Just because Google or Netflix shows success using this or that tool does not mean adoption will be successful in another context. Why? Google, or Netflix, or anyone else, has put years of development into their tools to solve the monitoring issues they were experiencing. That does not mean the tools will automatically translate to the next environment, or, in fact, be usable in any other environment. Do not be dissuaded from adopting the tools of another company. Just ensure that their use case is validated in the environment in which they are going to be used.

Single Pane of Glass

Frequently, someone, usually in senior management, will demand a single pane of glass. One view for everything. The best visual is the bridge of the Enterprise. A wall full of monitors with graphs and spinning wheels. Great eye candy, but not precisely useful at the end of the day for the people that are running the systems. There are several problems with the single pane of glass requirement.

First, how big is the viewport? Imagine a door with a peephole. You can see through the door into the hallway, and see, to an extent, who may or may not be there. But it is not a complete view. Any spy novel will illustrate how limited peephole views are. There might be an assassin hiding in the blind spot, waiting to kill the person behind the door. In a single pane of glass view, it might be a problem waiting to bring down the application. The broader the view, the more confusing the data becomes and the harder it is to understand what the data is telling you, even if it is simplified or aggregated.

Second, and more importantly, it is tough to have a one-to-one mapping of tools to dashboards. One tool may feed multiple dashboards, or one dashboard may be a composite of the input of many tools. Because monitoring is a solution to a set of problems, and a complex one at that, trying to force it into one system is going to hamper the ability to work efficiently or effectively.

Try to determine precisely what is most important to the senior manager, and provide them with that data, but remind them of the peephole problem. And the assassin that might be lurking on the other side of the door.

In this day of specialists, it might seem logical that the people that monitor systems should be specialists. The problem is that monitoring is not a job, it is a skill, and one that all team members need to have, not just for being able to understand what those metrics are trying to say, but to be able to expose the necessary metrics for consumption. After all, having a monitoring solution, just to say there is a monitoring solution, does no one any good.

Why Monitor

After the expenditure of hours, effort, and money on standing up a monitoring system, what exactly is being monitored? Or, in other words, what does up actually mean? And from whose perspective? Monitoring done right provides observability into the system being monitored.

One of the more popular reasons for monitoring is meeting a Service Level Agreement (SLA). Most SLAs involve some financial penalty either due or refunded depending on the verbiage in the contract. And most SLA goals are so unrealistic that invoking them, or proving their deficiency, would need more than just a talented statistician and a bevy of lawyers. If the reason for monitoring is to avoid or prove an SLA violation, then other things can be done with the resources. Especially when looking at system availability from the perspective of an SRE model, which assumes no system will be up one hundred percent of the time2.

Traditional monitoring has been done from the inside out. Up means a server is showing CPU activity, has memory available, and the application, such as it is, is running. What does running mean? In the UNIX world, it means that the application has not gone into zombie status3. In modern systems, this model is not enough. Systems can be functioning, yet the end-users are unable to access services. Even harder to track, if not adequately instrumented, is the partial disconnect of a system. The user believes a transaction has completed, yet there is no record of that transaction in the database, or worse, the transaction occurs multiple times. When it involves money, this could lead to larger than expected charges, overdrafts, and so on. Understanding that the entire system is working correctly from end-user to back-end and back again is critical. A sound monitoring system will properly report the status of messages as well as give you observability down to the thread level of what is happening.
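
As a sketch of what outside-in monitoring can look like, the following probe checks a service the way a user would, over HTTP, rather than asking the host whether the process exists. It assumes the Python requests library and a hypothetical health endpoint; the thresholds are placeholders.

import requests

def probe(url, timeout=5.0, max_latency=2.0):
    """Outside-in check: does the service answer, and fast enough to matter?"""
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException as exc:
        return {"up": False, "reason": str(exc)}
    latency = resp.elapsed.total_seconds()
    return {
        "up": resp.status_code == 200 and latency <= max_latency,
        "status": resp.status_code,
        "latency_seconds": round(latency, 3),
    }

print(probe("https://shop.example.com/health"))   # hypothetical endpoint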

Hardware is expensive. Cloud services cost money per minute, per transaction, or per byte of storage. With proper monitoring, metrics can be collected to provide demand forecasting and capacity planning, which results in better cost containment and gives departments the ability to argue for more substantial expenditures based on past usage histories and predictable future loads. Proper capacity planning also allows for growth predictability for new and existing services, as well as the ability to increase capacity to support future load without the risk of being caught short.
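
As a rough sketch of how collected metrics feed capacity planning, the following fits a simple linear trend to invented monthly storage figures; real forecasting would use far richer models, but the principle is the same.

import numpy as np

months = np.arange(1, 7)                              # the last six months
usage_tb = np.array([4.1, 4.6, 5.0, 5.7, 6.1, 6.8])   # hypothetical usage, in terabytes

slope, intercept = np.polyfit(months, usage_tb, 1)    # fit a straight-line trend
forecast_month = 12
projected = slope * forecast_month + intercept
print(f"Projected usage in month {forecast_month}: {projected:.1f} TB")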

With proper monitoring, we can evaluate performance, specifically around behavior. A piece of software is updated. Is the update successful? How is this measured? With metrics, of course. Was the update supposed to improve throughput? If so, did performance improve? Because of monitoring, it is simple to show whether the change was effective in achieving the goal of the upgrade. Or not, and if not, what did happen. This can be returned to the development team to resolve, with the support of metrics that show the exact effect the software is having on the system. No more he said/she said arguments or anecdotal discussions.

Finally, with proper monitoring, which leads to useful metrics, a risk budget can be established. To accomplish this, service risk must be identified, and that means metrics. Risk can include costs related to redundancies or opportunity, but if there are no measures, then the risk is only a gut reaction. To be efficient, it has to be measured. Downtime has to be quantified. Application failures must be offset by success and evaluated for value. For example, is a sign-up failure the same value as a failed poll request for new email? Which is the higher risk? Which has the more significant opportunity cost? Monitoring metrics help answer questions based on business cases.
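
A minimal sketch of turning raw success and failure counts into a risk budget figure, assuming a 99.9 percent availability target; the request counts are invented to echo the sign-up versus mail-poll comparison above.

def budget_consumed(total_requests, failed_requests, slo_target=0.999):
    """Fraction of the allowed-failure budget already spent."""
    allowed_failures = total_requests * (1 - slo_target)
    return float("inf") if allowed_failures == 0 else failed_requests / allowed_failures

# Hypothetical numbers: a sign-up service and a mail-poll service.
print(budget_consumed(1_000_000, 400))        # 0.4 -> 40% of the budget spent
print(budget_consumed(50_000_000, 20_000))    # 0.4 -> same fraction, very different business impact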

Profiling vs Monitoring

Monitoring allows for the collection of metrics about events from all over the environment. This is usually limited to the numbers that can be aggregated, while the raw information is purged shortly after it has been collected. If more specific information is needed, or needs to be maintained over long periods, we are generally talking about logging.

Profiling, compared to monitoring, acknowledges that capturing the full context for all events is not possible, but that it is necessary to collect the full context for limited periods. TCPDump is an example of a profiling tool. It records a slice of network traffic. While it is an essential debugging tool, if it runs for too long, it will rapidly fill available disk space. The same is true of debug builds. Large quantities of data can be collected, yet debug builds tend to negatively impact performance over time.

The best use of profiling is for tactical debugging, not for long term use.

Logging

The goal of monitoring is to gain near real-time insights into the current state of the system. By comparison, logging is best used to determine why a system changed state, or rather why it did not. Logging is also used to help create the post-mortem and other after-the-fact incident reviews. But this is not the only reason.

Logging is not done for fun. Like monitoring, logs and logging have a purpose behind them. Allocating storage, backups, and establishing search routines cost money. Usually, logging is established for regulatory and compliance reasons: to prove a system is doing what it should, or that access is being maintained, and it is done in such a way that it cannot be altered by system users, providing that proof to external auditors.

Logging is also done for forensic analysis after the fact. People do bad things. Breaches occur, and the impact of those breaches has to be understood. Post-mortems need to be written, and lessons learned need to be documented. Logging helps with this as well.

Finally, logging is an essential part of Security Information and Event Management (SIEM), a vital piece of any system operation.

Like monitoring, a plan needs to be established early on about what will be collected, how it will be collected, and where it will be stored. The question of who can access it after the fact as well as when it can be destroyed should also be a part of the planning process and done with the cooperation of departments outside of the usual IT groups.

One of the more interesting aspects of logging is that logs tend to be ignored and deleted when storage space gets low. This is partially due to the inadequate information that applications like syslog generate, but it is also because logs are not sexy. And vendors would rather sell the latest tool than have companies use existing log tools and analyze log files, even when that new tool generates logs of its own that need to be analyzed as well.

While there are many ways to monitor systems, logging provides some specific things. A log can tell you if a system is up, but it is more valuable for determining why a system did not start correctly. Failures, both hardware and software related, will often be reported in the logs in such detail that they help both engineers and developers resolve the situation. Most developers request logs as part of the trouble ticket, and many third-party hardware vendors will demand them.

Host logs often show failed user access attempts and other related information that becomes critical in post-mortems following intrusions. Who, how bad, and when are all questions that can be answered by a forensic review of the logs, provided they are properly stored and maintained.

And of course, logs are useful in troubleshooting issues because, if the system is appropriately configured, they contain large quantities of data beyond numbers: what happened, and date-time stamps for when it happened. Sometimes, though, determining a link to causality can be a bit more complicated. That will be discussed shortly.

Data Collection

But just what is a log? Like many IT terms, it is borrowed from other disciplines and is used loosely, incorrectly, or even corrupted entirely. To begin, we need to determine what a log should convey. A log should be able to help answer the what is wrong? question, but also, like monitoring, the questions of how well are we doing?, are there indications of something going wrong in the future?, and who did what when to whom? A log, then, is a collection of event records4. As part of defining things, an audit is a process of evaluating logs within an environment. The common goal of an audit is to assess the overall status or identify any unusual or problematic activity. Remember, though, that while useful logging will help with an audit, logging alone is not auditing, and log messages do not necessarily constitute an audit trail.

There are numerous methods to collect data, but when talking about logs, they generally break down into log parsers and log scanners.

A log parser extracts specific information from log entries, such as the status code and response of requests from a web server log.

A log scanner generally counts occurrences of strings in log files as defined by a regular expression.
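
A minimal sketch of both approaches in Python, run against the Apache common log format shown later in this section; the regular expressions are illustrative, not production-grade.

import re

# Parser: pull specific fields (status code, request) out of an Apache CLF entry.
CLF = re.compile(r'^(\S+) \S+ (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

def parse_clf(line):
    m = CLF.match(line)
    if not m:
        return None
    host, user, timestamp, request, status, size = m.groups()
    return {"host": host, "request": request, "status": int(status)}

# Scanner: count occurrences of a regular expression across a set of log lines.
def scan(lines, pattern):
    rx = re.compile(pattern)
    return sum(1 for line in lines if rx.search(line))

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_clf(sample))
print(scan([sample], r'" 200 '))    # how many 200 responses?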

Log monitoring provides the raw data for analysis, usually utilized after the fact beyond the quick data gathered by a parser or scanner. Log analysis tools should be part of the everyday toolbox, not just broken out when something goes wrong.

As part of data collection, some form of data normalization may need to occur. Because each system can create logs in different ways, it is up to the log storage system to address normalization and where that normalization will occur, if it occurs at all. With modern database-backed log servers, data being ingested does not generally need to be normalized if it is going to be queried by any number of SQL-capable applications. However, the more normalized your data is, the better your indexes will be, and the faster a query can be answered. As a developer, this is a critical concept to understand.

Generally, logging falls into the following categories:

  • Security
  • Operations
  • Compliance
  • Application

Security logging is focused on detecting and responding to attacks and other security-related issues. This often includes infections by malware, routine user authentication or authentication failures, and analysis of whether authentication was granted when it should not have been.

Operational logging provides information about the routine, or not so routine status of the systems being monitored. As stated before, most monitoring systems will parse information from operations logs, as well as other sources to show operational state. The real value of operations logs is when that state changes, or fails to change, as expected, such as initialization failures related to hardware or bad configurations.

Compliance logging often overlaps security logging, and quite often, security logs are part of the compliance logs. These logs are usually wrapped up in specific audit requirements around HIPAA data, PCI compliance, IRAP, and other government-related mandates.

Application logging often takes two forms. Like operations logging, applications logging is important at the change of state time, as well as tracking other runtime events. The most famous application log is probably the sendmail log, which tells operators not only the state of the mail queue but also who sent what to whom. This log is a progenitor of the syslog. The httpd log (the Apache log) is probably the most used application log.

The other type of application log is the debug log that may or may not be enabled by default. This contains dump information about the state of memory, application handles, threads, and whatever else the coders have included in their code to help them determine where something broke.

Logging syntax and format vary, but some standards are generally acknowledged:

  • W3C Extended Log File Format (ELF) (http://www.w3.org/TR/WD-logfile.html).
  • Apache access log (http://httpd.apache.org/docs/current/logs.html).
  • Cisco SDEE/CIDEE (http://www.cisco.com/en/US/docs/security/ips/specs/CIDEE_Specification.htm).
  • ArcSight common event format (CEF) (http://www.arcsight.com/solutions/solutions-cef/).
  • Syslog/SyslogNG (RFC3195 and newer RFC5424).
  • IDMEF, an XML-based format (http://www.ietf.org/rfc/rfc4765.txt).

Syslog is commonly used to describe both a way to move messages over UDP port 514 and a log format with a few structured elements.
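
For illustration, the following sketch ships application messages to a syslog daemon over UDP 514 using Python's standard-library SysLogHandler; the logger name and messages are invented, and it assumes a daemon is actually listening on that port.

import logging
import logging.handlers

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)

# Send records to a local syslog daemon over the traditional UDP port 514.
handler = logging.handlers.SysLogHandler(address=("localhost", 514))
handler.setFormatter(logging.Formatter("payments[%(process)d]: %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("transaction 8675309 committed")                       # hypothetical events
logger.error("transaction 8675310 rolled back: ledger timeout")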

Here are a few examples of various log file formats:

CEF:

CEF:0|security|threatmanager|1.0|100|detected a \| in message|10|src=10.0.0.1 act=blocked a | dst=1.1.1.1

Apache CLF:

192.168.1.3 - - [18/Feb/2000:13:33:37 -0600] "GET / HTTP/1.0" 200 5073
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

W3C ELF:

#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html

IDMEF:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE IDMEF-Message PUBLIC "-//IETF//DTD RFC XXXX IDMEF v1.0//EN"
"idmef-message.dtd">
<IDMEF-Message version="1.0" xmlns="urn:iana:xml:ns:idmef">
<Alert messageid="abc123456789">
<Analyzer analyzerid="hq-dmz-analyzer01">
<Node category="dns">
<location>Headquarters DMZ Network</location>
<name>analyzer01.example.com</name>
</Node>
</Analyzer>
<CreateTime ntpstamp="0xbc723b45.0xef449129">
2000-03-09T10:01:25.93464-05:00
</CreateTime>
<Source ident="a1b2c3d4">
<Node ident="a1b2c3d4-001" category="dns">
<name>badguy.example.net</name>
<Address ident="a1b2c3d4-002" category="ipv4-net-mask">
<address>192.0.2.50</address>
<netmask>255.255.255.255</netmask>
</Address>
</Node>
</Source>
<Target ident="d1c2b3a4">
<Node ident="d1c2b3a4-001" category="dns">
<Address category="ipv4-addr-hex">
<address>0xde796f70</address>
</Address>
</Node>
</Target>
<Classification text="Teardrop detected">
<Reference origin="bugtraqid">
<name>124</name>
<url>http://www.securityfocus.com/bid/124</url>
</Reference>
</Classification>
</Alert>
</IDMEF-Message>

Unfortunately, just because there are standards does not mean that most applications or application developers actually follow them. Worse, log files that are ASCII based are not necessarily human-readable (as in the IDMEF format, which is one of the more readable XML-based logs). Some companies use binary logs, which means a specific piece of software is required to read the log. Binary logs do have a purpose: performance and space. Binary logs tend to be smaller and take less processing time to create. Also, binary logs usually require less processing to parse and have delineated fields and data types, making the analysis more efficient. An ASCII log parser has to process more data and often has to use pattern matching to extract the useful bits of information.

Compliance mandates, such as PCI, have specific requirements for what should be included in the log, which may not be collected by the system by default, mandating an application or process specific to capturing those details. Many other industry organizations have created their own recommendations regarding events and details to be logged.

Overall, knowing log file syntax is critical before any kind of analysis can be done. The fact that some people can do such analysis in their heads does not discount the value; it just shows one case where such a thing is possible. Automated systems analyzing logs need to have an understanding of the log syntax, usually encoded in some form of template.

What is most important is the content of the log, or its taxonomy. If multiple systems log the same event, it should be expected that the taxonomy of the event is identical. Thus, to make that happen, there needs to be a collection of well-defined words that can be combined in a predictable fashion: the log taxonomy. Regrettably, this is easier said than done.

Taxonomy can be further developed into these areas:

  • Change Management
  • Authentication and Authorization
  • Data and Systems Access
  • Threat Management
  • Performance and Capacity Management
  • Business Continuity and Availability Management
  • Miscellaneous Errors and Failures
  • Miscellaneous Debugging Messages

Therefore, when looking at logs, the question becomes what makes a good log? In general, a good log should tell you the five W’s of logging (appropriated from other disciplines where they are used today):

  • What happened, with appropriate detail
  • When did it happen, and if it includes when it started and when it ended, so much the better
  • Where did it happen, specifically, on what host, file system, interface, etc.
  • Who was involved, especially when talking about A&A activities
  • Where did they come from, with as much detail as possible

Also, it would be nice to know:

  • Where can more information be obtained
  • How certain is the information
  • What is affected

Some of the information will be dependent on the environment, and the messages programmed into the software by the developers. If the software does not trap the authentication information, then it cannot be recorded in the log, for example.
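
As a sketch only, a single structured event record answering the questions above might look like the following; the field names are illustrative and do not follow any particular standard.

import json
from datetime import datetime, timezone

event = {
    "what": "user authentication failed",                  # what happened
    "when": datetime(2000, 3, 9, 15, 1, 25, tzinfo=timezone.utc).isoformat(),
    "where": {"host": "web01.example.com", "service": "sshd"},
    "who": "frank",                                         # the subject of the A&A activity
    "source": "192.0.2.50",                                 # where they came from
    "severity": "warning",                                  # how certain, and how bad
    "reference": "https://wiki.example.com/runbooks/auth-failures",   # where to learn more
}
print(json.dumps(event))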

As in monitoring, logging information should fall into two categories: does someone need to be awoken in the middle of the night to deal with the issue, or can it wait until morning? Sadly, in logging, there is usually a third category: it can be ignored. This, of course, begs the question: if it can be ignored, why is it being logged, and why is it taking up space? The other key thing to remember is that if an application is logging, there has to be somewhere for that log information to go to be useful. If the bits fall on the floor, did they really get logged in the first place?

If the information is being logged, and those logs are being used properly, there is a large amount of data that can be gleaned from them. The risk then becomes determining causality.

Causality

Generally, finding the root of a problem means working backward, gathering the symptoms that point to the cause, then determining the appropriate mitigating actions. To get there, it makes sense to follow some logical steps.

First, find a correlation. Identify the undesired symptom. Sometimes it is evident, like a compiler failure or a missing configuration file. Sometimes it requires a bit more research to gather and juxtapose time-series events from several logs, and sometimes the symptom is a red herring for what is actually occurring.

Second, establish direction. Make sure that the cause is not preceding the effect. This means ensuring that not only are all systems using the same date/time codes, but they are actually correct. Many a log reading effort has been foiled because one system is set to UTC and one system is set to local time.
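
A minimal sketch of that normalization step, using Python's standard-library zoneinfo: two invented entries, one from a box set to UTC and one from a box set to US Central time, describe the same instant once both are converted to UTC.

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The same event, as logged by a host set to UTC and a host set to local (Central) time.
utc_entry = datetime(2000, 2, 18, 19, 33, 37, tzinfo=timezone.utc)
local_entry = datetime(2000, 2, 18, 13, 33, 37, tzinfo=ZoneInfo("America/Chicago"))

# Normalize to UTC before deciding which event preceded which.
print(local_entry.astimezone(timezone.utc) == utc_entry)    # True: same instant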

Finally, rule out confounding factors. This is where you test your hypothesis by witnessing the event occur. Sometimes, the potential source of the problem is not the actual source of the problem, especially if the network is sufficiently complicated, and the teams that manage the various parts of the enterprise do not talk to each other when changes are made.

Log Management

As mentioned, logs take up space. Sometimes a lot of space. As discussed in the Monitoring section, pretend it is necessary to monitor HTTP requests for a given service. Further, there is a limit of a hundred fields per log entry. If the service handles a thousand requests per second, and a log entry has a hundred fields of ten bytes each, logging consumes roughly a megabyte a second, or over eighty gigabytes a day. For one service. Most systems have dozens, if not hundreds, of services, all of which might need to create logs. If the service falls under a government or industry mandate, the logs have to be maintained for a set period. If it is even a year, that is a lot of disk, tape, and cold storage space that has to be allocated and budgeted for, along with off-site storage. Disk space and tape may be cheap, but storage at somewhere like Iron Mountain quickly becomes a budget item of significance. Therefore, log management is an important, if often overlooked, topic.
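
A quick back-of-the-envelope check of that arithmetic, using the hypothetical figures from the paragraph above:

requests_per_second = 1_000
fields_per_entry = 100
bytes_per_field = 10

bytes_per_second = requests_per_second * fields_per_entry * bytes_per_field
gb_per_day = bytes_per_second * 86_400 / 1_000_000_000
tb_per_year = gb_per_day * 365 / 1_000

print(f"{bytes_per_second / 1_000_000:.1f} MB/s")    # 1.0 MB/s
print(f"{gb_per_day:.0f} GB/day")                    # 86 GB/day, for one service
print(f"{tb_per_year:.1f} TB/year")                  # ~31.5 TB if a mandate requires a year of retention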

Fortunately, the corporate log retention policy has likely been established, but it does not hurt to review it from time to time, as various policies change5. When considering log retention, some critical areas include:

  • Applicable Compliance Requirements
  • Understand the organizational risk posture
  • Review log sources and the size of logs generated
  • Review storage options
    • Internal (device)
    • External (method of transit)

While there are several storage format options, most logs will be moved to some form of central log storage, usually backed by a database.

Alerting

After setting up the monitoring solution, establishing the values to collect and ensuring that the monitoring solution is obtaining the appropriate data, the next, obvious question is Now What?

Most teams monitor their equipment to ensure that it is up and operational, and to alert them when it fails or is performing outside of established thresholds. If monitoring is the act of observation, then alerting is one method of delivering the data. It is not the only one, and should not be the first method of providing the data. Effective alerting is about getting it right, making it actionable, and ensuring it is not ignored.

Alerts fall into two categories. The first is an alert as a For your information. This alert is one where no immediate action is required, but someone should be informed. The backup job failed is an FYI type of alert. Someone needs to investigate, but the level of urgency does not rise to drop everything you are doing. This does not mean you should ignore the alert. You should follow up on all FYI alerts as soon as possible or question the value of continuing the alert.

The second type of alert is drop everything you are doing. This is an alert that is meant to wake people up in the dead of night6. It should be noted that there is occasionally a middle-range alert between the system is down and get to it while you can. Resist the temptation to create this intermediate type of alert. Either it is a problem that demands immediate action, or it is not. This is what makes a good alert.

The question then is around the strategy for delivery and resolution of alerts.

Stop using email for alerts. Most DevOps/SREs get too much email, and most filter their alerts into files that are never reviewed. If the alert is of the FYI variety, send it to a group notification (chat) system for follow up. If the alert requires immediate action, choose the method that works best for immediate response. SMS, pager duty, flashing red lights, etc. Not everyone uses or pays attention to their SMS alerts during the day or at night, so ensure the team has an agreed-upon method for how these alerts will be sent and acted upon. Use what works. It is also important to log all alerts for later reporting. A unique sequence number that can be referred back to in reports, knowledgebase articles, and post mortems is critical. It will also aid in SLA reporting and other legal actions.
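
A minimal sketch of that routing and record-keeping in Python; the delivery functions are placeholders for whatever chat and paging integrations the team has actually agreed on.

import uuid
from datetime import datetime, timezone

def send_to_chat(alert):     # placeholder for a group-notification/webhook integration
    print("CHAT:", alert)

def page_on_call(alert):     # placeholder for an SMS/paging integration
    print("PAGE:", alert)

def raise_alert(summary, urgent=False):
    alert = {
        "id": str(uuid.uuid4()),                           # unique reference for reports and post mortems
        "time": datetime.now(timezone.utc).isoformat(),
        "summary": summary,
        "urgent": urgent,
    }
    (page_on_call if urgent else send_to_chat)(alert)
    return alert["id"]                                     # record this ID with the incident

raise_alert("nightly backup job failed", urgent=False)     # FYI: follow up during the day
raise_alert("checkout service unreachable", urgent=True)   # drop everything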

Ensure that you have a runbook/checklist. The Checklist Manifesto details the value of having a checklist, and what makes a good one, whether you are in surgery, trying to land a plane, or troubleshooting a problem. Common items such as:

  • What system is on what machine in what subnet and connected to what router or gateway
  • Who is responsible for the system and code on the system and how to reach them in an emergency
  • Current infrastructure diagrams
  • Metrics and their meaning, and where they are collected

should be included.

This will get people focused on the job at hand and prevent the inevitable issue of but I thought…. Make sure to keep your checklists updated as systems change. If you find the runbook/checklist provides the actual answer, then automate the runbook! Self-healing should be the first response of any alerting system in the modern age. If a human has to respond to an alert, it is more than likely already too late.
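
A minimal sketch of the automate-the-runbook idea: attempt the documented remediation first, and only page a human when it fails. The remediation shown (a systemd service restart) and the service name are assumptions standing in for whatever the real runbook prescribes.

import subprocess

def restart_service(name):
    """Placeholder remediation step lifted from a runbook entry (assumes a systemd host)."""
    return subprocess.run(["systemctl", "restart", name]).returncode == 0

def handle_alert(service):
    if restart_service(service):
        print(f"self-healed: {service} restarted; logged for the morning review")
    else:
        print(f"remediation failed for {service}; paging the on-call engineer")

handle_alert("reporting-api")    # hypothetical service name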

Another useful alerting practice is to delete alerts or tune them to improve their value. Do not be afraid of removing alerts. If an alert is being ignored, and ignoring the alert is not causing a system issue, consider evaluating why the alert was created in the first place. Is it still relevant? Is it still needed? Threshold alerts, in particular, should be reviewed frequently and with a critical eye. Just because an alert fires on a threshold does not mean the threshold is valid. If an alert triggers on disk utilization at 90% capacity, is there an underlying problem if the disk goes from zero to 90% in an unreasonably short amount of time? Is the monitoring system triggering on that sort of issue? Should it? When establishing threshold alerts, their reason for existence should be discussed and evaluated, other what if scenarios should be considered, and some of those scenarios may deserve a higher risk rating for the alerting system to implement. Reducing alert fatigue will lead to more effective responses and fewer false alarms.
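
A sketch of evaluating a threshold together with the rate at which it was approached, using invented disk-utilization samples; the limits are placeholders for whatever the team agrees on.

def disk_alerts(samples, threshold=0.90, growth_per_hour=0.30):
    """samples: list of (minutes_since_start, utilization_fraction) readings."""
    alerts = []
    start_t, start_u = samples[0]
    end_t, end_u = samples[-1]
    if end_u >= threshold:
        alerts.append(f"disk at {end_u:.0%} (threshold {threshold:.0%})")
    elapsed_hours = (end_t - start_t) / 60
    if elapsed_hours > 0 and (end_u - start_u) / elapsed_hours >= growth_per_hour:
        alerts.append("disk filling abnormally fast; investigate even below the threshold")
    return alerts

# Jumping from 5% to 92% in half an hour trips both checks.
print(disk_alerts([(0, 0.05), (15, 0.55), (30, 0.92)]))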

It should be common sense to disable or toggle alerting during a maintenance window, but more often than not, spurious alerts are generated. Again, this can lead to alert fatigue and the misdiagnosis that because system X is under maintenance, any alert generated by the system is related to this maintenance, when in fact it may not be and should be investigated.

On-Call

Always being on-call is not a recipe for success. It is a tried and true way of rapidly burning out the staff. No one knows who is supposed to respond to an alert, so either everyone does, or worse, no one does, which only results in unnecessary escalations (and a lot of yelling in the C-suite).

Establish a visible and viable on-call rotation and stick to it. Whether each person gets the duty by the day or by the week (do not put someone on-call for the month; that is just cruel), rotate reasonably through the staff. Ensure that the changeover occurs during the workday or in the middle of the week so that details and updates can be shared appropriately. Then establish an escalation path so that the on-call person knows who they can count on for support in the event of an alert beyond their immediate ability to resolve, and, when that escalation path crosses departments, what the alert process is and who needs to be included in the alert notification. This is especially important when developers need to be brought in to help resolve an issue.

Finally, acknowledge that being on-call deserves additional compensation. It is not just the duty person that is impacted when a system goes sideways at two AM on a Sunday. It is their family, and that has potential costs over and above the business costs. IT has long ignored that other industries compensate for having the duty. It should not be expected that IT personnel just do it as part of their normal course of events. If the systems are that important to the business being successful, then their support should also be treated as necessary. Especially if the business wishes to maintain quality people over time. More than one IT professional has tendered their resignation rather than face pressure from their family because they had to miss that event when the server crashed.

After the Fire

Once the event has been wrestled under control, the service restored to functionality, and everyone has had a moment to dust off, a quick review of what happened should occur. In emergency response circles this is called a hot-wash. What went right and what can be improved should be discussed, usually with a couple of quick bullet points. The team involved should also document their findings, with bug links and other articles, for the after-action and the post mortem. The after-action should account for things that need to be fixed, whether it is better alerting, more monitoring, replacing hardware, and so on. These actions should be documented, ticketed, and addressed as soon as functionally possible to prevent the incident from occurring again.

A post mortem should also be held. This is different from the after-action. A successful post mortem looks at the issue from a blame-free position, using metrics, and root cause analysis. It should include action items from the after-action as well as a discussion on what can be done, that has not already been set in motion, to prevent the situation from occurring again. Punishment has no place in a post mortem.

Conclusion

There are numerous ways to monitor the network, servers, and services. Creating a proper monitoring solution means not approaching it as a single point solution but as multiple complex problems that represent different things to different people. Successful monitoring will also require numerous tools that will have to be integrated into an already crowded landscape. Sound monitoring is a group effort. Everyone should have a seat at the table and input into the discussion of the needs.

Logging, like monitoring, is about the health of the system but is also a critical factor in Security Information and Event Management (SIEM), forensic review, and compliance auditing. Logging is also an essential factor for state-change failure investigation. Because logs are needed for compliance, their growth, retention period, and media storage type need to be considered in annual budget discussions with the concerned departments.

Alerting is the action of responding to monitoring reports. Good alerting, like monitoring and logging, is also a group effort and a complex problem. Ensure that FYI alerts are followed up on and that Fix me now alerts are actionable and closed quickly with a root cause evaluation and new code or systems as appropriate. Like monitoring, never be afraid of updating and re-evaluating the value of alerts, including deleting unneeded alerts.

Properly instrumented, a system that is well monitored allows for complete transactional observability. This leads to a better customer experience, and fewer phones ringing in the back office.

Web Links

  • Checklist Manifesto: (https://en.wikipedia.org/wiki/The_Checklist_Manifesto)
  • Nagios (https://www.nagios.com)
  • New Relic (https://newrelic.com)
  • MITRE CEE (http://cee.mitre.org/docs/CEE_Architecture_Overview-v0.5.pdf).
  • W3C Extended Log File Format (ELF) (http://www.w3.org/TR/WD-logfile.html).
  • Apache access log (http://httpd.apache.org/docs/current/logs.html).
  • Cisco SDEE/CIDEE (http://www.cisco.com/en/US/docs/security/ips/specs/CIDEE_Specification.htm).
  • ArcSight common event format (CEF) (http://www.arcsight.com/solutions/solutions-cef/).
  • Syslog/SyslogNG (RFC3195 and newer RFC5424).
  • IDMEF, an XML-based format (http://www.ietf.org/rfc/rfc4765.txt).

Footnotes

  1. https://en.wikipedia.org/wiki/How_much_wood_would_a_woodchuck_chuck ↩︎
  2. Many operations managers will tell you that when you drive for extreme reliability (beyond even the legendary 5 9s), it comes with ever-increasing cost. These costs are not only monetary; there are also costs in terms of how fast new features can be developed and delivered. The end result is a decrease in application deployments because of risk aversion. ↩︎
  3. On Unix and Unix-like computer operating systems, a zombie process or defunct process is a process that has completed execution (via the exit system call) but still has an entry in the process table: it is a process in the "Terminated state". Zombie Process ↩︎
  4. An event is a single occurrence within an environment, usually involving an attempted state change. An event usually includes a notion of time, the occurrence, and any details that explicitly pertain to the event or environment that may help explain or understand the event's causes or effects (source: http://cee.mitre.org and specifically http://cee.mitre.org/docs/CEE_Architecture_Overview-v0.5.pdf). ↩︎
  5. I am not a lawyer, and I am not going to cover the ins and outs of the dozens of government oversight requirements, audit requirements, and other headaches that each company has to adhere to day in and day out. These are guidelines. Make sure to involve corporate compliance and other associated corporate entities when discussing log retention. ↩︎
  6. This type of alert should cause heads to pop up and people to move immediately to fix the problem. A wind-up air raid siren sound is a great alert tone for this sort of issue. ↩︎