Can I Get The Recipe For That – Pasta Dough

I do not have an Italian grandmother. In fact, I never really knew either of my grandmothers when I was growing up, and my mother’s favorite dinner was reservations, so learning how to cook has been an effort in trial, error, and ordering pizza.

If you are going to cook a meal that involves pasta, you really should learn how to make your own. I say this with a proviso – if you are going to make your own pasta, you will want some additional tools for your kitchen, so if you are going to make pasta, you are investing in more than just one or two meals because the tools are a couple of hundred dollars initially. You can do all of this by hand of course, but it will take a lot more time.

For this recipe, we will look at how I make the dough. I have used this dough recipe successfully for both fettuccine as well as other noodle shapes. The key to this is the flour, and if you are going to make pasta, you really should try and get the flour. This recipe comes from the King Arthur Flour Pasta Flour Blend, available from King Arthur Flour. The key to good pasta dough is the flour, specifically semolina, all-purpose, and high-protein, finely milled “00” flour. You can find this in some grocery stores and some specialty stores. Or you can order it directly from King Arthur.


3 cups of pasta flour
4 large eggs
2 to 4 tbsp water
extra flour for the work surface


Place the flour in a food processor, bread machine, or bowl. Mix in eggs all at once. Knead, adding only enough water to form a smooth dough. Form dough into a rectangle, about 1” thick, wrap well and rest for 30 minutes.

After 30 minutes, cut off a chunk, flour both sides of the dough and run it through a pasta machine on the thickest setting. Repeat the process, flouring as necessary. Repeat the process, flouring as necessary and gradually reducing the setting until the desired thickness is reached.

Cut into shapes and toss with flour to prevent sticking. Hang in individual strands or arrange in small nests and allow to dry.

To cook: Boil 4 quarts of water and 1 tablespoon of salt. Add pasta and cook for 2 to 4 minutes until pasta is still slightly firm. Fresh pasta cooks quickly. Drain and toss with oil or sauce.

A Political Stunt? Like Repealing ACA?

This got hung up somewhere, but it is still just as valid today when Congress still has not accomplished anything of note while the parties squabble with each other as it was when I wrote it before the shutdown….

Shutdown: Congress shouldn’t get paid during standoff, Va. rep says | WTOP

But McConnell called it a “political stunt” and said it would be useless to allow a vote that wouldn’t get Trump’s signature. The GOP lawmaker said simply, “This would not produce a result.”

In 2012, I wrote a post on the House of Representative’s 33 vote to repeal the Affordable Care Act. At the time I questioned why the Party of No was wasting the tax payer’s time with a show vote that would not result in anything more than, well a show vote.

Fast forward six years, and we have a Senate that seems to think show votes are a waste of time, and therefore, they will not do it. Except that this time, the vote is not a show vote. It is an adult, bringing a bill to the floor of the Senate, to be voted on to end the government shutdown. And this time, it is likely that even if the President does not sign the bill, there will be enough votes to override any veto he might threaten. That is what is scaring the Majority Leader. He is worried to his marrow that he has lost control of the Senate on this particular issue. Especially when the same Senate, not forty-eight hours before the shutdown began voted in favor of the same bill as the House has presented them.

I am beginning to think that this may actually be the definition of insanity.

Monitoring In the Modern Age


Continuous Integration and Continuous Development lead to continuous improvement, which, as we discussed in CI/CD is a founding principle of DevOps and Site Reliability Engineering. We test, we take the results of those tests, and we improve. But how do we measure this improvement? DevOps utilizes metrics to measure its progress, and the majority of those metrics come from instrumentation that generates metrics and logs. What is monitoring, how does it differ from logging, and does it matter?

The Monitoring Problem

Does this sound familiar: We need better monitoring! In most shops it is rare to hear someone say We have too much monitoring! Another frequent lament is: I am getting too many alerts! In the quest for achieving better monitoring, the question: What exactly is better monitoring? needs to be answered, and, wrapped up in that, questions around: Is there such a thing as too much monitoring? And, most importantly: How can monitoring be enabled to provide that all-important single pane of glass that all senior managers demand?

Monitoring used to be simple. Is the server up or down? Is the traffic flowing? Is there enough memory and CPU to execute the application? Companies spent as little as possible on monitoring, and it was rarely a high priority requirement, much less a demand in the problem scope, or considered in advance. Monitoring was reactive. As the transition to mission-critical applications increased, and their increased value to the business became quantified in monetary terms, the status of those applications became critically important. But administrators still had to be aware of the state of servers, disks, and other network equipment as well, which leads to an increase in complexity. Thus, the question becomes: What precisely needs to be monitored? Hint: Not everything.

Monitoring should not be approached as a single point solution. It represents multiple complex problems, and it represents different things to different people. It also means numerous tools will be involved and have to be integrated into an already crowded landscape.

A monitoring strategy should be established from a position of thoughtful construction. It is not enough to say monitor everything, even when management feels this is an appropriate solution. It is not even possible, with the number of servers, application, microservices, network devices, and even raw transaction to monitor everything. It is barely possible to monitor most things.

In the context of DevOps/SRE, monitoring serves several purposes.

First, it is a gauge of how the systems are performing. It is critical to note that performance has two facets, the customer facet, and the systems facet. Performance monitoring of one without the other only shows part of the picture.

Second, it provides the necessary metrics to evaluate how, or if systems are improving as changes are implemented in systems and code. These changes, unmonitored, could also introduce instability into well-behaved systems.

Third, it helps us understand our risk budget and enable success in our postmortems from a science-driven perspective.

Finally, useful metrics permit planning for the future. The one constant in all of these systems is growth. The speed of this growth can be predicted with useful metrics and modeling, but it means that monitoring and the collection of these metrics are accurate and representative.

In modern systems, we need to change the idea of monitoring, what we monitor, and why we monitor.

Modern Monitoring

In physics, the observer effect is the theory that simply observing a situation or phenomenon necessarily changes that phenomenon. In the 1990s, there was a school of thought that agents were terrible. They introduced the observer effect by adding an additional burden on the servers and systems they were supposed to monitor, oft times by masking their impact, or underreporting the server statistics. This led admins to assume their monitoring systems were unreliable because of different statistics being reported by various tools on the same measured value. If the system is unreliable, how can the metrics be used for planning? How can alerts be trusted? Which then leads management to question the value of monitoring, and the money being spent to support it.

Flash forward, and things have changed dramatically. The idea of an agent has also changed. What used to be a precompiled black box of code that seemed to chew up all the available resources is now reduced, in many cases, to small, programmable libraries that expose metrics on an as-needed, or as demanded basis.

However, this introduces a more complicated situation. With each set of systems and applications generating metrics, how many monitoring modes does it take to make a system that is capable of monitoring the systems that need to be monitored? If it sounds a bit like how much wood can a woodchuck chuck1 riddle, then the scope of the problem becomes more apparent. There is no easy answer to how many modes are too many modes. The short answer is, no more than are necessary, but determining how many is required is an open question. But there is a framework to help arrive at the answer that works in the organization.

First, it is an organizational decision. As will be discussed, do not worry about how other companies solve the problem, even if they look like your company.

Second, because the problem solution affects teams differently, ensure that each team’s vision and needs are considered and included. Avoid duplication, but do not assume one team’s needs are any more or any less important.

Finally, revisit all decisions and assumptions regularly. This is not a one and done exercise but a continuous part of the software life cycle of the application or system being monitored.

The same applies when selecting monitoring tools. Choose the tools wisely but do not be afraid to add a tool if there is a solid reason for it, or remove one if it is not doing the job. The tools must be selected to solve a real problem, not just because someone wants it. It is necessary to get groups together to talk about the tools being used, what they are monitoring, and what overlap may exist. If the tool does not solve a real problem, or there are too many overlaps, selecting that particular tool should be reconsidered.

Lastly, when we look at tools, there is a decision around buy or build. This is not so clear cut, but the idea of build or buy is still based on answering the question of what works best and will do what is needed for a reasonable cost. And the cost of monitoring is not just the cost of the software as we will see more clearly when we talk about logging.

One key concern is adopting tools only because someone else is using the tool (and even possibly using it successfully). Just because Google or Netflix shows success using this or that tool does not mean adoption will be successful in another context. Why? Google, or Netflix, or anyone else has put years of development into the tools to solve the monitoring issues they were experiencing. That does not mean the tools will automatically translate to the next environment, or in fact, be usable in any other environment. Do not be dissuaded from adopting the tools of another company. Just ensure that their use case is validated in the environment they are going to be used.

Single Pane of Glass

Frequently, someone, usually in senior management, will demand a single pane of glass. One view for everything. The best visual is the bridge of the Enterprise. A wall full of monitors with graphs and spinning wheels. Great eye candy, but not precisely useful at the end of the day for the people that are running the systems. There are several problems with the single pane of glass requirement.

First, how big is the viewport? Imagine a door with a peephole. You can see through the door into the hallway, and see, to an extent, who may or may not be there. But it is not a complete view. Any spy novel will illustrate how limited peephole views are. There might be an assassin hiding in the blind spot, waiting to kill the person behind the door. In a single pane of glass view, it might be a problem waiting to bring down the application. The broader the view, the more confusing the data becomes and the harder it is understanding what the data is telling you, even if it is simplified, or aggregated.

Second, and more importantly, it is tough to have a one-to-one mapping of tools to dashboards. One tool may feed multiple dashboards, or one dashboard may be a composite of the input of many tools. Because monitoring is a problem set solution, and a complex one at that, trying to force it into one system is going to hamper the ability to efficiently or effectively work.

Try to determine precisely what is most important to the senior manager, and provide them with that data, but remind them of the peephole problem. And the assassin that might be lurking on the other side of the door.

In this day of specialists, it might seem logical that the people that monitor systems should be specialists. The problem is that monitoring is not a job, it is a skill, and one that all team members need to have, not just for being able to understand what those metrics are trying to say, but to be able to expose the necessary metrics for consumption. After all, having a monitoring solution, just to say there is a monitoring solution, does no one any good.

Why Monitor

After the expenditure of hours, effort, and money on standing up a monitoring system, what exactly is being monitored? Or, in other words, what does up actually mean? And from whose perspective? Monitoring done right provides observability into the system being monitored.

One of the more popular reasons for monitoring is meeting a Service Level Agreement (SLA). Most SLAs usually involve some financial penalty either due or refunded depending on the verbiage in the contract. And most SLA goals are so unrealistic that invoking them, or proving their deficiency would need more than just a talented statistician, and a bevy of lawyers. If the reason for monitoring is to avoid or prove an SLA violation, then other things can be done with the resources. Especially when looking at system availability from the perspective of an SRE model, which assumes no system will be up one hundred percent of the time2.

Traditional monitoring has been done from the inside out. Up means, a server is showing a positive CPU, has memory available, and the application, such as it is, is running. What does running mean? In the UNIX world, it means that the application has not gone into zombie status3. In modern systems, this model is not enough. Systems can be functioning, yet the end-users are unable to access services. Even harder to track, if not adequately instrumented, is the partial disconnect of a system. The user believes a transaction has completed, yet there is no record of that transaction in the database, or worse, the transaction occurs multiple times. When it involves money, this could lead to larger than expected charges, overdrafts, etc. Understanding that the entire system is working correctly from end-user to back-end and back again is critical. A sound monitoring system will properly report the status of messages as well as give you the observability down to the thread level of what is happening.

Hardware is expensive. Cloud services cost money per minute, or per transaction or per byte of storage. With proper monitoring, metrics can be collected to provide demand forecasting and capacity planning, which results in better cost containment and provides departments the ability to argue for more substantial expenditures based on past usage histories and predictable future loads. Proper capacity planning also allows for growth predictability for new and existing services as well as the ability to increase capacity to support the future load without the risk of being caught short.

With proper monitoring, we can evaluate performance, specifically around behavior. A piece of software is updated. Is the update successful? How is this measured? With metrics, of course. Was the update supposed to improve throughput? If so, did performance improve? Because of monitoring, it is simple to show that if the change made was effective in achieving the goal of the upgrade. Or not, and if not, what did happen. This can be returned to the development team to resolve with the support of metrics that show the exact effect the software is having on the system. No more he said/she said arguments or antidotal discussions.

Finally, with proper monitoring, which leads to useful metrics, a risk budget can be established. To accomplish this, service risk must be identified, and that means metrics. Risk can include costs related to redundancies or opportunity, but if there are no measures, then the risk is only a gut reaction. To be efficient, it has to be measured. Downtime has to be quantified. Application failures must be offset by success and evaluated for value. For example, is a sign-up failure the same value as a failed poll request for new email? Which is the higher risk? Which has the more significant opportunity cost? Monitoring metrics help answer questions based on business cases.

Profiling vs Monitoring

Monitoring allows for the collection of metrics about events from all over the environment. This is usually limited to the numbers that can be aggregated while the raw information is purged shortly after the data’s been collected. If more specific information is needed or has been maintained over long periods, we are generally talking about logging.

Profiling, compared to monitoring, acknowledges that the full context for all events is not possible, but that it is necessary to collect the full context for limited periods. TCPDump is an example of a profiling tool. It records a slice of network traffic. While it is an essential debugging tool, if it runs for too long, it will rapidly fill available disk space. The same is true with debug builds. Large quantities of data can be collected, yet debug builds tend to impact performance over time negatively.

The best use of profiling is for tactical debugging, not for long term use.


The goal of monitoring is to gain near real-time insights into the current state of the system. By comparison, logging is best used to determine why a system changed state, or rather why they did not. Logging also is used to help create the post-mortem and other after the fact incidents. But this is not the only reason.

Logging is not done for fun. Like monitoring, logs and logging have a purpose behind them. Allocating storage, backups, and establishing search routines cost money. Usually, logging is established for regulatory and compliance reasons, to prove a system is doing what it should, or that access is being maintained, and it is done in such a way that cannot be altered by system users and provide that proof to external auditors.

Logging is also done for forensic analysis after the fact. People do bad things. Breaches occur, and the impact of those breaches have to be understood. Post-mortems need to be written, and lessons learned need to be documented. Logging helps with this as well.

Finally, logging is an essential part of Security Information and Event Management (SIEM), a vital piece of any system operation.

Like monitoring, a plan needs to be established early on about what will be collected, how it will be collected, and where it will be stored. The question of who can access it after the fact as well as when it can be destroyed should also be a part of the planning process and done with the cooperation of departments outside of the usual IT groups.

One of the more interesting aspects of logging is that logs tend to be ignored and deleted when storage space gets low. This is partially due to the inadequate information that applications like syslog generates, but it is also because logs are not sexy. And vendors would rather sell the latest tool than have companies use existing log tools, and analyzing log files, even when that tool generates logs of its own that need to be analyzed as well.

While there are are many ways to monitor systems, logging provides some specific things. A log can tell you if a system is up, but it is more valuable for determining why a system did not start correctly. Failures, both hardware and software related will often be reported in the logs, in such detail that it will help both engineers and developers resolve the situation. Most developers request logs as part of the trouble ticket, and many third-party hardware vendors will demand them.

Host logs often show failed user access attempts and other related information that become critical in post-mortems following intrusions. Who, how bad, and when are all questions that can be answered by a forensic review of the logs, and if properly stored, and maintained.

And of course, logs are useful in troubleshooting issues because they contain large quantities of data beyond numbers about what happened and date-time stamps for when it happened if the system is appropriately configured, although sometimes determining a link to causality can be a bit more complicated. That will be discussed shortly.

Data Collection

But just what is a log? Like many IT terms, it is borrowed from other disciplines and are used loosely, or incorrectly, or even corrupted entirely. To begin, we need to determine what a log should convey. A log should be able to help answer the what is wrong? question, but also, like monitoring, the questions of how well are we doing? are there indications of something going wrong in the future? and who did what when to whom? A log then is a collection of event records4. As part of defining things, an audit is a process of evaluating logs within an environment. The common goal of an audit is to assess the overall status or identify any unusual or problematic activity. Remember though that while useful logging will help with an audit, logging alone is not auditing and log messages do not necessarily indicate an audit trail.

There are numerous methods to collect data, but when talking about logs, they generally break down into log parsers and log scanners.

A log parser extracts specific information from log entries, such as the status code and response of requests from a web server log.

A log scanner generally counts occurrences of strings in log files as defined by a regular expression.

Log monitoring provides the raw data for analysis, usually utilized after the fact beyond the quick data gathered by a parser or scanner. Log analysis tools should be part of the everyday toolbox, not just broken out when something goes wrong.

As part of data collection, some form of data normalization may need to occur. Because each system can create a logs in different ways, it is up to the logging storage system to address the normalization and where that normalization will occur, if it occurs. With modern database-backend log servers, data being ingested does not generally need to be normalized if it is going to be queried by any number of SQL capable applications. However, the more normalized your data is, the better your indexes will be, and the faster a return on the query can be delivered. As a developer, this is a critical concept to understand.

Generally, logging falls into the following categories:

  • Security
  • Operations
  • Compliance
  • Application

Security logging is focused on detecting and responding to attacks and other security-related issues. This often includes infections by malware, routine user authentication, or the failure of authentication, and analyzing whether the authentication was granted but should not have been.

Operational logging provides information about the routine, or not so routine status of the systems being monitored. As stated before, most monitoring systems will parse information from operations logs, as well as other sources to show operational state. The real value of operations logs is when that state changes, or fails to change, as expected, such as initialization failures related to hardware or bad configurations.

Compliance logging often overlaps security logging, and quite often, security logs are part of the compliance logs. These logs are usually wrapped up in specific audit requirements around HIPAA data, PCI compliance, IRAP, and other government-related mandates.

Application logging often takes two forms. Like operations logging, applications logging is important at the change of state time, as well as tracking other runtime events. The most famous application log is probably the sendmail log, which tells operators not only the state of the mail queue but also who sent what to whom. This log is a progenitor of the syslog. The httpd log (the Apache log) is probably the most used application log.

The other type of application log is the debug log that may or may not be enabled by default. This contains dump information about the state of memory, application handles, threads, and whatever else the coders have included in their code to help them determine where something broke.

Logging syntax and format varies, but some standards are generally acknowledged:

  • W3C Extended Log File Format (ELF) (
  • Apache access log (
  • Cisco SDEE/CIDEE (
  • ArcSight common event format (CEF) (
  • Syslog/SyslogNG (RFC3195 and newer RFC5424).
  • IDMEF, an XML-based format (

Syslog is commonly used to describe both a way to move messages over port 514 UDP as well as a log format with a few structured elements.

Here are a few examples of various format that log files:


CEF:0|security|threatmanager|1.0|100|detected a \| in
message|10|src= act=blocked a | dst=

Apache CLF: - - [18/Feb/2000:13:33:37 -0600] “GET / HTTP/1.0” 200 5073 - frank [10/Oct/2000:13:55:36 -0700] “GET /apache_pb.gif HTTP/1.0” 200 2326


#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html


<?xml version=“1.0” encoding=“UTF-8”?>
<IDMEF-Message version=“1.0” xmlns=“urn:iana:xml:ns:idmef”>
<Alert messageid=“abc123456789”>
<Analyzer analyzerid=“hq-dmz-analyzer01”>
<Node category=“dns”>
<location>Headquarters DMZ Network</location>
<CreateTime ntpstamp=“0xbc723b45.0xef449129”>
<Source ident=“a1b2c3d4”>
<Node ident=“a1b2c3d4-001” category=“dns”>
<Address ident=“a1b2c3d4-002” category=“ipv4-net-mask”>
<Target ident=“d1c2b3a4”>
<Node ident=“d1c2b3a4-001” category=“dns”>
<Address category=“ipv4-addr-hex”>
<Classification text=“Teardrop detected”>
<Reference origin=“bugtraqid”>

Unfortunately, just because there are standards does not mean that most applications or application developers actually follow them. Worse, log files that are ASCII based are not necessarily human-readable (as in the IDMEF format, which is one of the more readable XML based logs). Some companies have binary logs, which means a specific piece of software is required to read the log. Binary logs do have a purpose – performance, and space. Binary logs tend to be smaller and take less processing time to create. Also, binary logs usually require less processing to parse and have delineated fields and data types, making the analysis more efficient. An ASCII log parser has to process more data and often has to use pattern matching to extract the useful bits of information.

Compliance mandates, such as PCI, have specific requirements of what should be included in the log, which may not be generally collected by the system, mandating an application or process specific to acquiring these requirements. Many other industry organizations have created their own recommendations in regards to events and details to be logged.

Overall, knowing log file syntax is critical before any kind of analysis can be done. The effect that some people can do such analysis in their head does not discount the value; it just shows once a case where such a thing is possible. The automated system analyzing logs need to have an understanding of a log syntax, usually encoded in some templates.

What is most important is the content of the log or its taxonomy. If multiple systems log the same event, it should be expected that the taxonomy of the event is identical. Thus, to make that happen, there needs to be a collection of well-defined words that can be combined into a predictable fashion – the log taxonomy. Regrettably, this is easier said than done.

Taxonomy can be further developed into these areas:

  • Change Management
  • Authentication and Authorization
  • Data and Systems Access
  • Threat Management
  • Performance and Capacity Management
  • Business Continuity and Availability Management
  • Miscellaneous Errors and Failures
  • Miscellaneous Debugging Messages

Therefore, when looking at logs, the question becomes what makes a good log? In general, a good log should tell you the five W’s of logging (appropriated from other disciplines where they are used today):

  • What happened, with appropriate detail
  • When did it happen, and if it includes when it started and when it ended, so much the better
  • Where did it happen, specifically, on what host, file system, interface, etc.
  • Who was involved, especially when talking about A&A activities
  • Where did they come from, with as much detail as possible

Also, it would be nice to know:

  • Where can more information be obtained
  • How certain is the information
  • What is affected

Some of the information will be dependent on the environment, and the messages programmed into the software by the developers. If the software does not trap the authentication information, then it cannot be recorded in the log, for example.

As in monitoring, logging information should fall into to categories: Does someone need to be awoken in the middle of the night to deal with the issue, or can it wait for morning. Sadly, in logging, there is usually a third category, which is it can be ignored. This, of course, begs the question if it can be ignored, why is it logged, and the associated space it is taking up? The other key to remember is that if an application is logging, there has to be somewhere for that log information to go to be useful. If the bits fall on the floor, then did they really get logged in the first place?

If the information is being logged, and those logs are being used properly, there is a large amount of data that can be gleaned from them. The risk then becomes determining causality.


Generally, finding the root of a problem means working backward, gathering together the symptoms to the cause, then determine the appropriate mitigating actions. To get there, it makes sense to follow some logical steps.

First, find a correlation. Identify the undesired symptom. Sometimes it is evident like a compiler failure or a missing configuration file. Sometimes it requires a bit more research to gather and juxtapose time-series events from several logs, and sometimes the symptom is a red herring of what is occurring.

Second, establish direction. Make sure that the cause is not preceding the effect. This means ensuring that not only are all systems using the same date/time codes, but they are actually correct. Many a log reading effort has been foiled because one system is set to UTC and one system is set to local time.

Finally, rule out confounding factors. This is where you test your hypothesis by witnessing the event occur. Sometimes, the potential source of the problem is not the actual source of the problem, especially if the network is sufficiently complicated, and the teams that manage the various parts of the enterprise do not talk to each other when changes are made.

Log Management

As mentioned, logs take up space. Sometimes a lot of space. As discussed in the Monitoring section, pretend it is necessary to monitor HTTP requests for a given service. Further, there is a limit of a hundred fields per log entry. If the service handles a thousand requests per second, a log entry with a hundred fields that take ten bytes each, consumes a megabyte a second, roughly. Or, about eighty gigabytes a day for logging. For one service. Most systems have dozens, if not hundreds of services, all which might need to create logs. If the service falls under government or industry mandate, that means the logs have to be maintained for a set period. If it is even a year, that is a lot of disk, and tape, and cold storage space that has to be allocated, and budgeted for, along with off-site storage. Disk space and tape may be cheap, but storage at somewhere like Iron Mountain quickly becomes a budget item of significance. Therefore, log management is important, if often overlooked, topic.

Fortunately, the corporate log retention policy has likely been established, but it does not hurt to review it from time to time, as various policies change5. When considering log retention, some critical areas include:

  • Applicable Compliance Requirements
  • Understand the organizational risk posture
  • Review log sources and the size of logs generated
  • Review storage options
    • Internal (device)
    • External (method of transit)

While there are several storage format options, most logs will be moved to some form of central log storage, usually backed by a database.


After setting up the monitoring solution, establishing the values to collect and ensuring that the monitoring solution is obtaining the appropriate data, the next, obvious question is Now What?

Most teams monitor their equipment to ensure that it is up and operational, and when it is not, to alert them when it fails if it is performing outside of established thresholds. If monitoring is the act of observation, then alerting is one method of delivering the data. It is not the only one, and should not be the first method of providing the data. Effective alerting is about getting it right, active, and ensuring it is not ignored.

Alerts fall into two categories. The first is an alert as a For your information. This alert is one where no immediate action is required, but someone should be informed. The backup job failed is an FYI type of alert. Someone needs to investigate, but the level of urgency does not rise to drop everything you are doing. This does not mean you should ignore the alert. You should follow up on all FYI alerts as soon as possible or question the value of continuing the alert.

The second type of alert is drop everything you are doing. This is an alert that is meant to wake people up in the dead of night6. It should be noted that there is occasionally a middle-range alert between the system is down and get to it while you can. Resist the temptation to create this intermediate type of alert. Either it is a problem that demands immediate action, or it is not. This is what makes a good alert.

The question then is around the strategy for delivery and resolution of alerts.

Stop using email for alerts. Most DevOps/SREs get too much email, and most filter their alerts into files that are never reviewed. If the alert is of the FYI variety, send it to a group notification (chat) system for follow up. If the alert requires immediate action, choose the method that works best for immediate response. SMS, pager duty, flashing red lights, etc. Not everyone uses or pays attention to their SMS alerts during the day or at night, so ensure the team has an agreed-upon method for how these alerts will be sent and acted upon. Use what works. It is also important to log all alerts for later reporting. A unique sequence number that can be referred back to in reports, knowledgebase articles, and post mortems is critical. It will also aid in SLA reporting and other legal actions.

Ensure that you have a runbook/checklist. In the Checklist Manifesto, the value of having a checklist, and what makes a good checklist details actions that are valuable, whether you are in surgery, trying to land a plane or troubleshoot a problem. Common items such as:

  • What system is on what machine in what subnet and connected to what router or gateway
  • Who is responsible for the system and code on the system and how to reach them in an emergency
  • Current infrastructure diagrams
  • Metrics and their meaning, and where they are collected

should be included.

This will get people focused on the job at hand and prevent the inevitable issue of but I thought…. Make sure to keep your checklists updated as systems change. If you are finding the runbook/checklist solution provides the actual answer, then automate the runbook! Self-healing should be the first response of any alerting system in the modern age. If a human has to respond to an alert, it is more than likely already too late.

Another useful alerting practice is to delete alerts or tune them to improve value. Do not be afraid of removing alerts. If an alert is being ignored, and ignoring the alert is not causing a system issue, consider evaluating why the alert was created in the first place. Is it still relevant? Is it still needed? Threshold alerts, in particular, should be reviewed frequently and with a critical eye. Just because an alert fires on a threshold does not mean the threshold is valid. If an alert triggers on a disk utilization at 90% capacity, is there an underlying problem if the disk goes from zero to 90% in an irrational amount of time? Is the monitoring system triggering on that sort of issue? Should it? When establishing threshold alerts, their reason for existence should be discussed, evaluated, and other what if scenarios should also be considered, and potentially rated a higher risk for alerting system to implement. Reducing alert fatigue will lead to more effective responses and fewer false alarms.

It should be common sense to disable or toggle alerting during a maintenance window, but more often than not, spurious alerts are generated. Again, this can lead to alert fatigue and the misdiagnosis that because system X is under maintenance, any alert generated by the system is related to this maintenance, when in fact it may not be and should be investigated.


Always being on-call is not a recipe for success. It is a tried and true way of rapidly burning out the staff. No one knows who is supposed to respond to an alert, so either everyone does, or worse, no one does, which only results in unnecessary escalations (and a lot of yelling in the C-suite).

Establish a visible and viable on-call rotation and stick to it. Whether each person gets the duty by the day, or by the week (do not put someone on-call for the month, that is just cruel) and reasonably rotate through the staff. Ensure that the change over occurs during the workday or in the middle of the week so that details and updates can be shared appropriately. Then establish an escalation path so that the on-call person knows who they can count on for support in the event of an alert beyond their immediate ability to resolve and when that escalation path crosses departments, what the alert process is and who needs to be included in the alert notification. This is especially important when developers need to be brought in to help resolve an issue.

Finally, acknowledge that being on-call deserves additional compensation. It is not just the duty person that is impacted when a system goes sideways at two AM on a Sunday. It is their family, and that has potential costs over and above the business costs. IT has long ignored that other industries compensate for having the duty. It should not be expected that IT personnel just do it as part of their normal course of events. If the systems are that important to the business being successful, then their support should also be treated as necessary. Especially if the business wishes to maintain quality people over time. More than one IT professional has tendered their resignation rather than face pressure from their family because they had to miss that event when the server crashed.

After the Fire

Once the event has been wrestled under control, the service restored to functionality, and everyone has had a moment to dust off, a quick review of what happened should occur. In emergency response circles this is called a hot-wash. What went right, what can be improved should be discussed, usually with a couple of quick bullet points. The team involved should also document their findings, with bug links, and other articles, for the after-action and the post mortem. The after-action should account for things that need to be fixed. Whether it is better alerting, more monitoring, replacing hardware, etc. These actions should be documented, ticketed, and addressed as soon as functionally possible to prevent the incident from occurring again.

A post mortem should also be held. This is different from the after-action. A successful post mortem looks at the issue from a blame-free position, using metrics, and root cause analysis. It should include action items from the after-action as well as a discussion on what can be done, that has not already been set in motion, to prevent the situation from occurring again. Punishment has no place in a post mortem.


There are numerous ways to monitor the network, servers, and services. Creating a proper monitoring solution means not approaching it as a single point solution but as multiple complex problems, and it represents different things to different people. Successful monitoring will also require numerous tools that will have to be integrated into an already crowded landscape. Sound monitoring is a group effort. Everyone should have a seat at the table and input into the discussion of the needs.

Logging, like monitoring, is about the health of the system but is also a critical factor in Security Information and Event Management (SIEM), forensic review, and compliance auditing. Logging is also an essential factor for state-change failure investigation. Because logs are needed for compliance, their growth, retention period, and media storage type needs to be considered in annual budget discussions with the concerned departments.

Alerting is the action of responding to monitoring reports. Good alerting, like monitoring and logging, is also a group effort and a complex problem. Ensure that FYI alerts are followed up on and that Fix me now alerts are actionable and closed quickly with a root cause evaluation and new code or systems as appropriate. Like monitoring, never be afraid of updating and re-evaluating the value of alerts, including deleting unneeded alerts.

Properly instrumented, a system that is well monitored allows for complete transactional observability. This leads to a better customer experience, and fewer phones ringing in the back office.

Web Links

  • Checklist Manifesto: (
  • Nagios (
  • New Relic (
  • MITRE:
  • W3C Extended Log File Format (ELF) (
  • Apache access log (
  • Cisco SDEE/CIDEE (
  • ArcSight common event format (CEF) (
  • Syslog/SyslogNG (RFC3195 and newer RFC5424).
  • IDMEF, an XML-based format (


  1. ↩︎
  2. Many operations managers will tell you is that when you drive for extreme reliability (beyond even the legendary 5 9s) it comes with ever increasing cost. These costs are not only monetary but there are also costs in terms of how fast new features can be developed and delivered. The end result is a decrease in application deployment because of risk aversion. ↩︎
  3. On Unix and Unix-like computer operating systems, a zombie process or defunct process is a process that has completed execution (via the exit system call) but still has an entry in the process table: it is a process in the “Terminated state”. Zombie Process ↩︎
  4. An event is a single occurrence within an environment, usually involving an attempted state change. An event usually includes a notion of time, the occurrence, and any details the explicitly pertain to the event or environment that may help explain or understand the event’s causes or effects” (source: and specifically ↩︎
  5. I am not a lawyer and I am not going to cover the ins and out of the dozens of government oversight requirements, audit requirements, and other headaches that each company has to adhere to day in and day out. These are guidelines. Make sure to involve corporate compliance and other associated corporate entities when discussing log retention. ↩︎
  6. This type of alert should cause heads to pop up and people to move immediately to fix the problem. A wind up air raid siren sound is a great alert tone for this sort of issue. ↩︎

I Hope We Got A Good Price For It

Twitter erupts over Trump claim that the moon ‘is a part’ of Mars

While it is clear 45 mis-spoke, it is not the first time he has said something that is so bizarre as to make many Bushisms (Bush II) seem like logical, well thought out arguments, the truth of the matter is that 45 just is not articulate. And that is being polite about it.

Put aside for the moment that the United States does not have the scientific will to return to the Moon, much less stage a mission to Mars, it does not have the political will to do so. With so many tangible problems on this planet, I can appreciate not funding a lunar mission. However, if funding a lunar mission kickstarts scientific funding for other, viable projects on Earth, then let’s go back to the Moon!

Otherwise, I hope Marvin gave us a good price.

The Realities of Contract Work

About three months after the longest government shutdown in history came to an end, leaders of companies and unions representing federal contract workers are speaking out, asking for legislative changes to ensure that their employees are guaranteed back pay if another shutdown occurs. WTOP

The latest Federal Government shutdown lasted thirty-five days and affected eight hundred thousand contractors, everyone from the men and women that empty the trash, to the men and women that write the software that fly the unmanned aircraft used by the military.

I ask that we work together to find a way to enact legislation that will treat the contractor workforce just as their federal workforce counterparts are treated —Leidos CEO Roger Krone.

Mr. Krone’s supposed concern for his employees notwithstanding, this is about the money Leidos and other federal contractors lost during the shutdown. Let me explain how it works. A contractor is a body with a value attached to it. That value is a combination of their salary, their benefits, and a number of other factors at play within the company that is loaning their services to the Federal Government. What is called the loaded rate. That loaded rate would astound you (or perhaps not – that $10,000 toilet seat. That was as much the cost of labour as the cost of materials). If I am employed at an annual salary of $130,000, I would be making about $60/hour. Now, we add to that the overhead costs, etc from the company and I am being sold to them at a rate of $120/hr.

This is about the money, and not the money lost by the contractors.

The argument or at least the parallel is that the Federal Employees got their money, the contractors should too. I am of two minds about this. I can see the argument that the Federal workforce was involuntarily furloughed. In labour law, it is slightly different from being laid-off, but the effects are the same. If Ford had furloughed a line, those employees would not be paid for their time off. Similarly, the contractors would not be paid.

Companies are often left with two choices during a shutdown, he said: pay their employees and go out of business, or withhold pay and see their workforce leave them.

He is correct. Contractors actually live on a very small margin with Federal contracts. And they do have a significant risk that their employees will walk out on them if they are not paid. And that is the risk of contracting in general. You take a job with a contractor knowing that at some point your contract will expire. Some companies recognize this, make allowances for it, and protect their employees, to a certain extent, while others are less concerned. In fairness to Leidos, they tend to protect their employees, but they have lost their margin over the thirty-five days, through no fault of their own because the government (the paymaster) was not doing their job.

How this compensation should be addressed is not a simple issue. Being paid to not work is not going to be successful. And 800,000 federal contractors did not work for thirty-five days. Unlike Federal Employees, contractors had the option of looking for work or taking unemployment, or annual training or leave. I am not sure contractors should expect back pay. But as a former Federal contractor, I can feel their pain. There is no easy solution.

Adults In Charge

In report, Mueller says Congress could take action on Trump obstruction

While most of official Washington is busy losing their minds today over the release of the Mueller report, this particular call out caught my attention.

“The President’s efforts to influence the investigation were mostly unsuccessful, but that is largely because the persons who surrounded the President declined to carry out orders or accede to his requests.”

What is harrowing to me, is that, had people like former chief of staff John Kelly not been in their jobs, then the current President of the United States would and quite likely could have been successful in interfering in the investigation and quite likely would now be facing some form of repercussion.

This begs the question. What other orders have they declined to carry out, and to the betterment of the country or the detriment of the President?

There is no question that the current President lacks certain qualities we might expect in a President. Worse, this President seems to not want to be President, yet clearly thinks that since he is, the Government should be run like a company, not a nation. And as a nation, the United States is paying the price. And will continue to pay the price for generations to come. History will not be kind to number 45, but what worries me, is now that the adults have been cleared from the room, what sort of havoc will be wrought on the country. I am not sure the nation is ready for it.

Greatest Opportunity of All Time

An open letter to all recruiters, past, present, and future. If you are serious about attracting top talent into hard to fill positions, you need to put as much effort into finding us as we put into finding our positions. It is not enough to have a catchy title, that title actually has to be meaningful. Most of us have been in the industry more than ten minutes so a headline like this is meaningless.

Greatest Opportunity Of All Time!

Even if it really is our dream opportunity, most of us will ignore an email with that title as junk because it means you have not done your homework.

Hello! I am a Sr. Technical Recruiter from an IT Staffing Firm, 1. I am reaching out in regards to possible job opportunities that we have in the area that match up very well with your profile. You have an excellent background and I know we can find the next best opportunity for you!

Again, I am skeptical. You did a keyword search, that’s good, but do I really match up? How?

We have a DevOps Lead specializing in Azure technologies in the Washington, DC area supporting the Federal Agnecy. US Citizens Required

Ah, you did not even do a keyword search. If you had, you would have discovered that I have zero experience with Azure, and I have never worked at the agency you are sourcing, and the law of averages says you should have hit my agencies, although I have neutered my profile and taken the agencies I have supported out of it.

If you are finding it hard to attract top talent, then perhaps you should review the tactics you are using to get their attention.

  1. I have removed all incriminating names to protect the guilty, but I get several of these a week with the same template.

When Security is Not Secure

There are wide variations in the quality and security of identification used to gain access to secure facilities where there is potential for terrorist attacks. In order to eliminate these variations, U.S. policy is to enhance security, increase Government efficiency, reduce identity fraud, and protect personal privacy by establishing a mandatory, Government-wide standard for secure and reliable forms of identification issued by the Federal Government to its employees and contractors (including contractor employees). HSPD-12

The Commonwealth of Virginia is the latest state to move to RealID. And again, I ask, why?

For those who have not followed the issue, following the attacks on September 11, 2001, a number of these Homeland Security Presidential Directives were issued. Number 12 forms the basis for the RealID standard. Other documents in this bucket include the CAC/PIV card used by the Federal Government, Passport, Global Entry, and yes, driver’s licenses. And if you have blindly, or even grudgingly handed over your personal information to these agencies, you probably did not think about the actual directive. But since this new ID allows you to board an aircraft, you probably did not blink. But perhaps you should. After all, unless you are issued a CAC/PIV card, what sort of security is this new ID providing?

I will wait.

Still confused? Let me help you. The process likely goes like this. You handed over your old driver’s license, your passport or immigration status card, your social security number, and some proof of residency to a clerk at the DMV (mine had a passing familiarity with English) and boom, you have a RealID card that will get you access to airplanes, military bases, and other government buildings. You may not get past the front door, but you will get inside. And how does this enhance security? There is no background check run. There are no fingerprints, no FBI file. If, like me, you have had your driver’s license more than a week, all they do is check your eyesight and charge a processing fee. Virginia gives you the option to not get one. For a lower fee.

CAC/PIV cards are completely different. They do a background check. With fingerprints, and an FBI file. But not with most of the other documents.

Feel more secure now? Oh, and China called. They are willing to sell you your file back. For less than the processing fee you just paid.

The Credibility Gap

US bars entry of International Criminal Court investigators | WTOP

The United States will revoke or deny visas to International Criminal Court personnel who try to investigate or prosecute alleged abuses committed by U.S. forces in Afghanistan or elsewhere, and may do the same with those who seeking action against Israel, Secretary of State Mike Pompeo said Friday.

The next time the US cries about an international war criminal running free – Julian Assange, Edward Snowden – and asking why they are not being held to account, I am going to point back at this decision.

Words Have Consequences

The white supremacist who livestreamed his Friday rampage at a New Zealand mosque posted a manifesto online that said the attack was partly inspired by Dylann Roof’s 2015 massacre of nine black churchgoers in South Carolina, police said. Washington Post

It is easy to say that these individuals have a screw loose. Or a at least a warped sense of community, but as we continue to suffer under the vitriol spewing forth from arguably one of the most important people on the planet, we continue to see this sort of self-enforcement of views. There are two ways to solve this problem. The first is to realize that the world is changing and that means everything you hold dear (yes, I am being sarcastic) will change with it. The second thing is that using a gun to solve the problem does not actually accomplish anything.

If you are not willing to talk civilly with the person next to you, then say nothing. Just because you disagree does not mean you have to shoot them.