Wednesday, August 10, 2011

Drift Management Coming to RHQ

Introduction
I am excited to share that we are very close to releasing a beta of RHQ 4.1.0. I have been working on Drift Management, one of the new features going into the release. I have been meaning to write a little bit about what this new feature is all about, and now is as good a time as any. I will try to provide a high-level overview and save the more specific, detailed topics for future posts.

What is Drift?
The first thing we need to do is define exactly what is meant by the term Drift Management. Let's start with the first part. Conceptually, we can define drift as an unplanned or unintended change to a managed resource. Let's consider a couple of examples to illustrate the concept.

We have an EAP server that is configured for production use. That is, things like the JVM heap size, data source definitions, etc. are configured with production values. At some point suppose the heap settings for the EAP server are changed such that they are no longer consistent with what is expected for production use. This constitutes drift.

Now let's consider another example involving application deployment. Suppose we have a cluster of EAP servers that is running our business application. We deploy an updated version of the application. For some reason, one of the cluster nodes does not get updated with the newer version of the application while the others do. We now have a cluster node that does not have the content that is expected to be deployed on it. This constitutes drift.

Why Do We Care about Drift?
Now that we have looked at some examples to illustrate the concept of drift, there is a perfectly reasonable question to ask. Why should we care? Unplanned or unintended changes frequently lead to problems. Those problems can manifest themselves as production failures, defects, outages, etc. Even with planned, intended changes, problems arise. It is not a question of if but rather when. A production server going down can result in a significant loss of time and money, among other things. Anything you can do to be proactive in handling issues when they occur could help save your organization time, money, and resources.

How Will RHQ Manage Drift?
What can RHQ do to deal with drift? First and foremost, it can monitor resources for unintended or unplanned changes. RHQ allows you to specify which resources or which parts of resources you want to monitor for drift. The agent can periodically scan the file system looking for changes. When the agent detects a change, it notifies the server with the details of what has changed.
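
To make the scan-and-compare idea concrete, here is a minimal sketch in Python. It is not RHQ's agent code or API; the function names scan and detect_drift, and the choice of SHA-256 digests to detect changes, are assumptions made purely for illustration.

```python
import hashlib
import os


def scan(base_dir):
    """Return {relative_path: sha256_hex} for every file under base_dir."""
    digests = {}
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            digests[os.path.relpath(path, base_dir)] = digest
    return digests


def detect_drift(previous, current):
    """Compare two scans (hypothetical format) and classify each differing path."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(p for p in shared if previous[p] != current[p]),
    }
```

An agent-like loop would call scan on a schedule, pass the previous and current results to detect_drift, and report any non-empty result to the server.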

The server maintains a history of the changes it receives from the agent. This makes it possible, for example, to compare the state of a resource today with its state two weeks ago. One of the many interesting and challenging problems we are tackling is how to present that history in meaningful ways so that users can quickly and easily identify changes of interest.

An integral aspect of RHQ's monitoring capabilities is its alerting system. RHQ allows you to define different rules which can result in alerts being triggered. For example, we can create a rule that will trigger an alert whenever an EAP server goes down. Similarly, RHQ could (and will) give you the ability to have alerts triggered whenever drift is detected on any of your managed EAP servers.

Another key aspect of RHQ's drift management functionality is remediation. Some platforms and products provide automatic remediation. Consider the earlier example of the changed heap settings on the EAP server. With automatic remediation, those settings might be reverted to their original values as soon as the change is detected.

Then there is also manual remediation. Think merge conflicts in a version control system. There are lots of visual editors for viewing diffs and resolving conflicts. A couple that I use are diffmerge and meld. RHQ will provide interfaces and tools for generating and viewing diffs and for performing remediation in much the same way you might with a visual diff editor.

What's Next?
Here is a quick rundown of drift management features that will be in the beta:

  • Enable drift management for individual resources
    • This involves defining the drift configuration or rules which specify what files to monitor for drift and how often monitoring should be done
  • Perform drift monitoring (done by the agent)
  • View change history in the UI
  • Execute commands from the CLI to:
    • Query for change history
    • Generate snapshots
      • A snapshot provides a point in time view of a resource for a specified set of changes
    • Diff snapshots (This is not a file diff)
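
As a rough illustration of the snapshot idea, the sketch below folds an ordered set of changes into a point-in-time view and then compares two such views by path rather than by file content. It is written in Python and is not the RHQ CLI or its API; the change tuple shape (path, kind, digest) and the function names build_snapshot and diff_snapshots are hypothetical.

```python
def build_snapshot(change_sets):
    """Fold an ordered list of change sets into a {path: digest} view.

    Each change set is a list of (path, kind, digest) tuples, where kind is
    "added", "changed", or "removed". This shape is an assumption.
    """
    snapshot = {}
    for change_set in change_sets:
        for path, kind, digest in change_set:
            if kind == "removed":
                snapshot.pop(path, None)
            else:
                snapshot[path] = digest
    return snapshot


def diff_snapshots(left, right):
    """List the paths that differ between two snapshots, without diffing file contents."""
    shared = left.keys() & right.keys()
    return {
        "only_in_left": sorted(left.keys() - right.keys()),
        "only_in_right": sorted(right.keys() - left.keys()),
        "different": sorted(p for p in shared if left[p] != right[p]),
    }
```

Building one snapshot from the first few change sets and another from the full history, then passing both to diff_snapshots, shows which files changed between the two points in time.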

Here are some notable features that will not be available in the beta:
  • Define filters that specify which files to include/exclude in drift monitoring (Note that you actually can define the filters. They just are not handled by the agent yet)
  • Perform manual remediation (i.e., visual diff editor)
  • Support for golden images (more on this in a future post)
  • Generate/view snapshots in the UI
  • Alerts integration
It goes without saying that there will be bugs, some of which are known, and that functionality in the beta is subject to change in ways that will likely break compatibility with future releases. More information will be provided in the release notes as soon as they are available. Stay tuned!
