
Sunday, July 20, 2014

Changing the Endpoint Address of an RHQ Storage Node

There is very limited support for changing the endpoint address of a storage node. In fact the only way to do so is by undeploying and redeploying the node with the new address. And in some cases, like when there is only a single storage node, this is not even an option. BZ 1103841 was opened to address this, and the changes will go into RHQ 4.13.

Changing the endpoint address of a Cassandra node is a routine maintenance operation. I am referring specifically to the address that Cassandra uses for gossip. This address is specified by the listen_address property in cassandra.yaml. The key thing when changing the address is to ensure that the node's token assignments do not change. Rob Coli's post on changing a node's address provides a nice summary of the configuration changes involved.

With CASSANDRA-7356, however, things are even easier. Change the value of listen_address and restart Cassandra with the following system properties defined in cassandra-env.sh,

  • -Dcassandra.replace_address=<new_address>
  • -Dcassandra.replace_address_first_boot=true 
The seeds property in cassandra.yaml might need to be updated as well. Note that there is no need to worry about the auto_bootstrap, initial_token, or num_tokens properties.

For the RHQ Storage Node, these system properties will be set in cassandra-jvm.properties. Users will be able to update a node's address either through the storage node admin UI or through the RHQ CLI. One interesting thing to note is that the RHQ Storage Node resource type uses the node's endpoint address as its resource key. This is not good. When the address changes, the agent will think it has discovered a new Storage Node resource. To prevent this, we can add resource upgrade support in the rhq-storage plugin and change the resource key to use the node's host ID, which is a UUID that does not change. The host ID is exposed through the StorageServiceMBean.getLocalHostId JMX attribute.

If you are interested in learning more about the work involved with adding support for changing a storage node's endpoint address, check out the wiki design doc that I will be updating over the next several days.

Monday, September 9, 2013

Upgrading to RHQ 4.9

RHQ 4.8 introduced the new Cassandra backend for metrics. There has been a tremendous amount of work since then focused on the management of the new RHQ Storage Node. We do not want to impose on users the burden of managing a second database. One of our key goals is to provide robust management such that Cassandra is nothing more than an implementation detail for users.

The version of Cassandra shipped in RHQ 4.8 includes some native libraries. One of the main uses for those native libraries is compression.  If the platform on which Cassandra is running has support for the native libraries, table compression will be enabled. Data files written to disk will be compressed.

All of the native libraries have been removed from the version of Cassandra shipped in RHQ 4.9. The reason for this change is to ensure RHQ continues to provide solid cross-platform support. The development and testing teams simply do not have the bandwidth right now to maintain native libraries for all of the supported platforms in RHQ and JON.

The following information applies only to RHQ 4.8 installs.

Since RHQ 4.9 does not ship with those native compression libraries, Cassandra will not be able to decompress the data files on disk.

Compression has to be disabled in your RHQ 4.8 installation before upgrading to 4.9. There is a patch which you will need to run prior to upgrading. Download rhq48-storage-patch.zip and follow the instructions provided in rhq48-storage-patch.sh|bat.

I do want to mention that we will likely re-enable compression using a pure Java compression library in a future RHQ release.

Wednesday, October 17, 2012

Why I am ready to move to CQL for Cassandra application development

Earlier this year, I started learning about Cassandra as it seemed like it might be a good fit as a replacement data store for metrics and other time series data in RHQ. I developed a prototype for RHQ. I used the client library Hector for accessing Cassandra from within RHQ. I defined my schema using a Cassandra CLI script. I recall when I first read about CQL. I spent some time deliberating over whether to define the schema using a CLI script or using a CQL script. Although I was intrigued, I ultimately decided against using CQL. As the CLI and the Thrift interface were more mature, they seemed like the safer bet. While I decided not to invest any time in CQL, I did make a mental note to revisit it at a later point since there was clearly a big emphasis within the Cassandra community on improving CQL. That later point is now, and I have decided to start making extensive use of CQL.

After a thorough comparative analysis, the RHQ team decided to move forward with using Cassandra for metric data storage. We are making heavy use of dynamic column families and wide rows. Consider for example the raw_metrics column family in figure 1,

Figure 1. raw_metrics column family

The metric schedule id is the row key. Each data point is stored in a separate column where the metric timestamp is the column name and the metric value is the column value. This design supports fast writes as well as fast reads and works particularly well for the various date range queries in RHQ. This is considered a dynamic column family because the number of columns per row will vary and because column names are not defined up front. I was quick to rule out using CQL due to a couple of misconceptions about CQL's support for dynamic column families and wide rows. First, I did not think it was possible to define a dynamic table with wide rows using CQL. Secondly, I did not think it was possible to execute range queries on wide rows.

A couple of weeks ago I came across this thread on the cassandra-users mailing list, which points out that you can in fact create dynamic tables/column families with wide rows. Conveniently, after coming across this thread, I happened to stumble on the same information in the docs. Specifically, the DataStax docs state that wide rows are supported using composite column names. The primary key can have multiple components, but there must be at least one column that is not part of the primary key. Using CQL I would then define the raw_metrics column family as follows,
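A minimal sketch of what such a table definition could look like, based on the description above (the column names schedule_id, time, and value are illustrative rather than the exact ones used in RHQ):

CREATE TABLE raw_metrics (
    schedule_id int,        -- row key: the metric schedule id
    time timestamp,         -- clustering component: the collection timestamp
    value double,           -- the metric value
    PRIMARY KEY (schedule_id, time)
);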

This CREATE TABLE statement is straightforward, and it does allow for wide rows with dynamic columns. The underlying column family representation of the data is slightly different from the one in figure 1 though.

Figure 2. CQL version of raw_metrics column family
Each column name is now a composite that consists of the metric timestamp along with the string literal, value. There is additional overhead on reads and writes as the column comparator now has to compare the string in addition to the timestamp. Although I have yet to do any of my own benchmarking, I am not overly concerned by the additional string comparison. I was, however, concerned about the additional overhead in terms of disk space. I have done some preliminary analysis and concluded that the difference compared with just storing the timestamp in the column name is negligible due to SSTable compression, which is enabled by default.

My second misconception about executing range queries is really predicated on the first misconception. It is true that you can only query named columns in CQL; consequently, it is not possible to perform a date range query against the column family in figure 1. It is possible though to execute a date range query against the column family in figure 2.
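For example, against the table sketched above, a date range query for a single schedule might look like the following (the schedule id and date literals are illustrative):

SELECT time, value
FROM raw_metrics
WHERE schedule_id = 123
  AND time >= '2012-10-01 00:00:00'
  AND time < '2012-10-08 00:00:00';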

RHQ supports multiple upgrade paths. This means that in order to upgrade to the latest release (which happens to be 4.5.0 at the time of this writing), I do not have to first upgrade to the previous release (which would be 4.4.0). I can upgrade from 4.2.0, for instance. Supporting multiple upgrade paths requires a tool for managing schema changes. There are plenty of such tools for relational databases, but I am not aware of any for Cassandra. But because we can leverage CQL and because there is a JDBC driver, we can look at using an existing tool instead of writing something from scratch. I have done just that and am working on adding support for Cassandra to Liquibase. I will have more on that in a future post. Using CQL allows us to reuse existing solutions, which in turn is going to save a lot of development and testing effort.

The most compelling reason to use CQL is the familiar, easy-to-use syntax. I have been nothing short of pleased with Hector. It is well designed, the online documentation is solid, and the community is great. Whenever I post a question on the mailing list, I get responses very quickly. With all that said, contrast the following two equivalent queries against the raw_metrics column family.

RHQ developers can look at the CQL version and immediately understand it. Using CQL will result in less code that is easier to maintain. We can also leverage ad hoc queries with cqlsh during development and testing. The JDBC driver also lends itself nicely to applications that run in an application server, as RHQ does.

Things are still evolving both with CQL and with the JDBC driver. Collections support is coming in Cassandra 1.2. The JDBC driver does not yet support batch statements. This is due to the lack of support for it on the server side. The functionality is there in the Cassandra trunk/master branch, and I expect to see it in the 1.2 release. The driver also currently lacks support for connection pooling. These and other critical features will surely make their way into the driver. With the enhancements and improvements to CQL and to the JDBC driver, adding Cassandra support to Hibernate OGM becomes that much more feasible.

The flexibility, tooling, and ease of use make CQL a very attractive option for working with Cassandra. I doubt the Thrift API is going away any time soon, and we will continue to leverage the Thrift API through Hector in RHQ in various places. But I am ready to make CQL a first class citizen in RHQ and look forward to watching it continue to mature into a great technology.

Monday, July 2, 2012

Setting up a local Cassandra cluster using RHQ

As part of my ongoing research into using Cassandra with RHQ, I did some work to automate setting up a Cassandra cluster (for RHQ) on a single machine for development and testing. I put together a short demo showing what is involved. Check it out at http://bit.ly/N3jbT8.

Saturday, June 16, 2012

Aggregating Metric Data with Cassandra

Introduction

I successfully performed metric data aggregation in RHQ using a Cassandra back end for the first time recently. Data roll up, or aggregation, is done by the data purge job, which is a Quartz job that runs hourly. This job is also responsible for purging old metric data as well as data from other parts of the system. The data purge job invokes a number of different stateless session EJBs (SLSBs) that do all the heavy lifting. While there is still a lot of work that lies ahead, this is a big first step forward that is ripe for discussion.

Integration

JPA and EJB are the predominant technologies used to implement and manage persistence and business logic. Those technologies, however, are not really applicable to Cassandra. JPA is for relational databases, and one of the central features of EJB is declarative, container-managed transactions. Cassandra is neither a relational nor a transactional data store. For the prototype, I am using server plugins to integrate Cassandra with RHQ.

Server plugins are used in a number of areas in RHQ already. Pluggable alert notification senders are one of the best examples. A key feature of server plugins is the encapsulation made possible by the class loader isolation that is also present with agent plugins. So let's say that Hector, the Cassandra client library, requires a different version of a library that is already used by RHQ. I can safely use the version required by Hector in my plugin without compromising the RHQ server. In addition to the encapsulation, I can dynamically reload my plugin without having to restart the whole server. This can help speed up iterative development.

Cassandra Server Plugin Configuration
You can define a configuration in the plugin descriptor of a server plugin. The above screenshot shows the configuration of the Cassandra plugin. The nice thing about this is that it provides a consistent, familiar interface in the form of the configuration editor that is used extensively throughout RHQ. There is one more screenshot that I want to share.

System Settings
This is a screenshot of the system settings view. It provides details about the RHQ server itself, like the database used, the RHQ version, and the build number. There are several configurable settings, like the retention period for alerts and drift files and settings for integrating with an LDAP server for authentication. At the bottom there is a property named Active Metrics Server Plugin. There are currently two values from which to choose. The first is the default, which uses the existing RHQ database. The second is for the new Cassandra back end. The server plugin approach affords us a pluggable persistence solution that can be really useful for prototyping among other things. Pluggable persistence with server plugins is a really interesting topic in and of itself. I will have more to say on that in a future post.

Implementation

The Cassandra implementation thus far uses the same buckets and time slices as the existing implementation. The buckets and retention periods are as follows:

Metrics Data Bucket    Data Retention Period
raw data               7 days
one hour data          2 weeks
6 hour data            1 month
1 day data             1 year

Unlike the existing implementation, purging old data is accomplished simply by setting the TTL (time to live) on each column. Cassandra takes care of purging expired columns. The schema is pretty straightforward. Here is the column family definition for raw data specified as a CLI script:
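An illustrative sketch of what such a definition could look like (the DateType, DoubleType, and Int32Type choices are assumptions based on the description below, not the exact script):

create column family raw_metrics
    with comparator = DateType and
    default_validation_class = DoubleType and
    key_validation_class = Int32Type;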


The row key is the metric schedule id. The column names are timestamps and column values are doubles. And here is the column family definition for one hour data:
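Again, as an illustrative sketch rather than the exact script (the composite comparator pairs the timestamp with the integer aggregate-type code described below):

create column family one_hour_metric_data
    with comparator = 'CompositeType(DateType, Int32Type)' and
    default_validation_class = DoubleType and
    key_validation_class = Int32Type;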


As with the raw data, the schedule id is the row key. Unlike the raw data, however, we use composite columns here. All of the buckets, with the exception of the raw data, store computed aggregates. RHQ calculates and stores the min, max, and average for each (numeric) metric schedule. The column name consists of a timestamp and an integer. The integer identifies whether the value is the max, min, or average. Here is some sample (Cassandra) CLI output for one hour data:


Each row in the output reads like a tuple. The first entry is the column name with a colon delimiter. The timestamp is listed first, followed by the integer code that identifies the aggregate type. Next is the column value, which is the value of the aggregate calculation. Then we have a timestamp. Every column in Cassandra has a timestamp. It is used for conflict resolution on writes. Lastly, we have the TTL. The schema for the remaining buckets is similar to the one_hour_metric_data column family, so I will not list them here.

The last implementation detail I want to discuss is querying. When the data purge job runs, it has to determine what data is ready to be aggregated. With the existing implementation that uses the RHQ database, queries are fast and efficient using indexes. The following column family definition serves as an index to make queries fast for the Cassandra implementation as well:
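An illustrative sketch of that index column family (the metrics_aggregates_index name comes from the discussion below; the type choices are assumptions based on the description):

create column family metrics_aggregates_index
    with comparator = 'CompositeType(DateType, Int32Type)' and
    default_validation_class = Int32Type and
    key_validation_class = UTF8Type;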


The row key is the metric data column family name, e.g., one_hour_metric_data. The column name is a composite that consists of a timestamp and a schedule id. Currently the column value is an integer that is always set to zero because only the column name is needed. At some point I will likely refactor the data type of the column value to something that occupies less space. Here is a brief explanation of how the index is used. Let's start with writes. Whenever data for a schedule is written into one bucket, we update the index for the next bucket. For example, suppose data for schedule id 123 is written into the raw_metrics column family at 09:15. We will write into the "one_hour_metric_data" row of the index with a column name of 09:00:123. The timestamp at which the write occurred is rounded down to the start of the time slice of the next bucket. Further suppose that additional data for schedule 123 is written into the raw_metrics column family at times 09:20, 09:25, and 09:30. Because each of those timestamps gets rounded down to 09:00 when writing to the index, we do not wind up with any additional columns for that schedule id. This means that the index will contain at most one column per schedule for a given time slice in each row.

Reads occur to determine what data, if any, needs to be aggregated. Each row in the index is queried. After a column is read and the data for the corresponding schedule is aggregated into the next bucket, that column is then deleted. This index is a lot like a job queue. Reads in the existing implementation, which uses a relational database, should be fast; however, there is still work that has to be done to determine what data, if any, needs to be aggregated when the data purge job runs. With the Cassandra implementation, the presence of a column in a row of the metrics_aggregates_index column family indicates that data for the corresponding schedule needs to be aggregated.

Testing

I have pretty good unit test coverage, but I have only done some preliminary integration testing. So far it has been limited to manual testing. This includes inspecting values in the database via the CLI or with CQL and setting breakpoints to inspect values. As I look to automate the integration testing, I have been giving some thought to how metric data is pushed to the server. Relying on the agent to push data to the server is suboptimal for a couple of reasons. First, the agent sends measurement reports to the server once a minute. I need better control of how frequently and when data is pushed to the server.

The other issue with using the agent is that it gets difficult to simulate older metric data that has been reported over a specified duration, be it an hour, a day, or a week. Simulating older data is needed for testing that data is aggregated into 6 hour and 24 hour buckets and that data is purged at appropriate times.

RHQ's REST interface is a better fit for the integration testing I want to do. It already provides the ability to push metric data to the server. I may wind up extending the API, even if just for testing, to allow for kicking off the aggregation that runs during the data purge job. I can then use the REST API to query the server and verify that it returns the expected values.

Next Steps

There is still plenty of work ahead. I have to investigate what consistency levels are most appropriate for different operations. There is still a large portion of the metrics APIs that needs to be implemented, some of the more important ones being query operations used to render metrics graphs and tables. The data purge job is not the best approach going forward for doing the aggregation. Only a single instance of the job runs each hour, and it does not exploit any of the opportunities that exist for parallelism. Lastly, and maybe most importantly, I have yet to start thinking about how to effectively manage the Cassandra cluster with RHQ. As I delve into these other areas I will continue sharing my thoughts and experiences.

Monday, June 11, 2012

Modeling Metric Data in Cassandra

RHQ supports three types of metric data - numeric, traits, and call time. Numeric metrics include things like the amount of free memory on a system or the number of transactions per minute. Traits are strings that track information about a resource and typically change in value much less frequently than numeric metrics. Some examples of traits include server start time and server version. Call time metrics capture the execution time of requests against a resource. An example of call time metrics is EJB method execution time.

I have read several times that with Cassandra it is best to let your queries dictate your schema design. I recently  spent some time thinking about RHQ's data model for metrics and how it might look in Cassandra. I decided to focus only on traits for the time being, but much of what I discuss applies to the other metrics types as well.

I will provide a little background on the existing data model to make it easier to understand some of the things I touch on. All metric data in RHQ belongs to resources. A particular resource might support metrics like those in the examples above, or it might support something entirely different. A resource has a type, and the resource type defines which types of metrics it supports. We refer to these as measurement definitions. These measurement definitions, along with other meta data associated with the resource type, are defined in the plugin descriptor of the plugin that is responsible for managing the resource. You can think of a resource type as an abstraction and a resource as a realization of that abstraction. Similarly, a measurement definition is an abstraction, and a measurement schedule is a realization of a measurement definition. A resource can have multiple measurement schedules, and each schedule is associated with a measurement definition. The schedule has a number of attributes like the collection interval, an enabled flag, and the value. When the agent reports metric data to the RHQ server, the data is associated with a particular schedule. To tie it all together, here is a snippet of some of the relevant parts of the measurement classes:

To review, for a given measurement schedule, we can potentially add an increasing number of rows in the RHQ_MEASUREMENT_DATA_TRAIT table over time. There are a lot of fields included in the snippet for MeasurementDefinition. I chose to include most of them because they are pertinent to the discussion.

For the Cassandra integration, I am interested primarily in the MeasurementDataTrait class. All of the other types are managed by the RHQ database. Initially when I started thinking about what column families I would need, I felt overcome with writer's block. Then I reminded myself to think about trait queries and try to let those guide my design. I decided to focus on some resource-level queries and leave others like group-level queries for a later exercise. Here is a screenshot of one of the resource-level views where the queries are used:


Let me talk a little about this view. There are a few things to point out in order to understand the approach I took with the Cassandra schema. First, this is a list view of all the resource's traits. Secondly, the view shows only the latest value for each trait. Finally, the fields required by this query span multiple tables and include resource id, schedule id, definition id, display name, value, and time stamp. Because the fields span multiple tables, one or more joins are required for this query. There are two things I want to accomplish with the column family design in Cassandra. I want to be able to fetch all of the data with a single read, and I want to be able to fetch all of the traits for a resource in that read. Cassandra of course does not support joins, so some denormalization is needed to meet my requirements. I have two column families for storing trait data. Here is the first one that supports the above list view as a Cassandra CLI script:
create column family resource_traits
    with comparator = 'CompositeType(DateType, Int32Type, Int32Type, BooleanType, UTF8Type, UTF8Type)' and
    default_validation_class = UTF8Type and
    key_validation_class = Int32Type;
The row key is the resource id. The column names are a composite type that consist of the time stamp, schedule id, definition id, enabled flag, display type, and display name. The column value is a string and is the latest known value of the trait. This design allows for the latest values of all traits to be fetched in a single read. It also gives me the flexibility to perform additional filtering. For example, I can query for all traits that are enabled or disabled. Or I can query for all traits whose values last changed after a certain date/time. Before I talk about the ramifications of the denormalization I want to introduce the other column family that tracks the historical data. Here is the CLI script for it:
create column family traits
    with comparator = DateType and
    default_validation_class = UTF8Type and
    key_validation_class = Int32Type;
This column family is pretty straightforward. The row key is the schedule id. The column name is the time stamp, and the column value is the trait value. In the relational design, we only store a new row in the trait table if the value has changed. I have only done some preliminary investigation, and I am not yet sure how to replicate that behavior with a single write. I may need to use a custom comparator. It is something I have to revisit.

I want to talk a little bit about the denormalization. As far as this example goes, the system of record for everything except the trait data is the RHQ database. Suppose a schedule is disabled. That will now require a write to both the RHQ database and Cassandra. When a new trait value is persisted, two writes have to be made to Cassandra - one to add a column to the traits column family and one to update the resource_traits column family.

The last thing I will mention about the design is that I could have opted for a more row-based approach where each column in resource_traits is stored in a separate row. With that approach, I would use statically named columns like scheduleId and the corresponding value would be something like 1234. The primary reason I decided against this is that the RandomPartitioner, which happens to be the default, is used for the partitioning strategy. RandomPartitioner is strongly recommended for most cases to allow for even key distribution across nodes. Without going into detail, range scans, i.e., row-based scans, are not possible when using the RandomPartitioner. Additionally, Cassandra is designed to perform better with slice queries, i.e., column-based queries, than with range queries.

The design may change as I get further along in the implementation, but it is a good starting point. The denormalization allows for efficient querying of a resource's traits and offers the flexibility for additional filtering. There are some trade-offs that have to be made, but at this point, I feel that they are worthwhile. One thing is for certain: studying the existing (SQL/JPA) queries and understanding what data is involved and how it is used helped flesh out the column family design.

Tuesday, May 29, 2012

Working with Cassandra

RHQ provides a rich feature set in terms of its monitoring capabilities. In addition to collecting and storing metric data, RHQ automatically generates baselines, allows you to view graphs of data points over different time intervals, and gives you the ability to alert on metric data. RHQ uses a single database for storing all of its data. This includes everything from the inventory, to plugin meta data, to metric data. This presents an architectural challenge for the measurement subsystem particularly in terms of scale. As the number of managed resources grows and the volume of metrics being collected increases, database performance starts to degrade. Despite various optimizations that have been made, the database remains a performance bottleneck. The reality is that the relational database simply is not the best tool for write-intensive applications like time-series data.

This architectural challenge has in large part motivated me to start learning about Cassandra. There are plenty of other, non-relational database systems that I think could address the performance problems with our measurement subsystem. There are a couple of things about Cassandra, though, that intrigued me enough to decide to invest time in learning about it.

The first point of intrigue is that Cassandra is a distributed, peer-to-peer system with no single point of failure. Any node in the cluster can serve read and write requests. Nodes can be added to and removed from the cluster at any point, making it easier to meet demands around scalability. This design is largely inspired by Amazon's Dynamo.

The second point of intrigue for me is that running a node involves running only a single Java process. For the purposes of RHQ and JBoss Operations Network (JON), this is much more important to me than the first point about single points of failure. The fewer the moving parts, the better. It simplifies management, which goes a long way towards the goal of having a self-contained solution.

Cassandra could be a great fit for RHQ, and the time I have spent thus far learning it is definitely time well spent. There are some learning curves and hurdles one has to overcome though. I find the project documentation to be lacking. For example, it took some time to wrap my head around super columns. It was only after I started understanding super columns to the point where I could begin thinking about how to leverage them with RHQ's data model that I then discovered that composite columns should be favored over super columns. Apparently composite columns do not have the performance and memory overhead inherent to super columns. And composite columns allow for an arbitrary level of nesting whereas super columns do not. Fortunately DataStax's docs help fill in a lot of the gaps.

One thing that was somewhat counter-intuitive initially is how the sorting works. With a relational database, you first define the schema, and then queries are defined later on. Sorting is done on column values and is specified at query time. With Cassandra, sorting is based on column names and is specified at the time of schema creation. This might seem really strange if you are thinking in terms of a relational database, but Cassandra is a distributed key-value store. If you think about it more along the lines of say, java.util.TreeMap, then it makes a lot more sense. With a TreeMap, sorting is done on keys. When I want to use a TreeMap or another ordered collection, I have to decide in advance how the elements of the collection should be ordered. This aspect of Cassandra is a good thing. It contributes to the high performance read/writes for which Cassandra is known. It also lends itself very well to working with time-series data.

DataStax posted a great blog post the other day about how they use Cassandra as a metrics database. The algorithm described sounds similar to what we do in RHQ; however, there are a few differences (aside from the obvious one of using different database systems). One difference is in bucket sizes. They use bucket sizes of one minute, five minutes, two hours, and twenty-four hours. RHQ uses bucket sizes of one hour, six hours, and twenty-four hours. I will briefly explain what this means. RHQ writes raw data points into a set of round-robin tables. Every hour a job runs to perform aggregation. The latest hour of data points is aggregated into the one hour table or bucket. RHQ calculates the max, min, and average for each metric collection schedule. When the one hour table has six hours' worth of data, it is aggregated and written into the six hour table.

Disk space is cheap, but it is not infinite. There needs to be a purge mechanism in place to prevent unbounded growth. For RHQ, the hourly job that does the aggregation also handles the purging. Data in the six hour bucket, for instance, is kept for 31 days. With Cassandra, DataStax simply relies on Cassandra's built-in TTL (time to live) feature. When data is written into a column, the TTL is set on it so that it will expire after the specified duration.
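As a rough sketch of what that looks like in CQL (the table and values here are illustrative, not RHQ's actual schema):

INSERT INTO raw_metrics (schedule_id, time, value)
VALUES (123, '2012-05-29 09:00:00', 3.14)
USING TTL 604800;   -- expire this data point after 7 days (604800 seconds)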

So far it has been a good learning experience. Cassandra is clearly an excellent fit for storing RHQ's metric data, but I am starting to see how it could also be a good fit for other parts of the data model as well.

Wednesday, August 10, 2011

Drift Management Coming to RHQ

Introduction
I am excited to share that we are very close to releasing a beta of RHQ 4.1.0. I have been working on Drift Management, one of the new features going into the release. I have been meaning to write a little bit about what this new feature is all about, and now is as good a time as any. I will try to provide a high level overview and save getting into more specific, detailed topics for future posts.

What is Drift?
The first thing we need to do is define what exactly is meant by the term Drift Management. Let's start with the first part. Conceptually, we can define drift as an unplanned or unintended change to a managed resource. Let's consider a couple of examples to illustrate the concept.

We have an EAP server that is configured for production use. That is, things like the JVM heap size, data source definitions, etc. are configured with production values. At some point suppose the heap settings for the EAP server are changed such that they are no longer consistent with what is expected for production use. This constitutes drift.

Now let's consider another example involving application deployment. Suppose we have a cluster of EAP servers that is running our business application. We deploy an updated version of the application. For some reason, one of the cluster nodes does not get updated with the newer version of the application while the others do. We now have a cluster node that does not have the content that is expected to be deployed on it. This constitutes drift.

Why Do We Care about Drift?
Now that we have looked at some examples to illustrate the concept of drift, there is a perfectly reasonable question to ask. Why should we care? Unplanned or unintended changes frequently lead to problems. Those problems can manifest themselves as production failures, defects, outages, etc. Even with planned, intended changes, problems arise. It is not a question of if but rather when. A production server going down can result in a significant loss of time and money among other things. Anything you can do to be proactive in handling issues when they occur could help save your organization time, money, and resources.

How Will RHQ Manage Drift?
What can RHQ do to deal with drift? First and foremost, it can monitor resources for unintended or unplanned changes. RHQ allows you to specify which resources or which parts of resources you want to monitor for drift. The agent can periodically scan the file system looking for changes. When the agent detects a change, it notifies the server with the details of what has changed.

The server maintains a history of the changes it receives from the agent. This makes it possible for example to compare the state of a resource today versus its state two weeks ago. One of the many interesting and challenging problems we are tackling is how to present that history in meaningful ways so that users can quickly and easily identify changes of interest.

An integral aspect of RHQ's monitoring capabilities is its alerting system. RHQ allows you to define different rules which can result in alerts being triggered. For example, we can create a rule that will trigger an alert whenever an EAP server goes down. Similarly, RHQ could (and will) give you the ability to have alerts triggered whenever drift is detected on any of your managed EAP servers.

Another key aspect of RHQ's drift management functionality is remediation. Some platforms and products provide automatic remediation. Consider the earlier example of the changed heap settings on the EAP server. With automatic remediation, those settings might be reverted to their original values as soon as the change is detected.

Then there is also manual remediation. Think merge conflicts in a version control system. There are lots of visual editors for viewing diffs and resolving conflicts. A couple that I use are diffmerge and meld. RHQ will provide interfaces and tools for generating and viewing diffs and for performing remediation, much in the same way you might with a visual diff editor.

What's Next?
Here is a quick run down of drift management features that will be in the beta:

  • Enable drift management for individual resources
    • This involves defining the drift configuration or rules which specify what files to monitor for drift and how often monitoring should be done
  • Perform drift monitoring (done by the agent)
  • View change history in the UI
  • Execute commands from the CLI to:
    • Query for change history
    • Generate snapshots
      • A snapshot provides a point in time view of a resource for a specified set of changes
    • Diff snapshots (This is not a file diff)

Here are some notable features that will not be available in the beta:
  • Define filters that specify which files to include/exclude in drift monitoring (Note that you actually can define the filters; they just are not handled by the agent yet)
  • Perform manual remediation (i.e., visual diff editor)
  • Support for golden images (more on this in a future post)
  • Generate/view snapshots in the UI
  • Alerts integration
It goes without saying that there will be bugs, some of which are known, and that functionality in the beta is subject to change in ways that will likely break compatibility with future releases. More information will be provided in the release notes as soon as they are available. Stay tuned!

Thursday, June 2, 2011

Manually Add Resources to Inventory from CLI

Resources in RHQ are typically added to the inventory through discovery scans that run on the agent. The plugin container (running inside the agent) invokes plugin components to discover resources. RHQ also allows you to manually add resources into inventory. There may be times when discovery scans fail to find a resource you want to manage. The other day I was asked whether or not you can manually add a resource to inventory via the CLI. Here is a small CLI script that demonstrates manually adding a Tomcat server into inventory.


The findResourceType and findPlatform functions are pretty straightforward. The interesting work happens in createTomcatConnectionProps and in manuallyAddTomcat. The key to it all though is on line 44. DiscoveryBoss provides methods for importing resources from the discovery queue as well as for manually adding resources. manuallyAddResources expects as arguments a resource type id, a parent resource id, and the plugin configuration (i.e., connection properties).

Determining the connection properties that you need to specify might not be entirely intuitive. I looked at the plugin descriptor as well as the TomcatDiscoveryComponent class from the tomcat plugin to determine the minimum, required connection properties that need to be included.

Here is how the script could be used from the CLI shell:

rhqadmin@localhost:7080$ login rhqadmin rhqadmin
rhqadmin@localhost:7080$ exec -f manual_add.js
rhqadmin@localhost:7080$ hostname = '127.0.0.1'
rhqadmin@localhost:7080$ tomcatDir = '/home/jsanda/Development/tomcat6'
rhqadmin@localhost:7080$ manuallyAddTomcat(hostname, tomcatDir)
Resource:
           id: 12071
         name: 127.0.0.1:8080
      version: 6.0.24.0
 resourceType: Tomcat Server

rhqadmin@localhost:7080$

This effectively adds the Tomcat server to the inventory of managed resources. This same approach can be used with other resource types. The key is knowing what connection properties you need to specify so that the plugin (in which the resource type is defined) knows how to connect to and manage the resource.

Monday, May 23, 2011

A REPL for the RHQ Plugin Container

Overview
RHQ plugins run inside of a plugin container that provides different services and manages the life cycles of plugins. The plugin container in turn runs inside of the RHQ agent. If you are not familiar with the agent, it is deployed to each machine you want RHQ to manage. You can read more about it here and here. While the plugin container runs inside of the agent, it is not coupled to the agent. In fact, it is used quite a bit outside of the agent. It is used in Embedded Jopr which was intended to be a replacement for the JMX web console in JBoss AS. The plugin container is also used a lot during development in automated tests.

My teammate, Heiko Rupp, has developed a cool wrapper application for the plugin container. It defines a handful of commands for working with the plugin container interactively. What is nice about this is that it can really speed up plugin development. Heiko has written several articles about the standalone container including Working on a standalone PluginContainer wrapper and I love the Standalone container in Jopr (updated). After reading some of his posts I got to thinking that a REPL for the plugin container would be really nice - but not just any REPL. I was thinking specifically about Clojure's REPL.

I have spent some time exploring the different ways Clojure could be effectively integrated with RHQ. There is little doubt in my mind that this is one of them. I recently started working on some Clojure functions to make working with the plugin container easier. I am utilizing Clojure's immutable and persistent data structures as well as some of the other great language features such as first class functions and multimethods. I am trying to make these functions easy enough to use so that someone who might not be a very experienced Clojure programmer might still find them useful during plugin development and testing.

Getting the Code
The project is available on github at https://github.com/jsanda/clj-rhq. It is built with leiningen, so you will want to have that installed. I typically run a swank server and connect from Emacs, but you can also start a REPL session directly if you are not an Emacs user. The project pulls in the necessary dependencies so that you can work with plugin container-related classes as you will see in the following sections.

Running the Code
These steps assume that you already have leiningen installed. First, clone the project:

git clone https://jsanda@github.com/jsanda/clj-rhq.git

Next, download project dependencies with:

lein deps

Some plugins rely on native code provided by the Sigar library, which you should find at clj-rhq/lib/sigar-dist-1.6.5.132.zip. Create a directory in lib named native and unzip sigar-dist-1.6.5.132.zip there. The project is configured to look for native libraries in lib/native.

Finally, if you are using Emacs run lein swank to start a swank server; otherwise, run lein repl to start a REPL session on the command line.

Starting/Stopping the Plugin Container

The first thing I do is call require to load the rhq.plugin-container namespace. Then I call the start function. The plugin container emits a line of output, and then the function returns nil. Next I verify that the plugin container has started up by calling running?. Then I call the stop function to shut down the plugin container and finally call running? again to verify that the plugin container has indeed shut down.

Executing Discovery Scans
So far we have looked at starting and stopping the PC. One of the nice things about working interactively in the REPL is that you are not limited to a pre-defined set of functions. If rhq.plugin-container did not offer any functions for executing a discovery scan, you could write something like the following:


The pc function simply returns the plugin container singleton which gives us access to the InventoryManager. We call InventoryManager's executeServerScanImmediately method and store the InventoryReport object that it returns in a variable named inventory-report. Alternatively you can use the discover function.


On the first call to discover we pass the keyword :SERVER as an argument. This results in a server scan being run. On the second call, we pass :SERVICE which results in a service scan being run. If you invoke discover with no arguments, a server scan is executed followed by a service scan. The two inventory reports from those scans are returned in a vector. The use of the count function to see how many resources were discovered is a good example that demonstrates how you can easily use functions defined outside of the rhq.plugin-container namespace to provide additional capabilities and functionality.

Searching the Local Inventory
Once you have the plugin container running and are able to execute discovery scans, you need a way to query the inventory for resources with which you want to work. The inventory function does just that. It can be invoked in one of two ways. In its simpler form, which takes no arguments, it returns the platform Resource object. In its more complex form, it takes a map of pre-defined filters and returns a lazy sequence of those resources that match the filters.


inventory is invoked on line 1 without any arguments, and then a string version of it is returned with a call to str. The type is Mac OS X indicating that the object is in fact the platform resource. On line 5 we invoke inventory with a single filter to include resources that are available.  That call shows that there are 62 resources in inventory that are up.  On line 7 we query for resources that are a service and see that there are 60 in inventory. On line 9 we specify multiple filters that will return down services. When multiple filters are specified, a resource must match each one in order to be included in the results. On line 10 we query for webapps from the JBossAS plugin. On line 13 we specify a custom filter in the form of an anonymous function with the :fn key. This filter finds resources that define at least two metrics.

Conclusion
We have looked at a number of functions to make working with the plugin container from the REPL a bit easier. Each function should also include a useful docstring as in the following example,


We have only scratched the surface with the functions in the rhq.plugin-container namespace. In some future posts we will explore invoking resource operations, updating resource configurations, and deploying resources like EARs and WARs.

Thursday, May 19, 2011

Remote Streams in RHQ

The agent/server communication layer in RHQ provides rich, bi-directional communication that is highly configurable, performant, and fault tolerant. And as a developer it has another feature of which I am quite fond - I rarely have to think about it. It just works.

Recently I started working on a new feature that involves streaming potentially large files from agent to server. This work has led me to look under the hood of the comm layer to an extent. The comm layer allows for high-level APIs between server and agent. Consider the following example:

public interface ConfigurationServerService {
    ...
    @Asynchronous(guaranteedDelivery = true)
    void persistUpdatedResourceConfiguration(int resourceId, Configuration resourceConfiguration);
}

The agent calls persistUpdatedResourceConfiguration when it has detected a resource configuration change that has occurred outside of RHQ. The @Asynchronous annotation tells the communication layer that the remote method call from agent to server can be performed asynchronously. There are no special stubs or proxies that I have to worry about to use this remote API. It is all nicely tucked away in the communication layer.

Several posts could be devoted to discussing RHQ's communication layer but back to my current work of streaming large files. I needed to put in place a remote API on the server so that the agent can upload files. You might consider something like the following as an initial approach:

// Remote API exposed by RHQ server to stream files from agent to server
void uploadFile(byte[] data);

The problem with this approach is that it involves loading the file contents into memory. File sizes could easily exceed several hundred megabytes, resulting in substantial memory usage that would be impractical. The RHQ agent is finely tuned to keep a low footprint in terms of memory usage as well as CPU utilization. When reading the contents of a large file that is too big to fit into memory, java.io.InputStream is commonly used. With the RHQ communication layer, I am able to expose an API like the following,

// Remote API exposed by RHQ server to stream files from agent to server
void uploadFile(InputStream stream);

With this API, the agent passes an InputStream object to the server. Keep in mind though that none of Java's standard InputStream classes implement Serializable which is a requirement for using objects with a remote invocation framework like RMI or JBoss Remoting. Fortunately for me RHQ provides the RemoteInputStream class which extends java.io.InputStream. The Javadocs from that class state,

This is an input stream that actually pulls down the stream data from a remote server. Note that this extends InputStream so it can be used as any normal stream object; however, all methods are overridden to actually delegate the methods to the remote stream.

When the agent wants to upload a file, it calls uploadFile passing a RemoteInputStream object. The server can then read from the input stream just as it would any other input stream unbeknownst to it that the bytes are being streamed over the wire.

While I find myself impressed with RemoteInputStream, it gets even better. I wanted to read from the stream asynchronously. When the agent calls uploadFile, instead of reading from the stream in the thread handling the request, I fire off a message to a JMS queue to free up the thread to service other agent requests. I am able to pass the RemoteInputStream object in a JMS message and have a Message Driven Bean then read from the stream to upload the file from the agent.

This level of abstraction along with the performance, fault tolerance, and stability characteristics of the agent/server communication layer makes it one of those hidden gems you do not really appreciate until you have to look under the hood, so to speak. And rarely if ever do I find myself having to look under the hood because... it just works. Lastly, I should point out that there is a RemoteOutputStream class that complements the RemoteInputStream class.

Sunday, February 20, 2011

RHQ Bundle Recipe for Deploying JBoss Server

My colleague mazz wrote an excellent blog post that describes in detail the provisioning feature of RHQ. The post links to a nice Flash demo he put together to illustrate the various things he discusses in his article. Taking what I learned from his post, I put together a simple recipe to deploy a JBoss EAP server and then start the server after it has been laid down on the destination file system. Here is the recipe:


The bundle declaration itself on lines 4 - 11 is pretty straightforward. If this part is not clear, read through the docs on Ant bundles. Where things became a little less than straightforward is with the <exec> task starting on line 18. The first problem I encountered was Ant saying that it could not find run.sh. I think this is because it was looking for it on my PATH. Adding resolveexecutable="true" on line 21 took care of this problem. This tells Ant to look for the executable in the specified execution directory.

On line 22 I specify arguments to run.sh. -b 0.0.0.0 tells JBoss to bind to all available addresses. Initially I had line 22 written as:

<arg value="-b 0.0.0.0"/>

That did not get parsed correctly and resulted in JBoss throwing an exception with an error message saying that an invalid bind address was specified. Specifying the line attribute instead of the value attribute fixed the problem.

The last problem I encountered was Ant complaining that it did not have the necessary permissions to execute run.sh. It turned out that when the EAP distro was unpacked, the scripts in the bin directory were not executable. This is why I added the <chmod> call on line 17. It seems that the executable file mode bits are getting lost somewhere along the way in the deployment process. I went ahead and filed a bug for this. You can view the ticket here.

After working through these issues, I was able to successfully deploy my JBoss server and have it start up without error. Now I can easily deploy my bundle to a single machine, a cluster of RHEL servers that might serve as a QA or staging environment, or even a group of heterogeneous machines that could consist of Windows, Fedora (or other Linux distros), and Mac OS X. Very cool! Provisioning is still a relatively new feature in RHQ. It adds tremendous value to the platform, and fortunately I think it can add even more value. One of the things I would like to see is more support for common tasks like starting/stopping a server, whether it is in the form of custom Ant tasks or something else.

Tuesday, November 23, 2010

Writing an RHQ Plugin in Clojure

Clojure is a new, exciting language. My biggest problem with it is that I do not find enough time to work with it. One of the ways I am trying to increase my exposure to Clojure is by exploring ways of integrating it into RHQ. RHQ is well-suited for integrating non-Java, JVM languages because it was designed and built to be extended. In previous posts I have talked about various extension points including agent plugins, server plugins, and remote clients.

I decided to write an agent plugin in Clojure. If you are not familiar with RHQ plugins or what is involved with implementing one, check out this excellent tutorial from my colleague Heiko. Right now, I am just doing exploratory work. I have a few goals in mind though as I go down this path.

First, I have no desire to wind up writing Java in Clojure. By that I mean that I do not want to get bogged down dealing with mutable objects. One of the big draws to Clojure for me is that it is a purely functional language with immutable data structures; so, as I continue my exploration of integrating Clojure with RHQ, I want to write idiomatic Clojure code to the greatest extent possible.

Secondly, I want to preserve what I like to think of as the Clojure development experience. Clojure is a very dynamic language in which functions and name spaces can be loaded and reloaded on the fly. The REPL is an invaluable tool. It provides instant feedback. In my experience Test-Driven Development usually results in short, quick development iterations. TDD + REPL produces extremely fast development iterations.

Lastly, I want to build on the aforementioned goals in order to create a framework for writing RHQ plugins in Clojure. For instance, I want to be able to run and test my plugin in a running plugin container directly from the REPL. And then when I make a change to some function in my plugin, I want to be able to just reload that code without having to rebuild or redeploy the plugin.

Now that I have provided a little background on where I hope to go, let's take a look at where I am currently. Here is the first cut at my Clojure plugin.


I am using gen-class to generate the plugin component classes. As you can see, this is just a skeleton implementation. Here is the plugin descriptor.


I have run into some problems though when I deploy the plugin. When the plugin container attempts to instantiate the plugin component to perform a discovery scan, the following error is thrown:


I was not entirely surprised to see such an error because I have heard about some of the complexities involved with trying to run Clojure in an OSGi container, and the RHQ plugin container shares some similarities with OSGi. There is a lot of class loader magic that goes on with the plugin container. For instance, each plugin has its own class loader, plugins can be reloaded at runtime, and the container limits the visibility of certain classes and packages. I came across this Clojure Jira ticket, CLJ-260, which talks about setting the context class loader. Unfortunately this did not help my situation because the context class loader is already set to the plugin class loader.

After spinning my wheels a bit, I decided to try a different approach. I implemented my plugin component class in Java, and it delegates to a Clojure script. Here is the code for it.


And here is the Clojure script,


This version deploys without error. I have not fully grokked the class loading issues, but at least for now, I am going to stick with a thin Java layer that delegates to my Clojure code. Up until now, I have been using leiningen to build my code, but now that I am looking at a mixed code base, I may consider switching over to Maven. I use Emacs and Par Edit for Clojure, but I use IntelliJ for Java. The IDE support for Maven will come in handy when I am working on the Java side of things.

Sunday, November 21, 2010

Server-Side Scripting in RHQ

Introduction
The RHQ platform can be extended in several ways, most notably through plugins that run on agents. There also exists the capability to extend the platform's functionality with scripting via the CLI. The CLI is a remote client, and even scripts that are run on the same machine on which the RHQ server resides are still a form of client-side scripting because they run in a separate process and operate on a set of remote APIs exposed by the server.

In this post I am going to introduce a way to do server-side scripting. That is, the scripts are run in the same JVM in which the RHQ server is running. This form of scripting is in no way mutually exclusive to writing CLI scripts; rather, it is complementary. While a large number of remote APIs are exposed through the CLI, they do not encompass all of the functionality internal to the RHQ server. Server-side scripts, however, have full access to the internal APIs of the RHQ server.


Server Plugins
RHQ 3.0.0 introduced server plugins which are distinct from agent plugins. Server plugins run directly in the RHQ server inside a server plugin container. Unlike agent plugins, they do not perform any resource discovery. The article, RHQ Server Plugins - Innovation Made Easy, provides a great introduction to server plugins. Similar to agent plugins, server plugins can expose operations which can be invoked from the UI. They can also be configured to run as scheduled jobs. Server plugins have full access to the internal APIs of the RHQ server. Reference documentation for server plugins can be found here. The server-side scripting capability we are going to look at is provided by a server plugin.

Groovy Script Server
Groovy Script Server is a plugin that allows you to dynamically execute Groovy scripts directly on the RHQ server. Documentation for the plugin can be found here. The plugin currently provides a handful of features including,
  • Customizable classpath per script
  • Easy access to RHQ EJBs through dynamic properties
  • An expressive DSL for generating criteria queries

An Example
Now that we have introduced server plugins and the Groovy Script Server, it is time for an example. A while back, I wrote a post on a way to auto-import resources into inventory using the CLI. We will revisit that script, written as a server-side script.

resourceIds = []

criteria(Resource) { 
  filters = [inventoryStatus: InventoryStatus.NEW] 
}.exec(SubjectManager.overlord) { resourceIds << it.id }

DiscoveryBoss.importResources(SubjectManager.overlord, (resourceIds as int[]))

On line three we call the criteria method, which is available to all scripts. This method provides our criteria query DSL. Notice that the method takes a single parameter - the class for which the criteria query is being generated. Filters are specified as a map of property names to property values. Property names are derived from the various addFilterXXX methods exposed by the criteria object being built. In this instance, the filter corresponds to the method ResourceCriteria.addFilterInventoryStatus.

The criteria method returns a criteria object that corresponds to the class argument. In this example, a ResourceCriteria object is returned. Notice that exec is called on the generated ResourceCriteria object. This method is dynamically added to each generated criteria object. It takes care of calling the appropriate manager which in this case is ResourceManager. exec takes two arguments - a Subject and a closure. Most stateless session bean methods in RHQ go through a security layer to ensure that the user specified by the Subject has the necessary permissions to perform the requested operation. In the CLI, you may have noticed that you do not have to pass a Subject to the various manager methods. This is because the CLI implicitly passes the Subject corresponding to the logged in user. The second argument, a closure, is called once for each entity in the results returned from exec.
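
Since the post being revisited was originally a CLI script, it is worth seeing the two side by side. A CLI (JavaScript) version of the same auto-import might look something like the sketch below. The criteria and manager calls follow the remote APIs discussed above, but treat the details as assumptions - in particular, how the JavaScript array of ids gets converted to the int[] expected by importResources may need adjusting for your CLI version.

// Find all resources sitting in the discovery queue.
var criteria = new ResourceCriteria();
criteria.addFilterInventoryStatus(InventoryStatus.NEW);
var resources = ResourceManager.findResourcesByCriteria(criteria);

// Collect the resource ids.
var resourceIds = [];
for (var i = 0; i < resources.size(); ++i) {
    resourceIds.push(resources.get(i).id);
}

// The CLI implicitly passes the logged-in user's Subject, so no overlord is needed here.
DiscoveryBoss.importResources(resourceIds);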

Let's look at a second server-side example that builds off of the Groovy example above. Instead of auto-importing everything in the discovery queue, suppose we only want to import JBoss AS 5 or AS 6 instances.

resourceIds = []

criteria(Resource) { 
  filters = [
    inventoryStatus:  InventoryStatus.NEW,
    resourceTypeName: 'JBossAS Server',
    pluginName:       'JBossAS5'
  ] 
}.exec(SubjectManager.overlord) { resourceIds << it.id }

DiscoveryBoss.importResources(SubjectManager.overlord, (resourceIds as int[]))

Here we add two additional filters for the resource type name and the plugin. If we did not filter on the plugin name in addition to the resource type name, then our results could include JBoss AS 4 instances which we do not want.

Future Work
The Groovy Script Server, as well as server plugins in general, is relatively new to RHQ. There are some enhancements that I already have planned. First is adding support for running scripts as scheduled jobs. This is one of the big features of server plugins. With support for scheduled jobs, we could configure the auto-inventory script to run periodically, freeing us from having to manually log into the server to execute the script. The CLI version of the script could be wrapped in a cron job, but if we did that, we might want to include some error handling logic in case the server is down or otherwise unavailable. With a server-side scheduled job, we do not need that kind of error handling logic.

The second thing I have planned is to put together additional documentation and examples. With the work that has already been done, the server-side scripting capability opens up a lot of interesting possibilities. I would love to hear feedback on how you might utilize the script server as well as any enhancements that you might like to see.

Saturday, November 13, 2010

RHQ: Deleting Agent Plugins

Introduction
RHQ is an extensible management platform; however, the platform itself does not provide the management capabilities. For example, there is nothing built into the platform for managing a JBoss AS cluster. The platform is agnostic of the actual resources and resource types it manages, like the JBoss AS cluster. The management capabilities for resources like JBoss AS are provided through plugins. RHQ's plugin architecture allows the platform to be extended in ways such that it can manage virtually any type of resource.

Plugin JAR files can be deployed and installed on an RHQ server (or cluster of servers), they can be upgraded, and they can even be disabled. They cannot, however, be deleted. In this post, we spend a little bit of time exploring plugin management, from installing and upgrading plugins to disabling them. Then we look at my recent work on deleting plugins.

Installing Plugins
Plugins can be installed in one of two ways. The first involves copying the plugin JAR file to the server's jbossas/server/default/deploy/rhq.ear/rhq-downloads/rhq-plugins directory. And starting with RHQ 3.0.0, you can alternatively copy the plugin JAR file to the server's plugins directory, which is arguably easier given the much shorter path. The RHQ server will periodically scan these directories for new plugin files. When a new or updated plugin is detected, the server will deploy the plugin. This approach is particularly convenient during development when the RHQ server is running on the same machine on which I am developing. In fact, RHQ's Maven build is set up to copy plugins to a development server as part of the build process.

The second approach to installing a plugin involves uploading the plugin file through the web UI. The screenshot below shows the UI for plugin file upload.

Deploying plugins through the web UI is particularly useful when the plugin is on a different file system than the one on which the RHQ server is running. It is worth noting that there currently is no API exposed for installing plugins through the CLI.

Upgrading Plugins
The platform not only supports deploying new plugins that previously have not been installed in the system, but it also supports upgrading existing plugins. From a user's perspective there really is no difference in upgrading a plugin versus installing one for the first time. The steps are the same. And the RHQ server, for the most part, treats both scenarios the same as well.

Installing a new or upgraded plugin does not affect any agents that are currently running. Agents have to be explicitly updated in one of a number of ways including,
  • Restarting the agent
  • Restarting the plugin container
  • Issuing the plugins update command from the agent prompt
  • Issuing a resource operation for one of the above. This can be done from the UI or from the CLI (a sketch of the CLI approach follows this list)
  • Issuing a resource operation for one of the above from a server script.
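
As a rough sketch of the CLI approach mentioned in the list above, the resource operation can be invoked through a resource proxy for the agent resource. Note that the operation name shown below is an assumption; check the RHQ Agent resource type for the actual operation that updates the plugins.

// agentResourceId is assumed to have been looked up beforehand.
var agent = ProxyFactory.getResource(agentResourceId);
agent.updateAllPlugins(); // hypothetical operation name - verify against the agent resource type
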
Disabling Plugins
Installed plugins can be disabled. Disabling a plugin results in agents ignoring that plugin once the agent is restarted (or more precisely, when the plugin container running inside the agent is restarted). The plugin container will not load that plugin, which means resource components, discovery components, and plugin classloaders are not loaded. This results in a reduced memory footprint of the agent. It also reduces overall CPU utilization since the agent's plugin container is performing fewer discovery and availability scans.

Plugins can be disabled on a per-agent basis allowing for a more heterogeneous deployment of agents. For instance, I might have a web server that is only running Apache and the agent that is monitoring it, while on another machine I have a JBoss AS instance running.  I could disable the JBoss-related plugins on the Apache box freeing up memory and CPU cycles. Likewise, I can disable the Apache plugins on the box running JBoss AS.

When a plugin is disabled, nothing is removed from the database. Any resources already in inventory from the disabled plugin remain in inventory. Type definitions from the disabled plugin also remain in the system.

Deleting Plugins
Recently I have been working on adding support for deleting plugins. Deleting a plugin not only deletes the plugin itself from the system, but also everything associated with it, including all type definitions and all resources of the types defined in the plugin. When disabling a plugin, the plugin container has to be explicitly restarted in order for it to pick up the changes. This is not the case though with deleting plugins. Agents periodically send inventory reports to the server. If the report contains a resource of a type that has been deleted, the server rejects the report and tells the agent that it contains stale resource types. The agent in turn recycles its plugin container, purging its local inventory of any stale types and updating its plugins to match what is on the server. No type definitions, discovery components, or resource components from the plugin will be loaded.

Use Cases for Plugin Deletion
There are a number of motivating use cases for supporting plugin deletion. The most important of these might be the added ability to downgrade a plugin. But we will also see the benefits plugin deletion brings to the plugin developer.

Downgrading Plugins
We have already mentioned that RHQ supports upgrading plugins. It does not, however, support downgrading a plugin. Deleting a plugin effectively provides a way to roll back to a previous version of a plugin. There may be times in a production deployment, for example, when a plugin does not behave as expected or desired. Users currently do not have the capability to downgrade to a previous version of that plugin. Plugin deletion now makes this possible.

Working with Experimental Plugins
Working with an experimental plugin or one that might not be ready for production use carries with it certain risks. Some of those risks can be mitigated with the ability to disable a plugin; however, the plugin still exists in the system. Resources remain in inventory. Granted those resources can be deleted easily enough, but there is still some margin for error in so far as failing to delete all of the resources from the plugin or accidentally deleting the wrong resources. And there exists no way to remove type definitions such as metric definitions and operation definitions without direct database access. Having the ability to delete a plugin along with all of its type definitions and all instances of those type definitions completely eliminates these risks.

Simplifying Plugin Development
A typical work flow during plugin development includes incremental deployments to an RHQ server as changes are introduced to the plugin. Many if not all plugin developers have run into situations in which they have to blow away their database due to changes made in the plugin (this normally involves changes to type definitions in the plugin descriptor). This slows down development, sometimes considerably. Deleting a plugin should prove much less disruptive to a developer's work flow than having to start with a fresh database installation, particularly when a substantial amount of test data has been built up in the database. To that end, I can really see the utility in a Maven plugin for RHQ plugin development that deploys the RHQ plugin to a development server. The Maven plugin could provide the option to delete the RHQ plugin if it already exists in the system before deploying the new version.

Conclusion
Development for the plugin deletion functionality is still ongoing, but I am confident that it will make it into the next major RHQ release. If you are interested in tracking the progress or experimenting with this new functionality, take a look at the delete-agent-plugin branch in the RHQ Git repo. This is where all of the work is currently being done. You can also check out this design document which provides a high level overview of the work involved.

Tuesday, September 28, 2010

Dealing with Asynchronous Workflows in the CLI

Introduction
There is constant, ongoing communication between agents and servers in RHQ. Agents, for example, send inventory and availability reports up to the server at regularly scheduled intervals. The server sends down resource-related requests such as updating a configuration or executing a resource operation. Examples of these include updating the connection pool setting for a JDBC data source and starting a JBoss AS server. Some of these work flows are performed in a synchronous manner while others are carried out in an asynchronous fashion. A really good example of an asynchronous work flow is scheduling a resource operation to execute at some point in the future. There is a common pattern used in implementing these asynchronous work flows. We will explore this pattern in some detail and then consider the impacts on remote clients like the CLI.

The Pattern
The asynchronous work flows are most prevalent in requests that produce mutative actions against resources. Let's go through the pattern.
  • A request is made on the server to take some action against a resource (e.g., invoke an operation, update connection properties, update configuration, deploy content, etc.)
  • The server logs the request on the audit trail
  • The server sends the request to the agent
    • Note that control returns to the server immediately after the request is sent to the agent. This means that the call to the agent will likely return before the requested action has actually been carried out.
  • The plugin container (running in the agent) invokes the appropriate resource component
  • The resource component carries out the request and reports the results back to the plugin container
  • The agent sends the response back to the server. The response will indicate success or failure.
  • The server updates the audit trail indicating that the request has completed and also whether it succeeded or failed.
    • Note that it is the same audit trail entry that was logged for the original request that gets updated
Let's revisit the earlier example of scheduling an operation to start a JBoss server. Suppose I schedule the operation to execute immediately. Then I navigate to the operation history page for the JBoss server. I will see the operation request listed in the history. The history page is a view of the audit trail. The operation shows a status of In Progress. We could continually refresh the page until we see the status change. Eventually it will change to Success or Failure. The status does not necessarily change immediately after the operation completes. It changes after the agent reports the results back to the server and the audit trail is updated.

As previously stated, this pattern is very common throughout RHQ. Consider making a resource configuration update, which is performed asynchronously as well. Once I submit the configuration update request, I can navigate to the configuration history page to check the status of the request. The status of the update request will show in progress until the agent reports back to the server that the update has completed. When the agent reports back to the server, the corresponding audit trail entry is updated with the results. The same pattern can also be observed when manually adding a new resource into the inventory.

Understanding the Impact to the CLI
So what does this asynchronous work flow mean for remote clients, notably CLI scripts? First and foremost, you need to understand when and where requests are carried out asynchronously to avoid unpredictable, unexpected results. We will discuss a number of things that can potentially impact how you think about and how you write CLI scripts.

A method that returns without error does not necessarily mean that the operation succeeded
Let's say we have a requirement to write a script that performs a couple of resource configuration updates, but we only want to perform the second update if the first one succeeds. We might be inclined to implement this as follows,

ConfigurationManager.updateResourceConfiguration(resourceId, firstConfig);
ConfigurationManager.updateResourceConfiguration(resourceId, secondConfig);

Provided we are logged in as a user having the necessary permissions to update the resource configuration and provided the agent is online and available, the first call to updateResourceConfiguration will return without error. We proceed to submit the second configuration change, but the first update might have actually failed. With the code as is, we could easily wind up violating the requirement of applying the second update only if the first succeeds. What we essentially need to do here is block until the first configuration update finishes so that we can verify that it did in fact succeed. This can be implemented by polling the ResourceConfigurationUpdate object that is returned from the call to updateResourceConfiguration.

ConfigurationManager.updateResourceConfiguration(resourceId, firstConfig);
var update = ConfigurationManager.getLatestResourceConfiguration(resourceId);
while (update.status == ConfigurationUpdateStatus.INPROGRESS) {
    java.lang.Thread.sleep(2000);  // sleep for 2 seconds
    update = ConfigurationManager.getLatestResourceConfiguration(resourceId);
}
if (update.status == ConfigurationUpdateStatus.SUCCESS) {
    ConfigurationManager.updateResourceConfiguration(resourceId, secondConfig);
}

The ResourceConfigurationUpdate object is our audit trail entry. The object's status will change once the resource component (running in the plugin container) finishes applying the update and the agent sends the response back to the server.

Resource proxies offer some polling support
Resource proxies greatly simplify working with a number of the RHQ APIs. Invoking resource operations is one of those enhanced areas. With a resource proxy, operations defined in the plugin descriptor appear as first-class methods on the proxy object. This allows us to invoke a resource operation in a much more concise and intuitive fashion. Here is a brief example.

var jbossServerId1 = 10001; // resource id of JBoss server 1 (placeholder - look this up first)
var jbossServerId2 = 10002; // resource id of JBoss server 2 (placeholder - look this up first)
server1 = ProxyFactory.getResource(jbossServerId1);
server2 = ProxyFactory.getResource(jbossServerId2);
server1.start();
server2.start();

The call to server1.start() does not immediately return. It polls the status of the operation, waiting for it to complete. The proxy sleeps for a short delay and then fetches the ResourceOperationHistory object that was logged for the request. If a history object is found and its status is something other than in progress, then the proxy returns the operation's results. If the history object indicates that the operation has not yet completed, the proxy will continue polling.

Resource proxies provide some great abstractions that simplify working in the CLI. The polling that is done behind the scenes for resource operations is yet another useful abstraction in that it makes a resource operation request look like a regular, synchronous method call. The polling, however, is somewhat limited. We will take a closer look at some of the implementation details to better understand how it all works.

The delay or sleep interval is fixed
The thread in which the proxy is running sleeps for one second before it polls the history object. There is currently no way to specify a different delay or sleep interval. In many cases the one second delay should be suitable, but there might be situations in which a shorter or longer delay is preferred.

The number of polling intervals is fixed
The proxy will poll the ResourceOperationHistory at most ten times. There is currently no way to specify a different number of intervals. If, after ten polls, the history still has a status of in progress, the proxy simply returns the incomplete results. Or if no history is available, null is returned. In many cases the polling delays and intervals may be sufficient for operations to complete, but there is no guarantee.

The proxy will not poll indefinitely
This is really an extension of the last point about not being able to specify the number of polling intervals. There may be times when you want to block indefinitely until the operation completes. Resource proxies currently do not offer this behavior.

Polling cannot be performed asynchronously
Let's say we want to start ten JBoss servers in succession. We want to know whether or not they start up successfully, but we are not concerned with the order in which they start. In this example some form of asynchronous polling would be appropriate. Let's further assume that each proxy winds up polling the maximum of ten intervals. Each call to server.start() will take a minimum of ten seconds plus whatever time it takes to retrieve and check the status of the ResourceOperationHistory. We can then conclude that it will take at least 100 seconds (ten servers at ten-plus seconds each) to invoke the start operation on all of the JBoss servers. This could turn out to be very inefficient. In all likelihood, it would be faster to schedule the start operation, have control return back to the script immediately, and then schedule each subsequent operation. Then the script could block until all of the operations have completed.
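
One way to work around these limitations today is to schedule the operations yourself and then poll the operation histories, rather than going through the proxies. The sketch below illustrates the idea for the ten-server scenario. The scheduleResourceOperation parameter list, the criteria class, and its filter methods are assumptions based on the remote API conventions; verify them against the API documentation for your RHQ version before relying on this.

// Placeholder resource ids for the JBoss servers we want to start.
var serverIds = [10001, 10002, 10003];

// Schedule the start operation on every server without blocking on any of them.
// Assumed parameters: (resourceId, operationName, delay, repeatInterval, repeatCount, timeout, params, description)
for (var i = 0; i < serverIds.length; ++i) {
    OperationManager.scheduleResourceOperation(serverIds[i], 'start', 0, 0, 0, 0, null, 'start JBoss server');
}

// Returns true if any operation history for the given resource is still in progress.
// Note the simplification: a real script should also account for the window before
// the history entry is created, for example by filtering on the schedule's job id.
function hasInProgressOperation(resourceId) {
    var criteria = new ResourceOperationHistoryCriteria();
    criteria.addFilterResourceIds(resourceId);
    criteria.addFilterStatus(OperationRequestStatus.INPROGRESS);
    return OperationManager.findResourceOperationHistoriesByCriteria(criteria).size() > 0;
}

// Block until all of the scheduled operations have completed.
for (var i = 0; i < serverIds.length; ++i) {
    while (hasInProgressOperation(serverIds[i])) {
        java.lang.Thread.sleep(1000);
    }
}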

As an aside, the previous example might better be solved by creating a resource group for the JBoss servers and then invoking the operation once on the entire group. The problems, however, still manifest themselves with resource groups. Suppose we want to call operation O1 on resource group G1, followed by a call to operation O2 on group G2, followed by O3 on G3, etc. We are essentially faced with the same problems, but now on a larger scale.

There is no uniform Audit Trail API
Scheduling a resource operation, submitting a resource configuration update, deploying content, etc. are generically speaking all operations that involve submitting a request to an agent (or multiple agents in the case of a group operation) for some mutative change to be applied to one or more resources.  In each of the different scenarios, an entry is persisted on the respective audit trails. For example, with a resource operation, a ResourceOperationHistory object is persisted. When deploying a new resource (i.e., a WAR file), a CreateResourceHistory object is persisted. With a resource configuration change, a ResourceConfigurationUpdate is persisted. Each of these objects exposes a status property that indicates whether the request is in progress, has succeeded, or has failed. Each of them also exposes an error message property that is populated if the request fails or an unexpected error occurs.

Unfortunately, there is no common base class shared among these audit trail classes in which the status and error message properties are defined. This makes writing a generic polling solution more challenging, at least if the solution is to be implemented in Java. A solution in a dynamic language, like JavaScript, might prove easier since we can rely on duck typing. We could implement a generic solution that works with a status property, without regard to an object's type.
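
As an interim, duck-typed approach, a small helper along the following lines can poll any audit trail entry generically. This is only a sketch: the helper's name and its interval/attempt parameters are invented here, and it assumes the entry's status property is an enum whose in-progress constant is named INPROGRESS, which is true for ConfigurationUpdateStatus and OperationRequestStatus but should be checked for the other audit trail types.

// Polls fetchEntry() until the returned entry is no longer in progress,
// or until maxAttempts polls have been made.
function waitForCompletion(fetchEntry, intervalMillis, maxAttempts) {
    var entry = fetchEntry();
    for (var attempts = 0;
         entry != null && entry.status.name() == 'INPROGRESS' && attempts < maxAttempts;
         ++attempts) {
        java.lang.Thread.sleep(intervalMillis);
        entry = fetchEntry();
    }
    return entry;
}

// Example usage with the configuration update scenario from earlier:
var update = waitForCompletion(
    function() { return ConfigurationManager.getLatestResourceConfiguration(resourceId); },
    2000, 30);
if (update != null && update.status == ConfigurationUpdateStatus.SUCCESS) {
    // safe to apply the next configuration update
}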

Conclusion
It is important to understand the work flows and communication patterns described here as well as the current limitations in resource proxies in order to write effective CLI scripts that have consistent behavior and predictable results. Consistent behavior and predictable results certainly do not mean that the same results are produced every time a script is run. It does mean though that given certain conditions, we can make valid assumptions that hold to be true. For example, if we execute a resource operation and then block until the ResourceOperationHistory status has changed to SUCCESS, then we can reasonably assume that the operation did in fact complete successfully.

Many of the work flows in RHQ are necessarily asynchronous, and this has to be taken into account when working with a remote client like the CLI. Fortunately, there are many ways we can look to encapsulate much of this, shielding developers from the underlying complexities while at the same time not limiting developers in how they choose to deal with these issues.