John Sanda's blog

<h3>Changing the Endpoint Address of an RHQ Storage Node (2014-07-20)</h3>
There is very limited support for changing the endpoint address of a storage node. In fact, the only way to do so is by undeploying and redeploying the node with the new address. And in some cases, like when there is only a single storage node, this is not even an option. <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1103841">BZ 1103841</a> was opened to address this, and the changes will go into RHQ 4.13.<br />
<br />
Changing the endpoint address of a Cassandra node is a routine maintenance operation. I am referring specifically to the address that Cassandra uses for gossip. This address is specified by the <span style="font-family: Courier New, Courier, monospace;">listen_address</span> property in cassandra.yaml. The key thing when changing the address is to ensure that the node's token assignments do not change. <a href="https://engineering.eventbrite.com/changing-the-ip-address-of-a-cassandra-node-with-auto_bootstrapfalse/">Rob Coli's post on changing a node's address</a> provides a nice summary of the configuration changes involved.<br />
<br />
With <a href="https://issues.apache.org/jira/browse/CASSANDRA-7356">CASSANDRA-7356</a>, however, things are even easier. Change the value of <span style="font-family: Courier New, Courier, monospace;">listen_address</span> and restart Cassandra with the following system properties defined in cassandra-env.sh:<br />
<br />
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">-Dcassandra.replace_address=<new_address></span></li>
<li><span style="font-family: Courier New, Courier, monospace;">-Dcassandra.replace_address_first_boot=true </span></li>
</ul>
<div>
The <span style="font-family: Courier New, Courier, monospace;">seeds</span> property in cassandra.yaml might need to be updated as well. Note that there is no need to worry about the <span style="font-family: Courier New, Courier, monospace;">auto_bootstrap</span>, <span style="font-family: Courier New, Courier, monospace;">initial_token</span>, or <span style="font-family: Courier New, Courier, monospace;">num_tokens</span> properties.</div>
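To make the procedure concrete, here is a sketch of the configuration edits. This is a hedged example, not the literal RHQ patch or Cassandra documentation; the address 192.0.2.42 is a placeholder, and your cassandra-env.sh may assemble JVM_OPTS differently.

```shell
# cassandra.yaml -- point listen_address at the node's new address:
#   listen_address: 192.0.2.42        (placeholder address)

# cassandra-env.sh -- append the replace-address flags to JVM_OPTS:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.0.2.42"
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=true"

# Restart Cassandra. Token assignments are preserved, and auto_bootstrap,
# initial_token, and num_tokens can be left alone.
```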
<div>
<br /></div>
<div>
For the RHQ Storage Node, these system properties will be set in cassandra-jvm.properties. Users will be able to update a node's address either through the storage node admin UI or through the RHQ CLI. One interesting thing to note is that the RHQ Storage Node resource type uses the node's endpoint address as its resource key. This is not good. When the address changes, the agent will think it has discovered a new Storage Node resource. To prevent this we can add <a href="https://docs.jboss.org/author/display/RHQ/Design+-+Resource+Upgrade">resource upgrade</a> support in the rhq-storage plugin, and change the resource key to use the node's host ID, which is a UUID that does not change. The host ID is exposed through the <span style="font-family: Courier New, Courier, monospace;">StorageServiceMBean.getLocalHostId</span> JMX attribute.</div>
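As an illustration, the host ID can be read with the standard JMX client API. This is a hedged sketch: the MBean object name and the LocalHostId attribute come from Cassandra's StorageServiceMBean, while the class name, method names, and the port (7199, Cassandra's default JMX port) are assumptions on my part, not RHQ code.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch of reading a Cassandra node's host ID (a stable UUID) over JMX.
class HostIdReader {

    // MBean registered by Cassandra's StorageService.
    static ObjectName storageServiceName() {
        try {
            return new ObjectName("org.apache.cassandra.db:type=StorageService");
        } catch (javax.management.MalformedObjectNameException e) {
            throw new IllegalStateException(e);
        }
    }

    // Connects to the node's JMX port (7199 by default) and reads the
    // LocalHostId attribute backing StorageServiceMBean.getLocalHostId().
    static String fetchLocalHostId(String host, int jmxPort) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + jmxPort + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            return (String) connection.getAttribute(storageServiceName(), "LocalHostId");
        }
    }
}
```

A resource key derived from this UUID would survive an endpoint address change, unlike one derived from the address itself.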
<div>
<br /></div>
<div>
If you are interested in learning more about the work involved in adding support for changing a storage node's endpoint address, check out the <a href="https://docs.jboss.org/author/display/RHQ/Changing+Storage+Node+Address">wiki design doc</a> that I will be updating over the next several days.</div>
<h3>Upgrading to RHQ 4.9 (2013-09-09)</h3>
RHQ 4.8 introduced the new Cassandra backend for metrics. There has been a tremendous amount of work since then focused on the management of the new RHQ Storage Node. We do not want to impose on users the burden of managing a second database. One of our key goals is to provide robust management such that Cassandra is nothing more than an implementation detail for users.<br />
<br />
The version of Cassandra shipped in RHQ 4.8 includes some native libraries. One of the main uses for those native libraries is compression. If the platform on which Cassandra is running has support for the native libraries, table compression will be enabled. Data files written to disk will be compressed.<br />
<br />
All of the native libraries have been removed from the version of Cassandra shipped in RHQ 4.9. The reason for this change is to ensure RHQ continues to provide solid cross-platform support. The development and testing teams simply do not have the bandwidth right now to maintain native libraries for all of the supported platforms in RHQ and JON.<br />
<br />
<b>The following information applies only to RHQ 4.8 installs.</b><br />
<br />
Since RHQ 4.9 does not ship with those native compression libraries, Cassandra will not be able to decompress the data files on disk.<br />
<br />
Compression has to be disabled in your RHQ 4.8 installation before upgrading to 4.9. There is a patch which you will need to run prior to upgrading. Download <a href="http://sourceforge.net/projects/rhq/files/rhq/rhq-4.9.0/rhq48-storage-patch.zip/download">rhq48-storage-patch.zip</a> and follow the instructions provided in rhq48-storage-patch.sh|bat.<br />
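Conceptually, the patch disables compression on the storage schema with CQL statements along these lines. This is an illustrative sketch only; the keyspace and table names are assumptions on my part, and the patch script itself is the authoritative procedure:

```sql
-- An empty sstable_compression value disables compression (Cassandra 1.2-era CQL3).
ALTER TABLE rhq.raw_metrics
  WITH compression = {'sstable_compression': ''};
```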
<br />
I do want to mention that we will likely re-enable compression using a pure Java compression library in a future RHQ release.

<h3>Configuring Cassandra JDBC with JBoss AS 7 (2012-10-22)</h3>
<h3>
1. Build the driver source</h3>
<div>
cassandra-jdbc is not yet available in the public Maven repos so it has to be built from source. There is already an <a href="http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/issues/detail?id=34">open ticket</a> requesting that the artifacts get published to a public Maven repo. If you do not already have a copy of the source, clone the git repo,</div>
<div>
<br />
<div style="text-align: left;">
<span style="font-family: Courier New, Courier, monospace;">git clone https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/
</span></div>
<br />
Build the project,<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">mvn install -DskipTests</span><br />
<br />
cassandra-jdbc uses Maven; this command generates artifacts and installs them into the local Maven repository. I set the <span style="font-family: Courier New, Courier, monospace;">skipTests</span> property to tell Maven not to execute any tests, as there are some failures in my local repo that I have yet to investigate.<br />
<br />
<h3>
2. Determine runtime dependencies</h3>
<div>
We need to determine the runtime dependencies for the driver so that we know what libraries need to be installed in the next step. The Maven <a href="http://maven.apache.org/plugins/maven-dependency-plugin/">dependency plugin</a> makes this easy. Run the following,</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">mvn dependency:copy-dependencies -DincludeScope="runtime"</span></div>
<div>
<br /></div>
<div>
This copies the runtime dependencies into <CASSANDRA_JDBC_HOME>/target/dependency. You should find the following:</div>
<div>
<ul>
<li>cassandra-clientutil-1.2.0-beta1.jar</li>
<li>cassandra-thrift-1.2.0-beta1.jar</li>
<li>commons-lang-2.4.jar</li>
<li>guava-12.0.jar</li>
<li>jsr305-1.3.9.jar</li>
<li>libthrift-0.7.0.jar</li>
<li>slf4j-api-1.6.1.jar</li>
</ul>
</div>
<h3>
</h3>
<h3>
3. Create the JBoss module</h3>
</div>
<div>
There are several ways to install the driver in JBoss AS 7. I am going to describe creating a module since that is what I have done. As it turns out, JBoss AS 7 already comes with modules for some of the dependencies, so we will not include every dependency from the above list in the module we create. First we need to create the module directory tree.</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">mkdir -p <JBOSS_HOME>/modules/org/apache-extras/cassandra-jdbc/main</span></div>
<div>
<br /></div>
<div>
Then copy the cassandra-jdbc build artifact.</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">cp <CASSANDRA_JDBC_HOME>/target/cassandra-jdbc-1.2.0-SNAPSHOT.jar</span> <span style="font-family: 'Courier New', Courier, monospace;"><JBOSS_HOME>/modules/org/apache-extras/cassandra-jdbc/main</span> </div>
<div>
<br /></div>
<div>
Now copy the following dependencies to the module directory, right alongside the driver JAR file.</div>
<div>
<ul>
<li>cassandra-clientutil-1.2.0-beta1.jar</li>
<li>cassandra-thrift-1.2.0-beta1.jar</li>
<li>guava-12.0.jar</li>
<li>jsr305-1.3.9.jar</li>
</ul>
Do not copy slf4j-api or commons-lang because they are already installed as modules. There is a module for guava as well, but it is an earlier version; consequently, we include guava in our module so that we get the correct version. Now create a module.xml to define the module resources and dependencies. This file goes in the same directory as the JAR file dependencies, and it should look like this:</div>
<div>
<br />
<script src="https://gist.github.com/3936212.js?file=module.xml"></script>
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br /></div>
<div>
<br /></div>
<div>
That's it for creating the module. All that is left to do is define the data source and driver.<br />
<br />
<h3>
4. Declare and configure driver and data source</h3>
</div>
<div>
Open <JBOSS_HOME>/standalone/configuration/standalone.xml and scroll down until you reach the datasources subsystem, which starts with the tag <subsystem xmlns="urn:jboss:domain:datasources:1.0">. We need to declare a data source as well as a driver. Here is what my additions look like:</div>
<div>
<br />
<script src="https://gist.github.com/3936261.js?file=standalone.xml"></script></div>
<div>
This is a minimal configuration for illustration purposes. Changes were recently pushed to the cassandra-jdbc repo to provide more support for connection pooling. For the data source class we could also use <span style="font-family: Courier New, Courier, monospace;">org.apache.cassandra.cql.jdbc.PooledCassandraDataSource</span>. Notice in the connection URL that there is a version parameter to tell the server we are using CQL v3. Take a look at the <a href="http://docs.jboss.org/ironjacamar/userguide/1.0/en-US/html/deployment.html#deployingds_descriptor">user guide</a> for a full listing of available configuration options for data sources and drivers.<br />
<br />
<h3>
5. Start JBoss AS and verify data source configuration</h3>
</div>
<div>
In a terminal go to <JBOSS_HOME>/bin and then run <span style="font-family: Courier New, Courier, monospace;">./standalone.sh</span>. Now log into the web console at http://localhost:9990. Click on the Profile link in the upper right corner. Then click on Datasources in the menu on the left and you should see something like this,<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoFJsszRHiDrikG-xEm7EndTzbR5OrLVxq8_4StM2d0PFe7wU4_uFfYD-JKIegNi6gyw4Kw0YCFDbmh-SQpQgk5MWrcWPGS9vZqdJpc_4R-azc2xxhhxu1tbA5iyHq0z0kZabSwNzyBXqH/s1600/datasources.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoFJsszRHiDrikG-xEm7EndTzbR5OrLVxq8_4StM2d0PFe7wU4_uFfYD-JKIegNi6gyw4Kw0YCFDbmh-SQpQgk5MWrcWPGS9vZqdJpc_4R-azc2xxhhxu1tbA5iyHq0z0kZabSwNzyBXqH/s1600/datasources.png" width="640" /></a></div>
<br />
<br />
Finally select the Connection tab and click the Test Connection button as shown below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiqamWTFTjnfsJcit-dTkHN4UjTMxXAOkwB5NL429p5dNkrDW8zLRJ1TOmMcT5u55u_4hzILL3bT2NuKBwEQvxBqsFmf-lVVSBiC2X3UWUkYbDRCqBLoDAYmd47nuK5sO08KV6RObOwbuF/s1600/test_connection.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="402" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiqamWTFTjnfsJcit-dTkHN4UjTMxXAOkwB5NL429p5dNkrDW8zLRJ1TOmMcT5u55u_4hzILL3bT2NuKBwEQvxBqsFmf-lVVSBiC2X3UWUkYbDRCqBLoDAYmd47nuK5sO08KV6RObOwbuF/s1600/test_connection.png" width="640" /></a></div>
<br />
<br />
We are all set now to use the data source in application code. We can use resource injection just as we would with any other data source.<br />
<br />
<script src="https://gist.github.com/3936385.js?file=MyEJBThatUsesCassandraDS.java"></script></div>
<h3>Why I am ready to move to CQL for Cassandra application development (2012-10-17)</h3>
Earlier this year, I started learning about Cassandra as it seemed like it might be a good fit as a replacement data store for metrics and other time series data in RHQ. I developed a prototype for RHQ. I used the client library <a href="http://hector-client.github.com/hector/build/html/index.html">Hector</a> for accessing Cassandra from within RHQ. I defined my schema using a <a href="http://www.datastax.com/docs/1.1/dml/using_cli">Cassandra CLI</a> script. I recall when I first read about <a href="http://www.datastax.com/docs/1.1/dml/using_cql">CQL</a>. I spent some time deliberating over whether to define the schema using a CLI script or a CQL script. I was intrigued but ultimately decided against using CQL; as the CLI and the <a href="http://wiki.apache.org/cassandra/ThriftInterface">Thrift interface</a> were more mature, they seemed like the safer bet. While I decided not to invest any time in CQL, I did make a mental note to revisit it at a later point since there was clearly a big emphasis within the Cassandra community on improving CQL. That later point is now, and I have decided to start making extensive use of CQL.<br />
<br />
After a <a href="https://docs.jboss.org/author/display/RHQ/Databases">thorough comparative analysis</a>, the RHQ team decided to move forward with using Cassandra for metric data storage. We are making heavy use of dynamic column families and wide rows. Consider for example the raw_metrics column family in figure 1,<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoDBUrhiHDZr1oZNcsNBeu4SF95CUVNfwE8fYIPBcVCuWN5-mb3Ti00TG-NnFTcmfeGsaklIth_z0A1cDraOcwzWYzxCNP1NuNRYaH53T-HZOou1GhIydVgy8f0caEOczL7hM1oaPls5tu/s1600/Raw+Metrics+Column+Family.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="231" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoDBUrhiHDZr1oZNcsNBeu4SF95CUVNfwE8fYIPBcVCuWN5-mb3Ti00TG-NnFTcmfeGsaklIth_z0A1cDraOcwzWYzxCNP1NuNRYaH53T-HZOou1GhIydVgy8f0caEOczL7hM1oaPls5tu/s400/Raw+Metrics+Column+Family.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1. raw_metrics column family</td></tr>
</tbody></table>
<br />
The metrics schedule id is the row key. Each data point is stored in a separate column where the metric timestamp is the column name and the metric value is the column value. This design supports fast writes as well as fast reads and works particularly well for the various date range queries in RHQ. This is considered a dynamic column family because the number of columns per row will vary and because column names are not defined up front. I was quick to rule out using CQL due to a couple misconceptions about CQL's support for dynamic column families and wide rows. First, I did not think it was possible to define a dynamic table with wide rows using CQL. Secondly, I did not think it was possible to execute range queries on wide rows.<br />
<br />
A couple weeks ago I came across this thread on the cassandra-users mailing list which points out that you can in fact create dynamic tables/column families with wide rows. And conveniently after coming across this thread, I happened to stumble on the same information in the docs. Specifically the <a href="http://www.datastax.com/docs/1.1/references/cql/CREATE_COLUMNFAMILY#cql-create-columnfamily">DataStax docs</a> state that wide rows are supported using composite column names. The primary key can have multiple components, but there must be at least one column that is not part of the primary key. Using CQL I would then define the raw_metrics column family as follows,<br />
<br />
<script src="https://gist.github.com/3902258.js?file=gistfile1.sql"></script>
This <span style="font-family: Courier New, Courier, monospace;">CREATE TABLE</span> statement is straightforward, and it does allow for wide rows with dynamic columns. The underlying column family representation of the data is slightly different from the one in figure 1 though.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEANZIApKEdpT6Tgkh0jrft7EuiyFmhyphenhyphen-5MZoruGhtr-ucp6pAaiGSvFdmkvxnT-EkfTHsCfIxEPR065ccf8O0xEn3EhwDEm6RhGp7TOENrbh4HHJBbD8YQfhKYuGKB4QibiloPiApdCXM/s1600/CQL+Raw+Metrics+Column+Family.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEANZIApKEdpT6Tgkh0jrft7EuiyFmhyphenhyphen-5MZoruGhtr-ucp6pAaiGSvFdmkvxnT-EkfTHsCfIxEPR065ccf8O0xEn3EhwDEm6RhGp7TOENrbh4HHJBbD8YQfhKYuGKB4QibiloPiApdCXM/s400/CQL+Raw+Metrics+Column+Family.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2. CQL version of raw_metrics column family</td></tr>
</tbody></table>
Each column name is now a composite that consists of the metric timestamp along with the string literal, <span style="font-family: Courier New, Courier, monospace;">value</span>. There is additional overhead on reads and writes, as the column comparator now has to compare the string in addition to the timestamp. Although I have yet to do any of my own benchmarking, I am not overly concerned by the additional string comparison. I was, however, concerned about the additional overhead in terms of disk space. I have done some preliminary analysis and concluded that the difference relative to storing just the timestamp in the column name is negligible due to compression of SSTables, which is enabled by default.<br />
<br />
My second misconception about executing range queries is really predicated on the first misconception. It is true that you can only query named columns in CQL; consequently, it is not possible to perform a date range query against the column family in figure 1. It is possible though to execute a date range query against the column family in figure 2.<br />
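For instance, a date range query against the figure 2 table might look like the following. This is a hedged sketch; the column names are assumed to mirror the schema described above (schedule_id, time, value), and the id and dates are made up:

```sql
SELECT time, value FROM raw_metrics
WHERE schedule_id = 123
  AND time >= '2012-10-01 00:00:00'
  AND time < '2012-10-08 00:00:00';
```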
<br />
RHQ supports multiple upgrade paths. This means that in order to upgrade to the latest release (which happens to be 4.5.0 at the time of this writing), I do not have to first upgrade to the previous release (which would be 4.4.0). I can upgrade from 4.2.0 for instance. Supporting multiple upgrade paths requires a tool for managing schema changes. There are plenty of such tools for relational databases, but I am not aware of any for Cassandra. But because we can leverage CQL and because there is a <a href="http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/">JDBC driver</a>, we can look at using an existing tool instead of writing something from scratch. I have done just that and am working on adding support for Cassandra to <a href="http://www.liquibase.org/">Liquibase</a>. I will have more on that in a future post. Using CQL allows us to reuse existing solutions, which in turn is going to save a lot of development and testing effort.<br />
<br />
The most compelling reason to use CQL is the familiar, easy to use syntax. I have been nothing short of pleased with Hector. It is well designed, the online documentation is solid, and the community is great. Whenever I post a question on the mailing list, I get responses very quickly. With all that said, contrast the following two equivalent queries against the raw_metrics column family.<br />
<br />
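As a rough illustration of that contrast, the Hector version builds the query up through serializers and setters, while the CQL version reads like SQL. This is a hedged sketch with illustrative identifiers, not RHQ's actual code:

```
// Hector (Thrift) slice query over raw_metrics:
SliceQuery<Integer, Long, Double> query = HFactory.createSliceQuery(
        keyspace, IntegerSerializer.get(), LongSerializer.get(), DoubleSerializer.get());
query.setColumnFamily("raw_metrics");
query.setKey(scheduleId);
query.setRange(startTime, endTime, false, Integer.MAX_VALUE);
ColumnSlice<Long, Double> slice = query.execute().get();

-- Equivalent CQL:
SELECT time, value FROM raw_metrics
WHERE schedule_id = ? AND time >= ? AND time < ?;
```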
RHQ developers can look at the CQL version and immediately understand it. Using CQL will result in less code that is easier to maintain. We can also leverage ad hoc queries with <span style="font-family: Courier New, Courier, monospace;">cqlsh</span> during development and testing. The JDBC driver also lends itself nicely to applications that run in an application server, as RHQ does.<br />
<br />
Things are still evolving both with CQL and with the JDBC driver. <a href="http://www.datastax.com/dev/blog/cql3_collections">Collections support</a> is coming in Cassandra 1.2. The JDBC driver does not yet support batch statements; this is due to the lack of support for it on the server side. The functionality is there in the Cassandra trunk/master branch, and I expect to see it in the 1.2 release. The driver also currently lacks support for connection pooling. These and other critical features will surely make their way into the driver. With the enhancements and improvements to CQL and to the JDBC driver, adding Cassandra support to <a href="http://www.hibernate.org/subprojects/ogm.html">Hibernate OGM</a> becomes that much more feasible.<br />
<br />
The flexibility, tooling, and ease of use make CQL a very attractive option for working with Cassandra. I doubt the Thrift API is going away any time soon, and we will continue to leverage the Thrift API through Hector in various places in RHQ. But I am ready to make CQL a first-class citizen in RHQ and look forward to watching it continue to mature into a great technology.

<h3>Setting up a local Cassandra cluster using RHQ (2012-07-02)</h3>
As part of my ongoing research into using Cassandra with RHQ, I did some work to automate setting up a Cassandra cluster (for RHQ) on a single machine for development and testing. I put together a short demo showing what is involved. Check it out at <a href="http://bit.ly/N3jbT8">http://bit.ly/N3jbT8</a>.

<h3>Aggregating Metric Data with Cassandra (2012-06-16)</h3>
<h3>
<b>Introduction</b></h3>
I successfully performed metric data aggregation in RHQ using a Cassandra back end for the first time recently. Data roll up, or aggregation, is done by the data purge job, which is a Quartz job that runs hourly. This job is also responsible for purging old metric data as well as data from other parts of the system. The data purge job invokes a number of different stateless session EJBs (SLSBs) that do all the heavy lifting. While there is still a lot of work that lies ahead, this is a big first step forward that is ripe for discussion.<br />
<br />
<h3>
<b>Integration</b></h3>
JPA and EJB are the predominant technologies used to implement and manage persistence and business logic. Those technologies, however, are not really applicable to Cassandra. JPA is for relational databases, and one of the central features of EJB is declarative, container-managed transactions. Cassandra is neither a relational nor a transactional data store. For the prototype, I am using <a href="http://rhq-project.org/display/RHQ/Server+Plugin+Development">server plugins</a> to integrate Cassandra with RHQ.<br />
<br />
Server plugins are used in a number of areas in RHQ already. Pluggable alert notification senders is one of the best examples. A key feature of server plugins is the encapsulation made possible by the class loader isolation that is also present with agent plugins. So let's say that <a href="http://hector-client.github.com/">Hector</a>, the Cassandra client library, requires a different version of a library that is already used by RHQ. I can safely use the version required by Hector in my plugin without compromising the RHQ server. In addition to the encapsulation, I can dynamically reload my plugin without having to restart the whole server. This can help speed up iterative development.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgc3MYXxfw17qlBW4wO2AWD2DjEDy6vvEWc-mas0Y7dHytBXvuD7ZqL3ug-Zx9KJce_D0aZ5IWFj1XiZUN2axyRQ-xV4_3YD7Vn8GfBgXuyTjp71gAD36DZ6Is-7AZ_UxLXRO8_Cr7f0Xzb/s1600/cassandra_server_plugin_config.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgc3MYXxfw17qlBW4wO2AWD2DjEDy6vvEWc-mas0Y7dHytBXvuD7ZqL3ug-Zx9KJce_D0aZ5IWFj1XiZUN2axyRQ-xV4_3YD7Vn8GfBgXuyTjp71gAD36DZ6Is-7AZ_UxLXRO8_Cr7f0Xzb/s1600/cassandra_server_plugin_config.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Cassandra Server Plugin Configuration</td></tr>
</tbody></table>
You can define a configuration in the plugin descriptor of a server plugin. The above screenshot shows the configuration of the Cassandra plugin. The nice thing about this is that it provides a consistent, familiar interface in the form of the configuration editor that is used extensively throughout RHQ. There is one more screenshot that I want to share.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitwLlP2T_DmatSe2fHftnWldMlTljAxkzZnROMpGzWtSruYfVvVMa9g3rtxqzUmPxJAnXrqwi83SaWUUfnXGz7dKu81TC7E2uNNDC7Ye0BeAAr4hV6GbEuimKDl4oPlR4ePvax-sVvWy_b/s1600/system_settings.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitwLlP2T_DmatSe2fHftnWldMlTljAxkzZnROMpGzWtSruYfVvVMa9g3rtxqzUmPxJAnXrqwi83SaWUUfnXGz7dKu81TC7E2uNNDC7Ye0BeAAr4hV6GbEuimKDl4oPlR4ePvax-sVvWy_b/s1600/system_settings.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">System Settings</td></tr>
</tbody></table>
This is a screenshot of the system settings view. It provides details about the RHQ server itself, like the database used, the RHQ version, and build number. There are several configurable settings, like the retention period for alerts and drift files and settings for integrating with an LDAP server for authentication. At the bottom there is a property named Active Metrics Server Plugin. There are currently two values from which to choose. The first is the default, which uses the existing RHQ database. The second is for the new Cassandra back end. The server plugin approach affords us a pluggable persistence solution that can be really useful for prototyping among other things. Pluggable persistence with server plugins is a really interesting topic in and of itself. I will have more to say on that in a future post.<br />
<br />
<h3>
<b>Implementation</b></h3>
The Cassandra implementation thus far uses the same buckets and time slices as the existing implementation. The buckets and retention periods are as follows:<br />
<br />
<table align="center" cellpadding="5" cellspacing="9" style="background-color: header.background.color; border: 0px solid;">
<tbody>
<tr>
<th style="text-align: center;">Metrics Data Bucket</th>
<th style="text-align: center;">Data Retention Period</th>
</tr>
<tr>
<td>raw data</td>
<td>7 days</td>
</tr>
<tr>
<td>one hour data</td>
<td>2 weeks</td>
</tr>
<tr>
<td>6 hour data</td>
<td>1 month</td>
</tr>
<tr>
<td>1 day data</td>
<td>1 year</td>
</tr>
</tbody>
</table>
<br />
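Because purging is delegated to Cassandra via column TTLs, the retention periods in the table translate directly into TTL values in seconds. Here is a minimal sketch of that mapping; the constant names are mine, not RHQ's:

```java
import java.util.concurrent.TimeUnit;

// Retention periods from the table above, expressed as Cassandra column TTLs
// in seconds. Constant names are illustrative, not RHQ's actual code.
final class RetentionTtl {
    static final int RAW      = (int) TimeUnit.DAYS.toSeconds(7);    // 7 days
    static final int ONE_HOUR = (int) TimeUnit.DAYS.toSeconds(14);   // 2 weeks
    static final int SIX_HOUR = (int) TimeUnit.DAYS.toSeconds(31);   // ~1 month
    static final int ONE_DAY  = (int) TimeUnit.DAYS.toSeconds(365);  // 1 year
}
```

Each write would then carry its bucket's TTL (for example via a CQL USING TTL clause or Hector's column TTL), and Cassandra expires the column on its own.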
Unlike the existing implementation, purging old data is accomplished simply by setting the TTL (time to live) on each column. Cassandra takes care of purging expired columns. The schema is pretty straightforward. Here is the column family definition for raw data specified as a CLI script:<br />
<br />
<script src="https://gist.github.com/2938795.js?file=raw_metrics_cf.txt">
</script>
<br />
The row key is the metric schedule id. The column names are timestamps and column values are doubles. And here is the column family definition for one hour data:
<br />
<br />
<script src="https://gist.github.com/2938802.js?file=one_hour_data_cf.txt">
</script>
<br />
As with the raw data, the schedule id is the row key. Unlike the raw data however, we use composite columns here. All the buckets, with the exception of the raw data, store computed aggregates. RHQ calculates and stores the min, max, and average for each (numeric) metric schedule. The column name consists of a timestamp and an integer. The integer identifies whether the value is the max, min, or average. Here is some sample (Cassandra) CLI output for one hour data:
<br />
<br />
<script src="https://gist.github.com/2939435.js?file=one_hour_data.txt">
</script>
<br />
Each row in the output reads like a tuple. The first entry is the column name, with a colon delimiter; the timestamp is listed first, followed by the integer code that identifies the aggregate type. Next is the column value, which is the value of the aggregate calculation. Then we have a timestamp; every column in Cassandra has a timestamp, and it is used for conflict resolution on writes. Lastly, we have the TTL. The schema for the remaining buckets is similar to the one_hour_metric_data column family, so I will not list them here.<br />
<br />
The last implementation detail I want to discuss is querying. When the data purge job runs, it has to determine what data is ready to be aggregated. With the existing implementation that uses the RHQ database, queries are fast and efficient using indexes. The following column family definition serves as an index to make queries fast for the Cassandra implementation as well:<br />
<br />
<script src="https://gist.github.com/2939498.js?file=metric_aggregates_index.txt">
</script>
<br />
The row key is the metric data column family name, e.g., one_hour_metric_data. The column name is a composite that consists of a timestamp and a schedule id. Currently the column value is an integer that is always set to zero because only the column name is needed. At some point I will likely refactor the data type of the column value to something that occupies less space. Here is a brief explanation of how the index is used. Let's start with writes. Whenever data for a schedule is written into one bucket, we update the index for the next bucket. For example, suppose data for schedule id 123 is written into the raw_metrics column family at 09:15. We will write into the "one_hour_metric_data" row of the index with a column name of 09:00:123. The timestamp in which the write occurred is rounded down to the start of the time slice of the next bucket. Further suppose that additional data for schedule 123 is written into the raw_metrics column family at times 09:20, 09:25, and 09:30. Because each of those timestamps gets rounded down to 09:00 when writing to the index, we do not wind up with any additional columns for that schedule id. This means that the index will contain at most one column per schedule for a given time slice in each row.<br />
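The rounding that drives the index writes can be sketched as a small pure function; the class and method names are illustrative, not RHQ's actual code:

```java
import java.util.concurrent.TimeUnit;

// Rounds a write's timestamp down to the start of the one hour bucket's time
// slice, producing the timestamp component of the index column name (e.g.
// writes at 09:15, 09:20, and 09:30 all map to the 09:00 slice).
final class IndexTimeSlice {
    private static final long HOUR = TimeUnit.HOURS.toMillis(1);

    static long oneHourSlice(long timestampMillis) {
        return (timestampMillis / HOUR) * HOUR;  // floor to the hour
    }
}
```

Because every raw write within the same hour floors to the same value, at most one index column per schedule is produced for a given time slice, exactly as described above.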
<br />
Reads occur to determine what data, if any, needs to be aggregated. Each row in the index is queried. After a column is read and the data for the corresponding schedule is aggregated into the next bucket, that column is deleted. This index is a lot like a job queue. Reads in the existing implementation that uses a relational database should be fast; however, there is still work to be done to determine what data, if any, needs to be aggregated when the data purge job runs. With the Cassandra implementation, the presence of a column in a row of the metric_aggregates_index column family indicates that data for the corresponding schedule needs to be aggregated.<br />
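The job-queue behavior can be sketched with an ordinary sorted map standing in for one index row. The `timeSlice:scheduleId` key format below is only an illustration, not the actual composite encoding:

```java
import java.util.Map;
import java.util.TreeMap;

public class IndexQueue {
    public static void main(String[] args) {
        // One row of the index: each column name identifies a (time slice,
        // schedule id) pair whose data is waiting to be aggregated.
        TreeMap<String, Integer> row = new TreeMap<>();
        row.put("09:00:123", 0); // column values are unused placeholders
        row.put("09:00:456", 0);

        // The purge job reads each column, aggregates that schedule's data
        // into the next bucket, then deletes the column - so the row drains
        // like a job queue.
        while (!row.isEmpty()) {
            Map.Entry<String, Integer> column = row.pollFirstEntry();
            System.out.println("aggregating " + column.getKey());
        }
        System.out.println(row.isEmpty()); // true - nothing left to aggregate
    }
}
```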
<br />
<h3>
Testing</h3>
I have pretty good unit test coverage, but I have only done some preliminary integration testing. So far it has been limited to manual testing. This includes inspecting values in the database via the CLI or with CQL and setting breakpoints to inspect values at runtime. As I look to automate the integration testing, I have been giving some thought to how metric data is pushed to the server. Relying on the agent to push data to the server is suboptimal for a couple of reasons. First, the agent sends measurement reports to the server once a minute. I need better control over how frequently and when data is pushed to the server.<br />
<br />
The other issue with using the agent is that it gets difficult to simulate older metric data that has been reported over a specified duration, be it an hour, a day, or a week. Simulating older data is needed for testing that data is aggregated into 6 hour and 24 hour buckets and that data is purged at appropriate times.<br />
<br />
RHQ's REST interface is a better fit for the integration testing I want to do. It already provides the ability to push metric data to the server. I may wind up extending the API, even if just for testing, to allow for kicking off the aggregation that runs during the data purge job. I can then use the REST API to query the server and verify that it returns the expected values.<br />
<br />
<h3>
Next Steps</h3>
There is still plenty of work ahead. I have to investigate what consistency levels are most appropriate for different operations. There is still a large portion of the metrics APIs that needs to be implemented, some of the more important ones being the query operations used to render metrics graphs and tables. The data purge job is not the best approach going forward for doing the aggregation. Only a single instance of the job runs each hour, and it does not exploit any of the opportunities that exist for parallelism. Lastly, and maybe most importantly, I have yet to start thinking about how to effectively manage the Cassandra cluster with RHQ. As I delve into these other areas I will continue sharing my thoughts and experiences.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com0tag:blogger.com,1999:blog-1563502382858123691.post-44864942776649319532012-06-11T22:48:00.000-04:002012-06-12T21:38:48.086-04:00Modeling Metric Data in CassandraRHQ supports three types of metric data - numeric, traits, and call time. Numeric metrics include things like the amount of free memory on a system or the number of transactions per minute. Traits are strings that track information about a resource and typically change in value much less frequently than numeric metrics. Some examples of traits include server start time and server version. Call time metrics capture the execution time of requests against a resource. An example of call time metrics is EJB method execution time.<br />
<br />
I have read several times that with Cassandra it is best to let your queries dictate your schema design. I recently spent some time thinking about RHQ's data model for metrics and how it might look in Cassandra. I decided to focus only on traits for the time being, but much of what I discuss applies to the other metrics types as well.<br />
<br />
I will provide a little background on the existing data model to make it easier to understand some of the things I touch on. All metric data in RHQ belongs to resources. A particular resource might support metrics like those in the examples above, or it might support something entirely different. A resource has a type, and the resource type defines which types of metrics it supports. We refer to these as measurement definitions. These measurement definitions, along with other metadata associated with the resource type, are defined in the plugin descriptor of the plugin that is responsible for managing the resource. You can think of a resource type as an abstraction and a resource as a realization of that abstraction. Similarly, a measurement definition is an abstraction, and a measurement schedule is a realization of a measurement definition. A resource can have multiple measurement schedules, and each schedule is associated with a measurement definition. The schedule has a number of attributes like the collection interval, an enabled flag, and the value. When the agent reports metric data to the RHQ server, the data is associated with a particular schedule. To tie it all together, here is a snippet of some of the relevant parts of the measurement classes:<br />
<script src="https://gist.github.com/2921241.js?file=gistfile1.java">
</script>
<br />
To review, for a given measurement schedule, we can potentially add an increasing number of rows in the RHQ_MEASUREMENT_DATA_TRAIT table over time. There are a lot of fields included in the snippet for MeasurementDefinition. I chose to include most of them because they are pertinent to the discussion.<br />
<br />
For the Cassandra integration, I am interested primarily in the MeasurementDataTrait class. All of the other types are managed by the RHQ database. Initially when I started thinking about what column families I would need, I felt overcome with writer's block. Then I reminded myself to think about trait queries and try to let those guide my design. I decided to focus on some resource-level queries and leave others like group-level queries for a later exercise. Here is a screenshot of one of the resource-level views where the queries are used:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUR0f52tg3UZX8MYCZUi2zcs6MDjPOwrAaGHMdJXEwi2PtEhvNPdAyzzzV6eRGQGnjS7UaDtMGzCGJTVpmSej_pXcTx3-9gX5AZoYljJqEzKC08ZH0k-tUi6rYi5NSocJK_6jiEoCKzJ12/s1600/traits.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUR0f52tg3UZX8MYCZUi2zcs6MDjPOwrAaGHMdJXEwi2PtEhvNPdAyzzzV6eRGQGnjS7UaDtMGzCGJTVpmSej_pXcTx3-9gX5AZoYljJqEzKC08ZH0k-tUi6rYi5NSocJK_6jiEoCKzJ12/s640/traits.png" width="640" /></a></div>
<br />
Let me talk a little about this view. There are a few things to point out in order to understand the approach I took with the Cassandra schema. First, this is a list view of all the resource's traits. Secondly, the view shows only the latest value for each trait. Finally, the fields required by this query span multiple tables and include resource id, schedule id, definition id, display name, value, and timestamp. Because the fields span multiple tables, one or more joins are required for this query. There are two things I want to accomplish with the column family design in Cassandra: I want to fetch all of the traits for a resource, and I want to do it in a single read. Cassandra of course does not support joins, so some denormalization is needed to meet my requirements. I have two column families for storing trait data. Here is the first one, which supports the above list view, as a Cassandra CLI script:<br />
<pre class="brush: plain">create column family resource_traits
with comparator = 'CompositeType(DateType, Int32Type, Int32Type, BooleanType, UTF8Type, UTF8Type)' and
default_validation_class = UTF8Type and
key_validation_class = Int32Type;
</pre>
The row key is the resource id. The column name is a composite that consists of the time stamp, schedule id, definition id, enabled flag, display type, and display name. The column value is a string holding the latest known value of the trait. This design allows the latest values of all traits to be fetched in a single read. It also gives me the flexibility to perform additional filtering. For example, I can query for all traits that are enabled or disabled, or for all traits whose values last changed after a certain date/time. Before I talk about the ramifications of the denormalization, I want to introduce the other column family, which tracks the historical data. Here is the CLI script for it:<br />
<pre class="brush: plain">create column family traits
with comparator = DateType and
default_validation_class = UTF8Type and
key_validation_class = Int32Type;
</pre>
This column family is pretty straightforward. The row key is the schedule id. The column name is the time stamp, and the column value is the trait value. In the relational design, we only store a new row in the trait table if the value has changed. I have only done some preliminary investigation, and I am not yet sure how to replicate that behavior with a single write. I may need to use a custom comparator. It is something I have to revisit.<br />
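For illustration, one straightforward way to get write-if-changed semantics is a read-before-write check against the last known value. This costs an extra read rather than achieving it in a single write, and the class and method names here are hypothetical, not RHQ's:

```java
import java.util.HashMap;
import java.util.Map;

public class TraitWriter {
    // Last known value per schedule id. In the real system this lookup
    // would go against the resource_traits column family, not a map.
    private final Map<Integer, String> lastValue = new HashMap<>();

    // Returns true when a new column would actually be written into
    // the traits column family.
    boolean store(int scheduleId, String value) {
        String previous = lastValue.put(scheduleId, value);
        if (value.equals(previous)) {
            return false; // value unchanged - skip the historical write
        }
        return true;
    }

    public static void main(String[] args) {
        TraitWriter writer = new TraitWriter();
        System.out.println(writer.store(123, "6.0.24")); // true  - first report
        System.out.println(writer.store(123, "6.0.24")); // false - no change
        System.out.println(writer.store(123, "6.0.26")); // true  - value changed
    }
}
```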
<br />
I want to talk a little bit about the denormalization. As far this example goes, the system of record for everything except the trait data is the RHQ database. Suppose a schedule is disabled. That will now require a write to both the RHQ database as well as to Cassandra. When a new trait value is persisted, two writes have to be made to Cassandra - one to add a column to the traits column family and one to update the resource_traits column family.<br />
<br />
The last thing I will mention about the design is that I could have opted for a more row-based approach where each column in resource_traits is stored in a separate row. With that approach, I would use statically named columns like scheduleId, and the corresponding value would be something like 1234. The primary reason I decided against this is that the RandomPartitioner is used for the partitioning strategy, which happens to be the default. RandomPartitioner is strongly recommended for most cases to allow for even key distribution across nodes. Without going into detail, range scans, i.e., row-based scans, are not possible when using the RandomPartitioner. Additionally, Cassandra is designed to perform better with slice queries, i.e., column-based queries, than with range queries.<br />
<br />
The design may change as I get further along in the implementation, but it is a good starting point. The denormalization allows for efficient querying of a resource's traits and offers the flexibility for additional filtering. There are some trade-offs that have to be made, but at this point I feel they are worthwhile. One thing is for certain: studying the existing (SQL/JPA) queries and understanding what data is involved and how it is used helped flesh out the column family design.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com0tag:blogger.com,1999:blog-1563502382858123691.post-83203635481397259762012-05-31T08:09:00.001-04:002012-06-12T21:47:01.430-04:00Hector's ColumnSliceIterator<a href="http://hector-client.github.com/hector/build/html/index.html">Hector</a> is a popular Java client library for Cassandra. Hector offers several classes and APIs for performing range and slice queries. ColumnSliceIterator is one such class. It implements java.util.Iterator and encapsulates some basic paging functionality. By default it fetches 100 columns per page/batch. Below is an example of how I was using it in my code that landed me in some trouble.<br />
<br />
<script src="https://gist.github.com/2921288.js?file=hector_column_slice_npe.java"></script>
<br />
That code resulted in the following NPE.<br />
<br />
<pre class="brush: java">java.lang.NullPointerException
at me.prettyprint.cassandra.service.ColumnSliceIterator.next(ColumnSliceIterator.java:105)
</pre>
<br />
Being new to both Cassandra and Hector, I immediately assumed that there was a problem with my query. I spent a good bit of time debugging before I realized what was happening. The internal iterator in ColumnSliceIterator is not initialized until hasNext is called. This seems like a bug to me, and sure enough I found this <a href="https://github.com/hector-client/hector/issues/450">issue</a>. It does not look like there are any plans to fix this bug, but fortunately it can be worked around easily enough. Changing my code to,<br />
<br />
<script src="https://gist.github.com/2921293.js?file=hector_column_slice_hack.java"></script>
<br />
did the trick. Going forward, I will probably wrap ColumnSliceIterator in an Iterable so that I can use it with Java's for-each loop. This will encapsulate the bug and allow me to better control the paging. This bug kept me up late the other night; so, I thought it worth a short post. Overall my experience with Hector thus far has been really good. It does a nice job of encapsulating <a href="http://wiki.apache.org/cassandra/API">Cassandra's Thrift APIs</a>.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com2tag:blogger.com,1999:blog-1563502382858123691.post-63778177078249452422012-05-29T21:23:00.000-04:002012-05-29T21:23:27.381-04:00Working with CassandraRHQ provides a rich feature set in terms of its monitoring capabilities. In addition to collecting and storing metric data, RHQ automatically generates baselines, allows you to view graphs of data points over different time intervals, and gives you the ability to alert on metric data. RHQ uses a single database for storing all of its data. This includes everything from the inventory, to plugin meta data, to metric data. This presents an architectural challenge for the measurement subsystem particularly in terms of scale. As the number of managed resources grows and the volume of metrics being collected increases, database performance starts to degrade. Despite various optimizations that have been made, the database remains a performance bottleneck. The reality is that the relational database simply is not the best tool for write-intensive applications like time-series data.<br />
<br />
This architectural challenge has in large part motivated me to start learning about Cassandra. There are plenty of other, non-relational database systems that I think could address the performance problems with our measurement subsystem. There are a couple things about Cassandra that provided enough intrigue that made me decide to invest time learning about it.<br />
<br />
The first point of intrigue is that Cassandra is a distributed, peer-to-peer system with no single point of failure. Any node in the cluster can serve read and write requests. Nodes can be added to and removed from the cluster at any point, making it easier to meet demands around scalability. This design is largely inspired by Amazon's Dynamo.<br />
<br />
The second point of intrigue for me is that running a node involves running only a single Java process. For the purposes of RHQ and JBoss Operations Network (JON), this is much more important to me than the first point about single points of failure. The fewer the moving parts, the better. It simplifies management, which goes a long way toward the goal of having a self-contained solution.<br />
<br />
Cassandra could be a great fit for RHQ, and the time I have spent thus far learning it is definitely time well spent. There are some learning curves and hurdles one has to overcome, though. I find the project documentation to be lacking. For example, it took some time to wrap my head around super columns. Only after I understood super columns well enough to begin thinking about how to leverage them with RHQ's data model did I discover that composite columns should be favored over super columns. Apparently composite columns do not have the performance and memory overhead inherent to super columns, and composite columns allow for an arbitrary level of nesting whereas super columns do not. Fortunately DataStax's docs help fill in a lot of the gaps.<br />
<br />
One thing that was somewhat counter-intuitive initially is how the sorting works. With a relational database, you first define the schema, and then queries are defined later on. Sorting is done on column values and is specified at query time. With Cassandra, sorting is based on column names and is specified at the time of schema creation. This might seem really strange if you are thinking in terms of a relational database, but Cassandra is a distributed key-value store. If you think about it more along the lines of say, java.util.TreeMap, then it makes a lot more sense. With a TreeMap, sorting is done on keys. When I want to use a TreeMap or another ordered collection, I have to decide in advance how the elements of the collection should be ordered. This aspect of Cassandra is a good thing. It contributes to the high performance read/writes for which Cassandra is known. It also lends itself very well to working with time-series data.<br />
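The TreeMap analogy can be made concrete with a short sketch. This is only an analogy for how columns in a single Cassandra row are kept sorted by name, not actual Cassandra code:

```java
import java.util.TreeMap;

public class SortedColumnsExample {
    public static void main(String[] args) {
        // Like a Cassandra row, a TreeMap keeps entries ordered by key,
        // and that ordering is fixed when the map (schema) is created.
        TreeMap<Long, Double> row = new TreeMap<>();
        row.put(1339540500000L, 42.0); // column name = collection timestamp
        row.put(1339540200000L, 40.5);
        row.put(1339540800000L, 43.1);

        // Columns come back in timestamp order regardless of insertion order...
        System.out.println(row.firstKey()); // 1339540200000

        // ...which makes time-range "slices" cheap, e.g. everything from a
        // given timestamp onward:
        System.out.println(row.tailMap(1339540500000L).size()); // 2
    }
}
```

This is why the sorted-by-column-name model lends itself so well to time-series data: a slice query over a time range is just a contiguous scan within one row.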
<br />
DataStax posted a <a href="http://www.datastax.com/dev/blog/metric-collection-and-storage-with-cassandra">great blog </a>the other day about how they use Cassandra as a metrics database. The algorithm described sounds similar to what we do in RHQ; however, there are a few differences (aside from the obvious one of using different database systems). One difference is in bucket sizes. They use bucket sizes of one minute, five minutes, two hours, and twenty-four hours. RHQ uses bucket sizes of one hour, six hours, and twenty-four hours. I will briefly explain what this means. RHQ writes raw data points into a set of round-robin tables. Every hour a job runs to perform aggregation. The latest hour of data points is aggregated into the one hour table or bucket. RHQ calculates the max, min, and average for each metric collection schedule. When the one hour table has six hours worth of data, it is aggregated and written into the six hour table.<br />
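As a rough sketch of the hourly roll-up (not RHQ's actual code), computing the max, min, and average for one schedule's raw data points might look like:

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;

public class HourlyAggregation {
    public static void main(String[] args) {
        // Raw data points collected for one schedule during the last hour.
        List<Double> raw = List.of(40.0, 44.0, 42.0, 46.0);

        // The hourly job rolls these up into a single (min, max, avg)
        // entry in the one hour bucket.
        DoubleSummaryStatistics stats = raw.stream()
            .mapToDouble(Double::doubleValue)
            .summaryStatistics();

        System.out.println(stats.getMin());     // 40.0
        System.out.println(stats.getMax());     // 46.0
        System.out.println(stats.getAverage()); // 43.0
    }
}
```

The same computation then cascades: six of these hourly entries are aggregated into the six hour bucket, and so on into the twenty-four hour bucket.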
<br />
Disk space is cheap, but it is not infinite. There needs to be a purge mechanism in place to prevent unbounded growth. For RHQ, the hourly job that does the aggregation also handles the purging. Data in the six hour bucket for instance, is kept for 31 days. With Cassandra, DataStax simply relies on Cassandra's built-in TTL (time to live) feature. When data is written into a column, the TTL is set on it so that it will expire after the specified duration.<br />
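The TTL semantics can be sketched as follows. The `Column` type here is a toy of mine to illustrate the idea, not a Cassandra class:

```java
public class TtlExample {
    static class Column {
        final String value;
        final long writtenAt;  // write time in ms
        final long ttlMillis;  // time to live

        Column(String value, long writtenAt, long ttlMillis) {
            this.value = value;
            this.writtenAt = writtenAt;
            this.ttlMillis = ttlMillis;
        }

        // A column is readable only while its age is below the TTL; after
        // that Cassandra treats it as deleted, so no purge job is needed.
        boolean isLive(long now) {
            return now - writtenAt < ttlMillis;
        }
    }

    public static void main(String[] args) {
        long thirtyOneDays = 31L * 24 * 60 * 60 * 1000;
        Column sixHourValue = new Column("43.0", 0L, thirtyOneDays);
        System.out.println(sixHourValue.isLive(thirtyOneDays - 1)); // true
        System.out.println(sixHourValue.isLive(thirtyOneDays));     // false
    }
}
```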
<br />
So far it has been a good learning experience. Cassandra is clearly an excellent fit for storing RHQ's metric data, but I am starting to see how it could also be a good fit for other parts of the data model as well.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com2tag:blogger.com,1999:blog-1563502382858123691.post-1453391958073061582011-08-10T21:31:00.000-04:002011-08-10T21:31:37.098-04:00Drift Management Coming to RHQ<span class="Apple-style-span" style="font-size: large;">Introduction</span><br />
I am excited to share that we are very close to releasing a beta of RHQ 4.1.0. I have been working on <i>Drift Management</i>, one of the new features going into the release. I have been meaning to write a little bit about what this new feature is all about, and now is as good a time as any. I will try to provide a high level overview and save getting into more specific, detailed topics for future posts.<br />
<br />
<span class="Apple-style-span" style="font-size: large;">What is Drift?</span><br />
The first thing we need to do is define exactly what is meant by the term Drift Management. Let's start with the first part. Conceptually, we can define drift as an unplanned or unintended change to a managed resource. Let's consider a couple of examples to illustrate the concept.<br />
<br />
We have an EAP server that is configured for production use. That is, things like the JVM heap size, data source definitions, etc. are configured with production values. At some point suppose the heap settings for the EAP server are changed such that they are no longer consistent with what is expected for production use. This constitutes drift.<br />
<br />
Now let's consider another example involving application deployment. Suppose we have a cluster of EAP servers that is running our business application. We deploy an updated version of the application. For some reason, one of the cluster nodes does not get updated with the newer version of the application while the others do. We now have a cluster node that does not have the content that is expected to be deployed on it. This constitutes drift.<br />
<br />
<span class="Apple-style-span" style="font-size: large;">Why Do We Care about Drift?</span><br />
Now that we have looked at some examples to illustrate the concept of drift, there is a perfectly reasonable question to ask: why should we care? Unplanned or unintended changes frequently lead to problems. Those problems can manifest themselves as production failures, defects, outages, etc. Even with planned, intended changes, problems arise. It is not a question of if but rather when. A production server going down can result in a significant loss of time and money, among other things. Anything you can do to be proactive in handling issues when they occur could help save your organization time, money, and resources.<br />
<br />
<span class="Apple-style-span" style="font-size: large;">How Will RHQ Manage Drift?</span><br />
What can RHQ do to deal with drift? First and foremost, it can monitor resources for unintended or unplanned changes. RHQ allows you to specify which resources or which parts of resources you want to monitor for drift. The agent can periodically scan the file system looking for changes. When the agent detects a change, it notifies the server with the details of what has changed.<br />
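As an illustration of the idea (not RHQ's actual agent code), a scan that detects changed content by comparing hashes against a previously captured baseline might look like this. The class name and in-memory baseline are my own simplifications:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class DriftScanner {
    // Baseline of file path -> content hash, captured on an earlier scan.
    private final Map<String, String> baseline = new HashMap<>();

    static String sha256(String content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Records the file's current hash and reports true when it no longer
    // matches the baseline - i.e., drift was detected.
    boolean scan(String path, String content) throws Exception {
        String hash = sha256(content);
        String previous = baseline.put(path, hash);
        return previous != null && !previous.equals(hash);
    }

    public static void main(String[] args) throws Exception {
        DriftScanner scanner = new DriftScanner();
        scanner.scan("conf/run.conf", "-Xmx512m");                     // baseline
        System.out.println(scanner.scan("conf/run.conf", "-Xmx512m")); // false
        System.out.println(scanner.scan("conf/run.conf", "-Xmx2g"));   // true - drift
    }
}
```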
<br />
The server maintains a history of the changes it receives from the agent. This makes it possible for example to compare the state of a resource today versus its state two weeks ago. One of the many interesting and challenging problems we are tackling is how to present that history in meaningful ways so that users can quickly and easily identify changes of interest.<br />
<br />
An integral aspect of RHQ's monitoring capabilities is its alerting system. RHQ allows you to define different rules which can result in alerts being triggered. For example, we can create a rule that will trigger an alert whenever an EAP server goes down. Similarly, RHQ could (and will) give you the ability to have alerts triggered whenever drift is detected on any of your managed EAP servers.<br />
<br />
Another key aspect of RHQ's drift management functionality is remediation. Some platforms and products provide automatic remediation. Consider the earlier example of the changed heap settings on the EAP server. With automatic remediation, those settings might be reverted to their original values as soon as the change is detected.<br />
<br />
Then there is also manual remediation. Think merge conflicts in a version control system. There are lots of visual editors for viewing diffs and resolving conflicts. A couple that I use are <a href="http://www.sourcegear.com/diffmerge/">diffmerge</a> and <a href="http://meld.sourceforge.net/">meld</a>. RHQ will provide interfaces and tools for generating and viewing diffs and for performing remediation, much in the same way you might with a visual diff editor.<br />
<br />
<span class="Apple-style-span" style="font-size: large;">What's Next?</span><br />
Here is a quick run down of drift management features that will be in the beta:<br />
<br />
<ul><li>Enable drift management for individual resources</li>
<ul><li>This involves defining the <i>drift configuration</i> or rules which specify what files to monitor for drift and how often monitoring should be done</li>
</ul><li>Perform drift monitoring (done by the agent)</li>
<li>View change history in the UI</li>
<li>Execute commands from the CLI to:</li>
<ul><li>Query for change history</li>
<li>Generate <i>snapshots</i></li>
<ul><li>A snapshot provides a point in time view of a resource for a specified set of changes</li>
</ul><li>Diff snapshots (This is not a file diff)</li>
</ul></ul><div><br />
</div>Here are some notable features that will not be available in the beta:<div><ul><li>Define filters that specify which files to include/exclude in drift monitoring (Note that you actually can define the filters; they are just not handled by the agent yet)</li>
<li>Perform manual remediation (i.e., visual diff editor)</li>
<li>Support for <i>golden images</i> (more on this in a future post)</li>
<li>Generate/view snapshots in the UI</li>
<li>Alerts integration</li>
</ul>It goes without saying that there will be bugs, some of which are known, and that functionality in the beta is subject to change in ways that will likely break compatibility with future releases. More information will be provided in the release notes as soon as they are available. Stay tuned!<br />
</div>Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com0tag:blogger.com,1999:blog-1563502382858123691.post-82435513956419153162011-06-02T11:03:00.000-04:002012-06-12T21:49:07.860-04:00Manually Add Resources to Inventory from CLIResources in RHQ are typically added to the inventory through discovery scans that run on the agent. The plugin container (running inside the agent) invokes plugin components to discover resources. RHQ also allows you to manually add resources into inventory. There may be times when discovery scans fail to find a resource you want to manage. The other day I was asked whether or not you can manually add a resource to inventory via the CLI. Here is a small CLI script that demonstrates manually adding a Tomcat server into inventory.<br />
<br />
<script src="https://gist.github.com/2921298.js?file=resource_manual_add.js"></script>
<br />
The <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">findResourceType</span> and <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">findPlatform</span> functions are pretty straightforward. The interesting work happens in <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">createTomcatConnectionProps</span> and in <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">manuallyAddTomcat</span>. The key to it all though is on line 44. DiscoveryBoss provides methods for importing resources from the discovery queue as well as for manually adding resources. <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">manuallyAddResources</span> expects as arguments a resource type id, a parent resource id, and the plugin configuration (i.e., connection properties).<br />
<br />
Determining the connection properties that you need to specify might not be entirely intuitive. I looked at the plugin descriptor as well as the <span class="Apple-style-span" style="font-family: inherit;">TomcatDiscoveryComponent</span> class from the tomcat plugin to determine the minimum, required connection properties that need to be included.<br />
<br />
Here is how the script could be used from the CLI shell:<br />
<br />
<pre class="brush: plain">rhqadmin@localhost:7080$ login rhqadmin rhqadmin
rhqadmin@localhost:7080$ exec -f manual_add.js
rhqadmin@localhost:7080$ hostname = '127.0.0.1'
rhqadmin@localhost:7080$ tomcatDir = '/home/jsanda/Development/tomcat6'
rhqadmin@localhost:7080$ manuallyAddTomcat(hostname, tomcatDir)
Resource:
id: 12071
name: 127.0.0.1:8080
version: 6.0.24.0
resourceType: Tomcat Server
rhqadmin@localhost:7080$
</pre><br />
This effectively adds the Tomcat server to the inventory of managed resources. This same approach can be used with other resource types. The key is knowing what connection properties you need to specify so that the plugin (in which the resource type is defined) knows how to connect to and manage the resource.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com2tag:blogger.com,1999:blog-1563502382858123691.post-21753475330470188592011-05-23T07:35:00.001-04:002012-06-12T22:05:23.748-04:00A REPL for the RHQ Plugin Container<b><span class="Apple-style-span" style="font-size: large;">Overview</span></b><br />
RHQ plugins run inside of a plugin container that provides different services and manages the life cycles of plugins. The plugin container in turn runs inside of the RHQ agent. If you are not familiar with the agent, it is deployed to each machine you want RHQ to manage. You can read more about it <a href="http://rhq-project.org/display/JOPR2/Agent+Features">here</a> and <a href="http://rhq-project.org/display/JOPR2/Running+the+RHQ+Agent">here</a>. While the plugin container runs inside of the agent, it is not coupled to the agent. In fact it is used quite a bit outside of the agent. It is used in <a href="http://www.jboss.org/embjopr">Embedded Jopr</a>, which was intended to be a replacement for the JMX web console in JBoss AS. The plugin container is also used a lot during development in automated tests.<br />
<br />
My teammate, Heiko Rupp, has developed a cool wrapper application for the plugin container. It defines a handful of commands for working with the plugin container interactively. What is nice about this is that it can really speed up plugin development. Heiko has written several articles about the standalone container including <a href="http://pilhuhn.blogspot.com/2009/03/working-on-standalone-plugincontainer.html">Working on a standalone PluginContainer wrapper</a> and <a href="http://pilhuhn.blogspot.com/2009/04/i-love-standalone-container-in-jopr.html">I love the Standalone container in Jopr (updated)</a>. After reading some of his posts I got to thinking that a <a href="http://en.wikipedia.org/wiki/Read-eval-print_loop">REPL</a> for the plugin container would be really nice but not just any REPL though. I was thinking specifically about Clojure's REPL.<br />
<br />
I have spent some time exploring the different ways Clojure could be effectively integrated with RHQ. There is little doubt in my mind that this is one of them. I recently started working on some Clojure functions to make working with the plugin container easier. I am utilizing Clojure's immutable and persistent data structures as well as some of the other great language features such as first class functions and multimethods. I am trying to make these functions easy enough to use so that someone who might not be a very experienced Clojure programmer might still find them useful during plugin development and testing.<br />
<br />
<b><span class="Apple-style-span" style="font-size: large;">Getting the Code</span></b><br />
The project is available on github at <a href="https://github.com/jsanda/clj-rhq">https://github.com/jsanda/clj-rhq</a>. It is built with <a href="https://github.com/technomancy/leiningen">leiningen</a>, so you will want to get it installed. I typically run a swank server and connect from Emacs, but you can also start a REPL session directly if you are not an Emacs user. The project pulls in the necessary dependencies so that you can work with plugin container-related classes, as you will see in the following sections.<br />
<br />
<b><span class="Apple-style-span" style="font-size: large;">Running the Code</span></b><br />
These steps assume that you already have leiningen installed. First, clone the project:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">git clone https://jsanda@github.com/jsanda/clj-rhq.git</span><br />
<br />
Next, download project dependencies with:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">lein deps</span><br />
<br />
Some plugins rely on native code provided by the <a href="http://support.hyperic.com/display/SIGAR/Home">Sigar</a> library, which you should find at <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">clj-rhq/lib/sigar-dist-1.6.5.132.zip</span>. Create a directory in <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">lib</span> named <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">native</span> and unzip sigar-dist-1.6.5.132.zip there. The project is configured to look for native libraries in <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">lib/native</span>.<br />
<br />
Finally, if you are using Emacs run <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">lein swank</span> to start a swank server; otherwise, run <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">lein repl</span> to start a REPL session on the command line.<br />
<br />
<b><span class="Apple-style-span" style="font-size: large;">Starting/Stopping the Plugin Container</span></b><br />
<script src="https://gist.github.com/2921305.js?file=start_stop_pc.clj"></script>
<br />
The first thing I do is call <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">require</span> to load the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">rhq.plugin-container</span> namespace. Then I call the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">start</span> function. The plugin container emits a line of output, and then the function returns <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">nil</span>. Next I verify that the plugin container has started by calling <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">running?</span>. Then I call the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">stop</span> function to shut down the plugin container, and finally call <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">running?</span> again to verify that the plugin container has indeed shut down.<br />
<br />
<b><span class="Apple-style-span" style="font-size: large;">Executing Discovery Scans</span></b><br />
So far we have looked at starting and stopping the PC. One of the nice things about working interactively in the REPL is that you are not limited to a pre-defined set of functions. If <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">rhq.plugin-container</span> did not offer any functions for executing a discovery scan, you could write something like the following:<br />
<br />
<script src="https://gist.github.com/2921344.js?file=discovery_scans.clj"></script>
<br />
The <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">pc</span> function simply returns the plugin container singleton which gives us access to the InventoryManager. We call InventoryManager's <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">executeServerScanImmediately</span> method and store the InventoryReport object that it returns in a variable named <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">inventory-report</span>. Alternatively you can use the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">discover</span> function.<br />
<br />
<script src="https://gist.github.com/2921349.js?file=discovery_scan2.clj"></script>
<br />
On the first call to <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">discover</span> we pass the keyword <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">:SERVER</span> as an argument. This results in a server scan being run. On the second call, we pass <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">:SERVICE</span>, which results in a service scan being run. If you invoke <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">discover</span> with no arguments, a server scan is executed followed by a service scan, and the two inventory reports from those scans are returned in a vector. Using the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">count</span> function to see how many resources were discovered demonstrates how easily you can combine functions defined outside of the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">rhq.plugin-container</span> namespace with these results to provide additional capabilities.<br />
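The dispatch behavior of discover can be summarized in a short sketch. Since the examples here are conceptual, I will sketch it in Java; ScanType and the string "reports" are simplified stand-ins for the real plugin container types, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class DiscoverSketch {
    enum ScanType { SERVER, SERVICE }

    // Stand-in for running one scan; the real plugin container
    // returns an InventoryReport from each scan.
    static String runScan(ScanType type) {
        return type + " scan report";
    }

    // With an argument: one scan of that type. With no arguments: a server
    // scan followed by a service scan, both reports returned in order
    // (mirroring the vector described in the text).
    static List<String> discover(ScanType... types) {
        ScanType[] toRun = types.length == 0
                ? new ScanType[] { ScanType.SERVER, ScanType.SERVICE }
                : types;
        List<String> reports = new ArrayList<>();
        for (ScanType t : toRun) {
            reports.add(runScan(t));
        }
        return reports;
    }

    public static void main(String[] args) {
        System.out.println(discover(ScanType.SERVER)); // one report
        System.out.println(discover());                // two reports
    }
}
```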
<br />
<span class="Apple-style-span" style="font-size: large;"><b>Searching the Local Inventory</b></span><br />
Once you have the plugin container running and are able to execute discovery scans, you need a way to query the inventory for resources with which you want to work. The <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">inventory</span> function does just that. It can be invoked in one of two ways. In its simpler form, which takes no arguments, it returns the platform Resource object. In its more complex form, it takes a map of pre-defined filters and returns a lazy sequence of those resources that match the filters.<br />
<br />
<script src="https://gist.github.com/2921351.js?file=search_inventory.clj"></script>
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">inventory</span> is invoked on line 1 without any arguments, and then a string version of it is returned with a call to <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">str</span>. The type is Mac OS X indicating that the object is in fact the platform resource. On line 5 we <span class="Apple-style-span" style="font-family: inherit;">invoke</span> <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">inventory</span> with a single filter to include resources that are available. That call shows that there are 62 resources in inventory that are up. On line 7 we query for resources that are a service and see that there are 60 in inventory. On line 9 we specify multiple filters that will return down services. When multiple filters are specified, a resource must match each one in order to be included in the results. On line 10 we query for webapps from the JBossAS plugin. On line 13 we specify a custom filter in the form of an anonymous function with the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">:fn</span> key. This filter finds resources that define at least two metrics.<br />
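The matching rule described above, that a resource must satisfy every supplied filter to be included, is just predicate conjunction. Here is a minimal sketch in Java; the Resource record and the filters are simplified stand-ins for illustration, not RHQ's actual classes:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterSketch {
    // Simplified stand-in for an inventoried resource; not RHQ's Resource class.
    record Resource(String category, boolean up, int metricCount) {}

    // A resource is included only if it matches every filter (logical AND),
    // matching the semantics of passing multiple filters to inventory.
    static List<Resource> inventory(List<Resource> all, List<Predicate<Resource>> filters) {
        Predicate<Resource> combined = filters.stream().reduce(r -> true, Predicate::and);
        return all.stream().filter(combined).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Resource> all = List.of(
                new Resource("SERVICE", true, 3),
                new Resource("SERVICE", false, 1),
                new Resource("SERVER", true, 5));
        // "Down services": must match both filters, like the multi-filter query above.
        List<Resource> downServices = inventory(all,
                List.of(r -> r.category().equals("SERVICE"), r -> !r.up()));
        System.out.println(downServices.size()); // prints 1
    }
}
```

A custom filter like the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">:fn</span> key is simply one more predicate added to the same conjunction.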
<br />
<b><span class="Apple-style-span" style="font-size: large;">Conclusion</span></b><br />
We have looked at a number of functions to make working with the plugin container from the REPL a bit easier. Each function should also include a useful docstring as in the following example,<br />
<br />
<script src="https://gist.github.com/2921352.js?file=avail.clj"></script>
<br />
We have only scratched the surface with the functions in the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">rhq.plugin-container</span> namespace. In some future posts we will explore invoking resource operations, updating resource configurations, and deploying resources like EARs and WARs.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com1tag:blogger.com,1999:blog-1563502382858123691.post-53210192905409718382011-05-19T23:10:00.001-04:002011-05-20T00:57:19.210-04:00Remote Streams in RHQThe agent/server communication layer in RHQ provides rich, bi-directional communication that is highly configurable, performant, and fault tolerant. And as a developer it has another feature of which I am quite fond - I rarely have to think about it. It just works.<br />
<br />
Recently I started working on a new feature that involves streaming potentially large files from agent to server. This work has led me to look under the hood of the comm layer to an extent. The comm layer allows for high-level APIs between server and agent. Consider the following example:<br />
<br />
<pre class="brush: java">public interface ConfigurationServerService {
...
@Asynchronous(guaranteedDelivery = true)
void persistUpdatedResourceConfiguration(int resourceId, Configuration resourceConfiguration);
}
</pre><br />
The agent calls <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">persistUpdatedResourceConfiguration</span> when it has detected a resource configuration change that has occurred outside of RHQ. The <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">@Asynchronous</span> annotation tells the communication layer that the remote method call from agent to server can be performed asynchronously. There are no special stubs or proxies that I have to worry about to use this remote API. It is all nicely tucked away in the communication layer.<br />
<br />
Several posts could be devoted to discussing RHQ's communication layer but back to my current work of streaming large files. I needed to put in place a remote API on the server so that the agent can upload files. You might consider something like the following as an initial approach:<br />
<br />
<pre class="brush: java">// Remote API exposed by RHQ server to stream files from agent to server
void uploadFile(byte[] data);
</pre><br />
The problem with this approach is that it involves loading the file contents into memory. File sizes could easily exceed several hundred megabytes, resulting in substantial memory usage that would be impractical. The RHQ agent is finely tuned to keep a low footprint in terms of memory and CPU utilization. When reading the contents of a large file that is too big to fit into memory, <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">java.io.InputStream</span> is commonly used. With the RHQ communication layer, I am able to expose an API like the following,<br />
<br />
<pre class="brush: java">// Remote API exposed by RHQ server to stream files from agent to server
void uploadFile(InputStream stream);
</pre><br />
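To see why the streaming API keeps memory usage flat, here is a sketch of what a server-side implementation of uploadFile might look like; the method and target path are illustrative, not RHQ's actual code. It reads the stream in fixed-size chunks and writes them to disk, so only one small buffer is ever held in memory regardless of file size:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class UploadSketch {
    // Hypothetical server-side handler: copies the stream to disk in
    // fixed-size chunks so memory use stays bounded.
    static long uploadFile(InputStream stream, Path target) throws IOException {
        long total = 0;
        try (OutputStream out = Files.newOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = stream.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[100_000];
        Path tmp = Files.createTempFile("upload", ".bin");
        long written = uploadFile(new ByteArrayInputStream(payload), tmp);
        System.out.println("copied " + written + " bytes");
        Files.delete(tmp);
    }
}
```

With a RemoteInputStream, each read call in the loop transparently pulls the next chunk of bytes from the agent over the wire.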
With this API, the agent passes an <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">InputStream</span> object to the server. Keep in mind though that none of Java's standard <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">InputStream</span> classes implement <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Serializable</span> which is a requirement for using objects with a remote invocation framework like RMI or JBoss Remoting. Fortunately for me RHQ provides the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">RemoteInputStream</span> class which extends <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">java.io.InputStream</span>. The Javadocs from that class state,<br />
<br />
<blockquote><i>This is an input stream that actually pulls down the stream data from a remote server. Note that this extends <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">InputStream</span> so it can be used as any normal stream object; however, all methods are overridden to actually delegate the methods to the remote stream.</i></blockquote><br />
When the agent wants to upload a file, it calls <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">uploadFile</span> passing a <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">RemoteInputStream</span> object. The server can then read from the input stream just as it would any other input stream unbeknownst to it that the bytes are being streamed over the wire.<br />
<br />
While I find myself impressed with <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">RemoteInputStream</span>, it gets even better. I wanted to read from the stream asynchronously. When the agent calls <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">uploadFile</span>, instead of reading from the stream in the thread handling the request, I fire off a message to a JMS queue to free up the thread to service other agent requests. I am able to pass the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">RemoteInputStream</span> object in a JMS message and have a Message Driven Bean then read from the stream to upload the file from the agent.<br />
<br />
This level of abstraction along with the performance, fault tolerance, and stability characteristics of the agent/server communication layer makes it one of those hidden gems you do not really appreciate until you have to look under the hood, so to speak. And rarely, if ever, do I find myself having to look under the hood because... it just works. Lastly, I should point out that there is a <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">RemoteOutputStream</span> class that complements the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">RemoteInputStream</span> class.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com2tag:blogger.com,1999:blog-1563502382858123691.post-214005457385756482011-02-20T21:12:00.001-05:002012-06-12T22:07:24.652-04:00RHQ Bundle Recipe for Deploying JBoss ServerMy colleague mazz wrote an excellent <a href="http://management-platform.blogspot.com/2011/01/bundle-provisioning-via-rhq.html">blog post</a> that describes in detail the provisioning feature of RHQ. The post links to a nice Flash demo he put together to illustrate the various things he discusses in his article. Taking what I learned from his post, I put together a simple recipe to deploy a JBoss EAP server and then start the server after it has been laid down on the destination file system. Here is the recipe:<br />
<br />
<script src="https://gist.github.com/2921362.js?file=jbas_recipe.xml"></script>
<br />
The bundle declaration itself on lines 4 - 11 is pretty straightforward. If this part is not clear, read through the <a href="http://rhq-project.org/display/JOPR2/Ant+Bundles">docs on Ant bundles</a>. Where things became a little less than straightforward is with the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><exec></span> task starting on line 18. The first problem I encountered was Ant saying that it could not find <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">run.sh</span>. I think this is because it was looking for it on my <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">PATH</span>. Adding <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">resolveexecutable="true"</span> on line 21 took care of this problem. This tells Ant to look for the executable in the specified execution directory.<br />
<br />
On line 22 I specify arguments to run.sh. <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">-b 0.0.0.0</span> tells JBoss to bind to all available addresses. Initially I had line 22 written as:<br />
<br />
<pre class="brush:xml"><arg value="-b 0.0.0.0"/>
</pre><br />
That did not get parsed correctly and resulted in JBoss throwing an exception with an error message saying that an invalid bind address was specified. Specifying the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">line</span> attribute instead of the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">value</span> attribute fixed the problem.<br />
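For reference, the working form looks something like the following; the surrounding attributes and paths are placeholders for illustration, not the recipe's exact values. The difference is that Ant tokenizes the line attribute on whitespace into separate arguments, whereas value passes the entire string as a single argument:

```xml
<exec dir="${jboss.home}/bin" executable="run.sh" resolveexecutable="true">
  <!-- "line" is split into two arguments, -b and 0.0.0.0;
       "value" would pass "-b 0.0.0.0" as one malformed argument -->
  <arg line="-b 0.0.0.0"/>
</exec>
```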
<br />
The last problem I encountered was Ant complaining that it did not have the necessary permissions to execute <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">run.sh</span>. It turned out that when the EAP distro was unpacked, the scripts in the bin directory were not executable. This is why I added the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><chmod></span> call on line 17. It seems that the executable file mode bits are getting lost somewhere along the way in the deployment process. I went ahead and filed a bug for this. You can view the ticket <a href="https://bugzilla.redhat.com/show_bug.cgi?id=678883">here</a>.<br />
<br />
After working through these issues, I was able to successfully deploy my JBoss server and have it start up without error. Now I can easily deploy my bundle to a single machine, a cluster of RHEL servers that might serve as a QA or staging environment, or even a group of heterogeneous machines that could consist of Windows, Fedora (or other Linux distros), and Mac OS X. Very cool! Provisioning is still a relatively new feature in RHQ. It adds tremendous value to the platform, and I think it can add even more. One of the things I would like to see is more support for common tasks like starting/stopping a server, whether it is in the form of custom Ant tasks or something else.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com0tag:blogger.com,1999:blog-1563502382858123691.post-2659282973906885512010-11-23T22:30:00.024-05:002012-06-12T22:20:58.597-04:00Writing an RHQ Plugin in ClojureClojure is a new, exciting language. My biggest problem with it is that I do not find enough time to work with it. One of the ways I am trying to increase my exposure to Clojure is by exploring ways of integrating it into RHQ. RHQ is well-suited for integrating non-Java, JVM languages because it was designed and built to be extended. In previous posts I have talked about various extension points including agent plugins, server plugins, and remote clients.<br />
<br />
I decided to write an agent plugin in Clojure. If you are not familiar with RHQ plugins or what is involved with implementing one, check out this excellent <a href="http://pilhuhn.blogspot.com/2008/05/writing-rhq-plugin-part-1.html">tutorial</a> from my colleague Heiko. Right now, I am just doing exploratory work. I have a few goals in mind though as I go down this path.<br />
<br />
First, I have no desire to wind up writing Java in Clojure. By that I mean that I do not want to get bogged down dealing with mutable objects. One of the big draws to Clojure for me is that it is a functional language with immutable data structures; so, as I continue my exploration of integrating Clojure with RHQ, I want to write idiomatic Clojure code to the greatest extent possible.<br />
<br />
Secondly, I want to preserve what I like to think of as the Clojure development experience. Clojure is a very dynamic language in which functions and name spaces can be loaded and reloaded on the fly. The REPL is an invaluable tool. It provides instant feedback. In my experience Test-Driven Development usually results in short, quick development iterations. TDD + REPL produces extremely fast development iterations.<br />
<br />
Lastly, I want to build on the aforementioned goals in order to create a framework for writing RHQ plugins in Clojure. For instance, I want to be able to run and test my plugin in a running plugin container directly from the REPL. And then when I make a change to some function in my plugin, I want to be able to just reload that code without having to rebuild or redeploy the plugin.<br />
<br />
Now that I have provided a little background on where I hope to go, let's take a look at where I am currently. Here is the first cut at my Clojure plugin.<br />
<br />
<script src="https://gist.github.com/2921124.js?file=gistfile1.clj">
</script>
<br />
I am using <span style="font-family: 'Courier New', Courier, monospace;">gen-class</span> to generate the plugin component classes. As you can see, this is just a skeleton implementation. Here is the plugin descriptor.<br />
<br />
<script src="https://gist.github.com/2921396.js?file=clj_plugin_descriptor.xml"></script>
<br />
I have run into some problems though when I deploy the plugin. When the plugin container attempts to instantiate the plugin component to perform a discovery scan, the following error is thrown:<br />
<br />
<script src="https://gist.github.com/2921398.js?file=clj_plugin_exception.java"></script>
<br />
I was not entirely surprised to see such an error because I have heard about some of the complexities involved with trying to run Clojure in an OSGi container, and the RHQ plugin container shares some similarities with OSGi. There is a lot of class loader magic that goes on with the plugin container. For instance, each plugin has its own class loader, plugins can be reloaded at runtime, and the container limits the visibility of certain classes and packages. I came across this Clojure Jira ticket, <a href="http://dev.clojure.org/jira/browse/CLJ-260">CLJ-260</a>, which talks about setting the context class loader. Unfortunately this did not help my situation because the context class loader is already set to the plugin class loader.<br />
<br />
After spinning my wheels a bit, I decided to try a different approach. I implemented my plugin component class in Java, and it delegates to a Clojure script. Here is the code for it.<br />
<br />
<script src="https://gist.github.com/2921415.js?file=clj_plugin_java.java"></script>
<br />
And here is the Clojure script,<br />
<br />
<script src="https://gist.github.com/2921426.js?file=discovery_component.clj"></script>
<br />
This version deploys without error. I have not fully grokked the class loading issues, but at least for now, I am going to stick with a thin Java layer that delegates to my Clojure code. Up until now, I have been using leiningen to build my code, but now that I am looking at a mixed code base, I may consider switching over to Maven. I use Emacs and Par Edit for Clojure, but I use IntelliJ for Java. The IDE support for Maven will come in handy when I am working on the Java side of things.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com1tag:blogger.com,1999:blog-1563502382858123691.post-44840021133327206362010-11-21T21:31:00.002-05:002010-11-21T21:33:49.930-05:00Server-Side Scripting in RHQ<span style="font-size: large;"><b>Introduction </b></span><br />
The RHQ platform can be extended in several ways, most notably through plugins that run on agents. There also exists the capability to extend the platform's functionality with scripting via the <a href="http://www.rhq-project.org/display/JOPR2/Running+the+RHQ+CLI">CLI</a>. The CLI is a remote client, and even scripts that are run on the same machine on which the RHQ server resides are still a form of client-side scripting because they run in a separate process and operate on a set of remote APIs exposed by the server.<br />
<br />
In this post I am going to introduce a way to do server-side scripting. That is, the scripts are run in the same JVM in which the RHQ server is running. This form of scripting is in no way mutually exclusive to writing CLI scripts; rather, it is complementary. While a large number of remote APIs are exposed through the CLI, they do not encompass all of the functionality internal to the RHQ server. Server-side scripts, however, have full access to the internal APIs of the RHQ server.<br />
<span style="font-size: large;"><br />
</span><br />
<span style="font-size: large;"><b>Server Plugins</b></span><br />
RHQ 3.0.0 introduced server plugins which are distinct from agent plugins. Server plugins run directly in the RHQ server inside a server plugin container. Unlike agent plugins, they do not perform any resource discovery. The article, <a href="http://management-platform.blogspot.com/2009/12/rhq-server-plugins-innovation-made-easy.html">RHQ Server Plugins - Innovation Made Easy</a>, provides a great introduction to server plugins. Similar to agent plugins, server plugins can expose operations which can be invoked from the UI. They can also be configured to run as scheduled jobs. Server plugins have full access to the internal APIs of the RHQ server. Reference documentation for server plugins can be found <a href="http://www.rhq-project.org/display/RHQ/Server+Plugin+Development">here</a>. The server-side scripting capability we are going to look at is provided by a server plugin.<br />
<br />
<span style="font-size: large;"><b>Groovy Script Server</b></span><br />
Groovy Script Server is a plugin that allows you to dynamically execute Groovy scripts directly on the RHQ server. Documentation for the plugin can be found <a href="http://www.rhq-project.org/display/RHQ/Groovy+Script+Server">here</a>. The plugin currently provides a handful of features including,<br />
<ul><li>Customizable classpath per script</li>
<li>Easy access to RHQ EJBs through dynamic properties</li>
<li>An expressive DSL for generating <a href="http://www.rhq-project.org/display/JOPR2/Running+the+RHQ+CLI#RunningtheRHQCLI-CriteriaSearching">criteria queries</a></li>
</ul><br />
<span style="font-size: large;"><b>An Example</b></span><br />
Now that we have introduced server plugins and the Groovy Script Server, it is time for an example. A while back, I wrote a post on a way to <a href="http://johnsanda.blogspot.com/2010/08/auto-import-resources-into-inventory.html">auto-import resources into inventory using the CLI</a>. We will revisit that script, written as a server-side script.<br />
<br />
<pre class="brush: groovy;">resourceIds = []
criteria(Resource) {
filters = [inventoryStatus: InventoryStatus.NEW]
}.exec(SubjectManager.overlord) { resourceIds << it.id }
DiscoveryBoss.importResources(SubjectManager.overlord, (resourceIds as int[]))
</pre><br />
On line two we call the <span style="font-family: "Courier New",Courier,monospace;">criteria</span> method, which is available to all scripts. This method provides our criteria query DSL. Notice that the method takes a single parameter - the class for which the criteria query is being generated. Filters are specified as a map of property names to property values.<br />
Property names are derived from the various <span style="font-family: "Courier New",Courier,monospace;">addFilterXXX</span> methods exposed by the criteria object being built. In this instance, the filter corresponds to the method <span style="font-family: "Courier New",Courier,monospace;">ResourceCriteria.addFilterInventoryStatus</span>.<br />
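The property-name-to-method mapping described above amounts to capitalizing the key and invoking the matching setter reflectively. Here is a rough sketch of that mechanism in Java, using a hypothetical stand-in criteria class rather than RHQ's actual ResourceCriteria:

```java
import java.lang.reflect.Method;

public class CriteriaSketch {
    // Hypothetical stand-in for a generated criteria object.
    public static class FakeCriteria {
        String inventoryStatus;
        public void addFilterInventoryStatus(String status) {
            this.inventoryStatus = status;
        }
    }

    // Turn a map key like "inventoryStatus" into "addFilterInventoryStatus"
    // and invoke it reflectively, as the DSL is described as doing.
    static void applyFilter(Object criteria, String property, Object value) throws Exception {
        String methodName = "addFilter"
                + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        for (Method m : criteria.getClass().getMethods()) {
            if (m.getName().equals(methodName) && m.getParameterCount() == 1) {
                m.invoke(criteria, value);
                return;
            }
        }
        throw new NoSuchMethodException(methodName);
    }

    public static void main(String[] args) throws Exception {
        FakeCriteria c = new FakeCriteria();
        applyFilter(c, "inventoryStatus", "NEW");
        System.out.println(c.inventoryStatus); // prints NEW
    }
}
```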
<br />
The <span style="font-family: "Courier New",Courier,monospace;">criteria</span> method returns a criteria object that corresponds to the class argument. In this example, a ResourceCriteria object is returned. Notice that <span style="font-family: "Courier New",Courier,monospace;">exec</span> is called on the generated ResourceCriteria object. This method is dynamically added to each generated criteria object. It takes care of calling the appropriate manager, which in this case is <span style="font-family: "Courier New",Courier,monospace;">ResourceManager</span>. <span style="font-family: "Courier New",Courier,monospace;">exec</span> takes two arguments - a Subject and a closure. Most stateless session bean methods in RHQ go through a security layer to ensure that the user specified by the Subject has the necessary permissions to perform the requested operation. In the CLI, you may have noticed that you do not have to pass a Subject to the various manager methods. This is because the CLI implicitly passes the Subject corresponding to the logged-in user. The second argument, a closure, is called once for each entity in the results returned from <span style="font-family: "Courier New",Courier,monospace;">exec</span>.<br />
<br />
Let's look at a second example that builds off of the previous one. Instead of auto-importing everything in the discovery queue, suppose we only want to import JBoss AS 5 or AS 6 instances.<br />
<br />
<pre class="brush: groovy;">resourceIds = []
criteria(Resource) {
filters = [
inventoryStatus: InventoryStatus.NEW,
resourceTypeName: 'JBossAS Server',
pluginName: 'JBossAS5'
]
}.exec(SubjectManager.overlord) { resourceIds << it.id }
DiscoveryBoss.importResources(SubjectManager.overlord, (resourceIds as int[]))
</pre><br />
Here we add two additional filters for the resource type name and the plugin. If we did not filter on the plugin name in addition to the resource type name, then our results could include JBoss AS 4 instances which we do not want.<br />
<br />
<span style="font-size: large;"><b>Future Work</b></span><br />
The Groovy Script Server, as well as server plugins in general, is relatively new to RHQ. There are some enhancements that I already have planned. First is adding support for running scripts as scheduled jobs. This is one of the big features of server plugins. With support for scheduled jobs, we could configure the auto-inventory script to run periodically, freeing us from having to manually log into the server to execute the script. The CLI version of the script could be wrapped in a cron job. If we did that with the CLI script though, we might want to include some error handling logic in case the server is down or otherwise unavailable. With the server-side scheduled job, we do not need that kind of error handling logic.<br />
<br />
The second thing I have planned is to put together additional documentation and examples. With the work that has already been done, the server-side scripting capability opens up a lot of interesting possibilities. I would love to hear feedback on how you might utilize the script server as well as any enhancements that you might like to see.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com0tag:blogger.com,1999:blog-1563502382858123691.post-28196712022222605722010-11-13T22:25:00.000-05:002010-11-13T22:25:22.925-05:00RHQ: Deleting Agent Plugins<b><span style="font-size: small;"><span style="font-size: large;">Introduction</span> </span></b><br />
RHQ is an extensible management platform; however, the platform itself does not provide the management capabilities. For example, there is nothing built into the platform for managing a JBoss AS cluster. The platform is agnostic of the actual resources and resource types it manages, like the JBoss AS cluster. The management capabilities for resources like JBoss AS are provided through plugins. RHQ's plugin architecture allows the platform to be extended in ways such that it can manage virtually any type of resource.<br />
<br />
Plugin JAR files can be deployed and installed on an RHQ server (or cluster of servers), they can be upgraded, and they can even be disabled. They cannot, however, be deleted. In this post, we spend a bit of time exploring plugin management, from installing and upgrading plugins to disabling them. Then we consider my recent work on deleting plugins.<br />
<br />
<span style="font-size: large;"><b>Installing Plugins </b></span><br />
Plugins can be installed in one of two ways. The first involves copying the plugin JAR file to <span style="font-family: "Courier New",Courier,monospace;"><rhq_home>/jbossas/server/default/deploy/rhq.ear/rhq-downloads/rhq-plugins</span>. And starting with RHQ 3.0.0, you can alternatively copy the plugin JAR file to <span style="font-family: "Courier New",Courier,monospace;"><rhq_home>/plugins</span>, which is arguably easier given the much shorter path. The RHQ server will periodically scan these directories for new plugin files. When a new or updated plugin is detected, the server will deploy the plugin. This approach is particularly convenient during development when the RHQ server is running on the same machine on which I am developing. In fact, RHQ's Maven build is set up to copy plugins to a development server as part of the build process.<br />
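A periodic scan of this kind boils down to comparing file timestamps against what was seen on the previous pass. Here is a simplified sketch in Java of how such a scanner might detect new or updated plugin JARs; this is illustrative, not RHQ's actual scanner:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PluginScanSketch {
    // Last-modified timestamp of each JAR seen on the previous pass.
    private final Map<String, Long> seen = new HashMap<>();

    // Returns plugin JARs that are new or whose timestamp changed since the
    // last scan; a real scanner would then (re)deploy each one.
    List<File> scan(File pluginDir) {
        List<File> changed = new ArrayList<>();
        File[] jars = pluginDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars == null) {
            return changed;
        }
        for (File jar : jars) {
            Long previous = seen.put(jar.getName(), jar.lastModified());
            if (previous == null || previous != jar.lastModified()) {
                changed.add(jar);
            }
        }
        return changed;
    }

    public static void main(String[] args) throws Exception {
        File dir = java.nio.file.Files.createTempDirectory("plugins").toFile();
        new File(dir, "my-plugin.jar").createNewFile();
        PluginScanSketch scanner = new PluginScanSketch();
        System.out.println(scanner.scan(dir).size()); // first pass: plugin detected
        System.out.println(scanner.scan(dir).size()); // unchanged: nothing to deploy
    }
}
```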
<br />
The second approach to installing a plugin involves uploading the plugin file through the web UI. The screenshot below shows the UI for plugin file upload.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQQDFuvxrDkjVavHpeM0BU1EsgDFk8H-4zQzfdDN3zr0c3WD_9BKLpW1hWm7PWuJUqH3cDKtBCeCMSxrUh9AzeRrzH4V5P1VFEt-5O8d9hSswo5duJpwZoFUxGkLV6Cs7dIXH-O0CNBvCP/s1600/plugin_upload.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="259" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQQDFuvxrDkjVavHpeM0BU1EsgDFk8H-4zQzfdDN3zr0c3WD_9BKLpW1hWm7PWuJUqH3cDKtBCeCMSxrUh9AzeRrzH4V5P1VFEt-5O8d9hSswo5duJpwZoFUxGkLV6Cs7dIXH-O0CNBvCP/s320/plugin_upload.png" width="320" /></a></div> Deploying plugins through the web UI is particularly useful when the plugin is on a different file system than the one on which the RHQ server is running. It is worth noting that there currently is no API exposed for installing plugins through the CLI.<br />
<br />
<span style="font-size: large;"><b>Upgrading Plugins </b></span><br />
The platform not only supports deploying new plugins that previously have not been installed in the system, but it also supports upgrading existing plugins. From a user's perspective there really is no difference in upgrading a plugin versus installing one for the first time. The steps are the same. And the RHQ server, for the most part, treats both scenarios the same as well.<br />
<br />
Installing a new or upgraded plugin does not affect any agents that are currently running. Agents have to be explicitly updated in one of a number of ways including,<br />
<ul><li>Restarting the agent</li>
<li>Restarting the plugin container</li>
<li>Issuing the <span style="font-family: "Courier New",Courier,monospace;">plugins update</span> command from the agent prompt </li>
<li>Issuing a resource operation for one of the above. This can be done from the UI, from the CLI, or from a server script.</li>
</ul><b><span style="font-size: large;">Disabling Plugins</span> </b><br />
Installed plugins can be disabled. Disabling a plugin results in agents <i>ignoring</i> that plugin once the agent is restarted (or more precisely, when the plugin container running inside the agent is restarted). The plugin container will not load that plugin, which means resource components, discovery components, and plugin classloaders are not loaded. This results in a reduced memory footprint of the agent. It also reduces overall CPU utilization since the agent's plugin container is performing fewer discovery and availability scans.<br />
<br />
Plugins can be disabled on a per-agent basis allowing for a more heterogeneous deployment of agents. For instance, I might have a web server that is only running Apache and the agent that is monitoring it, while on another machine I have a JBoss AS instance running. I could disable the JBoss-related plugins on the Apache box freeing up memory and CPU cycles. Likewise, I can disable the Apache plugins on the box running JBoss AS.<br />
<br />
When a plugin is disabled, nothing is removed from the database. Any resources already in inventory from the disabled plugin remain in inventory. Type definitions from the disabled plugin also remain in the system.<br />
<br />
<b><span style="font-size: large;">Deleting Plugins</span></b><br />
Recently I have been working on adding support for deleting plugins. Deleting a plugin not only deletes the actual plugin from the system, but also everything associated with it, including all type definitions and all resources of the types defined in the plugin. When disabling a plugin, the plugin container has to be explicitly restarted in order for it to pick up the changes. This is not the case though with deleting plugins. Agents periodically send inventory reports to the server. If the report contains a resource of a type that has been deleted, the server rejects the report and tells the agent that it contains stale resource types. The agent in turn recycles its plugin container, purging its local inventory of any stale types and updating its plugins to match what is on the server. No type definitions, discovery components, or resource components from the plugin will be loaded.<br />
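The report-rejection handshake can be sketched in a few lines of JavaScript. This is purely illustrative code, not the actual RHQ server implementation; the function name, the shape of the report object, and the type ids are all invented for the example.

```javascript
// Hypothetical sketch of the stale-type check described above: an inventory
// report that references a resource type that no longer exists is rejected,
// and the rejection tells the agent which type ids are stale so that it can
// recycle its plugin container and purge them.
function processInventoryReport(report, knownTypeIds) {
  var staleTypeIds = [];
  report.resources.forEach(function(resource) {
    if (knownTypeIds.indexOf(resource.typeId) === -1 &&
        staleTypeIds.indexOf(resource.typeId) === -1) {
      staleTypeIds.push(resource.typeId);
    }
  });
  if (staleTypeIds.length > 0) {
    // Reject the report; the agent reacts by purging its stale types
    return { accepted: false, staleTypeIds: staleTypeIds };
  }
  return { accepted: true, staleTypeIds: [] };
}
```

The essential point is that the server never pushes the deletion to agents; agents discover it the next time one of their reports is rejected.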
<br />
<span style="font-size: large;"><b>Use Cases for Plugin Deletion</b></span> <br />
There are a number of motivating use cases for supporting plugin deletion. The most important of these might be the added ability to downgrade a plugin. But we will also see the benefits plugin deletion brings to the plugin developer.<br />
<br />
<b>Downgrading Plugins</b><br />
We have already mentioned that RHQ supports upgrading plugins. It does not however support downgrading a plugin. Deleting a plugin effectively provides a way to rollback to a previous version of a plugin. There may be times in a production deployment for example when a plugin does not behave as expected or as desired. Users currently do not have the capability to downgrade to a previous version of that plugin. Plugin deletion now makes this possible.<br />
<br />
<b>Working with Experimental Plugins</b><br />
Working with an experimental plugin or one that might not be ready for production use carries with it certain risks. Some of those risks can be mitigated with the ability to disable a plugin; however, the plugin still exists in the system. Resources remain in inventory. Granted those resources can be deleted easily enough, but there is still some margin for error in so far as failing to delete all of the resources from the plugin or accidentally deleting the wrong resources. And there exists no way to remove type definitions such as metric definitions and operation definitions without direct database access. Having the ability to delete a plugin along with all of its type definitions and all instances of those type definitions completely eliminates these risks.<br />
<br />
<b>Simplifying Plugin Development</b><br />
A typical work flow during plugin development includes incremental deployments to an RHQ server as changes are introduced to the plugin. Many if not all plugin developers have run into situations in which they have to blow away their database due to changes made in the plugin (this normally involves changes to type definitions in the plugin descriptor). This slows down development, sometimes considerably. Deleting a plugin should prove much less disruptive to a developer's work flow than having to start with a fresh database installation, particularly when a substantial amount of test data has been built up in the database. To that end, I can really see the utility in a Maven plugin for RHQ plugin development that deploys the RHQ plugin to a development server. The Maven plugin could provide the option to delete the RHQ plugin if it already exists in the system before deploying the new version.<br />
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
Development for the plugin deletion functionality is still ongoing, but I am confident that it will make it into the next major RHQ release. If you are interested in tracking the progress or experimenting with this new functionality, take a look at the <a href="http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=shortlog;h=refs/heads/delete-agent-plugin">delete-agent-plugin</a> branch in the RHQ Git repo. This is where all of the work is currently being done. You can also check out <a href="http://www.rhq-project.org/display/RHQ/Deleting+Agent+Plugins">this design document</a> which provides a high level overview of the work involved.<br />
<br />
<span style="font-size: large;"><b>Dealing with Asynchronous Workflows in the CLI</b></span><br />
<span style="font-size: large;">Introduction</span> <br />
There is constant, ongoing communication between agents and servers in RHQ. Agents, for example, send inventory and availability reports up to the server at regularly scheduled intervals. The server sends down resource-related requests such as updating a configuration or executing a resource operation. Examples of these include updating the connection pool setting for a JDBC data source and starting a JBoss AS server. Some of these work flows are performed in a synchronous manner while others are carried out in an asynchronous fashion. A really good example of an asynchronous work flow is scheduling a resource operation to execute at some point in the future. There is a common pattern used in implementing these asynchronous work flows. We will explore this pattern in some detail and then consider the impacts on remote clients like the CLI.<br />
<br />
<span style="font-size: large;">The Pattern</span><br />
The asynchronous work flows are most prevalent in requests that produce mutative actions against resources. Let's go through the pattern.<br />
<ul><li>A request is made on the server to take some action against a resource (e.g., invoke an operation, update connection properties, update configuration, deploy content, etc.)</li>
<li> The server logs the request on the audit trail </li>
<li>The server sends the request to the agent</li>
<ul><li>Note that control is returned back to the server immediately after sending the request to the agent. This means that the call to the agent will likely return before the requested action has actually been carried out.</li>
</ul>
<li>The plugin container (running in the agent) invokes the appropriate resource component</li>
<li>The resource component carries out the request and reports the results back to the plugin container</li>
<li>The agent sends the response back to the server. The response will indicate success or failure.</li>
<li>The server updates the audit trail indicating that the request has completed and also whether it succeeded or failed.</li>
<ul><li>Note that it is the same request that was originally logged on the audit trail that is updated</li>
</ul></ul>Let's revisit the earlier example of scheduling an operation to start a JBoss server. Suppose I schedule the operation to execute immediately. Then I navigate to the operation history page for the JBoss server. I will see the operation request listed in the history. The history page is a view of the audit trail. The operation shows a status of <i>In Progress</i>. We could continually refresh the page until we see the status change. Eventually it will change to <i>Success</i> or <i>Failure</i>. The status does not necessarily change immediately after the operation completes. It changes after the agent reports the results back to the server and the audit trail is updated.<br />
<br />
As previously stated, this pattern is very common throughout RHQ. Consider making a resource configuration update, which is performed asynchronously as well. Once I submit the configuration update request, I can navigate to the configuration history page to check the status of the request. The status of the update request will show in progress until the agent reports back to the server that the update has completed. When the agent reports back to the server, the corresponding audit trail entry is updated with the results. The same pattern can also be observed when manually adding a new resource into the inventory.<br />
<br />
<span style="font-size: large;">Understanding the Impact to the CLI</span><br />
So what does this asynchronous work flow mean for remote clients, notably CLI scripts? First and foremost, you need to understand when and where requests are carried out asynchronously to avoid unpredictable, unexpected results. We will discuss a number of things that can potentially impact how you think about and how you write CLI scripts.<br />
<br />
<b><span style="font-size: small;">A method that returns without error does not necessarily mean that the operation succeeded</span></b><br />
Let's say we have a requirement to write a script that performs a couple of resource configuration updates, but we only want to perform the second update if the first one succeeds. We might be inclined to implement this as follows,<br />
<br />
<pre class="brush: js">ConfigurationManager.updateResourceConfiguration(resourceId, firstConfig);
ConfigurationManager.updateResourceConfiguration(resourceId, secondConfig);
</pre><br />
Provided we are logged in as a user having the necessary permissions to update the resource configuration and provided the agent is online and available, the first call to <span style="font-family: "Courier New",Courier,monospace;">updateResourceConfiguration</span> will return without error. We proceed to submit the second configuration change, but the first update might have actually failed. With the code as is we could easily wind up violating the requirement of applying the second update only if the first succeeds. What we need to do here essentially is to block until the first configuration update finishes so that we can verify that it did in fact succeed. This can be implemented by polling the <span style="font-family: "Courier New",Courier,monospace;">ResourceConfigurationUpdate</span> object that is returned from the call to <span style="font-family: "Courier New",Courier,monospace;">updateResourceConfiguration</span>.<br />
<br />
<pre class="brush: js">ConfigurationManager.updateResourceConfiguration(resourceId, firstConfig);

var update = ConfigurationManager.getLatestResourceConfiguration(resourceId);
while (update.status == ConfigurationUpdateStatus.INPROGRESS) {
    java.lang.Thread.sleep(2000); // sleep for 2 seconds
    update = ConfigurationManager.getLatestResourceConfiguration(resourceId);
}

if (update.status == ConfigurationUpdateStatus.SUCCESS) {
    ConfigurationManager.updateResourceConfiguration(resourceId, secondConfig);
}</pre><br />
The ResourceConfigurationUpdate object is our audit trail entry. The object's status will change once the resource component (running in the plugin container) finishes applying the update and the agent sends the response back to the server.<br />
<br />
<b>Resource proxies offer some polling support</b><br />
<b> </b><a href="http://www.rhq-project.org/display/JOPR2/Running+the+RHQ+CLI#RunningtheRHQCLI-Proxy">Resource proxies</a> greatly simplify working with a number of the RHQ APIs. Invoking resource operations is one of those enhanced areas. With a resource proxy, operations defined in the plugin descriptor appear as first class methods on the proxy object. This allows us to invoke a resource operation in a much more concise and intuitive fashion. Here is a brief example.<br />
<br />
<pre class="brush: js">var jbossServerId1 = // look up resource id of JBoss server 1
var jbossServerId2 = // look up resource id of JBoss server 2
var server1 = ProxyFactory.getResource(jbossServerId1);
var server2 = ProxyFactory.getResource(jbossServerId2);
server1.start();
server2.start();
</pre><br />
The call to <span style="font-family: "Courier New",Courier,monospace;">server1.start()</span> does not immediately return. It polls the status of the operation waiting for it to complete. The proxy sleeps for a short delay and then fetches the <span style="font-family: "Courier New",Courier,monospace;">ResourceOperationHistory</span> object that was logged for the request. If a history object is found and if its status is something other than in progress, then the proxy returns the operation's results. If the history object indicates that the operation has not yet completed, the proxy will continue polling.<br />
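The polling behavior just described can be sketched roughly as follows. This is a simplified illustration, not the actual proxy source; fetchHistory and sleep are stand-ins for the real history lookup and thread sleep.

```javascript
// Simplified sketch of the proxy's polling loop: a fixed one-second delay,
// at most ten attempts, returning whatever history is available when the
// attempts run out. fetchHistory() returns the latest history object (or
// null if none has been logged yet); sleep(ms) pauses the current thread.
function pollOperationHistory(fetchHistory, sleep) {
  var history = null;
  for (var i = 0; i < 10; i++) {   // the number of polling intervals is fixed
    sleep(1000);                    // the delay is fixed at one second
    history = fetchHistory();
    if (history !== null && history.status !== 'INPROGRESS') {
      return history;               // operation finished; return its results
    }
  }
  return history;  // may be null or still in progress after ten attempts
}
```

The limitations discussed below, the fixed delay, the fixed attempt count, and the possibility of returning an incomplete result, all fall out of this structure.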
<br />
Resource proxies provide some great abstractions that simplify working in the CLI. The polling that is done behind the scenes for resource operations is yet another useful abstraction in that it makes a resource operation request look like a regular, synchronous method call. The polling however, is somewhat limited. We will take a closer look at some of the implementation details to better understand how it all works.<br />
<br />
<b>The delay or sleep interval is fixed </b><br />
The thread in which the proxy is running sleeps for one second before it polls the history object. There is currently no way to specify a different delay or sleep interval. In many cases the one second delay should be suitable, but there might be situations in which a shorter or longer delay is preferred.<br />
<br />
<b>The number of polling intervals is fixed</b><br />
The proxy will poll the <span style="font-family: "Courier New",Courier,monospace;">ResourceOperationHistory</span> at most ten times. There is currently no way to specify a different number of intervals. If after ten times, the history still has a status of in progress, the proxy simply returns the incomplete results. Or if no history is available, null is returned. In many cases the polling delays and intervals may be sufficient for operations to complete, but there is no guarantee.<br />
<br />
<b>The proxy will not poll indefinitely</b><br />
This is really an extension of the last point about not being able to specify the number of polling intervals. There may be times when you want to block indefinitely until the operation completes. Resource proxies currently do not offer this behavior.<br />
<br />
<b>Polling cannot be performed asynchronously</b><br />
Let's say we want to start ten JBoss servers in succession. We want to know whether or not they start up successfully, but we are not concerned with the order in which they start. In this example some form of asynchronous polling would be appropriate. Let's further assume that each proxy winds up polling the maximum of ten intervals. Each call to <span style="font-family: "Courier New",Courier,monospace;">server.start()</span> will take a minimum of ten seconds plus whatever time it takes to retrieve and check the status of the <span style="font-family: "Courier New",Courier,monospace;">ResourceOperationHistory</span>. We can then conclude that it will take at least 100 seconds to invoke the start operation on all of the JBoss servers. This could turn out to be very inefficient. In all likelihood, it would be faster to schedule the start operation, have control return back to the script immediately, and then schedule each subsequent operation. Then the script could block until all of the operations have completed. <br />
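The schedule-then-poll approach might be sketched like this. scheduleStart and fetchHistory are stand-ins for the real server calls; this is an illustration of the idea, not an API the CLI currently provides.

```javascript
// Hypothetical sketch of scheduling all operations first and only then
// blocking: phase 1 fires off every start operation without waiting, and
// phase 2 polls all of the pending history objects until each one has left
// the in-progress state. A real script would also sleep between polling
// passes and cap the total number of attempts.
function startAll(serverIds, scheduleStart, fetchHistory) {
  // Phase 1: schedule every operation; scheduleStart returns a history id
  var pending = serverIds.map(scheduleStart);

  // Phase 2: keep polling until no operation is still in progress
  var results = [];
  while (pending.length > 0) {
    pending = pending.filter(function(historyId) {
      var history = fetchHistory(historyId);
      if (history.status === 'INPROGRESS') {
        return true;  // not done yet; keep polling this one
      }
      results.push(history);
      return false;   // done; record the result and stop polling it
    });
  }
  return results;
}
```

With this shape, the total wall-clock time is governed by the slowest operation rather than the sum of all of them.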
<br />
As an aside, the previous example might better be solved by creating a resource group for the JBoss servers and then invoking the operation once on the entire group. The problems however, still manifest themselves with resource groups. Suppose we want to call operation O1 on resource group G1, followed by a call to operation O2 on group G2, followed by O3 on G3, etc. We are essentially faced with the same problems but now on a larger scale.<br />
<br />
<span style="font-size: large;"> There is no uniform Audit Trail API</span><br />
Scheduling a resource operation, submitting a resource configuration update, deploying content, etc. are generically speaking all operations that involve submitting a request to an agent (or multiple agents in the case of a group operation) for some mutative change to be applied to one or more resources. In each of the different scenarios, an entry is persisted on the respective audit trails. For example, with a resource operation, a <span style="font-family: "Courier New",Courier,monospace;">ResourceOperationHistory</span> object is persisted. When deploying a new resource (i.e., a WAR file), a <span style="font-family: "Courier New",Courier,monospace;">CreateResourceHistory</span> object is persisted. With a resource configuration change, a <span style="font-family: "Courier New",Courier,monospace;">ResourceConfigurationUpdate</span> is persisted. Each of these objects exposes a status property that indicates whether the request is in progress, has succeeded, or has failed. Each of them also exposes an error message property that is populated if the request fails or an unexpected error occurs.<br />
<br />
Unfortunately, there is no common base class shared among these audit trail classes in which the status and error message properties are defined. This makes writing a generic polling solution more challenging, at least if the solution is to be implemented in Java. A solution in a dynamic language, like JavaScript, might prove easier since we can rely on <i>duck typing</i>. We could implement a generic solution that works with a <span style="font-family: "Courier New",Courier,monospace;">status</span> property, without regard to an object's type.<br />
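To illustrate, a duck-typed polling helper might look something like the following sketch. The names waitForCompletion, fetchLatest, and isDone are hypothetical; nothing like this currently ships with the CLI.

```javascript
// Generic, duck-typed polling helper: it works with any audit trail object
// that exposes a 'status' property, without caring about the object's
// concrete type (ResourceOperationHistory, ResourceConfigurationUpdate, etc.).
// fetchLatest returns the current audit trail object, isDone decides when to
// stop, and maxAttempts bounds the number of fetches.
function waitForCompletion(fetchLatest, isDone, maxAttempts) {
  var entry = fetchLatest();
  var attempts = 1;
  while (!isDone(entry) && attempts < maxAttempts) {
    // In a real CLI script we would sleep here between polls,
    // e.g. java.lang.Thread.sleep(2000)
    entry = fetchLatest();
    attempts++;
  }
  return entry;
}
```

Because JavaScript only cares that the object has a status property, one helper covers all of the audit trail types despite the lack of a common base class in the Java API.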
<br />
<span style="font-size: large;">Conclusion</span><br />
It is important to understand the work flows and communication patterns described here as well as the current limitations in resource proxies in order to write effective CLI scripts that have consistent behavior and predictable results. Consistent behavior and predictable results certainly do not mean that the same results are produced every time a script is run. It does mean though that given certain conditions, we can make valid assumptions that hold to be true. For example, if we execute a resource operation and then block until the <span style="font-family: "Courier New",Courier,monospace;">ResourceOperationHistory</span> status has changed to <span style="font-family: "Courier New",Courier,monospace;">SUCCESS</span>, then we can reasonably assume that the operation did in fact complete successfully.<br />
<br />
Many of the work flows in RHQ are necessarily asynchronous, and this has to be taken into account when working with a remote client like the CLI. Fortunately, there are many ways we can look to encapsulate much of this, shielding developers from the underlying complexities while at the same time not limiting developers in how they choose to deal with these issues.<br />
<br />
<span style="font-size: large;"><b>Updating Metric Collection Schedules</b></span><br />
One of the primary features of RHQ is <a href="http://rhq-project.org/display/JOPR2/Monitoring">monitoring</a>. Metrics can be collected for resources across the inventory; that metric data is aggregated and made available for viewing in graphs and tables. Measurements are collected at scheduled intervals. These collection schedules are configurable on a per-resource or per-group basis. Collection intervals can be increased or decreased, and metric collections can be turned on or off as well. There are any number of reasons why you might want to adjust metric collection schedules. One reason might be that collections are occurring too frequently and in turn causing performance degradation on the host machine.<br />
<br />
RHQ exposes APIs for updating measurement schedules through the CLI. I find that some of the APIs exposed through the CLI are not the most intuitive or require a thorough understanding of the RHQ domain classes and APIs. Some of the APIs for dealing with measurements fall into this category. I have started working on putting together some utility scripts that can serve as higher-level building blocks for CLI scripts. My aim is to simplify various, common tasks when and where possible. I put together a JavaScript class that offers a more data-driven approach for updating measurement schedules. Let's look at an example.<br />
<br />
<script src="https://gist.github.com/2921375.js?file=metrics1.js"></script>
<br />
The first thing we do is create an instance of <span style="font-family: 'Courier New', Courier, monospace;">MeasurementModule</span>. This class exposes properties and methods for working with measurements, most notably the method <span style="font-family: 'Courier New', Courier, monospace;">updateSchedules</span>. We call this method on line 15. <span style="font-family: 'Courier New', Courier, monospace;">updateSchedules</span> takes a single object which specifies the measurement schedule changes. That object is defined on lines 5 - 13. It has to define three properties - <span style="font-family: 'Courier New', Courier, monospace;">context</span>, <span style="font-family: 'Courier New', Courier, monospace;">id</span>, and <span style="font-family: 'Courier New', Courier, monospace;">schedules</span>.<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">context</span> accepts only two values, 'Resource' or 'Group'. This property declares whether the update is for an individual resource or for a resource group.<br />
<br />
The value of <span style="font-family: 'Courier New', Courier, monospace;">context</span> determines how <span style="font-family: 'Courier New', Courier, monospace;">id</span> is interpreted. It refers either to a resource id or to a resource group id.<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">schedules</span> is essentially a map that declares which schedules are to be updated. The keys are the measurement display names as you see them in the RHQ UI. Each value can be one of three things: the string 'enabled', the string 'disabled', or an integer collection interval. The first two indicate that collection of that measurement should be enabled or disabled respectively. The collection interval is stored in milliseconds, though most collection intervals are on the order of minutes. <br />
<br />
It is a lot easier to read and write 30 minutes instead of 1800000 milliseconds. To address this, <span style="font-family: 'Courier New', Courier, monospace;">MeasurementModule</span> exposes the <span style="font-family: 'Courier New', Courier, monospace;">interval</span> method, to which we declare a reference on line 2, to facilitate calculating the interval in a more readable way. We use <span style="font-family: 'Courier New', Courier, monospace;">MeasurementModule.time</span> in conjunction with this method. <span style="font-family: 'Courier New', Courier, monospace;">time</span> has three properties - <span style="font-family: 'Courier New', Courier, monospace;">seconds</span>, <span style="font-family: 'Courier New', Courier, monospace;">minutes</span>, <span style="font-family: 'Courier New', Courier, monospace;">hours</span>. We see these in use on lines 9 and 11.<br />
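To make the relationship concrete, here is a minimal sketch of how time and interval might fit together. This is an illustration only; the actual MeasurementModule implementation may differ.

```javascript
// Minimal sketch (the real MeasurementModule may differ): 'time' maps each
// unit name to its length in milliseconds, and interval(amount, unit)
// converts a readable quantity like "30 minutes" into the millisecond value
// that RHQ stores for a collection interval.
var time = {
  seconds: 1000,
  minutes: 60 * 1000,
  hours: 60 * 60 * 1000
};

function interval(amount, unit) {
  return amount * unit;
}

// interval(30, time.minutes) reads much better than writing 1800000 directly
```

The point of the helper is purely readability; the schedule update still receives a plain millisecond value.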
<br />
I think (hope) <span style="font-family: 'Courier New', Courier, monospace;">MeasurementModule</span> offers a fairly straightforward approach for updating measurement schedules. It should allow you to make programmatic updates without having an in-depth understanding of the underlying APIs.<br />
<br />
There is at least one additional enhancement that I already intend to make. I want to provide a way to specify an update that applies to multiple schedules. Maybe something along these lines,<br />
<br />
<script src="https://gist.github.com/2921379.js?file=metrics2.js"></script>
<br />
In this example we are updating schedules for a compatible group. We specify a collection interval of one hour for Measurement A, disable Measurement B, and all of the rest of the measurements are set to a collection interval of 30 minutes. I see this as a useful feature and will write another post when I have it implemented.<br />
<br />
There is one last annoying detail that needs to be discussed before you start using <span style="font-family: 'Courier New', Courier, monospace;">MeasurementModule</span>. The class uses some of the functions described in <a href="http://johnsanda.blogspot.com/2010/08/utility-functions-for-rhq-cli.html">Utility Functions for RHQ CLI</a>. Those functions are defined in another source file. When using the CLI in interactive mode, you can use the <span style="font-family: 'Courier New', Courier, monospace;">exec</span> command to execute a script. That command (or an equivalent method/function) however is not available in non-interactive mode. I filed a bug for this a little while back. You can track the progress <a href="https://bugzilla.redhat.com/show_bug.cgi?id=621545">here</a> if you are interested. This means that for now you need to run in interactive mode. Let's walk through a final example tying all of this together. Assume we have already logged in through the CLI.<br />
<br />
<pre class="brush: plain">rhqadmin@localhost:7080$ exec -f path/to/util.js
rhqadmin@localhost:7080$ exec -f path/to/measurement_utils.js
rhqadmin@localhost:7080$ measurementModule = new MeasurementModule()
rhqadmin@localhost:7080$ // do some stuff with measurementModule
</pre><br />
<span style="font-family: 'Courier New', Courier, monospace;">MeasurementModule</span> is defined in the file measurement_utils.js. A few scripts including measurement_utils.js are shipped in the latest release of RHQ, which can be downloaded from <a href="http://sourceforge.net/projects/rhq/files/rhq/rhq-4.1.0/">here</a>. They are packaged in the CLI in the samples directory.<br />
<br />
<span style="font-size: large;"><b>Find and Replace</b></span><br />
I came across a post on the clojure users list the other day that discussed some code that read a file, removed a specified character from each line in the file, and then wrote the file back out to disk. It got me thinking about how I might implement some basic file editing functions. I started with a find function. (Please note for brevity and for clarity the code snippets that follow only include the <span style="font-family: "Courier New",Courier,monospace;">ns</span> function call when I introduce a function from a different name space for the first time.) Here is the initial implementation.<br />
<br />
<pre class="brush: clojure">(ns clj-sandbox.io
  (:use [clojure.contrib.duck-streams :only [read-lines]]))

(defn find [file s]
  (filter #(.contains % s) (read-lines file)))
</pre><br />
This returns a sequence of all lines in the file that contain the string <span style="font-family: "Courier New",Courier,monospace;">s</span>. So far so good. I would like to get the line numbers as well though. I was ready to implement a solution using <span style="font-family: "Courier New",Courier,monospace;">recur</span>, but then I started thinking that there is probably a more functional way of achieving what I want. I started reading through the chapter on functional programming in <a href="http://pragprog.com/titles/shcloj/programming-clojure">Programming Clojure</a> (chapter 5) to spark some ideas. That led me to some of the example code which in turn led me to <span style="font-family: "Courier New",Courier,monospace;">clojure.contrib.seq/indexed</span>. And now I have this function,<br />
<br />
<pre class="brush: clojure">(ns clj-sandbox.io
  (:use [clojure.contrib.duck-streams :only [read-lines]])
  (:use [clojure.contrib.seq :only [indexed]]))

(defn find [file s]
  (map (fn [match] {:line-number (inc (first match)) :text (second match)})
       (filter #(.contains (second %) s) (indexed (read-lines file)))))
</pre><br />
which returns a lazy sequence of lines and corresponding line numbers that contain at least one occurrence of <span style="font-family: "Courier New",Courier,monospace;">s</span>. The sequence is composed of maps containing two keys, <span style="font-family: "Courier New",Courier,monospace;">:line-number</span> and <span style="font-family: "Courier New",Courier,monospace;">:text</span>. At this point I am pretty happy with my function and am ready to consider some additional enhancements. First, I want the capability to get back just the line numbers or just the text of each line. This is easily accomplished with a bit of refactoring.<br />
<br />
<pre class="brush: clojure">(defn find
  ([file s opt] (map #(opt %) (find file s)))
  ([file s]
   (map (fn [match] {:line-number (inc (first match)) :text (second match)})
        (filter #(.contains (second %) s) (indexed (read-lines file))))))
</pre><br />
Now I have a version that takes a third argument, <span style="font-family: "Courier New",Courier,monospace;">opt</span>, which should be one of the map keys, <span style="font-family: "Courier New",Courier,monospace;">:line-number</span> or <span style="font-family: "Courier New",Courier,monospace;">:text</span>. To get a sequence of just the line numbers I can write,<br />
<br />
<pre class="brush: plain">user> (find "myfile" "mystring" :line-number)
</pre><br />
At this point I am satisfied with <span style="font-family: "Courier New",Courier,monospace;">find</span> and ready to move onto a replace function. Here is a first cut at <span style="font-family: "Courier New",Courier,monospace;">replace</span>.<br />
<br />
<pre class="brush: clojure">(ns clj-sandbox.io
  (:use [clojure.contrib.duck-streams :only [read-lines]])
  (:use [clojure.contrib.seq :only [indexed]])
  (:use [clojure.contrib.string :only [replace-str]]))
(defn replace [file s r] (map #(replace-str s r (:text %)) (find file s)))
</pre><br />
This function replaces every occurrence of <span style="font-family: "Courier New",Courier,monospace;">s</span> with <span style="font-family: "Courier New",Courier,monospace;">r</span>, and the results are returned as a lazy sequence. I need to update the function to write the changes back to the file. Initially I consider,<br />
<br />
<pre class="brush: clojure">(defn replace [f s r]
  (write-lines (str "." f) (map #(replace-str s r (:text %)) (find f s)))
  (copy (file (str "." f)) (file f)))
</pre><br />
This implementation, however, is problematic. Changes are written to a copy of the file, and when that is done, the original file is replaced with the copy. Only those lines that match the search string are written back to the file; so, we wind up altogether losing lines that should be left intact. The function needs to write every line back to the file, including those that have not been modified. I decide to take out the call to <span style="font-family: "Courier New",Courier,monospace;">find</span> since it does not return all lines and replace it with a call to <span style="font-family: "Courier New",Courier,monospace;">read-lines</span>.<br />
<br />
<pre class="brush: clojure">(ns clj-sandbox.io
  (:use [clojure.contrib.duck-streams :only [read-lines write-lines]])
  (:use [clojure.contrib.seq :only [indexed]])
  (:use [clojure.contrib.string :only [replace-str]])
  (:use [clojure.java.io :only [file copy]]))

(defn replace [f s r]
  (write-lines (str "." f) (map #(replace-str s r %) (read-lines f)))
  (copy (file (str "." f)) (file f)))
</pre><br />
This gives me the behavior that I want; however, I notice some duplication with creating the backup file name. That can easily be eliminated with a <span style="font-family: "Courier New",Courier,monospace;">let</span> binding.<br />
<br />
<pre class="brush: clojure">(defn replace [f s r]
  (let [new-file (str "." f)]
    (write-lines new-file (map #(replace-str s r %) (read-lines f)))
    (copy (file new-file) (file f))))
</pre><br />
Now I have a couple of functions that can be used to perform a global find and replace. I spent some time working on functions that find/replace a specified number of matches. There is some additional effort needed for these, however, because I need to keep a running total of matches. Suppose I want the first 3 matches of some string. The first two occur on line 6, and the third match occurs on line 9. With the <span style="font-family: "Courier New",Courier,monospace;">find</span> function that has been discussed here, we can determine that we have matches on lines 6 and 9, but we cannot determine how many matches there are per line. I will try to revisit this in a future post. The source code for this can be found on <a href="http://github.com/jsanda/clj-sandbox">github</a>.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com0tag:blogger.com,1999:blog-1563502382858123691.post-13654375207197400292010-08-25T09:47:00.007-04:002010-09-01T09:12:28.200-04:00Building an RHQ ClientRHQ exposes a set of APIs that can be used to build a remote client. The <a href="http://www.rhq-project.org/display/JOPR2/Running+the+RHQ+CLI">CLI</a> is one such consumer of those APIs. The <a href="http://www.rhq-project.org/display/JOPR2/Building+a+Remote+Java+Client">documentation for building your own remote client</a> says to get the needed libraries from the CLI distribution. If you are using a build tool such as Maven, Gradle, or Buildr that provides dependency handling, it would be easier to simply be given the dependency information needed to integrate RHQ into your existing build.<br />
<br />
Last week I started building my own client and hit a couple snags. The CLI pulls in dependencies from the module rhq-remoting-client-api; so, I hoped that declaring a dependency on that module was all that I needed.<br />
<br />
<pre class="brush: xml"><dependencies>
  <dependency>
    <groupId>org.rhq</groupId>
    <artifactId>rhq-remoting-client-api</artifactId>
    <version>3.0.0</version>
  </dependency>
</dependencies>
</pre><br />
This pulls in a number of libraries. When I started my simple client though, I got the following error,<br />
<br />
<pre class="brush: java;">java.lang.NoClassDefFoundError: EDU/oswego/cs/dl/util/concurrent/SynchronizedLong (NO_SOURCE_FILE:3)
</pre><br />
After a little digging I realized that I needed to add the following dependency.<br />
<br />
<pre class="brush: xml;"><dependency>
  <groupId>oswego-concurrent</groupId>
  <artifactId>concurrent</artifactId>
  <version>1.3.4-jboss-update1</version>
</dependency>
</pre><br />
After rebuilding and restarting my client, I got a different exception.<br />
<br />
<pre class="brush: java;">org.jboss.remoting.CannotConnectException: Can not connect http client invoker. Response: OK/200.
at org.jboss.remoting.transport.http.HTTPClientInvoker.useHttpURLConnection(HTTPClientInvoker.java:348)
at org.jboss.remoting.transport.http.HTTPClientInvoker.transport(HTTPClientInvoker.java:137)
at org.jboss.remoting.MicroRemoteClientInvoker.invoke(MicroRemoteClientInvoker.java:122)
at org.jboss.remoting.Client.invoke(Client.java:1634)
at org.jboss.remoting.Client.invoke(Client.java:548)
at org.jboss.remoting.Client.invoke(Client.java:536)
at org.rhq.enterprise.client.RemoteClientProxy.invoke(RemoteClientProxy.java:201)
at $Proxy0.login(Unknown Source)
</pre><br />
I started comparing the dependencies that I was pulling in versus what the CLI used. I was definitely pulling in some additional libraries that are not included in the CLI. While the above exception does not convey a whole lot of information, I knew that it was occurring because I had classes on my local classpath that I should be getting from the server. When I managed to get my dependencies to match up with the CLI, I got past the exceptions.<br />
<br />
For other folks who want to build their own RHQ client, this dependency situation can be a bit of a mess. Last night I pushed a new module into the RHQ git repo that alleviates this problem. Now you only need the following dependency,<br />
<br />
<pre class="brush: xml;"><dependency>
  <groupId>org.rhq</groupId>
  <artifactId>remote-client-deps</artifactId>
  <version>4.0.0-SNAPSHOT</version>
  <type>pom</type>
</dependency>
</pre><br />
With this, you will pull in only those dependencies that are needed to build your own RHQ client. You do not need to declare any additional RHQ dependencies. You can view the source for the remote-client-deps module <a href="http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=tree;f=modules/enterprise/remoting/client-deps">here</a>.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com0tag:blogger.com,1999:blog-1563502382858123691.post-89337975234876597782010-08-18T10:46:00.000-04:002010-08-18T10:46:03.247-04:00Utility Functions for RHQ CLII have done a fair amount of programming in Groovy, and one of the things that I have quickly gotten accustomed to is closures. Among other things, closures provide an elegant solution for things like iteration by encapsulating control flow. The following example illustrates this.<br />
<br />
<pre class="brush: groovy;">[1, 2, 3].each { println it }
</pre><br />
When working in the CLI however, I have to fall back to a for loop. In the case of a JavaScript array, I do not have to worry about a loop counter.<br />
<br />
<pre class="brush: javascript;">var array = [1, 2, 3];
for (i in array) println(array[i]);
</pre><br />
A lot of the remote APIs, particularly query methods, return a java.util.List. In these situations, I have to use a loop counter.<br />
<br />
<pre class="brush: javascript;">var list = new java.util.ArrayList();
list.add(1);
list.add(2);
list.add(3);
for (i = 0; i < list.size(); i++) println(list.get(i));
</pre>Recently I had to write a script in which I was executing a number of criteria queries and then iterating over the results. Needless to say, I quickly found myself missing the methods in languages like Groovy and Ruby that provide control flow with closures; so, I wrote a few utility functions to make things a bit easier. <br />
<pre class="brush: javascript;">// Iterate over a JS array
var array = [1, 2, 3];
foreach(array, function(number) { println(number); });
// Iterate over a java.util.Collection
var list = new java.util.ArrayList();
list.add(1);
list.add(2);
list.add(3);
foreach(list, function(number) { println(number); });
// Iterate over a java.util.Map
var map = new java.util.HashMap();
map.put(1, "ONE");
map.put(2, "TWO");
map.put(3, "THREE");
foreach(map, function(key, value) { println(key + " --> " + value); });
// Iterate over an object
var obj = {x: "123", y: "456", z: "678"};
foreach(obj, function(name, value) { println(name + ": " + value); });
</pre>The <span style="font-family: "Courier New",Courier,monospace;">foreach</span> function is fairly robust in that it provides iteration over several different types including JavaScript arrays, Java collections and maps, and generic objects. In the case of an array or collection, the callback function takes a single argument. That argument will be each of the elements in the array or collection. In the case of a map or object, the callback function is passed two arguments. For the map the callback is passed the key and the value of each entry. For a generic object, the callback is passed each of the object's properties' names and values.<br />
<br />
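For reference, here is a rough sketch of how such a foreach might be written. This is not the actual util.js implementation; it handles only JavaScript arrays and plain objects, whereas the real function also dispatches on java.util collections and maps.

```javascript
// Simplified foreach sketch: JS arrays invoke fn(element); plain objects
// invoke fn(name, value). The java.util.Collection and java.util.Map
// cases handled by the real util.js function are omitted here.
function foreach(obj, fn) {
  if (obj instanceof Array) {
    for (var i = 0; i < obj.length; i++) {
      fn(obj[i]);
    }
  } else {
    for (var name in obj) {
      fn(name, obj[name]);
    }
  }
}

var total = 0;
foreach([1, 2, 3], function(number) { total += number; });

var props = [];
foreach({x: "123", y: "456"}, function(name, value) { props.push(name + "=" + value); });
```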
The <span style="font-family: "Courier New",Courier,monospace;">find</span> function iterates over an array, collection, map, or object in the same way that <span style="font-family: "Courier New",Courier,monospace;">foreach</span> does. The callback function though is a predicate that should evaluate to true or false. Iteration will stop when the first match is found, that is when the predicate returns true, and that value will be returned. Here are a couple examples to illustrate its usage. <br />
<pre class="brush: javascript;">// Find first number less than 3
var array = [1, 2, 3];
// prints "found: 1"
println("found: " + find(array, function(number) { return number < 3; }));
// Find map entry with a value of 'TWO'
var map = new java.util.HashMap();
map.put(1, "ONE");
map.put(2, "TWO");
map.put(3, "THREE");
var match = find(map, function(key, value) { return value == 'TWO'; });
// prints "found: 2.0 --> TWO"
println("found: " + match.key + " --> " + match.value);
</pre><br />
When <span style="font-family: "Courier New",Courier,monospace;">find</span> is iterating over a generic object or map, the first match will be returned as an object with two properties - key and value. Lastly, there is a <span style="font-family: "Courier New",Courier,monospace;">findAll</span> function that is similar to <span style="font-family: "Courier New",Courier,monospace;">find</span> except that it returns all matches in a java.util.List.<br />
<br />
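To round out the picture, here is an illustrative sketch of the findAll logic over a JavaScript array. The real implementation in util.js also accepts collections, maps, and objects, and returns a java.util.List rather than a JS array.

```javascript
// Simplified findAll sketch: collects every element of a JS array for
// which the predicate returns true. (The real util.js version returns
// a java.util.List and handles other iterable types as well.)
function findAll(array, predicate) {
  var matches = [];
  for (var i = 0; i < array.length; i++) {
    if (predicate(array[i])) {
      matches.push(array[i]);
    }
  }
  return matches;
}

var evens = findAll([1, 2, 3, 4], function(number) { return number % 2 == 0; });
```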
These functions can be found in the RHQ git repo at <a href="http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=blob;f=etc/cli-scripts/util.js">rhq/etc/cli-scripts/util.js</a>. These functions are neither part of nor distributed with the CLI; however, they may be included in a future release so that they are available to any CLI script.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com6tag:blogger.com,1999:blog-1563502382858123691.post-30069590908308738772010-08-17T12:45:00.001-04:002010-08-20T08:49:46.258-04:00Auto Import Resources into InventoryThere is currently no way in the RHQ UI to auto-import resources. You have to go to the discovery queue to explicitly select resources to import. Here is a short CLI script for auto-importing resources.<br />
<br />
<pre class="brush: javascript;">// auto_import.js
rhq.login('rhqadmin', 'rhqadmin');
var resources = findUncommittedResources();
var resourceIds = getIds(resources);
DiscoveryBoss.importResources(resourceIds);
rhq.logout();
// returns a java.util.List of Resource objects
// that have not yet been committed into inventory
function findUncommittedResources() {
  var criteria = ResourceCriteria();
  criteria.addFilterInventoryStatus(InventoryStatus.NEW);
  return ResourceManager.findResourcesByCriteria(criteria);
}
// returns an array of ids for a given list
// of Resource objects. Note the resources argument
// can actually be any Collection that contains
// elements having an id property.
function getIds(resources) {
  var ids = [];
  for (var i = 0; i < resources.size(); i++) {
    ids[i] = resources.get(i).id;
  }
  return ids;
}
</pre><br />
In the function findUncommittedResources() we query for Resource objects having an inventory status of NEW. This results in a query that retrieves discovered resources that have been "registered" with the RHQ server (i.e., stored in the database) but not yet committed into inventory.<br />
<br />
DiscoveryBoss is one of the remote EJBs exposed by the RHQ server to the CLI. It provides a handful of inventory-related operations. We then call DiscoveryBoss.importResources(), which takes an array of resource ids.<br />
<br />
In a follow-up post we will use some additional CLI features to parametrize this script so that we have more control of what gets auto-imported.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com6tag:blogger.com,1999:blog-1563502382858123691.post-66631185990015633982008-03-28T22:59:00.000-04:002008-03-28T23:04:21.825-04:00Test Fixture StrategiesIn his book, <a title="xUnit Test Patterns" href="http://xunitpatterns.com/" id="tbnp">xUnit Test Patterns</a>, Gerard Meszaros provides an in-depth, analytical discourse on unit testing patterns. Meszaros provides prerequisite information before getting into the patterns. There is a section on test fixtures that I found particularly useful. He defines a <i id="gv3m">test fixture</i> as everything that we need to exercise the system under test (SUT) - in other words, the pre-conditions of the test. Let's suppose we are testing an XML parser. Our test fixture will include an XML document that will be fed to the parser. The <i id="fj1h">fixture setup</i> is the part of the test logic that is executed to set up the test fixture. Continuing with our parser example, our fixture setup might require reading an XML document from the file system, or it may involve constructing a document in memory at runtime. After defining some terminology, Meszaros goes through common test fixture strategies. These strategies lay the ground work for the patterns discussed in the book. In fact, it quickly becomes apparent that understanding these strategies plays a big role in getting the most out of these patterns. <b id="grfa"><br /><br />Transient Fresh Fixture</b><br />A transient fresh fixture exists only in memory and only during the test in which it is used. It does not outlive the test. Fixture tear down is implicit (assuming a language that provides garbage collection). The fixture is created at the start of the test, and it is discarded at the end of the test. Each test creates its own fixture. 
In other words, the test creates the objects that it needs. Creation of those objects might be delegated to some helper object, but it is the test itself that initiates the creation. The test does not re-use any part of a pre-built fixture or a fixture from another test. If we elect to use a transient fixture for our XML parser, then the test must create the document that will be fed to the parser. The primary disadvantage of a transient fresh fixture is that it must be created for each and every test. In some situations this may lead to performance degradation. Despite this potential drawback, transient fresh fixtures offer the best avenue for keeping fixture logic clear and simple, thus resulting in tests that serve as documentation. The benefits of not having to deal with tear down logic simply cannot be overstated. <b id="xho2"><br /><br />Persistent Fresh Fixture</b><br />A persistent fresh fixture lives beyond the test method in which it is used. It requires explicit tear down at the end of each test. We often wind up using this fixture when we are testing objects that are tightly coupled to a database. Let's revisit our parser example. Suppose we need to add a test that verifies that the parser can handle consuming documents from the file system. For the fixture set up, the test creates an XML document and then writes it to disk so that we can exercise our parser for this scenario. So far, our test is pretty similar to one that is using a transient fresh fixture. The difference, however, reveals itself with tear down. The test using a transient fresh fixture does not have to worry about doing any tear down; it is implicit. Our test, on the other hand, must explicitly tear down the fixture. We could implement this easily enough by deleting the document from the file system. It is worth mentioning that this is a pretty straightforward example of tearing down a persistent fresh fixture.
Things can quickly get more complicated, particularly when dealing with a database. In these situations, we can easily wind up with <a title="obscure tests" href="http://xunitpatterns.com/Obscure%20Test.html" id="pa5m">obscure tests</a>. Another test smell that is often encountered with persistent fresh fixtures is <a title="slow tests" href="http://xunitpatterns.com/Slow%20Tests.html" id="kckt">slow tests</a>. This usually occurs as a result of the fixture having a high-latency dependency. For example, if we have to create our XML document on a remote file system over the network, we will likely experience high latency. High latency is commonly encountered when a database is involved. <br /><br /><b id="ftvk">Shared Fixture</b><br />A shared fixture is deliberately reused across different tests. Let's say that our parser has special requirements for handling very large documents. A shared fixture may seem like a logical approach in this situation. The advantage is improved execution time of tests since we cut out a lot of the set up and tear down work required. The primary disadvantage of this strategy is that it easily leads to <a title="interacting tests" href="http://xunitpatterns.com/Erratic%20Test.html#Interacting%20Tests" id="z2hq">interacting tests</a>. Interacting tests is an anti-pattern in which there is an inter-dependency among tests. Let's suppose that our parser needs to support both reading from and writing to XML documents. We could very quickly wind up with interacting tests. One test modifies the document while another test reads the document. If the document is expected to be in a particular state, then the latter test could easily break as a result of the former test (which modifies the document) running first.
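The interaction is easy to demonstrate with a deliberately bad example (the names here are invented for illustration):

```javascript
// A shared, mutable fixture deliberately reused across tests.
var sharedDoc = { items: ["a", "b"] };

// This read-only test assumes the fixture still holds two items...
function testDocHasTwoItems() {
  return sharedDoc.items.length === 2;
}

// ...but this test mutates the shared fixture, so testDocHasTwoItems
// now passes or fails depending on which test happens to run first.
function testRemoveLastItem() {
  sharedDoc.items.pop();
  return sharedDoc.items.length === 1;
}
```

Run testDocHasTwoItems first and both tests pass; reverse the order and the read-only test fails even though neither test changed. The tests now interact through the fixture.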
When using a shared fixture, a couple of questions should be considered: <ul id="od:2"><li id="uprs">To what extent should the fixture be shared?</li><li id="j8qa">How often do we rebuild the fixture?</li></ul> Should we reuse our XML document across multiple test cases? Across the entire test suite? In general, we want to minimize the extent to which we share our fixture. As for how often we should rebuild the fixture, that may depend on a number of factors. In the case of an immutable fixture, we might very well be able to forgo rebuilding the fixture altogether. Let's revisit a scenario in which we need to test both read and write operations for our parser. If we can guarantee the order of tests, then we can arrange for all of the read-only tests to run in sequence. Then for those tests we do not have to worry about rebuilding the fixture in between runs. <b id="l-81"><br /><br />Conclusion<br /></b> In most circumstances, a transient fresh fixture is the best strategy because it simply does not have to deal with the challenges presented by the other fixture strategies, namely fixture tear down. There are times when it is all but impossible to avoid using either a persistent fresh fixture or a shared fixture. Data access tests involving a database are the most prevalent example. Understanding the ramifications of the other fixture strategies is crucial to writing effective tests when they must be used; otherwise, we inevitably fall victim to the anti-patterns presented by Meszaros.
Just as an understanding of the more mainstream patterns like the widely embraced <a title="GoF patterns" href="http://en.wikipedia.org/wiki/Design_Patterns" id="ckpt">GoF patterns</a> leads to better designed software, an understanding of the sundry testing patterns leads to more effective tests, which in turn ultimately leads to better software.Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com0tag:blogger.com,1999:blog-1563502382858123691.post-79898775969749988262008-02-13T15:48:00.000-05:002008-02-14T13:24:06.529-05:00What to Expect from a Unit TestWhat is and what is not a unit test is a hotly debated subject. At one end of the spectrum you have people who argue that a unit test replaces all depended-on objects with mock or fake objects so that the system under test (SUT) is tested in complete isolation. At the other end of the spectrum you have people who contend that anything written with an xUnit framework like JUnit is a unit test. And then we have everything else that falls in between the two ends of that spectrum. Rather than trying to arrive at a universally accepted definition of a unit test, I think that it may be more productive to talk about what we expect from a unit test. If we can agree upon a set of goals that we aim to achieve through the practice of unit testing, then we do not need to concern ourselves with whether or not the test that we are writing is a <span style="font-style: italic;">true</span> unit test. Instead we can focus on using automated testing to facilitate the development of our software.<br /><br /><b>Rapid Feedback</b><br />A unit test should provide immediate feedback. Unit tests need to execute quickly since we (hopefully) run them over and over during development. While we work on a particular piece of code, we may choose to run a subset of the tests. We might do this through the IDE test runner. And then prior to committing code, we run the full test suite.
The tests for the code we are currently working on need to be fast since we are running those tests frequently as we are writing the code. Since we want commits to be small and frequent, the full test suite needs to be fast as well if we expect it to be run prior to commits. Not only should the test suite run fast for a developer build, but it should also run fast during integration builds so that we receive timely feedback during integration.<br /><br /><b>Defect Localization</b><br />When a test fails, we should know exactly what part of the SUT caused the test to fail. There are a couple of things that can be done to help promote effective defect localization. Only one condition should be verified per test. To that end there is a school of thought in which a test should only contain a single assert statement; however, the most important thing is that the test is not overly aggressive in trying to check multiple conditions. There are times when testing a condition may require multiple assert statements. Whether you choose to break out each assert into a separate test or you choose to keep all of the asserts for that condition in the same test is really a matter of preference.<br /><br />The second thing that comes into play for adequate defect localization is how well you isolate the SUT. Even if we only verify a single condition or even if we only use one assert per test, there may be times when it is not immediately obvious what part of the SUT caused the test to fail. This is often a direct result of not sufficiently isolating the SUT. There are plenty of articles, papers, and books that discuss strategies and techniques for isolating the SUT. Some tools like mock object libraries allow you to completely isolate an object by <i>mocking</i> all of its neighboring objects.
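As a small illustration of the idea, here is a hand-rolled mock, sketched in plain JavaScript rather than with a mock object library. The TicketService and repository names are invented for this example; the point is that a test failure now points directly at the SUT rather than at a real database:

```javascript
// The SUT: a ticket service that depends on a repository collaborator.
function TicketService(repository) {
  this.repository = repository;
}
TicketService.prototype.closeTicket = function(id) {
  var ticket = this.repository.findById(id);
  ticket.status = "CLOSED";
  this.repository.save(ticket);
};

// A hand-rolled mock repository: no database is involved, and the mock
// records what the SUT did so the test can verify the interaction.
var savedTicket = null;
var mockRepository = {
  findById: function(id) { return { id: id, status: "OPEN" }; },
  save: function(ticket) { savedTicket = ticket; }
};

// Exercise the SUT in complete isolation from its real collaborators.
new TicketService(mockRepository).closeTicket(42);
```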
Here is a good rule of thumb to start with for determining an appropriate level of isolation: the cause of the test failure can be determined without having to rely on a debugger or additional logging statements.<br /><br /><b>Executable Documentation</b><br />Tests can and should serve as documentation. They can provide a living, executable specification. Tests demonstrate how an object is expected to be used, what conditions must be satisfied for invoking a method on the object, and what kind of output to expect from that object. Because they can be such a powerful form of documentation, tests should be written in a clear, self-documenting style. Intuitive variable and method names should be used to make it obvious what is being tested. Test method names should be <span style="font-style: italic;">intent-revealing</span>. <span style="font-family:Courier New;">testUpdateTicket()</span>, for example, does not reveal intent nearly as well as <span style="font-family:Courier New;">testUpdateTicketShouldAddComment()</span>.<br /><br />Avoid putting complex set-up or verification code in test methods as it may obscure the intent of the tests. Instead, complex logic should be relegated to test utility methods and objects. This has a few benefits. First and most importantly, it prevents the test from getting littered with complex logic, thereby making it easier for the reader to see what the test is doing. Secondly, putting the complex logic in a test utility library makes it accessible and easy to reuse in other tests. Lastly, we can put our test utility objects in their own test harness to ensure that they have been implemented properly.<br /><br /><b>Regression Safeguard</b><br />The primary goal of testing in general is to validate that our software behaves as expected under prescribed conditions. Having a set of automated tests that we can continually run against our software provides a great safety net for catching regressions that are introduced into our code.
While unit testing alone is typically not sufficient for validating our code, it provides an excellent first level of defense that should be capable of catching most errors.<br /><br /><span style="font-weight: bold;">Refactoring</span><br />Unit tests should enable us to be aggressive with <span style="font-style: italic;">refactoring</span>. Refactoring is the practice of changing the implementation of code while preserving its behavior. Our unit tests should give us the confidence that refactoring will not alter the intended behavior of our code, at least not unexpectedly. If the tests do not instill that confidence, then we need to consider whether or not the tests are reliable, thorough and effective enough. Code coverage tools can help provide some measure of the effectiveness of a test suite, although a coverage tool alone should not be used to determine the quality and effectiveness of a test suite.<br /><br /><span style="font-weight: bold;">Repeatable and Reliable</span><br />What exactly does it mean for a test to be repeatable and reliable? Suppose we run a test and it passes. Then we run it again without making any changes, but this time the test fails. This would be an example of a test that is not repeatable. It could also be a strong indication that the test is using a persistent fixture that is outliving the test. With a persistent fixture, we need to be especially careful about cleaning up before/after the test so that the fixture is in a consistent state for each test run. Now consider a test that starts failing as a result of changes being made to code other than the SUT. This would be an example of an unreliable test.
In these situations we need to ensure that we properly isolate the SUT so that external changes do not affect our tests.<br /><br /><span style="font-weight: bold;">Easy and Fast to Implement<br /></span>Unit tests should be relatively easy to implement without adding a significant amount of time and overhead to the overall development effort. Code that is particularly difficult to get under test may indicate a larger design issue. We should take the opportunity to look for potential design problems. And maybe the easiest, most effective way to ensure that we design for testability is to write our tests first.<br /><br />As with any other software, it is imperative to refactor our test code. Using test utility methods and libraries as previously discussed will significantly reduce the amount of code that we have to write for tests as well as make tests more reliable since our test utility code can have its own test harness. Testing a single condition per test method will result in smaller tests as well. This will in turn lead to a faster turnaround time when going back and forth between the main code and test code.<br /><br /><span style="font-weight: bold;">Conclusion</span><br />These are sound, reasonable things to expect from unit tests; however, the exact expectations may vary from team to team. For example, some teams may prefer to make extensive use of mock objects and libraries like <a title="jMock" href="http://jmock.org/" id="vh.8">jMock</a>, while other teams may prefer not to use mock objects at all. The most important things are that the goals are clearly stated and agreed upon within the team and that the tests aid rather than inhibit development efforts.<br /><span style="font-weight: bold;"></span>Anonymoushttp://www.blogger.com/profile/08772590483013404123noreply@blogger.com0