Tuesday, September 28, 2010

Dealing with Asynchronous Workflows in the CLI

Introduction
There is constant, ongoing communication between agents and servers in RHQ. Agents, for example, send inventory and availability reports up to the server at regularly scheduled intervals. The server sends down resource-related requests such as updating a configuration or executing a resource operation. Examples of these include updating the connection pool setting for a JDBC data source and starting a JBoss AS server. Some of these workflows are performed synchronously while others are carried out asynchronously. A good example of an asynchronous workflow is scheduling a resource operation to execute at some point in the future. There is a common pattern used in implementing these asynchronous workflows. We will explore this pattern in some detail and then consider its impact on remote clients like the CLI.

The Pattern
The asynchronous workflows are most prevalent in requests that change resources in some way. Let's go through the pattern.
  • A request is made on the server to take some action against a resource (e.g., invoke an operation, update connection properties, update configuration, deploy content, etc.)
  • The server logs the request on the audit trail
  • The server sends the request to the agent
    • Note that control is returned to the server immediately after the request is sent to the agent. This means that the call to the agent will likely return before the requested action has actually been carried out.
  • The plugin container (running in the agent) invokes the appropriate resource component
  • The resource component carries out the request and reports the results back to the plugin container
  • The agent sends the response back to the server. The response will indicate success or failure.
  • The server updates the audit trail indicating that the request has completed and also whether it succeeded or failed.
    • Note that the entry being updated is the same one that was originally logged on the audit trail; no new entry is created
Let's revisit the earlier example of scheduling an operation to start a JBoss server. Suppose I schedule the operation to execute immediately. Then I navigate to the operation history page for the JBoss server. I will see the operation request listed in the history. The history page is a view of the audit trail. The operation shows a status of In Progress. We could continually refresh the page until we see the status change. Eventually it will change to Success or Failure. The status does not necessarily change immediately after the operation completes. It changes after the agent reports the results back to the server and the audit trail is updated.
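
The life cycle described above can be modeled in a few lines of JavaScript. This is only a toy sketch to make the ordering of events concrete: auditTrail, submitRequest, agentProcessNext, and reportResult are made-up names rather than RHQ APIs, and the agent is simulated by a plain array acting as a queue.

```javascript
// Toy model of the request life cycle. None of these names are RHQ APIs.
var auditTrail = [];

// Server side: log the request, hand it to the agent, and return immediately.
function submitRequest(description, agentQueue) {
    var entry = {
        id: auditTrail.length + 1,
        description: description,
        status: "INPROGRESS",
        errorMessage: null
    };
    auditTrail.push(entry);
    agentQueue.push(entry.id);   // "send" to the agent; no work has been done yet
    return entry;
}

// Agent side: carry out the request, then report the result to the server.
function agentProcessNext(agentQueue, succeed) {
    var id = agentQueue.shift();
    reportResult(id, succeed, succeed ? null : "resource component reported an error");
}

// Server side: update the same entry that was logged when the request came in.
function reportResult(id, succeeded, errorMessage) {
    for (var i = 0; i < auditTrail.length; i++) {
        if (auditTrail[i].id === id) {
            auditTrail[i].status = succeeded ? "SUCCESS" : "FAILURE";
            auditTrail[i].errorMessage = errorMessage;
        }
    }
}

var queue = [];
var entry = submitRequest("start JBoss server", queue);
var statusAfterSubmit = entry.status;   // control returned before the agent acted
agentProcessNext(queue, true);
var statusAfterReport = entry.status;   // changes only once the agent has reported back
```

Between the call to submitRequest and the call to agentProcessNext, anything that reads the entry sees it as in progress, which is the window in which the operation history page shows In Progress.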

As previously stated, this pattern is very common throughout RHQ. Consider a resource configuration update, which is also performed asynchronously. Once I submit the configuration update request, I can navigate to the configuration history page to check the status of the request. The status of the update request will show in progress until the agent reports back to the server that the update has completed. When the agent reports back to the server, the corresponding audit trail entry is updated with the results. The same pattern can also be observed when manually adding a new resource into the inventory.

Understanding the Impact to the CLI
So what does this asynchronous workflow mean for remote clients, notably CLI scripts? First and foremost, you need to understand when and where requests are carried out asynchronously to avoid unpredictable, unexpected results. We will discuss a number of things that can affect how you think about and how you write CLI scripts.

A method that returns without error does not necessarily mean that the operation succeeded
Let's say we have a requirement to write a script that performs a couple of resource configuration updates, but we only want to perform the second update if the first one succeeds. We might be inclined to implement this as follows:

ConfigurationManager.updateResourceConfiguration(resourceId, firstConfig);
ConfigurationManager.updateResourceConfiguration(resourceId, secondConfig);

Provided we are logged in as a user having the necessary permissions to update the resource configuration and provided the agent is online and available, the first call to updateResourceConfiguration will return without error. We proceed to submit the second configuration change, but the first update might have actually failed. With the code as is, we could easily wind up violating the requirement of applying the second update only if the first succeeds. What we need to do here is block until the first configuration update finishes so that we can verify that it did in fact succeed. This can be implemented by polling the ResourceConfigurationUpdate object for the request until its status is no longer in progress.

ConfigurationManager.updateResourceConfiguration(resourceId, firstConfig);
// Fetch the audit trail entry for the update that was just submitted
var update = ConfigurationManager.getLatestResourceConfigurationUpdate(resourceId);
// Block until the agent reports back and the status changes
while (update.status == ConfigurationUpdateStatus.INPROGRESS) {
    java.lang.Thread.sleep(2000);  // sleep for 2 seconds
    update = ConfigurationManager.getLatestResourceConfigurationUpdate(resourceId);
}
// Apply the second update only if the first one succeeded
if (update.status == ConfigurationUpdateStatus.SUCCESS) {
    ConfigurationManager.updateResourceConfiguration(resourceId, secondConfig);
}

The ResourceConfigurationUpdate object is our audit trail entry. The object's status will change once the resource component (running in the plugin container) finishes applying the update and the agent sends the response back to the server.
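
The same idea extends to any number of updates that must be applied in order. The following is a minimal sketch rather than an RHQ API: applyInOrder is a hypothetical helper, and submit, fetchLatest, sleep, and the statuses argument are all supplied by the caller. It also assumes the audit trail entry exposes its error message as an errorMessage property. The stand-in functions at the bottom exist only so that the sketch can run outside the CLI.

```javascript
// Hypothetical helper, not part of the RHQ CLI.
// submit(config)  - submits one configuration update
// fetchLatest()   - returns the audit trail entry for the latest update
// sleep(millis)   - blocks the script for the given time
// statuses        - { inProgress: ..., success: ... } values to compare against
function applyInOrder(configs, submit, fetchLatest, sleep, statuses) {
    for (var i = 0; i < configs.length; i++) {
        submit(configs[i]);
        var update = fetchLatest();
        while (update.status == statuses.inProgress) {
            sleep(2000);
            update = fetchLatest();
        }
        if (update.status != statuses.success) {
            // Stop at the first update that did not succeed.
            return { applied: i, error: String(update.errorMessage) };
        }
    }
    return { applied: configs.length, error: null };
}

// Stand-ins for the server: the first update succeeds after two polls,
// the second one fails, so the third must never be submitted.
var script = [["INPROGRESS", "INPROGRESS", "SUCCESS"], ["INPROGRESS", "FAILURE"]];
var current = null;
var submitted = [];
function fakeSubmit(config) { submitted.push(config); current = script.shift(); }
function fakeFetchLatest() {
    var status = current.length > 1 ? current.shift() : current[0];
    return { status: status, errorMessage: status === "FAILURE" ? "update rejected" : null };
}
function noSleep(millis) { }

var result = applyInOrder(["firstConfig", "secondConfig", "thirdConfig"],
    fakeSubmit, fakeFetchLatest, noSleep,
    { inProgress: "INPROGRESS", success: "SUCCESS" });
```

In a CLI script the statuses argument would presumably be built from ConfigurationUpdateStatus.INPROGRESS and ConfigurationUpdateStatus.SUCCESS, and sleep could wrap java.lang.Thread.sleep.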

Resource proxies offer some polling support
Resource proxies greatly simplify working with a number of the RHQ APIs. Invoking resource operations is one of those enhanced areas. With a resource proxy, operations defined in the plugin descriptor appear as first-class methods on the proxy object. This allows us to invoke a resource operation in a much more concise and intuitive fashion. Here is a brief example.

var jbossServerId1 = 10001;  // placeholder; look up the resource id of JBoss server 1
var jbossServerId2 = 10002;  // placeholder; look up the resource id of JBoss server 2
var server1 = ProxyFactory.getResource(jbossServerId1);
var server2 = ProxyFactory.getResource(jbossServerId2);
server1.start();
server2.start();

The call to server1.start() does not immediately return. It polls the status of the operation, waiting for it to complete. The proxy sleeps for a short delay and then fetches the ResourceOperationHistory object that was logged for the request. If a history object is found and its status is something other than in progress, then the proxy returns the operation's results. If the history object indicates that the operation has not yet completed, the proxy will continue polling.

Resource proxies provide some great abstractions that simplify working in the CLI. The polling that is done behind the scenes for resource operations is yet another useful abstraction in that it makes a resource operation request look like a regular, synchronous method call. The polling, however, is somewhat limited. We will take a closer look at some of the implementation details to better understand how it all works.

The delay or sleep interval is fixed
The thread in which the proxy is running sleeps for one second before it polls the history object. There is currently no way to specify a different delay or sleep interval. In many cases the one second delay should be suitable, but there might be situations in which a shorter or longer delay is preferred.

The number of polling intervals is fixed
The proxy will poll the ResourceOperationHistory at most ten times. There is currently no way to specify a different number of intervals. If the history still has a status of in progress after the tenth poll, the proxy simply returns the incomplete results. If no history is available, null is returned. In many cases the polling delay and the number of intervals are sufficient for operations to complete, but there is no guarantee.

The proxy will not poll indefinitely
This is really an extension of the last point about not being able to specify the number of polling intervals. There may be times when you want to block indefinitely until the operation completes. Resource proxies currently do not offer this behavior.
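
To illustrate what a more flexible version of this polling could look like, here is a minimal sketch. It is not how resource proxies are implemented and pollHistory is not an RHQ function; it is a hypothetical helper that follows the behavior described above (sleep, fetch the history, return once the status is no longer in progress) while letting the caller choose the delay and the number of attempts. Leaving maxAttempts out makes it poll until the status changes. The fetchHistory and sleep functions are supplied by the caller, and the stand-ins at the bottom only simulate a server.

```javascript
// Hypothetical helper, not part of the RHQ CLI or of resource proxies.
// fetchHistory() returns the current history object, or null if none exists yet.
// options.delayMillis is required; options.maxAttempts may be omitted to poll
// until the status is no longer in progress.
function pollHistory(fetchHistory, inProgress, sleep, options) {
    var unlimited = options.maxAttempts === undefined;
    var history = null;
    for (var attempt = 0; unlimited || attempt < options.maxAttempts; attempt++) {
        sleep(options.delayMillis);
        history = fetchHistory();
        if (history !== null && history.status != inProgress) {
            return history;   // finished, successfully or not
        }
    }
    return history;   // attempts exhausted: null or a history still in progress
}

// Stand-ins: no history for the first two polls, then in progress,
// then SUCCESS from the fifteenth poll onward.
var polls = 0;
var sleptMillis = 0;
function fakeFetch() {
    polls++;
    if (polls < 3) { return null; }
    return { status: polls < 15 ? "INPROGRESS" : "SUCCESS" };
}
function fakeSleep(millis) { sleptMillis += millis; }

// Same limits as the proxy: gives up while the operation is still in progress.
var bounded = pollHistory(fakeFetch, "INPROGRESS", fakeSleep, { delayMillis: 1000, maxAttempts: 10 });
// No limit and a shorter delay: keeps polling until the status changes.
var unbounded = pollHistory(fakeFetch, "INPROGRESS", fakeSleep, { delayMillis: 500 });
```

In a CLI script, sleep could be function (millis) { java.lang.Thread.sleep(millis); } and the inProgress argument would be the in-progress value of whichever status type the history object uses.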

Polling cannot be performed asynchronously
Let's say we want to start ten JBoss servers in succession. We want to know whether or not they start up successfully, but we are not concerned with the order in which they start. In this example some form of asynchronous polling would be appropriate. Let's further assume that each proxy winds up polling the maximum of ten intervals. Each call to server.start() will take a minimum of ten seconds plus whatever time it takes to retrieve and check the status of the ResourceOperationHistory. With ten servers, it will therefore take more than 100 seconds to invoke the start operation on all of them. This could turn out to be very inefficient. In all likelihood, it would be faster to schedule the start operation, have control return to the script immediately, and then schedule each subsequent operation. Then the script could block until all of the operations have completed.
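
A rough sketch of that schedule-first, wait-afterwards approach follows. startAll is a hypothetical helper, not an RHQ API; the schedule, fetchHistory, and sleep functions are supplied by the caller, and the stand-ins below simulate ten servers that each need ten polls before they finish, one of which fails.

```javascript
// Hypothetical helper, not part of the RHQ CLI.
// schedule(id)     - submits the operation for one resource and returns immediately
// fetchHistory(id) - returns the history object for that request, or null
function startAll(resourceIds, schedule, fetchHistory, inProgress, sleep, delayMillis) {
    var pending = [];
    var results = {};
    var i;
    for (i = 0; i < resourceIds.length; i++) {
        schedule(resourceIds[i]);      // no waiting between submissions
        pending.push(resourceIds[i]);
    }
    // Block until every operation has finished, in whatever order.
    while (pending.length > 0) {
        sleep(delayMillis);            // one sleep covers all pending operations
        var stillPending = [];
        for (i = 0; i < pending.length; i++) {
            var history = fetchHistory(pending[i]);
            if (history !== null && history.status != inProgress) {
                results[pending[i]] = history.status;
            } else {
                stillPending.push(pending[i]);
            }
        }
        pending = stillPending;
    }
    return results;
}

// Stand-ins: every server finishes on its tenth poll; server 7 fails.
var rounds = {};
var sleptMillis = 0;
function fakeSchedule(id) { rounds[id] = 0; }
function fakeFetchHistory(id) {
    rounds[id]++;
    return { status: rounds[id] < 10 ? "INPROGRESS" : (id === 7 ? "FAILURE" : "SUCCESS") };
}
function fakeSleep(millis) { sleptMillis += millis; }

var ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
var outcome = startAll(ids, fakeSchedule, fakeFetchHistory, "INPROGRESS", fakeSleep, 1000);
```

In this simulation the script sleeps ten times for one second in total, instead of ten times per server, and still learns the outcome of every start request. Unlike the proxy, this sketch blocks indefinitely, so a real script would probably want an upper bound on the number of rounds.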

As an aside, the previous example might better be solved by creating a resource group for the JBoss servers and then invoking the operation once on the entire group. The problems, however, still manifest themselves with resource groups. Suppose we want to call operation O1 on resource group G1, followed by a call to operation O2 on group G2, followed by O3 on G3, etc. We are essentially faced with the same problems but now on a larger scale.

There is no uniform Audit Trail API
Scheduling a resource operation, submitting a resource configuration update, deploying content, etc. are, generically speaking, all operations that involve submitting a request to an agent (or multiple agents in the case of a group operation) for some change to be applied to one or more resources. In each of the different scenarios, an entry is persisted on the respective audit trail. For example, with a resource operation, a ResourceOperationHistory object is persisted. When deploying a new resource (e.g., a WAR file), a CreateResourceHistory object is persisted. With a resource configuration change, a ResourceConfigurationUpdate is persisted. Each of these objects exposes a status property that indicates whether the request is in progress, has succeeded, or has failed. Each of them also exposes an error message property that is populated if the request fails or an unexpected error occurs.

Unfortunately, there is no common base class shared among these audit trail classes in which the status and error message properties are defined. This makes writing a generic polling solution more challenging, at least if the solution is to be implemented in Java. A solution in a dynamic language, like JavaScript, might prove easier since we can rely on duck typing. We could implement a generic solution that works with a status property, without regard to an object's type.
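
As a small illustration of the duck typing idea, the following sketch inspects an audit trail entry without caring about its type. summarize is a hypothetical helper, not an RHQ API. It assumes only that the entry has a status property and, optionally, an errorMessage property (the property name is an assumption). The caller passes in the in-progress and success values to compare against, since each audit trail class has its own status type. The plain objects and their extra fields are made up and merely stand in for the three classes mentioned above.

```javascript
// Hypothetical helper that relies only on duck typing: any object with a
// status property works, and errorMessage is read only if it is present.
function summarize(entry, inProgress, success) {
    if (entry === null || entry === undefined) {
        return "no audit trail entry found";
    }
    if (entry.status == inProgress) {
        return "in progress";
    }
    if (entry.status == success) {
        return "succeeded";
    }
    return "did not succeed: " + (entry.errorMessage ? entry.errorMessage : "no error message");
}

// Plain objects with different shapes; the extra fields are invented.
var operationHistory = { status: "SUCCESS", results: {} };
var configUpdate = { status: "FAILURE", errorMessage: "invalid property value" };
var createHistory = { status: "INPROGRESS", createdResourceName: "my-app.war" };

var summaries = [
    summarize(operationHistory, "INPROGRESS", "SUCCESS"),
    summarize(configUpdate, "INPROGRESS", "SUCCESS"),
    summarize(createHistory, "INPROGRESS", "SUCCESS")
];
```

The same function handles all three objects because it never asks what type they are, which is exactly what a Java implementation cannot do without a shared base class or interface.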

Conclusion
It is important to understand the workflows and communication patterns described here, as well as the current limitations in resource proxies, in order to write effective CLI scripts that have consistent behavior and predictable results. Consistent behavior and predictable results certainly do not mean that the same results are produced every time a script is run. They do mean, though, that given certain conditions, we can make assumptions that hold true. For example, if we execute a resource operation and then block until the ResourceOperationHistory status has changed to SUCCESS, then we can reasonably assume that the operation did in fact complete successfully.

Many of the workflows in RHQ are necessarily asynchronous, and this has to be taken into account when working with a remote client like the CLI. Fortunately, there are many ways we can look to encapsulate much of this, shielding developers from the underlying complexities while at the same time not limiting developers in how they choose to deal with these issues.
