
Developing Distributed Cache Applications

This topic discusses application development and design issues that are helpful if you would like to deploy your application on a distributed cache cluster, either as an option or as its primary configuration.

With InterSystems IRIS, the decision to deploy an application as a distributed system is primarily a runtime configuration issue (see Deploying a Distributed Cache Cluster). Using InterSystems IRIS configuration tools, map the logical names of your data (globals) and application logic (routines) to physical storage on the appropriate system.

This section discusses the following topics:

  • ECP Recovery Protocol
  • Forced Disconnects
  • Performance and Programming Considerations
  • ECP-related Errors

ECP Recovery Protocol

ECP is designed to automatically recover from interruptions in connectivity between an application server and the data server. In the event of such an interruption, ECP executes a recovery protocol that differs depending on the nature of the failure. The result is that the connection is either recovered, allowing the application processes to continue as though nothing had happened, or reset, forcing transaction rollback and rebuilding of the application processes. The main principles are as follows:

  • When the connection between an application server and data server is interrupted, the application server attempts to reestablish its connection with the data server, repeatedly if necessary, at an interval determined by the Time between reconnections setting (5 seconds by default).

  • When the interruption is brief, the connection is recovered.

    If the connection is reestablished within the data server’s configured Time interval for Troubled state timeout period (60 seconds by default), the data server restores all locks and open transactions to the state they were in prior to the interruption.

  • If the interruption is longer, the data server resets the connection, so that it cannot be recovered when the interruption ends.

    If the connection is not reestablished within the Time interval for Troubled state, the data server unilaterally resets the connection, allowing it to roll back transactions and release locks from the unresponsive application server so as not to block functioning application servers. When connectivity is restored, the connection is disabled from the application server's point of view; all processes waiting for the data server on the interrupted connection receive a <NETWORK> error and enter a rollback-only condition. The next request received by the application server establishes a new connection to the data server.

  • If the interruption is very long, the application server also resets the connection.

    If the connection is not reestablished within the application server’s longer Time to wait for recovery timeout period (20 minutes by default), the application server unilaterally resets the connection; all processes waiting for the data server on the interrupted connection receive a <NETWORK> error and enter a rollback-only condition. The next request received by the application server establishes a new connection to the data server, if possible.

The ECP timeout settings are shown in the following table. Each is configurable on the System > Configuration > ECP Settings page of the Management Portal, or in the ECP section of the configuration parameter file (CPF); for more information, see ECP in the Configuration Parameter File Reference.

ECP Timeout Values
Management Portal Setting | CPF Setting | Default | Range | Description
Time between reconnections | ClientReconnectInterval | 5 seconds | 1–60 seconds | The interval at which an application server attempts to reconnect to the data server.
Time interval for Troubled state | ServerTroubleDuration | 60 seconds | 20–65535 seconds | The length of time the data server waits for contact from the application server before resetting an interrupted connection.
Time to wait for recovery | ClientReconnectDuration | 1200 seconds (20 minutes) | 10–65535 seconds | The length of time an application server continues attempting to reconnect to the data server before resetting an interrupted connection.
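
As a rough illustration, these defaults might appear as follows in the ECP section of the CPF (a minimal sketch; see the Configuration Parameter File Reference for the authoritative format and any additional parameters):

 [ECP]
 ClientReconnectDuration=1200
 ClientReconnectInterval=5
 ServerTroubleDuration=60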

The default values are intended to do the following:

  • Avoid tying up data server resources for long periods waiting for an unresponsive application server, since those resources could otherwise be used to serve other application servers.

  • Give an application server — which has nothing else to do when the data server is not available — the ability to wait out an extended connection interruption for much longer by trying to reconnect at frequent intervals.

ECP relies on the TCP physical connection to detect the health of the instance at the other end without using too much of its capacity. On most platforms, you can adjust the TCP connection failure and detection behavior at the system level.

When an application server connection becomes inactive, the data server maintains an active daemon waiting for new requests to arrive on the connection, or for a new connection to be requested by the application server. If the old connection returns, it can immediately resume operation without recovery. When the underlying heartbeat mechanism indicates that the application server is completely unavailable due to a system or network failure, the underlying TCP connection is quickly reset. Thus, an extended period without a response from an application server generally indicates some kind of problem on the application server that caused it to stop functioning without interfering with its connections.

If the underlying TCP connection is reset, the data server puts the connection in an “awaiting reconnection” state in which there is no active ECP daemon on the data server. A new pair of data server daemons is created when the next incoming connection is requested by the application server.

Collectively, the nonresponsive state and the awaiting reconnection state are known as the data server Trouble state. The recovery required in both cases is very similar.

If the data server fails or shuts down, it remembers the connections that were active at the time of the crash or shutdown. After restarting, the data server has a short window (usually 30 seconds) during which it places these interrupted connections in the awaiting reconnection state. In this state, the application server and data server can cooperate to recover all the transaction and lock states, as well as all the pending Set and Kill operations, from the moment of the data server shutdown.

During the recovery of an ECP-configured instance, InterSystems IRIS guarantees a number of recoverable semantics, and also specifies limitations to these guarantees. ECP Recovery Process, Guarantees, and Limitations describes these in detail, as well as providing additional details about the recovery process.

Forced Disconnects

By default, ECP automatically manages the connection between an application server and a data server. When an ECP-configured instance starts up, all connections between application servers and data servers are in the Not Connected state (that is, the connection is defined, but not yet established). As soon as an application server makes a request (for data or code) that requires a connection to the data server, the connection is automatically established and the state changes to Normal. The network connection between the application server and data server is kept open indefinitely.

In some applications, you may wish to close open ECP connections. For example, suppose you have a system, configured as an application server, that periodically (a few times a day) needs to fetch data stored on a data server system, but does not need to keep the network connection with the data server open afterwards. In this case, the application server system can issue a call to the SYS.ECP.ChangeToNotConnected() method to force the state of the ECP connection to Not Connected.

For example:

 ; Perform application work that uses the ECP connection (placeholder label)
 Do OperationThatUsesECP()
 ; Force the ECP connection into the Not Connected state
 Do ##class(SYS.ECP).ChangeToNotConnected("ConnectionName")

The ChangeToNotConnected method does the following:

  1. Completes sending any data modifications to the data server and waits for acknowledgment from the data server.

  2. Removes any locks on the data server that were opened by the application server.

  3. Rolls back the data server side of any open transactions. The application server side of the transaction goes into a “rollback only” condition.

  4. Completes pending requests with a <NETWORK> error.

  5. Flushes all cached blocks.

After completion of the state change to Not Connected, the next request for data from the data server automatically reestablishes the connection.

Note:

See Data Server Connections for information about changing data server connection status from the Management Portal.

Performance and Programming Considerations

To achieve the highest performance and reliability from distributed cache cluster-based applications, you should be aware of the following issues:

Do Not Use Multiple ECP Channels

InterSystems strongly discourages establishing multiple duplicate ECP channels between an application server and a data server in an attempt to increase bandwidth. You run the risk of having locks and updates for a single logical transaction arrive out of sync on the data server, which can result in data inconsistency.

Increase Data Server Database Caches for ECP Control Structures

In addition to buffering the blocks that are served over ECP, data servers use global buffers to store various ECP control structures. Several factors determine how much memory these structures might require, but the most significant is the aggregate size of the application servers’ database caches. To roughly approximate the requirements, so you can adjust the data server’s database caches if needed, use the following guidelines:

Database Block Size | Recommendation
8 KB | 50 MB plus 1% of the sum of the sizes of all of the application servers’ 8 KB database caches
16 KB (if enabled) | 0.5% of the sum of the sizes of all of the application servers’ 16 KB database caches
32 KB (if enabled) | 0.25% of the sum of the sizes of all of the application servers’ 32 KB database caches
64 KB (if enabled) | 0.125% of the sum of the sizes of all of the application servers’ 64 KB database caches

For example, if the 16 KB block size is enabled in addition to the default 8 KB block size, and there are six application servers, each with an 8 KB database cache of 2 GB and a 16 KB database cache of 4 GB, you should adjust the data server’s 8 KB database cache to ensure that 173 MB (50 MB + [12 GB * .01]) is available for control structures, and the 16 KB cache to ensure that 123 MB (24 GB * .005) is available for control structures (rounding up in both cases).
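
To make the arithmetic concrete, the following sketch reproduces the calculation for this example in ObjectScript (the variable names are illustrative only):

 ; Rough sizing of ECP control-structure memory for the example above:
 ; six application servers, each with a 2 GB 8 KB cache and a 4 GB 16 KB cache.
 Set mb8=50+(6*2048*0.01)      ; 50 MB + 1% of 12 GB, expressed in MB
 Set mb16=6*4096*0.005         ; 0.5% of 24 GB, expressed in MB
 Write "8 KB control structures: ",$Justify(mb8,0,0)," MB",!
 Write "16 KB control structures: ",$Justify(mb16,0,0)," MB",!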

For information about allocating memory to database caches, see Memory and Startup Settings.

Evaluate the Effects of Load Balancing User Requests

Connecting users to application servers in a round-robin or load balancing scheme may diminish the benefit of caching on the application server. This is particularly likely if users work in functional groups that tend to read the same data. When these users are spread among application servers, each application server may end up requesting exactly the same data from the data server, which not only diminishes the efficiency of distributed caching (multiple application servers caching the same data), but can also lead to increased block invalidation as blocks are modified on one application server and refreshed across the others. Evaluating this is somewhat subjective, and should be done by someone very familiar with the application’s characteristics. If you do decide to configure a load balancer, see Load Balancing, Failover, and Mirrored Configurations for an important discussion of load balancing a web server tier that distributes application connections across application servers.

Restrict Transactions to a Single Data Server

Restrict updates within a single transaction to either a single remote data server or the local server. When a transaction includes updates to more than one server (including the local server) and the TCommit cannot complete successfully, some servers that are part of the transaction may have committed the updates while others may have rolled them back. For details, see Commit Guarantee.

Note:

Updates to IRISTEMP are not considered part of the transaction for the purpose of rollback, and, as such, are not included in this restriction.
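
As a minimal sketch of this guideline (the globals and values are hypothetical, and ^Order is assumed to be mapped entirely to a single remote data server), a transaction confined to one server looks like this:

 ; All updates in this transaction are stored on the same data server, so a
 ; failed TCOMMIT cannot leave one server committed and another rolled back.
 Set id=17                                  ; hypothetical order identifier
 TSTART
 Set ^Order(id,"status")="shipped"
 Set ^Order(id,"shipped")=$ZDateTime($Horolog,3)
 TCOMMIT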

Locate Temporary Globals on the Application Server

Temporary (scratch) globals should be local to the application server, assuming they do not contain data that needs to be globally shared. Often, temporary globals are highly active and write-intensive. If temporary globals are located on the data server, this may penalize other application servers sharing the ECP connection.
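
One way to keep scratch data off the data server, sketched below with a hypothetical structure: when the data does not need to be shared between processes at all, a process-private global (^||name) is stored on the local instance and never crosses ECP; scratch globals that must be shared across processes on the application server can instead be mapped to a database local to that application server.

 ; ^||scratch is process-private: it lives on the application server and
 ; generates no ECP traffic, no matter how write-intensive it is.
 Kill ^||scratch
 For i=1:1:1000 {
     Set ^||scratch(i)=i*i
 }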

Avoid Repeated References to Undefined Globals

Repeated references to a global that is not defined (for example, $Data(^x(1)) where ^x is not defined) always require a network operation to test whether the global is defined on the data server.

By contrast, repeated references to undefined nodes within a defined global (for example, $Data(^x(1)) where any other node in ^x is defined) do not require a network operation once the relevant portion of the global (^x) is in the application server cache.

This behavior differs significantly from that of a non-networked application. With local data, repeated references to the undefined global are highly optimized to avoid unnecessary work. Designers porting an application to a networked environment may wish to review the use of globals that are sometimes defined and sometimes not. Often it is sufficient to make sure that some other node of the global is always defined.
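
A minimal sketch of that approach, using the hypothetical global ^x: define the root node once, so that later $Data checks on undefined subscripts can be answered from the application server cache instead of requiring a network operation each time.

 ; Keeping the root node defined means the global itself is always defined,
 ; so $Data on an undefined subscript is resolved from the local cache.
 If $Data(^x)=0 Set ^x=""
 If $Data(^x(1)) {
     Write "node 1 exists",!
 }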

Avoid the Use of Stream Fields

A stream field within a query results in a read lock, which requires a connection to the data server. For this reason, such queries do not benefit from the database cache, which means their performance does not improve on the second and subsequent invocations.

Use the $Increment Function for Application Counters

A common operation in online transaction processing systems is generating a series of unique values for use as record numbers or the like. In a typical relational application, this is done by defining a table that contains a “next available” counter value. When the application needs a new identifier, it locks the row containing the counter, increments the counter value, and releases the lock. Even on a single-server system, this becomes a concurrency bottleneck: application processes spend more and more time waiting for the locks on this common counter to be released. In a networked environment, the bottleneck becomes even more severe.

InterSystems IRIS addresses this by providing the $Increment function, which atomically increments a counter value (stored in a global) without any need for application-level locking. Concurrency for $Increment is built into the InterSystems IRIS database engine as well as ECP, making it very efficient for use in single-server as well as distributed applications.

Applications built using the default structures provided by InterSystems IRIS objects (or SQL) automatically use $Increment to allocate object identifier values. $Increment is a synchronous operation involving journal synchronization when executed over ECP. For this reason, $Increment over ECP is a relatively slow operation, especially compared to others which may or may not already have data cached (in either the application server database cache or the data server database cache). The impact of this may be even greater in a mirrored environment due to network latency between the failover members. For this reason, it may be useful to redesign an application to replace $Increment with the $Sequence function, which automatically assigns batches of new values to each process on each application server, involving the data server only when a new batch of values is needed. (This approach cannot be used, however, when consecutive application counter values are required.) $Sequence can also be used in combination with $Increment.
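
A brief sketch contrasting the two functions (the counter globals are hypothetical):

 ; $Increment atomically increments the counter; no application-level Lock is needed.
 Set id=$Increment(^MyApp("counter"))
 ; $Sequence allocates a batch of values to this process, so most calls avoid a
 ; round trip to the data server; use it only where gaps in the sequence are acceptable.
 Set id2=$Sequence(^MyApp("sequence"))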

ECP-related Errors

There are several runtime errors that can occur on a system using ECP. An ECP-related error may occur immediately after a command is executed or, in the case of commands that are asynchronous in nature (such as Kill), a short time after the command completes.

<NETWORK> Errors

A <NETWORK> error indicates an error that could not be handled by the normal ECP recovery mechanism.

In an application, it is always acceptable to halt a process or roll back any pending work whenever a <NETWORK> error is received. Some <NETWORK> errors are essentially fatal error conditions. Others indicate a temporary condition that might clear up soon; however, the expected programming practice is to always roll back any pending work in response to a <NETWORK> error and start the current transaction over from the beginning.

A <NETWORK> error on a get-type request such as $Data or $Order can often be retried manually rather than simply rolling back the transaction immediately. ECP tries to avoid giving a <NETWORK> error that would lose data, but gives an error more freely for requests that are read-only.
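
The following sketch (globals and structure are hypothetical) shows the recommended pattern of rolling back pending work and restarting when a <NETWORK> error is caught:

 Set id=17                                 ; hypothetical identifier
 Try {
     TSTART
     Set ^Order(id,"status")="shipped"     ; hypothetical update over ECP
     TCOMMIT
 } Catch ex {
     If ex.Name["NETWORK" {
         ; Roll back any pending work, then restart the transaction from the
         ; beginning (or report the failure to the user).
         If $TLevel TROLLBACK
     } Else {
         Throw ex
     }
 }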

Rollback Only Condition

The application-side rollback-only condition occurs when a network failure is detected during a transaction initiated by the application server; the application server then enters a state in which all network requests receive errors until the transaction is rolled back.
