Skip to main content

ECP Recovery Protocol

ECP Recovery Protocol

ECP is designed to automatically recover from interruptions in connectivity between an application server and the data server. In the event of such an interruption, ECP executes a recovery protocol that differs depending on the nature of the failure. The result is that the connection is either recovered, allowing the application processes to continue as though nothing had happened, or reset, forcing transaction rollback and rebuilding of the application processes. The main principles are as follows:

  • When the connection between an application server and data server is interrupted, the application server attempts to reestablish its connection with the data server, repeatedly if necessary, at an interval determined by the Time between reconnections setting (5 seconds by default).

  • When the interruption is brief, the connection is recovered.

    If the connection is reestablished within the data server’s configured Time interval for Troubled state timeout period (60 seconds by default), the data server restores all locks and open transactions to the state they were in prior to the interruption.

  • If the interruption is longer, the data server resets the connection, so that it cannot be recovered when the interruption ends.

    If the connection is not reestablished within the Time interval for Troubled state, the data server unilaterally resets the connection, allowing it to roll back transactions and release locks from the unresponsive application server so as not to block functioning application servers. When connectivity is restored, the connection is disabled from the application server point of view; all processes waiting for the data server on the interrupted connection receive a <NETWORK> error and enter a rollback-only condition. The next request received by the application server establishes a new connection to the data server.

  • If the interruption is very long, the application server also resets the connection.

    If the connection is not reestablished within the application server’s longer Time to wait for recovery timeout period (20 minutes by default), the application server unilaterally resets the connection; all processes waiting for the data server on the interrupted connection receive a <NETWORK> error and enter a rollback-only condition. The next request received by the application server establishes a new connection to the data server, if possible.

The ECP timeout settings are shown in the following table. Each is configurable on the System > Configuration > ECP Settings page of the Management Portal, or in the ECP section of in the configuration parameter file (CPF); for more information, see ECP in the Configuration Parameter File Reference.

ECP Timeout Values
Management Portal Setting CPF Setting Default Range  
Time between reconnections ClientReconnectInterval 5 seconds 1–60 seconds The interval at which an application makes attempts to reconnect to the data server.
Time interval for Troubled state ServerTroubleDuration 60 seconds 20–65535 seconds The length of time for which the data server waits for contact from the application server before resetting an interrupted connection.
Time to wait for recovery ClientReconnectDuration 1200 seconds (20 minutes) 10–65535 seconds The length of time for which an application server continues attempting to reconnect to the data server before resetting an interrupted connection.

The default values are intended to do the following:

  • Avoid tying up data server resources that could be used for other application servers for a long time by waiting for an application server to become available.

  • Give an application server — which has nothing else to do when the data server is not available — the ability to wait out an extended connection interruption for much longer by trying to reconnect at frequent intervals.

ECP relies on the TCP physical connection to detect the health of the instance at the other end without using too much of its capacity. On most platforms, you can adjust the TCP connection failure and detection behavior at the system level.

While an application server connection becomes inactive, the data server maintains an active daemon waiting for new requests to arrive on the connection, or for a new connection to be requested by the application server. If the old connection returns, it can immediately resume operation without recovery. When the underlying heartbeat mechanism indicates that the application server is completely unavailable due to a system or network failure, the underlying TCP connection is quickly reset. Thus, an extended period without a response from an application server generally indicates some kind of problem on the application server that caused it to stop functioning, but without interfering with its connections.

If the underlying TCP connection is reset, the data server puts the connection in an “awaiting reconnection” state in which there is no active ECP daemon on the data server. A new pair of data server daemons are created when the next incoming connection is requested by the application server.

Collectively, the nonresponsive state and the awaiting reconnection state are known as the data server Trouble state. The recovery required in both cases is very similar.

If the data server fails or shuts down, it remembers the connections that were active at the time of the crash or shutdown. After restarting, the data server has a short window (usually 30 seconds) during which it places these interrupted connections in the awaiting reconnection state. In this state, the application server and data server can cooperate together to recover all the transaction and lock states as well as all the pending Set and Kill transactions from the moment of the data server shutdown.

During the recovery of an ECP-configured instance, InterSystems IRIS guarantees a number of recoverable semantics, and also specifies limitations to these guarantees. ECP Recovery Process, Guarantees, and Limitations describes these in detail, as well as providing additional details about the recovery process.

FeedbackOpens in a new tab