This section provides additional detail on the mechanics of failover.
The mirror responds to loss of contact between the failover members, or between a failover member and the arbiter, according to which of two mirror failover modes is in effect:
Agent Controlled Mode
When a mirror starts, the failover members begin operation in agent controlled mode. If the arbiter is not available or no arbiter is configured, they remain in this mode. When in agent controlled mode, the failover members respond to loss of contact with each other as described in the following sections.
Primary’s Response to Loss of Contact
If the primary loses its connection to an active backup, or exceeds the QoS timeout waiting for it to acknowledge receipt of data, the primary revokes the backup’s active status and enters the trouble state, waiting for the backup to acknowledge that it is no longer active. When the primary receives acknowledgement from the backup or the trouble timeout (which is two times the QoS timeout) expires, the primary exits the trouble state, resuming operation as primary.
If the primary loses its connection to a backup that is not active, it continues operating as primary and does not enter the trouble state.
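The primary's behavior in agent controlled mode can be sketched as a few small functions. This is only an illustrative model of the rules above, not product code: the names `trouble_timeout`, `primary_on_contact_loss`, and `primary_exits_trouble` are hypothetical, and the 8000 ms QoS value in the usage note is an arbitrary example.

```python
from enum import Enum


class PrimaryAction(Enum):
    CONTINUE = "continue operating as primary"
    ENTER_TROUBLE = "revoke backup's active status and enter trouble state"


def trouble_timeout(qos_timeout_ms: int) -> int:
    """The trouble timeout is two times the QoS timeout."""
    return 2 * qos_timeout_ms


def primary_on_contact_loss(backup_active: bool) -> PrimaryAction:
    """Primary's reaction to losing the backup (or a QoS timeout on an ack)."""
    if backup_active:
        return PrimaryAction.ENTER_TROUBLE
    # A backup that is not active does not put the primary in the trouble state.
    return PrimaryAction.CONTINUE


def primary_exits_trouble(backup_acknowledged: bool,
                          elapsed_ms: int,
                          qos_timeout_ms: int) -> bool:
    """The primary resumes normal operation on acknowledgement or timeout expiry."""
    return backup_acknowledged or elapsed_ms >= trouble_timeout(qos_timeout_ms)
```

For example, with a QoS timeout of 8000 ms, `trouble_timeout(8000)` gives 16000 ms, so a primary that never hears from the backup would leave the trouble state after 16 seconds in this model.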
Backup’s Response to Loss of Contact
If the backup loses its connection to the primary, or exceeds the QoS timeout waiting for a message from the primary, it attempts to contact the primary’s ISCAgent. If the agent reports that the primary instance is still operating as primary, the backup reconnects. If the agent confirms that the primary is down or that it has forced the primary down, the backup behaves as follows:
- If the backup is active and the agent confirms that the primary is down within the trouble timeout, the backup takes over as primary.
- If the backup is not active, or the trouble timeout is exceeded, the backup takes over if the agent confirms that the primary is down and if it can obtain the latest journal data from the agent.
Whether it is active or not, the backup can never automatically take over in agent controlled mode unless the primary itself confirms that it is hung or the primary’s agent confirms that the primary is down (possibly after forcing it down), neither of which can occur if the primary’s host is down or network isolated.
Note:
When one of the failover members restarts, it attempts to contact the other’s ISCAgent and then behaves as described for a backup that is not active.
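The backup's decision in agent controlled mode can be summarized as a small decision function. The sketch below is a simplified model under stated assumptions: all names (`AgentReport`, `BackupAction`, `backup_on_contact_loss`) are hypothetical, and the case in which the primary itself confirms that it is hung is left out.

```python
from enum import Enum


class AgentReport(Enum):
    UNREACHABLE = "primary's agent could not be contacted"
    STILL_PRIMARY = "primary instance is still operating as primary"
    PRIMARY_DOWN = "primary is down or was forced down by the agent"


class BackupAction(Enum):
    RECONNECT = "reconnect to primary"
    TAKE_OVER = "take over as primary"
    NO_TAKEOVER = "cannot take over automatically"


def backup_on_contact_loss(report: AgentReport,
                           backup_active: bool,
                           within_trouble_timeout: bool,
                           got_latest_journal_data: bool) -> BackupAction:
    """Model of the backup's response after it loses contact with the primary."""
    if report is AgentReport.STILL_PRIMARY:
        return BackupAction.RECONNECT
    if report is AgentReport.UNREACHABLE:
        # Primary's host is down or network isolated: no confirmation is possible.
        return BackupAction.NO_TAKEOVER
    # From here on the agent has confirmed that the primary is down.
    if backup_active and within_trouble_timeout:
        return BackupAction.TAKE_OVER
    # Not active, or trouble timeout exceeded: latest journal data is required.
    if got_latest_journal_data:
        return BackupAction.TAKE_OVER
    return BackupAction.NO_TAKEOVER
```

In this model, an unreachable agent always yields `NO_TAKEOVER`, which mirrors the rule that the backup cannot take over automatically when the primary's host is down or network isolated.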
Arbiter Controlled Mode
When the failover members are connected to each other, both are connected to the arbiter, and the backup is active, they enter arbiter controlled mode, in which the failover members respond to loss of contact between them based on information about the other failover member provided by the arbiter. Because each failover member responds to the loss of its arbiter connection by testing its connection to the other failover member, and vice versa, multiple connection losses arising from a single network event are processed as a single event.
In arbiter controlled mode, if either failover member loses its arbiter connection only, or the backup loses its active status, the failover members coordinate a switch to agent controlled mode and respond to further events as described for that mode.
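The conditions for entering arbiter controlled mode, and for the coordinated switch back to agent controlled mode, can be expressed as two predicates. This is a minimal sketch; `MirrorLinks` and both function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirrorLinks:
    members_connected: bool    # primary <-> backup
    primary_to_arbiter: bool   # primary <-> arbiter
    backup_to_arbiter: bool    # backup <-> arbiter
    backup_active: bool


def can_enter_arbiter_controlled_mode(links: MirrorLinks) -> bool:
    """All three connections must be up and the backup must be active."""
    return (links.members_connected
            and links.primary_to_arbiter
            and links.backup_to_arbiter
            and links.backup_active)


def coordinated_switch_to_agent_mode(links: MirrorLinks) -> bool:
    """True when the members are still connected to each other but an arbiter
    connection is lost or the backup is no longer active."""
    return links.members_connected and not can_enter_arbiter_controlled_mode(links)
```

A loss of the connection between the failover members themselves is deliberately not covered by the second predicate, because that case depends on the arbiter connections.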
If the connection between the primary and the backup is broken in arbiter controlled mode, each failover member responds based on the state of the arbiter connections as described in the following sections.
Primary Loses Connection to Backup
If the primary loses its connection to an active backup, or exceeds the QoS timeout waiting for it to acknowledge receipt of data, and learns from the arbiter that the arbiter has also lost its connection to the backup or exceeded the QoS timeout waiting for a response from the backup, the primary continues operating as primary and switches to agent controlled mode.
If the primary learns that the arbiter is still connected to the backup, it enters the trouble state and attempts to coordinate a switch to agent controlled mode with the backup through the arbiter. When either the coordinated switch is accomplished, or the primary learns that the backup is no longer connected to the arbiter, the primary returns to normal operation as primary.
If the primary has lost its arbiter connection as well as its connection to the backup, it remains in the trouble state indefinitely so that the backup can safely take over. If failover occurs, the primary shuts down when the connection is restored.
Note:
The trouble timeout does not apply in arbiter controlled mode.
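The three cases for the primary in arbiter controlled mode can be modeled as a lookup on what the primary learns from the arbiter. The sketch below is illustrative only; `ArbiterView`, `PrimaryResponse`, and `primary_on_backup_loss` are hypothetical names.

```python
from enum import Enum


class ArbiterView(Enum):
    NO_ARBITER_CONNECTION = "primary has also lost the arbiter"
    ARBITER_LOST_BACKUP = "arbiter has lost (or timed out on) the backup"
    ARBITER_SEES_BACKUP = "arbiter is still connected to the backup"


class PrimaryResponse(Enum):
    CONTINUE_AGENT_MODE = "continue as primary, switch to agent controlled mode"
    TROUBLE_COORDINATE = "enter trouble state, coordinate switch via arbiter"
    TROUBLE_INDEFINITELY = "remain in trouble state indefinitely"


def primary_on_backup_loss(view: ArbiterView) -> PrimaryResponse:
    """Primary's response to losing an active backup in arbiter controlled mode."""
    if view is ArbiterView.ARBITER_LOST_BACKUP:
        return PrimaryResponse.CONTINUE_AGENT_MODE
    if view is ArbiterView.ARBITER_SEES_BACKUP:
        # Trouble state ends once the coordinated switch completes or the
        # backup drops off the arbiter.
        return PrimaryResponse.TROUBLE_COORDINATE
    # No arbiter and no backup: stay in trouble so the backup can safely take over.
    return PrimaryResponse.TROUBLE_INDEFINITELY
```

Note that no timer appears anywhere in this function, which reflects the fact that the trouble timeout does not apply in arbiter controlled mode.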
Backup Loses Connection to Primary
If the backup loses its connection to the primary, or exceeds the QoS timeout waiting for a message from the primary, and learns from the arbiter that the arbiter has also lost its connection to the primary or exceeded the QoS timeout waiting for a response from the primary, the backup takes over as primary and switches to agent controlled mode. When connectivity is restored, if the former primary is not already down, the new primary forces it down.
If the backup learns that the arbiter is still connected to the primary, it no longer considers itself active, switches to agent controlled mode, and coordinates with the primary’s switch to agent controlled mode through the arbiter; the backup then attempts to reconnect to the primary.
If the backup has lost its arbiter connection as well as its connection to the primary, it switches to agent controlled mode and attempts to contact the primary’s ISCAgent per the agent controlled mechanics.
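The backup's side of the same event can be modeled in the same way. As before, this is a hedged sketch rather than the product's implementation, and `ArbiterView`, `BackupResponse`, and `backup_on_primary_loss` are hypothetical names.

```python
from enum import Enum


class ArbiterView(Enum):
    NO_ARBITER_CONNECTION = "backup has also lost the arbiter"
    ARBITER_LOST_PRIMARY = "arbiter has lost (or timed out on) the primary"
    ARBITER_SEES_PRIMARY = "arbiter is still connected to the primary"


class BackupResponse(Enum):
    TAKE_OVER = "take over as primary, switch to agent controlled mode"
    RECONNECT = "drop active status, switch to agent controlled mode, reconnect"
    CONTACT_AGENT = "switch to agent controlled mode, contact primary's ISCAgent"


def backup_on_primary_loss(view: ArbiterView) -> BackupResponse:
    """Backup's response to losing the primary in arbiter controlled mode."""
    if view is ArbiterView.ARBITER_LOST_PRIMARY:
        # Neither the backup nor the arbiter can reach the primary.
        return BackupResponse.TAKE_OVER
    if view is ArbiterView.ARBITER_SEES_PRIMARY:
        return BackupResponse.RECONNECT
    # Without the arbiter, fall back to the agent controlled mechanics.
    return BackupResponse.CONTACT_AGENT
```

Only the case in which the arbiter has also lost the primary leads directly to takeover; the other two cases hand the decision back to agent controlled mode.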
Mirror Responses to Lost Connections
The following table describes the mirror’s response to all possible combinations of lost connections in arbiter controlled mode. The first three situations represent network failures only, while the others could involve, from a failover member’s viewpoint, either system or network failures (or a combination). The descriptions assume that immediately prior to the loss of one or more connections, the failover members and arbiter were all in contact with each other and the backup was active.
Note:
The mirror's response to most combinations of connection losses in arbiter controlled mode is to switch to agent controlled mode. Therefore, once one failure event has been handled, responses to a subsequent event that occurs before all connections are reestablished are the same as those described for agent controlled mode rather than the responses described in the table.
Mirror Responses to Lost Connections in Arbiter Controlled Mode
All three systems connected:
- Mirror operates in arbiter controlled mode
Backup loses connection to arbiter, still connected to primary:
- Mirror switches to agent controlled mode
- Primary continues operating as primary
- Backup attempts to reconnect to arbiter
Primary loses connection to arbiter, still connected to backup:
- Mirror switches to agent controlled mode
- Primary continues operating as primary
- Primary attempts to reconnect to arbiter
Failover members lose connection to each other, still connected to arbiter:
- Mirror switches to agent controlled mode
- Primary continues operating as primary
- Backup attempts to reconnect to primary
Arbiter failed or isolated (failover members lose connections to arbiter, still connected to each other):
- Mirror switches to agent controlled mode
- Primary continues operating as primary
- Both failover members attempt to reconnect to arbiter
Backup failed or isolated (primary and arbiter lose connections to backup, still connected to each other):
- Primary continues operating as primary and switches to agent controlled mode
- Backup (if in operation) switches to agent controlled mode and attempts to reconnect to primary
Primary failed or isolated (backup and arbiter lose connections to primary, still connected to each other):
- Primary (if in operation) remains in arbiter controlled mode and trouble state indefinitely
- Backup takes over as primary, switches to agent controlled mode, and forces primary down when connectivity is restored
All three connections lost:
- Primary (if in operation) remains in arbiter controlled mode and trouble state indefinitely; if contacted by backup, switches to agent controlled mode and resumes operation as primary
- Backup (if in operation) switches to agent controlled mode and attempts to reconnect to primary
Note:
Loss of all connections due to a single event (or multiple simultaneous events) is rare. In most cases the mirror has switched to agent controlled mode before all connections are lost, in which case the failover members respond as described for agent controlled mode.