DataCheck Overview

DataCheck provides a mechanism to compare the state of data on two systems—the DataCheck source and the DataCheck destination—to determine whether or not they match. All configuration, operational controls and results of the check are provided on the destination system; the source system is essentially passive.

On the instance of InterSystems IRIS® that is to act as the DataCheck destination, you must create a DataCheck destination configuration. You can create multiple destination configurations on the same instance, which you can configure to check data against multiple source systems (or configure them to check different data against a single source). If you are using DataCheck to check the consistency of a mirror, see DataCheck for Mirror Configurations for more details.

The following subsections describe DataCheck topics in more detail:

DataCheck Queries

The destination system submits work units called DataCheck “queries” to the source system. Each query specifies a database, an initial global reference, a number of nodes, and a target global reference. Both systems calculate an answer by traversing the specified number of global nodes starting with the initial global reference, and hashing the global keys and values. If the answers match, the destination system records the results and resubmits the query with a larger number of nodes and the initial global reference advanced; if they don’t match, the query is resubmitted with a smaller number of nodes until the discrepancy is isolated down to the configured minimum query size.

You can display information about the queries submitted by the destination system using the View Queries option of the View Details submenu of the ^DATACHECK routine, including the globals that remain to be processed (or global ranges if subscript include/exclude ranges are used), and the active queries currently being worked on by DataCheck.

DataCheck Jobs

The answer to each query is calculated by DataCheck worker jobs running on both the source system and the destination system. The number of worker jobs is determined by the dynamically tunable performance settings of the destination system; for more information, see Performance Considerations.

In addition to the worker jobs, there are other jobs on each system. The following additional jobs run on the destination system:

Manager job—Loads and dispatches queries, compares query answers, and manages the progression through the workflow phases; this job is connected to the source system Manager job.
Receiver job—Receives answers from the source system.

The following additional jobs run on the source system:

Manager job—Receives requests from the destination system Manager job and sends them to worker jobs.
Sender job—Receives query answers from the worker jobs and sends them to the destination system Receiver job; this job is connected to the destination system Receiver job.

DataCheck Results

The results of the check lists global subscript ranges with one of the following states:

Unknown—DataCheck has not yet checked this range.
Matched—DataCheck has found that this range matches.
Unmatched—DataCheck has found a discrepancy in this range.
Collation Discrepancy—Global was found to have differing collation between the source system and the destination system.
Excluded—This range is excluded from checking.

You can view the results from the current check and the final results from the last check on the destination system; for more information, see the SYS.DataCheck.RangeListOpens in a new tab class. For all subscript ranges within DataCheck, the beginning of a range is inclusive and the end exclusive. See Specifying Globals and Subscript Ranges to Check for information about subscript ranges.

The following provides a sample check result:

c:\InterSystems\iris\mgr\mirror2 ^XYZ       Unmatched
        ^XYZ --Matched--> ^XYZ(3001,4)
        ^XYZ(3001,4) --Unmatched--> ^XYZ(5000)
        ^XYZ(5000) --Matched--> [end]

This result indicates that the nodes in the range starting at ^XYZ up to but not including ^XYZ(3001,4) are matched, while there is at least one discrepancy in the range of nodes from ^XYZ(3001,4) up to but not including ^XYZ(5000). The nodes in the range from ^XYZ(5000) to the end are matched.

The minimum number and frequency of discrepancies in the unmatched range depends on the minimum query size (see Performance Considerations). For example, if the minimum query size is set to the default of 32 in this case, there is at least one discrepancy every 32 nodes from ^XYZ(3001,4) until ^XYZ(5000); if there were a sequence within this range of more than 32 nodes without a discrepancy, it would appear in the results as a separate matched range.

DataCheck Workflow

During the check, data may be changing and transient discrepancies may be recorded. Rechecking may be required to eliminate these transient discrepancies. The destination system has a workflow that defines a strategy for how to check the globals.

A typical workflow begins with the “Check” phase as phase #1. (Phase #1 should always be defined as the logical starting point of the check cycle, since it is used by the workflow timeout and the Start dialog of the ^DATACHECK routine to indicate a "reset" from beginning, as described in the next section.) At the beginning of this phase, the current set of results are saved as the last completed results and a new set of active results is established. DataCheck makes an initial pass through all globals specified for inclusion in the check.

Following the Check phase, the “Recheck Discrepancies” phase is typically specified with the desired number of iterations. Each iteration rechecks all unmatched ranges in an effort to eliminate transient discrepancies.

As each phase of the workflow is completed, DataCheck moves to the next phase. The workflow is implicitly restarted from phase #1 after the last phase is complete. The “Stop” phase shuts down all DataCheck jobs and the “Idle” phase causes DataCheck to wait for you to manually specify the next phase.

Starting/Stopping/Reconnecting DataCheck

You can stop and start DataCheck at any time; when you start DataCheck, it resumes the workflow from where it left off. In addition, you can specify a different workflow phase to follow the current phase and/or abort the current phase at any time.

If, during a check, DataCheck is stopped, becomes disconnected, or pauses due to mirroring, the routine reports why the system was stopped, what phase it stopped in, and what it will do when it starts (for example, resume processing, move to the next phase, change phase due to user request or restart at phase #1 due to workflow timeout). If, upon starting, DataCheck is going to resume processing the current phase or make a transition to any phase other than phase #1, you are offered the option of restarting at phase #1, as in the following example:

Option? 4

Configuration Name: test

State:  Stopped due to Stop Requested
Current Phase: 1 - Check
Workflow Phases:
  1 - Check
  2 - RecheckDiscrepancies, Iterations=10
  3 - Stop
  (restart)
Workflow Timeout: 432000
New Phase Requested: 2
Abort Current Phase Requested

DataCheck is set to abort the current phase and transition to phase #2.

You may enter RESTART to restart at phase #1

Start Datacheck configuration 'test'? (yes/no/restart)

In cases in which DataCheck becomes disconnected and reconnects only after an extended period, it may be more desirable to restart from phase #1 of the workflow instead. For example, if the systems were disconnected for several weeks in the middle of a check and then the check is resumed, the results are of questionable value, having been collected in part from two weeks prior and in part from the present time. The workflow has a Timeout property that specifies the time, in seconds, within which DataCheck may resume a partially completed workflow phase. If the timeout is exceeded, DataCheck restarts from phase #1 the next time it reaches the running state. The default value is five days (432000 seconds), based on the assumption that a large amount of data is checked by this DataCheck configuration and the check may take hours or days to complete normally; a smaller value may be preferable for configurations that complete a check in a shorter amount of time. A value of zero means no timeout.

Note:

As noted, you should define phase #1 to be the logical starting point of the check cycle, since it is used by the workflow timeout and the Start dialog of the ^DATACHECK routine to indicate a "reset" from beginning, as shown in the previous example.

DataCheck for Mirror Configurations