Using Health Monitor
Health Monitor monitors a running InterSystems IRIS instance by sampling the values of a broad set of key metrics during specific periods and comparing them to configured parameters for the metric and established normal values for those periods; if sampled values are too high, Health Monitor generates an alert (notification of severity 2) or warning (severity 1). For example, if CPU usage values sampled by Health Monitor at 10:15 AM on a Monday are too high based on the configured maximum value for CPU usage or normal CPU usage samples taken during the Monday 9:00 AM to 11:30 AM period, Health Monitor generates a notification.
Health Monitor is part of the System Monitor tools.
Health Monitor Overview
Health Monitor uses a fixed set of rules to evaluate sampled values and identify those that are abnormally high. This design is based on the approach to monitoring manufacturing processes described in the “Process or Product Monitoring and Control” section of the NIST/SEMATECH e-Handbook of Statistical Methods, with deviation from normal values determined using rules based on the WECO statistical probability rules (Western Electric Rules), both adapted specifically for InterSystems IRIS monitoring purposes.
Health Monitor alerts (severity 2) and warnings (severity 1) are written to the messages log (install-dir\mgr\messages.log). See Tracking System Monitor Notifications for information about ways to make sure you are aware of these notifications.
Health Monitor status messages (severity 0) are written to the System Monitor log (install-dir\mgr\SystemMonitor.log).
Unlike System Monitor and Application Monitor, Health Monitor runs only in the %SYS namespace.
The following subsections describe how Health Monitor works and contain information about configuring and extending it in various ways.
Health Monitor Process Description
By default, Health Monitor does not start automatically when the instance starts; for this to happen, you must enable Health Monitor within System Monitor using the ^%SYSMONMGR utility. (You can specify an interval to wait after InterSystems IRIS starts before starting Health Monitor when it is enabled, allowing the instance to reach normal operating conditions before sampling begins.) You can always use the utility to see the current status of Health Monitor. For more information, see Using ^%SYSMONMGR to Manage Health Monitor.
The basic elements of the Health Monitor process are described in the following:
-
Health Monitor monitors a number of system sensors, which are represented as sensor objects. Every sensor object has a base (minimum) value for sensor samples, and optionally includes two notification threshold values (one for alerts, and the other for warnings) which can be set as absolute values or multipliers. These values determine when Health Monitor sends notifications.
Sensors and Sensor Objects lists all the sensor objects.
-
For the duration of a predefined period, each sensor is sampled every 30 seconds; samples below the base value are discarded. By default there are 63 weekly periods (nine per day), but you can configure your own weekly, monthly, quarterly, or yearly periods. Periods lists the default periods.
-
For a given sensor, unless the notification thresholds are set as absolute values, Health Monitor evaluates the sensor readings based on a chart. If the necessary chart for the current period does not exist, Health Monitor places the sensor in analysis mode to generate the chart.
You can edit or create a chart to calibrate how Health Monitor evaluates sensor readings. For more information, see Charts.
-
If a sensor is not in analysis mode, it is in monitoring mode. In monitoring mode, sensor readings are evaluated by the appropriate subscriber class. To ensure that notifications are not triggered by transient abnormal samples, every six sample values are averaged together to generate one reading every three minutes, and it is these readings that are evaluated.
-
When a sequence of readings meets the criteria for a notification (as described in Notification Rules), the subscriber class generates an alert or a warning by passing a notification containing text and a severity code to the system notifier, SYS.Monitor.SystemNotify.
Note: Because no chart is required to evaluate readings from sensors whose sensor objects have maximum and warning values specified, evaluation of these sensor readings and posting of any resulting notifications is handled by the SYS.Monitor.SystemSubscriber subscriber class, rather than the SYS.Monitor.Health.Control subscriber class (see Default System Monitor Components). As a result, notifications for these sensors are generated even when Health Monitor is not enabled, as long as System Monitor is running.
If you want to generate notifications using absolute values for some sensors but using multipliers for others—for example, using absolute values for DBLatency sensors for some databases but multipliers for others—you can do so by setting multipliers in the sensor object and manually creating charts for those for which you want to use absolute values; see Editing a Chart for more information.
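The sampling and averaging steps described above can be sketched as follows. This is an illustrative model only, not Health Monitor's implementation: the function name is hypothetical, and it assumes that samples below the base value are dropped before grouping and that an incomplete group of fewer than six samples produces no reading.

```python
from statistics import fmean

# Six 30-second samples are averaged into one reading every three minutes.
SAMPLES_PER_READING = 6


def readings_from_samples(samples, base):
    """Discard samples below the base value, then average each full group of six.

    Assumption: discarded samples are removed before grouping, and a
    trailing partial group is ignored.
    """
    kept = [s for s in samples if s >= base]
    readings = []
    for start in range(0, len(kept) - SAMPLES_PER_READING + 1, SAMPLES_PER_READING):
        group = kept[start:start + SAMPLES_PER_READING]
        readings.append(fmean(group))
    return readings
```

Averaging in this way means a single transient spike changes a reading by at most one sixth of its size, which is why readings rather than raw samples are evaluated.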
Sensors and Sensor Objects
A Health Monitor sensor object represents one of the sensors in SYS.Monitor.SystemSensors. Each sensor object must provide a base value, and can optionally provide a maximum (alert) threshold and a warning threshold (either as absolute values or multipliers); see Notification Rules for information about how these values are used in evaluating sensor readings. The Health Monitor sensor objects are shown with their default parameters in the following table.
Some sensors represent an overall metric for the InterSystems IRIS instance. These are the sensors which, in the following table, have no value listed in the Sensor Item column. For example, the LicensePercentUsed sensor samples the percentage of the instance’s authorized license units that are currently in use, while the JournalGrowthRate sensor samples the amount of data (in KB per minute) written to the instance’s journal files.
Other sensors collect information about a specific sensor item (either a CSP server, a database, or a mirror). For example, DBReads sensors sample the number of reads per minute from each mounted database. These sensors are specified as <sensor_object> <sensor_item>; for example, the DBLatency install-dir\IRIS\mgr\user sensor samples the time (in milliseconds) required to complete a random read on the USER database.
Sensor objects can be listed and edited (but not deleted) using the ^%SYSMONMGR utility (as described in Configure Health Monitor Classes). Editing a sensor object allows you to modify one or all of its values. You can enter a base value only; a base, maximum (alert), and warning value; or a base value, maximum (alert) multiplier, and warning multiplier.
Sensor Object | Sensor Item | Description | Base | Max Val. | Max Mult. | Warn Val. | Warn Mult. |
---|---|---|---|---|---|---|---|
CPUUsage | | System CPU usage (percent). | 50 | 85 | — | 75 | — |
CSPSessions | IP_address:port | Number of active web sessions on the listed Web Gateway server. | 100 | — | 2 | — | 1.6 |
CSPActivity | IP_address:port | Requests per minute to the listed Web Gateway server. | 100 | — | 2 | — | 1.6 |
CSPActualConnections | IP_address:port | Number of connections created on the listed Web Gateway server. | 100 | — | 2 | — | 1.6 |
CSPInUseConnections | IP_address:port | Number of currently active connections to the listed Web Gateway server. | 100 | — | 2 | — | 1.6 |
CSPPrivateConnections | IP_address:port | Number of private connections to the listed Web Gateway server. | 100 | — | 2 | — | 1.6 |
CSPUrlLatency | IP_address:port | Time (milliseconds) required to obtain a response from IP_address:port/csp/sys/UtilHome.csp. | 1000 | 5000 | — | 3000 | — |
CSPGatewayLatency | IP_address:port | Time (milliseconds) required to obtain a response from the listed Web Gateway server when fetching the metrics represented by the CSP sensor objects. | 1000 | 2000 | — | 1000 | — |
DBLatency | database_directory | Milliseconds to complete a random read from the listed mounted database. | 1000 | 3000 | — | 1000 | — |
DBReads | database_directory | Reads per minute from the listed mounted database. | 1024 | — | 2 | — | 1.6 |
DBWrites | database_directory | Writes per minute to the listed mounted database. | 1024 | — | 2 | — | 1.6 |
DiskPercentFull | database_directory | Disk percentage used for the listed mounted database. | 50 | 99 | — | 95 | — |
ECPAppServerKBPerMinute | | KB per minute sent to the ECP data server. | 1024 | — | 2 | — | 1.6 |
ECPConnections | | Number of active ECP connections. | 100 | — | 2 | — | 1.6 |
ECPDataServerKBPerMinute | | KB per minute received as ECP data server. | 1024 | — | 2 | — | 1.6 |
ECPLatency | | Network latency (milliseconds) between the ECP data server and this ECP application server. | 1000 | 3000 | — | 3000 | — |
ECPTransOpenCount | | Number of open ECP transactions. | 100 | — | 2 | — | 1.6 |
ECPTransOpenSecsMax | | Duration (seconds) of longest currently open ECP transaction. | 60 | — | 2 | — | 1.6 |
GlobalRefsPerMin | | Global references per minute. | 1024 | — | 2 | — | 1.6 |
GlobalSetKillPerMin | | Global sets/kills per minute. | 1024 | — | 2 | — | 1.6 |
JournalEntriesPerMin | | Number of journal entries written per minute. | 1024 | — | 2 | — | 1.6 |
JournalGrowthRate | | Number of KB per minute written to journal files. | 1024 | — | 2 | — | 1.6 |
LicensePercentUsed | | Percentage of authorized license units currently in use. | 50 | — | 1.5 | — | — |
LicenseUsedRate | | License acquisitions per minute. | 20 | — | 1.5 | — | — |
LockTablePercentFull | | Percentage of the lock table in use. | 50 | 99 | — | 85 | — |
LogicalBlockRequestsPerMin | | Number of logical block requests per minute. | 1024 | — | 2 | — | 1.6 |
MirrorDatabaseLatencyBytes | mirror_name | On the backup failover member of a mirror, number of bytes of journal data received from the primary but not yet applied to mirrored databases on the backup (measure of how far behind the backup’s databases are). | 2*10^7 | — | 2 | — | 1.6 |
MirrorDatabaseLatencyFiles | mirror_name | On the backup failover member of a mirror, number of journal files received from the primary but not yet fully applied to mirrored databases on the backup (measure of how far behind the backup’s databases are). | 3 | — | 2 | — | 1.6 |
MirrorDatabaseLatencyTime | mirror_name | On the backup failover member of a mirror, time (in milliseconds) between when the last journal file was received from the primary and when it was fully applied to the mirrored databases on the backup (measure of how far behind the backup’s databases are). | 1000 | 4000 | — | 3000 | — |
MirrorJournalLatencyBytes | mirror_name | On the backup failover member of a mirror, number of bytes of journal data received from the primary but not yet written to the journal directory on the backup (measure of how far behind the backup is). | 2*10^7 | — | 2 | — | 1.6 |
MirrorJournalLatencyFiles | mirror_name | On the backup failover member of a mirror, number of journal files received from the primary but not yet written to the journal directory on the backup (measure of how far behind the backup is). | 3 | — | 2 | — | 1.6 |
MirrorJournalLatencyTime | mirror_name | On the backup failover member of a mirror, time (in milliseconds) between when the last journal file was received from the primary and when it was fully written to the journal directory on the backup (measure of how far behind the backup is). | 1000 | 4000 | — | 3000 | — |
PhysicalBlockReadsPerMin | | Number of physical block reads per minute. | 1024 | — | 2 | — | 1.6 |
PhysicalBlockWritesPerMin | | Number of physical block writes per minute. | 1024 | — | 2 | — | 1.6 |
ProcessCount | | Number of active processes for the InterSystems IRIS instance. | 100 | — | 2 | — | 1.6 |
RoutineCommandsPerMin | | Number of routine commands per minute. | 1024 | — | 2 | — | 1.6 |
RoutineLoadsPerMin | | Number of routine loads per minute. | 1024 | — | 2 | — | 1.6 |
RoutineRefsPerMin | | Number of routine references per minute. | 1024 | — | 2 | — | 1.6 |
SMHPercentFull | | Percentage of the shared memory heap (generic memory heap) in use. | 50 | 98 | — | 85 | — |
TransOpenCount | | Number of open local transactions (local and remote). | 100 | — | 2 | — | 1.6 |
TransOpenSecondsMax | | Duration (seconds) of longest currently open local transaction. | 60 | — | 2 | — | 1.6 |
WDBuffers | | Average number of database buffers updated per write daemon cycle. | 1024 | — | 2 | — | 1.6 |
WDCycleTime | | Average number of seconds required to complete a write daemon cycle. | 60 | — | 2 | — | 1.6 |
WDWIJTime | | Average number of seconds spent updating the write image journal (WIJ) per cycle. | 60 | — | 2 | — | 1.6 |
WDWriteSize | | Average number of KB written per write daemon cycle. | 1024 | — | 2 | — | 1.6 |
Some sensors are not sampled for all InterSystems IRIS instances. For example, the ECP... sensors are sampled only on ECP data and application servers.
When you are monitoring a mirror member (see Mirroring), the following special conditions apply to Health Monitor:
-
No sensors are sampled while the mirror is restarting (for example, just after the backup failover member has taken over as primary) or if the member’s status in the mirror is indeterminate.
-
If a sensor is in analysis mode for a period and the member’s status in the mirror changes during the period, no chart is created and the sensor remains in analysis mode.
-
Only the MirrorDatabaseLatency* and MirrorJournalLatency* sensors are sampled on the backup failover mirror member.
-
All sensors except the MirrorDatabaseLatency* and MirrorJournalLatency* sensors are sampled on the primary failover mirror member.
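The sampling conditions for mirror members listed above can be summarized in a small sketch. The function and the role strings are hypothetical labels chosen for illustration; they are not part of any InterSystems API, and the sketch covers only failover mirror members.

```python
# Sensor name prefixes for the mirror latency sensors named in the text.
MIRROR_LATENCY_PREFIXES = ("MirrorDatabaseLatency", "MirrorJournalLatency")


def is_sampled(sensor_name, role):
    """Return True if the sensor is sampled on a failover member with this role.

    role is a hypothetical label: "primary", "backup", or any other string
    for a restarting mirror or an indeterminate member status.
    """
    is_mirror_latency = sensor_name.startswith(MIRROR_LATENCY_PREFIXES)
    if role == "backup":
        return is_mirror_latency
    if role == "primary":
        return not is_mirror_latency
    # Restarting mirror or indeterminate status: no sensors are sampled.
    return False
```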
Periods
By default there are 63 recurring weekly periods during which sensors are sampled. Each of these periods represents one of the following specified intervals during a particular day of the week:
00:15 a.m. – 02:45 a.m. | 03:00 a.m. – 06:00 a.m. | 06:15 a.m. – 08:45 a.m. |
09:00 a.m. – 11:30 a.m. | 11:45 a.m. – 01:15 p.m. | 01:30 p.m. – 04:00 p.m. |
04:15 p.m. – 06:00 p.m. | 06:15 p.m. – 08:45 p.m. | 09:00 p.m. – 11:59 p.m. |
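As a rough illustration of how a sample time maps to one of the default weekly periods, the following sketch looks up the period containing a given timestamp. The names are hypothetical, and treating both interval endpoints as inclusive is an assumption; the gaps between intervals (for example 02:45 to 03:00) are taken to belong to no period.

```python
from datetime import datetime, time

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# The nine default daily intervals listed above, as (start, end) pairs.
DEFAULT_INTERVALS = [
    (time(0, 15), time(2, 45)),
    (time(3, 0), time(6, 0)),
    (time(6, 15), time(8, 45)),
    (time(9, 0), time(11, 30)),
    (time(11, 45), time(13, 15)),
    (time(13, 30), time(16, 0)),
    (time(16, 15), time(18, 0)),
    (time(18, 15), time(20, 45)),
    (time(21, 0), time(23, 59)),
]


def find_period(moment):
    """Return (weekday, start, end) for the default period containing moment, or None."""
    t = moment.time()
    for start, end in DEFAULT_INTERVALS:
        if start <= t <= end:  # assumption: endpoints are inclusive
            return (WEEKDAYS[moment.weekday()], start, end)
    return None
```

With nine intervals on each of seven days, this yields the 63 default periods; a sample taken at 10:15 a.m. on a Monday falls in the Monday 09:00 a.m. to 11:30 a.m. period.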
You can list, add and delete periods using the Configure Periods option in the ^%SYSMONMGR utility (see Configure Health Monitor Classes). You can add monthly, quarterly or yearly periods as well as weekly periods.
Quarterly periods are listed in three-month increments beginning with the month specified as the start month; for example, if you specify 5 (May) as the starting month, the quarterly cycle repeats in August (8), November (11) and February (2).
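The three-month cycle can be checked with a short calculation. This is a minimal sketch; the function name is hypothetical and is not part of the ^%SYSMONMGR utility.

```python
def quarterly_months(start_month):
    """Return the four months (1-12) of a quarterly cycle beginning at start_month."""
    if not 1 <= start_month <= 12:
        raise ValueError("start_month must be between 1 and 12")
    # Step forward three months at a time, wrapping from December to January.
    return [((start_month - 1 + 3 * i) % 12) + 1 for i in range(4)]
```

For a start month of 5 (May), the function returns 5, 8, 11, and 2, matching the example above.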
Descriptions are optional for user-defined periods.
Charts
If the notification threshold values for a sensor object are given as multipliers (or are not specified), Health Monitor requires a chart to evaluate that sensor’s readings. Health Monitor generates the necessary charts by calculating the mean, standard deviation, and maximum value from sample sensor readings. This section describes how Health Monitor generates charts in analysis mode, and how to edit or create custom charts.
Analysis Mode
Before Health Monitor can evaluate sensor samples, it checks whether that sensor requires a chart. If a chart is required but does not yet exist, Health Monitor automatically puts the sensor in analysis mode.
In analysis mode, Health Monitor simply records the samples it collects, and at the end of the period generates the required chart for the sensor. To ensure that the chart is reliable, a minimum of 13 samples must be taken in analysis mode. Until 13 valid samples are taken within a single recurrence of a period, the sensor remains in analysis mode and no chart is generated for that period.
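The chart statistics can be pictured with the following sketch, which computes the mean, sigma, and maximum from the samples of one period recurrence. It is illustrative only: the function name and dictionary layout are hypothetical, and the use of the population standard deviation for sigma is an assumption, since the text does not say which estimator Health Monitor uses.

```python
from statistics import fmean, pstdev

# Minimum number of valid samples required before a chart is generated.
MIN_SAMPLES = 13


def build_chart(samples):
    """Return chart statistics for one period recurrence, or None if too few samples."""
    if len(samples) < MIN_SAMPLES:
        # Not enough data: the sensor would remain in analysis mode.
        return None
    return {
        "mean": fmean(samples),
        "sigma": pstdev(samples),  # assumption: population standard deviation
        "max": max(samples),
    }
```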
Charts should always be generated from samples taken during normal, stable operation of the InterSystems IRIS instance. For example, when a Monday 09:00 a.m. - 11:30 a.m. chart does not exist, it should not be generated on a Monday holiday or while a technical problem is affecting the operation of the InterSystems IRIS instance.
When a period has recurred five times since a chart was generated for a sensor or sensor/item during that period, not including those during which an alert was generated, the readings from these five normal period recurrences are evaluated to detect a rising or shifted mean for the sensor. If the mean is rising or has shifted with 95% certainty, the chart is recalibrated—the existing chart for the sensor during that period is replaced with a chart generated from the samples taken during the most recent recurrence of the period. For example, if the number of users accessing a database is growing slowly but steadily, the mean DBReads value for that database is likely to also rise slowly but steadily, resulting in regular chart recalibration every five periods, which avoids unwarranted alerts.
Note that the absolute and multiplier values in a sensor object are not recalibrated automatically; automatic recalibration applies only to charts, so these values must be adjusted manually. For example, if the number of users accessing a database grows, the base, maximum (alert) value, and warning value for the DBLatency sensor object may require manual adjustment.
Editing a Chart
The ^%SYSMONMGR utility lets you display a list of all current charts, including the mean and sigma of each. You can also display the details of a particular chart, including the individual readings and highest reading. To access these options from the utility, select Configure Charts from the Configure Health Monitor Classes submenu.
The Configure Charts option also provides two ways to customize alerting by customizing charts:
-
You can change the mean and/or sigma to whatever values you wish by editing an existing chart. The standard notification rules apply, but using the values you have entered.
-
You can create a chart, specifying an alert value and a warning value. Creating a chart is similar to setting an absolute value for the notification threshold; alerts and warnings are generated based solely on the values you supply for the chart.
When listing, examining, editing, or creating charts, the Item heading or prompt refers to a database (specified by a directory path), a Web Gateway server (specified by an IP address), or a mirror (specified by the mirror name). See Sensors and Sensor Objects for more information.
You can also programmatically build chart statistics based on a list of values with the following SYS.Monitor.Health.Chart class methods:
-
CreateChart(): Creates a chart for a specific period/sensor, evaluates the list of values, and sets the resulting mean and sigma values.
-
SetChartStats(): Evaluates the list of values and sets the resulting mean and sigma values for a specified period/sensor.
For more information, see the SYS.Monitor.Health.Chart class documentation.
A chart generated by Health Monitor, including one you have edited, can be automatically recalibrated as described in Analysis Mode. In addition, all charts generated by Health Monitor, including those that have been edited, are deleted when an InterSystems IRIS instance is upgraded.
A chart created using the Configure Charts submenu or the CreateChart() class method, however, is never automatically recalibrated or deleted on upgrade. A user-created chart is therefore permanently associated with the selected sensor/period combination until you select the Reset Charts option within the Reset Defaults option of the Configure Health Monitor Classes submenu or select Recalibrate Charts within the Configure Charts option.
Notification Rules
Health Monitor generates an alert (notification of severity 2) if three consecutive readings of a sensor during a period are greater than the sensor maximum threshold value, and a warning (notification of severity 1) if five consecutive readings of a sensor during a period are greater than the sensor warning threshold value. The maximum and warning threshold values depend on the settings in the sensor object and whether the applicable chart was generated by Health Monitor or created by a user, as shown in the following table.
Note also that:
-
When a sensor object has maximum value and warning value set, no chart is required and therefore no chart is generated, and notifications are generated even when Health Monitor is disabled.
-
When a sensor object has maximum multiplier and warning multiplier set, or base only, a chart is required; until sufficient samples have been collected in analysis mode to generate the chart, no notifications are generated.
-
When a user-created chart exists, it does not matter what the sensor object settings are.
Sensor Object Settings | Chart Type | Sensor Maximum Value | Sensor Warning Value | Active When |
---|---|---|---|---|
base, maximum value, warning value | none | sensor object maximum value | sensor object warning value | System Monitor running |
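base, maximum multiplier, warning multiplier | generated | sensor object maximum multiplier times the greater of: chart mean plus three sigma; highest chart value plus one sigma | sensor object warning multiplier times the greatest of the chart-derived warning values | System Monitor running, Health Monitor enabled |
base only | generated | greater of: chart mean plus three sigma; highest chart value plus one sigma | greater of the chart-derived warning values | System Monitor running, Health Monitor enabled |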
(n/a if user-created chart exists) | user-created | chart alert value | chart warning value | System Monitor running, Health Monitor enabled |
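The consecutive-reading rules can be modeled with a simple counter per threshold. This is a sketch under stated assumptions, not the subscriber class logic: the function name is hypothetical, it reports only the highest severity reached, and it stops at the first alert.

```python
def evaluate_readings(readings, max_threshold, warn_threshold):
    """Return 2 (alert), 1 (warning), or 0 (no notification) for a run of readings.

    An alert needs three consecutive readings above max_threshold;
    a warning needs five consecutive readings above warn_threshold.
    """
    over_max = 0   # current run of readings above the maximum threshold
    over_warn = 0  # current run of readings above the warning threshold
    severity = 0
    for reading in readings:
        over_max = over_max + 1 if reading > max_threshold else 0
        over_warn = over_warn + 1 if reading > warn_threshold else 0
        if over_max >= 3:
            return 2
        if over_warn >= 5:
            severity = 1
    return severity
```

Using the default CPUUsage thresholds (maximum 85, warning 75), three consecutive readings of 90 produce an alert, while five consecutive readings of 80 produce a warning. A single reading at or below a threshold resets the corresponding run.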
Examples
In this example, the chart for the DBReads install-dir\IRIS\mgr\user sensor during the Monday 09:00 a.m. - 11:30 a.m. period indicates that the mean reads per minute from the USER database is 2145, with a sigma of 141 and maximum value of 2327. The default notification threshold multiplier for DBReads is 2. An alert is generated for this sensor when three consecutive readings exceed the greater of the following two values:
-
maximum multiplier * (chart mean + (3 * chart sigma))
2 * (2145 + (3 * 141)) = 5136
-
maximum multiplier * (chart maximum value + chart sigma)
2 * (2327 + 141) = 4936
So, for this sensor during this period, an alert is generated if three consecutive readings are greater than 5136.
A sensor with no multipliers or maximum values is evaluated with a multiplier of 1. As an example, if the DBReads sensor object were edited to remove the multipliers, leaving it with only a base, an alert is generated for DBReads install-dir\IRIS\mgr\user when three consecutive readings are greater than 2568, calculated as the greater of:
-
maximum multiplier * (the chart mean + three times the sigma)
1 * (2145 + (3 * 141)) = 2568
-
maximum multiplier * (the highest value in the chart + one sigma)
1 * (2327 + 141) = 2468
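The arithmetic in both examples can be reproduced with a short function. This is a minimal sketch with a hypothetical name; it factors the multiplier out of the comparison, which gives the same result as comparing the two multiplied values whenever the multiplier is positive.

```python
def alert_threshold(mean, sigma, chart_max, multiplier=1):
    """Return the alert threshold for a sensor evaluated against a generated chart.

    A sensor object with a base value only is evaluated with a multiplier of 1.
    """
    return multiplier * max(mean + 3 * sigma, chart_max + sigma)
```

With the chart values above (mean 2145, sigma 141, maximum 2327), a multiplier of 2 gives 5136 and a multiplier of 1 gives 2568.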
Using ^%SYSMONMGR to Manage Health Monitor
As described in Using the ^%SYSMONMGR Utility, the ^%SYSMONMGR utility lets you manage and configure System Monitor, including Health Monitor. To manage Health Monitor, change to the %SYS namespace in the Terminal, then enter the following command:
%SYS>do ^%SYSMONMGR
1) Start/Stop System Monitor
2) Set System Monitor Options
3) Configure System Monitor Classes
4) View System Monitor State
5) Manage Application Monitor
6) Manage Health Monitor
7) View System Data
8) Exit
Option?
Health Monitor runs only in the %SYS namespace. When you start ^%SYSMONMGR in another namespace, option 6 (Manage Health Monitor) does not appear.
Enter 6 for Manage Health Monitor. The following menu displays:
1) Enable/Disable Health Monitor
2) View Alerts Records
3) Configure Health Monitor Classes
4) Set Health Monitor Options
5) Exit
Option?
Enter the number of your choice or press Enter to exit the Health Monitor utility.
The options in the main menu let you perform Health Monitor tasks as described in the following table:
Option | Description |
---|---|
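1) Enable/Disable Health Monitor | Enable Health Monitor so that it starts when the instance starts, or disable it (see Health Monitor Process Description). |
2) View Alerts Records | View recently generated alerts for a specific sensor or for all sensors (see View Alerts Records). |
3) Configure Health Monitor Classes | Customize periods, charts, and sensor objects (see Configure Health Monitor Classes). |
4) Set Health Monitor Options | Set the startup wait time and the alert purge time (see Set Health Monitor Options). |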
When the utility asks you to specify a single element such as a sensor, rule, period or chart, you can enter ? (question mark) at the prompt for a numbered list, then enter the number of the element you want.
All output from the utility can be displayed on the Terminal or sent to a specified device.
View Alerts Records
Choose this option to view recently generated alerts for a specific sensor, or for all sensors. You can examine the details of individual alerts and warnings, including the mean and sigma of the chart and the readings that triggered the notification. (Alert records are purged after a configurable number of days; see Set Health Monitor Options for more information.)
Configure Health Monitor Classes
The options in this submenu let you customize Health Monitor, as described in the following table.
You cannot use these options to customize Health Monitor while System Monitor is running; you must first stop System Monitor, and then restart it after you have made your changes.
Option | Description |
---|---|
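1) Activate/Deactivate Rules | (not in use in this release) |
2) Configure Periods | List the currently configured periods and add and delete periods. |
3) Configure Charts | Lets you list the current charts with their mean and sigma, display the details of a chart, edit or create a chart, and recalibrate charts (see Editing a Chart). |
4) Edit Sensor Objects | List the sensor objects representing the sensors in the SYS.Monitor.SystemSensors class and modify their base, maximum, warning, maximum multiplier, and warning multiplier values. |
5) Reset Defaults | Lets you reset Health Monitor settings to their defaults, including resetting charts with the Reset Charts option. |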
Set Health Monitor Options
This submenu lets you set several Health Monitor options, as shown in the following table:
Option | Description |
---|---|
1) Set Startup Wait Time | Configure the number of minutes System Monitor waits after starting, when Health Monitor is enabled, before passing sensor readings to the Health Monitor subscriber, SYS.Monitor.Health.Control. This allows InterSystems IRIS to reach normal operating conditions before Health Monitor begins creating charts or evaluating readings. |
2) Set Alert Purge Time | Specify when an alert record should be purged (deleted); the default is five days after the alert is generated. |
See Also
-
Manage Email Options (information about generating email messages from notifications in the messages log, including those generated by System Monitor)
-
Monitoring Log Files (includes information on the log files generated by this tool)