Providers

Providers are machine learning frameworks that IntegratedML makes accessible through a common interface. To choose a provider for training, select an ML configuration that specifies the desired provider.

You can pass additional parameters specific to these providers with a USING clause. See Adding Training Parameters (the USING clause) for further discussion.

AutoML

AutoML is an automated machine learning system developed by InterSystems, housed within InterSystems IRIS® data platform. Within IntegratedML, AutoML trains models quickly to produce accurate results. Additionally, AutoML features basic natural language processing (NLP), which allows the provider to incorporate feature columns containing unstructured text into machine learning models.

%AutoML is the system-default ML configuration for IntegratedML, and points to AutoML as the provider.

Training Parameters — AutoML

You can pass training parameters with a USING clause. For example:

TRAIN MODEL my_model USING {"seed": 3}

With AutoML, you can pass the following parameters into your training queries:

Training Parameter: Description

seed: A seed to initialize the random number generator. You can manually set any integer as the seed for reproducibility between training runs. By default, seed is set to “None”.

verbosity: Determines how verbose the output of each training run is. This output can be found in the ML_TRAINING_RUNS view. You can specify any of the following options for verbosity:

  • 0: Minimal/no output.

  • 1: Moderate output.

  • 2: Full output. This is the default setting for verbosity.

TrainMode: Determines the model selection metric for classification models. You can specify one of the following options for TrainMode:

  • “TIME”: Model selection prioritizes faster training time.

  • “BALANCE”: Model selection compares models by an equal proportion of each model’s respective score and training time.

  • “SCORE”: Model selection does not factor training run time at all. This is the default setting for TrainMode.

See the AutoML Reference for more information about these different modes.

MaxTime: The number of minutes allotted for initiating training runs. This does not necessarily limit training time. For example, if the MaxTime is set to 3000 minutes and there are 2 minutes remaining after a model is trained, another model could still be trained. By default, MaxTime is set to 14400 minutes.

Note:

This parameter is only applicable if TrainMode is set to “TIME”.

MinimumDesiredScore: The minimum score to allow for classification model selection, irrespective of the training mode selected. You can set any value between 0 and 1. By default, MinimumDesiredScore is set to 0.

Note:

This parameter is only applicable if TrainMode is set to “TIME”.

If the trained logistic regression or random forest classifier model exceeds the MinimumDesiredScore, then AutoML does not train the neural network model. See the AutoML Reference for more information about the different models used for classification models.
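If you generate training statements from application code, it can help to check parameter values before sending them. The following is a minimal sketch, assuming the allowed values listed in the parameter table above. The helper names build_automl_using and train_statement are hypothetical and are not part of IntegratedML.

```python
import json

# Allowed values copied from the parameter table above.
VALID_VERBOSITY = {0, 1, 2}
VALID_TRAIN_MODES = {"TIME", "BALANCE", "SCORE"}


def build_automl_using(seed=None, verbosity=None, train_mode=None,
                       max_time=None, minimum_desired_score=None):
    """Return the JSON text for a USING clause (hypothetical helper)."""
    params = {}
    if seed is not None:
        if isinstance(seed, bool) or not isinstance(seed, int):
            raise ValueError("seed must be an integer")
        params["seed"] = seed
    if verbosity is not None:
        if verbosity not in VALID_VERBOSITY:
            raise ValueError("verbosity must be 0, 1, or 2")
        params["verbosity"] = verbosity
    if train_mode is not None:
        if train_mode not in VALID_TRAIN_MODES:
            raise ValueError("TrainMode must be TIME, BALANCE, or SCORE")
        params["TrainMode"] = train_mode
    if max_time is not None:
        if max_time <= 0:
            raise ValueError("MaxTime must be a positive number of minutes")
        params["MaxTime"] = max_time
    if minimum_desired_score is not None:
        if not 0 <= minimum_desired_score <= 1:
            raise ValueError("MinimumDesiredScore must be between 0 and 1")
        params["MinimumDesiredScore"] = minimum_desired_score
    return json.dumps(params)


def train_statement(model_name, using_json):
    """Assemble the SQL text; model_name is assumed to be a valid identifier."""
    return f"TRAIN MODEL {model_name} USING {using_json}"
```

For example, train_statement("my_model", build_automl_using(seed=3)) produces the same statement shown at the start of this section. The sketch does not enforce the notes about TrainMode; it only checks the value ranges.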

Feature Engineering

AutoML uses feature engineering to modify existing features, create new ones, and remove unnecessary ones. These steps, which improve training speed and performance, include:

  • Column type classification to correctly use features in models

  • Feature elimination to remove redundancy and improve accuracy

  • One-hot encoding of categorical features

  • Filling in missing or null values in incomplete datasets

  • Creating new columns for hours, days, months, and years, wherever applicable, to capture time-related patterns in your data
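To make three of these steps concrete, the following sketch shows one-hot encoding, filling missing values, and deriving date-part columns in plain Python. It is an illustration of the general techniques only. The function names are hypothetical, and the strategies AutoML actually applies (for example, how it fills missing values) are not described here; mean imputation is an assumption.

```python
from datetime import datetime


def one_hot(values):
    """One-hot encode categorical values into {category: [0/1, ...]} columns."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


def fill_missing(values):
    """Replace None with the mean of the present values (assumed strategy)."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def date_parts(timestamps):
    """Derive hour/day/month/year columns from ISO 8601 timestamp strings."""
    parsed = [datetime.fromisoformat(ts) for ts in timestamps]
    return {
        "hour": [d.hour for d in parsed],
        "day": [d.day for d in parsed],
        "month": [d.month for d in parsed],
        "year": [d.year for d in parsed],
    }
```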

Model Selection

If a regression model is determined to be appropriate, AutoML uses a single process for developing a regression model.

For classification models, AutoML uses the following selection process to determine the most accurate model:

  1. If the dataset is too large, AutoML samples down the data to speed up the model selection process. The full dataset is still used for training after model selection.

  2. AutoML determines if the dataset presents a binary classification problem, or if multiple classes are present, to use the proper scoring metric.

  3. Using Monte Carlo cross validation, AutoML selects the model with the best scoring metrics for training on the entire dataset.
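The idea behind Monte Carlo cross validation in step 3 can be sketched as follows: each candidate is fit on a random subset of the data and scored on the held-out remainder, repeatedly, and the scores are averaged. This is a toy illustration, not AutoML's implementation. The candidate models, the accuracy metric, the number of splits, and the split fraction are all assumptions chosen for the example.

```python
import random


def monte_carlo_cv(rows, labels, models, n_splits=20, test_fraction=0.3, seed=0):
    """Return each candidate's mean accuracy over repeated random splits."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_test = max(1, round(len(rows) * test_fraction))
    totals = {name: 0.0 for name in models}
    for _ in range(n_splits):
        rng.shuffle(indices)  # a fresh random split on every iteration
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        train_x = [rows[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        for name, fit in models.items():
            predict = fit(train_x, train_y)
            correct = sum(predict(rows[i]) == labels[i] for i in test_idx)
            totals[name] += correct / n_test
    return {name: total / n_splits for name, total in totals.items()}


def fit_majority(train_x, train_y):
    """Toy baseline: always predict the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority


def fit_threshold(train_x, train_y):
    """Toy classifier: cut one numeric feature midway between the class means."""
    ones = [x for x, y in zip(train_x, train_y) if y == 1]
    zeros = [x for x, y in zip(train_x, train_y) if y == 0]
    if not ones or not zeros:
        return fit_majority(train_x, train_y)
    mean_one, mean_zero = sum(ones) / len(ones), sum(zeros) / len(zeros)
    cut = (mean_one + mean_zero) / 2
    high_is_one = mean_one > mean_zero
    return lambda x: int((x > cut) == high_is_one)


# Hypothetical data: one numeric feature, binary label.
rows = [float(i) for i in range(20)]
labels = [0] * 10 + [1] * 10
scores = monte_carlo_cv(rows, labels,
                        {"majority": fit_majority, "threshold": fit_threshold})
best = max(scores, key=scores.get)  # the candidate to retrain on the full dataset
```

The selected candidate would then be trained on the entire dataset, as described in step 3.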

Note:

A more detailed description of this model selection process can be found in the AutoML Reference.

Platform Support and Known Issues

The AutoML provider is not supported on any IBM AIX® platform, Red Hat Enterprise Linux 8 for ARM, or Ubuntu 20.04 for ARM.

AutoML is implemented using Python, which may lead to improper isolation between AutoML Python packages and Embedded Python packages. As a result, AutoML may be unable to find packages it needs to work correctly. To avoid this issue, add <path to instance>/lib/automl to the Python sys.path within your instance of InterSystems IRIS. To do so, open a Python shell with %SYS.Python.Shell() and enter the following commands:

import sys
sys.path.append("<path to instance>\\lib\\automl")
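If you run these commands more than once in the same session, the same entry is appended to sys.path repeatedly. A minimal sketch of a variant that adds the entry only once is shown below. The function name add_automl_path is hypothetical, and the instance directory you pass in is whatever <path to instance> is on your system.

```python
import os
import sys


def add_automl_path(instance_dir):
    """Append <instance_dir>/lib/automl to sys.path once; return the entry."""
    # os.path.join uses the path separator of the current platform.
    automl_dir = os.path.join(instance_dir, "lib", "automl")
    if automl_dir not in sys.path:
        sys.path.append(automl_dir)
    return automl_dir
```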

See More

For more information about how AutoML works, see the AutoML Reference.

H2O

You can specify H2O as your provider by setting %H2O as your ML configuration.

You can also create a new ML configuration where PROVIDER points to H2O.

Note:

The H2O provider does not support the creation of time series models.

Training Parameters — H2O

You can pass training parameters with a USING clause. For example:

TRAIN MODEL my_model USING {"seed": 3}

See the H2O documentation for information regarding expected input and how these parameters are handled. Unknown parameters result in an error during training.

When training a model using the H2O provider, the max_models parameter is set to 5 by default.

Model Selection

To select which model type to use, the system compares the number of unique values in the column being predicted to the total number of values in that column. If there are relatively few unique values, the system uses an H2O classification model. If there are relatively many unique values, the system uses an H2O regression model.

If you want to force H2O to train a regression model for a column, you can manually add the key-value pair "model_type":"regression" to your USING clause. For example:

TRAIN MODEL h2o_model USING {"model_type": "regression"}
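The selection logic described above can be sketched as a ratio test with an explicit override. This is an illustration under stated assumptions: the cutoff max_unique_ratio is a hypothetical value, since the actual threshold is not given here, and the function name choose_model_type is not part of IntegratedML or H2O.

```python
def choose_model_type(label_values, using=None, max_unique_ratio=0.05):
    """Pick "classification" or "regression" for a label column.

    max_unique_ratio is a hypothetical cutoff; the real threshold is not documented here.
    """
    using = using or {}
    # An explicit "model_type" in the USING parameters overrides the heuristic.
    if "model_type" in using:
        return using["model_type"]
    unique_ratio = len(set(label_values)) / len(label_values)
    return "classification" if unique_ratio <= max_unique_ratio else "regression"
```

With this sketch, a column of 100 values containing only "yes" and "no" is treated as classification, while a column of 100 distinct prices is treated as regression.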

Training Log Output

You can query the LOG column of the INFORMATION_SCHEMA.ML_TRAINING_RUNS view after training models using H2O.

Known Issues

  • When training with the H2O provider, you may see the following error message:

    LogMessage: %ML Provider '%ML.H2O.Provider' is not available on this instance
      > ERROR #5002: ObjectScript error: <READ>%GetResponse+4^%Net.Remote.Object.1
    

    If you do, you can address this issue by performing the following:

    1. Log in to the Management Portal.

    2. Go to System Administration > Configuration > Connectivity > External Language Servers.

    3. Select the server named %IntegratedML Server.

    4. Add the following to the JVM arguments field:

      -Djava.net.preferIPv6Addresses=true -Djava.net.preferIPv4Addresses=false
      
  • Setting the seed parameter with a USING clause for the H2O provider does not guarantee reproducible training runs. This is because the default training settings for H2O include the parameter max_models being set to 5, which triggers an early stopping mode. Reproducibility for the Gradient Boosting Model algorithm in H2O is a complex topic, as documented by H2O.

See More

For more information about H2O, see their documentation.

DataRobot

Important:

You must have a business relationship with DataRobot to use their AutoML capabilities.

DataRobot clients can use IntegratedML to train models with data stored within InterSystems IRIS® data platform.

You can specify DataRobot as your provider by selecting a DataRobot configuration as your default ML configuration:

SET ML CONFIGURATION datarobot_configuration

where datarobot_configuration is the name of an ML configuration where PROVIDER points to DataRobot.

Training Parameters — DataRobot

You can pass training parameters with a USING clause. For example:

TRAIN MODEL my_model USING {"seed": 3}

IntegratedML uses the DataRobot API to make an HTTP request to start modeling. Please consult their documentation for information regarding expected input and how these parameters are handled. Unknown parameters result in an error during training.

When training a model using the DataRobot provider, the quickrun parameter is set to true by default.

PMML

IntegratedML acts as a consumer of PMML (Predictive Model Markup Language) models, making it easy for you to import and execute your PMML models using SQL.

How PMML Models work in IntegratedML

As with any other provider, you use a CREATE MODEL statement to specify a model definition, including features and labels. This model definition must contain the same features and label that your PMML model contains.

The TRAIN MODEL statement operates differently. Instead of training a model on data, the TRAIN MODEL statement imports your PMML model. No training is necessary because the PMML model already has the properties of a trained model, including information on features and labels. The PMML model to import is identified by a USING clause.

Important:

The feature and label columns specified in your model definition must match the feature and label columns of the PMML model.
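Before writing the CREATE MODEL statement, you may want to confirm which fields your PMML file declares. The following sketch reads the first MiningSchema element with Python's standard XML parser and separates target fields from active (feature) fields, following the usageType attribute defined by the PMML standard. The function name pmml_fields and the trimmed sample document are hypothetical, and how IntegratedML itself checks the match is not described here.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def pmml_fields(pmml_text):
    """Return (features, targets) from the first MiningSchema in a PMML document."""
    root = ET.fromstring(pmml_text)
    features, targets = [], []
    for elem in root.iter():
        if local_name(elem.tag) != "MiningSchema":
            continue
        for field in elem:
            if local_name(field.tag) != "MiningField":
                continue
            # usageType defaults to "active" when the attribute is absent.
            usage = field.get("usageType", "active")
            if usage in ("target", "predicted"):
                targets.append(field.get("name"))
            elif usage == "active":
                features.append(field.get("name"))
        break  # inspect only the first model's schema
    return features, targets


# Trimmed, hypothetical PMML document (Header and DataDictionary omitted).
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="Price" usageType="target"/>
      <MiningField name="TotSqft"/>
      <MiningField name="num_beds"/>
      <MiningField name="num_baths"/>
    </MiningSchema>
  </RegressionModel>
</PMML>"""

features, targets = pmml_fields(SAMPLE)
```

You can then compare the returned names with the columns in the PREDICTING and WITH clauses of your model definition.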

While you still require a FROM clause in either your CREATE MODEL or TRAIN MODEL statement, the data specified is not used whatsoever.

Using your “trained” PMML model to make predictions works the same as any other trained model in IntegratedML. You can use the PREDICT function with any data that contains feature columns matching your PMML definition.

How to import a PMML Model

Before you can use a PMML model, set %PMML as your ML configuration, or select a different ML configuration where PROVIDER points to PMML.

You can specify a PMML model with a USING clause. You can choose one of the following parameters:

By Class Name

You can use the "class_name" parameter to specify the class name of a PMML model. For example:

USING {"class_name" : "IntegratedML.pmml.PMMLModel"}

By Directory Path

You can use the "file_name" parameter to specify the full path to a PMML model file. For example:

USING {"file_name" : "C:\temp\mydir\pmml_model.xml"}

Examples

The following examples highlight the multiple methods of passing a USING clause to specify a PMML model.

Specifying a PMML Model in an ML Configuration

The following series of statements creates a PMML configuration which specifies a PMML model for house prices by file name, and then imports the model with a TRAIN MODEL statement.

CREATE ML CONFIGURATION pmml_configuration PROVIDER PMML USING {"file_name" : "C:\PMML\pmml_house_model.xml"}
SET ML CONFIGURATION pmml_configuration
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
TRAIN MODEL HousePriceModel FROM HouseData
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000

Specifying a PMML Model in the TRAIN MODEL Statement

The following series of statements uses the provided %PMML configuration, and then specifies a PMML model by class name in the TRAIN MODEL statement.

SET ML CONFIGURATION %PMML
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
TRAIN MODEL HousePriceModel FROM HouseData USING {"class_name" : "IntegratedML.pmml.PMMLHouseModel"}
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000

Additional Parameters

If your PMML file contains multiple models, IntegratedML uses the first model in the file by default. To point to a different model within the file, use the model_name parameter in your USING clause:

TRAIN MODEL my_pmml_model FROM data USING {"class_name" : "my_pmml_file", "model_name" : "model_2_name"}
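To see which models a PMML file contains, you can list the modelName attributes of its top-level model elements. The sketch below assumes that the model_name parameter corresponds to the PMML modelName attribute, which is not confirmed here. The function name pmml_model_names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def pmml_model_names(pmml_text):
    """List the modelName attribute of each top-level element that has one."""
    root = ET.fromstring(pmml_text)
    return [child.get("modelName") for child in root
            if child.get("modelName") is not None]


# Trimmed, hypothetical PMML document containing two models.
TWO_MODELS = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <RegressionModel modelName="model_1_name" functionName="regression"/>
  <TreeModel modelName="model_2_name" functionName="classification"/>
</PMML>"""

names = pmml_model_names(TWO_MODELS)
```

The first name in the list is the model that would be used by default, since the first model in the file is selected when no model_name is given.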