Feature Engineering

AutoML performs two key steps of feature engineering:

  • Column Type Classification

  • Data Transformation

These steps make the data compatible with the machine learning algorithms that AutoML uses, and can greatly improve performance.

Column Type Classification

AutoML first examines the columns in the dataset and classifies them as a particular Python data type. For information about the conversion from DDL to Python data types, see DDL to Python Type Conversion.

The column types, along with how their classifications are made, are listed below:

Numeric Columns

Numeric columns are those that have a numeric pandas datatype, such as int8, int64, or float32. All columns meeting this condition are included, except:

  • Ignored Columns.

  • Columns of the timedelta datatype.

  • Columns with only one unique value.

Some columns with seemingly numeric data may be inappropriately classified as numeric columns. For example, if you had a column of ID numbers for different items, an ID number of 1000 is not “half of” an ID number of 2000. You can have AutoML treat such columns as category columns by recasting the numeric data as VARCHAR values.

Category Columns

Category columns are those that contain categorical values, meaning that a relatively small, fixed number of distinct values appear. They satisfy the following criteria:

  • Must be of category or object pandas datatype.

  • Must not include Ignored Columns.

  • Must not include List Columns.

  • The number of unique values is less than 10% of the total number of values.
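
The datatype and unique-value criteria above can be sketched in pandas as follows. The function name looks_categorical and its max_unique_ratio parameter are hypothetical, chosen for illustration; this is not AutoML's own code, and the exclusion of ignored and list columns is left out.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def looks_categorical(col: pd.Series, max_unique_ratio: float = 0.10) -> bool:
    """Hypothetical check: category/object dtype and few unique values."""
    is_cat_or_obj = isinstance(col.dtype, pd.CategoricalDtype) or is_object_dtype(col)
    if not is_cat_or_obj or len(col) == 0:
        return False
    # Unique values must make up less than 10% of all values.
    return col.nunique() / len(col) < max_unique_ratio


# 3 unique values out of 60: satisfies the ratio criterion.
colors = pd.Series(["red", "green", "blue"] * 20, dtype="object")
# 60 unique values out of 60: does not.
names = pd.Series([f"user_{i}" for i in range(60)], dtype="object")

print(looks_categorical(colors), looks_categorical(names))
```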

Text Columns

Text columns are columns where the values look like sentences. AutoML looks for values that contain 4 or more words. They satisfy the following criteria:

  • Must be of the category or object pandas datatype.

  • Must not include Ignored Columns.

  • Must not include Category Columns.

  • Must not include List Columns.

  • The number of unique values is less than 10% of the total number of values.

List Columns

List columns are those that contain list values. They satisfy the following criteria:

  • Must be of category or object pandas datatype.

  • Must not include Ignored Columns.

  • Must be, or contain, one of the following types:

    • InterSystems IRIS data type %Library.String:list

    • InterSystems IRIS data type %Library.String:array

    • Python list. This is determined by checking the first 10 non-empty values of the column to see if the type of each value is a Python list.

    • String array. This is determined by checking the first 10 non-empty values of the column to see if the type of each value is a string, with starting character [, ending character ], and of length at least 2.
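
The two sampling checks above (Python list and string array) might look like the following sketch. The helper name looks_like_list_column is hypothetical, and treating None and the empty string as "empty" is an assumption; the source does not define emptiness precisely.

```python
from itertools import islice


def looks_like_list_column(values, sample_size=10):
    """Hypothetical helper: inspect the first `sample_size` non-empty values."""
    # Assumption: "empty" means None or the empty string.
    non_empty = (v for v in values if v is not None and v != "")
    sample = list(islice(non_empty, sample_size))
    if not sample:
        return False
    # Case 1: every sampled value is a Python list.
    all_py_lists = all(isinstance(v, list) for v in sample)
    # Case 2: every sampled value is a string of length >= 2 wrapped in [ ].
    all_str_arrays = all(
        isinstance(v, str) and len(v) >= 2 and v.startswith("[") and v.endswith("]")
        for v in sample
    )
    return all_py_lists or all_str_arrays


print(looks_like_list_column([["a", "b"], None, ["c"]]))
print(looks_like_list_column(["[1, 2]", "", "[]"]))
print(looks_like_list_column(["abc", "[1]"]))
```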

Boolean Columns

Boolean columns are those that have the bool pandas datatype. They additionally satisfy the condition that they do not include Ignored Columns.

Ignored Columns

Ignored columns are those that are to be disregarded and removed before training. These include:

  • The ID column.

  • The label column.

  • Columns with only one unique value (except for columns of datetime pandas datatype).
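
As a rough illustration of these three rules, the sketch below collects the names of columns that would be ignored. The function ignored_columns and the column names are hypothetical; AutoML's own implementation is not shown in this document.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def ignored_columns(df: pd.DataFrame, id_col: str, label_col: str) -> set:
    """Hypothetical helper: ID, label, and single-valued non-datetime columns."""
    ignored = {id_col, label_col}
    for name in df.columns:
        col = df[name]
        # Single-valued columns carry no signal; datetime columns are exempt.
        if col.nunique() == 1 and not is_datetime64_any_dtype(col):
            ignored.add(name)
    return ignored


df = pd.DataFrame({
    "id": [1, 2, 3],
    "label": [0, 1, 0],
    "constant": [7, 7, 7],
    "age": [30, 41, 30],
    "seen": pd.to_datetime(["2024-01-01"] * 3),  # one unique value, but datetime
})
print(ignored_columns(df, id_col="id", label_col="label"))
```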

Date/Time Columns

Date/Time columns are those that have the datetime pandas datatype. They additionally satisfy the condition that they do not include Ignored Columns.

For the additional columns that AutoML derives from date/time columns, see Adding Additional Columns.

DDL to Python Type Conversion

The following table maps DDL data types to the Python data types that AutoML uses to classify data columns.

DDL Data Type    Python Data Type
BIGINT           integer
BINARY           bytes
BIT              Boolean
DATE             datetime64 (numpy)
DECIMAL          decimal
DOUBLE           float
INTEGER          integer
NUMERIC          float
REAL             float
SMALLINT         integer
TIME             datetime64 (numpy)
TIMESTAMP        datetime64 (numpy)
TINYINT          integer
VARBINARY        bytes
VARCHAR          string

For information about DDL data types, and their associated InterSystems IRIS® data platform data types, see “Data Types” in the InterSystems SQL Reference.

Data Transformation

The Transform Function transforms the entire dataset into the form to be used by the machine learning models. It is applied on the training set before training, and on any future datasets before predictions are made.

Adding Additional Columns

Additional Date/Time columns are created. For every datetime column, the following separate columns are added whenever applicable:

  • Hour of day.

  • Day of week.

  • Month of year.
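
In pandas, these three components can be read from a datetime column through the .dt accessor. The column names below are made up for illustration, and the exact encodings AutoML uses (for example, which weekday is 0) are not stated in this document.

```python
import pandas as pd

admitted = pd.Series(
    pd.to_datetime(["2024-03-04 09:30", "2024-12-25 23:05"]), name="admitted"
)

# One derived column per component of the original datetime column.
parts = pd.DataFrame({
    "admitted_hour": admitted.dt.hour,
    "admitted_dayofweek": admitted.dt.dayofweek,  # pandas convention: Monday = 0
    "admitted_month": admitted.dt.month,
})
print(parts)
```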

AutoML also creates duration columns. Each added column corresponds to one of the original date/time columns, and its values are the durations between the dates in that column and the dates in the other date/time columns. For example, consider patient data that has three date/time columns:

  • Date of birth.

  • Time of admission.

  • Time of exit.

AutoML creates two useful duration columns from these columns: age (duration between date of birth and time of admission) and length of stay (duration between time of admission and exit).
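
A minimal pandas sketch of the patient example follows. The column names and the units (days and hours) are assumptions made for illustration; the document does not say which unit AutoML uses for durations.

```python
import pandas as pd

patients = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1980-01-01", "2000-06-15"]),
    "admitted": pd.to_datetime(["2024-01-01 08:00", "2024-06-15 12:00"]),
    "discharged": pd.to_datetime(["2024-01-03 08:00", "2024-06-15 18:00"]),
})

# Subtracting two datetime columns yields a timedelta column.
age = patients["admitted"] - patients["date_of_birth"]
stay = patients["discharged"] - patients["admitted"]

# Convert the timedeltas to plain numbers so they can be used as features.
patients["age_days"] = age.dt.days
patients["stay_hours"] = stay.dt.total_seconds() / 3600
print(patients[["age_days", "stay_hours"]])
```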

Finally, for each list column present, AutoML adds another column containing the size of the lists. That is, each value in the new column is the length of the corresponding list in the original column.
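
For a column that already holds Python lists, the size column can be sketched as below. The column names tags and tags_size are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"tags": [["a", "b"], [], ["c", "d", "e"]]})

# Each value in the new column is the length of the list in the same row.
df["tags_size"] = df["tags"].apply(len)
print(df)
```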

Replacing Missing Values

Datasets can often be incomplete, with missing values in some of their columns. To help compensate for this and improve performance, AutoML fills in missing/NULL values:

  • For categorical and date columns, AutoML replaces missing values with the mode (most frequent value) of the column.

  • For numeric and duration columns, AutoML replaces missing values with the mean (average) of the column.

  • For list and text columns, AutoML replaces missing values with an empty string.
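
The three fill rules can be illustrated with pandas fillna. This is a sketch on a toy DataFrame with made-up column names, not AutoML's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", None, "red", "blue"],  # categorical column
    "weight": [1.0, None, 3.0, 5.0],        # numeric column
    "notes": ["fine", None, "ok", None],    # text column
})

# Categorical: fill with the mode (most frequent value).
df["color"] = df["color"].fillna(df["color"].mode().iloc[0])
# Numeric: fill with the mean of the observed values.
df["weight"] = df["weight"].fillna(df["weight"].mean())
# Text: fill with an empty string.
df["notes"] = df["notes"].fillna("")
print(df)
```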

Transforming Numeric Columns

For each numeric column, a standard scaler is fit. This applies to the original numeric columns as well as the duration and list size columns.
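
Assuming the scaler behaves like scikit-learn's StandardScaler (the document does not name the implementation), fitting on training data and reusing the fitted scaler on later data looks like this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])

# Fit once on the training data: learns the column mean and standard deviation.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Later data is transformed with the statistics learned from training.
X_new_scaled = scaler.transform(np.array([[4.0]]))
print(X_train_scaled.ravel(), X_new_scaled.ravel())
```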

Numeric column values are also binned and used as categorical values. The resulting bin columns are added alongside the existing numeric columns rather than replacing them. Each numeric column is separated into four bins, each representing a quartile of the values in that column. The new binned columns are treated as categorical columns.
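
Quartile binning can be sketched with pandas qcut. The bin labels q1 to q4 are made up for illustration, and how AutoML computes or stores bin edges is not described here.

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 44, 52, 67], name="age")

# q=4 splits the values at the 25th, 50th, and 75th percentiles.
age_bin = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

# The binned column is kept next to the original numeric column.
df = pd.DataFrame({"age": ages, "age_bin": age_bin})
print(df)
```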

Transforming Text and List Columns

For each text and list column, a vectorizer is fit to transform the data into the form needed for training. This is done with scikit-learn’s TfidfVectorizer; see the scikit-learn documentation for details.

The following parameters are used:

Parameter               Value
Convert to lowercase    True
Stop Words              None
N-Gram Range            (1,1)
Max Features            10000
Norm                    L2
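
Expressed as scikit-learn keyword arguments, the parameters above correspond to the following TfidfVectorizer configuration. The sample sentences are invented, and how AutoML prepares list columns before vectorizing is not covered by this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    lowercase=True,      # Convert to lowercase: True
    stop_words=None,     # Stop Words: None
    ngram_range=(1, 1),  # N-Gram Range: unigrams only
    max_features=10000,  # Max Features: keep at most 10000 terms
    norm="l2",           # Norm: each row is scaled to unit L2 length
)

docs = [
    "Patient reports mild chest pain",
    "No pain reported after discharge",
]
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document
print(X.shape, sorted(vectorizer.vocabulary_))
```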

Binary Columns

Binary columns are transformed into columns of 1s and 0s, with true values mapping to 1.

Categorical Columns

Categorical columns are one-hot encoded before being used for training.
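
The binary and categorical transformations can be illustrated together in pandas. Here get_dummies stands in for whichever one-hot encoder AutoML uses, which the document does not name, and the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "insured": [True, False, True],
})

# Binary column: True maps to 1 and False to 0.
df["insured"] = df["insured"].astype(int)

# Categorical column: one indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```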

Feature Elimination

As the last step before training, feature elimination is performed to remove redundancy, improve training speed, and improve the accuracy of models. This is done using scikit-learn’s SelectFpr class.

The following parameters are used:

Parameter           Value
Scoring function    f_classif
alpha               0.2
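
With these parameters, a SelectFpr run in scikit-learn looks like the sketch below. The synthetic data is invented for illustration: one feature is tied to the label and three are random noise. SelectFpr keeps every feature whose f_classif p-value is below alpha, so some noise features may also pass at alpha = 0.2.

```python
import numpy as np
from sklearn.feature_selection import SelectFpr, f_classif

rng = np.random.default_rng(0)
y = np.array([0, 1] * 50)

informative = y + rng.normal(scale=0.1, size=100)  # strongly tied to the label
noise = rng.normal(size=(100, 3))                  # unrelated to the label
X = np.column_stack([informative, noise])

# Keep features whose ANOVA F-test p-value is below alpha.
selector = SelectFpr(score_func=f_classif, alpha=0.2).fit(X, y)
X_reduced = selector.transform(X)

print(selector.get_support(), X_reduced.shape)
```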