Highlighted Features of AutoML
This section describes some of the machine learning techniques that AutoML applies automatically when it builds predictive models.
Natural Language Processing
AutoML uses natural language processing (NLP) to turn your text features into numeric features for your predictive models. It applies term frequency-inverse document frequency (TF-IDF) to weight key words in text and list columns.
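To make the idea of TF-IDF weighting concrete, here is a minimal sketch in plain Python. It uses the textbook formula (term frequency multiplied by the log of inverse document frequency); the exact variant, tokenization, and smoothing that AutoML uses are not stated here, so the function name `tfidf` and the sample sentences are illustrative assumptions only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook TF-IDF, assumed for illustration:
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical text column, already split into words.
docs = [
    "patient reports chronic asthma".split(),
    "patient reports mild hypertension".split(),
    "patient has asthma and hypertension".split(),
]
scores = tfidf(docs)
```

A word that appears in every row, such as "patient", receives a weight of zero, while a word that appears in only one row, such as "chronic", receives the highest weight in that row.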
Multi-Hot Encoding
While much real-world data is sparse, many machine learning algorithms require dense data. In most data modeling workflows, data scientists must perform difficult and cumbersome manual transformations to convert their sparse data into dense data.
Unlike workflows that require this manual step, AutoML performs the conversion automatically. Lists and one-to-many relationships are “multi-hot encoded” to account for columns that represent more than a single value per row.
For instance, consider a table that contains a list of medical conditions for each person:
Person | Conditions |
---|---|
Person A | [‘diabetes’, ‘osteoporosis’, ‘asthma’] |
Person B | [‘osteoporosis’, ‘hypertension’] |
Person C | [‘asthma’, ‘hypertension’] |
Person D | [‘hypertension’, ‘asthma’] |
Many machine learning functions treat each list as a single, indivisible value, so one-hot encoding produces one column per distinct list:
Person | [‘diabetes’, ‘osteoporosis’, ‘asthma’] | [‘osteoporosis’, ‘hypertension’] | [‘asthma’, ‘hypertension’] | [‘hypertension’, ‘asthma’] |
---|---|---|---|---|
Person A | 1 | 0 | 0 | 0 |
Person B | 0 | 1 | 0 | 0 |
Person C | 0 | 0 | 1 | 0 |
Person D | 0 | 0 | 0 | 1 |
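The whole-list behavior shown in the table above can be reproduced with a short Python sketch. The helper name `one_hot_whole_lists` is hypothetical and does not refer to any particular library; it only illustrates what happens when each list is treated as one category.

```python
rows = {
    "Person A": ["diabetes", "osteoporosis", "asthma"],
    "Person B": ["osteoporosis", "hypertension"],
    "Person C": ["asthma", "hypertension"],
    "Person D": ["hypertension", "asthma"],
}

def one_hot_whole_lists(rows):
    """One-hot encode each list as a single category (order included)."""
    # Collect every distinct list, in first-seen order.
    categories = []
    for conditions in rows.values():
        key = tuple(conditions)
        if key not in categories:
            categories.append(key)
    # One column per distinct list; exactly one 1 per row.
    encoded = {
        person: [int(tuple(conditions) == cat) for cat in categories]
        for person, conditions in rows.items()
    }
    return categories, encoded

categories, encoded = one_hot_whole_lists(rows)
```

Because the lists for Person C and Person D differ in order, they end up in different columns even though they contain the same conditions.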
Instead, AutoML uses a bag-of-words approach to create a separate column for each distinct value that appears in any list:
Person | ‘diabetes’ | ‘osteoporosis’ | ‘asthma’ | ‘hypertension’ |
---|---|---|---|---|
Person A | 1 | 1 | 1 | 0 |
Person B | 0 | 1 | 0 | 1 |
Person C | 0 | 0 | 1 | 1 |
Person D | 0 | 0 | 1 | 1 |
While other functions would have treated each person's list as an unrelated category, AutoML’s method allows a model to find patterns across the individual medical conditions that these persons share.
AutoML assumes that order does not matter. Person C and Person D have the same set of medical conditions, just listed in a different order. While other functions treat those two lists as different, AutoML identifies that they are the same.
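As a rough illustration of multi-hot encoding and its order independence, the following Python sketch builds one column per distinct condition. It is not AutoML's implementation; the helper name `multi_hot` and the first-seen column ordering are assumptions made for the example.

```python
rows = {
    "Person A": ["diabetes", "osteoporosis", "asthma"],
    "Person B": ["osteoporosis", "hypertension"],
    "Person C": ["asthma", "hypertension"],
    "Person D": ["hypertension", "asthma"],
}

def multi_hot(rows):
    """Multi-hot encode list values: one column per distinct item."""
    # Vocabulary: every distinct value across all lists, in first-seen order.
    vocab = []
    for conditions in rows.values():
        for condition in conditions:
            if condition not in vocab:
                vocab.append(condition)
    # A cell is 1 when the row's list contains that value, regardless of position.
    encoded = {
        person: [int(value in set(conditions)) for value in vocab]
        for person, conditions in rows.items()
    }
    return vocab, encoded

vocab, encoded = multi_hot(rows)
```

With this encoding, Person C and Person D receive identical rows, because membership in the list is all that is recorded.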