Class Reference
IRIS for UNIX 2019.3

abstract class %iKnow.Classification.Builder extends %RegisteredObject

This is the framework class for building Text Categorization models, generating valid %iKnow.Classification.Classifier subclasses.
Here's an example using the %iKnow.Classification.IKnowBuilder:

	// first initialize training and test sets 
	set tDomainId = $system.iKnow.GetDomainId("Standalone Aviation demo") 
	set tTrainingSet = ##class(%iKnow.Filters.SimpleMetadataFilter).%New(tDomainId, "Year", "<", 2007) 
	set tTestSet = ##class(%iKnow.Filters.GroupFilter).%New(tDomainId, "AND", 1) // NOT filter
	do tTestSet.AddSubFilter(tTrainingSet) 
	// Initialize Builder instance with domain name and training set
	set tBuilder = ##class(%iKnow.Classification.IKnowBuilder).%New("Standalone Aviation demo", tTrainingSet)
	// Configure it to use a Naive Bayes classifier
	set tBuilder.ClassificationMethod = "naiveBayes" 
	// Load category info from metadata field "AircraftCategory"
	write tBuilder.%LoadMetadataCategories("AircraftCategory") 
	   
	// manually add a few terms
	write tBuilder.%AddEntity("ultralight vehicle")
	set tData(1) = "helicopter", tData(2) = "helicopters"
	write tBuilder.%AddEntity(.tData)
	write tBuilder.%AddEntity("balloon",, "partialCount")
	write tBuilder.%AddCooccurrence($lb("landed", "helicopter pad")) 
	// or add them in bulk by letting the Builder instance decide
	write tBuilder.%PopulateTerms(50) 
	// after populating the term dictionary, let the Builder generate a classifier class
	write tBuilder.%CreateClassifierClass("User.MyClassifier") 

Inventory

Parameters  Properties  Methods  Queries  Indices  ForeignKeys  Triggers
            11          35


Summary

Properties
Categories ClassificationMethod Description DocumentVectorLocalWeights
DocumentVectorNormalization EntityRole MethodBuilder MinimumSpread
MinimumSpreadPercent TermSelectionMetric Terms

Methods
%%OIDGet %AddCRC %AddCategory %AddCooccurrence
%AddEntity %AddTerm %AddTermInternal %AddTermsFromSQL
%AddToSaveSet %BindExport %BuildObjectGraph %ClassIsLatestVersion
%ClassName %Close %ConstructClone %CreateClassifierClass
%DispatchClassMethod %DispatchGetModified %DispatchGetProperty %DispatchMethod
%DispatchSetModified %DispatchSetMultidimProperty %DispatchSetProperty %ExportDataTable
%Extends %GenerateClassifier %GetCandidateTerms %GetCategoryInfo
%GetCategoryPosition %GetParameter %GetRecordCount %GetTermInfo
%GetTermPosition %GetTerms %IncrementCount %IsA
%IsModified %LoadFromDefinition %LoadFromModel %New
%NormalizeObject %ObjectModified %OnLoadFromDefinition %OriginalNamespace
%PackageName %PopulateTerms %RemoveFromSaveSet %RemoveTerm
%RemoveTermAtIndex %RemoveTermEntryAtIndex %Reset %SerializeObject
%SetModified %TestClassifier %ValidateObject ClassificationMethodSet
GetColumnName

Subclasses
%iKnow.Classification.IFindBuilder %iKnow.Classification.IKnowBuilder

Properties

• property Categories as list of %List;
Categories.GetAt(i) = $lb("name", "spec", "description", "recordCount")
• property ClassificationMethod as %String(VALUELIST=",naiveBayes,linearRegression,euclideanDistance,cosineSimilarity,pmml,rules") [ InitialExpression = "naiveBayes" ];
The general method used for classification:
  • "naiveBayes" uses a probability-based approach based on Bayes' theorem, with the naive assumption that terms are independent of each other.
  • "rules" runs through a set of straightforward decision rules based on boolean expressions, each contributing to a single category's score if it fires. The category with the highest score wins.
  • "euclideanDistance" treats the per-category term weights as a vector in the same vector space as the document term vector and calculates the Euclidean distance between each of these vectors and the document vector.
  • "cosineSimilarity" also treats the per-category term weights as a vector in the same vector space as the document term vector, but looks at the cosine of the angle between each of these vectors and the document vector.
  • "linearRegression" considers the per-category term weights to be coefficients in a linear regression formula for calculating a category score, with the highest value winning.
  • "pmml" delegates the mathematical work to a predictive model defined in PMML. See also %iKnow.Classification.Methods.pmml.
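The three vector-based methods above can be compared on a toy example. The following Python sketch is only an illustration of the underlying arithmetic, not the code the generated classifier runs; the function names, category names and weights are all made up for the example, and it assumes the smallest distance (or the largest similarity or score) decides the category.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def linear_score(coefficients, document, intercept=0.0):
    """Weighted sum of the document's term scores."""
    return intercept + sum(w * x for w, x in zip(coefficients, document))

# Hypothetical per-category term weights over three terms.
category_weights = {
    "helicopter": [3.0, 0.0, 1.0],
    "balloon": [0.0, 2.0, 1.0],
}
document = [2.0, 0.0, 1.0]  # hypothetical document term vector

# Assumption: closest vector wins for distance, highest value wins otherwise.
by_distance = min(category_weights,
                  key=lambda c: euclidean_distance(category_weights[c], document))
by_cosine = max(category_weights,
                key=lambda c: cosine_similarity(category_weights[c], document))
by_linear = max(category_weights,
                key=lambda c: linear_score(category_weights[c], document))

print(by_distance, by_cosine, by_linear)
```

On this toy data all three methods pick the same category, but they can disagree when documents differ strongly in length, since cosine similarity ignores vector magnitude while Euclidean distance does not.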
• property Description as %String;
Optional description for the Classifier
• property DocumentVectorLocalWeights as %String(VALUELIST=",binary,linear,logarithmic") [ InitialExpression = "linear" ];
Local term weights for the document vector to register in the ClassificationMethod element. This might be overridden for some classification methods (e.g. Naive Bayes, which always uses "binary").
• property DocumentVectorNormalization as %String(VALUELIST=",none,cosine") [ InitialExpression = "none" ];
Document vector normalization method to register in the Classification element. This might be overridden for some classification methods (e.g. Naive Bayes, which always uses "none").
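As a rough illustration of what the local weight and normalization options mean for a document vector, here is a small Python sketch. It is not taken from the product: the helper names are hypothetical, and the exact formula behind "logarithmic" is not given on this page, so log(1 + frequency) is used purely as an assumption.

```python
import math

def local_weight(frequency, scheme="linear"):
    """Turn a raw term frequency into a local term weight."""
    if scheme == "binary":
        return 1.0 if frequency > 0 else 0.0
    if scheme == "linear":
        return float(frequency)
    if scheme == "logarithmic":
        # Assumption: the page does not spell out the formula.
        return math.log(1 + frequency)
    raise ValueError("unknown local weight scheme: " + scheme)

def normalize(vector, method="none"):
    """Optionally scale a vector to unit length (cosine normalization)."""
    if method == "none":
        return list(vector)
    if method == "cosine":
        length = math.sqrt(sum(x * x for x in vector))
        return [x / length for x in vector] if length else list(vector)
    raise ValueError("unknown normalization method: " + method)

frequencies = [3, 0, 4]  # hypothetical term frequencies in one document

binary_vector = [local_weight(f, "binary") for f in frequencies]
linear_vector = [local_weight(f, "linear") for f in frequencies]
unit_vector = normalize(linear_vector, "cosine")

print(binary_vector)
print(unit_vector)
```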
• property EntityRole as %Integer [ InitialExpression = $$$ENTTYPECONCEPT ];
Used by some models to refine the terms selected and/or how their default score is calculated
• property MethodBuilder as %iKnow.Classification.Methods.Base [ ReadOnly ];
This object will deliver the actual implementation of the classification method and is instantiated automatically through setting ClassificationMethod.
• property MinimumSpread as %Integer [ InitialExpression = 3 ];
The minimum number of records in the training set that should contain a term before it can get selected by %PopulateTerms. (Can be bypassed for specific terms by adding them through %AddTerm)
• property MinimumSpreadPercent as %Double [ InitialExpression = 0.05 ];
The minimum fraction of records in the training set that should contain a term before it can get selected by %PopulateTerms, EXCEPT if it occurs in more than 50% of the records in at least one category. (Can be bypassed for specific terms by adding them through %AddTerm)
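The interplay between MinimumSpread and MinimumSpreadPercent can be sketched as a simple eligibility check. The Python below is one reading of the two descriptions above, not the builder's actual code: it assumes both thresholds apply together, that each record belongs to exactly one category, and the function and data names are invented for the example.

```python
def meets_spread_thresholds(spread_per_category, records_per_category,
                            minimum_spread=3, minimum_spread_percent=0.05):
    """Check whether a term is eligible for automatic selection.

    spread_per_category:  {category: number of records containing the term}
    records_per_category: {category: number of records in the category}
    """
    total_spread = sum(spread_per_category.values())
    total_records = sum(records_per_category.values())

    # Absolute threshold: enough records must contain the term.
    if total_spread < minimum_spread:
        return False

    # Relative threshold: a large enough fraction of all records.
    if total_records and total_spread / total_records >= minimum_spread_percent:
        return True

    # Exception: the term covers more than 50% of at least one category.
    return any(
        size > 0 and spread_per_category.get(category, 0) / size > 0.5
        for category, size in records_per_category.items()
    )

# Hypothetical training set of 1000 records.
records = {"airplane": 960, "helicopter": 36, "balloon": 4}

print(meets_spread_thresholds({"balloon": 3}, records))     # rare overall, dominant in one category
print(meets_spread_thresholds({"airplane": 20}, records))   # below 5% and not dominant anywhere
print(meets_spread_thresholds({"airplane": 500}, records))  # well above 5%
```

The exception clause is what keeps terms that are characteristic of a small category from being filtered out just because the category itself is rare.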
• property TermSelectionMetric as %String;
The metric used for selecting terms for this classifier. This is for information purposes only.
• property Terms as list of %iKnow.Classification.Definition.Term;

Methods

• final method %AddCRC(ByRef pCRC As %List, pNegation As %String = "undefined", pCount As %String = "exactCount", Output pIndex As %Integer) as %Status

Adds one or more CRCs as a single term to the Text Categorization model's term dictionary. The term is to be counted only if it appears in the negation context defined by pNegation. If pCount = "exactCount", only exact occurrences of this CRC will be counted to calculate its base score to be fed into the categorization algorithm. If it is set to "partialCount", both exact and partial matches will be considered and if set to "partialScore", the score of all exact and partial matches will be summed as this term's base score.

Multiple CRCs can be supplied as a one-dimensional array of 3-element %Lists.
• final method %AddCategory(pName As %String, pSpec As %String, pRecordCount As %Integer = "", pDescription As %String = "") as %Status
Adds an optional category named pName for the classifier being built by this class. The meaning of pSpec depends on the actual builder implementation, but should allow the builder implementation to identify the records in the training set belonging to this category.
• final method %AddCooccurrence(ByRef pValue As %List, pNegation As %String = "undefined", pCount As %String = "exactCount", Output pIndex As %Integer) as %Status

Adds one or more Cooccurrences as a single term to the Text Categorization model's term dictionary. The term is to be counted only if it appears in the negation context defined by pNegation. If pCount = "exactCount", only exact occurrences of this cooccurrence's entities will be counted to calculate its base score to be fed into the categorization algorithm. If it is set to "partialCount", both exact and partial matches will be considered and if set to "partialScore", the score of all exact and partial matches will be summed as this term's base score.

A single cooccurrence can be supplied as a one-dimensional array of strings or a %List. Multiple cooccurrences can be supplied either as a one-dimensional array of %Lists or as a two-dimensional array of strings.
• final method %AddEntity(ByRef pValue As %String, pNegation As %String = "undefined", pCount As %String = "exactCount", Output pIndex As %Integer) as %Status

Adds one or more entities as a single term to the Text Categorization model's term dictionary. The term is to be counted only if it appears in the negation context defined by pNegation. If pCount = "exactCount", only exact occurrences of this entity will be counted to calculate its base score to be fed into the categorization algorithm. If it is set to "partialCount", both exact and partial matches will be considered and if set to "partialScore", the score of all exact and partial matches will be summed as this term's base score.

Multiple entities can be supplied either as a one-dimensional array or as a %List.
• deprecated final method %AddTerm(pValue As %String, pType As %String = "entity", ByRef pCustomWeights, pNegation As %String = "undefined") as %Status

Deprecated: use %AddEntity, %AddCRC or %AddCooccurrence instead

Adds a term whose presence or frequency is to be considered for categorizing by the classifier being built by this class.

• method %AddTermInternal(pTerm As %iKnow.Classification.Definition.Term, Output pIndex As %Integer) as %Status
Directly add a term object at the last index. (no existence checking!)
• method %AddTermsFromSQL(pSQL As %String, pType As %String = "entity", pNegationContext As %String = "undefined", pCount As %String = "exactCount") as %Status

Adds all terms selected by pSQL as pType, taking the string value from the column named "term" with negation context pNegationContext and count policy pCount. If there are columns named "type", "negation" or "count" selected by the query, any values in these columns will be used instead of the defaults supplied through the respective parameters.

When adding CRC or Cooccurrence terms, use colons to separate the composing entities.

• method %CreateClassifierClass(pClassName As %String, pVerbose As %Boolean = 1, pIncludeBuilderInfo As %Boolean = 1, pOverwrite As %Boolean = 0, pCompile As %Boolean = 1) as %Status

Generates a classifier definition and saves it to a %iKnow.Classification.Classifier subclass named pClassName. This will overwrite any existing class with that name if pOverwrite is 1. See also %GenerateClassifier.

• method %DispatchGetProperty(Property As %String)
Dispatch unknown property getters to MethodBuilder
• method %DispatchMethod(Method As %String, Args...)
Dispatch unknown method calls to MethodBuilder
• method %DispatchSetProperty(Property As %String, Val)
Dispatch unknown property setters to MethodBuilder
• final method %ExportDataTable(pClassName As %String, pOverwrite As %Boolean = 1, pVerbose As %Boolean = 1, pTracking As %Boolean = 0) as %Status
Exports the data in the training set to a new table pClassName, with columns containing the weighted score for each term.
• final method %GenerateClassifier(Output pDefinition As %iKnow.Classification.Definition.Classifier, pIncludeBuilderInfo As %Boolean = 0, pVerbose As %Boolean = 1) as %Status

Generates a %iKnow.Classification.Definition.Classifier XML tree based on the current set of categories and terms, with the appropriate weights and parameters calculated by the builder implementation (see %OnGenerateClassifier).

Use pIncludeBuilderInfo to include specifications of how this classifier was built so it can be "reloaded" from the classifier XML to retrain the model.

• method %GetCandidateTerms(pType As %String = "entity") as %Status
INTERNAL - DO NOT INVOKE
Used by MethodBuilder.%PopulateTerms() to provide:
^||%IK.TermCandidates(id) = $lb(value, spread)
^||%IK.TermCandidates(id, j) = [spread in category j]
• abstract method %GetCategoryInfo(Output pCategories) as %Status
Returns all categories added so far: pCategories(n) = $lb([name], [record count])
• method %GetCategoryPosition(pName As %String) as %Integer
• abstract method %GetRecordCount(Output pSC As %Status) as %Integer
• method %GetTermInfo(Output pTermInfo, pIncludeCategoryDetails As %String = "") as %Status
Returns an array for the current builder terms:
pTermInfo(i, "spread") = [spread in training set]
pTermInfo(i, "spread", j) = [spread in training set for category j]
pTermInfo(i, "frequency", j) = [freq in training set for category j]
• method %GetTermPosition(pTerm As %iKnow.Classification.Definition.Term) as %Integer
• method %GetTerms(Output pTerms) as %Status
Returns all terms added so far: pTerms(n) = $lb([string value], [type], [negation policy], [count policy])
• final classmethod %LoadFromDefinition(pClassName As %String, Output pBuilder As %iKnow.Classification.Builder, pValidateFirst As %Boolean = 1) as %Status
Loads the categories and terms from an existing Classifier class pClassName.
Note: this does not load any (custom) weight information from the definition.
• final classmethod %LoadFromModel(pDefinition As %iKnow.Classification.Definition.Classifier, Output pBuilder As %iKnow.Classification.Builder) as %Status
• private method %OnCreateExportTable(pClassDef As %Dictionary.ClassDefinition, pVerbose As %Boolean) as %Status
Callback invoked by %ExportDataTable when creating the export table definition.
• abstract private method %OnExportTable(pClassName As %String, pVerbose As %Boolean, pTracking As %Boolean) as %Status
Callback invoked by %ExportDataTable to load the data into export table pClassName.
• private method %OnGenerateClassifier(ByRef pDefinition As %iKnow.Classification.Definition.Classifier, pVerbose As %Boolean = 1, pIncludeBuilderInfo As %Boolean = 0) as %Status
Appends the ClassificationMethod element for this type of classifier.
• method %OnLoadFromDefinition(pDefinition As %iKnow.Classification.Definition.Classifier) as %Status
• private method %OnReset() as %Status
• method %PopulateTerms(pCount As %Integer = 100, pType As %String = "entity", pMetric As %String = "NaiveBayes", pPerCategory As %Boolean = 0) as %Status

Adds pCount terms of type pType to this classifier's set of terms, selecting those terms that have a high relevance for the categorization task based on metric pMetric and/or the specifics of this builder implementation.

If pPerCategory is 1, (pCount \ [number of categories]) terms are selected using the specified metric as calculated within each category. This often gives better results, but might not be supported for every metric or builder.

Builder implementations should ensure these terms meet the conditions set by MinimumSpread and MinimumSpreadPercent. MinimumSpreadPercent can be ignored if pPerCategory = 1.

This method implements a populate method for pMetric = "NaiveBayes", selecting terms based on their highest average per-category probability. In this case, the value of pPerCategory is ignored (automatically treated as 1). Implementations for other metrics can be provided by subclasses.
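To make the pPerCategory behavior more concrete, the Python sketch below splits a term budget evenly over the categories using integer division, mirroring (pCount \ [number of categories]), and takes the best-scoring terms within each category. It is an illustration only: the scores, the function name and the choice to merge duplicates (so fewer than the requested number of terms may come back) are assumptions, not the builder's documented behavior.

```python
def select_terms_per_category(scores, count):
    """Pick the top-scoring terms within each category.

    scores: {category: {term: score}}, higher scores are more relevant
    count:  total term budget, split evenly with integer division
    """
    per_category = count // len(scores)
    selected = []
    for category, term_scores in scores.items():
        ranked = sorted(term_scores, key=term_scores.get, reverse=True)
        for term in ranked[:per_category]:
            # Assumption: a term picked for several categories is kept once.
            if term not in selected:
                selected.append(term)
    return selected

# Hypothetical per-category relevance scores.
scores = {
    "helicopter": {"rotor": 0.9, "hover": 0.7, "runway": 0.2},
    "airplane": {"runway": 0.8, "taxi": 0.6, "rotor": 0.1},
}

terms = select_terms_per_category(scores, 5)  # 5 // 2 = 2 terms per category
print(terms)
```

Selecting within each category gives small categories their own share of the term dictionary, which a single global ranking would tend to fill with terms from the largest categories.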

• method %RemoveTerm(pValue As %String, pType As %String = "entity", pNegation As %String = "undefined", pCount As %String = "exactCount") as %Status
Removes pValue from the first term that contains it and meets the pType, pNegation and pCount criteria. If this is the last entry for that term, the whole term is removed.
• method %RemoveTermAtIndex(pIndex As %Integer) as %Status
Removes the term at index pIndex. If the term at this position is a composite one, all its entries are dropped as well.
• method %RemoveTermEntryAtIndex(pValue As %String, pIndex As %Integer, Output pRemovedTerm As %Boolean) as %Status
Removes a specific entry pValue from the term at index pIndex.
• final method %Reset() as %Status
Resets the term and category lists for this classifier
• abstract method %TestClassifier(pTestSet As %RawString, Output pResult, Output pAccuracy As %Double, pCategorySpec As %String = "", pVerbose As %Boolean = 0) as %Status

Utility method to batch-test the classifier against a test set pTestSet. Per-record results are returned through pResult:
pResult(n) = $lb([record ID], [actual category], [predicted category])

pAccuracy will contain the raw accuracy (# of records predicted correctly) of the current model. Use %iKnow.Classification.Utils for more advanced model testing.

If the current model's category options were added through %AddCategory without an appropriate category specification, use pCategorySpec to refer to the actual category values to test against.
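The per-record results lend themselves to simple post-processing. The Python sketch below recomputes a raw accuracy figure from rows shaped like pResult(n), represented here as (record ID, actual category, predicted category) tuples; the sample rows and the function name are hypothetical, and returning both the count and the fraction of correct predictions is a choice made for the example.

```python
def raw_accuracy(results):
    """Return (number correct, fraction correct) for a list of
    (record_id, actual_category, predicted_category) tuples."""
    correct = sum(1 for _record_id, actual, predicted in results
                  if actual == predicted)
    fraction = correct / len(results) if results else 0.0
    return correct, fraction

# Hypothetical test results.
results = [
    (101, "airplane", "airplane"),
    (102, "helicopter", "helicopter"),
    (103, "balloon", "airplane"),
    (104, "airplane", "airplane"),
]

correct, fraction = raw_accuracy(results)
print(correct, fraction)
```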

• method ClassificationMethodSet(pMethod As %String) as %Status
This is a Set accessor method for the ClassificationMethod property.
• classmethod GetColumnName(pTermId As %Integer) as %String


Copyright (c) 2019 by InterSystems Corporation. Cambridge, Massachusetts, U.S.A. All rights reserved. Confidential property of InterSystems Corporation.