Creating an NLP Environment Manually
InterSystems has deprecated InterSystems IRIS® Natural Language Processing (NLP). It may be removed from future versions of InterSystems products. The following documentation is provided as reference for existing users only. Existing users who would like assistance identifying an alternative solution should contact the WRC.
When you create or edit an InterSystems NLP domain using the Domain Architect, InterSystems NLP automatically creates or edits instances of three objects which together define the environment into which you load your text sources. These objects are:
- Domain: establishes the logical space for InterSystems NLP operations. Specifying a domain is mandatory.
- Configuration: establishes the language environment for source document content. Specifying a Configuration is optional. If not specified, InterSystems NLP provides a default. The Domain Architect allows you to define the language environment for a domain; this chapter describes how to create domain-independent Configurations, as well as additional features and options.
- User Dictionary: establishes a custom set of assignments and substitutions to apply when loading source texts into InterSystems NLP. A User Dictionary enhances the engine’s default behavior by, for example, identifying a string as a semantic attribute marker or as a concept unique to your field. A User Dictionary is supplied to one or more Configurations. Specifying a User Dictionary is optional. If not specified, no User Dictionary is used.
You can also create InterSystems NLP environments by creating instances of these objects manually. You can create multiple instances of Domains, Configurations, and User Dictionaries. These environment objects are independent of one another, and are independent of any specified set of source data.
This page describes Domain, Configuration, and User Dictionary objects in greater detail, and explains how to use these objects to build InterSystems NLP environments manually.
NLP Domains
All InterSystems NLP operations occur within a Domain. A domain is an InterSystems NLP defined unit within an InterSystems IRIS® data platform namespace. All source data to be used by InterSystems NLP is listed and loaded into a domain. A namespace can contain multiple domains.
If a namespace was created by copying another namespace, the two namespaces share their domains. For domains to be unique to a namespace, you must create a new database for globals when you create that namespace.
You can define, modify and delete an InterSystems NLP domain in three ways:
- Defining a domain using the Domain Architect. This is the easiest way to define a domain and to specify its languages, metadata, and data to be loaded.
- Defining a domain as a subclass of %iKnow.DomainDefinition. This option provides a powerful and fully-featured way to define a domain and specify its Configuration settings, metadata, and data to be loaded.
- Defining a domain programmatically using the %iKnow.Domain class methods and properties.
Defining a Domain Using DomainDefinition
When a user creates and compiles a class inheriting from %iKnow.DomainDefinition, the compiler automatically creates an InterSystems NLP domain corresponding to the settings specified as XML in the class’s Domain XData block. The user can specify static elements, such as domain parameters, metadata field definitions, and an assigned Configuration, all of which are created automatically at compile time. In addition, the user can specify sources of text data to be loaded into the domain. InterSystems IRIS uses this source information to generate a dedicated %Build() method in a new class named [classname].Domain. This %Build() method can then be used to load the specified data into the domain.
The following is an example of this kind of domain definition:
Class Aviation.MyDomain Extends %iKnow.DomainDefinition
{
/// An XML representation of the domain this class defines.
XData Domain [ XMLNamespace = "http://www.intersystems.com/iknow" ]
{
<domain name="AviationEvents">
  <parameter name="Status" value="1" />
  <configuration name="MyConfig" detectLanguage="1" languages="en,es" />
  <parameter name="DefaultConfig" value="MyConfig" />
  <metadata>
    <field name="EventDate" dataType="DATE"/>
    <field name="Type" dataType="STRING" />
  </metadata>
  <data dropBeforeBuild="true">
    <files path="C:\MyDocs\" encoding="utf-8" recursive="1" extensions="txt"
           configuration="MyConfig" />
    <query sql="SELECT %ID,EventDate,Type,NarrativeFull
                FROM Aviation.Event"
           idField="ID" groupField="ID" dataFields="NarrativeFull"
           metadataFields="EventDate,Type" metadataColumns="EventDate,Type" />
  </data>
</domain>
}
}
At compile-time, this definition creates a domain "AviationEvents", with a Status parameter set to 1, and two metadata fields. It defines and assigns to the domain a Configuration "MyConfig" for processing English (en) and Spanish (es) texts.
This definition specifies the files to be loaded into this domain. It will load text (.txt) files from the C:\MyDocs\ directory, and it will load InterSystems SQL data from the Aviation.Event table. Refer to the %iKnow.Model.listFiles and %iKnow.Model.listQuery class properties for details.
The system generates a %Build() method in a dependent class named Aviation.MyDomain.Domain that contains the logic to load data from the C:\MyDocs directory and the Aviation.Event table.
To load the specified text data sources into this domain:
SET stat=##class(Aviation.MyDomain).%Build()
After using %Build(), you can check for errors:
DO $SYSTEM.iKnow.ListErrors("AviationEvents",0)
This lists three types of errors: errors, failed sources, and warnings.
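Because %Build() returns a status value, it can be worth testing that value before consulting the error log. The following is a minimal sketch, assuming the Aviation.MyDomain class shown above has been compiled in the current namespace; it relies only on the general-purpose $SYSTEM.Status utilities and is not the only way to handle the result:

```objectscript
  // Build the domain and report any failure before consulting the error log
  SET stat=##class(Aviation.MyDomain).%Build()
  IF $SYSTEM.Status.IsError(stat) {
    WRITE "Build failed:",!
    DO $SYSTEM.Status.DisplayError(stat)
  }
  ELSE {
    WRITE "Build completed",!
  }
  // Errors, failed sources, and warnings recorded while loading
  DO $SYSTEM.iKnow.ListErrors("AviationEvents",0)
```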
To display the domains defined in the current namespace and the number of sources loaded for each domain:
DO $SYSTEM.iKnow.ListDomains()
To display the metadata fields defined for this domain:
DO $SYSTEM.iKnow.ListMetadata("AviationEvents")
InterSystems NLP assigns every domain one metadata field: DateIndexed. You can define additional metadata fields.
You can specify a <matching> element in the domain definition. The <matching> element describes dictionary information and specifies whether or not to automatically match loaded sources to these dictionaries. InterSystems IRIS performs basic validation on these objects during class compilation, but because they are loaded as part of the %Build() method, some name conflicts might only arise at runtime. Refer to the %iKnow.Model.matching class properties for details.
You can specify a <metrics> element in the domain definition. The <metrics> element adds custom metrics to the domain. No call to %iKnow.Metrics.MetricDefinition.Register() is required, because this is automatically performed by the Domain Definition code at compile time. Refer to the %iKnow.Model.metrics class properties for details.
Defining a Domain Programmatically
To define a new domain using class methods, invoke the %iKnow.Domain.%New() persistent method, supplying the domain name as the method parameter. A domain name can be any valid string; domain names are not case-sensitive. The name you assign to this domain must be unique for the current namespace. This method returns a domain object reference (oref) which is unique for all namespaces of the InterSystems IRIS instance. You must then save this instance using the %Save() method to make it persistent. The domain Id property (an integer value) is not defined until you save the instance as a persistent object, as shown in the following example:
CreateDomain
  SET domOref=##class(%iKnow.Domain).%New("FirstExampleDomain")
  WRITE "Id before save: ",domOref.Id,!
  DO domOref.%Save()
  WRITE "Id after save: ",domOref.Id,!
CleanUp
  DO ##class(%iKnow.Domain).%DeleteId(domOref.Id)
  WRITE "All done"
Use NameIndexExists() to determine if the domain already exists. If the domain exists, use NameIndexOpen() to open it. If the domain doesn’t exist, use %New() to create it and then use %Save().
The following example checks whether a domain exists. If the domain doesn’t exist, the program creates it. If the domain does exist, the program opens it. For the purpose of demonstration, this program then randomly either deletes or doesn’t delete the domain.
DomainCreateOrOpen
  SET domn="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(domn)) {
    WRITE "The ",domn," domain already exists",!
    SET domo=##class(%iKnow.Domain).NameIndexOpen(domn)
    SET domId=domo.Id
  }
  ELSE {
    SET domo=##class(%iKnow.Domain).%New(domn)
    DO domo.%Save()
    SET domId=domo.Id
    WRITE "Created the ",domn," domain",!
    WRITE "with domain ID ",domId,!
  }
ContainsData
  SET x=domo.IsEmpty()
  IF x=1 {WRITE "Domain ",domn," contains no data",!}
  ELSE {WRITE "Domain ",domn," contains data",!}
CleanupForNextTime
  SET rnd=$RANDOM(2)
  IF rnd {
    SET stat=##class(%iKnow.Domain).%DeleteId(domId)
    IF stat {WRITE "Deleted the ",domn," domain" }
    ELSE { WRITE "Domain delete error:",stat }
  }
  ELSE {WRITE "No delete this time" }
The %iKnow.Domain class methods that create or open a domain are provided with an output %Status parameter. This parameter is set when the current system does not have license access to InterSystems NLP, and thus cannot create or open an InterSystems NLP domain.
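As a rough sketch of how this status parameter might be inspected when opening a domain: the argument positions below are an assumption (name, concurrency, status passed by reference, which is the pattern that generated index Open methods normally follow), so confirm them in the %iKnow.Domain class reference before relying on this.

```objectscript
  // Assumption: NameIndexOpen() accepts (name, concurrency, .status).
  // The domain name "mydomain" is used only for illustration.
  SET domo=##class(%iKnow.Domain).NameIndexOpen("mydomain",,.sc)
  IF '$ISOBJECT(domo) {
    WRITE "Could not open domain",!
    // sc is examined only if the method actually set it
    IF $DATA(sc),$SYSTEM.Status.IsError(sc) { DO $SYSTEM.Status.DisplayError(sc) }
  }
  ELSE {
    WRITE "Opened domain with Id ",domo.Id,!
  }
```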
Setting Domain Parameters
Domain parameters govern the behavior of a wide variety of InterSystems NLP operations. The specific parameters are described where applicable. For a full list of available domain parameters, refer to the Appendix “Domain Parameters”.
In the examples that follow, domain parameters are referenced by their macro equivalent (for example, $$$IKPFULLMATCHONLY), not their parameter name (for example, FullMatchOnly). The recommended programming practice is to use these %IKPublic macros rather than the parameter names.
All domain parameters take a default value. Commonly, InterSystems NLP will give optimal results without specifically setting any domain parameters. InterSystems NLP determines the value for each parameter as follows:
- If you have specified a parameter value for the current domain, that value is used. Note that some parameters can only be set before loading data into a domain, while others can be set at any time. You can use the IsEmpty() method to determine if any data has been loaded into the current domain.
- If you have specified a system-wide parameter value, that value is used as a default for all domains, except for a domain where a domain-specific value has been set.
- If you have not specified a value for a parameter at either the domain level or the system level, InterSystems NLP uses its default value for that parameter.
Setting Parameters for the Current Domain
Once you have created a domain, you can set domain parameters for this specific domain using the SetParameter() instance method. SetParameter() returns a status indicating whether the parameter specified is valid and was set. GetParameter() returns the parameter value and the level at which the parameter was set (DEFAULT, DOMAIN, or SYSTEM). Note that GetParameter() does not check the validity of a parameter name; it returns DEFAULT for any parameter name it cannot identify as being set at the domain or system level.
The following example gets the default for the SortField domain parameter, sets this parameter for the current domain, then gets the value you set and the level at which it was set (DOMAIN):
#include %IKPublic
DomainCreate
  SET domn="paramdomain"
  SET domo=##class(%iKnow.Domain).%New(domn)
  WRITE "Created the ",domn," domain",!
  DO domo.%Save()
DomainParameters
  SET sfval=domo.GetParameter($$$IKPSORTFIELD,.sf)
  WRITE "SortField before SET=",sfval," ",sf,!
  IF sfval=0 {
    WRITE "changing SortByFrequency to SortBySpread",!
    SET stat=domo.SetParameter($$$IKPSORTFIELD,1)
    // An error %Status is not the literal value 0, so test it with IsError()
    IF $SYSTEM.Status.IsError(stat) {WRITE "SetParameter failed" QUIT}
  }
  WRITE "SortField after SET=",domo.GetParameter($$$IKPSORTFIELD,.str)," ",str,!!
CleanupForNextTime
  SET stat=##class(%iKnow.Domain).%DeleteId(domo.Id)
  IF stat {WRITE "Deleted the ",domn," domain" }
  ELSE { WRITE "Domain delete error:",stat }
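Because some parameters can only be set before data is loaded, a program can guard the call to SetParameter() with IsEmpty(). The following is a minimal sketch, assuming a domain named "paramdomain" already exists in the current namespace; SortField is used purely as an illustration, and whether a given parameter actually requires an empty domain is documented per parameter:

```objectscript
#include %IKPublic
  // Assumes a domain named "paramdomain" already exists in this namespace
  SET domo=##class(%iKnow.Domain).NameIndexOpen("paramdomain")
  IF domo.IsEmpty() {
    SET stat=domo.SetParameter($$$IKPSORTFIELD,1)
    IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  }
  ELSE {
    WRITE "Domain already contains data; parameter left unchanged",!
  }
  // The level reported is DOMAIN, SYSTEM, or DEFAULT
  SET val=domo.GetParameter($$$IKPSORTFIELD,.level)
  WRITE "SortField=",val," (",level,")",!
```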
Setting Parameters System-wide
You can set domain parameters for all domains system-wide using the SetSystemParameter() method. A parameter set using this method immediately becomes the default parameter value for all existing and subsequently created domains in all namespaces. This system-wide default is overridden for an individual domain using the SetParameter() instance method.
The SortField and Jobs domain parameters are exceptions. Setting these parameters at the system level has no effect on the domain settings.
You can determine if a domain parameter has been established as the system default using the GetSystemParameter() method. The initial value for a system-wide parameter is always the null string (no default).
If you wish to remove a system-wide default setting for a domain parameter, use the UnsetSystemParameter() method. Once a system-wide parameter setting has been established, you must unset it before you can set it to a new value. UnsetSystemParameter() returns a status of 1 (success) even when there was no parameter default value to unset.
The following example establishes a FullMatchOnly system-wide parameter value. If no system-wide default has been established, the program sets this system-wide parameter. If a system-wide default has been established, the program unsets this system-wide parameter, then sets it.
#include %IKPublic
SystemwideParameterSet
  /* Initial set */
  SET stat=##class(%iKnow.Domain).SetSystemParameter($$$IKPFULLMATCHONLY,1)
  IF stat=1 {
    WRITE "FullMatchOnly set system-wide to: "
    WRITE ##class(%iKnow.Domain).GetSystemParameter($$$IKPFULLMATCHONLY),!
    QUIT
  }
  ELSE {
    /* Unset and Reset */
    SET stat=##class(%iKnow.Domain).UnsetSystemParameter($$$IKPFULLMATCHONLY)
    IF stat=1 {
      SET stat=##class(%iKnow.Domain).SetSystemParameter($$$IKPFULLMATCHONLY,1)
      IF stat=1 {
        WRITE "FullMatchOnly was unset system-wide",!,"then set to: "
        WRITE ##class(%iKnow.Domain).GetSystemParameter($$$IKPFULLMATCHONLY),!!
        GOTO CleanUpForNextTime
      }
      ELSE {WRITE "System Parameter set error",stat,!}
    }
    ELSE {WRITE "System Parameter unset error",stat,!}
  }
CleanUpForNextTime
  SET stat=##class(%iKnow.Domain).UnsetSystemParameter($$$IKPFULLMATCHONLY)
  IF stat '=1 {WRITE " Unset error status:",stat}
The following example shows that setting a system-wide parameter value immediately sets the parameter value for all domains. After setting a system-wide parameter value, you can override this value for individual domains:
#include %IKPublic
SystemwideParameterUnset
  SET stat=##class(%iKnow.Domain).UnsetSystemParameter($$$IKPFULLMATCHONLY)
  WRITE "System-wide setting FullMatchOnly=",##class(%iKnow.Domain).GetSystemParameter($$$IKPFULLMATCHONLY),!!
Domain1Create
  SET domn1="mysysdomain1"
  SET domo1=##class(%iKnow.Domain).%New(domn1)
  DO domo1.%Save()
  SET dom1Id=domo1.Id
  WRITE "Created the ",domn1," domain ",dom1Id,!
  WRITE "FullMatchOnly=",domo1.GetParameter($$$IKPFULLMATCHONLY,.str)," ",str,!!
SystemwideParameterSet
  SET stat=##class(%iKnow.Domain).SetSystemParameter($$$IKPFULLMATCHONLY,1)
  // An error %Status is not the literal value 0, so test it with IsError()
  IF $SYSTEM.Status.IsError(stat) {WRITE "SetSystemParameter failed" QUIT}
  WRITE "Set system-wide FullMatchOnly=",##class(%iKnow.Domain).GetSystemParameter($$$IKPFULLMATCHONLY),!!
Domain2Create
  SET domn2="mysysdomain2"
  SET domo2=##class(%iKnow.Domain).%New(domn2)
  DO domo2.%Save()
  SET dom2Id=domo2.Id
  WRITE "Created the ",domn2," domain ",dom2Id,!
  WRITE "Domain setting FullMatchOnly=",domo2.GetParameter($$$IKPFULLMATCHONLY,.str)," ",str,!!
DomainParameters
  WRITE "New domain ",dom2Id," FullMatchOnly=",domo2.GetParameter($$$IKPFULLMATCHONLY,.str)," ",str,!
  WRITE "Existing domain ",dom1Id," FullMatchOnly=",domo1.GetParameter($$$IKPFULLMATCHONLY,.str)," ",str,!!
OverrideForOneDomain
  SET stat=domo1.SetParameter($$$IKPFULLMATCHONLY,0)
  IF $SYSTEM.Status.IsError(stat) {WRITE "SetParameter failed" QUIT}
  WRITE "Domain override FullMatchOnly=",domo1.GetParameter($$$IKPFULLMATCHONLY,.str)," ",str,!
CleanupForNextTime
  SET stat=##class(%iKnow.Domain).%DeleteId(dom1Id)
  SET stat=##class(%iKnow.Domain).%DeleteId(dom2Id)
  SET stat=##class(%iKnow.Domain).UnsetSystemParameter($$$IKPFULLMATCHONLY)
Assigning to a Domain
Once you have created a domain and (optionally) specified its domain parameters, you can assign various components to that domain:
- Source Data: After creating a domain, you commonly will load a number (usually a large number) of text sources into a domain; this generates InterSystems NLP indexed data within that domain. Loading text sources is a required precondition for most InterSystems NLP operations. A variety of text sources are supported, including files, SQL fields, and text strings. You can specify SQL fields of data type %String or %Stream.GlobalCharacter (character stream data). After InterSystems NLP has indexed a data source, the original data source can be removed without affecting further processing. Changing a data source has no effect on InterSystems NLP processing, unless you re-load that data source to update the indexed data in the domain.
- Filters: After creating a domain, you can optionally create one or more filters for that domain. A filter specifies criteria used to exclude some of the loaded sources from a query. Thus a filter allows you to perform InterSystems NLP operations on a subset of the data loaded in the domain.
- Metadata: After creating a domain, you can optionally specify one or more metadata fields that you can use as criteria for filtering sources. A metadata field is data associated with a source that is not indexed data. For example, the date and time that a text source was loaded is a metadata field for that source. Metadata fields must be defined before loading text sources into a domain.
- Skiplists: After creating a domain, you can optionally create one or more skiplists for that domain. A skiplist is a list of entities (such as words or phrases) that you do not want a query to return. Thus a skiplist allows you to perform InterSystems NLP operations that ignore specific data entities in data sources loaded in the domain.
- Smart Matching Dictionaries: After creating a domain, you can optionally create one or more Smart Matching dictionaries for that domain. A dictionary contains entities that are used to match the indexed data.
These components are defined using various InterSystems NLP classes and methods. You can also use the InterSystems IRIS Domain Architect to define metadata fields, load sources, and define skiplists and dictionaries.
Metadata fields must be defined before loading sources. Filters, skiplists, and dictionaries can be defined or modified at any time.
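The ordering constraint on metadata can be illustrated with a short sketch that defines a metadata field first and only then loads a source from a string. The class and method names used for the metadata and loading steps (%iKnow.Queries.MetadataAPI.AddField(), %iKnow.Source.Loader with BufferSource() and ProcessBuffer(), and %iKnow.Queries.SourceAPI.GetCountByDomain()) are stated from memory of the InterSystems NLP class library and should be treated as assumptions to confirm against the class reference; the domain name, field name, and sample text are hypothetical.

```objectscript
  // Hypothetical domain name and text; adjust for your environment
  SET domo=##class(%iKnow.Domain).%New("orderdemo")
  DO domo.%Save()
  SET domId=domo.Id
  // 1. Define the metadata field while the domain is still empty
  //    (assumed API: MetadataAPI.AddField(domainId, fieldName))
  SET fieldId=##class(%iKnow.Queries.MetadataAPI).AddField(domId,"Author")
  // 2. Load one source from a string (assumed API: Loader.BufferSource / ProcessBuffer)
  SET loader=##class(%iKnow.Source.Loader).%New(domId)
  SET stat=loader.BufferSource("ref1","The pilot reported engine trouble shortly after takeoff.")
  IF stat { SET stat=loader.ProcessBuffer() }
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  WRITE "Sources loaded: ",##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId),!
  // Clean up the demonstration domain: data first, then the domain itself
  SET stat=domo.DropData()
  SET stat=##class(%iKnow.Domain).%DeleteId(domId)
```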
Deleting All Data from a Domain
Deleting or changing an original source text has no effect on the source data listed and loaded from that text into an InterSystems NLP domain. You must explicitly add or delete a source to the set of indexed sources.
The %DeleteId() persistent method deletes a domain and all source data that has been listed and loaded in that domain. You can use the DropData() method to delete all source data that has been loaded into a domain without deleting the domain itself. Either method deletes all indexed source data, allowing you to start over with a new set of data sources.
When deleting a domain that contains a significant number of sources, use DropData() to delete the data before using %DeleteId() to delete the domain. If you use %DeleteId() to delete a domain while it still has data in it, InterSystems IRIS will delete the data, but it will journal each data deletion, even if journaling has been disabled. Deleting the data first and then deleting the domain prevents the generation of these large journal files.
You can use the IsEmpty() method to determine if any data has been loaded into a domain.
The following example demonstrates deleting the data from a domain. If the named domain doesn’t exist, the program creates the domain. If the named domain does exist, the program tests for the presence of data. If there is data in the domain, the program opens the domain and deletes the data.
DomainCreateOrOpen
  SET dname="mytestdomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname)) {
    WRITE "The ",dname," domain already exists",!
    SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
    IF domoref.IsEmpty() {GOTO RestOfProgram}
    ELSE {GOTO DeleteData }
  }
  ELSE {
    WRITE "The ",dname," domain does not exist",!
    SET domoref=##class(%iKnow.Domain).%New(dname)
    DO domoref.%Save()
    WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
    GOTO RestOfProgram
  }
DeleteData
  SET stat=domoref.DropData()
  IF stat {
    WRITE "Deleted the data from the ",dname," domain",!
    GOTO RestOfProgram
  }
  ELSE {
    WRITE "DropData error",!
    QUIT
  }
RestOfProgram
  WRITE "The ",dname," domain contains no data"
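The recommendation to drop the data before deleting a large domain can be sketched as follows. This is a minimal illustration that reuses the "mytestdomain" name from the example above and only the methods already described in this section (NameIndexExists(), NameIndexOpen(), DropData(), and %DeleteId()), plus the general-purpose $SYSTEM.Status.DisplayError() for reporting:

```objectscript
  SET dname="mytestdomain"
  IF ##class(%iKnow.Domain).NameIndexExists(dname) {
    SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
    SET domId=domoref.Id
    // Remove the indexed data first, then the (now empty) domain
    SET stat=domoref.DropData()
    IF stat { SET stat=##class(%iKnow.Domain).%DeleteId(domId) }
    IF stat { WRITE "Deleted the ",dname," domain and its data",! }
    ELSE { DO $SYSTEM.Status.DisplayError(stat) }
  }
  ELSE { WRITE "The ",dname," domain does not exist",! }
```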
Listing All Domains
You can use the GetAllDomains query to list all current domains in all namespaces. This is shown in the following example:
  SET stmt=##class(%SQL.Statement).%New()
  SET status=stmt.%PrepareClassQuery("%iKnow.Domain","GetAllDomains")
  IF status'=1 {WRITE "%Prepare failed:" DO $System.Status.DisplayError(status) QUIT}
  SET rset= stmt.%Execute()
  WRITE !,"Domains in all namespaces",!
  DO rset.%Display()
Each domain is listed on a separate line, using the following format: domainId:domainName:namespace:version.
The Version property is an integer that shows what version of InterSystems NLP data structure was used when the domain was created. The system version number changes when a release contains a change to the InterSystems NLP data structures. Therefore, a new version of InterSystems IRIS or the introduction of new InterSystems NLP features may not change the system version number. If the Version property value for a domain is not the current InterSystems NLP system version, you may wish to upgrade the domain to take advantage of the latest features of InterSystems NLP. See Upgrading InterSystems NLP Data on the “InterSystems NLP Implementation” page.
By default, GetAllDomains lists all the current domains for all namespaces. You can specify a boolean argument in %Execute() to limit the listing of domains to the current namespace, as shown in the following example:
  SET stmt=##class(%SQL.Statement).%New()
  SET status=stmt.%PrepareClassQuery("%iKnow.Domain","GetAllDomains")
  IF status'=1 {WRITE "%Prepare failed:" DO $System.Status.DisplayError(status) QUIT}
  SET rset= stmt.%Execute(1)
  WRITE !,"Domains in the current namespace",!
  DO rset.%Display()
A boolean value of 1 limits listing to domains in the current namespace. A boolean value of 0 (the default) lists all domains in all namespaces. (Note: listed Version property values may not be correct for domains other than the current domain.)
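If you need the individual values rather than a formatted display, you can iterate over the result set instead of calling %Display(). The following sketch assumes the query columns follow the order given in the format description above (domain Id, domain name, namespace, version); confirm the column order in the class reference before depending on it:

```objectscript
  SET stmt=##class(%SQL.Statement).%New()
  SET status=stmt.%PrepareClassQuery("%iKnow.Domain","GetAllDomains")
  IF $SYSTEM.Status.IsError(status) { DO $SYSTEM.Status.DisplayError(status) QUIT }
  // 1 = restrict the listing to the current namespace
  SET rset=stmt.%Execute(1)
  // Assumption: columns are domainId, domainName, namespace, version (in that order)
  WHILE rset.%Next() {
    WRITE "Domain ",rset.%GetData(2)," (Id ",rset.%GetData(1),"), version ",rset.%GetData(4),!
  }
```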
You can also list all domains in the current namespace using:
DO ##class(%SYSTEM.iKnow).ListDomains()
This method lists the domain Ids, domain names, number of sources, and the domain version number.
Renaming a Domain
You can use the Rename() class method to change the name of an existing domain within the current namespace, as shown in the following example:
  SET stat=##class(%iKnow.Domain).Rename(oldname,newname)
  IF stat=1 {WRITE "renamed ",oldname," to ",newname,!}
  ELSE {WRITE "no rename: ",oldname," is unchanged",! }
Renaming a domain changes the name used to open the domain, assigning the existing Domain Id to the new name. Rename() does not change the name of a current instance of the domain. For a rename to occur, the old domain name must exist and the new domain name must not exist.
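The two preconditions for a rename can be checked explicitly before calling Rename(), which makes the reason for a failure easier to report. A minimal sketch, using hypothetical domain names and only the NameIndexExists() and Rename() methods described on this page:

```objectscript
  // Hypothetical names used for illustration
  SET oldname="mydomain",newname="myrenameddomain"
  IF '##class(%iKnow.Domain).NameIndexExists(oldname) {
    WRITE "Cannot rename: ",oldname," does not exist",!
  }
  ELSEIF ##class(%iKnow.Domain).NameIndexExists(newname) {
    WRITE "Cannot rename: ",newname," already exists",!
  }
  ELSE {
    SET stat=##class(%iKnow.Domain).Rename(oldname,newname)
    IF stat=1 { WRITE "Renamed ",oldname," to ",newname,! }
    ELSE { DO $SYSTEM.Status.DisplayError(stat) }
  }
```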
Copying a Domain
You can copy an existing domain to a new domain in the current namespace by using the CopyDomain() method of the %iKnow.Utils.CopyUtils class. The CopyDomain() method copies a domain definition to a new domain, assigning a unique domain name and domain Id; the existing domain is unchanged. If the new domain does not exist, this method creates a new domain. By default, this method copies the domain parameter settings and assigned domain components from the existing domain to the copy, if these components are present.
By default, the CopyDomain() method copies the source data from the existing domain to the copy. However, if source data copying is requested and no source data is present in the existing domain, the CopyDomain() operation fails.
The following example copies the domain named “mydomain” and its parameter settings to a new domain named “mydupdomain”. Because “mydomain” contains no source data, the third argument (which specifies whether to copy source data) is set to 0:
DomainMustExistToBeCopied
  SET olddom="mydomain"_$PIECE($H,",",2)
  SET domo=##class(%iKnow.Domain).%New(olddom)
  DO domo.%Save()
  IF (##class(%iKnow.Domain).NameIndexExists(olddom)) {
    WRITE "Old domain exists, proceed with copy",!!
  }
  ELSE {WRITE "Old domain does not exist" QUIT}
CopyDomain
  SET newdom="mydupdomain"
  IF (##class(%iKnow.Domain).NameIndexExists(newdom)) {
    WRITE "Domain copy overwriting domain ",newdom,!
  }
  ELSE {WRITE "Domain copy creating domain ",newdom,!}
  SET stat=##class(%iKnow.Utils.CopyUtils).CopyDomain(olddom,newdom,0)
  IF stat=1 {WRITE !!,"Copied ",olddom," to ",newdom," copying all assignments",!!}
  ELSE {WRITE "Domain copy failed with status ",stat,!}
CleanUp
  SET stat=##class(%iKnow.Domain).%DeleteId(domo.Id)
  WRITE "Deleted the old domain"
The CopyDomain() method allows you to quickly copy all of the domain settings, source data, and assigned components of an existing domain to a new domain. It provides boolean options for all-or-nothing copying of assigned components. Other methods in the %iKnow.Utils.CopyUtils class provide greater control in specifying which assigned components to copy from one existing domain to another.
InterSystems NLP Configurations
An InterSystems NLP configuration specifies behavior for handling source documents. It is only used during the source data loading operation. A configuration is specific to its namespace; you can create multiple configurations within a namespace. InterSystems NLP assigns each configuration in a namespace a configuration Id, a unique integer. Configuration Id values are not reused. You can apply the same configuration to different domains and source text loads. Defining or using an InterSystems NLP configuration is optional; if you don’t specify a configuration, InterSystems NLP uses the property defaults.
You can define an InterSystems NLP configuration in two ways:
- Using the %iKnow.Configuration class methods and properties, as described in this chapter.
- Using the Domain Architect to specify supported languages as part of domain definition.
Defining a Configuration
You can define a configuration using the %New() persistent method of the %iKnow.Configuration class.
You can determine if an InterSystems NLP configuration with that name already exists by invoking the Exists() method. If the configuration exists, you can open it using the Open() method, as shown in the following example:
  IF ##class(%iKnow.Configuration).Exists("EnFr") {
    SET cfg=##class(%iKnow.Configuration).Open("EnFr")
  }
  ELSE {
    SET cfg=##class(%iKnow.Configuration).%New("EnFr",1,$LB("en","fr"))
    DO cfg.%Save()
  }
Setting Configuration Properties
A configuration defines the following properties:
- Name: A configuration name can be any valid string; configuration names are not case-sensitive. The name you assign to this configuration must be unique for the current namespace.
- DetectLanguage: A boolean value that specifies whether to use automatic language identification if more than one language is specified in the Languages property. Because this option may have a significant effect on performance, it should not be set unless needed. The default is 0 (do not use automatic language identification).
- Languages: The language(s) that the source documents contain, and therefore which languages to test for and which language models to apply. The available options are Czech (cs), Dutch (nl), English (en), French (fr), German (de), Japanese (ja), Portuguese (pt), Russian (ru), Spanish (es), Swedish (sv), and Ukrainian (uk). The default is English (en). Languages are always specified using their ISO 639-1 two-letter abbreviation. This property value is specified as an InterSystems IRIS list of strings (using $LISTBUILD).
- User Dictionary: Either the name of a defined User Dictionary object or the file path of a defined User Dictionary file. A User Dictionary contains user-defined substitution pairs that InterSystems NLP applies to the source text entities during the load operation. This property is optional; the default is the null string.
- Summarize: A boolean value that specifies whether to store summary information when loading source texts. If set to 1, InterSystems NLP stores the information it requires to generate summaries of the loaded source texts. If set to 0, no summaries can be generated for the sources processed with this Configuration object. Setting this option to 1 is generally recommended. The default is 1.
All configuration properties (except the Name) are assigned default values. You can get or set a configuration property by using property dispatch:
  IF cfgOref.DetectLanguage=0 {
    SET cfgOref.DetectLanguage=1
    DO cfgOref.%Save()
  }
Note that you must first %Save() the newly created configuration before you can change its properties using property dispatch, and then you must %Save() the configuration after changing the property values.
The following example creates a configuration that supports English and French with automatic language identification. It then changes the configuration to support English and Spanish:
OpenOrCreateConfiguration
  SET myconfig="Bilingual"
  IF ##class(%iKnow.Configuration).Exists(myconfig) {
    SET cfg=##class(%iKnow.Configuration).Open(myconfig)
    WRITE "Opened existing configuration ",myconfig,!
  }
  ELSE {
    SET cfg=##class(%iKnow.Configuration).%New(myconfig,1,$LB("en","fr"))
    DO cfg.%Save()
    WRITE "Created new configuration ",myconfig,!
  }
GetLanguages
  WRITE "that supports ",$LISTTOSTRING(cfg.Languages),!
SetConfigParameters
  // Spanish is "es" in ISO 639-1
  SET cfg.Languages=$LISTBUILD("en","es")
  DO cfg.%Save()
  WRITE "changed ",myconfig," to support ",$LISTTOSTRING(cfg.Languages),!
CleanUpForNextTime
  SET rnd=$RANDOM(2)
  IF rnd {
    SET stat=##class(%iKnow.Configuration).%DeleteId(cfg.Id)
    IF stat {WRITE "Deleted the ",myconfig," configuration" }
  }
  ELSE {WRITE "No delete this time",! }
For a description of using multiple languages and automatic language identification, refer to the “Language Identification” page.
Using a Configuration
You can apply a defined configuration in any of the following ways:
- Defining the DefaultConfig domain parameter.
- Specifying the configuration as the first argument of the Init() instance method to initialize the Lister instance and override the configuration default.
- Invoking %iKnow.Source.Lister.SetConfig().
- Specifying the configuration as an argument of the loader.ProcessBuffer() or loader.ProcessVirtualBuffer() method.
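The first of these options, setting the DefaultConfig domain parameter, might look like the following sketch. It assumes that a domain named "mydomain" and a configuration named "Bilingual" already exist in the current namespace, and that $$$IKPDEFAULTCONFIG is the %IKPublic macro corresponding to the DefaultConfig parameter name; verify the macro name in %IKPublic before using it.

```objectscript
#include %IKPublic
  // Assumes the "mydomain" domain and the "Bilingual" configuration already exist
  SET domo=##class(%iKnow.Domain).NameIndexOpen("mydomain")
  // Assumption: $$$IKPDEFAULTCONFIG is the macro for the DefaultConfig parameter
  SET stat=domo.SetParameter($$$IKPDEFAULTCONFIG,"Bilingual")
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  // Confirm the value and the level (DOMAIN, SYSTEM, or DEFAULT) at which it is set
  WRITE "DefaultConfig=",domo.GetParameter($$$IKPDEFAULTCONFIG,.level)," (",level,")",!
```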
Listing All Configurations
You can use the GetAllConfigurations query to list all defined configurations in the current namespace. This is shown in the following example:
  SET stmt=##class(%SQL.Statement).%New()
  SET status=stmt.%PrepareClassQuery("%iKnow.Configuration","GetAllConfigurations")
  IF status'=1 {WRITE "%Prepare failed:" DO $System.Status.DisplayError(status) QUIT}
  SET rset= stmt.%Execute()
  WRITE "The current namespace is: ",$NAMESPACE,!
  WRITE "It contains the following configurations: ",!
  DO rset.%Display()
Each configuration appears on a separate line, with the configuration Id followed by the configuration parameter values. The values are separated by colons. If the configuration is defined with a list of supported languages, GetAllConfigurations displays the language abbreviations separated by commas.
You can also list all configurations in the current namespace using:
DO ##class(%SYSTEM.iKnow).ListConfigurations()
Using a Configuration to Normalize a String
Using a defined InterSystems NLP configuration, you can perform text normalization on a string using the Normalize() method. This method both normalizes the string characters and (optionally) applies a User Dictionary, as shown in the following example:
DefineUserDictionary
  // Use the current time to make the names unique for each run
  SET time=$PIECE($H,",",2)
  SET udname="Abbrev"_time
  SET udict=##class(%iKnow.UserDictionary).%New(udname)
  DO udict.%Save()
  DO udict.AddEntry("Dr.","Doctor")
  DO udict.AddEntry("Mr.","Mister")
  DO udict.AddEntry("\&\","and")
DisplayUserDictionary
  DO udict.GetEntries(.dictlist)
  SET i=1
  WHILE $DATA(dictlist(i)) {
    WRITE $LISTTOSTRING(dictlist(i),",",1),!
    SET i=i+1
  }
  WRITE "End of UserDictionary",!!
DefineConfiguration
  // The fourth argument assigns the User Dictionary by name
  SET cfg=##class(%iKnow.Configuration).%New("EnUDict"_time,0,$LB("en"),udname)
  DO cfg.%Save()
NormalizeAString
  SET mystring="...The Strange Case of Dr. Jekyll & Mr. Hyde"
  SET normstring=cfg.Normalize(mystring)
  WRITE normstring
CleanUp
  DO ##class(%iKnow.UserDictionary).%DeleteId(udict.Id)
  DO ##class(%iKnow.Configuration).%DeleteId(cfg.Id)
You can perform InterSystems NLP text normalization on a string independent of a configuration using the NormalizeWithParams() method.
Both Normalize() and NormalizeWithParams() perform the following operations, in this order:
-
Apply a User Dictionary, if one is specified
-
Perform InterSystems NLP language model preprocessing
-
Convert all text to lowercase letters
-
Replace multiple whitespace characters with a single space
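The order above matters because User Dictionary substitutions run before the text is lowercased. The following Python sketch models that order outside of InterSystems IRIS. It is not the Normalize() implementation: the function name normalize_sketch is hypothetical, the language model preprocessing step is omitted, and substitution is modeled as plain string replacement, which is an assumption.

```python
import re

def normalize_sketch(text, user_dictionary=()):
    """Model the documented order of operations; not the InterSystems NLP engine."""
    # Step 1: apply User Dictionary substitution pairs, in order, case-sensitively.
    for old, new in user_dictionary:
        text = text.replace(old, new)
    # Step 2: language model preprocessing is engine-specific and omitted here.
    # Step 3: convert all text to lowercase.
    text = text.lower()
    # Step 4: replace each run of whitespace with a single space.
    return re.sub(r"\s+", " ", text)

pairs = [("Dr.", "Doctor"), ("Mr.", "Mister")]
print(normalize_sketch("Dr.  Jekyll   and Mr. Hyde", pairs))
```

In this model, a lowercase "dr." in the input would be left untouched, because the substitution step sees the text before it is lowercased.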
InterSystems NLP User Dictionary
A User Dictionary allows you to enhance the default behavior of the InterSystems NLP engine. It consists of a set of definition pairs, where each definition pair associates a string with one of the following counterparts:
-
A semantic attribute label, such as UDNegation or UDPosSentiment, for which the string should serve as an attribute marker. By assigning a semantic attribute label, you can (for example) specify “tremendous” as a term that indicates a positive sentiment.
-
An entity label to assign to each occurrence of the string, such as UDConcept or UDRelation. By assigning an entity label, you can (for example) instruct InterSystems NLP to recognize and index a concept which is unfamiliar to people outside of your industry or field.
-
A sentence break token: either \end, instructing the engine to issue a sentence break when it otherwise would not; or \noend, instructing the engine not to issue a sentence break when it otherwise would.
-
A replacement string to substitute for each occurrence of the string. Using a substitution pair, you can (for example) replace all occurrences of an abbreviation with an occurrence of the entity it represents.
When you define custom terms as markers for semantic attributes in a User Dictionary, InterSystems NLP finds each occurrence of the term and flags it (and the part of the sentence which contains it) with the attribute corresponding to the attribute label which follows. When you specify a term as a marker for the certainty attribute, you must also assign a certainty level c as metadata for each phrase which contains that term.
Unlike all other components of InterSystems NLP, a User Dictionary modifies the source content before listing and loading. This means that when the User Dictionary contains a substitution pair, all subsequent operations see only the substituted term. For example, if a User Dictionary replaces the abbreviation “Dr.” with “Doctor”, every occurrence of “Dr.” is replaced by the word “Doctor” in the data indexed by InterSystems NLP.
Although User Dictionary substitutions do not alter the input file for a source text, they do irreversibly modify all representations of a source text within the InterSystems NLP environment. The original content is not preserved for analysis unless the environment is rebuilt and the sources are reloaded. For this reason, using the User Dictionary for substitution is usually not recommended.
Substitution pairs are applied before NLP text normalization, which converts the NLP internal text representation to lowercase letters. For this reason, substitution pairs are case-sensitive. Thus, to replace all instances of “physician” with “doctor” you will need the substitution pairs "physician","doctor", "Physician","Doctor", and perhaps "PHYSICIAN","DOCTOR".
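Because each case variant needs its own pair, it can help to generate the variants rather than type them by hand. The following Python sketch does this for a single-word pair. The helper name case_variants is hypothetical, and the choice of three variants (lowercase, capitalized, uppercase) simply mirrors the example above; each resulting pair would still have to be added to the User Dictionary separately, for example with AddEntry().

```python
def case_variants(old, new):
    """Return lowercase, capitalized, and uppercase versions of one substitution pair."""
    return [
        (old.lower(), new.lower()),
        (old.capitalize(), new.capitalize()),
        (old.upper(), new.upper()),
    ]

# Print the pairs in the "old","new" form used in the text above.
for old, new in case_variants("physician", "doctor"):
    print(f'"{old}","{new}"')
```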
Defining a User Dictionary is optional. A User Dictionary exists independent of any specific configuration or domain. A defined User Dictionary can be assigned as a Configuration property. Only one User Dictionary can be assigned to a Configuration. However, the same User Dictionary can be assigned to multiple Configurations.
A defined User Dictionary can also be specified to the NormalizeWithParams() method, independent of any Configuration.
A User Dictionary is applied to sources when the sources are listed; sources that are already indexed are not affected by later changes to the User Dictionary.
Defining a User Dictionary in Domain Architect
You can define a User Dictionary as part of Domain Settings when creating a domain using the interactive Domain Architect tool.
Defining a User Dictionary as an Object Instance
You must first create a User Dictionary object, then populate that instance.
SET udict=##class(%iKnow.UserDictionary).%New("MyUserDict")
DO udict.%Save()
DO udict.AddEntry("Dr.","Doctor")
DO udict.AddEntry("physician","doctor")
DO udict.AddEntry("Physician","Doctor")
To populate a User Dictionary object, use the method in the %iKnow.UserDictionary class that is appropriate for the definition pair you would like to add. For example, AddConcept() allows you to identify a string as a concept entity; AddSentenceNoEnd() allows you to specify that the occurrence of a string should not result in a sentence break.
To add user-defined attribute terms, such as Sentiment attributes, you use the appropriate instance method, as shown in the following example:
SET udict=##class(%iKnow.UserDictionary).%New("SentimentUserDict")
DO udict.%Save()
DO udict.AddNegativeSentimentTerm("bad")
DO udict.AddNegativeSentimentTerm("horrible")
DO udict.AddPositiveSentimentTerm("good")
DO udict.AddPositiveSentimentTerm("excellent")
When you assign a certainty attribute using the AddCertaintyTerm() method, provide the integer value of the certainty level as the second argument, as shown in the following example:
SET udict=##class(%iKnow.UserDictionary).%New("CertaintyUserDict")
DO udict.%Save()
DO udict.AddCertaintyTerm("absolutely", 9)
DO udict.AddCertaintyTerm("presumably", 0)
To assign a custom attribute using one of the generic attribute labels, use the generic AddAttribute() method. This method accepts the attribute label as a string for its second argument. For example:
SET udict=##class(%iKnow.UserDictionary).%New("CustomAttrUserDict")
DO udict.%Save()
DO udict.AddAttribute("patient", "UDGeneric1")
To add a case-sensitive substitution pair, use AddEntry() with the following format: AddEntry(oldstring,newstring). You can, optionally, specify the position at which to add the User Dictionary entry (the position default is to add the entry at the end of the User Dictionary). Because InterSystems NLP applies substitution pairs in User Dictionary order, you can use position to perform additive substitutions. For example, first replace “PA” with “physician’s assistant”, then replace “physician” with “doctor”.
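The effect of entry order can be illustrated with a small Python model. This is a sketch only: the function name apply_in_order is hypothetical, and it treats each substitution as a plain string replacement, which is an assumption about how the engine matches text.

```python
def apply_in_order(text, pairs):
    """Apply substitution pairs one after another, in list order."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text

ordered = [("PA", "physician's assistant"), ("physician", "doctor")]
swapped = list(reversed(ordered))

# "PA" is expanded first, so the second pair can act on the expansion.
print(apply_in_order("The PA arrived.", ordered))
# With the order swapped, "physician" is not yet present when its pair runs.
print(apply_in_order("The PA arrived.", swapped))
```

In this model the first call yields "The doctor's assistant arrived." and the second yields "The physician's assistant arrived.", which is why the position of an entry matters for additive substitutions.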
To assign a User Dictionary object, you supply the User Dictionary name as the 4th argument in the Configuration %New() method:
SET cfg=##class(%iKnow.Configuration).%New("MyConfig",0,$LISTBUILD("en"),"MyUserDict",1)
DO cfg.%Save()
Defining a User Dictionary as a File
You can also create a User Dictionary by populating a file, and then assigning the User Dictionary file to a Configuration.
A User Dictionary file must be a text file in UTF-8 format encoding.
To populate a User Dictionary file, include each definition pair on a separate line.
An entity label or attribute label assignment must use the following format: @<markerTerm>,<label>. For the certainty attribute, the line must also include a certainty level assignment, in the following format: @<markerTerm>,UDCertainty,c=<number>.
A substitution pair must use the following format: <oldString>,<replacementString>. To specify that an assignment or substitution should only apply when a blank space precedes or follows a string, include the \ character in place of the blank space. To specify that a sentence break should or should not occur at a given string, provide \end or \noend (respectively) in place of the <replacementString>.
The following is a sample User Dictionary file:
Mr.,Mister
Dr.,Doctor
Fr.,Fr
\UK,United Kingdom
@outstanding,UDPosSentiment
@absolutely,UDCertainty,c=9
@patient,UDGeneric
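A file in the format described above can also be generated by a script. The following Python sketch writes one definition per line in UTF-8. The function name write_user_dictionary, its parameters, and the output file name are hypothetical, and the use of a line feed as the line terminator is an assumption.

```python
import os
import tempfile

def write_user_dictionary(path, substitutions=(), labels=(), certainty=()):
    """Write User Dictionary definitions to a UTF-8 text file, one per line."""
    lines = []
    # Substitution pairs: <oldString>,<replacementString>
    for old, new in substitutions:
        lines.append(f"{old},{new}")
    # Entity or attribute label assignments: @<markerTerm>,<label>
    for term, label in labels:
        lines.append(f"@{term},{label}")
    # Certainty assignments: @<markerTerm>,UDCertainty,c=<number>
    for term, level in certainty:
        lines.append(f"@{term},UDCertainty,c={level}")
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")
    return lines

# Example: write a small dictionary to the system temporary directory.
path = os.path.join(tempfile.gettempdir(), "udict_example.txt")
write_user_dictionary(
    path,
    substitutions=[("Mr.", "Mister"), ("Dr.", "Doctor")],
    labels=[("outstanding", "UDPosSentiment")],
    certainty=[("absolutely", 9)],
)
```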
To assign a User Dictionary file, supply the full pathname as the 4th argument in the Configuration %New() method:
SET cfg=##class(%iKnow.Configuration).%New(myconfig,0,$LISTBUILD("en"),"C:\temp\udict.txt",1)
DO cfg.%Save()