InterSystems NLP Configurations

An InterSystems NLP configuration specifies behavior for handling source documents. It is only used during the source data loading operation. A configuration is specific to its namespace; you can create multiple configurations within a namespace. InterSystems NLP assigns each configuration in a namespace a configuration Id, a unique integer. Configuration Id values are not reused. You can apply the same configuration to different domains and source text loads. Defining or using an InterSystems NLP configuration is optional; if you don’t specify a configuration, InterSystems NLP uses the property defaults.

You can define an InterSystems NLP configuration in two ways:

Defining a Configuration

You can define a configuration using the %New() persistent method of the %iKnow.Configuration class.

You can determine whether an InterSystems NLP configuration with a given name already exists by invoking the Exists() method. If the configuration exists, you can open it using the Open() method, as shown in the following example:

  IF ##class(%iKnow.Configuration).Exists("EnFr") {
    SET cfg=##class(%iKnow.Configuration).Open("EnFr")
  }
  ELSE {
    // Arguments: name, DetectLanguage=1, Languages=English and French
    SET cfg=##class(%iKnow.Configuration).%New("EnFr",1,$LB("en","fr"))
    DO cfg.%Save()
  }

Setting Configuration Properties

A configuration defines the following properties:

  • Name: A configuration name can be any valid string; configuration names are not case-sensitive. The name you assign to this configuration must be unique for the current namespace.

  • DetectLanguage: A boolean value that specifies whether to use automatic language identification if more than one language is specified in the Languages property. Because this option may have a significant effect on performance, it should not be set unless needed. The default is 0 (do not use automatic language identification).

  • Languages: What language(s) the source documents contain, and therefore which languages to test for and which language models to apply. The available options are Czech (cs), Dutch (nl), English (en), French (fr), German (de), Japanese (ja), Portuguese (pt), Russian (ru), Spanish (es), Swedish (sv), and Ukrainian (uk). The default is English (en). Languages are always specified using their ISO 639-1 two-letter abbreviation. This property value is specified as an InterSystems IRIS list of strings (using $LISTBUILD).

  • UserDictionary: Either the name of a defined User Dictionary object or the file path location of a defined User Dictionary file. A User Dictionary contains user-defined substitution pairs that InterSystems NLP applies to the source text entities during the load operation. This property is optional; the default is the null string.

  • Summarize: A boolean value that specifies whether to store summary information when loading source texts. If set to 1, InterSystems NLP stores the source information it requires to generate summaries of the loaded source texts. If set to 0, no summaries can be generated for the sources processed with this Configuration object. Setting this option to 1 is generally recommended. The default is 1.

All configuration properties (except the Name) are assigned default values. You can get or set a configuration property by using property dispatch:

   IF cfgOref.DetectLanguage=0 {
     SET cfgOref.DetectLanguage=1
     DO cfgOref.%Save() }

Note that you must first %Save() the newly created configuration before you can change its properties using property dispatch, and then you must %Save() the configuration after changing the property values.
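The save-then-modify sequence described above can be sketched as follows. This is a minimal illustration, not a definitive pattern: the configuration name "SketchConfig" is hypothetical, and calling %New() with only a name assumes that the remaining arguments take the property defaults listed above.

```objectscript
  // "SketchConfig" is a hypothetical name used only for this illustration
  SET name="SketchConfig"
  IF ##class(%iKnow.Configuration).Exists(name) {
    SET cfg=##class(%iKnow.Configuration).Open(name)
  }
  ELSE {
    // Assumption: omitted arguments fall back to the property defaults
    SET cfg=##class(%iKnow.Configuration).%New(name)
    // First save: required before changing properties by property dispatch
    SET sc=cfg.%Save()
    IF $SYSTEM.Status.IsError(sc) { DO $SYSTEM.Status.DisplayError(sc) QUIT }
  }
  // Change several properties, then save once more to persist them
  SET cfg.Languages=$LISTBUILD("en","de")
  SET cfg.DetectLanguage=1
  SET cfg.Summarize=0
  SET sc=cfg.%Save()
  IF $SYSTEM.Status.IsError(sc) { DO $SYSTEM.Status.DisplayError(sc) QUIT }
  // Read the properties back
  WRITE "Name: ",cfg.Name,!
  WRITE "Languages: ",$LISTTOSTRING(cfg.Languages),!
  WRITE "DetectLanguage: ",cfg.DetectLanguage,!
  WRITE "Summarize: ",cfg.Summarize,!
```

Checking the status returned by each %Save() call makes a failed save visible instead of letting later property reads mislead you.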

The following example creates a configuration that supports English and French with automatic language identification. It then changes the configuration to support English and Spanish:

OpenOrCreateConfiguration
  SET myconfig="Bilingual"
  IF ##class(%iKnow.Configuration).Exists(myconfig) {
    SET cfg=##class(%iKnow.Configuration).Open(myconfig)
    WRITE "Opened existing configuration ",myconfig,!
  }
  ELSE {
    // Arguments: name, DetectLanguage=1, Languages=English and French
    SET cfg=##class(%iKnow.Configuration).%New(myconfig,1,$LB("en","fr"))
    DO cfg.%Save()
    WRITE "Created new configuration ",myconfig,!
  }
GetLanguages
  WRITE "that supports ",$LISTTOSTRING(cfg.Languages),!
SetConfigParameters
  // "es" is the ISO 639-1 code for Spanish
  SET cfg.Languages=$LISTBUILD("en","es")
  DO cfg.%Save()
  WRITE "changed ",myconfig," to support ",$LISTTOSTRING(cfg.Languages),!
CleanUpForNextTime
  // Randomly delete the configuration so that reruns exercise both branches above
  SET rnd=$RANDOM(2)
  IF rnd {
    SET stat=##class(%iKnow.Configuration).%DeleteId(cfg.Id)
    IF stat {WRITE "Deleted the ",myconfig," configuration",! }
  }
  ELSE {WRITE "No delete this time",! }

For a description of using multiple languages and automatic language identification, refer to the “Language Identification” page.

Using a Configuration

You can apply a defined configuration in any of the following ways:

Listing All Configurations

You can use the GetAllConfigurations query to list all defined configurations in the current namespace. This is shown in the following example:

  SET stmt=##class(%SQL.Statement).%New()
  SET status=stmt.%PrepareClassQuery("%iKnow.Configuration","GetAllConfigurations")
     IF status'=1 {WRITE "%Prepare failed:" DO $System.Status.DisplayError(status) QUIT}
  SET rset= stmt.%Execute()
  WRITE "The current namespace is: ",$NAMESPACE,!
  WRITE "It contains the following configurations: ",!
  DO rset.%Display()

Each configuration is listed on a separate line, listing the configuration Id followed by the configuration parameter values. Listed values are separated by colons. If the configuration is defined with a list of supported languages, GetAllConfigurations displays these language abbreviations separated by commas.

You can also list all configurations in the current namespace using:

  DO ##class(%SYSTEM.iKnow).ListConfigurations()
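
If you need the values programmatically rather than as displayed output, you can iterate over the result set instead of calling %Display(). The following is a minimal sketch; it makes no assumption about the column names that GetAllConfigurations returns and addresses columns by position. How a multi-language value is rendered by %GetData() is not verified here.

```objectscript
  SET stmt=##class(%SQL.Statement).%New()
  SET status=stmt.%PrepareClassQuery("%iKnow.Configuration","GetAllConfigurations")
  IF $SYSTEM.Status.IsError(status) { DO $SYSTEM.Status.DisplayError(status) QUIT }
  SET rset=stmt.%Execute()
  // Number of columns in each row of the result set
  SET ncols=rset.%ResultColumnCount
  WHILE rset.%Next() {
    // Write every column of the current row, separated by " | "
    FOR i=1:1:ncols {
      WRITE:i>1 " | "
      WRITE rset.%GetData(i)
    }
    WRITE !
  }
```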

Using a Configuration to Normalize a String

Using a defined InterSystems NLP configuration, you can perform text normalization on a string using the Normalize() method. This method both normalizes the string characters and (optionally) applies a User Dictionary, as shown in the following example:

DefineUserDictionary
  SET time=$PIECE($H,",",2)
  SET udname="Abbrev"_time
  SET udict=##class(%iKnow.UserDictionary).%New(udname) 
  DO udict.%Save() 
  DO udict.AddEntry("Dr.","Doctor")
  DO udict.AddEntry("Mr.","Mister")
  DO udict.AddEntry("\&\","and")
DisplayUserDictionary
  DO udict.GetEntries(.dictlist)
  SET i=1
  WHILE $DATA(dictlist(i)) {
    WRITE $LISTTOSTRING(dictlist(i),",",1),!
    SET i=i+1 }
  WRITE "End of UserDictionary",!!
DefineConfiguration
   SET cfg=##class(%iKnow.Configuration).%New("EnUDict"_time,0,$LB("en"),udname)
   DO cfg.%Save()
NormalizeAString
   SET mystring="...The Strange Case  of Dr. Jekyll      & Mr. Hyde"
   SET normstring=cfg.Normalize(mystring)
   WRITE normstring
CleanUp
   DO ##class(%iKnow.UserDictionary).%DeleteId(udict.Id)
   DO ##class(%iKnow.Configuration).%DeleteId(cfg.Id)

You can perform InterSystems NLP text normalization on a string independent of a configuration using the NormalizeWithParams() method.

Both Normalize() and NormalizeWithParams() perform the following operations, in this order:

  1. Apply a User Dictionary, if one is specified

  2. Perform InterSystems NLP language model preprocessing

  3. Convert all text to lowercase letters

  4. Replace multiple whitespace characters with a single space
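
To make the order of operations concrete, the following sketch approximates steps 1, 3, and 4 using only built-in ObjectScript functions. It is not how InterSystems NLP implements normalization: the two $REPLACE calls are a hypothetical stand-in for a User Dictionary, step 2 (language model preprocessing) has no plain-ObjectScript equivalent and is skipped, and the actual Normalize() output may differ in punctuation handling.

```objectscript
  SET str="The Strange Case  of Dr. Jekyll      and Mr. Hyde"
  // Step 1 (stand-in): apply two literal substitution pairs
  SET str=$REPLACE(str,"Dr.","Doctor")
  SET str=$REPLACE(str,"Mr.","Mister")
  // Step 2: language model preprocessing is skipped in this approximation
  // Step 3: convert all letters to lowercase
  SET str=$ZCONVERT(str,"L")
  // Step 4: reduce each run of a repeated whitespace character to a single character
  SET str=$ZSTRIP(str,"=W")
  WRITE str,!
```

Applying the substitutions before lowercasing matters in this sketch: the pairs are matched case-sensitively, so "Dr." would no longer match once the string is lowercased.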
