InterSystems NLP User Dictionary
A User Dictionary allows you to enhance the default behavior of the InterSystems NLP engine. It consists of a set of definition pairs, where each definition pair associates a string with one of the following counterparts:
A semantic attribute label, such as UDNegation or UDPosSentiment, for which the string should serve as an attribute marker. By assigning a semantic attribute label, you can (for example) specify “tremendous” as a term that indicates a positive sentiment.
An entity label to assign to each occurrence of the string, such as UDConcept or UDRelation. By assigning an entity label, you can (for example) instruct InterSystems NLP to recognize and index a concept that is unfamiliar to people outside of your industry or field.
A sentence break token: either \end, instructing the engine to issue a sentence break when it otherwise would not; or \noend, instructing the engine not to issue a sentence break when it otherwise would.
A replacement string to substitute for each occurrence of the string. Using a substitution pair, you can (for example) replace all occurrences of an abbreviation with an occurrence of the entity it represents.
When you define custom terms as markers for semantic attributes in a User Dictionary, InterSystems NLP finds each occurrence of the term and flags it (and the part of the sentence that contains it) with the attribute that corresponds to the label paired with that term. When you specify a term as a marker for the certainty attribute, you must also assign a certainty level c as metadata for each phrase that contains that term.
Unlike all other components of InterSystems NLP, a User Dictionary modifies the source content before listing and loading. This means that when the User Dictionary contains a substitution pair, all subsequent operations see only the substituted term. For example, if a User Dictionary replaces the abbreviation “Dr.” with “Doctor”, every occurrence of “Dr.” is replaced by the word “Doctor” in the data indexed by InterSystems NLP.
Although User Dictionary substitutions do not alter the input file for a source text, they do irreversibly modify all representations of a source text within the InterSystems NLP environment. The original content is not preserved for analysis unless the environment is rebuilt and the sources are reloaded. For this reason, using the User Dictionary for substitution is usually not recommended.
Substitution pairs are applied before NLP text normalization, which converts the NLP internal text representation to lowercase letters. For this reason, substitution pairs are case-sensitive. Thus, to replace all instances of “physician” with “doctor” you will need the substitution pairs "physician","doctor", "Physician","Doctor", and perhaps "PHYSICIAN","DOCTOR".
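The effect of case sensitivity can be illustrated with the ObjectScript $REPLACE function, which is also case-sensitive by default. This is only an analogy for how a single substitution pair behaves; it is not how the NLP engine is implemented, and the sample sentence is invented for illustration.

```objectscript
  // Analogy only: $REPLACE matches case-sensitively, as a substitution pair does.
  SET text="Physician Smith paged the physician on call."
  // A lowercase-only pair leaves the capitalized form untouched.
  SET one=$REPLACE(text,"physician","doctor")
  WRITE one,!    // Physician Smith paged the doctor on call.
  // Adding the capitalized pair covers the sentence-initial form as well.
  SET both=$REPLACE(one,"Physician","Doctor")
  WRITE both,!   // Doctor Smith paged the doctor on call.
```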
Defining a User Dictionary is optional. A User Dictionary exists independent of any specific configuration or domain. A defined User Dictionary can be assigned as a Configuration property. Only one User Dictionary can be assigned to a Configuration. However, the same User Dictionary can be assigned to multiple Configurations.
A defined User Dictionary can also be specified to the NormalizeWithParams() method, independent of any Configuration.
A User Dictionary is applied to sources when the sources are listed; sources that are already indexed are not affected by changes to the User Dictionary.
Defining a User Dictionary in Domain Architect
You can define a User Dictionary as part of Domain Settings when creating a domain using the interactive Domain Architect tool.
Defining a User Dictionary as an Object Instance
You must first create a User Dictionary object, then populate that instance.
  SET udict=##class(%iKnow.UserDictionary).%New("MyUserDict")
  DO udict.%Save()
  DO udict.AddEntry("Dr.","Doctor")
  DO udict.AddEntry("physician","doctor")
  DO udict.AddEntry("Physician","Doctor")
To populate a User Dictionary object, use the method in the %iKnow.UserDictionary class that is appropriate for the definition pair you would like to add. For example, AddConcept() allows you to identify a string as a concept entity; AddSentenceNoEnd() allows you to specify that the occurrence of a string should not result in a sentence break.
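As a minimal sketch of the two methods just mentioned, the following assumes that each method takes the string as its first argument, in the same way as the other Add methods on this page. The dictionary name "EntityUserDict" and the sample terms are hypothetical.

```objectscript
  // "EntityUserDict" is an example name, not a predefined dictionary.
  SET udict=##class(%iKnow.UserDictionary).%New("EntityUserDict")
  DO udict.%Save()
  // Identify a field-specific term as a concept entity
  // (assumes the string is the first argument).
  DO udict.AddConcept("troponin")
  // Specify that this abbreviation should not result in a sentence break.
  DO udict.AddSentenceNoEnd("Fr.")
```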
To add user-defined attribute terms, such as Sentiment attributes, you use the appropriate instance method, as shown in the following example:
  SET udict=##class(%iKnow.UserDictionary).%New("SentimentUserDict")
  DO udict.%Save()
  DO udict.AddNegativeSentimentTerm("bad")
  DO udict.AddNegativeSentimentTerm("horrible")
  DO udict.AddPositiveSentimentTerm("good")
  DO udict.AddPositiveSentimentTerm("excellent")
When you assign a certainty attribute using the AddCertaintyTerm() method, provide the integer value of the certainty level as the second argument, as shown in the following example:
  SET udict=##class(%iKnow.UserDictionary).%New("CertaintyUserDict")
  DO udict.%Save()
  DO udict.AddCertaintyTerm("absolutely", 9)
  DO udict.AddCertaintyTerm("presumably", 0)
To assign a custom attribute using one of the generic attribute labels, use the generic AddAttribute() method. This method accepts the attribute label as a string for its second argument. For example:
  SET udict=##class(%iKnow.UserDictionary).%New("CustomAttrUserDict")
  DO udict.%Save()
  DO udict.AddAttribute("patient", "UDGeneric1")
To add a case-sensitive substitution pair, use AddEntry() with the following format: AddEntry(oldstring,newstring). You can optionally specify the position at which to add the entry; by default, the entry is added at the end of the User Dictionary. Because InterSystems NLP applies substitution pairs in User Dictionary order, you can use position to perform additive substitutions, in which a later pair operates on the output of an earlier one. For example, first replace “PA” with “physician’s assistant”, then replace “physician” with “doctor”.
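The PA example can be sketched as follows. The dictionary name "OrderedUserDict" and the sample sentence are hypothetical, and the sketch relies on the default behavior of adding each entry at the end rather than passing an explicit position. The $REPLACE calls only trace the expected effect of applying the pairs in order; they stand in for the engine and are not part of the User Dictionary API.

```objectscript
  SET udict=##class(%iKnow.UserDictionary).%New("OrderedUserDict")
  DO udict.%Save()
  // With no position given, each entry is added at the end, so call order is dictionary order.
  DO udict.AddEntry("PA","physician's assistant")
  DO udict.AddEntry("physician","doctor")

  // Trace of the expected result, using $REPLACE as a stand-in for the engine.
  SET text="The PA reviewed the chart."
  SET text=$REPLACE(text,"PA","physician's assistant")
  SET text=$REPLACE(text,"physician","doctor")
  WRITE text,!   // The doctor's assistant reviewed the chart.
```

If the two entries were added in the opposite order, the second pair would have nothing to match when it ran, and “PA” would end up as “physician's assistant”.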
To assign a User Dictionary object, you supply the User Dictionary name as the 4th argument in the Configuration %New() method:
  SET cfg=##class(%iKnow.Configuration).%New("MyConfig",0,$LISTBUILD("en"),"MyUserDict",1)
  DO cfg.%Save()
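Because the same User Dictionary can be assigned to multiple Configurations, the assignment can be repeated for each Configuration that needs it. The following is a sketch under stated assumptions: the User Dictionary "MyUserDict" already exists, and "ClinicalConfig" and "BillingConfig" are hypothetical Configuration names. It also checks the status returned by %Save() rather than discarding it. Because it uses multi-line blocks, it is meant to run inside a routine or method.

```objectscript
  // Hypothetical Configuration names; "MyUserDict" is assumed to exist already.
  FOR name="ClinicalConfig","BillingConfig" {
    SET cfg=##class(%iKnow.Configuration).%New(name,0,$LISTBUILD("en"),"MyUserDict",1)
    // %New() returns a null reference if the object could not be created.
    IF '$ISOBJECT(cfg) { WRITE "Could not create ",name,! CONTINUE }
    SET sc=cfg.%Save()
    IF $SYSTEM.Status.IsError(sc) { DO $SYSTEM.Status.DisplayError(sc) }
  }
```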
Defining a User Dictionary as a File
You can also create a User Dictionary by populating a file, and then assigning the User Dictionary file to a Configuration.
A User Dictionary file must be a UTF-8 encoded text file.
To populate a User Dictionary file, include each definition pair on a separate line.
An entity label or attribute label assignment must use the following format: @<markerTerm>,<label>. For the certainty attribute, the line must also include a certainty level assignment in the following format: @<markerTerm>,UDCertainty,c=<number>.
A substitution pair must use the following format: <oldString>,<replacementString>. To specify that an assignment or substitution applies only when a blank space precedes or follows a string, include the \ character in place of the blank space. To specify that a sentence break should or should not occur at a given string, provide \end or \noend (respectively) in place of the <replacementString>.
The following is a sample User Dictionary file:
  Mr.,Mister
  Dr.,Doctor
  Fr.,Fr
  \UK,United Kingdom
  @outstanding,UDPosSentiment
  @absolutely,UDCertainty,c=9
  @patient,UDGeneric1
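The remaining line types can be sketched in the same way. The fragment below is illustrative only: the terms are hypothetical, it assumes the backslash form of the sentence break tokens (\end and \noend), and it assumes that a \ may mark a required blank space on both sides of a string.

```text
@troponin,UDConcept
@ruled out,UDNegation
@presumably,UDCertainty,c=0
Fr.,\noend
\PA\,physician's assistant
```

The first three lines assign an entity label, a semantic attribute label, and a certainty level. The fourth suppresses a sentence break after “Fr.”, and the last replaces “PA” only where it stands alone between blank spaces.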
To assign a User Dictionary file, supply the full pathname as the 4th argument in the Configuration %New() method:
  SET cfg=##class(%iKnow.Configuration).%New("MyConfig",0,$LISTBUILD("en"),"C:\temp\udict.txt",1)
  DO cfg.%Save()
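A User Dictionary file can be written with any text editor that saves UTF-8. If you prefer to generate it from ObjectScript, one possible approach is the general-purpose %Stream.FileCharacter class, sketched below. This is not an NLP-specific API. The path is the example path used above and must be writable by the InterSystems process. Because the sketch uses multi-line blocks, it is meant to run inside a routine or method.

```objectscript
  // Example path; replace with a location the server process can write to.
  SET path="C:\temp\udict.txt"
  SET file=##class(%Stream.FileCharacter).%New()
  // Write the file with UTF-8 encoding, as a User Dictionary file requires.
  SET file.TranslateTable="UTF8"
  SET sc=file.LinkToFile(path)
  IF $SYSTEM.Status.IsError(sc) {
    DO $SYSTEM.Status.DisplayError(sc)
  } ELSE {
    // One definition pair per line.
    DO file.WriteLine("Dr.,Doctor")
    DO file.WriteLine("@outstanding,UDPosSentiment")
    DO file.WriteLine("@absolutely,UDCertainty,c=9")
    SET sc=file.%Save()
    IF $SYSTEM.Status.IsError(sc) { DO $SYSTEM.Status.DisplayError(sc) }
  }
```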