Text Normalization

NLP maintains links to the original source text. This enables it to return a sentence with its original capitalization, punctuation, and so forth. Within NLP, normalization operations are performed on entities to facilitate matching:

  • Capitalization is ignored. NLP matching is not case-sensitive. Entity values are returned in all lowercase letters.

  • Extra spaces are ignored. NLP treats all words as being separated by a single space.

  • Multiple periods (...) are reduced to a single period, which NLP treats as a sentence termination character.

  • Most punctuation is used by the language model to identify sentences, concepts, and relations, and is then discarded from further analysis. Punctuation is generally not preserved within entities: most punctuation marks are preserved in an entity only when there is no space before or after the mark. However, the slash (/) and the at sign (@) are preserved in an entity with or without surrounding spaces.

  • Certain language-specific letter characters are normalized. For example, the German eszett (“ß”) character is normalized as “ss”.
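The rules listed above can be approximated in a few lines of Python. This is only an illustrative sketch of the described behavior at the level of a single entity string, not the NLP engine's actual algorithm: the function name normalize_sketch, the use of Python's ASCII punctuation set, and the choice to handle only the eszett are assumptions made for the example. It does not model sentence splitting.

```python
import re
import string

# Assumption: the slash and at sign are kept even when surrounded by spaces,
# as the list above describes; all other ASCII punctuation is dropped when it
# sits at the edge of a word (that is, when it has a space on one side).
ALWAYS_KEPT = "/@"
EDGE_PUNCTUATION = "".join(c for c in string.punctuation if c not in ALWAYS_KEPT)


def normalize_sketch(text: str) -> str:
    """Approximate the normalization rules described above (hypothetical helper)."""
    text = text.lower()                  # capitalization is ignored
    text = text.replace("ß", "ss")       # one language-specific letter rule
    text = re.sub(r"\.{2,}", ".", text)  # runs of periods become a single period

    words = []
    for word in text.split():            # split() also collapses extra spaces
        word = word.strip(EDGE_PUNCTUATION)  # interior punctuation is left alone
        if word:
            words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    print(normalize_sketch("Straße  and/or  user @ example"))
    print(normalize_sketch("e-mail, re-sent"))
```

In this sketch, punctuation inside a word (such as the hyphen in "e-mail") survives because it has no adjacent space, while the trailing comma is dropped.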

The NLP engine automatically performs text normalization when a source text is indexed. NLP also automatically performs text normalization of dictionary terms and items.
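Normalizing both the source text and the dictionary terms is what lets differently formatted strings match. The following Python sketch illustrates the idea under simplifying assumptions: simple_normalize is a deliberately reduced stand-in (lowercasing and space collapsing only), and find_dictionary_matches is a hypothetical helper, not part of any NLP API.

```python
def simple_normalize(text: str) -> str:
    # Reduced stand-in for normalization: lowercase and collapse whitespace only.
    return " ".join(text.lower().split())


def find_dictionary_matches(entities, dictionary_terms):
    """Map each entity to the dictionary term it matches after normalization."""
    # Index the dictionary by normalized form so lookups ignore case and spacing.
    index = {simple_normalize(term): term for term in dictionary_terms}
    matches = {}
    for entity in entities:
        key = simple_normalize(entity)
        if key in index:
            matches[entity] = index[key]
    return matches


if __name__ == "__main__":
    found = find_dictionary_matches(
        ["Shaving   Bowl", "stairhead"],
        ["shaving bowl", "Tower"],
    )
    print(found)
```

Because both sides pass through the same normalization, "Shaving   Bowl" in the source matches the dictionary term "shaving bowl", while "stairhead" has no counterpart and is left out.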

You can also perform NLP text normalization on a string, independent of any NLP data loading, by using the Normalize() or NormalizeWithParams() methods. This is shown in the following example:

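   // Input string with mixed capitalization, a comma, and runs of extra spaces
   SET mystring="Stately plump Buck Mulligan   ascended the StairHead,  bearing a shaving bowl"
   // Normalize the string directly, without loading it into an NLP domain
   SET normstring=##class(%iKnow.Configuration).NormalizeWithParams(mystring)
   WRITE normstring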