
Semantic Attributes

Important:

InterSystems has deprecated InterSystems IRIS® Natural Language Processing (NLP). It may be removed from future versions of InterSystems products. The following documentation is provided as reference for existing users only. Existing users who would like assistance identifying an alternative solution should contact the WRC.

InterSystems NLP identifies concepts and their context in a natural language text.

Concepts are indivisible word groups that have a meaning on their own, mostly independent of the sentence in which they appear. For example, in the sentence “Patient is being treated for acute pulmonary hypertension,” InterSystems NLP identifies the word groups “patient” and “acute pulmonary hypertension” as concepts. Within sentences, relations indicate meaningful associations between concepts by linking them into a path. In the preceding example, the concepts are associated by the relation “is being treated for.”

However, in the sentence “Patient is not being treated for acute pulmonary hypertension,” the concept “acute pulmonary hypertension” has the same intrinsic meaning, but its context is clearly different. In this case, it appears as part of a sentence where the relation has been negated. An application that uses natural language processing to flag pulmonary problems should obviously treat this occurrence of the concept differently from its occurrence in the previous example.

By indexing the semantic attributes (such as negation) that affect the contextual meaning of a path and its constituent entities, InterSystems NLP provides a richer data set about your source texts, allowing you to perform more sophisticated analyses.

How Attributes Work: Marker Terms and Attribute Expansion

Within a sentence, a semantic attribute is usually indicated by the use of a marker term. In the sentence, “Patient is not being treated for acute pulmonary hypertension,” the word “not” is a marker term which indicates negation. Marker terms are usually single words or combinations of words, but they do not have to be a whole entity. In the preceding example, “not” is only part of the relation entity, “is not being treated for.”

You can specify additional marker terms for an attribute by using a User Dictionary. When a User Dictionary is specified as part of a Configuration, InterSystems NLP will recognize the marker terms defined within the User Dictionary and perform the appropriate attribute expansion to determine which part of the sentence or path the attribute applies to. If you are adding attribute marker terms to a User Dictionary programmatically, the %iKnow.UserDictionary class includes instance methods specific to each attribute type (for example, AddPositiveSentimentTerm()). The class also provides an AddAttribute() method for defining generic attributes.

At the Entity-level: Bit Mask for Marker Terms

Because the smallest unit of analysis within InterSystems NLP is an entity, the word-level presence of a marker term within an entity occurrence is annotated at the entity level using a bit mask. A bit mask is a string of zeroes and ones, with each position in the string representing one word in the sequence of words that makes up the entity. A one in a given position indicates that the corresponding word is a marker term. For example, the entity “is not being treated for” would be assigned the negation bit mask "01000".
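The bit mask idea can be sketched in a few lines of plain Python. This is only a conceptual model, not the InterSystems NLP implementation: the function name marker_bit_mask and the whitespace-based word splitting are assumptions made for illustration.

```python
def marker_bit_mask(entity: str, marker_terms: set) -> str:
    """Return one character per word: '1' for a marker term, '0' otherwise."""
    # Assumption: words are separated by whitespace and compared case-insensitively.
    words = entity.lower().split()
    return "".join("1" if word in marker_terms else "0" for word in words)


mask = marker_bit_mask("is not being treated for", {"not"})
print(mask)  # 01000
```

The mask has the same length as the entity has words, so a word-level marker can be located without re-parsing the sentence.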

At the Path-level: Position and Span for Expanded Attributes

However, InterSystems NLP does not merely index entities that contain marker terms for a semantic attribute. In addition, InterSystems NLP leverages its understanding of the grammar to perform attribute expansion, flagging all of the entities in the path before and after the marker term which are also affected by the attribute. Given the sentence, “Patient is not being treated for acute pulmonary hypertension or CAD, but reports frequent chest pain,” the concepts “acute pulmonary hypertension” and “CAD” would be flagged as part of the expanded negation attribute, but the concept “frequent chest pain” would not be.

Expanded semantic attribute information is annotated at the path level using two integers:

  • the position: the location of the first entity affected by the attribute

  • the span: the number of consecutive entities affected by the attribute

In the preceding sentence, the starting position for the negation would be 1 (“Patient”) and the span would be 5 (ending with “CAD”).
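As a rough illustration of how a position and a span select entities, the following plain-Python sketch slices a list of entities. The helper name entities_in_span is hypothetical, and the segmentation of the last two entities ("but reports" and "frequent chest pain") is an assumption; only the first five entities follow from the position and span given above.

```python
def entities_in_span(entities: list, position: int, span: int) -> list:
    """Return the entities covered by an attribute; position is 1-based."""
    start = position - 1
    return entities[start:start + span]


# Assumed entity segmentation of the example sentence.
path = [
    "Patient",
    "is not being treated for",
    "acute pulmonary hypertension",
    "or",
    "CAD",
    "but reports",
    "frequent chest pain",
]

negated = entities_in_span(path, position=1, span=5)
print("CAD" in negated)                  # True
print("frequent chest pain" in negated)  # False
```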

Through attribute expansion, InterSystems NLP lets you perform more advanced analyses. For example, you can easily distinguish between positive and negative occurrences of the concept “CAD,” because that information is available at the path level for clear highlighting on your screen or advanced interpretation logic within your applications.

Access Attribute Data

Attribute analysis information can be used with the following methods:

Attribute Data Structure

When InterSystems NLP identifies a marker term and determines which neighboring entities are affected by it, it then stores data about the attribute so that you can access it using one of the APIs in the %iKnow.Queries package listed previously.

Attribute data is stored as a %List. The exact contents of this attribute %List vary based on the level of analysis it provides information for. As applicable, content appears in the following order:

  1. The numeric ID for the attribute type. For ease of use, the %IKPublic #include file provides named macros for specifying these values in your queries. For example, the ID for the certainty attribute type can be invoked using the $$$IKATTCERTAINTY macro.

  2. A string containing the name of the attribute type (for example, “negation” or “measurement”).

  3. The numeric ID for the level of analysis (for example, path-level). For ease of use, the %IKPublic #include file provides named macros for specifying these values in your queries. For example, you can reference the path level by invoking the $$$IKATTLVLPATH macro.

  4. The numeric ID for the element at the given level of analysis which contains the attribute. For words within an entity, this will be a string containing a 0 or a 1.

  5. The position of the first entity within the element affected by the attribute. This value counts non-relevant words (such as “the” and “a”) as separate entities. For example, in the sentence “The White Rabbit usually hasn't any time,” the negation marker is in entity 3 (the relation “usually hasn’t”). In the sentence “The White Rabbit usually has no time,” the negation marker is in entity 4, the concept “no time”.

  6. The span of the attribute—that is, the number of consecutive entities which the attribute affects. For example, “The man is neither fat nor thin” has a negation span of 5 entities: “man,” “is neither,” “fat,” “nor,” and “thin.”

  7. Where applicable, a string containing the property or properties associated with this attribute. For measurements, this string includes a comma-delimited list containing the magnitudes and units of measurement in the order they are detected. For the certainty attribute, this string contains the numeric value of the certainty level c.
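To make the field order concrete, here is a plain-Python model of the list described above. It does not call any InterSystems API: the AttributeRecord type, the parse_attribute helper, and every value in the sample rows are hypothetical placeholders chosen for illustration.

```python
from typing import NamedTuple, Optional


class AttributeRecord(NamedTuple):
    """Fields in the order the text lists them for an attribute list."""
    type_id: int                      # 1. numeric ID of the attribute type
    type_name: str                    # 2. name of the attribute type
    level_id: int                     # 3. numeric ID of the level of analysis
    target_id: int                    # 4. ID of the element that contains the attribute
    position: int                     # 5. first affected entity
    span: int                         # 6. number of consecutive affected entities
    properties: Optional[str] = None  # 7. present only where applicable


def parse_attribute(values: list) -> AttributeRecord:
    """Map a 6- or 7-item row onto named fields."""
    return AttributeRecord(*values)


# Hypothetical rows; every ID and number here is a placeholder, not real output.
rows = [
    [1, "negation", 1, 12, 1, 5],
    [99, "certainty", 1, 14, 2, 3, "9"],
]
records = [parse_attribute(row) for row in rows]
negations = [r for r in records if r.type_name == "negation"]
print(negations[0].position, negations[0].span)  # 1 5
```

Naming the fields avoids the positional lookups (first element, second element, and so on) that are otherwise needed to read each row.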

Example

Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.

The following example uses %iKnow.Queries.SourceAPI.GetAttributes() to search each source in a domain for paths and sentences that have the negation attribute. It displays the PathId or SentenceId, the start position and the span of each negation:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error",!
            DO $System.Status.DisplayError(stat)
            QUIT }
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeCause FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeCause")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 { WRITE "The lister failed:",! DO $System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 { WRITE "The loader failed:",! DO $System.Status.DisplayError(stat) QUIT }
GetSourcesAndAttributes
   SET numSrcD=##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
   DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.srcs,domId,1,numSrcD)
   SET i=1
   WHILE $DATA(srcs(i)) {
      // The first element of each result row is the source ID
      SET srcId = $LISTGET(srcs(i),1)
      SET i=i+1
      DO ##class(%iKnow.Queries.SourceAPI).GetAttributes(.att,domId,srcId,1,10,"",$$$IKATTLVLANY)
      SET j=1
      WHILE $DATA(att(j)) {
         // Keep only attribute type ID 1, which this example treats as negation
         IF $LISTGET(att(j),1)=1 {
            SET type=$LISTGET(att(j),2)
            SET level=$LISTGET(att(j),3)
            SET targId=$LISTGET(att(j),4)
            SET start=$LISTGET(att(j),5)
            SET span=$LISTGET(att(j),6)
            // Level 1 is reported as a path, level 2 as a sentence
            IF level=1 {WRITE "source ",srcId," ",type," path ",targId," start at ",start," span ",span,!}
            ELSEIF level=2 {WRITE "source ",srcId," ",type," sentence ",targId," start at ",start," span ",span,!!}
            ELSE {WRITE "unexpected attribute level",! }
         }
         SET j=j+1
      }
   }

Supported Attributes

InterSystems NLP supports several semantic attribute types, and annotates each attribute type independently. In other words, an entity occurrence can receive annotations for any number and combination of the attribute types supported by a given language model.

InterSystems NLP includes marker terms for all of these attribute types (except the generic ones) for the English language. Semantic attribute support varies; the table identifies which semantic attribute types are supported for each language model in this version of InterSystems NLP. For ease of reference, the parenthetical beside each attribute type gives the default color used for highlighting within the Domain Explorer and the Indexing Results tool.

Attribute (highlight color)  English  Czech  Dutch    French  German  Japanese  Portuguese  Russian  Spanish  Swedish  Ukrainian
Negation (red)               Yes      Yes    Yes      Yes     Yes     Yes       Yes         Yes      Yes      Yes      Yes
Time (orange)                Yes      Yes    Yes      Yes     No      No        No          Yes      No       Yes      Yes
Duration (green)             Yes      No     No       No      No      Yes       No          No       No       No       No
Frequency (yellow)           Yes      No     No       No      No      Yes       No          No       No       No       No
Measurement (pink)           Yes      No     Partial  No      No      Yes       No          No       No       Partial  No
Sentiment (purple)           Yes      Yes    Yes      Yes     Yes     No        Yes         Yes      Yes      Yes      Yes
Certainty (yellow)           Yes      No     No       No      No      No        No          No       No       No       No
Generic (dark blue)          Yes      No     No       No      No      No        No          No       No       No       No

The rest of this page provides detailed descriptions of each attribute type.

Negation

Negation is the process that turns an affirmative sentence (or part of a sentence) into its opposite, a denial. For example, the sentence “I am a doctor.” can be negated as “I am not a doctor.” In analyzing text it is often important to separate affirmative statements about a topic from negative statements about that topic. During source indexing, InterSystems NLP associates the attribute “negation” with a sentence and indicates which part of the text is negated.

While in its simplest form negation is simply associating “no” or “not” with an affirmative statement or phrase, in actual language usage negation is a more complex and language-specific operation. There are two basic types of negation:

  • Formal, or grammatical, negation is always indicated by a specific morphological element in the text, such as “no”, “not”, “don’t” and other specific negating terms. These negating elements can be part of a concept (“He has no knowledge of”) or part of a relation (“He doesn’t know anything about”). Formal negation is always binary: a sentence (or part of a sentence) either contains a negating element (and is thus negated), or it is affirmative.

  • Semantic negation is a complex, context-dependent form of negation that is not indicated by any specific morphological element in the text. Semantic negation depends upon the specific meaning of a word or word group in a specific context, or results from a specific combination of meaning and tense (for example, conjunctive and subjunctive tenses in Romance languages). For example, “Fred would have been ready if he had stayed awake” and “Fred would have been ready if the need had arisen” say opposite things about Fred’s readiness. Semantic negation is not a binary principle; it is almost never absolute, but is subject to contextual and cultural insights.

InterSystems NLP language models contain a variety of language-specific negation words and structures. Using these language models, InterSystems NLP is able to automatically identify most instances of formal negation as part of the source loading operation, flagging them for your analysis. However, InterSystems NLP cannot identify instances of semantic negation.

The largest unit of negation in InterSystems NLP is a path; InterSystems NLP cannot identify negations in text units larger than a path. Many, but not all, sentences comprise a single path.

Negation Special Cases

The following are a few peculiarities of negation in English:

  • No.: The word “No.” (with capital letter and period, quoted or not quoted) in English is treated as an abbreviation. It is not treated as negation and is not treated as the end of a sentence. Lowercase “no.” is treated as negation and as a sentence ending.

  • Nor: The word “Nor” at the beginning of a sentence is not marked as negation. Within the body of a sentence the word “nor” is marked as negation.

  • No-one: The hyphenated word “no-one” is treated as a negation marker. Other hyphenated forms (for example, “no-where”) are not.

  • False negatives: Because formal negation depends on words, not context, cases of false negatives inevitably arise from time to time. For example, the sentences “There was no answer” and “The answer was no” are both flagged as negation.

Because negation operates on sentence units, it is important to know what InterSystems NLP does (and does not) consider a sentence. For details on how InterSystems NLP identifies a sentence, refer to the Logical Text Units Identified by InterSystems NLP section of the “Conceptual Overview” chapter.

Negation and Smart Matching

InterSystems NLP recognizes negated entities when matching against a Smart Matching dictionary. It calculates the number of entities that are part of a negation and stores this number as part of the match-level information (as returned by methods such as GetMatchesBySource() or as the NegatedEntityCount property of %iKnow.Objects.DictionaryMatch). This allows you to create code that interprets matching results by considering negation content, for example by comparing negated entities to the total number of entities matched.
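The comparison of negated entities to matched entities can be sketched as a simple ratio. This plain-Python example does not call the Smart Matching API: the function names and the 0.5 threshold are hypothetical, and the two counts are assumed to have been read from a match result beforehand.

```python
def negated_fraction(negated_entity_count: int, matched_entity_count: int) -> float:
    """Share of the matched entities that are part of a negation."""
    if matched_entity_count <= 0:
        raise ValueError("a match must cover at least one entity")
    return negated_entity_count / matched_entity_count


def is_mostly_negated(negated_entity_count: int, matched_entity_count: int,
                      threshold: float = 0.5) -> bool:
    """Treat a match as negated when the negated share reaches the threshold."""
    return negated_fraction(negated_entity_count, matched_entity_count) >= threshold


print(negated_fraction(1, 4))   # 0.25
print(is_mostly_negated(2, 3))  # True
```

Where the threshold should sit depends on the application; a strict filter might reject any match with a nonzero negated count.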

For further details, refer to Smart Matching: Using a Dictionary.

Time, Duration, and Frequency

Documents may also contain structured data that expresses time, duration, or frequency. These are annotated as separate attributes, commonly consisting of an attribute term as part of a concept. These attributes are identified based on marker terms identified in the language. They may or may not include a specific number.

A number, whether specified with numerals or with words, is almost always treated as a measurement attribute. However, a time attribute can contain a numeric, and a frequency attribute can contain an ordinal number.

The following are some of the guidelines in the English language model governing numbers:

  • Numerics: A number with no associated term could be a measurement or a year. Numbers from 1900 through 2039 are assumed to be years and are assigned the time attribute (1923, 2008 applicants). Numbers outside this range (1776) are not considered to be years, unless the word “year” is specified (the year 1776). Isolated numbers outside this range (1776) are assigned no attribute. Numbers outside this range with an associated term are assumed to be measurements (1776 applicants). Two-digit numbers with an appended apostrophe (for example Winter of '89) are assumed to be years and are assigned the time attribute. Numbers with numeric or currency punctuation (1,973, -1973, 1973.0, or $1973), and numbers expressed in words (nineteen seventy-three) are assumed to be measurements. A valid time numeric (12:34:33) is assigned the time attribute.

  • Ordinal numbers: An ordinal in a concept with other words takes the frequency attribute when spelled out (the fourth attempt) and the measurement attribute when specified as a number (the 4th attempt). Spelled-out ordinals beyond “tenth” do not take a frequency attribute. An ordinal by itself in a concept does not take an attribute (a fifth of scotch, came in third).

    A spelled-out ordinal (first through tenth) following another number takes the measurement attribute as a fraction (one third, two fifths), with the exception of “one second”, which takes the duration attribute.

    An ordinal of any size with a month name takes the time attribute (sixteenth of October, October 16th, October sixteenth), except May, which is ambiguous in English and therefore doesn’t take an attribute.
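The numeric guidelines above can be read as a small decision procedure. The following plain-Python sketch encodes only the rules for numbers written with digits; it is an illustration of those rules, not the English language model itself. The function name, the two boolean parameters, and the regular expressions are assumptions, and spelled-out numbers and ordinals are not covered.

```python
import re
from typing import Optional

TIME_OF_DAY = re.compile(r"^\d{1,2}:\d{2}(:\d{2})?$")    # shape check only, e.g. 12:34:33
APOSTROPHE_YEAR = re.compile(r"^['’]\d{2}$")             # e.g. '89
PLAIN_INTEGER = re.compile(r"^\d+$")                     # e.g. 1973
PUNCTUATED = re.compile(r"^[-$]?\d[\d,]*(\.\d+)?$")      # e.g. 1,973 -1973 1973.0 $1973


def classify_number(token: str, has_term: bool = False,
                    after_year_word: bool = False) -> Optional[str]:
    """Return 'time', 'measurement', or None for a number written with digits.

    has_term: the number has an associated term (as in "1776 applicants").
    after_year_word: the word "year" is specified (as in "the year 1776").
    """
    if TIME_OF_DAY.match(token) or APOSTROPHE_YEAR.match(token):
        return "time"
    if PLAIN_INTEGER.match(token):
        if 1900 <= int(token) <= 2039 or after_year_word:
            return "time"
        return "measurement" if has_term else None
    if PUNCTUATED.match(token):
        return "measurement"
    return None


print(classify_number("1923"))                 # time
print(classify_number("1776"))                 # None
print(classify_number("1776", has_term=True))  # measurement
print(classify_number("$1973"))                # measurement
```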

Note:

InterSystems NLP supports the Time attribute for Czech, Dutch, Russian, Swedish, and Ukrainian, but does not currently support distinct annotations for Time, Duration, and Frequency.

Measurement

Documents commonly contain structured data elements that express a quantity. These can include counts, lengths, weights, currency amounts, medication dosages, and other quantified expressions. They can follow different patterns:

  1. A number accompanied by a unit of measurement, such as “20 miles” or “5 mg”

  2. A string which contains both a number and a unit of measurement, such as “$200”

  3. A number with the associated concept term it is counting, such as “50 people”

  4. A number accompanied by an indicator that it is a measurement, such as “Blood Pressure: 120/80”

InterSystems NLP annotates a combination of a number and a unit of measurement (patterns 1 and 2 in the preceding list) as a measurement marker term at the word level. In other cases (patterns 3 and 4 in the preceding list), InterSystems NLP only annotates the number as a measurement at the word level.

To handle fractional numeric values, a leading period is included as part of the measurement number.

In addition to annotating the number and unit, InterSystems NLP uses attribute expansion rules to identify the other concepts “involved” in the measurement. This annotated sequence of concepts captures what is being measured, rather than just the measurement itself. (For patterns 3 and 4 in the preceding list, the indicator or concept term “involved” in the measurement is part of the span.)

These expanded attributes can be used to:

  • Extract all of the measurable facts in a document by highlighting them or displaying them in a list.

  • Narrow the display of a specific concept to only those that are associated with a measurement.

Note:

In English and Japanese, any concept associated with a number is flagged with the measurement attribute. The exceptions are described in Time, Duration, and Frequency. In Dutch and Swedish, numbers with an associated concept term (pattern 3 in the preceding list) are not marked as measurement.

Sentiment

A sentiment attribute flags a sentence as having either a positive or negative sentiment. Sentiment terms are highly dependent on the kind of texts being analyzed. For example, in a customer perception survey context the following terms might be flagged with a sentiment attribute:

  • The words “avoid”, “terrible”, “difficult”, “hated” convey a negative sentiment.

  • The words “attractive”, “simple”, “self-evident”, “useful”, “improved” convey a positive sentiment.

Because sentiment terms are often specific to the nature of the source texts, InterSystems NLP only identifies a small set of sentiment terms automatically. You can flag additional words as having a positive sentiment or a negative sentiment attribute. You can specify a sentiment attribute for specific words using a User Dictionary. When source texts are loaded into a domain, each appearance of these terms and the part of the sentence affected by it is flagged with the specified positive or negative sentiment marker.

For example, if “hated” is specified as having a negative sentiment attribute, and “amazing” is specified as having a positive sentiment attribute, when InterSystems NLP applies them to the sentence:

I hated the rain outside, but the running shoes were amazing.

Negative sentiment would affect “rain” and positive sentiment would affect “running shoes”.

When a positive or negative sentiment attribute appears in a negated part of a sentence, the sense of the sentiment is reversed. For example, if the word “good” is flagged as a positive sentiment, the sentence “The coffee was good” is a positive sentiment, but the sentence “The coffee was not good” is a negative sentiment.
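The reversal rule can be expressed as a tiny function. This plain-Python sketch is a model of the rule as stated, not of how InterSystems NLP stores sentiment; the function name and the string labels are assumptions.

```python
def effective_sentiment(term_sentiment: str, negated: bool) -> str:
    """Flip a term's sentiment when it appears in a negated part of a sentence."""
    if term_sentiment not in ("positive", "negative"):
        raise ValueError("sentiment must be 'positive' or 'negative'")
    if not negated:
        return term_sentiment
    return "negative" if term_sentiment == "positive" else "positive"


# "The coffee was good" versus "The coffee was not good"
print(effective_sentiment("positive", negated=False))  # positive
print(effective_sentiment("positive", negated=True))   # negative
```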

Certainty

A statement of fact is often qualified by terms that indicate the speaker's certainty (or a lack of certainty) about the accuracy of the statement.

  • Terms such as "clearly," "definitely," "confident that," and "without a doubt" can indicate a high level of certainty.

  • Terms such as "could," "uncertain," "quite possibly," and "seem to be" can indicate a low level of certainty.

Indexing these terms and the paths they qualify can provide valuable analytical information. For example, if you are analyzing medical records for occurrences of the entity "pulmonary embolus" in order to determine statistically effective response strategies for that condition, you may want to exclude a patient observation record where the entity appears as part of the phrase "concern for pulmonary embolus."

When you load texts into a domain, InterSystems NLP flags each appearance of a certainty term and the part of the sentence affected by it with a certainty attribute marker. As metadata, each certainty attribute flag receives an integer value c between 0 and 9, with higher values indicating higher levels of certainty.

However, the terms which indicate certainty or uncertainty are highly dependent on the kinds of texts being analyzed. For example, if the record of a medical encounter says that "the patient has no chance of recovery," the phrase "no chance of" indicates an unequivocally high level of certainty. But the same phrase may not indicate high certainty when it appears in a news article which quotes the manager of a sports team as saying, "We have no chance of being defeated."

Because of this, InterSystems NLP only identifies a small number of certainty terms automatically. These automatically detected certainty flags either receive the minimum (c=0) or maximum (c=9) certainty level.

You can specify certainty attributes for additional terms by including them in a User Dictionary.
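As an illustration of how the certainty level c might be used downstream, the following plain-Python sketch filters observations by a minimum level. It does not query InterSystems NLP: the function name, the sample texts paired with c values, and the cutoff of 5 are all hypothetical.

```python
def keep_confident(observations: list, minimum_c: int) -> list:
    """Keep (text, c) pairs whose certainty level c is at least minimum_c."""
    if not 0 <= minimum_c <= 9:
        raise ValueError("certainty levels range from 0 to 9")
    return [(text, c) for text, c in observations if c >= minimum_c]


# Hypothetical observations with assumed certainty levels.
observations = [
    ("pulmonary embolus confirmed on imaging", 9),
    ("concern for pulmonary embolus", 0),
]

confident = keep_confident(observations, minimum_c=5)
print(len(confident))  # 1
```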

Generic Attributes

In addition to the semantic attributes described previously, InterSystems NLP provides three generic flags that allow you to define custom attributes. You can specify terms as markers for one of the generic attributes by assigning them to one of the three generic attribute values (UDGeneric1, UDGeneric2, or UDGeneric3) in a User Dictionary. Similar to negation or certainty, InterSystems NLP flags each appearance of these terms and the part of the sentence affected by them with the generic attribute marker you have specified.
