Semantic Dominance

Semantic dominance is the overall importance of an entity within a source. NLP determines semantic dominance by performing the following tests and obtaining statistical results:

The number of times that an entity appears in the source (the frequency).
The number of times that each component word of an entity appears in the source.
The number of words in the entity.
The type of entity (concept or relation).
The diversity of entities in the source.
The diversity of component words in the source.

Non-relevant and path-relevant words have no effect on dominance and frequency calculations.

Note:

Japanese uses a different, language-specific set of statistics to calculate the semantic dominance of an entity. See NLP JapaneseOpens in a new tab.

NLP generates these values when the sources are loaded as part of NLP indexing. It combines these values to produce a dominance score for each entity. The higher the dominance score, the more dominant the entity is within the source.

For example, for the concept “cardiovascular surgery” to be semantically dominant in a document, a statistical analysis would be performed. It determines that this concept appears 20 times in the source. A dominant concept is not the same as a top concept. In this source the concepts “doctor” (60 times), “surgery” (50 times), “operating room” (40 times), and “surgical procedure” (30 times) are far more common. However, the component words of “cardiovascular surgery” appear over twice as many times as the concept itself: “cardiovascular” (50 times) and “surgery” (80 times), which lends support to this being a dominant concept. In contrast, the concept “operating room” appears 40 times, but its component words appear barely more often than the concept: “operating” 60 times and “room” only 45 times. This indicates that this source is much more concerned with cardiovascular matters and surgery than it is with rooms.

NLP gives these frequency counts greater or lesser weight based on the number of words in the original entity and whether that entity is a concept or a relation. (There is a much smaller number of commonly-occurring relations than concepts.)

However, to determine how dominant a concept really is, NLP has to compare it to the total number of concepts in the source. If 5% of the concepts in the source contain the words “cardiovascular” and “surgery”, and these words do not combine in other concepts nearly as frequently as they do together, we know that these words not only appear frequently in the source, but that the source does not have a wide range of subject matter. If, however, the source contains a nearly equal occurrence of the word “surgery” in concepts with “hand”, “kidney” and “brain” and the word “cardiovascular” appears nearly as often with the words “exercise” and “diet” it is apparent that the source contains a wide range of subject matter. The concept “cardiovascular surgery” and its component words may appear more frequently than others, but may not significantly dominate the subject matter of the source.

By performing these statistical calculations, NLP can determine the dominant concepts in a source — the subjects that are of greatest interest to you. NLP performs this analysis without using an external reference corpus (such as pre-existing table of the relative frequency of words in a “typical” medical text). NLP determines dominance using only the contents of the actual source text, and thus can be used on sources on any topic without any prior knowledge of the subject matter.

Dominance in Context

NLP calculates the dominance of entities (concepts and relations) within a source and assigns each an integer value. It assigns the most dominant concept(s) in a source a dominance value of 1000. You can use the Indexing Results tool to list the dominance values for concepts in a single source.

NLP calculates the dominance of CRCs within a source. The algorithm uses the dominance values of the entities within the CRC. CRC dominance values are intended only for comparison with other CRC dominance values; CRC dominance values should not be compared with entity dominance values. You can use the Indexing Results tool to list the dominance values for CRCs in a single source.

NLP calculates the dominance of concepts across all loaded sources as a weighted average. These concept dominance scores are fractional numbers, with the largest possible number being 1000. You can use the Domain Explorer tool Dominant Concepts option to list the dominance scores for concepts in all loaded sources. You can also use the %iKnow.Semantics.DominanceAPIOpens in a new tab GetTop()Opens in a new tab method to list the dominance scores for concepts in all loaded sources.

Concepts of Semantic Dominance

The following are the key elements of semantic dominance:

Profile: the counts of elements that are used to calculate a dominance score.
Typical: a typical source is a source in which the dominant entities in that source are most similar to the dominant elements of the group of sources. This is the opposite of Breaking.
Breaking: a breaking source is a source in which the dominant entities in that source are least similar to the dominant elements of the group of sources. This is the opposite of Typical. For example, a breaking news story would likely be least similar to the dominant entities in all of the news stories from the previous month.
Overlap: the number of occurrences of an entity in different sources.
Correlation: a comparison of the entities in a source with a list of entities, returning a correlation percentage for each source.

Semantic Dominance Examples

This chapter describes and provides examples for the following semantic dominance queries:

Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.

Concepts with Top Dominance Scores in the Domain

The GetTop()Opens in a new tab method returns the top concepts (or relations) by dominance scores for all loaded sources. GetTop() supports filters and skiplists.

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCount
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!!
DominantConcepts
  DO ##class(%iKnow.Semantics.DominanceAPI).GetTop(.profresult,domId,1,50)
  WRITE "Top Concepts in Domain by Dominance Score",!
  SET j=1
  WHILE $DATA(profresult(j),list) {
     WRITE $LISTGET(list,2)
     WRITE ": ",$LISTGET(list,3),!
     SET j=j+1 }
  WRITE !,"Printed ",j-1," dominant concepts"

Dominance Score for a Specified Entity

NLP uses the GetDomainValue()Opens in a new tab method to return the dominance value for a specified entity. You specify the entity by its entity Id (a unique integer), and specify the entity type by a numeric code. The default entity type is 0 (concept).

A single set of unique entity Ids is used for concepts and relations; therefore, no concept has the same entity Id as a relation. A separate set of unique entity Ids is used for CRCs; therefore, a CRC may have the same entity Id as a concept or a relation. This numbering is entirely coincidental; there is no connection between the entities.

The following example takes the top 12 entities and determines the dominance score for each entity. As one can see from the results of this example, the top (most frequently occurring) entities do not necessarily correspond to the entities with the highest dominance scores:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
TopEntitiesDominanceScores
  DO ##class(%iKnow.Queries.EntityAPI).GetTop(.result,domId,1,12)
  SET i=1
  WHILE $DATA(result(i)) {
       SET topstr=$LISTTOSTRING(result(i),",",1)
       SET topid=$PIECE(topstr,",",1)
       SET val=$PIECE(topstr,",",2)
         SET spc=25-$LENGTH(val)
       WRITE val
       WRITE $JUSTIFY("top=",spc),i
       WRITE " dominance="
       WRITE ##class(%iKnow.Semantics.DominanceAPI).GetDomainValue(domId,topid,0),!
       SET i=i+1 }
  WRITE "Top ",i-1," entities and their dominance scores"

Semantic Proximity