Listing Similar Entities

You can list the unique entities that are similar to a specified string. An entity is similar if one of the following applies:

  • The string is identical to the entity.

  • The string is one of the words of the entity.

  • The string is the first letters of one of the words of the entity.

Similarity returns each unique entity (Head Concept or Tail Concept) with integer counts of its frequency and spread, in descending sort order of these integer counts. Similarity does not match Relations. As is true throughout NLP, matching ignores letter case; all entities are returned in lowercase letters. Similarity does not use stemming logic; “cat” returns both “cats” and “category”.

The following example lists the entities that are similar to the string “student pilot”:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.GetErrorText(stat),!
            QUIT }
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.GetErrorText(stat),! QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.GetErrorText(stat),! QUIT }
SourceCountQuery
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," total sources",!!
SimilarEntityQuery
  WRITE "Entities similar to 'Student Pilot':",!
  DO ##class(%iKnow.Queries.EntityAPI).GetSimilar(.simresult,domId,"student pilot",1,50)
  SET j=1
  WHILE $DATA(simresult(j)) {
       // Each result row is a list; elements 2-4 hold the entity, frequency, and spread
       SET entity = $LISTGET(simresult(j),2)
       SET freq = $LISTGET(simresult(j),3)
       SET spread = $LISTGET(simresult(j),4)
       WRITE "(",entity,")  appears ",freq," times in ",spread," sources",!
       SET j=j+1 }

The default domain parameter setting governing entity similarity is EnableNgrams, a boolean value.
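As a minimal sketch of how this parameter might be set, the following assumes that domoref is an open %iKnow.Domain object (as in the example above), that the $$$IKPENABLENGRAMS macro from %IKPublic names the EnableNgrams parameter, and that the parameter is set before any sources are loaded into the domain. Check these assumptions against the class reference for your version before relying on them.

```objectscript
#include %IKPublic
SetNgrams
  // Assumption: domoref is an open %iKnow.Domain object that holds no data yet
  SET stat=domoref.SetParameter($$$IKPENABLENGRAMS,1)
  IF stat '= 1 { WRITE "SetParameter failed: ",$System.Status.GetErrorText(stat),! QUIT }
  // Read the value back to confirm the setting
  WRITE "EnableNgrams is set to ",domoref.GetParameter($$$IKPENABLENGRAMS),!
```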

Parts and N-grams

The GetSimilar() and GetSimilarCounts() methods have a mode parameter that specifies where to search for similarity. There are two available values:

  • $$$USEPARTS causes NLP to match the beginning of each part (word) for similarity. For texts in English and most other languages this is generally the preferred setting. $$$USEPARTS is the default.

  • $$$USENGRAMS causes NLP to match words and linguistic units within words (n-grams) for similarity. Use this mode when the source text language compounds words. For example, $$$USENGRAMS is commonly used with German, a language that regularly forms compound words. It is not normally used with English, which does not compound words in this way. $$$USENGRAMS can only be used in a domain that has the EnableNgrams domain parameter set.
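The two modes can be compared by running the same query twice. This is a sketch only: it assumes domId identifies a loaded domain that has EnableNgrams set, and that mode is the eighth argument of GetSimilar(), following the filter and filter mode arguments. That argument order is recalled from the class reference rather than stated on this page, so verify it for your version.

```objectscript
#include %IKPublic
CompareModes
  // Assumption: domId identifies a loaded domain with EnableNgrams set
  // Assumed argument order: result, domain ID, string, page, page size, filter, filter mode, mode
  DO ##class(%iKnow.Queries.EntityAPI).GetSimilar(.parts,domId,"pilot",1,50,"",$$$FILTERONLY,$$$USEPARTS)
  DO ##class(%iKnow.Queries.EntityAPI).GetSimilar(.ngrams,domId,"pilot",1,50,"",$$$FILTERONLY,$$$USENGRAMS)
  WRITE "Matches using $$$USEPARTS:",!
  SET j=""
  FOR { SET j=$ORDER(parts(j)) QUIT:j=""
        WRITE "  ",$LISTGET(parts(j),2),! }
  WRITE "Matches using $$$USENGRAMS:",!
  SET j=""
  FOR { SET j=$ORDER(ngrams(j)) QUIT:j=""
        WRITE "  ",$LISTGET(ngrams(j),2),! }
```

With English sources the two lists may differ little. The n-gram mode is mainly of interest for compounding languages such as German.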
