Listing Similar Sources
The NLP semantic analysis engine can list which sources are similar to a specified source. Similarity between sources is determined by the number of entities that appear in both sources (the overlap), and the percentage of the source contents that contain overlap.
The GetSimilar() method can calculate similarity of sources to a specified source. Because of the potentially large number of similar sources, this method is commonly used with a filter to limit the set of sources considered. GetSimilar() can use your choice of two algorithms, each of which takes an algorithm parameter:
- Basic similarity of items ($$$SIMSRCSIMPLE, the default). Available algorithm parameters are “ent” (entity similarity, the default), “crc” (Concept-Relation-Concept sequence), or “cc” (Concept + Concept pair).
- Using semantic dominance calculations ($$$SIMSRCDOMENTS). The algorithm parameter is a boolean flag that specifies limiting similarity to sources that contain a dominant entity that is also a dominant entity in the specified source.
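As a minimal sketch of selecting a non-default algorithm parameter, the following call uses the same argument order as the full example in this section but passes “crc” instead of “ent”. It assumes that a domain named "mydomain" already exists and has sources loaded; the source ID of 1 and the page size of 10 are placeholders chosen for illustration, not values taken from this documentation.

```objectscript
#include %IKPublic
  // Assumption: "mydomain" exists and already contains loaded sources
  SET domoref=##class(%iKnow.Domain).NameIndexOpen("mydomain")
  SET domId=domoref.Id
  // Placeholder source ID, substitute a source ID from your own domain
  SET srcId=1
  // Basic similarity, comparing Concept-Relation-Concept sequences
  SET stat=##class(%iKnow.Queries.SourceAPI).GetSimilar(.sim,domId,srcId,1,10,"",$$$SIMSRCSIMPLE,$LB("crc"))
  IF stat'=1 { DO $System.Status.DisplayError(stat) QUIT }
  // Print each returned row as a comma-separated string
  SET j=""
  FOR { SET j=$ORDER(sim(j)) QUIT:j=""  WRITE $LISTTOSTRING(sim(j)),! }
```

The sixth argument is left as an empty string here, so no filter is applied; with a large domain you would typically supply a filter to narrow the set of candidate sources.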
For each similar source, NLP returns a list of elements with the following format:
srcId,extId,percentageMatched,percentageNew,nbOfEntsInRefSrc,nbOfEntsInCommon,nbOfEntsInSimSrc,score
Element | Description |
---|---|
srcId | The source ID, an integer assigned by NLP. |
extId | The external ID for the source, a string value. |
percentageMatched | The percentage of this source's contents that also appears in the reference source (the source specified to GetSimilar()). |
percentageNew | The percentage of this source's contents that is new, meaning that it does not match the reference source. |
nbOfEntsInRefSrc | The number of unique entities in the reference source, the source that this source is matched against. |
nbOfEntsInCommon | The number of unique entities that are found in both sources. |
nbOfEntsInSimSrc | The number of unique entities in this source (the similar source). |
score | The similarity score, expressed as a fractional number. An identical source would have a similarity score of 1. |
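Because each result row is a list, the elements can also be read by position with $LISTGET rather than by converting the row to a string. The sketch below builds a hypothetical row by hand so that it runs on its own; every value in it, including the external ID string, is invented for illustration and is not real GetSimilar() output.

```objectscript
  // Hypothetical row in the documented element order (all values invented)
  SET row=$LISTBUILD(12,"example-ext-id",40,60,25,10,30,.4)
  // Read elements by their position in the format shown above
  SET srcId=$LISTGET(row,1)
  SET extId=$LISTGET(row,2)
  SET nbCommon=$LISTGET(row,6)
  SET score=$LISTGET(row,8)
  // Report only rows whose score is above an arbitrary threshold
  IF score>.33 {
    WRITE "Source ",srcId," (",extId,") has score ",score,!
    WRITE "  entities in common: ",nbCommon,!
  }
```

Reading by position avoids any ambiguity that would arise if an external ID happened to contain the delimiter used for string conversion.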
The following example demonstrates the listing of similar sources. It first limits the set of test sources to those that may describe an engine failure incident, by using GetByEntities() to select for a list of appropriate entities. It then uses GetSimilar() to find sources similar to these test sources, which may indicate a pattern of similar incidents. GetSimilar() takes the default similarity algorithm ($$$SIMSRCSIMPLE) and its default algorithm parameter (“ent”). The program displays only those similar sources with a high similarity score (>.33). The similarity display omits the source external IDs:
#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT }
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
  SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
  SET idfld="UniqueVal"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
  IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
  IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  SET totsrc = ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE totsrc," total sources",!
SimilarSourcesQuery
  // Select the sources that contain any of the listed engine-related entities
  SET engineents = $LB("engine","engine failure","engine power","loss of power","carburetor","crankshaft","piston")
  DO ##class(%iKnow.Queries.SourceAPI).GetByEntities(.result,domId,engineents,1,totsrc)
  SET i=1
  WHILE $DATA(result(i)) {
     SET src = $LISTTOSTRING(result(i),",",1)
     SET srcId = $PIECE(src,",",1)
     WRITE "Source ",srcId," contains an engine incident",!
     // Find up to 50 sources similar to this one, using entity similarity
     DO ##class(%iKnow.Queries.SourceAPI).GetSimilar(.sim,domId,srcId,1,50,"",$$$SIMSRCSIMPLE,$LB("ent"))
     SET j=1
     WHILE $DATA(sim(j)) {
        SET simlist=$LISTTOSTRING(sim(j))
        // Piece 8 is the score; pieces 3 through 8 skip the external ID
        IF $PIECE(simlist,",",8) > .33 {
           WRITE "   similar to source ",$PIECE(simlist,",",1),": "
           WRITE $PIECE(simlist,",",3,8),! }
        SET j=j+1 }
     SET i=i+1 }