Filtering a Random Selection of Sources

You can use the %iKnow.Filters.RandomFilter class to select a random sample of your sources. A random sample lets you run tests on a manageable subset of your sources. It also lets you divide your sources (or a subset of them) into “training” and “test” sets: you use the “training” set to define NLP analytics (dictionary matches, source categories, and so on), then use the “test” set to determine how well these analytics apply to another set of data. This helps you avoid “overfitting” the analytics to a particular set of data.
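
One possible way to build such a split is sketched below, using only the classes and methods that appear in the examples in this section. It is an illustration under stated assumptions, not a prescribed procedure: it assumes that domId identifies a domain that already contains indexed sources, the 70% training fraction is arbitrary, and the variable names (trainfilt, testfilt, inTrain, testlist) are hypothetical. The random filter is queried only once, and the “test” set is then derived as the remaining source Ids.

```objectscript
  // Sketch only: assumes domId identifies a domain that already contains indexed sources
  KILL inTrain
  SET total=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  // Training set: a random 70% of the sources (the fraction is an illustrative choice)
  SET trainfilt=##class(%iKnow.Filters.RandomFilter).%New(domId,.7)
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.train,domId,1,total,trainfilt)
  // Record the internal Id of each training source
  SET i=1
  WHILE $DATA(train(i)) {
     SET inTrain($LISTGET(train(i),1))=""
     SET i=i+1 }
  SET ntrain=i-1
  // Test set: every source Id that was not drawn for training
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.all,domId,1,total)
  SET testlist=""
  SET i=1
  WHILE $DATA(all(i)) {
     SET srcId=$LISTGET(all(i),1)
     IF '$DATA(inTrain(srcId)) { SET testlist=testlist_$LISTBUILD(srcId) }
     SET i=i+1 }
  SET testfilt=##class(%iKnow.Filters.SourceIdFilter).%New(domId,testlist)
  WRITE "Training sources: ",ntrain,!
  WRITE "Test sources: ",$LISTLENGTH(testlist),!
```

You could then pass a source Id filter built from the recorded training Ids, or testfilt, to the queries that define and evaluate your analytics.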

You can specify the size of the random subset in two ways:

  • As a percentage: You specify the percentage as a fractional number between 0 and 1, and this filter returns the corresponding percentage of the indexed sources in the specified domain (or filtered subset of the domain). For example, a value of “.5” means that 50% of the sources in the domain are included in the filtered result. Halves are rounded up, so 50% of 5 sources is 3 sources. To select 100%, specify a value such as “.999”, with enough fractional digits that the result rounds to the full number of sources. This filter selects the requisite number of sources randomly.

  • As an integer: You specify an integer, and this filter returns that number of indexed sources in the specified domain (or filtered subset of the domain). For example, a value of “7” means that 7 of the sources in the domain will be included in the filtered result. This filter selects the specified number of sources randomly.
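
The two forms can be compared side by side, as in the following minimal sketch. It assumes that domId identifies a domain that already contains 50 indexed sources, and the variable names (expected, pctfilt, intfilt) are hypothetical. The $NORMALIZE call is used here only to mirror the rounding described above; it is not necessarily how the filter itself computes the sample size.

```objectscript
  // Sketch only: assumes domId identifies a domain with 50 indexed sources
  // Fractional form: 33% of 50 is 16.5, which the filter rounds up
  SET expected=$NORMALIZE(50*.33,0)
  SET pctfilt=##class(%iKnow.Filters.RandomFilter).%New(domId,.33)
  WRITE "Fractional form: expected ",expected,", returned "
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,pctfilt),!
  // Integer form: the value is the number of sources to sample
  SET intfilt=##class(%iKnow.Filters.RandomFilter).%New(domId,7)
  WRITE "Integer form: requested 7, returned "
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,intfilt),!
```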

The following example randomly selects 33% of 50 sources, returning 17 sources. You can run this example repeatedly to demonstrate that different sources are randomly sampled:

  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
      { SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
        GOTO DeleteOldData }
  ELSE { SET domoref=##class(%iKnow.Domain).%New(dname)
         DO domoref.%Save()
         GOTO SetEnvironment }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { GOTO SetEnvironment }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT }
SetEnvironment
  SET domId=domoref.Id
  IF ##class(%iKnow.Configuration).Exists("myconfig") {
         SET cfg=##class(%iKnow.Configuration).Open("myconfig") }
  ELSE { SET cfg=##class(%iKnow.Configuration).%New("myconfig",0,$LISTBUILD("en"),"",1)
         DO cfg.%Save() }
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
   SET myquery="SELECT TOP 50 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
  SET filt=##class(%iKnow.Filters.RandomFilter).%New(domId,.33)
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The ",dname," domain contains ",numSrcD," sources",!
  SET numSrcFD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,filt)
  WRITE "Of these ",numSrcD," sources ",numSrcFD," were sampled:",!
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,20,filt)
  SET i=1
  WHILE $DATA(result(i)) {
     SET intId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
     WRITE "sample #",i," is source ",intId," ",extId,!
     SET i=i+1 } 
     WRITE "End of list"

The following example filters the sources in the domain by source Id, returning 11 sources. It then supplies this source Id filter when defining a random filter. Thus, the random filter returns 3 of these source-Id-filtered sources. You can run this example repeatedly to demonstrate that different sources are randomly sampled:

  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
      { SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
        GOTO DeleteOldData }
  ELSE { SET domoref=##class(%iKnow.Domain).%New(dname)
         DO domoref.%Save()
         GOTO SetEnvironment }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { GOTO SetEnvironment }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT }
SetEnvironment
  SET domId=domoref.Id
  IF ##class(%iKnow.Configuration).Exists("myconfig") {
         SET cfg=##class(%iKnow.Configuration).Open("myconfig") }
  ELSE { SET cfg=##class(%iKnow.Configuration).%New("myconfig",0,$LISTBUILD("en"),"",1)
         DO cfg.%Save() }
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
   SET myquery="SELECT TOP 50 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
  SET srclist=$LB(1,3,5,7,9,11,13,15,17,21,23)
  SET idfilt=##class(%iKnow.Filters.SourceIdFilter).%New(domId,srclist)
  SET numsrc = ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
     WRITE "The ",dname," domain contains ",numsrc," sources",!
 SET numfsrc = ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,idfilt)
     WRITE "Source count after source Id filtering: ",numfsrc,!
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,20,idfilt)
  SET i=1
  WHILE $DATA(result(i)) {
     SET intId = $LISTGET(result(i),1)
     WRITE intId," "
     SET i=i+1 } 
     WRITE !,"End of list",! 
  SET rfilt=##class(%iKnow.Filters.RandomFilter).%New(domId,3,idfilt)
  SET numrsrc=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,rfilt)
  WRITE "From ",numfsrc," sources ",numrsrc," are randomly sampled:",!
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,20,rfilt)
  SET j=1
  WHILE $DATA(result(j)) {
     SET intId = $LISTGET(result(j),1)
     WRITE intId," "
     SET j=j+1 } 
     WRITE !,"End of list",! 