Summarizing a Source

The NLP semantic analysis engine can summarize a source text by returning the most relevant sentences. It returns a user-specified number of sentences in the original sentence order, selecting those sentences that have the highest similarity to the overall content of the source text. NLP determines relevance by calculating an internal relevancy score for each sentence. Sentences that contain concepts that appear many times in the source text are more likely to be included in the summary than those that contain concepts that only appear once in the source text. NLP considers the overall frequency of each concept, the similarity of each concept to the most frequent concepts in the source, and other factors.

Summarizing a source is only available if the Summarize property was set to 1 in the Configuration when loading the source. The default Configuration specifies Summarize=1.

The accuracy of a summary therefore depends on two factors:

The source text must be large enough to permit meaningful frequency analysis, but not too large. NLP summarization works best on texts the length of a chapter or article. A book-length text should be summarized chapter-by-chapter.
The number of sentences in the summary should be a large enough subset of the original for the returned sentences to form a readable summary text. The minimum summary percentage is between 25% and 33%, depending on the contents of the text.

NLP provides three summary methods:

GetSummary()Opens in a new tab which returns each sentence of the summary text as a separate result. The sentence Id is returned as the first element of each returned sentence.
GetSummaryDirect()Opens in a new tab which returns the summary text as a single string. By default, the sentences within this string are separated by an ellipsis: a space followed by three periods, followed by a space. For example, “This is sentence one. ... This is sentence two.” You may specify a different sentence separator, if desired. Because this method concatenates multiple sentences into a single string, it may attempt to create a string longer than the InterSystems IRIS string length limit. When the maximum string length is reached, InterSystems IRIS sets this method’s isTruncated boolean output parameter to 1, and truncates the remaining text.
GetSummaryForText()Opens in a new tab which supports compiling a summary of a user-supplied string directly, rather than by supplying a specific source ID.

For details on what NLP considers a sentence, refer to the Logical Text Units Identified by NLP section of the “Conceptual Overview” chapter.

The following example goes through the source texts in a domain until it finds one that contains more than 100 sentences. It then uses GetSummary() to summarize that source to half of its original sentences:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceSentenceTotals
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
  SET numSentD=##class(%iKnow.Queries.SentenceAPI).GetCountByDomain(domId)
  WRITE "These sources contain ",numSentD," sentences",!!
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId)
SentenceCounts
  FOR i=1:1:numSrcD {
     SET srcId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
     SET fullref = $PIECE(extId,":",3,4)
     SET fname = $PIECE(fullref,"\",$LENGTH(extId,"\"))
     SET numSentS = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,result(i))
       IF numSentS > 100 {WRITE fname," has ",numSentS," sentences",!
                          GOTO SummarizeASource }
     }
     QUIT
SummarizeASource
   SET sumlen=$NUMBER(numSentS/2,0)
   WRITE "total sentences=",numSentS," summary=",sumlen," sentences",!!
   DO ##class(%iKnow.Queries.SourceAPI).GetSummary(.sumresult,domId,srcId,sumlen)
   FOR j=1:1:sumlen { WRITE "[S",j,"]: ",$LISTGET(sumresult(j),2),! }
   WRITE !,"END OF ",fname," SUMMARY",!!
   QUIT

Note that $NUMBER is used to assure that the specified summary sentence count is an integer. $LISTGET is used to remove the sentence Id and return just the sentence text.

The following example uses GetSummaryDirect() to return the same summary as a single concatenated string. It then uses $EXTRACT to divide the string into 38-character lines for display purposes:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceSentenceTotals
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
  SET numSentD=##class(%iKnow.Queries.SentenceAPI).GetCountByDomain(domId)
  WRITE "These sources contain ",numSentD," sentences",!!
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId)
SentenceCounts
  FOR i=1:1:numSrcD {
     SET srcId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
     SET fullref = $PIECE(extId,":",3,4)
     SET fname = $PIECE(fullref,"\",$LENGTH(extId,"\"))
     SET numSentS = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,result(i))
       IF numSentS > 100 {WRITE fname," has ",numSentS," sentences",!
                          GOTO SummarizeASource }
     }
     QUIT
SummarizeASource
   SET sumlen=$NUMBER(numSentS/2,0)
   WRITE "total sentences=",numSentS," summary=",sumlen," sentences",!!
   SET summary = ##class(%iKnow.Queries.SourceAPI).GetSummaryDirect(domId,srcId,sumlen)
FormatSummaryDisplay
   SET x=1
   SET totlines=$LENGTH(summary)/38
   FOR i=1:1:totlines {
      WRITE $EXTRACT(summary,x,x+38),!
      SET x=x+39 }
   WRITE !,"END OF ",fname," SUMMARY"

Custom Summaries

NLP permits you to generate custom summaries for sources by specifying a summaryConfig parameter string. Custom summaries are provided for those who desire to tune the content of NLP generated summaries to their specific needs. Custom summaries allow you to absolutely include, preferentially include, or absolutely exclude sentences into the summary. You can, for example, include or exclude standard components of sources that always appear at the same location, such as a title, byline, copyright, abstract, or summary. You can also absolutely or preferentially include or exclude sentences that contain a specified word.

The source summarization operation first gives each sentence a numeric summary weight, and then creates the summary by selecting the appropriate number of sentences with the highest weights. You can influence this ranking by specifying a summaryConfig parameter to the summary method.

The summaryConfig parameter value is a string consisting of one or more specifications. Each specification consists of three elements separated by vertical bars. For example, "s|2|false". You can concatenate multiple specifications using a vertical bar. For example, "s|1|true|s|2|false". The summaryConfig parameter default is the empty string.

You can configure the summary to select sentences according to the following:

Always (or never) include a sentence.
- By sentence number: "s|1|true" mean that sentence (s) number 1 is always included in the summary (true). For example, if the sources always contain a title, it can be beneficial to always include the first sentence (the title) in the summary. If the second line of each source is an author byline, and you never want this included in the summary, you can specify this as "s|2|false". You can also specify that the last sentence should never be included in the summary: "s|-1|false"; this might be appropriate if all of the sources end with a transcription reference or journal citation. Sentences in a source are numbered forward from 1, or numbered backwards from the end of the source as -1, -2, and so forth.
- By word: "w|requirement|true" mean that any sentence containing the word (w) requirement is always included in the summary (true). You can also exclude sentences containing a specific word. For example, "w|foreign|false" excludes all sentences that contain the word foreign from the summary. A “word” can be one or more whole words: it can consist of a string of multiple words separated by spaces; it cannot consist of a partial word string. Note that words are normalized, and thus must be specified in all lowercase letters.
Give a sentence more summary weight. This increases the chances that the sentence will appear in the summary. Available weight values are the integers 0 through 9.
- By sentence number: "s|1|3" mean that sentence (s) number 1 has its summary weight increased by a factor of 3. For example, the title (first sentence) of sources should be included if it is somewhat descriptive of the contents, but not included if it is not directly descriptive (for example, a literary quotation). Sentences in a source are numbered forward from 1, or numbered backwards from the end of the source as -1, -2, and so forth.
- By word: "w|requirement|2" mean that any sentence containing the word (w) requirement has its summary weight increased by a factor of 2. A “word” can be one or more whole words: it can consist of a string of multiple words separated by spaces; it cannot consist of a partial word string. Note that words are normalized, and thus must be specified in all lowercase letters.

You can specify multiple summary customizations by concatenation. For example: "s|1|true|s|2|false|w|surgery|3|w|hypnosis|false" (always include the first sentence, never include the second sentence, increase the summary weight of all sentences containing the word “surgery”, exclude all sentences containing the word “hypnosis”.

Thus the user can give more or less importance to specific words and/or sentences. The weight of sentences affected by more than one of the specifications in the summaryConfig will be resolved by the Custom Summaries algorithm. This algorithm also applies when there is a conflict between specifications that apply to the same sentence:

If there is a conflict between a sentence (s) specification and a word (w) specification, the sentence specification wins.
If there is a conflict between an include (true) and an exclude (false) involving two s specifications, or two w specifications, the include specification wins.
If there is a conflict between the specified summary length and the number of sentences that must be included or number of sentences that must be excluded, the summary length is ignored.

The options for custom summaries can be set by means of the summaryConfig parameter in the %iKnow.Queries.SourceAPI.GetSummary()Opens in a new tab and %iKnow.Queries.SourceAPI.GetSummaryForText()Opens in a new tab methods.

Querying a Subset of the Sources

Listing Similar Sources