Loading Text Data Programmatically
InterSystems has deprecated InterSystems IRIS® Natural Language Processing (NLP). It may be removed from future versions of InterSystems products. The following documentation is provided as reference for existing users only. Existing users who would like assistance identifying an alternative solution should contact the WRC.
Before NLP can analyze text data, the data sources must be loaded into a domain. This can be done in three ways:
-
Using the Domain Architect to specify the data locations of source texts for a domain. The Build button loads the specified sources into the domain.
-
Creating an %iKnow.DomainDefinition subclass that specifies the data locations of source texts for a domain. This generates a %Build() method in a dependent class that contains the logic to load this data.
-
Specifying a Loader and Lister programmatically to load the specified sources into a domain, as described in this chapter.
To make text data available for NLP analysis, the domain must invoke an instance of a Loader and a Lister. The Loader supervises NLP processing of text sources, using the Lister and a Processor. The Lister identifies the text sources to be used by the Loader. NLP provides a variety of Listers for different types of source text data. Each Lister, by default, automatically invokes the corresponding Processor with default parameters. There is one Loader used for data sources of all types.
Note that the Loader and Lister objects can be created in any order, but both must have been created before you invoke the Lister AddListToBatch() instance method and then the Loader ProcessBatch() instance method (or other equivalent Lister and Loader methods).
Loader
The Loader (%iKnow.Source.Loader) is the main class coordinating the loading process. You must create a new loader object for the domain. To create a loader object:
SET domo=##class(%iKnow.Domain).NameIndexOpen("mydomain")
SET domId=domo.Id
SET myloader=##class(%iKnow.Source.Loader).%New(domId)
After creating a loader and a lister, you issue instance methods to list and process the sources. For example, when performing a batch load you issue the Lister AddListToBatch() instance method to list the text sources. You then issue the Loader ProcessBatch() instance method to process the listed sources. This Loader method calls the Lister to scan the locations marked by AddListToBatch(), then calls the Processor to read those documents and push them to the NLP engine. Finally, it invokes the ^%iKnow.BuildGlobals routine to process the staging globals loaded by the NLP engine.
Loader Error Logging
If a load operation completes, but encounters errors in loading one or more sources, these errors are recorded in an error log. Errors of varying severity can be retrieved using the GetErrors(), GetWarnings(), and GetFailed() methods. For example, a failed load error (GetFailed()) occurs if you attempt to load a source file that has no contents. A warning load error (GetWarnings()) occurs if there is an error in the source metadata.
You can use the ClearLogs() method to clear the error log of error messages at any or all of these severity levels.
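The following is a minimal sketch of inspecting the error log after a load. It assumes that myloader is the Loader instance used for the load, and that each of the three retrieval methods returns its entries in an array passed by reference; that argument form and the array layout are assumptions here, so check the class reference before relying on them.

```objectscript
 // Assumes myloader is the %iKnow.Source.Loader instance that ran the load.
 // Assumption: each method fills a by-reference array with its log entries.
 DO myloader.GetFailed(.failed)
 DO myloader.GetWarnings(.warnings)
 DO myloader.GetErrors(.errors)
 // Dump whatever was returned; $DATA guards against an empty log
 IF $DATA(failed) { WRITE "Failed sources:",! ZWRITE failed }
 IF $DATA(warnings) { WRITE "Warnings:",! ZWRITE warnings }
 IF $DATA(errors) { WRITE "Errors:",! ZWRITE errors }
 // Assumption: calling ClearLogs() with no arguments clears all severity levels
 DO myloader.ClearLogs()
```

Dumping the arrays with ZWRITE avoids depending on the exact layout of each log entry.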
Loader Reset()
If a load operation did not complete as expected and you want to start from scratch, invoke the Reset() method for the loader instance, as follows:
DO myloader.Reset()
Lister
The Lister identifies text files, records, or other sources of unstructured data you wish NLP to index. That is, all text that will eventually end up as a Source in the domain. The unit of content in NLP is a Source, which can represent any unit of text you wish to analyze, such as a text file, a record in a SQL table, an RSS posting, or other text source.
Usually a Source is a text containing multiple sentences. However, a source can contain content of any type. For example, a file containing the number 123 is treated as a Source containing one sentence. A file with no contents is not listed as a Source.
All listers are subclasses of %iKnow.Source.Lister, and each scans its own specific type of source. For example, the subclass %iKnow.Source.File.Lister scans a file system and the subclass %iKnow.Source.RSS.Lister scans RSS web feeds, such as blog postings, in XML file format. NLP provides seven listers for different types of sources. You can also create your own custom lister.
Most text sources require a Lister. However, text that is directly specified as a string does not require a Lister.
Through the AddListToBatch() method you can instruct the Lister to look into a specific directory, SQL table, or RSS feed for Sources. The lister parameters depend on the actual Lister class.
Initializing a Lister
You can create a Lister instance for a domain using the %New() method for that type of lister, supplying the domain Id. The following example creates two listers within the specified domain:
SET domo=##class(%iKnow.Domain).NameIndexOpen("mydomain")
SET domId=domo.Id
SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
WRITE flister,!
SET rlister=##class(%iKnow.Source.RSS.Lister).%New(domId)
WRITE rlister
Each lister automatically invokes the corresponding processor, as follows:
-
The File.Lister invokes the File.Processor.
-
The Global.Lister invokes the Global.Processor.
-
The Domain.Lister invokes the Domain.Processor.
-
All other Listers invoke the Temp.Processor. The %iKnow.Source.Temp.Processor has that name because it processes temporary globals that are automatically created and deleted by NLP during the loading process.
Each processor has default processor parameters, which are appropriate for most NLP sources. Therefore, in most cases, you do not need to specify a processor or processor parameters. If you do not specify a processor, NLP uses the default processor, as shown by the DefaultProcessor() method.
Overriding Lister Instance Defaults
In most cases, the lister instance defaults are appropriate for the processing of your NLP sources.
If you wish to override the lister instance defaults for Configuration, Processor, or Converter objects, you can optionally use the Init() instance method to initialize the Lister instance. If you omit Init(), the defaults are used.
The complete Lister initialization is as follows:
Init(config,processor,processorparams,converter,converterparams)
To specify the default for any of these items, specify the empty string ("") as the Init() parameter value.
You can also initialize these objects separately using the SetConfig(), SetProcessor(), and SetConverter() methods.
-
Configuration (Config): If you do not specify a configuration, NLP uses the default configuration. A configuration specifies what language(s) the text documents contain, and whether or not automatic language identification should be used. A configuration object is not domain-specific; you can use the same configuration for multiple domains. While not required, explicitly specifying a configuration is recommended.
-
Processor: Using lister.Init() you can specify a processor and processor parameters. A processor reads the texts into NLP. Specifying a processor is optional. If you do not specify a processor, NLP uses the default processor and its parameter defaults. If you specify a processor, you can specify the processor parameter values, as shown in the following example:
SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
SET processor="%iKnow.Source.File.Processor"
SET pparams=$LB("Latin1")
DO flister.Init("",processor,pparams,"","")
If explicitly specified, the processor subclass should be either of the same type as the Lister subclass (for example, %iKnow.Source.File.Lister takes %iKnow.Source.File.Processor) or %iKnow.Source.Temp.Processor if the Lister subclass has no corresponding Processor subclass. You can also create your own custom processor.
Processor parameters are specified as an InterSystems IRIS list. For %iKnow.Source.File.Processor the first list element is the name of the character set used (for example "Latin1"). The %iKnow.Source.Temp.Processor does not take any processor parameters.
-
Converter: Using lister.Init() you can specify a user-defined converter and converter parameters. A Converter converts formatted source documents to plain text, removing HTML or XML tags, PDF formatting, or other non-text contents. Usually separate converters are used for each source document formatting type. Specifying a converter is optional. The default is to use no converter. If no converter is used, NLP indexes formatting contents as well as text contents.
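As an alternative to a single Init() call, the separate setter methods named above can be sketched as follows. This assumes that domId holds a valid domain Id, that a configuration named "myconfig" already exists, and that SetConfig() and SetProcessor() accept the same values as the corresponding Init() arguments; the argument forms are assumptions, not confirmed signatures.

```objectscript
 // Assumes domId is the Id of an existing domain
 SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
 // Assumption: SetConfig() takes the configuration name, as Init() does
 DO flister.SetConfig("myconfig")
 // Assumption: SetProcessor() takes the processor class name and its parameter list
 DO flister.SetProcessor("%iKnow.Source.File.Processor",$LB("Latin1"))
 // No converter is set, so the default (no converter) applies
```

Setting only the items you need leaves the remaining defaults in place, which is the same effect as passing the empty string for those Init() arguments.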
Lister Assigns IDs to Sources
The lister assigns two unique IDs to each source:
-
Source ID (internal ID): a unique integer assigned by NLP that is used for NLP internal processing.
-
External ID: a unique identifying string or number. The External ID is used as the link for any user-specified application that wishes to use NLP. The External ID has the following structure:
ListerReference:FullReference
The Lister Reference is either the full class name of the Lister class used to load this source, or a short alias defined by the Lister class itself, prefixed with a colon. The Full Reference is a string for which the format is defined by the Lister class. It contains a Group Name and a Local Reference. It is up to the Lister to provide the implementation to derive the Group Name and Local Reference from this Full Reference, and to rebuild the Full Reference from the Group Name and Local Reference.
For example, the text file external ID :FILE:c:\mytextfiles\mydoc.txt consists of:
-
ListerReference: the Lister class alias :FILE
-
FullReference: c:\mytextfiles\mydoc.txt, which consists of the Group Name c:\mytextfiles\ and the Local Reference mydoc.txt.
For data in an SQL table, the ListerReference is :SQL. The Group Name is the groupfield, a field in the record that contains a unique value, and the Local Reference is the row ID.
For data in a string or global variable, the ListerReference is :TEMP.
The external ID format described here is the default; external ID format is configurable using the SimpleExtIds domain parameter.
-
You can access a source using either ID. The %iKnow.Queries.SourceAPI class contains methods for accessing these IDs. The GetByDomain() method returns both IDs for each source. Given the source ID, the GetExternalId() method returns the external ID. Given the external ID, the GetSourceId() method returns the source ID.
You can determine the lister class alias using the GetAlias() method of the %iKnow.Source.File.Lister class. If no alias exists, the External ID contains the full Lister class name.
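The relationship between the two IDs can be illustrated with a short sketch. It assumes that domId is the Id of a domain that already contains loaded sources, and that GetByDomain() returns one page of results in a by-reference array whose entries are lists beginning with the source ID; the page and page-size arguments shown are assumptions.

```objectscript
 // Assumes domId is the Id of a domain that already contains sources.
 // Assumption: result(n) is a list whose first element is the source ID.
 SET stat=##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,10)
 IF stat'=1 { DO $SYSTEM.Status.DisplayError(stat) QUIT }
 SET n=""
 FOR { SET n=$ORDER(result(n),1,row) QUIT:n=""
   SET srcId=$LIST(row,1)
   // Source ID to external ID
   SET extId=##class(%iKnow.Queries.SourceAPI).GetExternalId(domId,srcId)
   WRITE "Source ID ",srcId," has external ID ",extId,!
   // External ID back to source ID
   WRITE "  which maps back to source ID ",##class(%iKnow.Queries.SourceAPI).GetSourceId(domId,extId),!
 }
```

For sources loaded by the File Lister, the external IDs printed here would be expected to start with the :FILE alias described above.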
Lister Defaults Example
The following is a minimal Lister and Loader example, taking all defaults. It establishes a domain, then creates Lister and Loader instance objects for that domain. It does not invoke lister.Init(), but takes the defaults for configuration, processor, and converter. It then lists and loads a directory of user-defined .txt and .log files:
SET domo=##class(%iKnow.Domain).NameIndexOpen("mydomain")
SET domId=domo.Id
SetListerAndLoader
SET mylister=##class(%iKnow.Source.File.Lister).%New(domId)
SET myloader=##class(%iKnow.Source.Loader).%New(domId)
UseListerAndLoader
SET install=$SYSTEM.Util.DataDirectory()
SET dirpath=install_"mgr\Temp\iris\mytextfiles"
SET stat=mylister.AddListToBatch(dirpath,$LB("txt","log"),0,"")
WRITE "The lister status is ",$System.Status.DisplayError(stat),!
SET stat=myloader.ProcessBatch()
WRITE "The loader status is ",$System.Status.DisplayError(stat),!
Most examples in this book delete old data before using the Lister and Loader; this old data deletion is for demonstration purposes to allow these examples to be run repeatedly. Most examples in this book do not specify the processor and processor parameters, taking the defaults. Many examples in this book specify values for configuration rather than taking the defaults.
Lister Parameters
When you invoke a method to specify sources, you specify Lister parameters. You specify the same Lister parameters for the AddListToBatch() Lister instance method (for large batch loads of sources) and the ProcessList() Loader instance method (for adding a small number of sources to an existing batch of sources).
There are four Lister parameters that cumulatively define which sources are to be listed for NLP indexing:
-
Path: the location where the sources are located, specified as a string. This parameter is mandatory.
-
Extensions: one or more file extension suffixes that identify which sources are to be listed. Specified as an InterSystems IRIS list data structure, each element of which is a string (refer to $LISTBUILD for details on InterSystems IRIS list data structures). By default the Lister selects all files in the Path directory that contain data, regardless of their file extension suffix. This includes files with no file extension suffix or with a file extension suffix indicating non-text content (such as .jpg). Empty files are not selected. Directories are not selected. When an extension suffix parameter is specified, the Lister selects only those files in the Path directory with that file extension suffix (or with no file extension suffix) that contain data.
-
Recursive: a boolean value that specifies whether to search subdirectories of the path for sources. If selected, multiple levels of subdirectories are searched for sources. 1 = include subdirectories. 0 = do not include subdirectories. The default is 0.
-
Filter: a string specifying a filter used to limit which sources are to be listed for NLP indexing. For example, a user-designed filter could limit the Lister to only those files that have a specified substring in their file names. The default is to use no filter. (Note that this use of the word “filter” is completely separate from the filters in the %iKnow.Filters class that are used to include or exclude already-indexed sources supplied to an NLP query.)
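The four parameters above can be sketched in a single call. This example assumes that mylister is a %iKnow.Source.File.Lister and myloader a Loader created for the same domain; the directory path is hypothetical.

```objectscript
 // Assumes mylister (File.Lister) and myloader (Loader) exist for the same domain.
 // The directory below is a hypothetical location.
 SET path="C:\mytextfiles"               // Path: mandatory
 SET extensions=$LISTBUILD("txt","log")  // Extensions: list of suffixes
 SET recursive=1                         // Recursive: 1 = include subdirectories
 SET filter=""                           // Filter: empty string = no filter
 SET stat=mylister.AddListToBatch(path,extensions,recursive,filter)
 IF stat'=1 { DO $SYSTEM.Status.DisplayError(stat) QUIT }
 SET stat=myloader.ProcessBatch()
 IF stat'=1 { DO $SYSTEM.Status.DisplayError(stat) QUIT }
```

Naming each parameter in its own variable makes the positional arguments easier to read than a single call with literal values.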
Batch or List?
NLP provides two ways to load sources of all types: batch loading (ProcessBatch()) or list loading (ProcessList()). Both perform the same processing; they differ in their speed of execution. Which one you use depends primarily on how many sources you are loading. As a general rule, when loading ten or fewer sources, use ProcessList(); when loading one hundred or more sources, use ProcessBatch(). Which to use on intermediate numbers of sources depends on the nature of the specific sources.
Listing and Loading Examples
The examples in this section show the different ways to load sources:
-
lister.AddListToBatch() and loader.ProcessBatch() to batch load a large number of sources.
-
loader.SetLister() and loader.ProcessList() to load a small number of sources, or to add sources to an existing batch load.
-
loader.BufferSource() and loader.ProcessBuffer() to load a string as a source. You can, of course, specify a local or global variable that contains the string.
You can also load sources as virtual sources using loader.ProcessVirtualList() or loader.ProcessVirtualBuffer(), as described in Loading a Virtual Source.
Loading Files
The following executable example performs a batch load of the source files in the Windows directory dirpath that have the extensions .txt or .log.
DomainCreateOrOpen
SET dname="mydomain"
IF (##class(%iKnow.Domain).NameIndexExists(dname))
{ SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
GOTO DeleteOldData }
ELSE
{ SET domoref=##class(%iKnow.Domain).%New(dname)
DO domoref.%Save()
GOTO SetEnvironment }
DeleteOldData
SET stat=domoref.DropData()
IF stat { GOTO SetEnvironment }
ELSE { WRITE "DropData error ",$System.Status.DisplayError(stat)
QUIT}
SetEnvironment
SET domId=domoref.Id
IF ##class(%iKnow.Configuration).Exists("myconfig") {
SET cfg=##class(%iKnow.Configuration).Open("myconfig") }
ELSE { SET cfg=##class(%iKnow.Configuration).%New("myconfig",0,$LISTBUILD("en"),"",1)
DO cfg.%Save() }
CreateListerAndLoader
SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
DO flister.Init("myconfig","","","","")
SET myloader=##class(%iKnow.Source.Loader).%New(domId)
UseListerAndLoader
SET install=$SYSTEM.Util.DataDirectory()
SET dirpath=install_"mgr\Temp\iris\mytextfiles"
SET stat=flister.AddListToBatch(dirpath,$LB("txt","log"),0,"")
WRITE "The lister status is ",$System.Status.DisplayError(stat),!
SET stat=myloader.ProcessBatch()
WRITE "The loader status is ",$System.Status.DisplayError(stat),!
QueryLoadedSources
WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," sources"
This example performs a batch load, appropriate for loading a large number of files. To load a small number of files use the SetLister() and ProcessList() methods.
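For comparison, the list-loading alternative might look like the following sketch. It assumes the flister, myloader, and dirpath variables from the preceding example are still defined, and it passes ProcessList() the same Lister parameters that AddListToBatch() received.

```objectscript
 // Assumes flister, myloader, and dirpath are defined as in the batch example
 SET stat=myloader.SetLister(flister)
 IF stat'=1 { DO $SYSTEM.Status.DisplayError(stat) QUIT }
 // ProcessList() takes the Lister parameters directly; no AddListToBatch() call is needed
 SET stat=myloader.ProcessList(dirpath,$LB("txt","log"),0,"")
 IF stat'=1 { DO $SYSTEM.Status.DisplayError(stat) QUIT }
 WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," sources"
```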
Loading SQL Records
Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.
The following executable example performs a batch load of the first 25 records of the Aviation.Event table. It loads as a source text the NarrativeFull field value for each record. You can specify a source text field of data type %String or %Stream.GlobalCharacter (character stream data). If there is an error in the SQL query, the Loader returns an error status.
NLP programs that load SQL data must use the %iKnow.Source.SQL.Lister. This lister always invokes the %iKnow.Source.Temp.Processor, which takes no parameters. There is, therefore, no reason to specify the processor, unless you have created your own custom processor.
DomainCreateOrOpen
SET dname="mydomain"
IF (##class(%iKnow.Domain).NameIndexExists(dname))
{ WRITE "The ",dname," domain already exists",!
SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
GOTO DeleteOldData }
ELSE
{ WRITE "The ",dname," domain does not exist",!
SET domoref=##class(%iKnow.Domain).%New(dname)
DO domoref.%Save()
WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
GOTO SetEnvironment }
DeleteOldData
SET stat=domoref.DropData()
IF stat { WRITE "Deleted the data from the ",dname," domain",!!
GOTO SetEnvironment }
ELSE { WRITE "DropData error ",$System.Status.DisplayError(stat)
QUIT}
SetEnvironment
SET domId=domoref.Id
IF ##class(%iKnow.Configuration).Exists("myconfig") {
SET cfg=##class(%iKnow.Configuration).Open("myconfig") }
ELSE { SET cfg=##class(%iKnow.Configuration).%New("myconfig",0,$LISTBUILD("en"),"",1)
DO cfg.%Save() }
CreateListerAndLoader
SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
DO flister.Init("myconfig")
SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull,EventDate FROM Aviation.Event"
SET idfld="UniqueVal"
SET grpfld="Type"
SET dataflds=$LB("NarrativeFull")
UseLister
SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
SET stat=myloader.ProcessBatch()
IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
QueryLoadedSources
WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," sources loaded"
This example performs a batch load, appropriate for loading a large number of SQL records. To load a small number of SQL records use the SetLister() and ProcessList() methods.
You can also use the %SYSTEM.iKnow utility method IndexTable().
Loading Elements of a Subscripted Global
Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.
The following executable example loads the elements of a subscripted global. It uses the %iKnow.Source.Global.Lister and specifies the following Lister parameters to the ProcessList() method: global name, first subscript (inclusive), and last subscript (inclusive). This example uses the ^Aviation.AircraftD global. Because this is a sparse array, only a few of the subscripts between 1 and 50,000 contain data:
DomainCreateOrOpen
SET dname="mydomain"
IF (##class(%iKnow.Domain).NameIndexExists(dname))
{ SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
GOTO DeleteOldData }
ELSE
{ SET domoref=##class(%iKnow.Domain).%New(dname)
DO domoref.%Save()
GOTO SetEnvironment }
DeleteOldData
SET stat=domoref.DropData()
IF stat { GOTO SetEnvironment }
ELSE { WRITE "DropData error ",$System.Status.DisplayError(stat)
QUIT}
SetEnvironment
SET domId=domoref.Id
IF ##class(%iKnow.Configuration).Exists("myconfig") {
SET cfg=##class(%iKnow.Configuration).Open("myconfig") }
ELSE { SET cfg=##class(%iKnow.Configuration).%New("myconfig",0,$LISTBUILD("en"),"",1)
DO cfg.%Save() }
ListerAndLoader
SET mylister=##class(%iKnow.Source.Global.Lister).%New(domId)
DO mylister.Init("myconfig","","","","")
SET myloader=##class(%iKnow.Source.Loader).%New(domId)
SET stat=myloader.SetLister(mylister)
IF stat '= 1 { WRITE "SetLister error ",$System.Status.DisplayError(stat)
QUIT}
SET gbl="^Aviation.AircraftD"
SET stat=myloader.ProcessList(gbl,1,50000)
IF stat '= 1 { WRITE "ProcessList error ",$System.Status.DisplayError(stat)
QUIT }
SourceSentenceQueries
SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
WRITE "The domain contains ",numSrcD," sources",!
SET numSentD=##class(%iKnow.Queries.SentenceAPI).GetCountByDomain(domId)
WRITE "These sources contain ",numSentD," sentences"
The ProcessList() method can specify only one subscript level at a time. In order to iterate through multiple subscript levels, you must write code to invoke this method at the desired subscript level. For example, to load the second level subscripts 1 and 2, you would write code such as the following:
FOR i=1:1:90000 {
SET gbl="^Aviation.NarrativeS("_i_")"
SET stat=myloader.ProcessList(gbl,1,2) }
This loads globals such as ^Aviation.NarrativeS(85879,1) and ^Aviation.NarrativeS(85879,2).
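Because the first-level subscripts are sparse, counting through every integer visits many nodes that do not exist. One alternative, sketched here under the assumption that the first-level subscripts are numeric and that myloader already has a Global Lister assigned through SetLister(), is to walk only the existing subscripts with $ORDER:

```objectscript
 // Assumes myloader has a %iKnow.Source.Global.Lister set via SetLister().
 // Assumes first-level subscripts are numeric, so no quoting is needed in the reference.
 SET i=""
 FOR { SET i=$ORDER(^Aviation.NarrativeS(i)) QUIT:i=""
   SET gbl="^Aviation.NarrativeS("_i_")"
   // Load second-level subscripts 1 and 2 under this first-level node
   SET stat=myloader.ProcessList(gbl,1,2)
   IF stat'=1 { DO $SYSTEM.Status.DisplayError(stat) QUIT }
 }
```

The loop stops at the first load error; whether to stop or continue on error is a design choice that depends on the application.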
Loading a String
The following executable example loads the value of a single global node (or a string literal) as a source. Note that no Lister is required when loading a string. You can specify the Configuration to apply in the ProcessBuffer() method.
ConfigurationCreateOrOpen
IF ##class(%iKnow.Configuration).Exists("EnFr") {
SET cfg=##class(%iKnow.Configuration).Open("EnFr") }
ELSE { SET cfg=##class(%iKnow.Configuration).%New("EnFr",1,$LB("en","fr"))
DO cfg.%Save() }
DomainCreateOrOpen
SET dname="mydomain"
IF (##class(%iKnow.Domain).NameIndexExists(dname))
{ WRITE "The ",dname," domain already exists",!
SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
GOTO DeleteOldData }
ELSE
{ WRITE "The ",dname," domain does not exist",!
SET domoref=##class(%iKnow.Domain).%New(dname)
DO domoref.%Save()
SET domId=domoref.Id
WRITE "Created the ",dname," domain with domain ID ",domId,!
GOTO CreateLoader }
DeleteOldData
SET stat=domoref.DropData()
IF stat { WRITE "Deleted the data from the ",dname," domain",!!
SET domId=domoref.Id
GOTO CreateLoader }
ELSE { WRITE "DropData error ",$System.Status.DisplayError(stat)
QUIT}
CreateLoader
SET myloader=##class(%iKnow.Source.Loader).%New(domId)
UseLoader
SET ^a="I drove at 70mph then sped up to 100mph when the light changed."
DO myloader.BufferSource("ref",^a)
DO myloader.ProcessBuffer("EnFr")
QuerySources
WRITE "number of sources:",##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
The first argument of the BufferSource() method specifies a unique external source Id. The following example creates a separate source for each global subscript:
SET i=1
WHILE $DATA(^a(i)) {
DO myloader.BufferSource("ref"_i,^a(i))
DO myloader.ProcessBuffer()
SET i=i+1 }
WRITE "end of data"
You can also use the %SYSTEM.iKnow utility method IndexString().
Updating the Domain Contents
After you have performed an initial load of sources to a domain, you can change this list of sources by adding sources or by deleting sources. Updating a domain refers to responding to changes in the set of source texts. This should not be confused with upgrading a domain, which refers to responding to changes in the NLP software, commonly after installing a significant new version of InterSystems IRIS.
Adding Sources
After you have performed an initial load of sources to a domain (using the AddListToBatch() and ProcessBatch() methods) you may want to add more files to the list of sources. This is done using the SetLister() and ProcessList() methods. The ProcessList() method takes the same parameters as the AddListToBatch() method.
-
To add one source at a time: SET stat=myloader.ProcessList("C:\mytextfiles\newfile.txt")
-
To add a directory of sources: SET stat=myloader.ProcessList("C:\mytextfiles\logfiles",$LB("log"),0,"")
Adding more sources to a batch load is shown in the following example:
DomainCreateOrOpen
SET dname="mydomain"
IF (##class(%iKnow.Domain).NameIndexExists(dname))
{ SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
GOTO DeleteOldData }
ELSE { SET domoref=##class(%iKnow.Domain).%New(dname)
DO domoref.%Save()
GOTO SetEnvironment }
DeleteOldData
SET stat=domoref.DropData()
IF stat { GOTO SetEnvironment }
ELSE { WRITE "DropData error ",$System.Status.DisplayError(stat)
QUIT}
SetEnvironment
SET domId=domoref.Id
IF ##class(%iKnow.Configuration).Exists("myconfig") {
SET cfg=##class(%iKnow.Configuration).Open("myconfig") }
ELSE { SET cfg=##class(%iKnow.Configuration).%New("myconfig",0,$LISTBUILD("en"),"",1)
DO cfg.%Save() }
ListerAndLoader
SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
DO flister.Init("myconfig","","","","")
SET myloader=##class(%iKnow.Source.Loader).%New(domId)
SET stat=myloader.SetLister(flister)
SourceBatchLoad
SET install=$SYSTEM.Util.DataDirectory()
SET dirpath=install_"mgr\Temp\iris\mytextfiles"
SET stat=flister.AddListToBatch(dirpath,$LB("txt"),0,"")
SET stat=myloader.ProcessBatch()
IF stat '= 1 { WRITE "Loader error ",$System.Status.DisplayError(stat)
QUIT }
QueryLoadedSources
WRITE "Source count is ",##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId),!
ExpandListofSources
SET elister=##class(%iKnow.Source.File.Lister).%New(domId)
DO elister.Init("myconfig")
SET stat=myloader.SetLister(elister)
SET addpath=install_"dev\IRIS"
SET stat=myloader.ProcessList(addpath,$LB("txt"),1,"")
IF stat '= 1 { WRITE "The ProcessList loader status is ",$System.Status.DisplayError(stat)
QUIT }
QueryTotalSources
WRITE "Expanded source count is ",##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
You can also use the %SYSTEM.iKnow utility methods IndexFile() and IndexDirectory().
Deleting Sources
You can remove a source that has been loaded to a domain using the DeleteSource() method. This method cannot be used to delete a virtual source; a separate DeleteVirtualSource() method is provided for this purpose. Both methods are found in the %SYSTEM.iKnow class.
Loading a Virtual Source
A virtual source is a source that is not static. You might, for example, use a virtual source for a file that is being frequently modified. The srcId of a virtual source is a negative integer. The external Id of a virtual source begins with the ListerReference (the Lister class alias), commonly :TEMP.
Adding a virtual source does not update NLP statistics. For this reason, using a virtual source may be desirable when you wish to temporarily add sources for a specific purpose without incurring the overhead of revising the domain statistics. You should use a virtual source when adding a source that is being continuously modified, such as a source in the process of being written. Because the virtual source Id is a negative number, it is easy to distinguish virtual sources from regular sources. Different methods are used to delete virtual sources and regular sources.
You can load virtual sources using loader.SetLister() and loader.ProcessVirtualList() or loader.BufferSource() and loader.ProcessVirtualBuffer(). The following program loads a virtual source using ProcessVirtualBuffer().
DomainCreateOrOpen
SET dname="mydomain"
IF (##class(%iKnow.Domain).NameIndexExists(dname))
{ WRITE "The ",dname," domain already exists",!
SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
GOTO DeleteOldData }
ELSE
{ WRITE "The ",dname," domain does not exist",!
SET domoref=##class(%iKnow.Domain).%New(dname)
DO domoref.%Save()
SET domId=domoref.Id
WRITE "Created the ",dname," domain with domain ID ",domId,!
GOTO SetEnvironment }
DeleteOldData /* This DOES NOT delete virtual sources */
SET stat=domoref.DropData()
IF stat { WRITE "Deleted the data from the ",dname," domain",!!
SET domId=domoref.Id
GOTO SetEnvironment }
ELSE { WRITE "DropData error ",$System.Status.DisplayError(stat)
QUIT}
SetEnvironment
SET config="VSConfig"
IF ##class(%iKnow.Configuration).Exists(config) {
SET cfg=##class(%iKnow.Configuration).Open(config) }
ELSE { SET cfg=##class(%iKnow.Configuration).%New(config,1)
DO cfg.%Save() }
CreateLoader
SET myloader=##class(%iKnow.Source.Loader).%New(domId)
VirtualSource
SET node="",(total,status)=0
FOR { SET node=$ORDER(^VendorData(node),1,data) QUIT:node=""
SET company=$LIST(data,1) QUIT:company=""
SET address=$LTS($LIST(data,2))
SET total=total+1
SET status=myloader.BufferSource("SourceTest"_total,company)
SET status=myloader.BufferSource("SourceTest"_total,address)
}
SET status=myloader.ProcessVirtualBuffer(config)
SET vsrclist=myloader.GetSourceIds()
FOR i=1:1:$LL(vsrclist) {
SET srcid=-$LIST(vsrclist,i)
WRITE "External Id=",##class(%iKnow.Queries.SourceAPI).GetExternalId(domId,srcid)
WRITE " Source Id=",srcid,!
WRITE " Sentence Count=",##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,$lb(srcid)),!
}
Note that the %iKnow.Queries.SourceAPI.GetCountByDomain() method does not count virtual sources. You can determine if a virtual source has been loaded by invoking %iKnow.Queries.SourceAPI.GetExternalId(domId,-1). Here -1 is the srcId of the first virtual source loaded.
By default, many NLP queries process only ordinary sources and ignore virtual sources. To use these queries to process a virtual source you must specify a vSrcId parameter value for the query method.
Deleting a Virtual Source
The %iKnow.Source.Loader class provides two methods for deleting virtual sources.
-
DeleteVirtualSource() deletes a single virtual source indexed for a domain. You specify the domain Id (a positive integer) and the virtual source Id (a negative integer). This deletes all NLP entities generated for this source text.
-
DeleteAllVirtualSources() deletes all of the virtual sources indexed for a specified domain. This deletes all NLP entities generated for these source texts.
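A minimal sketch of both deletion calls follows. It assumes that domId is the domain Id, that a virtual source with Id -1 exists, and that both methods can be invoked as class methods returning a status value; the invocation form is an assumption, so confirm it against the class reference.

```objectscript
 // Assumes domId is the domain Id and that virtual source -1 has been loaded.
 // Assumption: both methods are class methods that return a %Status.
 SET stat=##class(%iKnow.Source.Loader).DeleteVirtualSource(domId,-1)
 IF stat'=1 { DO $SYSTEM.Status.DisplayError(stat) }
 // Remove every remaining virtual source in the domain
 SET stat=##class(%iKnow.Source.Loader).DeleteAllVirtualSources(domId)
 IF stat'=1 { DO $SYSTEM.Status.DisplayError(stat) }
```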
Copying and Re-indexing Loaded Source Data
After you have successfully loaded sources into a domain, you may wish to copy some or all of these sources to another domain. When NLP copies these loaded sources it also re-indexes them. The copied sources therefore have different source Ids and entity Ids; the external Ids are not changed.
Some reasons you might want to copy/re-index from one domain to another:
-
To create a copy of a domain. You may wish to make a backup copy, or to create a copy to serve as a snapshot of the domain at a particular time. For example, when indexing RSS feeds you may wish to create a snapshot because these feeds change over time; at a future date you might no longer have access to the original source data.
-
To create a domain containing a subset of the original set of sources. The new domain can be smaller, more efficient, and easier to work with. You can specify this copied subset of sources by a list of source Ids to copy, or by a filter that limits which sources to copy. For example, you could create a domain consisting of only the newest sources, which you could then query without having to filter by date for each query.
-
To create a domain containing the merged sets of sources from two domains, or to add sources from one domain into a domain that already contains sources.
-
To re-index the sources in a domain after extreme modification of the set of sources. For example, if you very frequently add or delete multiple sources in a domain, the indexing may no longer be optimal. (Normal adding and deleting of sources does not degrade index performance.) By copying the domain, you re-index the current sources that you are copying, making the indexing in the new domain optimal.
-
To apply NLP language model revisions. Release versions of NLP commonly contain improvements to its language models. These may include introduction of support for new languages and improvements to already-supported languages. Copying the set of sources in a domain re-indexes these sources, and therefore applies the most current NLP language models to the copied sources.
You use the %iKnow.Source.Domain.Lister class to copy/re-index from one domain to another. The new domain must already be defined before you can create a Lister instance for this class using the %New() method. Both domains must be in the same namespace.
The following example populates the firstdomain domain, then copies the contents of firstdomain to an empty domain named newdomain, automatically re-indexing the newdomain contents:
EstablishAndPopulateFirstDomain
SET domOref=##class(%iKnow.Domain).%New("firstdomain")
DO domOref.%Save()
SET domId=domOref.Id
SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
SET myloader=##class(%iKnow.Source.Loader).%New(domId)
SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull,EventDate FROM Aviation.Event"
SET idfld="UniqueVal"
SET grpfld="Type"
SET dataflds=$LB("NarrativeFull")
SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
IF stat '= 1 {WRITE "The lister failed: " DO $System.Status.DisplayError(stat) QUIT }
SET stat=myloader.ProcessBatch()
IF stat '= 1 {WRITE "The loader failed: " DO $System.Status.DisplayError(stat) QUIT }
TestQueryFirstDomain
WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," sources in the from domain",!
CreateSecondDomain
SET domOref=##class(%iKnow.Domain).%New("newdomain")
DO domOref.%Save()
SET domNewId=domOref.Id
CopyAndReindexFromFirstDomainToSecondDomain
SET newlister=##class(%iKnow.Source.Domain.Lister).%New(domNewId)
SET newloader=##class(%iKnow.Source.Loader).%New(domNewId)
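SET stat=newlister.AddListToBatch(domId)
IF stat '= 1 {WRITE "The copy lister failed: " DO $System.Status.DisplayError(stat) QUIT }
SET stat=newloader.ProcessBatch()
IF stat '= 1 {WRITE "The copy loader failed: " DO $System.Status.DisplayError(stat) QUIT }
TestQuerySecondDomain
WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domNewId)," sources in the to domain",!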
CleanUpForNextTime
SET stat=##class(%iKnow.Domain).%DeleteId(domId)
IF stat '= 1 {WRITE "Domain delete error: " DO $System.Status.DisplayError(stat) }
SET stat=##class(%iKnow.Domain).%DeleteId(domNewId)
IF stat '= 1 {WRITE "Domain delete error: " DO $System.Status.DisplayError(stat) }
The AddListToBatch() method accepts an optional second lister parameter that specifies which sources are to be copied. This parameter can be either a comma-separated list of source Ids (integers) or a filter. The following example is identical to the previous example, except that it limits which sources are copied by specifying a comma-separated list of source Ids.
EstablishAndPopulateFirstDomain
SET domOref=##class(%iKnow.Domain).%New("firstdomain")
DO domOref.%Save()
SET domId=domOref.Id
SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
SET myloader=##class(%iKnow.Source.Loader).%New(domId)
SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull,EventDate FROM Aviation.Event"
SET idfld="UniqueVal"
SET grpfld="Type"
SET dataflds=$LB("NarrativeFull")
SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
IF stat '= 1 {WRITE "The lister failed: " DO $System.Status.DisplayError(stat) QUIT }
SET stat=myloader.ProcessBatch()
IF stat '= 1 {WRITE "The loader failed: " DO $System.Status.DisplayError(stat) QUIT }
TestQueryFirstDomain
WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," sources in the from domain",!
CreateSecondDomain
SET domOref=##class(%iKnow.Domain).%New("newdomain")
DO domOref.%Save()
SET domNewId=domOref.Id
SubsetOfSourcesToCopy
SET subset="1,3,5,7,9,11,13,15,17,19"
CopyAndReindexFromFirstDomainToSecondDomain
SET newlister=##class(%iKnow.Source.Domain.Lister).%New(domNewId)
SET newloader=##class(%iKnow.Source.Loader).%New(domNewId)
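SET stat=newlister.AddListToBatch(domId,subset)
IF stat '= 1 {WRITE "The copy lister failed: " DO $System.Status.DisplayError(stat) QUIT }
SET stat=newloader.ProcessBatch()
IF stat '= 1 {WRITE "The copy loader failed: " DO $System.Status.DisplayError(stat) QUIT }
TestQuerySecondDomain
WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domNewId)," sources in the to domain",!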
CleanUpForNextTime
SET stat=##class(%iKnow.Domain).%DeleteId(domId)
IF stat '= 1 {WRITE "Domain delete error: " DO $System.Status.DisplayError(stat) }
SET stat=##class(%iKnow.Domain).%DeleteId(domNewId)
IF stat '= 1 {WRITE "Domain delete error: " DO $System.Status.DisplayError(stat) }
UserDictionary and Copied Sources
A UserDictionary is applied when a source is listed. Therefore, any UserDictionary modifications made to the initial loaded sources will appear in the copied sources. However, because the copy operation is also a list operation, you can also apply a new UserDictionary to modify the sources as they are copied.
For example, suppose the UserDictionary used when the sources were originally listed substitutes “Doctor” for the abbreviation “Dr.”; this substitution is present in the loaded sources and therefore also in the copied sources. Later, you modify the UserDictionary to also substitute “doctor” for “physician”. This change to your UserDictionary has no effect on the already-loaded sources. When you copy the sources, you apply the revised UserDictionary. The “Dr.” to “Doctor” substitution is not performed again, because it has already been applied to the initial loaded sources; the “physician” to “doctor” substitution is performed on the copied sources.
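The scenario above might be sketched as follows. This is a hedged illustration only: the names "RevisedUserDict" and "CopyConfig" are hypothetical, and the code assumes that domId and domNewId hold the Ids of the populated source domain and the empty target domain, as in the copy examples earlier in this section. It also assumes that a UserDictionary is attached to a lister through a named %iKnow.Configuration and the lister SetConfig() method; confirm the exact signatures in the class reference for your version.

```objectscript
  // Hypothetical names, used for illustration only
  SET udict=##class(%iKnow.UserDictionary).%New("RevisedUserDict")
  DO udict.%Save()
  DO udict.AddEntry("Dr.","Doctor")        // original substitution
  DO udict.AddEntry("physician","doctor")  // substitution added later
  // Configuration that refers to the UserDictionary by name
  SET cfg=##class(%iKnow.Configuration).%New("CopyConfig",0,$LB("en"),"RevisedUserDict")
  DO cfg.%Save()
  // Copy the sources from domId into domNewId, applying the configuration
  SET newlister=##class(%iKnow.Source.Domain.Lister).%New(domNewId)
  SET newloader=##class(%iKnow.Source.Loader).%New(domNewId)
  SET stat=newlister.SetConfig("CopyConfig")
  IF stat '= 1 {WRITE "SetConfig failed: " DO $System.Status.DisplayError(stat) QUIT }
  SET stat=newlister.AddListToBatch(domId)
  IF stat '= 1 {WRITE "The copy lister failed: " DO $System.Status.DisplayError(stat) QUIT }
  SET stat=newloader.ProcessBatch()
  IF stat '= 1 {WRITE "The copy loader failed: " DO $System.Status.DisplayError(stat) QUIT }
```

Because the configuration and the UserDictionary are stored by name, a real routine would also delete them (or reuse existing ones) before running this code a second time.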