You can customize how the NLP semantic analysis engine loads and lists source texts on InterSystems IRIS® data platform. The following types of customization are supported:
Lister. By default, NLP provides several listers that correspond to common sources of data. You can create a custom lister appropriate for your data.
Processor. By default, NLP uses the processor that corresponds to the lister. You can create a custom processor and associate it with your custom lister, or you can associate an existing processor with your custom lister.
Converter. A Converter is an object that can transform complex input text into the plain text expected by the NLP engine. For example, a Converter can extract plain text from a PDF or RTF document, or select specific nodes from an XML document. By default, NLP does not use a converter. You can create a custom converter and specify it in the Lister SetConverter() or Init() method.
A Lister can optionally specify its Configuration, Processor, and Converter. You can specify these using the SetConfig(), SetProcessor(), and SetConverter() methods, or specify all three using the Init() method.
The following example shows how to specify a custom processor and/or a custom converter using the optional Lister Init() method. You can specify an empty string ("") for any of the Init() method parameters to take the default for the specified Lister.
  SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
  DO flister.Init(configuration,myprocessor,processorparams,myconverter,converterparams)
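The same settings can also be applied one at a time with the setter methods named above. The following is a minimal sketch, not a definitive call sequence: the variables domId, myconfig, myprocessor, processorparams, myconverter, and converterparams are assumed to be defined earlier, and the argument lists of the setters are assumed to mirror the corresponding Init() arguments.

```objectscript
  // Assumption: domId and the my* / *params variables are already defined
  SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
  // Each setter overrides one default; omit a call to keep that default
  DO flister.SetConfig(myconfig)
  DO flister.SetProcessor(myprocessor,processorparams)
  DO flister.SetConverter(myconverter,converterparams)
```

Using the individual setters may be convenient when only one of the three settings differs from the Lister defaults.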
NLP provides a base lister class and five subclasses containing listers specific to different types of input sources.
To implement a custom lister, begin with the base Lister class, %iKnow.Source.Lister, and override several of its defaults.
Before a lister can be used, it must specify the format in which the external id for each source is presented. The external id for a lister consists of the lister name and the full reference. The full reference consists of the groupName and the localRef. An external id is shown in the following example:
  :MYLISTER:groupname:localref
In this example, MYLISTER is the lister name alias. If you don't provide an alias, the full classname of the lister class is used. To determine the alias for your lister, use the GetAlias() method.
A lister name alias must be unique within the current namespace. If you specify a lister name alias that already exists, NLP generates a $$$IKListerAliasInUse error.
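As an illustration, an alias might be supplied by overriding GetAlias() in the custom lister class. This is a sketch under stated assumptions: the class name Demo.MyLister is hypothetical, and GetAlias() is assumed to be a class method that returns the alias as a string; check the class reference for the exact declaration.

```objectscript
/// Hypothetical custom lister; only the alias override is shown
Class Demo.MyLister Extends %iKnow.Source.Lister
{

/// Assumed signature: returns the short name used as the first part of each external id
ClassMethod GetAlias() As %String
{
    Quit "MYLISTER"
}

}
```

With this override, external ids produced by the lister would begin with :MYLISTER: rather than with the full class name.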
SplitFullRef() and BuildFullRef()
You must specify a SplitFullRef() instance method for your custom lister. This method is used to extract the groupName and the localRef from the fullRef string. Its results are supplied to the SplitExtId() class method. Assume your lister has an external id format like this: :MYLISTER:groupname:localref.
In this simple example, fullRef consists of the string groupname:localref, so the groupName is $PIECE(fullRef,":",1) and localRef is $PIECE(fullRef,":",2). Note that this is a very simple example; it does not work if the groupName or localRef parts contain ":" characters.
You must specify a BuildFullRef() instance method for your custom lister. This method is used to combine the groupName and the localRef to form the fullRef string. Its results are supplied to the BuildExtId() class method.
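The two methods can be sketched for the simple groupname:localref format as follows. The class name Demo.MyLister is hypothetical, and the parameter lists and return types shown here are assumptions rather than the documented signatures; confirm them, including whether the base class declares each as Method or ClassMethod, in the class reference before overriding.

```objectscript
/// Hypothetical custom lister; only the full-reference methods are shown
Class Demo.MyLister Extends %iKnow.Source.Lister
{

/// Assumed signature: split "groupname:localref" into its two parts
Method SplitFullRef(domainId As %Integer, fullRef As %String, Output groupName As %String, Output localRef As %String) As %Status
{
    // Simple scheme only: breaks if either part contains ":"
    Set groupName = $PIECE(fullRef, ":", 1)
    Set localRef = $PIECE(fullRef, ":", 2)
    Quit $$$OK
}

/// Assumed signature: inverse of SplitFullRef()
Method BuildFullRef(domainId As %Integer, groupName As %String, localRef As %String) As %String
{
    Quit groupName_":"_localRef
}

}
```

Whatever format you choose, the two methods should be exact inverses of each other, so that a fullRef built from a groupName and localRef splits back into the same two values.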
A Processor is a class that takes as input the list populated by the Lister, reads the corresponding source text, and directs the source data to the NLP engine for indexing. It can, optionally, pass this source data through a Converter.
NLP Processors are subclasses of the %iKnow.Source.Processor class. Each processor subclass is designed to read sources of a specific type. For example, %iKnow.Source.File.Processor reads files from a directory.
Every Lister has a default processor that is capable of processing the sources from that lister. By default, it uses a class called Processor in the same package as the Lister. If there is no processor corresponding to the specified lister, or if you wish to use the generic %iKnow.Source.Temp.Processor, you should override the DefaultProcessor() method and specify the desired default processor.
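A DefaultProcessor() override might look like the following sketch. The class name Demo.MyLister is hypothetical, and the method is assumed to be a class method that returns the processor class name as a string.

```objectscript
/// Hypothetical custom lister; only the default-processor override is shown
Class Demo.MyLister Extends %iKnow.Source.Lister
{

/// Assumed signature: returns the class name of the processor to use by default
ClassMethod DefaultProcessor() As %String
{
    Quit "%iKnow.Source.Temp.Processor"
}

}
```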
The ExpandList() method is responsible for listing all sources that need to be indexed. Override this method in your custom Lister subclass to implement the scan of the particular type of source location or structure that the Lister handles. The parameters for this method are the same as those used when invoking the corresponding AddListToBatch() method, and they can differ from one Lister to another. Document the Lister-specific ExpandList() parameters, so that a user knows which parameters to supply to the lister.AddListToBatch() method or the loader.ProcessList() method.
The ExpandList() parameters are as follows (in order):
Path: the location where the sources are located, specified as a string.
Extensions: one or more file extension suffixes that identify which sources are to be listed. Specified as a %List of strings.
Recursive: a boolean value that specifies whether to search subdirectories of the path for sources.
Filter: a string specifying a filter used to limit which sources are to be listed.
For further details, refer to Lister Parameters in the chapter “Loading Text Data into NLP”.
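The following sketch shows how an ExpandList() override might unpack these four parameters. It rests on several assumptions: the class name Demo.MyLister is hypothetical, the parameters are assumed to arrive packed in a single %List argument (here called listparams), and the code that scans the location and registers each source is deliberately left out.

```objectscript
/// Hypothetical custom lister; only the parameter handling is shown
Class Demo.MyLister Extends %iKnow.Source.Lister
{

/// Assumed signature: the AddListToBatch() arguments arrive as one %List
Method ExpandList(listparams As %List) As %Status
{
    // Unpack in the documented order; $LISTGET supplies a default for omitted values
    Set path = $LISTGET(listparams, 1)
    Set extensions = $LISTGET(listparams, 2)
    Set recursive = $LISTGET(listparams, 3, 0)
    Set filter = $LISTGET(listparams, 4)
    If path = "" {
        Return $$$ERROR($$$GeneralError, "Path is required")
    }
    // Scan the location and register each source found (not shown)
    Quit $$$OK
}

}
```

A caller would supply the same values in the same order, for example flister.AddListToBatch(dirpath,$LB("txt"),0,""), where dirpath is a directory path of your choosing.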
A processor can either copy the complete source into a temporary global for NLP processing, or it can store a reference to the source in a temporary global. These temporary globals are used by the NLP engine to index the text and store the results in NLP globals.
If a Lister does not have a corresponding processor, the %iKnow.Source.Temp.Processor is the default processor. It copies the complete text of each source into a temporary global. The other supplied processors store a reference to the source in a temporary global. You can use ..StoreTemp to specify copying the source, or ..StoreRef to specify storing a reference to the source.
While listing sources, the Lister is capable of extracting metadata that should be added to the sources. To let the system know which metadata the Lister will provide, call the ..RegisterMetadataKeys(metaFieldNames) method. The metaFieldNames parameter is a %List containing the keys for the metadata key-value pairs. You can then provide the metadata values by calling the ..SetMetadataValues(ref, metaValues) method. The metaValues parameter is a %List containing the values for the metadata key-value pairs. The values should appear in the same order as the keys are listed.
After establishing the metadata in the Lister, you can access this metadata in your processor by implementing the GetMetadataKeys() method. This method should return a %List of keys from the metadata key-value pairs. In the FetchSource() method, the processor can then set the appropriate values by calling ..SetCurrentMetadataValues(values), where values is a %List of the values of the metadata key-value pairs, in the same order as the keys were reported.
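The Lister side of this key and value ordering can be sketched as follows. The class name Demo.MyLister, the helper methods DeclareMetadata() and AttachMetadata(), and the Author and Year keys are all hypothetical; ref stands for the reference of a source the lister has already listed, and its exact form is not shown here.

```objectscript
/// Hypothetical custom lister; only the metadata calls are shown
Class Demo.MyLister Extends %iKnow.Source.Lister
{

/// Hypothetical helper: announce the metadata keys once, before listing sources
Method DeclareMetadata()
{
    Do ..RegisterMetadataKeys($LISTBUILD("Author", "Year"))
}

/// Hypothetical helper: attach values to one listed source
Method AttachMetadata(ref, author As %String, year As %String)
{
    // Values must follow the same order as the registered keys
    Do ..SetMetadataValues(ref, $LISTBUILD(author, year))
}

}
```

A processor paired with this lister would report the same keys, in the same order, from GetMetadataKeys().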
A Converter converts source text to plain text by removing tags from the source text. Tags are non-content elements used to format the text for display or printing. For example, you might use a converter to remove tags from RTF (Microsoft Rich Text Format) files, or to extract plain text from a PDF file. A converter is invoked by the Lister and applied prior to indexing the source text. Depending on the format of your source documents, the use of a source converter is an optional step.
NLP provides one sample converter, the subclass %iKnow.Source.Converter.Html, which you can use to remove HTML tags from source text. This is a basic HTML converter; you may need to customize your instance of this converter to support full conversion of your HTML source texts.
In order to implement a custom Converter you need to override several methods from the base converter class %iKnow.Source.Converter.
The user-provided %OnNew() callback method is invoked by the %New() method. It takes as its parameter a %List of any parameters that the Converter requires.
The BufferString() method will be called as many times as needed to buffer the complete document into the Converter. Each call will provide a chunk of text by means of the data parameter (max 32K). When no more data is to be buffered, the Convert() method will be called.
The Convert() method is responsible for processing the buffered content and converting the data into plain text (for example, RTF file conversion), or extracting the required data from the buffer (for example, node extraction from XML). The converted or extracted data will need to be buffered, as the converted data can be larger than 32K.
Next Converted Part
The NextConvertedPart() method is called after the Convert() method. This method must return the converted data in chunks of at most 32K. Every time this method is called, it must return the next chunk. If no more data is available, this method should return the empty string ("") to indicate that it has finished returning the converted data.
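The buffer, convert, and return cycle described above can be sketched with a deliberately trivial converter that replaces tab characters with spaces. The class name Demo.TabConverter and its three properties are hypothetical, and the method signatures are assumptions to be checked against the class reference. %OnNew() is omitted because this converter takes no parameters.

```objectscript
/// Hypothetical converter: replaces tabs with spaces, one chunk at a time
Class Demo.TabConverter Extends %iKnow.Source.Converter
{

/// Holds the input chunks, then the converted chunks, as Chunks(1..Count)
Property Chunks As %String [ MultiDimensional ];

Property Count As %Integer [ InitialExpression = 0 ];

/// Index of the chunk that NextConvertedPart() returns next
Property NextOut As %Integer [ InitialExpression = 1 ];

Method BufferString(data As %String) As %Status
{
    // Skip empty chunks: an empty return value later signals the end of the data
    If data = "" {
        Return $$$OK
    }
    Set ..Count = ..Count + 1
    Set ..Chunks(..Count) = data
    Quit $$$OK
}

Method Convert() As %Status
{
    // One-for-one character replacement, so no chunk grows or becomes empty
    For i=1:1:..Count {
        Set ..Chunks(i) = $TRANSLATE(..Chunks(i), $CHAR(9), " ")
    }
    Set ..NextOut = 1
    Quit $$$OK
}

Method NextConvertedPart() As %String
{
    If ..NextOut > ..Count {
        Return ""
    }
    Set part = ..Chunks(..NextOut)
    Set ..NextOut = ..NextOut + 1
    Quit part
}

}
```

Storing the chunks in a multidimensional property keeps each piece within the size in which it arrived. A real converter whose output can grow, or whose conversion spans chunk boundaries, would need to re-chunk its output before returning it.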