Custom Converter
A Converter converts source text to plain text by removing tags from the source text. Tags are non-content elements used to format the text for display or printing. For example, you might use a converter to remove tags from RTF (Microsoft Rich Text Format) files, or to extract plain text from a PDF file. A converter is invoked by the Lister and applied prior to indexing the source text. Using a converter is optional; whether you need one depends on the format of your source documents.
NLP provides one sample converter, the subclass %iKnow.Source.Converter.Html, which you can use to remove HTML tags from source text. This is a basic HTML converter; you may need to customize your instance of this converter to support full conversion of your HTML source texts.
To implement a custom Converter, override the following methods of the base converter class %iKnow.Source.Converter.
%OnNew
The user-provided %OnNew() callback method is invoked by the %New() method. It takes as its parameter a %List of any parameters that the Converter requires.
BufferString
The BufferString() method is called as many times as needed to buffer the complete document into the Converter. Each call supplies one chunk of text (at most 32K) in the data parameter. When there is no more data to buffer, the Convert() method is called.
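The buffering contract described above can be modeled in a few lines. The following is only an illustrative Python sketch of the calling pattern, not the ObjectScript API: the names BufferingSketch, buffer_string, buffered_text, and feed are hypothetical, and MAX_CHUNK assumes that "32K" means 32 * 1024 characters.

```python
MAX_CHUNK = 32 * 1024  # assumption: "32K" read as 32 * 1024 characters


class BufferingSketch:
    """Hypothetical stand-in for the buffering stage of a converter."""

    def __init__(self):
        self._chunks = []

    def buffer_string(self, data):
        # Each call delivers one chunk; keep the chunks in arrival order.
        if len(data) > MAX_CHUNK:
            raise ValueError("chunk exceeds the assumed 32K limit")
        self._chunks.append(data)

    def buffered_text(self):
        # The complete document is the concatenation of all chunks.
        return "".join(self._chunks)


def feed(converter, document):
    """Mimic the caller: pass the document in chunks of at most MAX_CHUNK.

    Returns the number of buffer_string() calls that were made.
    """
    calls = 0
    for start in range(0, len(document), MAX_CHUNK):
        converter.buffer_string(document[start:start + MAX_CHUNK])
        calls += 1
    return calls
```

In this model a document slightly longer than two chunks results in three calls, after which the buffered text equals the original document.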
Convert
The Convert() method is responsible for processing the buffered content, either converting it into plain text (for example, RTF file conversion) or extracting the required data from it (for example, node extraction from XML). The converted or extracted data must itself be buffered, because it can be larger than 32K.
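As a rough illustration of the conversion step, the Python sketch below strips tags from the buffered chunks with a regular expression. It is a deliberately naive model, not the behavior of any NLP converter class: convert_buffer and TAG_PATTERN are hypothetical names, and a real HTML or RTF converter would need proper parsing. The point it shows is that the chunks are joined before conversion, since a tag may be split across two chunks.

```python
import re

# Naive assumption: a tag is anything between "<" and the next ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def convert_buffer(chunks):
    """Join the buffered chunks, then remove tags from the whole text.

    Joining first matters: converting chunk by chunk would miss a tag
    that starts in one chunk and ends in the next.
    """
    source = "".join(chunks)
    return TAG_PATTERN.sub("", source)
```

For example, the chunks "<p>Hello <b" and ">world</b></p>" convert to "Hello world", even though the b tag straddles the chunk boundary. In this sketch the result is simply returned as one string; a converter subject to the 32K limit would hold it in its own buffer for later retrieval.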
NextConvertedPart
The NextConvertedPart() method is called after the Convert() method. It must return the converted data in chunks of at most 32K; each call returns the next chunk. When no more data is available, the method returns the empty string ("") to indicate that all converted data has been retrieved.
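The retrieval loop described above can be sketched as follows. This is an illustrative Python model of the contract only: ConvertedOutputSketch, next_converted_part, and drain are hypothetical names, and MAX_CHUNK again assumes that "32K" means 32 * 1024 characters.

```python
MAX_CHUNK = 32 * 1024  # assumption: "32K" read as 32 * 1024 characters


class ConvertedOutputSketch:
    """Hypothetical holder for converted text that is handed out in parts."""

    def __init__(self, converted_text):
        self._text = converted_text
        self._position = 0

    def next_converted_part(self):
        # Slice out the next chunk and advance; past the end of the
        # text the slice is "", which signals that nothing is left.
        part = self._text[self._position:self._position + MAX_CHUNK]
        self._position += len(part)
        return part


def drain(output):
    """Mimic the caller: request parts until the empty string comes back."""
    parts = []
    while True:
        part = output.next_converted_part()
        if part == "":
            break
        parts.append(part)
    return parts
```

Tracking a position rather than trimming the stored text keeps each call cheap, and any call made after the data is exhausted keeps returning the empty string.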