%Text.Text

datatype class %Text.Text extends %Library.String

ODBC Type: VARCHAR

The %Text.Text data type class implements the methods used by InterSystems IRIS for full text indexing, text search, similarity scoring, automatic classification, dictionary management, word stemming, n-gram key creation, and noise word filtering.

Usage

Creating a Text Property and a Full-Text Index

To create a %Text property and an index that supports Boolean queries, declare the property using the %Library.Text class and create a full-text index on the property, specifying (KEYS) in the ON clause as shown below:

PROPERTY myDocument As %Text (MAXLEN = 256, LANGUAGECLASS = "%Text.English");
INDEX myIndex ON myDocument(KEYS) [ TYPE=BITMAP ];

Set the LANGUAGECLASS property parameter to the name of the appropriate language-specific subclass of the %Text.Text class, such as %Text.English, %Text.French, %Text.Spanish, %Text.Italian, %Text.Portuguese, or %Text.Japanese.

Text indexes can be very large and expensive to update, but there are several ways to reduce the size of the index without compromising (and often improving) search quality:

MINWORDLEN discards all terms that have fewer than this number of characters.
FILTERNOISEWORDS=1 enables common-word filtering, in combination with calling the ExcludeCommonTerms() class method. Calling ExcludeCommonTerms() with an argument of 175 causes the 175 most common words and two-word combinations to be ignored, resulting in a very substantial reduction of index size (see also Dictionary Management, below).
STEMMING conflates multiple forms of a word to a common "stem". For example, in English, the common word endings -s, -ing, -ed, and so on may be removed so that the various word forms can all match against each other.

The %CONTAINS Operator

With the declarations above, the following SQL query could be issued to find all documents containing both the terms "InterSystems" and "IRIS":

SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('InterSystems', 'IRIS')

The %CONTAINS operator matches on complete terms, based on a language-specific tokenization of the text into words; therefore, unlike the behavior of the "[" operator, "InterSystems" would not match "InterSystemsCorp". Consider the following similar query:

SELECT myDocument FROM table t WHERE myDocument [ 'InterSystems' AND myDocument [ 'IRIS'

Both queries above would produce similar results in most cases, but the %CONTAINS operator can fully exploit the full text index, whereas the "[" operator matches character sequences rather than terms, is case sensitive, and will not exploit the full text index.

The %CONTAINS operator may also be used to search for multi-word phrases, such as in the following query:

SELECT myDocument FROM table t WHERE 
    myDocument %CONTAINS ('New Guinea') OR myDocument %CONTAINS ('West Africa')

If the class parameter NGRAMLEN=2 or more, then the 2-word terms 'New Guinea' and 'West Africa' will be stored in the index. If NGRAMLEN=1, then the text index will still be used to find all documents that contain "Guinea" or "Africa", but the query cannot be fully satisfied from the index and must be completed by searching through the original data using the case-sensitive "[" operator.

The next query illustrates the use of the STEMMING parameter. The language-specific subclasses of the %Text.Text class each strip off common word endings to put each term into a standard form.

SELECT myDocument FROM table t WHERE 
    myDocument %CONTAINS ('jumping')

The query above may succeed on any document that contains the word "jump", "jumping", "jumps" or "jumped", depending on the stemming algorithm. Note that if NGRAMLEN=1, then the query:

SELECT myDocument FROM table t WHERE
    myDocument %CONTAINS ('jumping through hoops')

will succeed only if the document contains the 3 word phrase exactly as specified, whereas if NGRAMLEN=3, then the query could also match "jump through hoops", because 3-word phrases are considered single terms and are subject to stemming.

Additional flexibility beyond what is available from the %CONTAINS operator can be obtained by using the FOR SOME %ELEMENT predicate. For example, wildcarding can be specified if STEMMING=0 and can optionally be combined with other WHERE clause predicates as follows:

SELECT myDocument FROM table t WHERE
    FOR SOME %ELEMENT(myDocument) (%KEY LIKE 'myo%opy')
    AND myDocument %CONTAINS ('heart')

The %SIMILARITY Operator

Many text-search applications require the ability to rank the results of a Boolean query by their relevance to a set of related terms. InterSystems IRIS supports this capability with the %SIMILARITY SQL extension. The following example finds all documents containing the terms 'InterSystems' and 'IRIS', and then ranks them in descending order of their similarity to any or all of the terms 'InterSystems IRIS Queue Messaging':

SELECT myDocument FROM table t 
    WHERE myDocument %CONTAINS ('InterSystems', 'IRIS')
    ORDER BY %SIMILARITY (myDocument, 'InterSystems IRIS Queue Messaging') DESC

If the optional SIMILARITYINDEX parameter is defined for a property, then the %SIMILARITY operator will be implemented by calling the SimilarityIdx() class method; otherwise the %SIMILARITY operator will call the Similarity() class method. If the SIMILARITYINDEX property parameter is defined, then it must specify the name of an index. The index on-clause and data-clause must follow whatever restrictions are imposed by the SimilarityIdx() class method.

%Text uses a state-of-the-art similarity algorithm based on the Okapi BM25 term weighting strategy and the cosine similarity metric. If desired, you can adjust the Okapi BM25 model parameters OKAPIBM25B, OKAPIBM25K1, and OKAPIBM25K3 to fine-tune the ranking algorithm when there is a mixture of large and small documents that need to be ranked. Alternatively, you may override the default similarity algorithms with your own algorithms and/or special index structures.
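
This reference does not show how the BM25 weights are computed internally. As a rough illustration of what the three tuning parameters control, the following Python sketch computes the textbook Okapi BM25 weight for a single term. The function name and its arguments are hypothetical; only the default values of b, k1, and k3 are taken from the OKAPIBM25B, OKAPIBM25K1, and OKAPIBM25K3 defaults listed under Parameters. The cosine normalization step is not included.

```python
import math

def bm25_weight(tf, qtf, df, n_docs, doc_len, avg_doc_len,
                b=0.2, k1=2.0, k3=7.0):
    """Textbook Okapi BM25 weight of one term (illustrative only).

    tf  = occurrences of the term in the document
    qtf = occurrences of the term in the query text
    df  = number of documents that contain the term
    """
    # Rare terms (small df) carry more information than common ones.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly long documents are penalized;
    # k1 controls how quickly repeated occurrences saturate.
    doc_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
    # k3 plays the same saturation role for repeats within the query.
    query_part = ((k3 + 1) * qtf) / (k3 + qtf)
    return idf * doc_part * query_part
```

In this formulation, raising b toward 1 penalizes long documents more heavily, which is the kind of adjustment that matters when large and small documents are ranked together.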

The second operand to %SIMILARITY may be any text-valued expression, so to find documents that contain both the terms "InterSystems" and "IRIS", but to rank the documents based on references to "integration", "platform", or "integration platform", the following query could be used:

SELECT myDocument FROM table t 
    WHERE myDocument %CONTAINS ('InterSystems', 'IRIS')
    ORDER BY %SIMILARITY (myDocument, 'Integration platform') DESC

Note that if similarity matching is not required by an application, then it may be preferable to use a bitmap full text index rather than an ordinary full text index, particularly if noise word filtering is not used.

Dictionary Management

Just as the %CONTAINS operator may be used without an index or without %SIMILARITY ranking, %SIMILARITY ranking can be used without dictionary support; however, a critically important aspect of similarity ranking is the ability to assess the information content of different words. For example, the word "the" has low utility as a search term, whereas the word "London" is much more specific and useful as a search term.

To reduce the size of the index, and to enable the similarity algorithm to more easily ignore words with low information content, you will usually want to call the ExcludeCommonTerms() class method to specify noise words for the current dictionary. By calling ExcludeCommonTerms with argument n and setting the class parameter FILTERNOISEWORDS=1, the n most common words and 2-word combinations in the current language will be ignored. For English text, the most common 100 words represent about 50% of all word occurrences.

Each language-specific subclass of the %Text.Text class is associated with a particular DICTIONARY identifier, so by default English words go into a different dictionary than French words, and so on; however, you can also create multiple dictionaries for each language. For example, it may be useful to have a separate dictionary for email than for legal briefs, because words that are common in one domain may be uncommon and useful in another domain.

To collect statistics about the frequency of different terms, call the AddDocToDictionary() class method. Since words that were rare yesterday are likely to be rare tomorrow (except in special applications like news feeds), the dictionary can be populated initially and then updated as an infrequent database maintenance operation (to rebuild the dictionary on a monthly or quarterly schedule, for example). For example, the following loop drops the current dictionary, then repopulates it:

  // Drop the current English dictionary, then reset the noise word list
  do ##class(%Text.English).DropDictionary()
  do ##class(%Text.English).ExcludeCommonTerms(175)
  &sql(DECLARE C CURSOR FOR SELECT myDocument, category INTO :myDoc, :category FROM myTable T)
  &sql(OPEN C) QUIT:SQLCODE<0 SQLCODE
  // Add each document, with its category, to the dictionary until no rows remain
  for { &sql(FETCH C) QUIT:SQLCODE=100  do ##class(%Text.English).AddDocToDictionary(myDoc, category) }
  &sql(CLOSE C)

Note: the second argument to the AddDocToDictionary() class method is discussed in the next section on Automatic Classification.

You can find relevant documents more easily by specifying a dictionary-specific thesaurus. If the class parameter THESAURUS=1, then terms in each document and in each %CONTAINS predicate are replaced by the standard term in the thesaurus. The API for adding or removing a term from the English language thesaurus is:

  do ##class(%Text.English).AddToThesaurus(term, standardTerm)
  do ##class(%Text.English).RemoveFromThesaurus(term)

In addition you can initialize a Thesaurus from a text file. For English, a predefined text file contains translations for both irregular verbs and commonly misspelled words. You can load the text file by invoking the LoadThesaurus class method:

  do ##class(%Text.English).LoadThesaurus("EnglishThesaurus.txt")

Automatic Classification

The example above not only repopulates the English dictionary, it also associates a category with each document. For example, if myDocument is an email, then category might be "junk" or "normal", or if myDocument is a problem report, then category might be the name of the person who resolved the problem. Classifying documents in this fashion makes it possible to automatically classify new and unseen documents into one of the known categories based on the similarity of the previously unseen document with the documents in each category. The Classify() class method computes the probability that a given document belongs to each of the known categories, and returns a $list of the n most likely categories, in decreasing order of probability.

A more whimsical (but hopefully interesting) example that illustrates the potential power of automatic classification would be to evaluate the true authorship of a document. A few literary scholars have speculated that some of the famous later works attributed to William Shakespeare were actually authored by Christopher Marlowe. Marlowe and Shakespeare attended the same school, and probably knew each other in England before Marlowe was forced to flee in secrecy and live in hiding in Italy. The theory is that Marlowe continued to publish his works in England through Shakespeare. If the theory is true, then The Merchant of Venice is among the works most likely to have been written by Marlowe since Marlowe lived in Italy, and Shakespeare is not known to have ever visited Italy. This question could be researched by calling AddDocToDictionary() to gather statistics about each passage, assigning each passage of each work attributed to Marlowe to the category "Marlowe", and each passage of each early work attributed to Shakespeare (up to the time of Marlowe's departure to Italy) to the category "Shakespeare". The Classify() class method could then directly estimate whether each passage of The Merchant of Venice is more similar to early works attributed to Shakespeare than to works attributed to Marlowe.

Parameters

parameter CASEINSENSITIVE = 1;

CASEINSENSITIVE=1 causes comparisons to be performed by %CONTAINS in a case-insensitive manner when the collation of the underlying property is case insensitive. Setting CASEINSENSITIVE=1 improves matching and typically reduces both the size of the index and index update time. Note that CASEINSENSITIVE is not applicable to the %CONTAINSTERM operator, since %CONTAINSTERM always compares terms using the collation of the specified property.

parameter DICTIONARY = 1;

The default dictionary for properties of this class. By overriding DICTIONARY, you can create separate dictionaries for different kinds of properties in the same language. For example, email documents, legal briefs, and medical records might each have a separate dictionary so that term frequency and document similarity can be appropriately estimated in each separate domain.

parameter FILTERNOISEWORDS = 1;

FILTERNOISEWORDS controls whether common-word filtering is enabled. Specifying a list of noise words can greatly reduce the size of a text index and the associated index update time; however, to perform text search it is necessary to also remove noise words from the search pattern, and this can produce some counter-intuitive results. See example below.

Setting up noise word filtering is a two-step process. First, enable noise word filtering by setting FILTERNOISEWORDS=1. Second, populate the noise word list of the corresponding DICTIONARY by calling the ExcludeCommonTerms() class method with the desired number of noise words. ExcludeCommonTerms() purges the previous set of noise words, so it may be called any number of times, but it is necessary to rebuild all text indexes on the corresponding properties whenever the list of noise words is changed.

Note: The SQL predicate:

SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('to be or not to be')

will not find any qualifying rows if 'to, be, or, not' are all noise words; however, if any of these terms are not noise words, then only the non-noise words will participate in the matching process.
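
The effect of noise word filtering on a search pattern can be pictured with a small Python sketch. It is only an illustration of the behavior described above, not the InterSystems implementation; the function names and the whitespace tokenization are assumptions.

```python
def effective_search_terms(pattern, noise_words):
    """Return the words of a search pattern that survive noise word filtering."""
    return [w for w in pattern.lower().split() if w not in noise_words]

def can_match(pattern, noise_words):
    # If every word is filtered out, nothing is left to look up in the index,
    # which mirrors the "no qualifying rows" outcome described above.
    return len(effective_search_terms(pattern, noise_words)) > 0
```

With the noise words {"to", "be", "or", "not"}, the pattern 'to be or not to be' leaves no terms to search for. If only "to" and "or" are noise words, the remaining words "be", "not", and "be" take part in matching.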

parameter IGNOREMARKUP = 0;

IGNOREMARKUP is a Boolean (0/1) flag. If equal to 1, then all content between '<' and '>' will be ignored. Note that the text must be properly escaped in order to pass literal '<' and '>' characters when IGNOREMARKUP=1.

parameter MAXLEN;

There is no default MAXLEN; that is, it must be specified wherever a %Text.Text property is declared. This behavior may be overridden by specifying MAXLEN as a positive integer in the %Library.Text class and optionally also in the %Text.Text class.

parameter MAXOCCURS = 5;

Text search applications sometimes need to highlight the matching terms found in a document. The array returned by BuildValueArray makes this possible by encoding the character offset of each occurrence of each term within a document, along with the number of occurrences of each term. Since the number of occurrences has no upper limit and you may want to store the occurrence list in an index, the MAXOCCURS parameter imposes an upper bound on the number of character positions that will be retained.

The first ..#MAXOCCURS-1 positions, the last position, and the total count of occurrences are returned in the %value portion of the valueArray in the format: count ^ pos1 ^ deltaPos2 ^ deltaPos3... ^ deltaPosN-1^ posN, where the separator "^" is defined as the "metachar", and may be redefined if necessary. The "deltaPos" are delta-compressed positions, so the first and last positions are simple character offsets into the document. The second position can be recovered by summing pos1+deltaPos2, the third by summing pos1+deltaPos2+deltaPos3, and so on.
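
The delta-compressed format can be decoded as in the following Python sketch. It is an illustration of the layout described above, not the DecompressOffsets() class method itself. The function name is hypothetical, and the sketch assumes that whenever two or more positions are stored, the final field is an absolute offset.

```python
def decode_positions(value, metachar="^"):
    """Decode 'count ^ pos1 ^ deltaPos2 ... ^ posN' into (count, positions)."""
    fields = [int(f) for f in value.split(metachar)]
    count, encoded = fields[0], fields[1:]
    if len(encoded) <= 1:
        # Zero or one stored position: nothing to decompress.
        return count, encoded
    positions = [encoded[0]]            # pos1 is an absolute offset
    for delta in encoded[1:-1]:         # middle fields are deltas
        positions.append(positions[-1] + delta)
    positions.append(encoded[-1])       # assumed: last field is absolute
    return count, positions
```

For example, the value "7^10^5^20^300" decodes to a count of 7 and the positions 10, 15, 35, and 300: the second position is 10+5 and the third is 10+5+20, while the first and last are plain offsets.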

parameter MAXWORDLEN = 128;

MAXWORDLEN specifies the maximum word length that will be retained. See also MINWORDLEN.

parameter MINWORDLEN = 3;

MINWORDLEN specifies the minimum length of a word that will be retained, excluding n-gram words and post-stemmed words. MINWORDLEN provides a simple means of excluding terms based on their length, since it is usually the case that short words such as 'a', 'to', 'an', etc., are connectives that contain little information content. The length refers to the number of characters in the original document. Note that if stemming or thesaurus translation is enabled, then a term in a text index may have fewer than MINWORDLEN characters.

Note: MINWORDLEN should typically be set to 3 or less when STEMMING=1, since otherwise a word stem could be classified as a noise word even though alternate forms of the word would not be classified as a noise word. For example, with MINWORDLEN=5 "jump" would be discarded as a noise word, whereas "jumps" would not.
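
The interaction between MINWORDLEN and stemming can be seen in a short Python sketch. It assumes, as stated above, that the length test applies to the word as it appears in the original document, before stemming. The stemmer here is a deliberately trivial stand-in (it strips a trailing "s" only) and the function names are hypothetical.

```python
def toy_stem(word):
    # Hypothetical stemmer for illustration: strips a single trailing "s".
    return word[:-1] if word.endswith("s") else word

def index_terms(text, minwordlen=3):
    """Apply the length filter to the original words, then stem the survivors."""
    kept = [w for w in text.lower().split() if len(w) >= minwordlen]
    return [toy_stem(w) for w in kept]
```

With minwordlen=5, the text "jump jumps" yields only one term: "jump" is discarded as too short, while "jumps" is retained and indexed as the four-character stem "jump". With minwordlen=3 both occurrences are kept.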

parameter NGRAMLEN = 1;

NGRAMLEN is the maximum number of words that will be regarded as a single search term. When NGRAMLEN=2, two-word combinations will be added to any index, in addition to single words. Consecutive words exclude noise words.
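
As an illustration of n-gram key creation, the following Python sketch produces every term of up to ngramlen words from a piece of text. It is not the BuildValueArray() implementation. The function name is hypothetical, and the sketch assumes one reading of the last sentence above: noise words are dropped first, so the words on either side of a noise word count as consecutive.

```python
def make_terms(text, ngramlen=1, noise_words=frozenset()):
    """Return all terms of 1..ngramlen consecutive non-noise words."""
    words = [w for w in text.lower().split() if w not in noise_words]
    terms = []
    for n in range(1, ngramlen + 1):
        for i in range(len(words) - n + 1):
            terms.append(" ".join(words[i:i + n]))
    return terms
```

With ngramlen=2, the text "West Africa" produces the terms "west", "africa", and "west africa", so a phrase search for 'West Africa' can be answered from the index alone.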

parameter NOISEBIGRAMS100;

parameter NOISEBIGRAMS200;

parameter NOISEBIGRAMS300;

parameter NOISEWORDS100;

NOISEWORDSnnn lists the most common words in the language, in order of their frequency of occurrence. See http://www.ranks.nl/stopwords/ for a list of commonly used noise words for many different languages.

parameter NOISEWORDS200;

parameter NOISEWORDS300;

parameter NUMCHARS = .-,;

NUMCHARS specifies the characters other than digits that may appear in a number. Note that if "," is included in NUMCHARS, then "1,000" will be considered a single number, but the comma will be removed so that "1,000" will match "1000" using the %CONTAINS SQL predicate. The characters "." and "-" are also special and mark the beginning of a numeric term when the next character is numeric, regardless of how NUMCHARS is defined.

parameter NUMERIC = 1;

NUMERIC specifies whether numeric terms will be retained (1) or ignored (0).

parameter OKAPIBM25B = .2;

See SimilarityIdx()

parameter OKAPIBM25K1 = 2;

See SimilarityIdx()

parameter OKAPIBM25K3 = 7;

See SimilarityIdx()

parameter SEPARATEWORDS = 0;

Languages such as Japanese require the raw document text to be parsed and separated into words before being processed by the class methods. If SEPARATEWORDS=1, then the SeparateWords() class method is called to perform this separation.

parameter SOURCELANGUAGE;

SOURCELANGUAGE specifies the default source language to translate documents or queries from. This enables documents written and stored in multiple languages to be queried in a single common language.

parameter STEMMING = 1;

STEMMING replaces each word by its language-specific stem to improve the matching quality. Note that stemmed words are modified, and may or may not correspond to real words in the language. If stemming is enabled, then search patterns must also be stemmed prior to searching.

Note: Stemming of search strings is performed automatically by the %CONTAINS SQL predicate if stemming is enabled on the corresponding property; however, stemming is not automatically performed by the more primitive FOR SOME %ELEMENT SQL predicate.

parameter TARGETLANGUAGE;

TARGETLANGUAGE specifies the default target language to translate documents or queries to. This enables documents written and stored in multiple languages to be queried in a single common language. See also TARGETLANGUAGECLASS. To find the list of values

parameter TARGETLANGUAGECLASS;

TARGETLANGUAGECLASS specifies the class to use when TARGETLANGUAGE has been specified as a non-null value. For example, if TARGETLANGUAGE="fr", then by default the TARGETLANGUAGECLASS would be "%Text.French", but if you extend the %Text.French class and also want to use it as a target class, then you need to override TARGETLANGUAGECLASS in every class that is referenced by a LANGUAGECLASS.

parameter THESAURUS = 0;

THESAURUS specifies that a language-specific thesaurus is to be used in place of, or in addition to, stemming. If an unstemmed term is found in the thesaurus, then the term in the thesaurus is used, otherwise if stemming is enabled then the term is first stemmed, and then the thesaurus is searched again for the stemmed term. If the term or stemmed term is found in the thesaurus, then the thesaurus term is used, otherwise the term or stemmed term is used.
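
The lookup order described above can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, the thesaurus is modeled as a plain dictionary from term to standard term, and the stemmer is passed in as an optional function (None meaning that stemming is disabled).

```python
def normalize(term, thesaurus, stem=None):
    """Resolve a term: thesaurus first, then stem, then thesaurus again."""
    if term in thesaurus:
        return thesaurus[term]          # unstemmed term found in thesaurus
    if stem is not None:
        stemmed = stem(term)
        # Use the thesaurus entry for the stem if there is one,
        # otherwise fall back to the stem itself.
        return thesaurus.get(stemmed, stemmed)
    return term                         # no thesaurus entry, no stemming
```

For example, with the entries {"ran": "run", "recieve": "receive"} and a stemmer that strips a trailing "s", the term "ran" resolves to "run" directly, while "recieves" is first stemmed to "recieve" and then resolved to "receive".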

parameter WORDCHARS;

WORDCHARS specifies the characters other than alphabetic that may appear in a word. For example, to regard hyphenated words as terms, include "-" in WORDCHARS. Note that characters that are not numbers or words are ignored for the purpose of comparison with the %CONTAINS operator, therefore the search pattern "off-hand" will match "off hand" if WORDCHARS="", but not if WORDCHARS="-"; conversely, "off-hand" will match "offhand" if WORDCHARS="-", but not if WORDCHARS="".

Methods

classmethod AddDocToDictionary(document As %String, category As %String = "") as %Status

Add words of the specified document to the ^%SYSDict global. Optionally, classify the document as being in the specified category so that other documents may be automatically classified. The ..#DICTIONARY is used as the first subscript to ^%SYSDict to enable classification to be carried out in both a language-specific and an application-specific way. For example, a subclass of the %Text.English class could be defined for English email, with a unique DICTIONARY value. The dictionary for this sub-language of English could inherit the English stemmer, but could have its own list of noise words, its own domain-specific word frequencies, and possibly its own BuildValueArray() that encodes words in the Subject/From/To/Body differently from each other. Email identified as belonging to the "junk mail" category could then be used to help automatically classify incoming mail as "junk mail".

classmethod AddToDictionary(word As %String, wordType As %Integer = 1, category As %String = "", wCount As %Integer = 1) as %Status

Add the specified word or phrase to the current dictionary. Optionally a repetition count and a category may be specified.

classmethod AddToThesaurus(term As %String, standardTerm As %String) as %Status

classmethod BuildValueArray(document As %Binary, ByRef valueArray As %Binary) as %Status

The BuildValueArray() method tokenizes a text string into a collection of terms (words or phrases), computes statistics (count and positions) of each term, and stores the result as valueArray(term)=statistics.

The statistics include the term count in $p(statistics,"#",1), and optionally include the character positions where the term appears in the document in subsequent #-delimited positions, where "#" is a non-word meta-character that may be redefined by an application if necessary.

Three special values are also returned in the valueArray:

valueArray("#doclen") holds the number of non-noise terms in the document
valueArray("#norm") holds a statistic needed by the cosine metric (see SimilarityIdx())
valueArray holds the number of distinct terms in the document (the number of terms)

classmethod ChooseSearchKey(document As %String) as %String

If we must choose exactly one indexable search string from a pattern that has more than ..#NGRAMLEN terms, then choose a multi-term pattern that occurs in at least 3 documents, if any; otherwise just select the longest term.

classmethod Classify(document As %String, topN As %Integer = 1, maxDocFreq=.20) as %List

Classify document into one of the known categories using a semi-naive Bayesian classification algorithm. A list of lists is returned, with each sublist containing a (category, score) pair. The score is the ln(probability) of generating the document, given the category, divided by the (unknown) probability of generating a document of the given length, which is assumed to be constant for all document lengths.

For background information (not used in this implementation), see www-2.cs.cmu.edu/~mccallum/bow/. Also see Dr. Dobb's Journal, May 2005.

A basic explanation of Bayes' Rule is as follows:

Naive Bayes assumes a particular generative model for text documents. Assumptions built into the model are that (a) the data are produced by a mixture model, (b) there is a one-to-one correspondence between mixture components and classes, (c) the probability that any given word appears in a document is conditionally independent of the probability of appearance of any other word, and (d) the probability that document Di is associated with class Cj is independent of the length of the document.

Thus the parameters of an individual mixture component are a multinomial distribution over words, i.e. the collection of word probabilities. Since the model assumes that document length is identically distributed for all classes, it does not need to be parameterized to classify a document.

Learning a Naive Bayes classifier consists of estimating the parameters of the generative model by using a set of pre-classified training samples. The goal of the training procedure is to determine the parameters T that maximize p(T | class(Di) = Cj, i=1:|D|, j=1:|C|).

	p(Category|Document) = ( p(Document|Category) * p(Category) ) / p(Document)
	         exp(metric) =   p(Document|Category) * p(Category) = Product(p(Word|Category)) * p(Category)
	         exp(metric) = Product(count(word,category)/count(word,corpus)) * (nWordsInCategory / nWordsInDictionary)

p(Document) is unknown, but since it is independent of category we can ignore it for the purpose of computing a relative p(Category|Document) score. p(Category) is the number of words in all documents in the specified category divided by the total number of words in all documents. p(Document|Category) is defined as the product of the probabilities of the individual words in that document. p(Word|Category) is the count of each word in the category divided by the count of that word in the corpus. We make a log transformation and compute the sum of the logs of the ratios instead of computing the product of the ratios themselves.

The resulting p(Document|Category) * p(Category) can then be compared across all categories to identify the category with maximum score, and hence the maximum p(Category|Document). This is the predicted category.

Note that the use of ..#NGRAMLEN>1 invalidates the mathematical justification for using Bayesian probabilities; however, biasing the probability score in favor of documents that match multi-word combinations is justifiable because it partially addresses the absence of the joint probability information that is the main deficiency of the naive Bayesian algorithm; therefore when ..#NGRAMLEN>1, we call this a "semi-naive" Bayesian classifier.
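
The scoring scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Classify() implementation: the class and method names are hypothetical, tokenization is a plain whitespace split, single words only are used (the NGRAMLEN=1 case), and add-one smoothing is introduced here so that a word absent from a category does not produce log(0).

```python
import math
from collections import Counter, defaultdict

class TinyDictionary:
    def __init__(self):
        self.by_category = defaultdict(Counter)  # category -> word counts
        self.corpus = Counter()                  # word -> count in all documents

    def add_doc(self, document, category):
        words = document.lower().split()
        self.by_category[category].update(words)
        self.corpus.update(words)

    def classify(self, document, top_n=1):
        total_words = sum(self.corpus.values())
        scores = []
        for category, counts in self.by_category.items():
            # ln p(Category): share of all words that belong to this category
            score = math.log(sum(counts.values()) / total_words)
            for word in document.lower().split():
                if self.corpus[word] == 0:
                    continue  # unseen word: no evidence for any category
                # ln of count(word,category)/count(word,corpus),
                # with add-one smoothing (an assumption of this sketch)
                score += math.log((counts[word] + 1) / (self.corpus[word] + 1))
            scores.append((category, score))
        # Highest score first; summing logs replaces multiplying ratios.
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:top_n]
```

After adding "cheap pills buy now" under the category "junk" and "meeting agenda for monday" under "normal", classifying "buy cheap pills" ranks "junk" first, because every word of the new document occurs only in that category.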

classmethod CreateQList(document As %String, coll As %String) as %List

Internal method used by the Similarity() and SimilarityIdx() class methods.

classmethod DecompressOffsets(compressed As %String) as %String

Converts the offsets from compressed to uncompressed form

classmethod DropDictionary()

Deletes all of the words, noisewords, etc. from the current dictionary. Dictionaries other than the current dictionary are not affected.

classmethod EndOfWord(document As %String, cs As %Integer) as %Integer

classmethod ExcludeCommonTerms(nTerms) as %Status

Classifies the most common nTerms words in the current language as noise words. The parameters NOISEWORDS100, NOISEWORDS200, and NOISEWORDS300 list the 300 most common words of the current language, in order of their frequency. Similarly, the NOISEBIGRAMSn00 parameters list the 300 most common bigrams of the current language that would not typically be considered useful for searching.

classmethod LoadThesaurus(pathname As %String) as %Status

classmethod MakeSearchTerms(searchPattern As %String, ngramlen As %Integer = 0) as %List

Convert a string into a list of search terms, such that each search term contains no noise words and has at most NGRAMLEN words per search term. Use this method to convert a search pattern into a list of search patterns that can be passed to %CONTAINSTERM. Note that if noise word filtering is enabled, noise words will be removed.

classmethod RemoveDocFromDictionary(document As %String, category As %String = "") as %Status

classmethod RemoveFromThesaurus(term As %String) as %Status

classmethod SeparateWords(rawText As %String) as %String

Separates individual terms with whitespace, for languages such as Japanese.

classmethod Similarity(document As %String, qList As %List) as %Numeric
