The %Text.Text data type class implements the methods used by InterSystems IRIS for full text indexing, text search, similarity scoring,
automatic classification, dictionary management, word stemming, n-gram key creation, and noise word filtering.
Usage
Creating a Text Property and a Full-Text Index
To create a %Text property and an index that supports Boolean queries, declare the property using the
%Library.Text class and create a full-text index on the property
specifying (KEYS) in the ON clause as shown below
PROPERTY myDocument As %Text (MAXLEN = 256, LANGUAGECLASS = "%Text.English");
INDEX myIndex ON myDocument(KEYS) [ TYPE=BITMAP ];
Set the LANGUAGECLASS property parameter to the name of the appropriate language-specific subclass
of the %Text.Text class, such as %Text.English, %Text.French,
%Text.Spanish, %Text.Italian, %Text.Portuguese, or %Text.Japanese.
Text indexes can be very large and expensive to update, but there are several ways to
reduce the size of the index without compromising (and often improving) search quality:
MINWORDLEN discards all terms that are fewer than this number of characters
FILTERNOISEWORDS=1 enables common-word filtering, in combination with calling the ExcludeCommonTerms() class method. Calling ExcludeCommonTerms() with
an argument of 175 causes the 175 most common words and two-word combinations to be ignored, resulting in a very substantial
reduction of index size (also see Dictionary Management, below)
STEMMING conflates multiple forms of a word to a common "stem". For example, in
English, the common word endings -s, -ing, -ed, (etc.) may be removed so that the various word forms
can all match against each other.
The %CONTAINS Operator
With the declarations above, the following SQL query could be issued to find all documents containing both the
terms "InterSystems" and "IRIS":
SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('InterSystems', 'IRIS')
The %CONTAINS operator matches on complete terms, based on a language-specific tokenization
of the text into words; therefore, unlike the behavior of the "[" operator, "Intersystems" would not match
"IntersystemsCorp". Consider the following similar query:
SELECT myDocument FROM table t WHERE myDocument [ 'InterSystems' AND myDocument [ 'IRIS'
Both queries above would produce similar results in most cases, but the %CONTAINS operator can fully exploit the full text
index, whereas the "[" operator matches character sequences rather than terms, is case sensitive, and
will not exploit the full text index.
The %CONTAINS operator may also be used to search for multi-word phrases, such as in the following query:
SELECT myDocument FROM table t WHERE
myDocument %CONTAINS ('New Guinea') OR myDocument %CONTAINS ('West Africa')
If the class parameter NGRAMLEN=2 or more, then the 2-word terms 'New Guinea' and
'West Africa' will be stored in the index. If NGRAMLEN=1, then the text index
will still be used to find all documents that contain "Guinea" or "Africa", but the query cannot be fully
satisfied from the index and must be completed by searching through the original data using the case-sensitive "[" operator.
The next query illustrates the use of the STEMMING parameter. The language-specific
subclasses of the %Text.Text class each strip off common word endings to put each term into a standard form.
SELECT myDocument FROM table t WHERE
myDocument %CONTAINS ('jumping')
The query above may succeed on any document that contains the word "jump", "jumping", "jumps" or "jumped",
depending on the stemming algorithm. Note that if NGRAMLEN=1, then the query:
SELECT myDocument FROM table t WHERE
myDocument %CONTAINS ('jumping through hoops')
will succeed only if the document contains the 3 word phrase exactly as specified, whereas if NGRAMLEN=3,
then the query could also match "jump through hoops", because 3-word phrases are considered single terms and are subject to stemming.
Additional flexibility beyond what is available from the %CONTAINS operator can be obtained by
using the FOR SOME %ELEMENT predicate. For example, wildcarding can be specified if STEMMING=0
and can optionally be combined with other WHERE clause predicates as follows:
SELECT myDocument FROM table t WHERE
FOR SOME %ELEMENT(myDocument) (%KEY LIKE 'myo%opy')
AND myDocument %CONTAINS ('heart')
The %SIMILARITY Operator
Many text-search applications require the ability to rank the results of a Boolean query
by their relevance to a set of related terms. InterSystems IRIS supports this capability with
the %SIMILARITY SQL extension The following example finds all documents containing the
terms 'InterSystems' and 'IRIS', and then ranks them in descending order of their
similarity to any or all of the terms 'InterSystems IRIS Queue Messaging':
SELECT myDocument FROM table t
WHERE myDocument %CONTAINS ('InterSystems', 'IRIS')
ORDER BY %SIMILARITY (myDocument, 'InterSystems IRIS Queue Messaging') DESC
If the optional SIMILARITYINDEX
parameter is defined for a property, then the %SIMILARITY operator will be implemented by
calling the SimilarityIdx() class method; otherwise the %SIMILARITY operator
will call the Similarity() class method. If the SIMILARITYIDX property parameter
is defined, then it must specify the name of an index. The index on-clause and data-clause
must follow whatever restrictions are imposed by the SimilarityIdx() class method.
%Text uses a state of the art similarity algorithm based on the Okapi BM25 term weighting
strategy and the cosine similarity metric. If desired, you can adjust the Okapi BM25 model
parameters OKAPIBM25B, OKAPIBM25K1, and
OKAPIBM25K3 to fine-tune the ranking algorithm when there is a mixture of
large and small documents that need to be ranked. Alternatively, you may override the default
similarity algorithms with your own algorithms and/or special index structures.
The second operand to %SIMILARITY may be any text-valued expression, so to find documents that
contain both the terms "InterSystems" and "IRIS", but to rank the documents based on
references to "integration", "platform", or "integration platform", the following query could
be used:
SELECT myDocument FROM table t
WHERE myDocument %CONTAINS ('InterSystems', 'IRIS')
ORDER BY %SIMILARITY (myDocument, 'Integration platform') DESC
Note that if similarity matching is not required by an application, then it may be preferrable to
use a bitmap full text indexe rather than an ordinary full text index, particularly if noise word
filtering is not used.
Dictionary Management
Just as the %CONTAINS operator may be used without an index or without %SIMILARITY ranking,
%SIMILARITY ranking can be used without dictionary support; however, a critically important
aspect of similarity ranking is the ability to assess the information content of different
words. For example, the word "the" has low utility as a search term, whereas the word "London"
is much more specific and useful as a search term.
To reduce the size of the index, and to enable the similarity algorithm to more easily ignore
words with low information content, you will usually want to call the ExcludeCommonTerms()
class method to specify noise words for the current dictionary. By calling ExcludeCommonTerms with
argument n and setting the class parameter FILTERNOISEWORDS=1, the n most
common words and 2-word combinations in the current language will be ignored. For English text,
the most common 100 words represent about 50% of all word occurrences.
Each language-specific subclass of the %Text.Text class is associated with a particular DICTIONARY
identifier, so by default English words go into a different dictionary than French words, and so on; however,
you can also create multiple dictionaries for each language. For example, it may be useful to have a separate
dictionary for email than for legal briefs, because words that are common in one domain may be uncommon and
useful in another domain.
To collect statistics about the frequency of different terms, call the AddDocToDictionary()
class method. Since words that were rare yesterday are likely to be rare tomorrow (except in special applications
like news feeds), the dictionary can be populated initially and then updated as an infrequent database maintenance operation
(to rebuild the dictionary on a monthly or quarterly schedule, for example). For example, the following loop
drops the current dictionary, then repopulates it:
do ##class(%Text.English).DropDictionary()
do ##class(%Text.English).ExcludeCommonTerms(175)
&sql(DECLARE C CURSOR FOR SELECT myDocument, category INTO :myDoc, :category FROM myTable T)
&sql(OPEN C) QUIT:SQLCODE<0 SQLCODE
for { &sql(FETCH C) QUIT:SQLCODE=100 do ##class(%Text.English).AddDocToDictionary(myDoc, category)
&sql(CLOSE C)
Note: the second argument to the AddDocToDictionary() class method is discussed in
the next section on Automatic Classification.
You can find relevant documents more easily by specifying a dictionary-specific thesaurus. If
the class parameter THESAURUS=1, then terms in each document and in each
%CONTAINS predicate are replaced by the standard term in the thesaurus. The API for adding or
removing a term from the English language thesaurus is:
do ##class(%Text.English).AddToThesaurus(term, standardTerm)
do ##class(%Text.English).RemoveFromThesaurus(term)
In addition you can initialize a Thesaurus from a text file. For English, a predefined text
file contains translations for both irregular verbs and commonly misspelled words. You can
load the text file by invoking the LoadThesaurus class method:
do ##class(%Text.English).LoadThesaurus("EnglishThesaurus.txt")
Automatic Classification
The example above not only repopulates the English dictionary, it also associates a category with
each document. For example, if myDocument is an email, then category might be "junk" or "normal", or if
myDocument is a problem report, then category might be the name of the person who resolved the problem.
Classifying documents in this fashion makes it possible to automatically classify new and unseen documents
into one of the known categories based on the similarity of the previously unseen document with the documents
in each category. The Classify() computes the probability that a given document belongs to
each of the known categories, and returns a $list of the n most likely categories, in decreasing order of
probability.
A more whimsical (but hopefully interesting) example that illustrates the potential power
of automatic classification would be to evaluate the true authorship of a document. A few
literary scholars have speculated that some of the famous later works attributed to
William Shakespeare were actually authored by Christopher Marlowe. Marlowe and Shakespeare
attended the same school, and probably knew each other in England before Marlowe was forced to
flee in secrecy and live in hiding in Italy. The theory is that Marlowe continued to publish his
works in England through Shakespeare. If the theory is true, then The
Merchant of Venice is among the works most likely to have been written by Marlowe since Marlowe lived
in Italy, and Shakespeare is not known to have ever visited Italy.
This question could be researched by calling AddDocToDictionary()
to gather statistics about each passage in each work attributed to Marlowe to Marlowe, and each passage
of each early work attributed to Shakespeare (up to the time of Marlowe's departure to Italy) to
Shakespeare. The Classify class method could then directly estimate whether
each passage of The Merchant of Venice is more similar to early works
attributed to Shakespeare than to works attributed to Marlowe.
CASEINSENSITIVE=1 causes comparisons to be performed by %CONTAINS
in a case-insensitive manner when the collation of the underlying property is case
insensitive. Setting CASEINSENSITIVE=1 improves
matching and typically reduces both the size of the index and index update time.
Note that CASEINSENSITIVE is not applicable to the %CONTAINSTERM operator,
since %CONTAINSTERM always compares terms using the collation of the specified property.
parameter DICTIONARY = 1;
The default dictionary for properties of this class. By overriding the
DICTIONARY you can create separate dictionaries for different kinds
of properties in the same language. For example, email documents, legal briefs, and
medical records might each have a separate dictionary so that term frequency and document
similarity can be appropriately estimated in each separate domain.
parameter FILTERNOISEWORDS = 1;
FILTERNOISEWORDS controls whether common-word filtering is enabled.
Specifying a list of noise words can greatly reduce the size of a text index and the associated
index update time; however, to perform text search it is necessary to also remove noise words
from the search pattern, and this can produce some counter-intuitive results. See example below.
Setting up noise word filtering is
a two-step process: First enable noise word filtering by setting FILTERNOISEWORDS=1. Second,
populate the noise word dictionary by calling the ExcludeCommonTerms()
with the desired number of noise words to populate the corresponding DICTIONARY. ExcludeCommonTerms
purges the previous set of noise words, so it may be called any number of times, but it is necessary
to rebuild all text indexes on the corresponding properties whenever the list of noise words is changed.
Note: The SQL predicate:
SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('to be or not to be')
will not find any qualifying rows if 'to, be, or, not' are all noise words; however, if any
of these terms are not noise words, then only the non-noise words will participate in the matching
process.
parameter IGNOREMARKUP = 0;
IGNOREMARKUP is a Boolean (0/1) flag. If equal to 1, then all content
between '<' and '>' will be ignored. Note that the text must be properly escaped in order to
pass literal '<' and '>' characters when IGNOREMARKUP=1.
parameter MAXLEN;
By default, there is no default MAXLEN; that is, it must be specified wherever a %Text.Text
property is declared. This behavior may be overridden by specifying MAXLEN as a positive
integer in the %Library.Text class and optionally also in the %Text.Text
class.
parameter MAXOCCURS = 5;
Text search applications sometimes need to highlight the matching terms found
in a document. The array returned by BuildValueArray makes this possible by
encoding the character offset of each occurrence of each term within a document,
along with the number of occurrences of each term. Since the number of occurrences
has no upper limit and you may want to store the occurrence list in an index, the
MAXOCCURS parameter imposes an upper bound on the number of
character positions that will be retained.
The first ..#MAXOCCURS-1 positions, the last position,
and the total count of occurrences are returned in the %value portion of
the valueArray in the format: count ^ pos1 ^ deltaPos2 ^ deltaPos3... ^ deltaPosN-1^ posN, where the
separator "^" is defined as the "metachar", and may be redefined if necessary.
The "deltaPos" are delta-compressed positions, so the first and last positions are
simple character offsets into the document. The second position can be recovered by
summing pos1+deltaPos2, the third by summing pos1+deltaPos2+deltaPos3, and so on.
parameter MAXWORDLEN = 128;
MAXWORDLEN specifies the maximum word length that will be retained. See also MINWORDLEN
parameter MINWORDLEN = 3;
MINWORDLEN specifies the minimum length word that will be retained
excluding ngram words and post-stemmed words.
MINWORDLEN provides a simple means of excluding terms based on their
length, since it is usually the case that short words such as 'a', 'to', 'an', etc., are
connectives that contain little information content. The length refers to the number of
characters in the original document. Note that if stemming or thesaurus translation is
enabled, then the length of the term in a text index may have fewer than MINWORDLEN
characters.
Note: MINWORDLEN should typically be set to 3 or less when STEMMING=1,
since otherwise a word stem could be classified as a noise word even though alternate forms of the
word would not be classified as a noise word. For example, with MINWORDLEN=5 "jump" would be discarded
as a noise word, whereas "jumps" would not.
parameter NGRAMLEN = 1;
NGRAMLEN is the maximum number of words that will be regarded as a single
search term. When NGRAMLEN=2, two-word combinations will be added to any
index, in addition to single words. Consecutive words exclude noise words.
parameter NOISEBIGRAMS100;
parameter NOISEBIGRAMS200;
parameter NOISEBIGRAMS300;
parameter NOISEWORDS100;
NOISEWORDSnnn lists the most common words in the language, in order of their frequency of occurrence.
See http://www.ranks.nl/stopwords/ for a list of commonly used noise words for many different languages.
parameter NOISEWORDS200;
parameter NOISEWORDS300;
parameter NUMCHARS = .-,;
NUMCHARS specifies the characters other than digits that may appear
in a number. Note that if "," is included in NUMCHARS, then "1,000" will be considered a
single number, but the comma will be removed so that "1,000" will match "1000" using the
%CONTAINS SQL predicate. The characters "." and "-" are also special and mark the beginning of
a numeric term when the next character is numeric, regardless of how NUMCHARS is defined.
parameter NUMERIC = 1;
NUMERIC specifies whether numeric terms will be retained(1) or ignored(0).
Languages such as Japanese require the raw document text to be parsed and
separated into words before being processed by the class methods.
If SEPARATEWORDS=1 then call the SeparateWords() class method.
parameter SOURCELANGUAGE;
SOURCELANGUAGEUAGE specifies the default source language to translate
documents or queries from. This enables documents written and stored in multiple langauges to
be queried in a single common language.
parameter STEMMING = 1;
STEMMING replaces each word by its language-specific stem to improve the
matching quality. Note that stemmed words are modified, and may or may
not correspond to real words in the language. If stemming is enabled, then
search patterns must also be stemmed prior to searching.
Note: Stemming of search
strings is performed automatically by the %CONTAINS SQL predicate if stemming is enabled
on the corresponding property; however, stemming is not automatically performed by
the more primitive FOR SOME %ELEMENT SQL predicate.
parameter TARGETLANGUAGE;
TARGETLANGUAGE specifies the default target language to translate
documents or queries to. This enables documents written and stored in multiple langauges to
be queried in a single common language. See also TARGETLANGUAGECLASS.
To find the list of values
parameter TARGETLANGUAGECLASS;
TARGETLANGUAGECLASS specifies the class to use when TARGETLANGUAGE
has been specified as a non-null value. For example, if TARGETLANGUAGE="fr", then by default the
TARGETLANGUAGECLASS would be "%Text.French", but if you extend the %Text.French class and also want to also
use it as a target class, then you need to override TARGETLANGUAGECLASS in every class that is
referenced by a LANGUAGECLASS.
parameter THESAURUS = 0;
THESAURUS specifies that a language-specific thesaurus is to be used in place of,
or in addition to, stemming. If an unstemmed term is found in the thesaurus, then
the term in the thesaurus is used, otherwise if stemming is enabled then the term
is first stemmed, and then the thesaurus is searched again for the stemmed term. If
the term or stemmed term is found in the thesaurus, then the thesaurus term is used,
otherwise the term or stemmed term is used.
parameter WORDCHARS;
WORDCHARS specifies the characters other than alphabetic that may
appear in a word. For example, to regard hyphenated words as terms, include "-" in WORDCHARS.
Note that characters that are not numbers or words are ignored for the purpose of comparison
with the %CONTAINS operator, therefore the search pattern "off-hand" will match "off hand"
if WORDCHARS="", but not if WORDCHARS="-"; conversely, "off-hand" will match "offhand" if
WORDCHARS="-", but not if WORDCHARS="".
Methods
classmethod AddDocToDictionary(document As %String, category As %String = "") as %Status [ Language = objectscript ]
Add words of the specified document to the ^%SYSDict global. Optionally, classify the document
as being in the specified category so that other documents may be automatically classified.
The ..#DICTIONARY is used as the first subscript to ^%SYSDict to enable classification to be
carried out in both a language-specific and an application specific way. For example, a subclass
of the %Text.English class could be defined for English email, with a unique DICTIONARY value. The
dictionary for this sub-language of English could inherit the English stemmer, but could have its
own list of noise words, its own domain-specific word frequencies,
and possibly its own BuildValueArray that encodes words in the Subject/From/To/Body differently
from each other. Email identified as belonging to the "junk mail" category could then be used to
help automatically classify incoming mail as "junk mail".
classmethod AddToDictionary(word As %String, wordType As %Integer = 1, category As %String = "", wCount As %Integer = 1) as %Status [ Language = objectscript ]
Add the specified word or phrase to the current dictionary. Optionally a repetition count
and a category may be specified.
classmethod AddToThesaurus(term As %String, standardTerm As %String) as %Status [ Language = objectscript ]
classmethod BuildValueArray(document As %Binary, ByRef valueArray As %Binary) as %Status [ Language = objectscript ]
The BuildValueArray() method tokenizes a text string into a collection of
terms (words or phrases), computes statistics (count and positions) of each term, and stores
the result as valueArray(term)=statistics.
The statistics include the term count in $p(statistics,"#",1), and optionally include the character
positions where the term appears in the document in subsequent #-delimited positions, where
"#" is a non-word meta-character that may be redefined by an application if necessary.
Three special values are also returned in the valueArray:
valueArray("#doclen") holds the number of non-noise terms in the document
valueArray("#norm") holds a statistic needed by the cosine metric (see SimilarityIdx())
valueArray holds the number of distinct terms in the document (the number of terms)
classmethod ChooseSearchKey(document As %String) as %String [ Language = objectscript ]
If we must choose exactly one indexable search string from a pattern that
has more than ..#NGRAMLEN terms, then choose a multi-term pattern that
occurs in at least 3 documents, if any; otherwise just select the longest term.
classmethod Classify(document As %String, topN As %Integer = 1, maxDocFreq=.20) as %List [ Language = objectscript ]
Classify document into one of the known categories using a semi-naive Bayesian classification algorithm.
A list of lists is returned, with each sublist containing the (category, score). The score is the
ln(probability) of generating the document, given the category, divided by the (unknown) probability of
generating a document of the given length, which is assumed to be constant for all document lengths.
For background information (not used in this implementation), see www-2.cs.cmu.edu/~mccallum/bow/
Also see Dr. Dobb's Journal, May 2005
A basic explanation of Bayes' Rule is as follows:
Naive Bayes assumes a particular generative model for text documents. Assumptions built into
the model are that (a) the data are produced by a mixture model, (b) there is a one-to-one correspondence
between mixture components and classes, (c) the probability that any given word appears in a document is
conditionally independent of the probability of appearance of any other word, and (d) the probability that
document Di is associated with class Cj is independent of the length of the document.
Under these assumptions, the probability that document Di could be generated by parameters T is given
by p(Di | T) = sum(p(Cj | T) * p(Di | Cj ; T),j=1:|C|), and
p(Di | Cj ; T) = p(|Di|) * product( p(word(Di,k) | Cj ; T),k=1:|Di|)
Thus the parameters of an individual mixture component are a multinomial distribution over words, i.e. the
collection of word probabilities. Since the model assumes that document length is identically distributed
for all classes, it does not need to be parameterized to classify a document.
Learning a Naive Bayes classifier consists of estimating the parameters of the generative model by using a
set of pre-classified training samples. The goal of the training procedure is to determine the parameters T
that maximize p(T | class(Di) = Cj), i=1:|D|, j=1:|C|).
p(Document) is unknown, but since it is independent of category we can ignore it for the purpose of computing
a relative p(Category|Document) score. p(Category) is the number of words in all documents in the specified
category divided by the total number of words in all documents. p(Document|Category) is defined as the product
of the probabilities of the individual words in that document. p(Word) is the count of each word in the category
divided by the count of that word in the corpus. We make a log transformation and compute the sum of the logs
of the ratios instead of computing the product of the ratios themselves.
The resulting p(Document|Category) * p(Category) can then be compared across all categories to identify the
category with maximum score, and hence the maximum p(Category|Document). This is the predicted category.
Note that the use of ..#NGRAMLEN>1 invalidates the mathematical justification for using Bayesian probabilities;
however, biasing the probability score in favor of documents that match multi-word combinations is justifiable
because it partially addresses the absence of the joint probability information that is the main deficiency of
the naive Bayesian algorithm; therefore when ..#NGRAMLEN>1, we call this a "semi-naive" Bayesian classifier.
classmethod CreateQList(document As %String, coll As %String) as %List [ Language = objectscript ]
classmethod DecompressOffsets(compressed As %String) as %String [ Language = objectscript ]
Converts the offsets from compressed to uncompressed form
classmethod DropDictionary() [ Language = objectscript ]
Deletes all of the words, noisewords, etc. from the current dictionary. Dictionaries
other than the current dictionary are not affected.
classmethod EndOfWord(document As %String, cs As %Integer) as %Integer [ Language = objectscript ]
classmethod ExcludeCommonTerms(nTerms) as %Status [ Language = objectscript ]
Classifies the most common nTerms words in the current language as noise words. The words specified
in NOISEWORDS100, NOISEWORDS200, and NOISEWORDS300,
list the most common 300 words of the current language, in order of their frequency. Similarly, NOISEBIGRAMSn00 lists
the most common 300 bigrams of the current language that would not typically be considered useful for searching.
classmethod LoadThesaurus(pathname As %String) as %Status [ Language = objectscript ]
classmethod MakeSearchTerms(searchPattern As %String, ngramlen As %Integer = 0) as %List [ Language = objectscript ]
Convert a string into a list of search terms, such that each search term contains no
noise words and has at most NGRAMLEN words per search term. Use this method to convert
a search pattern into a list of search patterns that can be passed to %CONTAINSTERM.
Note that if noise word filtering is enabled, noise words will be removed.
classmethod RemoveDocFromDictionary(document As %String, category As %String = "") as %Status [ Language = objectscript ]
classmethod RemoveFromThesaurus(term As %String) as %Status [ Language = objectscript ]
classmethod SeparateWords(rawText As %String) as %String [ Language = objectscript ]
Separates individual terms with whitespace, for languages such as Japanese.
classmethod Similarity(document As %String, qList As %List) as %Numeric [ Language = objectscript ]
classmethod SimilarityIdx(ID As %String, textIndex As %String, qList As %List) as %Numeric [ Language = objectscript ]
This feature is not available prior to version 2007.1
An index that supports both Boolean queries (the %CONTAINS operator) and ranking queries (the %SIMILARITY operator)
may be created by removing TYPE=BITMAP and by specifying "[ DATA = myDocument(ELEMENTS) ]". If such an index is
created, then also specify the name of the index in the SIMILARITYINDEX parameter of the corresponding property as follows:
PROPERTY myDocument As %Text (MAXLEN = 256, LANGUAGECLASS = "%Text.English", SIMILARITYINDEX="bigIndex");
INDEX bigIndex ON myDocument(KEYS) [ DATA=myDocument(ELEMENTS) ];
This method computes a score that relates the similarity of a query document
to a reference document. Many similarity heuristics have been proposed,
and have been shown to be effective on real data sets. A variation of one
effective and commonly used statistic is the cosine measure:
SUM(w(q,t)*w(d,t)) for t in both q and d
C(q,d) = -------------------------------------
SQRT(SUM(w(d,t)^2)*SQRT(SUM(w(q,t)^2) for all t
The weights w(d,t) and w(q,t) are Okapi BM25 weights, calculated as follows:
w(d,t) = dtf / (dtf + sizeAdj)
dtf = term frequency in document
sizeAdj = k1*((1-b) + (b * doclen/avgdoclen))
b = .75, k1 = 2
w(q,t) = qtf * IDF(N,f(t))
qtf = term frequency in query
IDF(N,df) = (ln(N/df)+1) / (ln(N)+1)
N = the number of documents classified
df = document frequency, or #documents containing term
OkapiBM25 = SUM(QTF*ln(IDF)*DTF)
Where:
- IDF = (N-n+.5)/(n+.5)
- DTF = (k1+1)*tf/((k1*sizeAdjD)+tf)
- tf = frequency of occurrences of the term in the document
- sizeAdjD = (1-b) + b*doclen/avgdoclen
- QTF = (k3+1)*qtf/(k3+qtf)
- qtf = frequency of occurrences of the term in the query
- doclen = document length
- avgdoclen = average document length
- N = is the number of documents in the collection
- n = is the number of documents containing the word
- k1 = 1.2
- b = 0.75 or 0.25 (recommend .75 for full text and .25 for shorter representations)
- k3 = 7, set to 7 or 1000, controls the effect of the query term frequency on the weight.
classmethod Standardize(document As %String, origtext As %Boolean = 0) as %String [ Language = objectscript ]
Returns the specified string in standardized form, that is: stemmed, filtered, translated,
space separated, with a leading and trailing space.
classmethod Translate(text As %String = "", langpair As %String = "", ByRef translated As %String) as %Status [ Language = objectscript ]