Note that NLP measures are not stored in the fact table for the cube and are not displayed in the Analyzer. The primary purpose of an NLP measure is to define a Text Analytics domain and to serve as the basis of an NLP dimension. See the next sections.
Alternative Technique: Retrieving Unstructured Text from Elsewhere
In some scenarios, you may need to retrieve the unstructured text from a web page. For example, you might have a table of structured information, with a field that contains the URL where additional information (such as a news article) can be found. In such a case, the easiest way to use that text as an NLP measure is to write a utility method that retrieves the text from the URL, and then to refer to that method in the source expression for the NLP measure.
For example, suppose that we are basing a cube on a class that has summary information about news articles. Each record in the class contains the name of the news agency, the date, the headline, and a property named Link, which contains the URL of the full news story. We want to create an NLP measure that uses the news stories at those URLs.
To do this, we could define a method, GetArticleText(), in the cube class as follows:
ClassMethod GetArticleText(pLink As %String) As %String
{
    set tSC = $$$OK, tStringValue = ""
    try {
        set tRawText = ..GetRawTextFromLink(pLink, .tSC)
        quit:$$$ISERR(tSC)
        set tStringValue = ..StripHTML(tRawText, .tSC)
        quit:$$$ISERR(tSC)
    } catch (ex) {
        set tSC = ex.AsStatus()
    }
    if $$$ISERR(tSC) {
        set tLogFile = "UpdateNEWSARCHIVE"
        set tMsg = $system.Status.GetOneErrorText(tSC)
        do ##class(%DeepSee.Utils).%WriteToLog("UPDATE", tMsg, tLogFile)
    }
    quit tStringValue
}
The GetRawTextFromLink() method would retrieve the raw text, as follows:
ClassMethod GetRawTextFromLink(pLink As %String, Output pSC As %Status) As %String
{
    set pSC = $$$OK, tRawText = ""
    try {
        // derive server and path from pLink
        set pLink = $zstrip(pLink,"<>W")
        set pLink = $e(pLink,$find(pLink,"://"),*)
        set tFirstSlash = $find(pLink,"/")
        set tServer = $e(pLink,1,tFirstSlash-2)
        set tPath = $e(pLink,tFirstSlash-1,*)
        // send the HTTP request for the article
        set tRequest = ##class(%Net.HttpRequest).%New()
        set tRequest.Server = tServer
        // store the status in pSC so that the caller sees a failed request
        set pSC = tRequest.Get(tPath)
        quit:$$$ISERR(pSC)
        // read the response body in chunks of up to 32000 characters
        set len = 32000
        while len>0 {
            set tString = tRequest.HttpResponse.Data.Read(.len, .pSC)
            quit:$$$ISERR(pSC)
            set tRawText = tRawText _ tString
        }
    } catch (ex) {
        set pSC = ex.AsStatus()
    }
    quit tRawText
}
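To make the string handling at the top of GetRawTextFromLink() easier to follow, the lines below trace the same expressions for a single link. This is only an illustrative sketch: the URL is a made-up example, and the values in the comments were worked out by hand from the definitions of $find and $extract.

```objectscript
// Hypothetical link, used only to illustrate the parsing steps
set pLink = "  http://www.example.com/news/story.html  "
set pLink = $zstrip(pLink,"<>W")                    // trim leading and trailing whitespace
set pLink = $extract(pLink,$find(pLink,"://"),*)    // "www.example.com/news/story.html"
set tFirstSlash = $find(pLink,"/")                  // 17, the position just after the first "/"
set tServer = $extract(pLink,1,tFirstSlash-2)       // "www.example.com"
set tPath = $extract(pLink,tFirstSlash-1,*)         // "/news/story.html"
write tServer,!,tPath
```

Two limitations of this approach are worth keeping in mind. A link with no path after the host name (for example, one that ends at the domain) contains no "/" once the scheme is removed, so $find returns 0 and tServer comes out empty. Also, the scheme is discarded, so the request is not automatically sent over HTTPS; if the links require it, the Https and SSLConfiguration properties of %Net.HttpRequest would presumably need to be set as well.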
The StripHTML() method would remove the HTML formatting, as follows:
ClassMethod StripHTML(pRawText As %String, Output pSC As %Status) As %String
{
    set pSC = $$$OK, tCleanText = ""
    try {
        // opening inline tags listed here are replaced with a single space;
        // every other tag is replaced with a blank line
        for tTag = "b","i","span","u","a","font","em","strong","img","label","small","sup","sub" {
            set tReplaceTag(tTag) = " "
        }
        // discard everything before the <body> tag
        set tLowerText = $$$LOWER(pRawText)
        set tStartPos = $find(tLowerText,"<body")-5, tEndTag = ""
        set pRawText = $e(pRawText,tStartPos,*), tLowerText = $e(tLowerText,tStartPos,*)
        for {
            set tPos = $find(tLowerText,"<")
            quit:'tPos // no tag start found
            set tNextSpace = $f(tLowerText," ",tPos), tNextEnd = $f(tLowerText,">",tPos)
            set tTag = $e(tLowerText,tPos,$s(tNextSpace&&(tNextSpace<tNextEnd):tNextSpace, 1:tNextEnd)-2)
            if (tTag="script") || (tTag="style") {
                // skip the whole element, including its content
                set tPosEnd = $find(tLowerText,">",$find(tLowerText,"</"_tTag,tPos))
            } else {
                set tPosEnd = tNextEnd
            }
            if 'tPosEnd { // the tag is never closed: cut the text off at this tag and stop
                set tEndTag = $e(pRawText,tPos-1,*)
                set pRawText = $e(pRawText,1,tPos-2)
                quit
            }
            set tReplace = $s(tTag="":"", 1:$g(tReplaceTag(tTag),$c(13,10,13,10)))
            set pRawText = $e(pRawText,1,tPos-2) _ tReplace _ $e(pRawText,tPosEnd,*)
            set tLowerText = $e(tLowerText,1,tPos-2) _ tReplace _ $e(tLowerText,tPosEnd,*)
        }
        // decode HTML entities, trim the ends, and collapse runs of repeated whitespace characters
        set tCleanText = $zstrip($zconvert(pRawText, "I", "HTML"),"<>=W")
    } catch (ex) {
        set pSC = ex.AsStatus()
    }
    quit tCleanText
}
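Before building the cube, it can be useful to try the helper methods on their own. The following lines are a minimal sketch of such a check. MyApp.NewsCube is a hypothetical name for the cube class that contains the methods, and the HTML string is invented for the test.

```objectscript
// MyApp.NewsCube is a hypothetical class name; substitute the name of your cube class
set tHTML = "<html><body><p>Markets rose <b>sharply</b> today.</p></body></html>"
set tText = ##class(MyApp.NewsCube).StripHTML(tHTML, .tSC)
// show the error if the call failed; otherwise show the cleaned text, including control characters
if $system.Status.IsError(tSC) { do $system.Status.DisplayError(tSC) } else { zwrite tText }
```

Using zwrite rather than write makes the inserted carriage return and line feed characters visible, which helps when judging whether the paragraph breaks fall where you expect.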
Finally, we would create an NLP measure and base it on the following source expression: %cube.GetArticleText(%source.Link).
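As a rough sketch, the corresponding measure element in the cube definition might look like the following. The measure name and display name are hypothetical, and the type and iKnowSource values are assumptions based on how NLP measures are usually written in cube definitions; check them against the cube reference for your version.

```xml
<!-- Hypothetical NLP measure; place inside the <cube> element of the cube class -->
<!-- iKnowSource="string" is assumed because GetArticleText() returns a %String -->
<measure name="ArticleText" displayName="Article Text"
         type="iKnow" iKnowSource="string"
         sourceExpression="%cube.GetArticleText(%source.Link)"/>
```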