Language Identification

Important:

InterSystems has deprecated InterSystems IRIS® Natural Language Processing (NLP). It may be removed from future versions of InterSystems products. The following documentation is provided as reference for existing users only. Existing users who would like assistance identifying an alternative solution should contact the WRC.

This chapter describes how to configure and use Automatic Language Identification (ALI), which is applied at the sentence level. It also describes a few language-specific issues.

Configuring Automatic Language Identification

An NLP Configuration establishes the language environment for source document content. A Configuration is independent of any specified set of source data. You can either define a Configuration, or take the default Configuration. If you do not specify a Configuration, the default is English-only, with no automatic language identification.

A configuration defines the following language options:

  • What language(s) the source documents contain, and therefore which languages to test for and which language models to apply. The available options are Czech (cs), Dutch (nl), English (en), French (fr), German (de), Japanese (ja), Portuguese (pt), Russian (ru), Spanish (es), Swedish (sv), and Ukrainian (uk). Specify a language using the ISO two-letter code. You can specify multiple languages as an InterSystems IRIS list structure.

  • Whether to activate automatic language identification, specified as a boolean value. Activate it when you specify more than one language.

The following example creates a configuration that assumes all source texts will be in English or French, and supports automatic language identification:

  SET myconfig="EnglishFrench"
  IF ##class(%iKnow.Configuration).Exists(myconfig) {
     SET cfg=##class(%iKnow.Configuration).Open(myconfig)
     WRITE "Opened existing configuration ",myconfig,!
  }
  ELSE {
     // The second argument (1) activates automatic language identification;
     // the third argument lists the languages to test for
     SET cfg=##class(%iKnow.Configuration).%New(myconfig,1,$LISTBUILD("en","fr"),"",1)
     DO cfg.%Save()
     IF ##class(%iKnow.Configuration).Exists(myconfig) {
        WRITE "Configuration ",myconfig," now exists",!
     }
     ELSE {
        WRITE "Configuration creation error"
        QUIT
     }
  }
  SET cfgId=cfg.Id
  WRITE "with configuration ID ",cfgId,!
  // Randomly delete the configuration so that repeated runs exercise both branches
  SET rnd=$RANDOM(2)
  IF rnd {
     SET stat=##class(%iKnow.Configuration).%DeleteId(cfgId)
     IF stat {WRITE "Deleted the ",myconfig," configuration" }
  }
  ELSE {WRITE "No delete this time",! }

Using Automatic Language Identification

NLP performs automatic language identification on a per-sentence basis. When the current Configuration has activated automatic language identification, NLP tests each sentence in each source text to determine which of the languages specified in the Configuration is used in that sentence. This identification is a statistical determination, not a certainty, which has the following consequences:

  • If a sentence contains text in more than one language specified in the Configuration, NLP will assign the sentence to what it determines is the predominant language of the sentence.

  • If a sentence is in a language not specified in the Configuration (or a language not supported by NLP), NLP will assign the sentence to one of the specified Configuration languages.

NLP then uses this language determination when identifying CRCs and performing other NLP analysis.

Thus, source texts and sentences within a source text can be in different languages. NLP automatically determines which language model to apply. Automatic language identification also assigns each identification a confidence level, expressed as an integer percentage. Confidence levels range from 100 (complete confidence) to 0 (indeterminate). If automatic language identification is not active, all sentences are assigned a confidence level of 0.

Language Identification Queries

Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.

The following example uses GetTopLanguage() to identify the language for a source and the degree of confidence in that identification. Because language identification is performed at the sentence level, the language for the source is the result of averaging the language identification confidence for the component sentences. This method returns the language as a two-character abbreviation (in this case, “en”). Note that totlangconf (the total of the language confidence values for the sentences) must be divided by numlangsent, not by numsent. These two sentence counts are usually, but not always, the same, because a source may contain sentences for which no language can be determined.

Configuration
  SET myconfig="EnFr"
  IF ##class(%iKnow.Configuration).Exists(myconfig)
       {SET cfg=##class(%iKnow.Configuration).Open(myconfig) }
  ELSE {SET cfg=##class(%iKnow.Configuration).%New(myconfig,1,$LISTBUILD("en","fr"),"",1)
        DO cfg.%Save() }
  SET cfgId=cfg.Id 
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET stat=flister.SetConfig(myconfig)
    IF stat '= 1 { WRITE "SetConfig error ",$System.Status.DisplayError(stat)
                   QUIT }
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 10 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
GetSources
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId)
  SET i=1
  WHILE $DATA(result(i)) {
     SET intId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
    SET numsent = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,result(i))
     WRITE !,extId," has ",numsent," sentences",!
     SET srclang = ##class(%iKnow.Queries.SourceAPI).GetTopLanguage(domId,intId,.totlangconf,.numlangsent)
     WRITE "Source language is ",srclang,!,"with a confidence % of ",totlangconf/numlangsent,!!
     SET i=i+1
     }
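
The division in the final WRITE assumes that numlangsent is greater than zero. As a minimal sketch, assuming totlangconf and numlangsent have been set by GetTopLanguage() as in the example above, the average could be guarded as follows. The avgconf variable and the message text are illustrative only and are not part of the NLP API.

```objectscript
  // Sketch only: totlangconf and numlangsent are assumed to have been
  // returned by GetTopLanguage(), as in the preceding example.
  IF numlangsent>0 {
     // Divide by the number of sentences that have a language, not by numsent
     SET avgconf=totlangconf/numlangsent
     WRITE "Average confidence: ",$FNUMBER(avgconf,"",1),"%",!
  }
  ELSE {
     // Avoids a divide-by-zero error when no sentence has a language
     WRITE "No sentence in this source has an identified language",!
  }
```

With hypothetical values of totlangconf=650 and numlangsent=10, this reports an average confidence of 65.0%, whatever the value of numsent.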

The following example uses GetLanguage() to identify the language for each sentence in a source and the degree of confidence in that identification. This method returns the language as a two-character abbreviation (in this case, “en”) and the confidence level as a percentage between 0 and 100. Note that the confidence level is rarely (if ever) 100%.

Configuration
  SET myconfig="EnFr"
  IF ##class(%iKnow.Configuration).Exists(myconfig)
       {SET cfg=##class(%iKnow.Configuration).Open(myconfig) }
  ELSE {SET cfg=##class(%iKnow.Configuration).%New(myconfig,1,$LISTBUILD("en","fr"),"",1)
        DO cfg.%Save() }
  SET cfgId=cfg.Id 
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET stat=flister.SetConfig(myconfig)
    IF stat '= 1 { WRITE "SetConfig error ",$System.Status.DisplayError(stat)
                   QUIT }
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 10 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
GetOneSource
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId)
  FOR i=1:1:10 {
    IF $DATA(result(i)) {
      SET intId = $LISTGET(result(i),1)
      SET extId = $LISTGET(result(i),2)
      SET myconf=0
      SET numSentS = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,result(i))
      WRITE !,extId," has ",numSentS," sentences",!
GetSentencesInSource
      // Clear any sentences left over from the previous source
      KILL sent
      SET sentStat=##class(%iKnow.Queries.SentenceAPI).GetBySource(.sent,domId,intId)
      IF sentStat=1 {
        // Use a separate counter (j) so that the outer FOR counter (i) is not overwritten
        SET j=1
        WHILE $DATA(sent(j)) {
          SET sentnum=$LISTGET(sent(j),1)
          WRITE "sentence:",sentnum
          SET lang = ##class(%iKnow.Queries.SentenceAPI).GetLanguage(domId,sentnum,.myconf)
          WRITE " language:",lang," confidence:",myconf,!
          SET j=j+1
        }
      }
    }
    ELSE { WRITE !,"That's all folks!" }
  }
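
Because each sentence carries its own confidence level, the same calls can be used to flag sentences whose language identification is doubtful. The following is a minimal sketch, assuming domId and intId are set as in the example above. The threshold of 50 and the variable names threshold, lowcount, and conf are hypothetical choices for illustration, not product defaults. It examines only the sentences that the GetBySource() call returns.

```objectscript
  // Sketch only: domId and intId are assumed to be set as in the preceding example.
  // The threshold value is a hypothetical choice, not a product default.
  SET threshold=50
  SET lowcount=0
  KILL sent
  SET stat=##class(%iKnow.Queries.SentenceAPI).GetBySource(.sent,domId,intId)
  IF stat=1 {
     SET j=""
     FOR {
        // Walk the returned sentence array in subscript order
        SET j=$ORDER(sent(j))
        QUIT:j=""
        SET sentId=$LISTGET(sent(j),1)
        SET lang=##class(%iKnow.Queries.SentenceAPI).GetLanguage(domId,sentId,.conf)
        IF conf<threshold {
           SET lowcount=lowcount+1
           WRITE "sentence ",sentId," (",lang,") has low confidence: ",conf,!
        }
     }
  }
  WRITE lowcount," sentence(s) below ",threshold,"% confidence",!
```

A sentence in a language that is not listed in the Configuration is still assigned to one of the listed languages, so a low confidence level is one possible hint that a source contains such text.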

Overriding Automatic Language Identification

You can use the LanguageFieldName domain parameter to override Automatic Language Identification. If activated, this parameter determines which language to apply by accessing a metadata field for each source. This metadata field contains the ISO language code. If the metadata field data is present, Automatic Language Identification is overridden for that source. If the metadata field is empty or invalid, Automatic Language Identification is used for that source. The LanguageFieldName domain parameter is inactive by default. For further details, refer to the Domain Parameters appendix of this manual.
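
As a minimal sketch of activating this override, assuming domoref is an open %iKnow.Domain object as in the examples in this chapter, and assuming that domain parameters are set through the domain's SetParameter() method: the metadata field name "Lang" is hypothetical, and the field must exist and hold an ISO language code for each source for the override to take effect. Check the Domain Parameters appendix for when this parameter may be set.

```objectscript
  // Sketch only: domoref is assumed to be an open %iKnow.Domain object.
  // "Lang" is a hypothetical metadata field name holding an ISO language code.
  SET stat=domoref.SetParameter("LanguageFieldName","Lang")
  IF stat'=1 {
     WRITE "SetParameter error ",$System.Status.DisplayError(stat)
     QUIT
  }
  WRITE "LanguageFieldName set to Lang",!
```

Sources whose "Lang" field is empty or invalid would still go through Automatic Language Identification, as described above.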

Language-Specific Issues

German: the German eszett (“ß”) character is normalized as “ss”. German commonly requires setting the EnableNgrams domain parameter.
