docs.intersystems.com
InterSystems IRIS Data Platform 2019.2  /  Using ObjectScript  /  Operators and Expressions

Pattern Matching
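InterSystems IRIS supports two systems of pattern matching: ObjectScript pattern matching, described here, which uses the pattern match operator (?); and regular expressions, which are used with functions such as $MATCH.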
These pattern match systems are wholly separate. Each pattern match system can only be used in its own context. It is, however, possible to combine pattern match tests from different pattern match systems using logical AND and OR syntax, as shown in the following example:
  SET var = "abcDEf"
  IF (var ?.e2U.e) && $MATCH(var, "^.{3,7}") { WRITE "It's a match!"}
  ELSE { WRITE "No match"}
The ObjectScript pattern tests that the string must contain two consecutive uppercase letters. The Regular Expression pattern tests that the string must contain between 3 and 7 characters.
ObjectScript Pattern Matching
The ObjectScript pattern match operator tests whether the characters in its left operand are correctly specified by the pattern in its right operand. It returns a boolean value: TRUE (1) when the pattern correctly specifies the characters in the left operand, and FALSE (0) when it does not.
For example, the following tests if the string ssn contains a valid U.S. Social Security Number (3 digits, a hyphen, 2 digits, a hyphen, and 4 digits):
 SET ssn="123-45-6789"
 SET match = ssn    ?3N1"-"2N1"-"4N
 WRITE match
The left operand (the test value) and the right operand (the pattern) are distinguished by the leading ? of the right operand. The two operands may be separated by one or more blank spaces, or not separated by blank spaces, as shown in the following equivalent program example:
 SET ssn="123-45-6789"
 SET match = ssn?3N1"-"2N1"-"4N
 WRITE match
No white space is permitted following the ? operator. White space within the pattern must be within a quoted string and is interpreted as being part of the pattern.
The general format for a pattern match operation is as follows:
operand?pattern
operand An expression that evaluates to a string or number, the characters of which you want to test for a pattern.
pattern A pattern match sequence beginning with a ? character (or with '? for a not-match test). The pattern sequence can be one of the following: a sequence of one or more pattern-elements, or an indirect reference that evaluates to a sequence of one or more pattern-elements.
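As an illustration of the indirect form, the following sketch stores a pattern in a variable and applies it with the indirection operator (@). The variable name pat is hypothetical; the pattern is the Social Security Number pattern used earlier. Note that the stored string omits the leading ? and that each double quote inside it is doubled.

```objectscript
 SET ssn = "123-45-6789"
 // pat is a hypothetical variable holding the pattern text (without the leading ?)
 SET pat = "3N1""-""2N1""-""4N"
 // @pat supplies the pattern at run time
 SET match = ssn?@pat
 WRITE match   // writes 1
```

Storing the pattern in a variable can be convenient when the same test is applied in several places, or when the pattern is chosen at run time.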
A pattern-element consists of one of the following:
repeat-count A repeat count — the exact number of instances to be matched. The repeat-count can evaluate to an integer or to the period wildcard character (.). Use the period to specify any number of instances.
pattern-codes One or more pattern codes. If more than one code is specified, the pattern is satisfied by matching any one of the codes.
literal-string A literal string enclosed in double quotes.
alternation A set of pattern-element sequences to choose from (in order to perform pattern matching on a segment of the operand string). This provides logical OR capability in pattern specifications.
Use a literal string enclosed in double quotes in a pattern if you want to match a specific character or characters. In other situations, use the special pattern codes provided by ObjectScript. Characters associated with a particular pattern code are (to some extent) locale-dependent. The following table shows the available pattern codes and their meanings:
Pattern Codes
Code Meaning
A Matches any uppercase or lowercase alphabetic character. The 8-bit character set for the locale defines what is an alphabetic character. For the English locale (based on the Latin-1 character set), this includes the ASCII values 65 through 90 (A through Z), 97 through 122 (a through z), 170, 181, 186, 192 through 214, 216 through 246, and 248 through 255.
C Matches any of the ASCII control characters (ASCII values 0 through 31 and the extended ASCII values 127 through 159).
E Matches any character, including non-printing characters, whitespace characters, and control characters.
L Matches any lowercase alphabetic character. The 8-bit character set for the locale defines what is a lowercase character. For the English locale (based on the Latin-1 character set) this includes the ASCII values 97 through 122 (a through z), 170, 181, 186, 223 through 246, and 248 through 255.
N Matches any of the 10 numeric characters 0 through 9 (ASCII 48 through 57).
P Matches any punctuation character. The character set for the locale defines what is a punctuation character for an extended (8-bit) ASCII character set. For the English locale (based on the Latin-1 character set), this includes the ASCII values 32 through 47, 58 through 64, 91 through 96, 123 through 126, 160 through 169, 171 through 177, 180, 182 through 184, 187, 191, 215, and 247.
U Matches any uppercase alphabetic character. The 8-bit character set for the locale defines what is an uppercase character. For the English locale (based on the Latin-1 character set), this includes the ASCII values 65 through 90 (A through Z), 192 through 214, and 216 through 222.
R, B, M Matches Cyrillic 8-bit alphabetic character mappings. R matches any Cyrillic character (ASCII values 192 through 255). B matches uppercase Cyrillic characters (ASCII values 192 through 223). M matches lowercase Cyrillic characters (ASCII values 224 through 255). These pattern codes are only meaningful in the Russian 8-bit Windows locale (ruw8). In other locales they execute successfully but fail to match any character.
ZFWCHARZ Matches any of the characters in the Japanese ZENKAKU character set. ZFWCHARZ matches full-width characters, such as those in the Kanji range, as well as many non-Kanji characters that occupy a double cell when displayed by some terminal emulators. ZFWCHARZ also matches the 303 surrogate pair characters defined in the JIS2004 standard, treating each surrogate pair as a single character. For example, the surrogate pair character $WC(131083) matches ?1ZFWCHARZ. This pattern match code requires a Japanese locale. See the $ZZENKAKU function for further details.
ZHWKATAZ Matches any of the characters in the Japanese HANKAKU Kana character set. These are Unicode values 65377 (FF61) through 65439 (FF9F). This pattern match code requires a Japanese locale. See the $ZZENKAKU function for further details.
Pattern codes are not case-sensitive; you can specify them in either uppercase or lowercase. For example, ?5N is equivalent to ?5n. You can specify multiple pattern codes to match a specific character or string. For example, ?1NU matches either a number or an uppercase letter.
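A short sketch of these two rules, using literal test values chosen purely for illustration:

```objectscript
 WRITE "7"?1NU,!      // 1: a digit satisfies N
 WRITE "Q"?1NU,!      // 1: an uppercase letter satisfies U
 WRITE "q"?1NU,!      // 0: a lowercase letter satisfies neither N nor U
 WRITE "12345"?5N,!   // 1
 WRITE "12345"?5n,!   // 1: same pattern, code written in lowercase
```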
As stated in the InterSystems Glossary of Terms, the ASCII character set refers to an extended, 8-bit character set, rather than the more limited, 7-bit character set.
Note:
Pattern matching with double-quote characters can yield inconsistent results, especially when data is supplied from InterSystems IRIS implementations using different NLS locales. The straight double-quote character ($CHAR(34) = ") matches as a punctuation character. Directional double-quote characters (curly quotes) do not match as punctuation characters. The 8-bit directional double-quote characters ($CHAR(147) = “ and $CHAR(148) = ”) match as control characters. The Unicode directional double-quote characters ($CHAR(8220) = “ and $CHAR(8221) = ”) do not match as either punctuation or control characters.
The Pattern Match operator differs from the Binary Contains ([) operator. The Binary Contains operator returns TRUE (1) even if only a substring of the left-hand operand matches the right-hand operand. Also, Binary Contains expressions do not provide the range of options available with the Pattern Match operator. In Binary Contains expressions, you can use only a single string as the right-hand operand, without any special codes.
For example, assume that variable var2 contains the value “abc”. Consider the following Pattern Match expression:
 SET match = var2?2L
This sets match to FALSE (0) because var2 contains three lowercase characters, not just two.
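The difference between the two operators can be seen side by side in the following sketch, which assumes the same value of var2. The last two lines are variations added for illustration; they show patterns that do account for the entire string.

```objectscript
 SET var2 = "abc"
 WRITE var2["ab",!    // 1: Binary Contains succeeds on a substring
 WRITE var2?2L,!      // 0: the pattern must describe the whole string
 WRITE var2?3L,!      // 1: exactly three lowercase letters
 WRITE var2?2L.E,!    // 1: two lowercase letters, then any remaining characters
```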
Here are some examples of basic pattern matching:
PatternMatchTest
 SET var = "O"
 WRITE "Is the letter O",!

 WRITE "...an alphabetic character? "
 WRITE var?1A,!

 WRITE "...a numeric character? "
 WRITE var?1N,!

 WRITE "...an alphabetic or ",!,"  a numeric character? "
 WRITE var?1AN,!

 WRITE "...an alphabetic or ",!,"  a ZENKAKU Kanji character? "
 WRITE var?1AZFWCHARZ,!

 WRITE "...a numeric or ",!,"  a HANKAKU Kana character? "
 WRITE var?1ZHWKATAZN
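You can extend the scope of a pattern code by specifying how many times a pattern can occur, multiple patterns, a combination pattern, an indefinite pattern, or an alternating pattern (logical OR). Each of these is described in the sections that follow.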
Specifying How Many Times a Pattern Can Occur
To define a range for the number of times that a pattern can occur in the target operand, use the form:
n.n
The first n defines the lower limit for the range of occurrences; the second n defines the upper limit.
For the following example, assume that the variable var3 contains multiple copies of the string “AB” (and no other characters). 1.4 indicates that from one to four occurrences of “AB” are recognized:
 SET match = var3?1.4"AB"
If var3 = “ABABAB”, the expression returns a result of TRUE (1) even though var3 contains only three occurrences of “AB”.
As another example, consider the following expression:
 SET match = var4?1.6A
This expression checks to see whether var4 contains from one to six alphabetic characters. A result of FALSE (0) is returned if var4 contains zero or more than six alphabetic characters, or contains a non-alphabetic character.
If you omit either n, ObjectScript supplies a default. The default for the first n is zero (0). The default for the second n is any number. Consider the following example:
 SET match = var5?.E1"AB".E
This example returns a result of TRUE (1) as long as var5 contains at least one occurrence of the pattern string “AB”.
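The defaults can be sketched with a few literal values (chosen for illustration only). Omitting the second n leaves the upper limit open; omitting the first n makes the lower limit zero.

```objectscript
 WRITE "12345"?2.N,!   // 1: at least two digits, no upper limit
 WRITE "1"?2.N,!       // 0: fewer than two digits
 WRITE "123"?.3N,!     // 1: lower limit defaults to 0, upper limit is 3
 WRITE "1234"?.3N,!    // 0: more than three digits
 WRITE ""?.3N,!        // 1: zero occurrences are accepted
```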
Specifying Multiple Patterns
To define multiple patterns, you can combine n and pattern in a sequence of any length. Consider the following example:
 SET match = date?2N1"/"2N1"/"2N
This expression checks for a date value in the format mm/dd/yy. The string “4/27/98” would return FALSE (0) because the month has only one digit. To detect both one and two digit months, you could modify the expression as:
 SET match = date?1.2N1"/"2N1"/"2N
Now the first pattern match (1.2N) accepts either 1 or 2 digits. It uses the optional period (.) to define a range of acceptable occurrences as described in the previous section.
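The following sketch compares the two date patterns on a few sample values (the values are illustrative). The third value fails both tests because the day still requires exactly two digits.

```objectscript
 FOR date="04/27/98","4/27/98","4/7/98" {
   SET strict = date?2N1"/"2N1"/"2N
   SET relaxed = date?1.2N1"/"2N1"/"2N
   WRITE date," strict: ",strict," relaxed: ",relaxed,!
 }
 // 04/27/98 strict: 1 relaxed: 1
 // 4/27/98 strict: 0 relaxed: 1
 // 4/7/98 strict: 0 relaxed: 0
```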
Specifying a Combination Pattern
To define a combination pattern, use the form:
pattern1pattern2
With a combination pattern, the sequence consisting of pattern1 followed by pattern2 is checked against the target operand. For example, consider the following expression:
 SET match = value?3N.4L
This expression checks for a pattern in which three numeric digits are followed by zero to four lowercase alphabetic characters. The expression returns TRUE (1) only if the target operand contains exactly one occurrence of the combined pattern. For example, the strings “345g” and “345gfij” would qualify, but “345gfijhkbc” and “345gfij276hkbc” would not.
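A minimal sketch that runs this combination pattern against the sample strings, plus the bare string “345” (added here to show that zero lowercase characters are also accepted):

```objectscript
 FOR value="345","345g","345gfij","345gfijhkbc","345gfij276hkbc" {
   SET match = value?3N.4L
   WRITE value," -> ",match,!
 }
 // 345 -> 1
 // 345g -> 1
 // 345gfij -> 1
 // 345gfijhkbc -> 0
 // 345gfij276hkbc -> 0
```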
Specifying an Indefinite Pattern
To define an indefinite pattern, use the form:
.pattern
With an indefinite pattern, the target operand is checked for an occurrence of pattern, but any number of occurrences is accepted (including zero occurrences). For example, consider the expression:
 SET match = value?.N
This expression returns TRUE (1) if the target operand contains zero, one, or more than one numeric character, and contains no characters of any other type.
Specifying an Alternating Pattern (Logical OR)
Alternation allows for testing if an operand matches one or more of a group of specified pattern sequences. It provides logical OR capability to pattern matching.
An alternation has the following syntax:
( pattern-element sequence {, pattern-element sequence }...)
Thus, the following pattern returns TRUE (1) if value contains one occurrence of the letter “A” or one occurrence of the letter “B”.
 SET match = value?1(1"A",1"B")
You can have nested alternation patterns, as in the following pattern match expression:
 SET match = value?.(.(1A,1N),1P)
For example, you may want to validate a U.S. telephone number. At a minimum, the phone number must be a 7-digit phone number with a hyphen (-) separating the third and fourth digits. For example:
nnn-nnnn
The phone number can also include a three-digit area code that must either have surrounding parentheses or be separated from the rest of the number by a hyphen. For example:
(nnn) nnn-nnnn
nnn-nnn-nnnn
The following pattern match expressions describe three valid forms of a U.S. telephone number:
 SET match = phone?3N1"-"4N
 SET match = phone?3N1"-"3N1"-"4N
 SET match = phone?1"("3N1") "3N1"-"4N
Without an alternation, the following compound Boolean expression would be required to validate any form of U.S. telephone number.
  SET match = ((phone?3N1"-"4N) || (phone?3N1"-"3N1"-"4N) || (phone?1"("3N1") "3N1"-"4N))
With an alternation, the following single pattern can validate any form of U.S. telephone number:
 SET match = phone?.1(1"("3N1") ",3N1"-")3N1"-"4N
The alternation in this example allows the area code component of the phone number to be satisfied by either 1"("3N1") " or 3N1"-". The alternation count range of 0 to 1 indicates that the operand phone can have 0 or 1 area code components.
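To see the alternation at work, the following sketch applies the single pattern to each of the three valid forms and to one invalid form. The phone numbers are made-up sample values.

```objectscript
 FOR phone="555-1234","617-555-1234","(617) 555-1234","617 555-1234" {
   SET match = phone?.1(1"("3N1") ",3N1"-")3N1"-"4N
   WRITE phone," -> ",match,!
 }
 // 555-1234 -> 1        (no area code)
 // 617-555-1234 -> 1    (area code followed by a hyphen)
 // (617) 555-1234 -> 1  (area code in parentheses)
 // 617 555-1234 -> 0    (area code followed by a space only)
```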
Alternations with a repeat count greater than one (1) can produce many combinations of acceptable patterns. The following alternation matches the string shown and matches 26 other three-character strings.
 SET match = "CAT"?3(1"C",1"A",1"T")
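The count of 27 comes from three positions with three choices each (3 x 3 x 3). The following sketch tries a handful of illustrative strings; any three-character string built only from “C”, “A”, and “T” matches, in any order and with repeats.

```objectscript
 FOR str="CAT","TAC","CCC","ACT","CAB","CA" {
   SET match = str?3(1"C",1"A",1"T")
   WRITE str," -> ",match,!
 }
 // CAT -> 1
 // TAC -> 1
 // CCC -> 1
 // ACT -> 1
 // CAB -> 0  ("B" is not one of the alternatives)
 // CA -> 0   (only two characters)
```

If only the exact string “CAT” should match, a literal pattern such as ?1"CAT" is the more precise choice.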
Using Incomplete Patterns
If a pattern match successfully describes only part of a string, then the pattern match returns a result of FALSE (0). That is, there cannot be any string left over when the pattern is exhausted. The following expression evaluates to a result of FALSE (0) because the pattern does not match the final “R”:
 SET match = "RAW BAR"?.U1P2U
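Two possible ways to make this pattern account for the whole string are sketched below. These variations are illustrations, not the only options: the first extends the final count, and the second appends .E to absorb whatever remains.

```objectscript
 WRITE "RAW BAR"?.U1P2U,!     // 0: the final "R" is left over
 WRITE "RAW BAR"?.U1P3U,!     // 1: the pattern now covers every character
 WRITE "RAW BAR"?.U1P2U.E,!   // 1: the trailing .E matches any remaining characters
```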
Multiple Pattern Interpretations
There can be more than one interpretation of a pattern as it is matched against an operand. For example, the following expression can be interpreted in two ways:
 SET match = "/////A#####B$$$$$"?.E1U.E
  1. The first “.E” matches the substring “/////”, the 1U matches the “A”, and the second “.E” matches the substring “#####B$$$$$”.
  2. The first “.E” matches the substring “/////A#####”, the 1U matches the character “B”, and the second “.E” matches the substring “$$$$$”.
As long as at least one interpretation of the expression is TRUE (1), then the expression has a value of TRUE.
Not Match Operator
You can produce a Not Match operation by using the Unary Not operator ( ' ) with Pattern Match:
operand'?pattern
Not Match reverses the truth value of the Pattern Match. If the characters in the operand cannot be described by the pattern, then Not Match returns a result of TRUE (1). If the pattern matches all of the characters in the operand, then Not Match returns a result of FALSE (0).
The following example uses the Not Match operator:
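  WRITE !,"abc" ?3L    // 1: three lowercase letters
  WRITE !,"abc" '?3L   // 0: Not Match reverses the result above
  WRITE !,"abc" ?3N    // 0: the characters are not numeric
  WRITE !,"abc" '?3N   // 1: Not Match reverses the result above
  WRITE !,"abc" '?3E   // 0: 3E matches any three characters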
Pattern Complexity
A pattern match with multiple alternations and indefinite patterns, when applied to a long string, can recurse many levels into the system stack. In rare cases, this recursion can rise to several thousand levels, threatening stack overflow and a process crash. When this extreme situation occurs, InterSystems IRIS issues a <COMPLEX PATTERN> error rather than risking a crash of the current process.
In the unusual event that such an error occurs, it is recommended that you either simplify your pattern, or apply it to shorter subunits of the original string.
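One way a program might guard against this error is to wrap the pattern match in a TRY/CATCH block, as in the following sketch. It assumes that the Name property of the caught system exception holds the error name including its angle brackets; the variable names and the short test string are hypothetical, and with a string this short the CATCH block is not expected to run.

```objectscript
 // longString stands in for a much longer value in a real application
 SET longString = "abc-123"
 SET match = ""
 TRY {
   // nested alternation with indefinite repeat counts, as shown earlier
   SET match = longString?.(.(1A,1N),1P)
 }
 CATCH ex {
   // assumption: ex.Name contains the error name, such as <COMPLEX PATTERN>
   IF ex.Name="<COMPLEX PATTERN>" {
     WRITE "Pattern too complex for this string; simplify it or test shorter segments",!
   }
   ELSE {
     THROW ex   // pass any other error on unchanged
   }
 }
 WRITE match
```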
You can interrupt pattern execution by issuing a Ctrl-C key command, resulting in an <INTERRUPT> error.


Copyright © 1997-2019 InterSystems Corporation, Cambridge, MA
Content Date/Time: 2019-09-13 06:50:57