@r_compile(key:value, r:string)

@r_match(key:value, s:string)
@r_match(r:string, s:string)

@r_search(key:value, s:string)
@r_search(r:string, s:string)

@r_findall(key:value, s:string)
@r_findall(r:string, s:string)

The function @r_compile(k, r) compiles the regular expression r (a pattern given as a string) and stores the result under the key k (any value) in an internal dictionary. This key can then be used as the first argument of the three functions @r_match(), @r_search() and @r_findall():

  • Function @r_match() checks whether a pattern matches the whole string.

  • Function @r_search() finds a single occurrence of a pattern within a string.

  • Function @r_findall() reports all the occurrences of a pattern within the whole string.

A regular expression r can also be given directly to these functions as a string, but using @r_compile() avoids recompiling the pattern r at each call. This is useful if the pattern is reused often.
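To make the benefit of @r_compile() concrete, here is a minimal C++ sketch of a dictionary from keys to compiled patterns. It is not Antescofo's implementation: the names compiled_patterns, compile_pattern and matches_with_key are hypothetical, keys are restricted to strings for brevity, and std::regex is only a stand-in engine (an assumption suggested by the fact that the notations listed further down coincide with the grammars std::regex accepts).

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical stand-in for the internal dictionary. Keys are strings here,
// whereas Antescofo accepts any value as a key.
static std::map<std::string, std::regex> compiled_patterns;

// Compile `pattern` once and remember it under `key`
// (an existing entry for the same key is overwritten).
void compile_pattern(const std::string& key, const std::string& pattern) {
    compiled_patterns.insert_or_assign(key, std::regex(pattern));
}

// Whole-string match using the stored regex: no recompilation happens here.
// Throws std::out_of_range if `key` was never compiled.
bool matches_with_key(const std::string& key, const std::string& s) {
    return std::regex_match(s, compiled_patterns.at(key));
}
```

After compile_pattern("ident", "[a-z]+"), every call to matches_with_key("ident", ...) reuses the same compiled object, which is the saving the paragraph above describes.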


Function @r_match(p, s) returns false if the pattern p does not match the entire string s. If s is an instance of the pattern p, a non-empty tab is returned. This tab contains either the submatches of the pattern (the substrings of s matching the groups in the pattern) or the entire string if the pattern has no group. See below for groups in a pattern.
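The return convention just described can be sketched in C++ as follows. This is an illustration under assumptions, not the actual implementation: match_whole is a hypothetical name, std::regex stands in for the engine, and std::nullopt plays the role of false.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper mimicking the described behavior of @r_match.
std::optional<std::vector<std::string>>
match_whole(const std::string& pattern, const std::string& s) {
    std::regex re(pattern);
    std::smatch m;
    // regex_match succeeds only if the pattern covers the entire string.
    if (!std::regex_match(s, m, re)) return std::nullopt;

    std::vector<std::string> result;
    if (m.size() == 1) {
        result.push_back(m[0].str());          // no group: the entire string
    } else {
        for (std::size_t i = 1; i < m.size(); ++i)
            result.push_back(m[i].str());      // one entry per group
    }
    return result;
}
```

With this sketch, match_whole("[a-e]*", "abcde") holds the single element "abcde", while match_whole("[a-e]*", "ab.cde") is empty (the analogue of false).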


Function @r_search(p, s) returns false if there is no occurrence of the pattern p in the string s. If an occurrence is found, the returned value is a non-empty tab giving the matched substrings together with the characters before and after the match:

     [ prefix, m[1], …, m[n], suffix ]

The strings m[1], …, m[n] are the substrings matched by the groups of the regular expression. If the pattern has no group, the matched string itself is reported between prefix and suffix.
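The layout of the tab returned by @r_search can be sketched in C++ like this. The helper name search_first is hypothetical, std::regex is an assumed stand-in for the engine, and std::nullopt stands for false. The fallback to the whole match when the pattern has no group follows the examples given later in this page.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper mimicking the described behavior of @r_search:
// [ prefix, m[1], ..., m[n], suffix ] for the first occurrence only.
std::optional<std::vector<std::string>>
search_first(const std::string& pattern, const std::string& s) {
    std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(s, m, re)) return std::nullopt;

    std::vector<std::string> result;
    result.push_back(m.prefix().str());        // text before the occurrence
    if (m.size() == 1) {
        result.push_back(m[0].str());          // no group: the match itself
    } else {
        for (std::size_t i = 1; i < m.size(); ++i)
            result.push_back(m[i].str());      // one entry per group
    }
    result.push_back(m.suffix().str());        // text after the occurrence
    return result;
}
```

For instance, search_first("[a-z]+", "888ab12cde999") yields the three strings "888", "ab" and "12cde999".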


Function @r_findall(p, s) returns false if there is no occurrence of the pattern p in the string s. If occurrences are found, the returned value is a non-empty tab giving, for each occurrence, the matched substrings, together with the characters before, between, and after the occurrences:

     [
        prefix,
        [ m[1,1], …, m[1,n] ],
        sep_1,
        [ m[2,1], …, m[2,n] ],
        sep_2,
        …,
        [ m[k,1], …, m[k,n] ],
        suffix
     ]

Here k is the number of occurrences found. The strings m[i,1], …, m[i,n] are the substrings matched by the groups of the ith occurrence of the regular expression in the string s. The string sep_i is the substring between the ith occurrence of the pattern and occurrence i+1. The string prefix is the part of s before the first occurrence and suffix is the part of s after the last occurrence.
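As an illustration of how prefix, separators and suffix relate to the successive occurrences, here is a C++ sketch. It is written under assumptions: find_all and FindAllResult are hypothetical names, std::regex stands in for the engine, the nested tab is flattened into a struct, and std::nullopt stands for false.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical container for the pieces of the tab described above.
struct FindAllResult {
    std::string prefix;                                 // before the first occurrence
    std::vector<std::vector<std::string>> occurrences;  // groups of each occurrence
    std::vector<std::string> separators;                // sep_i, between i and i+1
    std::string suffix;                                 // after the last occurrence
};

std::optional<FindAllResult>
find_all(const std::string& pattern, const std::string& s) {
    std::regex re(pattern);
    std::sregex_iterator it(s.begin(), s.end(), re), end;
    if (it == end) return std::nullopt;                 // no occurrence at all

    FindAllResult r;
    r.prefix = it->prefix().str();
    for (; it != end; ++it) {
        const std::smatch& m = *it;
        // For every occurrence after the first, prefix() is the text between
        // the previous occurrence and this one, i.e. the separator.
        if (!r.occurrences.empty()) r.separators.push_back(m.prefix().str());

        std::vector<std::string> groups;
        if (m.size() == 1) {
            groups.push_back(m[0].str());               // no group: the match itself
        } else {
            for (std::size_t i = 1; i < m.size(); ++i)
                groups.push_back(m[i].str());
        }
        r.occurrences.push_back(groups);
        r.suffix = m.suffix().str();                    // overwritten until the last one
    }
    return r;
}
```

By construction there is always one separator fewer than there are occurrences, which matches the alternation shown in the layout above.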

 

Regular expression notation

A regular expression (RE) is a matching engine constructed from a string. Several notations can be used to specify the RE: ECMAScript (the default), POSIX, awk, grep, and egrep. The notation in use is selected by setting the global variable

       $regexp_syntax_option

The recognized values are:

       "ECMAScript"    ; ECMAScript notation (aka JavaScript)
       "ECMA"          ; ECMAScript (alias)
       "default"       ; ECMAScript (alias)
       "basic"         ; POSIX basic RE
       "extended"      ; POSIX extended RE
       "awk"           ; awk RE
       "grep"          ; grep RE
       "egrep"         ; egrep RE

If the value of $regexp_syntax_option is not valid, the default notation (ECMAScript) is used. A change of notation affects only subsequent compilations: REs that are already compiled are not affected.
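The option names above coincide with the grammar flags of C++ std::regex, so the selection rule, including the fallback to ECMAScript, can be sketched as below. Whether Antescofo is actually built on std::regex is an assumption, and syntax_from_name is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Map an option name to a std::regex grammar flag.
// "ECMAScript", "ECMA", "default" and any unrecognized name all
// select the ECMAScript grammar, mirroring the fallback rule above.
std::regex::flag_type syntax_from_name(const std::string& name) {
    if (name == "basic")    return std::regex::basic;
    if (name == "extended") return std::regex::extended;
    if (name == "awk")      return std::regex::awk;
    if (name == "grep")     return std::regex::grep;
    if (name == "egrep")    return std::regex::egrep;
    return std::regex::ECMAScript;
}
```

In this sketch the flag is consumed when a regex object is constructed, for example std::regex re("a+", syntax_from_name("extended")). Changing the name afterwards cannot alter re, which is consistent with the statement that already compiled REs are not affected.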

The various RE notations are not presented here. We only recall some features of the ECMAScript notation:

  • A suffix ? after any of the repetition notations makes the pattern matcher "lazy" or "non-greedy". That is, when looking for a pattern, it looks for the shortest match rather than the longest. By default, the pattern matcher always looks for the longest match. For example, when searching ababab, the pattern (ab)+ matches all of ababab, whereas (ab)+? matches only the first ab. (The lazy form (ab)*? would match the empty string.)

  • The most common character classes have names xxx that can be used inside a bracket expression [ … ] with the notation [:xxx:]. For example, \$[[:alpha:]_][[:alnum:]_]* matches an Antescofo variable identifier: it starts with a dollar sign (escaped, because a bare $ denotes the end of the input), the second character must be an alphabetic character (class [:alpha:]) or an underscore, and the remaining characters are alphanumeric characters or underscores. Some of these classes are also available through the @char_is_xxx() predicates.

  • A group (a subpattern) potentially to be represented by a submatch is delimited by parentheses. If you need parentheses that should not define a subpattern, use (?: rather than plain (.
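The three ECMAScript features recalled above can be checked with a short C++ sketch, assuming std::regex (whose default grammar is ECMAScript) behaves like the engine used here. The helpers first_match and group_count are hypothetical names introduced for the illustration.

```cpp
#include <regex>
#include <string>

// Return the text of the first occurrence of `pattern` in `s`
// (an empty string if there is none).
std::string first_match(const std::string& pattern, const std::string& s) {
    std::regex re(pattern);   // ECMAScript is the default grammar
    std::smatch m;
    return std::regex_search(s, m, re) ? m[0].str() : std::string();
}

// Number of groups (capturing parentheses) in `pattern`.
unsigned group_count(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}
```

With these helpers, first_match("(ab)+", "ababab") gives "ababab" while first_match("(ab)+?", "ababab") gives "ab". The call first_match("\\$[[:alpha:]_][[:alnum:]_]*", "let $x_1 := 3") gives "$x_1". Finally, group_count("(ab)+") is 1 and group_count("(?:ab)+") is 0, since (?: opens parentheses that do not define a group.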

 

Examples

In these examples, the RE is given directly to the matching functions:

@r_match("[a-e]*", "abcde")  ->  [ "abcde" ]
@r_match("[a-e]*", "ab.cde")  ->  false
@r_search("[a-e]*", "ab.cde")  ->  ["", "ab", ".cde"]

The RE is compiled at each call. To avoid this recompilation, function @r_compile can be used:

_ := @r_compile(1, "[a-z]+")

The key 1 can then be an argument of the matching functions:

@r_match(1, "abcde")  ->  [ "abcde" ]
@r_search(1, "888ab12cde999")  -> ["888", "ab", "12cde999"]

Any value can be used as a key, even the string specifying the RE:

_ := @r_compile("[a-z]+", "[a-z]+")
@r_search("[a-z]+", "888ab12cde999")  -> ["888", "ab", "12cde999"]

In this last example, the occurrence of the pattern is ab, the prefix (the substring preceding the match) is 888 and the suffix is 12cde999.

The pattern "[a-z]+" does not contain groups. The pattern "([a-d]+)[0-9]+([a-z]+)" contains two groups, "([a-d]+)" and "([a-z]+)". When a pattern contains groups, the submatches are reported in the returned tab:

@r_match("([a-d]+)[0-9]+([a-z]+)", "ab12cde")  ->  [ "ab", "cde" ]
@r_search("([a-d]+)[0-9]+([a-z]+)", "88ab12cde99")  -> ["88", "ab", "cde", "99"]

Nota Bene: only groups are reported. In the previous example the substring matched by [0-9]+ is not reported.

Function @r_findall can be used to report all the occurrences of the pattern:

@r_findall("([a-z]+)[0-9]+",
           "a1.b2::c3   dd44____abcdefghi12345678-------")

The pattern describes a sequence of lowercase letters followed by a sequence of digits. Only the alphabetic part is reported, because it is the only group. The call returns:

 [
     "",             ; prefix
     ["a"],          ; occurrence 1
     ".",            ; sep_1
     ["b"],          ; occurrence 2
     "::",           ; sep_2
     ["c"],          ; occurrence 3
     "   ",          ; sep_3
     ["dd"],         ; occurrence 4
     "____",         ; sep_4
     ["abcdefghi"],  ; occurrence 5
     "-------"       ; suffix
 ]

The prefix is an empty string because s starts with an occurrence of the pattern. Each occurrence is reported as a tab. These tabs contain only one element, corresponding to the single group in the pattern.

 

 

See also: @char_is_alnum, @char_is_alpha, @char_is_ascii, @char_is_blank, @char_is_cntrl, @char_is_digit, @char_is_graph, @char_is_lower, @char_is_print, @char_is_punct, @char_is_space, @char_is_upper, @char_is_xdigit

See also String Management: @car, @cdr, @char_is_alnum, @char_is_alpha, @char_is_ascii, @char_is_blank, @char_is_cntrl, @char_is_digit, @char_is_graph, @char_is_lower, @char_is_print, @char_is_punct, @char_is_space, @char_is_upper, @char_is_xdigit, @copy, @count, @drop, @dump, @dumpvar, @empty, @explode, @find, @is_prefix, @is_string, @is_subsequence, @is_suffix, @last, @member, @occurs, @parse, @permute, @push_back, @r_compile, @r_findall, @r_match, @r_search, @remove, @remove_duplicate, @replace, @scramble, @slice, @sort, @split, @sputter, @string2fun, @string2proc, @strip_path, @stutter, @system, @take, @to_num, @Tracing, @UnTracing

Most functions acting on tabs also operate on strings.