PLT MzScheme: Language Manual


Regular Expressions

MzScheme provides built-in support for regular expression pattern matching on strings and input ports, built on Henry Spencer's package. Regular expressions are specified as strings, using the same pattern language as the Unix utility egrep. String-based regular expressions can be compiled into a regexp value for repeated matches. The internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 special characters.

The pregexp.ss library of MzLib (see Chapter 22 in PLT MzLib: Libraries Manual) provides a similar -- but more powerful -- form of matching.

Regexp   ::= Pieces                   Match Pieces                                 
          |  Regexp|Regexp            Match either Regexp, try left first          
Pieces   ::= Piece                    Match Piece                                  
          |  PiecePiece               Match first Piece followed by second Piece   
Piece    ::= Atom*                    Match Atom 0 or more times, longest possible 
          |  Atom+                    Match Atom 1 or more times, longest possible 
          |  Atom?                    Match Atom 0 or 1 times, longest possible    
          |  Atom*?                   Match Atom 0 or more times, shortest possible
          |  Atom+?                   Match Atom 1 or more times, shortest possible
          |  Atom??                   Match Atom 0 or 1 times, shortest possible   
          |  Atom                     Match Atom exactly once                      
Atom     ::= (Regexp)                 Match sub-expression Regexp                  
          |  [Range]                  Match any character in Range                 
          |  [^Range]                 Match any character not in Range             
          |  .                        Match any character                          
          |  ^                        Match start of string                        
          |  $                        Match end of string                          
          |  Literal                  Match a single literal character             
Literal  ::= Any character except (, ), *, +, ?, [, ], ., ^, \, or |               
          |  \Aliteral                Match Aliteral                               
Aliteral ::= Any character                                                         
Range    ::= ]                        Range contains ] only                        
          |  -                        Range contains - only                        
          |  ]Lrange                  Range contains ] and everything in Lrange    
          |  -Lrange                  Range contains - and everything in Lrange    
          |  Lrange-                  Range contains - and everything in Lrange    
          |  ]Lrange-                 Range contains ], -, and everything in Lrange
          |  Lrange                   Range contains everything in Lrange          
Lrange   ::= Rliteral                 Range contains a literal character           
          |  Rliteral-Rliteral        Range contains ASCII range inclusive         
          |  LrangeLrange             Range contains everything in both            
Rliteral ::= Any character except ] or -                                           

Figure 1:  Grammar for regular expressions
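The repetition and alternation rules in Figure 1 can be illustrated with a few small matches. This is a minimal sketch using regexp-match (described later in this section); the patterns and input strings are invented for illustration, and the results shown are the ones implied by the ``longest possible'', ``shortest possible'', and ``try left first'' rules of the grammar.

```scheme
;; Atom* takes the longest possible run; Atom*? takes the shortest.
(regexp-match "a*" "aaa")         ; => '("aaa")
(regexp-match "a*?" "aaa")        ; => '("")
(regexp-match "<.*>" "<a><b>")    ; => '("<a><b>")
(regexp-match "<.*?>" "<a><b>")   ; => '("<a>")

;; Regexp|Regexp tries the left alternative first, so the order
;; of the alternatives can change the result.
(regexp-match "a|ab" "ab")        ; => '("a")
(regexp-match "ab|a" "ab")        ; => '("ab")
```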

The format of a regular expression is specified by the grammar in Figure 1. A few subtle points about the regexp language are worth noting:

  • When an opening square bracket (``['') that starts a range is immediately followed by a closing square bracket (``]''), then the closing square bracket is part of the range, instead of ending an empty range. For example, "[]a]" matches any string that contains a lowercase ``a'' or a closing square bracket. A dash (``-'') at the start or end of a range is treated specially in the same way.

  • When a caret (``^'') or dollar sign (``$'') appears in the middle of a regular expression (not in a range), the resulting regexp is legal even though it is usually not matchable. For example, "a$b" is unmatchable, because no string can contain the letter ``b'' after the end of the string. In contrast, "a$b*" matches any string that ends with a lowercase ``a'', since zero ``b''s will match the part of the regexp after ``$''.

  • A backslash (``\'') in a regexp pattern specified with a Scheme string literal must be protected with an additional backslash. For example, the string "\\." describes a pattern that matches any string containing a period. In this case, the first backslash protects the second to generate a Scheme string containing two characters; the second backslash (which is the first character in the actual string value) protects the period in the regexp pattern.
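The three points above can be checked directly. The following is a small sketch using regexp-match (described below); the input strings are invented for illustration.

```scheme
;; A ] or - in a special position is an ordinary member of the range.
(regexp-match "[]a]" "x]y")    ; => '("]")
(regexp-match "[a-]" "x-y")    ; => '("-")

;; $ in the middle of a pattern is legal, but usually unmatchable.
(regexp-match "a$b" "ab")      ; => #f
(regexp-match "a$b*" "bba")    ; => '("a")

;; The Scheme literal "\\." is a two-character string: \ and .
(string-length "\\.")          ; => 2
(regexp-match "\\." "3.14")    ; => '(".")
;; Without the backslashes, . matches any character.
(regexp-match "." "3.14")      ; => '("3")
```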

The regular expression procedures are:

  • (regexp string) takes a string representation of a regular expression and compiles it into a regexp value. Other regular expression procedures accept either a string or a regexp value as the matching pattern. If a regular expression string is used multiple times, it is faster to compile the string once to a regexp value and use it for repeated matches instead of using the string each time.

    The object-name procedure (see section 6.2.4) returns the source string for a regexp value.

  • (regexp? v) returns #t if v is a regexp value created by regexp, #f otherwise.

  • (regexp-match pattern string [start-k end-k output-port]) attempts to match pattern (a string or a regexp value) to a portion of string; see below for information on using an input port in place of string.

    The optional start-k and end-k arguments select a substring of string for matching, and the default is the entire string. The end-k argument can be #f, which is the same as not supplying end-k. The matcher finds a portion of string that matches pattern and is closest to the start of the selected substring.

    If the match fails, #f is returned. If the match succeeds, a list containing strings, and possibly #f, is returned. The first string in this list is the portion of string that matched pattern. If two portions of string can match pattern, then the match that starts earliest is found.

    Additional strings are returned in the list if pattern contains parenthesized sub-expressions; matches for the sub-expressions are provided in the order of the opening parentheses in pattern. When sub-expressions occur in branches of an ``or'' (``|''), in a ``zero or more'' pattern (``*''), or in a ``zero or one'' pattern (``?''), a #f is returned for the expression if it did not contribute to the final match. When a single sub-expression occurs in a ``zero or more'' pattern (``*'') or a ``one or more'' pattern (``+'') and is used multiple times in a match, then the rightmost match associated with the sub-expression is returned in the list.

    If the optional output-port is provided, the part of string that precedes the match is written to the port. All of string up to end-k is written to the port if no match is found. This functionality is not especially useful, but it is provided for consistency with regexp-match on input ports.

  • (regexp-match pattern input-port [start-k end-k output-port]) is similar to regexp-match with a string (see above), except that the match is found in the stream of characters produced by input-port. The optional start-k argument indicates the number of characters to skip before matching pattern, and end-k indicates the maximum number of characters to consider (including skipped characters). The end-k argument can be #f, which is the same as not supplying end-k. The default is to skip no characters and read until the end-of-file if necessary. If the end-of-file is reached before start-k characters are skipped, the match fails.

    In pattern, a start-of-string caret (``^'') refers to the first read position after skipping, and the end-of-string dollar sign (``$'') refers to the end-kth read character or the end of file, whichever comes first.

    The optional output-port receives all characters that precede a match in the input port, or up to end-k characters (by default the entire stream) if no match is found.

    When matching an input port stream, all characters up to and including the match are eventually read from the port, but matching proceeds by first peeking characters from the port (using peek-string-avail!; see section 11.2.1), and then (re-)reading characters to discard them after the match result is determined. The matcher peeks in blocking mode only as far as necessary to determine a match, but it may peek extra characters to fill an internal buffer if immediately available (i.e., without blocking). Greedy repeat operators in pattern, such as ``*'' or ``+'', tend to force reading the entire content of the port to determine a match.

    If the port is read simultaneously by another thread, or if the port is a custom port with inconsistent reading and peeking procedures (see section 11.1.6), then the characters that are peeked and used for matching may differ from the characters read and discarded after the match completes. The matcher inspects only the peeked characters.

  • (regexp-match-positions pattern string-or-input-port [start-k end-k output-port]) is like regexp-match, but returns a list of number pairs (and #f) instead of a list of strings. Each pair of numbers refers to a range of characters in string-or-input-port in a substring-compatible manner for strings, independent of start-k. In the case of an input port, the returned positions indicate the number of characters that were read before the first matching character.

  • (regexp-match-peek pattern input-port [start-k end-k]) is like regexp-match on input ports, but only peeks characters from input-port instead of reading them.

  • (regexp-match-peek-positions pattern input-port [start-k end-k]) is like regexp-match-positions on input ports, but only peeks characters from input-port instead of reading them.

  • (regexp-replace pattern string insert-string) performs a match using pattern on string and then returns a string in which the matching portion of string is replaced with insert-string. If pattern matches no part of string, then string is returned unmodified.

    If insert-string contains ``&'', then ``&'' is replaced with the matching portion of string before it is substituted into string. If insert-string contains ``\n'' (for some integer n), then it is replaced with the nth matching sub-expression from string (see footnote 15). ``&'' and ``\0'' are synonymous. If the nth sub-expression was not used in the match or if n is greater than the number of sub-expressions in pattern, then ``\n'' is replaced with the empty string.

    A literal ``&'' or ``\'' is specified as ``\&'' or ``\\'', respectively. If insert-string contains ``\$'', then ``\$'' is replaced with the empty string. (This can be used to terminate a number n following a backslash.) If a ``\'' is followed by anything other than a digit, ``&'', ``\'', or ``$'', then it is treated as ``\0''.

  • (regexp-replace* pattern string insert-string) is the same as regexp-replace, except that every instance of pattern in string is replaced with insert-string. Only non-overlapping instances of pattern in the original string are replaced, so instances of pattern within inserted strings are not replaced recursively.
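The rules for sub-expression results, start-k and end-k, and input ports can be sketched as follows. This is an illustrative sketch, not part of the original manual: the variable names (alt, opt, rep, in, out, and so on) and the input strings are invented, and the port portion assumes string ports created with open-input-string and open-output-string.

```scheme
;; A sub-expression that takes no part in the match yields #f.
(define alt (regexp-match "(a)|(b)" "b"))     ; => '("b" #f "b")
(define opt (regexp-match "(x)?y" "y"))       ; => '("y" #f)

;; A repeated sub-expression reports its rightmost iteration.
(define rep (regexp-match "(a|b)+" "abab"))   ; => '("abab" "b")

;; start-k and end-k restrict matching to a substring ...
(define from-4 (regexp-match "[0-9]+" "ab12cd34" 4))      ; => '("34")
(define first-2 (regexp-match "[0-9]+" "ab12cd34" 0 2))   ; => #f
;; ... but reported positions still index the whole string.
(define pos (regexp-match-positions "[0-9]+" "ab12cd34" 4)) ; => '((6 . 8))

;; Matching on an input port consumes everything up to and including
;; the match; the characters before the match go to the output port.
(define in (open-input-string "key=value"))
(define out (open-output-string))
(define found (regexp-match "=" in 0 #f out))
(define skipped (get-output-string out))      ; => "key"
(define next-char (read-char in))             ; => #\v

;; regexp-match-peek leaves the port untouched.
(define in2 (open-input-string "abc"))
(define peeked (regexp-match-peek "b" in2))
(define still-first (read-char in2))          ; => #\a
```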

Examples:

(define r (regexp "(-[0-9]*)+"))  
(regexp-match r "a-12--345b") ; => '("-12--345" "-345") 
(regexp-match-positions r "a-12--345b") ; => '((1 . 9) (5 . 9)) 
(regexp-match "x+" "12345") ; => #f 
(regexp-replace "mi" "mi casa" "su") ; => "su casa" 
(define r2 (regexp "([Mm])i ([a-zA-Z]*)"))  
(define insert "\\1y \\2")  
(regexp-replace r2 "Mi Casa" insert) ; => "My Casa" 
(regexp-replace r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza Mi Mi Mi" 
(regexp-replace* r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza My Mi Mi" 
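The special forms allowed in insert-string can be sketched in the same style. These extra examples are illustrative additions with invented patterns and strings; the results follow from the substitution rules given above for ``&'', ``\n'', ``\&'', and ``\$''.

```scheme
;; & and \0 both stand for the whole match; \1 is the first sub-expression.
(regexp-replace "c(a+)t" "the caat sat" "<&>")     ; => "the <caat> sat"
(regexp-replace "c(a+)t" "the caat sat" "<\\0>")   ; => "the <caat> sat"
(regexp-replace "c(a+)t" "the caat sat" "[\\1]")   ; => "the [aa] sat"

;; \& inserts a literal ampersand.
(regexp-replace "c(a+)t" "the caat sat" "\\&")     ; => "the & sat"

;; A sub-expression that was not used in the match inserts nothing.
(regexp-replace "(a)|(b)" "b" "[\\1\\2]")          ; => "[b]"

;; \$ inserts nothing; here it separates \1 from a literal 0.
(regexp-replace "(a)" "a" "\\1\\$0")               ; => "a0"

;; regexp-replace* does not rescan the text it has inserted.
(regexp-replace* "a" "banana" "aa")                ; => "baanaanaa"

;; With no match, the string comes back unchanged.
(regexp-replace "x" "banana" "y")                  ; => "banana"
```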


15 The backslash is a character in the string, so an extra backslash is required to specify the string as a Scheme constant. For example, the Scheme constant "\\1" is ``\1''.