string manipulation - Convert recursive RegularExpression to StringExpression?

Consider the following expression for matching balanced brackets:

StringPosition["[[hello][[hi]][hey]ahoy][]", RegularExpression@"(?P<0>\\[([^\\[\\]]|(?P>0))*\\])"]

{{1, 24}, {2, 8}, {9, 14}, {10, 13}, {15, 19}, {25, 26}}

It uses features for recursive pattern matching. This answer by WReach explains the general structure using similar example. This comment by Leonid Shifrin states that recursive StringExpression is possible but its performance is worse than StringExpression's. And this question by Mr.Wizard shows that StringExpression can perform faster than RegularExpression.
So I'm eagerly interested if it's possible to build recursive StringExpressions, and if true, then how? What have I tried:

bb := "[" ~~ (Except@{"[", "]"} | bb) ... ~~ "]";
StringPosition["[[hello][[hi]][hey]ahoy][]", bb]

$RecursionLimit::reclim: Recursion depth of 1024 exceeded. >>

Ok, that was wrong, I think I know why, but I do not see another approach. :|

Answer

TL;DR

Recursive expressions are possible using native string patterns in Mathematica, but can be difficult to write correctly, and might perform very poorly.

Difficult To Write?

As @Leonid's solution shows, it is possible to express recursive patterns without resorting to regular expressions. However, recursive string patterns can be more difficult to construct.

Consider the pattern bb from @Leonid's response:

bb = "[" ~~ ((Except["[" | "]"] ...) | (patt__ /; StringMatchQ[patt, bb])) ~~ "]";

This is not actually equivalent to the original regular expression as can be seen from the presence of the string position {1, 24} in the original answer, but its absence when using bb. Here is a smaller example that demonstrates the problem:

re = RegularExpression@"(?P<0>\\[([^\\[\\]]|(?P>0))*\\])";

StringPosition["[[x][y]]", re]
(* {{1, 8}, {2, 4}, {5, 7}} *)


StringPosition["[[x][y]]", bb]
(* {{2, 4}, {5, 7}} *)

The outermost match is lost because bb does not allow multiple occurrences of the recursive pattern:

StringMatchQ["[[x][x]]", re]
(* True *)

StringMatchQ["[[x][x]]", bb]
(* False *)

That seems to be easy to fix...

bbb = "[" ~~ (Except["[" | "]"] | (patt__ /; StringMatchQ[patt, bbb])) ... ~~ "]";

StringMatchQ["[[x][x]]", bbb]
(* True *)

... but perhaps it is not so easy:

StringMatchQ["[[x][y]]", re]
(* True *)


StringMatchQ["[[x][y]]", bbb]
(* False *)

StringPosition["[[x][y]]", bbb]
(* {{2, 4}, {5, 7}} *)

Expressing Recursive Repetition is Hard

The problem is that once we name a subpattern patt, its repetition using ... requires that each submatch be the same. That is why [[x][x]] matches, but [[x][y]] does not.

In normal (non-string) pattern-matching, the work-around is simple. We would use a pattern test instead of a condition, like this: __?(StringMatchQ[#, bbb]&). This allows the match to remain anonymous, and each repetition to have a different value. Unfortunately, the documentation for string patterns tells us that pattern tests are applied to each matched character separately whereas we want to test the entire match. Consider:

StringMatchQ["a", __?(# === "a" &)]

(* True *)

StringMatchQ["abc", __?(# === "abc" &)]
(* False *)

StringMatchQ["aaaaaa", __?(# === "a"&)]
(* True *)

Thus the pattern-test solution is closed to us. It means that we must express independent repetition directly as recursion:

x1 = "[" ~~ (a___ /; StringMatchQ[a, x2]) ~~ "]";


x2 = Except["["|"]"]... ~~
     ("" | ((a:("["~~__) /; StringMatchQ[a, x1]) ~~ (b___ /; StringMatchQ[b, x2])));

This fixes the observed problems:

StringMatchQ["[[x][y]]", x1]
(* True *)

StringPosition["[[x][y]]", x1]
(* {{1, 8}, {2, 4}, {5, 7}} *)

BUT...

Say Goodbye to PCRE Optimization

These patterns are phenomenally slow to run. Consider the original question:

StringPosition["[[hello][[hi]][hey]ahoy][]", x1] // Timing
(* {99.435037, {{1, 24}, {2, 8}, {9, 14}, {10, 13}, {15, 19}, {25, 26}}} *)

99 seconds! Why is it so slow, even when we took great care to ensure that each matching path would consume at least one character?

Normally, PCRE compiles a regular expression into a highly efficient finite state machine that optimizes back-tracking. It identifies common back-tracking states and eliminates the need to constantly revisit the same potential matches over and over again. However, when we use conditions in string patterns, we force PCRE to call back into Mathematica to evaluate those conditions. The PCRE compiler has no information about the nature of those callback expressions, so it cannot determine whether a backtracking path can be abandoned. The net result is that all possible backtracking paths need to be explored, a process which can exhibit exponential time complexity. This is why our revised pattern is so slow. 100 seconds for a 26 character string -- imagine how long it would take to scan a text file!

Summary

To summarize...

It is possible to implement recursive patterns using native Mathematica string patterns, but such usage:

can be more difficult to express (an ironic statement given the "write only" nature of regular expressions), and

might block the PCRE optimization strategies, causing poor performance.

Blog

Search This Blog

string manipulation - Convert recursive RegularExpression to StringExpression?

Comments

Post a Comment

Popular posts from this blog

front end - keyboard shortcut to invoke Insert new matrix

How to thread a list

dynamic - How can I make a clickable ArrayPlot that returns input?