
string manipulation - Convert recursive RegularExpression to StringExpression?


Consider the following expression for matching balanced brackets:


StringPosition["[[hello][[hi]][hey]ahoy][]", RegularExpression@"(?P<0>\\[([^\\[\\]]|(?P>0))*\\])"]

{{1, 24}, {2, 8}, {9, 14}, {10, 13}, {15, 19}, {25, 26}}

It uses features for recursive pattern matching. This answer by WReach explains the general structure using a similar example. This comment by Leonid Shifrin states that a recursive StringExpression is possible but that its performance is worse than RegularExpression's. And this question by Mr.Wizard shows that StringExpression can perform faster than RegularExpression.
So I'm keen to know whether it's possible to build recursive StringExpressions and, if so, how. What I have tried:



bb := "[" ~~ (Except@{"[", "]"} | bb) ... ~~ "]";
StringPosition["[[hello][[hi]][hey]ahoy][]", bb]

$RecursionLimit::reclim: Recursion depth of 1024 exceeded. >>

Ok, that was wrong, I think I know why, but I do not see another approach. :|



Answer



TL;DR


Recursive expressions are possible using native string patterns in Mathematica, but can be difficult to write correctly, and might perform very poorly.


Difficult To Write?



As @Leonid's solution shows, it is possible to express recursive patterns without resorting to regular expressions. However, recursive string patterns can be more difficult to construct.


Consider the pattern bb from @Leonid's response:


bb = "[" ~~ ((Except["[" | "]"] ...) | (patt__ /; StringMatchQ[patt, bb])) ~~ "]";
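It may be worth noting why this definition does not hit $RecursionLimit the way the attempt in the question did. The reasoning sketched here is not spelled out in the original answers: Condition holds its arguments, so the bb inside the condition is not expanded when bb is defined, and is only looked up when the condition is tested during matching. In the question's attempt, bb sat in an unheld position and was expanded endlessly. The test strings below are illustrative only.

```mathematica
(* Condition has the HoldAll attribute, so its test stays unevaluated *)
MemberQ[Attributes[Condition], HoldAll]
(* True *)

(* The definition returns at once; the inner bb is resolved at match time *)
bb = "[" ~~ ((Except["[" | "]"] ...) | (patt__ /; StringMatchQ[patt, bb])) ~~ "]";

StringMatchQ["[x]", bb]
(* True *)

(* One level of nesting goes through the recursive branch *)
StringMatchQ["[[x]]", bb]
(* True *)
```

None of this changes what bb matches; it only explains why the definition is legal.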

The pattern bb is not actually equivalent to the original regular expression: the string position {1, 24} appears in the result shown in the question but is absent when using bb. Here is a smaller example that demonstrates the problem:


re = RegularExpression@"(?P<0>\\[([^\\[\\]]|(?P>0))*\\])";

StringPosition["[[x][y]]", re]
(* {{1, 8}, {2, 4}, {5, 7}} *)


StringPosition["[[x][y]]", bb]
(* {{2, 4}, {5, 7}} *)

The outermost match is lost because bb does not allow multiple occurrences of the recursive pattern:


StringMatchQ["[[x][x]]", re]
(* True *)

StringMatchQ["[[x][x]]", bb]
(* False *)


That seems to be easy to fix...


bbb = "[" ~~ (Except["[" | "]"] | (patt__ /; StringMatchQ[patt, bbb])) ... ~~ "]";

StringMatchQ["[[x][x]]", bbb]
(* True *)

... but perhaps it is not so easy:


StringMatchQ["[[x][y]]", re]
(* True *)


StringMatchQ["[[x][y]]", bbb]
(* False *)

StringPosition["[[x][y]]", bbb]
(* {{2, 4}, {5, 7}} *)

Expressing Recursive Repetition is Hard


The problem is that once we name a subpattern patt, its repetition using ... requires that each submatch be the same. That is why [[x][x]] matches, but [[x][y]] does not.
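The same rule is easier to see in ordinary (non-string) expression patterns. The following is a minimal sketch that is not taken from the original answers; the lists {1, 1} and {1, 2} are just illustrative inputs:

```mathematica
(* A named pattern must bind to the same value at every repetition *)
MatchQ[{1, 1}, {x_ ..}]
(* True *)

MatchQ[{1, 2}, {x_ ..}]
(* False *)

(* Attaching a condition to the named pattern does not lift the restriction *)
MatchQ[{1, 2}, {(x_ /; IntegerQ[x]) ..}]
(* False *)

(* An anonymous pattern is free to match a different value each time *)
MatchQ[{1, 2}, {_ ..}]
(* True *)
```

The named, conditioned form in the third example has the same shape as the recursive branch of bbb.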


In normal (non-string) pattern-matching, the work-around is simple. We would use a pattern test instead of a condition, like this: __?(StringMatchQ[#, bbb]&). This allows the match to remain anonymous, and each repetition to have a different value. Unfortunately, the documentation for string patterns tells us that pattern tests are applied to each matched character separately whereas we want to test the entire match. Consider:


StringMatchQ["a", __?(# === "a" &)]

(* True *)

StringMatchQ["abc", __?(# === "abc" &)]
(* False *)

StringMatchQ["aaaaaa", __?(# === "a"&)]
(* True *)

Thus the pattern-test solution is closed to us. It means that we must express independent repetition directly as recursion:


x1 = "[" ~~ (a___ /; StringMatchQ[a, x2]) ~~ "]";


x2 = Except["[" | "]"] ... ~~
   ("" | ((a : ("[" ~~ __) /; StringMatchQ[a, x1]) ~~ (b___ /; StringMatchQ[b, x2])));

This fixes the observed problems:


StringMatchQ["[[x][y]]", x1]
(* True *)

StringPosition["[[x][y]]", x1]
(* {{1, 8}, {2, 4}, {5, 7}} *)
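A single input is thin evidence, so it seems prudent to compare x1 against the regular expression on a few more strings. The following is only a sketch of such a check: the test strings are made up for illustration, and no claim is made here about what the comparison returns. The inputs are deliberately short.

```mathematica
re = RegularExpression@"(?P<0>\\[([^\\[\\]]|(?P>0))*\\])";

x1 = "[" ~~ (a___ /; StringMatchQ[a, x2]) ~~ "]";
x2 = Except["[" | "]"] ... ~~
   ("" | ((a : ("[" ~~ __) /; StringMatchQ[a, x1]) ~~ (b___ /; StringMatchQ[b, x2])));

(* Hypothetical test inputs: balanced and unbalanced, all short *)
tests = {"[]", "[x]", "[[x]]", "[[x][y]]", "[x[y]z]", "[[x]", "[x]]", "x"};

(* For each input, does x1 give the same verdict as the regular expression? *)
Table[s -> (StringMatchQ[s, x1] === StringMatchQ[s, re]), {s, tests}]
```

Any entry that comes back False would point to a remaining difference between the two patterns.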


BUT...


Say Goodbye to PCRE Optimization


These patterns are phenomenally slow to run. Consider the original question:


StringPosition["[[hello][[hi]][hey]ahoy][]", x1] // Timing
(* {99.435037, {{1, 24}, {2, 8}, {9, 14}, {10, 13}, {15, 19}, {25, 26}}} *)

99 seconds! Why is it so slow, even when we took great care to ensure that each matching path would consume at least one character?


Normally, PCRE compiles a regular expression into a highly efficient finite state machine that optimizes backtracking. It identifies common backtracking states and eliminates the need to revisit the same potential matches over and over again. However, when we use conditions in string patterns, we force PCRE to call back into Mathematica to evaluate those conditions. The PCRE compiler has no information about the nature of those callback expressions, so it cannot determine whether a backtracking path can be abandoned. The net result is that all possible backtracking paths need to be explored, a process which can exhibit exponential time complexity. This is why our revised pattern is so slow: about 100 seconds for a 26-character string. Imagine how long it would take to scan a text file!
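One way to get a feel for the callback traffic is to count how often the conditions are evaluated. The sketch below instruments copies of x1 and x2 with a counter. The names y1, y2 and calls are introduced here purely for illustration, and the resulting count is deliberately not quoted, since it has not been measured for this text.

```mathematica
calls = 0;

(* Same structure as x1 and x2, but every condition bumps the counter first *)
y1 = "[" ~~ (a___ /; (calls++; StringMatchQ[a, y2])) ~~ "]";
y2 = Except["[" | "]"] ... ~~
   ("" | ((a : ("[" ~~ __) /; (calls++; StringMatchQ[a, y1])) ~~
        (b___ /; (calls++; StringMatchQ[b, y2]))));

StringMatchQ["[[x][y]]", y1]

(* Number of times a condition was evaluated, nested matches included *)
calls
```

Resetting calls to 0 and repeating this on progressively longer strings should give an impression of how quickly the number of callbacks grows.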


Summary



To summarize...


It is possible to implement recursive patterns using native Mathematica string patterns, but such usage:



  1. can be more difficult to express (an ironic statement given the "write only" nature of regular expressions), and

  2. might block the PCRE optimization strategies, causing poor performance.

