Skip to main content

string manipulation - Unexpected behaviour when pattern matching with Longest


I came across some unexpected behaviour today when using the Longest function when trying to do some pattern matching.


StringCases[#, a___ ~~ b_ ~~ Longest[c___] ~~ b_ ~~ d___ -> {a, b, c, d}] &@"abcbba"

The intent is to find the two identical characters which are the furthest apart in the string. I would expect this to return {a,b,cb,a}, since that would be the longest possible distance between two identical characters. However, it instead returns {abc,b, ,a}. I don't understand why it's not finding the longer possibility, especially when default behaviour when matching patterns seems to be that earlier patterns try to match the shortest possible sequences. Using c__ instead of c___ makes the behaviour more like the expected behaviour for this case, but I do want the pattern to work when the only sets of identical characters are adjacent to each other.



What am I missing?



Answer



There are a number of subtle details at work here. Perhaps the prominent one is that in string expressions, unlike expression patterns, Longest and Shortest do not actually refer to the maximal or miminal possible match in a string. Rather, they correspond to the regular expression concepts of greedy and ungreedy.


Furthest Apart Identical Characters?


The question states that the intent is to find the two identical characters that are furthest apart in the string, and that the expected answer is {a,b,cb,a}. However, the characters that actually meet this criterion are the two a's at the ends of the string, which should yield a result of {,a,bcbb,}. This is a consequence of the observation that the subpatterns a___ and d___ are permitted to match empty strings. However, we do not get this result. Why?


The reason is that the initial subpattern is "greedy", meaning that it will match the maximal number of characters that does not cause the overall pattern to fail. We can see this is so from the actual result obtained: {abc,b,,a}. The pattern a___ has greedily matched "abc", leaving only the remnant "bba" for the rest of the pattern to match. The leftover adjacent b's become our paired characters.


We do not want a___ to be greedy, so we use Shortest[a___]:


StringCases["abcbba",
Shortest[a___] ~~ b_ ~~ Longest[c___] ~~ b_ ~~ d___ -> {a, b, c, d} ]


(* {{,a,bcbb,}} *)

Shortest is the string pattern syntax to make a repetition "ungreedy" (also sometimes known as "reluctant" or "lazy"). Ungreedy repetition matches the minimal number of characters while still giving the rest of the pattern a chance to succeed.


The use of Longest here is redundant since it is the default:


StringCases["abcbba",
Shortest[a___] ~~ b_ ~~ c___ ~~ b_ ~~ d___ -> {a, b, c, d} ]

(* {{,a,bcbb,}} *)

If we change the subject string so that it does not start and end with the same character, we get a result very much like the one desired by the question:



StringCases["abcbbz",
Shortest[a___] ~~ b_ ~~ c___ ~~ b_ ~~ d___ -> {a, b, c, d} ]

(* {{a,b,cb,z}} *)

StringCases["123bcbb45",
Shortest[a___] ~~ b_ ~~ c___ ~~ b_ ~~ d___ -> {a, b, c, d} ]

(* {{123,b,cb,45}} *)


How does all this relate to regular expressions?


Mathematica string patterns are a high-level expression-based syntax that is implemented using PCRE, a regular expression engine. The concept of greediness comes from that engine.


We can see how Longest and Shortest compile down to regular expressions:


StringPattern`PatternConvert[a___] // First

(* "(?ms)(.*)" *)

StringPattern`PatternConvert[Longest[a___]] // First

"(?ms)(.*)"


StringPattern`PatternConvert[Shortest[a___]] // First

"(?ms)(.*?)"

Close inspection shows that Shortest changes the regular expression from .* to .*?. The trailing ? is the regular expression syntax to turn off greediness.


The PCRE notion of greediness does not have anything to do with whether a match is the overall longest or shortest possible match within a string. Rather it controls whether a repeating pattern will consume the maximal (greedy) or minimal (ungreedy) number of characters without causing the rest of the pattern to fail (if possible). Subject to this behaviour, scanning proceeds from left to right. The first valid match is always used unless the remainder of the pattern fails. Such failure causes backtracking to try other possible matches.


The following examples demonstrate that extreme pattern lengths are not a consideration for string patterns:


StringCases["0111011110", Shortest["1"..]]
(* {1,1,1,1,1,1,1} *)


StringCases["0111011110", "1"..]
(* {111,1111} *)

StringCases["0111011110", Longest["1"..]]
(* {111,1111} *)

StringReplace["0111011110", Longest["1"..] :> "x"]
(* 0x0x0 *)


Observe how the presence of Longest does not restrict its operation to the absolute longest match. Also observe that since none of our patterns used StartOfString/EndOfString to anchor to the string ends, multiple matches are possible.


Why are Longest/Shortest different between string and expression patterns?


In contrast to string patterns, expression patterns do have a notion of the absolute longest match:


{0, 1, 1, 1, 0, 1, 1, 1, 1, 0} /. {___, Longest[x : 1 ..], ___} :> {x}

(* {1, 1, 1, 1} *)

Also, for expression patterns the default is Shortest instead of Longest:


{0, 1, 1, 1, 0, 1, 1, 1, 1, 0} /. {___, x:1.., ___} :> {x}
(* {1} *)


{0, 1, 1, 1, 0, 1, 1, 1, 1, 0} /. {___, Shortest[x:1..], ___} :> {x}
(* {1} *)

Why are there such differences?


The different defaults are due to history. Mathematica has used Shortest as the default for expression patterns from the beginning. String patterns were added relatively recently, and they follow the default convention that virtually all regular expression engines have adopted.


As for the semantic differences, history is still a factor but there are two other significant reasons as well.


First of all, repeating expression patterns can only work on sequences. Sequences are not quite first class citizens as they are always hosted within a full expression "container". Patterns must match the containers that hold sequences instead of the sequences themselves. This is illustrated by the /. examples above. Once such a container has been matched or replaced, further scanning within that container stops. Strings have no such notion of "container", so string patterns are under no such restriction. This "container" behaviour means that expression patterns are always anchored in some sense. Very recently, version 10.3 introduced functions like SequenceCases to loosen these restrictions a little.


Second, there is the matter of practicality. String patterns are based upon PCRE. PCRE has no notion of the longest overall match. Therefore neither do string patterns.


A good design decision?



On account of the semantic differences, one might argue that the names Longest and Shortest were misleading choices for the string pattern syntax. Perhaps they should instead have been something like Greedy and Ungreedy. That would at least suggest to a reader that they have different semantics. But the differences are subtle and there is always pressure to resist adding new symbols to the system namespace (*Data symbol notwithstanding). A design choice was made, and it is defensible. Hindsight is 20/20.


Comments

Popular posts from this blog

plotting - Plot 4D data with color as 4th dimension

I have a list of 4D data (x position, y position, amplitude, wavelength). I want to plot x, y, and amplitude on a 3D plot and have the color of the points correspond to the wavelength. I have seen many examples using functions to define color but my wavelength cannot be expressed by an analytic function. Is there a simple way to do this? Answer Here a another possible way to visualize 4D data: data = Flatten[Table[{x, y, x^2 + y^2, Sin[x - y]}, {x, -Pi, Pi,Pi/10}, {y,-Pi,Pi, Pi/10}], 1]; You can use the function Point along with VertexColors . Now the points are places using the first three elements and the color is determined by the fourth. In this case I used Hue, but you can use whatever you prefer. Graphics3D[ Point[data[[All, 1 ;; 3]], VertexColors -> Hue /@ data[[All, 4]]], Axes -> True, BoxRatios -> {1, 1, 1/GoldenRatio}]

plotting - Filling between two spheres in SphericalPlot3D

Manipulate[ SphericalPlot3D[{1, 2 - n}, {θ, 0, Pi}, {ϕ, 0, 1.5 Pi}, Mesh -> None, PlotPoints -> 15, PlotRange -> {-2.2, 2.2}], {n, 0, 1}] I cant' seem to be able to make a filling between two spheres. I've already tried the obvious Filling -> {1 -> {2}} but Mathematica doesn't seem to like that option. Is there any easy way around this or ... Answer There is no built-in filling in SphericalPlot3D . One option is to use ParametricPlot3D to draw the surfaces between the two shells: Manipulate[ Show[SphericalPlot3D[{1, 2 - n}, {θ, 0, Pi}, {ϕ, 0, 1.5 Pi}, PlotPoints -> 15, PlotRange -> {-2.2, 2.2}], ParametricPlot3D[{ r {Sin[t] Cos[1.5 Pi], Sin[t] Sin[1.5 Pi], Cos[t]}, r {Sin[t] Cos[0 Pi], Sin[t] Sin[0 Pi], Cos[t]}}, {r, 1, 2 - n}, {t, 0, Pi}, PlotStyle -> Yellow, Mesh -> {2, 15}]], {n, 0, 1}]

plotting - Mathematica: 3D plot based on combined 2D graphs

I have several sigmoidal fits to 3 different datasets, with mean fit predictions plus the 95% confidence limits (not symmetrical around the mean) and the actual data. I would now like to show these different 2D plots projected in 3D as in but then using proper perspective. In the link here they give some solutions to combine the plots using isometric perspective, but I would like to use proper 3 point perspective. Any thoughts? Also any way to show the mean points per time point for each series plus or minus the standard error on the mean would be cool too, either using points+vertical bars, or using spheres plus tubes. Below are some test data and the fit function I am using. Note that I am working on a logit(proportion) scale and that the final vertical scale is Log10(percentage). (* some test data *) data = Table[Null, {i, 4}]; data[[1]] = {{1, -5.8}, {2, -5.4}, {3, -0.8}, {4, -0.2}, {5, 4.6}, {1, -6.4}, {2, -5.6}, {3, -0.7}, {4, 0.04}, {5, 1.0}, {1, -6.8}, {2, -4.7}, {3, -1....