I came across some unexpected behaviour today when using the Longest
function when trying to do some pattern matching.
StringCases[#, a___ ~~ b_ ~~ Longest[c___] ~~ b_ ~~ d___ -> {a, b, c, d}] &@"abcbba"
The intent is to find the two identical characters which are the furthest apart in the string. I would expect this to return {a,b,cb,a}
, since that would be the longest possible distance between two identical characters. However, it instead returns {abc,b, ,a}
. I don't understand why it's not finding the longer possibility, especially when default behaviour when matching patterns seems to be that earlier patterns try to match the shortest possible sequences. Using c__
instead of c___
makes the behaviour more like the expected behaviour for this case, but I do want the pattern to work when the only sets of identical characters are adjacent to each other.
What am I missing?
Answer
There are a number of subtle details at work here. Perhaps the prominent one is that in string expressions, unlike expression patterns, Longest
and Shortest
do not actually refer to the maximal or miminal possible match in a string. Rather, they correspond to the regular expression concepts of greedy and ungreedy.
Furthest Apart Identical Characters?
The question states that the intent is to find the two identical characters that are furthest apart in the string, and that the expected answer is {a,b,cb,a}
. However, the characters that actually meet this criterion are the two a's at the ends of the string, which should yield a result of {,a,bcbb,}
. This is a consequence of the observation that the subpatterns a___
and d___
are permitted to match empty strings. However, we do not get this result. Why?
The reason is that the initial subpattern is "greedy", meaning that it will match the maximal number of characters that does not cause the overall pattern to fail. We can see this is so from the actual result obtained: {abc,b,,a}
. The pattern a___
has greedily matched "abc"
, leaving only the remnant "bba"
for the rest of the pattern to match. The leftover adjacent b's become our paired characters.
We do not want a___
to be greedy, so we use Shortest[a___]
:
StringCases["abcbba",
Shortest[a___] ~~ b_ ~~ Longest[c___] ~~ b_ ~~ d___ -> {a, b, c, d} ]
(* {{,a,bcbb,}} *)
Shortest
is the string pattern syntax to make a repetition "ungreedy" (also sometimes known as "reluctant" or "lazy"). Ungreedy repetition matches the minimal number of characters while still giving the rest of the pattern a chance to succeed.
The use of Longest
here is redundant since it is the default:
StringCases["abcbba",
Shortest[a___] ~~ b_ ~~ c___ ~~ b_ ~~ d___ -> {a, b, c, d} ]
(* {{,a,bcbb,}} *)
If we change the subject string so that it does not start and end with the same character, we get a result very much like the one desired by the question:
StringCases["abcbbz",
Shortest[a___] ~~ b_ ~~ c___ ~~ b_ ~~ d___ -> {a, b, c, d} ]
(* {{a,b,cb,z}} *)
StringCases["123bcbb45",
Shortest[a___] ~~ b_ ~~ c___ ~~ b_ ~~ d___ -> {a, b, c, d} ]
(* {{123,b,cb,45}} *)
How does all this relate to regular expressions?
Mathematica string patterns are a high-level expression-based syntax that is implemented using PCRE, a regular expression engine. The concept of greediness comes from that engine.
We can see how Longest
and Shortest
compile down to regular expressions:
StringPattern`PatternConvert[a___] // First
(* "(?ms)(.*)" *)
StringPattern`PatternConvert[Longest[a___]] // First
"(?ms)(.*)"
StringPattern`PatternConvert[Shortest[a___]] // First
"(?ms)(.*?)"
Close inspection shows that Shortest
changes the regular expression from .*
to .*?
. The trailing ?
is the regular expression syntax to turn off greediness.
The PCRE notion of greediness does not have anything to do with whether a match is the overall longest or shortest possible match within a string. Rather it controls whether a repeating pattern will consume the maximal (greedy) or minimal (ungreedy) number of characters without causing the rest of the pattern to fail (if possible). Subject to this behaviour, scanning proceeds from left to right. The first valid match is always used unless the remainder of the pattern fails. Such failure causes backtracking to try other possible matches.
The following examples demonstrate that extreme pattern lengths are not a consideration for string patterns:
StringCases["0111011110", Shortest["1"..]]
(* {1,1,1,1,1,1,1} *)
StringCases["0111011110", "1"..]
(* {111,1111} *)
StringCases["0111011110", Longest["1"..]]
(* {111,1111} *)
StringReplace["0111011110", Longest["1"..] :> "x"]
(* 0x0x0 *)
Observe how the presence of Longest
does not restrict its operation to the absolute longest match. Also observe that since none of our patterns used StartOfString
/EndOfString
to anchor to the string ends, multiple matches are possible.
Why are Longest/Shortest different between string and expression patterns?
In contrast to string patterns, expression patterns do have a notion of the absolute longest match:
{0, 1, 1, 1, 0, 1, 1, 1, 1, 0} /. {___, Longest[x : 1 ..], ___} :> {x}
(* {1, 1, 1, 1} *)
Also, for expression patterns the default is Shortest
instead of Longest
:
{0, 1, 1, 1, 0, 1, 1, 1, 1, 0} /. {___, x:1.., ___} :> {x}
(* {1} *)
{0, 1, 1, 1, 0, 1, 1, 1, 1, 0} /. {___, Shortest[x:1..], ___} :> {x}
(* {1} *)
Why are there such differences?
The different defaults are due to history. Mathematica has used Shortest
as the default for expression patterns from the beginning. String patterns were added relatively recently, and they follow the default convention that virtually all regular expression engines have adopted.
As for the semantic differences, history is still a factor but there are two other significant reasons as well.
First of all, repeating expression patterns can only work on sequences. Sequences are not quite first class citizens as they are always hosted within a full expression "container". Patterns must match the containers that hold sequences instead of the sequences themselves. This is illustrated by the /.
examples above. Once such a container has been matched or replaced, further scanning within that container stops. Strings have no such notion of "container", so string patterns are under no such restriction. This "container" behaviour means that expression patterns are always anchored in some sense. Very recently, version 10.3 introduced functions like SequenceCases
to loosen these restrictions a little.
Second, there is the matter of practicality. String patterns are based upon PCRE. PCRE has no notion of the longest overall match. Therefore neither do string patterns.
A good design decision?
On account of the semantic differences, one might argue that the names Longest
and Shortest
were misleading choices for the string pattern syntax. Perhaps they should instead have been something like Greedy
and Ungreedy
. That would at least suggest to a reader that they have different semantics. But the differences are subtle and there is always pressure to resist adding new symbols to the system namespace (*Data
symbol notwithstanding). A design choice was made, and it is defensible. Hindsight is 20/20.
Comments
Post a Comment