Edit: as noted by Albert Retey, the performance difference is only seen when subexpression extraction is performed. If the following test, which omits the extraction, is used in place of the timing test further down, the timings are similar:
First@Timing[r1 = StringCases[textBig, se];]
First@Timing[r2 = StringCases[textBig, re];]
According to the documentation:
Any symbolic string pattern is first translated to a regular expression. You can see this translation by using the internal StringPattern`PatternConvert function.
StringPattern`PatternConvert["a" | "" ~~ DigitCharacter ..] // InputForm
{"(?ms)a?\\d+", {}, {}, Hold[None]}
The first element returned is the regular expression, while the rest of the elements have to do with conditions, replacement rules, and named patterns.
The regular expression is then compiled by PCRE, and the compiled version is cached for future use when the same pattern appears again. The translation from symbolic string pattern to regular expression only happens once.
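To make the structure of the converted pattern concrete, here is a minimal sketch that splits the result of StringPattern`PatternConvert into its parts for a pattern containing a named subpattern. The names pat and conv are introduced only for this illustration, and since PatternConvert is an internal, undocumented function, its return format may differ between versions.

```mathematica
(* pat and conv are hypothetical names used only for this illustration *)
pat = Shortest["(ICD-9-CM " ~~ code__ ~~ ")"];

(* PatternConvert is internal; the four-element result is assumed from the quoted output above *)
conv = StringPattern`PatternConvert[pat];

(* First element: the regular expression string handed to PCRE *)
First[conv] // InputForm

(* Remaining elements: conditions, replacement rules, and named patterns *)
Rest[conv] // InputForm
```

Comparing the remaining elements for a pattern with and without a named subpattern such as code__ shows which parts of the conversion carry the subexpression information.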
Based on this I would expect a StringExpression and the regular expression produced by PatternConvert to perform similarly, but they do not. Taking an example from this recent question, please observe:
se = Shortest["(ICD-9-CM " ~~ code__ ~~ ")"];
re = First @ StringPattern`PatternConvert[se] // RegularExpression
RegularExpression["(?ms)\\(ICD-9-CM (.+?)\\)"]
text1 = " A Vitamin D Deficiency (ICD-9-CM 268.9) (ICD-9-CM 268.9) 09/11/2015 01 ";
textBig = StringJoin @ ConstantArray[text1, 1*^6];
First@Timing[r1 = StringCases[textBig, se :> code];]
First@Timing[r2 = StringCases[textBig, re :> "$1"];]
r1 === r2
0.718
1.903
True
- Why is using the StringExpression more than twice as fast as the RegularExpression?
- Is there a way to make the RegularExpression matching run just as quickly?