Skip to main content

performance tuning - Highlighting pattern strings in a large text


I would like to apply Style to strings that are matched, for example, with one or more specified patterns or regular expressions, a string or a list of both, whatever.


I've already written a function that works fine with small text samples, but it is slow and it has problems to export ( a RTF ) correctly, cause a formatting problem.


Highlight[pattern_, style_] := 
s_String :> (Row[{##}] & @@
StringReplace[s, t : pattern :> style[t], IgnoreCase -> False])

So for example try it with a short text:


txt = Import["http://www.google.com"] // 
StringReplace[#, {"\t", "\r\n", "\n"} -> ""] &;


txt /. Highlight[RegularExpression["[A-Z][a-z]+"], Style[#, Blue] &]

The output looks like:


string replace with styles


My scenario:



  • Issue 1: the input text is a long string, i.e ~500KB so far.

  • Issue 2: I want to be able send a list of patterns without drastically impacting the runtime or the normal behavior of the kernel. (Right now when I render more than one pattern in a large text sample with the current code, the kernel crashes).

  • Issue 3: It seems that the programming using these pattern replaces are so slow.



Is there a better way to run this pattern matching, or a more efficient approach to find and highlight specific text?


UPDATE: Sharing some kind of my input based on the comments:


txt = ExampleData[{"Text", "AeneidEnglish"}];
somewords =
DictionaryLookup[RegularExpression[".*tion"]] // Take[#, 200] &;

In[536]:= AbsoluteTiming[
txt /. Highlight[somewords, Style[#, Blue] &];]


Out[536]= {7.84513, Null}

Based on a list of words, and RegularExpression for the pattern argument, it seems to be the method is doing the necessary work, so far. Apparently this method is the fastest.



Answer



There is another way that is on my machine almost 500x faster then your solution. The idea is to look how Mathematica represents colored strings and use this directly.


When we colorize an input string by selecting text and using the Format menu, we can create something like this



Mathematica graphics



Now, press Ctrl+Shift+E to see the underlying expression.



Cell[BoxData["\"\  \!\(\*StyleBox[\"my\",FontColor->RGBColor[1, 0, 0]]\)
friend\>\""], "Input"]

I have put the important part in the second like and you see, it's only an inline style-box that is used.


In your updated question, you used a list of words to highlight and for this task, there is another approach useful:



  • we create a function that takes a string and returns the same colorized string when it is in your list of words. Otherwise, it just returns the same string

  • we split your input into words and apply this function to each word

  • we rebuild all words into a string again which now contains normal text and highlighted words.



For this purpose, I use a Module that on-the-fly creates local functions that do the highlighting. This is important, because with each call to highlight you might want to provide a different list of words to highlight. Therefore, the function doHighlight needs to be rebuilt on every call.


Sounds expensive? It is not and the implementation is only a few lines long:


highlight[txt_, words_] := Module[{colorize, doHighlight},
colorize[str_] := "\!\(\*StyleBox[\"" <> str <>
"\",FontColor->RGBColor[0, 0, 1]]\)";
SetAttributes[doHighlight, {Listable}];
(doHighlight[#] := colorize[#]) & /@ words;
doHighlight[s_] := s;
StringRiffle[doHighlight[StringSplit[txt]]]

]

Let's test it



Mathematica graphics



Now let us time this with the same input that Peter Roberge used. His function needed 3.7 seconds on my machine.


txt = ExampleData[{"Text", "AeneidEnglish"}];
somewords = DictionaryLookup[RegularExpression["[A-Z][a-z]+"]];


output = highlight[txt, somewords]; // AbsoluteTiming
(* {0.168501, Null} *)

And the text is highlighted as expected


enter image description here


Since you were brave enough to read until the end, let me tell you that there is one significant drawback: Mathematica has a bug and does not export colored strings to rtf correctly. At least on my machine, the text is not colorized in the final rtf.


Update


In case you really need to replace not a fixed word, but an expression you need to use StringReplace because it is possible you match more than one word (maybe a group of words). Therefore, splitting the text into words won't always work.


Nevertheless, the basic idea of my answer stays the same: We don't use Row and Style, but we inject inline string styles and transform a string into string.


The function itself becomes very easy:



highlight2[txt_, patterns_] := 
StringReplace[txt, str : (Alternatives @@ patterns) :>
"\!\(\*StyleBox[\"" <> str <> "\",FontColor->RGBColor[0, 0, 1]]\)"
]

Here a short test with different kinds of patterns:


highlight2["Hello bear, what are you doing here?", 
{ "b" ~~ LetterCharacter ..,
_ ~~ "o" ~~ _,
RegularExpression["[A-Z][a-z]+"],

"re?"
}]


Mathematica graphics



Update to provide custom style


Providing a custom style is possible too. You can just add this as parameter and the only thing you have to do inside the function is to transform this into a string and put it at the right place.


That being said:


highlight2[txt_, patterns_] := highlight2[txt, patterns, {Blue}];

highlight2[txt_, patterns_, {style__}] :=
StringReplace[txt,
str : (Alternatives @@ patterns) :>
"\!\(\*StyleBox[\"" <> str <> "\"," <>
StringRiffle[ToString /@ {style}, ", "]
<> "]\)"]

You can now give a list of style directives as last argument. When you leave them out, then the matching text becomes blue.


highlight2["Hello bear, what are you doing here?", {"b" ~~ 
LetterCharacter .., _ ~~ "o" ~~ _},

{30, Red, Italic}]

Mathematica graphics


Comments

Popular posts from this blog

plotting - Plot 4D data with color as 4th dimension

I have a list of 4D data (x position, y position, amplitude, wavelength). I want to plot x, y, and amplitude on a 3D plot and have the color of the points correspond to the wavelength. I have seen many examples using functions to define color but my wavelength cannot be expressed by an analytic function. Is there a simple way to do this? Answer Here a another possible way to visualize 4D data: data = Flatten[Table[{x, y, x^2 + y^2, Sin[x - y]}, {x, -Pi, Pi,Pi/10}, {y,-Pi,Pi, Pi/10}], 1]; You can use the function Point along with VertexColors . Now the points are places using the first three elements and the color is determined by the fourth. In this case I used Hue, but you can use whatever you prefer. Graphics3D[ Point[data[[All, 1 ;; 3]], VertexColors -> Hue /@ data[[All, 4]]], Axes -> True, BoxRatios -> {1, 1, 1/GoldenRatio}]

plotting - Filling between two spheres in SphericalPlot3D

Manipulate[ SphericalPlot3D[{1, 2 - n}, {θ, 0, Pi}, {ϕ, 0, 1.5 Pi}, Mesh -> None, PlotPoints -> 15, PlotRange -> {-2.2, 2.2}], {n, 0, 1}] I cant' seem to be able to make a filling between two spheres. I've already tried the obvious Filling -> {1 -> {2}} but Mathematica doesn't seem to like that option. Is there any easy way around this or ... Answer There is no built-in filling in SphericalPlot3D . One option is to use ParametricPlot3D to draw the surfaces between the two shells: Manipulate[ Show[SphericalPlot3D[{1, 2 - n}, {θ, 0, Pi}, {ϕ, 0, 1.5 Pi}, PlotPoints -> 15, PlotRange -> {-2.2, 2.2}], ParametricPlot3D[{ r {Sin[t] Cos[1.5 Pi], Sin[t] Sin[1.5 Pi], Cos[t]}, r {Sin[t] Cos[0 Pi], Sin[t] Sin[0 Pi], Cos[t]}}, {r, 1, 2 - n}, {t, 0, Pi}, PlotStyle -> Yellow, Mesh -> {2, 15}]], {n, 0, 1}]

plotting - Mathematica: 3D plot based on combined 2D graphs

I have several sigmoidal fits to 3 different datasets, with mean fit predictions plus the 95% confidence limits (not symmetrical around the mean) and the actual data. I would now like to show these different 2D plots projected in 3D as in but then using proper perspective. In the link here they give some solutions to combine the plots using isometric perspective, but I would like to use proper 3 point perspective. Any thoughts? Also any way to show the mean points per time point for each series plus or minus the standard error on the mean would be cool too, either using points+vertical bars, or using spheres plus tubes. Below are some test data and the fit function I am using. Note that I am working on a logit(proportion) scale and that the final vertical scale is Log10(percentage). (* some test data *) data = Table[Null, {i, 4}]; data[[1]] = {{1, -5.8}, {2, -5.4}, {3, -0.8}, {4, -0.2}, {5, 4.6}, {1, -6.4}, {2, -5.6}, {3, -0.7}, {4, 0.04}, {5, 1.0}, {1, -6.8}, {2, -4.7}, {3, -1....