Counting the number of instances of one sub-string within a given string within a lower- and upper-bound gap of a second sub-string

Please consider the situation where I give you a list where entries are drawn from a fixed set of characters:

 alphabet = {0,1,2};
 numElements = 10^3;
 bigString = StringJoin[Map[ToString, RandomChoice[alphabet, numElements]]];

I provide you two strings: string1 and string2. I'd like to count the number of instances where string2 occurs within a lower-bound and upper-bound "distance" of string1, and by "distance" I mean this in terms of the count for the number of characters in the gap region between string1 and string2 (i.e. the number of characters counting from immediately after the last character in string1 and the immediately before the first character in string2 if string1 occurs before string2, and vice versa if string2 occurs before string1). There may be multiple instances of string1 and string2, so in terms of overcounting, each instance of string2 should only be considered a single possible "hit" (if its within the lower- and upper-bound cutoff distance of string1).

Is there a built in function, or an easy way to do this?

As Leonid Shifrin requests, let's construct a small case example:

string1 = "1111";
string2 = "1221";

lowerboundDistance = 3;
upperboundDistance = 10;

bigString = "000000111100012210111100000122100000001111000000000001221";

Now, in the above string, there are three instances of string2, so at most we can have an output count of 3. From left-to-right, here are all possible instances where string1 and string2 are separated by a gap of at least 3 characters and at most 10 characters:

[1] "11110001221"

[2] "1111000001221"
[3] "122100000001111"

Notice however that instances [2] and [3] correspond to the same instance of string2, so we only increase the count by 1 after seeing both of these instances. The final count is therefore 2.

To clarify a particular point, note that:

string1 = "1111";
string2 = "1221";

lowerboundDistance = 3;
upperboundDistance = 10;


bigString = "11110001221001221";

Should give an output count of 2 considering that "11110001221" and "11110001221001221" (abstracted as "1111.........1221") represent instances of string1 and string2 within the lower- and upperbound gap specifications.

Answer

The code below will identify "hits" using the requested rules. It works by computing the intersection of two sets of positions: 1) the positions of "string2" and 2) the allowable positions determined by leading and trailing "windows" relative to each occurrence of "string1". The count function will return the number of hits and position give the positions of the hits, if desired.

count[test_, floater_, anchor_, min_, max_] :=
    Length @ positions[test, floater, anchor, min, max]

positions[test_, floater_, anchor_, min_, max_] :=

  anchorPositions[test, anchor] ~Intersection~
  allowedPositions[test, floater, anchor, min, max]

anchorPositions[test_, anchor_] := StringPosition[test, anchor][[All, 1]]

allowedPositions[testString_, floater_, anchor_, min_, max_] :=
  window[#, StringLength@anchor, min, max]& /@ StringPosition[testString, floater] //
  Union @@ # &

window[floaterPosition:{_, _}, anchorLength_, min_, max_] :=

  Flatten @ Apply[
    Range
  , floaterPosition + {{-max, -min} - anchorLength, 1 + {min, max}}
  , {1}
  ]

Explanation

For purposes of discussion, let's call the string that we wish to count the "anchor" string. We will use the term "floater" string to refer to the other string that must appear within a stipulated distance from an anchor. We will use the following test data from the question:

$floater= "1111";
$anchor = "1221";


$min = 3;
$max = 10;

$testString = "000000111100012210111100000122100000001111000000000001221";

We will start by defining a function that returns the start positions of the anchors within a test string.

anchorPositions[test_, anchor_] := StringPosition[test, anchor][[All, 1]]

For example:

anchorPositions[$testString, $anchor]
(* {14, 28, 54} *)

Next, we define a function that will return the allowable positions for an anchor given that a particular floater is in a known position:

window[floaterPosition:{_, _}, anchorLength_, min_, max_] :=
  Flatten @ Apply[
    Range
  , floaterPosition + {{-max, -min} - anchorLength, 1 + {min, max}}
  , {1}
  ]

For example, if we know the position of a floater in a test string, then the allowable start positions for an anchor lie in two windows. One window is before the floater and the other after. If a floater extends between positions 20-23 and the distance band is 3-10, then an anchor of length 4 must fall in either a leading window covering positions 6-13 or a trailing window covering positions 27-34 (taking into account the min/max spacing and the length of the anchor string).

window[{20, 23}, 4, 3, 10]
(* {6, 7, 8, 9, 10, 11, 12, 13, 27, 28, 29, 30, 31, 32, 33, 34} *)

This helper function can be used to determine all allowable positions for anchors in a test string, being the union of the windows associated with all floater strings:

allowedPositions[testString_, floater_, anchor_, min_, max_] :=
  window[#, StringLength@anchor, min, max]& /@ StringPosition[testString, floater] //
  Union @@ # &

Example:

allowedPositions[$testString, $floater, $anchor, $min, $max]
(* {-7, -6, -5, -4, -3, -2, -1, 0, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15,
    16, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 46, 47,
    48, 49, 50, 51, 52, 53} *)

Now we have everything we need to determine the positions of the allowed anchors within a test string. They are simply the intersection of two position sets: the set of anchor string and the set of allowable anchor positions:

positions[test_, floater_, anchor_, min_, max_] :=
  anchorPositions[test, anchor] ~Intersection~
  allowedPositions[test, floater, anchor, min, max]

Example:

positions[$testString, $floater, $anchor, $min, $max]
(* {14, 28} *)

Finally, the anchor count is the length of this position list:

count[test_, floater_, anchor_, min_, max_] :=
    Length @ positions[test, floater, anchor, min, max]

Example:

count[$testString, $floater, $anchor, $min, $max]
(* 2 *)

Here are the test cases from the question:

count[
  "000000111100012210111100000122100000001111000000000001221"
, "1111", "1221", 3, 10
]
(* 2 *)


count[
  "11110001221001221"
, "1111", "1221", 3, 10
]
(* 2 *)

plotting - Adding a thick curve to a regionplot

Suppose we have the following simple RegionPlot: f[x_] := 1 - x^2 g[x_] := 1 - 0.5 x^2 RegionPlot[{y < f[x], f[x] < y < g[x], y > g[x]}, {x, 0, 2}, {y, 0, 2}] Now I'm trying to change the curve defined by $y=g[x]$ into a thick black curve, while leaving all other boundaries in the plot unchanged. I've tried adding the region $y=g[x]$ and playing with the plotstyle, which didn't work, and I've tried BoundaryStyle, which changed all the boundaries in the plot. Now I'm kinda out of ideas... Any help would be appreciated! Answer With f[x_] := 1 - x^2 g[x_] := 1 - 0.5 x^2 You can use Epilog to add the thick line: RegionPlot[{y < f[x], f[x] < y < g[x], y > g[x]}, {x, 0, 2}, {y, 0, 2}, PlotPoints -> 50, Epilog -> (Plot[g[x], {x, 0, 2}, PlotStyle -> {Black, Thick}][[1]]), PlotStyle -> {Directive[Yellow, Opacity[0.4]], Directive[Pink, Opacity[0.4]],

Blog

Search This Blog

Counting the number of instances of one sub-string within a given string within a lower- and upper-bound gap of a second sub-string

Comments

Post a Comment

Popular posts from this blog

plotting - Filling between two spheres in SphericalPlot3D

plotting - Plot 4D data with color as 4th dimension

plotting - Adding a thick curve to a regionplot