Counting the number of instances of one sub-string within a given string within a lower- and upper-bound gap of a second sub-string
Please consider the situation where I give you a list where entries are drawn from a fixed set of characters:
alphabet = {0,1,2};
numElements = 10^3;
bigString = StringJoin[Map[ToString, RandomChoice[alphabet, numElements]]];
I provide you two strings: string1
and string2
. I'd like to count the number of instances where string2
occurs within a lower-bound and upper-bound "distance" of string1
, and by "distance" I mean this in terms of the count for the number of characters in the gap region between string1
and string2
(i.e. the number of characters counting from immediately after the last character in string1
and the immediately before the first character in string2
if string1
occurs before string2
, and vice versa if string2
occurs before string1
). There may be multiple instances of string1
and string2
, so in terms of overcounting, each instance of string2
should only be considered a single possible "hit" (if its within the lower- and upper-bound cutoff distance of string1
).
Is there a built in function, or an easy way to do this?
As Leonid Shifrin requests, let's construct a small case example:
string1 = "1111";
string2 = "1221";
lowerboundDistance = 3;
upperboundDistance = 10;
bigString = "000000111100012210111100000122100000001111000000000001221";
Now, in the above string, there are three instances of string2
, so at most we can have an output count of 3
. From left-to-right, here are all possible instances where string1
and string2
are separated by a gap of at least 3
characters and at most 10
characters:
[1] "11110001221"
[2] "1111000001221"
[3] "122100000001111"
Notice however that instances [2] and [3] correspond to the same instance of string2
, so we only increase the count by 1
after seeing both of these instances. The final count is therefore 2
.
To clarify a particular point, note that:
string1 = "1111";
string2 = "1221";
lowerboundDistance = 3;
upperboundDistance = 10;
bigString = "11110001221001221";
Should give an output count of 2
considering that "11110001221"
and "11110001221001221"
(abstracted as "1111.........1221"
) represent instances of string1
and string2
within the lower- and upperbound gap specifications.
Answer
The code below will identify "hits" using the requested rules. It works by computing the intersection of two sets of positions: 1) the positions of "string2" and 2) the allowable positions determined by leading and trailing "windows" relative to each occurrence of "string1". The count
function will return the number of hits and position
give the positions of the hits, if desired.
count[test_, floater_, anchor_, min_, max_] :=
Length @ positions[test, floater, anchor, min, max]
positions[test_, floater_, anchor_, min_, max_] :=
anchorPositions[test, anchor] ~Intersection~
allowedPositions[test, floater, anchor, min, max]
anchorPositions[test_, anchor_] := StringPosition[test, anchor][[All, 1]]
allowedPositions[testString_, floater_, anchor_, min_, max_] :=
window[#, StringLength@anchor, min, max]& /@ StringPosition[testString, floater] //
Union @@ # &
window[floaterPosition:{_, _}, anchorLength_, min_, max_] :=
Flatten @ Apply[
Range
, floaterPosition + {{-max, -min} - anchorLength, 1 + {min, max}}
, {1}
]
Explanation
For purposes of discussion, let's call the string that we wish to count the "anchor" string. We will use the term "floater" string to refer to the other string that must appear within a stipulated distance from an anchor. We will use the following test data from the question:
$floater= "1111";
$anchor = "1221";
$min = 3;
$max = 10;
$testString = "000000111100012210111100000122100000001111000000000001221";
We will start by defining a function that returns the start positions of the anchors within a test string.
anchorPositions[test_, anchor_] := StringPosition[test, anchor][[All, 1]]
For example:
anchorPositions[$testString, $anchor]
(* {14, 28, 54} *)
Next, we define a function that will return the allowable positions for an anchor given that a particular floater is in a known position:
window[floaterPosition:{_, _}, anchorLength_, min_, max_] :=
Flatten @ Apply[
Range
, floaterPosition + {{-max, -min} - anchorLength, 1 + {min, max}}
, {1}
]
For example, if we know the position of a floater in a test string, then the allowable start positions for an anchor lie in two windows. One window is before the floater and the other after. If a floater extends between positions 20-23 and the distance band is 3-10, then an anchor of length 4 must fall in either a leading window covering positions 6-13 or a trailing window covering positions 27-34 (taking into account the min/max spacing and the length of the anchor string).
window[{20, 23}, 4, 3, 10]
(* {6, 7, 8, 9, 10, 11, 12, 13, 27, 28, 29, 30, 31, 32, 33, 34} *)
This helper function can be used to determine all allowable positions for anchors in a test string, being the union of the windows associated with all floater strings:
allowedPositions[testString_, floater_, anchor_, min_, max_] :=
window[#, StringLength@anchor, min, max]& /@ StringPosition[testString, floater] //
Union @@ # &
Example:
allowedPositions[$testString, $floater, $anchor, $min, $max]
(* {-7, -6, -5, -4, -3, -2, -1, 0, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15,
16, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 46, 47,
48, 49, 50, 51, 52, 53} *)
Now we have everything we need to determine the positions of the allowed anchors within a test string. They are simply the intersection of two position sets: the set of anchor string and the set of allowable anchor positions:
positions[test_, floater_, anchor_, min_, max_] :=
anchorPositions[test, anchor] ~Intersection~
allowedPositions[test, floater, anchor, min, max]
Example:
positions[$testString, $floater, $anchor, $min, $max]
(* {14, 28} *)
Finally, the anchor count is the length of this position list:
count[test_, floater_, anchor_, min_, max_] :=
Length @ positions[test, floater, anchor, min, max]
Example:
count[$testString, $floater, $anchor, $min, $max]
(* 2 *)
Here are the test cases from the question:
count[
"000000111100012210111100000122100000001111000000000001221"
, "1111", "1221", 3, 10
]
(* 2 *)
count[
"11110001221001221"
, "1111", "1221", 3, 10
]
(* 2 *)
Comments
Post a Comment