Skip to main content

Counting the number of instances of one sub-string within a given string within a lower- and upper-bound gap of a second sub-string


Please consider the situation where I give you a list where entries are drawn from a fixed set of characters:


 alphabet = {0,1,2};
numElements = 10^3;
bigString = StringJoin[Map[ToString, RandomChoice[alphabet, numElements]]];

I provide you two strings: string1 and string2. I'd like to count the number of instances where string2 occurs within a lower-bound and upper-bound "distance" of string1, and by "distance" I mean this in terms of the count for the number of characters in the gap region between string1 and string2 (i.e. the number of characters counting from immediately after the last character in string1 and the immediately before the first character in string2 if string1 occurs before string2, and vice versa if string2 occurs before string1). There may be multiple instances of string1 and string2, so in terms of overcounting, each instance of string2 should only be considered a single possible "hit" (if its within the lower- and upper-bound cutoff distance of string1).


Is there a built in function, or an easy way to do this?


As Leonid Shifrin requests, let's construct a small case example:



string1 = "1111";
string2 = "1221";

lowerboundDistance = 3;
upperboundDistance = 10;

bigString = "000000111100012210111100000122100000001111000000000001221";

Now, in the above string, there are three instances of string2, so at most we can have an output count of 3. From left-to-right, here are all possible instances where string1 and string2 are separated by a gap of at least 3 characters and at most 10 characters:


[1] "11110001221"

[2] "1111000001221"
[3] "122100000001111"

Notice however that instances [2] and [3] correspond to the same instance of string2, so we only increase the count by 1 after seeing both of these instances. The final count is therefore 2.


To clarify a particular point, note that:


string1 = "1111";
string2 = "1221";

lowerboundDistance = 3;
upperboundDistance = 10;


bigString = "11110001221001221";

Should give an output count of 2 considering that "11110001221" and "11110001221001221" (abstracted as "1111.........1221") represent instances of string1 and string2 within the lower- and upperbound gap specifications.



Answer



The code below will identify "hits" using the requested rules. It works by computing the intersection of two sets of positions: 1) the positions of "string2" and 2) the allowable positions determined by leading and trailing "windows" relative to each occurrence of "string1". The count function will return the number of hits and position give the positions of the hits, if desired.


count[test_, floater_, anchor_, min_, max_] :=
Length @ positions[test, floater, anchor, min, max]

positions[test_, floater_, anchor_, min_, max_] :=

anchorPositions[test, anchor] ~Intersection~
allowedPositions[test, floater, anchor, min, max]

anchorPositions[test_, anchor_] := StringPosition[test, anchor][[All, 1]]

allowedPositions[testString_, floater_, anchor_, min_, max_] :=
window[#, StringLength@anchor, min, max]& /@ StringPosition[testString, floater] //
Union @@ # &

window[floaterPosition:{_, _}, anchorLength_, min_, max_] :=

Flatten @ Apply[
Range
, floaterPosition + {{-max, -min} - anchorLength, 1 + {min, max}}
, {1}
]

Explanation


For purposes of discussion, let's call the string that we wish to count the "anchor" string. We will use the term "floater" string to refer to the other string that must appear within a stipulated distance from an anchor. We will use the following test data from the question:


$floater= "1111";
$anchor = "1221";


$min = 3;
$max = 10;

$testString = "000000111100012210111100000122100000001111000000000001221";

We will start by defining a function that returns the start positions of the anchors within a test string.


anchorPositions[test_, anchor_] := StringPosition[test, anchor][[All, 1]]

For example:



anchorPositions[$testString, $anchor]
(* {14, 28, 54} *)

Next, we define a function that will return the allowable positions for an anchor given that a particular floater is in a known position:


window[floaterPosition:{_, _}, anchorLength_, min_, max_] :=
Flatten @ Apply[
Range
, floaterPosition + {{-max, -min} - anchorLength, 1 + {min, max}}
, {1}
]


For example, if we know the position of a floater in a test string, then the allowable start positions for an anchor lie in two windows. One window is before the floater and the other after. If a floater extends between positions 20-23 and the distance band is 3-10, then an anchor of length 4 must fall in either a leading window covering positions 6-13 or a trailing window covering positions 27-34 (taking into account the min/max spacing and the length of the anchor string).


window[{20, 23}, 4, 3, 10]
(* {6, 7, 8, 9, 10, 11, 12, 13, 27, 28, 29, 30, 31, 32, 33, 34} *)

This helper function can be used to determine all allowable positions for anchors in a test string, being the union of the windows associated with all floater strings:


allowedPositions[testString_, floater_, anchor_, min_, max_] :=
window[#, StringLength@anchor, min, max]& /@ StringPosition[testString, floater] //
Union @@ # &


Example:


allowedPositions[$testString, $floater, $anchor, $min, $max]
(* {-7, -6, -5, -4, -3, -2, -1, 0, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15,
16, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 46, 47,
48, 49, 50, 51, 52, 53} *)

Now we have everything we need to determine the positions of the allowed anchors within a test string. They are simply the intersection of two position sets: the set of anchor string and the set of allowable anchor positions:


positions[test_, floater_, anchor_, min_, max_] :=
anchorPositions[test, anchor] ~Intersection~
allowedPositions[test, floater, anchor, min, max]


Example:


positions[$testString, $floater, $anchor, $min, $max]
(* {14, 28} *)

Finally, the anchor count is the length of this position list:


count[test_, floater_, anchor_, min_, max_] :=
Length @ positions[test, floater, anchor, min, max]

Example:



count[$testString, $floater, $anchor, $min, $max]
(* 2 *)

Here are the test cases from the question:


count[
"000000111100012210111100000122100000001111000000000001221"
, "1111", "1221", 3, 10
]
(* 2 *)


count[
"11110001221001221"
, "1111", "1221", 3, 10
]
(* 2 *)

Comments

Popular posts from this blog

plotting - Plot 4D data with color as 4th dimension

I have a list of 4D data (x position, y position, amplitude, wavelength). I want to plot x, y, and amplitude on a 3D plot and have the color of the points correspond to the wavelength. I have seen many examples using functions to define color but my wavelength cannot be expressed by an analytic function. Is there a simple way to do this? Answer Here a another possible way to visualize 4D data: data = Flatten[Table[{x, y, x^2 + y^2, Sin[x - y]}, {x, -Pi, Pi,Pi/10}, {y,-Pi,Pi, Pi/10}], 1]; You can use the function Point along with VertexColors . Now the points are places using the first three elements and the color is determined by the fourth. In this case I used Hue, but you can use whatever you prefer. Graphics3D[ Point[data[[All, 1 ;; 3]], VertexColors -> Hue /@ data[[All, 4]]], Axes -> True, BoxRatios -> {1, 1, 1/GoldenRatio}]

plotting - Mathematica: 3D plot based on combined 2D graphs

I have several sigmoidal fits to 3 different datasets, with mean fit predictions plus the 95% confidence limits (not symmetrical around the mean) and the actual data. I would now like to show these different 2D plots projected in 3D as in but then using proper perspective. In the link here they give some solutions to combine the plots using isometric perspective, but I would like to use proper 3 point perspective. Any thoughts? Also any way to show the mean points per time point for each series plus or minus the standard error on the mean would be cool too, either using points+vertical bars, or using spheres plus tubes. Below are some test data and the fit function I am using. Note that I am working on a logit(proportion) scale and that the final vertical scale is Log10(percentage). (* some test data *) data = Table[Null, {i, 4}]; data[[1]] = {{1, -5.8}, {2, -5.4}, {3, -0.8}, {4, -0.2}, {5, 4.6}, {1, -6.4}, {2, -5.6}, {3, -0.7}, {4, 0.04}, {5, 1.0}, {1, -6.8}, {2, -4.7}, {3, -1....

functions - Get leading series expansion term?

Given a function f[x] , I would like to have a function leadingSeries that returns just the leading term in the series around x=0 . For example: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x)] x and leadingSeries[(1/x + 2 + (1 - 1/x^3)/4)/(4 + x)] -(1/(16 x^3)) Is there such a function in Mathematica? Or maybe one can implement it efficiently? EDIT I finally went with the following implementation, based on Carl Woll 's answer: lds[ex_,x_]:=( (ex/.x->(x+O[x]^2))/.SeriesData[U_,Z_,L_List,Mi_,Ma_,De_]:>SeriesData[U,Z,{L[[1]]},Mi,Mi+1,De]//Quiet//Normal) The advantage is, that this one also properly works with functions whose leading term is a constant: lds[Exp[x],x] 1 Answer Update 1 Updated to eliminate SeriesData and to not return additional terms Perhaps you could use: leadingSeries[expr_, x_] := Normal[expr /. x->(x+O[x]^2) /. a_List :> Take[a, 1]] Then for your examples: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x), x] leadingSeries[Exp[x], x] leadingSeries[(1/x + 2 + (1 - 1/x...