Skip to main content

fitting - Generating an "average curve" from a dense set of "semicontinuous curves" which clear trends


This question is related to a previous one of mine: Partitioning a superset of coordinates into subsets that generate continuous curves, where the challenge was to partition a data set into two clusters representing distinct "semicontinuous curves" (we mean this in the loose sense that a ListPlot generated two curves that by eye appeared to be continuous). The user rm -rf illustrated and demonstrated that FindClusters could very nice solve the problem using a Euclidean distance metric (a hat tip also to g3kk0 who elaborated on this approach).


My question here is how one would approach a much more pathological case, where we have multiple closely spaced curves, displaced some amount vertically, and we would like to generate an "average" curve that represents a sort of average for the shape of all the semicontinuous curves.


Take a look at the following plots (data uploaded to: https://www.dropbox.com/sh/751invayqe3ip9b/r6raIrdvLF):



Plot A:


enter image description here


Plot B:


enter image description here


In both cases, we can more or less clearly make out by eye what the shape of each semicontinuous curve looks like. For example, take the following "hand drawn" traces of two semicontinuous curves in Plot B:


enter image description here


While it may not be possible to separate the data sets for Plot A and Plot B into clusters that represent each of these semicontinuous curves (I'd love to be corrected on this point), how can we automatically generate a sort of "average curve" that follows the trend for each of the semicontinuous curves?


Update: In reponse to rm -rf's comment, if we take a similar data set to Plot B and bin along the X-axis and then compute a mean for each bin (computing the median generates a similar result), the resulting plot looks like this:


enter image description here


My explanation for this is that we cannot "see" (i.e. we do not have access to) all of the semicontinuous curves in the plot. A peak in the set of curves may cause additional curves to appear near the bottom of the plot (see the $X \approx 5000$ to $8000$ region in Plot B), and vice versa.





Update 2: If we connect all points within some small cutoff distance $T$ (perhaps $T = 1$ here), generating a random geometric graph, perhaps we could look at the angle of the edges connecting the points / vertices? This shouldn't be affected very much by the presence of peaks or troughs of curves at the bottom and top of the plot, respectively.




Update 3: Provided that the data here is generally periodic along the $y$-axis, it might be reasonable to treat the surface as if it it curves around on itself like a tube (where the x-axis is the long-axis of the tube, and the curvature occurs along the y-axis).



Answer



Your data seems to be composed of two components: An offset for each x-location (or an offset-function of x) and a density relative to that offset. If you had either of the two, estimating the other would be easy. Right so far?


One common solution for this kind of problem is the EM algorithm. Basically, you start with some estimate for one variable (e.g. the offset function) and use that to update the other (the density). Then you use fix the density and update the offsets. Rinse and repeat.


I used your "B" dataset for testing:


data = ToExpression[
Import["https://dl.dropboxusercontent.com/sh/751invayqe3ip9b/\

lhJ65LVWMI/Plot%20B%20Points.txt?token_hash=AAF1qneRm1kbtw4P9p_\
JCMhzO1SpmMlDst8zMbLfG_cNjQ&dl=1"]];

I need some parameters for estimating the density given the offsets and estimating the offsets given a density, but the algorithm is (relatively) robust to different parameters:


histogramBinSize = 0.1;
partitionSize = 50;
filterSize = 5;

makeHist =
Function[data,

GaussianFilter[BinCounts[data, {160, 200, histogramBinSize}],
filterSize]];

makeHist estimates the density from a set of values (with offsets removed).


The EM step takes an estimate for the offset at each position, calculates a denstiy estimate and returns a new offset estimate from that:


emStep = Function[offsetEstimate,
Module[{densityEstimate, zeroMeanDensityEstimate, correlation,
shift, offsets, offsetForEachPoint, dataWithoutOffset,
newDensityEstimate},
(

(* first estimate the density *)
dataWithoutOffset = data - offsetEstimate;
densityEstimate = makeHist[dataWithoutOffset];

(* then estimate the offset given the density, using simple correlation *)
zeroMeanDensityEstimate = densityEstimate - Mean[densityEstimate];
correlation =
GaussianFilter[
ListCorrelate[zeroMeanDensityEstimate, makeHist[#], 1] & /@
Partition[data, partitionSize], filterSize];

shift = Round[Length[densityEstimate]/2];
offsets =
histogramBinSize (Position[RotateRight[#, shift], Max[#]][[1,
1]] - shift) & /@ correlation;
offsetForEachPoint =
ListInterpolation[offsets, {1, Length[data]},
InterpolationOrder -> 1][Range[1, Length[data]]];

Sow[
Row[

{
ListLinePlot[densityEstimate, PlotLabel -> "Density Estimate",
ImageSize -> 400],
ListLinePlot[offsets, PlotLabel -> "New Estimated Offset",
ImageSize -> 400]
}]];

offsetForEachPoint
)]];


Starting with offset 0 for each point, the algorithm converges to a good estimate after a few iterations:


offsetStart = data*0;
{offsetEstimate, {output}} = Reap[Nest[emStep, offsetStart, 5]];
Show[ListPlot[data],
ListLinePlot[180 + offsetEstimate, PlotStyle -> {Red, Thick}]]

enter image description here


The EM step sows a plot of the current offset/denstiy estimate after each iteration. As you can see, it converges to a good solution quite quickly:


Column[output]


enter image description here


There are several ways to improve this simple algorithm:



  • You could probably come up with a smarter way to estimate the density given the offsets.

  • The estimation of the offset given the density is also far from perfect (it just looks for the maximum correlation at each x-location). For example, you could only search for offset functions without large jumps (e.g. Viterbi's or some graph-path-searching algorithm. Or maybe some general function approximation like Gaussian Process).

  • The algorithm is still quite ad-hoc. You could probably improve the results and/or automatically optimize the parameters by treating the "densities" as proper probability densities and estimating real probabilities (instead of a meaningless correlation score).


Update: To process the "multiple values per column" dataset, I'd change the data format a little so it can be processed by the same algorithm:


data2d = ToExpression[
Import["https://dl.dropboxusercontent.com/sh/751invayqe3ip9b/\

Ra8vDwQ1Ew/DataSet%20-%20Multiple%20Points%20Per%20Column.txt?token_\
hash=AAF1qneRm1kbtw4P9p_JCMhzO1SpmMlDst8zMbLfG_cNjQ&dl=1"]];

data = Flatten /@ data2d[[All, All, 2]];

Now data contains an array of y-values for each x-column, i.e. when data2d looks like this,


{{}, {{2, 149.645}}, {{3, 149.533}}, {{4, 139.979}}, {{5, 
156.648}, {5, 140.324}}, {{6, 139.452}}, {{7, 145.777}, {7,
140.795}, {7, 138.106}}, {{8, 145.766}, {8, 140.788}, {8,
137.636}} ...


data will look like this:


{{}, {149.645}, {149.533}, {139.979}, {156.648, 140.324}, 
{139.452}, {145.777, 140.795, 138.106}, {145.766, 140.788, 137.636}...

There are really only three kinds of operations I do with the data:



  • Calculate a histogram of all points,

  • Partition it into N-column stripes and calculate a histogram for each

  • Subtract an offset-list containing an offset for each x-column



All of these work perfectly well for a list of values for each column, as long as the histograms are calculated from a flattened list:


makeHist = 
Function[data,
GaussianFilter[
BinCounts[Flatten[data], {130, 170, histogramBinSize}],
filterSize]];

And I had to adjust the final data display:


Show[ListPlot[Flatten[data2d, 1]], 

ListLinePlot[Mean[Flatten[data]] + offsetEstimate,
PlotStyle -> {Red, Thick}]]

Other than that, the algorithm works fine for this dataset:


enter image description here


enter image description here


Comments

Popular posts from this blog

plotting - Plot 4D data with color as 4th dimension

I have a list of 4D data (x position, y position, amplitude, wavelength). I want to plot x, y, and amplitude on a 3D plot and have the color of the points correspond to the wavelength. I have seen many examples using functions to define color but my wavelength cannot be expressed by an analytic function. Is there a simple way to do this? Answer Here a another possible way to visualize 4D data: data = Flatten[Table[{x, y, x^2 + y^2, Sin[x - y]}, {x, -Pi, Pi,Pi/10}, {y,-Pi,Pi, Pi/10}], 1]; You can use the function Point along with VertexColors . Now the points are places using the first three elements and the color is determined by the fourth. In this case I used Hue, but you can use whatever you prefer. Graphics3D[ Point[data[[All, 1 ;; 3]], VertexColors -> Hue /@ data[[All, 4]]], Axes -> True, BoxRatios -> {1, 1, 1/GoldenRatio}]

plotting - Mathematica: 3D plot based on combined 2D graphs

I have several sigmoidal fits to 3 different datasets, with mean fit predictions plus the 95% confidence limits (not symmetrical around the mean) and the actual data. I would now like to show these different 2D plots projected in 3D as in but then using proper perspective. In the link here they give some solutions to combine the plots using isometric perspective, but I would like to use proper 3 point perspective. Any thoughts? Also any way to show the mean points per time point for each series plus or minus the standard error on the mean would be cool too, either using points+vertical bars, or using spheres plus tubes. Below are some test data and the fit function I am using. Note that I am working on a logit(proportion) scale and that the final vertical scale is Log10(percentage). (* some test data *) data = Table[Null, {i, 4}]; data[[1]] = {{1, -5.8}, {2, -5.4}, {3, -0.8}, {4, -0.2}, {5, 4.6}, {1, -6.4}, {2, -5.6}, {3, -0.7}, {4, 0.04}, {5, 1.0}, {1, -6.8}, {2, -4.7}, {3, -1....

functions - Get leading series expansion term?

Given a function f[x] , I would like to have a function leadingSeries that returns just the leading term in the series around x=0 . For example: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x)] x and leadingSeries[(1/x + 2 + (1 - 1/x^3)/4)/(4 + x)] -(1/(16 x^3)) Is there such a function in Mathematica? Or maybe one can implement it efficiently? EDIT I finally went with the following implementation, based on Carl Woll 's answer: lds[ex_,x_]:=( (ex/.x->(x+O[x]^2))/.SeriesData[U_,Z_,L_List,Mi_,Ma_,De_]:>SeriesData[U,Z,{L[[1]]},Mi,Mi+1,De]//Quiet//Normal) The advantage is, that this one also properly works with functions whose leading term is a constant: lds[Exp[x],x] 1 Answer Update 1 Updated to eliminate SeriesData and to not return additional terms Perhaps you could use: leadingSeries[expr_, x_] := Normal[expr /. x->(x+O[x]^2) /. a_List :> Take[a, 1]] Then for your examples: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x), x] leadingSeries[Exp[x], x] leadingSeries[(1/x + 2 + (1 - 1/x...