fitting - Generating an "average curve" from a dense set of "semicontinuous curves" with clear trends
This question is related to a previous one of mine: Partitioning a superset of coordinates into subsets that generate continuous curves, where the challenge was to partition a data set into two clusters representing distinct "semicontinuous curves" (we mean this in the loose sense that a ListPlot generated two curves that by eye appeared to be continuous). The user rm -rf demonstrated that FindClusters could very nicely solve the problem using a Euclidean distance metric (a hat tip also to g3kk0, who elaborated on this approach).
My question here is how one would approach a much more pathological case, where we have multiple closely spaced curves, displaced vertically by some amount, and we would like to generate an "average" curve that represents the overall shape of the semicontinuous curves.
Take a look at the following plots (data uploaded to: https://www.dropbox.com/sh/751invayqe3ip9b/r6raIrdvLF):
Plot A:
Plot B:
In both cases, we can more or less clearly make out by eye what the shape of each semicontinuous curve looks like. For example, take the following "hand drawn" traces of two semicontinuous curves in Plot B:
While it may not be possible to separate the data sets for Plot A and Plot B into clusters that represent each of these semicontinuous curves (I'd love to be corrected on this point), how can we automatically generate a sort of "average curve" that follows the trend for each of the semicontinuous curves?
Update: In response to rm -rf's comment, if we take a similar data set to Plot B, bin along the X-axis, and then compute a mean for each bin (computing the median generates a similar result), the resulting plot looks like this:
My explanation for this is that we cannot "see" (i.e. we do not have access to) all of the semicontinuous curves in the plot. A peak in the set of curves may cause additional curves to appear near the bottom of the plot (see the $X \approx 5000$ to $8000$ region in Plot B), and vice versa.
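For reference, the binning-and-averaging approach described above can be sketched as follows, assuming data is a list of {x, y} pairs and a hypothetical bin width of 250:

```mathematica
binWidth = 250; (* assumed bin width; tune to the data *)
(* group points into x-bins, then average each bin;
   Mean of a list of {x, y} pairs gives {mean x, mean y} *)
binned = GatherBy[data, Floor[First[#]/binWidth] &];
averaged = Mean /@ binned;
ListPlot[{data, averaged}]
```

As noted, this naive average is pulled toward whichever curves happen to be visible in each bin, so it does not track the shape of any individual curve.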
Update 2: If we connect all points within some small cutoff distance $T$ (perhaps $T = 1$ here), generating a random geometric graph, perhaps we could look at the angle of the edges connecting the points / vertices? This shouldn't be affected very much by the presence of peaks or troughs of curves at the bottom and top of the plot, respectively.
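A minimal sketch of the graph idea, assuming data is a list of {x, y} points and the hypothetical cutoff T = 1: connect all pairs of points within distance T and compute the angle of each edge.

```mathematica
T = 1; (* assumed cutoff distance *)
(* graph whose vertices are the points themselves,
   with edges between points closer than T *)
g = NearestNeighborGraph[data, {All, T}];
(* angle of the segment joining the two endpoints of an edge *)
edgeAngle[e_] := ArcTan @@ (Subtract @@ List @@ e);
angles = edgeAngle /@ EdgeList[g];
Histogram[angles]
```

A histogram of these angles should be dominated by the local slope of the curves, largely independent of where peaks or troughs occur.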
Update 3: Provided that the data here is generally periodic along the $y$-axis, it might be reasonable to treat the surface as if it curves around on itself like a tube (where the x-axis is the long axis of the tube, and the curvature occurs along the y-axis).
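One way to realize this tube picture, assuming the vertical period is some value p (hypothetical here), is simply to wrap the y-values modulo p before any further processing:

```mathematica
p = 20; (* assumed vertical period of the curve family *)
(* wrap each point's y-value onto the tube circumference *)
wrapped = {First[#], Mod[Last[#], p]} & /@ data;
ListPlot[wrapped]
```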
Answer
Your data seems to be composed of two components: An offset for each x-location (or an offset-function of x) and a density relative to that offset. If you had either of the two, estimating the other would be easy. Right so far?
One common solution for this kind of problem is the EM algorithm. Basically, you start with some estimate for one variable (e.g. the offset function) and use that to update the other (the density). Then you fix the density and update the offsets. Rinse and repeat.
I used your "B" dataset for testing:
data = ToExpression[
Import["https://dl.dropboxusercontent.com/sh/751invayqe3ip9b/\
lhJ65LVWMI/Plot%20B%20Points.txt?token_hash=AAF1qneRm1kbtw4P9p_\
JCMhzO1SpmMlDst8zMbLfG_cNjQ&dl=1"]];
I need some parameters for estimating the density given the offsets and estimating the offsets given a density, but the algorithm is (relatively) robust to different parameters:
histogramBinSize = 0.1;
partitionSize = 50;
filterSize = 5;
makeHist =
Function[data,
GaussianFilter[BinCounts[data, {160, 200, histogramBinSize}],
filterSize]];
makeHist
estimates the density from a set of values (with offsets removed).
The EM step takes an estimate for the offset at each position, calculates a density estimate, and returns a new offset estimate from it:
emStep = Function[offsetEstimate,
Module[{densityEstimate, zeroMeanDensityEstimate, correlation,
shift, offsets, offsetForEachPoint, dataWithoutOffset,
newDensityEstimate},
(
(* first estimate the density *)
dataWithoutOffset = data - offsetEstimate;
densityEstimate = makeHist[dataWithoutOffset];
(* then estimate the offset given the density, using simple correlation *)
zeroMeanDensityEstimate = densityEstimate - Mean[densityEstimate];
correlation =
GaussianFilter[
ListCorrelate[zeroMeanDensityEstimate, makeHist[#], 1] & /@
Partition[data, partitionSize], filterSize];
shift = Round[Length[densityEstimate]/2];
offsets =
histogramBinSize (Position[RotateRight[#, shift], Max[#]][[1,
1]] - shift) & /@ correlation;
offsetForEachPoint =
ListInterpolation[offsets, {1, Length[data]},
InterpolationOrder -> 1][Range[1, Length[data]]];
Sow[
Row[
{
ListLinePlot[densityEstimate, PlotLabel -> "Density Estimate",
ImageSize -> 400],
ListLinePlot[offsets, PlotLabel -> "New Estimated Offset",
ImageSize -> 400]
}]];
offsetForEachPoint
)]];
Starting with offset 0 for each point, the algorithm converges to a good estimate after a few iterations:
offsetStart = data*0;
{offsetEstimate, {output}} = Reap[Nest[emStep, offsetStart, 5]];
Show[ListPlot[data],
ListLinePlot[180 + offsetEstimate, PlotStyle -> {Red, Thick}]]
The EM step sows a plot of the current offset/density estimate after each iteration. As you can see, it converges to a good solution quite quickly:
Column[output]
There are several ways to improve this simple algorithm:
- You could probably come up with a smarter way to estimate the density given the offsets.
- The estimation of the offset given the density is also far from perfect (it just looks for the maximum correlation at each x-location). For example, you could restrict the search to offset functions without large jumps (e.g. using the Viterbi algorithm or some other graph-path-search, or perhaps a general function approximation such as a Gaussian process).
- The algorithm is still quite ad-hoc. You could probably improve the results and/or automatically optimize the parameters by treating the "densities" as proper probability densities and estimating real probabilities (instead of a meaningless correlation score).
Update: To process the "multiple values per column" dataset, I'd change the data format a little so it can be processed by the same algorithm:
data2d = ToExpression[
Import["https://dl.dropboxusercontent.com/sh/751invayqe3ip9b/\
Ra8vDwQ1Ew/DataSet%20-%20Multiple%20Points%20Per%20Column.txt?token_\
hash=AAF1qneRm1kbtw4P9p_JCMhzO1SpmMlDst8zMbLfG_cNjQ&dl=1"]];
data = Flatten /@ data2d[[All, All, 2]];
Now data contains an array of y-values for each x-column, i.e. when data2d
looks like this,
{{}, {{2, 149.645}}, {{3, 149.533}}, {{4, 139.979}}, {{5,
156.648}, {5, 140.324}}, {{6, 139.452}}, {{7, 145.777}, {7,
140.795}, {7, 138.106}}, {{8, 145.766}, {8, 140.788}, {8,
137.636}} ...
data
will look like this:
{{}, {149.645}, {149.533}, {139.979}, {156.648, 140.324},
{139.452}, {145.777, 140.795, 138.106}, {145.766, 140.788, 137.636}...
There are really only three kinds of operations I do with the data:
- Calculate a histogram of all points,
- Partition it into N-column stripes and calculate a histogram for each
- Subtract an offset-list containing an offset for each x-column
All of these work perfectly well for a list of values for each column, as long as the histograms are calculated from a flattened list:
makeHist =
Function[data,
GaussianFilter[
BinCounts[Flatten[data], {130, 170, histogramBinSize}],
filterSize]];
And I had to adjust the final data display:
Show[ListPlot[Flatten[data2d, 1]],
ListLinePlot[Mean[Flatten[data]] + offsetEstimate,
PlotStyle -> {Red, Thick}]]
Other than that, the algorithm works fine for this dataset: