Skip to main content

performance tuning - Transferring a large amount of data in parallel calculations


Bug fixed in version 11.1
Functions like MemberQ, FreeQ, etc. no longer unpack. Yay!




This question is inspired by one of @whuber's answers


Consider the following code:


μ = RandomReal[{0, 1}, 100];

Σ = DiagonalMatrix[Exp[RandomReal[{0, 1}, 100]]];

AbsoluteTiming[
RandomVariate[MultinormalDistribution[μ, Σ], 400000];]

It runs in 3.5 seconds here.


Now let's parallelize it:


LaunchKernels[]

AbsoluteTiming[

Join @@ ParallelTable[
RandomVariate[MultinormalDistribution[μ, Σ], 200000], {2}];]

This runs in 6.3 seconds on a 2-core machine---much slower. It also uses a lot of memory (which I checked using a process monitor).


Now let's suppress returning the results from the subkernels by including a semicolon:


AbsoluteTiming[
Join @@ ParallelTable[
RandomVariate[MultinormalDistribution[μ, Σ], 200000];, {2}];]

This one runs in 2.6 seconds---a speedup.



What is happening here? Why is the calculation that returns the result so much slower? Is it a general rule with parallel calculations that returning even moderately large data tends to lead to a significant slowdown? Is the slowdown due to MathLink's performance? Is there anything one could do to avoid the slowdown?


Warning: This might eat all your memory and force your system to swap! This computer has 6 GB and everything was fine. If you have less memory, reduce the amount of data a bit.




Solution


@Oleksandr's excellent analysis showed that the performance bottleneck is MemberQ, in particular that it unpacks all arrays inside the expression tested. This is completely unnecessary, and it's possible to define a more efficient (though more limited) version of MemberQ:


memberQ[list_, form_] := Or @@ (MatchQ[#, form] & /@ list)

Note that MemberQ only tests at level 1 by default (unlike FreeQ which tests at all levels). This made it easy to re-implement the two-argument form of MemberQ.


We can temporarily change MemberQ while executing parallel operations:


ClearAll[fix]

SetAttributes[fix, HoldAll]
fix[expr_] := Block[{MemberQ = memberQ}, expr]

fix@AbsoluteTiming[
Join @@ ParallelTable[
RandomVariate[MultinormalDistribution[μ, Σ],
200000], {2}];]

This runs in 3.0 seconds now, a huge improvement.


This is just an illustration of how to fix the performance problem, but the code I showed here is not completely safe to use in its current form.



Some notes:




  • Changing builtins is always risky, and can easily cause problems




  • I used Block to localize the change, which reduces the risk. Note that Block will not affect calculations in the parallel kernels, so if fix is used only on the parallelization functions, in the form fix@ParallelTable[...], then it will only have an effect for these functions, but not for the code that is being parallelized. This reduces the risk further.




  • I did not implement the 3-argument form of MemberQ. If this is used anywhere in the parallel tools, fix will break things. It'd take a bit more work to correctly implement this too, preferably just falling back to the builtin MemberQ for this case. There may always be some undocumented behaviour of MemberQ which we are not aware of and which differs from memberQ.





  • I did not implement short circuiting, so memberQ will be slower in some cases. This can be fixed as well.




These potential problems can largely be fixed with a bit of work, and I believe this method can work well for fixing this particular performance problem of parallel calculations.



Answer



There are two performance problems here. The first is relatively minor: MultinormalDistribution[μ, Σ] is evaluated in each slave kernel, returned to the master kernel, and sent back to the slave kernels as part of the RandomVariate call. In your example, this is a packed array of about 80KB in size: not large, yet not small either, and this behaviour may become an issue in other contexts. This is easily solved by specifying Method -> "CoarsestGrained" as an option to ParallelTable. However, while certainly representing an improvement, this setting unfortunately has little impact on overall behaviour in the current case.


The second issue is both more subtle and more serious, and comes from the handling of aborted evaluations by the Parallel` package. The essence of it is that all results returned to the master kernel by the slaves are checked for aborted evaluations using MemberQ[res, $Aborted]. Here, res is a large matrix in the form of a packed array $\approx$160MB in size, and the unpacking of this by MemberQ accounts for the poor performance and considerable memory consumption of this example. The peak memory consumption does not persist, however, since after the absence of $Aborted has been verified, the intermediate (unpacked) results are discarded.


To demonstrate more concretely the source of problem, we examine the file



FileNameJoin[{$InstallationDirectory, "AddOns", "Applications", "Parallel", "Combine.m"}]

and note that we can change the behaviour of this example using only the excerpt (which is complete with original comments, but repackaged to be a stand-alone modification of the relevant code, as well as slightly reformatted):


BeginPackage["Parallel`Combine`"];
Begin["`Private`"];

Needs["Parallel`Parallel`"];
Needs["Parallel`Kernels`"];

(* Additional required contexts -- O. R. *)

Needs["Parallel`Protected`"];
Needs["Parallel`Developer`"];

parallelIterateE[orig_, iter_, comb_, f_, expr_,
it : {w1_}, others___, {meth_, dist_, ___}] :=
With[{nk = $KernelCount, items = Internal`GetIteratorLength[it, orig]},
Module[{batches, batchsize, sizes, res},
If[ !IntegerQ[items] || items < 0, (* cannot do it if symbolic *)
Message[orig::nopar1, HoldForm[orig[expr, {w1}, others]]];
Return[iter[expr, {w1}, others]]

];
(* handle Method option *)
grokMethodOption[orig, Evaluate[items], nk, batches, batchsize, meth];
sizes = makeSizes[items, nk, batches, batchsize];
(* send definitions *)
Parallel`Protected`AutoDistribute[{f, expr, others}, orig, dist];

parStart;
With[{
chunks = HoldComplete @@ sizes /.

u1_Integer :> f[iter[expr, {u1}, others]]
},
res = If[
batches === 1,
ParallelDispatch[chunks],
ConcurrentEvaluate[chunks]
];
];
res = If[
(* Here lies the problem -- O. R. *)

MemberQ[res, $Aborted],
$Aborted,
comb @@ res
]; (* remote abort received *)
parStop;

res
]
]


End[];
EndPackage[];

(I should stress that the above is only one example of this code pattern as used in Combine.m; it also appears in other places not relevant to this example.)


Modifying this, either by removing the MemberQ call, or simply by commenting out res, and then running it (after loading the Parallel` package) to modify the definitions made in Combine.m can be seen to restore the expected performance.


Unfortunately, although one might think up a more performant (or at least non-unpacking) but semantically equivalent replacement, such as


Block[{$Aborted := Abort[]}, CheckAbort[res, $aborted]] === $aborted

this won't work since $Aborted is Locked. In any case, for obvious reasons, I hesitate to suggest modifying Combine.m in order to fix this problem, so attempting a workaround seems to be the best option here.


Edit: the workaround



As promised in the edit comment I left when I deleted my initial (incorrect) attempt at a workaround, here is a potential (this time, hopefully correct) solution to the problem of MemberQ unpacking. It is different in spirit to @Szabolcs's approach, being based on a modification of the original MemberQ rather than a reimplementation, so I feel it is worth posting as well. However, the same caveat applies: modifying System` functions is something which should be attempted with great care and only as a last resort. This method is based on the replacement of any packed arrays that appear in the first argument of MemberQ with an opaque list when the second argument is a symbol. I think it's safe, but there may be a situation I've overlooked, so I still advise caution.


withModifiedMemberQ[expr_] :=
Module[{doneQ, unmatchable},
Internal`InheritedBlock[{MemberQ},
Unprotect[MemberQ];
(* Can uncomment this if we want to print out the MemberQ calls:
mq:MemberQ[args___]/;(Print@HoldForm[mq];True):=mq;
*)
MemberQ[list_, patt_Symbol, args___] /; !TrueQ[doneQ] :=
Block[{doneQ = True},

MemberQ[
Unevaluated[list] /. _List?Developer`PackedArrayQ -> {unmatchable},
Unevaluated[patt],
args
]
];
Protect[MemberQ];
expr
]
];

SetAttributes[withModifiedMemberQ, HoldAllComplete];

Let's test it. For this I've uncommented the definition that prints out the calls to MemberQ so that its operation will produce some debug output. (Note also that Range returns a packed array.)


On["Packing"];

MemberQ[Range[5], $Aborted]
(* (message) Developer`FromPackedArray::unpack: Unpacking array in call to MemberQ.
False *)

withModifiedMemberQ@MemberQ[Range[5], $Aborted]

(* (prints) MemberQ[{1, 2, 3, 4, 5}, $Aborted]
(prints) MemberQ[{unmatchable$1305}, $Aborted]
False *)

withModifiedMemberQ@MemberQ[Range[5], 1]
(* (prints) MemberQ[{1, 2, 3, 4, 5}, 1]
(message) Developer`FromPackedArray::unpack: Unpacking array in call to MemberQ.
True *)

So it looks okay. Now let's comment out the debug output (otherwise the front end will probably lock up) and try it on the problematic ParallelTable call:



On["Packing"];

withModifiedMemberQ@AbsoluteTiming[
Join @@ ParallelTable[
RandomVariate[
MultinormalDistribution[\[Mu], \[CapitalSigma]],
200000
], {2}
];
]

(* {2.8593750, Null} *)

No unpacking messages are printed, and the performance issue is fixed. (On my computer, the version using the unmodified MemberQ takes 5.6 seconds.)


Comments

Popular posts from this blog

plotting - Plot 4D data with color as 4th dimension

I have a list of 4D data (x position, y position, amplitude, wavelength). I want to plot x, y, and amplitude on a 3D plot and have the color of the points correspond to the wavelength. I have seen many examples using functions to define color but my wavelength cannot be expressed by an analytic function. Is there a simple way to do this? Answer Here a another possible way to visualize 4D data: data = Flatten[Table[{x, y, x^2 + y^2, Sin[x - y]}, {x, -Pi, Pi,Pi/10}, {y,-Pi,Pi, Pi/10}], 1]; You can use the function Point along with VertexColors . Now the points are places using the first three elements and the color is determined by the fourth. In this case I used Hue, but you can use whatever you prefer. Graphics3D[ Point[data[[All, 1 ;; 3]], VertexColors -> Hue /@ data[[All, 4]]], Axes -> True, BoxRatios -> {1, 1, 1/GoldenRatio}]

plotting - Filling between two spheres in SphericalPlot3D

Manipulate[ SphericalPlot3D[{1, 2 - n}, {θ, 0, Pi}, {ϕ, 0, 1.5 Pi}, Mesh -> None, PlotPoints -> 15, PlotRange -> {-2.2, 2.2}], {n, 0, 1}] I cant' seem to be able to make a filling between two spheres. I've already tried the obvious Filling -> {1 -> {2}} but Mathematica doesn't seem to like that option. Is there any easy way around this or ... Answer There is no built-in filling in SphericalPlot3D . One option is to use ParametricPlot3D to draw the surfaces between the two shells: Manipulate[ Show[SphericalPlot3D[{1, 2 - n}, {θ, 0, Pi}, {ϕ, 0, 1.5 Pi}, PlotPoints -> 15, PlotRange -> {-2.2, 2.2}], ParametricPlot3D[{ r {Sin[t] Cos[1.5 Pi], Sin[t] Sin[1.5 Pi], Cos[t]}, r {Sin[t] Cos[0 Pi], Sin[t] Sin[0 Pi], Cos[t]}}, {r, 1, 2 - n}, {t, 0, Pi}, PlotStyle -> Yellow, Mesh -> {2, 15}]], {n, 0, 1}]

plotting - Mathematica: 3D plot based on combined 2D graphs

I have several sigmoidal fits to 3 different datasets, with mean fit predictions plus the 95% confidence limits (not symmetrical around the mean) and the actual data. I would now like to show these different 2D plots projected in 3D as in but then using proper perspective. In the link here they give some solutions to combine the plots using isometric perspective, but I would like to use proper 3 point perspective. Any thoughts? Also any way to show the mean points per time point for each series plus or minus the standard error on the mean would be cool too, either using points+vertical bars, or using spheres plus tubes. Below are some test data and the fit function I am using. Note that I am working on a logit(proportion) scale and that the final vertical scale is Log10(percentage). (* some test data *) data = Table[Null, {i, 4}]; data[[1]] = {{1, -5.8}, {2, -5.4}, {3, -0.8}, {4, -0.2}, {5, 4.6}, {1, -6.4}, {2, -5.6}, {3, -0.7}, {4, 0.04}, {5, 1.0}, {1, -6.8}, {2, -4.7}, {3, -1....