Bug fixed in version 11.1
Functions like MemberQ
, FreeQ
, etc. no longer unpack. Yay!
This question is inspired by one of @whuber's answers
Consider the following code:
μ = RandomReal[{0, 1}, 100];
Σ = DiagonalMatrix[Exp[RandomReal[{0, 1}, 100]]];
AbsoluteTiming[
RandomVariate[MultinormalDistribution[μ, Σ], 400000];]
It runs in 3.5 seconds here.
Now let's parallelize it:
LaunchKernels[]
AbsoluteTiming[
Join @@ ParallelTable[
RandomVariate[MultinormalDistribution[μ, Σ], 200000], {2}];]
This runs in 6.3 seconds on a 2-core machine---much slower. It also uses a lot of memory (which I checked using a process monitor).
Now let's suppress returning the results from the subkernels by including a semicolon:
AbsoluteTiming[
Join @@ ParallelTable[
RandomVariate[MultinormalDistribution[μ, Σ], 200000];, {2}];]
This one runs in 2.6 seconds---a speedup.
What is happening here? Why is the calculation that returns the result so much slower? Is it a general rule with parallel calculations that returning even moderately large data tends to lead to a significant slowdown? Is the slowdown due to MathLink's performance? Is there anything one could do to avoid the slowdown?
Warning: This might eat all your memory and force your system to swap! This computer has 6 GB and everything was fine. If you have less memory, reduce the amount of data a bit.
Solution
@Oleksandr's excellent analysis showed that the performance bottleneck is MemberQ
, in particular that it unpacks all arrays inside the expression tested. This is completely unnecessary, and it's possible to define a more efficient (though more limited) version of MemberQ
:
memberQ[list_, form_] := Or @@ (MatchQ[#, form] & /@ list)
Note that MemberQ
only tests at level 1 by default (unlike FreeQ
which tests at all levels). This made it easy to re-implement the two-argument form of MemberQ
.
We can temporarily change MemberQ
while executing parallel operations:
ClearAll[fix]
SetAttributes[fix, HoldAll]
fix[expr_] := Block[{MemberQ = memberQ}, expr]
fix@AbsoluteTiming[
Join @@ ParallelTable[
RandomVariate[MultinormalDistribution[μ, Σ],
200000], {2}];]
This runs in 3.0 seconds now, a huge improvement.
This is just an illustration of how to fix the performance problem, but the code I showed here is not completely safe to use in its current form.
Some notes:
Changing builtins is always risky, and can easily cause problems
I used
Block
to localize the change, which reduces the risk. Note thatBlock
will not affect calculations in the parallel kernels, so iffix
is used only on the parallelization functions, in the formfix@ParallelTable[...]
, then it will only have an effect for these functions, but not for the code that is being parallelized. This reduces the risk further.I did not implement the 3-argument form of
MemberQ
. If this is used anywhere in the parallel tools,fix
will break things. It'd take a bit more work to correctly implement this too, preferably just falling back to the builtinMemberQ
for this case. There may always be some undocumented behaviour ofMemberQ
which we are not aware of and which differs frommemberQ
.I did not implement short circuiting, so
memberQ
will be slower in some cases. This can be fixed as well.
These potential problems can largely be fixed with a bit of work, and I believe this method can work well for fixing this particular performance problem of parallel calculations.
Answer
There are two performance problems here. The first is relatively minor: MultinormalDistribution[μ, Σ]
is evaluated in each slave kernel, returned to the master kernel, and sent back to the slave kernels as part of the RandomVariate
call. In your example, this is a packed array of about 80KB in size: not large, yet not small either, and this behaviour may become an issue in other contexts. This is easily solved by specifying Method -> "CoarsestGrained"
as an option to ParallelTable
. However, while certainly representing an improvement, this setting unfortunately has little impact on overall behaviour in the current case.
The second issue is both more subtle and more serious, and comes from the handling of aborted evaluations by the Parallel`
package. The essence of it is that all results returned to the master kernel by the slaves are checked for aborted evaluations using MemberQ[res, $Aborted]
. Here, res
is a large matrix in the form of a packed array $\approx$160MB in size, and the unpacking of this by MemberQ
accounts for the poor performance and considerable memory consumption of this example. The peak memory consumption does not persist, however, since after the absence of $Aborted
has been verified, the intermediate (unpacked) results are discarded.
To demonstrate more concretely the source of problem, we examine the file
FileNameJoin[{$InstallationDirectory, "AddOns", "Applications", "Parallel", "Combine.m"}]
and note that we can change the behaviour of this example using only the excerpt (which is complete with original comments, but repackaged to be a stand-alone modification of the relevant code, as well as slightly reformatted):
BeginPackage["Parallel`Combine`"];
Begin["`Private`"];
Needs["Parallel`Parallel`"];
Needs["Parallel`Kernels`"];
(* Additional required contexts -- O. R. *)
Needs["Parallel`Protected`"];
Needs["Parallel`Developer`"];
parallelIterateE[orig_, iter_, comb_, f_, expr_,
it : {w1_}, others___, {meth_, dist_, ___}] :=
With[{nk = $KernelCount, items = Internal`GetIteratorLength[it, orig]},
Module[{batches, batchsize, sizes, res},
If[ !IntegerQ[items] || items < 0, (* cannot do it if symbolic *)
Message[orig::nopar1, HoldForm[orig[expr, {w1}, others]]];
Return[iter[expr, {w1}, others]]
];
(* handle Method option *)
grokMethodOption[orig, Evaluate[items], nk, batches, batchsize, meth];
sizes = makeSizes[items, nk, batches, batchsize];
(* send definitions *)
Parallel`Protected`AutoDistribute[{f, expr, others}, orig, dist];
parStart;
With[{
chunks = HoldComplete @@ sizes /.
u1_Integer :> f[iter[expr, {u1}, others]]
},
res = If[
batches === 1,
ParallelDispatch[chunks],
ConcurrentEvaluate[chunks]
];
];
res = If[
(* Here lies the problem -- O. R. *)
MemberQ[res, $Aborted],
$Aborted,
comb @@ res
]; (* remote abort received *)
parStop;
res
]
]
End[];
EndPackage[];
(I should stress that the above is only one example of this code pattern as used in Combine.m
; it also appears in other places not relevant to this example.)
Modifying this, either by removing the MemberQ
call, or simply by commenting out res
, and then running it (after loading the Parallel`
package) to modify the definitions made in Combine.m
can be seen to restore the expected performance.
Unfortunately, although one might think up a more performant (or at least non-unpacking) but semantically equivalent replacement, such as
Block[{$Aborted := Abort[]}, CheckAbort[res, $aborted]] === $aborted
this won't work since $Aborted
is Locked
. In any case, for obvious reasons, I hesitate to suggest modifying Combine.m
in order to fix this problem, so attempting a workaround seems to be the best option here.
Edit: the workaround
As promised in the edit comment I left when I deleted my initial (incorrect) attempt at a workaround, here is a potential (this time, hopefully correct) solution to the problem of MemberQ
unpacking. It is different in spirit to @Szabolcs's approach, being based on a modification of the original MemberQ
rather than a reimplementation, so I feel it is worth posting as well. However, the same caveat applies: modifying System`
functions is something which should be attempted with great care and only as a last resort. This method is based on the replacement of any packed arrays that appear in the first argument of MemberQ
with an opaque list when the second argument is a symbol. I think it's safe, but there may be a situation I've overlooked, so I still advise caution.
withModifiedMemberQ[expr_] :=
Module[{doneQ, unmatchable},
Internal`InheritedBlock[{MemberQ},
Unprotect[MemberQ];
(* Can uncomment this if we want to print out the MemberQ calls:
mq:MemberQ[args___]/;(Print@HoldForm[mq];True):=mq;
*)
MemberQ[list_, patt_Symbol, args___] /; !TrueQ[doneQ] :=
Block[{doneQ = True},
MemberQ[
Unevaluated[list] /. _List?Developer`PackedArrayQ -> {unmatchable},
Unevaluated[patt],
args
]
];
Protect[MemberQ];
expr
]
];
SetAttributes[withModifiedMemberQ, HoldAllComplete];
Let's test it. For this I've uncommented the definition that prints out the calls to MemberQ
so that its operation will produce some debug output. (Note also that Range
returns a packed array.)
On["Packing"];
MemberQ[Range[5], $Aborted]
(* (message) Developer`FromPackedArray::unpack: Unpacking array in call to MemberQ.
False *)
withModifiedMemberQ@MemberQ[Range[5], $Aborted]
(* (prints) MemberQ[{1, 2, 3, 4, 5}, $Aborted]
(prints) MemberQ[{unmatchable$1305}, $Aborted]
False *)
withModifiedMemberQ@MemberQ[Range[5], 1]
(* (prints) MemberQ[{1, 2, 3, 4, 5}, 1]
(message) Developer`FromPackedArray::unpack: Unpacking array in call to MemberQ.
True *)
So it looks okay. Now let's comment out the debug output (otherwise the front end will probably lock up) and try it on the problematic ParallelTable call:
On["Packing"];
withModifiedMemberQ@AbsoluteTiming[
Join @@ ParallelTable[
RandomVariate[
MultinormalDistribution[\[Mu], \[CapitalSigma]],
200000
], {2}
];
]
(* {2.8593750, Null} *)
No unpacking messages are printed, and the performance issue is fixed. (On my computer, the version using the unmodified MemberQ
takes 5.6 seconds.)
Comments
Post a Comment