What reasons are there that can cause parallelized Mathematica code not to run with full performance?
Answer
This is a general guide on debugging issues with parallelization performance.
1. Measuring performance
The proper way to measure the timing of parallelized calculations is AbsoluteTiming, which measures wall time. Timing measures CPU time on the main kernel only and won't give a correct result when used with parallel calculations.
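For example (a minimal sketch, assuming several subkernels are already running): on four subkernels, Timing reports near-zero main-kernel CPU time, while AbsoluteTiming reports roughly one second of wall time.
Timing[ParallelTable[Pause[1], {4}]]          (* main-kernel CPU time: ~0 s *)
AbsoluteTiming[ParallelTable[Pause[1], {4}]]  (* wall time: ~1 s on 4 subkernels *)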
2. How to parallelize effectively?
Simply using Parallelize will not work magically on most code snippets. It won't work at all on most built-in functions such as NIntegrate. Here's some info on what is auto-Parallelizable.
It is better to formulate the problem in terms of more specific constructs such as ParallelTable, ParallelMap, ParallelCombine, ParallelDo, etc., and take full control.
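For instance, a table of independent computations can be stated directly with ParallelTable (a minimal sketch; the Mersenne-number primality test is just an arbitrary example of an expensive, independent computation):
(* each iteration is independent, so the work can be
   distributed across the subkernels directly *)
ParallelTable[PrimeQ[2^k - 1], {k, 1, 2000}]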
Try to use functional code with no side effects for easy and effective parallelization. Read more about this here and here.
Using procedural code might require synchronization using SetSharedFunction and SetSharedVariable. Both of these force code to be evaluated on the main kernel only and can cause a significant performance hit: every time a variable or function marked with SetSharedVariable or SetSharedFunction is evaluated on a parallel kernel, it triggers a costly callback to the main kernel.
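A minimal sketch of the cost difference (the accumulation itself is only a toy example):
total = 0;
SetSharedVariable[total];
ParallelDo[total += i^2, {i, 10^4}]  (* slow: every += is a callback to the main kernel *)

fastTotal = Total[ParallelTable[i^2, {i, 10^4}]]  (* fast: no shared state; results are combined once at the end *)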
Do not parallelize code that is already parallelized internally, as this is likely to reduce performance. Certain functions such as LinearSolve use multi-threaded code internally. Some others, such as NIntegrate, make use of multiple cores only in certain cases. Use a process monitor to check whether your code already makes use of multiple cores without explicit parallelization.
3. Common issues causing slowdowns
3.1 Communication overhead
Parallelization has an overhead: the data being operated on needs to be broken into pieces, sent to the subkernels, and processed there, and then the results need to be sent back. The subkernels are separate processes, and the interprocess communication involved can take a considerable amount of time, especially when the subkernels are running on a remote machine. It is therefore important to minimize the communication with the subkernels:
- Try to communicate less often. This is controlled by the Method option of Parallelize and related functions. Method -> "CoarsestGrained" minimizes communication. Method -> "FinestGrained" breaks the data into as many pieces as possible, which is useful when the pieces can take vastly different times to process.
- Try to send as little data back and forth as possible. Sending large expressions takes longer. If the subkernels generate large data, see if you can reduce it before returning it to the main kernel.
- Different types of data can take hugely different amounts of time to transfer. Prefer packed arrays whenever you can.
- A common mistake (example) is to send a huge array along with each parallel evaluation. Try to send the array once, then index into it as necessary in the evaluations (see the sketch after this list). I show an example at the end of this post.
- Up until version 9, Mathematica launched twice as many subkernels as there were available cores when the CPU had HyperThreading ($ProcessorCount). This typically increases communication overhead but does not always improve computation performance. Sometimes it's better to use only as many subkernels as the number of physical cores; see the second sketch below. (The optimal number differs from case to case.)
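Here is a minimal sketch of the "send it once" pattern combined with coarse-grained scheduling (the array name, sizes, and chunking are made up for illustration; recent versions distribute data referenced by ParallelTable automatically, but the explicit call makes the one-time transfer visible):
data = RandomReal[1, 10^6];   (* a large packed array *)
DistributeDefinitions[data];  (* transferred to each subkernel once, up front *)

(* per evaluation only the index k travels, not the array itself *)
ParallelTable[Total[data[[k ;; k + 9999]]], {k, 1, 10^6 - 9999, 10^4},
 Method -> "CoarsestGrained"]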
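And if you want to control the subkernel count yourself, something along these lines (the count 4 stands in for your machine's number of physical cores):
CloseKernels[];    (* shut down the currently running subkernels *)
LaunchKernels[4];  (* relaunch with one subkernel per physical core *)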
3.2 Improperly distributed definitions
Since subkernels are completely separate Mathematica processes, all the definitions used in the parallel calculation need to be distributed (sent) to the subkernels. This must be done manually in version 7, while it is mostly automatic in later versions. In some cases it does not happen automatically, causing a situation where incorrectly parallelized code returns the correct result but runs slowly.
Example: ParallelEvaluate does not automatically distribute definitions. The following code returns the expected result:
f[] := RandomReal[]
ParallelEvaluate[f[]]
What happens is that f[] is evaluated on each subkernel, and the results are returned as a list. Since f has no associated definition on the subkernels, f[] is returned unevaluated by each subkernel, and the main kernel receives the list {f[], f[], ..., f[]}. This list is then further evaluated on the main kernel into a list of random numbers. Notice that all the calculation happens on the main kernel: this computation doesn't really run in parallel. The solution is to use DistributeDefinitions[f].
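With that fix, the sketch becomes (same toy definition as above):
f[] := RandomReal[]
DistributeDefinitions[f];  (* f now has a definition on every subkernel *)
ParallelEvaluate[f[]]      (* each subkernel generates its own random number *)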
3.3 Make sure packages are loaded in subkernels
This is closely related to the previous point. Functions from packages loaded into the main kernel are not automatically distributed to subkernels. If you use any packages in the parallel code, make sure they are loaded into the subkernels using ParallelNeeds.
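For example (the package name is only a placeholder; substitute whatever package your parallel code uses):
Needs["ComputationalGeometry`"]          (* load the package in the main kernel *)
ParallelNeeds["ComputationalGeometry`"]  (* load it in every subkernel as well *)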
Warning: In certain cases the parallelized code appears to work even without loading the packages in the subkernels, but will be much slower. What actually happens is completely analogous to the example from the previous point: functions are returned unevaluated from the subkernels, and will subsequently get evaluated on the main kernel.
Loading custom packages: To load a custom package from the current directory of the main kernel, make sure that the current directory of the subkernels is the same as the current directory of the main kernel:
With[{d = Directory[]}, ParallelEvaluate[SetDirectory[d]]]
If you set a custom $Path in init.m, it won't take effect in subkernels. To make subkernels use the same $Path as the main kernel, use
With[{p = $Path}, ParallelEvaluate[$Path = p]];
3.4 There are a few bugs known to affect parallel performance
- Packed arrays get temporarily unpacked when sent back to the main kernel (reference). This affects performance when large packed arrays are sent back. See the link for a workaround.
- Certain functions lose performance when evaluated on subkernels (ref1, ref2). Some functions known to be affected: Rule, InterpolatingFunction. Workaround: re-evaluate the affected expression as expression = expression on the subkernels, as in the sketch below. This is described in the last entry under the Possible Issues for DistributeDefinitions.
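A minimal sketch of that workaround, with a hypothetical InterpolatingFunction named if:
if = Interpolation[Range[10]^2];  (* some InterpolatingFunction *)
DistributeDefinitions[if];
ParallelEvaluate[if = if;]        (* re-evaluate on each subkernel to restore performance *)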