
performance tuning - Why is CompilationTarget -> C slower than directly writing with C?

Probably a hard question, but I think it's better to cry out loud.

I hesitated for a while over whether to post this on Stack Overflow with a c tag, but finally decided to keep it here.

This question can be viewed as a follow-up to "Has this implementation of FDM touched the speed limit of Mathematica?". In the answer to that post, Daniel managed to implement a compiled Mathematica function that is almost as fast (more precisely, 3/4 as fast) as one implemented directly in C++, with the help of devectorization, CompilationTarget -> "C", RuntimeOptions -> "Speed" and Compile`GetElement. Since then, this combination has been tested on various samples and has turned out to be quite effective at speeding up a CompiledFunction that involves a lot of array element access. I have benefited a lot from this technique; meanwhile, one question has never left my mind:

Why is a CompiledFunction created with the combination above still slower than one written directly in C++?

To make the question clearer and more answerable, let's use a simpler example. In the answers to this post about the Laplacian of a matrix, I created the following function with the technique above:

cLa = Hold@Compile[{{z, _Real, 2}},
     Module[{d1, d2}, {d1, d2} = Dimensions@z;
      Table[z[[i + 1, j]] + z[[i, j + 1]] + z[[i - 1, j]] + z[[i, j - 1]] -
        4 z[[i, j]], {i, 2, d1 - 1}, {j, 2, d2 - 1}]], CompilationTarget -> "C",
     RuntimeOptions -> "Speed"] /. Part -> Compile`GetElement // ReleaseHold;

and Shutao created one with LibraryLink (which is almost equivalent to writing the code directly in C):

src = "
#include \"WolframLibrary.h\"

DLLEXPORT int laplacian(WolframLibraryData libData, mint Argc, MArgument *Args, MArgument Res) {
    MTensor tensor_A, tensor_B;
    mreal *a, *b;
    mint const *A_dims;

    mint n;
    int err;
    mint dims[2];
    mint i, j, idx;

    /* Input matrix; the code assumes it is square (n x n). */
    tensor_A = MArgument_getMTensor(Args[0]);
    a = libData->MTensor_getRealData(tensor_A);
    A_dims = libData->MTensor_getDimensions(tensor_A);
    n = A_dims[0];

    /* Output holds the interior points only: (n - 2) x (n - 2). */
    dims[0] = dims[1] = n - 2;
    err = libData->MTensor_new(MType_Real, 2, dims, &tensor_B);

    b = libData->MTensor_getRealData(tensor_B);
    for (i = 1; i <= n - 2; i++) {
        for (j = 1; j <= n - 2; j++) {
            idx = n*i + j;
            /* idx+1-2*i-n equals (i-1)*(n-2) + (j-1), the row-major output index. */
            b[idx+1-2*i-n] = a[idx-n] + a[idx-1] + a[idx+n] + a[idx+1] - 4*a[idx];
        }
    }
    MArgument_setMTensor(Res, tensor_B);
    return LIBRARY_NO_ERROR;
}
";

lib = CreateLibrary[src, "laplacian"];

lapShutao = LibraryFunctionLoad[lib, "laplacian", {{Real, 2}}, {Real, 2}];
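The index arithmetic in the LibraryLink source is easy to misread, so here is a minimal stand-alone sketch of the same kernel in plain C, without any LibraryLink types. The name laplacian_plain and the use of size_t are my own choices for illustration and are not taken from Shutao's code; the sketch assumes a square n x n row-major input, as the LibraryLink version does.

```c
#include <assert.h>
#include <stddef.h>

/* 5-point Laplacian on the interior of an n x n row-major array `a`.
   Writes (n - 2) x (n - 2) values into `b` (row-major as well).
   Hypothetical helper for illustration only. */
void laplacian_plain(const double *a, double *b, size_t n)
{
    assert(n >= 3);
    size_t m = n - 2; /* side length of the output */
    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 1; j <= m; j++) {
            size_t idx = n * i + j;             /* input index of (i, j) */
            size_t out = (i - 1) * m + (j - 1); /* output index of (i, j) */
            /* Same offset as the expression used in the LibraryLink code. */
            assert(out == idx + 1 - 2 * i - n);
            b[out] = a[idx - n] + a[idx - 1] + a[idx + n] + a[idx + 1]
                     - 4.0 * a[idx];
        }
    }
}
```

Called on a 4 x 4 array filled with i*i + j*j, this should give 4 at every interior point, which makes a quick sanity check for the index mapping.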

and the following is the benchmark by anderstood:

[Image: benchmark timings by anderstood]

Why is cLa slower than lapShutao?

Do we really touch the speed limit of Mathematica this time?

Answers that address the reason for the inferiority of cLa, or that improve the speed of cLa, are both welcome.

…OK, the example above turns out to be special. As mentioned in the comment below, cLa becomes as fast as lapShutao if we extract the LibraryFunction inside it:

cLaCore = cLa[[-1]];

mat = With[{n = 5000}, RandomReal[1, {n, n}]];

cLaCore@mat; // AbsoluteTiming
(* {0.269556, Null} *)

lapShutao@mat; // AbsoluteTiming
(* {0.269062, Null} *)

However, this trick has a noticeable effect only if the output is memory-consuming.

Since I've chosen such a big title for my question, I feel somewhat responsible for adding a more general example. The following is the fastest 1D FDTD implementation in Mathematica so far:

fdtd1d = ReleaseHold@
   With[{ie = 200, cg = Compile`GetElement},
    Hold@Compile[{{steps, _Integer}},
        Module[{ez = Table[0., {ie + 1}], hy = Table[0., {ie}]},
         Do[
          Do[ez[[j]] += hy[[j]] - hy[[j - 1]], {j, 2, ie}];
          ez[[1]] = Sin[n/10.];
          Do[hy[[j]] += ez[[j + 1]] - ez[[j]], {j, 1, ie}], {n, steps}]; ez],
        "CompilationTarget" -> "C", "RuntimeOptions" -> "Speed"] /. Part -> cg /.
     HoldPattern@(h : Set | AddTo)[cg@a__, b_] :> h[Part@a, b]];

fdtdcore = fdtd1d[[-1]];

and the following is an implementation via LibraryLink (which is almost equivalent to writing the code directly in C):

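str = "#include \"WolframLibrary.h\"
#include <math.h>

DLLEXPORT int fdtd1d(WolframLibraryData libData, mint Argc, MArgument *Args, MArgument Res) {
    MTensor tensor_ez;
    double *ez;
    int i, t;
    const int ie = 200, steps = MArgument_getInteger(Args[0]);
    const mint dimez = ie + 1;

    double hy[ie];

    libData->MTensor_new(MType_Real, 1, &dimez, &tensor_ez);
    ez = libData->MTensor_getRealData(tensor_ez);

    /* Zero both fields, as Table[0., ...] does in the compiled version. */
    for (i = 0; i < ie + 1; i++) { ez[i] = 0; }
    for (i = 0; i < ie; i++) { hy[i] = 0; }

    /* Loop bodies reconstructed from the Mathematica code above (0-based indices). */
    for (t = 1; t <= steps; t++) {
        for (i = 1; i < ie; i++) { ez[i] += hy[i] - hy[i - 1]; }
        ez[0] = sin(t / 10.);
        for (i = 0; i < ie; i++) { hy[i] += ez[i + 1] - ez[i]; }
    }

    MArgument_setMTensor(Res, tensor_ez);
    return 0;
}
";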

fdtdlib = CreateLibrary[str, "fdtd"];
fdtdc = LibraryFunctionLoad[fdtdlib, "fdtd1d", {Integer}, {Real, 1}];

test = fdtdcore[10^6]; // AbsoluteTiming
(* {0.551254, Null} *)
testc = fdtdc[10^6]; // AbsoluteTiming
(* {0.261192, Null} *)

As one can see, the algorithms in both pieces of code are the same, but fdtdc is twice as fast as fdtdcore. (Well, the speed difference is larger than it was two years ago; the reason might be that I'm no longer on a 32-bit machine.)

My C compiler is TDM-GCC 4.9.2, with "SystemCompileOptions"->"-Ofast" set in Mathematica.

