In 2021, is there still a huge performance difference between JavaScript and C++ for CPU-bound operations?
If you think that this is a joke, then this story is for you
I recently published a new plugin for the JS bindings of GDAL (which I currently maintain) for importing and exporting N-dimensional arrays to and from scijs. scijs, while still far behind its Python namesake, is gaining more and more popularity, driven by the general success of JavaScript and V8.
I always knew that Node.js I/O performance was unbeatable compared to other scripting languages. Then, last year, I gave up on Python completely after I was unable to beat my own mono-threaded JS implementation of a linear optimization algorithm with multi-threaded Python.
Still, being used to the node-gdal-async model, where a multi-threaded C++ implementation is driven by a mono-threaded but asynchronous JS super-structure (currently the leading high-performance model in the Node.js ecosystem), I naturally assumed that a good implementation of multidimensional arrays in scijs just had to work the same way.
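From the JS side the pattern looks roughly like this; addAsync and my-native-addon are hypothetical stand-ins for any async native binding, not an actual ndarray or gdal API:

// The addition runs on a C++ worker thread; the single JS thread only
// awaits the result and stays free to service other events meanwhile.
const { addAsync } = require('my-native-addon');   // hypothetical addon

async function run(a, b) {
  const sum = await addAsync(a, b);                // heavy lifting happens in C++
  console.log('added', sum.length, 'elements');
}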
So I set out to do some testing to see what could be achieved by implementing a drop-in replacement for some of the basic operations in ndarray-ops, namely comparing, adding and multiplying matrices.
This story is not about basic matrix operations, which are a very well studied problem. It is about using JavaScript for heavy CPU-bound work.
When it comes to matrices, you rarely have to reinvent the wheel, especially when your target language is C++. So I set out to compare Eigen, one of the leading linear algebra libraries for C++, to ndarray. Eigen, implemented as C++ templates, supports SIMD instructions (SSE and AVX) on modern x86 CPUs via hand-crafted inner loops. Surely, when operating on very large matrices, Eigen was going to smoke ndarray. I expected at least a factor of 10, maybe even more.
Turns out, it is not always that simple.
My test consists of adding two 10,000x10,000 Float32 (TYPE) matrices (400 MB each). It is a somewhat unusual test that requires very high memory bandwidth.
My first method is the “naive JS” one, which simply did:
const ndarray = require('ndarray');
// TYPE = Float32Array, size = 10000
const a = ndarray(new TYPE(size * size), [size, size], [1, size]);
const b = ndarray(new TYPE(size * size), [size, size], [1, size]);
for (let i = 0; i < size * size; i++) a.data[i] += b.data[i];
It wouldn’t work for anything but a simple addition of matrices with identical in-memory strides. I planned to use it as a baseline to check the overhead of ndarray-ops.
The ndarray-ops one:
const ops = require('ndarray-ops');
const a = ndarray(new TYPE(size * size), [size, size], [1, size]);
const b = ndarray(new TYPE(size * size), [size, size], [1, size]);
ops.addeq(a, b);
which is basically the same, except ndarray doesn’t really know that the destination is the same as one of the operands.
And finally Eigen C++ matrices:
#include <Eigen/Dense>
using namespace Eigen;   // TYPE = float, SIZE = 10000
Matrix<TYPE, Dynamic, Dynamic> a(SIZE, SIZE);
Matrix<TYPE, Dynamic, Dynamic> b(SIZE, SIZE);
a += b;
which does in-place addition too.
Matrices are initialized (omitted here for brevity) so as to defeat any lazy memory allocation.
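On the JS side, a minimal way to do that (just a sketch, with TYPE and size as above) is to touch every element:

// Writing every element forces the OS to actually commit the pages,
// so the benchmarked run does not pay for page faults.
for (let i = 0; i < size * size; i++) {
  a.data[i] = Math.random();
  b.data[i] = Math.random();
}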
So, first round:
-----------------------------------------------------
trivial | ndarray-ops | Eigen | Eigen w/SIMD
-----------------------------------------------------
  130ms |       224ms |  46ms |         33ms   AVX512 (recent Intel)
  434ms |       579ms | 152ms |        126ms   SSE2 (older Athlon)
Not that bad from V8. Once again, Eigen is using internal hand-optimized SSE2/AVX assembly loops. I was expecting a larger difference.
Now let’s try another test, the one that I actually needed: the two matrices have different strides, one in row-major order and the other in column-major order. This time the trivial implementation won’t work. For each library, a is the matrix using that library’s default stride.
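On the ndarray side this simply means passing different stride vectors; here is a sketch of that setup (TYPE and size as before, not necessarily my exact benchmark code):

// a uses ndarray's default row-major layout, b is explicitly column-major
const a = ndarray(new TYPE(size * size), [size, size]);             // strides [size, 1]
const b = ndarray(new TYPE(size * size), [size, size], [1, size]);  // strides [1, size]
ops.addeq(a, b);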
Second round:
----------------------------------
ndarray-ops |  Eigen | Eigen SIMD
----------------------------------
      363ms |  533ms |      533ms   AVX512 (recent Intel i7)
     1131ms | 3032ms |     3032ms   SSE2 (older Athlon Phenom II BE)
Huh?
At this point, confronted with that result, I spent more than three hours fiddling with various compilation options.
I simply refused to believe it. At one point I even suspected that some black magic was at hand and V8 was secretly doing parallel work on multiple cores.
I ended up disassembling the code, and this is the moment I decided to write this story, as it was a valuable lesson (and also a way to justify all the time I had wasted).
(Turns out the answer was very simple: on this particular operation Eigen was not using tiling and was thrashing the L1 cache. Eigen is a very good library with a very good performance reputation, but on this particular operation it simply did a sloppy job; there is no other way to put it. The nature of the operation used as an example makes it very memory-bandwidth dependent, which is why SIMD has only a marginal impact: it is mostly a fight for the L1 cache, and the order in which the matrices are traversed really matters. When I reimplemented the Eigen addition using tiling, I was able to drop the execution time by an order of magnitude; see the note at the end. For the record, Eigen does use tiling when doing multiplication; I had simply hit a weak spot.)
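To make the idea concrete, here is a minimal sketch of tiling in plain JS, operating directly on two flat Float32Arrays from the second test (a row-major, b column-major); the block size of 64 is an assumption, not a tuned value:

const BLOCK = 64;                            // assumed tile size, not tuned
function addTiled(a, b, size) {
  for (let bi = 0; bi < size; bi += BLOCK) {
    for (let bj = 0; bj < size; bj += BLOCK) {
      const iEnd = Math.min(bi + BLOCK, size);
      const jEnd = Math.min(bj + BLOCK, size);
      for (let i = bi; i < iEnd; i++) {
        for (let j = bj; j < jEnd; j++) {
          // a is row-major (i * size + j), b is column-major (j * size + i)
          a[i * size + j] += b[j * size + i];
        }
      }
    }
  }
}

Within a tile both operands stay in the L1 cache, instead of every access to b landing on a different cache line.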
First of all, a word on ndarray-ops: it is built around the idea of generating custom functions for any given array shape and stride. If you are coming from a modern structured language, the idea of constructing a function by writing code in a string and then passing it to the interpreter might look like the ultimate sacrilege, but it is a common and very useful (even if not always very safe) programming paradigm in the JavaScript world. Here is the inner thunk that it generated on this occasion:
for(i1 = 0; i1 < s0; ++i1) {
  for(i0 = 0; i0 < s1; ++i0) {
    a0[p0] += a1[p1]
    p0 += d0s0
    p1 += d1s0
  }
}
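The generation mechanism itself is ordinary JavaScript. Here is a toy sketch of the paradigm (this is not the actual ndarray-ops/cwise generator, just the general idea): bake the concrete strides into a source string and let new Function hand V8 a plain, specialized function it can JIT aggressively.

// Toy illustration only: specialize an element-wise loop for concrete strides.
function compileAddeq(stride0, stride1, count) {
  const body = `
    for (let i = 0; i < ${count}; ++i) {
      a0[p0] += a1[p1];
      p0 += ${stride0};
      p1 += ${stride1};
    }`;
  return new Function('a0', 'a1', 'p0', 'p1', body);
}

// Hypothetical usage on two flat Float32Arrays:
const addeq = compileAddeq(1, 1, 8);
addeq(new Float32Array(8).fill(1), new Float32Array(8).fill(2), 0, 0);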
Also, for those new to V8: V8 works in both interpreter and JIT-compiler mode. Usually the first few iterations are run by the interpreter; then, as the function becomes “hotter”, there will be several rounds of JIT compilation, starting with the most commonly encountered code path and going up to a fully optimized compilation of those parts that can be compiled, as JavaScript, in the general case, cannot be fully compiled. Usually even the final version will have an occasional exception handler or two for code paths that are never triggered. So here is the final optimized output of the inner loop of this function as produced by V8:
0x00003f738f8c5860: mov %rcx,%r12 # Not great
0x00003f738f8c5863: mov %r11,%r8 # register
0x00003f738f8c5866: mov %r14,%rbx # optimization
0x00003f738f8c5869: cmp -0x158(%rbp),%r12d # i0 < s1
0x00003f738f8c5870: jge 0x3f738f8c58f7 # loop exit
0x00003f738f8c5876: movslq %r8d,%r11 # array 1 start
0x00003f738f8c5879: cmp %rdx,%r11 # rdx = array 1 size
0x00003f738f8c587c: jae 0x3f738f8c6419 # exception handler
0x00003f738f8c5882: lea (%r9,%rsi,1),%rcx # array 1 offset
0x00003f738f8c5886: movslq %ebx,%r14 # array 2 start
0x00003f738f8c5889: movss (%rcx,%r11,4),%xmm0 # load float 1
0x00003f738f8c588f: cmp %rdi,%r14 # rdi = array 2 size
0x00003f738f8c5892: jae 0x3f738f8c6425 # exception handler
0x00003f738f8c5898: lea (%r15,%rax,1),%rcx # array 2 offset
0x00003f738f8c589c: movss (%rcx,%r14,4),%xmm1 # load float 2
0x00003f738f8c58a2: cvtss2sd %xmm1,%xmm1 # somewhat archaic
0x00003f738f8c58a6: cvtss2sd %xmm0,%xmm0 # method
0x00003f738f8c58aa: addsd %xmm1,%xmm0 # add float
0x00003f738f8c58ae: lea (%r9,%rsi,1),%rcx # dst array offset!
0x00003f738f8c58b2: cvtsd2ss %xmm0,%xmm0 # why? useless!
0x00003f738f8c58b6: movss %xmm0,(%rcx,%r11,4) # store
0x00003f738f8c58bc: mov -0x120(%rbp),%r11 # p0
0x00003f738f8c58c3: add %r8d,%r11d # d0s0
0x00003f738f8c58c6: jo 0x3f738f8c6431 # exception handler
0x00003f738f8c58cc: mov -0x118(%rbp),%r14 # p1
0x00003f738f8c58d3: add %ebx,%r14d # d0s1
0x00003f738f8c58d6: jo 0x3f738f8c643d # exception handler
0x00003f738f8c58dc: mov %r12,%rcx # i0
0x00003f738f8c58df: add $0x1,%ecx # ++
0x00003f738f8c58e2: jo 0x3f738f8c6449 # exception handler
0x00003f738f8c58e8: cmp -0x20(%r13),%rsp # JIT handler?
0x00003f738f8c58ec: ja 0x3f738f8c5860 # loop
It is not absolutely perfect, but it is still remarkably clean. It has lots of bounds checking and exception handling, but you will get these with any modern “safe” language. It has a few more memory accesses than it should, the addition is not in place, the register optimization is terrible and the floats are not loaded in the most efficient way for a modern x86 CPU, but that is about it. It is not much worse than what you will get from g++ -O0.
A small side note: in this particular case, V8 easily wins the match against g++ -O0 because Eigen, implemented as C++ templates, has a very heavy debug framework. Otherwise, in the general case, V8 tends to be just a little bit slower than a non-optimizing C++ compiler.
Something that is really striking for a first-time reader of V8 assembly (like me) is that all the array indices are integers. As you know, there are no true integers in JavaScript; every number is a double. Still, V8 has turned them into integers with bounds checking. Remarkable.
Now, here is the high-performance inner loop produced by g++ with AVX512 when using matrices with identical strides (the first round):
0x00000000000014e8 <+56>: vmovaps (%rsi,%rcx,4),%ymm1
0x00000000000014ed <+61>: vaddps (%r8,%rcx,4),%ymm1,%ymm0
0x00000000000014f3 <+67>: vmovaps %ymm0,(%rsi,%rcx,4)
0x00000000000014f8 <+72>: add $0x8,%rcx # 8 floats per
0x00000000000014fc <+76>: cmp %rcx,%rax # iteration
0x00000000000014ff <+79>: jg 0x14e8
Now, I won’t lie to you (it would be difficult anyway): this one is obviously better. This is the result of 30 years of optimizing compiler experience. For starters, it processes 8 floats per iteration with a single 256-bit vector instruction (why the compiler did not unroll the loop further remains a mystery to me). It also does far fewer memory accesses. It doesn’t do bounds checking, so the code appears to be much shorter; but once again, the way V8 does bounds checking is very efficient and not too expensive. Also, the addition is in place, saving quite a few cycles.
In case you are wondering: the inner assembly code comes directly from Eigen itself via compiler intrinsics, but the loop logic around it has been produced by the compiler.
This is the g++ loop adding matrices with different strides (no AVX possible in this case):
0x00000000000014c0 <+80>: movss (%rax),%xmm0 # load
0x00000000000014c4 <+84>: addss (%rsi),%xmm0 # add
0x00000000000014c8 <+88>: add $0x4,%rax # increment on X
0x00000000000014cc <+92>: add %r11,%rsi # increment on Y
0x00000000000014cf <+95>: movss %xmm0,-0x4(%rax) # store
0x00000000000014d4 <+100>: cmp %r9,%rax # loop exit?
0x00000000000014d7 <+103>: jne 0x14c0 # loop
This is about perfect, and it should be our baseline for the V8 code. The first increment is a constant, the second one is a register. Adding the two numbers happens in place. Instead of computing the array offset at every iteration, the pointer is simply incremented and used as the loop iterator; recomputing that offset every time is probably the number one performance problem in the V8 assembly code.
Still, V8 comes remarkably close. Now, one has to admit a certain clumsiness when it comes to producing assembly output. Some variables that could have been kept in registers are accessed from memory, but as they will surely be in the L1 cache, it is not such a big deal; on this point the lack of sophistication of V8 is covered up by the modern CPU, as register optimization is not as important as it used to be 20 or 30 years ago. The instructions used are not always optimal: the cvtss2sd could have been avoided and the addition could have been done in place, especially since the JS is doing an in-place addition. The bounds and exception checking is remarkably good; it could hardly have been better.
Also, the thunk from ndarray-ops could probably eliminate the iterator variable manually, saving a few cycles per iteration. Alas, as there is no pointer arithmetic in JavaScript, computing the array offset must be left to V8, and this carries a significant performance price.
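For what it is worth, here is a sketch of what an iterator-free inner loop could look like, reusing the variables of the thunk above and assuming a positive stride d0s0 so that the end offset can be precomputed (my illustration, not generated ndarray-ops code):

// Drive the loop with the offset itself instead of a separate counter.
const end0 = p0 + s1 * d0s0;
while (p0 < end0) {
  a0[p0] += a1[p1];
  p0 += d0s0;
  p1 += d1s0;
}

The offset still has to be turned into an address by V8 on every access, which is exactly the step that pointer arithmetic lets you skip in C++.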
So, to recap:
- First of all, algorithms are more important than low-level optimization, and a higher-level language makes it easier to implement the better one
- Don’t assume that you need C++ to be fast; V8 JavaScript delivers almost compiled-like performance
- The current performance level of V8-produced assembly is a little bit below that of g++ -O0, and an order of magnitude above that of any other “interpreted” language
- Manual memory management and pointer arithmetic, which make C++ so hard to learn, are probably its last major strength against memory-managed languages; as JIT compilers mature, the gap in raw performance keeps narrowing
- It is still possible to get a 10x performance increase with hand-optimized assembly, but it is a very time-consuming process and even small details can easily ruin your performance, and your investment along with it
- In the particular case of V8 JavaScript, C++ also remains the only way to take advantage of multiple cores for computationally heavy tasks
- SIMD would be a very welcome addition to V8, especially since it is quite easy to apply to a simple loop like the one in the ndarray-ops thunk
Update: It seems that there is a very recent SIMD WebAssembly library on github: https://github.com/CrimsonCodes0/SIMD.ts
As of July 9th, v128 SIMD is a phase 4 proposal with a working implementation in V8 since 9.1.269.36 and Node.js since 16.4.0
This means that a near-future version of ndarray-ops could probably achieve near-native single-threaded performance
As a final note, after reimplementing the sum in Eigen using tiling (second round test), these are the results I got:
-----------------------------------------
ndarray-ops | Eigen a+b | Eigen tiled a+b
-----------------------------------------
      425ms |     533ms |            45ms   AVX512 (recent Intel i7)
     1273ms |    3032ms |           175ms   SSE2 (older Athlon Phenom II BE)
With special thanks to mikolalysenko and puzpuzpuz for their suggestions
2022 Update 1: This story has a sequel in which V8 comes even closer to C++:
2022 Update 2: The latest Intel CPUs can perform a LEA with triple operand indirection in 1 cycle, which renders the C++ pointer optimizations useless. Also, incrementing a C++ pointer uses the CPU ALU, while LEA uses a dedicated circuit called an AGU (Address Generation Unit), which frees up the ALU for real work.