Performance considerations when using and abusing functions in JavaScript/V8

Momtchil Momtchev
5 min read · Mar 31, 2022

(even if there is no such thing as a free lunch, some meals are definitely cheaper than others)


This article is the sequel of:

If you have never read it, you should probably start with it.

Recently, while working on ExprTk.js, I found that in some cases, especially when using floating-point numbers, scijs/cwise made V8 produce remarkably fast code for one particular function:

const fn = (x) => x*x + 2*x + 1

I couldn’t resist, so I disassembled its output and I was, once again, amazed by its remarkable efficiency. Since I already had a previous story where I had clearly underestimated its performance, I decided to right that wrong and publish a sequel.
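For reference, this kind of disassembly can be obtained by passing V8 flags through Node.js. A hypothetical invocation sketch (flag behavior varies across V8 versions, and the file name is an assumption):

```javascript
// Run with: node --print-opt-code --code-comments bench.js
// (--print-opt-code and --code-comments are V8 flags that Node.js
// passes through; output format varies by V8 version)
const fn = (x) => x * x + 2 * x + 1;

// Call the function enough times, with a consistent argument type,
// so that TurboFan considers it hot and emits optimized code
let sum = 0;
for (let i = 0; i < 1e6; i++) sum += fn(i * 0.5);
console.log(sum);
```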

While I was at it, I also decided to examine one very common dilemma every JavaScript developer has faced, namely the eternal question of whether a forEach loop is slower than a normal one and, if so, why.

So, first things first, the code.

Let’s examine the various functions:

Developers with long experience in compiled languages will immediately tell you that #1, the direct approach, is surely the best one. The helping hand in #2 isn’t really needed: the compiler will surely optimize away the repeated memory accesses. Calling a function, #3, is a good test of whether your compiler has an excellent inlining optimizer. If you are not aware of the subtle differences between #4 and #5, you would probably think that these are equivalent to #3. What about #6 and #7? Every time we ask the V8 gurus we get a different answer, so it is probably time to finally check this for ourselves.
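The original code embed does not survive in this copy, so here is a minimal reconstruction of what the first seven variants may have looked like, pieced together from the descriptions in this article; the names, the array size, and the use of Float64Array are all assumptions:

```javascript
// Hypothetical reconstruction of the benchmark variants; not the
// original gist. Array size of 1e6 is an assumption.
const size = 1e6;
const input = new Float64Array(size).map((_, i) => i * 0.5);
const output = new Float64Array(size);

const fn = (x) => x * x + 2 * x + 1;

// #1 direct: inline expression, re-reading input[i] three times
function direct() {
  for (let i = 0; i < input.length; i++)
    output[i] = input[i] * input[i] + 2 * input[i] + 1;
}

// #2 direct with help: hoist input[i] into a local variable
function directWithHelp() {
  for (let i = 0; i < input.length; i++) {
    const x = input[i];
    output[i] = x * x + 2 * x + 1;
  }
}

// #3 with function: rely on V8 inlining fn
function withFunction() {
  for (let i = 0; i < input.length; i++) output[i] = fn(input[i]);
}

// #4 with for..in: enumerates *string* keys, meant for Objects
function withForIn() {
  for (const i in input) output[i] = fn(input[i]);
}

// #5 with for..of: goes through the Iterator protocol
function withForOf() {
  let i = 0;
  for (const x of input) output[i++] = fn(x);
}

// #6 with map: allocates a new array for the result
function withMap() {
  return input.map(fn);
}

// #7 with forEach
function withForEach() {
  input.forEach((x, i) => { output[i] = fn(x); });
}

for (const f of [direct, directWithHelp, withFunction,
                 withForIn, withForOf, withMap, withForEach]) {
  console.time(f.name);
  f();
  console.timeEnd(f.name);
}
```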

OK, let’s see the results now.

…ta-da…

…and the winner is…

#1 direct: 6.396ms
#2 direct with help: 4.409ms
#3 with function: 5.076ms
#4 with function and for..in: 235.577ms
#5 with function and for..of: 15.278ms
#6 with map: 12.399ms
#7 with forEach: 10.519ms
#8 two slices: 6.998ms
#9 two slices with cheese: 7.856ms

Well, it turns out V8 is remarkably good at some tasks, but still needs some help with others. This one should definitely end up on a post-it note on one of the V8 developers’ screens: the transformation from #1 to #2 should be automatic. It will be even faster, as it will avoid the creation of a dynamic variable on the stack.

#3 is where we discover the (almost) free lunch: functions in JS are a dime a dozen. You can safely use and abuse them.

If no one ever told you that you should never use for..in with an Array, then after looking at #4, now you will remember it. It was made for Objects and it has many problems, including not guaranteeing order and creating temporary Arrays.
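A small illustration of the for..in pitfalls (a hypothetical snippet, not from the original benchmark): it yields string keys rather than values, and it walks every enumerable property, not just the numeric elements.

```javascript
const a = [10, 20, 30];

// for..in yields the *keys*, and they are strings
for (const k in a) console.log(typeof k, k);   // 'string' '0', 'string' '1', 'string' '2'

// It also picks up any enumerable property, own or inherited
a.extra = 'surprise';
const seen = [];
for (const k in a) seen.push(k);
console.log(seen);   // [ '0', '1', '2', 'extra' ]

// A plain for loop only sees the numeric elements
for (let i = 0; i < a.length; i++) console.log(a[i]);
```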

And of course, #5, #6 and #7. Now it is official: forEach should stop blaming the function call for what is clearly its own fault. All these methods of Array traversal are suboptimal and should be avoided when possible. In V8, a simple for loop is compiled directly, while these rely on more complex built-in implementations written in Torque. for..of, for example, supports the Iterator protocol of the underlying object and can be used with every Object that implements it. These methods do not have any added value when used with an Array.
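The Iterator protocol is where for..of earns its keep: any object that implements Symbol.iterator can be traversed with it. A quick hypothetical example, unrelated to the benchmark:

```javascript
// A custom object implementing the Iterator protocol
const range = {
  from: 1,
  to: 5,
  [Symbol.iterator]() {
    let current = this.from;
    const last = this.to;
    return {
      // Each call to next() advances the iteration state
      next: () => current <= last
        ? { value: current++, done: false }
        : { value: undefined, done: true }
    };
  }
};

const values = [];
for (const v of range) values.push(v);
console.log(values);   // [ 1, 2, 3, 4, 5 ]
```

This generality is precisely the overhead a plain Array does not need; a simple for loop over indices skips the protocol entirely.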

#8 and #9 show the dynamic compilation cost, which is less than I previously thought. The role of the cheese in #9 is to break the compiler optimization, because V8 initially compiled the function to accept only floating-point numbers. Calling it with a different type of argument forces a recompilation. That one-million-and-first call costs as much as a hundred thousand compiled iterations. Still, this is clearly not as expensive as you might have believed: V8 recompilation is very fast.
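The effect in #9 can be reproduced with a sketch along these lines (the harness and the iteration count are assumptions): optimize for doubles, then call once with a string, and the type feedback is invalidated.

```javascript
const fn = (x) => x * x + 2 * x + 1;

function run(n) {
  let s = 0;
  for (let i = 0; i < n; i++) s += fn(i * 0.5);
  return s;
}

// First slice: V8 optimizes fn for floating-point arguments
console.time('slice 1');
run(1e6);
console.timeEnd('slice 1');

// The cheese: a call with a completely different argument type
// forces V8 to throw away the optimized code
console.log(fn('cheese'));   // NaN

// Second slice: V8 recompiles, fast but not free
console.time('slice 2');
run(1e6);
console.timeEnd('slice 2');
```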

Now to go back to my initial observation.

In fact, unlike scijs/ndarray-ops, which uses the slightly slower #1 form, scijs/cwise uses the most efficient form possible — #2.

In fact, when using #2 on x86-64, as is the case with scijs/cwise, V8 produces this absolutely amazing machine code, which has only two memory accesses per iteration and keeps everything in registers:

movapd  %xmm1,%xmm2         # x has been previously loaded in xmm1
addsd   %xmm1,%xmm2         # x+x = 2*x in xmm2
movapd  %xmm1,%xmm3         # the previous add is still running
mulsd   %xmm1,%xmm3         # x*x in xmm3
addsd   %xmm2,%xmm3         # x*x + 2*x in xmm3
lea     (%r11,%rax,1),%rdx  # calculate the output offset
addsd   %xmm0,%xmm3         # constant xmm0=1, x*x + 2*x + 1 in xmm3
movsd   %xmm3,(%rdx,%r15,8) # store the result

+2*x has been transformed to +x+x which is slightly faster and avoids using a constant that must be loaded into an FPU register. Instructions are correctly interleaved to take advantage of the super-scalar architecture. And — something I previously believed was absent from V8 — instead of being loaded at each iteration, the constant 1 is being held throughout the loop in a static register — xmm0. This is clearly indicative of a very good register optimization — unlike the example from my previous story.

In fact, this piece of machine code is equivalent to what gcc or clang would produce at -O3, maximum optimization. Yes, in this particular example, when using floating-point numbers, V8 runs at the same speed as highly optimized C++ code. Integers would have added a few more jo (jump-on-overflow) instructions for exception handling.

Remember to tell this to everyone who still intends to rewrite scijs in C++.

Or to whoever told you that JavaScript was ill-suited for data-mining.

I am a hungry engineer who is currently unemployed and turned open source developer because my ex-employers are extorting me to cover up a remarkably ugly sex scandal. I do mostly Node.js and V8 stuff.
