Debugging a common performance pitfall on Node.js

Or why manual throttling is mandatory when using async parallelism in Node.js

A valuable lesson from scientific computing that also applies to anyone doing I/O, networking or number-crunching

Photo by shun idota on Unsplash

I am running a very specialized weather site for paragliding and sailplane flying that does scientific computing in a web context.

Since discovering that the need for C++ code for high performance is constantly diminishing in “In 2021, is there still a huge performance difference between JavaScript and C++ for CPU-bound operations”, I set to finalize the migration of some of the API code from my old GDAL/C++ implementation to a more modern Node.js/scijs/gdal-async one — instead of invoking native code from Express, I was going to read and process the data directly in Node.js.

I expected to lose some performance on the way, but I was deeply convinced that the simplicity of the code was worth it — and I was quickly proven right as my new implementation made evident a number of bugs in the old one.

However the huge disappointment came when I started measuring the performance loss: generating an emagram (a weather diagram that requires retrieving a small amount of numerical data from a huge 4D space in numerous 2D GeoTIFFs) went up from a tiny 4s in C++ to 115s in Node.js.

No, this was simply not right. There was a problem worth investigating. As the main author of gdal-async and at least an occasional contributor to the remaining software I felt that I was in the right position to fix it.

So, first step, Node.js profiling. It is as simple as it gets:

$ node --prof program.js ; node --prof-process isolate*

Also, the program seemed to have extreme difficulty getting beyond 100% CPU utilization— ie it was mostly sequential. And it had a very sluggish response to Ctrl-Cwhich was the first sign of some kind of event loop jam.

This behavior had written all over it gdal-async, of which I am the main author. Did I do a sloppy job with the locking? The parallelization maybe? Definitely worth investigating!

So, time for the heavy machinery. Now, just a small side note — if you have never used perf from linux-tools before and you are interested in performance, you simply must try it. It is a huge step up from all the classical profilers you have been using as it allows for true vertical profiling of the whole system — from the user process to the kernel events and the L1 cache. It is a very complex tool and there are numerous tutorials out there — so go check them out if you have never used it.

$ perf record -F 99 node program.js ; perf report

Huh? Two C lines in the glibc fclose syscall are responsible for 2/3 of the CPU load. My program is spending most of its time closing files? What could explain this?

Let’s begin with those 2 lines in genops.c in glibc:

for (f = &_IO_list_all->file._chain; *f; f = &(*f)->_chain)
if (*f == (FILE *) fp) { *f = fp->file._chain; break; }

Er? Deleting elements from a singly-linked chain-list? In a mutex-protected critical section? Hmm. Not very clever for sure. How many files does my program open?

$ ls /proc/120282/fd | wc
47577

Oh holy ****. How did Node.js even get away with this behavior?

$ ulimit -n
1024

No, it is impossible. Unless someone t̶h̶i̶n̶k̶s̶ ̶t̶h̶e̶ ̶r̶u̶l̶e̶s̶ ̶d̶o̶ ̶n̶o̶t̶ ̶a̶p̶p̶l̶y̶ ̶t̶o̶ ̶h̶i̶m̶ is raising the limit.

$ ulimit -n -H
1048576

Is Node.js simply being cheeky?

Yes, it is, it happens in node.cc:

Turns out Node.js is one real file-descriptor-hungry beast. Its authors understand the problem and so they quietly s̶t̶e̶a̶l̶ ̶c̶o̶o̶k̶i̶e̶s̶ ̶a̶t̶ ̶n̶i̶g̶h̶t̶ raise the limit up to the maximum allowed without getting caught. And Linux on the other side allows it to get away with a limit its glibc can’t really handle very well.

How did we end up here?

The culprit lies in something that is a very common and standard way of achieving parallelism in JavaScript:

const q = [];
for (const element of array) q.push(asyncProcess(element));
await Promise.all(q);

This is the power of the amazing transparent parallelism in JavaScript. It is the official way of doing it — some linters will even raise a warning if you are using await in a loop — you are supposed to let the engine parallelize it for you. It allows for a remarkably easy and painless parallel programming by developers who lack even the most basic understanding of locking and thread pools.

As a small side note: this programming pattern currently requires that the heavy lifting is implemented in C++ but I hope that one day V8 will support running small JS thunks which do not access anything but their arguments in parallel — think of it as OpenMP.js— as this would be a very elegant exit from the current dead-end that is JS shared-memory parallelism.

However this pattern comes with one huge pitfall — the way await or Promises (there is no real difference behind the scenes) work. The event loop algorithm is not fair and it is prone to starvation — and this is what happens in this particular case. Promises are enqueued in a microtask queue and then the event loop proceeds to execute the various queues one in a time — until each one of them is empty. If one queue has 47000 elements, the event loop will execute them all — starving out everything else. And in the aforementioned typical code example it will create all of the async contexts before actually running the first one.

The official Node.js developer guide is very clear on this point:

The fair treatment of clients is thus the responsibility of your application.

This is akin to giving you a driverless car which allows you to ride without actually being able to drive, but that also has a tendency to drift in roundabouts unless you apply the brakes at the right moment.

One has to check all the throttling packages on npm to see the problem. You should of course start by my own — the best one out there — https://www.npmjs.com/package/async-await-queue — but there are also many others.

Manual throttling is mandatory when using async parallelism in Node.js.

If you are not manually throttling your async loops this will come back to bite you and it will bite you hard.

This applies whether you are doing I/O, networking or CPU bound tasks.

The previous example can be throttled (limiting to 16 tasks in parallel) using async-await-queue like this:

const q = new Queue(16, 0);
for (const element of array) q.run(() => asyncProcess(element));
await q.flush();

In this particular case, throttling brought the execution time from 115s down to 9s — well within the expected loss from switching from a GDAL/C++ implementation to a Node.js/scijs/gdal-async one.

Can anything be done?

First of all, I would like to remark a fact that seems almost unreal to me: in a piece of software as fundamental and mature as glibc is supposed to be, in one of its most basic core features—handling file descriptors—in a mutex-protected critical section there is an implementation of an O(n) algorithm where an O(1) is possible. I don’t know what are my chances of landing a PR in glibc that modifies the data structure and the algorithm used for handling FDs, especially when the use case is a program that opens 47000 of them — but it is probably worth the try — especially since it seems that I am not the first one to be surprised by this behavior. It won’t be a trivial PR since the chaining is implemented directly in the FILE structure — which cannot be modified. In the mean time maybe people should abstain from using fopen/fclose at all.

Second, maybe Node.js should rethink its strategy. Maybe one should not completely empty all the queues at every event loop iteration — maybe there should be a limit on the work done at each iteration. Surely, this is a major behavioral change — that would probably create bugs even in the Node.js internals — which sometimes rely on the order of processing of a nextTick vs a Promise, but it is definitely something that should be considered at some point. Maybe it is a bad idea to rely on the order of the event loop.

Or maybe simply throttling should be a core Node.js internal feature. async loops are such a common programming pattern and there should be an official right way of doing them.

If you are interested in throttling, maybe you should also check my other story on the same subject: