Beyond .then.then.catch — using and abusing Promises for fun and profit
Some common pitfalls and design patterns when using Promises and async functions
Single-threaded, event-driven, asynchronous I/O is one of the most interesting features of the JS language. It is not a new concept: more than 20 years ago, one very influential paper in the world of network programming, “The C10K problem”, was an eye-opening experience for a whole generation of system developers, and it spawned arguably the fastest and one of the most widely used low-level I/O frameworks in the world, libevent. It was also one of the driving forces that pushed the Apache HTTP team to reimplement its worker model.
Today, JS takes that asynchronous programming pattern to a new level. It is a language that was never designed for heavy I/O; in fact, its initial objective was writing simple UI components, which, by pure chance, also happen to be naturally asynchronous. The remarkably well-performing V8 engine, coupled with the fact that JS imposes its blazingly fast but otherwise impractical I/O model, made Node.js the absolute performance winner among modern high-level non-statically-compiled languages.
As asynchronous programming remains less straightforward and tends to result in difficult-to-read code, it is the area that has seen the largest number of new features in the JS language during the last few years. First Promises and then async/await were supposed to make asynchronous programming easier and less error-prone. In this story, I go through implementing a simple Wikipedia crawler that shows various advanced uses of Promises and asynchronous functions. All the code uses ES module (.mjs) notation and is compatible with Node.js v12.
The task is very simple — use the public Wikipedia API to retrieve, for a given starting page, all the pages that link to it, up to a certain depth, and the number of times each page is present in our tree graph.
We will be using a very simple JS implementation of the Wikipedia API, wiki.mjs, which allows for rate-limited concurrent access. Using the API is beyond the scope of this story, but you are free to check my story Parallelizing download loops in JS with async-await-queue.
wiki.mjs provides us with two simple asynchronous functions, wiki.info() and wiki.linksHere(), to retrieve the page titles and the pages that link to a given page. info() takes a pageId and resolves with the page information, including its title, and linksHere() takes a pageId and resolves with an array of pageIds. Retrying on error (the Wikipedia API can sometimes be flaky) is also taken care of. The full code can be found here.
Let’s first try the trivial implementation, using modern async JS:
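The original listing is embedded as a gist; here is a minimal reconstruction. The tiny in-memory wiki object is a hypothetical stand-in for wiki.mjs, added only so the sketch runs standalone, and the data set is invented; line numbers cited in the text refer to the original listing:

```javascript
// Hypothetical in-memory stand-in for wiki.mjs, so the sketch runs standalone
const pages = {
  1: { title: 'Main Page', links: [2, 3] },
  2: { title: 'Page A', links: [3, 1] },
  3: { title: 'Page B', links: [] }
};
const wiki = {
  info: async (id) => ({ title: pages[id].title }),
  linksHere: async (id) => pages[id].links
};

const result = {};

// Retrieve all pages linking to id, recursing up to depth levels
async function getPage(id, depth) {
  const links = await wiki.linksHere(id);
  for (const page of links) {
    if (result[page] === undefined) {           // the if discussed below
      result[page] = {
        // result[page] stays undefined until this await returns
        title: (await wiki.info(page)).title,
        count: 0
      };
    }
    result[page].count++;
    if (depth > 0)
      await getPage(page, depth - 1);
  }
}

const crawl = getPage(1, 2);
crawl.then(() => console.log(result));
```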
Everything is pretty straightforward: a recursive asynchronous function, getPage, retrieves all the linking pages, creates a new object in an array indexed by pageId if this is the first time we are seeing this pageId, and then increments the counter. We then proceed to the next recursion level.
JS makes it all very easy and simple. However, while this first implementation will initially appear to work, it contains one subtle and nasty bug that can be very hard to find. Look at line 9, that if. When the array element for the current pageId is undefined, we create a new object. Its title is retrieved by wiki.info(), which we await upon. Now, JS doesn’t have mutexes. It doesn’t really need them either: synchronization in JS is achieved purely by having naturally uninterruptible code blocks. In most other languages, a function can get interrupted by asynchronous code at any moment, and we have to use mutexes to protect the critical sections. In JS, the code has to trigger an iteration of the engine’s event loop by itself. Normally, this never happens in the middle of a function, except when using await. Calling await interrupts that async context, allowing other waiting async contexts to run. And until that await on line 12 returns, result[page] will stay undefined. Which means that another context can fail that same if and create another object, which will then get overwritten by the current context once its own await returns. This is a very common and particularly dangerous pitfall when using async contexts: you always have to be extra careful about which operations could potentially interrupt your code. Here is the correct code:
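A minimal reconstruction of the fix, again with a hypothetical in-memory stand-in for wiki.mjs so the sketch runs standalone; the key point is that the slot in result is claimed synchronously, before the interruptible await:

```javascript
// Hypothetical in-memory stand-in for wiki.mjs, so the sketch runs standalone
const pages = {
  1: { title: 'Main Page', links: [2, 3] },
  2: { title: 'Page A', links: [3, 1] },
  3: { title: 'Page B', links: [] }
};
const wiki = {
  info: async (id) => ({ title: pages[id].title }),
  linksHere: async (id) => pages[id].links
};

const result = {};

async function getPage(id, depth) {
  const links = await wiki.linksHere(id);
  for (const page of links) {
    if (result[page] === undefined) {
      // claim the slot synchronously, before any await can interrupt us
      result[page] = { title: undefined, count: 0 };
      result[page].title = (await wiki.info(page)).title;
    }
    result[page].count++;
    if (depth > 0)
      await getPage(page, depth - 1);
  }
}

const crawl = getPage(1, 2);
crawl.then(() => console.log(result));
```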
A very small change allows us to assign a value to result[page] before calling await.
Now that code is simple enough, but even for our example data set, the first 5 linking pages, starting with the main page, and up to depth 2, it takes ages to complete. If we want to do it faster, there are two possible improvements we can make: first, try to make more than one request in parallel, and second, try to cache responses, as we expect that some pages will be present more than once.
The parallel implementation is a little bit more interesting:
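A sketch of this version, with the same hypothetical in-memory stand-in for wiki.mjs so it runs standalone (line numbers cited in the text refer to the original embedded listing):

```javascript
// Hypothetical in-memory stand-in for wiki.mjs, so the sketch runs standalone
const pages = {
  1: { title: 'Main Page', links: [2, 3] },
  2: { title: 'Page A', links: [3, 1] },
  3: { title: 'Page B', links: [] }
};
const wiki = {
  info: async (id) => ({ title: pages[id].title }),
  linksHere: async (id) => pages[id].links
};

const result = {};

async function getPage(id, depth) {
  // we have no choice but to await the list of linking pages
  const links = await wiki.linksHere(id);
  const waiting = [];
  for (const page of links) {
    if (result[page] === undefined) {
      result[page] = { title: undefined, count: 0 };
      // launch the title retrieval without awaiting it;
      // title holds a pending Promise until it resolves
      result[page].title = wiki.info(page).then((info) => info.title);
      waiting.push(result[page].title);
    }
    result[page].count++;
    if (depth > 0)
      // launch the next recursion level without awaiting it either
      waiting.push(getPage(page, depth - 1));
  }
  // resolves when every collected Promise has resolved
  return Promise.all(waiting);
}

const crawl = getPage(1, 2);
crawl.then(async () => {
  for (const id of Object.keys(result))
    console.log(await result[id].title, result[id].count);
});
```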
For each page we start by retrieving the linking pages. We have to await the links, there is no other choice. But once we have them, we can launch the retrieval of the title and the processing of all the pages from the next level at the same time. Every async function returns a Promise, and we collect all those Promises in an array. Then, at line 21, we simply return a new Promise that will resolve when all the Promises in the array have resolved (or reject as soon as one of them is rejected). Thus we construct one enormous chain of Promises that will be awaited by the final then. The engine takes care of parallelizing everything. It is magic, isn’t it?
Also note the change on line 14. wiki.info() returns a Promise that resolves with the info object of the page. This means that once the Promise is resolved, its value becomes the info object. We attach another handler to this Promise that takes just the title out and returns it. Thus result[page].title will initially be a pending Promise, and it will contain the title once this Promise is resolved. We push it to the array to make sure it will be among the Promises we await before outputting the final result. In theory, there should not be any difference between a normal object and a Promise resolved with that normal object. In practice, there are some very subtle differences, mainly around the order in which the engine’s event loop will handle them. Then there is the catch statement. Normally, the rule of thumb is that wherever there is a non-awaited Promise, there should be a catch() statement. A classical try..catch block won’t be enough, because by the time the exception is thrown, the try..catch context might no longer exist:
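A minimal standalone illustration of this trap; failingInfo is a hypothetical stand-in for a rejecting wiki.info() call:

```javascript
// Hypothetical stand-in for an API call that fails
const failingInfo = () => Promise.reject(new Error('API error'));

let caughtBySyncTry = false;
let caughtByChainedCatch = false;

try {
  // launched but not awaited: the rejection arrives long after
  // this try..catch context has ceased to exist
  const title = failingInfo().then((info) => info.title);
  // a chained catch(), on the other hand, stays attached to the Promise
  title.catch(() => { caughtByChainedCatch = true; });
} catch (e) {
  caughtBySyncTry = true; // never reached
}

setTimeout(() => {
  // prints { caughtBySyncTry: false, caughtByChainedCatch: true }
  console.log({ caughtBySyncTry, caughtByChainedCatch });
}, 10);
```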
Here, the execution of that portion of the code will be over by the time the exception from wiki.info() is processed, and the engine won’t be able to deliver it anywhere. On the other hand, a catch() statement, which has its own closure and is essentially an error callback, will remain chained to the Promise. Recent versions of Node.js will even try to push you towards writing correct code by warning that, in the future, the engine will simply exit when it is not able to deliver an exception. This is no longer an empty threat: starting with Node.js 15, an unhandled rejection terminates the process by default, so you should definitely strive to handle every rejection.
If we try that new improved version of our crawler, we shall see an improvement roughly equal to the number of concurrent requests. So, with 4 of them running at the same time, this implementation will be (almost) 4 times faster than the trivial one.
Now let’s try the cache improvement:
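A sketch of the cached serial version, with the hypothetical in-memory stand-in for wiki.mjs extended with an apiCalls counter (an addition of mine, not in the original) so the saving is visible:

```javascript
// Hypothetical in-memory stand-in for wiki.mjs; apiCalls counts simulated API requests
const pages = {
  1: { title: 'Main Page', links: [2, 3] },
  2: { title: 'Page A', links: [3, 1] },
  3: { title: 'Page B', links: [] }
};
let apiCalls = 0;
const wiki = {
  info: async (id) => ({ title: pages[id].title }),
  linksHere: async (id) => { apiCalls++; return pages[id].links; }
};

const result = {};
const linksHereCache = {};

async function getPage(id, depth) {
  // only hit the API when the cache does not already have the answer
  if (linksHereCache[id] === undefined)
    linksHereCache[id] = await wiki.linksHere(id);
  for (const page of linksHereCache[id]) {
    if (result[page] === undefined) {
      result[page] = { title: undefined, count: 0 };
      result[page].title = (await wiki.info(page)).title;
    }
    result[page].count++;
    if (depth > 0)
      await getPage(page, depth - 1);
  }
}

const crawl = getPage(1, 2);
// with this tiny data set, 3 linksHere calls instead of 5 without the cache
crawl.then(() => console.log(result, `${apiCalls} linksHere calls`));
```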
It is very simple and straightforward: before retrieving the linking pages at line 9, we check whether we already have them in our new cache array. As is always the case with caching, its efficiency will depend greatly on the nature of the data set. For our example data set, the improvement is quite marginal, reducing the number of API calls from 110 to 107, but for some data sets it could be even larger than the gain from parallelization, so it is definitely worth doing.
The last step will be to combine both. At first it seems easy:
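A sketch of this naive combination, again over the hypothetical in-memory stand-in for wiki.mjs with an apiCalls counter (my addition) to expose the problem:

```javascript
// Hypothetical in-memory stand-in for wiki.mjs; apiCalls counts simulated API requests
const pages = {
  1: { title: 'Main Page', links: [2, 3] },
  2: { title: 'Page A', links: [3, 1] },
  3: { title: 'Page B', links: [] }
};
let apiCalls = 0;
const wiki = {
  info: async (id) => ({ title: pages[id].title }),
  linksHere: async (id) => { apiCalls++; return pages[id].links; }
};

const result = {};
const linksHereCache = {};

async function getPage(id, depth) {
  // BUG: while we are awaiting, linksHereCache[id] stays undefined,
  // so a concurrent context will issue the very same request again
  if (linksHereCache[id] === undefined)
    linksHereCache[id] = await wiki.linksHere(id);
  const links = linksHereCache[id];
  const waiting = [];
  for (const page of links) {
    if (result[page] === undefined) {
      result[page] = { title: undefined, count: 0 };
      result[page].title = wiki.info(page).then((info) => info.title);
      waiting.push(result[page].title);
    }
    result[page].count++;
    if (depth > 0)
      waiting.push(getPage(page, depth - 1));
  }
  return Promise.all(waiting);
}

const crawl = getPage(1, 2);
// more linksHere calls than the 3 unique pages: the cache misses the race
crawl.then(() => console.log(`${apiCalls} linksHere calls`));
```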
We simply add the cache step at line 8 to the previous parallel implementation. Then we launch it, and… we end up doing 110 requests. Why?
Because once again we got burned by interruptible code. The first time we retrieve the linking pages for an id, we await. linksHereCache[id] stays undefined until that await returns. And any other async context that passes through here while we are waiting will simply do the whole thing again. Then all the different values will overwrite each other.
This is a very common situation in which I usually use a standard design pattern that is a very good solution for caching of values obtained by an asynchronous operation — create a cache not of the values themselves, but rather of Promises for delivering the values:
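A sketch of the pattern, with the same hypothetical in-memory stand-in for wiki.mjs and apiCalls counter (my addition) showing that every page is now fetched exactly once:

```javascript
// Hypothetical in-memory stand-in for wiki.mjs; apiCalls counts simulated API requests
const pages = {
  1: { title: 'Main Page', links: [2, 3] },
  2: { title: 'Page A', links: [3, 1] },
  3: { title: 'Page B', links: [] }
};
let apiCalls = 0;
const wiki = {
  info: async (id) => ({ title: pages[id].title }),
  linksHere: async (id) => { apiCalls++; return pages[id].links; }
};

const result = {};
const linksHereCache = {};

async function getPage(id, depth) {
  // store the Promise itself: the assignment is synchronous, so no other
  // context can ever see an undefined cache entry for an in-flight request
  if (linksHereCache[id] === undefined)
    linksHereCache[id] = wiki.linksHere(id);
  // every context, first or not, awaits the very same Promise
  const links = await linksHereCache[id];
  const waiting = [];
  for (const page of links) {
    if (result[page] === undefined) {
      result[page] = { title: undefined, count: 0 };
      result[page].title = wiki.info(page).then((info) => info.title);
      waiting.push(result[page].title);
    }
    result[page].count++;
    if (depth > 0)
      waiting.push(getPage(page, depth - 1));
  }
  return Promise.all(waiting);
}

const crawl = getPage(1, 2);
crawl.then(() => console.log(`${apiCalls} linksHere calls`)); // one per unique page
```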
The first getPage for a given id will populate linksHereCache with a pending Promise. A single Promise can be saved in a variable and then be awaited by multiple async contexts. Once it is resolved, its value will be (almost) identical to a simple reference to the resolved object.
Here only the first call for a given id will create a new Promise. All other calls will be blocked awaiting that same Promise on the next line, line 10. Once that Promise is resolved, all of them will proceed. Accessing the original object through that resolved Promise should not change (almost) anything, and awaiting an already resolved Promise should be (almost) a no-op.
Almost, because there is one notable difference in the V8 engine: every await, even an await of a constant like await 1, which is a perfectly valid expression, is interruptible code, and it will provoke one full iteration of the engine’s loop. It will potentially allow other waiting async contexts to run. In fact, the effect of await 1 is very similar to that of the old sched_yield system call in POSIX, as it gives all other runnable tasks a chance to proceed. In theory, it can even be used to avoid blocking the event loop during CPU-intensive operations. However, since it is not an officially documented feature of the JS language, but rather an implementation detail of the V8 engine, you would do best to abstain from using it.
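The yielding effect is easy to observe directly. In this small standalone sketch, both async contexts run up to their await 1, the synchronous code finishes, and only then do the two continuations get their turn:

```javascript
const order = [];

// two async contexts that each yield once via await 1
(async () => { order.push('a1'); await 1; order.push('a2'); })();
(async () => { order.push('b1'); await 1; order.push('b2'); })();
order.push('sync');

setTimeout(() => console.log(order), 10);
// prints [ 'a1', 'b1', 'sync', 'a2', 'b2' ]
```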
The only way to avoid that useless iteration of the event loop is to not await when the Promise is already resolved. Alas, there is no official way to determine whether a Promise is still pending. There is an early ECMA proposal for it, Promise.prototype.inspect, but it isn’t considered a priority feature. The performance hit of preceding every cache access with an await is absolutely minimal, and for most uses it will be impossible to measure, but should you really need to avoid it, the correct way is to chain a small handler to the Promise which, upon the fulfillment of the Promise, replaces the reference with a plain object reference:
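A sketch of this final version, again over the hypothetical in-memory stand-in for wiki.mjs (line numbers cited in the text refer to the original embedded listing; in real code the swap handlers should also carry a catch(), which the never-failing mock lets us omit):

```javascript
// Hypothetical in-memory stand-in for wiki.mjs; apiCalls counts simulated API requests
const pages = {
  1: { title: 'Main Page', links: [2, 3] },
  2: { title: 'Page A', links: [3, 1] },
  3: { title: 'Page B', links: [] }
};
let apiCalls = 0;
const wiki = {
  info: async (id) => ({ title: pages[id].title }),
  linksHere: async (id) => { apiCalls++; return pages[id].links; }
};

const result = {};
const linksHereCache = {};

async function getPage(id, depth) {
  if (linksHereCache[id] === undefined) {
    linksHereCache[id] = wiki.linksHere(id);
    // once fulfilled, swap the Promise for the plain value
    linksHereCache[id].then((links) => (linksHereCache[id] = links));
  }
  // the price to pay: we must now handle both cases
  const links = linksHereCache[id] instanceof Promise ?
    await linksHereCache[id] : linksHereCache[id];
  const waiting = [];
  for (const page of links) {
    if (result[page] === undefined) {
      result[page] = { title: undefined, count: 0 };
      // the same swap trick for the title Promise
      result[page].title = wiki.info(page)
        .then((info) => (result[page].title = info.title));
      waiting.push(result[page].title);
    }
    result[page].count++;
    if (depth > 0)
      waiting.push(getPage(page, depth - 1));
  }
  return Promise.all(waiting);
}

const crawl = getPage(1, 2);
crawl.then(() => console.log(result, `${apiCalls} linksHere calls`));
```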
Line 9. It simply replaces the resolved Promise with a plain object reference, sparing you the need to always await it, but forcing you to check whether you have a Promise or a plain object. If you plan to loop through that cache retrieval more than a hundred thousand times per second, it might be worth it, like the example in my story Reading large structured text files in Node.js.
As you can see, I have added the same handler to the title Promise at line 21.
When using Promises, remember to always be extra careful with handling exceptions, especially when using Promises that are not immediately awaited in order to launch multiple async contexts. It is very good practice to test your code with an explicit throw in every single async handler; you will often be surprised by the exception propagation.
Also, when dealing with reusable values obtained by executing I/O code, try to keep them wrapped in the original Promise. This way you can easily dedupe access to them: the first one to ask creates the Promise and saves it; everyone else waits on that same Promise.
All the code is available on https://github.com/mmomtchev/wiki-crawler .