How many times have you been confronted with the classic problem of parallelizing download loops in crawlers and scrapers?
Consider the following typical crawler or scraper code:
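The original code listing is not reproduced here; a minimal sketch of such a loop, assuming a hypothetical `download(url)` helper that fetches one page and swallows failures, might look like:

```javascript
// Hypothetical helper (assumed, not from the article): fetch one URL
// and return its body, or null on failure so the crawl continues
async function download(url) {
  try {
    const res = await fetch(url);
    return await res.text();
  } catch (e) {
    return null;
  }
}

// The sequential crawler loop: each iteration awaits
// the previous download before starting the next one
async function crawl(urls) {
  for (const url of urls) {
    const body = await download(url);
    // ...process body...
  }
}
```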
Thanks to async / await, this is as simple and readable as it can get.
But what if you have thousands of URLs? This would take ages as each iteration waits for the previous one.
Luckily, JS has a solution:
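The parallel version described below is also missing from this copy; a sketch, again assuming the same hypothetical `download(url)` helper, would push all the promises into an array and await them together:

```javascript
// Hypothetical helper (assumed, not from the article)
const download = (url) => fetch(url).then((r) => r.text()).catch(() => null);

async function crawlParallel(urls) {
  const p = [];
  for (const url of urls) {
    // No await here: the download starts immediately and we only
    // keep the resulting Promise
    p.push(
      download(url).then((body) => {
        // ...process body...
      })
    );
  }
  // allSettled never rejects on the first failed promise,
  // so one broken URL cannot abort the whole batch
  await Promise.allSettled(p);
}
```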
Here we are launching all the downloads without await. The beauty of promise chaining allows us to create new promises out of a succession of asynchronous operations. We push all those promises to an array and then await the parallel completion of all of them. It is important to use allSettled, as it won't stop on the first rejected promise, thus preserving the behavior of the original code, which continued when one of the URLs failed.
Now you can launch this and just sit back and relax. That is, until the network system administrator comes running in, yelling. One of the main reasons for Node.js's success was its particularly efficient I/O subsystem, and this piece of code can easily bring a high-end server down when launched from a laptop. Or even worse, this could be a public API, and the next message you get will be "Your IP address has been permanently banned due to abuse".
Should you decide to take this approach on the front-end, Chrome will silently solve the problem for you by limiting the number of available sockets and queuing the requests. However, when a user of your site decides to open a new tab while the code is running, they will be greeted by the somewhat arcane message: "Waiting for sockets…". Hopefully, a future version of the browser will even point the finger at the ill-behaved page.
This is where async-await-queue comes to the rescue. It allows you to easily control the aggressiveness level of your crawler, so that it runs fast while still being somewhat nice, or at least without annoying everyone around you too much.
We create an async-await-queue that will allow for no more than 10 parallel downloads, spaced no less than 100ms apart. These settings are usually close to the maximum possible that won’t get you banned from accessing a public API.
We are creating new promises by chaining asynchronous operations. This time we begin by waiting for our position in the line. At the end, always in a finally block, we free our place for the next operation. Should an operation finish without calling end(), its place will remain forever busy.
Now that is easy and simple enough.
But the whole point of async / await was to avoid that horrible then/catch/finally chaining and to make complex code easier to read. What if we had conditional expressions?
This is where async comes to help:
Each iteration is an anonymous async function that is declared and immediately called. We are pushing its return value, a Promise, to p.
This code is equivalent to the previous one. However using an anonymous async function makes the chaining much more intuitive to read and allows you to use natural conditional statements.
In the hope that it will be useful to you, async-await-queue is available on npm under an MIT license.