Extremely high performant utility for building tools that read the file system, minimizing filesystem and path string munging operations to the greatest degree possible.
Yes, this is yet another file-system traversal utility on npm. None of the existing ones gave me exactly what I wanted.
While working on glob, I found that I needed a module to very efficiently manage the traversal over a folder tree, such that:
- No `readdir()` or `stat()` would ever be called on the same file or directory more than one time.
- No `readdir()` calls would be made if we can be reasonably sure that the path is not a directory. (Ie, a previous `readdir()` or `stat()` covered the path, and `ent.isDirectory()` is false.)
- `path.resolve()`, `dirname()`, `basename()`, and other string-parsing/munging operations are minimized. This means it has to track "provisional" child nodes that may not exist (and if we find that they don't exist, store that information as well, so we don't have to ever check again).
- The API is not limited to use as a stream/iterator/etc. There are many cases where an API like node's `fs` is preferable.
- It's more important to prevent excess syscalls than to be up to date, but it should be smart enough to know what it doesn't know, and go get it seamlessly when requested.
- Do not blow up the JS heap allocation if operating on a directory with a huge number of entries.
- Handle all the weird aspects of Windows paths, like UNC paths and drive letters and wrong-way slashes, so that the consumer can return canonical platform-specific paths without having to parse or join or do any error-prone string munging.
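For a concrete sense of how those goals show up in the API, here is a minimal usage sketch (the `PathScurry` constructor, `walk()`, `fullpath()`, and `isDirectory()` are the documented entry points; the logging is just for illustration):

```js
import { PathScurry } from 'path-scurry'

const pw = new PathScurry(process.cwd())

// Each result is a Path object, not a string, so checking its type or
// getting its full path does not trigger extra stat() calls or string
// munging beyond what the walk already did.
for (const entry of await pw.walk()) {
  if (entry.isDirectory()) {
    console.log('dir: ', entry.fullpath())
  } else {
    console.log('file:', entry.fullpath())
  }
}
```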
JavaScript people throw around the word "blazing" a lot. I hope that this module doesn't blaze anyone. But it does go very fast, in the cases it's optimized for, if used properly.
PathScurry provides ample opportunities to get extremely good performance, as well as several options to trade performance for convenience.
Benchmarks can be run by executing `npm run bench`.
As is always the case, doing more means going slower, doing less means going faster, and there are trade offs between speed and memory usage.
PathScurry makes heavy use of `LRUCache` to efficiently cache whatever it can, and `Path` objects remain in the graph for the lifetime of the walker, so repeated calls with a single PathScurry object will be extremely fast. However, adding items to a cold cache means "doing more", so in those cases, we pay a price. Nothing is free, but every effort has been made to reduce costs wherever possible.
Also, note that a "cache as long as possible" approach means that changes to the filesystem may not be reflected in the results of repeated PathScurry operations.
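As a sketch of that trade-off (the project path below is a placeholder; the `walk()` calls are the documented API):

```js
import { PathScurry } from 'path-scurry'

// Placeholder path; any project root works for illustration.
const pw = new PathScurry('/some/project')

// First walk: cold cache, pays the full cost of readdir()/lstat() syscalls.
const first = await pw.walk()

// Second walk with the same object: answered almost entirely from the
// cached Path graph, so it is dramatically faster...
const second = await pw.walk()

// ...but it will not notice entries created or removed since the first
// walk. If freshness matters more than speed, use a new PathScurry object.
const fresh = await new PathScurry('/some/project').walk()
```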
For resolving string paths, PathScurry ranges from 5-50 times
faster than `path.resolve` on repeated resolutions, but around
100 to 1000 times slower on the first resolution. If your
program is spending a lot of time resolving the same paths
repeatedly (like, thousands or millions of times), then this can
be beneficial. But both implementations are pretty fast, and
speeding up an infrequent operation from 4µs to 400ns is not
going to move the needle on your app's performance.
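A sketch of the pattern where that pays off (the loop counts are arbitrary, chosen only to illustrate repeated resolution of the same strings):

```js
import { PathScurry } from 'path-scurry'
import { resolve } from 'node:path'

const pw = new PathScurry(process.cwd())

// path.resolve() re-parses and re-joins the strings on every call.
for (let i = 0; i < 1_000_000; i++) {
  resolve('src', '..', 'lib', 'index.js')
}

// pw.resolve() pays a higher cost on the first call for a given input,
// then answers subsequent calls for the same strings from its cache.
for (let i = 0; i < 1_000_000; i++) {
  pw.resolve('src', '..', 'lib', 'index.js')
}
```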
For walking file system directory trees, a lot depends on how often a given PathScurry object will be used, and also on the walk method used.
With default settings on a folder tree of 100,000 items, consisting of around a 10-to-1 ratio of normal files to directories, PathScurry performs comparably to @nodelib/fs.walk, which is the fastest and most reliable file system walker I could find. As far as I can tell, it's almost impossible to go much faster in a Node.js program, just based on how fast you can push syscalls out to the fs thread pool.
On my machine, that is about 1000-1200 completed walks per second for async or stream walks, and around 500-600 walks per second synchronously.
In the warm cache state, PathScurry's performance increases around 4x for async `for await` iteration, 10-15x faster for streams and synchronous `for of` iteration, and anywhere from 30x to 80x faster for the rest.
Walking 100,000 fs entries, 10/1 file/dir ratio (operations / ms):

| Method | New PathScurry object | Reuse PathScurry object |
| --- | ---: | ---: |
| stream | 1112.589 | 13974.917 |
| sync stream | 492.718 | 15028.343 |
| async walk | 1095.648 | 32706.395 |
| sync walk | 527.632 | 46129.772 |
| async iter | 1288.821 | 5045.510 |
| sync iter | 498.496 | 17920.746 |
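The walk styles in the table correspond to methods on the PathScurry object; a hedged sketch, assuming the documented `walk()`, `walkSync()`, `stream()`, and async-iteration APIs:

```js
import { PathScurry } from 'path-scurry'

const pw = new PathScurry(process.cwd())

// async walk: resolves to an array of Path objects
const entries = await pw.walk()

// sync walk
const entriesSync = pw.walkSync()

// stream: emits Path objects as they are discovered
pw.stream().on('data', entry => console.log(entry.fullpath()))

// async iteration over the same (now warm) PathScurry object
for await (const entry of pw) {
  console.log(entry.fullpath())
}
```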
A hand-rolled walk calling `entry.readdir()` and recursing through the entries can benefit even more from caching, with greater flexibility and without the overhead of streams or generators.
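A sketch of what such a hand-rolled walk might look like, assuming the documented `readdir()` and `isDirectory()` methods on `Path` entries (the recursion strategy itself is just an example):

```js
import { PathScurry } from 'path-scurry'

const pw = new PathScurry(process.cwd())

// Recurse manually. readdir() is only invoked on entries known to be
// directories, so no syscalls are wasted on plain files, and repeated
// walks with the same PathScurry object reuse the cached results.
async function walkTree(dir) {
  for (const child of await dir.readdir()) {
    console.log(child.fullpath())
    if (child.isDirectory()) {
      await walkTree(child)
    }
  }
}

await walkTree(pw.cwd)
```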
The cold cache state is still limited by the costs of file system
operations, but with a warm cache, the only bottleneck is CPU
speed and VM optimizations. Of course, in that case, some care
must be taken to ensure that you don't lose performance as a
result of silly mistakes, like calling `readdir()` on entries that you know are not directories.
Manual recursive iteration functions:

| Method | Cold cache | Warm cache |
| --- | ---: | ---: |
| async | 1164.901 | 17923.320 |
| cb | 1101.127 | 40999.344 |
| zalgo | 1082.240 | 66689.936 |
| sync | 526.935 | 87097.591 |
In this case, the speed improves by around 10-20x in the async case, 40x when using `entry.readdirCB` with protections against synchronous callbacks, 50-100x with callback deferrals disabled, and several hundred times for synchronous iteration.
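For reference, a sketch of the callback styles being compared; the `readdirCB(cb, allowZalgo)` signature here is my reading of the API, so check the docs before relying on it:

```js
import { PathScurry } from 'path-scurry'

const pw = new PathScurry(process.cwd())

// Default: the callback is always deferred, so it never fires
// synchronously even when the result is already cached.
pw.cwd.readdirCB((er, entries) => {
  if (er) throw er
  console.log('deferred:', entries.length)
})

// With the second argument set to true, a cached result may invoke the
// callback synchronously ("zalgo" mode): faster, but the caller must
// tolerate either timing.
pw.cwd.readdirCB((er, entries) => {
  if (er) throw er
  console.log('maybe sync:', entries.length)
}, true)
```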
If you can think of a case that is not covered in these benchmarks, or an implementation that performs significantly better than PathScurry, please let me know!