path-scurry

Extremely high-performance utility for building tools that read the file system, minimizing filesystem and path string munging operations to the greatest degree possible.

Ugh, yet another file traversal thing on npm?

Yes. None of the existing ones gave me exactly what I wanted.

Well, what is it you wanted?

While working on glob, I found that I needed a module to very efficiently manage the traversal over a folder tree, such that:

  1. No readdir() or stat() would ever be called on the same file or directory more than one time.
  2. No readdir() calls would be made if we can be reasonably sure that the path is not a directory. (That is, a previous readdir() or stat() covered the path, and ent.isDirectory() is false.)
  3. path.resolve(), dirname(), basename(), and other string-parsing/munging operations are minimized. This means it has to track "provisional" child nodes that may not exist (and if we find that they don't exist, store that information as well, so we don't have to ever check again).
  4. The API is not limited to use as a stream/iterator/etc. There are many cases where an API like node's fs is preferable. (See the sketch after this list.)
  5. It's more important to prevent excess syscalls than to be up to date, but it should be smart enough to know what it doesn't know, and go get it seamlessly when requested.
  6. Do not blow up the JS heap allocation if operating on a directory with a huge number of entries.
  7. Handle all the weird aspects of Windows paths, like UNC paths and drive letters and wrong-way slashes, so that the consumer can return canonical platform-specific paths without having to parse or join or do any error-prone string munging.
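For a concrete sense of what that fs-like, cache-everything API looks like, here is a minimal sketch. The readdir() and isDirectory() calls are the ones described above; the PathScurry constructor, pw.cwd, and fullpath() are assumptions made for illustration, not a definitive reference.

import { PathScurry } from 'path-scurry'

// A single PathScurry instance caches everything it learns about the tree.
const pw = new PathScurry('/some/project')

// The first readdir() hits the filesystem...
for (const ent of await pw.readdir(pw.cwd)) {
  // ...and isDirectory() answers from the already-cached entry data,
  // so no extra stat() syscall is issued per entry.
  if (ent.isDirectory()) {
    console.log('dir:', ent.fullpath())
  }
}

// A second readdir() of the same directory is answered from cache,
// with no additional syscalls.
const again = await pw.readdir(pw.cwd)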

PERFORMANCE

JavaScript people throw around the word "blazing" a lot. I hope that this module doesn't blaze anyone. But it does go very fast, in the cases it's optimized for, if used properly.

PathScurry provides ample opportunities to get extremely good performance, as well as several options to trade performance for convenience.

Benchmarks can be run by executing npm run bench.

As is always the case, doing more means going slower, doing less means going faster, and there are trade-offs between speed and memory usage.

PathScurry makes heavy use of LRUCache to efficiently cache whatever it can, and Path objects remain in the graph for the lifetime of the walker, so repeated calls with a single PathScurry object will be extremely fast. However, adding items to a cold cache means "doing more", so in those cases, we pay a price. Nothing is free, but every effort has been made to reduce costs wherever possible.

Also, note that a "cache as long as possible" approach means that changes to the filesystem may not be reflected in the results of repeated PathScurry operations.

For resolving string paths, PathScurry ranges from 5-50 times faster than path.resolve on repeated resolutions, but around 100 to 1000 times slower on the first resolution. If your program is spending a lot of time resolving the same paths repeatedly (like, thousands or millions of times), then this can be beneficial. But both implementations are pretty fast, and speeding up an infrequent operation from 4µs to 400ns is not going to move the needle on your app's performance.
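A rough sketch of what "repeated resolution" means here; the resolve() instance method name is an assumption made for illustration.

import { PathScurry } from 'path-scurry'
import { resolve } from 'path'

const pw = new PathScurry(process.cwd())

// path.resolve() re-parses and re-joins the strings on every call.
for (let i = 0; i < 1_000_000; i++) resolve('src', '../lib', 'index.js')

// pw.resolve() pays the parsing cost once, then answers repeated calls
// from cache, which is where the 5-50x speedup on repeats comes from.
for (let i = 0; i < 1_000_000; i++) pw.resolve('src', '../lib', 'index.js')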

For walking file system directory trees, a lot depends on how often a given PathScurry object will be used, and also on the walk method used.

With default settings on a folder tree of 100,000 items, consisting of around a 10-to-1 ratio of normal files to directories, PathScurry performs comparably to @nodelib/fs.walk, which is the fastest and most reliable file system walker I could find. As far as I can tell, it's almost impossible to go much faster in a Node.js program, just based on how fast you can push syscalls out to the fs thread pool.

On my machine, that is about 1000-1200 completed walks per second for async or stream walks, and around 500-600 walks per second synchronously.

In the warm cache state, PathScurry's performance increases around 4x for async for await iteration, 10-15x for streams and synchronous for of iteration, and anywhere from 30x to 80x for the rest.

# walk 100,000 fs entries, 10/1 file/dir ratio
# operations / ms
 New PathScurry object  |  Reuse PathScurry object
     stream:  1112.589  |  13974.917
sync stream:   492.718  |  15028.343
 async walk:  1095.648  |  32706.395
  sync walk:   527.632  |  46129.772
 async iter:  1288.821  |   5045.510
  sync iter:   498.496  |  17920.746
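The rows above correspond to the different walk flavors, sketched below. The method names walk(), walkSync(), stream(), and iterateSync() are assumptions inferred from the row labels, not a definitive API listing.

import { PathScurry } from 'path-scurry'

const pw = new PathScurry('/some/project')

// "async walk" / "sync walk": collect every entry into an array.
const entries = await pw.walk()
const entriesSync = pw.walkSync()

// "stream": consume entries as a stream of Path objects.
pw.stream().on('data', entry => console.log(entry.fullpath()))

// "async iter" / "sync iter": visit entries one at a time.
let files = 0
for await (const entry of pw) {
  if (entry.isFile()) files++
}
for (const entry of pw.iterateSync()) {
  if (entry.isFile()) files++
}

Reusing the same pw instance across these calls is what produces the warm-cache column; constructing a new PathScurry for every walk pays the cold-cache cost each time.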

A hand-rolled walk calling entry.readdir() and recursing through the entries can benefit even more from caching, with greater flexibility and without the overhead of streams or generators.

The cold cache state is still limited by the costs of file system operations, but with a warm cache, the only bottleneck is CPU speed and VM optimizations. Of course, in that case, some care must be taken to ensure that you don't lose performance as a result of silly mistakes, like calling readdir() on entries that you know are not directories.
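Here is a minimal sketch of such a hand-rolled walk, roughly corresponding to the "async" row in the table below. entry.readdir() and isDirectory() are used as described above; Path and fullpath() are assumptions made for illustration.

import { PathScurry, Path } from 'path-scurry'

const pw = new PathScurry('/some/project')
const results: string[] = []

const walk = async (dir: Path): Promise<void> => {
  for (const child of await dir.readdir()) {
    results.push(child.fullpath())
    // Only recurse into known directories: calling readdir() on entries
    // that are known not to be directories throws the cache advantage away.
    if (child.isDirectory()) await walk(child)
  }
}

await walk(pw.cwd)

A real implementation would likely recurse into sibling directories concurrently (e.g. with Promise.all) rather than awaiting each one in turn.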

# manual recursive iteration functions
      cold cache  |  warm cache
async:  1164.901  |  17923.320
   cb:  1101.127  |  40999.344
zalgo:  1082.240  |  66689.936
 sync:   526.935  |  87097.591

In this case, the speed improves by around 10-20x in the async case, 40x when using entry.readdirCB with protections against synchronous callbacks, 50-100x with those callback deferrals disabled (the "zalgo" case), and several hundred times faster for synchronous iteration.
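As a hypothetical sketch of the two callback styles being compared: the readdirCB() method is mentioned above, but the second argument opting out of callback deferral is an assumption about the API, not confirmed here.

import { PathScurry } from 'path-scurry'

const pw = new PathScurry('/some/project')

// "cb" case: callbacks are deferred so they never fire synchronously,
// even when the result is already sitting in the cache.
pw.cwd.readdirCB((er, entries) => {
  if (!er) console.log(entries.length, 'entries')
})

// "zalgo" case: allow the callback to fire synchronously on a cache hit,
// trading predictable async ordering for raw speed.
pw.cwd.readdirCB((er, entries) => {
  if (!er) console.log(entries.length, 'entries')
}, true)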

If you can think of a case that is not covered in these benchmarks, or an implementation that performs significantly better than PathScurry, please let me know.