Operations on files collection and aggregate results report with non-blocking IO

I would like to perform some arbitrarily expensive work on an arbitrarily large set of files. I would like to report progress in real-time and then display results after all files have been processed. If there are no files that match my expression, I'd like to throw an error.

Imagine writing a test framework that loads up all of your test files, executes them (in no particular order), reports on progress in real-time, and then displays aggregate results after all tests have been completed.

Writing this code in a blocking language (like Ruby, for example) is extremely straightforward.

As it turns out, I'm having trouble performing this seemingly simple task in node, while also truly taking advantage of asynchronous, event-based IO.

My first design was to perform each step serially:

  1. Load up all of the files, creating a collection of files to process
  2. Process each file in the collection
  3. Report the results when all files have been processed

This approach does work, but doesn't seem quite right to me since it causes the more computationally expensive portion of my program to wait for all of the file IO to complete. Isn't this the kind of waiting that Node was designed to avoid?
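A minimal sketch of what that first design might look like; findAllMatching and processFile are hypothetical placeholders for the traversal and the per-file work:

// First design, sketched: gather every matching file, then process, then report.
// findAllMatching and processFile are hypothetical stand-ins for the real
// traversal and the real per-file "expensive work".
findAllMatching('test/', /_test.js/, function(err, files) {
  if (err) throw err;
  if (files.length === 0) throw new Error('No matching files found');
  var remaining = files.length;
  var results = [];
  files.forEach(function(file) {
    processFile(file, function(err, result) {
      results.push(result);
      console.log('Processed: ' + file);   // real-time progress
      if (--remaining === 0) {
        console.log('Done', results);      // aggregate report
      }
    });
  });
});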

My second design was to process each file as it was asynchronously found on disk. For the sake of argument, let's imagine a method that looks something like:

function eachFileMatching(path, expression, callback) {
  // recursively, asynchronously traverse the file system,
  // calling callback every time a file name matches expression.
}

And a consumer of this method that looks something like this:

eachFileMatching('test/', /_test.js/, function(err, testFile) {
  // read and process the content of testFile
});

While this design feels like a very 'node' way of working with IO, it suffers from 2 major problems (at least in my presumably erroneous implementation):

  1. I have no idea when all of the files have been processed, so I don't know when to assemble and publish results.
  2. Because the file reads are non-blocking and recursive, I'm struggling with how to know if no files were found.

I'm hoping that I'm simply doing something wrong, and that there is some reasonably simple strategy that other folks use to make the second approach work.

Even though this example uses a test framework, I have a variety of other projects that bump up against this exact same problem, and I imagine anyone writing a reasonably sophisticated application that accesses the file system in node would too.


What do you mean by "read and process the content of testFile"?

I don't understand why you have no idea when all of the files are processed. Are you not using Streams? A stream has several events, not just data. If you handle the end events then you will know when each file has finished.

For instance, you might have a list of filenames, set up the processing for each file, and then, when you get an end event, delete the filename from the list. When the list is empty you are done. Or create a FileName object that contains the name and a completion status. When you get an end event, change the status and decrement a filename counter as well. When the counter gets to zero you are done, or, if you are not confident, you could scan all the FileName objects to make sure that their status is completed.

You might also have a timer that checks the counter periodically, and if it doesn't change for some period of time, report that the processing might be stuck on the FileName objects whose status is not completed.
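A minimal sketch of that bookkeeping, assuming each file is processed through a readable stream; the filenames list and the log messages are placeholders:

// Sketch of the counter-plus-watchdog idea described above.
var fs = require('fs');

// Placeholder list; in practice this comes from your own file search.
var filenames = ['test/a_test.js', 'test/b_test.js'];

var pending = filenames.length;
var lastPending = pending;

filenames.forEach(function(name) {
  var stream = fs.createReadStream(name);
  stream.on('data', function(chunk) {
    // process the chunk (the "expensive work" goes here)
  });
  stream.on('end', function() {
    if (--pending === 0) console.log('All files processed'); // aggregate report here
  });
});

// Watchdog: periodically check whether the pending count is still moving.
var watchdog = setInterval(function() {
  if (pending === 0) return clearInterval(watchdog);
  if (pending === lastPending) console.log(pending + ' file(s) may be stuck');
  lastPending = pending;
}, 5000);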

... I just came across this scenario in another question and the accepted answer (plus the github link) explains it well. Check out "for loop over event driven code?"


As it turns out, the smallest working solution that I've been able to build is much more complicated than I hoped.

Following is code that works for me. It can probably be cleaned up or made slightly more readable here and there, and I'm not interested in feedback like that.

If there is a significantly different way to solve this problem, that is simpler and/or more efficient, I'm very interested in hearing it. It really surprises me that the solution to this seemingly simple requirement would require such a large amount of code, but perhaps that's why someone invented blocking IO?

The complexity is really in the desire to meet all of the following requirements:

  • Handle files as they are found
  • Know when the search is complete
  • Know if no files are found

Here's the code:

var fs = require('fs');

/**
 * Call fileHandler with the file name and file Stat for each file found inside
 * of the provided directory.
 *
 * Call the optionally provided completeHandler with an array of files (mingled
 * with directories) and an array of Stat objects (one for each of the found
 * files).
 *
 * Following is an example of a simple usage:
 *
 *   eachFileOrDirectory('test/', function(err, file, stat) {
 *     if (err) throw err;
 *     if (!stat.isDirectory()) {
 *       console.log(">> Found file: " + file);
 *     }
 *   });
 *
 * Following is an example that waits for all files and directories to be 
 * scanned and then uses the entire result to do something:
 *
 *   eachFileOrDirectory('test/', null, function(err, files, stats) {
 *     if (err) throw err;
 *     var len = files.length;
 *     for (var i = 0; i < len; i++) {
 *       if (!stats[i].isDirectory()) {
 *         console.log(">> Found file: " + files[i]);
 *       }
 *     }
 *   });
 */
var eachFileOrDirectory = function(directory, fileHandler, completeHandler) {
  var filesToCheck = 0;
  var checkedFiles = [];
  var checkedStats = [];

  directory = (directory) ? directory : './';

  var fullFilePath = function(dir, file) {
    return dir.replace(/\/$/, '') + '/' + file;
  };

  var checkComplete = function() {
    if (filesToCheck == 0 && completeHandler) {
      completeHandler(null, checkedFiles, checkedStats);
    }
  };

  var onFileOrDirectory = function(fileOrDirectory) {
    filesToCheck++;
    fs.stat(fileOrDirectory, function(err, stat) {
      filesToCheck--;
      if (err) return fileHandler(err);
      checkedFiles.push(fileOrDirectory);
      checkedStats.push(stat);
      fileHandler(null, fileOrDirectory, stat);
      if (stat.isDirectory()) {
        onDirectory(fileOrDirectory);
      }
      checkComplete();
    });
  };

  var onDirectory = function(dir) {
    filesToCheck++;
    fs.readdir(dir, function(err, files) {
      filesToCheck--;
      if (err) return fileHandler(err);
      files.forEach(function(file, index) {
        file = fullFilePath(dir, file);
        onFileOrDirectory(file);
      });
      checkComplete();
    });
  };

  onFileOrDirectory(directory);
};
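For what it's worth, something like the eachFileMatching method from the question can then be sketched on top of this, with the complete handler doubling as the "no files found" check; the filtering and error details below are illustrative rather than definitive:

// Sketch: layer eachFileMatching on top of eachFileOrDirectory, using the
// complete handler both to detect "no files found" and to publish results.
var eachFileMatching = function(directory, expression, fileHandler, completeHandler) {
  eachFileOrDirectory(directory, function(err, file, stat) {
    if (err) return fileHandler(err);
    if (!stat.isDirectory() && expression.test(file)) {
      fileHandler(null, file, stat);                    // handle files as found
    }
  }, function(err, files, stats) {
    if (err) return completeHandler(err);
    var matches = files.filter(function(file, i) {
      return !stats[i].isDirectory() && expression.test(file);
    });
    if (matches.length === 0) {
      return completeHandler(new Error('No files matched ' + expression));
    }
    completeHandler(null, matches);                     // search is complete
  });
};

eachFileMatching('test/', /_test.js/, function(err, file) {
  if (err) throw err;
  console.log('Processing: ' + file);
}, function(err, files) {
  if (err) throw err;
  console.log('Finished ' + files.length + ' files');
});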


There are two ways of doing this. The first, which could be considered serial, would go something like this:

var files = [];
doFile(files, oncomplete);

function doFile(files, oncomplete) {
  if (files.length === 0) return oncomplete();
  var f = files.pop();
  processFile(f, function(err) {
    // Handle error if any
    doFile(files, oncomplete); // Recurse
  });
};

function processFile(file, callback) {
  // Do whatever you want to do and once 
  // done call the callback
  ...
  callback();
};
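As a purely hypothetical stand-in for processFile, reading the file and reporting a line count might look like this:

// Hypothetical processFile: read the file, report a line count as the
// "expensive work", then signal completion through the callback.
var fs = require('fs');

function processFile(file, callback) {
  fs.readFile(file, 'utf8', function(err, contents) {
    if (err) return callback(err);
    console.log(file + ': ' + contents.split('\n').length + ' lines');
    callback(null);
  });
}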

The second way, let's call it parallel, is similar and goes something like this:

var files = [];
doFiles(files, oncomplete);

function doFiles(files, oncomplete) {
  var exp = files.length;
  var done = 0;
  for (var i = 0; i < exp; i++) {
    processFile(files[i], function(err) {
      // Handle errors (but still need to increment counter)
      if (++done === exp) return oncomplete();      
    });
  }
};

function processFile(file, callback) {
  // Do whatever you want to do and once 
  // done call the callback
  ...
  callback();
};
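One caveat with the parallel version as written: if the files array is empty, the loop never runs and oncomplete is never called. A guarded variant, sketched below as an addition to the snippet above, is also a natural place to raise the "no files matched" error from the question:

// Guarded variant of doFiles (an addition, not part of the original snippet):
// call oncomplete with an error when there is nothing to process, so an empty
// match set is reported instead of silently hanging.
function doFiles(files, oncomplete) {
  if (files.length === 0) return oncomplete(new Error('No files to process'));
  var exp = files.length;
  var done = 0;
  for (var i = 0; i < exp; i++) {
    processFile(files[i], function(err) {
      // Handle errors (but still increment the counter)
      if (++done === exp) return oncomplete(null);
    });
  }
}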

Now it may seem obvious that you should use the second approach, but you'll find that for IO-intensive operations you don't really get any performance gains from parallelising. One disadvantage of the first approach is that the recursion can blow out your call stack.

Tnx

Guido
