
Sorting a data stream before writing to file in Node.js

Published 2020-08-17 17:21

Question:

I have an input file which may potentially contain up to 1M records, and each record looks like this:

field 1 field 2 field3 \n

I want to read this input file and sort it based on field3 before writing it to another file.

Here is what I have so far:

var fs = require('fs'),
    readline = require('readline'),
    stream = require('stream');

var start = Date.now();

var outstream = new stream.Stream();
outstream.readable = true;
outstream.writable = true;

var rl = readline.createInterface({
    input: fs.createReadStream('cross.txt'),
    output: outstream,
    terminal: false
});

rl.on('line', function(line) {
    //var tmp = line.split("\t").reverse().join('\t') + '\n';
    //fs.appendFileSync("op_rev.txt", tmp );
    // this logic to reverse and then sort is too slow
});

rl.on('close', function() {
    var closetime = Date.now();
    console.log('Read entire file.', (closetime - start) / 1000, 'secs');
});

I am basically stuck at this point; all I have is the ability to read from one file and write to another. Is there a way to efficiently sort this data before writing it?

Answer 1:

A DB and sort-stream are fine solutions, but a DB might be overkill, and I think sort-stream eventually just sorts the entire file in an in-memory array (in its through stream's end callback), so performance should be roughly the same as the original solution (but I haven't run any benchmarks, so I might be wrong).
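For intuition, a sorting through-stream typically boils down to something like the sketch below (my own minimal illustration of the general pattern, not sort-stream's actual source): buffer every record, then run one big in-memory sort once the input ends.

const { Transform } = require('stream');

// Minimal sketch of what a sorting through-stream generally does:
// nothing is emitted until the input ends, then everything is sorted at once.
function makeSortStream(comparator) {
    const buffer = [];
    return new Transform({
        objectMode: true,
        transform(record, _encoding, callback) {
            buffer.push(record); // buffer everything in memory
            callback();          // emit nothing yet
        },
        flush(callback) {
            buffer.sort(comparator);            // the same in-memory sort
            buffer.forEach(r => this.push(r));  // emit in sorted order
            callback();
        }
    });
}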

So, just for the hack of it, I'll throw in another solution :)


EDIT: I was curious to see how big a difference this would be, so I ran some benchmarks.

Results were surprising even to me: it turns out the sort -k3,3 solution is by far the best, about x10 times faster than the original solution (a simple array sort), while the nedb and sort-stream solutions are at least x18 times slower than the original solution (i.e. at least x180 times slower than sort -k3,3).

(See benchmark results below)


If you're on a *nix machine (Unix, Linux, Mac, ...), you can simply use
sort -k 3,3 yourInputFile > op_rev.txt and let the OS do the sorting for you.
You'll probably get better performance, since the sorting is done natively.

Or, if you want to process the sorted output in Node:

var spawn = require('child_process').spawn,
    sort = spawn('sort', ['-k3,3', './test.tsv']);

sort.stdout.on('data', function (data) {
    // process data
    // (note: a 'data' chunk may end mid-line; see the readline variant below)
    data.toString()
        .split('\n')
        .map(line => line.split("\t"))
        .forEach(record => console.info(`Record: ${record}`));
});

sort.on('exit', function (code) {
    if (code) {
        // handle error
    }

    console.log('Done');
});

// optional
sort.stderr.on('data', function (data) {
    // handle error...
    console.log('stderr: ' + data);
});
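One caveat with the snippet above: a 'data' chunk is not guaranteed to end on a line boundary, so a record can be split across two chunks. If that matters, a readline interface over sort's stdout handles the line splitting (a minimal sketch reusing the same spawn call):

var spawn = require('child_process').spawn,
    readline = require('readline'),
    sort = spawn('sort', ['-k3,3', './test.tsv']);

// readline buffers partial chunks and emits complete lines only
readline.createInterface({ input: sort.stdout }).on('line', function (line) {
    var record = line.split('\t');
    console.info(`Record: ${record}`);
});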

Hope this helps :)


EDIT: Adding some benchmark details.

Here are the results (running on a MacBook Pro):

  • sort1 uses a straightforward approach, sorting the records in an in-memory array.
    Avg time: 35.6s (baseline)

  • sort2 uses sort-stream, as suggested by Joe Krill.
    Avg time: 11.1m (about x18.7 times slower)
    (I wonder why. I didn't dig in.)

  • sort3 uses nedb, as suggested by Tamas Hegedus.
    Time: about 16m (about x27 times slower)

  • sort4 only sorts by executing sort -k 3,3 input.txt > out4.txt in a terminal
    Avg time: 1.2s (about x30 times faster)

  • sort5 uses sort -k3,3 and processes the output sent to stdout
    Avg time: 3.65s (about x9.7 times faster)



Answer 2:

You can take advantage of streams for something like this. There are a few npm modules that will be helpful; first install them by running

npm install sort-stream csv-parse stream-transform

from the command line.

Then:

var fs = require('fs');
var sort = require('sort-stream');
var parse = require('csv-parse');
var transform = require('stream-transform');

// Create a readable stream from the input file.
fs.createReadStream('./cross.txt')
  // Use `csv-parse` to parse the input using a tab character (\t) as the 
  // delimiter. This produces a record for each row which is an array of 
  // field values.
  .pipe(parse({
    delimiter: '\t'
  }))
  // Use `sort-stream` to sort the parsed records on the third field. 
  .pipe(sort(function (a, b) {
    return a[2].localeCompare(b[2]);
  }))
  // Use `stream-transform` to transform each record (an array of fields) into 
  // a single tab-delimited string to be output to our destination text file.
  .pipe(transform(function(row) {
    return row.join('\t') + '\n';
  }))
  // And finally, output those strings to our destination file.
  .pipe(fs.createWriteStream('./cross_sorted.txt'));


Answer 3:

You have two options, depending on how much data is being processed. (A count of 1M records with 3 columns doesn't say much about the amount of actual data.)

Load the data in memory, sort in place

var lines = [];
rl.on('line', function(line) {
    // reversed so that field3 ends up at index 0
    lines.push(line.split("\t").reverse());
});

rl.on('close', function() {
    // a[0] is field3, thanks to the reverse() above
    lines.sort(function(a, b) { return compare(a[0], b[0]); });

    // write however you want
    fs.writeFileSync(
        fileName,
        lines.map(function(x) { return x.join("\t"); }).join("\n")
    );
    function compare(a, b) {
        if (a < b) return -1;
        if (a > b) return 1;
        return 0;
    }
});
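As a side note, the reverse() above is only there so that field3 lands at index 0; you can keep the original field order and sort on index 2 directly (a minimal variant of the same idea):

var lines = [];
rl.on('line', function(line) {
    lines.push(line.split("\t")); // keep the original field order
});

rl.on('close', function() {
    // field3 sits at index 2 when the fields are not reversed
    lines.sort(function(a, b) { return compare(a[2], b[2]); });
    // ... write out as above, without having to restore the order
});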

Load the data in a persistent database, read ordered

Using a database engine of your choice (for example nedb, a pure JavaScript DB for Node.js)

EDIT: It seems that NeDB keeps the whole database in memory; the file is only a persistent copy of the data. We'll have to search for another implementation. TingoDB looks promising (see the sketch after the nedb example below).

// This code is only to give an idea, not tested in any way

var Datastore = require('nedb');
var db = new Datastore({
    filename: 'path/to/temp/datafile',
    autoload: true
});

rl.on('line', function(line) {
    var tmp = line.split("\t").reverse();
    db.insert({
        field0: tmp[0],
        field1: tmp[1],
        field2: tmp[2]
    });
});

rl.on('close', function() {
    var cursor = db.find({})
            .sort({ field0: 1 }); // sort by field0, ascending
    var PAGE_SIZE = 1000;
    paginate(0);
    function paginate(i) {
        cursor.skip(i).limit(PAGE_SIZE).exec(function(err, docs) {
            // handle errors

            var tmp = docs.map(function(o) {
                return o.field0 + "\t" + o.field1 + "\t" + o.field2 + "\n";
            });
            fs.appendFileSync("op_rev.txt", tmp.join(""));
            if (docs.length >= PAGE_SIZE) {
                paginate(i + PAGE_SIZE);
            } else {
                // cleanup temp database
            }
        });
    }
});
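For completeness, here is a rough sketch of the same flow with TingoDB, as mentioned in the EDIT above. This is untested and assumes TingoDB's MongoDB-like API; the './tmpdb' path and 'records' collection name are placeholders:

// This code is only to give an idea, not tested in any way
var Engine = require('tingodb')();
var db = new Engine.Db('./tmpdb', {});
var records = db.collection('records');

rl.on('line', function(line) {
    var tmp = line.split("\t").reverse();
    records.insert({ field0: tmp[0], field1: tmp[1], field2: tmp[2] });
});

rl.on('close', function() {
    // toArray buffers the whole result; paginate with skip/limit as in
    // the nedb example if memory is a concern
    records.find({}).sort({ field0: 1 }).toArray(function(err, docs) {
        // handle errors, then write docs out as in the nedb example
    });
});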


Answer 4:

I had quite a similar issue and needed to perform an external sort.

After wasting some time on it, I figured out that I could load the data into a database and then query the desired data out of it.

It doesn't even matter if the inserts aren't ordered, as long as the query result can be.

Hope it can work for you too.

In order to insert your data into a database, there are plenty of tools in Node to perform such a task. I have a pet project which does a similar job.
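To make that idea concrete, here is a minimal sketch using better-sqlite3 as a stand-in SQL engine (my choice for illustration, not the tool this answer refers to; the table and column names are made up). Rows go in unordered, and ORDER BY hands them back sorted:

var Database = require('better-sqlite3');
var db = new Database('temp_sort.db');

db.exec('CREATE TABLE IF NOT EXISTS records (f1 TEXT, f2 TEXT, f3 TEXT)');
var insert = db.prepare('INSERT INTO records VALUES (?, ?, ?)');

// insertion order does not matter...
insert.run('b1', 'b2', 'zzz');
insert.run('a1', 'a2', 'aaa');

// ...because the query result is ordered on the third field
db.prepare('SELECT * FROM records ORDER BY f3').all()
  .forEach(function(r) { console.log([r.f1, r.f2, r.f3].join('\t')); });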

I'm also sure that if you search the subject, you'll find much more info.

Good luck.