
Sorting a data stream before writing to file in Node.js

Published 2020-08-17 17:21

Question:

I have an input file which may potentially contain up to 1M records, and each record looks like this:

field 1 field 2 field3 \n

I want to read this input file and sort it based on field3 before writing it to another file.

Here is what I have so far:

var fs = require('fs'),
    readline = require('readline'),
    stream = require('stream');

var start = Date.now();

var outstream = new stream.Stream();
outstream.readable = true;
outstream.writable = true;

var rl = readline.createInterface({
    input: fs.createReadStream('cross.txt'),
    output: outstream,
    terminal: false
});

rl.on('line', function(line) {
    //var tmp = line.split("\t").reverse().join('\t') + '\n';
    //fs.appendFileSync("op_rev.txt", tmp );
    // this logic to reverse and then sort is too slow
});

rl.on('close', function() {
    var closetime = Date.now();
    console.log('Read entire file.', (closetime - start) / 1000, 'secs');
});

I am basically stuck at this point; all I have is the ability to read from one file and write to another. Is there a way to efficiently sort this data before writing it?

Answer 1:

A DB and sort-stream are fine solutions, but a DB might be overkill, and I think sort-stream eventually just sorts the entire file in an in-memory array (in its through stream's end callback), so performance should be roughly the same as the original solution (but I haven't run any benchmarks, so I might be wrong).
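For intuition, a sorting through-stream typically boils down to something like the sketch below (my own minimal illustration of the general pattern, not sort-stream's actual source): buffer every record, then run one big in-memory sort once the input ends.

const { Transform } = require('stream');

// Minimal sketch of what a sorting through-stream generally does:
// nothing is emitted until the input ends, then everything is sorted at once.
function makeSortStream(comparator) {
    const buffer = [];
    return new Transform({
        objectMode: true,
        transform(record, _encoding, callback) {
            buffer.push(record); // buffer everything in memory
            callback();          // emit nothing yet
        },
        flush(callback) {
            buffer.sort(comparator);            // the same in-memory sort
            buffer.forEach(r => this.push(r));  // emit in sorted order
            callback();
        }
    });
}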

So, just for the hack of it, I'll throw in another solution :)


EDIT: I was curious to see how big a difference this would be, so I ran some benchmarks.

Results were surprising even to me: it turns out the sort -k3,3 solution is by far the best, about x10 times faster than the original solution (a simple array sort), while the nedb and sort-stream solutions are at least x18 times slower than the original solution (i.e. at least x180 times slower than sort -k3,3).

(See benchmark results below)


If you're on a *nix machine (Unix, Linux, Mac, ...), you can simply use
sort -k 3,3 yourInputFile > op_rev.txt and let the OS do the sorting for you.
You'll probably get better performance, since the sorting is done natively.

Or, if you want to process the sorted output in Node:

var spawn = require('child_process').spawn,
    sort = spawn('sort', ['-k3,3', './test.tsv']);

sort.stdout.on('data', function (data) {
    // process data
    // (note: a 'data' chunk may end mid-line; see the readline variant below)
    data.toString()
        .split('\n')
        .map(line => line.split("\t"))
        .forEach(record => console.info(`Record: ${record}`));
});

sort.on('exit', function (code) {
    if (code) {
        // handle error
    }

    console.log('Done');
});

// optional
sort.stderr.on('data', function (data) {
    // handle error...
    console.log('stderr: ' + data);
});
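One caveat with the snippet above: a 'data' chunk is not guaranteed to end on a line boundary, so a record can be split across two chunks. If that matters, a readline interface over sort's stdout handles the line splitting (a minimal sketch reusing the same spawn call):

var spawn = require('child_process').spawn,
    readline = require('readline'),
    sort = spawn('sort', ['-k3,3', './test.tsv']);

// readline buffers partial chunks and emits complete lines only
readline.createInterface({ input: sort.stdout }).on('line', function (line) {
    var record = line.split('\t');
    console.info(`Record: ${record}`);
});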

Hope this helps :)


EDIT: Adding some benchmark details.

Here are the results (running on a MacBook Pro):

  • sort1 uses a straightforward approach, sorting the records in an in-memory array.
    Avg time: 35.6s (baseline)

  • sort2 uses sort-stream, as suggested by Joe Krill.
    Avg time: 11.1m (about x18.7 times slower)
    (I wonder why. I didn't dig in.)

  • sort3 uses nedb, as suggested by Tamas Hegedus.
    Time: about 16m (about x27 times slower)

  • sort4 only sorts by executing sort -k 3,3 input.txt > out4.txt in a terminal
    Avg time: 1.2s (about x30 times faster)

  • sort5 uses sort -k3,3 and processes the output sent to stdout
    Avg time: 3.65s (about x9.7 times faster)



Answer 2:

You can take advantage of streams for something like this. There are a few npm modules that will be helpful; first install them by running

npm install sort-stream csv-parse stream-transform

from the command line.

Then:

var fs = require('fs');
var sort = require('sort-stream');
var parse = require('csv-parse');
var transform = require('stream-transform');

// Create a readable stream from the input file.
fs.createReadStream('./cross.txt')
  // Use `csv-parse` to parse the input using a tab character (\t) as the 
  // delimiter. This produces a record for each row which is an array of 
  // field values.
  .pipe(parse({
    delimiter: '\t'
  }))
  // Use `sort-stream` to sort the parsed records on the third field. 
  .pipe(sort(function (a, b) {
    return a[2].localeCompare(b[2]);
  }))
  // Use `stream-transform` to transform each record (an array of fields) into 
  // a single tab-delimited string to be output to our destination text file.
  .pipe(transform(function(row) {
    return row.join('\t') + '\n';
  }))
  // And finally, output those strings to our destination file.
  .pipe(fs.createWriteStream('./cross_sorted.txt'));


Answer 3:

You have two options, depending on how much data is being processed. (A count of 1M records with 3 columns doesn't say much about the amount of actual data.)

Load the data in memory, sort in place

var lines = [];
rl.on('line', function(line) {
    // reversed so that field3 ends up at index 0
    lines.push(line.split("\t").reverse());
});

rl.on('close', function() {
    // a[0] is field3, thanks to the reverse() above
    lines.sort(function(a, b) { return compare(a[0], b[0]); });

    // write however you want
    fs.writeFileSync(
        fileName,
        lines.map(function(x) { return x.join("\t"); }).join("\n")
    );
    function compare(a, b) {
        if (a < b) return -1;
        if (a > b) return 1;
        return 0;
    }
});
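As a side note, the reverse() above is only there so that field3 lands at index 0; you can keep the original field order and sort on index 2 directly (a minimal variant of the same idea):

var lines = [];
rl.on('line', function(line) {
    lines.push(line.split("\t")); // keep the original field order
});

rl.on('close', function() {
    // field3 sits at index 2 when the fields are not reversed
    lines.sort(function(a, b) { return compare(a[2], b[2]); });
    // ... write out as above, without having to restore the order
});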

Load the data in a persistent database, read ordered

Using a database engine of your choice (for example nedb, a pure JavaScript DB for Node.js)

EDIT: It seems that NeDB keeps the whole database in memory; the file is only a persistent copy of the data. We'll have to search for another implementation. TingoDB looks promising (see the sketch after the nedb example below).

// This code is only to give an idea, not tested in any way

var Datastore = require('nedb');
var db = new Datastore({
    filename: 'path/to/temp/datafile',
    autoload: true
});

rl.on('line', function(line) {
    var tmp = line.split("\t").reverse();
    db.insert({
        field0: tmp[0],
        field1: tmp[1],
        field2: tmp[2]
    });
});

rl.on('close', function() {
    var cursor = db.find({})
            .sort({ field0: 1 }); // sort by field0, ascending
    var PAGE_SIZE = 1000;
    paginate(0);
    function paginate(i) {
        cursor.skip(i).limit(PAGE_SIZE).exec(function(err, docs) {
            // handle errors

            var tmp = docs.map(function(o) {
                return o.field0 + "\t" + o.field1 + "\t" + o.field2 + "\n";
            });
            fs.appendFileSync("op_rev.txt", tmp.join(""));
            if (docs.length >= PAGE_SIZE) {
                paginate(i + PAGE_SIZE);
            } else {
                // cleanup temp database
            }
        });
    }
});
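For completeness, here is a rough sketch of the same flow with TingoDB, as mentioned in the EDIT above. This is untested and assumes TingoDB's MongoDB-like API; the './tmpdb' path and 'records' collection name are placeholders:

// This code is only to give an idea, not tested in any way
var Engine = require('tingodb')();
var db = new Engine.Db('./tmpdb', {});
var records = db.collection('records');

rl.on('line', function(line) {
    var tmp = line.split("\t").reverse();
    records.insert({ field0: tmp[0], field1: tmp[1], field2: tmp[2] });
});

rl.on('close', function() {
    // toArray buffers the whole result; paginate with skip/limit as in
    // the nedb example if memory is a concern
    records.find({}).sort({ field0: 1 }).toArray(function(err, docs) {
        // handle errors, then write docs out as in the nedb example
    });
});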


Answer 4:

I had quite a similar issue and needed to perform an external sort.

After wasting some time on it, I figured out that I could load the data into a database and then query the desired data out of it.

It doesn't even matter if the inserts aren't ordered, as long as the query result can be.

Hope it can work for you too.

In order to insert your data into a database, there are plenty of tools in Node to perform such a task. I have a pet project which does a similar job.
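To make that idea concrete, here is a minimal sketch using better-sqlite3 as a stand-in SQL engine (my choice for illustration, not the tool this answer refers to; the table and column names are made up). Rows go in unordered, and ORDER BY hands them back sorted:

var Database = require('better-sqlite3');
var db = new Database('temp_sort.db');

db.exec('CREATE TABLE IF NOT EXISTS records (f1 TEXT, f2 TEXT, f3 TEXT)');
var insert = db.prepare('INSERT INTO records VALUES (?, ?, ?)');

// insertion order does not matter...
insert.run('b1', 'b2', 'zzz');
insert.run('a1', 'a2', 'aaa');

// ...because the query result is ordered on the third field
db.prepare('SELECT * FROM records ORDER BY f3').all()
  .forEach(function(r) { console.log([r.f1, r.f2, r.f3].join('\t')); });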

I'm also sure that if you search the subject, you'll find much more info.

Good luck.