
Loading large amount of data into memory - most efficient way to do this?

Published 2020-02-24 13:14

Question:

I have a web-based documentation searching/viewing system that I'm developing for a client. Part of this system is a search system that allows the client to search for a term[s] contained in the documentation. I've got the necessary search data files created, but there's a lot of data that needs to be loaded, and it takes anywhere from 8-20 seconds to load all the data. The data is broken into 40-100 files, depending on what documentation needs to be searched. Each file is anywhere from 40-350kb.

Also, this application must be able to run on the local file system, as well as through a webserver.

When the webpage loads up, I can generate a list of what search data files I need to load. This entire list must be loaded before the webpage can be considered functional.

With that preface out of the way, let's look at how I'm doing it now.

After I know that the entire webpage is loaded, I call a loadData() function

function loadData(){
    var d = new Date();
    var curr_min = d.getMinutes();
    var curr_sec = d.getSeconds();
    var curr_mil = d.getMilliseconds();
    console.log("test.js started background loading, time is: " + curr_min + ":" + curr_sec + ":" + curr_mil);
    recursiveCall();
}


function recursiveCall(){
    if(file_array.length > 0){
        var string = file_array.pop();
        setTimeout(function(){$.getScript(string,recursiveCall);},1);
    }
    else{
        var d = new Date();
        var curr_min = d.getMinutes();
        var curr_sec = d.getSeconds();
        var curr_mil = d.getMilliseconds();
        console.log("test.js stopped background loading, time is: " + curr_min + ":" + curr_sec + ":" + curr_mil);
    }
}

What this does is process an array of files sequentially, taking a 1ms break between files. This helps prevent the browser from being completely locked up during the loading process, but the browser still tends to get bogged down by loading the data. Each of the files that I'm loading looks like this:

AddToBookData(0,[0,1,2,3,4,5,6,7,8]);
AddToBookData(1,[0,1,2,3,4,5,6,7,8]);
AddToBookData(2,[0,1,2,3,4,5,6,7,8]);

Where each line is a function call that is adding data to an array. The "AddToBookData" function simply does the following:

function AddToBookData(index1,value1){
    BookData[BookIndex].push([index1,value1]);
}

This is the existing system. After loading all the data, "AddToBookData" can get called 100,000+ times.

I figured that was pretty inefficient, so I wrote a script to take the test.js file, which contains all the function calls above, and process it into a giant array equal to the data structure that BookData creates. Instead of making all the function calls that the old system did, I simply do the following:

var test_array = [..........(data structure I need).......];
BookData[BookIndex] = test_array;

I was expecting to see a performance increase because I was removing all the function calls above, but this method actually takes slightly more time to create the exact same data structure. I should note that "test_array" holds slightly over 90,000 elements in my real-world test.

It seems that both methods of loading data have roughly the same CPU utilization. I was surprised to find this, since I was expecting the second method to require little CPU time, because the data structure is created beforehand.

Please advise?

Answer 1:

It looks like there are two basic areas for optimising the data loading, which can be considered and tackled separately:

  1. Downloading the data from the server. Rather than one large file, you should gain wins from parallel loads of multiple smaller files. Experiment with the number of simultaneous loads, bearing in mind browser limits and the diminishing returns of having too many parallel connections (a rough sketch follows this list). See my parallel vs sequential experiments on jsfiddle, but bear in mind that the results will vary due to the vagaries of pulling the test data from GitHub - you're best off testing with your own data under more tightly controlled conditions.
  2. Building your data structure as efficiently as possible. Your result looks like a multi-dimensional array; this interesting article on JavaScript array performance may give you some ideas for experimentation in this area.
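
For the parallel-download idea in point 1, here is a rough sketch only (not tested against your setup; loadInParallel, the maxParallel cap and the onAllLoaded callback are names invented for illustration, not part of your code):

function loadInParallel(fileArray, maxParallel, onAllLoaded) {
    var queue = fileArray.slice();   // copy so the original array is left untouched
    var active = 0;

    function next() {
        // start requests until the cap is reached or the queue is empty
        while (active < maxParallel && queue.length > 0) {
            var url = queue.shift();
            active++;
            $.getScript(url, function () {
                active--;
                if (queue.length === 0 && active === 0) {
                    onAllLoaded();   // every file has finished
                } else {
                    next();          // keep the pipeline full
                }
            });
        }
    }
    next();
}

// e.g. try 4 simultaneous downloads of the existing script files:
// loadInParallel(file_array, 4, function(){ console.log("all search data loaded"); });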

But I'm not sure how far you'll really be able to go by optimising the data loading alone. To solve the actual problem with your application (the browser locking up for too long), have you considered options such as the following?

Using Web Workers

Web Workers might not be supported by all your target browsers, but should prevent the main browser thread from locking up while it processes the data.
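
A very rough sketch of that idea, assuming the data files were converted to JSON as suggested in another answer; the file name search-worker.js and the message shapes are inventions for illustration:

// main page
var worker = new Worker("search-worker.js");
worker.onmessage = function (e) {
    BookData[BookIndex] = e.data;   // structure was built off the main thread
    console.log("search data ready");
};
worker.postMessage(file_array);     // tell the worker which files to fetch

// search-worker.js
self.onmessage = function (e) {
    var result = [];
    e.data.forEach(function (url) {
        var xhr = new XMLHttpRequest();
        xhr.open("GET", url, false);   // synchronous XHR is acceptable inside a worker
        xhr.send(null);
        JSON.parse(xhr.responseText).forEach(function (entry) {
            result.push([entry.key, entry.values]);   // same shape BookData expects
        });
    });
    self.postMessage(result);
};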

For browsers without workers, you could consider increasing the setTimeout interval slightly to give the browser time to service the user as well as your JS. This will actually make things slightly slower, but may increase user happiness when combined with the next point.

Providing feedback of progress

For both worker-capable and worker-deficient browsers, take some time to update the DOM with a progress bar. You know how many files you have left to load, so progress should be fairly consistent, and although things may actually be slightly slower, users will feel better if they get the feedback and don't think the browser has locked up on them.
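
A minimal sketch of that, reusing the recursive loader from the question (the #load-progress element and the total_files variable are assumed here, not part of the original code):

var total_files = file_array.length;   // remember the starting count

function recursiveCallWithProgress() {
    if (file_array.length > 0) {
        var done = total_files - file_array.length;
        $("#load-progress").text("Loading search data: " + done + " / " + total_files);
        var url = file_array.pop();
        setTimeout(function(){ $.getScript(url, recursiveCallWithProgress); }, 1);
    } else {
        $("#load-progress").text("Search data loaded");
    }
}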

Lazy Loading

As suggested by jira in his comment: if Google Instant can search the entire web as we type, is it really not possible to have the server return a file with all locations of the search keyword within the current book? This file should be much smaller and faster to load than the locations of all words within the book, which is what I assume you are currently trying to get loaded as quickly as you can.
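
A hedged sketch of what that could look like on the client, assuming the server (or a pre-generated file per term) can answer for a single keyword; the URL scheme and the showResults function are assumptions:

function searchTerm(term) {
    // fetch only the locations of this one term in the current book
    $.getJSON("search-data/" + encodeURIComponent(term) + ".json", function (locations) {
        showResults(term, locations);   // assumed rendering function
    });
}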



Answer 2:

I tested three methods of loading the same 9,000,000-point dataset into Firefox 3.6.4:

1. Stephen's getJSON method
2. My function-based push method
3. My pre-processed array-appending method

I ran my tests two ways. In the first iteration of testing, I imported 100 files, each containing 10,000 rows of data, with each row containing 9 data elements [0,1,2,3,4,5,6,7,8].

In the second iteration, I tried combining files, so that I was importing 1 file with 9 million data points.

This was a lot larger than the dataset I'll be using, but it helps demonstrate the speed of the various import methods.

             Separate files:     Combined file:

JSON:        34 seconds          34 seconds
FUNC-BASED:  17.5 seconds        24 seconds
ARRAY-BASED: 23 seconds          46 seconds

Interesting results, to say the least. I closed out the browser after loading each webpage and ran the tests 4 times each to minimize the effect of network traffic/variation (run across a network, using a file server). The numbers you see are the averages, although the individual runs differed by only a second or two at most.



Answer 3:

Instead of using $.getScript to load JavaScript files containing function calls, consider using $.getJSON. This may boost performance. The files would now look like this:

{
    "key" : 0,
    "values" : [0,1,2,3,4,5,6,7,8]
}

After receiving the JSON response, you could then call AddToBookData on it, like this:

function AddToBookData(json) {
     BookData[BookIndex].push([json.key,json.values]);
}
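
As a rough sketch of how the loading side might change (reusing file_array and the setTimeout pattern from the question; this part is an illustration, not taken from the answer), the recursive loader could swap $.getScript for $.getJSON:

function recursiveCall() {
    if (file_array.length > 0) {
        var url = file_array.pop();
        setTimeout(function () {
            $.getJSON(url, function (json) {
                AddToBookData(json);   // store the parsed response, then queue the next file
                recursiveCall();
            });
        }, 1);
    }
}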

If your files have multiple sets of calls to AddToBookData, you could structure them like this:

[
    {
        "key" : 0,
        "values" : [0,1,2,3,4,5,6,7,8]
    },
    {
        "key" : 1,
        "values" : [0,1,2,3,4,5,6,7,8]
    },
    {
        "key" : 2,
        "values" : [0,1,2,3,4,5,6,7,8]
    }
]

And then change the AddToBookData function to compensate for the new structure:

function AddToBookData(json) {
    $.each(json, function(index, data) {
        BookData[BookIndex].push([data.key,data.values]);
    });
}  

Addendum
I suspect that regardless of what method you use to transport the data from the files to the BookData array, the true bottleneck is the sheer number of requests. Must the files be fragmented into 40-100? If you change to the JSON format, you could load a single file that looks like this:

{
    "file1" : [
        {
            "key" : 0,
            "values" : [0,1,2,3,4,5,6,7,8]
        },
        // all the rest...
    ],
    "file2" : [
        {
            "key" : 1,
            "values" : [0,1,2,3,4,5,6,7,8]
        },
        // yadda yadda
    ]
}

Then you could do one request, load all the data you need, and move on... Although the browser may initially lock up (although, maybe not), it would probably be MUCH faster this way.
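
A hedged sketch of consuming such a combined file (the file name all-search-data.json is an assumption for illustration):

$.getJSON("all-search-data.json", function (combined) {
    // walk each "fileN" group and push its entries into BookData
    $.each(combined, function (fileName, entries) {
        $.each(entries, function (i, data) {
            BookData[BookIndex].push([data.key, data.values]);
        });
    });
});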

Here is a nice JSON tutorial, if you're not familiar: http://www.webmonkey.com/2010/02/get_started_with_json/



Answer 4:

Fetch all the data as a string, and use split(). This is the fastest way to build an array in JavaScript.
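
A rough sketch of that approach, assuming each line of a plain-text file is formatted as "key|v,v,v,..." (the file name and the delimiters are inventions for illustration):

$.get("book-data.txt", function (text) {
    var lines = text.split("\n");
    for (var i = 0; i < lines.length; i++) {
        if (!lines[i]) continue;                        // skip blank lines
        var parts = lines[i].split("|");
        var key = parseInt(parts[0], 10);
        var values = parts[1].split(",").map(Number);   // "0,1,2" -> [0, 1, 2]
        BookData[BookIndex].push([key, values]);
    }
}, "text");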

There's an excellent article on a very similar problem, from the people who built the Flickr search: http://code.flickr.com/blog/2009/03/18/building-fast-client-side-searches/