
Random sort order

Posted 2020-08-10 06:59

Question:

The question of how to get a random document from a collection has been asked many times, and there have been suggestions on the topic.

What I need is to get several random documents from a collection, and what is even worse, those documents must match certain criteria (be filtered, I mean). For example, I have a collection of articles where each article has a 'topic' field. The user chooses a topic he's interested in, and my db must show the corresponding articles in random order each time.

Obviously, the previously discussed hacks won't help me. The only way I see to achieve what I want is to query for the corresponding topic, getting ids only:

var arr = db.articles.find({topic: 3}, {_id:1}).toArray();

and then generate a random sequence of numbers depending on how many documents were received, pick document ids from the array using those random numbers as indexes, and finally make another request to mongodb to fetch the documents with those randomly chosen ids.
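Spelled out, that two-step dance would look something like this (a rough sketch of the approach just described; picking 7 documents is only an example):

// step 1: fetch only the _ids of the matching articles
var ids = db.articles.find({topic: 3}, {_id: 1}).toArray();

// step 2: pick random positions in that array, without repeats
var picked = [];
for (var i = 0; i < 7 && ids.length > 0; i++) {
    var idx = Math.floor(Math.random() * ids.length);
    picked.push(ids.splice(idx, 1)[0]._id);
}

// step 3: a second request to fetch the chosen documents
var docs = db.articles.find({_id: {$in: picked}}).toArray();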

As you can see, this is a little too slow altogether, especially if the first query returns many articles :)

So what I think is that there may be some mongodb command to get documents by index keys based on their position in the index. The point is that I can create a covered compound index like this:

db.articles.ensureIndex({topic: 1, _id:1});

Now my query would only have to scan a contiguous run of the right _ids in the index. And if I could request the documents from the collection by those _id positions, I could do the whole thing in one request! Something like:

var cursor = db.articles.find({topic:3, $indexKeyPosition: {$in: myRandomSequence}});

Does anyone know about such features?

Answer 1:

Nowadays (MongoDB 3.2 and later), you should be able to use the $sample aggregation stage.

Example (untested):

db.articles.aggregate([
    { $match : { topic : 3 } },
    { $sample : { size: 3 } }
])

Note, however, that it may return the same document more than once.
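If duplicates matter for your use case, one possible (untested) workaround is to over-sample and deduplicate by _id on the client, roughly like this:

var wanted = 3;
var seen = {};
var unique = [];
db.articles.aggregate([
    { $match : { topic : 3 } },
    { $sample : { size: wanted * 3 } }    // sample more than needed, dedupe below
]).forEach(function(doc) {
    var key = String(doc._id);
    if (!seen[key] && unique.length < wanted) {
        seen[key] = true;
        unique.push(doc);
    }
});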



Answer 2:

So what I think is that there may be some mongodb command to get documents by index keys based on their position in the index. The point is that I can create a covered compound index like this:

No, there isn't such a function in MongoDB, though being able to randomise a result set is a good idea. In the meantime, here is the JIRA ticket for it: https://jira.mongodb.org/browse/SERVER-533

Since there is no way to pick documents by their position in an index (and so do this with an index and a single round trip), you have no choice but to open multiple cursors.

The current solution depends on how many documents are in your result set.

If there is a small number of documents in your result set, you can solve this with a simple skip(rand()) and limit(1); however, you must be aware that skip() makes no effective use of indexes.

That does not mean it will scan the entire B-tree; it means it will scan as far as you skip().

This means that if your result set grows large and rand() produces a large offset, you will see serious performance problems, as many have.

One possible way of solving this is to maintain either:

  • A rand() field value between 0 and 1
  • Or an incremental ID field ( http://www.mongodb.org/display/DOCS/How+to+Make+an+Auto+Incrementing+Field )

and use that new field to "skip" on, combined with the rest of your query, like:

var arr = db.articles.find({topic: 3, rand: rand()}, {_id:1}).limit(7).toArray();

This will get 7 random rows using the 0-to-1 idea.
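For that to work, each article needs the rand field written when it is inserted (or back-filled once). A minimal sketch of the 0-to-1 idea, with the field and collection names from the examples above:

// store a random value between 0 and 1 alongside each new article
db.articles.insert({topic: 3, rand: Math.random()});

// back-fill documents that don't have the field yet
db.articles.find({rand: {$exists: false}}).forEach(function(doc) {
    db.articles.update({_id: doc._id}, {$set: {rand: Math.random()}});
});

// index it together with the query field
db.articles.ensureIndex({topic: 1, rand: 1});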

The random-sort abilities of this approach rely on an ever-changing data set to help create randomness within the sort. This will, of course, not work if the data set stays static.

As for using batchSize, it becomes irrelevant here, and in fact it normally is. For example, your logic of using batchSize to get all your results back does not make complete sense, since a batch normally has an absolute maximum size of 16MB. This means that if your documents are large you may not be getting the single round trip you think you are.

It also only dictates that the server will send all of this data at once; it does not reflect the amount of work placed on the server, just the amount of data sent over the wire at once.

So, considering that you must do this with multiple cursors (the way I would recommend), you can just run:

var arr = db.articles.find({topic: 3, rand: {$gte:rand()}}).sort({rand:1}).limit(1);

several times over, or however many times you need. This is not much different from a normal iteration of a cursor and, provided you have the right indexes, should be quite fast.
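Run in a loop, that looks roughly like this (a sketch; it assumes the rand field and a {topic: 1, rand: 1} index as described above, and it wraps around in case the random threshold overshoots the largest stored value):

var results = [];
for (var i = 0; i < 7; i++) {
    var r = Math.random();
    var cursor = db.articles.find({topic: 3, rand: {$gte: r}}).sort({rand: 1}).limit(1);
    if (!cursor.hasNext()) {
        // nothing above the threshold, so take the closest value below it instead
        cursor = db.articles.find({topic: 3, rand: {$lt: r}}).sort({rand: -1}).limit(1);
    }
    if (cursor.hasNext()) results.push(cursor.next());
}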

There is one other method, but I do not recommend it. You could run a map/reduce, say, once every hour or so, which creates another collection of _id and rand() values; this means you can do the first query I showed:

var arr = db.articles.find({topic: 3, rand: rand()}, {_id:1}).limit(7).toArray();

and really get 7 random records, since the rand() values will, of course, be different. But this is not real time, nor would it be very nice to your server on a large data set, so I do not recommend it.

Edit

There is one other way. With the auto-incrementing id you could do an $or statement to pick out 7 rand()s all at once. However, this introduces yet another problem: deletes.

If you delete any rows, you might hit a rand() that doesn't exist, and so no row will be returned. Since the auto-incrementing id is not maintained server-side to account for deletes, you would have to do this yourself, which would not be an easy or scalable thing to do.

To add to this, $or statements cannot be limit()ed per clause, which means you can't get around this by doing sub-select-type $ors to make MongoDB pick out only one result per $or clause using $gte.

The same applies to a rand() between 0 and 1: it would work with an $or if you could limit individual clauses.



Answer 3:

You can (as in pagination) count how many documents match the query. Then make N queries with skip(random_value) and limit(1).

db.collection.count({field:value,field2:value2})

db.collection.find({field:value,field2:value2}).skip(n).limit(1)

If the collection is indexed for the query, this should be fast.
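Putting the two commands together, one (untested) way to pull N random matching documents, where field/value are the placeholders from the commands above:

var n = 7;
var total = db.collection.count({field: value, field2: value2});
var picks = [];
for (var i = 0; i < n && total > 0; i++) {
    var offset = Math.floor(Math.random() * total);
    var c = db.collection.find({field: value, field2: value2}).skip(offset).limit(1);
    if (c.hasNext()) picks.push(c.next());
}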