Querying directly on results from MongoDB mapreduce versus updating original collection -

- August 15, 2013

i have mapreduce job runs on collection of posts , calculates popularity each post. mapreduce outputs collection post_id , popularity each post. application needs able posts sorted popularity. there millions of posts, , these popularities updated every 10 minutes. 2 methods can think of:

method 1 keep index on posts table popularity field run mapreduce on posts table (this replace previous mapreduce results) loop through each row in mapreduce results collection , individually update popularity of corresponding post in posts table query straight on posts table posts sorted popularity method 2 run mapreduce on posts table (this replace previous mapreduce results) add index popularity field in resulting mapreduce collection when application needs posts, first query mapreduce results collection sorted post_ids, query posts collection actual post data

questions

method 1 need maintain index on popularity in posts table. it'll need update millions (the post table has millions of rows) of popularities individually every 10 or minutes. it'll update posts have changed popularity, it's still lot of updates on collection couple of indexes. there important # of reads on collection well. scalable? for method 2, possible mapreduce posts collection create new popularities collection, create index on it, , query it? are there concurrency issues question #2, assuming application querying popularities collection it's beingness updated map cut down , re-indexed. if mapreduce replaces popularities collection need manually create new index every time or mongo know maintain index on popularity field. basically, how indexes work mapreduce result collections. is there tweak or other method utilize this??

thanks help!

the generic advice concerning map cut down have application perform little computation on each insert, , avoid doing processor-intensive map cut down job whenever possible.

is possible add together "popularity" field each "post" document , have application increment each time each post viewed, clicked on, voted for, or measure popularity? index popularity field, , searches posts popularity lightning-fast.

if incrementing "popularity" field not option, , mapreduce operation must performed, seek prevent paging through of documents in collection. find becomes prohibitively slow collection grows. sounds though collection pretty large.

it possible perform incremental map reduce, results of latest map cut down integrated results of previous one, instead of simply beingness overwritten. can provide query mapreduce function, not documents read. perhaps add together query matches posts have been viewed, voted for, or added since lastly map reduce.

the documentation on incremental mapreduce operations here: http://www.mongodb.org/display/docs/mapreduce#mapreduce-incrementalmapreduce

integrating new results old ones explained in "output options" section.

i realize advice has been pretty general far, effort address questions now:

1) discussed above, if mapreduce operation has read every single document, not scale well. 2) mapreduce operation outputs collection. creating index , querying collection have done programmatically. 3) if there 1 process querying collection @ same time updating it, possible query homecoming document before has been updated. short reply is, "yes" 4) if collection dropped indexes have rebuilt. if documents in collection deleted, collection not dropped index(es) persist. in case of mapreduce run {out:{replace:"output"}} option, index(ex) persist, , won't have recreated. 5) stated above, if possible preferable add together field "posts" collection, , update that, instead of performing many mapreduce operations.

hopefully have been able provide additional factors consider when building application. ultimately, of import remember each application unique, , ultimate proof of way "best", have experiment of different options , decide way efficient. luck!

mongodb mapreduce

Search This Blog

Kamlesh

Querying directly on results from MongoDB mapreduce versus updating original collection -

Comments

Post a Comment

Popular posts from this blog

How do I check if an insert was successful with MySQLdb in Python? -

delphi - blogger via idHTTP : error 400 bad request -

postgresql - ERROR: operator is not unique: unknown + unknown -