Introduction
My collection has more than 1 million documents. Each document has an identical structure and looks like this:
{_id: "LiTC4psuoLWokMPmY", number: "12345", letter: "A", extra: [{eid:"jAHBSzCeK4SS9bShT", value: "Some text"}]}
As you can see, the extra field is an array of small objects. I try to insert as many of these objects as possible (until I get close to the 16 MB document limit), and the same objects usually appear in the extra array of most documents in the collection, so I often end up with hundreds of thousands of copies of the same object.
I have an index on the eid key in the extra array, which I created like this:
db.collectionName.createIndex({"extra.eid":1})
Problem
I want to count how many documents in the collection contain a given extra object. I do it like this:
db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}).count()
At first the query above is very fast, but as soon as the extra array gets a little bigger (more than 20 objects), it gets really slow.
With 3-4 objects it takes less than 100 milliseconds, but as the array grows it takes a lot more time; with 50 objects it takes 6238 milliseconds.
Questions
Why is this happening?
How can I make this process faster?
Is there another way of doing this that is faster?
I ran into a similar problem. I bet your query isn't hitting your index.
You can do an explain (run db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}).explain() in the Mongo shell) to know for sure.
The reason is that in Mongo db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}) is not the same as db.collectionName.find({"extra.eid": "jAHBSzCeK4SS9bShT"}). The first form asks for an array element that is exactly equal to the embedded document {eid: "jAHBSzCeK4SS9bShT"} and won't use your index on extra.eid, while the second (dot-notation) form matches any element whose eid has that value and can use the multikey index. Note that because of this the two forms don't necessarily match the same documents.
In my own case I didn't find any solution except indexing the entire subdocument.
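To see which plan each form gets, you can run a quick check in the shell. This is only a sketch using the query and index from the question; explain("executionStats") needs a reasonably recent server, older shells just use .explain():

// Look at the winning plan: IXSCAN means the index is used, COLLSCAN means a full collection scan.
db.collectionName.find({ extra: { eid: "jAHBSzCeK4SS9bShT" } }).explain("executionStats")
db.collectionName.find({ "extra.eid": "jAHBSzCeK4SS9bShT" }).explain("executionStats")

// If the dot-notation form shows an IXSCAN on { "extra.eid": 1 }, the count can use it as well:
db.collectionName.find({ "extra.eid": "jAHBSzCeK4SS9bShT" }).count()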
Related
This is from a book I am reading:
Strictly speaking, the find() command returns a cursor to the returning documents. Therefore, to access the documents you’ll need to iterate the cursor. The find() command automatically returns 20 documents—if they’re available—after iterating the cursor 20 times.
I cannot understand what the author means. What is a cursor in MongoDB?
There are many slightly different ways to process the result of a request:
Maybe you want to sort them
Maybe you want to limit the number of results
Maybe you want to skip items
Etc...
To let you do this in a convenient way, and to offer a performant implementation, MongoDB asks you to do things in two steps:
Specify the request (filter and projection)
Then, tell what you want to do with the results (sort, skip, limit, etc...)
Step 1 returns the cursor.
The cursor has methods which allow you to specify what you want to do in step 2, and it also has methods which allow you to iterate on the result.
Results are actually retrieved over time, as you iterate the cursor. This keeps the use of system resources reasonable.
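A small shell sketch of the two steps (the collection and field names here are just examples):

// Step 1: specify the request. find() returns a cursor; no documents are fetched yet.
var cursor = db.users.find({ active: true })

// Step 2: tell the cursor what to do with the results, then iterate it.
cursor.sort({ name: 1 }).skip(20).limit(20)
while (cursor.hasNext()) {
    printjson(cursor.next())   // documents are pulled from the server as you iterate
}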
I think it's probably a simple answer but I thought I'd quickly check...
Let's say I'm adding Ints to an array at various points in my code, and then later I want to find out whether the array contains a certain Int.
var array = [Int]()
array.append(2)
array.append(4)
array.append(5)
array.append(7)
if array.contains(7) { print("There's a 7 alright") }
Is this heavier, performance-wise, than if I created a dictionary?
var dictionary = [Int:Int]()
dictionary[7] = 7
if dictionary[7] != nil { print("There's a value for key 7")}
Obviously there are reasons to prefer one or the other, like wanting to eliminate duplicate entries of the same number... but I could also do that with a Set. I'm mainly just wondering about the performance of dictionary[key] vs array.contains(value).
Thanks for your time
Generally speaking, Dictionaries provide constant-time, i.e. O(1), access, which means checking whether a value exists and updating it are faster than with an Array, where contains has to scan the elements linearly, i.e. O(n). If those are the operations you need to optimize for, then a Dictionary is a good choice. However, since dictionaries enforce uniqueness of keys, you cannot insert multiple values under the same key.
Based on the question, I would recommend for you to read Ray Wenderlich's Collection Data Structures to get a more holistic understanding of data structures than I can provide here.
I did some sampling!
I edited your code so that the print statements are empty.
I ran the code 1,000,000 times. Each time I measured how long it takes to access the dictionary and the array separately, then subtracted the dictTime from the arrTime (arrTime - dictTime) and saved this number.
Once it finished I took the average of the results.
The result is 23150, meaning that over 1,000,000 tries the array was faster to access by 23150 nanoseconds on average.
The max difference was 2426737 and the min was -5711121.
Here are the results on a graph:
I have an array of integers storing some userIDs. I basically want to prevent a user from performing an action twice, so the moment he has done it his userID enters this array.
I wonder whether or not it is a good idea to sort this array. If it is sorted, then you have A = {min, ..., max}. Then, if I'm not wrong, checking whether an ID is in the array takes about log2(|A|) 'steps'. On the other hand, if the array is not sorted, you need |A|/2 steps on average.
So sorting seems better for checking whether an element exists in the array (log(|A|) vs |A|), but what about adding a new value? Finding the position where the new userID should go can be done while you're checking, but then you have to shift all the elements after that position by one... or at least that's how I'd do it in C. The truth is this is going to be an array inside a MongoDB document, so perhaps it is handled in some more efficient way.
Of course if the array is unsorted then adding a new value will just take one step ("pushing" it to the end).
To me, an adding operation (with previous checking) will take:
If sorted: log2(|A|) + |A|/2. The log2 part to check and find the place and the |A|/2 as an average of the displacements needed.
If not sorted: |A|/2 + 1. The |A|/2 to check and the +1 to push the new element.
Given that adding always starts with a check, the unsorted version appears to need fewer steps, but the truth is I'm not very confident about the +|A|/2 of the sorted version. That's how I would do it in C, but maybe it works another way...
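(For a sense of scale: with |A| = 1,000,000, that is roughly 20 + 500,000 ≈ 500,020 steps for the sorted version versus 500,000 + 1 ≈ 500,001 for the unsorted one, so the two come out nearly even.)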
O(log |A|) is definitely better than O(|A|), but this can be done in O(1). The data structure you are looking for is a hash map, if you are going to do this in C. I haven't worked in C in a very long time, so I don't know whether one is natively available now; it surely is available in C++, and there are libraries you can use in the worst case.
For MongoDB, my solution may not be the best, but I think you can create another collection containing just the userIDs and index that collection on the userID. This way, when someone tries to perform the action, you can check the user's status with a fast indexed query.
Also, in MongoDB you can try adding another key called UserDidTheAction to your Users collection, with a value of true or false. Index the collection on userID and you will probably get performance similar to the other solution, but at the cost of modifying your original collection's design (though schemas are not required to be fixed in MongoDB).
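A rough shell sketch of the separate-collection idea (the userActions collection and userId field names are made up for illustration):

// A unique index keeps the check an indexed lookup and prevents recording the same userID twice.
db.userActions.createIndex({ userId: 1 }, { unique: true })

// Record that the user performed the action; a second insert for the same userId
// is rejected by the unique index instead of creating a duplicate.
db.userActions.insertOne({ userId: 12345 })

// Checking whether the user has already done it is an indexed point query:
db.userActions.find({ userId: 12345 }).count()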
I have a table with a bunch of documents that are updated regularly (in part).
What I'm essentially trying to do is create another table (called changes below) that stores the latest N changes to each of those documents.
I'm thus doing table.changes() to get all changes on the table, calculating the diff information I want (called diffentry below) and prepending that info to an array in the other table:
changes.get(doc_id).update({
    'diffs': R.row['diffs'].prepend(diffentry)
}).run()
The tricky bit is: how do I limit the size of the diffs array?
There's an array method delete_at() that can delete one or many items from an array, which I could just "brute force"-call, like:
delete_at(diff_limit, diff_limit + 10000)
and ignore any error (the insane upper limit is just paranoia). But that feels kind of dirty...
I thought a nicer and better way would be to filter on arrays that are larger than the limit and remove the excess. Pseudo:
changes.get(doc_id).filter(R.row['diffs'].length > diff_limit).update({
'diffs': R.row['diffs'].delete_at(diff_limit, R.row['diffs'].length - 1)
}).run()
But, alas, there is no length, as far as I've found...
Any ideas on how to do this kind of thing in a nice way?
In JS you can use a function with .count() like this:
changes.get(doc_id).update(function(doc) {
    return {
        // drop everything from position diff_limit through the end of the array
        diffs: doc('diffs').deleteAt(diff_limit, doc('diffs').count())
    };
}).run()
I think in Python it should be something similar.
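If diffs can be shorter than diff_limit, you may prefer to trim only when the array is actually too long, along the lines of the filter idea in the question. A rough sketch in JS using branch() and slice(), where r is the driver handle (R in your code):

changes.get(doc_id).update(function(doc) {
    return {
        diffs: r.branch(
            doc('diffs').count().gt(diff_limit),   // only touch arrays over the limit
            doc('diffs').slice(0, diff_limit),     // keep the first diff_limit entries
            doc('diffs')                           // otherwise leave the array as is
        )
    };
}).run()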
I have n arrays of data, each sorted by the same criteria.
The number of arrays will, in almost all cases, not exceed 10, so it is a relatively small number. Each array, however, can contain a large number of objects, which should be treated as unbounded for the algorithm I am looking for.
I now want to treat these arrays as if they were one array. However, I need a way to retrieve objects in a given range as fast as possible, without touching all the objects before the range or all the objects after it. Therefore it is not an option to iterate over all objects and store them in one single array. Fetches with low start values are also more likely than fetches with a high start value; e.g. fetching objects [20, 40) is much more likely than fetching objects [1000, 1020), though the latter can happen.
The range itself will be pretty small, around 20 objects, but it can be increased if that helps performance, as long as it stays within memory limits; I would guess a couple of hundred objects would be fine as well.
Example:
3 arrays, each containing a couple of thousand entries. I now want to get the objects of the combined order in the range [60, 80) without touching either the first 60 objects of each array or all the objects that come after object 80.
I am thinking about some sort of combined, modified binary search. My current idea is something like the following (note that this is not fully thought through yet; it is just an idea):
get object 60 of each array; the beginning of the range cannot come after any of these, since every single array on its own already provides 60 objects before it
use these objects as the maximum value for the binary search in every array
from one of the arrays, get the centered object (e.g. object 30)
with a binary search in all the other arrays, find in each array the object that comes before, but as close as possible to, the picked object
we now have 3 objects, e.g. at positions 15, 10 and 20. These positions sum to 45, so there are 42 objects in front of them, which is more than the beginning of the range we are looking for (30). We continue our binary search in the remaining left half of one of the arrays
if instead the sum is smaller than the beginning of the range we are looking for, we continue our search in the right half
at some point we will hit object 30. From there on, we can simply add the objects from each array, one by one, with an insertion sort, until we reach the range length
My questions are:
Is there any name for this kind of algorithm I described here?
Are there other algorithms or ideas for this problem, that might be better suited for this issue?
Thanks in advance for any ideas or help!
People usually call this problem something like "selection in the union of multiple sorted arrays". There are existing questions about the special case of two sorted arrays as well as about the general case, and several comparison-based approaches appear in their answers; they more or less all have to determine where the lower endpoint of the range falls in each individual array. Your binary-search idea is one of the better approaches; there is an asymptotically faster algorithm due to Frederickson and Johnson, but it is complicated and not obviously an improvement for small ranks.
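For concreteness, here is a rough sketch of the counting-based binary search idea (my own illustration, not the Frederickson-Johnson algorithm). It assumes non-empty arrays sorted ascending by an integer key, so we can binary-search over values; for arbitrary keys you would binary-search over candidate elements drawn from the arrays instead.

// countAtMost(arrays, x): how many elements across all sorted arrays are <= x,
// using one binary search per array (O(n log m) in total).
function countAtMost(arrays, x) {
    let total = 0;
    for (const a of arrays) {
        let lo = 0, hi = a.length;              // find the first index with a[i] > x
        while (lo < hi) {
            const mid = (lo + hi) >> 1;
            if (a[mid] <= x) lo = mid + 1; else hi = mid;
        }
        total += lo;
    }
    return total;
}

// kthSmallest(arrays, k): the k-th smallest value (1-based) in the union,
// found by binary searching over the value range and counting ranks.
function kthSmallest(arrays, k) {
    let lo = Math.min(...arrays.map(a => a[0]));
    let hi = Math.max(...arrays.map(a => a[a.length - 1]));
    while (lo < hi) {
        const mid = Math.floor((lo + hi) / 2);
        if (countAtMost(arrays, mid) >= k) hi = mid;   // the k-th smallest is <= mid
        else lo = mid + 1;                             // the k-th smallest is > mid
    }
    return lo;
}

// Example: for the range [60, 80), the lower endpoint is the 61st smallest element.
// Once its value v is known, a binary search per array gives each array's split point,
// and the 20 requested objects can be merged from those positions onward.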