I have a records collection with the following indexes:
{"_id":1}
{"car.make":1,"city":1,"car.mileage":1}
And I'm performing the following query:
db.records.aggregate([
  {
    "$match": {
      "car.mileage": { "$in": [1000, 2000, 3000, 4000] },
      "car.make": "Honda",
      "city": { "$in": ["Miami", "San Francisco", "New York", "Chicago", "Seattle", "Boston"] }
    }
  },
  {
    "$sort": { "_id": -1 }
  }
])
The query without the $sort stage finishes in a few milliseconds, but adding the $sort makes it take around 2 minutes. The query should return around 40 documents from a collection of 6M documents. Any clues about what could cause this huge difference in query time?
After additional testing, the problem goes away when sorting on a different field such as creation_date, even though creation_date is not indexed. Any ideas why the _id field would perform so much worse than the unindexed creation_date field in this aggregation?
I ran into the same problem today. I'm speculating here, but I believe that in this case sorting by _id sorts all the entries in the collection before the other operations (I say speculating because if you omit the $match stage and keep only the $sort, you still get your data back in milliseconds).
The workaround that helped me was projection.
If you add a $project stage between $match and $sort, you will get your data back in milliseconds again. So you can either sort on a field like creation_date, or, if you must sort on _id, add a $project before the $sort.
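For example, a minimal sketch of the workaround applied to the query from the question (the projected field list here is illustrative; include whichever fields you actually need):

db.records.aggregate([
  {
    "$match": {
      "car.mileage": { "$in": [1000, 2000, 3000, 4000] },
      "car.make": "Honda",
      "city": { "$in": ["Miami", "San Francisco", "New York", "Chicago", "Seattle", "Boston"] }
    }
  },
  // Projecting before sorting is what avoided the slow plan in my case
  { "$project": { "_id": 1, "car": 1, "city": 1 } },
  { "$sort": { "_id": -1 } }
])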
Related
I have documents for software that contain these fields: _id, category, brand, etc. There is a price field of type string, and some documents have invalid or null prices. I want to use an aggregation pipeline so that only prices whose length is >= 4 and <= 8 are kept, and to convert the price to a double. There is also a date field that I want to be >= 10. I also want to use $out to create a new collection from the result. This is what I have done so far; I was wondering if someone could let me know how I can retrieve the documents without losing or changing the other fields, only the price and date.
db.sw.aggregate([
  { $match: {} },
  { $project: { priceLen: { $strLenCP: "$price" } } },
  { $match: { priceLen: { $gte: 4, $lte: 8 } } },
  { $project: { price: { $trim: { input: "$price", chars: "$" } } } },
  { $project: { price: { $toDouble: "$price" } } }
])
My thought process for the empty $match was to retrieve all the documents. Any help will be really appreciated.
No idea what your requirements are in terms of being "correct".
$project removes all fields (apart from _id) and populates only the ones given. If you'd like to keep the existing fields, use $set or its alias $addFields, whose name describes the actual operation.
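For instance, a minimal sketch using $set so the other fields survive, assuming the price strings carry a leading "$" as in the $trim above, and writing to a hypothetical sw_clean collection ($set and $unset need MongoDB 4.2+; $convert with onError turns still-unparsable prices into null instead of aborting the pipeline):

db.sw.aggregate([
  // Keep only string prices; $strLenCP errors on null or non-string values
  { $match: { price: { $type: "string" } } },
  { $set: { priceLen: { $strLenCP: "$price" } } },
  { $match: { priceLen: { $gte: 4, $lte: 8 } } },
  // Overwrite price in place; every other field is left untouched
  { $set: { price: { $convert: {
      input: { $trim: { input: "$price", chars: "$" } },
      to: "double",
      onError: null
  } } } },
  // Drop the helper field before writing out
  { $unset: "priceLen" },
  { $out: "sw_clean" }
])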
I'm trying to create an interface that gives our team the ability to build relatively simple queries to segment customers and aggregate data about them. In addition, I want to build a "List" feature that allows users to upload a CSV generated separately for more complex one-off queries. In this scenario, I'm trying to figure out the best way to query the database.
My main question is how the $in operator works. The example below has an aggregate which tries to check if a primary key (object ID) is in an array of object IDs.
How well does the $in operator perform? I'm mostly wondering how this query will run – does it do N index lookups, one for each value in the array, or does it loop over all of the documents in the database and, for each one, scan the array to check whether the field matches?
db.getCollection('customers').aggregate([
  {
    $match: {
      _id: { $in: [ObjectId("idstring1"), ObjectId("idstring2"), ..., ObjectId("idstring5000")] }
    }
  }
])
If that's not how it works, what's the most efficient way of aggregating a set of documents given a bunch of object IDs? Should I just do the N lookups manually and pipe the documents into the aggregation?
My main question is how the $in operator works. The example below has an aggregate which tries to check if a primary key (object ID) is in an array of object IDs.
Consider the code:
var OBJ_ARR = [
  ObjectId("5df9b50e7b7941c4273a5369"),
  ObjectId("5df9b50f7b7941c4273a5525"),
  ObjectId("5df9b50e7b7941c4273a515f"),
  ObjectId("5df9b50e7b7941c4273a51ba")
]

db.test.aggregate([
  {
    $match: {
      _id: { $in: OBJ_ARR }
    }
  }
])
The query tries to match each of the array's elements against the documents in the collection. Since there are four elements in OBJ_ARR, at most four documents will be returned, depending on the matches.
If you have N _ids in the lookup array, the operation tries to find a match for every element of the input array. The more values the array contains, the longer it takes; the number of ObjectIds matters for query performance. If the array contains a single element, it is treated as a plain equality match.
From the documentation, $in works like an $or of equality checks.
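For illustration, the $match above behaves like this $or of equality checks:

db.test.aggregate([
  {
    $match: {
      $or: [
        { _id: ObjectId("5df9b50e7b7941c4273a5369") },
        { _id: ObjectId("5df9b50f7b7941c4273a5525") },
        { _id: ObjectId("5df9b50e7b7941c4273a515f") },
        { _id: ObjectId("5df9b50e7b7941c4273a51ba") }
      ]
    }
  }
])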
... what's the most efficient way of aggregating a set of documents given a bunch of object IDs?
The _id field of a collection has a unique index by default. The query you are trying will use this index to match the documents. Running explain() on the query (with a small set of test data) confirms that there is an index scan (IXSCAN) on the match operation when $in is used in the aggregation query. That makes it a well-performing query (as it is) because of the index usage. But the aggregation's later stages, the size of the data set, the size of the input array and other factors will influence the overall performance and efficiency.
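You can check the plan yourself with the test collection and OBJ_ARR from above:

// "executionStats" also reports how many index keys and documents were examined
db.test.explain("executionStats").aggregate([
  { $match: { _id: { $in: OBJ_ARR } } }
])

The winning plan should show an IXSCAN stage on the _id index.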
Also, see:
Pipeline Operators and Indexes and Aggregation Pipeline Optimization.
Given the following Mongo document model:
Each document in the collection represents 1 hour of resource monitoring. In each document, there is a collection of summaries. There is also a count of the number of summaries as an integer, as it may make life easier.
Is there an efficient way to query the collection and return just the 1000 most recent summaries as an aggregated list?
Or an efficient way to query the collection and return the number of documents that contain the 1000 most recent summaries?
The number of summaries in each document will differ, but the number of summaries in one single document will never equal more than 1000.
EDIT: I should mention I am using Mongo with the .NET driver, so I have LINQ available to me.
Have you looked at Mongo aggregation? If you want to return the 1000 most recent summaries, you could go with an $unwind followed by a $replaceRoot operation. Here's the shell query I tried:
db.getCollection('test').aggregate([
  { $match: { /* your timestamp match query here */ } },
  { $sort: { "timestamp": -1 } },
  { $unwind: "$summaries" },
  { $limit: 1000 },
  { $replaceRoot: { newRoot: "$summaries" } }
])
The $match operation at the beginning of your aggregation pipeline is important, as indexes can only be used in the first stages of the pipeline. If you unwind your whole collection, performance might drop drastically.
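If the timestamp field is not indexed yet, a one-line sketch (the field name is the one used in the pipeline above):

// Lets the leading $match/$sort use the index instead of scanning the collection
db.getCollection('test').createIndex({ "timestamp": -1 })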
I want to have search results from SOLR ordered like this:
All the documents that have the same score will be ordered descending by date added.
So when I query Solr I will get n documents. In this result set there will be groups of documents with the same score, and I want each of these groups to be ordered descending by date added.
I discovered I can accomplish this using function queries, more exactly using the rord function (http://wiki.apache.org/solr/FunctionQuery#rord), but as stated in the documentation:
WARNING: as of Solr 1.4, ord() and rord() can cause excess memory use since they must use a FieldCache entry at the top level reader, while sorting and function queries now use entries at the segment level. Hence sorting or using a different function query, in addition to ord()/rord(), will double memory use.
it will cause excess memory use.
What other options do I have?
I was thinking of using recip(ms(NOW,startTime),1,1,0). Is this the best approach?
Is there any negative performance impact if I use recip and ms ?
You can use multiple SORT conditions:
Multiple sort orderings can be separated by a comma, i.e.: sort=<field name>+<direction>[,<field name>+<direction>]...
http://wiki.apache.org/solr/CommonQueryParameters
So, in your case it would be:
sort=score desc, date_added desc
Since your question says:
All the documents that have the same score will be ordered descending by date added.
the other answer you got is perfect.
Anyway, I'd suggest you make sure that you really want to sort by date only for documents with the same score. In my experience this has always been wrong: the Solr score is not absolute but relative to the other documents in the result set, and each document is different. Therefore I wouldn't sort by score and then by something else, because it's hard to predict when you'll have the same score for different documents.
I would personally sort only on score and use a function to boost recent documents. You can find a good example on the Solr wiki; the function used there is recip(ms(NOW,date_field),3.16e-11,1,1).
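For instance, with the dismax parser the boost can be passed as a bf parameter; date_added here stands in for whatever your actual date field is called:

defType=dismax&q=your query&bf=recip(ms(NOW,date_added),3.16e-11,1,1)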
If you're worried about performance, you can try index-time boosting, which should be faster than query-time boosting. Have a look here.
I have a solr index with the unique field as "id".
I have an ordered set of ids with which I would like to query Solr, but I want the results in the same order.
So, for example, if I have the ids id = [5,1,3,4], I want the results displayed by Solr in that same order.
I tried http://localhost:8983/solr/select/?q=id:(5 OR 1 OR 3 OR 4)&fl=id, but the results displayed are in ascending order.
Is there a way to query Solr and get results as I mentioned?
I think you can't.
The results appear in the order they are indexed, unless you specify a default sort field or an explicit sort field/order.
You can add another field to keep the initial sort order; you can then use sort=field asc to retrieve the data in the original order.
The simplest way is to query Solr and sort the results in your own code.
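A sketch of that client-side sort in JavaScript, assuming the parsed Solr response documents are in docs and ids is the ordered list from the question:

var ids = [5, 1, 3, 4];
// Order the returned docs by each id's position in the original list
docs.sort(function (a, b) {
  return ids.indexOf(a.id) - ids.indexOf(b.id);
});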