I'm working on an API and need fast queries against my MongoDB. I have documents like:
{
  "_id": "123456",
  "values": [
    { "value": "A", "begin": 0, "end": 1 },
    { "value": "B", "begin": 1, "end": 2 },
    { "value": "C", "begin": 3, "end": 7 }
  ],
  "name": "test"
}
I have about 6k documents in my database, and I want to unwind and group the values so I can count how many distinct values I have. After the unwind stage I have about 32 million documents.
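For concreteness, a pipeline of that shape would look roughly like this (a minimal sketch; the collection name coll is a placeholder, since the question does not show the actual aggregation):
db.coll.aggregate([
  // Unwind the values array: ~6k documents become ~32 million
  { $unwind: "$values" },
  // Group on the inner value field to count each distinct value
  { $group: { _id: "$values.value", count: { $sum: 1 } } }
])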
I tried several variations of the aggregation, but it is slow every time; it always takes over 20 seconds.
I wanted to try indexes, but I can't create an index that makes it any faster. I tried it with the MongoDB Compass app, but I think I'm doing something wrong while creating the index.
Hope someone can help me.
Greetings
While faceting, Azure Search returns the count for each facet field by default. How do I also get other searchable fields for every facet?
For example, when I facet on area, I want something like this (description is a searchable field):
{
"area": [
{
"count": 1,
"description": "Acrylics",
"value": "ACR"
},
{
"count": 1,
"description": "Power",
"value": "POW"
}
]
}
Can someone please help with the extra parameters I need to send in the query?
Unfortunately there is no good way to do this, as there is no direct support for nested faceting in Azure Search (you can upvote it here). To achieve the result you want, you would need to store the data together as a composite value, as described by this workaround.
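To illustrate the composite-value workaround: index a single facetable string that combines the code and the description, and split it on the client. The field name areaFacet and the "|" separator below are assumptions for this sketch, not part of the Azure Search API:
{ "area": "ACR", "description": "Acrylics", "areaFacet": "ACR|Acrylics" }
GET /indexes/myindex/docs?search=*&facet=areaFacet
Each facet value returned (e.g. "ACR|Acrylics" with its count) then carries both pieces of information in one string.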
Sometimes when paging through Azure Search results, duplicate documents may appear. Here is an example of a paging request:
GET /indexes/myindex/docs?search=*&$top=15&$skip=15&$orderby=rating desc
Why is this possible? How can it happen? Are there any consistency guarantees when paging?
The results of paginated queries are not guaranteed to be stable if the underlying index is changing, or if you are relying on sorting by relevance score. Paging simply changes the value of $skip for each page, but each query is independent and operates on the current view of the data (i.e., there is no snapshotting or other consistency mechanism like you'd find in a general-purpose database).
Here is an example of how you might get duplicates. Assume an index with four documents:
{ "id": "1", "rating": 5 }
{ "id": "2", "rating": 3 }
{ "id": "3", "rating": 2 }
{ "id": "4", "rating": 1 }
Now assume you want to page through the results with a page size of two, ordered by rating. You’d execute this query to get the first page:
$top=2&$skip=0&$orderby=rating desc
And get these results:
{ "id": "1", "rating": 5 }
{ "id": "2", "rating": 3 }
Now you insert a fifth document into the index:
{ "id": "5", "rating": 4 }
Shortly thereafter, you execute a query to fetch the second page of results:
$top=2&$skip=2&$orderby=rating desc
And get these results:
{ "id": "2", "rating": 3 }
{ "id": "3", "rating": 2 }
Notice that you’ve fetched document 2 twice. This is because the new document 5 has a greater value for rating, so it sorts before document 2 and lands on the first page.
In situations where you're relying on document score (either you don't use $orderby, or you use $orderby=search.score()), paging can return duplicate results because each query might be handled by a different replica, and that replica may have different term and document frequency statistics, which can be enough to change the relative ordering of documents at page boundaries.
For these reasons, it’s important to think of Azure Search as a search engine (because it is), and not a general-purpose database.
I have a database with documents like these:
{_id: "1", module:["m1"]}
{_id: "2", module:["m1", "m2"]}
{_id: "3", module:["m3"]}
There is a search index created for these documents with the following index function:
function (doc) {
  // Index each entry of the module array as a stored, facetable value
  doc.module && doc.module.forEach &&
    doc.module.forEach(function (module) {
      index("module", module, { "store": true, "facet": true });
    });
}
The index uses the "keyword" analyzer on the module field.
The sample data is quite small (11 documents, 3 distinct module values).
I have two issues with queries that use the group_field=module parameter:
Not all groups are returned: I get 2 of the 3 groups I expect. It seems that if a document with ["m1", "m2"] is returned in the "m1" group, no "m2" group appears at all. When I use counts=["module"] I do get the complete list of distinct values.
I'd like to be able to get something like:
{
  "total_rows": 3,
  "groups": [
    { "by": "m1",
      "total_rows": 2,
      "rows": [ { "_id": "1", "module": "m1" },
                { "_id": "2", "module": "m1" } ]
    },
    { "by": "m2",
      "total_rows": 1,
      "rows": [ { "_id": "2", "module": "m2" } ]
    },
    ....
  ]
}
When using group_field, bookmark is not returned, so there is no way to get the next chunk of the data beyond 200 groups or 200 rows in a group.
Cloudant Search is based on Apache Lucene, and hence inherits its properties and limitations.
One limitation of grouping is that "the group field must be a single-valued indexed field" (Lucene Grouping), hence a document can only be in one group.
Another property of grouping is that topNGroups and maxDocsPerGroup need to be provided in advance; in Cloudant's case the maximums are 200 and 200 (they can be set lower using the group_limit and limit parameters).
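For reference, a grouped search of the kind described in the question looks roughly like this (the database, design document, and search index names are placeholders):
GET /mydb/_design/mydesigndoc/_search/mysearch?q=*:*&group_field=module&group_limit=10&limit=5
Here group_limit caps the number of groups returned and limit caps the number of rows per group, both up to the 200 maximum mentioned above.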
I created a JSON index in Cloudant on _id like so:
{
"index": {
"fields": [ "_id"]
},
"ddoc": "mydesigndoc",
"type": "json",
"name": "myindex"
}
First off, unless I specified the index name, Cloudant somehow could not differentiate between the index I created and the default text-based index on _id (if that is truly the case, then I believe this is a bug).
I ran the following query against the _find endpoint of my db:
{
"selector": {
"_id": {
"$nin":["v1","v2"]
}
},
"fields":["_id", "field1", "field2"],
"use_index": "mydesigndoc/myindex"
}
The result was this error:
{"error":"no_usable_index","reason":"There is no index available for this selector."}
If I change "$nin":["v1","v2"] to "$eq":"v1" then it works fine, but that is not the query I am after.
So in order to get what I want, I had to add "_id": {"$gt": null} to my selector, which now looks like:
{
"selector": {
"_id": {
"$nin":["v1","v2"],
"$gt":null
}
},
"fields":["_id", "field1", "field2"],
"use_index": "mydesigndoc/myindex"
}
Why does this happen? It seems to only occur when I use the _id field in the selector.
What are the ramifications of adding "_id": {"$gt": null} to my selector? Is this going to scan the entire database rather than use the index?
I would appreciate any help, thank you.
Cloudant Query can use Cloudant's pre-existing primary index for selection and range querying, without you having to create your own index on the _id field.
Unfortunately, the index doesn't really help when using the $nin operator: Cloudant would have to scan the entire database to check for documents which are not in your list, so the index doesn't get you any further forward.
By changing the operator to $eq you are playing to the strengths of the index, which can be used to locate the record you need quickly and efficiently.
In short, the query you are attempting is inefficient. If your query were more complex, e.g. the equivalent of WHERE colour='red' AND _id NOT IN ['a','b'], then a Cloudant index on colour could be used to reduce the data set to a reasonable level before applying the $nin condition to the remaining data.
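A minimal sketch of that combined query, assuming a JSON index has been created on a colour field (both the field and the index name colour-index are hypothetical):
{
  "selector": {
    "colour": { "$eq": "red" },
    "_id": { "$nin": ["a", "b"] }
  },
  "fields": ["_id", "colour"],
  "use_index": "mydesigndoc/colour-index"
}
The colour index narrows the candidates to the red documents first, and the $nin test is then applied to that much smaller set.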
I've got a MongoDB collection where a particular string may appear in any of a number of fields:
{"_id":1, "field1": "foo", "field2": "bar", "field3": "baz", "otherfield": "stuff"},
{"_id":2, "field1": "bar", "field2": "baz", "field3": "foo", "otherfield": "morestuff"},
{"_id":3, "field1": "baz", "field2": "foo", "field3": "bar", "otherfield": "you get the idea"}
I need to query so that I am returned all records where any one of a set of fields is equal to any value in an array ... basically, if I have ["foo","bar"] I need it to match if either of those strings is in field1 or field2 (but not any other field).
Obviously I can do this with a series of separate queries:
db.collection.find({"field1":{"$in":["foo","bar"]}})
db.collection.find({"field2":{"$in":["foo","bar"]}})
etc., and I've also built one very large $or query that concatenates them all together, but that seems far too inefficient (my actual collection needs to match any of 15 strings that can occur in any of 9 fields). I'm still new to NoSQL databases and am not sure of the best paradigm to use here. Any help is greatly appreciated.
Try:
db.collection.find(
  // Find documents matching any of these values
  { $or: [
      { "field1": { "$in": ["foo", "bar"] } },
      { "field2": { "$in": ["foo", "bar"] } }
  ] }
)
Also refer to this question.
Found another answer through poring over the documentation that seems to hit a sweet spot: text indexes.
db.collection.ensureIndex({ "field1": "text", "field2": "text" })
db.collection.runCommand("text", { search: "foo bar" })
When I run my actual query with many more strings and fields (and about 100,000 records), the $or/$in approach takes 620 milliseconds while the text index takes 131 milliseconds. The one drawback is that it returns a different type of document as a result; luckily the actual documents are a parameter of each result object.
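For reference, the text command shown above is the legacy (pre-2.6) interface and has since been removed; on current MongoDB versions the same approach would be written with createIndex and the $text query operator (standard MongoDB syntax, not part of the original answer):
db.collection.createIndex({ "field1": "text", "field2": "text" })
db.collection.find({ $text: { $search: "foo bar" } })
Unlike the text command, this returns ordinary documents from find(), so the drawback mentioned above does not apply.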
Thanks to those who took the time to make suggestions.
I would collect all the relevant fields in one field (e.g. collected) by adding their values as value:field pairs like
"foo:field1",
"bar:field2",
"baz:field3",
"stuff:otherfield",
"bar:field1",
"baz:field2"
...
into that field.
If you search for bar existing in any field, you can use a prefix regex (anchored through the separator so it only matches whole values):
db.collection.find( { collected: { $regex: "^bar:" } }, ... );
Your example in the question would look like:
db.collection.find( { collected: { $in: [ "foo:field1", "foo:field2", "bar:field1", "bar:field2" ] } }, ... );
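A minimal sketch of how such a collected field could be populated for existing documents (this update loop is an assumption for illustration, not part of the original answer; the field names come from the question):
db.collection.find().forEach(function (doc) {
  var collected = [];
  // Build one "value:fieldname" entry per field of interest
  ["field1", "field2", "field3", "otherfield"].forEach(function (f) {
    if (doc[f]) collected.push(doc[f] + ":" + f);
  });
  db.collection.update({ _id: doc._id }, { $set: { collected: collected } });
});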