MongoDB OR with Regex not using compound index - database

Really at wits end here; I'm using the following query to search a collection with about 300K documents
query = { $or: [
{description: { $regex: ".*app.*"}},
{username: { $regex: ".*app.*"}},
]};
and simply putting that in a .find() function. It is tremendously slow. Like every single query takes at least 20 seconds.
I have tried individual indices on both username and description, and now have a compound index on {description: 1, username: 1}, but it does not seem to make a difference at all. If I check the MongoDB live metrics, it does not use the index at all.
Any pointers would be greatly appreciated.

Regex using partial string matching never use an index, because, as the name implies, with a partial string match it has no idea where to start looking for the match, and has to go over all strings.
As a solution, you can hook your database up to something like Lucene, which specializes in such queries.

Related

MongoDB Aggregate List of Object IDs

I'm trying to create an interface which gives our team the ability to build relatively simple queries to segment customers and aggregate data about those customers. In addition to this, I want to build a "List" feature" that allows users to upload a CSV that can be separately generated for more complex one-off queries. In this scenario, I'm trying to figure out the best way to query the database.
My main question is how the $in operator works. The example below has an aggregate which tries to check if a primary key (object ID) is in an array of object IDs.
How well does the $in operator perform? I'm mostly wondering how this query will run – does it loop over the array and look for documents that match each value in the array for N lookups, or will it loop over all of the documents in the database and for each one, loop over the array and check if the field matches?
db.getCollection('customers').aggregate([
{
$match: {
'_id': { $in: ['ObjectId("idstring1")','ObjectId("idstring2")'...'ObjectId("idstring5000")']}
}
}
])
If that's not how it works, what's the most efficient way of aggregating a set of documents given a bunch of object IDs? Should I just do the N lookups manually and pipe the documents into the aggregation?
My main question is how the $in operator works. The example below has
an aggregate which tries to check if a primary key (object ID) is in
an array of object IDs.
Consider the code:
var OBJ_ARR = [ ObjectId("5df9b50e7b7941c4273a5369"), ObjectId("5df9b50f7b7941c4273a5525"), ObjectId("5df9b50e7b7941c4273a515f"), ObjectId("5df9b50e7b7941c4273a51ba") ]
db.test.aggregate( [
{
$match: {
_id: { $in: OBJ_ARR }
}
}
])
The query tries to match each of the array elements with the documents in the collection. Since, there are four elements in the OBJ_ARR there might be four documents returned or lesser depending upon the matches.
If you have N _id's in the lookup array, the operation will try to find match for all elements in the input array. The more number of values you have in the array, the more time it takes; the number of ObjectIds matter in query performance. In case you have a single element in the array, it is considered as one equal match.
From the documentation - the $in works like an $or operator with equality checks.
... what's the most efficient way of aggregating a set of documents
given a bunch of object IDs?
The _id field of the collection has a unique index, by default. The query, you are trying will use this index to match the documents. Running an explain() on a query (with a small set of test data) confirms that there is an index scan (IXSCAN) on the match operation with $in used with the aggregation query. That is a better performing query (as it is) becuse of the index usage. But, the aggregation query's later/following stages, size of the data set, the input array size and other factors will influence the overall performance and efficiency.
Also, see:
Pipeline Operators and Indexes and Aggregation Pipeline Optimization.

MongoDB sorting order and unique fields

I'm using a mongoose schema like this one:
{
numero: {
type: Number,
required: true,
unique: true
},
capacidad: {
type: Number,
required: true
}
}
When I retrieve the collection's documents (p.e. using Model.find({})), I get the documents sorted by _id.
My questions are:
MongoDB creates an index for handling the unique: true requirement but it does not use it as default sorting mechanism?
If I do Model.find({}).sort("numero") does this use the index for handling uniqueness or must build another for my query?
If I define my own index (schema.index({ numero: 1 }), am I duplicating work?
Summarizing, what are the best practices for maintaining a collection sorted for querying?
First lets make a note that you are talking more specifically on how Mongoose does things as a mediator with MongoDB. What I mean by this is that in mongoose schema defining something like this:
numero: {
type: Number,
required: true,
unique: true
}
Actually means that (by using unique: true) you ARE creating a unique index on the field numero and the the index: true part is optional as per the documentation.
So you do not need to actually create another index.
That also should answer your question about Model.find({}).sort("numero") using the index as well.
As far as best practices go you should review the way you are querying your data and very importantly what type of data you have in order to figure out what kind of index you need. For example if you have lat/lng data you should probably be using the Geospatial Index etc.
You also do not want to go crazy on the indexing since that brings other issues as well.
Also very important tool to review what MongoDB is doing is the explain operator which gives you information on the query plan.
You should use it often to analyze and optimize your queries and figure out where your bottlenecks are.
Also you can view your collection statistics via
db.getCollection('your-collection-name').stats()
That would give you good information on the current indexes on the collection their size etc.
Hope this helps.

How to filter an array in Azure Search

I have following Data in my Index,
{
"name" : "The 100",
"lists" : [
"2c8540ee-85df-4f1a-b35f-00124e1d3c4a;Bellamy",
"2c8540ee-85df-4f1a-b35f-00155c40f11c;Pike",
"2c8540ee-85df-4f1a-b35f-00155c02e581;Clark"
]
}
I have to get all the documents where the lists has Pike in it.
Though a full search query works with Any I could't get the contains work.
$filter=lists/any(t: t eq '2c8540ee-85df-4f1a-b35f-00155c40f11c;Pike')
However i am not sure how to search only with Pike.
$filter=lists/any(t: t eq 'Pike')
I guess the eq looks for a full text search, is there any way with the given data structure I should make this query work.
Currently the field lists has no searchable property only the filterable property.
The eq operator looks for exact, case-sensitive matches. That's why it doesn't match 'Pike'. You need to structure your index such that terms like 'Pike' can be easily found. You can accomplish this in one of two ways:
Separate the GUIDs from the names when you index documents. So instead of indexing "2c8540ee-85df-4f1a-b35f-00155c40f11c;Pike" as a single string, you could index them as separate strings in the same array, or perhaps in two different collection fields (one for GUIDs and one for names) if you need to correlate them by position.
If the field is searchable, you can use the new search.ismatch function in your filter. Assuming the field is using the standard analyzer, full-text search will word-break on the semicolons, so you should be able to search just for "Pike" and get a match. The syntax would look like this: $filter=search.ismatch('Pike', 'lists') (If looking for "Pike" is all your filter does, you can just use the search and searchFields parameters to the Search API instead of $filter.) If the "lists" field is not already searchable, you will need to either add a new field and re-index the "lists" values, or re-create your index from scratch with the new field definition.
Update
There is a new approach to solve this type of problem that's available in API versions 2019-05-06 and above. You can now use complex types to represent structured data, including in collections. For the original example, you could structure the data like this:
{
"name" : "The 100",
"lists" : [
{ "id": "2c8540ee-85df-4f1a-b35f-00124e1d3c4a", "name": "Bellamy" },
{ "id": "2c8540ee-85df-4f1a-b35f-00155c40f11c", "name": "Pike" },
{ "id": "2c8540ee-85df-4f1a-b35f-00155c02e581", "name": "Clark" }
]
}
And then directly query for the name sub-field like this:
$filter=lists/any(l: l/name eq 'Pike')
The documentation for complex types is here.

Cloiudant using $nin There is no index available for this selector

I created a JSON index in cloudant on _id like so:
{
"index": {
"fields": [ "_id"]
},
"ddoc": "mydesigndoc",
"type": "json",
"name": "myindex"
}
First off, unless I specified the index name, somehow cloudant could not differentiate between the index I created and the default text based index for _id (if that is truly the case, then this is a bug I believe)
I ran the following query against the _find endpoint of my db:
{
"selector": {
"_id": {
"$nin":["v1","v2"]
}
},
"fields":["_id", "field1", "field2"],
"use_index": "mydesigndoc/myindex"
}
The result was this error:
{"error":"no_usable_index","reason":"There is no index available for this selector."}
if I change "$nin":["v1","v2"] to "$eq":"v1" then it works fine, but that is not the query I am after.
So in order to get what I want, I had to this to my selector "_id": {"$gt":null}, which now looks like:
{
"selector": {
"_id": {
"$nin":["v1","v2"],
"$gt":null
}
},
"fields":["_id", "field1", "field2"],
"use_index": "mydesigndoc/myindex"
}
Why is this behavior? This seems to be only happening if I use the _id field in the selector.
What are the ramifications of adding "_id": {"$gt":null} to my selector? Is this going to scan the entire table rather than use the index?
I would appreciate any help, thank you
Cloudant Query can use Cloudant's pre-existing primary index for selection and range querying without you having to create your own index in the _id field.
Unfortunately, the index doesn't really help when using the $nin operator - Cloudant would have to scan the entire database to check for documents which are not in your list - the index doesn't really get it any further forward.
By changing the operator to $eq you are playing to the strengths of the index which can be used to locate the record you need quickly and efficiently.
In short, the query you are attempting is inefficient. If your query was more complex e.g. the equivalent of WHERE colour='red' AND _id NOT IN ['a','b'] then a Cloudant index on colour could be used to reduce the data set to a reasonable level before doing the $nin operation on the remaining data.

Cloudant Search: match a whole phrase using a full text index

I want to be able to match a whole phrase using a full text index, but I can't seem to work out how to do it. The Lucene Query Parser syntax states that:
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
But when I specify the following selector, it returns all records with either "sign" or "design" in the name but I would expect it to return only those with "sign design".
POST https://foo.cloudant.com/remote/_find
{"selector":{"$text":"\"SIGN DESIGN\""}}
My index is defined as follows:
db.index({
name: 'subbies_text',
type: 'text',
index: {},
})
Alternatively, is it possible to do a substring match on a field in json index?
You are using the index API to create the index, correct?
Would you please try creating this design document?
{ "_id": '_design/library',
"indexes": {
"subbies_text": {
"analyzer": {
"name":'standard'
},
"index": "function(doc) { index('XXX', doc.YYY); }"
}
}
}
(However, change the "XXX" and "YYY" to your field name.
If you know how many maximum words to allow, you can make a searchable index with a map-reduce view. I think it is not ideal, but just for posterity:
You can emit() every consecutive pair of words that you see. So, for example, given the phrase "The quick brown fox" then you can emit ["the","quick"], ["quick","brown"], ["brown", "fox"]. I think this can be nice and simple, but it's really only appropriate for small amounts of data. The index will likely grow too large.
If you want to use cloudant search, you should create a search index first just like JasonSmith said. Then you can use this search index to do the specific queries.
Suppose you have a document which has a "name:SIGNDESIN" field.
1.If you want to query a whole phrase ,you can query like this:
curl https://<username:password>#<username>.cloudant.com/db/_design/<design_doc>/_search/<searchname>?q=name:SIGNDESIN | jq .
2.If you want to query a substring phrase, you can query like this:
curl https://<username:password>#<username>.cloudant.com/db/_design/<design_doc>/_search/<searchname>?q=name:SI* | jq .

Resources