MongoDB sorting order and unique fields - database

I'm using a mongoose schema like this one:
{
  numero: {
    type: Number,
    required: true,
    unique: true
  },
  capacidad: {
    type: Number,
    required: true
  }
}
When I retrieve the collection's documents (e.g. using Model.find({})), I get the documents sorted by _id.
My questions are:
MongoDB creates an index to enforce the unique: true requirement, but does it not use that index as the default sorting mechanism?
If I do Model.find({}).sort("numero"), does this use the index created for uniqueness, or must I build another one for my query?
If I define my own index (schema.index({ numero: 1 })), am I duplicating work?
In summary, what are the best practices for keeping a collection sorted for querying?

First, let's note that you are talking more specifically about how Mongoose does things as a mediator with MongoDB. What I mean is that defining something like this in a Mongoose schema:
numero: {
  type: Number,
  required: true,
  unique: true
}
actually means that (by using unique: true) you ARE creating a unique index on the field numero; the index: true part is optional, as per the documentation.
So you do not need to create another index.
That should also answer your question about Model.find({}).sort("numero"): it can use that same index.
As far as best practices go, you should review the way you are querying your data and, very importantly, what type of data you have, in order to figure out what kind of index you need. For example, if you have lat/lng data you should probably be using a geospatial index.
You also do not want to go overboard with indexing, since every extra index takes up storage and adds work to each write.
Another very important tool for reviewing what MongoDB is doing is the explain operator, which gives you information on the query plan.
You should use it often to analyze and optimize your queries and figure out where your bottlenecks are.
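To make the explain advice a bit more concrete, here is a minimal sketch of how you might read a query plan programmatically. It assumes the classic explain output shape (queryPlanner.winningPlan with nested inputStage / inputStages); newer server versions may nest the plan differently. The helper names collectStages and sortUsesIndex and the sample plan are made up for illustration, not part of any MongoDB or Mongoose API.

```javascript
// Collect the stage names of an explain() winning plan, depth first.
function collectStages(plan, stages = []) {
  if (!plan) return stages;
  stages.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, stages);
  for (const child of plan.inputStages || []) collectStages(child, stages);
  return stages;
}

// True when the plan reads an index and needs neither a collection scan
// nor a separate SORT stage.
function sortUsesIndex(explainOutput) {
  const stages = collectStages(explainOutput.queryPlanner.winningPlan);
  return (
    stages.includes("IXSCAN") &&
    !stages.includes("COLLSCAN") &&
    !stages.includes("SORT")
  );
}

// Hand-written sample, shaped like what find({}).sort({ numero: 1 })
// could report when the unique index on numero serves the sort.
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: {
        stage: "IXSCAN",
        indexName: "numero_1",
        keyPattern: { numero: 1 },
      },
    },
  },
};

console.log(collectStages(sampleExplain.queryPlanner.winningPlan));
console.log(sortUsesIndex(sampleExplain));
```

With Mongoose you would pass it the result of something like await Model.find({}).sort("numero").explain(). If you see COLLSCAN followed by SORT instead of IXSCAN, the sort is being done without the index.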
You can also view your collection statistics via
db.getCollection('your-collection-name').stats()
That gives you good information on the collection's current indexes, their sizes, etc.
Hope this helps.

Related

Is there any way to sort on a nested value in Azure Cognitive Search?

My use case is that I have a database of songs that are associated with dances that one can dance to that song. Users can vote on the danceability of a dance to a song, so there is a numeric vote tally for each song/dance combination. A core part of the functionality for the search is to be able to do an arbitrary search and sort the results by the popularity of a particular dance.
I am currently modeling this by creating a new top level field with a decorated name (e.g. DNC_Salsa or DNC_Waltz) for each dance. This works. But aside from being clumsy, I can't associate other information with a dance. In addition, I have to dynamically add the dance fields, so I have to use the generic SearchDocument type in the C# library rather than using a POCO type.
I'd much prefer to model this with the dance fields as an array of subdocuments where the subdocuments contain a dance name, a vote count and the other information I'd like to associate with a dance.
A simplified example record would look something like this:
{
  "title": "Baby, It's Cold Outside",
  "artist": "Seth MacFarlane",
  "tempo": 119.1,
  "dances": [
    { "name": "cha cha", "votes": 1 },
    { "name": "foxtrot", "votes": 4 }
  ]
}
I gave this a try and received:
{"error":{"code":"OperationNotAllowed","message":"The request is invalid.","details":[{"code":"CannotEnableFieldForSorting","message":"The field 'Votes' cannot be enabled for sorting because it is directly or indirectly contained in a collection, which makes it a multi-valued field. Sorting is not allowed on multi-valued fields. Parameters: definition"}]}}
It looks like Elasticsearch will do what I want:
Sort search results | Elasticsearch Guide [7.17] | Elastic
If I'm reading the Elasticsearch documentation correctly, you can basically say "I'd like to sort on the dances subdocument by first filtering for name == "cha cha" and then sorting on the vote field."
Is there anything like this in Azure Cognitive Search? Or even something more restrictive? I don't need to do arbitrary sorting on anything in the subdocument. I would be happy to only ever sort on the vote count (although I'd have to be able to do that for any dance name).
It's not clear to me what your records or data model look like. However, from the error message you provided, it's clear that you are trying to sort on a multi-valued property. That is logically impossible.
Imagine a property Color that can contain colors like 'Red' or 'Blue'. If you sort by Color, you would get your red values before the blues. If you instead had 'Colors' that can contain multiple values like both 'Red' and 'Blue', how would you sort it? You can't.
So, if you actually want to sort by a property, that property has to contain a single value.
That said, I have a feeling you are really asking about ranking/boosting, not sorting. Have a look at the examples with boosting and scoring profiles for different genres of music. I believe those examples could help you solve your use case.
https://learn.microsoft.com/en-us/azure/search/index-add-scoring-profiles#extended-example
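If you do need true sorting, each sortable property has to hold a single value, which is what the per-dance top-level fields from the question achieve. As a rough sketch (not an Azure SDK call), the nested shape could be kept as the source of truth and flattened just before indexing. The helper name flattenDances and the rule of replacing spaces with underscores are my own assumptions; only the DNC_ prefix comes from the question.

```javascript
// Hypothetical helper: copy the song and add one single-valued numeric
// field per dance (e.g. DNC_cha_cha) next to the nested "dances" array.
function flattenDances(song) {
  const flat = { ...song };
  for (const { name, votes } of song.dances || []) {
    // Assumption: anything that is not a letter or digit becomes "_".
    const field = "DNC_" + name.trim().replace(/[^A-Za-z0-9]+/g, "_");
    flat[field] = votes;
  }
  return flat;
}

const song = {
  title: "Baby, It's Cold Outside",
  artist: "Seth MacFarlane",
  tempo: 119.1,
  dances: [
    { name: "cha cha", votes: 1 },
    { name: "foxtrot", votes: 4 },
  ],
};

const flatSong = flattenDances(song);
console.log(flatSong.DNC_cha_cha, flatSong.DNC_foxtrot); // 1 4
```

The generated fields would still have to exist in the index definition and be marked sortable, so this does not remove the need to add fields dynamically; it only keeps the flattening in one place.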

MongoDB: $lookup on an indexed property vs $in on a non-indexed property

I am currently using MongoDB 3.5. I have two collections (users, items). Each user has a list of items:
//users
{
  _id: ObjectId('userObjId1'),
  itemArray: [
    { ObjectId('itemA'), specialId: '123-this-is-unique' },
    { ObjectId('itemB'), specialId: '456-this-is-unique' },
    { ObjectId('itemC'), specialId: '789-this-is-unique' },
  ]
}
and items
//items
{
  _id: ObjectId('itemA'),
  specialId: '123-this-is-unique',
  owner: ObjectId('userObjId1')
}
One of my operations involves querying for users, given an array of specialIds.
In my items collection, the items' specialIds are indexed.
Which one would be a better practice (and potentially better performance)?
A) Query the array of specialIds in the users' collection using the $in operator.
Pros: query stays within the same collection
Cons: the itemArray itself in each user is not indexed; from my understanding this may affect performance
B) Query in the items collection, project the owner and use it to run $lookup in the users collection
Pros: newer syntax; since specialId is already indexed in the items collection, it should perform better.
Cons: Needs to access two collections in one query
It depends on how many users you have, and how many items each user has.
Plan A will work well if you have a small number of users (dozens, perhaps hundreds), or if you can create an index on {"itemArray.specialId":1}.
Plan B will use the index on specialId to select the items, and then the _id index in the users collection during the lookup, which should perform fairly well.
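For illustration, the two plans could look roughly like the sketch below. The function names and the "user" output field are hypothetical; the collection and field names come from the question. The $group stage is my own addition so that a user who owns several matching items is returned only once.

```javascript
// Plan A: filter for find() on the users collection. It benefits from a
// multikey index on { "itemArray.specialId": 1 } if you can create one.
function planAFilter(specialIds) {
  return { "itemArray.specialId": { $in: specialIds } };
}

// Plan B: aggregation pipeline meant to run on the items collection.
function planBPipeline(specialIds) {
  return [
    // Can use the existing index on items.specialId.
    { $match: { specialId: { $in: specialIds } } },
    // Collapse to one row per owner before joining.
    { $group: { _id: "$owner" } },
    // Joins on users._id, which is always indexed.
    { $lookup: { from: "users", localField: "_id", foreignField: "_id", as: "user" } },
    { $unwind: "$user" },
    { $replaceRoot: { newRoot: "$user" } },
  ];
}

const ids = ["123-this-is-unique", "456-this-is-unique"];
console.log(JSON.stringify(planAFilter(ids)));
console.log(JSON.stringify(planBPipeline(ids), null, 2));
```

You would pass the first object to db.users.find(...) and the second to db.items.aggregate(...), then compare the two with explain on your real data.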

How do I use Solr "relatedness()" function to measure relatedness of two sets of documents?

I'd like to use the new Semantic Knowledge Graph capability in Solr to answer this question:
Given a set of documents from several different publishers, compute a "relatedness" metric between a given publisher and every other publisher, based on the text content of their respective documents.
I've watched several of Trey Grainger's talks regarding the Semantic Knowledge Graph functionality in Solr (this is a great recent one: https://www.youtube.com/watch?v=lLjICpFwbjQ). I have a reasonably good understanding of Solr faceted search functionality, and I have a working Solr engine with my dataset indexed and searchable. So far I've been unable to construct a facet query to do what I want.
Here is an example curl command which I thought might get me what I want:
curl -sS -X POST http://localhost:8983/solr/plans/query -d '
{
  params: {
    fore: "publisher_url:life.church",
    back: "*:*"
  },
  query: "*:*",
  limit: 0,
  facet: {
    pub_type: {
      type: terms,
      field: "publisher_url",
      limit: 5,
      sort: { "r1": "desc" },
      facet: {
        r1: "relatedness($fore,$back)"
      }
    }
  }
}'
Below are the resulting facets. Notice that after the first bucket (which matches the foreground query), the others all have exactly the same relatedness, which leads me to believe that the "relatedness" is based only on the publisher_url field rather than on the entire text content of the documents.
{
"facets":{
"count":2152,
"pub_type":{
"buckets":[{
"val":"life.church",
"count":141,
"r1":{
"relatedness":0.38905,
"foreground_popularity":0.06552,
"background_popularity":0.06552}},
{
"val":"10ofthose.com/us/products/1039/colossians",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}},
{
"val":"14DAYMARRIAGECHALLENGE.COM",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}},
{
"val":"23blast.com",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}},
{
"val":"2911worship.com",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}}]}}}
I'm not very familiar with the relatedness function, but as far as I understand, the relatedness score is generated from the similarity between your foreground and background set of documents for that facet bucket.
Since your foreground set only contains that single value (and none of the others), the first bucket is the only one that will generate a different similarity score when you're faceting on the same field that you use for selecting documents.
I'm not sure your use case is a good match for what you're trying to use, as relatedness indicates whether single terms in a field are related between the two sets you're using; it is not a similarity score computed across a different field for the two sets being compared.
You probably want something more structured than a text field to generate relatedness() scores, as that's usually more useful for finding single values that generate statistical insight into the structure of your query set.
The More Like This functionality might actually be a better match for getting the most similar other sites instead.
Again, this is based on my understanding of the functionality at the moment, so someone else can hopefully add more details and correct me as necessary.

MongoDB OR with Regex not using compound index

I'm really at my wits' end here; I'm using the following query to search a collection with about 300K documents:
query = { $or: [
  {description: { $regex: ".*app.*"}},
  {username: { $regex: ".*app.*"}},
]};
and simply putting that in a .find() function. It is tremendously slow. Like every single query takes at least 20 seconds.
I have tried individual indices on both username and description, and now have a compound index on {description: 1, username: 1}, but it does not seem to make a difference at all. If I check the MongoDB live metrics, it does not use the index at all.
Any pointers would be greatly appreciated.
A regex that does partial (unanchored) string matching cannot use an index efficiently because, as the name implies, with a partial string match the database has no idea where to start looking for the match and has to go over all the strings.
As a solution, you can hook your database up to something like Lucene, which specializes in such queries.
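If a "starts with" search is acceptable instead of "contains", a case-sensitive regex anchored with ^ is the one form that can use index bounds. Here is a minimal sketch of building such a filter; the helper names are made up, and note that this changes the semantics of the original query. Each $or branch would also need its own single-field index, since a compound index on {description: 1, username: 1} cannot serve the username branch.

```javascript
// Escape regex metacharacters so the search term is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build an $or filter with anchored, case-sensitive prefix patterns.
function prefixSearchFilter(term, fields) {
  const pattern = "^" + escapeRegex(term);
  return { $or: fields.map((field) => ({ [field]: { $regex: pattern } })) };
}

const filter = prefixSearchFilter("app", ["description", "username"]);
console.log(JSON.stringify(filter));
// {"$or":[{"description":{"$regex":"^app"}},{"username":{"$regex":"^app"}}]}
```

Run the resulting query through explain to confirm that both branches show an index scan before relying on it.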

How to use indexes in ArangoDB graph search?

I'm evaluating ArangoDB for my application.
I have a data model like a file system, with an Items document collection and an ItemsParents edge collection holding the parent-child relations between Items.
Now I would like to find all children of a specific item that have a specific attribute.
Ex: all children of A with property Properties.Age.Value = 20
So I created a hash index on Items.Properties.Age.Value and wrote this AQL query:
FOR item
  IN GRAPH_NEIGHBORS('ItemsGraph', 'Items/A',
    { direction : 'outbound',
      includeData: true,
      neighborExamples : { 'Properties.Age.Value': 20 }
    })
  RETURN { Id: item._key, Name: item.Name }
The above query works well, but no index is used, so it performs a full scan of the Items collection to test the Properties.Age.Value filter.
How should I design the query so that it performs efficiently using an index and avoids a collection scan?
Thanks
Currently ArangoDB can only utilize edge indices in graph operations.
Not using GRAPH_NEIGHBORS may allow the index to be used, but then you would have to filter for neighbors yourself.
Vertex-centric indices, which would offer that kind of index support, may arrive in one of the next two ArangoDB releases.
[edit]
This has since become possible in newer ArangoDB releases.
GRAPH_NEIGHBORS was integrated into the traversal engine. You would now create a combined index on Age and _from. You should use db._explain() to inspect the usage of indices.

Resources