RethinkDB - Find documents with missing field - query-optimization

I'm trying to write the most optimal query to find all of the documents that do not have a specific field. Is there any better way to do this than the examples I have listed below?
// Get the ids of all documents missing "location"
r.db("mydb").table("mytable").filter({location: null},{default: true}).pluck("id")
// Get a count of all documents missing "location"
r.db("mydb").table("mytable").filter({location: null},{default: true}).count()
Right now, these queries take about 300-400ms on a table with ~40k documents, which seems rather slow. Furthermore, in this specific case, the "location" attribute contains latitude/longitude and has a geospatial index.
Is there any way to speed this up? Thanks!

A naive suggestion
You could use the hasFields method along with the not method to filter out unwanted documents:
r.db("mydb").table("mytable")
.filter(function (row) {
return row.hasFields({ location: true }).not()
})
This might or might not be faster, but it's worth trying.
Using a secondary index
Ideally, you'd want a way to make location a secondary index and then use getAll or between, since queries that use indexes are generally much faster. One way to work around the missing field is to give every row in the table a location value of false when it doesn't have a location. Then, you would create a secondary index for location. Finally, you can query the table with getAll as much as you want!
Adding a location property to all documents without a location
For that, you'd need to first insert location: false into all rows without a location. You could do this as follows:
r.db("mydb").table("mytable")
.filter(function (row) {
return row.hasFields({ location: true }).not()
})
.update({
location: false
})
After this, you would need to find a way to insert location: false every time you add a document without a location.
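One simple way to do that is to default the field in the application before every insert. Here's a minimal sketch in plain JavaScript (newDoc is a hypothetical name for whatever document your app is about to insert):
// Default location to false so documents without one still show up in the index.
var doc = Object.assign({ location: false }, newDoc);
r.db("mydb").table("mytable").insert(doc)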
Create secondary index for the table
Now that all documents have a location field, we can create a secondary index for location.
r.db("mydb").table("mytable")
.indexCreate('location')
Keep in mind that you only have to add { location: false } and create the index once.
Use getAll
Now we can just use getAll to query documents using the location index.
r.db("mydb").table("mytable")
.getAll(false, { index: 'location' })
This will probably be faster than the query above.
Using a secondary index (function)
You can also create a secondary index as a function. Basically, you create a function and then query the results of that function using getAll. This is probably easier and more straightforward than what I proposed before.
Create the index
Here it is:
r.db("mydb").table("mytable")
.indexCreate('has_location',
function(x) { return x.hasFields('location');
})
Use getAll
Here it is:
r.db("mydb").table("mytable")
.getAll(false, { index: 'has_location' })
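If you also want the count from the original question, the same index works; a quick sketch:
r.db("mydb").table("mytable")
.getAll(false, { index: 'has_location' })
.count()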

Related

Querying Firestore without Primary Key

I'd like my users to be able to update the slug on the URL, like so:
url.co/username/projectname
I could use the primary key, but unfortunately Firestore does not allow any modification of the assigned UID once set, so I created a unique slug field.
Example of structure:
projects: {
  P10syfRWpT32fsceMKEm6X332Yt2: {
    slug: "majestic-slug",
    ...
  },
  K41syfeMKEmpT72fcseMlEm6X337: {
    slug: "beautiful-slug",
    ...
  },
}
A way to modify the slug would be to delete the document and copy its data into a new one, but doing this becomes complicated as I have subcollections attached to the document.
I'm aware I can query by document key like so:
var doc = db.collection("projects");
var query = doc.where("slug", "==", "beautiful-slug").limit(1).get();
Here comes the questions.
Wouldn't this be highly impractical? If I have more than 1,000 docs in my database, wouldn't each visit to a project (url.co/username/projectname) cost 1,000+ reads, since the query has to look through all the documents? If so, what would be the correct way?
As stated in this answer on StackOverflow: https://stackoverflow.com/a/49725001/7846567, only the document returned by a query is counted as a read operation.
Now for your special case:
doc.where("slug", "==", "beautiful-slug").limit(1).get();
The Firestore server may still do work internally to locate the matching document, but by using limit(1) you will only receive a single document, so only a single read operation is counted against your limits.
Using the where() function is the correct and recommended approach to your problem.
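For completeness, here's a minimal sketch of resolving a project by its slug with that query, assuming the namespaced (v8-style) Firebase web SDK used in the question:
db.collection("projects")
  .where("slug", "==", "beautiful-slug")
  .limit(1)
  .get()
  .then(function (snapshot) {
    if (snapshot.empty) {
      // No project with that slug exists.
      return;
    }
    var project = snapshot.docs[0];
    console.log(project.id, project.data()); // document key and its fields
  });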

Is there a built-in function to get all unique values in an array field, across all records?

My schema looks like this:
var ArticleSchema = new Schema({
  ...
  category: [{
    type: String,
    default: ['general']
  }],
  ...
});
I want to parse through all records and find all unique values for this field across all records. This will be sent to the front-end via being called by service for look-ahead search on tagging articles.
We could iterate through every single record, go through each array value, and do a check, but this would be O(n²).
Is there an existing function or another way that has better performance?
You can use the distinct function to get the unique values across all category array fields of all documents:
Article.distinct('category', function(err, categories) {
// categories is an array of the unique category values
});
Put an index on category for best performance.
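For example, the index can be declared on the schema itself; a small sketch assuming the ArticleSchema and Article model from the question:
// Index category so distinct() can be served from the index.
ArticleSchema.index({ category: 1 });

// distinct() also returns a thenable query if you omit the callback.
Article.distinct('category').then(function (categories) {
  console.log(categories); // e.g. ['general', 'tech', ...]
});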

Upsert an array of items with MongoDB

I've got an array of objects I'd like to insert into a MongoDB collection, however I'd like to check if each item already exists in the collection first. If it doesn't exist then insert it, if it does exist then don't insert it.
After some Googling it looks like I'm best using the update method with the upsert property set to true.
This seems to work fine if I pass it a single item, as well as if I iterate over my array and insert the items one at a time, however I'm not sure whether this is the correct way to do this, or if it's the most efficient. Here's what I'm doing at the moment:
var data = [{}, {}, ... {}];
for (var i = 0; i < data.length; i++) {
  var item = data[i];
  collection.update(
    {userId: item.id},
    {$setOnInsert: item},
    {upsert: true},
    function(err, result) {
      if (err) {
        // Handle errors here
      }
    });
}
Is iterating over my array and inserting them one-by-one the most efficient way to do this, or is there a better way?
I can see two options here, depending on your exact needs:
If you need to insert or update use an upsert query. In that case, if the object is already present you will update it -- assuming your document will have some content, this means that you will possibly overwrite the previous value of some fields. As you noticed, using $setOnInsert will mitigate that issue.
If you need to insert or ignore, add a unique index on the related fields and let the insert fail on duplicate key. As you have several documents to insert, you might send them all in an array using an unordered insert (i.e.: setting the ordered option to false) so MongoDB will continue to insert the remaining documents in case of error with one of them (see the sketch below).
As for performance, with the second option MongoDB will use the index to look up a possible duplicate, so chances are good that it will perform better.
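A minimal sketch of the second option, assuming the Node.js MongoDB driver and that userId is the field that must stay unique (adjust to your schema):
// One-time setup: a unique index so duplicate userIds are rejected by the server.
collection.createIndex({ userId: 1 }, { unique: true }, function (err) {
  if (err) { /* handle index creation errors */ }
});

// Unordered insert: MongoDB keeps inserting the remaining documents even when
// some fail with a duplicate key error (code 11000).
collection.insertMany(data, { ordered: false }, function (err, result) {
  if (err && err.code !== 11000) {
    // Something other than a duplicate key went wrong.
  }
});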

Count both sides of a partition in ReQL

I have a dataset which has a boolean attribute, like:
{
name: 'Steven',
isQualified: true
}
And I want to count both sides of a partition. That is, how many documents are qualified or not. What's the best way to do this with a single rethinkdb query?
Here's an example with underscore.js, but it relies on querying all the documents and processing them in my app:
results = _.partition(data, 'isQualified').map(_.iteratee('length'))
At the moment I have this, but it feels inefficient and I'm assuming/hoping there's a better way to do it.
r.expr({
'qualified': r.table('Candidate').filter({isQualified: true}).count(),
'unqualified': r.table('Candidate').filter({isQualified: false}).count()
})
How could I improve this and make it more DRY?
Create an index on isQualified
r.table('Candidate').indexCreate("isQualified");
Then use it to count
r.expr({
  'qualified': r.table('Candidate').getAll(true, {index: "isQualified"}).count(),
  'unqualified': r.table('Candidate').getAll(false, {index: "isQualified"}).count()
})
This is way faster as the server doesn't have to iterate through all the documents, but just has to traverse a B-tree.
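One practical note: a freshly created index can't be queried until it has finished building, so you may want to wait for it before running the counts:
// Block until the isQualified index is ready.
r.table('Candidate').indexWait("isQualified");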

MongoDB: how to auto-increment a subdocument field?

The following document records a conversation between Milhouse and Bart. I would like to insert a new message with the right num (the next in the example would be 3) in a unique operation. Is that possible ?
{ user_a:"Bart",
user_b:"Milhouse",
conversation:{
last_msg:2,
messages:[
{ from:"Bart",
msg:"Hello"
num:1
},
{ from:"Milhouse",
msg:"Wanna go out ?"
num:2
}
]
}
}
In MongoDB, arrays keep their order, so by adding a num attribute, you're only creating more data for something that you could accomplish without the additional field. Just use the position in the array to accomplish the same thing. Grabbing the Xth message in an array will be faster than searching for { num: X }.
To keep the order, I don't think there's an easy way to add the num attribute besides doing a find() on conversation.last_msg before you insert the new subdocument and then incrementing last_msg.
Depending on what you need to keep the ordering for, you might consider including a time stamp in your subdocument, which is commonly kept in conversation records anyway and may provide other useful information.
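A small sketch of the timestamp approach, assuming the Node.js MongoDB driver and a conversations collection shaped like the document above:
// Append a new message with a timestamp instead of a num counter.
db.collection('conversations').update(
  { user_a: 'Bart', user_b: 'Milhouse' },
  { $push: { 'conversation.messages': { from: 'Milhouse', msg: 'Sure!', sentAt: new Date() } } }
);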
Also, I haven't used it, but there's a Mongoose plugin that may or may not be able to do what you want: https://npmjs.org/package/mongoose-auto-increment
You can't create an auto-increment field, but you can use functions to generate and manage a sequence:
http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/ 
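For reference, the pattern from that tutorial keeps a separate counters collection and increments it atomically; a hedged sketch in mongo shell syntax, where counters and msgNum are illustrative names:
// Atomically increment and return the next value in the sequence.
function getNextSequence(name) {
  var ret = db.counters.findAndModify({
    query: { _id: name },
    update: { $inc: { seq: 1 } },
    new: true,
    upsert: true
  });
  return ret.seq;
}

// Use it when pushing a message that needs a num value.
db.conversations.update(
  { user_a: "Bart", user_b: "Milhouse" },
  { $push: { "conversation.messages": { from: "Bart", msg: "Ok!", num: getNextSequence("msgNum") } } }
);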
I would recommend using a timestamp rather than a numerical value. By using a timestamp, you can keep the ordering of the subdocument and make other use of it.
