Upsert an array of items with MongoDB

I've got an array of objects that I'd like to insert into a MongoDB collection, but I'd like to check whether each item already exists in the collection first. If it doesn't exist, insert it; if it does, skip it.
After some Googling it looks like I'm best off using the update method with the upsert option set to true.
This seems to work fine if I pass it a single item, and also if I iterate over my array and insert the items one at a time, but I'm not sure whether this is the correct way to do this, or whether it's the most efficient. Here's what I'm doing at the moment:
var data = [{}, {}, ... {}];

for (var i = 0; i < data.length; i++) {
  var item = data[i];
  collection.update(
    {userId: item.id},
    {$setOnInsert: item},
    {upsert: true},
    function(err, result) {
      if (err) {
        // Handle errors here
      }
    });
}
Is iterating over my array and inserting them one-by-one the most efficient way to do this, or is there a better way?

I can see two options here, depending on your exact needs:
If you need to insert or update, use an upsert query. In that case, if the object is already present you will update it -- assuming your document has some content, this means you may overwrite the previous value of some fields. As you noticed, using $setOnInsert mitigates that issue.
If you need to insert or ignore, add a unique index on the relevant fields and let the insert fail on a duplicate key. As you have several documents to insert, you might send them all in an array using an unordered insert (i.e. setting the ordered option to false) so MongoDB will continue to insert the remaining documents if one of them fails; see the sketch below.
As for performance, with the second option MongoDB will use the index to look up a possible duplicate, so chances are good it will perform better.
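A minimal sketch of the second option, assuming the Node.js driver used in the question and that userId is the field that has to be unique (the index creation and error handling here are illustrative):
// One-time setup (assumption): a unique index on the field that identifies an item
collection.createIndex({userId: 1}, {unique: true}, function(err) {
  if (err) {
    // Handle index creation errors here
  }

  // Unordered insert: documents that hit a duplicate key fail individually,
  // the remaining documents are still inserted
  collection.insertMany(data, {ordered: false}, function(err, result) {
    if (err) {
      // Duplicate-key failures show up here (e.g. in err.writeErrors) and can usually be ignored
    }
  });
});
With the unique index in place the server rejects each duplicate on its own, so there is no need to run a query per item.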

Related

How to get all vector ids from Milvus2.0?

I used to use Milvus 1.0, where I could get all IDs by using the get_collection_stats and list_id_in_segment APIs.
These days I am trying Milvus 2.0. I also want to get all IDs from Milvus 2.0, but I can't find any way to do it.
Milvus v2.0.x supports queries using boolean expressions.
This can be used to return IDs by checking whether the primary key field is greater than or equal to zero.
Let's assume you are using this schema for your collection, referencing https://github.com/milvus-io/pymilvus/blob/master/examples/hello_milvus.py as of 3/8/2022:
fields = [
FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
FieldSchema(name="random", dtype=DataType.DOUBLE),
FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim)
]
schema = CollectionSchema(fields, "hello_milvus is the simplest demo to introduce the APIs")
hello_milvus = Collection("hello_milvus", schema, consistency_level="Strong")
Remember to insert something into your collection first... see the pymilvus example.
Here you want to query out all IDs (pk).
You cannot currently list IDs for a specific segment, but this will return all IDs in a collection:
res = hello_milvus.query(
    expr="pk >= 0",
    output_fields=["pk", "embeddings"]
)

for x in res:
    print(x["pk"], x["embeddings"])
I think this is the only way to do it now, since they removed list_id_in_segment.

How to always filter on a column by default in Knex / Postgres?

When we're deleting data, often instead of deleting a DB row we instead set a date_deleted column (for non-privacy-sensitive data) so that the data is accessible later for auditing if necessary.
How can we use Objection, Knex, or Postgres to always pre-filter on this column (if it exists) and otherwise return all rows? (We only ever want to look at this column manually, not through code.)
It looks like postProcessResponse will work fine in Knex - we just filter the returned rows, checking for date_deleted. But this would of course be more efficient if we could find a way to always filter before the query fires, rather than after getting the results.
Using postProcessResponse will give you various problems with paging etc.
You could use the start event to modify each query:
knex.on('start', function(builder) {
  builder.whereNull('date_deleted');
});

knex.select('*')
  .from('users')
  .then(function(rows) {
    // Only contains rows where date_deleted is null
  });
But that is also quite error prone, for example if you use .orWhere in your query, or with any query that is not a plain select...
The feature you are looking for is not really convenient to implement with Knex. With objection.js, for example, there are many more options for how to do it.
For Knex I would probably just extend the query builder with a special function, which does something like this (since knex 0.19.1):
const Knex = require('knex');

Knex.QueryBuilder.extend('selectWithoutDeleted', function(tableName) {
  return this
    .with('tableWithoutDeleted',
      knex(tableName).whereNull('date_deleted')
    )
    .from('tableWithoutDeleted');
});

const res = await knex.selectWithoutDeleted('table')
  .where('col1', 'foo')
  .orWhere('col2', 'bar');
That should work in theory... the CTE first limits the result set to exclude deleted rows, and the rest of the where clauses are applied to that limited result set.

OrientDB - find "orphaned" binary records

I have some images stored in the default cluster of my OrientDB database. I stored them by implementing the code given in the documentation for the case of multiple ORecordBytes (for large content): http://orientdb.com/docs/2.1/Binary-Data.html
So I have two types of object in my default cluster: binary data records, and ODocuments whose 'data' field references the various binary data records.
Some of the ODocument records' RIDs are used in some other classes, but the other records are orphaned and I would like to be able to retrieve them.
My idea was to use
select from cluster:default where #rid not in (select myField from MyClass)
But the problem is that I also retrieve the binary data records, and I just want the records with the 'data' field.
In addition, I'd prefer a prettier query, because I don't think the "not in" clause is really something that should be encouraged. Is there something like a JOIN which returns records that are not joined to anything?
Can you help me please?
To resolve my problem, I did it like this. However, I don't know if it is the right (most optimized) way to do it:
I used the following SQL query:
SELECT rid FROM (FIND REFERENCES (SELECT FROM CLUSTER:default)) WHERE referredBy = []
In Java, I execute it using OCommandSQL together with OCommandRequest and get back an OrientDynaElementIterable. I just iterate over it to retrieve an OrientVertex, contained in another OrientVertex, from which I retrieve the RID of the orphan.
Now, here is some code in case it helps someone, assuming that you have an OrientGraphNoTx or an OrientGraph in the 'graph' variable :)
String cmd = "SELECT rid FROM (FIND REFERENCES (SELECT FROM CLUSTER:default)) WHERE referredBy = []";
List<String> orphanedRid = new ArrayList<String>();
OCommandRequest request = graph.command(new OCommandSQL(cmd));
OrientDynaElementIterable objects = request.execute();
Iterator<Object> iterator = objects.iterator();
while (iterator.hasNext()) {
OrientVertex obj = (OrientVertex) iterator.next();
OrientVertex orphan = obj.getProperty("rid");
orphanedRid.add(orphan.getIdentity().toString());
}

RethinkDB - Find documents with missing field

I'm trying to write the most efficient query to find all of the documents that do not have a specific field. Is there any better way to do this than the examples I have listed below?
// Get the ids of all documents missing "location"
r.db("mydb").table("mytable").filter({location: null},{default: true}).pluck("id")
// Get a count of all documents missing "location"
r.db("mydb").table("mytable").filter({location: null},{default: true}).count()
Right now, these queries take about 300-400ms on a table with ~40k documents, which seems rather slow. Furthermore, in this specific case, the "location" attribute contains latitude/longitude and has a geospatial index.
Is there a faster way to accomplish this? Thanks!
A naive suggestion
You could use the hasFields method along with the not method to filter out unwanted documents:
r.db("mydb").table("mytable")
.filter(function (row) {
return row.hasFields({ location: true }).not()
})
This might or might not be faster, but it's worth trying.
Using a secondary index
Ideally, you'd want a way to make location a secondary index and then use getAll or between, since queries using indexes are always faster. A way you could work around that is to give every row in your table that has no location a false value for it. Then you would create a secondary index on location. Finally, you could query the table using getAll as much as you want!
Adding a location property to all documents without a location
For that, you'd need to first insert location: false into all rows without a location. You could do this as follows:
r.db("mydb").table("mytable")
.filter(function (row) {
return row.hasFields({ location: true }).not()
})
.update({
location: false
})
After this, you would need to find a way to insert location: false every time you add a document without a location.
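One way to handle that from application code is to default the field before inserting. A minimal sketch, assuming the JavaScript driver and a hypothetical newDoc object (written, like the other snippets here, without the .run() call):
// newDoc is the document about to be inserted (hypothetical)
if (newDoc.location === undefined) {
  newDoc.location = false;
}

r.db("mydb").table("mytable")
  .insert(newDoc)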
Create secondary index for the table
Now that all documents have a location field, we can create a secondary index for location.
r.db("mydb").table("mytable")
.indexCreate('location')
Keep in mind that you only have to add { location: false } and create the index once.
Use getAll
Now we can just use getAll to query documents using the location index.
r.db("mydb").table("mytable")
.getAll(false, { index: 'location' })
This will probably be faster than the query above.
Using a secondary index (function)
You can also create a secondary index as a function. Basically, you create a function and then query the results of that function using getAll. This is probably easier and more straightforward than what I proposed before.
Create the index
Here it is:
r.db("mydb").table("mytable")
.indexCreate('has_location',
function(x) { return x.hasFields('location');
})
Use getAll.
Here it is:
r.db("mydb").table("mytable")
.getAll(false, { index: 'has_location' })
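Since the original question also asks for a count of documents missing location, the same index query can be counted directly (a sketch in the same style):
r.db("mydb").table("mytable")
  .getAll(false, { index: 'has_location' })
  .count()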

Mongodb: how to auto-increment a subdocument field?

The following document records a conversation between Milhouse and Bart. I would like to insert a new message with the right num (the next one in the example would be 3) in a single operation. Is that possible?
{ user_a:"Bart",
user_b:"Milhouse",
conversation:{
last_msg:2,
messages:[
{ from:"Bart",
msg:"Hello"
num:1
},
{ from:"Milhouse",
msg:"Wanna go out ?"
num:2
}
]
}
}
In MongoDB, arrays keep their order, so by adding a num attribute you're only creating more data for something you could accomplish without the additional field. Just use the position in the array to accomplish the same thing. Grabbing the Xth message in an array will be faster than searching for { num: X }.
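For illustration, here is a sketch of grabbing a message by its array position with a $slice projection (mongo shell syntax; the collection name conversations is an assumption):
// Return only the third message (skip 2, limit 1) of the embedded array
db.conversations.find(
  { user_a: "Bart", user_b: "Milhouse" },
  { "conversation.messages": { $slice: [2, 1] } }
)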
To keep the order, I don't think there's an easy way to add the num field other than doing a find() on conversation.last_msg before you insert the new subdocument and incrementing last_msg.
Depending on what you need to keep the ordering for, you might consider including a time stamp in your subdocument, which is commonly kept in conversation records anyway and may provide other useful information.
Also, I haven't used it, but there's a Mongoose plugin that may or may not be able to do what you want: https://npmjs.org/package/mongoose-auto-increment
You can't create an auto-increment field, but you can use functions to generate and administer a sequence:
http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/ 
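A sketch of the pattern from that tutorial, using a separate counters collection (mongo shell syntax; the collection and sequence names are illustrative, and note that this is still two operations rather than one):
// Atomically fetch the next value of a named sequence
function getNextSequence(name) {
  var ret = db.counters.findAndModify({
    query: { _id: name },
    update: { $inc: { seq: 1 } },
    new: true,
    upsert: true
  });
  return ret.seq;
}

// Push the new message with the generated number
db.conversations.update(
  { user_a: "Bart", user_b: "Milhouse" },
  {
    $push: {
      "conversation.messages": {
        from: "Bart",
        msg: "See you later",
        num: getNextSequence("bart_milhouse")
      }
    }
  }
)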
I would recommend using a timestamp rather than a numerical value. By using a timestamp, you can keep the ordering of the subdocuments and make other use of it.
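A sketch of that approach in the mongo shell, pushing the message with a timestamp in a single update (the collection name is an assumption):
db.conversations.update(
  { user_a: "Bart", user_b: "Milhouse" },
  {
    $push: {
      "conversation.messages": {
        from: "Milhouse",
        msg: "Sure!",
        ts: new Date()
      }
    }
  }
)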
