My Code
I have a MongoDB with two collections, Items and Calculations.
Items
  value: Number
  date: Date

Calculations
  calculation: Number
  start_date: Date
  end_date: Date
A Calculation is a stored calculation based on the values of all Items in the DB whose dates fall between the Calculation's start date and end date.
Mongo Change Streams
I figure a good way to create / update Calculations is to open a Mongo Change Stream on the Items collection, listen for changes to Items, and then recalculate the relevant Calculations.
The issue is that, according to the Mongo Change Event docs, when a document is deleted the fullDocument field is omitted. That prevents me from accessing the deleted Item's date, which is what determines which Calculations should be updated.
Question
Is there any way to access the fullDocument of a Mongo Change Event that was fired due to a document deletion?
No, I don't believe there is a way. From https://docs.mongodb.com/manual/changeStreams/#event-notification:
Change streams only notify on data changes that have persisted to a majority of data-bearing members in the replica set.
Once the document has been deleted and the deletion has been persisted across the majority of the nodes in the replica set, the document has ceased to exist in the replica set. Thus the change stream cannot return something that no longer exists.
The solution to your question would be transactions in MongoDB 4.0. That is, you can adjust the Calculations and delete the corresponding Items in a single atomic operation.
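For illustration, a rough sketch of how that could look with the Node.js driver. Here client, db, and itemId are assumed to exist, and the $inc adjustment is purely illustrative, since the actual recalculation depends on what a Calculation computes:

const session = client.startSession();
try {
  await session.withTransaction(async () => {
    // Read the item first so its date and value are still known after the delete.
    const item = await db.collection('Items').findOne({ _id: itemId }, { session });
    await db.collection('Items').deleteOne({ _id: itemId }, { session });
    // Adjust every Calculation whose date range contains the deleted item's date.
    await db.collection('Calculations').updateMany(
      { start_date: { $lte: item.date }, end_date: { $gte: item.date } },
      { $inc: { calculation: -item.value } }, // illustrative adjustment only
      { session }
    );
  });
} finally {
  await session.endSession();
}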
fullDocument is not returned when a document is deleted.
But there is a workaround.
Right before you delete the document, set a hint field. (Obviously use a name that does not collide with your current properties.)
await myCollection.updateOne({_id: theId}, {$set: {_deleting: true}})
await myCollection.deleteOne({_id: theId})
This will trigger one final change event in the stream, with a hint that the document is getting deleted. Then in your stream watcher, you simply check for this value.
stream.on('change', event => {
  if (!event.fullDocument) {
    // The document was deleted
  }
  else if (event.fullDocument._deleting) {
    // The document is going to be deleted
  }
  else {
    // Document created or updated
  }
})
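Note that for event.fullDocument to be populated on update events at all, the change stream has to be opened with the fullDocument: 'updateLookup' option. A minimal sketch, assuming the Node.js driver:

// Ask the server to look up and attach the current document on update events.
const stream = myCollection.watch([], { fullDocument: 'updateLookup' });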
My oplog was not getting updated fast enough, and the update event's document lookup was hitting a document that was already removed, so I needed to add a small delay to get this working.
myCollection.updateOne({_id: theId}, {$set: {_deleting: true}})
setTimeout(() => {
  myCollection.deleteOne({_id: theId})
}, 100)
If the document did not exist in the first place, it won't be updated or deleted, so nothing gets triggered.
Using TTL Indexes
Another way to make this work is to add a TTL index and then set that field to the current time. This will trigger an update first, and the document is then deleted automatically.
myCollection.createIndex({_deleting: 1}, {expireAfterSeconds: 0})
myCollection.updateOne({_id: theId}, {$set: {_deleting: new Date()}})
The problem with this approach is that Mongo prunes TTL documents only at certain intervals (every 60 seconds or more, as stated in the docs), so I prefer the first approach.
Related
My Firestore database has a document with numbered fields: announcement0, announcement1, and so on. I want the announcement1 field to be renamed to announcement0 when the announcement0 field is deleted. Is there a way to do this?
There is no way to rename fields in Firestore, let alone to have that happen automatically.
It sounds like you have multiple announcements in your document however. In that case, you could consider storing all announcements in a single array field announcements. In an array field, when you remove the first item (at index 0) all other items after that shift down in the array to take its place, which seems to be precisely what you want.
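A minimal sketch of that approach, assuming the announcements live in an array field called announcements and using made-up collection/document names:

// Read the document, drop the first announcement, and write the array back;
// the remaining announcements shift down to take its place.
const docRef = db.collection('announcements').doc('someDoc'); // hypothetical path
const snap = await docRef.get();
const announcements = snap.data().announcements || [];
announcements.shift();
await docRef.update({ announcements });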
You cannot rename fields in a document. You'll have to delete and recreate it.
Now I'm assuming the number just defines the order of the announcements. If that's the case, you can use this workaround: instead of looking for 'announcement0' on the client side, store a number field in each announcement document, such as 0 for announcement0 and so on. Then, to get announcement1 once announcement0 is deleted, you can use this query:
const firstAnnouncement = await dbRef.orderBy('number').limit(1).get()
This will get the announcement with the lowest number (highest rank). You can change the limit as per your needs.
But if renaming fields is needed then you'll have to delete and recreate all trailing announcements.
I have a stream of Booking elements of the following form:
Booking(id=B1, driverId=D1, time=t1, location=l1)
Booking(id=B2, driverId=D2, time=t2, location=l2)
I need to find, per location, the count of bookings made in the last 15 minutes. But the window should be evaluated whenever a new booking arrives for a location.
Roughly like:
Assuming the `time` field is set as the timestamp of the record:
bookingStream.keyBy(b=>b.location).window(Any window of 15mins).trigger(triggerFunction)
Except that the trigger function should not be evaluated at the end of the 15 minutes but instead whenever any booking arrives at a location, emitting the count of bookings in the 15 minutes preceding the timestamp of the newly arrived booking.
Approach:
Use a RichMapFunction and maintain a priority queue of the location's bookings as managed state (ValueState), with the timestamp as the priority. For each element that arrives, first add it to the state, then remove elements more than 15 minutes older than the newly arrived element. Emit the count of the remaining elements in the priority queue to the collector.
Is this the right way, or could it be achieved in a better way using some other Flink construct?
If you are running on the heap-based state backend, what you propose should behave reasonably well. But with RocksDB you will have to go through serialization/deserialization of the priority queue for every access, which may be rather painful.
An approach that might perform better on RocksDB would be to keep the current count along with the earliest timestamp in ValueState, and the set of bookings in ListState. The RocksDB state backend can append to ListState without going through ser/de, so you would only have to deserialize and reserialize the whole list when the earliest element is too old.
Given the following code:
var dbRecords = _context.Alerts.AsNoTracking()
    .Where(a => a.OrganizationId == _authorization.OrganizationId)
    .ToList();
var dbRecords2 = _context.Alerts
    .Where(a => a.OrganizationId == _authorization.OrganizationId)
    .ToList();

foreach (var untrackedRecord in dbRecords) {
    var trackedRecord = dbRecords2.First(a => a.Id == untrackedRecord.Id);
    Assert.AreEqual(untrackedRecord.TimeStamp.Ticks, trackedRecord.TimeStamp.Ticks);
}
Where the TimeStamp data is stored in SQL Server 2012 in a column defined as datetime2(0).
The Assert fails, and the debugger demonstrates that the two Ticks values are always different.
Expected: 636179928520000000 But was: 636179928523681935
The untracked value will always be rounded off to the nearest second (which is expected, based on what SQL is storing). When creating the record, the value I'm saving comes from DateTime.Now.
Testing some more, the inconsistent ticks don't appear for every object I'm testing, only for records I've inserted recently. Looking at the code, and given the way the column is defined, it's not obvious to me why that would matter.
For now, to get my tests to pass, I'm just comparing the DateTime values down to the second, which is all that's required. However, I'm just wanting to understand why this is happening: Why can I not reliably compare two DateTime values depending on whether or not the entities are being tracked?
I figured this out, so answering my own question; I found I left off what turns out to be a key piece of information here. I mentioned that this issue came up in testing. What I didn't mention is that we're inserting the records and then testing all within a single transaction, and within a single DbContext.
Because I use the same DbContext for all work, the Alert objects that are inserted for testing are cached. When I query the objects using AsNoTracking, the DbContext has to refresh the objects before giving them back to me (since their current state isn't being tracked, and therefore is unknown to EF), apparently without updating what's in the cache (since we told EF we don't want to track the objects).
Querying for the same objects without AsNoTracking results in a cache hit; those objects that were inserted are still in the cache, so the cached versions are returned.
Given that, it's clear why the Ticks aren't matching up. The non-cached objects pull their DateTime values from the database, where the column is defined to store the time only down to the nearest second. The cached objects still have the original DateTime.Now values, which keep the time down to the millisecond. This explains why the Ticks don't match between the two DateTimes, even though both objects represent the same underlying database record.
I'm working with Solr indexing data from two sources: a real-time "pump" inserting (and updating) documents into Solr, and a database which holds backups of those documents.
The problem we encountered looks like this: if we run a data import from the database while the pump is performing inserts, we may index a doc from the pump and later overwrite it with the doc extracted from the database, which is a backup and therefore probably a little outdated.
If we close the pump, import from the database, and open the pump again, it will probably cause instabilities in our application.
What I'd like to do is tell Solr to not automatically overwrite the document, but do so conditionally (for example by the value of 'last_modified_date' field).
My question is - how can I do it? Do I have to modify Solr source, make a new class overwriting some update processor, or just add some magic lines to solrconfig?
Sorry, but there is not an option or config to tell Solr not to automatically overwrite documents and instead use some conditional check. The current model for Solr is that if you insert a document with the same unique id as one already in the index, it will "update" that document by a delete/add operation. Solr also does not currently support the ability to update only specific fields in an existing indexed document. Please see issue SOLR-139 for more details.
Based on the scenario you have described, I would suggest that you create a process outside of Solr that handles the retrieval of items from your data sources and then performs the conditional check to see what is in the index already and determine if an update to the index is necessary.
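A minimal sketch of such a process, assuming a core reachable at the URL below with id and last_modified_date fields (all names are illustrative; fetch implies Node 18+):

const SOLR = 'http://localhost:8983/solr/mycore'; // hypothetical core

async function indexIfNewer(doc) {
  // Look up the currently indexed version of this document, if any.
  const res = await fetch(SOLR + '/select?q=id:' + encodeURIComponent(doc.id) + '&fl=last_modified_date&wt=json');
  const existing = (await res.json()).response.docs[0];

  // Only send the document to Solr if the index has nothing newer.
  if (!existing || new Date(existing.last_modified_date) < new Date(doc.last_modified_date)) {
    await fetch(SOLR + '/update?commit=true', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify([doc])
    });
  }
}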
You can use Solr script update processors to check whether the document already exists and proceed accordingly.
The code below only works when Solr runs on Java 8.
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var previousDoc = null;
  try {
    // create a Term object for the field used to look up the existing document
    var Term = Java.type("org.apache.lucene.index.Term");
    var termObject = new Term("fieldForSearchTryUnique", "Value of field");
    // retrieve the internal document id from Solr; returns -1 if not present
    var previousDocId = req.getSearcher().getFirstMatch(termObject);
    if (previousDocId != -1) {
      // get the complete document from Solr for that id
      previousDoc = req.getSearcher().doc(previousDocId);
      // do the required processing here
    }
  }
  catch (err) {
    logger.error("error in update processor " + err);
  }
}
I would like to create a logger using CouchDB. Basically, every time someone accesses a file, I would like to write to the database the username and the time the file was accessed. If this were MySQL, I would just add a row for every access corresponding to the user. I am not sure what to do in CouchDB. Would I need to store each access in an array? Then what do I do during an update; is there a way to append to the document? Would each user have their own document?
I couldn't find any documentation on how to append to an existing document or array without retrieving and updating the entire document. So for every event you log, you'll have to retrieve the entire document, update it and save it to the database. So you'll want to keep the documents small for two reasons:
Log files/documents tend to grow big. You don't want to send large documents across the wire for each new log entry you add.
Log files/documents tend to get updated a lot. If all log entries are stored in a single document and you're trying to write a lot of concurrent log entries, you're likely to run into mismatching document revisions on updates.
Your suggestion of user-based documents sounds like a good solution, as it will keep the documents small. Also, a single user is unlikely to generate concurrent log entries, minimizing any race conditions.
Another option would be to store a new document for each log entry. Then you'll never have to update an existing document, eliminating any race conditions and the need to send large documents between your application and the database.
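For illustration, logging one access as its own document could look roughly like this from Node (the database name and fields are made up; fetch implies Node 18+):

// POST a new log document; CouchDB assigns the _id automatically.
await fetch('http://127.0.0.1:5984/access_log', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ at: Math.round(Date.now() / 1000), by: 'username', file: 'some/file/path/name.txt' })
});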
Niels' answer is going down the right path with transactions. As he said, you will want to create a different document for each access - think of them as actions. Here's what one of those documents might look like
{
    "_id": "32 char hash",
    "_rev": "32 char hash",
    "at": Unix time stamp,
    "by": "some unique identifier"
}
If you were tracking multiple files, then you'd want to add a "file" field and include a unique identifier.
Now the power of Map/Reduce begins to really shine, as it's extremely good at aggregating multiple pieces of data. Here's how to get the total number of views:
Map:
function(doc)
{
    emit(doc.at, 1);
}
Reduce:
function(keys, values, rereduce)
{
    return sum(values);
}
The reason I threw the time stamp (doc.at) into the key is that it allows us to get total views for a range of time. Ex., /dbName/_design/designDocName/_view/viewName?startkey=1000&endkey=2000&group=true gives us the total number of views between those two time stamps.
Cheers.
Although Sam's answer is an OK pattern to follow, I wanted to point out that there is, indeed, a nice way to append to a Couch document. It just isn't very well documented yet.
By defining an update function in your design document and using that to append to an array inside a couch document you may be able to save considerable disk space. Plus, you end up with a 1:1 correlation between the file you're logging accesses on and the couch doc that represents that file. This is how I imagine a doc might look:
{
    "_id": "some/file/path/name.txt",
    "_rev": "32 char hash",
    "accesses": [
        {"at": 1282839291, "by": "ben"},
        {"at": 1282839305, "by": "kate"},
        {"at": 1282839367, "by": "ozone"}
    ]
}
One caveat: You will need to encode the "/" as %2F when you request it from CouchDB or you'll get an error. Using slashes in document ids is totally ok.
And here is a pair of map/reduce functions:
function(doc)
{
    if (doc.accesses) {
        for (var i = 0; i < doc.accesses.length; i++) {
            var event = doc.accesses[i];
            emit([doc._id, event.by, event.at], 1);
        }
    }
}
function(keys, values, rereduce)
{
    return sum(values);
}
And now we can see another benefit of storing all accesses for a given file in one JSON document: to get a list of all accesses on a document just make a get request for the corresponding document. In this case:
GET http://127.0.0.1:5984/dbname/some%2Ffile%2Fpath%2Fname.txt
If you want to count the number of times each file was accessed by each user, query the view like so:
GET http://127.0.0.1:5984/test/_design/touch/_view/log?group_level=2
Use group_level=1 if you just want to count total accesses per file.
Finally, here is the update function you can use to append onto that doc.accesses array:
function(doc, req) {
    var whom = req.query.by;
    var when = Math.round(new Date().getTime() / 1000);
    if (!doc.accesses) doc.accesses = [];
    var event = {"at": when, "by": whom};
    doc.accesses.push(event);
    var message = 'Logged ' + event.by + ' accessing ' + doc._id + ' at ' + event.at;
    return [doc, message];
}
Now whenever you need to log an access to a file, issue a request like the following (depending on how you name your design document and update function):
http://127.0.0.1:5984/my_database/_design/my_designdoc/_update/update_function_name/some%2Ffile%2Fpath%2Fname.txt?by=username
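Update handlers are invoked with a PUT (or POST) request rather than a GET. From Node that might look roughly like this (assuming Node 18+ for fetch; the database and design document names follow the example above):

const res = await fetch(
  'http://127.0.0.1:5984/my_database/_design/my_designdoc/_update/update_function_name/' +
  encodeURIComponent('some/file/path/name.txt') + '?by=username',
  { method: 'PUT' }
);
console.log(await res.text()); // prints the message returned by the update function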
A comment on the last two answers: they refer to Couchbase, not Apache CouchDB.
It is however possible to define update handlers in CouchDB, but I have not used them.
http://wiki.apache.org/couchdb/Document_Update_Handlers