Resolving replication conflicts for deleted documents in CouchDB

The way of resolving replication conflicts recommended by the official documentation is (sketched below):
1. Read the conflicting revisions using the document's _conflicts field (e.g. via a view)
2. Fetch the docs for all listed revisions
3. Perform application-specific merging
4. Remove the unwanted revisions
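In HTTP terms, these four steps look roughly like the following (a minimal Python sketch using the requests library; the database URL, document id, and my_merge function are placeholders, not part of CouchDB):

import requests

COUCH = "http://localhost:5984/mydb"    # placeholder database URL
DOC_ID = "some-doc"                     # placeholder document id

# 1. Read the winning revision together with its conflicting revisions
doc = requests.get(f"{COUCH}/{DOC_ID}", params={"conflicts": "true"}).json()
conflict_revs = doc.get("_conflicts", [])

# 2. Fetch the body of every conflicting revision
conflict_docs = [
    requests.get(f"{COUCH}/{DOC_ID}", params={"rev": rev}).json()
    for rev in conflict_revs
]

# 3. Application-specific merge (my_merge is your own logic; the result must
#    keep the winning revision's _rev so the PUT below succeeds)
merged = my_merge(doc, conflict_docs)

# 4. Write the merged result and remove the losing revisions
requests.put(f"{COUCH}/{DOC_ID}", json=merged)
for rev in conflict_revs:
    requests.delete(f"{COUCH}/{DOC_ID}", params={"rev": rev})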
The problem comes in when I want to merge deleted documents. They do not show up in the _conflicts field, but in _deleted_conflicts. If I merge using only the _conflicts field, and a document is deleted in the local database and edited in the remote replica, it will be resurrected locally on replication. My application model assumes that deletion always takes precedence when merging: a deleted document stays deleted regardless of what edits it conflicts with.
So, at first glance, the simplest thing to do is to check whether _deleted_conflicts is non-empty and, if it is, delete the document, right? Well... the problem is that _deleted_conflicts may also contain deleted revisions that were introduced by resolving edit conflicts in step 4, so its meaning is ambiguous in this case.
What's the canonical way of handling deletion conflicts in CouchDB (if any) that doesn't involve doing gross things like marking documents as deleted and filtering at the application layer?

The best solution would be to use the reserved property _deleted to remove documents instead of HTTP DELETE. Then you are free to also set other properties:
doc._deleted = true;
doc.deletedByUser = true;
doc.save();
Then in the merge process check the _changes feed for _deleted_conflicts and delete the document if there is a revision within _deleted_conflicts that has the deletedByUser flag set to true.
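Roughly, assuming the deletedByUser convention above (a Python sketch with the requests library; the database URL and document id are placeholders):

import requests

COUCH = "http://localhost:5984/mydb"    # placeholder database URL
DOC_ID = "some-doc"                     # placeholder document id

# Fetch the winning revision together with any deleted conflict revisions
doc = requests.get(f"{COUCH}/{DOC_ID}",
                   params={"deleted_conflicts": "true"}).json()

# A tombstone written via PUT keeps its extra fields, so we can inspect them
user_deleted = False
for rev in doc.get("_deleted_conflicts", []):
    tombstone = requests.get(f"{COUCH}/{DOC_ID}", params={"rev": rev}).json()
    if tombstone.get("deletedByUser"):
        user_deleted = True
        break

# Deletion wins: remove the surviving edited revision as well
if user_deleted:
    requests.delete(f"{COUCH}/{DOC_ID}", params={"rev": doc["_rev"]})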
I hope this helps!

Related

How can we implement archiving in SOLR?

I am new to Apache SOLR and I want to implement archiving in SOLR since my data is growing day by day. I am not sure whether SOLR supports data archiving or not.
If anybody has any suggestions on this, please share them.
This question is pretty general so it's a bit hard to give a cut-and-dried answer, but if one thinks about archiving for a moment, there are two parts to it:
1. Removing old data
2. Storing the old data in an alternate location.
The first part is fairly easy in Solr so long as you can identify a query that will select the "old" documents. For example, if you have a field called 'indexed_date' that records when you sent the data to Solr and you want to delete everything before Jan 1, 2014, you might do this:
curl http://localhost:8983/solr/update --data '<delete><query>indexed_date:[* TO 2014-01-01T00:00:00Z]</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
The second part requires more thought. The first question is: why would you want to move the data in Solr to some other location? The answer more or less has to be that you think you might need it again. But ask yourself what the use case for that is, and how you might service that use case. Are you planning on putting the data back into Solr at some later point if you want it? Is Solr the only place where this data was stored, and you need it for record keeping/audit only?
You will have to determine the second half of "archiving" based on your needs, but here are some things to think about. The data behind fields in Solr that are stored="false" is already lost; you cannot completely reconstruct the data that went into creating them. Fields for which stored="true" can be retrieved in XML/JSON/CSV with a regular query, and then output to the long-term storage of your choice. Many systems use Solr as an index into the primary sources rather than using Solr as a primary source itself. In this case there may be no need to archive the data; simply remove the data that is too old to be relevant in the search results, but of course make sure that your business team understands and agrees with this strategy before you do it! :)
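For example, the stored fields matching the same date range can be pulled out and written to a file before issuing the delete. A rough Python sketch (the endpoint, field name, output file, and page size are assumptions, and for very large exports Solr's cursorMark deep paging is preferable to start/rows):

import json
import requests

SOLR = "http://localhost:8983/solr"     # assumes the same default endpoint as above
query = "indexed_date:[* TO 2014-01-01T00:00:00Z]"

# Pull the stored fields of the "old" documents, page by page
archived, start, rows = [], 0, 1000
while True:
    resp = requests.get(f"{SOLR}/select", params={
        "q": query, "fl": "*", "wt": "json",
        "start": start, "rows": rows,
    }).json()
    docs = resp["response"]["docs"]
    if not docs:
        break
    archived.extend(docs)
    start += rows

# Write them to the long-term storage of your choice (here: a flat JSON file)
with open("archive-pre-2014.json", "w") as fh:
    json.dump(archived, fh)

# Only then delete them from the index
requests.post(f"{SOLR}/update", params={"commit": "true"},
              json={"delete": {"query": query}})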
EDIT: I happened to look back at this, and when I re-read it I realized I had left something out; there is also a new development.
What I Left Out
The above delete-by-query strategy has the drawback that deleted documents remain in the index (just marked deleted), potentially wasting as much as 50% of your space (or more if you've run "optimize" in the past!). Here's a good article by Erick Erickson about deleting and its space consequences:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
New Development
If time is the criterion for deletion and you followed the best practice I mentioned above about not having Solr be the single source of truth (i.e. Solr is just an index into a primary source, not the data store), then you may very well want to use the new Time Routed Aliases feature, which keeps a set of temporally bounded collections and deletes the oldest collections. The great thing about deleting a collection rather than deleting by query is that there's no merging to do. The segments for the index disappear as a whole, so there are no deleted documents hanging around wasting space.
http://lucene.apache.org/solr/guide/7_4/time-routed-aliases.html
Self Promotion Disclaimer: Along with David Smiley, I helped write this feature

Solr Deduplication use of overWriteDupes flag

I had a configuration where I had "overwriteDupes"=false. I added a few duplicate documents. Result: I got duplicate documents in the index.
When I changed to "overwriteDupes"=true, the duplicate documents started overwriting the older documents.
Question 1: How do I achieve [add if not there, fail if a duplicate is found], i.e. mimic the behaviour of a DB which fails when trying to insert a record that violates a unique constraint? I thought that "overwriteDupes"=false would do that, but apparently not.
Question 2: Is there some documentation around overwriteDupes? I have checked the existing wiki; there is very little explanation of the flag there.
Thanks,
-Amit
Apparently "overwriteDupes"=false would indeed allow in duplicate documents. The utility of such a setting would be to allow duplicate records but be able to query them later, based on signature field and do whatever one wants to do with them.
The behavior is NOT well documented in the Solr wiki document.
One cannot achieve [add if not there, fail if duplicate is found] in a straight forward manner in Solr.
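If you do let duplicates in with "overwriteDupes"=false, one way to find them afterwards is to facet on the signature field and look for values that occur more than once. A Python sketch (the core name and the name of the signature field are assumptions; use whatever signatureField is configured in your dedupe update chain):

import requests

SOLR = "http://localhost:8983/solr/mycore"   # assumed core URL
SIG_FIELD = "signatureField"                 # assumed signature field name

# Facet on the signature field; any value with count >= 2 marks a duplicate group
resp = requests.get(f"{SOLR}/select", params={
    "q": "*:*", "rows": 0, "wt": "json",
    "facet": "true", "facet.field": SIG_FIELD, "facet.mincount": 2,
}).json()

# Solr returns facet counts as a flat [value, count, value, count, ...] list
counts = resp["facet_counts"]["facet_fields"][SIG_FIELD]
dup_signatures = counts[0::2]

# Fetch the documents behind each duplicate signature and handle them as needed
for sig in dup_signatures:
    dupes = requests.get(f"{SOLR}/select", params={
        "q": f'{SIG_FIELD}:"{sig}"', "wt": "json",
    }).json()["response"]["docs"]
    print(sig, [d.get("id") for d in dupes])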

Override delete method of custom object

We have a custom object in our instance that is effectively a junction object. Right now, if a relationship is removed, the record in the junction object is deleted.
We want to change this behavior such that the junction object record is marked as deleted, but not physically deleted (please understand that I cannot go into details of why; there are good business reasons to do so). Since we have multiple clients accessing our instance through the SOAP and REST APIs, I would like to implement a solution whereby I override the standard delete functionality of the custom object to just check a custom is_deleted field, instead of deleting the record.
Is this possible?
Cheers,
Dan
I suppose you can't just put an on-delete trigger on the object?
If you can, then just add the trigger code to update the field, and then attach an error to the record being deleted (so the deletion doesn't go through). There are plenty of examples in the official docs for how to do this.
Remember to keep everything bulkified (process all the records being deleted at once, from a list)...
On a side note, deleted records in Salesforce are kept in the org's Recycle Bin for 15 days after deletion. So you can also select them from the object by using the SELECT... ALL ROWS query form.
I don't think you can really override the delete action. You could override a button (with a Visualforce page), but that won't help you in any way if the delete is fired from the API.
I suspect you want to pretend to API (SOAP, REST etc.) users that the record was deleted while in reality retaining it somewhere? Smells like some shady business practice to be honest, but whatever, let's assume it really is legit... For sure you can't suddenly throw errors at the operation, because your end users will notice.
I think I'd go with a hidden 1-to-1 matching "shadow" object and sync each action to it. You'd need a trigger on insert/update/delete/undelete of your junction that would replicate the action (difference being this custom "soft delete" flag). This has lots of concerns like storage usage but well.
One thing that comes to mind is that (if I recall correctly) the triggers on a junction object don't fire if you delete one of the masters. So if it's a real junction object (you wrote "effectively is") you'd have to deal with this scenario too and put logic into the master objects' triggers.
If it's not a real junction object (i.e. it has an OwnerId field visible) and your sharing rules permit, maybe you could transfer ownership of the record to some special user/queue outside of the role hierarchy so it becomes invisible... But I doubt it'll work; in the end the delete should appear to complete successfully, right? Maybe in combination with some @future method that would immediately undelete them & transfer... Still messy!

Partial Update of documents

We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past, but they weren't the right fit for all our use cases.
2. On the other hand, can Elasticsearch, LinkedIn's SenseiDB, or other text search engines achieve this?
For now, we manage by re-indexing the affected documents whenever they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: it never goes back to modify existing ones, it only marks documents as deleted and deletes them for real when a merge happens.
Search servers on top of Lucene try to work around this problem by exposing a single endpoint that is able to delete the old document and reindex the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
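For completeness, this is roughly what a Solr 4 atomic update looks like when all fields are stored (a Python sketch; the endpoint, document id, and field names are made up):

import requests

SOLR = "http://localhost:8983/solr"   # assumed update endpoint

# "set" replaces a field's value, "add" appends to a multi-valued field;
# Solr rebuilds the whole document behind the scenes from its stored fields.
requests.post(f"{SOLR}/update", params={"commit": "true"}, json=[
    {"id": "doc-1", "title": {"set": "new title"}, "tags": {"add": "archived"}},
])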
Elasticsearch works around it by storing the source document by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place, while indexing. This is, by the way, one of the features that makes Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this great feature.
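To make the two options concrete, here is a rough Python sketch against the Update API (the index, type, document id, and field names are made up, and the exact URL layout and script syntax vary between Elasticsearch versions):

import requests

ES = "http://localhost:9200/myindex/mytype/1"   # assumed index/type/id

# Option 1: send a partial document; it is merged into the stored _source and
# the merged result is reindexed (the old version is deleted under the hood)
requests.post(f"{ES}/_update", json={"doc": {"title": "new title"}})

# Option 2: run a script against the existing _source and index the result
requests.post(f"{ES}/_update", json={"script": "ctx._source.counter += 1"})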

Alfresco Solr Custom Search

Using Alfresco 4.0.1, we have added numerous new entities and linked them to cm:content. When we search, we want to be able to search not only by criteria of the content, but also to say "give us all content that is linked to libraries with these properties" (for example).
We would expect we need to add a new Solr core (index) and populate it.
Has anyone done this? Can someone offer a hint or two, or a link to a post explaining it?
Thanks
--MB
Addition 1: linked means the content is 'linked' with other entities using Alfresco's Peer (Non-Child) Associations.
Addition 2: for example, if our model is content and libraries (but it's much more complicated than that), these are linked using peer (non-child) associations because we were not able to use parent-child for other reasons. So what we want to search for is all content with name "document", but that is linked to libraries with location "Texas".
The bottom line is that Alfresco isn't relational. You can set up associations, and through the API you can ask a given node for its associations, but you cannot run queries across associations like you can when you do joins in a relational database.
Maybe you should add a location property to your content node and update its value with a behavior any time an association is created, updated, or deleted on that node. Then you'd be able to run a query by AND-ing the location with other criteria on the node.
Obviously, if you have many such properties that you need to keep in sync, your behavior could start to affect performance negatively, but if you have only a handful you should be okay.
