Solr Deduplication: use of the overwriteDupes flag

I had a configuration with "overwriteDupes"=false. I added a few duplicate documents. Result: I got duplicate documents in the index.
When I changed to "overwriteDupes"=true, the duplicate documents started overwriting the older documents.
Question 1: How do I achieve [add if not there, fail if a duplicate is found], i.e. mimic the behaviour of a DB which fails when trying to insert a record that violates a unique constraint? I thought that "overwriteDupes"=false would do that, but apparently not.
Question 2: Is there some documentation around overwriteDupes? I have checked the existing wiki; there is very little explanation of the flag there.

Apparently "overwriteDupes"=false does indeed allow duplicate documents into the index. The utility of such a setting is that you can admit duplicate records but query for them later, based on the signature field, and do whatever you want with them.
The behaviour is not well documented in the Solr wiki.
There is no straightforward way to achieve [add if not there, fail if a duplicate is found] in Solr.
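The duplicates admitted by "overwriteDupes"=false can be found later by faceting on the signature field. A minimal sketch of such a query, assuming the de-duplication chain writes its hash into a field named sig (the collection and field names here are illustrative, not from the original post):

    # sketch: find signature values shared by 2+ documents (names are illustrative)
    http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=sig&facet.mincount=2

Each facet value returned with a count of two or more is a signature shared by several documents; a follow-up query with fq=sig:<value> retrieves that duplicate group for whatever post-processing is needed.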

Related

In Azure Search, can an indexer combine information from different documents into a single index item without them overwriting each other?

My goal is to create a single searchable Azure index that has all of the relevant information currently stored in many different SQL tables.
I'm also using an Azure Cognitive Service to add additional info from related documents. Each document is tied to only a single item in my Index, but each item in the index will be tied to many documents.
According to my understanding, if two documents have the same value for the indexer's Key, then the index will overwrite the extracted information from the first document with the information extracted from the second. I'm hoping there's a way to append the information instead of overwriting it. For example: if two documents relate to the same index item, I want the values mapped to keyphrases for that item to include the keyphrases found in the first document and the keyphrases found in the second document.
Is this possible? Is there a different way I should be approaching this?
If it is possible, can I do it without having duplicate values?
Currently I have multiple indexes and I'm combining the search results from each one, but this seems inefficient and likely messes up the default scoring algorithm.
Every code example I find only has one document for each index item and doesn't address my problem. Admittedly, I haven't tried to set up my index as described above, because it would take a lot of refactoring, and I'm confident it would just overwrite itself.
I am currently creating my indexes and indexers programmatically using dotnet. I'm assuming my code isn't relevant to my question, but I can provide it if need be.
Edit: I'm thinking about creating a custom skill to do the aggregation for me, but I don't know how the skill would access everything it needs. It needs the extracted info from the current document, and it needs the previously aggregated info from earlier documents. I guess the custom skill could perform a search on the index and get the item that way, but that sounds dangerously hacky. Any thoughts would be appreciated.
Pasting from docs:
Indexing actions: upload, merge, mergeOrUpload, delete
You can control the type of indexing action on a per-document basis, specifying whether the document should be uploaded in full, merged with existing document content, or deleted.
Whether you use the REST API or an SDK, the following document operations are supported for data import:
upload, similar to an "upsert": the document is inserted if it is new, and updated or replaced if it exists. If the document is missing values that the index requires, the field's value is set to null.
merge updates a document that already exists and fails if the document cannot be found. merge replaces existing values; for this reason, be sure to check collection fields that contain multiple values, such as fields of type Collection(Edm.String). For example, if a tags field starts with the value ["budget"] and you execute a merge with ["economy", "pool"], the final value of the tags field is ["economy", "pool"], not ["budget", "economy", "pool"].
mergeOrUpload behaves like merge if the document exists, and upload if the document is new.
delete removes the entire document from the index. If you want to remove an individual field, use merge instead, setting the field in question to null.
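As a concrete sketch, these actions are specified per document in the batch you POST to the index's docs collection (the service, index, key, and field names below are made up for illustration):

    POST https://myservice.search.windows.net/indexes/items/docs/index?api-version=2020-06-30
    {
      "value": [
        { "@search.action": "upload",        "id": "1", "tags": ["budget"] },
        { "@search.action": "merge",         "id": "1", "tags": ["economy", "pool"] },
        { "@search.action": "mergeOrUpload", "id": "2", "tags": ["spa"] },
        { "@search.action": "delete",        "id": "3" }
      ]
    }

After this batch, document 1's tags field is ["economy", "pool"], not ["budget", "economy", "pool"], which is why the append behaviour asked about above has to happen outside the indexer, for example by aggregating the per-document values before pushing them.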

What does "A version conflict was detected when attempting to index this document." mean in Azure Search?

My indexer is failing with the message:
    A version conflict was detected when attempting to index this document. Please try again.
What is this referring to?
This error indicates that multiple documents with the same document key are being indexed at the same time. This can happen in several situations:
You have multiple indexers writing to the same index. One way of mitigating this is to stagger the indexers' schedules so they overlap as little as possible.
Your data source actually has multiple items that map to the same document key.
You're using some other code in addition to the indexer to push data into your index. Occasional conflicts between documents with the same key may be unavoidable, but if you're running your indexer on a schedule, indexing will still make forward progress.
This issue happens when your index key values are not unique. Make sure the key attribute is unique in your index definition.

Prevent Duplication in Solr using UpdateRequestProcessor chain

We are using Solr to store items that have been received and ingested through another service.
I am currently looking into a task to avoid duplicate items being created with the same id.
I am not an expert in Solr and am trying to pick up the task from someone who has left the company. The last suggestion for preventing duplication was that it should be possible using a combination of a unique key defined on the id field and an UpdateRequestProcessor chain. I don't know enough about UpdateRequestProcessor chains to know what approach they had in mind. I do know the ultimate goal: when an item is sent to Solr with the same id as an existing item, an update should be performed rather than a create.
I have looked at the Solr documentation on UpdateRequestProcessor chains, but without more background information those resources have not helped much so far. I would benefit from Solr experts helping me get started or pointing me in the right direction.
You don't need to get a URP involved; it is much simpler than that. If your doc's id (defined in schema.xml as <uniqueKey>id</uniqueKey>) is already a unique id, then you don't need to do anything else. Indexing the same doc with the same id twice will update it the second time (a delete and a fresh insert under the hood).
If your uniqueKey is not the unique id, then just rework the schema (and the app using Solr, if it needs to change) so they match.
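A minimal sketch of the relevant schema.xml declarations (the field name id is just the common convention; adjust to your schema):

    <!-- sketch: declare the key field and mark it as the uniqueKey -->
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <uniqueKey>id</uniqueKey>

With this in place, posting a document whose id matches an existing document replaces it, which is exactly the update-rather-than-create behaviour described above; no UpdateRequestProcessor chain is required.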

Identifying documents by multiple unique keys in Solr

I have been setting Solr up to automatically generate IDs for my documents by following this guide:
https://wiki.apache.org/solr/UniqueKey, which is working as intended.
Now, when inserting a document, I would like to check/ensure that the url field (just a string) is unique for all documents in the index. So whenever a new document is added, it should just update any existing document if an document already exists with that particular url.
The unique id is used to identify a document in another part of the system.
I have tried marking the url field as unique as well, but it is just ignored, and it is thus still possible to add a document with a non-unique url.
I'm using Solr 4.10.2.
You could prevent duplicates from entering the index by using the "De-duplication" Solr feature. Please have a look at the wiki for configuration and more details: https://cwiki.apache.org/confluence/display/solr/De-Duplication
There is also a flag, "overwriteDupes", which I believe issues an "update" command that overwrites the old values, although it is not clearly documented in the wiki.
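A sketch of such a configuration in solrconfig.xml, keyed on the url field from the question (the chain and signature-field names are illustrative, and the signature field must also be declared in the schema):

    <!-- sketch: hash the url field into url_signature and overwrite duplicates -->
    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">url_signature</str>
        <bool name="overwriteDupes">true</bool>
        <str name="fields">url</str>
        <str name="signatureClass">solr.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    <!-- attach the chain to the update handler -->
    <requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">dedupe</str>
      </lst>
    </requestHandler>

With overwriteDupes=true, adding a document whose url hashes to the same signature as an existing document deletes the old document and keeps the new one, giving the update-on-duplicate-url behaviour the question asks for.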

Partial Update of documents

We have a requirement that documents we currently index in Solr may periodically need to be partially updated. The updates can be either:
a. adding new fields, or
b. updating the content of existing fields.
Some of the fields in our schema are stored, others are not.
Solr 4 does allow this, but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that Solr can achieve this? We've tried Solr joins in the past, but they weren't the right fit for all our use cases.
2. Alternatively, can Elasticsearch, LinkedIn's SenseiDB, or other text search engines achieve this?
For now, we manage by re-indexing the affected documents whenever they need to be updated.
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: Lucene never goes back to modify an existing segment; it only marks documents as deleted and removes them for real when a segment merge happens.
Search servers built on top of Lucene work around this problem by exposing a single endpoint that deletes the old document and reindexes the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
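For reference, the stored-fields-only partial update that Solr 4 does support (what the posts linked in the question describe) uses atomic-update modifiers such as set and add in the update JSON. A minimal sketch, assuming a core named collection1 and stored fields named price and tags:

    # sketch: core and field names are illustrative
    curl 'http://localhost:8983/solr/collection1/update?commit=true' \
      -H 'Content-Type: application/json' \
      -d '[{ "id": "doc1", "price": { "set": 599 }, "tags": { "add": "sale" } }]'

Under the hood Solr rebuilds the document from its stored fields, applies the changes, deletes the old version, and reindexes the result, which is why every field has to be stored.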
Elasticsearch works around it by storing the source document by default, in a special field called _source. That is exactly the document you sent to the search engine in the first place, while indexing. This is, by the way, one of the features that make Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old document and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this great feature.
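A sketch of both styles against the Update API of that era (the index, type, document id, and field names below are made up):

    # partial document: price is merged into the stored _source
    POST /items/item/1/_update
    { "doc": { "price": 599 } }

    # scripted update: the script modifies the stored _source
    POST /items/item/1/_update
    { "script": "ctx._source.views += 1" }

In both cases Elasticsearch fetches the _source, applies the change, deletes the old document, and indexes the result, exactly as described above.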
