Updating solr index with deleted records - solr

I was trying to figure out how to update the index for the deleted records. I'm indexing from the database. I search for documents in the database, put them in an array and index them by creating a SolrInputDocument.
So, I couldn't figure out how to update the index for the deleted records (because they don't exist in the database now).
I'm using the php-solr-pecl extension.

You need to handle the deletion of documents yourself; Solr won't do it for you.
For an incremental import, you need to keep track of the documents deleted from the database and then fire a delete query for them to clean up the index.
To do this, maintain a timestamp and a delete flag so the deleted documents can be identified.
For a full import, you can simply clear the index and reindex everything.
However, if the reindex fails you may lose all the indexed data.
The Solr DataImportHandler (DIH) provides some handling for this (for example, the deletedPkQuery attribute used during delta imports).
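
As a rough illustration of the incremental case, the sketch below collects the ids of rows flagged as deleted since the last run and builds a JSON delete-by-id body that could be posted to the core's update handler. The table name, the column names (deleted, updated_at) and the in-memory SQLite database are assumptions made only so the example runs on its own; they are not from the original setup.

```python
import json
import sqlite3

def build_delete_payload(conn, last_run):
    """Return a Solr JSON delete-by-id body for rows flagged deleted since last_run."""
    rows = conn.execute(
        "SELECT id FROM documents WHERE deleted = 1 AND updated_at > ? ORDER BY id",
        (last_run,),
    ).fetchall()
    ids = [str(row[0]) for row in rows]
    # Nothing to delete: return None so the caller can skip the request.
    return json.dumps({"delete": ids}) if ids else None

# Hypothetical table layout: id, delete flag, last-modified timestamp.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, deleted INTEGER, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(1, 0, "2024-01-01"), (2, 1, "2024-01-05"), (3, 1, "2023-12-01")],
)

payload = build_delete_payload(conn, "2024-01-02")
print(payload)
```

Only row 2 is both flagged and newer than the last run, so it is the only id in the payload. After sending the delete and committing, the stored last-run timestamp would be advanced.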

Create a delete trigger on the database table that inserts the id of each deleted record into another table. (Alternatively, add a boolean "deleted" field and mark the record instead of actually deleting it; considering the trade-offs, I would choose the trigger.)
Once in a while, run a batch delete against the index based on the "deleted" table, then remove the processed ids from that table.
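
A minimal sketch of the trigger approach, using an in-memory SQLite database so it is self-contained. The table names, the trigger name and the delete_from_index callback are hypothetical; in a real setup the callback would issue the Solr delete-by-id request and commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE deleted_documents (id INTEGER PRIMARY KEY);
-- Record the id of every row removed from documents.
CREATE TRIGGER log_document_delete AFTER DELETE ON documents
BEGIN
    INSERT OR IGNORE INTO deleted_documents (id) VALUES (OLD.id);
END;
""")
conn.executemany("INSERT INTO documents VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.execute("DELETE FROM documents WHERE id IN (1, 3)")

def drain_deleted_ids(conn, delete_from_index):
    """Batch job: remove logged ids from the index, then clear them from the log table."""
    raw_ids = [row[0] for row in conn.execute("SELECT id FROM deleted_documents ORDER BY id")]
    ids = [str(i) for i in raw_ids]
    if ids:
        delete_from_index(ids)  # assumed to raise on failure, leaving the log intact
        conn.executemany("DELETE FROM deleted_documents WHERE id = ?", [(i,) for i in raw_ids])
        conn.commit()
    return ids

removed = []  # stands in for the search index in this sketch
drained = drain_deleted_ids(conn, removed.extend)
print(drained)
```

Clearing the log table only after the index call returns means a failed batch is retried on the next run.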

We faced the same issue and came up with a batch deletion approach.
We wrote a program that deletes documents from Solr based on the unique id: if a unique id is present in Solr but not in the database, that document can be deleted from Solr.
(uniqueid list from Solr) minus (uniqueid list from the database)
You can use SQL MINUS to get the list of unique ids belonging to the documents that need to be deleted.
Otherwise, you can do everything on the Java side: get the list from the database, get the list from Solr, compare the two lists and delete based on the result. This would be a lot faster for a huge number of documents. You can use binary search to do the comparison, provided the database id list is sorted.
Something like
// sortedDatabaseIds is a sorted List<String>; a negative result means the id is missing from the database
Collections.binarySearch(sortedDatabaseIds, solrUniqueId);
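
The same comparison can be sketched in a few lines of Python. Two variants are shown: a plain set difference, and a binary-search lookup equivalent to the Collections.binarySearch idea. The function names and sample ids are made up for illustration.

```python
from bisect import bisect_left

def ids_to_delete(solr_ids, database_ids):
    """Set difference: ids present in Solr but no longer in the database."""
    return sorted(set(solr_ids) - set(database_ids))

def ids_to_delete_bisect(solr_ids, sorted_database_ids):
    """Same result using binary search; sorted_database_ids must already be sorted."""
    stale = []
    for doc_id in solr_ids:
        pos = bisect_left(sorted_database_ids, doc_id)
        found = pos < len(sorted_database_ids) and sorted_database_ids[pos] == doc_id
        if not found:
            stale.append(doc_id)
    return stale

solr_ids = ["a1", "a2", "a3", "a4"]
database_ids = ["a3", "a1"]

print(ids_to_delete(solr_ids, database_ids))
print(ids_to_delete_bisect(solr_ids, sorted(database_ids)))
```

The set version is usually the simpler choice when both id lists fit in memory; the binary-search version avoids building a second hash set when the database list is already sorted.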

Related

Best Practices to update/add/remove fields for an Azure Search Index

I was wondering if there are any good resources on best practices for dealing with changes (adding/removing fields) to your search index without taking your Azure search service and index down.
Do we need to create a completely new index and indexer to do that? I discovered that the Azure portal currently lets you add new fields to your index, but what about updating/deleting fields from your search index?
Thanks!
If you add a field there is no strict requirement on rebuild. Existing indexed documents are given a null value for the new field. On a future re-index, values from source data are added to documents.
While you can't directly delete a field from an Azure Search index, you can achieve the same effect without rebuilding the index by having your application simply ignore the "deleted" field. If you use this approach, a deleted field isn't used, but physically the field definition and contents remain in the index until the next time you rebuild your index.
Changing a field definition requires you to rebuild your index, with the exception of changing these index attributes: Retrievable, SearchAnalyzer, SynonymMaps. You can add the Retrievable, SearchAnalyzer, and SynonymMaps attributes to an existing field or change their values without having to rebuild the index.
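
The rules above can be condensed into a small check. This is only a sketch of the decision logic as stated in this answer: field definitions are modeled as plain dictionaries, the attribute keys are written in camelCase as an assumption about how they appear in an index definition, and needs_rebuild is a hypothetical helper, not part of any Azure SDK.

```python
# Attributes that, per the answer above, can change without a rebuild.
REBUILD_FREE_ATTRIBUTES = {"retrievable", "searchAnalyzer", "synonymMaps"}

def needs_rebuild(old_fields, new_fields):
    """Return True if moving from old_fields to new_fields requires an index rebuild."""
    for name, old in old_fields.items():
        if name not in new_fields:
            return True  # a field cannot be physically removed without a rebuild
        new = new_fields[name]
        changed = {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}
        if changed - REBUILD_FREE_ATTRIBUTES:
            return True  # some other part of the definition changed
    return False  # only additions or rebuild-free attribute changes

old = {"title": {"type": "Edm.String", "retrievable": True}}
added = {
    "title": {"type": "Edm.String", "retrievable": False},
    "tags": {"type": "Edm.String"},
}
retyped = {"title": {"type": "Edm.Int32", "retrievable": True}}

print(needs_rebuild(old, added))
print(needs_rebuild(old, retyped))
```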

Solr schema modifications that do not affect existing Documents

I am trying to figure out whether I need to re-index a [very large] document base in Solr in the following scenarios:
I want to add a few new fields to the schema: none of the old Documents need to be updated to add values for these fields, only new documents that I will be adding after the schema update will have these fields. Do I still need to re-index Solr?
I want to remove a couple of unused fields from the schema (they were added prematurely ...): none of the existing documents has any of these fields. Do I still need to re-index Solr after the schema update?
I saw many recommendations for updating existing documents when adding/modifying fields, but this is not the case for me - I only want to update the schema, not the existing documents.
Thanks!
Marina
Answer 1: You are correct. You can add a new field, and you do not need to reindex if only new documents going forward will have a value for it.
Answer 2: Yes, you can remove a field without rebuilding the index if none of the documents has a value for that field. You can check by looking at that field under:
http://localhost:8080/admin/schema.jsp
If any document has a value for the field you want to remove, you have to rebuild the index; otherwise Solr will give an error.

Database search with Lucene.net and how to update the index

I have a SQL Server database with about 40 tables that need to be searched. I just started looking into Lucene for .NET. The tables that need to be searched don't have any column that identifies when a row was last updated or created, and we don't want to change the table structure right now. What options do I have to identify whether a row in a table has been modified, so that I can update the document in the Lucene index? The same goes for newly created rows. Any help is greatly appreciated.
If you can't tell what has changed by looking at the database, then just assume all of the rows have changed and update them all in Lucene. That handles your new rows as well.
If this is too slow or time consuming, then that gives you a reason why you should change your table structure to store the last updated date.

Drop, not overwrite, on unique id field

When using a unique id field, Solr will overwrite old documents with newly indexed documents. Is there any way to prevent this, so that the old documents are stored but the new are dropped?
Thanks.
Nope. By default, Solr will delete the existing record and insert a new one.
You can look at Deduplication and UpdateXmlMessages#Optional_attributes, which may serve the purpose.
You can write your own update processor, extending UpdateRequestProcessorFactory/UpdateRequestProcessor, that detects an existing id and drops the new document.
Otherwise, you can check whether the id exists and skip the insert if it does. This adds overhead on the client side.
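
A minimal sketch of the client-side check, with the index lookup and the add operation passed in as callables. Here they are backed by an in-memory dictionary so the example runs on its own; in practice id_exists would be a query on the unique id field and add_document the normal add call. All names are hypothetical.

```python
def add_if_absent(doc, id_exists, add_document, id_field="id"):
    """Add doc only when no document with the same id is already indexed."""
    if id_exists(doc[id_field]):
        return False  # keep the old document, drop the new one
    add_document(doc)
    return True

# In-memory stand-in for the index, used only for this sketch.
index = {}

def id_exists(doc_id):
    return doc_id in index

def add_document(doc):
    index[doc["id"]] = doc

first = add_if_absent({"id": "1", "title": "first"}, id_exists, add_document)
second = add_if_absent({"id": "1", "title": "second"}, id_exists, add_document)
print(first, second, index["1"]["title"])
```

Note that check-then-add is not atomic: two clients indexing the same id at the same time can both pass the check, so this only works reliably with a single writer.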

mongoid include soft deleted document

Mongoid supports soft deletion with
include Mongoid::Paranoia
Let's suppose I have soft deleted a document from one of the collections.
Now I need a query that includes the soft deleted documents from that collection.
How can I do that?
Do I need to write a separate method to achieve this?
Thanks
You can find all deleted documents with the query
Model.deleted
and if you want to find deleted documents matching a specific condition, then
Model.deleted.where(:field => value)
