Solr still gives old data while faceting from deleted documents

Solr returns old data in facet results, coming from deleted or updated documents.
For example, we facet on name, and name changes frequently in our application. When we re-index a document after changing the name, we get both the old name and the new name in the faceting results. After digging into this I learned that Solr indexes are composed of write-once segments, each containing a set of documents. Whenever a hard commit happens these segments are closed, and if a document is deleted or updated after that, the segment still contains the old version, just marked as deleted. These documents are not cleared immediately; they no longer show up in search results, but faceting is somehow still able to access their data.
Optimizing fixes this issue, but we cannot run an optimize every time a customer changes data in production. I tried the options below and they did not work for me.
1) expungeDeletes.
I added the lines below to solrconfig.xml:
<autoCommit>
<maxTime>30000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<!-- softAutoCommit is like autoCommit except it causes a
'soft' commit which only ensures that changes are visible
but does not ensure that data is synced to disk. This is
faster and more near-realtime friendly than a hard commit.
-->
<autoSoftCommit>
<maxTime>10000</maxTime>
</autoSoftCommit>
<commit waitSearcher="false" expungeDeletes="true"/>
2) Using TieredMergePolicyFactory might not help me, as the merge threshold may not always be reached and users would still see old data until it is.
3) Another way is to call the optimize() method exposed in SolrJ once a day (a rough sketch of this is below), but I am not sure what impact this would have on performance.
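For reference, a minimal sketch of what option 3 could look like, assuming SolrJ 8.x; the core URL, schedule and error handling are placeholders rather than a tested implementation:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class NightlyOptimize {
    public static void main(String[] args) {
        // Placeholder core URL; point this at the core whose facets show stale values.
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run optimize() once every 24 hours: waitFlush=true, waitSearcher=true, merge down to 1 segment.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                solr.optimize(true, true, 1);
            } catch (Exception e) {
                e.printStackTrace(); // real code should log and alert
            }
        }, 24, 24, TimeUnit.HOURS);
    }
}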
The number of documents we index per server will be at most 2M-3M.
Please suggest if there is any solution to this.
Let me know if more details are needed.

Related

Update SOLR document without adding deleted documents

I'm running a lot of SOLR document updates, which results in hundreds of thousands of deleted documents and a significant increase in disk usage (hundreds of GB).
I'm able to remove all deleted documents by running an optimize:
curl http://localhost:8983/solr/core_name/update?optimize=true
But this takes hours to run and requires a lot of RAM and disk space.
Is there a better way to remove deleted documents from the SOLR index or to update a document without creating a deleted one?
Thanks for your help!
Lucene uses an append-only strategy, which means that when a new version of an old document is added, the old document is marked as deleted and a new one is inserted into the index. This approach lets Lucene avoid rewriting the whole index file as documents are added, at the cost of old documents physically remaining in the index until a merge or an optimize happens.
When you issue expungeDeletes, you're telling Solr to perform a merge if the number of deleted documents exceeds a certain threshold, in effect forcing an optimize behind the scenes as Solr deems it necessary.
How you can work around this depends on more specific information about your use case; in the general case, just leaving the standard settings for merge factors etc. should be good enough. If you're not seeing any merges, you might have disabled automatic merges (depending on your index size; seeing hundreds of thousands of deleted documents seems excessive for an indexing process taking 2m30s). In that case, make sure merging is enabled properly and tweak its values again. There are also changes introduced in 7.5 to the TieredMergePolicy that allow even more detailed control (and possibly better defaults) over the merge process.
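As an illustration only, here is a minimal SolrJ sketch of issuing such a commit with expungeDeletes as a request parameter (SolrJ 8.x assumed; the core URL is a placeholder):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class ExpungeDeletes {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/core_name").build()) {
            UpdateRequest req = new UpdateRequest();
            // A plain commit request; expungeDeletes asks Solr to merge away segments with many deletions.
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            req.setParam("expungeDeletes", "true");
            req.process(solr);
        }
    }
}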
If you're re-indexing your complete dataset each time, indexing to a separate collection/core and then switching an alias over or renaming the core when finished before removing the old dataset is also an option.
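If you go the separate-collection route in SolrCloud, a hedged sketch of the alias switch via SolrJ could look like the following (collection names, alias name and ZooKeeper address are all placeholders):

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SwitchAlias {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            // After the full reindex into "products_v2" finishes, repoint the alias queries use.
            CollectionAdminRequest.createAlias("products", "products_v2").process(solr);
            // The previous collection can then be dropped to reclaim its space.
            CollectionAdminRequest.deleteCollection("products_v1").process(solr);
        }
    }
}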

Is it better to update all records or reindex while using solr?

I am using the Solr search engine. I defined a schema initially and imported data from a SQL DB into Solr using DIH. I now have a new column in the SQL DB whose value is populated from some of the existing columns, and I have to index this new column into Solr.
My question is: do I perform an update for all records, or do I delete all records from Solr and rebuild the index again using DIH? I am asking because I have read that when we update any document, Solr first deletes the existing document and then re-adds it.
The answer regarding speed is, as always, "it depends". But it's usually easier to just reindex. It doesn't require all fields to be stored in Solr and it's something you'll have to support anyway - so it doesn't require any additional code.
It also offers a bit more flexibility with regard to the index. As you note, if you go down the partial-update route, the actual implementation is a delete+add internally (since there might be fields that depend on the field you're changing, update processors, distribution across the cluster, etc.), and that requires all fields to be stored. Storing everything can have a huge impact on index size, which might not be necessary, especially if you already have all the content in the DB for every other use anyway.
So in regards to speed you're probably just going to have to try it (document sizes, speed of the DB, field sizes, etc. are going to affect that in each case), but usually the speed of a reindex isn't the most important part.
If you do update your index, don't forget to optimize it afterwards (via the Admin console, for example) to get rid of all the deleted documents.
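For illustration, a minimal SolrJ sketch of the partial-update path discussed above, using the atomic "set" modifier (SolrJ 8.x assumed; the core URL, id and field names are placeholders, and this generally requires the other fields to be stored):

import java.util.Collections;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PartialUpdate {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42"); // uniqueKey of the existing document
            // Atomic update: set only the new field; Solr rewrites the document internally (delete+add).
            doc.addField("new_column", Collections.singletonMap("set", "computed value"));
            solr.add(doc);
            solr.commit();
        }
    }
}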

Max length for Solr Delete by query

I am working on some improvements to our reindexing process. We have custom logic to figure out which documents have been modified and need to be reindexed, so at the end I can generate a delete query along the lines of "delete all documents where fieldId is in the list".
So instead of deleting and adding 50k documents every time, we only re-index a tiny percentage of them.
Now I am thinking about the edge case where our list of fieldIds is extremely large, say 30-40,000 ids. In that case, is there an upper limit on request length that I should worry about, or would it have negative effects on performance and make the situation worse instead of better?
I have read some articles advising to use a POST request instead.
I am using the latest SolrNet build, which is built against Solr 4.0.
I would revisit that logic, because deleting the documents and then re-indexing them is not the best solution. Firstly, it is an expensive operation; secondly, your index will be empty or incomplete for a while until you re-index the documents again, which means that if you query your index in the middle of the operation you could get zero or partial results.
I would advise simply indexing again with the same document id (the uniqueKey defined in Solr's schema.xml). Solr is smart enough to overwrite the document if it is indexed with the same id, so you don't have to worry about the hassle of deleting old documents. You might also run an optimize on the index from time to time to physically get rid of the 'deleted' documents.
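A minimal sketch of that overwrite-in-place approach (shown with SolrJ for brevity; SolrNet has equivalent Add/Commit calls, and the core URL, ids and field values are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ReindexChanged {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            for (String id : new String[] {"101", "102", "103"}) { // ids of the modified documents
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", id); // same uniqueKey, so the old version is simply replaced
                doc.addField("name", "rebuilt value for " + id);
                solr.add(doc);
            }
            solr.commit(); // no explicit delete step needed
        }
    }
}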

When using DSE Search, do I need to reindex to remove a field?

I'm using DSE Search 3.2.1. We have removed some unneeded indexes and fields and posted the schema.xml document to all of the nodes. Do we need to do anything else to have it discontinue indexing data? Do we need to run a reindex, or a full reindex?
I'm pretty sure, from what I've seen in Solr, that you need to reindex after changing the fields in your Solr schema.xml. After you post the new schema, you'll need to reload the core. If querying still works after that you might be OK, but I would guess you're going to need to run a reindex to be safe.
If you don't reindex, the existing Solr index field values will remain, occupying space and responding to queries. And fresh inserts or updates will not have the deleted fields. As Ben said, that might be okay.
A Solr reindex will delete all of the old field values.
Ideally, if you change anything in schema.xml and want the changes to take effect, you have to re-index. But whether a re-index is worth it depends entirely on the application use case and the number of records you have. If the reason for removing the index was lack of usage, then there is no need to re-index, since no one is going to search on those fields; the old index data will take up some space, but that should be fine. Also, be careful when re-indexing, because the cost depends heavily on the number of documents: if you have somewhere around 10M or more I would NOT recommend re-indexing, as it is a CPU- and I/O-bound operation. If the number of documents is smaller, you can go ahead and do it.

How to configure Solr for improved indexing speed

I have a client program which generates 1-50 million Solr documents and adds them to Solr.
I'm using ConcurrentUpdateSolrServer to push the documents from the client, 1000 documents per request.
The documents are relatively small (a few small text fields).
I want to improve the indexing speed.
I've tried increasing the ramBufferSizeMB to 1G and the mergeFactor to 25, but didn't see any change.
I was wondering if there are other recommended settings for improving Solr indexing speed.
Any links to relevant materials will be appreciated.
It looks like you are doing a bulk import of data into Solr, so you don't need to search any data right away.
First, you can increase the number of documents per request. Since your documents are small, I would increase it to 100K docs per request or even more and see how it goes.
Second, you want to reduce the number of times commits happen when you are bulk indexing. In your solrconfig.xml look for:
<!-- AutoCommit
Perform a hard commit automatically under certain conditions.
Instead of enabling autoCommit, consider using "commitWithin"
when adding documents.
http://wiki.apache.org/solr/UpdateXmlMessages
maxDocs - Maximum number of documents to add since the last
commit before automatically triggering a new commit.
maxTime - Maximum amount of time in ms that is allowed to pass
since a document was added before automatically
triggering a new commit.
openSearcher - if false, the commit causes recent index changes
to be flushed to stable storage, but does not cause a new
searcher to be opened to make those changes visible.
-->
<autoCommit>
<maxTime>15000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
You can disable autoCommit altogether and then call a commit after all your documents are posted. Otherwise you can tweak the numbers as follows:
The default maxTime is 15 seconds, so an auto commit happens every 15 seconds whenever there are uncommitted docs; you can set this to something large, say 3 hours (i.e. 3*60*60*1000 ms). You can also add <maxDocs>50000000</maxDocs>, which means an auto commit happens only after 50 million documents are added. After you post all your documents, call commit once manually or from SolrJ; it will take a while to commit, but this will be much faster overall.
Also after you are done with your bulk import, reduce maxTime and maxDocs, so that any incremental posts you will do to Solr will get committed much sooner. Or use commitWithin as mentioned in solrconfig.
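To make the bulk-import pattern concrete, here is a hedged SolrJ sketch (assuming SolrJ 8.x, where ConcurrentUpdateSolrServer has become ConcurrentUpdateSolrClient; the URL, queue size and thread count are placeholders). Documents are queued and streamed in the background, and a single commit is issued at the end instead of relying on autoCommit:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndex {
    public static void main(String[] args) throws Exception {
        try (ConcurrentUpdateSolrClient solr =
                 new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycore")
                     .withQueueSize(100000)   // buffer many small docs before flushing over HTTP
                     .withThreadCount(4)      // parallel connections streaming to Solr
                     .build()) {
            for (int i = 0; i < 1_000_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("text_field", "small document body " + i);
                solr.add(doc);               // queued and sent in the background
            }
            solr.blockUntilFinished();       // wait for the internal queue to drain
            solr.commit();                   // one explicit commit after the whole import
        }
    }
}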
In addition to what was written above, if you are running SolrCloud you may want to consider using the CloudSolrClient with SolrJ. The CloudSolrClient class is ZooKeeper-aware and is able to send documents directly to the shard leaders, speeding up indexing in some cases.
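A hedged sketch of that, assuming SolrJ 8.x with a hypothetical ZooKeeper ensemble and collection name:

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndex {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            solr.setDefaultCollection("mycollection");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("text_field", "example document");
            solr.add(doc);   // routed to the correct shard leader via cluster state from ZooKeeper
            solr.commit();
        }
    }
}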
