Update Solr document without adding deleted documents

I'm running a lot of Solr document updates, which results in hundreds of thousands of deleted documents and a significant increase in disk usage (hundreds of GB).
I'm able to remove all the deleted documents by running an optimize:
curl http://localhost:8983/solr/core_name/update?optimize=true
But this takes hours to run and requires a lot of RAM and disk space.
Is there a better way to remove deleted documents from the Solr index, or to update a document without creating a deleted one?
Thanks for your help!

Lucene uses an append-only strategy, which means that when a new version of an old document is added, the old document is marked as deleted and a new one is inserted into the index. This lets Lucene avoid rewriting the whole index file as documents are added, at the cost of the old documents physically remaining in the index until a merge or an optimize happens.
When you issue expungeDeletes, you're telling Solr to merge any segments whose number of deleted documents exceeds a certain threshold; in effect you're forcing part of the optimize work to happen behind the scenes, as Solr deems it necessary.
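As a rough sketch (reusing the core name from the question), expungeDeletes can be requested as part of a commit; since it only rewrites segments with enough deletes, it's usually cheaper than a full optimize:
curl 'http://localhost:8983/solr/core_name/update?commit=true&expungeDeletes=true'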
How you can work around this depends on more specific information about your use case - in the general case, just leaving the merge factors etc. at their standard settings should be good enough. If you're not seeing any merges, you might have disabled automatic merging (depending on your index size, seeing hundreds of thousands of deleted documents seems excessive for an indexing process taking 2m30s). In that case make sure to enable it properly and tweak its values again. There are also changes introduced in 7.5 to the TieredMergePolicy that allow even more detailed control (and possibly better defaults) over the merge process.
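For reference, a minimal sketch of tuning the merge policy in solrconfig.xml - the values are placeholders, and deletesPctAllowed only exists in versions built on Lucene 7.5 or later:
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
    <!-- rough cap on the share of deleted docs kept around (7.5+ only, placeholder value) -->
    <double name="deletesPctAllowed">25.0</double>
  </mergePolicyFactory>
</indexConfig>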
If you're re-indexing your complete dataset each time, another option is to index into a separate collection or core, switch an alias over (or rename the core) when finished, and then remove the old dataset.
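A sketch of that switch, with placeholder names - CREATEALIAS for SolrCloud collections, or the Core Admin SWAP action for standalone cores:
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_new'
curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=products&other=products_new'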

Related

Solr: how to improve query speed after too many atomic updates

I'm using SolrCloud 7.4 with 3 instances (16GB RAM each) and have one collection with 10M documents. At first it was really fast, with almost no query taking more than 2 seconds.
Then I updated it with transaction data (i.e. popularity) from another Oracle database to make my collection more relevant. I simply loop over the transactions and use Solr atomic updates like set and inc on about 1~10 fields (almost all of type float and long). But there are more than 300M transactions, so I apply the set and inc operations to the collection in batches of 10k transactions.
The 300M-transaction update is only processed once; after that it's maybe 50k/day, processed at 0am.
In the end the collection still has 10M documents, but my queries have slowed down to almost 10 seconds.
Looking at the shard overview, each shard has 20+ segments and about half of their documents are deleted documents.
Did I miss something here - why did my query performance drop?
How do I speed it up again like before?
Should I copy and create a new collection and reindex my 10M-document collection (after the atomic updates from the 300M transactions) into the new collection?
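For reference, the atomic updates I'm sending look roughly like this (field names and values are just examples):
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/my_collection/update' \
  -d '[{"id":"doc-1","popularity":{"inc":1},"last_price":{"set":19.99}}]'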
The issue is caused by a large number of segments being created, mostly consisting of deleted documents. When you're doing an atomic update, the previous document is fetched, the value is changed, and the new document (with the new value) is indexed. This leaves the old document as deleted, while the new document is written to a new file.
These segments are merged when the mergeFactor value is hit; i.e. when the number of segments gets high enough, they're merged into a new segment file instead of keeping multiple files around. When this merge happens, deleted documents are expunged (there's no need to write documents that no longer exist to a new file).
You can force this process to happen by issuing an optimize. While you can usually rely on mergeFactor to do the job for you (depending on its value and your indexing strategy), for datasets where everything is updated in one go, such as once a night, issuing an optimize afterwards works fine.
The downside is that it requires extra processing (which would happen anyway if you just relied on mergeFactor, only not all at the same time), and up to 2x the current size of the index as temporary space.
You can perform an optimize by calling the update endpoint for your collection: http://localhost:8983/solr/collection/update?optimize=true&maxSegments=1&waitFlush=false
The maxSegments value tells Solr how many segments it's acceptable to end up with. The default value is 1, which is fine for most use cases.
While calling optimize has gotten a bad rep (since mergeFactor usually should do the work for you, and people tend to call optimize far too often), this is a perfectly fine use case for it. There are also enhancements to the optimize command in 7.5 that help avoid the previous worst-case scenarios.

Is it better to update all records or reindex while using Solr?

I am using the Solr search engine. I defined a schema initially and imported data from a SQL DB into Solr using DIH. I now have a new column in the SQL DB whose value is populated from some of the previous columns, and I have to index this new column into Solr.
My question is: do I perform an update for all records, or do I delete all records from Solr and rebuild the index again using DIH? I'm asking because I have read that if we perform an update on any document, Solr first deletes the indexed document and then builds it again.
The answer regarding speed is, as always, "it depends". But it's usually easier to just reindex. It doesn't require all fields to be stored in Solr and it's something you'll have to support anyway - so it doesn't require any additional code.
It also offers a bit more flexibility in regards to the index, since as you note, if you are going to do partial updates, the actual implementation is delete+add internally (since there might be fields that depend on the field you're changing, update processors, distribution across the cluster, etc.) - which requires all fields to be stored. This can have a huge impact on index size, which might not be necessary - especially if you have all the content in the DB for all other uses anyway.
So with regard to speed you're probably just going to have to try it (document sizes, speed of the DB, field sizes, etc. will affect it in each individual case) - but usually the speed of a reindex isn't the most important part.
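If you do go the full-reindex route with DIH, kicking it off is just a request to the import handler - assuming it's registered at /dataimport and using a placeholder core name; clean=true (the default for full-import) clears the existing documents first, so the new column is picked up for every record:
curl 'http://localhost:8983/solr/core_name/dataimport?command=full-import&clean=true'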
If you update your index don't forget to optimize it afterwards (via the Admin console for example) to get rid of all the deleted documents.

When to optimize a Solr Index

I have a classifieds website. Users may put ads, edit ads, view ads etc.
Whenever a user puts an ad, I am adding a document to Solr.
I don't know, however, when to commit it. Commit slows things down from what I have read.
How should I do it? Autocommit every 12 hours or so?
Also, how should I do it with optimize?
A little more detail on Commit/Optimize:
Commit: When you are indexing documents to Solr, none of the changes you are making will appear until you run the commit command. So the timing of the commit really depends on how quickly you want the changes to appear on your site through the search engine. However, it is a heavy operation, so it should be done in batches, not after every update.
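As a sketch, an explicit commit is just another update request, and the commitWithin parameter lets Solr batch commits for you (the core name, document, and 60000 ms window are placeholders):
curl 'http://localhost:8983/solr/core_name/update?commit=true'
curl 'http://localhost:8983/solr/core_name/update?commitWithin=60000' -H 'Content-Type: application/json' -d '[{"id":"ad-123","title":"Example ad"}]'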
Optimize: This is similar to a defrag command on a hard drive. It reorganizes the index into segments (increasing search speed) and removes any deleted (replaced) documents. Index segments are write-once, so every time you re-index a document, Solr marks the old document as deleted and then creates a brand new one to replace it. Optimize removes these deleted documents. You can see the searchable vs. deleted document count by going to the Solr Statistics page and comparing the numDocs and maxDocs numbers. The difference between the two is the number of deleted (non-searchable) documents in the index.
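If you'd rather script that check than eyeball the Statistics page, the Luke handler reports the same counters (numDocs, maxDoc, deletedDocs); the core name is a placeholder:
curl 'http://localhost:8983/solr/core_name/admin/luke?numTerms=0&wt=json'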
Also, Optimize builds a whole NEW index from the old one and then switches to the new index when complete. Therefore the command requires double the space to perform the action, so you will need to make sure that the size of your index does not exceed 50% of your available hard drive space. (This is a rule of thumb; it usually needs less than 50% because of deleted documents.)
Index Server / Search Server:
Paul Brown was right in that the best design for Solr is to have a server dedicated and tuned to indexing, and then replicate the changes to the searching servers. You can tune the index server to have multiple index endpoints.
e.g.: http://solrindex01/index1; http://solrindex01/index2
And since the index server is not searching for content, you can set it up with different memory footprints, index warming commands, etc.
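For the replication part, a rough sketch (host and core names are made up, and this assumes the ReplicationHandler is already configured on both sides) - a search server can be told to pull the latest index from the index server with:
curl 'http://solrsearch01:8983/solr/index1/replication?command=fetchindex'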
Hope this is useful info for everyone.
Actually, committing often and optimizing makes things really slow. It's too heavy.
After a day of searching and reading stuff, I found out this:
1- Optimize causes the index to double in size while being optimized, and makes things really slow.
2- Committing after each add is NOT a good idea; it's better to commit a couple of times a day, and then run an optimize only once a day at most.
3- Commits should be handled via autoCommit in the solrconfig.xml file, tuned according to your needs (a sketch follows below).
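A minimal sketch of what that looks like in solrconfig.xml - the times are placeholders, and autoSoftCommit only applies to Solr versions that support soft commits:
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>        <!-- hard commit every 60s: flushes changes to disk -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>         <!-- soft commit every 5s: makes new docs searchable -->
  </autoSoftCommit>
</updateHandler>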
The way that this sort of thing is usually done is to perform commit/optimize operations on a Solr node located out of the request path for your users. This requires additional hardware, but it ensures that the performance penalty of the indexing operations doesn't impact your users. Replication is used to periodically shuttle optimized index files from the master node to the nodes that perform search queries for users.
Try it first. It would be really bad if you avoided a simple and elegant solution just because you read that it might cause a performance problem. In other words, avoid premature optimization.

Max length for Solr Delete by query

I am working on some improvements to our reindexing process. We have custom logic to figure out which documents have been modified and need to be reindexed, so at the end I can generate a delete query along the lines of "delete all documents where fieldId is in this list".
So instead of deleting and adding 50k documents every time, we only re-index a tiny percentage of them.
Now I am thinking about an edge-case scenario where our list of fieldIds is extremely large, say 30-40,000 ids. In that case, is there an upper limit on request length that I should worry about, or would it have negative effects on performance and exacerbate the situation instead of making it better?
I read some articles on Google advising to make it a POST request instead.
I am using the latest SolrNet build, which is built on Solr 4.0.
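For what it's worth, the POST form of such a delete would look roughly like this (the field name and ids are just examples):
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/core_name/update?commit=true' \
  -d '{"delete":{"query":"fieldId:(101 OR 102 OR 103)"}}'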
I would revisit that logic, because deleting the documents and then re-indexing them is not the best solution. Firstly, it is an expensive operation; secondly, your index will be empty or incomplete for a while until you re-index the documents again, which means that if you query your index in the middle of the operation you could get zero or partial results.
I would advise just indexing again with the same document id (the uniqueKey defined in Solr's schema.xml). Solr is smart enough to overwrite the document if it is indexed with the same id, so you don't have to worry about the hassle of deleting old documents. You might also run an 'Optimize' on the index from time to time to physically get rid of the 'deleted' documents.
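A sketch of what that looks like, assuming id is the uniqueKey and the field values are placeholders - re-sending the document simply replaces the old version:
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/core_name/update?commit=true' \
  -d '[{"id":"101","fieldId":"101","title":"updated title"}]'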

When using DSE Search, do I need to reindex to remove a field?

I'm using DSE Search 3.2.1. We have removed some unneeded indexes and fields and posted the schema.xml document to all of the nodes. Do we need to do anything else to have it discontinue indexing data? Do we need to run a reindex, or a full reindex?
I'm pretty sure from what I've seen in Solr that you need to reindex after changing the fields in your Solr schema.xml. After you post it, you'll need to reload the core. If querying still works after that you might be OK, but I would guess you're going to need to run a reindex to be safe.
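For reference, in stock Solr the reload is a Core Admin call like the one below (the core name is a placeholder); DSE wraps this with its own tooling and extra options, so check the DSE docs for the exact command on 3.2.1:
curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=core_name'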
If you don't reindex, the existing Solr index field values will remain, occupying space and responding to queries. And fresh inserts or updates will not have the deleted fields. As Ben said, that might be okay.
A Solr reindex will delete all of the old field values.
Ideally, if you change anything in schema.xml and want the changes to take effect, you have to do a re-index. But whether to re-index depends entirely on the application use case and the number of records you have. If the reason for removing the index was lack of usage, then there is no need for you to re-index, since no one is going to search on those fields. The old indexes will take some space, but that should be fine. Also, be careful when re-indexing, because the cost depends heavily on the number of documents you have. If you have somewhere around 10M or more, I would NOT recommend re-indexing, as it is a CPU- and I/O-bound operation. If the number of documents is smaller, you can surely go ahead and do it.
