Solr full reindexing without downtime

We have the following problem at hand: we want to do a full reindex with 100% read availability during the process. The tricky part is deleting old documents from the index. At the moment we're doing something like this:
1) fetch all data from the database and update the Solr index via solrServer.add()
2) get all document IDs that were updated and compare them with all document IDs in the index
3) delete all documents that are in the index but weren't updated
This seems to work, but is there a better/easier solution for this task?
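Roughly, our current flow looks like this simplified SolrJ sketch (the core URL, field names and DB access are placeholders):
import java.util.*;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
public class FullReindex {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();
        // 1) fetch all data from the DB and push it into the index
        Set<String> updatedIds = new HashSet<>();
        for (Map<String, Object> row : fetchAllRowsFromDb()) { // placeholder DB access
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", row.get("id").toString());
            doc.addField("title", row.get("title"));
            solr.add(doc);
            updatedIds.add(row.get("id").toString());
        }
        // 2) collect every id currently in the index (fine for a sketch; use cursorMark for large indexes)
        SolrQuery q = new SolrQuery("*:*").setFields("id").setRows(Integer.MAX_VALUE);
        List<String> stale = new ArrayList<>();
        for (SolrDocument d : solr.query(q).getResults()) {
            String id = (String) d.getFieldValue("id");
            if (!updatedIds.contains(id)) {
                stale.add(id); // in the index but not part of this reindex run
            }
        }
        // 3) delete the leftovers, then make everything visible at once
        if (!stale.isEmpty()) {
            solr.deleteById(stale);
        }
        solr.commit();
        solr.close();
    }
    private static List<Map<String, Object>> fetchAllRowsFromDb() {
        return Collections.emptyList(); // stand-in for the real database query
    }
}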

The changes do not become visible until you commit. So you can issue a delete and then index all your documents; just make sure automatic commits are disabled while you do it. This obviously requires more memory.
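For example, a minimal SolrJ sketch of that pattern (assuming autoCommit is disabled in solrconfig.xml; client setup omitted):
import java.util.Collection;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;
class DeleteThenReindex {
    // nothing becomes visible to searchers until the explicit commit at the end
    static void fullReindex(SolrClient solr, Collection<SolrInputDocument> docs) throws Exception {
        solr.deleteByQuery("*:*"); // queued delete, not visible yet
        solr.add(docs);            // re-add the complete data set
        solr.commit();             // readers switch from the old view to the new one here
    }
}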
Alternatively, you can add a separate field with a generation stamp (e.g. an increasing ID or a timestamp). Then you issue a delete-by-query to remove the leftover documents that still carry the old generation.
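A sketch of the generation-stamp variant (the field name index_generation is just an example and has to exist in your schema):
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;
class GenerationReindex {
    static void reindex(SolrClient solr, Iterable<SolrInputDocument> docs) throws Exception {
        long generation = System.currentTimeMillis(); // one stamp for this whole run
        for (SolrInputDocument doc : docs) {
            doc.setField("index_generation", generation);
            solr.add(doc);
        }
        // anything not touched in this run still carries an older stamp
        solr.deleteByQuery("index_generation:[* TO " + (generation - 1) + "]");
        solr.commit();
    }
}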
Finally, you can index into a new core/collection and then swap the active one to point at the new index. Afterwards you can simply delete the old core's directory.
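And a sketch of the core-swap variant, driving the CoreAdmin SWAP command through SolrJ's generic request (core names are placeholders; on SolrCloud you would repoint a collection alias instead):
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
class CoreSwap {
    public static void main(String[] args) throws Exception {
        // the client must point at the Solr root, not at a specific core
        HttpSolrClient admin = new HttpSolrClient.Builder("http://localhost:8983/solr").build();
        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("action", "SWAP");
        p.set("core", "items_new");   // freshly built index
        p.set("other", "items_live"); // core the application queries
        new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/cores", p).process(admin);
        // after the swap, the core named items_live serves the new index; the old data can be unloaded and deleted
        admin.close();
    }
}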

It sounds like you may have a performance issue with the deletes. If you do this:
delete id:12345
delete id:23456
delete id:13254
then it is a lot slower than this:
delete id:(12345 OR 23456 OR 13254)
Collect the list of ids that need to be deleted, batch them in groups of 100 or so, and transform those batches into delete queries using parentheses and OR. I have done this with batches of deletes numbering several thousand, and it is much faster than stepping through one at a time.
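A small SolrJ sketch of that batching approach (batch size and the id field name are arbitrary):
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
class BatchedDeletes {
    static void deleteInBatches(SolrClient solr, List<String> ids, int batchSize) throws Exception {
        for (int from = 0; from < ids.size(); from += batchSize) {
            List<String> batch = ids.subList(from, Math.min(from + batchSize, ids.size()));
            // one delete-by-query per batch: id:(12345 OR 23456 OR 13254 ...)
            solr.deleteByQuery("id:(" + String.join(" OR ", batch) + ")");
        }
        solr.commit(); // a single commit after all the batches
    }
}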

Related

Get only the last N months of data indexed in a collection, on a rolling basis

Currently I am facing an issue where a MongoDB collection may hold billions of records; the documents are generated by rapid events happening in the system, and these events get logged into the collection.
Since we have 2-3 compound indexes on the same collection, searches definitely become slow.
The way out is that our customer has agreed that if we index only the last N months of data in MongoDB, read efficiency should improve, instead of keeping 2-3 years of data indexed and reading against all of it.
My thoughts on solution 1: we can use TTL indexes and set an expiry, after which the data gets deleted from the main collection. We can somehow back up the expired records beforehand. This way we only keep the data we actually need in the main collection.
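A sketch of what solution 1 could look like with the MongoDB Java driver (database, collection and field names are made up; TTL needs a date field that is present in every document):
import java.util.concurrent.TimeUnit;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
class TtlIndexSetup {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events = client.getDatabase("mydb").getCollection("events");
            // documents are removed roughly N months (here ~6) after their createdAt value
            events.createIndex(
                    Indexes.ascending("createdAt"),
                    new IndexOptions().expireAfter(180L * 24 * 60 * 60, TimeUnit.SECONDS));
        }
    }
}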
My thoughts on solution 2: I can drop all the indexes and recreate them based on a time frame, i.e. rebuild the indexes so that they only cover the past N months of data. This way I maintain a limited index, but I am not sure how feasible that is.
Question: I need more help on how I can achieve this kind of selective indexing. It also has to be rolling: records get added every day, and so does the indexing.
If you're on Mongo 3.2 or above, you should be able to use a partial index to create the "selective index" that you want -- https://docs.mongodb.com/manual/core/index-partial/#index-type-partial You'll just need to be sure that your queries share the same partial filter expression that the index has.
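A sketch of that with the MongoDB Java driver (field names and the cutoff date are made up; note the threshold is fixed at index-creation time, so a truly rolling window means rebuilding the index periodically or filtering on a flag you maintain yourself):
import java.time.Instant;
import java.util.Date;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
class PartialIndexSetup {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events = client.getDatabase("mydb").getCollection("events");
            Date cutoff = Date.from(Instant.parse("2023-01-01T00:00:00Z"));
            // only documents newer than the cutoff are included in this index
            events.createIndex(
                    Indexes.compoundIndex(Indexes.ascending("deviceId"), Indexes.descending("eventTime")),
                    new IndexOptions().partialFilterExpression(Filters.gte("eventTime", cutoff)));
            // queries must repeat the same predicate so the planner can use the partial index
            events.find(Filters.and(
                    Filters.eq("deviceId", "sensor-42"),
                    Filters.gte("eventTime", cutoff))).first();
        }
    }
}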
(I suspect that there might also be issues with the indexes you currently have, and that reducing index size won't necessarily have a huge impact on search duration. Mongo indexes are stored in a B-tree, so the time to navigate the tree to find a single item is going to scale relative to the log of the number of items. It might be worth examining the explain output for the queries that you have to see what mongo is actually doing.)

Update SOLR document without adding deleted documents

I'm running a lot of Solr document updates, which results in hundreds of thousands of deleted documents and a significant increase in disk usage (hundreds of GB).
I'm able to remove all deleted documents by doing an optimize:
curl http://localhost:8983/solr/core_name/update?optimize=true
But this takes hours to run and requires a lot of RAM and disk space.
Is there a better way to remove deleted documents from the SOLR index or to update a document without creating a deleted one?
Thanks for your help!
Lucene uses an append only strategy, which means that when a new version of an old document is added, the old document is marked as deleted, and a new one is inserted into the index. This way allows Lucene to avoid rewriting the whole index file as documents are added, at the cost of old documents physically still being present in the index - until a merge or an optimize happens.
When you issue expungeDeletes, you're telling Solr to perform a merge if the number of deleted documents exceeds a certain threshold; in effect you're forcing the kind of merge an optimize would do, but only where Solr deems it necessary.
How you can work around this depends on more specific information about your use case - in the general case, just leaving the standard settings for merge factors etc. in place should be good enough. If you're not seeing any merges at all, you might have disabled automatic merging (depending on your index size, hundreds of thousands of deleted documents seems like a lot for an indexing process taking 2m30s). In that case make sure to enable it properly and tune its values again. There are also changes introduced in 7.5 to the TieredMergePolicy that allow even more detailed control (and possibly better defaults) for the merge process.
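If you do want to force that merge explicitly, a commit with expungeDeletes is a lighter alternative to a full optimize; a hedged SolrJ sketch (the core name is a placeholder):
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;
class ExpungeDeletes {
    static void commitAndExpunge(SolrClient solr) throws Exception {
        UpdateRequest req = new UpdateRequest();
        // a hard commit that also asks Solr to merge away segments with many deleted documents
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        req.setParam("expungeDeletes", "true");
        req.process(solr, "core_name");
    }
}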
If you're re-indexing your complete dataset each time, indexing to a separate collection/core and then switching an alias over or renaming the core when finished before removing the old dataset is also an option.
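On SolrCloud that switch can be made atomic by repointing an alias once the new collection is fully built; a sketch with placeholder names:
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
class AliasSwitch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            // applications always query the alias "products";
            // repointing it from products_v1 to products_v2 is atomic for readers
            CollectionAdminRequest.createAlias("products", "products_v2").process(solr);
            // the old collection can be dropped once nothing references it any more
            CollectionAdminRequest.deleteCollection("products_v1").process(solr);
        }
    }
}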

Deleting Huge Data In Cassandra Cluster

I have a Cassandra cluster with three nodes and close to 7 TB of data from the last 4 years. Because the servers are running low on space, we would like to keep only the last 2 years of data. But we don't want to delete everything older than 2 years: we want to keep specific data even if it is older than that.
Currently I can think of one approach:
1) A Java client using a MutationBatch object. I can get all the record keys that fall into the date range, exclude the rows we don't want to delete, and then delete the records in batches. But this solution raises performance concerns because the data is huge.
Is it possible to handle this at the server level (OpsCenter)? I read about TTL, but how can I apply it to existing data while still keeping some of the data that is older than 2 years?
Please help me find the best solution.
The main thing you need to understand is that when you remove data in Cassandra, you're actually adding data by writing a tombstone; the deletion of the actual data happens later, during compaction.
So it's very important to perform deletion correctly. There are different types of deletes - individual cell, row, range, and partition (from least to most efficient in terms of the number of tombstones generated). The best option for you is to delete by partition; the second best is to delete ranges inside a partition. The following article describes in great detail how the data is removed.
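A hedged sketch with the DataStax Java driver (keyspace, table and column names are invented), showing a whole-partition delete and a range delete inside a partition:
import java.net.InetSocketAddress;
import java.time.Instant;
import com.datastax.oss.driver.api.core.CqlSession;
class BulkDelete {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("dc1")
                .build()) {
            // one partition tombstone covers everything in the partition (most efficient)
            session.execute("DELETE FROM metrics.readings WHERE sensor_id = ?", "sensor-42");
            // one range tombstone inside a partition (second best):
            // removes everything in this partition older than the cutoff
            session.execute(
                "DELETE FROM metrics.readings WHERE sensor_id = ? AND reading_time < ?",
                "sensor-7", Instant.parse("2019-01-01T00:00:00Z"));
        }
    }
}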
You may need to perform deletion in several steps, so you don't add too much data as tombstones. You also need to check that you have enough disk space for compaction.

When to optimize a Solr Index [duplicate]

I have a classifieds website. Users may put ads, edit ads, view ads etc.
Whenever a user puts an ad, I am adding a document to Solr.
I don't know, however, when to commit it. Commit slows things down from what I have read.
How should I do it? Autocommit every 12 hours or so?
Also, how should I do it with optimize?
A little more detail on Commit/Optimize:
Commit: when you are indexing documents to Solr, none of the changes you make will appear until you run the commit command. So when to commit really depends on how quickly you want changes to appear on your site through the search engine. Commit is a heavy operation, though, so it should be done in batches, not after every update.
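In client terms that boils down to committing once per batch rather than once per document; a minimal SolrJ sketch:
import java.util.Collection;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;
class BatchIndexer {
    static void indexBatch(SolrClient solr, Collection<SolrInputDocument> ads) throws Exception {
        solr.add(ads);  // queue the whole batch of ads
        solr.commit();  // one commit for the batch, not one per document
    }
}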
Optimize: this is similar to a defrag command on a hard drive. It reorganizes the index into fewer segments (increasing search speed) and removes any deleted (replaced) documents. Solr's index segments are write-once (append-only), so every time you reindex a document the old version is marked as deleted and a brand-new document is created to replace it. Optimize removes these deleted documents. You can see the searchable vs. deleted document count by going to the Solr statistics page and comparing the numDocs and maxDocs numbers; the difference between the two is the number of deleted (non-searchable) documents in the index.
Also, optimize builds a whole NEW index from the old one and then switches to the new index when complete, so the command requires double the space to perform the action. You will therefore need to make sure that the size of your index does not exceed 50% of your available hard drive space. (This is a rule of thumb; it usually needs less than 50% because of deleted documents.)
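For reference, the corresponding SolrJ call (equivalent to sending optimize=true; run it rarely, and only when that extra disk headroom is available):
import org.apache.solr.client.solrj.SolrClient;
class Maintenance {
    static void nightlyOptimize(SolrClient solr) throws Exception {
        // waitFlush=true, waitSearcher=true, merge down to a single segment;
        // this rewrites the index, so expect roughly double the disk usage while it runs
        solr.optimize(true, true, 1);
    }
}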
Index Server / Search Server:
Paul Brown was right in that the best design for Solr is to have a server dedicated and tuned to indexing, and then replicate the changes to the search servers. You can set up the index server with multiple index endpoints.
eg: http://solrindex01/index1; http://solrindex01/index2
And since the index server is not searching for content you can have it set up with different memory footprints and index warming commands etc.
Hope this is useful info for everyone.
Actually, committing often and optimizing makes things really slow. It's too heavy.
After a day of searching and reading stuff, I found out this:
1- Optimize causes the index to double in size while being optimized, and makes things really slow.
2- Committing after each add is NOT a good idea; it's better to commit a couple of times a day, and to run an optimize at most once a day.
3- Commit should be set up as autoCommit in the solrconfig.xml file, and tuned there according to your needs.
The way that this sort of thing is usually done is to perform commit/optimize operations on a Solr node located out of the request path for your users. This requires additional hardware, but it ensures that the performance penalty of the indexing operations doesn't impact your users. Replication is used to periodically shuttle optimized index files from the master node to the nodes that perform search queries for users.
Try it first. It would be really bad if you avoided a simple and elegant solution just because you read that it might cause a performance problem. In other words, avoid premature optimization.

Max length for Solr Delete by query

I am working on some improvements to our reindexing process. We have custom logic to figure out which documents have been modified and need to be reindexed, so at the end I can generate a delete query along the lines of "delete all documents where fieldId is in this list".
That way, instead of deleting and re-adding 50k documents every time, we only reindex a tiny percentage of them.
Now I am thinking about the edge case where the list of fieldIds is extremely large, say 30-40,000 IDs. In that case, is there an upper limit on request length that I should worry about, or would such a query hurt performance and make the situation worse instead of better?
I have read some articles advising to send it as a POST request instead.
I am using the latest SolrNet build, which is built against Solr 4.0.
I would revisit that logic, because deleting the documents and then reindexing them is not the best solution. Firstly, it is an expensive operation; secondly, your index will be empty or incomplete for a while until you reindex the documents, which means that if you query the index in the middle of the operation you could get zero or partial results.
I would advise simply indexing again with the same document id (the uniqueKey defined in Solr's schema.xml); Solr is smart enough to overwrite the document when it is indexed with the same id. Then you don't have to worry about the hassle of deleting old documents. You might also optimize the index from time to time to physically get rid of 'deleted' documents.
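The question is about SolrNet, but the idea is the same in any client: send the updated document with its existing uniqueKey and Solr replaces the old version. A SolrJ sketch for illustration (field names are placeholders):
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;
class Upsert {
    static void reindexModified(SolrClient solr, String id, String title) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);       // same uniqueKey as the existing document
        doc.addField("title", title); // updated field values
        solr.add(doc);                // the old version is marked deleted, the new one replaces it
        solr.commit();
    }
}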
