Solr storage handling

I have a six-node Solr cluster, and every node has 200GB of storage. We created one collection with two shards.
I would like to know what will happen if my data reaches 400GB (node1 200GB, node2 200GB). Will Solr automatically use another free node from my cluster?

What will happen if my data reaches 400GB (node1 200GB, node2 200GB)?
Ans: I am not sure exactly what error you would get, but in production you should try not to get into this situation. To avoid or handle such scenarios there are monitoring and autoscaling trigger APIs.
Will Solr automatically use another free node from my cluster?
Ans: No, extra shards will not be added automatically. However, whenever you observe that search is getting slow, or Solr is approaching the physical limits of its machines, you should go for a SPLITSHARD operation, for example as sketched below.
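A minimal sketch of a manual split through the Collections API (the collection and shard names here are placeholders):

curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1'

Note that the new sub-shards initially live on the same nodes as the parent shard, so you may still need to add or move replicas (for example with ADDREPLICA) to spread the data onto the free nodes.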
So ultimately you can handle this with autoscaling triggers. That is, you can set up an autoscaling trigger that detects when a shard crosses specified limits on the number of documents or the size of the index, and once a limit is reached the trigger can call SPLITSHARD (see the sketch after the quoted passage below).
The linked documentation says:
This trigger can be used for monitoring the size of collection shards,
measured either by the number of documents in a shard or the physical
size of the shard’s index in bytes.
When either of the upper thresholds is exceeded the trigger will
generate an event with a (configurable) requested operation to perform
on the offending shards - by default this is a SPLITSHARD operation.
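As a rough sketch, such an indexSize trigger can be registered through the autoscaling API (Solr 7.x); the trigger name and threshold values below are made-up placeholders you would tune to your own hardware:

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/admin/autoscaling' -d '{
  "set-trigger": {
    "name": "index_size_trigger",
    "event": "indexSize",
    "aboveDocs": 100000000,
    "aboveBytes": 150000000000,
    "waitFor": "1m",
    "enabled": true
  }
}'

When a shard exceeds either threshold the trigger fires an event whose default requested operation is SPLITSHARD, as the quoted passage describes.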

Related

Update SOLR document without adding deleted documents

I'm running a lot of Solr document updates, which results in hundreds of thousands of deleted documents and a significant increase in disk usage (hundreds of GB).
I'm able to remove all deleted documents by doing an optimize:
curl http://localhost:8983/solr/core_name/update?optimize=true
But this takes hours to run and requires a lot of RAM and disk space.
Is there a better way to remove deleted documents from the SOLR index or to update a document without creating a deleted one?
Thanks for your help!
Lucene uses an append-only strategy, which means that when a new version of an old document is added, the old document is marked as deleted and a new one is inserted into the index. This approach allows Lucene to avoid rewriting the whole index file as documents are added, at the cost of old documents physically remaining in the index until a merge or an optimize happens.
When you issue expungeDeletes, you're telling Solr to perform a merge if the number of deleted documents exceeds a certain threshold; in effect, you're forcing an optimize behind the scenes where Solr deems it necessary.
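For example, a minimal way to request this (reusing the core name from your question) is a commit with expungeDeletes enabled:

curl 'http://localhost:8983/solr/core_name/update?commit=true&expungeDeletes=true'

This only merges segments whose share of deleted documents crosses the threshold, so it is usually cheaper than a full optimize, though it is still an expensive operation.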
How you can work around this depends on more specific information about your use case; in the general case, just leaving the standard settings for merge factors etc. should be good enough. If you're not seeing any merges, you might have disabled automatic merges (given your index size, seeing hundreds of thousands of deleted documents seems excessive for an indexing process taking 2m30s). In that case, make sure merging is enabled properly and tweak its values again. There are also changes introduced in 7.5 to the TieredMergePolicy that allow even more detailed control (and possibly better defaults) for the merge process.
If you're re-indexing your complete dataset each time, indexing into a separate collection/core and then switching an alias over (or renaming the core) when finished, before removing the old dataset, is also an option.
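On SolrCloud the switch can be done atomically with a collection alias; a sketch, with made-up collection and alias names:

# index the full dataset into a fresh collection, e.g. products_v2, then point the alias at it:
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2'
# once queries go through the alias, drop the old collection to reclaim the disk:
curl 'http://localhost:8983/solr/admin/collections?action=DELETE&name=products_v1'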

How to monitor Vespa index disk usage and number of indexed documents

I am trying to monitor my Vespa cluster (with the help of the Prometheus exporter), but I can't find the right metrics to observe to know the space my index is taking, nor the space my replicas are taking. I would also like a simple way of visualizing the number of documents indexed in my cluster, but I can't find one. I have found the vespa_container_documents_total metric, but its value is always zero. The only way I've found to get its real value is to perform a search request on the cluster; then the metric is populated, but only for a short time (about one minute), after which it goes back to zero.
So, is there a way to simply monitor those two metrics?
Take a look at https://docs.vespa.ai/documentation/reference/metrics-health-format.html; you want to gather metrics from the searchnode service, not the container. If you fetch metrics from the searchnode's metrics port, you'll find a ton of metrics related to disk usage, documents indexed, documents active, and more.
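As a rough sketch, you can locate the searchnode's HTTP state port with vespa-model-inspect and then scrape its metrics endpoint (the host and port below are placeholders; the actual port is allocated per deployment):

vespa-model-inspect service searchnode    # lists the searchnode's host/port pairs; use the HTTP state port
curl http://localhost:19108/state/v1/metrics    # per-node metrics: disk usage, documents active/ready, etc.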

SOLR Cloud 4.3 delete data from index and disc

I wish to clear the data from SOLR Cloud 4.3 (both from the index and the disc; no recovery is needed).
I ran the following query:
http://host:port/solr/core/update?stream.body=<delete><query>*:*</query></delete>&commit=true
This deletes the data from the index itself, but the data is still on the disc (I am not familiar with how Solr saves the data, but the disc size remains the same). Is there a property that I need to add in order to delete the data itself from the disc?
There are 2 shards managed with ZooKeeper.
Thanks
When you run the query, all the documents will be marked as deleted; the space will not be cleaned up immediately. The next time the segment merger executes, it will discard all the deleted documents from the old segments. Once the merging process completes, the old segments are discarded and the space is reclaimed.
The underlying Lucene data structure for storage is called a segment. It is immutable in nature, so you cannot update or delete entries in it directly. When a background merge of segments happens, as per the merge policy defined in the configuration, the updates and deletes are reflected in the new segments. Until then, Lucene just sets a bit indicating that a document is deleted so it is not included in results.
Partial updates in Solr are also treated as deleting the old document and re-indexing it with the updated values.
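If you want the disk space back right away instead of waiting for a natural merge, one option (sketched here with the same placeholder host, port, and core as in the question) is to force a merge after the delete; note that it temporarily needs extra disk while the segments are rewritten:

curl 'http://host:port/solr/core/update?optimize=true&maxSegments=1'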

Solr indexing issue (out of memory) - looking for a solution

I have a large index of 50 million docs, all on the same machine (no sharding).
I don't have an ID that would allow me to update only the wanted docs, so for each update I must delete the whole index and index everything from scratch, committing only at the end when I'm done indexing.
My problem is that every few index runs, my Solr crashes with an out-of-memory exception. I am running with 12.5 GB of memory.
From what I understand, until the commit everything is kept in memory, so I'm storing 100M docs in memory instead of 50M. Am I right?
But I cannot commit while I'm indexing, because I deleted all the docs at the beginning, so I'd be serving a partial index, which is bad.
Are there any known solutions for that? Can sharding solve it, or will I still have the same problem?
Is there a flag that allows me to make soft commits without changing the visible index until the hard commit?
You can use master-slave replication. Just dedicate one machine to do your indexing (the master Solr), and then, when it's finished, you can tell the slave to replicate the index from the master machine. The slave will download the new index, and it will only delete the old index if the download is successful, so it's quite safe.
http://wiki.apache.org/solr/SolrReplication
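Replication is configured via the ReplicationHandler in solrconfig.xml (see the wiki page above); once it is set up, a pull can also be triggered on the slave on demand, roughly like this (the host and core name are placeholders):

curl 'http://slave-host:8983/solr/core_name/replication?command=fetchindex'

The slave keeps serving its current index until the new one has been fully downloaded and installed.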
Another solution that avoids all this replication set-up is to use a reverse proxy: put nginx or something similar in front of your Solr. Use one machine for indexing the new data and the other for searching, and make the reverse proxy always point at the one not currently doing any indexing.
If you do one of these, you can commit as often as you want.
And because it's generally a bad idea to do indexing and searching on the same machine, I would prefer the master-slave solution (not to mention you have 50M docs).
The out-of-memory error can be addressed by giving more memory to the JVM of your container; it has nothing to do with your cache.
Use better options for garbage collection, because the source of the error is your JVM memory being full.
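For example, with the Jetty-based start script that older Solr versions such as 4.x ship with, the heap size and garbage collector are set on the java command line; the sizes here are only placeholders to adapt to your machine:

java -Xms4g -Xmx14g -XX:+UseG1GC -jar start.jar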
Increase the number of threads, because if the number of threads for a process is reached, a new process is spawned (which has the same number of threads as the prior one and the same memory allocation).
Please also write about any CPU spikes and any other type of caching mechanism you are using.
One other thing you can try is to set all autowarm counts to 0; it will speed up commit time.
regards
Rajat

Number Found Accuracy on Search API Affecting Cursor Results

When using the Google App Engine Search API, if we have a query that returns a large result set (>1000) and we need to iterate using the cursor to collect the entire result set, we get indeterminate results for the documents returned if the number_found_accuracy is lower than our result size.
In other words, the same query run twice, collecting all the documents via cursors, does not return the same documents, UNLESS our number_found_accuracy is higher than the result size (e.g., using the 10000 maximum). Only then do we always get the same documents.
Our understanding of how the number_found_accuracy is supposed to work is that it would only affect the number_found estimation. We assumed that if you use the cursor to get all the results, you would be able to get the same results as if you had run one large query.
Are we mis-understanding the use of the number_found_accuracy or cursors, or have we found a bug?
Your understanding of number_found_accuracy is correct. I think that the behavior you're observing is the surprising interplay between replicated query failover and how queries that specify number_found_accuracy affect future queries using continuation tokens.
When you index documents using the Search API, we store them in several distinct replicas using the same mechanism as the High Replication Datastore (i.e., Megastore). How quickly those documents become live on each replica depends on many factors. It's usually immediate, but the delay can become much longer if you're doing batch writes to a single (index, namespace) pair.
Searches can get executed on any of these replicas. We'll even potentially run a search that uses a continuation token on a different replica than the original search. If the original replica and/or continuation replica are catching up on their indexing work, they might have different sets of live documents. It will become consistent "eventually" but it's not always immediate.
If you specify number_found_accuracy on a query, we have to run most of the query as if we're going to return number_found_accuracy results. We specifically have to read much further down the posting lists to find and count matching documents. Reading a posting list results in its associated file block being inserted into various caches.
In turn, when you do the search using a cursor we'll be able to read the document for real much more quickly on the same replica that we'd used for the original search. You're thus less likely to have the continuation search failover to a different replica that might not have finished indexing the same set of documents. I think that the inconsistent results you've observed are the result of this kind of continuation query failover.
In summary, setting number_found_accuracy to something large effectively prewarms that replica's cache. It will thus almost certainly be the fastest replica for a continuation search. In the face of replicas that are trying to catch up on indexing, that will give the appearance that number_found_accuracy has a direct effect on the consistency of results, but in reality it's just a side-effect.
