I wish to clear the data from SolrCloud 4.3 (both from the index and from disk; no recovery is needed).
I ran the following query:
http://host:port/solr/core/update?stream.body=<delete><query>*:*</query></delete>&commit=true
This deletes the data from the index itself, but the data is still on disk (I am not familiar with how Solr stores its data, but the disk size remains the same). Is there a property I need to add in order to delete the data from disk as well?
There are two shards managed with ZooKeeper.
Thanks
When you run the query, all the documents are marked as deleted; the space is not cleaned up immediately. The next time the segment merger executes, it discards all the deleted documents from the old segments. Once the merging process completes, the old segments are dropped and the space is reclaimed.
The underlying Lucene storage data structure is called a segment, and it is immutable, so you cannot update or delete entries in it directly. When a background merge of segments happens, as per the merge policy defined in the configuration, the updates/deletes are reflected in the new segments. Until then, Lucene just sets a bit indicating that the document is deleted, so it is not included in results.
Partial (atomic) updates in Solr are likewise treated as deleting the old document and re-indexing it with the updated values.
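If you want to reclaim the disk space right away rather than waiting for a background merge, you can ask Solr to expunge deleted documents or to optimize the index. A minimal sketch against the URL from the question (host, port and core are the same placeholders used above):
# Merge away segments that are dominated by deleted documents:
curl "http://host:port/solr/core/update?commit=true&expungeDeletes=true"
# Or force a full merge, which rewrites the index without the deleted documents:
curl "http://host:port/solr/core/update?optimize=true"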
Related
I have a six-node Solr cluster, and every node has 200 GB of storage. We created one collection with two shards.
I would like to know what will happen if my documents reach 400 GB (node 1: 200 GB, node 2: 200 GB). Will Solr automatically use another free node from my cluster?
What happens if my documents reach 400 GB (node 1: 200 GB, node 2: 200 GB)?
Ans: I am not sure exactly what error you would get, but in production you should try not to reach this situation. To avoid/handle such scenarios there are monitoring and autoscaling trigger APIs.
Will Solr automatically use another free node from my cluster?
Ans: No, extra shards will not be added automatically. However, whenever you observe that search is getting slow, or that Solr is approaching the physical limits of your machines, you should go for SPLITSHARD.
So ultimately you can handle this with autoscaling triggers. That is, you can set up an autoscaling trigger that detects when a shard crosses specified limits on the number of documents or the size of the index, and once a limit is reached the trigger can call SPLITSHARD (see the sketch after the quoted documentation).
The linked documentation mentions:
This trigger can be used for monitoring the size of collection shards,
measured either by the number of documents in a shard or the physical
size of the shard’s index in bytes.
When either of the upper thresholds is exceeded the trigger will
generate an event with a (configurable) requested operation to perform
on the offending shards - by default this is a SPLITSHARD operation.
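For illustration, a split can be requested manually through the Collections API, and (in Solr 7.5+) an index-size trigger can be registered so that the split is requested automatically. This is a hedged sketch; the host, collection name and thresholds below are placeholders:
# Manually split shard1 of "mycollection" into two sub-shards:
curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1"
# Sketch of an autoscaling IndexSizeTrigger that requests a SPLITSHARD once a shard
# exceeds roughly 150 GB or 100M documents:
curl -X POST -H 'Content-Type: application/json' \
  http://localhost:8983/solr/admin/autoscaling -d '{
  "set-trigger": {
    "name": "index_size_trigger",
    "event": "indexSize",
    "aboveBytes": 161061273600,
    "aboveDocs": 100000000,
    "waitFor": "1m",
    "enabled": true,
    "actions": [
      {"name": "compute_plan", "class": "solr.ComputePlanAction"},
      {"name": "execute_plan", "class": "solr.ExecutePlanAction"}
    ]
  }
}'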
I'm running a lot of Solr document updates, which results in hundreds of thousands of deleted documents and a significant increase in disk usage (hundreds of GB).
I'm able to remove all deleted documents by doing an optimize:
curl http://localhost:8983/solr/core_name/update?optimize=true
But this takes hours to run and requires a lot of RAM and disk space.
Is there a better way to remove deleted documents from the Solr index, or to update a document without creating a deleted one?
Thanks for your help!
Lucene uses an append only strategy, which means that when a new version of an old document is added, the old document is marked as deleted, and a new one is inserted into the index. This way allows Lucene to avoid rewriting the whole index file as documents are added, at the cost of old documents physically still being present in the index - until a merge or an optimize happens.
When you issue expungeDeletes, you're telling Solr to perform a merge if the number of deleted documents exceeds a certain threshold, in effect forcing an optimize behind the scenes as Solr deems necessary.
How you can work around this depends on more specific information about your use case; in the general case, just leaving the standard settings for merge factors etc. in place should be good enough. If you're not seeing any merges, you might have disabled automatic merges (depending on your index size; seeing hundreds of thousands of deleted documents seems excessive for an indexing process taking 2m30s). In that case, make sure merging is enabled properly and tweak its values again. There are also changes introduced in 7.5 to the TieredMergePolicy that allow even more detailed control (and possibly better defaults) over the merge process.
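If you want to trigger the merge-with-expunge on demand instead of waiting for the merge policy, it can be sent as a parameter on an ordinary commit. A minimal sketch, reusing the host and core name from the question above:
# Commit and ask Solr to merge away segments with many deleted documents;
# cheaper than a full optimize because only qualifying segments get rewritten:
curl "http://localhost:8983/solr/core_name/update?commit=true&expungeDeletes=true"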
If you're re-indexing your complete dataset each time, indexing to a separate collection/core and then switching an alias over or renaming the core when finished before removing the old dataset is also an option.
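As an illustration of the alias approach (the collection names here are hypothetical): index into a fresh collection, then atomically repoint the alias your application queries, and drop the old collection once nothing references it:
# Point the "products" alias at the newly built collection; readers switch instantly:
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2"
# Remove the old collection to free the disk space:
curl "http://localhost:8983/solr/admin/collections?action=DELETE&name=products_v1"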
I'm using SolrCloud 7.4 with 3 instances (16 GB RAM each) and have 1 collection with 10M documents. At the start it was really fast; almost no query took more than 2 seconds.
Then I updated it with transaction data (e.g. popularity) from another Oracle database to make my collection more relevant. I simply loop over the transactions and apply Solr atomic updates (set and inc) to about 1~10 fields (almost all of type float and long). But there are more than 300M transaction records, so the process sends the set and inc updates to the Solr collection in batches of 10k transactions.
The update from the 300M transactions is processed only once; after that it is maybe 50k/day, processed at 0 am.
In the end, the collection still has 10M documents, but my queries seem to have slowed down to almost 10 seconds.
Looking at the shard overview, each shard has 20+ segments and about half of the documents in them are deleted documents.
Have I missed something here? Why has my query time dropped?
How do I speed it up again like before?
Should I create a new collection and re-index my 10M documents into it after the atomic updates (from the 300M transactions)?
The issue is caused by a large number of segments being created, mostly consisting of deleted documents. When you're doing an atomic update, the previous document is fetched, the value is changed, and the new document (with the new value) is indexed. This leaves the old document as deleted, while the new document is written to a new file.
These segments are merged when the mergeFactor value is hit; i.e. when the number of segments gets high enough, they're merged into a new segment file instead of keeping multiple files around. When this merge happens, deleted documents are expunged (there is no need to write documents that no longer exist to the new file).
You can force this process to happen by issuing an optimize. While you usually can rely on mergeFactor to do the job for you (depending on the value of mergeFactor and your indexing strategy), for datasets where everything is updated in one go, such as once at night, issuing an optimize afterwards works fine.
The downside is that it requires extra processing (which would happen anyway if you just relied on mergeFactor, just not all at the same time), and up to 2x the current size of the index as temporary disk space.
You can perform an optimize by calling the update endpoint for your collection: http://localhost:8983/solr/collection/update?optimize=true&maxSegments=1&waitFlush=false
The maxSegments value tells Solr how many segments it is acceptable to end up with. The default value is 1, which will be fine for most use cases.
While calling optimize has gotten a bad rep (since mergeFactor usually should do the work for you, and people tend to call optimize far too often), this is a perfectly fine use case for optimize. There are also optimization enhancements for the optimize command in 7.5, which will help avoid the previous worst case scenarios.
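Since the bulk of the updates arrives in the nightly 0 am batch, one hedged way to schedule this is a cron entry that fires after that job finishes (the time and the collection name are assumptions):
# Run a daily optimize at 01:00, after the nightly atomic-update batch:
0 1 * * * curl -s "http://localhost:8983/solr/collection/update?optimize=true&waitSearcher=false" > /dev/null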
I'm trying to understand better the immutability of SSTables in Cassandra. It's very clear what happens in an insert operation, or in an update/delete operation when the data still exists in the memtable. But it's not clear what happens when I want to modify data that has already been flushed out.
So I understand the simple scenario: I execute an insert operation and the data is written to a memtable. When the memtable is full, it's flushed to an SSTable.
Now, how does modification of data occur? What happens when I execute a delete or update command (when the data has been flushed out)? If the SSTable is immutable, how will the data get deleted/updated? And how does the memtable work for delete and update commands (on data that does not exist in it because it has been flushed out)? What will the memtable contain?
In Cassandra / Scylla you ALWAYS append, meaning any operation, whether it's an insert / update / delete, will create a new entry for that partition containing the new data and a new timestamp. In the case of a delete operation, the new entry will actually be a tombstone with the new timestamp (indicating that the previous data was deleted). This applies whether the data is still in memory (memtable) or already flushed to disk (an SSTable has been created).
Several "versions" of the same partition, with different data and different timestamps, can reside in multiple SSTables (and even in memory) at the same time. SSTables are merged during compaction, and there are several compaction strategies that can be applied.
When the gc_grace_period (default: 10 days, tunable) has expired, on the next compaction that tombstone will be removed, meaning the data that was deleted and the tombstone indicating the latest action (delete), will not get merged into the new sstable.
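As a concrete, hedged illustration with the standard tooling (the keyspace ks and table t are made-up names): you can watch this lifecycle by flushing memtables and forcing a compaction by hand, and gc_grace_seconds is just a table property:
# Flush memtables to disk, creating immutable SSTables:
nodetool flush ks t
# Force a compaction so overlapping SSTables are merged and expired tombstones are dropped:
nodetool compact ks t
# Tune how long tombstones are kept before they are eligible for removal (value in seconds; 864000 = 10 days):
cqlsh -e "ALTER TABLE ks.t WITH gc_grace_seconds = 864000;"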
The internal implementation of the memtables might be slightly different between Scylla and Cassandra but for the sake of simplicity let's assume it is the same.
You are welcome to read more about the architecture in the following documentation:
SStables
Compaction strategies
Solr returns old data when faceting, picked up from old deleted or updated documents.
For example, we are faceting on name, and name changes frequently in our application. When we re-index the document after changing the name, we get both the old name and the new name in the search results. After digging into this, I learned that Solr indexes are composed of write-once segments, and each segment contains a set of documents. Whenever a hard commit happens these segments are closed, and even if a document is deleted after that, the segment still contains it (marked as deleted). These documents are not cleared immediately. They are not displayed in the search results, but somehow faceting is still able to access that data.
Optimizing fixed this issue, but we cannot do that every time a customer changes data in production. I tried the options below and they did not work for me.
1) expungeDeletes.
Added the lines below in solrconfig.xml:
<autoCommit>
  <maxTime>30000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<!-- softAutoCommit is like autoCommit except it causes a
     'soft' commit which only ensures that changes are visible
     but does not ensure that data is synced to disk. This is
     faster and more near-realtime friendly than a hard commit.
-->
<autoSoftCommit>
  <maxTime>10000</maxTime>
</autoSoftCommit>
<commit waitSearcher="false" expungeDeletes="true"/>
2) Using TieredMergePolicyFactory might not help me, as the threshold may not always be reached and users would see old data in the meantime.
3) One more way of doing it is calling the optimize() method exposed in SolrJ once daily, but I'm not sure what impact this will have on performance.
The number of documents we index per server will be at most 2M-3M.
Please suggest if there is any solution to this.
Let me know if more data is needed.