SOLRCloud 7.3.1: CDCR issues - solr

We have the following environment setup:
SOLRCloud
SOLR Version: 7.3.1
9 Nodes per DC
2 DCs
2 separate ZK ensembles (one for each SOLR DC)
CDCR bidirectional enabled.
2 Collections.
3 shards per collection, replication factor 3.
Basic auth enabled. (Aware of the CDCR basic auth issues, so added the other DC's node information as part of live_nodes.)
ZK ACL enabled.
Solr Node JVM heap=64 GB with G1GC enabled and tuned.
#########################
solrConfig settings for CDCR
<lst name="replicator">
<str name="threadPoolSize">8</str>
<str name="schedule">1000</str>
<str name="batchSize">512</str>
</lst>
<lst name="updateLogSynchronizer">
<str name="schedule">1000</str>
</lst>
#########################
-Dsolr.autoCommit.maxTime=60000 -Dsolr.autoSoftCommit.maxTime=1000
#########################
Now, we are seeing the following issues:
1. Data inserted into one DC is not forwarded to the other DC when no hard commit is issued after the insert.
2. Data inserted into one DC is not forwarded to the other DC even when a hard commit is issued after the insert. Verified with /get as well.
3. After doing a hard commit on the target DC and a RELOAD, data started showing up, but numFound does not match across DCs.
Errors:
Each individual shard leader's queueSize was either -1 or 0, and we also saw bad_request:
8983/solr/collection_name_shard2_replica_n6/cdcr?action=QUEUES
{
"responseHeader":{
"status":0,
"QTime":1},
"queues":[
"abc.com:2181,abc1.com:2181,abc2.com:2181",[
"collection_name",[
"queueSize",0,
"lastTimestamp","2018-08-01T17:21:29.990Z"]]],
"tlogTotalSize":16545113,
"tlogTotalCount":5,
"updateLogSynchronizer":"stopped"}
ERROR from log:
INFO - 2018-07-31 17:54:46.722; [ ]
org.apache.solr.handler.CdcrReplicatorManager$BootstrapStatusRunnable; CDCR
bootstrap successful in 5 seconds
INFO - 2018-07-31 17:54:46.889; [ ]
org.apache.solr.handler.CdcrReplicatorManager$BootstrapStatusRunnable;
Create new update log reader for target collection_name with checkpoint
1607545724212346885 # collection_name:shard2
ERROR - 2018-07-31 17:54:47.052; [ ]
org.apache.solr.handler.CdcrReplicatorManager$BootstrapStatusRunnable;
Unable to bootstrap the target collection collection_name shard: shard2
WARN : [c:collection_name s:shard2 r:core_node11
x:collection_name_shard2_replica_n8]
org.apache.solr.handler.CdcrRequestHandler; The log reader for target
collection collection_name is not initialised # collection_name:shard2
So we are wondering how to proceed further. Thanks in advance.
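When checking many shard leaders, it can help to flatten the QUEUES response into one row per target collection. The following is a minimal sketch that parses the response shown above; it assumes the nested lists are flat key/value pairs (as they appear in that output), and the helper names are made up for illustration. The WARN line in the log above suggests that a queueSize of -1 goes together with an uninitialised log reader, but treat that reading as an assumption.

```python
import json

def named_list_to_dict(items):
    """Convert a flat [key, value, key, value, ...] list into a dict."""
    return dict(zip(items[0::2], items[1::2]))

def summarize_cdcr_queues(response_text):
    """Return (rows, synchronizer_state) from a cdcr?action=QUEUES response.

    Each row is (zk_host, collection, queue_size, last_timestamp).
    """
    body = json.loads(response_text)
    rows = []
    for zk_host, collections in named_list_to_dict(body["queues"]).items():
        for collection, stats in named_list_to_dict(collections).items():
            info = named_list_to_dict(stats)
            rows.append((zk_host, collection, info["queueSize"], info["lastTimestamp"]))
    return rows, body.get("updateLogSynchronizer")

# The response from the question, copied as-is.
SAMPLE = """
{"responseHeader": {"status": 0, "QTime": 1},
 "queues": ["abc.com:2181,abc1.com:2181,abc2.com:2181",
            ["collection_name",
             ["queueSize", 0, "lastTimestamp", "2018-08-01T17:21:29.990Z"]]],
 "tlogTotalSize": 16545113,
 "tlogTotalCount": 5,
 "updateLogSynchronizer": "stopped"}
"""

rows, sync_state = summarize_cdcr_queues(SAMPLE)
for zk_host, collection, queue_size, last_ts in rows:
    # Assumption: -1 is worth flagging because the log reader may not be initialised.
    flag = " (check log reader)" if queue_size == -1 else ""
    print(f"{collection} -> {zk_host}: queueSize={queue_size}{flag}, last={last_ts}")
print("updateLogSynchronizer:", sync_state)
```

Running this against the output of every shard leader makes it easier to see which shards report -1 and which report 0.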

Related

Solr Repeaters/Slaves are replicating after every commit on Master instead of only after Optimize

I have a Master-Repeater-Slave configuration. The master, repeaters, and slaves are all set up with <str name="replicateAfter">optimize</str> in their replication config; the full config is below:
<requestHandler name="/replication" class="solr.ReplicationHandler">
<str name="commitReserveDuration">01:00:00</str>
<lst name="master">
<str name="enable">${Project.enable.master:false}</str>
<str name="replicateAfter">${Project.master.setReplicateAfterCommit:}</str>
<str name="replicateAfter">${Project.master.setReplicateAfterStartup:}</str>
<str name="replicateAfter">optimize</str>
<str name="confFiles"></str>
</lst>
<lst name="slave">
<str name="enable">${Project.enable.slave:false}</str>
<str name="masterUrl">/solr/someCoreName</str>
<str name="pollInterval">${Newton.replication.pollInterval:00:02:00}</str>
</lst>
</requestHandler>
Repeaters are configured to poll every 1 second.
The N slaves are configured to poll at different intervals (e.g. 2, 4, 6, 8 minutes) so as not to overwhelm the repeater with download requests.
Both are set via Java startup command-line arguments.
Now, given that I issue an optimize on the master index every 2 hours, I expect the master to make a replicable version available only after an optimize. But it seems that the master's generation increases after every commit, which happens every X (configurable) minutes, and the repeaters and slaves get the unoptimized index (a recent state with the latest committed data).
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">some/dir</str>
</updateLog>
<autoCommit>
<maxDocs>10000000</maxDocs>
<maxTime>${Project.autoCommit.maxTime:60000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
</updateHandler>
Repeater/Slave logs after they see Master Generation increment
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Master's generation: 6
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Master's version: 1567288083960
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Slave's generation: 5
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Slave's version: 1567288023785
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Starting replication process
2019-08-31 14:48:05,563 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Number of files in latest index in master: 66
2019-08-31 14:48:05,624 [INFO ][indexFetcher-15-thread-1][solr.update.DefaultSolrCoreState][changeWriter()] - New IndexWriter is ready to be used.
2019-08-31 14:48:05,627 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Starting download (fullCopy=false) to MMapDirectory#/data/<path>/index.20190831144805564 lockFactory=org.apache.lucene.store.NativeFSLockFactory#416c5340
Question:
How do I make absolutely sure that the index flows from the master to the repeaters/slaves only after the optimize command I issued is complete?
Note
Once I issue an optimize, the optimized index with 1 segment does flow to the repeaters/slaves as expected. However, the intermediate commits on the master also cause the repeaters/slaves to download part of the new index, which pushes their segment count above 1 and slows search, since searching more than one segment costs more than searching a single segment. I want the new index to be replicated only after the periodic (programmed in code) optimize command is issued, not after every commit. I did try removing the commit interval on the master, and then it incremented its generation only after an optimize. But if I remove commits altogether, we risk losing uncommitted data if a machine dies between two optimize cycles.
Solr/luceneMatchVersion Version
7.7.1
I also tried adding mergePolicyFactory settings, but the behaviour is still the same:
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">32</int>
<int name="segmentsPerTier">32</int>
</mergePolicyFactory>
Try changing <str name="replicateAfter">commit</str> to <str name="replicateAfter">optimize</str>.
Also, if that does not work, try removing the polling interval configuration from the slaves.
What you are seeing is expected Solr behaviour; nothing unusual is happening. Try out the changes and I hope it works.
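Since two of the replicateAfter entries in the question come from ${...} properties, one thing worth checking is which triggers are actually non-empty once the properties are substituted. Below is a rough sketch of that check. The substitution logic is a simplified stand-in for the ${name:default} syntax used in the config, the property values passed in are assumptions (the question does not say what they are set to), and whether Solr ignores an empty replicateAfter entry is not something this sketch verifies.

```python
import re
import xml.etree.ElementTree as ET

# Matches ${name} and ${name:default}.
PLACEHOLDER = re.compile(r"\$\{([^:}]+)(?::([^}]*))?\}")

def substitute(text, props):
    """Replace ${name:default} with props[name], else the default, else ''. Simplified."""
    def repl(match):
        name, default = match.group(1), match.group(2)
        return props.get(name, default if default is not None else "")
    return PLACEHOLDER.sub(repl, text or "")

def effective_replicate_after(handler_xml, props):
    """List the non-empty replicateAfter values in the master section."""
    root = ET.fromstring(handler_xml)
    master = root.find("./lst[@name='master']")
    values = [substitute(el.text, props).strip()
              for el in master.findall("./str[@name='replicateAfter']")]
    return [v for v in values if v]

# Master section copied from the question.
CONFIG = """<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">${Project.master.setReplicateAfterCommit:}</str>
    <str name="replicateAfter">${Project.master.setReplicateAfterStartup:}</str>
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>"""

only_optimize = effective_replicate_after(CONFIG, {})
# Hypothetical: what the list looks like if a startup flag sets the property to "commit".
with_commit = effective_replicate_after(
    CONFIG, {"Project.master.setReplicateAfterCommit": "commit"})
print(only_optimize)
print(with_commit)
```

If the second case matches how the master is started, the config effectively contains a commit trigger alongside optimize, which would be consistent with the generation increasing on every commit.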

How can I check for more corrupted Solr indexed nodes in Alfresco?

We found a bug in DIT where a CMIS query returned some nodes in archive://SpacesStore among many seemingly valid nodes.
Using CMIS Workbench, when I clicked on an archived node's objectId, CMIS Workbench said that the node could not be found.
The query used IN_TREE like this:
SELECT * FROM <model> WHERE IN_TREE('6fd7f269-9d44-40a3-9152-1b89d6a3d07c')
Alfresco support suggested that our Solr index could be corrupted, and that we could simply reindex. I preferred to surgically fix the problem, so I ended up doing this:
Look up node ids in the database, and run the reindex script on each.
select n.id as node_id, n.UUID
from alf_node n
where n.uuid in ('a06d87d6-e49a-4f85-b5c3-2e68a81ad760',
'b6bb26d1-aaf2-4a80-8ce9-8aae93f89d9d',
'10951fbd-f2ab-4247-89c1-b050aa00a4f9')
I found two db node ids per uuid, so I did some more digging.
1724775 b6bb26d1-aaf2-4a80-8ce9-8aae93f89d9d
1724776 b6bb26d1-aaf2-4a80-8ce9-8aae93f89d9d
1726270 10951fbd-f2ab-4247-89c1-b050aa00a4f9
1726271 10951fbd-f2ab-4247-89c1-b050aa00a4f9
1726260 a06d87d6-e49a-4f85-b5c3-2e68a81ad760
1726261 a06d87d6-e49a-4f85-b5c3-2e68a81ad760
Those nodes had some properties with a string_value that started with 'workspace':
select n.id as node_id, n.uuid
from alf_node_properties p, alf_node n
where n.uuid in ('a06d87d6-e49a-4f85-b5c3-2e68a81ad760',
'b6bb26d1-aaf2-4a80-8ce9-8aae93f89d9d',
'10951fbd-f2ab-4247-89c1-b050aa00a4f9')
and p.STRING_VALUE like 'workspace%'
and n.id = p.node_id;
I used these node ids to send to Solr
1724775 b6bb26d1-aaf2-4a80-8ce9-8aae93f89d9d
1726270 10951fbd-f2ab-4247-89c1-b050aa00a4f9
1726260 a06d87d6-e49a-4f85-b5c3-2e68a81ad760
http://<solr_host>/solr/admin/cores?action=REINDEX&nodeid=1724775
http://<solr_host>/solr/admin/cores?action=REINDEX&nodeid=1726270
http://<solr_host>/solr/admin/cores?action=REINDEX&nodeid=1726260
Each time I got this response:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
</response>
Rerun
SELECT * FROM <model> WHERE IN_TREE('6fd7f269-9d44-40a3-9152-1b89d6a3d07c')
in CMIS workbench with max hits 1000.
Scroll through the results, and find no more archive records.
I cleaned up the corrupted indexed nodes that I found from that CMIS search, but how can I find such corrupted nodes throughout my index? I tried adding criteria to my search to select only those nodes, but that just returned an empty result set. Can anyone explain how or why this could have happened, and how I can check for more corrupted Solr indexed nodes in Alfresco?
By the way, we are still on Alfresco 4.2 and Solr 1
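For reference, the manual procedure above (find UUIDs that map to more than one database node id, then build one REINDEX request per chosen node id) can be sketched as follows. This only automates the steps already described; it does not detect corruption on its own. The host name is a placeholder, the helper names are hypothetical, and the URL pattern is the one used in the question.

```python
from collections import defaultdict
from urllib.parse import urlencode

def group_node_ids(rows):
    """Group (node_id, uuid) rows from alf_node by uuid."""
    by_uuid = defaultdict(list)
    for node_id, uuid in rows:
        by_uuid[uuid].append(node_id)
    return dict(by_uuid)

def reindex_url(solr_host, node_id):
    """Build the REINDEX URL for one node id, following the pattern in the question."""
    query = urlencode({"action": "REINDEX", "nodeid": node_id})
    return f"http://{solr_host}/solr/admin/cores?{query}"

# Rows returned by the first query in the question.
ROWS = [
    (1724775, "b6bb26d1-aaf2-4a80-8ce9-8aae93f89d9d"),
    (1724776, "b6bb26d1-aaf2-4a80-8ce9-8aae93f89d9d"),
    (1726270, "10951fbd-f2ab-4247-89c1-b050aa00a4f9"),
    (1726271, "10951fbd-f2ab-4247-89c1-b050aa00a4f9"),
    (1726260, "a06d87d6-e49a-4f85-b5c3-2e68a81ad760"),
    (1726261, "a06d87d6-e49a-4f85-b5c3-2e68a81ad760"),
]

grouped = group_node_ids(ROWS)
duplicated = {uuid: ids for uuid, ids in grouped.items() if len(ids) > 1}
for uuid, ids in duplicated.items():
    print(uuid, "has node ids", ids)

# "solr.example.com" is a placeholder host, not a real server.
urls = [reindex_url("solr.example.com", node_id) for node_id in (1724775, 1726270, 1726260)]
for url in urls:
    print(url)
```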

Solr replication is slow

We have an old Solr 3.6 server and replication is behaving very strangely.
Look at the image. It is extremely slow. It says that the connection is slow, but that may not actually be true, because even after several minutes the number of KB downloaded does not change at all.
It is also wrong that it shows a total download of 419 GB; that is the size of the whole index, but we are not copying all of it.
I can see the "downloading File" gets to 100% in a second, and then the rest is all waiting time. Even when it goes faster, the wait is always around 120 seconds before the index moves to the next version.
It stays in this state sometimes for a long time (like 5 to 20 minutes) and then suddenly it is all done.
Sometimes it is quick instead.
We have a replication configuration like this:
<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="master">
<str name="enable">${solr.master.enable:false}</str>
<str name="replicateAfter">startup</str>
<str name="replicateAfter">commit</str>
</lst>
<lst name="slave">
<str name="enable">${solr.slave.enable:false}</str>
<str name="masterUrl">http://10.20.16.125:8080/solr/replication</str>
<str name="pollInterval">00:00:60</str>
</lst>
</requestHandler>
Several causes can lead to this kind of issue:
A java.lang.OutOfMemoryError happening during replication (to troubleshoot this, refer to "How to deal with out of memory problems" in the Apache Solr Cookbook);
Frequent segment merges, which can be caused by:
optimization running after each commit;
a wrong merge policy or merge factor.
As a next step I advise you to:
Verify whether the Solr server log contains OutOfMemoryError or other interesting errors.
Verify how frequently optimization is performed (do you have a trigger in your code?).
Lower the merge factor to 2 (<mergeFactor>2</mergeFactor>).
Try <useCompoundFile>true</useCompoundFile>, which tells Solr to use the compound index structure and thus reduces the number of files that make up the index and the number of merges required.
Verify whether there is an open merge policy bug for your Solr/Lucene version.
Some additional interesting info can be found in this answer.
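The first check (looking for OutOfMemoryError in the server log) can be done with a few lines of Python. This is only a sketch: the sample lines are placeholders standing in for a real log, not actual Solr output, and the function name is made up.

```python
def find_oom_lines(lines):
    """Return (line_number, text) for every line that mentions OutOfMemoryError."""
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if "OutOfMemoryError" in line]

# Placeholder lines standing in for a real log file; they are not real Solr output.
SAMPLE_LINES = [
    "placeholder: replication started\n",
    "placeholder: java.lang.OutOfMemoryError: Java heap space\n",
    "placeholder: replication finished\n",
]

hits = find_oom_lines(SAMPLE_LINES)
for number, text in hits:
    print(f"line {number}: {text}")
```

For a real log, pass an open file object instead of the sample list, since a file iterates line by line in the same way.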

Storing filter definition on Solr server

I have a situation where all my queries have some sub filter queries which are added each time and are very long.
The query filters are the same each time so it is a waste of time sending them over and over to Solr server and parsing them on the other side just to find them in the cache.
Is there a way I can send filter query definition once to the Solr server and then reference it in following queries?
You can add a static configuration directive in your solr config (solrconfig.xml):
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="appends">
<str name="fq">foo:value</str>
</lst>
</requestHandler>
This will always append the fq term to the request's parameters before the SearchHandler runs the query. Other options are invariants or defaults. See Request Handlers and Search Handlers on the community wiki for more information.
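To make the difference between the three options concrete, here is a small model of how they combine with the parameters sent by the client: defaults apply only when the client did not send the parameter, appends are added to whatever the client sent, and invariants replace it. This is an illustration of the semantics, not Solr's code, and merge_params is a made-up name.

```python
def merge_params(request, defaults=None, appends=None, invariants=None):
    """Combine request parameters with handler-level defaults, appends and invariants.

    Every argument maps a parameter name to a list of values.
    """
    merged = {name: list(values) for name, values in (defaults or {}).items()}
    for name, values in request.items():
        merged[name] = list(values)                 # client values replace defaults
    for name, values in (appends or {}).items():
        merged.setdefault(name, []).extend(values)  # appended on top of client values
    for name, values in (invariants or {}).items():
        merged[name] = list(values)                 # invariants always win
    return merged

# The client sends only its own short filter; the long shared filter lives in the config.
result = merge_params(
    request={"q": ["laptop"], "fq": ["inStock:true"]},
    defaults={"rows": ["10"]},
    appends={"fq": ["foo:value"]},
)
print(result)
```

With appends, the client never has to send the long filter again, but it can still add its own fq values per request. Use invariants instead if clients must not be able to change the parameter at all.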

What is DataImportHandler doing after Indexing completed?

I am using Solr to index about 40M items, and the final index is about 20 GB. Below is the message after a delta import:
<lst name="statusMessages">
<str name="Time Elapsed">0:51:44.149</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">5634016</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-09-27 01:25:17</str>
<str name="">
Indexing completed. Added/Updated: 5634016 documents. Deleted 0 documents.
</str>
I am wondering what Solr is doing in this state. The message that replication?command=details returns is:
<lst name="masterDetails">
<str name="indexSize">36.69 GB</str>
The index is almost double the size and is still growing. This confuses me. I am doing a delta import, so why does the index double in size when documents are replaced?
If you are replacing most of your documents, that's normal. An update in Lucene consists of a deletion and a re-insertion of the document, since index segments are write-once. When you delete a document, you are not really deleting it but only marking it as deleted, again because the segments are write-once.
Deleted documents are removed for real when the next merge happens, when a new, bigger segment is created out of the small segments that you have. That's when you should see the index size decrease, so your index size shouldn't only increase. Merges happen more or less often according to the merge policy in use. If you want to force a merge manually, you can use the forceMerge operation, which is the new name for optimize. Depending on the Solr version in use, you need to use one name or the other. Be careful, since forceMerge takes a while if you have a lot of documents. Have a look at this article too.
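A toy model may make the size behaviour easier to see. The class below is not Lucene; it only mimics the two rules described above: an update marks the old copy as deleted and writes a new copy into a new segment, and a merge drops the copies marked as deleted. All names are made up for the illustration.

```python
class ToyIndex:
    """Toy model of write-once segments: each segment maps doc_id -> live flag."""

    def __init__(self):
        self.segments = []

    def add_batch(self, doc_ids):
        """Index a batch; older copies of the same ids are only marked as deleted."""
        doc_ids = list(doc_ids)
        for segment in self.segments:
            for doc_id in doc_ids:
                if segment.get(doc_id):
                    segment[doc_id] = False  # mark as deleted, keep the stored copy
        self.segments.append({doc_id: True for doc_id in doc_ids})

    def stored_docs(self):
        """Copies on disk, including those marked as deleted."""
        return sum(len(segment) for segment in self.segments)

    def live_docs(self):
        """Copies that are still visible to searches."""
        return sum(sum(segment.values()) for segment in self.segments)

    def force_merge(self):
        """Rewrite all live copies into one segment, dropping deleted ones."""
        merged = {doc_id: True
                  for segment in self.segments
                  for doc_id, live in segment.items() if live}
        self.segments = [merged]

index = ToyIndex()
index.add_batch(range(5))   # initial import
index.add_batch(range(5))   # delta import that replaces every document
before = (index.stored_docs(), index.live_docs())
index.force_merge()
after = (index.stored_docs(), index.live_docs())
print("before merge:", before)
print("after merge:", after)
```

Replacing every document doubles the stored copies while the live count stays the same, which matches the roughly doubled index size in the question; the merge brings the stored count back down.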
Before Solr 3.6, dataImportHandler set optimize=true by default:
http://wiki.apache.org/solr/DataImportHandler
This triggers merging of all segments into one regardless of other settings. I think you might be able to address this by adding an optimize checkbox to debug.jsp, though I haven't actually tried it.
