Solr replication is slow

We have an old Solr 3.6 server and replication is behaving very strangely.
The replication admin page makes it look extremely slow. It reports that the connection is slow, but that may not actually be true, because even after several minutes the number of KB downloaded does not change at all.
It is also misleading that it shows a total download of 419 GB: that is the size of the whole index, but we are not copying all of it.
I can see that "downloading File" reaches 100% in a second and then the rest is all waiting time. Even when it goes faster, the wait time is always around 120 seconds before the index moves to the next version.
It sometimes stays in this state for a long time (5 to 20 minutes) and then suddenly it is all done.
Other times it is quick instead.
We have a replication configuration like this:
<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="master">
<str name="enable">${solr.master.enable:false}</str>
<str name="replicateAfter">startup</str>
<str name="replicateAfter">commit</str>
</lst>
<lst name="slave">
<str name="enable">${solr.slave.enable:false}</str>
<str name="masterUrl">http://10.20.16.125:8080/solr/replication</str>
<str name="pollInterval">00:00:60</str>

There are several possible causes that can lead to such an issue:
java.lang.OutOfMemoryError occurring during replication (to troubleshoot this kind of issue, please refer to "How to deal with out of memory problems" in the Apache Solr Cookbook);
frequent segment merges, which can be caused by:
an optimization running after each commit;
a wrong merge policy or merge factor.
As next steps I advise you to:
Verify in the Solr server log the presence of OutOfMemoryError or other interesting errors.
Verify how frequently the optimization is performed (do you have a trigger in your code?).
Lower the merge factor to 2 (<mergeFactor>2</mergeFactor>; see the snippet after this list).
Try <useCompoundFile>true</useCompoundFile>, which tells Solr to use the compound index structure more, reducing the number of files that make up the index and the number of merges required.
Verify whether a merge-policy bug is open for your Solr/Lucene version.
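On Solr 3.6, both settings live in the <indexDefaults> section of solrconfig.xml; a minimal sketch (values illustrative):
<indexDefaults>
  <!-- pack each segment into a single compound (.cfs) file -->
  <useCompoundFile>true</useCompoundFile>
  <!-- merge more aggressively, keeping fewer segments on disk -->
  <mergeFactor>2</mergeFactor>
</indexDefaults>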
Some additional interesting info can be found in this answer.

Related

maxCommitsToKeep value=1 configuration in Solr

<str name="maxCommitsToKeep">1</str>
What is this field used for? What happens if we increase the value of this key? Can someone help me with this?
It's part of the deletion policy.
The policy has sub-parameters for the maximum number of commits to keep (maxCommitsToKeep), the maximum number of optimized commits to keep (maxOptimizedCommitsToKeep), and the maximum age of any commit to keep (maxCommitAge). Increasing maxCommitsToKeep makes Solr retain older commit points (snapshots of the index) on disk instead of deleting them as soon as a new commit arrives, at the cost of extra disk space.
<reopenReaders>true</reopenReaders>
<deletionPolicy class="solr.SolrDeletionPolicy">
<str name="maxCommitsToKeep">1</str>
<str name="maxOptimizedCommitsToKeep">0</str>
<str name="maxCommitAge">1DAY</str>
</deletionPolicy>
<infoStream>false</infoStream>
For more information, please check the documentation

Solr Repeaters/Slaves replicating on every commit on Master instead of after Optimize

I have a Master-Repeater-Slave configuration. Master/Slaves/Repeaters are set up with this replication config: <str name="replicateAfter">optimize</str>; the full config is below.
<requestHandler name="/replication" class="solr.ReplicationHandler">
<str name="commitReserveDuration">01:00:00</str>
<lst name="master">
<str name="enable">${Project.enable.master:false}</str>
<str name="replicateAfter">${Project.master.setReplicateAfterCommit:}</str>
<str name="replicateAfter">${Project.master.setReplicateAfterStartup:}</str>
<str name="replicateAfter">optimize</str>
<str name="confFiles"></str>
</lst>
<lst name="slave">
<str name="enable">${Project.enable.slave:false}</str>
<str name="masterUrl">/solr/someCoreName</str>
<str name="pollInterval">${Newton.replication.pollInterval:00:02:00}</str>
</requestHandler>
Repeaters are configured to poll every 1 sec.
N slaves are configured to poll at different intervals so as not to overwhelm the repeater with download requests, e.g. 2, 4, 6, 8 minutes.
Both are set via Java startup command-line arguments.
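For reference, a sketch of how such properties are typically passed at startup (property names taken from the configs above; the exact start command depends on your deployment):
bin/solr start -DProject.enable.slave=true -DNewton.replication.pollInterval=00:04:00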
Now, given that I issue an optimize on the Master index every 2 hours, I expect the master to make a new replicable version available only after the optimize. But it seems that the master's generation increases after every commit (which happens every X configurable minutes), and the repeater and slaves fetch the unoptimized (but recent) state with the latest committed data.
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">some/dir</str>
</updateLog>
<autoCommit>
<maxDocs>10000000</maxDocs>
<maxTime>${Project.autoCommit.maxTime:60000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
</updateHandler>
Repeater/Slave logs after they see Master Generation increment
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Master's generation: 6
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Master's version: 1567288083960
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Slave's generation: 5
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Slave's version: 1567288023785
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Starting replication process
2019-08-31 14:48:05,563 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Number of files in latest index in master: 66
2019-08-31 14:48:05,624 [INFO ][indexFetcher-15-thread-1][solr.update.DefaultSolrCoreState][changeWriter()] - New IndexWriter is ready to be used.
2019-08-31 14:48:05,627 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Starting download (fullCopy=false) to MMapDirectory#/data/<path>/index.20190831144805564 lockFactory=org.apache.lucene.store.NativeFSLockFactory#416c5340
Question:
How do I make absolutely sure that the index only flows from Master to Repeaters/Slaves after my issued optimize command has completed?
Note
Once I issue optimize, the optimized index with 1 segment does flow to the Repeaters/Slaves as expected, but intermediate commits on the master also cause the Repeaters/Slaves to download parts of the new index, pushing their segment count above 1 and slowing search, since searching across multiple segments costs more than searching a single one. I want the new index to flow only after the periodic (programmed in code) optimize command, not after every commit. I actually removed the commit interval on the master, and then it only incremented its generation after optimize; but if I remove commits altogether, we risk losing uncommitted data if a machine dies between two optimize cycles.
Solr/luceneMatchVersion Version
7.7.1
I also tried adding mergePolicyFactory configs, but the behaviour is still the same:
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">32</int>
<int name="segmentsPerTier">32</int>
</mergePolicyFactory>
Try changing <str name="replicateAfter">commit</str> to <str name="replicateAfter">optimize</str>, so that optimize is the only replication trigger on the master.
Also, if that does not work, try removing the polling interval configuration from the slaves.
What you are seeing is expected behaviour for Solr; nothing is unusual. Try out the changes, and I hope it will work fine.
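A master section with optimize as the only trigger might look like this minimal sketch (property name taken from the config above):
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${Project.enable.master:false}</str>
    <!-- no commit/startup entries: optimize is the only replication trigger -->
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>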

Apache Solr Index Benchmarking

I recently started playing around with Apache Solr and currently trying to figure out the best way to benchmark the indexing of a corpus of XML documents. I am basically interested in the throughput (documents indexed/second) and index size on disk.
I am doing all this on Ubuntu.
Benchmarking Technique
Run the following 5 times and take the average total time (a scripted sketch of this loop follows the steps):
Index documents [curl http://localhost:8983/solr/core/dataimport?command=full-import]
Get 'Time taken' name attribute from XML response when status is 'idle' [curl http://localhost:8983/solr/core/dataimport]
Get size of 'data/index' directory
Delete Index [curl http://localhost:8983/solr/core/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8']
Commit [curl http://localhost:8983/solr/w5/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8']
Re-index documents
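A shell sketch of that loop, assuming the default localhost URL and a core named "core" as in the commands above (the index directory path is illustrative):
#!/bin/bash
for i in 1 2 3 4 5; do
  # kick off a full import
  curl -s "http://localhost:8983/solr/core/dataimport?command=full-import"
  # wait until the import handler reports idle
  until curl -s "http://localhost:8983/solr/core/dataimport" | grep -q '>idle<'; do
    sleep 5
  done
  # read the 'Time taken' line from the status response
  curl -s "http://localhost:8983/solr/core/dataimport" | grep 'Time taken'
  # index size on disk (path illustrative)
  du -sh /var/solr/data/core/data/index
  # delete everything and commit so the next run starts clean
  curl "http://localhost:8983/solr/core/update" --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
  curl "http://localhost:8983/solr/core/update" --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
done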
Questions
I intend to calculate my throughput by dividing the number of documents indexed by average total time taken; is this fine?
Are there tools (like SolrMeter for query benchmarking) or standard scripts already available that I could use to achieve my objectives? I do not want to re-invent the wheel...
Is my approach fine?
Is there an easier way of getting the index size as opposed to performing a 'du' on the data/index/ directory?
Where can I find information on how to interpret XML response attributes (see sample output below). For instance, I would want to know the difference between the QTime and Time taken values.
XML Response Used to Get Throughput
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">w5-data-config.xml</str>
</lst>
</lst>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">3200</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-12-11 14:06:19</str>
<str name="">Indexing completed. Added/Updated: 1600 documents. Deleted 0 documents.</str>
<str name="Total Documents Processed">1600</str>
<str name="Time taken">0:0:10.233</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
To question 1:
I would suggest you try to index more than one XML file (with different datasets) and compare the results. That way you will know whether it is OK to simply divide the time taken by your number of documents.
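As a rough worked example using the sample response in the question: 1600 documents in 0:0:10.233 gives 1600 / 10.233 ≈ 156 documents per second.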
To question 2:
I didn't find any such tools; I did it on my own by developing a short Java application.
To question 3:
Which approach do you mean? I would refer to my answer to question 1...
To question 4:
The size of the index folder gives you the correct size of the whole index, so why don't you want to use it?
To question 5:
The results you get in the posted XML are transformed through an XSL file. You can find it in the /bin/solr/conf/xslt folder. There you can look up what exactly the terms mean, AND you can write your own XSL to display the results and information.
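If you write your own stylesheet, one way to apply it to query responses is Solr's XSLT response writer (the stylesheet name here is illustrative; the file goes in the conf/xslt directory):
curl "http://localhost:8983/solr/core/select?q=*:*&wt=xslt&tr=mystyle.xsl"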
Note: If you create a new XSL file, you have to change the settings in your solrconfig.xml. If you don't want to make any changes, edit the existing file.
Edit: I think the difference is that QTime is a rounded version of the 'Time taken' value; there are only even numbers in QTime.
Best regards

What is DataImportHandler doing after Indexing completed?

I am using solr to index about 40m items, and the final index file is about 20G. Below is the message after a delta import:
<lst name="statusMessages">
<str name="Time Elapsed">0:51:44.149</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">5634016</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-09-27 01:25:17</str>
<str name="">
Indexing completed. Added/Updated: 5634016 documents. Deleted 0 documents.
</str>
I am wondering what Solr is doing in this status. The response from replication?command=details is:
<lst name="masterDetails">
<str name="indexSize">36.69 GB</str>
The index has almost doubled in size and is still growing. This confuses me: I am doing a delta import, so why does the index double in size when documents are replaced?
If you are replacing most of your documents, that's normal. An update in Lucene consists of a deletion and a re-insertion of the document, since the index segments are write-once. When you delete a document, you are not really deleting it but only marking it as deleted, again because the segments are write-once.
Deleted documents are removed for real when the next merge happens, when new, bigger segments are created out of the small segments that you have. That's when you should see a decrease in index size, which means your index size shouldn't only increase. Merges happen more or less according to the merge policy in use. If you want to manually force a merge, you can use the forceMerge operation, which is the new name for optimize. Depending on the Solr version in use, you need to use either the first or the second one. Be careful, since forceMerge takes a while if you have a lot of documents. Have a look at this article too.
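If you do want to trigger it manually, an optimize can be issued through the update handler (host and core name assumed):
curl "http://localhost:8983/solr/core/update" --data '<optimize/>' -H 'Content-type:text/xml; charset=utf-8'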
Before Solr 3.6, DataImportHandler set optimize=true by default:
http://wiki.apache.org/solr/DataImportHandler
This triggers merging of all segments into one regardless of other settings. I think you might be able to address this by adding an optimize checkbox to debug.jsp, though I haven't actually tried it.
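If your version still optimizes by default, you can also override it per request with the optimize parameter (URL assumed from the standard example setup):
curl "http://localhost:8983/solr/core/dataimport?command=full-import&optimize=false"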

Querying across multiple fields with different boosts in Solr

In Solr, what is the best way of querying across different fields where each query on each field has a different weighting?
We're using C# and ASP.NET, with SolrNet being used to query Solr. Our index looks a bit like this:
document_id
title
text_content
tags
[some more fields...]
This is then queried using keywords, where each keyword has a different weight. So, for example, "ipad" might have a weight of 40, but "android" might have a weight of 25.
In conjunction with this, each field has a different base weight. For example, keywords are more valuable than page title, which are more valuable than text content.
So, we end up with something like the following:
title^25
text_content^10
tags^50
And the following keywords:
ipad^25
apple^22
microsoft^15
windows^15
software^20
computer^18
So, each search query has a different weighting, and each field has a different weight. As a result, we end up with search criteria that looks like this:
title:ipad^50
title:apple^47
title:microsoft^40
[more titles...]
text_content:ipad^35
text_content:apple^32
text_content:microsoft^25
[lots more...]
This translates into a very, very long search query, which exceeds the limit allowed. It also seems like a very inefficient way of doing things, and I was wondering if there's a better way of achieving this.
Effectively, we have a list of keywords with varied weights, and a list of fields in Solr which also have varied weights, and the idea is to query the index to retrieve the most relevant documents.
Further complicating this matter, though it may be out of the scope of this question, is that the query also includes filters to filter out documents. This is done using the following type of query:
&fq=(-document_id:4f845eb321c90b0aec5ee0eb)&fq=(-document_id:4f845cd421c90b0aec5ee041)&fq=(-document_id:4f845cea21c90b0aec5ee049)&fq=(-document_id:4f845cf821c90b0aec5ee04d)&fq=(-document_id:4f845d0e21c90b0aec5ee056)&fq=(-document_id:4f845d3521c90b0aec5ee064)&fq=(-document_id:4f845d3921c90b0aec5ee065)&fq=(-document_id:4f845d4921c90b0aec5ee06b)&fq=(-document_id:4f845d7521c90b0aec5ee07b)&fq=(-document_id:4f845d9021c90b0aec5ee084)&fq=(-document_id:4f845dac21c90b0aec5ee08e)&fq=(-document_id:4f845dbc21c90b0aec5ee093)
These can also add a lot of characters to the search query, and it would be good if there was also a better way to handle this as well.
Any help or advice is most appreciated. Thanks.
I would suggest adding those default parameters to your request handler configuration within solrconfig.xml. They are always the same, right?
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="qf">title^25 text_content^10 tags^50</str>
</lst>
</requestHandler>
You should be able to add your static filters and so on as defaults too, so that you don't have to specify those values unless you want to do something different from the default, ending up with much shorter URLs.
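With the qf defaults in place, a query then only needs the boosted keywords, and the per-document exclusions can be collapsed into a single fq with a boolean OR (a sketch using values from the question; edismax applies the qf field boosts on top of the per-term boosts):
q=ipad^25 apple^22 microsoft^15 windows^15 software^20 computer^18
fq=-document_id:(4f845eb321c90b0aec5ee0eb OR 4f845cd421c90b0aec5ee041 OR 4f845cea21c90b0aec5ee049)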
