Let's say that I have replication on the master Solr server configured like this:
<lst name="master">
<str name="enable">true</str>
<str name="replicateAfter">optimize</str>
<str name="confFiles">solrconfig.xml,schema.xml,stopwords.txt,synonyms.xml</str>
<str name="commitReserveDuration">00:00:10</str>
</lst>
and slave configured like this:
<lst name="slave">
<str name="enable">true</str>
<str name="masterUrl">masterSolr</str>
<str name="pollInterval">24:00:00</str>
</lst>
How does the slave know that an optimization was performed on the master (the master knows nothing about its slaves)?
Does the slave only check every 24 hours (and not more often)?
Will replication be performed if there was no optimization, but several commits happened on the master?
How do I reach a state where the slave replicates ONLY after an optimization (nothing else), and does so shortly after that optimization (we don't want to wait several hours)?
Replication is a pull mechanism, so to support your scenario you need a bit of configuration.
For your questions:
1. It does not - the slave pulls at intervals (or when forced) from the master, which reports which version is ready to be replicated
2. Yes - every 24 hours
3. Only if an optimize has been done since the last index fetch
4. Some configuration, and knowledge of the slaves on the master side, is needed.
You can use the postOptimize update event on the update handler to force a replication on the slaves:
<listener event="postOptimize" class="solr.RunExecutableListener">
<str name="exe">wget</str>
<str name="dir">solr/bin</str>
<bool name="wait">true</bool>
<arr name="args"> <str>http://slave_host:port/solr/core_name/replication?command=fetchindex</str> </arr>
</listener>
You can then remove the poll interval from the slave config. Note that you need to add one arg (in its own str tag) per slave.
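For example, the args array with two slaves might look like the sketch below (the hostnames, port, and core name are placeholders; wget accepts multiple URLs):

```xml
<!-- Sketch only: slave1/slave2 and core_name are placeholders for your setup. -->
<listener event="postOptimize" class="solr.RunExecutableListener">
  <str name="exe">wget</str>
  <str name="dir">solr/bin</str>
  <bool name="wait">true</bool>
  <arr name="args">
    <str>http://slave1:8983/solr/core_name/replication?command=fetchindex</str>
    <str>http://slave2:8983/solr/core_name/replication?command=fetchindex</str>
  </arr>
</listener>
```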
Related
<str name="maxCommitsToKeep">1</str>
What is this field used for? What happens if we increase the value of this key? Can someone help me with this?
It's part of the deletion policy.
The policy has sub-parameters for the maximum number of commits to keep (maxCommitsToKeep), the maximum number of optimized commits to keep (maxOptimizedCommitsToKeep), and the maximum age of any commit to keep (maxCommitAge).
<reopenReaders>true</reopenReaders>
<deletionPolicy class="solr.SolrDeletionPolicy">
<str name="maxCommitsToKeep">1</str>
<str name="maxOptimizedCommitsToKeep">0</str>
<str name="maxCommitAge">1DAY</str>
</deletionPolicy>
<infoStream>false</infoStream>
For more information, please check the documentation.
I have a master-repeater-slave configuration. The master, repeaters, and slaves are set up with the replication config <str name="replicateAfter">optimize</str>; full config below:
<requestHandler name="/replication" class="solr.ReplicationHandler">
<str name="commitReserveDuration">01:00:00</str>
<lst name="master">
<str name="enable">${Project.enable.master:false}</str>
<str name="replicateAfter">${Project.master.setReplicateAfterCommit:}</str>
<str name="replicateAfter">${Project.master.setReplicateAfterStartup:}</str>
<str name="replicateAfter">optimize</str>
<str name="confFiles"></str>
</lst>
<lst name="slave">
<str name="enable">${Project.enable.slave:false}</str>
<str name="masterUrl">/solr/someCoreName</str>
<str name="pollInterval">${Newton.replication.pollInterval:00:02:00}</str>
</lst>
</requestHandler>
Repeaters are configured to poll every 1 sec.
N slaves are configured to poll at different intervals so as not to overwhelm the repeater with download requests, e.g. 2, 4, 6, 8 minutes.
Both are set via Java startup command args.
Now, given that I issue an optimize on the master index every 2 hours, I expect the master to make a replicable version available only after the optimize. But it seems that the master's generation increases after every commit (which happens every X configurable minutes), and the repeaters and slaves then fetch the unoptimized (but recent) state with the latest committed data.
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">some/dir</str>
</updateLog>
<autoCommit>
<maxDocs>10000000</maxDocs>
<maxTime>${Project.autoCommit.maxTime:60000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
</updateHandler>
Repeater/Slave logs after they see Master Generation increment
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Master's generation: 6
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Master's version: 1567288083960
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Slave's generation: 5
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Slave's version: 1567288023785
2019-08-31 14:48:05,544 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Starting replication process
2019-08-31 14:48:05,563 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Number of files in latest index in master: 66
2019-08-31 14:48:05,624 [INFO ][indexFetcher-15-thread-1][solr.update.DefaultSolrCoreState][changeWriter()] - New IndexWriter is ready to be used.
2019-08-31 14:48:05,627 [INFO ][indexFetcher-15-thread-1][solr.handler.IndexFetcher][fetchLatestIndex()] - Starting download (fullCopy=false) to MMapDirectory#/data/<path>/index.20190831144805564 lockFactory=org.apache.lucene.store.NativeFSLockFactory#416c5340
Question:
How do I make absolutely sure that the index only flows from the master to the repeaters/slaves after my issued optimize command has completed?
Note
Once I issue an optimize, the optimized index with 1 segment does flow to the repeaters/slaves as expected. But the intermediate commits that happen on the master also cause the repeaters/slaves to download part of the new index, raising their segment count above 1 and slowing search (searching across more than one segment costs more than searching a single segment). I want a new index to flow only after the periodic (programmed in code) optimize command, not after every commit. I actually removed the commit interval on the master, and then it only incremented its generation after an optimize; but if I remove commits altogether, we risk losing uncommitted data if a machine dies between two optimize cycles.
Solr/luceneMatchVersion Version
7.7.1
I also tried adding the mergePolicyFactory configs below, but the behaviour is still the same:
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">32</int>
<int name="segmentsPerTier">32</int>
</mergePolicyFactory>
Try changing <str name="replicateAfter">commit</str> to <str name="replicateAfter">optimize</str>.
Also, if that does not work, try removing the polling interval configuration from the slaves.
What you are seeing is expected behaviour for Solr; nothing is unusual. Try out the changes and I hope it will work fine.
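For the property-driven config quoted in the question, one way to sketch this is to leave the commit/startup properties unset so that optimize is the only replicateAfter trigger, while keeping autoCommit (with openSearcher=false) purely for durability:

```xml
<!-- Sketch only: make optimize the sole replicateAfter trigger.
     Leave Project.master.setReplicateAfterCommit / ...AfterStartup unset. -->
<lst name="master">
  <str name="enable">${Project.enable.master:false}</str>
  <str name="replicateAfter">optimize</str>
</lst>
```

You can check with /replication?command=details on the master whether intermediate commits still bump the generation; that behaviour can vary by Solr version.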
If I have lots of requests that search by different addresses, may I use a wildcard in the select query, selecting all addresses for warming in the settings of the query-related listeners? I would like to cache all addresses to make subsequent queries for individual addresses faster. Or is using wildcards for caching not possible?
<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">address:*</str>
<str name="rows">10000</str>
</lst>
</arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">address:*</str>
<str name="rows">10000</str>
</lst>
</arr>
</listener>
The query address:* retrieves all documents having a non-empty value in the field address, but that won't be very useful for Solr's filter cache, since a subsequent hit would only match the wildcard itself as a filter.
You need to load documents where the address field actually matches a precise value; the wildcard in this context is treated as one distinct entry in the filter cache, not as a catch-all.
So it's not that caching a wildcard query doesn't work; it just doesn't warm the cache as you might expect or need, that is, for all distinct values in the field. (It could be useful as a "shortcut" to warm all possible results, but imagine the cost of warming a wildcard query if the field is not restricted to a finite set...)
Instead, use filter queries, each intersecting the whole set of documents (this always implies a main wildcard query q=*:* on which you apply an fq), with one fq per possible value in the field - or per most frequently submitted value if the field is unrestricted. This loads every (or the most frequently requested) subset of documents by address, which effectively warms the filter cache for each one of them.
https://lucene.apache.org/solr/guide/7_3/query-settings-in-solrconfig.html#filtercache
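As a hedged sketch of that warming setup (the address values below are invented examples; in practice you would list your most frequently queried addresses):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- One fq per frequent value: each entry warms one filter-cache slot. -->
    <lst><str name="q">*:*</str><str name="fq">address:"10 Downing Street"</str></lst>
    <lst><str name="q">*:*</str><str name="fq">address:"221B Baker Street"</str></lst>
  </arr>
</listener>
```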
I'm trying to count issues 1 to 5 with this range facet query:
...&facet.range=issue&facet.range.start=1&q=magid:abc&facet.range.end=5&facet.range.gap=1
It returns:
<lst name="issue">
<lst name="counts">
<int name="1">5</int>
<int name="2">7</int>
<int name="3">9</int>
<int name="4">7</int>
</lst>
</lst>
There's no issue 5?! Also, issue 1 should be 3, and 5 is the count for issue 2. (Then I think: "Hey! It must be the 'array elements start from 0' problem, right?!") I change facet.range.start to 0 and query again. This time it returns:
<lst name="issue">
<lst name="counts">
<int name="0">3</int>
<int name="1">5</int>
<int name="2">7</int>
<int name="3">9</int>
<int name="4">7</int>
</lst>
</lst>
Oh my! It should be issues 1~5, but instead I get 0~4. Why is Solr doing this? It is really confusing me!
I am sure these are not 0-based index values. The values you see are the actual values indexed as tokens, so if you index values from 1 to 5 you should see values from 1 to 5. (Note also that range buckets are labeled by their lower bound and only run up to, but not including, facet.range.end - so with start=1, end=5 and gap=1 there is no bucket for value 5; use facet.range.end=6 to include it.)
So, if you want to make sure whether you have documents with value 5 or not, the best way to debug this is from the Schema Browser -> Term Info.
Go to the Solr Admin interface, select the core, click on Schema Browser, choose the field name you want to see term info for, then click on Load Term Info.
I recently started playing around with Apache Solr and currently trying to figure out the best way to benchmark the indexing of a corpus of XML documents. I am basically interested in the throughput (documents indexed/second) and index size on disk.
I am doing all this on Ubuntu.
Benchmarking Technique
* Run the following 5 times & get the average total time taken *
Index documents [curl http://localhost:8983/solr/core/dataimport?command=full-import]
Get 'Time taken' name attribute from XML response when status is 'idle' [curl http://localhost:8983/solr/core/dataimport]
Get size of 'data/index' directory
Delete Index [curl http://localhost:8983/solr/core/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8']
Commit [curl http://localhost:8983/solr/w5/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8']
Re-index documents
Questions
I intend to calculate my throughput by dividing the number of documents indexed by average total time taken; is this fine?
Are there tools (like SolrMeter for query benchmarking) or standard scripts already available that I could use to achieve my objectives? I do not want to re-invent the wheel...
Is my approach fine?
Is there an easier way of getting the index size as opposed to performing a 'du' on the data/index/ directory?
Where can I find information on how to interpret XML response attributes (see sample output below). For instance, I would want to know the difference between the QTime and Time taken values.
* XML Response Used to Get Throughput *
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">w5-data-config.xml</str>
</lst>
</lst>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">3200</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-12-11 14:06:19</str>
<str name="">Indexing completed. Added/Updated: 1600 documents. Deleted 0 documents.</str>
<str name="Total Documents Processed">1600</str>
<str name="Time taken">0:0:10.233</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
To question 1:
I would suggest you try indexing more than one XML file (with different datasets) and compare the results. That way you will know whether it's okay to simply divide the time taken by your number of documents.
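To make the division concrete, here is a small sketch that converts the DIH "Time taken" value (H:M:S.millis format) into seconds and computes throughput; the time and document count are taken from the sample response above:

```shell
# Compute docs/sec from the DIH status response values quoted in the question.
time_taken="0:0:10.233"   # from <str name="Time taken">
docs=1600                 # from <str name="Total Documents Processed">

# Convert H:M:S.millis to seconds.
secs=$(echo "$time_taken" | awk -F: '{ printf "%.3f", $1*3600 + $2*60 + $3 }')
# Throughput = documents / seconds.
throughput=$(awk -v d="$docs" -v s="$secs" 'BEGIN { printf "%.1f", d/s }')
echo "indexed $docs docs in ${secs}s -> ${throughput} docs/sec"
```

Averaging the per-run throughput over your 5 runs then gives the figure you are after.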
To question 2:
I didn't find any such tools, so I did it on my own by developing a short Java application.
To question 3:
Which approach do you mean? I would point to my answer to question 1...
To question 4:
The size of the index folder gives you the correct size of the whole index, so why don't you want to use it?
To question 5:
The results you get in the posted XML are transformed through an XSL file. You can find it in the /bin/solr/conf/xslt folder. You can look up what the terms mean exactly, AND you can write your own XSL to display the results and information.
Note: If you create a new XSL file, you have to change the settings in your solrconfig.xml. If you don't want to make any changes, edit the existing file.
Edit: I think the difference is that QTime is the rounded value of the time taken; there are only even numbers in QTime.
Best regards