What is DataImportHandler doing after Indexing completed? - solr

I am using solr to index about 40m items, and the final index file is about 20G. Below is the message after a delta import:
<lst name="statusMessages">
<str name="Time Elapsed">0:51:44.149</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">5634016</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-09-27 01:25:17</str>
<str name="">
Indexing completed. Added/Updated: 5634016 documents. Deleted 0 documents.
</str>
I am wondering what solr is doing this status? and the message replication?command=details return is :
<lst name="masterDetails">
<str name="indexSize">36.69 GB</str>
The index is almost double, and is still going to be bigger. This made me very confused. I am doing delta import, why index will be double size when replace?

If you are replacing most of your documents that's normal. An update in lucene consists of a deletion and a re-insertion of the documents, since the index segments are write-once. When you delete a document, you are not really deleting it but only marking it as deleted, again because the segments are write-once.
Deleted documents will be deleted for real when the next merge happens, when a new bigger segments will be created out of the small segments that you have. That's when you should see a decreasement of the index size. That means that your index size shouldn't only increase. Merges happen more or less according to the merge policy in use. If you want to manually force a merge you can use the forceMerge operation, which is the new name for the optimize. Depending on the solr version in use you need to use either the first or the second one. Be careful, since the forceMerge takes a while if you have a lot of documents. Have a look at this article too.

Before Solr 3.6, dataImportHandler set optimize=true by default:
http://wiki.apache.org/solr/DataImportHandler
This triggers merging of all segments into one regardless of other settings. I think you might be able to address this by adding an optimize checkbox to debug.jsp, though I haven't actually tried it.

Related

How precise should be a query specified in Solr query related listeners config?

If I have lots of requests which search selecting different addresses, may I use a wildcard for select query, selecting all addresses for warming in settings of query related listeners? I would like to cache all addresses to make subsequent queries of separate addresses faster. Or using wildcards for caching isn't possible?
<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">address:*</str>
<str name="rows">10000</str>
</lst>
</arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">address:*</str>
<str name="rows">10000</str>
</lst>
</arr>
</listener>
The query address:* retrieves all documents having a non-empty value in the field address, but that won't be that much useful for Solr's filter cache since a subsequent hit would only match the wildcard character as a filter.
You need to load documents where address field actually matches a precise value, and the wildcard character in this context will be treated as a unique filter for the filter cache, not as a cacthall.
So it's not that caching a wildcard query doesn't work but it doesn't warm the cache as you might expect/need, that is for all distinct values in the field (it could be useful as a "shortcut" to warm all possible results though, but imagine the cost of warming a wildcard query if the field is not restricted to a finite set..).
Instead you may have to use filter queries, each intersecting the whole set of documents (this always implies a main wildcard query q=*:* on which you apply a fq), and using one fq per possible value in the field - or per most frequently submitted values if the field is not restricted, which will load every (or the most frequently loaded) subsets of documents by addresses, which actually means warming the filter cache for each one of them.
https://lucene.apache.org/solr/guide/7_3/query-settings-in-solrconfig.html#filtercache

Solr replication is slow

We have an old Solr 3.6 server and replication is behaving very strangely.
Look at the image. It is like super slow. It says that the connection is slow, but actually that may not be true because even after several minutes the number of kb downloaded does not change at all.
Also it is wrong that you see a total download of 419 GB, that is the whole index but we are not not copying all of it.
I can see the "downloading File" gets to 100% in a second and then the rest is all waiting time. Even when it goes faster, the wait time is always around 120sec before the index moves to the next version.
It stays in this state sometimes for a long time (like 5 to 20 minutes) and then suddenly it is all done.
Sometimes it is quick instead.
We have a replication configuration like this:
<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="master">
<str name="enable">${solr.master.enable:false}</str>
<str name="replicateAfter">startup</str>
<str name="replicateAfter">commit</str>
</lst>
<lst name="slave">
<str name="enable">${solr.slave.enable:false}</str>
<str name="masterUrl">http://10.20.16.125:8080/solr/replication</str>
<str name="pollInterval">00:00:60</str>
There are several possible causes that can lead to such issue:
java.lang.OutOfMemoryError happening during replication (in order to troubleshoot this kind of issue please refer to "How to deal with out of memory problems" in Apache Solr Cookbook);
A frequent segment merge that can be caused by:
optimization running after each commit;
wrong Merge Policy or Merge Factor;
As next step I advise to:
Verify in the Solr server log the presence of OutOfMemory or other interesting errors.
Verify how frequently the optimization is performed (do you have a trigger in your code?);
Lower the merge factor to 2 (<mergeFactor>**2**</mergeFactor>)
Try <useCompoundFile>true</useCompoundFile> that will tell Solr to use the compound index structure more and will thus reduce the number of files that create the index and the number of merges required.
Verify if there's some merge policy bug opened for your Solr/Lucene version.
Some additional interesting info can be found in this answer.

Solr supporting varying facets for different types of products

I am using Solr for indexing different types of products. The product types (category) have different facets. For example:
camera: megapixel, type (slr/..), body construction, ..
processors: no. of cores, socket type, speed, type (core i5/7)
food: type, origin, shelf-life, weight
tea: type (black/green/white/..), origin, weight, use milk?
serveware: type, material, color, weight
...
And they have common facets as well:
Brand, Price, Discount, Listing timeframe (like new), Availability, Category
I need to display the relevant facets and breadcrumbs when user clicks on any category, or brand page or a global search across all types of products. This is same as what we see on several ecommerce sites.
The query that I have is:
Since the facet types are more or less unique across different types of products, do I put them in separate schemas? Is that the way to do it? The fundamental worry is that those fields will not have any data for other types of products. And are there any implementation principles here that makes retrieving the respective faces for a given product type easier?
I would like to have a design that is scalable to accommodate lots of items in each product type as we go forward, as well as that is easy to use and performance oriented, if possible. Right now I am having a single instance of Solr.
The only risk of underpopulated facets are when they misrepresent the search. I'm sure you've used a search site where the metadata you want to facet on is underpopulated so that when you apply the facet you also eliminate from your result set a number of records that should have been included. The thing to watch is that the facet values are populated consistently where they are appropriate. That means that your "tea" records don't need to have a number of cores listed, and it won't impact anything, but all of your "processor" records should, and (to whatever extent possible) they should be populated consistently. This means that if one processor lists its number of cores as "4", and another says "quadcore", these are two different values and a user applying either facet value will eliminate the other processor from their result. If a third quadcore processor is entirely missing the "number of cores" stat from the no_cores facet field (field name is arbitrary), then your facet could be become counterproductive.
So, we can throw all of these records into the same Solr, and as long as the facets are populated consistently where appropriate, it's not really necessary that they be populated for all records, especially when not applicable.
Applying facets dynamically
Most of what you need to know is in the faceting documentation of Solr. The important thing is to specify the appropriate arguments in your query to tell Solr which facets you want to use. (Until you actually facet on a field, it's not a facet but just a field that's both stored="true" and indexed="true".) For a very dynamic effect, you can specify all of these arguments as part of the query to Solr.
&facet=true
This may seem obvious, but you need to turn on faceting. This argument is convenient because it also allows you to turn off faceting with facet=false even if there are lots of other arguments in your query detailing how to facet. None of it does anything if faceting is off.
&facet.field=no_cores
You can include this field over and over again for as many fields as you're interested in faceting on.
&facet.limit=7
&f.no_cores.facet.limit=4
The first line here limits the number of values for returned by Solr for each facet field to 7. The 7 most frequent values for the facet (within the search results) will be returned, with their record counts. The second line overrides this limit for the no_cores field specifically.
&facet.sort=count
You can either list the facet field's values in order by how many appear in how many records (count), or in index order (index). Index order generally means alphabetically, but depends on how the field is indexed. This field is used together with facet.limit, so if the number of facet values returned is limited by facet.limit they will either be the most numerous values in the result set or the earliest in the index, depending on how this value is set.
&facet.mincount=1
There are very few circumstances that you will want to see facet values that appear zero times in your search results, and this can fix the problem if it pops up.
The end result is a very long query:
http://localhost/solr/collecion1/search?facet=true&facet.field=no_cores&
facet.field=socket_type&facet.field=processor_type&facet.field=speed&
facet.limit=7&f.no_cores.facet.limit=4&facet.mincount=1&defType=dismax&
qf=name,+manufacturer,+no_cores,+description&
fl=id,name,no_cores,description,price,shipment_mode&q="Intel"
This is definitely effective, and allows for the greatest amount of on-the-fly decision-making about how the search should work, but isn't very readable for debugging.
Applying facets less dynamically
So these features allow you to specify which fields you want to facet on, and do it dynamically. But, it can lead to a lot of very long and complex queries, especially if you have a lot of facets you use in each of several different search modes.
One option is to formalize each set of commonly used options in a request handler within your solrconfig.xml. This way, you apply the exact same arguments but instead of listing all of the arguments in each query, you just specify which request handler you want.
<requestHandler name="/processors" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<str name="fl">id,name,no_cores,description,price,shipment_mode</str>
<str name="qf">name, manufacturer, no_cores, description</str>
<str name="sort">score desc</str>
<str name="rows">30</str>
<str name="wt">xml</str>
<str name="q.alt">*</str>
<str name="facet.mincount">1</str>
<str name="facet.field">no_cores</str>
<str name="facet.field">socket_type</str>
<str name="facet.field">processor_type</str>
<str name="facet.field">speed</str>
<str name="facet.limit">10</str>
<str name="facet.sort">count</str>
</lst>
<lst name="appends">
<str name="fq">category:processor</str>
</lst>
</requestHandler>
If you set up a request hander in solrconfig.xml, all it does is serve as a shorthand for a set of query arguments. You can have as many request handlers as you want for a single solr index, and you can alter them without rebuilding the index (reload the Solr core or restart the server application (JBoss or Tomcat, e.g.), to put changes into effect).
There are a number of things going on with this request handler that I didn't get into, but it's all just a way of representing default Solr request arguments so that your live queries can be simpler. This way, you might make a query like:
http://localhost/solr/collection1/processors?q="Intel"
to return a result set with all of your processor-specific facets populated, and filtered so that only processor records are returned. (This is the category:processor filter, which assumes a field called category where all the processor records have a value processor. This is entirely optional and up to you.) You will probably want to retain the default search request handler that doesn't filter by record category, and which may not choose to apply any of the available (stored="true" and indexed="true") fields as active facets.

Understading Solr nested queries

I'm trying to understand solr nested queries but I'm having a problem undestading the syntax.
I have the following two indexed documents (among others):
<doc>
<str name="city">Guarulhos</str>
<str name="name">Fulano Silva</str>
</doc>
<doc>
<str name="city">Fortaleza</str>
<str name="name">Fulano Cardoso Silva</str>
</doc>
If I query for q="Fulano Silva"~2&defType=edismax&qf=name&fl=score I have:
<doc>
<float name="score">28.038431</float>
<str name="city">Guarulhos</str>
<str name="name">Fulano Silva</str>
</doc>
<doc>
<float name="score">19.826164</float>
<str name="city">Fortaleza</str>
<str name="name">Fulano Cardoso Silva</str>
</doc>
So I thought that if I queried for:
q="Fulano Silva"~2 AND __query__="{!edismax qf=city}fortaleza" &defType=edismax&qf=name&fl=score
I'd give a bit more score for the second document, but actually I get an empty result set with numFound=0.
What am I doing wrong here?
Need to remove the "=" and replace it with ":" to use the nested query syntax:
q="Fulano Silva"~2 AND _query_:"{!edismax qf=city}fortaleza" &defType=edismax&qf=name&fl=score
*Use _query_: instead of _query_=
Hope this works...
EDIT: When you say q=, are you specifying the query in a URL, or is the text after the q= being put in an application or the Solr dashboard? If we're talking about a URL, you may need to use percent-encoding to get it to work. I mentioned that below, but since I haven't heard from you, I thought I'd reiterate.
Why don't you do q=name:"Fulano Silva" AND city:"fortaleza"?
Another possibility: q=_query_:"{!edismax qf='name'}Fulano Silva" AND city:"fortaleza"
If you're set on a nested query, select?defType=edismax&q="Fulano Silva" AND _query_:"{!edismax qf='city' v='fortaleza'}" should work, but the results and the way it matches will depend on what analyzers you are using to query and index name and city. Also, if these queries are in your query string, make sure you are
encoding them properly.
In order to help you any more, I need to know what you're trying to accomplish with your query. Then perhaps we can be sure you have the right indexing set up, that edismax is the right query handler, etc.
On top of the previous comments, the asker has mispelled _query_ as __query__ (note the double underscore in the second, mispelled, version); Solr expects _query_ to be spelled with only one underscore (_) before and one after the word query, not two.

Apache Solr Index Bechmarking

I recently started playing around with Apache Solr and currently trying to figure out the best way to benchmark the indexing of a corpus of XML documents. I am basically interested in the throughput (documents indexed/second) and index size on disk.
I am doing all this on Ubuntu.
Benchmarking Technique
* Run the following 5 times& get average total time taken *
Index documents [curl http://localhost:8983/solr/core/dataimport?command=full-import]
Get 'Time taken' name attribute from XML response when status is 'idle' [curl http://localhost:8983/solr/core/dataimport]
Get size of 'data/index' directory
Delete Index [curl http://localhost:8983/solr/core/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8']
Commit [curl http://localhost:8983/solr/w5/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8']
Re-index documents
Questions
I intend to calculate my throughput by dividing the number of documents indexed by average total time taken; is this fine?
Are there tools (like SolrMeter for query benchmarking) or standard scripts already available that I could use to achive my objectives? I do not want to re-invent the wheel...
Is my approach fine?
Is there an easier way of getting the index size as opposed to performing a 'du' on the data/index/ directory?
Where can I find information on how to interpret XML response attributes (see sample output below). For instance, I would want to know the difference between the QTime and Time taken values.
* XML Response Used to Get Throughput *
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">w5-data-config.xml</str>
</lst>
</lst>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">3200</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-12-11 14:06:19</str>
<str name="">Indexing completed. Added/Updated: 1600 documents. Deleted 0 documents.</str>
<str name="Total Documents Processed">1600</str>
<str name="Time taken">0:0:10.233</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
To question 1:
I would suggest you should try to index more than 1 XML (with different dataset) file and compare the given results. Thats the way you will know if it´s ok to simply divide the taken time with your number of documents.
To question 2:
I didn´t find any of these tools, I did it by my own by developing a short Java application
To question 3:
Which approach you mean? I would link to my answer to question 1...
To question 4:
The size of the index folder gives you the correct size of the whole index, why don´t you want to use it?
To question 5:
The results you get in the posted XML is transfered through a XSL file. You can find it in the /bin/solr/conf/xslt folder. You can look up what the termes exactly means AND you can write your own XSL to display the results and informations.
Note: If you create a new XSL file, you have to change the settings in your solrconfig.xml. If you don´t want to make any changes, edit the existing file.
edit: I think the difference is, that the Qtime is the rounded value of the taken time value. There are only even numbers in Qtime.
Best regards

Resources