Apache Solr Index Benchmarking

I recently started playing around with Apache Solr and am currently trying to figure out the best way to benchmark the indexing of a corpus of XML documents. I am basically interested in throughput (documents indexed per second) and index size on disk.
I am doing all this on Ubuntu.
Benchmarking Technique
Run the following five times and take the average total time (a scripted sketch follows the steps):
Index documents [curl http://localhost:8983/solr/core/dataimport?command=full-import]
Get 'Time taken' name attribute from XML response when status is 'idle' [curl http://localhost:8983/solr/core/dataimport]
Get size of 'data/index' directory
Delete Index [curl http://localhost:8983/solr/core/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8']
Commit [curl http://localhost:8983/solr/w5/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8']
Re-index documents
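For reference, the whole cycle can be scripted. Below is a minimal bash sketch of the loop above; the core name ("core"), the data directory path, and the '>idle<' grep are assumptions based on the sample response, so adjust them for your setup.

#!/bin/bash
# Minimal benchmark loop: import, wait for idle, record time and size, wipe.
SOLR='http://localhost:8983/solr/core'          # assumed core name
DATA_DIR=/path/to/solr/core/data/index          # placeholder path

for run in 1 2 3 4 5; do
  # Step 1: kick off a full import
  curl -s "$SOLR/dataimport?command=full-import" > /dev/null
  # Step 2: poll until the handler reports idle, then record 'Time taken'
  until curl -s "$SOLR/dataimport" | grep -q '>idle<'; do sleep 2; done
  curl -s "$SOLR/dataimport" | grep 'Time taken'
  # Step 3: index size on disk
  du -sh "$DATA_DIR"
  # Steps 4-5: delete everything and commit before the next run
  curl -s "$SOLR/update" --data '<delete><query>*:*</query></delete>' \
       -H 'Content-type:text/xml; charset=utf-8'
  curl -s "$SOLR/update" --data '<commit/>' \
       -H 'Content-type:text/xml; charset=utf-8'
done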
Questions
I intend to calculate my throughput by dividing the number of documents indexed by the average total time taken; is this fine?
Are there tools (like SolrMeter for query benchmarking) or standard scripts already available that I could use to achieve my objectives? I do not want to re-invent the wheel...
Is my approach fine?
Is there an easier way of getting the index size as opposed to performing a 'du' on the data/index/ directory?
Where can I find information on how to interpret the XML response attributes (see sample output below)? For instance, I would want to know the difference between the QTime and 'Time taken' values.
XML Response Used to Get Throughput
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">w5-data-config.xml</str>
</lst>
</lst>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">3200</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-12-11 14:06:19</str>
<str name="">Indexing completed. Added/Updated: 1600 documents. Deleted 0 documents.</str>
<str name="Total Documents Processed">1600</str>
<str name="Time taken">0:0:10.233</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

To question 1:
I would suggest indexing more than one XML file (with different datasets) and comparing the results. That way you will know whether it's OK to simply divide the time taken by the number of documents.
To question 2:
I didn't find any such tools; I did it on my own by developing a short Java application.
To question 3:
Which approach do you mean? I would refer to my answer to question 1...
To question 4:
The size of the index folder gives you the correct size of the whole index, so why don't you want to use it?
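That said, if you would rather ask Solr than the filesystem, the replication handler also reports the index size (assuming it is enabled in your solrconfig.xml); a quick sketch:

$ curl -s 'http://localhost:8983/solr/core/replication?command=details'

Look for the indexSize entry in the response.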
To question 5:
The results you get in the posted XML are transformed through an XSL file, which you can find in the /bin/solr/conf/xslt folder. There you can look up what the terms mean exactly, and you can also write your own XSL to display the results and information.
Note: if you create a new XSL file, you have to change the settings in your solrconfig.xml accordingly. If you don't want to make any changes there, edit the existing file instead.
Edit: I think the difference is that QTime is the rounded value of the 'Time taken' value; there are only even numbers in QTime.
Best regards

Related

Can a SOLR faceted result using the sum function be further filtered on the server?

In SOLR version 6.0.0, can a faceted result be further filtered before the response is returned to the client? My use case uses the facet "sum" function to calculate the aggregate deposit for each customer. So far so good.
I would like to filter the rows returned in the facet response to only show me those customers who have deposited above a certain threshold. I can't seem to find a way to do that. Is it possible?
I am trying to avoid processing the response after it returns from the SOLR server; I am wondering if it is possible to accomplish this on the server side. The reason is that the data set might be very large. If I did this on the client side, I might need to do many faceted searches with the 'limit' and 'offset' parameters to find the location of my threshold, and only then continue with the use case.
References consulted:
http://yonik.com/json-facet-api/
Below is what I did to setup the environment.
Created a CSV file named entry.csv with the following data:
id,name_s,date_dt,flow_s,amount_f
1,John,2016-01-01T00:00:00Z,Deposit,10
2,Mary,2016-01-15T00:00:00Z,Deposit,20
3,Peter,2016-01-19T00:00:00Z,Deposit,30
4,John,2016-01-20T00:00:00Z,Deposit,40
5,Mary,2016-01-22T00:00:00Z,Deposit,50
6,Mary,2016-01-23T00:00:00Z,Deposit,60
Start the SOLR server
$ bin/solr start
Create a new core named simple.
$ bin/solr create -c simple
Import the data
$ bin/post -c simple ~/entry.csv
Query the data to validate import was successful
$ curl -s 'http://localhost:8983/solr/simple/select?indent=on&q=*:*&wt=json'
{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"*:*",
"indent":"on",
"wt":"json"}},
"response":{"numFound":6,"start":0,"docs":[
{
"id":"1",
"name_s":"John",
"date_dt":"2016-01-01T00:00:00Z",
"flow_s":"Deposit",
"amount_f":10.0,
"_version_":1532465926194069504},
{
"id":"2",
"name_s":"Mary",
"date_dt":"2016-01-15T00:00:00Z",
"flow_s":"Deposit",
"amount_f":20.0,
"_version_":1532465926248595456},
{
"id":"3",
"name_s":"Peter",
"date_dt":"2016-01-19T00:00:00Z",
"flow_s":"Deposit",
"amount_f":30.0,
"_version_":1532465926250692608},
{
"id":"4",
"name_s":"John",
"date_dt":"2016-01-20T00:00:00Z",
"flow_s":"Deposit",
"amount_f":40.0,
"_version_":1532465926252789760},
{
"id":"5",
"name_s":"Mary",
"date_dt":"2016-01-22T00:00:00Z",
"flow_s":"Deposit",
"amount_f":50.0,
"_version_":1532465926253838336},
{
"id":"6",
"name_s":"Mary",
"date_dt":"2016-01-23T00:00:00Z",
"flow_s":"Deposit",
"amount_f":60.0,
"_version_":1532465926255935488}]
}}
Query the data with a facet showing the total gross deposit amount for each customer.
$ curl -s http://localhost:8983/solr/simple/select -d 'q=*:*&rows=0&
json.facet={
customers:{
type:terms,
field:name_s,
sort:{gross:desc},
facet:{
gross:"sum(amount_f)"
}
}
}'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="q">*:*</str>
<str name="json.facet">{ customers:{ type:terms, field:name_s, sort:{gross:desc}, facet:{ gross:"sum(amount_
<str name="rows">0</str>
</lst>
</lst>
<result name="response" numFound="6" start="0"/>
<lst name="facets">
<int name="count">6</int>
<lst name="customers">
<arr name="buckets">
<lst>
<str name="val">Mary</str>
<int name="count">3</int>
<double name="gross">130.0</double>
</lst>
<lst>
<str name="val">John</str>
<int name="count">2</int>
<double name="gross">50.0</double>
</lst>
<lst>
<str name="val">Peter</str>
<int name="count">1</int>
<double name="gross">30.0</double>
</lst>
</arr>
</lst>
</lst>
</response>
What I would like to achieve
I would like to filter only those customers that had deposited more than 100 dollars. That would mean that in the response, I would like to see only Mary who has an aggregate deposit of 130. I don't want to see John or Peter returned.
In version 6.3.0 of Apache Solr (SolrCloud), there is a built-in "/sql" handler. Please see Apache's wiki page at the URL below for the details of the handler.
https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface#ParallelSQLInterface-/sqlRequestHandler
To achieve the result described under "What I would like to achieve", the following query can be submitted; the result will only show names and aggregate totals above 100:
stmt=SELECT name_s, sum(amount_f) as total
FROM simple
GROUP BY name_s
HAVING total > 100
ORDER BY total desc
In my local environment, SolrCloud is hosted on IP 10.0.0.40, port 8983, and I have created a collection named 'simple'.
curl --data-urlencode 'stmt=SELECT name_s, sum(amount_f) as total FROM simple GROUP BY name_s HAVING total>100 ORDER BY total desc' http://10.0.0.40:8983/solr/simple/sql
{"result-set":{"docs":[
{"name_s":"Mary","total":130.0},
{"EOF":true,"RESPONSE_TIME":7}]}}

Solr number range facet count wrong

I'm trying to count issues 1 to 5 with this range facet query:
...&facet.range=issue&facet.range.start=1&q=magid:abc&facet.range.end=5&facet.range.gap=1
It returns:
<lst name="issue">
<lst name="counts">
<int name="1">5</int>
<int name="2">7</int>
<int name="3">9</int>
<int name="4">7</int>
</lst>
There's no issue 5?! Also, issue 1's count should be 3; 5 is the count for issue 2. (Then I think: "Hey! It can't be the 'array elements start from 0' problem, right?!") I change facet.range.start to 0 and query again. This time it returns:
<lst name="issue">
<lst name="counts">
<int name="0">3</int>
<int name="1">5</int>
<int name="2">7</int>
<int name="3">9</int>
<int name="4">7</int>
</lst>
Oh my! It should be issues 1~5, not 0~4? Why is Solr doing this? It is really confusing me!
I am sure that these are not 0-based index values. The values you see are the actual values indexed as tokens, so if you index values from 1 to 5, you should see values from 1 to 5.
So, if you want to make sure whether you have documents with value 5 or not, the best way to debug this is from the Schema Browser -> Term Info. Go to the Solr Admin interface, select the core, click on Schema Browser, choose the field you want to see term info for, then click Load Term Info.
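If you prefer to check this over HTTP, the same term information is exposed by the Luke request handler that backs that screen. A sketch (the core name is a placeholder; the field name comes from the question):

$ curl -s 'http://localhost:8983/solr/core/admin/luke?fl=issue&numTerms=10'

This returns the top indexed terms for the issue field, so you can see exactly which values made it into the index and how often.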

Understanding Solr nested queries

I'm trying to understand Solr nested queries, but I'm having a problem understanding the syntax.
I have the following two indexed documents (among others):
<doc>
<str name="city">Guarulhos</str>
<str name="name">Fulano Silva</str>
</doc>
<doc>
<str name="city">Fortaleza</str>
<str name="name">Fulano Cardoso Silva</str>
</doc>
If I query for q="Fulano Silva"~2&defType=edismax&qf=name&fl=score I have:
<doc>
<float name="score">28.038431</float>
<str name="city">Guarulhos</str>
<str name="name">Fulano Silva</str>
</doc>
<doc>
<float name="score">19.826164</float>
<str name="city">Fortaleza</str>
<str name="name">Fulano Cardoso Silva</str>
</doc>
So I thought that if I queried for:
q="Fulano Silva"~2 AND __query__="{!edismax qf=city}fortaleza" &defType=edismax&qf=name&fl=score
it would give a bit more score to the second document, but actually I get an empty result set with numFound=0.
What am I doing wrong here?
You need to remove the "=" and replace it with ":" to use the nested query syntax:
q="Fulano Silva"~2 AND _query_:"{!edismax qf=city}fortaleza" &defType=edismax&qf=name&fl=score
Use _query_: instead of _query_=
Hope this works...
EDIT: When you say q=, are you specifying the query in a URL, or is the text after the q= being put in an application or the Solr dashboard? If we're talking about a URL, you may need to use percent-encoding to get it to work. I mentioned that below, but since I haven't heard from you, I thought I'd reiterate.
Why don't you do q=name:"Fulano Silva" AND city:"fortaleza"?
Another possibility: q=_query_:"{!edismax qf='name'}Fulano Silva" AND city:"fortaleza"
If you're set on a nested query, select?defType=edismax&q="Fulano Silva" AND _query_:"{!edismax qf='city' v='fortaleza'}" should work, but the results and the way it matches will depend on what analyzers you are using to query and index name and city. Also, if these queries are in your query string, make sure you are
encoding them properly.
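For example, you can let curl do the percent-encoding for you; a sketch, with the core name as a placeholder:

$ curl -s 'http://localhost:8983/solr/core/select' \
     --data-urlencode 'q="Fulano Silva"~2 AND _query_:"{!edismax qf=city v=fortaleza}"' \
     --data-urlencode 'defType=edismax' \
     --data-urlencode 'qf=name' \
     --data-urlencode 'fl=score'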
In order to help you any more, I need to know what you're trying to accomplish with your query. Then perhaps we can be sure you have the right indexing set up, that edismax is the right query handler, etc.
On top of the previous comments, the asker has misspelled _query_ as __query__ (note the double underscores in the second, misspelled, version); Solr expects _query_ to be spelled with only one underscore (_) before and one after the word query, not two.

Get total term frequencies by date query

I'd like to know the "top 10 terms" from a query (which is just a date range query). I need "total term frequency" by date...not a count of documents and not just a count of term frequency across the entire index. I've looked into the Solr TermsComponent and Lucene's HighFreqTerms, but neither seems to support the operation I want as the result of a query.
My index is pretty simple: every item goes into the 'content' field, and each document also has a 'dateCreated' field (to support the query). Any thoughts on a technique I could use?
When you query for the date in question, you can iterate through the scoreDocs returned, and get TermVectors for the content field like:
Terms terms = myIndexReader.getTermVector(currentScoreDoc.doc, "content");
You can then iterate through terms.iterator() and build a collection of counts for each term (obtained from the TermsEnum.next() and TermsEnum.term() methods).
Faceting provides almost what you're looking for, but it gives document frequencies for each term, not total term frequencies. Make your date range query as a /select call, then add the following parameters (a complete request is sketched after the list):
* rows=0 since you don't want to see the documents found, just counts
* facet=true
* facet.field=<the field with the required terms>
* facet.limit=10 since you want top ten terms
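Assembled into a single request, it might look like the following sketch (the core name and the dateCreated range are placeholders based on the question):

$ curl -s 'http://localhost:8983/solr/core/select' \
     --data-urlencode 'q=dateCreated:[2012-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]' \
     --data-urlencode 'rows=0' \
     --data-urlencode 'facet=true' \
     --data-urlencode 'facet.field=content' \
     --data-urlencode 'facet.limit=10'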
Over a field called text, part of the response would look like:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="text">
<int name="from">3690</int>
<int name="have">3595</int>
<int name="it">3495</int>
<int name="has">3450</int>
<int name="one">3375</int>
<int name="who">3221</int>
<int name="he">3137</int>
<int name="up">3125</int>
<int name="all">3112</int>
<int name="year">3089</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
Warning, this request may be slow!

What is DataImportHandler doing after Indexing completed?

I am using Solr to index about 40M items, and the final index is about 20 GB. Below is the status message after a delta import:
<lst name="statusMessages">
<str name="Time Elapsed">0:51:44.149</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">5634016</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-09-27 01:25:17</str>
<str name="">
Indexing completed. Added/Updated: 5634016 documents. Deleted 0 documents.
</str>
I am wondering what Solr is doing at this point. Also, the response that replication?command=details returns is:
<lst name="masterDetails">
<str name="indexSize">36.69 GB</str>
The index has almost doubled, and it is still growing. This confuses me. I am doing a delta import; why does the index double in size when documents are replaced?
If you are replacing most of your documents, that's normal. An update in Lucene consists of a deletion and a re-insertion of the document, since the index segments are write-once. When you delete a document, you are not really deleting it but only marking it as deleted, again because the segments are write-once.
Deleted documents are removed for real when the next merge happens, when new, bigger segments are created out of the small segments you have. That's when you should see a decrease in index size, which means the index shouldn't only ever grow. Merges happen more or less according to the merge policy in use. If you want to force a merge manually, you can use the forceMerge operation, which is the new name for optimize; depending on the Solr version in use, you need either one or the other. Be careful, since forceMerge takes a while if you have a lot of documents. Have a look at this article too.
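For example, keeping to the same curl style used elsewhere on this page, an explicit optimize (what newer APIs call forceMerge) can be sent to the update handler; be aware it is expensive on a large index:

$ curl http://localhost:8983/solr/core/update --data '<optimize/>' -H 'Content-type:text/xml; charset=utf-8'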
Before Solr 3.6, DataImportHandler set optimize=true by default:
http://wiki.apache.org/solr/DataImportHandler
This triggers merging of all segments into one regardless of other settings. I think you might be able to address this by adding an optimize checkbox to debug.jsp, though I haven't actually tried it.
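If your version supports it, you should also be able to override this per import by passing the optimize parameter on the command itself (hedged; check the DataImportHandler documentation for your release):

$ curl 'http://localhost:8983/solr/core/dataimport?command=full-import&optimize=false'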
