I'd like to know the "top 10 terms" from a query (which is just a date range query). I need "total term frequency" by date...not a count of documents and not just a count of term frequency across the entire index. I've looked into the Solr TermsComponent and Lucene's HighFreqTerms, but neither seems to support the operation I want as the result of a query.
My index is pretty simple...every item goes into the 'content' field, and each document also has a 'dateCreated' field (to support the query). Any thoughts on a technique I could use?
When you query for the date in question, you can iterate through the scoreDocs returned, and get TermVectors for the content field like:
Terms terms = myIndexReader.getTermVector(currentScoreDoc.doc, "content");
You can then iterate with terms.iterator() and accumulate a count for each of the terms (the terms themselves come from the TermsEnum.next() or TermsEnum.term() methods), as in the sketch below.
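A minimal sketch of that loop, assuming a Lucene 5.x-style API, that term vectors were enabled for the content field at index time, and that myIndexSearcher, myIndexReader and dateRangeQuery stand in for your own objects:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.BytesRef;

Map<String, Long> termCounts = new HashMap<>();
// collect every document that matches the date range query
TopDocs hits = myIndexSearcher.search(dateRangeQuery, myIndexReader.maxDoc());
for (ScoreDoc scoreDoc : hits.scoreDocs) {
    Terms terms = myIndexReader.getTermVector(scoreDoc.doc, "content");
    if (terms == null) {
        continue; // no term vector stored for this document
    }
    TermsEnum termsEnum = terms.iterator(); // on Lucene 4.x: terms.iterator(null)
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        // for a term vector, totalTermFreq() is the term's frequency within this document
        termCounts.merge(term.utf8ToString(), termsEnum.totalTermFreq(), Long::sum);
    }
}
// sort termCounts by value, descending, and keep the first 10 entries for the "top 10 terms"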
Faceting provides almost what you're looking for, but it will give document frequencies for each term, not the total term frequencies. Make your date range query as a /select call, then add these parameters (a full example request follows the list):
* rows=0 since you don't want to see the documents found, just counts
* facet=true
* facet.field=<the field with the required terms>
* facet.limit=10 since you want top ten terms
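Putting those together, an illustrative request (host, core, field and date names are placeholders for your own setup) could be:
http://localhost:8983/solr/core/select?q=dateCreated:[2012-01-01T00:00:00Z%20TO%202012-12-31T23:59:59Z]&rows=0&facet=true&facet.field=text&facet.limit=10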
Over a field called text, part of the response would look like:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="text">
<int name="from">3690</int>
<int name="have">3595</int>
<int name="it">3495</int>
<int name="has">3450</int>
<int name="one">3375</int>
<int name="who">3221</int>
<int name="he">3137</int>
<int name="up">3125</int>
<int name="all">3112</int>
<int name="year">3089</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
Warning, this request may be slow!
Related
I would like to create a query that will return facet counts for some field values even if the count for that value is 0. But on the other hand, I don't want counts for the values that were not in the original query.
For example if I use in my query:
<arr name="fq">
<str>Field:(4 OR 5)</str>
</arr>
and there exists no document with the value 5 in this field, I would like to get back:
<lst name="facet_fields">
<lst name="VersionStatus">
<int name="4">2</int>
<int name="5">0</int>
</lst>
</lst>
But there shouldn't be counts for other values (e.g. 1, 2, 3, ...), because they weren't specified in the query.
Is that even possible? I was trying to achieve this with the facet.missing=true parameter, but that didn't work.
Instead of faceting on the field with 'facet.field', you can use N facet.query parameters matching the terms in your fq; so, in your example:
&facet.query=Field:4&facet.query=Field:5
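With facet.query, each query gets its own entry under facet_queries in the response, including zero counts. An illustrative fragment of what the response could look like:
<lst name="facet_counts">
<lst name="facet_queries">
<int name="Field:4">2</int>
<int name="Field:5">0</int>
</lst>
</lst>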
I cannot find any documentation on the solr website that indicates how to search for a string that contains a literal hash character inside it.
example:
?q=id_number:723#52
I've tried escaping the hash, 723\#52, and URL-encoding it, 723%2352, but the Solr output shows that it cuts off at the hash symbol each time:
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">2</int>
<lst name="params">
<str name="q">id_number:723</str>
</lst>
</lst>
Because Solr tokenizes the query using solr.StandardTokenizerFactory, the # character is removed from the query. You can change the tokenizer in the field type definition.
In your case, for the field id_number, change the tokenizer class from solr.StandardTokenizerFactory to solr.WhitespaceTokenizerFactory.
But note that this approach will also let all other special characters (., :, etc.) through in the query.
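For reference, a minimal sketch of such a field type in schema.xml (the type name text_ws is illustrative, and you will need to re-index after changing the analysis chain):
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<field name="id_number" type="text_ws" indexed="true" stored="true"/>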
I am trying to count issues 1 to 5 with this range facet query:
...&facet.range=issue&facet.range.start=1&q=magid:abc&facet.range.end=5&facet.range.gap=1
It returns:
<lst name="issue">
<lst name="counts">
<int name="1">5</int>
<int name="2">7</int>
<int name="3">9</int>
<int name="4">7</int>
</lst>
There's no issue 5?!? Also, issue 1 should be 3, and 5 is the count for issue 2. (Then I think, "Hey! Don't tell me it's the 'array elements start from 0' problem, right?!...") I change facet.range.start to 0 and run the query again. This time it returns:
<lst name="issue">
<lst name="counts">
<int name="0">3</int>
<int name="1">5</int>
<int name="2">7</int>
<int name="3">9</int>
<int name="4">7</int>
</lst>
Oh my! It should be issues 1~5, but instead I get 0~4? Why is Solr doing this? It is really confusing me!
I am sure that these are not 0-based index values. The values you see are the actual values indexed as tokens, so if you index values from 1 to 5, you should see values from 1 to 5.
So, if you want to make sure whether you have documents with value 5 or not, the best way to debug this is from the Schema Browser -> Term info.
So, go to Solr Admin interface, select the core, click on schema browser, choose the field name you want to see term info for, then click on Load term info.
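As a quick cross-check from the query side (an illustrative request; the field and query values come from your example), a plain facet.query counts the documents matching an exact value, independent of any range bucketing:
...&q=magid:abc&facet=true&facet.query=issue:5
If that comes back as 0, there simply is no document with issue 5 in the index.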
I recently started playing around with Apache Solr and currently trying to figure out the best way to benchmark the indexing of a corpus of XML documents. I am basically interested in the throughput (documents indexed/second) and index size on disk.
I am doing all this on Ubuntu.
Benchmarking Technique
* Run the following 5 times & get average total time taken *
Index documents [curl http://localhost:8983/solr/core/dataimport?command=full-import]
Get 'Time taken' name attribute from XML response when status is 'idle' [curl http://localhost:8983/solr/core/dataimport]
Get size of 'data/index' directory
Delete Index [curl http://localhost:8983/solr/core/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8']
Commit [curl http://localhost:8983/solr/w5/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8']
Re-index documents
Questions
I intend to calculate my throughput by dividing the number of documents indexed by average total time taken; is this fine?
Are there tools (like SolrMeter for query benchmarking) or standard scripts already available that I could use to achieve my objectives? I do not want to re-invent the wheel...
Is my approach fine?
Is there an easier way of getting the index size as opposed to performing a 'du' on the data/index/ directory?
Where can I find information on how to interpret the XML response attributes (see sample output below)? For instance, I would like to know the difference between the QTime and 'Time taken' values.
* XML Response Used to Get Throughput *
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">w5-data-config.xml</str>
</lst>
</lst>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">3200</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-12-11 14:06:19</str>
<str name="">Indexing completed. Added/Updated: 1600 documents. Deleted 0 documents.</str>
<str name="Total Documents Processed">1600</str>
<str name="Time taken">0:0:10.233</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
To question 1:
I would suggest you try to index more than one XML file (with different datasets) and compare the results. That's how you will know whether it's OK to simply divide the time taken by your number of documents.
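As a worked example with the numbers from your posted response: 1600 documents in a 'Time taken' of 0:0:10.233 gives roughly 1600 / 10.233 ≈ 156 documents/second for that run; with 5 runs you would average the per-run times first and then divide.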
To question 2:
I didn't find any such tools; I did it on my own by developing a short Java application.
To question 3:
Which approach do you mean? I would refer to my answer to question 1...
To question 4:
The size of the index folder gives you the correct size of the whole index, so why don't you want to use it?
To question 5:
The results you get in the posted XML are transformed through an XSL file. You can find it in the /bin/solr/conf/xslt folder. There you can look up what the terms mean exactly, AND you can write your own XSL to display the results and information.
Note: If you create a new XSL file, you have to change the settings in your solrconfig.xml. If you don't want to make any changes there, edit the existing file.
Edit: I think the difference is that QTime is a rounded value of the 'Time taken' value. There are only even numbers in QTime.
Best regards
We have a field 'facet_tag' that contains tags describing a product. Since the tags are in German, they may contain non-ASCII characters (like umlauts). Here are some possible values:
"Zelte"
"Tunnelzelte"
"Äxte"
"Sägen"
"Softshells"
Now if we query solr for the facets with a query like:
http://<solr_host>:<solr_port>/solr/select?q=*&facet=on&facet.field=facet_tag&facet.sort=index
The sorted result looks like this:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="facet_tag">
<int name="Softshells">1</int>
<int name="Sägen">1</int>
<int name="Tunnelzelte">1</int>
<int name="Zelte">1</int>
<int name="Äxte">2</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
The tag "Äxte" should be the first item, followed by "Sägen". Obviously Solr does not handle non-ASCII characters well in this case (which is also stated in the documentation for faceted search, see http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort)
Is there any way to let Solr sort these values properly without normalizing umlauts (since we show the values to the user)?
I would use ASCIIFoldingFilterFactory:
Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
This way what you index becomes normalized (for example Äxte becomes indexed as Axte), but what is stored doesn't change. That's why you should then get the expected sorting, but the content you'll show will still be the original one (Äxte for example).
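As a sketch, the analysis chain for such a field could look like this in schema.xml (the type name and the choice of KeywordTokenizerFactory are illustrative; use whatever tokenizer matches how your tags should be split):
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>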
UPDATE
The solution doesn't apply to facets, since they use the indexed values. Using the ASCIIFoldingFilterFactory you can get the right sort, but you'll see the normalized characters in the output as well. Basically you can have the right sort but the wrong output, or the wrong sort but the right output. Unfortunately I don't know any other solution.