Sort Facets by Index with non-ASCII values - solr

We have a field 'facet_tag' that contains tags describing a product. Since the tags are in German, they may contain non-ASCII characters (such as umlauts). Here are some possible values:
"Zelte"
"Tunnelzelte"
"Äxte"
"Sägen"
"Softshells"
Now if we query Solr for the facets with a query like:
http://<solr_host>:<solr_port>/solr/select?q=*&facet=on&facet.field=facet_tag&facet.sort=index
The sorted result looks like this:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="facet_tag">
<int name="Softshells">1</int>
<int name="Sägen">1</int>
<int name="Tunnelzelte">1</int>
<int name="Zelte">1</int>
<int name="Äxte">2</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
The tag "Äxte" should be the first item, followed by "Sägen". Obviously Solr does not handle non-ASCII characters well in this case (this is also noted in the documentation for faceted search; see http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort).
Is there any way to let Solr sort these values properly without normalizing umlauts (since we show the values to the user)?

I would use ASCIIFoldingFilterFactory:
Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
This way what you index becomes normalized (for example Äxte becomes indexed as Axte), but what is stored doesn't change. That's why you should then get the expected sorting, but the content you'll show will still be the original one (Äxte for example).
UPDATE
The solution doesn't apply to facets, since facets are built from the indexed values. With ASCIIFoldingFilterFactory you get the right sort order, but the output shows the normalized characters as well. Basically, you can have either the right sort with the wrong output or the wrong sort with the right output. Unfortunately I don't know of any other solution.
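One possible workaround outside Solr, which is not part of the answer above, is to re-sort the returned facet values on the client. The sketch below assumes the client receives the complete list of facet values (for example, no facet.limit cut-off) and is free to reorder them. The class and method names are made up for illustration. It compares plain String ordering, which for these values matches what facet.sort=index returned, with a German java.text.Collator.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Solr or SolrJ.
class FacetSortDemo {

    // Natural String order compares UTF-16 code units, so the umlaut-initial
    // tag (code point U+00C4) sorts after every tag starting with A-Z.
    static List<String> indexOrder(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(null); // null comparator means natural ordering
        return copy;
    }

    // A German collator treats an umlaut as an accented variant of its base
    // letter, so the umlaut-initial tag sorts with the other A words.
    static List<String> germanOrder(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Collator.getInstance(Locale.GERMAN));
        return copy;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only:
        // \u00C4 is A-umlaut, \u00E4 is a-umlaut.
        List<String> tags = List.of(
                "Zelte", "Tunnelzelte", "\u00C4xte", "S\u00E4gen", "Softshells");
        System.out.println("index order:  " + indexOrder(tags));
        System.out.println("German order: " + germanOrder(tags));
    }
}
```

The displayed values stay untouched, because only the order of the list changes. The trade-off is that the sorting no longer happens inside Solr, so paging through facets with facet.offset would still follow the index order.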

Related

Solr facet behavior

So I'm running into a problem with faceting in Solr. When I query for something, say sourcecode:WOS, I get what you would expect, like so:
This looks fine.
However, if I now try to FACET on this field instead, I get this:
As you can see, I now have 4 entries for each sourcecode, and all of them have some weird combination of symbols prepended to them:
<int name="C¨4C1">1433755</int>
<int name="C¨4C1">1433755</int>
<int name="¨4c1">1433755</int>
<int name="¨4c1">1433755</int>
It seems like some Unicode characters are added when I facet on sourcecode, which is a solr.TextField.
Has anyone ever encountered this issue before?
Thanks,
Rasmus Edvardsen

search for literal hash character in solr query

I cannot find any documentation on the Solr website that explains how to search for a string that contains a literal hash character.
Example:
?q=id_number:723#52
I've tried escaping the hash, 723\#52, and URL-encoding it, 723%2352, but the Solr output shows that the query is cut off at the hash symbol each time:
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">2</int>
<lst name="params">
<str name="q">id_number:723</str>
</lst>
</lst>
Solr tokenizes the query with solr.StandardTokenizerFactory, so the # character is removed from the query. You can change the tokenizer in the field type definition.
In your case, for the field type of id_number, change the tokenizer class from solr.StandardTokenizerFactory to solr.WhitespaceTokenizerFactory.
Be aware that with this change all other special characters in the query (such as . : ,) are kept as well.
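Independently of the tokenizer, a raw # in a URL starts the fragment part, which HTTP clients do not send to the server, so it is worth ruling that out first. The following is a minimal sketch of percent-encoding the query value before building the request. The host, port, and class name are placeholders and not taken from the question.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building a /select URL by hand.
class HashQueryDemo {

    // Percent-encodes the q parameter so that '#' reaches Solr as %23
    // instead of being interpreted as the start of a URL fragment.
    static String selectUrl(String baseUrl, String query) {
        return baseUrl + "/select?q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder base URL; adjust to the actual Solr instance.
        System.out.println(selectUrl("http://localhost:8983/solr", "id_number:723#52"));
        // prints http://localhost:8983/solr/select?q=id_number%3A723%2352
    }
}
```

If the echoed q parameter in the response header still stops at 723 with an encoded request, the truncation is probably happening somewhere between the client and Solr (for example in a proxy or a framework that decodes the URL again).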

Using alternative label for Solr 4.0 Multivalue-field

I'm struggling a little with the labels of my facet fields. I'm using Solr 4 and feed my Solr index with the Drupal Search API Solr module (http://drupal.org/project/search_api_solr).
I use some taxonomy fields for facets and almost everything is working fine. But I can't change the label of the fields. Say I have the field
"sm_thisisvocname"
Then the field is in the index like
sm_thisisvocname:name
for the values of the field and
sm_thisisvocname:vocabulary:name
for the label of the (taxonomy)-field like "This Is Vocname".
So the XML looks like
<lst name="facet_fields">
<lst name="sm_thisisvocname:name">
<int name="C">2</int>
<int name="B">1</int>
<int name="D">1</int>
<int name="E">1</int>
</lst>
</lst>
and
<sm_thisisvocname:vocabulary:name>This Is Vocname</sm_thisisvocname:vocabulary:name>
in the XML. I can't use the query
&facet=true&facet.field=sm_thisisvocname:name
because there are colons in the field names. Can anybody help me?
You should change your field name so that it does not contain a colon (:), because the colon is treated as a special character in several places in Solr query syntax.
The only documentation I could find says:
Currently a field name must consist of only A-Z, a-z, 0-9, - or _
A field alias is something you could look into; however, the alias syntax itself also depends on the colon.
You can also try to escape the : in the field name.
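The naming rule quoted above can be checked mechanically before fields are created. The sketch below is only an illustration: the class name is made up, and replacing offending characters with an underscore is an assumed convention, not something Solr or the Drupal module prescribes.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of Solr or the Drupal module.
class FieldNameCheck {

    // Only A-Z, a-z, 0-9, '-' and '_' are allowed, per the quoted rule.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9_-]+");

    static boolean isSafeFieldName(String name) {
        return SAFE.matcher(name).matches();
    }

    // Assumed convention: replace every disallowed character with '_'.
    static String sanitize(String name) {
        return name.replaceAll("[^A-Za-z0-9_-]", "_");
    }

    public static void main(String[] args) {
        System.out.println(isSafeFieldName("sm_thisisvocname:name")); // false
        System.out.println(sanitize("sm_thisisvocname:name"));        // sm_thisisvocname_name
    }
}
```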

Get total term frequencies by date query

I'd like to know the "top 10 terms" for a query (which is just a date range query). I need the "total term frequency" by date, not a count of documents and not a term frequency across the entire index. I've looked into the Solr TermsComponent and Lucene's HighFreqTerms, but neither seems to support the operation I want as the result of a query.
My index is pretty simple: the text of every item goes into the 'content' field, and each document also has a 'dateCreated' field (to support the query). Any thoughts on the technique I could use?
When you query for the date range in question, you can iterate through the scoreDocs returned and get the term vector for the content field of each hit (this requires term vectors to be stored for that field), like:
Terms terms = myIndexReader.getTermVector(currentScoreDoc.doc, "content");
You can then iterate through terms.iterator() and build a collection of counts for each of the terms (acquired from the TermsEnum.next() or TermsEnum.term() methods).
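The aggregation step described above can be sketched without any Lucene dependency. The example assumes that each matching document has already been reduced to a map from term to in-document frequency (to my knowledge TermsEnum.totalTermFreq() reports that value when iterating a term vector, but treat that as an assumption to verify against your Lucene version). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: sums per-document term frequencies and picks the top n.
class TopTerms {

    // Adds up the frequency of every term across all matching documents.
    static Map<String, Long> sumFrequencies(List<Map<String, Long>> perDocument) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> doc : perDocument) {
            doc.forEach((term, freq) -> totals.merge(term, freq, Long::sum));
        }
        return totals;
    }

    // Returns the n terms with the highest total, ties broken alphabetically.
    static List<Map.Entry<String, Long>> top(Map<String, Long> totals, int n) {
        Comparator<Map.Entry<String, Long>> byCountDesc = (a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        };
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals.entrySet());
        entries.sort(byCountDesc);
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Made-up frequencies standing in for two documents' term vectors.
        List<Map<String, Long>> docs = List.of(
                Map.of("solr", 3L, "facet", 1L),
                Map.of("solr", 2L, "index", 4L));
        System.out.println(top(sumFrequencies(docs), 10));
        // prints [solr=5, index=4, facet=1]
    }
}
```

For a large result set it may be worth keeping only a bounded priority queue instead of sorting every distinct term, but the simple version shows the idea.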
Faceting provides almost what you're looking for, but will give document frequencies for each term, not the total term frequencies. Make your date range query as a /select call, then add parameters:
* rows=0 since you don't want to see the documents found, just counts
* facet=true
* facet.field=<the field with the required terms>
* facet.limit=10 since you want top ten terms
Over a field called text, part of the response would look like:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="text">
<int name="from">3690</int>
<int name="have">3595</int>
<int name="it">3495</int>
<int name="has">3450</int>
<int name="one">3375</int>
<int name="who">3221</int>
<int name="he">3137</int>
<int name="up">3125</int>
<int name="all">3112</int>
<int name="year">3089</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
Warning, this request may be slow!
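The parameters listed above can be assembled into a request URL as in the following sketch. The base URL and the date range are placeholders, the field names content and dateCreated come from the question, and the class name is made up.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper that builds the faceting request described above.
class FacetRequest {

    static String build(String baseUrl, String dateRangeQuery, String field) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", dateRangeQuery);   // the date range query
        params.put("rows", "0");           // no documents, only counts
        params.put("facet", "true");
        params.put("facet.field", field);  // field holding the terms
        params.put("facet.limit", "10");   // top ten terms

        String queryString = params.entrySet().stream()
                .map(e -> e.getKey() + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return baseUrl + "/select?" + queryString;
    }

    public static void main(String[] args) {
        // Placeholder host and dates; adjust to the real index.
        System.out.println(build(
                "http://localhost:8983/solr",
                "dateCreated:[2013-01-01T00:00:00Z TO 2013-02-01T00:00:00Z]",
                "content"));
    }
}
```

As noted in the answer, the counts returned this way are document frequencies, so a term that occurs five times in one document still counts once.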

Apache Solr TermsComponent: How to prevent from splitting words after one character. E.g. "t-shirt"

I'm trying to get autosuggestions for search terms. But I've run into a problem with words containing characters like "-" and "&", which are being split after just one character.
Example:
/solr/terms/?terms=true&terms.fl=item&terms.limit=10&terms.sort=count&terms.prefix=t
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="terms">
<lst name="item">
<int name="top">11335</int>
<int name="tshirt">10249</int>
<int name="t">10156</int>
<int name="trouser">4771</int>
<int name="tight">1577</int>
</lst>
</lst>
</response>
The problem lies with tshirt and t. "t" only appears within "t-shirt". So how do I prevent Solr from splitting words after just one character if there is no whitespace after it? "t shirt" should split; "t-shirt" and "h&m" should not.
Thanks for your help!
The field type for item seems to be a text type with WordDelimiterFilterFactory as one of the filters in the analysis chain.
By default, WordDelimiterFilterFactory splits on intra-word delimiters.
So t-shirt generates the two tokens t and shirt, which is why the term t shows up for you.
If you want to use terms for autosuggest, remove or tune the WordDelimiterFilterFactory according to your requirements.
You can use a TextField with a basic configuration, for example WhitespaceTokenizerFactory plus the lowercase and ASCII folding filters, so that the tokens are analyzed as little as possible and don't appear fragmented.
You can also list words that should not be split in protwords.txt, or map certain characters in wdfftypes.txt so that they are not used to split terms.
Also check this link for a good autosuggest setup: http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
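The difference between the two analysis styles can be illustrated with a rough simulation. This is not the Solr filter itself, only a simplified stand-in that splits on non-alphanumeric characters, and the class and method names are hypothetical. The tshirt term in the question's output would be consistent with a catenate option such as catenateWords being enabled, but the schema is not shown, so that remains a guess.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified simulation; NOT the actual Solr analysis classes.
class DelimiterSplitDemo {

    // Whitespace-only tokenizing plus lowercasing keeps "t-shirt" intact.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Additionally splitting each token on intra-word delimiters
    // (anything that is not a letter or digit) fragments it.
    static List<String> delimiterTokens(String text) {
        List<String> parts = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            for (String part : token.split("[^\\p{L}\\p{N}]+")) {
                if (!part.isEmpty()) {
                    parts.add(part);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("T-Shirt H&M")); // [t-shirt, h&m]
        System.out.println(delimiterTokens("T-Shirt H&M"));  // [t, shirt, h, m]
    }
}
```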
If that's the only problem you have with using the TermsComponent for auto suggestions, the answer you got is perfect, but I'd like to propose an alternative.
The TermsComponent is fast and pretty easy to use, but it has the following limitations:
* you can't apply any filter to your suggestions;
* you may have trouble with case-sensitive queries: for example, if you use the LowerCaseFilterFactory and index the word Word, you'll get the suggestion only when typing w, not W. You basically need to lowercase the query yourself before submitting it to Solr, since you can't apply any tokenizer or filter to it.
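The second limitation can be handled on the client, as in this minimal sketch. It assumes the indexed terms were lowercased at index time; the base URL and class name are placeholders, while the parameters mirror the request shown in the question.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for building a TermsComponent request.
class TermsPrefix {

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Lowercases the user's input before using it as terms.prefix, because
    // the TermsComponent does not run the prefix through the field's analyzer.
    static String termsUrl(String baseUrl, String field, String userInput) {
        String prefix = userInput.trim().toLowerCase(Locale.ROOT);
        return baseUrl + "/terms?terms=true"
                + "&terms.fl=" + enc(field)
                + "&terms.limit=10&terms.sort=count"
                + "&terms.prefix=" + enc(prefix);
    }

    public static void main(String[] args) {
        // Placeholder base URL; "T" is what the user typed.
        System.out.println(termsUrl("http://localhost:8983/solr", "item", "T"));
    }
}
```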
Depending on your requirements, you might want to consider different ways to make auto suggestions with Solr. The Different ways to make auto suggestions with Solr article should be useful in order to make the right choice.

Resources