Manipulating and Removing Facets in Apache Solr - solr

I am creating a front end application which queries through a database using the Apache Solr engine, but I have two issues that I just cannot find the answer to.
When Solr is processing a Facet query, how do I get the facet to be a single phrase ("Department of the Navy (160)") instead of a broken up facet of 4 terms ("Department (160)" "of (200)" "the (200)" "Navy(160)").
Also, how do I remove certain facets from being queried, for example "and" "to" "the" etc.
Thank you.

Looks like your phrase is being indexed into a Text field which, among many things, splits by whitespace. This is very good for full text search but not for faceting.
You can have a duplicate field for this, of type string (and not Text), which is not splitted. You can still use the original field for searching but the new string field for faceting.

Related

Solr 3.6.2 spellcheck multi-word phrase: how to get collations without ignored stopwords?

I'm having a problem with the Solr 3.6.2 default (field based) spellchecker configured with query time parameters
spellcheck.onlyMorePopular=true
spellcheck.count=5
spellcheck.collate=true
spellcheck.maxCollations=5
spellcheck.maxCollationTries=5
on a field type which has a solr.StopFilterFactory filter on its analyzers.
The suggestion phase works as intended :
the indexed field does not contain any stopword
no suggestion is provided for a given stopword
But the resulting collation always contains the ignored stopwords, which I don't want: I'd prefer a raw suggestion of combined terms over something which looks like a "sort of" natural language answer.
For instance, searching for "handfum of perries", I'd prefer "handful berry" over "handful of berry".
I don't think that the stopwords excluded from spellchecking suggestions because of the field query analyzer are "marked" for preservation like the official documentation goes about other query elements :
Note that the non-spellcheckable terms such as those for range
queries, prefix queries etc. are detected and excluded for
spellchecking. Such non-spellcheckable terms are preserved in the
collated output so that the original query can be run again, as is.
It seems two solutions would be
either having a custom query converter so the stopwords are ignored right from the start: not sure it is possible in 3.6.2
or having a custom spellchecker that would not try to find any suggestion for a stopword (or would always suggest an "empty" string), without messing up the collation process
Am I missing something ?
Regards

Solrnet facet returning spaces

I'm using Solrnet to return search results and am also requesting the facets, in particular categories which is a multi-valued field.
The problem I'm coming up against is that the category "house products" is being returned as two seperate facets because of the space.
Is there a way of ensuring this is returned as a single facet value, or should I be escaping the value when it is added to the index?
Thanks in advance
Al
If the tokens are generated for house products then you are using text analysis for the field.
Text fields are not suggested to be used for Faceting.
You won't get the desired behavior as the text fields would be tokenized and filtered leading to the generation of multiple tokens which you see from the facets returned as response.
Use a copy field to copy the field to a String field to be able to facet on it without splitting the words.
SolrFacetingOverview :-
Because faceting fields are often specified to serve two purposes,
human-readable text and drill-down query value, they are frequently
indexed differently from fields used for searching and sorting:
They are often not tokenized into separate words
They are often not mapped into lower case
Human-readable punctuation is often not removed (other than double-quotes)
There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for
value retrieval.
Try to use String fields and it would be good enough without any overheads.
The faceting works on tokens, so if you have a field that is tokenized in many words it will split the facet too.
I suggest you create another field of type string used only for faceting.

Whole phrase search possible in solr for double quotes

I want to do a phrase search in solr without analyzers being applied to it
eg - If I search for "DelhiDareDevil" (i.e - with inverted commas)it should search the exact text and not apply any analyzers or tokenizers on this field
However if i search for DelhiDareDevil it should use tokenizers and analyzers and split it to something like this delhi dare devil
any help would be appreciated
Not sure of any Out of the Box approach, you can copy field the content to a new field with no analysis.
And, You can define your own query parser and have the query being searched on the field with no analysis.

Solr Index appears to be valid - but returns no results

Solr newbie here.
I have created a Solr index and write a whole bunch of docs into it. I can see
from the Solr admin page that the docs exist and the schema is fine as well.
But when I perform a search using a test keyword I do not get any results back.
On entering * : *
into the query (in Solr admin page) I get all the results.
However, when I enter any other query (e.g. a term or phrase) I get no results.
I have verified that the field being queried is Indexed and contains the values I am searching for.
So I am confused what I am doing wrong.
Probably you don't have a <defaultSearchField> correctly set up. See this question.
Another possibility: your field is of type string instead of text. String fields, in contrast to text fields, are not analyzed, but stored and indexed verbatim.
I had the same issue with a new setup of Solr 8. The accepted answer is not valid anymore, because the <defaultSearchField> configuration will be deprecated.
As I found no answer to why Solr does not return results from any fields despite being indexed, I consulted the query documentation. What I found is the DisMax query parser:
The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field. Additional options enable users to influence the score based on rules specific to each use case (independent of user input).
In contrast, the default Lucene parser only speaks about searching one field. So I gave DisMax a try and it worked very well!
Query example:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video
You can also specify which fields to search exactly to prevent unwanted side effects. Multiple fields are separated by spaces which translate to + in URLs:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features+text
Last but not least, give the fields a weight:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features^20.0+text^0.3
If you are using pysolr like I do, you can add those parameters to your search request like this:
results = solr.search('search term', **{
'defType': 'dismax',
'qf': 'features text'
})
In my case the problem was the format of the query. It seems that my setup, by default, was looking and an exact match to the entire value of the field. So, in order to get results if I was searching for the sit I had to query *sit*, i.e. use wildcards to get the expected result.
With solr 4, I had to solve this as per Mauricio's answer by defining type="text_en" to the field.
With solr 6, use text_general.

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming, because my first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion is based by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
so what I was looking for is to make the search term for 'gatorade' -> 'gatorade OR gatorade*' which will give me all the matches i'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory. .
This will create indexes for ngrams or parts of words. Documents, with a min ngram size of 5 and max ngram size of 8, would index: Docum Docume Document Documents
There is a bit of a tradeoff for index size and time. One of the Solr books quotes as a rough guide: Indexing takes 10 times longer Uses 5 times more disk space Creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard character in your queries. As you aren't doing a wildcard search, you are matching a search term on ngrams(parts of words).
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution

Resources