Difference in scoring between multivalued field and tokenized field

For example, I have several tags per document. I can either
index them as a single text string, split on spaces using WhitespaceTokenizer (example: "tag1 tag2 tag3"), or
add them separately to a single field name multiple times using KeywordAnalyzer, for example:
doc.addField("tags", "tag1");
doc.addField("tags", "tag2");
doc.addField("tags", "tag3");
Both approaches will work. The question is how the scoring will differ between these two types of indexing (i.e. field normalization factor, tf/idf count, field length calculation, slop factor, etc.).

Lucene will concatenate all the values of a multivalued field behind the scenes anyway, so it would not be much different from your first case, if at all. If you use tags only as filters (give me all docs tagged with tag2), then you definitely won't see any difference.
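
The point that both approaches look alike to the scorer can be illustrated with a toy model. The sketch below is plain Python, not Lucene code. It assumes the classic TF-IDF length norm of 1/sqrt(number of terms), which other similarity implementations do not share, and all function names are made up for illustration.

```python
import math

def whitespace_terms(text):
    """Mimic a whitespace tokenizer: one term per space-separated word."""
    return text.split()

def keyword_terms(values):
    """Mimic a keyword analyzer on a multivalued field: one term per value."""
    return list(values)

def length_norm(terms):
    """Classic TF-IDF style length norm (an assumption, see the text above)."""
    return 1.0 / math.sqrt(len(terms))

single_string = whitespace_terms("tag1 tag2 tag3")
multi_valued = keyword_terms(["tag1", "tag2", "tag3"])

print(single_string == multi_valued)                            # True
print(length_norm(single_string) == length_norm(multi_valued))  # True
```

For single-word tags the two term lists are identical, so term frequency and field length come out the same in this model. Differences can only appear once a single value contains spaces.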

I would think the multi-valued field would be more accurate.
Imagine a tokenized string "spider web developer" vs. a multi-valued field with the values "spider" and "web developer".
A search for "web developer" would match both fields, but the match against the multi-valued field could be seen as more accurate.
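
To make the "spider web developer" example concrete, here is a small Python sketch of the terms each approach would index. It is a toy model rather than Solr itself, it assumes a keyword analyzer on the multi-valued field, and the function names are hypothetical.

```python
def tokenized_terms(text):
    """Whitespace tokenization: every word becomes its own term."""
    return text.split()

def multivalued_keyword_terms(values):
    """Keyword analysis per value: each value stays one untouched term."""
    return list(values)

tokenized = tokenized_terms("spider web developer")
multivalued = multivalued_keyword_terms(["spider", "web developer"])

# With keyword terms the whole tag has to match exactly.
print("web developer" in multivalued)  # True
print("web" in multivalued)            # False

# With whitespace tokens single words match, but the tag boundary is lost.
print("web" in tokenized)              # True
print(tokenized)                       # ['spider', 'web', 'developer']
```

In the tokenized case nothing records that "spider" and "web" belonged to different tags, which is why a phrase such as "spider web" could match there but not against the multi-valued keyword field.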

Related

Manipulating and Removing Facets in Apache Solr

I am creating a front-end application which queries a database using the Apache Solr engine, but I have two issues that I just cannot find the answer to.
When Solr is processing a facet query, how do I get the facet to be a single phrase ("Department of the Navy (160)") instead of a broken-up facet of 4 terms ("Department (160)", "of (200)", "the (200)", "Navy (160)")?
Also, how do I exclude certain facets from being returned, for example "and", "to", "the", etc.?
Thank you.

It looks like your phrase is being indexed into a Text field which, among many other things, splits on whitespace. This is very good for full-text search but not for faceting.
You can add a duplicate field for this, of type string (not Text), which is not split. You can still use the original field for searching and the new string field for faceting.
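
The effect of the two field types on facet counts can be sketched in a few lines of Python. This only simulates what faceting would count. The sample values are hypothetical, and each word occurs at most once per value, so occurrence counts equal document counts here.

```python
from collections import Counter

# Hypothetical field values from three documents.
departments = [
    "Department of the Navy",
    "Department of the Navy",
    "Department of the Army",
]

# Faceting on a tokenized text field counts every word separately.
text_facets = Counter(word for value in departments for word in value.split())

# Faceting on an untokenized string field counts each whole value.
string_facets = Counter(departments)

print(text_facets["Department"])                # 3
print(text_facets["the"])                       # 3
print(string_facets["Department of the Navy"])  # 2
print("the" in string_facets)                   # False
```

With the string field, words like "of" and "the" never show up as facets of their own, which also takes care of the second part of the question.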

Solrnet facet returning spaces

I'm using Solrnet to return search results and am also requesting the facets, in particular categories which is a multi-valued field.
The problem I'm coming up against is that the category "house products" is being returned as two separate facets because of the space.
Is there a way of ensuring this is returned as a single facet value, or should I be escaping the value when it is added to the index?
Thanks in advance
Al

If separate tokens are generated for "house products", then you are using text analysis on the field.
Text fields are not recommended for faceting.
You won't get the desired behavior, because text fields are tokenized and filtered, which generates multiple tokens; those are what you see in the facets returned in the response.
Use a copy field to copy the value into a String field, so that you can facet on it without splitting the words.
From SolrFacetingOverview:
Because faceting fields are often specified to serve two purposes, human-readable text and drill-down query value, they are frequently indexed differently from fields used for searching and sorting:
They are often not tokenized into separate words
They are often not mapped into lower case
Human-readable punctuation is often not removed (other than double-quotes)
There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for value retrieval.
Try using String fields; they should be good enough, without any extra overhead.

Faceting works on tokens, so if you have a field that is tokenized into many words, the facet will be split too.
I suggest you create another field of type string, used only for faceting.
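
Once such a string field exists, the facet request simply points at it instead of the analyzed field. The Python sketch below only builds the query string with the standard library. The field name category_facet is a hypothetical name for the string copy, and how the parameters are passed through SolrNet is not shown.

```python
from urllib.parse import urlencode

# "category_facet" is a hypothetical string field, filled by a copy field
# from the analyzed "category" field.
params = {
    "q": "*:*",
    "rows": 0,                        # only the facet counts are needed
    "facet": "true",
    "facet.field": "category_facet",  # facet on the untokenized copy
}
query_string = urlencode(params)
print(query_string)
```

Searching can keep using the analyzed field, while the facet values come back as whole strings such as "house products".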

Haystack/Solr boosting results if the query is found in a specific field

We're having issues with irrelevant results being returned at the top of our search results and we're trying to improve that behavior, but we're not really sure how.
We have SearchIndex with about a dozen fields. The document=True field is a template backed field that we have placed the majority of the content into. Some of the stuff found in there is much less relevant than other stuff, even if it's still useful.
To give a concrete example: if a user searches for "red rose", we want to return red roses as the top results...even better if lower results are just roses or just red, or even are described as being "rose red" in color.
The issue is that our document=True field has a ton of items that are described as being "rose red". Worse, the actual red roses don't have "red" and "rose" particularly close to each other, as those values come from disparate fields. As a result, the top few hundred results are completely irrelevant.
What we would like to do is either:
A. Search the primary document and then search each of our other fields and boost (but not hard filter) accordingly. If the term "rose" appears in an item's name and "red" appears as one of its attribute values, then that result should have a higher score. In theory this gives us the optimal results, sorted by relevancy.
B. Search all fields at once and boost if the value is any of the "boosted" fields.
It seems like using field boost should be the answer, but we can't figure out how to express it since filtering based on a field is a harsh exclude and we want it to only impact the relevance scoring.
The result of both of these is effectively the same. We just can't figure out how to do either of them with Haystack, or, if we have to fall back to raw queries, how to write a Solr query that accomplishes this.

I can only give you some pointers, as I did not fully understand the exact use case.
You can look at Solr's edismax query parser, which lets you configure:
Fields you want to search on - mainly to select the results
Variable boost on fields for relevancy - to determine the importance of each field
Variable boost for different word combinations, e.g. single words, phrase match, shingle match with slop - to determine relevancy
Additional boost on other fields
This will help you filter the results and order them according to the field and word combination matches.
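
As a rough illustration of the options listed above, the following Python sketch assembles edismax request parameters for the "red rose" search. The parameter names (defType, qf, pf, pf2, ps) are standard edismax parameters, but every field name and boost value is hypothetical and would need tuning. How to pass these through Haystack is not covered here.

```python
from urllib.parse import urlencode

# All field names and boost values below are hypothetical.
params = {
    "defType": "edismax",
    "q": "red rose",
    "qf": "text name^5 attributes^3",  # fields to search, with per-field boosts
    "pf": "name^10 text^2",            # extra boost when all terms match as a phrase
    "pf2": "name^4 text",              # extra boost for matching word pairs (shingles)
    "ps": 2,                           # slop allowed for the pf phrase match
}
encoded = urlencode(params)
print(encoded)
```

Because qf lists several fields, a document still matches if the terms only occur in the catch-all text field. The boosts change the ordering rather than acting as a hard filter.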

Can I use Solr term component with filtering on non-term fields

http://localhost:8080/search/terms?terms.prefix=ab&terms.fl=text&terms.sort=count
I have the above terms query, which works as I expect. It returns all the terms from the "text" field that have a certain prefix, sorted by count.
I want to return only the terms where another field, "language", is "en". Can I add such a filter to a terms query?

Unfortunately, you can't filter while accessing the indexed terms of a field through the TermsComponent. That's one of the limitations you face when you build auto-suggestions, for example. If you are building auto-suggestions, one approach that does support filtering is to use faceting with the facet.prefix parameter, as explained here.
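
A minimal sketch of that facet-based alternative, written in Python with only the standard library, might look as follows. The /search/select path is an assumption derived from the /search/terms URL in the question, and the actual handler path depends on the installation.

```python
from urllib.parse import urlencode

params = [
    ("q", "*:*"),
    ("fq", "language:en"),    # the filter that the TermsComponent cannot apply
    ("rows", 0),              # no documents needed, only facet counts
    ("facet", "true"),
    ("facet.field", "text"),  # same field as terms.fl in the question
    ("facet.prefix", "ab"),   # same prefix as terms.prefix
    ("facet.sort", "count"),
]
# The "/search/select" path is an assumption, see the text above.
url = "http://localhost:8080/search/select?" + urlencode(params)
print(url)
```

The facet counts are then computed only over documents matching the filter query, at the cost of running a regular search instead of a raw term lookup.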

What are the advantages of the multivalued option in Solr

What are the advantages of the multivalued field option in Solr?
I have a field with comma-separated keywords.
I can do two things:
make a non-multivalued text field
make a multivalued text field which contains each keyword
I can still query in both cases. So what are the advantages of multivalued over non-multivalued?

Advantages of multivalued: you don't need to change the document design. If a document contains multiple values in one field, Solr/Lucene can handle this field.
Another advantage: multiple values can describe a document more exactly (think about the tags of a blog post, for example).
Advantages of non-multivalued: you can use specific features which require a single term (word) in one field, like spell checking. It is also a benefit for clustering (Carrot2) or grouping, which mostly work better on non-multivalued fields.

Querying the multivalued field will return what you want.
Example: doc1 has the keyword 'abc' and doc2 has the keyword 'abcd'. If you query for the keyword 'abc', only doc1 should be matched.
With the non-multivalued approach both documents will match, because you'll be using LIKE-style syntax.
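
The difference can be sketched with a toy model in Python. It assumes the single-valued field holds the raw comma-separated string and is searched by substring. A tokenized text field that splits on commas would behave more like the multivalued case. The document contents are hypothetical.

```python
# Hypothetical documents: keywords stored as one comma-separated string
# versus as separate values of a multivalued field.
single_valued = {"doc1": "xyz,abc", "doc2": "xyz,abcd"}
multi_valued = {"doc1": ["xyz", "abc"], "doc2": ["xyz", "abcd"]}

def like_match(docs, keyword):
    """Substring ("like") search over the unsplit string."""
    return sorted(doc for doc, text in docs.items() if keyword in text)

def exact_match(docs, keyword):
    """Exact match against each individual value."""
    return sorted(doc for doc, values in docs.items() if keyword in values)

print(like_match(single_valued, "abc"))   # ['doc1', 'doc2']
print(exact_match(multi_valued, "abc"))   # ['doc1']
```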

Multivalued fields can be very handy. Let's say you have many fields and you wish to search several of them, but not all. You can create a multivalued field that includes all the fields you want to search, and search in it.
For example, let's say you have fields whose values may be strings or numbers, and you wish to search all the string values found in the document. You can create a multivalued field for all the string values and search in it.
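
A client-side version of that idea could look like the Python sketch below. The record, the field names and the all_strings catch-all field are all hypothetical. In Solr itself the same effect is commonly achieved on the server side with copyField rules into a multivalued field.

```python
import json

# Hypothetical source record mixing string and numeric values.
record = {"id": "42", "title": "Garden rose", "color": "red", "price": 12.5, "stock": 7}

# Collect every string value (except the id) into one multivalued field.
all_strings = [
    value for key, value in record.items()
    if key != "id" and isinstance(value, str)
]

# A list value stands for a multivalued field in the document to be indexed.
doc = dict(record, all_strings=all_strings)
print(json.dumps(doc))
```

A query against all_strings then covers the title and the color in one go, while the numeric fields stay out of the text search.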