Solr - How do I match all user provided tokens, not just some?

I'm currently experimenting with Solr and attempting to get a query to only retrieve documents where all the provided tokens match.
For example, assume I have a field called data which when indexed uses a PatternTokenizer to split the incoming string on a delimiting character, e.g. '/'. For the input string "Foo/Bar/Baz" I would expect to get three tokens (if my understanding of the docs is correct!). Adding a few more documents I end up with:
Foo/Bar/Baz ==> Foo, Bar, Baz
Foo/Far/Faz ==> Foo, Far, Faz
Boo/Bar/Baz ==> Boo, Bar, Baz
When I come to query this field however, I get results I wasn't quite expecting. Using the query:
+data:Foo/Bar
I would expect this to match documents which contained both Foo and Bar, but instead it returns documents which contain at least Foo or Bar, scoring those with both terms higher. Other than altering the query such that it resembles:
+data:Foo +data:Bar
is there any way to change the behaviour such that instead of matching all 3 of my example documents, it matches only the one?
This experiment was done using the nightly builds of Solr 4.0.
Thanks

You could set the default operator to be AND in schema.xml which will make all queries an AND search.
http://wiki.apache.org/solr/SchemaXml#Default_query_parser_operator
You could also change it per-query by adding q.op=AND to the solr url.
http://solrhost/solr/select?q=solr+lucene&q.op=AND
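The two options discussed in this thread (marking every token as mandatory in the query itself, or passing q.op=AND on the request) can be sketched in Python as below. The helper names require_all_tokens and select_url are made up for illustration, and the host is the placeholder used in the answer; this only builds the strings and does not contact a Solr server.

```python
from urllib.parse import urlencode


def require_all_tokens(field, raw, delimiter="/"):
    """Rewrite 'Foo/Bar' as '+data:Foo +data:Bar' so each token is mandatory."""
    tokens = [t for t in raw.split(delimiter) if t]
    return " ".join(f"+{field}:{t}" for t in tokens)


def select_url(base, query, default_operator=None):
    """Build a select URL, optionally overriding the default operator via q.op."""
    params = {"q": query}
    if default_operator:
        params["q.op"] = default_operator
    return f"{base}?{urlencode(params)}"


# Option 1: make every token mandatory in the query string itself.
print(require_all_tokens("data", "Foo/Bar"))

# Option 2: leave the query alone and set q.op=AND for this request.
print(select_url("http://solrhost/solr/select", "solr lucene", "AND"))
```

The second call reproduces the URL from the answer, with the space in the query encoded as +.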

Related

Solr query not working as expected when it contains the `#` character

I have a field called email_txt of type text_general that holds a list of emails of type abc#xyz.com,
and I'm trying to create a query that will only search the username and disregard the domain.
My query looks something like this:
email_txt:*abc*#*
This produces 0 results. I expect to receive results where the username contains abc, like abcdefg#xyz.com, fooabc#xyzbuzz.com, barabcefg#fizzxyz.com, abc#fizz.com. And yes, I am confident that I have data of that type, it doesn't work even if I try email_txt:*#*.
If I try something like:
email_txt:*abc*
It works, and produces multiple results, including the desired ones from above, but also cases where the domain contains abc, like fizz#helpmeabc.com, which is not desired.
I've had a look at the documentation (just in case I'm going crazy) and it confirms that # is not a special character. Even so, I have tried to escape it like this (just in case, I am going crazy):
email_txt:*abc*\#*
still, 0 results
Now the actual question. Is # a special character? If so, how can it be escaped, if not what am I doing wrong in the query? I genuinely can't tell if there is a flaw in my logic, or if there is something that I am missing.
Notes: I'm using solr version 6.3.0, the doc is for 6.6 (the closest available)
When you're using the StandardTokenizer (which the default field types such as text_general and text_en use), the content will be split into tokens wherever the # sign occurs. That means that for your example fizz#helpmeabc.com, there are actually two or three tokens being stored: (fizz and helpmeabc.com) or (fizz, helpmeabc and com).
A wildcard match is applied against the tokens by themselves (unless using the complex phrase query parser), with no tokenization or filtering taking place on the query (except for multi-term aware filters such as the lowercase filter).
The effect is that your query, *abc*#*, attempts to match a token containing #, but since the processing at index time splits on # and separates the tokens at that character, no tokens contain #, and thus you get no hits.
You can use the string field type or a KeywordTokenizer paired with filters such as the lower case filter, etc. to get the original input more or less as a complete token instead.
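A rough Python model of why the wildcard fails may help. It is not Solr's StandardTokenizer: split_tokens simply splits on # and lowercases, keyword_token keeps the whole value as one lowercased token, and fnmatchcase stands in for wildcard matching. All three function names are invented for this sketch.

```python
from fnmatch import fnmatchcase


def split_tokens(value):
    """Simplified stand-in for a tokenizer that breaks the value at '#'."""
    return [t.lower() for t in value.split("#") if t]


def keyword_token(value):
    """Simplified stand-in for KeywordTokenizer plus a lowercase filter."""
    return [value.lower()]


def wildcard_hits(pattern, docs, analyze):
    """Return docs where at least one indexed token matches the wildcard."""
    return [d for d in docs if any(fnmatchcase(tok, pattern) for tok in analyze(d))]


docs = ["abcdefg#xyz.com", "fooabc#xyzbuzz.com", "fizz#helpmeabc.com"]

# No token contains '#', so the pattern can never match a single token.
print(wildcard_hits("*abc*#*", docs, split_tokens))

# Without '#', the domain token 'helpmeabc.com' matches as well.
print(wildcard_hits("*abc*", docs, split_tokens))

# With the whole address kept as one token, only usernames containing abc match.
print(wildcard_hits("*abc*#*", docs, keyword_token))
```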

Solr query string not working for full text searches

I'm following this tutorial on how to perform indexing on sample documents using Solr. The default collection is "gettingstarted" as shown. Now I'm trying to query it. There are 52 entries as shown:
However, when I replace the q argument with, say, electronics, which should return 14 results, I get nothing.
When I replace the query string q with cat:electronics, then I actually get the 14 results. But why is this the case? Isn't q=word supposed to search for word wherever it appears?
No, it's not. Your assumption that:
isn't q=word supposed to search for word wherever it appears?
is wrong. If you're using word as your only query, and nothing more - you're searching for word in the default search field. It does not search all available fields in all available documents.
Also be aware that the default query parser assumes that your query is in the Lucene Query Syntax. To handle more "natural" querying, you can use the edismax query parser (selected with defType=edismax). This query parser supports the qf parameter that tells Solr which fields to search, instead of having to use the cat:electronics syntax. Your example would then be q=electronics&defType=edismax&qf=cat.
In the example documents you've given, qf=series_t author name cat is probably a decent value to search all these fields for the given query. You can also append ^<weight> to a field name to give hits in the different fields different weights. qf=name^10 cat would give a hit in name ten times the weight of a hit in the cat field.
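As a minimal sketch, the request parameters for such an edismax query could be assembled in Python like this. The helper edismax_params is hypothetical; the field names and the weight come from the answer above, and nothing is sent to Solr.

```python
from urllib.parse import urlencode


def edismax_params(text, field_weights):
    """Build edismax parameters; a weight of None means no ^boost suffix."""
    qf = " ".join(
        name if weight is None else f"{name}^{weight}"
        for name, weight in field_weights.items()
    )
    return {"q": text, "defType": "edismax", "qf": qf}


params = edismax_params("electronics", {"name": 10, "cat": None})
print(params["qf"])       # the qf value before URL encoding
print(urlencode(params))  # query string to append to the select URL
```

Note that ^ and the space in qf are percent-encoded by urlencode, so the raw qf value and the encoded query string look different.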

Solr OR query on a text field

How to perform a simple query on a text field with an OR condition? Something like name:ABC OR name:XYZ so the resulting set would contain only those docs where name is exactly "XYZ" or "ABC"
Dug tons of manuals, cannot figure this out.
I use Solr 5.5.0
Update: Upgraded to Solr 6.6.0, still cannot figure it out. Below are illustrations to demonstrate my issue:
This works:
This works too:
This still works:
But this does not! Omg why!?
There are many ways to perform an OR query. Below I have listed some of them. You can select any of them.
[Simple Query]
q=name:(XYZ OR ABC)
[Lucene Query Parser]
q={!lucene q.op=OR df=name v="XYZ ABC"}
Your syntax is right, but what you're asking for isn't what text fields are made for. A text field is tokenized (split into multiple tokens), and each token is searched by itself. So if the text inserted is "ABC DEF GHI", it will be split into three separate tokens, namely "ABC", "DEF" and "GHI". So when you're searching field:ABC, you're really asking for any document that has the token "ABC" somewhere.
Since you want to perform an exact match, you want to query against a field that is defined as a string field, as this will keep the value verbatim (including casing, so the matching will be case sensitive). You can tell Solr to index the same content into multiple fields by adding a copyField instruction, telling it to take the content submitted for field foo and also copy it into field bar, allowing you to perform both an exact match and a more general search as needed.
If you need to perform exact, but case insensitive, searches, you can use a KeywordTokenizer - the KeywordTokenizer does nothing, keeping the whole string as a single token, before allowing you to add filters to the analysis chain. By adding a LowerCaseFilter you tell Solr to lowercase the string as well before storing it (or querying for it).
You can use the "Analysis" page under the Solr admin page to experiment and see how content for your field is being processed for each step.
After that, querying as string_field:ABC OR string_field:XYZ should do what you want (or string_field:(ABC OR XYZ), or a few other ways to express the same).
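The difference between the three field setups described above can be illustrated with a toy Python model. This is not how Solr is implemented: each function below is an invented stand-in for an analysis chain (whitespace-tokenized text, verbatim string, KeywordTokenizer plus lowercasing), and term_matches applies the same chain to the query term and the stored value.

```python
def text_analyzer(value):
    """Rough model of a tokenized text field: one token per word."""
    return value.split()


def string_analyzer(value):
    """Rough model of a string field: the value is kept verbatim."""
    return [value]


def keyword_lowercase_analyzer(value):
    """Rough model of KeywordTokenizer followed by a lowercase filter."""
    return [value.lower()]


def term_matches(query_term, stored_value, analyzer):
    """A term matches when an analyzed query token equals an indexed token."""
    indexed = analyzer(stored_value)
    return any(tok in indexed for tok in analyzer(query_term))


# Text field: "ABC" matches any document containing that token.
print(term_matches("ABC", "ABC DEF GHI", text_analyzer))

# String field: only the exact, case-sensitive value matches.
print(term_matches("ABC", "ABC DEF GHI", string_analyzer))
print(term_matches("ABC", "ABC", string_analyzer))

# Keyword + lowercase: exact value, but case insensitive.
print(term_matches("abc", "ABC", keyword_lowercase_analyzer))
```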
A wacky workaround I've just come up with:

Solr DisMax query equivalent

I am trying to set up the elevate handler in SOLR 3.5.0 and I need the equivalent of the below query in dismax format, which defines different boost values on the same field based on the match type (an exact match gets 200 whereas a wildcard match gets 100).
q=name:(foo*^100.0 OR foo^200.0)
This is one way to solve this problem.
Keep a text field with only WhitespaceTokenizer (and maybe LowerCaseFilter depending on your case-sensitivity needs). Use this field for the exact match. Let's call this field name_ws.
Instead of using a wild-card query on name_ws, use a text-type copy field with EdgeNGramTokenizer in your analyzer chain, which will output tokens like:
food -> f, fo, foo, food
Let's call this field name_edge.
Then you can issue this dismax query:
q=foo&defType=dismax&qf=name_ws^200+name_edge^100
(Add debugQuery=on to verify if the scoring works the way you want.)
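To make the edge n-gram idea concrete, here is a small Python sketch of the prefixes such a tokenizer would emit for one token. The function edge_ngrams and its min_gram and max_gram parameters are illustrative only and are not Solr's implementation.

```python
def edge_ngrams(token, min_gram=1, max_gram=None):
    """Return the leading prefixes of token, from min_gram up to max_gram chars."""
    limit = len(token) if max_gram is None else min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, limit + 1)]


indexed_edge = edge_ngrams("food")  # tokens a field like name_edge would hold
indexed_ws = ["food"]               # token a field like name_ws would hold

print(indexed_edge)

# The plain term "foo" hits the prefix field but not the exact field,
# so no wildcard is needed to get the lower-boosted prefix match.
print("foo" in indexed_edge, "foo" in indexed_ws)
```

A document named exactly foo would match in both fields, which is what lets the higher boost on name_ws rank exact matches first.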

Multiple word queries on solr

I am using solr 3.3.0 working out of the box using the example folder
solrQueryParser defaultOperator = "OR"
My problem is that Solr doesn't seem to be returning good results when I search for a multiple word phrase.
The following search returns no results.
http://localhost:8080/solr/select/?q=roof+fixing
However, when I search for roof or fixing, they both return a few good results.
http://localhost:8080/solr/select/?q=roof returns 4 results
http://localhost:8080/solr/select/?q=fixing returns 3 results
On the query for "roof fixing", I expect solr to return 7 results. The 4 records for roof and 3 records for fixing.
Is any special configuration necessary for that to happen?
You just expressed your query incorrectly.
Try the following query from the Admin page:
(roof OR fixing)
Or, if you want to find that in a particular field:
fieldname:(roof OR fixing)
When you give SOLR a query like "roof fixing" you are effectively asking for all documents which have "roof" AND "fixing" in the default field (or the default dismax set of fields). The only way to change the meaning is to rewrite the query that your users type in. That's what we do, but on a larger scale. We have a front end interface that provides a whole bunch of options and generates a SOLR query from it. People can enter a search term in a specific field and if there is more than one word and it's not quoted, we add the AND. Then we OR together all of the fields that are filled in. Some fields are special and have a MIN and a MAX version which we turn into a range query :[0 TO 125000]. And there are some dropdowns that support multiple selections which we also turn into an OR, e.g. State:("WA" OR "CA" OR "OR" OR "NV")
Solr won't necessarily return 7 results for "roof OR fixing" as one result could include both "roof" and "fixing". Suppose "roof" has 3 results, "fixing" has 4, but both "roof" and "fixing" appear in 2 results. You will get only 5 results on a search for "roof OR fixing" as Solr will not return duplicate results.
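The counting argument above is just set union. A small Python sketch with made-up document ids:

```python
# Hypothetical document ids: 3 hits for "roof", 4 for "fixing", 2 shared.
roof = {"doc1", "doc2", "doc3"}
fixing = {"doc2", "doc3", "doc4", "doc5"}

both = roof & fixing    # documents containing both words
either = roof | fixing  # what an OR query would return, each document once

# |A or B| = |A| + |B| - |A and B|
print(len(roof), len(fixing), len(both), len(either))
```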
Have you tried using a url-encoded space ("%20") instead of the "+" sign? If the default operator is OR you should not need to include that operator.
