Solr wildcard query on multiple words in text field

I'm searching for "foo" followed by "bar" in a text field named "doc".
My query needs to match the text "foo walks into a bar" but not "bar has place for foo"
I've seen a few similar questions, but no concrete answer.
Queries that don't work:
q=doc:foo*bar
q=doc:/.*foo.bar./
It seems that this is because each word in the text field is tokenized separately. Is there a way to get around this? (Note: I can't change the field type)

Have a look at the Surround Query Parser and at the Complex Phrase Query Parser
The SurroundQParser enables the Surround query syntax, which provides
proximity search functionality.
There are two positional operators: w creates an ordered span query
and n creates an unordered one. Both operators take a numeric value
to indicate distance between two terms. The default is 1, and the
maximum is 99.
Note that the query string is not analyzed in any way.
Example:
{!surround} 3w(foo, bar)
This example would find documents where the terms "foo" and "bar" were
no more than 3 terms away from each other (i.e., no more than 2 terms
between them).
Regarding the Complex Phrase Query Parser, pay attention to the inOrder parameter, which lets you specify whether the matched keywords must appear in the order given.
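As a rough illustration of the surround syntax quoted above, here is a minimal Python sketch that builds an ordered proximity query for the "doc" field from the question. The helper name surround_ordered_query and the use of the df local parameter to select the field are assumptions made for illustration, so check them against your Solr version. By the definition quoted above, "foo walks into a bar" puts "bar" four positions after "foo", so a distance of at least 4 should be needed, assuming the analyzer leaves no position gaps.

```python
from urllib.parse import urlencode, parse_qs


def surround_ordered_query(first, second, field, distance=4):
    """Build an ordered ("w") surround query: `first` followed by `second`.

    Hypothetical helper. The surround parser does not analyze the query
    string, so pass terms in their indexed form (e.g. already lowercased).
    """
    if not 1 <= distance <= 99:
        raise ValueError("distance must be between 1 and 99")
    # df=<field> as a local param is an assumption about how the field is chosen.
    return f"{{!surround df={field}}}{distance}w({first}, {second})"


q = surround_ordered_query("foo", "bar", field="doc", distance=4)
query_string = urlencode({"q": q})  # safe to append to a /select URL
```

Because the w operator is ordered, the same query should not match "bar has place for foo"; swapping w for n would make the match unordered.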

Related

Solr query string not working for full text searches

I'm following this tutorial on how to perform indexing on sample documents using Solr. The default collection is "gettingstarted" as shown. Now I'm trying to query it. There are 52 entries as shown:
However, when I replace the q argument with, say, electronics, I expect 14 results but get nothing.
When I replace the query string q with cat:electronics, I actually get the 14 results. Why is this the case? Isn't q=word supposed to search for word wherever it appears?
No, it's not. Your assumption that:
isn't q=word supposed to search for word wherever it appears?
is wrong. If you're using word as your only query, and nothing more - you're searching for word in the default search field. It does not search all available fields in all available documents.
Also be aware that the default query parser assumes that your query is in the Lucene Query Syntax. To handle more "natural" querying, you can use the edismax query parser. This query parser supports the qf parameter that tells Solr which fields to search, instead of having to use the cat:electronics syntax. Your example would then be q=electronics&qf=cat.
In the example documents you've given, qf=series_t author name cat is probably a decent value to search all these fields for the given query. You can also append ^<weight> to a field name to give hits in the different fields different weights. qf=name^10 cat would give a hit in name ten times the weight of a hit in the cat field.
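To make the qf syntax concrete, here is a small Python sketch that assembles the request parameters described above. The function name edismax_params and the dict-based weight format are illustrative assumptions. Only the parameter names (defType, q, qf) and the ^weight syntax come from the answer.

```python
from urllib.parse import urlencode, parse_qs


def edismax_params(query, field_weights):
    """Return Solr request parameters for an edismax search.

    field_weights maps a field name to a weight, or to None when the
    field should be searched without an explicit weight.
    """
    qf = " ".join(
        name if weight is None else f"{name}^{weight}"
        for name, weight in field_weights.items()
    )
    return {"defType": "edismax", "q": query, "qf": qf}


# A hit in name weighs ten times a hit in cat, as in the answer's example.
params = edismax_params("electronics", {"name": 10, "cat": None})
query_string = urlencode(params)  # percent-encodes "^" and spaces
```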

Solr eDismax Search - Prioritize phrase over individual words

I am trying to use the eDismax Query Parser so that a search query is interpreted both as a phrase and as individual words, with the phrase taking precedence over the individual words.
Example:
Search query: We are cool
Results should be:
Documents whose fields contain the phrase 'we are cool', at the top of the list
Documents whose fields contain any of 'we', 'are', 'cool', ranked by the number of occurrences
How would I go about implementing this? Thanks.
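The simplest way is to use the pf (phrase fields) parameter to boost documents in which the query terms appear together as a phrase; see the eDismax documentation for details.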
So for example, adding this (if you had those two fields):
q=We are cool&pf=mytitle^10 mydescription
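For a fuller picture, the sketch below puts the pf parameter next to the defType and qf parameters that an edismax request would normally also carry. The field names mytitle and mydescription come from the answer. The helper name and the choice to search the same two fields in qf are assumptions.

```python
from urllib.parse import urlencode, parse_qs


def phrase_boost_params(query, query_fields, phrase_fields):
    """Search individual words in query_fields; boost whole-phrase hits via pf."""
    return {
        "defType": "edismax",
        "q": query,
        "qf": " ".join(query_fields),   # per-word matching
        "pf": " ".join(phrase_fields),  # extra score when the words form a phrase
    }


params = phrase_boost_params(
    "We are cool",
    query_fields=["mytitle", "mydescription"],
    phrase_fields=["mytitle^10", "mydescription"],
)
query_string = urlencode(params)
```

Documents matching only some of the words are still returned through qf, while pf adds score on top for documents in which the words appear together.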

Solr negative boost

I'm looking into the possibility of de-boosting a set of documents during
query time. In my application, when I search for e.g. "preferences", I want
to de-boost content tagged with ContentGroup:"Developer" or, in other words,
push that content further down in the order. Here's the catch. I've the following
weights on query fields and boost query on source
qf=text^6 title^15 IndexTerm^8
As you can see, title has a higher weight.
Now, a bunch of content tagged with ContentGroup:"Developer" consists of a
title like "Preferences.material" or "Preferences Property" or
"Preferences.graphics". The boost on title pushes these documents to the
top.
What I'm looking for is a way to deboost all documents that are tagged
with ContentGroup:"Developer", irrespective of whether the term occurs in
text or title. I tried something like this, but it didn't make any difference.
Source:simplecontent^10 Source:Help^20 (-ContentGroup-local:("Developer"))^99
I'm using edismax query parser.
Any pointers will be appreciated.
Thanks,
Shamik
You're onto something with your last attempt, but you have to start with *:*, so that you actually have something to subtract the documents from. The resulting set of documents (those not matching your query) can then be boosted.
From the Solr Relevancy FAQ
How do I give a negative (or very low) boost to documents that match a query?
True negative boosts are not supported, but you can use a very "low" numeric boost value on query clauses. In general the problem that confuses people is that a "low" boost is still a boost, it can only improve the score of documents that match. For example, if you want to find all docs matching "foo" or "bar" but penalize the scores of documents matching "xxx" you might be tempted to try...
q = foo^100 bar^100 xxx^0.00001 # NOT WHAT YOU WANT
...but this will still help a document matching all three clauses score higher than a document matching only the first two. One way to fake a "negative boost" is to give a large boost to everything that does not match. For example...
q = foo^100 bar^100 (*:* -xxx)^999
NOTE: When using (e)dismax, people sometimes expect that specifying a pure negative query with a large boost in the "bq" param will work (since Solr automatically makes top level purely negative queries positive by adding an implicit "*:*") -- but this doesn't work with "bq", because of how queries specified via "bq" are added directly to the main query. You need to be explicit...
?defType=dismax&q=foo bar&bq=(*:* -xxx)^999
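Applied to the question, the FAQ's pattern might look like the following Python sketch. The helper deboost_clause is hypothetical, and the field name ContentGroup is taken from the question's prose (the attempted query used ContentGroup-local, so adjust it to whatever the schema actually defines).

```python
def deboost_clause(field, value, boost=999):
    """Return a bq clause that boosts every document NOT matching field:value.

    The explicit *:* gives the negative clause a set of documents to
    subtract from, as the FAQ requires for bq.
    """
    # Escape backslashes and double quotes so the value stays one quoted phrase.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'(*:* -{field}:"{escaped}")^{boost}'


bq = " ".join([
    "Source:simplecontent^10",
    "Source:Help^20",
    deboost_clause("ContentGroup", "Developer", boost=99),
])
```

The boost value probably needs tuning against the title^15 weight, since documents outside the Developer group only move ahead if the boost outweighs the title match.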

Solr highlighting gives field/snippets with ANY term, instead of those that satisfy the query fully

I'm using Solr 5.x with the standard highlighter, and I'm getting snippets that match only one of the search terms, even when I specify q.op=AND.
I need ONLY the fields and snippets that match ALL the terms (unless I say q.op=OR or just omit it), i.e. the field/snippet must satisfy the query. Solr does return the field/snippet that has all the terms, but it also returns many others.
I'm using hl.fl=* to get only the fields that contain the terms, and I'm searching against the default field ('text', which contains the full doc). I need to use * since I have multiple dynamic fields. Most fields are of type 'text_general' (for search and highlighting), and some are of type 'string' for faceting.
If it's not possible for snippets to have all the terms, I MUST at least get only the fields that satisfy the query fully (this question talks about matching all the terms, but the search query can become arbitrarily complex, so the fields/snippets should match the query, whatever it is).
Next, I also want snippets highlighted for proximity-based searches. What should I do/use for this? The fields returned in highlighting in this scenario should also satisfy the proximity query (currently I get any field that contains any term, without regard to proximity constraints or other query terms).
Thanks for your help.
I've also encountered the same problem with highlighting. In my case, the query like
(foo AND bar) OR eggs
highlighted eggs and foo even though bar was not present in the document. I didn't manage to come up with a proper solution, but I devised a dirty workaround.
I use the following query:
id:highlighted_document_id AND text:(my_original_query)
with debugQuery set to true. Then I parse the explain text for highlighted_document_id. The text contains the terms from the query that contributed to the score. Terms that should not be highlighted are absent from the explanation.
The Python regular expressions I use to extract the terms (valid for Solr 5.2.1):
import re
term_regex = re.compile(r'weight\(text:(.+) in')
wildcard_term_regex = re.compile(r'text:(.+), product')
Then I simply search for the highlight markers in the highlighted text and remove them if the marked term doesn't match any of the terms found by term_regex and wildcard_term_regex.
The solution is probably pretty limited, but works for me.
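The post-processing step described above could be sketched roughly as follows. It reuses the two regular expressions from the answer. Everything else is an assumption for illustration: the function names, the `<em>` highlight tags, and the one-term-per-line shape of the explain text.

```python
import re

TERM_REGEX = re.compile(r'weight\(text:(.+) in')
WILDCARD_TERM_REGEX = re.compile(r'text:(.+), product')
EM_REGEX = re.compile(r'<em>(.*?)</em>')  # assumes <em>...</em> highlight markers


def contributing_terms(explain_text):
    """Collect the query terms that appear in a debugQuery explain text."""
    terms = set()
    for line in explain_text.splitlines():
        for regex in (TERM_REGEX, WILDCARD_TERM_REGEX):
            match = regex.search(line)
            if match:
                terms.add(match.group(1).lower())
    return terms


def strip_unmatched_highlights(snippet, terms):
    """Remove highlight markers around words that did not contribute to the score."""
    def keep_or_unwrap(match):
        word = match.group(1)
        return match.group(0) if word.lower() in terms else word
    return EM_REGEX.sub(keep_or_unwrap, snippet)


# Hypothetical explain line, shaped only to fit the regexes above.
explain = "0.3 = weight(text:eggs in 7) [DefaultSimilarity], result of:"
cleaned = strip_unmatched_highlights(
    "<em>foo</em> with <em>eggs</em>", contributing_terms(explain)
)
```

One limitation of this sketch: the explain text lists analyzed terms (for example stemmed forms), while the snippet contains the original words, so a plain lowercase comparison will miss some matches.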

Solr Tokenizer Question

I have what I think is a simple solr exercise, but I'm unsure what to use.
I have a field of names, e.g. Joe Smith and Jack Daniels and Steve. They could each be one name or two names. I want to be able to search this s.t. if you search for "Danie" you get everything that has a first or last name that starts with "Danie". Three example returns would be "Danielle", "Steven Daniels", and "Danier Daniellson".
I would also like it so that the preference is given to the first name.
So two questions would be do I need to use a copyField and break up the names into first and last name? And what would my analyzer look like?
Edit: Two edits on the searching ability.
1. Something like "Joe S" should return all users that look like "Joe S*"
2. If a user searches with an "&" character, that should be included in the search and not used as an operator.
To solve your first part I suggest the following solution:
index your fields twice:
once with solr.KeywordTokenizerFactory - that will index your entire field as it is. It will not be split into tokens. This will be useful for boosting results with the preference given to the first name.
once with StandardTokenizerFactory (or a whitespace tokenizer followed by WordDelimiterFilterFactory, which is a filter rather than a tokenizer)
You can find more about these tokenizers here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
After you have indexed them into two fields with different tokenizers, you just use a boost query to boost results from one field (the one with preference given to the first name), as explained here: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_make_.22superman.22_in_the_title_field_score_higher_than_in_the_subject_field
If a user searches with an "&" character, that should be included in the search and not used as an operator.
For this part you either use the DisMax query parser http://wiki.apache.org/solr/DisMaxQParserPlugin or, when you make a request, escape the "&" instead of sending a bare & (in a URL query string it has to be percent-encoded as %26).
You also need to use a tokenizer such as WhitespaceTokenizerFactory so that characters like "&" are kept in the tokens.
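The request-side escaping can be illustrated with Python's standard library, which percent-encodes "&" inside parameter values so that it is not mistaken for a parameter separator. The field name "name" and the example query are placeholders, not taken from the question.

```python
from urllib.parse import urlencode, parse_qs

# urlencode escapes "&" inside values as %26; the bare "&" characters that
# remain in the result are only the separators between parameters.
params = {"defType": "dismax", "q": "Joe & Sons", "qf": "name"}
query_string = urlencode(params)

# Decoding shows that the "&" reaches the server as part of the q value.
decoded_q = parse_qs(query_string)["q"][0]
```

Whether the "&" then matches anything still depends on the index-time analysis, which is why the answer recommends a tokenizer that keeps such characters.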
