We experience unexpectedly high recall for Chinese queries. I have managed to reproduce a minimal use-case using a simple data model with only 2 properties.
REPRODUCE
Define a property DescriptionZhCn for Chinese product descriptions, using zh-Hans.microsoft analyzer
Populate two records with the following values in DescriptionZhCn
Contoso 减振接杆
Contoso 缩径接柄
Search using options searchMode=all, queryType=full, searchFields=DescriptionZhCn, api-version=2019-05-06 with the following values in the search parameter:
减振接杆
缩径接柄
EXPECTED
When searching for 减振接杆 I would expect only the record with description "Contoso 减振接杆". When searching for 缩径接柄 I would expect only the record "Contoso 缩径接柄".
ACTUAL
Searching either 减振接杆 or 缩径接柄 unexpectedly return both records. The only thing common character is the third character 接.
I have verified the output from the zh-Hans.microsoft analyzer and it splits both of the Chinese strings into 4 tokens. E.g.
减振接杆 => 减 振 接 杆
My query only matches one of the tokens. And I'm using searchMode=all. Why does my query match? Is this a bug? Any input Yanoosh, Liam?
Related
I'm following this tutorial on how to perform indexing on sample documents using Solr. The default collection is "gettingstarted" as shown. Now I'm trying to query it. There are 52 entries as shown:
However, when I replace the q argument with say electronics, it should return 14 results. However, I get nothing.
When I replace the query string q with cat:electronics, then I actually get the 14 results. But why is this the case? isn't q=word supposed to search for word wherever it appears?
No, it's not. Your assumption that:
isn't q=word supposed to search for word wherever it appears?
is wrong. If you're using word as your only query, and nothing more - you're searching for word in the default search field. It does not search all available fields in all available documents.
Also be aware that the default query parser assumes that your query is in the Lucene Query Syntax. To handle more "natural" querying, you can use the edismax query parser. This query parser supports the qf parameter that tells Solr which fields to search, instead of having to use the cat:electronics syntax. Your example would then be q=electronics&qf=cat.
In the example documents you've given, qf=series_t author name cat is probably a decent value to search all these fields for the given query. You can also append ^<weight> to a field name to give hits in the different fields different weights. qf=name^10 cat would give a hit in name ten times the weight of a hit in the cat field.
I have a DB where i store strings of hex colors like f9f8f7 or aaaaaa
If search for a color, strange things happen:
if i search aaaaaa i get the one and only result hat contains aaaaaa
but if I search f9f8f7 i get more results not pertinent...
it seems like since f9f8f7 ha letters and numbers together Solr tries to split and search for single letters..
how do i prevent this?
You define your field to not split on numbers inside words. You can use the WhitespaceTokenizer in your field definition instead of the one you're currently using. The whitespace tokenizer will only break words on whitespace instead of using a range of other splitting points (which will depend on your tokenizer or if you have a WordDelimiter(Graph)Filter active in your analysis chain.
You can test out exactly how your strings are being processed for a specific field by going to "Analysis" under the collection / core in your Solr Admin.
How to perform a simple query on a text field with an OR condition? Something like name:ABC OR name:XYZ so the resulting set would contain only those docs where name is exactly "XYZ" or "ABC"
Dug tons of manuals, cannot figure this out.
I use Solr 5.5.0
Update: Upgraded to Solr 6.6.0, still cannot figure it out. Below are illustrations to demonstrate my issue:
This works:
This works too:
This still works:
But this does not! Omg why!?
There are many ways to perform OR query. Below I have listed some of them. You can select any of it.
[Simple Query]
q=name:(XYZ OR ABC)
[Lucene Query Parser]
q={!lucene q.op=OR df=name v="XYZ ABC"}
Your syntax is right, but what you're asking for isn't what text fields are made for. A text field is tokenized (split into multiple tokens), and each token is searched by itself. So if the text inserted is "ABC DEF GHI", it will be split into three separate tokens, namely "ABC", "DEF" and "GHI". So when you're searching field:ABC, you're really asking for any document that has the token "ABC" somewhere.
Since you want to perform an exact match, you want to query against a field that is defined as a string field, as this will keep the value verbatim (including casing, so the matching will be case sensitive). You can tell Solr to index the same content into multiple fields by adding a copyFile instruction, telling it to take the content submitted for field foo and also copying it into field bar, allowing you to perform both an exact match if needed and a more general search if necessary.
If you need to perform exact, but case insensitive, searches, you can use a KeywordTokenizer - the KeywordTokenizer does nothing, keeping the whole string as a single token, before allowing you to add filters to the analysis chain. By adding a LowercaseFilter you tell Solr to lowercase the string as well before storing it (or querying for it).
You can use the "Analysis" page under the Solr admin page to experiment and see how content for your field is being processed for each step.
After that querying as string_field:ABC OR string_field:XYZ should do what you want (or string_field:(ABC OR XYZ) or a few other ways to express the same.
A wacky workaround I've just come up with:
I just indexed a bunch of text data from our products DB. My goal is evaluating Apache Solr for production use.
This is a document example:
{
"shape":"Geometric",
"color":"MATTE BLACK",
"gender":"unisex",
"model":"CLUBMASTER RX 5154",
"sales":10,
"lens":"rugged",
"material":"plastic",
"brand":"Ray-Ban"
}
The most important thing in our search app is fuzzy matching, because inaccurate search terms are very frequent.
So, I'm a little disappointed with results found by Solr.
For example:
clubmaster -> many results
club master -> no results
Why?!
ray ban -> many results
rayban -> no results
I also tried putting ~1 or even ~2 after my term, with no luck!
All fields are indexed '*_txt_en' predefined field.
You can't just run a serious production setup without customizing schema/solrconfig to fit your specific needs. From what I can guess, you would get the results you want by:
copy your text fields into different versions with different analysis, for example:
one as a string type, hard to match
one field that is using EdgeNgram to match prefixes.
another with WordDelimiterFilterFactory to match ray-ban/rayban
...
using edismax as the query parser
in edismax, there are many things to tweak in it. But the most important is: search on all the fields above, but weight then in different way, the less analysis, the more weight
Solr newbie here.
I have created a Solr index and write a whole bunch of docs into it. I can see
from the Solr admin page that the docs exist and the schema is fine as well.
But when I perform a search using a test keyword I do not get any results back.
On entering * : *
into the query (in Solr admin page) I get all the results.
However, when I enter any other query (e.g. a term or phrase) I get no results.
I have verified that the field being queried is Indexed and contains the values I am searching for.
So I am confused what I am doing wrong.
Probably you don't have a <defaultSearchField> correctly set up. See this question.
Another possibility: your field is of type string instead of text. String fields, in contrast to text fields, are not analyzed, but stored and indexed verbatim.
I had the same issue with a new setup of Solr 8. The accepted answer is not valid anymore, because the <defaultSearchField> configuration will be deprecated.
As I found no answer to why Solr does not return results from any fields despite being indexed, I consulted the query documentation. What I found is the DisMax query parser:
The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field. Additional options enable users to influence the score based on rules specific to each use case (independent of user input).
In contrast, the default Lucene parser only speaks about searching one field. So I gave DisMax a try and it worked very well!
Query example:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video
You can also specify which fields to search exactly to prevent unwanted side effects. Multiple fields are separated by spaces which translate to + in URLs:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features+text
Last but not least, give the fields a weight:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features^20.0+text^0.3
If you are using pysolr like I do, you can add those parameters to your search request like this:
results = solr.search('search term', **{
'defType': 'dismax',
'qf': 'features text'
})
In my case the problem was the format of the query. It seems that my setup, by default, was looking and an exact match to the entire value of the field. So, in order to get results if I was searching for the sit I had to query *sit*, i.e. use wildcards to get the expected result.
With solr 4, I had to solve this as per Mauricio's answer by defining type="text_en" to the field.
With solr 6, use text_general.