I have a field called email_txt of type text_general that holds a list of emails of the form abc#xyz.com, and I'm trying to write a query that searches only the username part and disregards the domain.
My query looks something like this:
email_txt:*abc*#*
This produces 0 results. I expect to receive results where the username contains abc, like abcdefg#xyz.com, fooabc#xyzbuzz.com, barabcefg#fizzxyz.com, abc#fizz.com. And yes, I am confident that I have data of that type; it doesn't work even if I try email_txt:*#*.
If I try something like:
email_txt:*abc*
It works, and produces multiple results, including the desired ones from above, but also cases where the domain contains abc, like fizz#helpmeabc.com, which is not desired.
I've had a look at the documentation (just in case I'm going crazy) and it confirms that # is not a special character. Even so, I have tried to escape it like this (just in case, I am going crazy):
email_txt:*abc*\#*
still, 0 results
Now the actual question: is # a special character? If so, how can it be escaped? If not, what am I doing wrong in the query? I genuinely can't tell if there is a flaw in my logic or if there is something I am missing.
Note: I'm using Solr version 6.3.0; the documentation I checked is for 6.6 (the closest available).
When you're using the StandardTokenizer (which the default field types text_general, text_en, etc. use), the content is split into tokens whenever the # sign occurs. That means that for your example there are actually two or three tokens being stored: (fizz and helpmeabc.com) or (fizz, helpmeabc and com).
A wildcard match is applied against the tokens by themselves (unless you use the complex phrase query parser), with no tokenization or filtering taking place (except for multi-term-aware filters such as the lowercase filter).
The effect is that your query *abc*#* attempts to match a token containing #, but since the processing at indexing time splits on # and separates the tokens at that character, no token contains # - and thus you get no hits.
You can use the string field type or a KeywordTokenizer paired with filters such as the lower case filter, etc. to get the original input more or less as a complete token instead.
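As a rough sketch (the type and field names string_lowercase and email_exact are just placeholders, not something from your schema), such a field type could look like this:

<fieldType name="string_lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- keep the whole input value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- lowercase it so matching is case insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="email_exact" type="string_lowercase" indexed="true" stored="true"/>

A wildcard query such as email_exact:*abc*#* would then be matched against the full, lowercased value, # included.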
Related
I posted a document with the field value "Pineapple upside down cake." I want to get hits for pineapple, pine*, *side, pi?????le, upside down, etc. I chose text_en, which matches neither *side nor pi?????le.
What out of the box field type will give me hits for all the above?
I'm using Solr 7.6.
If you want to retain all the tokens as is (as I commented on your previous question about this, the text_en type contains a stemmer), use a field type with just a WhitespaceTokenizer and a LowercaseFilter. You'll have to define this field yourself.
I'm guessing you can use text_general to get a decent enough answer (it uses the StandardTokenizer, so it'll split on a few more cases than just whitespace).
The reason is that wildcard searches happen without most processing taking place (it's impossible to do proper handling of stemming, splitting, etc. when you don't have the complete token), so any wildcard search runs against the generated list of tokens after processing.
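A minimal sketch of the WhitespaceTokenizer + LowercaseFilter type mentioned above (the name text_ws_lower is made up for this example):

<fieldType name="text_ws_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only, keeping tokens like "upside" intact -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lowercase so Pineapple and pineapple match -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Note that with only whitespace tokenization the trailing period stays attached ("cake."), which is one of the extra cases the StandardTokenizer in text_general would split on.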
How do I perform a simple query on a text field with an OR condition? Something like name:ABC OR name:XYZ, so that the resulting set contains only those docs where name is exactly "XYZ" or "ABC".
I've dug through tons of manuals and cannot figure this out.
I use Solr 5.5.0
Update: Upgraded to Solr 6.6.0, still cannot figure it out. Below are illustrations to demonstrate my issue:
[Screenshots omitted: three query variants work, but the one I need does not. Omg why!?]
There are many ways to perform an OR query. Below I have listed some of them; you can use whichever fits.
[Simple Query]
q=name:(XYZ OR ABC)
[Lucene Query Parser]
q={!lucene q.op=OR df=name v="XYZ ABC"}
Your syntax is right, but what you're asking for isn't what text fields are made for. A text field is tokenized (split into multiple tokens), and each token is searched by itself. So if the text inserted is "ABC DEF GHI", it will be split into three separate tokens, namely "ABC", "DEF" and "GHI". So when you're searching field:ABC, you're really asking for any document that has the token "ABC" somewhere.
Since you want to perform an exact match, you want to query against a field that is defined as a string field, as this will keep the value verbatim (including casing, so the matching will be case sensitive). You can tell Solr to index the same content into multiple fields by adding a copyField instruction, telling it to take the content submitted for field foo and also copy it into field bar, allowing you to perform both an exact match and a more general search as needed.
If you need to perform exact, but case insensitive, searches, you can use a KeywordTokenizer - the KeywordTokenizer does nothing, keeping the whole string as a single token, before allowing you to add filters to the analysis chain. By adding a LowercaseFilter you tell Solr to lowercase the string as well before storing it (or querying for it).
You can use the "Analysis" page under the Solr admin page to experiment and see how content for your field is being processed for each step.
After that, querying string_field:ABC OR string_field:XYZ should do what you want (or string_field:(ABC OR XYZ), or a few other ways of expressing the same thing).
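As a sketch, assuming the tokenized field is called name, the copyField setup could look like this (name_exact is a made-up field name):

<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="name_exact" type="string" indexed="true" stored="false"/>
<!-- index the same submitted content into both fields -->
<copyField source="name" dest="name_exact"/>

You'd then query name_exact:(ABC OR XYZ) for the exact match and name:... for the tokenized search.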
The problem is this: I've got a field (named name) which consists of names, for example "Иван Кирилов Петров", "Нина Семова Мариножа" and so on.
I want to make a query which will get all the names that have the first name 'Иван' and the last name 'Петров'; the middle name doesn't matter, so I will put a * wildcard character there.
There is also a bigger problem: if the user writes "Иван Кирилов Петров", I should be able to find this exact person.
What I have tried: I made the field of type text_ws and tested the following query:
q=name:Иван*Петров
Perfect - it finds what I want: all the names with first name Иван and last name Петров.
But when I want to find Иван Кирилов Петров I get no results, because for an exact search my field type should be string.
How can I solve this?
Try adding the autoGeneratePhraseQueries="true" flag to your text_ws field type definition, and use the debugQuery=true flag to see how it does the matches against the field. If the basic thing works, you can then look at the pf3 parameter in the eDisMax configuration to boost phrase matches.
Solr also comes with dedicated Token Filters for Russian, but you probably don't care about that for the people's names.
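As a sketch of where the flag goes (assuming the stock text_ws definition), it sits on the <fieldType> element:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- split on whitespace only -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

Then add debugQuery=true to a request, e.g. q=name:Иван*Петров&debugQuery=true, to see how the query is parsed and matched.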
I don't think you need a wildcard query. If you are only splitting on whitespace at index time (text_ws) and you get complete first, last and/or middle names in the query, you can do an AND query like
q=name:(Иван AND Петров)
or
q=name:(ИВАН AND МИНЧЕВ AND ПЕТРОВ)
Update: After your comment, I see that this will do a bag-of-words search and won't preserve the order. I guess you need to keep a string copy field of name, say name_str, which will give you more search options. For example, if there are 2 spaces in the query, meaning you get the first, middle and last names, then you can do an exact match on name_str like
q=name_str:"ИВАН%20МИНЧЕВ%20ПЕТРОВ"
If you are using Solr 4.0 or above, a regex query on the string field can help you. You can do
q=name_str:/ИВАН.*ПЕТРОВ/
which will match anything that begins with ИВАН and ends with ПЕТРОВ,
or even
q=name_str:/Иван.*?Кирилов.*?Петров/
Unfortunately, there is no Solr wiki page on regex search yet, but you can google around.
You need to distinguish between the different types of queries you want to do and do different searches. Maybe give a check-box to your users asking if they want an exact match or not.
For a phrase search, we want to bring up results only if there's an exact match (without ignoring stopwords). If it's a non-phrase search, we are fine displaying results even if the root form of the word matches etc.
We currently pass our data through StandardTokenizer, StopFilter, PorterStemFilter and LowerCaseFilter. Because of this, when a user searches for "password management", the search brings up results containing "password manager".
If I remove the stem filter, I will no longer be able to match the root form of a word for non-phrase queries. I was thinking I could index the same data into two fields of the document.
For the first field (to be used for phrase searches), following tokenizers/filters will be used:
StandardTokenizer, LowerCaseFilter
For the second field (non-phrase searches):
StandardTokenizer, StopFilter, PorterStemFilter, LowerCaseFilter
Now, based on whether it's a phrase search or not, I need to rewrite user's query to search in the appropriate field.
Is this the right way to address this issue? Is there any other way to achieve this without doubling index size?
Let's say the user's query is
summary:"Furthermore, we should also fix this"
Internally this will be translated to
summary_field1:"Furthermore, we should also fix this"
If user's query is
summary:(Furthermore, we should also fix this)
Internally this will be translated to
+summary_field2:furthermor +summary_field2:we +summary_field2:should +summary_field2:also +summary_field2:fix
Both summary_field1 and summary_field2 index the same data. summary_field1 passes through only StandardTokenizer and LowerCaseFilter, whereas summary_field2 passes through StandardTokenizer, StopFilter, PorterStemFilter and LowerCaseFilter.
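For reference, the two field types could be declared roughly like this (the type names are made up; the LowerCaseFilter is placed before the stop filter and stemmer here, which is the usual ordering so the stemmer sees lowercased input):

<fieldType name="text_phrase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
<field name="summary_field1" type="text_phrase" indexed="true" stored="true"/>
<field name="summary_field2" type="text_stemmed" indexed="true" stored="false"/>
<copyField source="summary_field1" dest="summary_field2"/>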
Please let me know if I'm missing something here.
By defining two different fields you can search for exact matches.
By using boosts you can also bring back results in one query. For example:
(firstField:"password management")^5 OR (secondField:"password management")^1
I'm currently experimenting with Solr and attempting to get a query to only retrieve documents where all the provided tokens match.
For example, assume I have a field called data which when indexed uses a PatternTokenizer to split the incoming string on a delimiting character, e.g. '/'. For the input string "Foo/Bar/Baz" I would expect to get three tokens (if my understanding of the docs is correct!). Adding a few more documents I end up with:
Foo/Bar/Baz ==> Foo, Bar, Baz
Foo/Far/Faz ==> Foo, Far, Faz
Boo/Bar/Baz ==> Boo, Bar, Baz
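For reference, a field type that behaves like this could be defined along these lines (a sketch of the setup described above; the type name and attributes are assumptions, not the actual schema):

<fieldType name="slash_delimited" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split the incoming string on '/' so Foo/Bar/Baz becomes Foo, Bar, Baz -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="/"/>
  </analyzer>
</fieldType>
<field name="data" type="slash_delimited" indexed="true" stored="true"/>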
When I come to query this field however, I get results I wasn't quite expecting. Using the query:
+data:Foo/Bar
I would expect this to match documents which contain both Foo and Bar, but instead it returns documents which contain either Foo or Bar, scoring those with both terms higher. Other than altering the query so that it resembles:
+data:Foo +data:Bar
is there any way to change the behaviour such that instead of matching all 3 of my example documents, it matches only the one?
This experiment was done using the nightly builds of Solr 4.0.
Thanks
You could set the default operator to AND in schema.xml, which will make all queries AND searches.
http://wiki.apache.org/solr/SchemaXml#Default_query_parser_operator
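In the schema.xml of that era of Solr (4.x) that is the solrQueryParser element (later deprecated in favour of q.op):

<solrQueryParser defaultOperator="AND"/>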
You could also change it per-query by adding q.op=AND to the solr url.
http://solrhost/solr/select?q=solr+lucene&q.op=AND