Hello there i'm trying to get the list of the imported field Movie_name
How ever the hyphen was escaped and not showing like this img example
And this the data i attached already
Like you see two data with the Movie_name
"Movie_name":"sci-Fi2"},
"Movie_name":"sci-Fi"}]
What i'm trying to do is a simple analytics & get all the list of names with the field Movie_name and not the data.
So the question why the hyphen are escaped in the schema-browser
Why i cannot get the exactly correct field name ???
It's not being escaped - what you're seeing in the schema browser are the actual terms stored in the index (what is usually referred to as "tokens"). If you want these to be preserved in the original form (i.e. as a single token) to be used for faceting or analytics, store them as the type string instead of as a text based field (which have a tokenizer attached - and that tokenizer usually splits the string into multiple smaller tokens on natural split points, such as -).
In your example, sci-fi is turned into sci and fi. If you use a string type, or a KeywordTokenizer, the input is kept as it is, and the token is stored as sci-fi instead.
Related
I have a field containing short texts (a few tokens). I index it as Text rather than String because I need to search within the text.
However, I need to search with the String-style (matching the entire field).
For example, if a field is Google Search Engine. I currently find the row by searching "search engine". While preserving this behavior, I need another option to catch the row only if the search term is "google search engine".
I believe it is possible by regex, but it should be slow.
I wonder if there is a standard way to do so or if I need to add another field of the same content but with the String type.
Use multiple fields - the definition of the second field will differ based on whether you want the search to be case sensitive or not. If you're OK with having a case sensitive field (i.e. "Google" and "google" are different terms), then string is the correct choice.
If you want the field to be case insensitive, use a TextField with a KeywordTokenizer (which keeps the input as a single, large token) with a LowercaseFilter attached (which lowercases the content).
You can then search both fields by using qf - query fields - with the edismax/dismax query parses and score them differently. If you only need explicit searching (you choose whether you want to match the whole string or just words in it yourself), using the field name in the regular way would work.
Use a copyField instruction to index the same content into both fields without changing your indexing pipeline. You'll need to reindex your core / collection for the new field to get any values.
And no, you can't do this with a regex, since the regex is applied against the tokens. You already have the tokens split up into smaller parts, so /foo bar/ doesn't have a foo bar token to match against, just foo and bar - neither match the regex.
I have a field called email_txt of type text_general that holds a list of emails of type abc#xyz.com,
and I'm trying to create a query that will only search the username and disregard the domain.
My query looks something like this:
email_txt:*abc*#*
This produces 0 results. I expect to receive results where the username contains abc, like abcdefg#xyz.com, fooabc#xyzbuzz.com, barabcefg#fizzxyz.com, abc#fizz.com. And yes, I am confident that I have data of that type, it doesn't work even if I try email_txt:*#*.
If I try something like:
email_txt:*abc*
It works, and produces multiple results, including the desired ones from above, but also cases where the domain contains abc, like fizz#helpmeabc.com, which is not desired.
I've had a look at the documentation (just in case I'm going crazy) and it confirms that # is not a special character. Even so, I have tried to escape it like this (just in case, I am going crazy):
email_txt:*abc*\#*
still, 0 results
Now the actual question. Is # a special character? If so, how can it be escaped, if not what am I doing wrong in the query? I genuinely can't tell if there is a flaw in my logic, or if there is something that I am missing.
Notes: I'm using solr version 6.3.0, the doc is for 6.6 (the closest available)
When you're using the StandardTokenizer (which the default field types text_general, text_en, etc. use by default), the content will be split into tokens when the # sign occurs. That means that for your example, there are actually two or three tokens being stored, (izz and helpmeabc.com) or (izz, helpmeabc and com).
A wildcard match is applied against the tokens by themselves (unless using the complex phrase query parser), where no tokenization and filtering taking place (except for multi term aware filters such as the lowercase filter).
The effect is that your query, *abc*#* attempts to match a token containing #, but since the processing when you're indexing splits on # and separate the tokens based on that character, no tokens contain # - and thus, giving you no hits.
You can use the string field type or a KeywordTokenizer paired with filters such as the lower case filter, etc. to get the original input more or less as a complete token instead.
How to perform a simple query on a text field with an OR condition? Something like name:ABC OR name:XYZ so the resulting set would contain only those docs where name is exactly "XYZ" or "ABC"
Dug tons of manuals, cannot figure this out.
I use Solr 5.5.0
Update: Upgraded to Solr 6.6.0, still cannot figure it out. Below are illustrations to demonstrate my issue:
This works:
This works too:
This still works:
But this does not! Omg why!?
There are many ways to perform OR query. Below I have listed some of them. You can select any of it.
[Simple Query]
q=name:(XYZ OR ABC)
[Lucene Query Parser]
q={!lucene q.op=OR df=name v="XYZ ABC"}
Your syntax is right, but what you're asking for isn't what text fields are made for. A text field is tokenized (split into multiple tokens), and each token is searched by itself. So if the text inserted is "ABC DEF GHI", it will be split into three separate tokens, namely "ABC", "DEF" and "GHI". So when you're searching field:ABC, you're really asking for any document that has the token "ABC" somewhere.
Since you want to perform an exact match, you want to query against a field that is defined as a string field, as this will keep the value verbatim (including casing, so the matching will be case sensitive). You can tell Solr to index the same content into multiple fields by adding a copyFile instruction, telling it to take the content submitted for field foo and also copying it into field bar, allowing you to perform both an exact match if needed and a more general search if necessary.
If you need to perform exact, but case insensitive, searches, you can use a KeywordTokenizer - the KeywordTokenizer does nothing, keeping the whole string as a single token, before allowing you to add filters to the analysis chain. By adding a LowercaseFilter you tell Solr to lowercase the string as well before storing it (or querying for it).
You can use the "Analysis" page under the Solr admin page to experiment and see how content for your field is being processed for each step.
After that querying as string_field:ABC OR string_field:XYZ should do what you want (or string_field:(ABC OR XYZ) or a few other ways to express the same.
A wacky workaround I've just come up with:
I'm using Solrnet to return search results and am also requesting the facets, in particular categories which is a multi-valued field.
The problem I'm coming up against is that the category "house products" is being returned as two seperate facets because of the space.
Is there a way of ensuring this is returned as a single facet value, or should I be escaping the value when it is added to the index?
Thanks in advance
Al
If the tokens are generated for house products then you are using text analysis for the field.
Text fields are not suggested to be used for Faceting.
You won't get the desired behavior as the text fields would be tokenized and filtered leading to the generation of multiple tokens which you see from the facets returned as response.
Use a copy field to copy the field to a String field to be able to facet on it without splitting the words.
SolrFacetingOverview :-
Because faceting fields are often specified to serve two purposes,
human-readable text and drill-down query value, they are frequently
indexed differently from fields used for searching and sorting:
They are often not tokenized into separate words
They are often not mapped into lower case
Human-readable punctuation is often not removed (other than double-quotes)
There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for
value retrieval.
Try to use String fields and it would be good enough without any overheads.
The faceting works on tokens, so if you have a field that is tokenized in many words it will split the facet too.
I suggest you create another field of type string used only for faceting.
In apache Solr why do we always need to prefer string field over text field if both solves purposes?
How string or text affects the parameters like index size, index read, index creation?
The fields as default defined in the solr schema are vastly different.
String stores a word/sentence as an exact string without performing tokenization etc. Commonly useful for storing exact matches, e.g, for facetting.
Text typically performs tokenization, and secondary processing (such as lower-casing etc.). Useful for all scenarios when we want to match part of a sentence.
If the following sample, "This is a sample sentence", is indexed to both fields we must search for exactly the text This is a sample sentence to get a hit from the string field, while it may suffice to search for sample (or even samples with stemmning enabled) to get a hit from the text field.
Adding to Johans Sjöbergs good answer:
You can sort a String but not a Text.