SOLR Query for literal values ignoring case - solr

I am attempting to index file shares as a way of identifying secrets. Problem is that most secrets (e.g. P#ssw0rd!) contain special characters that aren't easily escaped. I need a way to search for an exact literal string while ignoring special character meanings. I'm using SOLR 6.3 and I believe it uses a managed-schema which relies on a REST API to configuration. I've seen this resolved somewhat with the older schema method, but not this one.

If you only want an exact match against the complete value of a field, use a string field as that will only give you exact matches without any further processing.
To change a field's type or add a new field with a given type through the Schema API (the managed schema), use the add field method with string as the field type.
If you're not using a client library, you'll still need to [escape any character with special meaning](https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Escaping Special Characters) in Solr (the library will do this for you if you're using SolrJ for example) - this won't give a false positive, but will make certain strings not be able to match the field (if the secret has a space for example).

Related

Exact phrase search in solr (no substring)

I am using plain solr queries via http from rsolr ruby gem. So its practically the same as doing a query using solr admin interface. I need to search for exact phrases like "SFO 10+" in solr, but it will be treated as a substring search. And the "+" will have some special meaning even when using \\ as noted in the doc. It must not match substring, and the plus should be treated literally.
q: col0_t:"SFO 10+" returns 1 correct hit - 1 wrong
q: col0_t:"SFO 10\\+" same result
q: col0_t:"SFO 10" same result
The unexpected hit is:
"col0_t":"SFO 10-15"
Its a third party database, so I rather not modify anything in the database, unless its strictly necessary. It could damage the original system. We have made our own module on top of this system. I hope to do this directly, or I must clean the data afterwards in the app.

Searching with String-style within Text fields in Solar

I have a field containing short texts (a few tokens). I index it as Text rather than String because I need to search within the text.
However, I need to search with the String-style (matching the entire field).
For example, if a field is Google Search Engine. I currently find the row by searching "search engine". While preserving this behavior, I need another option to catch the row only if the search term is "google search engine".
I believe it is possible by regex, but it should be slow.
I wonder if there is a standard way to do so or if I need to add another field of the same content but with the String type.
Use multiple fields - the definition of the second field will differ based on whether you want the search to be case sensitive or not. If you're OK with having a case sensitive field (i.e. "Google" and "google" are different terms), then string is the correct choice.
If you want the field to be case insensitive, use a TextField with a KeywordTokenizer (which keeps the input as a single, large token) with a LowercaseFilter attached (which lowercases the content).
You can then search both fields by using qf - query fields - with the edismax/dismax query parses and score them differently. If you only need explicit searching (you choose whether you want to match the whole string or just words in it yourself), using the field name in the regular way would work.
Use a copyField instruction to index the same content into both fields without changing your indexing pipeline. You'll need to reindex your core / collection for the new field to get any values.
And no, you can't do this with a regex, since the regex is applied against the tokens. You already have the tokens split up into smaller parts, so /foo bar/ doesn't have a foo bar token to match against, just foo and bar - neither match the regex.

Solr query not working as expected when it contains the `#` character

I have a field called email_txt of type text_general that holds a list of emails of type abc#xyz.com,
and I'm trying to create a query that will only search the username and disregard the domain.
My query looks something like this:
email_txt:*abc*#*
This produces 0 results. I expect to receive results where the username contains abc, like abcdefg#xyz.com, fooabc#xyzbuzz.com, barabcefg#fizzxyz.com, abc#fizz.com. And yes, I am confident that I have data of that type, it doesn't work even if I try email_txt:*#*.
If I try something like:
email_txt:*abc*
It works, and produces multiple results, including the desired ones from above, but also cases where the domain contains abc, like fizz#helpmeabc.com, which is not desired.
I've had a look at the documentation (just in case I'm going crazy) and it confirms that # is not a special character. Even so, I have tried to escape it like this (just in case, I am going crazy):
email_txt:*abc*\#*
still, 0 results
Now the actual question. Is # a special character? If so, how can it be escaped, if not what am I doing wrong in the query? I genuinely can't tell if there is a flaw in my logic, or if there is something that I am missing.
Notes: I'm using solr version 6.3.0, the doc is for 6.6 (the closest available)
When you're using the StandardTokenizer (which the default field types text_general, text_en, etc. use by default), the content will be split into tokens when the # sign occurs. That means that for your example, there are actually two or three tokens being stored, (izz and helpmeabc.com) or (izz, helpmeabc and com).
A wildcard match is applied against the tokens by themselves (unless using the complex phrase query parser), where no tokenization and filtering taking place (except for multi term aware filters such as the lowercase filter).
The effect is that your query, *abc*#* attempts to match a token containing #, but since the processing when you're indexing splits on # and separate the tokens based on that character, no tokens contain # - and thus, giving you no hits.
You can use the string field type or a KeywordTokenizer paired with filters such as the lower case filter, etc. to get the original input more or less as a complete token instead.

What Solr field type provides basic wildcard searches?

I posted a document with the field value "Pineapple upside down cake." I want to get hits for pineapple, pine*, *side, pi?????le, upside down, etc. I chose text_en which does not find *side nor pi?????le.
What out of the box field type will give me hits for all the above?
I'm using Solr 7.6.
If you want to retain all the tokens as is (as I commented on your previous question about this, the text_en type contains a stemmer), use a field type with just a WhitespaceTokenizer and a LowercaseFilter. You'll have to define this field yourself.
I'm guessing you can use text_general to get a decent enough answer (it uses the StandardTokenizer, so it'll split on a few more cases than just whitespace).
The reason is that wildcard searches happens without most processing taking place (as it's impossible to do proper handling of stemming, splitting, etc. when you don't have the complete token), so any wildcard search will be against the generated list of tokens after processing.

Apache Solr string field or text field?

In apache Solr why do we always need to prefer string field over text field if both solves purposes?
How string or text affects the parameters like index size, index read, index creation?
The fields as default defined in the solr schema are vastly different.
String stores a word/sentence as an exact string without performing tokenization etc. Commonly useful for storing exact matches, e.g, for facetting.
Text typically performs tokenization, and secondary processing (such as lower-casing etc.). Useful for all scenarios when we want to match part of a sentence.
If the following sample, "This is a sample sentence", is indexed to both fields we must search for exactly the text This is a sample sentence to get a hit from the string field, while it may suffice to search for sample (or even samples with stemmning enabled) to get a hit from the text field.
Adding to Johans Sjöbergs good answer:
You can sort a String but not a Text.

Resources