We have some code that processes messages based on MIME types, which requires matching them. A cursory glance suggests they are all lower-case, which leads one to wonder whether:
they are so by convention, or
is that part of the spec? (a search of RFC 2045/2046 for upper/lower case etc. did not return any hits)
So, can the case-insensitive comparison be omitted for the tiny performance boost?
MIME types are case insensitive. They are lowercase by convention only.
RFC 2045 says: "The type, subtype, and parameter names are not case sensitive." If you have a MIME type of text/plain that's a type of text and a subtype of plain. So, per the spec, these are not case sensitive.
As Cromax notes in a comment, MIME type parameter values may be case-sensitive. See the comment or the spec for details. But if you're matching only the mime type, subtype, or parameter names, they are case insensitive. Anecdotally, most people work with mime type and maybe subtype, and those are case insensitive.
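To make the distinction concrete, here is a minimal sketch of a comparison that lowercases the type, subtype, and parameter names while leaving parameter values untouched. The helper names `normalize_mime` and `same_media_type` are made up for illustration, and the parsing is deliberately naive: it assumes no quoted parameter value contains a `;`.

```python
def normalize_mime(value):
    """Return (type/subtype, params) with case-insensitive parts lowercased.

    Parameter *values* are left as-is, since they may be case sensitive.
    Naive parsing: assumes no ';' appears inside a quoted parameter value.
    """
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        name, _, val = part.partition("=")
        params[name.strip().lower()] = val.strip()
    return media_type, params


def same_media_type(a, b):
    """Compare only type/subtype, ignoring case and any parameters."""
    return normalize_mime(a)[0] == normalize_mime(b)[0]
```

With this, `same_media_type("Text/PLAIN", "text/plain; charset=UTF-8")` is true, while the `UTF-8` value itself is never case-folded.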
Related
I posted a document with the field value "Pineapple upside down cake." I want to get hits for pineapple, pine*, *side, pi?????le, upside down, etc. I chose text_en which does not find *side nor pi?????le.
What out of the box field type will give me hits for all the above?
I'm using Solr 7.6.
If you want to retain all the tokens as is (as I commented on your previous question about this, the text_en type contains a stemmer), use a field type with just a WhitespaceTokenizer and a LowercaseFilter. You'll have to define this field yourself.
I'm guessing you can use text_general to get a decent enough answer (it uses the StandardTokenizer, so it'll split on a few more cases than just whitespace).
The reason is that wildcard searches happen without most of the analysis chain being applied to the query (it's impossible to handle stemming, splitting, etc. properly when you don't have the complete token), so any wildcard search is matched against the list of tokens that were generated at index time, after processing.
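As a rough sketch, a field type with only a whitespace tokenizer and a lowercase filter could be defined through the Schema API with a payload like the one built below. The field type name `text_ws_lower` is made up, and the payload would be POSTed to your collection's `/schema` endpoint; check the exact shape against the Schema API documentation for your Solr version before relying on it.

```python
import json


def whitespace_lowercase_field_type(name="text_ws_lower"):
    """Build an add-field-type command: whitespace tokenizer + lowercase filter.

    The name is hypothetical; no stemmer is included, so tokens are kept as-is
    apart from lowercasing.
    """
    return {
        "add-field-type": {
            "name": name,
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
                "filters": [{"class": "solr.LowerCaseFilterFactory"}],
            },
        }
    }


# JSON body to send to the schema endpoint of the collection.
payload = json.dumps(whitespace_lowercase_field_type(), indent=2)
print(payload)
```

After adding the type, the field itself still has to be defined with that type and the documents re-indexed.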
I am attempting to index file shares as a way of identifying secrets. The problem is that most secrets (e.g. P#ssw0rd!) contain special characters that aren't easily escaped. I need a way to search for an exact literal string while ignoring special character meanings. I'm using Solr 6.3 and I believe it uses a managed schema, which relies on a REST API for configuration. I've seen this resolved somewhat with the older schema method, but not this one.
If you only want an exact match against the complete value of a field, use a string field as that will only give you exact matches without any further processing.
To change a field's type or add a new field with a given type through the Schema API (the managed schema), use the add field method with string as the field type.
If you're not using a client library, you'll still need to [escape any character with special meaning](https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Escaping Special Characters) in Solr (a client library such as SolrJ will do this for you). Skipping the escaping won't produce false positives, but some strings will be unable to match the field (for example, a secret that contains a space).
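If you do have to escape by hand, a minimal sketch could look like the following. The character set is an assumption, modelled loosely on what SolrJ's escaping helper is understood to cover (the query-syntax special characters plus whitespace); verify it against the query parser documentation for your Solr version. The function name `escape_query_chars` is made up for this example.

```python
# Assumed set of characters with special meaning to the query parser.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_chars(text):
    """Backslash-escape query-syntax characters and whitespace in text."""
    out = []
    for ch in text:
        if ch in SPECIAL_CHARS or ch.isspace():
            out.append("\\")
        out.append(ch)
    return "".join(out)


# Build a query against a hypothetical string field named "secret".
query = "secret:" + escape_query_chars("P#ssw0rd! here")
print(query)
```

Escaping the space is what lets a value containing whitespace be matched as a single term against a string field.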
When creating a document to add to a search index, you can specify the document language. I've done this, but would now like to query only those docs in a specific language. Is this possible? I assumed it would be trivial (and documented), but I can't find how to do it.
Thanks!
I don't think you can currently, but I haven't seen anything explicitly saying that. I'm inferring from these sentences that the language field is for their use and not for querying.
The language parameter for search.TextField:
Two-letter ISO 639-1 language code for the field's content, to assist in tokenization. If None, the language code of the document will be used.
And Building Queries:
Search supports all space-delimited languages as well as some languages not segmented by spaces (specifically, Chinese, Japanese, Korean, and Thai). For these languages, Search segments the text automatically.
They need to know the language so they know how to parse it into words.
My plan is to just add an additional field to my search documents that has the same value as the language field. It's slightly redundant, but simple to do.
    search.Document(
        fields=[
            ...,
            # Duplicate the document language in a field so it can be queried.
            search.TextField(name='language', value=lang),
        ],
        language=lang,
    )
For example:
When I search for "support", I want only results containing "support" to be returned, and NOT results containing "supports" or any other related matches.
Is it possible to implement like this?
Thanks.
Yes, if you search against an unanalyzed field type, matches are exact. In the default Solr schema the unanalyzed field type is named "string" (of class "solr.StrField").
EDIT: it depends on what you mean by "precisely". If your field value is "support desk" and your query is "support", should it match?
If your answer is yes, then you should look into configuring stemming.
If your answer is no, i.e. the query must match the field value and nothing else, then you should use a string (i.e. unanalyzed) field type.
Furthermore, if your query is "supports" and the field value is "Supports", should it match?
If your answer is yes, then you should use a LowerCaseFilterFactory (you can't do this on a string field type; you'll have to switch to a text field type).
If your answer is no, then it's OK to use a string field type.
In summary, the Lucene/Solr text analysis pipeline is very configurable. Take a look at the analyzer docs for a reference of all available options.
What you are describing is called stemming. There is another, almost identical question on Stack Overflow that is worth checking out: Solr exact word search
You will need to disable stemming in your configuration and re-index. I don't believe it's possible to do that at query time, since what is stored in your index is the stemmed version of the word. In your case "support" is stored in the index even if "supports" is displayed.
This should get you started: How to configure stemming in Solr?
The Google App Engine Datastore querying language (gql) does not offer inexact operators like "LIKE" or even case insensitivity. One can get around the case sensitive issue by storing a lower-case version of a field. But what if I want to search for a person but I'm not sure of the spelling of the name? Is there an accepted pattern for dealing with this scenario?
Quoting from the documentation:
Tip: Query filters do not have an explicit way to match just part of a string value, but you can fake a prefix match using inequality filters:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")
This matches every MyModel entity with a string property prop that begins with the characters abc. The unicode string u"\ufffd" represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.
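The same range trick can be reproduced on a plain sorted list, which may help show why it works. This is only an illustration in Python 3 using `bisect`, not Datastore code, and `prefix_range` is a made-up helper name.

```python
import bisect


def prefix_range(sorted_values, prefix):
    """Return the slice of sorted_values whose items start with prefix.

    Mirrors the filter `prop >= prefix AND prop < prefix + u"\\ufffd"`.
    """
    lo = bisect.bisect_left(sorted_values, prefix)
    hi = bisect.bisect_left(sorted_values, prefix + "\ufffd")
    return sorted_values[lo:hi]


names = sorted(["abacus", "abc", "abcd", "abcz", "abd", "b"])
print(prefix_range(names, "abc"))  # ['abc', 'abcd', 'abcz']
```

Everything that starts with the prefix sorts between the prefix itself and the prefix followed by a very large character, so the two bounds bracket exactly those values.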
http://code.google.com/appengine/docs/python/datastore/queriesandindexes.html
Another option is the SearchableModel; however, I don't believe it supports partial matches.
http://billkatz.com/2008/8/A-SearchableModel-for-App-Engine
You could store a Soundex version of the name in the datastore (see http://effbot.org/librarybook/soundex.htm). Then, to query a name, compute the Soundex code of the query and look that up.
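For reference, a minimal sketch of the common American Soundex variant is shown below. It is not the code from the linked page, and treating `h` and `w` as transparent between same-coded consonants is the rule assumed here. The idea would be to store `soundex(name)` in an extra property and run an equality filter on it.

```python
# Map each consonant to its Soundex digit.
CODES = {}
for digit, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character Soundex code of name ('' if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = [first.upper()]
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code  # vowels reset prev, so repeated codes count again
    return ("".join(result) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Names that sound alike, such as "Smith" and "Smyth", end up with the same code, so an exact-match query on the stored code finds both spellings.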