Inconsistent results in solr when query contains special characters

I am trying to use Solr to search some log files that are already indexed using SolrJ. I want to search for a string and get back the names of the files that contain it.
But if I use special characters, the search results are inconsistent. I have admin#10.x.x.x in my log file. If I search for admin, I get proper results. If I search for admin#, I also get proper results. But if I search for admin#10, I get no results. And if I search for a string like *a#, I get some results that don't even contain that string.
I tried escaping the metacharacters. That did not work.
I cannot show any code because it is related to my office work.
Note: by results I mean the names of the log files whose contents contain the search string (not files whose names contain it).

Related

Solr schema-browser hyphen escaped problem

Hello there, I'm trying to get the list of values of the imported field Movie_name.
However, the hyphen appears to be escaped and is not shown, as in this image example.
This is the data I have already imported.
As you can see, there are two documents with the Movie_name field:
"Movie_name":"sci-Fi2"},
"Movie_name":"sci-Fi"}]
What I'm trying to do is a simple analytics task: get the list of all names in the field Movie_name, and not the rest of the data.
So the question is: why are the hyphens escaped in the schema browser?
Why can't I get the exact, correct name from the field?

It's not being escaped - what you're seeing in the schema browser are the actual terms stored in the index (what is usually referred to as "tokens"). If you want these to be preserved in the original form (i.e. as a single token) to be used for faceting or analytics, store them as the type string instead of as a text based field (which have a tokenizer attached - and that tokenizer usually splits the string into multiple smaller tokens on natural split points, such as -).
In your example, sci-fi is turned into sci and fi. If you use a string type, or a KeywordTokenizer, the input is kept as it is, and the token is stored as sci-fi instead.
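The difference can be illustrated without Solr at all. The sketch below is only a simplified stand-in for what a text field does (it splits on anything that is not a letter or digit and lowercases, which is an assumption; the real analysis chain depends on your schema), next to a string-like field that keeps the value whole. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TokenizeSketch {
    // Rough stand-in for a text field: split on anything that is not
    // a letter or digit, then lowercase each piece.
    static List<String> textTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Stand-in for a string field (or a KeywordTokenizer):
    // the whole value stays a single token.
    static List<String> stringTokens(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        System.out.println(textTokens("sci-Fi"));   // [sci, fi]
        System.out.println(stringTokens("sci-Fi")); // [sci-Fi]
    }
}
```

The first list is roughly what the schema browser shows for a tokenized field; the second is what you would want for faceting or analytics.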

Is it possible to use multiple words in a filter query in SOLRJ / SOLR?

I am using SOLRJ (with SOLR 7) and my index features some fields for the document contents named content_eng, content_ita, ...
It also features a field with the full path to the document (processed by a StandardTokenizer and a WordDelimiterGraphFilter).
The user is able to search in the content_xyz fields thanks to the lines :
final SolrQuery query = new SolrQuery();
query.setQuery(searchedText);
query.set("qf",searchFields); // searchFields is a generated String which looks like "content_eng content_ita" (field names separated by space)
Now the user needs to be able to specify some words contained in the path (namely some subdirectories). So I added a filterQuery :
query.addFilterQuery(
"full_path_split:" + searchedPath);
If searchedPath contains only a single word from the document path, the document is correctly returned. However, if searchedPath has several words from the path, the document is not returned. To sum up, the fq only works if searchedPath contains a single word.
For example doc1 is in /home/user/dir1/doc1.txt
If I search for all documents (* in searchedText) that are in user dir (fq=full_path_split%3Adir), doc1.txt is returned.
If I do the same search but for documents that are in user and dir1 (fq=full_path_split%3Auser+dir1), doc1.txt is not returned. I think this is because the fq is parsed as "+full_path_split:user +_text_:dir1", as debug=query shows. I don't know where _text_ comes from; it may be a default field.
So is it possible to use a filter query with several words to fulfill my needs?
Any help appreciated,

Your suspicion is correct - the _text_:dir1 part comes from you not providing a field name, and the default field name being used instead.
You can work around this by using the more general edismax (or the older dismax) parser as you're doing in your main query with qf:
fq={!type=edismax qf='full_path_split'}user dir1
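On the Java side, a minimal sketch of building that filter string could look like the following. The helper name buildPathFilter is hypothetical, and the code only assembles the string; with SolrJ you would pass the result to query.addFilterQuery(...) as in the question.

```java
import java.util.List;

final class PathFilterSketch {
    // Builds an fq value that sends every word to the given field via edismax,
    // so that no word falls through to the default field.
    static String buildPathFilter(String field, List<String> words) {
        return "{!type=edismax qf='" + field + "'}" + String.join(" ", words);
    }

    public static void main(String[] args) {
        String fq = buildPathFilter("full_path_split", List.of("user", "dir1"));
        System.out.println(fq); // {!type=edismax qf='full_path_split'}user dir1
    }
}
```

Whether all words have to match or just one of them depends on the mm / q.op settings in effect; if every word must be present, adding something like mm='100%' to the local params is one option to try.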

Solr query giving wrong results searching multi-word (separated by space) string

I have indexed the following document in Solr, where app_name is a multi-word string, e.g. "Fire inspection":
{
"app_name":"Fire inspection",
"appversion":1,
"id":"app_1397_version_2417",
"icon":"/images/media/default_icons/app.png",
"type":"app",
"app_id":1397,
"account_id":556,
"app_description":"fire inspection app",
"_version_":1599441252925833216}]
}
If I execute the following Solr query, Solr returns the wrong response.
Query:
http://localhost:8983/solr/AxoSolrCollectionLocal/select?fq=app_name:*fire P*&q=*:*
I'm searching for records whose app_name contains "fire P", but I get a response whose app_name contains "fire inspection". The string "fire P" does not match the record below, but Solr still returns it.
Response:
{
"app_name":"Fire inspection",
"appversion":1,
"id":"app_1397_version_2417",
"icon":"/images/media/default_icons/app.png",
"type":"app",
"app_id":1397,
"account_id":556,
"app_description":"fire inspection app",
"_version_":1599441252925833216}]
}
Can someone please help me with a Solr query (similar to a LIKE query in SQL) that checks for a substring, spaces included?
Your help is greatly appreciated.

First - your query does not mean what you think it means. app_name:*fire P* means "search for anything ending in fire in the field app_name and/or anything starting with p in the default search field". Since you haven't prefixed the second value with a field name, the default search field will be used.
If you want to search for a substring match inside a field like that (i.e. something that contains "fire P" as a substring inside the value), the field has to be a string field, or a field with a keyword tokenizer. That way the field retains its actual value and is not processed, filtered or tokenized further. If it's being tokenized, those tokens (i.e. fire, inspection etc.) will be stored separately. You'll have to escape any spaces properly and query a single field (i.e. app_name:*fire\ P*), and depending on the use case, performance may take a hit unless you have the ReversedWildcardFilter enabled as well.
However, you can probably also use the ComplexPhraseQueryParser to get support for wildcards in phrase queries:
{!complexphrase inOrder=true}app_name:"*fire P*"
should work, as long as you actually have uppercase letters in your tokens (wildcards disable many filters, so you'll usually want the query to match the final form of the tokens in the index).
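For the string-field variant, the escaping can be done in code. This is a minimal sketch, assuming the field is a string field and that a contains-style match is wanted; the class and method names are hypothetical. The character list mirrors what SolrJ's ClientUtils.escapeQueryChars escapes, so in a SolrJ project you could call that instead of the hand-rolled loop.

```java
final class WildcardSketch {
    // Characters with special meaning in the Lucene/Solr query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    // Backslash-escapes special characters and whitespace so the whole
    // input is treated as one literal term.
    static String escapeTerm(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Wraps the escaped term in wildcards and pins it to a single field.
    static String containsQuery(String field, String substring) {
        return field + ":*" + escapeTerm(substring) + "*";
    }

    public static void main(String[] args) {
        System.out.println(containsQuery("app_name", "fire P")); // app_name:*fire\ P*
    }
}
```

Because the space is escaped, the parser no longer splits the input into two clauses, so nothing ends up in the default search field.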

Solr OR query on a text field

How to perform a simple query on a text field with an OR condition? Something like name:ABC OR name:XYZ so the resulting set would contain only those docs where name is exactly "XYZ" or "ABC"
Dug tons of manuals, cannot figure this out.
I use Solr 5.5.0
Update: Upgraded to Solr 6.6.0, still cannot figure it out. Below are illustrations to demonstrate my issue:
This works:
This works too:
This still works:
But this does not! Omg why!?

There are many ways to perform an OR query. I have listed some of them below; you can pick any of them.
[Simple Query]
q=name:(XYZ OR ABC)
[Lucene Query Parser]
q={!lucene q.op=OR df=name v="XYZ ABC"}

Your syntax is right, but what you're asking for isn't what text fields are made for. A text field is tokenized (split into multiple tokens), and each token is searched by itself. So if the text inserted is "ABC DEF GHI", it will be split into three separate tokens, namely "ABC", "DEF" and "GHI". So when you're searching field:ABC, you're really asking for any document that has the token "ABC" somewhere.
Since you want to perform an exact match, you want to query against a field that is defined as a string field, as this will keep the value verbatim (including casing, so the matching will be case sensitive). You can tell Solr to index the same content into multiple fields by adding a copyField instruction, telling it to take the content submitted for field foo and also copy it into field bar, allowing you to perform both an exact match if needed and a more general search if necessary.
If you need to perform exact, but case insensitive, searches, you can use a KeywordTokenizer - the KeywordTokenizer does nothing, keeping the whole string as a single token, before allowing you to add filters to the analysis chain. By adding a LowerCaseFilter you tell Solr to lowercase the string as well before storing it (or querying for it).
You can use the "Analysis" page under the Solr admin page to experiment and see how content for your field is being processed at each step.
After that, querying as string_field:ABC OR string_field:XYZ should do what you want (or string_field:(ABC OR XYZ), or a few other ways to express the same).
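If the list of values comes from code, the OR clause can be assembled programmatically. A minimal sketch, assuming a string field named string_field as above; the helper name anyOf is hypothetical, and each value is quoted so that values containing spaces stay intact.

```java
import java.util.List;
import java.util.stream.Collectors;

final class OrQuerySketch {
    // Produces field:("v1" OR "v2" ...), escaping backslashes and quotes
    // inside each value before wrapping it in double quotes.
    static String anyOf(String field, List<String> values) {
        return values.stream()
                .map(v -> "\"" + v.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(" OR ", field + ":(", ")"));
    }

    public static void main(String[] args) {
        System.out.println(anyOf("string_field", List.of("ABC", "XYZ")));
        // string_field:("ABC" OR "XYZ")
    }
}
```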

A wacky workaround I've just come up with:

Lucene search for a filename, using WordDelimiterFilterFactory

I'm indexing some data, including filenames.
What I want is, given an indexed filename such as:
MySupercool123girlfriend.jpg
to be able to search for it with:
supercool
supercool123
123
girlfriend
jpg
So at index time it's pretty easy to use WordDelimiterFilterFactory so that tokens like these are created:
my
supercool
mysupercool
mysupercool123
supercool123
123
girlfriend
jpg
girlfriend.jpg
etc...
The problem is that at search time, I don't really know what I should do.
If I use WordDelimiterFilterFactory at search time, MySupercool123girlfriend.jpg would match even toto.jpg, because in both cases a token jpg is created.
toto.jpg should not be in the result list at all, so having both results with the appropriate one scoring higher is not a solution for me.
Do you have any recommendations for indexing and searching filenames?

For this specific example of yours, i.e. if the search is for MySupercool123girlfriend.jpg and you want it to return only documents that contain the entire string, you can keep a copyField, say named filename_str, whose fieldType is string. String matching will ensure that you get an exact match. This could be a first-level "exact match" search.
However, I am guessing that you would want a search for 123girlfriend.jpg to return the document containing MySupercool123girlfriend.jpg. You can do a second-level search for this. Beginning with Solr 4.0 you can do a regex search like
q=filename_str:/.*123girlfriend.jpg/
(This regex query should also work for the filename field itself, if you are using preserveOriginal=1 in WordDelimiterFilterFactory at index time.)
Otherwise you can do a leading wildcard search, which works in earlier Solr versions too.
If you also want MySupercool.jpg to match MySupercool123girlfriend.jpg, then I guess you would have to manually do the work of WordDelimiterFilterFactory and construct a regex query like
q=filename_str:/.*My.*Supercool.*.jpg/
Another issue is that jpg is going to match a lot of documents, so you may want to split the filename and the extension and keep them as separate fields.
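Constructing such a regex by hand can be sketched as below. This is only an illustration: the class and method names are hypothetical, the split into parts is assumed to be done by the caller, and the check uses java.util.regex, whose syntax differs from Lucene's regex syntax in general but coincides for a simple pattern like this one. Unlike the hand-written queries above, the helper escapes the dot so that it only matches a literal ".".

```java
import java.util.List;
import java.util.regex.Pattern;

final class FilenameRegexSketch {
    // Characters escaped here so that each part is matched literally.
    private static final String META = ".?+*|{}[]()\"\\#@&<>~/";

    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds ".*part1.*part2...": the parts must occur in this order,
    // with anything in between, and the value must end with the last part.
    static String containsInOrder(List<String> parts) {
        StringBuilder sb = new StringBuilder(".*");
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) {
                sb.append(".*");
            }
            sb.append(escapeLiteral(parts.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String regex = containsInOrder(List.of("My", "Supercool", ".jpg"));
        System.out.println("q=filename_str:/" + regex + "/");
        // q=filename_str:/.*My.*Supercool.*\.jpg/
        System.out.println(Pattern.matches(regex, "MySupercool123girlfriend.jpg")); // true
        System.out.println(Pattern.matches(regex, "toto.jpg"));                     // false
    }
}
```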

Can you come up with a DisMax mm parameter that is meaningful for your use case?
See http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
E.g.
mm=100% and "MySupercool123girlfriend.jpg" would match only filenames that have all ["my", "supercool", "123", "girlfriend", "jpg"] terms in them
You can also find an expression that is less strict but still gives relevant results. See http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/util/doc-files/min-should-match.html
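To get a feel for what a percentage mm does, here is a rough simulation outside Solr. It is a simplification under stated assumptions: only plain percentage values are handled, the percentage is rounded down, at least one term is always required, and the token lists are written out by hand rather than produced by a real analyzer. All names are hypothetical.

```java
import java.util.List;
import java.util.Set;

final class MinShouldMatchSketch {
    // Returns true when enough query terms occur in the document's terms.
    static boolean matches(List<String> queryTerms, Set<String> docTerms, int mmPercent) {
        // Integer division rounds down; require at least one matching term.
        int required = Math.max(1, queryTerms.size() * mmPercent / 100);
        long hits = queryTerms.stream().filter(docTerms::contains).count();
        return hits >= required;
    }

    public static void main(String[] args) {
        List<String> query = List.of("my", "supercool", "123", "girlfriend", "jpg");
        Set<String> wanted = Set.of("my", "supercool", "123", "girlfriend", "jpg");
        Set<String> toto = Set.of("toto", "jpg");

        System.out.println(matches(query, wanted, 100)); // true
        System.out.println(matches(query, toto, 100));   // false
        System.out.println(matches(query, toto, 20));    // true: one of five terms is enough
    }
}
```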
