Luwak/Lucene vs Solr: TrieDateField range query - solr

For our system, we have a Solr schema defined with the basic TrieDateField fieldType, which has precisionStep=6 and stored, indexed, and docValues all set to true. We also have a custom query parser that takes a query like 'date > 2012-02-10T13:19:11Z' and turns it into a range query (in Lucene syntax it would look something like date:{1328879951000 TO *], but under the hood it just calls the getRangeQuery method on a TrieDateField object).
When running the query date > 2012-02-10T13:19:11Z in Solr, I correctly get back documents with a date field of 2014-05-11T12:00:00Z. However, when matching with Luwak, the same query matches nothing. In fact, the only query that works is strict equality. However, if I change the precisionStep in the schema for tdate to either 0 or a high number (above, say, 32), all range queries work as expected.
Is there a reason range queries match only when fewer precision levels are indexed (higher precisionStep)? Why do Solr and Luwak behave differently if they're using the same schema and the same query parser?

If anyone comes across this later (though this was probably a niche question, considering there were no answers and I'm using a deprecated field type): I was indexing the date without a specified precisionStep, while the query DID have a precisionStep.
When building the luwak document, I did:
InputDocument doc = InputDocument.builder("doc1")
.addField("date", iso_date_string, customAnalyzer).build();
When I needed to do something akin to:
FieldType ft = new FieldType();
ft.setNumericType(FieldType.LegacyNumericType.LONG);
ft.setNumericPrecisionStep(6);
ft.setStored(false);
ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
LegacyLongField field = new LegacyLongField("date", iso_date_as_long, ft);
builder.addField(field);
Where iso_date_as_long is the given iso_date_string converted to date with JodaTime, converted back to string with DateTools.dateToString, and then converted to a long again with DateTools.stringToTime.
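For anyone trying to picture why the precisionStep has to agree on both sides, here is a rough sketch. It makes two assumptions that are not in my original code: it uses java.time instead of JodaTime and DateTools for the date conversion, and prefixTerms is a simplified, hypothetical model of trie encoding (one shifted prefix per precision level), not Lucene's actual byte encoding. As far as I understand, a range query built with precisionStep=6 relies on the lower-precision terms being present in the document, so a value indexed with only its full-precision term can only be hit by an exact match.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class TrieTermsSketch {

    // Parse an ISO-8601 UTC timestamp into epoch milliseconds.
    static long toEpochMillis(String isoDate) {
        return Instant.parse(isoDate).toEpochMilli();
    }

    // Simplified model: one {shift, prefix} pair per precision level.
    // Real Lucene terms are encoded differently; this only shows how many
    // levels exist for a given precisionStep.
    static List<long[]> prefixTerms(long value, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<long[]> terms = new ArrayList<>();
        // shift is a long so that a very large step cannot overflow.
        for (long shift = 0; shift < 64; shift += precisionStep) {
            terms.add(new long[] {shift, value >>> (int) shift});
        }
        return terms;
    }

    public static void main(String[] args) {
        long millis = toEpochMillis("2012-02-10T13:19:11Z");
        System.out.println("epoch millis: " + millis);
        for (long[] term : prefixTerms(millis, 6)) {
            System.out.println("shift=" + term[0] + " prefix=" + term[1]);
        }
        System.out.println("levels with a step of 64: " + prefixTerms(millis, 64).size());
    }
}
```

In this model a step of 6 produces eleven levels per value, while a step of 64 or more produces only the full-precision level, which would be consistent with range queries working once the precisionStep was set very high.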

Related

Azure Search using Lucene Query Syntax Returns Incorrect Results

I am using the Microsoft.Azure.Search .NET SDK v5.0.1. I am attempting to perform a search against my Azure Search index as follows: Documents.SearchAsync("fieldname:val* AND timeStamp:2018-05-03T13\:23\:59Z"). The results are incorrect. There are exactly 2 documents in my index with that timestamp. There are 121 documents in my index where the fieldname starts with val. When I run the above query using the SDK, it always returns 121 documents. Is there some special way to query timestamp that I am missing?
There are a few points to make here:
In your index definition, I believe you have timeStamp set to be a String. Otherwise you wouldn't have been able to make the search query, as DateTime fields are not searchable. Firstly, I'd advise against treating timeStamp as a string. This is because searchable fields go through a bunch of analysis, tokenization being one step (see the reference on query parsing). In your case, the timestamp query (say 2018-05-03) will be tokenized into smaller constituents (2018, 05, 03), and documents containing any of those terms will be returned. That is why you observe what you see.
Your scenario seems to be a classic case of "filtering" results based on a criterion, followed by "searching" the filtered documents. To accomplish this, you need to do the following:
Use a filter on the timestamp, so that it doesn't go through the analysis.
On the filtered results, apply your search query.
Reference
I strongly recommend, however, that if possible you make your timeStamp column a datetime for more reasonable semantics.
As an example, here's how you'd go about achieving a filter + search combo:
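var parameters = new SearchParameters()
{
    // Filters are compared exactly and do not go through text analysis.
    Filter = "timeStamp eq '2018-05-03'"
};
// The search query then runs only against the filtered documents.
Documents.SearchAsync("fieldname:val*", parameters);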
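To illustrate the difference between analyzed matching and an exact filter, here is a small Java sketch. The tokenize method is a hypothetical stand-in for an analyzer (it simply splits on anything that is not a letter or digit); the analyzers Azure Search actually applies may behave differently, so treat it only as a model of the effect described above.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenizeSketch {

    // Hypothetical stand-in for an analyzer: split on anything that is
    // not a letter or digit and drop empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // True when the document shares at least one token with the query.
    static boolean matchesAny(String query, String document) {
        List<String> documentTokens = tokenize(document);
        for (String token : tokenize(query)) {
            if (documentTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String query = "2018-05-03";
        String otherDate = "2018-11-20";
        System.out.println("tokens: " + tokenize(query));
        System.out.println("analyzed match: " + matchesAny(query, otherDate));
        System.out.println("exact match: " + query.equals(otherDate));
    }
}
```

In this model a document dated 2018-11-20 matches the analyzed query because it shares the token 2018, whereas an exact comparison, which is what a filter does, rejects it.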

Solr Query Syntax conversion from boolean expression

I'm attempting to query Solr for documents, given a basic schema with the following field names (data types irrelevant):
I'm attempting to match documents that match at least one of the following:
occupation, name, age, gender, but I want to OR them together.
How do you OR together many terms and require the document to match at least one?
This seems to be failing: +(name:Sarah age:24 occupation:doctor gender:male)
How do you convert a boolean expression into Solr query syntax? I can't figure out the syntax with + and - and the default operator for OR.
I still don't fully understand your requirement, but you just need a query like:
+(age:24 OR gender:male)
Or, if you want to match multiple values in the same field with an OR condition, i.e. get documents with either age:24 or age:25:
+(age:24 OR age:25 OR gender:male)
Then you can shorten it to:
+(age:(24 25) OR gender:male)
If that isn't your requirement, let me know.
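If the clauses come from client code, the query string can be built mechanically. The following is a minimal Java sketch; the class and method names are made up for illustration, and it assumes the values contain no characters that need escaping in Solr query syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class OrQuerySketch {

    // Joins field:value clauses with an explicit OR and marks the whole
    // group as required with a leading '+'.
    // Assumption: values need no escaping.
    static String anyOf(Map<String, String> clauses) {
        if (clauses.isEmpty()) {
            throw new IllegalArgumentException("at least one clause is required");
        }
        return clauses.entrySet().stream()
                .map(entry -> entry.getKey() + ":" + entry.getValue())
                .collect(Collectors.joining(" OR ", "+(", ")"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the clauses in insertion order.
        Map<String, String> clauses = new LinkedHashMap<>();
        clauses.put("name", "Sarah");
        clauses.put("age", "24");
        clauses.put("occupation", "doctor");
        clauses.put("gender", "male");
        System.out.println(anyOf(clauses));
    }
}
```

Writing the OR explicitly means the result does not depend on which default operator the request handler is configured with.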
If you want to make it as simple as possible for the client, just go for the dismax[1] or edismax[2] query parser.
Specifically, you can configure a request parameter called "qf":
"The qf parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field’s importance in the query. For example, the query below:
qf=fieldOne^2.3 fieldTwo fieldThree^0.4
assigns fieldOne a boost of 2.3, leaves fieldTwo with the default boost (because no boost factor is specified), and fieldThree a boost of 0.4.
These boost factors make matches in fieldOne much more significant than matches in fieldTwo, which in turn are much more significant than matches in fieldThree." from the wiki
Then you can just pass a free text query, and it will be searched in the fields you specified, giving also different importance to each one, if necessary.
[1] https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html
[2] https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html
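As a rough sketch of what the client would send, the following Java snippet assembles the request parameters for an edismax query. The class and method names are hypothetical, and the field names are the ones from the quoted example; defType, q, and qf are the standard Solr parameter names.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class EdismaxParamsSketch {

    // URL-encode a single parameter value (form encoding, space becomes '+').
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Build the query string for a free-text search over the qf fields.
    static String queryString(String userQuery, String queryFields) {
        return "defType=edismax"
                + "&q=" + encode(userQuery)
                + "&qf=" + encode(queryFields);
    }

    public static void main(String[] args) {
        String params = queryString("doctor", "fieldOne^2.3 fieldTwo fieldThree^0.4");
        // Append this to the select handler URL of your own core.
        System.out.println(params);
    }
}
```

The client only passes the free text in q; which fields are searched and how they are weighted stays in qf, which could also be set as a default in the request handler configuration.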

Solr 4.2.1 and SOLR-1604 : ComplexPhrase AND date range queries do not work together

I recently patched my Solr 4.2.1 with the ComplexPhrase query addon (https://issues.apache.org/jira/browse/SOLR-1604). When I issue a query such as :
my_text_field:"testin* compl*"~1 AND my_date_field:2013-12-12T04:58:53.732Z
I get results that contain the text query I issued and the date I issued in the my_date_field.
But when I do this:
my_text_field:"testin* compl*"~1 AND my_date_field:[2013-01-01T02:58:53.732Z TO 2013-12-12T04:58:53.732Z]
I get no results.
If I remove the ComplexPhrase parser, things go back to normal (but then I have no support for complex phrase queries).
OK, after some time reading the Lucene and Solr code, I figured it out.
This patch creates a query parser that extends the Lucene QueryParser. The Lucene QueryParser does not handle range queries other than term ranges (simple strings, in a way). To specialize the behavior of the QueryParser, you must look up the field type and create the appropriate range query (e.g. NumericRangeQuery for numbers).

Lucene OR query not working

I am trying to query Solr with the following requirements:
I would like to get all documents which do not have a particular field:
-exclusivity:[* TO *]
I would like to get all documents which have this field with a specific value:
exclusivity:(None)
So when I query Solr 4 with:
fq=(-exclusivity:[* TO *]) OR exclusivity:(None)
I only get results where the field exists in the document and the value is None; the results from the first query are missing.
I cannot understand why it is not working.
To explain your results, the query (-exclusivity:[* TO *]) will always get no results, because you haven't specified any result to retrieve. By default, Lucene doesn't retrieve any results, unless you tell it to get them. exclusivity:(None) isn't a limitation placed on the full result set, it is the key used to find the documents to retrieve. This differs from a database, which by default returns all records in a table, and allows you to limit the set.
(-exclusivity:[* TO *]) only specifies what NOT to get, but doesn't tell it to GET anything at all.
Solr has logic to handle pure negative queries (I believe, in much the same way as below, by implicitly retrieving all documents first), but from what I gather, only as the top-level query; it does not handle queries like term1 OR -term2 (documented here).
I believe with solr you should be able to use the query *:* to get all docs (though that would not be available in raw lucene), so you could use the query:
(*:* -exclusivity:[* TO *]) exclusivity:(None)
which would mean, get (all docs except those with a value in exclusivity) or docs where exclusivity = "None"
I have found the answer to this problem. I made a bad assumption about how "-" works in Solr. I thought that
-exclusivity:[* TO *]
adds everything without the exclusivity field to the data set, but that is not the case. The '-' can only exclude things from the data set. BTW femtoRgon, you are right, but I am using it as an fq (filter query), not as the main query; I forgot to mention that.
So the solution is:
-exclusivity:([* TO *] AND -(None))
and the full query looks like:
/?q=*:*&fq=-exclusivity:([* TO *] AND -(None))
which means I will get everything that either does not have the exclusivity field or has it populated with the value None.
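The behaviour discussed in this thread can be modelled with plain sets. The Java sketch below uses a made-up three-document index and treats a negative clause as "remove from whatever has been selected so far"; it is a toy model for reasoning about the queries, not how Lucene evaluates them internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class NegativeClauseSketch {

    // Toy index: document id -> value of "exclusivity" (null = field absent).
    static Map<String, String> index() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", null);
        docs.put("doc2", "None");
        docs.put("doc3", "Gold");
        return docs;
    }

    // Documents matching exclusivity:[* TO *].
    static Set<String> hasField(Map<String, String> docs) {
        Set<String> ids = new TreeSet<>();
        docs.forEach((id, value) -> { if (value != null) ids.add(id); });
        return ids;
    }

    // Documents matching exclusivity:(wanted).
    static Set<String> withValue(Map<String, String> docs, String wanted) {
        Set<String> ids = new TreeSet<>();
        docs.forEach((id, value) -> { if (wanted.equals(value)) ids.add(id); });
        return ids;
    }

    // (-exclusivity:[* TO *]): nothing is selected, so nothing is left.
    static Set<String> pureNegative(Map<String, String> docs) {
        Set<String> result = new TreeSet<>();
        result.removeAll(hasField(docs));
        return result;
    }

    // (*:* -exclusivity:[* TO *]) exclusivity:(None)
    static Set<String> allMinusFieldOrNone(Map<String, String> docs) {
        Set<String> result = new TreeSet<>(docs.keySet());
        result.removeAll(hasField(docs));
        result.addAll(withValue(docs, "None"));
        return result;
    }

    // q=*:* with fq=-exclusivity:([* TO *] AND -(None))
    static Set<String> filterForm(Map<String, String> docs) {
        Set<String> excluded = hasField(docs);
        excluded.removeAll(withValue(docs, "None"));
        Set<String> result = new TreeSet<>(docs.keySet());
        result.removeAll(excluded);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = index();
        System.out.println("pure negative:      " + pureNegative(docs));
        System.out.println("*:* minus field:    " + allMinusFieldOrNone(docs));
        System.out.println("filter query form:  " + filterForm(docs));
    }
}
```

In this model the nested pure negative clause selects nothing, while the two working forms both return the document without the field and the document whose value is None.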

SOLR index time boost depending on the field value

Is it possible to boost a document on the indexing stage depending on the field value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard SOLR behavior that in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important because the hardware I run SOLR on is not too powerful, and trying to boost at query time fails with an OutOfMemory exception. (Even if I could work around that by increasing the memory for Java, I prefer to be on the safe side and build the index in the most efficient way possible.)
Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know there's no way of resolving this only on the solr server side at the moment.
If you're using the regular XML based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document depending on the length of the text field.
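A minimal Java sketch of that idea follows. The boost rule (1 divided by the square root of the text length) is purely an illustrative assumption, as are the class and method names; the only thing taken from the answer above is that the generating code writes a boost attribute into the submitted XML.

```java
import java.util.Locale;

final class BoostedDocSketch {

    // Hypothetical boost rule: shorter text gets a larger boost.
    static double boostFor(String text) {
        return 1.0 / Math.sqrt(Math.max(1, text.length()));
    }

    // Escape the characters that are special inside XML element content.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build an add command whose document carries a boost attribute.
    static String toAddXml(String fieldName, String text) {
        String boost = String.format(Locale.ROOT, "%.3f", boostFor(text));
        return "<add><doc boost=\"" + boost + "\">"
                + "<field name=\"" + fieldName + "\">" + escapeXml(text) + "</field>"
                + "</doc></add>";
    }

    public static void main(String[] args) {
        System.out.println(toAddXml("text", "ABCD"));
        System.out.println(toAddXml("text", "ABCDEBCE"));
    }
}
```

With this rule ABCD receives a larger boost than ABCDEBCE, which is the ordering asked for in the question; the exact formula would need tuning against real data.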
You can check the DIH Special Commands, which include a $docBoost command:
$docBoost : Boost the current doc. The value can be a number or the toString of a number
However, there seems to be no $fieldBoost command.
For your case, though, if you are using DefaultSimilarity, shorter fields are boosted higher than longer fields in the score calculation.
You can certainly implement your own Similarity class with a changed TF (term frequency) and lengthNorm calculation to suit your needs.
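To make the two knobs concrete, here is a standalone Java model of them. As far as I know, DefaultSimilarity uses the square root of the term frequency for TF and 1 divided by the square root of the number of terms for the length norm (before any boost is applied); flatTf is a hypothetical replacement that ignores repeated matches. The class does not extend any Lucene type, so it only shows the arithmetic, not how to plug a Similarity into Solr.

```java
final class SimilaritySketch {

    // TF as I understand DefaultSimilarity computes it: sqrt(freq).
    static double classicTf(int freq) {
        return Math.sqrt(freq);
    }

    // Hypothetical alternative: any number of matches counts the same.
    static double flatTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    // Length norm as I understand DefaultSimilarity computes it,
    // ignoring index-time boosts: 1 / sqrt(numTerms).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        System.out.println("classic tf, 1 match vs 2: " + classicTf(1) + " vs " + classicTf(2));
        System.out.println("flat tf, 1 match vs 2:    " + flatTf(1) + " vs " + flatTf(2));
        System.out.println("length norm, 4 vs 8 terms: " + lengthNorm(4) + " vs " + lengthNorm(8));
    }
}
```

With the flat TF, a second match no longer raises the score, so the length norm alone would favour the shorter field.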
