Solr - execute different query based on condition - solr

I have 50 value fields, 50 booleans and a date field. To comply with the new GDPR standard, I need to make certain fields unsearchable. The difficult part in my case is that the fields that need to be unsearchable differ per record. So in one record fields 2 and 5 might be protected, while in another fields 3 and 7 are protected. The booleans encode this: every value field has a matching boolean that defines whether that field is protected.
All this only applies when the date field is still in the future. When the date is in the past, or there is no date at all, all fields of that record are searchable anyway, regardless of the booleans.
What I had in mind is to execute a different query per record, based on whether or not the date field of that record is in the future.
if (date > today) -> query1
else -> query2
Where query1 checks every field individually, taking into account the matching boolean. Is this possible, and how?

For the first condition, use separate fields for searching before and after the date has passed (if you can still store the value; I'm not too familiar with the detailed GDPR requirements).
I.e. have field_1 and field_1_before_date, and when indexing the document only submit a value for field_1_before_date if the matching boolean allows that field to be searched.
Issue two separate queries, one to get documents in the future and one to get documents in the past - in the first one you limit the fields you query to field_1_before_date, while in the second one you use field_1 instead.
You can combine these using _query_ (see "Using nested queries in Solr"):
q=yourfirstquery OR _query_:"your second query"
.. should work, unless there is a limitation to combining those using OR.
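As a rough sketch of how the two queries could be assembled into one request, the helper below builds the combined string in Python. The function name, the date field name "date", and the field_N / field_N_before_date naming are assumptions carried over from the answer, not something Solr prescribes, and the search term is assumed to need no escaping.

```python
def build_combined_query(term, field_names, date_field="date"):
    """Build 'future docs' OR _query_:"past or undated docs" as one query string."""
    # Future-dated documents: only the restricted *_before_date copies are searched.
    restricted = " OR ".join(f"{name}_before_date:{term}" for name in field_names)
    first = f"({date_field}:[NOW TO *] AND ({restricted}))"

    # Past-dated or undated documents: the full fields are searched.
    full = " OR ".join(f"{name}:{term}" for name in field_names)
    second = f"(*:* NOT {date_field}:[NOW TO *]) AND ({full})"

    return f'{first} OR _query_:"{second}"'


query = build_combined_query("foo", ["field_1", "field_2"])
print(query)
```

Whether the nested query combines cleanly with OR in a given Solr version is still worth checking against a test index, as the answer notes.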

Related

Luwak/Lucene vs Solr: TrieDateField range query

For our system, we have a Solr schema defined with the basic TrieDateField fieldType, which has precisionStep=6 as well as stored/indexed/docvalues all equal to true. We also have a custom query parser which will take a query like 'date > 2012-02-10T13:19:11Z' and turn it into a range query (in Lucene syntax it would look something like date:{1328879951000 TO *], but under the hood it's just calling the getRangeQuery method on a TrieDateField object).
When running the query date > 2012-02-10T13:19:11Z in Solr, I correctly get back documents with a date field of 2014-05-11T12:00:00Z. However, when matching using Luwak, the above query matches nothing. In fact, the only query that works is one with strict equality. However, if I change the precisionStep in the schema for tdate to be either 0 or a high number (above, say, 32), all range queries work as expected.
Is there a reason range queries are matching only with less indexed ranges (higher precisionStep)? Why is it different between solr and luwak, if they're using the same schema and same query parser?
If anyone comes across this later (though this was probably a niche question considering no answers and I'm using a deprecated field type), I was indexing the date without a specified precisionStep, while the query DID have a precisionStep.
When building the luwak document, I did:
InputDocument doc = InputDocument.builder("doc1")
.addField("date", iso_date_string, customAnalyzer).build();
When I needed to do something akin to:
FieldType ft = new FieldType();
ft.setNumericType(FieldType.LegacyNumericType.LONG);
ft.setNumericPrecisionStep(6);
ft.setStored(false);
ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
LegacyLongField field = new LegacyLongField("date", iso_date_as_long, ft);
builder.addField(field);
Where iso_date_as_long is the given iso_date_string converted to date with JodaTime, converted back to string with DateTools.dateToString, and then converted to a long again with DateTools.stringToTime.
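To make the conversion concrete, here is a small Python sketch of the same ISO-string-to-epoch-milliseconds step. It uses only the standard library rather than JodaTime or DateTools, so it illustrates the arithmetic and is not the code used above; the function name is hypothetical.

```python
from datetime import datetime, timezone


def iso_to_epoch_millis(iso_date_string):
    """Convert a UTC timestamp like 2012-02-10T13:19:11Z to epoch milliseconds."""
    parsed = datetime.strptime(iso_date_string, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing Z means UTC, so attach that zone before taking the timestamp.
    parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


# The date from the question maps to the long shown in the range query above.
print(iso_to_epoch_millis("2012-02-10T13:19:11Z"))  # 1328879951000
```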

solr function query by example

Can someone explain, with an example, how a Solr function query is used?
I could not find any concrete example that shows how the results differ with and without function queries.
I want something with an example URL and what it shows in the response.
A function query is a query that invokes a function on one (or more) of the fields available. You add a function query if the value you have in a field has to be processed to get the value you want - just as you'd do in a mathematical sense.
Showing "the difference between a query with function queries and without" isn't really possible, as they don't do the same thing. You pick one (or both) depending on what you need.
An example adapted from the reference manual: let's imagine we have a set of documents that describe users, and these users have two fields, mails_read and mails_received. To get anyone who has read no more than 50% of their mails, we can apply a filter query as a function with the frange query parser (fq here means filter query; the frange is what makes it a function query):
fq={!frange l=0 u=0.5}div(mails_read,mails_received)
Otherwise we'd be limited to retrieving users who had read a specific range of emails or who had received a specific range of emails, or we'd have to index a field holding the value of mails_read / mails_received and update it each time we updated the document (which is a perfectly valid strategy, and usually more efficient).
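Since the question asks for an example URL, the sketch below builds one and also mimics, on a few made-up users, which documents the frange filter would keep. The host, core name ("users") and sample data are hypothetical, and the local filter only imitates the inclusive lower and upper bounds that frange uses by default; skipping users with zero received mails is a choice made for this sketch.

```python
from urllib.parse import urlencode

FQ = "{!frange l=0 u=0.5}div(mails_read,mails_received)"

# Hypothetical host and core; only the q and fq parameters come from the answer.
url = "http://localhost:8983/solr/users/select?" + urlencode({"q": "*:*", "fq": FQ})


def frange_keep(docs, lower=0.0, upper=0.5):
    """Local imitation of the filter: keep users whose read ratio is in [lower, upper]."""
    kept = []
    for doc in docs:
        if doc["mails_received"] == 0:
            continue  # assumption for this sketch: skip users with no mail
        ratio = doc["mails_read"] / doc["mails_received"]
        if lower <= ratio <= upper:
            kept.append(doc["user"])
    return kept


sample = [
    {"user": "alice", "mails_read": 10, "mails_received": 100},
    {"user": "bob", "mails_read": 80, "mails_received": 100},
    {"user": "carol", "mails_read": 50, "mails_received": 100},
]
print(url)
print(frange_keep(sample))  # ['alice', 'carol']
```

Without the fq parameter the same request would return all three users, which is the "with and without" difference the question asks about.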
Another example is to use a function query for boosting documents, and the most common one is to boost by recency (i.e. that a more recent document receives a larger boost):
bf=recip(ms(NOW/HOUR,mydatefield),3.16e-11,1,1)
This applies the recip function to the difference (expressed in milliseconds) between the mydatefield field and the current hour.
recip: Performs a reciprocal function with recip(x,m,a,b) implementing a/(m*x+b) where m,a,b are constants, and x is any arbitrarily complex function.
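Plugging numbers into a/(m*x+b) shows what the boost above does. The constant 3.16e-11 is roughly 1 divided by the number of milliseconds in a year, so a brand-new document gets a boost of 1 and a one-year-old document gets about half of that. The Python function below only reproduces the arithmetic and is not Solr code.

```python
def recip(x, m, a, b):
    """Reciprocal function a / (m * x + b), as described above."""
    return a / (m * x + b)


MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000  # 31,536,000,000 ms

for label, age_ms in [("new", 0), ("1 year", MS_PER_YEAR), ("10 years", 10 * MS_PER_YEAR)]:
    boost = recip(age_ms, 3.16e-11, 1, 1)
    print(f"{label:>8}: boost = {boost:.3f}")
```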
Yet another fine use case is to use the special _val_ field - if you query against this magic field with a function, the value returned by the function will be used as the score of the document (instead of affecting it through boosting or limiting the resulting set of documents as a query).
_val_:"div(popularity, price)"
.. would give the score of the document based on the result of the division (what the values represent is up to you).
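As an illustration of what "the function value becomes the score" means for ordering, this sketch ranks a few made-up documents by popularity / price. The documents and the helper name are hypothetical, and the code imitates the ordering locally rather than calling Solr.

```python
def rank_by_function_score(docs):
    """Order document ids by popularity / price, highest score first."""
    scored = [(doc["popularity"] / doc["price"], doc["id"]) for doc in docs]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


docs = [
    {"id": "a", "popularity": 10, "price": 5},  # score 2.0
    {"id": "b", "popularity": 9, "price": 3},   # score 3.0
    {"id": "c", "popularity": 4, "price": 4},   # score 1.0
]
print(rank_by_function_score(docs))  # ['b', 'a', 'c']
```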

Filter on fields only if present on a document

Is it possible to filter a document by the value provided only if the document has the field?
For context,
I have document types A,B,C that have the field.
I also have document types D and E that don't.
I could define a query such that the filter only applies to the first subset, but I might later add a new document type to the first set which will invalidate this filter.
You'll have to combine the query with a match against all documents except those that have a value in the field:
myfield:foobar OR (*:* NOT myfield:*)
.. should do what you want. That being said, I'd probably wait to introduce these additional queries until I actually see that it's needed, as it will make each query more expensive without possibly being necessary in the future - but that's up to your judgement.
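A short sketch of the same logic, with a helper that builds the filter string for any field and a local predicate that mirrors what it is meant to match. Both function names are hypothetical, and the predicate is a plain-Python imitation for single-valued fields, not Solr's matching.

```python
def build_optional_field_filter(field, value):
    """Match docs where field equals value, or where the field is absent."""
    return f"{field}:{value} OR (*:* NOT {field}:*)"


def matches(doc, field, value):
    """Local imitation: a doc passes if it lacks the field or holds the value."""
    return field not in doc or doc[field] == value


print(build_optional_field_filter("myfield", "foobar"))
print(matches({"type": "D"}, "myfield", "foobar"))                    # True: no field
print(matches({"type": "A", "myfield": "baz"}, "myfield", "foobar"))  # False
```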

SOLR index time boost depending on the field value

Is it possible to boost a document on the indexing stage depending on the field value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard SOLR behavior that in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important as the hardware I run the SOLR on is not too powerful and trying to boost on query time returns with OutOfMemory Exception. (Even If I could work around that increasing memory for java I prefer to be on the safe side and implement the index the most efficient way possible.)
Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know there's no way of resolving this only on the solr server side at the moment.
If you're using the regular XML based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document depending on the length of the text field.
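A minimal sketch of that approach, assuming a Solr version that still accepts index-time boosts in the XML update format. The boost formula 1/sqrt(length) and the field names id and text are assumptions chosen so that shorter texts get the larger boost; they do not come from the answer.

```python
import math
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build an <add> update message with a per-document boost based on text length."""
    add = ET.Element("add")
    for doc in docs:
        # Hypothetical formula: shorter text gives a larger boost.
        boost = 1.0 / math.sqrt(max(len(doc["text"]), 1))
        doc_el = ET.SubElement(add, "doc", boost=f"{boost:.3f}")
        for name, value in doc.items():
            field_el = ET.SubElement(doc_el, "field", name=name)
            field_el.text = str(value)
    return ET.tostring(add, encoding="unicode")


xml_body = build_add_xml([
    {"id": "1", "text": "ABCDEBCE"},
    {"id": "2", "text": "ABCD"},
])
print(xml_body)
```

With the two items from the question, ABCD ends up with a boost of 0.500 and ABCDEBCE with 0.354.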
You can look at the DIH special commands, which include a $docBoost command:
$docBoost : Boost the current doc. The value can be a number or the toString of a number
However, there seems to be no $fieldBoost command.
In your case, though, if you are using DefaultSimilarity, shorter fields are already boosted higher than longer fields in the score calculation.
You can also implement your own Similarity class with the TF (term frequency) and lengthNorm calculations changed to suit your needs.

SOLR - Match range query only if all dates in range are matched

I am using SOLR and storing an array of dates on which a salesperson is available to visit clients (trips can last anywhere from a day upwards, depending on the client request). For each salesperson I have a list of the dates on which they are available in a given month. There are other fields, including salesperson data, geolocation information, etc.
I am familiar with range queries, but it seems that SOLR's range searches on arrays work differently than I would like: as long as any item in the array is a match, the range is a match. I would like to send SOLR a query with a range and only return a match if all dates in that range are found in the array. For example:
<arr name="available_dates">
<date>2012-04-30T00:00:00Z</date>
<date>2012-05-01T00:00:00Z</date>
<date>2012-05-02T00:00:00Z</date>
</arr>
-- should match --
available_dates:[2012-04-30T00:00:00.000Z TO 2012-05-02T00:00:00.000Z]
-- should not match as 2012-04-29 is not contained in available_dates --
available_dates:[2012-04-29T00:00:00.000Z TO 2012-05-02T00:00:00.000Z]
Is this possible or am I going about this all wrong?
You have the right idea, but your initial query is a search instead of a match. Intuitively, your search within available_dates:[2012-04-30T00:00:00.000Z TO 2012-05-02T00:00:00.000Z] should contain all of the elements of available_dates for it to have matched successfully.
You have two options to implement this logic efficiently and successfully. You can either manually or dynamically perform the range query for each element in your array, or you can set up an ancillary that attempts to perform the match after your search has been performed. For example:
available_dates:[2012-04-30T00:00:00.000Z TO 2012-05-02T00:00:00.000Z](available_dates)
Which is saying, in left to right order: evaluate the range search, then check that all of the results from available_dates are contained in this evaluation (by way of a default AND query). If they are, return the element. If not, don't.
Syntactically, the above is untested and probably does not work. But procedurally, you should be able to draft the right query around this to fit your needs.
(Additional resource discussing the default AND behavior of composite search queries)
Instead of using a range query you should use multiple clauses, one for each date.
So instead of available_dates:[2012-04-29T00:00:00.000Z TO 2012-05-02T00:00:00.000Z]
You should use available_dates:"2012-04-29T00:00:00.000Z" AND available_dates:"2012-04-30T00:00:00Z" AND available_dates:"2012-05-01T00:00:00.000Z" AND available_dates:"2012-05-02T00:00:00.000Z"
Hope that answers your question!
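Writing one clause per date by hand gets tedious for longer trips, so the clauses can be generated. The sketch below does this in Python; the function name is hypothetical, and it assumes every available date is stored at midnight UTC, as in the example document.

```python
from datetime import date, timedelta


def all_dates_query(field, start, end):
    """Build one AND-ed clause per day from start to end, both inclusive."""
    day_count = (end - start).days + 1
    clauses = []
    for offset in range(day_count):
        day = start + timedelta(days=offset)
        clauses.append(f'{field}:"{day.isoformat()}T00:00:00Z"')
    return " AND ".join(clauses)


query = all_dates_query("available_dates", date(2012, 4, 29), date(2012, 5, 2))
print(query)
```

For the range 2012-04-29 to 2012-05-02 this yields four clauses, so the example document (which lacks 2012-04-29) would not match.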
Assuming you're importing this data from a database:
In your database or in your search index, create a new column that stores the max of your salesperson's dates (i.e. the latest date), as well as the min. Also calculate and store the difference between the max and min dates.
Three criteria must be met for a query to match (so use AND in the query):
the difference between the query's max and min can't be bigger than the difference stored in the index
you'd make sure {!frange l=0 u=difn_bet_query_max_and_min}sub(field_min,query_min) holds
formulate the same thing for your max values
For a reference on function ranges
http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/
