Why does Solr change record position after updating a field?

I am new to Solr and encountered some weird behavior when I update a field and then perform a search.
Here's the scenario:
I have 300 records in my core, and I have a search query in which I filter the results with
fq=IsSoldHidden:false AND IsDeleted:false AND StoreId:60
and sort by DateInStock asc.
Everything returns exactly the results I expect.
Here are the top 3 results of my query:
| id    | Price   | IsSoldHidden | IsDeleted | StoreId | StockNo | DateInStock          |
|-------|---------|--------------|-----------|---------|---------|----------------------|
| 27236 | 15000.0 | false        | false     | 60      | A00059  | 2021-06-07T00:00:00Z |
| 37580 | 0.0     | false        | false     | 60      | M9202   | 2021-06-08T00:00:00Z |
| 37581 | 12000   | false        | false     | 60      | M9173   | 2021-06-08T00:00:00Z |
But when I update (an atomic update, to be specific) the Price field of the 2nd row and trigger the same search with the same filter requirements again, the results change to this:
| id    | Price   | IsSoldHidden | IsDeleted | StoreId | StockNo | DateInStock          |
|-------|---------|--------------|-----------|---------|---------|----------------------|
| 27236 | 15000.0 | false        | false     | 60      | A00059  | 2021-06-07T00:00:00Z |
| 37581 | 0.0     | false        | false     | 60      | M9173   | 2021-06-08T00:00:00Z |
| 37582 | 0.0     | false        | false     | 60      | M1236   | 2021-06-08T00:00:00Z |
The 2nd row (37580) of the first result set was moved to the last row (document #300).
I researched online, and here's what I found:
"Solr changes document's score when its random field value altered"
but I think that situation is different from mine, since I did not add score as a sort.
I am not sure why it behaves like this.
Am I missing something? Or can anyone explain it?
Thanks in advance.

Since the dates are identical, their internal sort order depends on their position in the index.
Updating the document marks the original document as deleted and adds a new document at the end of the index, so its position in the index changes.
If you want the order to be stable, sort by date and id instead - that way the lower id will always come first when the dates are identical, and the sort will be stable.
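For example, assuming id is your uniqueKey field, the tie-broken sort would look like:
sort=DateInStock asc, id asc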

Related

Replacing placeholders with another table's data (without knowing the substitutions in advance)

I need to replace placeholders in a text, reading from a query matched to that specific message.
Table Template_Messages
| ID         | String                                                         | Query                                |
|------------|----------------------------------------------------------------|--------------------------------------|
| PICKUP_MSG | Your {vehicle_name} will be ready for pick-up on {pickup_date} | SELECT * FROM vehicles WHERE ID = ?  |
If I take the query from the 'Query' column, I find the following table:
Table Vehicles
| ID   | vehicle_name | plate   | pickup_date | ... |
|------|--------------|---------|-------------|-----|
| P981 | BMW X5       | AA014CC | 2022-09-20  | ... |
| Z323 | Ford Focus   | HH000JJ | 2022-10-21  | ... |
Then with the following query:
SELECT * FROM vehicles WHERE ID = 'Z323'
By making the appropriate substitutions I should obtain this output:
Your Ford Focus will be ready for pick-up on 2022-10-21
How can I achieve this?
And since the 'Query' column of the first table does not refer only to the 'vehicles' table, can this work dynamically with any placeholder/query?
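For what it's worth, here is a minimal sketch of one way to do this in Python. It assumes the placeholders always match the column names of the row the stored query returns; sqlite3 and the database file name are just stand-ins for whatever engine you actually use:

import sqlite3

def render_message(conn, message_id, param):
    cur = conn.cursor()
    # Fetch the template string and its lookup query
    # (table/column names follow the question).
    cur.execute("SELECT String, Query FROM Template_Messages WHERE ID = ?",
                (message_id,))
    template, query = cur.fetchone()
    # Run the stored query and turn the matching row into a
    # column-name -> value dict.
    cur.execute(query, (param,))
    columns = [d[0] for d in cur.description]
    row = dict(zip(columns, cur.fetchone()))
    # Replace each {placeholder} with the value of the same-named column.
    return template.format_map(row)

conn = sqlite3.connect("messages.db")  # hypothetical database file
print(render_message(conn, "PICKUP_MSG", "Z323"))
# -> Your Ford Focus will be ready for pick-up on 2022-10-21

Because the substitution is driven entirely by the column names of the returned row, the same function works for any template/query pair, which covers the dynamic part of the question.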

Algorithm or pseudocode for finding combinations from a table

I have a table with thousands of items, each with a lot of attributes (15+). I would like to select the following results:
1. Select all combinations of items that reach at least 100% of each attribute. Exactly 100% would be nice, but that's not necessary, so it can go over a little or be a little less (maybe ±2%).
2. All combinations would be a big dataset, so I think it would be better to sort them by price and select only the 10 cheapest ones.
3. Also, can I modify the selects so that one or several attributes can't go over some value, like 50% for example?
| item name | attribute 1 | attribute 2 | attribute 3 | price |
|-----------|-------------|-------------|-------------|-------|
| item 1    | 25%         | 1%          | 5%          | 1€    |
| item 2    | 10%         | 10%         | 10%         | 2€    |
| item 3    | 5%          | 20%         | 5%          | 3€    |
| item 4    | 20%         | 15%         | 50%         | 12€   |
I don't know if there is an existing algorithm for my problem (I hope so), or whether my problem has a name I can google, but I would be thankful for any tips on how to proceed.
The only way I can think of for now is to brute-force all the combinations and drop the unusable ones. But I don't think that's the right way (maybe I'm wrong and that's the only way).
The number of items, prices, and attribute values can change over time. If they were static, I would just run the brute-force option once and be done with it.
Sorry if this question was already asked.
EDIT:
As an example, I can provide nutritional information about food (all the numbers are made up):
The daily intakes of carbohydrates/fat/protein are 225g/30g/65g.
| item name | carbohydrates | fat | protein | sodium | price |
|-----------|---------------|-----|---------|--------|-------|
| apple     | 10g           | 1g  | 5g      | 1mg    | 1€    |
| banana    | 20g           | 2g  | 10g     | 1mg    | 2€    |
| pear      | 15g           | 3g  | 5g      | 5mg    | 3€    |
1. Find me a combination of foods which will reach the daily intake.
2. Now I want the same as in 1., but sorted by price / selecting the cheapest.
3. I want only combinations with sodium not exceeding 30mg.
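For reference, a minimal Python sketch of the brute-force idea described above (enumerate subsets, drop the unusable ones, keep the cheapest). It is exponential, so it only works for small item counts; the data and limits are the made-up ones from the example:

from itertools import combinations

# (name, carbs_g, fat_g, protein_g, sodium_mg, price_eur)
ITEMS = [
    ("apple",  10, 1,  5, 1, 1.0),
    ("banana", 20, 2, 10, 1, 2.0),
    ("pear",   15, 3,  5, 5, 3.0),
]
TARGETS = (225, 30, 65)  # daily carbs/fat/protein to reach (point 1)
SODIUM_CAP = 30          # hard upper limit (point 3)

def cheapest_combos(items, top=10):
    # Enumerate every subset, keep those that reach all targets without
    # exceeding the sodium cap, and return the `top` cheapest (point 2).
    found = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            totals = [sum(item[i] for item in combo) for i in (1, 2, 3)]
            sodium = sum(item[4] for item in combo)
            price = sum(item[5] for item in combo)
            if sodium <= SODIUM_CAP and all(
                    got >= need for got, need in zip(totals, TARGETS)):
                found.append((price, [item[0] for item in combo]))
    return sorted(found)[:top]

print(cheapest_combos(ITEMS))

With thousands of items this enumeration explodes; the problem is essentially a minimum-cost covering problem, so a generic integer-programming solver is a more realistic route for real data.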

Solr poor performance on date range query

I'm currently struggling to get decent performance on a date range query against a ~18M document core with NRT indexing, in Solr 4.10 (CDH 5.14).
I tried multiple strategies, but everything seems to fail.
Each document has multiple versions (10 to 100), valid at different, non-overlapping periods of time (startTime/endTime).
The query pattern is the following: query on the referenceNumber (or other criteria), but only return documents valid at a referenceDate (day precision). 75% of queries select a referenceDate within the last 30 days.
If we query without the referenceDate, we get very good performance, but a 100x slowdown with the additional referenceDate filter, even when forcing it as a postfilter.
Here are some perf tests from a Python script executing HTTP queries and computing the QTime for 100 distinct referenceNumber values.
+----+-------------------------------------+----------------------+--------------------------+
| ID | Query                               | Results              | Comment                  |
+----+-------------------------------------+----------------------+--------------------------+
| 1  | q=referenceNumber:{referenceNumber} | 100 calls in <10ms   | Performance OK           |
+----+-------------------------------------+----------------------+--------------------------+
| 2  | q=referenceNumber:{referenceNumber} | 99 calls in <10ms    | 1 call to warm up        |
|    | &fq=startDate:[* TO NOW/DAY]        | 1 call in >=1000ms   | the cache then all       |
|    | AND endDate:[NOW/DAY TO *]          |                      | queries hit the filter   |
|    |                                     |                      | cache. Problem: as       |
|    |                                     |                      | soon as new documents    |
|    |                                     |                      | come in, they invalidate |
|    |                                     |                      | the cache.               |
+----+-------------------------------------+----------------------+--------------------------+
| 3  | q=referenceNumber:{referenceNumber} | 99 calls in >=500ms  | The average of           |
|    | &fq={!cache=false cost=200}         | 1 call in >=1000ms   | calls is 734.5ms.        |
|    | startDate:[* TO NOW/DAY]            |                      |                          |
|    | AND endDate:[NOW/DAY TO *]          |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
How is it possible that the additional date range filter creates a 100x slowdown? Based on this blog post, I would have expected the date range filter to perform similarly to the query without it: http://yonik.com/advanced-filter-caching-in-solr/
Or is the only option to change the softCommit/hardCommit delays, create 30 warmup fqs for the past 30 days, and tolerate poor performance on 25% of our queries?
Edit 1: Thanks for the answer. Unfortunately, using integers instead of tdate does not seem to provide any performance gain. It can only leverage caching, like query ID 2 above. That means we would still need a warmup strategy with 30+ fqs.
+----+-------------------------------------+----------------------+--------------------------+
| ID | Query                               | Results              | Comment                  |
+----+-------------------------------------+----------------------+--------------------------+
| 4  | fq={!cache=false}                   | 35 calls in <10ms    |                          |
|    | referenceNumber:{referenceNumber}   | 65 calls in >10ms    |                          |
+----+-------------------------------------+----------------------+--------------------------+
| 5  | fq={!cache=false}                   | 9 calls in >100ms    |                          |
|    | referenceNumber:{referenceNumber}   | 6 calls in >500ms    |                          |
|    | AND versionNumber:[2 TO *]          | 85 calls in >1000ms  |                          |
+----+-------------------------------------+----------------------+--------------------------+
Edit 2: It seems that moving my referenceNumber from q to fq and setting different costs improves the query time (not perfect, but better). What's weird, though, is that a cost >= 100 is supposed to be executed as a postfilter, yet changing the cost from 20 to 200 does not seem to impact performance at all. Does anyone know how to check whether an fq param is executed as a postfilter?
+----+-------------------------------------+----------------------+--------------------------+
| 6  | fq={!cache=false cost=0}            | 89 calls in >100ms   |                          |
|    | referenceNumber:{referenceNumber}   | 11 calls in >500ms   |                          |
|    | &fq={!cache=false cost=200}         |                      |                          |
|    | startDate:[* TO NOW] AND            |                      |                          |
|    | endDate:[NOW TO *]                  |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
| 7  | fq={!cache=false cost=0}            | 36 calls in >100ms   |                          |
|    | referenceNumber:{referenceNumber}   | 64 calls in >500ms   |                          |
|    | &fq={!cache=false cost=20}          |                      |                          |
|    | startDate:[* TO NOW] AND            |                      |                          |
|    | endDate:[NOW TO *]                  |                      |                          |
+----+-------------------------------------+----------------------+--------------------------+
Hi, I have another solution for you; it should give good performance for the same query against Solr.
My suggestion is to store the dates in int format. Please find an example below.
Your start date: 2017-03-01
Your end date: 2029-03-01
Suggested int format:
Start date: 20170301
End date: 20290301
When you fire the same query with int values instead of dates, it works faster, as expected. So your query will be:
q=referenceNumber:{referenceNumber}
&fq=startNewDate:[* TO YYYYMMDD]
AND endNewDate:[YYYYMMDD TO *]
Hope it helps.
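For illustration, here is a minimal Python sketch of that encoding (startNewDate/endNewDate are the field names from the answer above, and day precision is assumed):

from datetime import date

def to_int_date(d: date) -> int:
    # Encode a date as YYYYMMDD; integer order then matches
    # chronological order, so range queries still work.
    return d.year * 10000 + d.month * 100 + d.day

today = to_int_date(date.today())  # e.g. 20240315
# Doubled braces produce the literal {!cache=false ...} local params.
fq = f"{{!cache=false cost=200}}startNewDate:[* TO {today}] AND endNewDate:[{today} TO *]"
print(fq)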

How to implement a custom sort order with some items at fixed places in Solr?

I have a list of drawers in a MySQL table named Drawers, like:
| BOX | RANK |
|----------------|--------|
| Box1 | 1 |
| Box2 | 2 |
| Box3 | 3 |
| Box4 | 4 |
| Box5 | 5 |
Then I have another source which says some of these boxes contain jewels and should be placed at a specific position only (let's call this table Jewelboxes):
| BOX | RANK |
|----------------|--------|
| Box1 | 4 |
| Box3 | 1 |
| Box5 | 3 |
I have certain restrictions that need to be adhered to:
I cannot write a stored proc on these tables.
I want to get a list of boxes from Solr where the positions of the jewelboxes are fixed irrespective of the sort order (ascending/descending). For example,
ascending order would be:
| BOX | RANK |
|----------------|--------|
| Box3 | 1 |
| Box2 | 2 |
| Box5 | 3 |
| Box1 | 4 |
| Box4 | 5 |
descending order would be:
| BOX | RANK |
|----------------|--------|
| Box3 | 1 |
| Box4 | 2 |
| Box5 | 3 |
| Box1 | 4 |
| Box2 | 5 |
I am importing these tables into Solr via the DIH, and I'm currently ripping my hair out thinking about how to do this. I have 2 options in mind, but neither is very clear, and I would like you folks here to help me out. My options are:
1. Write a query in such a way that it gives me the correct order (this would need master-level querying skills, because all we have is a select query in the DIH).
2. Write a CustomFieldComparator as described in the following link: http://sujitpal.blogspot.in/2011/05/custom-sorting-in-solr-using-external.html
Is there any third approach which can be followed to get the desired results?
UPDATE:
I can work without the descending order criterion, but I still need the ascending one.
Thanks :-)
I would create a custom indexer in a language you are comfortable with (see https://wiki.apache.org/solr/IntegratingSolr - I work in Python and use mysolr) when you need more flexibility like this. Maybe create two fields, one for "ascending" and one for "descending": build one list with the ranks and one with the reversed ranks, then query the second source and apply its overrides to both lists - or whatever else you need to do to process the data. The feasibility also depends on how many records you need to index. Then process each record, adding the information for the two ranking fields from the lists, and send the records to your Solr server in batches.
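As a rough sketch of the rank computation (plain Python; the function and variable names are made up, and the actual indexing/batching is left out):

def ranked(boxes, fixed):
    # Reserve the fixed ranks for the jewelboxes, then fill the
    # remaining (free) slots with the other boxes in the order given.
    slots = [None] * (len(boxes) + 1)  # index 0 unused; ranks are 1-based
    for box, rank in fixed.items():
        slots[rank] = box
    movable = (box for box in boxes if box not in fixed)
    for rank in range(1, len(boxes) + 1):
        if slots[rank] is None:
            slots[rank] = next(movable)
    return slots[1:]

boxes = ["Box1", "Box2", "Box3", "Box4", "Box5"]   # Drawers, ordered by RANK
fixed = {"Box1": 4, "Box3": 1, "Box5": 3}          # Jewelboxes overrides
asc = ranked(boxes, fixed)         # -> Box3, Box2, Box5, Box1, Box4
desc = ranked(boxes[::-1], fixed)  # -> Box3, Box4, Box5, Box1, Box2

Index each box's position in asc and desc as two integer fields, and always sort ascending on whichever field matches the direction the user asked for; the jewelboxes then stay at their fixed places in both orders, matching the tables above.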

Fill sequence in SQL rows

I have a table that stores a group of attributes and keeps them ordered in a sequence. One of the attributes (rows) could be deleted from the table, and in that case the sequence of positions should be compacted.
For instance, if I originally have these set of values:
+----+--------+-----+
| id | name   | pos |
+----+--------+-----+
| 1  | one    | 1   |
| 2  | two    | 2   |
| 3  | three  | 3   |
| 4  | four   | 4   |
+----+--------+-----+
If the second row is deleted, the positions of all subsequent rows should be updated to close the gap. The result should be this:
+----+--------+-----+
| id | name   | pos |
+----+--------+-----+
| 1  | one    | 1   |
| 3  | three  | 2   |
| 4  | four   | 3   |
+----+--------+-----+
Is there a way to do this update in a single query? How could I do this?
PS: I'd appreciate examples for both SQL Server and Oracle, since the system is supposed to support both engines. Thanks!
UPDATE: The reason for this is that users are allowed to modify the positions at will, as well as add or delete rows. Positions are shown to the user, and for that reason they should show a consistent sequence at all times (and this sequence must be stored, not generated on demand).
Not sure it works, but with Oracle I would try the following:
update my_table set pos = rownum;
This would work, but may be suboptimal for large datasets:
SQL> UPDATE my_table t
2 SET pos = (SELECT COUNT(*) FROM my_table WHERE id <= t.id);
3 rows updated
SQL> select * from my_table;
        ID NAME              POS
---------- ---------- ----------
         1 one                 1
         3 three               2
         4 four                3
Do you really need the sequence values to be contiguous, or do you just need to be able to display the contiguous values? The easiest way to do this is to let the actual sequence become sparse and calculate the rank based on the order:
select id,
       name,
       dense_rank() over (order by pos) as pos,
       pos as sparse_pos
  from my_table
(note: this is an Oracle-specific query)
If you make the position sparse in the first place, this would even make re-ordering easier, since you could make each new position halfway between the two existing ones. For instance, if you had a table like this:
+----+--------+-----+
| id | name | pos |
+----+--------+-----+
| 1 | one | 100 |
| 2 | two | 200 |
| 3 | three | 300 |
| 4 | four | 400 |
+----+--------+-----+
When it becomes time to move ID 4 into position 2, you'd just change the position to 150.
Further explanation:
Using the above example, the user initially sees the following (because you're masking the position):
+----+--------+-----+
| id | name | pos |
+----+--------+-----+
| 1 | one | 1 |
| 2 | two | 2 |
| 3 | three | 3 |
| 4 | four | 4 |
+----+--------+-----+
When the user, through your interface, indicates that the record in position 4 needs to be moved to position 2, you update the position of ID 4 to 150, then re-run your query. The user sees this:
+----+--------+-----+
| id | name | pos |
+----+--------+-----+
| 1 | one | 1 |
| 4 | four | 2 |
| 2 | two | 3 |
| 3 | three | 4 |
+----+--------+-----+
The only reason this wouldn't work is if the user is editing the data directly in the database. Though, even in that case, I'd be inclined to use this kind of solution, via views and instead-of triggers.
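To make the midpoint idea concrete, here is a minimal Python sketch (the names are made up, and handling for the case where midpoints run out, e.g. re-spreading the positions every so often, is omitted):

def move_before(positions, item, target):
    # Give `item` a sparse position halfway between `target` and the
    # row just above it, so no other row needs to be updated.
    ordered = sorted(positions, key=positions.get)
    i = ordered.index(target)
    lower = positions[ordered[i - 1]] if i > 0 else 0
    positions[item] = (lower + positions[target]) / 2

positions = {1: 100, 2: 200, 3: 300, 4: 400}  # id -> sparse pos
move_before(positions, item=4, target=2)      # ID 4 takes position 150.0
print(sorted(positions, key=positions.get))   # -> [1, 4, 2, 3]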
