Solr DataImportHandler delta import - solr

I am using the DataImportHandler for indexing data in Solr. I used full-import to index all the data in my database, which is around 10000 products. Now I am confused about the delta-import usage. Does it index new data added to the database on an interval basis (I mean, will it index the roughly 10 new rows added to my table), or does it just update the changes in the already indexed data?
Can anyone please explain it to me with a simple example as soon as you can?

The DataImportHandler can be a little daunting. Your initial query has loaded 10,000 unique products. This is loaded if you specify /dataimport?command=full-import.
When this import is done, the DIH stores a variable (${dataimporter.last_index_time}) which is the last date/time you did this import.
In order to do an update, you specify a deltaQuery. The deltaQuery is meant to identify the records that have changed in your database since the last update. So, you specify a query like this:
SELECT product_id
FROM sometable
WHERE [date_update] >= '${dataimporter.last_index_time}'
This will retrieve all the product_ids from your database that have been updated since your last import. The next query (deltaImportQuery) you need to specify is the query that will retrieve the full record for each product_id that you have from the previous step.
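The deltaImportQuery could then look something like the sketch below (assuming product_id is the primary key column configured for the DIH entity; depending on your Solr version the placeholder is written ${dih.delta.product_id} or ${dataimporter.delta.product_id}, and the DIH fills it in for each id returned by the deltaQuery):
SELECT *
FROM sometable
WHERE product_id = '${dih.delta.product_id}'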
Assuming product_id is your unique key, Solr will figure out that it needs to update an existing record, or add one if the product_id doesn't exist yet.
In order to execute the deltaQuery and the deltaImportQuery you use /dataimport?command=delta-import
This is a great simplification of all the possibilities, check the Solr wiki on DataImportHandler, it is a VERY powerful tool!

On another note:
When you use a delta import within a small time window (like a couple of times in a few seconds) and the database server is on another machine than the Solr index service, make sure that the system time of both machines matches, since the timestamp of [date_update] is generated on the database server and dataimporter.last_index_time is generated on the Solr machine.
Otherwise you will end up updating the index too little (or too much), depending on the time difference.

I agree that the DataImportHandler can handle this situation. One important limitation of the DIH is that it does not queue requests. As a result, if the DIH is "busy" indexing, it will ignore all further DIH requests until it is "idle" again. The skipped DIH requests are lost and never executed.

Related

Creating an efficient indexer for my Azure Search Service

My data source cannot be a single table as I need data that spans across 6 tables. For that I have created a view that does joins on these tables.
When I use this view as the data source, indexing takes a long time and times out. I tried increasing the timeout to 40 minutes and one more suggested change:
"disableOrderByHighWaterMarkColumn" : true
It still timed out. I also set the batch size to 1000. This time it populated the index but failed after a few hours saying "connection lost", and "thanks" to disableOrderByHighWaterMarkColumn, if I rerun the indexer it will process all the rows again.
My question is what is the best way to approach a solution to this problem.
Follow-up Question: Since I am relying on a view, I cannot have automatic change tracking. I am using a high watermark column (LastUpdatedTime) to track changes in my view. I only want to keep 6 months of data in my index, and I am not sure how I can do that when I am using a view. I already have a "where CreateDateTime > dateadd(month, -6, getdate())" clause in my view, but this will not enable the indexer to delete "out-of-time-window" rows (documents) from the index. How can I achieve my goals here?
Should I write a processor task to periodically query all documents using the C# SDK and delete documents based on date?
Sorry to hear the Azure SQL Database indexer is giving you trouble. I noticed a couple of things in your question that might be worth thinking about in terms of SQL performance:
My data source cannot be a single table as I need data that spans across 6 tables. For that I have created a view that does joins on these tables. When I use this view as the data source, indexing takes a long time and times out.
It's worth taking a look at the query performance troubleshooting guide to figure out what exactly is happening in your Azure SQL database that is causing problems. Assuming you want to use change tracking support, the default query the indexer uses against the SQL database looks like this:
SELECT * FROM c WHERE hwm_column > #hwmvalue ORDER BY hwm_column
We frequently see issues with performance here when there isn't an index on the hwm_column or if hwm_column is computed. You can read more about issues with the high water mark column in the indexer documentation.
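If the high water mark column isn't indexed, adding an index on the base table it comes from often helps. A minimal sketch, assuming the column lives in a hypothetical dbo.Products base table behind your view:
CREATE NONCLUSTERED INDEX IX_Products_LastUpdatedTime
    ON dbo.Products (LastUpdatedTime);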
I tried increasing the timeout to 40 minutes and one more suggested change: "disableOrderByHighWaterMarkColumn" : true. It still timed out. I also set the batch size to 1000. This time it populated the index but failed after a few hours saying "connection lost", and "thanks" to disableOrderByHighWaterMarkColumn, if I rerun the indexer it will process all the rows again.
disableOrderByHighWaterMarkColumn doesn't seem like it will work for your scenario, so I agree that you shouldn't set it. Decreasing the batch size seems to have had a positive effect; I would consider measuring the performance gain here using the troubleshooting guide referenced above.
Follow-up Question: Since I am relying on a view, I cannot have automatic change tracking. I am using a high watermark column (LastUpdatedTime) to track changes in my view. I only want to keep 6 months of data in my index, and I am not sure how I can do that when I am using a view. I already have a "where CreateDateTime > dateadd(month, -6, getdate())" clause in my view, but this will not enable the indexer to delete "out-of-time-window" rows (documents) from the index. How can I achieve my goals here? Should I write a processor task to periodically query all documents using the C# SDK and delete documents based on date?
Instead of filtering out data that is more than 6 months old, I would consider adding a soft delete policy. The challenge here is that the indexer needs to pick up rows that should be deleted. The easiest way to accomplish this might be updating your application logic to add a new column to your view indicating that the row should be deleted. Once the value of this column changes, the LastUpdatedTime should also be updated so the row shows up in the next indexer query.
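As a sketch with hypothetical names (the real view would join your 6 tables), the view could expose that flag next to the high water mark column; the indexer data source would then be configured with a soft delete column deletion detection policy pointing at IsDeleted:
CREATE OR ALTER VIEW dbo.ProductSearchView
AS
SELECT p.Id,
       p.Name,
       p.LastUpdatedTime,  -- high water mark column
       p.IsDeleted         -- set to 1 by application logic when a row ages out;
                           -- the same write should also bump LastUpdatedTime
FROM dbo.Products p;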
You can write your own processor task, but querying all documents in Azure Cognitive Search and paging through them may have negative implications for your search performance. I would recommend trying to get it working with your indexer first.

MAX on a VIEW doing a TableScan instead of using metadata lookup. Why doesn't SF use metadata?

I noticed even the simplest 'SELECT MAX(TIMESTAMP) FROM MYVIEW' is somewhat slow (taking minutes) in my environment, and found it's doing a TableScan of 100+GB across 80K micropartitions.
My expectation was for this to finish in milliseconds using the MIN/MAX/COUNT metadata in each micropartition. In fact, I do see Snowflake finishing the job in milliseconds using metadata for an almost identical MIN/MAX lookup in the following article:
http://cloudsqale.com/2019/05/03/performance-of-min-max-functions-metadata-operations-and-partition-pruning-in-snowflake/
Is there any limitation in how Snowflake decides to use metadata? Could it be because I'm querying through a view, instead of querying a table directly?
=== Added for clarity ===
Thanks for answering! Regarding how the view is defined, it adds a WHERE clause for additional filtering on a cluster key. So I believe it should still be possible to fully use the metadata of the micropartitions. But as posted, a TableScan is being done in the profiler output.
I'm a bit concerned about your comment on SECURE VIEW. The view I'm querying is indeed a SECURE VIEW - does that affect how the optimizer handles my query? Could that be the reason why a TableScan is done?
It looks like you're running the query on a view. The metadata you're referring to will be used when you're running a simple MIN/MAX etc. directly on the table; however, if you have some logic inside your view which requires filtering/joining of data, then Snowflake cannot return results just based off the metadata.
So yes, you are correct when you say the following because your view is probably doing something other than a simple MAX on the table:
...Could it be because I'm querying through a view, instead of querying a table directly?
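A small sketch of the difference, with hypothetical table, view and column names:
-- Directly against the table: Snowflake can typically answer from micropartition metadata (milliseconds)
SELECT MAX(event_ts) FROM my_table;

-- Through a (secure) view that filters or joins: the metadata shortcut no longer applies,
-- so a TableScan of the underlying micropartitions shows up in the query profile
SELECT MAX(event_ts) FROM my_secure_view;
If you have access to the base table, running the MAX directly against it is the quickest workaround.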

How to perform bulk updates in Solr

Is there a way in Solr to perform bulk updates without specifying it document by document?
In Solr we can update a field of a single record at a time, but in order to update 1000 records it is going to take more time. Is there any option to update a field of a thousand documents in one shot, in one go?
No, there is nothing similar to UPDATE foo SET field = "bar" - you'll have to either submit the complete set of updated documents, or batches of atomic update commands (each referencing a separate id), for example:
[{"id":"mydoc", "price":{"set":99}},
{"id":"mydoc2", "price":{"set":199}}]

Add DATE column to store when last read

We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive? (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own (application) layer.
In any case you will need an application server for this. You are not going to update the tracking field in the same transaction as the SELECT, are you? What about rollbacks? So you need some manager that first runs the SELECT and then writes the tracking information. And what is the point of saving the tracking information together with the entity info by sending it back to the DB? Save it to a file on the application server.
You could update the column in the table as you suggested, but if it were me I'd log the event to another table, i.e. the id of the record, datetime, userid (maybe IP address, browser version, etc.), just about anything else I could capture that was even possibly relevant. (For example, 6 months from now your manager decides not only does s/he want to know which records were used the most, s/he wants to know which users are using the most records, or what time of day that usage pattern occurs, etc.)
This type of information can be useful for things you've never even thought of down the road, and if the table starts to grow large you can always roll it up and prune it to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never regret having it available down the road, and it is impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to 'select' the data from within a stored procedure that also issues the logging command, so that the client is not doing two roundtrips (one for the select, one for the update/insert).
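A quick T-SQL sketch of that idea (hypothetical table, column, and procedure names; error handling omitted):
CREATE PROCEDURE dbo.GetRecordAndLogRead
    @Id INT
AS
BEGIN
    SET NOCOUNT ON;

    -- record the read in a separate tracking table
    INSERT INTO dbo.RecordReadLog (RecordId, ReadAt)
    VALUES (@Id, GETDATE());

    -- return the actual data in the same roundtrip
    SELECT *
    FROM dbo.MyTable
    WHERE Id = @Id;
END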
Alternatively, if this is a web application, you could use an async AJAX call to issue the logging action, which wouldn't slow down the user's experience at all.
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and database performance is one of the major concerns in database server administration.
Instead, you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
For more information, search for database auditing for SELECT statements in your database's documentation.
Use another table as a key/value pair with two columns (e.g. id_selected, times) for storing the ids of the records you select from your standard table, and increment the times value by 1 every time the records are selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query into the counting table. As a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1 = 'somevalue';

# or just build a list of ids as values from your last result
INSERT INTO countTable (id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1 = 'somevalue'
ON DUPLICATE KEY UPDATE times = times + 1;
The ON DUPLICATE KEY syntax is right off the top of my head in MySQL. For conditionally inserting or updating in MSSQL you would need to use MERGE instead.
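A rough T-SQL sketch of the MERGE equivalent, using the same hypothetical table and column names as above:
MERGE countTable AS target
USING (SELECT id FROM myTable WHERE stuff1 = 'somevalue') AS src
    ON target.id_selected = src.id
WHEN MATCHED THEN
    UPDATE SET times = times + 1
WHEN NOT MATCHED THEN
    INSERT (id_selected, times) VALUES (src.id, 1);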

MSSQL/Oracle Query Tuning 500,000+ records Coldfusion - does lower() reduce performance

I'm not trying to start a debate on which is better in general, I'm asking specifically to this question. :)
I need to write a query to pull back a list of userids (uid) from a database containing 500k+ records. I'm returning just the one field, uid. I can query either our Oracle box or our MSSQL 2000 box. The query looks like this (it has not been simplified):
select uid
from employeeRec
where uid = 'abc123'
Yes, it really is that simple of a query. Where I need the tuning help is that uid is indexed and some uids (not many, but some) could be 'ABC123' rather than 'abc123'. MSSQL doesn't care about case sensitivity whereas Oracle does. So for Oracle, my query would look like this:
select uid
from employeeRec
where lower(uid) = 'abc123'
I've learned that if you use lower() on an indexed field in MSSQL, you render the index useless (there are ways around it, but that is beyond the scope of my question here, since if I choose MSSQL I don't need to use lower at all). I wanted to know: if I choose Oracle and use the lower() function, will that also hurt the performance of the query?
I'm looping over this query about 200 times, in addition to some other queries that are being run, and the entire loop takes 1 second per iteration; I've narrowed down the slowness to this particular query. For a web page, 200 seconds seems like an eternity. For you CF readers, the timeout value has been increased so the page doesn't error out, and there are no page errors; I'm just trying to speed up this query.
Another item to note: This database is in a different city than the other queries being run so I do expect some lag time there.
As TomTom put it, your index will simply not be used by Oracle. But you can create a function-based index, and this new index will be used when you issue your query:
create index my_new_ix on employeeRec(lower(uid));
Wrapping an indexed column in a function call would have the potential to cause performance problems in Oracle. Oracle couldn't use a plain index on UID to process your query. On the other hand, you could create a function-based index on lower(uid) that would be used by the query, i.e.
CREATE INDEX case_insensitive_idx
ON employeeRec( lower( uid ) );
Note that if you want to do case-insensitive queries in general, you may be better served setting NLS parameters to force case-insensitivity. You'd still need function-based indexes on the columns you're searching on, but it can simplify your queries a bit.
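A sketch of those session settings (hypothetical index name; the exact values depend on your requirements and Oracle version):
ALTER SESSION SET NLS_COMP = LINGUISTIC;
ALTER SESSION SET NLS_SORT = BINARY_CI;
-- to keep the comparison indexed, pair this with a function-based index:
CREATE INDEX employeeRec_uid_ci ON employeeRec (NLSSORT(uid, 'NLS_SORT=BINARY_CI'));
With these settings in place, WHERE uid = 'abc123' matches case-insensitively without wrapping the column in lower().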
I wanted to know if I choose Oracle, and use the lower() function, will that also hurt performance of the query?
Yes. The performance reduction is because the index is on the original value and the collation is case-sensitive, so all possible values must be run through the function to find the matching ones.
