I am working on a spring batch service that pulls data from a db on a schedule. (e.g. every day at 12pm)
I am using JdbcPagingItemReader to read the data and a scheduler (#Scheduled provided by spring batch) to launch the job. The problem that I have now is: every time the job runs, it will just pull all the data from the beginning and not from the "last read" row.
The data from the db is changing everyday(deleting old ones and adding new ones) and all I have is a timestamp column to track them.
Is there a way to "remember" the last row read from the last execution of the job and read data only later than that row?
Since you need to pull data on a daily basis, and your records have a timestamp, then you can design your job instances to be based on a given date (ie using the date as an identifying job parameter). With this approach, you do not need to "remember" the last processed record. All you need to do is process records for a given date by using the correct SQL query. For example:
Job instance ID
Date
Job parameter
SQL
1
2021-03-22
date=2021-03-22
Select c1, c2 from table where date = 2021-03-22
2
2021-03-23
date=2021-03-23
Select c1, c2 from table where date = 2021-03-23
...
...
...
...
With that in place, you can use any cursor-based or paging-based reader to process records of a given date. If a job instance fails, you can restart it without a risk to interfere with other job instances. The restart could be done even several days after the failure since the job instance will always process the same data set. Moreover, in case of failure and job restart, Spring Batch will reprocess records from the last check point in the previous (failed) run.
Just want to post an update to this question.
So in the end I created two more steps to achieve what I wanted to do initially.
Since I don't have the privilege to modify the table where I read the data from, I couldn't use the "process indicator pattern" which involves having a column to mark if a record is processed or not. I created another table to store the last-read record's timestamp, and use it to update the sql query.
step 0: a tasklet that reads the bookmark from a table, pass it in the job context
step 1: a chunk step, get the bookmark from the context, use jdbcPagingItemReader to read the data
step 2: a tasklet to update the bookmark
But doing this I have to be very cautious with the bookmark table. If I lose that I lose everything
Related
I want to implement an incremental load process using SQL Server Change Data Capture. Every example I find takes the "happy path."
In other words, they assume that the CDC history exceeds the time since the last successful incremental load.
Suppose we leave the cleanup job with the default of 3 days, and for some reason our load hasn't successfully completed for longer than that. I need to check for this and run a full extract instead.
I'm logging the successful execution datetime in SQL Server tables. So, if I compare the last successful date to the earliest record in the cdc.lsn_time_mapping table, will this accomplish my task?
Basically something like:
Select #LastSuccessfulDate from audit....
Select #MinCdCDate = min(tran_begin_time) from cdc.lsn_time_mapping
if #MinCdCDate > #LastSuccessfulDate then 'Full' else 'Incremental'
Should this work? Is there a better way to accomplish it?
I would always stay in the "log domain" not the "time domain" when working directly with CDC. So track the last LSN of the last run and compare it against sys.fn_cdc_get_min_lsn every time you syncronize.
So if you last synchronized at lsn=100, and the min_lsn=110, then you've got a gap of 10 missing log records.
But this is only one of many scenarios that will require you to reinitialize the replication with a full sync, so you should also have an input parameter or somesuch to force a full sync.
I have an ERP System (Navision) where product data and stock numbers are frequently updated. Every time an attribute of a product is updated I want this change to be pushed to another SQL Server using Service Broker. I was considering using triggers for the detection, but I am unsure if that is the best way, and whether this is scalable. I expect updates to happen approx. once per second, but this number might double or triple.
Any feedback would be appreciated.
Add a column for Last Modified Date for each record and update this column using the trigger each time a record is being updated. Then Run a scheduled job at a specific time each day (Off-business hours preferred) So that all records that are updated after the last scheduled run is processed.
So The following items need to be done
Add a new column LastModifiedDate in the table with DATETIME data type.
Create a Trigger to update the ModifiedDate each time the record is updated
Create a new table to store the schedule run date and time
Create a scheduled job on Database that will run at a specified time every day.
This job will pick all the records that have the value greater than the date in the Table Create on Step#4.
So Since only 1 column is being updated in the trigger, it won't affect the performance of the table. Also since we are running the update job only once a day, It will also reduce the Database Traffic.
I'm trying to create SSIS package which will periodically send data to other database. I want to send only new records(I need to keep sent records) so I created status column in my source table.
I want my package to update this column after successfuly sending data, but I can't update all rows wih "unsent" status because during package execution some rows may have been added, and I also can't use transactions(I mean on isolation levels that would solve my problem: I can't use Serializable beacause i musn't prevent users from adding new rows, and Sequence Container doesn't support Snapshot).
My next idea was to use recordset and after sending data to other db use it to get ids of sent rows, but I couldn't find a way to use it as datasource.
I don't think I should set status "to send" and then update it to "sent", I believe it would be to costly.
Now I'm thinking about using temporary table, but I'm not convinced that this is the right way to do it, am I missing something?
Record Set is a destination. You cannot use it in Data Flow task.
But since the data is saved to a variable, it is available in the Control flow.
After completing the DataFlow, come to the control flow and create a foreach component that can run on the ResultSet varialbe.
Read each Record Set value into a variable and use it to run an update query.
Also, see if "Lookup Transform" can be useful to you. You can generate rows that match or doesn't match.
I will improve the answer based on discussions
What you have here is a very typical data mirroring problem. To start with, I would not simply have a boolean that signifies that a record was "sent" to the destination (mirror) database. At the very least, I would put a LastUpdated datetime column in the source table, and have triggers on that table, on insert and update, that put the system date into that column. Then, every day I would execute an SSIS package that reads the records updated in the last week, checks to see if those records exist in the destination, splitting the datastream into records already existing and records that do not exist in the destination. For those that do exist, if the LastUpdated in the destination is less than the LastUpdated in the source, then update them with the values from the source. For those that do not exist in the destination, insert the record from the source.
It gets a little more interesting if you also have to deal with record deletions.
I know it may seem wasteful to read and check a week's worth, every day, but your database should hardly feel it, it provides a lot of good double checking, and saves you a lot of headaches by providing a simple, error tolerant algorithm. Some record does not get transferred because of some hiccup on the network, no worries, it gets picked up the next day.
I would still set up the SSIS package as a server task that sends me an email with any errors, so that I can keep track. Most days, you get no errors, and when there are errors, you can wait a day or resolve the cause and let the next days run pick up the problems.
I am doing a similar thing, in my case, I have a status on the source record.
I read in all records with a status of new.
Then use a OLE DB Command to execute SQL on each row, changing
the status to "In progress"(in you where, enter a ? as the value in
the Component Property tab, and you can configure it as a parameter
from the table row like an ID or some pk in the Column Mappings
tab).
Once the records are processed, you can change all "In Progress"
records to "Success" or something similar using another OLE DB
Command.
Depending on what you are doing, you can use the status to mark records that errored at some point, and require further attention.
I have data to load where I only need to pull records since the last time I pulled this data. There are no date fields to save this information in my destination table so I have to keep track of the maximum date that I last pulled. The problem is I can't see how to save this value in SSIS for the next time the project runs.
I saw this:
Persist a variable value in SSIS package
but it doesn't work for me because there is another process that purges and reloads the data separate from my process. This means that I have to do more than just know the last time my process ran.
The only solution I can think of is to create a table but it seems a bit much to create a table to hold one field.
This is a very common thing to do. You create an execution table that stores the package name, the start time, the end time, and whether or not the package failed/succeeded. You are then able to pull the max start time of the last successfully ran execution.
You can't persist anything in a package between executions.
What you're talking about is a form of differential replication and this has been done many many times.
For differential replication it is normal to store some kind of state in the subscriber (the system reading the data) or the publisher (the system providing the data) that remembers what state you're up to.
So I suggest you:
Read up on differential replication design patterns
Absolutely put your mind at rest about writing data to a table
If you end up having more than one source system or more than one source table your storage table is not going to have just one record. Have a think about that. I answered a question like this the other day - you'll find over time that you're going to add handy things like the last time the replication ran, how long it took, how many records were transferred etc.
Is it viable to have a SQL table with only one row and one column?
TTeeple and Nick.McDermaid are absolutely correct, and you should follow their advice if humanly possible.
But if for some reason you don't have access to write to an execution table, you can always use a script task to read/write the last loaded date to a text file on on whatever local file-system you're running SSIS on.
I am currently in the process of revamping my company's management system to run a little more lean in terms of network traffic. Right now I'm trying to figure out an effective way to query only the records that have been modified (by any user) since the last time I asked.
When the application starts it loads the job information and caches it locally like the following: SELECT * FROM jobs.
I am writing out the date/time a record was modified ala UPDATE jobs SET Widgets=#Widgets, LastModified=GetDate() WHERE JobID=#JobID.
When any user requests the list of jobs I query all records that have been modified since the last time I requested the list like the following: SELECT * FROM jobs WHERE LastModified>=#LastRequested and store the date/time of the request to pass in as #LastRequest when the user asks again. In theory this will return only the records that have been modified since the last request.
The issue I'm running into is when the user's date/time is not quite in sync with the server's date/time and also of server load when querying an un-indexed date/time column. Is there a more effective system then querying date/time information?
I don't know that I would rely on Date-Time since it is external to SQL Server.
If you have an Identity column, I would use that column in a table UserId, LastQueryDateTime, LastIdRetrieved
Every time you query the base table, insert new row for user (or update if exists) the max id into this table. Also, the query should read the row from this table to get the LastIdRetrieved and use that in the where clause.
All this could be eliminated if all of your code chooses to insert GetDate() from SQL Server instead of from the client machines, but that task is pretty labor intensive.
The easiest solution seems to settle on one time as leading.
One way would be to settle on the server time. After updating the row, store the value returned by select LastModified where JobID = #JobID on the client side. That way, the client can effectively query using only the server time as reference.
Use an update sequence number (USN) much like Active Directory and DNS use to keep track of the objects that have changed since their last replication. Pick a number to start with, and each time a record in the Jobs table is inserted or modified, write the most recent USN. Keep track of the USN when the last Select query was executed, and you then always know what records were altered since the last query. For example...
Set LastQryUSN = 100
Update Jobs Set USN=101, ...
Update Jobs Set USN=102, ...
Insert Jobs (USN, ...) Values (103, ...)
Select * From Jobs Where USN > LastQryUSN
Set LastQryUSN = 103
Update Jobs Set USN=104
Insert Jobs (USN, ...) Values (105, ...)
Select * From Jobs Where USN > LastQryUSN
Set LastQryUSN = 105
... and so on
When you get the Jobs, get the server time too:
DECLARE #now DATETIME = GETUTCDATE();
SELECT #now AS [ServerTime], * FROM Jobs WHERE Modified >= #LastModified;
First time you pass in a minimum date as #LastModified. On each subsequent call, you pass in the ServerTime returned last call. This way the client time is taken out of the equation.
The answer to the server load is, I hope, obvious: add an index on Modified column.
And one more adice: never use local time, not even on server. Always use UTC times, and store UTC time in Modified. As it is right now, your program is completely screwed two times a year, when the daylight savings time changes are set in or when they are removed.
Current SQL Server has change tracking you can use for exactly that. Just enable change tracking on the tables you want to track.