Avoiding duplicates - database

Imagine that we have a file, and some job that processes it and sends the data:
1) into the database
2) to an external service
Can we guarantee that the file is processed only once, or at least detect that something went wrong and notify the user so that they can resolve the problem manually?

Yes, you can.
What you can do is create a table in the database that stores the name and a flag/status (read or not read) for each file. When a process drops a file in that location, make sure the same process records the name (if the name is different each time) and the flag/status for that file in the database. Your file-reading process can get the file name from the database, dump the file wherever you want, and, when it's done, update the flag to "read" (or whatever you choose). This way you can avoid reading the same file more than once.
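A minimal sketch of that idea in Perl with DBI follows. The DSN, credentials, and the processed_files table (a filename plus a status column) are assumptions for illustration, not part of the original setup.

use strict;
use warnings;
use DBI;

# Assumed tracking table:
#   CREATE TABLE processed_files (filename VARCHAR(255) PRIMARY KEY, status VARCHAR(16));
my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

# Has this file already been read?
sub already_processed {
    my ($filename) = @_;
    my ($status) = $dbh->selectrow_array(
        'SELECT status FROM processed_files WHERE filename = ?', undef, $filename);
    return defined $status && $status eq 'read';
}

# Record the file once it has been dumped where it belongs.
sub mark_as_read {
    my ($filename) = @_;
    $dbh->do('INSERT INTO processed_files (filename, status) VALUES (?, ?)',
             undef, $filename, 'read');
}

The reading job would call already_processed() before touching a file, and mark_as_read() only after the dump succeeds.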

I would store two tables of information in your database.
The processed file lines, like you were already doing.
A record of the files themselves. Include:
the filename
whether the processing was successful, failed, or partially succeeded
a SHA-1 checksum that can be used to check the uniqueness of the file later
When you go to process a file, first check whether the checksum already exists. If it does, you can stop processing and log the issue, or record that information in the files table.
Also be sure to have a foreign key association between your processed lines and your files. That way if something does go wrong, the person doing manual intervention can trace the affected lines.
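A short sketch of the checksum check, assuming Perl with Digest::SHA and a hypothetical files table that has a checksum column:

use strict;
use warnings;
use DBI;
use Digest::SHA;

my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

# SHA-1 checksum of the file contents
sub file_checksum {
    my ($path) = @_;
    return Digest::SHA->new(1)->addfile($path)->hexdigest;
}

# True if a file with this checksum has already been recorded
sub already_seen {
    my ($checksum) = @_;
    my ($count) = $dbh->selectrow_array(
        'SELECT COUNT(*) FROM files WHERE checksum = ?', undef, $checksum);
    return $count > 0;
}

If already_seen(file_checksum($path)) is true, log the issue and skip the file; otherwise insert a files row and process the lines against its id so the foreign key association is in place.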

Neither Usmana's nor Tracy's answer actually guarantees that a file is not processed more than once, or that your job doesn't send duplicate requests to the database and the external service (#1 and #2 in your question). Both solutions suggest keeping a log and updating it after all the processing is done, but if an error occurs when you try to update the log at the very end, your job will try to process the file again the next time it runs and will send duplicate requests to the database and the external service. The only way to deal with that using the solutions Usmana and Tracy suggested is to run everything in a transaction, but that is quite a challenging task in a distributed environment like yours.
A common solution to your problem is to gracefully handle duplicate requests to the database and to external services. The actual implementation can vary, but, for example, you can add a unique constraint to the database; when the job tries to insert a duplicate record, an exception will be thrown, which you can simply ignore in the job because it means the required data is already in the database.
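As an illustration only (assuming MySQL, where a duplicate-key violation is reported as native error 1062, and a hypothetical processed_lines table with a unique key on line_key), the job could swallow the duplicate and carry on:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

sub insert_line {
    my ($line_key, $payload) = @_;
    # processed_lines has a UNIQUE constraint on line_key, so a re-run
    # cannot create a second copy of the same data.
    eval {
        $dbh->do('INSERT INTO processed_lines (line_key, payload) VALUES (?, ?)',
                 undef, $line_key, $payload);
    };
    if ($@) {
        # 1062 is MySQL's duplicate-entry error; anything else is a real failure.
        die $@ unless $dbh->err && $dbh->err == 1062;
    }
}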
My answer doesn't mean that you don't need the log table Usmana and Tracy suggested. You do need it to keep track of processing status, but it doesn't really guarantee there won't be duplicate requests to your database and external service unless you use a distributed transaction.
Hope it helps!

Related

Perl SQL and file create race condition

How do I handle "race condition" between instances of script that is scheduled to run every minute, performing following tasks for every file in directory:
Connect to SQL database and check last element (filename) in table
Create several files (multiple folders) with the next available filename
Insert to SQL new record with the filename and created files' information
Because process runs every 1 minute, it's possible that 2 instances overlap and work on same files. I can prevent that by file locking and skipping already opened file, however the issue persists with:
Checking next available filename in database (2 processes want to use the same filename)
Creating files with this filename
Process A takes inputA.jpg and finds next available as filename image_01.
Process B takes inputB.jpg and finds next available as filename image_01.
And so the chaos begins...
Unfortunately, I can't insert any placeholder record into the SQL table to show that the next filename is being processed.
Pseudo-code of the loop:
foreach my $file (@files)
{
    # Look up the next available output name for this input file
    my $name  = findFileNameInSql($file);
    my $path1 = createFile($name, $settings1);
    my $path2 = createFile($name, $settings2);
    my $path3 = createFile($name, $settings3);
    # Record the input file, the chosen name and the generated paths
    addToSql($file, $name, $path1, $path2, $path3);
}
The actual code is a bit more complicated, including file modifications and a transactional insert into 2 SQL tables. In case of a createFile() failure, the application rolls back all previously created files. This obviously creates an issue when one instance of the app is creating file "abc" and a second instance fails because file "abc" already exists.
EDIT :
Sure, limiting the script to a single running instance could be a solution, but I was hoping to find a way to run the instances in parallel. If there's no way to do it, we can close this as a duplicate.
You need to make the code that returns the next available filename from the database atomic in the database, so that the database can't return the same filename twice. This is really a database problem rather than a Perl problem, per se.
You don't say which database you're using, but there are several ways to do it.
A naive and brutish way to do it in MySQL is for the Perl script to perform a LOCK TABLES table_name WRITE on the table with the filenames while it calculates a new one and does its work. Once the table is updated with the new filename, you can release the lock. Table locks don't play nicely with transactions, though.
Or you could do something rather more elegant, like implement a stored procedure with appropriate locking within the database itself to return the new filename.
Or use an AUTO_INCREMENT column, so that each time you add something to the table you get a new number (and hence a new filename).
This can all get quite complicated, though; if you have multiple transactions running simultaneously, how the database resolves them is usually configurable, so I can't tell you what will happen.
Given that it sounds as though your code is principally reorganising data on disk, there's not much advantage to having multiple jobs running at the same time; this code is probably I/O bound anyway. In that case it's much simpler just to make the code changes others have suggested and run only one copy at a time.
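For the AUTO_INCREMENT route, a rough sketch (assuming MySQL and a dedicated one-column sequence table, here called filename_seq, which is an invention for illustration) might look like this:

use strict;
use warnings;
use DBI;

# Assumed: CREATE TABLE filename_seq (id INT AUTO_INCREMENT PRIMARY KEY);
my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

sub next_filename {
    # Each INSERT atomically reserves a fresh id, even with concurrent instances,
    # so two processes can never be handed the same number.
    $dbh->do('INSERT INTO filename_seq () VALUES ()');
    my $id = $dbh->last_insert_id(undef, undef, 'filename_seq', 'id');
    return sprintf('image_%02d', $id);
}

findFileNameInSql() in the pseudo-code above could then be replaced by something like next_filename(), and the later addToSql() insert can reference the reserved id.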

How can we make database processing resumable in a Perl script

I have Perl code that connects to a database and scans data from several different tables. I face a problem if I lose my connection: it rolls back the whole transaction. How can I make the Perl script resume the connection and start the process from where the interruption took place? Can I use Perl to resume the connection, or is there any other technique to restart the process from where it was interrupted? If so, could anyone guide me through the steps, please?
This is needed because we have a lot of data: it takes a week to scan all of it and insert it into a specific table. If a database offline backup runs in between, it disconnects all the connections, whatever transaction is in progress is rolled back, and the job needs to run once again from the beginning.
We can commit whatever has been done so far, but the challenge is how to start the process from where the interruption took place so that we don't have to run it from the beginning.
Relying on a DB connection to be consistently open for over a day is the wrong approach.
A possible solution involves:
1) Connect to the DB to create a DB Handle. Use an infinite loop and sleep to wait until you have a good db handle. Put this into a subroutine.
2) Put the individual requests for individual tables in a data structure like an array
Execute separate queries in separate statements in a loop.
Check if the dbh is stale before each individual request. Clean up and recreate the handle if necessary, using the subroutine from 1).
Handle breakdowns during a request in eval blocks, using the "redo" statement to make sure no statement gets skipped.
3) Keep the data between requests either in memory or in any non-SQL storage, like a key/value store (e.g. Redis).
4) Compute whatever needs computation
5) When you have all the data for your final transaction, commit it.
This solution assumes you don't care about changes between reading and committing back. If you do, you need to LOCK your affected tables first. You probably don't want to lock a table for a week though.
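A minimal sketch of steps 1) through 3) above, assuming DBI; the DSN, credentials, table names and queries are placeholders rather than anything from the original script:

use strict;
use warnings;
use DBI;

# 1) Loop until we get a handle that answers a ping.
sub get_dbh {
    while (1) {
        my $dbh = eval {
            DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                         { RaiseError => 1, AutoCommit => 0 });
        };
        return $dbh if $dbh && $dbh->ping;
        sleep 30;    # wait before retrying
    }
}

# 2) Run each query separately, refreshing stale handles and redoing failures.
my $dbh     = get_dbh();
my @queries = ('SELECT * FROM table_a', 'SELECT * FROM table_b');
my %results;

for my $sql (@queries) {
    $dbh = get_dbh() unless $dbh && $dbh->ping;
    my $rows = eval { $dbh->selectall_arrayref($sql) };
    if ($@) {
        $dbh = undef;    # force a reconnect, then retry the same query
        redo;
    }
    $results{$sql} = $rows;    # 3) keep the data until the final commit
}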

Spring-JPA (Hibernate): AssertionFailure: null id in `xxxx` entry (don't flush the Session after an exception occurs)

I need to do a data migration from a legacy system, and there are 20,000+ records in total (all the data is provided as CSV files). Due to some technical reasons, I must use JPA/Hibernate to import that data into the new system currently in use.
When I do the import, I always encounter a transaction problem like the following:
The database is MS SQL Server 2005
org.hibernate.AssertionFailure: null id in xxxx entry (don't flush the Session after an exception occurs)
And I tried the following things:
Using a nested exception to import each record separately.
Splitting the data into several small CSV files.
Calling entityManager.flush() manually and adding a Thread.sleep(10000) pause.
Calling entityManager.flush() every 20 records and sleeping 5 ms.
But unfortunately, nothing seems to help. Please help.
Thanks for reading. Any help would be much appreciated!
If the mappings are correct, the problem is probably incorrect import code, as described in org.hibernate.AssertionFailure.
You can also look at Hibernate batch-processing best practices and the StatelessSession API.
A last option is to move to Spring Batch.

How to only allow one update/insert to a table if row is outdated

I have a file stored on disk that can be accessed across multiple servers in a web farm. This file is updated as necessary based on data changes in the database. I have a database table that stores a row with a URI for this file and some hashes based off of some database tables. If the hashes don't match their respective tables, then the file needs to be regenerated and a new row needs to be inserted.
How do I make it so that only one client regenerates this file and inserts a row?
The easiest but worst solution (because of locks) is to:
BEGIN TRANSACTION
SELECT ROW FROM TABLE (lock the table for the remainder of the transaction)
IF ROW IS OUT OF DATE:
REGENERATE FILE
INSERT ROW INTO TABLE
DO SOME STUFF WITH FILE (30s)
COMMIT TRANSACTION
However, if multiple clients execute this code, all of the subsequent clients sit for a long time while the "DO SOME STUFF WITH FILE" step runs.
Is there a better way to handle this? Maybe changing the way I process the file before the commit to make it faster? I've been stumped on this for a couple days.
It sounds like you need to do your file processing asynchronously, so the file processing is spun off and the transaction completes in a timely manner. There are a few ways to do that, but the easiest might be to replace the "do stuff with file" step with an "insert a record into the This_File_Needs_To_Be_Updated table" step, then run a job every few minutes that updates each record in that table. Or HERE is some code that generates a job on the fly. Or see THIS question on Stack Overflow.
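A language-agnostic sketch of that queue-table idea, shown here in Perl/DBI for illustration; the id and status columns on This_File_Needs_To_Be_Updated and the regenerate_file() helper are assumptions, not part of the original setup:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

# Inside the short transaction: just note that the file needs attention.
sub queue_file_update {
    my ($file_uri) = @_;
    $dbh->do('INSERT INTO This_File_Needs_To_Be_Updated (file_uri, status) VALUES (?, ?)',
             undef, $file_uri, 'pending');
}

# The job that runs every few minutes does the slow work outside any lock.
sub process_queue {
    my $rows = $dbh->selectall_arrayref(
        'SELECT id, file_uri FROM This_File_Needs_To_Be_Updated WHERE status = ?',
        undef, 'pending');
    for my $row (@$rows) {
        my ($id, $file_uri) = @$row;
        regenerate_file($file_uri);    # the slow "do some stuff with file" step
        $dbh->do('UPDATE This_File_Needs_To_Be_Updated SET status = ? WHERE id = ?',
                 undef, 'done', $id);
    }
}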
The answer depends on the details of file level processing.
If you just swap the database and file operations, you risk corruption of the file or busy waiting (depending on how exactly you open it, and what your code does when a concurrent open is rejected). Busy waiting would definitely be worse than waiting on a database lock from a throughput (or any other) perspective.
If your file processing really takes so long as to be frequently causing queueing of requests, the only solutions are to add more powerful hardware or optimize file level processing.
For example, if the file only reflects the data in the database, you might get away with not updating it at all, and having a background process that periodically regenerates its content based on the data in the database. You might need to add versioning that makes sure that whoever reads the file is not receiving stale data. If the file pointed to by the URL has a new name every time, you might need an error handler that makes sure that GET requests are not habitually receiving a 404 response on new files.

'tail -f' a database table

Is it possible to effectively tail a database table such that when a new row is added an application is immediately notified with the new row? Any database can be used.
Use an ON INSERT trigger.
You will need to check the specifics of your database for how to call external applications with the values contained in the inserted record, or you can write your 'application' as a SQL procedure and have it run inside the database.
It sounds like you will want to brush up on databases in general before you paint yourself into a corner with your command-line approaches.
Yes, if the database is a flat text file and appends are done at the end.
Yes, if the database supports this feature in some other way; check the relevant manual.
Otherwise, no. Databases tend to be binary files.
I am not sure, but this might work for primitive / flat-file databases; as far as I understand (and I could be wrong), modern database files are encrypted. Hence reading a newly added row would not work with that command.
I would imagine most databases allow for write triggers, and you could have a script that triggers on write that tells you some of what happened. I don't know what information would be available, as it would depend on the individual database.
There are a few options here, some of which others have noted:
Periodically poll for new rows. With the way MVCC works though, it's possible to miss a row if there were two INSERTS in mid-transaction when you last queried.
Define a trigger function that will do some work for you on each insert. (In Postgres you can call a NOTIFY command that other processes can LISTEN to; a sketch of this follows at the end of this answer.) You could combine a trigger with writes to an unpublished_row_ids table to ensure that your tailing process doesn't miss anything. (The tailing process would then delete IDs from the unpublished_row_ids table as it processed them.)
Hook into the database's replication functionality, if it provides any. This should have a means of guaranteeing that rows aren't missed.
I've blogged in more detail about how to do all these options with Postgres at http://btubbs.com/streaming-updates-from-postgres.html.
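A sketch of the trigger + NOTIFY/LISTEN option from the list above, assuming PostgreSQL with DBD::Pg; the new_rows channel, and the trigger that issues the NOTIFY, are assumptions for illustration:

use strict;
use warnings;
use DBI;

# Assumes an ON INSERT trigger on the watched table already executes
# something like: PERFORM pg_notify('new_rows', NEW.id::text);
my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('LISTEN new_rows');

while (1) {
    # Drain any pending notifications; DBD::Pg returns [channel, pid, payload].
    while (my $notify = $dbh->pg_notifies) {
        my ($channel, $pid, $payload) = @$notify;
        print "new row signalled on $channel: $payload\n";
        # ... fetch the new row(s), then delete their ids from unpublished_row_ids ...
    }
    sleep 1;    # a real consumer would block on the connection's socket instead
}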
tail on Linux appears to be using inotify to tell when a file changes - it probably uses similar filesystem notifications frameworks on other operating systems. Therefore it does detect file modifications.
That said, tail performs an fstat() call after each detected change and will not output anything unless the size of the file increases. Modern DB systems use random file access and reuse DB pages, so it's very possible that an inserted row will not cause the backing file size to change.
You're better off using inotify (or similar) directly, and even better off if you use DB triggers or whatever mechanism your DBMS offers to watch for DB updates, since not all file updates are necessarily row insertions.
I was just in the middle of posting the same exact response as glowcoder, plus another idea:
The low-tech way to do it is to have a timestamp field, and have a program run a query every n minutes looking for records where the timestamp is greater than that of the last run. The same concept can be done by storing the last key seen if you use a sequence, or even adding a boolean field "processed".
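As a minimal sketch of that low-tech approach (the watched_table name and created_at column are made up for illustration):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

my $last_seen = '1970-01-01 00:00:00';

while (1) {
    # Fetch only the rows added since the previous run.
    my $rows = $dbh->selectall_arrayref(
        'SELECT id, created_at FROM watched_table WHERE created_at > ? ORDER BY created_at',
        undef, $last_seen);
    for my $row (@$rows) {
        my ($id, $created_at) = @$row;
        print "new row: $id\n";       # hand the row to the application here
        $last_seen = $created_at;     # remember the newest timestamp processed
    }
    sleep 60;    # "every n minutes" from the answer above, here one minute
}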
With Oracle you can select a pseudo-column called 'rowid' that gives a unique identifier for the row in the table, and rowids are ordinal: new rows get assigned rowids that are greater than any existing rowids.
So, first select max(rowid) from table_name
I assume that one reason the question came up is that there are many, many rows in the table... so this first step will tax the db a little and take some time.
Then, select * from table_name where rowid > 'whatever_that_rowid_string_was'
You still have to run the query periodically, but it is now just a quick and inexpensive query.

Resources