How do I handle a race condition between instances of a script that is scheduled to run every minute, performing the following tasks for every file in a directory:
Connect to SQL database and check last element (filename) in table
Create several files (multiple folders) with the next available filename
Insert to SQL new record with the filename and created files' information
Because the process runs every minute, it's possible that 2 instances overlap and work on the same files. I can prevent that by file locking and skipping already opened files, but the issue persists with:
Checking next available filename in database (2 processes want to use the same filename)
Creating files with this filename
Process A takes inputA.jpg and finds the next available filename is image_01.
Process B takes inputB.jpg and finds the next available filename is image_01.
And so the chaos begins...
Unfortunately, I can't insert any placeholder record in SQL table to show that the next filename is being processed.
Pseudo-code of the loop:
foreach ($file)
{
    $name = findFileNameInSql($file);
    $path1 = createFile($name, $settings1);
    $path2 = createFile($name, $settings2);
    $path3 = createFile($name, $settings3);
    addToSql($file, $name, $path1, $path2, $path3);
}
The actual code is a bit more complicated, including file modifications and a transactional insert into 2 SQL tables. If createFile() fails, the application rolls back all previously created files. This obviously creates an issue when one instance of the app is creating file "abc" and a second instance errors out because file "abc" already exists.
EDIT:
Sure, limiting the script to a single instance could be a solution, but I was hoping to find a way to run the instances in parallel. If there's no way to do it, we can close this as a duplicate.
You need to make the code that returns the next available filename atomic in the database, so that the database can't return the same filename twice. This is really a database problem rather than a Perl problem, per se.
You don't say which database you're using, but there are several ways to do it.
A naive and brutish way to do it in MySQL is for the Perl script to perform a LOCK TABLES table WRITE on the table with the filenames while it calculates a new one and does its work. Once the table is updated with the new filename, you can release the lock. Table locks don't play nicely with transactions, though.
Or you could do something rather more elegant, like implement a stored procedure with appropriate locking within the database itself to return the new filename.
Or use an AUTO_INCREMENT column, so that each time you add a row to the table you get a new number (and hence a new filename).
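To make the auto-increment idea concrete, here is a minimal sketch in Python using SQLite as a stand-in for MySQL (the table and the image_NN naming scheme are illustrative, not from the question). Each INSERT atomically reserves a new id inside the database, so two concurrent processes can never be handed the same filename.

```python
import sqlite3

# SQLite's AUTOINCREMENT plays the role of MySQL's AUTO_INCREMENT here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE files (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source TEXT
    )
""")

def reserve_filename(source_file):
    # The INSERT and the id generation happen atomically in the database,
    # so concurrent callers always receive distinct ids.
    cur = conn.execute("INSERT INTO files (source) VALUES (?)", (source_file,))
    return "image_%02d" % cur.lastrowid

name_a = reserve_filename("inputA.jpg")  # -> image_01
name_b = reserve_filename("inputB.jpg")  # -> image_02, never image_01 again
```

The point is that the "find the next available name" step disappears entirely; the database hands out the number as a side effect of the insert.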
This can all get quite complicated, though; if you have multiple simultaneous transactions, how the database resolves them is usually configurable, so I can't tell you what will happen.
Given that it sounds as though your code is principally reorganising data on disk, there's not much advantage to having multiple jobs running at the same time; this code is probably I/O bound anyway. In which case it's much simpler just to make the code changes others have suggested to run only one copy at once.
Imagine that we have a file, and some job that processes it and sends the data:
into the database
to an external service
Can we guarantee that the file is processed only once, or at least determine that something went wrong and notify the user so that they can resolve the problem manually?
Yes, you can.
What you can do is create a table in the database to store the name and a flag/status (read: yes or no) for each file. When a process feeds the file into that location, make sure the same process updates the name (if the name differs each time) and the flag/status for that file in the database. Your file-read process can get the file's name from the database, dump the file wherever you want, and when it's done, update the flag to read (or whatever). This way, you can avoid reading a file more than once.
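A minimal sketch of that name-plus-status table, using Python and SQLite with illustrative table and column names (none of these come from the question):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE file_log (name TEXT PRIMARY KEY, status TEXT)")

def register_file(name):
    # The feeding process records the file as unread.
    db.execute("INSERT INTO file_log VALUES (?, 'unread')", (name,))

def next_unread():
    # The reading process asks the database which file still needs work.
    row = db.execute(
        "SELECT name FROM file_log WHERE status = 'unread' LIMIT 1").fetchone()
    return row[0] if row else None

def mark_read(name):
    # Once done, flip the flag so the file is never picked up again.
    db.execute("UPDATE file_log SET status = 'read' WHERE name = ?", (name,))

register_file("batch_001.csv")
name = next_unread()   # the reader picks up the unread file
mark_read(name)
# next_unread() now returns None: the file will not be processed twice
```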
I would store two tables of information in your database.
The processed file lines like you were already doing.
A record of the files themselves. Include:
the filename
whether the processing succeeded, failed, or partially succeeded
a SHA-1 checksum that can be used to check the uniqueness of the file later
When you go to process a file, first check whether the checksum already exists. If it does, you can stop processing and log the issue, or record that information in the file table.
Also be sure to have a foreign key association between your processed lines and your files. That way if something does go wrong, the person doing manual intervention can trace the affected lines.
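Here is a rough sketch of that two-table design in Python with SQLite. The table names, and hashing the file's raw bytes with SHA-1, are illustrative assumptions:

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
# One table for the files themselves (with the checksum),
# one for the processed lines, linked by a foreign key.
db.execute("""CREATE TABLE files (
    id INTEGER PRIMARY KEY, filename TEXT, status TEXT, sha1 TEXT UNIQUE)""")
db.execute("""CREATE TABLE lines (
    id INTEGER PRIMARY KEY, file_id INTEGER REFERENCES files(id), line TEXT)""")

def process(filename, content):
    digest = hashlib.sha1(content).hexdigest()
    # First check whether this exact file was already processed.
    if db.execute("SELECT 1 FROM files WHERE sha1 = ?", (digest,)).fetchone():
        return "duplicate"   # stop processing and log the issue
    cur = db.execute(
        "INSERT INTO files (filename, status, sha1) VALUES (?, 'ok', ?)",
        (filename, digest))
    for line in content.splitlines():
        db.execute("INSERT INTO lines (file_id, line) VALUES (?, ?)",
                   (cur.lastrowid, line))
    return "processed"

first = process("a.txt", b"line1\nline2")
second = process("b.txt", b"line1\nline2")  # same bytes, different name
# first == "processed", second == "duplicate"
```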
Neither Usmana's nor Tracy's answer actually guarantees that a file is not processed more than once, or that your job doesn't send duplicate requests to the database and the external service (#1 and #2 in your question). Both solutions suggest keeping a log and updating it after all the processing is done, but if an error occurs when you try to update the log at the very end, your job will try to process the file again on its next run and will send duplicate requests to the database and the external service. The only way to deal with that using the solutions Usmana and Tracy suggested is to run everything in a transaction, but that's quite a challenging task in a distributed environment like yours.
A common solution to your problem is to gracefully handle duplicate requests to the database and external services. The actual implementation can vary but for example you can add a unique constraint to the database and when the job tries to insert a duplicate record an exception will be thrown which you can just ignore in the job because it means the required data is already in the db.
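As a concrete sketch of that idea, using Python and SQLite (table and column names are illustrative): the UNIQUE constraint makes the database reject the second insert, and the job treats the resulting error as "already done" rather than as a failure.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (filename TEXT UNIQUE)")

def record(filename):
    try:
        db.execute("INSERT INTO processed VALUES (?)", (filename,))
        return "inserted"
    except sqlite3.IntegrityError:
        # The required data is already in the db; safe to ignore.
        return "already-processed"

r1 = record("report.csv")
r2 = record("report.csv")   # duplicate request from a retried job
# r1 == "inserted", r2 == "already-processed" -- no duplicate row was created
```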
My answer doesn't mean that you don't need the log table Usmana and Tracy suggested. You do need it to keep track of processing status, but it doesn't really guarantee there won't be duplicate requests to your database and external service unless you use a distributed transaction.
Hope it helps!
I have data to load where I only need to pull records added since the last time I pulled. There are no date fields in my destination table to save this information, so I have to keep track of the maximum date I last pulled. The problem is I can't see how to save this value in SSIS for the next time the project runs.
I saw this:
Persist a variable value in SSIS package
but it doesn't work for me because another process purges and reloads the data separately from mine. This means I have to know more than just the last time my process ran.
The only solution I can think of is to create a table but it seems a bit much to create a table to hold one field.
This is a very common thing to do. You create an execution table that stores the package name, the start time, the end time, and whether the package failed or succeeded. You can then pull the max start time of the last successful execution.
You can't persist anything in a package between executions.
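A minimal sketch of such an execution table, using Python and SQLite with illustrative column names; the key query pulls the max start time of the last successful run:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE executions (
    package TEXT, start_time TEXT, end_time TEXT, succeeded INTEGER)""")

# Each run of the package logs a row like these:
db.executemany("INSERT INTO executions VALUES (?, ?, ?, ?)", [
    ("LoadSales", "2024-01-01 02:00", "2024-01-01 02:05", 1),
    ("LoadSales", "2024-01-02 02:00", "2024-01-02 02:01", 0),  # failed run
    ("LoadSales", "2024-01-03 02:00", "2024-01-03 02:04", 1),
])

# At the start of the next run, find the watermark to load from:
last_good = db.execute(
    """SELECT MAX(start_time) FROM executions
       WHERE package = ? AND succeeded = 1""", ("LoadSales",)).fetchone()[0]
# last_good == "2024-01-03 02:00"; pull only records newer than this
```

Note that the failed run is skipped, so any records it half-loaded get picked up again.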
What you're talking about is a form of differential replication and this has been done many many times.
For differential replication it is normal to store some kind of state in the subscriber (the system reading the data) or the publisher (the system providing the data) that remembers what state you're up to.
So I suggest you:
Read up on differential replication design patterns
Absolutely put your mind at rest about writing data to a table
If you end up with more than one source system or more than one source table, your storage table is not going to have just one record. Have a think about that. I answered a question like this the other day; over time you'll find yourself adding handy things like the last time the replication ran, how long it took, how many records were transferred, and so on.
TTeeple and Nick.McDermaid are absolutely correct, and you should follow their advice if humanly possible.
But if for some reason you don't have access to write to an execution table, you can always use a script task to read/write the last loaded date to a text file on whatever local file system you're running SSIS on.
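A rough sketch of that fallback in Python (the file location, default date, and date format are illustrative assumptions):

```python
import os
import tempfile

# State file holding the last loaded date between runs; path is illustrative.
state_file = os.path.join(tempfile.gettempdir(), "last_loaded.txt")

def read_last_loaded(default="1900-01-01"):
    try:
        with open(state_file) as f:
            return f.read().strip()
    except FileNotFoundError:
        return default     # first run: load everything

def write_last_loaded(date_str):
    # Rewrite the whole file each run; it only ever holds one value.
    with open(state_file, "w") as f:
        f.write(date_str)

write_last_loaded("2024-01-03")
after = read_last_loaded()   # -> "2024-01-03" on the next run
```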
I am maintaining legacy code in which we have a concept of TempDB and FullDB; TempDB is just a small instance of FullDB, so that the user can browse while FullDB is being prepared.
Since lots of writes are involved in FullDB, reading and writing the same database file was creating a lock for readers on other threads. So I am thinking of the following strategy, which best fits our situation, if it's possible.
Here's what I want to do :
Start preparing the DB; when the threshold for TempDB is reached, commit the transaction and close the connection. Make a copy of this file; let's call them orig (the temp DB) and copy (a copy of the temp DB, to which further writes will be done).
Next, readers can open a connection on orig as soon as they receive an event. The writer will open a connection on copy and perform the remaining writes for quite a long time, during which readers use the orig temp DB.
When the writer has prepared the full DB copy, I need to replace the orig file with the updated full DB copy.
Here's the catch: readers will not close and reopen the connection, so I need to block the readers while I replace the DB. I can achieve this by acquiring an EXCLUSIVE lock on the orig DB, and then I can replace the orig DB with the copy DB (renaming).
The Problem :
The readers are not picking up the new DB file. How can I make them do that?
I mean, when I tried this through the terminal (make a DB, copy it, make some entries in the copy, then replace the original with the copy), I was still getting the entries that were present in the original DB. To my surprise, even when I deleted both DB files (orig and copy), I was still getting entries. It seems SQLite was serving data from memory and not from the disk files.
Any help?
PS: On searching I found something called the .open command, but I'm not sure how it works or whether it's really helpful.
EDIT
Is this what I want?
You must not rename or delete a database file while there is an open connection to it; on Unix-based systems, any open handles will still access the old file.
With an (exclusive) lock on the DB, you can just keep the file, but delete all its contents, and copy the new data into it.
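As an illustration of replacing the contents without touching the file handles, here is a sketch using Python's sqlite3 backup API, with in-memory databases standing in for the orig and copy files. Under your exclusive lock, the prepared database's pages are copied wholesale into the original:

```python
import sqlite3

orig = sqlite3.connect(":memory:")   # stands in for the orig (temp) DB file
copy = sqlite3.connect(":memory:")   # stands in for the prepared full DB

orig.execute("CREATE TABLE items (name TEXT)")
orig.execute("INSERT INTO items VALUES ('temp-only row')")
orig.commit()

copy.execute("CREATE TABLE items (name TEXT)")
copy.executemany("INSERT INTO items VALUES (?)",
                 [("full row 1",), ("full row 2",)])
copy.commit()

# Overwrite orig's entire contents with copy's, keeping the file itself
# (and thus the readers' open connections) intact.
copy.backup(orig)

rows = [r[0] for r in orig.execute("SELECT name FROM items ORDER BY name")]
# rows == ['full row 1', 'full row 2'] -- readers now see the new data
```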
I have a file stored on disk that can be accessed across multiple servers in a web farm. This file is updated as necessary based on data changes in the database. I have a database table that stores a row with a URI for this file and some hashes based on some database tables. If the hashes don't match their respective tables, the file needs to be regenerated and a new row needs to be inserted.
How do I make it so that only 1 client regenerates this file and inserts a row?
The easiest but worst solution (because of locks) is to:
BEGIN TRANSACTION
    SELECT ROW FROM TABLE (lock the table for the remainder of the transaction)
    IF ROW IS OUT OF DATE:
        REGENERATE FILE
        INSERT ROW INTO TABLE
    DO SOME STUFF WITH FILE (30s)
COMMIT TRANSACTION
However, if multiple clients execute this code, all of the subsequent clients sit for a long time while the "DO SOME STUFF WITH FILE" processes.
Is there a better way to handle this? Maybe changing the way I process the file before the commit to make it faster? I've been stumped on this for a couple days.
It sounds like you need to do your file processing asynchronously, so the file process is spun off and the transaction completes in a timely manner. There are a few ways to do that, but the easiest might be to replace the "do stuff with file" step with an "insert a record into the table This_File_Needs_To_Be_Updated" step, then run a job every few minutes that updates each record in that table. Or HERE is some code that generates a job on the fly. Or see THIS question on Stack Overflow.
The answer depends on the details of file level processing.
If you just swap the order of the database and file operations, you risk corruption of the file or busy-waiting (depending on how exactly you open it, and what your code does when a concurrent open is rejected). Busy-waiting would definitely be worse than waiting on a database lock, from a throughput (or any other) perspective.
If your file processing really takes so long as to be frequently causing queueing of requests, the only solutions are to add more powerful hardware or optimize file level processing.
For example, if the file only reflects the data in the database, you might get away with not updating it at all, and having a background process that periodically regenerates its content based on the data in the database. You might need to add versioning that makes sure that whoever reads the file is not receiving stale data. If the file pointed to by the URL has a new name every time, you might need an error handler that makes sure that GET requests are not habitually receiving a 404 response on new files.
Is it possible to effectively tail a database table such that when a new row is added an application is immediately notified with the new row? Any database can be used.
Use an ON INSERT trigger.
You will need to check the specifics of how to call external applications with the values contained in the inserted record, or you can write your 'application' as a SQL procedure and have it run inside the database.
It sounds like you'll want to brush up on databases in general before you paint yourself into a corner with your command-line approach.
Yes, if the database is a flat text file and appends are done at the end.
Yes, if the database supports this feature in some other way; check the relevant manual.
Otherwise, no. Databases tend to be binary files.
I am not sure, but this might work for primitive / flat-file databases. As far as I understand (and I could be wrong), modern database files use binary formats, so reading a newly added row would not work with that command.
I would imagine most databases allow for write triggers, and you could have a script that triggers on write that tells you some of what happened. I don't know what information would be available, as it would depend on the individual database.
There are a few options here, some of which others have noted:
Periodically poll for new rows. With the way MVCC works, though, it's possible to miss a row if there were two INSERTs mid-transaction when you last queried.
Define a trigger function that will do some work for you on each insert. (In Postgres you can call a NOTIFY command that other processes can LISTEN to.) You could combine a trigger with writes to an unpublished_row_ids table to ensure that your tailing process doesn't miss anything. (The tailing process would then delete IDs from the unpublished_row_ids table as it processed them.)
Hook into the database's replication functionality, if it provides any. This should have a means of guaranteeing that rows aren't missed.
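A small sketch of the trigger-plus-queue-table variant (option 2), using Python and SQLite for illustration; in Postgres you would pair the trigger with NOTIFY/LISTEN so the tailing process doesn't have to poll:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE unpublished_row_ids (row_id INTEGER);
    -- The trigger records every new row's id, so nothing can be missed
    -- between polls, even across concurrent transactions.
    CREATE TRIGGER events_tail AFTER INSERT ON events
    BEGIN
        INSERT INTO unpublished_row_ids VALUES (NEW.id);
    END;
""")

db.execute("INSERT INTO events (payload) VALUES ('first')")
db.execute("INSERT INTO events (payload) VALUES ('second')")

# The tailing process drains the queue, then deletes what it handled.
pending = [r[0] for r in db.execute("SELECT row_id FROM unpublished_row_ids")]
db.execute("DELETE FROM unpublished_row_ids")
# pending == [1, 2]
```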
I've blogged in more detail about how to do all these options with Postgres at http://btubbs.com/streaming-updates-from-postgres.html.
tail on Linux appears to use inotify to tell when a file changes (it probably uses similar filesystem-notification frameworks on other operating systems), so it does detect file modifications.
That said, tail performs an fstat() call after each detected change and will not output anything unless the size of the file increases. Modern DB systems use random file access and reuse DB pages, so it's very possible that an inserted row will not cause the backing file size to change.
You're better off using inotify (or similar) directly, and even better off if you use DB triggers or whatever mechanism your DBMS offers to watch for DB updates, since not all file updates are necessarily row insertions.
I was just in the middle of posting the same exact response as glowcoder, plus another idea:
The low-tech way is to have a timestamp field and a program that runs a query every n minutes looking for records whose timestamp is greater than that of the last run. The same concept works by storing the last key seen if you use a sequence, or even by adding a boolean "processed" field.
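A minimal sketch of that last-key polling approach in Python with SQLite (table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, data TEXT)")
db.executemany("INSERT INTO records (data) VALUES (?)",
               [("old 1",), ("old 2",)])

last_seen = 0

def poll():
    # Ask only for rows beyond the last key we processed.
    global last_seen
    rows = db.execute(
        "SELECT id, data FROM records WHERE id > ? ORDER BY id",
        (last_seen,)).fetchall()
    if rows:
        last_seen = rows[-1][0]   # remember the highest key processed
    return [data for _, data in rows]

first_run = poll()                            # picks up the existing rows
db.execute("INSERT INTO records (data) VALUES ('new 3')")
second_run = poll()                           # only the row added since
```

The same shape works with a timestamp column instead of a key; the only state kept between runs is the watermark.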
With Oracle you can select a pseudo-column called rowid that gives a unique identifier for the row in the table, and rowids are ordinal: new rows get assigned rowids greater than any existing rowid.
So, first select max(rowid) from table_name
I assume one cause for the question is that there are many, many rows in the table, so this first step will tax the DB a little and take some time.
Then, select * from table_name where rowid > 'whatever_that_rowid_string_was'
You still have to run the query periodically, but now it's a quick and inexpensive query.