I am maintaining a legacy code wherein we have a cocept of TempDB and FullDB, TempDB is just a small instance of FullDB, so that user can browse while FullDB is prepared.
Since lots of writes were involved in FullDB, reading and writing on same database file was creating a lock for readers on other thread. SO I am thinking of the following strategy, which best fits in our situation, in case its possible.
Here's what I want to do :
Start preparing the DB, when threshold for tempDB is reached, commit the transaction and close the connection.Make a copy of this file, lets call them orig(which is the temp db) and copy(which is copy of temp DB and further writes will be done to this file).
Next, readers can open a connection on orig as soon as they receive an event. Writer will open a connection on copy and perform remaining writes for quite a long time during which readers are using the orig temp db.
When the writer has prepared the full DB copy , I need to replace the orig file with the updated full db copy.
Here's the catch, readers will not close and reopen the connection. So I need to block the readers while I am replacing the DB. This I can achieve by acquiring an EXCLUSIVE lock on the orig DB, and then I can replace the orig db with copy db (renaming).
The Problem :
The readers are not accepting the new DB file.How can I make them to do that?
I mean when I tried through terminal : make a DB, copy it and make some entries into the copy and then replace the original with the copy, I was still getting entries that were present in the original DB. To the surprise, even when I deleted both (orig and copy) the DB files, I was still getting entries. It seems SQLite was picking data from some in-memory and not from the disk files.
Any help?
PS : On searching I found something called .open command but not sure how it works or whether its really helpful.
EDIT
Is this what I want?
You must not rename or delete database file while there is some open connection; on Unix-based systems, any open handles will still access the old file.
With an (exclusive) lock on the DB, you can just keep the file, but delete all its contents, and copy the new data into it.
Related
Imagine that we have a file, and some job that processes it and sends the data:
into the database
to an external service
Can we guarantee to process the file only once or at least to determine that something went wrong and notify the user so that he manually solved this problem?
Yes, you can.
What you can do is create a table in the database to store the name and a flag/status (if read, yes else no) of files. When process feeds the file in that location, make sure that the same process updates the name (if name is different each time) and flag/status for that file in the database. Your file read process can get the name of file from the database and dump that file in wherever you ant and when it's done, It should update the flag to read or whatever. This way, you can avoid reading the file more than one time.
I would store two tables of information in your database.
The processed file lines like you were already doing.
A record of the files themselves. Include:
the filename
whether the processing was successful, failed, partially succeeded
a SHA1 hashed checksum that can be used to check for the uniqueness of the file later
When you go to process a file, you first check whether the checksum already exists. If it does, you can stop processing and log the issue. Or you can throw that information on the file table.
Also be sure to have a foreign key association between your processed lines and your files. That way if something does go wrong, the person doing manual intervention can trace the affected lines.
Neither Usmana or Tracy answer actually guarantees that a file is not processed more than once and your job doesn't send duplicate requests to the database and the external service(#1 and #2 in your question). Both solutions suggest keeping a log and update it after all the processing is done but if an error occurs when you try to update the log at the very end, your job will try processing the file again next time it runs and will send duplicate requests to the database and external service. The only way to deal with it using the solutions Usmana and Tracy suggested is to run everything in a transaction but it's quite a challenging task in a distributing environment like yours.
A common solution to your problem is to gracefully handle duplicate requests to the database and external services. The actual implementation can vary but for example you can add a unique constraint to the database and when the job tries to insert a duplicate record an exception will be thrown which you can just ignore in the job because it means the required data is already in the db.
My answer don't mean that you don't need the log table Usmana and Tracy suggested. You do need it to keep track of processing status but it doesn't really guarantee there won't be duplicate requests to your database and external service unless you use a distributed transaction.
Hope it helps!
How do I handle "race condition" between instances of script that is scheduled to run every minute, performing following tasks for every file in directory:
Connect to SQL database and check last element (filename) in table
Create several files (multiple folders) with the next available filename
Insert to SQL new record with the filename and created files' information
Because process runs every 1 minute, it's possible that 2 instances overlap and work on same files. I can prevent that by file locking and skipping already opened file, however the issue persists with:
Checking next available filename in database (2 processes want to use the same filename)
Creating files with this filename
Process A takes inputA.jpg and finds next available as filename image_01.
Process B takes inputB.jpg and finds next available as filename image_01.
And so the chaos begins...
Unfortunately, I can't insert any placeholder record in SQL table to show that the next filename is being processed.
Pseudo-code of the loop:
foreach ($file)
{
$name = findFileNameInSql($file)
$path1 = createFile($name, $settings1);
$path2 = createFile($name, $settings2);
$path3 = createFile($name, $settings3);
addToSql($file, $name, $path1, $path2, $path3)
}
The actual code is a bit more complicated, including file modifications and transactional insert to 2 SQL tables. In case of createFile() failure the application is rolling back all previously created files. It obviously creates issue when one instance of app is creating file "abc" and second instance has error that file "abc" already exists.
EDIT :
Sure, limiting script to have only one instance could be solution, but I was hoping to find a way to run them in parallel. If there's no way to do it, we can close this as duplicate.
You need to make the code which returns the next available filename from the database atomic in the database, so that the database can't return the same filename twice. This is really a database problem rather than a perl problem, per se.
You don't say which database you're using, but there are several ways to do it.
A naive and brutish way to do it in MySQL is for the perl script to perform a LOCK TABLE table WRITE on the table with the filenames while it calculates a new one and does its work. Once the table is updated with the new filename, you can release the lock. TABLE LOCKS don't play nicely with transactions though.
Or you could do something rather more elegant, like implement a stored procedure with appropriate locking within the database itself to return the new filename.
Or use an AUTOINCREMENT column, so that each time you add something to the table you get a new number (and hence a new filename).
This can all get quite complicated though; if you have multiple transactions simultaneously, how the database resolves those is usually a configurable thing, so I can't tell you what will happen.
Given that it sounds as though your code is principally reorganising data on disk, there's not much advantage to having multiple jobs running at the same time; this code is probably I/O bound anyway. In which case it's much simpler just to make the code changes others have suggested to run only one copy at once.
I have a Perl code which connects with database and scan data from different different table. I face a problem if I lose my connection: it roll back all the transaction. How could I make the Perl script resume the connection and start the process from where the interruption took place? Can I use the Perl to resume the connection or any thing other technique to start the process from where the interruption took place if so could any one guide me with the steps please.
It is actually required because of we have lots of data and takes 1 week to scan all the data and insert in specific table, in between if we run database offline backup it disconnect all the connection and whatever transaction happens it roll back and need to run once again from the beginning.
We can commit the transaction whatever done but challenge is how we can start process from where the interruption took place so we don't require to run from the beginning.
Relying on a DB connection to be consistently open for over a day is the wrong approach.
A possible solution involves:
1) Connect to the DB to create a DB Handle. Use an infinite loop and sleep to wait until you have a good db handle. Put this into a subroutine.
2) Put the individual requests for individual tables in a data structure like an array
Execute separate Queries in separate statements in a loop (.
Check if the dbh is stale before each individual request. Clean and recreate the handle if necessary. Use the subroutine from 1)
Handle breakdowns during a request in EVAL blocks using the "redo" statement to make sure no statement gets skipped.
3) Keep the data between requests either in memory or any non-SQL storage like a Key/Value Store ( Redis, )
4) Compute whatever needs computation
5) When you have all data for your commit transaction, do.
This solution assumes you don't care about changes between reading and committing back. If you do, you need to LOCK your affected tables first. You probably don't want to lock a table for a week though.
I have a file stored on disk that can be access across multiple servers in a web farm. This file is updated as necessary based on data changes in the database. I have a database table that stores a row with a URI for this file and some hashes based off of some database tables. If the hashes don't match their respective tables, then the file need to be regenerated and a new row needs to be inserted.
How do I make it so that only 1 client regenerates this file and inserts a row?
The easiest but worst solution (because of locks) is to:
BEGIN TRANSACTION
SELECT ROW FROM TABLE (lock the table for the remainder of the transaction)
IF ROW IS OUT OF DATE:
REGENERATE FILE
INSERT ROW INTO TABLE
DO SOME STUFF WITH FILE (30s)
COMMIT TRANSACTION
However, if multiple clients execute this code, all of the subsequent clients sit for a long time while the "DO SOME STUFF WITH FILE" processes.
Is there a better way to handle this? Maybe changing the way I process the file before the commit to make it faster? I've been stumped on this for a couple days.
It sounds like you need to do your file processing asynchronously, so the file process is spun off and the transaction completes in a timely manner. There are a few ways to do that, but the easiest might be to replace the "do stuff with file" with a "insert a record into the table This_File_Needs_To_Be_Updated, then run a job every few minutes that updates each record in that table. Or HERE is some code that generates a job on the fly. Or see THIS question on Stack Overflow.
The answer depends on the details of file level processing.
If you just swap the database and file operations, you risk corruption of the file or busy waiting (depending on how exactly you open it, and what your code does when a concurrent open is rejected). Busy waiting would definitely be worse than waiting on a database lock from a throughput (or any other) perspective.
If your file processing really takes so long as to be frequently causing queueing of requests, the only solutions are to add more powerful hardware or optimize file level processing.
For example, if the file only reflects the data in the database, you might get away with not updating it at all, and having a background process that periodically regenerates its content based on the data in the database. You might need to add versioning that makes sure that whoever reads the file is not receiving stale data. If the file pointed to by the URL has a new name every time, you might need an error handler that makes sure that GET requests are not habitually receiving a 404 response on new files.
We're trying to locate a memory leak in our code. As the software runs, I can see memory usage slowly get bigger and bigger. Each operation add records in a database.
Then, I was wondering, where does the data from an INSERT command really go before we commit the changes? Is the data added in the actual database file and flagged as "Rollback this if requested"? or only stored in internal memory and dumped when the commit request is done?
If it helps, we're using Access for now.
As part of the "Begin Transaction", the "Data" as such does not go anywhere or change in any way, instead it records the list of commands issued to it. If you then cancel the transaction with a "Rollback" the instructions are dropped and no changes made, else if a "Commit" is issued then it executes the saved instructions in the correct order.
As the instructions are stored in a local table (as mentioned by Albert) this is why you see a memory increase as the local file is opened in it's entirety to memory (hence why we split front and back end database's in Access to avoid dumping a huge file on the RAM)
Also worth mentioning that any SQL statements that are issued have their syntax checked before been saved in order to ensure that if you are running multiple SQL's; it won't fail mid way through and leave your data in a state that you did not intend.
Apologies for "Instructions" I know this is the layman's term but I hope it makes sense.