Deleting files and corresponding entries from database

I have a web site which handles file management. Users can upload files, add descriptions, edit and delete them. What are the best practices for that kind of scenario?
I store files in the file system.
How should I handle deleting a file? In this case I have two entities to delete: the file and the entry in the database. The first scenario is that I delete the file and, if there was no error, I delete the entry from the database. But then, if the entry from the database couldn't be deleted, I cannot restore my file. So the second scenario is the opposite: first the entry from the database and then the file. But again, when the file cannot be deleted, I cannot restore the entry in the DB. Which approach is better? Or is there any other?
I think the problem is universal for all web programming languages and all database engines. But let's say that I have MySQL and PHP, so deleting the file from a database stored procedure is not possible.

I usually find it best to do soft deletes, especially for data in the database.
Doing it this way you could simply put the file in a new location to denote it as being deleted, and mark the entry in the database as being removed. This then allows you to still work with the file if the database delete fails for some reason.
Once you have the file in a new location, you can either backup that location to another place or set something up to periodically delete items from that location.
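For illustration, a minimal sketch of that flow in PHP with MySQL, assuming a hypothetical files table with filename and deleted_at columns and a separate trash directory:

    <?php
    // Soft delete: move the file aside and mark the row as deleted.
    // Table/column names and directory layout are assumptions for illustration.
    function softDeleteFile(PDO $db, int $fileId, string $uploadDir, string $trashDir): bool
    {
        $stmt = $db->prepare('SELECT filename FROM files WHERE id = ?');
        $stmt->execute([$fileId]);
        $filename = $stmt->fetchColumn();
        if ($filename === false) {
            return false; // no such entry
        }

        $src = $uploadDir . '/' . $filename;
        $dst = $trashDir . '/' . $filename;

        // Step 1: move the file to the "deleted" location instead of removing it.
        if (!rename($src, $dst)) {
            return false;
        }

        // Step 2: mark the database entry as removed; if that fails, put the file back.
        $stmt = $db->prepare('UPDATE files SET deleted_at = NOW() WHERE id = ?');
        if (!$stmt->execute([$fileId])) {
            rename($dst, $src); // restore so file and database stay in sync
            return false;
        }
        return true;
    }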

Here is one possibility.
You could first move the file to a 'Deleted' folder, then delete the entry in the database. If that fails, restore the file from the 'Deleted' folder.
To get rid of the files in the 'Deleted' folder, one option is to do it right after the entry is deleted from the database. If that fails, you end up with orphan files in the 'Deleted' folder... Depending on your requirements, this might not be a problem.
Another option would be to have a call (maybe on SessionEnd, or a service) that does the cleanup.

I would personally make sure the database gets the priority, as it is more fundamental to the system. So I'd make sure the DB row gets deleted, then delete the file. If the file deletion fails, I'd make it fail silently. I would then have a cronjob that checks whether all files have their DB counterpart and, if not, marks them for deletion, so the system stays clean and coherent.
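A rough sketch of such a cronjob in PHP; the files table, the filename column, and the upload path are assumptions for illustration:

    <?php
    // Cron job: remove files on disk that no longer have a database counterpart.
    $db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    foreach (glob('/var/www/uploads/*') as $path) {
        $stmt = $db->prepare('SELECT COUNT(*) FROM files WHERE filename = ?');
        $stmt->execute([basename($path)]);

        if ((int) $stmt->fetchColumn() === 0) {
            // The row was already deleted, so this file is an orphan.
            unlink($path);
        }
    }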

There is often a garbage collection phase, and in fact your database has something similar called the "transaction log" which it can use to rewind or play forward a transaction.
In the case of your file delete you will have a clean-up process that runs periodically (perhaps manually in the event of a crash, or automatically every so often) that compares what is on the disk with what is in the database and makes an appropriate correction.
In order for any operation to be "atomic" there must be a method of cleaning up in the event of a crash. The key is finding a method that cleans up consistently such that a failure at any point within the "atomic" operation doesn't leave the system unrecoverable.

Vista introduced Transactional NTFS that will allow you to wrap file system operations along with database operations into a transaction that can be rolled back if either failed. This is not in .Net 3.5 and I do not know if it will be part of .Net 4.0, but there is evidence that it works today with a bit of leg work, i.e. by using Win32 calls (see this article).
I know nothing about transaction management in PHP.

Related

Split Access database not allowing multiple users

I have a split database design in Microsoft Access. Copies of the front end (w/ forms, queries, linked tables) are distributed to multiple users, while the backend (tables only) resides on the network.
Everything works fine when there is only one user, but as soon as a second user tries to open their copy of the front end, they get an error message saying the backend is already in use.
I've already confirmed that everyone has read & write permissions for the backend.
I've used split databases before and never run into this issue. The only difference is this time I didn't use the Database Splitter utility. I just started with the backend, then created a new database and set up some linked tables. Could that be my problem? Is there a step or setting I'm missing?
In general, this should work.
However, not only do users need read/write, but they ALSO require create-file and delete-file rights to that folder.
The reason is that of course this is a pure file-based system, so on first open, Access will create an ldb (locking) file. This locking file is used to manage (allow) multi-user operation of the file.
If the locking file can't be created (by the first user to open), then multi-user operation can't be used, and in fact in most cases you get a read-only file.
So users pretty much need full rights. I have seen some setups work without delete rights, but that means the last user out cannot remove (delete) that ldb locking file, and it should be allowed to be deleted.
So, create-file and delete-file rights are also required in most cases for this to work.
It is also possible, I suppose, that one user launched Access, chose Open, browsed to that file, and opened it exclusively. However, you have a split system, and that should not be possible, but it is certainly still possible that someone on the network opened the back-end file (that shared file) directly with Access and used the Open Exclusive option, which would prevent all other users from opening the file.
As noted, since this looks to be a split system, I would suggest that users don't have the all-important create-file and delete-file rights to that folder. Without such rights, as noted, multi-user operation can't occur; you get a read-only file in most cases.
So, either users don't have enough rights to that folder, or someone has opened the file with Access and opened it "exclusive".

How to deal with issues when storing uploaded files in the file system for a web app?

I am building a web application where the users can create reports and then upload some images for the created reports. Those images will be rendered in the browser when the user clicks a button on the report page. The images are confidential and only authorized users will be able to access them.
I am aware of the pros and cons of storing images in database, in filesystem or a service like amazon S3. For my application, I am inclined to keep the images in the filesystem and paths of the images in the database. That means I have to deal with the problems arising around distributed transaction management. I need some advice on how to deal with these problems.
1- I believe one of the proper solutions is to use technologies like JTA and XADisk. I am not very knowledgeable about these technologies, but I believe two-phase commit is how atomicity is achieved. I am using MySQL as the database, and it seems like two-phase commit is supported by MySQL. The problem with this approach is that XADisk does not seem to be an active project, there is not much documentation about it, and there is the fact that I am not very knowledgeable about the ins and outs of this approach. I am not sure if I should invest in this approach.
2- I believe I can get away with some of the problems arising from the violation of ACID properties for my application. While uploading images, I can first write the files to disk, and if this operation succeeds I can update the paths in the database. If the database transaction fails, I can delete the files from the disk. I know that is still not bulletproof; a power failure might occur just after the DB transaction, or the disk might not be responsive for a while, etc. I know there are also concurrency issues, for instance if one user tries to modify the uploaded image and another tries to delete it at the same time, there will be some problems. Still, the chances of concurrent updates in my application will be relatively low.
I believe I can live with orphan files on the disk or orphan image paths in the DB if such exceptional cases occur. If a file path exists in the DB but not in the file system, I can show a notification to the user on the report page and he might try to re-upload the image. Orphan files in the file system would not be too much of a problem; I might run a process to detect such files from time to time. Still, I am not very comfortable with this approach.
3- The last option might be to not store file paths in the DB at all. I can structure the filesystem such that I can infer the file path in code and load all images at once. For instance, I can create a folder named after the report id for each report. When a request has been made to load the images of a report, I can load them at once since I know the report id. That might end up with a huge number of folders in the filesystem, and I am not sure if such a design is acceptable. Concurrency issues will still exist in this scheme.
I would appreciate some advice on which approach I should follow.
I believe you are trying to be ultra-correct, and maybe not that much is needed, but I also faced a similar situation some time ago and explored different possibilities. I disliked options along the lines of your option 1, but for 2 and 3 I have had different successful approaches.
Let's sum up first the list of concerns:
You want the file to be saved
You want the file path to be linked to the corresponding entity (i.e. the report)
You don't want a file path to be linked to a file that doesn't exist
You don't want files in the filesystem not linked to any report
And the different approaches:
1. Using DB
You can ensure transactions in the DB with pretty much any relational database, and with S3 you get read-after-write consistency for new objects: if you PUT an object and you get a 200 OK, it will be readable. Now, how to put all this together? You need to keep track of the process. I can see two ways:
1.1 With a progress table
The upload request is saved to a table with anything needed to identify this file: report id, temporary uploaded file path, destination path, and a status column
You save the file
If saving the file fails, you can update the record in the table, or delete it
If saving the file is successful, in a transaction:
update the progress table with successful status
update the table where you actually save the relationship report-image
Have a cron job, but checking the process table rather than the filesystem. If there is an orphan file in the filesystem, it was definitely added to the table (that was step 1). Here you can decide whether to delete the file or, if you have enough info, continue the aborted process by triggering step 4.
(The progress table can even be the same report-image relationship table with some extra status columns.)
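A small sketch of the success path (step 4) in PHP, assuming hypothetical upload_progress and report_images tables and a PDO connection configured to throw exceptions:

    <?php
    // Step 4: in one transaction, mark the progress row as done and record the
    // report-image relationship. Table names are assumptions for illustration.
    function finishUpload(PDO $db, int $progressId, int $reportId, string $finalPath): void
    {
        $db->beginTransaction();
        try {
            $db->prepare('UPDATE upload_progress SET status = ? WHERE id = ?')
               ->execute(['done', $progressId]);
            $db->prepare('INSERT INTO report_images (report_id, path) VALUES (?, ?)')
               ->execute([$reportId, $finalPath]);
            $db->commit();
        } catch (Throwable $e) {
            $db->rollBack();
            throw $e; // the cron job will later find the stale progress row
        }
    }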
1.2 With a queue system
Like RabbitMQ, SQS, AMQ, etc
A very similar approach could be taken with any queue system instead of a DB table. I won't give many details because it depends more on your actual infrastructure, but here is the general idea.
The upload request goes to a queue: you send a message with anything you may need to identify this file, the report id, and, if you want, a tentative final path.
You upload the file
A worker reads pending messages in the queue and does the work. The message is marked as consumed only when everything goes well.
If something fails, naturally the message will come back to the queue
The next time a message is read, the worker can have enough info to see if there is work to resume, or even a file to delete if resuming is not possible
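As a rough illustration of the worker side, here is a sketch using php-amqplib against RabbitMQ; the queue name and message payload are assumptions, and the same shape applies to SQS, AMQ, etc.:

    <?php
    // Worker: consume upload jobs and only acknowledge when everything succeeded,
    // so failed messages naturally return to the queue for a retry.
    require __DIR__ . '/vendor/autoload.php';

    use PhpAmqpLib\Connection\AMQPStreamConnection;

    $connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
    $channel = $connection->channel();
    $channel->queue_declare('image_uploads', false, true, false, false);

    $callback = function ($msg) {
        $job = json_decode($msg->getBody(), true);
        // Move the temp file to its final path, update the DB, resume or clean up...
        // Acknowledge only once everything went well.
        $msg->ack();
    };

    // no_ack = false so unacknowledged messages are redelivered after a failure.
    $channel->basic_consume('image_uploads', '', false, false, false, false, $callback);

    while ($channel->is_consuming()) {
        $channel->wait();
    }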
In both cases, concurrency problems won't be straightforward to manage, but they can be managed (relying on DB locks in the first case, and FIFO queues in the second), always with some application logic
2. Without DB
To some extent a system without a database would be perfectly acceptable, if we can defend it as a proper convention over configuration design.
You have to deal with 3 things:
Save files
Read files
Make sure that the structure of the filesystem is manageable
Let's start with 3:
Folder structure
In general, something like one folder per report id will be too simple, maybe hard to maintain, and ultimately too flat. This will cause issues: if we have an images folder with one folder per report, and tomorrow you have, let's say, 200k reports, the images folder will have 200k entries, and even an ls will take too long; the same goes for any programming language trying to access it. That will kill you.
You can think about something more sophisticated. Personally I like an approach that I learnt from Magento 1 more than 10 years ago and have used a lot since then: a folder structure that starts from fixed outer rules, extended with rules derived from the file name itself.
We want to save a product image. The image name is: myproduct.jpg
The first rule is: for product images I use /media/catalog/product
Then, to avoid too many images in the same folder, I create one folder per letter of the image name, up to some number of letters, let's say 3. So my final path will be something like /media/catalog/product/m/y/p/myproduct.jpg
Like this, it is clear where to save any new image. You can do something similar using your report ids, categories, or anything that makes sense for you. The final objective is to avoid a structure that is too flat, and to create a tree that makes sense to you and that can be automated easily.
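A tiny PHP helper illustrating that rule; the base directory and the depth of 3 are just examples:

    <?php
    // Build a nested path from the first letters of the file name to keep folders small.
    function imagePath(string $baseDir, string $fileName, int $depth = 3): string
    {
        $name = strtolower(pathinfo($fileName, PATHINFO_FILENAME));
        $parts = [];
        for ($i = 0; $i < $depth && $i < strlen($name); $i++) {
            $parts[] = $name[$i];
        }
        return $baseDir . '/' . implode('/', $parts) . '/' . $fileName;
    }

    echo imagePath('/media/catalog/product', 'myproduct.jpg');
    // => /media/catalog/product/m/y/p/myproduct.jpg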
And that takes us to the next part:
Read and write.
I implemented a similar system before quite successfully. It allowed me to save files easily, and to retrieve them easily, with locations that were purely dynamic. The parts here were:
S3 (but you can do with any filesystem)
A small microservice acting as a proxy for both read and write.
Some namespace system and attached logic.
The logic is quite simple. The namespace lets me know where the file will be saved. For example, the namespace can be companyname/reports/images.
Let's say I develop a microservice for reads and writes:
For saving a file, it receives:
namespace
entity id (i.e. your report)
file to upload
And it will do:
based on the rules I have for that namespace, the id, and the file name, it will save the file in the corresponding folder
it doesn't return the physical location. That remains unknown to the client.
Then, for reading, clients will use a URL that also follows the convention. For example you can have something like
https://myservice.com/{NAMESPACE}/{entity_id}
And based on the logic, the microservice will know where to find that in the storage and return the image.
If you have more than one image per report, you can do different things, such as:
- you may want to have a third slug in the path such as https://myservice.com/{NAMESPACE}/{entity_id}/1 https://myservice.com/{NAMESPACE}/{entity_id}/2 etc...
- if it is for your internal application usage, you can have one endpoint that returns the list of all eligible images, lets say https://myservice.com/{NAMESPACE}/{entity_id} returns an array with all image urls
I implemented this with a quite simple YAML config to define the logic, and very simple code reading that config. That gave me a lot of flexibility: for example, saving reports in totally different paths, servers, or S3 buckets if they belong to different companies or are different report types.
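A very reduced PHP sketch of that idea, where a plain config array stands in for the YAML and the namespace, bucket and path rule are made up:

    <?php
    // Map a namespace to a base location plus a path rule; clients never see the
    // physical path. All names here are assumptions for illustration.
    $namespaces = [
        'companyname/reports/images' => [
            'base' => 's3://acme-reports',   // or a local directory
            'rule' => fn (string $id, string $file) => "reports/{$id}/" . basename($file),
        ],
    ];

    function storagePath(array $namespaces, string $namespace, string $entityId, string $fileName): string
    {
        if (!isset($namespaces[$namespace])) {
            throw new InvalidArgumentException("Unknown namespace: $namespace");
        }
        $cfg = $namespaces[$namespace];
        return rtrim($cfg['base'], '/') . '/' . ($cfg['rule'])($entityId, $fileName);
    }

    // A request like https://myservice.com/companyname/reports/images/42 could resolve to:
    echo storagePath($namespaces, 'companyname/reports/images', '42', 'photo.jpg');
    // => s3://acme-reports/reports/42/photo.jpg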

Access database split problems

I am trying to split an Access database where I work but I have encountered a few issues that I am struggling to resolve. If I can first explain the problem.
I work for a large multi-national company that has on-site IT support but does not support Access (so no help there)
There are 12 of us working in our section, and we have an old and badly designed StockMaster database on the networked F drive. The problem is that it is only set up for single users, so we have to take turns using it. We aren't a computer-savvy bunch; we tend to run the same named queries on a daily basis.
The database is only updated once per day, every morning we get a download from our colleagues in Amsterdam. I do not want to play around with this database as first of all I'm no expert and secondly if I break it, no one will fix it.
My plan is this;
I have created a new Access database, StockMaster2, that imports the required tables. Using VB-coded modules, it deletes the old data and then imports the new. Therefore every morning it replicates what is in the original database, and it works fine.
My next step is to split the database, create the front end and distribute. This is where I'm having problems.
I created the original front end StockMaster2_fe.accde and placed it in the database folder on the F:\ drive. Does every user get their own copy of the front end? I copied and saved two more front ends (copy and paste in the same folder -> rename) namely StockMaster2_alan_fe and StockMaster2_ryan_fe and tested it. I told Ryan (who sits next to me) to find the front end named after him on the F:\ drive and open it whilst I was in ...alan_fe. We both went to run macros at the same time but he was kicked out as it gave me exclusive access.
What am I doing wrong? Why is it not allowing multiple access?
My problem is that due to strict administrator privileges I cannot download any software or access the command line, so anything that I do must be done in Access itself
I apologize for not seeing this post sooner to end your agony. There are two main issues that must be resolved to get you on the right track. The first, and perhaps the most important, is that your file is named StockMaster2_fe.accde. The .accde extension is the executable version; design changes cannot be made to that version. The extension should be .accdb to give you the flexibility to alter the database, create one database for back-end tables, and a second database for front-end objects including queries, forms, reports, macros and modules. If you have the accdb version, then your work will start to get much easier.
Issue number two: if your team is not able to share the database, that is a sign that the database, when first opened, is opening in Exclusive mode. This option can be changed in Access Options, under the Advanced section. Look for Default open mode; it should say Shared so that multiple users can operate all at once.
A possible hidden issue that can be happening, is that the database has VBA code which informs the database to open exclusively. With your version of the accde, you will not be able to access that code or change how the database opens.
Let's break this down (only because I finished all my work already...):
My next step is to split the database, create the front end and distribute. This is where i'm having problems. I created the original front end StockMaster2_fe.accde and placed it in the database folder on the F drive. Does every user get their own copy of the front end?
Yes
I copied and saved two more front ends (copy and paste in the same folder -> rename) namely StockMaster2_alan_fe and StockMaster2_ryan_fe and tested it. I told Ryan (who sits next to me) to find the front end named after him on the F drive and open it whilst i was in ...alan_fe. We both went to run macros at the same time but he was kicked out as it gave me exclusive access. What am i doing wrong?
Ensure your back end contains only tables. Access is a "client-centric" database, which means when a query is run it pulls all of the data over the pipe to your local computer, does what it does, and then sends it back. So, make sure the back end has only tables and all the other jazz (macros, queries, etc...) are in the front end. Also, the front end will contain links to the back-end tables. All of your queries/macros/etc.. will reference these links, and not the tables in the back-end DB directly.
Why is it not allowing multiple access?
Also, make sure your table-locking scheme is multi-user friendly. If you're doing table locking, it will cause errors. If you're doing record locking, it probably won't.
My problem is that due to strict administrator privileges i cannot download any software or access the command line, so anything that i do must be done in Access itself.
Shouldn't be a problem at all.

EventStore automatic backup

Is there a way to automatically back up EventStore? I didn't find any solution other than manual scripts/console apps, etc. I am using Get EventStore.
Unfortunately, as of right now I believe your best bet is to either:
Create an additional node in a separate environment which asynchronously replicates - this acts as a sort of 'hot backup', however this does NOT protect against data destruction via application or user command (unless you catch it before synchronization).
Roll your own script to create copies of the *.chk and other necessary files and drop them off at a secure backup location.
Outside of that, I don't know of any automated/native backup functionality in EventStore.
Here is a link to the current (albeit limited) EventStore backup documentation: http://docs.geteventstore.com/server/3.1.0/database-backup/
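For option 2, a rough cron-driven sketch in PHP; the directory names are assumptions, and the linked documentation describes copying the checkpoint (*.chk) files before the chunk files, so verify the exact procedure for your version:

    <?php
    // Copy the EventStore data directory to a timestamped backup location.
    $dataDir   = '/var/lib/eventstore';
    $backupDir = '/backups/eventstore-' . date('Ymd-His');
    mkdir($backupDir, 0750, true);

    // 1. Checkpoint files first, so they never point past the copied chunks.
    foreach (glob($dataDir . '/*.chk') as $file) {
        copy($file, $backupDir . '/' . basename($file));
    }
    // 2. Then the chunk files.
    foreach (glob($dataDir . '/chunk-*') as $file) {
        copy($file, $backupDir . '/' . basename($file));
    }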

How can I put a database under git (version control)?

I'm doing a web app, and I need to make a branch for some major changes, the thing is, these changes require changes to the database schema, so I'd like to put the entire database under git as well.
How do I do that? Is there a specific folder that I can keep under a git repository? How do I know which one? How can I be sure that I'm choosing the right folder?
I need to be sure, because these changes are not backward compatible; I can't afford to screw up.
The database in my case is PostgreSQL
Edit:
Someone suggested taking backups and putting the backup file under version control instead of the database. To be honest, I find that really hard to swallow.
There has to be a better way.
Update:
OK, so there's no better way, but I'm still not quite convinced, so I will change the question a bit:
I'd like to put the entire database under version control, what database engine can I use so that I can put the actual database under version control instead of its dump?
Would sqlite be git-friendly?
Since this is only the development environment, I can choose whatever database I want.
Edit2:
What I really want is not to track my development history, but to be able to switch from my "new radical changes" branch to the "current stable branch" and be able, for instance, to fix some bugs/issues with the current stable branch, such that when I switch branches, the database auto-magically becomes compatible with the branch I'm currently on.
I don't really care much about the actual data.
Take a database dump, and version control that instead. This way it is a flat text file.
Personally I suggest that you keep both a data dump, and a schema dump. This way using diff it becomes fairly easy to see what changed in the schema from revision to revision.
If you are making big changes, you should have a secondary database that you make the new schema changes to and not touch the old one since, as you said, you are making a branch.
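A tiny sketch of that dump-and-commit workflow, shelling out to pg_dump (the database name and output paths are illustrative):

    <?php
    // Produce diff-friendly text dumps that can be committed like any other file.
    $dbName = 'myapp';

    // Schema only: small and easy to diff between revisions.
    exec('pg_dump --schema-only ' . escapeshellarg($dbName) . ' > db/schema.sql', $out, $schemaRc);

    // Data only, kept in a separate file so schema diffs stay readable.
    exec('pg_dump --data-only ' . escapeshellarg($dbName) . ' > db/data.sql', $out, $dataRc);

    if ($schemaRc !== 0 || $dataRc !== 0) {
        fwrite(STDERR, "pg_dump failed\n");
        exit(1);
    }
    // db/schema.sql and db/data.sql now sit in the working tree, ready to commit.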
I'm starting to think of a really simple solution, don't know why I didn't think of it before!!
Duplicate the database, (both the schema and the data).
In the branch for the new-major-changes, simply change the project configuration to use the new duplicate database.
This way I can switch branches without worrying about database schema changes.
EDIT:
By duplicate, I mean create another database with a different name (like my_db_2); not doing a dump or anything like that.
Use something like Liquibase. This lets you keep your Liquibase changelog files under revision control. You can tag changes for production only, and have Liquibase keep your DB up to date for either production or development (or whatever scheme you want).
Irmin (branching + time travel)
Flur.ee (immutable + time travel + graph query)
XTDB (formerly called 'CruxDB') (time travel + query)
TerminusDB (immutable + branching + time travel + Graph Query!)
DoltDB (branching + time-travel + SQL query)
Quadrable (branching + remote state verification)
EdgeDB (no real time travel, but migrations derived by the compiler after schema changes)
Migra (diffing for Postgres schemas/data. Auto-generate migration scripts, auto-sync db state)
ImmuDB (immutable + time-travel)
I've come across this question because I have a similar problem, where something approximating a DB-based directory structure stores 'files', and I need git to manage it. It's distributed across a cloud, using replication, hence its access point will be via MySQL.
The gist of the above answers seems to be to suggest an alternative solution to the problem asked, which kind of misses the point of using Git to manage something in a database, so I'll attempt to answer that question.
Git is a system which in essence stores a database of deltas (differences) which can be reassembled, in order, to reproduce a context. The normal usage of git assumes that the context is a filesystem, and those deltas are diffs in that file system, but really all git is is a hierarchical database of deltas (hierarchical because in most cases each delta is a commit with at least one parent, arranged in a tree).
As long as you can generate a delta, in theory git can store it. The problem is that normally git expects the context on which it's generating deltas to be a file system, and similarly, when you check out a point in the git hierarchy, it expects to generate a filesystem.
If you want to manage change in a database, you have 2 discrete problems, and I would address them separately (if I were you). The first is schema, the second is data (although in your question, you state data isn't something you're concerned about). A problem I had in the past was a Dev and Prod database, where Dev could take incremental changes to the schema, and those changes had to be documented in CVS and propagated to live, along with additions to one of several 'static' tables. We did that by having a 3rd database, called Cruise, which contained only the static data. At any point the schema from Dev and Cruise could be compared, and we had a script to take the diff of those 2 files and produce an SQL file containing ALTER statements to apply it. Similarly any new data could be distilled to an SQL file containing INSERT commands. As long as fields and tables are only added, and never deleted, the process could automate generating the SQL statements to apply the delta.
The mechanism by which git generates deltas is diff and the mechanism by which it combines 1 or more deltas with a file, is called merge. If you can come up with a method for diffing and merging from a different context, git should work, but as has been discussed you may prefer a tool that does that for you. My first thought towards solving that is this https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration#External-Merge-and-Diff-Tools which details how to replace git's internal diff and merge tool. I'll update this answer, as I come up with a better solution to the problem, but in my case I expect to only have to manage data changes, in-so-far-as a DB based filestore may change, so my solution may not be exactly what you need.
There is a great project called Migrations under Doctrine that was built just for this purpose.
It's still in alpha state and built for PHP.
http://docs.doctrine-project.org/projects/doctrine-migrations/en/latest/index.html
Take a look at RedGate SQL Source Control.
http://www.red-gate.com/products/sql-development/sql-source-control/
This tool is a SQL Server Management Studio snap-in which will allow you to place your database under Source Control with Git.
It's a bit pricey at $495 per user, but there is a 28 day free trial available.
NOTE
I am not affiliated with RedGate in any way whatsoever.
I've released a tool for sqlite that does what you're asking for. It uses a custom diff driver leveraging the sqlite project's 'sqldiff' tool, UUIDs as primary keys, and leaves off the sqlite rowid. It is still in alpha, so feedback is appreciated.
Postgres and mysql are trickier, as the binary data is kept in multiple files and may not even be valid if you were able to snapshot it.
https://github.com/cannadayr/git-sqlite
I want to make something similar, add my database changes to my version control system.
I am going to follow the ideas in this post from Vladimir Khorikov, "Database versioning best practices". In summary I will:
store both its schema and the reference data in a source control system.
for every modification we will create a separate SQL script with the changes
In case it helps!
You can't do it without atomicity, and you can't get atomicity without either using pg_dump or a snapshotting filesystem.
My postgres instance is on zfs, which I snapshot occasionally. It's approximately instant and consistent.
I think X-Istence is on the right track, but there are a few more improvements you can make to this strategy. First, use:
$ pg_dump --schema-only ...
to dump the tables, sequences, etc and place this file under version control. You'll use this to separate the compatibility changes between your branches.
Next, perform a data dump for the set of tables that contain configuration required for your application to operate (you should probably skip user data, etc.), like form defaults and other non-user-modifiable data. You can do this selectively by using:
$ pg_dump --table=... or $ pg_dump --exclude-table=...
This is a good idea because the repo can get really clunky when your database gets to 100 MB+ if you do a full data dump. A better idea is to back up a more minimal set of data that you require to test your app. If your default data is very large, though, this may still cause problems.
If you absolutely need to place full backups in the repo, consider doing it in a branch outside of your source tree. An external backup system with some reference to the matching svn rev is likely best for this though.
Also, I suggest using text format dumps over binary for revision purposes (for the schema at least) since these are easier to diff. You can always compress these to save space prior to checking in.
Finally, have a look at the postgres backup documentation if you haven't already. The way you're commenting on backing up 'the database' rather than a dump makes me wonder if you're thinking of file system based backups (see section 23.2 for caveats).
What you want, in spirit, is perhaps something like Post Facto, which stores versions of a database in a database. Check this presentation.
The project apparently never really went anywhere, so it probably won't help you immediately, but it's an interesting concept. I fear that doing this properly would be very difficult, because even version 1 would have to get all the details right in order to have people trust their work to it.
This question is pretty much answered but I would like to complement X-Istence's and Dana the Sane's answer with a small suggestion.
If you need revision control with some degree of granularity, say daily, you could couple the text dump of both the tables and the schema with a tool like rdiff-backup which does incremental backups. The advantage is that instead of storing snapshots of daily backups, you simply store the differences from the previous day.
With this you have both the advantage of revision control and you don't waste too much space.
In any case, using git directly on big flat files which change very frequently is not a good solution. If your database becomes too big, git will start to have some problems managing the files.
Here is what I am trying to do in my projects:
Separate data, schema, and default data.
The database configuration is stored in a configuration file that is not under version control (.gitignore).
The database defaults (for setting up new projects) are a simple SQL file under version control.
For the database schema, create a schema dump and keep it under version control.
The most common way is to have update scripts that contain SQL statements (ALTER TABLE... or UPDATE). You also need a place in your database where you save the current version of your schema.
Take a look at other big open source database projects (Piwik, or your favorite CMS system); they all use update scripts (1.sql, 2.sql, 3.sh, 4.php, 5.sql).
But this is a very time-intensive job: you have to create and test the update scripts, and you need to run a common update script that compares the version and runs all necessary update scripts.
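A bare-bones sketch of such a runner in PHP; the schema_version table and the updates/ folder are assumptions, and it assumes each script can be executed in a single call:

    <?php
    // Apply every update script with a number higher than the recorded schema version.
    $db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $current = (int) $db->query('SELECT MAX(version) FROM schema_version')->fetchColumn();

    $scripts = glob(__DIR__ . '/updates/*.sql');
    sort($scripts, SORT_NATURAL);              // 1.sql, 2.sql, ..., 10.sql

    foreach ($scripts as $script) {
        $version = (int) basename($script, '.sql');
        if ($version <= $current) {
            continue;                          // already applied
        }
        $db->exec(file_get_contents($script)); // the ALTER/UPDATE statements of this step
        $db->prepare('INSERT INTO schema_version (version) VALUES (?)')->execute([$version]);
        $current = $version;
    }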
So theoretically (and that's what I am looking for) you could:
dump the database schema after each change (manually, cron job, git hooks (maybe before commit))
(and only in some very special cases create update scripts)
After that, in your common update script, run the normal update scripts for the special cases, then compare the schemas (the dump and the current database) and automatically generate the necessary ALTER statements. There are some tools that can do this already, but I haven't found a good one yet.
What I do in my personal projects is store my whole database in Dropbox and then point my MAMP/WAMP workflow to use it right from there. That way the database is always up to date wherever I need to do some development. But that's just for dev! Live sites use their own server for that, of course! :)
Storing each level of database changes under git versioning control is like pushing your entire database with each commit and restoring your entire database with each pull.
If your database is so prone to crucial changes and you cannot afford to lose them, you can just update your pre_commit and post_merge hooks.
I did the same with one of my projects and you can find the directions here.
That's how I do it:
Since you have free choice of DB type, use a file-based DB, e.g. Firebird.
Create a template DB which has the schema that fits your current branch and store it in your repository.
When executing your application, programmatically create a copy of your template DB, store it somewhere else, and just work with that copy.
This way you can put your DB schema under version control without the data. And if you change your schema you just have to change the template DB
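A minimal PHP sketch of that startup step, assuming Firebird via pdo_firebird and illustrative paths and credentials:

    <?php
    // On startup: copy the versioned template DB and connect to the copy only.
    $template = __DIR__ . '/db/template.fdb';      // checked into the repository
    $workCopy = sys_get_temp_dir() . '/app-' . uniqid() . '.fdb';

    if (!copy($template, $workCopy)) {
        throw new RuntimeException('Could not create a working copy of the template DB');
    }

    // Never connect to the template itself.
    $db = new PDO('firebird:dbname=' . $workCopy, 'sysdba', 'masterkey');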
We used to run a social website, on a standard LAMP configuration. We had a Live server, Test server, and Development server, as well as the local developers machines. All were managed using GIT.
On each machine, we had the PHP files, but also the MySQL service, and a folder with Images that users would upload. The Live server grew to have some 100K (!) recurrent users, the dump was about 2GB (!), the Image folder was some 50GB (!). By the time that I left, our server was reaching the limit of its CPU, Ram, and most of all, the concurrent net connection limits (We even compiled our own version of network card driver to max out the server 'lol'). We could not (nor should you assume with your website) put 2GB of data and 50GB of images in GIT.
To manage all this under GIT easily, we would ignore the binary folders (the folders containing the images) by inserting these folder paths into .gitignore. We also had a folder called SQL outside the Apache documentroot path. In that SQL folder, we would put our SQL files from the developers with incremental numbering (001.florianm.sql, 001.johns.sql, 002.florianm.sql, etc). These SQL files were managed by GIT as well. The first sql file would indeed contain a large set of DB schema. We didn't add user data in GIT (e.g. the records of the users table or the comments table), but data like configs, topology, or other site-specific data was maintained in the sql files (and hence by GIT). Mostly it's the developers (who know the code best) that determine what is and what is not maintained by GIT with regards to SQL schema and data.
When it got to a release, the administrator logs in onto the dev server, merges the live branch with all developers' and other needed branches on the dev machine into an update branch, and pushes it to the test server. On the test server, he checks that the updating process for the Live server is still valid and, in quick succession, points all traffic in Apache to a placeholder site, creates a DB dump, points the working directory from 'live' to 'update', executes all new sql files into mysql, and repoints the traffic back to the correct site. When all stakeholders agreed after reviewing the test server, the administrator did the same thing from the test server to the live server. Afterwards, he merges the live branch on the production server to the master branch across all servers, and rebases all live branches. The developers were responsible for rebasing their branches themselves, but they generally know what they are doing.
If there were problems on the test server, eg. the merges had too many conflicts, then the code was reverted (pointing the working branch back to 'live') and the sql files were never executed. The moment that the sql files were executed, this was considered as a non-reversible action at the time. If the SQL files were not working properly, then the DB was restored using the Dump (and the developers told off, for providing ill-tested SQL files).
Today, we maintain both a sql-up and a sql-down folder, with equivalent filenames, where the developers have to test that the upgrading sql files can be equally downgraded. This could ultimately be executed with a bash script, but it's a good idea if human eyes keep monitoring the upgrade process.
It's not great, but it's manageable. Hope this gives an insight into a real-life, practical, relatively high-availability site. It may be a bit outdated, but the approach is still followed.
Update Aug 26, 2019:
Netlify CMS is doing it with GitHub; an example implementation can be found in netlify-cms-backend-github, with all the information on how they implemented it.
I say don't. Data can change at any given time. Instead you should only commit data models in your code, schema and table definitions (create database and create table statements) and sample data for unit tests. This is kinda the way that Laravel does it, committing database migrations and seeds.
I would recommend neXtep (link removed - domain was taken over by an NSFW website) for version controlling the database. It has a good set of documentation and forums that explain how to install it and the errors encountered. I have tested it with PostgreSQL 9.1 and 9.3; I was able to get it working for 9.1, but for 9.3 it doesn't seem to work.
Use a tool like iBatis Migrations (manual, short tutorial video) which allows you to version control the changes you make to a database throughout the lifecycle of a project, rather than the database itself.
This allows you to selectively apply individual changes to different environments, keep a changelog of which changes are in which environments, create scripts to apply changes A through N, rollback changes, etc.
I'd like to put the entire database under version control, what database engine can I use so that I can put the actual database under version control instead of its dump?
This is not database-engine dependent. For Microsoft SQL Server there are lots of version-control programs. I don't think this problem can be solved with git; you would have to use a pgsql-specific schema version control system. I don't know whether such a thing exists or not...
Use a version-controlled database, of which there are now several.
https://www.dolthub.com/blog/2021-09-17-database-version-control/
These products don't apply version control on top of another type of database -- they are their own database engines that support version control operations. So you need to migrate to them or start building on them in the first place.
I write one of them, DoltDB, which combines the interfaces of MySQL and Git. Check it out here:
https://github.com/dolthub/dolt
I wish it were simpler. Checking in the schema as a text file is a good start to capture the structure of the DB. For the content, however, I have not found a cleaner, better method for git than CSV files. One per table. The DB can then be edited on multiple branches and merges extremely well.
