File storage system for a PostgreSQL database

I currently have a database which needs to store a lot of files. However, I also store information about specific files. I want to be able to store the files alongside the database which contains this metadata, and I am wondering how best to do this. The files are auto-generated and are in a hierarchical structure which is suited to file systems.
The original idea was to store the path to the root directory of the file system containing all the files, and then reference the files relative to this (e.g. ../file_group_1/file_1). In looking into this, I have found it difficult to find a way to store the files in this file system without, say, running a separate server alongside the database to manage the filesystem.
I have looked into the Large Objects type in PostgreSQL, but I'm also concerned about the security implications. Is there a better solution to this?

It is often better to store files outside the database, because access is faster and the database size stays small. The downside is that your application will have to manage consistency between the database and the file system in the face of crashes, aborted transactions, etc.
I'd store the complete path of the file with the metadata; splitting it up doesn't save a lot and will make queries more complicated.
To maintain consistency between the database and the file system, you could always write the file first and never delete files except during special garbage collection runs when you can also identify and remove orphans.
If files are stored in the database, bytea is better than large objects unless the files are very large or need to be read and written in parts.
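To make the write-file-first pattern described above concrete, here is a minimal Python sketch. It assumes psycopg2, a hypothetical file_metadata table with path and original_name columns, and a made-up storage root; none of these names come from the answer. The file hits the disk before the row is inserted, so a crash can only leave an orphaned file, never a dangling database reference, and a separate garbage-collection pass removes orphans later.

```python
import os
import uuid
import psycopg2

STORAGE_ROOT = "/var/data/files"  # hypothetical root directory for stored files

def store_file(conn, data: bytes, original_name: str) -> str:
    """Write the file first, then record its path; never the other way round."""
    path = os.path.join(STORAGE_ROOT, f"{uuid.uuid4().hex}_{original_name}")
    with open(path, "wb") as f:
        f.write(data)                       # step 1: the file is safely on disk
    with conn.cursor() as cur:              # step 2: the metadata row points at it
        cur.execute(
            "INSERT INTO file_metadata (path, original_name) VALUES (%s, %s)",
            (path, original_name),
        )
    conn.commit()
    return path

def collect_orphans(conn):
    """Garbage collection: delete files on disk that no database row references."""
    with conn.cursor() as cur:
        cur.execute("SELECT path FROM file_metadata")
        referenced = {row[0] for row in cur.fetchall()}
    for name in os.listdir(STORAGE_ROOT):
        full = os.path.join(STORAGE_ROOT, name)
        if full not in referenced:
            os.remove(full)
```

Because files are only ever deleted in the garbage-collection run, an aborted transaction at worst wastes some disk space until the next sweep.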

To organize the file system, I suggest using a folder and document hierarchy.
The document table has a reference to the entity table and a parent_doc_id column for the hierarchy logic; use a recursive CTE to retrieve the document tree as required (see the sketch below).
In the file system you can build the path from the document references.
e.g.
entity => 1001
Document 1 => 1002
Document 2 => 1003
I suggest using integer paths in the file system to avoid duplicate filename collisions.
for document 1: 1001\1002
for document 2: 1001\1003
You can store the actual file name and path in the table for reference.
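A rough sketch of that recursive CTE, assuming a hypothetical document table with id, parent_doc_id, entity_id and file_name columns, where root documents have parent_doc_id NULL; these names are illustrative, not from the answer:

```python
import psycopg2

# Recursive CTE over an assumed "document" table; '\\' in the Python source
# becomes a single backslash in the SQL, matching paths like 1001\1002.
TREE_SQL = """
WITH RECURSIVE doc_tree AS (
    SELECT id, parent_doc_id, entity_id, file_name,
           entity_id::text || '\\' || id::text AS fs_path
    FROM document
    WHERE parent_doc_id IS NULL
    UNION ALL
    SELECT d.id, d.parent_doc_id, d.entity_id, d.file_name,
           t.fs_path || '\\' || d.id::text AS fs_path
    FROM document d
    JOIN doc_tree t ON d.parent_doc_id = t.id
)
SELECT id, file_name, fs_path FROM doc_tree ORDER BY fs_path;
"""

def document_tree(conn):
    """Return every document together with the integer path used on the file system."""
    with conn.cursor() as cur:
        cur.execute(TREE_SQL)
        return cur.fetchall()   # e.g. (1002, 'report.pdf', '1001\\1002')
```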

Related

When should files be saved to the database?

Images and documents could be saved to the filesystem or to the database. For example, I can save a file to a byte array and store it in the database, or I can save it to the filesystem and store a URL. So when should files be saved to the database and when to the filesystem? Or sometimes both? What is the best method to store files?
If you are looking for a system to save files to, as the name suggests, I would look into a "filesystem". A database CAN hold files, but for most purposes this is just useless.
You can save the location (directory, name/hash, etc.) in a database, but save the file itself on a filesystem. Databases are good at a lot of things, but saving files isn't one of them.
Of course, if your documents are pure text, it might be better to save them to a database, e.g. Elasticsearch, so you can search them, but these are special cases: you're not talking about saving files, you're talking about searching, for which you incidentally find a different way to save the file.

Database of metadata files across multiple directories

Consider multiple binary files associated with one metadata file each across multiple directories:
directory1: file1.bin file1.txt
directory2: file2.bin file2.txt
The metadata files contain structured data in XML or JSON format.
Is there a database which can use these metadata files for operating on them and running queries against them?
From what I understand, document-oriented databases store their data files in one directory.
My question is related to this Stack Exchange question. Unfortunately, there is no good description of an XML-based solution there.
To get good performance on metadata-based queries, virtually any system will have to extract the metadata from the individual metadata files and store it in a more optimized form: one or more indexes of some kind. If there is associated data stored only in files, and not in the index (like your .bin files), then the index entry needs to store a path to the file so the associated data can be retrieved when needed. The path can typically include directory names, machine names, etc.; in modern systems it could be a URL.
A document-oriented database might be a perfectly good place to store the metadata index, but it isn't necessarily the best choice. If the metadata you need to query on is highly regular (always has the same fields), then some other form of index storage could have substantially better performance; but if you don't know the structure of the metadata ahead of time, a document-oriented database might be more flexible. Another approach might be to use a full-text search engine if you are trying to match words and phrases in the metadata.
So yes, such databases exist. Unfortunately, there are far too many unspecified factors to make a specific recommendation. The question isn't well suited to a generic answer: the size of the document collection, the expected transaction rate, the required storage and retrieval latency targets, and the consistency requirements could all factor into a recommendation, as would any platform preferences (Windows vs. *nix, on-premise vs. cloud, etc.).
If you want to query structured data directly in XML or JSON files there are tools for doing so, for example:
xml-grep
jq
If your metadata text files relate to interpreting the binary files, I'm not aware of any generic parser for this. One may exist, but it seems a stretch unless you are using well-defined formats.
The general approach of working with these files directly is going to be inefficient if you need to make repeated queries, as any non-database solution will have to parse the files to resolve each query. A document-oriented database also stores structured content, but its on-disk format is more efficient (and more complex) than plain text files with XML/JSON metadata that must be parsed on every query.
If you actually want to use a database and build appropriate indexes over structured content, you should import your raw data into one.
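As a sketch of the "import the raw data into a database and index it" approach: the snippet below assumes the .txt metadata files contain JSON (an XML variant would use an XML parser instead), assumes a "title" field purely for illustration, and uses SQLite only as an example index store that keeps a path back to the matching .bin file.

```python
import json
import sqlite3
from pathlib import Path

def build_index(root: str, db_path: str = "metadata_index.db"):
    """Scan all directories under root for *.txt metadata files and index
    selected fields together with the path to the matching .bin file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS meta (bin_path TEXT, title TEXT, raw TEXT)"
    )
    for meta_file in Path(root).rglob("*.txt"):
        doc = json.loads(meta_file.read_text())          # assumed JSON content
        bin_path = meta_file.with_suffix(".bin")          # file1.txt -> file1.bin
        conn.execute(
            "INSERT INTO meta VALUES (?, ?, ?)",
            (str(bin_path), doc.get("title"), json.dumps(doc)),
        )
    conn.commit()
    return conn

# Repeated queries then hit the index instead of re-parsing every file, e.g.:
# conn.execute("SELECT bin_path FROM meta WHERE title LIKE ?", ("%survey%",))
```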

Can RavenDB persist documents to a filesystem?

I need the following: I have a substantial collection of text files. I want them accessible (read-only) from RAM for speed and persisted on disk (as the actual text file) for safety and access via Isilon. I was wondering if RavenDB is capable of doing this, or am I taking the 'document' too literally?
Edit:
OR can the data inside the file (semi-structured) be stored on disk, in a single file, not required to be in its original form, but still easily readable by other programs?
In other words, can RavenDB store a/every row in a file on disc?
You don't need RavenDB for that. Just use System.IO.File and related concepts.
Raven won't work with individual files on disk. It keeps its own set of files for its indexes and data stores. Access from other programs is not allowed. What happens in RavenDB stays in RavenDB.
Most people store big files on disk, and then just store a file path or url reference to them in their RavenDB database. Raven also supports the concept of attachments, where the file is uploaded into Raven's database - but then it wouldn't be available as a single file on disk the way you are thinking.

When would you store metadata on a filesystem rather than a database?

I want to store documents with metadata in a web application such that a person can view them in a hierarchy.
I gather that a typical way to do this is to create a database entry for each document, store metadata in the database and store the files on a filesystem.
It seems much simpler and faster to store both the documents and the metadata on the filesystem. So a directory might look like this
$ ls subdirectory
.json
Subsubdirectory
bar.pdf
bar.json
foo.tex
foo.json
And then I could get the metadata from the json files (or whatever format I use). I could render subdirectory/foo.html based on the contents of subdirectory/foo.json. And I could render subdirectory.html based on the contents of subdirectory/.json and the contents of the other child json files.
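A minimal sketch of how such a layout could be read to render those pages, assuming the sidecar naming shown above (foo.json next to foo.tex, and a directory-level .json file); the structure of the returned dictionary is my own choice for illustration:

```python
import json
from pathlib import Path

def load_directory(path: str) -> dict:
    """Read the directory-level .json plus every sidecar foo.json next to foo.*."""
    directory = Path(path)
    dir_meta = directory / ".json"
    listing = {
        "metadata": json.loads(dir_meta.read_text()) if dir_meta.is_file() else {},
        "entries": [],
    }
    for item in sorted(directory.iterdir()):
        if item.suffix == ".json" or item.name == ".json":
            continue                      # sidecars are read via their document
        sidecar = item.with_suffix(".json")
        listing["entries"].append({
            "name": item.name,
            "is_dir": item.is_dir(),
            "metadata": json.loads(sidecar.read_text()) if sidecar.is_file() else {},
        })
    return listing
```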
The main disadvantage I've thought of is that it might be harder to search based on the contents of the metadata file (though I could search based on filesystem-level metadata). What other disadvantages are there? And if people do use this approach, why don't I hear about it?
EDIT: I'm not really so concerned about searching; if I build some sort of searching, it'll probably be within a single, smallish directory.
"I could search based on filesystem-level metadata" - you could do this but it means each time you do a search you have to read all the metdata files from the FS and then you have to manually process it. There's no indexing, this is roughly the equivalent of a full table scan in an SQL database (but it's even slower..).
In general storing data on the FS has some other drawbacks, you have to do replication both for durability (so you don't lose files if the disk dies), and if your site is popupar, for scalability. But since you already storing the files on the disk you have to solve this issue anyway.

What is the best way to associate a file with a piece of data?

I have an application that creates records in a table (rocket science, I know). Users want to associate files (.doc, .xls, .pdf, etc...) to a single record in the table.
Should I store the contents of the file(s) in the database? Wouldn't this bloat the database?
Should I store the file(s) on a file server, and store the path(s) in the database?
What is the best way to do this?
I think you've accurately captured the two most popular approaches to solving this problem. There are pros and cons to each:
Store the Files in the DB
Most RDBMSs have support for storing BLOBs (binary file data: .doc, .xls, etc.) in a DB, so you're not breaking new ground here.
Pros
Simplifies backup of the data: back up the DB and you have all the files.
The linkage between the metadata (the other columns ABOUT the files) and the file itself is solid and built into the DB, so it's a one-stop shop for data about your files.
Cons
Backups can quickly blossom into a HUGE nightmare as you're storing all of that binary data with your database. You could alleviate some of the headaches by keeping the files in a separate DB.
Without the DB or an interface to the DB, there's no easy way to get to the file content to modify or update it.
In general, it's harder to code and coordinate the upload and storage of data to a DB vs. the filesystem.
Store the Files on the FileSystem
This approach is pretty simple: you store the files themselves in the filesystem, and your database stores a reference to the file's location (as well as all of the metadata about the file). One helpful hint here is to standardize your naming scheme for the files on disk (don't use the file name the user gives you; create one of your own and store theirs in the DB).
Pros
Keeps your file data cleanly separated from the database.
Easy to maintain the files themselves: if you need to swap out or update a file, you do so in the file system itself. You can just as easily do it from the application as well, via a new upload.
Cons
If you're not careful, your database about the files can get out of sync with the files themselves.
Security can be an issue (again if you're careless) depending on where you store the files and whether or not that filesystem is available to the public (via the web I'm assuming here).
At the end of the day, we chose to go the filesystem route. It was easier to implement quickly, easy on the backups, and pretty secure once we locked down any holes and streamed the file out (instead of serving it directly from the filesystem). It's been operational in pretty much the same form for about 6 years in two different government applications.
J
How well you can store binaries, or BLOBs, in a database will be highly dependent on the DBMS you are using.
If you store binaries on the file system, you need to consider what happens in the case of file name collision, where you try and store two different files with the same name - and if this is a valid operation or not. So, along with the reference to where the file lives on the file system, you may also need to store the original file name.
Also, if you are storing a large amount of files, be aware of possible performance hits of storing all your files in one folder. (You didn't specify your operating system, but you might want to look at this question for NTFS, or this reference for ext3.)
We had a system that had to store several thousands of files on the file system, on a file system where we were concerned about the number of files in any one folder (it may have been FAT32, I think).
Our system would take a new file to be added, and generate an MD5 checksum for it (in hex). It would take the first two characters and make that the first folder, the next two characters and make that the second folder as a sub-folder of the first folder, and then the next two as the third folder as a sub-folder of the second folder.
That way, we ended up with a three-level set of folders, and the files were reasonably well scattered so no one folder filled up too much.
If we still had a file name collision after that, then we would just add "_n" to the file name (before the extension), where n was just an incrementing number until we got a name that didn't exist (and even then, I think we did atomic file creation, just to be sure).
Of course, then you need tools to do the occasional comparison of the database records to the file system, flagging any missing files and cleaning up any orphaned ones where the database record no longer exists.
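A sketch of that layout in Python, with the storage root as a placeholder: the first six hex characters of the MD5 checksum give a three-level folder hierarchy, collisions get the same "_n" suffix idea, and the atomic-create detail is handled with O_EXCL.

```python
import hashlib
import os

STORAGE_ROOT = "/data/files"   # placeholder root directory

def shard_and_store(data: bytes, original_name: str) -> str:
    """Place the file under <root>/ab/cd/ef/ using the first six hex chars of its MD5."""
    digest = hashlib.md5(data).hexdigest()
    folder = os.path.join(STORAGE_ROOT, digest[0:2], digest[2:4], digest[4:6])
    os.makedirs(folder, exist_ok=True)

    base, ext = os.path.splitext(original_name)
    candidate = os.path.join(folder, original_name)
    n = 1
    while True:
        try:
            # O_EXCL makes the creation atomic: it fails if the name already exists.
            fd = os.open(candidate, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            candidate = os.path.join(folder, f"{base}_{n}{ext}")
            n += 1
            continue
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return candidate   # store this path (and the original name) in the database
```

With two hex characters per level, each folder has at most 256 subfolders, so the files stay reasonably well scattered.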
You should only store files in the database if you're reasonably sure you know that the sizes of those files aren't going to get out of hand.
I use our database to store small banner images, whose size I always know in advance. Your database will store a pointer to the data inside a row and then plunk the data itself somewhere else, so it doesn't necessarily impact speed.
If there are too many unknowns though, using the filesystem is the safer route.
Use the database for data and the filesystem for files. Simply store the file path in the database.
In addition, your web server can probably serve files more efficiently than your application code can (which would otherwise have to stream the file from the DB back to the client).
Store the paths in the database. This keeps your database from bloating, and also allows you to separately back up the external files. You can also relocate them more easily; just move them to a new location and then UPDATE the database.
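As a rough sketch of that relocation step (the table and column names, file_metadata.path, are my own assumptions, not from the answer), the stored path prefixes can be rewritten in a single UPDATE after the files have been moved on disk:

```python
import psycopg2

def relocate_files(conn, old_prefix: str, new_prefix: str):
    """After moving files from old_prefix to new_prefix on disk,
    rewrite the stored paths so the database follows along."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE file_metadata "
            "SET path = %(new)s || substring(path FROM char_length(%(old)s) + 1) "
            "WHERE path LIKE %(old)s || '%%'",
            {"old": old_prefix, "new": new_prefix},
        )
    conn.commit()
```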
One additional thing to keep in mind: In order to use most of the filetypes you mentioned, you'll end up having to:
Query the database to get the file contents in a blob
Write the blob data to a disk file
Launch an application to open/edit/whatever the file you just created
Read the file back in from disk to a blob
Update the database with the new content
All that as opposed to:
Read the file path from the DB
Launch the app to open/edit/whatever the file
I prefer the second set of steps, myself.
The best solution would be to put the documents in the database. This simplifies all the linking, backup, and restore issues - but it might not satisfy the basic 'we just want to point to documents on our file server' mindset the users may have.
It all depends (in the end) on actual user requirements.
But my recommendation would be to put it all together in the database so you retain control of the files. Leaving them in the file system leaves them open to being deleted, moved, ACL'd, or any one of hundreds of other changes that could render your links to them pointless or even damaging.
Database bloat is only an issue if you haven't sized for it. Do some tests and see what effects it has. 100GB of files on a disk is probably just as big as the same files in a database.
I would try to store it all in the database. I haven't done it, but if you don't, there is a small risk that file names get out of sync with the files on disk, and then you have a big problem.
And now for the completely off-the-wall suggestion - you could consider storing the binaries as attachments in a CouchDB document database. This would avoid the file name collision issues, as you would use a generated UID as each document ID (which is what you would store in your RDBMS), and the actual attachment's file name is kept with the document.
If you are building a web-based system, then the fact that CouchDB uses REST over HTTP could also be leveraged. And there are also the replication facilities, which could prove useful.
Of course, CouchDB is still in incubation, although there are some who are already using it 'in the wild'.
I would store the files in the filesystem. But to keep the linking to the files robust, i.e. to avoid the cons of this option, I would generate some hash for each file, and then use the hash to retrieve it from the filesystem, without relying on the filenames and/or their path.
I don't know the details, but I know that this can be done, because this is the way BibDesk works (a BibTeX app for the Mac OS). It is wonderful software, used to keep track of the PDF attachments to the database of scientific literature that it manages.
