Database Tables for files in filesystem

Database Tables for files in filesystem - database

i needed to save images to my back-end, and finally went with storing them in the file system instead of in the database as blobs. So now i have a different issue, i want to make my database as optimized as possible. Here are my needs, and my approaches:
I have these entities:
User
Image
In my file system, i can store the images in directories named after the user id. So basically:
16
asd.jpg
blaBla.jpg
Would represent the images about the user with id 16.
Now, i know i will have a lot of directories and a lot of images, and i know that storing their paths in a database would be better than querying the file system. (or would the OS know the locations of all the directories, making these tables not needed?)However i was wondering should i make a table such as (userId,imagePath), connecting every image to a userid, or (userId,directoryPath), connecting every-user with the path to his directory, then use something like Files.walk(directoryPath) to list all of the paths of the images inside that directory. What would be a better approach, or is this way to opinion-based ? A completely different approach or any tips would also be appreciated.

Related

Image path in database vs intelligent folder structure

Assuming I want to safe one profile picture for each user of my system.
Is it better to save the path to this image in my database or to rely on a intelligent folder structure like
/images/users/user1.png
and access the image directly?
What if I have more then one picture per user? Would this be a good practice?
/images/user1/pic1.png
So basically my question is, why would you save the picture's path and waste space when you already know where the pictures are without any db queries?
This is just a general question apart from any technologies.

It's always better to assume that each user will have more than one picture, even if you're absolutely sure of otherwise now.. Two ways to go about it:
Each user will have a folder for all their pictures. /path/user1/img1.jpg , /path/user1/im2.gif
All the pictures will be in the same folder, but the username will be a prefix of the filename /path/user_pics/user1_img1.jpg , /path/user_pics/user2_img3.gif
Personally, I prefer the former.
In most cases, you won't need to store the full path of the picture in a database.. Storing the image content-type (which determines its extension) and the filename will generally be enough. That gives you more freedom as to which machine is serving the image itself.

Server backend: how to generate file paths for uploaded files?

I am trying to create a site where users can upload images, videos and other types of files.
I did some research and people seem to suggest that saving the files as BLOB in database is a Bad idea; instead, save the file paths in database.
My questions are, if I save the file paths in a database:
1. How do I generate the file names?
I thought about computing the MD5 value of the file name, but what if two files have the same name? Adding the username and time-stamp etc. to file name? Does it even make sense?
2. What is the best directory structure?
If a user uploads images at 12/17/2013, 12/18/2018, can I just put it in user_ABC/images/, then create time-stamped sub-directories 20131217, 20131218 etc. ? What is the best structure for all these stuff?
3. How do all these come together?
It seems like maintaining this system is such a pain, because the file system manipulation scripts are tightly coupled with the database operations(may also need the worry about database transactions? Say in one transaction I updated the database but failed to modify the file system so I need to roll back my database?).
And I think this system doesn't scale (what if my machine runs out of hard disk so I need to upload the files to a second machine? What if my contents are on a cluster?)
I think my real question is:
4. Is there any existing framework/design pattern/db that handles this problem?
What is the standard way of handling this kind of problems?
Thanks in advance for your answers.

I've actually asked this same question when I was designing a social website for food chefs. I decided to store the url of the image in a MySQL database along with recipe. If you plan on storing multiple images for one recipe, in my example, maybe having a comma separated value would work. When the recipe loaded on the page, I would fetch the image associated with that recipe onto the screen.
Since it was a hackathon and wasn't meant for production purposes, I didn't encode the file name into something unique. However, if I were developing for productional purposes, I would append the time-stamp to the media file name when storing it into the server and database/backend.
I believe what I've proposed is the best data structure of handling this scenario. Storing the image onto the server is not only faster, but it should also take less space. I have found that when converting a standard jpg file of reasonable resolution to base64 encoding, the encoded text file representation took 30% more space. There is also the time of encoding the file and decoding the file for storage and resolving when using some BLOB type of data format instead of straight up storing the file on the server.
Using some sort of backend server scripting like PHP, you'll be able to do some pretty neat stuff with the information you have available. Fetch the result from the database, and load it in from the page using HTML.
As far as I know, there isn't a standard way of fetching media from a database yet. Perhaps there will be one day.

There is not standard way to do that, it is different to the different application. The idea is you need generate a different Path+FileName for every upload, here is a way:
HashId = sha1(microsecond + random(1,1000000));
Path = /[user_id]/[HashId{0,2}]/[HashId{-2}];
FileName = HashId

What is better for performance - many files in one directory, or many subdirectories each with one file?

While building web applications often we have files associated with database entries, eg: we have a user table and each category has a avatar field, which holds the path to associated image.
To make sure there are no conflicts in filenames we can either:
rename files upon upload to ID.jpg; the path would be then /user-avatars/ID.jpg
or create a sub-directory for each entity, and leave the original filename intact; the path would be then /user-avatars/ID/original_filename.jpg
where ID is users's unique ID number
Both perfectly valid from application logic's point of view.
But which one would be better from filesystem performance point of view? We have to keep in mind that the number of category entries can be very high (milions).
Is there any limit to a number of sub-directories a directory can hold?

It's going to depend on your file system, but I'm going to assume you're talking about something simple like ext3, and you're not running a distributed file system (some of which are quite good at this). In general, file systems perform poorly over a certain number of entries in a single directory, regardless of whether those entries are directories or files. So no matter whether if you're creating one directory per image or one image in the root directory, you will run into scaling problems. If you look at this answer:
How many files in a directory is too many (on Windows and Linux)?
You'll see that ext3 runs into limits at about 32K entries in a directory, far fewer than you're proposing.
Off the top of my head, I'd suggest doing some rudimentary sharding into a multilevel directory tree, something like /user-avatars/1/2/12345/original_filename.jpg. (Or something appropriate for your type of ID, but I am interpreting your question to be about numeric IDs.) Doing that will also make your life easier later when you decide you want to distribute across a storage cluster, since you can spread the directories around.

Millions of entries (either files or directories) in one parent directory would be hard to deal with for any filesystem. While modern filesystems use sorting and various tree algorithms for quick search for the needed files, even navigating to the folder with Windows Explorer or Midnight Commander or any other file manager will be complicated as the file manager would have to read contents of the directory. The same applies to file search. So subdirectories are preferred for this.
Yet I need to notice that access to particular file would be a bit faster when all files are in one directory than when they are separated into subdirectories at least on NTFS (measured this myself several times with 400K files).

I've been having a very similar issue with html files not images. Trying to store millions of them in a Ubuntu server in ext4. Ended running my own benchmarks. Found out that flat directory performs way better while being way simpler to use:
Reference: article

If you really want to use files, maybe your best bet is to partition the files off into several subdirectories so that you don't hit a limit. For example, if you have an ID 123456, you can put it in /12/34/56.jpg.
However, I would recommend just using the database to store this data since you are already using one. You can store the image data and ID in the same table, and you don't have to worry about some of the pesky business of dealing with files like making sure the permissions are set right, etc.

Should data files be stored on the same computer (server) the database is stored in?

Currently in our research group, we have many "data files" stored on three servers and a couple of personal computers running different operating systems.
We want to build a database, which would store some information in addition to the URLs of those various "data files". My question is, do we have to copy all the data files and put them in a directory in the same server the database is in? Or can they be left as they are on the different computers? If the second case is ok, what would be the format of the url of the "data files"?

It really depends on what your intended goal is and what your current setup is like
If the files are currently sitting somewhere on the network, and you need a path that the application can use to access them, you just need to store the network path (\\server\share\file for Windows environments) in the database, then read it and access that path to access the files. You'll need to make sure everyone has read access to them.
If the files are currently accessible through a website URL, internal or external, then again, you just need to store that URL (or some portion thereof) (http://mywebsite.com/myfile or http://servername/myfile) and access that.
If either of the above are not currently true, but you want them to be, then you'll need to set up a new share/webserver and put the files there. There's no requirement that this be the same server as the database, but it'd make for better backups if it was.
If you want the files themselves to be in the database, you should check out Bob Fanger's link.

Not sure what you're asking here but...
If you want your database engine to read files filled with data, it probably doesn't matter where they are stored - though this may depend on the database you are using. Are you using MySQL? MS-SQL Server? Oracle?
Many database vendors provide relatively easy-to-use admin tools that would let you choose a file to be loaded, and usually the file chooser dialoge lets you browse networks so you could load a file over the network. Details on how to do this vary so consult the manual for your database engine for loading data from a pre-existing file.
Be aware that if the database is on Computer A and the data is being loaded from Computer B over the network, it will probably be slower than if the data was on the same computer as the database.

It doesn't really matter if the files are stored outside the database anyway.
See Storing Images in DB - Yea or Nay? for more thoughts on that one.
If the files accessible by an url, you can store that with the meta data, like
http://server1/folder/file.ext, file://\server1\folder\file.ext or "file://P:\folder\file.ext"
Things to consider:
Backups
Performance
Synchronisation between the meta-data and the data

What is a database file system?

I have a very little idea about what database file system is.
Can somebody out here explain to me what actually a database file system is, and what its applications are?
How is it different from a conventional file system?
How I can build it?

Typical file systems (*nix, ms-dos, etc) organize files hierarchically. For example,
c:\ represents the top of a hierarchy
c:\foo is the next level in the hierarchy
c:\foo\bar is a sub-node of \foo
etc..
Each file exists in one and only one location in this hierarchy.
By contrast, a database file system organizes files by metadata attributes. For example, topic, type, author, etc.. Rather than existing in one particular place in a hierarchy, the file exists in multiple "places" depending on its attributes.
The last question you ask is unanswerable.

Found some good links
DBFS (This one is really good)
Towards A Single Folder Filesystem

It's a file system where files have significant amounts of metadata. For example, the iTunes library might count as a database file system; not only do you have files on disk and know where they are, but you have tags (genres) and other metadata like author (artist).

It's a file system that stores files as blobs in a database, rather than in a hierarchy of directories. Imagine a web-site with no "directory-like" hierarchy in the URL - just loads of tags and categories and a big "search" field - something like that, only on your hard-drive.
Pros & cons? Ask yourself, how many database filesystems have I ever seen? Do you need to ask more?