Basic file storage

In order to prevent file storage problems, such as two people uploading files that happen to have the same file name:
Is it better to give each user a separate folder, or is it better to keep all files in one folder for all users but change the file names to keep them unique?

It depends on what you are trying to achieve.
What kind of service do you want to provide? A general file storage service? Then use different folders, since the number of files in a directory may be limited (depending on the file system) and can have a major influence on performance.
Do you provide an upload area for a simple blog? Use a single directory and change the file names.
Sorry, an absolute answer can only be given if you provide more information.
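If you go the single-directory route, here is a minimal sketch of generating collision-free names on upload (the form field name and target directory are assumptions):

    $ext  = pathinfo($_FILES['upload']['name'], PATHINFO_EXTENSION);
    $name = bin2hex(random_bytes(16)) . ($ext !== '' ? '.' . $ext : ''); // random, effectively collision-free name
    move_uploaded_file($_FILES['upload']['tmp_name'], '/var/www/uploads/' . $name);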

Related

Is there a standard for protecting application files from user interference outside of application

Sorry if I didn't express myself precisely in the title; I'll try to explain what I meant to say here.
My application uses a lot of small files like DB files, XML files, fonts, etc. There is a folder and file presence check when the application starts, but I would like to make sure that the user cannot accidentally change or delete an important file from disk.
The only thing that comes to mind is archiving the files in a few archives by usage frequency, changing the archive extension to something unfamiliar, and hiding those archives.
But compressing and uncompressing those files all the time in the application doesn't seem like an efficient solution.
Is there some standard procedure for protecting these important files from tampering?
The only thing that comes to mind is archiving the files in a few archives by usage frequency, changing the archive extension to something unfamiliar and hiding those archives
That is security through obscurity, which is not a recommended practice.
Instead, use the file security mechanisms built into your operating system. Allow appropriate file access only to a specific group/role or user, and ensure your application runs in that group/role or as that user.
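As a minimal sketch of that approach on a POSIX system (the data directory and the appuser account are assumptions, and chown() requires the script to run with sufficient privileges):

    $dataDir = '/opt/myapp/data';
    chown($dataDir, 'appuser');  // only the application's user owns the data
    chmod($dataDir, 0700);       // owner may read/write/traverse; everyone else is locked out
    foreach (glob($dataDir . '/*') as $file) {
        chown($file, 'appuser');
        chmod($file, 0600);      // owner read/write only
    }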

Store file reference in LDAP

In the initial phase, the developer started with storing a small file in an LDAP attribute. Later, as file sizes grew, this became a problem. Now I am planning to change it to storing the file content on disk and the file path in an attribute. My doubt: is it possible for the OpenLDAP server to automatically serve the file content when the client reads that attribute?
I saw reference attributes like labeledURI. Is there any specific attribute to handle this situation?
Nope, it's not possible, and it would be a bad idea anyway.
An LDAP directory should never be treated as a file store, as it is designed to host many but small objects. To be performant, requests should be as short as possible.
A NAS would be better suited to host those files.
You'll have to modify your code to access those files based on a filename stored in the directory.
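A minimal sketch of that client-side approach, assuming PHP's LDAP extension and a hypothetical fileLocation attribute that holds an absolute path (the DN, search base, and filter are placeholders):

    $conn = ldap_connect('ldap://localhost');
    ldap_set_option($conn, LDAP_OPT_PROTOCOL_VERSION, 3);
    ldap_bind($conn, 'cn=admin,dc=example,dc=com', 'secret');
    $result  = ldap_search($conn, 'ou=files,dc=example,dc=com', '(cn=report)', ['fileLocation']);
    $entries = ldap_get_entries($conn, $result);
    $path    = $entries[0]['filelocation'][0]; // attribute keys come back lowercased
    $content = file_get_contents($path);       // the application reads the file; the LDAP server never does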
Certainly storing the URI to a file (or any other Resource) is possible and often done.
Serving a file depends on the LDAP server implementation and the size of the file. Certificates and photos are often stored in LDAP.
eDirectory, as an example, streams data over a certain size to a file in the DIB store. However, the LDAP protocol is not very efficient at streaming large blocks of data.

Server backend: how to generate file paths for uploaded files?

I am trying to create a site where users can upload images, videos and other types of files.
I did some research, and people seem to suggest that saving the files as BLOBs in the database is a bad idea; instead, save the file paths in the database.
My questions are, if I save the file paths in a database:
1. How do I generate the file names?
I thought about computing the MD5 hash of the file name, but what if two files have the same name? Adding the username, a time-stamp, etc. to the file name? Does that even make sense?
2. What is the best directory structure?
If a user uploads images on 12/17/2013 and 12/18/2013, can I just put them in user_ABC/images/, then create time-stamped sub-directories 20131217, 20131218, etc.? What is the best structure for all this?
3. How do all these come together?
It seems like maintaining this system is such a pain, because the file system manipulation scripts are tightly coupled with the database operations (I may also need to worry about database transactions: say in one transaction I updated the database but failed to modify the file system, so I need to roll back the database?).
And I think this system doesn't scale (what if my machine runs out of hard disk space, so I need to move the files to a second machine? What if my contents are on a cluster?).
I think my real question is:
4. Is there any existing framework/design pattern/db that handles this problem?
What is the standard way of handling this kind of problem?
Thanks in advance for your answers.
I actually asked this same question when I was designing a social website for food chefs. I decided to store the URL of the image in a MySQL database along with the recipe. If you plan on storing multiple images for one recipe, as in my example, maybe a comma-separated value would work. When the recipe loaded on the page, I would fetch the image associated with that recipe onto the screen.
Since it was a hackathon and wasn't meant for production purposes, I didn't encode the file name into something unique. However, if I were developing for production purposes, I would append a time-stamp to the media file name when storing it on the server and in the database/backend.
I believe what I've proposed is the best way of handling this scenario. Storing the image on the server is not only faster, but it should also take less space: I have found that when converting a standard JPG file of reasonable resolution to base64 encoding, the encoded text representation took 30% more space. There is also the time spent encoding and decoding the file when using some BLOB type of data format instead of storing the file directly on the server.
Using some sort of backend server scripting like PHP, you'll be able to do some pretty neat stuff with the information you have available: fetch the result from the database, and load it into the page using HTML.
As far as I know, there isn't a standard way of fetching media from a database yet. Perhaps there will be one day.
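A minimal sketch of the time-stamp idea from the answer above (the form field name and upload directory are assumptions):

    $original = basename($_FILES['photo']['name']);
    $unique   = time() . '_' . $original; // e.g. 1387324800_dish.jpg
    move_uploaded_file($_FILES['photo']['tmp_name'], '/var/www/uploads/' . $unique);
    // store '/uploads/' . $unique in the recipes table alongside the recipe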
There is no standard way to do this; it differs from application to application. The idea is that you need to generate a different path and file name for every upload. Here is one way, in PHP ($userId is assumed to come from your application):

    $hashId   = sha1(microtime() . random_int(1, 1000000)); // practically unique per upload
    $path     = '/' . $userId . '/' . substr($hashId, 0, 2) . '/' . substr($hashId, -2); // shard by hash prefix and suffix
    $fileName = $hashId;
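Since sha1() output is hexadecimal, each two-character level spreads uploads across up to 256 subdirectories, which keeps any single directory small.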

What is better for performance - many files in one directory, or many subdirectories each with one file?

While building web applications, we often have files associated with database entries, e.g. we have a user table and each user has an avatar field, which holds the path to the associated image.
To make sure there are no conflicts in filenames we can either:
rename files upon upload to ID.jpg; the path would be then /user-avatars/ID.jpg
or create a sub-directory for each entity, and leave the original filename intact; the path would be then /user-avatars/ID/original_filename.jpg
where ID is the user's unique ID number
Both perfectly valid from application logic's point of view.
But which one would be better from the filesystem performance point of view? We have to keep in mind that the number of entries can be very high (millions).
Is there any limit to a number of sub-directories a directory can hold?
It's going to depend on your file system, but I'm going to assume you're talking about something simple like ext3, and that you're not running a distributed file system (some of which are quite good at this). In general, file systems perform poorly over a certain number of entries in a single directory, regardless of whether those entries are directories or files. So no matter whether you're creating one directory per image or one image in the root directory, you will run into scaling problems. If you look at this answer:
How many files in a directory is too many (on Windows and Linux)?
You'll see that ext3 runs into limits at about 32K entries in a directory, far fewer than you're proposing.
Off the top of my head, I'd suggest doing some rudimentary sharding into a multilevel directory tree, something like /user-avatars/1/2/12345/original_filename.jpg. (Or something appropriate for your type of ID, but I am interpreting your question to be about numeric IDs.) Doing that will also make your life easier later when you decide you want to distribute across a storage cluster, since you can spread the directories around.
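A minimal sketch of that sharding scheme, assuming numeric IDs (the avatarPath helper is hypothetical):

    // Shard /user-avatars by the first two digits of the ID.
    function avatarPath(int $id, string $filename): string {
        $digits = str_pad((string) $id, 2, '0', STR_PAD_LEFT); // guarantee at least two digits
        return sprintf('/user-avatars/%s/%s/%d/%s', $digits[0], $digits[1], $id, $filename);
    }
    echo avatarPath(12345, 'original_filename.jpg');
    // => /user-avatars/1/2/12345/original_filename.jpg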
Millions of entries (either files or directories) in one parent directory would be hard to deal with for any filesystem. While modern filesystems use sorting and various tree algorithms to find the needed files quickly, even navigating to the folder with Windows Explorer, Midnight Commander, or any other file manager will be complicated, as the file manager has to read the whole contents of the directory. The same applies to file search. So subdirectories are preferred for this.
Yet I should note that access to a particular file would be a bit faster when all files are in one directory than when they are separated into subdirectories, at least on NTFS (I measured this myself several times with 400K files).
I've been having a very similar issue with HTML files, not images, trying to store millions of them on an Ubuntu server in ext4. I ended up running my own benchmarks, and found that a flat directory performs way better while being way simpler to use:
Reference: article
If you really want to use files, maybe your best bet is to partition the files off into several subdirectories so that you don't hit a limit. For example, if you have an ID 123456, you can put it in /12/34/56.jpg.
However, I would recommend just using the database to store this data since you are already using one. You can store the image data and ID in the same table, and you don't have to worry about some of the pesky business of dealing with files like making sure the permissions are set right, etc.
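A minimal sketch of that database route, assuming PDO against MySQL and a hypothetical avatars(id INT PRIMARY KEY, data MEDIUMBLOB) table:

    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'dbuser', 'dbpass');
    $stmt = $pdo->prepare('REPLACE INTO avatars (id, data) VALUES (?, ?)');
    $stmt->bindValue(1, $userId, PDO::PARAM_INT); // $userId from your session/auth layer
    $stmt->bindValue(2, file_get_contents($_FILES['avatar']['tmp_name']), PDO::PARAM_LOB); // raw bytes, no base64 needed
    $stmt->execute();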

Efficiency of searching files in a directory?

I am building a website with a user authentication system allowing each user to upload images to their account. Essentially I am doing this as an experience in web development, so please forgive my ignorance on the topic.
My question involves the efficiency of placing files into a directory. Is it more efficient to create a deeper directory structure or to place all files into one folder? The former seems obvious, but does it not depend on the search algorithm implemented by the file system?
For example:
root/user/2012/A/ B/ C/ D/
root/user/2013/A/ B/ C/ D/
root/user/2014/A/ B/ C/ D/
Or dump all files into a single folder?
root/user/
When an image is retrieved, for example by an <img> tag, which way provides a more efficient result? I have searched Google for information on the topic, but couldn't find anything definitive or at my level of understanding.
Accessing a single file should be roughly equivalent either way. The single-directory versus multiple-directories choice really depends on how you use the file listing. If you expect a user to have thousands of files and you only display a single year at a time, it may make sense to break the directory structure into multiple sections to keep file listings manageable. If you always show all the files, the single folder may be faster, since with subdirectories you would have to run through multiple file listings to build the full view. I would do a few tests based on what you expect your app to deal with; my guess is that a single directory should be fine, unless you expect large numbers of files and can break the listing down.
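A rough sketch of the kind of test suggested above (the paths are hypothetical; populate them with sample files first):

    $t = microtime(true);
    $flat = scandir('/var/www/uploads/flat');   // everything in one folder
    printf("flat:    %d entries in %.4fs\n", count($flat), microtime(true) - $t);

    $t = microtime(true);
    $year = scandir('/var/www/uploads/2014/A'); // one subdirectory at a time
    printf("sharded: %d entries in %.4fs\n", count($year), microtime(true) - $t);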
I don't know what OS you intend to run on, but I'd go with the multiple-directories approach, as some filesystems (NTFS on Windows, for example) slow down horribly when dealing with 10,000+ files in a single directory.
