Storing files in multiple directories vs single directory

I am working on a web server where I have to store many image files of different cities.
There are two choices:
1. All files can be stored in the same directory, or
2. Files of city1 can be stored in a city1 directory, files of city2 in a city2 directory, and so on. Around 20 such directories will be created.
Does it cause any difference in the response time of the server, robustness of the file system, or any other factor? Which is the better choice?

I would argue for choice 2 because it gives a better overview and is easier to extend. Additionally, a filename that already exists in folder city1 can also be used in folder city2, whereas in choice 1 you might get filename collisions. There is no difference in response time worth mentioning. If you often read or write the same files, the server's cache will play a significant role in access time, and that is the same for either choice.
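As a minimal sketch of choice 2 (the function name `image_path` and the idea of passing a storage root are assumptions for illustration, not something from the question), each city gets its own directory, so the same filename can exist once per city without colliding:

```python
from pathlib import Path


def image_path(root: Path, city: str, filename: str) -> Path:
    """Return root/<city>/<filename>, creating the city directory if needed.

    `city` and `filename` are assumed to be trusted, already-validated names.
    """
    city_dir = root / city
    city_dir.mkdir(parents=True, exist_ok=True)
    return city_dir / filename
```

With this layout, `image_path(root, "city1", "skyline.jpg")` and `image_path(root, "city2", "skyline.jpg")` are two different files.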

Related

What is the most appropriate way to store files?

I am dealing with a problem and decided to do some research before starting development.
So the question is: what is an efficient solution for storing files?
I read about it, and some people were against storing files in the database, as it has a negative impact on backups and restores, adds processing time when reading large files from the database, etc.
A good option would be to use S3 or another cloud solution to store the files, but for this customer the cloud is not suitable.
Another option would be to store files in the file system. The concept is clear, but I want to understand what I need to consider before implementing that solution.
For example, we need to consider how to structure directories: if we store 100 000 files in one directory, it can become slow to open, etc. There is also a maximum number of files that can be stored in one directory.
Are there any third-party tools that help manage files in the file system, automatically partitioning files and placing them in directories?
I work with software that has more than 10 million files in the file system. How you structure the folders depends on your case, but what I did was:
Create a folder for each entity (Document, Image...).
Inside each entity folder, create a folder for each object, named after the object's ID, and put its files inside, but this could vary.
Example for a Person that has the ID 15:
ls /storage/Person/15/Image/
This will give me these 4 images, which in the database I linked to the person with the ID 15:
Output:
1.jpg
2.png
3.bmp
4.bmp
If you have a HUGE number of elements, you could separate each digit of an ID into a subfolder. That is, the Person with ID 16549 will have this path: /storage/Person/1/6/5/4/9/Image/
About the limits on files in a folder, I suggest you read this thread: https://stackoverflow.com/a/466596/12914069
I don't know of a third-party tool, but you could certainly build this logic into your code.
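The path logic described in this answer could be sketched roughly as follows. The function name `entity_dir` and its parameters are hypothetical; only the two example paths come from the answer itself.

```python
from pathlib import PurePosixPath


def entity_dir(root: str, entity: str, object_id: int, kind: str,
               split_digits: bool = False) -> PurePosixPath:
    """Build <root>/<entity>/<id>/<kind>, optionally one folder per ID digit."""
    if object_id < 0:
        raise ValueError("object_id must be non-negative")
    # "16549" becomes ["1", "6", "5", "4", "9"] when split_digits is set.
    id_parts = list(str(object_id)) if split_digits else [str(object_id)]
    return PurePosixPath(root, entity, *id_parts, kind)


print(entity_dir("/storage", "Person", 15, "Image"))
print(entity_dir("/storage", "Person", 16549, "Image", split_digits=True))
```

Note that with per-digit splitting, the folder for ID 1 also contains the subtrees of every ID that starts with 1, so a folder can hold both files of one object and subfolders of others.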
English isn't my first language, tell me if you didn't understand something.

Database Tables for files in filesystem

I needed to save images to my back end, and finally went with storing them in the file system instead of in the database as blobs. So now I have a different issue: I want to make my database as optimized as possible. Here are my needs and my approaches:
I have these entities:
User
Image
In my file system, I can store the images in directories named after the user ID. So basically:
16/
    asd.jpg
    blaBla.jpg
would represent the images of the user with ID 16.
Now, I know I will have a lot of directories and a lot of images, and I know that storing their paths in a database would be better than querying the file system. (Or would the OS know the locations of all the directories, making these tables unnecessary?) However, I was wondering whether I should make a table such as (userId, imagePath), connecting every image to a user ID, or (userId, directoryPath), connecting every user with the path to their directory, and then use something like Files.walk(directoryPath) to list the paths of all the images inside that directory. Which would be the better approach, or is this too opinion-based? A completely different approach or any tips would also be appreciated.
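To make the two options concrete, here is a rough sketch in Python rather than Java. The table name `images`, its columns, and both function names are assumptions for illustration, and `Path.iterdir` stands in (non-recursively) for the `Files.walk` call mentioned in the question.

```python
import sqlite3
from pathlib import Path


def create_schema(conn: sqlite3.Connection) -> None:
    """Option A: one row per image, i.e. the (userId, imagePath) table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "user_id INTEGER NOT NULL, image_path TEXT NOT NULL UNIQUE)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_images_user ON images(user_id)")


def images_from_table(conn: sqlite3.Connection, user_id: int) -> list[str]:
    """List a user's image paths with a single indexed query."""
    rows = conn.execute(
        "SELECT image_path FROM images WHERE user_id = ? ORDER BY image_path",
        (user_id,),
    )
    return [row[0] for row in rows]


def images_from_directory(storage_root: Path, user_id: int) -> list[str]:
    """Option B: derive the directory from the user ID and list it on demand."""
    user_dir = storage_root / str(user_id)
    if not user_dir.is_dir():
        return []
    return sorted(f"{user_id}/{p.name}" for p in user_dir.iterdir() if p.is_file())
```

Because the directory name is simply the user ID, option B does not strictly need a (userId, directoryPath) table at all; the table in option A mainly pays off when per-image metadata or queries across users are needed.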

File Management for Large Quantity of Files

Before I begin, I would like to express my appreciation for all of the insight I've gained on stackoverflow and everyone who contributes. I have a general question about managing large numbers of files. I'm trying to determine my options, if any. Here it goes.
Currently, I have a large number of files and I'm on Windows 7. What I've been doing is categorizing the files by copying them into folders based on what needs to be processed together. So, I have one set that contains the files by date (for long term storage) and another that contains the copies by category (for processing and calculations). Of course this doubles my data each time. Now I'm having to create more than one set of categories; 3 copies to be exact. This is quadrupling my data.
For the processing side of things, the data ends up in Excel. Originally, all the data was brought into Excel, and all organization and filtering was performed there. This was time consuming and not easily maintainable over the long term. Later the workload was shifted to the file system itself, which lightened the work in Excel.
The long and short of it is that this is an extremely inefficient use of disk space. What would be a better way of handling this?
Things that have come to mind:
1. Overlapping Folders
Is there a way to create a folder that only holds the addresses of files, rather than copies of them? This way I could have two folders reference the same file.
To my understanding, a folder is a file listing the memory addresses of the files inside of it, but on Windows a file can only be contained in one folder.
2. Microsoft SQL Server
Not sure what could be done here.
3. Symbolic Links
I'm not an administrator, so I cannot execute the mklink command.
Also, I'm uncertain about any performance issues with this.
4. A Junction
Apparently not allowed for individual files, only folders, in Windows.
5. Search folders (*.search-ms)
Maybe I'm missing something, but to my knowledge there is no way to specify individual files to be listed.
6. Hashing the files
Creating hashes for all the files would allow each file to be stored once. But then I have no idea how I would manage the hashes.
7. XML
Maybe I could use XML files to attach metadata to the files and somehow search using them.
8. Database File System
I recently came across this concept in my search. Not sure how it would apply to Windows.
I have found a partial solution. First, I discovered that the laptop I'm using is actually logged in as Administrator. As an alternative to options 3 and 4, I have decided to use hard-links, which are part of the NTFS file system. However, due to the large number of files, this is unmanageable using the following command from an elevated command prompt:
mklink /h <new\link> <existing\file>
Luckily, Hermann Schinagl has created the Link Shell Extension application for Windows Explorer, along with a very insightful write-up of how Junctions, Symbolic Links, and Hard Links work. The only reason this is currently a partial solution is a separate problem with Windows Explorer, which I intend to post as a separate question. Thank you, Hermann.
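For creating many hard links at once without typing one command per file, a small script is another possibility. This is only a sketch: the function name `link_into_category` and the folder names in the usage note are made up, and it relies on Python's `os.link`, which creates a hard link and requires the link and the original file to be on the same volume.

```python
import os
from pathlib import Path


def link_into_category(files, category_dir) -> list[Path]:
    """Hard-link each file into category_dir instead of copying it.

    The data is stored once; each link is just another name for the same file.
    """
    category_dir = Path(category_dir)
    category_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for src in map(Path, files):
        dst = category_dir / src.name
        if not dst.exists():  # skip names that are already present
            os.link(src, dst)  # src is the existing file, dst the new link
        links.append(dst)
    return links
```

A call such as `link_into_category(Path("by_date").glob("*.csv"), "category_a")` would then make every CSV in the dated folder appear in the category folder without using additional disk space for the contents.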

What is better for performance - many files in one directory, or many subdirectories each with one file?

When building web applications, we often have files associated with database entries, e.g. we have a user table and each user has an avatar field, which holds the path to the associated image.
To make sure there are no conflicts in filenames we can either:
rename files upon upload to ID.jpg; the path would then be /user-avatars/ID.jpg
or create a sub-directory for each entity, and leave the original filename intact; the path would then be /user-avatars/ID/original_filename.jpg
where ID is the user's unique ID number.
Both are perfectly valid from the application logic's point of view.
But which one would be better from a filesystem performance point of view? We have to keep in mind that the number of user entries can be very high (millions).
Is there any limit to a number of sub-directories a directory can hold?
It's going to depend on your file system, but I'm going to assume you're talking about something simple like ext3, and you're not running a distributed file system (some of which are quite good at this). In general, file systems perform poorly beyond a certain number of entries in a single directory, regardless of whether those entries are directories or files. So whether you create one directory per image or put every image in the root directory, you will run into scaling problems. If you look at this answer:
How many files in a directory is too many (on Windows and Linux)?
You'll see that ext3 limits a directory to about 32K subdirectories, far fewer than you're proposing.
Off the top of my head, I'd suggest doing some rudimentary sharding into a multilevel directory tree, something like /user-avatars/1/2/12345/original_filename.jpg. (Or something appropriate for your type of ID, but I am interpreting your question to be about numeric IDs.) Doing that will also make your life easier later when you decide you want to distribute across a storage cluster, since you can spread the directories around.
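A minimal sketch of that kind of sharding, assuming numeric IDs: the leading digits become directory levels, followed by the full ID. The function name `sharded_avatar_path` and the `levels` parameter are illustrative choices, not part of the answer.

```python
from pathlib import PurePosixPath


def sharded_avatar_path(user_id: int, filename: str, levels: int = 2,
                        root: str = "/user-avatars") -> PurePosixPath:
    """Build <root>/<d1>/<d2>/<id>/<filename> from the ID's leading digits."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # Keep only the final path component so an uploaded name such as
    # "../../x" or "/etc/passwd" cannot escape the user's directory.
    safe_name = PurePosixPath(filename).name
    if safe_name in {"", ".", ".."}:
        raise ValueError("filename has no usable final component")
    digits = str(user_id).zfill(levels)  # pad short IDs: 7 -> "07"
    shards = list(digits[:levels])
    return PurePosixPath(root, *shards, str(user_id), safe_name)


print(sharded_avatar_path(12345, "original_filename.jpg"))
```

Leading digits of sequential IDs tend to fill the shards unevenly; using the trailing digits or a prefix of a hash of the ID is a common alternative when an even spread matters.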
Millions of entries (either files or directories) in one parent directory would be hard to deal with for any filesystem. While modern filesystems use sorting and various tree algorithms to find files quickly, even navigating to the folder with Windows Explorer, Midnight Commander, or any other file manager will be slow, as the file manager has to read the whole contents of the directory. The same applies to file search. So subdirectories are preferred for this.
That said, I should note that access to a particular file is a bit faster when all files are in one directory than when they are split into subdirectories, at least on NTFS (I measured this myself several times with 400K files).
I've been having a very similar issue with HTML files rather than images, trying to store millions of them on an Ubuntu server using ext4. I ended up running my own benchmarks and found that a flat directory performs far better while being much simpler to use.
Reference: article
If you really want to use files, maybe your best bet is to partition the files off into several subdirectories so that you don't hit a limit. For example, if you have an ID 123456, you can put it in /12/34/56.jpg.
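The two-digit partitioning from this example could look roughly like the sketch below. The name `paired_path` and the fixed `width` of 6 digits are assumptions; with a fixed width, every directory holds at most 100 subdirectories or 100 files.

```python
def paired_path(file_id: int, ext: str = ".jpg", width: int = 6) -> str:
    """Map an ID to a path of two-digit segments, e.g. 123456 -> /12/34/56.jpg."""
    if file_id < 0:
        raise ValueError("file_id must be non-negative")
    digits = str(file_id).zfill(width)  # left-pad short IDs with zeros
    if len(digits) % 2:                 # keep an even number of digits
        digits = "0" + digits
    pairs = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return "/" + "/".join(pairs) + ext


print(paired_path(123456))
```

IDs longer than `width` simply produce a deeper path, so it is worth choosing a width that covers the largest ID expected.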
However, I would recommend just using the database to store this data since you are already using one. You can store the image data and ID in the same table, and you don't have to worry about some of the pesky business of dealing with files like making sure the permissions are set right, etc.

Efficiency of searching files in a directory?

I am building a website with a user authentication system that allows each user to upload images to their account. I am essentially doing this to gain experience in web development, so please forgive my ignorance on the topic.
My question involves the efficiency of placing files into a directory. Is it more efficient to create a deeper directory structure or to place all files into one folder? The former seems obvious, but does it not depend on the search algorithm implemented by the file system?
For example:
root/user/
    2012/
        A/  B/  C/  D/
    2013/
        A/  B/  C/  D/
    2014/
        A/  B/  C/  D/
Or dump all files into a single folder?
root/user/
When an image is retrieved, for example by an <img> tag, which way provides a more efficient result? I have searched Google for information on the topic, but couldn't find anything definitive or at my level of understanding.
Accessing a single file should be roughly equivalent either way. Whether to use a single directory or multiple directories really depends on how you intend to use the file listing. If you expect a user to have thousands of files and you only display a single year at a time, it may make sense to break the directory structure into multiple sections to keep file listings manageable. If you always show all the files, I suspect the single folder may be faster, since with multiple folders you would have to walk the whole tree and perform several directory listings. I would run a few tests based on what you expect your app to have to deal with. My guess is that a single directory should be fine, unless you expect large numbers of files and can break the listing down.
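A test along those lines might be sketched as below. Everything here is an assumption made for illustration: the helper names, the 100-files-per-folder split, and the use of `os.stat` as a stand-in for "accessing a file". Results depend heavily on the file system, the OS cache, and the number of files, so treat any numbers as specific to the machine they were measured on.

```python
import os
import tempfile
import time
from pathlib import Path


def build_flat(root: Path, n: int) -> list[Path]:
    """Create n empty files directly inside root."""
    paths = [root / f"{i}.jpg" for i in range(n)]
    for p in paths:
        p.touch()
    return paths


def build_nested(root: Path, n: int, per_dir: int = 100) -> list[Path]:
    """Create n empty files spread over subfolders of per_dir files each."""
    paths = []
    for i in range(n):
        folder = root / str(i // per_dir)
        folder.mkdir(exist_ok=True)
        path = folder / f"{i}.jpg"
        path.touch()
        paths.append(path)
    return paths


def time_stat(paths: list[Path]) -> float:
    """Return the seconds taken to stat every path once."""
    start = time.perf_counter()
    for p in paths:
        os.stat(p)
    return time.perf_counter() - start


def compare(n: int = 2000) -> dict[str, float]:
    """Time single-file lookups in a flat layout and in a nested layout."""
    with tempfile.TemporaryDirectory() as flat_tmp, \
            tempfile.TemporaryDirectory() as nested_tmp:
        flat = build_flat(Path(flat_tmp), n)
        nested = build_nested(Path(nested_tmp), n)
        return {"flat": time_stat(flat), "nested": time_stat(nested)}


if __name__ == "__main__":
    print(compare())
```

To get closer to a real workload, `n` should match the expected number of files, and the timed operation should match what the application actually does (listing a folder, opening a file, and so on).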
I don't know what OS you intend to run on, but I'd go with the multiple-directories approach, as some file systems (NTFS on Windows, for example) slow down horribly when dealing with 10000+ files in a single directory.
