I am building a website with a user authentication system that allows each user to upload images to their account. I am doing this mainly as a learning exercise in web development, so please forgive my ignorance on the topic.
My question involves the efficiency of placing files into a directory. Is it more efficient to create a deeper directory structure or to place all files into one folder? The former seems obvious, but does it not depend on the search algorithm implemented by the file system?
For example:
root/user/
    2012/  A/  B/  C/  D/
    2013/  A/  B/  C/  D/
    2014/  A/  B/  C/  D/
Or dump all files into a single folder?
root/user/
When an image is retrieved, for example by an <img> tag, which way provides a more efficient result? I have searched Google for information on the topic, but couldn't find anything definitive or at my level of understanding.
Accessing a single file should be roughly equivalent either way. Whether one directory or many is better really depends on how you use the file listing. If you expect a user to have thousands of files and you only display a single year at a time, it may make sense to break the directory structure into sections to keep each listing manageable. If you always show all the files, I suspect the single folder may be faster, since otherwise you would have to walk every subdirectory to build one combined listing. I would run a few tests based on what you expect your app to deal with. My guess is a single directory should be fine unless you expect large numbers of files, in which case you can break the listing down.
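If you want to run such a test for your own workload, here is a minimal timing sketch (Python used just for illustration; the paths are placeholders standing in for the two layouts above):

    import os
    import time

    def time_listing(path):
        """Count entries in one folder and time the full listing."""
        start = time.perf_counter()
        count = sum(1 for _ in os.scandir(path))
        return count, time.perf_counter() - start

    # Placeholder paths: the flat layout vs. one shard of the split layout.
    for path in ("root/user", "root/user/2014/A"):
        count, secs = time_listing(path)
        print(f"{path}: {count} entries in {secs:.4f} s")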
I don't know what OS you intend to run on, but I'd go with the multiple-directories approach: some file systems (NTFS on Windows, for example) slow down horribly when dealing with 10,000+ files in a single directory.
I am dealing with a problem, and before starting development I decided to do some research.
The problem: what is an efficient way to store files?
I read about it, and some people were against storing files in the database, as it hurts backups and restores, adds processing time when reading large files out of the database, and so on.
A good option would be to use S3 or another cloud solution to store the files, but for this particular customer, cloud storage isn't an option.
Another option is to store the files in the file system. The concept is clear, but I'm trying to work out what I need to consider before implementing that solution.
For example, we need to think about how to structure the directories: if we stored 100,000 files in one directory, it could become slow to open, and there is also a maximum number of files that can be stored in one directory.
Are there any 3rd-party tools that help manage files in a file system, and that automatically partition files and place them into directories?
I work with software that has more than 10 million files in the file system. How you structure the folders depends on your case, but what I did was:
Create a folder for each entity (Document, Image, ...).
Inside each of those, create a folder for each object, with the object's ID as the folder name, and put its files inside (this could vary).
Example for a Person with ID 15:
ls /storage/Person/15/Image/
will list the four images that in the database are linked to the person with ID 15:
Output:
1.jpg
2.png
3.bmp
4.bmp
If you have a HUGE number of elements, you could separate each digit of an ID into its own subfolder; that is, a Person with ID 16549 would have this path: /storage/Person/1/6/5/4/9/Image/
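As a rough sketch of that digit-splitting idea (Python used just for illustration; the function name is made up):

    import os

    def entity_path(storage_root, entity, obj_id, kind):
        """Build e.g. /storage/Person/1/6/5/4/9/Image by giving each digit
        of the ID its own subfolder."""
        digit_dirs = os.path.join(*str(obj_id))  # "16549" -> "1/6/5/4/9"
        return os.path.join(storage_root, entity, digit_dirs, kind)

    path = entity_path("/storage", "Person", 16549, "Image")
    print(path)                        # /storage/Person/1/6/5/4/9/Image
    os.makedirs(path, exist_ok=True)   # create the tree on first use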
About the limit on files per folder, I suggest you read this thread: https://stackoverflow.com/a/466596/12914069
About a 3rd-party tool, I don't know of one, but you could certainly build this logic into your own code.
English isn't my first language, tell me if you didn't understand something.
Not sure if this is the correct section or whether I should try SuperUser, but here goes.
I have a ton of music (around 70GB) in my library that I would like to organize, rename, etc. Is it possible to do this using some kind of script? I am not familiar with this type of thing at all, just to put things in perspective.
I have the metadata organized (artist name, album name, track no, track name, etc.) already so if I could write a script to organize my folders and files that would be great.
The answer to your question is: of course it's possible. You simply need a tool that can read the existing tags in your music files. Which tool depends on the format of the files and on which operating system you are using.
Following on from that, every OS can move files, so it's a case of reading the metadata from each file and then using the output to rename and move it.
For MP3s you can use, for example, id3 on Linux.
You might also want to look at beets - http://beets.radbox.org/
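As a minimal sketch of the tag-then-move approach (this uses the Python mutagen library, which isn't mentioned above but is a common choice for reading ID3 tags; the Artist/Album/NN - Title.mp3 layout is just one possible convention):

    import os
    import shutil
    from mutagen.easyid3 import EasyID3  # pip install mutagen

    def organize(src_dir, dest_root):
        """Move MP3s into dest_root/Artist/Album/NN - Title.mp3 based on their tags."""
        for name in os.listdir(src_dir):
            if not name.lower().endswith(".mp3"):
                continue
            path = os.path.join(src_dir, name)
            tags = EasyID3(path)
            # EasyID3 values are lists; fall back to placeholders when a tag is missing.
            artist = tags.get("artist", ["Unknown Artist"])[0]
            album = tags.get("album", ["Unknown Album"])[0]
            title = tags.get("title", [os.path.splitext(name)[0]])[0]
            track = tags.get("tracknumber", ["0"])[0].split("/")[0]
            track_no = int(track) if track.isdigit() else 0
            dest_dir = os.path.join(dest_root, artist, album)
            os.makedirs(dest_dir, exist_ok=True)
            # Real code should also sanitize tag values for filesystem-unsafe characters.
            shutil.move(path, os.path.join(dest_dir, f"{track_no:02d} - {title}.mp3"))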
Sorry if I didn't express myself precisely in the title; I'll try to explain what I meant here.
My application uses a lot of small files: DB files, XML files, fonts, etc. There is a folder- and file-presence check when the application starts, but I would like to make sure the user cannot accidentally change or delete an important file from disk.
The only thing that comes to mind is archiving the files into a few archives by usage frequency, changing the archive extension to something unfamiliar, and hiding those archives.
But compressing and uncompressing those files all the time in the application doesn't seem like an efficient solution.
Is there a standard procedure for keeping these important files safe from tampering?
The only thing that comes to mind is archiving the files into a few archives by usage frequency, changing the archive extension to something unfamiliar, and hiding those archives
That is security through obscurity, which is not a recommended practice.
Instead, use the file-security mechanisms built into your operating system. Allow access to the files only for a specific group/role or user, and ensure your application runs in that group/role or as that user.
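For example, on a POSIX system the permissions could be tightened like this (a minimal sketch; the file list is hypothetical, and on Windows you would set NTFS ACLs instead, e.g. via icacls):

    import os
    import stat

    # Hypothetical list of the application's important files.
    IMPORTANT_FILES = ["app.db", "settings.xml", "fonts/main.ttf"]

    for path in IMPORTANT_FILES:
        # Owner may read and write; group and others get no access at all.
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)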
Before I begin, I would like to express my appreciation for all of the insight I've gained on Stack Overflow and everyone who contributes. I have a general question about managing large numbers of files, and I'm trying to determine my options, if any. Here goes.
Currently, I have a large number of files and I'm on Windows 7. What I've been doing is categorizing the files by copying them into folders based on what needs to be processed together. So, I have one set that contains the files by date (for long-term storage) and another that contains copies by category (for processing and calculations). Of course this doubles my data each time, and now I'm having to create more than one set of categories: three copies, to be exact. This is quadrupling my data.
For the processing side of things, the data ends up in Excel. Originally, all the data was brought into Excel, and all organization and filtering was performed there. This was time-consuming and not easily maintainable over the long term. Later the workload was shifted to the file system itself, which lightened the work in Excel.
The long and short of it is that this is an extremely inefficient use of disk space. What would be a better way of handling this?
Things that have come to mind:
Overlapping Folders
Is there a way to create a folder that only holds the addresses of files rather than copies of them? That way two folders could reference the same file.
To my understanding, a folder is a file listing the locations of the files inside it, but on Windows a file can only be contained in one folder.
Microsoft SQL Server
Not sure what could be done here.
Symbolic Links
I'm not an administrator, so I cannot execute the mklink command.
Also, I'm uncertain about any performance issues with this.
A Junction
Apparently not allowed for individual files, only folders, in Windows.
Search folders (*.search-ms)
Maybe I'm missing something, but to my knowledge there is no way to specify individual files to be listed.
Hashing the files
Computing hashes for all the files would allow each file to be stored only once, but then I have no idea how I would manage the hashes.
XML
Maybe I could use XML files to attach metadata to the files and somehow search using them.
Database File System
I recently came across this concept in my search. Not sure how it would apply to Windows.
I have found a partial solution. First, I discovered that the laptop I'm using is actually logged in as Administrator. As an alternative to options 3 and 4, I have decided to use hard links, which are part of the NTFS file system. However, due to the large number of files, it is unmanageable to create them one at a time with the following command from an elevated command prompt:
mklink /h <link> <target>
Luckily, Hermann Schinagl has created the Link Shell Extension application for Windows Explorer, along with a very insightful explanation of how junctions, symbolic links, and hard links work. The only reason this is currently a partial solution is a separate problem with Windows Explorer, which I intend to post as a separate question. Thank you, Hermann.
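For what it's worth, the per-file mklink step can also be scripted. A rough sketch in Python (os.link creates NTFS hard links on Windows; the function name and all paths below are hypothetical):

    import os

    def link_into_category(date_dir, category_dir, filenames):
        """Create hard links in category_dir pointing at files stored once in date_dir."""
        os.makedirs(category_dir, exist_ok=True)
        for name in filenames:
            target = os.path.join(date_dir, name)
            link = os.path.join(category_dir, name)
            if not os.path.exists(link):
                # Equivalent to mklink /h; both paths must be on the same NTFS volume.
                os.link(target, link)

    # Hypothetical usage: one stored copy, referenced from a category folder.
    link_into_category(r"D:\data\2014-01-15",
                       r"D:\data\by-category\reports",
                       ["results.xlsx", "summary.csv"])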
While building web applications, we often have files associated with database entries, e.g. a user table where each user has an avatar field holding the path to the associated image.
To make sure there are no conflicts in filenames we can either:
rename files upon upload to ID.jpg; the path would then be /user-avatars/ID.jpg
or create a sub-directory for each entity and leave the original filename intact; the path would then be /user-avatars/ID/original_filename.jpg
where ID is the user's unique ID number.
Both are perfectly valid from the application logic's point of view.
But which one would be better from a filesystem-performance point of view? We have to keep in mind that the number of entries can be very high (millions).
Is there any limit to the number of subdirectories a directory can hold?
It's going to depend on your file system, but I'm going to assume you're talking about something simple like ext3 and that you're not running a distributed file system (some of which are quite good at this). In general, file systems perform poorly past a certain number of entries in a single directory, regardless of whether those entries are directories or files. So whether you create one directory per image or put every image directly in one directory, you will run into scaling problems. If you look at this answer:
How many files in a directory is too many (on Windows and Linux)?
You'll see that ext3 runs into limits at about 32K entries in a directory, far fewer than you're proposing.
Off the top of my head, I'd suggest doing some rudimentary sharding into a multilevel directory tree, something like /user-avatars/1/2/12345/original_filename.jpg. (Or something appropriate for your type of ID, but I am interpreting your question to be about numeric IDs.) Doing that will also make your life easier later when you decide you want to distribute across a storage cluster, since you can spread the directories around.
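A quick sketch of that sharding layout (Python used just for illustration; assumes numeric IDs, as above):

    import os

    def avatar_path(root, user_id, filename):
        """Shard on the first two digits: /user-avatars/1/2/12345/original_filename.jpg"""
        id_str = str(user_id).zfill(2)  # pad so there are always two digits to shard on
        return os.path.join(root, id_str[0], id_str[1], str(user_id), filename)

    print(avatar_path("/user-avatars", 12345, "original_filename.jpg"))
    # -> /user-avatars/1/2/12345/original_filename.jpg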
Millions of entries (either files or directories) in one parent directory are hard to deal with for any filesystem. While modern filesystems use sorting and various tree algorithms to find the needed files quickly, even navigating to the folder in Windows Explorer, Midnight Commander, or any other file manager becomes painful, since the file manager has to read the entire contents of the directory. The same applies to file search. So subdirectories are preferred here.
Yet I should note that access to a particular file can be a bit faster when all files are in one directory than when they are separated into subdirectories, at least on NTFS (I measured this myself several times with 400K files).
I've been having a very similar issue with HTML files rather than images, trying to store millions of them on an Ubuntu server on ext4. I ended up running my own benchmarks and found that a flat directory performs far better while being far simpler to use:
Reference: article
If you really want to use files, maybe your best bet is to partition the files into several subdirectories so that you don't hit a limit. For example, with ID 123456 you could put the file at /12/34/56.jpg.
However, I would recommend just using the database to store this data, since you are already using one. You can store the image data and the ID in the same table, and you don't have to worry about the pesky business of dealing with files, like making sure the permissions are set correctly.
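As a minimal sketch of the database approach (SQLite here purely for illustration; the table and column names are assumptions):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS avatars (
                        user_id INTEGER PRIMARY KEY,
                        image   BLOB NOT NULL)""")

    # Store the raw bytes of an uploaded image next to the user's ID.
    with open("avatar.jpg", "rb") as f:
        conn.execute("INSERT OR REPLACE INTO avatars (user_id, image) VALUES (?, ?)",
                     (12345, f.read()))
    conn.commit()

    # Retrieving it later is a plain SELECT.
    image_bytes = conn.execute("SELECT image FROM avatars WHERE user_id = ?",
                               (12345,)).fetchone()[0]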