Before I begin, I would like to express my appreciation for all of the insight I've gained on Stack Overflow and for everyone who contributes. I have a general question about managing large numbers of files. I'm trying to determine my options, if any. Here goes.
Currently, I have a large number of files and I'm on Windows 7. What I've been doing is categorizing the files by copying them into folders based on what needs to be processed together. So I have one set that contains the files by date (for long-term storage) and another that contains copies by category (for processing and calculations). Of course, this doubles my data each time. Now I'm having to create more than one set of categories (three copies, to be exact), which quadruples my data.
For the processing side of things, the data ends up in Excel. Originally, all the data was brought into Excel, and all organization and filtering was performed there. This was time consuming and not easily maintainable over the long term. Later the workload was shifted to the file system itself, which lightened the work in Excel.
The long and short of it is that this is an extremely inefficient use of disk space. What would be a better way of handling this?
Things that have come to mind:
Overlapping Folders
Is there a way to create a folder that holds only references to files, rather than copies of the files? That way two folders could reference the same file.
To my understanding, a folder is essentially a list of entries that point to the files inside it, but on Windows a file normally appears in only one folder.
Microsoft SQL Server
Not sure what could be done here.
Symbolic Links
I'm not an administrator, so I cannot execute the mklink command.
Also, I'm uncertain about any performance issues with this.
A Junction
Apparently these are only allowed for folders, not individual files, on Windows.
Search folders (*.search-ms)
Maybe I'm missing something, but to my knowledge there is no way to specify individual files to be listed.
Hashing the files
Creating hashes of all the files would allow each file to be stored only once. But then I have no idea how I would manage the hashes (see the sketch after this list).
XML
Maybe I could use XML files to attach metadata to the files and somehow search using them.
Database File System
I recently came across this concept in my search. Not sure how it would apply to Windows.
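To sketch what the hashing idea could look like in practice: each file is stored exactly once under its content hash, and each "category" becomes a small index file mapping original names to hashes. This is only a minimal illustration; the folder names, the CSV index, and the function names below are my own inventions, not an existing tool.

# Minimal sketch of content-addressed storage: one physical copy per unique file,
# plus one CSV index per category. All paths and names here are hypothetical.
import csv
import hashlib
import os
import shutil

STORE = r"D:\store"            # single copy of every file lives here
INDEX_DIR = r"D:\categories"   # one CSV per category, mapping filename -> hash

def file_hash(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def add_to_category(path, category):
    digest = file_hash(path)
    os.makedirs(STORE, exist_ok=True)
    os.makedirs(INDEX_DIR, exist_ok=True)
    stored = os.path.join(STORE, digest + os.path.splitext(path)[1])
    if not os.path.exists(stored):   # store the actual bytes only once
        shutil.copy2(path, stored)
    with open(os.path.join(INDEX_DIR, category + ".csv"), "a", newline="") as f:
        csv.writer(f).writerow([os.path.basename(path), digest])

Looking a file up then means reading the category's CSV and opening the stored copy named by its hash; in practice a small database would manage that index more comfortably than loose CSV files.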
I have found a partial solution. First, I discovered that the laptop I'm using is actually logged in as Administrator. As an alternative to options 3 and 4, I have decided to use hard links, which are part of the NTFS file system. However, due to the large number of files, creating them one at a time with the following command from an elevated command prompt is unmanageable:
mklink /h <source\file> <target\file>
Luckily, Hermann Schinagl has created the Link Shell Extension application for Windows Explorer, along with a very insightful write-up on how Junctions, Symbolic Links, and Hard Links work. The only reason this is currently a partial solution is a separate problem with Windows Explorer, which I intend to post as a separate question. Thank you, Hermann.
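For completeness, bulk creation of hard links can also be scripted. Below is a rough sketch using Python's os.link, which creates NTFS hard links just like mklink /h; the folder paths are only examples, and both trees have to live on the same NTFS volume.

# Rough sketch: mirror a "by date" tree into a "by category" tree using hard links
# instead of copies. Example paths; source and target must be on the same volume.
import os

SOURCE = r"D:\by-date\2014-05"
TARGET = r"D:\by-category\reports"

for root, dirs, files in os.walk(SOURCE):
    rel = os.path.relpath(root, SOURCE)
    dest_dir = os.path.join(TARGET, rel)
    os.makedirs(dest_dir, exist_ok=True)
    for name in files:
        src = os.path.join(root, name)
        dst = os.path.join(dest_dir, name)
        if not os.path.exists(dst):
            os.link(src, dst)   # same data on disk, a second directory entry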
Related
I am dealing with a problem, and before starting development I decided to do some research.
So the problem is: what would be an efficient solution for storing files?
I read about it, and some people were against storing files within the database, as it would have a negative impact on backups/restores, add processing time when reading large files from the database, and so on.
A good option would be to use S3 or another cloud solution to store the files, but for this particular customer the cloud isn't suitable.
Another option would be to store the files in the file system. The concept is clear, but I'm trying to work out what I need to understand before implementing that solution.
For example, we need to consider how to structure the directories: if we stored 100,000 files in one directory it could become slow to open, and there is also a maximum number of files that can be stored in one directory.
Are there any 3rd-party tools that help manage files in the file system, i.e. that automatically partition files and place them in directories?
I work with software that has more than 10 million files in the file system. How you structure the folders depends on your needs, but what I did was:
Create a folder for each entity (Document, Image, ...)
Inside each folder, create a folder for each object, with the object's ID as the folder name, and put its files inside (this could vary).
Example for a Person with ID 15:
ls /storage/Person/15/Image/
This will give me the 4 images that, in the database, are linked to the person with ID 15:
Output:
1.jpg
2.png
3.bmp
4.bmp
If you have a HUGE number of elements, you could separate each digit of the ID into a subfolder; that is, a Person with ID 16549 would have this path: /storage/Person/1/6/5/4/9/Image/
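A tiny sketch of how that per-digit split could be computed (the helper name is mine, just for illustration):

# Illustrative helper: build /storage/<Entity>/<digit>/<digit>/.../<Kind>/ from an ID.
import os

def entity_dir(base, entity, obj_id, kind):
    digits = list(str(obj_id))                  # 16549 -> ['1', '6', '5', '4', '9']
    return os.path.join(base, entity, *digits, kind)

path = entity_dir("/storage", "Person", 16549, "Image")
print(path)                                     # /storage/Person/1/6/5/4/9/Image
os.makedirs(path, exist_ok=True)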
Regarding limits on the number of files per folder, I suggest you read this thread: https://stackoverflow.com/a/466596/12914069
I don't know of a 3rd-party tool, but you could certainly build this logic into your code.
English isn't my first language, so tell me if anything is unclear.
My problem is the following:
I usually back up files (e.g. pictures) onto external hard disk drives and then store them away in safe places; more recently also on a NAS. But I don't want to have them connected and online all the time, for power and security reasons.
If I'm now looking for an old file (e.g. a particular JPG from a holiday in April 2004), I would have to connect a few disks and search them for the file I need.
To overcome this problem, after each backup I usually create a recursive directory dump of the whole disk into a text file.
This way I can search for the filename in the text file.
But there is still a problem if I don't know the exact file name I am looking for. I know the year and month, and maybe the camera I was using then, but there must be hundreds of files from that month.
Therefore I would like to create a "dummy" backup filesystem with all the filenames from the hard disk but without the actual data behind them. This way I could click through the folders, see the folder and file names, and easily find the file in question.
The question is: how do I create such a filesystem copy, with the complete folder structure and the filenames but not the data?
I'm working on Linux (openSUSE), but I guess this is not a Linux-specific question.
In the meantime I found the solution I was looking for:
Virtual Volume View:
http://vvvapp.sourceforge.net/
Works with Linux, MacOS and Windows!
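Virtual Volume View does exactly this, but in case anyone wants to build such a "skeleton" copy by hand, a small script can recreate the directory tree and touch zero-byte placeholder files with the same names and timestamps. A rough sketch (the paths are just examples):

# Rough sketch: recreate the folder structure of a backup disk with empty
# placeholder files, so it can be browsed offline without the actual data.
import os

SOURCE = "/run/media/me/backup2004"               # mounted backup disk (example)
TARGET = os.path.expanduser("~/catalogs/backup2004")

for root, dirs, files in os.walk(SOURCE):
    rel = os.path.relpath(root, SOURCE)
    dest_dir = os.path.join(TARGET, rel)
    os.makedirs(dest_dir, exist_ok=True)
    for name in files:
        src = os.path.join(root, name)
        dst = os.path.join(dest_dir, name)
        open(dst, "a").close()                    # zero-byte file, same name
        st = os.stat(src)
        os.utime(dst, (st.st_atime, st.st_mtime)) # keep the original timestamps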
Our current database system is a Clipper DOS application. The database inside its folder is fragmented/divided into many parts. I want to consolidate the database so that I end up with just one database and avoid reshuffling data. I'll attach a screenshot of the file folder; the database is in .DBF format.
Screenshot of files
Often you can decompile the Clipper EXE file to source code and work from the .prg files; I've done it many times. The program to use is called WALKYRIE.
In Clipper and FoxPro for DOS, a .dbf file is a simple table file.
If you want to use these tables as a database with many tables in one unit, you can import them into an MS SQL Server database and/or an MS Access database.
I see that you got several answers. Most are partially right. Let's address these one at a time:
All those files essentially comprise the "database" for the application you're using. They could be used by other applications as well. Besides having a lot of files, what is the problem you're trying to solve?
People mentioned indexes. You can generally ignore these. They are there primarily to make access to the data files faster. Any properly written Clipper application will recreate them if they're missing or corrupted. You could test this by renaming one, running the app, and seeing what happens. If it doesn't recreate the index you can rename it back. Not replacing missing index files would be unusual behavior.
The DBF file format is binary, but barely. Most of what's in a DBF is text and is readable with an editor. But there's no reason to do so; I'm sure there are several free DBF utilities out there to read DBF files. Getting the structure of the files could be very helpful.
Getting the data out of the files would also be fairly simple with a utility. If you look up the DBF format you could even write one fairly easily in Clipper, in any other language that uses DBF files, or in something like Python. Any language that can open and write files, really. It's not hard; any competent developer could do this in a matter of hours. Much less if you're using Clipper or another language that natively reads DBF files.
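To give a feel for how approachable the format is, here is a minimal, simplified sketch of reading a DBF's structure and records in Python. It assumes a plain dBase III style file without memo fields and treats every field as text; the file name is hypothetical.

# Minimal DBF reader sketch (dBase III style, no memo fields); simplified,
# just to show how approachable the format is.
import struct

def read_dbf(path):
    with open(path, "rb") as f:
        data = f.read()
    num_records, header_size, record_size = struct.unpack_from("<IHH", data, 4)
    # field descriptors: 32 bytes each, starting at offset 32, terminated by 0x0D
    fields, pos = [], 32
    while data[pos] != 0x0D:
        name = data[pos:pos + 11].split(b"\x00")[0].decode("ascii", "replace")
        ftype = chr(data[pos + 11])
        length = data[pos + 16]
        fields.append((name, ftype, length))
        pos += 32
    rows = []
    for i in range(num_records):
        rec = data[header_size + i * record_size: header_size + (i + 1) * record_size]
        if rec[:1] == b"*":            # record marked as deleted
            continue
        offset, row = 1, {}            # byte 0 of each record is the deletion flag
        for name, ftype, length in fields:
            row[name] = rec[offset:offset + length].decode("ascii", "replace").strip()
            offset += length
        rows.append(row)
    return fields, rows

fields, rows = read_dbf("CUSTOMER.DBF")   # hypothetical file name
print([f[0] for f in fields], len(rows))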
Most people create dBase/Clipper programs with relational data, like SQL Server. Where SQL Server has tables that relate to each other, dBase/Clipper has a file for each "table." This isn't a requirement, but it was almost certainly done this way.
Given that, if you get the table structures through a utility or by reading the headers in an editor (don't save them from an editor!), you could quite likely recreate the database schema (i.e. the map of the data). Once you have that it's fairly trivial to get the data into another type of database (SQL Server, Access, or whatever you like to use). If none of the files are too large it's even conceivable to put all the files into Excel sheets. It really depends on what you want to do with it.
As others have said, you may be able to recover the code with Valkyrie. Some people have used it very successfully. I don't know where you get it and I've never used it. Why do you not have the code? If this is a commercial application you likely should not have it. If it's a custom app, whoever wrote it or paid to have it written should have the code.
Again, it's not clear to me what problem you're trying to solve. But there are many options for doing something with those DBF files. Fortunately they are one of the easier to read data formats you could be working with.
Let me know if you have any questions. Apologies for the typos that are no doubt scattered throughout this reply.
You can sort of get an idea of how they relate to each other by opening the index files they use (.NTX files). If you have the DBU utility (executable) around, you can open the DBF and load the index (NTX). LibreOffice Calc is also able to open DBFs (I haven't tested .NTX).
If you open the .NTX in a text editor you will see the index expressions near the beginning.
I open them with Access, but I can save the data using a PrintFill program.
While building web applications, we often have files associated with database entries, e.g. we have a user table and each record has an avatar field, which holds the path to the associated image.
To make sure there are no conflicts in filenames we can either:
rename files upon upload to ID.jpg; the path would then be /user-avatars/ID.jpg
or create a sub-directory for each entity and leave the original filename intact; the path would then be /user-avatars/ID/original_filename.jpg
where ID is the user's unique ID number.
Both are perfectly valid from the application logic's point of view.
But which one would be better from a filesystem performance point of view? We have to keep in mind that the number of entries can be very high (millions).
Is there any limit to the number of sub-directories a directory can hold?
It's going to depend on your file system, but I'm going to assume you're talking about something simple like ext3, and that you're not running a distributed file system (some of which are quite good at this). In general, file systems perform poorly beyond a certain number of entries in a single directory, regardless of whether those entries are directories or files. So whether you're creating one directory per image or putting every image into one directory, you will run into scaling problems. If you look at this answer:
How many files in a directory is too many (on Windows and Linux)?
You'll see that ext3 runs into limits at about 32K entries in a directory, far fewer than you're proposing.
Off the top of my head, I'd suggest doing some rudimentary sharding into a multilevel directory tree, something like /user-avatars/1/2/12345/original_filename.jpg. (Or something appropriate for your type of ID, but I am interpreting your question to be about numeric IDs.) Doing that will also make your life easier later when you decide you want to distribute across a storage cluster, since you can spread the directories around.
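A hedged sketch of that kind of sharding; the helper name and the two-level split are just examples, matching the /user-avatars/1/2/12345/... layout above:

# Illustrative: map a numeric user ID to /user-avatars/<d1>/<d2>/<id>/<filename>,
# sharding on the first two digits of the ID.
import os

def avatar_path(base, user_id, filename):
    digits = str(user_id)
    d1 = digits[0]
    d2 = digits[1] if len(digits) > 1 else "0"
    return os.path.join(base, d1, d2, str(user_id), filename)

print(avatar_path("/user-avatars", 12345, "original_filename.jpg"))
# /user-avatars/1/2/12345/original_filename.jpg

Note that with sequential IDs the leading digits cluster new uploads into the same shard for long stretches; hashing the ID first and sharding on the hash spreads entries more evenly, at the cost of paths you can no longer derive by eye.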
Millions of entries (either files or directories) in one parent directory would be hard to deal with for any filesystem. While modern filesystems use sorting and various tree algorithms to find the needed files quickly, even navigating to the folder with Windows Explorer, Midnight Commander, or any other file manager will be slow, because the file manager has to read the entire contents of the directory. The same applies to file search. So subdirectories are preferred for this.
That said, I should note that access to a particular file can be a bit faster when all files are in one directory than when they are separated into subdirectories, at least on NTFS (I measured this myself several times with about 400K files).
I've been having a very similar issue, with HTML files rather than images: trying to store millions of them on an Ubuntu server on ext4. I ended up running my own benchmarks and found that a flat directory performs much better while being much simpler to use:
Reference: article
If you really want to use files, maybe your best bet is to partition the files off into several subdirectories so that you don't hit a limit. For example, if you have an ID 123456, you can put it in /12/34/56.jpg.
However, I would recommend just using the database to store this data since you are already using one. You can store the image data and ID in the same table, and you don't have to worry about some of the pesky business of dealing with files like making sure the permissions are set right, etc.
I am building a website with a user authentication system that allows each user to upload images to their account. Essentially I am doing this as a learning experience in web development, so please forgive my ignorance on the topic.
My question involves the efficiency of placing files into a directory. Is it more efficient to create a deeper directory structure or to place all files into one folder? The former seems obvious, but does it not depend on the search algorithm implemented by the file system?
For example:
root/user/2012/   A/  B/  C/  D/
root/user/2013/   A/  B/  C/  D/
root/user/2014/   A/  B/  C/  D/
Or dump all files into a single folder?
root/user/
When an image is retrieved, for example by an <img> tag, which way provides a more efficient result? I have searched Google for information on the topic, but couldn't find anything definitive or at my level of understanding.
Accessing a single file should be roughly equivalent either way. Whether to use a single directory or multiple directories really depends on how you intend to use the file listing. If you expect the user to have thousands of files and you only display a single year at a time, it may make sense to break up the directory structure into multiple sections to keep file listings manageable. If you always show all the files, I suspect the single folder may be faster, since otherwise you would have to walk the whole tree and do multiple directory listings. I would do a few tests based on what you expect your app to have to deal with. My guess is that a single directory should be fine unless you expect large numbers of files, in which case you can break the listing down.
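If you do run those tests, a throwaway benchmark along these lines can compare a flat layout against a sharded one on your own filesystem; the file count, the modulo-100 sharding, and the paths are arbitrary choices for illustration.

# Throwaway benchmark sketch: create N empty files flat vs. sharded into
# 100 subfolders, then time a full listing of each layout.
import os
import tempfile
import time

N = 20000
base = tempfile.mkdtemp()
flat = os.path.join(base, "flat")
nested = os.path.join(base, "nested")

os.makedirs(flat)
for i in range(N):
    open(os.path.join(flat, f"{i}.jpg"), "w").close()

for i in range(N):
    d = os.path.join(nested, str(i % 100))
    os.makedirs(d, exist_ok=True)
    open(os.path.join(d, f"{i}.jpg"), "w").close()

for label, path in (("flat", flat), ("nested", nested)):
    start = time.time()
    count = sum(len(files) for _, _, files in os.walk(path))
    print(label, count, "files listed in", round(time.time() - start, 3), "s")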
I don't know what OS you intend to run on, but I'd go with the multiple-directories approach, as some filesystems (NTFS on Windows, for example) slow down horribly when dealing with 10,000+ files in a single directory.