why does the drive space used vary between two disk drives holding identical files - filesystems

I have two external NTFS USB drives with identical files but 5GB difference in used space. The drives are 500/1000 (465GB and 931GB) with 257GB and 252GB used. There is no fragmentation and no windows shadow storage on either.
I have run windows chkdsk and get very different results:
file records processed 4832512 versus 119296
large file records 11 versus 1
reparse records 6272 versus 2
security descriptor data files 10778 versus 10774
files 108387 versus 108386
Should I be concerned at the 5GB leakage of space on the first/older drive or is this expected?

One thing that can happen (among others) is that NTFS keeps a journal which records things that happen to files and folders, such as move, copy, update. This is kept in a large, fairly well hidden file. I doubt that it or it's content are reported through chkdsk. There are APIs to read from it...but the file itself is not generally accessible. If one volume was an active volume for some period of time, and the other is a backup, then this can account for a lot of hidden size difference.
Also, I noticed that the reparse points are somewhat numerous on one and almost non-existent on the other. A naïve backup can effectively destroy reparse points...and can undo hard links (pointers to files in various folders that are really not copies but links to the same file...kinda like shortcuts). For example, most of the files in c:\Windows\WinSXS are hard links to other files in other locations on the same volume. When copying files off, a program should be track and recover the structure of the reparse points and hard links. Depending on the utility used, these files can be accounted for differently.

Related

Strategy for mass storage of small files

What is the good strategy for mass storage for millions of small files (~50 KB on average) with auto-pruning of files older than 20 minutes? I need to write and access them from the web server.
I am currently using ext4, and during delete (scheduled in cron) HDD usage spikes up to 100% with [flush-8:0] showing up as the process that creates the load. This load is interferes with other applications on the server. When there are no deletes, max HDD utilisation is 0-5%. Situation is same with nested and non-nested directory structures. The worst part is that it seems that mass-removing during peak load is slower than the rate of insertions, so amount of files that need to be removed grows larger and larger.
I have tried changing schedulers (deadline, cfq, noop), it didn't help. I have also tried setting ionice to removing script, but it didn't help either.
I have tried GridFS with MongoDB 2.4.3 and it performs nicely, but horrible during mass delete of old files. I have tried running MongoDB with journaling turned off (nojournal) and without write confirmation for both delete and insert (w=0) and it didn't help. It only works fast and smooth when there are no deletes going on.
I have also tried storing data in MySQL 5.5, in BLOB column, in InnoDB table, with InnoDB engine set to use innodb_buffer_pool=2GB, innodb_log_file_size=1GB, innodb_flush_log_on_trx_commit=2, but the perfomance was worse, HDD load was always at 80%-100% (expected, but I had to try). Table was only using BLOB column, DATETIME column and CHAR(32) latin1_bin UUID, with indexes on UUID and DATETIME columns, so there was no room for optimization, and all queries were using indexes.
I have looked into pdflush settings (Linux flush process that creates the load during mass removal), but changing the values didn't help anything so I reverted to default.
It doesn't matter how often I run auto-pruning script, each 1 second, each 1 minute, each 5 minutes, each 30 minutes, it is disrupting server significantly either way.
I have tried to store inode value and when removing, remove old files sequentially by sorting them with their inode numbers first, but it didn't help.
Using CentOS 6. HDD is SSD RAID 1.
What would be good and sensible solution for my task that will solve auto-pruning performance problem?
Deletions are kind of a performance nuisance because both the data and the metadata need to get destroyed on disk.
Do they really need to be separate files? Do the old files really need to get deleted, or is it OK if they get overwritten?
If the answer is "no" to the second of these questions, try this:
Keep a list of files that's roughly sorted by age. Maybe chunk it by file size.
When you want to write to a new file, find an old file that's preferably bigger than what you're replacing it with. Instead of blowing away the old file, truncate() it to the appropriate length and then overwrite its contents. Make sure you update your old-files list.
Clean up the really old stuff that hasn't been replaced explicitly once in a while.
It might be advantageous to have an index into these files. Try using a tmpfs full of symbolic links to the real file system.
You might or might not get a performance advantage in this scheme by chunking the files in to manageably-sized subdirectories.
If you're OK with multiple things being in the same file:
Keep files of similar sizes together by storing each one as an offset into an array of similarly-sized files. If every file is 32k or 64k, keep a file full of 32k chunks and a file full of 64k chunks. If files are of arbitrary sizes, round up to the next power of two.
You can do lazy deletes here by keeping track of how stale each file is. If you're trying to write and something's stale, overwrite it instead of appending to the end of the file.
Another thought: Do you get a performance advantage by truncate()ing all of the files to length 0 in inode order and then unlink()ing them? Ignorance stops me from knowing whether this can actually help, but it seems like it would keep the data zeroing together and the metadata writing similarly together.
Yet another thought: XFS has a weaker write ordering model than ext4 with data=ordered. Is it fast enough on XFS?
If mass-removing millions of files results in performance problem, you can resolve this problem by "removing" all files at once. Instead of using any filesystem operation (like "remove" or "truncate") you could just create a new (empty) filesystem in place of the old one.
To implement this idea you need to split your drive into two (or more) partitions. After one partition is full (or after 20 minutes) you start writing to second partition while using the first one for reading only. After another 20 minutes you unmount first partition, create empty filesystem on it, mount it again, then start writing to first partition while using the second one for reading only.
The simplest solution is to use just two partitions. But this way you don't use disk space very efficiently: you can store twice less files on the same drive. With more partitions you can increase space efficiency.
If for some reason you need all your files in one place, use tmpfs to store links to files on each partition. This requires mass-removing millions of links from tmpfs, but this alleviates performance problem because only links should be removed, not file contents; also these links are to be removed only from RAM, not from SSD.
If you don't need to append to the small files, I would suggest that you create a big file and do a sequential write of the small files right in it, while keeping records of offsets and sizes of all the small files within that big file.
As you reach the end of the big file, start writing from its beginning again, while invalidating records of the small files in the beginning that you replace with new data.
If you choose the big file size properly, based new files saving rate, you can get auto-pruning of files older than ~20 minutes almost as you need.

Does a large number of directories negatively impact performance?

I'm intending to use files to store data as a kind of cache for PHP-generated files, so as to avoid having to re-generate them every time they are loaded (their contents only change once a day).
One thing I have noticed in the past is that if a directory has a large number of files inside, reaching the thousands, it will take a long time for an FTP program to load its contents, sometimes even crashing the computer that's trying to load them. So I'm looking into a tree-based system, where each file is stored in a subfolder based on its ID. So for example a file with the ID 123456 would be stored as 12/34/56.html. In this way, each folder will have at most 100 items (except in the event that there are millions of files, but that is extremely unlikely to happen).
Is this a good idea, is it overkill, or is it unnecessary? The question essentially boils down to: "What is the best way to organise a large number of files?"
1) The answer depends on
OS (e.g. Linux vs Windows) and
filesystem (e.g. ext3 vs NTFS).
2) Keep in mind that when you arbitrarily create a new subdirectory, you're using more inodes
3) Linux usually handles "many files/directory" better than Windows
4) A couple of additional links (assuming you're on Linux):
200,000 images in single folder in linux, performance issue or not?
https://serverfault.com/questions/43133/filesystem-large-number-of-files-in-a-single-directory
https://serverfault.com/questions/147731/do-large-folder-sizes-slow-down-io-performance
http://tldp.org/LDP/intro-linux/html/sect_03_01.html
*

Custom-made archive format question

I was thinking about developing an own file archive format to use for private projects. The thing is that I am not looking for a solution like 7z or RAR, but I want to make something different, similar to a file system.
Looking at real file system, each has two sections in common in its architecture - information about files stored on disk and actual data of the files, as follows:
----------------------------
METADATA | FILE DATA
----------------------------
My question is - how is it possible that these two sections will not overlap? I mean, the FAT STRUCTURE section grows towards the FILE DATA section, while the latter grows towards the end of the disk (partition). How does a file system manage these sections?
This is what I have been trying to figure out for most of the time and any tip would be more than welcome.
Most file systems operate with clusters or pages or blocks, which have fixed size. In many filesystems the directory (metadata) is a just a special file, so it can grow in the same way the regular data files grow. On other filesystems some master metadata block has a fixed size which is pre-allocated during file system formatting. In this case the file system can become full before files take all available space.
On a side note, is there a reason to reinvent the wheel (custom file system for private needs)? There exist some implementations of in-file virtual file systems which are similar to archives, but provide more functionality. One of examples is our SolFS.
All you need is a manifest containing the file list, archive name, and or password and then have all the files listed there
if you can make the files smaller than that's even better!

Uniquely identify files/folders in NTFS, even after move/rename

I haven't found a backup (synchronization) program which does what I want so I'm thinking about writing my own.
What I have now does the following: It goes through the data in the source and for every file which has its archive bit set OR does not exist in the destination, copies it to the destination, overwriting a possibly existing file. When done, it checks for all files in the destination if it exists in the source, and if it doesn't, deletes it.
The problem is that if I move or rename a large folder, it first gets copied to the destination even though it is in principle already there, just has a different path. Then the folder which was already there is deleted afterwards.
Apart from the unnecessary copying, I frequently run into space problems because my backup drive isn't large enough to hold the original data twice.
Is there a way to programmatically identify such moved/renamed files or folders, i.e. by NTFS ID or physical location on media or something else? Are there solutions to this problem?
I do not care about the programming language, but hints for doing this with Python, C++, C#, Java or Prolog are appreciated.
Are you familiar with object IDs? This might be what you're looking for: http://msdn.microsoft.com/en-us/library/aa363997.aspx
You may also want to use file IDs. You can get this from the FileId field of FILE_ID_BOTH_DIR_INFO you get by calling GetFileInformationByHandleEx or the nFileIndexLow and nFileIndexHigh fields of BY_HANDLE_FILE_INFORMATION you get by calling GetFileInformationByHandle.
Although it would require you to redesign your system, NTFS has a feature called a change journal that was designed for just this situation. It keeps track of every file that was changed, even across reboots. When your program runs, it would read the change journal from whenever it left off. For every file that was deleted, delete that file on your backup. For every file that was renamed, rename that file on your backup. For every file that was created or changed, copy that file to your backup. Now, instead of having to traverse both directory trees in parallel, you can simply traverse the list of files you'll actually have to pay attention to.
Not sure about NTFS specifics which might help you but didn't you think about comparing file hashes? And in order not to calculate hash many times you can firstly compare file sizes.

Fastest file access/storage?

I have about 750,000,000 files I need to store on disk. What's more is I need to be able to access these files randomly--any given file at any time--in the shortest time possible. What do I need to do to make accessing these files fastest?
Think of it like a hash table, only the hash keys are the filenames and the associated values are the files' data.
A coworker said to organize them into directories like this: if I want to store a file named "foobar.txt" and it's stored on the D: drive, put the file in "D:\f\o\o\b\a\r.\t\x\t". He couldn't explain why this was a good idea though. Is there anything to this idea?
Any ideas?
The crux of this is finding a file. What's the fastest way to find a file by name to open?
EDIT:
I have no control over the file system upon which this data is stored. It's going to be NTFS or FAT32.
Storing the file data in a database is not an option.
Files are going to be very small--maximum of probably 1 kb.
The drives are going to be solid state.
Data access is virtually random, but I could probably figure out a priority for each file based on how often it is requested. Some files will be accessed much more than others.
Items will constantly be added, and sometimes deleted.
It would be impractical to consolidate multiple files into single files because there's no logical association between files.
I would love to gather some metrics by running tests on this stuff, but that endeavour could become as consuming as the project itself!
EDIT2:
I want to upvote several thorough answers, whether they're spot-on or not, and cannot because of my newbie status. Sorry guys!
This sounds like it's going to be largely a question of filesystem choice. One option to look at might be ZFS, it's designed for high volume applications.
You may also want to consider using a relational database for this sort of thing. 750 million rows is sort of a medium size database, so any robust DBMS (eg. PostgreSQL) would be able to handle it well. You can store arbitrary blobs in the database too, so whatever you were going to store in the files on disk you can just store in the database itself.
Update: Your additional information is certainly helpful. Given a choice between FAT32 and NTFS, then definitely choose NTFS. Don't store too many files in a single directory, 100,000 might be an upper limit to consider (although you will have to experiment, there's no hard and fast rule). Your friend's suggestion of a new directory for every letter is probably too much, you might consider breaking it up on every four letters or something. The best value to choose depends on the shape of your dataset.
The reason breaking up the name is a good idea is that typically the performance of filesystems decreases as the number of files in a directory increases. This depends highly on the filesystem in use, for example FAT32 will be horrible with probably only a few thousand files per directory. You don't want to break up the filenames too much, so you will minimise the number of directory lookups the filesystem will have to do.
That file algorithm will work, but it's not optimal. I would think that using 2 or 3 character "segments" would be better for performance - especially when you start considering doing backups.
For example:
d:\storage\fo\ob\ar\foobar.txt
or
d:\storage\foo\bar\foobar.txt
There are some benefits to using this sort of algorithm:
No database access is necessary.
Files will be spread out across many directories. If you don't spread them out, you'll hit severe performance problems. (I vaguely recall hearing about someone having issues at ~40,000 files in a single folder, but I'm not confident in that number.)
There's no need to search for a file. You can figure out exactly where a file will be from the file name.
Simplicity. You can very easily port this algorithm to just about any language.
There are some down-sides to this too:
Many directories may lead to slow backups. Imagine doing recursive diffs on these directories.
Scalability. What happens when you run out of disk space and need to add more storage?
Your file names cannot contain spaces.
This depends to a large extent on what file system you are going to store the files on. The capabilities of file systems in dealing with large number of files varies widely.
Your coworker is essentially suggesting the use of a Trie data structure. Using such a directory structure would mean that at each directory level there are only a handful of files/directories to choose from; this could help because as the number of files within a directory increases the time to access one of them does too (the actual time difference depends on the file system type.)
That said, I personally wouldn't go that many levels deep -- three to four levels ought to be enough to give the performance benefits -- most levels after that will probably have very entries (assuming your file names don't follow any particular patterns.)
Also, I would store the file itself with its entire name, this will make it easier to traverse this directory structure manually also, if required.
So, I would store foobar.txt as f/o/o/b/foobar.txt
This highly depends on many factors:
What file system are you using?
How large is each file?
What type of drives are you using?
What are the access patterns?
Accessing files purely at random is really expensive in traditional disks. One significant improvement you can get is to use solid state drive.
If you can reason an access pattern, you might be able to leverage locality of reference to place these files.
Another possible way is to use a database system, and store these files in the database to leverage the system's caching mechanism.
Update:
Given your update, is it possbile you consolidate some files? 1k files are not very efficient to store as file systems (fat32, ntfs) have cluster size and each file will use the cluster size anyway even if it is smaller than the cluster size. There is usually a limit on the number of files in each folder, with performance concerns. You can do a simple benchmark by putting as many as 10k files in a folder to see how much performance degrades.
If you are set to use the trie structure, I would suggest survey the distribution of file names and then break them into different folders based on the distribution.
First of all, the file size is very small. Any File System will eat something like at least 4 times more space. I mean any file on disk will occupy 4kb for 1kb file. Especially on SSD disks, the 4kb sector will be the norm.
So you have to group several files into 1 physical file. 1024 file in 1 storage file seems reasonable. To locate the individual files in these storage files you have to use some RDBMS (PostgreSQL was mentioned and it is good but SQLite may be better suited to this) or similar structure to do the mapping.
The directory structure suggested by your friend sounds good but it does not solve the physical storage problem. You may use similar directory structure to store the storage files. It is better to name them by using a numerical system.
If you can, do not let them format as FAT32, at least NTFS or some recent File System of Unix flavor. As total size of the files is not that big, NTFS may be sufficient but ZFS is the better option...
Is there any relation between individual files? As far as access times go, what folders you put things in won't affect much; the physical locations on the disk are what matter.
Why isn't storing the paths in a database table acceptable?
My guess is he is thinking of a Trie data structure to create on disk where the node is a directory.
I'd check out hadoops model.
P
I know this is a few years late, but maybe this can help the next guy..
My suggestion use a SAN, mapped to a Z drive that other servers can map to as well. I wouldn't go with the folder path your friend said to go with, but more with a drive:\clientid\year\month\day\ and if you ingest more than 100k docs a day, then you can add sub folders for hour and even minute if needed. This way, you never have more than 60 sub folders while going all the way down to seconds if required. Store the links in SQL for quick retrieval and reporting. This makes the folder path pretty short for example: Z:\05\2004\02\26\09\55\filename.txt so you don't run into any 256 limitations across the board.
Hope that helps someone. :)

Resources