I haven't found a backup (synchronization) program which does what I want so I'm thinking about writing my own.
What I have now does the following: it goes through the data in the source and, for every file which either has its archive bit set or does not exist in the destination, copies it to the destination, overwriting any existing file. When done, it checks every file in the destination and deletes it if it does not exist in the source.
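In Python, the current logic looks roughly like this (a simplified sketch of the two passes only, assuming Windows so the archive bit is readable via st_file_attributes; it is not the actual program):

```python
# Sketch of the two passes described above. Windows-only, because it reads
# the archive bit from st_file_attributes. Clearing the archive bit after a
# successful copy is omitted here.
import os
import shutil
import stat

def mirror(source, destination):
    # Pass 1: copy every file whose archive bit is set or which is missing
    # from the destination, overwriting whatever is already there.
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(destination, rel, name)
            attrs = os.stat(src).st_file_attributes
            if attrs & stat.FILE_ATTRIBUTE_ARCHIVE or not os.path.exists(dst):
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)

    # Pass 2: delete every destination file that no longer exists in the source.
    for root, _dirs, files in os.walk(destination):
        rel = os.path.relpath(root, destination)
        for name in files:
            if not os.path.exists(os.path.join(source, rel, name)):
                os.remove(os.path.join(root, name))
```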
The problem is that if I move or rename a large folder, it first gets copied to the destination even though it is, in principle, already there under a different path; only afterwards is the old copy in the destination deleted.
Apart from the unnecessary copying, I frequently run into space problems because my backup drive isn't large enough to hold the original data twice.
Is there a way to programmatically identify such moved/renamed files or folders, e.g. by NTFS ID, physical location on the media, or something else? Are there existing solutions to this problem?
I do not care about the programming language, but hints for doing this with Python, C++, C#, Java or Prolog are appreciated.
Are you familiar with object IDs? This might be what you're looking for: http://msdn.microsoft.com/en-us/library/aa363997.aspx
You may also want to use file IDs. You can get a file's ID from the FileId field of the FILE_ID_BOTH_DIR_INFO structure returned by GetFileInformationByHandleEx, or from the nFileIndexLow and nFileIndexHigh fields of the BY_HANDLE_FILE_INFORMATION structure returned by GetFileInformationByHandle.
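If you end up doing this from Python, here is a sketch of the idea: on Windows, os.stat() exposes the volume serial and 64-bit file index as st_dev/st_ino (as far as I know it obtains them via GetFileInformationByHandle), so you can record them in a manifest and recognise a moved/renamed file on the next run. Treat the st_ino/FileId correspondence as an assumption to verify; the manifest name is just illustrative.

```python
# Sketch: record each source file's NTFS file ID so the next run can
# recognise moves/renames instead of re-copying them.
import json
import os

MANIFEST = "file_ids.json"   # illustrative name

def scan(root):
    ids = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            # (volume serial, file index) identifies the file regardless of its path
            ids[str((st.st_dev, st.st_ino))] = os.path.relpath(path, root)
    return ids

current = scan(r"C:\data")
try:
    with open(MANIFEST) as f:
        previous = json.load(f)
except FileNotFoundError:
    previous = {}

# Same file ID, different relative path => the file was moved/renamed,
# so the backup copy can be renamed instead of copied again.
moved = {fid: (previous[fid], path)
         for fid, path in current.items()
         if fid in previous and previous[fid] != path}

with open(MANIFEST, "w") as f:
    json.dump(current, f)
```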
Although it would require you to redesign your system, NTFS has a feature called a change journal that was designed for just this situation. It keeps track of every file that was changed, even across reboots. When your program runs, it would read the change journal from where it left off. For every file that was deleted, delete that file on your backup. For every file that was renamed, rename that file on your backup. For every file that was created or changed, copy that file to your backup. Now, instead of having to traverse both directory trees in parallel, you can simply traverse the list of files you actually have to pay attention to.
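To make that concrete, here is a rough Python sketch of just the apply loop. read_usn_records is a hypothetical helper: actually reading the journal means calling DeviceIoControl with FSCTL_QUERY_USN_JOURNAL / FSCTL_READ_USN_JOURNAL (e.g. via ctypes or pywin32) and parsing USN_RECORD structures, which is left out here.

```python
# Sketch of applying change-journal records to a backup. The record objects
# are assumed to have the fields reason ("delete" / "rename" / other),
# path, and old_path -- this matches the hypothetical reader, not the raw
# USN_RECORD layout.
import os
import shutil

def apply_changes(records, source_root, backup_root):
    for rec in records:
        dst = os.path.join(backup_root, rec.path)
        if rec.reason == "delete":
            if os.path.exists(dst):
                os.remove(dst)
        elif rec.reason == "rename":
            old = os.path.join(backup_root, rec.old_path)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            if os.path.exists(old):
                os.replace(old, dst)
        else:  # created or modified
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(os.path.join(source_root, rec.path), dst)
```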
I'm not sure about NTFS specifics that might help you, but have you thought about comparing file hashes? To avoid calculating hashes more often than necessary, you can compare file sizes first.
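For example, a small Python sketch of the size-first, hash-second check (the hash function and chunk size are arbitrary choices):

```python
# Sketch: find a destination file that already contains a source file's data,
# comparing sizes first and hashing only when sizes collide.
import hashlib
import os
from collections import defaultdict

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while data := f.read(chunk):
            h.update(data)
    return h.hexdigest()

def by_size(root):
    # group all destination files by size so lookups are cheap
    sizes = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            sizes[os.path.getsize(path)].append(path)
    return sizes

def find_match(src_path, dest_sizes):
    candidates = dest_sizes.get(os.path.getsize(src_path), [])
    if not candidates:
        return None
    want = sha256_of(src_path)
    return next((c for c in candidates if sha256_of(c) == want), None)
```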
Related
I have two external NTFS USB drives with identical files but a 5 GB difference in used space. The drives are 500/1000 GB (465 GB and 931 GB) with 257 GB and 252 GB used. There is no fragmentation and no Windows shadow storage on either.
I have run Windows chkdsk and get very different results:
file records processed: 4832512 versus 119296
large file records: 11 versus 1
reparse records: 6272 versus 2
security descriptor data files: 10778 versus 10774
files: 108387 versus 108386
Should I be concerned at the 5GB leakage of space on the first/older drive or is this expected?
One thing that can happen (among others) is that NTFS keeps a journal which records things that happen to files and folders, such as move, copy, and update. This is kept in a large, fairly well hidden file. I doubt that it or its contents are reported through chkdsk. There are APIs to read from it... but the file itself is not generally accessible. If one volume was an active volume for some period of time and the other is a backup, this can account for a lot of hidden size difference.
Also, I noticed that the reparse points are somewhat numerous on one and almost non-existent on the other. A naïve backup can effectively destroy reparse points... and can undo hard links (pointers to files in various folders that are really not copies but links to the same file... kinda like shortcuts). For example, most of the files in c:\Windows\WinSXS are hard links to other files in other locations on the same volume. When copying files off, a program should track and recover the structure of the reparse points and hard links. Depending on the utility used, these files can be accounted for differently.
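If it helps, here is a rough Python sketch for spotting hard-linked files before copying, by grouping paths that share the same (st_dev, st_ino); that os.stat() fills st_nlink and st_ino on NTFS is an assumption you should verify for your Python version.

```python
# Sketch: spot hard-linked files so a backup tool can recreate the links
# instead of duplicating the data.
import os
from collections import defaultdict

def find_hard_link_groups(root):
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(path)
    # Every value with more than one path is a single file reachable by several names.
    return {k: v for k, v in groups.items() if len(v) > 1}
```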
File paths are inherently dubious when working with data.
Let's say I have a hypothetical situation with a program called find_brca and some data called my.genome, and both are in the /Users/Desktop/ directory.
find_brca takes a single argument, a genome, runs for about 4 hours, and returns the probability of that individual developing breast cancer in their lifetime. Some people, presented with a very high % probability, might then immediately have both of their breasts removed as a precaution.
Obviously, in this scenario, it is absolutely vital that /Users/Desktop/my.genome actually contains the genome we think it does. There are no do-overs. "oops we used an old version of the file from a previous backup" or any other technical issue will not be acceptable to the patient. How do we ensure we are analysing the file we think we are analysing?
To make matters trickier, let's also assert that we cannot modify find_brca itself, because we didn't write it, it's closed source, proprietary, whatever.
You might think MD5 or other cryptographic checksums could come to the rescue, and while they do help to a degree, you can only MD5 the file before and/or after find_brca has run; you can never know exactly what data find_brca actually used (without doing some serious low-level system probing with DTrace/ptrace, etc.).
The root of the problem is that file paths do not have a 1:1 relationship with actual data. Only in a filesystem where files can only be requested by their checksum - and as soon as the data is modified its checksum is modified - can we ensure that when we feed find_brca the genome's file path 4fded1464736e77865df232cbcb4cd19, we are actually reading the correct genome.
Are there any filesystems that work like this? If I wanted to create such a filesystem because none currently exists, how would you recommend I go about doing it?
I have my doubts about its stability, but hashfs looks exactly like what you want: http://hashfs.readthedocs.io/en/latest/
HashFS is a content-addressable file management system. What does that mean? Simply, that HashFS manages a directory where files are saved based on the file's hash. Typical use cases for this kind of system are ones where: files are written once and never change (e.g. image storage); it's desirable to have no duplicate files (e.g. user uploads); file metadata is stored elsewhere (e.g. in a database).
Note: Not to be confused with the hashfs, a student of mine did a couple of years ago: http://dl.acm.org/citation.cfm?id=1849837
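For what it's worth, a quick sketch with the PyPI hashfs linked above; the API is recalled from its docs, so double-check the exact signatures there before relying on this.

```python
# Sketch: store the genome in a content-addressed store and hand consumers
# the hash-derived path rather than a mutable path on the Desktop.
from hashfs import HashFS

fs = HashFS('genome_store', depth=4, width=1, algorithm='sha256')

# Store the genome once; the returned address contains the content hash.
with open('/Users/Desktop/my.genome', 'rb') as f:
    address = fs.put(f)

print(address.id)       # the checksum the file is now addressed by
print(address.abspath)  # the on-disk path you would hand to find_brca

# Later, retrieve it strictly by hash; if the bytes ever change, this hash
# simply no longer resolves to the new content.
fileobj = fs.open(address.id)
```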
I would say the question is a little vague; however, there are several answers that can be given to parts of it.
First of all, not all filesystems lack path/data correspondence. On many (if not most) filesystems, the file is identified only by its path, not by any IDs.
Next, if you want to guarantee that the data is not changed while the application handles it, then the approach depends on the filesystem being used and the way the application works with the file (whether it keeps it open or opens and closes it as needed).
Finally, if you are concerned about an attacker altering the data on the filesystem in some way while the file data is being used, then you probably have a bigger problem than just file paths, and that problem should be addressed beforehand.
On a side note, you can implement a virtual file system (FUSE on Linux, our CBFS on Windows), which will feed your application with data taken from elsewhere, be it memory, a database or a cloud. This approach answers your question as well.
Update: if you want to get rid of file paths at all and have the data addressed by hash, then probably a NoSQL database, where the hash is the key, would be your best bet.
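A minimal sketch of the hash-as-key idea, using the standard library's dbm purely as a stand-in for whatever key-value/NoSQL store you choose:

```python
# Sketch: content is stored and retrieved strictly by its SHA-256 digest.
import dbm
import hashlib

def put(db, data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    db[key.encode("ascii")] = data   # idempotent: same bytes, same key
    return key

def get(db, key: str) -> bytes:
    return db[key.encode("ascii")]

with dbm.open("genome_store", "c") as db:
    with open("/Users/Desktop/my.genome", "rb") as f:
        key = put(db, f.read())      # fine for a sketch; stream in chunks for multi-GB data
    # Consumers are handed the hash, not a mutable file path.
    data = get(db, key)
```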
Suppose I have N files to process using Hadoop MapReduce; let's assume they are large, well beyond the block size, and that there are only a few hundred of them. Now I would like to process each of these files; let's assume the word-counting example.
My question is: what is the difference between creating a map-reduce job whose input is a text file with the paths to each of the files, as opposed to sending each of the files directly to the map function, i.e. concatenating all the files and pushing them into different mappers [EDIT]?
Are these both valid approaches?
Are there any drawbacks to them?
Thanks for the prompt answers. I've included a detailed description of my problem, since my abstraction may have missed a few important points:
I have N small files on HDFS in my application and I just need to process each file. So I am using a map function to apply a Python script to each file (actually an image; I've already looked at all the Hadoop image processing links out there). I am aware of the small-file problem, and the typical recommendation is to group the smaller files so we avoid the overhead of moving files around (the basic recommendation being to use sequence files or to create your own data structures, as in the case of HIPI).
This makes me wonder: can't we just tell each mapper to look for files that are local to it and operate on those?
I haven't found a solution to that issue, which is why I was looking at either sending a path to the files to each mapper or the file itself.
Creating a list of path names for each collection of images seems to be OK, but as mentioned in the comments, I lose the data locality property.
Now, when I looked at the Hadoop streaming interface, it mentions that the different pieces communicate via stdin and stdout, typically used for text files. That's where I get confused: if I am just sending a list of path names, this shouldn't be an issue, since each mapper would just try to find the collection of images it is assigned. But when I look at the word count example, the input is the file itself, which then gets split up across the mappers. So I was confused as to whether I should concatenate images into groups and then send these concatenated groups, just like the text document, to the different mappers, or whether I should instead concatenate the images, leave them in HDFS, and then just pass their path to the mapper... I hope this makes sense... maybe I'm completely off here...
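For reference, here is what I imagine a streaming mapper for the "text file full of paths" option would look like; process_image and the hdfs dfs -cat fetch are placeholders for my real script, and the fetch is exactly the step that loses data locality:

```python
#!/usr/bin/env python
# Sketch of a streaming mapper: each input line is an HDFS path, the mapper
# fetches that file itself and emits one tab-separated result per file.
import subprocess
import sys

def process_image(data: bytes) -> str:
    return str(len(data))   # stand-in for the real analysis

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    # Pull the file out of HDFS; the block may live on another node,
    # which is why this defeats data locality.
    data = subprocess.run(["hdfs", "dfs", "-cat", path],
                          capture_output=True, check=True).stdout
    print(f"{path}\t{process_image(data)}")
```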
Thanks again!
Both are valid. But the latter would incur extra overhead, and performance would go down, because you are talking about concatenating all the files into one and feeding it to just one mapper. By doing that you would go against one of the most basic principles of Hadoop: parallelism. Parallelism is what makes Hadoop so efficient.
FYI, if you really need to do that, you have to override isSplitable() to return false in your InputFormat class; otherwise the framework will split the file (based on your InputFormat).
And as far as the input path is concerned, you just need to give the path of the input directory. Each file inside this directory will be processed without human intervention.
HTH
In response to your edit :
I think you have misunderstood this a bit. You don't have to worry about data locality; Hadoop takes care of that. You just have to run your job and the data will be processed on the node where it is present. The size of the file has nothing to do with it. You don't have to tell the mappers anything. The process goes like this:
You submit your job to the JobTracker (JT). The JT directs the TaskTracker (TT) running on the node which has the block of data required by the job to start the mappers. If the slots are occupied by some other process, the same thing takes place on some other node that has the data block.
The bottleneck will only be there if you are processing the whole concatenated file in a single mapper, as you have mentioned.
It won't be a problem if you are providing the concatenated file as input to Hadoop, since the large file will be distributed in HDFS (I assume you are using HDFS) and will be processed by multiple mappers and reducers concurrently.
My question is: What is the difference between creating a map-reduce job whose input is a text file with the paths to each of the files as opposed to sending each of the files directly to the map function i.e. concatenating all the files and pushing them into a single mapper.
By listing the file paths in a text file and (I assume) manually opening them in the mapper, you'll be defeating data locality (that is, Hadoop trying to run your mapper code on the node where the data is, rather than moving the data to where your code executes). With 1000 files, the path list will also probably be processed by a single mapper instance (I imagine 1000 lines of text will be less than your block size).
If you concatenate all the files first and then use the result as input, this will usually be less efficient, mainly because you're copying all the files to a single node (to concatenate them) and then pushing the data back up to HDFS as a single file. This is even before you get to process the file again in a mapper (or more than one, depending on the splittability of your input format / compression codec).
If you were going to process this concatenated file multiple times, and each file is smaller than the block size, then merging them to a single file may be beneficial, but you've already noted that each file is larger than the default block size.
Is there a particular reason you want all files to flow through a single mapper (which is what it sounds like you are trying to achieve with these two options)?
I am writing a program which outputs a file. This file has two parts, but the second part is computed before the first. I was thinking of creating a temporary file and writing that data to it, then creating the permanent file, dumping the temp file's content into it, and deleting the temp file. I saw some posts saying this does not work, or that it might cause problems with different compilers or something.
The data is a bunch of chars. Every 32 chars have to appear on a different line. I can store it in a linked list or something, but I do not want to have to write a linked list for that.
Does anyone have any suggestions or alternative methods?
A temporary file can be created; although some people say they have problems with this, I personally have used them with no issues. Using the platform functions to obtain a temporary file is the best option. Don't assume you can write to C:\ etc. on Windows, as this isn't always possible, and don't assume a filename in case the file is already in use. Not using temporary files correctly is what causes people problems, rather than temporary files being bad.
Is there any reason you cannot just keep the second part in RAM until you are ready for the first? Otherwise, can you work out the size needed for the first part and leave that section of the file blank, coming back to fill it in later? That would eliminate the need for the temporary file.
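A sketch of that keep-it-in-RAM approach (shown in Python for brevity; the same shape works in C++ with a string/ostringstream buffer, and the stand-in strings below are just placeholders):

```python
# Sketch: compute the second part first, buffer it in memory, then write
# both parts to the file in their final order.
def chunk32(chars: str) -> str:
    # every 32 characters on their own line
    return "\n".join(chars[i:i + 32] for i in range(0, len(chars), 32))

def write_output(path: str, first_part: str, second_part: str) -> None:
    with open(path, "w") as f:
        f.write(chunk32(first_part) + "\n")
        f.write(chunk32(second_part) + "\n")

if __name__ == "__main__":
    second = "B" * 70   # stand-in for the part that is computed first
    first = "A" * 40    # stand-in for the part that must appear first in the file
    write_output("result.txt", first, second)
```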
Both solutions you propose could work. You can output intermediate results to a temporary file, and then later append that file to the file that contains the dataset that you want to present first. You could also store your intermediate data in memory. The right data structure depends on how you want to organize the data.
As one of the other answerers notes, files are inherently platform specific. If your code will only run on a single platform, then this is less of a concern. If you need to support multiple platforms, then you may need to special case some or all of those platforms, if you go with the temporary file solution. Whether this is a deal-breaker for you depends on how much complexity this adds compared to structuring and storing your data in memory.
I have about 750,000,000 files I need to store on disk. What's more is I need to be able to access these files randomly--any given file at any time--in the shortest time possible. What do I need to do to make accessing these files fastest?
Think of it like a hash table, only the hash keys are the filenames and the associated values are the files' data.
A coworker said to organize them into directories like this: if I want to store a file named "foobar.txt" and it's stored on the D: drive, put the file in "D:\f\o\o\b\a\r.\t\x\t". He couldn't explain why this was a good idea though. Is there anything to this idea?
Any ideas?
The crux of this is finding a file. What's the fastest way to find a file by name to open?
EDIT:
I have no control over the file system upon which this data is stored. It's going to be NTFS or FAT32.
Storing the file data in a database is not an option.
Files are going to be very small--maximum of probably 1 kb.
The drives are going to be solid state.
Data access is virtually random, but I could probably figure out a priority for each file based on how often it is requested. Some files will be accessed much more than others.
Items will constantly be added, and sometimes deleted.
It would be impractical to consolidate multiple files into single files because there's no logical association between files.
I would love to gather some metrics by running tests on this stuff, but that endeavour could become as consuming as the project itself!
EDIT2:
I want to upvote several thorough answers, whether they're spot-on or not, and cannot because of my newbie status. Sorry guys!
This sounds like it's going to be largely a question of filesystem choice. One option to look at might be ZFS; it's designed for high-volume applications.
You may also want to consider using a relational database for this sort of thing. 750 million rows is a medium-sized database, so any robust DBMS (e.g. PostgreSQL) would be able to handle it well. You can store arbitrary blobs in the database too, so whatever you were going to store in the files on disk you can just store in the database itself.
Update: Your additional information is certainly helpful. Given a choice between FAT32 and NTFS, definitely choose NTFS. Don't store too many files in a single directory; 100,000 might be an upper limit to consider (although you will have to experiment, there's no hard and fast rule). Your friend's suggestion of a new directory for every letter is probably too much; you might consider breaking it up every four letters or so instead. The best value to choose depends on the shape of your dataset.
The reason breaking up the name is a good idea is that typically the performance of filesystems decreases as the number of files in a directory increases. This depends highly on the filesystem in use; for example, FAT32 will be horrible with probably only a few thousand files per directory. At the same time, you don't want to break up the filenames too much, so that you minimise the number of directory lookups the filesystem has to do.
That file algorithm will work, but it's not optimal. I would think that using 2 or 3 character "segments" would be better for performance - especially when you start considering doing backups.
For example:
d:\storage\fo\ob\ar\foobar.txt
or
d:\storage\foo\bar\foobar.txt
There are some benefits to using this sort of algorithm:
No database access is necessary.
Files will be spread out across many directories. If you don't spread them out, you'll hit severe performance problems. (I vaguely recall hearing about someone having issues at ~40,000 files in a single folder, but I'm not confident in that number.)
There's no need to search for a file. You can figure out exactly where a file will be from the file name.
Simplicity. You can very easily port this algorithm to just about any language.
There are some down-sides to this too:
Many directories may lead to slow backups. Imagine doing recursive diffs on these directories.
Scalability. What happens when you run out of disk space and need to add more storage?
Your file names cannot contain spaces.
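A sketch of the segment scheme, with the width and depth as knobs to tune against your real file-name distribution:

```python
# Sketch: derive a sharded storage path from a file name using fixed-width
# segments of the name, as described above.
import os

def shard_path(storage_root: str, filename: str, width: int = 2, depth: int = 3) -> str:
    stem = os.path.splitext(filename)[0]
    segments = [stem[i:i + width]
                for i in range(0, min(len(stem), width * depth), width)]
    return os.path.join(storage_root, *segments, filename)

print(shard_path(r"d:\storage", "foobar.txt"))
# on Windows: d:\storage\fo\ob\ar\foobar.txt
```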
This depends to a large extent on what file system you are going to store the files on. The capabilities of file systems in dealing with large number of files varies widely.
Your coworker is essentially suggesting the use of a Trie data structure. Using such a directory structure would mean that at each directory level there are only a handful of files/directories to choose from; this could help because as the number of files within a directory increases the time to access one of them does too (the actual time difference depends on the file system type.)
That said, I personally wouldn't go that many levels deep; three to four levels ought to be enough to give the performance benefits, and most levels after that will probably have very few entries (assuming your file names don't follow any particular patterns).
Also, I would store the file itself under its entire name; this will make it easier to traverse the directory structure manually, if required.
So, I would store foobar.txt as f/o/o/b/foobar.txt
This highly depends on many factors:
What file system are you using?
How large is each file?
What type of drives are you using?
What are the access patterns?
Accessing files purely at random is really expensive on traditional disks. One significant improvement you can get is to use a solid-state drive.
If you can reason about the access pattern, you might be able to leverage locality of reference when placing these files.
Another possible way is to use a database system and store these files in the database to leverage the system's caching mechanism.
Update:
Given your update, is it possible for you to consolidate some files? 1 KB files are not very efficient to store, as file systems (FAT32, NTFS) have a cluster size and each file will use up a full cluster even if it is smaller than the cluster size. There is also usually a limit on the number of files in each folder before performance becomes a concern. You can do a simple benchmark by putting as many as 10k files in a folder to see how much performance degrades.
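A quick-and-dirty version of that benchmark (sizes and counts are placeholders; only numbers measured on the real hardware mean anything):

```python
# Sketch: create N tiny files in one folder, then time random opens.
import os
import random
import time

N = 10_000
target = "bench_folder"
os.makedirs(target, exist_ok=True)

for i in range(N):
    with open(os.path.join(target, f"f{i:06d}.dat"), "wb") as f:
        f.write(b"x" * 1024)   # ~1 KB each, like the real workload

names = [os.path.join(target, f"f{i:06d}.dat") for i in range(N)]
start = time.perf_counter()
for _ in range(1000):
    with open(random.choice(names), "rb") as f:
        f.read()
elapsed = time.perf_counter() - start
print(f"1000 random reads took {elapsed:.3f}s")
```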
If you are set to use the trie structure, I would suggest survey the distribution of file names and then break them into different folders based on the distribution.
First of all, the file size is very small. Any file system will eat something like at least 4 times more space: a 1 KB file will occupy 4 KB on disk. Especially on SSDs, the 4 KB sector will be the norm.
So you have to group several files into one physical file; 1024 files per storage file seems reasonable. To locate the individual files within these storage files you have to use some RDBMS (PostgreSQL was mentioned and it is good, but SQLite may be better suited to this) or a similar structure to do the mapping.
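A sketch of that pack-plus-index idea using SQLite from the standard library; the table layout and file names are illustrative, and a real implementation would also need compaction when files are deleted:

```python
# Sketch: blobs are appended to a big container file and SQLite records
# where each one lives, so lookups never hit the filesystem's directory code.
import os
import sqlite3

db = sqlite3.connect("index.db")
db.execute("""CREATE TABLE IF NOT EXISTS files
              (name TEXT PRIMARY KEY, store TEXT, offset INTEGER, length INTEGER)""")

def put(name: str, data: bytes, store: str = "store_0000.bin") -> None:
    with open(store, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        f.write(data)
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
               (name, store, offset, len(data)))
    db.commit()

def get(name: str) -> bytes:
    store, offset, length = db.execute(
        "SELECT store, offset, length FROM files WHERE name = ?", (name,)).fetchone()
    with open(store, "rb") as f:
        f.seek(offset)
        return f.read(length)

put("foobar.txt", b"hello world")
print(get("foobar.txt"))
```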
The directory structure suggested by your friend sounds good, but it does not solve the physical storage problem. You may use a similar directory structure to store the storage files; it is better to name them using a numerical scheme.
If you can, do not let the drives be formatted as FAT32; use at least NTFS or some recent file system of Unix flavor. As the total size of the files is not that big, NTFS may be sufficient, but ZFS would be the better option...
Is there any relation between individual files? As far as access times go, what folders you put things in won't affect much; the physical locations on the disk are what matter.
Why isn't storing the paths in a database table acceptable?
My guess is that he is thinking of a trie data structure created on disk, where each node is a directory.
I'd check out Hadoop's model.
I know this is a few years late, but maybe this can help the next guy..
My suggestion: use a SAN, mapped to a Z: drive that other servers can map to as well. I wouldn't go with the folder path your friend suggested, but rather with drive:\clientid\year\month\day\, and if you ingest more than 100k docs a day, you can add subfolders for hour and even minute if needed. This way, you never have more than 60 subfolders at any level while going all the way down to seconds if required. Store the links in SQL for quick retrieval and reporting. This makes the folder path pretty short, for example Z:\05\2004\02\26\09\55\filename.txt, so you don't run into any 256-character limitations across the board.
Hope that helps someone. :)