Copying millions of files

I've got about 3 million files I need to copy from one folder to another over my company's SAN. What's the best way for me to do this?

If a straight copy is too slow (although a SAN with write-back caching would be about as fast as anything for this type of operation) you could tar the files up into one or more archives and then expand the archives out at the destination. This would slightly reduce the disk thrashing.
A cleverer approach is a trick with tar or cpio: archive the files and write the archive to stdout, then pipe that to another tar/cpio process that unpacks the files at the destination.
A sample command to do this with tar looks like:
tar cf - * | (cd [destination dir] ; tar xf - )
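The cpio equivalent would be a sketch along these lines (assuming GNU cpio's pass-through mode: -p copies instead of archiving, -d creates directories, -m preserves modification times):
find . -depth -print | cpio -pdm [destination dir]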
Some SANs will also directly clone a disk volume.

If you're on Windows, use robocopy. It's very robust and built for situations like this. It supports dead-link detection and can be told to retry a copy if it is interrupted.
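A starting point might look like this (a sketch; the paths are placeholders, /E copies subdirectories including empty ones, /R and /W set the retry count and the wait between retries, and /MT enables multithreaded copying on newer Windows versions):
robocopy C:\source D:\destination /E /R:3 /W:5 /MT:16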

Have you considered using rsync? It compares the two sides by computing hashes over chunks of each file and then sends only the deltas between the sites.
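A basic invocation would be a sketch like this (assuming local paths; -a preserves attributes and recurses, -v is verbose, and the trailing slash on the source copies its contents rather than the directory itself):
rsync -av /source/ /destination/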

Microsoft SyncToy is in my experience very good at handling ridiculous numbers of files. And it's very easy to use.

Teracopy will do this I think.
http://www.codesector.com/teracopy.php
Or, if on *nix, try cuteftp.

If you ask me, the best way to copy is with the plain system tools.
Just something like:
cp -pvr /pathtoolddir /pathtonewdir
on a Linux box will do the job and work great. Any compression in between will just slow the process down.

Related

rsync on fat32 and ntfs

A little background: I have tried to use rsync to backup my wife's home directory to an external usb drive with the command
rsync -va /home/wife /run/media/wife
but kept getting error messages that mkstemp failed, and that rsync failed to set times, because of a read-only filesystem. Worse, it seems that rsync is unable to tell when files don't need syncing and winds up copying a lot of stuff it doesn't need to, resulting in ridiculously slow backup times.
So I tried using rsync -rtvO instead, based on this guy's advice. Okay, no more warnings, but the backups still seem too slow, especially on big media files that already exist -- i.e. it's still copying stuff unnecessarily.
Is my analysis correct?
Is there a workaround?
Will the problem be fixed if I use an NTFS drive for her backups?
I could of course use a Linux filesystem, but on rare occasions she would like to be able to take the drive to work and access it from the Windows machines there.
Try using --modify-window=1
In particular, when transferring to or from an MS Windows FAT filesystem (which represents times with a 2-second resolution), --modify-window=1 is useful (allowing times to differ by up to 1 second).
https://download.samba.org/pub/rsync/rsync.html
You could also try using --size-only
skip files that match in size
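For example, a size-only run might look like this (a sketch, reusing the placeholder paths from the command below):
rsync -rv --size-only source/ destination/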
For rsync to FAT, this is what I use and it seems to work pretty well:
rsync -rtv --modify-window=1 source/ destination/
Source: https://serverfault.com/a/144475/58568

Traverse Zipped Filesystem from Command Line

Is there a way to traverse the files & folders inside an archive? For example, if I have a file my-zip-file.zip, could I do
ls -l my-zip-file.zip
or even
cd my-zip-file
I know there is the tar command and the command-line version of 7-Zip, but it seems like you can only do these things from outside the archive. Also, with grep you can pretty much simulate the ls situation from this question, but much more slowly and, again, only from outside the archive.
With the GUI version of 7-Zip, you can do pretty much all of this, just with a different shell, so I am looking for a command-line version. From this question that I asked, it seems 7-Zip does this by creating temporary folders to hold the represented files & folders, so this might be a bottleneck.
I would like this solution to be cross-platform, but I understand if that's not possible.
Yes, you can effectively mount a zip file on the file system using AVFS.
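A sketch of how that might look (assuming AVFS is installed; mountavfs exposes a mirror of the filesystem under ~/.avfs, and archive contents are reached by appending # to the archive name):
mountavfs
ls -l ~/.avfs$PWD/my-zip-file.zip#
cd ~/.avfs$PWD/my-zip-file.zip#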

How to "defragment" a directory on ext3?

I am running a daemon which analyses files in a directory and then deletes them. In case the daemon is not running for whatever reason, files get stacked there. Today I had 90k files in that directory. After starting the daemon again, it processed all the files.
However, the directory remains large; "ls -dh ." returns a size of 5.6M. How can I "defragment" that directory? I already figured out that renaming the directory and creating a new one with the same name and permissions solves the problem. However, as files get written there at any time, there doesn't seem to be a safe way to rename the directory and create a new one, since for a moment the target directory would not exist.
So a) is there a way/a (shell) program which can defragment directories on an ext3 filesystem? or b) is there a way to create a lock on a directory so that trying to write files blocks until the rename/create has finished?
"Optimize directories in filesystem. This option causes e2fsck to try to optimize all directories, either by reindexing them if the filesystem supports directory indexing, or by sorting and compressing directories for smaller directories, or for filesystems using traditional linear directories." -- fsck.ext3 -D
Of course this should not be done on a mounted filesystem.
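In practice that means something like this (a sketch; /dev/sdXN is a placeholder for the device holding the filesystem, and -f forces a check even if it looks clean):
umount /dev/sdXN
e2fsck -fD /dev/sdXN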
Not really applicable for Ext3, but maybe useful for users of other filesystems:
According to https://wiki.archlinux.org/index.php/Btrfs#Defragmentation, with Btrfs it is apparently possible to defragment the metadata of a directory: btrfs filesystem defragment / will defragment the metadata of the root folder, using Btrfs's online defragmentation support.
Ext4 does support online defragmentation (with e4defrag), but this doesn't seem to apply to directory metadata (according to http://sourceforge.net/p/e2fsprogs/bugs/326/).
I haven't tried either of these solutions, though.
I'm not aware of a way to reclaim free space from within a directory.
5MB isn't very much space, so it may be easiest to just ignore it. If this problem (files stacking up in the directory) occurs on a regular basis, then that space will be reused anytime the directory fills up again.
If you desperately need the ability to shrink the directory here's a (ugly) hack that might work.
Replace the directory with a symbolic link to an empty directory. If the problem reoccurs, you can create a new empty directory and then change the symlink to point to it. Changing the symlink should be atomic, so you won't lose any incoming files. Then you can safely empty and delete the old directory.
[Edited to add: It turns out that this does not work. As Bada points out in the comments you can't atomically change a symlink in the way I suggested. This leaves me with my original point. File systems I'm familiar with don't provide a mechanism to reclaim free space within directory blocks.]

Read and write directly from and to compressed files in C

In Java, I think it is possible to browse through JAR files as if they were not compressed. Is there something similar (and portable) in C/C++?
I would like to read binary data into memory from a large (zipped or similar) file without decompressing it to disk first, and afterwards write it back to disk in a compressed form.
Maybe some trick with shell pipes and the zip utility?
I think you want zlib:
http://www.zlib.net/
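If the shell-pipe route mentioned in the question is enough, here is a sketch using Info-ZIP's unzip (-p streams one entry of the archive to stdout) and gzip for the write side; my_program and the file names are placeholders:
unzip -p archive.zip data.bin | ./my_program
./my_program | gzip -c > results.bin.gz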

Apply file structure diff/patch on remote system?

Is there a tool that creates a diff of a file structure, perhaps based on an MD5 manifest? My goal is to send a package across the wire that contains new/updated files and a list of files to remove. It needs to copy over the new/updated files and remove the files that have been deleted on the source file structure.
You might try rsync. Depending on your needs, the command might be as simple as this:
rsync -az --del /path/to/master dup-site:/path/to/duplicate
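Since --del removes files on the receiving side, a dry run first can be a useful sanity check (-n shows what would change without doing anything):
rsync -azn --del /path/to/master dup-site:/path/to/duplicate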
Quoting from rsync's web site:
rsync is an open source utility that provides fast incremental file transfer. rsync is freely available under the GNU General Public License and is currently being maintained by Wayne Davison.
Or, if you prefer wikipedia:
rsync is a software application for Unix systems which synchronizes files and directories from one location to another while minimizing data transfer using delta encoding when appropriate. An important feature of rsync not found in most similar programs/protocols is that the mirroring takes place with only one transmission in each direction. rsync can copy or display directory contents and copy files, optionally using compression and recursion.
#vfilby I'm in the process of implementing something similar.
I've been using rsync for a while, but it gets funky when deploying to a remote server with permission changes that are out of my control. With rsync you can choose not to include permissions, but they still end up being considered for some reason.
I'm now using git diff. This works very well for text files. Diff generates patches, rather than a MANIFEST that you have to include with your files. The nice thing about patches is that there is already an established framework for using and testing them before they're applied.
For example, with the patch utility that comes standard on any *nix box, you can run the patch in dry-run mode. This will tell you whether the patch you're about to apply will actually apply cleanly before you run it. It helps you make sure that the files you're updating have not changed while you were preparing the patch.
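A sketch of that workflow (the patch file name is a placeholder; --dry-run makes patch report whether the changes would apply cleanly without touching anything):
git diff > changes.patch
patch -p1 --dry-run < changes.patch
patch -p1 < changes.patch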
If this is similar to what you're looking for, I can elaborate on my process.
