Split file occupying the same memory space as source file - file

I have a file, say 100MB in size. I need to split it into (for example) 4 different parts.
Let's say first file from 0-20MB, second 20-60MB, third 60-70MB and last 70-100MB.
But I do not want to do a safe split - into 4 output files. I would like to do it in place. So the output files should use the same place on the hard disk that is occupied by this one source file, and literally split it, without making a copy (so at the moment of split, we should loose the original file).
In other words, the input file is the output files.
Is this possible, and if yes, how?
I was thinking maybe to manually add a record to the filesystem, that a file A starts here, and ends here (in the middle of another file), do it 4 times and afterwards remove the original file. But for that I would probably need administrator privileges, and probably wouldn't be safe or healthy for the filesystem.
Programming language doesn't matter, I'm just interested if it would be possible.

The idea is not so mad as some comments paint it. It would certainly be possible to have a file system API that supports such reinterpreting operations (to be sure, the desired split is probably not exacly aligned to block boundaries, but you could reallocate just those few boundary blocks and still save a lot of temporary space).
None of the common file system abstraction layers support this; but recall that they don't even support something as reasonable as "insert mode" (which would rewrite only one or two blocks when you insert something into the middle of a file, instead of all blocks), only an overwrite and an append mode. The reasons for that are largely historical, but the current model is so entrenched that it is unlikely a richer API will become common any time soon.

As I explain in this question on SuperUser, you can achieve this using the technique outlined by Tom Zych in his comment.
bigfile="mybigfile-100Mb"
chunkprefix="chunk_"
# Chunk offsets
OneMegabyte=1048576
chunkoffsets=(0 $((OneMegabyte*20)) $((OneMegabyte*60)) $((OneMegabyte*70)))
currentchunk=$((${#chunkoffsets[#]}-1))
while [ $currentchunk -ge 0 ]; do
# Print current chunk number, so we know it is still running.
echo -n "$currentchunk "
offset=${chunkoffsets[$currentchunk]}
# Copy end of $archive to new file
tail -c +$((offset+1)) "$bigfile" > "$chunkprefix$currentchunk"
# Chop end of $archive
truncate -s $offset "$archive"
currentchunk=$((currentchunk-1))
done
You need to give the script the starting position (offset in bytes, zero means a chunk starting at bigfile's first byte) of each chunk, in ascending order, like on the fifth line.
If necessary, automate it using seq : The following command will give a chunkoffsets with one chunk at 0, then one starting at 100k, then one for every megabyte for the range 1--10Mb, (note the -1 for the last parameter, so it is excluded) then one chunk every two megabytes for the range 10--20Mb.
OneKilobyte=1024
OneMegabyte=$((1024*OneKilobyte))
chunkoffsets=(0 $((100*OneKilobyte)) $(seq $OneMegabyte $OneMegabyte $((10*OneMegabyte-1))) $(seq $((10*OneMegabyte-1)) $((2*OneMegabyte)) $((20*OneMegabyte-1))))
To see which chunks you have set :
for offset in "${chunkoffsets[#]}"; do echo "$offset"; done
0
102400
1048576
2097152
3145728
4194304
5242880
6291456
7340032
8388608
9437184
10485759
12582911
14680063
16777215
18874367
20971519
This technique has the drawback that it needs at least the size of the largest chunk available (you can mitigate that by making smaller chunks, and concatenating them somewhere else, though). Also, it will copy all the data, so it's nowhere near instant.
As to the fact that some hardware video recorders (PVRs) manage to split videos within seconds, they probably only store a list of offsets for each video (a.k.a. chapters), and display these as independent videos in their user interface.

Related

File Browser in C for POSIX OS

I have created a file browsing UI for an embedded device. On the embedded side, I am able to get all files in a directory off the hard disk and return stats such as name, size, modified, etc. This is done using opendir and closedir and a while loop that goes through every file until no files are left.
This is cool, until file counts reach large quantities. I need to implement pagination and sorting. Suppose I have 10,000 files in a directory - how can I possibly go through this amount of files and sort based on size, name, etc, without easily busting the RAM (about 1mb of RAM... !). Perhaps something already exists within the Hard Drive OS or drivers?
Here's two suggestions, both of which have small memory footprint. The first will use no more memory that the number of results you wish to return for the request. It's a constant-time O(1) memory - it only depends on the size of the result set but is ultimately quadratic time (or worse) if the user really does page through all results:
You are only looking for a small paged result (e.g the r=25 entries). You can generate these by scanning through all filenames and maintaining a sorted list of items you will return, using an insertion sort of length r and for each file inserted, only retain the first r results. (In practice you would not insert the file F if it is lower than the rth entry).
How would you generate the 2nd page of results? You already know the 25th file from the previous request - so during the scan ignore all entries that are before that. (You'll need to work harder if sorting on fields with duplicates)
The upside is the minimum memory required - the memory needed is not much larger than the r results you wish to return (and can even be less if you don't cache the names). The downside is generating the complete result will be quadratic in time in terms of the number of total files you have. In practice people don't sort results then page through all pages, so this may be acceptable.
If your memory budget is larger (e.g. fewer than 10000 files) but you still don't have enough space to perform a simple in-memory sort of all 10000 filenames then seekdir/telldir is your friend. i.e. create an array of longs by streaming readdir and using telldir to capture the position of each entry. (you might even be able to compress the delta between each telldir to a 2 byte short). As a minimal implementation you can then sort 'em all with clib's sort function and writing your own callback to convert a location into a comparable value. Your call back will use seekdir twice to read the two filenames.
The above approach is overkill - you just sorted all entries and you only needed one page of ~25, so for fun why not read up on Hoare's QuickSelect algorithm and use a version of it to identify the results within the required range. You can recursive ignore all entries outside the required range and only sort the entries between the first and last entry of the results.
What you want is an external sort, that's a sort done with external resources, usually on disk. The Unix sort command does this. Typically this is done with an external merge sort.
The algorithm is basically this. Let's assume you want to dedicate 100k of memory to this (the more you dedicate, the fewer disk operations, the faster it will go).
Read 100k of data into memory (ie. call readdir a bunch).
Sort that 100k hunk in-memory.
Write the hunk of sorted data to its own file on disk.
You can also use offsets in a single file.
GOTO 1 until all hunks are sorted.
Now you have X hunks of 100k on disk an each of them are sorted. Let's say you have 9 hunks. To keep within the 100k memory limit, we'll divide the work up into the number of hunks + 1. 9 hunks, plus 1, is 10. 100k / 10 is 10k. So now we're working in blocks of 10k.
Read the first 10k of each hunk into memory.
Allocate another 10k (or more) as a buffer.
Do an K-way merge on the hunks.
Write the smallest in any hunk to the buffer. Repeat.
When the buffer fills, append it to a file on disk.
When a hunk empties, read the next 10k from that hunk.
When all hunks are empty, read the resulting sorted file.
You might be able to find a library to perform this for you.
Since this is obviously overkill for the normal case of small lists of files, only use it if there are more files in the directory than you care to have in memory.
Both in-memory sort and external sort begin with the same step: start calling readdir and writing to a fixed-sized array. If you run out of files before running out of space, just do an in-memory quicksort on what you've read. If you run out of space, this is now the first hunk of an external sort.

Cut a file in parts (without using more space) [duplicate]

I have a file, say 100MB in size. I need to split it into (for example) 4 different parts.
Let's say first file from 0-20MB, second 20-60MB, third 60-70MB and last 70-100MB.
But I do not want to do a safe split - into 4 output files. I would like to do it in place. So the output files should use the same place on the hard disk that is occupied by this one source file, and literally split it, without making a copy (so at the moment of split, we should loose the original file).
In other words, the input file is the output files.
Is this possible, and if yes, how?
I was thinking maybe to manually add a record to the filesystem, that a file A starts here, and ends here (in the middle of another file), do it 4 times and afterwards remove the original file. But for that I would probably need administrator privileges, and probably wouldn't be safe or healthy for the filesystem.
Programming language doesn't matter, I'm just interested if it would be possible.
The idea is not so mad as some comments paint it. It would certainly be possible to have a file system API that supports such reinterpreting operations (to be sure, the desired split is probably not exacly aligned to block boundaries, but you could reallocate just those few boundary blocks and still save a lot of temporary space).
None of the common file system abstraction layers support this; but recall that they don't even support something as reasonable as "insert mode" (which would rewrite only one or two blocks when you insert something into the middle of a file, instead of all blocks), only an overwrite and an append mode. The reasons for that are largely historical, but the current model is so entrenched that it is unlikely a richer API will become common any time soon.
As I explain in this question on SuperUser, you can achieve this using the technique outlined by Tom Zych in his comment.
bigfile="mybigfile-100Mb"
chunkprefix="chunk_"
# Chunk offsets
OneMegabyte=1048576
chunkoffsets=(0 $((OneMegabyte*20)) $((OneMegabyte*60)) $((OneMegabyte*70)))
currentchunk=$((${#chunkoffsets[#]}-1))
while [ $currentchunk -ge 0 ]; do
# Print current chunk number, so we know it is still running.
echo -n "$currentchunk "
offset=${chunkoffsets[$currentchunk]}
# Copy end of $archive to new file
tail -c +$((offset+1)) "$bigfile" > "$chunkprefix$currentchunk"
# Chop end of $archive
truncate -s $offset "$archive"
currentchunk=$((currentchunk-1))
done
You need to give the script the starting position (offset in bytes, zero means a chunk starting at bigfile's first byte) of each chunk, in ascending order, like on the fifth line.
If necessary, automate it using seq : The following command will give a chunkoffsets with one chunk at 0, then one starting at 100k, then one for every megabyte for the range 1--10Mb, (note the -1 for the last parameter, so it is excluded) then one chunk every two megabytes for the range 10--20Mb.
OneKilobyte=1024
OneMegabyte=$((1024*OneKilobyte))
chunkoffsets=(0 $((100*OneKilobyte)) $(seq $OneMegabyte $OneMegabyte $((10*OneMegabyte-1))) $(seq $((10*OneMegabyte-1)) $((2*OneMegabyte)) $((20*OneMegabyte-1))))
To see which chunks you have set :
for offset in "${chunkoffsets[#]}"; do echo "$offset"; done
0
102400
1048576
2097152
3145728
4194304
5242880
6291456
7340032
8388608
9437184
10485759
12582911
14680063
16777215
18874367
20971519
This technique has the drawback that it needs at least the size of the largest chunk available (you can mitigate that by making smaller chunks, and concatenating them somewhere else, though). Also, it will copy all the data, so it's nowhere near instant.
As to the fact that some hardware video recorders (PVRs) manage to split videos within seconds, they probably only store a list of offsets for each video (a.k.a. chapters), and display these as independent videos in their user interface.

Binary Search on Large Disk File in C - Problems

This question recurs frequently on StackOverflow, but I have read all the previous relevant answers, and have a slight twist on the question.
I have a 23Gb file containing 475 million lines of equal size, with each line consisting of a 40-character hash code followed by an identifier (an integer).
I have a stream of incoming hash codes - billions of them in total - and for each incoming hash code I need to locate it and print out corresponding identifier. This job, while large, only needs to be done once.
The file is too large for me to read into memory and so I have been trying to usemmap in the following way:
codes = (char *) mmap(0,statbuf.st_size,PROT_READ,MAP_SHARED,codefile,0);
Then I just do a binary search using address arithmetic based on the address in codes.
This seems to start working beautifully and produces a few million identifiers in a few seconds, using 100% of the cpu, but then after some, seemingly random, amount of time it slows down to a crawl. When I look at the process using ps, it has changed from status "R" using 100% of the cpu, to status "D" (diskbound) using 1% of the cpu.
This is not repeatable - I can start the process off again on the same data, and it might run for 5 seconds or 10 seconds before the "slow to crawl" happens. Once last night, I got nearly a minute out of it before this happened.
Everything is read only, I am not attempting any writes to the file, and I have stopped all other processes (that I control) on the machine. It is a modern Red Hat Enterprise Linux 64-bit machine.
Does anyone know why the process becomes disk-bound and how to stop it?
UPDATE:
Thanks to everyone for answering, and for your ideas; I had not previously tried all the various improvements before because I was wondering if I was somehow using mmap incorrectly. But the gist of the answers seemed to be that unless I could squeeze everything into memory, I would inevitable run into problems. So I squashed the size of the hash code to the size of the leading prefix that did not create any duplicates - the first 15 characters were enough. Then I pulled the resulting file into memory, and ran the incoming hash codes in batches of about 2 billion each.
The first thing to do is split the file.
Make one file with the hash-codes and another with the integer ids. Since the rows are the same then it will line up fine after the result is found. Also you can try an approach that puts every nth hash into another file and then stores the index.
For example, every 1000th hash key put into a new file with the index and then load that into memory. Then binary scan that instead. This will tell you the range of 1000 entries that need to be further scanned in the file. Yes that will do it fine! But probably much less than that. Like probably every 20th record or so will divide that file size down by 20 +- if I am thinking good.
In other words after scanning you only need to touch a few kilobytes of the file on disk.
Another option is to split the file and put it in memory on multiple machines. Then just binary scan each file. This will yield the absolute fastest possible search with zero disk access...
Have you considered hacking a PATRICIA trie algorithm up? It seems to me that if you can build a PATRICIA tree representation of your data file, which refers to the file for the hash and integer values, then you might be able to reduce each item to node pointers (2*64 bits?), bit test offsets (1 byte in this scenario) and file offsets (uint64_t, which might need to correspond to multiple fseek()s).
Does anyone know why the process becomes disk-bound and how to stop it?
Binary search requires a lot of seeking within the file. In the case where the whole file doesn't fit in memory, the page cache doesn't handle the big seeks very well, resulting in the behaviour you're seeing.
The best way to deal with this is to reduce/prevent the big seeks and make the page cache work for you.
Three ideas for you:
If you can sort the input stream, you can search the file in chunks, using something like the following algorithm:
code_block <- mmap the first N entries of the file, where N entries fit in memory
max_code <- code_block[N - 1]
while(input codes remain) {
input_code <- next input code
while(input_code > max_code) {
code_block <- mmap the next N entries of the file
max_code <- code_block[N - 1]
}
binary search for input code in code_block
}
If you can't sort the input stream, you could reduce your disk seeks by building an in-memory index of the data. Pass over the large file, and make a table that is:
record_hash, offset into file where this record starts
Don't store all records in this table - store only every Kth record. Pick a large K, but small enough that this fits in memory.
To search the large file for a given target hash, do a binary search in the in-memory table to find the biggest hash in the table that is smaller than the target hash. Say this is table[h]. Then, mmap the segment starting at table[h].offset and ending at table[h+1].offset, and do a final binary search. This will dramatically reduce the number of disk seeks.
If this isn't enough, you can have multiple layers of indexes:
record_hash, offset into index where the next index starts
Of course, you'll need to know ahead of time how many layers of index there are.
Lastly, if you have extra money available you can always buy more than 23 gb of RAM, and make this a memory bound problem again (I just looked at Dell's website - you pick up a new low-end workstation with 32 GB of RAM for just under $1,400 Australian dollars). Of course, it will take a while to read that much data in from disk, but once it's there, you'll be set.
Instead of using mmap, consider just using plain old lseek+read. You can define some helper functions to read a hash value or its corresponding integer:
void read_hash(int line, char *hashbuf) {
lseek64(fd, ((uint64_t)line) * line_len, SEEK_SET);
read(fd, hashbuf, 40);
}
int read_int(int line) {
lseek64(fd, ((uint64_t)line) * line_len + 40, SEEK_SET);
int ret;
read(fd, &ret, sizeof(int));
return ret;
}
then just do your binary search as usual. It might be a bit slower, but it won't start chewing up your virtual memory.
We don't know the back story. So it is hard to give you definitive advice. How much memory do you have? How sophisticated is your hard drive? Is this a learning project? Who's paying for your time? 32GB of ram doesn't seem so expensive compared to two days of work of person that makes $50/h. How fast does this need to run? How far outside the box are you willing to go? Does your solution need to use advanced OS concepts? Are you married to a program in C? How about making Postgres handle this?
Here's is a low risk alternative. This option isn't as intellectually appealing as the other suggestions but has the potential to give you significant gains. Separate the file into 3 chunks of 8GB or 6 chunks of 4GB (depending on the machines you have around, it needs to comfortably fit in memory). On each machine run the same software, but in memory and put an RPC stub around each. Write an RPC caller to each of your 3 or 6 workers to determine the integer associated with a given hash code.

How can I predict the size of an ISO 9660 filesystem?

I'm archiving data to DVD, and I want to pack the DVDs full. I know the names and sizes of all the files I want on the DVD, but I don't know how much space is taken up by metadata. I want to get as many files as possible onto each DVD, so I'm using a Bubblesearch heuristic with greedy bin-packing. I try 10,000 alternatives and get the best one. Currently I know the sizes of all the files and because I don't know how files are stored in an ISO 9660 filesystem, I add a lot of slop for metadata. I'd like to cut down the slop.
I could use genisoimage -print-size except it is too slow---given 40,000 files occupying 500MB, it takes about 3 seconds. Taking 8 hours per DVD is not in the cards. I've modified the genisoimage source before and am really not keen to try to squeeze the algorithm out of the source code; I am hoping someone knows a better way to get an estimate or can point me to a helpful specification.
Clarifying the problem and the question:
I need to burn archives that split across multiple DVDs, typically around five at a time. The problem I'm trying to solve is to decide which files to put on each DVD, so that each DVD (except the last) is as full as possible. This problem is NP-hard.
I'm using the standard greedy packing algorithm where you place the largest file first and you put it in the first DVD having sufficient room. So j_random_hacker, I am definitely not starting from random. I start from sorted and use Bubblesearch to perturb the order in which the files are packed. This procedure improves my packing from around 80% of estimated capacity to over 99.5% of estimated capacity. This question is about doing a better job of estimating the capacity; currently my estimated capacity is lower than real capacity.
I have written a program that tries 10,000 perturbations, each of which involves two steps:
Choose a set of files
Estimate how much space those files will take on DVD
Step 2 is the step I'm trying to improve. At present I am "erring on the side of caution" as Tyler D suggests. But I'd like to do better. I can't afford to use genisomage -print-size because it's too slow. Similarly, I can't tar the files to disk, because on only is it too slow, but a tar file is not the same size as an ISO 9660 image. It's the size of the ISO 9660 image I need to predict. In principle this could be done with complete accuracy, but I don't know how to do it. That's the question.
Note: these files are on a machine with 3TB of hard-drive storage. In all cases the average size of the files is at least 10MB; sometimes it is significantly larger. So it is possible that genisomage will be fast enough after all, but I doubt it---it appears to work by writing the ISO image to /dev/null, and I can't imagine that will be fast enough when the image size approaches 4.7GB. I don't have access to that machine right now, or when I posted the original question. When I do have access in the evening I will try to get better numbers for the question. But I don't think genisomage is going to be a good solution---although it might be a good way to learn a model of the filesystem
that tells me how how it works. Knowing that block size is 2KB is already helpful.
It may also be useful to know that files in the same directory are burned to the samae DVD, which simplifies the search. I want access to the files directly, which rules out tar-before-burning. (Most files are audio or video, which means there's no point in trying to hit them with gzip.)
Thanks for the detailed update. I'm satisfied that your current bin-packing strategy is pretty efficient.
As to the question, "Exactly how much overhead does an ISO 9660 filesystem pack on for n files totalling b bytes?" there are only 2 possible answers:
Someone has already written an efficient tool for measuring exactly this. A quick Google search turned up nothing however which is discouraging. It's possible someone on SO will respond with a link to their homebuilt tool, but if you get no more responses for a few days then that's probably out too.
You need to read the readily available ISO 9660 specs and build such a tool yourself.
Actually, there is a third answer:
(3) You don't really care about using every last byte on each DVD. In that case, grab a small representative handful of files of different sizes (say 5), pad them till they are multiples of 2048 bytes, and put all 2^5 possible subsets through genisoimage -print-size. Then fit the equation nx + y = iso_size - total_input_size on that dataset, where n = number of files in a given run, to find x, which is the number of bytes of overhead per file, and y, which is the constant amount of overhead (the size of an ISO 9660 filesystem containing no files). Round x and y up and use that formula to estimate your ISO filesystem sizes for a given set of files. For safety, make sure you use the longest filenames that appear anywhere in your collection for the test filenames, and put each one under a separate directory hierarchy that is as deep as the deepest hierarchy in your collection.
I'm not sure exactly how you are currently doing this -- according to my googling, "Bubblesearch" refers to a way to choose an ordering of items that is in some sense near a greedy ordering, but in your case, the order of adding files to a DVD does not change the space requirements so this approach wastes time considering multiple different orders that amount to the same set of files.
In other words, if you are doing something like the following to generate a candidate file list:
Randomly shuffle the list of files.
Starting from the top of the list, greedily choose all files that you estimate will fit on a DVD until no more will.
Then you are searching the solution space inefficiently -- for any final candidate set of n files, you are potentially considering all n! ways of producing that set. My suggestion:
Sort all files in decreasing order of file size.
Mark the top (largest) file as "included," and remove it from the list. (It must be included on some DVD, so we might as well include it now.)
Can the topmost file in the list be included without the (estimated) ISO filesystem size exceeding the DVD capacity? If so:
With probability p (e.g. p = 0.5), mark the file as "included".
Remove the topmost file from the list.
If the list is now empty, you have a candidate list of files. Otherwise, goto 3.
Repeat this many times and choose the best file list.
Tyler D's suggestion is also good: if you have ~40000 files totalling ~500Mb, that means an average file size of 12.5Kb. ISO 9660 uses a block size of 2Kb, meaning those files are wasting on average 1Kb of disk space, or about 8% of their size. So packing them together with tar first will save around 8% of space.
Can't use tar to store the files on disk?
It's unclear if you're writing a program to do this, or simply making some backups.
Maybe do some experimentation and err on the side of caution - some free space on a disk wouldn't hurt.
Somehow I imagine you've already considered these, or that my answer is missing the point.
I recently ran an experiment to find a formula to do a similar filling estimate on dvds, and found a simple formula given some assumptions... from your original post this formula will likely be a low number for you, it sounds like you have multiple directories and longer file names.
Assumptions:
all the files are exactly 8.3 characters.
all the files are in the root directory.
no extensions such as Joliet.
The formula:
174 + floor(count / 42) + sum( ceil(file_size / 2048) )
count is the number of files
file_size is each file's size in bytes
the result is in 2048 byte blocks.
An example script:
#!/usr/bin/perl -w
use strict;
use POSIX;
sub sum {
my $out = 0;
for(#_) {
$out += $_;
}
return $out;
}
my #sizes = ( 2048 ) x 1000;
my $file_count = #sizes;
my $data_size = sum(map { ceil($_ / 2048) } #sizes);
my $dir_size = floor( $file_count / 42 ) + 1;
my $overhead = 173;
my $size = $overhead + $dir_size + $data_size;
$\ = "\n";
print $size;
I verified this on disks with up to 150k files, with sizes ranging from 200 bytes to 1 MiB.
Nice thinking, J. Random. Of course I don't need every last byte, this is mostly for fun (and bragging rights at lunch). I want to be able to type du at the CD-ROM and have it very close to 4700000000.
I looked at the ECMA spec but like most specs it's medium painful and I have no confidence in my ability to get it right. Also it appears not to discuss Rock Ridge extensions, or if it does, I missed it.
I like your idea #3 and think I will carry it a bit further: I'll try to build a fairly rich model of what's going on and then use genisoimage -print-size on a number of filesets to estimate the parameters of the model. Then I can use the model to do my estimation. This is a hobby project so it will take a while, but I will get around to it eventually. I will post an answer here to say how much wastage is eliminated!

How can you concatenate two huge files with very little spare disk space? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
Suppose that you have two huge files (several GB) that you want to concatenate together, but that you have very little spare disk space (let's say a couple hundred MB). That is, given file1 and file2, you want to end up with a single file which is the result of concatenating file1 and file2 together byte-for-byte, and delete the original files.
You can't do the obvious cat file2 >> file1; rm file2, since in between the two operations, you'd run out of disk space.
Solutions on any and all platforms with free or non-free tools are welcome; this is a hypothetical problem I thought up while I was downloading a Linux ISO the other day, and the download got interrupted partway through due to a wireless hiccup.
time spent figuring out clever solution involving disk-sector shuffling and file-chain manipulation: 2-4 hours
time spent acquiring/writing software to do in-place copy and truncate: 2-20 hours
times median $50/hr programmer rate: $400-$1200
cost of 1TB USB drive: $100-$200
ability to understand the phrase "opportunity cost": priceless
I think the difficulty is determining how the space can be recovered from the original files.
I think the following might work:
Allocate a sparse file of the
combined size.
Copy 100Mb from the end of the second file to the end of the new file.
Truncate 100Mb of the end of the second file
Loop 2&3 till you finish the second file (With 2. modified to the correct place in the destination file).
Do 2&3&4 but with the first file.
This all relies on sparse file support, and file truncation freeing space immediately.
If you actually wanted to do this then you should investigate the dd command. which can do the copying step
Someone in another answer gave a neat solution that doesn't require sparse files, but does copy file2 twice:
Copy 100Mb chunks from the end of file 2 to a new file 3, ending up in reverse order. Truncating file 2 as you go.
Copy 100Mb chunks from the end of file 3 into file 1, ending up with the chunks in their original order, at the end of file 1. Truncating file 3 as you go.
Here's a slight improvement over my first answer.
If you have 100MB free, copy the last 100MB from the second file and create a third file. Truncate the second file so it is now 100MB smaller. Repeat this process until the second file has been completely decomposed into individual 100MB chunks.
Now each of those 100MB files can be appended to the first file, one at a time.
With those constraints I expect you'd need to tamper with the file system; directly edit the file size and allocation blocks.
In other words, forget about shuffling any blocks of file content around, just edit the information about those files.
if the file is highly compressible (ie. logs):
gzip file1
gzip file2
zcat file1 file2 | gzip > file3
rm file1
rm file2
gunzip file3
At the risk of sounding flippant, have you considered the option of just getting a bigger disk? It would probably be quicker...
Not very efficient, but I think it can be done.
Open the first file in append mode, and copy blocks from the second file to it until the disk is almost full. For the remainder of the second file, copy blocks from the point where you stopped back to the beginning of the file via random access I/O. Truncate the file after you've copied the last block. Repeat until finished.
Obviously, the economic answer is buy more storage assuming that's a possible answer. It might not be, though--embedded system with no way to attach more storage, or even no access to the equipment itself--say, space probe in flight.
The previously presented answer based on the sparse file system is good (other than the destructive nature of it if something goes wrong!) if you have a sparse file system. What if you don't, though?
Starting from the end of file 2 copy blocks to the start of the target file reversing them as you go. After each block you truncate the source file to the uncopied length. Repeat for file #1.
At this point the target file contains all the data backwards, the source files are gone.
Read a block from the tart and from the end of the target file, reverse them and write them to the spot the other came from. Work your way inwards flipping blocks.
When you are done the target file is the concatenation of the source files. No sparse file system needed, no messing with the file system needed. This can be carried out at zero bytes free as the data can be held in memory.
ok, for theoretical entertainment, and only if you promise not to waste your time actually doing it:
files are stored on disk in pieces
the pieces are linked in a chain
So you can concatenate the files by:
linking the last piece of the first file to the first piece of the last file
altering the directory entry for the first file to change the last piece and file size
removing the directory entry for the last file
cleaning up the first file's end-of-file marker, if any
note that if the last segment of the first file is only partially filled, you will have to copy data "up" the segments of the last file to avoid having garbage in the middle of the file [thanks #Wedge!]
This would be optimally efficient: minimal alterations, minimal copying, no spare disk space required.
now go buy a usb drive ;-)
Two thoughts:
If you have enough physical RAM, you could actually read the second file entirely into memory, delete it, then write it in append mode to the first file. Of course if you lose power after deleting but before completing the write, you've lost part of the second file for good.
Temporarily reduce disk space used by OS functionality (e.g. virtual memory, "recycle bin" or similar). Probably only of use on Windows.
I doubt this is a direct answer to the question. You can consider this as an alternative way to solve the problem.
I think it is possible to consider 2nd file as the part 2 of the first file. Usually in zip application, we would see a huge file is split into multiple parts. If you open the first part, the application would automatically consider the other parts in further processing.
We can simulate the same thing here. As #edg pointed out, tinkering file system would be one way.
you could do this:
head file2 --bytes=1024 >> file1 && tail --bytes=+1024 file2 >file2
you can increase 1024 according to how much extra disk space you have, then just repeat this until all the bytes have been moved.
This is probably the fastest way to do it (in terms of development time)
You may be able to gain space by compressing the entire file system. I believe NTFS supports this, and I am sure there are flavors of *nix file systems that would support it. It would also have the benefit of after copying the files you would still have more disk space left over than when you started.
OK, changing the problem a little bit. Chances are there's other stuff on the disk that you don't need, but you don't know what it is or where it is. If you could find it, you could delete it, and then maybe you'd have enough extra space.
To find these "tumors", whether a few big ones, or lots of little ones, I use a little sampling program. Starting from the top of a directory (or the root) it makes two passes. In pass 1, it walks the directory tree, adding up the sizes of all the files to get a total of N bytes. In pass 2, it again walks the directory tree, pretending it is reading every file. Every time it passes N/20 bytes, it prints out the directory path and name of the file it is "reading". So the end result is 20 deep samples of path names uniformly spread over all the bytes under the directory.
Then just look at that list for stuff that shows up a lot that you don't need, and go blow it away.
(It's the space-equivalent of the sampling method I use for performance optimization.)
"fiemap"
http://www.mjmwired.net/kernel/Documentation/filesystems/fiemap.txt

Resources