Checking hard disk space before file writes - C

I have a program which simulates the orbits of planets. The information for each time step is written to a .txt file, with each iteration requiring 64 bytes (8 doubles). The time step and the final time are both chosen by the user, which allows me to calculate the amount of disk space required. E.g., a step of 10 with a final time of 1000 gives 100 sets of info, implying at least 6400 bytes on disk.
Is there a way of using this information to, for lack of a better word, check the drive to see if there is enough space before allowing the program to write to the file? I would like to prevent files which are too large from being written to disk. Ideally this should be standard C if possible.

If you know in advance exactly how much space you need, you can try the following for all the needed files:
fopen the file and fill it with blanks until you reach the exact size you need.
fflush the file and check for error.
rewind the file pointer to the beginning and overwrite the content with the real data.
In case of error when fflush'ing, remove all files.
When you're finished, fclose the files.
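A minimal sketch of those steps in standard C (the function name `reserve_file` and the 4 KiB fill chunk are my own choices; error handling is kept to the essentials):

```c
#include <stdio.h>

/* Reserve `nbytes` on disk by filling the file with zeros.
 * Returns the open FILE* rewound to the start on success,
 * or NULL (with the partial file removed) on failure. */
FILE *reserve_file(const char *path, long nbytes)
{
    FILE *f = fopen(path, "wb+");
    if (!f)
        return NULL;

    char zeros[4096] = {0};
    long left = nbytes;
    while (left > 0) {
        size_t chunk = left < (long)sizeof zeros ? (size_t)left : sizeof zeros;
        if (fwrite(zeros, 1, chunk, f) != chunk)
            goto fail;          /* disk full or other write error */
        left -= (long)chunk;
    }

    /* fwrite may only have buffered the data; fflush pushes it to
     * the OS so a full disk is reported here rather than later. */
    if (fflush(f) != 0)
        goto fail;

    rewind(f);                  /* ready to overwrite with real data */
    return f;

fail:
    fclose(f);
    remove(path);
    return NULL;
}
```

Note that even a successful fflush only guarantees the data reached the OS, not the physical disk, but for reserving space that is usually enough.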

Related

What's more efficient: reading from a file or allocating memory

I have a text file and I need to allocate an array with as many entries as there are lines in the file. What's more efficient: reading the file twice (first to find out the number of lines) and allocating the array once, or reading the file once and using realloc after each line read? Thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input twice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
I presume you want to store the read lines also and not just allocate an array of that many entries.
Also that you don't want to change the lines and then write them back as in that case you might be better off using mmap.
Reading a file twice is always bad, even if it is cached the second time; too many system calls are needed. Also, allocating every line separately is a waste of time if you don't need to deallocate them in a random order.
Instead read the entire file at once, into an allocated area.
Find the number of lines by finding line feeds.
Alloc an array
Put the start pointers into the array by finding the same line feeds again.
If you need it as strings, then replace the line feed with \0
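The steps above might look like this in C (an illustrative sketch; `read_lines` is a made-up name, and it assumes a seekable file with no embedded NUL bytes):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read a whole file into one buffer and return an array of pointers
 * to each line (newlines replaced by '\0').  *nlines receives the
 * line count.  Caller frees lines[0] (the buffer) and lines itself. */
char **read_lines(const char *path, size_t *nlines)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    char *buf = malloc((size_t)size + 1);
    if (!buf || fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);
        fclose(f);
        return NULL;
    }
    fclose(f);
    buf[size] = '\0';

    /* First scan: count the line feeds. */
    size_t count = 0;
    for (long i = 0; i < size; i++)
        if (buf[i] == '\n')
            count++;
    if (size > 0 && buf[size - 1] != '\n')
        count++;                /* final line without trailing newline */

    /* One allocation for all the line pointers. */
    char **lines = malloc(count * sizeof *lines);
    if (!lines) { free(buf); return NULL; }

    /* Second scan: record each line start, terminate at each '\n'. */
    size_t n = 0;
    char *p = buf;
    while (*p && n < count) {
        lines[n++] = p;
        char *nl = strchr(p, '\n');
        if (!nl) break;
        *nl = '\0';
        p = nl + 1;
    }
    *nlines = count;
    return lines;
}
```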
This might also be improved upon on modern CPU architectures: instead of scanning the buffer twice, it might be faster to simply allocate a "large enough" array for the pointers and scan the buffer once. This requires a realloc at the end to shrink the array to the right size, and potentially a couple more along the way to grow it if it wasn't large enough at the start.
Why is this faster? Because each line involves a lot of branches that can take a lot of time, so it's better to only do that scan once; the cost is the reallocation, but copying large arrays with memcpy can be comparatively cheap.
But you have to measure it, your system settings, buffer sizes etc. will influence things too.
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.

Copy sparse files

I'm trying to understand Linux (UNIX) low-level interfaces and as an exercise want to write a code which will copy a file with holes into a new file (again with holes).
So my question is, how to read from the first file not till the first hole, but till the very end of the file?
If I'm not mistaken, read() returns 0 when it reaches the first hole (EOF).
I was thinking about seeking right byte by byte and trying to read this byte, but then I have to know the number of holes in advance.
If by holes you mean sparse files, then you have to find the holes in the input file and recreate them using lseek when writing the output file. Since Linux 3.1, you can even use lseek to jump to the beginning or end of a hole, as described in great detail in the man page.
As ThiefMaster already pointed out, normal file operations will treat holes simply as sequences of zero bytes, so you won't see the EOF you mention.
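A rough sketch of such a copy using lseek with SEEK_DATA/SEEK_HOLE (Linux-specific, not standard C; `copy_sparse` is a made-up name, and on file systems without hole support the kernel falls back to treating the whole file as one data region, so the copy still works, just not sparsely):

```c
#define _GNU_SOURCE             /* for SEEK_DATA / SEEK_HOLE (Linux >= 3.1) */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3             /* Linux values, in case the libc headers */
#define SEEK_HOLE 4             /* don't expose them                      */
#endif

/* Copy `src` to `dst`, recreating holes with lseek instead of
 * writing the zero bytes out.  Minimal sketch: no error recovery. */
int copy_sparse(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0)
        return -1;

    off_t end = lseek(in, 0, SEEK_END);
    off_t data = lseek(in, 0, SEEK_DATA);   /* first non-hole region */

    while (data >= 0 && data < end) {
        off_t hole = lseek(in, data, SEEK_HOLE);  /* end of this region */
        lseek(in, data, SEEK_SET);
        lseek(out, data, SEEK_SET);         /* seeking past EOF makes a hole */

        off_t left = hole - data;
        char buf[65536];
        while (left > 0) {
            ssize_t n = read(in, buf, left < (off_t)sizeof buf
                                        ? (size_t)left : sizeof buf);
            if (n <= 0) return -1;
            if (write(out, buf, (size_t)n) != n) return -1;
            left -= n;
        }
        data = lseek(in, hole, SEEK_DATA);  /* next data region, or -1 */
    }

    ftruncate(out, end);        /* preserve a trailing hole */
    close(in);
    close(out);
    return 0;
}
```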
For copies of sparse files, from the cp manual;
By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.
Thus, try --sparse=always if you need to copy a sparse file 'as-is' (though, as the manual says, detection still relies on a heuristic).
A file is not presented as if it has any gaps. If your intention is to say that the file has sections on one area of the disk, then more on another, etc., you are not going to be able to see this through a call to open() on that file and a series of read() calls. You would instead need to open() and read() the raw disk instead, seeking to sectors on your own.
If your meaning of "holes" in a file is as @ThiefMaster says, just areas of 0 bytes -- these are only "holes" according to your application's use of the data; to the file system they're just bytes in a file, no different from any other. In this case, you can copy it through a simple read of the data source and write to the data target, and you will get a full copy (along with what you're calling holes).

Replacing spaces with %20 in a file on hard disk

I have gone through all the answers for the similar question posted earlier, Replacing spaces with %20 in C. However, I'm unable to work out how we can do this for a file on hard disk, where disk accesses can be expensive and the file is too long to load into memory at once. If it does fit, we can simply load the file and write back over the existing one.
Further, for memory constraints one would like to replace the original file and not create a new one.
Horrible idea. Since the "%20" is longer than " " you can't just replace chars inside the file, you have to move whatever follows it further back. This is extremely messy and expensive if you want to do it on the existing disk file.
You could try to determine the total growth of the file on a first pass, then do the whole shifting from the back of the file taking blocksize into account and adjusting the shifting as you encounter " ". But as I said -- messy. You really don't want to do that unless it's a definite must.
Read the file, do the replacements, write to a new file, and rename the new file over the old one.
EDIT: as a side effect, if your program terminates while doing this you won't end up with a half-converted file. That's actually the reason many programs write to a new file even when they don't need to: the new file only replaces the old one after it has been written successfully, so the file is "always" correct. It's a simple transaction scheme that doesn't cover system failures, but works well for application failures (including users forcibly terminating the program).
For the replacement part, you can have two buffers, one that you read into and one that you write the translated string to and which you write to disk. Depending on your memory constraints even a small input buffer (say 1 KiB) is enough. However, to avoid repeated reallocations you can keep a fixed buffer for the output, and make it three times the size of the input buffer (worst case scenario: input is all spaces). In total that's 4 KiB of memory, plus whatever buffers the OS uses. I would recommend using a multiple of the disk block size as the input size.
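The per-block translation might be sketched like this (the surrounding read/write loop is omitted; `expand_spaces` is a made-up name):

```c
#include <stddef.h>
#include <string.h>

/* Translate one input block: copy `in` to `out`, expanding each
 * space to "%20".  `out` must hold at least 3 * n bytes (worst
 * case: all spaces).  Returns the number of bytes written. */
size_t expand_spaces(const char *in, size_t n, char *out)
{
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == ' ') {
            memcpy(out + w, "%20", 3);
            w += 3;
        } else {
            out[w++] = in[i];
        }
    }
    return w;
}
```

The caller reads the old file in, say, 1 KiB blocks into `in`, calls this with an `out` buffer three times that size, and writes the result to the new file.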
The problem is your requirement of reading and writing to the same file. Unfortunately this is impossible. If you read char-by-char, think about what happens when you reach a space: you then have to write three characters, overwriting the next two characters in the file. Not exactly what you want.

What's a short way to prepend 3 bytes to the beginning of a binary file in C?

The straightforward way I know of is to create a new file, write the three bytes to it, and then read the original file into memory (in a loop) and write it out to the new file.
Is there a faster way that would either permit skipping the creation of a new file, or skip reading the original file into memory and writing back out again?
There is, unfortunately, no way (with POSIX or standard libc file APIs) to insert or delete a range of bytes in an existing file.
This isn't so much about C as about filesystems; there aren't many common filesystem APIs that provide shortcuts for prepending data, so in general the straightforward way is the only one.
You may be able to use some form of memory-mapped I/O appropriate to your platform, but this trades off one set of problems for another (such as, can you map the entire file into your address space or are you forced to break it up into chunks?).
You could open the file as read/write, read the first 4KB, seek backward 4KB, write your three bytes followed by the first (4KB - 3) bytes you read, and repeat the process until you reach the end of the file, carrying the displaced 3 bytes forward into each following block (the file grows by 3 bytes at the end).
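For reference, the straightforward new-file approach from the question might look like this (a sketch; `prepend3` is a made-up name, and temp-file naming and error cleanup are simplified):

```c
#include <stdio.h>

/* Prepend the 3 bytes of `hdr` to `path` the straightforward way:
 * write them to a temp file, append the original contents, then
 * rename the temp file over the original. */
int prepend3(const char *path, const char *tmp, const unsigned char hdr[3])
{
    FILE *in = fopen(path, "rb");
    FILE *out = fopen(tmp, "wb");
    if (!in || !out)
        return -1;

    if (fwrite(hdr, 1, 3, out) != 3)
        return -1;

    /* Copy the original file through a fixed-size buffer. */
    char buf[8192];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        if (fwrite(buf, 1, n, out) != n)
            return -1;

    fclose(in);
    if (fclose(out) != 0)       /* catches deferred write errors */
        return -1;
    return rename(tmp, path);   /* replaces the original */
}
```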

How can you concatenate two huge files with very little spare disk space? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
Suppose that you have two huge files (several GB) that you want to concatenate together, but that you have very little spare disk space (let's say a couple hundred MB). That is, given file1 and file2, you want to end up with a single file which is the result of concatenating file1 and file2 together byte-for-byte, and delete the original files.
You can't do the obvious cat file2 >> file1; rm file2, since in between the two operations, you'd run out of disk space.
Solutions on any and all platforms with free or non-free tools are welcome; this is a hypothetical problem I thought up while I was downloading a Linux ISO the other day, and the download got interrupted partway through due to a wireless hiccup.
time spent figuring out clever solution involving disk-sector shuffling and file-chain manipulation: 2-4 hours
time spent acquiring/writing software to do in-place copy and truncate: 2-20 hours
times median $50/hr programmer rate: $400-$1200
cost of 1TB USB drive: $100-$200
ability to understand the phrase "opportunity cost": priceless
I think the difficulty is determining how the space can be recovered from the original files.
I think the following might work:
Allocate a sparse file of the combined size.
Copy 100Mb from the end of the second file to the end of the new file.
Truncate 100Mb off the end of the second file.
Loop steps 2 & 3 till you finish the second file (with step 2 adjusted to the correct place in the destination file).
Do steps 2, 3 & 4 with the first file.
This all relies on sparse file support, and file truncation freeing space immediately.
If you actually wanted to do this then you should investigate the dd command, which can do the copying step.
Someone in another answer gave a neat solution that doesn't require sparse files, but does copy file2 twice:
Copy 100Mb chunks from the end of file 2 to a new file 3, ending up in reverse order. Truncating file 2 as you go.
Copy 100Mb chunks from the end of file 3 into file 1, ending up with the chunks in their original order, at the end of file 1. Truncating file 3 as you go.
Here's a slight improvement over my first answer.
If you have 100MB free, copy the last 100MB from the second file and create a third file. Truncate the second file so it is now 100MB smaller. Repeat this process until the second file has been completely decomposed into individual 100MB chunks.
Now each of those 100MB files can be appended to the first file, one at a time.
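The decomposition step above might be sketched like this (a sketch using POSIX ftruncate; `split_tail` is a made-up name, and for brevity the whole chunk is buffered in memory, so you'd pick a chunk size that fits in RAM or copy it in smaller pieces):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Split the last `chunk` bytes off `src` into a new file `piece`,
 * then truncate `src` to release that disk space immediately.
 * Returns the bytes remaining in `src`, or -1 on error.  Calling
 * this repeatedly decomposes file2 into numbered pieces that can
 * afterwards be appended to file1 in their original order. */
long split_tail(const char *src, const char *piece, long chunk)
{
    FILE *in = fopen(src, "rb+");
    if (!in)
        return -1;

    fseek(in, 0, SEEK_END);
    long size = ftell(in);
    long take = size < chunk ? size : chunk;

    char *buf = malloc((size_t)take);
    fseek(in, size - take, SEEK_SET);
    if (!buf || fread(buf, 1, (size_t)take, in) != (size_t)take) {
        free(buf);
        fclose(in);
        return -1;
    }

    FILE *out = fopen(piece, "wb");
    if (!out || fwrite(buf, 1, (size_t)take, out) != (size_t)take) {
        free(buf);
        fclose(in);
        return -1;
    }
    fclose(out);
    free(buf);

    /* POSIX ftruncate frees the tail blocks right away. */
    ftruncate(fileno(in), size - take);
    fclose(in);
    return size - take;
}
```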
With those constraints I expect you'd need to tamper with the file system; directly edit the file size and allocation blocks.
In other words, forget about shuffling any blocks of file content around, just edit the information about those files.
if the file is highly compressible (ie. logs):
gzip file1
gzip file2
zcat file1 file2 | gzip > file3
rm file1
rm file2
gunzip file3
At the risk of sounding flippant, have you considered the option of just getting a bigger disk? It would probably be quicker...
Not very efficient, but I think it can be done.
Open the first file in append mode, and copy blocks from the second file to it until the disk is almost full. For the remainder of the second file, copy blocks from the point where you stopped back to the beginning of the file via random access I/O. Truncate the file after you've copied the last block. Repeat until finished.
Obviously, the economic answer is to buy more storage, assuming that's a possible answer. It might not be, though: an embedded system with no way to attach more storage, or even no access to the equipment itself (say, a space probe in flight).
The previously presented answer based on sparse files is good (other than its destructive nature if something goes wrong!) if you have sparse file support. What if you don't, though?
Starting from the end of file 2 copy blocks to the start of the target file reversing them as you go. After each block you truncate the source file to the uncopied length. Repeat for file #1.
At this point the target file contains all the data backwards, the source files are gone.
Read a block from the start and one from the end of the target file, reverse them, and write each to the spot the other came from. Work your way inwards, flipping blocks.
When you are done the target file is the concatenation of the source files. No sparse file system needed, no messing with the file system needed. This can be carried out at zero bytes free as the data can be held in memory.
ok, for theoretical entertainment, and only if you promise not to waste your time actually doing it:
files are stored on disk in pieces
the pieces are linked in a chain
So you can concatenate the files by:
linking the last piece of the first file to the first piece of the last file
altering the directory entry for the first file to change the last piece and file size
removing the directory entry for the last file
cleaning up the first file's end-of-file marker, if any
note that if the last segment of the first file is only partially filled, you will have to copy data "up" the segments of the last file to avoid having garbage in the middle of the file [thanks @Wedge!]
This would be optimally efficient: minimal alterations, minimal copying, no spare disk space required.
now go buy a usb drive ;-)
Two thoughts:
If you have enough physical RAM, you could actually read the second file entirely into memory, delete it, then write it in append mode to the first file. Of course if you lose power after deleting but before completing the write, you've lost part of the second file for good.
Temporarily reduce disk space used by OS functionality (e.g. virtual memory, "recycle bin" or similar). Probably only of use on Windows.
I doubt this is a direct answer to the question. You can consider this as an alternative way to solve the problem.
I think it is possible to consider the 2nd file as part 2 of the first file. Usually in zip applications, we see a huge file split into multiple parts. If you open the first part, the application automatically considers the other parts in further processing.
We can simulate the same thing here. As @edg pointed out, tinkering with the file system would be one way.
you could do this:
head --bytes=1024 file2 >> file1 && tail --bytes=+1025 file2 > file2.tmp && mv file2.tmp file2
(Redirecting tail straight back into file2 would truncate it before tail reads it, hence the temporary file, which admittedly needs scratch space of its own. Note also that tail --bytes=+N starts output at byte N, so +1025 skips the 1024 bytes already moved.)
You can increase 1024 according to how much extra disk space you have, then just repeat this until all the bytes have been moved.
This is probably the fastest way to do it (in terms of development time).
You may be able to gain space by compressing the entire file system. I believe NTFS supports this, and I am sure there are flavors of *nix file systems that would support it. It would also have the benefit of after copying the files you would still have more disk space left over than when you started.
OK, changing the problem a little bit. Chances are there's other stuff on the disk that you don't need, but you don't know what it is or where it is. If you could find it, you could delete it, and then maybe you'd have enough extra space.
To find these "tumors", whether a few big ones, or lots of little ones, I use a little sampling program. Starting from the top of a directory (or the root) it makes two passes. In pass 1, it walks the directory tree, adding up the sizes of all the files to get a total of N bytes. In pass 2, it again walks the directory tree, pretending it is reading every file. Every time it passes N/20 bytes, it prints out the directory path and name of the file it is "reading". So the end result is 20 deep samples of path names uniformly spread over all the bytes under the directory.
Then just look at that list for stuff that shows up a lot that you don't need, and go blow it away.
(It's the space-equivalent of the sampling method I use for performance optimization.)
"fiemap"
http://www.mjmwired.net/kernel/Documentation/filesystems/fiemap.txt
