Change one line, rewrite whole file? - filesystems

Simple question about how file system works.
If I change one line in a 100MB .txt file, will the file system invalidate and rewrite the whole 100MB file?
What if I add one line? (and file size changes)
Thanks.

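The filesystem doesn't have a notion of inserting into the middle of a file -- it only knows how to overwrite existing bytes in place and how to append to the end. If the changed line has exactly the same length as the old one, only the block(s) holding it need to be written. If you want to add data in the middle of a file (in C, for instance), you need to manually move forward all the data past the point at which you want to write, then write the new characters into the newly created space.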
In doing this, it's possible that the filesystem will have to rearrange parts of the file, for instance if you exceed the size of the disk block that that piece of the file is stored on. So everything at and past the point where you're adding text to the middle has to be re-written, but where it's re-written depends on the filesystem and the arrangement of the file on disk.
TL;DR it depends on the filesystem and how the file was stored on disk.
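To make the difference concrete, here is a minimal Python sketch. The helper names `overwrite_at` and `insert_at` are made up for illustration, and the tail is held in memory for brevity, which a real tool handling a 100MB file would avoid by shifting it in chunks.

```python
def overwrite_at(path, offset, data):
    """Replace len(data) bytes starting at offset; the file size is unchanged."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def insert_at(path, offset, data):
    """Insert data at offset by rewriting everything that follows it."""
    with open(path, "r+b") as f:
        f.seek(offset)
        tail = f.read()      # everything from offset to the end of the file
        f.seek(offset)
        f.write(data)        # the new bytes go where the tail used to start
        f.write(tail)        # the old tail is written again, further along
```

The first function writes only the bytes it is given. The second has to write `len(data)` plus the whole remainder of the file, which is the cost the answer above describes.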

Related

Why is in-place replacing so hard in files?

I have a very large CSV file that I want to import straight into Postgresql with COPY. For that, the CSV column headers need to match DB column names. So I need to do a simple string replace on the first line of the very large file.
There are many answers on how to do that like:
Is it possible to modify lines in a file in-place?
Optimizing find and replace over large files in Python
All the answers imply creating a copy of the large file or using file-system level solutions that access the entire file, although only the first line is relevant. That makes all solutions slow and seemingly overkill.
What is the underlying cause that makes this simple job so hard? Is it file-system related?
The underlying cause is that a .csv file is a text file, and making changes to the first line of the file implies random access to the first "record" of the file. But text files don't really have "records", they have lines, of unequal length. So changing the first line implies reading the file up to the first line ending, putting something in its place, and then moving all of the rest of the data in the file to the left if the replacement is shorter, or to the right if it is longer. And to do that you have two choices: (1) read the entire file into memory so you can do the left or right shift, or (2) read the file line by line and write out a new one.
It is easy to add stuff at the end because that doesn't involve displacing what is there already.
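A rough Python sketch of choice (2), streaming the data rather than holding it in memory. The function name `replace_first_line` is hypothetical, and the temporary file is assumed to fit on the same filesystem as the original so that the final rename is cheap.

```python
import os
import shutil
import tempfile


def replace_first_line(path, new_line):
    """Rewrite path so its first line becomes new_line (bytes, no newline)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            src.readline()                             # skip the old first line
            dst.write(new_line + b"\n")
            shutil.copyfileobj(src, dst, 1024 * 1024)  # stream the rest in 1 MiB pieces
        os.replace(tmp_path, path)                     # swap the new file into place
    except BaseException:
        os.remove(tmp_path)
        raise
```

This still copies the whole file once, which is exactly the cost the question complains about. It only avoids loading it all into memory. Note that the new file gets the temporary file's permissions, not the original's.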

Text files edit C

I have a program which takes data (ints, floats and strings) given by the user and writes it to a text file. Now I have to update a part of that written data.
For example:
At line 4 in file I want to change the first 2 words (there's an int and a float). How can I do that?
With the information I found out, fseek() and fputs() can be used but I don't know exactly how to get to a specific line.
(Explained code will be appreciated as I'm a starter in C)
You can't "insert" characters into a file. You will have to write a program that reads the whole file and writes a new one: first the part before the insertion point, then your edit, then the rest of the file.
You really need to read all the file, and ignore what is not needed.
fseek is not really useful: it positions the file at some byte offset (relative to the start or the end of the file) and doesn't know about line boundaries.
Actually, lines inside a file are an ill-defined concept. Often a line is a sequence of bytes (different from the newline character) ended by a newline ('\n'). Some operating systems (notably Windows) read text files in a special manner (e.g. the real file contains \r\n to end each line, but the C library gives you the illusion that you have read \n).
In practice, you probably want to use line input routines notably getline (or perhaps fgets).
If you use getline, you should take care to free the line buffer.
If your textual file has a very regular structure, you might fscanf the data (ignoring what you need to skip) without caring about line boundaries.
If you wanted to absolutely use fseek (which is a mistake), you would have to read the file twice: a first time to remember where each line starts (or ends) and a second time to fseek to the line start. Still, that does not work for updates, because you cannot insert bytes in the middle of a file.
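The two-pass idea can be sketched in Python like this. The names `index_line_offsets` and `read_line` are invented for the example, and line numbers are assumed to be 1-based.

```python
def index_line_offsets(path):
    """First pass: return the byte offset at which each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def read_line(path, offsets, line_number):
    """Second pass: seek straight to a 1-based line number and read it."""
    with open(path, "rb") as f:
        f.seek(offsets[line_number - 1])
        return f.readline()
```

As the answer says, such an index helps with reading, or with overwriting a line by one of exactly the same length. It does not help with a change that alters the length, since every later offset would shift.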
And in practice, the most costly operation is the actual disk read. The cost of buffering (partly done by the kernel and the <stdio.h> functions, and partly by you when you deal with lines) is negligible by comparison.
Of course you cannot change in place some line in a file. If you need to do that, process the file for input, produce some output file (containing the modified input) and rename that when finished.
BTW, you might perhaps be interested in indexed files like GDBM, or even in databases like SQLite, MariaDB or MongoDB, and you might be interested in standard textual serialization formats like JSON or YAML (both have many libraries, even for C, to deal with them).
fseek() is used for random-access files where each record of data has the same size. Typically the data is binary, not text.
To solve your particular issue, you will need to read one line at a time to find the line you want to change. A simple solution to make the change is to write the preceding lines to a temporary file, write the changes to the same temporary file, then skip the parts of the original file that you want to change and copy the rest to the temporary file. Finally, close the original file, copy the temporary file over it, and delete the temporary file.
With that said, I suggest that you learn more about random-access files. These are very useful when storing records all of the same size. If you have control over creating the original file, these might be better for your current purpose.
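For a feel of what a random-access file with same-size records looks like, here is a small Python sketch. The record layout (an int, a float and a 16-byte name) and the function names are assumptions chosen to mirror the question, not anything from the asker's program.

```python
import struct

# Assumed record layout: 4-byte int, 8-byte float, 16-byte name, no padding.
RECORD = struct.Struct("<id16s")


def write_record(f, index, number, value, name):
    """Overwrite record `index` in place; no other record moves."""
    f.seek(index * RECORD.size)
    # Names longer than 16 bytes are silently cut off by the 16s format.
    f.write(RECORD.pack(number, value, name.encode("utf-8")))


def read_record(f, index):
    """Read record `index` back as (number, value, name)."""
    f.seek(index * RECORD.size)
    number, value, raw = RECORD.unpack(f.read(RECORD.size))
    return number, value, raw.rstrip(b"\0").decode("utf-8")
```

Because every record occupies `RECORD.size` bytes, record `i` always starts at `i * RECORD.size`, so updating the int and float of the fourth record is one seek and one write. The file object is expected to be opened in binary read/write mode, for example `open(path, "r+b")`.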

Pack files using c so that can be unpacked to original files

I have to pack a few files in such a way that at some later stage I can unpack them again to the original files using a C program. Please suggest how.
I suppose the explanation for wanting to write your own implementation might be curiosity.
Whether you add compression or not, if you simply want to store files in an archive, similar to the tar command, then you have a few possible approaches.
One of the fundamental choices you have to make is: how to demarcate the boundaries of the packed files within the archive? It is not a great idea to use a special character, because the packed files could contain any character to begin with.
To keep track of the end of files, you can use the length of the file in bytes. For example, you could, for each file:
1. Write to the archive the '\0'-terminated C-string which names the packed file.
2. Write to the archive an off64_t which gives the length, in bytes, of the packed file.
3. Write to the archive the actual bytes (if any) of the packed file.
4. (Optional) Write to the archive a checksum or CRC of the packed file.
Repeatedly perform this for each file, concatenating the results with no intervening characters.
Finally, when no files remain, write an empty C-string, a zero character.
The unpacking process is:
1. Read the '\0'-terminated C-string which names this packed file.
2. If the name is empty, assert that we have read the entire archive, then exit.
3. Read the off64_t which gives the length of the packed file.
4. Read as many bytes as the packed file length from the archive and write to the newly-created unpacked file.
Again, repeat these steps until step (2) concludes the program.
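The pack and unpack steps above could be sketched as follows. This is Python rather than C, the optional checksum is left out, and a little-endian signed 64-bit integer is assumed as a stand-in for the off64_t length field. The function names are illustrative only.

```python
import struct

LENGTH = struct.Struct("<q")    # assumed stand-in for the off64_t length field


def pack(files, out):
    """files: iterable of (name, data) pairs; out: a binary file object."""
    for name, data in files:
        out.write(name.encode("utf-8") + b"\0")
        out.write(LENGTH.pack(len(data)))
        out.write(data)
    out.write(b"\0")            # an empty name marks the end of the archive


def read_cstring(src):
    """Read bytes up to, but not including, the next NUL byte (or EOF)."""
    chars = bytearray()
    while True:
        c = src.read(1)
        if c in (b"", b"\0"):
            return bytes(chars)
        chars += c


def unpack(src):
    """Yield (name, data) pairs until the empty name is reached."""
    while True:
        name = read_cstring(src)
        if not name:
            return
        (length,) = LENGTH.unpack(src.read(LENGTH.size))
        yield name.decode("utf-8"), src.read(length)
```

Since the data is delimited by its length rather than by a marker character, a packed file may contain any bytes at all, including '\0'.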
This design, in which file names alternate with file data, is workable. It has some drawbacks. The essential problem is that the data structure isn't designed for random access. In order to get the information about a file in the "middle" of the archive, a program is required to process the preceding files. The program can call lseek64 to skip reading file data that isn't needed, but a processor needs to read at least each file name and each file length. The file length is needed to skip over the file data. The file name, as I arranged the data, must be read in order to locate the file length.
So this is inefficient. Even if the file names did not have to be read in order to access the file size, the fact that the file details are sprinkled throughout the archive means that reading the index data requires accessing several ranges of data on the disk.
A better approach might be to write a "block" of index data to the front of the file. This data structure might be something like:
The size of the first file in the archive.
The name of the first file in the archive.
The position, in bytes, within this archive, where the "first file" may be located as a contiguous block of bytes.
The size of the second file in the archive...
And the data in the index might repeat until, again, a file with empty name marks the end of the index.
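One possible way to lay out such a front index, as a Python sketch. The field widths (64-bit little-endian integers) and the function names are assumptions, and the whole archive is built in memory only to keep the example short.

```python
import struct

FIELD = struct.Struct("<q")     # assumed width for the size and position fields


def build_archive(files):
    """files: list of (name, data) pairs; returns the archive as bytes."""
    names = [name.encode("utf-8") + b"\0" for name, _ in files]
    # The index length must be known before any file position can be computed.
    index_size = sum(2 * FIELD.size + len(n) for n in names) + FIELD.size + 1
    index = bytearray()
    position = index_size
    for encoded, (_, data) in zip(names, files):
        index += FIELD.pack(len(data)) + encoded + FIELD.pack(position)
        position += len(data)
    index += FIELD.pack(0) + b"\0"      # an entry with an empty name ends the index
    return bytes(index) + b"".join(data for _, data in files)


def read_index(archive):
    """Return {name: (size, position)} by reading only the index block."""
    entries = {}
    offset = 0
    while True:
        (size,) = FIELD.unpack_from(archive, offset)
        offset += FIELD.size
        end = archive.index(b"\0", offset)
        name = archive[offset:end].decode("utf-8")
        offset = end + 1
        if not name:
            return entries
        (position,) = FIELD.unpack_from(archive, offset)
        offset += FIELD.size
        entries[name] = (size, position)
```

With the index read, any member is the slice `archive[position:position + size]`, without touching the members before it.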
Having an index like this is nice, but presents a difficulty: when the user wishes to append a file to the archive, the index might need to grow in size. This could change the locations of the packed files within the archive -- the archive program may need to move them around to make room for a bigger index.
The file structure can get more and more complex in order to serve all these different needs. For example, the index can be designed so that it is always allocated out of what the file system considers a "page" (the amount the OS reads or writes from the disk as a minimum-size granule), and if the index needs to grow, discontiguous "index pages" are chained together by file-position data leading from one index page to another. (Like a linked list, but on disk.) The complexity can go on and on.
A fast solution would be to take advantage of an external library like zlib (usage example: http://zlib.net/zlib_how.html ) and use it for compression.
If you want to dig deeper into the topic of compression, have a look at the different lossless compression algorithms and further hints at Wikipedia - Data compression.
I wrote a tar-like program a couple of days ago. Here is my implementation (hope you can get some ideas):
Each file is stored in the file archive with an "header", which is like:
<file-type,file-path,file-size,file-mode>
In file-type I used 0 for files and 1 for directories (this way you can recreate the directory tree).
For example, the header of a file named foo.txt of size 245 bytes with mode 0755 (on Unix, see chmod) will look like:
<0,foo.txt,245,0755>
here the file contents
in this way, the first character of the file archive is always a <, then you parse the list separated by commas (first possible bug) and extract the file type, the path, the size (which you will use to read the next size bytes from the archive - to avoid the "special character bug" pointed out by Heath Hunnicutt) and the mode of the file (let's say you have a binary file and you want to have it executable when extracted too, you need to chmod it with the original file mode).
About the first possible bug: a comma is not commonly used in a file name, but it's probably better to use another separator character or to wrap the path in a pair of double quotes (""). Obviously the parser should then be aware of them, and ignore any comma between the quotes.
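As a sketch of the header parsing (in Python, not the C of my program), one way around the comma problem without quoting is to split the first field from the left and the last two from the right, so commas inside the path survive. The function name `parse_header` is made up for this example.

```python
def parse_header(header):
    """Parse b'<type,path,size,mode>' into (type, path, size, mode)."""
    if not (header.startswith(b"<") and header.endswith(b">")):
        raise ValueError("header must be wrapped in < and >")
    body = header[1:-1].decode("utf-8")
    file_type, rest = body.split(",", 1)       # first field never holds a comma
    path, size, mode = rest.rsplit(",", 2)     # neither do the last two
    return int(file_type), path, int(size), int(mode, 8)
```

For the example header above, `parse_header(b"<0,foo.txt,245,0755>")` gives type 0, path `foo.txt`, size 245 and mode 0755 (octal). This does not solve finding where the header ends when the path itself contains a `>`.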
For writing and reading files in C see fgetc and fputc from stdio.h
To get file info, modes and the directory tree, see stat and chmod from sys/stat.h and ftw from ftw.h (probably Linux/Unix only, because they are POSIX functions).
Hope it helps! (if you need some code i can post some snippets, the header parsing is probably the hardest part).

IO Question: Writing a portion of a file

I have a general IO question. I was trying to replace a single line in an ASCII-encoded file. After searching around quite a bit I found that it is not possible to do that. According to what I read, if a single line needs to be replaced in a file, the whole file needs to be rewritten. I read that this is the same for all OSes. After reading that I thought ok, no choice, I'll just rewrite the whole file.
What got me wondering about this again is I've been working with a program that uses a ".dat" and ".idx" file for its database. The program is constantly reading and writing to the db. It obviously needs to write only small portions at a time (the db is about 200MB in size), so there's no way it could be efficient to write the whole file each time. So my question is: what kind of solution would a program like this have for such a problem? Would it write to memory and then every now and then rewrite the whole database? Would it be writing temp files and then merging them into the DB at some point? Or is it possible for a single line (or several) in the db to be written without the whole file being written?
Any info on this would be greatly appreciated!
Thx
nt
The general comment 'you have to rewrite the whole file' applies when the line you are replacing is of length L1 and the line you are adding is of length L2 and L1 ≠ L2. The trouble is that if L1 is bigger than L2, then you have to move the data in the rest of the file down the file to avoid leaving a gap with garbage where the end of the line was (and you must chop off the tail of the file, shortening it, to avoid leaving garbage at the end). Conversely, if L1 is smaller than L2, you have to move the lines after it up the file to avoid having the new line overwrite the start of the next line.
In the case of the .dat and .idx files, though, you will find that indeed, you are correct: the software is not rewriting the whole file each time. There's a moderate chance that the files represent a C-ISAM file, or one of the related systems (D-ISAM, T-ISAM, etc). In original (Informix) C-ISAM, the .dat file contains fixed length records, so it is possible to write over any old record with a new record because L1 = L2, always. The .idx file is more complex, but it is split into pages (possibly as small as 512 bytes per page), and when an edit is needed, the whole page is rewritten. Since the pages are all the same size, L1 = L2 again - and it is safe to do the rewrite of just the section of the index file that changes.
When a C-ISAM file contains variable length data, the fixed portion of the record is stored in the .dat file, and the variable length portion of the data is stored in pages within the .idx file. This arrangement has just one merit - it leaves the records in the .dat file at a fixed size.
This is not true ntmp. You can indeed write in the middle of a file. How you do it depends on the system and programming language you use. What you are looking for might be seeking operations in IO.
Well, you will not exactly have to rewrite the whole file, only the rest of the file from where you start inserting, since that part will need to be moved behind what you are inserting.
There are several ways you can solve this, one would for example be to reserve space in the file (making the file larger). That way you would only have to move data when the placeholder areas have been filled out.
Write a bit more and we might be able to help you out.

How can you concatenate two huge files with very little spare disk space? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Suppose that you have two huge files (several GB) that you want to concatenate together, but that you have very little spare disk space (let's say a couple hundred MB). That is, given file1 and file2, you want to end up with a single file which is the result of concatenating file1 and file2 together byte-for-byte, and delete the original files.
You can't do the obvious cat file2 >> file1; rm file2, since in between the two operations, you'd run out of disk space.
Solutions on any and all platforms with free or non-free tools are welcome; this is a hypothetical problem I thought up while I was downloading a Linux ISO the other day, and the download got interrupted partway through due to a wireless hiccup.
time spent figuring out clever solution involving disk-sector shuffling and file-chain manipulation: 2-4 hours
time spent acquiring/writing software to do in-place copy and truncate: 2-20 hours
times median $50/hr programmer rate: $400-$1200
cost of 1TB USB drive: $100-$200
ability to understand the phrase "opportunity cost": priceless
I think the difficulty is determining how the space can be recovered from the original files.
I think the following might work:
1. Allocate a sparse file of the combined size.
2. Copy 100MB from the end of the second file to the end of the new file.
3. Truncate 100MB off the end of the second file.
4. Loop 2 and 3 until you finish the second file (with step 2 modified to write to the correct place in the destination file).
5. Do 2, 3 and 4 again, but with the first file.
This all relies on sparse file support, and file truncation freeing space immediately.
If you actually wanted to do this then you should investigate the dd command, which can do the copying step.
Someone in another answer gave a neat solution that doesn't require sparse files, but does copy file2 twice:
Copy 100MB chunks from the end of file 2 to a new file 3, ending up in reverse order, truncating file 2 as you go.
Copy 100MB chunks from the end of file 3 into file 1, ending up with the chunks in their original order at the end of file 1, truncating file 3 as you go.
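A Python sketch of that two-pass shuffle, under the assumption that truncating a file frees its space right away. The names `move_chunks_from_end` and `concatenate` are invented here. One detail the description glosses over: when file 2 is not a whole number of chunks long, the short chunk has to be taken first in the first pass, otherwise the chunk boundaries no longer line up in the second pass.

```python
import os


def move_chunks_from_end(src, dst, chunk_size, short_chunk_first):
    """Append src to dst chunk by chunk, last chunk first, shrinking src as we go."""
    size = os.path.getsize(src)
    with open(src, "r+b") as s, open(dst, "ab") as d:
        while size > 0:
            n = min(chunk_size, size)
            if short_chunk_first and size % chunk_size:
                n = size % chunk_size       # take the odd-sized piece first
            s.seek(size - n)
            d.write(s.read(n))
            d.flush()
            os.fsync(d.fileno())            # make sure the copy is on disk first
            size -= n
            s.truncate(size)                # then give the space back
    os.remove(src)


def concatenate(file1, file2, scratch, chunk_size=100 * 1024 * 1024):
    """Append file2 to file1 using roughly one chunk of spare space."""
    # Pass 1: file2 -> scratch, chunks end up in reverse order.
    move_chunks_from_end(file2, scratch, chunk_size, short_chunk_first=True)
    # Pass 2: scratch -> file1, reversing again restores the original order.
    move_chunks_from_end(scratch, file1, chunk_size, short_chunk_first=False)
```

Like the sparse-file version, this is destructive: if it is interrupted, the data is split between the files and has to be pieced together by hand.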
Here's a slight improvement over my first answer.
If you have 100MB free, copy the last 100MB from the second file and create a third file. Truncate the second file so it is now 100MB smaller. Repeat this process until the second file has been completely decomposed into individual 100MB chunks.
Now each of those 100MB files can be appended to the first file, one at a time.
With those constraints I expect you'd need to tamper with the file system; directly edit the file size and allocation blocks.
In other words, forget about shuffling any blocks of file content around, just edit the information about those files.
if the file is highly compressible (i.e. logs):
gzip file1
gzip file2
zcat file1.gz file2.gz | gzip > file3.gz
rm file1.gz
rm file2.gz
gunzip file3.gz
At the risk of sounding flippant, have you considered the option of just getting a bigger disk? It would probably be quicker...
Not very efficient, but I think it can be done.
Open the first file in append mode, and copy blocks from the second file to it until the disk is almost full. For the remainder of the second file, copy blocks from the point where you stopped back to the beginning of the file via random access I/O. Truncate the file after you've copied the last block. Repeat until finished.
Obviously, the economic answer is buy more storage assuming that's a possible answer. It might not be, though--embedded system with no way to attach more storage, or even no access to the equipment itself--say, space probe in flight.
The previously presented answer based on the sparse file system is good (other than the destructive nature of it if something goes wrong!) if you have a sparse file system. What if you don't, though?
Starting from the end of file 2 copy blocks to the start of the target file reversing them as you go. After each block you truncate the source file to the uncopied length. Repeat for file #1.
At this point the target file contains all the data backwards, the source files are gone.
Read a block from the start and one from the end of the target file, reverse them and write each to the spot the other came from. Work your way inwards, flipping blocks.
When you are done the target file is the concatenation of the source files. No sparse file system needed, no messing with the file system needed. This can be carried out at zero bytes free as the data can be held in memory.
ok, for theoretical entertainment, and only if you promise not to waste your time actually doing it:
files are stored on disk in pieces
the pieces are linked in a chain
So you can concatenate the files by:
linking the last piece of the first file to the first piece of the last file
altering the directory entry for the first file to change the last piece and file size
removing the directory entry for the last file
cleaning up the first file's end-of-file marker, if any
note that if the last segment of the first file is only partially filled, you will have to copy data "up" the segments of the last file to avoid having garbage in the middle of the file [thanks @Wedge!]
This would be optimally efficient: minimal alterations, minimal copying, no spare disk space required.
now go buy a usb drive ;-)
Two thoughts:
If you have enough physical RAM, you could actually read the second file entirely into memory, delete it, then write it in append mode to the first file. Of course if you lose power after deleting but before completing the write, you've lost part of the second file for good.
Temporarily reduce disk space used by OS functionality (e.g. virtual memory, "recycle bin" or similar). Probably only of use on Windows.
I doubt this is a direct answer to the question. You can consider this as an alternative way to solve the problem.
I think it is possible to treat the 2nd file as part 2 of the first file. With zip applications, we often see a huge file split into multiple parts; if you open the first part, the application automatically picks up the other parts during processing.
We can simulate the same thing here. As @edg pointed out, tinkering with the file system would be one way.
you could do this:
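head --bytes=1024 file2 >> file1 && dd if=file2 of=file2 bs=1024 skip=1 conv=notrunc && truncate -s -1024 file2
you can increase 1024 according to how much extra disk space you have, then just repeat this until all the bytes have been moved. The dd step shifts the rest of file2 down in place (conv=notrunc keeps it from being emptied first) and truncate then drops the duplicated tail.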
This is probably the fastest way to do it (in terms of development time)
You may be able to gain space by compressing the entire file system. I believe NTFS supports this, and I am sure there are flavors of *nix file systems that would support it. It would also have the benefit of after copying the files you would still have more disk space left over than when you started.
OK, changing the problem a little bit. Chances are there's other stuff on the disk that you don't need, but you don't know what it is or where it is. If you could find it, you could delete it, and then maybe you'd have enough extra space.
To find these "tumors", whether a few big ones, or lots of little ones, I use a little sampling program. Starting from the top of a directory (or the root) it makes two passes. In pass 1, it walks the directory tree, adding up the sizes of all the files to get a total of N bytes. In pass 2, it again walks the directory tree, pretending it is reading every file. Every time it passes N/20 bytes, it prints out the directory path and name of the file it is "reading". So the end result is 20 deep samples of path names uniformly spread over all the bytes under the directory.
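A rough Python reconstruction of that sampling idea, not the actual program. The name `sample_paths` is made up, and instead of walking the tree twice it keeps the (path, size) list from the first walk and makes the second pass over that list.

```python
import os


def sample_paths(root, samples=20):
    """Return `samples` file paths spread evenly over all bytes under root."""
    files = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                     # fixed order, so runs are repeatable
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                files.append((path, os.path.getsize(path)))

    total = sum(size for _, size in files)  # pass 1: N bytes in total
    hits = []
    passed = 0
    k = 1                                   # next sample sits at byte k * N / samples
    for path, size in files:                # pass 2: "read" every file
        passed += size
        while k <= samples and k * total <= passed * samples:
            hits.append(path)
            k += 1
    return hits
```

A file or directory that accounts for, say, half of the bytes shows up in about half of the returned paths, which is what makes the big consumers easy to spot.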
Then just look at that list for stuff that shows up a lot that you don't need, and go blow it away.
(It's the space-equivalent of the sampling method I use for performance optimization.)
"fiemap"
http://www.mjmwired.net/kernel/Documentation/filesystems/fiemap.txt
