I am learning to code in Unix with C. So far I have written the code to find the index of the first byte of the line that I want to replace. The problem is that the replacement may contain more bytes than the line already holds. In this case, the code starts overwriting the next line. I came up with two standard solutions:
a) Rather than trying to edit the file in-place, I could copy the entire file into memory, edit it by shifting all the bytes if necessary, and write it back to the file.
b) Copy only the part from the line I want through end-of-file into memory and edit that.
Neither suggestion scales well. And I don't want to impose any restrictions on the line size (like every line must be 50 bytes or something). Is there any efficient way to do the line replacement? Any help would be appreciated.
Copy the first part of the file to a new file (no need to read it all into memory). Then, write the new version of the line. Finally, copy the final part of the file. Swap files and done.
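The copy-and-swap approach above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the names `copy_bytes` and `replace_range` and the caller-supplied temp-file path are hypothetical, and it assumes a POSIX system where `rename()` replaces an existing file.

```c
#include <stdio.h>

/* Copy up to `count` bytes from `in` to `out`; a negative count means
   "until end of file". Returns 0 on success, -1 on an I/O error. */
static int copy_bytes(FILE *in, FILE *out, long count)
{
    char buf[4096];
    while (count != 0) {
        size_t want = sizeof buf;
        if (count > 0 && (size_t)count < want)
            want = (size_t)count;
        size_t got = fread(buf, 1, want, in);
        if (got == 0)
            break;                          /* end of file or read error */
        if (fwrite(buf, 1, got, out) != got)
            return -1;
        if (count > 0)
            count -= (long)got;
    }
    return ferror(in) ? -1 : 0;
}

/* Replace the `old_len` bytes starting at byte offset `start` in `path`
   with `text`, going through the temporary file `tmp_path` (hypothetical
   name chosen by the caller, ideally on the same file system). */
int replace_range(const char *path, const char *tmp_path,
                  long start, long old_len, const char *text)
{
    FILE *in = fopen(path, "rb");
    if (!in) return -1;
    FILE *out = fopen(tmp_path, "wb");
    if (!out) { fclose(in); return -1; }

    int rc = copy_bytes(in, out, start);                        /* first part */
    if (rc == 0 && fputs(text, out) == EOF) rc = -1;            /* new line   */
    if (rc == 0 && fseek(in, old_len, SEEK_CUR) != 0) rc = -1;  /* skip old   */
    if (rc == 0) rc = copy_bytes(in, out, -1);                  /* final part */

    fclose(in);
    if (fclose(out) != 0) rc = -1;
    if (rc == 0 && rename(tmp_path, path) != 0) rc = -1;        /* swap files */
    if (rc != 0) remove(tmp_path);
    return rc;
}
```

Only one 4 KB buffer is held in memory at a time, so the file size does not matter, and the original file stays intact until the final `rename()`.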
I have a very large CSV file that I want to import straight into Postgresql with COPY. For that, the CSV column headers need to match DB column names. So I need to do a simple string replace on the first line of the very large file.
There are many answers on how to do that like:
Is it possible to modify lines in a file in-place?
Optimizing find and replace over large files in Python
All the answers imply creating a copy of the large file or using file-system level solutions that access the entire file, although only the first line is relevant. That makes all solutions slow and seemingly overkill.
What is the underlying cause that makes this simple job so hard? Is it file-system related?
The underlying cause is that a .csv file is a text file, and making changes to the first line of the file implies random access to the first "record" of the file. But text files don't really have "records"; they have lines of unequal length. So changing the first line implies reading the file up to the first newline, putting something in its place, and then moving all of the rest of the data in the file to the left if the replacement is shorter, or to the right if it is longer. And to do that you have two choices. (1) Read the entire file into memory so you can do the left or right shift. (2) Read the file line by line and write out a new one.
It is easy to add stuff at the end because that doesn't involve displacing what is there already.
I have a program which takes data (ints, floats and strings) given by the user and writes it to a text file. Now I have to update a part of that written data.
For example:
At line 4 in file I want to change the first 2 words (there's an int and a float). How can I do that?
With the information I found out, fseek() and fputs() can be used but I don't know exactly how to get to a specific line.
(Explained code will be appreciated as I'm a starter in C)
You can't "insert" characters in a file. You will have to write a program that reads the whole file and writes a new one: first the part before the insertion point, then your edit, then the rest of the file.
You really need to read all the file, and ignore what is not needed.
fseek is not really useful: it positions the file at some byte offset (relative to the start or the end of the file) and does not know about line boundaries.
Actually, lines inside a file are an ill-defined concept. Often a line is a sequence of bytes (different from the newline character) ended by a newline ('\n'). Some operating systems (notably Windows) read text files in a special manner (e.g. the real file contains \r\n to end each line, but the C library gives you the illusion that you have read \n).
In practice, you probably want to use line input routines notably getline (or perhaps fgets).
If you use getline, you should take care to free the line buffer when you are done.
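As an illustration of reading line by line with getline, here is a minimal sketch that fetches line number `target` (counting from 1). The function name `read_line` is hypothetical, and getline is a POSIX function rather than part of standard C.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Return a malloc'd copy of line `target` (1-based) of `path`, including
   its trailing newline if it has one, or NULL if the file cannot be
   opened or has fewer lines. The caller must free() the result. */
char *read_line(const char *path, long target)
{
    FILE *fp = fopen(path, "r");
    if (!fp) return NULL;

    char *line = NULL;      /* getline() allocates and grows this buffer */
    size_t cap = 0;
    long n = 0;
    while (getline(&line, &cap, fp) != -1) {
        if (++n == target) {
            fclose(fp);
            return line;    /* ownership passes to the caller */
        }
    }
    free(line);             /* one free() covers every line that was read */
    fclose(fp);
    return NULL;
}
```

Calling `read_line("data.txt", 4)` would return the fourth line, which could then be parsed with `sscanf` to pick out the int and the float.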
If your textual file has a very regular structure, you might fscanf the data (ignoring what you need to skip) without caring about line boundaries.
If you wanted to absolutely use fseek (which is a mistake), you would have to read the file twice: a first time to remember where each line starts (or ends) and a second time to fseek to the line start. Still, that does not work for updates, because you cannot insert bytes in the middle of a file.
And in practice, the most costly operation is the actual disk read. The cost of buffering (partly done by the kernel and <stdio.h> functions, and partly by you when you deal with lines) is negligible by comparison.
Of course you cannot change in place some line in a file. If you need to do that, process the file for input, produce some output file (containing the modified input) and rename that when finished.
BTW, you might perhaps be interested in indexed files like GDBM, or even in databases like SQLite, MariaDB or MongoDB, and you might be interested in standard textual serialization formats like JSON or YAML (both have many libraries, even for C, to deal with them).
fseek() is used for random-access files where each record of data has the same size. Typically the data is binary, not text.
To solve your particular issue, you will need to read one line at a time to find the line you want to change. A simple solution to make the change is to write the lines before it to a temporary file, write the changes to the same temporary file, then skip the parts of the original file that you want to change and copy the rest to the temporary file. Finally, close the original file, copy the temporary file to it, and delete the temporary file.
With that said, I suggest that you learn more about random-access files. These are very useful when storing records all of the same size. If you have control over creating the original file, these might be better for your current purpose.
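A line-by-line version of this could be sketched as follows. The name `replace_line` and the caller-supplied temp-file path are hypothetical. Note one swapped-in detail: instead of copying the temporary file back and deleting it, this sketch uses `rename()` for the last step, assuming a POSIX system and a temp file on the same file system.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Replace line `target` (1-based) of `path` with `new_line`, which should
   end in '\n'. Returns 0 on success, -1 on error or if the file has fewer
   than `target` lines (the original is then left untouched). */
int replace_line(const char *path, const char *tmp_path,
                 long target, const char *new_line)
{
    FILE *in = fopen(path, "r");
    if (!in) return -1;
    FILE *out = fopen(tmp_path, "w");
    if (!out) { fclose(in); return -1; }

    char *line = NULL;
    size_t cap = 0;
    ssize_t len;
    long n = 0;
    int rc = 0;
    while (rc == 0 && (len = getline(&line, &cap, in)) != -1) {
        n++;
        if (n == target) {
            if (fputs(new_line, out) == EOF)         /* write the change */
                rc = -1;
        } else if (fwrite(line, 1, (size_t)len, out) != (size_t)len) {
            rc = -1;                                 /* copy line as-is  */
        }
    }
    free(line);
    if (ferror(in)) rc = -1;
    fclose(in);
    if (fclose(out) != 0) rc = -1;
    if (rc == 0 && n < target) rc = -1;              /* line not found   */
    if (rc == 0 && rename(tmp_path, path) != 0) rc = -1;
    if (rc != 0) remove(tmp_path);
    return rc;
}
```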
As far as I know, the fseek function sets the file pointer's new position measured in bytes. How can we move the file pointer to a new position measured in lines?
The short answer: there's no easy way. A file in C is a bunch of bytes, and there is nothing in particular that makes the bytes '\n' and '\r' special (depending on your system). If you really care about a general solution, I would recommend building a lookup table for the byte offsets of line endings as you read the file, and then using it to jump around in the file later on.
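One way such a lookup table might be built is sketched below. It records where each line starts rather than where it ends, which is the more convenient form for seeking; the name `index_lines` is hypothetical, and the file is assumed to be opened in binary mode so that byte counts match file offsets.

```c
#include <stdio.h>
#include <stdlib.h>

/* Scan `fp` from the beginning and return a malloc'd array with the byte
   offset at which each line starts; *count receives the number of lines.
   Returns NULL if memory runs out. The caller must free() the array. */
long *index_lines(FILE *fp, size_t *count)
{
    size_t cap = 64, n = 0;
    long *offsets = malloc(cap * sizeof *offsets);
    if (!offsets) return NULL;

    rewind(fp);
    long pos = 0;
    int c, at_line_start = 1;
    while ((c = getc(fp)) != EOF) {
        if (at_line_start) {
            if (n == cap) {                       /* grow the table */
                cap *= 2;
                long *tmp = realloc(offsets, cap * sizeof *offsets);
                if (!tmp) { free(offsets); return NULL; }
                offsets = tmp;
            }
            offsets[n++] = pos;
            at_line_start = 0;
        }
        pos++;
        if (c == '\n')
            at_line_start = 1;                    /* next byte starts a line */
    }
    *count = n;
    return offsets;
}
```

After the scan, `fseek(fp, offsets[k], SEEK_SET)` positions the stream at the start of line `k` (counting from 0). For very large files, keeping only every thousandth entry and reading forward from there trades memory for a short scan.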
You can't position the file pointer directly on a line. You have to read the file.
The basic stdio functions operate on bytes only. You will have to read the file byte by byte and count the lines yourself.
I was facing the same problem. My solution was to store the seek positions of some of the lines and do a forward search from there.
E.g., if you have a million lines, you can store the seek position of every thousandth line.
I have a general IO question. I was trying to replace a single line in an ASCII-encoded file. After searching around quite a bit I found that it is not possible to do that. According to what I read, if a single line needs to be replaced in a file, the whole file needs to be rewritten. I read that this is the same for all OSes. After reading that I thought OK, no choice, I'll just rewrite the whole file.
What got me wondering about this again is that I've been working with a program that uses a ".dat" and an ".idx" file for its database. The program is constantly reading and writing to the db. It obviously needs to write only small portions at a time (the db is about 200 MB in size), so there's no way it could be efficient to write the whole file each time. So my question is: what kind of solution would a program like this have for such a problem? Would it write to memory and then every now and then rewrite the whole database? Would it be writing temp files and then merging them into the DB at some point? Or is it possible for a single line (or several) in the db to be written without the whole file being rewritten?
Any info on this would be greatly appreciated!
Thx
nt
The general comment 'you have to rewrite the whole file' applies when the line you are replacing is of length L1 and the line you are adding is of length L2 and L1 ≠ L2. The trouble is that if L1 is bigger than L2, then you have to move the data in the rest of the file toward the start of the file to avoid leaving a gap with garbage where the end of the line was (and you must chop off the tail of the file, shortening it, to avoid leaving garbage at the end). Conversely, if L1 is smaller than L2, you have to move the lines after the replaced line toward the end of the file to avoid having the new line overwrite the start of the next line.
In the case of the .dat and .idx files, though, you will find that indeed, you are correct: the software is not rewriting the whole file each time. There's a moderate chance that the files represent a C-ISAM file, or one of the related systems (D-ISAM, T-ISAM, etc). In original (Informix) C-ISAM, the .dat file contains fixed length records, so it is possible to write over any old record with a new record because L1 = L2, always. The .idx file is more complex, but it is split into pages (possibly as small as 512 bytes per page), and when an edit is needed, the whole page is rewritten. Since the pages are all the same size, L1 = L2 again - and it is safe to do the rewrite of just the section of the index file that changes.
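To illustrate why fixed-length records can be rewritten in place, here is a minimal sketch. It is not C-ISAM's actual on-disk format: the 32-byte record size, the space padding, and the names `write_record` and `read_record` are all assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

#define RECORD_SIZE 32   /* hypothetical fixed record length in bytes */

/* Overwrite record number `index` (0-based) with `text`, padded with
   spaces to RECORD_SIZE. `fp` must be open for update in binary mode
   (for example "r+b" or "w+b"). Returns 0 on success, -1 on error. */
int write_record(FILE *fp, long index, const char *text)
{
    char rec[RECORD_SIZE];
    size_t len = strlen(text);
    if (len > sizeof rec) return -1;          /* does not fit in a record */
    memset(rec, ' ', sizeof rec);
    memcpy(rec, text, len);

    /* Every record has the same size, so its offset is a multiplication. */
    if (fseek(fp, index * RECORD_SIZE, SEEK_SET) != 0) return -1;
    if (fwrite(rec, 1, sizeof rec, fp) != sizeof rec) return -1;
    return fflush(fp) == 0 ? 0 : -1;
}

/* Read record number `index` into `rec` (RECORD_SIZE bytes, not
   NUL-terminated). Returns 0 on success, -1 on error. */
int read_record(FILE *fp, long index, char *rec)
{
    if (fseek(fp, index * RECORD_SIZE, SEEK_SET) != 0) return -1;
    return fread(rec, 1, RECORD_SIZE, fp) == RECORD_SIZE ? 0 : -1;
}
```

Because the old and new record always occupy exactly `RECORD_SIZE` bytes, nothing after the record has to move, which is the L1 = L2 case described above.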
When a C-ISAM file contains variable length data, the fixed portion of the record is stored in the .dat file, and the variable length portion of the data is stored in pages within the .idx file. This arrangement has just one merit - it leaves the records in the .dat file at a fixed size.
This is not true, ntmp. You can indeed write in the middle of a file. How you do it depends on the system and programming language you use. What you are looking for might be seek operations in IO.
Well, you will not exactly have to rewrite the whole file, only the rest of the file from the point where you start inserting, since that part will need to be moved behind what you are inserting.
There are several ways you can solve this, one would for example be to reserve space in the file (making the file larger). That way you would only have to move data when the placeholder areas have been filled out.
Write a bit more and we might be able to help you out.
Hi, I am working in C on a Unix platform. Please tell me how to insert one line before the last line of a file in C. I have used fopen in append mode, but I can't add a line before the last line.
I just want to write to the second last line in the file.
You don't need to overwrite the whole file. You just have to:
open your file in "r+" mode ("rw" is not a valid fopen mode),
read your file to find the last line: store its position (ftell/ftello) in the file and its contents
go back to the beginning of the last line (fseek/fseeko)
write whatever you want before the last line
write the last line.
close your file.
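The steps above can be sketched as follows. This is a minimal illustration, assuming a POSIX system (it uses getline) and a hypothetical function name `insert_before_last_line`; it relies on the fact that only the last line has to move, so the rest of the file is never rewritten.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Insert `new_line` (which should end in '\n') before the last line of
   `path`. Returns 0 on success, -1 on error. */
int insert_before_last_line(const char *path, const char *new_line)
{
    FILE *fp = fopen(path, "r+");
    if (!fp) return -1;

    /* Read every line, remembering where the most recent one started. */
    char *line = NULL;
    size_t cap = 0;
    long last_pos = 0, pos = 0;
    while (getline(&line, &cap, fp) != -1) {
        last_pos = pos;          /* start of the line just read */
        pos = ftell(fp);         /* start of the next line, or end of file */
    }
    free(line);

    /* Save the last line: the bytes from last_pos to end of file. */
    size_t tail_len = (size_t)(pos - last_pos);
    char *tail = malloc(tail_len + 1);
    int rc = tail ? 0 : -1;
    if (rc == 0 && fseek(fp, last_pos, SEEK_SET) != 0) rc = -1;
    if (rc == 0 && fread(tail, 1, tail_len, fp) != tail_len) rc = -1;

    /* Go back, write the new line, then the saved last line after it. */
    if (rc == 0 && fseek(fp, last_pos, SEEK_SET) != 0) rc = -1;
    if (rc == 0 && fputs(new_line, fp) == EOF) rc = -1;
    if (rc == 0 && fwrite(tail, 1, tail_len, fp) != tail_len) rc = -1;

    free(tail);
    if (fclose(fp) != 0) rc = -1;
    return rc;
}
```

The file only ever grows here, so no truncation is needed. Unlike the temp-file approach, an interruption between the two writes would leave the file without its last line.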
There is no way of doing this directly in standard C, mostly because few file systems support this operation. The easiest way around this is to read the file into an in-memory structure (where you probably have it anyway), insert the line in memory, then write the whole structure out again, overwriting the original file.
Append only appends to the end, not in the middle.
You need to read in the entire file, and then write it out to a new file. You might have luck starting from the back, and finding the byte offset of the second-to-last linefeed. Then you can just block write the entire "prelude", add your new line, and then emit the remaining trailer.
You can find the place where the last line ends, read the last line into memory, seek back to the place, write the new line, and then the last line.
To find the place: Seek to the end, minus a buffer size. Read buffer, look for newline. If not found, seek backwards two buffer sizes, and try again.
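The backward search might look like the sketch below. The name `last_line_start` is hypothetical. Instead of seeking backwards by two buffer sizes relative to the current position, it computes absolute offsets, which comes to the same thing; the file is assumed to be open in binary mode.

```c
#include <stdio.h>

/* Return the byte offset at which the last line of `fp` starts, or -1 on
   error. `fp` must be opened in binary mode so offsets can be computed. */
long last_line_start(FILE *fp)
{
    char buf[4096];
    if (fseek(fp, 0, SEEK_END) != 0) return -1;
    long end = ftell(fp);
    if (end < 0) return -1;
    if (end == 0) return 0;                   /* empty file */

    /* A newline as the very last byte terminates the last line, so it
       must not be mistaken for the end of the line before it. */
    if (fseek(fp, end - 1, SEEK_SET) != 0) return -1;
    long limit = (getc(fp) == '\n') ? end - 1 : end;

    while (limit > 0) {
        long chunk = limit < (long)sizeof buf ? limit : (long)sizeof buf;
        long base = limit - chunk;
        if (fseek(fp, base, SEEK_SET) != 0) return -1;
        if (fread(buf, 1, (size_t)chunk, fp) != (size_t)chunk) return -1;
        for (long i = chunk - 1; i >= 0; i--)
            if (buf[i] == '\n')
                return base + i + 1;          /* byte after the newline */
        limit = base;                         /* step back one buffer   */
    }
    return 0;                                 /* file is a single line  */
}
```

With the offset in hand, the last line can be read into memory, the stream repositioned with `fseek`, and the new line written followed by the saved last line.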
You'll need to use the r+ mode for fopen.
Oh, and you'll need to be careful about text and binary modes. You need to use binary mode, since with text mode you can't compute jump positions, you can only jump to locations you've gotten from ftell. You can work around that by reading through the entire file, and calling ftell at the beginning of each line. For large files, that is going to be slow.
Use fseek to jump to end of file, read backwards until you encounter a newline. Then insert your line.
You might want to save the 'last line' you are reading by counting how many chars you are reading backwards then strncpy it to a properly allocated buffer.