I have gone through all the answers to the similar question posted earlier, Replacing spaces with %20 in C. However, I can't figure out how to do this for a file on hard disk, where disk accesses can be expensive and the file is too long to load into memory at once. If the file did fit, we could simply load it and write back onto the same existing file.
Further, because of memory constraints, I would like to replace the original file rather than create a new one.
Horrible idea. Since "%20" is longer than " ", you can't just replace characters inside the file; you have to move everything that follows further back. This is extremely messy and expensive if you want to do it on the existing disk file.
You could determine the total growth of the file on a first pass, then do the whole shifting from the back of the file, taking the block size into account and adjusting the shift each time you encounter a " ". But as I said -- messy. You really don't want to do that unless it's a definite must.
Read the file, do the replacements, write to a new file, and rename the new file over the old one.
EDIT: as a side effect, if your program terminates partway through, you won't end up with a half-converted file. That's actually why many programs write to a new file even when they don't need to: the file is "always" correct, because the new file only replaces the old one after it has been written successfully. It's a simple transaction scheme that doesn't take system failures into account, but it works well for application failures (including users forcibly terminating the program).
For the replacement part, you can use two buffers: one that you read into, and one that you write the translated data into and then flush to disk. Depending on your memory constraints, even a small input buffer (say 1 KiB) is enough. To avoid repeated reallocations, keep a fixed buffer for the output and make it three times the size of the input buffer (worst case: the input is all spaces). That's 4 KiB of memory total, plus whatever buffers the OS uses. I would recommend using a multiple of the disk block size as the input buffer size.
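Here is a minimal sketch of that two-buffer scheme in C, assuming the file is called input.txt and the temporary output is input.txt.tmp (both placeholder names), with the rename-over-the-original step at the end:

    #include <stdio.h>
    #include <string.h>

    /* Sketch of the two-buffer replacement with a rename at the end.
       "input.txt" and "input.txt.tmp" are placeholder names. */
    int main(void)
    {
        FILE *in  = fopen("input.txt", "rb");
        FILE *out = fopen("input.txt.tmp", "wb");
        if (!in || !out) { perror("fopen"); return 1; }

        char ibuf[1024];        /* input buffer, ideally a multiple of the disk block size */
        char obuf[3 * 1024];    /* worst case: every input byte is a space -> 3 output bytes */
        size_t n;

        while ((n = fread(ibuf, 1, sizeof ibuf, in)) > 0) {
            size_t o = 0;
            for (size_t i = 0; i < n; i++) {
                if (ibuf[i] == ' ') {
                    memcpy(obuf + o, "%20", 3);
                    o += 3;
                } else {
                    obuf[o++] = ibuf[i];
                }
            }
            if (fwrite(obuf, 1, o, out) != o) { perror("fwrite"); return 1; }
        }

        fclose(in);
        if (fclose(out) != 0) { perror("fclose"); return 1; }

        /* Only replace the original after the new file was written successfully.
           On some systems rename() refuses to overwrite; remove() the old file first there. */
        if (rename("input.txt.tmp", "input.txt") != 0) { perror("rename"); return 1; }
        return 0;
    }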
The problem is your requirement of reading and writing to the same file; unfortunately, that is impossible. If you read char-by-char, think about what happens when you reach a space... you then have to write three characters, overwriting the next two characters in the file. Not exactly what you want.
Related
I have a very large CSV file that I want to import straight into Postgresql with COPY. For that, the CSV column headers need to match DB column names. So I need to do a simple string replace on the first line of the very large file.
There are many answers on how to do that like:
Is it possible to modify lines in a file in-place?
Optimizing find and replace over large files in Python
All the answers imply creating a copy of the large file or using file-system level solutions that access the entire file, although only the first line is relevant. That makes all solutions slow and seemingly overkill.
What is the underlying cause that makes this simple job so hard? Is it file-system related?
The underlying cause is that a .csv file is a text file, and changing the first line of the file implies random access to the first "record" of the file. But text files don't really have "records"; they have lines, of unequal length. So changing the first line implies reading the file up to the first newline, putting something in its place, and then moving all of the rest of the data in the file to the left if the replacement is shorter, or to the right if it is longer. And to do that you have two choices: (1) read the entire file into memory so you can do the left or right shift, or (2) read the file line by line and write out a new one.
It is easy to add stuff at the end because that doesn't involve displacing what is there already.
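As a rough illustration of choice (2) in C (the linked answers are about Python tooling, but the idea is the same): only the header line needs line-level handling, and the rest of the file can be block-copied untouched. The names big.csv and big.csv.tmp and the replacement header are placeholders, and the sketch assumes the original header fits in the buffer.

    #include <stdio.h>

    /* Sketch of choice (2): rewrite only the header line, block-copy the rest. */
    int main(void)
    {
        FILE *in  = fopen("big.csv", "rb");
        FILE *out = fopen("big.csv.tmp", "wb");
        if (!in || !out) { perror("fopen"); return 1; }

        /* Read and discard the original header line (assumed to fit in the buffer). */
        char header[4096];
        if (!fgets(header, sizeof header, in)) { perror("fgets"); return 1; }

        /* Write the replacement header that matches the DB column names. */
        fputs("id,name,price,quantity\n", out);

        /* Copy the remainder in large blocks; no line handling needed. */
        char buf[64 * 1024];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            if (fwrite(buf, 1, n, out) != n) { perror("fwrite"); return 1; }

        fclose(in);
        if (fclose(out) != 0) { perror("fclose"); return 1; }
        if (rename("big.csv.tmp", "big.csv") != 0) { perror("rename"); return 1; }
        return 0;
    }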
I have a text file and I need to allocate an array with as many entries as there are lines in the file. What's more efficient: to read the file twice (first to find out the number of lines) and allocate the array once, or to read the file once and use realloc after each line read? Thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input twice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
I presume you want to store the read lines also and not just allocate an array of that many entries.
Also, I assume you don't want to change the lines and then write them back; in that case you might be better off using mmap.
Reading a file twice is always bad; even if it is cached the second time, too many system calls are needed. Also, allocating every line separately is a waste of time if you don't need to deallocate them in a random order.
Instead read the entire file at once, into an allocated area.
Find the number of lines by finding line feeds.
Alloc an array
Put the start pointers into the array by finding the same line feeds again.
If you need it as strings, then replace the line feed with \0
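A minimal sketch of these steps in C, assuming the file fits in memory and is named input.txt (a placeholder):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: slurp the file, count the line feeds, allocate the pointer array, split. */
    int main(void)
    {
        FILE *f = fopen("input.txt", "rb");
        if (!f) { perror("fopen"); return 1; }
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);

        char *data = malloc(size + 1);
        if (!data || fread(data, 1, size, f) != (size_t)size) { perror("read"); return 1; }
        data[size] = '\0';
        fclose(f);

        /* First scan: count line feeds (plus a final line without one). */
        size_t nlines = 0;
        for (long i = 0; i < size; i++)
            if (data[i] == '\n') nlines++;
        if (size > 0 && data[size - 1] != '\n') nlines++;

        /* Allocate the array of line pointers. */
        char **lines = malloc(nlines * sizeof *lines);

        /* Second scan: record each line start and terminate it with '\0'. */
        size_t li = 0;
        char *start = data;
        for (long i = 0; i < size; i++) {
            if (data[i] == '\n') {
                data[i] = '\0';
                lines[li++] = start;
                start = data + i + 1;
            }
        }
        if (li < nlines) lines[li] = start;   /* final line with no trailing '\n' */

        printf("read %zu lines\n", nlines);
        free(lines);
        free(data);
        return 0;
    }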
This might also be improved upon on modern CPU architectures: instead of scanning the buffer twice, it might be faster to simply allocate a "large enough" array for the pointers and scan the buffer once. This requires a realloc at the end to trim the array to the right size, and potentially a few reallocs along the way to make the array larger if it wasn't large enough at the start.
Why would this be faster? Because you have a lot of ifs that can take a lot of time for each line, so it's better to only have to do that once. The cost is the reallocation, but copying large arrays with memcpy can be comparatively cheap.
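A sketch of that single-scan variant, with the same assumptions as above plus a guessed initial capacity of 1024 pointers (an arbitrary placeholder):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: the whole file is in memory; the pointer array starts at a guessed
       capacity and grows via realloc as line feeds keep turning up. */
    int main(void)
    {
        FILE *f = fopen("input.txt", "rb");
        if (!f) { perror("fopen"); return 1; }
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);
        char *data = malloc(size + 1);
        if (!data || fread(data, 1, size, f) != (size_t)size) { perror("read"); return 1; }
        data[size] = '\0';
        fclose(f);

        size_t cap = 1024, count = 0;            /* "large enough" initial guess */
        char **lines = malloc(cap * sizeof *lines);

        char *start = data;
        for (long i = 0; i < size; i++) {
            if (data[i] == '\n') {
                if (count == cap)                /* the guess was too small: grow */
                    lines = realloc(lines, (cap *= 2) * sizeof *lines);
                data[i] = '\0';
                lines[count++] = start;
                start = data + i + 1;
            }
        }

        /* One last realloc to trim the array to its real size. */
        if (count > 0)
            lines = realloc(lines, count * sizeof *lines);

        printf("read %zu lines\n", count);
        free(lines);
        free(data);
        return 0;
    }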
But you have to measure it; your system settings, buffer sizes, etc. will influence things too.
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.
I have a program which simulates the orbits of planets. The information for each time step is written to a .txt file, with each iteration requiring 64 bytes (8 doubles). The time step and the final time are both chosen by the user. This allows me to calculate the amount of space required on the disk. E.g., a step of 10 with a final time of 1000 gives 100 sets of info, implying at least 6400 bytes.
Is there a way of using this information to, for lack of a better word, check the drive to see if there is enough space before allowing the program to write to the file? I would like to prevent files which are too large from being written to disk. Ideally this should be standard C if possible.
If you know in advance exactly how much space you need, you can try the following for each of the needed files:
fopen the file and fill it with blanks until you reach the exact size you need.
fflush the file and check for error.
rewind the file pointer to the beginning and overwrite the content with the real data.
In case of error when fflush'ing, remove all files.
When you're finished, fclose the files.
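A minimal sketch of this reservation scheme for a single file. The name orbits.txt, the record count of 100 and the binary layout of 8 doubles per step are placeholder assumptions:

    #include <stdio.h>

    /* Sketch: reserve the exact space with blanks, then overwrite with real data. */
    int main(void)
    {
        const long record_count = 100;
        const long needed = record_count * 64;    /* 8 doubles x 8 bytes per step */

        FILE *f = fopen("orbits.txt", "w+b");
        if (!f) { perror("fopen"); return 1; }

        /* Fill with blanks until the file has exactly the size we need. */
        for (long i = 0; i < needed; i++)
            fputc(' ', f);

        /* fflush and check for error: a full disk shows up here. */
        if (fflush(f) != 0 || ferror(f)) {
            fclose(f);
            remove("orbits.txt");                 /* not enough space: clean up */
            fprintf(stderr, "not enough disk space\n");
            return 1;
        }

        /* Rewind and overwrite the reserved space with the real data. */
        rewind(f);
        for (long i = 0; i < record_count; i++) {
            double state[8] = {0};                /* placeholder simulation output */
            if (fwrite(state, sizeof state, 1, f) != 1) { perror("fwrite"); break; }
        }

        fclose(f);
        return 0;
    }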
I have a program which takes data (ints, floats and strings) given by the user and writes it to a text file. Now I have to update a part of that written data.
For example:
At line 4 in the file I want to change the first 2 words (there's an int and a float). How can I do that?
From what I found out, fseek() and fputs() can be used, but I don't know exactly how to get to a specific line.
(Explained code will be appreciated, as I'm a beginner in C.)
You can't "insert" characters in a file. You will have to create program, which will read whole file, then copy part before insert to a new file, your edition, rest of file.
You really need to read all the file, and ignore what is not needed.
fseek is not really useful here: it positions the file at some byte offset (relative to the start or the end of the file) and knows nothing about line boundaries.
Actually, lines inside a file are an ill-defined concept. Often a line is a sequence of bytes (none of them being the newline character) ended by a newline ('\n'). Some operating systems (Windows, MacOSX) read text files in a special manner (e.g. the real file contains \r\n at the end of each line, but the C library gives you the illusion that you have read \n).
In practice, you probably want to use line input routines notably getline (or perhaps fgets).
If you use getline, you should take care to free the line buffer.
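For illustration, a small sketch that uses getline to reach a given line (note that getline is POSIX, not ISO C; data.txt and the target line number 4 are placeholders):

    #define _POSIX_C_SOURCE 200809L   /* getline() is POSIX, not ISO C */
    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: walk the file with getline() until the wanted line number. */
    int main(void)
    {
        FILE *f = fopen("data.txt", "r");
        if (!f) { perror("fopen"); return 1; }

        char *line = NULL;
        size_t cap = 0;
        int lineno = 0;

        while (getline(&line, &cap, f) != -1) {
            lineno++;
            if (lineno == 4) {
                printf("line 4 is: %s", line);
                break;
            }
        }

        free(line);    /* the buffer getline allocated must be freed by the caller */
        fclose(f);
        return 0;
    }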
If your textual file has a very regular structure, you might fscanf the data (ignoring what you need to skip) without caring about line boundaries.
If you absolutely wanted to use fseek (which would be a mistake), you would have to read the file twice: a first time to remember where each line starts (or ends), and a second time to fseek to the start of the line. Still, that does not work for updates, because you cannot insert bytes in the middle of a file.
And in practice, the most costly operation is the actual disk read. The buffering (done partly by the kernel and the <stdio.h> functions, and partly by you when you deal with lines) costs comparatively little.
Of course you cannot change a line in place inside a file. If you need to do that, process the file as input, produce an output file (containing the modified input), and rename it over the original when finished.
BTW, you might be interested in indexed files like GDBM, or even in databases like SQLite, MariaDB, MongoDB, etc., and you might be interested in standard textual serialization formats like JSON or YAML (both have many libraries, even for C, to deal with them).
fseek() is used for random-access files where each record of data has the same size. Typically the data is binary, not text.
To solve your particular issue, you will need to read one line at a time to find the line you want to change. A simple solution is to copy the lines before it to a temporary file, write the changed line to the same temporary file, then skip the part of the original file that you changed and copy the rest to the temporary file. Finally, close the original file, copy the temporary file over it, and delete the temporary file.
With that said, I suggest that you learn more about random-access files. These are very useful when storing records all of the same size. If you have control over creating the original file, these might be better for your current purpose.
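As a rough illustration of that random-access idea: when every record has the same size, fseek() can jump straight to record n and overwrite it in place. The struct layout and the name records.dat below are illustrative assumptions, not anything from your program:

    #include <stdio.h>

    /* Sketch: overwrite one fixed-size binary record in place. */
    struct record {
        int   id;
        float value;
        char  name[24];
    };

    int main(void)
    {
        FILE *f = fopen("records.dat", "r+b");    /* open an existing file for update, binary */
        if (!f) { perror("fopen"); return 1; }

        /* Overwrite record number 3 (counting from 0) in place. */
        struct record r = { 42, 3.14f, "updated" };
        if (fseek(f, 3L * (long)sizeof r, SEEK_SET) != 0) { perror("fseek"); return 1; }
        if (fwrite(&r, sizeof r, 1, f) != 1) { perror("fwrite"); return 1; }

        fclose(f);
        return 0;
    }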
I have a general I/O question. I was trying to replace a single line in an ASCII-encoded file. After searching around quite a bit, I found that it is not possible to do that. According to what I read, if a single line needs to be replaced in a file, the whole file needs to be rewritten. I read that this is the same for all OSes. After reading that I thought, OK, no choice, I'll just rewrite the whole file.
What got me wondering about this again is that I've been working with a program that uses a ".dat" and ".idx" file for its database. The program is constantly reading and writing to the DB. It obviously writes only small portions at a time (the DB is about 200 MB in size), so there's no way it could be efficient to write the whole file each time. So my question is: what kind of solution would a program like this use for such a problem? Would it write to memory and then every now and then rewrite the whole database? Would it write temp files and then merge them into the DB at some point? Or is it possible for a single line (or several lines) in the DB to be written without the whole file being rewritten?
Any info on this would be greatly appreciated!
Thx
nt
The general comment 'you have to rewrite the whole file' applies when the line you are replacing is of length L1, the line you are adding is of length L2, and L1 ≠ L2. The trouble is that if L1 is bigger than L2, you have to move the data in the rest of the file towards the start to avoid leaving a gap of garbage where the end of the line was (and you must chop off the tail of the file, shortening it, to avoid leaving garbage at the end). Conversely, if L1 is smaller than L2, you have to move everything after that line towards the end of the file to avoid having the new line overwrite the start of the next line.
In the case of the .dat and .idx files, though, you will find that indeed, you are correct: the software is not rewriting the whole file each time. There's a moderate chance that the files represent a C-ISAM file, or one of the related systems (D-ISAM, T-ISAM, etc). In original (Informix) C-ISAM, the .dat file contains fixed length records, so it is possible to write over any old record with a new record because L1 = L2, always. The .idx file is more complex, but it is split into pages (possibly as small as 512 bytes per page), and when an edit is needed, the whole page is rewritten. Since the pages are all the same size, L1 = L2 again - and it is safe to do the rewrite of just the section of the index file that changes.
When a C-ISAM file contains variable length data, the fixed portion of the record is stored in the .dat file, and the variable length portion of the data is stored in pages within the .idx file. This arrangement has just one merit - it leaves the records in the .dat file at a fixed size.
This is not true, ntmp. You can indeed write in the middle of a file. How you do it depends on the system and programming language you use. What you are looking for might be the seek operations of the I/O library.
Well, you will not exactly have to rewrite the whole file, only the rest of the file from where you start inserting, since that part will need to be moved behind what you are inserting.
There are several ways you can solve this; one, for example, would be to reserve space in the file (making the file larger). That way you would only have to move data once the placeholder areas have been filled up.
Write a bit more and we might be able to help you out.