I have a program that takes data (ints, floats and strings) from the user and writes it to a text file. Now I have to update part of that written data.
For example:
On line 4 of the file I want to change the first two words (an int and a float). How can I do that?
From what I have found, fseek() and fputs() can be used, but I don't know exactly how to get to a specific line.
(Explained code would be appreciated, as I'm a beginner in C.)
You can't "insert" characters into a file. You will have to write a program that reads the whole file and writes a new one: first the part before the edit, then your edited text, then the rest of the file.
You really need to read all the file, and ignore what is not needed.
fseek is not really useful here: it positions the file at a byte offset (relative to the start, the current position, or the end of the file) and knows nothing about line boundaries.
Actually, lines inside a file are an ill-defined concept. Often a line is a sequence of bytes (other than the newline character) ended by a newline ('\n'). Some operating systems (notably Windows) treat text files specially: the real file contains \r\n at the end of each line, but in text mode the C library gives you the illusion that you have read just \n.
In practice, you probably want to use line input routines, notably getline (which is POSIX) or perhaps fgets.
If you use getline, you must take care to free the line buffer.
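As a rough sketch of the getline approach, assuming a POSIX system: the function below reads up to a given line and returns it. The name read_line_number is made up for illustration. Note that getline allocates the buffer itself, so it has to be freed on every path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: returns a malloc'd copy of line `wanted` (1-based,
   trailing newline stripped), or NULL if the file is shorter or cannot be
   opened. The caller must free the result. */
char *read_line_number(const char *path, long wanted)
{
    assert(path != NULL && wanted >= 1);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return NULL;

    char *line = NULL;   /* getline allocates and grows this buffer */
    size_t cap = 0;
    ssize_t len;
    long current = 0;
    char *result = NULL;

    while ((len = getline(&line, &cap, fp)) != -1) {
        if (++current == wanted) {
            if (len > 0 && line[len - 1] == '\n')
                line[len - 1] = '\0';
            result = line;   /* hand the buffer over to the caller */
            line = NULL;
            break;
        }
    }
    free(line);   /* free(NULL) is a no-op */
    fclose(fp);
    return result;
}
```

The lines before the wanted one are still read and thrown away; there is no way around that for a plain text file.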
If your textual file has a very regular structure, you might fscanf the data (ignoring what you need to skip) without caring about line boundaries.
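A minimal sketch of the fscanf idea, assuming line 4 really does start with an int followed by a float. The helper name read_first_two is invented here. One caveat: %d skips leading whitespace, including newlines, so if the target line is empty the values are taken from the next non-empty line.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: skips to line `target` (1-based) and reads the int
   and float that start it. Returns 0 on success, -1 otherwise. */
int read_first_two(FILE *fp, long target, int *i, float *f)
{
    assert(fp != NULL && i != NULL && f != NULL && target >= 1);
    rewind(fp);
    for (long line = 1; line < target; line++) {
        /* Discard everything up to the newline. On an empty line this
           matches nothing and returns 0, which is fine. */
        if (fscanf(fp, "%*[^\n]") == EOF)
            return -1;
        /* Consume the newline itself. */
        if (fgetc(fp) == EOF)
            return -1;
    }
    return fscanf(fp, "%d %f", i, f) == 2 ? 0 : -1;
}
```

This only reads the values. Changing them still means writing a new file.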
If you wanted to absolutely use fseek (which is a mistake), you would have to read the file twice: a first time to remember where each line starts (or ends) and a second time to fseek to the line start. Still, that does not work for updates, because you cannot insert bytes in the middle of a file.
And in practice, the most costly operation is the actual disk read. The cost of buffering (partly done by the kernel and the <stdio.h> functions, and partly by you when you deal with lines) is negligible by comparison.
Of course, you cannot change a line in place in a file. If you need to do that, read the file as input, produce an output file (containing the modified input), and rename it over the original when finished.
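Here is a sketch of that read, write and rename approach using only fgets. The function name replace_line and the ".tmp" suffix are my own choices, not anything standard. It replaces the whole target line, so to change just the first two words you would build the new line yourself (for example with snprintf) and pass it in. Renaming over an existing file works on POSIX systems; on Windows you may have to remove the original first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies `path` to a temporary file, writing `new_text`
   (without a trailing newline) in place of line `target` (1-based), then
   renames the copy over the original. Returns 0 on success, -1 on failure. */
int replace_line(const char *path, long target, const char *new_text)
{
    assert(path != NULL && new_text != NULL && target >= 1);
    char tmp_path[4096];
    int len = snprintf(tmp_path, sizeof tmp_path, "%s.tmp", path);
    if (len < 0 || (size_t)len >= sizeof tmp_path)
        return -1;

    FILE *in = fopen(path, "r");
    if (in == NULL)
        return -1;
    FILE *out = fopen(tmp_path, "w");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    char chunk[256];          /* a long line arrives in several chunks */
    long line_no = 1;
    int at_line_start = 1;
    int replaced = 0;
    while (fgets(chunk, sizeof chunk, in) != NULL) {
        size_t n = strlen(chunk);
        int ends_line = (n > 0 && chunk[n - 1] == '\n');
        if (line_no == target) {
            if (at_line_start) {          /* emit the replacement once */
                fputs(new_text, out);
                replaced = 1;
            }
            if (ends_line)
                fputc('\n', out);         /* keep the original line ending */
        } else {
            fputs(chunk, out);
        }
        at_line_start = ends_line;
        if (ends_line)
            line_no++;
    }

    int failed = ferror(in) || ferror(out);
    fclose(in);
    if (fclose(out) != 0)
        failed = 1;
    if (failed || !replaced) {
        remove(tmp_path);                 /* leave the original untouched */
        return -1;
    }
    return rename(tmp_path, path);
}
```

For example, replace_line("data.txt", 4, "42 3.25 rest of the line") would rewrite line 4 and leave every other line as it was.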
BTW, you might be interested in indexed files like GDBM, or even in databases like SQLite, MariaDB, MongoDB, etc. You might also be interested in standard textual serialization formats like JSON or YAML (both have many libraries, even for C).
fseek() is used for random-access files where each record of data has the same size. Typically the data is binary, not text.
To solve your particular issue, you will need to read one line at a time to find the line you want to change. A simple solution is to copy the lines before it to a temporary file, write your changes to the same temporary file, then skip the parts of the original file that you are replacing and copy the rest to the temporary file. Finally, close the original file, copy the temporary file to it, and delete the temporary file.
With that said, I suggest that you learn more about random-access files. These are very useful when storing records that are all the same size. If you have control over creating the original file, they might be better for your current purpose.
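To give an idea of what a random-access file looks like, here is a small sketch with fixed-size binary records. The struct layout and the function names are invented for the example. Because every record has the same size, record number index starts at byte index * sizeof(struct record), and it can be overwritten in place.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record layout: every record occupies the same number of bytes. */
struct record {
    int   id;
    float value;
    char  name[24];
};

/* Overwrites (or appends) record number `index` (0-based). */
int write_record(FILE *fp, long index, const struct record *rec)
{
    assert(fp != NULL && rec != NULL && index >= 0);
    if (fseek(fp, index * (long)sizeof *rec, SEEK_SET) != 0)
        return -1;
    return fwrite(rec, sizeof *rec, 1, fp) == 1 ? 0 : -1;
}

/* Reads record number `index` (0-based); fails if it does not exist. */
int read_record(FILE *fp, long index, struct record *rec)
{
    assert(fp != NULL && rec != NULL && index >= 0);
    if (fseek(fp, index * (long)sizeof *rec, SEEK_SET) != 0)
        return -1;
    return fread(rec, sizeof *rec, 1, fp) == 1 ? 0 : -1;
}
```

The file should be opened in binary update mode ("r+b", or "w+b" to create it). The raw struct bytes are not portable between machines with different padding or byte order, which is fine for a file that only your own program reads.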
I have a very large CSV file that I want to import straight into PostgreSQL with COPY. For that, the CSV column headers need to match DB column names. So I need to do a simple string replace on the first line of the very large file.
There are many answers on how to do that like:
Is it possible to modify lines in a file in-place?
Optimizing find and replace over large files in Python
All the answers imply creating a copy of the large file or using file-system level solutions that access the entire file, although only the first line is relevant. That makes all solutions slow and seemingly overkill.
What is the underlying cause that makes this simple job so hard? Is it file-system related?
The underlying cause is that a .csv file is a text file, and making changes to the first line of the file implies random access to the first "record" of the file. But text files don't really have "records"; they have lines, of unequal length. So changing the first line implies reading the file up to the first newline, putting something in its place, and then moving all of the rest of the data in the file to the left if the replacement is shorter, or to the right if it is longer. To do that you have two choices: (1) read the entire file into memory so you can do the left or right shift, or (2) read the file line by line and write out a new one.
It is easy to add stuff at the end because that doesn't involve displacing what is there already.
I'm trying to understand the Linux (UNIX) low-level interfaces and, as an exercise, want to write code that copies a file with holes into a new file (again with holes).
So my question is: how do I read from the first file not just up to the first hole, but to the very end of the file?
If I'm not mistaken, read() returns 0 when it reaches the first hole (EOF).
I was thinking about seeking right byte by byte and trying to read this byte, but then I have to know the number of holes in advance.
If by holes you mean sparse files, then you have to find the holes in the input file and recreate them using lseek when writing the output file. Since Linux 3.1, you can even use lseek (with SEEK_DATA and SEEK_HOLE) to jump to the beginning or end of a hole, as described in great detail in the man page.
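A sketch of how that could look on Linux, assuming a kernel and file system that support SEEK_DATA and SEEK_HOLE (on file systems without real support, the whole file is reported as one data region, so you still get a correct but non-sparse copy). The function name copy_sparse is mine.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copies only the data regions of src_path and seeks
   over the holes in the destination. Returns 0 on success, -1 on error. */
int copy_sparse(const char *src_path, const char *dst_path)
{
    assert(src_path != NULL && dst_path != NULL);
    static char buf[65536];
    struct stat st;
    off_t pos = 0;
    int rc = -1;

    int in = open(src_path, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }
    if (fstat(in, &st) != 0)
        goto done;

    while (pos < st.st_size) {
        off_t data = lseek(in, pos, SEEK_DATA);     /* next byte of data */
        if (data < 0) {
            if (errno == ENXIO)
                break;                              /* only a hole remains */
            goto done;
        }
        off_t hole = lseek(in, data, SEEK_HOLE);    /* end of that data region */
        if (hole < 0)
            goto done;
        if (lseek(in, data, SEEK_SET) < 0 || lseek(out, data, SEEK_SET) < 0)
            goto done;
        pos = data;
        while (pos < hole) {
            size_t want = sizeof buf;
            if ((off_t)want > hole - pos)
                want = (size_t)(hole - pos);
            ssize_t n = read(in, buf, want);
            if (n <= 0)
                goto done;
            for (ssize_t off = 0; off < n; ) {      /* handle short writes */
                ssize_t w = write(out, buf + off, (size_t)(n - off));
                if (w < 0)
                    goto done;
                off += w;
            }
            pos += n;
        }
    }
    /* Set the final size, which also recreates a trailing hole. */
    if (ftruncate(out, st.st_size) == 0)
        rc = 0;
done:
    close(in);
    close(out);
    return rc;
}
```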
As ThiefMaster already pointed out, normal file operations will treat holes simply as sequences of zero bytes, so you won't see the EOF you mention.
For copies of sparse files, from the cp manual:
By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.
Thus, try --sparse=always if you need to copy a sparse file 'as-is' (the result still seems to depend on a heuristic, though).
A file is not presented as if it has any gaps. If your intention is to say that the file has sections on one area of the disk, then more on another, etc., you are not going to be able to see this through a call to open() on that file and a series of read() calls. You would instead need to open() and read() the raw disk instead, seeking to sectors on your own.
If your meaning of "holes" in a file is as @ThiefMaster says, just areas of 0 bytes, then these are only "holes" according to your application's use of the data; to the file system they're just bytes in a file, no different from any other. In this case, you can copy it through a simple read of the data source and write to the data target, and you will get a full copy (along with what you're calling holes).
I have gone through all the answers to the similar question posted earlier, "Replacing spaces with %20 in C". However, I can't work out how to do this for a file on a hard disk, where disk accesses can be expensive and the file is too long to load into memory at once. If it did fit, we could simply load the file and write it back onto the existing one.
Further, for memory constraints one would like to replace the original file and not create a new one.
Horrible idea. Since the "%20" is longer than " " you can't just replace chars inside the file, you have to move whatever follows it further back. This is extremely messy and expensive if you want to do it on the existing disk file.
You could try to determine the total growth of the file on a first pass, then do the whole shifting from the back of the file taking blocksize into account and adjusting the shifting as you encounter " ". But as I said -- messy. You really don't want to do that unless it's a definite must.
Read the file, do the replacements, write to a new file, and rename the new file over the old one.
EDIT: as a side effect, if your program terminates while doing the thing you won't end up with a half-converted file. That's actually the reason why many programs write to a new file even if they wouldn't need to, to make sure the file is "always" correct because the new file only replaces the old file after it has been written successfully. It's a simple transaction scheme that doesn't take system failures into account, but works well for application failures (including users forcibly terminating the program)
For the replacement part, you can have two buffers: one that you read into, and one that you write the translated string to and then write to disk. Depending on your memory constraints, even a small input buffer (say 1 KiB) is enough. To avoid repeated reallocations you can keep a fixed buffer for the output and make it three times the size of the input buffer (worst case, the input is all spaces). In total that's 4 KiB of memory, plus whatever buffers the OS uses. I would recommend using a multiple of the disk block size as the input size.
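The two-buffer scheme could look roughly like this. The function name encode_spaces and the IN_SIZE constant are just for illustration. It copies from one open stream to another, so the caller is expected to open the original for reading, a new file for writing, and rename the new file over the old one afterwards.

```c
#include <assert.h>
#include <stdio.h>

#define IN_SIZE 1024   /* input buffer size; a multiple of the block size is a good choice */

/* Hypothetical helper: copies `in` to `out`, replacing each space with "%20".
   Returns 0 on success, -1 on a read or write error. */
int encode_spaces(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);
    char inbuf[IN_SIZE];
    char outbuf[3 * IN_SIZE];   /* worst case: every input byte is a space */
    size_t n;

    while ((n = fread(inbuf, 1, sizeof inbuf, in)) > 0) {
        size_t m = 0;
        for (size_t i = 0; i < n; i++) {
            if (inbuf[i] == ' ') {
                outbuf[m++] = '%';
                outbuf[m++] = '2';
                outbuf[m++] = '0';
            } else {
                outbuf[m++] = inbuf[i];
            }
        }
        if (fwrite(outbuf, 1, m, out) != m)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

Memory use stays at 4 KiB no matter how large the file is, since only one chunk is held at a time.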
The problem is your requirement of reading and writing to the same file. Unfortunately this is impossible. If you read char-by-char, think about what happens when you reach a space: you then have to write three characters, which overwrites the next two characters in the file. Not exactly what you want.
I have to pack a few files in such a way that at some later stage I can unpack them again into the original files using a C program. Please suggest how.
I suppose the explanation for wanting to write your own implementation might be curiosity.
Whether you add compression or not, if you simply want to store files in an archive, similar to the tar command, then you have a few possible approaches.
One of the fundamental choices you have to make is: how to demarcate the boundaries of the packed files within the archive? It is not a great idea to use a special character, because the packed files could contain any character to begin with.
To keep track of the end of files, you can use the length of the file in bytes. For example, you could, for each file:
1. Write to the archive the '\0'-terminated C string which names the packed file.
2. Write to the archive an off64_t which gives the length, in bytes, of the packed file.
3. Write to the archive the actual bytes (if any) of the packed file.
4. (Optional) Write to the archive a checksum or CRC of the packed file.
Repeatedly perform this for each file, concatenating the results with no intervening characters.
Finally, when no files remain, write an empty C-string, a zero character.
The unpacking process is:
1. Read the '\0'-terminated C string which names this packed file.
2. If the name is empty, assert that we have read the entire archive, then exit.
3. Read the off64_t which gives the length of the packed file.
4. Read as many bytes as the packed file length from the archive and write them to the newly created unpacked file.
Again, repeat these steps until step (2) concludes the program.
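A compact sketch of that layout, with a few liberties that are mine: the function names are invented, the length is stored as a uint64_t in the host's byte order rather than an off64_t, the data comes from a memory buffer instead of a file, and the optional checksum is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Appends one entry: NUL-terminated name, 64-bit length, then the bytes. */
int archive_add(FILE *ar, const char *name, const void *data, uint64_t len)
{
    assert(ar != NULL && name != NULL && name[0] != '\0');
    size_t name_bytes = strlen(name) + 1;           /* include the '\0' */
    if (fwrite(name, 1, name_bytes, ar) != name_bytes)
        return -1;
    if (fwrite(&len, sizeof len, 1, ar) != 1)
        return -1;
    if (len > 0 && fwrite(data, 1, (size_t)len, ar) != len)
        return -1;
    return 0;
}

/* Writes the empty name that marks the end of the archive. */
int archive_finish(FILE *ar)
{
    return fputc('\0', ar) == EOF ? -1 : 0;
}

/* Reads the next entry's name and length. Returns 1 if an entry follows
   (the stream is then positioned at its data), 0 at the end marker,
   -1 on error or if the name does not fit in `cap` bytes. */
int archive_next(FILE *ar, char *name, size_t cap, uint64_t *len)
{
    size_t i = 0;
    int c;
    while ((c = fgetc(ar)) != EOF) {
        if (i >= cap)
            return -1;
        name[i++] = (char)c;
        if (c == '\0')
            break;
    }
    if (c == EOF)
        return -1;                                   /* truncated archive */
    if (name[0] == '\0')
        return 0;                                    /* end-of-archive marker */
    return fread(len, sizeof *len, 1, ar) == 1 ? 1 : -1;
}
```

After archive_next returns 1, the caller either reads len bytes into the unpacked file or seeks past them to reach the next entry.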
This design, in which file names alternate with file data, is workable. It has some drawbacks. The essential problem is that the data structure isn't designed for random access. In order to get the information about a file in the "middle" of the archive, a program is required to process the preceding files. The program can call lseek64 to skip over file data that isn't needed, but it still needs to read at least each file name and each file length. The file length is needed to skip over the file data. The file name, as I arranged the data, must be read in order to locate the file length.
So this is inefficient. Even if the file names did not have to be read in order to access the file size, the fact that the file details are sprinkled throughout the archive means that reading the index data requires accessing several ranges of data on the disk.
A better approach might be to write a "block" of index data to the front of the file. This data structure might be something like:
The size of the first file in the archive.
The name of the first file in the archive.
The position, in bytes, within this archive, where the "first file" may be located as a contiguous block of bytes.
The size of the second file in the archive...
And the data in the index might repeat until, again, a file with empty name marks the end of the index.
Having an index like this is nice, but presents a difficulty: when the user wishes to append a file to the archive, the index might need to grow in size. This could change the locations of the packed files within the archive -- the archive program may need to move them around to make room for a bigger index.
The file structure can get more and more complex in order to serve all these different needs. For example, the index can be designed so that it is always allocated out of what the file system considers a "page" (the amount the OS reads or writes from the disk as a minimum-size granule), and if the index needs to grow, discontiguous "index pages" are chained together by file-position data leading from one index page to another. (Like a linked list, but on disk.) The complexity can go on and on.
A fast solution would be to take advantage of an external library like zlib (usage example: http://zlib.net/zlib_how.html ) and use it for compression.
If you want to dig deeper into the topic of compression, have a look at the different lossless compression algorithms and further hints at Wikipedia - Data compression.
I wrote a tar-like program a couple of days ago. Here is my implementation (hope you can get some ideas from it):
Each file is stored in the archive with a "header", which looks like:
<file-type,file-path,file-size,file-mode>
In file-type I used 0 for files and 1 for directories (this way you can recreate the directory tree).
For example, the header of a file named foo.txt of size 245 bytes with mode 0755 (on Unix, see chmod) looks like:
<0,foo.txt,245,0755>
here the file contents
This way, the first character of the archive is always a <. You then parse the comma-separated list (first possible bug) and extract the file type, the path, the size (which tells you how many bytes to read next from the archive, avoiding the "special character bug" pointed out by Heath Hunnicutt) and the mode of the file (say you have a binary file and you want it to be executable when extracted too; you need to chmod it with the original file mode).
About the first possible bug: a comma is not commonly used in a file name, but it's probably better to use another separator character or to "sanitize" the path by wrapping it in a pair of quotation marks. Obviously the parser has to be aware of this and ignore any comma inside the quotation marks.
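For the header parsing, a sketch with sscanf might look like this. The struct and function names are made up, the 255-character path limit is an arbitrary choice, and it has exactly the limitation described above: a comma in the path breaks it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of a <file-type,file-path,file-size,file-mode> header. */
struct entry_header {
    int          type;        /* 0 = file, 1 = directory */
    char         path[256];
    long         size;        /* bytes of content that follow the header */
    unsigned int mode;        /* permission bits, written in octal */
};

/* Parses a header at the start of `text`. Returns the number of characters
   the header occupies, or -1 if the text is not a well-formed header. */
int parse_header(const char *text, struct entry_header *h)
{
    assert(text != NULL && h != NULL);
    int end = 0;
    /* %n stores how many characters were consumed; it is only reached
       if the closing '>' matched, so `end` stays 0 otherwise. */
    int got = sscanf(text, "<%d,%255[^,],%ld,%o>%n",
                     &h->type, h->path, &h->size, &h->mode, &end);
    if (got != 4 || end == 0 || h->size < 0)
        return -1;
    return end;
}
```

The return value tells you where the file contents begin, and size tells you how many bytes to read from there.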
For writing and reading files in C, see fgetc and fputc from stdio.h.
For file information, permissions and walking the directory tree, see stat and chmod from sys/stat.h and ftw from ftw.h (probably Linux/Unix only, because these are POSIX interfaces).
Hope it helps! (If you need some code I can post some snippets; the header parsing is probably the hardest part.)
I'm using fgets to read in text from simple files such as .txt files, but I need the ability to jump back to previous lines. Is there any way to do this using fgets? Or should I just store the text in some sort of array/structure?
fseek or a combination of fgetpos and fsetpos would be appropriate. AFAIK, there is no "go to line X" function; you'll have to save some information about each line (e.g. its starting position) instead, using fseek or fsetpos to move around.
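As an illustration of saving the starting position of each line, here is a sketch with ftell and fseek. The function names are invented. It assumes the file is opened in binary mode ("rb") so that the offsets are plain byte counts and no newline translation gets in the way.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: scans the file once and returns a malloc'd array of
   the offsets where each line starts; *count receives the number of lines.
   Returns NULL if memory runs out. The caller frees the array. */
long *build_line_index(FILE *fp, size_t *count)
{
    assert(fp != NULL && count != NULL);
    size_t cap = 16, n = 0;
    long *starts = malloc(cap * sizeof *starts);
    if (starts == NULL)
        return NULL;

    rewind(fp);
    for (;;) {
        long pos = ftell(fp);            /* offset of the next line, if any */
        int c = fgetc(fp);
        if (c == EOF)
            break;
        if (n == cap) {                  /* grow the array when it is full */
            long *tmp = realloc(starts, 2 * cap * sizeof *tmp);
            if (tmp == NULL) {
                free(starts);
                return NULL;
            }
            starts = tmp;
            cap *= 2;
        }
        starts[n++] = pos;
        while (c != '\n' && c != EOF)    /* skip to the end of this line */
            c = fgetc(fp);
        if (c == EOF)
            break;
    }
    *count = n;
    return starts;
}

/* Positions the stream at the start of `line` (1-based). Returns 0 on success. */
int seek_to_line(FILE *fp, const long *starts, size_t count, size_t line)
{
    if (line < 1 || line > count)
        return -1;
    return fseek(fp, starts[line - 1], SEEK_SET);
}
```

After seek_to_line succeeds, the next fgets call returns that line, so you can move backwards and forwards freely. The index costs one long per line, which is usually much less than keeping the whole text in memory.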
You may be able to solve your problems with fseek() and friends ( http://linux.die.net/man/3/fseek ).
However, mixing the "fseek" functions with text files (especially if you're reading and writing to the same stream) may cause problems due to the library translation of line breaks.
If you're not too tight on memory, I'd go with saving information from previous lines.
Better yet, if possible review your algorithm/data structure so that you don't need to go back.