Reading and piping large files with C - c

I am interested in writing a utility that modifies PostScript files. It needs to traverse the file, make certain decisions about the page count and dimensions, and then write the output to a file or stdout making certain modifications to the PostScript code.
What would be a good way to handle file processing on a *NIX system in this case? I'm fairly new to pipes and forking in C, and it is my understanding that, in case of reading a file directly, I could probably seek back and forth around the input file, but if input is directly piped into the program, I can't simply rewind to the beginning of an input as the input could be a network stream for example, correct?
Rather than store the entire PS file into memory, which can grow huge, it seems like it would make more sense to buffer the input to disk while doing my first pass of page analysis, then re-read from the temporary file, produce output, and remove the temporary file. If that's a viable solution, where would be a good place to store such a file on a *NIX system? I'm not sure how safe such code would be either: the program could potentially be used by multiple users on the same server. It sounds like I would have make sure to save the file somewhere in a temporary directory unique to a given user account as well as give the temporary file on disk a fairly unique name.
Would appreciate any tips and pointers on this crazy puzzling world of file processing.

Use mkstemp(3) to create your temporary file. It will handle concurrency issues for you. mmap(2) will let you move around in the file with abandon.

if input is directly piped into the program, I can't simply rewind to the beginning of an input as the input could be a network stream for example, correct?
That's correct. You can only perform random access on a file.
If you read the file, perhaps you could build a table of metadata, which you can use to seek specific portions of the file later, without keeping the file itself in memory.

/tmp is the temporary directory on unix systems. It's specified by FHS. It's cleaned out when the system is rebooted.
If you need more persistent data storage than that there's /var/tmp which is not cleaned out after reboots. Also FHS.
http://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard

Related

Removing bytes from File in (C) without creating new File

I have a file let's log. I need to remove some bytes let's n bytes from starting of file only. Issue is, this file referenced by some other file pointers in other programs and may these pointer write to this file log any time. I can't re-create new file otherwise file-pointer would malfunction(i am not sure about it too).
I tried to google it but all suggestion for only to re-write to new files.
Is there any solution for it?
I can suggest two options:
Ring bufferUse a memory mapped file as your logging medium, and use it as a ring buffer. You will need to manually manage where the last written byte is, and wrap around your ring appropriately as you step over the end of the ring. This way, your logging file stays a constant size, but you can't tail it like a regular file. Instead, you will need to write a special program that knows how to walk the ring buffer when you want to display the log.
Multiple number of small log filesUse some number of smaller log files that you log to, and remove the oldest file as the collection of files grow beyond the size of logs you want to maintain. If the most recent log file is always named the same, you can use the standard tail -F utility to follow the log contents perpetually. To avoid issues of multiple programs manipulating the same file, your logging code can send logs as messages to a single logging daemon.
So... you want to change the file, but you cannot. The reason you cannot is that other programs are using the file. In general terms, you appear to need to:
stop all the other programs messing with the file while you change it -- to chop now unwanted stuff off the front;
inform the other programs that you have changed it -- so they can re-establish their file-pointers.
I guess there must be a mechanism to allow the other programs to change the file without tripping over each other... so perhaps you can extend that ? [If all the other programs are children of the main program, then if the children all O_APPEND, you have a fighting chance of doing this, perhaps with the help of a file-lock or a semaphore (which may already exist ?). But if the programs are this intimately related, then #jxh has other, probably better, suggestions.]
But, if you cannot change the other programs in any way, you appear to be stuck, except...
...perhaps you could try 'sparse' files ? On (recent-ish) Linux (at least) you can fallocate() with FALLOC_FL_PUNCH_HOLE, to remove the stuff you don't want without affecting the other programs file-pointers. Of course, sooner or later the other programs may overflow the file-pointer, but that may be a more theoretical than practical issue.

File being simultaneously written, read and removed

I'm wondering if there is an mechanism that reads a file while it is being written and remove the content that has been read simultaneously. The purpose for doing this is because the file is stored in memory (ramdisk) and as the file size increases, we need to remove the part that has already being processed.
Thanks a lot!!!
PS: I'm using Linux and Java for this. :)
Data cannot be removed from the beginning or middle of a file. Process the data using multiple files and erase them as they are consumed.
Reading from a file while it is being written to is no big deal, this is the purpose of every tail program, however deleting already read content of an opened file... I don't think it is possible.
You may want to think of a work around. For example you can have a number of files {0,n} with the same limit of bytes to write to. Start writing the file_i where i is the highest available number out of {0,n} and go up to limit. Reading starts from the lowest available file_i, reads up to limit and when done deletes the file just consumed.
We still haven't heard what OS our friend user2386567 is using, but as a counterpoint to the other answers declaring that it's impossible to delete data from the middle of a file, I'd like to point out that Linux has FALLOC_FL_PUNCH_HOLE for that exact purpose.

If my application frequently writes out huge files what would be the most memory efficient way of dealing with the files?

Would it be beneficial for me to create a tmp file then rename it once the whole file is written to ?
This is a standard technique used by many applications to ensure that the file you see is always up to date as writing it in parts can take a while if the file is very large.
Write out to temp file, close the file descriptor and force a sync call to make sure that the contents are written to disk (because of write buffering) and then do a rename on the temp file to the one you want to write.

How exactly is a file opened for reading/writing by applications like msword/pdf?

I want are the steps that an application takes inorder to open the file and allow user to read. File is nothing more than sequence of bits on the disk. What steps does it take to show show the contents of the file?
I want to programatically do this in C. I don't want to begin with complex formats like word/pdf but something simpler. So, which format is best?
If you want to investigate this, start with plain ASCII text. It's just one byte per character, very straightforward, and you can open it in Notepad or any one of its much more capable replacements.
As for what actually happens when a program reads a file... basically it involves making a system call to open the file, which gives you a file handle (just a number that the operating system maps to a record in the filesystem). You then make a system call to read some data from the file, and the OS fetches it from the disk and copies it into some region of RAM that you specify (that would be a character/byte array in your program). Repeat reading as necessary. And when you're done, you issue yet another system call to close the file, which simply tells the OS that you're done with it. So the sequence, in C-like pseudocode, is
int f = fopen(...);
while (...) {
byte foo[BLOCK_SIZE];
fread(f, foo, BLOCK_SIZE);
do something with foo
}
fclose(f);
If you're interested in what the OS actually does behind the scenes to get data from the disk to RAM, well... that's a whole other can of worms ;-)
Start with plain text files

Is it ‘safe’ to remove() open file?

I think about adding possibility of using same the filename for both input and output file to my program, so that it will replace the input file.
As the processed file may be quite large, I think that best solution would to be first open the file, then remove it and create a new one, i.e. like that:
/* input == output in this case */
FILE *inf = fopen(input, "r");
remove(output);
FILE *outf = fopen(output, "w");
(of course, with error handling added)
I am aware that not all systems are going to allow me to remove open file and that's acceptable as long as remove() is going to fail in that case.
I am worried though if there isn't any system which will allow me to remove that open file and then fail to read its' contents.
C99 standard specifies behavior in that case as ‘implementation-defined’; SUS doesn't even mention the case.
What is your opinion/experience? Do I have to worry? Should I avoid such solutions?
EDIT: Please note this isn't supposed to be some mainline feature but rather ‘last resort’ in the case user specifies same filename as both input and output file.
EDIT: Ok, one more question then: is it possible that in this particular case the solution proposed by me is able to do more evil than just opening the output file write-only (i.e. like above but without the remove() call).
No, it's not safe. It may work on your file system, but fail on others. Or it may intermittently fail. It really depends on your operating system AND file system. For an in depth look at Solaris, see this article on file rotation.
Take a look at GNU sed's '--in-place' option. This option works by writing the output to a temporary file, and then copying over the original. This is the only safe, compatible method.
You should also consider that your program could fail at any time, due to a power outage or the process being killed. If this occurs, then your original file will be lost. Additionally, for file systems which do have reference counting, your not saving any space, over the temp file solution, as both files have to exist on disk until the input file is closed.
If the files are huge, and space is at premium, and developer time is cheap, you may be able to open a single for read/write, and ensure that your write pointer does not advance beyond your read pointer.
All systems that I'm aware of that let you remove open files implement some form of reference-counting for file nodes. So, removing a file removes the directory entry, but the file node itself still has one reference from open file handle. In such an implementation, removing a file obviously won't affect the ability to keep reading it, and I find it hard to imagine any other reasonable way to implement this behavior.
I've always got this to work on Linux/Unix. Never on Windows, OS/2, or (shudder) DOS. Any other platforms you are concerned about?
This behaviour actually is useful in using temporary diskspace - open the file for read/write, and immediately delete it. It gets cleaned up automatically on program exit (for any reason, including power-outage), and makes it much harder (but not impossible) for others to monitor it (/proc can give clues, if you have read access to that process).

Resources