file opening pointer - c

In C programming we open a file through a pointer. How many pointers can have the same file open at the same time without getting any error? Is there a limit? Also, does the sequence matter? For example:
f1 = fopen("abc.txt", "r");
f2 = fopen("abc.txt", "w");
Must f2 be closed first, or can f1 be closed first too?

Yes, most standard libraries impose some limit on how many files a particular process can have open at a time. As long as you're halfway reasonable about things, however, and only open files as you need them and close them when you're done, it's rarely an issue. As for the closing sequence: it doesn't matter; each FILE * returned by fopen is an independent stream, so they can be closed in either order.
You're guaranteed that you can open at least FOPEN_MAX files simultaneously. In some cases you can open more than that, but (absent limits imposed elsewhere, such as the OS being short of resources) you can open at least that many.
Edit: As to why you can often open many more files than FOPEN_MAX indicates, it's pretty simple: to guarantee the ability to open N files, you pretty much need to pre-allocate all the space you're going to use for those files (e.g., a buffer for each). Since most programs never open more than a few files at a time, implementations keep that guaranteed number fairly low to avoid wasting memory on space most programs would never use.
Then, to accommodate programs that need to open more files, they can/will use realloc (or something similar) to try to allocate more space as needed. Since realloc can fail, though, the attempt at opening more files can also fail.

This will give you the answer for your system. I got 16 on mine, FWIW.
#include <stdio.h>

int main(void)
{
    printf("%d\n", FOPEN_MAX);
    return 0;
}

Related

Use fopen to open file repeatedly in C

I have a question about "fopen" function.
FILE *pFile1, *pFile2;
pFile1 = fopen(fileName,"rb+");
pFile2 = fopen(fileName,"rb+");
Can I say that pFile1==pFile2? Besides, can the FILE type be used as the key of a map?
Thanks!
Can I say that pFile1 == pFile2?
No, pFile1 and pFile2 are pointers to two distinct FILE structures, returned by the two different function calls.
Give it a try!!
To add further:
Note opening a file that is already open has implementation-defined behavior, according to the C Standard:
FIO31-C. Do not open a file that is already open
subclause 7.21.3, paragraph 8 [ISO/IEC 9899:2011]:
Functions that open additional (nontemporary) files require a file
name, which is a string. The rules for composing valid file names are
implementation-defined. Whether the same file can be simultaneously
open multiple times is also implementation-defined.
Some platforms may forbid a file being opened multiple times simultaneously, while other platforms may allow it; therefore, portable code cannot depend on what will happen if this rule is violated. This isn't a problem on POSIX-compliant systems, though: many applications open a file multiple times to read it concurrently (of course, if you also want to write to it, you may need a concurrency-control mechanism, but that's a different matter).
Can I say that pFile1==pFile2?
(edited after reading the pertinent remark of Grijesh Chauhan)
You can say that pFile1 != pFile2, because two things can happen:
the system forbids opening the file twice, in which case pFile2 will be NULL
the system allows a second opening, in which case pFile2 will point to a different context
This is one more reason among thousands to check system calls, by the way.
Assuming the second call succeeded, you can, for instance, seek to a given position with pFile1 while you read from another position with pFile2.
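As a rough illustration (the file name and offset are placeholders, and, as noted above, whether the second fopen succeeds at all is implementation-defined, so both results are checked):

#include <stdio.h>

int main(void)
{
    FILE *pFile1 = fopen("abc.txt", "rb");
    FILE *pFile2 = fopen("abc.txt", "rb");

    if (!pFile1 || !pFile2) {
        printf("this implementation refused one of the opens\n");
        if (pFile1) fclose(pFile1);
        if (pFile2) fclose(pFile2);
        return 1;
    }

    printf("distinct streams? %s\n", pFile1 != pFile2 ? "yes" : "no");

    /* Each stream keeps its own position: seeking one does not move the other. */
    fseek(pFile1, 10, SEEK_SET);
    printf("pFile1 at %ld, pFile2 at %ld\n", ftell(pFile1), ftell(pFile2));

    fclose(pFile2);
    fclose(pFile1);
    return 0;
}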
As a side note, since you will eventually access the same physical disk, it is rarely a good idea to do so unless you know exactly what you're doing. Seeking back and forth like crazy between two different parts of a big file could eventually force the disk driver to wobble between two physical parts of the disk, reducing your I/O performance dramatically (unless the disk is a non-seeking device like an SSD).
can the FILE type be used as the key of a map?
No, because
it would not make any sense to use an unknown structure of an unknown size whose lifetime you have no direct control of as a key
the FILE class does not implement the necessary comparison operator
You could use a FILE *, though, since any pointer can be used as a map key.
However, it is pretty dangerous to do so. For one thing, the pointer is just like a random number to you. It comes from some memory allocation within the stdio library, and you have no control over it.
Second, if for some reason you deallocate the file handle (i.e. you close the file), you will keep using an invalid pointer as a key unless you also remove the entry from the map. This is doable, but both awkward and dangerous IMHO.

What is generally the best approach reading a file for a compiler?

I know this is a general question.
I'm going to program a compiler and I was wondering whether it's better to extract the language's tokens while reading the file (i.e., first open the file, then extract tokens while reading, and finally close the file) or to read the whole file first, close it, and then work with the data in a variable. The pseudo-code for the second option would be something like:
file = open(filename);
textVariable = read(file);
close(file);
getTokens(textVariable);
The first option would be something like:
file = open(filename);
readWhileGeneratingTokens(file);
close(file);
I guess the first option looks better, since there isn't an additional cost in terms of main memory. However, I think there might be some benefits using the second option, for I minimize the time the file is going to be open.
I can't provide any hard data, but generally the amount of time a compiler spends tokenizing source code is rather small compared to the amount of time spent optimizing/generating target code. Because of this, wanting to minimize the amount of time the source file is open seems premature. Additionally, reading the entire source file into memory before tokenizing would prevent any sort of line-by-line execution (think interpreted language) or reading input from a non-file location (think of a stream like stdin). I think it is safe to say that the overhead in reading the entire source file into memory is not worth the computer's resources and will ultimately be detrimental to your project.
Compilers are carefully designed to be able to proceed on as little as one character at a time from the input. They don't read entire files prior to processing, or rather they have no need to do so: that would just add pointless latency. They don't even need to read entire lines before processing.
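For illustration only, here is a toy sketch of that streaming style: the "tokenizer" below merely counts whitespace-separated words, but it consumes the input one character at a time with fgetc and never needs the whole file in memory (the file name is a placeholder):

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    FILE *src = fopen("input.src", "r");
    if (!src) { perror("fopen"); return 1; }

    long tokens = 0;
    int c, in_token = 0;
    while ((c = fgetc(src)) != EOF) {
        if (isspace(c)) {
            in_token = 0;
        } else if (!in_token) {
            in_token = 1;
            tokens++;               /* start of a new token */
        }
    }

    printf("%ld tokens\n", tokens);
    fclose(src);
    return 0;
}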

random reading on very large files with fgets seems to bring Windows caching to its limits

I have written a C/C++-program for Windows 7 - 64bit that works on very large files. In the final step it reads lines from an input-file (10GB+) and writes them to an output file. The access to the input-file is random, the writing is sequential.
EDIT: Main reason for this approach is to reduce RAM usage.
What I basically do in the reading part is this: (Sorry, very shortened and maybe buggy)
void seekAndGetLine(char* line, size_t lineSize, off64_t pos, FILE* filePointer){
    fseeko64(filePointer, pos, SEEK_SET);
    fgets(line, lineSize, filePointer);
}
Normally this code is fine, not to say fast, but under some very special conditions it gets very slow. The behaviour doesn't seem to be deterministic, since the performance drops occur on different machines at different parts of the file, or don't occur at all. It even goes so far that the program totally stops reading while there are no disk operations.
Another symptom seems to be the RAM usage. My process keeps its RAM steady, but the RAM used by the system sometimes grows very large. After using some RAM tools I found out that the Windows Mapped File grows to several GBs. This behaviour also seems to depend on the hardware, since it occurs on different machines at different parts of the process.
As far as I can tell, this problem doesn't exist on SSDs, so it definitely has something to do with the response time of the HDD.
My guess is that the Windows caching somehow gets "weird". The program is fast as long as the cache does its work. But when caching goes wrong, the behaviour turns into either "stop reading" or "grow cache size", and sometimes even both. Since I'm no expert on the Windows caching algorithms, I would be happy to hear an explanation. Also, is there any way to manipulate/stop/enforce the caching from C/C++ on Windows?
Since I've been hunting this problem for a while now, I've already tried some tricks that didn't work out:
filePointer = fopen(fileName, "rbR"); //Just fills the cache till the RAM is full
massive buffering of the reads/writes, to stop the two from getting in each other's way
Thanks in advance
Truly random access across a huge file is the worst possible case for any cache algorithm. It may be best to turn off as much caching as possible.
There are multiple levels of caching:
the CRT library (since you're using the f- functions)
the OS and filesystem
probably onboard the drive itself
If you replace your I/O calls via the f- functions in the CRT with the comparable ones in the Windows API (e.g., CreateFile, ReadFile, etc.) you can eliminate the CRT caching, which may be doing more harm than good. You can also warn the OS that you're going to be doing random accesses, which affects its caching strategy. See options like FILE_FLAG_RANDOM_ACCESS and possibly FILE_FLAG_NO_BUFFERING.
You'll need to experiment and measure.
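As a hedged sketch of that direction (the file name, offset, and buffer size below are placeholders, and error handling is minimal), opening the file with a random-access hint and doing one positioned read might look roughly like this:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* FILE_FLAG_RANDOM_ACCESS is only a hint to the OS cache manager. */
    HANDLE h = CreateFileA("big_input.dat",
                           GENERIC_READ,
                           FILE_SHARE_READ,
                           NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_RANDOM_ACCESS,
                           NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFileA failed: %lu\n", (unsigned long)GetLastError());
        return 1;
    }

    LARGE_INTEGER pos;
    pos.QuadPart = 1234567;                     /* arbitrary example offset */
    if (!SetFilePointerEx(h, pos, NULL, FILE_BEGIN)) {
        fprintf(stderr, "SetFilePointerEx failed: %lu\n", (unsigned long)GetLastError());
        CloseHandle(h);
        return 1;
    }

    char buf[4096];
    DWORD bytesRead = 0;
    if (ReadFile(h, buf, sizeof buf, &bytesRead, NULL))
        printf("read %lu bytes\n", (unsigned long)bytesRead);

    CloseHandle(h);
    return 0;
}

Keep in mind that FILE_FLAG_NO_BUFFERING imposes extra alignment requirements on offsets and buffer sizes, so measure before committing to it.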
You might also have to reconsider how your algorithm works. Are the seeks truly random? Can you re-sequence them, perhaps in batches, so that they're in order? Can you limit access to a relatively small region of the file at a time? Can you break the huge file into smaller files and then work with one piece at a time? Have you checked the level of fragmentation on the drive and on the particular file?
Depending on the larger picture of what your application does, you could possibly take a different approach - maybe something like this:
decide which lines you need from the input file and store the line numbers in a list
sort the list of line numbers
read through the input file once, in order, and pull out the lines you need (better yet, seek to the next line and grab it, especially when there are big gaps)
if the list of lines you're grabbing is small enough, you can store them in memory for reordering before output; otherwise, stick them in a smaller temporary file and use that file as input for your current algorithm to reorder the lines for final output
It's definitely a more complex approach, but it would be much kinder to your caching subsystem, and as a result, could potentially perform significantly better.
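A compressed sketch of the first three steps, assuming you work with byte offsets rather than line numbers (the file name, offsets, and output handling are all hypothetical, and a real 10 GB+ file would need fseeko()/_fseeki64() instead of fseek()):

#include <stdio.h>
#include <stdlib.h>

static int cmp_offsets(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    long offsets[] = { 900000L, 1000L, 4500000L, 250000L };  /* example positions */
    size_t n = sizeof offsets / sizeof offsets[0];
    char line[4096];

    qsort(offsets, n, sizeof offsets[0], cmp_offsets);       /* read in file order */

    FILE *in = fopen("big_input.txt", "rb");
    if (!in) { perror("fopen"); return 1; }

    for (size_t i = 0; i < n; i++) {
        if (fseek(in, offsets[i], SEEK_SET) != 0) { perror("fseek"); break; }
        if (fgets(line, sizeof line, in))
            fputs(line, stdout);                             /* or write to a temp file */
    }

    fclose(in);
    return 0;
}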

What is the best way to truncate the beginning of a file in C?

There are many similar questions, but nothing that answers this specifically after googling around quite a bit. Here goes:
Say we have a file (could be binary, and much bigger too):
abcdefghijklmnopqrztuvwxyz
what is the best way in C to "move" a right most portion of this file to the left, truncating the beginning of the file.. so, for example, "front truncating" 7 bytes would change the file on disk to be:
hijklmnopqrztuvwxyz
I must avoid temporary files, and would prefer not to use a large buffer to read the whole file into memory. One possible method I thought of is to fopen the file with the "rb+" mode and constantly fseek back and forth, reading and writing to copy bytes from the offset back to the beginning, then SetEndOfFile to truncate at the end. That seems to be a lot of seeking (possibly inefficient).
Another way would be to fopen the same file twice, and use fgetc and fputc with the respective file pointers. Is this even possible?
If there are other ways, I'd love to read all of them.
You could mmap the file into memory and then memmove the contents. You would have to truncate the file separately.
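A minimal POSIX sketch of that idea, assuming the file fits in the address space; the file name and the 7-byte cut are placeholders and error handling is kept short:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const off_t cut = 7;                      /* bytes to drop from the front */
    int fd = open("example.bin", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= cut) { close(fd); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    memmove(p, p + cut, st.st_size - cut);    /* shift the tail to the front */
    munmap(p, st.st_size);

    if (ftruncate(fd, st.st_size - cut) < 0)  /* then cut off the stale tail */
        perror("ftruncate");

    close(fd);
    return 0;
}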
You don't have to use an enormous buffer size, and the kernel is going to be doing the hard work for you. But yes, reading a bufferful from further up the file and writing it nearer the beginning is the way to do it if you can't afford the simpler job of creating a new file, copying what you want into it, and then copying the new (temporary) file over the old one. I wouldn't rule out the possibility that the copy-to-a-new-file approach, followed by either moving the new file into place or copying it over the old one, will be faster than the shuffling process you describe. If the number of bytes to be removed were a disk block size rather than 7 bytes, the situation might be different, but probably not. The only disadvantage is that the copying approach requires more intermediate disk space.
Your outline approach will require the use of truncate() or ftruncate() to shorten the file to the proper length, assuming you are on a POSIX system. If you don't have truncate(), then you will need to do the copying.
Note that opening the file twice will work OK if you are careful not to clobber the file when opening for writing - using "r+b" mode with fopen(), or avoiding O_TRUNC with open().
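For what it's worth, a rough sketch of the single-stream shuffle described above might look like the following; the file name and the 7-byte cut are placeholders, ftruncate()/fileno() are POSIX, and error handling is minimal:

#include <stdio.h>
#include <unistd.h>   /* ftruncate(), fileno() */

int main(void)
{
    const long cut = 7;
    char buf[8192];
    FILE *fp = fopen("example.bin", "r+b");
    if (!fp) { perror("fopen"); return 1; }

    long rd = cut, wr = 0;
    size_t n;
    for (;;) {
        fseek(fp, rd, SEEK_SET);              /* read a block from further up */
        n = fread(buf, 1, sizeof buf, fp);
        if (n == 0) break;
        fseek(fp, wr, SEEK_SET);              /* write it nearer the beginning */
        fwrite(buf, 1, n, fp);
        rd += (long)n;
        wr += (long)n;
    }

    fflush(fp);                               /* push buffered data out first */
    if (ftruncate(fileno(fp), wr) != 0)       /* drop the now-duplicated tail */
        perror("ftruncate");

    fclose(fp);
    return 0;
}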
If you are using Linux, since Kernel 3.15 you can use
#include <fcntl.h>
int fallocate(int fd, int mode, off_t offset, off_t len);
with the FALLOC_FL_COLLAPSE_RANGE flag.
http://manpages.ubuntu.com/manpages/disco/en/man2/fallocate.2.html
Note that not all file systems support it but most modern ones such as ext4 and xfs do.
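A hedged usage sketch (the file name is a placeholder): note that the man page requires both offset and len to be multiples of the filesystem block size, so this cannot collapse exactly 7 bytes; the 4096 below assumes a 4 KiB block size.

#define _GNU_SOURCE            /* exposes fallocate() in <fcntl.h> */
#include <fcntl.h>
#include <linux/falloc.h>      /* FALLOC_FL_COLLAPSE_RANGE (older glibc may need this) */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.bin", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Remove the first 4096 bytes in place, without rewriting the rest. */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 4096) != 0)
        perror("fallocate");   /* e.g. EOPNOTSUPP on unsupported filesystems */

    close(fd);
    return 0;
}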

C malloc/free + fgets performance

As I loop through lines in file A, I am parsing the line and putting each string (char*) into a char**.
At the end of a line, I then run a procedure that consists of opening file B, using fgets, fseek and fgetc to grab characters from that file. I then close file B.
I repeat reopening and reclosing file B for each line.
What I would like to know is:
Is there a significant performance hit from using malloc and free, such that I should use something static like myArray[NUM_STRINGS][MAX_STRING_WIDTH] instead of a dynamic char** myArray?
Is there significant performance overhead from opening and closing file B (conceptually, many thousands of times)? If my file A is sorted, is there a way for me to use fseek to move "backwards" in file B, to reset where I was previously located in file B?
EDIT Turns out that a two-fold approach greatly reduced the runtime:
My file B is actually one of twenty-four files. Instead of opening up the same file B1 a thousand times, and then B2 a thousand times, etc. I open up file B1 once, close it, B2 once, close it, etc. This reduces many thousands of fopen and fclose operations to roughly 24.
I used rewind() to reset the file pointer.
This yielded a roughly 60-fold speed improvement, which is more than sufficient. Thanks for pointing me to rewind().
If your dynamic array grows in time, there is a copy cost on some reallocs. If you use the "always double" heuristic, this is amortized to O(n), so it is not horrible. If you know the size ahead of time, a stack allocated array will still be faster.
For the second question read about rewind. It has got to be faster than opening and closing all the time, and lets you do less resource management.
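An illustrative sketch of that reuse pattern, with placeholder file names and the per-line work elided:

#include <stdio.h>

int main(void)
{
    /* Open file B once and rewind() it before each pass, instead of
       calling fopen()/fclose() for every line of file A. */
    FILE *a = fopen("fileA.txt", "r");
    FILE *b = fopen("fileB.txt", "r");
    if (!a || !b) { perror("fopen"); return 1; }

    char lineA[1024], lineB[1024];
    while (fgets(lineA, sizeof lineA, a)) {
        rewind(b);                            /* back to the start of B, no reopen */
        while (fgets(lineB, sizeof lineB, b)) {
            /* ... compare/parse lineA against lineB here ... */
        }
    }

    fclose(b);
    fclose(a);
    return 0;
}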
What I would like to know is:
does your code work correctly?
is it running fast enough for your purpose?
If the answer to both of these is "yes", don't change anything.
Opening and closing has a variable overhead depending on whether other programs are competing for that resource.
Measure the file size first and then use that to calculate the array size in advance, so you can do one big heap allocation.
You won't get a multi-dimensional array right off, but a bit of pointer arithmetic and you are there.
Can you not cache positional information in the other file and then, rather than opening and closing it, use previous seek indexes as an offset? Depends on the exact logic really.
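Here is a minimal sketch of the "one big heap allocation plus pointer arithmetic" idea mentioned above; the counts are hypothetical and would really be derived from the measured file size:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t num_strings = 1000;        /* e.g. estimated from file size / average line length */
    size_t max_width   = 128;

    /* One block instead of an array of per-string allocations. */
    char *block = malloc(num_strings * max_width);
    if (!block) { perror("malloc"); return 1; }

    /* "Row" i starts at block + i * max_width. */
    for (size_t i = 0; i < 3; i++) {
        char *row = block + i * max_width;
        snprintf(row, max_width, "string number %zu", i);
    }

    printf("%s\n", block + 2 * max_width);   /* prints "string number 2" */

    free(block);                             /* one free releases everything */
    return 0;
}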
If your files are large, disk I/O will be far more expensive than memory management. Worrying about malloc/free performance before profiling indicates that it is a bottleneck is premature optimization.
It is possible that the overhead from frequent open/close is significant in your program, but again the actual I/O is likely to be more expensive, unless the files are small, in which case the loss of buffers between close and open can potentially cause extra disk I/O. And yes you can use ftell() to get the current position in a file then fseek with SEEK_SET to get to that.
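A small sketch of that save-and-restore pattern (placeholder file name; for files larger than 2 GB you'd prefer fgetpos()/fsetpos() or the 64-bit seek variants):

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("fileB.txt", "r");
    if (!fp) { perror("fopen"); return 1; }

    char line[256];
    fgets(line, sizeof line, fp);        /* consume the first line */

    long saved = ftell(fp);              /* remember where we are */
    fgets(line, sizeof line, fp);        /* read ahead... */

    fseek(fp, saved, SEEK_SET);          /* ...and jump back to the saved spot */
    fgets(line, sizeof line, fp);        /* re-reads the same line */

    fclose(fp);
    return 0;
}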
There is always a performance hit with using dynamic memory. Using a static buffer will provide a speed boost.
There is also going to be a performance hit with reopening a file. You can use fseek(fp, pos, SEEK_SET) to set the file pointer to any position in the file, or fseek(fp, offset, SEEK_CUR) to do a relative move.
Significant performance hit is relative, and you will have to determine what that means for yourself.
I think it's better to allocate the actual space you need, and the overhead will probably not be significant. This avoids both wasting space and stack overflows.
Yes. Though the IO is cached, you're making unnecessary syscalls (open and close). Use fseek with probably SEEK_CUR or SEEK_SET.
In both cases, there is some performance hit, but the significance will depend on the size of the files and the context your program runs in.
If you actually know the max number of strings and max width, this will be a lot faster (but you may waste a lot of memory if you use less than the "max"). The happy medium is to do what a lot of dynamic array implementations in C++ do: whenever you have to realloc myArray, alloc twice as much space as you need, and only realloc again once you've run out of space. This keeps the number of reallocations down to O(log n), and the total copying cost is amortized to O(n).
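A short sketch of that doubling strategy for a growing char ** array (names are hypothetical, error handling is abbreviated, and strdup is POSIX):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t count = 0, capacity = 8;
    char **strings = malloc(capacity * sizeof *strings);
    if (!strings) return 1;

    char word[64];
    while (scanf("%63s", word) == 1) {
        if (count == capacity) {                          /* out of room: double it */
            capacity *= 2;
            char **tmp = realloc(strings, capacity * sizeof *strings);
            if (!tmp) break;                              /* keep the old block on failure */
            strings = tmp;
        }
        strings[count++] = strdup(word);
    }

    printf("stored %zu strings (capacity %zu)\n", count, capacity);

    for (size_t i = 0; i < count; i++)
        free(strings[i]);
    free(strings);
    return 0;
}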
Reopening the file thousands of times may be a big performance hit. I strongly recommend using fseek, though the details will depend on your algorithm.
I often find the performance overhead of malloc to be outweighed by the direct control over memory that it and the other low-level C memory routines give you. Unless these areas of memory are going to remain static and untouched for long enough that the allocation cost never amortizes, it may be more beneficial to stick with the static array. In the end, it's up to you.
