Use fopen to open file repeatedly in C

I have a question about "fopen" function.
FILE *pFile1, *pFile2;
pFile1 = fopen(fileName,"rb+");
pFile2 = fopen(fileName,"rb+");
Can I say that pFile1==pFile2? Besides, can FILE type be used as a key of map?

Can I say that pFile1 == pFile2?
No. pFile1 and pFile2 are pointers to two distinct FILE structures, returned by two different function calls.
To add further:
Note that opening a file that is already open has implementation-defined behavior, according to the C Standard:
FIO31-C. Do not open a file that is already open
subclause 7.21.3, paragraph 8 [ISO/IEC 9899:2011]:
Functions that open additional (nontemporary) files require a file
name, which is a string. The rules for composing valid file names are
implementation-defined. Whether the same file can be simultaneously
open multiple times is also implementation-defined.
Some platforms may forbid a file being opened multiple times simultaneously, while others allow it. Therefore, portable code cannot depend on what will happen if this rule is violated. This is not a problem on POSIX-compliant systems, though, where many applications open a file multiple times to read it concurrently (of course, if you also want to write, you may need a concurrency control mechanism, but that is a different matter).

Can I say that pFile1==pFile2?
(edited after reading the pertinent remark of Grijesh Chauhan)
You can say that pFile1 != pFile2 (assuming the first call succeeded), because one of two things can happen:
the system forbids opening the file twice, in which case pFile2 will be NULL
the system allows a second opening, in which case pFile2 will point to a different context.
This is one more reason among thousands to check the return value of calls like fopen, by the way.
Assuming the second call succeeded, you can, for instance, seek to a given position with pFile1 while you read from another with pFile2.
As a side note, since you will eventually access the same physical disk, it is rarely a good idea to do so unless you know exactly what you're doing. Seeking back and forth like crazy between two different parts of a big file could eventually force the disk driver to wobble between two physical parts of the disk, reducing your I/O performance dramatically (unless the disk is a non-seeking device like an SSD).
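The "different context" point can be checked with a small experiment. This is only a sketch: demo_two_streams is a hypothetical helper name, and it assumes a platform that allows the second fopen (the function reports failure otherwise).

```c
#include <stdio.h>

/* Hypothetical helper: opens `path` twice and records the position of each
   stream after seeking only the first one.
   Returns 0 on success, -1 if either open (or the seek) fails. */
int demo_two_streams(const char *path, long *pos1, long *pos2)
{
    FILE *f1 = fopen(path, "rb+");
    if (f1 == NULL)
        return -1;

    FILE *f2 = fopen(path, "rb+");
    if (f2 == NULL) {            /* the platform refused the second open */
        fclose(f1);
        return -1;
    }

    int rc = -1;
    /* The two pointers differ, and moving f1 does not move f2. */
    if (f1 != f2 && fseek(f1, 3L, SEEK_SET) == 0) {
        *pos1 = ftell(f1);
        *pos2 = ftell(f2);
        rc = 0;
    }

    fclose(f2);
    fclose(f1);
    return rc;
}
```

On a system that permits both opens, pos1 ends up at 3 while pos2 stays at 0, since each FILE carries its own file position indicator.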
can FILE type be used as a key of map?
No, because
it would not make any sense to use an unknown structure of an unknown size whose lifetime you have no direct control of as a key
the FILE type does not provide the necessary comparison operator
You could use a FILE *, though, since any pointer can be used as a map key.
However, it is pretty dangerous to do so. For one thing, the pointer is just like a random number to you. It comes from some memory allocation within the stdio library, and you have no control over it.
Second, if for some reason you deallocate the file handle (i.e. you close the file), you will keep using an invalid pointer as a key unless you also remove the file from the map. This is doable, but both awkward and dangerous IMHO.


C: Reading large files with limited memory

I am working on something that requires reading from and writing to a large file (or equivalent) but is allowed fairly minimal memory to do it (I don't have the exact spec, but let's call the "large" 15GB and the "minimal" 16K). The file is accessed randomly, usually in chunks of 512 bytes, and it is guaranteed that consecutive reads will sometimes be a significant distance apart, possibly at literally opposite ends of the disk (or a small number of MB from either end). Currently I'm using pread/pwrite to hit the locations I want in the file (I was previously using fseek, but abandoned it in favor of pread/pwrite because reasons).
Accessing the file this way is (perhaps obviously) slow, and I'm looking for ways to optimise/speed up the performance as much as possible (with as limited use (read: NO) as possible of external libraries).
I don't mean to be too cagey about exactly what we're doing, so it might help to think of it as a driver for a file system. At one end of the disk we're accessing the file and directory tables, and at the other raw data, so we need to write file information and then skip to the data. But even within such zones, don't assume anything about the layout. There is no guarantee that multiple files (or even multiple chunks of a single file) will be stored contiguously, or even close together. This also means that we can't make assumptions about the order in which data will be read.
A couple of things I have considered include:
Opening Multiple File Descriptors for different parts of the file (but I'm not sure there's any state associated with the FD and whether this would even have an impact)
A few smarts around caching data that I expect to be accessed several times in a short amount of time
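On the question of per-descriptor state: the file offset is the positional state tied to an open file description, and pread takes an explicit offset and leaves that state alone. A minimal sketch, assuming a POSIX system; read_block, pread_keeps_offset and BLOCK_SIZE are hypothetical names chosen for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 512

/* Hypothetical helper: read the 512-byte block with index `block_no`.
   Returns the number of bytes read (short only at end of file), -1 on error. */
ssize_t read_block(int fd, off_t block_no, unsigned char *buf)
{
    size_t done = 0;
    while (done < BLOCK_SIZE) {
        ssize_t n = pread(fd, buf + done, BLOCK_SIZE - done,
                          block_no * BLOCK_SIZE + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            break;               /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Demonstration on a scratch file: returns 1 if the descriptor's own offset
   is the same before and after read_block(), 0 otherwise. */
int pread_keeps_offset(void)
{
    unsigned char block[BLOCK_SIZE] = {0};
    int ok = 0;
    FILE *tmp = tmpfile();
    if (tmp == NULL)
        return 0;

    /* Write two blocks so that block 1 exists, then push them to the fd. */
    if (fwrite(block, 1, BLOCK_SIZE, tmp) == BLOCK_SIZE &&
        fwrite(block, 1, BLOCK_SIZE, tmp) == BLOCK_SIZE &&
        fflush(tmp) == 0) {
        int fd = fileno(tmp);
        off_t before = lseek(fd, 0, SEEK_CUR);
        ok = before != (off_t)-1 &&
             read_block(fd, 1, block) == BLOCK_SIZE &&
             lseek(fd, 0, SEEK_CUR) == before;
    }
    fclose(tmp);
    return ok;
}
```

Since every pread call already names its own offset, extra descriptors would not add any positional state that a single descriptor lacks.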
I was wondering whether others might have been in a similar boat and/or have opinions (or articles they can link) that discuss different strategies to minimise the impact of reading.
I guess I was always wondering whether pread is even the right choice in this situation....
Any thoughts/opinions/criticisms/etc more than welcome.
NOTE: The program will always run in a single thread (so options don't need to be thread-safe, but equally pushing the read to the background isn't an option either).

What does a file pointer point to in C?

I am trying to understand input and output files in C. In the beginning, when we want to open a file to read, we declare a file pointer as follows:
FILE *fptr1 = fopen("filename", "r");
I understand that FILE is a data structure in the stdio.h library and that it contains information about the file. I also know that the fopen() function returns a FILE structure. But is that the purpose of the pointer? Does it just point to a bunch of information about the file? I've been reading into this and I have heard the term "file streams" floating around a bit. I understand that it is an interface of communication with the file (I find it vague, but I'll take it). Is that what the pointer points to in simple terms: a file stream? In the above code example, would the pointer be pointing to an input file stream?
The FILE structure is intended to be opaque. In other words, you are not supposed to look into it if you want your programs to remain portable.
Further, FILE is always used through a pointer, so you don't even need to know its size.
In a way, you can consider it a void * for all intents and purposes.
Now, if you are really interested in what the FILE type may hold, the C standard itself explains it quite well! See C11 7.21.1p2:
(...) FILE which is an object type capable of recording all the information needed to control a stream, including its file position indicator, a pointer to its associated buffer (if any), an error indicator that records whether a read/write error has occurred, and an end-of-file indicator that records whether the end of the file has been reached; (...)
So as you see, at least it contains stuff like:
The position inside the file
A pointer to a buffer
Error flags
EOF flag
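Because FILE is opaque, these pieces of state are reached through standard accessor functions rather than struct members. A small sketch of that idea; struct stream_state and query_stream are hypothetical names used only for illustration:

```c
#include <stdio.h>

/* Snapshot of some of the state the standard says a FILE records. */
struct stream_state {
    long position;   /* file position indicator, via ftell() */
    int  at_eof;     /* end-of-file indicator, via feof()    */
    int  has_error;  /* error indicator, via ferror()        */
};

/* Query the stream through the portable interface only. */
struct stream_state query_stream(FILE *fp)
{
    struct stream_state s;
    s.position  = ftell(fp);
    s.at_eof    = feof(fp) != 0;
    s.has_error = ferror(fp) != 0;
    return s;
}
```

The buffer is handled the same way: setvbuf lets you choose or replace it without ever touching the structure itself.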
It mentions (as you do) streams. You can find some more details about it in section 7.21.2 Streams:
Input and output, whether to or from physical devices such as terminals and tape drives, or whether to or from files supported on structured storage devices, are mapped into logical data streams, whose properties are more uniform than their various inputs and outputs. Two forms of mapping are supported, for text streams and for binary streams.
A binary stream is an ordered sequence of characters that can transparently record internal data. (...)
As we can read, a stream is an ordered sequence of characters. Note that it does not say whether this sequence is finite or not! (More on that later)
So, how do they relate to files? Let's see section 7.21.3 Files:
A stream is associated with an external file (which may be a physical device) by opening a file, which may involve creating a new file. Creating an existing file causes its former contents to be discarded, if necessary. If a file can support positioning requests (such as a disk file, as opposed to a terminal), then a file position indicator associated with the stream is positioned at the start (character number zero) of the file, unless the file is opened with append mode in which case it is implementation-defined whether the file position indicator is initially positioned at the beginning or the end of the file. The file position indicator is maintained by subsequent reads, writes, and positioning requests, to facilitate an orderly progression through the file.
See, when you open a "disk file" (the typical file in your computer), you are associating a "stream" (finite, in this case) with it, which you can then read/write/close/... through fread() and related functions; and the data structure that holds all the required information about it is FILE.
However, there are other kinds of files. Imagine a pseudo-random number generator. You can conceptualize it as an infinite read-only file: every time you read it gives you a different value and it never "ends". Therefore, this file would have an infinite stream associated with it. And some operations may not make sense with it (e.g. maybe you cannot seek it, i.e. move the file position indicator).
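A pipe is an easy way to see a stream that refuses positioning requests. The sketch below assumes a POSIX system (pipe and fdopen are not part of ISO C); stream_is_seekable and pipe_stream_is_seekable are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if the stream accepts a positioning request, 0 otherwise. */
int stream_is_seekable(FILE *fp)
{
    return fseek(fp, 0L, SEEK_CUR) == 0;
}

/* Wraps the read end of a pipe in a FILE* and asks whether it can seek.
   Returns 1 or 0 as above, or -1 if the pipe could not be set up. */
int pipe_stream_is_seekable(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    FILE *in = fdopen(fds[0], "r");
    if (in == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    int seekable = stream_is_seekable(in);
    fclose(in);          /* also closes fds[0] */
    close(fds[1]);
    return seekable;
}
```

The same FILE * interface covers both cases; only the set of operations that succeed differs.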
This only serves as a quick introduction, but as you can see, the FILE structure is an abstraction over the concept of a file. If you want to learn more about this kind of thing, the best you can do is reach for a good book on Operating Systems, e.g. Modern Operating Systems from Tanenbaum. This book also refers to C, so even better.

Under what circumstances will fseek/ftell or fstat fail to get the size of a file?

I'm trying to access a file as a char array, by memory mapping it, copying it into a buffer, or whatever, but all of these need the size of the file. Easy enough, thought I: just use fseek(file, 0, SEEK_END).
However: according to C++ Reference "Library implementations [of fseek] are allowed to not meaningfully support SEEK_END," Meaning that I can't get the size of a file using that method.
Next I tried fstat, which is less portable, but at least will provide a compile error rather than a runtime problem; but The Open Group notes that fstat does not need to provide a meaningful value for st_size.
So: has anyone actually come across a system where these methods do not work?
The notes about files not having valid sizes reported are there because, in Linux, there are many "files" for which "file size" is not a meaningful concept.
There are two main cases:
The file is not a regular file. In particular, pipes, sockets, and character device files are streams of data where data is consumed on read, and not put on disk, so a size does not make much sense.
The file system that the file resides on does not provide the file size. This is especially common in "virtual" filesystems, where the file contents are generated when read and, again, have no disk backing.
To expand, filesystems do not necessarily keep file contents on disk. Since the filesystem API is a convenient API for expressing hierarchical data, and there are many tools for operating on files, it sometimes makes sense to expose data as a file hierarchy. For example, /proc/ contains information about processes (such as open files and used memory) and /sys/ contains driver-specific information and options (anything from sensor sampling rates to LED colors). With FUSE (Filesystem in UserSpacE), you can program a filesystem to do pretty much anything, from SSHing into a remote computer to exposing Twitter as a filesystem.
For a lot of these filesystems, "file size" may not make much sense. For example, an LED driver might expose three files red, green, and blue. They can be read to get the current color or written to to change the color. Now, is it really worth implementing a file size for them, since they are merely settings in RAM, don't have any disk backing, and can't be removed? Not really.
In summary, files are not necessarily "things on disk". For many of the more advanced usages of files, "file size" either does not make sense or is not worth providing.
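In practice this suggests checking the file type before trusting st_size. A minimal sketch, assuming a POSIX system; regular_file_size and stream_size are hypothetical names. Note that even a "regular" file on a virtual filesystem may report a size that does not match what a read returns, so this check narrows the problem rather than eliminating it.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Returns the size in bytes if fd refers to a regular file,
   or -1 if fstat fails or the file is a pipe, socket, device, etc. */
off_t regular_file_size(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    if (!S_ISREG(st.st_mode))
        return -1;               /* size is not a meaningful concept here */
    return st.st_size;
}

/* Convenience wrapper for stdio streams. Data still sitting in the
   stream's write buffer is not counted until it has been flushed. */
off_t stream_size(FILE *fp)
{
    return regular_file_size(fileno(fp));
}
```

When -1 comes back, the fallback is to read until end of file and grow the buffer as needed instead of sizing it up front.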

How feasible is it to virtualise the FILE* interfaces of C?

I have often noticed that I would have been able to solve practical problems in C elegantly if there had been a way of creating a ‘virtual FILE’ and attaching the necessary callbacks for events such as buffer full, input requested, close, flush. It should then be possible to use a large part of the stdio.h functions, e.g. fprintf, unchanged. Is there a framework enabling one to do this? If not, is it feasible with a moderate amount of effort, on at least some platforms?
Possible applications would be:
To write to or read from a dynamic or static region of memory.
To write to multiple files in parallel.
To read from a thread or co-routine generating data.
To apply a filter to another (virtual or real) FILE.
Support for file formats with indirection (like #include).
A C pre-processor(?).
I am less interested in solutions for specific cases than in a framework to let you roll your own FILE. I am also not looking for a virtual filesystem, but rather virtual FILE*s that I can pass to the CRT.
To my disappointment I have never seen anything of the sort; as far as I can see C11 considers FILE entirely up to the language implementer, which is perhaps reasonable if one wishes to keep the language (+library) specifications small but sad if you compare it with Java I/O streams.
I feel sure that virtual FILEs must be possible with any (fully) open source implementation of the C run-time, but I imagine there might be a large number of details making it trickier than it seems, and if it has already been done it would be a shame to reduplicate the effort. It would also be greatly preferable not to have to modify the CRT code. Without open source one might be able to reverse engineer the functions supplied, but I fear the result would be far too vulnerable to changes in unsupported features, unless there were a commitment to a set of interfaces. I suppose too that any system for which one can write a device driver would allow one to create a virtual device, but I suspect that of being unnecessarily low-level and of requiring one to write privileged code.
I have to admit that while I have code that would have benefited from virtual FILEs, I have no current requirement for it; nonetheless it is something I have often wondered about and that I imagine could be of interest to others.
This is somewhat similar to a-reader-interface-that-consumes-files-and-char-in-c, but there the questioner did not hope to return a virtual FILE; the answer, however, using fmemopen, did.
There is no standard C interface for creating virtual FILE*s, but both the GNU and the BSD standard libraries include one. On Linux (glibc), you can use fopencookie; on most *BSD systems (including Mac OS X), funopen. (See Note 1)
The two interfaces are similar but slightly different in some details. However, it is usually very simple to adapt code written for one interface to the other.
These are not complete virtualizations. They associate the FILE* with four callbacks and a void* context (the "cookie" in fopencookie). The callbacks are read, write, seek and close; there are no callbacks for flush or tell operations. Still, this is sufficient for many simple FILE* adaptors.
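As an illustration of the callback-plus-cookie shape, here is a sketch of a write-only "tee" stream built with glibc's fopencookie. It will not compile as-is against funopen, and tee_cookie, tee_write, tee_close and tee_open are hypothetical names, not part of any library.

```c
#define _GNU_SOURCE              /* fopencookie is a glibc extension */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical cookie: the two real streams every write is copied to. */
struct tee_cookie {
    FILE *a;
    FILE *b;
};

/* write callback: forward the bytes to both underlying streams. */
static ssize_t tee_write(void *cookie, const char *buf, size_t size)
{
    struct tee_cookie *t = cookie;
    if (fwrite(buf, 1, size, t->a) != size ||
        fwrite(buf, 1, size, t->b) != size)
        return 0;                /* 0 signals a write error to glibc */
    return (ssize_t)size;
}

/* close callback: flush both targets and release the cookie.
   The underlying streams stay open; the caller still owns them. */
static int tee_close(void *cookie)
{
    struct tee_cookie *t = cookie;
    int rc = (fflush(t->a) == 0 && fflush(t->b) == 0) ? 0 : EOF;
    free(t);
    return rc;
}

/* Returns a FILE* usable with fprintf, fputs, ... or NULL on failure. */
FILE *tee_open(FILE *a, FILE *b)
{
    struct tee_cookie *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->a = a;
    t->b = b;

    cookie_io_functions_t io = {
        .read  = NULL,           /* write-only stream */
        .write = tee_write,
        .seek  = NULL,           /* not seekable */
        .close = tee_close,
    };
    FILE *fp = fopencookie(t, "w", io);
    if (fp == NULL)
        free(t);
    return fp;
}
```

A caller would do something like fprintf(tee, ...) followed by fclose(tee); the virtual stream buffers as usual, so the callbacks typically run on flush or close rather than on every fprintf.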
For a simple example, see the two answers to Write simultaneousely to two streams.
Note 1: funopen is derived from "functional open", not from "file unopen".

Parsing: load into memory or use stream

I'm writing a little parser and I would like to know the advantages and disadvantages of the different ways to load the data to be parsed. The two ways that I thought of are:
Load the file's contents into a string then parse the string (access the character at an array position)
Parse as reading the file stream (fgetc)
The former will allow me to have two functions, parse_from_file and parse_from_string; however, I believe this mode will take up more memory. The latter does not have the disadvantage of using more memory.
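For reference, the first option can be sketched as below. slurp_stream is a hypothetical name; the function grows its buffer as it reads, so it does not need to know the file size in advance and also works on pipes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads everything left in `fp` into a NUL-terminated heap buffer.
   Returns the buffer (caller frees it) or NULL on error; the number of
   bytes read, excluding the terminator, is stored in *len_out if non-NULL. */
char *slurp_stream(FILE *fp, size_t *len_out)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Always keep one byte spare for the terminator. */
        len += fread(buf + len, 1, cap - 1 - len, fp);
        if (len < cap - 1)
            break;               /* short read: end of file or error */

        char *bigger = realloc(buf, cap * 2);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        cap *= 2;
    }

    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (len_out != NULL)
        *len_out = len;
    return buf;
}
```

With this in place, parse_from_file can simply slurp the stream and hand the result to parse_from_string, at the cost of holding the whole input in memory.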
Does anyone have any advice on the matter?
Reading the entire file in or memory mapping it will be faster, but may cause issues if you want your language to be able to #include other files as these would be memory mapped or read into memory as well.
The stdio functions would work well because they usually try to buffer up data for you, but they are also general purpose so they also try to look out for usage patterns which differ from reading a file from start to finish, but that shouldn't be too much overhead.
A good balance is to have a large circular buffer (x * 2 * 4096 is a good size) which you load with file data and then have your tokenizer read from. Whenever a block's worth of data has been passed to your tokenizer (and you know that it is not going to be pushed back) you can refill that block with new data from the file and update some buffer location info.
Another thing to consider is if there is any chance that the tokenizer would ever need to be able to be used to read from a pipe or from a person typing directly in some text. In these cases your reads may return less data than you asked for without it being at the end of the file, and the buffering method I mentioned above gets more complicated. The stdio buffering is good for this as it can easily be switched to/from line or block buffering (or no buffering).
Using gnu fast lex (flex, but not the Adobe Flash thing) or similar can greatly ease the trouble with all of this. You should look into using it to generate the C code for your tokenizer (lexical analysis).
Whatever you do you should try to make it so that your code can easily be changed to use a different form of next character peek and consume functions so that if you change your mind you won't have to start over.
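One way to keep that flexibility is to route every character access through two small functions. This is only a sketch: struct char_source, src_next and src_peek are hypothetical names, and a real tokenizer might prefer function pointers or a buffer-based source.

```c
#include <stdio.h>

/* Hypothetical character source: the tokenizer only calls src_peek() and
   src_next(), so the backing store can be swapped without touching it. */
struct char_source {
    FILE *fp;          /* if non-NULL, characters come from this stream */
    const char *str;   /* otherwise, from this NUL-terminated string    */
};

/* Consume and return the next character, or EOF at the end of input. */
int src_next(struct char_source *s)
{
    if (s->fp != NULL)
        return fgetc(s->fp);
    if (*s->str == '\0')
        return EOF;
    return (unsigned char)*s->str++;
}

/* Return the next character without consuming it, or EOF. */
int src_peek(struct char_source *s)
{
    if (s->fp != NULL) {
        int c = fgetc(s->fp);
        if (c != EOF)
            ungetc(c, s->fp);    /* one character of pushback is guaranteed */
        return c;
    }
    return *s->str != '\0' ? (unsigned char)*s->str : EOF;
}
```

A string-backed source is set up as { NULL, "text" } and a stream-backed one as { fp, NULL }; the parsing code is the same for both.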
Consider using lex (and perhaps yacc, if the language of your grammar matches its capabilities). Lex will handle all the fiddly details of lexical analysis for you and produce efficient code. You can probably beat its memory footprint by a few bytes, but how much effort do you want to expend into that?
The most efficient approach on a POSIX system would probably be neither of the two (or a variant of the first, if you like): just map the file read-only with mmap, and then parse it. Modern systems are quite efficient with that: they prefetch data when they detect streaming access, multiple instances of your program that parse the same file will share the same physical pages of memory, and so on. And the interfaces are relatively simple to handle, I think.
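A minimal sketch of the mmap approach, assuming a POSIX system and a non-empty regular file; map_file and unmap_file are hypothetical names. Note that the mapped bytes are not NUL-terminated, so the parser has to work with an explicit length.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Maps `path` read-only. Returns a pointer to the file contents and stores
   their length in *len_out, or returns NULL on failure (including empty or
   non-regular files, which this sketch does not handle). */
const char *map_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    void *p = MAP_FAILED;
    if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0)
        p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    close(fd);                   /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}

/* Releases a mapping obtained from map_file(). */
void unmap_file(const char *data, size_t len)
{
    munmap((void *)data, len);
}
```

The tokenizer can then walk the range [data, data + len) with plain pointer arithmetic, and pages are brought in by the kernel as they are touched.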
