Fast way to check if file pointer points to a valid file - c

I am looking for a fast (this is performance-critical code), safe and cross-platform way to check whether a FILE* actually refers to a file after a successful call to fopen().
Asking for the current position with ftell() is one approach, but
I doubt that it is the fastest or safest option, and I suspect there is a more straightforward way dedicated to this purpose.

If a call to fopen has succeeded, but you want to know whether you've just opened a file or something else, I know of two general approaches:
Use fstat on the file descriptor (or stat on the same pathname you just opened), then inspect the mode bits.
Attempt to seek on the file descriptor. If this works as expected it's probably a file; if it doesn't it's a pipe or a socket or something like that.
The code for (1) might look like
struct stat st;
if (fstat(fileno(fp), &st) == 0 &&
    (st.st_mode & S_IFMT) == S_IFREG) {
    /* it's a regular file */
}
To perform (2) I normally seek to offset 1, then test to see what offset I'm at. If I'm at 1, it's a seekable file, and I rewind to 0 for the rest of the program. But if I'm still at 0, it's not a seekable file. (And of course I do this once, right after I open the file, and record the result in my own flag associated with the open file, so the performance hit is minimal.)
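As a sketch, that seek test might look like this (done once, right after fopen):
int seekable = 0;
if (fseek(fp, 1L, SEEK_SET) == 0 && ftell(fp) == 1L) {
    seekable = 1;   /* it's a seekable file */
    rewind(fp);     /* back to offset 0 for the rest of the program */
}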

In C there are three kinds of pointer values:
Values that are NULL, because the programmer initialized them (or because they took advantage of default static initialization).
Values that were returned by a pointer-returning function such as fopen or malloc (and that have not yet been passed to fclose or free).
Values where neither 1 nor 2 is true.
And the simple fact is that if you have a pointer of kind 3, there is no mechanism in the language that will tell you whether the pointer is valid or not. If you have a pointer p that might have been obtained from malloc or not, but you can't remember, there is no way to ask the compiler or run-time system to tell you if it currently points to valid memory. If you have a FILE pointer fp that might have been obtained from fopen or not, but you can't remember, there is no way to ask the compiler or run-time system to tell you if it currently "points to" a valid file.
So it's up to you, the programmer, to keep track of pointer values, and to use programming practices which help you determine whether pointer values are valid or not.
Those ways include the following:
Always initialize pointer variables, either to NULL, or to point to something valid.
When you call a function that returns a pointer, such as fopen or malloc, always test the return value to see if it's NULL, and if it is, return early or print an error message or whatever is appropriate.
When you're finished with a dynamically-allocated pointer, and you release it by calling fclose or free or the equivalent, always set it back to NULL.
If you do these three things religiously, then you can test to see if a pointer is valid by doing
if(p != NULL)
or
if(p)
Similarly, and again if you do those things religiously, you can test to see if a pointer is invalid by doing
if(p == NULL)
or
if(!p)
But those tests work reliably only if you have performed steps 1 and 3 religiously. If you haven't, it's possible -- and quite likely -- for various pointer values to be non-NULL but invalid.
The above is one strategy. I should point out that steps 1 and 3 are not strictly necessary. The other strategy is to apply step 2 religiously, and to never keep around -- never attempt to use -- a pointer that might be null. If functions like fopen or malloc return NULL, you either exit the program immediately, or return immediately from whatever function you're in, typically with a failure code that tells your caller you couldn't do your job (because you couldn't open the file you needed, or you couldn't allocate the memory you needed). In a program that applies rule 2 religiously, you don't even need to test pointers for validity, because all pointer values in such programs are valid. (Well, all pointers are valid as long as Rule 2 was applied religiously. If you forget to apply Rule 2 even once, things can begin to break down.)
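A minimal sketch of that rule-2 style (the filename is hypothetical):
FILE *fp = fopen("settings.txt", "r");
if (fp == NULL) {
    fprintf(stderr, "cannot open settings.txt\n");
    return -1;   /* bail out immediately; a null fp is never kept around */
}
/* from here on, fp is known to be valid */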

Trying to programmatically detect an invalid pointer is a little like hunting for witches.
Supposedly, one way to detect a witch was to hold her underwater. If she died, she was an ordinary human. But if she used her magical powers to avoid drowning, that meant she was a witch — so you killed her. (I can't remember just now if this was ever considered a "legitimate" method, or just a joke out of Monty Python and the Holy Grail.)
But, similarly, if you have a pointer that might be valid or might be invalid, and you try to test it by calling a function that acts on the pointer — like calling ftell on an unknown FILE pointer — there are two possible outcomes:
If the pointer was valid, the function will return normally.
But if the pointer was invalid, the behavior is undefined. In particular, it's significantly likely that the program will crash. That is, the function will not return normally, and it will not return with an error code, either. It will not return at all, because the program will have crashed, and your code (that was going to do one thing or the other depending on whether the pointer was or wasn't valid) won't run at all, because the whole program won't be running any more.
So, once again, if a pointer might or might not be valid, you (that is, explicit code in your program) must keep track of this fact somehow. If you have an unknown pointer value, for which you've lost track of its status, there is no well-defined way to determine its validity.

Related

Assign to a null pointer an area inside a function and preserve the value outside

I have a function that reads from a socket. It returns a char** where the packets are stored, and my intention is to use a NULL unsigned int pointer in which I store the length of each packet.
char** readPackets(int numToRead, unsigned int **lens, int socket){
    char** packets = (char**)malloc(numToRead);
    int *len = (int*)malloc(sizeof(int)*numToRead);
    *(lens) = len;
    for(int i = 0; i < numToRead; i++){
        //read
        packets[i] = (char*)malloc(MAX_ETH_LEN);
        register int pack_len = read(socket, packets[i], MAX_ETH_LEN);
        //TODO handle error in case of freezing
        if(pack_len <= 0){
            i--;
            continue;
        }
        len[i] = pack_len;
    }
    return packets;
}
I use it in this way:
unsigned int *lens_out=NULL;
char **packets=readPackets(N_PACK,&lens,sniff_sock[handler]);
where N_PACK is a constant defined previously.
Now the problem is that when I am inside the function everything works, in fact *(lens) points to the same memory area of len and outside the function lens_out points to the same area too. Inside the function len[i] equals to *(lens[i]) (I checked it with gdb).
The problem is that outside the function even if lens_out points to the same area of len elements with same index are different for example
len[0]=46
lens_out[0]=4026546640
Can anyone explain where I made the mistake?
Your statement char** packets=(char**)malloc(numToRead) for sure does not reserve enough memory. Note that an element of the packets array is of type char*, and that sizeof(char*) is probably 8 (possibly 4), but very, very unlikely 1. So you should write
char** packets = malloc(sizeof(char*) * numToRead)
Otherwise, you write out of the bounds of reserved memory, thereby yielding undefined behaviour (probably the one you explained).
Note further that with i--; continue;, you get memory leaks since you assign a new memory block to the ith element, but you lose reference to the memory reserved right before. Write free(packets[i]);i--;continue; instead.
Further, len[0] is an integral type, whereas lens[0] refers to a pointer to int. Comparing these two does not make sense.
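Putting both fixes together, the relevant lines of the function might look like this (a sketch; the rest of the original logic is unchanged):
char **packets = malloc(sizeof(char *) * numToRead);   /* room for numToRead pointers */
...
if (pack_len <= 0) {
    free(packets[i]);   /* release this slot's buffer before retrying */
    i--;
    continue;
}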
Firstly, I want to put it out there that you should write clear code for the sake of future maintenance, and not for what you think is optimal (yet). This entire function should merely be replaced with read. That's the crux of my post.
Now the problem is that when I am inside the function everything works
I disagree. On a slightly broader topic, the biggest problem here is that you've posted a question containing code which doesn't compile when copied and pasted unmodified, and the question isn't about the error messages, so we can't answer the question without guessing.
My guess is that you haven't noticed these error messages; you're running a stale binary which we don't have the source code for, we can't reproduce the issue and we can't see the old source code, so we can't help you. It is as valid as any other guess. For example, there's another answer which speculates:
Your statement char** packets=(char**)malloc(numToRead) for sure does not reserve enough memory.
The malloc manual doesn't guarantee that precisely numToRead bytes will be allocated; in fact, allocations to processes tend to be performed in pages, and just as the sleep manual doesn't guarantee a precise number of milliseconds/microseconds, malloc may allocate more than you asked for; if it can't satisfy the request, it must return NULL, which your code needs to check.
It's extremely common for implementations to seem to behave correctly when a buffer is overflowed anyway. Nonetheless, it'd be best if you fixed that buffer overflow. malloc doesn't know about the type you're allocating; you need to tell it everything about the size, not just the number of elements.
P.S. You probably want select or sleep within your loop, you know, to "handle error in case of freezing" or whatever. Generally, OSes will switch context to another program when you call one of those, and only switch back when there's data ready to process. By calling sleep after sending or receiving, you give the OS a heads up that it needs to perform some I/O. The ability to choose that timing can be beneficial, when you're optimising. Not at the moment, though.
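For illustration only, a minimal select-based wait could look like this (a sketch, assuming a POSIX socket descriptor and <sys/select.h>):
fd_set rfds;
struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };   /* hypothetical 5s timeout */
FD_ZERO(&rfds);
FD_SET(socket, &rfds);
if (select(socket + 1, &rfds, NULL, NULL, &tv) > 0) {
    /* the socket is readable; read() will not block */
}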
Inside the function len[i] equals to *(lens[i]) (I checked it with gdb).
I'm fairly sure you've misunderstood that. Perhaps gdb is implicitly dereferencing your pointers for you; that's really irrelevant to C (so don't confuse anything you learn from gdb with anything C-related).
In fact, I strongly recommend learning a little bit less about gdb and a lot more about assert, because the former won't help you document your code for future maintenance by other people (including us, the people you ask questions of), whereas the latter will. If you include assert in your code, you're almost certainly strengthening your question (and code) much more than including gdb output in your question would.
The types of lens[i] and *(lens[i]) are different, and their values are affected by the way types are interpreted. These values can only be considered equal when they're converted to the same type. We can see this through C11/3.19p1 (the definition of "value", where the standard establishes that it is dependent upon type). lens[i] is an unsigned int * value, whereas *(lens[i]) is an unsigned int value. The two categories of values might have different alignment, representation and... well, they have different semantics entirely. One is for integral data, and the other is a reference to an object or array. You shouldn't be comparing them, no matter how equal they may seem; the information you obtain from such a comparison is virtually useless.
You can't use lens[i] in a multiplication expression, for example. They're certainly not equal in that respect. They might compare equal (as a side-effect of comparison introducing implicit conversions), which is useless information for you to have, and that is a different story.
memcmp((int[]){0}, (unsigned char[]){ [sizeof(int)] = 42 }, sizeof(int)) may return 0, indicating that they're equal, but you know that array of characters contains an extra byte, right? Yeh... they're equal...
You must check the return value of malloc (and don't cast the return value), if you're using it, though I really think you should reconsider your options in that regard.
The fact that you use malloc means everyone who uses your function must then use free; it's locking down-stream programmers into an anti-pattern that can tear the architecture of software apart. You should separate categories of allocation logic and user interface logic from processing logic.
For example, you use read which gives you the opportunity to choose whatever storage duration you like. This means you have an immense number of optimisation opportunities. It gives you, the downstream programmer, the opportunity to write flexible code which assigns whatever storage duration you like to the memory used. Imagine if, on the other hand, you had to free every return value from every function... That's the mess you're encouraging.
This is especially a poor, inefficient design when constants are involved (i.e. your use case), because you could just use an automatic array and get rid of the calls to malloc and free altogether... Your downstream programmer's code could be:
char packet[count][size];
int n_read = read(fd, packet, size * count);
Perhaps you think using malloc to allocate (and later read) n spaces for packets is faster than using something else to allocate n spaces. You should test that theory, because from my experience computers tend to be optimised for simpler, shorter, more concise logic.
In anticipation:
But I can't return packet; like that!
True. You can't return packet; to your downstream programmer, so you modify an object pointed at by an argument. That doesn't mean you should use malloc, though.
Unfortunately, too many programs are adopting this "use malloc everywhere" mentality. It's reminiscent of the "don't use goto" crap that we've been fed. Rather than listening to cargo cult propaganda, I recommend thinking critically about what you hear, because your peers are in the same position as you; they don't necessarily know what they're talking about.

C Design: Pass memory address or return

In some functions (such as the *scanf variants) there is an argument that takes a pointer to memory for the result. You could also write the code so that it returns an address. What are the advantages? Why design the function in such a weird way?
Example
void process_settings(char* data)
{
    .... // open file and put the contents in the data memory
    return;
}
vs
char* process_settings()
{
    char* data = malloc(some_size);
    .... // open file and load it into data memory
    return data;
}
The benefit is that you can reserve the return value of the function for error checking, status indicators, etc, and actually send back data using the output parameter. In fact, with this pattern, you can send back any amount of data along with the return value, which could be immensely useful. And, of course, with multiple calls to the function (for example, calling scanf in a loop to validate user input), you don't have to malloc every time.
One of the best examples of this pattern being used effectively is the function strtol, which converts a string to a long.
The function accepts a pointer to a char pointer as one of its parameters. It's common to declare a char * locally, often named endptr, and pass its address to the function. The function will return the converted number if it was able to, but if not, it'll return 0 to indicate failure and also set the pointer passed in to point at the non-digit character that stopped the conversion.
You can then report that the conversion failed on that particular character.
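A minimal sketch of that strtol pattern (the input string is hypothetical):
#include <stdlib.h>
#include <stdio.h>

const char *input = "123abc";
char *endptr;
long value = strtol(input, &endptr, 10);
if (endptr == input)
    fprintf(stderr, "no digits found\n");
else if (*endptr != '\0')
    fprintf(stderr, "conversion stopped at '%c'\n", *endptr);
else
    printf("converted: %ld\n", value);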
This is better design than using global error indicators; consider multithreaded programs. It likely isn't reasonable to use global error indicators if you'll be calling functions that could fail in several threads.
You mention that a function should be responsible for its own memory. Well, scanf doesn't exist to create the memory to store the scanned value. It exists to scan a value from an input buffer. The responsibilities of that function are very clear and don't include allocating the space.
It's also not unreasonable to return a malloc'd pointer. The programmer should be prudent, though, and free the returned pointer when they're done using it.
The decision of using one method instead of another depends on what you intend to do.
Example
If you want to modify an array inside a function and keep the modification in the original array, you should use your first example.
If you are creating your own data structure, you have to deal with all the operations, and if you want to create a new struct you should allocate memory inside the function and return the pointer, as in the second example.
If you want to "return" two values from a function, like a vector and the length of the vector, and you don't want to create a struct for this, you could return the pointer of the vector and pass an int pointer as an argument of the function. That way you could modify the value of the int inside the function and you use it outside too.
char* return_vector_and_length(int* length);
Let’s say that, for example, you wanted to store process settings in a specific place in memory. With the first version, you can write this as process_settings(output_buffer + offset);. How would you have to do it if you only had the second version? What would happen to performance if it were a really big array? Or what if, let’s say, you’re writing a multithreaded application where having all the threads call malloc() all the time would make them fight over the heap and serialize the program, so you want to preallocate all your buffers?
Your intuition is correct in some cases, though: on modern OSes that can memory-map files, it does turn out to be more efficient to return a pointer to the file contents than the way the standard library was historically written, and this is how glib does it. Sometimes allocating all your buffers on the heap helps avoid buffer overflows that smash the stack.
An important point is that, if you have the first version, you can trivially get the second one by calling malloc and then passing the buffer as the dest argument. But, if you have only the second, you can’t implement the first without copying the whole array.
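A sketch of that point, reusing the names from the example above:
char* process_settings_alloc(void)
{
    char* data = malloc(some_size);
    if (data != NULL)
        process_settings(data);
    return data;   /* the caller must free() it */
}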

How to handle error conditions in a void function

I'm making a data structures and algorithms library in C for learning purposes (so this doesn't necessarily have to be bullet-proof), and I'm wondering how void functions should handle errors on preconditions. If I have a function for destroying a list as follows:
void List_destroy(List* list) {
/*
...
free()'ing pointers in the list. Nothing to return.
...
*/
}
This has a precondition that list != NULL; otherwise the function will blow up in the caller's face with a segfault.
So as far as I can tell I have a few options: one, I throw in an assert() statement to check the precondition, but that means the function would still blow up in the caller's face (which, as far as I have been told, is a big no-no when it comes to libraries), but at least I could provide an error message; or two, I check the precondition, and if it fails I jump to an error block and just return;, silently chugging along, but then the caller doesn't know the List* was NULL.
Neither of these options seem particularly appealing. Moreover, implementing a return value for a simple destroy() function seems like it should be unnecessary.
EDIT: Thank you everyone. I settled on implementing (in all my basic list functions, actually) consistent behavior for NULL List* pointers being passed to the functions. All the functions jump to an error block and exit(1) as well as report an error message to stderr along the lines of "Cannot destroy NULL list." (or push, or pop, or whatever). I reasoned that there's really no sensible reason why a caller should be passing NULL List* pointers anyway, and if they didn't know they were then by all means I should probably let them know.
Destructors (in the abstract sense, not the C++ sense) should indeed never fail, no matter what. Consistent with this, free is specified to return without doing anything if passed a null pointer. Therefore, I would consider it reasonable for your List_destroy to do the same.
However, a prompt crash would also be reasonable, because in general the expectation is that C library functions crash when handed invalid pointers. If you take this option, you should crash by going ahead and dereferencing the pointer and letting the kernel fire a SIGSEGV, not by assert, because assert has a different crash signature.
Absolutely do not change the function signature so that it can potentially return a failure code. That is the mistake made by the authors of close() for which we are still paying 40 years later.
Generally, you have several options if a constraint of one of your functions is violated:
Do nothing, successfully
Return some value indicating failure (or set something pointed-to by an argument to some error code)
Crash randomly (i.e. introduce undefined behaviour)
Crash reliably (i.e. use assert or call abort or exit or the like)
Where (but this is my personal opinion) this is a good rule of thumb:
the first option is the right choice if you think it's OK to not obey the constraints (i.e. they aren't real constraints), a good example for this is free.
the second option is the right choice, if the caller can't know in advance if the call will succeed; a good example is fopen.
the third and fourth options are a good choice if the former two don't apply. A good example is memcpy. I prefer the use of assert (a form of the fourth option) because it enables both: crashing reliably for people who are unwilling to read your documentation, and undefined behaviour for people who do read it (they will prevent it by obeying your constraints), depending on whether they compile with NDEBUG defined or not. Dereferencing a pointer argument can serve as an assert, because it will make your program crash (which is the right thing; people not reading your documentation should crash as early as possible) if they pass an invalid pointer.
So, in your case, I would make it similar to free and would succeed without doing anything.
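A minimal sketch of that free-like behaviour:
void List_destroy(List* list) {
    if (list == NULL)
        return;   /* like free(NULL): succeed, doing nothing */
    /* ... free()'ing pointers in the list ... */
}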
HTH
If you wish not to return any value from the function, then it is a good idea to have one more argument for an error code.
void List_destroy(List* list, int* ErrCode) {
    *ErrCode = ...
}
Edit:
Changed & to * as question is tagged for C.
I would say that simply returning in case the list is NULL would make sense, as this would indicate that the list is empty (not an error condition). If the list is an invalid pointer, you can't detect that; let the kernel handle it for you by giving a segfault, and let the programmer fix it.

Why does glibc's fclose(NULL) cause a segmentation fault instead of returning an error?

According to man page fclose(3):
RETURN VALUE
Upon successful completion 0 is returned. Otherwise, EOF is returned and the
global variable errno is set to indicate the error. In either case any further
access (including another call to fclose()) to the stream results in
undefined behavior.
ERRORS
EBADF The file descriptor underlying fp is not valid.
The fclose() function may also fail and set errno for any of the errors
specified for the routines close(2), write(2) or fflush(3).
Of course fclose(NULL) should fail, but I expected it to return normally with errno set rather than dying directly with a segmentation fault. Is there any reason for this behavior?
Thanks in advance.
UPDATE: I shall put my code here (I'm trying strerror(), particularly).
FILE *not_exist = NULL;

not_exist = fopen("nonexist", "r");
if(not_exist == NULL){
    printError(errno);
}
if(fclose(not_exist) == EOF){
    printError(errno);
}
fclose requires as its argument a FILE pointer obtained either by fopen, one of the standard streams stdin, stdout, or stderr, or in some other implementation-defined way. A null pointer is not one of these, so the behavior is undefined, just like fclose((FILE *)0xdeadbeef) would be. NULL is not special in C; aside from the fact that it's guaranteed to compare not-equal to any valid pointer, it's just like any other invalid pointer, and using it invokes undefined behavior except when the interface you're passing it to documents as part of its contract that NULL has some special meaning to it.
Further, returning with an error would be valid (since the behavior is undefined anyway) but harmful behavior for an implementation, because it hides the undefined behavior. The preferable result of invoking undefined behavior is always a crash, because it highlights the error and enables you to fix it. Most users of fclose do not check for an error return value, and I'd wager that most people foolish enough to be passing NULL to fclose are not going to be smart enough to check the return value of fclose. An argument could be made that people should check the return value of fclose in general, since the final flush could fail, but this is not necessary for files that are opened only for reading, or if fflush was called manually before fclose (which is a smarter idiom anyway because it's easier to handle the error while you still have the file open).
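A sketch of the fflush-before-fclose idiom mentioned above:
if (fflush(fp) != 0) {
    /* handle the write error here, while fp is still open */
}
fclose(fp);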
fclose(NULL) should succeed. free(NULL) succeeds, because that makes it easier to write cleanup code.
Regrettably, that's not how it was defined. Therefore you can't use fclose(NULL) in portable programs. (E.g. see http://pubs.opengroup.org/onlinepubs/9699919799/).
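If you want free(NULL)-like convenience in portable code, a small wrapper is one option (a sketch; the name is made up):
int fclose_checked(FILE *fp)
{
    return (fp != NULL) ? fclose(fp) : 0;
}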
As others have mentioned, you don't generally want an error return if you pass NULL to the wrong place. You want a warning message, at least on debug/test builds. Dereferencing NULL gives you an immediate warning message, and the opportunity to collect a backtrace which identifies the programming error :). While you're programming, a segfault is about the best error you can get. C has many more subtle errors, which take much longer to debug...
It is possible to abuse error returns to increase robustness against programming errors. However, if you're worried a software crash would lose data, note that exactly the same can happen e.g. if your hardware loses power. That's why we have autosave (since Unix text editors with two-letter names like ex and vi). It'd still be preferable for your software to crash visibly, rather than continuing with an inconsistent state.
The errors that the man page are talking about are runtime errors, not programming errors. You can't just pass NULL into any API expecting a pointer and expect that API to do something reasonable. Passing a NULL pointer to a function documented to require a pointer to data is a bug.
Related question: In either C or C++, should I check pointer parameters against NULL/nullptr?
To quote R.'s comment on one of the answers to that question:
... you seem to be confusing errors arising from exceptional conditions in the operating environment (fs full, out of memory, network down, etc.) with programming errors. In the former case, of course a robust program needs to be able to handle them gracefully. In the latter, a robust program cannot experience them in the first place.
This fclose() issue seems to be a legacy of FreeBSD, and was accepted uncritically by both the Microsoft and Linux camps.
But HP, SGI, Solaris, and CYGWIN, on the other hand, all handle fclose(NULL) reasonably. For example, man fclose for CYGWIN, which uses newlib rather than the OP's glibc, states:
fclose returns 0 if successful (including when FP is NULL or not an open file)
See https://stackoverflow.com/a/8442421/318716 for a related discussion.
I think the man page is talking about the underlying file descriptor (the one obtained internally via the open system call when you call fopen) being invalid, not the file pointer which you pass to fclose.

Why does MapViewOfFile return an unusable pointer for rapidxml?

As suggested: I have a file which is larger than 2 GB. I am mapping it into memory using the following function:
char* ptr = (char*) MapViewOfFile( map_handle,
FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, 0 );
I pass ptr to rapidxml, which accepts Ch*. As per the rapidxml documentation, ptr should be modifiable, but since it is declared to be of type char* this cannot be done. The program compiles, but at runtime it crashes with the following error: Access violation. I found out that this occurs while parsing the char*. How do I get around this, please?
You are passing 0 for the last argument of MapViewOfFile(). That argument is named dwNumberOfBytesToMap. Since you picked zero, the entire 2 gigabytes is going to be mapped. This cannot work in 32-bit mode; there is not nearly enough virtual memory available. The ptr value will be NULL, and any attempt to write through the pointer is going to generate an AV.
You'll need to map sections of the file.
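A minimal sketch of mapping one section at a time (the chunk size here is arbitrary; the offset must be a multiple of the system allocation granularity):
SYSTEM_INFO si;
GetSystemInfo(&si);
DWORD64 offset = 0;   /* advance by the chunk size on each pass */
SIZE_T  chunk  = si.dwAllocationGranularity * 256;
char* view = (char*) MapViewOfFile( map_handle, FILE_MAP_READ,
                                    (DWORD)(offset >> 32),
                                    (DWORD)(offset & 0xFFFFFFFF),
                                    chunk );
if (view != NULL) {
    /* ... work on this window, then UnmapViewOfFile(view) and move on ... */
}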
Blind guess: ptr is probably NULL. From the documentation
If the function fails, the return
value is NULL. To get extended error
information, call GetLastError.
If you give more information, we can probably help more. Check the return value in the debugger. Regarding the first handle parameter, map_handle: the CreateFileMapping and OpenFileMapping functions return this handle. Maybe you used some other function to get a handle?
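A minimal sketch of that check:
char* ptr = (char*) MapViewOfFile( map_handle,
                                   FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, 0 );
if (ptr == NULL) {
    fprintf(stderr, "MapViewOfFile failed, error %lu\n", GetLastError());
    /* handle the failure instead of parsing */
}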
Your "Access Violation" is a memory access error. In other words, your program accessed memory it didn't own. This is probably caused by your parser trying to read beyond the bounds of the memory allocated to the file, or, like jdehaan suggests, your MapViewOfFile function is returning NULL.
UPDATE:
If MapViewOfFile is not returning NULL, then the problem is probably that you are accessing beyond the allocated range for the mapped file. You seemed to indicate in your comments on this question that the parsing operation is also modifying the xml document by adding some terminating tags. This will undoubtedly increase the length of the file and, thus, write past the end of the file's block in memory. That would cause the error you are seeing.
If it isn't that, then perhaps you didn't call CreateFileMapping with the proper access specifiers. The documentation for MapViewOfFile says that the file mapping object needs to be created with PAGE_READWRITE (or PAGE_EXECUTE_READWRITE) if you want a map view which allows read/write access.
If it isn't that, then I would suspect that Hans' answer could be the key. What system are you running this on? Is it 32-bit Windows, or 64-bit? If the file is larger than 2GB, you won't be able to map it.
