Understanding C built-in library function implementations

So I was going through K&R second edition doing the exercises. Feeling pretty confident after doing a few exercises, I thought I'd check the actual implementations of these functions. It was then my confidence fled the scene. I could not understand any of it.
For example, I checked getchar():
Here is the prototype in libio/stdio.h
extern int getchar (void);
So I follow it through and get this:
__STDIO_INLINE int
getchar (void)
{
return _IO_getc (stdin);
}
Again I follow it to the libio/getc.c:
int
_IO_getc (fp)
FILE *fp;
{
int result;
CHECK_FILE (fp, EOF);
_IO_acquire_lock (fp);
result = _IO_getc_unlocked (fp);
_IO_release_lock (fp);
return result;
}
And I'm taken to another header file libio/libio.h, which is pretty cryptic:
#define _IO_getc_unlocked(_fp) \
(_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) \
? __uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++)
Which is where I finally ended my journey.
My question is pretty broad. What does all this mean? I could not for the life of me figure out anything logical out of it by looking at the code. It looks like a bunch of code abstracted away layer after layer.
More importantly, when does it really get the character from stdin?

_IO_getc_unlocked is an inlinable macro. The idea is that you can get a character from the stream without having to call a function, making it hopefully fast enough to use in tight loops, etc.
Let's take it apart one layer at a time. First, what is _IO_BE?
/usr/include/libio.h:# define _IO_BE(expr, res) __builtin_expect ((expr), res)
_IO_BE is a hint to the compiler, that expr will usually evaluate to res. It's used to structure code flow to be faster when the expectation is true, but has no other semantic effect. So we can get rid of that, leaving us with:
#define _IO_getc_unlocked(_fp) \
( ( (_fp)->_IO_read_ptr >= (_fp)->_IO_read_end ) \
? __uflow(_fp) : *(unsigned char *)(_fp)->_IO_read_ptr++)
Let's turn this into an inline function for clarity:
inline int _IO_getc_unlocked(FILE *_fp) {
if (_fp->_IO_read_ptr >= _fp->_IO_read_end)
return __uflow(_fp);
else
return *(unsigned char *)(_fp->_IO_read_ptr++);
}
In short, we have a pointer into a buffer, and a pointer to the end of the buffer. We check if the pointer is outside the buffer; if not, we increment it and return whatever character was at the old value. Otherwise we call __uflow to refill the buffer and return the newly read character.
As such, this allows us to avoid the overhead of a function call until we actually need to do IO to refill the input buffer.
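To make the fast path and the refill path concrete, here is a minimal sketch of the same pattern outside glibc. Everything in it is hypothetical (struct mini_stream, mini_open, mini_uflow, mini_getc are made-up names), and the "device" is just a string in memory rather than a real file descriptor. The buffer is deliberately tiny so the slow path runs often.

```c
#include <stddef.h>
#include <string.h>

#define MINI_EOF   (-1)
#define MINI_BUFSZ 4   /* tiny on purpose so refills happen often */

/* Hypothetical stand-in for FILE: a small buffer plus the "device" behind it. */
struct mini_stream {
    unsigned char buf[MINI_BUFSZ];
    unsigned char *read_ptr;  /* next unread byte, like _IO_read_ptr */
    unsigned char *read_end;  /* one past the last valid byte, like _IO_read_end */
    const char *src;          /* pretend device: bytes not yet delivered */
    size_t src_left;
    int refills;              /* how many times the slow path refilled the buffer */
};

static void mini_open(struct mini_stream *s, const char *text)
{
    s->read_ptr = s->read_end = s->buf;  /* start with an empty buffer */
    s->src = text;
    s->src_left = strlen(text);
    s->refills = 0;
}

/* Slow path, playing the role of __uflow: refill, then return the first byte. */
static int mini_uflow(struct mini_stream *s)
{
    size_t n = s->src_left < MINI_BUFSZ ? s->src_left : MINI_BUFSZ;
    if (n == 0)
        return MINI_EOF;
    memcpy(s->buf, s->src, n);
    s->src += n;
    s->src_left -= n;
    s->read_ptr = s->buf;
    s->read_end = s->buf + n;
    s->refills++;
    return *s->read_ptr++;
}

/* Fast path: no function call while unread bytes remain in the buffer. */
#define mini_getc(s) \
    ((s)->read_ptr >= (s)->read_end ? mini_uflow(s) : *(s)->read_ptr++)
```

Reading a six-character string through mini_getc() with this four-byte buffer should only hit mini_uflow() twice for data; every other call is a pointer compare and a load.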
Keep in mind that standard library functions can be complicated like this; they can also use extensions to the C language (such as __builtin_expect) that are NOT standard and may NOT work on all compilers. They do this because they need to be fast, and because they can make assumptions about what compiler they're using. Generally speaking your own code should not use such extensions unless absolutely necessary, as it'll make porting to other platforms more difficult.
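If you do want a branch hint in your own code, one common way to limit the portability cost is to hide the extension behind a macro with a plain fallback. This is only a sketch: MY_UNLIKELY and count_negative are hypothetical names, and the compiler check shown covers GCC and Clang only.

```c
#include <stddef.h>

/* Hypothetical portability wrapper: use the builtin where it is known to
   exist, otherwise fall back to the plain expression (same result, no hint). */
#if defined(__GNUC__) || defined(__clang__)
#  define MY_UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
#  define MY_UNLIKELY(cond) (cond)
#endif

/* Counts negative entries; the hint says negatives are expected to be rare.
   The hint never changes the answer, only how the compiler lays out the code. */
static size_t count_negative(const int *v, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (MY_UNLIKELY(v[i] < 0))
            count++;
    }
    return count;
}
```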

Going from pseudo-code to real code we can break it down:
if (there is a character in the buffer)
return (that character)
else
call a function to refill the buffer and return the first character
end
Let's use the ?: operator:
#define getc(f) (is_there_buffered_stuff(f) ? *pointer++ : refill())
A bit closer:
#define getc(f) (is_there_buffered_stuff(f) ? *f->pointer++ : refill(f))
Now we are almost there. To determine if there is something buffered already, it uses the file structure pointer and a read pointer within the buffer
_fp->_IO_read_ptr >= _fp->_IO_read_end ?
This actually tests the opposite condition to my pseudo-code, namely "is the buffer empty?". If it is, it calls __uflow(_fp) ("underflow"); otherwise, it reaches directly into the buffer with the pointer, gets the character, and then increments the pointer:
? __uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++)

I can highly recommend The Standard C Library by P.J. Plauger. He provides background on the standard and provides an implementation of every function. The implementation is simpler than what you'll see in glibc or a modern C compiler, but does still make use of macros like the _IO_getc_unlocked() you posted.
The macro is going to pull a character from buffered data (which could be the ungetc buffer) or read it from the stream (which may read and buffer multiple bytes).

The reason there is a standard library is that you should not need to know the exact implementation details of these functions. The code that implements the library calls at some point has to use nonstandard system calls, which have to deal with issues you may not be concerned with. If you are learning C, make sure you can understand other C programs besides the stdlib first. Once you get a little more advanced, look at the stdlib, but it still won't make a lot of sense until you understand the system calls involved.

The definition of getchar() redefines the request as a specific request for a character from stdin.
The definition of _IO_getc() does a sanity check (CHECK_FILE) to make sure that the FILE* is valid, returning EOF if it is not; then it locks the stream to prevent other threads from corrupting the call to _IO_getc_unlocked().
The macro definition of _IO_getc_unlocked() simply checks to see if the read pointer is at or past the end of the buffered data, and either calls __uflow if it is, or returns the char at the read pointer (and advances the pointer) if it is not.
This is standard stuff for all stdlib implementations. You are not supposed to ever look at it. In fact, many stdlib implementations will use assembly language for optimal processing, which is even more cryptic.

Related

Why does this code accessing the array after scanf result in a segmentation error?

For some homework I have to write a calculator in C. I wanted to input some string with scanf and then access it. But when I access the first element I get a segmentation error.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
int main(){
char input1[30];
scanf("%s",input1);
printf("%s",input1);
char current = input1[0];
int counter = 0;
while(current != '\0'){
if(isdigit(current) || current == '+' || current == '-' || current == '*' || current == '/'){
counter++;
current = input1[counter];
}else{
printf("invalid input\n");
exit(1);
}
}
return 0;
}
The printf in line 3 returns the string, but accessing it in line 4 returns a segmentation error (tested in gdb). Why?
There are a few potential causes, some of which have been mentioned in the comments (I won't cover those). It's hard to say which one (or more) is the cause of your problem, so I guess it makes sense to iterate them. However, you may notice that I cite some resources in the process... The information is out there, yet you're not stumbling across it until it's too late. Something needs to change with how you research, because this is slowing your progress down.
On input/output dynamics, just a quick note
printf("%s",input1);
Unless we include a trailing newline, this output may be delayed (or "buffered"), which may have the effect of confusing you about the root of your issues. As an alternative to using a trailing newline (which I'd prefer, personally) you could explicitly force partial lines to be written by invoking fflush(stdout) immediately after each of the relevant output operations, or use setbuf to disable buffering entirely. I think this is unlikely to be your problem, but it may mask your problem, so it's important to realise, when using printf to debug, it might be best to include a trailing newline...
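As a rough sketch of the fflush approach, a small helper can write a partial line and push it out right away, so that debug output appears before whatever crashes next. The name trace_now is made up for this example.

```c
#include <stdio.h>

/* Hypothetical debug helper: write msg (no newline needed) and flush stdout
   immediately. Returns 0 on success, -1 if writing or flushing failed. */
static int trace_now(const char *msg)
{
    if (fputs(msg, stdout) == EOF)
        return -1;
    return fflush(stdout) == 0 ? 0 : -1;
}
```

Calling trace_now("after scanf ") right after the scanf call would then show up even if the program dies on the very next line.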
On main entry points
The first potential culprit I see is here:
int main()
I don't know why our education system is still pushing these broken lessons. My only guess is the professors learnt many years back using the nowadays irrelevant Turbo C and don't want to stay up-to-date with tech. We can further reduce this to a simple testcase to work out if this is your segfault, but like I said, it's hard to say whether this is actually your problem...
#include <string.h>
int main() {
char input1[30];
memset(input1, '\x90', sizeof input1);
return 0; // this is redundant for `main` nowadays, btw
}
To explain what's going on here, I'll cite this page, which you probably ought to go and read (in its entirety) once you're done here:
A common misconception for C programmers, is to assume that a function prototyped as follows takes no arguments:
int foo();
In fact, this function is deemed to take an unknown number of arguments. Using the keyword void within the brackets is the correct way to tell the compiler that the function takes NO arguments.
Simply put, if the linker doesn't know/can't work out how many arguments are required for the entry point, there's probably gonna be some oddness to your callstack, and that's gonna occur at the beginning or end of your program.
On input errors, return values and uninitialised access
#include <assert.h>
#include <stdio.h>
#include <string.h>
int main(void) {
char input1[30];
memset(input1, '\x90', sizeof input1);
scanf("%s",input1); // this is sus a.f.
assert(memchr(input1, '\0', sizeof input1));
}
In my testcase, I actually wrote '\x90' to each byte in the array, to show that if the scanf call fails you may end up with an array that has no null terminator. If this is your problem, this assertion is likely to fail (as you can see from the ideone demo) when you run it, which indicates that your loop is likely accessing garbage beyond the bounds of input1. On this note I intended to demonstrate that we (mostly) cannot rely upon scanf and friends unless we also check their return values! There's a good chance your compiler is warning you about this one, so another lesson is to pay close attention to warning messages, and strive to have none.
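Here is one way the return-value check might look. To keep it easy to try, this sketch uses sscanf on a string instead of scanf on stdin; the return-value convention is the same for the whole family. read_word and WORD_MAX are hypothetical names, and the %29s width is chosen to match the 30-byte array in the question.

```c
#include <stdio.h>

#define WORD_MAX 30  /* same size as input1 in the question */

/* Hypothetical helper: copy the first whitespace-delimited word of line into
   out. Returns 1 if a word was stored, 0 otherwise. out always ends up
   null-terminated, and the %29s width stops sscanf from overrunning it. */
static int read_word(const char *line, char out[WORD_MAX])
{
    out[0] = '\0';  /* defined contents even when the conversion fails */
    return sscanf(line, "%29s", out) == 1;
}
```

The caller should only start looping over the characters when read_word() returns 1.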
On argument expectations for standard library functions
For many standard library functions it may be possible to give input that is outside of the acceptable domain, and so causes instability. The most common form, which I also see in your program, exists in the form of possibly passing invalid values to <ctype.h> functions. In your case, you could change the declaration of current to be an unsigned char instead, but the usual idiom is to put the cast explicitly in the call (like isdigit((unsigned char) current)) so the rest of us can see you're not stuck in this common error, at least while you're learning C.
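A small sketch of that idiom applied to the validation loop from the question; is_calc_char and is_valid_expression are names I made up for illustration.

```c
#include <ctype.h>

/* Hypothetical helper: is c a digit or one of the four operators?
   The cast keeps isdigit's argument within the range it is defined for,
   even when plain char is signed and c holds a byte >= 0x80. */
static int is_calc_char(char c)
{
    return isdigit((unsigned char)c) ||
           c == '+' || c == '-' || c == '*' || c == '/';
}

/* Returns 1 if every character of s is acceptable, 0 otherwise. */
static int is_valid_expression(const char *s)
{
    for (; *s != '\0'; s++) {
        if (!is_calc_char(*s))
            return 0;
    }
    return 1;
}
```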
Please note at this point I'm thinking whichever resources you're using to learn aren't working, because you're stumbling into common traps... please try to find more reputable resources to learn from so you don't fall into more common traps and waste more time later on. If you're struggling, check out the C tag wiki...

Understanding K&R's putc macro: K&R Chapter 8 (The Unix System Interface) Exercise 2

I've been trying to understand K&R's version of putc for some time now, and I'm out of resources (google, stack overflow, clcwiki don't quite have what I'm looking for and I have no friends or colleagues to turn to). I'll explain the context first and then ask for clarification.
This chapter of the text introduced an example of a data structure that describes a file. The structure includes a character buffer for reading and writing large chunks at a time. They then asked the reader to write a version of the standard library putc.
As a clue for the reader, K&R wrote a version of getc that supports both buffered and unbuffered reading. They also wrote the skeleton of the putc macro, leaving the user to write the function _flushbuf() for themselves. The putc macro looks like this (p is a pointer to the file structure):
int _flushbuf(int, FILE *);
#define putc(x,p) (--(p)->cnt >= 0 \
? *(p)->ptr++ = (x) : _flushbuf((x),p)
typedef struct {
int cnt; /*characters left*/
char *ptr; /*next character position*/
char *base; /*location of buffer*/
int flag; /*mode of file access*/
int fd; /*file descriptor*/
} FILE;
Confusingly, the conditional in the macro is actually testing if the structure's buffer is full (this is stated in the text) - as a side note, the conditional in getc is exactly the same but means the buffer is empty. Weird?
Here's where I need clarification: I think there's a pretty big problem with buffered writing in putc; since writing to p is only performed in _flushbuf(), but _flushbuf() is only called when the file structure's buffer is full, then writing is only done if the buffer is entirely filled. And the size for buffered reading is always the system's BUFSIZ. Writing anything other than exactly 'BUFSIZ' characters just doesn't happen, because _flushbuf() will never be called in putc.
putc works just fine for unbuffered writing. But the design of the macro makes buffered writing almost entirely pointless. Is this correct, or am I missing something here? Why is it like this? I truly appreciate any and all help here.
I think you may be misreading what takes place inside the putc() macro; there are a lot of operators and symbols in there, and they all matter (and their order-of-execution matters!) for this to work. To help understand it better, let's substitute it into a real usage, and then expand it out until you can see what's going on.
Let's start with a simple invocation of putc('a', file), as in the example below:
FILE *file = /* ... get a file pointer from somewhere ... */;
putc('a', file);
Now substitute the macro in place of the call to putc() (this is the easy part, and is performed by the C preprocessor; also, I think you're missing a parenthesis at the end of the version you provided, so I'm going to insert it at the end where it belongs):
FILE *file = /* ... get a file pointer from somewhere ... */;
(--(file)->cnt >= 0 ? *(file)->ptr++ = ('a') : _flushbuf(('a'),file));
Well, isn't that a mess of symbols. Let's strip off the unneeded parentheses, and then convert the ?...: into the if-statement that it actually is under the hood:
FILE *file = /* ... get a file pointer from somewhere ... */;
if (--file->cnt >= 0)
*file->ptr++ = 'a';
else
_flushbuf('a', file);
This is closer, but it's still not quite obvious what's going on. Let's move the increments and decrements into separate statements so it's easier to see the order of execution:
FILE *file = /* ... get a file pointer from somewhere ... */;
--file->cnt;
if (file->cnt >= 0) {
*file->ptr = 'a';
file->ptr++;
}
else {
_flushbuf('a', file);
}
Now, with the content reordered, it should be a little easier to see what's going on. First, we decrement cnt, the count of remaining characters. If that indicates there's room left, then it's safe to write a into the file's buffer, at the file's current write pointer, and then we move the write pointer forward.
If there isn't room left, then we call _flushbuf(), passing it both the file (whose buffer is full) and the character we wanted to write but couldn't. Presumably, _flushbuf() will first write the whole buffer out to the actual underlying I/O system, and then it will write that character, and then likely reset ptr to the beginning of the buffer and cnt to a big number to indicate that the buffer is able to store lots of data again.
So why does this result in buffered writing? The answer is that the _flushbuf() call only gets performed "every once in a while," when the buffer is full. Writing a byte to a buffer is cheap, while performing the actual I/O is expensive, so this results in _flushbuf() being invoked relatively rarely (only once for every BUFSIZ characters).
If you write enough, the buffer will eventually get full. If you don't, you will eventually close the file (or the runtime will do that for you when main() returns) and fclose() calls _flushbuf() or its equivalent. Or you will manually fflush() the stream, which also does the equivalent to _flushbuf().
If you were to write a few characters and then call sleep(1000), you would find that nothing gets printed for quite a while. That's indeed the way it works.
The tests in getc and putc are the same because in one case the counter records how many characters are available and in the other case it records how much space is available.
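To see the whole cycle in one place, here is a toy version of the same design. It is not K&R's code: struct out_file, out_open, out_flush, out_flushbuf and out_putc are hypothetical names, the buffer holds only four bytes, and the "device" is an array in memory so you can inspect when bytes actually arrive.

```c
#include <stddef.h>
#include <string.h>

#define OUT_BUFSZ 4  /* tiny on purpose so the buffer fills quickly */

/* Hypothetical stand-in for K&R's FILE, with an in-memory "device". */
struct out_file {
    int cnt;              /* free slots left in buf */
    char *ptr;            /* next free slot in buf */
    char buf[OUT_BUFSZ];
    char sink[64];        /* pretend device */
    size_t sink_len;
    int flushes;
};

static void out_open(struct out_file *f)
{
    f->cnt = OUT_BUFSZ;
    f->ptr = f->buf;
    f->sink_len = 0;
    f->flushes = 0;
}

/* Push whatever is buffered to the device and mark the buffer empty.
   This is what a real fflush()/fclose() would end up doing. */
static void out_flush(struct out_file *f)
{
    size_t n = (size_t)(f->ptr - f->buf);
    size_t room = sizeof f->sink - f->sink_len;
    if (n > room)
        n = room;         /* toy device: silently drop what does not fit */
    memcpy(f->sink + f->sink_len, f->buf, n);
    f->sink_len += n;
    f->ptr = f->buf;
    f->cnt = OUT_BUFSZ;
    f->flushes++;
}

/* Slow path: the buffer was full, so flush it, then store c. */
static int out_flushbuf(int c, struct out_file *f)
{
    out_flush(f);
    f->cnt--;
    return (unsigned char)(*f->ptr++ = (char)c);
}

/* Fast path: same shape as the K&R macro. */
#define out_putc(c, f) \
    (--(f)->cnt >= 0 ? (unsigned char)(*(f)->ptr++ = (char)(c)) \
                     : out_flushbuf((c), (f)))
```

Writing three characters leaves the sink empty, writing a fifth triggers the first flush, and the leftovers only arrive once out_flush() is called explicitly, which mirrors what fclose() or fflush() does for a real stream.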

Working with getenv_s in C to process CONTENT_LENGTH

Being new to C, I just came across the C11 addition getenv_s. Here is what I'm actually trying to do:
Handling POST data sent by html form in CGI C
I'm trying to sanitize, both CONTENT_LENGTH and message-body(stdin) in my case. That is the objective here.
So in order to limit the upper-bounds (against malformed CONTENT_LENGTH, trying to cause overflow), I tried using an array instead of pointer, like this:
char some[512];
some = getenv("CONTENT_LENGTH");
It naturally threw an error (incompatible types when assigning to type char[512] from type char *). So I assume,
Q1. getenv is already a string?
Then I came across "getenv_s"
http://en.cppreference.com/w/c/program/getenv
Q2. Can anyone tell me a safe-as-rocksolid way of using this? To avoid underflow, overflow, etc.
First off, do not use any of the _s functions. They are an optional feature of C11 which, to my knowledge, has never been fully implemented by anyone, not even Microsoft, which invented them, and it has been proposed to remove them again; even more importantly, they do not actually solve the problems they purport to address. (The intention was to have a bunch of drop-in replacements for dangerous string-related functions, but it turns out that that doesn't work; fixing string-related security bugs in C programs requires actual redesign with thought put into it. The functions that genuinely could not be used safely already had portable replacements, e.g. fgets instead of gets, snprintf instead of sprintf, strsep instead of strtok -- sometimes the replacement is not in ISO C but it's usually widespread enough not to worry about, or you can get a shim implementation from gnulib.)
getenv is guaranteed to return a valid NUL-terminated C string (or a null pointer), but the string could be arbitrarily long. In the context of a CGI program written in C, the correct way to "sanitize" the value of the CONTENT_LENGTH environment variable is to feed it to strtol and carefully check for errors:
#include <errno.h>
#include <stdlib.h>
/* Returns a valid content_length, or -1 on error. */
long get_content_length(void)
{
char *content_length, *endp;
long rv;
content_length = getenv("CONTENT_LENGTH");
if (!content_length) return -1;
errno = 0;
rv = strtol(content_length, &endp, 10);
if (endp == content_length || *endp || errno || rv <= 0)
return -1;
return rv;
}
Each of the four clauses in the if statement after the strtol call checks for a different class of ill-formed input. You have to clear errno explicitly before the call, because the value strtol returns when it reports an overflow is also a value it can return when there was no overflow, so the only way to distinguish is to look at errno, but errno could have a stale nonzero value from some earlier operation.
Note that even if CONTENT_LENGTH is syntactically valid, it might not be trustworthy. That is, the actual amount of POST data available to you might be either less or more than CONTENT_LENGTH. Make sure to pay attention to the numbers returned by read as well. (This is an example of how swapping out string functions for "hardened" ones doesn't solve all your problems.)
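Since the question also asked about an upper bound, here is a sketch of the same strtol checks with a caller-supplied cap, taking the string as a parameter so it can be exercised without touching the environment. parse_content_length and max_len are hypothetical names; the cap you choose is an application decision.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical variant: parse a decimal length from text.
   Returns the value if it is in 1..max_len, or -1 for anything else
   (NULL, empty, trailing junk, overflow, zero, negative, too large). */
static long parse_content_length(const char *text, long max_len)
{
    char *endp;
    long value;

    if (text == NULL)
        return -1;
    errno = 0;  /* strtol only sets errno on failure, so clear stale values */
    value = strtol(text, &endp, 10);
    if (endp == text || *endp != '\0' || errno != 0)
        return -1;
    if (value <= 0 || value > max_len)
        return -1;
    return value;
}
```

You would call it as parse_content_length(getenv("CONTENT_LENGTH"), some_limit) and then still count the bytes that read actually delivers.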

How to implement wrapper function for C sscanf() without using vsscanf()

I want to implement a wrapper function for C's sscanf() without using vsscanf(), because my environment only provides sscanf(), not vsscanf(). I don't want to write a complete implementation of sscanf() either, because then I would need to handle every possible scenario. I have seen some samples on Google, but they do not cover all scenarios.
So now I want to implement like below:
int my_sscanf(char * buf, char format[], ...)
{
va_list vargs = {0};
va_start(vargs, format);
//some loop to get the variable arguments
//and call again sscanf() here.
va_end (vargs);
}
Ouch! Here's a hammer; it'll be more fun hitting yourself on the head with it. Seriously, that's a non-trivial proposition.
You'll need a loop that scans through the format string, reading characters from the buffer when they're normal characters, remembering that spaces in the format chew up zero or more spaces in the buffer. When you encounter a conversion specification, you'll need to create a singleton format string containing the user-supplied conversion specification plus a %n conversion specification. You'll invoke:
int pos;
int rc = sscanf(current_pos_in_buf, manufactured_format_with_percent_n,
appropriate_pointer_from_varargs, &pos);
If rc is not 1, you'll fail. Otherwise, you update the current position in the buffer using the value stored in pos, and then repeat. Note that scanning a conversion specification is not trivial. Also, if there is an assignment-suppressing * in the specification, you'll have to expect a 0 back from sscanf() (and not provide the appropriate pointer from the variable args).
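A single step of that loop, for one fixed conversion (%d), might look like the sketch below. scan_one_int is a hypothetical name; a real wrapper would build the format string from whatever conversion specification it just parsed instead of hard-coding "%d%n".

```c
#include <stdio.h>

/* Hypothetical single step: convert one int at pos and report how many
   characters sscanf consumed. Returns that count, or -1 if nothing matched.
   %n stores the consumed count and is not included in sscanf's return value. */
static int scan_one_int(const char *pos, int *out)
{
    int consumed = 0;
    if (sscanf(pos, "%d%n", out, &consumed) != 1)
        return -1;
    return consumed;
}
```

The caller advances its position in the buffer by the returned count and repeats with the next conversion specification.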
Try telling your compiler to compile your code as C99. If that still doesn't work, your libc does not comply with the C99 standard – in that case, get a proper libc.
E.g. if you're using gcc, try adding -std=c99 to the compiler command line.
There's a slightly simpler way to do this using the preprocessor, but it's a little hacky. Take this as an example:
#define my_sscanf(buf, fmt, ...) { \
do_something(); \
sscanf((buf), (fmt), __VA_ARGS__); \
do_something_else(); }
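One caveat with a bare { ... } macro is that it misbehaves in an if/else without braces and gives you no way to get at sscanf's return value. A possible variation, sketched with made-up hook names (before_scan, after_scan) and a made-up macro name (MY_SSCANF), wraps the body in do { } while (0) and hands the result back through an extra argument. It still needs a compiler with C99 variadic macros and at least one argument after the format.

```c
#include <stdio.h>

/* Hypothetical hooks standing in for do_something()/do_something_else(). */
static int before_calls, after_calls;
static void before_scan(void) { before_calls++; }
static void after_scan(void)  { after_calls++; }

/* rc receives sscanf's return value. The do/while(0) wrapper makes the
   macro behave like a single statement, so it is safe in if/else. */
#define MY_SSCANF(rc, buf, fmt, ...)              \
    do {                                          \
        before_scan();                            \
        (rc) = sscanf((buf), (fmt), __VA_ARGS__); \
        after_scan();                             \
    } while (0)
```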

Resuming [vf]?nprintf after reaching the limit

I have an application which prints strings to a buffer using snprintf and vsnprintf. Currently, if it detects an overflow, it appends a > to the end of the string as a sign that the string was chopped and prints a warning to stderr. I'm trying to find a way to have it resume the string [from where it left off] in another buffer.
If this was using strncpy, it would be easy; I know how many bytes were written, and so I can start the next print from *(p+bytes_written). However, with printf, I have two problems: first, the formatting specifiers may take up more or less space in the final string than in the format string, and secondly, my va_list may be partially parsed.
Does anyone have an easy-ish solution to this?
EDIT: I should probably clarify that I'm working on an embedded system with limited memory + no dynamic allocation [i.e., I don't want to use dynamic allocation]. I can print messages of 255 bytes, but no more, although I can print as many of those as I want. I don't, however, have the memory to allocate lots of memory on the stack, and my print function needs to be thread-safe, so I can't allocate just one global / static array.
I don't think you can do what you're looking for (other than by the straightforward way of reallocating the buffer to the necessary size and performing the entire operation again).
The reasons you listed are a couple contributors to this, but the real killer is that the formatter might have been in the middle of formatting an argument when it ran out of space, and there's no reasonable way to restart that.
For example, say there's 3 bytes left in the buffer, and the formatter starts working on a "%d" conversion for the value -1234567. It'll put "-1\0" into the buffer then do whatever else it needs to do to return the size of buffer you really need.
In addition to you being able to determine which specifier the formatter was working on, you'd need to be able to figure out that instead of passing in -1234567 on the second round you need to pass in 234567. I defy you to come up with a reasonable way to do that.
Now if there's a real reason you don't want to restart the operation from the top, you probably could wrap the snprintf()/vsnprintf() call with something that breaks down the format string, sending only a single conversion specifier at a time and concatenating that result to the output buffer. You'd have to come up with some way for the wrapper to keep some state across retries so it knows which conversion spec to pick up from.
So maybe it's doable in a sense, but it sure seems like it would be an awful lot of work to avoid the much simpler 'full retry' scheme. I could see maybe (maybe) trying this on a system where you don't have the luxury of dynamically allocating a larger buffer (an embedded system, maybe). In that case, I'd probably argue that what's needed is a much simpler/restricted scope formatter that doesn't have all the flexibility of printf() formatters and can handle retrying (because their scope is more limited).
But, man, I would try very hard to talk some sense into whoever said it was a requirement.
Edit:
Actually, I take some of that back. If you're willing to use a customized version of snprintf() (let's call it snprintf_ex()) I could see this being a relatively simple operation:
int snprintf_ex( char* s, size_t n, size_t skipChars, const char* fmt, ...);
snprintf_ex() (and its companion functions such as vsnprintf_ex()) will format the string into the provided buffer (as usual) but will skip outputting the first skipChars characters.
You could probably rig this up pretty easy using the source from your compiler's library (or using something like Holger Weiss' snprintf()) as a starting point. Using this might look something like:
int bufSize = sizeof(buf);
char* fmt = "some complex format string...";
int needed = snprintf_ex( buf, bufSize, 0, fmt, arg1, arg2, etc, etc2);
if (needed >= bufSize) {
// dang truncation...
// do whatever you want with the truncated bits (send to a logger or whatever)
// format the rest of the string, skipping the bits we already got
needed = snprintf_ex( buf, bufSize, bufSize - 1, fmt, arg1, arg2, etc, etc2);
// now the buffer contains the part that was truncated before. Note that
// you'd still need to deal with the possibility that this is truncated yet
// again - that's an exercise for the reader, and it's probably trickier to
// deal with properly than it might sound...
}
One drawback (that might or might not be acceptable) is that the formatter will do all the formatting work over again from the start - it'll just throw away the first skipChars characters that it comes up with. If I had to use something like this, I'd think that would almost certainly be an acceptable thing (it's what happens when someone deals with truncation using the standard snprintf() family of functions).
The C99 functions snprintf() and vsnprintf() both return the number of characters needed to print the whole format string with all the arguments.
If your implementation conforms to C99, you can create an array large enough for your output strings then deal with them as needed.
int chars_needed = snprintf(NULL, 0, fmt_string, v1, v2, v3, ...);
char *buf = malloc(chars_needed + 1);
if (buf) {
snprintf(buf, chars_needed + 1, fmt_string, v1, v2, v3, ...);
/* use buf */
free(buf);
} else {
/* no memory */
}
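Where malloc is off the table, as in the question's edit, the same return value can still be used to detect truncation in a fixed buffer and mark it with the '>' the question describes. A sketch, with format_marked as a hypothetical name and assuming a C99-conforming vsnprintf:

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format into buf (size bytes, no allocation).
   Returns 0 if everything fit, 1 if the output was cut short (the last
   visible character is then replaced by '>'), or -1 on error. */
static int format_marked(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int needed;

    if (size < 2)
        return -1;  /* no room for a marker plus the terminator */
    va_start(ap, fmt);
    needed = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    if (needed < 0)
        return -1;
    if ((size_t)needed >= size) {
        buf[size - 2] = '>';  /* buf[size - 1] is already '\0' */
        return 1;
    }
    return 0;
}
```

This only reports and marks the truncation; it does not solve the resume problem.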
If you're on a POSIX-ish system (which I'm guessing you may be since you mentioned threads), one nice solution would be:
First try printing the string to a single buffer with snprintf. If it doesn't overflow, you've saved yourself a lot of work.
If that doesn't work, create a new thread and a pipe (with the pipe() function), fdopen the writing end of the pipe, and use vfprintf to write the string. Have the new thread read from the reading end of the pipe and break the output string into 255-byte messages. Close the pipe and join with the thread after vfprintf returns.
