getc and getwc: How exactly do they read stdin?

I'm not exactly sure whether or not this is a silly question, but I guess I will find out soon enough.
I'm having problems understanding exactly how getc and getwc work. It's not that I can't use them, but more that I don't know exactly what they do. Both getc and getwc return most characters correctly if I `printf("%c")` them, including multibyte ones like € or even £.
My question is: how exactly do these functions work, how do they read stdin exactly? Explanations and good pointers to docs much appreciated.
Edit: Please, read the comment I left in William's answer. It helps clarify the level of detail I'm after.

If you are on a system with 8-bit chars (that is, UCHAR_MAX == 255), then getc() will return a single 8-bit character. The reason it returns an int is so that the EOF value can be distinguished from any of the possible character values. This covers almost any system you are likely to come across today.
The reason that fgetc() appears to work for multibyte characters is that the bytes making up each multibyte character are being read in separately, written out separately, and then interpreted as a multibyte character by your console. If you change your printf to:
printf("%c ", somechar);
(that is, put a space after each character), then you should see multibyte characters broken up into their constituent bytes, which will probably look quite weird.
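A minimal way to see this for yourself, assuming a UTF-8 terminal where € occupies three bytes:

#include <stdio.h>

int main(void)
{
    int c;

    /* Each getc() returns one byte; a character such as '€' arrives
       as three separate UTF-8 bytes, and the space printed after each
       byte keeps the console from reassembling them. */
    while ((c = getc(stdin)) != EOF)
        printf("%c ", c);
    return 0;
}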

The answer is platform-dependent. On Unix-like machines, getc checks whether there is data available in the stream's buffer. If not, it invokes read() to refill the buffer; it then returns the next character and advances the buffer position (among other bookkeeping). The details differ between implementations and really are not important to the developer.
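As a rough illustration of that scheme only (not any real implementation; my_getc, BUFSZ, and the buffer variables are invented names):

#include <stdio.h>
#include <unistd.h>

/* A heavily simplified sketch of the buffering idea behind getc()
   on a Unix-like system; real libc implementations are far more
   elaborate. */
#define BUFSZ 4096

static unsigned char buf[BUFSZ];
static ssize_t buf_len = 0;   /* bytes currently in the buffer */
static ssize_t buf_pos = 0;   /* index of the next byte to hand out */

static int my_getc(int fd)
{
    if (buf_pos >= buf_len) {               /* buffer exhausted? */
        buf_len = read(fd, buf, BUFSZ);     /* refill from the kernel */
        buf_pos = 0;
        if (buf_len <= 0)
            return -1;                      /* EOF or error */
    }
    return buf[buf_pos++];                  /* next buffered byte */
}

int main(void)
{
    int c;
    while ((c = my_getc(0)) != -1)          /* echo stdin via our buffer */
        putchar(c);
    return 0;
}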

If you really want to know how they work, check the source for glibc.
For starters, getc() from libio/getc.c will call _IO_getc_unlocked(), which is defined in libio/libio.h and will call __uflow() from libio/genops.c on underflow.
Tracking the call chain can get a bit tedious, but you asked for it ;)

Related

What's the difference between putch() and putchar()?

Okay so, I'm pretty new to C.
I've been trying to figure out what exactly is the difference between putch() and putchar()?
I tried googling my answers but all I got was the same copy-pasted-like message that said:
putchar(): This function is used to print one character on the screen, and this may be any character from C character set (i.e it may be printable or non printable characters).
putch(): The putch() function is used to display all alphanumeric characters through the standard output device like monitor. this function display single character at a time.
As English isn't my first language I'm kinda lost. Are there non printable characters in C? If so, what are they? And why can't putch produce the same results?
I've tried googling the C character set and all of the alphanumeric characters there are, but as much as my testing went, there wasn't really anything that one function could print and the other couldn't.
Anyways, I'm kind of lost.
Could anyone help me out? thanks!
TLDR;
what can putchar() do that putch() can't? (or the opposite or something idk)
dunno, hoped there would be a visible difference between the two but can't seem to find it.
putchar() is a standard function, possibly implemented as a macro, defined in <stdio.h> to output a single byte to the standard output stream (stdout).
putch() is a non-standard function available on some legacy systems, originally implemented on MS-DOS as a way to output characters directly to the screen, bypassing the standard streams' buffering scheme. This function is mostly obsolete; don't use it.
Output via stdout to a terminal is line-buffered by default on most systems, so as long as your output ends with a newline, it will appear on the screen without further delay. If you need output to be flushed in the absence of a newline, use fflush(stdout) to force the stream's buffered contents to be written to the terminal.
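A small sketch of that flushing behaviour, assuming a line-buffered terminal:

#include <stdio.h>

int main(void)
{
    /* Print a prompt with no newline: on a line-buffered terminal it
       might otherwise sit in the buffer until the program exits. */
    putchar('>');
    putchar(' ');
    fflush(stdout);          /* force the prompt onto the screen now */

    int c = getchar();       /* wait for input; the prompt is visible */
    if (c != EOF)
        putchar(c);
    putchar('\n');           /* the newline flushes line-buffered stdout */
    return 0;
}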
putch(), in the old Borland/DOS conio implementations, writes directly to the console and does not translate the newline character (\n) into a carriage-return/line-feed pair.
putchar() is a standard C function that writes a single character to the standard output stream.

Will program crash if Unicode data is parsed as Multibyte?

Okay basically what I'm asking is:
Let's say I use PathFindFileNameA on a Unicode-enabled path. I obtain this path via GetModuleFileNameA, but since this API doesn't support Unicode characters (Italian accented characters, for example), it will output junk characters in that part of the system path.
Let's assume x represents a junk character in the file path, such as:
C:\Users\xxxxxxx\Desktop\myfile.sys
I assume that PathFindFileNameA just scans the string (strtok-style) until it encounters the last \\ and returns the remainder, i.e. the last str_length - pos_of_last_\\ characters, in a preallocated buffer.
The question is: will PathFindFileNameA parse the string correctly even if it encounters the junk characters from a failed Unicode conversion (since the multibyte/ANSI counterpart of the API is called), or will it crash the program?
Don't answer something like "Well just use MultiByteToWideChar", or "Just use a wide-version of the API". I am asking a specific question, and a specific answer would be appreciated.
Thanks!
Why do you think the Windows API just does a strtok-style scan? I have heard that all the xxA APIs were being redirected to the xxW APIs even before Windows 10 was released.
And I think the answer to this question is quite simple to get empirically: just write a small test program, set the code page to whatever you want, run it, and the answer comes out.
P.S.: Personally, I think GetModuleFileNameA will work correctly even if there are junk characters, because Windows stores the image name internally as a UNICODE_STRING. And even if you use MBCS, the junk characters do not contain zero bytes, so it will work as usual, since it is effectively just doing a strncpy.
Sorry for my last answer :)

Implementing scanf using read()

I want to implement the scanf function using the read system call, but I have a problem: when you use read you have to specify how many bytes it will read, and I am not sure what number to use. I have, for example, implemented the flag %b for binary numbers, %x for hex, etc. Should I just use a constant size like 256 bytes?
You will probably have to implement a buffering layer. When you call read(), you get a number of bytes from the stream. From the point of view of the system, those bytes are now gone and will not be returned by a subsequent call to read(). The problem is, if you have a format specifier like %d, you can't know when you are done reading the number until you actually read the next non-digit byte. You'll have to store that already-read input somewhere so you can use it later, for example for a following %s.
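A minimal sketch of such a buffering layer, assuming POSIX read() and a single byte of pushback (next_byte and unread_byte are invented names; a real implementation would refill a block buffer rather than read one byte at a time):

#include <unistd.h>

/* One byte of pushback on top of read(): enough to stop a %d
   conversion at the first non-digit and hand that byte back for
   the next conversion. */
static int pushback = -1;

static int next_byte(int fd)
{
    unsigned char c;

    if (pushback >= 0) {             /* return the pushed-back byte first */
        int r = pushback;
        pushback = -1;
        return r;
    }
    return read(fd, &c, 1) == 1 ? c : -1;   /* -1 on EOF or error */
}

static void unread_byte(int c)
{
    pushback = c;                    /* at most one byte of pushback */
}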
You'll basically have to do your own buffering, since any number you choose may be too large in some cases (unless you choose something ridiculously small) and too small in others (for example, lots of whitespace), so there is no right number.
read(), without a layer of buffering code, will not work to implement scanf() as scanf() needs the ability to "unget" characters back to stdin.
Instead, scanf() could be implemented with fgetc()/ungetc(), but it is a great deal of work to implement a complete scanf(), no matter which route is taken.
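For the fgetc()/ungetc() route, here is a hedged sketch of what just the %d conversion might look like (parse_int is an invented helper, and overflow handling is omitted):

#include <ctype.h>
#include <stdio.h>

/* Parse an optionally signed decimal integer, pushing the first
   non-digit back so later conversions (e.g. a following %s) still
   see it. */
static int parse_int(FILE *stream, long *out)
{
    int c, sign = 1;
    long val = 0;

    while ((c = fgetc(stream)) != EOF && isspace(c))
        ;                                /* %d skips leading whitespace */
    if (c == '-' || c == '+') {
        sign = (c == '-') ? -1 : 1;
        c = fgetc(stream);
    }
    if (c == EOF || !isdigit(c)) {
        if (c != EOF)
            ungetc(c, stream);           /* matching failure: put it back */
        return 0;
    }
    do {
        val = val * 10 + (c - '0');
    } while ((c = fgetc(stream)) != EOF && isdigit(c));
    if (c != EOF)
        ungetc(c, stream);               /* give the terminator back */
    *out = sign * val;
    return 1;                            /* one field converted */
}

int main(void)
{
    long n;
    while (parse_int(stdin, &n))
        printf("%ld\n", n);
    return 0;
}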

How can I get how many bytes sscanf_s read in its last operation?

I wrote up a quick memory reader class that emulates the same functions as fread and fscanf.
Basically, I used memcpy and advanced an internal pointer to read the data like fread does, but now I have an fscanf_s call to emulate. I used sscanf_s, except that it doesn't tell me how many bytes it read out of the data.
Is there a way to tell how many bytes sscanf_s read in the last operation in order to increase the internal pointer of the string reader? Thanks!
EDIT:
An example format I am reading is:
|172|44|40|128|32|28|
fscanf reads that fine, and so does sscanf. The problem is that if the input were instead:
|0|0|0|0|0|0|
the length would be different. What I'm wondering is how fscanf knows where to put the file pointer, while sscanf gives me nothing equivalent.
With scanf and family, use %n in the format string. It won't read anything in, but it will cause the number of characters read so far (by this call) to be stored in the corresponding parameter (expects an int*).
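A quick sketch against the sample data from the question, using plain sscanf(); whether sscanf_s accepts %n the same way is left untested here, as noted further down:

#include <stdio.h>

int main(void)
{
    const char *src = "|172|44|40|128|32|28|";
    int a, b, consumed = 0;

    /* %n stores how many characters this call has consumed so far;
       it does not itself consume input or count as a converted field. */
    if (sscanf(src, "|%d|%d|%n", &a, &b, &consumed) == 2)
        printf("read %d and %d, consumed %d characters\n", a, b, consumed);
    /* a memory-reader class could now advance its internal pointer
       by `consumed` bytes */
    return 0;
}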
Maybe I'm silly, but I'm going to try anyway. It seems from the comment threads that there's still some misconception. You need to know the number of bytes, but the function returns only the number of fields read, or EOF.
To get at the number of bytes, either use something you can easily count, or use a size specifier in the format string. Otherwise you won't stand a chance of finding out how many bytes were read, other than by going over the fields one by one. Also, what you may mean is that
sscanf_s(source, "%d%d"...)
will succeed on both inputs "123 456" and "10\t30", which have different lengths. In these cases, there's no way to tell the size unless you convert the values back. So: use fixed-size fields, or be left in oblivion....
Important note: %c is the only conversion that includes the field separators (newline, tab, and space) in the output. All the others skip the field boundaries, making it harder to find the right number of bytes.
EDIT:
From "C++ The Complete Reference" I just read that:
%n Receives an integer value equal to the number of characters read so far
Isn't that precisely what you were after? Just add it in the format string. This is confirmed here, but I haven't tested it with sscanf_s.
From MSDN:
sscanf_s, _sscanf_s_l, swscanf_s, _swscanf_s_l
Each of these functions returns the number of fields successfully converted and assigned; the return value does not include fields that were read but not assigned. A return value of 0 indicates that no fields were assigned. The return value is EOF for an error or if the end of the string is reached before the first conversion.

When/why is it a bad idea to use the fscanf() function?

In an answer there was an interesting statement: "It's almost always a bad idea to use the fscanf() function as it can leave your file pointer in an unknown location on failure. I prefer to use fgets() to get each line in and then sscanf() that."
Could you expand upon when/why it might be better to use fgets() and sscanf() to read some file?
Imagine a file with three lines:
1
2b
c
Using fscanf() to read integers, the first line would read fine but on the second line fscanf() would leave you at the 'b', not sure what to do from there. You would need some mechanism to move past the garbage input to see the third line.
If you do a fgets() and sscanf(), you can guarantee that your file pointer moves a line at a time, which is a little easier to deal with. In general, you should still be looking at the whole string to report any odd characters in it.
I prefer the latter approach myself, although I wouldn't agree with the statement that "it's almost always a bad idea to use fscanf()"... fscanf() is perfectly fine for most things.
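A minimal sketch of that fgets()/sscanf() pattern, applied to a file like the one above:

#include <stdio.h>

int main(void)
{
    char line[128];
    int value, lineno = 0;

    /* One fgets() per line: a malformed line costs exactly one line,
       and the file position stays predictable. */
    while (fgets(line, sizeof line, stdin) != NULL) {
        lineno++;
        if (sscanf(line, "%d", &value) == 1)
            printf("line %d: got %d\n", lineno, value);
        else
            fprintf(stderr, "line %d: skipping malformed input\n", lineno);
    }
    /* Note: "2b" still parses as 2 here; checking the rest of the
       line (e.g. with %n) is needed to catch trailing garbage. */
    return 0;
}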
The case where this comes into play is when you match character literals. Suppose you have:
int n = fscanf(fp, "%d,%d", &i1, &i2);
Consider two possible inputs "323,A424" and "323A424".
In both cases fscanf() will return 1 and the next character read will be an 'A'. There is no way to determine if the comma was matched or not.
That being said, this only matters if finding the actual source of the error is important. In cases where knowing there is malformed input error is enough, fscanf() is actually superior to writing custom parsing code.
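A quick demonstration using sscanf(), which follows the same matching rules as fscanf():

#include <stdio.h>

int main(void)
{
    int i1, i2;

    /* Both calls return 1, so the caller cannot tell whether the comma
       was matched before the second %d failed. */
    printf("%d\n", sscanf("323,A424", "%d,%d", &i1, &i2));   /* prints 1 */
    printf("%d\n", sscanf("323A424", "%d,%d", &i1, &i2));    /* prints 1 */
    return 0;
}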
When fscanf() fails, due to an input failure or a matching failure, the file pointer (that is, the position in the file from which the next byte will be read) is left in a position other than where it would be had the fscanf() succeeded. This is typically undesirable in sequential file reads. Reading one line at a time results in the file input being predictable, while single line failures can be handled individually.
There are two reasons:
scanf() can leave stdin in a state that's difficult to predict; this makes error recovery difficult if not impossible (this is less of a problem with fscanf()); and
The entire scanf() family takes pointers as arguments but no length limits, so they can overrun a buffer and alter unrelated variables that happen to sit after the buffer, causing seemingly random memory-corruption errors that are very difficult to understand, find, and debug, particularly for less experienced C programmers.
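To make the overrun risk concrete, a short sketch; the field width is the standard mitigation:

#include <stdio.h>

int main(void)
{
    char name[16];

    /* A bare "%s" may write past the end of name[] on long input;
       the field width "%15s" stops after 15 bytes, leaving room
       for the terminating '\0'. */
    if (scanf("%15s", name) == 1)
        printf("hello, %s\n", name);
    return 0;
}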
Novice C programmers are often confused about pointers and the “address-of” operator, and frequently omit the & where it's needed, or add it “for good measure” where it's not. This causes “random” segfaults that can be hard for them to find. This isn't scanf()'s fault, so I leave it off my list, but it is worth bearing in mind.
After 23 years, I still remember it being a huge pain when I started C programming and didn't know how to recognize and debug these kinds of errors, and (as someone who spent years teaching C to beginners) it's very hard to explain them to a novice who doesn't yet understand pointers and the stack.
Anyone who recommends scanf() to a novice C programmer should be flogged mercilessly.
OK, maybe not mercilessly, but some kind of flogging is definitely in order ;o)
It's almost always a bad idea to use the fscanf() function as it can leave your file pointer in an unknown location on failure. I prefer to use fgets() to get each line in and then sscanf() that.
You can always use ftell() to find out the current position in the file, and then decide what to do from there. Basically, if you know what to expect, then feel free to use fscanf().
Basically, there's no way to tell the function not to go out of bounds of the memory area you've allocated for it.
A number of replacements have come out, like fnscanf, which attempt to fix those functions by taking a maximum limit on what the reader may write, thus preventing the overflow.
