While writing some C code, I came across a little problem where I had to convert a character into a "string" (some memory chunk the beginning of which is given by a char* pointer).
The idea is that if some sourcestr pointer is set (not NULL), then I should use it as my "final string", otherwise I should convert a given charcode into the first character of another array, and use it instead.
For the purposes of this question, we'll assume that the types of the variables cannot be changed beforehand. In other words, I can't just store my charcode as a const char* instead of an int.
Because I tend to be lazy, I thought to myself: "hey, couldn't I just use the character's address and treat that pointer as a string?". Here's a little snippet of what I wrote (don't smash my head against the wall just yet!):
int charcode = FOO; /* Assume this is always valid ASCII. */
char* sourcestr = "BAR"; /* Case #1 */
char* sourcestr = NULL; /* Case #2 */
char* finalstr = sourcestr ? sourcestr : (char*)&charcode;
Now of course I tried it, and as I expected, it does work. Even with a few warning flags, the compiler is still happy. However, I have this weird feeling that this is actually undefined behaviour, and that I just shouldn't be doing it.
The reason why I think this way is because char* arrays need to be null-terminated in order to be printed properly as strings (and I want mine to be!). Yet, I have no certainty that the value at &charcode + 1 will be zero, hence I might end up with some buffer overflow madness.
Is there an actual reason why it does work properly, or have I just been lucky to get zeroes in the right places when I tried?
(Note that I'm not looking for other ways to achieve the conversion. I could simply use a char tmp[2] = {0} variable, and put my character at index 0. I could also use something like sprintf or snprintf, provided I'm careful enough with buffer overflows. There's a myriad of ways to do this, I'm just interested in the behaviour of this particular cast operation.)
Edit: I've seen a few people call this hackery, and let's be clear: I completely agree with you. I'm not enough of a masochist to actually do this in released code. This is just me getting curious ;)
The cast itself is well-defined, as you can always cast an object pointer to char*. But there are some issues:
Note that "BAR" is a string literal; even though its type in C is char[], don't attempt to modify its contents. That would be undefined.
Don't attempt to use (char*)&charcode as a parameter to any of the string functions in the C standard library. It is not guaranteed to be null-terminated. So in that sense, you cannot treat it as a string.
Pointer arithmetic on (char*)&charcode will be valid up to and including one past the scalar charcode. But don't attempt to dereference any pointer beyond charcode itself. The range of n for which the expression (char*)&charcode + n is valid depends on sizeof(int).
The cast and assignment, char* finalstr = (char*)&charcode; is defined.
Printing finalstr with printf as a string, %s, if it points to charcode is undefined behavior.
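To make those boundaries concrete, here is a sketch of which operations are defined and which are not (the value 65 and the variable names are purely for illustration):
int charcode = 65;            /* 'A' in ASCII, for illustration */
char* p = (char*)&charcode;   /* defined: casting to char* is always allowed */
char first = p[0];            /* defined: reading one byte of the object representation */
char* end = p + sizeof(int);  /* defined: forming a one-past-the-end pointer */
/* char bad = *end;              undefined: dereferences past the object */
/* printf("%s", p);              undefined: p does not point to a string */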
Rather than resorting to hackery and hiding a string in an int, convert the value stored in the integer to a string using a conversion function of your choice. One possible example is:
char str[32] = { 0 };
snprintf( str , 32 , "%d" , charcode );
char* finalstr = sourcestr ? sourcestr : str;
or use whatever other (defined!) conversion you like.
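If the goal is the character itself rather than its decimal digits, the same pattern works with a %c conversion (a sketch, assuming charcode always holds a valid character code):
char str[2] = { 0 };
snprintf( str , sizeof str , "%c" , charcode );
char* finalstr = sourcestr ? sourcestr : str;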
Like others said, it happens to work because the internal representation of an int on your machine is little-endian and your char is smaller than an int. Also, the ASCII value of your character is either below 128 or you have unsigned chars (otherwise there would be sign extension). This means that the value of the character is in the lower byte(s) of the representation of the int, and the rest of the int will be all zeroes (assuming any normal representation of an int). You're not "lucky"; you have a pretty normal machine.
It is also completely undefined behavior to give that char pointer to any function that expects a string. You might get away with it now but the compiler is free to optimize that to something completely different.
For example, if you do a printf just after that assignment, the compiler is free to assume that you'll always pass a valid string to printf. If sourcestr were NULL, printf would be called with something that isn't a string, and the compiler is free to assume that undefined behavior never happens. This means that any check of sourcestr against NULL, before or after that assignment, is unnecessary, because the compiler already "knows" it isn't NULL. This assumption is allowed to spread everywhere in your code.
This was rarely a thing to worry about, and you could get away with tricks uglier than this until a decade ago or so, when compiler writers started an arms race over how literally they could follow the C standard in order to get away with ever more brutal optimizations. Today compilers are getting more and more aggressive, and while the optimization I speculated about probably doesn't exist yet, if a compiler person sees this, they'll probably implement it just because they can.
This is absolutely undefined behavior for the following reasons:
Less probable, but worth considering when referencing the standard strictly: you can't assume sizeof(int) on the machine/system where the code will be compiled.
As above, you can't assume the character set. E.g. what happens on an EBCDIC machine/system?
It's easy to say that your machine has a little-endian processor; on big-endian machines the code fails, because with that memory layout the character ends up in the last byte rather than the first.
Because on many systems char is a signed integer type, as is int, when your char holds a negative value (i.e. a character code above 127 on machines with 8-bit char), it could fail due to sign extension if you assign the value as in the code below:
char ch = FOO;
int charcode = ch;
P.S. About point 3: your string will indeed be null-terminated on a little-endian machine with sizeof(int) > sizeof(char) and a char holding a positive value, because the upper bytes of the int will be 0 and the memory layout for that endianness is LSB first.
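A small test program makes that layout visible (a sketch; the exact output depends on your machine's endianness and sizeof(int)):
#include <stdio.h>

int main(void)
{
    int charcode = 'A';   /* 0x41 */
    unsigned char *bytes = (unsigned char *)&charcode;
    /* On a little-endian machine with a 32-bit int this prints "41 00 00 00":
       the character sits in the first byte and the remaining bytes are zero,
       which is why the hack appears to produce a terminated string. */
    for (size_t i = 0; i < sizeof charcode; i++)
        printf("%02x ", bytes[i]);
    printf("\n");
    return 0;
}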
I have the following program that causes a segmentation fault.
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int main(int argc, char *argv[])
{
printf("TEST");
for (int k=0; k<(strlen(argv[1])); k++)
{
if (!isalpha(argv[1])) {
printf("Enter only alphabets!");
return 1;
}
}
return 0;
}
I've figured out that it is this line that is causing the problem
if (!isalpha(argv[1])) {
and replacing argv[1] with argv[1][k] solves the problem.
However, I find it rather curious that the program results in a segmentation fault without even printing TEST. I also expected the isalpha function to incorrectly check the lower byte of the char* pointer argv[1], but this doesn't seem to be the case. I have code to check the number of arguments, but it isn't shown here for brevity.
What's happening here?
In general it is rather pointless to discuss why undefined behaviour leads to this result or the other.
But maybe it doesn't harm to try to understand why something happens even if it is outside the spec.
There are implementations of isalpha which use a simple array to look up all possible unsigned char values. In that case the value passed as a parameter is used as an index into the array.
While a real character is limited to 8 bits, an integer is not.
The function takes an int as its parameter; this is to allow passing EOF as well, which does not fit into unsigned char.
If you pass an address like 0x7239482342 to the function, this is far beyond the end of said array, and when the CPU tries to read the entry at that index, it falls off the rim of the world. ;)
Calling isalpha with such an address is the place where the compiler should raise some warning about converting a pointer to an integer. Which you probably ignore...
The library might contain code that checks for valid parameters but it might also just rely on the user not passing things that shall not be passed.
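A hypothetical table-driven implementation, just to make the mechanism concrete (this is illustrative, not any real library's code, and the mask value is made up):
#include <limits.h>

/* One flag entry per value in the range EOF..UCHAR_MAX; the +1 shifts
   the index so that EOF (-1) lands on entry 0. A real table would be
   filled with flag bits rather than zeroes. */
static const unsigned short ctype_flags[UCHAR_MAX + 2] = { 0 };

#define my_isalpha(c) (ctype_flags[(c) + 1] & 0x0400)   /* 0x0400: made-up mask */

/* Passing a pointer such as argv[1] converts the address to an enormous
   index, far outside ctype_flags; that out-of-bounds read is the crash. */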
Two things happened here. First, printf's output was not flushed. Second, the implicit conversion from pointer to integer, which ought to have generated at least a compile-time diagnostic for a constraint violation, produced a number that was out of range for isalpha. Since isalpha is implemented as a look-up table, your code accessed the table out of bounds; hence the undefined behaviour.
Why you didn't get a diagnostic might be partly explained by how isalpha is implemented as a macro. On my computer with Glibc 2.27-3ubuntu1, isalpha is defined as
# define isalpha(c) __isctype((c), _ISalpha)
# define __isctype(c, type) \
((*__ctype_b_loc ())[(int) (c)] & (unsigned short int) type)
The macro contains an unfortunate cast to int, which silences the error!
One reason why I am posting this answer after so many others is that you didn't fix the code: it still suffers from undefined behaviour given extended characters when char is signed (which is generally the case on x86-32 and x86-64).
The correct argument to give to isalpha is (unsigned char)argv[1][k]! C11 7.4:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
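Putting the fixes together, a corrected version of the program might look like this (a sketch; the unsigned char cast and the trailing \n are the essential parts):
#include <stdio.h>
#include <string.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
    if (argc < 2)          /* the argument check omitted from the question */
        return 1;
    printf("TEST\n");      /* the \n also flushes line-buffered stdout */
    for (size_t k = 0; k < strlen(argv[1]); k++)
    {
        if (!isalpha((unsigned char)argv[1][k])) {
            printf("Enter only alphabets!\n");
            return 1;
        }
    }
    return 0;
}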
I find it rather curious that the program results in a segmentation fault without even printing TEST
printf doesn't print instantly; it writes to a temporary buffer. End your string with \n if you want it flushed to the actual output.
and replacing argv[1] with argv[1][k] solves the problem.
isalpha is intended to work with single characters.
First of all, a conforming compiler must give you a diagnostic message here. It is not allowed to implicitly convert from a pointer to the int parameter that isalpha expects. (It is a violation of the rules of simple assignment, 6.5.16.1.)
As for why "TEST" isn't printed, it could simply be because stdout isn't flushed. You could try adding fflush(stdout); after printf and see if this solves the issue. Alternatively add a line feed \n at the end of the string.
Otherwise, the compiler is free to re-order the execution of code as long as there are no side effects. That is, it is allowed to execute the whole loop before the printf("TEST");, as long as it prints TEST before it potentially prints "Enter only alphabets!". Such optimizations are probably not likely to happen here, but in other situations they can occur.
Coming from Python, where I would simply use type() to find out the type of an object, C, lacking introspection, is forcing me to better grasp its data types, their relatedness, and pointers, before moving on to more advanced topics. This is a good thing. So I have the following piece of code, which I will tweak in various ways and try to understand the resulting behavior:
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i = 0;
    for(i = 0; argv[1][i] != '\0'; i++) {
        printf("%d\n", argv[1][i]);
        char letter = argv[1][i];
        switch(letter) {
            case 2:
                printf("%d: 2\n", i);
                break;
        }
    }
    return 0;
}
If I run this and pass the number 2 as a single argument, nothing happens. My understanding then is that because I have defined argv[1][i] as a char, comparing it to 2 (an int) will return false, hence the code does not get called. And indeed, if I change the case from 2 to '2', the code does get called. Makes sense, but it leads me to my first question:
Q1. I have read in various places that to C, a character and an integer are essentially the same thing. So why doesn't C "know" that the 2 passed as an argument should be interpreted as an integer and not a string? After all, C allows me to, for example, use the %d string formatter for characters.
If I then change the type of variable letter from a char to an int:
int letter = argv[1][i];
... I still get the same behavior as in the first variant (i.e. nothing happens), even though now I am apparently comparing an int to an int in the case statement. This leads me to surmise that although I am now defining letter as an int, C is still reading it in as a char on the command line, and just calling it an int isn't enough to change its type from the point of view of subsequent program flow.
Q2. Is the above reasoning correct?
So now I figure that if I change the type of letter to an int using atoi(), things should go OK. So:
int letter = atoi(argv[1][i]);
When I now try compile, I get:
Switch.c:14:27: warning: incompatible integer to pointer conversion passing 'char' to parameter of type 'const char *'; take the address with & [-Wint-conversion]
    int letter = atoi(argv[1][i]);
                      ^~~~~~~~~~
                      &
/usr/include/stdlib.h:132:23: note: passing argument to parameter here
int atoi(const char *);
I then look up the documentation for atoi() and see that it can only be used to convert a string (more precisely, a const char *), not a single character. I would have thought that since a char * is just a sequence of chars, atoi() would work with both. And apparently there is no equivalent of atoi() for a char, rather only workarounds, such as the one described here.
Anyway, I decide to take the warning's instructions and place an ampersand before the value (knowing that this implies a memory address, but not yet knowing why it is being suggested). So:
int letter = atoi(&argv[1][i]);
When I do so, it compiles. And now the program in this final form - with letter defined as an int, with the case statement comparing to an int, and with atoi being passed the address rather than value of argv[1][i] - runs successfully.
But I don't know why, so I strip this down to test the values of argv[1][i] and &argv[1][i] by printing them. I observe that the program will only compile if I use the %s string formatter to print &argv[1][i], as it tells me that &argv[1][i] is a char *.
Q3. Why is &argv[1][i], an address in memory, a char *?
In my printout, I observe that the values of &argv[1][i] and argv[1][i] are the same, namely: 2. So:
Q4. Why didn't the compiler allow me to use argv[1][i], if its value is no different to that of &argv[1][i]?
Q5. Any time I've printed a memory address in a C program, it has always been some long number such as 1246377222. Why is the memory address == the value in this case?
No doubt there will be someone objecting that this mammoth post should be split into separate posts with separate questions, but I think the flow of trial-and-error, and the answers you provide, will help not only me but others looking to understand these aspects of C. Also feel free to suggest a better title for the post.
Many thanks.
Your misunderstanding comes down to the fact that 2 != '2'. They both have integral values, but those values differ from one another. The ASCII value of '2' does not equal 2; it equals 50. This means that int a = '2'; causes a to evaluate to 50. The standard way of converting an integral char to its numeric value is to write int a = '2' - '0'; which will cause a to evaluate to 2.
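A minimal illustration of the difference:
#include <stdio.h>

int main(void)
{
    char c = '2';
    printf("%d\n", c);         /* prints 50, the ASCII code of '2' */
    printf("%d\n", c - '0');   /* prints 2, the numeric value */
    return 0;
}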
argv is an array of char *. This means that argv[j][i] is a char and &argv[j][i] is a char *: it is the address of the character at location argv[j][i]. This means that atoi(&argv[j][i]) will compile, but I am not sure it does what you expect, because it will try to translate the entire string starting at argv[j][i] into a number instead of only the single character at argv[j][i].
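For example (a sketch): with the string "123", parsing from the middle consumes the whole remaining tail:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char s[] = "123";
    printf("%d\n", atoi(&s[1]));   /* prints 23: parsing starts at '2' and
                                      runs to the end of the string */
    printf("%d\n", s[1] - '0');    /* prints 2: just that one digit */
    return 0;
}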
Q1: char and int are "the same" only in the sense that both are (usually signed) integer types. There are multiple differences; for example, char is (usually) 1 byte long, while int is (usually) at least 4 bytes long.
Q2: your reasoning is wrong because you compare the ASCII code of the letter '2' (which is 50) with the number 2 (which has no visual representation in ASCII).
Q3: you made a mistake in your debugging: &argv[1][i] is the address of the i-th character in the argv[1] string, so essentially it is a pointer. In your debugger you probably saw the character that was pointed to.
Q4: see above - argv[1][i] is '2', while &argv[1][i] is an address in memory where this '2' can be found.
Q5: you probably made a mistake in debugging - see answer to Q3.
A string in C is a sequence of characters with a null terminator. It will always be referenced by address (of the first character), because its length is variable.
Q4. Why didn't the compiler allow me to use argv[1][i], if its value is no different to that of &argv[1][i]?
They are different
&argv[1][i] is a pointer to a memory position and
argv[1][i] is the value of the char in that position
It is just that
printf("%s", &argv[1][i]); // Prints the c-string at memory position
printf("%c", argv[1][i]); // Prints the char
I assume that when you say "printed" you mean the printf() function.
For Q5: as you say you come from Python, note that id is implemented in CPython as the address of the object. And even in Python, the id of a numeric variable is not the value of the variable.
I want to understand a number of things about the strings on C:
I could not understand why you cannot change a string through a normal assignment (but only through the functions of string.h). For example, I can't do d = "aa" (where d is a pointer to char or an array of char).
Can someone explain to me what's going on behind the scenes? The compiler lets such a thing compile and run, and then you receive a segmentation fault error.
Something else: I ran a program in C that contains the following lines:
char c='a',*pc=&c;
printf("Enter a string:");
scanf("%s",pc);
printf("your first char is: %c",c);
printf("your string is: %s",pc);
If I put in more than 2 letters (at the scanf), I get a segmentation fault error. Why is this happening?
If I put in two letters, the first letter is printed right! But the string is printed with a lot of spaces (incorrect).
If I put in one letter, the letter is printed right! But the string is printed with a lot of spaces and, at the end, something weird (a square with four numbers containing zeros and ones).
Can anyone explain what is happening behind the scenes?
Please note: I do not want the program to work, I did not ask the question to get suggestions for another program, I just want to understand what happens behind the scenes in these situations.
Strings almost do not exist in C (except as C string literals like "abc" in some C source file).
In fact, strings are mostly a convention: a C string is an array of char whose last element is the zero char '\0'.
So declaring
const char s[] = "abc";
is exactly the same as
const char s[] = {'a','b','c','\0'};
in particular, sizeof(s) is 4 (3+1) in both cases (and so is sizeof("abc")).
The standard C library contains a lot of functions (such as strlen(3) or strncpy(3)...) which obey and/or presuppose the convention that strings are zero-terminated arrays of char-s.
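A minimal strlen-like function shows how such routines rely on the convention (a sketch; my_strlen is not a standard name):
#include <stddef.h>

/* Walks the array until it finds the '\0' terminator. */
size_t my_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}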
Better code would be:
char buf[16]="a",*pc= buf;
printf("Enter a string:"); fflush(NULL);
scanf("%15s",pc);
printf("your first char is: %c",buf[0]);
printf("your string is: %s",pc);
Some comments: be afraid of buffer overflow. When reading a string, always give a bound to the read string, or else use a function like getline(3) which dynamically allocates the string in the heap. Beware of memory leaks (use a tool like valgrind ...)
When computing a string, be also aware of the maximum size. See snprintf(3) (avoid sprintf).
Often, you adopt the convention that a string is returned and dynamically allocated in the heap. You may want to use strdup(3) or asprintf(3) if your system provides it. But you should adopt the convention that the calling function (or something else, but well defined in your head) is free(3)-ing the string.
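For instance, a sketch of that ownership convention, assuming a system that provides strdup(3):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *copy = strdup("hello");   /* heap-allocated copy of the string */
    if (copy == NULL)
        return 1;
    printf("%s\n", copy);
    free(copy);   /* the caller owns the copy, so the caller frees it */
    return 0;
}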
Your program can be semantically wrong and, by bad luck, happen to sometimes work. Read carefully about undefined behavior. Avoid it absolutely (your points 1, 2, 3 are probably UB). Sadly, UB may sometimes happen to "work".
To explain some actual undefined behavior, you have to take into account your particular implementation: the compiler, the flags -notably optimization flags- passed to the compiler, the operating system, the kernel, the processor, the phase of the moon, etc etc... Undefined behavior is often non reproducible (e.g. because of ASLR etc...), read about heisenbugs. To explain the behavior of points 1,2,3 you need to dive into implementation details; look into the assembler code (gcc -S -fverbose-asm) produced by the compiler.
I suggest you compile your code with all warnings and debugging info (e.g. using gcc -Wall -g with GCC), improve the code until you get no warnings, and learn how to use the debugger (e.g. gdb) to run your code step by step.
If I put more than 2 letters (on scanf) I get segmentation fault error, why is this happening?
Because memory is allocated for only one byte.
Look at char c: it is assigned 'a', and only one byte of memory is allocated for it.
If scanf() uses this memory for reading more than one byte, then this is simply undefined behavior.
char c="a"; is a wrong declaration in c language since even a single character is enclosed within a pair of double quotes("") will treated as string in C because it is treated as "a\0" since all strings ends with a '\0' null character.
char c="a"; is wrong where as char c='c'; is correct.
Also note that the memory allocated for char is only 1byte, so it can hold only one character, memory allocation details for datatypes are described bellow
I have this code:
#include <ctype.h>
char *tokenHolder[2500];
for(i = 0; tokenHolder[i] != NULL; ++i){
    if(isdigit(tokenHolder[i])){ printf("worked"); }
}
Here tokenHolder holds char * tokens from user input, which have been tokenized through getline and strtok. I get a seg fault when trying to use isdigit on tokenHolder, and I'm not sure why.
Since tokenHolder is an array of char *, when you index tokenHolder[i], you are passing a char * to isdigit(), and isdigit() does not accept pointers.
You are probably missing a second loop, or you need:
if (isdigit(tokenHolder[i][0]))
printf("working\n");
Don't forget the newline.
Your test in the loop is odd too; you normally spell 'null pointer' as 0 or NULL and not as '\0'; that just misleads people.
Also, you need to pay attention to the compiler warnings you are getting! Don't post code that compiles with warnings, or (at the least) specify what the warnings are so people can see what the compiler is telling you. You should be aiming for zero warnings with the compiler set to fussy.
If you are trying to test that the values in the token array are all numbers, then you need a test_integer() function that tries to convert the string to a number and lets you know if the conversion does not use all the data in the string (or you might allow leading and trailing blanks). Your problem specification isn't clear on exactly what you are trying to do with the string tokens that you've found with strtok() etc.
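One possible sketch of such a test_integer() function, built on strtol (the details, such as allowing surrounding blanks, are illustrative):
#include <ctype.h>
#include <stdlib.h>

/* Returns 1 if s is a decimal integer, possibly surrounded by blanks,
   and 0 otherwise. strtol itself skips leading whitespace. */
static int test_integer(const char *s)
{
    char *end;
    strtol(s, &end, 10);
    if (end == s)                          /* no digits were consumed */
        return 0;
    while (isspace((unsigned char)*end))   /* allow trailing blanks */
        end++;
    return *end == '\0';
}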
As to why you are getting the core dump:
The code for the isdigit() macro is often roughly
#define isdigit(x) (_Ctype[(x)+1]&_DIGIT)
When you provide a pointer, it is treated as a very large (positive or possibly negative) offset into an array of (usually) 257 values, and because you're accessing memory out of bounds, you get a segmentation fault. The +1 allows EOF to be passed to isdigit() when EOF is -1, which is the usual value but is not mandatory. The macros/functions like isdigit() take as valid input either a character as an unsigned char (usually in the range 0..255, therefore) or EOF.
You're declaring an array of pointer to char, not a simple array of just char. You also need to initialise the array or assign it some value later. If you read the value of a member of the array that has not been initialised or assigned to, you are invoking undefined behaviour.
char tokenHolder[2500] = {0};
for(int i = 0; tokenHolder[i] != '\0'; ++i){
    if(isdigit(tokenHolder[i])){ printf("worked"); }
}
On a side note, you are probably overlooking compiler warnings telling you that your code might not be correct. isdigit expects an int, and a char * is not compatible with int, so your compiler should have generated a warning for that.
You need/want to cast your input to unsigned char before passing it to isdigit.
if(isdigit((unsigned char)tokenHolder[i])){ printf("worked"); }
In most typical encoding schemes, characters outside the US-ASCII range (e.g., any letters with umlauts, accents, graves, etc.) will show up as negative numbers in the typical case that char is signed.
As to how this causes a segment fault: isdigit (along with islower, isupper, etc.) is often implemented using a table of bit-fields, and when you call the function the value you pass is used as an index into the table. A negative number ends up trying to index (well) outside the table.
Though I didn't initially notice it, you also have a problem because tokenHolder (probably) isn't the type you expected/planned to use. From the looks of the rest of the code, you really want to define it as:
char tokenHolder[2500];
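With that declaration, the loop then reads single chars; combined with the unsigned char cast from the other answer, the result might look like this (a sketch with hypothetical contents):
#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char tokenHolder[2500] = "abc123";   /* hypothetical contents */
    for (int i = 0; tokenHolder[i] != '\0'; ++i) {
        if (isdigit((unsigned char)tokenHolder[i])) {
            printf("worked\n");
        }
    }
    return 0;
}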
I have a basic C programming question; here is the situation. If I am creating a character array and I want to treat that array as a string using the %s conversion, do I have to include a null zero? Example:
char name[6] = {'a','b','c','d','e','f'};
printf("%s",name);
The console output for this is:
abcdef
Notice that there is not a null zero as the last element in the array, yet I am still printing this as a string.
I am new to programming... so I am reading a beginner's C book, which states that since I am not putting a null zero in the last element, I cannot treat the array as a string.
This is the same output as above, although I include the null zero.
char name[7] = {'a','b','c','d','e','f','\0'};
printf("%s",name);
You're just being lucky; probably after the end of that array, on the stack, there's a zero, so printf stops reading just after the last character. If your program is very short and that zone of the stack is still "unexplored" - i.e. the stack hasn't grown up to that point yet - it's quite likely to be zero, since modern OSes generally give freshly zeroed pages to applications.
More formally: by not explicitly including the NUL terminator, you're entering the land of undefined behavior, which means that anything can happen; that "anything" may also be that your program works fine, but it's just luck, and it's the worst type of bug, since, if it generally works fine, it's difficult to spot.
TL;DR version: don't do that. Stick to what is guaranteed to work and you won't introduce sneaky bugs in your application.
The output of your first printf is not predictable, specifically because you failed to include the terminating zero character. If it appears to work in your experiment, it is only because by random chance the next byte in memory happened to be zero and worked as a zero terminator. The chances of this happening depend greatly on where you declare your name array (it is not clear from your example). For a static array the chances might be pretty high, while for a local (automatic) array you'll run into various garbage being printed pretty often.
You must include the null character at the end.
It worked without error because of luck, and luck alone. Try this:
char name[6] = {'a','b','c','d','e','f'};
printf("%s",name);
printf("%d",name[6]);
You'll most probably see that you can read that memory, and that there's a zero in it. But it's sheer luck.
What most likely happened is that there happened to be the value of 0 at memory location name + 6. This is not defined behavior though, you could get different output on a different system or with a different compiler.
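Side by side, the two declarations behave like this (a sketch; the second printf's output is undefined and merely happens to look right when a zero byte follows the array):
#include <stdio.h>

int main(void)
{
    char proper[7] = {'a','b','c','d','e','f','\0'};
    char name[6]   = {'a','b','c','d','e','f'};   /* no terminator */

    printf("%s\n", proper);   /* always prints abcdef */
    printf("%s\n", name);     /* undefined: may print abcdef, or abcdef plus garbage */
    return 0;
}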
Yes. You do. There are a few other ways to do it.
This form of initialization puts the NUL character in for you automatically.
char name[7] = "abcdef";
printf("%s",name);
Note that I added 1 to the array size, to make space for that NUL.
One can also get away with omitting the size, and letting the compiler figure it out.
char name[] = "abcdef";
printf("%s",name);
Another method is to specify it with a pointer to char. Note that the pointer refers to a string literal, which must not be modified, so declaring it as const char * would be safer.
char *name = "abcdef";
printf("%s",name);