Convert a char inside a String to ASCII - c

I'm writing a mini C compiler for a university project.
My problem is:
How can I obtain the ASCII code of the char inside the string?
The string is always in this format:
"'{char}'"
For example:
char* c1 = "'a'" #I want the ascii code of char: a
In this case, I can obtain the ascii code with the command int(c1[1]).
But if the case is:
char* c1 = "'\000'" #I want the ascii code of char: \000
How can I obtain the ascii code of this case?
Is it possible to obtain a generic function for all cases?

If you want to know the codes of characters, a simple way is
char c1 = '\0';
int c1_code = c1;
char c2 = 'a';
int c2_code = c2;
printf("%d %d\n", c1_code, c2_code);
On an ASCII system, this will print 0 97.
But this is a little silly. It would be simpler and more straightforward to just write
int c1_code = '\0';
int c2_code = 'a';
This works because of a super easy, super important basic definition in C:
In C, a character is represented by a small integer corresponding to the value of that character in the machine's character set.
In some languages, you need special functions to convert back and forth between characters and their character-set values. (I believe BASIC uses CHR$ and ASC, or something.) But in C, you don't need any special processing: a character basically just is its value.
If you want to find character values out of strings (not single characters, as I've shown so far), it's only a tiny bit more involved. A string in C is just an array of characters, so you can do something like this:
char str3[] = "a";
int c3_value = str3[0]; /* value of first character in string */
I can print character values even more simply like this:
printf("%d %d %d\n", 'a', 'b', 'c');
If I read a line of text from the user:
char line[100];
printf("type something:\n");
fgets(line, 100, stdin);
I can print the values of the first few characters like this:
printf("you typed:\n");
printf("%c = %d\n", line[0], line[0]);
printf("%c = %d\n", line[1], line[1]);
printf("%c = %d\n", line[2], line[2]);
If you're unfamiliar with C's character handling, I encourage you to write a little program like this and play with it. For example, if I run that program and type "Hello, world!" into it, it will print
you typed:
H = 72
e = 101
l = 108
Perhaps you knew all of this. Perhaps you really did want to do something like
char *c1 = "'\\000'";
meaning that c1 is a string containing the six characters
' \ 0 0 0 '
and you want to interpret this string as the syntax of a C character constant, just as a C compiler would. That is, perhaps you're trying to basically write a miniature version of that portion of a C compiler that parses character constants. If so, that's a completely different (and considerably more involved) problem.
And evidently this is what you're trying to do. See my other answer.

So you're trying to write a simple lexical analyzer. The syntax of a character constant in C is a single quote, followed by a thing that can go inside, followed by a single quote. The thing that can go inside is either a single character, or an escape sequence. An escape sequence is a \ character followed either by a single character like n, or by one to three octal digits. (There are also hexadecimal escapes, and multi-character character constants, but we'll probably want to ignore those for now.)
So you'll need to write code that can handle all of these possibilities. In pseudocode, it might look something like this:
if first character is `'`, step over it
else error
if next character is not '\', it's the character code we want
else if next character is '\', we have an escape sequence; skip over it and...
    if next character is 'n', character code we want is '\n'
    else if next character is 'r', character code we want is '\r'
    else if next character is 't', character code we want is '\t'
    else if next character is a digit:
        read 1-3 digits
        convert from octal
        that's the character code we want
finally, if next character is `'`, step over it
else error
When people write lexical analyzers for real, they usually use a program to help them, such as lex or flex. But it's also a great learning exercise to write your own, by hand, like this.
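The pseudocode above might be sketched in C roughly as follows. This is only a sketch under a few assumptions: parse_char_constant is a hypothetical name, the input is the whole constant including both quotes, and I've added the \\ and \' escapes, which the pseudocode doesn't list. Hexadecimal escapes and multi-character constants are still ignored, and octal values too large for a char are not range-checked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: parse text such as "'a'", "'\n'" or "'\000'"
 * (the source spelling of a C character constant) and store the
 * character code in *code. Returns 1 on success, 0 on a syntax error. */
int parse_char_constant(const char *s, int *code)
{
    assert(s != NULL && code != NULL);

    if (*s != '\'')                     /* opening quote */
        return 0;
    s++;

    if (*s == '\0' || *s == '\'' || *s == '\n')
        return 0;                       /* empty or unterminated constant */

    if (*s != '\\') {
        *code = (unsigned char)*s++;    /* ordinary character */
    } else {
        s++;                            /* step over the backslash */
        if (*s >= '0' && *s <= '7') {
            int value = 0, digits = 0;
            /* read 1-3 octal digits */
            while (digits < 3 && *s >= '0' && *s <= '7') {
                value = value * 8 + (*s - '0');
                s++;
                digits++;
            }
            *code = value;
        } else {
            switch (*s) {
            case 'n':  *code = '\n'; break;
            case 'r':  *code = '\r'; break;
            case 't':  *code = '\t'; break;
            case '\\': *code = '\\'; break;
            case '\'': *code = '\''; break;
            default:   return 0;        /* escape not handled in this sketch */
            }
            s++;
        }
    }

    if (*s != '\'')                     /* closing quote */
        return 0;
    s++;
    return *s == '\0';                  /* reject trailing text */
}
```

With this, parse_char_constant("'\\000'", &code) stores 0 in code, and on an ASCII system parse_char_constant("'a'", &code) stores 97.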

If you just want the ASCII code of any character in a string, you can literally just index into the array like so:
char *foo = "asdf";
char bar = foo[3];
// bar == 'f'

Related

Why does fgetc() in C always reads extra, non-existent characters whenever I try to read non-printable characters from txt files?

I am trying to read non-printable characters from a text file, print out the characters' ASCII code, and finally write these non-printable characters into an output file.
However, I have noticed that for every non-printable character I read, there is always an extra non-printable character existing in front of what I really want to read.
For example, the character I want to read is "§".
And when I print out its ASCII code in my program, instead of printing just "167", it prints out "194 167".
I looked it up in the debugger and saw "Â§" in the char array. But I don't have "Â" anywhere in my input file.
And after I write the non-printable character into my output file, I have noticed that it is also just "§", not "§".
There is an extra character being attached to every single non-printable character I read. Why is this happening? How do I get rid of it?
Thanks!
Code as follows:
case 1:
mode = 1;
FILE *fp;
fp = fopen ("input2.txt", "r");
int charCount = 0;
while(!feof(fp)) {
original_message[charCount] = fgetc(fp);
charCount++;
}
original_message[charCount - 1] = '\0';
fclose(fp);
k = strlen(original_message);//split the original message into k input symbols
printf("k: \n%lld\n", k);
printf("ASCII code:\n");
for (int i = 0; i < k; i++)
{
ASCII = original_message[i];
printf("%d ", ASCII);
}
C's getchar (and getc and fgetc) functions are designed to read individual bytes. They won't directly handle "wide" or "multibyte" characters such as occur in the UTF-8 encoding of Unicode.
But there are other functions which are specifically designed to deal with those extended characters. In particular, if you wish, you can replace your call to fgetc(fp) with fgetwc(fp), and then you should be able to start reading characters like § as themselves.
You will have to #include <wchar.h> to get the prototype for fgetwc. And you may have to add the call
setlocale(LC_CTYPE, "");
at the top of your program to synchronize your program's character set "locale" with that of your operating system.
Not your original code, but I wrote this little program:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void)
{
    wint_t c; /* wint_t rather than wchar_t, so that WEOF can be represented */
    setlocale(LC_CTYPE, "");
    while((c = fgetwc(stdin)) != WEOF)
        printf("%lc %d\n", c, (int)c);
}
When I type "A", it prints A 65.
When I type "§", it prints § 167.
When I type "Ƶ", it prints Ƶ 437.
When I type "†", it prints † 8224.
Now, with all that said, reading wide characters using functions like fgetwc isn't the only or necessarily even the best way of dealing with extended characters. In your case, it carries a number of additional consequences:
Your original_message array is going to have to be an array of wchar_t, not an array of char.
Your original_message array isn't going to be an ordinary C string — it's a "wide character string". So you can't call strlen on it; you're going to have to call wcslen.
Similarly, you can't print it using %s, or its characters using %c. You'll have to remember to use %ls or %lc.
So although you can convert your entire program to use "wide" strings and "w" functions everywhere, it's a ton of work. In many cases, and despite anomalies like the one you asked about, it's much easier to use UTF-8 everywhere, since it tends to Just Work. In particular, as long as you don't have to pick a string apart and work with its individual characters, or compute the on-screen display length of a string (in "characters") using strlen, you can just use plain C strings everywhere, and let the magic of UTF-8 sequences take care of any non-ASCII characters your users happen to enter.
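If you do stay with plain char strings and occasionally need a character count rather than a byte count, one possible approach is to skip UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx. The sketch below assumes the input is valid UTF-8; utf8_count is a hypothetical name, and it counts code points, which is not always the same as on-screen width.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the code points in a UTF-8 string by
 * counting every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For example, "§" is stored in UTF-8 as the two bytes 194 167 (written "\xC2\xA7" in C), so strlen reports 2 for it while utf8_count reports 1. That is the same pair of bytes the question's program printed.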

Logical XOR in character arrays

I've been trying to make a program on Vernam Cipher which requires me to XOR two strings. I tried to do this program in C and have been getting an error. The lengths of the two strings are the same.
#include<stdio.h>
#include<string.h>
int main()
{
printf("Enter your string to be encrypted ");
char a[50];
char b[50];
scanf("%s",a);
printf("Enter the key ");
scanf("%s",b);
char c[50];
int q=strlen(a);
int i=0;
for(i=0;i<q;i++)
{
c[i]=(char)(a[i]^b[i]);
}
printf("%s",c);
}
Whenever I run the code, I get output as ????? in boxes. What is the method to XOR these two strings ?
I've been trying to make a program on Vernam Cipher which requires me to XOR two strings
Yes, it does, but that's not the only thing it requires. The Vernam cipher involves first representing the message and key in the ITA2 encoding (also known as Baudot-Murray code), and then computing the XOR of each pair of corresponding character codes from the message and key streams.
Moreover, to display the result in the manner you indicate wanting to do, you must first convert it from ITA2 to the appropriate character encoding for your locale, which is probably a superset of ASCII.
The transcoding to and from ITA2 is relatively straightforward, but not so trivial that I'm inclined to write them for you. There is a code chart at the ITA2 link above.
Note also that ITA2 is a stateful encoding that includes shift codes and a null character. This implies that the enciphered message may contain non-printing characters, which could cause some confusion, including a null character, which will be misinterpreted as a string terminator if you are not careful. More importantly, encoding in ITA2 may increase the length of the message as a result of a need to insert shift codes.
Additionally, as a technical matter, if you want to treat the enciphered bytes as a C string, then you need to ensure that it is terminated with a null character. On a related note, scanf() will do that for the strings it reads, which uses one character, leaving you only 49 each for the actual message and key characters.
What is the method to XOR these two strings ?
The XOR itself is not your problem. Your code for that is fine. The problem is that you are XORing the wrong values, and (once the preceding is corrected) outputting the result in a manner that does not serve your purpose.
Whenever I run the code, I get output as ????? in boxes...
XORing two printable characters does not always result in a printable value.
Consider the following:
the ^ operator operates at the bit level.
there is a limited range of values that are printable. (from here):
Control Characters (0–31 & 127): Control characters are not printable characters. They are used to send commands to the PC or the printer and are based on telex technology. With these characters, you can set line breaks or tabs. Today, they are mostly out of use.
Special Characters (32–47 / 58–64 / 91–96 / 123–126): Special characters include all printable characters that are neither letters nor numbers. These include punctuation or technical, mathematical characters. ASCII also includes the space (a non-visible but printable character), which therefore does not belong to the control characters category, as one might suspect.
Numbers (48–57): These numbers include the ten Arabic numerals from 0-9.
Letters (65–90 / 97–122): Letters are divided into two blocks, with the first group containing the uppercase letters and the second group containing the lowercase.
Using the following two strings and the following code:
char str1[] = {"asdf"};
char str2[] = {"jkl;"};
The following demonstrates XORing the elements of the strings:
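#include <stdio.h>

int main(void)
{
    char str1[] = {"asdf"};
    char str2[] = {"jkl;"};
    /* sizeof counts the terminating null character too, so it is XORed as well */
    for(size_t i=0;i<sizeof(str1)/sizeof(str1[0]);i++)
    {
        printf("%d ^ %d: %d\n", str1[i],str2[i], str1[i]^str2[i]);
    }
    getchar();
    return 0;
}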
While all of the input characters are printable (except the terminating null character), not all of the XOR results of corresponding characters are:
97 ^ 106: 11 //not printable
115 ^ 107: 24 //not printable
100 ^ 108: 8 //not printable
102 ^ 59: 93
0 ^ 0: 0
This is why you are seeing the odd output. While all of the values may be completely valid for your purposes, they are not all printable.
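Since the XOR result is raw bytes rather than text, one way to look at it is to keep the length separately and print each byte in hexadecimal instead of using %s. This is a minimal sketch, not a Vernam implementation; xor_bytes and print_hex are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: XOR n bytes of msg with n bytes of key into out.
 * The result may contain zero bytes and non-printable values, so it is
 * not a C string and its length has to be tracked separately. */
void xor_bytes(const char *msg, const char *key, unsigned char *out, size_t n)
{
    assert(msg != NULL && key != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(msg[i] ^ key[i]);
}

/* Print n bytes as two-digit hexadecimal values, which are always printable. */
void print_hex(const unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++)
        printf("%02x ", (unsigned)buf[i]);
    printf("\n");
}
```

For "asdf" and "jkl;" with n = 4, the bytes come out as 0b 18 08 5d, matching the decimal values 11, 24, 8 and 93 listed above. XORing the result with the key again gives back the original message.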

How does c compare character variable against string?

The following code is completely OK in C but not in C++. In the following code, the if statement is always false. How does C compare a character variable against a string?
int main()
{
char ch='a';
if(ch=="a")
printf("confusion");
return 0;
}
The following code is completely ok in C
No, not at all.
In your code
if(ch=="a")
is essentially trying to compare the value of ch with the base address of the string literal "a". This is meaningless and useless.
What you want here, is to use single quotes (') to denote a char literal, like
if(ch == 'a')
NOTE 1:
To elaborate on the difference between single quotes for char literals and double quotes for string literals:
For char literal, C11, chapter §6.4.4.4
An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'
and, for string literal, chapter §6.4.5
A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz".
NOTE 2:
That said, as a note, the recommended signature of main() is int main(void).
I wouldn't say the code is okay in either language.
'a' is a single character. It is actually a small integer, having as its value the value of the given character in the machine's character set (almost invariably ASCII). So 'a' has the value 97, as you can see by running
char c = 'a';
printf("%d\n", c);
"a", on the other hand, is a string. It is an array of characters, terminated by a null character. In C, arrays are almost always referred to by pointers to their first element, so in this case the string constant "a" acts like a pointer to an array of two characters, 'a' and the terminating '\0'. You could see that by running
char *str = "a";
printf("%d %d\n", str[0], str[1]);
This will print
97 0
Now, we don't know where in memory the compiler will choose to put our string, so we don't know what the value of the pointer will be, but it's safe to say that it will never be equal to 97. So the comparison if(ch=="a") will always be false.
When you need to compare a character and a string, you have two choices. You can compare the character to the first character of the string:
if(c == str[0])
printf("they are equal\n");
else printf("confusion\n");
Or you can construct a string from the character, and compare that. In C, that might look like this:
char tmpstr[2];
tmpstr[0] = c;
tmpstr[1] = '\0';
if(strcmp(tmpstr, str) == 0)
printf("they are equal\n");
else printf("confusion\n");
That's the answer for C. There's a different, more powerful string type in C++, so things would be different in that language.
There is a difference between 'a' (a character) and "a" (a string holding two characters, a and \0). The comparison ch=="a" evaluates to false because in this expression "a" is converted to a pointer to its first element, and that address is a pointer value, not a character code.
Change it to
if(ch=='a')

Confused about C string constants

When I came across this C language implementation of Porters Stemming algorithm I found a C-ism I was confused about.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void test( char *s )
{
int len = s[0];
printf("len = %i\n", len );
printf("s[len] = %c\n", s[len] );
}
int main()
{
test("\07" "abcdefg");
return 0;
}
and output:
len = 7
s[len] = g
However, when I input
test("\08" "abcdefgh");
or any string constant that is longer than 7 with the corresponding length in the first pair of parenthesis ( i.e. test("\09" "abcdefghi"); the output is
len = 0
s[len] =
But any input like test("\01" "abcdefgh"); prints out the character in that position ( if we call the first character position 1 and not 0 for the moment )
It appears if test( char *s ) reads the number in the first pair of parenthesis ( how it does this I am not sure since I thought s[0] would be able to only read a single char, i.e. the '\' ) and prints the last character at that index + 1 of the string constant in the second pair of parenthesis.
My question is this: It seems as if we are passing two string constants into test( char *s ). What exactly is happening here, meaning, how does the compiler seem to "split" up the string over two pairs of parenthesis? Another question one might have is, is a string of the form "blah" "abcdefg" one consecutive block of memory? It may be the case that I have overlooked something elementary, but even so I would like to know what I overlooked. I know this is a basic concept but I could not find a clear example or situation on the web that explains this and in all honesty I don't follow the output. Any helpful comments are welcomed.
There are at least three things going on here:
Literal strings juxtaposed against one another are concatenated by the compiler. "a" "b" is exactly the same as "ab".
The backslash is an escape character, which means it is not copied literally into the resulting string. The notation \01 means "the character with ASCII value 1".
The notation \0... means an octal character constant. Octal numbers are base 8, made up from digits that range from 0 through 7 inclusive. 8 is not a valid octal constant, so "\08" does not follow "\07".
The problem is not in the length of the string, but in the \o syntax for specifying non-printable values in string literals. \o, \oo, and \ooo denote octal constants, i.e. a single character whose value is written in base 8. Since 08 in \08 doesn't represent a valid base 8 number, it is interpreted as \0 followed by the ASCII character 8.
To fix the problem, represent 8 as \10 or \010:
test("\007" "abcdefg");
test("\010" "abcdefgh");
...or switch to hexadecimal, where the \x prefix makes the base more explicit to the casual reader:
test("\x07" "abcdefg");
test("\x08" "abcdefgh");
test("\x09" "abcdefghi");
test("\x0a" "abcdefghij");
...
\number in a character or string literal means the character whose code is the value number. number is interpreted in octal, so the first non-octal digit terminates the number. So "\07" is a one-character string containing the character with code 7, but \08 is a two-character string containing the character with code 0 followed by the digit 8.
Additionally, code 0 is the null terminator that C uses to indicate the end of a string. So that second string ends at the beginning, because its first byte is the terminator. This is why the length of the string in your second example is 0.
When two or more string literals are adjacent (separated only by white-space), the compiler will join them into a single string. Therefore "\07" "abcdefg" is equivalent to "\07abcdefg".
"\07" is an octal escape. An octal escape ends after three digits or at the first non-octal character. So when you write "\08", 8 is not an octal digit, so the escape ends there and 0 is stored at s[0].
Now, len is 0 and printing s[len] will try to print the character at s[0], which is the null character and has no printable representation (only ASCII codes 32 through 126 are printable).
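To see the difference concretely, here is a small sketch that puts the three spellings side by side. The array names and last_char are hypothetical, chosen only for illustration; last_char does the same lookup as test() in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Length-prefixed strings as in the question: the first byte holds the length. */
static const char seven[] = "\07" "abcdefg";    /* \07: one byte with value 7 */
static const char wrong[] = "\08" "abcdefgh";   /* \0, then the digit '8' */
static const char eight[] = "\010" "abcdefgh";  /* \010: one byte with value 8 */

/* Hypothetical helper: return the character at the index stored in s[0]. */
char last_char(const char *s)
{
    int len;

    assert(s != NULL);
    len = s[0];
    return s[len];
}
```

Here sizeof seven is 9 (length byte, seven letters, terminator) and sizeof eight is 10, but sizeof wrong is 11 because "\08" contributes two bytes. last_char(seven) is 'g' and last_char(eight) is 'h', while last_char(wrong) is the null character, since wrong[0] is 0.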

Decimal string to character ASCII conversion - C

Can someone explain how to convert a string of decimal values from ASCII table to its character 'representation' in C ? For example: user input could be 097 and the function would print 'a' on the screen, but also user could type in '097100101' and the function would have to print 'ade' etc. I have written something clunky that does the opposite operation:
char word[30];
int i = 0;
scanf("%s", word);
while(word[i]!=0)
{
if(word[i]<'d')
printf("0%d", (int)word[i]);
if(word[i]>='d')
printf("%d", (int)word[i]);
i++;
}
but it works. Now I want to have function that works in a similar way but of course does decimal > char conversion. The point is, I cannot use any functions like 'atoi' or something like that (not sure about names, never used them ;)).
You can use this function instead of atoi:
char a3toc(const char *ptr)
{
    return (ptr[0]-'0')*100 + (ptr[1]-'0')*10 + (ptr[2]-'0');
}
So, a3toc("102") will return the same thing as (char) 102, which is an 'f'.
If you don't see why, substitute in the values: ptr[0] is '1', so the first part becomes ('1'-'0')*100 or 1*100 or 100, which is what that first 1 in 102 represents.
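The same digit arithmetic extends to a whole string of three-digit groups such as "097100101". The following is a sketch under the assumption that every character is written as exactly three decimal digits and that only codes 1 through 127 are wanted; decode_decimal is a hypothetical name, and no library conversion functions are used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode groups of three decimal digits, such as
 * "097100101", into characters ("ade"). Writes a terminated string to
 * out (capacity outsize) and returns the number of characters decoded,
 * or -1 if the input is malformed or does not fit. */
int decode_decimal(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return -1;

    while (in[0] != '\0') {
        int value = 0;
        for (int i = 0; i < 3; i++) {
            if (in[i] < '0' || in[i] > '9')
                return -1;          /* also catches a short final group */
            value = value * 10 + (in[i] - '0');
        }
        if (value < 1 || value > 127 || n + 1 >= outsize)
            return -1;
        out[n++] = (char)value;
        in += 3;                    /* move on to the next group */
    }
    out[n] = '\0';
    return (int)n;
}
```

On an ASCII system, decode_decimal("097100101", buf, sizeof buf) with a 16-byte buf returns 3 and leaves "ade" in buf. An input whose length is not a multiple of three is rejected with -1.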
Tokenize the input string. I'm assuming you are forcing that every letter MUST be represented in 3 characters. So break the string that way. And simply use explicit type casting to get the desired character.
I don't think I should be giving you the code for this, since it is pretty easy and seems more like a Homework question.

Resources