Logical XOR in character arrays - c

I've been trying to make a program on the Vernam Cipher, which requires me to XOR two strings. I tried to do this program in C and have been getting an error. The two strings are the same length.
#include <stdio.h>
#include <string.h>

int main()
{
    printf("Enter your string to be encrypted ");
    char a[50];
    char b[50];
    scanf("%s", a);
    printf("Enter the key ");
    scanf("%s", b);
    char c[50];
    int q = strlen(a);
    int i = 0;
    for (i = 0; i < q; i++)
    {
        c[i] = (char)(a[i] ^ b[i]);
    }
    printf("%s", c);
}
Whenever I run the code, I get output as ????? in boxes. What is the method to XOR these two strings?

I've been trying to make a program on Vernam Cipher which requires me to XOR two strings
Yes, it does, but that's not the only thing it requires. The Vernam cipher involves first representing the message and key in the ITA2 encoding (also known as Baudot-Murray code), and then computing the XOR of each pair of corresponding character codes from the message and key streams.
Moreover, to display the result in the manner you indicate wanting to do, you must first convert it from ITA2 to the appropriate character encoding for your locale, which is probably a superset of ASCII.
The transcoding to and from ITA2 is relatively straightforward, but not so trivial that I'm inclined to write it for you. There is a code chart at the ITA2 link above.
Note also that ITA2 is a stateful encoding that includes shift codes and a null character. This implies that the enciphered message may contain non-printing characters (which could cause some confusion), possibly including a null character, which will be misinterpreted as a string terminator if you are not careful. More importantly, encoding in ITA2 may increase the length of the message as a result of a need to insert shift codes.
Additionally, as a technical matter, if you want to treat the enciphered bytes as a C string, then you need to ensure that it is terminated with a null character. On a related note, scanf() will do that for the strings it reads; the terminator uses one character, leaving you only 49 each for the actual message and key characters.
What is the method to XOR these two strings?
The XOR itself is not your problem. Your code for that is fine. The problem is that you are XORing the wrong values, and (once the preceding is corrected) outputting the result in a manner that does not serve your purpose.
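For reference, here is a minimal sketch of the plain byte-wise XOR the question attempts (not the full ITA2-based Vernam cipher described above), with the termination and display issues handled. The buffer sizes, variable names, and hex output are illustrative choices, and it assumes the key is at least as long as the message:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char msg[50], key[50], enc[50], dec[50];

    printf("Enter your string to be encrypted ");
    scanf("%49s", msg);               /* width limit leaves room for the terminator */
    printf("Enter the key ");
    scanf("%49s", key);

    size_t n = strlen(msg);           /* assumes strlen(key) >= strlen(msg) */

    for (size_t i = 0; i < n; i++)
        enc[i] = (char)(msg[i] ^ key[i]);
    enc[n] = '\0';                    /* terminate so the buffer is a valid C string */

    /* The enciphered bytes may not be printable, so show them as hex codes. */
    for (size_t i = 0; i < n; i++)
        printf("%02X ", (unsigned)(unsigned char)enc[i]);
    printf("\n");

    /* XORing with the same key again recovers the original message. */
    for (size_t i = 0; i < n; i++)
        dec[i] = (char)(enc[i] ^ key[i]);
    dec[n] = '\0';
    printf("Decrypted: %s\n", dec);
    return 0;
}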

Whenever I run the code, I get output as ????? in boxes...
XORing two printable characters does not always result in a printable value.
Consider the following:
the ^ operator operates at the bit level.
there is a limited range of values that are printable. Paraphrasing a common description of the ASCII table:
Control characters (0–31 & 127): these are not printable characters. They are used to send commands to the PC or the printer and are based on telex technology; with them you can set line breaks or tabs. Today, they are mostly out of use.
Special characters (32–47 / 58–64 / 91–96 / 123–126): these include all printable characters that are neither letters nor digits, such as punctuation and technical or mathematical characters. ASCII also counts the space as a printable (though non-visible) character, so it does not belong to the control-character category, as one might suspect.
Digits (48–57): the ten Arabic numerals from 0-9.
Letters (65–90 / 97–122): letters are divided into two blocks, with the first group containing the uppercase letters and the second group containing the lowercase.
Using the following two strings:
char str1[] = {"asdf"};
char str2[] = {"jkl;"};
the following code demonstrates XORing the elements of the strings:
#include <stdio.h>

int main(void)
{
    char str1[] = {"asdf"};
    char str2[] = {"jkl;"};

    for (int i = 0; i < sizeof(str1) / sizeof(str1[i]); i++)
    {
        printf("%d ^ %d: %d\n", str1[i], str2[i], str1[i] ^ str2[i]);
    }
    getchar();
    return 0;
}
While all of the input characters are printable (except the NULL character), not all of the XOR results of corresponding characters are:
97 ^ 106: 11 //not printable
115 ^ 107: 24 //not printable
100 ^ 108: 8 //not printable
102 ^ 59: 93
0 ^ 0: 0
This is why you are seeing the odd output. While all of the values may be completely valid for your purposes, they are not all printable.
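If it helps to see that distinction directly, here is a small sketch using isprint() from ctype.h (with the same str1/str2 as above) that flags which XOR results are printable:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char str1[] = {"asdf"};
    char str2[] = {"jkl;"};

    for (size_t i = 0; i < sizeof(str1); i++)
    {
        int x = (unsigned char)str1[i] ^ (unsigned char)str2[i];
        /* isprint() reports whether the XOR result is a printable character. */
        printf("%3d -> %s\n", x, isprint(x) ? "printable" : "not printable");
    }
    return 0;
}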

Related

Caesar cipher in C shows strange behaviour

I was writing a Caesar cipher in C for practice and made a functioning one, but it shows some strange behavior. The code is as follows:
#include <stdio.h>

#define lenght 18

char *caesar(char *cyphertext, int key)
{
    static char result[lenght];
    for (int i = 0; i < lenght; i++) {
        result[i] = (char)(((int) cyphertext[i]) + key) % 255;
    }
    return result;
}

int main()
{
    char *text = caesar("Hola buenas tardes", 23);
    printf("%s \n", text);
    char *check = caesar(text, 256 - 23);
    printf("%s \n", check);
    return 0;
}
The encrypted version is _x7y|x7x{|, a shorter string, but when I run the second Caesar cipher with the decryption key it decrypts back to the original with no problem. I have been looking around and it is probably about how the characters are stored. I would be very grateful for any help.
The encrypted version is _x7y|x7x{|, a shorter string
No, what printf prints is the above. Or even more precisely, that's how your terminal displays what printf prints. If you want to be certain exactly what the encrypted version is then you should run your program in a debugger, and use it to examine the bytes of the encoded version.
Your approach will encode some character codes from the printable ASCII range (codes 32 - 126 decimal) as codes outside that range. How your terminal handles those bytes depends on your configuration and environment, but if it expects UTF-8-encoded data then it will trip over invalid code sequences in the output, and if it expects an ISO-8859 encoding then some of the output codes will be interpreted as control characters from the C1 set. There are other possibilities.
Usually, a Caesar-cypher program is careful to map all printable characters to other printable characters, leaving others alone. The typical academic exercise is even more narrowly scoped, asking for a program that maps only the upper- and lowercase Latin letters, preserving case, and leaves all others (punctuation, digits, control characters) alone. This is, accordingly, left as an exercise.
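For illustration only, a sketch of that narrower exercise might look like the following; the function name, the in-place interface, and the sample text are choices made here, not anything from the question, and it assumes an ASCII-style contiguous letter range:

#include <stdio.h>

/* Rotate only Latin letters by `key` positions, preserving case;
   every other character is left untouched. */
static void caesar_letters(char *s, int key)
{
    key = ((key % 26) + 26) % 26;                 /* normalize the shift */
    for (; *s != '\0'; s++) {
        if (*s >= 'a' && *s <= 'z')
            *s = (char)('a' + (*s - 'a' + key) % 26);
        else if (*s >= 'A' && *s <= 'Z')
            *s = (char)('A' + (*s - 'A' + key) % 26);
    }
}

int main(void)
{
    char text[] = "Hola buenas tardes";
    caesar_letters(text, 23);
    printf("%s\n", text);                         /* encrypted, still printable */
    caesar_letters(text, 26 - 23);
    printf("%s\n", text);                         /* back to the original */
    return 0;
}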
The printf function should not be used to print the cipher text; it mainly supports ASCII characters, and you have random unprintable characters. Consider converting it to a hexadecimal string.
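A sketch of that conversion could look like this; the helper name and buffer handling are illustrative, not part of any particular library:

#include <stdio.h>

/* Write the bytes of `data` as a hexadecimal string into `out`.
   `out` must have room for 2*len + 1 characters. */
static void to_hex(const unsigned char *data, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++)
        sprintf(out + 2 * i, "%02X", data[i]);
    out[2 * len] = '\0';
}

int main(void)
{
    unsigned char cipher[] = { 0x5F, 0x78, 0x07, 0x79 };   /* example bytes only */
    char hex[2 * sizeof cipher + 1];

    to_hex(cipher, sizeof cipher, hex);
    printf("%s\n", hex);                                   /* prints 5F780779 */
    return 0;
}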

Null character and strings in C

I have the following C code:
#include <stdio.h>
#include <strings.h>

int main(void){
    char *str = "\012\0345";
    char testArr[8] = {'\0','1','2','\0','3','4','5','\0'};

    printf("%s\n", str);
    printf("**%s**", testArr);
    return 0;
}
I'm having trouble understanding the results. I have googled, but I am unsure why a null character at the start of a string, and one in the middle, would cause only the string "5" to display. Also, when I assign each character to the array testArr and then attempt to display that array, the result is different, despite the string and the array containing the same characters. So I'm struck by the confounding results, especially their disparity. With the string str, does the code display "5" because the null characters overwrite what is in memory?
Also, with the array I created using the same characters, nothing of the data contained in testArr displays. Is it that once the first null is encountered everything else is ignored? If so, why doesn't the same behavior occur with the string str, which contains the same characters?
An octal escape sequence is \ followed by one to three octal digits, per C 2018 6.4.4.4 1. Per 6.4.4.4 7: “Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.” So, when the compiler sees "\012\0345", it interprets it as the sequence \012 (which is ten), the sequence \034 (which is twenty-eight), and the character 5.
To represent the string you intended, you could use "\00012\000345". Since an octal escape sequence stops at three digits, this is interpreted as the sequence \000, the characters 1 and 2, the sequence \000, and the characters 3, 4, and 5. (A null terminating character will also be appended automatically.)
When you printed "\012\0345", the characters with codes ten and twenty-eight were printed but had no visible effect. (Your C implementation likely uses ASCII, in which case they are control characters. \012 is new-line, so it should have caused a line advance, but you probably did not notice that. \034 is a file-separator control character, which likely has no effect when printed to a regular terminal display.)
When you printed testArr, the null character in the first position ended the string.
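One way to see this for yourself is to dump the individual byte values instead of relying on %s; here is a small sketch using the same two definitions:

#include <stdio.h>

int main(void)
{
    char *str = "\012\0345";
    char testArr[8] = {'\0','1','2','\0','3','4','5','\0'};

    /* Print each byte's numeric value; %s would stop at the first '\0'. */
    for (int i = 0; str[i] != '\0'; i++)
        printf("str[%d] = %d\n", i, str[i]);

    for (int i = 0; i < 8; i++)
        printf("testArr[%d] = %d\n", i, testArr[i]);
    return 0;
}

This prints 10, 28, and 53 for str (the three characters explained above) and shows the leading 0 in testArr that stops %s immediately.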

Check if array is ASCII

How do I check in C if an array of uint8 contains only ASCII elements?
If possible, please refer me to the condition that checks whether an element is ASCII or not.
Your array elements are uint8, so they must be in the range 0-255.
For the standard ASCII character set, only codes 0-127 are used, so you can use a for loop to iterate through the array, checking whether each element is <= 127.
If you're treating the array as a string, be aware of the 0 byte (null character), which marks the end of the string.
From your example comment, this could be implemented like this:
int checkAscii (uint8 *array) {   /* uint8 and LEN come from the asker's own code */
    for (int i = 0; i < LEN; i++) {
        if (array[i] > 127) return 0;
    }
    return 1;
}
It breaks out early at the first element greater than 127.
All valid ASCII characters have values 0 to 127, so the test is simply a value check or a 7-bit mask. For example, given the inclusion of stdbool.h:
bool is_ascii = (ch & ~0x7f) == 0 ;
Possibly however you intended only printable ASCII characters (excluding control characters). In that case, given inclusion of ctype.h:
bool is_printable_ascii = (ch & ~0x7f) == 0 &&
                          (isprint(ch) || isspace(ch)) ;
Your intent may be slightly different in terms of which characters you want to include in your set, in which case other functions in ctype.h may be applied, or you can simply test the values against the ranges you want to include or exclude.
Note also that the ASCII set is very restricted in international terms. The ANSI or "extended ASCII" set uses locale-specific code pages to define the glyphs associated with codes 128 to 255; that is to say, the set changes depending on language/locale settings to accommodate different language characters, accents, and alphabets. In modern systems it is common instead to use a multi-byte Unicode encoding (of which there are several, with either fixed or variable length codes). UTF-8 is a variable-width encoding in which all single-byte encodings are also ASCII codes. As such, while it is trivial to determine whether data is entirely within the ASCII set, it does not follow that the data is therefore text. If the test is intended to distinguish binary data from text, it will fail in a great many scenarios unless you can guarantee a priori that all text is restricted to the ASCII set - and that is application specific.
You cannot check if something is "ASCII" with standard C, because C does not specify which character set is used by a compiler. Various other more or less exotic character sets exist or have existed.
UTF-8, for example, is a superset of ASCII. Older, dysfunctional 8-bit character sets have also existed, such as EBCDIC and "extended ASCII". Telling whether something is, for example, ASCII or EBCDIC can't be done trivially without a long series of value checks.
With standard C, you can only do the following:
You can check if a character is printable, with the function isprint() from ctype.h.
Or you can check that only the low 7 bits are set: if((ch & 0x7F)==ch).
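Putting those two checks together over a whole buffer might look like the following sketch; the function names and the len parameter are illustrative choices:

#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if every byte fits in 7 bits (i.e. is a valid ASCII code). */
bool all_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if ((buf[i] & 0x7F) != buf[i])
            return false;
    return true;
}

/* Returns true if every byte is printable ASCII or whitespace. */
bool all_printable_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if ((buf[i] & ~0x7F) != 0 || !(isprint(buf[i]) || isspace(buf[i])))
            return false;
    return true;
}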
In C programming, a character variable holds a character code (on an ASCII system, an integer between 0 and 127) rather than the character itself.
The ASCII values of the lowercase letters run from 97 to 122, and the ASCII values of the uppercase letters run from 65 to 90.
Instead of giving the actual code, here is an example.
You can assign an int to a char directly.
int a = 47;
char c = a;
printf("%c", c);
And this will also work.
printf("%c", a); // a is in valid range
Another approach:
An integer can be assigned directly to a character. A character is different mostly just because of how it is interpreted and used.
char c = atoi("47");
Try to implement this after you have properly understood the logic above.

Convert a char inside a String to ASCII

I'm doing a mini C compiler for a University project.
My problem is:
How can I obtain the ASCII code of the char inside the string?
The string is always in this format:
"'{char}'"
For example:
char* c1 = "'a'";   // I want the ASCII code of char: a
In this case, I can obtain the ASCII code with the expression (int)c1[1].
But if the case is:
char* c1 = "'\000'";   // I want the ASCII code of char: \000
How can I obtain the ASCII code in this case?
Is it possible to write a generic function that handles all cases?
If you want to know the codes of characters, a simple way is
char c1 = '\0';
int c1_code = c1;
char c2 = 'a';
int c2_code = c2;
printf("%d %d\n", c1_code, c2_code);
On an ASCII system, this will print 0 97.
But this is a little silly. It would be simpler and more straightforward to just write
int c1_code = '\0';
int c2_code = 'a';
This works because of a super easy, super important basic definition in C:
In C, a character is represented by a small integer corresponding to the value of that character in the machine's character set.
In some languages, you need special functions to convert back and forth between characters and their character-set values. (I believe BASIC uses CHR$ and ASC, or something like that.) But in C, you don't need any special processing: a character basically just is its value.
If you want to find character values out of strings (not single characters, as I've shown so far), it's only a tiny bit more involved. A string in C is just an array of characters, so you can do something like this:
char str3[] = "a";
int c3_value = str3[0]; /* value of first character in string */
I can print character values even more simply like this:
printf("%d %d %d\n", 'a', 'b', 'c');
If I read a line of text from the user:
char line[100];
printf("type something:\n");
fgets(line, 100, stdin);
I can print the values of the first few characters like this:
printf("you typed:\n");
printf("%c = %d\n", line[0], line[0]);
printf("%c = %d\n", line[1], line[1]);
printf("%c = %d\n", line[2], line[2]);
If you're unfamiliar with C's character handling, I encourage you to write a little program like this and play with it. For example, if I run that program and type "Hello, world!" into it, it will print
you typed:
H = 72
e = 101
l = 108
Perhaps you knew all of this. Perhaps you really did want to do something like
char *c1 = "'\\000'";
meaning that c1 is a string containing the six characters
' \ 0 0 0 '
and you want to interpret this string as the syntax of a C character constant, just as a C compiler would. That is, perhaps you're trying to basically write a miniature version of that portion of a C compiler that parses character constants. If so, that's a completely different (and considerably more involved) problem.
And evidently this is what you're trying to do. See my other answer.
So you're trying to write a simple lexical analyzer. The syntax of a character constant in C is a single quote, followed by a thing that can go inside, followed by a single quote. The thing that can go inside is either a single character, or an escape sequence. An escape sequence is a \ character followed either by a single character like n, or by one to three octal digits. (There are also hexadecimal escapes, and multi-character character constants, but we'll probably want to ignore those for now.)
So you'll need to write code that can handle all of these possibilities. In pseudocode, it might look something like this:
if first character is `'`, step over it
else error
if next character is not '\', it's the character code we want
else if next character is '\', we have an escape sequence; skip over it and...
    if next character is 'n', character code we want is '\n'
    else if next character is 'r', character code we want is '\r'
    else if next character is 't', character code we want is '\t'
    else if next character is a digit:
        read 1-3 digits
        convert from octal
        that's the character code we want
finally, if next character is `'`, step over it
else error
When people write lexical analyzers for real, they usually use a program to help them, such as lex or flex. But it's also a great learning exercise to write your own, by hand, like this.
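Hand-translated into C, that pseudocode might come out something like the sketch below. The function name and the error handling are choices made here, not part of the question, and hex escapes and multi-character constants are ignored as suggested:

#include <stdio.h>

/* Parse a C-style character constant such as "'a'", "'\n'" or "'\012'"
   and return its character code, or -1 on a syntax error. */
int parse_char_constant(const char *s)
{
    int value;

    if (*s++ != '\'')                  /* opening quote */
        return -1;

    if (*s != '\\') {                  /* plain character */
        value = (unsigned char)*s++;
    } else {                           /* escape sequence */
        s++;
        switch (*s) {
        case 'n':  value = '\n'; s++; break;
        case 'r':  value = '\r'; s++; break;
        case 't':  value = '\t'; s++; break;
        case '\\': value = '\\'; s++; break;
        case '\'': value = '\''; s++; break;
        case '0': case '1': case '2': case '3':
        case '4': case '5': case '6': case '7':
            value = 0;                 /* one to three octal digits */
            for (int i = 0; i < 3 && *s >= '0' && *s <= '7'; i++)
                value = value * 8 + (*s++ - '0');
            break;
        default:
            return -1;
        }
    }

    if (*s != '\'')                    /* closing quote */
        return -1;
    return value;
}

int main(void)
{
    printf("%d\n", parse_char_constant("'a'"));     /* 97 on an ASCII system */
    printf("%d\n", parse_char_constant("'\\n'"));   /* 10 */
    printf("%d\n", parse_char_constant("'\\000'")); /* 0  */
    return 0;
}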
If you just want the ASCII code of any character in a string, you can literally just index into the array like so:
char *foo = "asdf";
char bar = foo[3];
// bar == 'f'

Figure out whether two strings are similar or not

Rules:
Two strings, a and b, both consisting of ASCII chars and non-ASCII chars (say, GBK-encoded Chinese characters).
If the non-ASCII chars contained in b also show up in a, at least as many times as they appear in b, then we say b is similar to a.
For example:
a = "ab中ef日jkl中本" //non-ASCII chars:'中'(twice), '日'(once), '本'(once)
b = "bej中中日" //non-ASCII chars:'中'(twice), '日'(once)
c = "lk日日日" //non-ASCII chars:'日'(3 times, more often than in a)
According to the rule, b is similar to a, but c is not.
Here is my question:
We don't know how many non-ASCII chars there are in a and b; probably many.
So, to find out how many times each non-ASCII char appears in a and b, am I supposed to use a hash table to store their occurrence counts?
Take string a as an example:
[non-ASCII's hash-value]:[times]
中's hash-val : 2
日's hash-val : 1
本's hash-val : 1
Check string b: if we encounter a non-ASCII char in b, hash it and look it up in a's hash table; if the char is present in a's hash table, decrement its count by 1.
If the count drops below 0 (to -1), then we say b is not similar to a.
Or is there any better way?
PS:
I read string a byte by byte; if the byte is less than 128, I take it as an ASCII char, otherwise I take it as part of a (multi-byte) non-ASCII char.
This is what I am doing to find out the non-ASCII chars.
Is it right?
You have asked two questions:
Can we count the non-ASCII characters using a hashtable? Answer: sure. As you read the characters (not the bytes), examine the codepoints. For any codepoint greater than 127, put it into a counting hashtable. That is for a character c, add (c,1) if c is not in the table, and update (c,x) to (c, x+1) if c is in the table already.
Is there a better way to solve this problem than your approach of incrementing counts in a and decrementing as you run through b? If your hashtable implementation gives nearly O(1) access, then I suspect not. You are looking at each character in the string exactly once, and for each character you are doing either a hashtable insert or a lookup, plus an addition or subtraction and a check against 0. With unsorted strings, you have to look at all the characters in both strings anyway, so you've given, I think, the best solution.
The interviewer might be looking for you to say things like, "Hmmmm, if these strings were actually massive files that could not fit in memory, what would I do?" Or for you to ask, "Well, are the strings sorted? Because if they are, I can do it faster...".
But now let's say the strings are massive. The only thing you are storing in memory is the hashtable. Unicode has only around 1 million codepoints and you are storing an integer count for each, so even if you are getting data from gigabyte sized files you only need around 4MB or so for your hash table (or a small multiple of this, as there will be overhead).
In the absence of any other conditions, your algorithm is nice. Sorting the strings beforehand isn't good; it takes up more memory and isn't a linear-time operation.
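A rough C sketch of that counting approach could look like the following; it uses wide strings (as in the addenda below) and a flat count table sized to the roughly-one-million-codepoints estimate above, and it assumes that each wchar_t holds one codepoint, as on typical Linux systems:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

#define MAX_CODEPOINT 0x110000          /* upper bound on Unicode codepoints */

/* Returns true if every non-ASCII character of b appears in a
   at least as many times as it appears in b. */
bool similar(const wchar_t *a, const wchar_t *b)
{
    int *count = calloc(MAX_CODEPOINT, sizeof *count);
    if (count == NULL)
        return false;                   /* allocation failure treated as "not similar" */

    for (; *a != L'\0'; a++)            /* count the non-ASCII characters of a */
        if ((unsigned)*a > 127 && (unsigned)*a < MAX_CODEPOINT)
            count[(unsigned)*a]++;

    bool result = true;
    for (; *b != L'\0'; b++)            /* decrement for each one found in b */
        if ((unsigned)*b > 127 && (unsigned)*b < MAX_CODEPOINT &&
            --count[(unsigned)*b] < 0) {
            result = false;
            break;
        }

    free(count);
    return result;
}

int main(void)
{
    const wchar_t *a = L"ab中ef日jkl中本";
    const wchar_t *b = L"bej中中日";
    const wchar_t *c = L"lk日日日";

    printf("b similar to a: %d\n", similar(a, b));   /* expected 1 */
    printf("c similar to a: %d\n", similar(a, c));   /* expected 0 */
    return 0;
}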
ADDENDUM
Since your original comments mentioned the type char as opposed to wchar_t, I thought I'd show an example of using wide strings. See http://codepad.org/B3MXOgqc
Hope that helps.
ADDENDUM 2
Okay here is a C program that shows exactly how to go through a widestring and work at the character level:
http://codepad.org/QVX3QPat
It is a very short program so I will also paste it here:
#include <stdio.h>
#include <string.h>
#include <wchar.h>

char *s1 = "abd中日";
wchar_t *s2 = L"abd中日";

int main() {
    int i, n;
    printf("length of s1 is %zu\n", strlen(s1));
    printf("length of s2 using wcslen is %zu\n", wcslen(s2));
    printf("The codepoints of the characters of s2 are\n");
    for (i = 0, n = wcslen(s2); i < n; i++) {
        printf("%02x\n", (unsigned)s2[i]);
    }
    return 0;
}
Output:
length of s1 is 9
length of s2 using wcslen is 5
The codepoints of the characters of s2 are
61
62
64
4e2d
65e5
What can we learn from this? A couple things:
If you use plain old char for CJK characters, then the "string length" you get is a byte count, not a character count.
To use Unicode characters in C, use wchar_t
String literals have a leading L for wide strings
In this example I defined a string with CJK characters and used wchar_t and a for loop with wcslen. Please note here that I am working with real characters, NOT BYTES, so I get the correct count of characters, which is 5. Now I print out each codepoint. In your interview question, you will be looking to see if the codepoint is >= 128. I showed them in hex, as is customary, so you can look for > 0x7F. :-)
ADDENDUM 3
A few notes in http://tldp.org/HOWTO/Unicode-HOWTO-6.html are worth reading. There is a lot more to character handling than the simple example above shows. In the comments below J.F. Sebastian gives a number of other important links.
One of the things that needs to be addressed is normalization. For example, does your interviewer care whether, given two strings, one containing just a Ç and the other a C followed by a COMBINING CEDILLA, they would be considered the same? They represent the same character, but one uses one codepoint and the other uses two.
