Figure out whether 2 strings are similar or not - C

Rules:
2 strings, a and b, both of them consist of ASCII chars and non-ASCII chars (say, Chinese Characters gbk-encoded).
If every non-ASCII char contained in b also shows up in a, at least as many times as it appears in b, then we say b is similar to a.
For example:
a = "ab中ef日jkl中本" //non-ASCII chars:'中'(twice), '日'(once), '本'(once)
b = "bej中中日" //non-ASCII chars:'中'(twice), '日'(once)
c = "lk日日日" //non-ASCII chars:'日'(3 times, more than the one time it appears in a)
According to the rule, b is similar to a, but c is not.
Here is my question:
We don't know how many non-ASCII chars are there in a and b, probably many.
So to find out how many times a non-ASCII char appears in a and b, am I supposed to use a Hash-Table to store their appearing-times?
Take string a as an example:
[non-ASCII's hash-value]:[times]
中's hash-val : 2
日's hash-val : 1
本's hash-val : 1
Check string b, if we encounter a non-ASCII char in b, then hash it and check a's hash-table, if the char is present in a's hash-table, then its appearing-times decrements by 1.
If the appearing-times is less than 0 (-1), then we say b is not similar with a.
Or is there any better way?
PS:
I read string a byte by byte. If the byte is less than 128, I take it as an ASCII char; otherwise I take it as part of a non-ASCII (multi-byte) char.
This is what I am doing to find out the non-ASCII chars.
Is it right?

You have asked two questions:
Can we count the non-ASCII characters using a hashtable? Answer: sure. As you read the characters (not the bytes), examine the codepoints. For any codepoint greater than 127, put it into a counting hashtable. That is for a character c, add (c,1) if c is not in the table, and update (c,x) to (c, x+1) if c is in the table already.
Is there a better way to solve this problem than your approach of incrementing counts in a and decrementing as you run through b? If your hashtable implementation gives nearly O(1) access, then I suspect not. You are looking at each character in the string exactly once, and for each character you are doing either a hashtable insert or lookup, an addition or subtraction, and a check against 0. With unsorted strings, you have to look at all the characters in both strings anyway, so you've given, I think, the best solution.
The interviewer might be looking for you to say things like, "Hmmmmm, if these strings were actually massive files that could not fit in memory, what would I do?" Or for you to ask "Well, are the strings sorted? Because if they are, I can do it faster...".
But now let's say the strings are massive. The only thing you are storing in memory is the hashtable. Unicode has only about 1.1 million codepoints (U+0000 to U+10FFFF) and you are storing an integer count for each, so even if you are getting data from gigabyte-sized files you only need around 4MB or so for your hash table (or a small multiple of this, as there will be overhead).
In the absence of any other conditions, your algorithm is nice. Sorting the strings beforehand isn't good; it takes up more memory and isn't a linear-time operation.
ADDENDUM
Since your original comments mentioned the type char as opposed to wchar_t, I thought I'd show an example of using wide strings. See http://codepad.org/B3MXOgqc
Hope that helps.
ADDENDUM 2
Okay here is a C program that shows exactly how to go through a widestring and work at the character level:
http://codepad.org/QVX3QPat
It is a very short program so I will also paste it here:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
char *s1 = "abd中日";
wchar_t *s2 = L"abd中日";
int main() {
int i, n;
printf("length of s1 is %zu\n", strlen(s1)); /* strlen returns size_t, so use %zu */
printf("length of s2 using wcslen is %zu\n", wcslen(s2));
printf("The codepoints of the characters of s2 are\n");
for (i = 0, n = wcslen(s2); i < n; i++) {
printf("%02x\n", (unsigned int)s2[i]); /* cast so the argument matches %x */
}
return 0;
}
Output:
length of s1 is 9
length of s2 using wcslen is 5
The codepoints of the characters of s2 are
61
62
64
4e2d
65e5
What can we learn from this? A couple things:
If you use plain old char for CJK characters then the string length will be wrong.
To use Unicode characters in C, use wchar_t
String literals have a leading L for wide strings
In this example I defined a string with CJK characters and used wchar_t and a for-loop with wcslen. Please note here that I am working with real characters, NOT BYTES, so I get the correct count of characters, which is 5. Now I print out each codepoint. In your interview question, you will be looking to see if the codepoint is >= 128. I showed them in Hex, as is the culture, so you can look for > 0x7F. :-)
ADDENDUM 3
A few notes in http://tldp.org/HOWTO/Unicode-HOWTO-6.html are worth reading. There is a lot more to character handling than the simple example above shows. In the comments below J.F. Sebastian gives a number of other important links.
One of the things that needs to be addressed is normalization. For example, does your interviewer care whether two strings, one containing just a Ç and the other a C followed by U+0327 COMBINING CEDILLA, are treated as the same? They represent the same character, but one uses one codepoint and the other uses two.

Related

What happens when we make an array defined using characters instead of integers in C?

This is the code I used to define an array:
int characters[126];
I then wanted to record the frequency of every character read, for which I used a while loop in this format:
while((a=getchar())!=EOF){
characters[a]=characters[a]+1;
}
Then using a for loop I print the values of integers in the array.
How exactly is this working?
Does C assign a specific number for letters ie. a,b,c, etc in the array?
What happens when we make an array defined using characters instead of integers in C?
Let's be sure we are clear: you are using integer values returned by getchar() as indexes into your array. This is not defining the array, it is just accessing its elements.
Does C assign a specific number for letters ie. a,b,c, etc in the array?
There are no letters in the array. There are ints. However, yes, the characters read by getchar() are encoded as integer values, so they are, in principle, suitable array indexes. Thus, this line ...
characters[a]=characters[a]+1;
... reads the int value then stored at index a in array characters, adds 1 to it, and then assigns the result back to element a of the array, provided that the value of a is a valid index into the array.
More generally, it is important to understand that although one of its major uses is to represent characters, type char is an integer type. Its values are numbers. The mapping from characters to numbers is implementation and context dependent, but it is common enough for the mapping to be consistent with the ASCII code that you will often see programs that assume such a mapping.
Indeed, your code makes exactly such an assumption (and others) by allowing only for character codes less than 126.
You should also be aware that if your characters array is declared inside a function then it is not initialized. The code depends on all elements being zero initially. I would recommend this declaration instead:
int characters[UCHAR_MAX + 1] = {0};
That upper bound (UCHAR_MAX is defined in <limits.h>) will be sufficient for all the non-EOF values returned by getchar(), and the explicit zero-initialization will ensure the needed initial values regardless of where the array is declared.
I have realized that the character set that can serve as input for getchar() is part of the ASCII table and is represented as an int. I used the following code to find that out:
#include <stdio.h>
int main(){
int a[128];
a['b']=4;
printf("%d",a[98]); //it is 98 as according to the table 'b' is assigned the value of 98
}
Executing this code, I get the output 4.
I am really new to coding so feel free to correct me.
Character values are represented using some kind of integer encoding - ASCII (very common), EBCDIC (mostly IBM mainframes), UTF-8 (backward-compatible to ASCII), etc.
The character value 'a' maps to some integer value - 97 in ASCII and UTF-8, 129 in EBCDIC. So yes, you can use a character value to index into an array - arr['a']++ would be equivalent to arr[97]++ if you were using ASCII or UTF-8.
The C language does not dictate this - it's determined by the underlying platform.

Logical XOR in character arrays

I've been trying to make a program on Vernam Cipher which requires me to XOR two strings. I tried to do this program in C and have been getting an error. The lengths of the two strings are the same.
#include<stdio.h>
#include<string.h>
int main()
{
printf("Enter your string to be encrypted ");
char a[50];
char b[50];
scanf("%s",a);
printf("Enter the key ");
scanf("%s",b);
char c[50];
int q=strlen(a);
int i=0;
for(i=0;i<q;i++)
{
c[i]=(char)(a[i]^b[i]);
}
printf("%s",c);
}
Whenever I run the code, I get output as ????? in boxes. What is the method to XOR these two strings ?
I've been trying to make a program on Vernam Cipher which requires me to XOR two strings
Yes, it does, but that's not the only thing it requires. The Vernam cipher involves first representing the message and key in the ITA2 encoding (also known as Baudot-Murray code), and then computing the XOR of each pair of corresponding character codes from the message and key streams.
Moreover, to display the result in the manner you indicate wanting to do, you must first convert it from ITA2 to the appropriate character encoding for your locale, which is probably a superset of ASCII.
The transcoding to and from ITA2 is relatively straightforward, but not so trivial that I'm inclined to write them for you. There is a code chart at the ITA2 link above.
Note also that ITA2 is a stateful encoding that includes shift codes and a null character. This implies that the enciphered message may contain non-printing characters, which could cause some confusion, including a null character, which will be misinterpreted as a string terminator if you are not careful. More importantly, encoding in ITA2 may increase the length of the message as a result of a need to insert shift codes.
Additionally, as a technical matter, if you want to treat the enciphered bytes as a C string, then you need to ensure that it is terminated with a null character. On a related note, scanf() will do that for the strings it reads, which uses one character, leaving you only 49 each for the actual message and key characters.
What is the method to XOR these two strings ?
The XOR itself is not your problem. Your code for that is fine. The problem is that you are XORing the wrong values, and (once the preceding is corrected) outputting the result in a manner that does not serve your purpose.
Whenever I run the code, I get output as ????? in boxes...
XORing two printable characters does not always result in a printable value.
Consider the following:
the ^ operator operates at the bit level.
there is a limited range of values that are printable. (from here):
Control Characters (0–31 & 127): Control characters are not printable characters. They are used to send commands to the PC or the printer and are based on telex technology. With these characters, you can set line breaks or tabs. Today, they are mostly out of use.
Special Characters (32–47 / 58–64 / 91–96 / 123–126): Special characters include all printable characters that are neither letters nor numbers. These include punctuation or technical, mathematical characters. ASCII also includes the space (a non-visible but printable character), which, therefore, does not belong to the control characters category, as one might suspect.
Numbers (48–57): These numbers include the ten Arabic numerals from 0-9.
Letters (65–90 / 97–122): Letters are divided into two blocks, with the first group containing the uppercase letters and the second group containing the lowercase.
Using the following two strings and the following code:
char str1[] = {"asdf"};
char str2[] = {"jkl;"};
Following demonstrates XORing the elements of the strings:
#include <stdio.h>
int main(void)
{
char str1[] = {"asdf"};
char str2[] = {"jkl;"};
for(int i=0;i<sizeof(str1)/sizeof(str1[i]);i++)
{
printf("%d ^ %d: %d\n", str1[i],str2[i], str1[i]^str2[i]);
}
getchar();
return 0;
}
While all of the input characters are printable (except the terminating null character), not all of the XOR results of corresponding characters are:
97 ^ 106: 11 //not printable
115 ^ 107: 24 //not printable
100 ^ 108: 8 //not printable
102 ^ 59: 93
0 ^ 0: 0
This is why you are seeing the odd output. While all of the values may be completely valid for your purposes, they are not all printable.

How do I index a (not all ascii) utf8 string in C?

I want to index the characters in a utf8 string which does not necessarily contain
only ascii characters. I want the same kind of behavior I get in javascript:
> str = "lλך" // i.e. Latin ell, Greek lambda, Hebrew lamedh
'lλך'
> str[0]
'l'
> str[1]
'λ'
> str[2]
'ך'
Following the advice of UTF-8 Everywhere, I am representing my mixed character-length string just as any other string in C - and not using wchars.
The problem is that, in C, one cannot access the 16th character of a string: only the 16th byte. Because λ is encoded with two bytes in utf-8, I have to access the 16th and 17th bytes of the string in order to print out one λ.
For reference, the output of:
#include <stdio.h>
int main () {
char word_with_greek[] = "this is lambda:_λ";
printf("%s\n",word_with_greek);
printf("The 0th character is: %c\n", word_with_greek[0]);
printf("The 15th character is: %c\n",word_with_greek[15]);
printf("The 16th character is: %c%c\n",word_with_greek[16],word_with_greek[17]);
return 0;
}
is:
this is lambda:_λ
The 0th character is: t
The 15th character is: _
The 16th character is: λ
Is there an easy way to break up the string into characters? It does not seem too difficult to write a function which breaks a string into wchars- but I imagine that someone has already written this yet I cannot find it.
It depends on what your Unicode characters can be. Most strings are restricted to the Basic Multilingual Plane (BMP). If yours are (not by accident but because of their very nature: at least no risk of emoji...) you can use char16_t to represent any character. BTW, on common platforms wchar_t is at least as large as char16_t, so in that case it is safe to use it.
If your text can contain emoji, or other characters not in the BMP, or simply if you are unsure, the only foolproof way is to convert everything to char32_t, because any Unicode character (at least in 2019...) has a code that fits in 32 bits.
Converting from UTF-8 to 32-bit (or 16-bit) Unicode is not that hard, and can be coded by hand; Wikipedia contains enough information for it. But you will find tons of libraries where this is already coded and tested, mainly the excellent libiconv, and the C11 version of the C standard library contains functions for UTF-8 conversions. Not as nice, but usable.

Differences between int/char arrays/strings

I'm still new to the forum so I apologize in advance for any forum-etiquette issues.
I'm having trouble understanding the differences between int arrays and char arrays.
I recently wrote a program for a Project Euler problem that originally used a char array to store a string of numbers, and later called specific characters and tried to use int operations on them to find a product. When I used a char string I got a ridiculously large product, clearly incorrect. Even if I converted what I thought would be compiled as a character (str[n]) to an integer in-line ((int)str[n]) it did the exact same thing. Only when I actually used an integer array did it work.
Code is as follows
for the char string
char str[21] = "73167176531330624919";
This did not work. I got an answer of about 1.5 trillion for an answer that should have been about 40k.
for the int array
int str[] = {7,3,1,6,7,1,7,6,5,3,1,3,3,0,6,2,4,9,1,9};
This is what did work. I took off the in-line type casting too.
Any explanation as to why these things worked/did not work and anything that can lead to a better understanding of these ideas will be appreciated. Links to helpful stuff are as well. I have researched strings and arrays and pointers plenty on my own (I'm self taught as I'm in high school) but the concepts are still confusing.
Side question, are strings in C automatically stored as arrays or is it just possible to do so?
To elaborate on WhozCraig's answer, the trouble you are having does not have to do with strings, but with the individual characters.
Strings in C are stored by and large as arrays of characters (with the caveat that there exists a null terminator at the end).
The characters themselves are encoded in a system called ASCII, which assigns codes between 0 and 127 to characters used in the English language (only). Thus '7' is not stored as 7 but as the ASCII encoding of 7, which is 55.
I think now you can see why your product got so large.
One elegant way to fix would be to convert
int num = (int) str[n];
to
int num = str[n] - '0';
//thanks for fixing, ' ' is used for characters, " " is used for strings
This solution subtracts the ASCII code for '0' from the ASCII code for your character, say '7'. Since the digits '0' through '9' are encoded contiguously, this will work (for single-digit numbers). For larger numbers, you should use atoi or strtol from stdlib.h.
Strings are just character arrays with a null terminating byte.
There is no separate string data type in c.
When using a char as an integer, the numeric ASCII value is used. For example, saying something like printf("%d\n", (int)'a'); will result in 97 (the ASCII value of 'a') being printed.
You cannot use the characters of a string of digits directly in numeric calculations; each one must first be converted to the integer it represents. To convert a digit character into its integer form, you can do something like this:
char a = '2';
int a_num = a - '0';
//a_num now stores integer 2
This causes the ascii value of '0' (48) to be subtracted from ascii value '2' (50), finally leaving 2.
char str[21] = "73167176531330624919";
This code is equivalent to
char str[21] = {'7','3','1','6','7','1','7','6','5',
'3','1','3','3','0','6','2','4','9','1','9','\0'};
So what is stored in str is not the numbers themselves but chars, whose ASCII codes differ from the digits they represent.
Side question answer: yes and no. Strings are automatically stored as char arrays, but a string has an extra character ('\0') as its last element, whereas a char array in general need not have one.

Trouble comparing UTF-8 characters using wchar.h

I am in the process of making a small program that reads a file, that contains UTF-8 elements, char by char. After reading a char it compares it with a few other characters and if there is a match it replaces the character in the file with an underscore '_'.
(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)
I'm not sure where exactly I'm messing up here but it's most likely everywhere.
Here is my code:
FILE *fpi;
FILE *fpo;
char ifilename[FILENAME_MAX];
char ofilename[FILENAME_MAX];
wint_t sample;
fpi = fopen(ifilename, "rb");
fpo = fopen(ofilename, "wb");
while (!feof(fpi)) {
fread(&sample, sizeof(wchar_t*), 1, fpi);
if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0) ) {
fwrite(L"_", sizeof(wchar_t*), 1, fpo);
} else {
fwrite(&sample, sizeof(wchar_t*), 1, fpo);
}
}
I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.
If I feed this program a file containing the words γειά σου κόσμε. I would want it to return this:
γει_ σου κόσμ_.
Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.
Anything pointing me the right way is most welcome.
I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.
Thank you in advance for your help.
C has two different kinds of characters: multibyte characters and wide characters.
Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.
Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code-points, so must be (at least) 4 bytes long (technically, it must be at least 21-bits long).
So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).
Unfortunately, there is more to Unicode than this.
ά can be represented as a single Unicode codepoint, or as two separate codepoints:
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 char's.
U+03B1 GREEK SMALL LETTER ALPHA U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 char's.
U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 char's.
All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).
First of all, please do take the time to read this great article, which explains UTF8 vs Unicode and lots of other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html
What you are trying to do in your code is read in Unicode character by character, and do comparisons with those. That won't work if the input stream is UTF8, and it's not really possible to do with quite this structure.
In short: Fully unicode strings can be encoded in several ways. One of them is using a series of equally-sized "wide" chars, one for each character. That is what the wchar_t type (sometimes WCHAR) is for. Another way is UTF8, which uses a variable number of raw bytes to encode each character, depending on the value of the character.
UTF8 is just a stream of bytes, which can encode a unicode string, and is commonly used in files. It is not the same as a string of WCHARs, which are the more common in-memory representation. You can't poke through a UTF8 stream reliably, and do character replacements within it directly. You'll need to read the whole thing in and decode it, and then loop through the WCHARs that result to do your comparisons and replacement, and then map that result back to UTF8 to write to the output file.
On Win32, use MultiByteToWideChar to do the decoding, and you can use the corresponding WideCharToMultiByte to go back.
When you use a "string literal" with regular quotes, you're creating a nul-terminated narrow string (an array of char), which is indexed byte by byte rather than character by character. The L"string literal" with the L prefix will create a nul-terminated string of WCHARs (wchar_t *), which you can use in string or character comparisons. The L prefix also works with single-quote character literals, like so: L'ε'
As a commenter noted, when you use fread/fwrite, you should be using sizeof(wchar_t) and not its pointer type, since the amount you are trying to read/write is an actual wchar, not the size of a pointer to one. This advice is just code feedback independent of the above-- you don't want to be reading the input character by character anyways.
Note too that when you do string comparisons (wcscmp), you should use actual wide strings (which are terminated with a nul wide char)-- not use single characters in memory as input. If (when) you want to do character-to-character comparisons, you don't even need to use the string functions. Since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {}.

Resources