Resolving equal XOR values for different strings for anagram detection

I recently had an interview question where I had to write a function that takes two strings, and it would return 1 if they are anagrams of each other or else return 0. To simplify things, both strings are of the same length, non-empty, and only contain lower-case alphabetical and numeric characters.
I implemented a function that accumulates the XOR of each string's characters independently and then compares the two final XOR values. If they are equal, it returns 1; otherwise it returns 0.
My function:
int isAnagram(char* str1, char* str2){
    int xor_acc_1 = 0;
    int xor_acc_2 = 0;
    for(int i = 0; i<strlen(str1); i++){
        xor_acc_1 ^= str1[i] - '0';
        xor_acc_2 ^= str2[i] - '0';
    }
    return xor_acc_1 == xor_acc_2;
}
My function worked for every case except for one test case.
char* str1 = "123";
char* str2 = "303";
To my surprise, even though these two strings are not anagrams of each other, they both returned 48 as their XOR value.
My question is: Can this still be resolved with XOR in linear time, without using a data structure such as a map, by modifying the mathematics behind the XOR?

A pure xor solution will not work as there is information lost during the process (this problem is likely to exist in other forms of lossy calculation as well, such as hashing). The information lost in this case is the actual characters being used for comparison.
By way of example, consider the two strings ae and bf (in ASCII):
a: 0110 0001 b: 0110 0010
e: 0110 0101 f: 0110 0110
---- ---- ---- ----
xor: 0000 0100 0000 0100
You can see that the result of the xor is identical for both strings despite the fact they are totally different.
This may become even more obvious once you realise that any value xor-ed with itself is zero, meaning that all strings like aa, bb, cc, xx, and so on, would be considered anagrams under your scheme.
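To make the collisions concrete, here is a minimal sketch that folds a string with XOR. The helper name xor_fold is made up for illustration, and the values quoted below assume ASCII.

```c
#include <stddef.h>

/* Hypothetical helper: XOR of all character codes in s. */
unsigned xor_fold(const char *s)
{
    unsigned acc = 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        acc ^= (unsigned char)s[i];
    return acc;
}
```

In ASCII, xor_fold("ae") and xor_fold("bf") both give 4, xor_fold("aa") and xor_fold("bb") both give 0, and the raw codes of "123" and "303" both fold to 48, so comparing folds cannot tell these pairs apart.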
So, now you've established that method as unsuitable, there are a couple of options that spring to mind.
The first is to simply sort both strings and compare them. Once sorted, they will be identical on a character-by-character basis. This will work but it's unlikely to deliver your requested O(n) time complexity since you'll almost certainly be using a comparison style sort.
The second still allows you to meet that requirement by using the usual "trick" of trading space for time. You simply set up a count of each character (all initially zero) then, for each character in the first string, increase its count.
After that, for each character in the second string, decrease its count.
That's linear time complexity and the strings can be deemed to be anagrams if every character count is set to zero after the process. Any non-zero count will only be there if a character occurred more times in one string than the other.
This is effectively a counting sort, a non-comparison sort meaning it's not subject to the normal minimum O(n log n) time complexity for those sorts.
The pseudo-code for such a beast would be:
def isAnagram(str1, str2):
    if len(str1) != len(str2):    # Can also handle different lengths.
        return false
    dim count[0..255] = {0}       # Init all counts to zero.
    for each code in str1:        # Increase for each char in string 1.
        count[code]++
    for each code in str2:        # Decrease for each char in string 2.
        count[code]--
    for each code in 0..255:
        if count[code] != 0:      # Any non-zero means non-anagram.
            return false
    return true                   # All zero means anagram.
Here, by the way, is a complete C test program which illustrates this concept, able to handle 8-bit character widths though more widths can be added with a simple change to the #if section:
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <stdbool.h>
#if CHAR_BIT == 8
    #define ARRSZ 256
#else
    #error Need to adjust for unexpected CHAR_BIT.
#endif
static bool isAnagram(const char *str1, const char *str2) {
    // Ensure strings are same size.
    size_t len = strlen(str1);
    if (len != strlen(str2))
        return false;
    // Initialise all counts to zero.
    int count[ARRSZ];
    for (size_t i = 0; i < sizeof(count) / sizeof(*count); ++i)
        count[i] = 0;
    // Increment for string 1, decrement for string 2. The casts keep
    // the index non-negative on platforms where plain char is signed.
    for (size_t i = 0; i < len; ++i) {
        count[(unsigned char)str1[i]]++;
        count[(unsigned char)str2[i]]--;
    }
    // Any count non-zero means non-anagram.
    for (size_t i = 0; i < sizeof(count) / sizeof(*count); ++i)
        if (count[i] != 0)
            return false;
    // All counts zero means anagram.
    return true;
}
int main(int argc, char *argv[]) {
    if ((argc - 1) % 2 != 0) {
        puts("Usage: check_anagrams [<string1> <string2>] ...");
        return 1;
    }
    for (int i = 1; i < argc; i += 2) {
        printf("%s: '%s' '%s'\n",
            isAnagram(argv[i], argv[i + 1]) ? "Yes" : " No",
            argv[i], argv[i + 1]);
    }
    return 0;
}
Running this on some suitable test data shows it in action:
pax$ ./check_anagrams ' paxdiablo ' 'a plaid box' paxdiablo PaxDiablo \
one two aa bb aa aa '' '' paxdiablo pax.diablo
Yes: ' paxdiablo ' 'a plaid box'
No: 'paxdiablo' 'PaxDiablo'
No: 'one' 'two'
No: 'aa' 'bb'
Yes: 'aa' 'aa'
Yes: '' ''
No: 'paxdiablo' 'pax.diablo'

Why do you need XOR in the first place?
The simplest approach, and one that is fast enough, is to sort both strings by character and compare the results for equality. If you need a faster sort, you can use counting sort to achieve linear time.
Another way is to count the occurrences of each character in each string and check whether the counts are equal.
EDIT
Your XOR-based solution is not correct. More than one combination of characters can XOR to the same number, so the XOR of the character codes of two different strings is not always different. For anagrams the output will always be correct, but for non-anagrams it MAY be wrong (a false positive).
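The sort-and-compare approach mentioned above could be sketched roughly as follows. It uses the standard qsort, a comparison sort, so it runs in O(n log n) rather than the linear time a counting sort would give. The names cmp_char and is_anagram_sorted are invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Orders two bytes by their unsigned value. */
static int cmp_char(const void *a, const void *b)
{
    unsigned char x = *(const unsigned char *)a;
    unsigned char y = *(const unsigned char *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: returns 1 for anagrams, 0 otherwise,
   and -1 if memory for the working copies cannot be allocated. */
int is_anagram_sorted(const char *s1, const char *s2)
{
    size_t len = strlen(s1);
    if (len != strlen(s2))
        return 0;
    char *a = malloc(len + 1);
    char *b = malloc(len + 1);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1;
    }
    /* Sort copies so the caller's strings are left untouched. */
    memcpy(a, s1, len + 1);
    memcpy(b, s2, len + 1);
    qsort(a, len, 1, cmp_char);
    qsort(b, len, 1, cmp_char);
    int same = memcmp(a, b, len) == 0;
    free(a);
    free(b);
    return same;
}
```

With this, is_anagram_sorted("123", "303") returns 0, which is the case the XOR version gets wrong.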

Related

Run length encoding on binary files in C

I wrote this function that performs a slightly modified variation of run-length encoding on text files in C.
I'm trying to generalize it to binary files but I have no experience working with them. I understand that, while I can compare bytes of binary data much the same way I can compare chars from a text file, I am not sure how to go about printing the number of occurrences of a byte to the compressed version like I do in the code below.
A note on the type of RLE I'm using: bytes that occur more than once in a row are duplicated to signal the next-to-come number is in fact the number of occurrences vs just a number following the character in the file. For occurrences longer than one digit, they are broken down into runs that are 9 occurrences long.
For example, aaaaaaaaaaabccccc becomes aa9aa2bcc5.
Here's my code:
char* encode(char* str)
{
    char* ret = calloc(2 * strlen(str) + 1, 1);
    size_t retIdx = 0, inIdx = 0;
    while (str[inIdx]) {
        size_t count = 1;
        size_t contIdx = inIdx;
        while (str[inIdx] == str[++contIdx]) {
            count++;
        }
        size_t tmpCount = count;
        // break down counts with 2 or more digits into counts ≤ 9
        while (tmpCount > 9) {
            tmpCount -= 9;
            ret[retIdx++] = str[inIdx];
            ret[retIdx++] = str[inIdx];
            ret[retIdx++] = '9';
        }
        char tmp[2];
        ret[retIdx++] = str[inIdx];
        if (tmpCount > 1) {
            // repeat character (this tells the decompressor that the next digit
            // is in fact the # of consecutive occurrences of this char)
            ret[retIdx++] = str[inIdx];
            // convert single-digit count to string
            snprintf(tmp, 2, "%ld", tmpCount);
            ret[retIdx++] = tmp[0];
        }
        inIdx += count;
    }
    return ret;
}
What changes are in order to adapt this to a binary stream? The first problem I see is with the snprintf call since it's operating using a text format. Something that rings a bell is also the way I'm handling the multiple-digit occurrence runs. We're not working in base 10 anymore so that has to change, I'm just unsure how having almost never worked with binary data.

A few ideas that can be useful to you:
One simple method to generalize RLE to binary data is to use a bit-based compression. For example, the bit sequence 00000000011111100111 can be translated to the sequence 0 9623. Since the binary alphabet is composed of only two symbols, you only need to store the first bit value (this can be as simple as storing it in the very first bit) and then the lengths of the runs of equal values. Arbitrarily large integers can be stored in a binary format using Elias gamma coding. Extra padding can be added to fit the entire sequence into a whole number of bytes. Using this method, the above sequence can be encoded like this:
00000000011111100111 -> 0 0001001 00110 010 011
                        ^ ^       ^     ^   ^
                first bit 9       6     2   3
If you want to keep it byte-based, one idea is to treat the bytes at even positions as frequencies (interpreted as unsigned char) and the bytes at odd positions as the values. If a byte occurs more than 255 times in a row, you can simply repeat the pair. This can be very inefficient, but it is definitely simple to implement, and it might be good enough if you can make some assumptions about the input.
Also, you can consider moving away from RLE and implementing Huffman coding or other more sophisticated algorithms (e.g. LZW).
Implementation-wise, I think tucuxi already gave you some hints.
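As a rough illustration of the Elias gamma step used in the bit-based example above, the sketch below writes the code of a run length as a string of '0' and '1' characters. A real encoder would pack the bits into bytes. The function name elias_gamma is an assumption of this example, and it requires n >= 1.

```c
#include <stddef.h>

/* Hypothetical helper: writes the Elias gamma code of n (n >= 1) into out
   as '0'/'1' characters and returns its length. For a 32-bit unsigned,
   out needs room for 64 bytes including the terminator. */
size_t elias_gamma(unsigned n, char *out)
{
    int bits = 0;                      /* floor(log2(n)) */
    for (unsigned t = n; t > 1; t >>= 1)
        bits++;
    size_t k = 0;
    for (int i = 0; i < bits; i++)     /* unary prefix of zeros */
        out[k++] = '0';
    for (int i = bits; i >= 0; i--)    /* n in binary, most significant bit first */
        out[k++] = ((n >> i) & 1u) ? '1' : '0';
    out[k] = '\0';
    return k;
}
```

For the run lengths 9, 6, 2 and 3 this produces 0001001, 00110, 010 and 011, matching the encoded sequence shown above.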

You only have to address 2 problems:
you cannot use any str-related functions, because C strings do not deal well with '\0'. So for example, strlen will return the index of the 1st 0x0 byte in a string. The length of the input must be passed in as an additional parameter: char *encode(char *start, size_t length)
your output cannot have an implicit length of strlen(ret), because there may be extra 0-bytes sprinkled about in the output. You again need an extra parameter: size_t encode(char *start, size_t length, char *output) (this version would require the output buffer to be reserved externally, with a size of at least length*2, and return the length of the encoded string)
The rest of the code, assuming it was working before, should continue to work correctly now. If you want to go beyond base-10, and for instance use base-256 for greater compression, you would only need to change the constant in the break-things-up loop (from 9 to 255), and replace the snprintf as follows:
// before
snprintf(tmp, 2, "%ld", tmpCount);
ret[retIdx++] = tmp[0];
// after: much easier
ret[retIdx++] = tmpCount;
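Putting those two changes together, a byte-oriented version might look something like the sketch below. It is not the asker's exact function: the name rle_encode and the unsigned char types are choices made for this example, and it caps runs at 255 while scanning instead of splitting them afterwards. The output buffer of at least length*2 bytes from the signature above is more than enough, since the worst case (runs of exactly two bytes) grows the data by a factor of 1.5.

```c
#include <stddef.h>

/* Hypothetical byte-oriented variant of the doubled-byte scheme:
   a run of 2..255 equal bytes becomes <byte><byte><count>, and a
   single byte is copied as is. Returns the number of bytes written. */
size_t rle_encode(const unsigned char *in, size_t length, unsigned char *out)
{
    size_t inIdx = 0, outIdx = 0;
    while (inIdx < length) {
        /* Measure the run starting at inIdx, capped at 255. */
        size_t count = 1;
        while (inIdx + count < length && count < 255 &&
               in[inIdx + count] == in[inIdx])
            count++;
        out[outIdx++] = in[inIdx];
        if (count > 1) {
            out[outIdx++] = in[inIdx];            /* doubled byte: a count follows */
            out[outIdx++] = (unsigned char)count; /* raw count, not a digit */
        }
        inIdx += count;
    }
    return outIdx;
}
```

Because the length is passed in and returned explicitly, zero bytes in the input or output are handled like any other value.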

intermix command line strings in C

This is homework, so I am not looking for a direct answer I am more-so looking for the logic behind this. I do not believe the question is stated very well for novice C devs, and I cannot find any resources to help me out here. I am new to C much more a Java guy so this may seem totally and utterly noobish. The instructions are below
$ ./mixedupecho HELLO!
.H/EmLiLxOe!dHuEpLeLcOh!oH
*
For this program, you can ignore any command-line arguments beyond the first two (including the program name itself):
$ ./mixedupecho HELLO! morestuff lalala
.H/EmLiLxOe!dHuEpLeLcOh!oH
*
Notice how "HELLO!" is shorter than "./mixedupecho", and so the program "wraps around"
and starts over again at 'H' whenever it reaches the end of the string.
*
How can you implement that? The modulo % operator is your friend here.
Specifically, note that "HELLO!"[5] yields '!', and "HELLO!"[6] is beyond the bounds of the array.
But "HELLO!"[6 % 6] evaluates to "HELLO!"[0], which yields 'H'.
And "HELLO!"[7 % 6] evaluates to "HELLO!"[1] ...
Below is the code I have so far. It iterates through every character of each argv string I get. What I don't get is how to print it so that instead of the sequence [0][0], [0][1], [0][2]... I get [0][0], [1][0], [0][1]... etc.
Can someone take a crack at explaining this to me?
int main(int argc, string argv[])
{
    for(int i = 0; i < argc; i++)
    {
        for(int j = 0, n = strlen(argv[i]); j < n; j++)
        {
            printf("%c", argv[i][j]);
        }
    }
    printf("\n");
}
THANKS SO MUCH! THIS IS DRIVING ME INSANE!

You want to keep the index i incrementing until it is equal to the index of the null terminator in the longest string. Meanwhile, you'll use the % operator to ensure that i stays within the boundaries of the shorter string.
Here's how I'd do it:
Set the initial (unsigned) lengths to -1U to avoid calculating lengths unnecessarily. I'll use LIMIT for the rest of this example as if I did #define LIMIT -1U.
Iterate through the strings, checking whether argv[N][i % len[N]] is a null terminator. If it is a null terminator and len[N] == LIMIT, set len[N] = i.
When the expression len[0] != LIMIT && len[1] != LIMIT is true, the loop ends since both strings will have the correct length, meaning all characters in each string have been enumerated.
The only thing left is printing the character for each string, which I'm sure you can handle. I would have used 0 as the initial length, except that complicates things since you can't do x % 0. The reason for unsigned length is that -1U results in an unsigned int value (e.g. 4294967295 or 65535); plain -1 results in x % -1, which makes no sense because dividing by -1 yields no remainder.
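For comparison, here is a simpler sketch of the same wrap-around idea. It is not the lazy-length method described above: it just calls strlen up front and writes the result into a caller-supplied buffer. The function name intermix is hypothetical, and both strings are assumed to be non-empty so that the % operator never divides by zero.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: interleaves a and b, wrapping the shorter string.
   out must hold at least 2 * max(strlen(a), strlen(b)) + 1 bytes.
   Both a and b must be non-empty. */
void intermix(const char *a, const char *b, char *out)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t n = la > lb ? la : lb;      /* run to the end of the longer string */
    for (size_t i = 0; i < n; i++) {
        *out++ = a[i % la];            /* the index wraps when a is shorter */
        *out++ = b[i % lb];            /* the index wraps when b is shorter */
    }
    *out = '\0';
}
```

Called with "./mixedupecho" and "HELLO!", it fills the buffer with .H/EmLiLxOe!dHuEpLeLcOh!oH, the output shown in the assignment.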

Boyer-Moore Algorithm

I'm trying to implement the Boyer-Moore algorithm in C to search for a particular word in a .pcap file. I took the code from http://ideone.com/FhJok5 and am using it as is.
I just pass the packet as a string, along with the keyword I'm searching for, to its search() function. When I run my code it gives different values every time. Sometimes it gives the correct value, but most of the time it fails to identify some matches.
I have compared against the results of a naive algorithm implementation, which are always correct.
I am using Ubuntu 12.0.4 over VMware 10.0.1. Language: C.
My question is: shouldn't it give the same result every time, whether right or wrong? The output keeps changing each time I run the program on the same input, and on some runs it gives the correct answer. Mostly the value varies among 3 or 4 values.
What I have done so far for debugging:
Passed plain strings instead of packets: it works perfectly and gives the same correct value every time.
Checked the pcap part: I can see that all packets are being passed to the function (I checked by printing the packet frame number).
Sent the same packets to the naive algorithm code: it gives the correct result.
Please give me some idea of what the issue could be. I suspect something is wrong with memory management, but how do I find out what?
Thanks in advance.
#include <limits.h>
#include <string.h>
#include <stdio.h>
#define NO_OF_CHARS 256
// A utility function to get maximum of two integers
int max (int a, int b) { return (a > b)? a: b; }
// The preprocessing function for Boyer Moore's bad character heuristic
void badCharHeuristic( char *str, int size, int badchar[NO_OF_CHARS])
{
    int i;
    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;
    // Fill the actual value of last occurrence of a character
    for (i = 0; i < size; i++)
        badchar[(int) str[i]] = i;
}
/* A pattern searching function that uses Bad Character Heuristic of
   Boyer Moore Algorithm */
void search( char *txt, char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);
    int badchar[NO_OF_CHARS];
    /* Fill the bad character array by calling the preprocessing
       function badCharHeuristic() for given pattern */
    badCharHeuristic(pat, m, badchar);
    int s = 0; // s is shift of the pattern with respect to text
    while(s <= (n - m))
    {
        int j = m-1;
        /* Keep reducing index j of pattern while characters of
           pattern and text are matching at this shift s */
        while(j >= 0 && pat[j] == txt[s+j])
            j--;
        /* If the pattern is present at current shift, then index j
           will become -1 after the above loop */
        if (j < 0)
        {
            printf("\n pattern occurs at shift = %d", s);
            /* Shift the pattern so that the next character in text
               aligns with the last occurrence of it in pattern.
               The condition s+m < n is necessary for the case when
               pattern occurs at the end of text */
            s += (s+m < n)? m-badchar[txt[s+m]] : 1;
        }
        else
            /* Shift the pattern so that the bad character in text
               aligns with the last occurrence of it in pattern. The
               max function is used to make sure that we get a positive
               shift. We may get a negative shift if the last occurrence
               of bad character in pattern is on the right side of the
               current character. */
            s += max(1, j - badchar[txt[s+j]]);
    }
}
/* Driver program to test above function */
int main()
{
    char txt[] = "ABAAAABAACD";
    char pat[] = "AA";
    search(txt, pat);
    return 0;
}

find the longest non decreasing sub sequence

Given a string consisting only of 0s and 1s, say 10101, how do I find the length of the longest non-decreasing subsequence?
For example, for the string
10101
the longest non-decreasing subsequences are
111
001
so the output should be 3.
For the string
101001
the longest non-decreasing subsequence is
0001
so the output should be 4.
How can this be done when limits are given, i.e. for the part of the sequence between the limits?
For example, for
101001
with limits [3,6]
the longest non-decreasing subsequence is
001
so the output should be 3.
Can this be achieved in O(strlen)?

Can this be achieved in O(strlen)?
Yes. Observe that the non-decreasing subsequences would have one of these three forms:
0........0 // Only zeros
1........1 // Only ones
0...01...1 // Some zeros followed by some ones
The first two forms can be easily checked in O(n) by counting all zeros and by counting all ones.
The last one is a bit harder: you need to go through the string keeping the counter of zeros that you've seen so far, along with the length of the longest string of 0...01...1 form that you have discovered so far. At each step where you see 1 in the string, the length of the longest subsequence of the third form is the larger of the number of zeros plus one or the longest 0...01...1 sequence that you've seen so far plus one.
Here is the implementation of the above approach in C:
char *str = "10101001";
int longest0=0, longest1=0;
for (char *p = str ; *p ; p++) {
    if (*p == '0') {
        longest0++;
    } else { // *p must be 1
        longest1 = max(longest0, longest1)+1;
    }
}
printf("%d\n", max(longest0, longest1));
max is defined as follows:
#define max( a, b ) ( ((a) > (b)) ? (a) : (b) )
Here is a link to a demo on ideone.

Use dynamic programming. Run through the string from left to right, and keep track of two variables:
zero: length of longest subsequence ending in 0
one: length of longest subsequence ending in 1
If we see a 0, we can append it to any prefix that ends in 0, so we increase zero. If we see a 1, we can append it to a prefix that ends in either 0 or 1, so we set one to the longer of the two plus one. In C99:
int max(int a, int b) {
    return a > b ? a : b;
}
int longest(char *string) {
    int zero = 0;
    int one = 0;
    for (; *string; ++string) {
        switch (*string) {
        case '0':
            ++zero;
            break;
        case '1':
            one = max(zero, one) + 1;
            break;
        }
    }
    return max(zero, one);
}
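For the limits variant asked about in the question, the same two counters can simply be run over the requested range. The sketch below assumes the limits are 1-based and inclusive, which is what the [3,6] example suggests. The function name longest_in_range is made up here.

```c
#include <stddef.h>

/* Hypothetical helper: first and last are 1-based, inclusive positions,
   with 1 <= first <= last <= strlen(s). */
int longest_in_range(const char *s, size_t first, size_t last)
{
    int zero = 0;   /* longest subsequence ending in 0 */
    int one = 0;    /* longest subsequence ending in 1 */
    for (size_t i = first - 1; i < last; i++) {
        if (s[i] == '0')
            zero++;
        else if (s[i] == '1')
            one = (zero > one ? zero : one) + 1;
    }
    return zero > one ? zero : one;
}
```

For "101001" with limits [3,6] this returns 3, and with limits [1,6] it returns 4, in time proportional to the length of the range.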

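/* Assumes length > 0, i starts at the first index, prev starts at
   array[i], and count and max start at 0. */
do {
    if (array[i] < prev) {
        if (count > max)
            max = count;
        count = 0;
    }
    count++;
    prev = array[i];
} while (++i < length);
if (count > max)
    max = count;
Single pass. It works on any numbers, not just 1s and 0s, but note that it measures the longest contiguous non-decreasing run rather than a subsequence.
For limits, set i to the starting index and use the ending index instead of the array length.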

Algorithm for processing the string

I really don't know how to implement this function:
The function should take a pointer to an integer, a pointer to an array of strings, and a string for processing. The function should write to array all variations of exchange 'ch' combination to '#' symbol and change the integer to the size of this array. Here is an example of processing:
choker => {"choker","#oker"}
chocho => {"chocho","#ocho","cho#o","#o#o"}
chachacha => {"chachacha","#achacha","cha#acha","chacha#a","#a#acha","cha#a#a","#acha#a","#a#a#a"}
I am writing this in C standard 99. So this is sketch:
int n;
char **arr;
char *string = "chacha";
func(&n,&arr,string);
And function sketch:
int func(int *n,char ***arr, char *string) {
}
So I think I need to create another function, which counts the number of 'ch' combinations and allocates memory for this one. I'll be glad to hear any ideas about this algorithm.

You can count the number of combinations pretty easily:
char * tmp = string;
int i;
for(i = 0; *tmp != '\0'; i++){
    if(!(tmp = strstr(tmp, "ch")))
        break;
    tmp += 2; // Skip past the 2 characters "ch"
}
// i contains the number of times ch appears in the string.
int num_combinations = 1 << i;
// num_combinations contains the number of combinations,
// which is 2 to the power of the number of occurrences of "ch".

First, I'd create a helper function, e.g. countChs that would just iterate over the string and return the number of 'ch'-s. That should be easy, as no string overlapping is involved.
When you have the number of occurrences, you need to allocate space for 2^count strings, each needing at most strlen(original) + 1 bytes including the terminator (every string apart from the original is shorter). You also set your n variable to that 2^count.
After you have your space allocated, just iterate over all indices in your new table and fill them with copies of the original string (strcpy() or strncpy() to copy), then replace 'ch' with '#' in them (there are loads of ready snippets online, just look for "C string replace").
Finally make your arr pointer point to the new table. Be careful though - if it pointed to some other data before, you should think about freeing it or you'll end up having memory leaks.

If you would like to have all variations of the replaced string, the array will have 2^n elements, where n is the number of "ch" substrings. Calculating n looks like this:
int i = 0;
int n = 0;
while(string[i] != '\0')
{
    if(string[i] == 'c' && string[i + 1] == 'h')
        n++;
    i++;
}
Then we can use the binary representation of a number. Note that as an integer i is incremented from 0 to 2^n - 1, its binary representation tells us which "ch" occurrences to change. So:
for(unsigned long long i = 0; i < (1ULL << n); i++)
{
    unsigned long long number = i;
    int k = 0;
    while(number > 0)
    {
        if(number % 2 == 1)
        {
            // Replace k-th occurrence of "ch"
        }
        number /= 2;
        k++;
    }
    // Add replaced string to array
}
This code checks every bit in the binary representation of the number and changes the k-th occurrence if the k-th bit is 1. Changing the k-th "ch" is pretty easy, and I leave it to you.
This code only works for fewer than 64 occurrences, because a 64-bit unsigned long long cannot hold the value 2^64.
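One possible way to fill in the replacement step is sketched below. It builds each variation directly from the bits of a mask instead of editing a copy in place. The function name ch_variations, its return convention and the cap of 20 occurrences are all assumptions of this example, not part of the original task.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd array of 2^k malloc'd strings,
   where k is the number of "ch" occurrences in s, and stores the array
   size in *n. Returns NULL on allocation failure or if k is too large. */
char **ch_variations(const char *s, int *n)
{
    size_t len = strlen(s);
    int k = 0;
    for (const char *p = strstr(s, "ch"); p != NULL; p = strstr(p + 2, "ch"))
        k++;
    if (k >= 20)                       /* arbitrary cap chosen for this sketch */
        return NULL;
    int total = 1 << k;
    char **arr = malloc((size_t)total * sizeof *arr);
    if (arr == NULL)
        return NULL;
    for (int mask = 0; mask < total; mask++) {
        char *out = malloc(len + 1);   /* no variation is longer than s */
        if (out == NULL) {
            while (mask > 0)
                free(arr[--mask]);
            free(arr);
            return NULL;
        }
        size_t o = 0;
        int occ = 0;                   /* index of the current "ch" occurrence */
        for (size_t i = 0; i < len; i++) {
            if (s[i] == 'c' && s[i + 1] == 'h') {
                if ((mask >> occ) & 1) {
                    out[o++] = '#';    /* bit set: replace this occurrence */
                } else {
                    out[o++] = 'c';
                    out[o++] = 'h';
                }
                occ++;
                i++;                   /* skip the 'h' */
            } else {
                out[o++] = s[i];
            }
        }
        out[o] = '\0';
        arr[mask] = out;
    }
    *n = total;
    return arr;
}
```

For "chocho" the array holds "chocho", "#ocho", "cho#o" and "#o#o". The caller is expected to free each string and then the array.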

There are two sub-problems that you need to solve for your original problem:
allocating space for the array of variations
calculating the variations
For the first problem, you need to find the mathematical function f that takes the number of "ch" occurrences in the input string and returns the number of total variations.
Based on your examples: f(1) = 2, f(2) = 4 and f(3) = 8. This should give you a good idea of where to start, but it is important to prove that your function is correct. Induction is a good way to make that proof.
Since your replace process ensures that the results have either the same or a lower length than the original, you can allocate space for each individual result equal to the length of the original.
As for the second problem, the simplest way is to use recursion, like in the example provided by nightlytrails.
You'll need another function which take the array you allocated for the results, a count of results, the current state of the string and an index in the current string.
When called, if there are no further occurrences of "ch" beyond the index then you save the result in the array at position count and increment count (so the next time you don't overwrite the previous result).
If there are any "ch" beyond index then call this function twice (the recurrence part). One of the calls uses a copy of the current string and only increments the index to just beyond the "ch". The other call uses a copy of the current string with the "ch" replaced by "#" and increments the index to beyond the "#".
Make sure there are no memory leaks. No malloc without a matching free.
After you make this solution work you might notice that it plays loose with memory. It is using more than it should. Improving the algorithm is an exercise for the reader.
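A rough sketch of the recursive idea follows. It is a variation on the description above rather than a literal transcription: instead of copying the string for each branch, both branches share one scratch buffer and a copy is made only when a result is complete. The function name expand and its parameters are invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper. rest is the unread part of the input, buf[0..len)
   is the output built so far (buf needs strlen(input) + 1 bytes), out is
   the result array with room for 2^k entries, and *count is the number of
   results stored so far. Returns 0 on success, -1 on allocation failure. */
int expand(const char *rest, char *buf, size_t len, char **out, int *count)
{
    const char *hit = strstr(rest, "ch");
    if (hit == NULL) {
        /* No further "ch": append the tail and store a finished copy. */
        size_t tail = strlen(rest);
        char *s = malloc(len + tail + 1);
        if (s == NULL)
            return -1;
        memcpy(s, buf, len);
        memcpy(s + len, rest, tail + 1);
        out[(*count)++] = s;
        return 0;
    }
    /* Copy the text that precedes this "ch". */
    size_t skip = (size_t)(hit - rest);
    memcpy(buf + len, rest, skip);
    len += skip;
    /* Branch 1: keep "ch" and continue just beyond it. */
    buf[len] = 'c';
    buf[len + 1] = 'h';
    if (expand(hit + 2, buf, len + 2, out, count) != 0)
        return -1;
    /* Branch 2: replace "ch" with '#' and continue just beyond it. */
    buf[len] = '#';
    return expand(hit + 2, buf, len + 1, out, count);
}
```

Starting it with the whole input, len set to 0 and *count set to 0 fills out with every variation. For "chocho" the order is "chocho", "cho#o", "#ocho", "#o#o". Each stored string still has to be freed by the caller.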
