Hash function is not giving desired results - c

I am implementing a hash function in order to check for anagrams, but I am not getting the desired output. Could you suggest what went wrong?
Output:
key[148]:val[joy]
key[174]:val[jam]
key[294]:val[paula]
key[13]:val[ulrich]
key[174]:val[cat]
key[174]:val[act]
key[148]:val[yoj]
key[265]:val[vij]
key[265]:val[jiv]
Here key value 174 is fine for strings act and cat (anagrams) but same can't be expected with jam.
Below is the code snippet.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
unsigned long hash(char *str, size_t size) {
unsigned long hash_val = 5381;
unsigned long sum = 0;
char *val;
int i, j;
for (j = 0; j < 9; j++) {
val = malloc(strlen(str) + 1);
memset(val, '\0', strlen(str) + 1);
strcpy(val, str);
for (i = 0; val[i] != '\0'; i++) {
sum = sum + val[i];
}
return size % sum;
}
}
int main() {
int i;
char *str[9] = { "joy", "jam", "paula", "ulrich","cat", "act","yoj", "vij", "jiv" };
unsigned long key;
size_t size = 4542; // it may be anything just for test it is being used
for (i = 0; i < 9; i++) {
key = hash(str[i], size);
printf("\nkey[%ld]:val[%s]", key, str[i]);
}
return 1;
}

Yes, it can, because your hash function is very poorly written - it returns your constant 'size' variable modulo sum of all the string characters.
The problem is that the sum of ASCII codes 'c' + 'a' + 't' is equal to the 'j' + 'a' + 'm' (equal to 312) so you are getting the same value for your 'hash'.
You could use a 'normal' (e.g. polynomial) hash function for your anagram table, but with sorted strings - that would be the easiest approach.
For another method, you can calculate a number of appearances of each letter in the string (a histogram) and hash (or just store as is) them instead.
I recommend you to do some research on this topic as it's a very common task.
Also, if C++ is an option, you could just sort the strings and let unordered_set<string> do the job for you.
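The sorted-string idea can be sketched in C roughly as follows. This is only an illustration under assumptions: the name anagram_key is made up here, the polynomial hash is the common djb2-style variant (multiply by 33), and the caller is expected to reduce the result modulo the table size.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for single characters */
static int cmp_char(const void *a, const void *b)
{
    return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
}

/* Hypothetical helper: hashes the sorted characters of str, so that all
   anagrams of a word map to the same key. Returns 0 if malloc fails. */
unsigned long anagram_key(const char *str)
{
    size_t len = strlen(str);
    char *sorted = malloc(len + 1);
    unsigned long h = 5381;

    if (sorted == NULL)
        return 0;
    memcpy(sorted, str, len + 1);
    qsort(sorted, len, sizeof sorted[0], cmp_char);

    /* djb2-style polynomial hash over the sorted copy */
    for (size_t i = 0; i < len; i++)
        h = h * 33 + (unsigned char)sorted[i];

    free(sorted);
    return h;
}
```

With this key, cat and act share a value while jam gets a different one. Distinct non-anagrams can still collide in general, so a real table should also compare the sorted strings on lookup.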

but same can't be expected with jam.
Well, there you go wrong. Let's see your algo. What you're doing is basically summing up the ASCII value of the elements of the strings, and returning the modulus result of a fixed value taken with respect to the sum.
To elaborate, as per the ASCII table,
j == 106
a == 97
m == 109
and
c == 99
a == 97
t == 116
Both the words end up having a sum result of 312.
Now as per your algo,
4542 % 312
is supposed to give a constant value, right? That is what it is giving.
Now, don't be "sad", as
s == 115
a == 97
d == 100
that also comes up with 312.
That said, I see you have a local variable unsigned long hash_val = 5381; defined inside your function, but used nowhere.
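The arithmetic above can be checked with a few lines of C. This is a minimal sketch that assumes an ASCII execution character set; char_sum is a name introduced here only for illustration.

```c
/* Sum of the character codes in a string, which is what the
   question's hash computes before taking the modulus. */
unsigned long char_sum(const char *s)
{
    unsigned long sum = 0;
    while (*s != '\0')
        sum += (unsigned char)*s++;
    return sum;
}
```

On an ASCII system, char_sum("jam"), char_sum("cat") and char_sum("sad") all return 312, and 4542 % 312 is 174, which is the key printed for both jam and cat in the question.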

Your hash function has many problems:
The for (j = 0; j < 9; j++) loop is completely useless.
It is utterly inadequate to allocate memory for a copy of the string, and to forget to free it! Just use the string directly.
Your summing method has too many easy collisions, as you diagnosed: anagrams produce the same sum, but so do many other simple words. You should shuffle the sum before each character value is added.
return size % sum; should really be return sum % size; so the return value can be used as an index into the hash table of size size. As a matter of fact, size % sum would invoke undefined behavior if sum happened to compute to 0, which would require a very long string (>16MB) but is possible.
Here is an improved hash function:
#include <limits.h>
// constraints: str != NULL, size > 0
size_t hash(const char *str, size_t size) {
size_t sum = 5381; // initial salt
while (*str != '\0') {
// rotate the current sum 2 places to the left
sum = (sum << 2) | (sum >> (CHAR_BIT * sizeof(sum) - 2));
// add the next character value
sum += (unsigned char)*str++;
}
return sum % size;
}

Related

Finding substring, but not for all inputs?

I wrote a code to find the index of the largest substring in a larger string.
A substring is found when there is an equal amount of a's and b's.
For example, giving 12 and bbbbabaababb should give 2 9, since the first appearing substring starts at index 2 and ends at index 9. 3 10 is also an answer, but since this is not the first appearing substring, this will not be the answer.
The code I made is:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
void substr(char str[], int n) {
int sum = 0;
int max = -1, start;
for (int i = 0; i < n; i++) {
if (str[i]=='a') {
str[i] = 0;
} else if(str[i]=='b') {
str[i] = 1;
}
}
// starting point i
for (int i = 0; i < n - 1; i++) {
sum = (str[i] == 0) ? -1 : 1;
// all subarrays from i
for (int j = i + 1; j < n; j++) {
(str[j] == 0) ? (sum += -1) : (sum += 1);
// sum == 0
if (sum == 0 && max < j - i + 1 && n%2==0) {
max = j - i + 1;
start = i-1;
} else if (sum == 0 && max < j - i + 1 && n%2!=0) {
max = j - i + 1;
start = i;
}
}
}
// no subarray
if (max == -1) {
printf("No such subarray\n");
} else {
printf("%d %d\n", start, (start + max - 1));
}
}
/* driver code */
int main(int argc, char* v[]) {
int n; // stores the length of the input
int i = 0; // used as counter
scanf("%d", &n);
n += 1; // deals with the /0 at the end of a str
char str[n]; // stores the total
/* adding new numbers */
while(i < n) {
char new;
scanf("%c", &new);
str[i] = new;
++i;
}
substr(str, n);
return 0;
}
It works for a lot of values, but not for the second example (given below). It should output 2 9 but gives 3 10. This is a valid substring, but not the first one...
Example inputs and outputs should be:
Input (n, then the string)    Output
5  baababb                    0 5
12 bbbbabaababb               2 9
5  bbbbb                      No such subarray
You have several problems, many of them to do with arrays sizes and indices.
When you read in the array, you want n characters. You then increase n in order to accommodate the null terminator. It is a good idea to null-terminate the string, but the '\0' at the end is really not part of the string data. Instead, adjust the array size when you create the array and place the null terminator explicitly:
char str[n + 1];
// scan n characters
str[n] = '\0';
In C (and other languages), ranges are defined by an inclusive lower bound, but by an exclusive upper bound: [lo, hi). The upper bound hi is not part of the range and there are hi - lo elements in the range. (Arrays with n elements are a special case, where the valid range is [0, n).) You should embrace rather than fight this convention. If your output should be different, amend the output, not the representation in your program.
(And note how your first example, where you are supposed to have a string of five characters, actually reads and considers the b in the 6th position. That's a clear error.)
The position of the maximum valid substring does not depend on whether the overall string length is odd or even!
The first pass, where you convert all "a"s and "b"s to 0's and 1's is unnecessary and it destroys the original string. That's not a big problem here, but keep that in mind.
The actual problem is how you try to find the substrings. Your idea to subtract 1 for an "a" and add 1 for a "b" is good, but you don't keep your sums correctly. For each possible starting point i, you scan the rest of the string and look for a zero sum. That will only work if you reset the sum to zero for each i.
void substr(char str[], int n)
{
int max = 0;
int start = -1;
for (int i = 0; i + max < n; i++) {
int sum = 0;
for (int j = i; j < n; j++) {
sum += (str[j] == 'a') ? -1 : 1;
if (sum == 0 && max < j - i) {
max = j - i;
start = i;
}
}
}
if (max == 0) {
printf("No such subarray\n");
} else {
printf("%d %d\n", start, start + max);
}
}
Why initialize max = 0 instead of -1? Because you add +1/−1 as first thing, your check can never find a substring of max == 0, but there's a possibility of optimization: If you have already found a long substring, there's no need to look at the "tail" of your string: The loop condition i + max < n will cut the search short.
(There's another reason: Usually, sizes and indices are represented by unsigned types, e.g. size_t. If you use 0 as initial value, your code will work for unsigned types.)
The algorithm isn't the most efficient for large arrays, but it should work.
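For large inputs, one possible alternative (not part of the answer above, just a sketch) is a single pass that remembers the first position at which each running balance was seen: whenever the same balance shows up again, the characters in between contain equally many a's and b's. The function name longest_balanced and its return convention are assumptions made for this example; like the code above, it treats every character other than 'a' as a 'b'.

```c
#include <stdlib.h>

/* Returns the length of the longest balanced substring of s[0..n-1] and
   stores its first index in *start. Returns 0 if there is none and -1 if
   memory allocation fails. Among equally long candidates, the first wins. */
int longest_balanced(const char *s, int n, int *start)
{
    /* first[b + n] is the shortest prefix length whose balance is b */
    int *first = malloc((2 * (size_t)n + 1) * sizeof *first);
    int best = 0, balance = 0;

    if (first == NULL)
        return -1;
    for (int k = 0; k <= 2 * n; k++)
        first[k] = -1;
    first[n] = 0;                       /* the empty prefix has balance 0 */

    for (int j = 0; j < n; j++) {
        balance += (s[j] == 'a') ? -1 : 1;
        int idx = balance + n;
        if (first[idx] < 0) {
            first[idx] = j + 1;         /* balance seen for the first time */
        } else if (j + 1 - first[idx] > best) {
            best = j + 1 - first[idx];  /* strictly longer substring found */
            *start = first[idx];
        }
    }
    free(first);
    return best;
}
```

For bbbbabaababb this yields length 8 starting at index 2, so printing start and start + length - 1 gives the inclusive pair 2 9 that the question expects.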

Subtracting arbitrary large integers in C

Question:
I want to know the difference of number n and a, both stored in char
arrays in ALI structures. Basically, what I'm doing is initialising
two integers (temp_n and temp_a) with the current digits of n and a,
subtracting them and placing the result in a new ALI instance named
k. If the j-th digits of a is greater than the i-th digit of n, then
I add 10 to the digit if n, finish the subtraction, and in the next
turn, I increase temp_a by one. The value of number a certainly falls
between 1 and n - 1 (that's given). If a is shorter than n, as soon
as I reach the last digits of a, I put the remaining digits of n to
the result array k. And I do this all backwards, so the initialising
value of i would be the size of n -1.
Example:
I store a number in a structure like this:
typedef struct Arbitrary_Large_Integer
{
char digits[];
} ALI;
Requirements:
I know that it could be easier to use char arrays instead of a
structure with a single member which barely makes sense, but I'm
forced to put structures in my code this time (that's a requirement
for my assignment).
Code:
ALI *subtraction(ALI n, ALI a, int nLength, int aLength)
{
ALI *result;
result = (ALI*)malloc(nLength * sizeof(ALI));
if (result == NULL)
printf("ERROR");
int temp_n, temp_a, difference;
int i = nLength - 1; //iterator for number 'n'
int j = aLength - 1; //iterator for number 'a'
int k = 0; //iterator for number 'k', n - a = k
bool carry = false; //to decide whether a carry is needed or not the turn
for (i; i >= 0; i--)
{
//subtracting 48 from n.digits[i], so temp_n gets the actual number
//and not its ASCII code when the value is passed
temp_n = n.digits[i] - ASCIICONVERT;
temp_a = a.digits[j] - ASCIICONVERT;
//Performing subtraction the same way as it's used on paper
if (carry) //if there is carry, a needs to be increased by one
{
temp_a++;
carry = false;
}
if (temp_n >= temp_a)
{
difference = temp_n - temp_a;
}
//I wrote else if instead of else so I can clearly see the condition
else if (temp_a > temp_n)
{
temp_n += 10;
difference = temp_n - temp_a;
carry = true;
}
//placing the difference in array k, but first converting it back to ASCII
result->digits[k] = difference + ASCIICONVERT;
k++;
//n is certainly longer than a, so after every subtraction is performed on a's digits,
//I place the remaining digits of n in k
if (j == 0)
{
for (int l = i - 1; l >= 0; l--)
{
result->digits[k] = n.digits[l];
k++;
}
//don't forget to close the array
result->digits[k] = '\0';
break;
}
j--;
}
//reverse the result array
_strrev(result->digits);
return result;
}
Output/Error:
Output results
It seems like when the array is passed to the function, its value
changes for some reason. I can't figure out what's wrong with it.
Problems:
Non-standard C
The typedef is not a valid standard C structure. A flexible array member (FAM) such as .digits must be preceded by at least one other named member. I recommend putting .nLength as the first member.
// Not standard
typedef struct Arbitrary_Large_Integer {
char digits[];
} ALI;
malloc(0)??
Since code is using a non-standard C, watch out that nLength * sizeof(ALI) may be the same as nLength * 0.
No room for the null character
The code attempts to use .digits as a string with _strrev(), so the malloc() is too small by 1, at least.
Other problems may exist
A Minimal, Complete, and Verifiable example is useful for additional fixes/solutions
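A sketch of what a standard-conforming layout and allocation might look like, under the assumption that the digits are kept as a null-terminated string. The nLength member follows the recommendation above; the helper name ali_from_string is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Arbitrary_Large_Integer {
    size_t nLength;   /* number of digits, not counting the terminator */
    char digits[];    /* flexible array member, must be the last member */
} ALI;

/* Hypothetical helper: allocates one ALI with room for the digits of s
   plus the null character. Returns NULL if malloc fails. */
ALI *ali_from_string(const char *s)
{
    size_t len = strlen(s);
    ALI *p = malloc(sizeof *p + len + 1);

    if (p == NULL)
        return NULL;
    p->nLength = len;
    memcpy(p->digits, s, len + 1);   /* copies the terminator too */
    return p;
}
```

The allocation size is sizeof *p plus the bytes needed for the array, rather than a multiple of sizeof(ALI). The caller releases the object with a single free().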

C - Counting the occurrence of same number in an array

I have an array in C where:
int buf[4];
buf[0] = 1;
buf[1] = 2;
buf[2] = 5;
buf[3] = 2;
and I want to count how many elements in the array that have the same value with a counter.
In the above example, the number of elements of similar value is 2 since there are two 2s in the array.
I tried:
#include <stdio.h>
int main() {
int buf[4];
int i = 0;
int count = 0;
buf[0] = 1;
buf[1] = 2;
buf[2] = 5;
buf[3] = 2;
int length = sizeof(buf) / sizeof(int);
for (i=0; i < length; i++) {
if (buf[i] == buf[i+1]) {
count++;
}
}
printf("count = %d", count);
return 0;
}
but I'm getting 0 as the output. Would appreciate some help on this.
Update
Apologies for not being clear.
First:
the array is limited to only of size 4 since it involves 4 directions, left, bottom, top and right.
Second:
if there is at least 2 elements in the array that have the same value, the count is accepted. Anything less will simply not register.
Example:
1,2,5,2
count = 2 since there are two '2's in the array.
1,2,2,2
count = 3 since there are three '2's in the array
1,2,3,4
count = 0 since there are no similarities in the array. Hence this is not accepted.
Anything less than the count = 2 is invalid.
You are really rather hamstrung by the order the values appear within buf. The only rudimentary way to handle this when limited to 4-values is to make a pass with nested loops to determine what the matching value is, and then make a single pass over buf again counting how many times it occurs (and since you limit to 4-values, even with a pair of matches, your count is limited to 2 -- so it doesn't make a difference which you count)
A short example would be:
#include <stdio.h>
int main (void) {
int buf[] = {1, 2, 5, 2},
length = sizeof(buf) / sizeof(int),
count = 0,
same = 0;
for (int i = 0; i < length - 1; i++) /* identify what value matches */
for (int j = i + 1; j < length; j++)
if (buf[i] == buf[j]) {
same = buf[i];
goto saved; /* jump out of both loops when same found */
}
saved:; /* the lowly, but very useful 'goto' saves the day - again */
for (int i = 0; i < length; i++) /* count matching numbers */
if (buf[i] == same)
count++;
printf ("count = %d\n", count);
return 0;
}
Example Use/Output
$ ./bin/arr_freq_count
count = 2
While making that many passes over the values, it takes little more to use an actual frequency array to fully determine how often each value occurs, e.g.
#include <stdio.h>
#include <string.h>
#include <limits.h>
int main (void) {
int buf[] = {1, 2, 3, 4, 5, 2, 5, 6},
n = sizeof buf / sizeof *buf,
max = INT_MIN,
min = INT_MAX;
for (int i = 0; i < n; i++) { /* find max/min for range */
if (buf[i] > max)
max = buf[i];
if (buf[i] < min)
min = buf[i];
}
int range = max - min + 1; /* max-min elements (inclusive) */
int freq[range]; /* declare VLA */
memset (freq, 0, range * sizeof *freq); /* initialize VLA zero */
for (int i = 0; i < n; i++) /* loop over buf setting count in freq */
freq[buf[i]-min]++;
for (int i = 0; i < range; i++) /* output frequency of values */
printf ("%d occurs %d times\n", i + min, freq[i]);
return 0;
}
(note: add a sanity check on the range to prevent being surprised by the amount of storage required if min is actually close to INT_MIN and your max is close to INT_MAX -- things could come to quick stop depending on the amount of memory available)
Example Use/Output
$ ./bin/freq_arr
1 occurs 1 times
2 occurs 2 times
3 occurs 1 times
4 occurs 1 times
5 occurs 2 times
6 occurs 1 times
After your edit and explanation that you are limited to 4-values, the compiler should optimize the first rudimentary approach just fine. However, for any more than 4-values or when needing the frequency of anything (characters in a file, duplicates in an array, etc.), think frequency array.
The first thing that's wrong is that you are only comparing adjacent values in the buf array. You have to compare all the values to each other.
How to do this is an architectural question. The approach suggested by David Rankin in the comments is one, using an array of structs with the value and count is a second, and using a hash table is a third option. You've got some coding to do! Good luck. Ask for more help as you need it.
You are comparing values of buf[i] and buf[i+1]. i.e. You are comparing buf[0] with buf[1], buf[1] with buf[2] etc.
What you need is a nested for loop to compare all buf values with each other.
count = 0;
for (i=0; i<4; i++)
{
for (j=i+1; j<4; j++)
{
if (buf[i]==buf[j])
{
count++;
}
}
}
As pointed out by Jonathan Leffler, there is an issue in the above algorithm in case the input has elements {1,1,1,1}. It gives a value of 6 when expected value is 4.
I am keeping it up, as the OP has mentioned that he wants to only check anything above 2. So, this method may still be useful.
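If the result should be the number of occurrences rather than the number of matching pairs, one possible variant counts, for each element, how many elements equal it and keeps the largest total. This is only a sketch; the name max_repeat_count and the rule of returning 0 when nothing repeats are assumptions based on the examples in the question's update.

```c
/* Hypothetical helper: returns how often the most repeated value occurs
   in buf[0..n-1], or 0 if no value occurs at least twice. */
int max_repeat_count(const int *buf, int n)
{
    int best = 0;

    for (int i = 0; i < n; i++) {
        int count = 0;
        for (int j = 0; j < n; j++)   /* count elements equal to buf[i] */
            if (buf[j] == buf[i])
                count++;
        if (count > best)
            best = count;
    }
    return best >= 2 ? best : 0;
}
```

For the inputs from the question this gives 2 for {1,2,5,2}, 3 for {1,2,2,2}, 0 for {1,2,3,4}, and 4 for {1,1,1,1}.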

Detecting most frequently recurring symbol from ASCII-characters on C

How do I write an implementation for a function that takes as input a sequence of ASCII characters and gives the most frequently recurring symbol? I need to do it in C. Where is my mistake?
char mostFrequentCharacter(char* str, int size);
char value;
int valueCount = 0;
for (int i =0; i < strlen(str); i++)
{
char oneChar = str[i];
var totalCount = source.Split(oneChar).Length - 1;;
if (totalCount >= valueCount)
{
valueCount = totalCount;
value = oneChar;
}
}
return value;
The function should be optimized to run on a device with a dual-core ARM-based processor and an infinite amount of memory.
If memory is not an issue, as you noted, then you should create a lookup table where you store the number of occurrences of each character. Since the input is a sequence of ASCII characters, the size of the structure should be 256. After checking the input and initializing the lookup table, in the main for loop, increment the number of occurrences in the corresponding place in the lookup table and check whether it exceeds the current maximal count; if so, update the current maximal count and the current most frequent character. In the end, just return the most frequent character. The time complexity of this solution is O(N) and the space complexity is O(1).
char mostFrequentCharacter(char* str, int size) {
    char mostFrequent = '\0';
    int counts[256], i, maxCount = 0;
    // in the case of invalid input, return some invalid character
    if(!str || size < 1)
        return '\0';
    for(i = 0; i < 256; i++)
        counts[i] = 0;
    for (i = 0; i < size; i++)
    {
        // index with unsigned char: plain char may be signed,
        // which could produce a negative array index
        unsigned char c = (unsigned char)str[i];
        counts[c]++;
        if(counts[c] > maxCount) {
            maxCount = counts[c];
            mostFrequent = str[i];
        }
    }
    return mostFrequent;
}
Here is an algorithm outline:
1. Declare a 256 element array of integers (pick your size), zero it.
2. Loop over the string:
2a. Use each char as index into your array and increment element.
2b. If incremented element is largest so far record index.
All done in one pass over the string. Storage 256 * size of integer bytes, but you have "infinite memory" and that's minuscule in comparison ;-)

Converting int to int[] in 'C'

I basically want to convert a given int number and store individual digits in an array for further processing.
I know I can use % and get each digit and store it. But the thing is if I do not know the number of digits of the int till runtime and hence I cannot allocate the size of the array. So, I cannot work backwards (from the units place).
I also do not want to first store the number backwards in an array and then again reverse the array.
Is there any other way of getting about doing this?
Eg: int num = 12345;
OUTPUT: ar[0] = 1, ar[1] = 2 and so on, where ar[] is an int array.
Convert is probably not the right word. You can take the int, dynamically allocate a new int[], and then store the digits of the int into the int[]. I'm using log base 10 to calculate how many digits num has. Include math.h to use it. The following code is untested, but will give you an idea of what to do.
int num = 12345;
int size = (int)(log10(num)+1);
// allocate array
int *digits = (int*)malloc(sizeof(int) * size);
// get digits
for(int i=size-1; i>=0; --i) {
digits[i] = num%10;
num=num/10; // integer division
}
The easiest way is to calculate number of digits to know the size of an array you need
int input = <input number>; // >= 0
int d, numdigits = 1;
int *arr;
d = input;
while (d /= 10)
numdigits++;
arr = malloc(sizeof(int) * numdigits);
There's an even easier way: you probably pass the number to your program as a command-line argument. In that case you receive it as a string in argp[N], so you can just call strlen(argp[N]) to determine the number of digits in your number.
If you have a 32-bit integer type, the maximum value will be comprised of 10 digits at the most (excluding the sign for negative numbers). That could be your upper limit.
If you need to dynamically determine the minimum sufficient size, you can determine that with normal comparisons (since calling a logarithmic function is probably more expensive, but a possibility):
size = 10;
if (myint < 1000000000) size--;
if (myint < 100000000) size--;
/* ... */
Declaring the array to be of a dynamic size depends on the C language standard you are using. In C89 dynamic array sizes (based on values calculated during run-time) is not possible. You may need to use dynamically allocated memory.
HTH,
Johan
The following complete program shows one way to do this. It uses unsigned integers so as to not have to worry about converting - you didn't state what should happen for negative numbers so, like any good consultant, I made the problem disappear for my own convenience :-)
It basically works out the required size of an array and allocates it. The array itself has one element at the start specifying how many elements are in the array (a length int).
Each subsequent element is a digit in sequence. The main code below shows how to process it.
If it can't create the array, it'll just give you back NULL - you should also remember to free the memory passed back once you're done with it.
#include <stdio.h>
#include <stdlib.h>
unsigned int *convert (unsigned int num) {
unsigned int *ptr;
unsigned int digits = 0;
unsigned int temp = num;
// Figure out how many digits in the number.
if (temp == 0) {
digits = 1;
} else {
while (temp > 0) {
temp /= 10;
digits++;
}
}
// Allocate enough memory for length and digits.
ptr = malloc ((digits + 1) * sizeof (unsigned int));
// Populate array if we got one.
if (ptr != NULL) {
ptr[0] = digits;
for (temp = 0; temp < digits; temp++) {
ptr[digits - temp] = num % 10;
num /= 10;
}
}
return ptr;
}
That convert function above is the "meat" - it allocates an integer array to place the length (index 0) and digits (indexes 1 through N where N is the number of digits). The following was the test program I used.
int main (void) {
unsigned int i;
unsigned int num = 12345;
unsigned int *arr = convert (num);
if (arr == NULL) {
printf ("No memory\n");
} else {
// Length is index 0, rest are digits.
for (i = 1; i <= arr[0]; i++)
printf ("arr[%u] = %u\n", i, arr[i]);
free (arr);
}
return 0;
}
The output of this is:
arr[1] = 1
arr[2] = 2
arr[3] = 3
arr[4] = 4
arr[5] = 5
You can find out the number of digits by taking the base-10 logarithm and adding one. For that, you could use the log10 or log10f functions from the standard math library. This may be a bit slower, but it's probably the most exact as long as double has enough bits to exactly represent your number:
int numdigits = 1 + log10(num);
Alternatively, you could repeatedly divide by ten until the result is zero and count the digits that way.
Still another option is just to allocate enough room for the maximum number of digits the type can have. For a 32-bit integer, that'd be 10; for 64-bit, 20 should be enough. You can just zero the extra digits. Since that's not a lot of wasted space even in the worst case, it might be the simplest and fastest option. You'd have to know how many bits are in an int in your setup, though.
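The fixed-maximum idea could look something like the sketch below. The names MAX_DIGITS and to_digits are invented for this example, and the bound relies on the approximation that each bit contributes about 0.302 decimal digits; the digits are written right-aligned so that no reversal is needed.

```c
#include <limits.h>
#include <stddef.h>

/* Upper bound on the decimal digits of an unsigned int:
   10 for a 32-bit type, 20 for a 64-bit type. */
#define MAX_DIGITS ((sizeof(unsigned int) * CHAR_BIT * 302) / 1000 + 1)

/* Hypothetical helper: writes the digits of num right-aligned into buf,
   zero-fills the unused leading slots, and returns the index of the
   first significant digit. */
size_t to_digits(unsigned int num, int buf[MAX_DIGITS])
{
    size_t i = MAX_DIGITS;
    size_t first;

    do {                          /* fill from the units place backwards */
        buf[--i] = (int)(num % 10);
        num /= 10;
    } while (num != 0);
    first = i;

    while (i > 0)                 /* zero the extra leading digits */
        buf[--i] = 0;
    return first;
}
```

After calling to_digits(12345, buf), the digits 1 through 5 sit in buf[first] to buf[MAX_DIGITS - 1] in reading order, and MAX_DIGITS - first is the digit count.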
You can also estimate fairly well by allocating 3 digits for each 10 bits used, plus one. That should be enough digits unless the number of bits is ridiculously large (way above the number of digits any of the usual int types could have).
int numdigits = 1;
unsigned int n;
/* add 3 digits for every group of 10 bits until no bits are left */
for (n = num; n != 0; n >>= 10)
numdigits += 3;
/* numdigits is at least the needed number of digits, maybe up to 3 more */
This last one won't work (directly) if the number is negative.
What you basically want to do is to transform your integer to an array of its decimal positions. The printf family of functions perfectly knows how to do this, no need to reinvent the wheel. I am changing the assignment a bit since you didn't say anything about signs, and it simply makes more sense for unsigned values.
unsigned* res = 0;
size_t len = 0;
{
/* temporary array, large enough to hold the representation of any unsigned */
char positions[20] = { 0 };
sprintf(positions, "%u", number);
len = strlen(positions);
res = malloc(sizeof(unsigned[len]));
for (size_t i = 0; i < len; ++i)
res[i] = positions[i] - '0';
}

Resources