I was wondering what the fastest way is to count the number of occurrences of a string (the needle) within another string (the haystack). The way I'm doing it is:
int findWord(char *file, char *word) {
    char *fptr;
    char *current = strtok_r(file, " ,.\n", &fptr);
    int sum = 0;
    while (current != NULL) {
        //printf("%s\n", current);
        if (strcmp(current, word) == 0)
            sum += 1;
        current = strtok_r(NULL, " ,.\n", &fptr);
    }
    return sum;
}
Would it be faster to use a more complex algorithm (Boyer-Moore)?
Thanks
Currently, if your program is counting the word "blah" and encounters a token such as "blahblah", your algorithm counts that as zero occurrences. If you needed to count it as two, you could benefit from a more advanced approach.
If your program does what you want, you are already processing as fast as you can: the work is linear in the length of the input, so you cannot speed it up asymptotically.
An even more interesting solution would be required to count words with self-aliasing: for example, counting occurrences of "aa" inside the string "aaaa". If you needed to return 3 in that situation, you'd need a considerably more advanced algorithm.
Would it be faster to use a more complex algorithm (Boyer-Moore)?
In your algorithm, the unit of comparison is a word rather than a character. This lets the algorithm ignore matches that straddle a word boundary, and makes it run in O(n) time.
I doubt you'd be able to beat that asymptotically.
As far as lowering the multiplicative constant goes: right now your algorithm looks at every character in file twice. You can eliminate that redundancy by rewriting the code to use a pair of pointers and a single loop (a sketch follows; working out the remaining details is left as an exercise for the reader :))
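For illustration, here is one possible shape of that single-pass version. This is a sketch only: it assumes the same delimiter set " ,.\n" as the original, and the function name is mine.

#include <string.h>

/* Single pass over the file: skip delimiters, bracket a token with two
   pointers, and compare it against word without copying or modifying file. */
int findWordOnePass(const char *file, const char *word) {
    const char *delims = " ,.\n";
    size_t wlen = strlen(word);
    int sum = 0;
    const char *p = file;
    while (*p) {
        while (*p && strchr(delims, *p))   /* skip leading delimiters */
            p++;
        const char *start = p;
        while (*p && !strchr(delims, *p))  /* scan to the end of the token */
            p++;
        if ((size_t)(p - start) == wlen && memcmp(start, word, wlen) == 0)
            sum++;
    }
    return sum;
}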
Unless your system has a bad implementation of the string functions, this should be roughly the fastest (note strstr rather than strchr, since the needle is a string):
const char *s, *t;
size_t cnt;
for (cnt=0, s=haystack; (t=strstr(s, needle)); s=t+1, cnt++);
Adjust it a bit (s=t+strlen(needle) rather than s=t+1) if you don't want to count overlapping matches.
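For instance, a quick check of the two behaviours on hypothetical inputs (counting "aa" in "aaaa"):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *haystack = "aaaa", *needle = "aa";
    const char *s, *t;
    size_t cnt;

    for (cnt = 0, s = haystack; (t = strstr(s, needle)); s = t + 1, cnt++)
        ;
    printf("overlapping: %zu\n", cnt);      /* 3: matches at positions 0, 1, 2 */

    for (cnt = 0, s = haystack; (t = strstr(s, needle)); s = t + strlen(needle), cnt++)
        ;
    printf("non-overlapping: %zu\n", cnt);  /* 2: matches at positions 0, 2 */
    return 0;
}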
I need to create some hashing function... can you help me?
The input is a sequence of strings of digits and letters (char *a[]); n is the number of strings in the input. The task is to determine how many of the strings are repeated; the function returns the number of repetitions.
#include <stdio.h>
#include <string.h>

int function(char *a[], int n)
{
    int i, j;
    int same = 0;
    for (i = 0; i < n - 1; i++)
    {
        for (j = i + 1; j < n; j++)
        {
            if (!strcmp(a[i], a[j]))
                same++;
        }
    }
    return same;
}

int main(void)
{
    char *a[] = {"AA123456", "BA987689", "AA123123", "AA312312", "BB345345", "AA123123"};
    printf("Number of duplicates: %d\n", function(a, 6));
    return 0;
}
Read the Wikipedia pages on hash functions and hash tables.
Often, a linear combination with prime coefficients (see Bézout's identity), involving the string's components and the partial hash, gives a good enough result.
For example, something like
int basile_hash(const char *str) {
    int h = 65537;
    while (*str) {
        h = 75553 * h + 5531 * (*str);
        str++;
    }
    return h;
}
I don't claim it is a very good hash, but it is probably good enough for your needs. The constants 65537, 75553, and 5531 are all primes (given by /usr/games/primes from the bsdgames Debian package).
You could make a variant with bitwise xor ^, or you could take more than one component into account at a time:
h = (65579 * str[0]) ^ (5507 * str[1] + 17 * h);
str += 2;
but then you must take care to special-case the situation where str[1] is the terminating null byte.
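A minimal sketch of that two-at-a-time variant with the null-byte case handled (the odd-tail treatment here is my own choice, not from the original):

int basile_hash2(const char *str) {
    int h = 65537;
    while (str[0]) {
        if (str[1] == '\0') {
            /* odd-length tail: fold in the last byte on its own */
            h = 75553 * h + 5531 * str[0];
            break;
        }
        h = (65579 * str[0]) ^ (5507 * str[1] + 17 * h);
        str += 2;
    }
    return h;
}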
Read also about MD5
Notice that a lot of standard or popular libraries give you many hash functions. Most of the time the particular choice of hash function is not very important. On the other hand, you can still earn a PhD studying and inventing good hash functions. I have another one in this values.c file, function mom_cstring_hash near line 150 (I imagine it might be better optimized, since for large strings some of the instructions might run "in parallel" inside the processor).
I certainly don't claim to be an expert on hash functions.
Study also the source code of the hash functions in free-software libraries like GLib, Qt, etc. See also gperf.
I am currently playing around with hashing and key generation, trying to make my own hash-key generator.
At the moment I have a list of ~90000 strings (each a single, distinct word). I was wondering what the best way to generate keys (numeric keys, not string keys) would be?
Currently, depending on the word's last ASCII character, I do a calculation based on the value of that letter.
The result is that about 50% of the words generate a key that clashes with another.
I have used quadratic probing to then find space in the table for the rest of the words.
My question, as above, is: what is generally the best way to generate a key for 90000 different words? I know that the larger the data set, the more likely clashes become, but how would you suggest minimising them?
Edit: Also - I don't care about cryptography, it just needs to be fast.
Thanks.
You can "borrow" Java's implementation of String's hashCode*:
int hashCode(const char* s) {
    int h = 0;
    while (*s) {
        h = 31*h + (*s++);
    }
    return h;
}
This function achieves a reasonable separation, and is among the most widely used hash functions out there.
* which, as it turns out, Java in turn "borrowed" from Kernighan & Ritchie's book on C programming.
To prevent clashes you need a good hash key generator.
There are several algorithms available. One recent and very fast one is called xxHash. It's written in C.
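A hedged usage sketch (it assumes you have xxhash.h from the xxHash repository on your include path; XXH32 is its 32-bit entry point taking a buffer, a length, and a seed):

#include <string.h>
#include "xxhash.h"

/* Derive a numeric key for a word using xxHash's 32-bit variant. */
unsigned int key_for(const char *word) {
    return (unsigned int)XXH32(word, strlen(word), 0 /* seed */);
}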
Choosing 90,000 as the size of the hash table can't be a good choice; there is the much better concept of perfect hashing. Failing that, use double hashing - one hash for the table lookup and the other to drive the probe sequence - and try the multiplication method for both. I think that's a good idea.
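A minimal sketch of such a double-hashing probe (the table size and the two multiplicative hash constants below are my own illustrative choices, not prescribed values):

#include <stddef.h>

#define TABLE_SIZE 196613U  /* an illustrative prime, a bit over 2x 90,000 */

/* Two multiplication-method hashes; the multipliers are illustrative. */
static size_t hash1(const char *s) {
    size_t h = 0;
    while (*s) h = h * 2654435761U + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

static size_t hash2(const char *s) {
    size_t h = 0;
    while (*s) h = h * 40503U + (unsigned char)*s++;
    return h % (TABLE_SIZE - 1) + 1;  /* never zero, so probing always advances */
}

/* Slot to inspect on the k-th probe attempt for string s. */
size_t probe_slot(const char *s, size_t k) {
    return (hash1(s) + k * hash2(s)) % TABLE_SIZE;
}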
I've seen Knuth use:
register int h,k; register char *p;
for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime;
Where hash_prime is a prime larger than 4x the expected number of live entries in the hash table.
See: Knuth's literateprogramming.com, the Adventure example.
Here's the hashing code in context:
#define hash_prime 1009 /* the size of the hash table */

typedef struct {
    char text[6];   /* string of length at most 5 */
    char word_type; /* a |wordtype| */
    char meaning;
} hash_entry;

hash_entry hash_table[hash_prime]; /* the table of words we know */

void new_word(w, m)
    char *w; /* a string of length 5 or less */
    int m;   /* its meaning */
{
    register int h, k; register char *p;
    for (h = 0, p = w; *p; p++) h = (*p + h + h) % hash_prime;
    while (hash_table[h].word_type) {
        h++; if (h == hash_prime) h = 0;
    }
    /* ... store w, its word_type, and its meaning m in hash_table[h]
       (those details are elided in this excerpt) ... */
}

int lookup(w)
    char *w; /* a string that you typed */
{
    register int h; register char *p; register char t;
    t = w[5]; w[5] = '\0'; /* truncate the word */
    for (h = 0, p = w; *p; p++) h = (*p + h + h) % hash_prime; /* compute starting address */
    w[5] = t; /* restore original word */
    if (h < 0) return -1; /* a negative character might screw us up */
    while (hash_table[h].word_type) {
        if (streq(w, hash_table[h].text)) return h;
        h++; if (h == hash_prime) h = 0;
    }
    return -1;
}
Note that this code:
register char t;
// . . .
t=w[5]; w[5]='\0'; /* truncate the word */
// . . .
w[5]=t; /* restore original word */
is there to satisfy a specific requirement to look at only the first 5 characters, and should be removed so that you hash the entire word.
The term you want is avalanche - a hash function that provides optimal spread.
If you want your keys to be guaranteed unique, and if your dataset has zero duplicates,
then you can convert your word, read as a base-36 number, into a base-10 number. If you use strtoull() you can return really large integers:
#include <ctype.h>
#include <stdlib.h>

char *p = myword;
for (; *p; p++)
    *p = toupper(*p);
unsigned long long key = strtoull(myword, NULL, 36);
This can overflow and still return a positive number. Some hashes, when given a long string, may overflow a 32-bit integer; Kernighan's hash and Bernstein's hash both do that.
In reality, and as pointed out by several other folks:
collisions are a function of the hash-table size and of the avalanche behaviour of the hash function modulo the hash-table size. Instead of truly unique keys, what you may really want is a better hash-table algorithm and size.
I really don't know how to implement this function:
The function should take a pointer to an integer, a pointer to an array of strings, and a string to process. The function should write to the array all variations of replacing the 'ch' combination with the '#' symbol, and set the integer to the size of that array. Here is an example of the processing:
choker => {"choker","#oker"}
chocho => {"chocho","#ocho","cho#o","#o#o"}
chachacha => {"chachacha","#achacha","cha#acha","chacha#a","#a#acha","cha#a#a","#acha#a","#a#a#a"}
I am writing this in C99. So this is a sketch of the call site:
int n;
char **arr;
char *string = "chacha";
func(&n, &arr, string);
And a sketch of the function:
int func(int *n, char ***arr, char *string) {
}
So I think I need to create another function which counts the number of 'ch' combinations and allocates memory accordingly. I'll be glad to hear any ideas about this algorithm.
You can count the number of combinations pretty easily:
char *tmp = string;
int i;
for (i = 0; *tmp != '\0'; i++) {
    if (!(tmp = strstr(tmp, "ch")))
        break;
    tmp += 2; /* skip past the two characters "ch" */
}
/* i now contains the number of times "ch" appears in the string */
int num_combinations = 1 << i;
/* num_combinations is 2 to the power of the number of occurrences of "ch" */
First, I'd create a helper function, e.g. countChs, that would just iterate over the string and return the number of 'ch'-s. That should be easy, as no string overlapping is involved.
When you have the number of occurrences, you need to allocate space for 2^count strings, each string (apart from the original one) of length at most strlen(original) - 1. You also set your n variable equal to that 2^count.
After you have your space allocated, just iterate over all indices in your new table and fill them with copies of the original string (strcpy() or strncpy() to copy), then replace the chosen 'ch'-s with '#' in them (there are loads of ready snippets online; look for "C string replace").
Finally, make your arr pointer point to the new table. Be careful though - if it pointed to some other data before, you should think about freeing it, or you'll end up with memory leaks. A sketch of the whole approach follows.
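As a sketch only (the helper name countChs and the bit-per-occurrence encoding are my own choices, and error handling is minimal):

#include <stdlib.h>
#include <string.h>

/* Count non-overlapping occurrences of "ch". */
static int countChs(const char *s) {
    int count = 0;
    for (const char *p = s; (p = strstr(p, "ch")) != NULL; p += 2)
        count++;
    return count;
}

/* Fill *arr with all 2^count variants; bit k of the variant index v
   decides whether the k-th "ch" is replaced by '#'. */
int func(int *n, char ***arr, const char *string) {
    int count = countChs(string);
    int total = 1 << count;
    char **table = malloc(total * sizeof *table);
    if (!table)
        return -1;
    for (int v = 0; v < total; v++) {
        char *out = malloc(strlen(string) + 1);  /* variants never grow */
        if (!out)
            return -1;                           /* (cleanup elided in sketch) */
        const char *src = string;
        char *dst = out;
        int k = 0;
        while (*src) {
            if (src[0] == 'c' && src[1] == 'h') {
                if (v & (1 << k)) {
                    *dst++ = '#';                /* this occurrence is replaced */
                } else {
                    *dst++ = 'c';
                    *dst++ = 'h';
                }
                src += 2;
                k++;
            } else {
                *dst++ = *src++;
            }
        }
        *dst = '\0';
        table[v] = out;
    }
    *n = total;
    *arr = table;
    return 0;
}

On "chocho" this produces {"chocho", "#ocho", "cho#o", "#o#o"}, matching the example above.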
If you would like to have all variations of the replaced string, the array will have 2^n elements, where n is the number of "ch" substrings. So, calculating n:
int i = 0;
int n = 0;
while (string[i] != '\0')
{
    if (string[i] == 'c' && string[i + 1] == 'h')
        n++;
    i++;
}
Then we can use the binary representation of a number. Note that, as we increment an integer i from 0 to 2^n - 1, the binary representation of i tells us which "ch" occurrences to change. So:
for (unsigned long long i = 0; i < (1ULL << n); i++)
{
    unsigned long long number = i;
    int k = 0;
    while (number > 0)
    {
        if (number % 2 == 1)
        {
            /* replace the k-th occurrence of "ch" */
        }
        number /= 2;
        k++;
    }
    /* add the resulting string to the array */
}
This code checks every bit in the binary representation of number and changes the k-th occurrence if the k-th bit is 1. Changing the k-th "ch" is pretty easy, and I leave it to you (one possible single-pass version follows).
This approach only works for 64 or fewer occurrences, because an unsigned long long can hold only 2^64 values.
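One way to realize that replacement step in a single pass, taking the whole bit pattern at once (the function name is mine; dst must have room for strlen(src) + 1 bytes):

/* Copy src to dst, replacing the k-th occurrence of "ch" with '#'
   whenever bit k of mask is set. */
void replace_masked(const char *src, char *dst, unsigned long long mask) {
    int k = 0;
    while (*src) {
        if (src[0] == 'c' && src[1] == 'h') {
            if (mask & (1ULL << k))
                *dst++ = '#';
            else {
                *dst++ = 'c';
                *dst++ = 'h';
            }
            src += 2;
            k++;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}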
There are two sub-problems that you need to solve for your original problem:
allocating space for the array of variations
calculating the variations
For the first problem, you need to find the mathematical function f that takes the number of "ch" occurrences in the input string and returns the total number of variations.
Based on your examples: f(1) = 2, f(2) = 4 and f(3) = 8. This should give you a good idea of where to start, but it is important to prove that your function is correct. Induction is a good way to build that proof.
Since your replace process ensures that each result is either the same length as or shorter than the original, you can allocate space for each individual result equal to the length of the original.
As for the second problem, the simplest way is to use recursion, as in the example provided by nightlytrails.
You'll need another function which takes the array you allocated for the results, a count of results, the current state of the string, and an index into the current string.
When called, if there are no further occurrences of "ch" beyond the index, then save the result in the array at position count and increment count (so the next result doesn't overwrite this one).
If there is a "ch" beyond index, call the function twice (the recursion part). One call uses a copy of the current string and merely advances the index to just beyond the "ch". The other call uses a copy of the current string with that "ch" replaced by "#" and advances the index to just beyond the "#".
Make sure there are no memory leaks. No malloc without a matching free.
After you make this solution work, you might notice that it plays fast and loose with memory: it uses more than it needs to. Improving the algorithm is an exercise for the reader. A sketch of the recursion follows.
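A minimal sketch of that recursion (the names and the heap-ownership convention are my own; strdup is POSIX, and malloc error checks are elided):

#include <stdlib.h>
#include <string.h>

/* 'current' must be heap-allocated; ownership passes to this function.
   Finished variants are stored in results[], counted by *count. */
void expand(char *current, size_t index, char **results, int *count) {
    char *next = strstr(current + index, "ch");
    if (next == NULL) {
        results[(*count)++] = current;   /* no "ch" beyond index: done */
        return;
    }
    size_t pos = (size_t)(next - current);

    /* Branch 1: keep this "ch", advance the index past it. */
    expand(strdup(current), pos + 2, results, count);

    /* Branch 2: replace this "ch" with '#', advance past the '#'. */
    char *repl = malloc(strlen(current));    /* one char shorter, plus NUL */
    memcpy(repl, current, pos);
    repl[pos] = '#';
    strcpy(repl + pos + 1, current + pos + 2);
    free(current);                           /* both branches took copies */
    expand(repl, pos + 1, results, count);
}

The initial call would be expand(strdup(s), 0, results, &count), with results sized to hold 2^occurrences pointers and count starting at 0.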
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
Calculate the sum of similarities of a string S with each of its suffixes
Here is my solution...
#include <stdio.h>
#include <string.h>

int getSim(char str[], int subindex)
{
    int l2 = subindex;
    int i = 0;
    int count = 0;
    for (i = 0; i < l2; i++)
        if (str[i] == str[subindex])
        {
            count++;
            subindex++;
        }
        else
            break;
    return count;
}

int main()
{
    int testcase = 0;
    int len = 0;
    int sum = 0;
    int i = 0;
    char s[100000];
    scanf("%d", &testcase);
    while (testcase--)
    {
        sum = 0;
        scanf("%s", s);
        for (i = 0; i < strlen(s); i++)
            if (s[i] == s[0])
            {
                sum = sum + getSim(s, i);
            }
        printf("%d\n", sum);
    }
}
How can we go about solving this problem using a suffix array?
I'm not sure if it is the best algorithm, but here is a solution.
First of all, build the suffix array. The naive algorithm (putting all suffixes into an array and then sorting it) is quite slow - O(n^2 * log(n)) operations - but there are several algorithms that do it in O(n log n) time.
I'm assuming that strings are 0-indexed.
Now, take the first letter l of the string s, and use one binary search to find the index i of the first string in the suffix array which starts with l, and another binary search to find the index j of the first string in the range [i..n] which doesn't start with l. Then all strings in the range [i..j-1] start with the same letter l, so the answer to the problem is at least j-i.
Then apply a similar procedure to the strings in the range [i..j). Take the second letter l2 and find the indexes i2 and j2 corresponding to the first string s[i2] such that s[i2][1] == l2 and the first string s[j2] such that s[j2][1] != l2. Add j2-i2 to the answer.
Repeat this procedure up to n times, until you run out of letters in the original string. The answer to the problem is (j1-i1) + (j2-i2) + ... + (jn-in).
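To make the procedure concrete, here is a compact sketch. It builds the suffix array naively with qsort and uses linear scans where the real solution would binary-search, so it illustrates the idea rather than the target speed:

#include <stdlib.h>
#include <string.h>

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(*(const char **)a, *(const char **)b);
}

long long sum_similarities(const char *s) {
    size_t n = strlen(s);
    const char **sa = malloc(n * sizeof *sa);
    for (size_t i = 0; i < n; i++)
        sa[i] = s + i;                       /* suffix starting at position i */
    qsort(sa, n, sizeof *sa, cmp_suffix);    /* naive suffix array */

    long long answer = 0;
    size_t lo = 0, hi = n;                   /* current range [lo, hi) */
    for (size_t k = 0; k < n && lo < hi; k++) {
        /* narrow the range to suffixes whose k-th character is s[k];
           within [lo, hi) the k-th characters appear in sorted order */
        while (lo < hi && (unsigned char)sa[lo][k] < (unsigned char)s[k])
            lo++;
        size_t j = lo;
        while (j < hi && sa[j][k] == s[k])
            j++;
        answer += (long long)(j - lo);       /* these suffixes match s[0..k] */
        hi = j;
    }
    free(sa);
    return answer;
}

For "aaa" this returns 3 + 2 + 1 = 6, as expected from the definition.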
You mention in the comments that it is correct, but very slow.
In Java, you can get the length of a String with s.length() - this value is cached in the object, so getting it is O(1).
But in C, you get the length of a string with strlen(s), which recalculates it (in O(n)) on every call. So while you expect the loop to be O(n), with an O(n) operation in the loop condition the entire function becomes O(n^2).
To get around this, cache the value once before the loop. This brings you back to linear time.
Bad:
scanf("%s", s);
for (i = 0; i < strlen(s); i++)
    if (s[i] == s[0])
    {
        sum = sum + getSim(s, i);
    }
Good:
scanf("%s", s);
len = strlen(s); /* assume you declared "int len" earlier; don't name it strlen, or it will shadow the function */
for (i = 0; i < len; i++) /* the loop condition is now constant time */
    if (s[i] == s[0])
    {
        sum = sum + getSim(s, i);
    }
Can the performance of this sequential search algorithm (taken from The Practice of Programming) be improved using any of C's native utilities, e.g. by making the i variable a register variable?
int lookup(char *word, char *array[])
{
    int i;
    for (i = 0; array[i] != NULL; i++)
        if (strcmp(word, array[i]) == 0)
            return i;
    return -1;
}
Yes, but only very slightly. A much bigger performance improvement can be achieved by using a better algorithm, for example keeping the list sorted and doing a binary search (see the sketch below).
In general optimizing a given algorithm only gets you so far. Choosing a better algorithm (even if it's not completely optimized) can give you a considerable (order of magnitude) performance improvement.
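For instance, with a sorted array the lookup can lean on the standard library's bsearch. A sketch, returning the index as the original function does (the function name is mine):

#include <stdlib.h>
#include <string.h>

/* bsearch passes the key as the first argument to the comparator. */
static int cmp_word(const void *key, const void *elem) {
    return strcmp((const char *)key, *(const char * const *)elem);
}

/* array must be sorted in strcmp order and hold n entries. */
int lookup_sorted(const char *word, char *array[], size_t n) {
    char **hit = bsearch(word, array, n, sizeof array[0], cmp_word);
    return hit ? (int)(hit - array) : -1;
}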
I think it will not make much of a difference. The compiler will already optimize it in that direction.
Besides, the variable i does not have much impact: word stays constant throughout the function, and the rest of the data is too large to fit in any register. What matters is how large the cache is and whether the whole array fits in it.
String comparisons are rather expensive computationally.
Can you perhaps use some kind of hashing for the array before searching?
There is a well-known technique called the sentinel method.
To use the sentinel method, you must know the length of array[], and the array must have one spare slot past its last element.
You can then remove the array[i] != NULL comparison:
int lookup(char *word, char *array[], int array_len)
{
    int i = 0;
    array[array_len] = word; /* the sentinel guarantees the loop terminates */
    for (;; ++i)
        if (strcmp(word, array[i]) == 0)
            break;
    array[array_len] = NULL;
    return (i != array_len) ? i : -1;
}
If you're reading TPOP, you will next see how they make this search many times faster with different data structures and algorithms.
But you can make things a bit faster by replacing things like
for (i = 0; i < n; ++i)
    foo(a[i]);
with
char **p = a;
for (i = 0; i < n; ++i)
foo(*p);
++p;
If there is a known value at the end of the array (e.g. NULL) you can eliminate the loop counter:
for (p = a; *p != NULL; ++p)
    foo(*p);
Good luck, that's a great book!
To optimize that code, the best bet would be to rewrite the strcmp routine: since you are only checking for equality, you don't need it to compute the full lexicographic ordering.
Other than that you can't do much else. You can't sort, as it appears you are looking for text within a larger text, and binary search won't work either since the text is unlikely to be sorted.
My 2p (C pseudocode):
wrd_end = wrd_ptr + wrd_len;
arr_end = arr_ptr + arr_len - wrd_len;
while (arr_ptr < arr_end)
{
    wrd_beg = wrd_ptr; arr_beg = arr_ptr;
    while (*wrd_ptr == *arr_ptr)
    {
        wrd_ptr++; arr_ptr++;
        if (wrd_ptr == wrd_end)
            return arr_beg;
    }
    wrd_ptr = wrd_beg;
    arr_ptr = arr_beg + 1;
}
Mark Harrison: your loop never advances p, so it calls foo on the same element every time (++p is indented, but is not actually within the for :-)
Also, switching between pointers and indexing will generally have no effect on performance, nor will adding register keywords (as mat already mentions) - the compiler is smart enough to apply these transformations where appropriate, and if you tell it enough about your CPU architecture, it will do a better job of them than manual pseudo-micro-optimizations.
A faster way to match strings would be to store them Pascal style. If you don't need more than 255 characters per string, store them roughly like this, with the count in the first byte:
char s[] = "\x05Hello";
Then you can do:
for (i = 0; i < len; ++i) {
    s_len = strings[i][0];                         /* Pascal-style length byte */
    if (s_len == match_len
        && strings[i][s_len] == match[s_len - 1]   /* cheap last-character check */
        && 0 == memcmp(strings[i] + 1, match, s_len - 1)) {
        return 1;
    }
}
And to get really fast, add memory prefetch hints for string start + 64, + 128 and the start of the next string. But that's just crazy. :-)
Another fast way to do it is to get your compiler to use an SSE2-optimized memcmp. Use fixed-length char arrays and align them so each string starts on a 64-byte boundary. Then I believe you can get the good memcmp functions if you pass const char match[64] instead of const char *match into the function, or strncpy match into a 64-, 128-, or 256-byte array.
Thinking a bit more about this, these SSE2 match functions might be part of packages like Intel's and AMD's accelerator libraries. Check them out.
Realistically, setting i to be a register variable won't do anything the compiler wouldn't do already.
If you are willing to spend some time upfront preprocessing the reference array, you should google "The World's Fastest Scrabble Program" and implement that. Spoiler: it's a DAG optimized for character lookups.
/* you can't get much quicker than this */
int lookup(char *word, char *array[])
{
    int i;
    for (i = 0; *array != NULL; i++, array++)
        if (strcmp(word, *array) == 0)
            return i;
    return -1;
}