Is it a bug in ReduceVocab() or am I missing something? - C

Here's a piece of code from Google's word2vec (word2vec.c), which I downloaded:
// Reduces the vocabulary by removing infrequent tokens
void ReduceVocab() {
  int a, b = 0;
  unsigned int hash;
  for (a = 0; a < vocab_size; a++) if (vocab[a].cn > min_reduce) {
    vocab[b].cn = vocab[a].cn;
    vocab[b].word = vocab[a].word;
    b++;
  } else free(vocab[a].word);
  vocab_size = b;
  for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1;
  for (a = 0; a < vocab_size; a++) {
    // Hash will be re-computed, as it is not actual
    hash = GetWordHash(vocab[a].word);
    while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size;
    vocab_hash[hash] = a;
  }
  fflush(stdout);
  min_reduce++;
}
which is called in the LearnVocabFromTrainFile function.
Assume min_reduce = 5.
So if the input file is not that good, say a word like "hello" has appeared only 4 times by the time ReduceVocab is called, then the vocab will remove "hello".
Later, when ReduceVocab is called again, "hello" has (luckily) appeared 5 more times, and it seems ReduceVocab will remove "hello" again.
In truth, "hello" appeared 9 times and should be in the vocab, but the code above removed it.
It doesn't matter much, as this situation seems to happen seldom. I'm just wondering whether my analysis is right or I've missed something in the code.
Thanks for any advice.

A better URL for reviewing the relevant source is:
https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L185
As I understand it, this is not a bug – just a compromise with non-intuitive effects.
This code uses an intentionally rough/approximate method of ensuring the number of tracked vocabulary terms never exceeds 0.7 * vocab_hash_size (21 million). Whenever the number of terms hits that high-water mark, all terms with min_reduce or fewer occurrences are discarded - & min_reduce is increased to take even more, next time.
(And in practice, this escalating-floor, along with the typical long-tail Zipfian distribution of word frequencies, can mean that at each triggered ReduceVocab operation, most terms are discarded, bringing the total vocab size to something that's way smaller than 0.7 * vocab_hash_size.)
An unavoidable effect of discarding known counts, in an interim running fashion, is that counts after each discard are no longer complete & exact. The relative position of terms in the corpus can thus have a big effect on which terms are ReduceVocab-pruned - with terms that "just miss" the cutoff each time potentially having far more occurrences, in total, than the final min_reduce. And further, all final counts of less-frequent words might be incomplete, if the term's early occurrence counts didn't survive earlier ReduceVocab steps.
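As a toy illustration (not part of word2vec), here is the question's scenario in code: "hello" occurs 4 times before the first prune and 5 more times before the second, so its true total is 9, yet each interim count sits at or below the current floor and the word never survives:

#include <stdio.h>

int main(void) {
    int min_reduce = 5;     /* the floor assumed in the question */
    int hello_count = 0;

    hello_count += 4;                  /* occurrences seen before prune #1 */
    if (hello_count <= min_reduce) {   /* pruned: the interim count is lost */
        printf("prune 1: dropped \"hello\" at count %d\n", hello_count);
        hello_count = 0;
    }
    min_reduce++;                      /* the floor escalates to 6 */

    hello_count += 5;                  /* occurrences seen before prune #2 */
    if (hello_count <= min_reduce) {   /* pruned again, despite a true total of 9 */
        printf("prune 2: dropped \"hello\" at count %d\n", hello_count);
        hello_count = 0;
    }
    return 0;
}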
Still, this approach works to keep the vocabulary-survey from taking an arbitrary amount of RAM, and the imprecision in the tail of rarer word counts isn't too big of a concern in typical cases.
If you have the RAM & want to prevent this behavior, you could edit the source to make vocab_hash_size arbitrarily larger, so that either ReduceVocab() is never triggered (and thus your final counts are exact), or happens rarely enough that any words it affects don't concern you.
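For reference, a minimal sketch of that edit: the constant is defined near the top of word2vec.c (the stock value is 30 million, so ReduceVocab() fires at about 21 million tracked words); the factor of 10 below is purely illustrative.

// word2vec.c: ReduceVocab() triggers when vocab_size exceeds 0.7 * vocab_hash_size,
// so a larger table makes pruning rare or avoids it entirely, at the cost of RAM.
const int vocab_hash_size = 300000000;  // illustrative: 10x the stock 30 million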

(Edit) I wrote the same code (finding prime numbers) in Swift and C, but C is much faster than Swift

(There is an edit below.)
Well, I wrote exactly the same code in Swift and in C. It's a program that finds prime numbers and prints them.
I expected the Swift code to be much faster than the C program, but it isn't.
Is there any reason the Swift code is so much slower than the C code?
When computing up to the 4000th prime number, the C program finished the calculation in only one second.
But Swift finished in 38.8 seconds.
It's much, much slower than I thought.
Here is the code I wrote.
Are there any ways to speed up the Swift code?
(Sorry for the Japanese comments and text in the code.)
Swift
import CoreFoundation

/*
var calendar = Calendar.current
calender.locale = .init(identifier: "ja.JP")
*/

var primeCandidate: Int
var prime: [Int] = []
var countMax: Int

print("いくつ目まで?(最小2、最大100000まで)\n→ ", terminator: "")
countMax = Int(readLine()!)!

var flagPrint: Int
print("表示方法を選んでください。(1:全て順番に表示、2:\(countMax)番目の一つだけ表示)\n→ ", terminator: "")
flagPrint = Int(readLine()!)!

prime.append(2)
prime.append(3)
var currentMaxCount: Int = 2
var numberCount: Int
primeCandidate = 4
var flag: Int = 0
var ix: Int

let startedTime = clock()
//let startedTime = time()
//.addingTimeInterval(0.0)

while currentMaxCount < countMax {
    for ix in 2..<primeCandidate {
        if primeCandidate % ix == 0 {
            flag = 1
            break
        }
    }
    if flag == 0 {
        prime.append(primeCandidate)
        currentMaxCount += 1
    } else if flag == 1 {
        flag = 0
    }
    primeCandidate += 1
}

let endedTime = clock()
//let endedTime = Time()
//.timeIntervalSince(startedTime)

if flagPrint == 1 {
    print("計算された素数の一覧:", terminator: "")
    let completedPrimeNumber = prime.map {
        $0
    }
    print(completedPrimeNumber)
    //print("\(prime.map)")
    print("\n\n終わり。")
} else if flagPrint == 2 {
    print("\(currentMaxCount)番目の素数は\(prime[currentMaxCount - 1])です。")
}
print("\(countMax)番目の素数まで計算。")
print("計算経過時間: \(round(Double((endedTime - startedTime) / 100000)) / 10)秒")
C
#include <stdio.h>
#include <time.h> // for measuring elapsed time

int main(void)
{
    int primeCandidate;
    unsigned int prime[100000];
    int countMax;

    printf("いくつ目まで?(最小2、最大100000まで)\n→ ");
    scanf("%d", &countMax);

    int flagPrint;
    printf("表示方法を選んでください。(1:全て順番に表示、2:%d番目の一つだけ表示)\n→ ", countMax);
    scanf("%d", &flagPrint);

    prime[0] = 2;
    prime[1] = 3;
    int currentMaxCount = 2;
    int numberCount;
    primeCandidate = 4;
    int flag = 0;
    int ix;

    int startedTime = time(NULL);
    for(;currentMaxCount < countMax;primeCandidate++){
        /*
        for(numberCount = 0;numberCount < currentMaxCount - 1;numberCount++){
            if(primeCandidate % prime[numberCount] == 0){
                flag = 1;
                break;
            }
        }
        */
        for(ix = 2;ix < primeCandidate;++ix){
            if(primeCandidate % ix == 0){
                flag = 1;
                break;
            }
        }
        if(flag == 0){
            prime[currentMaxCount] = primeCandidate;
            currentMaxCount++;
        } else if(flag == 1){
            flag = 0;
        }
    }
    int endedTime = time(NULL);

    if(flagPrint == 1){
        printf("計算された素数の一覧:");
        for(int i = 0;i < currentMaxCount - 1;i++){
            printf("%d, ", prime[i]);
        }
        printf("%d.\n\n終わり", prime[currentMaxCount - 1]);
    } else if(flagPrint == 2){
        printf("%d番目の素数は「%d」です。\n",currentMaxCount ,prime[currentMaxCount - 1]);
    }
    printf("%d番目の素数まで計算", countMax);
    printf("計算経過時間: %d秒\n", endedTime - startedTime);
    return 0;
}
**Add**
I found one of the reasons myself.
for ix in 0..<currentMaxCount - 1 {
    if primeCandidate % prime[ix] == 0 {
        flag = 1
        break
    }
}
I had written the code to test divisibility against every number rather than only the primes found so far. That was a mistake.
But even after fixing it with the code above, Swift finishes the calculation in 4.7 seconds.
That's still about 4 times slower than C.
The fundamental cause
As with most of these "why does this same program in 2 different languages perform differently?", the answer is almost always: "because they're not the same program."
They might be similar in high-level intent, but they're implemented differently enough that you can distinguish their performance.
Sometimes they're different in ways you can control (e.g. you use an array in one program and a hash set in the other) or sometimes in ways you can't (e.g. you're using CPython and you're experiencing the overhead of interpretation and dynamic method dispatch, as compared to compiled C function calls).
Some example differences
In this case, there's a few notable differences I can see:
The prime array in your C code uses unsigned int, which is typically akin to UInt32. Your Swift code uses Int, which is typically equivalent to Int64. It's twice the size, which doubles memory usage and decreases the efficacy of the CPU cache.
Your C code pre-allocates the prime array on the stack, whereas your Swift code starts with an empty Array, and repeatedly grows it as necessary.
Your C code doesn't pre-initialize the contents of the prime array. Any junk that might be leftover in the memory is still there to be observed, whereas the Swift code will zero-out all the array memory before use.
All Swift arithmetic operations are checked for overflow. This introduces a branch within every single +, %, etc. That's good for program safety (overflow bugs will never be silent and will always be detected), but sub-optimal in performance-critical code where you're certain that overflow is impossible. There are unchecked variants of the standard integer operators that you can use, such as &+, &-, etc.
The general trend
In general, you'll notice a trend that Swift optimizes for safety and developer experience, whereas C optimizes for being close to the hardware. Swift optimizes for allowing the developer to express their intent about the business logic, whereas C optimizes for allowing the developer to express their intent about the final machine code that runs.
There are typically "escape hatches" in Swift that let you sacrifice safety or convenience for C-like performance. This sounds bad, but arguably you can view C as exclusively using these escape hatches. There's no Array, Dictionary, automatic reference counting, Sequence algorithms, etc. E.g. what Swift calls UnsafePointer is just a "pointer" in C. "Unsafe" comes with the territory.
Improving the performance
You could get pretty far in hitting performance parity by:
Pre-allocating a sufficiently large array with [Array.reserveCapacity(_:)](https://developer.apple.com/documentation/swift/array/reservecapacity(_:)). See this note in the Array documentation:
Growing the Size of an Array
Every array reserves a specific amount of memory to hold its contents. When you add elements to an array and that array begins to exceed its reserved capacity, the array allocates a larger region of memory and copies its elements into the new storage. The new storage is a multiple of the old storage’s size. This exponential growth strategy means that appending an element happens in constant time, averaging the performance of many append operations. Append operations that trigger reallocation have a performance cost, but they occur less and less often as the array grows larger.
If you know approximately how many elements you will need to store, use the reserveCapacity(_:) method before appending to the array to avoid intermediate reallocations. Use the capacity and count properties to determine how many more elements the array can store without allocating larger storage.
For arrays of most Element types, this storage is a contiguous block of memory. For arrays with an Element type that is a class or #objc protocol type, this storage can be a contiguous block of memory or an instance of NSArray. Because any arbitrary subclass of NSArray can become an Array, there are no guarantees about representation or efficiency in this case.
Use UInt32 or Int32 instead of Int.
If necessary, drop down to UnsafeMutableBufferPointer<UInt32> instead of Array<UInt32>. This is closer to the simple pointer implementation used in your C example.
You can use unchecked arithmetic operators like &+, &-, &* and so on. Obviously, you should only do this when you're absolutely certain that overflow is impossible. Given how many thousands of silent overflow-related bugs have come and gone, this is almost always a bad bet, but the loaded gun is available for you if you insist.
These aren't things you should generally do. They're merely possibilities that exist if they're necessary to improve performance of critical code.
For example, the Swift convention is to generally use Int unless you have a good reason to use something else. For example, Array.count returns an Int, even though it can never be negative, and is unlikely to ever need to be more than UInt32.max.
You've forgotten to turn on the optimizer. Swift is much slower than C without optimization, but on things like this it's roughly the same when optimized:
➜ x swift -O prime.swift
いくつ目まで?(最小2、最大100000まで)
→ 40000
表示方法を選んでください。(1:全て順番に表示、2:40000番目の一つだけ表示)
→ 2
40000番目の素数は479909です。
40000番目の素数まで計算。
計算経過時間: 5.9秒
➜ x clang -O3 prime.c && ./a.out
いくつ目まで?(最小2、最大100000まで)
→ 40000
表示方法を選んでください。(1:全て順番に表示、2:40000番目の一つだけ表示)
→ 2
40000番目の素数は「479909」です。
40000番目の素数まで計算計算経過時間: 6秒
This is without doing any work to improve your code (probably the most significant change would be pre-allocating the buffer as you do in C, which turns out not to matter much here).

Dynamically indexing an array in C

Is it possible to create arrays based on their index, as in
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y] = someNr;
dynamically/on the run, without creating foo[0...3][0...4]?
If not, is there a data structure that allow me to do something similar to this in C?
No.
As written, your code makes no sense at all. You need foo to be declared somewhere, and then you can index into it with foo[x][y] = someNr;. But you can't just make foo spring into existence, which is what it looks like you are trying to do.
Either create foo with the correct sizes (only you can say what they are), int foo[16][16]; for example, or use a different data structure.
In C++ you could use a map<pair<int, int>, int>.
Variable Length Arrays
Even if x and y were replaced by constants, you could not initialize the array using the notation shown. You'd need to use:
int fixed[3][4] = { someNr };
or similar (extra braces, perhaps; more values perhaps). You can, however, declare/define variable length arrays (VLA), but you cannot initialize them at all. So, you could write:
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y];
for (int i = 0; i < x; i++)
{
    for (int j = 0; j < y; j++)
        foo[i][j] = someNr + i * (x + 1) + j;
}
Obviously, you can't use x and y as indexes without writing (or reading) outside the bounds of the array. The onus is on you to ensure that there is enough space on the stack for the values chosen as the limits on the arrays (it won't be a problem at 3x4; it might be at 300x400 though, and will be at 3000x4000). You can also use dynamic allocation of VLAs to handle bigger matrices.
VLA support is mandatory in C99, optional in C11 and C18, and non-existent in strict C90.
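For the bigger sizes mentioned above, here is a minimal sketch of the dynamically allocated VLA approach (the 300x400 size is just the example from the previous paragraph):

#include <stdlib.h>

int main(void)
{
    int x = 300, y = 400;                      /* sizes too big to want on the stack */
    int someNr = 123;
    int (*big)[y] = malloc(x * sizeof *big);   /* x rows of y ints, on the heap */
    if (big == NULL)
        return 1;
    for (int i = 0; i < x; i++)
        for (int j = 0; j < y; j++)
            big[i][j] = someNr;
    /* ... use big[i][j] like a normal 2-D array ... */
    free(big);
    return 0;
}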
Sparse arrays
If what you want is 'sparse array support', there is no built-in facility in C that will assist you. You have to devise (or find) code that will handle that for you. It can certainly be done; Fortran programmers used to have to do it quite often in the bad old days when megabytes of memory were a luxury and MIPS meant millions of instructions per second and people were happy when their computer could do double-digit MIPS (and the Fortran 90 standard was still years in the future).
You'll need to devise a structure and a set of functions to handle the sparse array. You will probably need to decide whether you have values in every row, or whether you only record the data in some rows. You'll need a function to assign a value to a cell, and another to retrieve the value from a cell. You'll need to think what the value is when there is no explicit entry. (The thinking probably isn't hard. The default value is usually zero, but an infinity or a NaN (not a number) might be appropriate, depending on context.) You'd also need a function to allocate the base structure (would you specify the maximum sizes?) and another to release it.
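To make that concrete, here is a minimal sketch of such a structure, assuming a plain linked list of explicitly stored cells and a default value of zero; the names (struct sparse, sparse_get, sparse_set) are illustrative, and a real implementation would add create/release functions and probably a smarter representation (per-row lists or hashing):

#include <stdlib.h>

/* One explicitly stored cell of the sparse 2-D array. */
struct cell {
    int row, col;
    int value;
    struct cell *next;
};

struct sparse {
    struct cell *head;   /* unordered list of non-default cells */
    int default_value;   /* value reported for cells never assigned */
};

/* Retrieve the value at (row, col), falling back to the default. */
int sparse_get(const struct sparse *s, int row, int col) {
    for (const struct cell *c = s->head; c != NULL; c = c->next)
        if (c->row == row && c->col == col)
            return c->value;
    return s->default_value;
}

/* Assign a value to (row, col), creating the cell if needed.
   Returns 0 on success, -1 on allocation failure. */
int sparse_set(struct sparse *s, int row, int col, int value) {
    for (struct cell *c = s->head; c != NULL; c = c->next)
        if (c->row == row && c->col == col) {
            c->value = value;
            return 0;
        }
    struct cell *c = malloc(sizeof *c);
    if (c == NULL) return -1;
    c->row = row;
    c->col = col;
    c->value = value;
    c->next = s->head;
    s->head = c;
    return 0;
}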
The most efficient way to create a dynamic index for an array is to create an empty array of the same data type as the one held by the array to be indexed.
Let's imagine we are using integers, for the sake of simplicity. You can then stretch the concept to any other data type.
The ideal index depth will depend on the length of the data to index, and will be somewhere close to that length.
Let's say you have 1 million 64 bit integers in the array to index.
First of all you should order the data and eliminate duplicates. That's easy to achieve using qsort() (the C standard library sort function) and a remove-duplicates function such as
/* Copies the unique entries of a sorted string array into ord_arr.
   Assumes unord_arr has already been sorted (e.g. with qsort()) and that
   every ord_arr[j] already points to a buffer large enough for a string. */
uint64_t remove_dupes(char **unord_arr, char **ord_arr, uint64_t arr_size)
{
    uint64_t i, j = 0;
    for (i = 1; i < arr_size; i++)
    {
        if (strcmp(unord_arr[i], unord_arr[i - 1]) != 0) {
            strcpy(ord_arr[j], unord_arr[i - 1]);
            j++;
        }
        if (i == arr_size - 1) {
            strcpy(ord_arr[j], unord_arr[i]);
            j++;
        }
    }
    return j;
}
Adapt the code above to your needs; you should free() the unordered array once the function has finished copying into the ordered array. The function above is very fast, but it will return zero entries when the array to order contains a single element; that's probably something you can live with.
Once the data is ordered and unique, create an index with a length close to that of the data. It does not need to be an exact length, although sticking to powers of 10 will make everything easier in the case of integers.
uint64_t* idx = calloc(pow(10, indexdepth), sizeof(uint64_t));
This will create an empty index array.
Then populate the index. Traverse the array to be indexed just once, and every time you detect a change in its leading significant figures (as many of them as the index depth), record the position where that new prefix was first seen.
If you choose an indexdepth of 2 you will have 10² = 100 possible values in your index, typically going from 0 to 99.
When you detect that some number starts with 10 (e.g. 103456), you add an entry to the index. Let's say 103456 was detected at position 733; your index entry would be:
index[10] = 733;
The next entry, beginning with 11, should be added in the next index slot. Let's say the first number beginning with 11 is found at position 2023:
index[11] = 2023;
And so on.
When you later need to find some number in your original array of 1 million entries, you don't have to iterate over the whole array; you just check where in your index the first number starting with the same two significant digits is stored. Entry index[10] tells you where the first number starting with 10 is stored. You can then iterate forward until you find your match.
In my example I employed a small index, so the average number of iterations you will need to perform will be about 1,000,000 / 100 = 10,000.
If you enlarge your index to somewhere close to the length of the data, the number of iterations will tend to 1, making any search blazingly fast.
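Here is a minimal sketch of that lookup scheme, assuming every value fits in 6 decimal digits (so the bucket for 103456 is 10, matching the example above); the names build_index and find are illustrative, and values outside that range are simply not indexed:

#include <stdint.h>

#define INDEX_DEPTH 2
#define BUCKETS 100            /* 10^INDEX_DEPTH possible prefixes, 00..99 */
#define BUCKET_DIV 10000ULL    /* 10^(6 - INDEX_DEPTH): strips all but the 2 leading digits */

/* index[b] = position of the first element whose two leading digits are b,
   or UINT64_MAX if no element has that prefix. data must be sorted and unique. */
void build_index(const uint64_t *data, uint64_t n, uint64_t *index) {
    for (int b = 0; b < BUCKETS; b++)
        index[b] = UINT64_MAX;
    for (uint64_t i = 0; i < n; i++) {
        uint64_t b = data[i] / BUCKET_DIV;
        if (b < BUCKETS && index[b] == UINT64_MAX)
            index[b] = i;      /* first position where this prefix appears */
    }
}

/* Linear scan, but starting at the indexed bucket instead of at position 0.
   Returns the position of key, or -1 if it is not present. */
long long find(const uint64_t *data, uint64_t n, const uint64_t *index, uint64_t key) {
    uint64_t b = key / BUCKET_DIV;
    if (b >= BUCKETS || index[b] == UINT64_MAX)
        return -1;
    for (uint64_t i = index[b]; i < n && data[i] / BUCKET_DIV == b; i++)
        if (data[i] == key)
            return (long long)i;
    return -1;
}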
What I like to do is create a simple algorithm that tells me the ideal depth of the index once I know the type and length of the data to index.
Please note that in the example I have posted, 64-bit numbers are indexed by their first index-depth significant figures, so 10 and 100001 will be stored in the same index segment. That's not a problem on its own; nonetheless, each master has his small book of secrets. Treating the numbers as fixed-length hexadecimal strings can help keep a strict numerical order.
You don't have to change the base, though; you could consider 10 to be 0000010 to keep it in the 00 index segment and keep base-10 numbers ordered. Using different numerical bases is nonetheless trivial in C, which is of great help for this task.
As you make the index depth larger, the number of entries per index segment is reduced.
Please do note that programming, especially at a lower level like C, consists in great part of understanding the trade-off between CPU cycles and memory use.
Creating the proposed index is a way to reduce the number of CPU cycles required to locate a value, at the cost of using more memory as the index becomes larger. This is nonetheless the way to go nowadays, as massive amounts of memory are cheap.
As SSD speeds get closer to that of RAM, using files to store indexes is worth taking into account. Nevertheless, modern OSs tend to load into RAM as much as they can, so using files would end up performing similarly anyway.

Fastest algorithm to figure out if an array has at least one duplicate

I have a quite peculiar case here. I have a file containing several million entries and want to find out if there exists at least one duplicate. The language here isn't of great importance, but C seems like a reasonable choice for speed. Now, what I want to know is what kind of approach to take to this? Speed is the primary goal here. Naturally, we want to stop looking as soon as one duplicate is found, that's clear, but when the data comes in, I don't know anything about how it's sorted. I just know it's a file of strings, separated by newline. Now keep in mind, all I want to find out is if a duplicate exists. Now, I have found a lot of SO questions regarding finding all duplicates in an array, but most of them go the easy and comprehensive way, rather than the fastest.
Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Please note that this is not strictly theoretical. It will be tested on a machine (Intel i7 with 8GB RAM), so I do have to take into consideration the time of making a string comparison etc. Which is why I'm also wondering if it could be faster to split the strings in two, and first compare the integer part, as an int comparison will be quicker, and then the string part? Of course, that will also require me to split the string and cast the second half to an int, which might be slower...
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).
If X ∈ {A..Z} and N ∈ {0..9}, that gives 26^3 * 10^3 = 17,576,000 possible values ... a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take ~2 MB for this.
Here you go: a Python script to generate all 17,576,000 possible keys:

import itertools
from string import ascii_uppercase

for prefix in itertools.product(ascii_uppercase, repeat=3):
    for numeric in range(1000):
        print("%s%03d" % (''.join(prefix), numeric))
and a simple C bitset filter:
#include <limits.h>

/* convert number of bits into number of bytes */
int filterByteSize(int max) {
    return (max + CHAR_BIT - 1) / CHAR_BIT;
}

/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
    int byteIndex = value / CHAR_BIT;
    unsigned char mask = 1 << (value % CHAR_BIT);
    unsigned char byte = filter[byteIndex];
    filter[byteIndex] = byte | mask;
    return byte & mask;
}
which for your purposes you'd use like so:
#include <stdlib.h>

/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
    int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
    return calloc(filterByteSize(maxKey), 1);
}

/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
    int alpha = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
    int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
    int key = numeric + 1000 * alpha;
    return filterTestAndSet(filter, key);
}
#include <stdio.h>

int main() {
    unsigned char *filter = allocMyFilter();
    char key[8]; /* 6 chars + newline + nul */
    while (fgets(key, sizeof(key), stdin)) {
        if (testAndSetMyKey(filter, key)) {
            printf("collision: %s\n", key);
            return 1;
        }
    }
    return 0;
}
This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:
useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt
real 0m0.474s
user 0m0.436s
sys 0m0.036s
useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000
real 0m0.467s
user 0m0.452s
sys 0m0.016s
admittedly the input file is cached in memory for these runs.
The reasonable answer is to keep the algorithm with the smallest complexity. I encourage you to use a hash table to keep track of inserted elements; the final algorithm complexity is O(n), because searching in a hash table is O(1) on average. In your case, I suggest running the algorithm while reading the file.
public static bool ThereAreDuplicates(string[] inputs)
{
    var hashTable = new Hashtable();
    foreach (var input in inputs)
    {
        if (hashTable[input] != null)
            return true;
        hashTable.Add(input, string.Empty);
    }
    return false;
}
A fast but memory-inefficient solution would use
// Entries are AAA####
char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */] = { 0 };  // or calloc() this
char buffer[100];

while (fgets(buffer, sizeof buffer, istream)) {
    unsigned long index = strtoul(buffer, NULL, 36);
    if (found[index]++) {
        Dupe_found();
        break;
    }
}
The trouble with the post is that it asks for the "fastest algorithm", but does not detail memory concerns and their relative importance to speed. So speed must be king, and the above wastes little time. It does meet the "stop looking as soon as one duplicate is found" requirement.
Depending on how many different things there can be, you have some options:
Sort the whole array and then look for a repeating element; complexity is O(n log n), but it can be done in place, so memory will be O(1). (A sketch of this option follows below.)
Build a set of all elements. Depending on the chosen set implementation this can be O(n) (a hash set) or O(n log n) (a binary tree), but it will cost you some memory.
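To make the first option concrete, here is a minimal sketch, assuming the entries are the 6-character XXXNNN strings from the question stored one per row in a contiguous array (the names KEYLEN, cmp_key and has_duplicate are illustrative):

#include <stdlib.h>
#include <string.h>

#define KEYLEN 7  /* 6 characters plus the terminating nul */

/* Compare two fixed-size key rows as C strings. */
static int cmp_key(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

/* Returns 1 if keys[0..n-1] contains at least one duplicate, else 0. */
int has_duplicate(char keys[][KEYLEN], size_t n) {
    qsort(keys, n, KEYLEN, cmp_key);             /* O(n log n), in place */
    for (size_t i = 1; i < n; i++)
        if (strcmp(keys[i - 1], keys[i]) == 0)   /* duplicates end up adjacent */
            return 1;
    return 0;
}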
The fastest way to find out if an array contains at least one duplicate is to use a bitmap, multiple CPUs and an (atomic or not) "test and set bit" instruction (e.g. lock bts on 80x86).
The general idea is to divide the array into "total elements / number of CPUs" sized pieces and give each piece to a different CPU. Each CPU processes its piece of the array by calculating an integer and doing the atomic "test and set bit" for the bit corresponding to that integer.
However, the problem with this approach is that you're modifying something that all CPUs are using (the bitmap). A better idea is to give each CPU a range of integers (e.g. CPU number N does all integers from "(min - max) * N / CPUs" to "(min - max) * (N+1) / CPUs"). This means that all CPUs read the entire array, but each CPU only modifies its own private piece of the bitmap. This avoids some performance problems involved with cache coherency protocols ("read for ownership of cache line") and also avoids the need for atomic instructions.
Then next step beyond that is to look at how you're converting your "3 characters and 3 digits" strings into an integer. Ideally, this can/would be done using SIMD; which would require that the array is in "structure of arrays" format (and not the more likely "array of structures" format). Also note that you can convert the strings to integers first (in an "each CPU does a subset of the strings" way) to avoid the need for each CPU to convert each string and pack more into each cache line.
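As a minimal sketch of the private-bitmap idea from the previous paragraphs, assuming the XXXNNN keys have already been converted to integers in the range 0 .. 26*26*26*1000 - 1 (the thread count, the byte-per-key "bitmap" and names like has_duplicate_mt are illustrative simplifications):

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define KEYSPACE (26 * 26 * 26 * 1000)
#define NTHREADS 4

static const int *g_keys;       /* shared, read-only key array */
static size_t g_nkeys;

struct range { int lo, hi; };   /* this thread owns keys in [lo, hi) */

/* Each thread reads the whole array but only marks keys in its own range,
   so its bitmap is private and no atomic instructions are needed. */
static void *worker(void *arg) {
    struct range r = *(struct range *)arg;
    unsigned char *seen = calloc((size_t)(r.hi - r.lo), 1);
    intptr_t dup = 0;
    if (seen == NULL)
        return (void *)0;
    for (size_t i = 0; i < g_nkeys; i++) {
        int k = g_keys[i];
        if (k < r.lo || k >= r.hi) continue;     /* some other thread's key */
        if (seen[k - r.lo]) { dup = 1; break; }
        seen[k - r.lo] = 1;
    }
    free(seen);
    return (void *)dup;                          /* 1 if a duplicate was seen */
}

/* Returns 1 if keys[0..n-1] contains a duplicate, 0 otherwise. */
int has_duplicate_mt(const int *keys, size_t n) {
    pthread_t tid[NTHREADS];
    struct range ranges[NTHREADS];
    g_keys = keys;
    g_nkeys = n;
    for (int t = 0; t < NTHREADS; t++) {
        ranges[t].lo = (int)((long long)KEYSPACE * t / NTHREADS);
        ranges[t].hi = (int)((long long)KEYSPACE * (t + 1) / NTHREADS);
        pthread_create(&tid[t], NULL, worker, &ranges[t]);
    }
    int dup = 0;
    for (int t = 0; t < NTHREADS; t++) {
        void *res;
        pthread_join(tid[t], &res);
        if (res != NULL) dup = 1;
    }
    return dup;
}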
Since you have several million entries, I think the best algorithm would be based on counting sort. Counting sort does exactly what you asked: it sorts an array by counting how many times every element occurs. So you could write a function that does the counting pass over the array:

#include <stdlib.h>

/* Returns 1 as soon as a duplicate is found, 0 otherwise.
   max is the largest value in a[], found beforehand in O(n). */
int counting_sort(int a[], int n, int max)
{
    int *count = calloc(max + 1, sizeof *count);
    int i;
    for (i = 0; i < n; ++i) {
        count[a[i]]++;
        if (count[a[i]] >= 2) {
            free(count);
            return 1;
        }
    }
    free(count);
    return 0;
}

where you should first find the max element (in O(n)). The asymptotic time complexity of counting sort is O(max(n, M)), where M is the max value found in the array. Because you have several million entries, if M is of the order of some millions this will work in O(n) (or less, since we stop early, but finding M already costs O(n)). If you also know that there is no way M can be greater than some millions, then you can be sure this gives O(n) and not just O(max(n, M)).
You can see counting sort visualization to understand it better, here:
https://www.cs.usfca.edu/~galles/visualization/CountingSort.html
Note that in the above function we don't implement counting sort exactly; we stop as soon as we find a duplicate, which is even more efficient, since you only want to know whether there is a duplicate.

Efficient search for series of values in an array? Ideally OpenCL usable?

I have a massive array I need to search (actually it's a massive array of smaller arrays, but for all intents and purposes, let's consider it one huge array). What I need to find is a specific series of numbers. Obviously, a simple for loop will work:
Pseudocode:
for (x = 0; x < arraylength; x++) {
    if (array[x] == searchfor[location])
        location++;
    else
        location = 0;
    if (location >= strlen(searchfor))
        return FOUND_IT;
}
Thing is I want this to be efficient. And in a perfect world, I do NOT want to return the prepared data from an OpenCL kernel and do a simple search loop.
I'm open to non-OpenCL ideas, but something I can implement across a work group size of 64 on a target array length of 1024 would be ideal.
I'm kicking around ideas (split the target across work items, compare each item, looped, against each target, if it matches, set a flag. After all work items complete, check flags. Though as I write that, that sounds very inefficient) but I'm sure I'm missing something.
Another idea was that since the target array is uchar, I could lump 8 bytes together (as one double-sized, 64-bit value) and check 8 indexes at a time. Not sure I can do that easily in OpenCL.
Also toying with the idea of hashing the search target with something fast, MD5 likely, then grabbing strlen(searchtarget) characters at a time, hashing it, and seeing if it matches. Not sure how much the hashing will kill my search speed though.
Oh - code is in C, so no C++ maps (something I found while googling that seems like it might help?)
Based on comments above, for future searches, it seems a simple for loop scanning the range IS the most efficient way to find matches given an OpenCL implementation.
Create an index array with one slot for every possible uchar value (UCHAR_MAX + 1 entries, i.e. 256). For each uchar in the search string, set array[uchar] to that uchar's position in the search string. The rest of the array contains -1.

#include <limits.h>
#include <string.h>

unsigned searchindexing[UCHAR_MAX + 1];

memset(searchindexing, 0xFF, sizeof searchindexing);  /* every entry becomes (unsigned)-1 */
for (i = 0; i < strlen(searchfor); i++)
    searchindexing[(unsigned char)searchfor[i]] = i;

Because the loop runs forward from the start of searchfor, a uchar that occurs more than once ends up recorded with its last position (later assignments overwrite earlier ones), which is what the stepping logic below relies on.
Then you search the array by stepping strlen(searchfor) positions at a time, unless you find a uchar that occurs in searchfor.
for (i = 0; i < MAXARRAYLEN; i += strlen(searchfor))
    if ((unsigned)-1 != searchindexing[array[i]]) {
        i -= searchindexing[array[i]];
        if (!memcmp(searchfor, &array[i], strlen(searchfor)))
            return FOUND_IT;
    }
If most of the uchars in the array aren't in searchfor, this is probably the fastest way. Note the code has not been optimized.
Example: searchfor = "banana", strlen is 6. searchindexing['a'] = 5, ['b'] = 0, ['n'] = 4, and the rest hold a value not between 0 and 5, like -1 or UINT_MAX. If array[i] is something not in "banana", like a space, i increments by 6. If array[i] is 'a', you might be inside "banana", and it could be any of the 3 'a's. So we assume the last 'a', move 5 places back and compare with searchfor. If it's a success, we found it; otherwise we step 6 places forward.

very large loop counts in c

How can I run a loop in C for a very large count, e.g. 2^1000 times?
Also, using two nested loops that run a and b times respectively, we get a block that runs a*b times. Is there any smart method for running a loop a^b times?
You could loop recursively, e.g.
#include <stdio.h>

void loop( unsigned a, unsigned b ) {
    unsigned int i;
    if ( b == 0 ) {
        printf( "." );
    } else {
        for ( i = 0; i < a; ++i ) {
            loop( a, b - 1 );
        }
    }
}
...will print a^b . characters.
While I cannot answer your first question (although look into libgmp, which might help you work with large numbers), a way to perform an action a^b times would be to use recursion.

function(a, b) {
    if (b == 0) return;
    for (i = 0; i < a; i++) {
        function(a, b - 1);
    }
}

This will perform the loop a times for each step until b reaches 0, so the innermost level is reached a^b times.
Regarding your reply in one of the comments ("But if I have two lines of input and 2^n lines of trash between them, how do I skip past them?"): can you tell me a real-life scenario where you will see 2^1000 lines of trash that you have to monitor?
For a more reasonable (smaller) number of inputs, you may be able to solve what sounds to be your real need (i.e. handle only relevant lines of input), not by iterating an index, but rather by simply checking each line for the relevant component as it is processed in a while loop...
pseudo code:
BOOL criteriaMet = FALSE;
while(1)
{
    while(!criteriaMet)
    {
        //test next line of input
        //if criteria met, set criteriaMet = TRUE;
        //if criteria met, handle line of input
        //if EOF or similar, break out of loops
    }
    //criteria met, handle it here and continue
    criteriaMet = FALSE;//reset for more searching...
}
Use a b-sized array i[] where each cell holds values from 0 to a-1. For example, for 2^3 use a 3-sized array of booleans.
On each iteration, increment i[0]. If i[0] reaches a, set i[0] to 0 and increment i[1]. If i[1] reaches a, set i[1] to 0 and increment i[2], and so on, until you increment a cell without reaching a. This can easily be done in a loop:
for (int j = 0; j < b; ++j) {
    ++i[j];
    if (i[j] < a) {
        break;        /* no carry needed */
    }
    i[j] = 0;         /* this cell wrapped: carry into the next one */
}
After a iterations, i[0] will return to zero. After a^2 iterations, i[0] and i[1] will both be zero. After a^b iterations, all cells will be 0 and you can exit the loop. You don't need to check the whole array each time; the moment you wrap i[b-1] back to zero, you know the entire array is back to zero.
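A minimal self-contained sketch of that counter, with a = 3 and b = 4 chosen only so the demonstration finishes instantly; the body runs exactly 3^4 = 81 times:

#include <stdio.h>

int main(void) {
    const int a = 3, b = 4;
    int i[4] = {0};             /* b cells, each counting 0..a-1 */
    unsigned long iterations = 0;

    for (;;) {
        /* loop body goes here */
        iterations++;

        /* increment the counter with carry */
        int j;
        for (j = 0; j < b; ++j) {
            ++i[j];
            if (i[j] < a) break;   /* no carry needed */
            i[j] = 0;              /* reset and carry into the next cell */
        }
        if (j == b) break;         /* every cell wrapped: a^b iterations done */
    }
    printf("body ran %lu times\n", iterations);  /* prints 81 */
    return 0;
}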
Your question doesn't make sense. Even when your loop is empty you'd be hard pressed to do more than 2^32 iterations per second. Even in this best case scenario, processing 2^64 loop iterations which you can do with a simple uint64_t variable would take 136 years. This is when the loop does absolutely nothing.
Same thing goes for skipping lines as you later explained in the comments. Skipping or counting lines in text is a matter of counting newlines. In 2006 it was estimated that the world had around 10*2^64 bytes of storage. If we assume that all the data in the world is text (it isn't) and the average line is 10 characters including newline (it probably isn't), you'd still fit the count of numbers of lines in all the data in the world in one uint64_t. This processing would of course still take at least 136 years even if the cache of your cpu was fed straight from 4 10Gbps network interfaces (since it's inconceivable that your machine could have that much disk).
In other words, whatever problem you think you're solving is not a problem of looping more than a normal uint64_t in C can handle. The n in your 2^n can't reasonably be more than 50-55 on any hardware your code can be expected to run on.
So to answer your question: if looping a uint64_t is not enough for you, your best option is to wait at least 30 years until Moore's law has caught up with your problem and solve the problem then. It will go faster than trying to start running the program now. I'm sure we'll have a uint128_t at that time.
