Linear Search Algorithm Optimization - c

I just finished a homework problem for Computer Science 1 (yes, it's homework, but hear me out!). Now, the assignment is 100% complete and working, so I don't need help on it. My question involves the efficiency of an algorithm I'm using (we aren't graded on algorithmic efficiency yet, I'm just really curious).
The function I'm about to present currently uses a modified version of the linear search algorithm (that I came up with, all by myself!) in order to check how many numbers on a given lottery ticket match the winning numbers, assuming that both the numbers on the ticket and the numbers drawn are in ascending order. I was wondering, is there any way to make this algorithm more efficient?
/*
 * Function: ticketCheck
 *
 * @param struct ticket
 * @param array winningNums[6]
 *
 * Takes in a ticket, counts how many numbers
 * in the ticket match, and returns the number
 * of matches.
 *
 * Uses a modified linear search algorithm,
 * in which the index of the successor to the
 * last matched number is used as the index of
 * the first number tested for the next ticket value.
 *
 * @return int numMatches
 */
int ticketCheck( struct ticket ticket, int winningNums[6] )
{
    int numMatches = 0;
    int offset = 0;
    int i;
    int j;
    for( i = 0; i < 6; i++ )
    {
        for( j = 0 + offset; j < 6; j++ )
        {
            if( ticket.ticketNum[i] == winningNums[j] )
            {
                numMatches++;
                offset = j + 1;
                break;
            }
            if( ticket.ticketNum[i] < winningNums[j] )
            {
                i++;
                j--;
                continue;
            }
        }
    }
    return numMatches;
}

It's more or less there, but not quite. In most situations, it's O(n), but it's O(n^2) if every ticketNum is greater than every winningNum. (This is because the inner j loop doesn't break when j==6 like it should, but runs the next i iteration instead.)
You want your algorithm to increment either i or j at each step, and to terminate when i==6 or j==6. [Your algorithm almost satisfies this, as stated above.] As a result, you only need one loop:
for (i=0, j=0; i<6 && j<6; /* no increment step here */) {
    if (ticketNum[i] == winningNum[j]) {
        numMatches++;
        i++;
        j++;
    }
    else if (ticketNum[i] < winningNum[j]) {
        /* ticketNum[i] won't match any winningNum, discard it */
        i++;
    }
    else { /* ticketNum[i] > winningNum[j] */
        /* discard winningNum[j] similarly */
        j++;
    }
}
Clearly this is O(n); at each stage, it either increments i or j, so the most steps it can do is 2*n-1. This has almost the same behaviour as your algorithm, but is easier to follow and easier to see that it's correct.

You're basically looking for the size of the intersection of two sets. Given that most lottos use around 50 balls (or so), you could store the numbers as bits that are set in an unsigned long long. Finding the common numbers is then a simple matter of ANDing the two together: commonNums = TicketNums & winningNums;.
Finding the size of the intersection is a matter of counting the one bits in the resulting number, a subject that's been covered previously (though in this case, you'd use 64-bit numbers, or a pair of 32-bit numbers, instead of a single 32-bit number).
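For concreteness, here is a minimal sketch of that approach in C, assuming all numbers are below 64 so they fit in one unsigned 64-bit mask (the helper names are illustrative, not from the original post):
#include <stdint.h>

/* Build a mask with bit k set for each number k in nums[]. */
uint64_t toMask(const int nums[6]) {
    uint64_t mask = 0;
    for (int i = 0; i < 6; i++)
        mask |= (uint64_t)1 << nums[i];
    return mask;
}

int countMatches(const int ticketNums[6], const int winningNums[6]) {
    uint64_t commonNums = toMask(ticketNums) & toMask(winningNums);
    /* Count the one bits (Kernighan's method); many compilers also
     * provide a builtin such as __builtin_popcountll. */
    int count = 0;
    while (commonNums) {
        commonNums &= commonNums - 1; /* clear the lowest set bit */
        count++;
    }
    return count;
}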

Yes, there is something faster, though it probably uses more memory. Make an array of zeros sized to the range of possible numbers, put a 1 at the index of every drawn number, then for every ticket number add the value at that number's index.
int NumsArray[MAX_NUMBER+1];
memset(NumsArray, 0, sizeof NumsArray);
for( i = 0; i < 6; i++ )
    NumsArray[winningNums[i]] = 1;
for( i = 0; i < 6; i++ )
    numMatches += NumsArray[ticket.ticketNum[i]];
12 loop rounds instead of up to 36.
The surrounding code is left as an exercise.
EDIT: It also has the advantage of not needing to sort both sets of values.

This is really only a minor change at this scale, but if the second loop reaches a number bigger than the current ticket number, it is already allowed to break. Furthermore, if your second loop traverses numbers lower than your ticket number, it may update the offset even if no match is found within that iteration.
PS:
Not to forget, general results on efficiency make more sense if we treat the number of balls or the size of the ticket as variable. Otherwise the result depends too much on the machine.
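As a hedged sketch (not the poster's exact code), here is the original function with both tweaks applied, assuming the ticket and winning numbers are strictly ascending as the question states:
int ticketCheck( struct ticket ticket, int winningNums[6] )
{
    int numMatches = 0;
    int offset = 0;
    int i;
    int j;
    for( i = 0; i < 6; i++ )
    {
        for( j = offset; j < 6; j++ )
        {
            if( ticket.ticketNum[i] == winningNums[j] )
            {
                numMatches++;
                offset = j + 1;
                break;
            }
            if( ticket.ticketNum[i] < winningNums[j] )
            {
                break; /* no later winning number can match either */
            }
            offset = j + 1; /* winningNums[j] is too small to match any later ticket number */
        }
    }
    return numMatches;
}
With both changes, every inner-loop iteration either consumes a winning number (advancing offset) or ends the scan, so the total work is O(n).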

If instead of comparing the arrays of lottery numbers you were to create two bit arrays of flags -- each flag being set if its index is in that array -- then you could perform a bitwise AND on the two bit arrays (the lottery ticket and the winning number sets) and produce another bit array whose bits are flags for matching numbers only. Then count the bits set.
For many lotteries 64 bits would be enough, so a uint64_t should be big enough to cover this. Also, some architectures have instructions to count the bits set in a register, which some compilers might be able to recognize and optimize for.
The efficiency of this algorithm is based both on the range of lottery numbers (M) and the number of lottery numbers per ticket (N). The setting of the flags is O(N), while the ANDing of the two bit arrays and counting of the bits could be O(M), depending on whether your M (lotto number range) is larger than the size that the target CPU can perform these operations on directly. Most likely, though, M will be small and its impact will likely be less than that of N on the performance.


Cycling through interval in C efficiently

I have a dynamically allocated array consisting of a lot of numbers (200,000+) and I have to find out if (and how many of) these numbers are contained in a given interval. There can be duplicates, and all the numbers are in random order.
Example of numbers I get at the beginning:
{1,2,3,1484984,48941651,489416,1816,168189161,6484,8169181,9681916,121,231,684979,795641,231484891,...}
Given interval:
<2;150000>
I created a simple algorithm with 2 for loops cycling through all numbers:
for( int j = 0; j <= numberOfRepeats; j++){
    for( int i = 0; i < arraySize; i++){
        if(currentNumber == array[i]){
            counter++;
        }
    }
    currentNumber++;
}
printf(" -> %d\n", counter);
This algorithm is too slow for my task. Is there a more efficient way for me to implement my solution? Could sorting the array by value help in this case, or would that be too slow?
Example of a working program:
{ 1, 7, 22, 4, 7, 5, 11, 9, 1 }
<4;7>
-> 4
The problem was simple, as the single comment on my question pointed out: there was no reason for the second loop. A single loop can do it alone.
My changed code:
for(int i = 0; i <= arraySize-1; i++){
    if(array[i] <= endOfInterval && array[i] >= startOfInterval){
        counter++;
    }
}
This algorithm is too slow for my task. Is there a more efficient way for me to implement my solution? Could sorting the array by value help in this case, or would that be too slow?
Of course it is slow. A single-pass algorithm should suffice: just count the elements that pass the test (n[i] >= lower_bound && n[i] <= upper_bound, or a similar condition).
Only if you need to handle duplicates specially (e.g. not counting them twice) do you need to track whether you have already seen a value. In that case, the sorting solution will be faster: a qsort(3) call is O(n log n) against the O(n^2) your double loop is doing, and you then make a second pass over the data, giving O(n log n + n) overall, still far lower than O(n^2) for the large amount of data you have.
Sorting has the advantage that it puts all the repeated values together, so you only have to check whether the last element you read was the same as the one you are processing now; if it is different, count it only if it is in the specified range.
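A minimal sketch of that sort-then-scan variant, counting each distinct value in the interval once (the function names are illustrative):
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort, then count the values in [lo, hi], skipping repeated keys. */
size_t count_distinct_in_range(int *arr, size_t n, int lo, int hi) {
    qsort(arr, n, sizeof *arr, cmp_int);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i > 0 && arr[i] == arr[i - 1]) continue; /* duplicate of previous */
        if (arr[i] >= lo && arr[i] <= hi) count++;
    }
    return count;
}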
One final note: reading a set of 200,000 integers into an array just to filter them by some criterion is normally a bad, non-scalable way to solve a problem. Your problem (counting the elements that belong to a given interval) allows a scalable and better solution by streaming: you read a number, check whether it is in the interval, then count it (or output it, or whatever you like to do with it), without using a large amount of memory to hold all the values before starting. That is a far better way to solve the problem, as it lets you read a truly unbounded sequence of numbers (coming e.g. from a file) and produce an output based on it:
#include <stdio.h>

#define A (2)
#define B (150000)

int main()
{
    int the_number;
    size_t count = 0;
    int res;
    while ((res = scanf("%d", &the_number)) > 0) {
        if (the_number >= A && the_number <= B)
            count++;
    }
    printf("%zu numbers fitted in the range\n", count);
}
With this approach you can give the program 10^26 numbers (assuming you have a file system large enough to hold an input file of that size) and it will still handle them; you could never create an in-memory array with the capacity to hold 10^26 values.

Given an array of integers of size n+1 consisting of the elements [1,n]. All elements are unique except one which is duplicated k times

I have been attempting to solve the following problem:
You are given an array of n+1 integers where all the elements lie in [1,n]. You are also given that one of the elements is duplicated a certain number of times, whilst the others are distinct. Develop an algorithm to find both the duplicated number and the number of times it is duplicated.
Here is my solution where I let k = number of duplications:
#include <vector>

struct LatticePoint{ // to hold duplicate and k
    int a;
    int b;
    LatticePoint(int a_, int b_) : a(a_), b(b_) {}
};

LatticePoint findDuplicateAndK(const std::vector<int>& A){
    int n = A.size() - 1;
    std::vector<int> Numbers (n);
    for(int i = 0; i < n + 1; ++i){
        ++Numbers[A[i] - 1]; // A[i] in range [1,n] so no out-of-bounds access
    }
    int i = 0;
    while(i < n){
        if(Numbers[i] > 1) {
            int duplicate = i + 1;
            int k = Numbers[i] - 1;
            LatticePoint result{duplicate, k};
            return result;
        }
        ++i;
    }
    return LatticePoint(-1, -1); // unreachable: the input always contains a duplicate
}
So, the basic idea is this: we go along the array and each time we see the number A[i] we increment the value of Numbers[A[i] - 1]. Since only the duplicate appears more than once, the index of the entry of Numbers with value greater than 1 gives the duplicate number, and the value of that entry, minus 1, gives the number of duplications. This algorithm is O(n) in time complexity and O(n) in space.
I was wondering if someone had a solution that is better in time and/or space? (Or indeed whether there are any errors in my solution...)
You can reduce the scratch space to n bits instead of n ints, provided you either have or are willing to write a bitset with run-time specified size (see boost::dynamic_bitset).
You don't need to collect duplicate counts until you know which element is duplicated, and then you only need to keep that count. So all you need to track is whether you have previously seen the value (hence, n bits). Once you find the duplicated value, set count to 2 and run through the rest of the vector, incrementing count each time you hit an instance of the value. (You initialise count to 2, since by the time you get there, you will have seen exactly two of them.)
That's still O(n) space, but the constant factor is a lot smaller.
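A hedged C sketch of that idea, assuming C99 and values in [1, n] as the problem states (the function name is illustrative):
#include <stdint.h>
#include <string.h>

/* a has n+1 elements; writes the duplicated value and its total count. */
void findDuplicateAndCount(const int a[], int n, int *dup, int *count) {
    uint8_t seen[(n + 7) / 8]; /* n bits of scratch space */
    memset(seen, 0, sizeof seen);
    int i = 0;
    *dup = 0;
    for (; i <= n; ++i) {
        int bit = a[i] - 1;
        if (seen[bit / 8] & (1u << (bit % 8))) { *dup = a[i]; break; }
        seen[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
    /* By this point we have seen the duplicate exactly twice. */
    *count = 2;
    for (++i; i <= n; ++i)
        if (a[i] == *dup) ++*count;
}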
The idea of your code works.
But, thanks to the n+1 elements, we can achieve other tradeoffs of time and space.
If we have some number of buckets we're dividing numbers between, putting n+1 numbers in means that some bucket has to wind up with more than expected. This is a variant on the well-known pigeonhole principle.
So we use 2 buckets, one for the range 1..floor(n/2) and one for floor(n/2)+1..n. After one pass through the array, we know which half the answer is in. We then divide that half into halves, make another pass, and so on. This leads to a binary search which will get the answer with O(1) data, and with ceil(log_2(n)) passes, each taking time O(n). Therefore we get the answer in time O(n log(n)).
Now we don't need to use 2 buckets. If we used 3, we'd take ceil(log_3(n)) passes. So as we increase the fixed number of buckets, we take more space and save time. Are there other tradeoffs?
Well, you showed how to do it in 1 pass with n buckets. How many buckets do you need to do it in 2 passes? The answer turns out to be at least sqrt(n) buckets. And 3 passes is possible with the cube root. And so on.
So you get a whole family of tradeoffs where the more buckets you have, the more space you need, but the fewer passes you make. And your solution is merely at one extreme end, taking the most space and the least time.
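For concreteness, a hedged C sketch of the two-bucket binary search (v holds the n+1 values in [1, n]; the function name is illustrative):
/* Each pass counts how many elements fall in the lower half of the
 * current value range; by the pigeonhole argument above, whichever
 * half holds more elements than it has distinct values holds the
 * duplicate. */
int findDuplicate(const int v[], int n) {
    int lo = 1, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int count = 0;
        for (int i = 0; i <= n; ++i)
            if (v[i] >= lo && v[i] <= mid) count++;
        if (count > mid - lo + 1) hi = mid; /* duplicate in [lo, mid] */
        else lo = mid + 1;                  /* otherwise in (mid, hi] */
    }
    return lo;
}
Once the duplicated value is known, one more O(n) pass over v counting occurrences of that value yields k.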
Here's a cheekier algorithm, which requires only constant space but rearranges the input vector. (It only reorders; all the original elements are still present at the end.)
It's still O(n) time, although that might not be completely obvious.
The idea is to try to rearrange the array so that A[i] is i, until we find the duplicate. The duplicate will show up when we try to put an element at the right index and it turns out that that index already holds that element. With that, we've found the duplicate; we have a value we want to move to A[j] but the same value is already at A[j]. We then scan through the rest of the array, incrementing the count every time we find another instance.
#include <utility>
#include <vector>

std::pair<int, int> count_dup(std::vector<int> A) {
    /* Try to put each element in its "home" position (that is,
     * where the value is the same as the index). Since the
     * values start at 1, A[0] isn't home to anyone, so we start
     * the loop at 1.
     */
    int n = A.size();
    for (int i = 1; i < n; ++i) {
        while (A[i] != i) {
            int j = A[i];
            if (A[j] == j) {
                /* j is the duplicate. Now we need to count them.
                 * We have one at i. There's one at j, too, but we only
                 * need to add it if we're not going to run into it in
                 * the scan. And there might be one at position 0. After that,
                 * we just scan through the rest of the array.
                 */
                int count = 1;
                if (A[0] == j) ++count;
                if (j < i) ++count;
                for (++i; i < n; ++i) {
                    if (A[i] == j) ++count;
                }
                return std::make_pair(j, count);
            }
            /* This swap can only happen once per element. */
            std::swap(A[i], A[j]);
        }
    }
    /* If we get here, every element from 1 to n is at home.
     * So the duplicate must be A[0], and the duplicate count
     * must be 2.
     */
    return std::make_pair(A[0], 2);
}
A parallel solution with O(1) complexity is possible.
Introduce an array of atomic booleans and two atomic integers called duplicate and count. First set count to 1. Then access the array in parallel at the index positions of the numbers and perform a test-and-set operation on the boolean. If a boolean is set already, assign the number to duplicate and increment count.
This solution may not always perform better than the suggested sequential alternatives. Certainly not if all numbers are duplicates. Still, it has constant complexity in theory. Or maybe linear complexity in the number of duplicates. I am not quite sure. However, it should perform well when using many cores and especially if the test-and-set and increment operations are lock-free.
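Here is a hedged sketch of that idea using C11 atomics and POSIX threads (the array contents, thread count, and work split are illustrative; compile with -pthread):
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define N 8 /* values lie in [1, N]; the array has N+1 elements */
static int A[N + 1] = {3, 1, 4, 3, 5, 2, 3, 6, 7};
static atomic_bool seen[N + 1];
static atomic_int duplicate;
static atomic_int count = 1;

static void *scan(void *arg) {
    const int *range = arg; /* [start, end) slice of A */
    for (int i = range[0]; i < range[1]; ++i) {
        /* test-and-set: atomic_exchange returns the previous value */
        if (atomic_exchange(&seen[A[i]], true)) {
            atomic_store(&duplicate, A[i]);
            atomic_fetch_add(&count, 1);
        }
    }
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ranges[2][2] = {{0, (N + 1) / 2}, {(N + 1) / 2, N + 1}};
    for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, scan, ranges[i]);
    for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);
    printf("duplicate %d occurs %d times\n",
           atomic_load(&duplicate), atomic_load(&count));
    return 0;
}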

Specific permutations of 32 card deck (in C)

I want to generate all permutations of a 32-card deck. I represent cards as numbers 0-7, so I don't care about the color of the card. The game is very simple (divide the deck into two groups, compare two cards, add both cards to the group of the bigger card). I have already coded this part of the game, but the deck is currently generated randomly, and I want to look at all the possible decks and make some statistics about them. How can I code this card generation? I totally don't know how to code it.
Because I was just studying Aaron Williams 2009 paper "Loopless Generation of Multiset Permutations by Prefix Shifts", I'll contribute a version of his algorithm, which precisely solves this problem. I believe it to be faster than the standard C++ next_permutation which is usually cited for this problem, because it doesn't rely on searching the input vector for the pivot point. But more extensive benchmarking would be required to produce a definitive answer; it is quite possible that it ends up moving more data around.
Williams' implementation of the algorithm avoids data movement by storing the permutation in a linked list, which allows the "prefix shift" (rotate a prefix of the vector by one position) to be implemented by just modifying two next pointers. That makes the algorithm loopless.
My version here differs in a couple of ways.
First, it uses an ordinary array to store the values, which means that the shift does require a loop. On the other hand, it avoids having to implement a linked-list datatype, and many operations are faster on arrays.
Second, it uses suffix shifts rather than prefix shifts; in effect, it produces the reverse of each permutation (compared with Williams' implementation). I did that because it simplifies the description of the starting condition.
Finally, it just does one permutation step. One of the great things about Williams' algorithm is that the state of the permutation sequence can be encapsulated in a single index value (as well as the permutation itself, of course). This implementation returns the state to be provided to the next call. (Since the state variable will be 0 at the end, the return value doubles as a termination indicator.)
Here's the code:
/* Do a single permutation of v in reverse coolex order, using
 * a modification of Aaron Williams' loopless shift prefix algorithm.
 * v must have length n. It may have repeated elements; the permutations
 * generated will be unique.
 * For the first call, v must be sorted into non-descending order and the
 * third parameter must be 1. For subsequent calls, the third parameter must
 * be the return value of the previous call. When the return value is 0,
 * all permutations have been generated.
 */
unsigned multipermute_step(int* v, unsigned n, unsigned state) {
    int old_end = v[n - 1];
    unsigned pivot = state < 2 || v[state - 2] > v[state] ? state - 1 : state - 2;
    int new_end = v[pivot];
    for (; pivot < n - 1; ++pivot) v[pivot] = v[pivot + 1];
    v[pivot] = new_end;
    return new_end < old_end ? n - 1 : state - 1;
}
In case that comment was unclear, you could use the following to produce all shuffles of a deck of 4*k cards without regard to suit:
unsigned n = 4 * k;
int v[n];
for (unsigned i = 0; i < k; ++i)
    for (unsigned j = 0; j < 4; ++j)
        v[4 * i + j] = i;
unsigned state = 1;
do {
    /* process the permutation */
} while ((state = multipermute_step(v, n, state)));
Actually trying to do that for k == 8 will take a while, since there are 32!/(4!)^8 possible shuffles. That's about 2.39 * 10^24. But I did do all the shuffles of decks of 16 cards in 0.3 seconds, and I estimate that I could have done 20 cards in half an hour.

Shuffle an array while making each index have the same probability to be in any index

I want to shuffle an array so that each index has the same probability of ending up at any other index (excluding itself).
I have this solution, only I find that the last 2 indexes will always be swapped with each other:
void Shuffle(int arr[], size_t n)
{
    int newIndx = 0;
    int i = 0;
    for(; i < n - 2; ++i)
    {
        newIndx = rand() % (n - 1);
        if (newIndx >= i)
        {
            ++newIndx;
        }
        swap(i, newIndx, arr);
    }
}
but in the end it might be that some indexes will go back to their first place once again.
Any thoughts?
C lang.
A permutation (shuffle) where no element is in its original place is called a derangement.
Generating random derangements is harder than generating random permutations, but it can still be done in linear time and space. (Generating a random permutation can be done in linear time and constant space.) Here are two possible algorithms.
The simplest solution to understand is a rejection strategy: do a Fisher-Yates shuffle, but if the shuffle attempts to put an element at its original spot, restart the shuffle. [Note 1]
Since the probability that a random shuffle is a derangement is approximately 1/e, the expected number of shuffles performed is about e (that is, 2.71828…). But since unsuccessful shuffles are restarted as soon as the first fixed point is encountered, the total number of shuffle steps is less than e times the array size. For a detailed analysis, see this paper, which proves the expected number of random numbers needed by the algorithm to be around (e−1) times the number of elements.
In order to be able to do the check and restart, you need to keep an array of indices. The following little function produces a derangement of the indices from 0 to n-1; it is necessary to then apply the permutation to the original array.
/* n must be at least 2 for this to produce meaningful results */
void derange(size_t n, int ind[]) {
    for (size_t i = 0; i < n; ++i) ind[i] = i;
    swap(ind, 0, randint(1, n));
    for (size_t i = 1; i < n; ++i) {
        int r = randint(i, n);
        swap(ind, i, r);
        if (ind[i] == (int)i) i = 0;
    }
}
Here are the two functions used by that code:
void swap(int arr[], size_t i, size_t j) {
    int t = arr[i]; arr[i] = arr[j]; arr[j] = t;
}
/* This is not the best possible implementation */
int randint(int low, int lim) {
    return low + rand() % (lim - low);
}
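Since that comment flags the modulo construction's bias, here is a hedged sketch of a rejection-based variant (the name is illustrative, and it still inherits rand()'s overall quality):
#include <stdlib.h>

/* Uniform over [low, lim): reject the final partial span of rand()'s
 * range so that every residue class is equally likely. */
int randint_unbiased(int low, int lim) {
    int span = lim - low;
    int limit = RAND_MAX - (RAND_MAX % span); /* a multiple of span */
    int r;
    do {
        r = rand();
    } while (r >= limit);
    return low + r % span;
}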
The following function is based on the 2008 paper "Generating Random Derangements" by Conrado Martínez, Alois Panholzer and Helmut Prodinger, although I use a different mechanism to track cycles. Their algorithm uses a bit vector of size N but uses a rejection strategy in order to find an element which has not been marked. My algorithm uses an explicit vector of indices not yet operated on. The vector is also of size N, which is still O(N) space [Note 2]; since in practical applications, N will not be large, the difference is not IMHO significant. The benefit is that selecting the next element to use can be done with a single call to the random number generator. Again, this is not particularly significant since the expected number of rejections in the MP&P algorithm is very small. But it seems tidier to me.
The basis of the algorithms (both MP&P and mine) is the recursive procedure to produce a derangement. It is important to note that a derangement is necessarily the composition of some number of cycles where each cycle is of size greater than 1. (A cycle of size 1 is a fixed point.) Thus, a derangement of size N can be constructed from a smaller derangement using one of two mechanisms:
Produce a derangement of the N-1 elements other than element N, and add N to some cycle at any point in that cycle. To do so, randomly select any element j of the smaller derangement and place N immediately after j in j's cycle. This alternative covers all possibilities where N ends up in a cycle of size greater than 2.
Produce a derangement of N-2 of the N-1 elements other than N, and add a cycle of size 2 consisting of N and the element not selected from the smaller derangement. This alternative covers all possibilities where N is in a cycle of size 2.
If D(n) is the number of derangements of size n, it is easy to see from the above recursion that:
D(n) = (n−1)(D(n−1) + D(n−2))
The multiplier is n−1 in both cases: in the first alternative, it refers to the number of possible places N can be added, and in the second alternative to the number of possible ways to select the n−2 elements of the recursive derangement.
Therefore, if we were to recursively produce a random derangement of size N, we would randomly select one of the N-1 previous elements, and then make a random boolean decision on whether to produce alternative 1 or alternative 2, weighted by the number of possible derangements in each case.
One advantage to this algorithm is that it can derange an arbitrary vector; there is no need to apply the permuted indices to the original vector as with the rejection algorithm.
As MP&P note, the recursive algorithm can just as easily be performed iteratively. This is quite clear in the case of alternative 2, since the new 2-cycle can be generated either before or after the recursion, so it might as well be done first and then the recursion is just a loop. But that is also true for alternative 1: we can make element N the successor in a cycle to a randomly-selected element j even before we know which cycle j will eventually be in. Looked at this way, the difference between the two alternatives reduces to whether or not element j is removed from future consideration or not.
As shown by the recursion, alternative 2 should be chosen with probability (n−1)D(n−2)/D(n), which is how MP&P write their algorithm. I used the equivalent formula D(n−2) / (D(n−1) + D(n−2)), mostly because my prototype used Python (for its built-in bignum support).
Without bignums, the number of derangements and hence the probabilities need to be approximated as double, which will create a slight bias and limit the size of the array to be deranged to about 170 elements. (long double would allow slightly more.) If that is too much of a limitation, you could implement the algorithm using some bignum library. For ease of implementation, I used the Posix drand48 function to produce random doubles in the range [0.0, 1.0). That's not a great random number function, but it's probably adequate to the purpose and is available in most standard C libraries.
Since no attempt is made to verify the uniqueness of the elements in the vector to be deranged, a vector with repeated elements may produce a derangement where one or more of these elements appear to be in the original place. (It's actually a different element with the same value.)
The code:
#include <stdbool.h>
#include <stdlib.h> /* for drand48 (POSIX) */

/* Deranges the vector `arr` (of length `n`) in place, to produce
 * a permutation of the original vector where every element has
 * been moved to a new position. Returns `true` unless the derangement
 * failed because `n` was 1.
 */
bool derange(int arr[], size_t n) {
    if (n < 2) return n != 1;
    /* Compute derangement counts ("subfactorials") */
    double subfact[n];
    subfact[0] = 1;
    subfact[1] = 0;
    for (size_t i = 2; i < n; ++i)
        subfact[i] = (i - 1) * (subfact[i - 2] + subfact[i - 1]);
    /* The vector 'todo' is the stack of elements which have not yet
     * been (fully) deranged; `u` is the count of elements in the stack
     */
    size_t todo[n];
    for (size_t i = 0; i < n; ++i) todo[i] = i;
    size_t u = n;
    /* While the stack is not empty, derange the element at the
     * top of the stack with some element lower down in the stack
     */
    while (u) {
        size_t i = todo[--u];     /* Pop the stack */
        size_t j = u * drand48(); /* Get a random stack index */
        swap(arr, i, todo[j]);    /* i will follow j in its cycle */
        /* If we're generating a 2-cycle, remove the element at j */
        if (drand48() * (subfact[u - 1] + subfact[u]) < subfact[u - 1])
            todo[j] = todo[--u];
    }
    return true;
}
Notes
Many people get this wrong, particularly in social occasions such as "secret friend" selection (I believe this is sometimes called "the Santa game" in other parts of the world.) The incorrect algorithm is to just choose a different swap if the random shuffle produces a fixed point, unless the fixed point is at the very end in which case the shuffle is restarted. This will produce a random derangement but the selection is biased, particularly for small vectors. See this answer for an analysis of the bias.
Even if you don't use the RAM model where all integers are considered fixed size, the space used is still linear in the size of the input in bits, since N distinct input values must have at least N log N bits. Neither this algorithm nor MP&P makes any attempt to derange lists with repeated elements, which is a much harder problem.
Your algorithm is only almost correct (which in algorithmics means unexpected results). Because of some little errors scattered along it, it will not produce the expected results.
First, rand() % N is not guaranteed to produce a uniform distribution, unless N is a divisor of the number of possible values. In any other case, you will get a slight bias. Anyway, my man page for rand describes it as a bad random number generator, so you should try to use random() or, if available, arc4random_uniform().
But preventing an index from coming back to its original place is both uncommon and rather hard to achieve. The only way I can imagine is to keep an array of the numbers [0; n[ and swap it the same way as the real array, to be able to know the original index of a number.
The code could become:
void Shuffle(int arr[], size_t n)
{
    int i, newIndx;
    int *indexes = malloc(n * sizeof(int));
    for (i=0; i<n; i++) indexes[i] = i;
    for(i=0; i < n - 1; ++i) // beware of the inequality!
    {
        int i1;
        // search if index i is in the [i; n[ current array:
        for (i1=i; i1 < n; ++i1) {
            if (indexes[i1] == i) { // move it to i position
                if (i1 != i) { // nothing to do if already at i
                    swap(i, i1, arr);
                    swap(i, i1, indexes);
                }
                break;
            }
        }
        i1 = (i1 == n) ? i : i+1; // we will start the search at i1
                                  // to guarantee that no element keeps its place
        newIndx = i1 + arc4random_uniform(n - i1);
        /* if arc4random is not available:
        newIndx = i1 + (random() % (n - i1));
        */
        swap(i, newIndx, arr);
        swap(i, newIndx, indexes);
    }
    /* special case: a permutation of [0; n-1[ may have left the last
     * element in place; we will exchange the last element with a random one
     */
    if (indexes[n-1] == n-1) {
        newIndx = arc4random_uniform(n-1);
        swap(n-1, newIndx, arr);
        swap(n-1, newIndx, indexes);
    }
    free(indexes); // don't forget to free what we have malloc'ed...
}
Beware: the algorithm should be correct, but the code has not been tested and can contain typos...

Efficient way to detect "rank of corner" in flattened multi-dimensional array

This is a small piece of very frequently-called code, and part of a convolution algorithm I am trying to optimise (technically it's my first-pass optimisation, and I have already improved speed by a factor of 2, but now I am stuck):
inline int corner_rank( int max_ranks, int *shape, int pos ) {
    int i;
    int corners = 0;
    for ( i = 0; i < max_ranks; i++ ) {
        if ( pos % shape[i] ) break;
        pos /= shape[i];
        corners++;
    }
    return corners;
}
The code is being used to calculate a property of a position pos within an N-dimensional array (that has been flattened to pointer, plus arithmetic). max_ranks is the dimensionality, and shape is the array of sizes in each dimension.
An example 3-dimensional array might have max_ranks = 3, and shape = { 3, 4, 5 }. The schematic layout of the first few elements might look like this:
  0       1       2       3       4       5       6       7       8
[0,0,0] [1,0,0] [2,0,0] [0,1,0] [1,1,0] [2,1,0] [0,2,0] [1,2,0] [2,2,0]
Returned by function:
  3       0       0       1       0       0       1       0       0
Where the first row 0..8 shows the index offset given by pos, and the numbers below give the multi-dimensional indices. Edit: Below that I have put the value returned by the function (the value of 2 is returned at positions 12, 24 and 36).
The function is effectively returning the number of "leading" zeros in the multi-dimensional index, and is designed as it is to avoid needing to make a full conversion to array indices on every increment.
Is there anything I can do with this function to make it inherently faster? Is there a clever way of avoiding %, or another way to calculate the "corner rank"? (Apologies, by the way, if it has a more formal name that I do not know...)
The only time you should return max_ranks is if pos equals zero. Checking for this allows you to remove the conditional check from your for-loop. This should improve both the worst-case completion time and the speed of the looping for large values of max_ranks.
Here is my addition, plus an alternative way of avoiding the division operation. I believe that this is as fast as a handwritten div like #twalberg was suggesting, unless there is some way to produce the remainder without a second multiplication.
I'm afraid that since the most common answer is 0 (which doesn't even get past the first mod call) you aren't going to see much improvement. My guess is that your average run time is very close to the run time of the modulus function itself. You might try searching for a faster way to determine whether a number is a factor of pos. You don't actually need to calculate the remainder; you just need to know whether there is a remainder or not.
Sorry if I made things confusing by restructuring your code. I believe this will be slightly faster unless your compiler was already making these optimizations.
inline int corner_rank( int max_ranks, int *shape, int pos ) {
    // Most calls will not get farther than this.
    if (pos % shape[0] != 0) return 0;
    // One check here guarantees that the while loop below always returns.
    if (pos == 0) return max_ranks;
    int divisor = shape[0] * shape[1];
    int i = 1;
    while (true) {
        if (pos % divisor != 0) return i;
        divisor *= shape[++i];
    }
}
Also try declaring pos and divisor as the smallest types possible. If they will never be greater than 255 you can use an unsigned char. I know that some processors can perform a divide with smaller numbers faster than larger numbers, but you have to set your variable types appropriately.
