Hash tables: double probe when collision

I am currently working on hash tables and am a little confused about double hashing. Let me start with the information I was given.
You first make an array that will hold all the data, with entries placed according to their keys. I used the formula K % size to find the position in the array where the key will go. If you try to put a key into a spot that already holds a key, it's called a collision. Here is where the double hashing comes in. I use the formula max(1,(K/size) % size) to get a number by which to decrement from that position.
So I came up with these functions:
int hashing(table_t *hash, hashkey_t K)
{
    int item;
    item = K % hash->size;
    return item;
}
int double_hashing(table_t *hash, hashkey_t K)
{
    int item;
    item = (K / hash->size) % hash->size;
    return item;
}
//This is part of another function which involves the double.
else if(hash->probing_type == 2)
{
    int dec, item;
    item = hashing(hash,K);
    if(hash->table[item] == NULL)
    {
        hash->table[item]->K == K;
        hash->table[item]->I == I;
    }
    else
    {
        dec = double_hashing(hash,K);
        hash->table[item-dec]->K == K;
        hash->table[item-dec]->I == I;
    }
}
So I use the two formulas to move the keys around. Now I am confused about what happens if I decrement and land on another spot where a key is already placed. Do I decrement again by that much until I find a place?

Now I am confused to what happens if I decrement and land on another
spot in which a key is already placed. Do I decrement again by that
much until I find a place?
Yes. Provided your hash table size is prime and the table is not full, you will eventually find a free space for your new entry.
You don't just check if the entry is NULL. You need to also check that it doesn't contain the same key that is being inserted. Storing the key in a hash table is essential, so you can be sure that the key you searched on is the key you found.
Beware of modifying your table index without forcing it to be in the array bounds. For example, if item was 0 and then you subtract 1, you will have an out-of-bounds index.
You can correct this like so:
item = (item - dec + hash->size) % hash->size;
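The points above (keep probing, compare stored keys, wrap the index) can be sketched together as one insert routine. This is only a minimal sketch under assumptions: the layouts of entry_t and table_t and the name insert_double are hypothetical, since the question does not show them, and a zero step is bumped to 1 as in the max(1, ...) formula from the question.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int hashkey_t;                  /* assumed key type */
typedef struct { hashkey_t K; int I; } entry_t;  /* hypothetical entry layout */
typedef struct { entry_t **table; unsigned int size; } table_t;

/* Inserts (K, I) using double hashing with a decrementing probe.
 * Returns the slot index used, or -1 if the table is full or malloc fails. */
int insert_double(table_t *hash, hashkey_t K, int I)
{
    assert(hash != NULL && hash->size > 0);
    unsigned int item = K % hash->size;
    unsigned int dec = (K / hash->size) % hash->size;
    if (dec == 0)
        dec = 1;                                 /* max(1, ...): never probe with step 0 */

    for (unsigned int tries = 0; tries < hash->size; tries++) {
        entry_t *e = hash->table[item];
        if (e == NULL) {                         /* free slot: store the new entry */
            e = malloc(sizeof *e);
            if (e == NULL)
                return -1;
            e->K = K;
            e->I = I;
            hash->table[item] = e;
            return (int)item;
        }
        if (e->K == K) {                         /* same key already stored: update it */
            e->I = I;
            return (int)item;
        }
        /* step back by dec, wrapping around so the index stays in bounds */
        item = (item + hash->size - dec) % hash->size;
    }
    return -1;                                   /* every probe hit an occupied slot */
}
```

With a table of size 7, keys 3, 10 and 17 all hash to slot 3; their steps are 1, 1 and 2, so in this sketch they end up in slots 3, 2 and 1.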

Related

How to delete an element from an array in C?

I've tried shifting elements backwards but it is not making the array completely empty.
for(i=pos;i<N-count;i++)
{
    A[i]=A[i+1];
}
Actually, I have to test for a key value in an input array, and if the key value is present then I have to remove it from the array. The loop should terminate when the array becomes empty. Here "count" represents the number of times a key value was previously found and removed, and "pos" represents the position of the element to be removed. I think dynamic memory allocation may help, but I've not learned it yet.

From your description and code, by "delete" you probably mean shift the values to remove the given element and shorten the list by reducing the total count.
In your example, pos and count would be (or should be) similar, perhaps off by one.
The limit for your for loop isn't N - count. It is N - 1.
So, you want:
for (i = pos; i < (N - 1); i++) {
    A[i] = A[i + 1];
}
N -= 1;
To do a general delete, given some criteria (a function/macro that matches on element(s) to delete, such as match_for_delete below), you can do the match and delete in a single pass on the array:
int isrc = 0;
int idst = 0;
for (; isrc < N; ++isrc) {
    if (match_for_delete(A,isrc,...))
        continue;
    if (isrc > idst)
        A[idst] = A[isrc];
    ++idst;
}
N = idst;

How to remove repeating elements from an array in C

I want to write a C program that removes repeated values in an array and keep only the last occurrence.
For example, I have two arrays:
char vals[6]={'a','b','c','a','f','b'};
int pos[6]={1,2,3,4,5,6};
I want to write a function so that the elements in the array after would be:
char vals[4]={'c','a','f','b'};
int pos[4]={3,4,5,6};
I know how to delete elements in general but in this case I am looking for a way where I could also delete the values in the pos array (associated with the Vals array)

Overwriting duplicate elements, per se, isn't particularly complicated. But right here you have the additional constraint of wanting the last index of each element you find. This can be solved easily when you search for duplicates:
unsigned remove_duplicates (char * restrict array,
                            unsigned * restrict positions, unsigned count) {
    // assume positions is uninitialized
    unsigned current, insert = 0;
    for (current = 0; current < count; current ++) {
        // first, see if the value is already in the array
        unsigned search;
        for (search = 0; search < current; search ++)
            if (array[current] == array[search]) break;
        if (search < current)
            // if we found it, we just have a new position for it
            positions[search] = current + 1; // +1 because your positions are 1-based
        else {
            // otherwise, write it into the array and store its position
            // insert tracks the insertion pointer (i.e., the new end of the array)
            array[insert] = array[current];
            positions[insert] = current + 1;
            insert ++;
        }
    }
    // at this point we're done; insert will have tracked the number of
    // unique elements, which we can return as the new array size
    // the positions won't be sorted; you can sort both arrays if you want
    return insert;
}

removing set interval of struct

I'm having trouble "removing" entries from my array of structs. Right now I can define the max array size to be 10. I can fill the array with structs containing name, age, etc. My search function lets me search within a given interval, say age 10 to 25. What I want my remove function to do is remove all the people between ages 10 and 25. I should be able to re-enter new people into the database as long as it doesn't exceed my defined limit. Right now it seems to randomly remove stuff from the array.
struct database
{
    float age,b,c,d;
    char name[WORDLENGTH];
};
typedef struct database Database;
search func();
.........
void remove(Database inv[], int *np, int *min, int *max, int *option)
{
    int i;
    if (*np == 0)
    {
        printf("The database is empty\n");
        return;
    }
    search(inv, *np, low, high, option);
    if (*option == 1)
    {
        for (i = 0; i<*np; i++)
        {
            if (inv[i].age >= *low && inv[i].age <= *high)
            {
                (*np)--;
            }
        }
    }
}

Right now it seems to randomly remove stuff from the array.
The items that your code removes are not random at all. This line
(*np)--;
removes the last item. Therefore, if the range contains two items that match the search condition at the beginning of the inv, your code would remove two items off the end. Things get a little more complicated if matching items are located in the back of the valid range of inv, so deletions start looking random.
Deleting from an array of structs is not different from deleting from an array of ints. You need to follow this algorithm:
Maintain a read index and a write index, initially set to zero
Run a loop that terminates when the read index goes past the end
At each step check the item at read index
If the item does not match the removal condition, copy from read index to write index, and advance both indexes
Otherwise, advance only the read index
Set new np to the value of write index at the end of the loop.
This algorithm ensures that items behind the deleted ones get moved toward the front of the array. See this answer for an example implementation of the above approach.
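The read/write index steps listed above could look roughly like this for the Database struct from the question. It is a sketch, not the asker's final function: the name remove_by_age, the WORDLENGTH value, and passing the bounds as plain floats are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define WORDLENGTH 32                 /* assumed; the question does not give the value */

typedef struct database {
    float age, b, c, d;
    char name[WORDLENGTH];
} Database;

/* Removes every entry whose age lies in [low, high].
 * Surviving entries keep their relative order; *np is set to the new count.
 * Returns the number of entries removed. */
int remove_by_age(Database inv[], int *np, float low, float high)
{
    assert(inv != NULL && np != NULL && *np >= 0);
    int write = 0;
    for (int read = 0; read < *np; read++) {
        if (inv[read].age >= low && inv[read].age <= high)
            continue;                 /* matches the removal condition: skip it */
        if (write != read)
            inv[write] = inv[read];   /* move the survivor toward the front */
        write++;
    }
    int removed = *np - write;
    *np = write;
    return removed;
}
```

After the call, slots from *np up to the array limit are free again, so new people can be appended at inv[*np] as long as the defined maximum is not exceeded.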

You can't remove an array element simply by decreasing the count of number of elements.
If you want to remove the n'th element in the array, you have to overwrite the n'th element with the (n+1)'th element and overwrite the (n+1)'th element with the (n+2)'th element and so on.
Something like:
int arr[5] = { 1, 2, 3, 4, 5};
int np = 5;
// Remove element 3 (aka index 2)
int i;
for (i = 2; i < (np-1); ++i)
{
    arr[i] = arr[i+1];
}
--np;
This is a simple approach to explain the concept. But notice that it requires a lot of copying, so in real code you should use a better algorithm (if performance is an issue). The answer from @dasblinkenlight explains one good algorithm.

Efficiently choose an integer distinct from all elements of a list

I have a linked list of objects each containing a 32-bit integer (and provably fewer than 2^32 such objects) and I want to efficiently choose an integer that's not present in the list, without using any additional storage (so copying them to an array, sorting the array, and choosing the minimum value not in the array would not be an option). However, the definition of the structure for list elements is under my control, so I could add (within reason) additional storage to each element as part of solving the problem. For example, I could add an extra set of prev/next pointers and merge-sort the list. Is this the best solution? Or is there a simpler or more efficient way to do it?

Given the conditions that you outline in the comments, especially your expectation of many identical values, you must expect a sparse distribution of used values.
Consequently, it might actually be best to just guess a value randomly and then check whether it coincides with a value in the list. Even if half the available value range were used (which seems extremely unlikely from your comments), you would only traverse the list twice on average. And you can drastically decrease this factor by simultaneously checking a number of guesses in one pass. Done correctly, the factor should always be close to one.
The advantage of such a probabilistic approach is that you are immune to bad sequences of values. Such sequences are always possible with range-based approaches: if you calculate the min and max of the data, you run the risk that the data contains both 0 and 2^32-1. If you sequentially subdivide an interval, you run the risk of always getting values in the middle of the interval, which can shrink it to zero in 32 steps. With a probabilistic approach, these sequences can't hurt you.
I think I would use something like four guesses for very small lists, and crank it up to roughly 16 as the size of the list approaches the limit. The high starting value is due to the fact that any such algorithm will be memory bound, i.e. your CPU has ample time to check a value while it waits for the next values to arrive from memory, so you had better make good use of that time to reduce the number of passes required.
A further optimization would instantly replace a busted guess with a new one and keep track of where the replacement happened, so that you can avoid a complete second pass through the data. Also, move the busted guess to the end of the list of guesses, so that you only need to check against the start position of the first guess in your loop to stop as early as possible.
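A minimal sketch of the basic multi-guess pass described in this answer (without the replace-in-place optimization from the last paragraph) might look as follows. The node layout, the names pick_unused and rand32, and building 32 random bits out of rand() are all assumptions for illustration; a real program would likely use a better random source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct node {                 /* hypothetical list element */
    uint32_t id;
    struct node *next;
} node;

#define MAX_GUESSES 16

/* rand() only guarantees 15 random bits, so combine three calls. */
static uint32_t rand32(void)
{
    uint32_t hi  = (uint32_t)rand() & 0x7FFF;
    uint32_t mid = (uint32_t)rand() & 0x7FFF;
    uint32_t lo  = (uint32_t)rand() & 0x3;
    return (hi << 17) | (mid << 2) | lo;
}

/* Checks nguess random candidates against the list in a single pass and
 * repeats with fresh candidates until at least one survives.
 * Assumes the list does not contain every 32-bit value. */
uint32_t pick_unused(const node *head, int nguess)
{
    assert(nguess >= 1 && nguess <= MAX_GUESSES);
    for (;;) {
        uint32_t guess[MAX_GUESSES];
        int live = nguess;
        for (int i = 0; i < nguess; i++)
            guess[i] = rand32();
        for (const node *p = head; p != NULL && live > 0; p = p->next) {
            for (int i = 0; i < live; ) {
                if (guess[i] == p->id)
                    guess[i] = guess[--live];   /* drop the busted guess */
                else
                    i++;
            }
        }
        if (live > 0)
            return guess[0];          /* survived the whole pass: not in the list */
    }
}
```

Following the suggestion above, a caller might pass 4 for short lists and something closer to 16 as the list grows.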

If you can spare one pointer in each object, you get an O(n) worst-case algorithm easily (standard divide-and-conquer):
Divide the range of possible IDs equally.
Make a singly-linked list covering each subrange.
If one subrange is empty, choose any id in it.
Otherwise repeat with the elements of the subrange with fewest elements.
Example code using two sub-ranges per iteration:
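unsigned getunusedid(element* head) {
    // Invariant: the sublist h holds fewer elements than [start, stop] has ids.
    unsigned start = 0, stop = -1;
    element* h = head;  // keep the head; the linking loop must not consume it
    for(element* e = head; e; e = e->mainnext)
        e->next = e->mainnext;
    while(h) {
        element *l = 0, *r = 0;
        unsigned cl = 0, cr = 0;
        unsigned mid = start + (stop - start) / 2;
        while(h) {
            element* next = h->next;
            if(h->id <= mid) {  // left half is [start, mid]
                h->next = l;
                cl++;
                l = h;
            } else {            // right half is [mid + 1, stop]
                h->next = r;
                cr++;
                r = h;
            }
            h = next;
        }
        if(cl <= cr) {
            h = l;
            stop = mid;
        } else {
            h = r;
            start = mid + 1;
        }
    }
    return start;
}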
Some more remarks:
Beware of bugs in the above code; I have only proved it correct, not tried it.
Using more buckets (best keep to a power of 2 for easy and efficient handling) each iteration might be faster due to better data-locality (though only try and measure if it's not fast enough otherwise), as @MarkDickson rightly remarks.
Without those extra-pointers, you need full sweeps each iteration, raising the bound to O(n*lg n).
An alternative would be using 2+ extra-pointers per element to maintain a balanced tree. That would speed up id-search, at the expense of some memory and insertion/removal time overhead.

If you don't mind an O(n) scan for each change in the list and two extra bits per element, whenever an element is inserted or removed, scan through and use the two bits to represent whether an integer (element + 1) or (element - 1) exists in the list.
For example, inserting the element, 2, the extra bits for each 3 and 1 in the list would be updated to show that 3-1 (in the case of 3) and 1+1 (in the case of 1) now exist in the list.
Insertion/deletion time can be reduced by adding a pointer from each element to the next element with the same integer.

I am supposing that integers have random values not controlled by your code.
Add two unsigned integers in your list class:
unsigned int rangeMinId = 0;
unsigned int rangeMaxId = 0xFFFFFFFF ;
Or, if it is not possible to change the List class, add them as global variables.
When the list is empty, you always know that the whole range is free. When you add a new item to the list, check whether its ID is between rangeMinId and rangeMaxId, and if so, change the nearer of the two bounds to this ID.
After a long time, rangeMinId may become equal to rangeMaxId-1; then you need a simple function that traverses the whole list and searches for another free range. But this will not happen very frequently.
Other solutions are more complex and involve using sets, binary trees or sorted arrays.
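The bookkeeping on insertion could be sketched like this. The struct and function names are made up for illustration, and the bounds are treated as exclusive (the ids strictly between them are the ones known to be free), which is an assumption based on the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Every id strictly between lo and hi is known to be unused. */
typedef struct {
    uint32_t lo;   /* plays the role of rangeMinId */
    uint32_t hi;   /* plays the role of rangeMaxId */
} free_range;

void range_init(free_range *r)
{
    assert(r != NULL);
    r->lo = 0;
    r->hi = 0xFFFFFFFFu;
}

/* Call whenever an id is added to the list. */
void range_note_insert(free_range *r, uint32_t id)
{
    assert(r != NULL);
    if (id > r->lo && id < r->hi) {
        /* move the nearer bound so the larger part of the range survives */
        if (id - r->lo < r->hi - id)
            r->lo = id;
        else
            r->hi = id;
    }
}

/* Stores a free id in *out and returns 1, or returns 0 when the tracked
 * range is exhausted and the list has to be rescanned for a new range. */
int range_pick(const free_range *r, uint32_t *out)
{
    assert(r != NULL && out != NULL);
    if (r->hi - r->lo <= 1)
        return 0;
    *out = r->lo + 1;
    return 1;
}
```

Ids outside the tracked range are ignored on insertion, so the only expensive step is the occasional rescan when range_pick returns 0.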
Update:
The free range search function can be done in O(n*log(n)). An example of such a function is given below (I have not tested it extensively). The example is for an integer array but can easily be adapted for a list.
int g_Calls = 0;
bool _findFreeRange(const int* value, int n, int& left, int& right)
{
    g_Calls ++ ;
    int l=left, r=right,l2,r2;
    int m = (right + left) / 2 ;
    int nl=0, nr=0;
    for(int k = 0; k < n; k++)
    {
        const int& i = value[k] ;
        if(i > l && i < r)
        {
            if(i-l < r-i)
                l = i;
            else
                r = i;
        }
        if(i < m)
            nl ++ ;
        else
            nr ++ ;
    }
    if ( (r - l) > 1 )
    {
        left = l;
        right = r;
        return true ;
    }
    if( nl < nr)
    {
        // check first left then right
        l2 = left;
        r2 = m;
        if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
        {
            left = l2 ;
            right = r2 ;
            return true;
        }
        l2 = m;
        r2 = right;
        if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
        {
            left = l2 ;
            right = r2 ;
            return true;
        }
    }
    else
    {
        // check first right then left
        l2 = m;
        r2 = right;
        if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
        {
            left = l2 ;
            right = r2 ;
            return true;
        }
        l2 = left;
        r2 = m;
        if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
        {
            left = l2 ;
            right = r2 ;
            return true;
        }
    }
    return false;
}
bool findFreeRange(const int* value, int n, int& left, int& right, int maxx)
{
    g_Calls = 1;
    left = 0;
    right = maxx;
    if(!_findFreeRange(value, n, left, right))
        return false ;
    left++;
    right--;
    return (right - left) >= 0 ;
}
If it returns false, the list is full and there is no free range (very unlikely); maxx is the upper limit of the range, in this case 0xFFFFFFFF.
The idea is first to search for the biggest free range in the list and then, if no free hole is found, to recursively search the subranges for holes that may have been missed during the first pass. If the list is sparsely filled, it is very unlikely that the function will be called more than once. However, when the list becomes almost completely filled, the range search can take longer. Thus, in this worst-case scenario, when the list is close to full, it's better to start keeping all free ranges in a list.

This reminds me of the book Programming Pearls, and in particular the very first column, "Cracking the Oyster". What is the real problem you are trying to solve?
If your list is small, then a simple linear search to find max/min would work and it would work quickly.
When your list gets large and linear search becomes unwieldy, you can build a bitmap to represent the unused numbers for much less memory than adding 2 extra pointers at each node in the linked list. In fact, it would only be 2^32 bits = 2^(32-3) bytes = 512MB of RAM compared to your linked list being potentially >10GB.
Then to find an unused number, you can just traverse the bitmap one machine-word at a time, checking if it's non-zero. If it is, then at least one number in that 32- or 64- bit block is unused, and you can inspect the word to find out exactly which bit is set. As you add numbers to the list, all you have to do is clear the corresponding bit in the bitmap.
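A rough sketch of that bitmap follows, with a set bit meaning "unused" as in the description above. The type and function names are hypothetical, and the universe size is a parameter so the idea can be tried on a small range; covering all 32-bit values would mean passing 2^32 and allocating 512MB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t *words;   /* one bit per id; 1 = unused, 0 = used */
    size_t nwords;
} free_bitmap;

/* universe = number of ids covered; must be a positive multiple of 64. */
int bitmap_init(free_bitmap *bm, uint64_t universe)
{
    assert(bm != NULL && universe > 0 && universe % 64 == 0);
    bm->nwords = (size_t)(universe / 64);
    bm->words = malloc(bm->nwords * sizeof *bm->words);
    if (bm->words == NULL)
        return 0;
    for (size_t w = 0; w < bm->nwords; w++)
        bm->words[w] = ~UINT64_C(0);           /* everything starts out unused */
    return 1;
}

/* Call when id is added to the list: clear its bit. */
void bitmap_mark_used(free_bitmap *bm, uint32_t id)
{
    assert(id / 64 < bm->nwords);
    bm->words[id / 64] &= ~(UINT64_C(1) << (id % 64));
}

/* Scans one machine word at a time; returns 1 and stores the lowest unused id. */
int bitmap_find_unused(const free_bitmap *bm, uint32_t *out)
{
    for (size_t w = 0; w < bm->nwords; w++) {
        if (bm->words[w] == 0)
            continue;                          /* all 64 ids in this word are taken */
        for (unsigned bit = 0; bit < 64; bit++) {
            if ((bm->words[w] >> bit) & 1u) {
                *out = (uint32_t)(w * 64 + bit);
                return 1;
            }
        }
    }
    return 0;
}
```

Removing an element from the list would set its bit again, but only if no other element carries the same value; the sketch leaves that bookkeeping out.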

One possible solution is to take the min and max of the list with a simple O(n) iteration, then pick a number between max and min + (1 << 32). This is simple to do since overflow/underflow behavior is well-defined for unsigned integers:
uint32_t min, max;
// TODO: compute min and max here
// exclude max from choice space (min will be an exclusive upper bound)
max++;
uint32_t choice = rand32() % (min - max) + max; // where rand32 is a random unsigned 32-bit integer
Of course, if it doesn't need to be random, then you can just use one more than the maximum of the list.
Note: the only case where this fails is if min is 0 and max is UINT32_MAX (aka 4294967295).
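To fill in the TODO, here is one way the whole thing could be written. It is a sketch under assumptions: the ids are taken from a plain array rather than the linked list to keep it short, the function name is made up, and the caller supplies the random 32-bit value r (standing in for rand32()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Picks a value outside [min, max] of the given ids, using r as the random input.
 * Returns 1 and stores the choice in *out, or returns 0 in the one failing case
 * (min is 0 and max is UINT32_MAX). */
int pick_outside_span(const uint32_t *ids, size_t n, uint32_t r, uint32_t *out)
{
    assert(out != NULL && (ids != NULL || n == 0));
    if (n == 0) {                     /* empty list: every value is free */
        *out = r;
        return 1;
    }
    uint32_t min = ids[0], max = ids[0];
    for (size_t i = 1; i < n; i++) {
        if (ids[i] < min) min = ids[i];
        if (ids[i] > max) max = ids[i];
    }
    uint32_t first = max + 1u;        /* wraps to 0 when max is UINT32_MAX */
    uint32_t span = min - first;      /* number of free values, modulo 2^32 */
    if (span == 0)
        return 0;
    *out = first + r % span;          /* may wrap past UINT32_MAX toward min - 1 */
    return 1;
}
```

Passing r = 0 gives the non-random variant mentioned above: one more than the maximum of the list.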

Ok. Here is one really simple solution. Some of the answers have become too theoretical and complicated for optimization. If you need a quick solution do this:
1. In your List, add a member:
unsigned int NextFreeId = 1;
Also add an std::set<unsigned int> ids.
When you add an item to the list, also add the integer to the set and keep track of the NextFreeId:
int insert(unsigned int id)
{
    ids.insert(id);
    if (NextFreeId == id) // will not happen too frequently
    {
        unsigned int nextid = id, previd = id;
        while (true)
        {
            // step outward in both directions, clamping at the ends of the range
            if (nextid < 0xFFFFFFFF)
                nextid++;
            if (previd > 0)
                previd--;
            if (!ids.count(nextid))
            {
                NextFreeId = nextid;
                break;
            }
            if (!ids.count(previd))
            {
                NextFreeId = previd;
                break;
            }
            if (previd == 0 && nextid == 0xFFFFFFFF)
                break; // all the range is filled, there is no free id
        }
    }
    return 1;
}
Sets are very efficient at checking whether a value is contained, so the complexity will be O(log(N)). It is quick to implement. Also, the set is searched not on every insert but only when the NextFreeId gets taken. The list is not traversed at all.

Request a lua table size in c before iterating it

This should be simple, and it probably is, but in my C code, I want to know the size of a table before I start iterating through it. I need to preallocate some memory to store the values that come from that table.
I get this table as a parameter in a Lua C function.
static int lua_FloatArray(lua_State *L)
{
    int n = lua_gettop(L);
    if (n != 1 || lua_type(L, 1) != LUA_TTABLE)
    {
        luaL_error(L, "FloatArray expects first parameter to be a table");
        return 0;
    }
    int tablesize = ????;
    float *a = (float*)lua_newuserdata(L, tablesize * sizeof(float));
    lua_pushnil(L);
    int x = 0;
    while (lua_next(L, 1) != 0) // the table is at stack index 1
    {
        a[x++] = (float)lua_tonumber(L, -1);
        lua_pop(L, 1); // Remove value, but keep key for next iteration
    }
    return 1;
}
tablesize? how to get tablesize?

Assuming you are working with arrays - tables with integer keys, without holes (some keys being nil) - you can use the lua_objlen method. Quoting from the manual:
Returns the "length" of the value at the given acceptable index: for strings, this is the string length; for tables, this is the result of the length operator ('#');

There's no such API function. You need to count the items yourself.
On the other hand, you seem to be filling an array in C, and I guess you have a Lua table like, say, {10,20,30} and you assume that you'll get the items in the order I've listed. This is not so with lua_next. See the second paragraph in http://www.lua.org/manual/5.1/manual.html#pdf-next .

I don't know whether this is actually good programming practice, but I often use 2D arrays / tables in the form of a structure with a field for a pointer to the array (or a pointer to an array of pointers) and fields for the number of columns and the number of rows.