Ideal data structure for mapping integers to integers? - c

I won't go into details, but I'm attempting to implement an algorithm similar to the Boyer-Moore-Horspool algorithm, only using hex color values instead of characters (i.e., there is a much greater range).
Following the example on Wikipedia, I originally had this:
size_t jump_table[0xFFFFFF + 1];
memset(jump_table, default_value, sizeof(jump_table);
However, 0xFFFFFF is obviously a huge number and this quickly causes C to seg-fault (but not stack-overflow, disappointingly).
Basically, what I need is an efficient associative array mapping integers to integers. I was considering using a hash table, but having a malloc'd struct for each entry just seems overkill to me (I also do not need hashes generated, as each key is a unique integer and there can be no duplicate entries).
Does anyone have any alternatives to suggest? Am I being overly pragmatic about this?
Update
For those interested, I ended up using a hash table via the uthash library.

0xffffff is rather too large to put on the stack on most systems, but you absolutely can malloc a buffer of that size (at least on current computers; not so much on a smartphone). Whether or not you should do it for this task is a separate issue.
Edit: Based on the comment, if you expect the common case to have a relatively small number of entries other than the "this color doesn't appear in the input" skip value, you should probably just go ahead and use a hash map (obviously only storing values that actually appear in the input).
(ignore earlier discussion of other data structures, which was based on an incorrect recollection of the algorithm under discussion -- you want to use a hash table)

If the array you were going to make (of size 0xFFFFFF) was going to be sparse you could try making a smaller array to act as a simple hash table, with the size being 0xFFFFFF / N and the hash function being hexValue / N (or hexValue % (0xFFFFFF / N)). You'll have to be creative to handle collisions though.
This is the only way I can foresee getting out of mallocing structs.

You can malloc(3) 0xFFFFFF blocks of size_t on the heap (for simplicity), and address them as you do with an array.
As for the stack overflow. Basically the program receives a SIGSEGV, which can be a result of a stack overflow or accessing illegal memory or writing on a read-only segment etc... They are all abstracted under the same error message "Segmentation fault".
But why don't you use a higher level language like python that supports associate arrays?

At possibly the cost of some speed, you could try modifying the algorithm to find only matches that are aligned to some boundary (every three or four symbols), then perform the search at byte level.

You could create a sparse array of sorts which has "pages" like this (this example uses 256 "pages", so the upper most byte is the page number):
int *pages[256];
/* call this first to make sure all of the pages start out NULL! */
void init_pages(void) {
for(i = 0; i < 256; ++i) {
pages[i] = NULL;
}
}
int get_value(int index) {
if(pages[index / 0x10000] == NULL) {
pages[index / 0x10000] = calloc(0x10000, 1); /* calloc so it will zero it out */
}
return pages[index / 0x10000][index % 0x10000];
}
void set_value(int index, int value) {
if(pages[index / 0x10000] == NULL) {
pages[index / 0x10000] = calloc(0x10000, 1); /* calloc so it will zero it out */
}
pages[index / 0x10000][index % 0x10000] = value;
}
this will allocate a page the first time it is touched, read or write.

To avoid the overhead of malloc you can use a hashtable where the entries in the table are your structs, assuming they are small. In your case a pair of integers should suffice, with a special value to indicate emptyness of the slot in the table.

How many values are there in your output space, i.e. how many different values do you map to in the range 0-0xFFFFF?
Using randomized universal hashing you can come up with a collision-free hash function with a table no bigger than 2 times the number of values in your output space (for a static table)

Related

Dynamically indexing an array in C

Is it possible to create arrays based of their index as in
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y] = someNr;
dynamically/on the run, without creating foo[0...3][0...4]?
If not, is there a data structure that allow me to do something similar to this in C?
No.
As written your code make no sense at all. You need foo to be declared somewhere and then you can index into it with foo[x][y] = someNr;. But you cant just make foo spring into existence which is what it looks like you are trying to do.
Either create foo with correct sizes (only you can say what they are) int foo[16][16]; for example or use a different data structure.
In C++ you could do a map<pair<int, int>, int>
Variable Length Arrays
Even if x and y were replaced by constants, you could not initialize the array using the notation shown. You'd need to use:
int fixed[3][4] = { someNr };
or similar (extra braces, perhaps; more values perhaps). You can, however, declare/define variable length arrays (VLA), but you cannot initialize them at all. So, you could write:
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y];
for (int i = 0; i < x; i++)
{
for (int j = 0; j < y; j++)
foo[i][j] = someNr + i * (x + 1) + j;
}
Obviously, you can't use x and y as indexes without writing (or reading) outside the bounds of the array. The onus is on you to ensure that there is enough space on the stack for the values chosen as the limits on the arrays (it won't be a problem at 3x4; it might be at 300x400 though, and will be at 3000x4000). You can also use dynamic allocation of VLAs to handle bigger matrices.
VLA support is mandatory in C99, optional in C11 and C18, and non-existent in strict C90.
Sparse arrays
If what you want is 'sparse array support', there is no built-in facility in C that will assist you. You have to devise (or find) code that will handle that for you. It can certainly be done; Fortran programmers used to have to do it quite often in the bad old days when megabytes of memory were a luxury and MIPS meant millions of instruction per second and people were happy when their computer could do double-digit MIPS (and the Fortran 90 standard was still years in the future).
You'll need to devise a structure and a set of functions to handle the sparse array. You will probably need to decide whether you have values in every row, or whether you only record the data in some rows. You'll need a function to assign a value to a cell, and another to retrieve the value from a cell. You'll need to think what the value is when there is no explicit entry. (The thinking probably isn't hard. The default value is usually zero, but an infinity or a NaN (not a number) might be appropriate, depending on context.) You'd also need a function to allocate the base structure (would you specify the maximum sizes?) and another to release it.
Most efficient way to create a dynamic index of an array is to create an empty array of the same data type that the array to index is holding.
Let's imagine we are using integers in sake of simplicity. You can then stretch the concept to any other data type.
The ideal index depth will depend on the length of the data to index and will be somewhere close to the length of the data.
Let's say you have 1 million 64 bit integers in the array to index.
First of all you should order the data and eliminate duplicates. That's something easy to achieve by using qsort() (the quick sort C built in function) and some remove duplicate function such as
uint64_t remove_dupes(char *unord_arr, char *ord_arr, uint64_t arr_size)
{
uint64_t i, j=0;
for (i=1;i<arr_size;i++)
{
if ( strcmp(unord_arr[i], unord_arr[i-1]) != 0 ){
strcpy(ord_arr[j],unord_arr[i-1]);
j++;
}
if ( i == arr_size-1 ){
strcpy(ord_arr[j],unord_arr[i]);
j++;
}
}
return j;
}
Adapt the code above to your needs, you should free() the unordered array when the function finishes ordering it to the ordered array. The function above is very fast, it will return zero entries when the array to order contains one element, but that's probably something you can live with.
Once the data is ordered and unique, create an index with a length close to that of the data. It does not need to be of an exact length, although pledging to powers of 10 will make everything easier, in case of integers.
uint64_t* idx = calloc(pow(10, indexdepth), sizeof(uint64_t));
This will create an empty index array.
Then populate the index. Traverse your array to index just once and every time you detect a change in the number of significant figures (same as index depth) to the left add the position where that new number was detected.
If you choose an indexdepth of 2 you will have 10² = 100 possible values in your index, typically going from 0 to 99.
When you detect that some number starts by 10 (103456), you add an entry to the index, let's say that 103456 was detected at position 733, your index entry would be:
index[10] = 733;
Next entry begining by 11 should be added in the next index slot, let's say that first number beginning by 11 is found at position 2023
index[11] = 2023;
And so on.
When you later need to find some number in your original array storing 1 million entries, you don't have to iterate the whole array, you just need to check where in your index the first number starting by the first two significant digits is stored. Entry index[10] tells you where the first number starting by 10 is stored. You can then iterate forward until you find your match.
In my example I employed a small index, thus the average number of iterations that you will need to perform will be 1000000/100 = 10000
If you enlarge your index to somewhere close the length of the data the number of iterations will tend to 1, making any search blazing fast.
What I like to do is to create some simple algorithm that tells me what's the ideal depth of the index after knowing the type and length of the data to index.
Please, note that in the example that I have posed, 64 bit numbers are indexed by their first index depth significant figures, thus 10 and 100001 will be stored in the same index segment. That's not a problem on its own, nonetheless each master has his small book of secrets. Treating numbers as a fixed length hexadecimal string can help keeping a strict numerical order.
You don't have to change the base though, you could consider 10 to be 0000010 to keep it in the 00 index segment and keep base 10 numbers ordered, using different numerical bases is nonetheless trivial in C, which is of great help for this task.
As you make your index depth become larger, the amount of entries per index segment will be reduced
Please, do note that programming, especially lower level like C consists in comprehending the tradeof between CPU cycles and memory use in great part.
Creating the proposed index is a way to reduce the number of CPU cycles required to locate a value at the cost of using more memory as the index becomes larger. This is nonetheless the way to go nowadays, as masive amounts of memory are cheap.
As SSDs' speed become closer to that of RAM, using files to store indexes is to be taken on account. Nevertheless modern OSs tend to load in RAM as much as they can, thus using files would end up in something similar from a performance point of view.

How to save memory in an array of which many elements are always 0?

I have a 2tensor in C that looks like:
int n =4;
int l =5;
int p =6;
int q=2;
I then initialize each element of T
//loop over each of the above indices
T[n][l][p][q]=...
However, many of them are zero and there are symmetries such as.
T[4][3][2][1]=-T[3][4][2][1]
How can I save memory on the elements of T which are zero? Ideally I would like to place something like NULL in those positions so they use 0 instead of 8 bytes. Also, later on in the calculation I can check if they are zero or not by checking if they are equal to NULL
How do I implicitly include those symmetries in T with using excess memory?
Edit: the symmetry can perhaps be fixed with a different implementation. But what about the zeros? Is there any implementation to not have them waste memory?
You cannot influence the size of any variable by a value you write to it.
If you want to save memory you have not only to not use it, you have to not define a variable using it.
If you do not define a variable, then you have to not use it ever.
Then you have saved memory.
This is of course obvious.
Now, how to apply that to your problem.
Allow me to simplify, for one because you did not give enough information and explanation, at least not for me to understand every detail. For another, to keep the explanation simple.
So I hope that it suffices if I solve the following problem for you, which I think is kind of the little brother of your problem.
I have a large array in C (not really large, lets say N entries, with N==20).
But for special reasons, I will never need to actually read and write any even indices, they should act as if they contain 0, but I want to save the memory used by them.
So actually I want to only use M of the entries, with M*2==N.
So instead of
int Array[N]; /* all the theoretical elements */
I define
int Array[M]; /* only the actually used elements */
Of course I cannot access any of the elements which are not needed and it will not really be necessary.
But for the logic of my program, I want to be able to program as if I could access them, but be sure that they will always every only read 0 and ignore any written value.
So what I do is wrapping all accesses to the array.
int GetArray(int index)
{
if (index & 1)
{
/* odd, I need to really access the array,
but at a calculated index */
return Array[index/2];
} else
{
/* even, always 0 */
return 0;
}
}
void SetArray(int index, int value)
{
if (index & 1)
{
/* odd, I need to really access the array,
but at a calculated index */ */
Array[index/2] = value;
} else
{
/* even, no need to store anything, stays always "0" */
}
}
So I can read and write as if the array were twice as large, but guarantee not to ever use the faked elements.
And by mapping the indices as
actualindex = wantindex / 2
I ensure that I do not access beyond the size of the actually existing array.
Porting this concept now to the more complicated setup you have described is your job. You know all the details, you can test wether everything works.
I recommend to extend GetArray() and SetArray() by checks on the resulting index, to make sure that it is never outside of the actual array.
You can also add all kinds of self checks to verify that all your rules and expectations are met.

Fastest algorithm to figure out if an array has at least one duplicate

I have a quite peculiar case here. I have a file containing several million entries and want to find out if there exists at least one duplicate. The language here isn't of great importance, but C seems like a reasonable choice for speed. Now, what I want to know is what kind of approach to take to this? Speed is the primary goal here. Naturally, we want to stop looking as soon as one duplicate is found, that's clear, but when the data comes in, I don't know anything about how it's sorted. I just know it's a file of strings, separated by newline. Now keep in mind, all I want to find out is if a duplicate exists. Now, I have found a lot of SO questions regarding finding all duplicates in an array, but most of them go the easy and comprehensive way, rather than the fastest.
Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Please note that this is not strictly theoretical. It will be tested on a machine (Intel i7 with 8GB RAM), so I do have to take into consideration the time of making a string comparison etc. Which is why I'm also wondering if it could be faster to split the strings in two, and first compare the integer part, as an int comparison will be quicker, and then the string part? Of course, that will also require me to split the string and cast the second half to an int, which might be slower...
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).
If X &in; {A..Z} and N &in; {0..9}, that gives 263 * 103 = 17,576,000 possible values ... a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take ~2Mb for this.
Here you go: a python script to generate all possible 17 million keys:
import itertools
from string import ascii_uppercase
for prefix in itertools.product(ascii_uppercase, repeat=3):
for numeric in range(1000):
print "%s%03d" % (''.join(prefix), numeric)
and a simple C bitset filter:
#include <limits.h>
/* convert number of bits into number of bytes */
int filterByteSize(int max) {
return (max + CHAR_BIT - 1) / CHAR_BIT;
}
/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
int byteIndex = value / CHAR_BIT;
unsigned char mask = 1 << (value % CHAR_BIT);
unsigned char byte = filter[byteIndex];
filter[byteIndex] = byte | mask;
return byte & mask;
}
which for your purposes you'd use like so:
#include <stdlib.h>
/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
return calloc(filterByteSize(maxKey), 1);
}
/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
int alpha = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
int key = numeric + 1000 * alpha;
return filterTestAndSet(filter, key);
}
#include <stdio.h>
int main() {
unsigned char *filter = allocMyFilter();
char key[8]; /* 6 chars + newline + nul */
while (fgets(key, sizeof(key), stdin)) {
if (testAndSetMyKey(filter, key)) {
printf("collision: %s\n", key);
return 1;
}
}
return 0;
}
This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:
useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt
real 0m0.474s
user 0m0.436s
sys 0m0.036s
useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000
real 0m0.467s
user 0m0.452s
sys 0m0.016s
admittedly the input file is cached in memory for these runs.
The reasonable answer is to keep the algorithm with the smallest complexity. I encourage you to use a HashTable to keep track of inserted elements; the final algorithm complexity is O(n), because search in HashTable is O(1) theoretically. In your case I suggest you, to run the algorithm when reading file.
public static bool ThereAreDuplicates(string[] inputs)
{
var hashTable = new Hashtable();
foreach (var input in inputs)
{
if (hashTable[input] != null)
return true;
hashTable.Add(input, string.Empty);
}
return false;
}
A fast but inefficient memory solution would use
// Entries are AAA####
char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */] = { 0 }; // or calloc() this
char buffer[100];
while (fgets(buffer, sizeof buffer, istream)) {
unsigned long index = strtoul(buffer, NULL, 36);
if (found[index]++) {
Dupe_found();
break;
}
}
The trouble with the post is that it wants "Fastest algorithm", but does not detail memory concerns and its relative importance to speed. So speed must be king and the above wastes little time. It does meet the "stop looking as soon as one duplicate is found" requirement.
Depending on how many different things there can be you have some options:
Sort whole array and then lookup for repeating element, complexity O(n log n) but can be done in place, so memory will be O(1)
Build set of all elements. Depending on chosen set implementation can be O(n) (when it will be hash set) or O(n log n) (binary tree), but it would cost you some memory to do so.
The fastest way to find out if an array contains at least one duplicate is to use a bitmap, multiple CPUs and an (atomic or not) "test and set bit" instruction (e.g. lock bts on 80x86).
The general idea is to divide the array into "total elements / number of CPUs" sized pieces and give each piece to a different CPU. Each CPU processes it's piece of the array by calculating an integer and doing the atomic "test and set bit" for the bit corresponding to that integer.
However, the problem with this approach is that you're modifying something that all CPUs are using (the bitmap). A better idea is to give each CPU a range of integers (e.g. CPU number N does all integers from "(min - max) * N / CPUs" to "(min - max) * (N+1) / CPUs"). This means that all CPUs read from the entire array, but each CPU only modifies it's own private piece of the bitmap. This avoids some performance problems involved with cache coherency protocols ("read for ownership of cache line") and also avoids the need for atomic instructions.
Then next step beyond that is to look at how you're converting your "3 characters and 3 digits" strings into an integer. Ideally, this can/would be done using SIMD; which would require that the array is in "structure of arrays" format (and not the more likely "array of structures" format). Also note that you can convert the strings to integers first (in an "each CPU does a subset of the strings" way) to avoid the need for each CPU to convert each string and pack more into each cache line.
Since you have several million entries I think the best algorithm would be counting sort. Counting sort does exactly what you asked: it sorts an array by counting how many times every element exists. So you could write a function that does the counting sort to the array :
void counting_sort(int a[],int n,int max)
{
int count[max+1]={0},i;
for(i=0;i<n;++i){
count[a[i]]++;
if (count[a[i]]>=2) return 1;
}
return 0;
}
Where you should first find the max element (in O(n)). The asymptotic time complexity of counting sort is O(max(n,M)) where M is the max value found in the array. So because you have several million entries if M has size order of some millions this will work in O(n) (or less for counting sort but because you need to find M it is O(n)). If also you know that there is no way that M is greater than some millions than you would be sure that this gives O(n) and not just O(max(n,M)).
You can see counting sort visualization to understand it better, here:
https://www.cs.usfca.edu/~galles/visualization/CountingSort.html
Note that in the above function we don't implement exactly counting sort, we stop when we find a duplicate which is even more efficient, since yo only want to know if there is a duplicate.

Optimising C for performance vs memory optimisation using multidimensional arrays

I am struggling to decide between two optimisations for building a numerical solver for the poisson equation.
Essentially, I have a two dimensional array, of which I require n doubles in the first row, n/2 in the second n/4 in the third and so on...
Now my difficulty is deciding whether or not to use a contiguous 2d array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, and more memory efficient method, would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a nice plot of the memory requirements in each case:
There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers will it be used? These are the questions that we have to grapple with in Computer Science. But you can only look at them once you have numbers. Otherwise you are just guessing and a fair amount of the time our intuition when it comes to optimizations are not correct.
Build a custom array that will follow the rules you have set.
The implementation will use a simple 1d contiguous array. You will need a function that will return the start of array given the row. Something like this:
int* Get( int* array , int n , int row ) //might contain logical errors
{
int pos = 0 ;
while( row-- )
{
pos += n ;
n /= 2 ;
}
return array + pos ;
}
Where n is the same n you described and is rounded down on every iteration.
You will have to call this function only once per entire row.
This function will never take more that O(log n) time, but if you want you can replace it with a single expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
You could use a single array and just calculate your offset yourself
size_t get_offset(int n, int row, int column) {
size_t offset = column;
while (row--) {
offset += n;
n << 1;
}
return offset;
}
double * array = calloc(sizeof(double), get_offset(n, 64, 0));
access via
array[get_offset(column, row)]

Keeping Track of Array Indices which Meet Some Requirement in C

This sort of situation comes up pretty often. You loop through an array, and if some elements meet some requirement, you'd like to keep track of their indices for later. Here's what I mean:
for(i=0;i<10;++i)
{
if(array[i] > 10)
{
//Keep track of this index for later use.
}
}
The easy solution would be to create an array of 10 elements, and if say the 2nd element is greater than 10, one could do indices[i] = 1; But I feel this approach isn't that good. I'll need a large array to store this and most of the space is wasted.
In my application, I'm trying to find which bits are set in a bit array. So if bits 0 and 10 are set, I need to store these numbers for later use by the program. What's the best way to go about this?
This code needs to run on an AVR Mega and I'm using AVR-GCC so a C only solution is required.
You can use a bitmap: this only uses 1 bit per index, instead of 16 or 32 bits per index.
uint32_t bitmap[10] = {0}; // works for array size up to 320 elements
for(i=0;i<10;++i)
{
if(array[i] > 10)
{
//Keep track of this index for later use.
bitmap[i/32] |= (uint32_t)1 << (i%32);
}
}
for(i=0;i<10;++i)
{
if((bitmap[i/32] >> (i%32)) & 1)
{
// Later use :)
// Put some code here
}
}
On a PC, I would say a dynamically growing linked list or stack would be best.
On a microcontroller, it's often best to use statically allocated structures so that performance is deterministic and to avoid wasting precious memory. So a fixed size FIFO that stores the index you care about (rather than a simple 1/0 status) is the way to go. Just be prepared to think about detection and graceful failure if there's an overflow OR find some way to guarantee no overflow.
If you feel that much space would be wasted by using an additional array to remember the "special" indices, try to determine exactly how much space would be wasted. Then, use a smaller array. For example, if you know that you must remember at most 4 indices, declare an array of size 4.
You can also declare a small array, not large enough to remember all indices, and run the loop that fills it several times:
int indices[4];
int number_of_indices = 0;
int i_start = 0; // array entries up to this index were already checked
while (i_start < 10) {
for(i=i_start;i<10;++i)
{
if(array[i] > 10)
{
//Keep track of this index for "later use" below.
indices[number_of_indices++] = i;
// If 4 indices have been gathered, break the loop and use them
if (number_of_indices == 4)
{
break;
}
}
}
i_start = i;
// Put "Later use" here :)
// Do something for the list of indices gathered so far
}

Resources