Maintain a sorted array that a separate, iterative function can keep accessing - c

I'm writing code for a decision tree in C. Right now it gives me the correct result (0% training error, low test error), but it takes a long time to run.
The problem lies in how often I run qsort. My basic algorithm is this:
for every feature
    sort that feature column using qsort
    remove duplicate feature values in that column
    for every unique feature value
        split
        determine entropy given that split
save the best feature to split + split value
for every training_example
    if training_example's value for best feature < best split value, store in Left[]
    else store in Right[]
recursively call this function, using only the Left[] training examples
recursively call this function, using only the Right[] training examples
Because the last two lines are recursive calls, and because the tree can extend for dozens and dozens of branches, the number of calls to qsort is huge (especially for my dataset, which has > 1000 features).
My idea to reduce the runtime is to create a 2d array (in a separate function) where each column is a sorted feature column. Then, as long as I maintain a vector of row numbers of the training examples in Left[] and Right[] for each recursive call, I can just call this separate function, grab the rows I want in the pre-sorted feature vector, and save the cost of having to qsort each time.
I'm fairly new to C, so I'm not sure how to code this. In MATLAB I can just have a global array that any function can change or access; I'm looking for something like that in C.
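The pre-sorting idea described above could be sketched roughly as follows. This is only an illustration under assumptions: feature values are taken to be double, each feature column is stored as its own array, and the names presort_rows, split_sorted_rows, g_sort_column and go_left are hypothetical. Each feature gets its row numbers sorted once; at every split the sorted row list is divided with a stable pass, so both halves stay sorted and no further qsort call is needed.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort() takes no context argument, so the column currently being
   sorted is passed through a file-scope pointer (hypothetical name). */
static const double *g_sort_column;

static int cmp_rows_by_value(const void *a, const void *b)
{
    double va = g_sort_column[*(const size_t *)a];
    double vb = g_sort_column[*(const size_t *)b];
    return (va > vb) - (va < vb);
}

/* Fill order[0..nrows-1] with row numbers sorted by column[row].
   Intended to be called once per feature, before the tree is built. */
static void presort_rows(const double *column, size_t nrows, size_t *order)
{
    for (size_t i = 0; i < nrows; i++)
        order[i] = i;
    g_sort_column = column;
    qsort(order, nrows, sizeof *order, cmp_rows_by_value);
}

/* Stable split of an already sorted row list: rows with go_left[row] != 0
   are copied to left[], all others to right[]. Relative order is kept, so
   both outputs remain sorted by the same feature.
   Returns the number of rows written to left[]. */
static size_t split_sorted_rows(const size_t *order, size_t count,
                                const unsigned char *go_left,
                                size_t *left, size_t *right)
{
    size_t nl = 0, nr = 0;
    for (size_t i = 0; i < count; i++) {
        size_t row = order[i];
        if (go_left[row])
            left[nl++] = row;
        else
            right[nr++] = row;
    }
    return nl;
}
```

With this layout each recursive call costs one linear pass per feature instead of one sort per feature.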

Global arrays in C are totally possible. There are actually two ways of doing that. In the first case the dimensions of the array are fixed for the application:
#define NROWS 100
#define NCOLS 100
int array[NROWS][NCOLS];
int main(void)
{
int i, j;
for (i = 0; i < NROWS; i++)
for (j = 0; j < NCOLS; j++)
{
array[i][j] = i+j;
}
return 0;
}
In the second example the dimensions may depend on values from the input.
#include <stdlib.h>
int **array;
int main(void)
{
int nrows = 100;
int ncols = 100;
int i, j;
array = malloc(nrows*sizeof(*array));
for (i = 0; i < nrows; i++)
{
array[i] = malloc(ncols*sizeof(*(array[i])));
for (j = 0; j < ncols; j++)
{
array[i][j] = i+j;
}
}
}
Although access to the arrays looks deceptively similar in both examples, the implementations are quite different. In the first example the array occupies one contiguous block of memory, and the stride from one row to the next is the length of a whole row. In the second example array[i] is a pointer to a row, and each row is its own block of memory, so the rows can be located in different areas of memory. In the second example the rows may also have different lengths; in that case you need to store the length of each row somewhere too.
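A third layout, not shown in the two examples above, combines run-time dimensions with a single contiguous block. The sketch below is only an illustration; the struct matrix type and the matrix_init, matrix_at and matrix_free names are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical wrapper: one contiguous block of nrows * ncols ints. */
struct matrix {
    size_t nrows, ncols;
    int *data;
};

/* Allocate a zero-filled matrix. Returns 0 on success, -1 on failure. */
static int matrix_init(struct matrix *m, size_t nrows, size_t ncols)
{
    m->nrows = nrows;
    m->ncols = ncols;
    m->data = calloc(nrows * ncols, sizeof *m->data);
    return m->data ? 0 : -1;
}

/* Address of element (i, j); the stride between rows is ncols elements. */
static int *matrix_at(struct matrix *m, size_t i, size_t j)
{
    return &m->data[i * m->ncols + j];
}

static void matrix_free(struct matrix *m)
{
    free(m->data);
    m->data = NULL;
}
```

A pointer to such a struct can be passed to every function that needs the data, which avoids a global variable altogether.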
I don't fully understand what you are trying to achieve, because I'm not familiar with the terminology of decision tree, feature and the standard approaches to training sets. But you may also want to have a look at other data structures to maintain sorted data:
http://en.wikipedia.org/wiki/Red–black_tree maintains a more or less balanced and sorted tree.
AVL tree a bit slower but more balanced and sorted tree.
Trie a sorted tree on lists of elements.
Hash function to easily map a complex element to an integral value that can be used to sort the elements. Good for finding exact elements, but there is no real order in the elements itself.
P.S1: Coming from Matlab you may want to consider a different language from C to move to. C++ has standard libraries to support above data structures. Java, Python come to mind or even Haskell if you are daring. Pointer handling in C can be quite tedious and error prone.
P.S2: I'm unable to include a - in a URL on StackOverflow. So the Red-black tree links is a bit off and can't be clicked. If someone can edit my post to fix it, then I would appreciate that.

Related

Making a character array rotate its cells left/right n times

I'm totally new here but I heard a lot about this site and now that I've been accepted for a 7 months software development 'bootcamp' I'm sharpening my C knowledge for an upcoming test.
I've been assigned a question on a test that I've passed already, but I did not finish that question and it bothers me quite a lot.
The question was a task to write a program in C that moves a character (char) array's cells by 1 to the left (it doesn't quite matter in which direction for me, but the question specified left). And I also took upon myself NOT to use a temporary array/stack or any other structure to hold the entire array data during execution.
So a 'string' or array of chars containing '0' '1' '2' 'A' 'B' 'C' will become
'1' '2' 'A' 'B' 'C' '0' after using the function once.
Writing this was no problem, I believe I ended up with something similar to:
void ArrayCharMoveLeft(char arr[], int arrsize, int times) {
int i;
for (i = 0; i < arrsize - 1; i++) { /* stop before the last cell so i+1 stays in bounds */
ArraySwap2CellsChar(arr, i, i+1);
}
}
As you can see the function is somewhat modular since it allows to input how many times the cells need to move or shift to the left. I did not implement it, but that was the idea.
As far as I know there are 3 ways to make this:
Loop ArrayCharMoveLeft times times. This feels instinctively inefficient.
Use recursion in ArrayCharMoveLeft. This should resemble the first solution, but I'm not 100% sure on how to implement this.
This is the way I'm trying to figure out: No loop within loop, no recursion, no temporary array, the program will know how to move the cells x times to the left/right without any issues.
The problem is that after swapping say N times of cells in the array, the remaining array size - times are sometimes not organized. For example:
Using ArrayCharMoveLeft with 3 as times with our given array mentioned above will yield
ABC021 instead of the expected value of ABC012.
I've run the following function for this:
int i;
char* lastcell;
if (!(times % arrsize))
{
printf("Nothing to move!\n");
return;
}
times = times % arrsize;
// Input checking: if the user passes more than the array size, reduce times to the remainder
for (i = 0; i < arrsize-times; i++) {
printf("I = %d ", i);
PrintArray(arr, arrsize);
ArraySwap2CellsChar(arr, i, i+times);
}
As you can see, the for loop runs i from 0 to arrsize - times - 1. With an array of 14 chars and times = 5, i runs from 0 to 8, so the last 5 cells (indices 9 to 13) are NOT in order (but the rest are).
The worst thing about this is that the remaining cells always maintain the sequence, but at different position. Meaning instead of 0123 they could be 3012 or 2301... etc.
I've run different arrays with different times values and didn't find a particular pattern such as "if remaining cells = 3 then use ArrayCharMoveLeft on remaining cells with times = 1".
It always seem to be 1 out of 2 options: the remaining cells are in order, or shifted with different values. It seems to be something similar to this:
times shift+direction to align
1 0
2 0
3 0
4 1R
5 3R
6 5R
7 3R
8 1R
the numbers change with different times and arrays. Anyone got an idea for this?
even if you use recursion or loops within loops, I'd like to hear a possible solution. Only firm rule for this is not to use a temporary array.
Thanks in advance!
If irrespective of efficiency or simplicity for the purpose of studying you want to use only exchanges of two array elements with ArraySwap2CellsChar, you can keep your loop with some adjustment. As you noted, the given for (i = 0; i < arrsize-times; i++) loop leaves the last times elements out of place. In order to correctly place all elements, the loop condition has to be i < arrsize-1 (one less suffices because if every element but the last is correct, the last one must be right, too). Of course when i runs nearly up to arrsize, i+times can't be kept as the other swap index; instead, the correct index j of the element which is to be put at index i has to be computed. This computation turns out somewhat tricky, due to the element having been swapped already from its original place. Here's a modified variant of your loop:
for (i = 0; i < arrsize-1; i++)
{
printf("i = %d ", i);
int j = i+times;
while (arrsize <= j) j %= arrsize, j += (i-j+times-1)/times*times;
printf("j = %d ", j);
PrintArray(arr, arrsize);
ArraySwap2CellsChar(arr, i, j);
}
Use standard library functions memcpy, memmove, etc as they are very optimized for your platform.
Use the correct type for sizes - size_t not int
#include <string.h>

char *ArrayCharMoveLeft(char *arr, const size_t arrsize, size_t ntimes)
{
    if (arrsize == 0) return arr;   /* avoid a modulo by zero */
    ntimes %= arrsize;
    if(ntimes)
    {
        char temp[ntimes];          /* VLA holding the first ntimes chars */
        memcpy(temp, arr, ntimes);
        memmove(arr, arr + ntimes, arrsize - ntimes);
        memcpy(arr + arrsize - ntimes, temp, ntimes);
    }
    return arr;
}
But you want it without the temporary array (more memory efficient, very bad performance-wise):
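char *ArrayCharMoveLeft(char *arr, size_t arrsize, size_t ntimes)
{
    if (arrsize == 0) return arr;   /* avoid a modulo by zero */
    ntimes %= arrsize;
    while(ntimes--)
    {
        /* rotate by one position: save the first char, shift the rest left */
        char temp = arr[0];
        memmove(arr, arr + 1, arrsize - 1);
        arr[arrsize - 1] = temp;
    }
    return arr;
}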
https://godbolt.org/z/od68dKTWq
https://godbolt.org/z/noah9zdYY
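For the third option in the question (no temporary array, no nested loop, no recursion), a different and well-known technique is rotation by three reversals. It is not what the answers above use, and the names reverse_range and rotate_left below are hypothetical. Reversing the first times cells, then the remaining cells, then the whole array leaves the array rotated left by times, using only single-element swaps.

```c
#include <stddef.h>

/* Reverse arr[lo .. hi-1] in place using single-element swaps. */
static void reverse_range(char *arr, size_t lo, size_t hi)
{
    while (lo + 1 < hi) {
        hi--;
        char tmp = arr[lo];
        arr[lo] = arr[hi];
        arr[hi] = tmp;
        lo++;
    }
}

/* Rotate arr[0 .. arrsize-1] left by `times` positions. */
static void rotate_left(char *arr, size_t arrsize, size_t times)
{
    if (arrsize == 0)
        return;
    times %= arrsize;
    reverse_range(arr, 0, times);        /* first block reversed  */
    reverse_range(arr, times, arrsize);  /* second block reversed */
    reverse_range(arr, 0, arrsize);      /* whole array reversed  */
}
```

Every cell is swapped at most twice, so the running time is linear in the array size regardless of times.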
Disclaimer: I'm not sure if it's common to share full working code here or not, since this is literally my first question asked here, so I'll refrain from doing so, assuming the idea is to answer specific questions and not to provide an example solution up for grabs (which might defeat the purpose of studying and exploring C). This argument is backed by the fact that this specific task is derived from a programming test used by a programming course, and its purpose is to filter out applicants who aren't fit for an intense 7 months of training in software development. If you still wish to see my code, message me privately.
So, with a great amount of help from @Armali, I'm happy to announce the question is answered! Together we came up with a function that takes an array of characters in C (a string) and, without using any previously written libraries (such as string.h) or even a temporary array, rotates all the cells in the array N times to the left.
Example: using ArrayCharMoveLeft() on the following array with N = 5:
Original array: 0123456789ABCDEF
Updated array: 56789ABCDEF01234
As you can see the first cell (0) is now the sixth cell (5), the 2nd cell is the 7th cell and so on. So each cell was moved to the left 5 times. The first 5 cells 'overflow' to the end of the array and now appear as the Last 5 cells, while maintaining their order.
The function works with various array lengths and N values.
This is not any sort of achievement, but rather an attempt to execute the task with as little variables as possible (only 4 ints, besides the char array, also counting the sub function used to swap the cells).
It was achieved using a nested loop, so by no means is it efficient runtime-wise, just memory-wise, while still being self-coded functions, with no external libraries used (except stdio.h).
Refer to Armali's posted solution, it should get you the answer for this question.

Dynamically indexing an array in C

Is it possible to create arrays based of their index as in
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y] = someNr;
dynamically/on the run, without creating foo[0...3][0...4]?
If not, is there a data structure that allow me to do something similar to this in C?
No.
As written, your code makes no sense at all. You need foo to be declared somewhere, and then you can index into it with foo[x][y] = someNr;. But you can't just make foo spring into existence, which is what it looks like you are trying to do.
Either create foo with correct sizes (only you can say what they are) int foo[16][16]; for example or use a different data structure.
In C++ you could do a map<pair<int, int>, int>
Variable Length Arrays
Even if x and y were replaced by constants, you could not initialize the array using the notation shown. You'd need to use:
int fixed[3][4] = { someNr };
or similar (extra braces, perhaps; more values perhaps). You can, however, declare/define variable length arrays (VLA), but you cannot initialize them at all. So, you could write:
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y];
for (int i = 0; i < x; i++)
{
for (int j = 0; j < y; j++)
foo[i][j] = someNr + i * (x + 1) + j;
}
Obviously, you can't use x and y as indexes without writing (or reading) outside the bounds of the array. The onus is on you to ensure that there is enough space on the stack for the values chosen as the limits on the arrays (it won't be a problem at 3x4; it might be at 300x400 though, and will be at 3000x4000). You can also use dynamic allocation of VLAs to handle bigger matrices.
VLA support is mandatory in C99, optional in C11 and C18, and non-existent in strict C90.
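The dynamic allocation of VLAs mentioned above could look like the following sketch, which assumes a compiler with VLA support. The function name make_filled is hypothetical. The matrix lives on the heap, so the stack-size concern does not apply, yet it can still be indexed as m[i][j].

```c
#include <stdlib.h>

/* Allocate a rows x cols int matrix on the heap and set every cell to
   `value`. Returns NULL if allocation fails; release it with free(). */
static void *make_filled(size_t rows, size_t cols, int value)
{
    int (*m)[cols] = malloc(rows * sizeof *m);  /* pointer to VLA rows */
    if (m == NULL)
        return NULL;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            m[i][j] = value;
    return m;
}
```

The caller assigns the result to a pointer of matching shape, for example int (*foo)[y] = make_filled(x, y, someNr);, and then uses foo[i][j] as usual.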
Sparse arrays
If what you want is 'sparse array support', there is no built-in facility in C that will assist you. You have to devise (or find) code that will handle that for you. It can certainly be done; Fortran programmers used to have to do it quite often in the bad old days when megabytes of memory were a luxury and MIPS meant millions of instruction per second and people were happy when their computer could do double-digit MIPS (and the Fortran 90 standard was still years in the future).
You'll need to devise a structure and a set of functions to handle the sparse array. You will probably need to decide whether you have values in every row, or whether you only record the data in some rows. You'll need a function to assign a value to a cell, and another to retrieve the value from a cell. You'll need to think what the value is when there is no explicit entry. (The thinking probably isn't hard. The default value is usually zero, but an infinity or a NaN (not a number) might be appropriate, depending on context.) You'd also need a function to allocate the base structure (would you specify the maximum sizes?) and another to release it.
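As a minimal sketch of such a structure and its functions, assuming int values and a simple unsorted list of cells with linear search (all names here are hypothetical, and a real implementation would likely want a hash table or sorted rows for speed):

```c
#include <stdlib.h>

struct sparse_cell { size_t row, col; int value; };

struct sparse {
    struct sparse_cell *cells;  /* only explicitly assigned cells */
    size_t count, capacity;
    int default_value;          /* value reported for missing cells */
};

static void sparse_init(struct sparse *s, int default_value)
{
    s->cells = NULL;
    s->count = 0;
    s->capacity = 0;
    s->default_value = default_value;
}

static struct sparse_cell *sparse_find(const struct sparse *s,
                                       size_t row, size_t col)
{
    for (size_t i = 0; i < s->count; i++)
        if (s->cells[i].row == row && s->cells[i].col == col)
            return &s->cells[i];
    return NULL;
}

/* Retrieve a cell, falling back to the default when it was never set. */
static int sparse_get(const struct sparse *s, size_t row, size_t col)
{
    const struct sparse_cell *cell = sparse_find(s, row, col);
    return cell ? cell->value : s->default_value;
}

/* Assign a cell. Returns 0 on success, -1 if memory runs out. */
static int sparse_set(struct sparse *s, size_t row, size_t col, int value)
{
    struct sparse_cell *cell = sparse_find(s, row, col);
    if (cell == NULL) {
        if (s->count == s->capacity) {
            size_t new_cap = s->capacity ? 2 * s->capacity : 8;
            struct sparse_cell *grown =
                realloc(s->cells, new_cap * sizeof *grown);
            if (grown == NULL)
                return -1;
            s->cells = grown;
            s->capacity = new_cap;
        }
        cell = &s->cells[s->count++];
        cell->row = row;
        cell->col = col;
    }
    cell->value = value;
    return 0;
}

static void sparse_free(struct sparse *s)
{
    free(s->cells);
    sparse_init(s, s->default_value);
}
```

With this, the assignment from the question becomes sparse_set(&foo, 4, 5, 123), and no storage is spent on the cells that were never written.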
Most efficient way to create a dynamic index of an array is to create an empty array of the same data type that the array to index is holding.
Let's imagine we are using integers for the sake of simplicity. You can then stretch the concept to any other data type.
The ideal index depth will depend on the length of the data to index and will be somewhere close to the length of the data.
Let's say you have 1 million 64 bit integers in the array to index.
First of all you should sort the data and eliminate duplicates. That's easy to achieve by using qsort() (the C standard library sort function) and some duplicate-removal function such as
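#include <stdint.h>

/* unord_arr must already be sorted (it may still contain duplicates);
   the unique values are copied to ord_arr and their count is returned */
uint64_t remove_dupes(const uint64_t *unord_arr, uint64_t *ord_arr, uint64_t arr_size)
{
    uint64_t i, j=0;
    for (i=1;i<arr_size;i++)
    {
        /* the value changed: the previous element closes a run of duplicates */
        if ( unord_arr[i] != unord_arr[i-1] ){
            ord_arr[j] = unord_arr[i-1];
            j++;
        }
        /* the last element always closes the final run */
        if ( i == arr_size-1 ){
            ord_arr[j] = unord_arr[i];
            j++;
        }
    }
    return j;
}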
Adapt the code above to your needs, you should free() the unordered array when the function finishes ordering it to the ordered array. The function above is very fast, it will return zero entries when the array to order contains one element, but that's probably something you can live with.
Once the data is ordered and unique, create an index with a length close to that of the data. It does not need to be of an exact length, although sticking to powers of 10 will make everything easier in the case of integers.
uint64_t* idx = calloc(pow(10, indexdepth), sizeof(uint64_t));
This will create an empty index array.
Then populate the index. Traverse your array to index just once and every time you detect a change in the number of significant figures (same as index depth) to the left add the position where that new number was detected.
If you choose an indexdepth of 2 you will have 10² = 100 possible values in your index, typically going from 0 to 99.
When you detect that some number starts with 10 (103456), you add an entry to the index. Let's say that 103456 was detected at position 733; your index entry would be:
idx[10] = 733;
The next entry, for numbers beginning with 11, should be added in the next index slot. Let's say that the first number beginning with 11 is found at position 2023:
idx[11] = 2023;
And so on.
When you later need to find some number in your original array storing 1 million entries, you don't have to iterate over the whole array; you just need to check where in your index the first number starting with the same two significant digits is stored. Entry idx[10] tells you where the first number starting with 10 is stored. You can then iterate forward until you find your match.
In my example I employed a small index, thus the average number of iterations that you will need to perform will be 1000000/100 = 10000
If you enlarge your index to somewhere close the length of the data the number of iterations will tend to 1, making any search blazing fast.
What I like to do is to create some simple algorithm that tells me what's the ideal depth of the index after knowing the type and length of the data to index.
Please, note that in the example that I have posed, 64 bit numbers are indexed by their first index depth significant figures, thus 10 and 100001 will be stored in the same index segment. That's not a problem on its own, nonetheless each master has his small book of secrets. Treating numbers as a fixed length hexadecimal string can help keeping a strict numerical order.
You don't have to change the base though, you could consider 10 to be 0000010 to keep it in the 00 index segment and keep base 10 numbers ordered, using different numerical bases is nonetheless trivial in C, which is of great help for this task.
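The zero-padded variant described above can be sketched as follows, under the assumption that all values are below a known bound, so that the leading digits of the padded number are simply value / segment_width. The names build_index, find_value, NO_ENTRY and segment_width are hypothetical. For 6-digit values and an index depth of 2, segment_width would be 10000 and the index would have 100 segments.

```c
#include <stdint.h>
#include <stdlib.h>

#define NO_ENTRY UINT64_MAX   /* marks an index segment with no data */

/* Build idx[s] = position in data[] of the first value whose segment is s,
   where segment = value / segment_width. data[] must be sorted ascending.
   Returns NULL if allocation fails; release the index with free(). */
static uint64_t *build_index(const uint64_t *data, size_t len,
                             uint64_t segment_width, size_t nsegments)
{
    uint64_t *idx = malloc(nsegments * sizeof *idx);
    if (idx == NULL)
        return NULL;
    for (size_t s = 0; s < nsegments; s++)
        idx[s] = NO_ENTRY;
    for (size_t i = 0; i < len; i++) {
        uint64_t s = data[i] / segment_width;
        if (s < nsegments && idx[s] == NO_ENTRY)
            idx[s] = i;   /* first value seen in this segment */
    }
    return idx;
}

/* Return the position of key in data[], or -1 if it is not present.
   Only the values from the start of key's segment onward are scanned. */
static long find_value(const uint64_t *data, size_t len, const uint64_t *idx,
                       uint64_t segment_width, size_t nsegments, uint64_t key)
{
    uint64_t s = key / segment_width;
    if (s >= nsegments || idx[s] == NO_ENTRY)
        return -1;
    for (size_t i = (size_t)idx[s]; i < len && data[i] <= key; i++)
        if (data[i] == key)
            return (long)i;
    return -1;
}
```

Because the data is sorted, the scan stops as soon as it passes the key, so a lookup touches at most one segment's worth of values.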
As you make your index depth become larger, the amount of entries per index segment will be reduced
Please note that programming, especially at a lower level like C, consists in great part of understanding the tradeoff between CPU cycles and memory use.
Creating the proposed index is a way to reduce the number of CPU cycles required to locate a value at the cost of using more memory as the index becomes larger. This is nonetheless the way to go nowadays, as massive amounts of memory are cheap.
As the speed of SSDs gets closer to that of RAM, using files to store indexes is worth taking into account. Nevertheless, modern OSs tend to load as much as they can into RAM, so using files would end up similar from a performance point of view.

Is there an approach to traverse array randomly?

I am trying to compare linear memory access to random memory access. I am traversing an array in the order of its indices to log performance of linear memory access. However to log memory's performance with random memory access I want to traverse my array randomly i.e arr[8], arr[17], arr[34], arr[2]...
Can I use pointer chasing to achieve this while ensuring that no index are accessed twice? Is pointer chasing most optimal approach in this case?
If your goal is to show that sequential access is faster than non-sequential access, simply pointer chasing the latter is not a good way to demonstrate that. You would be comparing access via a single pointer plus a simple offset against dereferencing one or more pointers before offsetting.
To use pointer chasing, you'd have to apply it to both cases. Here's an example:
int arr[n], i;
int *unshuffled[n];
int *shuffled[n];
for(i = 0; i < n; i++) {
    unshuffled[i] = arr + i;
}
/* I'll let you figure out how to randomize your indices */
shuffle(unshuffled, shuffled);
/* Do timing on these two loops */
for(i = 0; i < n; i++) {
    do_stuff(*unshuffled[i]);
}
for(i = 0; i < n; i++) {
    do_stuff(*shuffled[i]);
}
If you want to time the direct access better, though, you could construct some simple formula for advancing the index instead of randomizing the access completely:
for(i = 0; i < n; i++) {
do_stuff(arr[i]);
}
for(i = 0; i < n; i++) {
do_stuff(arr[i / 2 + (i % 2) * (n / 2)]);
}
This will only work properly for even n as shown, but it illustrates the idea. You could go so far as to compensate for the extra index arithmetic within do_stuff.
Probably the most apples-to-apples test would be to literally access the indices you want, without loops or additional computations:
do_stuff(arr[0]);
do_stuff(arr[1]);
do_stuff(arr[2]);
...
do_stuff(arr[123]);
do_stuff(arr[17]);
do_stuff(arr[566]);
...
Since I'd imagine you'd want to test with large arrays, you can write a program to generate the actual test code for you, and possibly compile and run the result.
As far as the C language itself is concerned, the access time for an array element is constant regardless of the index being accessed. Any difference you measure between random and sequential access therefore comes from the hardware (caches and prefetching favor sequential access) and from the additional computation that randomizing introduces.
But, to really answer your question, you would probably be best off to build some kind of lookup array and shuffle it a few times and use that array to get the next index. Obviously, you would be accessing two arrays, one sequentially and another randomly, by doing so, thus making the exercise pretty much useless.
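For the pointer-chasing traversal the question asks about, one possible approach that the answers above do not mention is Sattolo's algorithm. It produces a permutation consisting of a single cycle, so following next[pos] from any start visits every index exactly once before returning. The sketch below uses the hypothetical names make_single_cycle and cycle_length and relies on rand(), whose range and quality are limited, so treat it as an illustration only.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fill next[0 .. n-1] with a random permutation that forms one single
   cycle (Sattolo's algorithm), so chasing next[] never revisits an index
   until all n indices have been seen. */
static void make_single_cycle(size_t *next, size_t n)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % (i - 1);   /* 0 <= j < i-1 */
        size_t tmp = next[i - 1];
        next[i - 1] = next[j];
        next[j] = tmp;
    }
}

/* Chase the chain starting at index 0 and count the steps needed to get
   back to 0 (capped at n). Requires n >= 1. In a benchmark, the access to
   the measured array would happen inside this loop. */
static size_t cycle_length(const size_t *next, size_t n)
{
    size_t pos = 0, steps = 0;
    do {
        pos = next[pos];
        steps++;
    } while (pos != 0 && steps < n);
    return steps;
}
```

Since each load depends on the result of the previous one, the hardware cannot overlap the accesses, which is what makes this pattern useful for measuring latency.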

How to remove certain elements from an array using a conditional test in C?

I am writing a program that goes through an array of ints and calculates stdev to identify outliers in the data. From here, I would like to create a new array with the identified outliers removed in order to recalculate the avg and stdev. Is there a way that I can do this?
There is a pretty simple solution to the problem that involves switching your mindset in the if statement (which isn't actually in a for loop it seems... might want to fix that).
float dataMinusOutliers[n];
int indexTracker = 0;
for (i=0; i<n; i++) {
if (data[i] >= (-2*stdevfinal) && data[i] <= (2*stdevfinal)) {
dataMinusOutliers[indexTracker] = data[i];
indexTracker += 1;
}
}
Note that this isn't particularly scalable and that the dataMinusOutliers array will potentially have quite a few unused slots. indexTracker tells you how many values were actually kept (the last used index is indexTracker - 1), so you can create yet another array of that size into which you copy the important values from dataMinusOutliers. Is there likely a more elegant solution? Yes. Does this work given your requirements though? Yup.
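The snippet above compares the raw values against plus or minus 2 * stdevfinal, which matches a two-sigma test only when the mean is zero. A variant that measures the distance from the mean could look like the sketch below; it assumes double data, and the names remove_outliers, mean, stdev and k are hypothetical.

```c
#include <stddef.h>

/* Copy every value within k standard deviations of the mean from data[]
   to out[] (which must have room for n values).
   Returns the number of values kept. */
static size_t remove_outliers(const double *data, size_t n,
                              double mean, double stdev, double k,
                              double *out)
{
    size_t kept = 0;
    double lo = mean - k * stdev;
    double hi = mean + k * stdev;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= lo && data[i] <= hi)
            out[kept++] = data[i];
    }
    return kept;
}
```

The return value is the length to use when recalculating the average and standard deviation over out[].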

Optimising C for performance vs memory optimisation using multidimensional arrays

I am struggling to decide between two optimisations for building a numerical solver for the Poisson equation.
Essentially, I have a two dimensional array, of which I require n doubles in the first row, n/2 in the second n/4 in the third and so on...
Now my difficulty is deciding whether or not to use a contiguous 2d array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, and more memory efficient method, would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a nice plot of the memory requirements in each case:
There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown, is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers, will it be used? These are the questions that we have to grapple with in Computer Science. But you can only look at them once you have numbers. Otherwise you are just guessing, and a fair amount of the time our intuition when it comes to optimizations is not correct.
Build a custom array that will follow the rules you have set.
The implementation will use a simple 1d contiguous array. You will need a function that will return the start of array given the row. Something like this:
int* Get( int* array , int n , int row ) //might contain logical errors
{
int pos = 0 ;
while( row-- )
{
pos += n ;
n /= 2 ;
}
return array + pos ;
}
Where n is the same n you described and is rounded down on every iteration.
You will have to call this function only once per entire row.
This function will never take more than O(log n) time, but if you want you can replace it with a single expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
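As a sketch of that single expression, assuming n is a power of two and row is at most log2(n) + 1 (so no rounding occurs in the halving), the geometric series n + n/2 + ... over the first row rows sums to 2n - 2n/2^row. The function names below are hypothetical; the loop version is included only to compare against.

```c
#include <stddef.h>

/* Start offset of `row`, computed by summing the preceding row lengths. */
static size_t row_offset_loop(size_t n, unsigned row)
{
    size_t pos = 0;
    while (row--) {
        pos += n;
        n /= 2;
    }
    return pos;
}

/* Same offset in closed form. Only valid when n is a power of two and
   row <= log2(n) + 1; otherwise the two functions disagree. */
static size_t row_offset_closed(size_t n, unsigned row)
{
    return 2 * n - ((2 * n) >> row);
}
```

For other values of n the rounding in n / 2 breaks the closed form, and the loop (or a small precomputed table of offsets) is the safer choice.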
You could use a single array and just calculate your offset yourself:
size_t get_offset(int n, int row, int column) {
    size_t offset = column;
    while (row--) {
        offset += n;
        n >>= 1;   /* each row is half as long as the previous one */
    }
    return offset;
}
double * array = calloc(get_offset(n, 64, 0), sizeof(double));
access via
array[get_offset(n, row, column)]

Resources