Best way to compare two huge arrays?

I'm tasked with comparing two input files: one lists 1,000,000 cards with their market price, and the other lists the same 1,000,000 cards with a higher price. I have to compare both to compute the profit.
A nested for loop of:
for (int i = 0; i < marketPriceCards.size(); i++) {
    for (int j = 0; j < priceListCards.size(); j++) {
        // compute profit
    }
}
is O(n^2), which is way too slow. I was thinking of a hash table, but how big would I have to make it? A prime number that's higher than 1,000,000?

In Java the default load factor is 0.75, so you can create your hashtable with an initial capacity of:
1.75 * <size of your data>
and that should be a good start.
By the way, you didn't mention which language you're going to use. In case it's Java, you should use HashMap, not Hashtable (just FYI).

I don't understand why you wrote a nested loop, since this can be done in one loop, O(n).
As your data is stored in two big files, you need to read them, and you need to traverse both files completely, since you need all the numbers.
If there were fewer than 100,000 records I would suggest loading them both into memory (e.g. with mmap()), but you have two big files and loading both of them into memory is not a clever move. So here is what I think you should do, assuming you have text files:
FILE *cardsFile = fopen("cards.dta", "rt");    /* example file names */
FILE *priceFile = fopen("prices.dta", "rt");
char aCardline[80], aPriceline[80];
long aCard, aPrice;

while (fgets(aCardline, 80, cardsFile) != NULL &&
       fgets(aPriceline, 80, priceFile) != NULL)
{
    sscanf(aCardline, "%ld", &aCard);      /* market price for this card */
    sscanf(aPriceline, "%ld", &aPrice);    /* list price for the same card */
    printf("Card: %ld Price: %ld\n", aCard, aPrice);
}
You will have to adapt this to the methods that actually return your cards and prices, and you may want to use bigger buffers if you need to do more processing on the data.
I personally like to store data of this size in a database.
Hope this helps.
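If the two files do not list the cards in the same order, the hash-table idea from the question can be sketched in C roughly as follows. The file format (one "id price" pair per line), the struct layout, the table size and the hash scheme (id modulo table size with linear probing) are all assumptions for illustration, and error checking is omitted:

#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 21)   /* ~2.1 million slots, roughly double the card count */

struct entry { long id; double price; int used; };

/* Linear probing: return the slot holding 'id', or the first free slot. */
static size_t lookup_slot(const struct entry *t, long id)
{
    size_t i = (unsigned long)id % TABLE_SIZE;
    while (t[i].used && t[i].id != id)
        i = (i + 1) % TABLE_SIZE;
    return i;
}

int main(void)
{
    struct entry *table = calloc(TABLE_SIZE, sizeof *table);
    FILE *market = fopen("market.txt", "r");      /* placeholder file names */
    FILE *list   = fopen("pricelist.txt", "r");
    long id;
    double price, profit = 0.0;

    /* Pass 1: load the market prices, keyed by card id. */
    while (fscanf(market, "%ld %lf", &id, &price) == 2) {
        size_t i = lookup_slot(table, id);
        table[i].id = id;
        table[i].price = price;
        table[i].used = 1;
    }

    /* Pass 2: stream the price list and accumulate the profit. */
    while (fscanf(list, "%ld %lf", &id, &price) == 2) {
        size_t i = lookup_slot(table, id);
        if (table[i].used)
            profit += price - table[i].price;
    }

    printf("total profit: %.2f\n", profit);
    return 0;
}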

Related

Cycling through interval in C efficiently

I have a dynamically allocated array containing a lot of numbers (200,000+) and I have to find out how many of these numbers fall within a given interval. There can be duplicates, and all the numbers are in random order.
Example of numbers I get at the beginning:
{1,2,3,1484984,48941651,489416,1816,168189161,6484,8169181,9681916,121,231,684979,795641,231484891,...}
Given interval:
<2;150000>
I created a simple algorithm with 2 for loops cycling through all numbers:
for (int j = 0; j <= numberOfRepeats; j++) {
    for (int i = 0; i < arraySize; i++) {
        if (currentNumber == array[i]) {
            counter++;
        }
    }
    currentNumber++;
}
printf(" -> %d\n", counter);
This algorithm is too slow for my task. Is there a more efficient way to implement my solution? Could sorting the array by value help in this case, or would that be too slow?
Example of working program:
{ 1, 7, 22, 4, 7, 5, 11, 9, 1 }
<4;7>
-> 4
The problem was simple, as the single comment on my question pointed out: there was no reason for the second loop. A single loop can do it alone.
My changed code:
for (int i = 0; i < arraySize; i++) {
    if (array[i] >= startOfInterval && array[i] <= endOfInterval) {
        counter++;
    }
}
This algorithm is too slow for my task. Is there a more efficient way to implement my solution? Could sorting the array by value help in this case, or would that be too slow?
Of course it is slow. A single-pass algorithm that counts the elements in the interval should suffice: just count them in one pass if they pass the test (n[i] >= lower_bound && n[i] <= upper_bound, or a similar check), and that will do the work.
Only if you need to handle duplicates specially (e.g. not counting them twice) do you need to track whether you have already seen a value. In that case the sorting solution is faster: a qsort(3) call is O(n log n), against the O(n^2) your double loop is doing, so it runs in almost linear time; then you make a second pass over the data, bringing the total complexity to O(n log n + n), still far lower than O(n^2) for the large amount of data you have.
Sorting has the advantage that it puts all repeated values together, so you only have to check whether the element you are processing is the same as the last one you read; if it is different, count it only if it is in the specified range.
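As a concrete illustration of that approach, here is a minimal sketch using the small example array from the question; the cmp_int comparison function is the only piece added beyond what is described above:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int array[] = { 1, 7, 22, 4, 7, 5, 11, 9, 1 };
    int arraySize = sizeof array / sizeof array[0];
    int startOfInterval = 4, endOfInterval = 7;
    int counter = 0;

    qsort(array, arraySize, sizeof array[0], cmp_int);
    for (int i = 0; i < arraySize; i++) {
        if (i > 0 && array[i] == array[i - 1])
            continue;                               /* skip duplicates */
        if (array[i] >= startOfInterval && array[i] <= endOfInterval)
            counter++;                              /* count each distinct value once */
    }
    printf(" -> %d\n", counter);                    /* prints 3: the distinct values 4, 5 and 7 */
    return 0;
}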
One final note: reading a set of 200,000 integers into an array just to filter them based on some criterion is normally a bad, non-scalable way to solve a problem. Your problem (selecting the elements that belong to a given interval) allows a scalable and better solution by streaming it: you read a number, check whether it is in the interval, then output it, count it, or do whatever you like with it, without using a large amount of memory to hold everything before starting. That is a far better way to solve the problem, as it allows you to read a truly unbounded stream of numbers (coming e.g. from a file) and produce an output based on that:
#include <stdio.h>

#define A (2)
#define B (150000)

int main(void)
{
    int the_number;
    size_t count = 0;
    int res;

    while ((res = scanf("%d", &the_number)) > 0) {
        if (the_number >= A && the_number <= B)
            count++;
    }
    printf("%zu numbers fitted in the range\n", count);
    return 0;
}
With this approach you can feed the program 1.0E26 numbers (assuming you have a file system large enough to hold a file of that size) and it will still be able to handle them, whereas you cannot create an array with the capacity to hold 10^26 values.

create two-dimensional array/matrix using C

I need to read some kind of matrix from a CSV file (the number of matrix columns and rows may be different every time) using C.
The file will be something like that:
#,#,#,#,#,#,.,#,.,.,.$
#,.,#,.,.,#,.,#,#,#,#$
#,.,#,.,.,.,.,.,.,#,#$
#,.,#,.,.,#,#,#,#,#,#$
#,.,.,#,.,.,.,.,.,.,#$
#,.,.,.,#,.,#,#,.,.,#$
#,.,.,.,.,#,.,.,.,.,#$
#,.,.,.,.,#,.,.,.,.,#$
#,.,.,.,.,.,.,.,.,.,#$
#,#,#,#,#,#,#,#,#,.,#$
I need to read the file and save it to a two-dimensional array to be able to iterate through it and find the path out of the labyrinth using Lee algorithm.
So I want to do something like:
int fd = open(argv[i], O_RDONLY);

while (read(fd, &ch, 1)) {
    /* here should be some for loops to find the number of columns and rows */
}
Unfortunately, I don't know how to do that when the height and width of the matrix are unknown.
I was trying this:
while (read(fd, &ch, 1)) {
    for (int i = 0; arr[i] != '\0'; i++) {
        for (int j = 0; j != '\n'; j++) {
            /* somehow save the values and the number of columns and rows */
        }
    }
}
However, the number of rows could be greater than the number of columns.
Any help will be appreciated.
If the size isn't known but has to be determined as you parse the file, then a simple but somewhat naive idea would be to use a char **rows = malloc(n * sizeof *rows);, where n is a number of rows large enough to cover most normal use cases. realloc if you go past n.
Then, for each row you read, store a copy of it in rows[i] through another malloc followed by strcpy/memcpy.
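A rough sketch of that growth scheme (the function name, the initial capacity of 64 and the doubling policy are my own choices, and error checking is omitted for brevity):

#include <stdlib.h>
#include <string.h>

/* Append a copy of 'line' to a growing char** table, doubling the capacity
   with realloc whenever the current guess runs out. */
static char **append_row(char **rows, size_t *nrows, size_t *cap, const char *line)
{
    if (*nrows == *cap) {
        *cap = *cap ? *cap * 2 : 64;
        rows = realloc(rows, *cap * sizeof *rows);
    }
    rows[*nrows] = malloc(strlen(line) + 1);   /* one allocation per row */
    strcpy(rows[*nrows], line);
    (*nrows)++;
    return rows;
}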
A smarter version of the same idea would be to first read the first row, find the row length, and then assume that all rows in the file have that size. You can then do a char (*rows)[row_length+1] = malloc(n * sizeof *rows); to allocate a true 2D array. This has advantages over the char**, since you get a proper cache-friendly 2D array with faster access, faster allocation and less heap fragmentation. See Correctly allocating multi-dimensional arrays for details about that.
Another big advantage of the pointer-to-2D-array version is that if you know n in advance, you can actually read/fread the whole file in one go, which would be a significant performance boost, since file I/O will be the bottleneck in this program.
If you don't know n in advance, you would still have to realloc in case you end up reading more than n rows. So a third option would be to use a linked list, which is probably the worst option, since it is slow and adds complexity. Its only advantage is that a linked list lets you swiftly add/remove rows on the fly.
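As an illustration of the "read the first row, then allocate a true 2D array" idea, here is a minimal sketch. It assumes the maze file from the question, rows of equal length, and an upper bound of MAX_ROWS rows (past which you would realloc); the file name and the macro are placeholders, not from the question:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ROWS 1024   /* assumed upper bound; realloc if exceeded */

int main(void)
{
    FILE *f = fopen("maze.csv", "r");              /* placeholder file name */
    if (!f)
        return 1;

    char line[4096];
    if (!fgets(line, sizeof line, f))
        return 1;

    size_t row_len = strcspn(line, "\r\n");        /* length of the first row */
    char (*rows)[row_len + 1] = malloc(MAX_ROWS * sizeof *rows);

    size_t nrows = 0;
    do {
        line[strcspn(line, "\r\n")] = '\0';        /* strip the newline */
        strncpy(rows[nrows], line, row_len);
        rows[nrows][row_len] = '\0';
        nrows++;
    } while (nrows < MAX_ROWS && fgets(line, sizeof line, f));

    printf("read %zu rows of %zu characters each\n", nrows, row_len);

    free(rows);
    fclose(f);
    return 0;
}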

Efficient search for series of values in an array? Ideally OpenCL usable?

I have a massive array I need to search (actually it's a massive array of smaller arrays, but for all intents and purposes, let's consider it one huge array). What I need to find is a specific series of numbers. Obviously, a simple for loop will work:
Pseudocode:
for (x = 0; x < array_length; x++) {
    if (array[x] == searchfor[location])
        location++;
    else
        location = 0;
    if (location >= strlen(searchfor))
        return FOUND_IT;
}
Thing is I want this to be efficient. And in a perfect world, I do NOT want to return the prepared data from an OpenCL kernel and do a simple search loop.
I'm open to non-OpenCL ideas, but something I can implement across a work group size of 64 on a target array length of 1024 would be ideal.
I'm kicking around ideas (split the target across work items, compare each item in a loop against each part of the target, set a flag if it matches, and check the flags after all work items complete; though as I write that, it sounds very inefficient), but I'm sure I'm missing something.
Another idea was, since the target array is uchar, to lump it together as a double and check 8 indexes at a time. Not sure I can do that in OpenCL easily.
I'm also toying with the idea of hashing the search target with something fast, likely MD5, then grabbing strlen(searchtarget) characters at a time, hashing them, and seeing if they match. Not sure how much the hashing would hurt my search speed though.
Oh - code is in C, so no C++ maps (something I found while googling that seems like it might help?)
Based on the comments above, and for future reference: it seems a simple for loop scanning the range IS the most efficient way to find matches, given an OpenCL implementation.
Create an index array with one entry per possible uchar value. For each uchar in the search string, set array[uchar] = the position in the search string of the last occurrence of that uchar. The rest of the array contains -1.
unsigned searchindexing[UCHAR_MAX + 1];               /* UCHAR_MAX is in <limits.h> */
memset(searchindexing, 0xFF, sizeof searchindexing);  /* set every entry to (unsigned)-1 */
for (i = 0; i < strlen(searchfor); i++)
    searchindexing[(unsigned char)searchfor[i]] = i;
Because the table is filled from the beginning of searchfor and later entries overwrite earlier ones, a uchar occurring more than once ends up with the position of its last occurrence in searchindexing (which is what the search below relies on).
Then you search the array by stepping strlen(searchfor) positions at a time, unless you find a uchar that occurs in searchfor.
for (i = 0; i < MAXARRAYLEN; i += strlen(searchfor))
    if ((unsigned)-1 != searchindexing[array[i]]) {
        i -= searchindexing[array[i]];
        if (!memcmp(searchfor, &array[i], strlen(searchfor)))
            return FOUND_IT;
    }
If most of the uchars in array don't occur in searchfor, this is probably the fastest way. Note that the code has not been optimized.
Example: searchfor = "banana", so strlen is 6. searchindexing['a'] = 5, ['b'] = 0, ['n'] = 4, and the rest hold a value that is not between 0 and 5, like -1 or UINT_MAX. If array[i] is something not in "banana", like a space, i is incremented by 6. If array[i] is 'a', you might be inside "banana", and it could be any of the 3 'a's. So we assume it is the last 'a', move 5 places back and compare with searchfor. If that succeeds, we found it; otherwise we step 6 places forward.
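Putting the two fragments together, a self-contained sketch might look like the following. The function name find_series, the haystack/haylen parameters, and the extra branch that handles probes near the start of the array are additions for illustration, not part of the original answer:

#include <limits.h>
#include <stdio.h>
#include <string.h>

static long find_series(const unsigned char *haystack, size_t haylen, const char *searchfor)
{
    size_t len = strlen(searchfor);
    unsigned searchindexing[UCHAR_MAX + 1];

    memset(searchindexing, 0xFF, sizeof searchindexing);     /* (unsigned)-1 = not in searchfor */
    for (size_t k = 0; k < len; k++)
        searchindexing[(unsigned char)searchfor[k]] = (unsigned)k;

    size_t i = 0;
    while (i < haylen) {
        unsigned pos = searchindexing[haystack[i]];
        if (pos == (unsigned)-1) {
            i += len;                    /* character not in searchfor: skip a whole block */
        } else if (pos <= i) {
            size_t start = i - pos;      /* assume we hit the last occurrence */
            if (start + len <= haylen && !memcmp(searchfor, &haystack[start], len))
                return (long)start;      /* found it */
            i = start + len;             /* step back, then move one block forward */
        } else {
            i += len - pos;              /* alignment would start before the array */
        }
    }
    return -1;
}

int main(void)
{
    const unsigned char data[] = "xxxxxxbananaxxxx";
    printf("match at offset %ld\n", find_series(data, sizeof data - 1, "banana"));
    return 0;
}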

Best (fastest) way to find the number most frequently entered in C?

Well, I think the title basically explains my question. I will have n numbers to read; these n numbers go from 1 to x, where x is at most 10^5. What is the fastest way (least possible running time) to find out which number was entered the most times? Note that the number that appears most often appears more than half of the time.
What I've tried so far:
// for 1 <= x <= 10^5
int v[100000+1];

// multiple instances; ends when n = 0
while (scanf("%d", &n) && n > 0) {
    zerofill(v);
    for (i = 0; i < n; i++) {
        scanf("%d", &x);
        v[x]++;
        if (v[x] > n/2)
            i = n;        /* majority found: force the loop to end */
    }
    printf("%d\n", x);
}
Zero-filling an array of x positions, then incrementing v[x] and at the same time checking whether v[x] is greater than n/2, is not fast enough.
Any idea might help, thank you.
Observation: there is no need to care about the amount of memory used.
The trivial solution of keeping a counter array is O(n), and you obviously can't get better than that. The fight is then about the constants, and that is where a lot of details come into play, including the exact values of n and x, the kind of processor, the kind of architecture and so on.
On the other hand, this really looks like the "knockout" (majority vote) problem, but that algorithm needs two passes over the data plus an extra conditional, so in practical terms, on the computers I know, it will most probably be slower than the counter-array solution for a lot of n and x values.
The good point of the knockout solution is that you don't need to put a limit x on the values, and you don't need any extra memory.
If you already know that there is a value with an absolute majority (and you simply need to find which value it is), then this could do it (but note that there are two conditionals in the inner loop):
initialize count = 0
loop over all elements
    if count is 0 then set champion = element and count = 1
    else if element != champion decrement count
    else increment count
At the end of the loop, champion will be the value with the absolute majority of elements, if such a value is present.
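A minimal C sketch of that loop, assuming the values are already stored in an array v[] of length n (the names here are placeholders):

#include <stdio.h>

/* Majority-vote ("knockout") pass, as described in the pseudocode above. */
static int majority_candidate(const int *v, int n)
{
    int champion = 0, count = 0;
    for (int i = 0; i < n; i++) {
        if (count == 0) { champion = v[i]; count = 1; }
        else if (v[i] != champion) count--;
        else count++;
    }
    return champion;   /* the absolute-majority value, if one exists */
}

int main(void)
{
    int v[] = { 3, 1, 3, 3, 2, 3, 3 };
    printf("%d\n", majority_candidate(v, 7));   /* prints 3 */
    return 0;
}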
But as said before, I'd expect the trivial
for (int i = 0, n = size; i < n; i++) {
    if (++count[x[i]] > half)
        return x[i];
}
to be faster.
EDIT
After your edit, it seems you're really looking for the knockout algorithm, but if the concern is speed, that's probably still the wrong question on modern computers (100,000 elements is nothing, even for a nail-sized single chip today).
I think you can create a max heap keyed on the count of each number you read, and use heap sort to find all the counts that are greater than n/2.

Maintain a sorted array that a separate, iterative function can keep accessing

I'm writing code for a decision tree in C. Right now it gives me the correct result (0% training error, low test error), but it takes a long time to run.
The problem lies in how often I run qsort. My basic algorithm is this:
for every feature
    sort that feature column using qsort
    remove duplicate feature values in that column
    for every unique feature value
        split
        determine entropy given that split
save the best feature to split + split value
for every training_example
    if training_example's value for best feature < best split value, store in Left[]
    else store in Right[]
recursively call this function, using only the Left[] training examples
recursively call this function, using only the Right[] training examples
Because the last two lines are recursive calls, and because the tree can extend for dozens and dozens of branches, the number of calls to qsort is huge (especially for my dataset, which has > 1000 features).
My idea to reduce the runtime is to create a 2d array (in a separate function) where each column is a sorted feature column. Then, as long as I maintain a vector of row numbers of the training examples in Left[] and Right[] for each recursive call, I can just call this separate function, grab the rows I want in the pre-sorted feature vector, and save the cost of having to qsort each time.
I'm fairly new to C, so I'm not sure how to code this. In MatLab I can just have a global array that any function can change or access; I'm looking for something like that in C.
Global arrays in C are totally possible. There are actually two ways of doing that. In the first case the dimensions of the array are fixed for the application:
#define NROWS 100
#define NCOLS 100

int array[NROWS][NCOLS];

int main(void)
{
    int i, j;
    for (i = 0; i < NROWS; i++)
        for (j = 0; j < NCOLS; j++)
        {
            array[i][j] = i+j;
        }
    return 0;
}
In the second example the dimensions may depend on values from the input.
#include <stdlib.h>

int **array;

int main(void)
{
    int nrows = 100;
    int ncols = 100;
    int i, j;

    array = malloc(nrows*sizeof(*array));
    for (i = 0; i < nrows; i++)
    {
        array[i] = malloc(ncols*sizeof(*(array[i])));
        for (j = 0; j < ncols; j++)
        {
            array[i][j] = i+j;
        }
    }
    return 0;
}
Although the access to the arrays in both examples looks deceptively similar, the implementation of the arrays is quite different. In the first example the array is located in one contiguous piece of memory, and the stride from one row to the next is a whole row. In the second example each row access goes through a pointer to that row, and each row is its own piece of memory; the various rows can be located in different areas of memory. In the second example rows might also have different lengths, in which case you would need to store the length of each row somewhere too.
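For completeness, a third variant (not in the original answer) keeps the runtime sizing of the second example but the single contiguous block of the first, at the price of doing the index arithmetic yourself; the global names here are placeholders:

#include <stdlib.h>

/* One contiguous block; element (i, j) lives at array[i*g_ncols + j]. */
int *array;
int g_nrows, g_ncols;

int main(void)
{
    g_nrows = 100;
    g_ncols = 100;
    array = malloc(g_nrows * g_ncols * sizeof *array);

    for (int i = 0; i < g_nrows; i++)
        for (int j = 0; j < g_ncols; j++)
            array[i*g_ncols + j] = i + j;

    free(array);
    return 0;
}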
I don't fully understand what you are trying to achieve, because I'm not familiar with the terminology of decision trees, features and the standard approaches to training sets. But you may also want to have a look at other data structures for maintaining sorted data:
Red–black tree (http://en.wikipedia.org/wiki/Red–black_tree): maintains a more or less balanced and sorted tree.
AVL tree: a bit slower, but a more strictly balanced sorted tree.
Trie: a sorted tree over lists of elements.
Hash function: easily maps a complex element to an integral value that can be used to bucket the elements. Good for finding exact elements, but there is no real order among the elements themselves.
P.S.1: Coming from Matlab, you may want to consider moving to a language other than C. C++ has standard library support for the above data structures; Java and Python come to mind, or even Haskell if you are daring. Pointer handling in C can be quite tedious and error-prone.
P.S.2: I'm unable to include a "-" in a URL on StackOverflow, so the red-black tree link is a bit off and can't be clicked. If someone can edit my post to fix it, I would appreciate that.
