All the references to this error I could find searching online were completely inapplicable to my situation; they all dealt with variables involving dots, like a.b (structures, in other words), whereas I am strictly using arrays. Nothing in my code involves a dot, nor asks about one.
Ok, I have this GINORMOUS array called tier2comparatorconnectionpoints. It is a 4-D array of size 400×10×20×10. Consider tier2comparatorconnectionpoints(counter,counter2,counter3,counter4).
counter is a number from 1 to 400,
counter2 is a number from 1 to numchromosomes(counter), and numchromosomes(counter) is bounded by 10,
counter3 is a number from 1 to tier2numcomparators(counter,counter2), which is in turn bounded by 20, and
counter4 is a number from 1 to tier2inputspercomparator(counter,counter2,counter3), which is bounded by 10.
Now, so that I don't run out of RAM, I have tier2comparatorconnectionpoints stored as type int8. UNFORTUNATELY, at some point in my horrendous amount of code, I forgot to cast it to a double when doing math with it, and a rounding error from multiplying it by a rand leaves tier2comparatorconnectionpoints, for some combinations of its 4 indices, outside the values it's allowed to have.
The values it's allowed to have are 1 through tier1numcomparators(counter,counter2), which is bounded by 40; 41 through 40+tier2numcomparators(counter,counter2), with tier2numcomparators(counter,counter2) bounded by 20; and 61 through 60+tier2numcomparators(counter,counter2). So it's never allowed to exceed 80, since it can't exceed 60+tier2numcomparators(counter,counter2) and tier2numcomparators(counter,counter2) is bounded by 20. It's also not allowed to fall in the gaps: greater than tier1numcomparators(counter,counter2) but less than 40, or greater than 40+tier2numcomparators(counter,counter2) but less than 60. I became aware of the problem because it was being set to 81 somewhere.
This is an evolutionary simulation, by the way; it's natural selection on simulated organisms. I need to hunt down the part of the code that is allowing the values of tier2comparatorconnectionpoints to exceed what it's allowed to be, but that is a separate problem.
A temporary fix for my data, just so that it at least conforms to its allowed values, is to clamp it: set anything greater than tier1numcomparators(counter,counter2) but less than 40 to tier1numcomparators(counter,counter2), set anything greater than 40+tier2numcomparators(counter,counter2) but less than 60 to 40+tier2numcomparators(counter,counter2), and set anything greater than 60+tier2numcomparators(counter,counter2) to 60+tier2numcomparators(counter,counter2). I first found this problem because a value was being set to 81, so it didn't just exceed 60+tier2numcomparators(counter,counter2); it exceeded 60+20, with tier2numcomparators bounded by 20.
I hope this isn't all too-much-information, but I felt it might be necessary to get you to understand just what sort of variables these are.
So in my attempts to at least turn the data into valid data, I did the following:
for counter=1:size(tier2comparatorconnectionpoints,1)
    for counter2=1:size(tier2comparatorconnectionpoints,2)
        for counter3=1:size(tier2comparatorconnectionpoints,3)
            for counter4=1:size(tier2comparatorconnectionpoints,4)
                if tier2comparatorconnectionpoints(counter,counter2,counter3,counter4)>60+tier2numcomparators(counter,counter2)
                    tier2comparatorconnectionpoints(counter,counter2,counter3,counter4)=60+tier2numcomparators(counter,counter2);
                end
            end
        end
    end
end
And that worked just fine. And then:
for counter=1:size(tier2comparatorconnectionpoints,1)
    for counter2=1:size(tier2comparatorconnectionpoints,2)
        for counter3=1:size(tier2comparatorconnectionpoints,3)
            for counter4=1:size(tier2comparatorconnectionpoints,4)
                if tier2comparatorconnectionpoints(counter,counter2,counter3,counter4)>40+tier2numcomparators(counter,counter2)
                    if tier2comparatorconnectionpoints(counter,counter2,counter3,counter4)<60
                        tier2comparatorconnectionpoints(counter,counter2,counter3,counter4)=40+tier2numcomparators(counter,counter2);
                    end
                end
            end
        end
    end
end
And that's where it said "Attempt to reference field of non-structure array".
TBH it sounds like maybe you've made a typo and put a . in somewhere? Otherwise please post the entire error as maybe it's happening in a different function or something.
Either way, you don't need all those for loops; it's simpler and usually quicker to do the following (and it should bypass your error).
First, replicate your tier2numcomparators matrix so that it has the same dimension sizes as tier2comparatorconnectionpoints:
T = repmat(tier2numcomparators + 40, 1, 1, size(tier2comparatorconnectionpoints, 3), size(tier2comparatorconnectionpoints, 4));
Now in one shot you can create a logical matrix of which elements meet your criteria:
ind = tier2comparatorconnectionpoints > T & tier2comparatorconnectionpoints < 60;
Finally employ logical indexing to set your desired elements:
tier2comparatorconnectionpoints(ind) = T(ind);
You can play around with bsxfun instead of repmat if this is slow or takes too much memory.
Mathematician here looking for a bit of help. (If you ever need math help I'll try to reciprocate on math.stackexchange!) Sorry if this is a dup. Couldn't find it myself.
Here's the thing. I write a lot of code (mostly in C) that is extremely slow, and I know it could be sped up considerably, but I'm not sure what data structure to use. I went to school 20 years ago and unfortunately never got to take a computer science course. I have watched a lot of open-course videos on data structures, but I'm still a bit fuddled, never having taken an actual class.
Mostly my functions just take integers to integers. I almost always use 64-bit numbers and I have three use cases that I'm interested in. I use the word small to mean no more than a million or two in quantity.
Case 1: Small numbers as input. Outputs are arbitrary.
Case 2: Any 64-bit values as input, but only a small number of them. Outputs are arbitrary.
Case 3: Two parameter functions with one parameter that's small in value (say less than two million), and the other parameter is Large but with only a small number of possible inputs. Outputs are arbitrary.
For Case 1, I just make an array to cache the values. Easy and fast.
For Case 2, I think I should be using a hash. I haven't yet done this but I think I could figure it out if I took the time.
Case 3 is the one I'd like help with and I'm not even sure what I need.
For a specific example, take a function F(n,p) that takes large inputs n for the first parameter and a prime p for the second. The prime is at most the square root of n, so even if n is about 10^12, the primes are only up to about a million. Suppose this function is recursive or otherwise difficult to calculate (expensive) and will be called over and over with the same inputs. What might be a good data structure to use to easily create and retrieve the possible values of F(n,p) so that I don't have to recalculate it every time? The total number of possible inputs should be 10 or 20 million at most.
Help please! and Thank you in advance!
You are talking about memoization, I presume. Trying to answer without a concrete example...
If you have to retrieve values keyed by a small range (the 2nd parameter), say from 0 to 10^6, and that needs to be super fast, and... you have enough memory, you could simply declare an array of int (or long...) which stores the output values for all inputs.
To make things simple, let's say the value 0 means no value has been set:
long *small = calloc(MAX, sizeof *small); // calloc initializes to 0
then, in a function that returns the value for a small-range input:
if (small[ input ]) return small[ input ]; // cache hit
....calculate
small[ input ] = value; // cache the freshly computed value
return value;
+/-
+ Very fast
- Memory consumption takes the whole range, [ 0, MAX-1 ].
If you need to store arbitrary inputs, use one of the many libraries available (there are so many). Use a set/map structure that tells you whether an item exists or not:
if (set.exists( input )) return set.get( input );
....calculate
set.set( input, value );
+/-
+ less memory usage
+ still fast (said to be O(1))
- but, not as fast as a mere array
Add to this the hashed set (...), which is faster, since in terms of probabilities the values (hashes) are better distributed.
+/-
+ less memory usage than array
+ faster than a simple Set
- but, not as fast as a mere array
- use more memory than a simple Set
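For Case 3 specifically (the F(n,p) example), one option is to memoize on the pair (n,p) directly with a single hash table. Below is a minimal C sketch of that idea; the table size, the hash mixing constant and the placeholder F_expensive are my own choices and not part of the question, so treat it as a starting point rather than a finished implementation. At this table size it uses roughly 0.8 GB, sized for the stated 10-20 million distinct inputs.

#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 25)        /* ~33M slots; keep well above the ~10-20M inputs */

typedef struct {
    uint64_t n;                      /* large first argument */
    uint32_t p;                      /* small prime second argument */
    uint8_t  used;                   /* slot occupied? */
    int64_t  value;                  /* cached F(n, p) */
} memo_slot;

static memo_slot *memo;              /* allocated once in main */

/* Mix the pair (n, p) into a table index. */
static size_t memo_hash(uint64_t n, uint32_t p)
{
    uint64_t h = (n ^ ((uint64_t)p << 32 | p)) * 0x9E3779B97F4A7C15ull;
    return (size_t)(h % TABLE_SIZE);
}

/* Placeholder for the real, expensive F -- replace with your computation. */
static int64_t F_expensive(uint64_t n, uint32_t p)
{
    return (int64_t)(n % p);
}

/* Memoized wrapper: open addressing with linear probing.
 * Assumes the table never fills up completely. */
static int64_t F(uint64_t n, uint32_t p)
{
    size_t i = memo_hash(n, p);
    while (memo[i].used) {
        if (memo[i].n == n && memo[i].p == p)
            return memo[i].value;             /* cache hit */
        i = (i + 1) % TABLE_SIZE;             /* probe the next slot */
    }
    int64_t v = F_expensive(n, p);            /* cache miss: compute... */
    memo[i] = (memo_slot){ n, p, 1, v };      /* ...and remember it */
    return v;
}

int main(void)
{
    memo = calloc(TABLE_SIZE, sizeof *memo);  /* zeroed, so every slot starts unused */
    if (!memo) return 1;
    int64_t a = F(1000000000000ull, 999983);  /* first call computes */
    int64_t b = F(1000000000000ull, 999983);  /* second call is just a lookup */
    return a == b ? 0 : 1;
}

Keeping one flat table keyed on the pair avoids juggling a separate structure per prime; if memory is tight, an alternative is an array indexed by the prime (there are only about 150,000 primes below two million) whose entries each hold a small per-prime map keyed by n.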
Here's my problem. I have a set of 20 objects stored in memory as an array. I want to store a second piece of data that defines an order for the objects to be displayed.
The simplest way to store the order is as an array of 20 unsigned integers, each of which is 5 bits (aka 0-31). The position of the object in the output list would be defined by the number stored in this array at the same index as the object in its array.
But... I know from statistics that there are only 20! (that's 20 factorial) ways to arrange these objects.
This could be stored in 62 bits, since 2^62 > 20!
I'm currently using 100 bits to store the same information.
So my question is this: Is there a space efficient way to store ORDER as a sequence of bits?
I have some additional constraints as well. This will run on an embedded device, so I can't use any huge arrays or high-level math functions. I would need a simple iterative method.
Edit: Some clarification on why this is necessary. Say for example the objects are pictures, and they're stored in ROM (aka they can't be moved around). Now let's say I want to keep track of what order to display the images in, and I'm going to update that order every second. My device has 1k of storage with wear leveling, but each bit in the storage can only be written 1000 times before it becomes unreliable. If I need 1kb to store the order, then my device will only work for 1000 seconds. If I need 0.1kb, it will work for 10k seconds, and so on. Thus the device's longevity will be inversely proportional to the number of bits I need to update every cycle.
You can store the order in a single 64-bit value x:
For the first choice, 20 possibilities, compute the index as x % 20 and update x as x /= 20,
For the next choice, only 19 possibilities, compute x % 19 and update x as x /= 19.
Continue this process 17 more times and you are done.
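To make that concrete, here is a small C sketch of both directions (the names are mine; it assumes the 20 objects are identified by indices 0..19). Encoding builds x as a mixed-radix number whose digit i is "position of the chosen item among the items not yet used", with radices 20, 19, 18, ...; decoding is exactly the repeated x % k, x /= k described above.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N 20

/* Pack a permutation of 0..N-1 into one 64-bit value (always < 20! < 2^62). */
uint64_t encode_order(const uint8_t perm[N])
{
    uint8_t remaining[N];
    for (int i = 0; i < N; i++) remaining[i] = (uint8_t)i;

    uint64_t x = 0, place = 1;
    for (int i = 0; i < N; i++) {
        int idx = 0;
        while (remaining[idx] != perm[i]) idx++;   /* position among unused items */
        x += place * (uint64_t)idx;
        place *= (uint64_t)(N - i);                /* radices 20, 19, 18, ... */
        /* drop the chosen item from the remaining list */
        memmove(&remaining[idx], &remaining[idx + 1], (size_t)(N - i - 1 - idx));
    }
    return x;
}

/* Unpack: repeatedly take x % k as the index into the unused items. */
void decode_order(uint64_t x, uint8_t perm[N])
{
    uint8_t remaining[N];
    for (int i = 0; i < N; i++) remaining[i] = (uint8_t)i;

    for (int k = N; k >= 1; k--) {
        int idx = (int)(x % (uint64_t)k);
        x /= (uint64_t)k;
        perm[N - k] = remaining[idx];
        memmove(&remaining[idx], &remaining[idx + 1], (size_t)(k - 1 - idx));
    }
}

int main(void)
{
    uint8_t order[N], back[N];
    for (int i = 0; i < N; i++) order[i] = (uint8_t)(N - 1 - i);  /* example: reversed order */

    uint64_t packed = encode_order(order);
    decode_order(packed, back);
    printf("packed = %llu (%s)\n", (unsigned long long)packed,
           memcmp(order, back, N) == 0 ? "round trip ok" : "round trip FAILED");
    return 0;
}

This stores the whole order in a single 8-byte value (of which only 62 bits are ever used), versus the 100 bits of the 5-bits-per-slot encoding.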
I think I've found a partial solution to my own question. Assuming I start at the left side of the order array, for every move right there are fewer remaining possibilities for the position value. The number of possibilities is 20, 19, 18, etc. I can take advantage of this by populating the order array in a relative fashion. The first index will place a value in the order array. There are 20 possibilities so this takes 5 bits. Placing the next value, there are only 19 positions available (still 5 bits). Proceeding through the whole array, the bits required are now 5,5,5,5,4,4,4,4,4,4,4,4,3,3,3,3,2,2,1,0. So that gets me down to 69 bits, much better.
There's still some "wasted" precision in each of the values, since for example the first position can store 32 possible values, even though there are only 20. I'm not sure how to deal with this, but I think it will have something to do with carrying a remainder from one calculation to the next...
Given an array of values,
arr = [8,10,4,5,3,7,6,0,1,9,13,2]
X is an array of values that can be chosen at a time, where X.length != 0 and X.length < arr.length.
The chosen values are then fed into a function, score(), which will return a score based on the array of selected values.
Example 1:
X = [8]
score(X) = 71
Example 2:
X = [4]
score(X) = 36
Example 3:
X = [8,10,7]
score(X) = 51
Example 4:
X = [5,9,0]
score(X) = 4
The function score() here is a blackbox and we can't modify how the function works, we just provide an input and the function will return the score output.
My problem: How to get the lowest score for each set of numbers?
Meaning, if X is an array that has only 1 value, and I feed all the different values in arr, each value will return me a different score value, and I find which arr value provides the lowest score.
If X is an array of 3 values, I feed a combination of all the different possible values in arr, with each different set of 3 values returning a different score and finding the lowest score.
This is simple enough to do if my arr is small. However, if I have an array of 50 or even 100 values, how can I create an algorithm that would provide the lowest score based on the number of input values?
tl;dr: If you don't know anything about score, then you can't speed it up.
In order to optimize score itself, you would have to know how it works. After all, "optimizing" simply means "doing the same thing more efficiently", but how can you know if it really does "the same thing" if you don't know what "the same thing" is? Plus, speeding up score will not help you with the combinatorial explosion anyway. The number of combinations grows so fast that any speedups to score will be quickly eaten up by slightly larger inputs.
In order to optimize how you apply score, you would again need to know something about it. If you knew something about score, you could, for example, only generate combinations that you know will yield different values, or combinations that you know will only yield larger values. In other words, you could exploit some structure in the output of score in order to reduce the input size. However, we don't know the structure of the output of score; in fact, we don't even know if there is any structure at all! So we can't exploit it. Plus, there would have to be some extreme redundancy and regularity in the structure in order to get a significant reduction in input size.
In his comment, @ndn suggested applying some form of machine learning to discover structure in the output. How well this works depends on what kind of structure the output has. And of course, this again assumes that there even is some structure to discover, which we don't know. And again, even if there were some structure, it would have to be very redundant and regular to make up for the combinatorial explosion of the input space.
Really, brute force is the only way. Our last resort is parallelization. Maybe, if we distribute the problem across enough CPU cores, we can tackle it? Unfortunately, the combinatorial explosion in the input space is still really going to hurt you:
If we assume that we have a 10THz CPU (i.e. a thousand times faster than the fastest currently available CPU), and we assume that we can compute score in a single clock cycle, and we assume that we have a computer with 10 million cores (again, that's a thousand times larger than the largest supercomputers), it's still going to take over 400 years to find the optimal selection for an input array as small as 100 numbers. And even if we make our CPU a billion times faster and the computer a billion times bigger, simply doubling the size of the array to 200 items will increase the runtime to 500 trillion years.
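For the curious, the rough arithmetic behind that estimate (assuming one subset evaluated per core per clock cycle): a 100-element array has about 2^100 ≈ 1.3×10^30 subsets to try; at 10^13 evaluations per second per core times 10^7 cores, that's 10^20 evaluations per second, hence about 1.3×10^10 seconds, a bit over 400 years. For 200 elements there are 2^200 ≈ 1.6×10^60 subsets; even with 10^18 times more throughput (10^38 evaluations per second) that's about 1.6×10^22 seconds, on the order of 500 trillion years.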
There is a reason why we call combinatorial explosion "combinatorial explosion", after all.
I have a vector of numbers like this:
myVec= [ 1 2 3 4 5 6 7 8 ...]
and I have a custom function which takes the input of one number, performs an algorithm and returns another number.
cust(1)= 55, cust(2)= 497, cust(3)= 14, etc.
I want to be able to return the number in the first vector which yielded the highest outcome.
My current thought is to generate a second vector, outcomeVec, which contains the output from the custom function, and then find the index of that vector that has max(outcomeVec), then match that index to myVec. I am wondering, is there a more efficient way of doing this?
What you described is a good way to do it.
outcomeVec = myfunc(myVec);
[~,ndx] = max(outcomeVec);
myVec(ndx) % input that produces max output
Another option is to do it with a loop. This saves a little memory, but may be slower.
maxOutputValue = -Inf;
maxOutputNdx = NaN;
for ndx = 1:length(myVec)
output = myfunc(myVec(ndx));
if output > maxOutputValue
maxOutputValue = output;
maxOutputNdx = ndx;
end
end
myVec(maxOutputNdx) % input that produces max output
Those are pretty much your only options.
You could make it fancy by writing a general purpose function that takes in a function handle and an input array. That method would implement one of the techniques above and return the input value that produces the largest output.
Depending on the size of the range of discrete numbers you are searching over, you may find that a solution based on a golden section algorithm works more efficiently. I tried for instance to minimize the following:
bf = -21;
f = @(x) round(x-bf).^2;
within the range [-100 100] with a routine based on a script from the Mathworks file exchange. This specific file exchange script does not appear to implement the golden section correctly as it makes two function calls per iteration. After fixing this the number of calls required is reduced to 12, which certainly beats evaluating the function 200 times prior to a "dumb" call to min. The gains can quickly become dramatic. For instance, if the search region is [-100000 100000], golden finds the minimum in 25 function calls as opposed to 200000 - the dependence of the number of calls in golden section on the range is logarithmic, not linear.
So if the range is sufficiently large, other methods can definitely beat min by requiring fewer function calls. Minimization search routines sometimes incorporate such a search in their early steps. However, you will have a problem with convergence (termination) criteria, which you will have to modify so that the routine knows when to stop. The best option is probably to narrow the search region for application of min by starting out with a few iterations of golden section.
An important caveat is that golden section is guaranteed to work only with unimodal regions, that is, displaying a single minimum. In a region containing multiple minima it's likely to get stuck in one and may miss the global minimum. In that sense min is a sure bet.
Note also that the function in the example here rounds input x, whereas your function takes an integer input. This means you would have to place a wrapper around your function which rounds the input passed by the calling golden routine.
Others appear to have used genetic algorithms to perform such a search, although I did not research this.
Say I have 10 billion numbers stored in a file. How would I find the number that has already appeared once previously?
Well, I can't just load billions of numbers into an array at a stretch and then use a simple nested loop to check whether a number has appeared previously.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N):
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number's value. Once you hit a value that is already in the table, you have found your duplicate.
Author Edit:
Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th of that, 512 MB. On many machines, this computation is now possible without considering either a persistent hash or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course, one needs to use a somewhat "special" hash table:
Take a hash table consisting of 2^32 bits. Since the question asks about 4-byte integers, there are at most 2^32 different values, i.e. one bit for each possible number. 2^32 bits = 512 MB.
So now one just has to determine the location of the corresponding bit in the bitmap and set it. If one encounters a bit which is already set, the number has already occurred in the sequence.
The important question is whether you want to solve this problem efficiently, or whether you want to solve it exactly.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of a very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass (using a standard hash table, as kbrimington suggested), determine the elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
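To illustrate the filter-then-verify idea, here is a bare-bones C sketch of a Bloom filter over 64-bit values. The filter size, the two derived hash functions and the splitmix-style mixer are my own choices, not something from the answer; anything flagged as "possibly seen" would still need to be confirmed with an exact structure on a second pass.

#include <stdint.h>
#include <stdlib.h>

#define M_BITS (1ull << 33)        /* 2^33 bits = 1 GiB of filter; assumes a 64-bit build */

static uint8_t *filter;

static uint64_t mix(uint64_t x)    /* splitmix64-style bit mixer */
{
    x += 0x9E3779B97F4A7C15ull;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

static void set_bit(uint64_t i) { filter[i >> 3] |= (uint8_t)(1u << (i & 7)); }
static int  get_bit(uint64_t i) { return (filter[i >> 3] >> (i & 7)) & 1; }

/* Returns 1 if v was possibly seen before (needs exact verification),
 * 0 if v is definitely new; records v either way. */
static int seen_before(uint64_t v)
{
    uint64_t h1 = mix(v) % M_BITS;
    uint64_t h2 = mix(v ^ 0xA5A5A5A5A5A5A5A5ull) % M_BITS;
    int hit = get_bit(h1) && get_bit(h2);
    set_bit(h1);
    set_bit(h2);
    return hit;
}

int main(void)
{
    filter = calloc(M_BITS / 8, 1);           /* zeroed: empty filter */
    if (!filter) return 1;
    /* values flagged here would be re-checked exactly on a second pass */
    return (seen_before(12345) == 0 && seen_before(12345) == 1) ? 0 : 1;
}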
Read the file once and create a hash table storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item itself, for example the least significant bits, let's say 20 bits (about 1M buckets).
After the first pass, all items whose counter is > 1 may point to a duplicated item, or be a false positive. Rescan the file, considering only items that may lead to a duplicate (by looking each item up in the first table), and build a new hash table using the real values as keys now, again storing the count.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
How about:
Sort the input using an algorithm that allows only a portion of the input to be in RAM at a time (external sorting; examples are out there).
Look for duplicates in the output of the 1st step -- you'll need space for just 2 elements of the input in RAM at a time to detect repetitions.
Finding duplicates
Noting that it's a 32-bit integer means that you're going to have a large number of duplicates, since a 32-bit int can only represent about 4.3 billion different numbers and you have "10 billion".
If you were to use a tightly packed set you could represent all the possibilities in 512 MB, which can easily fit into current RAM sizes. This, as a start, pretty easily allows you to recognise whether a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated, you're getting into having a hashmap that contains only the duplicates (using the first 500MB of RAM to tell efficiently IF a value should be in the map or not). In a worst-case scenario with a large spread, you're not going to be able to fit that into RAM.
Another approach, if the numbers have an even spread of duplicates, is to use a tightly packed array with 2-8 bits per value, taking about 1-4 GB of RAM and allowing you to count up to 255 occurrences of each number.
It's going to be a hack, but it's doable.
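As a rough sketch of that tightly packed counter idea (my own code, not the answer's): with 2 bits per possible 32-bit value the whole table is 1 GiB and counts saturate at 3, which is already enough to tell "seen once" from "seen more than once".

#include <stdint.h>
#include <stdlib.h>

static uint8_t *counts;                 /* 2 bits per possible 32-bit value */

static unsigned get_count(uint32_t v)
{
    return (counts[v >> 2] >> ((v & 3u) * 2)) & 3u;   /* 4 values packed per byte */
}

static void bump_count(uint32_t v)      /* saturating increment, caps at 3 */
{
    unsigned shift = (v & 3u) * 2;
    unsigned c = get_count(v);
    if (c < 3)
        counts[v >> 2] = (uint8_t)((counts[v >> 2] & ~(3u << shift)) | ((c + 1u) << shift));
}

int main(void)
{
    counts = calloc((size_t)1 << 30, 1);   /* 2^32 values * 2 bits = 1 GiB, zeroed */
    if (!counts) return 1;
    bump_count(42);
    bump_count(42);
    return get_count(42) == 2 ? 0 : 1;     /* tiny self-check */
}

Widening the field to 8 bits per value gives the 255-count version mentioned above at about 4 GB.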
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.
If the possible range of numbers in the file is not too large, then you can use a bit array to indicate whether each number in the range has appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With a large range (like int) you need to read through the file every time. The file layout may allow for more efficient lookups (e.g. binary search in the case of a sorted array).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking at 32-bit numbers, you would need a single array of 4,294,967,296 bits. You start by setting all bits to 0, and every number in the file sets a specific bit. If the bit is already set, then you've found a number that has occurred before. Do you also need to know how often they occur? Still, it would need 536,870,912 bytes at least (512 MB). It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
Had to do this a long time ago.
What I did... I sorted the numbers as much as I could (I had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become...
[1,10], 12, 16, [20,50], 52, ...
Since in my case I had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets I had very low memory usage.
P.S. Another way to store them:
1, -9, 12, 16, 20, -30, 52,
when I had no numbers lower than zero.
After that I applied the various algorithms (described by other posters) here on the reduced data set.
#include <stdio.h>
#include <stdlib.h>
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
    unsigned x = 0;
    int dup = 0;
    /* one bit per possible unsigned value; calloc zeroes all of them */
    size_t *seen = calloc((size_t)1 << (8 * sizeof(unsigned) - 3), 1);
    if (!seen) return 1;
    while (scanf("%u", &x) == 1) {
        if (BITOP(seen, x, &)) { dup = 1; break; }   /* bit already set: duplicate */
        BITOP(seen, x, |=);                          /* mark x as seen */
    }
    if (dup) printf("duplicate is %u\n", x);
    else printf("no duplicate\n");
    return 0;
}
This is a simple problem that can be solved very easily (several lines of code) and very fast (several minutes of execution) with the right tools.
My personal approach would be to use MapReduce:
MapReduce: Simplified Data Processing on Large Clusters
I'm sorry for not going into more detail, but once you get familiar with the concept of MapReduce it will be very clear how to target the solution.
Basically, we are going to implement two simple functions:
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]), if the number of values > 1 then it's a duplicate number
print the duplicates: number, line_index1, line_index2, ...
Again, this approach can result in very fast execution depending on how your MapReduce framework is set up; it is highly scalable and very reliable, and there are many different implementations of MapReduce in many languages.
There are several top companies offering already-built cloud computing environments, like Google, Microsoft Azure, Amazon AWS, ...
Or you can build your own and set up a cluster with any provider offering virtual computing environments, paying very low costs by the hour.
good luck :)
Another, simpler approach could be to use Bloom filters.
Implement a BitArray such that the ith index of this array corresponds to the numbers 8*i+1 to 8*(i+1), i.e. the first bit of the ith byte is 1 if we have already seen 8*i+1, the second bit is 1 if we have already seen 8*i+2, and so on.
Initialize this bit array with size Integer.Max/8, and whenever you see a number k, set the (k%8)th bit at index k/8 to 1; if this bit is already 1, it means you have seen this number already.