I am currently refactoring an old program and I am having trouble finding a way to optimize a certain piece of code.
The primary aim is reducing memory usage rather than raw performance, as the system is embedded.
for(n = 0; n < NUMBER_VARS_IN_STRUCT1; n++) {
    int m;

    for(m = 0; m < NUMBER_OF_LANGUAGES; m++) {
        UnicodeStrCat(2, tx_data_1, struct1.var1[n][m], L"\r\n");
        SendSerialUserData(UNICODE);
    }
    for(m = 0; m < 4; m++) {
        sprintf(ansicode_text, "%.8f\r\n", (double) struct1.var2[n][m]);
        StrAnsiToUnicode(tx_data_1, ansicode_text);
        SendSerialUserData(UNICODE);
    }
    for(m = 0; m < 4; m++) {
        sprintf(ansicode_text, "%.8f\r\n", (double) struct1.var3[n][m]);
        StrAnsiToUnicode(tx_data_1, ansicode_text);
        SendSerialUserData(UNICODE);
    }
    /* ...continues in the same pattern for the remaining fields... */
}
The code is much longer (~250 lines of the same sort of thing) and is then repeated in a similar manner to allow data to be read back into the device.
My thought was that I could reduce memory usage by hardcoding an array of pointers to each of the array values in each structure (or, if I know the size of each array, by simply incrementing a pointer), and then shrink the code to a single loop that cycles through that table.
The output of the function is to print out a large table of data through a serial bus.
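Something along the lines of the sketch below is what I have in mind. To be clear, the fmt_t/field_t types and the element types of var1/var2/var3 are illustrative guesses here, not my real declarations:

typedef enum { FMT_UNICODE, FMT_FLOAT } fmt_t;

typedef struct {
    fmt_t       fmt;    /* how each element is formatted and sent   */
    const void *data;   /* first element of this field's row        */
    int         count;  /* elements per row (languages, 4, ...)     */
} field_t;

for (n = 0; n < NUMBER_VARS_IN_STRUCT1; n++) {
    const field_t fields[] = {
        { FMT_UNICODE, struct1.var1[n], NUMBER_OF_LANGUAGES },
        { FMT_FLOAT,   struct1.var2[n], 4 },
        { FMT_FLOAT,   struct1.var3[n], 4 },
        /* ...one entry per remaining field... */
    };
    size_t f;
    int m;

    for (f = 0; f < sizeof fields / sizeof fields[0]; f++) {
        for (m = 0; m < fields[f].count; m++) {
            if (fields[f].fmt == FMT_UNICODE) {
                /* assumes var1[n] holds pointers to per-language strings */
                UnicodeStrCat(2, tx_data_1,
                              ((const wchar_t * const *)fields[f].data)[m],
                              L"\r\n");
            } else {
                /* assumes var2/var3 hold float values */
                sprintf(ansicode_text, "%.8f\r\n",
                        (double)((const float *)fields[f].data)[m]);
                StrAnsiToUnicode(tx_data_1, ansicode_text);
            }
            SendSerialUserData(UNICODE);
        }
    }
}

The per-field loops would then collapse into one table walk, and adding a field to the protocol becomes a one-line change.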
Thank you in advance for any help
I am writing a program in C which needs to take in a set of integer values into a 2D array and then perform certain mathematical operations on them. I have decided to implement a check as the user is inputting the values, to prevent them from entering values that are already present in the array.
I am, however, unsure of how to go about this check. I figured I might need some sort of recursive function to check all the elements entered previous to the one being entered, but I don't know how to implement it.
Please find below a snippet of my code for illustrative purposes:
row and col are values entered by the user for the dimensions of the array.
for (int i = 0; i < row; i++) {
    for (int j = 0; j < col; j++) {
        scanf("%d", &arr[i][j]); //take in elements
    }
}

for (int i = 0; i < row; i++)
{
    for (int j = 0; i < col; j++)
    {
        if (arr[i][j] == arr[i][j-1]) {
            printf("Duplicate.\n");
        }
        else {}
    }
}
I know this is probably not correct but it's my attempt.
Any help would be much appreciated.
I would suggest that you store every element you read in a temporary 1D array. Every time you scan a new element, traverse the 1D array to check whether the value already exists. Although this is not optimal, it will at least be less expensive than traversing the 2D array every time.
Example:
int temp[SIZE];
int elements = 0;

for (int i = 0; i < row; i++) {
    for (int j = 0; j < col; j++) {
        scanf("%d", &arr[i][j]); //take in elements
        for (int k = 0; k < elements; k++) { //check only values read so far
            if (temp[k] == arr[i][j])
                printf("Duplicate.\n"); //or do whatever you wish
        }
        temp[elements++] = arr[i][j]; //record the new value after the check
    }
}
A balanced tree inserts and searches in O(log N) time.
Since the algorithms are quite simple and standard, and were published in the seminal books by Knuth, there are plenty of implementations out there, including a clear and concise one at codereview.SE (which is thus automatically CC-BY-SA 3.0; do apply the bugfix from the answer there). Using it (like virtually any other implementation) is simple: start with node* root = NULL;, then insert and search, and finally free_tree.
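For illustration, a minimal sketch of that interface might look as follows; this one omits balancing for brevity (so its worst case degrades to O(N)), which is why the linked implementation is preferable for real use:

#include <stdlib.h>

typedef struct node {
    int value;
    struct node *left, *right;
} node;

/* Returns 1 if value was already in the tree, 0 if it was just inserted. */
int insert(node **root, int value)
{
    if (*root == NULL) {
        *root = calloc(1, sizeof **root); /* left and right start as NULL */
        (*root)->value = value;
        return 0;
    }
    if (value == (*root)->value)
        return 1;
    return insert(value < (*root)->value ? &(*root)->left : &(*root)->right,
                  value);
}

void free_tree(node *root)
{
    if (root != NULL) {
        free_tree(root->left);
        free_tree(root->right);
        free(root);
    }
}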
Asymptotically, the best method is a hash table, with O(1) for both operations, but that is probably overkill (the algorithms are much more complex and the memory footprint is larger) unless you have a lot of numbers. C++ has a standard implementation, and there are plenty of third-party ones for C, too.
If the number of input values is small, even the tree may be overkill, and simply looking through the previous values would be fast enough. If your 2D array is contiguous in memory, you can access it as 1D with int* arr1d = (int*)&arr2d[0][0];.
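A minimal sketch of that linear check, assuming the array really is a contiguous int arr2d[row][col]:

#include <stdbool.h>

/* Returns true if v already appears among the first count elements. */
bool seen_before(const int *arr1d, int count, int v)
{
    for (int k = 0; k < count; k++)
        if (arr1d[k] == v)
            return true;
    return false;
}

/* usage while reading: */
int *arr1d = (int *)&arr2d[0][0];
for (int i = 0; i < row; i++)
    for (int j = 0; j < col; j++) {
        scanf("%d", &arr2d[i][j]);
        if (seen_before(arr1d, i * col + j, arr2d[i][j])) /* values before this one */
            printf("Duplicate.\n");
    }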
I'm trying to speed up a matrix multiplication algorithm by blocking the loops to improve cache performance, yet the non-blocked version remains significantly faster regardless of matrix size, block size (I've tried lots of values between 2 and 200, powers of 2 and others) and optimization level.
Non-blocked version:
for(size_t i = 0; i < n; ++i)
{
    for(size_t k = 0; k < n; ++k)
    {
        int r = a[i][k];
        for(size_t j = 0; j < n; ++j)
        {
            c[i][j] += r * b[k][j];
        }
    }
}
Blocked version:
for(size_t kk = 0; kk < n; kk += BLOCK)
{
    for(size_t jj = 0; jj < n; jj += BLOCK)
    {
        for(size_t i = 0; i < n; ++i)
        {
            for(size_t k = kk; k < kk + BLOCK; ++k)
            {
                int r = a[i][k];
                for(size_t j = jj; j < jj + BLOCK; ++j)
                {
                    c[i][j] += r * b[k][j];
                }
            }
        }
    }
}
I also have a bijk version and a 6-loop bikj version, but they all get outperformed by the non-blocked version, and I don't get why this happens. Every paper and tutorial I've come across seems to indicate that the blocked version should be significantly faster. I'm running this on a Core i5, if that matters.
Try blocking in one dimension only, not in both dimensions.
Matrix multiplication exhaustively processes elements from both matrices. Each row vector of the left matrix is repeatedly processed, applied against successive columns of the right matrix.
If the matrices do not both fit into the cache, some data will invariably end up loaded multiple times.
What we can do is break up the operation so that we work with about a cache-sized amount of data at one time. We want the row vector from the left operand to be cached, since it is repeatedly applied against multiple columns. But we should only take enough columns (at a time) to stay within the limit of the cache. For instance, if we can only take 25% of the columns, it means we will have to pass over the row vectors four times. We end up loading the left matrix from memory four times, and the right matrix only once.
(If anything is to be loaded more than once, it should be the row vectors on the left, because they are flat in memory, which benefits from burst loading. Many cache architectures can perform a burst load from memory into adjacent cache lines faster than random-access loads. If the right matrix were stored in column-major order, that would be even better: then we are doing dot products between flat arrays, which prefetch nicely.)
Let's also not forget the output matrix. The output matrix occupies space in the cache also.
I suspect one flaw in the 2D blocked approach is that each element of the output matrix depends on two inputs: its entire row in the left matrix and the entire column in the right matrix. If the matrices are visited in blocks, each target element is visited multiple times to accumulate partial results.
If we do a complete row-column dot product, we don't have to visit c[i][j] more than once; once we take column j into row i, we are done with that c[i][j].
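Concretely, blocking over the columns of the right matrix only might look like the following sketch, using the same names as your code and assuming n is a multiple of BLOCK:

for(size_t jj = 0; jj < n; jj += BLOCK) /* one cache-sized column band */
{
    for(size_t i = 0; i < n; ++i)
    {
        for(size_t k = 0; k < n; ++k)
        {
            int r = a[i][k];            /* row of a streams in flat */
            for(size_t j = jj; j < jj + BLOCK; ++j)
                c[i][j] += r * b[k][j];
        }
    }
}

Each c[i][j] is finished within a single column band, and if the band of b fits in cache, b is effectively loaded from memory once while a is re-read once per band.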
I was wondering, why does one set of loops allow for better cache performance than another in spite of logically doing the same thing?
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        accum = 0.0;
        for (k = 0; k < n; k++) {
            accum += b[j][k] * a[k][i];
        }
        c[j][i] = accum;
    }
}
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        val = b[j][k];
        for (i = 0; i < n; i++) {
            c[j][i] += val * a[k][i];
        }
    }
}
I believe the first one above delivers better cache performance, but why?
Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?
Just generally speaking, the most efficient loops through a matrix are going to cycle through the last dimension, not the first ("last" being c in m[a][b][c]).
For example, given a 2D matrix like an image which has its pixels represented in memory from top-left to bottom-right, the quickest way to sequentially iterate through it is going to be horizontally across each scanline, like so:
for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x)
        // access pixel[y][x]
}
... not like this:
for (int x = 0; x < w; ++x) {
    for (int y = 0; y < h; ++y)
        // access pixel[y][x]
}
... due to spatial locality. The computer grabs memory from slower, bigger regions of the hierarchy and moves it to faster, smaller regions in large, aligned chunks (e.g., 64-byte cache lines, 4-kilobyte pages, and on down to a tiny 64-bit general-purpose register). The first example accesses all the data from such a contiguous chunk immediately, prior to eviction.
harold on this site gave me a nice way to look at and explain this subject by suggesting not to focus so much on cache misses, but instead to strive to use all the data in a cache prior to eviction. The second example fails to do that for all but the most trivially small images, by iterating through the image vertically with a large, scanline-sized stride rather than horizontally with a small, pixel-sized one.
Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?
The answer here would be "yes": an increase in block size naturally means more compulsory misses (though that is more simply "misses" rather than "miss rate"), but also just more data to process, which won't all necessarily fit into the fastest L1 cache. If we're accessing a large amount of data with a large stride, we end up with a higher non-compulsory miss rate, as more data is evicted from the cache before we utilize it, only to be redundantly loaded back into a faster cache later.
There is also a case where, if the block size is small enough and aligned properly, all the data will just fit into a single cache line and it wouldn't matter so much how we sequentially access it.
Matrix Multiplication
Now your example is quite a bit more complex than this straightforward image example above, but the same concepts tend to apply.
Let's look at the first one:
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        accum = 0.0;
        for (k = 0; k < n; k++)
            accum += b[j][k] * a[k][i];
        c[j][i] = accum;
    }
}
If we look at the innermost k loop, we access b[j][k]. That's a fairly optimal access pattern: "horizontal" if we imagine a row-order memory layout. However, we also access a[k][i]. That's not so optimal, especially for a very large matrix, as it's accessing memory in a vertical pattern with a large stride and will tend to suffer from data being evicted from the fastest but smallest forms of memory before it is used, only to load that chunk of data again redundantly.
If we look at the middle j loop, that's accessing c[j][i], again in a vertical fashion, which is not so optimal.
Now let's have a glance at the second example:
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        val = b[j][k];
        for (i = 0; i < n; i++)
            c[j][i] += val * a[k][i];
    }
}
If we look at the k loop in this second example, it starts off accessing b[j][k], which is optimal (horizontal). Furthermore, it explicitly caches the value in val, which might improve the odds of the compiler keeping it in a register for the following loop (this relates to compiler concepts around aliasing rather than to the CPU cache).
In the innermost i loop, we're accessing c[j][i] which is also optimal (horizontal) along with a[k][i] which is also optimal (horizontal).
So this second version is likely to be more efficient in practice. Note that we can't absolutely say that, as aggressive optimizing compilers can do all sorts of magical things like rearranging and unrolling loops for you. Yet short of that, we should be able to say the second one has higher odds of being more efficient.
"What's a profiler?"
I just noticed this question in the comments. A profiler is a measuring tool that can give you a precise breakdown of where time is spent in your code, along with possibly further statistics like cache misses and branch mispredictions.
It's not only good for optimizing real-world production code and helping you more effectively prioritize your efforts to places that really matter, but it can also accelerate the learning process of understanding why inefficiencies exist through the process of chasing one hotspot after another.
Loop Tiling/Blocking
It's worth mentioning an advanced optimization technique which can be useful for large matrices -- loop tiling/blocking. It's beyond the scope of this subject but that one plays to temporal locality.
Deep C Optimization
Hopefully later you will be able to C these things clearly as a deep C explorer. While most optimization is best saved for hindsight with a profiler in hand, it's useful to know the basics of how the memory hierarchy works as you go deeper and deeper exploring the C.
I have the following piece of C code:
double findIntraClustSimFullCoverage(cluster * pCluster)
{
    double sum = 0;
    register int i = 0, j = 0;
    double perElemSimilarity = 0;

    for (i = 0; i < 10000; i++)
    {
        perElemSimilarity = 0;
        for (j = 0; j < 10000; j++)
        {
            perElemSimilarity += arr[i][j];
        }
        perElemSimilarity /= pCluster->size;
        sum += perElemSimilarity;
    }
    return (sum / pCluster->size);
}
NOTE: arr is a matrix of size 10000 X 10000
This is a portion of a GA code, so this nested for loop runs many times.
This hurts the performance of the code, i.e. it takes a very long time to produce results.
I profiled the code using valgrind / kcachegrind.
This indicated that 70 % of the process execution time was spent in running this nested for loop.
The register variables i and j do not seem to actually be stored in registers (profiling with and without the "register" keyword indicated this).
I simply cannot find a way to optimize this nested-for-loop portion of the code (as it is very simple and straightforward).
Please help me in optimizing this portion of code.
I'm assuming that you change the arr matrix frequently; otherwise you could just compute the sum once (see Lucian's answer) and remember it.
You can use a similar approach when you modify the matrix. Instead of completely re-computing the sum after the matrix has (likely) been changed, you can store a 'sum' value somewhere and have every piece of code that updates the matrix also update the stored sum appropriately. For instance, assuming you start with an array of all zeros:
static double arr[10000][10000]; // static storage duration: zero-initialized
double sum = 0;

// you want to set arr[27][53] to 82853
sum -= arr[27][53];
arr[27][53] = 82853;
sum += arr[27][53];

// you want to set arr[27][53] to 473
sum -= arr[27][53];
arr[27][53] = 473;
sum += arr[27][53];
You might want to completely re-calculate the sum from time to time to avoid accumulation of errors.
If you're sure that you have no option for algorithmic optimization, you'll have to rely on very low level optimizations to speed up your code. These are very platform/compiler specific so your mileage may vary.
It is probable that, at some point, the bottleneck of the operation is pulling the values of arr from memory. So make sure that your data is laid out in a linear, cache-friendly way, that is to say that (char *)&arr[i][j+1] - (char *)&arr[i][j] == sizeof(double).
You may also try to unroll your inner loop, in case your compiler does not already do it. Your code:
for (j = 0; j < 10000; j++)
{
    perElemSimilarity += arr[i][j];
}
would, for example, become:
for (j = 0; j < 10000; j += 10)
{
    perElemSimilarity += arr[i][j+0];
    perElemSimilarity += arr[i][j+1];
    perElemSimilarity += arr[i][j+2];
    perElemSimilarity += arr[i][j+3];
    perElemSimilarity += arr[i][j+4];
    perElemSimilarity += arr[i][j+5];
    perElemSimilarity += arr[i][j+6];
    perElemSimilarity += arr[i][j+7];
    perElemSimilarity += arr[i][j+8];
    perElemSimilarity += arr[i][j+9];
}
These are the basic ideas; it is difficult to say more without knowing your platform and compiler, or without looking at the generated assembly code.
You might want to take a look at this presentation for more complete examples of optimization opportunities.
If you need even more performance, you could take a look at SIMD intrinsics for your platform, or try to use, say, OpenMP to distribute your computation over multiple threads.
With OpenMP, that would be something along the following lines (untested):
#pragma omp parallel for private(perElemSimilarity) reduction(+:sum)
for (i = 0; i < 10000; i++)
{
    perElemSimilarity = 0;
    /* INSERT INNER LOOP HERE */
    perElemSimilarity /= pCluster->size;
    sum += perElemSimilarity;
}
But note that even if you brought this portion of code down to 0% of your execution time (which is impossible), your GA algorithm would still take hours to run. Your performance bottleneck is elsewhere now that this portion of code takes 'only' 22% of your running time.
I might be wrong here, but isn't the following equivalent?
for (i = 0; i < 10000; i++)
{
    for (j = 0; j < 10000; j++)
    {
        sum += arr[i][j];
    }
}
return (sum / (pCluster->size * pCluster->size));
The register keyword is a hint to the optimizer; if the optimizer doesn't think a register is well spent there, it won't use one.
Is the matrix well packed, i.e. is it a contiguous block of memory?
Is 'j' the minor index (i.e. are you going from one element to the next in memory), or are you jumping from one element to that plus 10,000?
Is arr fairly static? Is this called more than once on the same arr? The result of the inner loop depends only on the row that j traverses, so calculating it lazily and storing it for future reference would make a big difference.
The way this problem is stated, there isn't much you can do. You are processing 10,000 x 10,000 double input values, that's 800 MB. Whatever you do is limited by the time it takes to read 800 MB of data.
On the other hand, are you also writing 10,000 x 10,000 values each time this is called? If not, you could, for example, store the sum of each row along with a boolean per row indicating that the row sum needs to be recalculated, set whenever you change an element in that row. Or you could even update a row's stored sum each time one of its elements changes.
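A minimal sketch of that row-caching idea (names here are made up for illustration):

#include <stdbool.h>

#define N 10000

double arr[N][N];
double row_sum[N];   /* cached sum of each row; globals start zeroed   */
bool   row_dirty[N]; /* set whenever an element of the row changes     */

void set_element(int i, int j, double v)
{
    arr[i][j] = v;
    row_dirty[i] = true;
}

double total_sum(void)
{
    double sum = 0;
    for (int i = 0; i < N; i++) {
        if (row_dirty[i]) {          /* recompute only stale rows */
            double s = 0;
            for (int j = 0; j < N; j++)
                s += arr[i][j];
            row_sum[i] = s;
            row_dirty[i] = false;
        }
        sum += row_sum[i];
    }
    return sum;
}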
What is the difference between storing the multi-dimensional arrays in memory in Row Major or Column Major fashion?
As far as I know, C follows the row-major style.
Just out of curiosity I would like to know, are there any benefits of one style over another?
In general, you can emulate each one with the other, so there's no inherent advantage of one over the other. However, cache implementations usually consider locality of reference as a positive factor when estimating whether a memory location is going to be accessed soon. That may have performance implications. For instance, in a row-major implementation, this code snippet:
int sum = 0;
for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
        sum += a[i][j];
is likely to be faster than:
int sum = 0;
for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
        sum += a[j][i];
You should try to design your algorithms so that the outer loop runs over rows in a row-major environment and over columns in a column-major environment, to minimize cache misses.