How to use Armadillo Columns/Rows to perform optimised calculations on accesses within the same column - arrays

What is the best way to manipulate indexing in Armadillo? I was under the impression that it relies heavily on expression templates to avoid temporaries, but I'm not seeing the expected speedups.
Is direct array indexing still the best way to approach calculations that rely on consecutive elements within the same array?
Keep in mind that I hope to parallelise these calculations in the future with tbb::parallel_for (in which case, from a maintainability perspective, it may be simpler to use direct access?). These calculations happen in a tight loop, and I want them to be as fast as possible.
ElapsedTimer timer;
int n = 768000;
int numberOfLoops = 5000;
arma::Col<double> directAccess1(n);
arma::Col<double> directAccess2(n);
arma::Col<double> directAccessResult1(n);
arma::Col<double> directAccessResult2(n);
arma::Col<double> armaAccess1(n);
arma::Col<double> armaAccess2(n);
arma::Col<double> armaAccessResult1(n);
arma::Col<double> armaAccessResult2(n);
std::valarray<double> valArrayAccess1(n);
std::valarray<double> valArrayAccess2(n);
std::valarray<double> valArrayAccessResult1(n);
std::valarray<double> valArrayAccessResult2(n);
// Prefill
for (int i = 0; i < n; i++) {
    directAccess1[i] = i;
    directAccess2[i] = n - i;
    armaAccess1[i] = i;
    armaAccess2[i] = n - i;
    valArrayAccess1[i] = i;
    valArrayAccess2[i] = n - i;
}
// Direct element indexing into the Armadillo columns
for (int j = 0; j < numberOfLoops; j++) {
    for (int i = 1; i < n; i++) {
        directAccessResult1[i] = -directAccess1[i] / (directAccess1[i] + directAccess1[i - 1]) * directAccess2[i - 1];
        directAccessResult2[i] = -directAccess1[i] / (directAccess1[i] + directAccess1[i]) * directAccess2[i];
    }
}
timer.StopAndPrint("Direct Array Indexing Took");
std::cout << std::endl;
// Armadillo subvector expressions (% is element-wise multiplication)
for (int j = 0; j < numberOfLoops; j++) {
    armaAccessResult1.rows(1, n - 1) = -armaAccess1.rows(1, n - 1) / (armaAccess1.rows(1, n - 1) + armaAccess1.rows(0, n - 2)) % armaAccess2.rows(0, n - 2);
    armaAccessResult2.rows(1, n - 1) = -armaAccess1.rows(1, n - 1) / (armaAccess1.rows(1, n - 1) + armaAccess1.rows(1, n - 1)) % armaAccess2.rows(1, n - 1);
}
timer.StopAndPrint("Arma Array Indexing Took");
std::cout << std::endl;
// std::valarray with element-wise indexing
for (int j = 0; j < numberOfLoops; j++) {
    for (int i = 1; i < n; i++) {
        valArrayAccessResult1[i] = -valArrayAccess1[i] / (valArrayAccess1[i] + valArrayAccess1[i - 1]) * valArrayAccess2[i - 1];
        valArrayAccessResult2[i] = -valArrayAccess1[i] / (valArrayAccess1[i] + valArrayAccess1[i]) * valArrayAccess2[i];
    }
}
timer.StopAndPrint("Valarray Array Indexing Took:");
std::cout << std::endl;
In Visual Studio release mode (/O2, to avoid Armadillo's array indexing checks), they produce the following timings:
Started Performance Analysis!
Direct Array Indexing Took: 37.294 seconds elapsed
Arma Array Indexing Took: 39.4292 seconds elapsed
Valarray Array Indexing Took:: 37.2354 seconds elapsed

Your direct code is already quite optimal, so expression templates are not going to help here.
However, you may want to make sure the optimization level in your compiler actually enables auto-vectorization (-O3 in gcc). Secondly, you can get a bit of extra speed by #define ARMA_NO_DEBUG before including the Armadillo header. This will turn off all run-time checks (such as bound checks for element access), but this is not recommended until you have completely debugged your program.
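For illustration, here is the question's first "direct" formula as a standalone plain C function. This is only a sketch: the name direct_kernel and the restrict qualifiers are assumptions added here, not part of the original code. restrict promises the compiler that the three arrays do not overlap, which is one of the things an auto-vectoriser usually needs to know before it can vectorise a loop like this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical standalone version of the question's first "direct" loop.
   restrict (C99) tells the compiler that a, b and result never alias. */
void direct_kernel(size_t n, const double *restrict a,
                   const double *restrict b, double *restrict result)
{
    assert(n >= 1);
    /* Element 0 is left untouched, exactly as in the question's loop. */
    for (size_t i = 1; i < n; i++)
        result[i] = -a[i] / (a[i] + a[i - 1]) * b[i - 1];
}
```

Whether the loop is actually vectorised still depends on the compiler and flags, so it is worth checking the compiler's vectorisation report rather than assuming.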


How to find all the palindromes in a large array?

I need to find all the palindromes of π with 50 million digits 3.141592653589793238462643383279502884197169399375105820974944592307816406286... (goes on and on...)
I've stored all the digits of π in a char array. Now I need to search and count the number of 'palindromes' of length 2 to 15. For example, 535, 979, 33, 88, 14941, etc. are all valid results.
The final output I want is basically like the following.
Palindrome length    Number of palindromes of this length
2                    1234 (just an example)
3                    1245
4                    689
...                  ...
15                   0
pseudocode of my logic so far - it works but takes forever
//store all digits in a char array
char *piArray = (char *)malloc(NUM_PI_DIGITS * sizeof(char));
int count = 0; //count for the number of palindromes
//because we only need to find palindromes that are 2 - 15 digits long
for (int i = 2; i <= 15; i++) {
    //loop through the piArray and find all the palindromes that are i digits long
    for (int j = 0; j + i <= size_of_piArray; j++) {
        //check if the sub array piArray[j:j+i] is a palindrome, if so, add a count
        bool isPalindrome = true;
        for (int k = 0; k < i / 2; k++) {
            if (piArray[j + k] != piArray[j + i - 1 - k]) {
                isPalindrome = false;
                break;
            }
        }
        if (isPalindrome)
            count++;
    }
}
The problem I am facing now is that it takes too long to loop through the array of this large size (15-2)=13 times. Is there any better way to do this?
Here is a C version adapted from the approach proposed by Caius Jard:
#include <stdlib.h>
#include <string.h>

// assumes max_length >= 3; counts[] must have max_length + 1 entries
void check_pi_palindromes(int NUM_PI_DIGITS, int max_length, int counts[]) {
    // store all digits in a char array, padded on both sides
    int max_span = max_length / 2;
    int start = max_span;
    int end = NUM_PI_DIGITS + max_span;
    char *pi = (char *)malloc(max_span + NUM_PI_DIGITS + max_span);
    // read or generate the digits starting at position `max_span`
    // clear an initial and trailing area to simplify boundary testing
    memset(pi, ' ', start);
    memset(pi + end, ' ', max_span);
    // clear the result array
    for (int i = 0; i <= max_length; i++) {
        counts[i] = 0;
    }
    // loop through the pi array and find all the palindromes
    for (int i = start; i < end; i++) {
        int n;
        if (pi[i + 1] == pi[i - 1]) { // center of an odd length palindrome
            counts[3]++;
            for (n = 2; 2 * n + 1 <= max_length && pi[i + n] == pi[i - n]; n++) {
                counts[n * 2 + 1]++;
            }
        }
        if (pi[i] == pi[i - 1]) { // center of an even length palindrome
            counts[2]++;
            for (n = 1; 2 * n + 2 <= max_length && pi[i + n] == pi[i - 1 - n]; n++) {
                counts[n * 2 + 2]++;
            }
        }
    }
    free(pi);
}
For each position in the array, it scans in both directions for palindromes of odd and even lengths with these advantages:
single pass through the array
good cache locality because all reads from the array are in a small span from the current position
fewer tests as larger palindromes are only tested as extensions of smaller ones.
A small working prefix and suffix is used to avoid the need to special case the beginning and end of the sequence.
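A self-contained variant of the same center-expansion idea can be sketched as follows. It is not the code from the answer: the function name count_palindromes is made up for this example, it takes the digits as a parameter, and it uses explicit bounds checks instead of the padded prefix and suffix, which makes it easy to try on a short string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts[len] receives the number of palindromic
   substrings of length len (2 <= len <= max_length) in digits[0..n-1].
   counts must have room for max_length + 1 entries. */
void count_palindromes(const char *digits, size_t n, int max_length,
                       long counts[])
{
    assert(max_length >= 2);
    for (int len = 0; len <= max_length; len++)
        counts[len] = 0;

    for (size_t c = 0; c < n; c++) {
        /* Odd lengths: digits[c] is the center, r is the radius. */
        for (size_t r = 1; 2 * r + 1 <= (size_t)max_length
                           && c >= r && c + r < n
                           && digits[c - r] == digits[c + r]; r++)
            counts[2 * r + 1]++;

        /* Even lengths: the center lies between digits[c-1] and digits[c]. */
        for (size_t r = 1; 2 * r <= (size_t)max_length
                           && c >= r && c + r - 1 < n
                           && digits[c - r] == digits[c + r - 1]; r++)
            counts[2 * r]++;
    }
}
```

For the string "93993" this counts one palindrome each of length 2 (99), 3 (939) and 4 (3993), so overlapping palindromes are all counted.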
I can't solve it for C, as I'm a C# dev but I expect the conversion will be trivial - I've tried to keep it as basic as possible
char[] pi = "3.141592653589793238462643383279502884197169399375105820974944592307816406286".ToCharArray(); //get a small piece as an array of char
int[] lenCounts = new int[16]; //make a new int array with slots 0-15
for (int i = 1; i < pi.Length - 1; i++) {
    if (pi[i+1] == pi[i-1]) { //center of an odd length pal
        int n = 2;
        while (pi[i+n] == pi[i-n] && n <= 7) n++;
        //n - 1 is now the largest radius that matched around this center
    } else if (pi[i] == pi[i-1]) { //center of an even length pal
        int n = 1;
        while (pi[i+n] == pi[i-1-n] && n <= 7) n++;
        //n - 1 further pairs matched beyond the central pair
    }
}
This demonstrates the "crawl the string looking for a palindrome center, then crawl away from it in both directions looking for equal chars" technique.
The only thing I'm not sure about, and it occurs in the digits of Pi posted, is what you want to do if palindromes overlap:
This contains 939 and overlapping with it, 3993. The algo above will find both, so if overlaps are not to be allowed then you might need to extend it to deal with eliminating earlier palindromes if they're overlapped by a longer one found later
You can play with the c# version at - it has some debug print lines in too. Fiddles are limited to a 10 second execution time so I don't know if you'll be able to time the full 50 megabyte 😀 - you might have to run this algo locally for that one
Edit: fixed a bug in the answer but I haven't fixed it in the fiddle. I did have while(.. n<lenCounts.Length), i.e. allowing n to reach 15, but that would be an issue because n counts in both directions. n should go to 7 to remain in range of the counts array. I've patched that by hard coding 7, but you might want to make it depend on the array length / 2 etc.
Well, I don't think it can be done in less than O(len*n). You are currently doing O(len^2*n) with 2 <= len <= 15, which is almost the same, since a constant factor doesn't change the O notation. If you want to avoid the extra loop, though, you can check these links. They count all palindromes up to the maximum possible length, so it shouldn't be hard to add a counter for each length:
source1, source2, source3.
NOTE: GeeksforGeeks is usually a good place to look when you need algorithms or optimizations.
EDIT: one possible way, with O(n^2) time complexity and O(n) auxiliary space. You can replace the unordered_map with an array if you wish; either way, the key is the length and the value is the count of palindromes with that length.
unordered_map<int, int> countPalindromes(string& s) {
    unordered_map<int, int> m;
    for (int i = 0; i < s.length(); i++) {
        // check for odd length palindromes
        for (int j = 0; j <= i; j++) {
            if (!s[i + j])  // reached the terminating '\0'
                break;
            if (s[i - j] == s[i + j]) {
                // check for palindromes of length
                // greater than 1
                if ((i + j + 1) - (i - j) > 1)
                    m[(i + j + 1) - (i - j)]++;
            } else
                break;
        }
        // check for even length palindromes
        for (int j = 0; j <= i; j++) {
            if (!s[i + j + 1])  // reached the terminating '\0'
                break;
            if (s[i - j] == s[i + j + 1]) {
                // check for palindromes of length
                // greater than 1
                if ((i + j + 2) - (i - j) > 1)
                    m[(i + j + 2) - (i - j)]++;
            } else
                break;
        }
    }
    return m;
}

Work out the overall progress (%) of three nested for loops

I have three nested for loops, each of which has a limit. To calculate the progress of any one of the three loops, all I need to do is divide the current iteration by the total number of iterations that loop will make. However, given that there are three nested loops, how can I work out the overall percentage complete?
int iLimit = 10, jLimit = 24, kLimit = 37;
for (int i = 0; i < iLimit; i++) {
    for (int j = 0; j < jLimit; j++) {
        for (int k = 0; k < kLimit; k++) {
            printf("Percentage Complete = %d", percentage);
        }
    }
}
I tried the following code, but it reset after the completion of each loop, reaching a percentage greater than 100.
float percentage = ((i + 1) / (float)iLimit) * ((j + 1) / (float)jLimit) * ((k + 1) / (float)kLimit) * 100;
You can easily calculate the "change in percentage per inner cycle"
const double percentChange = 1.0 / iLimit / jLimit / kLimit;
Note that this is mathematically equivalent to 1/(iLimit*jLimit*kLimit). However, if iLimit*jLimit*kLimit is sufficiently large, the integer product overflows and you get unexpected behavior. Underflow is still possible with the 1.0/... approach, but it's far less likely.
int iLimit = 10, jLimit = 24, kLimit = 37;
const double percentChange = 1.0 / iLimit / jLimit / kLimit;
double percentage = 0;
for (int i = 0; i < iLimit; i++) {
    for (int j = 0; j < jLimit; j++) {
        for (int k = 0; k < kLimit; k++) {
            percentage += percentChange;
            printf("Percentage Complete = %d\n", (int)(percentage * 100));
        }
    }
}
If I understand your question correctly, the counter variables at each level (i.e. i, j, k) should carry different weights in the percentage formula. Let me explain what I mean: each increment of j corresponds to kLimit iterations of the innermost loop. So, with only one level of nesting (say the outermost loop over i is not present), the total number of loop iterations would be kLimit*jLimit and the percentage:
percentage = (100.0 * (j*kLimit + k + 1)) / (float)(kLimit*jLimit)
Got the idea? It's easy to generalize this to any level of nesting. For your case, the final formula is:
percentage = 100.0 * (kLimit * (i * jLimit + j) + k + 1) / (iLimit * jLimit * kLimit)
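As a small sketch, the closed-form formula can be wrapped in a C helper. The name progress_percent and the switch to long long for the intermediate products are assumptions made for this example; the arithmetic itself is the formula given above.

```c
#include <assert.h>

/* Hypothetical helper: percentage complete once the inner-loop body
   has finished for the indices (i, j, k). */
double progress_percent(int i, int j, int k,
                        int iLimit, int jLimit, int kLimit)
{
    assert(iLimit > 0 && jLimit > 0 && kLimit > 0);
    /* long long keeps the products from overflowing int for large limits */
    long long total = (long long)iLimit * jLimit * kLimit;
    long long done = ((long long)i * jLimit + j) * kLimit + k + 1;
    return 100.0 * (double)done / (double)total;
}
```

With the limits from the question (10, 24, 37), the last iteration (9, 23, 36) gives exactly 100, and (4, 23, 36) gives exactly 50.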
The total number of inner-loop iterations is iLimit * jLimit * kLimit, so if you increment a counter in the inner loop, you can just print
100 * counter / (iLimit * jLimit * kLimit)
Since you are using %d to print the percentage, you can limit everything to integer calculations. (And it avoids seeing meaningless 'exact' values such as 0.011261 for the first step.)
If you want to see properly rounded values, you can also use this:
printf("Percentage Complete = %d%%\r", (counter*200+iLimit * jLimit * kLimit) /
(2 * iLimit * jLimit * kLimit));
The \r at the end is a small refinement so each line will overprint the previous one.
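The integer-only rounding used in that printf can be isolated like this. It is a sketch: the function name rounded_percent and the long long types are assumptions added for the example. Adding the total to 200 * counter before dividing by 2 * total is the integer version of adding 0.5 before truncating.

```c
#include <assert.h>

/* Hypothetical helper: percentage rounded to the nearest integer,
   computed without any floating point. */
int rounded_percent(long long counter, long long total)
{
    assert(total > 0 && counter >= 0 && counter <= total);
    /* (200*counter + total) / (2*total) == floor(100*counter/total + 0.5) */
    return (int)((counter * 200 + total) / (2 * total));
}
```

With 8880 total iterations (10 * 24 * 37), counter values 44 and 45 sit on either side of 0.5 percent, so they print 0 and 1 respectively.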
Try this one:
int iLimit = 10, jLimit = 24, kLimit = 37;
float percentage;
for (int i = 0; i < iLimit; i++) {
    for (int j = 0; j < jLimit; j++) {
        for (int k = 0; k < kLimit; k++) {
            percentage = ((k+1) + j * kLimit + i*jLimit*kLimit)/(float)(iLimit*jLimit*kLimit) * 100;
            printf("Percentage Complete = %f\n", percentage);
        }
    }
}
This solution is very similar to the counter-increment solution posted here.
The advantage of this solution is that it supplies a formula for the counter that depends only on i, j, k and the limits iLimit, jLimit, kLimit:
counter = (k+1) + j * kLimit + i*jLimit*kLimit
This way, you can find the percentage when you know i, j, k without iterating through the loops.
Thus you can possibly reduce an O(iLimit * jLimit * kLimit) problem to an O(1) problem.
Remember that percentage is parts of 100.
To get 100 you need to do e.g. (iLimit * jLimit * kLimit) / (iLimit * jLimit * kLimit) * 100.
Each iteration of the loop takes 1 / (iLimit * jLimit * kLimit) parts of the whole.
To get the percentage, "simply" do e.g.
float percentage = ++counter / (float) (iLimit * jLimit * kLimit) * 100.0;
Remember to declare and initialize the counter variable before the loops.

Generating a Sparse Matrix in C

Is there a simpler way of generating sparse matrix other than this?
for (i = 0; i < 1000; i++)
    if (rand() % 3 == 0)
        array[i] = rand() % 3;
    else
        array[i] = ((rand() % 3) - 1);
I used array for presentational purposes
Use a to determine how sparse you want it to be.
for (i = 0; i < 1000; i++)
    if (rand() % a == 0)
        array[i] = rand() % 100;
    else
        array[i] = 0;
Let t be the target number of non-zero elements in the array, which should be much less than the length of the array for sparseness. I'm assuming your array is of length length. I'm also generating the random indices without the modulus operator to avoid modulo bias.
for (i = 0; i < t; ++i) {
    int index = (int) (length * ((double) rand() / (RAND_MAX + 1.0)));
    array[index] = i % 2 ? -1 : 1;
}
Note that this may give a few less than t non-zero elements because random numbers can produce duplicates, but that should be rare if it really is sparse, e.g., t < square root of the array length. If you're worried about duplicate randoms making things sparser than you want, you can modify accordingly:
for (i = 0; i < t;) {
    int index = (int) (length * ((double) rand() / (RAND_MAX + 1.0)));
    if (array[index]) { /* something already at this index */
        continue; /* skip incrementing and try again */
    }
    array[index] = i % 2 ? -1 : 1;
    ++i; /* only advance once a new slot has been filled */
}
In both cases I'm alternating +/- ones for the non-zero values, but if you want something more random it would be easy to replace the right-hand side of the assignment to array[index].
Finally, I ask your indulgence if I fluffed something on C syntax. My C is about 15 years rusty, but the intent should be clear.
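Putting the retry variant together as a compilable sketch: the function names fill_sparse and count_nonzero are invented for this example, and the array is zeroed first, which the snippets above take for granted. It is meant for t much smaller than length, since the retry loop slows down as the array fills up.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: zero the array, then place exactly t entries of
   +1 or -1 at distinct random positions. */
void fill_sparse(int *array, int length, int t)
{
    assert(length > 0 && t >= 0 && t <= length);
    for (int i = 0; i < length; i++)
        array[i] = 0;
    for (int i = 0; i < t; ) {
        /* scale rand() into [0, length) without the modulus operator */
        int index = (int)(length * ((double)rand() / (RAND_MAX + 1.0)));
        if (array[index] != 0)
            continue;               /* collision: retry without advancing i */
        array[index] = (i % 2) ? -1 : 1;
        i++;
    }
}

/* Hypothetical helper used to check the result. */
int count_nonzero(const int *array, int length)
{
    int count = 0;
    for (int i = 0; i < length; i++)
        if (array[i] != 0)
            count++;
    return count;
}
```

Calling srand with a fixed seed before fill_sparse makes the pattern reproducible; the number of non-zero entries is t regardless of the seed.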

2D convolution with a kernel which is not center originated

I want to do 2D convolution of an image with a Gaussian kernel which is not centre originated given by equation:
h(x-x', y-y') = exp(-((x-x')^2 + (y-y')^2) / (2*sigma))
Lets say the centre of kernel is (1,1) instead of (0,0). How should I change my following code for generation of kernel and for the convolution?
int krowhalf = krow / 2, kcolhalf = kcol / 2;
int sigma = 1;
// sum is for normalization
float sum = 0.0;
// generate kernel
for (int x = -krowhalf; x <= krowhalf; x++) {
    for (int y = -kcolhalf; y <= kcolhalf; y++) {
        r = sqrtl((x-1)*(x-1) + (y-1)*(y-1));
        gKernel[x + krowhalf][y + kcolhalf] = exp(-(r*r)/(2*sigma));
        sum += gKernel[x + krowhalf][y + kcolhalf];
    }
}
//normalize the Kernel
for (int i = 0; i < krow; ++i)
    for (int j = 0; j < kcol; ++j)
        gKernel[i][j] /= sum;

float **convolve2D(float** in, float** out, int h, int v, float **kernel, int kCols, int kRows)
{
    int kCenterX = kCols / 2;
    int kCenterY = kRows / 2;
    int i,j,m,mm,n,nn,ii,jj;
    for (i = 0; i < h; ++i) { // rows
        for (j = 0; j < v; ++j) { // columns
            for (m = 0; m < kRows; ++m) { // kernel rows
                mm = kRows - 1 - m; // row index of flipped kernel
                for (n = 0; n < kCols; ++n) { // kernel columns
                    nn = kCols - 1 - n; // column index of flipped kernel
                    //index of input signal, used for checking boundary
                    ii = i + (m - kCenterY);
                    jj = j + (n - kCenterX);
                    // ignore input samples which are out of bound
                    if (ii >= 0 && ii < h && jj >= 0 && jj < v)
                        //out[i][j] += in[ii][jj] * (kernel[mm+nn*29]);
                        out[i][j] += in[ii][jj] * (kernel[mm][nn]);
                }
            }
        }
    }
    return out;
}
Since you're using the convolution operator, you have two choices:
1. Use its spatial invariance (shift invariance) property.
To do so, just filter the image with a regular, centred convolution kernel (best done using either conv2 or imfilter) and then shift the result.
Mind the boundary condition you choose to employ (see the imfilter properties).
2. Calculate the shifted result directly.
You can do this with loops as you suggested or, more easily, create a non-symmetric kernel and still use imfilter or conv2.
Sample Code (MATLAB)
mInputImage = imread('3.png');
mInputImage = double(mInputImage) / 255;
mConvolutionKernel = zeros(3, 3);
mConvolutionKernel(2, 2) = 1;
mOutputImage01 = conv2(mConvolutionKernel, mInputImage);
mConvolutionKernelShifted = [mConvolutionKernel, zeros(3, 150)];
mOutputImage02 = conv2(mConvolutionKernelShifted, mInputImage);
The tricky part is to know to "Crop" the second image in the same axis as the first.
Then you'll have a shifted image.
You can use any Kernel and any function which applies convolution.
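For the loop-based route in C (the language of the question), one way to sketch it is to pass the kernel origin explicitly instead of assuming it sits at kCols / 2, kRows / 2. The function name convolve2d_origin, the flat row-major arrays and the zero boundary handling are all assumptions made for this example, not part of the code above.

```c
#include <assert.h>

/* Hypothetical sketch: 2D convolution where the kernel origin is at
   (originRow, originCol) rather than at the kernel center. Images and
   kernel are stored row-major in flat arrays; input samples outside the
   image are treated as zero. */
void convolve2d_origin(const float *in, float *out, int rows, int cols,
                       const float *kernel, int kRows, int kCols,
                       int originRow, int originCol)
{
    assert(originRow >= 0 && originRow < kRows);
    assert(originCol >= 0 && originCol < kCols);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            float acc = 0.0f;
            for (int m = 0; m < kRows; ++m) {
                for (int n = 0; n < kCols; ++n) {
                    /* Kernel sample (m, n) has offset (m - originRow,
                       n - originCol); convolution pairs it with the input
                       sample mirrored around (i, j). */
                    int ii = i - (m - originRow);
                    int jj = j - (n - originCol);
                    if (ii >= 0 && ii < rows && jj >= 0 && jj < cols)
                        acc += in[ii * cols + jj] * kernel[m * kCols + n];
                }
            }
            out[i * cols + j] = acc;
        }
    }
}
```

Moving the origin only shifts the output, which is the shift-invariance property mentioned above: a 3x3 kernel with a single 1 in the middle reproduces the input when the origin is (1, 1), and shifts it one pixel down and to the right when the origin is (0, 0).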

complexity for a nested loop with varying internal loop

Very similar complexity examples. I am trying to understand how these questions differ. Exam coming up tomorrow :( Are there any shortcuts for finding the complexities here?
void doit(int N) {
    while (N) {
        for (int j = 0; j < N; j += 1) {}
        N = N / 2;
    }
}
void doit(int N) {
    while (N) {
        for (int j = 0; j < N; j *= 4) {}
        N = N / 2;
    }
}
void doit(int N) {
    while (N) {
        for (int j = 0; j < N; j *= 2) {}
        N = N / 2;
    }
}
Thank you so much!
void doit(int N) {
    while (N) {
        for (int j = 0; j < N; j += 1) {}
        N = N / 2;
    }
}
To find the O() of this, notice that we are dividing N by 2 each iteration. So (not to insult your intelligence, but for completeness) on the final non-zero pass through the loop we will have N = 1. The pass before that we will have roughly N = a(2), and before that roughly N = a(4), and so on, where a is a constant with 1 <= a < 2 (integer division rounds down, so these values are approximate). The loop therefore executes about log(N) times (base 2), and on the first pass N = a2^(floor(log(N))).
Why do we care about that? Well, the inner loop runs N times on each pass, so the total work is a geometric series, which has a nice closed form:
Sum = \sum_{k=0}^{\log N} a 2^k = a \cdot \frac{1 - 2^{\log N + 1}}{1 - 2} = 2aN - a = O(N).
If someone can figure out how to get that latexy notation to display correctly for me I would really appreciate it.
You already have the answer to number 1, O(N), as given by @NickO. Here is an alternative explanation.
Denote the total number of inner-loop iterations by T(N), and let the number of outer-loop iterations be h. Note that h is about log_2(N).
T(N) = N + N/2 + ... + N / (2^i) + ... + 2 + 1
< 2N (sum of geometric series)
in O(N)
Number 3 is O((log N)^2).
Denote the total number of inner-loop iterations by T(N), and let the number of outer-loop iterations be h. Note that h is about log_2(N).
T(N) = log(N) + log(N/2) + log(N/4) + ... + log(1) (because j doubles each time, the inner loop runs about log(N) times for a given N)
= log(N * (N/2) * (N/4) * ... * 1) (because log(a) + log(b) = log(a*b))
= log(N^h * (1 * 1/2 * 1/4 * .... * 1/N))
= log(N^h) + log(1 * 1/2 * 1/4 * .... * 1/N) (because log(a*b) = log(a) + log(b))
< log(N^h) + log(1)
= log(N^h) (log(1) = 0)
= h * log(N) (log(a^b) = b*log(a))
= (log(N))^2 (because h=log_2(N))
Number 2 is almost identical to number 3.
(For 2 and 3 I am assuming j starts from 1, not from 0. If it does start from 0, j never grows and the inner loop never terminates, as @WhozCraig points out.)
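If you want to sanity-check these bounds, a quick way is to count the inner-loop iterations directly. The sketch below does that for snippets 1 and 3; the function names count_linear and count_doubling are made up here, and the doubling version starts j at 1, as assumed above.

```c
#include <assert.h>

/* Counts inner-loop iterations of snippet 1 (j += 1). */
long long count_linear(int N)
{
    assert(N >= 0);
    long long steps = 0;
    while (N) {
        for (int j = 0; j < N; j += 1)
            steps++;
        N = N / 2;
    }
    return steps;
}

/* Counts inner-loop iterations of snippet 3 (j *= 2), with j starting
   at 1 instead of 0 so that the inner loop terminates. */
long long count_doubling(int N)
{
    assert(N >= 0);
    long long steps = 0;
    while (N) {
        for (long long j = 1; j < N; j *= 2)
            steps++;
        N = N / 2;
    }
    return steps;
}
```

For N = 1024 the first function returns 1024 + 512 + ... + 1 = 2047, just under 2N, and the second returns 10 + 9 + ... + 1 = 55, which grows like (log N)^2 / 2.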
