Time-consuming Problem of Memory Copy Between REE and QSEE - arm

First, here is the test code:
#define DATA_TYPE float
#define _1KB (1024)
static inline __attribute__((__always_inline__)) void swap_data_value(DATA_TYPE* pSrc, DATA_TYPE* pDst, uint32_t elemCnt)
{
    for (int i = 0; i < elemCnt; ++i) {
        pDst[i] = pSrc[i];
    }
}
void test_func()
{
    const int DATA_NUM = _1KB * _1KB;
    uint32_t calc_len = 64;
    int loop_cnt = DATA_NUM / calc_len;
    if((DATA_NUM % calc_len) != 0) {
        LOGE("loop_cnt not match calc_len");
    }
    for(int k = 0; k < 666; ++k) {
        DATA_TYPE* pData = (DATA_TYPE*)ftk_ta_malloc(2 * DATA_NUM * sizeof(DATA_TYPE));
        for(int i = 0; i < DATA_NUM * 2; ++i) {
            pData[i] = (DATA_TYPE)i;
        }
        DATA_TYPE* pSeg1 = pData;
        DATA_TYPE* pSeg2 = pData + k * 1024;
        ftk_millisecond_t t0 = ftk_ta_get_uptime();
        for(int j = 0; j < 400; ++j) {
            DATA_TYPE* p1 = pSeg1;
            DATA_TYPE* p2 = pSeg2;
            for (int i = 0; i < loop_cnt; i++) {
                swap_data_value(p1, p2, calc_len);
                p1 += calc_len;
                p2 += calc_len;
            }
        }
        t0 = ftk_ta_get_uptime() - t0;
        LOGD("swap_data_value[%d: %dx%d]: %0.4f ms", k, calc_len, loop_cnt, t0/400.0f);
    }
}
I ran this code on an SDM865 platform and saw a huge performance difference between the REE and QSEE (Qualcomm's TrustZone implementation).
In the REE, it consistently takes 0.1325 ~ 0.1375 ms.
But in QSEE, it takes 0.7275 ~ 10.37 ms and fluctuates widely.
I suspect this is caused by some cache limitation. However, I can't get the cache information in QSEE: the code below crashes the TA (it exits immediately).
uint64_t ctr_el0 = 0;
asm volatile("mrs %0, CTR_EL0" : "=r"(ctr_el0) : );
In the REE, the same code reports a cache line size of 64 B.
So, is this problem caused by QSEE (TrustZone) limiting the cache size or the cache access performance?


Array processing produces -nan(ind) as the result

I am writing a program that creates arrays of a given length and manipulates them. Other libraries may not be used.
First, an array M1 of length N is formed, and then an array M2 of length N/2.
Each element of M1 is divided by Pi and then raised to the third power.
Then, in the M2 array, each element is added to the previous one in turn, and the absolute value of the tangent of the sum is taken.
After that, the elements of M1 are raised to the power of the M2 elements with the same indexes, and the resulting array is sorted with gnome sort (dwarf_sort in the code).
Finally, the sum of the sines is calculated over those elements of M2 that give an even number when divided by the minimum non-zero element of M2.
The problem is that the result X is -nan(ind). I can't figure out exactly where the error is.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
const int A = 441;
const double PI = 3.1415926535897931159979635;
inline void dwarf_sort(double* array, int size) {
size_t i = 1;
while (i < size) {
if (i == 0) {
i = 1;
if (array[i - 1] <= array[i]) {
long tmp = array[i];
array[i] = array[i - 1];
array[i - 1] = tmp;
inline double reduce(double* array, int size) {
size_t i;
double min = RAND_MAX, sum = 0;
for (i = 0; i < size; ++i) {
if (array[i] < min && array[i] != 0) {
min = array[i];
for (i = 0; i < size; ++i) {
if ((int)(array[i] / min) % 2 == 0) {
sum += sin(array[i]);
return sum;
int main(int argc, char* argv[])
int i, N, j;
double* M1 = NULL, * M2 = NULL, * M2_copy = NULL;
double X;
unsigned int seed = 0;
N = atoi(argv[1]); /* N is the first command-line argument */
M1 = malloc(N * sizeof(double));
M2 = malloc(N / 2 * sizeof(double));
M2_copy = malloc(N / 2 * sizeof(double));
for (i = 0; i < 100; i++)
seed = i;
for (j = 0; j < N; ++j) {
M1[j] = (rand_r(&seed) % A) + 1;
for (j = 0; j < N / 2; ++j) {
M2[j] = (rand_r(&seed) % (10 * A)) + 1;
for (j = 0; j < N; ++j)
M1[j] = pow(M1[j] / PI, 3);
for (j = 0; j < N / 2; ++j) {
M2_copy[j] = M2[j];
M2[0] = fabs(tan(M2_copy[0]));
for (j = 0; j < N / 2; ++j) {
M2[j] = fabs(tan(M2[j] + M2_copy[j]));
for (j = 0; j < N / 2; ++j) {
M2[j] = pow(M1[j], M2[j]);
dwarf_sort(M2, N / 2);
X = reduce(M2, N / 2);
printf("\nN=%d.\n", N);
printf("X=%f\n", X);
return 0;
Does anyone knowledgeable see where my mistake is? I suspect I'm using the wrong data types for the variables, but I still can't solve the problem.
Replace the /* merge */ part with this:
for (j = 0; j < N / 2; ++j) {
printf("%f %f ", M1[j], M2[j]);
M2[j] = pow(M1[j], M2[j]);
printf("%f\n", M2[j]);
This will print the operands and the results of the pow operation. You'll see that some of these values are huge, resulting in an overflow of double.
Something like pow(593419.97, 31.80) will not end well.

How do I include a switch-case statement for this encryption/decryption code?

I just started using a microcontroller and I have to implement encryption/decryption on it. Sorry for the super long post.
This is the Python script; it does not need to be edited.
DEVPATH = "/dev"
INPUT = b"Hello!"
#OUTPUT = b"Ifmmp!"
if __name__=='__main__':
for tty in (os.path.join(DEVPATH,tty) for tty in os.listdir(DEVPATH) \
if tty.startswith(TTYPREFIX)):
ctt = serial.Serial(tty, timeout=1, writeTimeout=1)
except serial.SerialException:
# print(ctt)
except serial.SerialTimeoutException:
for retry in range(3): # Try three times to read connection test result
ret = ctt.read(2*len(INPUT))
print("ret: " + repr(ret))
if INPUT in ret:
This is the main.c file. I know that CDC_Device_BytesReceived reports the input received from the Python script. If there is input, the inner while loop runs, since Bytes will be greater than 0.
while (1)
/* Check if data received */
Bytes = CDC_Device_BytesReceived(&VirtualSerial_CDC_Interface);
while(Bytes > 0)
/* Send data back to the host */
ch = CDC_Device_ReceiveByte(&VirtualSerial_CDC_Interface);
CDC_Device_SendByte(&VirtualSerial_CDC_Interface, ch);
return 0;
However, I was tasked with adding a switch-case statement inside the loop so that it switches between encryption and decryption. I have no idea what kind of condition to use to distinguish encryption from decryption.
This is the code for encryption.
int crypto_aead_encrypt(unsigned char* c, unsigned long long* clen,
const unsigned char* m, unsigned long long mlen,
const unsigned char* ad, unsigned long long adlen,
const unsigned char* nsec, const unsigned char* npub,
const unsigned char* k)
int klen = CRYPTO_KEYBYTES; // 16 bytes
int size = 320 / 8; // 40 bytes
int rate = 128 / 8; // 16 bytes
// int capacity = size - rate;
// Permutation
int a = 12;
int b = 8;
// Padding process appends a 1 to the associated data
i64 s = adlen / rate + 1;
// Padding process appends a 1 to the plain text
i64 t = mlen / rate + 1;
// Length = plaintext mod r
// i64 l = mlen % rate;
u8 S[size];
// Resulting Padded associated data is split into s blocks of r bits
u8 A[s * rate];
// Resulting Padded plain text is split into t blocks of r bits
u8 P[t * rate];
i64 i, j;
// Pad Associated Data
for(i = 0; i < adlen; ++i)
A[i] = ad[i];
A[adlen] = 0x80; // 128 bits
// No Padding Applied
for(i = adlen + 1; i < s * rate; ++i)
A[i] = 0;
// Pad Plaintext
for(i = 0; i < mlen; ++i)
P[i] = m[i];
P[mlen] = 0x80; // 128 bits
// No Padding Applied
for(i = mlen + 1; i < t * rate; ++i)
P[i] = 0;
// Initialization
// IV = k || r || a || b || 0
// S = IV || K || N
S[0] = klen * 8;
S[1] = rate * 8;
S[2] = a;
S[3] = b;
// i < 40 - 2 * 16 = 8
for(i = 4; i < size - 2 * klen; ++i)
// S[4] until S[7] = 0
S[i] = 0;
// i < 16
for(i = 0; i < klen; ++i)
// S[8] = k[0], S[9] = k[1] until S[23] = k[15]
S[size - 2 * klen + i] = k[i];
// i < 16
for(i = 0; i < klen; i++)
// S[24] = npub[0], S[25] = npub[1] until S[39] = npub[15]
S[size - klen + i] = npub[i];
printstate("Initial Value: ", S);
// S - state, 12-a - start, a - 12 rounds
permutations(S, 12 - a, a);
// i < 16
for(i = 0; i < klen; ++i)
// S[24] ^= k[0], S[25] ^= k[1] until S[39] ^= k[15]
S[size - klen + i] ^= k[i];
printstate("Initialization: ", S);
// Process Associated Data
if(adlen != 0)
// i < s = (adlen / rate + 1)
for(i = 0; i < s; ++i)
// rate = 16
for(j = 0; j < rate; ++j)
// S ^= A
S[j] ^= A[i * rate + j];
// S - state, 12-b - start, b - 8 rounds
permutations(S, 12 - b, b);
// S <- S ^= 1
S[size - 1] ^= 1;
printstate("Process Associated Data: ", S);
// Process Plain Text
for(i = 0; i < t - 1; ++i)
for(j = 0; j < rate; ++j)
// S <- S ^= P
S[j] ^= P[i * rate + j];
// c <- S
c[i * rate + j] = S[j];
// S <- permutation b (S)
permutations(S, 12 - b, b);
for(j = 0; j < rate; ++j)
// S <- S ^= Pt
S[j] ^= P[(t-1) * rate + j];
for(j = 0; j < (i64)(mlen % rate); ++j)
// C <- S
// Bitstring S truncated to the first (most significant) k bits
c[(t - 1) * rate + j] = S[j];
printstate("Process Plaintext: ", S);
// Finalization
for(i = 0; i < klen; ++i)
S[rate + i] ^= k[i];
permutations(S, 12 - a, a);
for(i = 0; i < klen; ++i)
// T <- S ^= k
// Bitstring S truncated to the last (least significant) k bits
S[size - klen + i] ^= k[i];
printstate("Finalization: ", S);
// Return Cipher Text & Tag
for(i = 0; i < klen; ++i)
c[mlen + i] = S[size - klen + i];
*clen = mlen + klen;
return 0;
And this is the code for decryption.
int crypto_aead_decrypt(unsigned char *m, unsigned long long *mlen,
unsigned char *nsec, const unsigned char *c,
unsigned long long clen, const unsigned char *ad,
unsigned long long adlen, const unsigned char *npub,
const unsigned char *k)
*mlen = 0;
if (clen < CRYPTO_KEYBYTES)
    return -1;
int klen = CRYPTO_KEYBYTES;
// int nlen = CRYPTO_NPUBBYTES;
int size = 320 / 8;
int rate = 128 / 8;
// int capacity = size - rate;
int a = 12;
int b = 8;
i64 s = adlen / rate + 1;
i64 t = (clen - klen) / rate + 1;
i64 l = (clen - klen) % rate;
u8 S[size];
u8 A[s * rate];
u8 M[t * rate];
i64 i, j;
// pad associated data
for (i = 0; i < adlen; ++i)
A[i] = ad[i];
A[adlen] = 0x80;
for (i = adlen + 1; i < s * rate; ++i)
A[i] = 0;
// initialization
S[0] = klen * 8;
S[1] = rate * 8;
S[2] = a;
S[3] = b;
for (i = 4; i < size - 2 * klen; ++i)
S[i] = 0;
for (i = 0; i < klen; ++i)
S[size - 2 * klen + i] = k[i];
for (i = 0; i < klen; ++i)
S[size - klen + i] = npub[i];
printstate("initial value:", S);
permutations(S, 12 - a, a);
for (i = 0; i < klen; ++i)
S[size - klen + i] ^= k[i];
printstate("initialization:", S);
// process associated data
if (adlen)
for (i = 0; i < s; ++i)
for (j = 0; j < rate; ++j)
S[j] ^= A[i * rate + j];
permutations(S, 12 - b, b);
S[size - 1] ^= 1;
printstate("process associated data:", S);
// process plaintext
for (i = 0; i < t - 1; ++i)
for (j = 0; j < rate; ++j)
M[i * rate + j] = S[j] ^ c[i * rate + j];
S[j] = c[i * rate + j];
permutations(S, 12 - b, b);
for (j = 0; j < l; ++j)
M[(t - 1) * rate + j] = S[j] ^ c[(t - 1) * rate + j];
for (j = 0; j < l; ++j)
S[j] = c[(t - 1) * rate + j];
S[l] ^= 0x80;
printstate("process plaintext:", S);
// finalization
for (i = 0; i < klen; ++i)
S[rate + i] ^= k[i];
permutations(S, 12 - a, a);
for (i = 0; i < klen; ++i)
S[size - klen + i] ^= k[i];
printstate("finalization:", S);
// return -1 if verification fails
for (i = 0; i < klen; ++i)
if (c[clen - klen + i] != S[size - klen + i])
return -1;
// return plaintext
*mlen = clen - klen;
for (i = 0; i < *mlen; ++i)
m[i] = M[i];
return 0;
Thanks in advance for the help; I am really clueless right now.
However, in the loop, I was tasked to add a switch case so that it
will switch between encryption and decryption. But I have no idea what
kind of condition to use to differentiate the encryption and
decryption.
According to your comments, the calls for encryption and decryption are happening inside of CDC_Device_ReceiveByte and CDC_Device_SendByte, which means you need to create a state machine for sending and receiving of the bytes. The condition that you would use for this is the return value of CDC_Device_BytesReceived.
You can create an enum for the states, and a simple struct for holding the current state along with any other pertinent information. You can create a function for the state machine that maps out what to do given the current state. Your while(1) loop will simply call the function to ensure the state machine moves along. You might implement that like this:
typedef enum{
    IDLE,
    DECRYPTING,
    ENCRYPTING,
}state_t;
typedef struct{
    state_t current_state;
}fsm_t;
fsm_t my_fsm = {0}; //initial state is idle
void myFSM(void){
    switch(my_fsm.current_state){
        case IDLE:
            /* Check if data received */
            Bytes = CDC_Device_BytesReceived(&VirtualSerial_CDC_Interface);
            if(Bytes) my_fsm.current_state = DECRYPTING; //we have data, decrypt it
            break;
        case DECRYPTING:
            /* Send data back to the host */
            ch = CDC_Device_ReceiveByte(&VirtualSerial_CDC_Interface);
            my_fsm.current_state = ENCRYPTING; // encrypt byte that we are going to send to host
            break;
        case ENCRYPTING:
            CDC_Device_SendByte(&VirtualSerial_CDC_Interface, ch);
            if(--Bytes > 0)
                my_fsm.current_state = DECRYPTING; // still have bytes left to decrypt
            else my_fsm.current_state = IDLE;
            break;
        default:
            asm("nop"); // whoops
            break;
    }
}
Now your loop is just
while(1){
    myFSM();
}
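For experimenting off-target, here is a minimal host-side sketch of a three-state machine driven by such a loop. Everything in it is an assumption for illustration: the names fsm_t, fsm_step, fsm_run and shift_letter are hypothetical, the USB CDC calls are replaced by plain buffers, and the byte transform is only a placeholder. The commented-out OUTPUT value in the Python script (b"Ifmmp!" for b"Hello!") suggests each letter is shifted by one, so the placeholder does that; real firmware would call crypto_aead_encrypt or crypto_aead_decrypt at that point instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { FSM_IDLE, FSM_RECEIVING, FSM_SENDING } fsm_state_t;

typedef struct {
    fsm_state_t state;
    const uint8_t *rx;      /* hypothetical stand-in for the USB receive queue */
    size_t rx_len, rx_pos;
    uint8_t *tx;            /* hypothetical stand-in for the USB transmit side */
    size_t tx_cap, tx_len;
    uint8_t ch;             /* byte currently being processed */
} fsm_t;

/* Placeholder transform: shift letters by one, leave other bytes unchanged. */
static uint8_t shift_letter(uint8_t b) {
    if ((b >= 'a' && b < 'z') || (b >= 'A' && b < 'Z')) return (uint8_t)(b + 1);
    if (b == 'z') return 'a';
    if (b == 'Z') return 'A';
    return b;
}

/* Advance the state machine by exactly one step. */
static void fsm_step(fsm_t *f) {
    switch (f->state) {
    case FSM_IDLE:          /* wait until input is pending */
        if (f->rx_pos < f->rx_len) f->state = FSM_RECEIVING;
        break;
    case FSM_RECEIVING:     /* take one byte from the input */
        f->ch = f->rx[f->rx_pos++];
        f->state = FSM_SENDING;
        break;
    case FSM_SENDING:       /* transform the byte and emit it */
        if (f->tx_len < f->tx_cap) f->tx[f->tx_len++] = shift_letter(f->ch);
        f->state = (f->rx_pos < f->rx_len) ? FSM_RECEIVING : FSM_IDLE;
        break;
    }
}

/* Feed a string through the machine until it returns to idle. */
static size_t fsm_run(const char *in, uint8_t *out, size_t cap) {
    fsm_t f = { FSM_IDLE, (const uint8_t *)in, strlen(in), 0, out, cap, 0, 0 };
    do {
        fsm_step(&f);
    } while (f.state != FSM_IDLE);
    return f.tx_len;
}
```

Calling fsm_run("Hello!", out, sizeof out) fills out with the six bytes "Ifmmp!". To choose between encryption and decryption, one option (again an assumption, not something the question specifies) is to reserve the first received byte as a command and switch on it in the receiving state.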

Loading an Integer Array into a SIMD register

At the moment I'm trying to load an integer array into a SIMD register using SSE.
I have an aligned 32-bit integer array Ai and want to load 4 consecutive elements into a SIMD register Xi. However, the values stored in Xi after executing _mm_load_si128 are garbage except for the first one.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>
// number has to be divisible by 4 without remainder
#define VECTOR_SIZE 8
int main() {
    __attribute__((aligned (16))) int32_t *Ai = (int32_t*) malloc(VECTOR_SIZE * sizeof(int32_t));
    for(int i = 0; i < VECTOR_SIZE; i++) {
        Ai[i] = rand() % 100000;
    }
    __m128i Xi;
    for(int i = 0; i < VECTOR_SIZE; i+=4) {
        Xi = _mm_load_si128((__m128i*) &Ai[i]);
        // show content of Xi and Ai
        for(int j = 0; j < 4; j++) {
            printf("Xi[%d] = %d\t Ai[%d] = %d\n", j, Xi[j], i+j, Ai[i+j]);
        }
    }
}
Here is an example output:
Xi[0] = 16807 Ai[0] = 16807
Xi[1] = 50073 Ai[1] = 75249
Xi[2] = 1489217992 Ai[2] = 50073
Xi[3] = 1346391152 Ai[3] = 43658
Xi[0] = 8930 Ai[4] = 8930
Xi[1] = 27544 Ai[5] = 11272
Xi[2] = 1489217992 Ai[6] = 27544
Xi[3] = 1346391168 Ai[7] = 50878
What is wrong?
You probably meant this when you were coming up with your example:
union {
__m128i m128;
int32_t i32[4];
} Xi;
for(int i = 0; i < VECTOR_SIZE; i+=4) {
    Xi.m128 = _mm_load_si128((__m128i*) &Ai[i]);
    // show content of Xi and Ai
    for(int j = 0; j < 4; j++) {
        printf("Xi[%d] = %d\t Ai[%d] = %d\n", j, Xi.i32[j], i+j, Ai[i+j]);
    }
}
Here is the example output:
Xi[0] = 89383 Ai[0] = 89383
Xi[1] = 30886 Ai[1] = 30886
Xi[2] = 92777 Ai[2] = 92777
Xi[3] = 36915 Ai[3] = 36915
Xi[0] = 47793 Ai[4] = 47793
Xi[1] = 38335 Ai[5] = 38335
Xi[2] = 85386 Ai[6] = 85386
Xi[3] = 60492 Ai[7] = 60492
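Some background on why the original printed garbage, plus an alternative to the union: as far as I know, GCC and Clang declare __m128i as a vector of two 64-bit integers, so Xi[j] in the question indexes 64-bit lanes. That matches the first output, where Xi[1] shows the value of Ai[2], and Xi[2] and Xi[3] read past the end of the vector. The sketch below stores the register back into an int32_t array instead. The helper name load4 is hypothetical, and the unaligned load/store intrinsics are used so that no assumption about the alignment of the malloc'ed buffer is needed (the aligned attribute in the question applies to the pointer variable, not to the memory it points to). The memcpy branch is only there so the sketch also compiles on targets without SSE2.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Loads four consecutive 32-bit integers from src through a SIMD register
   (when SSE2 is available) and writes the four lanes to out. */
static void load4(const int32_t *src, int32_t out[4]) {
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)src);  /* unaligned load */
    _mm_storeu_si128((__m128i *)out, v);                 /* lanes back to memory */
#else
    memcpy(out, src, 4 * sizeof *out);                   /* portable fallback */
#endif
}
```

If an aligned load is wanted, the buffer itself has to be allocated with a 16-byte alignment guarantee, for example with aligned_alloc(16, size) in C11.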

Allocate 3D matrix in one big chunk

I'd like to allocate a 3D matrix in one big chunk. It should be possible to access this matrix in the [i][j][k] fashion, without having to calculate the linearized index every time.
I think it should be something like below, but I'm having trouble filling the ...
double ****matrix = (double ****) malloc(...)
for (int i = 0; i < imax; i++) {
    matrix[i] = &matrix[...]
    for (int j = 0; j < jmax; j++) {
        matrix[i][j] = &matrix[...]
        for (int k = 0; k < kmax; k++) {
            matrix[i][j][k] = &matrix[...]
        }
    }
}
For the single allocation to be possible and work, you need to lay out the resulting memory like this:
imax units of double **
imax * jmax units of double *
imax * jmax * kmax units of double
Further, the 'imax units of double **' must be allocated first; you can reorder the other two sections, but it is most sensible to deal with them in the order listed.
You also need to be able to assume that double and double * (and double **, but that's not much of a stretch) are sufficiently well aligned that you can simply allocate the chunks contiguously. That is going to hold OK on most 64-bit systems with type double, but be aware of the possibility that it does not hold on 32-bit systems or for other types than double (basically, the assumption could be problematic when sizeof(double) != sizeof(double *)).
With those caveats made, then this code works cleanly (tested on Mac OS X 10.10.2 with GCC 4.9.1 and Valgrind version valgrind-3.11.0.SVN):
#include <stdio.h>
#include <stdlib.h>
typedef double Element;
static Element ***alloc_3d_matrix(size_t imax, size_t jmax, size_t kmax)
{
    size_t i_size = imax * sizeof(Element **);
    size_t j_size = imax * jmax * sizeof(Element *);
    size_t k_size = imax * jmax * kmax * sizeof(Element);
    Element ***matrix = malloc(i_size + j_size + k_size);
    if (matrix == 0)
        return 0;
    printf("i = %zu, j = %zu, k = %zu; sizes: i = %zu, j = %zu, k = %zu; "
           "%zu bytes total\n",
           imax, jmax, kmax, i_size, j_size, k_size, i_size + j_size + k_size);
    printf("matrix = %p .. %p\n", (void *)matrix,
           (void *)((char *)matrix + i_size + j_size + k_size));
    /* The block of Element * pointers starts right after the Element ** block */
    Element **j_base = (void *)((char *)matrix + imax * sizeof(Element **));
    printf("j_base = %p\n", (void *)j_base);
    for (size_t i = 0; i < imax; i++)
    {
        matrix[i] = &j_base[i * jmax];
        printf("matrix[%zu] = %p (%p)\n",
               i, (void *)matrix[i], (void *)&matrix[i]);
    }
    /* The block of Element values starts right after the Element * block */
    Element *k_base = (void *)((char *)j_base + imax * jmax * sizeof(Element *));
    printf("k_base = %p\n", (void *)k_base);
    for (size_t i = 0; i < imax; i++)
    {
        for (size_t j = 0; j < jmax; j++)
        {
            matrix[i][j] = &k_base[(i * jmax + j) * kmax];
            printf("matrix[%zu][%zu] = %p (%p)\n",
                   i, j, (void *)matrix[i][j], (void *)&matrix[i][j]);
        }
    }
    /* Diagnostic only */
    for (size_t i = 0; i < imax; i++)
        for (size_t j = 0; j < jmax; j++)
            for (size_t k = 0; k < kmax; k++)
                printf("matrix[%zu][%zu][%zu] = %p\n",
                       i, j, k, (void *)&matrix[i][j][k]);
    return matrix;
}
int main(void)
{
    size_t i_max = 3;
    size_t j_max = 4;
    size_t k_max = 5;
    Element ***matrix = alloc_3d_matrix(i_max, j_max, k_max);
    if (matrix == 0)
    {
        fprintf(stderr, "Failed to allocate matrix[%zu][%zu][%zu]\n", i_max, j_max, k_max);
        return 1;
    }
    for (size_t i = 0; i < i_max; i++)
        for (size_t j = 0; j < j_max; j++)
            for (size_t k = 0; k < k_max; k++)
                matrix[i][j][k] = (i + 1) * 100 + (j + 1) * 10 + k + 1;
    for (size_t i = 0; i < i_max; i++)
        for (size_t j = 0; j < j_max; j++)
            for (size_t k = k_max; k > 0; k--)
                printf("[%zu][%zu][%zu] = %6.0f\n", i, j, k-1, matrix[i][j][k-1]);
    free(matrix); /* one allocation, so one free releases everything */
    return 0;
}
Example output (with some boring bits omitted):
i = 3, j = 4, k = 5; sizes: i = 24, j = 96, k = 480; 600 bytes total
matrix = 0x100821630 .. 0x100821888
j_base = 0x100821648
matrix[0] = 0x100821648 (0x100821630)
matrix[1] = 0x100821668 (0x100821638)
matrix[2] = 0x100821688 (0x100821640)
k_base = 0x1008216a8
matrix[0][0] = 0x1008216a8 (0x100821648)
matrix[0][1] = 0x1008216d0 (0x100821650)
matrix[0][2] = 0x1008216f8 (0x100821658)
matrix[0][3] = 0x100821720 (0x100821660)
matrix[1][0] = 0x100821748 (0x100821668)
matrix[1][1] = 0x100821770 (0x100821670)
matrix[1][2] = 0x100821798 (0x100821678)
matrix[1][3] = 0x1008217c0 (0x100821680)
matrix[2][0] = 0x1008217e8 (0x100821688)
matrix[2][1] = 0x100821810 (0x100821690)
matrix[2][2] = 0x100821838 (0x100821698)
matrix[2][3] = 0x100821860 (0x1008216a0)
matrix[0][0][0] = 0x1008216a8
matrix[0][0][1] = 0x1008216b0
matrix[0][0][2] = 0x1008216b8
matrix[0][0][3] = 0x1008216c0
matrix[0][0][4] = 0x1008216c8
matrix[0][1][0] = 0x1008216d0
matrix[0][1][1] = 0x1008216d8
matrix[0][1][2] = 0x1008216e0
matrix[0][1][3] = 0x1008216e8
matrix[0][1][4] = 0x1008216f0
matrix[0][2][0] = 0x1008216f8
matrix[2][2][4] = 0x100821858
matrix[2][3][0] = 0x100821860
matrix[2][3][1] = 0x100821868
matrix[2][3][2] = 0x100821870
matrix[2][3][3] = 0x100821878
matrix[2][3][4] = 0x100821880
[0][0][4] = 115
[0][0][3] = 114
[0][0][2] = 113
[0][0][1] = 112
[0][0][0] = 111
[0][1][4] = 125
[0][1][3] = 124
[0][1][2] = 123
[0][1][1] = 122
[0][1][0] = 121
[0][2][4] = 135
[2][2][0] = 331
[2][3][4] = 345
[2][3][3] = 344
[2][3][2] = 343
[2][3][1] = 342
[2][3][0] = 341
There is a lot of diagnostic output in the code shown.
The technique works with C89 (and C99 and C11) because it does not require support for variable-length arrays (VLAs). As written, the code needs C99 or later because it declares variables in for loops, but it can easily be changed to declare those variables outside the loops, and it will then compile with C89.
This can be done with one simple malloc() call in C (not in C++, though, since C++ has no variable-length arrays):
void foo(int imax, int jmax, int kmax) {
    double (*matrix)[jmax][kmax] = malloc(imax*sizeof(*matrix));
    //Allocation done. Now fill the matrix:
    for(int i = 0; i < imax; i++) {
        for(int j = 0; j < jmax; j++) {
            for(int k = 0; k < kmax; k++) {
                matrix[i][j][k] = ...
            }
        }
    }
}
Note that C allows jmax and kmax to be dynamic values that are only known at runtime. That is the ability that's missing in C++, which makes C arrays much more powerful than their C++ counterpart.
The only drawback of this approach, as WhozCraig rightly notes, is that you can't return the resulting matrix as the return value of the function without resorting to a void*. However, you can return it by reference like this:
void foo(int imax, int jmax, int kmax, double (**outMatrix)[jmax][kmax]) {
    *outMatrix = malloc(imax*sizeof(**outMatrix));
    double (*matrix)[jmax][kmax] = *outMatrix; //avoid having to write (*outMatrix)[i][j][k] everywhere
    ... //as above
}
This function would need to be called like this:
int imax = ..., jmax = ..., kmax = ...;
double (*myMatrix)[jmax][kmax];
foo(imax, jmax, kmax, &myMatrix);
That way you get full type checking on the inner two dimension sizes even though they are runtime values.
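For reference, here is a small complete sketch of the same idea that can be compiled and run as is. The function name vla_matrix_demo and the fill pattern are made up for illustration, and it assumes a compiler that supports variably modified types (C99, or C11 without __STDC_NO_VLA__ defined).

```c
#include <stdlib.h>

/* Allocates an imax x jmax x kmax matrix with a single malloc() through a
   pointer to a variably modified array type, fills it, and returns the value
   stored in the last element (or -1.0 if the allocation fails). */
static double vla_matrix_demo(int imax, int jmax, int kmax) {
    double (*matrix)[jmax][kmax] = malloc((size_t)imax * sizeof(*matrix));
    if (matrix == NULL)
        return -1.0;
    for (int i = 0; i < imax; i++)
        for (int j = 0; j < jmax; j++)
            for (int k = 0; k < kmax; k++)
                matrix[i][j][k] = 100.0 * i + 10.0 * j + k;
    double last = matrix[imax - 1][jmax - 1][kmax - 1];
    free(matrix);  /* a single free() releases the whole matrix */
    return last;
}
```

Because the whole matrix is one contiguous block, there are no pointer tables to set up and nothing but the one pointer to free.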
Note: This was intended to be a comment but it got too long, until it turned into a proper answer.
You can't use a single chunk of memory without performing some calculations.
Note that the beginning of each row is marked by the formula
// row_begin is the memory address of the row at index row_idx
row_begin = row_idx * jmax * kmax
And then, each column depends on where the row starts:
// column_begin is the memory address of the column
// at index column_idx of the row starting at row_begin
column_begin = row_begin + column_idx * kmax
Which, using absolute addresses (relative to the matrix pointer, of course) translates to:
column_begin = (row_idx * jmax * kmax) + column_idx * kmax
Finally, getting the element at a given k index is straightforward, applying the same rule once more:
// element address = row_address + column_address + element_k_index
element_address = column_begin + element_k_idx
Which translates to
element_address = (row_idx * jmax * kmax) + column_idx * kmax + element_k_idx
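The offset formula above can be wrapped in a small helper. This is only a sketch; the name index_3d is made up, and the result is an element offset relative to the start of the single block, not an absolute address.

```c
#include <stddef.h>

/* Offset of element [i][j][k] in a row-major block with inner
   dimensions jmax and kmax. */
static size_t index_3d(size_t i, size_t j, size_t k, size_t jmax, size_t kmax) {
    return (i * jmax * kmax) + (j * kmax) + k;
}
```

With a flat allocation such as double *data = malloc(imax * jmax * kmax * sizeof *data), element [i][j][k] is then data[index_3d(i, j, k, jmax, kmax)].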
This works for me:
void foo(int imax, int jmax, int kmax)
{
    // Allocate memory for all the numbers.
    // Think of this as (imax*jmax) number of memory chunks,
    // with each chunk containing kmax doubles.
    double* data_0 = malloc(imax*jmax*kmax*sizeof(double));
    // Allocate memory for the previous dimension of pointers.
    // Think of this as imax number of memory chunks,
    // with each chunk containing jmax double*.
    double** data_1 = malloc(imax*jmax*sizeof(double*));
    // Allocate memory for the previous dimension of pointers.
    double*** data_2 = malloc(imax*sizeof(double**));
    for (int i = 0; i < imax; i++)
    {
        data_2[i] = &data_1[i*jmax];
        for (int j = 0; j < jmax; j++)
        {
            data_1[i*jmax+j] = &data_0[(i*jmax+j)*kmax];
        }
    }
    // That is the matrix.
    double ***matrix = data_2;
    for (int i = 0; i < imax; i++)
        for (int j = 0; j < jmax; j++)
            for (int k = 0; k < kmax; k++)
                matrix[i][j][k] = i+j+k;
    for (int i = 0; i < imax; i++)
        for (int j = 0; j < jmax; j++)
            for (int k = 0; k < kmax; k++)
                printf("%lf ", matrix[i][j][k]);
    // Deallocate memory
    free(data_2);
    free(data_1);
    free(data_0);
}

High Pass Filter using FFTW in C

I have a question regarding FFT. I have already managed to do forward and backward FFTs using FFTW in C. Now I want to apply a high-pass filter for edge detection; some of my sources say that this just means zeroing the centre of the magnitude spectrum.
This is my input image
Basically, what I do is:
Forward FFT
Convert the output to a 2D array
Do forward FFT shifting
Set the real and imaginary values to 0 where the distance from the centre is less than 25% of the height
Generate the magnitude
Do backward FFT shifting
Convert into a 1D array
Do backward FFT.
This is the original magnitude, the processed magnitude, and the result
Can someone help me by telling me which part is wrong and how to do high-pass filtering using FFTW in C?
Thank you.
The Source Code:
unsigned char **FFT2(int width,int height, unsigned char **pixel, char line1[100],char line2[100], char line3[100],char filename[100])
fftw_complex* in, * dft, * idft, * dft2;
//fftw_complex tmp1,tmp2;
fftw_plan plan_f,plan_i;
int i,j,k,w,h,N,w2,h2;
w = width;
h = height;
N = w*h;
unsigned char **pixel_out;
pixel_out = malloc(h*sizeof(unsigned char*));
for(i = 0 ; i<h;i++)
pixel_out[i]=malloc(w*sizeof(unsigned char));
in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) *N);
dft = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) *N);
dft2 = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) *N);
idft = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) *N);
/*run forward FFT*/
plan_f = fftw_plan_dft_2d(w,h,in,dft,FFTW_FORWARD,FFTW_ESTIMATE);
for(i = 0,k = 0 ; i < h ; i++)
for(j = 0 ; j < w ; j++,k++)
in[k][0] = pixel[i][j];
in[k][1] = 0.0;
double maxReal = 0.0;
for(i = 0 ; i < N ; i++)
maxReal = dft[i][0] > maxReal ? dft[i][0] : maxReal;
printf("MAX REAL : %f\n",maxReal);
//convert to 2d
double ***temp1;
temp1 = malloc(h * sizeof (double**));
for (i = 0;i < h; i++){
temp1[i] = malloc(w*sizeof (double*));
for (j = 0; j < w; j++){
temp1[i][j] = malloc(2*sizeof(double));
double ***temp2;
temp2 = malloc(h * sizeof (double**));
for (i = 0;i < h; i++){
temp2[i] = malloc(w*sizeof (double*));
for (j = 0; j < w; j++){
temp2[i][j] = malloc(2*sizeof(double));
for (i = 0;i < h; i++){
for (j = 0; j < w; j++){
temp1[i][j][0] = dft[i*w+j][0];
temp1[i][j][1] = dft[i*w+j][1];
int m2 = h/2;
int n2 = w/2;
//forward shifting
for (i = 0; i < m2; i++)
for (k = 0; k < n2; k++)
double tmp13[2] = {temp1[i][k][0],temp1[i][k][1]};
temp1[i][k][0] = temp1[i+m2][k+n2][0];
temp1[i][k][1] = temp1[i+m2][k+n2][1];
temp1[i+m2][k+n2][0] = tmp13[0];
temp1[i+m2][k+n2][1] = tmp13[1];
double tmp24[2] = {temp1[i+m2][k][0],temp1[i+m2][k][1]};
temp1[i+m2][k][0] = temp1[i][k+n2][0];
temp1[i+m2][k][1] = temp1[i][k+n2][1];
temp1[i][k+n2][0] = tmp24[0];
temp1[i][k+n2][1] = tmp24[1];
for (i = 0;i < h; i++){
for (j = 0; j < w; j++){
if(distance_to_center(i,j,m2,n2) < 0.25*h)
temp1[i][j][0] = (double)0.0;
temp1[i][j][1] = (double)0.0;
/* copy for magnitude */
for (i = 0;i < h; i++){
for (j = 0; j < w; j++){
temp2[i][j][0] = temp1[i][j][0];
temp2[i][j][1] = temp1[i][j][1];
//backward shifting
for (i = 0; i < m2; i++)
for (k = 0; k < n2; k++)
double tmp13[2] = {temp1[i][k][0],temp1[i][k][1]};
temp1[i][k][0] = temp1[i+m2][k+n2][0];
temp1[i][k][1] = temp1[i+m2][k+n2][1];
temp1[i+m2][k+n2][0] = tmp13[0];
temp1[i+m2][k+n2][1] = tmp13[1];
double tmp24[2] = {temp1[i+m2][k][0],temp1[i+m2][k][1]};
temp1[i+m2][k][0] = temp1[i][k+n2][0];
temp1[i+m2][k][1] = temp1[i][k+n2][1];
temp1[i][k+n2][0] = tmp24[0];
temp1[i][k+n2][1] = tmp24[1];
//convert back to 1d
for (i = 0;i < h; i++){
for (j = 0; j < w; j++){
dft[i*w+j][0] = temp1[i][j][0];
dft[i*w+j][1] = temp1[i][j][1];
dft2[i*w+j][0] = temp2[i][j][0];
dft2[i*w+j][1] = temp2[i][j][1];
/* magnitude */
double max = 0;
double min = 0;
double mag=0;
for (i = 0, k = 1; i < h; i++){
for (j = 0; j < w; j++, k++){
mag = sqrt(pow(dft2[i*w+j][0],2) + pow(dft2[i*w+j][1],2));
if (max < mag)
max = mag;
double **magTemp;
magTemp = malloc(h * sizeof (double*));
for (i = 0;i < h; i++){
magTemp[i] = malloc(w*sizeof (double));
for(i = 0,k = 0 ; i < h ; i++)
for(j = 0 ; j < w ; j++,k++)
double mag = sqrt(pow(dft2[i*w+j][0],2) + pow(dft2[i*w+j][1],2));
mag = 255*(mag/max);
//magTemp[i][j] = 255-mag; //white
magTemp[i][j] = mag; //black
/* brightening magnitude*/
for(i = 0,k = 0 ; i < h ; i++)
for(j = 0 ; j < w ; j++,k++)
//double temp = magTemp[i][j];
double temp = (double)(255/(log(1+255)))*log(1+magTemp[i][j]);
pixel_out[i][j] = (unsigned char)temp;
/* backward fft */
plan_i = fftw_plan_dft_2d(w,h,dft,idft,FFTW_BACKWARD,FFTW_ESTIMATE);
for(i = 0,k = 0 ; i < h ; i++)
for(j = 0 ; j < w ; j++,k++)
double temp = idft[i*w+j][0]/N;
pixel_out[i][j] = (unsigned char)temp; //+ pixel[i][j];
return pixel_out;
EDIT: new source code
I added this part before the forward shifting; the result is also as expected.
//create filter
unsigned char **pixel_filter;
pixel_filter = malloc(h*sizeof(unsigned char*));
for(i = 0 ; i<h;i++)
pixel_filter[i]=malloc(w*sizeof(unsigned char));
for (i = 0;i < h; i++){
for (j = 0; j < w; j++){
if(distance_to_center(i,j,m2,n2) < 20)
    pixel_filter[i][j] = 0;
else
    pixel_filter[i][j] = 255;
for (i = 0; i < m2; i++)
for (k = 0; k < n2; k++)
unsigned char tmp13 = pixel_filter[i][k];
pixel_filter[i][k] = pixel_filter[i+m2][k+n2];
pixel_filter[i+m2][k+n2] = tmp13;
unsigned char tmp24 = pixel_filter[i+m2][k];
pixel_filter[i+m2][k] = pixel_filter[i][k+n2];
pixel_filter[i][k+n2] = tmp24;
for (i = 0;i < h; i++){
for (j = 0; j < w; j++){
temp1[i][j][0] *= pixel_filter[i][j];
temp1[i][j][1] *= pixel_filter[i][j];
Your general idea is OK. From the output, it's hard to tell whether there's simply an accounting problem in your program, or whether this is perhaps the expected result. Try padding the source image with much more empty space, and filter out a smaller area in the frequency domain.
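If you stay in C, two details are worth checking; the following is a sketch under stated assumptions, not a diagnosis of your output. First, the mask can be built directly in the unshifted layout that FFTW produces (DC term at index 0), which removes the need for the two quadrant-swapping passes. Second, the inverse transform of a high-pass-filtered image generally contains negative values, and converting an out-of-range double to unsigned char is undefined in C, so the result should be clamped (or offset and rescaled) before the cast. The helper names make_highpass_mask and clamp_to_byte are hypothetical. Also note that fftw_plan_dft_2d expects the number of rows first, so for data stored as in[i*w+j] the call would be fftw_plan_dft_2d(h, w, ...); this only matters for non-square images.

```c
#include <stddef.h>

/* Fills mask (h*w entries, row-major, unshifted layout with the DC term at
   index 0) with 0.0 inside the cutoff radius and 1.0 outside it. */
static void make_highpass_mask(int h, int w, int radius, double *mask) {
    for (int i = 0; i < h; i++) {
        int fi = (i <= h / 2) ? i : i - h;      /* signed vertical frequency */
        for (int j = 0; j < w; j++) {
            int fj = (j <= w / 2) ? j : j - w;  /* signed horizontal frequency */
            mask[(size_t)i * w + j] =
                (fi * fi + fj * fj < radius * radius) ? 0.0 : 1.0;
        }
    }
}

/* Rounds v to the nearest integer and clamps it to the range 0..255. */
static unsigned char clamp_to_byte(double v) {
    if (!(v > 0.0)) return 0;        /* negatives, zero and NaN */
    if (v >= 255.0) return 255;
    return (unsigned char)(v + 0.5);
}
```

With such a mask, the filtering step would multiply both dft[k][0] and dft[k][1] by mask[k] before the backward transform, and the output pixel would become clamp_to_byte(idft[k][0] / N). For edge detection it may be more useful to display the absolute value of the filtered result rather than clamping negatives to zero.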
As a side note, doing this in C appears incredibly painful. Here is an equivalent implementation in Matlab. Not including plotting, it's around 10 lines of code. You might also try Numerical Python (NumPy).
% Demonstrate frequency-domain image filtering in Matlab
% Define the grid
x = linspace(-1, 1, 1001);
y = x;
[X, Y] = meshgrid(x, y);
% Make a square (source image)
rect = (abs(X) < 0.1) & (abs(Y) < 0.1);
% Compute the transform
rect_hat = fft2(rect);
% Make the high-pass filter
R = sqrt(X.^2 + Y.^2);
filt = (R > 0.05);
% Apply the filter
rect_hat_filtered = rect_hat .* ifftshift(filt);
% Compute the inverse transform
rect_filtered = ifft2(rect_hat_filtered);
%% Plot everything
axis square
saveas(gcf, 'fig1.png');
axis square
saveas(gcf, 'fig2.png');
title('filter (frequency domain)');
axis square
saveas(gcf, 'fig3.png');
title('fft(source) .* filter');
axis square
saveas(gcf, 'fig4.png');
axis square
saveas(gcf, 'fig5.png');
The source image:
Fourier transform of the source image:
The filter:
Result of applying (multiplying) the filter with the fourier transform of the source image:
Taking the inverse transform gives the final result:
