C - Store global variables in flash?

As the title suggests, I'm currently short on SRAM in my program and I can't find a way to reduce my global variables. Is it possible to move global variables to flash memory? Since these variables are frequently read and written, would that be bad for the NAND flash, given that it has a limited number of read/write cycles?
If the flash cannot handle this, would EEPROM be a good alternative?
Sorry for the ambiguity guys. I'm working with Atmel AVR ATmega32HVB which has:
2K bytes of SRAM,
1K bytes of EEPROM
32K bytes of FLASH
Compiler: AVR C/C++
Platform: IAR Embedded AVR
The global variables that I want to get rid of are:
uint32_t capacityInCCAccumulated[TOTAL_CELL];
int32_t AccumulatedCCADCvalue[TOTAL_CELL];
Code snippets:
int32_t AccumulatedCCADCvalue[TOTAL_CELL];
void CCGASG_AccumulateCCADCMeasurements(int32_t ccadcMeasurement, uint16_t slowRCperiod)
{
    uint8_t cellIndex;
    // Sampling period dependant on configuration of CCADC sampling..
    int32_t temp = ccadcMeasurement * (int32_t)slowRCperiod;
    bool polChange = false;
    if(temp < 0) {
        temp = -temp;
        polChange = true;
    }
    // Add 0.5*divisor to get proper rounding
    temp += (1<<(CCGASG_ACC_SCALING-1));
    if(polChange) {
        temp = -temp;
    }
    for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++) {
        AccumulatedCCADCvalue[cellIndex] += temp;
    }
    // If it was a charge, update the charge cycle counter
    if(ccadcMeasurement <= 0) {
        // If it was a discharge, AccumulatedCADCvalue can be negative, and that
        // is "impossible", so set it to zero
        for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++) {
            if(AccumulatedCCADCvalue[cellIndex] < 0) {
                AccumulatedCCADCvalue[cellIndex] = 0;
            }
        }
    }
}
And this
uint32_t capacityInCCAccumulated[TOTAL_CELL];
void BATTPARAM_InitSramParameters() {
    uint8_t cellIndex;
    // Active current threshold in ticks
    battParams_sram.activeCurrentThresholdInTicks = (uint16_t) BATTCUR_mA2Ticks(battParams.activeCurrentThreshold);
    for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++) {
        // Full charge capacity in CC accumulated
        battParams_sram.capacityInCCAccumulated[cellIndex] = (uint32_t) CCGASG_mAh2Acc(battParams.fullChargeCapacity);
    }
    // Terminate discharge limit in CC accumulated
    battParams_sram.terminateDischargeLimit = CCGASG_mAh2Acc(battParams.terminateDischargeLimit);
    // Values for remaining capacity calibration
}

would it be bad for the nand flash because they have limited number of
read/write cycle?
Yes, it's not a good idea to use flash for frequently modified data.
Reading from flash does not reduce its lifetime; erasing and writing do.
Reading from and writing to flash is substantially slower than accessing conventional memory.
To write a single byte, a whole block has to be erased and rewritten.

Any kind of Flash is a bad idea to be used for frequently changing values:
Limited number of erase/write cycles (see the datasheet).
Very slow erase/write (an erase can take ~1 s; see the datasheet).
You need a special sequence to erase and then write (there is no language support for this).
While an erase or write is in progress, accesses to the Flash are blocked at best; some devices require that the Flash not be accessed at all (undefined behaviour otherwise).
Flash cells cannot be written freely per byte/word. Most devices must be written per page (e.g. 64 bytes) and are usually erased in much larger units (segments/blocks/sectors).
For NAND Flash, endurance is even more reduced compared to NOR Flash and the cells are less reliable (bits might flip occasionally or are defective), so you have to add error detection and correction. This is very likely a direction you should not go.
True EEPROM shares most of these issues, but it can usually be written byte/word-wise (internal erase).
Note that modern MCU-integrated "EEPROM" is most often also Flash. Some implementations just use slightly more reliable cells (about one decade more erase/write cycles than the program flash) and additional hardware allowing arbitrary byte/word writes (automatic erase). But that is still not sufficient for frequent changes.
However, you should first verify whether your application can tolerate the lengthy write/erase times. Can you accept a process blocking for that long, or rewrite your program accordingly? If the answer is "no", you should stop investigating in that direction. Otherwise, calculate the number of updates over the expected lifetime and compare it to the endurance given in the datasheet. There are also methods to reduce the number of erase cycles, but that leads too far here.
If an external device (I2C/SPI) is an option, you could use a serial SRAM. However, the better (and likely cheaper) approach would be a larger MCU, or a more efficient (i.e. less RAM, more code) way to store the data in SRAM.
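The lifetime calculation mentioned above can be sketched roughly as follows. This is only an illustration: the function names are made up for this example, and the update rate, product lifetime and rated endurance are placeholders that you would replace with your own requirements and the figure from your device's datasheet.

```c
#include <stdint.h>

/* Total erase/write cycles one memory cell would see over the product
   lifetime, assuming a constant update rate (an assumption of this sketch). */
uint64_t total_updates(uint32_t updates_per_hour, uint32_t lifetime_years)
{
    return (uint64_t)updates_per_hour * 24u * 365u * lifetime_years;
}

/* Returns 1 if the expected number of updates stays within the rated
   endurance (taken from the datasheet), 0 otherwise. */
int endurance_ok(uint64_t updates, uint32_t rated_cycles)
{
    return updates <= rated_cycles;
}
```

For example, one update per minute over ten years is total_updates(60, 10) = 5,256,000 cycles, which would exceed an assumed rating of 100,000 cycles by a wide margin, while one update per hour (87,600 cycles) would just fit.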


RTC Static Memory in Deep Sleep on ESP32 with ESP-IDF

I am using the 8KB of static RAM on the RTC inside the ESP32 to save a small amount of sensor data to reduce power consumption by transmitting less frequently. But I am having no luck with even the simple example code:
RTC_DATA_ATTR uint32_t testValue = 0;
ESP_LOGE(TAG, "testValue = %d", testValue++);
In the monitor, I can see the value as 0 first time round, but then it's anyone's guess.
E (109) app_main: testValue = 0
E (109) app_main: testValue = -175962388
Also tried the attribute:
RTC_NOINIT_ATTR uint32_t testValue = 0;
What am I doing wrong?
I received an answer through another channel that I'd like to share. The solution was to set:
So the RTC memory regions are enabled. In my case, I had specifically disabled them in another area of the code (the deep sleep power management code). This solution doesn't significantly affect the deep sleep power consumption (~10 uA).

Improve performance of reading volatile memory

I have a function reading from some volatile memory which is updated by a DMA. The DMA never operates on the same memory location as the function. My application is performance critical, and I found that the execution time improves by approx. 20% if I do not declare the memory as volatile. Within the scope of my function the memory is effectively non-volatile. However, I have to be sure that the next time the function is called, the compiler knows that the memory may have changed.
The memory is two two-dimensional arrays:
volatile uint16_t memoryBuffer[2][10][20] = {0};
The DMA operates on the opposite "matrix" than the program function:
void myTask(uint8_t indexOppositeOfDMA)
{
    for(uint8_t n=0; n<10; n++)
    {
        for(uint8_t m=0; m<20; m++)
        {
            //Do some stuff with memory (readings only):
        }
    }
}
Is there a proper way to tell my compiler that memoryBuffer is non-volatile inside the scope of myTask() but may have changed the next time I call myTask(), so that I can obtain the 20% performance improvement?
Platform: Cortex-M4
The problem without volatile
Let's assume that volatile is omitted from the data array. Then the C compiler
and the CPU do not know that its elements change outside the program-flow. Some
things that could happen then:
The whole array might be loaded into the cache when myTask() is called for
the first time. The array might stay in the cache forever and is never
updated from the "main" memory again. This issue is more pressing on multi-core
CPUs if myTask() is bound to a single core, for example.
If myTask() is inlined into the parent function, the compiler might decide
to hoist loads outside of the loop even to a point where the DMA transfer
has not been completed.
The compiler might even be able to determine that no write happens to
memoryBuffer and assume that the array elements stay at 0 all the time
(which would again trigger a lot of optimizations). This could happen if
the program was rather small and all the code is visible to the compiler
at once (or LTO is used).
Remember: After all the compiler does not know anything about the DMA
peripheral and that it is writing "unexpectedly and wildly into memory"
(from a compiler perspective).
If the compiler is dumb/conservative and the CPU not very sophisticated (single core, no out-of-order execution), the code might even work without the volatile declaration. But it also might not...
The problem with volatile
Making the whole array volatile is often a pessimisation. For speed reasons you
probably want to unroll the loop. So instead of loading from the
array and incrementing the index alternately, such as
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
it can be faster to load multiple elements at once and increment the index
in larger steps such as
load memoryBuffer[m]
load memoryBuffer[m + 1]
load memoryBuffer[m + 2]
load memoryBuffer[m + 3]
m += 4;
This is especially true if the loads can be fused together (e.g. to perform
one 32-bit load instead of two 16-bit loads). Furthermore, you want the
compiler to use SIMD instructions to process multiple array elements with
a single instruction.
These optimizations are often prevented if the load happens from
volatile memory because compilers are usually very conservative with
load/store reordering around volatile memory accesses.
Again the behavior differs between compiler vendors (e.g. MSVC vs GCC).
Possible solution 1: fences
So you would like to make the array non-volatile but add a hint for the compiler/CPU saying "when you see this line (execute this statement), flush the cache and reload the array from memory". In C11 you could insert an atomic_thread_fence at the beginning of myTask(). Such fences prevent the re-ordering of loads/stores across them.
Since we do not have a C11 compiler, we use intrinsics for this task. The ARMCC compiler has a __dmb() intrinsic (data memory barrier). For GCC you may want to look at __sync_synchronize().
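For readers who do have a C11 toolchain, the fence idea might look roughly like the sketch below. The buffer is declared without volatile and a full fence is placed at the start of the task. The function name sumBuffer and the summing loop are only stand-ins for the real processing. Note that this relies on the compiler treating the fence as a barrier for ordinary memory accesses, which GCC and Clang do in practice; it does not replace whatever cache maintenance a platform with data caches might need.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Plain (non-volatile) double buffer; one half is assumed to be DMA-owned. */
uint16_t memoryBuffer[2][10][20];

/* Hypothetical task: sums the half that the DMA is NOT currently writing. */
uint32_t sumBuffer(uint8_t indexOppositeOfDMA)
{
    /* Full fence: in practice the compiler will neither move the loads
       below above this point nor reuse values it read before it. */
    atomic_thread_fence(memory_order_seq_cst);

    uint32_t sum = 0;
    for (uint8_t n = 0; n < 10; n++) {
        for (uint8_t m = 0; m < 20; m++) {
            /* Ordinary loads: free to be unrolled, fused or vectorised. */
            sum += memoryBuffer[indexOppositeOfDMA][n][m];
        }
    }
    return sum;
}
```

Inside the loops the compiler is free to optimise, while each call starts with a fresh view of memory.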
Possible solution 2: atomic variable holding the buffer state
We use the following pattern a lot in our codebase (e.g. when reading data from
SPI via DMA and calling a function to analyze it): The buffer is declared as
plain array (no volatile) and an atomic flag is added to each buffer, which
is set when the DMA transfer has finished. The code looks something
like this:
typedef struct Buffer
{
    uint16_t data[10][20];
    // Flag indicating if the buffer has been filled. Only use atomic instructions on it!
    int filled;
    // C11: atomic_int filled;
    // C++: std::atomic_bool filled{false};
} Buffer_t;
Buffer_t buffers[2];
Buffer_t* volatile currentDmaBuffer; // using volatile here because I'm lazy
void setupDMA(void)
{
    for (int i = 0; i < 2; ++i)
    {
        int bufferFilled;
        // Atomically load the flag.
        bufferFilled = __sync_fetch_and_or(&buffers[i].filled, 0);
        // C11: bufferFilled = atomic_load(&buffers[i].filled);
        // C++: bufferFilled = buffers[i].filled;
        if (!bufferFilled)
        {
            currentDmaBuffer = &buffers[i];
            // ... configure DMA to write to buffers[i].data and start it
            return; // a free buffer was found, so do not fall through
        }
    }
    // If you end up here, there is no free buffer available because the
    // data processing takes too long.
}
void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed
    // Atomically set the flag indicating that the buffer has been filled.
    __sync_fetch_and_or(&currentDmaBuffer->filled, 1);
    // C11: atomic_store(&currentDmaBuffer->filled, 1);
    // C++: currentDmaBuffer->filled = true;
    currentDmaBuffer = 0;
    // ... possibly start another DMA transfer ...
}
void myTask(Buffer_t* buffer)
{
    for (uint8_t n=0; n<10; n++)
    {
        for (uint8_t m=0; m<20; m++)
        {
            // ... work with buffer->data[n][m] ...
        }
    }
    // Reset the flag atomically.
    __sync_fetch_and_and(&buffer->filled, 0);
    // C11: atomic_store(&buffer->filled, 0);
    // C++: buffer->filled = false;
}
void waitForData(void)
{
    // ... see setupDma(void) ...
}
The advantage of pairing the buffers with an atomic is that you are able to detect when the processing is too slow, meaning that you have to buffer more,
slow down the incoming data, make the processing code faster, or do whatever is
sufficient in your case.
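The C11 variant hinted at in the comments above could be condensed into something like the following sketch. The helper names (acquireFreeBuffer, markFilled, markProcessed) are invented for this illustration and the DMA configuration itself is left out. A NULL return from acquireFreeBuffer() is the "processing is too slow" case described above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t data[10][20];  /* plain, non-volatile payload */
    atomic_bool filled;     /* set by the DMA-done handler, cleared by the task */
} Buffer;

static Buffer buffers[2];   /* static storage: both flags start out false */

/* Returns a buffer the DMA may write to next, or NULL if both buffers
   are still waiting to be processed. */
Buffer *acquireFreeBuffer(void)
{
    for (size_t i = 0; i < 2; ++i) {
        if (!atomic_load(&buffers[i].filled)) {
            return &buffers[i];
        }
    }
    return NULL;
}

/* Intended to be called from the DMA-complete interrupt handler. */
void markFilled(Buffer *b)
{
    atomic_store(&b->filled, true);
}

/* Intended to be called by the task once it has finished reading b->data. */
void markProcessed(Buffer *b)
{
    atomic_store(&b->filled, false);
}
```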
Possible solution 3: OS support
If you have an (embedded) OS, you might resort to other patterns instead of using volatile arrays. The OS we use features memory pools and queues. The latter can be filled from a thread or an interrupt and a thread can block on
the queue until it is non-empty. The pattern looks a bit like this:
MemoryPool pool; // A pool to acquire DMA buffers.
Queue bufferQueue; // A queue for pointers to buffers filled by the DMA.
void* volatile currentBuffer; // The buffer currently filled by the DMA.
void setupDMA(void)
{
    currentBuffer = MemoryPool_Allocate(&pool, 20 * 10 * sizeof(uint16_t));
    // ... make the DMA write to currentBuffer
}
void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed
    Queue_Post(&bufferQueue, currentBuffer);
    currentBuffer = 0;
}
void myTask(void)
{
    void* buffer = Queue_Wait(&bufferQueue);
    // ... work with buffer ...
    MemoryPool_Deallocate(&pool, buffer);
}
This is probably the easiest approach to implement but only if you have an OS
and if portability is not an issue.
Here you say that the buffer is non-volatile:
"memoryBuffer is non-volatile inside the scope of myTask"
But here you say that it must be volatile:
"but may be changed next time i call myTask"
These two statements contradict each other. Clearly the memory area must be volatile, or the compiler can't know that it may be updated by the DMA.
However, I rather suspect that the actual performance loss comes from accessing this memory region repeatedly through your algorithm, forcing the compiler to read it back over and over again.
What you should do is to take a local, non-volatile copy of the part of the memory you are interested in:
void myTask(uint8_t indexOppositeOfDMA)
{
    for(uint8_t n=0; n<10; n++)
    {
        for(uint8_t m=0; m<20; m++)
        {
            volatile uint16_t* data = &memoryBuffer[indexOppositeOfDMA][n][m];
            uint16_t local_copy = *data; // this access is volatile and won't get optimized away
            foo(&local_copy); // optimizations possible here
            // if needed, write back again:
            *data = local_copy; // optional
        }
    }
}
You'll have to benchmark it, but I'm pretty sure this should improve performance.
Alternatively, you could first copy the whole part of the array you are interested in, then work on that, before writing it back. That should help performance even more.
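A minimal sketch of that copy-first variant, assuming the 2x10x20 layout from the question. Here process() is a hypothetical stand-in for the real algorithm, and myTask() returns a value only so that the result can be checked. Whether the extra copy pays off depends on how often the algorithm touches each element, so benchmark it.

```c
#include <stdint.h>

volatile uint16_t memoryBuffer[2][10][20];

/* Hypothetical processing step working on an ordinary, non-volatile array,
   so the compiler may unroll, fuse and vectorise freely. */
static uint32_t process(uint16_t local[10][20])
{
    uint32_t sum = 0;
    for (uint8_t n = 0; n < 10; n++) {
        for (uint8_t m = 0; m < 20; m++) {
            sum += local[n][m];
        }
    }
    return sum;
}

uint32_t myTask(uint8_t indexOppositeOfDMA)
{
    uint16_t local[10][20];

    /* Exactly one volatile read per element. */
    for (uint8_t n = 0; n < 10; n++) {
        for (uint8_t m = 0; m < 20; m++) {
            local[n][m] = memoryBuffer[indexOppositeOfDMA][n][m];
        }
    }

    /* All further accesses go to the local copy. */
    return process(local);
}
```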
You're not allowed to cast away the volatile qualifier [1].
If the array must be defined holding volatile elements, then the only two options "that let the compiler know that the memory has changed" are to keep the volatile qualifier, or to use a temporary array which is defined without volatile and is copied to the proper array after the function call. Pick whichever is faster.
[1] Quoted from ISO/IEC 9899:201x, 6.7.3 Type qualifiers, paragraph 6:
If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined.
It seems to me that you are passing half of the buffer to myTask, and each half does not need to be volatile. So I wonder if you could solve your issue by defining the buffer as such, and then passing a pointer to one of the half-buffers to myTask. I'm not sure whether this will work, but maybe something like this...
typedef struct memory_buffer {
    uint16_t buffer[10][20];
} memory_buffer ;
volatile memory_buffer double_buffer[2];
void myTask(memory_buffer *mem_buf)
{
    for(uint8_t n=0; n<10; n++)
    {
        for(uint8_t m=0; m<20; m++)
        {
            //Do some stuff with memory:
        }
    }
}
I don't know your platform/MCU/SoC, but DMAs usually have an interrupt that triggers on a programmable threshold.
What I can imagine is removing the volatile keyword and using the interrupt as a semaphore for the task.
In other words:
The DMA is programmed to interrupt when the last byte of the buffer is written.
The task blocks on a semaphore/flag, waiting for the flag to be released.
When the DMA calls the interrupt routine, change the buffer the DMA points to for the next transfer and change the flag, which unblocks the task so that it can process the data.
Something like:
uint16_t memoryBuffer[2][10][20];
volatile uint8_t PingPong = 0;
void interrupt ( void )
{
    // Change current DMA pointed buffer
    PingPong ^= 1;
}
void myTask(void)
{
    static uint8_t lastPingPong = 0;
    if (lastPingPong != PingPong)
    {
        for (uint8_t n = 0; n < 10; n++)
        {
            for (uint8_t m = 0; m < 20; m++)
            {
                //Do some stuff with memory:
            }
        }
        lastPingPong = PingPong;
    }
}

Freertos + STM32 - thread memory overflow with malloc

I'm working with an STM32 and FreeRTOS to implement a file system based on SPI flash. For FreeRTOS, I adopted the heap_1 implementation. This is how I create my task:
osThreadDef(Task_Embedded, Task_VATEmbedded, osPriorityNormal, 0, 2500);
VATEmbeddedTaskHandle = osThreadCreate(osThread(Task_Embedded), NULL);
I allocated 10000 bytes of memory to this thread.
In this thread I tried to write data into flash. The first few calls worked, but after several more writes it crashes.
VATAPI_RESULT STM32SPIWriteSector(void *writebuf, uint8_t* SectorAddr, uint32_t buff_size){
if(STM32SPIEraseSector(SectorAddr) == VAT_SUCCESS){
DBGSTR("ERASE SECTOR - 0x%2x %2x %2x", SectorAddr[0], SectorAddr[1], SectorAddr[2]);
}else return VAT_UNKNOWN;
if(STM32SPIProgram_multiPage(writebuf, SectorAddr, buff_size) == VAT_SUCCESS){
}else return VAT_UNKNOWN;
VATAPI_RESULT STM32SPIProgram_multiPage(uint8_t *writebuf, uint8_t *writeAddr, uint32_t buff_size){
uint8_t tmpaddr[3] = {writeAddr[0], writeAddr[1], writeAddr[2]};
uint8_t* sectorBuf = malloc(4096 * sizeof(uint8_t));
uint8_t* pagebuf = malloc(255* sizeof(uint8_t));
uint32_t i = 0, tmp_convert1, times = 0;
if(buff_size < Page_bufferSize)
times = 1;
times = buff_size / (Page_bufferSize-1);
/* Note : According to winbond flash feature, the last bytes of every 256 bytes should be 0, so we need to plus one byte on every 256 bytes*/
i = 0;
while(i < times){
memset(&pagebuf[0], 0, Page_bufferSize - 1);
memcpy(&pagebuf[0], &writebuf[i*255], Page_bufferSize - 1);
memcpy(&sectorBuf[i*Page_bufferSize], &pagebuf[0], Page_bufferSize - 1);
sectorBuf[((i+1)*Page_bufferSize)-1] = 0;
i = 0;
while(i < times){
if((nres=STM32SPIPageProgram(&sectorBuf[Page_bufferSize*i], &tmpaddr[0], Page_bufferSize)) != VAT_SUCCESS){
DBGSTR("STM32SPIProgram_allData write data fail on %d times!",i);
return nres;
tmp_convert1 = (tmpaddr[0]<<16 | tmpaddr[1]<<8 | tmpaddr[2]) + Page_bufferSize;
tmpaddr[0] = (tmp_convert1&0xFF0000) >> 16;
tmpaddr[1] = (tmp_convert1&0xFF00) >>8;
tmpaddr[2] = 0x00;
return nres;
I opened the debugger, and it seems to crash when I malloc "sectorBuf" in the function "STM32SPIProgram_multiPage". What confuses me is that I did free the memory after the malloc. Does anyone have an idea about it?
arm-none-eabi-size "RTOS.elf"
text data bss dec hex filename
77564 988 100756 179308 2bc6c RTOS.elf
Reading the manual:
Memory Management
If RTOS objects are created dynamically then the standard C library malloc() and free() functions can sometimes be used for the purpose, but ...
they are not always available on embedded systems,
they take up valuable code space,
they are not thread safe, and
they are not deterministic (the amount of time taken to execute the function will differ from call to call)
... so more often than not an alternative memory allocation implementation is required.
One embedded / real time system can have very different RAM and timing requirements to another - so a single RAM allocation algorithm will only ever be appropriate for a subset of applications.
To get around this problem, FreeRTOS keeps the memory allocation API in its portable layer. The portable layer is outside of the source files that implement the core RTOS functionality, allowing an application specific implementation appropriate for the real time system being developed to be provided. When the RTOS kernel requires RAM, instead of calling malloc(), it instead calls pvPortMalloc(). When RAM is being freed, instead of calling free(), the RTOS kernel calls vPortFree().
(Emphasis mine.)
This means that if you call malloc directly, FreeRTOS is not able to manage the heap consumed by that library function. The same applies if you choose the heap_3 scheme, which is a simple malloc wrapper.
Also note that the memory management scheme you chose (heap_1) cannot free memory:
This is the simplest implementation of all. It does not permit memory to be freed once it has been allocated. Despite this, heap_1.c is appropriate for a large number of embedded applications. This is because many small and deeply embedded applications create all the tasks, queues, semaphores, etc. required when the system boots, and then use all of these objects for the lifetime of program (until the application is switched off again, or is rebooted). Nothing ever gets deleted.
The implementation simply subdivides a single array into smaller blocks as RAM is requested. The total size of the array (the total size of the heap) is set by configTOTAL_HEAP_SIZE - which is defined in FreeRTOSConfig.h. The configAPPLICATION_ALLOCATED_HEAP FreeRTOSConfig.h configuration constant is provided to allow the heap to be placed at a specific address in memory.
The xPortGetFreeHeapSize() API function returns the total amount of heap space that remains unallocated, allowing the configTOTAL_HEAP_SIZE setting to be optimised.
The heap_1 implementation:
Can be used if your application never deletes a task, queue, semaphore, mutex, etc. (which actually covers the majority of applications in which FreeRTOS gets used).
Is always deterministic (always takes the same amount of time to execute) and cannot result in memory fragmentation.
Is very simple and allocates memory from a statically allocated array, meaning it is often suitable for use in applications that do not permit true dynamic memory allocation.
(Emphasis mine.)
Side note: You should always check that the value returned by malloc is not NULL.
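To illustrate the "subdivides a single array" behaviour described in the quote, here is a rough sketch of an allocate-only scheme in that style. It is not the actual heap_1.c source: all names are made up, SKETCH_HEAP_SIZE merely plays the role of configTOTAL_HEAP_SIZE, and there is deliberately no free function. It also shows why the NULL check matters, because once the array is used up every further request fails.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HEAP_SIZE 1024u  /* stand-in for configTOTAL_HEAP_SIZE */
#define SKETCH_ALIGN     8u     /* assumed alignment requirement */

static _Alignas(8) uint8_t sketch_heap[SKETCH_HEAP_SIZE];
static size_t sketch_next = 0;  /* offset of the first unused byte */

/* Hands out the next chunk of the array; memory is never returned. */
void *sketch_malloc(size_t wanted)
{
    /* Round the request up to a multiple of the alignment. */
    size_t rounded = (wanted + (SKETCH_ALIGN - 1u)) & ~(size_t)(SKETCH_ALIGN - 1u);

    /* Fail on arithmetic overflow or when the array is exhausted. */
    if (rounded < wanted || rounded > SKETCH_HEAP_SIZE - sketch_next) {
        return NULL;
    }

    void *block = &sketch_heap[sketch_next];
    sketch_next += rounded;
    return block;
}

/* Rough counterpart of xPortGetFreeHeapSize(). */
size_t sketch_free_bytes(void)
{
    return SKETCH_HEAP_SIZE - sketch_next;
}
```

With such a scheme, allocating a 4096-byte sector buffer on every call and never being able to give it back exhausts the heap after only a few calls, which matches the symptom of the first few writes working and later ones failing.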

Creating File in arduino's Memory while arduino is operating

In my Arduino project I have to store some integers (25, to be specific) in a file in the Arduino's memory (I have an Arduino UNO, which has no built-in port for an SD card) and read that file the next time I start the Arduino.
Also, my Arduino is not connected to a PC or laptop, so I can't use the file system of a PC or laptop.
Is there any way of doing this?
Arduino Uno has 1KB of non-volatile EEPROM memory, which you can use for this purpose. An integer is 2 bytes, so you should be able to store over 500 ints this way.
This example sketch should write the integers from 10 down to 5 into EEPROM memory:
#include <EEPROM.h>
void setup() {
  int address = 0; // Location we want the data to be put.
  for (int value = 10; value >= 5; --value) {
    // Write the int at address
    EEPROM.put(address, value);
    // Move the address, so the next value will be written after the first.
    address += sizeof(int);
  }
}
void loop() {
}
This example is a stripped down version of the one in the EEPROM.put documentation. Other examples can be found in the documentation of the various EEPROM functions.
Another nice tutorial on the subject can be found on tronixstuff.
By the way, if you need more memory, you could also use external EEPROM chips. These are small ICs that are available at very low prices in small capacities, typically from 1KB to 256KB. Not much in terms of modern computing, but a huge expansion compared to the 1KB you have by default.

Why doesn't this code scale linearly?

I wrote this SOR solver code. Don't bother too much what this algorithm does, it is not the concern here. But just for the sake of completeness: it may solve a linear system of equations, depending on how well conditioned the system is.
I run it with an ill-conditioned sparse matrix of 2097152 rows (that never converges), with at most 7 non-zero columns per row.
Translating: the outer do-while loop will perform 10000 iterations (the value I pass as max_iters), the middle for will perform 2097152 iterations, split in chunks of work_line, divided among the OpenMP threads. The innermost for loop will have 7 iterations, except in very few cases (less than 1%) where it can be less.
There is a data dependency among the threads in the values of the sol array. Each iteration of the middle for updates one element but reads up to 6 other elements of the array. Since SOR is not an exact algorithm, when reading, it can have any of the previous or the current value at that position (if you are familiar with solvers, this is a Gauss-Seidel that tolerates Jacobi behavior in some places for the sake of parallelism).
typedef struct{
    size_t size;
    unsigned int *col_buffer;
    unsigned int *row_jumper;
    real *elements;
} Mat;
int work_line;
// Assumes there are no null elements on main diagonal
unsigned int solve(const Mat* matrix, const real *rhs, real *sol, real sor_omega, unsigned int max_iters, real tolerance)
{
    real *coefs = matrix->elements;
    unsigned int *cols = matrix->col_buffer;
    unsigned int *rows = matrix->row_jumper;
    int size = matrix->size;
    real compl_omega = 1.0 - sor_omega;
    unsigned int count = 0;
    bool done;
    do {
        done = true;
        #pragma omp parallel shared(done)
        {
            bool tdone = true;
            #pragma omp for nowait schedule(dynamic, work_line)
            for(int i = 0; i < size; ++i) {
                real new_val = rhs[i];
                real diagonal;
                real residual;
                unsigned int end = rows[i+1];
                for(int j = rows[i]; j < end; ++j) {
                    unsigned int col = cols[j];
                    if(col != i) {
                        real tmp;
                        #pragma omp atomic read
                        tmp = sol[col];
                        new_val -= coefs[j] * tmp;
                    } else {
                        diagonal = coefs[j];
                    }
                }
                residual = fabs(new_val - diagonal * sol[i]);
                if(residual > tolerance) {
                    tdone = false;
                }
                new_val = sor_omega * new_val / diagonal + compl_omega * sol[i];
                #pragma omp atomic write
                sol[i] = new_val;
            }
            #pragma omp atomic update
            done &= tdone;
        }
    } while(++count < max_iters && !done);
    return count;
}
As you can see, there is no lock inside the parallel region, so, from what we are always taught, it is the kind of 100% parallel problem. That is not what I see in practice.
All my tests were run on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2 processors, 10 cores each, hyper-threading enabled, summing up to 40 logical cores.
In my first set of runs, work_line was fixed at 2048, and the number of threads varied from 1 to 40 (40 runs in total). This is the graph with the execution time of each run (seconds x number of threads):
The surprise was the logarithmic curve, so I thought that since the work line was so large, the shared caches were not very well used, so I dug up this virtual file /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size that told me this processor's L1 cache synchronizes updates in groups of 64 bytes (8 doubles in the array sol). So I set the work_line to 8:
Then I thought 8 was too low to avoid NUMA stalls and set work_line to 16:
While running the above, I thought "Who am I to predict what work_line is good? Lets just see...", and scheduled to run every work_line from 8 to 2048, steps of 8 (i.e. every multiple of the cache line, from 1 to 256). The results for 20 and 40 threads (seconds x size of the split of the middle for loop, divided among the threads):
I believe the cases with low work_line suffer badly from cache synchronization, while bigger work_line values offer no benefit beyond a certain number of threads (I assume because the memory pathway is the bottleneck). It is very sad that a problem that seems 100% parallel presents such bad behavior on a real machine. So, before I am convinced multi-core systems are a very well sold lie, I am asking you here first:
How can I make this code scale linearly to the number of cores? What am I missing? Is there something in the problem that makes it not as good as it seems at first?
Following suggestions, I tested both static and dynamic scheduling, with the atomic reads/writes on the array sol removed. For reference, the blue and orange lines are the same as in the previous graph (just up to work_line = 248). The yellow and green lines are the new ones. From what I could see: static makes a significant difference for low work_line, but after 96 the benefits of dynamic outweigh its overhead, making it faster. The atomic operations make no difference at all.
The sparse matrix vector multiplication is memory bound, and this can be shown with a simple roofline model. Memory bound problems benefit from the higher memory bandwidth of multisocket NUMA systems, but only if the data initialisation is done in such a way that the data is distributed among the two NUMA domains. I have some reasons to believe that you are loading the matrix in serial and therefore all its memory is allocated on a single NUMA node. In that case you won't benefit from the double memory bandwidth available on a dual-socket system and it really doesn't matter if you use schedule(dynamic) or schedule(static). What you could do is enable the memory interleaving NUMA policy in order to have the memory allocation spread among both NUMA nodes. Thus each thread would end up with 50% local memory access and 50% remote memory access instead of having all threads on the second CPU being hit by 100% remote memory access. The easiest way to enable the policy is by using numactl:
$ OMP_NUM_THREADS=... OMP_PROC_BIND=1 numactl --interleave=all ./program ...
OMP_PROC_BIND=1 enables thread pinning and should improve the performance a bit.
I would also like to point out that this:
done = true;
#pragma omp parallel shared(done)
{
    bool tdone = true;
    // ...
    #pragma omp atomic update
    done &= tdone;
}
is probably a not very efficient re-implementation of:
done = true;
#pragma omp parallel reduction(&:done)
{
    // ...
    if(residual > tolerance) {
        done = false;
    }
    // ...
}
There won't be a notable performance difference between the two implementations because of the amount of work done in the inner loop, but it is still not a good idea to reimplement existing OpenMP primitives, for the sake of portability and readability.
Try running the IPCM (Intel Performance Counter Monitor). You can watch memory bandwidth, and see if it maxes out with more cores. My gut feeling is that you are memory bandwidth limited.
As a quick back-of-the-envelope calculation, I find that uncached read bandwidth is about 10 GB/s on a Xeon. If your clock is 2.5 GHz, that's one 32-bit word per clock cycle. Your inner loop is basically just a multiply-add operation whose cycles you can count on one hand, plus a few cycles for the loop overhead. It doesn't surprise me that after 10 threads, you don't get any performance gain.
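The arithmetic behind this estimate can be written down as a small sketch. The function names are invented for this example, and the figure of roughly 10 cycles per inner-loop iteration is only an assumption chosen to match the "count on one hand, plus loop overhead" remark; measured numbers should replace it.

```c
/* Bytes per clock cycle that the memory system can supply. */
double bytes_per_cycle(double bandwidth_bytes_per_s, double clock_hz)
{
    return bandwidth_bytes_per_s / clock_hz;
}

/* Number of threads at which the memory system is saturated, if each
   thread consumes bytes_per_iter bytes every cycles_per_iter cycles. */
double threads_to_saturate(double bandwidth_bytes_per_s, double clock_hz,
                           double bytes_per_iter, double cycles_per_iter)
{
    double supply = bytes_per_cycle(bandwidth_bytes_per_s, clock_hz);
    double demand_per_thread = bytes_per_iter / cycles_per_iter;
    return supply / demand_per_thread;
}
```

With 10 GB/s and 2.5 GHz the supply is 4 bytes per cycle. If one thread consumes one 4-byte word every 10 cycles, about 10 threads are enough to use all of it, and further threads only wait on memory.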
Your inner loop has an omp atomic read, and your middle loop has an omp atomic write to a location that could be the same one read by one of the reads. OpenMP is obligated to ensure that atomic writes and reads of the same location are serialized, so in fact it probably does need to introduce a lock, even though there isn't any explicit one.
It might even need to lock the whole sol array unless it can somehow figure out which reads might conflict with which writes, and really, OpenMP processors aren't necessarily all that smart.
No code scales absolutely linearly, but rest assured that there are many codes that do scale much closer to linearly than yours does.
I suspect you are having caching issues. When one thread updates a value in the sol array, it invalidates the caches on other CPUs that are storing that same cache line. This forces the caches to be updated, which then leads to the CPUs stalling.
Even if you don't have an explicit mutex lock in your code, you have one shared resource between your processes: the memory and its bus. You don't see this in your code because it is the hardware that takes care of handling all the different requests from the CPUs, but nevertheless, it is a shared resource.
So, whenever one of your processes writes to memory, that memory location will have to be reloaded from main memory by all other processes that use it, and they all have to use the same memory bus to do so. The memory bus saturates, and you have no more performance gain from additional CPU cores that only serve to worsen the situation.
