Freertos + STM32 - thread memory overflow with malloc

Freertos + STM32 - thread memory overflow with malloc - c

I'm working with stm32+rtos to implement a file system based on spi flash. For freertos, I adopted heap_1 implementation. This is how i create my task.
osThreadDef(Task_Embedded, Task_VATEmbedded, osPriorityNormal, 0, 2500);
VATEmbeddedTaskHandle = osThreadCreate(osThread(Task_Embedded), NULL);
I allocated 10000 bytes of memory to this thread.
and in this thread. I tried to write data into flash. In the first few called it worked successfully. but somehow it crash when i tried more time of write.
VATAPI_RESULT STM32SPIWriteSector(void *writebuf, uint8_t* SectorAddr, uint32_t buff_size){
if(STM32SPIEraseSector(SectorAddr) == VAT_SUCCESS){
DBGSTR("ERASE SECTOR - 0x%2x %2x %2x", SectorAddr[0], SectorAddr[1], SectorAddr[2]);
}else return VAT_UNKNOWN;
if(STM32SPIProgram_multiPage(writebuf, SectorAddr, buff_size) == VAT_SUCCESS){
DBGSTR("WRTIE SECTOR SUCCESSFUL");
return VAT_SUCCESS;
}else return VAT_UNKNOWN;
return VAT_UNKNOWN;
}
.
VATAPI_RESULT STM32SPIProgram_multiPage(uint8_t *writebuf, uint8_t *writeAddr, uint32_t buff_size){
VATAPI_RESULT nres;
uint8_t tmpaddr[3] = {writeAddr[0], writeAddr[1], writeAddr[2]};
uint8_t* sectorBuf = malloc(4096 * sizeof(uint8_t));
uint8_t* pagebuf = malloc(255* sizeof(uint8_t));
memset(&sectorBuf[0],0,4096);
memset(&pagebuf[0],0,255);
uint32_t i = 0, tmp_convert1, times = 0;
if(buff_size < Page_bufferSize)
times = 1;
else{
times = buff_size / (Page_bufferSize-1);
if((times%(Page_bufferSize-1))!=0)
times++;
}
/* Note : According to winbond flash feature, the last bytes of every 256 bytes should be 0, so we need to plus one byte on every 256 bytes*/
i = 0;
while(i < times){
memset(&pagebuf[0], 0, Page_bufferSize - 1);
memcpy(&pagebuf[0], &writebuf[i*255], Page_bufferSize - 1);
memcpy(&sectorBuf[i*Page_bufferSize], &pagebuf[0], Page_bufferSize - 1);
sectorBuf[((i+1)*Page_bufferSize)-1] = 0;
i++;
}
i = 0;
while(i < times){
if((nres=STM32SPIPageProgram(&sectorBuf[Page_bufferSize*i], &tmpaddr[0], Page_bufferSize)) != VAT_SUCCESS){
DBGSTR("STM32SPIProgram_allData write data fail on %d times!",i);
free(sectorBuf);
free(pagebuf);
return nres;
}
tmp_convert1 = (tmpaddr[0]<<16 | tmpaddr[1]<<8 | tmpaddr[2]) + Page_bufferSize;
tmpaddr[0] = (tmp_convert1&0xFF0000) >> 16;
tmpaddr[1] = (tmp_convert1&0xFF00) >>8;
tmpaddr[2] = 0x00;
i++;
}
free(sectorBuf);
free(pagebuf);
return nres;
}
I open the debugger and it seems like it crash when i malloced "sectorbuf" in function "STM32SPIProgram_multiPage", what Im confused is that i did free the memory after "malloc". anyone has idea about it?
arm-none-eabi-size "RTOS.elf"
text data bss dec hex filename
77564 988 100756 179308 2bc6c RTOS.elf

Reading the man
Memory Management
[...]
If RTOS objects are created dynamically then the standard C library malloc() and free() functions can sometimes be used for the purpose, but ...
they are not always available on embedded systems,
they take up valuable code space,
they are not thread safe, and
they are not deterministic (the amount of time taken to execute the function will differ from call to call)
... so more often than not an alternative memory allocation implementation is required.
One embedded / real time system can have very different RAM and timing requirements to another - so a single RAM allocation algorithm will only ever be appropriate for a subset of applications.
To get around this problem, FreeRTOS keeps the memory allocation API in its portable layer. The portable layer is outside of the source files that implement the core RTOS functionality, allowing an application specific implementation appropriate for the real time system being developed to be provided. When the RTOS kernel requires RAM, instead of calling malloc(), it instead calls pvPortMalloc(). When RAM is being freed, instead of calling free(), the RTOS kernel calls vPortFree().
[...]
(Emphasis mine.)
So the meaning is that if you use directly malloc, FreeRTOS is not able to handle the heap consumed by the system function. Same if you choose heap_3 management that is a simple malloc wrapper.
Take also note that the memory management you choose has no free capability.
heap_1.c
This is the simplest implementation of all. It does not permit memory to be freed once it has been allocated. Despite this, heap_1.c is appropriate for a large number of embedded applications. This is because many small and deeply embedded applications create all the tasks, queues, semaphores, etc. required when the system boots, and then use all of these objects for the lifetime of program (until the application is switched off again, or is rebooted). Nothing ever gets deleted.
The implementation simply subdivides a single array into smaller blocks as RAM is requested. The total size of the array (the total size of the heap) is set by configTOTAL_HEAP_SIZE - which is defined in FreeRTOSConfig.h. The configAPPLICATION_ALLOCATED_HEAP FreeRTOSConfig.h configuration constant is provided to allow the heap to be placed at a specific address in memory.
The xPortGetFreeHeapSize() API function returns the total amount of heap space that remains unallocated, allowing the configTOTAL_HEAP_SIZE setting to be optimised.
The heap_1 implementation:
Can be used if your application never deletes a task, queue, semaphore, mutex, etc. (which actually covers the majority of applications in which FreeRTOS gets used).
Is always deterministic (always takes the same amount of time to execute) and cannot result in memory fragmentation.
Is very simple and allocated memory from a statically allocated array, meaning it is often suitable for use in applications that do not permit true dynamic memory allocation.
(Emphasis mine.)
Side note: You have always to check malloc return value != NULL.

Related

concurrent I/O - buffers corruption, block device driver

I developing block layered device driver. So, I intercept WRITE request and encrypt data, and decrypt data in the end_bio() routine (during processing and READ request).
So all works fine in single stream. But I getting buffers content corruption if have tried to performs I/O from two and more processes simultaneously. I have not any local storage for buffers.
Do I'm need to count a BIO merging in my driver?
Is the Linux I/O subsystem have some requirements related to the a number of concurrent I/O request?
Is there some tips and tricks related stack using or compilation?
This is under kernel 4.15.
At the time I use next constriction to run over disk sectors:
/*
* A portion of the bio_copy_data() ...
*/
for (vcnt = 0, src_iter = src->bi_iter; ; vcnt++)
{
if ( !src_iter.bi_size)
{
if ( !(src = src->bi_next) )
break;
src_iter = src->bi_iter;
}
src_bv = bio_iter_iovec(src, src_iter);
src_p = bv_page = kmap_atomic(src_bv.bv_page);
src_p += src_bv.bv_offset;
nlbn = src_bv.bv_len512;
for ( ; nlbn--; lbn++ , src_p += 512 )
{
{
/* Simulate a processing of data in the I/O buffer */
char *srcp = src_p, *dstp = src_p;
int count = DUDRV$K_SECTORSZ;
while ( count--)
{
*(dstp++) = ~ (*(srcp++));
}
}
}
kunmap_atomic(bv_page);
**bio_advance_iter**(src, &src_iter, src_bv.bv_len);
}
Is this correct ? Or I'm need to use something like **bio_for_each_segment(bvl, bio, iter) ** ?

The root of the problem is a "feature" of the Block I/O methods. In particularly (see description at Linex site reference )
** Biovecs can be shared between multiple bios - a bvec iter can represent an
arbitrary range of an existing biovec, both starting and ending midway
through biovecs. This is what enables efficient splitting of arbitrary
bios. Note that this means we only use bi_size to determine when we've
reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes
bi_size into account when constructing biovecs.*
So, in my case this is a cause of the buffer with disk sector overrun.
The trick is set REQ_NOMERGE_FLAGS in the .bi_opf before sending BIO to backed device driver.
The second reason is the non-actual .bi_iter is returned by backed device driver. So, we need to save it (before submiting BIO request to backend) and restore it in the our "bio_endio()" routine.

Have you considered the use of vmap with global synchronization, instead?
The use of kmap_atomic has some restrictions:
Since the mapping is restricted to the CPU that issued it, it
performs well, but the issuing task is therefore required to stay on that
CPU until it has finished, lest some other task displace its mappings.
kmap_atomic() may also be used by interrupt contexts, since it is does not
sleep and the caller may not sleep until after kunmap_atomic() is called.
Reference: https://www.kernel.org/doc/Documentation/vm/highmem.txt

Read Kernel Memory from user mode WITHOUT driver

I'm writing an program which enumerates hooks created by SetWindowsHookEx() Here is the process:
Use GetProcAddress() to obtain gSharedInfo exported in User32.dll(works, verified)
Read User-Mode memory at gSharedInfo + 8, the result should be a pointer of first handle entry. (works, verified)
Read User-Mode memory at [gSharedInfo] + 8, the result should be countof handles to enumerate. (works, verified)
Read data from address obtained in step 2, repeat count times
Check if HANDLEENTRY.bType is 5(which means it's a HHOOK). If so, print informations.
The problem is, although step 1-3 only mess around with user mode memory, step 4 requires the program to read kernel memory. After some research I found that ZwSystemDebugControl can be used to access Kernel Memory from user mode. So I wrote the following function:
BOOL GetKernelMemory(PVOID pKernelAddr, PBYTE pBuffer, ULONG uLength)
{
MEMORY_CHUNKS mc;
ULONG uReaded = 0;
mc.Address = (UINT)pKernelAddr; //Kernel Memory Address - input
mc.pData = (UINT)pBuffer;//User Mode Memory Address - output
mc.Length = (UINT)uLength; //length
ULONG st = -1;
ZWSYSTEMDEBUGCONTROL ZwSystemDebugControl = (ZWSYSTEMDEBUGCONTROL)GetProcAddress(
GetModuleHandleA("ntdll.dll"), "NtSystemDebugControl");
st = ZwSystemDebugControl(SysDbgCopyMemoryChunks_0, &mc, sizeof(MEMORY_CHUNKS), 0, 0, &uReaded);
return st == 0;
}
But the function above didn't work. uReaded is always 0 and st is always 0xC0000002. How do I resolve this error?
my full program:
http://pastebin.com/xzYfGdC5

MSFT did not implement NtSystemDebugControl syscall after windows XP.

The Meltdown vulnerability makes it possible to read Kernel memory from User Mode on most Intel CPUs with a speed of approximately 500kB/s. This works on most unpatched OS'es.

HeapFree Breakpoint on Free()

I have a very large (~1E9) array of objects that I malloc, realloc, and free on iterations of a single thread program.
Specifically,
//(Individual *ind)
//malloc, old implementation
ind->obj = (double *)malloc(sizeof(double)*acb->in.nobj);
//new implementation
ind->obj = (double *)a_allocate(acb->in.nobj*sizeof(double));
void *a_allocate (int siz){
void *buf;
buf = calloc(1,siz);
acb->totmemMalloc+=siz;
if (buf==NULL){
a_throw2("a_allocate...failed to allocate buf...<%d>",siz);
}
return buf;
}
...
//realloc
ind->obj = (double *)a_realloc(ind->obj, acb->in.nobj*sizeof(double));
void *a_realloc (void *bufIn,int siz)
{
void *buf = bufIn;
if (buf==NULL){
a_throw2("a_realloc called with null bufIn...");
}
buf = realloc(buf,siz);
return buf;
}
...
//deallocate
free(ind->obj);
The other three dozen properties are processed similarly.
However, every few test runs, the code fails a heap validation on the deallocation of only this object property (the free() statement). At the time of failure, the ind->obj property is not null and has some valid value.
Is there any obvious problem with what I'm doing?
I'm very new to C and am not entirely sure I'm perform the memory operations correctly.
Thanks!
EDIT: using _CRTLDBG_REPORT_FLAG
HEAP[DEMO.exe]: Heap block at 010172B0 modified at 010172E4 past requested size of 2c

Heap validation is a delayed metric. The Visual Studio debug heap can be used (debug build) with more frequent checks Microsoft : Debug Heap flags.
Alternative using application verifier and turning on heap checking, will help find the point which is causing this.
+-----+----------+-----+ +----+-------------+-----+
| chk | memory |chk2 | | chk| different m | chk2|
+-----+----------+-----+ +----+-------------+-----+
When the system allocates memory, it puts meta- information about the memory before the returned pointer (or maybe after). When these memory pieces get overwritten, then that causes the heap failure.
This may be the memory you are freeing, or the memory which was directly before hand.
Edit - to address comments
A message such as "HEAP[DEMO.exe]: Heap block at 010172B0 modified at 010172E4 past requested size of 2c"
Implies that the memory at 01017280 wrote beyond the end of the allocated memory.
This could be because the amount malloced/realloced was too small, or an error in your loops.
+---+-----------------+----+--------------------------+
|chk|d0|d1|d2|d3|d4|d5|chk2| memory |
+---+-----------------+----+--------------------------+
So if you tried to write into d6 above, that would cause 'chk2' to be overwritten, which is being detected. In this case the difference is small - requested size is 0x2c and the difference = E4 - B0 = 0x34
Turning on these debug checks should change your code to be more crashing and predictable. If there is no randomness in your data, then turn off ASLR (only for debugging) and the addresses being used will be predictable, you can put a breakpoint in the malloc/realloc for a given memory address.

C - Store global variables in flash?

As the title may suggest, I'm currently short on SRAM in my program and I can't find a way to reduce my global variables. Is it possible to bring global variables over to flash memory? Since these variables are frequently read and written, would it be bad for the nand flash because they have limited number of read/write cycle?
If the flash cannot handle this, would EEPROM be a good alternative?
EDIT:
Sorry for the ambiguity guys. I'm working with Atmel AVR ATmega32HVB which has:
2K bytes of SRAM,
1K bytes of EEPROM
32K bytes of FLASH
Compiler: AVR C/C++
Platform: IAR Embedded AVR
The global variables that I want to get rid of are:
uint32_t capacityInCCAccumulated[TOTAL_CELL];
and
int32_t AccumulatedCCADCvalue[TOTAL_CELL];
Code snippets:
int32_t AccumulatedCCADCvalue[TOTAL_CELL];
void CCGASG_AccumulateCCADCMeasurements(int32_t ccadcMeasurement, uint16_t slowRCperiod)
{
uint8_t cellIndex;
// Sampling period dependant on configuration of CCADC sampling..
int32_t temp = ccadcMeasurement * (int32_t)slowRCperiod;
bool polChange = false;
if(temp < 0) {
temp = -temp;
polChange = true;
}
// Add 0.5*divisor to get proper rounding
temp += (1<<(CCGASG_ACC_SCALING-1));
temp >>= CCGASG_ACC_SCALING;
if(polChange) {
temp = -temp;
}
for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++)
{
AccumulatedCCADCvalue[cellIndex] += temp;
}
// If it was a charge, update the charge cycle counter
if(ccadcMeasurement <= 0) {
// If it was a discharge, AccumulatedCADCvalue can be negative, and that
// is "impossible", so set it to zero
for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++)
{
if(AccumulatedCCADCvalue[cellIndex] < 0)
{
AccumulatedCCADCvalue[cellIndex] = 0;
}
}
}
}
And this
uint32_t capacityInCCAccumulated[TOTAL_CELL];
void BATTPARAM_InitSramParameters() {
uint8_t cellIndex;
// Active current threshold in ticks
battParams_sram.activeCurrentThresholdInTicks = (uint16_t) BATTCUR_mA2Ticks(battParams.activeCurrentThreshold);
for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++)
{
// Full charge capacity in CC accumulated
battParams_sram.capacityInCCAccumulated[cellIndex] = (uint32_t) CCGASG_mAh2Acc(battParams.fullChargeCapacity);
}
// Terminate discharge limit in CC accumulated
battParams_sram.terminateDischargeLimit = CCGASG_mAh2Acc(battParams.terminateDischargeLimit);
// Values for remaining capacity calibration
GASG_CalculateRemainingCapacityValues();
}

would it be bad for the nand flash because they have limited number of
read/write cycle?
Yes it's not a good idea to use flash for frequent modification of data.
Read only from flash does not reduce the life time of flash. Erasing and writing will reduce the flash lifetime.
Reading and writing from flash is substantially slower compared to conventional memory.
To write a byte whole block has to be erased and re written in flash.

Any kind of Flash is a bad idea to be used for frequently changing values:
limited number of erase/write cycles, see datasheet.
very slow erase/write (erase can be ~1s), see datasheet.
You need a special sequence to erase then write (no language support).
While erasing or writing accesses to Flash are blocked at best, some require not to access the Flash at all (undefined behaviour).
Flash cells cannot freely be written per-byte/word. Most have to be written per page (e.g. 64 bytes) and erased most times in much larger units (segments/blocks/sectors).
For NAND Flash, endurance is even more reduced compared to NOR Flash and the cells are less reliable (bits might flip occasionally or are defective), so you have to add error detection and correction. This is very likely a direction you should not go.
True EEPROM shares most issues, but they might be written byte/word-wise (internal erase).
Note that modern MCU-integrated "EEPROM" is most times also Flash. Some implementations just use slightly more reliable cells (about one decade more erase/write cycles than the program flash) and additional hardware allowing arbitrary byte/word write (automatic erase). But that is still not sufficient for frequent changes.
However, you first should verify if your application can tolerate the lengthly write/erase times. Can you accept a process blocking that long, or rewrite your program acordingly? If the answer is "no", you should even stop further investigation into that direction. Otherwise you should calculate the number of updates over the expected lifetime and compare to the information in the datasheet. There are also methods to reduce the number of erase cycles, but the leads too far.
If an external device (I2C/SPI) is an option, you could use a serial SRAM. Although the better (and likely cheaper) approach would be a larger MCU or think about a more efficient (i.e. less RAM, more code) way to store the data in SRAM.

multithreaded environment in c

I'm just trying to get my head around multithreading environments, specifically how you would implement a cooperative one in c (on an AVR, but out of interest I would like to keep this general).
My problem comes with the thread switch itself: I'm pretty sure I could write this in assembler, flushing all the registers to a stack and then saving the PC to return to later.
How would one pull something like this off in c? I have been told it can do "everything".
I realize this is quite a general question, so any links with information on this topic would be greatly appreciated.
Thanks

You can do this with setjmp/longjmp on most systems -- here is some code I've use in the past for task switching:
void task_switch(Task *to, int exit)
{
int tmp;
int task_errno; /* save space for errno */
task_errno = errno;
if (!(tmp = setjmp(current_task->env))) {
tmp = exit ? (int)current_task : 1;
current_task = to;
longjmp(to->env, tmp); }
if (exit) {
/* if we get here, the stack pointer is pointing into an already
** freed block ! */
abort(); }
if (tmp != 1)
free((void *)tmp);
errno = task_errno;
}
This depends on sizeof(int) == sizeof(void *) in order to pass a pointer as the argument to setjmp/longjmp, but that could be avoided by using handles (indexes into a global array of all task structures) instead of raw pointers here, or by using a static pointer.
Of course, the tricky part is setting up jmpbuf objects for newly created tasks, each with their own stack. You can use a signal handler with sigaltstack for that:
static void (*tfn)(void *);
static void *tfn_arg;
static stack_t old_ss;
static int old_sm;
static struct sigaction old_sa;
Task *current_task = 0;
static Task *parent_task;
static int task_count;
static void newtask()
{
int sm;
void (*fn)(void *);
void *fn_arg;
task_count++;
sigaltstack(&old_ss, 0);
sigaction(SIGUSR1, &old_sa, 0);
sm = old_sm;
fn = tfn;
fn_arg = tfn_arg;
task_switch(parent_task);
sigsetmask(sm);
(*fn)(fn_arg);
abort();
}
Task *task_start(int ssize, void (*_tfn)(void *), void *_arg)
{
Task *volatile new;
stack_t t_ss;
struct sigaction t_sa;
old_sm = sigsetmask(~sigmask(SIGUSR1));
if (!current_task) task_init();
tfn = _tfn;
tfn_arg = _arg;
new = malloc(sizeof(Task) + ssize + ALIGN);
new->next = 0;
new->task_data = 0;
t_ss.ss_sp = (void *)(new + 1);
t_ss.ss_size = ssize;
t_ss.ss_flags = 0;
if ((unsigned long)t_ss.ss_sp & (ALIGN-1))
t_ss.ss_sp = (void *)(((unsigned long)t_ss.ss_sp+ALIGN) & ~(ALIGN-1));
t_sa.sa_handler = newtask;
t_sa.sa_mask = ~sigmask(SIGUSR1);
t_sa.sa_flags = SA_ONSTACK|SA_RESETHAND;
sigaltstack(&t_ss, &old_ss);
sigaction(SIGUSR1, &t_sa, &old_sa);
parent_task = current_task;
if (!setjmp(current_task->env)) {
current_task = new;
kill(getpid(), SIGUSR1); }
sigaltstack(&old_ss, 0);
sigaction(SIGUSR1, &old_sa, 0);
sigsetmask(old_sm);
return new;
}

If you wanted to keep it pure C, I think you might be able to use setjmp and longjmp, but I've never tried it myself, and I imagine there's probably some platforms on which this wouldn't work (i.e. certain registers/other settings not being saved). The only other alternative would be to write it in assembly.

As mentioned, setjmp/longjmp are standard C and are available even in the libc of 8-bit AVRs. They do exactly what you said you'd do in assembler: save the processor context. But one has to keep in mind that the intended purpose of those functions is just to jump backwards in the flow of control; switching between tasks is an abuse. It does work anyway, and looks like this is even frequently used in a variety of user-level thread libraries -- like GNU Pth. But still, is an abuse of the intended purpose, and requires being careful.
As Chris Dodd said, you still need to provide an stack for each new task. He used sigaltstack() and other signal-related functions, but those do not exist in standard C, only in unix-like environments. For example, the AVR libc does not provide them. So as an alternative you can try reserving a part of your existing stack (by declaring a big local array, or using alloca()) for use as the stack of the new thread. Just keep in mind that the main/scheduler thread will keep using its stack, each thread uses its own stack, and all of them will grow and shrink as stacks usually do, so they will need space for doing so without interfering with each other.
And since we're already mentioning unix-like, non-standard-C mechanisms, there is also makecontext()/swapcontext() and family, which are more powerful but harder to find than setjmp()/longjmp(). The names say it all really: the context functions let you manage full process contexts (stacks included), the jmp functions let you just jump around - you'll have to hack the rest.
For the AVR anyway, given that you won't probably have an OS to help nor much memory to blindly reserve, you'd be probably better off using assembler for the switching and stack initializing.

In my experience if people start writing schedulers it isn't too long before they start wanting things like network stacks, memory allocation and file systems too. It's almost never worth going down that route; you end up spending more time writing your own operating system than you're spending on your actual application.
First whiff of your project heading that way and it's almost always worth putting the effort to put in an existing OS (linux, VxWorks, etc). Of course, that might mean that you run into problems if the CPU isn't up to it. And AVR isn't exactly a whole lot of CPU, and fitting an existing OS on to it ranges from mostly impossible to tricky for the major OSes, though there are some tiny OSes (some open source, see http://en.wikipedia.org/wiki/List_of_real-time_operating_systems).
So at the commencement of a project you should carefully consider how you might wish to evolve it going into the future. This might influence your choice of CPU now to save having to do hideous things in software later.