How do I use memcpy_toio/fromio? - c

I am working on a kernel module in C to talk to a PCIe card and I have allocated some io memory using pci_iomap, and I write/read there using ioread/write32.
This works but the performance is quite poor, and I read I could use block transfer through memcpy_toio/fromio instead of just doing 32b at a time.
To write, I am using iowrite32(buffer[i], privdata->registers + i);
To read, I do buffer[i] = ioread32(&privdata->registers[i]);
I tried to replace the for loops these are in with:
memcpy_toio(privdata->registers, buffer, 2048);
memcpy_fromio(buffer, privdata->registers, 2048);
If I only replace the write loop with memcpy_toio and I do the reading using ioread32, the program doesn't crash but the instruction doesn't seem to be doing anything (registers don't change);
Also, when I replace the read loop as well with the memcpy_fromio instruction, it crashes.
I was thinking it might be because the reads try to access the mem location while it is still being written to. Is there a way to flush the writes queue after either iowrite32 or memcpy_toio?
What am I doing wrong here?

memcpy_from/toio() can be used only if the I/O memory behaves like memory, i.e., if values can be read speculatively, and be written multiple times or out of order.
An I/O range marked as non-prefetchable does not support this.
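To make the distinction concrete, here is a minimal host-side sketch of what the per-word loop gives you and a plain memcpy() does not promise: one 32-bit access per word, in ascending order, each done exactly once. The helper names and the plain array standing in for the mapped BAR are assumptions for illustration only; in a driver the loop body would be iowrite32()/ioread32() on the __iomem pointer. Also keep in mind that the count passed to memcpy_toio()/memcpy_fromio() is in bytes, not in 32-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy 'words' 32-bit values into a register window.
 * 'regs' is volatile, so the compiler must perform every store, in program
 * order. In kernel code the loop body would be iowrite32(src[i], regs + i).
 */
static void regs_write32_ordered(volatile uint32_t *regs,
                                 const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++)
        regs[i] = src[i];
}

/* Matching read side: one 32-bit load per element, in ascending order. */
static void regs_read32_ordered(uint32_t *dst,
                                const volatile uint32_t *regs, size_t words)
{
    for (size_t i = 0; i < words; i++)
        dst[i] = regs[i];
}
```

If the BAR turns out to be prefetchable, memory-like storage, the block copy is worth trying; for a register file with side effects, a loop of this shape is the safer choice.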

I'm not sure this suggestion applies to your case, but a single ioread32() call is a single access to the PCI device, whereas memcpy may split the transfer into several accesses of whatever width it chooses. The kernel also provides helpers such as ioread32_rep() to replace a hand-written for loop; note that the _rep variants read repeatedly from the same address (as for a FIFO register) rather than from consecutive addresses. If efficiency matters, you can try ioread32_rep() where that access pattern fits, and memcpy_fromio()/memcpy_toio() for variable-length transfers to memory-like regions.
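For what it's worth, here is a small host-side model of the _rep access pattern: 'count' reads from one address, each read returning the next word, which is what a FIFO data register does. The mock_fifo type and both functions are hypothetical stand-ins, not kernel API; only the shape (same address, repeated reads) mirrors ioread32_rep().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device FIFO data register: every read of the
 * same "address" pops the next queued word. */
struct mock_fifo {
    const uint32_t *words;
    size_t count;
    size_t next;
};

/* One 32-bit read from the FIFO register; returns 0 once it is drained. */
static uint32_t mock_fifo_read32(struct mock_fifo *f)
{
    return f->next < f->count ? f->words[f->next++] : 0;
}

/* Same access pattern as ioread32_rep(): 'count' reads from ONE address,
 * results stored to consecutive buffer slots. */
static void mock_read32_rep(struct mock_fifo *f, uint32_t *buf, size_t count)
{
    while (count--)
        *buf++ = mock_fifo_read32(f);
}
```

So the _rep helpers are a good fit for draining a FIFO, but they are not a drop-in replacement for a loop that walks consecutive register offsets.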

Which buffer type do you use?
Look at the implementation of memcpy_fromio() and memcpy_toio():
static inline void
memcpy_fromio(void *dst, volatile void __iomem *src, int count)
{
    memcpy(dst, (void __force *) src, count);
}
static inline void
memcpy_toio(volatile void __iomem *dst, const void *src, int count)
{
    memcpy((void __force *) dst, src, count);
}
You can see it is a simple memcpy call.
And look at the iowrite32() and ioread32() implementations:
static inline void iowrite32(u32 val, void __iomem *p)
{
    if (__is_PCI_addr(p))
        val = _swapl(val);
    __builtin_write32(p, val);
    if (__is_PCI_MEM(p))
        __flush_PCI_writes();
}
static inline unsigned int ioread32(void __iomem *p)
{
    uint32_t ret = __builtin_read32(p);
    if (__is_PCI_addr(p))
        ret = _swapl(ret);
    return ret;
}
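As you can see, in this implementation memcpy_fromio() and memcpy_toio() do none of the byte swapping or write flushing that iowrite32() and ioread32() perform for PCI addresses, so they are not suitable for working with PCIe devices.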

Related

C: How to guard static variables in multithreaded environment?

Suppose having the following code elements working on a fifo buffer:
static uint_fast32_t buffer_start;
static uint_fast32_t buffer_end;
static mutex_t buffer_guard;
(...)
void buffer_write(uint8_t* data, uint_fast32_t len)
{
    uint_fast32_t pos;
    mutex_lock(buffer_guard);
    pos = buffer_end;
    buffer_end = buffer_end + len;
    (...) /* Wrap around buffer_end, fill in data */
    mutex_unlock(buffer_guard);
}
bool buffer_isempty(void)
{
    bool ret;
    mutex_lock(buffer_guard);
    ret = (buffer_start == buffer_end);
    mutex_unlock(buffer_guard);
    return ret;
}
This code might be running on an embedded system with an RTOS, with the buffer_write() and buffer_isempty() functions called from different threads. The compiler has no means to know that the mutex_lock() and mutex_unlock() functions provided by the RTOS delimit a critical section.
As the code stands, because buffer_end is a static variable (local to the compilation unit), the compiler might choose to reorder accesses to it around function calls (at least as far as I understand the C standard, this seems possible). So the code performing buffer_end = buffer_end + len could potentially end up before the call to mutex_lock().
Using volatile on these variables (like static volatile uint_fast32_t buffer_end;) seems to resolve this, as accesses to them would then be constrained by sequence points (which a mutex_lock() call is, being a function call).
Is my understanding right on these?
Is there a more appropriate means (than using volatile) of dealing with this type of problem?

How is data really exchanged between user and kernel space while copy_from_user() is being executed?

I'm writing my first trivial device driver and got a few questions:
I'm following this book, but it doesn't seem to go into the details of what happens while the copy_(to|from)_user() API (or any API that transfers data between user and kernel space) executes. I'm not looking for anything super detailed, just what one must know while working on the kernel.
What's the implementation of copy_from_user() really like? I came across the following snippets but it just goes down to the assembly level. I might be navigating incorrectly. I have seen some references for this function and looks like if it returns anything other than 0, something went wrong.
// https://elixir.bootlin.com/linux/latest/source/include/linux/uaccess.h#L189
static __always_inline unsigned long __must_check
copy_from_user(void *to, const void __user *from, unsigned long n)
{
if (likely(check_copy_size(to, n, false)))
n = _copy_from_user(to, from, n);
return n;
}
__copy_from_user(void *to, const void __user *from, unsigned long n)
{
might_fault();
if (should_fail_usercopy())
return n;
instrument_copy_from_user(to, from, n);
check_object_size(to, n, false);
return raw_copy_from_user(to, from, n);
}
// https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/uaccess.h#L385
#define raw_copy_from_user(to, from, n) \
({ \
unsigned long __acfu_ret; \
uaccess_enable_not_uao(); \
__acfu_ret = __arch_copy_from_user((to), \
__uaccess_mask_ptr(from), (n)); \
uaccess_disable_not_uao(); \
__acfu_ret; \
})
// https://elixir.bootlin.com/linux/latest/source/arch/nds32/lib/copy_from_user.S#L34
.text
ENTRY(__arch_copy_from_user)
add $r5, $r0, $r2
#include "copy_template.S"
move $r0, $r2
ret
.section .fixup,"ax"
.align 2
9001:
sub $r0, $r5, $r0
ret
.previous
ENDPROC(__arch_copy_from_user)
During a syscall the kernel still has the process memory space mapped, so on most modern architectures it can read and write it directly. The main work is validating the user-provided address and size. Also, the data may not be resident, so the normal page fault mechanism can be triggered. After that it's just a memcpy.
Most of the macro layers and calls are there to deal with arch-specific differences. For example, ARM has user-access override, uao in your example code, which involves privileged mode access to user memory.
EDIT:
During the syscall, the current process isn't changed so the kernel has both the kernel memory and the user process memory in the memory map.
Address validation is to limit the access to the allowed user-process memory. Otherwise, the user process could pass a kernel address to a write, for example, and copy kernel memory out to a user file.
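As a rough user-space model of the "validate, then memcpy" idea described above (not the kernel's actual code), the sketch below checks that the source range lies inside an allowed window before copying. The user_range struct and both function names are made up for illustration. The return convention mirrors copy_from_user(): the number of bytes that could not be copied, so 0 means success. The real function can also fault part-way through and report a partial count, which this model does not attempt.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bounds of the "user" address range in this model. */
struct user_range {
    uintptr_t start;
    uintptr_t end;   /* one past the last valid byte */
};

/* Model of the access check: [from, from + n) must lie inside the range.
 * Written so the comparison cannot wrap around. */
static int model_access_ok(const struct user_range *r,
                           const void *from, size_t n)
{
    uintptr_t addr = (uintptr_t)from;

    if (addr < r->start || addr > r->end)
        return 0;
    return n <= r->end - addr;
}

/* Returns the number of bytes NOT copied: 0 on success, n on rejection. */
static unsigned long model_copy_from_user(const struct user_range *r,
                                          void *to, const void *from,
                                          unsigned long n)
{
    if (!model_access_ok(r, from, n))
        return n;
    memcpy(to, from, n);
    return 0;
}
```

A pointer outside the window, or a length that runs past its end, is rejected before any byte is copied, which is the property that stops a process from handing the kernel a kernel address.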

String parsing fails when using a function doing substring in C

I am having an issue with parsing a string in C. It causes a HardFault eventually.
MCU: LPC1769,
OS: FreeRTOS 10,
Toolchain: IAR
To test, I keep sending the same data frame (you can see the sample in the message variable in the parseMessage function).
The first 5-6 times the parsing works as I expect, and then it suddenly ends up in a HardFault when I send exactly the same string to the function once more.
I tested the function in OnlineGDB and haven't observed any issue there.
I have a couple of slightly different versions of that function, although the result is the same:
char *substr3(char const *input, size_t start, size_t len) {
    char *ret = malloc(len+1);
    memcpy(ret, input+start, len);
    ret[len] = '\0';
    return ret;
}
I've extracted the relevant function for a better overview:
(ignore the stripEOL(message); call, it just strips end-of-line characters; you can see it in my OnlineGDB share)
void parseMessage(char * message){
//char* message= "7E00002A347C31323030302D3132353330387C33302E30372E323032307C31317C33307C33317C31352D31367C31357C317C57656E67657274880D";
// Parsing the frame
char* start;
char* len;
char* cmd;
char* data;
char* chksum;
char* end;
stripEOL(message);
unsigned int messagelen = strlen(message);
start = substr3(message, 0, 2);
len = substr3(message, 2, 4);
cmd = substr3(message, 6, 2);
data = substr3(message, 8, messagelen-8-4);
chksum = substr3(message, messagelen-4, 2);
end = substr3(message, messagelen-2, 2);
}
Only the data variable differs in length.
e.g. data --> "347C31323030302D3132353330387C33302E30372E323032307C31317C33307C33317C31352D31367C31357C317C57656E67657274"
A HardFault debug log:
LR = 0x8667 in disassembly
PC = 0x2dd0 in disassembly
I appreciate the contributors who led me to the solution for my case.
Since none of them gave a complete solution and I found a working one, I'm writing it up for anyone who may be interested in the future.
I am developing my application on top of FreeRTOS 10 and was using malloc from the C library, and apparently that wasn't cooperating, at least with my implementation. Some resources say you can use the standard malloc within FreeRTOS, but I couldn't get it to work for some unknown reason. Increasing the heap size might have helped; I don't know, and I didn't intend to do that anyway.
I've just placed these two wrapper functions (somewhere in a common file) without even changing my malloc and free calls.
Creating malloc/free functions that work with the built-in FreeRTOS heap is quite simple. We just wrap the pvPortMalloc/vPortFree calls:
void* malloc(size_t size)
{
    void* ptr = NULL;
    if(size > 0)
    {
        // We simply wrap the FreeRTOS call into a standard form
        ptr = pvPortMalloc(size);
    } // else NULL if there was an error
    return ptr;
}
void free(void* ptr)
{
    if(ptr)
    {
        // We simply wrap the FreeRTOS call into a standard form
        vPortFree(ptr);
    }
}
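Note: you can't use this with heap scheme 1 (heap_1.c), which does not implement freeing, nor with heap_3.c, which itself wraps the C library malloc() and free() and would recurse into these wrappers. It works with heap_2, heap_4 and heap_5.
I would recommend starting with portable/MemMang/heap_4.c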
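One more thing worth checking, going only by the posted snippet: parseMessage() makes six allocations per call and shows no free(), and substr3() never checks the result of malloc(). If the real code is the same, a small heap would run out after a handful of calls, malloc() would return NULL, and the memcpy() into NULL would be a plausible source of the HardFault. That is a guess from the snippet, not a confirmed diagnosis. A minimal defensive variant, with substr_checked as a made-up name, could look like this:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Returns a newly allocated copy of input[start .. start+len), or NULL if
 * the range does not fit inside the input or the allocation fails.
 * The caller owns the result and must free() it when done.
 */
static char *substr_checked(const char *input, size_t start, size_t len)
{
    size_t input_len = strlen(input);
    char *ret;

    /* Reject ranges that would read past the terminating NUL. */
    if (start > input_len || len > input_len - start)
        return NULL;

    ret = malloc(len + 1);
    if (ret == NULL)
        return NULL;

    memcpy(ret, input + start, len);
    ret[len] = '\0';
    return ret;
}
```

The range check also covers frames shorter than expected, where an expression like messagelen-8-4 on an unsigned value would wrap around to a huge length.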

two function addresses subtraction

I'm reading a piece of code about exploit in here. There is a statement going like this:
/*
FreeBSD <= 6.1 suffers from classical check/use race condition on SMP
systems in kevent() syscall, leading to kernel mode NULL pointer
dereference. It can be triggered by spawning two threads:
1st thread looping on open() and close() syscalls, and the 2nd thread
looping on kevent(), trying to add possibly invalid filedescriptor.
*/
static void kernel_code(void) {
    struct thread *thread;
    gotroot = 1;
    asm(
        "movl %%fs:0, %0"
        : "=r"(thread)
    );
    thread->td_proc->p_ucred->cr_uid = 0;
#ifdef PRISON_BREAK
    thread->td_proc->p_ucred->cr_prison = NULL;
#endif
    return;
}
static void code_end(void) {
    return;
}
int main() {
    ....
    memcpy(0, &kernel_code, &code_end - &kernel_code);
    ....
}
I'm curious what's the meaning of this memcpy? What is the result of &code_end - &kernel_code?
This assumes that the function kernel_code() ends somewhere before the function code_end() starts. The memcpy() therefore copies kernel_code() to address 0. One assumes that some other aspect of the exploit results in a return or jump to address 0, thereby running kernel_code().
void * memcpy ( void * destination, const void * source, size_t num );
That memcpy will copy the function kernel_code to address 0 (NULL).
The exploit is trying to gain root privilege (a UID of 0) by having two threads, do_thread and do_thread2, race each other on the kevent queue.
The difference between the address of code_end and the address of kernel_code is taken as the size of kernel_code, on the assumption that the two functions are adjacent in memory, and that many bytes are copied to address 0. When the kernel then dereferences the NULL pointer and runs the copied code, the effective user ID is set to 0, i.e. root.
This C++ Ref page summarizes what memcpy is about.
void * memcpy ( void * destination, const void * source, size_t num );
Copies the values of num bytes from the location pointed to by source
directly to the memory block pointed to by destination.
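To see what that subtraction amounts to: ISO C does not define `-` on function pointers, but GCC accepts it as an extension and treats the pointed-to size as 1, so the result is a byte count. A more portable way to get the same number is to convert both addresses to integers first. The sketch below uses made-up function names, and nothing in the language guarantees that the two functions are adjacent or emitted in source order, which is exactly the assumption the exploit relies on.

```c
#include <stdint.h>

/* Two small functions whose addresses we compare (illustrative only). */
static int first_fn(int x)  { return x + 1; }
static int second_fn(int x) { return x + 2; }

/* Byte distance between two code addresses given as integers.
 * Works in either order, since the layout is not guaranteed. */
static uintptr_t code_distance(uintptr_t from, uintptr_t to)
{
    return to >= from ? to - from : from - to;
}

/* Distance in bytes between the entry points of the two functions above.
 * If second_fn directly follows first_fn, this is the size of first_fn's
 * code (plus any alignment padding). */
static uintptr_t fn_distance(void)
{
    return code_distance((uintptr_t)&first_fn, (uintptr_t)&second_fn);
}
```

The actual value depends on the compiler, optimization level and target, so treat it as something to inspect rather than rely on.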

C: Minimising code duplication using functions in a header file

This is a bit of a strange use case, so searching for existing discussion is difficult. I'm programming for embedded systems (Microchip PIC24 using XC16 compiler) and am currently implementing a communication protocol identically across 3 separate UART channels (each UART will grab data from a master data table).
The way I started out writing the project was to have each UART handled by a separate module, with a lot of code duplication, along the lines of the following pseudocode:
UART1.c:
static unsigned char buffer[128];
static unsigned char pointer = 0;
static unsigned char packet_received = 0;
void interrupt UART1Receive (void) {
buffer[pointer++] = UART1RX_REG;
if (end of packet condition) packet_received = 1;
}
void processUART1(void) { // This is called regularly from main loop
if (packet_received) {
// Process packet
}
}
UART2.c:
static unsigned char buffer[128];
static unsigned char pointer = 0;
static unsigned char packet_received = 0;
void interrupt UART2Receive (void) {
buffer[pointer++] = UART2RX_REG;
if (end of packet condition) packet_received = 1;
}
void processUART2(void) { // This is called regularly from main loop
if (packet_received) {
// Process packet
}
}
While the above is neat and works well, in practice the communication protocol itself is quite complex, so having it duplicated three times (simply with changes to references to the UART registers) is increasing the opportunity for bugs to be introduced. Having a single function and passing pointers to it is not an option, since this will have too great an impact on speed. The code needs to be physically duplicated in memory for each UART.
I gave it a lot of thought and despite knowing the rules of never putting functions in a header file, decided to try a specific header file that included the duplicate code, with references as #defined values:
protocol.h:
// UART_RECEIVE_NAME and UART_RX_REG are just macros to be defined
// in calling file
void interrupt UART_RECEIVE_NAME (void) {
buffer[pointer++] = UART_RX_REG;
if (end of packet condition) packet_received = 1;
}
UART1.c:
static unsigned char buffer[128];
static unsigned char pointer = 0;
static unsigned char packet_received = 0;
#define UART_RECEIVE_NAME UART1Receive
#define UART_RX_REG UART1RX_REG
#include "protocol.h"
void processUART1(void) { // This is called regularly from main loop
if (packet_received) {
// Process packet
}
}
UART2.c:
static unsigned char buffer[128];
static unsigned char pointer = 0;
static unsigned char packet_received = 0;
#define UART_RECEIVE_NAME UART2Receive
#define UART_RX_REG UART2RX_REG
#include "protocol.h"
void processUART2(void) { // This is called regularly from main loop
if (packet_received) {
// Process packet
}
}
I was slightly surprised when the code compiled without any errors! It does seem to work though, and post compilation MPLAB X can even work out all of the symbol references, so that none of the macro references in UART1.c and UART2.c get flagged as unresolvable identifiers. I did then realise I should probably rename the protocol.h file to protocol.c (and update the #includes accordingly), but that's not practically a big deal.
There is only one downside: the IDE has no idea what to do while stepping through code included from protocol.h while simulating or debugging. It just stays at the calling instruction while the code executes, so debugging will be a little more difficult.
So how hacky is this solution? Will the C gods smite me for even considering this? Are there any better alternatives that I've overlooked?
An alternative is to define a function macro that contains the body of code. Some token pasting operators can automatically generate the symbol names required. Multi-line macros can be generated by using \ at the end of all but the last line.
#define UART_RECEIVE(n) \
void interrupt UART##n##Receive (void) { \
buffer[pointer++] = UART##n##RX_REG; \
if (end of packet condition) packet_received = 1; \
}
UART_RECEIVE(1)
UART_RECEIVE(2)
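Here is a host-testable sketch of the same token-pasting idea, for anyone who wants to try it off-target. Everything in it is an assumption made for illustration: the interrupt keyword and the UARTnRX_REG registers are replaced by a plain function taking the received byte, the end-of-packet condition is taken to be a newline, and the per-UART state is generated by the macro as well so that both instances can live in one file.

```c
#include <stdint.h>

#define UART_BUF_SIZE 128
#define UART_EOL '\n'   /* assumed end-of-packet byte */

/*
 * Generates independent state and a receive function for UART 'n'.
 * 'byte' stands in for reading the UARTnRX_REG hardware register.
 * uart##n##_receive pastes to uart1_receive, uart2_receive, and so on.
 */
#define DEFINE_UART_RX(n)                                        \
    static uint8_t uart##n##_buffer[UART_BUF_SIZE];              \
    static uint8_t uart##n##_pointer = 0;                        \
    static uint8_t uart##n##_packet_received = 0;                \
    static void uart##n##_receive(uint8_t byte)                  \
    {                                                            \
        /* Drop bytes once the buffer is full. */                \
        if (uart##n##_pointer < UART_BUF_SIZE)                   \
            uart##n##_buffer[uart##n##_pointer++] = byte;        \
        if (byte == UART_EOL)                                    \
            uart##n##_packet_received = 1;                       \
    }

DEFINE_UART_RX(1)
DEFINE_UART_RX(2)
```

Each expansion produces physically separate code and data, which keeps the "no shared function, no pointer indirection" property the question asks for, at the cost of harder debugging inside the macro body.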
Using macros for this purpose seems to me to be a bad idea. Making debugging impossible is just one disadvantage. It also makes the code difficult to understand, by hiding the real meaning of symbols. And interrupt routines should really be kept independent and short, with the common logic moved into shared handler functions.
The first thing I would do is to define a common buffer struct for each UART. This makes simultaneous communication possible. If each UART needs a separate handler function for the messages, it can be included as a function pointer. The syntax is a bit complicated, but it results in efficient code.
typedef struct uart_buf uart_buf_t;
struct uart_buf {
    uint8_t* buffer;
    int16_t inptr;
    bool packet_received;
    void (*handler_func)(uart_buf_t*);
};
uart_buf_t uart_buf_1;
uart_buf_t uart_buf_2;
Then each interrupt handler will be like this:
void interrupt UART1Receive (void) {
    handle_input(UART1RX_REG, &uart_buf_1);
}
void interrupt UART2Receive (void) {
    handle_input(UART2RX_REG, &uart_buf_2);
}
And the common handler will be:
void handle_input(uint8_t in_char, uart_buf_t* buf) {
    buf->buffer[buf->inptr++] = in_char;
    if (in_char == LF) {   /* comparison, not assignment */
        buf->packet_received = true;
        buf->handler_func(buf);
    }
}
And the message handler is:
void handle_packet(uart_buf_t* buf) {
    /* ... code to handle message ... */
    buf->packet_received = false;
}
And the function pointers must be initialized:
void init(void) {
    uart_buf_1.handler_func = handle_packet;
    uart_buf_2.handler_func = handle_packet;
}
The resulting code is very flexible and can easily be changed. Single-stepping the code is no problem.
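Putting the pieces of this approach together, here is a sketch that compiles and runs on a PC. It makes a few assumptions beyond what is shown above: the interrupt functions and UART registers are left out (you would call handle_input() from each ISR), LF is taken to be 0x0A, and a size field is added to the struct so the shared handler can bounds-check the buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LF 0x0A   /* assumed end-of-packet byte */

typedef struct uart_buf uart_buf_t;
struct uart_buf {
    uint8_t *buffer;
    size_t   size;               /* buffer capacity (added for bounds check) */
    size_t   inptr;
    bool     packet_received;
    void   (*handler_func)(uart_buf_t *);
};

/* Shared receive logic: store the byte, and on LF hand the packet over. */
static void handle_input(uint8_t in_char, uart_buf_t *buf)
{
    if (buf->inptr < buf->size)
        buf->buffer[buf->inptr++] = in_char;
    if (in_char == LF) {
        buf->packet_received = true;
        buf->handler_func(buf);
    }
}

/* Example packet handler: consume the packet and reset the buffer. */
static void handle_packet(uart_buf_t *buf)
{
    /* ... protocol processing would go here ... */
    buf->inptr = 0;
    buf->packet_received = false;
}

/* One buffer and one state struct per UART, initialized statically. */
static uint8_t storage_1[128], storage_2[128];
static uart_buf_t uart_buf_1 = { storage_1, sizeof storage_1, 0, false, handle_packet };
static uart_buf_t uart_buf_2 = { storage_2, sizeof storage_2, 0, false, handle_packet };
```

Feeding bytes to handle_input() with &uart_buf_1 or &uart_buf_2 exercises the whole protocol path without hardware, which also addresses the debugging concern raised in the question.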
