Concurrent I/O - buffer corruption, block device driver - C

I am developing a layered block device driver. I intercept WRITE requests and encrypt the data, and decrypt the data in the end_bio() routine (while processing READ requests).
Everything works fine with a single stream, but I get buffer content corruption as soon as I/O is performed from two or more processes simultaneously. I do not keep any local storage for the buffers.
Do I need to account for BIO merging in my driver?
Does the Linux I/O subsystem have any requirements related to the number of concurrent I/O requests?
Are there any tips and tricks related to stack usage or compilation?
This is under kernel 4.15.
At the moment I use the following construction to run over the disk sectors:
/*
 * A portion of the bio_copy_data() ...
 */
for (vcnt = 0, src_iter = src->bi_iter; ; vcnt++) {
        if (!src_iter.bi_size) {
                if (!(src = src->bi_next))
                        break;

                src_iter = src->bi_iter;
        }

        src_bv = bio_iter_iovec(src, src_iter);

        src_p = bv_page = kmap_atomic(src_bv.bv_page);
        src_p += src_bv.bv_offset;

        nlbn = src_bv.bv_len / 512;

        for ( ; nlbn--; lbn++, src_p += 512) {
                /* Simulate processing of the data in the I/O buffer */
                char *srcp = src_p, *dstp = src_p;
                int count = DUDRV$K_SECTORSZ;

                while (count--)
                        *(dstp++) = ~(*(srcp++));
        }

        kunmap_atomic(bv_page);

        bio_advance_iter(src, &src_iter, src_bv.bv_len);
}
Is this correct? Or do I need to use something like bio_for_each_segment(bvl, bio, iter) instead?
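For reference, here is a minimal sketch of the helper-based iteration (single, non-chained bio; bvec and iter are local variables, and the per-sector processing is only indicated by a comment):

struct bio_vec bvec;
struct bvec_iter iter;

bio_for_each_segment(bvec, bio, iter) {
        void *page_addr = kmap_atomic(bvec.bv_page);
        char *p = (char *)page_addr + bvec.bv_offset;

        /* process bvec.bv_len bytes at p, e.g. encrypt/decrypt in place */

        kunmap_atomic(page_addr);
}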

The root of the problem is a "feature" of the block I/O machinery. In particular (see the description in the kernel's block-layer documentation on biovec iterators):
Biovecs can be shared between multiple bios - a bvec iter can represent an arbitrary range of an existing biovec, both starting and ending midway through biovecs. This is what enables efficient splitting of arbitrary bios. Note that this means we only use bi_size to determine when we've reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes bi_size into account when constructing biovecs.
So, in my case, this is the cause of the buffer overrunning the disk sector.
The trick is to set REQ_NOMERGE_FLAGS in .bi_opf before sending the BIO to the backing device driver.
The second issue is that the .bi_iter returned by the backing device driver is no longer valid, so we need to save it (before submitting the BIO request to the backend) and restore it in our bio_endio() routine.
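A minimal sketch of that approach (the context structure and the function names are illustrative, not taken from the driver; REQ_NOMERGE is the relevant bit within the REQ_NOMERGE_FLAGS mask):

struct my_io_ctx {
        struct bio      *bio;
        struct bvec_iter saved_iter;   /* copy of bi_iter taken before submit */
};

static void my_end_io(struct bio *bio)
{
        struct my_io_ctx *ctx = bio->bi_private;

        /* The backing driver has advanced bi_iter (bi_size is now 0), so
         * restore the saved iterator before walking the data again. */
        bio->bi_iter = ctx->saved_iter;

        /* ... decrypt the segments here, then complete the original bio ... */
}

static void my_submit(struct my_io_ctx *ctx, struct bio *bio)
{
        bio->bi_opf |= REQ_NOMERGE;       /* do not let the bio be merged   */
        ctx->saved_iter = bio->bi_iter;   /* keep a copy for my_end_io()    */
        bio->bi_private = ctx;
        bio->bi_end_io = my_end_io;
        submit_bio(bio);
}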

Have you considered using vmap with global synchronization instead?
The use of kmap_atomic has some restrictions:
Since the mapping is restricted to the CPU that issued it, it
performs well, but the issuing task is therefore required to stay on that
CPU until it has finished, lest some other task displace its mappings.
kmap_atomic() may also be used by interrupt contexts, since it does not
sleep, and the caller may not sleep until after kunmap_atomic() is called.
Reference: https://www.kernel.org/doc/Documentation/vm/highmem.txt
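To illustrate the quoted constraint (a fragment only; bvec and buffer are assumed to come from a surrounding loop such as the one in the question):

void *p = kmap_atomic(bvec.bv_page);

memcpy(buffer, (char *)p + bvec.bv_offset, bvec.bv_len); /* fine: nothing sleeps */
/* Calls that may sleep - kmalloc(GFP_KERNEL), mutex_lock(), copy_to_user(),
 * etc. - must NOT appear here while the atomic mapping is held. */

kunmap_atomic(p);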

How does Qemu emulate PCIe devices?

I'm writing an open-source document about QEMU internals, so by helping me you're helping the QEMU project grow.
The closest answer I found was: In which conditions the ioctl KVM_RUN returns?
This is the thread loop for a single CPU running on KVM:
static void *qemu_kvm_cpu_thread_fn(void *arg)
{
    CPUState *cpu = arg;
    int r;

    rcu_register_thread();

    qemu_mutex_lock_iothread();
    qemu_thread_get_self(cpu->thread);
    cpu->thread_id = qemu_get_thread_id();
    cpu->can_do_io = 1;
    current_cpu = cpu;

    r = kvm_init_vcpu(cpu);
    if (r < 0) {
        error_report("kvm_init_vcpu failed: %s", strerror(-r));
        exit(1);
    }

    kvm_init_cpu_signals(cpu);

    /* signal CPU creation */
    cpu->created = true;
    qemu_cond_signal(&qemu_cpu_cond);
    qemu_guest_random_seed_thread_part2(cpu->random_seed);

    do {
        if (cpu_can_run(cpu)) {
            r = kvm_cpu_exec(cpu);
            if (r == EXCP_DEBUG) {
                cpu_handle_guest_debug(cpu);
            }
        }
        qemu_wait_io_event(cpu);
    } while (!cpu->unplug || cpu_can_run(cpu));

    qemu_kvm_destroy_vcpu(cpu);
    cpu->created = false;
    qemu_cond_signal(&qemu_cpu_cond);
    qemu_mutex_unlock_iothread();
    rcu_unregister_thread();
    return NULL;
}
You can see here:
do {
    if (cpu_can_run(cpu)) {
        r = kvm_cpu_exec(cpu);
        if (r == EXCP_DEBUG) {
            cpu_handle_guest_debug(cpu);
        }
    }
    qemu_wait_io_event(cpu);
} while (!cpu->unplug || cpu_can_run(cpu));
that every time KVM returns, it gives QEMU an opportunity to emulate things. I suppose that when the kernel on the guest tries to access a PCIe device, KVM on the host returns. I don't know how KVM knows when to return. Maybe KVM maintains the addresses of the PCIe devices and tells Intel's VT-d or AMD's IOV which addresses should generate an exception. Can someone clarify this?
Well, by the look of qemu_kvm_cpu_thread_fn, the only place where a PCIe access could be emulated is qemu_wait_io_event(cpu), which is defined here: https://github.com/qemu/qemu/blob/stable-4.2/cpus.c#L1266 and which calls qemu_wait_io_event_common, defined here: https://github.com/qemu/qemu/blob/stable-4.2/cpus.c#L1241, which calls process_queued_cpu_work, defined here: https://github.com/qemu/qemu/blob/stable-4.2/cpus-common.c#L309
Let's see the code which executes the queue functions:
while (cpu->queued_work_first != NULL) {
    wi = cpu->queued_work_first;
    cpu->queued_work_first = wi->next;
    if (!cpu->queued_work_first) {
        cpu->queued_work_last = NULL;
    }
    qemu_mutex_unlock(&cpu->work_mutex);
    if (wi->exclusive) {
        /* Running work items outside the BQL avoids the following deadlock:
         * 1) start_exclusive() is called with the BQL taken while another
         * CPU is running; 2) cpu_exec in the other CPU tries to takes the
         * BQL, so it goes to sleep; start_exclusive() is sleeping too, so
         * neither CPU can proceed.
         */
        qemu_mutex_unlock_iothread();
        start_exclusive();
        wi->func(cpu, wi->data);
It looks like the only power the vCPU thread qemu_kvm_cpu_thread_fn has when KVM returns is to execute the queued functions:
wi->func(cpu, wi->data);
This means that a PCIe device would have to constantly queue itself as a function for QEMU to execute. I don't see how that would work.
The functions that are able to queue work on this CPU have run_on_cpu in their name. By searching for it in VSCode I found some functions that queue work, but none related to PCIe or even emulation. The nicest function I found was this one, which apparently patches instructions: https://github.com/qemu/qemu/blob/stable-4.2/hw/i386/kvmvapic.c#L446. Nice, I wanted to know that as well.
Device emulation (of all devices, not just PCI) under KVM gets handled by the "case KVM_EXIT_IO" (for x86-style IO ports) and "case KVM_EXIT_MMIO" (for memory mapped IO including PCI) in the "switch (run->exit_reason)" inside kvm_cpu_exec(). qemu_wait_io_event() is unrelated.
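To see that mechanism in isolation, here is a minimal stand-alone /dev/kvm program, condensed from the well-known LWN "Using the KVM API" example (so this is not QEMU code, just the same KVM_RUN / exit_reason loop in miniature, with most error checking omitted). The tiny real-mode guest adds %al and %bl, converts the result to ASCII, writes it and a newline to port 0x3f8, and halts; each out instruction makes KVM_RUN return with KVM_EXIT_IO, and user space "emulates the device" by printing the byte. MMIO accesses (e.g. to PCI BARs) come back the same way, just with KVM_EXIT_MMIO.

#include <err.h>
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    /* Guest code: add %al+%bl, make it ASCII, out it and '\n', then hlt. */
    const uint8_t code[] = {
        0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
        0x00, 0xd8,       /* add %bl, %al    */
        0x04, '0',        /* add $'0', %al   */
        0xee,             /* out %al, (%dx)  */
        0xb0, '\n',       /* mov $'\n', %al  */
        0xee,             /* out %al, (%dx)  */
        0xf4,             /* hlt             */
    };

    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm < 0)
        err(1, "/dev/kvm");
    int vmfd = ioctl(kvm, KVM_CREATE_VM, 0UL);

    /* One page of guest "RAM" at guest-physical 0x1000, holding the code. */
    uint8_t *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memcpy(mem, code, sizeof(code));
    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0x1000,
        .memory_size = 0x1000, .userspace_addr = (uint64_t)mem,
    };
    ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0UL);

    /* The shared kvm_run structure carries the exit reason back to us. */
    long mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, NULL);
    struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpufd, 0);

    struct kvm_sregs sregs;
    ioctl(vcpufd, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    ioctl(vcpufd, KVM_SET_SREGS, &sregs);
    struct kvm_regs regs = { .rip = 0x1000, .rax = 2, .rbx = 2, .rflags = 0x2 };
    ioctl(vcpufd, KVM_SET_REGS, &regs);

    /* The emulation loop: every return from KVM_RUN is a device-emulation
     * opportunity, dispatched on run->exit_reason exactly as in QEMU. */
    for (;;) {
        ioctl(vcpufd, KVM_RUN, NULL);
        switch (run->exit_reason) {
        case KVM_EXIT_HLT:
            return 0;
        case KVM_EXIT_IO:   /* the guest executed an in/out instruction */
            if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3f8)
                putchar(*((char *)run + run->io.data_offset));
            break;
        default:
            errx(1, "unhandled exit_reason %d", run->exit_reason);
        }
    }
}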
Want to know how execution gets to "emulate a register read on a PCI device"? Run QEMU under gdb, set a breakpoint on, say, the register read/write function for the Ethernet PCI card you're using, and when you get dropped into the debugger look at the stack backtrace. (Configure QEMU with --enable-debug to get better debug info for this kind of thing.)
PS: If you're examining QEMU internals for educational purposes, you'd be better to use the current code, not a year-old release of it.

Which of these implementations of seqlock are correct?

I am studying the implementation of seqlocks. However, all the sources I found implement them differently.
Linux Kernel
The Linux kernel implements it like this:
static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
    unsigned ret;

repeat:
    ret = READ_ONCE(s->sequence);
    if (unlikely(ret & 1)) {
        cpu_relax();
        goto repeat;
    }
    return ret;
}

static inline unsigned raw_read_seqcount_begin(const seqcount_t *s)
{
    unsigned ret = __read_seqcount_begin(s);
    smp_rmb();
    return ret;
}
Basically, it uses a volatile read plus a read barrier with acquire semantics on the reader side.
When used, subsequent reads are unprotected:
struct Data {
    u64 a, b;
};

// ...
read_seqcount_begin(&seq);
int v1 = d.a, v2 = d.b;
// ...
rigtorp/Seqlock
RIGTORP_SEQLOCK_NOINLINE T load() const noexcept {
    T copy;
    std::size_t seq0, seq1;
    do {
        seq0 = seq_.load(std::memory_order_acquire);
        std::atomic_signal_fence(std::memory_order_acq_rel);
        copy = value_;
        std::atomic_signal_fence(std::memory_order_acq_rel);
        seq1 = seq_.load(std::memory_order_acquire);
    } while (seq0 != seq1 || seq0 & 1);
    return copy;
}
The load of the data is still performed without an atomic operation or protection. However, an atomic_signal_fence with acquire-release semantics is added around the read, in contrast to the rmb with acquire semantics in the kernel.
Amanieu/seqlock (Rust)
pub fn read(&self) -> T {
    loop {
        // Load the first sequence number. The acquire ordering ensures that
        // this is done before reading the data.
        let seq1 = self.seq.load(Ordering::Acquire);

        // If the sequence number is odd then it means a writer is currently
        // modifying the value.
        if seq1 & 1 != 0 {
            // Yield to give the writer a chance to finish. Writing is
            // expected to be relatively rare anyways so this isn't too
            // performance critical.
            thread::yield_now();
            continue;
        }

        // We need to use a volatile read here because the data may be
        // concurrently modified by a writer.
        let result = unsafe { ptr::read_volatile(self.data.get()) };

        // Make sure the seq2 read occurs after reading the data. What we
        // ideally want is a load(Release), but the Release ordering is not
        // available on loads.
        fence(Ordering::Acquire);

        // If the sequence number is the same then the data wasn't modified
        // while we were reading it, and can be returned.
        let seq2 = self.seq.load(Ordering::Relaxed);
        if seq1 == seq2 {
            return result;
        }
    }
}
There is no explicit memory barrier between loading seq and the data; a volatile read is used instead.
Can Seqlocks Get Along with Programming Language Memory Models? (Variant 3)
T reader() {
    int r1, r2;
    unsigned seq0, seq1;
    do {
        seq0 = seq.load(m_o_acquire);
        r1 = data1.load(m_o_relaxed);
        r2 = data2.load(m_o_relaxed);
        atomic_thread_fence(m_o_acquire);
        seq1 = seq.load(m_o_relaxed);
    } while (seq0 != seq1 || seq0 & 1);
    // do something with r1 and r2;
}
Similar to the Rust implementation, but atomic operations on the data are used instead of a volatile read.
Arguments in P1478R1: Byte-wise atomic memcpy
This paper claims that:
In the general case, there are good semantic reasons to require that all data accesses inside such a seqlock "critical section" must be atomic. If we read a pointer p as part of reading the data, and then read *p as well, the code inside the critical section may read from a bad address if the read of p happened to see a half-updated pointer value. In such cases, there is probably no way to avoid reading the pointer with a conventional atomic load, and that's exactly what's desired.
However, in many cases, particularly in the multiple process case, seqlock data consists of a single trivially copyable object, and the seqlock "critical section" consists of a simple copy operation. Under normal circumstances, this could have been written using memcpy. But that's unacceptable here, since memcpy does not generate atomic accesses, and is (according to our specification anyway) susceptible to data races.
Currently to write such code correctly, we need to basically decompose such data into many small lock-free atomic subobjects, and copy them a piece at a time. Treating the data as a single large atomic object would defeat the purpose of the seqlock, since the atomic copy operation would acquire a conventional lock. Our proposal essentially adds a convenient library facility to automate this decomposition into small objects.
My question
Which of the above implementations are correct? Which are correct but inefficient?
Can the volatile_read be reordered before the acquire-read of seqlock?
Your quotes from Linux seem wrong.
According to https://www.kernel.org/doc/html/latest/locking/seqlock.html the read process is:
Read path:
do {
    seq = read_seqcount_begin(&foo_seqcount);
    /* ... [[read-side critical section]] ... */
} while (read_seqcount_retry(&foo_seqcount, seq));
If you look at the GitHub link posted in the question, you'll find a comment describing nearly the same process.
It seems that you are only looking at one part of the read process. The linked file implements what you need in order to implement readers and writers, but not the readers/writers themselves.
Also notice this comment from the top of the file:
* The seqlock seqcount_t interface does not prescribe a precise sequence of
* read begin/retry/end. For readers, typically there is a call to
* read_seqcount_begin() and read_seqcount_retry(), however, there are more
* esoteric cases which do not follow this pattern.
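For completeness, here is what a full reader/writer pair can look like in C11 atomics: the reader is essentially the Variant 3 reader quoted in the question, and the writer mirrors the kernel's write_seqcount_begin()/write_seqcount_end() pattern (sequence increment, write barrier, data writes, write barrier via the release store, sequence increment). This is only a sketch, assuming a single writer (or externally serialized writers); the data1/data2 fields are illustrative.

#include <stdatomic.h>

/* The payload fields are individually atomic, so the reads inside the
 * "critical section" are not data races (the Variant 3 approach). */
static _Atomic unsigned seq;       /* even: stable, odd: write in progress */
static _Atomic int data1, data2;

void write_pair(int a, int b)      /* single writer assumed */
{
    unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);   /* now odd  */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&data1, a, memory_order_relaxed);
    atomic_store_explicit(&data2, b, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 2, memory_order_release);   /* now even */
}

void read_pair(int *a, int *b)
{
    unsigned s0, s1;
    int r1, r2;
    do {
        s0 = atomic_load_explicit(&seq, memory_order_acquire);
        r1 = atomic_load_explicit(&data1, memory_order_relaxed);
        r2 = atomic_load_explicit(&data2, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s1 = atomic_load_explicit(&seq, memory_order_relaxed);
    } while (s0 != s1 || (s0 & 1));  /* retry on in-progress or torn write */
    *a = r1;
    *b = r2;
}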

DirectShow data copy is TOO slow

I have a USB 3.0 HDMI capture device. It uses the YUY2 format (2 bytes per pixel) at 1920x1080 resolution.
The video capture output pin connects directly to the video renderer input pin.
And it all works well: it shows 1920x1080 without any freezes.
But I need to take a screenshot every second. So this is what I do:
void CaptureInterface::ScreenShoot() {
    IMemInputPin* p_MemoryInputPin = nullptr;
    hr = p_RenderInputPin->QueryInterface(IID_IMemInputPin, (void**)&p_MemoryInputPin);

    IMemAllocator* p_MemoryAllocator = nullptr;
    hr = p_MemoryInputPin->GetAllocator(&p_MemoryAllocator);

    IMediaSample* p_MediaSample = nullptr;
    hr = p_MemoryAllocator->GetBuffer(&p_MediaSample, 0, 0, 0);

    long buff_size = p_MediaSample->GetSize(); // buff_size = 4147200 bytes

    BYTE* buff = nullptr;
    hr = p_MediaSample->GetPointer(&buff);

    // BYTE CaptureInterface::ScreenBuff[1920*1080*2]; defined in header

    //--------- TOO SLOW (1.5 seconds for 4 MBytes) ----------
    std::memcpy(ScreenBuff, buff, buff_size);
    //--------------------------------------------------------

    p_MediaSample->Release();
    p_MemoryAllocator->Release();
    p_MemoryInputPin->Release();

    return;
}
Any other operation on this buffer is very slow too.
But if I use memcpy on other data (for example, two arrays of the same 4 MB size in my class), it is very fast: <0.01 sec.
Video memory is (or might be) slow to read back by its nature (e.g. VMR9 IBasicVideo->GetCurrentImage is very slow, and you can find other references). You normally want to grab the data before it actually reaches video memory.
Additionally, the way you read the data is not quite reliable. You don't know which frame you are actually copying, and it might happen that you read blackness or garbage, or, vice versa, that your acquiring access to the buffer freezes the main video streaming. This is because you are grabbing an unused buffer from the pool of available buffers rather than a buffer that corresponds to a specific video frame. Getting an image from such a buffer rests on the fragile assumption that the unused data from a previously streamed frame was initialized and has not yet been overwritten by anything else.

FreeRTOS + STM32 - thread memory overflow with malloc

I'm working with an STM32 plus an RTOS to implement a file system based on SPI flash. For FreeRTOS, I adopted the heap_1 implementation. This is how I create my task:
osThreadDef(Task_Embedded, Task_VATEmbedded, osPriorityNormal, 0, 2500);
VATEmbeddedTaskHandle = osThreadCreate(osThread(Task_Embedded), NULL);
I allocated 10000 bytes of memory to this thread, and in this thread I tried to write data to the flash. The first few calls worked successfully, but somehow it crashes when I try to write more times.
VATAPI_RESULT STM32SPIWriteSector(void *writebuf, uint8_t* SectorAddr, uint32_t buff_size){
    if(STM32SPIEraseSector(SectorAddr) == VAT_SUCCESS){
        DBGSTR("ERASE SECTOR - 0x%2x %2x %2x", SectorAddr[0], SectorAddr[1], SectorAddr[2]);
    }else return VAT_UNKNOWN;
    if(STM32SPIProgram_multiPage(writebuf, SectorAddr, buff_size) == VAT_SUCCESS){
        DBGSTR("WRTIE SECTOR SUCCESSFUL");
        return VAT_SUCCESS;
    }else return VAT_UNKNOWN;
    return VAT_UNKNOWN;
}

VATAPI_RESULT STM32SPIProgram_multiPage(uint8_t *writebuf, uint8_t *writeAddr, uint32_t buff_size){
    VATAPI_RESULT nres;
    uint8_t tmpaddr[3] = {writeAddr[0], writeAddr[1], writeAddr[2]};
    uint8_t* sectorBuf = malloc(4096 * sizeof(uint8_t));
    uint8_t* pagebuf = malloc(255 * sizeof(uint8_t));
    memset(&sectorBuf[0],0,4096);
    memset(&pagebuf[0],0,255);
    uint32_t i = 0, tmp_convert1, times = 0;
    if(buff_size < Page_bufferSize)
        times = 1;
    else{
        times = buff_size / (Page_bufferSize-1);
        if((times%(Page_bufferSize-1))!=0)
            times++;
    }
    /* Note : According to winbond flash feature, the last bytes of every 256 bytes should be 0, so we need to plus one byte on every 256 bytes*/
    i = 0;
    while(i < times){
        memset(&pagebuf[0], 0, Page_bufferSize - 1);
        memcpy(&pagebuf[0], &writebuf[i*255], Page_bufferSize - 1);
        memcpy(&sectorBuf[i*Page_bufferSize], &pagebuf[0], Page_bufferSize - 1);
        sectorBuf[((i+1)*Page_bufferSize)-1] = 0;
        i++;
    }
    i = 0;
    while(i < times){
        if((nres=STM32SPIPageProgram(&sectorBuf[Page_bufferSize*i], &tmpaddr[0], Page_bufferSize)) != VAT_SUCCESS){
            DBGSTR("STM32SPIProgram_allData write data fail on %d times!",i);
            free(sectorBuf);
            free(pagebuf);
            return nres;
        }
        tmp_convert1 = (tmpaddr[0]<<16 | tmpaddr[1]<<8 | tmpaddr[2]) + Page_bufferSize;
        tmpaddr[0] = (tmp_convert1&0xFF0000) >> 16;
        tmpaddr[1] = (tmp_convert1&0xFF00) >> 8;
        tmpaddr[2] = 0x00;
        i++;
    }
    free(sectorBuf);
    free(pagebuf);
    return nres;
}
I opened the debugger and it seems like it crashes when I malloc "sectorBuf" in the function "STM32SPIProgram_multiPage". What confuses me is that I did free the memory after each malloc. Does anyone have an idea about it?
arm-none-eabi-size "RTOS.elf"
text data bss dec hex filename
77564 988 100756 179308 2bc6c RTOS.elf
Reading the manual:
Memory Management
[...]
If RTOS objects are created dynamically then the standard C library malloc() and free() functions can sometimes be used for the purpose, but ...
they are not always available on embedded systems,
they take up valuable code space,
they are not thread safe, and
they are not deterministic (the amount of time taken to execute the function will differ from call to call)
... so more often than not an alternative memory allocation implementation is required.
One embedded / real time system can have very different RAM and timing requirements to another - so a single RAM allocation algorithm will only ever be appropriate for a subset of applications.
To get around this problem, FreeRTOS keeps the memory allocation API in its portable layer. The portable layer is outside of the source files that implement the core RTOS functionality, allowing an application specific implementation appropriate for the real time system being developed to be provided. When the RTOS kernel requires RAM, instead of calling malloc(), it instead calls pvPortMalloc(). When RAM is being freed, instead of calling free(), the RTOS kernel calls vPortFree().
[...]
(Emphasis mine.)
So the point is that if you call malloc directly, FreeRTOS is not able to account for the heap consumed by that system function. The same applies if you choose the heap_3 scheme, which is a simple malloc wrapper.
Also note that the memory management scheme you chose has no free capability.
heap_1.c
This is the simplest implementation of all. It does not permit memory to be freed once it has been allocated. Despite this, heap_1.c is appropriate for a large number of embedded applications. This is because many small and deeply embedded applications create all the tasks, queues, semaphores, etc. required when the system boots, and then use all of these objects for the lifetime of program (until the application is switched off again, or is rebooted). Nothing ever gets deleted.
The implementation simply subdivides a single array into smaller blocks as RAM is requested. The total size of the array (the total size of the heap) is set by configTOTAL_HEAP_SIZE - which is defined in FreeRTOSConfig.h. The configAPPLICATION_ALLOCATED_HEAP FreeRTOSConfig.h configuration constant is provided to allow the heap to be placed at a specific address in memory.
The xPortGetFreeHeapSize() API function returns the total amount of heap space that remains unallocated, allowing the configTOTAL_HEAP_SIZE setting to be optimised.
The heap_1 implementation:
Can be used if your application never deletes a task, queue, semaphore, mutex, etc. (which actually covers the majority of applications in which FreeRTOS gets used).
Is always deterministic (always takes the same amount of time to execute) and cannot result in memory fragmentation.
Is very simple and allocates memory from a statically allocated array, meaning it is often suitable for use in applications that do not permit true dynamic memory allocation.
(Emphasis mine.)
Side note: you always have to check that the malloc return value is not NULL.
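As a minimal sketch of what that means for the code in the question (illustrative only, and assuming a heap scheme that supports freeing, i.e. heap_2, heap_4 or heap_5 rather than heap_1): allocate from the FreeRTOS heap with pvPortMalloc()/vPortFree() instead of malloc()/free(), and check the result before using it.

#include "FreeRTOS.h"

VATAPI_RESULT STM32SPIProgram_multiPage(uint8_t *writebuf, uint8_t *writeAddr,
                                        uint32_t buff_size)
{
    uint8_t *sectorBuf = pvPortMalloc(4096);
    uint8_t *pagebuf   = pvPortMalloc(256);

    if ((sectorBuf == NULL) || (pagebuf == NULL)) {
        /* Not enough FreeRTOS heap left: raise configTOTAL_HEAP_SIZE in
         * FreeRTOSConfig.h, or fall back to static buffers. */
        if (sectorBuf != NULL) { vPortFree(sectorBuf); }
        if (pagebuf   != NULL) { vPortFree(pagebuf); }
        return VAT_UNKNOWN;
    }

    /* ... the same sector/page copy loops as in the question ... */

    vPortFree(pagebuf);
    vPortFree(sectorBuf);
    return VAT_SUCCESS;
}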

Read Kernel Memory from user mode WITHOUT driver

I'm writing a program which enumerates hooks created by SetWindowsHookEx(). Here is the process:
Use GetProcAddress() to obtain gSharedInfo exported in User32.dll(works, verified)
Read User-Mode memory at gSharedInfo + 8, the result should be a pointer of first handle entry. (works, verified)
Read User-Mode memory at [gSharedInfo] + 8; the result should be the count of handles to enumerate. (works, verified)
Read data from address obtained in step 2, repeat count times
Check if HANDLEENTRY.bType is 5(which means it's a HHOOK). If so, print informations.
The problem is, although step 1-3 only mess around with user mode memory, step 4 requires the program to read kernel memory. After some research I found that ZwSystemDebugControl can be used to access Kernel Memory from user mode. So I wrote the following function:
BOOL GetKernelMemory(PVOID pKernelAddr, PBYTE pBuffer, ULONG uLength)
{
    MEMORY_CHUNKS mc;
    ULONG uReaded = 0;
    mc.Address = (UINT)pKernelAddr; // Kernel memory address - input
    mc.pData = (UINT)pBuffer;       // User-mode memory address - output
    mc.Length = (UINT)uLength;      // length
    ULONG st = -1;
    ZWSYSTEMDEBUGCONTROL ZwSystemDebugControl = (ZWSYSTEMDEBUGCONTROL)GetProcAddress(
        GetModuleHandleA("ntdll.dll"), "NtSystemDebugControl");
    st = ZwSystemDebugControl(SysDbgCopyMemoryChunks_0, &mc, sizeof(MEMORY_CHUNKS), 0, 0, &uReaded);
    return st == 0;
}
But the function above didn't work. uReaded is always 0 and st is always 0xC0000002. How do I resolve this error?
my full program:
http://pastebin.com/xzYfGdC5
MSFT did not implement the NtSystemDebugControl syscall after Windows XP; the 0xC0000002 you are getting is STATUS_NOT_IMPLEMENTED.
The Meltdown vulnerability makes it possible to read kernel memory from user mode on most Intel CPUs, at a speed of approximately 500 kB/s. This works on most unpatched OSes.
