Using memcpy and friends with memory-mapped I/O

Using memcpy and friends with memory-mapped I/O - c

I'm working on an embedded project which involves I/O on memory-mapped FPGA registers. Pointers to these memory regions need to be marked volatile so the compiler does not "optimize out" reads and writes to the FPGA by caching values in CPU registers.
In a few cases, we want to copy a series of FPGA registers into a buffer for further use. Since the registers are mapped to contiguous addresses, memcpy seems appropriate, but passing our volatile pointer as the source argument gives a warning about discarding the volatile qualifier.
Is it safe (and sane) to cast away the volatile-ness of the pointer to suppress this warning? Unless the compiler does something magical, I can't imagine a scenario where calling memcpy would fail to perform an actual copy. The alternative is to just use a for loop and copy byte by byte, but memcpy implementations can (and do) optimize the copy based on size of the copy, alignment, etc.

As a developer of both: FPGA and embedded software, there is just one clear answer: do not use memcpy et al. for this
Some reasons:
There is no guarantee memcpy will work in any specific order.
The compiler might very well replace the call with inline code.
Such acceses often require a certain word-size. memcpy does not guarantee that.
Gaps in the register map might result in undefined behaviour.
You can, however, use a simple for loop and copy yourself. This is safe, if the registers are volatile (see below).
Depending on your platform, volatile alone might not be sufficient. The memory area has also to be non-cachable and strictily ordered (and - possibly - non-shared). Otherwise the system busses might (and will for some platforms) reorder accesses.
Furthermore, you might need barriers/fences for your CPU not to reorder accesses. Please read your hardware-specs very carefully about this.
If you need to transfer larger blocks more often, think about using DMA. If the FPGA uses PCI(e), you could use busmaster DMA with scatter/gather for instance (however, this is not easily implemented; did that myself, but might be worth the effort).
The best (and most sane) approach depends actually on multiple factors, like platform, required speed, etc. Of all possible approaches, I would deem using mempcy() one of the lesser sane(1) at best (1): not sure if that is correct grammar, but I hope you got my point).

Absolutely not safe. There is no guarantee whatsoever in which order memcpy will copy the data, and how many bytes are copied at a time.

Related

Does labelling a block of memory volatile imply the cache is always bypassed? [duplicate]

Cache is controlled by cache hardware transparently to processor, so if we use volatile variables in C program, how is it guaranteed that my program reads data each time from the actual memory address specified but not cache.
My understanding is that,
Volatile keyword tells compiler that the variable references shouldn't be optimized and should be read as programmed in the code.
Cache is controlled by cache hardware transparently, hence when processor issues an address, it doesn't know whether the data is coming from cache or the memory.
So, if I have a requirement of having to read a memory address every time required, how can I make sure that its not referred from cache but from required address?
Some how, these two concepts are not fitting together well. Please clarify how its done.
(Imagining we have write-back policy in cache (if required for analyzing the problem))
Thank you,
Microkernel :)

Firmware developer here. This is a standard problem in embedded programming, and one that trips up many (even very experienced) developers.
My assumption is that you are attempting to access a hardware register, and that register value can change over time (be it interrupt status, timer, GPIO indications, etc.).
The volatile keyword is only part of the solution, and in many cases may not be necessary. This causes the variable to be re-read from memory each time it is used (as opposed to being optimized out by the compiler or stored in a processor register across multiple uses), but whether the "memory" being read is an actual hardware register versus a cached location is unknown to your code and unaffected by the volatile keyword. If your function only reads the register once then you can probably leave off volatile, but as a general rule I will suggest that most hardware registers should be defined as volatile.
The bigger issue is caching and cache coherency. The easiest approach here is to make sure your register is in uncached address space. That means every time you access the register you are guaranteed to read/write the actual hardware register and not cache memory. A more complex but potentially better performing approach is to use cached address space and have your code manually force cache updates for specific situations like this. For both approaches, how this is accomplished is architecture-dependent and beyond the scope of the question. It could involve MTRRs (for x86), MMU, page table modifications, etc.
Hope that helps. If I've missed something, let me know and I'll expand my answer.

From your question there is a misconception on your part.
Volatile keyword is not related to the cache as you describe.
When the keyword volatile is specified for a variable, it gives a hint to the compiler not to do certain optimizations as this variable can change from other parts of the program unexpectedly.
What is meant here, is that the compiler should not reuse the value already loaded in a register, but access the memory again as the value in register is not guaranteed to be the same as the value stored in memory.
The rest concerning the cache memory is not directly related to the programmer.
I mean the synchronization of any cache memory of CPU with the RAM is an entirely different subject.

My suggestion is to mark the page as non-cached by the virtual memory manager.
In Windows, this is done through setting PAGE_NOCACHE when calling VirtualProtect.
For a somewhat different purpose, the SSE 2 instructions have the _mm_stream_xyz instructions to prevent cache pollution, although I don't think they apply to your case here.
In either case, there is no portable way of doing what you want in C; you have to use OS functionality.

Wikipedia has a pretty good article about MTRR (Memory Type Range Registers) which apply to the x86 family of CPUs.
To summarize it, starting with the Pentium Pro Intel (and AMD copied) had these MTR registers which could set uncached, write-through, write-combining, write-protect or write-back attributes on ranges of memory.
Starting with the Pentium III but as far as I know, only really useful with the 64-bit processors, they honor the MTRRs but they can be overridden by the Page Attribute Tables which let the CPU set a memory type for each page of memory.
A major use of the MTRRs that I know of is graphics RAM. It is much more efficient to mark it as write-combining. This lets the cache store up the writes and it relaxes all of the memory write ordering rules to allow very high-speed burst writes to a graphics card.
But for your purposes you would want either a MTRR or a PAT setting of either uncached or write-through.

As you say cache is transparent to the programmer. The system guarantees that you always see the value that was last written to if you access an object through its address. The "only" thing that you may incur if an obsolete value is in your cache is a runtime penalty.

volatile makes sure that data is read everytime it is needed without bothering with any cache between CPU and memory. But if you need to read actual data from memory and not cached data, you have two options:
Make a board where said data is not cached. This may already be the case if you address some I/O device,
Use specific CPU instructions that bypass the cache. This is used when you need to scrub memory for activating possible SEU errors.
The details of second option depend on OS and/or CPU.

using the _Uncached keyword may help in embedded OS , like MQX
#define MEM_READ(addr) (*((volatile _Uncached unsigned int *)(addr)))
#define MEM_WRITE(addr,data) (*((volatile _Uncached unsigned int *)(addr)) = data)

How Dangerous is This Faster `strlen`?

strlen is a fairly simple function, and it is obviously O(n) to compute. However, I have seen a few approaches that operate on more than one character at a time. See example 5 here or this approach here. The basic way these work is by reinterpret-casting the char const* buffer to a uint32_t const* buffer and then checking four bytes at a time.
Personally, my gut reaction is that this is a segfault-waiting-to-happen, since I might dereference up to three bytes outside valid memory. However, this solution seems to hang around, and it seems curious to me that something so obviously broken has stood the test of time.
I think this comprises UB for two reasons:
Potential dereference outside valid memory
Potential dereference of unaligned pointer
(Note that there is not an aliasing issue; one might think the uint32_t is aliased as an incompatible type, and code after the strlen (such as code that might change the string) could run out of order to the strlen, but it turns out that char is an explicit exception to strict aliasing).
But, how likely is it to fail in practice? At minimum, I think there needs to be 3 bytes padding after the string literal data section, malloc needs to be 4-byte or larger aligned (actually the case on most systems), and malloc needs to allocate 3 extra bytes. There are other criteria related to aliasing. This is all fine for compiler implementations, which create their own environments, but how frequently are these conditions met on modern hardware for user code?

The technique is valid, and you will not avoid it if you call our C library strlen. If that library is, for instance, a recent version of the GNU C library (at least on certain targets), it does the same thing.
The key to make it work is to ensure that the pointer is aligned properly. If the pointer is aligned, the operation will read beyond the end of the string surely enough, but not into an adjacent page. If the null terminating byte is within one word of the end of a page, then that last word will be accessed without touching the subsequent page.
It certainly isn't well-defined behavior in C, and so it carries the burden of careful validation when ported from one compiler to another. It also triggers false positives from out-of-bounds access detectors like Valgrind.
Valgrind had to be patched to work around Glibc doing this. Without the patches, you get nuisance errors such as this:
==13669== Invalid read of size 8
==13669== at 0x411D6D7: __wcslen_sse2 (wcslen-sse2.S:59)
==13669== by 0x806923F: length_str (lib.c:2410)
==13669== by 0x807E61A: string_out_put_string (stream.c:997)
==13669== by 0x8075853: obj_pprint (lib.c:7103)
==13669== by 0x8084318: vformat (stream.c:2033)
==13669== by 0x8081599: format (stream.c:2100)
==13669== by 0x408F4D2: (below main) (libc-start.c:226)
==13669== Address 0x43bcaf8 is 56 bytes inside a block of size 60 alloc'd
==13669== at 0x402BE68: malloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==13669== by 0x8063C4F: chk_malloc (lib.c:1763)
==13669== by 0x806CD79: sub_str (lib.c:2653)
==13669== by 0x804A7E2: sysroot_helper (txr.c:233)
==13669== by 0x408F4D2: (below main) (libc-start.c:226)
Glibc is using SSE instructions to do calculate wcslen eight bytes at a time (instead of four, the width of wchar_t). In doing so, it is accessing at offset 56 in a block that is 60 bytes wide. However, note that this access could never straddle a page boundary: the address is divisible by 8.
If you're working in assembly language, you don't have to think twice about the technique.
In fact, the technique is used quite bit in some optimized audio codecs that I work with (targetting ARM), which feature a lot of hand-written assembly language in the Neon instruction set.
I noticed it when running Valgrind on code which integrated these codecs, and contacted the vendor. They explained that it was just a harmless loop optimization technique; I went through the assembly language and convinced myself they were right.

(1) can definitely happen. There's nothing preventing you from taking the strlen of a string near the end of an allocated page, which could result in an access past the end of allocated memory and a nice big crash. As you note, this could be mitigated by padding all your allocations, but then you have to have any libraries do the same. Worse, you have to arrange for the linker and OS to always add this padding (remember the OS passes argv[] in a static memory buffer somewhere). The overhead of doing this isn't worth it.
(2) also definitely happens. Earlier versions of ARM processors generate data aborts on unaligned accesses, which either cause your program to die with a bus error (or halt the CPU if you're running bare-metal), or force a very expensive trap through the kernel to handle the unaligned access. These earlier ARM chips are still in wide use in older cellphones and embedded devices. Later ARM processors synthesize multiple word accesses to deal with unaligned accesses, but this will result in overall slower performance since you basically double the number of memory loads you need to do.
Many current ("modern") PICs and embedded microprocessors lack the logic to handle unaligned accesses, and may behave unpredictably or even nonsensically when given unaligned addresses (I've personally seen chips that will just mask off the bottom bits, which would give incorrect answers, and others that will just give garbage results with unaligned accesses).
So, this is ridiculously dangerous to use in anything that should be remotely portable. Please, please, please do not use this code; use the libc strlen. It will usually be faster (optimized for your platform properly) and will make your code portable. The last thing you want is for your code to subtly and unexpectedly break in some situation (string near the end of an allocation) or on some new processor.

Donald Knuth, a person who wrote 3+ volumes on clever algorithms said: "Premature optimization is the root of all evil".
strlen() is used a lot, so it really should be fast. Riffing on wildplasser's remark, "I would trust the library function", what makes you think that the library function works byte at a time? Or is slow?
The title may give folks the impression that the code you suggest is faster than the standard system library strlen(), but I think what you mean is that it is faster than a naive strlen() which probably doesn't get used, anyway.
I compiled a simple C program and looked on my 64-bit system which uses GNU's glibc function. The code I saw was pretty sophisticated and looks pretty fast in terms of working with register width rather than byte at a time. The code I saw for strlen() is written in assembly language so there probably aren't junk instructions as you might get if this were compiled C code. What I saw was rtld-strlen.S. This code also unrolls loops to reduce the overhead in looping.
Before you think you can do better on strlen, you should look at that code, or the corresponding code for your particular architecture, and register size.
And if you do write your own strlen, benchmark it against the existing implementation.
And obviously, if you use the system strlen, then it is probably correct and you don't have to worry about invalid memory references as a result of an optimization in the code.

I agree it's a bletcherous technique, but I suspect it's likely to work most of the time. It's only a segfault if the string happens to be right up against the end of your data (or stack) segment. The vast majority of strings (wheher statically or dynamically allocated) won't be.
But you're right, to guarantee it working you'd need some guarantee that all strings were padded somehow, and your list of shims looks about right.
If alignment is a problem, you could take care of that in the fast strlen implementation; you wouldn't have to run around trying to align all strings.
(But of course, if your problem is that you're spending too much time scanning strings, the right fix is not to desperately try to make string scanning faster, but to rig things up so that you don't have to scan so many strings in the first place...)

Read and Write atomic operation implementation in the Linux Kernel

Recently I've peeked into the Linux kernel implementation of an atomic read and write and a few questions came up.
First the relevant code from the ia64 architecture:
typedef struct {
int counter;
} atomic_t;
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic64_read(v) (*(volatile long *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
#define atomic64_set(v,i) (((v)->counter) = (i))
For both read and write operations, it seems that the direct approach was taken to read from or write to the variable. Unless there is another trick somewhere, I do not understand what guarantees exist that this operation will be atomic in the assembly domain. I guess an obvious answer will be that such an operation translates to one assembly opcode, but even so, how is that guaranteed when taking into account the different memory cache levels (or other optimizations)?
On the read macros, the volatile type is used in a casting trick. Anyone has a clue how this affects the atomicity here? (Note that it is not used in the write operation)

I think you are misunderstanding the (very much vague) usage of the word "atomic" and "volatile" here. Atomic only really means that the words will be read or written atomically (in one step, and guaranteeing that the contents of this memory position will always be one write or the other, and not something in between). And the volatile keyword tells the compiler to never assume the data in that location due to an earlier read/write (basically, never optimize away the read).
What the words "atomic" and "volatile" do NOT mean here is that there's any form of memory synchronization. Neither implies ANY read/write barriers or fences. Nothing is guaranteed with regards to memory and cache coherence. These functions are basically atomic only at the software level, and the hardware can optimize/lie however it deems fit.
Now as to why simply reading is enough: the memory models for each architecture are different. Many architectures can guarantee atomic reads or writes for data aligned to a certain byte offset, or x words in length, etc. and vary from CPU to CPU. The Linux kernel contains many defines for the different architectures that let it do without any atomic calls (CMPXCHG, basically) on platforms that guarantee (sometimes even only in practice even if in reality their spec says the don't actually guarantee) atomic reads/writes.
As for the volatile, while there is no need for it in general unless you're accessing memory-mapped IO, it all depends on when/where/why the atomic_read and atomic_write macros are being called. Many compilers will (though it is not set in the C spec) generate memory barriers/fences for volatile variables (GCC, off the top of my head, is one. MSVC does for sure.). While this would normally mean that all reads/writes to this variable are now officially exempt from just about any compiler optimizations, in this case by creating a "virtual" volatile variable only this particular instance of a read/write is off-limits for optimization and re-ordering.

The reads are atomic on most major architectures, so long as they are aligned to a multiple of their size (and aren't bigger than the read size of a give type), see the Intel Architecture manuals. Writes on the other hand many be different, Intel states that under x86, single byte write and aligned writes may be atomic, under IPF (IA64), everything use acquire and release semantics, which would make it guaranteed atomic, see this.
the volatile prevents the compiler from caching the value locally, forcing it to be retrieve where ever there is access to it.

If you write for a specific architecture, you can make assumptions specific to it.
I guess IA-64 does compile these things to a single instruction.
The cache shouldn't be an issue, unless the counter crosses a cache line boundry. But if 4/8 byte alignment is required, this can't happen.
A "real" atomic instruction is required when a machine instruction translates into two memory accesses. This is the case for increments (read, increment, write) or compare&swap.
volatile affects the optimizations the compiler can do.
For example, it prevents the compiler from converting multiple reads into one read.
But on the machine instruction level, it does nothing.

Concurrent access to struct member

I'm using 32-bit microcontroller (STR91x). I'm concurrently accessing (from ISR and main loop) struct member of type enum. Access is limited to writing to that enum field in the ISR and checking in the main loop. Enum's underlying type is not larger than integer (32-bit).
I would like to make sure that I'm not missing anything and I can safely do it.

Provided that 32 bit reads and writes are atomic, which is almost certainly the case (you might want to make sure that your enum's word-aligned) then that which you've described will be just fine.

As paxdiablo & David Knell said, generally speaking this is fine. Even if your bus is < 32 bits, chances are the instruction's multiple bus cycles won't be interrupted, and you'll always read valid data.
What you stated, and what we all know, but it bears repeating, is that this is fine for a single-writer, N-reader situation. If you had more than one writer, all bets are off unless you have a construct to protect the data.

If you want to make sure, find the compiler switch that generates an assembly listing and examine the assembly for the write in the ISR and the read in the main loop. Even if you are not familiar with ARM assembly, I'm sure you could quickly and easily be able to discern whether or not the reads and writes are atomic.

ARM supports 32-bit aligned reads that are atomic as far as interrupts are concerned. However, make sure your compiler doesn't try to cache the value in a register! Either mark it as a volatile, or use an explicit memory barrier - on GCC this can be done like so:
int tmp = yourvariable;
__sync_synchronize(yourvariable);
Note, however, that current versions of GCC person a full memory barrier for __sync_synchronize, rather than just for the one variable, so volatile is probably better for your needs.
Further, note that your variable will be aligned automatically unless you are doing something Weird (ie, explicitly specifying the location of the struct in memory, or requesting a packed struct). Unaligned variables on ARM cannot be read atomically, so make sure it's aligned, or disable interrupts while reading.

Well, it depends entirely on your hardware but I'd be surprised if an ISR could be interrupted by the main thread.
So probably the only thing you have to watch out for is if the main thread could be interrupted halfway through a read (so it may get part of the old value and part of the new).
It should be a simple matter of consulting the specs to ensure that interrupts are only processed between instructions (this is likely since the alternative would be very complex) and that your 32-bit load is a single instruction.

An aligned 32 bit access will generally be atomic (unless it were a particularly ludicrous compiler!).
However the rock-solid solution (and one generally applicable to non-32 bit targets too) is to simply disable the interrupt temporarily while accessing the data outside of the interrupt. The most robust way to do this is through an access function to statically scoped data rather than making the data global where you then have no single point of access and therefore no way of enforcing an atomic access mechanism when needed.

memcpy vs assignment in C

Under what circumstances should I expect memcpys to outperform assignments on modern INTEL/AMD hardware? I am using GCC 4.2.x on a 32 bit Intel platform (but am interested in 64 bit as well).

You should never expect them outperform assignments. The reason is, the compiler will use memcpy anyway when it thinks it would be faster (if you use optimize flags). If not and if the structure is reasonable small that it fits into registers, direct register manipulation could be used which wouldn't require any memory access at all.
GCC has special block-move patterns internally that figure out when to directly change registers / memory cells, or when to use the memcpy function. Note when assigning the struct, the compiler knows at compile time how big the move is going to be, so it can unroll small copies (do a move n-times in row instead of looping) for instance. Note -mno-memcpy:
-mmemcpy
-mno-memcpy
Force (do not force) the use of "memcpy()" for non-trivial block moves.
The default is -mno-memcpy, which allows GCC to inline most constant-sized copies.
Who knows it better when to use memcpy than the compiler itself?

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight