32-bit fixed instruction length in 64-bit memory space

32-bit fixed instruction length in 64-bit memory space - arm

I am currently reading up on the AArch64 architecture by ARM. They are using a RISC-like instruction set with a fixed instruction length of 32-bit while operating on 64-bit addresses. I am still new to the topic of ISA so my question is: how can you operate with 64-bit long addresses when you only have 32-bit length in your instructions?

32-bit is the length of an instruction in bytes, not the operand-size or the address size.
ARM 32-bit (like all other 32-bit RISCs that use 32-bit fixed-width instruction words) can't fit a 32-bit address as an immediate into a single instruction either: there'd be no room for an opcode to say what instruction it is.
The width of an instruction limits the number of registers you can have. With 3 registers per instruction (dst, src1, src2), AArch64's increase from 16 to 32 registers means that each instruction needs 3 * log2(32) = 3* 5 = 15 bits to encode the registers. Or fewer for instructions with only 2 or 1 registers. (e.g. mov-immediate or add-immediate). The rest of the space goes to number of possible opcodes, and the size of immediates.
To get an address into a register, ARM compilers will typically load it from a nearby pool of constants (with a PC-relative addressing mode).
The other option is what most RISC CPUs do: use a 2-instruction sequence to put a 16-bit immediate in the upper half of a register, then OR a 16-bit immediate into the low half. (Or use the lower half of a static address as the displacement to a load/store instruction that uses an addressing mode like register + 16-bit offset.)
MIPS is a good example of a very simple RISC, see it's ISA with binary encoding. Its lui reg, imm16 puts imm16 <<16 into a register. (Load Upper Immediate). Then lw dst, imm16(base_reg) is a load like I was talking about in the last paragraph.
Even in 64-bit code, most numbers are still small, so there's not much need for wider immediate operands (except for addresses). e.g. x86 still uses a choice of 32-bit or 8-bit immediate operands for add r64, imm. x86 being a variable length ISA saves space when immediates are between -128 and +127 in a lot of cases.

You could have a 4-bit CPU that operates on 4096 bit values, it just wouldn't be terribly fast at doing it. Any programming language with a "bignum" type will work this way, though usually not to the same extreme.
The way a 32-bit CPU can operate on 64-bit values is because the hardware or software allows it, no other reason.
It's not like we couldn't manipulate 64-bit integer values before we had 64-bit CPUs. Even the old Intel 80386 could support 64-bit operations and it was released in 1985.

Related

how does the processor read memory?

I'm trying to re-implement malloc and I need to understand the purpose of the alignment. As I understand it, if the memory is aligned, the code will be executed faster because the processor won't have to take an extra step to recover the bits of memory that are cut. I think I understand that a 64-bit processor reads 64-bit by 64-bit memory. Now, let's imagine that I have a structure with in order (without padding): a char, a short, a char, and an int. Why will the short be misaligned? We have all the data in the block! Why does it have to be on an address which is a multiple of 2. Same question for the integers and other types?
I also have a second question: With the structure I mentioned before, how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?

The effects can even include correctness, not just performance: C Undefined Behaviour (UB) leading to possible segfaults or other misbehaviour for example if you have a short object that doesn't satisfy alignof(short). (Faulting is expected on ISAs where load/store instructions require alignment by default, like SPARC, and MIPS before MIPS64r6. And possible even on x86 after compiler optimization of loops, even though x86 asm allows unaligned loads/stores except for some SIMD with 16-byte or wider.)
Or tearing of atomic operations if an _Atomic int doesn't have alignof(_Atomic int).
(Typically alignof(T) = sizeof(T) up to some size, often register width or wider, in any given ABI).
malloc should return memory with alignof(max_align_t) because you don't have any type info about how the allocation will be used.
For allocations smaller than sizeof(max_align_t), you can return memory that's merely naturally aligned (e.g. a 4-byte allocation aligned by 4 bytes) if you want, because you know that storage can't be used for anything with a higher alignment requirement.
Over-aligned stuff like the dynamically-allocated equivalent of alignas (16) int32_t foo needs to use a special allocator like C11 aligned_alloc. If you're implementing your own allocator library, you probably want to support aligned_realloc and aligned_calloc, filling those gaps that ISO C leave for no apparent reason.
And make sure you don't implement the braindead ISO C++17 requirement for aligned_alloc to fail if the allocation size isn't a multiple of the alignment. Nobody wants an allocator that rejects an allocation of 101 floats starting on a 16-byte boundary, or much larger for better transparent hugepages. aligned_alloc function requirements and How to solve the 32-byte-alignment issue for AVX load/store operations?
I think I understand that a 64-bit processor reads 64-bit by 64-bit memory
Nope. Data bus width and burst size, and load/store execution unit max width or actually-used width, don't have to be the same as width of integer registers, or however the CPU defines its bitness. (And in modern high performance CPUs typically aren't. e.g. 32-bit P5 Pentium had a 64-bit bus; modern 32-bit ARM has load/store-pair instructions that do atomic 64-bit accesses.)
Processors read whole cache lines from DRAM / L3 / L2 cache into L1d cache; 64 bytes on modern x86; 32 bytes on some other systems.
And when reading individual objects or array elements, they read from L1d cache with the element width. e.g. a uint16_t array may only benefit from alignment to a 2-byte boundary for 2-byte loads/stores.
Or if a compiler vectorizes a loop with SIMD, a uint16_t array can be read 16 or 32 bytes at a time, i.e. SIMD vectors of 8 or 16 elements. (Or even 64 with AVX512). Aligning arrays to the expected vector width can be helpful; unaligned SIMD load/store run fast on modern x86 when they don't cross a cache-line boundary.
Cache-line splits and especially page-splits are where modern x86 slows down from misalignment; unaligned within a cache line generally not because they spend the transistors for fast unaligned load/store. Some other ISAs slow down, and some even fault, on any misalignment, even within a cache line. The solution is the same: give types natural alignment: alignof(T) = sizeof(T).
In your struct example, modern x86 CPUs will have no penalty even though the short is misaligned. alignof(int) = 4 in any normal ABI, so the whole struct has alignof(struct) = 4, so the char;short;char block starts at a 4-byte boundary. Thus the short is contained within a single 4-byte dword, not crossing any wider boundary. AMD and Intel both handle this with full efficiency. (And the x86 ISA guarantees that accesses to it are atomic, even uncached, on CPUs compatible with P5 Pentium or later: Why is integer assignment on a naturally aligned variable atomic on x86?)
Some non-x86 CPUs would have penalties for the misaligned short, or have to use other instructions. (Since you know the alignment relative to an aligned 32-bit chunk, for loads you'd probably do a 32-bit load and shift.)
So yes there's no problem accessing one single word containing the short, but the problem is for load-port hardware to extract and zero-extend (or sign-extend) that short into a full register. This is where x86 spends the transistors to make this fast. (#Eric's answer on a previous version of this question goes into more detail about the shifting required.)
Committing an unaligned store back into cache is also non-trivial. For example, L1d cache might have ECC (error-correction against bit flips) in 32-bit or 64-bit chunks (which I'll call "cache words"). Writing only part of a cache word is thus a problem for that reason, as well as for shifting it to an arbitrary byte boundary within the cache word you want to access. (Coalescing of adjacent narrow stores in the store buffer can produce a full-width commit that avoids an RMW cycle to update part of a word, in caches that handle narrow stores that way). Note that I'm saying "word" now because I'm talking about hardware that's more word-oriented instead of being designed around unaligned loads/stores the way modern x86 is. See Are there any modern CPUs where a cached byte store is actually slower than a word store? (storing a single byte is only slightly simpler than an unaligned short)
(If the short spans two cache words, it would of course needs to separate RMW cycles, one for each byte.)
And of course the short is misaligned for the simple reason that alignof(short) = 2 and it violates this ABI rule (assuming an ABI that does have that). So if you pass a pointer to it to some other function, you could get into trouble. Especially on CPUs that have fault-on-misaligned loads, instead of hardware handling that case when it turns out to be misaligned at runtime. Then you can get cases like Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? where GCC auto-vectorization expected to reach a 16-byte boundary by doing some multiple of 2-byte elements scalar, so violating the ABI leads to a segfault on x86 (which is normally tolerant of misalignment.)
For the full details on memory access, from DRAM RAS / CAS latency up to cache bandwidth and alignment, see What Every Programmer Should Know About Memory? It's pretty much still relevant / applicable
Also Purpose of memory alignment has a nice answer. There are plenty of other good answers in SO's memory-alignment tag.
For a more detailed look at (somewhat) modern Intel load/store execution units, see: https://electronics.stackexchange.com/questions/329789/how-can-cache-be-that-fast/329955#329955
how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?
It doesn't, other than the fact it's running instructions which treat the data that way.
In asm / machine-code, everything is just bytes. Every instruction specifies exactly what to do with which data. It's up to the compiler (or human programmer) to implement variables with types, and the logic of a C program, on top of a raw array of bytes (main memory).
What I mean by that is that in asm, you can run any load or store instruction you want to, and it's up to you to use the right ones on the right addresses. You could load 4 bytes that overlap two adjacent int variable into a floating-point register, then and run addss (single-precision FP add) on it, and the CPU won't complain. But you probably don't want to because making the CPU interpret those 4 bytes as an IEEE754 binary32 float is unlikely to be meaningful.

modern processors and memory are built to optimize memory access as much as possible. One the current way of accessing memory is to address it not byte by byte but by an address of a bigger block, e.g. by an 8 byte blocks. You do not need 3 lower bits of the address this way. To access a certain byte within the block the processs needs to get the block at the aligned address, then shift and mask the byte. So, it gets slower.
When fields in the struct are not aligned, there is a risk of slowing down the access to them. Therefore, it is better to align them.
But the alignment requirements are based on the underlying platform. For systems which support word access (32 bit), 4-byte alignment is ok, otherwise 8-byte can be used or some other. The compiler (and libc) knows the requirements.
So, in your example char, short, char, the short will start with an odd byte position if not padded. To access it, the system might need to read the 64 bit word for the struct, then shift it 1 byte right and then mask 2 bytes in order to provide you with this byte.

As I understand it, if the memory is aligned, the code will be executed faster because the processor won't have to take an extra step to recover the bits of memory that are cut.
It's not necessarily an execution thing, an x86 has variable length instructions starting with single 8 bit instructions on up to a handful to several bytes, its all about being unaligned. but they have taken measures to smooth that out for the most part.
If I have a 64 bit bus on the edge of my processor that doesn't mean edge of chip that means edge of the core. The other side of this is a memory controller that knows the bus protocol and is the first place the addresses start to be decoded and the transactions start to split up down other buses toward their destination.
It is very much architecture and bus design specific and you can have architectures with different buses over time or different versions you can get an arm with a 64 bus or a 32 bit bus for example. But let's say we have a not atypical situation where the bus is 64 bits wide and all transactions on that bus are aligned on a 64 bit boundary.
If I were to do a 64 bit write to 0x1000 that would be a single bus transaction, which these days is some sort of write address bus with some id x and a length of 0 (n-1) then the other side acks that I see you want to do a write with id x, I am ready to take your data. Then the processor uses the data bus with id x to send the data, one clock per 64 bits this is a single 64 bit so one clock on that bus. and maybe an ack comes back or maybe not.
But if I wanted to do a 64 bit write to 0x1004, what would happen is that turns into two transactions one complete 64 bit address/data transaction at address 0x1000 with only four byte lanes enabled lanes 4-7 (representing bytes at address 0x1004-0x1007). Then a complete transaction at 0x1008 with 4 byte lanes enabled, lanes 0-3. So the actual data movement across the bus goes from one clock to two, but there is also twice the overhead of the handshakes to get to those data cycles. On that bus it is very noticeable, how the overall system design is though you may feel it or not, or may have to do many of them to feel it or not. But the inefficiency is there, buried in the noise or not.
I think I understand that a 64-bit processor reads 64-bit by 64-bit memory.
Not a good assumption at all. 32 bit ARMs have 64 bit buses these days the ARMv6 and ARMv7s for example come with them or can.
Now, let's imagine that I have a structure with in order (without padding): a char, a short, a char, and an int. Why will the short be misaligned? We have all the data in the block! Why does it have to be on an address which is a multiple of 2. Same question for the integers and other types?
unsigned char a 0x1000
unsigned short b 0x1001
unsigned char c 0x1003
unsigned int d 0x1004
You would normally use the structure items in the code something.a something.b something.c something.d. When you access something.b that is a 16 bit transaction against the bus. In a 64 bit system you are correct that if aligned as I have addressed it, then the whole structure is being read when you do x = something.b but the processor is going to discard all but byte lanes 1 and 2 (discarding 0 and 3-7), then if you access something.c it will do another bus transaction at 0x1000 and discard all but lane 3.
When you do a write to something.b with a 64 bit bus only byte lanes 1 and 2 are enabled. Now where more pain comes in is if there is a cache it is likely also constructed of a 64 bit ram to mate up with this bus, doesn't have to, but let's assume it does. You want to write through the cache to something.b, a write transaction at 0x1000 with byte lanes 1 and 2 enabled 0, 3-7 disabled. The cache ultimately gets this transaction, it internally has to do a read-modify write because it is not a full 64 bit wide transaction (all lanes enabled) so you are taking hit with that read-modify write from a performance perspective as well (same was true for the unaligned 64 bit write above).
The short is unaligned because when packed its address lsbit is set, to be aligned a 16 bit item in an 8 bit is a byte world needs to be zero, for a 32 bit item to be aligned the lower two bits of its address are zero, 64 bit, three zeros and so on.
Depending on the system you may end up on a 32 or 16 bit bus (not for memory so much these days) so you can end up with the multiple transfers thing.
Your highly efficient processors like MIPS and ARM took the approach of aligned instructions, and forced aligned transactions even in the something.b case that specifically doesn't have a penalty on a 32 nor 64 bit bus. The approach is performance over memory consumption, so the instructions are to some extent wasteful in their consumption to be more efficient in their fetching and execution. The data bus is likewise much simpler. When high level concepts like a struct in C are constructed there is memory waste in padding to align each item in the struct to gain performance.
unsigned char a 0x1000
unsigned short b 0x1002
unsigned char c 0x1004
unsigned int d 0x1008
as an example
I also have a second question: With the structure I mentioned before, how does the processor know when it reads its 64 bits that the first 8 bits correspond to a char, then the next 16 correspond to a short etc...?
unsigned char c 0x1003
the compiler generates a single byte sized read at address 0x1003, this turns in to that specific instruction with that address and the processor generates the bus transaction to do that, the other side of the processor bus then does its job and so on down the line.
The compiler in general does not turn a packed version of that struct into a single 64 bit transaction that gives you all of the items, you burn a 64 bit bus transaction for each item.
it is possible that depending on the instruction set, prefetcher, caches and so on that instead of using a struct at a high level you create a single 64 bit integer and you do the work in the code, then you might or might not gain performance. This is not expected to perform better on most architectures running with caches and such, but when you get into embedded systems where you may have some number of wait states on the ram or some number of wait states on the flash or whatever code storage there is you can find times where instead of fewer instructions and more data transactions you want more instructions and fewer data transactions. code is linear a code section like this read, mask and shift, mask and shift, etc. the instruction storage may have a burst mode for linear transactions but data transactions take as many clocks as they take.
A middle ground is to just make everything a 32 bit variable or a 64 bit, then it is all aligned and performs relatively well at the cost of more memory used.
Because folks don't understand alignment, have been spoiled by x86 programming, choose to use structs across compile domains (such a bad idea), the ARMs and others are tolerating unaligned accesses, you can very much feel the performance hit on those platforms as they are so efficient if everything is aligned, but when you do something unaligned it just generates more bus transactions making everything take longer. So the older arms would fault by default, the arm7 could have the fault disabled but would rotate the data around the word (nice trick for swapping 16 bit values in a word) rather than spill over into the next word, later architectures default to not fault on aligned or most folks set them to not fault on aligned and they read/write the unaligned transfers as one would hope/expect.
For every x86 chip you have in your computer you have several if not handfuls of non-x86 processors in that same computer or peripherals hanging off that computer (mouse, keyboard, monitor, etc). A lot of those are 8-bit 8051s and z80s, but also a lot of them are arm based. So there is lots of non-x86 development going on not just all the phones and tablets main processors. Those others desire to be low cost and low power so more efficiency in the coding both in its bus performance so the clock can be slower but also a balance of code/data usage overall to reduce the cost of the flash/ram.
It is quite difficult to force these alignment issues on an x86 platform there is a lot of overhead to overcome its architectural issues. But you can see this on more efficient platforms. Its like a train vs a sports car, something falls off a train a person jumps off or on there is so much momentum its not noticed one bit, but step change the mass on the sports car and you will feel it. So trying to do this on an x86 you are going to have to work a lot harder if you can even figure out how to do it. But on other platforms its easier to see the effects. Unless you find an 8086 chip and I suspect you can feel the differences there, would have to pull out my manual to confirm.
If you are lucky enough to have access to chip sources/simulations then you can see this kind of thing happening all over the place and can really start to hand tune your program (for that platform). Likewise you can see what caching, write buffering, instruction prefetching in its various forms and so on do for overall performance and at times create parallel periods of time where other not-so-efficient transactions can hide, and or intentional spare cycles are created so that transactions that take extra time can have a time slice.

Too many Local Variables and Stack Base Pointer Offset Overflows

So the %ebp (stack base pointer) + a constant is used to reference local variables in assembly. What if there are too many local variables and the required constant is soo large that it does not fit in one line of assembly code (32 or 64 bits)? How are edge cases like this handled?
For example, in the above image assume that there are 2^30 local variables. To reference the last one we would need an offset of 2^32. If we are working in a 32bit environment this offset won't fit in one line of code considering there is the opcode, destination etc also in that same line.

In its 32 bit and 64 bit operation modes, the x86 architecture addressing modes allow for either no displacement, an 8 bit displacement, or a 32 bit displacement.
In 32 bit mode, a 32 bit displacement is sufficient to describe every possible displacement (and thus, every possible stack offset). For your concern: The stack couldn't possibly contain 230 variables as that would be 4 GiB of stack space, leaving no space to store the machine code.
In 64 bit mode, it is indeed possible to have displacements that cannot be described with a 32 bit displacement. This rarely happens in reality (which is why the AMD engineers decided to leave the displacement size at 32 bit) but it can happen occasionally. In such cases, the displacement has to be applied through a register:
mov rax,0x123456789abcdef0 ; displacement
mov eax,[rax,rbp] ; value

ARM Cortex A7: avoid memory veneers?

On ARMv7, which is Thumb capable, is it right that we can avoid all the veneers by using the BX instruction?
Since this instruction takes a 32 bit register, are we good?
If yes, when I see veneers in the generated code, I should specialize the output for my machine, right?
Thanks

Yes, since BX takes a 32-bit register, there's no need for veeners because you can cover the whole addressing space.
Of course you'd need to load a 32-bit value into the register, which usually means constant pooling, so if you are looking to squeeze every cycle out of it and your program is not too large you're better off with relative branches. As #Notlikethat notes, if you don't already have the address in a register there's no point in using BX when you can just LDR PC, ... (unless you need to support ARMv4T interworking).
Relative, non-conditional, 32-bit Thumb branches have a 24-bit addressing space, so you can reach +/- 16MB (for others see here). If you're doing ELF, be really careful with 16-bit relative Thumb branches. A 32-bit branch will generate a 24-bit relocation and the linker will insert a veener if the target can't be addressed with 24 bits. A 16-bit branch generates a 11-bit relocation and ELF for ARM specifies that the linker is not required to generate veeners for those, so you'd risk a link-time out-of-range branch.

How does one access individual characters of a string properly aligned in memory, on ARM platform?

Since (from what I have read) ARM9 platform may fail to correctly load data at an unaligned memory address, let's assume unaligned meaning that the address value is not multiple of 2 (i.e. not aligned on 16-bit), then how would one access say, fourth character on a string of characters pointed to by a properly aligned pointer?
char buf[] = "Hello world.";
buf[3]; // (buf + 3) is unaligned here, is it not?
Does compiler generate extra code, as opposed to the case when buf + 3 is properly aligned? Or will the last statement in the example above have undesired results at runtime - yielding something else than the fourth character, the second l in Hello?

Byte accesses don't have to be aligned. The compiler will generate a ldrb instruction, which does not need any sort of alignment.
If you're curious as to why, this is because ARM will load the entire aligned word that contains the target byte, and then simply select that byte out of the four it just loaded.

The concept to remember is that the compiler will try to optimize access based on the type in order to get the most efficiency of your processor. So when accessing ints, it'll want to use things like the ldr instruction which will fault if it's an unaligned access. For something link a char access, the compiler will work some of the details for you. Where you have to be concerned are things like:
Casting pointers. If you cast a char * to an int * and the pointer is not aligned correctly, you'll get an alignment trap. In general, it's okay to cast down (from an int to a char), but not the other way around. You would not want to do this:
char buf[] = "12345678";
int *p = &buf[1];
printf("0x%08X\n", *p); // *p is badness here!
Trying to pull data off the wire with structures. I've seen this done a lot, and it's just plain bad practice. Endianess issues aside, you can cause an alignment trap if the elements aren't aligned correctly for the platform.
FWIW, casting pointers is probably the number one issue I've seen in practice.
There's a great book called Write Portable Code which goes over quite a few details about writing code for multiple platforms. The sample chapter on the linked site actually contains a section talking about alignment.
There's a little more that's going on too. Memory functions, like malloc, also give you back aligned blocks (generally on a double-word boundary) so that you can write in data and not hit an alignment fault.
One last bit, while newer ARMs can cope with unaligned accesses better, that does not mean they're performant. It just means they're tolerant. The same can be said for the X86 processors too. They'll do the unaligned access, but you're forcing extra memory fetches by doing so.

Most systems use byte based addressing. The address 0x1234 is in terms of bytes for example. Assume that I mean 8 bit bytes for this answer.
The definition of unaligned as to do with the size of the transfer. A 32 bit transfer for example is 4 bytes. 4 is 2 to the power 2 so if the lower 2 bits of the address are anything other than zeros then that address is an unaligned 32 bit transfer.
So using a table like this or just understanding powers of 2
8 1 0 []
16 2 1 [0]
32 4 2 [1:0]
64 8 3 [2:0]
128 16 4 [3:0]
the first column is the number of bits in the transfer. the second is the number of bytes that represents, the third is the number of bits at the bottom of the address that have to be zero to make it an aligned transfer, and the last column describes those bits.
It is not possible to have an unaligned 8 bit transfer. Not on arm, not on any system. Please understand that.
16 bit transfers. Once we get into transfers larger than 16 bits then you can START to talk about being unaligned. Then problem with unaligned transfers has to do with the number of bus cycles. Say you are doing 16 bit transfers on a system with a 16 bit wide bus and 16 bit wide memories. That means that we have items at memory at these addresses for example, address on left, data on right:
0x0100 : 0x1234
0x0102 : 0x5678
If you want to do a 16 bit transfer that is aligned the lsbit of your address must be zero, 0x100, 0x102, 0x104, etc. Unaligned transfers would be at addresses with the lsbit set, 0x101, 0x103, 0x105, etc. Why are they a problem? In this hypothetical (there were and are still real systems like this) system in order to get two bytes at address 0x0100 we only need to access the memory one time and take all 16 bits from that one address resulting in 0x1234. But if we want 16 bits starting at address 0x0101. We have to do two memory transactions 0x0100 and 0x0102 and take one byte from each combine those to get the result which little endian is 0x7812. That takes more clock cycles, more logic, etc. Inefficient and costly. Intel x86 and other systems from that era which were 8 or 16 bit processors but used 8 bit memory, everything larger than an 8 bit transfer was multiple clock cycles, instructions themselves took multiple clock cycles to execute before the next one could start, burning clock cycles and complication in the logic was not of interest (they saved themselves from pain in other ways).
The older arms may or may not have been from that era, but post acorn, the armv4 to the present is a 32 bit system from a perspective of the size of the general purpose registers, the data bus is 32 or 64 bits (the newest arms have 64 bit registers and I would assume if not already 128 bit busses) depending on your system. The core that put ARM on the map the ARM7TDMI which is an ARMv4T, I assume is a 32 bit data bus. The ARM7 and ARM9 ARM ARM (ARM Architectural Reference Manual) changed its language on each revision (I have several revisions going back to the paper only ones) with respect to words like UNPREDICTABLE RESULTS. When and where they would list something as bad or broken. Some of this was legal, understand ARM does not make chips, they sell IP, back then it was masks for a particular foundry today you get the source code to their core and you deal with it. So to survive you need a good legal defense, your secrets are exposed to customers, some of these items that were claimed not to be supported actually have deterministic results, if ARM were to find a clone (which is yet another legal discussion) with these unpredictable results being predictable and matching what arms logic does you have to be pretty good at explaining why. The clones have been crushed when they have tried (that or legally become licensed arm cores) so some of this is just interesting history. Another arm manual described quite clearly what happens when you do an unaligned transfer on the older ARM7 systems. And it is a bit of a duh moment when you see it, quite obvious what was going on (just plain keep it simple stupid logic).
The byte lanes rotated. On a 32 bit bus somewhere in the system, likely not on the amba/axi bus but inside the memory controller you would effectively get this:
0x0100 : 0x12345678
0x0101 : 0x78123456
0x0102 : 0x56781234
0x0103 : 0x34567812
address on the left resulting data on the right. Now why is that obvious you ask and what is the size of that transfer? The size of the transfer is irrelevant, doesnt matter, look at that address/data this way:
0x0100 : 0x12345678
0x0101 : 0xxx123456
0x0102 : 0xxxxx1234
0x0103 : 0xxxxxxx12
Using aligned transfers, 0x0100 is legal for 32, 16, and 8 bit and look at the lower 8, 16, or 32 bits you get the right answer with the data as shown. For address 0x0101 only an 8 bit transfer is legal, and the lower 8 bits of that data is in the lower 8 bits, just copy those over to the registers lower 8 bits. for address 0x0102 8 and 16 are legal, unaligned, transfers and 0x1234 is the right answer for 16 bit and 0x34 for 8. lastly 0x0103 8 bit is the only transfer size without alignment issues and 0x12 is the right answer.
This above information is all from publicly available documents, no secrets here or special insider knowledge, just generic programming experience.
ARM put an exception in, data abort or prefetch abort (thumb is a separate topic) to discourage the use of unaligned transfers as do other architectures. Unfortunately x86 has lead people to be very lazy and also not care about the performance hit that they incur when doing such a thing on an x86, which allows the transfer at the price of extra cycles and extra logic. The prefetch abort if I remember was not on by default on the ARM7 platforms I used, but was on by default on the ARM9 platforms I used, my memory could be wrong and since I dont know how the defaults worked that could have been a strap option on the core so it could have varied from chip to chip, vendor to vendor. You could disable it and do unaligned transfers so long as you understood what happened with the data (rotate not spill over into the next word).
More modern ARM processors do support unaligned transfers and they are as one would expect, I wont use 64 bit examples here to save typing and space but go back to that 16 bit example to paint the picture
0x0100: 0x1234
0x0102: 0x5678
With a 16 bit wide system, memory and bus, little endian, if you did a 16 bit unaligned transfer at address 0x0101 you would expect to see 0x7812 and that is what you get now on the modern arm systems. But it is still a software controlled feature, you can enable exceptions on unaligned transfers and you will get a data abort instead of a completed transfer.
As far as your question goes look at the ldrb instruction, that instruction does an 8 bit read from memory, being 8 bit there is no such thing as unaligned all addresses are valid, if buf[] happened to live at address 0x1234 then buf[3] is at address 0x1237 and that is a perfectly valid address for an 8 bit read. No alignment issues of any kind, no exceptions will fire. Where you would get into trouble is if you do one of these very ugly programming hacks:
char buf[]="hello world";
short *sptr;
int *iptr;
sptr=(short *)&buf[3];
iptr=(int *)&buf[3];
...
something=*sptr;
something=*iptr;
...
short_something=*(short *)&buf[3];
int_something=*(int *)&buf[3];
And then yes you would need to worry about unaligned transfers as well as hoping that you dont have any compiler optimization issues making the code not work as you had thought it would. +1 to jszakmeister for already covering this sub topic.
short answer:
char buf[]="hello world";
char is generally assumed to mean an 8 bit byte so this is a quantity of 8 bit items. certainly compiled for ARM that is what you will get (or mips or x86 or power pc, etc). So accessing buf[X] for any X within that string, cannot be unaligned because
something = buf[X];
Is an 8 bit transfer and you cant have unaligned 8 bit transfers. If you were to do this
short buf[]={1,2,1,2,3,2,1};
short is assumed but not always the case, to be 16, bit, for the arm compilers I know it is 16 bit. but that doesnt matter buf[X] here also cannot be unaligned because the compiler computes the offset for you. As follows address of buf[X] is base_address_of_buf + (X<<1). And the compiler and/or linker will insure, on ARM, MIPS, and other systems that buf is placed on a 16 bit aligned address so that math will always result in an aligned address.

Is it atomic to access(load/store) 32 bit integer when using ARM Thumb instruction set?

Using ARM cortex with thumb instruction set and Keil realview compiler, is it safe to access to 32 bit integer? Since the thumb register set is 16 bits, does this mean, fetching a 32 bit int needs 2 machine instructions? If so, accessing 32 bit will not be atomic. If my worry is true, does it mean that int assignment should be protected by a critical region?

Thumb uses the same 32-bit registers as ARM, so there's no issue there. What's halved is the instruction size (and even that is not strictly true for Thumb-2).
Do not worry, you don't need to change your code if you're compiling to Thumb.

The instruction size is 16-Bit in thumb mode, not the register size.
This means that a constant assignment - as in i=1; - can be seen as atomic. Although more than one instruction is generated, only one will modify the memory location of i even if i is int32_t.
But you need a critical section once you to things like i=i+1. That is of course not atomic.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight