This question already has answers here:
how does the processor read memory?
I am trying to learn how memory is arranged and handled by a computer, and I don't get the concept of alignment.
For instance, in a 32-bit architecture, why do we say that a short (2 bytes) is unaligned when it is not located at an even address, even if it fits entirely within a single 32-bit word?
Because if the processor reads 32 bits at a time, and a char is at address 0x00, followed by a short (addresses 0x01 and 0x02), followed by another char (0x03), then suddenly there is no problem: no data is cut, since the processor reads all 4 bytes in one go.
So the short is aligned, isn't it?
The question suggests a processor that has 32 wires connected to a bus, for data, with possibly other wires for control. When it wants data from memory, it puts an address on the bus, requests a read from memory, waits for the data, and reads it through those 32 wires.
In typical processor designs, those 32 wires are connected to some temporary internal register which itself has connections to other registers. It is easy to move those 32 bits around as a block, with each bit going on its own wire.
If we want to move some of the bits within the 32, we need to shift them. This might be done with various hardware, such as a shifting unit that we put bits into, request a certain amount of shift, and read a result from. Internally, that shifting unit will have a variety of connections and switches to do its job.
Typically, such a shifting unit will be able to move eight bits from any of four positions (starting at bits 0, 8, 16, or 24) to the base position (0). That way, an instruction such as “load byte” can be effected by reading 32 bits from memory (because it only comes in 32-bit chunks), then using the shifting unit to get the desired byte. That shifting unit might not have the wires and switches needed to move any arbitrary set of bits (say, starting at 7, 13, or 22) to the base position. That would take many more wires and switches.
The processor also needs to be able to effect a load-16-bits instruction. For that, the shifting unit will be able to move 16 bits from positions 0 or 16 to position 0. Certainly the engineers could design it to also move 16 bits from position 8 to position 0. But that requires more wires and switches, which cost money, silicon, and energy. In many processors, a decision was made that this expense was not worthwhile, so the capability is not implemented.
In consequence, the hardware simply cannot shift data from bytes 1 and 2 to bytes 0 and 1 in the course of the loading process. (There might be other shifters in the processor, such as in a general-purpose logic unit for implementing shift instructions, but those are generally separate and accessed through instruction dispatching and control mechanisms. They are not in the line of components used in loading from memory.)
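As a software picture of what that load path can and cannot do, here is a minimal C sketch (an illustration only, little-endian assumed): the hardware shifter supports exactly these byte and halfword positions, and nothing in between.

#include <stdint.h>

/* What a load-byte instruction effectively does with the 32-bit chunk:
   shift by 0, 8, 16, or 24 bits, then keep the low byte. */
static uint8_t load_byte(uint32_t chunk, unsigned byte_offset /* 0..3 */)
{
    return (uint8_t)(chunk >> (byte_offset * 8u));
}

/* The 16-bit load only has shifts for offsets 0 and 2; an offset of 1
   (bits starting at position 8) is the case the hardware leaves out. */
static uint16_t load_half(uint32_t chunk, unsigned byte_offset /* 0 or 2 */)
{
    return (uint16_t)(chunk >> (byte_offset * 8u));
}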
Alignment is a definition. Assume 8-bit bytes and byte-addressable memory. An 8-bit byte (unsigned char) cannot be unaligned. For a 16-bit halfword to be aligned, the lsbit of its address must be zero; for a 32-bit word, the lower two bits must be zero; for a 64-bit doubleword, the lower three bits; and so on. So if your 16-bit unsigned short is at an odd address, then it is unaligned.
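That definition translates directly into a check on the address bits; a minimal sketch in C (the helper name is mine):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Aligned means: the low log2(size) bits of the address are zero.
   'size' must be a power of two (1, 2, 4, 8, ...). */
static bool is_aligned(uintptr_t addr, size_t size)
{
    return (addr & (size - 1u)) == 0u;
}

So is_aligned(0x1001, 2) is false (a short at an odd address), is_aligned(0x1002, 2) is true, and is_aligned(0x1002, 4) is false.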
A "32 bit system" does not mean a 32-bit bus; bus widths do not necessarily match the size of the processor registers or the instruction size or whatever, so there is no reason to make that assumption. Having said that, if you are talking MIPS or ARM, then yes, the buses are most likely 32 or 64 bits wide for their 32-bit-register processors, and 64 or perhaps 128 bits (likely 64) for the 64-bit processors. But an x86 has 8-bit opcodes, 8-, 16-, 32-, and 64-bit registers, and variable-length instructions when you add up the bytes they can take, so there is no way to classify its size: is it an 8-bit processor because of its 8-bit opcodes, 32- or 64-bit due to its register sizes, or 128-, 256-, 512-bit, etc. due to its bus sizes?
You mentioned 32, so let's stick with that. Say I want to walk through an array of bytes doing writes, and I have a 32-bit-wide data bus, one of the typical designs you see today. Let's say the other side is a cache built of 32-bit-wide SRAMs to line up with the processor-side bus; we won't worry about how the DRAM is implemented on the far side. You will likely have a write data bus, a read data bus, and either separate write and read address buses or one address bus with a way to indicate a read or write transaction.
As far as the bus is concerned, all transactions are 32 bits. You don't necessarily expect the unused byte lanes to float (z state); you expect them to be high or low during valid clocks on that bus (between valid clock cycles, sure, the bus may go high-z).
A read transaction will typically be (let's assume it is) at an address aligned to the bus width, so a 32-bit-aligned address (either on the bus or on the far side). There isn't usually a notion of byte-lane enables on a read; the processor internally isolates the bytes of interest and discards the others. Some buses have a length field on the address bus where it makes sense, plus cache control signals and other signals.
An aligned 32-bit read would be, say, address 0x1000 or 0x1004 with a length of 0 (length fields are often encoded as n-1). The address bus does its handshake with a unique transaction id; later, the read data bus, ideally in a single clock cycle, will carry those 32 bits of data with that id. The processor sees that, completes the transaction (there might be more handshaking), extracts all 4 bytes, and does what the instruction said to do with them.
A 64-bit access aligned on a 32-bit boundary would have a length of one: one address-bus handshake, two clock cycles' worth of data on the read data bus. A 16-bit aligned transaction at 0x1000 or 0x1002 will, let's say, be a read of 0x1000, and the processor will discard either lanes 0 and 1 or lanes 2 and 3. Some bus designs align the bytes on the lower lanes, so you might see a bus where the two bytes always come back on lanes 0 and 1 for a 16-bit read.
An unaligned 32-bit read takes two bus cycles: twice the overhead, twice the number of clocks. A 32-bit read at 0x1002 is one read of 0x1000, where the processor saves two of the bytes, then a read of 0x1004, where the processor saves two of those bytes, combines the four into the 32-bit number, and then does what the instruction says. So instead of the 5 or 8 clocks or whatever the minimum is for this bus, it is now twice as many, and likely back to back rather than interleaved.
An unaligned 16-bit read at address 0x1001 would hopefully be a single 32-bit read, but an unaligned 16-bit read at address 0x1003 is two transactions, twice the clocks, twice the overhead: one at 0x1000 and one at 0x1004, saving one byte from each.
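In C terms, the combining the processor does for the 0x1002 case looks roughly like this (a sketch, not any particular bus protocol; little-endian, with memory modeled as an array of aligned 32-bit words):

#include <stdint.h>

/* Two aligned bus reads plus a merge: the cost of one unaligned 32-bit read.
   'mem' is a hypothetical word-indexed view of memory. */
static uint32_t read32_at_0x1002(const uint32_t *mem)
{
    uint32_t first  = mem[0x1000 / 4];  /* bytes 0x1000..0x1003: keep upper two */
    uint32_t second = mem[0x1004 / 4];  /* bytes 0x1004..0x1007: keep lower two */
    return (first >> 16) | (second << 16);
}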
Writes are the same but with an additional penalty. For an aligned 32-bit write, say at 0x1000: one bus transaction, address, write data, done. The cache, being 32 bits wide in this example, can simply write those 32 bits to SRAM in one SRAM transaction. An unaligned 32-bit write, say at 0x1001, would be two complete bus transactions as expected, taking twice the number of bus clocks, but the SRAM will also take two or more clocks, because you need to read-modify-write the SRAM; you can't just write. In order to write the bytes at 0x1001 to 0x1003, you need to read 32 bits from SRAM, change three of those bytes while leaving the lowest one alone, and write that back. Then, when the other transaction comes in, you write the 0x1004 byte while preserving the other three bytes in that SRAM location.
Every byte write is a single bus transaction, but each one also incurs the read-modify-write. Note that depending on how many clocks the bus takes and how many transactions can be in flight at a time, the read-modify-write of the SRAM might be invisible: you might not be able to get data to the cache fast enough for a bus transaction to have to wait on the SRAM read-modify-write. But in another similar question (this has been asked many times here), there is a platform where this cost was demonstrated.
So you can now tell me how the 16-bit write transactions will go: they also incur a read-modify-write at the cache for every one of them, and if the address is, say, 0x1003, you get two bus transactions and two read-modify-writes.
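A sketch of the read-modify-write that the 32-bit-wide cache SRAM has to do for any write narrower than a word (little-endian byte lanes; names are mine):

#include <stdint.h>

/* One byte write = one SRAM read + merge + one SRAM write. */
static void sram_write_byte(uint32_t *sram, uint32_t byte_addr, uint8_t value)
{
    uint32_t word  = sram[byte_addr / 4u];   /* read            */
    uint32_t shift = (byte_addr & 3u) * 8u;  /* pick byte lane  */
    word &= ~((uint32_t)0xFFu << shift);     /* clear that lane */
    word |= (uint32_t)value << shift;        /* modify          */
    sram[byte_addr / 4u] = word;             /* write back      */
}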
One of the beauties of the cache, though, is that even though DRAMs come in 8-, 16-, and 32-bit parts (count how many chips are on a DRAM stick: often 8 or 9, 4 or 5, 2 or 3, or some multiple of those; 8 parts is likely a 64-bit-wide bus at 8 bits per part, 16 parts a 64-bit-wide bus at 8 bits per part, dual rank, and so on), the transactions are done in 32- or 64-bit widths, which is kind of the point of a cache. If we had to do a read-modify-write at the DRAM's slow speed, that would be horrible; instead we read-modify-write at cache/SRAM speed, and all cache-line evictions and fills are multiples of the DRAM bus width, so 64 or 2x64 or 4x64, etc. per cache line.
Related
My question is about Chapter 5 in this link.
I have an Error Correction Code handler which simply increments the program counter (PC) by 2 or 4 bytes according to the length of the instruction at the time of the exception. The core is an e200z4.
As far as I know, the e200z4 can also support fixed-length instructions of 4 bytes.
The thing I don't understand is this: to determine whether Variable Length Instructions (VLE) are enabled, we need to check the VLEMI bit in the ESR (Exception Syndrome Register). However, this register always contains 0x00000000. The only interrupt we end up in is the Machine Check Interrupt (IVOR1) (during power-on and power-off tests with increasing on and fixed off intervals).
So why does the CPU not provide the information about the length of the instruction (whether VLE was in use at the moment of the interrupt), for instance via the VLEMI bit inside the ESR? How can I determine whether the instruction at the time of the interrupt is 2 or 4 bytes long, i.e. fixed length or variable length?
Note 1: isOpCode32Bit below decodes the opcode to determine the instruction length, but it is relevant only if isFixedLength is 0, i.e. when (syndrome & VLEMI_MASK) is nonzero. So we need to get the VLEMI value into syndrome somehow, but the ESR seems to always be 0x00 (why?).
Note 2: As mentioned before, we always end up in IVOR1, and the address of the instruction right before the interrupt is reachable (provided in a register).
// IVOR1 (Machine Check Interrupt, assembly part):
ASSEMBLY(mfmcsr r7)          // copy MCSR into register 7 (MCSR in Chapter 5 in the link)
ASSEMBLY(store r7, &syndrome)

// IVOR2:
ASSEMBLY(mfesr r7)           // copy ESR into register 7 (ESR in Chapter 5 in the link)
ASSEMBLY(store r7, &syndrome)
------------------------------------------------------
#define VLEMI_MASK 0x00000020uL
isFixedLength = ((syndrome & VLEMI_MASK) == 0);
if (isFixedLength || isOpCode32Bit)
{
PC += 4; // instruction is 32-bit, increase PC by 4
}
else
{
PC += 2; // instruction is 16-bit, increase PC by 2
}
When it comes to how these exception handlers work in real systems:
Sometimes handling the exception only requires servicing a page fault (e.g. via copy on write or disc reload). In such cases, we don't even need to know the length of the instruction, just the effective memory address the instruction is accessing, and the CPUs generally offer that value. If the page fault can be serviced, then re-running that faulting instruction (without advancing the PC) is appropriate (and if not, then halting the program, also without advancing the PC, is appropriate.)
In other cases, such as software emulation for instructions not present in this hardware, presumably hardware designers consider that such a software handler needs to decode the faulting instruction in order to emulate it, and so will figure out the instruction length anyway.
Thus, hardware turns the job of understanding the faulting instruction over to software. Such system software needs deep knowledge of the instruction set architecture, and will likely also require customization for each different hardware instantiation of that instruction set.
So, why does the CPU not provide information about the length of the instruction at the moment of interrupt inside ESR?
No CPU that I know of tells us the length of the instruction that caused an exception. If one did, that would be convenient, but only for toy exception handlers. For real systems, it is ultimately not a true burden.
How to determine if an instruction is long or short at the event of an exception? (Variable Length Instructions)
Decode the instruction (while considering any instruction modes the CPU was in at the time of exception)!
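As a minimal sketch of that decode-it-yourself approach, reusing the question's isOpCode32Bit as an assumed external decoder (the actual opcode test must come from the core's VLE reference manual, and the handler must be able to read the faulting instruction):

#include <stdint.h>

/* Assumed to exist, per the question: inspects the first halfword of a VLE
   instruction and reports whether it is a 32-bit encoding. */
extern int isOpCode32Bit(uint16_t firstHalfword);

/* Compute the PC of the instruction after the faulting one. */
static uint32_t next_pc(uint32_t faulting_pc, int vle_in_use)
{
    if (!vle_in_use)
        return faulting_pc + 4u;  /* fixed-length encoding: always 4 bytes */

    uint16_t hw = *(const uint16_t *)(uintptr_t)faulting_pc;
    return faulting_pc + (isOpCode32Bit(hw) ? 4u : 2u);
}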
I am new to the embedded world and would like to know whether passing the value through an array before writing to EEPROM does any good for reliability or accuracy. I am using the I2C protocol here. Below are the two ways of writing a value to my EEPROM chip. I see that people mostly store data into an array before writing. If it is neither for reliability nor accuracy, what is the reason behind it?
uint64_t OperationTime;
uint8_t e2prom_w_buf[256];
uint8_t i;

for (i = 0; i < 8; i++)
{
    e2prom_w_buf[i] = (uint8_t)(OperationTime >> (i * 8)); /* serialize byte i, little-endian */
}
e2prom_PageWrite(e2prom_w_buf, Address, 8);
versus:
uint8_t i;
for (i = 0; i < 8; i++)
{
    e2prom_PageWrite(OperationTime >> (i * 8), Address, 8); // do I need a uint8_t cast here for OperationTime?
}
OperationTime is incremented once per second in a timer interrupt function.
The I2C bus is set up at 100 kHz.
Writing an EEPROM in a block is faster than multiple single-byte operations. You should look at the data sheet for the specific EEPROM to understand why, but just to pick a typical SPI EEPROM as an example (see note; AT25M01 in this case), it has a Byte Write sequence thus:
Note: I chose an SPI part for illustration because the timing diagrams are a little simpler. I note you mention I2C; the principle is the same, the bus transactions are just more complicated.
[Byte Write timing diagram, ©2019 Microchip Technology Inc.]
and a Page Write thus:
[Page Write timing diagram, ©2019 Microchip Technology Inc.]
The Page Write can write up to 256 bytes in a burst given only the start address, while for a Byte Write you have to send the address of every byte and wait for the write to complete (busy state) before you can write the next one. So you have to send 5 bytes on the bus for every single-byte write, but only 4 + n bytes for an n-byte Page Write. In terms of SPI (or I2C, in your case) bus activity, writing your 8 bytes as single-byte writes requires nearly 4 times the number of bytes to be sent over the bus.
Also, in the case of the AT25M01 in this example, the write cycle time is "up to 5 ms": that is 5 ms for one Byte Write, or 5 ms for an entire Page Write. So with byte writes, the 8 bytes in your example may take as much as 40 ms, whereas a Page Write will take no more than 5 ms.
Another benefit of page writes is that on many MCUs you can send the data using DMA (Direct Memory Access), so that there is little or no software overhead in writing to the EEPROM, and the MCU can do other work.
The memory of some EEPROM devices is organized in pages, each consisting of multiple bytes. Writing multiple bytes into the same page takes the same time as writing a single byte, so a page write operation is preferred for faster writes/programming.
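The reason behind the buffer, then, is simply to get all 8 bytes into one contiguous block so the driver can issue a single Page Write; a minimal sketch reusing the question's e2prom_PageWrite (its exact signature here is an assumption):

#include <stdint.h>

extern void e2prom_PageWrite(const uint8_t *buf, uint16_t addr, uint16_t len);

/* One bus transaction and one EEPROM write cycle for all 8 bytes,
   instead of 8 transactions and 8 write cycles. */
static void store_operation_time(uint64_t t, uint16_t addr)
{
    uint8_t buf[8];
    for (unsigned i = 0; i < 8u; i++)
        buf[i] = (uint8_t)(t >> (i * 8u));  /* little-endian serialization */
    e2prom_PageWrite(buf, addr, 8u);
}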
Can anybody explain the usage of the EN4B command of Micron SPI chips?
I want to know the difference between 3-byte and 4-byte address mode in SPI.
I was going through the SPI drivers when I found this command.
Thanks in advance!
From a legacy point of view, SPI commands have always used 3 bytes for the address affected by their operation.
This was fine, as 24 bits can address up to 16 MiB (128 Mibit).
When the flashes grew larger, it became necessary to switch from 3-byte to 4-byte addressing.
Whenever you have doubts about the hardware, you can find the answers in the proper datasheet; I don't know which specific chip you are referring to, however.
I found the Micron N25Q512A NOR flash, which at 512 Mibit (64 MiB) needs a form of 4-byte addressing; from its datasheet you can learn that:
1. There are 3-byte legacy commands and new 4-byte commands; for example, 03h and 13h for the single read.
2. You can supply a default fourth address byte with a specific register: the Extended Address Register lets you choose which region of the flash the legacy commands address.
3. You can enable 4-byte addressing for the legacy commands, either by writing the appropriate bit in the Nonvolatile Configuration Register or with the ENTER / EXIT 4-BYTE ADDRESS MODE commands (opcodes B7h and E9h, respectively).
This Linux patch also has some insights, basically telling us that some chips support only one of the three options above.
Macronix seems to have first opted for option 3 only, and Spansion for option 1.
Checking some of their datasheets seems to suggest that both now support all three methods.
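As a sketch of option 3 in practice, with a hypothetical spi_write_cmd() standing in for whatever the platform's SPI driver provides (opcodes as listed above; on some parts a WRITE ENABLE is required first, so check the datasheet):

#include <stdint.h>

/* Hypothetical one-byte command transfer via the platform's SPI driver. */
extern void spi_write_cmd(uint8_t opcode);

#define CMD_WREN 0x06u /* WRITE ENABLE (may be required before EN4B; check the datasheet) */
#define CMD_EN4B 0xB7u /* ENTER 4-BYTE ADDRESS MODE */
#define CMD_EX4B 0xE9u /* EXIT 4-BYTE ADDRESS MODE */

static void flash_enter_4byte_mode(void)
{
    spi_write_cmd(CMD_WREN);
    spi_write_cmd(CMD_EN4B); /* legacy commands now take 4 address bytes */
}

static void flash_exit_4byte_mode(void)
{
    spi_write_cmd(CMD_WREN);
    spi_write_cmd(CMD_EX4B);
}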
I am trying to use HC-SR04 sensors on the BeagleBone Black (adapted from this code - https://github.com/luigif/hcsr04).
I got it working for 4 different sensors individually, and am now unsure how to combine them into one program.
Is there a way to send the triggers and receive the echoes simultaneously, such that the interrupts arrive as different events to the C program?
Running them one after the other is the last option we have in mind.
Russ is correct - since there are two PRU cores in the BeagleBone's AM335x processor, there's no way to run 4 instances of that PRU program simultaneously. I suppose you could load one compiled for one set of pins, take a measurement, stop it, then load a different binary compiled for a sensor on different pins, but that would be a pretty inefficient (and ugly, IMHO) way to do it.
If you know any assembly it should be pretty straight-forward to update that code to drive all 4 sensors (PRU assembly instructions). Alternatively you could start from scratch in C and use the clpru PRU C compiler as Russ suggested, though AFAIK that's still in somewhat of a beta state and there's not much info out there on it. Either way, I'd recommend reading from the 4 sensors in parallel or one after the other, loading the measurements into the PRU memory at different offsets, then sending a single signal to the ARM.
In that code you linked, the line:
SBCO roundtrip, c24, 0, 4
takes 4 bytes from register roundtrip (which is register r4, per the #define roundtrip r4 at the top of the file) and stores them into the PRU data RAM (constant c24 is set to the beginning of data RAM in lines 39-41) at offset 0. So if you had 4 different measurements in 4 registers, you could offset the data in RAM, e.g.:
SBCO roundtrip1, c24, 0, 4
SBCO roundtrip2, c24, 4, 4
SBCO roundtrip3, c24, 8, 4
SBCO roundtrip4, c24, 12, 4
Then read those 4 consecutive 32-bit integers in your C program.
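On the ARM side, reading those four results back might look like this sketch, assuming the libprussdrv API that the linked example already uses (prussdrv_map_prumem maps PRU data RAM into the host process):

#include <prussdrv.h>
#include <stdint.h>
#include <stdio.h>

/* Read the four 32-bit roundtrip counts the PRU stored at offsets 0..12. */
static void read_roundtrips(void)
{
    void *mapped;
    prussdrv_map_prumem(PRUSS0_PRU0_DATARAM, &mapped);

    volatile uint32_t *results = (volatile uint32_t *)mapped;
    for (int i = 0; i < 4; i++)
        printf("sensor %d roundtrip: %u PRU cycles\n", i, (unsigned)results[i]);
}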
At memory 0x100 and 0x104 are two 32-bit counters. They represent a 64-bit timer and are constantly incrementing.
How do I correctly read from two memory addresses and store the time as a 64-bit integer?
One incorrect solution:
x = High
y = Low
result = (x << 32) + y
(The program could be swapped out and in the meantime Low overflows...)
Additional requirements:
Use C only, no assembly
The bus is 32-bit, so no way to read them in one instruction.
Your program may get context switched at any time.
No mutex or locks available.
Some high-level explanation is okay. Code not necessary. Thanks!
I learned this from David L. Mills, who attributes it to Leslie Lamport:
1. Read the upper half of the timer into H.
2. Read the lower half of the timer into L.
3. Read the upper half of the timer again into H'.
4. If H == H' then return {H, L}, otherwise go back to 1.
Assuming that the timer itself updates atomically, this is guaranteed to work: if L overflowed somewhere between steps 1 and 2, then H will have incremented between steps 1 and 3, and the test in step 4 will fail.
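A minimal C sketch of that loop, assuming (for illustration) that the high word lives at 0x104 and the low word at 0x100; swap the two if your memory map differs:

#include <stdint.h>

#define TIMER_LOW  (*(volatile uint32_t *)0x100u)
#define TIMER_HIGH (*(volatile uint32_t *)0x104u)

uint64_t read_timer64(void)
{
    uint32_t h, l;
    do {
        h = TIMER_HIGH;         /* step 1 */
        l = TIMER_LOW;          /* step 2 */
    } while (h != TIMER_HIGH);  /* steps 3-4: retry if the high word moved */
    return ((uint64_t)h << 32) | l;
}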
Given the nature of the memory (a timer), you should be able to read A, read B, read A', and compare A to A'; if they match, you have your answer. Otherwise, repeat.
It sort of depends on what other constraints there are on this memory. If it's something like a system clock, the above handles the situation where 0x0000FFFF goes to 0x00010000; depending on the order you read them in, you would otherwise erroneously end up with 0x00000000 or 0x0001FFFF.
In addition to what has already been said, you won't get more accurate timing reads than your interrupt / context switch jitter allows. If you fear an interrupt / context switch in the middle of a timer polling, the solution is not to adapt some strange read-read-read-compare algorithm, nor is it to use memory barriers or semaphores.
The solution is to use a hardware interrupt for the timer, with an interrupt service routine that cannot be interrupted when executed. This will give the highest possible accuracy, if you actually have need of such.
The obvious and presumably intended answer is already given by Hobbs and jkerian:
1. sample High
2. sample Low
3. read High again - if it differs from the sample from step 1, return to step 1
On some multi-CPU/core hardware, this doesn't actually work properly. Unless you have a memory barrier to ensure that you're not reading High and Low from your own core's cache, updates from another core - even if 64-bit atomic and flushed to some shared memory - aren't guaranteed to be visible in your core in a timely fashion. While High and Low must be volatile-qualified, this is not sufficient.
The higher the frequency of updates, the more probable and significant the errors due to this issue.
There is no portable way to do this without some C wrappers for OS/CPU-specific memory barriers, mutexes, atomic operations, etc.
Brooks' comment below mentions that this does work for certain CPUs, such as modern AMDs.
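For what it's worth, since C11 the language itself provides one form of such wrappers in <stdatomic.h>; a sketch of the same retry loop with acquire loads, assuming (platform-specifically) that the two counter words can be exposed as _Atomic objects:

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical: the two counter words exposed as C11 atomics. */
extern _Atomic uint32_t timer_high, timer_low;

uint64_t read_timer64_atomic(void)
{
    uint32_t h, l;
    do {
        h = atomic_load_explicit(&timer_high, memory_order_acquire);
        l = atomic_load_explicit(&timer_low,  memory_order_acquire);
    } while (h != atomic_load_explicit(&timer_high, memory_order_acquire));
    return ((uint64_t)h << 32) | l;
}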
If you can guarantee that the maximum duration of a context switch is significantly less than half the low word's rollover period, you can use that fact to decide whether the Low value was read before or after its rollover, and choose the correct high word accordingly.
H1 = High; L = Low; H2 = High;
if (H2 != H1 && L < 0x80000000uL) { H1 = H2; } /* High changed mid-read and L is small, so L was read after the rollover */
result = ((uint64_t)H1 << 32) + L;
This avoids the 'repeat' phase of other solutions.
The problem statement didn't say whether the counters could roll over all 64 bits several times between reads. So I might try alternately reading both 32-bit words a few thousand times (more if needed), storing them in two arrays, running a linear-regression fit modulo 2^32 against both vectors, applying slope-matching constraints to the possible results, and then using the estimated regression fit to predict the count value back at the desired reference time.