Regarding the always block when implementing an ARM CPU in Verilog

I'm trying to implement the register file of an ARM CPU in Verilog.
I'm very new to Verilog, so I'm having trouble.
I want the register file to hold the value PC+8 in its register 15 and the value 0 in register 0 from the beginning, so that it can give PC+8 as output whenever one of its read-register inputs is 15, and so on.
Currently, I've written the code like this:
reg[31:0] register[15:0];
initial
begin
register[15] = register15;//register15 is the input holding PC+8 as its value
register[0] = 32'h00000000;
end
always @(posedge clk)
begin
outreg1 <= register[A1];// outreg1, outreg2 are outputs (values of registers A1, A2)
outreg2 <= register[A2];
end
However, I want it all to happen on the posedge of clk, when the register read happens. But if I do that, would I have to make all the statements in always @(posedge clk) blocking assignments ('=') so that they run in order and assign registers 15 and 0 first?
My understanding of blocking and non-blocking assignments isn't very clear, so I am not sure whether that would work.

So, this looks like an attempt to remap input values 'register0 ... register15' to a set of 'outreg1 ...' outputs using 'A1 ...' as the selectors.
In this case you cannot use an initial block. An initial block runs only once, at the beginning of the simulation, and cannot react to input changes. It is not synthesizable either. Since you said that 'registerN' are also inputs, you'd better create two different always blocks:
reg[31:0] register[15:0];
always @*
begin
register[15] = register15;//register15 is the input holding PC+8 as its value
register[0] = 32'h00000000;
end
always @(posedge clk)
begin
outreg1 <= register[A1];// outreg1, outreg2 are outputs (values of registers A1, A2)
outreg2 <= register[A2];
end
The difference between blocking and non-blocking assignments is that with non-blocking assignments the actual value is assigned to the variable later, after all evaluation of the posedge is done for all such blocks in the design. This makes the simulation behave more like hardware with respect to flops and latches: if you have one flop A feeding another flop B at the same 'posedge clk', flop B will capture the output of A as it existed before the posedge. This is the way the hardware behaves. With blocking assignments, the result of the simulation would be unpredictable in such a case, depending on the simulator implementation.
So, the rule of thumb is to use non-blocking assignments for all 'outputs' of the always blocks representing latches and flops. Everything else should be blocking. This means that flop/latch blocks can use blocking assignments for intermediate variables if needed, but it is better to avoid that.

Related

Issue with global variable while making 32-bit counter

I am trying to do quadrature decoding using an Atmel XMEGA AVR microcontroller. The XMEGA has only 16-bit counters, and in addition I have used up all the available timers.
Now, to make a 32-bit counter I use one 16-bit counter and, in its overflow/underflow interrupt, increment/decrement a 16-bit global variable, so that by combining the two we get a 32-bit counter.
ISR(timer_16bit)
{
if(quad_enc_mov_forward)
{
timer_over_flow++;
}
else if (quad_enc_mov_backward)
{
timer_over_flow--;
}
}
So far it is working fine. But I need to use this 32-bit value in various tasks running in parallel. I'm trying to read the 32-bit value as below:
uint32_t current_count = timer_over_flow;   // high 16 bits, maintained by the ISR
current_count = current_count << 16;
current_count = current_count + timer_16bit_count;   // low 16 bits, the hardware counter
timer_16bit_count is a hardware register.
Now, the problem I am facing is that when I read timer_over_flow into current_count in the first statement, by the time I add timer_16bit_count an overflow may have occurred and the 16-bit timer may have wrapped to zero. This can result in a totally wrong value.
And I am trying to read this 32-bit value in multiple tasks.
Is there a way to prevent this data corruption and get a reliably working 32-bit value?
Details sought by different members:
My motor can move forward or backward, and the counter increments/decrements accordingly.
Regarding the ISR: before starting my motor I set the global variables (quad_enc_mov_forward & quad_enc_mov_backward) so that if there is an overflow/underflow, timer_over_flow gets changed accordingly.
Variables that are modified in the ISR are declared as volatile.
Multiple tasks means that I'm using an RTOS kernel with about 6 tasks (mostly 3 tasks running in parallel).
In the XMEGA I'm directly reading the TCCO_CNT register for the lower byte.
One solution is:
uint16_t a, b, c;
do {
    a = timer_over_flow;     // high word, maintained by the ISR
    b = timer_16bit_count;   // low word, the hardware counter
    c = timer_over_flow;     // high word again
} while (a != c);            // retry if the high word changed during the reads
uint32_t counter = (uint32_t) a << 16 | b;
Per the comment from user5329483, this must not be used with interrupts disabled: the hardware counter fetched into b keeps changing, while the ISR that updates timer_over_flow cannot run if interrupts are disabled. It is necessary that the ISR be able to interrupt this code if a wrap occurs during it.
This gets the counters and checks whether the high word changed. If it did, this code tries again. When the loop exits, we know the low word did not wrap during the reads. (Unless there is a possibility we read the high word, then the low word wrapped, then we read the low word, then it wrapped the other way, then we read the high word. If that can happen in your system, an alternative is to add a flag that the ISR sets when the high word changes. The reader would clear the flag, read the timer words, and read the flag. If the flag is set, it tries again.)
Note that timer_over_flow, timer_16bit_count, and the flag, if used, must be volatile.
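As a rough illustration of the flag-based variant described above, a sketch in C could look like the following. The names high_word_changed and read_counter32 are hypothetical, not from the original code, and the ISR is assumed to set high_word_changed to 1 whenever it modifies timer_over_flow.
#include <stdint.h>
extern volatile uint16_t timer_over_flow;     /* high word, maintained by the ISR */
extern volatile uint16_t timer_16bit_count;   /* stand-in for the hardware count register */
extern volatile uint8_t  high_word_changed;   /* hypothetical flag set by the ISR */
uint32_t read_counter32(void)
{
    uint16_t high, low;
    do {
        high_word_changed = 0;        /* clear the flag                        */
        high = timer_over_flow;       /* read the ISR-maintained high word     */
        low  = timer_16bit_count;     /* read the hardware low word            */
    } while (high_word_changed);      /* the ISR ran in between: read again    */
    return (uint32_t)high << 16 | low;
}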
If the wrap-two-times scenario cannot happen, then you can eliminate the loop (a sketch follows this list):
Read a, b, and c as above.
Compare b to 0x8000.
If b has a high value, either there was no wrap, it was read before a wrap upward (0xffff to 0), or it was read after a wrap downward. Use the lower of a or c.
Otherwise, either there was no wrap, b was read after a wrap upward, or it was read before a wrap downward. Use the larger of a or c.
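A minimal sketch of that heuristic, using the same variable names as the loop above; the selection logic simply mirrors the two bullets:
uint16_t a, b, c;
a = timer_over_flow;        /* high word                 */
b = timer_16bit_count;      /* low word (hardware)       */
c = timer_over_flow;        /* high word again           */
/* b large: no wrap, read before an upward wrap, or after a downward wrap -> lower of a/c.
   b small: no wrap, read after an upward wrap, or before a downward wrap -> larger of a/c. */
uint16_t high = (b >= 0x8000) ? (a < c ? a : c) : (a > c ? a : c);
uint32_t counter = (uint32_t)high << 16 | b;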
The #1 fundamental embedded systems programming FAQ:
Any variable shared between the caller and an ISR, or between different ISRs, must be protected against race conditions. To prevent some compilers from doing incorrect optimizations, such variables should also be declared as volatile.
Those who don't understand the above are not qualified to write code containing ISRs. Or programs containing multiple processes or threads for that matter. Programmers who don't realize the above will always write very subtle, very hard-to-catch bugs.
Some means of protecting against race conditions are these:
Temporarily disabling the specific interrupt during access.
Temporarily disabling all maskable interrupts during access (the crude way; a sketch of this follows the list).
Atomic access, verified in the machine code.
A mutex or semaphore. On single-core MCUs where interrupts cannot be interrupted in turn, you can use a bool as a "poor man's mutex".
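For the first two items, here is a minimal sketch on AVR using avr-libc's ATOMIC_BLOCK (assuming avr-gcc; read_over_flow is just an illustrative helper name). On an 8-bit AVR a 16-bit read takes two instructions, so the ISR could run between the two byte accesses; the copy is therefore taken with interrupts masked.
#include <stdint.h>
#include <util/atomic.h>                     /* avr-libc ATOMIC_BLOCK */
extern volatile uint16_t timer_over_flow;    /* written by the overflow/underflow ISR */
static uint16_t read_over_flow(void)
{
    uint16_t copy;
    ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {      /* saves SREG, executes cli, restores SREG on exit */
        copy = timer_over_flow;
    }
    return copy;
}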
Just reading TCCO_CNT in multithreaded code is a race condition if you do not handle it correctly. Check the section on reading 16-bit registers in the XMega manual. You should read the lower byte first (this will probably be handled transparently by the compiler for you). When the lower byte is read, the higher byte is (atomically) copied into the TEMP register. Then, reading the high byte actually reads the TEMP register, not the counter. In this way atomic reading of the 16-bit value is ensured, but only if there is no access to the TEMP register between the low- and high-byte reads.
Note that this TEMP register is shared between all counters, so a context switch at the right (wrong) moment will probably trash its content and therefore your high byte. You need to disable interrupts for this 16-bit read. Because the XMega will execute one more instruction after sei before any interrupt is taken, the best way is probably:
cli
ld [low_byte]
sei
ld [high byte]
It disables interrupts for four CPU cycles (if I counted it correctly).
An alternative would be to save the shared TEMP register(s) on each context switch. It is possible (though I'm not sure how likely) that your OS already does this, but be sure to check. Even so, you need to make sure a colliding access does not occur from an ISR.
This precaution should be applied to any 16-bit register access in your code. Either make sure the TEMP register is correctly saved/restored (or not used by multiple threads at all), or disable interrupts when reading/writing a 16-bit value.
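The four-cycle trick above relies on exactly one instruction executing after sei before an interrupt is taken, which cannot be guaranteed from C. A conservative C sketch therefore keeps the whole 16-bit access inside the interrupt-disabled window. This assumes avr-gcc and the timer/counter C0 count register TCC0.CNT from the device header (the question writes TCCO_CNT); read_hw_count is just an illustrative name.
#include <stdint.h>
#include <avr/io.h>             /* SREG, TCC0 */
#include <avr/interrupt.h>      /* cli() */
static uint16_t read_hw_count(void)
{
    uint8_t sreg = SREG;        /* remember the current interrupt state          */
    cli();                      /* nothing may touch the TEMP register now       */
    uint16_t value = TCC0.CNT;  /* low-byte read latches the high byte into TEMP;
                                   the high-byte read returns the latched value  */
    SREG = sreg;                /* restore the interrupt state                   */
    return value;
}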
This problem is indeed a very common and very hard one. All solutions to it will have a caveat regarding timing constraints in the lower-priority layers. To clarify this: the highest-priority function in your system is the hardware counter; its response time defines the maximum frequency that you can eventually sample. The next lower priority in your solution is the interrupt routine, which tries to keep track of bit 2^16, and the lowest is your application-level code, which tries to read the 32-bit value. The question now is whether you can quantify the shortest time between two level changes on the A and B inputs of your encoder. The shortest time usually occurs not at the highest speed your real-world axis is rotating, but when halting at a position: through minimal vibrations the encoder can swing back and forth between two increments, thereby producing e.g. a falling and a rising edge on the same encoder output in short succession. Iff (if and only if) you can guarantee that your interrupt processing time is shorter (by a margin) than this minimal time, you can use such a method to virtually extend the coordinate range of your encoder.

Avoid loop unrolling for executing sequential data transfer in Verilog

I need to execute a set of statements sequentially in Verilog.
The problem is that I tried looping with a for loop / generate for loop. In a for loop, I strongly believe loop unrolling takes place and everything happens in parallel. Could you please suggest how to implement sequential execution of a for loop, so that I can apply the same concept to carry out a repeated process? Or is there any other technique for implementing a sequential procedure? I am using this process for transferring multiple bytes of data over UART.
The usual technique for implementing a sequential procedure in hardware is building a state machine with a case statement.
integer state, next_state;
parameter S0 = 0, S1 = 1, S2 = 2;
always @(posedge clock) state <= next_state;
always @(*)
case(state)
S0: begin
// ... code for sequence 0
next_state = S1;
end
S1: begin
// ... code for sequence 1
next_state = S2;
end
S2: begin
// ... code for sequence 2
next_state = S0;
end
endcase
But for data transfer, this is a very inefficient use of hardware. Think of your data as a car on a factory assembly line. Although the car goes through a sequential series of stages in its manufacture, each stage of the factory is repeating the same steps on different cars, with all stages working in parallel. That is how you should be describing your hardware to a synthesis tool. There are some tools just now beginning to appear that take a sequential description and parallelize it, but they are far from generally available right now.

HCS12 embedded: counter timer and calculated output compare values

I'm having trouble with timer output compare interrupts on the HCS12. The problem seems to be that I'm writing calculated values to the output compare registers rather than immediates, i.e.:
OCval = x + y ; ldd OC1, OCval ; // what I need to do
ldd OC1, #3000 ; // what works
With the calculated values, the timer interrupt is erratic, which is unacceptable in my application. The problem has been firmly pinned down to the documented requirement to access the timer and OC registers in a single cycle; anything other than an immediate write violates this. I also note that all the sample code on the web uses immediate operands.
Just wondering if there is a software workaround. I need to allow the counter to free-run (i.e. no reset) because there are other output compares with immediate writes that have to remain in action. Only two of my interrupts need to be calculated.
A software fix would be nice because the only other options I can see involve additional hardware to handle the dynamic timing, which is messy. TIA
This is a bit tentative, but early tests are encouraging. I've moved the offending interrupts from the main timer to the modulo downcounter, which also provides clocked interrupts. The documentation states that setting the count register is subject to the same single-cycle write rules; however, my extensive testing with the main timer indicates that problems are very unlikely to occur until the counter has been running for a while with the new setting. The advantage of the new approach is that a value needs to be written only once, when initially setting the time value, unlike the main timer where a rewrite needed to be performed many times a second.
In case it helps, I'm stopping the counter before doing the write, then restarting it.

Why is disabling interrupts necessary here?

static void RadioReleaseSPI(void) {
__disable_interrupt();
spiTxRxByteCount &= ~0x0100;
__enable_interrupt();
}
I understand that multiple tasks may attempt to use the SPI resource. spiTxRxByteCount is a global variable used to keep track of whether the SPI is currently in use by another task. When a task requires the SPI it can check the status of spiTxRxByteCount to see if the SPI is being used. When a task is done using the SPI it calls this function and clears the bit, to indicate that the SPI is now free. But why disable the interrupts first and then re-enable them after? Just paranoia?
The &= does a read-modify-write operation - it's not atomic. You don't want an interrupt changing things in the middle of that, resulting in the write storing back an incorrect value.
You need to disable interrupts to ensure atomic access. You don't want any other process to access and potentially modify that variable while you're reading it.
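To see why, here is roughly what the single C statement expands to, shown as a hypothetical step-by-step rewrite (the real generated code is similar, as the assembly listing further down illustrates):
unsigned tmp = spiTxRxByteCount;   /* 1. read  - an ISR could run right here        */
tmp &= ~0x0100u;                   /* 2. modify the local copy                      */
spiTxRxByteCount = tmp;            /* 3. write back - this overwrites any change an
                                         ISR made between steps 1 and 3             */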
From Introduction to Embedded Computing:
The Need for Atomic Access
Imagine this scenario: foreground program, running on an 8-bit uC,
needs to examine a 16-bit variable, call it X. So it loads the high
byte and then loads the low byte (or the other way around, the order
doesn’t matter), and then examines the 16-bit value. Now imagine an
interrupt with an associated ISR that modifies that 16-bit variable.
Further imagine that the value of the variable happens to be 0x1234 at
a given time in the program execution. Here is the Very Bad Thing
that can happen:
foreground loads high byte (0x12)
ISR occurs, modifies X to 0xABCD
foreground loads low byte (0xCD)
foreground program sees a 16-bit value of 0x12CD.
The problem is that a supposedly indivisible piece of data, our
variable X, was actually modified in the process of accessing it,
because the CPU instructions to access the variable were divisible.
And thus our load of variable X has been corrupted. You can see that
the order of the variable read does not matter. If the order were
reversed in our example, the variable would have been incorrectly read
as 0xAB34 instead of 0x12CD. Either way, the value read is neither
the old valid value (0x1234) nor the new valid value (0xABCD).
Writing ISR-referenced data is no better. This time assume that the
foreground program has written, for the benefit of the ISR, the
previous value 0x1234, and then needs to write a new value 0xABCD. In
this case, the VBT is as follows:
foreground stores new high byte (0xAB)
ISR occurs, reads X as 0xAB34
foreground stores new low byte (0xCD)
Once again the code (this time the ISR) sees neither the previous
valid value of 0x1234, nor the new valid value of 0xABCD, but rather
the invalid value of 0xAB34.
While spiTxRxByteCount &= ~0x0100; may look like a single instruction in C, it is actually several instructions to the CPU. Compiled in GCC, the assembly listing looks like so:
57:atomic.c **** spiTxRxByteCount &= ~0x0100;
68 .loc 1 57 0
69 004d A1000000 movl _spiTxRxByteCount, %eax
69 00
70 0052 80E4FE andb $254, %ah
71 0055 A3000000 movl %eax, _spiTxRxByteCount
71 00
If an interrupt comes in between any of those instructions and modifies the data, the code can end up working with a stale value and writing back an incorrect one. So you need to disable interrupts before you operate on the variable, and also declare it volatile.
There are two reasons for why you should be disabling interrupts:
The &= is a read-modify-write operation, which by nature is not atomic. It consists of a read, a bitwise AND, and a write. You don't want this operation to be interrupted by an ISR (interrupt service routine). The ISR could modify spiTxRxByteCount after the read and before the write. The write would then be based on an outdated value, and you would lose information.
__disable_interrupt() and __enable_interrupt() serve as software barriers. Even if optimization is enabled, the compiler must not move the read or the write across the two barriers. Also, the compiler must not cache the value of spiTxRxByteCount across the two barriers. If there were no barriers, the compiler would be allowed to hold a copy of spiTxRxByteCount in some CPU register even across multiple invocations of RadioReleaseSPI(). This would typically happen if inlining is enabled and RadioReleaseSPI() is called repeatedly.
That disabling and enabling interrupts serves as barriers is at least as important as avoiding the interruption by an ISR, IMHO. But it seems to be overlooked, sometimes.
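Putting both points together, a sketch of how the function and the shared variable might be declared, using the same intrinsics as in the question (the exact integer type of spiTxRxByteCount is an assumption here):
#include <stdint.h>
/* volatile: force the compiler to re-read/re-write memory on every access,
 * instead of caching the value in a register across the barriers. */
extern volatile uint16_t spiTxRxByteCount;
static void RadioReleaseSPI(void)
{
    __disable_interrupt();        /* mask ISRs; also acts as a compiler barrier */
    spiTxRxByteCount &= ~0x0100;  /* read-modify-write is now indivisible       */
    __enable_interrupt();         /* barrier again; pending ISRs run after this */
}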

How to read two 32-bit counters as a 64-bit integer without a race condition

At memory 0x100 and 0x104 are two 32-bit counters. They represent a 64-bit timer and are constantly incrementing.
How do I correctly read from two memory addresses and store the time as a 64-bit integer?
One incorrect solution:
x = High
y = Low
result = x << 32 + y
(The program could be swapped out and in the meantime Low overflows...)
Additional requirements:
Use C only, no assembly
The bus is 32-bit, so no way to read them in one instruction.
Your program may get context switched at any time.
No mutex or locks available.
Some high-level explanation is okay. Code not necessary. Thanks!
I learned this from David L. Mills, who attributes it to Leslie Lamport:
Read the upper half of the timer into H.
Read the lower half of the timer into L.
Read the upper half of the timer again into H'.
If H == H' then return {H, L}, otherwise go back to 1.
Assuming that the timer itself updates atomically then this is guaranteed to work -- if L overflowed somewhere between steps 1 and 2, then H will have incremented between steps 1 and 3, and the test in step 4 will fail.
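A minimal C sketch of that loop, assuming the high word lives at 0x104 and the low word at 0x100 (the question does not say which is which) and that each 32-bit word can be read atomically over the 32-bit bus; the macro and function names are illustrative:
#include <stdint.h>
#define TIMER_LOW  (*(volatile uint32_t *)0x100u)
#define TIMER_HIGH (*(volatile uint32_t *)0x104u)
uint64_t read_timer64(void)
{
    uint32_t h, l, h2;
    do {
        h  = TIMER_HIGH;    /* 1. upper half                    */
        l  = TIMER_LOW;     /* 2. lower half                    */
        h2 = TIMER_HIGH;    /* 3. upper half again              */
    } while (h != h2);      /* 4. retry if the high word moved  */
    return (uint64_t)h << 32 | l;
}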
Given the nature of the memory (a timer), you should be able to read A, read B, read A' and compare A to A'; if they match, you have your answer. Otherwise, repeat.
It sort of depends on what other constraints there are on this memory. If it's something like a system clock, the above will handle the situation where 0x0000FFFF goes to 0x00010000, where, depending on the order you read it in, you would otherwise erroneously end up with 0x00000000 or 0x0001FFFF.
In addition to what has already been said, you won't get more accurate timing reads than your interrupt / context switch jitter allows. If you fear an interrupt / context switch in the middle of polling the timer, the solution is not to adopt some strange read-read-read-compare algorithm, nor is it to use memory barriers or semaphores.
The solution is to use a hardware interrupt for the timer, with an interrupt service routine that cannot be interrupted when executed. This will give the highest possible accuracy, if you actually have need of such.
The obvious and presumably intended answer is already given by Hobbs and jkerian:
sample High
sample Low
read High again - if it differs from the sample from step 1, return to step 1
On some multi-CPU/core hardware, this doesn't actually work properly. Unless you have a memory barrier to ensure that you're not reading High and Low from your own core's cache, updates from another core - even if 64-bit atomic and flushed to some shared memory - aren't guaranteed to be visible in your core in a timely fashion. While High and Low must be volatile-qualified, this is not sufficient.
The higher the frequency of updates, the more probable and significant the errors due to this issue.
There is no portable way to do this without some C wrappers for OS/CPU-specific memory barriers, mutexes, atomic operations etc..
Brooks' comment below mentions that this does work for certain CPUs, such as modern AMDs.
If you can guarantee that the maximum duration of a context switch is significantly less than half the low-word rollover period, you can use that fact to decide whether the Low value was read before or after its rollover, and choose the correct high word accordingly.
H1 = High; L = Low; H2 = High;
/* High changed and Low is in the lower half: Low was read after its rollover, so take the later High */
if (H2 != H1 && L < 0x80000000UL) { H1 = H2; }
result = ((uint64_t)H1 << 32) + L;
This avoids the 'repeat' phase of other solutions.
The problem statement didn't say whether the counters could roll over all 64 bits several times between reads. So I might try alternately reading both 32-bit words a few thousand times (more if needed), store them in two vector arrays, run a linear regression fit modulo 2^32 against both vectors, apply slope-matching constraints of that ratio to the possible results, and then use the estimated regression fit to predict the count value back to the desired reference time.
