I'm trying to implement the register file of an ARM CPU in Verilog.
I'm very new to Verilog, so I'm having trouble.
I want the register file to hold the value PC+8 in its 15th register and the value 0 in register number 0 from the beginning, so that it can output PC+8 whenever one of its read-register inputs is 15, and so on.
Currently, I've written the code like this:
reg[31:0] register[15:0];
initial
begin
register[15] = register15;//register15 is the input holding PC+8 as its value
register[0] = 32'h00000000;
end
always @(posedge clk)
begin
outreg1 <= register[A1];// outreg1,2 are outputs (values of register A1, A2)
outreg2 <= register[A2];
end
However, I want to make it all happen on the posedge of clk, when the register read happens. But if I do that, would I have to make all the statements in always @(posedge clk) blocking assignments '=' so they execute in order, assigning registers 15 and 0 first?
My understanding of blocking and nonblocking assignments isn't very clear, so I am not sure whether that would work.
So, this looks like an attempt to remap input values 'register0, ... register15' onto a set of outputs 'outreg1...' using 'A1...' as map selectors.
In this case you cannot use an initial block. An initial block runs only once, at the beginning of simulation, and cannot react to input changes. It is not synthesizable either. Since you said that 'registerN' are also inputs, you'd better create two different always blocks:
reg[31:0] register[15:0];
always @*
begin
register[15] = register15;//register15 is the input holding PC+8 as its value
register[0] = 32'h00000000;
end
always @(posedge clk)
begin
outreg1 <= register[A1];// outreg1,2 are outputs (values of register A1, A2)
outreg2 <= register[A2];
end
The difference between blocking and non-blocking assignments is that with non-blocking assignments the value is actually assigned to the variable later, after all evaluation of the posedge is done for all such blocks in the design. This lets simulation behave more like hardware with respect to flops and latches: if you have one flop A feeding another flop B at the same 'posedge clk', flop B will catch the output of A as it existed before the posedge, which is exactly how the hardware behaves. With blocking assignments the result of the simulation would be unpredictable in such a case, depending on the simulator implementation.
So the rule of thumb is: use non-blocking assignments for all 'outputs' of the always blocks representing latches and flops. Everything else must be blocking. Flop/latch blocks can still use blocking assignments for intermediate variables if needed, but this is better avoided.
I need to execute a set of statements sequentially in Verilog
The problem is that I tried looping using a for loop / generate-for loop. With a for loop I strongly believe that loop unrolling takes place and everything happens in parallel. Could you please suggest how to implement sequential execution of the loop body, so that I can apply the same concept to a repeated process? Or is there any other technique for implementing a sequential procedure? I am using the process for transferring multiple bytes of data over a UART.
The usual technique for implementing a sequential procedure in hardware is building a state machine with a case statement.
integer state, next_state;
parameter S0 = 0, S1 = 1, S2 = 2;
always @(posedge clock) state <= next_state;
always @(*)
case(state)
S0: begin
// ... code for sequence 0
next_state = S1;
end
S1: begin
// ... code for sequence 1
next_state = S2;
end
S2: begin
// ... code for sequence 2
next_state = S0;
end
default: next_state = S0; // guard against latch inference from an incomplete case
endcase
But for data transfer, this is a very inefficient use of hardware. Think of your data as a car on a factory assembly line. Although the car goes through a sequential series of stages in its manufacture, each stage of the factory is going through a repetitive series of the same steps on different cars, with every stage working in parallel. That is how you should be describing your hardware to a synthesis tool. There are some tools just now beginning to appear that take a sequential description and parallelize it, but those are far from generally available right now.
What purpose does while(1); serve? I am aware that while(1) (no semicolon) loops infinitely and is similar to a spinlock situation. However, I do not see where while(1); could be used.
Sample code
if(!condition)
{
while(1);
}
Note: This is not a case of do-while() or plain while(1).
Please note that not every valid statement of the language has to serve a purpose. Statements are valid per the grammar of the language.
One can build many similar "useless" statements, such as if (1);.
I see such statements as the conjunction of a conditional (if, while, etc.) and the empty statement ; (which is also a valid statement although it obviously serves no specific purpose).
That being said, I encountered while (1); in security code. When the user does something very bad with an embedded device, it can be good to block them from trying anything else.
With while (1);, we can unconditionally block a device until an accredited operator manually reboots it.
while(1); can also be part of the implementation of a kernel panic, although a for(;;) {} loop seems to be a more common way of expressing the infinite loop, and there might be a non-empty body (for instance to panic_blink()).
If you dig down to assembly,
(this is easier to grasp from an embedded systems point of view, or if you tried to program a bootloader)
you will realize that a while loop is just a jmp instruction ... i.e.
(pseudo code: starting loop address)
add ax, bx
add ax, cx
cmp ax, dx
jz (pseudo code: another address location)
jmp (pseudo code: starting loop address)
Let's explain how this works: the processor keeps executing instructions sequentially, no matter what. The moment it enters this loop it adds register bx to ax and stores the result in ax, adds register cx to ax and stores it in ax, then cmp ax, dx compares ax with dx (it subtracts dx from ax but only sets the flags, without storing the result). The jz instruction means jump to (another address location) if the zero flag is set (a bit in the flag register that is set when the result of that subtraction is zero); finally, jmp to the starting loop address (pretty straightforward) redoes the whole thing.
The reason I bothered you with all this assembly is to show you that this would translate in C to
int A, B, C, D;
// initialize to whatever
while (1)
{
A = A + B;
A = A + C;
if((A-D)==0)
{break;}
}
// if((X-Y)==0){break;} is the
// cmp ax, dx
// jz (pseudo code: another address location)
So imagine the scenario in assembly where you just had a very long list of instructions that didn't end with a jmp (the while loop) to repeat some section, load a new program, or do anything else.
Eventually the processor would reach the last instruction, fetch the following instruction, and find nothing there (it would then freeze, triple fault, or the like).
That is exactly why, when you want the program to do nothing until an event is triggered, you have to use a while(1) loop: the processor keeps jumping in place instead of running off the end into empty instruction space. When the event is triggered, it jumps to the event handler's address, executes it, clears the interrupt and goes back to your while(1) loop, jumping in place while awaiting further interrupts. By the way, this while(1) construct is called a superloop if you want to read more about it. Note that this is a plain-English explanation that overlooks many underlying details (pointers, stacks, and so on) and oversimplifies in places to get a point across; the C code above is only a demo, not production code.
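As a minimal sketch of such a superloop (uart_rx_ready and the UART interrupt it stands for are made-up names for illustration, not any particular vendor's API):
#include <stdint.h>

static volatile uint8_t uart_rx_ready = 0; /* set by a hypothetical UART ISR */

int main(void)
{
    /* init clocks, peripherals, enable interrupts ... */
    while (1) {                  /* the superloop: jump in place forever */
        if (uart_rx_ready) {     /* flag raised by the interrupt handler */
            uart_rx_ready = 0;
            /* handle the received byte here */
        }
    }
}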
This is tagged C, but I'll start with a C++ perspective. In C++11, the compiler is free to optimize while(1); away.
From the C++11 draft standard n3092, section 6.5 paragraph 5 (emphasis mine):
A loop that, outside of the for-init-statement in the case of a for statement,
— makes no calls to library I/O functions, and
— does not access or modify volatile objects, and
— performs no synchronization operations (1.10) or atomic operations (Clause 29)
may be assumed by the implementation to terminate. [Note: This is intended to allow compiler transformations, such as removal of empty loops, even when termination cannot be proven. — end note ]
The C11 standard has a similar entry, but with one key difference. From the C11 draft standard n1570, (emphasis mine):
An iteration statement whose controlling expression is not a constant expression,156) that performs no input/output operations, does not access volatile objects, and performs no synchronization or atomic operations in its body, controlling expression, or (in the case of a for statement) its expression-3, may be assumed by the implementation to terminate.157)
156) An omitted controlling expression is replaced by a nonzero constant, which is a constant expression.
157) This is intended to allow compiler transformations such as removal of empty loops even when termination cannot be proven.
This means while(1); can be assumed to terminate in C++11 but not in C11: the C11 wording exempts loops whose controlling expression is a constant expression, and 1 is one. Even so, note 157 (not binding) is interpreted by some vendors as allowing them to remove that empty loop. The practical difference is one of defined versus undefined behavior. In C++11 the empty loop falls under the termination assumption, so a program that relies on it never terminating has undefined behavior; since the programmer has invoked UB, the compiler is free to do anything, including deleting the offending loop. In C11, while(1); is exempt from the assumption and must spin forever.
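To make the distinction concrete, here is a small illustration (hedged: what actually happens depends on your compiler and optimization flags):
/* Side-effect free: a C++11 compiler may assume this terminates and
   delete it; a C11 compiler must keep it (constant controlling expression). */
while (1);

/* Portable in both languages: the loop accesses a volatile object,
   so it is excluded from the termination assumption. */
volatile int spin = 1;
while (spin);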
There have been a number of stackoverflow discussions on optimizing compilers deleting while(1);. For example, Are compilers allowed to eliminate infinite loops?, Will an empty for loop used as a sleep be optimized away?, Optimizing away a "while(1);" in C++0x. Note that the first two were C-specific.
A common usage in embedded software is to implement a software reset using the watchdog:
while (1);
or equivalent but safer as it makes the intent more clear:
do { /* nothing, let the dog bite */ } while (1);
If the watchdog is enabled and is not acknowledged within x milliseconds, we know it will reset the processor, so this can be used to implement a software reset.
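A sketch of that idiom; WDT_CTRL, its address, and the enable value are hypothetical and vary per microcontroller:
/* hypothetical memory-mapped watchdog control register */
#define WDT_CTRL (*(volatile unsigned int *)0x40001000u)

static void software_reset(void)
{
    WDT_CTRL = 1u;  /* enable the watchdog and never acknowledge it again */
    while (1);      /* spin until the dog bites and the processor resets */
}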
I assume that the while(1); is not associated with a do loop...
The only semi-useful implementation of while(1); I have seen is a do-nothing loop waiting for an interrupt, such as a parent process waiting for a SIGCHLD indicating a child process has terminated. After all child processes have terminated, the parent's SIGCHLD handler can terminate the parent.
It does the trick, but wastes a lot of CPU-time. Such a usage should perhaps perform some sort of sleep to relinquish the processor periodically.
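A sketch of that friendlier variant under POSIX assumptions: pause() sleeps until a signal arrives instead of burning CPU. (There is a classic race between the flag check and pause(); sigsuspend() closes it, but is omitted here for brevity.)
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t children_done = 0;

static void on_sigchld(int sig)
{
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)  /* reap finished children */
        ;
    children_done = 1;      /* simplification: assume all have now exited */
}

int main(void)
{
    signal(SIGCHLD, on_sigchld);
    /* ... fork() child processes here ... */
    while (!children_done)
        pause();            /* relinquish the processor until a signal arrives */
    return 0;
}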
One place that I have seen a while(1); is in embedded programming.
The architecture used a main thread to monitor events and worker threads to handle them. There was a hardware watchdog timer (explanation here) that would perform a soft reset of the module after a period of time. Within the main thread polling loop, it would reset this timer. If the main thread detected an unrecoverable error, a while(1); would be used to tie up the main thread, thus triggering the watchdog reset. I believe that assert failure was implemented with a while(1); as well.
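A sketch of that architecture's main loop; kick_watchdog, poll_events and fatal_error are hypothetical names standing in for the real module's API:
#include <stdbool.h>

extern void kick_watchdog(void);  /* hypothetical: restart the hardware timer */
extern void poll_events(void);    /* hypothetical: dispatch events to workers */
extern volatile bool fatal_error; /* set on an unrecoverable error */

void main_thread(void)
{
    while (1) {
        if (fatal_error)
            while (1);            /* stop kicking; the watchdog soft-resets us */
        kick_watchdog();
        poll_events();
    }
}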
As others have said, it's just an infinite loop that does nothing, completely analogous to
while (1) {
/* Do nothing */
}
The loop with the semicolon does have a body. When used as a statement, a single semicolon is a null statement, and the loop body consists of that null statement.
For readability, to make it plain to the reader that the null statement is the body of the loop, I recommend writing it on a separate line:
while (1)
;
Otherwise it is easy to miss it at the end of the "while" line, where there usually isn't a semicolon, and the reader can mistake the next line as the body of the loop.
Or use an empty compound statement instead.
while(1);
is actually very useful, especially in a program that has some sort of passcode and you want to lock the user out, for example because they entered the wrong passcode three times. Using while(1); stops the program's progress, and nothing happens until the device is rebooted, mostly for security reasons.
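A sketch of such a lockout, where check_passcode() is a hypothetical routine returning nonzero on a correct entry:
extern int check_passcode(void);  /* hypothetical verification routine */

void login_or_lockout(void)
{
    for (int attempts = 0; attempts < 3; attempts++)
        if (check_passcode())
            return;               /* correct passcode: proceed normally */
    while (1);                    /* three failures: hang until reboot */
}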
This may be used to wait for an interrupt. Basically you initialize everything you need and start waiting for something to occur; when it does, a specific handler is called and executed, and afterwards the program goes back to the waiting state.
That something could be a button press, a mouse click or movement, received data, and so on.
Something similar is often used by UI frameworks while they wait for signals about user actions.
In AVR microcontroller programming (in C) this statement is frequently used; it plays the role of an event loop.
Suppose I want to design a count-up counter. I can use code like this to implement it:
#include <avr/io.h>
#include <avr/interrupt.h>

volatile uint8_t counter = 0;

ISR(INT0_vect) {
    /* key pressed: count up the counter */
    counter++;
}

int main(void) {
    /* common inits; configure INT0 as the key-press interrupt */
    sei();      /* enable interrupt capability */
    /* event loop: all the work happens in the ISR */
    while (1);
}
I think the reason while(1); is used is that earlier in the code an event handler or interrupt has been set up on this thread. Standard thread-safe locking can be fairly costly (in time) when you know your code will only wait for a very short amount of time.
Therefore you can set up the interrupt and 'spin' using while(1);, which, although it is a busy wait (it doesn't let the CPU idle or service other threads), takes very few cycles to set up.
In summary, it's a 'cheap' spinlock while your thread waits for an interrupt or Event.
Since the condition is always true, we can say that we are using a logical tautology, as it is known in mathematics. And while the loop condition always proves true, it won't stop looping unless forced by the code or until resources collapse.
I'm currently learning OpenCL and came across this code snippet:
int gti = get_global_id(0);
int ti = get_local_id(0);
int n = get_global_size(0);
int nt = get_local_size(0);
int nb = n/nt;
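/* p, a, eps, pblock and pos_old are declared earlier in the full kernel
   (see the linked tutorial) */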
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */
for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
float4 p2 = pblock[j]; /* Read a cached particle position */
float4 d = p2 - p;
float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
float f = p2.w*invr*invr*invr;
a += f*d; /* Accumulate acceleration */
}
barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in work-group */
}
Background info about the code: This is part of an OpenCL kernel in a NBody simulation program. The entirety of the code and tutorial can be found here.
Here are my questions (mainly to do with the for loops):
How exactly are for-loops executed in OpenCL? I know that all work-items run the same code and that work-items within a work-group try to execute in parallel. So if I run a for loop in OpenCL, does that mean all work-items run the same loop, or is the loop somehow divided up to run across multiple work-items, with each work-item executing a part of the loop (i.e. work-item 1 processes indices 0~9, item 2 processes indices 10~19, etc.)?
In this code snippet, how does the outer and inner loops execute? Does OpenCL know that the outer loop is dividing the work among all the work groups and that the inner loop is trying to divide the work among work-items within each work group?
If the inner loop is divided among the work-items (meaning that the code within the for loop is executed in parallel, or at least attempted to), how does the addition at the end work? It is essentially doing a = a + f*d, and from my understanding of pipelined processors, this has to be executed sequentially.
I hope my questions are clear enough and I appreciate any input.
1) How exactly are for-loops executed in OpenCL? I know that all
work-items run the same code and that work-items within a work group
tries to execute in parallel. So if I run a for loop in OpenCL, does
that mean all work-items run the same loop or is the loop somehow
divided up to run across multiple work items, with each work item
executing a part of the loop (ie. work item 1 processes indices 0 ~ 9,
item 2 processes indices 10 ~ 19, etc).
You are right. All work items run the same code, but note that they may not run the same code at the same pace. They run the same code only logically; in the hardware, work items inside the same wave (AMD term) or warp (NVIDIA term) follow each other in lockstep at the instruction level.
In terms of loops, a loop is nothing more than a few branch operations at the assembly-code level. Threads from the same wave execute the branch instruction in parallel. If all work items meet the same condition, they still follow the same path and run in parallel. However, if they don't agree on the condition, there will typically be divergent execution. For example, in the code below:
if(condition is true)
do_a();
else
do_b();
logically, the work items that meet the condition execute the do_a() function while the other work items execute the do_b() function. In reality, however, the work items in a wave execute in exactly the same step in the hardware, so it is impossible for them to run different code in parallel. Instead, some work items are masked out for the do_a() operations while the wave executes the do_a() function; when that is finished, the wave goes on to the do_b() function, and this time the remaining work items are masked out. For either function, only part of the work items are active.
Going back to the loop question: since a loop is a branch operation, if the loop condition is true for only some work items, the situation above occurs, with some work items executing the code in the loop while the others are masked out. However, in your code:
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */
for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
the loop condition does not depend on the work-item IDs, which means all the work items have exactly the same loop condition, so they follow the same execution path and run in parallel all the time.
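For contrast, a made-up loop whose bound depends on the work-item ID would diverge:
/* hypothetical counterexample: the trip count differs per work item, so
   the wave keeps running until the largest local id finishes, masking
   out the work items that are already done */
for (int j = 0; j <= get_local_id(0); j++) {
    /* ... */
}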
2) In this code snippet, how does the outer and inner loops execute?
Does OpenCL know that the outer loop is dividing the work among all
the work groups and that the inner loop is trying to divide the work
among work-items within each work group?
As described in the answer to (1), since the loop conditions of the outer and inner loops are the same for all work items, they always run in parallel.
In terms of the workload distribution in OpenCL, it totally relies on the developer to specify how to distribute the workload. OpenCL does not know anything about how to divide the workload among work groups and work items. You can partition the workloads by assigning different data and operations by using the global work id or local work id. For example,
unsigned int gid = get_global_id(0);
buf[gid] = input1[gid] + input2[gid];
this code asks each work item to fetch two values from consecutive memory locations and store the computed result into consecutive memory.
3) If the inner loop is divided among the work-items (meaning that the
code within the for loop is executed in parallel, or at least
attempted to), how does the addition at the end work? It is
essentially doing a = a + f*d, and from my understanding of pipelined
processors, this has to be executed sequentially.
float4 d = p2 - p;
float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
float f = p2.w*invr*invr*invr;
a += f*d; /* Accumulate acceleration */
Here, a, f and d are defined in the kernel code without an address-space qualifier, which means they are private to each work item. On a GPU these variables are first assigned to registers; however, registers are typically a very limited resource on a GPU, so when they are used up the variables are placed in private memory instead, which is called register spilling (depending on the hardware, this may be implemented in different ways; on some platforms private memory is implemented in global memory, so any register spilling causes great performance degradation).
Since these variables are private, all the work items still run in parallel, and each work item maintains and updates its own a, f and d without interfering with the others.
Heterogeneous programming works on a work-distribution model, meaning each thread gets its own portion of the work and starts on it.
1.1) As you know, threads are organized into work-groups (or thread blocks), and in your case each thread in a work-group (or thread block) brings data from global memory into local memory.
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
pblock[ti] = pos_old[jb*nt+ti];
//I assume pblock is local memory
1.2) Now all threads in the thread block have the data they need in their local storage (so there is no need to go to global memory anymore).
1.3) Now comes the processing. If you look carefully at the for loop where the processing takes place,
for(int j=0; j<nt; j++) {
you see that it runs nt times, the number of threads per block. Each thread runs the full loop over the cached elements, and the loop design makes sure that each thread processes its own separate output element.
1) A for loop is just another C statement to OpenCL, and every thread executes it as-is; it is up to you how you divide the work. OpenCL will not do anything internally with your loop (like point # 1.1).
2) OpenCL doesn't know anything about your code; it is you who divides the loops.
3) As in statement 1, the inner loop is not divided among the threads; all threads execute it as-is, only each points at the data it wants to process.
I guess this confusion arises because you jumped into the code before learning much about thread blocks and local memory. I suggest you look at the initial version of this code, which does not use local memory at all.
How exactly are for-loops executed in OpenCL?
For loops can be unrolled automatically into pages of code, which can make the kernel slower or faster to complete. The scalar ALU (SALU) is used for the loop counters, so nesting loops adds SALU pressure, which becomes a bottleneck when more than 9-10 loops are nested (maybe some intelligent algorithm sharing one counter across loops could do the trick). So doing not only SALU work in the loop body but adding some vector ALU (VALU) instructions is a plus.
Loops are run in parallel in SIMD fashion, so all threads' loops are locked to each other unless there is branching or a memory operation. If one loop is adding something, all the other threads' loops are adding too, and if some finish sooner they wait for the last thread still computing. When they all finish, they continue to the next instruction (unless there is branching or a memory operation). If there is no local/global memory operation, you don't need synchronization. This is SIMD, not MIMD, so it is not efficient when the loops are not doing the same thing in all threads.
In this code snippet, how does the outer and inner loops execute?
nb and nt are constants, and they are the same for all threads, so all threads do the same amount of work.
If the inner loop is divided among the work-items
That needs OpenCL 2.0, which has the ability to do fine-grained work distribution (and to spawn kernels from within kernels).
http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/
Look for "subgroup-level functions" and "region growing" titles.
All subgroup threads would have their own accumulators, which are then added together at the end using a "reduction" operation for speed.
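A minimal sketch of such a reduction in local memory, reusing the names from the question's kernel and assuming nt is a power of two, scratch is a __local float array of size nt, and partial is each work item's private accumulator:
scratch[ti] = partial;                    /* publish the private accumulator */
barrier(CLK_LOCAL_MEM_FENCE);
for (int offset = nt / 2; offset > 0; offset /= 2) {
    if (ti < offset)
        scratch[ti] += scratch[ti + offset];
    barrier(CLK_LOCAL_MEM_FENCE);         /* every item must pass each round */
}
/* scratch[0] now holds the work-group total */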
Found in torvalds/linux-2.6.git -> kernel/mutex.c line 171
I have tried to find it on Google and such to no avail.
What does for (;;) instruct?
It literally means "initialize nothing; loop while nothing (which counts as true); and at each step, do nothing to prepare for the next". Basically, it's an infinite loop that you'll have to break out of somehow from within, using a break, return or goto statement.
The for(;;) is an infinite loop, similar to while(1), as most have already mentioned. You will more often see it in kernel mutex code, or in classic mutual-exclusion problems such as the dining philosophers. Until the mutex variable is set to a particular value so that a second process gets access to the resource, the second process keeps on looping; this is also known as busy waiting. The resource could be disk access, with two processes competing via a mutex so that at any time only one process holds it.
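A busy-wait sketch using C11 atomics; resource_busy is a made-up flag, and a real kernel mutex is far more involved than this:
#include <stdatomic.h>

extern atomic_flag resource_busy;   /* defined elsewhere, ATOMIC_FLAG_INIT */

void busy_wait_acquire(void)
{
    for (;;) {                                       /* spin until released */
        if (!atomic_flag_test_and_set(&resource_busy))
            return;                                  /* we now own the resource */
    }
}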
It is an infinite loop which has no initialization, no increment and no end condition, so it will iterate forever, equivalent to while(1).
It loops forever (until the code inside the loop calls break or return, of course). while(1) is equivalent; I personally find it more logical to use that.
It's equivalent to while( true )
Edit: Since there's been some debate sparked by my answer (good debate, mind you), it should be clarified that this is not entirely accurate for C programs not written to C99 and beyond, wherein stdbool.h sets the value of true to 1.
It is an infinite for loop.
It is the same as writing an infinite loop using a for statement, but you have to use break or some other statement that can get out of this loop.
It is functionally equivalent to while(true) { }.
The reason the for(;;) syntax is sometimes preferred comes from an older age when for(;;) actually compiled to slightly faster machine code than while(TRUE) {}. This is because while(TRUE) { foo(); } would translate in the first pass of the compiler to:
lbl_while_condition:
mov $t1, 1
cmp $t1, 0
jz _exit_while
lbl_block:
call _foo
jmp lbl_while_condition
whereas the for(;;) would compile in the first pass to:
lbl_for_init:
; do nothing
lbl_for_condition:
; always
lbl_for_block:
call foo;
lbl_for_iterate:
; no iterate
jmp lbl_for_condition
i.e.
lbl_for_ever:
call foo
jmp lbl_for_ever
Hence saving 3 instructions on every pass of the loop.
In practice, however, both statements have long since been not only functionally equivalent but actually equivalent, since optimisations in the compiler for all builds other than debug builds will ensure that the mov, cmp and jz are optimised away in the while(1) case, resulting in optimal code for both for(;;) and while(1).
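A quick way to check this yourself (exact output depends on the compiler and flags) is to compile both forms and compare the generated assembly; with modern compilers at -O1 or higher, both typically become a single jump-to-self:
/* both of these normally compile to one unconditional jump in C mode */
void spin_for(void)   { for (;;) ; }
void spin_while(void) { while (1) ; }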
It means:
#define EVER ;;
for(EVER)
{
// do something
}
Warning: Using this in your code is highly discouraged.
for(;;)
is an infinite loop, just like while(1). Here no condition is given that would terminate the loop; if you do not break out of it using a break statement, this loop will never come to an end.
It's an infinite loop that you'll have to break out of somehow from within, using a break, return or goto statement, or that is left when some interrupt happens; otherwise the loop runs forever, executing the null statement ; each time.
That is, obviously, an infinite loop condition.