Avoid loop unrolling for executing sequential data transfer in Verilog

I need to execute a set of statements sequentially in Verilog.
The problem is that I tried to loop using a for loop and a generate for loop. I strongly believe that a for loop gets unrolled, so everything happens in parallel. Could you please suggest how to implement sequential execution of a for loop, so that I can apply the same concept to a repeated process? Or is there another technique for implementing a sequential procedure? I am using this to transfer multiple bytes of data over UART.

The usual technique for implementing a sequential procedure in hardware is building a state machine with a case statement.
integer state, next_state;
parameter S0 = 0, S1 = 1, S2 = 2;
always @(posedge clock) state <= next_state;
always @(*)
case(state)
S0: begin
// ... code for sequence 0
next_state = S1;
end
S1: begin
// ... code for sequence 1
next_state = S2;
end
S2: begin
// ... code for sequence 2
next_state = S0;
end
endcase
But for data transfer, this is a very inefficient use of hardware. Think of your data as a car on a factory assembly line. Although the car goes through a sequential series of stages in its manufacture, each stage of the factory repeats the same steps on different cars, with all stages working in parallel. That is how you should describe your hardware to a synthesis tool. Some tools that take a sequential description and parallelize it are just beginning to appear, but they are far from generally available right now.
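To make the state-machine approach concrete for the UART case in the question, here is a rough software model in C, where each call to the step function stands for one clock edge. The names `uart_tx_t`, `uart_tx_start`, `uart_tx_step`, the three states, and the `uart_ready` flag are all hypothetical and not from the original answer; a real design would express the same structure in Verilog with a state register and a byte counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a transmit sequencer: one call = one clock edge. */
typedef enum { TX_IDLE, TX_LOAD, TX_WAIT } tx_state_t;

typedef struct {
    tx_state_t state;
    const uint8_t *buf;  /* bytes to send */
    size_t len;          /* number of bytes in buf */
    size_t index;        /* next byte to load */
    uint8_t shift_reg;   /* byte currently handed to the (assumed) UART shifter */
    size_t sent;         /* bytes handed over so far */
} uart_tx_t;

void uart_tx_start(uart_tx_t *tx, const uint8_t *buf, size_t len)
{
    assert(tx != NULL && (buf != NULL || len == 0));
    tx->state = (len > 0) ? TX_LOAD : TX_IDLE;
    tx->buf = buf;
    tx->len = len;
    tx->index = 0;
    tx->shift_reg = 0;
    tx->sent = 0;
}

/* Advance by one clock. uart_ready models the UART's "can accept a byte" flag. */
void uart_tx_step(uart_tx_t *tx, int uart_ready)
{
    switch (tx->state) {
    case TX_IDLE:
        break;                                   /* nothing to do */
    case TX_LOAD:
        tx->shift_reg = tx->buf[tx->index++];    /* hand over exactly one byte */
        tx->sent++;
        tx->state = TX_WAIT;
        break;
    case TX_WAIT:
        if (uart_ready)                          /* move on only when the UART is free */
            tx->state = (tx->index < tx->len) ? TX_LOAD : TX_IDLE;
        break;
    }
}
```

The byte index plays the role of the loop counter: instead of a for loop that the synthesis tool would unroll, the counter advances by one each time the machine passes through the load state.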


regarding always block in implementing ARM cpu in verilog

I'm trying to implement the register file of an ARM CPU in Verilog.
I'm very new to Verilog, so I've had some trouble.
I want the register file to hold the value PC+8 in register 15 and the value 0 in register 0 from the start, so that it outputs PC+8 when one of the read-register inputs is 15, and so on.
Currently, I've written the code like this:
reg[31:0] register[15:0];
initial
begin
register[15] = register15; // register15 is the input holding PC+8 as its value
register[0] = 32'h00000000;
end
always @(posedge clk)
begin
outreg1 <= register[A1];// outreg1,2 are outputs (values of register A1, A2)
outreg2 <= register[A2];
end
However, I want it all to happen on the posedge of clk, when the register read happens. If I do that, would I have to make every statement in always @(posedge clk) a blocking assignment '=' so that they run in order and registers 15 and 0 are assigned first?
My understanding of blocking and non-blocking assignments isn't very clear, so I am not sure whether that would work.
So, this looks like an attempt to remap the input values 'register0, ... register15' to a set of outputs 'outreg1...', using 'A1...' as selectors.
In this case you cannot use an initial block. An initial block runs only once, at the beginning of simulation, and cannot react to input changes. Initial blocks are also generally not synthesizable. Since you said that the 'registerN' values are also inputs, you had better create two different always blocks:
reg[31:0] register[15:0];
always @*
begin
register[15] = register15; // register15 is the input holding PC+8 as its value
register[0] = 32'h00000000;
end
always @(posedge clk)
begin
outreg1 <= register[A1];// outreg1,2 are outputs (values of register A1, A2)
outreg2 <= register[A2];
end
The difference between blocking and non-blocking assignments is that with non-blocking assignments the new value is assigned to the variables later, after the evaluation of the posedge is done for all such blocks in the design. This lets simulation behave more like hardware with respect to flops and latches. For example, if flop A feeds flop B on the same 'posedge clk', flop B captures the output of A as it existed before the posedge. This is how the hardware behaves. With blocking assignments, the simulation result in such a case is unpredictable and depends on the simulator implementation.
So, the rule of thumb is to use non-blocking assignments for all 'outputs' of always blocks that represent latches and flops. Everything else must be blocking. This means flop/latch blocks can use blocking assignments for intermediate variables if needed, but that is better avoided.
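The flop A feeding flop B case can be modelled in plain C as a rough illustration. This is only a sketch under stated assumptions: the names `flops_t`, `clock_edge_nonblocking`, and `clock_edge_blocking` are made up here, and the blocking variant fixes one particular evaluation order, whereas a real simulator is free to pick either order when the two assignments live in separate always blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two flops in series: in -> a -> b. */
typedef struct { uint32_t a, b; } flops_t;

/* Non-blocking style: sample every right-hand side first, then commit. */
void clock_edge_nonblocking(flops_t *f, uint32_t in)
{
    assert(f != NULL);
    uint32_t next_a = in;
    uint32_t next_b = f->a;   /* value of a as it was before the edge */
    f->a = next_a;
    f->b = next_b;
}

/* Blocking style with a written first: b sees the new a,
   so the two-stage shift register collapses into one stage. */
void clock_edge_blocking(flops_t *f, uint32_t in)
{
    assert(f != NULL);
    f->a = in;
    f->b = f->a;
}
```

Starting from a = 1 and b = 2 with an input of 7, the non-blocking model ends the edge with b = 1 (the old a), while this blocking ordering ends with b = 7.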

Measuring time in IAR (for STM8) of routine called each 100 micro seconds

I am maintaining some C code for an STM8 using IAR Embedded Workbench.
What is the way to measure the execution time between one part of the code and another?
(Take into account that, if possible, I don't want to stop the execution of the code (as a breakpoint would) or write to the console, since I found that this heavily affects the timing of the program.)
I've found something like this:
Techniques for measuring the elapsed time
but it is mostly for ARM processors, so many of the methods don't apply to my setting. I am thinking something like Technique #3 might be applicable...
Concretely, I am asking whether I can do something like that technique:
unsigned int cnt1 = 0;
unsigned int cnt2 = 0;
cnt1 = TIM3->CNT;
func();
cnt2 = TIM3->CNT;
printf("cnt1:%u cnt2:%u diff:%u \n",cnt1,cnt2,cnt2-cnt1);
on this microcontroller.
Any help is greatly appreciated.
You cannot call printf every 100us on an 8-bit microcontroller; it does not have the throughput for that. Instead, increment status variables every time anything behaves unexpectedly.
unsigned int cnt1 = 0;
unsigned int cnt2 = 0;
cnt1 = TIM3->CNT;
func();
cnt2 = TIM3->CNT;
if ((cnt2 - cnt1) > MAX_DURATION_ALLOWED)
global_error_func_duration ++;
(Make sure TIM3 counts in microseconds)
CLK_PeripheralClockConfig (CLK_PERIPHERAL_TIMER3 , ENABLE);
TIM3_DeInit();
TIM3_TimeBaseInit(/* Fill these parameters to get 1us counts at your CPU clock */);
TIM3_Cmd(ENABLE);
Now you can write a console function to print this variable, so that from time to time you can check whether even a single loop took more than 10ms.
Later in development you will want to monitor these status variables to assess the system's runtime integrity and take action in case of misbehavior.
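One detail worth spelling out is counter wraparound. The sketch below assumes a free-running 16-bit up-counter whose period is the full 0xFFFF range; under that assumption, unsigned subtraction gives the right elapsed count even if the counter overflowed once between the two reads. The names `elapsed_ticks`, `timing_stats_t`, and `timing_record` are hypothetical helpers, not part of any STM8 library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Elapsed ticks between two reads of a free-running 16-bit up-counter.
   Unsigned subtraction wraps modulo 2^16, so the result stays correct
   if the counter overflowed once between the two reads. */
uint16_t elapsed_ticks(uint16_t start, uint16_t end)
{
    return (uint16_t)(end - start);
}

/* Hypothetical status record kept instead of printing on every call. */
typedef struct {
    uint16_t max_ticks;      /* longest duration seen so far */
    uint16_t overrun_count;  /* how many calls exceeded the limit */
} timing_stats_t;

void timing_record(timing_stats_t *s, uint16_t start, uint16_t end, uint16_t limit)
{
    assert(s != NULL);
    uint16_t d = elapsed_ticks(start, end);
    if (d > s->max_ticks)
        s->max_ticks = d;
    if (d > limit)
        s->overrun_count++;
}
```

Tracking the maximum alongside the overrun count costs only a compare and a store per call, and it gives a useful number to read out later from a console command or the debugger.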
There are plenty of solutions for that, but a simple one is to toggle a hardware pin at the places where you want to start and stop measuring, and observe it with an oscilloscope or a cheap logic analyser. In software, as someone mentioned, you can assign the current timer tick to start and end variables and read them in the debugger. You could also print them at runtime, e.g. over UART, but this would slow the code down as well.

OpenCL for loop execution model

I'm currently learning OpenCL and came across this code snippet:
int gti = get_global_id(0);
int ti = get_local_id(0);
int n = get_global_size(0);
int nt = get_local_size(0);
int nb = n/nt;
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */
for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
float4 p2 = pblock[j]; /* Read a cached particle position */
float4 d = p2 - p;
float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
float f = p2.w*invr*invr*invr;
a += f*d; /* Accumulate acceleration */
}
barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in work-group */
}
Background info about the code: This is part of an OpenCL kernel in a NBody simulation program. The entirety of the code and tutorial can be found here.
Here are my questions (mainly to do with the for loops):
How exactly are for-loops executed in OpenCL? I know that all work-items run the same code and that work-items within a work group tries to execute in parallel. So if I run a for loop in OpenCL, does that mean all work-items run the same loop or is the loop somehow divided up to run across multiple work items, with each work item executing a part of the loop (ie. work item 1 processes indices 0 ~ 9, item 2 processes indices 10 ~ 19, etc).
In this code snippet, how does the outer and inner loops execute? Does OpenCL know that the outer loop is dividing the work among all the work groups and that the inner loop is trying to divide the work among work-items within each work group?
If the inner loop is divided among the work-items (meaning that the code within the for loop is executed in parallel, or at least attempted to), how does the addition at the end work? It is essentially doing a = a + f*d, and from my understanding of pipelined processors, this has to be executed sequentially.
I hope my questions are clear enough and I appreciate any input.
1) How exactly are for-loops executed in OpenCL? I know that all
work-items run the same code and that work-items within a work group
tries to execute in parallel. So if I run a for loop in OpenCL, does
that mean all work-items run the same loop or is the loop somehow
divided up to run across multiple work items, with each work item
executing a part of the loop (ie. work item 1 processes indices 0 ~ 9,
item 2 processes indices 10 ~ 19, etc).
You are right. All work items run the same code, but note that they may not run it at the same pace; they run the same code only logically. In hardware, the work items inside the same wave (AMD term) or warp (NVIDIA term) execute in lockstep at the instruction level.
A loop is nothing more than a few branch operations at the assembly level. Threads from the same wave execute the branch instruction in parallel. If all work items meet the same condition, they follow the same path and run in parallel. However, if they disagree on the condition, execution typically diverges. For example, in the code below:
if(condition is true)
do_a();
else
do_b();
logically, the work items that meet the condition execute do_a(), while the others execute do_b(). In reality, however, the work items in a wave execute in lockstep in hardware, so they cannot run different code in parallel. Instead, some work items are masked out while the wave executes do_a(); when that finishes, the wave moves on to do_b(), and this time the remaining work items are masked out. For either function, only part of the work items are active.
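The masking behaviour can be imitated in ordinary C as a rough model. Everything here is an assumption made for illustration: the wave size of 4 (real waves and warps are much wider), the function name `run_divergent_branch`, and the two stand-in branch bodies. The function returns how many passes the wave makes over its lanes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WAVE_SIZE 4  /* hypothetical; real waves/warps are typically 32 or 64 lanes */

/* Model of one wave executing: if (x > 0) y = 2 * x; else y = -x;
   Returns the number of passes the wave makes over its lanes. */
int run_divergent_branch(const int x[WAVE_SIZE], int y[WAVE_SIZE])
{
    assert(x != NULL && y != NULL);
    bool mask[WAVE_SIZE];
    bool any_then = false, any_else = false;
    int passes = 0;

    for (size_t i = 0; i < WAVE_SIZE; i++) {
        mask[i] = x[i] > 0;             /* every lane evaluates the condition */
        if (mask[i]) any_then = true; else any_else = true;
    }
    if (any_then) {                     /* pass over the "then" side */
        for (size_t i = 0; i < WAVE_SIZE; i++)
            if (mask[i]) y[i] = 2 * x[i];   /* lanes with a false condition are masked out */
        passes++;
    }
    if (any_else) {                     /* pass over the "else" side */
        for (size_t i = 0; i < WAVE_SIZE; i++)
            if (!mask[i]) y[i] = -x[i];     /* lanes with a true condition are masked out */
        passes++;
    }
    return passes;
}
```

When all lanes agree on the condition the model makes a single pass; when they disagree it makes two, which is the cost of divergence described above.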
Back to the loop question: since a loop is a branch operation, if the loop condition is true for only some work items, the situation above occurs, with some work items executing the loop body while the others are masked out. However, in your code:
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */
for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
The loop condition does not depend on the work item IDs, which means that all the work items will have exactly the same loop condition, so they will follow the same execution path and be running in parallel all the time.
2) In this code snippet, how does the outer and inner loops execute?
Does OpenCL know that the outer loop is dividing the work among all
the work groups and that the inner loop is trying to divide the work
among work-items within each work group?
As described in the answer to (1), since the loop conditions of the outer and inner loops are the same for all work items, they always run in parallel.
As for workload distribution, OpenCL relies entirely on the developer to specify it. OpenCL knows nothing about how to divide the workload among work groups and work items. You partition the workload by assigning different data and operations based on the global or local work-item ID. For example,
unsigned int gid = get_global_id(0);
buf[gid] = input1[gid] + input2[gid];
this code asks each work item to fetch two values from consecutive memory locations and store the result into consecutive memory.
3) If the inner loop is divided among the work-items (meaning that the
code within the for loop is executed in parallel, or at least
attempted to), how does the addition at the end work? It is
essentially doing a = a + f*d, and from my understanding of pipelined
processors, this has to be executed sequentially.
float4 d = p2 - p;
float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
float f = p2.w*invr*invr*invr;
a += f*d; /* Accumulate acceleration */
Here, a, f and d are declared in the kernel code without an address space qualifier, which means they are private to each work item. On a GPU, these variables are first assigned to registers; however, registers are typically a very limited resource on a GPU, so when they run out, the variables are placed in private memory, which is called register spilling (the implementation depends on the hardware; e.g., on some platforms private memory is implemented using global memory, so any register spilling causes a large performance degradation).
Since these variables are private, all the work items still run in parallel, and each work item maintains and updates its own a, f and d without interfering with the others.
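A sequential C model may help picture this. It is only a sketch: the function `run_kernel_model` and its parameters are invented for illustration, and the outer loop stands in for the work items that a real device would run in parallel. The point is that each work item executes the whole inner loop and accumulates into its own private variable.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential model of a 1-D kernel launch: each work item runs the whole
   loop over data and accumulates into its own private variable. */
void run_kernel_model(const float *data, size_t n, const float *scale,
                      float *out, size_t work_items)
{
    assert(data != NULL && scale != NULL && out != NULL);
    for (size_t gid = 0; gid < work_items; gid++) {  /* one iteration = one work item */
        float a = 0.0f;                 /* private to this work item */
        for (size_t j = 0; j < n; j++)  /* the full loop, not a slice of it */
            a += scale[gid] * data[j];
        out[gid] = a;                   /* only the result index depends on gid */
    }
}
```

Because `a` is declared inside the per-work-item scope, no two work items ever touch the same accumulator, so the `a += ...` line needs no synchronization.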
Heterogeneous programming works on a work distribution model, meaning each thread gets its portion of the work and starts on it.
1.1) As you know, threads are organized in work-groups (or thread blocks), and in your case each thread in a work-group brings data from global memory to local memory.
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
pblock[ti] = pos_old[jb*nt+ti];
//I assume pblock is local memory
1.2) Now all threads in the thread block have the data they need in their local storage (so there is no need to go to global memory anymore).
1.3) Now comes the processing. Look carefully at the for loop where the processing takes place:
for(int j=0; j<nt; j++) {
which runs nt times, once for each position cached by the work-group. Each thread runs all of these iterations against its own particle, so every thread processes a separate data element.
1) A for loop is just another C statement to OpenCL, and every thread executes it as is; it is up to you how you divide it. OpenCL will not do anything internally for your loop (like point 1.1).
2) OpenCL doesn't know anything about your code; how the loops are divided is up to you.
3) Same as statement 1: the inner loop is not divided among the threads. All threads execute it as is; the only difference is the data each one points to.
I guess the confusion arises because you jumped into the code before having much knowledge of thread blocks and local memory. I suggest looking at the initial version of this code, where local memory is not used at all.
How exactly are for-loops executed in OpenCL?
They can be unrolled automatically into pages of code, which can make them slower or faster to complete. The SALU is used for the loop counter, so nesting loops adds SALU pressure, which becomes a bottleneck with more than 9-10 nested loops (maybe some intelligent scheme using the same counter for all loops would do the trick). So having some VALU instructions in the loop body, rather than only SALU work, is a plus.
They run in parallel in SIMD fashion, so all threads' loops are locked to each other unless there is branching or a memory operation. If one loop is adding something, all other threads' loops are adding too, and if some finish sooner they wait for the last thread to finish computing. When they all finish, they continue to the next instruction (unless there is branching or a memory operation). If there is no local/global memory operation, you don't need synchronization. This is SIMD, not MIMD, so it is inefficient when the loops are not doing the same thing in all threads.
In this code snippet, how does the outer and inner loops execute?
nb and nt are constants and are the same for all threads, so all threads do the same amount of work.
If the inner loop is divided among the work-items
That needs OpenCL 2.0, which has the ability of fine-grained optimization (and spawning kernels from within a kernel).
http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/
Look for the "subgroup-level functions" and "region growing" headings.
All subgroup threads would have their own accumulators, which are then added at the end using a "reduction" operation for speed.

Cost of a locked operation increases with the number of threads

I've written a program that performs some calculations and then merges the results.
I've used multi-threading to calculate in parallel.
During the merge phase, each thread locks the global array and appends its own part to it, and some extra work is done to eliminate duplicates.
I tested it and found that the cost of merging increases with the number of threads, at an unexpected rate:
2 threads: 40,116,084 (us)
6 threads: 511,791,532 (us)
Why? What happens when the number of threads increases, and how do I change this?
Actually, the code is very simple. Here is the pseudo-code:
typedef struct my_object {
long no;
int count;
double value;
//something others
} my_object_t;
static my_object_t** global_result_array; // about ten thousand entries
static pthread_mutex_t global_lock;
void* thread_function(void* arg){
my_object_t** local_result;
int local_result_number;
int i;
my_object_t* ptr;
for(;;){
if( exit_condition ){ return NULL;}
if( merge_condition){
//start time point to log
pthread_mutex_lock( &global_lock);
for( i = local_result_number-1; i>=0 ;i--){
ptr = local_result[ i] ;
if( NULL == global_result_array[ ptr->no] ){
global_result_array[ ptr->no] = ptr; //step 4
}else{
global_result_array[ ptr->no] -> count += ptr->count; // step 5
global_result_array[ ptr->no] -> value += ptr->value; // step 6
}
}
pthread_mutex_unlock( &global_lock); // end time point to log
}else{
//do some calculation and produce the partial, thread-local result, namely local_result and local_result_number
}
}
}
As shown above, the difference between two threads and six threads is in steps 5 and 6; I counted on the order of hundreds of millions of executions of steps 5 and 6. Everything else is the same.
So, from my point of view, the merge operation is very light; whether using 2 threads or 6, they both need to lock and do the merge exclusively.
Another astonishing thing: when using six threads, the cost of step 4 exploded! That was the root cause of the explosion in total cost!
By the way, the test server has two CPUs, each with four cores.
There are various reasons for the behaviour shown:
More threads mean more lock contention and more blocking time among threads. As is apparent from your description, your implementation uses mutex locks or something similar. The speed-up from threads is better if the data sets are largely exclusive.
Unless your system has as many processors/cores as threads, they cannot all run concurrently. You can hint at the desired concurrency level using pthread_setconcurrency.
Context switching is an overhead, hence the difference. If your computer had 6 cores it would be faster. Otherwise you need more context switches for the threads.
This is a huge performance difference between 2 and 6 threads. I'm sorry, but you have to try very hard indeed to produce such a huge discrepancy. You seem to have succeeded :((
As others have pointed out, using multiple threads on one data set only becomes worth it if the time spent on inter-thread communication (locks etc.) is less than the time gained by the concurrent operations.
If, for example, you find that you are merging successively smaller data sections (e.g. with a merge sort), you are effectively optimizing the time wasted on inter-thread comms and cache-thrashing. This is why multi-threaded merge sorts are frequently started with an in-place sort once the data has been divided up into chunks smaller than the L1 cache.
'each thread will lock the global array' - try not to do this. Locking large data structures for extended periods, or continually locking them for successive short periods, is a very bad plan. Locking the global once serializes the threads and generates one thread with too much inter-thread comms. Continually locking/releasing generates one thread with far, far too much inter-thread comms.
Once the operations get so short that the returns are diminished to the point of uselessness, you would be better off queueing those operations to one thread that finishes off the job on its own.
Locking is often grossly over-used and/or misused. If I find myself locking anything for longer than the time taken to push/pop a pointer onto a queue or similar, I start to get jittery.
Without seeing and analysing the code and, more importantly, the data (I guess both are complex), it's difficult to give any direct advice :(
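As one possible way to shrink the critical section, each thread could first fold its own results together without any lock and only then take the mutex for a short, fixed-size merge. The sketch below is not the asker's code: it stores totals by value instead of pointers, `NUM_KEYS`, `total_t`, `result_t`, `aggregate_local`, and `merge_into_global` are hypothetical names, and it assumes every `no` lies in the range 0 to NUM_KEYS-1.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_KEYS 8   /* hypothetical size; the question mentions about ten thousand */

typedef struct { int count; double value; } total_t;
typedef struct { long no; int count; double value; } result_t;

static total_t global_totals[NUM_KEYS];
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Phase 1, no lock: fold this thread's results into a thread-local table. */
void aggregate_local(const result_t *results, size_t n, total_t local[NUM_KEYS])
{
    for (size_t k = 0; k < NUM_KEYS; k++) {
        local[k].count = 0;
        local[k].value = 0.0;
    }
    for (size_t i = 0; i < n; i++) {
        assert(results[i].no >= 0 && results[i].no < NUM_KEYS);
        local[results[i].no].count += results[i].count;
        local[results[i].no].value += results[i].value;
    }
}

/* Phase 2, short critical section: one addition per key, not one per result. */
void merge_into_global(const total_t local[NUM_KEYS])
{
    pthread_mutex_lock(&global_lock);
    for (size_t k = 0; k < NUM_KEYS; k++) {
        global_totals[k].count += local[k].count;
        global_totals[k].value += local[k].value;
    }
    pthread_mutex_unlock(&global_lock);
}
```

With this split, the hundreds of millions of per-result updates happen on thread-local memory, and the lock is held only for a pass over the key range. Whether that helps in the asker's case depends on the data, which the question does not show.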

Redundant loop inside a process (VHDL)?

I'm taking a university course to learn digital design using VHDL, and was doing some reading in the book the other day where I came across the following piece of code:
architecture abstract of computer_system is
...
cpu : process is
variable instr_reg : word;
variable PC : natural;
...
begin
loop
address <= PC;
mem_read <= '1';
wait until mem_ready;
...
end loop;
end process cpu;
end architecture abstract;
Now, as I've understood it, once a process reaches its last statement, it will go back and execute the first statement (provided that the last statement wasn't a wait, of course). And the purpose of loop ... end loop; is to repeat the intermediate code indefinitely. So doesn't that make the loop redundant in this case? Does it add any extra behaviour that isn't already exhibited by the process?
You're spot on as far as I can see, no need to have a loop in there.
