Simple nested loop not pipelining on TI C64x+ - c

The following code is not pipelining when compiled on the C64x+:
void main ()
{
    int a, b, ar[100] = {0};
    for (a = 0; a < 1000; a++)
        for (b = 0; b < 100; b++)
            ar[b]++;
    while (1);
}
My IDE (Code Composer v6) gives the following message for the inner loop: "Loop cannot be scheduled efficiently, as it contains complex conditional expression. Try to simplify condition."
The problem seems to be with the nested loop, but I can't find any more information about optimizing one as simple as this.
Has anyone solved a similar issue before?
-- Additional information --
Processor: TMS320C64x+
Compiler: TI v8.0.3
Compiler flags:-mv6400+ --abi=eabi -O3 --opt_for_speed=4 --include_path="D:/TI/ccsv6/tools/compiler/ti-cgt-c6000_8.0.3/include" --advice:performance -g --issue_remarks --verbose_diagnostics --diag_warning=225 --gen_func_subsections=on --debug_software_pipeline --gen_opt_info=2 --gen_profile_info -k --c_src_interlist --asm_listing --output_all_syms
Linker flags: -mv6400+ --abi=eabi -O3 --opt_for_speed=4 --advice:performance -g --issue_remarks --verbose_diagnostics --diag_warning=225 --gen_func_subsections=on --debug_software_pipeline --gen_opt_info=2 --gen_profile_info -k --c_src_interlist --asm_listing --output_all_syms -z -m"dsp.map" -i"D:/TI/ccsv6/tools/compiler/ti-cgt-c6000_8.0.3/lib" -i"D:/TI/ccsv6/tools/compiler/ti-cgt-c6000_8.0.3/include" --reread_libs --warn_sections --xml_link_info="dsp_linkInfo.xml" --rom_model

Removing --gen_profile_info from the compiler flags solved the issue: the profiling instrumentation was what prevented pipelining. My loops have been SPLOOPed (scheduled into the C64x+ software-pipelined loop buffer).
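For anyone who runs into pipelining trouble on ordinary counted loops without profiling instrumentation in the picture, giving the C6000 tools an explicit trip count often helps the software pipeliner. A minimal sketch, assuming the TI MUST_ITERATE pragma behaves as described in the C6000 compiler manual (the function name is mine, and the original poster did not need this once --gen_profile_info was removed):

void count_up(int *ar)
{
    int a, b;
    for (a = 0; a < 1000; a++)
    {
        /* Assumed usage: tell the pipeliner the inner loop runs exactly 100
           times, removing trip-count uncertainty from its schedule. */
        #pragma MUST_ITERATE(100, 100, 100)
        for (b = 0; b < 100; b++)
            ar[b]++;
    }
}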

Related

Benchmarking C bubblesort performance compared to Julia

I wanted to create a formal comparison between C and Julia performance. For this purpose I wanted to compare different sorting algorithms, starting with the bubble. In Julia I wrote it like:
using BenchmarkTools
function bubble_sort(v::AbstractArray{T}) where T<:Real
    for _ in 1:length(v)-1
        for i in 1:length(v)-1
            if v[i] > v[i+1]
                v[i], v[i+1] = v[i+1], v[i]
            end
        end
    end
    return v
end
v = rand(Int32, 100_000)
@timed bubble_sort(v)
In the case of the C code (I don't know how to program in C, so I apologize for the code):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
static void swap(int *xp, int *yp){
    int temp = *xp;
    *xp = *yp;
    *yp = temp;
}
void bubble_sort(int arr[], int n){
    int i, j;
    for (j = 0; j < n - 1; j++){
        for (i = 0; i < n - 1; i++){
            if (arr[i] > arr[i+1]){
                swap(&arr[i], &arr[i+1]);
            }
        }
    }
}
int main(){
    int arr_sz = 100000;
    int arr[arr_sz], i;
    for (i = 0; i < arr_sz; i++){
        arr[i] = rand();
    }
    double cpu_time_used;
    clock_t begin = clock();
    bubble_sort(arr, arr_sz);
    clock_t end = clock();
    cpu_time_used = ((double) (end - begin)) / CLOCKS_PER_SEC;
    printf("time %f\n", cpu_time_used);
    return 0;
}
The performance difference (on my computer) is:
Julia    20s
C        ~50s
I suppose that I have a big mistake in the C code, but I am not able to find it, or is Julia just faster at loops?
Update: performance optimization
Changed the type to Int32 in Julia so it is the same as C
Made the swap function static (+1s improvement on average)
Compiler optimization (detailed below)
Instead of plain gcc main.c, I've used different optimization flags, as well as the clang compiler. Results:
Compiler / command       Time (s)
Julia                    19.13
gcc -O main.c            47.58
gcc -O1 main.c           15.98
gcc -O2 main.c           19.52
gcc -O3 main.c           19.20
gcc -Os main.c           17.72
clang -O0 main.c         51.59
clang -O1 main.c         16.78
clang -O2 main.c         13.53
clang -O3 main.c         13.57
clang -Ofast main.c      12.39
clang -Os main.c         18.85
clang -Oz main.c         15.64
clang -Og main.c         16.37
It seems this question was reopened after you discovered that your initial measurements were taken on code compiled for debugging rather than fully optimised, with different data sets, different compiler platforms, and different integer representations.
I suppose that I have a big mistake in the C code, but I am not able to find it out, or is just Julia faster in loops?
I can answer this somewhat cringy (in my opinion) double-barreled question with a quote from the C standard: "The semantic descriptions in this International Standard describe the behavior of an abstract machine in which issues of optimization are irrelevant." In short, there's no speed in C; that's an attribute that occurs in implementations of C. We can't reproduce your speed without having your implementation (including your hardware, for example).
It's very possible that Julia has similar clauses in its spec. The gist is: a sufficiently clever optimiser may determine that sorting your arrays has no observable side effects, so the sort may theoretically be optimised away entirely. I'd expect both programs to output somewhere near 0.0 in that case. This is your perfectly optimal compiler: one that spots code with no actual impact upon the logic of the program, and optimises away the dead code.
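To guard a benchmark against that kind of dead-code elimination, the usual trick is to make the result observable. A minimal sketch of the idea in C (my own illustration, not the original harness; the checksum is just one way one might consume the result):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 100000;
    long long checksum = 0;
    int *a = malloc(n * sizeof *a);
    for (int i = 0; i < n; i++)
        a[i] = rand();
    /* ... sort or otherwise transform a here ... */
    for (int i = 0; i < n; i++)
        checksum += a[i];                   /* consume the result */
    printf("checksum %lld\n", checksum);    /* observable output: the work cannot be discarded */
    free(a);
    return 0;
}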
We haven't always had loop-invariant code motion, and so it stands to reason that there may be a fifth element here: your compiler's version. You'll probably get different statistics if the underlying LLVM is different, for example:
LLVM 11 tends to take 2x longer to compile code with optimizations, and as a result produces code that runs 10-20% faster (with occasional outliers in either direction), compared to LLVM 2.7 which is more than 10 years old.
-- source
Perhaps one day you'll update this question with output that reads 0.0s for both programs. Then this question has truly lost its point.
It's hard to tell what further is being asked here, #cbk. The comments section managed to reduce the runtime for the C program significantly with those four improvements. The question kinda doesn't even make sense here anymore, because it largely cancels itself out by answering itself at the end.
Perhaps this is just one of those cases where a newcomer ought to have posted their own answer (you can do that), rather than rotting the question with edits that answer it... Nonetheless, the question now shows up as unanswered in the question list. I'd vote to close it with the reason "The question should include more details", but I suspect you would then just add more output from a build halted after assembly generation, when the OP seems to have glossed over the solution. The details we actually need are more along the lines of "What didn't you understand about the comments that answer your original question?", and yet the question has varied so much in apparent meaning... Are we going to have a close/reopen war?

Explicit iteration counter privatisation with OMP parallel for

I have a few questions about parallelisation using OMP.
Say I have a program, within which there is a nested for loop. From my understanding of the directive #pragma omp parallel for, the outer iteration counter is automatically privatised. Is the same true for the inner iteration counter? This appears to be the case, as the outputs are identical whether I state it explicitly or not.
Is it necessary(/safer) to explicitly privatise iteration counters for for loops within a parallel for block?
I am compiling with GCC - I found that I had some unhelpful crosstalk between threads when using GCC 5.4.0, but not when using GCC 7.5.0. To resolve this, I added private(foo, bar) to the directive, but I am curious as to why it works without this statement for GCC 7.5.0. Does the GCC (7.5.0) automatically identify race conditions/crosstalk and privatise things it thinks should be private?
Other than allocating a few additional memory addresses, is there any significant overhead cost in privatising variables? I suspect the answer is 'yes, but (in my case) negligible'. The target audience for the code will be using systems with ~10s-100s of cores.
Toy example, which finds maximum values in chunks along an array:
void find_max(double *inArr, double *outArr, int64_t nSamps, int64_t nCells, int64_t threads) {
    double maxVal, curVal;
    int64_t t, cell;
    #pragma omp parallel for private(maxVal, curVal) num_threads(threads)
    for (t=0; t<nSamps; t++) {
        maxVal = inArr[t];
        for (cell=1; cell<nCells; cell++) {
            curVal = inArr[cell * nSamps + t];
            if (curVal > maxVal) {
                maxVal = curVal;
            }
        }
        outArr[t] = maxVal;
    }
}
I am building this as an extension module for a Python library - the call to gcc is:
gcc -pthread -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -Wall -Wstrict-prototypes -c src.c -o src.o -fopenmp -fPIC -Ofast
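For what it's worth, one way to sidestep the privatisation question entirely is to declare the counters and temporaries inside the parallel loop, so they are private by construction. A sketch of that variant (my rewrite of the toy example above, not the code the original timings were taken with):

#include <stdint.h>

void find_max_scoped(double *inArr, double *outArr,
                     int64_t nSamps, int64_t nCells, int threads) {
    /* Variables declared inside the parallel loop are automatically private
       to each thread, so no private() clause is needed. */
    #pragma omp parallel for num_threads(threads)
    for (int64_t t = 0; t < nSamps; t++) {
        double maxVal = inArr[t];
        for (int64_t cell = 1; cell < nCells; cell++) {
            double curVal = inArr[cell * nSamps + t];
            if (curVal > maxVal) {
                maxVal = curVal;
            }
        }
        outArr[t] = maxVal;
    }
}

This relies on C99-style declarations inside the for statement, which the GCC versions mentioned above accept by default.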

using -O2 decreases time of bubble sort C program

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
void sort();
int main() {
    int i;
    for (i = 0; i < 100000; i++) {
        sort();
    }
}
void sort() {
    int i, j, k, array[100], l = 99, m;
    for (i = 0; i < 100; i++) {
        array[i] = rand() % 1000 + 1;
    }
    for (k = 0; k < 99; k++) {
        for (j = 0; j < l; j++) {
            if (array[j + 1] > array[j]) {
                int temp = array[j];
                array[j] = array[j + 1];
                array[j + 1] = temp;
            }
        }
        l--;
    }
    for (m = 0; m < 100; m++) {
        printf("%d ", array[m]);
    }
}
On the Linux shell, I compile with gcc sort.c -o sort and then run time ./sort >> out.
If I instead use gcc -O2 sort.c -o sort (and similarly -O3 and -O4), the time keeps decreasing. How do the optimization options work? Please explain in terms of real time, user time and system time.
PS: The code might be a little inefficient. Kindly ignore that.
Optimization options work between the reading of the source code and the writing of the binary instructions to the CPU.
GCC is a multi-phase compiler, where the phases roughly consist of:
Creating "tokens" from the input text.
Arranging those tokens into abstract syntax tree structure.
Pruning the abstract syntax tree.
Creating register based instructions, assuming an infinite number of CPU registers.
Mapping the registers into the actual registers available.
Writing the binary information out, in the loader's expected format.
Optimizations can take effect at a number of these stages; typically they become active in steps 3 through 5 above. There are many optimizations, including:
Constant folding – Evaluate constant subexpressions in advance.
Strength reduction – Replace slow operations with faster equivalents.
Null sequences – Delete useless operations.
Combine operations – Replace several operations with one equivalent.
Algebraic laws – Use algebraic laws to simplify or reorder instructions.
Special case instructions – Use instructions designed for special operand cases.
Address mode operations – Use address modes to simplify code.
Loop unrolling – Replace a loop with an equivalent straight-line sequence of instructions.
Partial loop unrolling – Reduce the number of times the loop condition is evaluated while preserving the overall behaviour.
Note that these are not all the optimizations that might be performed, but it starts to give you an idea.
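As a hypothetical illustration of the first two (my own example, not code from the question), constant folding evaluates a constant subexpression at compile time, and strength reduction can replace the multiplication in the loop with a cheaper running addition:

void fill(int *out, int n) {
    int seconds_per_day = 60 * 60 * 24;   /* constant folding: computed as 86400 by the compiler */
    for (int i = 0; i < n; i++)
        out[i] = i * seconds_per_day;     /* strength reduction: may become a sum bumped by 86400 each pass */
}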
For example, if the compiler sees
int s = 3;
while (s < 6) {
    printf("%d\n", s);
    s++;
}
and the flags are set to unroll loops, then it might write CPU instructions equivalent to
printf("%d\n", 3);
printf("%d\n", 4);
printf("%d\n", 5);
Those instructions might seem more wordy to us humans, but the CPU commands might be smaller, because there is no need to look up the now-erased value of s, nor any need to add one to it or store the updated value back into RAM.
GCC arranges the optimizations into categories, ranging from "safe" to "risky". -O2 is a good compromise between speed and safety. Higher -O numbers are riskier.
The -O compiler flag controls the amount of compiler optimization that you wish the compiler to perform. In short, building the project will take longer but the resulting executable should be faster. For more information, type man gcc into the command prompt or gcc -c -Q -O3 --help=optimizers for specific information regarding the optimizations performed for a particular flag.
-O stands for optimize, in which gcc will automatically take the steps necessary to optimize your program. You can read more about the specific steps that GCC takes to optimize your program here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
But essentially, -O2 is more optimized than -O1, and -O3 more than -O2. This might come with drawbacks in regard to compiled binary size, where the resulting binary could use more space, but run faster, and vice versa. You can actually paste your code into https://godbolt.org/, and write in -O1 or any of the optimization options beside the dropdown to choose a compiler, and godbolt will show you what the resulting code looks like. You will be able to see a difference between O1 and O2, namely, the O2 generated code is probably shorter and will use a lot of shortcuts to do your algorithm.
gcc offers a number of optimization flags. You can see what each one does specifically here:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
There's always a tradeoff with optimizations, either by increased compile time, increased use of memory, etc...
There are dozens of optimizations enabled by the -O2 flag, so it might not be immediately clear which specific ones affect the sorting. Instead of -O2, you can try each optimization individually, for example using the -falign-loops flag, to see whether that is the one providing the performance increase.

How to use OpenMP in a C project built with meson

I am quite new to meson and C, please forgive me if the answer to this question is trivial ...
I want to use OpenMP in a C project, and I am using meson as a build tool.
I want to compile the parallel for example from this tutorial.
My main.c looks very similar:
#include <omp.h>
#define N 1000
#define CHUNKSIZE 100
int main(int argc, char *argv[]) {
    int i, chunk;
    float a[N], b[N], c[N];
    /* Some initializations */
    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;
    #pragma omp parallel for \
        shared(a,b,c,chunk) private(i) \
        schedule(static,chunk)
    for (i=0; i < N; i++)
        c[i] = a[i] + b[i];
    return 0;
}
My short meson.build file contains this:
project('openmp_with_meson', 'c')
# add_project_arguments('-fopenmp', language: 'c')
exe = executable('some_exe', 'src/main.c') #, c_args: '-fopenmp')
I commented out the c_args keyword in the call to executable here.
Now I end up with the following scenarios:
without '-fopenmp' option, I get the warning, that the pragma is unknown and will be ignored (as I would expect): ../src/main.c:15:0: warning: ignoring pragma omp parallel [-Wunknown-pragmas] #pragma omp parallel for
with the option c_args: '-fopenmp' inserted, I do not get the above warning anymore, instead I get errors for undefined references to GOMP_parallel, omp_get_num_threads and omp_get_thread_num, and nothing gets built
when I use gcc manually with gcc -Wall -o manually_with_gcc ../src/main.c -fopenmp the program compiles and executes without any errors.
Can anyone tell me how to get the executable to compile with meson?
Meson 0.46 or later
Meson 0.46 (released Apr 23, 2018) added OpenMP support. So, if you have Meson 0.46 or later, the following should work with both GCC and Clang:
project('openmp_with_meson', 'c')
omp = dependency('openmp')
exe = executable('some_exe', 'src/main.c',
                 dependencies : omp)
Meson 0.45 or earlier
If you happen to have an older version (as shipped with Debian Stretch, Ubuntu Bionic (18.04 LTS), or Fedora 27, for example), you can do the following:
You need another keyword arg link_args : '-fopenmp' for executable().
exe = executable('some_exe', 'src/main.c',
                 c_args: '-fopenmp',
                 link_args : '-fopenmp')
Meson builds a C program in two phases: compiling and linking. You can pass extra arguments with c_args for compiling and link_args for linking.
The option -fopenmp enables OpenMP directives while compiling, and
the flag also arranges for automatic linking of the OpenMP runtime
library.
That is, -fopenmp is a dual-purpose option.
The above is simple and good. Once you understand it, however, you can also compile your program with -fopenmp to activate the OpenMP directives and link the OpenMP runtime library yourself, without passing -fopenmp in link_args.
Here is a complete meson.build:
project('openmp_with_meson', 'c')
cc = meson.get_compiler('c')
libgomp = cc.find_library('gomp')
exe = executable('some_exe', 'src/main.c',
                 c_args: '-fopenmp',
                 dependencies : libgomp)
Meson >= 0.46 now has a builtin for this (docs):
openmp = dependency('openmp') # meson builtin

cilk plus array notation not vectorized with gcc 4.9.0

I'm trying to figure out why gcc 4.9.0 won't vectorize a simple array addition when compiling with -O -ftree-vectorize:
int a[256], b[256], c[256];
void foo() {
    int i;
    a[:] = b[:] + c[:];
}
From looking at the assembler produced, this loop hasn't been vectorized, and with the -fopt-info-vec-all flag I get a lot of output telling me why vectorization failed, beginning with:
>testvec.c:5: note: ===== analyze_loop_nest =====
>testvec.c:5: note: === vect_analyze_loop_form ===
>testvec.c:5: note: not vectorized: control flow in loop.
>testvec.c:5: note: bad loop form.
which is puzzling, since there's no control flow in the loop. Vectorizing the same operation written as an ordinary for loop with standard array indexing works fine.
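For comparison, the ordinary-loop version referred to above would look something like this (my reconstruction, not the exact code from the question), which, as noted, does vectorize:

int a[256], b[256], c[256];

void foo_loop(void) {
    int i;
    /* Plain-C equivalent of a[:] = b[:] + c[:]; the fixed trip count and
       simple body make it an easy target for the vectorizer. */
    for (i = 0; i < 256; i++)
        a[i] = b[i] + c[i];
}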
It looks like only the latest version of GCC (6.1) can vectorize your example:
http://melpon.org/wandbox/permlink/LOIweYNRRLXeJsZf

Resources