I'd like to run this simple re-ordering code on a GPU with an OpenCL kernel. Is it possible? - doparallel

I'd like to run this simple C code on the GPU in an OpenCL kernel. Is it possible?
#include <stdio.h>
int main(void)
{
    int i;
    int a[15] = {7,8,0,4,13,1,14,5,10,2,3,11,12,6,9};
    int b[15];
    printf("input data: ");
    for (i = 0; i < 15; i++) printf("%3d", a[i]);
    printf("\n");
    for (i = 0; i < 15; i++) b[a[i]] = i;  /* invert the permutation */
    for (i = 0; i < 15; i++) printf("%3d", b[i]);
    printf("\n");
    return 0;
}
My input and output data should be:
Input: 7 8 0 4 13 1 14 5 10 2 3 11 12 6 9
Output: 2 5 9 10 3 7 13 0 1 14 8 11 12 4 6

It is possible, although it will be really inefficient because of those random memory accesses. Simplifying it a lot, GPUs work better when work-items (instances of an OpenCL kernel) access memory sequentially.
Having said this, to do this in C and OpenCL you need to perform the following steps (again I'm simplifying a bit):
Include OpenCL headers.
Write the OpenCL kernel itself, and either put it in a string in your main() or save it to a .cl file and read it into a string from your main().
Get desired GPU device and create a context.
Create an OpenCL command queue.
Create the input and output device buffers.
Write the desired information to the input device buffer (via the command queue).
Create an OpenCL program (from the kernel source string), build it, get the kernel object and set its parameters.
Run the kernel (via the command queue), which will perform the desired operation, reading from the input buffer and writing to the output buffer.
Read back the data from the output device buffer (via the command queue) and show it on screen.
Release all the created OpenCL objects.
See this link on how to get started with OpenCL and GPU computing. It gives a good idea of how something like this is done. You will notice that doing this in pure C is very verbose, so either use a wrapper library such as cf4ocl, use C++, or use some other language with higher-level bindings (e.g. Python).

fflush of named pipe (fifo) blocks system for greater than 1 second in C code on raspberry pi

I am using a raspberry pi (either 3b+ or even the 4GB model 4 running either Stretch Lite or Buster Lite) to read an ADC and report the values to a named pipe (fifo). When I open the pipe I use the O_NONBLOCK parameter. The code is written in C. The data rate is not especially fast, as the values are written only once per second. I read the values using cat in a terminal.
I have set up a timer in my code, and typically the fprintf followed by fflush requires less than 1 millisecond. However, somewhat frequently (once every 10-15 minutes), it can take sometimes over 1 second to complete!
The code is part of a much larger project, but these are the lines I am using around this fifo implementation:
int chOutputData(int ch, struct output_s *output)
{
    int nchar;
    if (output->complete_flag) {
        nchar = fprintf(fdch[ch], "m %d %d %d\n",
                        output->val1,
                        output->val2,
                        output->val3);
        fflush(fdch[ch]);
        return nchar;
    } else {
        chOutputError(ch, "Output data is not complete.\n");
        exit(1);
    }
}
val1, val2, val3 are just some simple numbers, nothing crazy big.
Am I not using fprintf with fflush correctly? Do I need to set some sort of buffer? Is this related to slow sd cards on raspberry pi? Even if this took up to 5-10ms to complete I would not complain, but I have a watchdog that is tripped around 100ms so when this takes over 1 second I have issues.
I am open to other ideas of how to spit out this string. I wonder if MQTT might be a means to publish this data?
Thanks!

Sequential part of the program takes more time as process count increases

I'm writing a parallel C program using MPICH. The program naturally has a sequential part and a parallel part. The parallel part seems to be working fine, however I'm having trouble with the sequential part.
In the sequential part, the program reads some values from a file and then proceeds to distribute them among the other processes. It is written as follows:
if (rank == 0)
{
    gettimeofday(&sequentialStartTime, NULL);
    // Read document ids and their weights into the corresponding variables
    documentCount = readDocuments(&weights, &documentIds, documentsFileName, dictionarySize);
    readQuery(&query, queryFileName, dictionarySize);
    gettimeofday(&endTime, NULL);
    timersub(&endTime, &sequentialStartTime, &resultTime);
    printf("Sequential part: %.2f ms\n",
           resultTime.tv_sec * 1000.0 + resultTime.tv_usec / 1000.0);
    // distribute the data to other processes
} else {
    // wait for the data, then start working
}
Here, readQuery and readDocuments are reading values from files, and the elapsed time is printed after they are complete. This piece of code actually works just fine. The problem arises when I try to run this with different number of processors.
I run the program with the following command
mpirun -np p ./main
where p is the processor count. I expect the sequential part to run in a certain amount of time no matter how many processors I use. For p values 1 through 4 this holds; however, when I use 5 to 8 as p values, the sequential part takes more time.
The processor I'm using is Intel® Core™ i7-4790 CPU @ 3.60GHz × 8 and the operating system I have is Windows 8.1 64-bit. I'm running this program on Ubuntu 14.04 which runs on a virtual machine which has full access to my processor and 8 GB's of RAM.
The only reason that came to my mind is that maybe when the process count is higher than 4, the main process may share the physical core it's running on with another process, since I know that this CPU has 4 physical cores but functions as if it had 8, using the hyperthreading technology. However, when I increase p from 5 to 6 to 7 and so on, the execution time increases linearly, so that cannot be the case.
Any help or idea on this would be highly appreciated. Thanks in advance.
Edit: I realized that increasing p increases the run time no matter the value. I'm getting a linear increase in time as p increases.

Using multiple hcsr04 sensors on Beaglebone Black

I am trying to use hcsr04 sensors on the Beaglebone black (adapted from this code - https://github.com/luigif/hcsr04)
I got it working for 4 different sets of sensors individually, and am now unsure of how to combine them into one program.
Is there a way to send the triggers and receive the echoes simultaneously, so that interrupts can be generated as different events to the C program?
Running them one after the other is the last option we have in mind.
Russ is correct: since there are only two PRU cores in the BeagleBone's AM335x processor, there's no way to run 4 instances of that PRU program simultaneously. I suppose you could load one binary compiled for one set of pins, take a measurement, stop it, then load a different binary compiled for a sensor on different pins, but that would be a pretty inefficient (and ugly, IMHO) way to do it.
If you know any assembly, it should be pretty straightforward to update that code to drive all 4 sensors (see the PRU assembly instructions). Alternatively, you could start from scratch in C and use the clpru PRU C compiler as Russ suggested, though AFAIK that's still in somewhat of a beta state and there isn't much info out there on it. Either way, I'd recommend reading from the 4 sensors in parallel or one after the other, loading the measurements into the PRU memory at different offsets, then sending a single signal to the ARM.
In that code you linked, the line:
SBCO roundtrip, c24, 0, 4
takes 4 bytes from register roundtrip (which is register r4, per the #define roundtrip r4 at the top of the file) and stores them into the PRU data RAM (constant c24 is set to the beginning of data RAM in lines 39-41) at offset 0. So if you had 4 different measurements in 4 registers, you could offset the data in RAM, e.g.:
SBCO roundtrip1, c24, 0, 4
SBCO roundtrip2, c24, 4, 4
SBCO roundtrip3, c24, 8, 4
SBCO roundtrip4, c24, 12, 4
Then read those 4 consecutive 32-bit integers in your C program.

What is the difference between cgaputc(int c) / uartputc(int c) / consputc(int c) in xv6?

In the xv6 MIT operating system, I'm trying to understand the difference between a few putc functions in /xv6/console.c:
static void cgaputc(int c)
void uartputc(int c)
static void consputc(int c)
Thanks!
consputc() is a console output function. It writes a char to the console, which in that OS appears to mean both the serial port and the CGA text display. Before doing that, it first checks if the system has panicked (a panic is the state which the kernel enters when it has encountered an error and doesn't know what to do, so instead of going ahead and probably making matters worse decides to panic and stop), and if so, enters an infinite loop with interrupts disabled, so only a system reset can leave the panic state.
uartputc() writes a char to the serial port. It first checks that the serial port is not busy, and will accept the char.
cgaputc() writes a char to the CGA text framebuffer and adjusts the cursor position accordingly. The CGA text framebuffer starts at address 0xb8000 and consists of interleaved (character, attribute) bytes. The default mode, mode 3, is an 80x25 (80 columns, 25 rows) text mode. Attribute 0x07 means gray text on a black background. The cursor position is manipulated via the CRT controller, which exposes several registers; registers 14 and 15 hold the cursor position as 14 bits. The CRTC is accessed by first selecting a register by writing its number to the CRTC index port at 0x3d4, and then writing or reading the CRTC data port at 0x3d5. This stuff is documented in a document called vgadoc4b, and in Ralf Brown's Interrupt List.
You can see what all these functions do if you consult the code.
consputc(int c) checks whether the kernel has panicked, then calls uartputc() and then cgaputc().
uartputc(int c) uses in and out port I/O to write c to the serial port (UART).
cgaputc(int c) writes c to the CGA text buffer, updates the cursor position, and sets the attribute byte for the character (gray on black).
That's what I get from reading the code, anyway; I have not used these functions before, but it seems pretty straightforward.

What is the outp() counterpart in gcc compiler?

In my school, my project is to make a simple program that controls LED lights.
My professor said that outp() is in conio.h, and I know that conio.h is not a standard header.
Example of outp():
// assume that the port to be used is 0x378
outp(0x378, 1); // turn on the first LED
thanks in advance
You can do this from user space in Linux by writing to /dev/port as long as you have write permissions to /dev/port (root or some user with write permissions). You can do it in the shell with:
echo -en '\001' | dd of=/dev/port bs=1 count=1 seek=888
(note that 888 decimal is 378 hex). I once wrote a working parallel port driver for Linux entirely in shell script this way. (It was rather slow, though!)
You can do this in C in Linux like so:
int f = open("/dev/port", O_WRONLY);
lseek(f, 0x378, SEEK_SET);
write(f, "\01", 1);
Obviously, add the needed #includes (<fcntl.h> and <unistd.h>) and error handling.
How to write to a parallel port depends on the OS, not the compiler. In Linux, you'd open the appropriate device file for your parallel port, which is /dev/lp1 on PC hardware for port 0x0378.
Then, interpreting the MS docs for _outp, I guess you want to write a single byte with the value 1 to the parallel port. That's just
FILE *fp = fopen("/dev/lp1", "wb");
// check for errors, including permission denied
putc(1, fp);
You're mixing up two things. A compiler makes programs for an OS. Your school project made a program for DOS. outp(0x378,1); is essentially a DOS function. It writes to the parallel port. Other operating systems use other commands.
GCC is a compiler which targets multiple operating systems. On each OS, GCC will be able to use header files particular to that system.
It's usually going to be a bit more complex. DOS runs one program at a time, so there's no contention for port 0x378. About every other OS runs far more programs concurrently, so you first have to figure out who gets it.
