When does OpenCL data transfer occur?

I've seen a few questions here on Stack Overflow dealing with the same issues, but no definite answer. I thought I'd ask again, with a bunch of questions of my own, all relating to the subject matter at hand.
So, do we know when the data transfer from host to the OpenCL device occurs? Can you tell me the exact memory transfer behaviour of the functions below (that is, what data is transferred or created, if any, when these functions are invoked)?
clCreateBuffer()
clSetKernelArg()
clEnqueueNDRangeKernel()
The first two don't even produce events, so we can't time them, but surely some data transfer is happening here.
Is there a way to transfer data to a device without first setting it as a kernel arg?
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device. Why would that not be desirable, since that way we could avoid further data transfer commands (and surely the driver implements this in the most efficient way)?
Does transferred data (say, as part of a kernel arg) stay on the device for further manipulation after a kernel returns? If not, is there a way to do just that?

Buffer copies are issued through command queues. The easiest way to synchronize a command queue with the host is clFinish().
clCreateBuffer()
clEnqueueWriteBuffer() <-------- you can get event data from this
(set the blocking parameter to CL_FALSE to queue everything quickly)
(set blocking to CL_TRUE if you want a synchronous write here)
clSetKernelArg()
clEnqueueWriteBuffer() <----- it could be here too
clEnqueueNDRangeKernel()
clEnqueueWriteBuffer() <----- or here (to quickly re-set an array?)
clFinish() <--------- this ensures all queued commands are executed before it returns
Now you can query that event's profiling data to check when the transfer started and when it ended.
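For concreteness, here is a minimal host-side sketch of that sequence. It assumes a context, a queue created with CL_QUEUE_PROFILING_ENABLE, a compiled kernel and a host float array already exist; the helper name write_and_time is made up and error checking is omitted.

    #include <CL/cl.h>

    /* Hypothetical helper: enqueues a non-blocking write, runs the kernel,
       then reports how long the host->device transfer took. The queue must
       have been created with CL_QUEUE_PROFILING_ENABLE. */
    static cl_ulong write_and_time(cl_context ctx, cl_command_queue queue,
                                   cl_kernel kernel, const float *data, size_t n)
    {
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);

        cl_event write_evt;
        clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                             data, 0, NULL, &write_evt);      /* non-blocking write */

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

        size_t global = n;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

        clFinish(queue);                 /* everything queued so far has completed here */

        cl_ulong t0 = 0, t1 = 0;
        clGetEventProfilingInfo(write_evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(write_evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);

        clReleaseEvent(write_evt);
        clReleaseMemObject(buf);
        return t1 - t0;                  /* write duration in nanoseconds */
    }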
To let a buffer stay on the device, create it on the device first and don't migrate it to another device. Using only the CL_MEM_READ_WRITE flag in clCreateBuffer() is enough to make it a real device-side buffer until you release it.
CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR uses host memory, which the device maps into its address space. This is faster for streaming data in and out because no extra host-side copies are needed. If you always need dedicated device memory such as fast GDDR5 or HBM, you should not use these flags.
Copy to the device once and use it as much as you want, provided the device has its own memory. For example, Intel HD Graphics 400 doesn't have dedicated memory and shares system RAM, so it is much faster to use the CL_MEM_..._HOST_PTR flags, especially CL_MEM_USE_HOST_PTR.
To check whether the device shares RAM with the CPU, query the CL_DEVICE_HOST_UNIFIED_MEMORY property of the device.
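A small sketch of that query, with made-up helper and variable names, choosing the buffer flags accordingly:

    #include <CL/cl.h>

    /* Sketch: pick buffer flags depending on whether the device shares RAM
       with the host. dev, ctx and (for the unified case) host_ptr are
       assumed to exist already. */
    static cl_mem create_buffer_for(cl_device_id dev, cl_context ctx,
                                    void *host_ptr, size_t bytes)
    {
        cl_bool unified = CL_FALSE;
        clGetDeviceInfo(dev, CL_DEVICE_HOST_UNIFIED_MEMORY,
                        sizeof(unified), &unified, NULL);

        cl_mem_flags flags = CL_MEM_READ_WRITE;
        if (unified)
            flags |= CL_MEM_USE_HOST_PTR;   /* iGPU etc.: let the device map host memory */

        return clCreateBuffer(ctx, flags, bytes,
                              unified ? host_ptr : NULL, NULL);
    }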
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device
Even without map/unmap commands prior to kernel execution, my machine behaves the same, but I use map/unmap just to be safe, and it doesn't cost many cycles.
Edit: if you want to make sure a command doesn't start before you want it to, you can add a user event to the event wait list of the buffer write command. Then you can trigger the user event to let the write start, because a command waits for all events in its wait list to complete before it runs (if any are specified).
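A rough sketch of that user-event gating (the helper name and arguments are hypothetical; ctx, queue, buf and data are assumed to exist):

    #include <CL/cl.h>

    /* Sketch: gate a non-blocking write behind a user event so it cannot
       start until the host says so. */
    static void gated_write(cl_context ctx, cl_command_queue queue,
                            cl_mem buf, const float *data, size_t bytes)
    {
        cl_event gate = clCreateUserEvent(ctx, NULL);

        /* The write waits on `gate`; it is queued but cannot run yet. */
        clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, data, 1, &gate, NULL);

        /* ... enqueue or prepare other work here ... */

        clSetUserEventStatus(gate, CL_COMPLETE);   /* now the write may start */
        clReleaseEvent(gate);
        clFinish(queue);
    }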

Related

Vulkan: Why have multiple command buffers per pool?

I am developing in Vulkan 1.0, building a rendering system by learning and implementing functionality one step at a time. I get the gist of command recording and submission, but I haven't gotten far enough to understand a use case in which I'd want multiple command buffers per pool. It was slide 14 of this presentation that raised some questions.
My understanding and current design is as follows:
Optimally, there should be one command pool per frame per thread so command buffers aren't recording over the same memory while in flight. If I have 3 frames and each frame can have up to 4 recording threads, that's 12 command pools at a minimum.
Command buffers are associated with a command pool at creation time and will be reset on the next frame. To potentially get better performance, the entire pool will be reset rather than the individual buffers.
A single command pool may be used in the creation of multiple command buffers. This group of command buffers would all be used in the same frame and thread.
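A rough sketch of the setup described in the last three points (made-up function and variable names; device and queueFamily are assumed to exist, error handling omitted): one pool owns a handful of command buffers, and the whole pool is reset when its frame/thread slot comes around again.

    #include <vulkan/vulkan.h>

    /* Sketch: one of the 12 pools (frame x thread) owning several command
       buffers, reset wholesale rather than buffer by buffer. */
    static void setup_and_reset_pool(VkDevice device, uint32_t queueFamily)
    {
        VkCommandPool   pool;
        VkCommandBuffer cmd[4];   /* buffers recorded by one thread in one frame */

        VkCommandPoolCreateInfo poolInfo = {
            .sType            = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
            .flags            = VK_COMMAND_POOL_CREATE_TRANSIENT_BIT, /* no per-buffer reset */
            .queueFamilyIndex = queueFamily,
        };
        vkCreateCommandPool(device, &poolInfo, NULL, &pool);

        VkCommandBufferAllocateInfo allocInfo = {
            .sType              = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
            .commandPool        = pool,
            .level              = VK_COMMAND_BUFFER_LEVEL_PRIMARY,
            .commandBufferCount = 4,
        };
        vkAllocateCommandBuffers(device, &allocInfo, cmd);

        /* ... record and submit, wait on this frame's fence ... */

        /* When this frame/thread slot comes around again, reset the whole
           pool rather than each command buffer individually. */
        vkResetCommandPool(device, pool, 0);
    }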
According to this article under "Command overlap", the reordering of commands may happen between command buffers and vkQueueSubmit calls. So if I had a group of command buffers in the same frame and thread, I'd need something more than just submission order to guarantee the results I want. Maybe I'd use unique semaphores for each submission?
If I'm coding for a frame/thread, I see no advantage to submitting commands a few times between the beginning and the end of the frame as opposed to submitting everything at once at the end. It's the same amount of work in the same time span. It may even be detrimental to submit multiple times because of the vkQueueSubmit overhead mentioned in the specification.
From the assumptions above, in what cases would it be necessary or advantageous to have more than one command buffer per command pool as opposed to having one command buffer that records everything from beginning to end for the given frame and thread?
having one command buffer that records everything from beginning to end for the given frame and thread?
Well, what happens if a thread needs to record things in an order other than the order in which they need to be submitted? That's kind of the point of a CB, isn't it? The ability to build commands in an order that is convenient, then submit them in the way that works out for the GPU.
For example, let's say you have a thread that is rendering a particular set of objects. To do that, you need to write their matrices and other per-object properties to a uniform buffer. And let's say that, for whatever reason, this particular Vulkan implementation doesn't allow you to use mappable memory directly for uniform buffers. So you have to write to mappable memory and copy the data to a uniform buffer via a memory transfer operation.
So the thread creating the commands for these meshes needs to do two things. It needs to build the commands to render the meshes, and it needs to build the commands to transfer the uniform data to the buffer that the rendering commands will need.
Your way, however, requires that commands be put into the CB in the order you want them executed. So you would have to loop through the entire list of objects to build the transfer commands, and loop through it again to build the rendering commands. But you're reading the same objects each time through the loop. During the first loop, you had access to 100% of the data needed to issue the rendering command.
And the second time through the loop, all that data is no longer in the cache. So the second time has about the same number of cache misses (and therefore real memory accesses) as the first time.
That's bad.
Furthermore, rendering commands need to be placed within a render pass instance. Transfer commands cannot be in a render pass instance. But if you're putting transfer commands into the same CB as the rendering commands... that CB must begin and end the render pass instance.
So... how can other threads issue commands for that render pass instance?
If you want parallelism (and you do), then you need these threads to be creating secondary CBs for their rendering commands. A later task will collate them into the primary CB, and that CB will have the render pass instance. But secondary CBs built for a render pass cannot contain transfer commands.
So if you want parallelism, then any transfer commands that have to be generated alongside rendering commands must go into a different CB. One that will be submitted before the secondary CBs (or even submitted to a different queue altogether).
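A rough sketch of that collation step (all handles are assumed to already exist, the names are made up, and the synchronization that orders the separately submitted transfer CBs before this one is elided):

    #include <vulkan/vulkan.h>

    /* Sketch: the frame's primary CB begins the render pass and pulls in the
       secondary CBs recorded by the worker threads. Their transfer commands
       live in separate CBs submitted earlier (or on a transfer queue). */
    static void collate(VkCommandBuffer primary,
                        VkRenderPass renderPass, VkFramebuffer framebuffer,
                        VkExtent2D extent,
                        const VkCommandBuffer *secondaries, uint32_t count)
    {
        VkCommandBufferBeginInfo begin = {
            .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
            .flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
        };
        vkBeginCommandBuffer(primary, &begin);

        VkRenderPassBeginInfo rpBegin = {
            .sType       = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO,
            .renderPass  = renderPass,
            .framebuffer = framebuffer,
            .renderArea  = { .offset = {0, 0}, .extent = extent },
        };
        /* SECONDARY_COMMAND_BUFFERS: this subpass's contents come only from
           the secondaries recorded by the worker threads. */
        vkCmdBeginRenderPass(primary, &rpBegin,
                             VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
        vkCmdExecuteCommands(primary, count, secondaries);
        vkCmdEndRenderPass(primary);

        vkEndCommandBuffer(primary);
    }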

Single use and the CPU data cache

I am working on an application that has quite a few internal data structures, but also processes huge amounts of user data. During this processing, I need to have the CPU look at the data just once (the rest of the processing is done via zero copies and DMA, so the CPU need not touch the data at all).
I am searching for a way to process the user data (even if it means copying it to a temporary buffer) without having it evict the internal structures from the CPU's data cache. In other words, I'm looking for a way to tell the CPU "give me this data, but I'm never going to need it again".
I seem to recall that gcc had an intrinsic to do it, but going over the list, I seem to have misremembered (or otherwise couldn't find it). Either way, an assembly solution (Intel) would work fine for my purposes.
Logic states that there must be a way to do this, as it is necessary to do this before sending data to (or receiving from) DMA buffers.
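For what it's worth, one common x86 approach is to combine the non-temporal prefetch hint with non-temporal (streaming) stores. A minimal sketch, assuming a 16-byte-aligned destination and sizes that are multiples of 16; this reduces rather than fully prevents cache pollution:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: prefetch the source with the non-temporal hint and write the
       destination with streaming stores, so the single-use data displaces as
       little of the cache as possible. */
    static void copy_once(void *dst, const void *src, size_t bytes)
    {
        const uint8_t *s = (const uint8_t *)src;
        uint8_t       *d = (uint8_t *)dst;

        for (size_t i = 0; i < bytes; i += 16) {
            _mm_prefetch((const char *)(s + i + 256), _MM_HINT_NTA); /* hint: don't keep it */
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
            _mm_stream_si128((__m128i *)(d + i), v);                 /* bypass the cache on store */
        }
        _mm_sfence();   /* make the streaming stores globally visible */
    }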

Using interrupts during reading a file from disk

Assume that a large file is saved on disk and I want to run a computation on every chunk of data contained in the file.
The C/C++ code that I would write to do so would load part of the file, process it, then load the next part, process that part, and so on.
If I am, however, interested to do so in the shortest possible time, I could actually do the following: First, tell DMA-controller to load first part of the file. When this part is loaded tell the DMA-controller to load the second part (in some other part of the memory) and then immediately start processing the first part.
If I get an interrupt from the DMA during processing the first part, I finish the first part and afterwards tell the DMA to overwrite it with the third part of the file; then I process the second part.
If I do not get an interrupt from the DMA during processing the first part, I finish the first part and wait for the interrupt of the DMA.
Depending on how long the processing takes in relation to the disk read, this should be up to twice as fast. In reality, of course, one would have to measure. But that is not the question I am asking.
The question is: is it possible to do this (a) in C using some non-standard extension, or (b) in assembly? Or do operating systems not allow such things in general? The question is meant primarily in a single-thread context, although I would also be interested to know how to do it with two threads. Also, I am not so much interested in specific code; this is more of a theoretical question.
You're right that you will not get the benefit of this by default, because a blocking read stops your thread from doing any processing. Hans is right that modern OSes already take care of all the little details of DMA and interrupt completion routines.
You need to use the architecture you've described: issuing a request in advance of when you will use the data. Issue asynchronous I/O requests (on Windows these are called OVERLAPPED). Then the flow will go exactly as you envision, but the DMA and interrupts are handled in the drivers.
On Windows, take a look at FILE_FLAG_OVERLAPPED (to CreateFile) and ReadFile (if you like events) or ReadFileEx (if you like callbacks). If you don't have to process the data in any particular order, then add a completion port to the mix, which queues the completion responses.
On Linux, OSX, and many other Unix-like OSes, look at aio_read. Or fadvise. Or use mmap with madvise.
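As a rough illustration of the double-buffered flow from the question using POSIX AIO (process_chunk and the chunk size are made up; error handling is minimal):

    #include <aio.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)

    /* Hypothetical consumer of one chunk of data. */
    extern void process_chunk(const char *buf, ssize_t len);

    /* Sketch: overlap processing of one chunk with the read of the next,
       the ping-pong flow described in the question. */
    static void process_file(const char *path)
    {
        static char buf[2][CHUNK];
        int fd = open(path, O_RDONLY);
        off_t offset = 0;
        int cur = 0;

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur];
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = offset;
        aio_read(&cb);                          /* kick off the first read */

        for (;;) {
            const struct aiocb *list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);     /* wait for the outstanding read */
            ssize_t got = aio_return(&cb);
            if (got <= 0)
                break;

            int ready = cur;
            cur = 1 - cur;
            offset += got;

            memset(&cb, 0, sizeof(cb));         /* start reading the next chunk ... */
            cb.aio_fildes = fd;
            cb.aio_buf    = buf[cur];
            cb.aio_nbytes = CHUNK;
            cb.aio_offset = offset;
            aio_read(&cb);

            process_chunk(buf[ready], got);     /* ... while processing this one */
        }
        close(fd);
    }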
And you can get these benefits without even writing native code. .NET recently added the ReadAsync method to its FileStream, which can be used with continuation-passing style in the form of Task objects, with async/await syntactic sugar in the C# compiler.
Typically, in a multi-mode (user/system) operating system, you do not have access to direct DMA or to interrupts. In systems that extend those features from kernel (system) mode down to user mode, the overhead eliminates the benefit of using them.
Ignoring that what you're asking to do requires a very specialized environment to support it, the idea is sound and common: declare two (or more) buffers so that DMA can fill the next one while you process the first. When two buffers are used, they're sometimes referred to as ping-pong buffers.

How to prevent C read() from reading from cache

I have a program that is used to exercise several disk units in a RAID configuration. One process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a second process waits for the queue to have entries so it can read the data back into memory using read().
The problem that I can't seem to overcome is that when the second process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the data that was written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but I can't seem to get the data to be written to the file. It could have to do with setting up correctly aligned buffers. I have also tried the posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED) system call.
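For reference, on Linux-like systems O_DIRECT generally requires the buffer address, file offset and transfer size to be aligned to the device's logical block size, hence posix_memalign. A minimal sketch; IBM i / PASE may well behave differently, and the 4096-byte block size is an assumption:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK 4096

    /* Sketch: read `bytes` (a multiple of BLOCK) from the start of a file,
       bypassing the page cache with O_DIRECT and an aligned buffer. */
    static ssize_t direct_read(const char *path, size_t bytes)
    {
        void *buf;
        if (posix_memalign(&buf, BLOCK, bytes))         /* aligned buffer */
            return -1;

        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) { free(buf); return -1; }

        ssize_t got = pread(fd, buf, bytes, 0);         /* size and offset must be */
        close(fd);                                      /* BLOCK-aligned too       */
        /* ... use buf ... */
        free(buf);
        return got;
    }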
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My thought is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file work, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in the OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.

How to record a huge data in kernel

I am trying to record a huge data output from the kernel. Essentially I am trying to log how processes context-switch in the kernel. Even for one minute of profiling, the recorded data will be huge. How can I do this? I have to open a huge buffer, record the data in it, and then send it to user space for further analysis.
EDIT: To clarify how big "big" is: the exact problem I am trying to solve produces roughly 10,000 lines of output.
My suggestion is to use the same idea the Linux kernel uses for capturing packets, specifically the packet ring buffer (search: PACKET_RX_RING).
The idea is quite simple: in your user-space program, allocate the ring. Then pass this ring to the "driver" (your kernel module); your driver can simply write the data points into the ring, and your user-space program can read them off. Because it's a ring, you can simply continue writing and the client can continue to read - if the client falls behind, there is a chance that your driver may overtake it (once it has been around the ring), but I'm sure you can size the ring appropriately.
Each slot in the ring should contain your "serialized" data, which the user-space program can simply read off. This type of ring should be fairly easy to implement lock-free, and most likely you want your client to spin to see if there is data.
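A user-space sketch of the slot idea: a single-producer/single-consumer ring of fixed-size records, lock-free via C11 atomics. In the real setup the producer would be your kernel module and the ring would live in memory mapped into both sides, as PACKET_RX_RING does; sizes and names here are made up.

    #include <stdatomic.h>
    #include <string.h>

    #define SLOTS 1024           /* power of two */
    #define RECORD_BYTES 64

    struct ring {
        _Atomic unsigned long head;              /* written by the producer */
        _Atomic unsigned long tail;              /* written by the consumer */
        char slot[SLOTS][RECORD_BYTES];
    };

    /* Producer side: copy one record into the next free slot. */
    static int ring_push(struct ring *r, const void *rec)
    {
        unsigned long h = atomic_load_explicit(&r->head, memory_order_relaxed);
        unsigned long t = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (h - t == SLOTS)
            return 0;                            /* full: caller decides to drop/overwrite */
        memcpy(r->slot[h % SLOTS], rec, RECORD_BYTES);
        atomic_store_explicit(&r->head, h + 1, memory_order_release);
        return 1;
    }

    /* Consumer side: copy one record out, or report the ring as empty. */
    static int ring_pop(struct ring *r, void *rec)
    {
        unsigned long t = atomic_load_explicit(&r->tail, memory_order_relaxed);
        unsigned long h = atomic_load_explicit(&r->head, memory_order_acquire);
        if (t == h)
            return 0;                            /* empty: consumer spins or sleeps */
        memcpy(rec, r->slot[t % SLOTS], RECORD_BYTES);
        atomic_store_explicit(&r->tail, t + 1, memory_order_release);
        return 1;
    }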
Well, the "standard" way to export such a lot of data is using debugfs. You can take a look at how ftrace (kernel/trace/ftrace.c) does this.
And, for even more data, you can use relayfs interface (kernel/relay.c). You can take a look at how blktrace (kernel/trace/blktrace.c) does this.
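A minimal sketch of the debugfs route (names are made up; it just exposes a read-only file backed by a seq_file, which is roughly the pattern such tracing code follows):

    #include <linux/module.h>
    #include <linux/debugfs.h>
    #include <linux/seq_file.h>
    #include <linux/fs.h>

    static struct dentry *dir;

    /* Whatever you have recorded gets printed here. */
    static int ctxsw_show(struct seq_file *m, void *v)
    {
        seq_printf(m, "example record\n");
        return 0;
    }

    static int ctxsw_open(struct inode *inode, struct file *file)
    {
        return single_open(file, ctxsw_show, NULL);
    }

    static const struct file_operations ctxsw_fops = {
        .owner   = THIS_MODULE,
        .open    = ctxsw_open,
        .read    = seq_read,
        .llseek  = seq_lseek,
        .release = single_release,
    };

    static int __init ctxsw_init(void)
    {
        dir = debugfs_create_dir("ctxsw_demo", NULL);
        debugfs_create_file("log", 0444, dir, NULL, &ctxsw_fops);
        return 0;
    }

    static void __exit ctxsw_exit(void)
    {
        debugfs_remove_recursive(dir);
    }

    module_init(ctxsw_init);
    module_exit(ctxsw_exit);
    MODULE_LICENSE("GPL");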
