I am developing in Vulkan 1.0, building a rendering system by learning and implementing functionality one step at a time. I get the gist of command recording and submission, but I haven't gotten far enough to understand a use case in which I'd want multiple command buffers per pool. Slide 14 of this presentation raised some questions.
My understanding and current design is as follows:
Optimally, there should be one command pool per frame per thread so command buffers aren't recording over the same memory while in flight. If I have 3 frames and each frame can have up to 4 recording threads, that's 12 command pools at a minimum.
Command buffers are associated with a command pool at creation time and will be reset on the next frame. To potentially get better performance, the entire pool will be reset rather than the individual buffers.
A single command pool may be used in the creation of multiple command buffers. This group of command buffers would all be used in the same frame and thread.
According to this article under "Command overlap", the reordering of commands may happen between command buffers and vkQueueSubmit calls. So if I had a group of command buffers in the same frame and thread, I'd need something more than just submission order to guarantee the results I want. Maybe I'd use unique semaphores for each submission?
If I'm coding for a frame/thread, I see no advantage to submitting commands a few times from beginning to end as opposed to submitting everything at once in the end. It's the same amount of work in the same time span. It may even be detrimental to submit multiple times because of the vkQueueSubmit overhead mentioned in the specification.
From the assumptions above, in what cases would it be necessary or advantageous to have more than one command buffer per command pool as opposed to having one command buffer that records everything from beginning to end for the given frame and thread?
having one command buffer that records everything from beginning to end for the given frame and thread?
Well, what happens if a thread needs to record things in an order other than the order in which they need to be submitted? That's kind of the point of a CB, isn't it? The ability to build commands in an order that is convenient, then submit them in the way that works out for the GPU.
For example, let's say you have a thread that is rendering a particular set of objects. To do that, you need to write their matrices and other per-object properties to a uniform buffer. And let's say that, for whatever reason, this particular Vulkan implementation doesn't allow you to use mappable memory directly for uniform buffers. So you have to write to mappable memory and copy the data to a uniform buffer via a memory transfer operation.
So the thread creating the commands for these meshes needs to do two things. It needs to build the commands to render the meshes, and it needs to build the commands to transfer the uniform data to the buffer that the rendering commands will read.
Your way, however, requires that commands are put into the CB in the order you want them executed. So you would have to loop through the entire list of objects to build the transfer commands, then loop through it again to build the rendering commands. But you're reading the same objects each time through the loop. During the first loop, you already had access to 100% of the data needed to issue the rendering commands.
And the second time through the loop, all that data is no longer in the cache. So the second time has about the same number of cache misses (and therefore real memory accesses) as the first time.
That's bad.
Furthermore, rendering commands need to be placed within a render pass instance. Transfer commands cannot be in a render pass instance. But if you're putting transfer commands into the same CB as the rendering commands... that CB must begin and end the render pass instance.
So... how can other threads issue commands for that render pass instance?
If you want parallelism (and you do), then you need these threads to be creating secondary CBs for their rendering commands. A later task will collate them into the primary CB, and that CB will have the render pass instance. But secondary CBs built for a render pass cannot contain transfer commands.
So if you want parallelism, then any transfer commands that have to be generated alongside rendering commands must go into a different CB. One that will be submitted before the secondary CBs (or even submitted to a different queue altogether).
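As a rough illustration of the single-pass idea above, here is a plain-C model that makes no Vulkan calls. The types `cmd` and `cmd_list` and the functions `record`, `record_objects` and `submit_in_order` are hypothetical stand-ins invented for this sketch. In real Vulkan the draw list would be a secondary CB executed inside the primary's render pass instance, and the copies would still need explicit synchronization (for example a pipeline barrier) before the draws read the data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; these are NOT Vulkan types. */
enum cmd_kind { CMD_COPY_UNIFORMS, CMD_DRAW };

struct cmd {
    enum cmd_kind kind;
    int object_id;
};

#define MAX_CMDS 64

/* Models one command buffer as a flat list of recorded commands. */
struct cmd_list {
    struct cmd cmds[MAX_CMDS];
    size_t count;
};

/* Append one command to a list. */
void record(struct cmd_list *cb, enum cmd_kind kind, int object_id)
{
    assert(cb->count < MAX_CMDS);
    cb->cmds[cb->count].kind = kind;
    cb->cmds[cb->count].object_id = object_id;
    cb->count++;
}

/* One pass over the objects: each object is visited once and feeds
 * both lists, instead of looping over the objects twice. */
void record_objects(const int *object_ids, size_t n,
                    struct cmd_list *transfer_cb,
                    struct cmd_list *draw_cb)
{
    for (size_t i = 0; i < n; i++) {
        record(transfer_cb, CMD_COPY_UNIFORMS, object_ids[i]);
        record(draw_cb, CMD_DRAW, object_ids[i]);
    }
}

/* Models submission order: every transfer command first, then every draw.
 * `out` must have room for transfer_cb->count + draw_cb->count entries.
 * Returns the number of commands written to `out`. */
size_t submit_in_order(const struct cmd_list *transfer_cb,
                       const struct cmd_list *draw_cb,
                       struct cmd *out)
{
    size_t n = 0;
    for (size_t i = 0; i < transfer_cb->count; i++)
        out[n++] = transfer_cb->cmds[i];
    for (size_t i = 0; i < draw_cb->count; i++)
        out[n++] = draw_cb->cmds[i];
    return n;
}
```

Recording order and execution order are decoupled here: the two lists are filled in an interleaved way, but the transfer list is always handed over before the draw list.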
When doing multiple aio_writes to file is it necessary to wait (e.g. aio_suspend or other) before starting the next one? From the documentation it says that writes are enqueued so does that mean they are written in order? Also, I can track the offset and make sure nothing is ever overwritten (I'm assuming that a failed write could leave a gap in this case).
You've asked two questions:
Can I issue new aios before previous aio finishes?
If so, do aios finish in order (i.e. in the same order they were issued)?
Native aio is essentially only asynchronous for files opened with O_DIRECT, so I'll assume that is the case in what follows.
The answer is:
Yes. That's one of the essential uses of asynchronous I/O.
No. You can assume almost nothing about their completion order.
aio (distinct from POSIX aio; see also here) is Linux's native support for asynchronous I/O. You submit async I/O requests to the kernel, and the kernel records them and performs them asynchronously in the background. Multiple "outstanding" async I/O requests are permitted and are tracked by the kernel.
As for ordering, and write ordering in particular, there are many places inside the kernel where requests can be reordered.
A typical reordering comes from the block layer, which accepts disk write requests from the filesystem layer above it and delivers them to the device driver layer below it. The block layer has several request schedulers, each of which arranges the I/O into a "suitable" order. How "suitable" is defined depends on the scheduler type. For example, with an HDD the scheduler may try to merge more requests, holding back a single sector or page while it waits for a merge opportunity, which causes reordering. For more information, here is an introduction.
If you need strict ordering, you can enforce it manually by waiting for each aio to finish before issuing the next.
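The "wait before issuing more" approach could look roughly like the sketch below. It assumes the POSIX AIO calls named in the question (aio_write, aio_suspend, aio_error, aio_return) rather than the Linux-native interface described in this answer; `write_and_wait` and `write_chunks_in_order` are hypothetical helper names, and on older glibc versions you may need to link with -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Issue one asynchronous write and block until it has completed.
 * Returns the number of bytes written, or -1 with errno set. */
ssize_t write_and_wait(int fd, const void *buf, size_t len, off_t offset)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = (void *)buf;   /* the buffer is only read by aio_write */
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (aio_write(&cb) != 0)
        return -1;

    /* aio_suspend can return early (for example on EINTR), so re-check
     * the request status in a loop until it is no longer in progress. */
    const struct aiocb *const list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    int err = aio_error(&cb);
    ssize_t ret = aio_return(&cb);  /* must be called exactly once */
    if (err != 0) {
        errno = err;
        return -1;
    }
    return ret;
}

/* Write n chunks back to back starting at `start`, one at a time, so
 * chunk i+1 is never issued before chunk i has finished.
 * Returns 0 on success, -1 on the first failed or short write. */
int write_chunks_in_order(int fd, const char *const *chunks,
                          const size_t *lens, size_t n, off_t start)
{
    off_t offset = start;
    for (size_t i = 0; i < n; i++) {
        ssize_t written = write_and_wait(fd, chunks[i], lens[i], offset);
        if (written < 0) {
            perror("aio write");
            return -1;
        }
        if ((size_t)written != lens[i])
            return -1;          /* short write: stop instead of leaving a gap */
        assert(written >= 0);
        offset += written;
    }
    return 0;
}
```

This trades away the overlap that makes asynchronous I/O attractive, so it only makes sense where the ordering requirement is real. Note that completion here means the write has been accepted by the kernel, not necessarily that it has reached the disk.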
I've seen a few questions here on Stack Overflow dealing with the same issues, but no definite answer. I thought I'd ask again, with a bunch of questions of my own. All relate to the subject matter at hand.
So, do we know when the data transfer from host to the OpenCL device occurs? Can you tell me the exact memory transfer operation of the functions below (that is, what data is transferred or created, if any, when these functions are invoked)?
clCreateBuffer()
clSetKernelArg()
clEnqueueNDRangeKernel()
The first two don't even produce events, so we can't time them, but surely some data transferring is happening here.
Is there a way to transfer data to a device without first setting it as a kernel arg?
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device. Why would that not be desirable, since, that way, we could avoid further data transfer commands (and surely the driver implements this in the most efficient way)?
Does transferred data (say, as part of a kernel arg) stay on the device for further manipulation after a kernel returns? If not, is there a way to do just that?
Buffer copies go through command queues. The easiest way to sync a command queue with the host is clFinish().
clCreateBuffer()
clEnqueueWriteBuffer() <-------- you can get event data from this
(set blocking parameter to false to queue everything quickly)
(set blocking to true if you want the write to sync here)
clSetKernelArg()
clEnqueueWriteBuffer() <----- it could be here too
clEnqueueNDRangeKernel()
clEnqueueWriteBuffer() <----- or here (too quickly re-set an array?)
clFinish() <--------- this ensures all queued commands are executed before this
now you can query that event with clGetEventProfilingInfo() to check when it started and when it ended (the queue must have been created with CL_QUEUE_PROFILING_ENABLE)
To keep a buffer on the device, create it on the device and don't migrate it to another device. Using only the CL_MEM_READ_WRITE flag in clCreateBuffer() is enough to make it a real device-side buffer until you release it.
CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR use host memory, which the device maps for access by its cores. This is faster for streaming data in and out because no extra data movement is needed on the host side. If you always need device memory, such as fast GDDR5 or HBM, you should not use these flags.
Copy to the device once and use the data as much as you want, provided the device has its own memory, of course. For example, Intel HD Graphics 400 doesn't have its own memory and shares RAM, so it is much faster to use the CL_MEM_..._HOST_PTR flags, especially USE_HOST_PTR.
To check whether a device shares RAM with the CPU, query its CL_DEVICE_HOST_UNIFIED_MEMORY property with clGetDeviceInfo().
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device
Even without map/unmap commands prior to kernel execution, my computer behaves the same way, but I use map/unmap just to be safe, and it doesn't cost many cycles.
Edit: if you want to make sure a command doesn't start before you want it to, you can add a user event to the event wait list parameter of the buffer-write command. You can then trigger the user event to let the write start, because a command waits for all events in its wait list (if any are specified) to complete before it runs.
I am designing a file system in user space and need to test it. I do not want to use the available benchmarking tools, as my requirements are different. So to test the file system I wish to simulate file access operations. To do this, I first use the ftw() function to walk through one of my existing (experimental) file systems and list all the files and directories in a file.
Then I invoke a simulator to simulate file access by a number of processes. The simulator randomly starts a "process", i.e. it spawns a thread which does what a real process would have done. The thread randomly selects a file operation (read, write, rename, etc.) and selects arguments to this operation from the list generated by ftw(). The thread does a number of such file operations and then exits, marking the end of a process. The simulator continues to spawn threads; thread execution can overlap just as real processes do. As operations are performed by threads, files get inserted, deleted and renamed, and the list of files is updated accordingly.
I have not yet started coding. Does the plan seem sane? I am also not sure how to code the simulator: how will it spawn threads over a period of time? Should I be using some random delay to do this?
Thanks
Yep, that seems fairly reasonable to me. I would consider attempting to impose a statistical distribution over your file operations (and accesses to particular files) that is somehow matched to your expected workload. You might be able to find some statistics about typical filesystem workloads as a starting point.
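One way to impose such a distribution is to pick each operation from a weighted table. The sketch below is only an illustration: the `file_op` enum, the `pick_op` function and any weights you pass in are hypothetical, and real weights should come from measurements of the workload you expect.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical set of simulated operations. */
enum file_op { OP_READ, OP_WRITE, OP_RENAME, OP_DELETE, OP_COUNT };

/* Map a uniform sample u in [0, 1) to an operation so that each
 * operation is chosen with probability weight / sum(weights).
 * Weights must be non-negative and must not all be zero. */
enum file_op pick_op(const double weights[OP_COUNT], double u)
{
    assert(u >= 0.0 && u < 1.0);

    double total = 0.0;
    for (int i = 0; i < OP_COUNT; i++) {
        assert(weights[i] >= 0.0);
        total += weights[i];
    }
    assert(total > 0.0);

    double threshold = u * total;
    double cumulative = 0.0;
    enum file_op fallback = OP_READ;
    for (int i = 0; i < OP_COUNT; i++) {
        cumulative += weights[i];
        if (weights[i] > 0.0)
            fallback = (enum file_op)i;
        if (threshold < cumulative)
            return (enum file_op)i;
    }
    /* Only reached through floating-point rounding when u is close to 1:
     * fall back to the last operation that has a non-zero weight. */
    return fallback;
}
```

Each simulated process thread could keep its own seed and derive `u` as `rand_r(&seed) / ((double)RAND_MAX + 1.0)`, so that threads do not contend on shared random state. The same table approach works for choosing which file to access.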
That sounds about right for a decent test case just to make sure it's working. You could use sleep() to wait between spawning threads, or just spawn them all at once and have each one do an operation, wait a bit, then do another operation, and so on. IMO if you hit it hard with a lot of requests and it works, then there's a good chance your filesystem will do just fine. Take an example from PostMark, which does nothing but append like crazy to different files, and from other benchmarks that do random-access reads and writes at different locations to make sure that the page has to be read from disk.
In my program, I hold two files open for writing: a content-file, containing chunks of data, and an index-file, containing a map of which chunks of data have been written so far.
I would like to flush them both to disk as efficiently as possible, with the only constraint that the blocks in the content-file must be written before the corresponding blocks in the index-file (naturally).
The catch is that I would like to avoid blocking (i.e. calling fsync), for both latency and throughput reasons.
Any ideas?
I don't think you can do this easily in a single execution path. You need fsync to have the write to disk guaranteed - and this is going to have to wait for the write.
I suspect it is possible (but not easy) to do this by delegating the writing task to a separate thread or process. Generate the data in your existing program and 'write' it to the second thread/process using any method that looks sensible. This can be non-blocking. The second thread would then write any new data to your content-file, then fsync, then write the index-file, then check for new data again. Key design decisions relate to how you separate the two execution paths, how you communicate between them, and whether you need to report the write back to the main program. This could still have latency and throughput issues, but that's part of the cost of choosing to have the index-file and content-file in sync. At least there would be a chance of getting work done while waiting on the disk.
It could be worth looking at the source of any of the transactional databases to see whether this is well enough encapsulated to be useful to you. You could also investigate the sync option when you mount the file system for the content-file.
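The write, fsync, write sequence that the second thread would perform for each chunk might be sketched as follows. `pwrite_all` and `commit_chunk` are hypothetical names, the offsets and index entry layout are left to the caller, and the queue that hands work to the thread is not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly `len` bytes at `offset`, retrying after short writes
 * and EINTR. Returns 0 on success, -1 on error. */
int pwrite_all(int fd, const void *buf, size_t len, off_t offset)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;          /* no progress: give up rather than spin */
        p += n;
        len -= (size_t)n;
        offset += n;
    }
    return 0;
}

/* Commit one chunk: the data is written and fsync'ed to the content
 * file before the index entry that refers to it is written.
 * Returns 0 on success, -1 on error. */
int commit_chunk(int content_fd, int index_fd,
                 const void *data, size_t data_len, off_t data_off,
                 const void *entry, size_t entry_len, off_t entry_off)
{
    assert(data != NULL && entry != NULL);
    if (pwrite_all(content_fd, data, data_len, data_off) != 0)
        return -1;
    if (fsync(content_fd) != 0) /* blocks, but only in the writer thread */
        return -1;
    return pwrite_all(index_fd, entry, entry_len, entry_off);
}
```

The index file itself is not synced here; whether and when to fsync it is a separate decision. Batching several chunks before each fsync would reduce the number of blocking calls at the cost of a later index update.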
I have an image generator which would benefit from running in threads. I am intending to use POSIX threads, and have written some mock up code based on https://computing.llnl.gov/tutorials/pthreads/#ConVarSignal to test things out.
In the intended program, when the GUI is in use, I want the generated lines to appear from the top to the bottom one by one (the image generation can be very slow).
It should also be noted, the data generated in the threads is not the actual image data. The thread data is read and transformed into RGB data and placed into the actual image buffer. And within the GUI, the way the thread generated data is translated to RGB data can be changed during image generation without stopping image generation.
However, there is no guarantee from the thread scheduler that the threads will run in the order I want, which unfortunately makes the transformations of the thread generated data trickier, implying the undesirable solution of keeping an array to hold a bool value to indicate which lines are done.
How should I deal with this?
Currently I have a watcher thread to report when the image is complete (this should really drive a progress bar, but I've not got that far yet; for now it uses pthread_cond_wait), and several render threads doing while(next_line());
next_line() locks a mutex, reads the value of img_next_line, increments it, and unlocks the mutex. It then renders the line, locks a second (different) mutex to get lines_done, checks it against height, signals if the image is complete, unlocks, and returns 0 if complete or 1 if not.
Given that threads may well be executing in parallel on different cores, it's pretty much inevitable that the results will arrive out of order. I think your approach of tracking what's complete with a set of flags is quite reasonable.
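With such a flag array, the GUI side can still reveal lines strictly from top to bottom by tracking how many leading lines are finished. The function below is a minimal sketch; `advance_frontier` and `line_done` are hypothetical names, and it assumes the flags are read under the same lock the render threads hold when setting them (or are otherwise safely published).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* `frontier` is the number of lines, counted from the top, that have
 * already been shown. Returns the new frontier: the old one advanced
 * past every consecutive finished line. Lines finished out of order
 * further down are left for a later call. */
size_t advance_frontier(const bool *line_done, size_t height, size_t frontier)
{
    assert(frontier <= height);
    while (frontier < height && line_done[frontier])
        frontier++;
    return frontier;
}
```

The GUI would convert and draw lines in the range from the old frontier up to the returned value, then remember the returned value for the next update.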
It's possible that the overall effect might be nicer if you used threads at a different granularity. Say you give each thread 20 lines to work on rather than one. Then on completion you'd have bigger blocks available to draw, and maybe drawing stripes would look OK?
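Handing out blocks instead of single lines is a small change to the mutex-protected counter described in the question. The sketch below uses hypothetical names (`line_queue`, `claim_block`); the block size of 20 is just the figure suggested above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Shared work counter, protected by its own mutex. */
struct line_queue {
    pthread_mutex_t lock;
    size_t next_line;   /* first line not yet handed out */
    size_t height;      /* total number of lines in the image */
};

/* Claim up to `block_size` consecutive lines for the calling thread.
 * On success stores the first line and the line count and returns true;
 * returns false once every line has been handed out. */
bool claim_block(struct line_queue *q, size_t block_size,
                 size_t *first, size_t *count)
{
    assert(block_size > 0);

    pthread_mutex_lock(&q->lock);
    size_t remaining = q->height - q->next_line;
    size_t n = remaining < block_size ? remaining : block_size;
    *first = q->next_line;
    *count = n;
    q->next_line += n;
    pthread_mutex_unlock(&q->lock);

    return n > 0;
}
```

A render thread would loop on `claim_block(&queue, 20, &first, &count)` and render lines `first` to `first + count - 1` on each iteration. The last block is simply shorter when the height is not a multiple of the block size.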
Just accept that the rows will be done in a non-deterministic order; it sounds like that is happening because they take different lengths of time to render, in which case forcing a completion order will waste CPU time.
This may sound silly but as a user I don't want to see one line rendered slowly from top to bottom. It makes a slow process seem even slower because the user already has completely predicted what will happen next. Better to just render when ready even if it is scattered over the place (either as single lines or better yet as blocks as some have suggested). It makes it look more random and therefore more captivating and less boring to a user like me.