I'm running a compute shader. It takes about .22ms to complete (according to a D3D11 query), but real-world time is around .6ms on average, so the Map call definitely takes up some of that time. The shader computes a dirty RECT that contains all the pixels that have changed since the last frame and returns it to the CPU. This all works great.
My issue is that the Map call is very slow. Under normal conditions the total is around .6ms, but if I run this in the background alongside any GPU-intensive application (Blender, an AAA game, etc.) it really starts to slow down: I see times jump from the normal .6ms all the way to 9ms. I need to figure out why the Map is taking so long.
The shader writes its result into this buffer:
buffer_desc.ByteWidth = sizeof(RECT);
buffer_desc.StructureByteStride = sizeof(RECT);
buffer_desc.Usage = D3D11_USAGE_DEFAULT;
buffer_desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
buffer_desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
buffer_desc.CPUAccessFlags = 0;
hr = ID3D11Device5_CreateBuffer(ctx->device, &buffer_desc, NULL, &ctx->reduction_final);
With a UAV of
uav_desc.Format = DXGI_FORMAT_UNKNOWN;
uav_desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uav_desc.Buffer.FirstElement = 0;
uav_desc.Buffer.NumElements = 1;
uav_desc.Buffer.Flags = 0;
hr = ID3D11Device5_CreateUnorderedAccessView(ctx->device, (ID3D11Resource*) ctx->reduction_final, &uav_desc, &ctx->reduction_final_UAV);
Copying it into this staging buffer here
buffer_desc.ByteWidth = sizeof(RECT);
buffer_desc.StructureByteStride = sizeof(RECT);
buffer_desc.Usage = D3D11_USAGE_STAGING;
buffer_desc.BindFlags = 0;
buffer_desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
buffer_desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
hr = ID3D11Device5_CreateBuffer(ctx->device, &buffer_desc, NULL, &ctx->cpu_staging);
and mapping it like so
hr = ID3D11DeviceContext4_Map(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0, D3D11_MAP_READ, 0, &mapped_res);
I added a fence to my system so I can time out if the readback takes too long (longer than 1ms).
Creation of fence and event
hr = ID3D11Device5_CreateFence(ctx->device, 1, D3D11_FENCE_FLAG_NONE, &IID_ID3D11Fence, (void**) &ctx->fence);
ctx->fence_handle = CreateEventA(NULL, FALSE, FALSE, NULL);
Then, after my dispatch calls, I copy into the staging buffer and signal the fence:
ID3D11DeviceContext4_CopyResource(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, (ID3D11Resource*) ctx->reduction_final);
ID3D11DeviceContext4_Signal(ctx->device_ctx, ctx->fence, ++ctx->fence_counter);
hr = ID3D11Fence_SetEventOnCompletion(ctx->fence, ctx->fence_counter, ctx->fence_handle);
This all seems to be working. The fence signals and lets WaitForSingleObject return, and when I wait on it I can time out, move on to other things while the GPU finishes, and come back later (which is obviously not what I want; I need the data from the shader ASAP).
DWORD waitResult = WaitForSingleObject(ctx->fence_handle, 1);
if (waitResult != WAIT_OBJECT_0)
{
LOG("Timeout Count: %u", ++timeout_count);
return false;
}
else
{
LOG("Timeout Count: %u", timeout_count);
ID3D11DeviceContext4_Map(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0, D3D11_MAP_READ, 0, &mapped_res);
memcpy(rect, mapped_res.pData, sizeof(RECT));
ID3D11DeviceContext4_Unmap(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0);
return true;
}
The problem is that the Map can still take a long time (ranging from .6 to 9ms). The fence signals once everything is done, right? And that takes less than 1ms. Yet the Map call itself can still take a long time. What gives? What am I missing about how mapping works? Is it possible to reduce the time to map? Is the fence signaling before the copy happens?
So, to recap: my shader runs fast under normal operation, around .6ms total, with .22ms for the shader itself (and I'm assuming the rest is the mapping). Once I run my shader while rendering in Blender, it slows down badly. The fence tells me the GPU work is still completing in record time, but the Map call is still taking a very long time (upwards of 9ms).
Other things I've tried:
Setting SetGPUThreadPriority to 7.
Setting AvSetMmThreadCharacteristicsW to "DisplayPostProcessing", "Games", and "Capture"
Setting AvSetMmThreadPriority to AVRT_PRIORITY_CRITICAL
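To illustrate one common way to hide this readback latency (this is not from the original post; the two-element cpu_staging array, the frame_index field, and the read_dirty_rect_latched name are hypothetical): keep two staging buffers and Map the one that was copied into on the previous frame, so the Map almost never has to wait on in-flight GPU work, at the cost of the dirty RECT being one frame old.
/* Hedged sketch: double-buffered staging readback, one frame of latency. */
static bool read_dirty_rect_latched(struct context *ctx, RECT *rect)
{
    UINT write_idx = ctx->frame_index & 1;        /* copy target this frame   */
    UINT read_idx  = (ctx->frame_index + 1) & 1;  /* buffer filled last frame */

    /* queue this frame's copy; do not wait for it */
    ID3D11DeviceContext4_CopyResource(ctx->device_ctx,
        (ID3D11Resource*) ctx->cpu_staging[write_idx],
        (ID3D11Resource*) ctx->reduction_final);
    ctx->frame_index++;

    /* map last frame's copy; the GPU has had a whole frame to finish it */
    D3D11_MAPPED_SUBRESOURCE mapped_res;
    HRESULT hr = ID3D11DeviceContext4_Map(ctx->device_ctx,
        (ID3D11Resource*) ctx->cpu_staging[read_idx], 0,
        D3D11_MAP_READ, D3D11_MAP_FLAG_DO_NOT_WAIT, &mapped_res);
    if (hr == DXGI_ERROR_WAS_STILL_DRAWING)
        return false;              /* still in flight; pick it up next frame */
    if (FAILED(hr))
        return false;

    memcpy(rect, mapped_res.pData, sizeof(RECT));
    ID3D11DeviceContext4_Unmap(ctx->device_ctx,
        (ID3D11Resource*) ctx->cpu_staging[read_idx], 0);
    return true;
}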
Related
I'm using the transfer queue to upload data to GPU-local memory to be used by the graphics queue. I believe I need 3 barriers: one to release the texture object from the transfer queue, one to acquire it on the graphics queue, and one to transition it from TRANSFER_DST_OPTIMAL to SHADER_READ_ONLY_OPTIMAL. I think my barriers are what's incorrect, since this is the error I get; I do still see the correct rendered output, but I'm on Nvidia hardware. Is there any synchronization missing?
UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout(ERROR / SPEC): msgNum: 1303270965 -
Validation Error: [ UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout ] Object 0:
handle = 0x562696461ca0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x4dae5635 |
Submitted command buffer expects VkImage 0x1c000000001c[] (subresource: aspectMask 0x1 array
layer 0, mip level 0) to be in layout VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL--instead,
current layout is VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL.
I believe what I'm doing wrong is not properly specifying the stage masks:
VkImageMemoryBarrier tex_barrier = {0};
/* layout transition - UNDEFINED -> TRANSFER_DST */
tex_barrier.srcAccessMask = 0;
tex_barrier.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
tex_barrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
tex_barrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
tex_barrier.srcQueueFamilyIndex = -1;
tex_barrier.dstQueueFamilyIndex = -1;
tex_barrier.subresourceRange = (VkImageSubresourceRange) { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };
vkCmdPipelineBarrier(transfer_cmdbuffs[0],
VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
VK_PIPELINE_STAGE_TRANSFER_BIT,
0,
0, NULL, 0, NULL, 1, &tex_barrier);
/* queue ownership transfer */
tex_barrier.srcAccessMask = 0;
tex_barrier.dstAccessMask = 0;
tex_barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
tex_barrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
tex_barrier.srcQueueFamilyIndex = device.transfer_queue_family_index;
tex_barrier.dstQueueFamilyIndex = device.graphics_queue_family_index;
vkCmdPipelineBarrier(transfer_cmdbuffs[0],
VK_PIPELINE_STAGE_TRANSFER_BIT,
VK_PIPELINE_STAGE_TRANSFER_BIT,
0,
0, NULL, 0, NULL, 1, &tex_barrier);
tex_barrier.srcAccessMask = 0;
tex_barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
tex_barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
tex_barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
tex_barrier.srcQueueFamilyIndex = device.transfer_queue_family_index;
tex_barrier.dstQueueFamilyIndex = device.graphics_queue_family_index;
vkCmdPipelineBarrier(transfer_cmdbuffs[0],
VK_PIPELINE_STAGE_TRANSFER_BIT,
VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
0,
0, NULL, 0, NULL, 1, &tex_barrier);
Doing an ownership transfer is a two-way process: the source of the transfer has to release the resource, and the receiver has to acquire it. And by "the source" and "the receiver", I mean the queues themselves. You can't merely have a queue take ownership of a resource; that queue must issue a command to claim ownership of it.
You need to submit a release barrier operation on the source queue. It must specify the source queue family as well as the destination queue family. Then, you have to submit an acquire barrier operation on the receiving queue, using the same source and destination. And you must ensure the order of these operations via a semaphore. So the vkQueueSubmit call for the acquire has to wait on the semaphore from the submission of the release operation (a timeline semaphore would work too).
Now, since these are pipeline/memory barriers, you are free to also specify a layout transition. You don't need a third barrier to change the layout, but both barriers have to specify the same source/destination layouts for the acquire/release operation.
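A minimal sketch of the release/acquire pair described above, assuming the same device.*_queue_family_index fields as in the question; the texture_image handle and the graphics_cmdbuff command buffer (recorded for and submitted to the graphics queue) are hypothetical, and the semaphore wiring between the two vkQueueSubmit calls is omitted:
/* Release on the transfer queue, recorded at the end of transfer_cmdbuffs[0].
   The TRANSFER_DST -> SHADER_READ_ONLY layout transition is folded into the
   release/acquire pair, so no third barrier is needed. */
VkImageMemoryBarrier release = {0};
release.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
release.srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
release.dstAccessMask       = 0;   /* ignored on the releasing queue */
release.oldLayout           = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
release.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
release.srcQueueFamilyIndex = device.transfer_queue_family_index;
release.dstQueueFamilyIndex = device.graphics_queue_family_index;
release.image               = texture_image; /* hypothetical image handle */
release.subresourceRange    = (VkImageSubresourceRange) { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };
vkCmdPipelineBarrier(transfer_cmdbuffs[0],
    VK_PIPELINE_STAGE_TRANSFER_BIT,
    VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
    0,
    0, NULL, 0, NULL, 1, &release);
/* Acquire on the graphics queue: same queue families and same layouts,
   recorded in a command buffer submitted to the graphics queue, with that
   submit waiting on a semaphore signalled by the transfer submit. */
VkImageMemoryBarrier acquire = release;
acquire.srcAccessMask = 0;  /* ignored on the acquiring queue */
acquire.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
vkCmdPipelineBarrier(graphics_cmdbuff,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    0,
    0, NULL, 0, NULL, 1, &acquire);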
"select * from tables" query in MySQL connector/libmysql C is very slow in getting the results:
Here is my code in C:
int getfrommysql() {
time_t starttime, endtime;
time(&starttime);
double st;
st = GetTickCount();
MYSQL *sqlconn = NULL;
MYSQL_RES * res = NULL;
MYSQL_ROW row = NULL;
MYSQL_FIELD * field;
/*char ipaddr[16];
memset(ipaddr,0,sizeof(ipaddr));*/
char * sqlquery = "select * from seat_getvalue";
sqlconn = malloc(sizeof(MYSQL));
sqlconn = mysql_init(sqlconn);
mysql_real_connect(sqlconn, "111.111.111.111", "root", "password", "database", 0, NULL, 0);
char query[100];
memset(query, 0, 100);
strcpy(query, "select * from seat_getvalue");
mysql_query(sqlconn, query);
res = mysql_store_result(sqlconn);
int col_num, row_num;
if (res) {
col_num = res->field_count;
row_num = res->row_count;
printf("\nthere is a %d row,%d field table", res->row_count, res->field_count);
}
for (int i = 0; i < row_num; i++) {
row = mysql_fetch_row(res);
for (int j = 0; j < col_num; j++) {
printf("%s\t", row[j]);
}
printf("\n");
}
mysql_close(sqlconn);
time(&endtime);
double et = GetTickCount();
printf("the process cost time(get by GetTickCount):%f",et-st);
printf("\nthere is a %d row,%d field table", res->row_count, res->field_count);
}
Apart from the fact that there isn't even a question in your post, you are comparing apples to oranges. MySQL gives you (I think; correct me if I am wrong) the time needed to execute the query, while in your C code you measure the time that passed between the start and the end of the program. This is wrong for at least two reasons:
The difference between two GetTickCount() calls gives you the wall-clock time that passed between the calls on the whole system, not the time spent executing your software. These are two different things, because your process does not have to run from beginning to end uninterrupted: it can (and probably will) be swapped out for another process in the middle of its execution, it can be interrupted, and so on. All the time the system spends doing things outside your program gets added to your measurement. To get the time spent executing your code, you could use GetProcessTimes or QueryProcessCycleTime.
Even if you did use an appropriate method of measuring time, you are timing the wrong part of the code. Instead of measuring only the query execution and result retrieval, you measure the whole run: establishing the connection, copying the query string, executing it, storing the results, fetching them, printing them, and closing the connection. That is quite different from what MySQL measures, and printing hundreds of lines can take a lot of time depending on your terminal, sometimes more than the SQL query itself. If you want to know how much time the connector needs to retrieve the data, benchmark only the code responsible for executing the query and fetching the results, or better, use a dedicated performance-monitoring tool or library. I can't point to a specific one, because I have never run tests like that, but there certainly are some.
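As a rough illustration (not from the original answer), here is a sketch that times only the query and the row fetch with QueryPerformanceCounter, assuming an already-connected MYSQL *sqlconn and the same seat_getvalue table as in the question:
/* Measure only query + result retrieval, with no printing inside the timed region. */
#include <windows.h>
#include <stdio.h>
#include <mysql.h>
static double elapsed_ms(LARGE_INTEGER a, LARGE_INTEGER b, LARGE_INTEGER freq)
{
    return 1000.0 * (double)(b.QuadPart - a.QuadPart) / (double)freq.QuadPart;
}
void time_query(MYSQL *sqlconn)
{
    LARGE_INTEGER freq, t0, t1, t2;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    mysql_query(sqlconn, "select * from seat_getvalue");
    QueryPerformanceCounter(&t1);
    MYSQL_RES *res = mysql_store_result(sqlconn); /* pulls the whole result set to the client */
    MYSQL_ROW row;
    while ((row = mysql_fetch_row(res)) != NULL)
        ;                                         /* fetch only, no printf */
    QueryPerformanceCounter(&t2);
    printf("query: %.3f ms, store+fetch: %.3f ms\n",
           elapsed_ms(t0, t1, freq), elapsed_ms(t1, t2, freq));
    mysql_free_result(res);
}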
I'm implementing a low-pass filter in C with the PortAudio library.
I record my microphone input with one of the example programs that ships with PortAudio. There I added the following code:
float cutoff = 4000.0;
float filter(float cutoffFreq){
    float RC = 1.0/(cutoffFreq * 2 * M_PI);
    float dt = 1.0/SAMPLE_RATE;
    float alpha = dt/(RC+dt);
    return alpha;
}
float filteredArray[numSamples];
filteredArray[0] = data.recordedSamples[0];
for(i=1; i<numSamples; i++){
    if(i%SAMPLE_RATE == 0){
        cutoff = cutoff - 400;
    }
    data.recordedSamples[i] = data.recordedSamples[i-1] + (filter(cutoff)*(data.recordedSamples[i] - data.recordedSamples[i-1]));
}
When I run this for 5 seconds it works. But when I try to run it for longer than 5 seconds it fails: the application records everything but crashes on playback. If I remove the filter, the application works.
Any advice?
The problem:
You lower the cutoff frequency by 400 Hz every time i % SAMPLE_RATE == 0.
You never stop, so the cutoff eventually goes below zero.
Also, this does not happen once per second of real time; it happens every time your for loop crosses a one-second boundary in the data, which can occur more often than you think if the code is not called from the right place (which is not visible in your code).
You are also filtering in the wrong order:
... a[i] = f(a[i], a[i-1]); i++;
which means you filter against the already-filtered a[i-1] value.
What to do about it:
Check where the code is placed; it should run in some event (such as a "packet done sampling" callback), in a thread after some Sleep(...), or inside a timer.
Change how the cutoff is lowered (handle the edge cases).
Reverse the direction of the filtering loop.
Something like this:
int i_done=0;
void on_some_timer()
{
    int i;
    cutoff-=400;
    if (cutoff<1) cutoff=1; // change 1 to your limit frequency
    if (numSamples!=i_done)
        for (i=numSamples-1; (i>=i_done)&&(i>0); i--)
            data.recordedSamples[i] = data.recordedSamples[i-1] + (filter(cutoff)*(data.recordedSamples[i] - data.recordedSamples[i-1]));
    i_done=numSamples;
}
If your code placement is already OK (you did not post the whole thing, so I may be missing something), then just add the if (cutoff<1) cutoff=1; clamp after the cutoff change.
I have a problem when I write a large amount of data (>2 GB) to a file. The first ~1.4 GB are written fast (100 MB/s), then the code becomes really slow (0-2 MB/s).
My code (simplified) is:
//FileOptions FILE_FLAG_NO_BUFFERING = (FileOptions)0x20000000;
FileOptions fileOptions = FileOptions.SequentialScan;
int fileBufferSize = 1024 * 1024;
byte[] Buffer = new byte[32768];
Random random = new Random();
long fileSize = 2588490188;
long totalByteWritten = 0;
using (FileStream fs = File.Create(@"c:\test\test.bin", fileBufferSize, fileOptions))
{
while (totalByteWritten < fileSize)
{
random.NextBytes(Buffer);
fs.Write(Buffer, 0, Buffer.Length);
totalByteWritten += Buffer.Length;
//Thread.Sleep(10);
}
}
I think this is a caching issue: during the "fast write" phase RAM usage increases as well, and when RAM usage stops increasing the write performance drops.
What I have tried:
changing to async writes
-> no significant change
changing the array buffer size
-> no significant change
changing fileBufferSize
-> no significant change, but with a large buffer (~100MB) write performance is fast at first; when RAM usage stops increasing, write performance drops to 0 and then, after a while, goes back to 100 MB/s. It seems the cache buffer is being "flushed".
changing fileOptions to WriteThrough
-> performance is always slow
adding fs.Flush(true) after every xx loops
-> no significant change
uncommenting Thread.Sleep(10)
-> write speed is always good... this is strange
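For reference (this is not from the original post, and it is C at the Win32 level rather than the C# above): the commented-out FILE_FLAG_NO_BUFFERING flag bypasses the system cache entirely, which is typically what removes the fast-then-stalled pattern, at the cost of requiring sector-aligned buffer sizes and file offsets. A hedged sketch, reusing the path and target size from the question:
/* Unbuffered, write-through output; throughput is bounded by the disk from the
   start instead of being fast until the cache fills and then collapsing. */
#include <windows.h>
#include <stdio.h>
int main(void)
{
    const DWORD chunk = 1 << 20;                  /* 1 MiB, a multiple of the sector size */
    BYTE *buf = VirtualAlloc(NULL, chunk,         /* page-aligned, so also sector-aligned */
                             MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    HANDLE h = CreateFileA("c:\\test\\test.bin", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);
    if (buf == NULL || h == INVALID_HANDLE_VALUE) return 1;
    unsigned long long total = 0, target = 2588490188ULL;
    while (total < target) {
        DWORD written = 0;
        if (!WriteFile(h, buf, chunk, &written, NULL)) break;
        total += written;
    }
    /* Note: with NO_BUFFERING the file length ends up rounded to a sector multiple;
       reopen without the flag and call SetEndOfFile() to trim to the exact size. */
    CloseHandle(h);
    VirtualFree(buf, 0, MEM_RELEASE);
    printf("wrote %llu bytes\n", total);
    return 0;
}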
Is it somehow trying to write before it has finished writing the previous chunk and getting into a mess? (It seems unlikely, but it's very odd that the Thread.Sleep should speed it up, and this might explain it.) What happens if you modify the code inside the using statement to lock the filestream, like this?
using (FileStream fs = File.Create(@"c:\testing\test.bin", fileBufferSize, fileOptions))
{
while (fs.Position < fileSize)
{
lock(fs) // this is the bit I have added to try to speed it up
{
random.NextBytes(Buffer);
fs.Write(Buffer, 0, Buffer.Length);
}
}
}
EDIT: I have tweaked your example code to include the while loop required to make it write a file of the correct size.
Incidentally, when I run the sample code it is very quick with or without the lock statement and adding the sleep slows it down significantly.
I wrote a little HTTP video streaming server using GStreamer. Essentially the client does a GET request and receives a continuous HTTP stream.
The stream should be sent synchronously, i.e. at the same speed as the bitrate. The problem is that some players (mplayer is a prominent example) don't buffer variable-bitrate content well, and thus keep lagging every other second.
I want to circumvent the buffer underruns by transmitting the first, say, 5 MB immediately, ignoring the pipeline's clock. The rest of the stream should be transmitted at the appropriate speed.
I figured setting fdsink's sync=FALSE for the first 5 MB, and sync=TRUE from then on, should do the trick, but that does not work: once sync is re-enabled, fdsink patiently waits for the pipeline clock to catch up to the already-sent data. In my test with a very low bitrate, no data is transmitted for quite a few seconds.
My fdsink reader thread currently looks like this:
static void *readerThreadFun(void *arg) {
int fastStart = TRUE;
g_object_set(G_OBJECT(fdsink0), "sync", FALSE, NULL);
for(uint64_t position = 0;;) {
// (On the other side there is node.js,
// that's why I don't do the HTTP chunking here)
ssize_t readCount = splice(gstreamerFd, NULL, remoteFd,
NULL, 1<<20, SPLICE_F_MOVE|SPLICE_F_MORE);
if(readCount == 0) {
break;
} else if(readCount < 0) {
goto error;
}
position += readCount;
if(fastStart && position >= 5*1024*1024) {
fastStart = FALSE;
g_object_set(G_OBJECT(fdsink0), "sync", TRUE, NULL);
}
}
...
}
How can I make GStreamer "forget" the duration the wall clock has to catch up with? Is there some "reset" function? Am I misunderstanding sync? Is there another method to realize a "fast start" in GStreamer?
Not quite the solution I was looking for, but this works:
gst_base_sink_set_ts_offset(GST_BASE_SINK(fdsink0), -10ll*1000*1000*1000);
The sink will stream the first 10 seconds immediately.
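Equivalently (a sketch, not part of the original answer), the same offset can be set through the "ts-offset" property that fdsink inherits from GstBaseSink, using the GST_SECOND constant for readability:
/* Shift the sink's clock 10 seconds into the past so the first 10 seconds
   of the stream are pushed out immediately. */
g_object_set(G_OBJECT(fdsink0), "ts-offset", (gint64)(-10 * GST_SECOND), NULL);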