I'm using the transfer queue to upload data to GPU-local memory, to be used by the graphics queue. I believe I need 3 barriers: one to release the texture object from the transfer queue, one to acquire it on the graphics queue, and one to transition it from TRANSFER_DST_OPTIMAL to SHADER_READ_ONLY_OPTIMAL. I think my barriers are what's incorrect, as this is the error I get; I do still see the correct rendered output, though that may just be because I'm on Nvidia hardware. Is there any synchronization missing?
UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout(ERROR / SPEC): msgNum: 1303270965 -
Validation Error: [ UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout ] Object 0:
handle = 0x562696461ca0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x4dae5635 |
Submitted command buffer expects VkImage 0x1c000000001c[] (subresource: aspectMask 0x1 array
layer 0, mip level 0) to be in layout VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL--instead,
current layout is VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL.
I believe what I'm doing wrong is not properly specifying the stage masks:
VkImageMemoryBarrier tex_barrier = {0};

/* layout transition - UNDEFINED -> TRANSFER_DST */
tex_barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
tex_barrier.srcAccessMask = 0;
tex_barrier.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
tex_barrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
tex_barrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
tex_barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
tex_barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
tex_barrier.subresourceRange = (VkImageSubresourceRange) { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

vkCmdPipelineBarrier(transfer_cmdbuffs[0],
                     VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     0,
                     0, NULL, 0, NULL, 1, &tex_barrier);

/* queue ownership transfer */
tex_barrier.srcAccessMask = 0;
tex_barrier.dstAccessMask = 0;
tex_barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
tex_barrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
tex_barrier.srcQueueFamilyIndex = device.transfer_queue_family_index;
tex_barrier.dstQueueFamilyIndex = device.graphics_queue_family_index;

vkCmdPipelineBarrier(transfer_cmdbuffs[0],
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     0,
                     0, NULL, 0, NULL, 1, &tex_barrier);

/* layout transition - TRANSFER_DST -> SHADER_READ_ONLY */
tex_barrier.srcAccessMask = 0;
tex_barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
tex_barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
tex_barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
tex_barrier.srcQueueFamilyIndex = device.transfer_queue_family_index;
tex_barrier.dstQueueFamilyIndex = device.graphics_queue_family_index;

vkCmdPipelineBarrier(transfer_cmdbuffs[0],
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                     0,
                     0, NULL, 0, NULL, 1, &tex_barrier);
Doing an ownership transfer is a two-way process: the source of the transfer has to release the resource, and the receiver has to acquire it. And by "the source" and "the receiver", I mean the queues themselves. You can't merely have a queue take ownership of a resource; that queue must issue a command to claim ownership of it.
You need to submit a release barrier operation on the source queue. It must specify the source queue family as well as the destination queue family. Then you have to submit an acquire barrier operation on the receiving queue, using the same source and destination queue families. You must also ensure the ordering of these two operations with a semaphore: the vkQueueSubmit call for the acquire has to wait on a semaphore signaled by the submission containing the release operation (a timeline semaphore would work too).
Now, since these are pipeline/memory barriers, you are free to also specify a layout transition. You don't need a third barrier to change the layout, but both the release and the acquire barrier have to specify the same oldLayout and newLayout.
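A minimal sketch of what that looks like, assuming the upload is recorded on the transfer queue and the image is sampled in the fragment shader; the handle names (tex_image, release_cmdbuf, acquire_cmdbuf) are placeholders, not names from the code above:

/* 1) Recorded on the TRANSFER queue, after the copy command: release.
 *    The layout transition is folded into the ownership transfer. */
VkImageMemoryBarrier release = {0};
release.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
release.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
release.dstAccessMask = 0; /* ignored for a release operation */
release.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
release.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
release.srcQueueFamilyIndex = device.transfer_queue_family_index;
release.dstQueueFamilyIndex = device.graphics_queue_family_index;
release.image = tex_image;
release.subresourceRange = (VkImageSubresourceRange) { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

vkCmdPipelineBarrier(release_cmdbuf,
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                     0,
                     0, NULL, 0, NULL, 1, &release);

/* 2) Recorded on the GRAPHICS queue: acquire, with the same queue families
 *    and the same oldLayout/newLayout as the release. */
VkImageMemoryBarrier acquire = release;
acquire.srcAccessMask = 0; /* ignored for an acquire operation */
acquire.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

vkCmdPipelineBarrier(acquire_cmdbuf,
                     VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                     0,
                     0, NULL, 0, NULL, 1, &acquire);

/* 3) Submit release_cmdbuf on the transfer queue signaling a semaphore, and
 *    submit acquire_cmdbuf on the graphics queue waiting on that semaphore,
 *    so the acquire is ordered after the release. */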
I'm running a compute shader. It takes about 0.22 ms to complete (according to a D3D11 Query), but real-world time is around 0.6 ms on average; the Map call definitely takes up some of that time! It computes a dirty RECT that contains all the pixels that have changed since the last frame and returns it back to the CPU. This all works great!
My issue is that the Map call is very slow. Under normal conditions performance is fine, at around 0.6 ms, but if I run this in the background alongside any GPU-intensive application (Blender, a AAA game, etc.) it really starts to slow down; I see times jump from the normal 0.6 ms all the way to 9 ms! I need to figure out why the Map is taking so long!
The shader is passing the result into this structure here
buffer_desc.ByteWidth = sizeof(RECT);
buffer_desc.StructureByteStride = sizeof(RECT);
buffer_desc.Usage = D3D11_USAGE_DEFAULT;
buffer_desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
buffer_desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
buffer_desc.CPUAccessFlags = 0;
hr = ID3D11Device5_CreateBuffer(ctx->device, &buffer_desc, NULL, &ctx->reduction_final);
With a UAV of
uav_desc.Format = DXGI_FORMAT_UNKNOWN;
uav_desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uav_desc.Buffer.FirstElement = 0;
uav_desc.Buffer.NumElements = 1;
uav_desc.Buffer.Flags = 0;
hr = ID3D11Device5_CreateUnorderedAccessView(ctx->device, (ID3D11Resource*) ctx->reduction_final, &uav_desc, &ctx->reduction_final_UAV);
Copying it into this staging buffer here
buffer_desc.ByteWidth = sizeof(RECT);
buffer_desc.StructureByteStride = sizeof(RECT);
buffer_desc.Usage = D3D11_USAGE_STAGING;
buffer_desc.BindFlags = 0;
buffer_desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
buffer_desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
hr = ID3D11Device5_CreateBuffer(ctx->device, &buffer_desc, NULL, &ctx->cpu_staging);
and mapping it like so
hr = ID3D11DeviceContext4_Map(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0, D3D11_MAP_READ, 0, &mapped_res);
I added a fence to my system so I could timeout if it took too long (if it took longer than 1ms).
Creation of fence and event
hr = ID3D11Device5_CreateFence(ctx->device, 1, D3D11_FENCE_FLAG_NONE, &IID_ID3D11Fence, (void**) &ctx->fence);
ctx->fence_handle = CreateEventA(NULL, FALSE, FALSE, NULL);
Then, after my dispatch calls, I copy into the staging buffer and signal the fence:
ID3D11DeviceContext4_CopyResource(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, (ID3D11Resource*) ctx->reduction_final);
ID3D11DeviceContext4_Signal(ctx->device_ctx, ctx->fence, ++ctx->fence_counter);
hr = ID3D11Fence_SetEventOnCompletion(ctx->fence, ctx->fence_counter, ctx->fence_handle);
This all seems to be working. The fence signals and allows WaitForSingleObject to pass, and when I wait on it I can time out, move on to other things while this finishes, then come back (this is not OK, obviously; I want the data from the shader ASAP).
DWORD waitResult = WaitForSingleObject(ctx->fence_handle, 1);
if (waitResult != WAIT_OBJECT_0)
{
    LOG("Timeout Count: %u", ++timeout_count);
    return false;
}
else
{
    LOG("Timeout Count: %u", timeout_count);
    ID3D11DeviceContext4_Map(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0, D3D11_MAP_READ, 0, &mapped_res);
    memcpy(rect, mapped_res.pData, sizeof(RECT));
    ID3D11DeviceContext4_Unmap(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0);
    return true;
}
The problem is, the Map can still take a long time (ranging from 0.6 to 9 ms). The fence signals once everything is done, right? And that takes less than 1 ms. But the Map call itself can still take a long time. What gives? What am I missing about how mapping works? Is it possible to reduce the time to map? Is the fence signaling before the copy happens?
So, just to recap: my shader runs fast under normal operation, around 0.6 ms total time, with 0.22 ms for the shader itself (and I'm assuming the rest is the mapping). Once I start running my shader while rendering with Blender, it starts to seriously slow down. The fence tells me that the work is still being completed in record time, but the Map call is still taking a very long time (upwards of 9 ms).
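(For reference, the wait and the Map are timed separately, roughly along these lines; this is a simplified sketch, not the exact code:)

/* Simplified timing sketch: measure the fence wait and the Map call
 * independently with QueryPerformanceCounter. */
LARGE_INTEGER freq, t0, t1, t2;
QueryPerformanceFrequency(&freq);

QueryPerformanceCounter(&t0);
DWORD waitResult = WaitForSingleObject(ctx->fence_handle, 1);
QueryPerformanceCounter(&t1);

hr = ID3D11DeviceContext4_Map(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0, D3D11_MAP_READ, 0, &mapped_res);
QueryPerformanceCounter(&t2);

double wait_ms = (double)(t1.QuadPart - t0.QuadPart) * 1000.0 / (double)freq.QuadPart;
double map_ms  = (double)(t2.QuadPart - t1.QuadPart) * 1000.0 / (double)freq.QuadPart;
LOG("wait %.3f ms, map %.3f ms", wait_ms, map_ms);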
Other random things I've tried:
Setting SetGPUThreadPriority to 7.
Setting AvSetMmThreadCharacteristicsW to "DisplayPostProcessing", "Games", and "Capture"
Setting AvSetMmThreadPriority to AVRT_PRIORITY_CRITICAL
I'm trying to add a perf event to a group of events with an existing leader, using the following code snippet (as an example):
struct perf_event_attr perf_event;
int cpu_fd = -1;
int process_fd = -1;
memset(&perf_event, 0, sizeof(perf_event));
perf_event.type = PERF_TYPE_HARDWARE;
perf_event.size = sizeof(perf_event);
perf_event.config = config; // config (PERF_COUNT_HW_CPU_CYCLES) is passed as a function's argument
perf_event.use_clockid = 1;
perf_event.clockid = CLOCK_MONOTONIC;
perf_event.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_GROUP;
// First step:
// Create leader of perf events group
// cpu_idx is passed as an function's argument
cpu_fd = perf_event_open(&perf_event, -1, cpu_idx, -1,
                         PERF_FLAG_FD_CLOEXEC);
if (cpu_fd < 0) {
    printf("Unable to open new CPU PMU event.\n");
    return STATUS_NOK;
}
memset(&perf_event, 0, sizeof(perf_event));
perf_event.type = PERF_TYPE_HARDWARE;
perf_event.size = sizeof(perf_event);
perf_event.config = config; // same as above (PERF_COUNT_HW_CPU_CYCLES)
perf_event.use_clockid = 1;
perf_event.clockid = CLOCK_MONOTONIC;
perf_event.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED;
// Second step:
// Attempting to add new perf event for account current process CPU cycles
// in existing group
// cpu_idx is passed as an function's argument
process_fd = perf_event_open(&perf_event, getpid(), cpu_idx, cpu_fd,
                             PERF_FLAG_FD_CLOEXEC);
if (process_fd < 0) {
    printf("Unable to open new process PMU event.\n");
    return STATUS_NOK;
}
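(perf_event_open has no glibc wrapper, so the calls above go through a small helper like the following; the wrapper itself is not part of the problem:)

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw syscall, since glibc does not provide one. */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                           int group_fd, unsigned long flags)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}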
In the first step, the group leader perf event is created to count the total number of CPU cycles on the required CPU (cpu_idx).
In the second step, I attempt to add a new perf event to the existing leader, in order to count the current process's CPU cycles on the same CPU (cpu_idx).
But for some unknown reason, creation of the new process_fd perf descriptor always fails with errno EINVAL (Invalid argument). I tried to investigate the root cause with syscall tracing in the kernel, but it seems I'm completely stuck.
Could you please help me find the root cause of this issue?
There are many possible reasons for the perf_event_open syscall to return EINVAL, so I'd appreciate any suggestions or ideas for debugging.
Thanks a lot for your help!
Using the Windows CNG API, I am able to encrypt and decrypt individual blocks of data with authentication, using AES in GCM mode. I now want to encrypt and decrypt multiple buffers in a row.
According to the documentation for CNG, the following scenario is supported:
If the input to encryption or decryption is scattered across multiple
buffers, then you must chain calls to the BCryptEncrypt and
BCryptDecrypt functions. Chaining is indicated by setting the
BCRYPT_AUTH_MODE_IN_PROGRESS_FLAG flag in the dwFlags member.
If I understand it correctly, this means that I can invoke BCryptEncrypt sequentially on multiple buffers and obtain the authentication tag for the combined buffers at the end. Similarly, I can invoke BCryptDecrypt sequentially on multiple buffers while deferring the actual authentication check until the end. I cannot get that to work, though; it looks like the value for dwFlags is ignored. Whenever I use BCRYPT_AUTH_MODE_IN_PROGRESS_FLAG, I get a return value of 0xc000a002, which is equal to STATUS_AUTH_TAG_MISMATCH as defined in ntstatus.h.
Even though the parameter pbIV is marked as in/out, the elements pointed to by the parameter pbIV do not get modified by BCryptEncrypt(). Is that expected? I also looked at the field pbNonce in the BCRYPT_AUTHENTICATED_CIPHER_MODE_INFO structure, pointed to by the pPaddingInfo pointer, but that one does not get modified either. I also tried "manually" advancing the IV, modifying the contents myself according to the counter scheme, but that did not help either.
What is the right procedure to chain the BCryptEncrypt and/or BCryptDecrypt functions successfully?
I managed to get it to work. It seems that the problem is in MSDN: it should mention setting BCRYPT_AUTH_MODE_CHAIN_CALLS_FLAG instead of BCRYPT_AUTH_MODE_IN_PROGRESS_FLAG.
#include <windows.h>
#include <tchar.h>
#include <assert.h>
#include <vector>
#include <Bcrypt.h>
#pragma comment(lib, "bcrypt.lib")

std::vector<BYTE> MakePatternBytes(size_t a_Length)
{
    std::vector<BYTE> result(a_Length);
    for (size_t i = 0; i < result.size(); i++)
    {
        result[i] = (BYTE)i;
    }
    return result;
}

std::vector<BYTE> MakeRandomBytes(size_t a_Length)
{
    std::vector<BYTE> result(a_Length);
    for (size_t i = 0; i < result.size(); i++)
    {
        result[i] = (BYTE)rand();
    }
    return result;
}

int _tmain(int argc, _TCHAR* argv[])
{
    NTSTATUS bcryptResult = 0;
    DWORD bytesDone = 0;

    BCRYPT_ALG_HANDLE algHandle = 0;
    bcryptResult = BCryptOpenAlgorithmProvider(&algHandle, BCRYPT_AES_ALGORITHM, 0, 0);
    assert(BCRYPT_SUCCESS(bcryptResult) || !"BCryptOpenAlgorithmProvider");

    bcryptResult = BCryptSetProperty(algHandle, BCRYPT_CHAINING_MODE, (BYTE*)BCRYPT_CHAIN_MODE_GCM, sizeof(BCRYPT_CHAIN_MODE_GCM), 0);
    assert(BCRYPT_SUCCESS(bcryptResult) || !"BCryptSetProperty(BCRYPT_CHAINING_MODE)");

    BCRYPT_AUTH_TAG_LENGTHS_STRUCT authTagLengths;
    bcryptResult = BCryptGetProperty(algHandle, BCRYPT_AUTH_TAG_LENGTH, (BYTE*)&authTagLengths, sizeof(authTagLengths), &bytesDone, 0);
    assert(BCRYPT_SUCCESS(bcryptResult) || !"BCryptGetProperty(BCRYPT_AUTH_TAG_LENGTH)");

    DWORD blockLength = 0;
    bcryptResult = BCryptGetProperty(algHandle, BCRYPT_BLOCK_LENGTH, (BYTE*)&blockLength, sizeof(blockLength), &bytesDone, 0);
    assert(BCRYPT_SUCCESS(bcryptResult) || !"BCryptGetProperty(BCRYPT_BLOCK_LENGTH)");

    BCRYPT_KEY_HANDLE keyHandle = 0;
    {
        const std::vector<BYTE> key = MakeRandomBytes(blockLength);
        bcryptResult = BCryptGenerateSymmetricKey(algHandle, &keyHandle, 0, 0, (PUCHAR)&key[0], key.size(), 0);
        assert(BCRYPT_SUCCESS(bcryptResult) || !"BCryptGenerateSymmetricKey");
    }

    const size_t GCM_NONCE_SIZE = 12;
    const std::vector<BYTE> origNonce = MakeRandomBytes(GCM_NONCE_SIZE);
    const std::vector<BYTE> origData = MakePatternBytes(256);

    // Encrypt data as a whole
    std::vector<BYTE> encrypted = origData;
    std::vector<BYTE> authTag(authTagLengths.dwMinLength);
    {
        BCRYPT_AUTHENTICATED_CIPHER_MODE_INFO authInfo;
        BCRYPT_INIT_AUTH_MODE_INFO(authInfo);
        authInfo.pbNonce = (PUCHAR)&origNonce[0];
        authInfo.cbNonce = origNonce.size();
        authInfo.pbTag = &authTag[0];
        authInfo.cbTag = authTag.size();

        bcryptResult = BCryptEncrypt
        (
            keyHandle,
            &encrypted[0], encrypted.size(),
            &authInfo,
            0, 0,
            &encrypted[0], encrypted.size(),
            &bytesDone, 0
        );

        assert(BCRYPT_SUCCESS(bcryptResult) || !"BCryptEncrypt");
        assert(bytesDone == encrypted.size());
    }

    // Decrypt data in two parts
    std::vector<BYTE> decrypted = encrypted;
    {
        DWORD partSize = decrypted.size() / 2;

        std::vector<BYTE> macContext(authTagLengths.dwMaxLength);

        BCRYPT_AUTHENTICATED_CIPHER_MODE_INFO authInfo;
        BCRYPT_INIT_AUTH_MODE_INFO(authInfo);
        authInfo.pbNonce = (PUCHAR)&origNonce[0];
        authInfo.cbNonce = origNonce.size();
        authInfo.pbTag = &authTag[0];
        authInfo.cbTag = authTag.size();
        authInfo.pbMacContext = &macContext[0];
        authInfo.cbMacContext = macContext.size();

        // IV value is ignored on first call to BCryptDecrypt.
        // This buffer will be used to keep internal IV used for chaining.
        std::vector<BYTE> contextIV(blockLength);

        // First part
        authInfo.dwFlags = BCRYPT_AUTH_MODE_CHAIN_CALLS_FLAG;
        bcryptResult = BCryptDecrypt
        (
            keyHandle,
            &decrypted[0*partSize], partSize,
            &authInfo,
            &contextIV[0], contextIV.size(),
            &decrypted[0*partSize], partSize,
            &bytesDone, 0
        );

        assert(BCRYPT_SUCCESS(bcryptResult) || !"BCryptDecrypt");
        assert(bytesDone == partSize);

        // Second part
        authInfo.dwFlags &= ~BCRYPT_AUTH_MODE_CHAIN_CALLS_FLAG;
        bcryptResult = BCryptDecrypt
        (
            keyHandle,
            &decrypted[1*partSize], partSize,
            &authInfo,
            &contextIV[0], contextIV.size(),
            &decrypted[1*partSize], partSize,
            &bytesDone, 0
        );

        assert(BCRYPT_SUCCESS(bcryptResult) || !"BCryptDecrypt");
        assert(bytesDone == partSize);
    }

    // Check decryption
    assert(decrypted == origData);

    // Cleanup
    BCryptDestroyKey(keyHandle);
    BCryptCloseAlgorithmProvider(algHandle, 0);

    return 0;
}
@Codeguard's answer got me through the project I was working on, which led me to find this question/answer in the first place; however, there were still a number of gotchas I struggled with. Below is the process I followed, with the tricky parts called out. You can view the actual code at the link above:
Use BCryptOpenAlgorithmProvider to open the algorithm provider using BCRYPT_AES_ALGORITHM.
Use BCryptSetProperty to set the BCRYPT_CHAINING_MODE to BCRYPT_CHAIN_MODE_GCM.
Use BCryptGetProperty to get the BCRYPT_OBJECT_LENGTH to allocate for use by the BCrypt library for the encrypt/decrypt operation. Depending on your implementation, you may also want to:
Use BCryptGetProperty to determine BCRYPT_BLOCK_LENGTH and allocate scratch space for the IV. The Windows API updates the IV with each call, and the caller is responsible for providing the memory for that usage.
Use BCryptGetProperty to determine BCRYPT_AUTH_TAG_LENGTH and allocate scratch space for the largest possible tag. Like the IV, the caller is responsible for providing this space, which the API updates each time.
Initialize the BCRYPT_AUTHENTICATED_CIPHER_MODE_INFO struct:
Initialize the structure with BCRYPT_INIT_AUTH_MODE_INFO()
Initialize the pbNonce and cbNonce field. Note that for the first call to BCryptEncrypt/BCryptDecrypt, the IV is ignored as an input and this field is used as the "IV". However, the IV parameter will be updated by that first call and used by subsequent calls, so space for it must still be provided. In addition, the pbNonce and cbNonce fields must remain set (even though they are unused after the first call) for all calls to BCryptEncrypt/BCryptDecrypt or those calls will complain.
Initialize pbAuthData and cbAuthData. In my project, I set these fields just before the first call to BCryptEncrypt/BCryptDecrypt and reset them to NULL/0 immediately afterward. You can pass NULL/0 as the input and output parameters during these calls.
Initialize pbTag and cbTag. pbTag can be NULL until the final call to BCryptEncrypt/BCryptDecrypt when the tag is retrieved or checked, but cbTag must be set or else BCryptEncrypt/BCryptDecrypt will complain.
Initialize pbMacContext and cbMacContext. These point to a scratch space for the BCryptEncrypt/BCryptDecrypt to use to keep track of the current state of the tag/mac.
Initialize cbAAD and cbData to 0. The APIs use these fields, so you can read them at any time, but you should not update them after initially setting them to 0.
Initialize dwFlags to BCRYPT_AUTH_MODE_CHAIN_CALLS_FLAG. After initialization, changes to this field should be made by using |= or &=. Windows also sets flags within this field that the caller needs to take care not to alter.
Use BCryptGenerateSymmetricKey to import the key to use for encryption/decryption. Note that you will need to supply the memory associated with BCRYPT_OBJECT_LENGTH to this call for use by BCryptEncrypt/BCryptDecrypt during operation.
Call BCryptEncrypt/BCryptDecrypt with your AAD, if any; no input or space for output needs to be supplied for this call (see the sketch after this list). If the call succeeds, you can see the size of your AAD reflected in the cbAAD field of the BCRYPT_AUTHENTICATED_CIPHER_MODE_INFO structure.
Set pbAuthData and cbAuthData to reflect the AAD.
Call BCryptEncrypt or BCryptDecrypt.
Set pbAuthData and cbAuthData back to NULL and 0.
Call BCryptEncrypt/BCryptDecrypt "N - 1" times
The amount of data passed to each call must be a multiple of the algorithm's block size.
Do not set the dwFlags parameter of the call to anything other than 0.
The output space must be equal to or greater than the size of the input
Call BCryptEncrypt/BCryptDecrypt one final time (with or without plain/cipher text input/output). The size of the input need not be a multiple of the algorithm's block size for this call. dwFlags is still set to 0.
Set the pbTag field of the BCRYPT_AUTHENTICATED_CIPHER_MODE_INFO structure either to the location at which to store the generated tag or to the location of the tag to verify against, depending on whether the operation is an encryption or decryption.
Remove the BCRYPT_AUTH_MODE_CHAIN_CALLS_FLAG from the dwFlags field of the BCRYPT_AUTHENTICATED_CIPHER_MODE_INFO structure using the &= syntax.
Call BCryptDestroyKey
Call BCryptCloseAlgorithmProvider
It would be wise, at this point, to wipe out the space associated with BCRYPT_OBJECT_LENGTH.
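To make the AAD-only step concrete, here is a minimal sketch of that first chained call (decrypt shown; keyHandle, authInfo, contextIV, blockLength and bytesDone are assumed to be set up as described above, while aad, aadSize and status are illustrative names, not from the project):

/* AAD-only first call of a chained GCM operation: no plaintext/ciphertext
 * yet, so input and output are NULL/0. */
authInfo.dwFlags = BCRYPT_AUTH_MODE_CHAIN_CALLS_FLAG;
authInfo.pbAuthData = aad;
authInfo.cbAuthData = aadSize;

status = BCryptDecrypt(keyHandle,
                       NULL, 0,
                       &authInfo,
                       contextIV, blockLength,
                       NULL, 0,
                       &bytesDone, 0);

/* On success, authInfo.cbAAD reflects the AAD size; clear the pointers again
 * before the data calls. */
authInfo.pbAuthData = NULL;
authInfo.cbAuthData = 0;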
I'm trying to understand the use case for EV_DISABLE and EV_ENABLE in kqueue.
int KQueue = kqueue();

struct kevent ev = {
    .ident  = fd,
    .filter = EVFILT_READ,
    .flags  = EV_ADD | EV_DISABLE,
    .udata  = somePtr
};
kevent(KQueue, &ev, 1, NULL, 0, NULL);

...

struct kevent ev = {
    .ident  = fd,
    .filter = EVFILT_READ,
    .flags  = EV_ENABLE
};
kevent(KQueue, &ev, 1, &ev, 1, NULL);
Now, when the last call to kevent() returns, ev.udata is NULL instead of somePtr. If kevent() updates the udata pointer even though EV_ADD isn't set, instead of just enabling the event, what is the reason for allowing you to add a disabled event, then?
Another use case for EV_ENABLE is in conjunction with EV_DISPATCH. This is necessary in a multi-threaded scenario where you have multiple threads waiting for events in a kevent() call. When an event occurs, without EV_DISPATCH, all your threads would be woken up by the same event, causing a thundering herd problem. With EV_DISPATCH, the event is delivered to one thread and disabled right after that (i.e. atomically from the userspace point of view). That thread then handles the event and may re-enable it.
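A rough sketch of that pattern (kq, fd and handle_read are placeholders; error handling is omitted):

/* Register the fd with EV_DISPATCH so it is disabled as it is delivered,
 * and only one waiting thread wakes up per readiness notification. */
struct kevent reg;
EV_SET(&reg, fd, EVFILT_READ, EV_ADD | EV_DISPATCH, 0, 0, NULL);
kevent(kq, &reg, 1, NULL, 0, NULL);

/* In each worker thread: */
for (;;) {
    struct kevent ev;
    int n = kevent(kq, NULL, 0, &ev, 1, NULL);  /* only one thread gets it */
    if (n <= 0)
        continue;

    handle_read((int)ev.ident);                 /* placeholder handler */

    /* Re-arm the now-disabled event for the next notification. */
    struct kevent rearm;
    EV_SET(&rearm, ev.ident, EVFILT_READ, EV_ENABLE, 0, 0, NULL);
    kevent(kq, &rearm, 1, NULL, 0, NULL);
}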
kqueue did not update udata. YOU updated udata, by leaving it uninitialized (and therefore zero) in the second change: you are re-registering the filter with new values. The point of udata is to carry your pointer across the kernel; you can always keep your own pointer in userland instead.
The point of disabling an event is that you don't want it to cause kevent() to return while it is disabled; you want it reported at a different time, from a later call, after you re-enable it.
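So if you want udata preserved across the enable, specify it again in the enabling change; a small sketch reusing the names from the question:

/* The enabling change updates the event's parameters, so pass udata again. */
struct kevent ev = {
    .ident  = fd,
    .filter = EVFILT_READ,
    .flags  = EV_ENABLE,
    .udata  = somePtr   /* same pointer that was registered with EV_ADD */
};
kevent(KQueue, &ev, 1, &ev, 1, NULL);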
I initialize a Redis connection pool in redis.lua. By calling into it from C, I get a redis_lua_state; this Lua state is global, initialized once, and other threads only get connections from it.
When an HTTP request comes in (on a worker thread), I need to fetch a Redis connection from redis_lua_state, then create another Lua state to load other Lua scripts, and those scripts will use this Redis connection to communicate with Redis. How do I do this? Or how should I design my Lua scripts to make this simple?
Code Sample:
/* on main thread, to init redis pool connection */
lua_State *g_ls = NULL;

lua_State *init_redis_pool(void) {
    int ret = 0;
    g_ls = luaL_newstate();
    lua_State *ls = g_ls;
    luaL_openlibs(ls);
    ret = luaL_loadfile(ls, "redis.lua");
    const char *err;
    (void)err;

    /* preload */
    ret = lua_pcall(ls, 0, 0, 0);
    lua_getglobal(ls, "init_redis_pool");
    ret = lua_pcall(ls, 0, 0, 0);
    return ls;
}
/* worker thread */
int worker() {
    ...
    lua_State *ls = luaL_newstate();
    ret = luaL_loadfile(ls, "run.lua");

    /* How to fetch data from g_ls? */
    ...

    lua_getglobal(ls, "run");
    ret = lua_pcall(ls, 0, 0, 0);
    lua_close(ls);
    ...
    return 0;
}
If your Lua states are separate, then there's no way to do this. Your worker thread will have to initialize the Redis connection and do processing on it.
One way to do it is to implement copying variables between Lua states on the C side. I have done a similar thing in my ERP project, but it is a bit of a hassle, especially when it comes to copying userdata.
What I did was implement a sort of super-global variable system (a C class in my case), wired up as the __index and __newindex metamethods of the global table, which acts as a proxy to the default Lua global table. When a global variable is set, __newindex copies it into that super-global store; when another Lua state tries to access that global, __index retrieves it from the same structure.
The Redis connection could then be shared behind a mutex, so that while one state is accessing it, the others cannot, for instance.
Of course, there is also the issue of access to Lua's default globals, which you have to take care of.
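A rough sketch of that idea on the C side (string values only, and the function and variable names are made up; a real implementation would also handle numbers, tables, and userdata):

#include <lua.h>
#include <lauxlib.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* A mutex-protected C-side store shared by all Lua states, exposed to Lua as
 * set_shared()/get_shared(). Only one string slot is kept here for brevity. */
static pthread_mutex_t store_lock = PTHREAD_MUTEX_INITIALIZER;
static char *store_key = NULL;
static char *store_val = NULL;

static int l_set_shared(lua_State *L) {
    const char *k = luaL_checkstring(L, 1);
    const char *v = luaL_checkstring(L, 2);
    pthread_mutex_lock(&store_lock);
    free(store_key); free(store_val);
    store_key = strdup(k);
    store_val = strdup(v);
    pthread_mutex_unlock(&store_lock);
    return 0;
}

static int l_get_shared(lua_State *L) {
    const char *k = luaL_checkstring(L, 1);
    pthread_mutex_lock(&store_lock);
    if (store_key && strcmp(store_key, k) == 0)
        lua_pushstring(L, store_val);
    else
        lua_pushnil(L);
    pthread_mutex_unlock(&store_lock);
    return 1;
}

/* Register these in every state you create (g_ls and each worker state);
 * hooking them into __index/__newindex of a proxy global table is then a
 * small Lua-side wrapper. */
void register_shared(lua_State *L) {
    lua_pushcfunction(L, l_set_shared);
    lua_setglobal(L, "set_shared");
    lua_pushcfunction(L, l_get_shared);
    lua_setglobal(L, "get_shared");
}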
Alternatively, you can check out Lua Lanes, if that's an option (I have no Redis experience, so I don't know how open the embedded Lua is, but I see that you have full access to the C API, so it should work).