MySQL connector (libmysql/C) is very slow in getting results - C

"select * from tables" query in MySQL connector/libmysql C is very slow in getting the results:
Here is my code in C :
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <windows.h>
#include <mysql.h>

int getfrommysql() {
    time_t starttime, endtime;
    time(&starttime);
    double st = GetTickCount();

    /* mysql_init(NULL) allocates the handle itself; no separate malloc needed */
    MYSQL *sqlconn = mysql_init(NULL);
    MYSQL_RES *res = NULL;
    MYSQL_ROW row;

    if (!mysql_real_connect(sqlconn, "111.111.111.111", "root", "password",
                            "database", 0, NULL, 0)) {
        fprintf(stderr, "connect failed: %s\n", mysql_error(sqlconn));
        return 1;
    }

    if (mysql_query(sqlconn, "select * from seat_getvalue")) {
        fprintf(stderr, "query failed: %s\n", mysql_error(sqlconn));
        mysql_close(sqlconn);
        return 1;
    }
    res = mysql_store_result(sqlconn);

    unsigned int col_num = 0;
    my_ulonglong row_num = 0;
    if (res) {
        col_num = mysql_num_fields(res);   /* use the API instead of poking struct fields */
        row_num = mysql_num_rows(res);
        printf("\nthere is a %llu row, %u field table",
               (unsigned long long)row_num, col_num);
    }
    while (res && (row = mysql_fetch_row(res))) {
        for (unsigned int j = 0; j < col_num; j++)
            printf("%s\t", row[j] ? row[j] : "NULL");
        printf("\n");
    }

    /* free the result set before closing; touching res after mysql_close is use-after-free */
    if (res)
        mysql_free_result(res);
    mysql_close(sqlconn);

    time(&endtime);
    double et = GetTickCount();
    printf("the process cost time (get by GetTickCount): %f ms", et - st);
    printf("\nthere is a %llu row, %u field table",
           (unsigned long long)row_num, col_num);
    return 0;
}

Apart from the fact that there isn't even a question in your post, you are comparing apples to oranges. MySQL gives you (I think - correct me if I am wrong) the time needed to execute the query, while in your C code you measure the time that passes between the start and the end of the program. This is wrong for at least two reasons:
The difference between two GetTickCount() calls gives you the time that has passed between the calls in the whole system, not the time spent executing your software. These are two different things, because your process does not have to execute from beginning to end uninterrupted - it can (and probably will) be swapped out for another process in the middle of its execution, it can be interrupted, etc. The whole time the system spends doing things outside your program gets added to your measurement. To get the time spent executing your code you could use GetProcessTimes or QueryProcessCycleTime.
Even if you did use an appropriate method of retrieving the time, you are timing the wrong part of the code. Instead of measuring only the query execution and result retrieval, you measure the whole run: establishing the connection, copying the query, executing it, storing the results, fetching them, printing them, and closing the connection. That is quite different from what MySQL measures, and printing hundreds of lines can take quite a lot of time depending on your shell - more than the actual SQL query execution. If you want to know how much time the connector needs to retrieve the data, benchmark only the code responsible for executing the query and retrieving the results. Or, better yet, use dedicated performance monitoring tools or libraries. I can't point to a specific solution because I have never run tests like that, but there certainly must be some.
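For what it's worth, here is a minimal sketch of that narrower measurement (my own illustration, not a full benchmark), using QueryPerformanceCounter, which has far finer resolution than GetTickCount; only the query execution and result retrieval fall inside the timed window:

#include <stdio.h>
#include <windows.h>
#include <mysql.h>

/* Time only mysql_query + mysql_store_result, in milliseconds.
 * Connection setup, row printing, and cleanup stay outside the window. */
double time_query_ms(MYSQL *conn, const char *query)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    if (mysql_query(conn, query) != 0)
        return -1.0;                          /* query failed */
    MYSQL_RES *res = mysql_store_result(conn);
    QueryPerformanceCounter(&t1);             /* stop before printing anything */

    if (res)
        mysql_free_result(res);
    return (double)(t1.QuadPart - t0.QuadPart) * 1000.0
         / (double)freq.QuadPart;
}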

Related

D3D11 Map call is slow even after fence signaled

I'm running a compute shader. It takes about .22 ms to complete (according to a D3D11 query), but real-world time is around .6 ms on average. The Map call definitely takes up some of that time! The shader computes a dirty RECT that contains all the pixels that have changed since the last frame and returns it to the CPU. This all works great!
My issue is that the Map call is very slow. Normal performance is fine, at around .6 ms, but if I run this in the background with any GPU-intensive application (Blender, a AAA game, etc.) it really starts to slow down; I see times jump from the normal .6 ms all the way to 9 ms! I need to figure out why the Map is taking so long!
The shader is passing the result into this structure here
buffer_desc.ByteWidth = sizeof(RECT);
buffer_desc.StructureByteStride = sizeof(RECT);
buffer_desc.Usage = D3D11_USAGE_DEFAULT;
buffer_desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
buffer_desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
buffer_desc.CPUAccessFlags = 0;
hr = ID3D11Device5_CreateBuffer(ctx->device, &buffer_desc, NULL, &ctx->reduction_final);
With a UAV of
uav_desc.Format = DXGI_FORMAT_UNKNOWN;
uav_desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uav_desc.Buffer.FirstElement = 0;
uav_desc.Buffer.NumElements = 1;
uav_desc.Buffer.Flags = 0;
hr = ID3D11Device5_CreateUnorderedAccessView(ctx->device, (ID3D11Resource*) ctx->reduction_final, &uav_desc, &ctx->reduction_final_UAV);
Copying it into this staging buffer here
buffer_desc.ByteWidth = sizeof(RECT);
buffer_desc.StructureByteStride = sizeof(RECT);
buffer_desc.Usage = D3D11_USAGE_STAGING;
buffer_desc.BindFlags = 0;
buffer_desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
buffer_desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
hr = ID3D11Device5_CreateBuffer(ctx->device, &buffer_desc, NULL, &ctx->cpu_staging);
and mapping it like so
hr = ID3D11DeviceContext4_Map(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0, D3D11_MAP_READ, 0, &mapped_res);
I added a fence to my system so I could timeout if it took too long (if it took longer than 1ms).
Creation of fence and event
hr = ID3D11Device5_CreateFence(ctx->device, 1, D3D11_FENCE_FLAG_NONE, &IID_ID3D11Fence, (void**) &ctx->fence);
ctx->fence_handle = CreateEventA(NULL, FALSE, FALSE, NULL);
Then after my dispatch calls I copy into the staging, then signal the fence
ID3D11DeviceContext4_CopyResource(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, (ID3D11Resource*) ctx->reduction_final);
ID3D11DeviceContext4_Signal(ctx->device_ctx, ctx->fence, ++ctx->fence_counter);
hr = ID3D11Fence_SetEventOnCompletion(ctx->fence, ctx->fence_counter, ctx->fence_handle);
This all seems to be working. The fence signals and allows WaitForSingleObject to pass, and when I wait on it, I can time out here and move on to other things while this finishes, then come back (this is not OK, obviously - I want the data from the shader ASAP).
DWORD waitResult = WaitForSingleObject(ctx->fence_handle, 1);
if (waitResult != WAIT_OBJECT_0)
{
LOG("Timeout Count: %u", ++timeout_count);
return false;
}
else
{
LOG("Timeout Count: %u", timeout_count);
ID3D11DeviceContext4_Map(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0, D3D11_MAP_READ, 0, &mapped_res);
memcpy(rect, mapped_res.pData, sizeof(RECT));
ID3D11DeviceContext4_Unmap(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0);
return true;
}
Problem is, the Map can still take a long time (ranging from .6 to 9 ms). The fence signals once everything is done, right? Which takes less than 1 ms. But the Map call itself can still take a long time. What gives? What am I missing about how mapping works? Is it possible to reduce the time to map? Is the fence signaling before the copy happens?
So, to recap: my shader runs fast under normal operation - around .6 ms total time, with .22 ms for the shader itself (and I'm assuming the rest is the mapping). Once I start running my shader while rendering with Blender, it seriously slows down. The fence tells me that the work is still being completed in record time, but the Map call is still taking a very long time (upwards of 9 ms).
Other random things I've tried:
Setting SetGPUThreadPriority to 7.
Setting AvSetMmThreadCharacteristicsW to "DisplayPostProcessing", "Games", and "Capture"
Setting AvSetMmThreadPriority to AVRT_PRIORITY_CRITICAL
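One mitigation worth trying (a sketch under assumptions, not something from the original post): rotate a small ring of staging buffers and map the copy issued a few frames earlier, optionally with D3D11_MAP_FLAG_DO_NOT_WAIT so a still-busy buffer is skipped instead of blocking. Here ctx->staging_ring and ctx->frame are hypothetical names:

#define RING 3
UINT w = ctx->frame % RING;            /* staging buffer to copy into this frame */
UINT r = (ctx->frame + 1) % RING;      /* oldest buffer, written RING-1 frames ago */
ID3D11DeviceContext4_CopyResource(ctx->device_ctx,
    (ID3D11Resource*) ctx->staging_ring[w],
    (ID3D11Resource*) ctx->reduction_final);
if (ctx->frame >= RING - 1)            /* nothing to read during warm-up */
{
    D3D11_MAPPED_SUBRESOURCE mapped;
    HRESULT hr = ID3D11DeviceContext4_Map(ctx->device_ctx,
        (ID3D11Resource*) ctx->staging_ring[r], 0, D3D11_MAP_READ,
        D3D11_MAP_FLAG_DO_NOT_WAIT, &mapped);
    if (SUCCEEDED(hr))
    {
        memcpy(rect, mapped.pData, sizeof(RECT));   /* note: this RECT is a few frames old */
        ID3D11DeviceContext4_Unmap(ctx->device_ctx,
            (ID3D11Resource*) ctx->staging_ring[r], 0);
    }
    /* hr == DXGI_ERROR_WAS_STILL_DRAWING: GPU not done with it; try next frame */
}
ctx->frame++;

The trade-off is latency: the RECT you read back lags by a couple of frames, which may or may not be acceptable for the dirty-region use case described above.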

Increase performance in Lua: lua_gettable is slow

I would like to use a Lua script to do some mathematical precalculations in my application; I don't want to hardcode them. I use Lua as a DLL-linked library. The calling program's language is not a C-based language.
The application handles pretty big arrays, normally (25k-65k) x 8 double-precision numbers.
My target is:
put this array into the Lua script using a global variable
read this array back from the Lua script
I would like this round trip to take less than 100 ms.
Currently I tested with a 28000 x 6 array, but the time is 5 sec.
I am using the lua_gettable function and iterating across the array; it is a huge amount of stack writes and reads.
My question: is there any other solution for this? I checked the API, but maybe I skipped some function. Is there any way to ask Lua to put an array subset onto the stack? And of course the opposite way.
Thank you so much for any help and suggestions!
As suggested by DarkWiiPlayer, I believe the best way to achieve this at a reasonably fast speed would be to use Lua's userdata. I did an example using a class holding a double matrix of dimensions [65536][8], the upper bound of the size you said yours would be:
class MatrixHolder {
public:
    double matrix[65536][8];
};
Then I created a method to create a new MatrixHolder and another one to perform an operation on one of the positions of the matrix (passing I and J as parameters).
static int newMatrixHolder(lua_State *lua) {
    // lua_newuserdata allocates the block, pushes it onto the stack,
    // and leaves Lua in charge of its lifetime.
    size_t nbytes = sizeof(MatrixHolder);
    MatrixHolder* object = static_cast<MatrixHolder*>(lua_newuserdata(lua, nbytes));
    memset(object, 0, nbytes);  // start with a zeroed matrix
    return 1;                   // the userdata is the single return value
}
static int performOperation(lua_State *lua) {
    MatrixHolder* object = static_cast<MatrixHolder*>(lua_touserdata(lua, 1));
    int i = luaL_checkinteger(lua, -2);
    int j = luaL_checkinteger(lua, -1);
    object->matrix[i][j] += 1.0;
    lua_pushnumber(lua, object->matrix[i][j]);  // the cell holds a double, so push a number
    return 1;
}
static const struct luaL_Reg matrixHolderLib [] = {
{"new", newMatrixHolder},
{"performOperation", performOperation},
{NULL, NULL} // - signals the end of the registry
};
On my computer, it executed the given Lua scripts in the following times:
m = matrixHolder.new()
i = matrixHolder.performOperation(m, 1, 1);
j = matrixHolder.performOperation(m, 1, 2);
i = matrixHolder.performOperation(m, 1, 1);
~845 microseconds
for i = 1, 1000
do
    m = matrixHolder.new()
    i = matrixHolder.performOperation(m, 1, 1);
    j = matrixHolder.performOperation(m, 1, 2);
    i = matrixHolder.performOperation(m, 1, 1);
end
~617 milliseconds
I'm unsure whether it will serve your purpose, but it already seems way faster than the 5 seconds you mentioned. For comparison, my computer is a 2.3 GHz 8-core Intel Core i9 with 16 GB RAM.

Merge sort large file in parallel with memory limit (Linux)

I need to sort a large binary file of size M, using t threads. Records in the file are all of equal size. The task explicitly says that the amount of memory I can allocate is m, which is much smaller than M. The hard drive is guaranteed to have at least 2 * M free space. This calls for merge sort, of course, but it turned out not to be so obvious. I see three different approaches here:
A. Map the files input, temp1 and temp2 into memory. Perform merge sort input -> temp1 -> temp2 -> temp1 ... until one of the temps is sorted. Threads contend only for selecting the next portion of work; no contention on read/write.
B. fopen the 3 files t times each, so each thread gets 3 FILE pointers, one per file. Again, threads contend only for the next portion of work; reads and writes should work in parallel.
C. fopen the 3 files once each, keep them under mutexes; all threads work in parallel, but to grab more work, to read, or to write they lock the respective mutex.
Notes:
In real life I would choose A for sure. But doesn't it defeat the whole purpose of having a limited buffer? (In other words, isn't it cheating?) With such an approach I could even radix sort the whole file in place without any extra buffer. Also, this solution is Linux-specific; I think Linux is implied by the conversation, but it's not stated explicitly in the task description.
Regarding B, I think it works on Linux but isn't portable; see the Linux note above.
Regarding C, it's portable, but I am not sure how to optimize it (e.g. 8 threads with a small enough m will just queue up waiting their turn, then read/write a tiny portion of data, instantly sort it, and bump into each other again. IMO it is unlikely to work faster than 1 thread).
Questions:
Which solution is a better match for the task?
Which solution is a better design in real life (assuming Linux)?
Does B work? In other words is opening file multiple times and writing in parallel (to different parts of it) legal?
Any alternative approaches?
Your question has many facets, so I will try to break it down a bit, while trying to answer almost all of your questions:
You are given a large file on a storage device that probably operates on blocks, i.e. you can load and store many entries at the same time. If you access a single entry from storage, you have to deal with a rather large access latency, which you can only try to hide by loading many elements at the same time, thus amortizing the latency over all element load times.
Your main memory is quite fast compared to the storage (especially for random access), so you want to keep as much data in main memory as possible and only read and write sequential blocks on the storage. This is also the reason why A is not really cheating, since if you tried to use your storage for random access, you would be waaay slower than using main memory.
Combining these results, you can arrive at the following approach, which is basically A but with some engineering details that are usually used in external algorithms.
Use only a single dedicated thread for reading and writing on the storage.
This way, you need only one file descriptor for every file and could in theory even collect and reorder read and write requests from all threads within a small timeframe to get nearly sequential access patterns. Additionally, your threads can just queue a write request and continue with the next block without waiting for the IO to finish.
Load t blocks (from input) into main memory of a maximum size such that you can run mergesort in parallel on each of these blocks. After the blocks are sorted, write them onto the storage as temp1.
Repeat this until all blocks in the file have been sorted.
Now do a so-called multiway merge on the sorted blocks:
Every thread loads a certain number k of consecutive blocks from temp1 into memory and merges them using a priority queue or tournament tree to find the next minimum to be inserted into the resulting block. As soon as your block is full, you write it to your storage at temp2 to free up memory for the next block. After this step, conceptually swap temp1 and temp2.
You still need to do several merge steps, but this number is down by a factor of log k compared to regular two-way merges you probably meant in A. After the first few merge steps, your blocks will probably be too large to fit into main memory, so you split them into smaller blocks and, starting from the first small block, fetch the next block only when all of the previous elements have already been merged. Here, you might even be able to do some prefetching since the order of block accesses is predetermined by the block minima, but this is probably outside the scope of this question.
Note that the value for k is usually only limited by available memory.
Finally, you arrive at t huge blocks which need to be merged together. I don't really know whether there is a nice parallel approach to this; it might be necessary to just merge them sequentially, so again you can work with a t-way merge as above to end up with a single sorted file.
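To make the multiway merge concrete, here is a minimal sketch of a k-way merge driven by a binary min-heap; the uint64 keys and the run_next/emit callbacks are assumptions standing in for the fixed-size records and the buffered block I/O described above:

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t key; size_t run; } HeapItem;

static void heap_sift_down(HeapItem *h, size_t n, size_t i) {
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, min = i;
        if (l < n && h[l].key < h[min].key) min = l;
        if (r < n && h[r].key < h[min].key) min = r;
        if (min == i) return;
        HeapItem tmp = h[i]; h[i] = h[min]; h[min] = tmp;
        i = min;
    }
}

/* run_next yields the next record of run i (false when exhausted);
 * emit consumes one merged record. Both stand in for block-buffered I/O. */
void kway_merge(size_t k,
                bool (*run_next)(size_t run, uint64_t *key),
                void (*emit)(uint64_t key)) {
    HeapItem heap[64];                       /* assumes k <= 64 for the sketch */
    size_t n = 0;
    for (size_t i = 0; i < k; i++)           /* prime the heap with each run's head */
        if (run_next(i, &heap[n].key)) heap[n++].run = i;
    for (size_t i = n; i-- > 0;) heap_sift_down(heap, n, i);

    while (n > 0) {
        emit(heap[0].key);                   /* smallest head wins */
        if (!run_next(heap[0].run, &heap[0].key))
            heap[0] = heap[--n];             /* run exhausted: shrink the heap */
        heap_sift_down(heap, n, 0);
    }
}

Each output record costs O(log k) comparisons, which is where the factor-of-log-k saving over repeated two-way merges comes from.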
GNU sort is a multi-threaded merge sort for text files, but its basic features can be used here. Define a "chunk" as the number of records that can be sorted in memory of size m.
Sort phase: for each "chunk" of records, read the "chunk", use a multi-threaded sort on it, then write it to a temp file, ending up with ceiling(M / m) temp files. GNU sort sorts an array of pointers to records, partly because the records are variable length. For fixed-size records, in my testing, due to cache issues it is faster to sort records directly rather than sort an array of pointers to records (which results in cache-unfriendly random access of records), unless the record size is greater than somewhere between 128 and 256 bytes.
Merge phase: perform single-threaded k-way merges (e.g. with a priority queue) on the temp files until a single file is produced. Multi-threading doesn't help here, since it is assumed that the k-way merge phase is I/O bound, not CPU bound. For GNU sort the default k is 16 (it does 16-way merges on the temp files).
To keep from exceeding 2 * M space, files will need to be deleted once they have been read.
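As an illustration of the sort phase just described, a small sketch in C (error handling omitted; the fixed record size and the memcmp key comparison are assumptions for the sketch):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REC_SIZE 64                        /* assumed fixed record size */

static int cmp_rec(const void *a, const void *b) {
    return memcmp(a, b, REC_SIZE);         /* assume lexicographic keys */
}

/* Reads up to max_recs records at a time from in (the "m" budget),
 * sorts each chunk in memory, writes it to dir/runN, and returns the
 * number of temp files produced, i.e. ceiling(M / m). */
size_t sort_phase(FILE *in, size_t max_recs, const char *dir) {
    char *buf = malloc(max_recs * REC_SIZE);
    size_t nfiles = 0, got;
    while ((got = fread(buf, REC_SIZE, max_recs, in)) > 0) {
        qsort(buf, got, REC_SIZE, cmp_rec);   /* sort the chunk in memory */
        char name[256];
        snprintf(name, sizeof name, "%s/run%zu", dir, nfiles++);
        FILE *out = fopen(name, "wb");
        fwrite(buf, REC_SIZE, got, out);
        fclose(out);
    }
    free(buf);
    return nfiles;
}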
If your file is way bigger than your RAM size, then this is the solution: https://stackoverflow.com/a/49839773/1647320
If your file size is 70-80% of your RAM size, then the following is the solution: an in-memory parallel merge sort.
Change these lines according to your system. fPath is your one big input file. shared is where the execution log is stored. fdir is where the intermediate files will be stored and merged. Change these paths according to your machine.
public static final String fdir = "/tmp/";
public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
public static final String fPath = "/input/data-20GB.in";
public static final String opLog = shared+"Mysort20GB.log";
Then run the following program. Your final sorted file will be created with the name op2GB in the fdir path. The last line, Runtime.getRuntime().exec(...), runs valsort to check whether the output is sorted; remove it if you don't have valsort installed on your machine or if the input file was not generated with gensort (http://www.ordinal.com/gensort.html).
Also, don't forget to change int totalLines = 20000000; to the total number of lines in your file, and the thread count (int threadCount = 8) should always be a power of 2.
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.stream.Stream;
class SplitJob extends Thread {
LinkedList<String> chunkName;
int startLine, endLine;
SplitJob(LinkedList<String> chunkName, int startLine, int endLine) {
this.chunkName = chunkName;
this.startLine = startLine;
this.endLine = endLine;
}
public void run() {
try {
int totalLines = endLine + 1 - startLine;
Stream<String> chunks =
Files.lines(Paths.get(Mysort2GB.fPath))
.skip(startLine - 1)
.limit(totalLines)
.sorted(Comparator.naturalOrder());
chunks.forEach(line -> {
chunkName.add(line);
});
System.out.println(" Done Writing " + Thread.currentThread().getName());
} catch (Exception e) {
System.out.println(e);
}
}
}
class MergeJob extends Thread {
int list1, list2, oplist;
MergeJob(int list1, int list2, int oplist) {
this.list1 = list1;
this.list2 = list2;
this.oplist = oplist;
}
public void run() {
try {
System.out.println(list1 + " Started Merging " + list2 );
LinkedList<String> merged = new LinkedList<>();
LinkedList<String> ilist1 = Mysort2GB.sortedChunks.get(list1);
LinkedList<String> ilist2 = Mysort2GB.sortedChunks.get(list2);
//Merge 2 files based on which string is greater.
while (ilist1.size() != 0 || ilist2.size() != 0) {
if (ilist1.size() == 0 ||
(ilist2.size() != 0 && ilist1.get(0).compareTo(ilist2.get(0)) > 0)) {
merged.add(ilist2.remove(0));
} else {
merged.add(ilist1.remove(0));
}
}
System.out.println(list1 + " Done Merging " + list2 );
Mysort2GB.sortedChunks.remove(list1);
Mysort2GB.sortedChunks.remove(list2);
Mysort2GB.sortedChunks.put(oplist, merged);
} catch (Exception e) {
System.out.println(e);
}
}
}
public class Mysort2GB {
//public static final String fdir = "/Users/diesel/Desktop/";
public static final String fdir = "/tmp/";
public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
public static final String fPath = "/input/data-2GB.in";
public static HashMap<Integer, LinkedList<String>> sortedChunks = new HashMap<>();
public static final String opfile = fdir+"op2GB";
public static final String opLog = shared + "mysort2GB.log";
public static void main(String[] args) throws Exception{
long startTime = System.nanoTime();
int threadCount = 8; // Number of threads
int totalLines = 20000000;
int linesPerFile = totalLines / threadCount;
LinkedList<Thread> activeThreads = new LinkedList<Thread>();
for (int i = 1; i <= threadCount; i++) {
int startLine = i == 1 ? i : (i - 1) * linesPerFile + 1;
int endLine = i * linesPerFile;
LinkedList<String> thisChunk = new LinkedList<>();
SplitJob mapThreads = new SplitJob(thisChunk, startLine, endLine);
sortedChunks.put(i,thisChunk);
activeThreads.add(mapThreads);
mapThreads.start();
}
activeThreads.stream().forEach(t -> {
try {
t.join();
} catch (Exception e) {
}
});
int treeHeight = (int) (Math.log(threadCount) / Math.log(2));
for (int i = 0; i < treeHeight; i++) {
LinkedList<Thread> actvThreads = new LinkedList<Thread>();
for (int j = 1, itr = 1; j <= (threadCount >> i); j += 2, itr++) { // threadCount >> i lists remain at this level; the original "/ (i + 1)" only worked for threadCount <= 8
int offset = i * 100;
int list1 = j + offset;
int list2 = (j + 1) + offset;
int opList = itr + ((i + 1) * 100);
MergeJob reduceThreads =
new MergeJob(list1,list2,opList);
actvThreads.add(reduceThreads);
reduceThreads.start();
}
actvThreads.stream().forEach(t -> {
try {
t.join();
} catch (Exception e) {
}
});
}
BufferedWriter writer = Files.newBufferedWriter(Paths.get(opfile));
sortedChunks.get(treeHeight*100+1).forEach(line -> {
try {
writer.write(line+"\r\n");
}catch (Exception e){
}
});
writer.close();
long endTime = System.nanoTime();
double timeTaken = (endTime - startTime)/1e9;
System.out.println(timeTaken);
BufferedWriter logFile = new BufferedWriter(new FileWriter(opLog, true));
logFile.write("Time Taken in seconds:" + timeTaken);
Runtime.getRuntime().exec(new String[] {"sh", "-c", "valsort " + opfile + " > " + opLog}); // redirection needs a shell; a bare exec() would pass ">" as a literal argument
logFile.close();
}
}

XPC service array crashes

I'm using the C interface for XPC services; incidentally, my XPC service runs very nicely aside from the following problem.
The other day I tried to send a "large" array via XPC, of the order of 200,000 entries. Usually my application deals with data of a couple of thousand entries and has no problems with that. For other uses, an array of this size may not be special.
Here is my C++ server code for generating the array:
xpc_connection_t remote = xpc_dictionary_get_remote_connection(event);
xpc_object_t reply = xpc_dictionary_create_reply(event);
xpc_object_t times;
times = xpc_array_create(NULL, 0);
for(unsigned int s = 0; s < data.size(); s++)
{
    xpc_object_t index = xpc_uint64_create(data[s]);
    xpc_array_append_value(times, index);
    xpc_release(index); /* the array retains the value; drop our reference to avoid a leak */
}
xpc_dictionary_set_value(reply, "times", times);
xpc_connection_send_message(remote, reply);
xpc_release(times);
xpc_release(reply);
and here is the client code:
xpc_object_t times = xpc_dictionary_get_value(reply, "times");
size_t count = xpc_array_get_count(times);
for(size_t c = 0; c < count; c++) /* size_t, to match the count's type */
{
    uint64_t my_time = xpc_array_get_uint64(times, c); /* uint64_t, not long */
    local_times.push_back(my_time);
}
If I try to handle a large array I get a seg fault (SIGSEGV)
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libxpc.dylib 0x00007fff90e5cc02 xpc_array_get_count + 0
When you say "extremely big array" are you speaking of something that launchd might regard as a resource-hog and kill?
XPC is only really meant for short-fast transactional runs rather than long-winded service-based runs.
If you're going to make calls that make launchd wait, then I'd suggest you try https://developer.apple.com/library/mac/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/CreatingLaunchdJobs.html
When the Service dies.. Are any specific events other then SIG_ABORTS etc... fired?
Do you get "xpc service was invalidated" (which usually means launchD killed it, or did you get "xpc service/exited prematurely" which usually is handler code error.

Low Pass filter in C

I'm implementing a low-pass filter in C with the PortAudio library.
I record my microphone input with a script from PortAudio itself. There I added the following code:
float cutoff = 4000.0;
float filter(float cutoffFreq){
float RC = 1.0/(cutoffFreq * 2 * M_PI);
float dt = 1.0/SAMPLE_RATE;
float alpha = dt/(RC+dt);
return alpha;
}
float filteredArray[numSamples];
filteredArray[0] = data.recordedSamples[0];
for(int i = 1; i < numSamples; i++){
if(i%SAMPLE_RATE == 0){
cutoff = cutoff - 400;
}
data.recordedSamples[i] = data.recordedSamples[i-1] + (filter(cutoff)*(data.recordedSamples[i] - data.recordedSamples[i-1]));
}
When I run this script for 5 seconds it works. But when I try to run it for more than 5 seconds it fails. The application records everything but crashes on playback. If I remove the filter, the application works.
Any advice?
The problem:
you are lowering the cutoff frequency by 400 Hz every time i % SAMPLE_RATE == 0
you never stop, so the cutoff goes below zero
this is not done once per second!!!
instead, it happens every time your loop index passes a one-second boundary in your data
that can occur more often than you think if your calls are not placed correctly
which is not visible in your code
you are filtering in the wrong order
... a[i]=f(a[i],a[i-1]); i++;
that means you are filtering with an already-filtered a[i-1] value
What to do about it:
check the code placement
it should be in some event, like on-packet-done-sampling
or in a thread after some Sleep(...); (or inside a timer)
change the cutoff handling (handle the edge cases)
reverse the filter direction
Something like this:
int i_done = 0;
void on_some_timer()
{
    cutoff -= 400;
    if (cutoff < 1) cutoff = 1; // here change 1 to your limit frequency
    if (numSamples != i_done)
        for (int i = numSamples - 1; i >= i_done && i > 0; i--) // ';' not ',' in the original; i > 0 keeps [i-1] in bounds
            data.recordedSamples[i] = data.recordedSamples[i-1] + (filter(cutoff) * (data.recordedSamples[i] - data.recordedSamples[i-1]));
    i_done = numSamples;
}
If your code is already OK (you did not post the whole thing, so I may be missing something), then just add the if (cutoff < 1) cutoff = 1; after the cutoff change.
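Putting the pieces together, here is a minimal consolidated sketch of the forward loop from the question, with the clamp from the answer applied and alpha recomputed only when the cutoff actually changes (SAMPLE_RATE, numSamples, data, and filter() are taken from the original post):

float cutoff = 4000.0f;
float alpha = filter(cutoff);              /* reuse filter() from the question */
for(int i = 1; i < numSamples; i++){
    if(i % SAMPLE_RATE == 0){              /* once per second of audio samples */
        cutoff -= 400.0f;
        if(cutoff < 1.0f) cutoff = 1.0f;   /* clamp, per the answer above */
        alpha = filter(cutoff);            /* recompute only on change, not per sample */
    }
    data.recordedSamples[i] = data.recordedSamples[i-1]
        + alpha * (data.recordedSamples[i] - data.recordedSamples[i-1]);
}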
