Improving performance of slow lua_gettable for arrays

I would like to use a Lua script to do some mathematical precalculations in my application that I don't want to hardcode. I use Lua as a DLL-linked library; the calling program is not written in a C-based language.
The application handles pretty big arrays, normally (25k-65k) x 8 arrays of doubles.
My target is:
put this array into the Lua script as a global variable
read back this array from the Lua script
I would like to complete this round trip in less than 100 ms.
Currently I tested with a 28000 x 6 array, but it takes 5 seconds.
I am using the lua_gettable function and iterating across the array, which means a huge number of stack writes and reads.
My question: is there any other solution for this? I checked the API, but maybe I skipped some function. Is there any way to ask Lua to put an array subset onto the stack? And of course the opposite way.
Thank you so much for any help and suggestions!

As suggested by DarkWiiPlayer, I believe the best way to achieve this at a reasonably fast speed would be to use Lua's userdata. I did an example using a class holding a double matrix with [65536][8] dimensions, the upper bound you mentioned:
class MatrixHolder {
public:
    double matrix[65536][8];
};
Then, I created a method to create a new MatrixHolder and another one to perform an operation on one of the positions of the matrix (passing I and J as parameters).
static int newMatrixHolder(lua_State *lua) {
    size_t nbytes = sizeof(MatrixHolder);
    MatrixHolder* object = static_cast<MatrixHolder*>(lua_newuserdata(lua, nbytes));
    (void)object; // the userdata itself is the return value, left on the stack
    return 1;
}
static int performOperation(lua_State *lua) {
    MatrixHolder* object = static_cast<MatrixHolder*>(lua_touserdata(lua, 1));
    int i = luaL_checkinteger(lua, 2);
    int j = luaL_checkinteger(lua, 3);
    object->matrix[i][j] += 1.0;
    lua_pushnumber(lua, object->matrix[i][j]);
    return 1;
}
static const struct luaL_Reg matrixHolderLib[] = {
    {"new", newMatrixHolder},
    {"performOperation", performOperation},
    {NULL, NULL} // signals the end of the registry
};
On my computer, it executed the given Lua scripts in the following times:
m = matrixHolder.new()
i = matrixHolder.performOperation(m, 1, 1)
j = matrixHolder.performOperation(m, 1, 2)
i = matrixHolder.performOperation(m, 1, 1)
~845 microseconds
for i = 1, 1000 do
    m = matrixHolder.new()
    i = matrixHolder.performOperation(m, 1, 1)
    j = matrixHolder.performOperation(m, 1, 2)
    i = matrixHolder.performOperation(m, 1, 1)
end
~617 milliseconds
I'm unsure whether it will serve your purpose, but it already seems way faster than the 5 seconds you mentioned. For comparison, my computer is a 2.3 GHz 8-core Intel Core i9 with 16 GB RAM.
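As a side note not taken from the answer above: even without userdata, the per-element transfer the question describes can be trimmed by using the raw accessors lua_rawgeti/lua_rawseti, which skip the metamethod handling done by lua_gettable/lua_settable, and by preallocating the table with lua_createtable. A rough sketch, assuming a flat row-major double buffer and the global name "data":
static void push_matrix(lua_State *L, const double *mat, int rows, int cols) {
    lua_createtable(L, rows * cols, 0);      // preallocate the array part
    for (int i = 0; i < rows * cols; i++) {
        lua_pushnumber(L, mat[i]);
        lua_rawseti(L, -2, i + 1);           // data[i+1] = mat[i], no metamethods
    }
    lua_setglobal(L, "data");
}
static void read_matrix(lua_State *L, double *mat, int rows, int cols) {
    lua_getglobal(L, "data");
    for (int i = 0; i < rows * cols; i++) {
        lua_rawgeti(L, -1, i + 1);
        mat[i] = lua_tonumber(L, -1);
        lua_pop(L, 1);
    }
    lua_pop(L, 1);                           // pop the table
}
This is still O(n) stack traffic, so for very large arrays the userdata approach above remains the better option.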

Related

How to initialize struct members at compile time, depending on the value of previously defined entries in an array

Suppose I have a structure used for describing values stored inside a virtual memory map:
typedef struct
{
    uint16_t u16ID;
    uint16_t u16Offset;
    uint8_t u8Size;
} MemMap_t;
const MemMap_t memoryMap[3] =
{
    {
        .u16ID = 0,
        .u16Offset = 0,
        .u8Size = 3
    },
    {
        .u16ID = 1,
        .u16Offset = 3,
        .u8Size = 2
    },
    {
        .u16ID = 2,
        .u16Offset = 5,
        .u8Size = 3
    }
};
Each entry contains an offset for addressing the memory location and the size of the value it contains.
The offset of each following value depends on the offset and size of the values before it.
In this example I set all offsets manually.
The reason I implemented it that way is that it allows me to change the layout of the entire memory map later on,
while the structure still makes it possible to look up the offset and size of an entry with a certain ID.
The problem with this is that setting the offsets manually is going to get unwieldy quite quickly once the map becomes bigger,
and changing the size of an entry at the beginning would require manually changing all offsets of the entries after it.
I came up with some ways to just calculate the offsets at runtime, but as the target system this will run on is a very RAM constrained embedded system, I really want to keep the entire map as a constant.
Is there an elegant way to calculate the offsets of the map entries at compile time?
After some experiments, I found something that may work for a large number of attributes. Posting as a new answer, as my previous answer took a very different approach.
Consider creating a proxy structure that describes the object laid out by MemMap_t, using a series of char[] members.
static struct MemMap_v {
    char t0[3];
    char t1[2];
    char t2[3];
    char t3[10];
} vv;
const MemMap_t memoryMap[3] =
{
    {
        .u16ID = 0,
        .u16Offset = vv.t0 - vv.t0,
        .u8Size = sizeof(vv.t0)
    },
    {
        .u16ID = 1,
        .u16Offset = vv.t1 - vv.t0,
        .u8Size = sizeof(vv.t1)
    },
    {
        .u16ID = 2,
        .u16Offset = vv.t2 - vv.t0,
        .u8Size = sizeof(vv.t2)
    }
};
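A variation on the same proxy idea (my addition, not from the original answer): offsetof from <stddef.h> gives the same byte offsets as integer constant expressions, which avoids relying on pointer subtraction between distinct members. Because the members are char arrays, the struct has no padding, so the offsets match the packed layout intended by MemMap_t:
#include <stddef.h>

const MemMap_t memoryMap[3] =
{
    { .u16ID = 0, .u16Offset = offsetof(struct MemMap_v, t0), .u8Size = sizeof(vv.t0) },
    { .u16ID = 1, .u16Offset = offsetof(struct MemMap_v, t1), .u8Size = sizeof(vv.t1) },
    { .u16ID = 2, .u16Offset = offsetof(struct MemMap_v, t2), .u8Size = sizeof(vv.t2) }
};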
Is there an elegant way to calculate the offsets of the map entries at compile time?
Yes: write yourself a code generator that accepts input data describing the memory map and outputs C source for the initializer or for the whole declaration. Have the appropriate source file #include that. Structure this program so that the form of its input data is convenient for you to maintain.
If the number of map entries were bounded by a (very) small number, and if their IDs were certain to be consecutive and to correspond to their indices in the memoryMap array, then I feel pretty confident that it would be possible to write a set of preprocessor macros that did the job without a separate program. Such a preprocessor-based solution would be messy, and difficult to debug and maintain. I do not recommend this alternative.
Short answer: it is not possible to calculate the values at compile time with the given data structure.
Alternative:
Consider using symbolic constants for the sizes: E_0, E_1, E_2, .... Then you can compute the offsets at compile time (0, E_0, E_0 + E_1, ...). Not very elegant, and it does not scale well for a large number of items, but it will meet the requirements.
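A minimal sketch of that idea, using the sizes from the original map (the E_* names are just placeholders):
enum { E_0 = 3, E_1 = 2, E_2 = 3 };   /* entry sizes as symbolic constants */

const MemMap_t memoryMap[3] =
{
    { .u16ID = 0, .u16Offset = 0,         .u8Size = E_0 },
    { .u16ID = 1, .u16Offset = E_0,       .u8Size = E_1 },
    { .u16ID = 2, .u16Offset = E_0 + E_1, .u8Size = E_2 }
};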
A second alternative is to create a function that returns a pointer to memoryMap. The function can initialize the offsets on the first call. The program then calls getMemoryMap instead of using memoryMap directly.
#include <stdbool.h>

static MemMap_t memoryMap[3] =
{
    ...
};

const MemMap_t *getMemoryMap(void) {
    MemMap_t *p = memoryMap;
    static bool offsetDone;
    if (!offsetDone) {
        offsetDone = true;
        for (size_t i = 1; i < sizeof(memoryMap)/sizeof(memoryMap[0]); i++) {
            p[i].u16Offset = p[i-1].u16Offset + p[i-1].u8Size;
        }
    }
    return p;
}
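Hypothetical usage, assuming the elided initializer keeps the IDs and sizes from the original map:
const MemMap_t *map = getMemoryMap();
/* after the first call: map[1].u16Offset == 3, map[2].u16Offset == 5 */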

Can gcc/clang optimize initialization computing?

I recently wrote a parser generator tool that takes a BNF grammar (as a string) and a set of actions (as a function pointer array) and outputs a parser (= a state automaton, allocated on the heap). I then use another function to run that parser on my input data and generate an abstract syntax tree.
In the initial parser generation there are quite a lot of steps, and I was wondering whether gcc or clang are able to optimize this, given constant inputs to the parser generation function (and never using the pointer values, only dereferencing them). Is it possible to run the function at compile time and embed the result (i.e., the allocated memory) in the executable?
(Obviously, that would require link-time optimization, since the compiler would need to be able to check that the whole function does indeed produce the same result with the same parameters.)
What you could do in this case is have code that generates code.
Have your initial parser generator as a separate piece of code that runs independently. The output of this code would be a header file containing a set of variable definitions initialized to the proper values. You then use this file in your main code.
As an example, suppose you have a program that needs to know the number of bits that are set in a given byte. You could compute this whenever you need it:
#include <stdint.h>

int count_bits(uint8_t b)
{
    int count = 0;
    while (b) {
        count += b & 1;
        b >>= 1;
    }
    return count;
}
Or you can generate the table in a separate program:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    FILE *header = fopen("bitcount.h", "w");
    if (!header) {
        perror("fopen failed");
        exit(1);
    }
    fprintf(header, "int bit_counts[256] = {\n");
    for (unsigned v = 0; v < 256; v++) {
        uint8_t b = v;
        int count = 0;              /* reset for every value */
        while (b) {
            count += b & 1;
            b >>= 1;
        }
        fprintf(header, "    %d,\n", count);
    }
    fprintf(header, "};\n");
    fclose(header);
    return 0;
}
This creates a file called bitcount.h that looks like this:
int bit_counts[256] = {
    0,
    1,
    1,
    2,
    ...
    8,
};
That you can include in your "real" code.
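A hypothetical use of the generated header; note that it defines the array rather than declaring it, so include it in exactly one translation unit:
#include <stdint.h>
#include "bitcount.h"   /* generated by the program above */

int popcount8(uint8_t b) {
    return bit_counts[b];   /* table lookup instead of looping over the bits */
}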

Merge sort large file in parallel with memory limit (Linux)

I need to sort a large binary file of size M, using t threads. Records in the file are all of equal size. The task explicitly says that the amount of memory I can allocate is m, which is much smaller than M. The hard drive is also guaranteed to have at least 2 * M of free space. This calls for merge sort, of course, but it turned out not to be so obvious. I see three different approaches here:
A. Map the files input, temp1 and temp2 into memory. Perform merge sort input -> temp1 -> temp2 -> temp1 ... until one of the temps is sorted. Threads only contend for selecting the next portion of work; no contention on read/write.
B. fopen the 3 files t times each, so each thread gets 3 FILE pointers, one per file. Again they contend only for the next portion of work; reads and writes should work in parallel.
C. fopen the 3 files once each, keep them under mutexes; all threads work in parallel, but to grab more work, or to read, or to write, they lock the respective mutex.
Notes:
In real life I would choose A for sure. But doesn't it defeat the whole purpose of having a limited buffer? (In other words, isn't it cheating?) With such an approach I could even radix sort the whole file in place without an extra buffer. Also, this solution is Linux-specific; I think Linux is implied by the conversation, but it's not stated explicitly in the task description.
Regarding B, I think it works on Linux but isn't portable; see the Linux note above.
Regarding C, it's portable, but I am not sure how to optimize it (e.g. 8 threads with a small enough m will just sit waiting for their turn in the queue, then read/write a tiny portion of data, then instantly sort it and bump into each other again; IMO that is unlikely to work faster than 1 thread).
Questions:
Which solution is a better match for the task?
Which solution is a better design in real life (assuming Linux)?
Does B work? In other words, is opening a file multiple times and writing to it in parallel (to different parts of it) legal?
Any alternative approaches?
Your question has many facets, so I will try to break it down a bit, while trying to answer almost all of your questions:
You are given a large file on a storage device that probably operates on blocks, i.e. you can load and store many entries at the same time. If you access a single entry from storage, you have to deal with rather large access latency which you can only try to hide by loading many elements at the same time thus amortizing the latency over all element load times.
Your main memory is quite fast compared to the storage (especially for random access), so you want to keep as much data in main memory as possible and only read and write sequential blocks on the storage. This is also the reason why A is not really cheating, since if you tried to use your storage for random access, you would be waaay slower than using main memory.
Combining these results, you can arrive at the following approach, which is basically A but with some engineering details that are usually used in external algorithms.
Use only a single dedicated thread for reading and writing on the storage.
This way, you need only one file descriptor for every file and could in theory even collect and reorder read and write requests from all threads within a small timeframe to get nearly sequential access patterns. Additionally, your threads can just queue a write request and continue with the next block without waiting for the IO to finish.
Load t blocks (from input) into main memory of a maximum size such that you can run mergesort in parallel on each of these blocks. After the blocks are sorted, write them onto the storage as temp1.
Repeat this until all blocks in the file have been sorted.
Now do a so-called multiway merge on the sorted blocks:
Every thread loads a certain number k of consecutive blocks from temp1 into memory and merges them using a priority queue or tournament tree to find the next minimum to be inserted into the resulting block. As soon as your block is full, you write it onto your storage at temp2 to free up memory for the next block. After this step, conceptually swap temp1 and temp2
You still need to do several merge steps, but this number is down by a factor of log k compared to regular two-way merges you probably meant in A. After the first few merge steps, your blocks will probably be too large to fit into main memory, so you split them into smaller blocks and, starting from the first small block, fetch the next block only when all of the previous elements have already been merged. Here, you might even be able to do some prefetching since the order of block accesses is predetermined by the block minima, but this is probably outside the scope of this question.
Note that the value for k is usually only limited by available memory.
Finally, you arrive at t huge blocks which need to be merged together. I don't really know if there is a nice parallel approach to this, it might be necessary to just merge them sequentially, so again you can work with a t-way merge as above to result in a single sorted file.
Gnu sort is a multi-threaded merge sort for text files, but its basic features could be used here. Define a "chunk" as the number of records that can be sorted in memory of size m.
Sort phase: for each "chunk" of records, read the "chunk", use a multi-threaded sort on it, then write it to a temp file, ending up with ceiling(M / m) temp files. Gnu sort sorts an array of pointers to records, partly because the records are variable length. For fixed-size records, in my testing, due to cache issues, it's faster to sort records directly rather than sort an array of pointers to records (which results in cache-unfriendly random access of records), unless the record size is greater than somewhere between 128 and 256 bytes.
Merge phase: perform single-threaded k-way merges (e.g. with a priority queue) on the temp files until a single file is produced. Multi-threading doesn't help here, since it's assumed that the k-way merge phase is I/O bound and not CPU bound. For Gnu sort, the default for k is 16 (it does 16-way merges on the temp files).
To keep from exceeding 2 x M space, files will need to be deleted once they have been read.
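To make the merge phase concrete, here is a minimal sketch (mine, not from either answer) of a k-way merge over sorted temp files of fixed-size records. It uses a linear scan over the k current front records instead of a priority queue, which is adequate for small k such as 16; the record size, key length and memcmp ordering are assumptions for illustration:
#include <stdio.h>
#include <string.h>

#define K        4     /* number of temp files merged at once (assumed) */
#define RECSIZE  16    /* fixed record size in bytes (assumed) */

/* Order records by their first 8 bytes (assumed key). */
static int rec_cmp(const unsigned char *a, const unsigned char *b) {
    return memcmp(a, b, 8);
}

/* Merge K sorted record files into out. */
static void kway_merge(FILE *in[K], FILE *out) {
    unsigned char head[K][RECSIZE];
    int alive[K];
    for (int i = 0; i < K; i++)                 /* prime one record per input */
        alive[i] = fread(head[i], RECSIZE, 1, in[i]) == 1;
    for (;;) {
        int best = -1;
        for (int i = 0; i < K; i++)             /* linear scan for the smallest front record */
            if (alive[i] && (best < 0 || rec_cmp(head[i], head[best]) < 0))
                best = i;
        if (best < 0)
            break;                              /* all inputs exhausted */
        fwrite(head[best], RECSIZE, 1, out);
        alive[best] = fread(head[best], RECSIZE, 1, in[best]) == 1;
    }
}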
If your file is way bigger than your RAM size, then this is the solution: https://stackoverflow.com/a/49839773/1647320
If your file size is 70-80% of your RAM size, then the following is the solution. It's an in-memory parallel merge sort.
Change these lines according to your system. fPath is your one big input file, shared is where the execution log is stored, and fdir is where the intermediate files will be stored and merged. Change these paths according to your machine.
public static final String fdir = "/tmp/";
public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
public static final String fPath = "/input/data-20GB.in";
public static final String opLog = shared+"Mysort20GB.log";
Then run the following program. Your final sorted file will be created with the name op2GB in the fdir path. The last line, Runtime.getRuntime().exec("valsort " + opfile + " > " + opLog);, checks whether the output is sorted. Remove this line if you don't have valsort installed on your machine or if the input file was not generated using gensort (http://www.ordinal.com/gensort.html).
Also, don't forget to change int totalLines = 20000000; to the total number of lines in your file, and the thread count (int threadCount = 8) should always be a power of 2.
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.stream.Stream;
class SplitJob extends Thread {
LinkedList<String> chunkName;
int startLine, endLine;
SplitJob(LinkedList<String> chunkName, int startLine, int endLine) {
this.chunkName = chunkName;
this.startLine = startLine;
this.endLine = endLine;
}
public void run() {
try {
int totalLines = endLine + 1 - startLine;
Stream<String> chunks =
Files.lines(Paths.get(Mysort2GB.fPath))
.skip(startLine - 1)
.limit(totalLines)
.sorted(Comparator.naturalOrder());
chunks.forEach(line -> {
chunkName.add(line);
});
System.out.println(" Done Writing " + Thread.currentThread().getName());
} catch (Exception e) {
System.out.println(e);
}
}
}
class MergeJob extends Thread {
int list1, list2, oplist;
MergeJob(int list1, int list2, int oplist) {
this.list1 = list1;
this.list2 = list2;
this.oplist = oplist;
}
public void run() {
try {
System.out.println(list1 + " Started Merging " + list2 );
LinkedList<String> merged = new LinkedList<>();
LinkedList<String> ilist1 = Mysort2GB.sortedChunks.get(list1);
LinkedList<String> ilist2 = Mysort2GB.sortedChunks.get(list2);
//Merge 2 files based on which string is greater.
while (ilist1.size() != 0 || ilist2.size() != 0) {
if (ilist1.size() == 0 ||
(ilist2.size() != 0 && ilist1.get(0).compareTo(ilist2.get(0)) > 0)) {
merged.add(ilist2.remove(0));
} else {
merged.add(ilist1.remove(0));
}
}
System.out.println(list1 + " Done Merging " + list2 );
Mysort2GB.sortedChunks.remove(list1);
Mysort2GB.sortedChunks.remove(list2);
Mysort2GB.sortedChunks.put(oplist, merged);
} catch (Exception e) {
System.out.println(e);
}
}
}
public class Mysort2GB {
//public static final String fdir = "/Users/diesel/Desktop/";
public static final String fdir = "/tmp/";
public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
public static final String fPath = "/input/data-2GB.in";
public static HashMap<Integer, LinkedList<String>> sortedChunks = new HashMap();
public static final String opfile = fdir+"op2GB";
public static final String opLog = shared + "mysort2GB.log";
public static void main(String[] args) throws Exception{
long startTime = System.nanoTime();
int threadCount = 8; // Number of threads
int totalLines = 20000000;
int linesPerFile = totalLines / threadCount;
LinkedList<Thread> activeThreads = new LinkedList<Thread>();
for (int i = 1; i <= threadCount; i++) {
int startLine = i == 1 ? i : (i - 1) * linesPerFile + 1;
int endLine = i * linesPerFile;
LinkedList<String> thisChunk = new LinkedList<>();
SplitJob mapThreads = new SplitJob(thisChunk, startLine, endLine);
sortedChunks.put(i,thisChunk);
activeThreads.add(mapThreads);
mapThreads.start();
}
activeThreads.stream().forEach(t -> {
try {
t.join();
} catch (Exception e) {
}
});
int treeHeight = (int) (Math.log(threadCount) / Math.log(2));
for (int i = 0; i < treeHeight; i++) {
LinkedList<Thread> actvThreads = new LinkedList<Thread>();
for (int j = 1, itr = 1; j <= threadCount / (i + 1); j += 2, itr++) {
int offset = i * 100;
int list1 = j + offset;
int list2 = (j + 1) + offset;
int opList = itr + ((i + 1) * 100);
MergeJob reduceThreads =
new MergeJob(list1,list2,opList);
actvThreads.add(reduceThreads);
reduceThreads.start();
}
actvThreads.stream().forEach(t -> {
try {
t.join();
} catch (Exception e) {
}
});
}
BufferedWriter writer = Files.newBufferedWriter(Paths.get(opfile));
sortedChunks.get(treeHeight*100+1).forEach(line -> {
try {
writer.write(line+"\r\n");
}catch (Exception e){
}
});
writer.close();
long endTime = System.nanoTime();
double timeTaken = (endTime - startTime)/1e9;
System.out.println(timeTaken);
BufferedWriter logFile = new BufferedWriter(new FileWriter(opLog, true));
logFile.write("Time Taken in seconds:" + timeTaken);
Runtime.getRuntime().exec("valsort " + opfile + " > " + opLog);
logFile.close();
}
}

Parallelisation in R: Using parLapply with pointer to C object

I'm trying to parallelise an R function that conducts some arithmetic in C.
A C object is constructed once from an R dataset using some function, which I'll call InitializeCObject, that returns a pointer to the object. I want to create an instance of this object on each worker that I can reuse many times.
This is where I've got so far:
nCores <- 2
cluster <- makeCluster(nCores)
on.exit(stopCluster(cluster))
clusterEvalQ(cluster, {library(pkgName); NULL})
The simplest solution is to make a new C object on each call:
x <- list(val1, val2) # list of length `nCores`
parLapply(cluster, x, function (x_i) pkgName::MakeObjectAndCalc(x_i, dataset))
But the time spent initializing the C object on every single call outweighs the benefits of parallelization.
I've tried creating nCores C objects and exporting all of them to each worker, then making worker n use local object n:
cPointer <- lapply(seq_len(nCores), function(xx) InitializeCObject(dataset))
on.exit(DestroyCObject(cPointer), add=TRUE)
clusterExport(cluster, 'cPointer')
parLapply(cluster, seq_len(nCores), function (i) Calculate(x[[i]], cPointer[[i]]))
But this doesn't work; the objects on the workers seem not to be initialized.
So I tried creating a separate C object locally on each worker:
clusterExport(cluster, 'dataset')
clusterEvalQ(cluster, {
localPointer <- InitializeCObject(dataset)
LocalCalc <- function (x) Calculate(x, localPointer)
on.exit(DestroyCObject(localPointer))
})
parLapply(cluster, x, LocalCalc)
But this causes the workers to crash. Any suggestions as to how I might move forwards would be appreciated.
Edit: minimal C example
Here's my attempt to provide a minimal example of the associated C code. I'm far from fluent in C structures, but hopefully this code is sufficient to demonstrate my problem.
#include <stdlib.h>

// Define the object structure
typedef struct CObject_t {
    int data;
} CObject_t, *CObject;

// Allocate memory for a new (empty) object and return a pointer to it
CObject new_object_t(void) {
    CObject new = (CObject)calloc(1, sizeof(CObject_t));
    return new;
}

// Copy the dataset value into the object
void initialize_object(const int *dataset, CObject cObj) {
    cObj->data = *dataset;
}

// Use the object in a calculation
int use_object_to_calculate(int *x, CObject cObj) {
    *x = *x + cObj->data;
    return *x;
}
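For reference (my addition, not the package's actual code): the usual way such a pointer crosses the R/C boundary is as an external pointer created inside a .Call wrapper. This also hints at why exporting cPointer to the workers fails: an external pointer wraps a raw address that is only meaningful in the process that created it, so it cannot usefully be serialized to a PSOCK worker. A hypothetical wrapper:
#include <R.h>
#include <Rinternals.h>
#include <stdlib.h>

/* Free the C object when the R external pointer is garbage collected. */
static void cobject_finalize(SEXP ptr) {
    CObject obj = (CObject) R_ExternalPtrAddr(ptr);
    if (obj) free(obj);
    R_ClearExternalPtr(ptr);
}

/* .Call entry point: build a CObject from an integer dataset and hand it to R. */
SEXP C_InitializeCObject(SEXP dataset) {
    CObject obj = new_object_t();
    initialize_object(INTEGER(dataset), obj);
    SEXP ptr = PROTECT(R_MakeExternalPtr(obj, R_NilValue, R_NilValue));
    R_RegisterCFinalizerEx(ptr, cobject_finalize, TRUE);
    UNPROTECT(1);
    return ptr;
}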

Access violation when trying to populate an array of struct

Original code comment specifying the core question:
The error I am getting is while iterating through the while loop:
memory out of range or something... resizing to 300... "Access
violation writing location", that's the exact phrase...
I'm trying to implement a faster .NET List<T> equivalent in C.
I'm using blittable data types in C#.
In the code below I've moved a function body into the main function just for testing, after I failed to understand where I went wrong.
The problem seems to be that inside the while loop UntArr does not increment.
What am I doing wrong?
typedef struct {
int Id;
char *StrVal;
}Unit; // a data unit as element of an array
unsigned int startTimer(unsigned int start);
unsigned int stopTimer(unsigned int start);
int main(){
Unit *UntArr= {NULL};
//Unit test[30000];
//decelerations comes first..
char *dummyStringDataObject;
int adummyNum,requestedMasterArrLength,requestedStringSize,MasterDataArrObjectMemorySize,elmsz;
int TestsTotalRounds, TestRoundsCounter,ccountr;
unsigned int start, stop, mar;
//Data Settings (manually for now)
requestedMasterArrLength=300;
requestedStringSize = 15;
//timings
start=0;stop=0;
//data sizes varies (x86/x64) compilation according to fastest results
MasterDataArrObjectMemorySize = sizeof(UntArr);
elmsz= sizeof(UntArr[0]);
TestRoundsCounter=-1;
start = startTimer(start);
while(++TestRoundsCounter<requestedMasterArrLength){
int count;
count=-1;
//allocate memory for the "Master Arr"
UntArr = (Unit *)malloc(sizeof(Unit)*requestedMasterArrLength);
dummyStringDataObject = (char*)malloc(sizeof(char)*requestedStringSize);
dummyStringDataObject = "abcdefgHijkLmNo";
while (++count<requestedMasterArrLength)
{
dummyStringDataObject[requestedStringSize-1]=count+'0';
puts(dummyStringDataObject);
ccountr=-1;
// tried
UntArr[count].Id = count;
UntArr[count].StrVal = (char*)malloc(sizeof(char)*requestedStringSize);
UntArr[count].StrVal = dummyStringDataObject;// as a whole
//while(++ccountr<15)// one by one cause a whole won't work ?
//UntArr[count].StrVal[ccountr] = dummyStringDataObject[ccountr];
}
free(UntArr);free(dummyStringDataObject);
}
stop = startTimer(start);
mar = stop - start;
MasterDataArrObjectMemorySize = sizeof(UntArr)/1024;
printf("Time taken in millisecond: %d ( %d sec)\r\n size: %d kb\r\n", mar,(mar/1000),MasterDataArrObjectMemorySize);
printf("UntArr.StrVal: %s",UntArr[7].StrVal);
getchar();
return 0;
}
unsigned int startTimer(unsigned int start){
start = clock();
return start;
}
unsigned int stopTimer(unsigned int start){
start = clock()-start;
return start;
}
Testing the code one by one instead of within a while loop works as expected:
//allocate memory for the "Master Arr"
UntArr = (Unit *)malloc(sizeof(Unit)*requestedMasterArrLength);
UntArr[0].Id = 0;
dummyStringDataObject = (char*)malloc(sizeof(char)*requestedStringSize);
dummyStringDataObject = "abcdefgHijkLmNo";
////allocate memory for the string object
UntArr[0].StrVal = (char*)malloc(sizeof(char)*requestedStringSize);
////test string manipulation
adummyNum=5;
UntArr[0].StrVal= dummyStringDataObject;
//
UntArr[0].StrVal[14] = adummyNum+'0';
////test is fine
As it happens, and as I am new to pointers, I had not realized that when debugging I would not see the elements of a given pointer to an array the way I am used to with a normal Array[]. I was hovering over the Array* within the while loop expecting to see the elements as in a normal array, e.g. Data[] DataArr = new Data[1000] in C#, where I expected to actually see the body of the array while looping and populating it; I did not realize that the Data* is not an actual array but a pointer to one, so you cannot see the elements/body.
The solution is via a function now, as planned originally:
void dodata(int requestedMasterArrLength, int requestedStringSize) {
    int ccountr, count;
    count = 0;
    UntArr = NULL;   /* UntArr is now a file-scope Unit* */
    UntArr = (Unit *)malloc(sizeof(Unit) * requestedMasterArrLength);
    while (count != requestedMasterArrLength)
    {
        /* long enough for requestedStringSize = 15 (a shorter literal would overflow below) */
        char dummyStringDataObject[] = "abcdefgHijkLmNo";
        UntArr[count].StrVal = NULL;
        dummyStringDataObject[requestedStringSize - 1] = count + '0';
        UntArr[count].Id = count;
        ccountr = 0;
        UntArr[count].StrVal = (char*)malloc(sizeof(char) * requestedStringSize);
        while (ccountr != requestedStringSize) {
            UntArr[count].StrVal[ccountr] = dummyStringDataObject[ccountr];
            ++ccountr;
        }
        ++count;
    }
}
Generally speaking, x86 compilation gets better performance for this particular task of populating an array of a struct.
So I also compiled it in C++ and C#.
Executing similar code in C#, C++ and C:
minimum time measured in C#: ~3,100 ms
minimum time measured in this code (C): ~1,700 ms
minimum time measured in C++: ~900 ms
I was surprised to see this last result: C++ is the winner, but why?
I thought C was closer to the system level, the CPU, memory...
