What is the fastest way to organize data in C?


I need to keep track of a lot of boolean-esque data in C. I'm writing a toy kernel and need to store data on whether a certain memory address is used or free. Because of this, I need to store and traverse through this data in the fastest, most efficient way possible. Since I am writing the kernel from scratch I cannot use the C Standard Library. What's the best, fastest, most efficient way to organize, traverse through and modify a large series of boolean-esque data without using the C Standard Library? E.g. would a bitmap or an array or linked list take the least amount of resources to traverse and modify?

Many filesystems have the same problem: indicating whether an allocation unit (a group of disk sectors) is available or not. Except for MS-DOS's FAT, I think all use a bitmap. NTFS and Linux's ext/ext2/ext3/ext4 definitely use bitmaps.
There are several easy optimizations to make. If more than 8/16/32/64 sequential units are needed for an allocation, checking that many bits at once is simple using the corresponding integer size. If a bit being zero means "available", then testing for a zero integer tells whether the whole allocation is available. However, boundary optimizations might need to be considered.
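The word-at-a-time checks described above can be sketched as follows. This is a minimal illustration, not a complete allocator: `PAGE_COUNT`, `page_map`, and the function names are hypothetical, and it assumes 0 means "free". It uses only `<stddef.h>` and `<stdint.h>`, which are freestanding headers and so should be available without a C library.

```c
#include <stddef.h>   /* freestanding headers: no C library required */
#include <stdint.h>

#define PAGE_COUNT 1024u   /* hypothetical number of tracked units */
#define WORD_BITS  32u

/* One bit per unit: 0 = free, 1 = used. */
static uint32_t page_map[PAGE_COUNT / WORD_BITS];

static void page_set_used(size_t i)
{
    page_map[i / WORD_BITS] |= (UINT32_C(1) << (i % WORD_BITS));
}

static void page_set_free(size_t i)
{
    page_map[i / WORD_BITS] &= ~(UINT32_C(1) << (i % WORD_BITS));
}

static int page_is_used(size_t i)
{
    return (int)((page_map[i / WORD_BITS] >> (i % WORD_BITS)) & 1u);
}

/* First free unit, skipping fully used words with a single compare each.
 * Returns PAGE_COUNT if nothing is free. */
static size_t page_find_free(void)
{
    for (size_t w = 0; w < PAGE_COUNT / WORD_BITS; ++w) {
        if (page_map[w] == UINT32_MAX)
            continue;   /* all 32 units in this word are used */
        for (size_t b = 0; b < WORD_BITS; ++b)
            if (!((page_map[w] >> b) & 1u))
                return w * WORD_BITS + b;
    }
    return PAGE_COUNT;
}

/* First word-aligned run of 32 free units: a zero word means all are free.
 * Runs that straddle a word boundary are not found by this simple version. */
static size_t page_find_free_word(void)
{
    for (size_t w = 0; w < PAGE_COUNT / WORD_BITS; ++w)
        if (page_map[w] == 0)
            return w * WORD_BITS;
    return PAGE_COUNT;
}
```

The last function is where the "boundary" caveat shows up: a run of 32 free units that starts mid-word needs extra masking logic that is left out here.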

Related

Reading quickly from many random points of a file

I am in the process of developing a performance-critical network service in Rust. A request to my service looks like a vector ids: Vec<u64> of numerical ids. For each id in ids, my service must read the id-th record from a long sequence of records stored contiguously on an SSD. Because all records have the same size RECORD_SIZE (in practice, around 6 KB), the position of every record is entirely predictable, so a trivial solution reduces to
for id in ids {
file.seek(SeekFrom::Start(id * RECORD_SIZE)).unwrap();
let mut record = vec![0u8; RECORD_SIZE];
file.read_exact(&mut record).unwrap();
records.push(record);
}
// Do something with `records`
Now, sadly, the following apply:
The elements of ids are non-contiguous, unpredictable, unstructured, and effectively distributed uniformly at random in the range [0, N].
N is way too large for me to store the entire file in memory.
ids.len() is much smaller than N, so I cannot efficiently cycle through the file linearly without having 99% of my reads be for records that have nothing to do with ids.
Now, reading the specs, the raw QD32 IOPS of my SSD should allow me to collect all records in time (i.e., before the next request comes). But what I observe with my trivial implementation is much, much worse. I suspect this is because it is effectively a QD1 implementation:
Read something from disk at a random location.
Wait for the data to arrive, store it in RAM.
Read the next thing from disk at another, independent location.
Now, the thing is I know all ids at the very beginning, and I would love it if there was a way to specify:
As much in parallel as possible, read all the locations relevant to each element of ids.
When that is done, carry on doing something on everything.
I am wondering if there is an easy way to get this done in Rust. I scouted for file.parallel_read-like functions in the standard library, for useful crates on crates.io, but to no avail. Which puzzles me because this should be a relatively common problem in a server / database setting. Am I missing something?

Depending on the architecture you're targeting, there is the posix_fadvise syscall:
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
You would pass the offset, RECORD_SIZE, and probably the POSIX_FADV_WILLNEED advice. Both the function and the constant are available in the libc crate. The same idea can be applied to memory-mapped files using posix_madvise() and POSIX_MADV_WILLNEED, as hinted in the comments.
You then will need to do some performance tuning to determine how far ahead to make these calls. Not early enough and the data isn't there when you want it, and too early means you're needlessly adding pressure on your system memory.
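As a rough sketch of the hinting step, here it is written in C against the POSIX API that the libc crate wraps. The names `prefetch_records`, `fd`, `ids`, and `record_size` are hypothetical and only mirror the question; it assumes a Linux-like system where `posix_fadvise()` is available.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_fadvise() */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Ask the kernel to start reading every requested record in the background.
 * Returns the number of hints the kernel accepted. */
static size_t prefetch_records(int fd, const uint64_t *ids, size_t count,
                               size_t record_size)
{
    size_t accepted = 0;
    for (size_t i = 0; i < count; ++i) {
        off_t offset = (off_t)(ids[i] * record_size);
        /* posix_fadvise() returns 0 on success, an error number otherwise. */
        if (posix_fadvise(fd, offset, (off_t)record_size,
                          POSIX_FADV_WILLNEED) == 0)
            ++accepted;
    }
    return accepted;
}
```

The hints are advisory only: the ordinary seek-and-read loop still has to run afterwards, and how many hints to issue ahead of the reads is the tuning question raised above.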

C- Why is for loop pointer indexing faster? [duplicate]

Some years ago I was on a panel that was interviewing candidates for a relatively senior embedded C programmer position.
One of the standard questions that I asked was about optimisation techniques. I was quite surprised that some of the candidates didn't have answers.
So, in the interests of putting together a list for posterity - what techniques and constructs do you normally use when optimising C programs?
Answers to optimisation for speed and size both accepted.

First things first - don't optimise too early. It's not uncommon to spend time carefully optimising a chunk of code only to find that it wasn't the bottleneck that you thought it was going to be. Or, to put it another way: "Before you make it fast, make it work."
Investigate whether there's any option for optimising the algorithm before optimising the code. It'll be easier to find an improvement in performance by optimising a poor algorithm than it is to optimise the code, only then to throw it away when you change the algorithm anyway.
And work out why you need to optimise in the first place. What are you trying to achieve? If you're trying, say, to improve the response time to some event, work out if there is an opportunity to change the order of execution to minimise the time-critical areas. For example, when trying to improve the response to some external interrupt, can you do any preparation in the dead time between events?
Once you've decided that you need to optimise the code, which bit do you optimise? Use a profiler. Focus your attention (first) on the areas that are used most often.
So what can you do about those areas?
Minimise condition checking. Checking conditions (e.g. terminating conditions for loops) is time that isn't being spent on actual processing. Condition checking can be minimised with techniques like loop unrolling.
In some circumstances condition checking can also be eliminated by using function pointers. For example if you are implementing a state machine you may find that implementing the handlers for individual states as small functions (with a uniform prototype) and storing the "next state" by storing the function pointer of the next handler is more efficient than using a large switch statement with the handler code implemented in the individual case statements. YMMV.
Minimise function calls. Function calls usually carry a burden of context saving (e.g. writing local variables contained in registers to the stack, saving the stack pointer), so if you don't have to make a call this is time saved. One option (if you're optimising for speed and not space) is to make use of inline functions.
If function calls are unavoidable minimise the data that is being passed to the functions. For example passing pointers is likely to be more efficient than passing structures.
When optimising for speed choose datatypes that are the native size for your platform. For example on a 32bit processor it is likely to be more efficient to manipulate 32bit values than 8 or 16 bit values. (side note - it is worth checking that the compiler is doing what you think it is. I've had situations where I've discovered that my compiler insisted on doing 16 bit arithmetic on 8 bit values with all of the to and from conversions to go with them)
Find data that can be precalculated, and either calculate during initialisation or (better yet) at compile time. For example when implementing a CRC you can either calculate your CRC values on the fly (using the polynomial directly) which is great for size (but dreadful for performance), or you can generate a table of all of the interim values - which is a much faster implementation, to the detriment of the size.
Localise your data. If you're manipulating a blob of data often your processor may be able to speed things up by storing it all in cache. And your compiler may be able to use shorter instructions that are suited to more localised data (eg. instructions that use 8 bit offsets instead of 32 bit)
In the same vein, localise your functions. For the same reasons.
Work out the assumptions that you can make about the operations that you're performing and find ways of exploiting them. For example, on an 8 bit platform, if the only operation that you're doing on a 32 bit value is an increment, you may find that you can do better than the compiler by inlining (or creating a macro) specifically for this purpose, rather than using a normal arithmetic operation.
Avoid expensive instructions - division is a prime example.
The "register" keyword can be your friend (although hopefully your compiler has a pretty good idea about your register usage). If you're going to use "register" it's likely that you'll have to declare the local variables that you want "register"ed first.
Be consistent with your data types. If you are doing arithmetic on a mixture of data types (eg. shorts and ints, doubles and floats) then the compiler is adding implicit type conversions for each mismatch. This is wasted cpu cycles that may not be necessary.
Most of the options listed above can be used as part of normal practice without any ill effects. However if you're really trying to eke out the best performance:
- Investigate where you can (safely) disable error checking. It's not recommended, but it will save you some space and cycles.
- Hand craft portions of your code in assembler. This of course means that your code is no longer portable but where that's not an issue you may find savings here. Be aware though that there is potentially time lost moving data into and out of the registers that you have at your disposal (ie. to satisfy the register usage of your compiler). Also be aware that your compiler should be doing a pretty good job on its own. (of course there are exceptions)
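The precalculation tip above (the CRC example) might look like this in practice. It is only a sketch: the polynomial `0x07` and all function names are assumptions chosen for illustration, and an 8-bit CRC is used to keep the table small.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC8_POLY 0x07u   /* assumed polynomial x^8 + x^2 + x + 1 */

/* Bit-by-bit: small code, eight shift/xor steps per input byte. */
static uint8_t crc8_bitwise(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ CRC8_POLY)
                                : (uint8_t)(crc << 1);
    }
    return crc;
}

static uint8_t crc8_table[256];

/* Precalculate once at initialisation: costs 256 bytes of table. */
static void crc8_init_table(void)
{
    for (unsigned v = 0; v < 256; ++v) {
        uint8_t byte = (uint8_t)v;
        crc8_table[v] = crc8_bitwise(&byte, 1);
    }
}

/* Table-driven: one lookup per input byte. */
static uint8_t crc8_lookup(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; ++i)
        crc = crc8_table[crc ^ data[i]];
    return crc;
}
```

Both functions compute the same value; the table version trades 256 bytes of memory for far fewer operations per byte. Generating the table at compile time (as a const array) would remove the initialisation step as well.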

As everybody else has said: profile, profile, profile.
As for actual techniques, one that I don't think has been mentioned yet:
Hot & Cold Data Separation: Staying within the CPU's cache is incredibly important. One way of helping to do this is by splitting your data structures into frequently accessed ("hot") and rarely accessed ("cold") sections.
An example: Suppose you have a structure for a customer that looks something like this:
struct Customer
{
int ID;
int AccountNumber;
char Name[128];
char Address[256];
};
struct Customer customers[1000];
Now, let's assume that you want to access the ID and AccountNumber a lot, but not so much the name and address. What you'd do is split it into two:
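struct CustomerData;                 /* cold data, defined below */
struct CustomerAccount
{
    int ID;
    int AccountNumber;
    struct CustomerData *pData;      /* pointer to the rarely used fields */
};
struct CustomerData
{
    char Name[128];
    char Address[256];
};
struct CustomerAccount customers[1000];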
In this way, when you're looping through your "customers" array, each entry is only 12 bytes (assuming 4-byte ints and pointers; 16 bytes on a typical 64-bit platform), so you can fit many more entries in the cache. This can be a huge win if you can apply it to situations like the inner loop of a rendering engine.

My favorite technique is to use a good profiler. Without a good profile telling you where the bottleneck lies, no tricks and techniques are going to help you.

The most common techniques I have encountered are:
loop unrolling
loop optimization for better cache prefetch
(i.e. do N operations in M cycles instead of NxM singular operations)
data aligning
inline functions
hand-crafted asm snippets
As for general recommendations, most of them have already been mentioned:
choose better algos
use profiler
don't optimize if it doesn't give 20-30% performance boost

For low-level optimization:
START_TIMER/STOP_TIMER macros from ffmpeg (clock-level accuracy for measurement of any code).
Oprofile, of course, for profiling.
Enormous amounts of hand-coded assembly (just do a wc -l on x264's /common/x86 directory, and then remember most of the code is templated).
Careful coding in general; shorter code is usually better.
Smart low-level algorithms, like the 64-bit bitstream writer I wrote that uses only a single if and no else.
Explicit write-combining.
Taking into account important weird aspects of processors, like Intel's cacheline split issue.
Finding cases where one can losslessly or near-losslessly make an early termination, where the early-termination check costs much less than the speed one gains from it.
Actually inlined assembly for tasks which are far more suited to the x86 SIMD unit, such as median calculations (requires compile-time check for MMX support).

First and foremost, use a better/faster algorithm. There is no point optimizing code that is slow by design.
When optimizing for speed, trade memory for speed: lookup tables of precomputed values, binary trees, write faster custom implementation of system calls...
When trading speed for memory: use in-memory compression

Avoid using the heap. Use obstacks or a pool allocator for identically sized objects. Put small things with short lifetimes onto the stack. alloca still exists.
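A pool allocator for identically sized objects can be sketched in a few lines. `Node`, `POOL_CAPACITY`, and the `pool_*` names below are hypothetical; the idea is simply a static array of slots threaded onto a free list, so allocation and release are a couple of pointer moves with no heap involved.

```c
#include <stddef.h>

#define POOL_CAPACITY 64   /* hypothetical size, chosen for illustration */

typedef struct Node { int value; struct Node *next; } Node;   /* example object */

typedef union Slot {
    Node        object;      /* storage while the slot is allocated */
    union Slot *next_free;   /* link while the slot is on the free list */
} Slot;

static Slot  pool_slots[POOL_CAPACITY];
static Slot *pool_free_list;

/* Thread every slot onto the free list. */
static void pool_init(void)
{
    for (size_t i = 0; i + 1 < POOL_CAPACITY; ++i)
        pool_slots[i].next_free = &pool_slots[i + 1];
    pool_slots[POOL_CAPACITY - 1].next_free = NULL;
    pool_free_list = &pool_slots[0];
}

/* Pop a slot; returns NULL when the pool is exhausted. */
static Node *pool_alloc(void)
{
    Slot *slot = pool_free_list;
    if (slot == NULL)
        return NULL;
    pool_free_list = slot->next_free;
    return &slot->object;
}

/* Push a slot back; a union and its members share the same address. */
static void pool_free(Node *node)
{
    Slot *slot = (Slot *)node;
    slot->next_free = pool_free_list;
    pool_free_list = slot;
}
```

Because every slot has the same size there is no fragmentation, and the worst-case cost of both operations is constant.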

Premature optimization is the root of all evil!
;)

As my applications usually don't need much CPU time by design, I focus on the size of my binaries on disk and in memory. What I mostly do is look out for statically sized arrays and replace them with dynamically allocated memory where it's worth the additional effort of freeing the memory later. To cut down the size of the binary, I look for big arrays that are initialized at compile time and move the initialization to runtime.
char buf[1024] = { 0, };
/* becomes: */
char buf[1024];
memset(buf, 0, sizeof(buf));
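This removes the 1024 zero bytes from the binary's .data section; the buffer is instead created on the stack at runtime and then filled with zeros. How much this saves depends on the toolchain: many already place zero-initialized static data in .bss, which takes no space in the file, so the gain is larger for arrays with non-zero initializers.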
EDIT: Oh yeah, and I like to cache things. It's not C specific but depending on what you're caching, it can give you a huge boost in performance.
PS: Please let us know when your list is finished, I'm very curious. ;)

If possible, compare with 0, not with arbitrary numbers, especially in loops, because comparison with 0 is often implemented with separate, faster assembler commands.
For example, if possible, write
for (i=n; i!=0; --i) { ... }
instead of
for (i=0; i!=n; ++i) { ... }

Another thing that was not mentioned:
Know your requirements: don't optimize for situations that are unlikely to happen or never will; concentrate on the most bang for the buck.

basics/general:
Do not optimize when you have no problem.
Know your platform/CPU...
...know it thoroughly
know your ABI
Let the compiler do the optimization, just help it with the job.
some things that have actually helped:
Opt for size/memory:
Use bitfields for storing bools
re-use big global arrays by overlaying with a union (be careful)
Opt for speed (be careful):
use precomputed tables where possible
place critical functions/data in fast memory
Use dedicated registers for often used globals
count to-zero, zero flag is free

Difficult to summarize ...
Data structures:
Splitting a data structure according to how it is used is extremely important. It is common to see a structure holding data whose access depends on control flow. This can significantly lower cache utilisation.
Take cache line size and prefetch rules into account.
Reorder the members of the structure so that your code accesses them sequentially.
Algorithms:
Take time to think about your problem and to find the correct algorithm.
Know the limitations of the algorithm you choose (a radix-sort/quick-sort for 10 elements to be sorted might not be the best choice).
Low level:
On recent processors it is not recommended to unroll a loop that has a small body. The processor provides its own detection mechanism for this and will short-circuit a whole section of its pipeline.
Trust the HW prefetcher. Of course if your data structures are well designed ;)
Care about your L2 cache line misses.
Try to reduce the local working set of your application as much as possible, as processors are moving towards smaller caches per core (the Core 2 Duo enjoyed up to 3MB per core, whereas the Core i7 will provide at most 256KB per core plus 8MB shared between all cores on a quad-core die).
The most important of all: measure early, measure often, and never make assumptions; base your thinking and optimizations on data retrieved by a profiler (please use PTU).
Another hint, performance is key to the success of an application and should be considered at design time and you should have clear performance targets.
This is far from being exhaustive but should provide an interesting base.

These days, the most important things in optimization are:
respecting the cache - try to access memory in simple patterns, and don't unroll loops just for fun. Use arrays instead of data structures with lots of pointer chasing and it'll probably be faster for small amounts of data. And don't make anything too big.
avoiding latency - try to avoid divisions and stuff that's slow if other calculations depend on them immediately. Memory accesses that depend on other memory accesses (ie, a[b[c]]) are bad.
avoiding unpredictability - a lot of if/elses with unpredictable conditions, or conditions that introduce more latency, will really mess you up. There are a lot of branchless math tricks that are useful here, but they increase latency and are only useful if you really need them. Otherwise, just write simple code and don't have crazy loop conditions.
Don't bother with optimizations that involve copy-and-pasting your code (like loop unrolling), or reordering loops by hand. The compiler usually does a better job than you at doing this, but most of them aren't smart enough to undo it.

Collecting profiles of code execution gets you 50% of the way there. The other 50% is analyzing those reports.
Further, if you use GCC or Visual C++, you can use "profile guided optimization", where the compiler takes information from previous executions and reschedules instructions to make the CPU happier.

Inline functions! Inspired by the profiling fans here, I profiled an application of mine and found a small function that does some bit shifting on MP3 frames. It accounts for about 90% of all function calls in my application, so I made it inline and voila - the program now uses half of the CPU time it did before.

On most of the embedded systems I have worked on there were no profiling tools, so "use a profiler" is nice advice but not very practical.
The first rule of speed optimization is: find your critical path.
Usually you will find that this path is neither very long nor very complex. It is hard to say in a generic way how to optimize it; that depends on what you are doing and what is in your power to change. For example, you usually want to avoid memcpy on the critical path, so you either use DMA or optimize the copy. And if your hardware has no DMA? Check whether the memcpy implementation is the best one available, and if not, rewrite it.
Do not use dynamic allocation at all in embedded code; if you must for some reason, do not do it on the critical path.
Organize your thread priorities correctly. What "correctly" means is the real question, and it is clearly system specific.
We use very simple tools to analyze bottlenecks: a simple macro that stores a timestamp and an index. In 90% of cases, a few (2-3) runs will show where you spend your time.
And the last one, a very important one, is code review. In most cases we catch performance problems during code review, which is very effective :)
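The timestamp-and-index macro mentioned above could look roughly like this. It is a guess at the shape of such a tool, not the author's actual code: `TRACE`, `trace_log`, and `TRACE_SLOTS` are hypothetical names, and `clock()` stands in for whatever hardware cycle counter or timer register the target provides.

```c
#include <time.h>

#define TRACE_SLOTS 64   /* hypothetical buffer size */

typedef struct {
    int     index;   /* which checkpoint was reached */
    clock_t stamp;   /* when it was reached */
} TracePoint;

static TracePoint trace_log[TRACE_SLOTS];
static unsigned   trace_count;

/* Record "checkpoint idx reached now". On a real target, replace clock()
 * with a read of the platform's timer or cycle counter. */
#define TRACE(idx)                                      \
    do {                                                \
        if (trace_count < TRACE_SLOTS) {                \
            trace_log[trace_count].index = (idx);       \
            trace_log[trace_count].stamp = clock();     \
            ++trace_count;                              \
        }                                               \
    } while (0)

/* Ticks elapsed between two recorded checkpoints. */
static clock_t trace_elapsed(unsigned from, unsigned to)
{
    return trace_log[to].stamp - trace_log[from].stamp;
}
```

Sprinkling `TRACE(1); ... TRACE(2);` around suspect sections and dumping `trace_log` after a run is usually enough to see which stretch of the critical path dominates.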

Measure performance.
Use realistic and non-trivial benchmarks. Remember that "everything is fast for small N".
Use a profiler to find hotspots.
Reduce number of dynamic memory allocations, disk accesses, database accesses, network accesses, and user/kernel transitions, because these often tend to be hotspots.
Measure performance.
In addition, you should measure performance.

Sometimes you have to decide whether it is more space or more speed that you are after, which will lead to almost opposite optimizations. For example, to get the most out of your space, you pack structures (e.g. #pragma pack(1)) and use bit fields in structures. For more speed, you pack to align with the processor's preference and avoid bit fields.
Another trick is picking the right re-sizing algorithms for growing arrays via realloc, or better still writing your own heap manager based on your particular application. Don't assume the one that comes with the compiler is the best possible solution for every application.
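One common resizing policy for realloc-grown arrays is geometric growth. The sketch below assumes a doubling factor and a starting capacity of 16; both numbers, and the `IntArray` type, are illustrative choices rather than recommendations from the answer.

```c
#include <stdlib.h>

typedef struct {
    int    *items;
    size_t  count;
    size_t  capacity;
} IntArray;   /* hypothetical growable array */

/* Append one value, doubling the capacity when full, so that n appends
 * copy O(n) elements in total instead of O(n^2). Returns 0 on success. */
static int int_array_push(IntArray *a, int value)
{
    if (a->count == a->capacity) {
        size_t new_cap = a->capacity ? a->capacity * 2 : 16;
        int *grown = realloc(a->items, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;   /* the old block is still valid and unchanged */
        a->items = grown;
        a->capacity = new_cap;
    }
    a->items[a->count++] = value;
    return 0;
}
```

Growing by a fixed increment instead (say, 16 more elements each time) uses less slack memory but makes the total copying cost quadratic, which is the kind of trade-off the paragraph above is pointing at.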

If someone doesn't have an answer to that question, it could be they don't know much.
It could also be that they know a lot. I know a lot (IMHO :-), and if I were asked that question, I would be asking you back: Why do you think that's important?
The problem is, any a-priori notions about performance, if they are not informed by a specific situation, are guesses by definition.
I think it is important to know coding techniques for performance, but I think it is even more important to know not to use them, until diagnosis reveals that there is a problem and what it is.
Now I'm going to contradict myself and say, if you do that, you learn how to recognize the design approaches that lead to trouble so you can avoid them, and to a novice, that sounds like premature optimization.
To give you a concrete example, this is a C application that was optimized.

Great lists. I will just add one tip I didn't see in the above lists that in some cases can yield a huge optimisation for minimal cost.
Bypass the linker
If your application is divided into two files, say main.c and lib.c, in many cases you can just add #include "lib.c" to main.c. The compiler then sees both files as a single translation unit, which effectively bypasses the linker for calls between them and allows much more effective optimisation.
The same effect can be achieved optimizing dependencies between files, but the cost of changes is usually higher.

Sometimes Google is the best algorithm optimization tool. When I have a complex problem, a bit of searching reveals some guys with PhD's have found a mapping between this and a well-known problem and have already done most of the work.

I would recommend optimizing using more efficient algorithms and not do it as an afterthought but code it that way from the start. Let the compiler work out the details on the small things as it knows more about the target processor than you do.
For one, I rarely use loops to look things up, I add items to a hashtable and then use the hashtable to lookup the results.
For example, you have a string to look up and 50 possible values. So instead of doing 50 strcmps, you add all 50 strings to a hashtable and give each a unique number (you only have to do this once). Then you look up the target string in the hashtable and have one large switch with all 50 cases (or use function pointers).
When looking up things with common sets of input (like CSS rules), I use fast code to keep track of the only possible solutions and then iterate through those to find a match. Once I have a match I save the results into a hashtable (as a cache) and then use the cached results if I get that same input set later.
My main tools for faster code are:
hashtable - for quick lookups and for caching results
qsort - it's the only sort I use
bsp - for looking up things based on area ( map rendering etc )

Storing records in a byte array vs using an array of structs

I have 200 million records, some of which have variable sized fields (string, variable length array etc.). I need to perform some filters, aggregations etc. on them (analytics oriented queries).
I want to just lay them all out in memory (enough to fit in a big box) and then do linear scans on them. There are two approaches I can take, and I'd like your opinions on which is better for maximizing speed:
Using an array of structs with char* and int* etc. to deal with variable length fields
Use a large byte array, scan the byte array like a binary stream, and then parse the records
Which approach would you recommend?
Update: Using C.

The unfortunate answer is that "it depends on the details which you haven't provided" which, while true, is not particularly useful. The general advice to approaching a problem like this is to start with the simplest/most obvious design and then profile and optimize it as needed. If it really matters you can start with a few very basic benchmark tests of a few designs using your exact data and use-cases to get a more accurate idea of what direction you should take.
Looking in general at a few specific designs and their general pros/cons:
One Big Buffer
char* pBuffer = malloc(200000000);
Assumes your data can fit into memory all at once.
Would work better for all text (or mostly text) data.
Wouldn't be my first choice for large data as it just mirrors the data on the disk. Better just to use the hardware/software file cache/read ahead and read data directly from the drive, or map it if needed.
For linear scans this is a good format but you lose a bit if it requires complex parsing (especially if you have to do multiple scans).
Potential for the least overhead assuming you can pack the structures one after the other.
Static Structure
typedef struct {
char Data1[32];
int Data2[10];
} myStruct;
myStruct *pData = malloc(sizeof(myStruct)*200000000);
Simplest design and likely the best potential for speed at a cost of memory (without actual profiling).
If your variable length arrays have a wide range of sizes you are going to waste a lot of memory. Since you have 200 million records you may not have enough memory to use this method.
For a linear scan this is likely the best memory structure due to memory cache/prefetching.
Dynamic Structure
typedef struct {
char* pData1;
int* pData2;
} myStruct2;
myStruct2 *pData = malloc(sizeof(myStruct2)*200000000);
With 200 million records this is going to require a lot of dynamic memory allocations, which is very likely to have a significant impact on speed.
Has the potential to be more memory efficient if your dynamic arrays have a wide range of sizes (though see next point).
Be aware of the overhead of the pointer sizes. On a 32-bit system this structure needs 8 bytes (ignoring padding) to store the pointers which is 1.6 GB alone for 200 million records! If your dynamic arrays are generally small (or empty) you may be spending more memory on the overhead than the actual data.
For a linear scan of data this type of structure will probably perform poorly as you are accessing memory in a non-linear manner which cannot be predicted by the prefetcher.
Streaming
If you only need to do one scan of the data then I would look at a streaming solution where you read a small bit of the data at a time from the file.
Works well with very large data sets that wouldn't fit into memory.
Main limitation here is the disk read speed and how complex your parsing is.
Even if you have to do multiple passes with file caching this might be comparable in speed to the other methods.
Which of these is "best" really depends on your specific case...I can think of situations where each one would be the preferred method.

You could use structs, indeed, but you'd have to be very careful about alignments and aliasing, and it would all need patching up when there's a variable length section. In particular, you could not use an array of such structs because all entries in an array must be constant size.
I suggest the flat array approach. Then add a healthy dose of abstraction; you don't want your "business logic" doing bit-twiddling.
Better still, if you need to do a single linear scan over the whole data set then you should treat it like a data stream, and de-serialize (copy) the records into proper, native structs, one at a time.
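To make the "flat array plus abstraction" suggestion concrete, here is a minimal sketch. The record layout (a 4-byte id, a 2-byte length, then the name bytes) and every name in the code are assumptions made up for illustration; the point is that only the two accessor functions know about byte offsets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: uint32_t id | uint16_t name_len | name bytes (no NUL). */
#define REC_HEADER_SIZE (sizeof(uint32_t) + sizeof(uint16_t))

typedef struct {
    uint32_t    id;
    uint16_t    name_len;
    const char *name;   /* points into the buffer, not NUL-terminated */
} RecordView;

/* Append a record at buf + off; returns the offset just past it.
 * The caller is responsible for ensuring the buffer is large enough. */
static size_t record_write(unsigned char *buf, size_t off,
                           uint32_t id, const char *name)
{
    uint16_t len = (uint16_t)strlen(name);
    memcpy(buf + off, &id, sizeof id);
    memcpy(buf + off + sizeof id, &len, sizeof len);
    memcpy(buf + off + REC_HEADER_SIZE, name, len);
    return off + REC_HEADER_SIZE + len;
}

/* Decode the record at buf + off; returns the offset of the next record.
 * memcpy avoids the alignment and aliasing problems of pointer casts. */
static size_t record_read(const unsigned char *buf, size_t off, RecordView *out)
{
    memcpy(&out->id, buf + off, sizeof out->id);
    memcpy(&out->name_len, buf + off + sizeof out->id, sizeof out->name_len);
    out->name = (const char *)(buf + off + REC_HEADER_SIZE);
    return off + REC_HEADER_SIZE + out->name_len;
}

/* Example "business logic": a linear scan that never touches raw offsets. */
static size_t count_names_longer_than(const unsigned char *buf, size_t used,
                                      uint16_t min_len)
{
    size_t hits = 0;
    for (size_t off = 0; off < used; ) {
        RecordView r;
        off = record_read(buf, off, &r);
        if (r.name_len > min_len)
            ++hits;
    }
    return hits;
}
```

The scan walks memory strictly front to back, which suits the prefetcher, and the filter code reads like ordinary struct access even though the storage is a packed byte stream.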

"Which approach would you recommend?" Neither actually. With this amount of data my recommendation would be something like a linked list of your structs. However, if you are 100% sure that you will be able to allocate the required amount of memory (with 1 malloc call) for all your data, then use array of structs.

How much efficiency would be lost if a hash table is implemented with a 2d array but the second dimension of the array is never accessed?

I need to make a hash table that can eventually be used to write a full assembler.
Basically I will have something like:
foo 100,
and I will need to hash foo and then store the 100 (the address of the command). I was thinking I should just use a 2d array. The second dimension of the array would only be accessed when recording the address (just an int) or when returning the address. There would be no searching done in the second dimension.
If I implement the hash table this way, would it be inefficient? If it is very inefficient, what would be a better way to implement the table?
Edit: I haven't written any code yet. In fact I don't even know what language I'm going to use yet. I want to write it in C so it will be more of a challenge, but I might write it in Java if I feel pressured for time.

If you have every other int in the array unused then in addition to memory waste you're going to use the cache poorly as the cache lines will be underused.
But normally I wouldn't worry about such things when writing an assembler as it's not something very performance demanding as say graphics or heavy computations. At least, I wouldn't rush into optimizing too early.
It is, however, important to keep in mind that once you start assembling large pieces of code (~100,000 lines of assembly) generated automatically (say, from C/C++ code by a compiler), performance will become more and more important as the user experience (wait times) degrades. At that point there will be many candidates for optimization: I/O, parsing, symbol look up, generation of as short as possible jump instructions if they can have multiple encodings for shorter and longer jumps. Expressions and macros will contribute too. You may even consider minimizing white space and comments in the input assembly code in the first place.

Without seeing any code, there is no reason that this would have to be inefficient. The only way it could be is if you preallocated a bunch of memory that you did not end up using; however, without seeing the algorithm you have in mind it is impossible to tell.
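For comparison with the 2D-array idea, one conventional layout is a single array of (name, address) pairs with open addressing. This is only a sketch under assumptions: `TABLE_SIZE`, the FNV-1a hash, and the `symbol_*` names are illustrative choices, and the table stores the caller's string pointer rather than copying it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 256u   /* hypothetical capacity; must exceed the symbol count */

typedef struct {
    const char *name;     /* NULL marks an empty slot */
    int         address;
} Symbol;

static Symbol symbols[TABLE_SIZE];

/* FNV-1a string hash; any reasonable hash would do here. */
static uint32_t hash_name(const char *s)
{
    uint32_t h = UINT32_C(2166136261);
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= UINT32_C(16777619);
    }
    return h % TABLE_SIZE;
}

/* Insert or update; returns 0 on success, -1 if the table is full.
 * The pointer is stored as-is, so name must outlive the entry. */
static int symbol_put(const char *name, int address)
{
    uint32_t i = hash_name(name);
    for (uint32_t probes = 0; probes < TABLE_SIZE; ++probes) {
        if (symbols[i].name == NULL || strcmp(symbols[i].name, name) == 0) {
            symbols[i].name = name;
            symbols[i].address = address;
            return 0;
        }
        i = (i + 1) % TABLE_SIZE;   /* linear probing on collision */
    }
    return -1;
}

/* Look up; returns 1 and stores the address if found, 0 otherwise. */
static int symbol_get(const char *name, int *address)
{
    uint32_t i = hash_name(name);
    for (uint32_t probes = 0; probes < TABLE_SIZE; ++probes) {
        if (symbols[i].name == NULL)
            return 0;
        if (strcmp(symbols[i].name, name) == 0) {
            *address = symbols[i].address;
            return 1;
        }
        i = (i + 1) % TABLE_SIZE;
    }
    return 0;
}
```

Keeping the key and the address side by side in one struct means a successful lookup touches a single cache line, and no slot holds an unused second dimension.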

How to sort a very large array in C

I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort() but four million * 8 bytes is one huge chunk of contiguous memory.
What's the easiest way to do this? I rate ease over pure speed for this. I'd prefer not to use any libraries and the result will need to run on a modest netbook under both Windows and Linux.

Just allocate a buffer and call qsort. 32MB isn't so very big these days even on a modest netbook.
If you really must split it up: sort smaller chunks, write them to files, and merge them (a merge takes a single linear pass over each of the things being merged). But, really, don't. Just sort it.
(There's a good discussion of the sort-and-merge approach in volume 2 of Knuth, where it's called "external sorting". When Knuth was writing that, the external data would have been on magnetic tape, but the principles aren't very different with discs: you still want your I/O to be as sequential as possible. The tradeoffs are a bit different with SSDs.)
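The "just allocate a buffer and call qsort" route needs little more than a comparator. A minimal sketch follows; `sort_long_longs` is a hypothetical wrapper name. The one subtlety is that the comparator compares rather than subtracts, since the difference of two long longs can overflow.

```c
#include <stdlib.h>

/* qsort comparator for long long: returns -1, 0, or 1 without
 * subtracting, so widely separated values cannot overflow. */
static int compare_long_long(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Sort `count` values in place, in ascending order. */
static void sort_long_longs(long long *values, size_t count)
{
    qsort(values, count, sizeof values[0], compare_long_long);
}
```

For the case in the question, the buffer would come from something like `malloc(4000000 * sizeof(long long))`, checked for NULL, and the same call sorts it in place.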

32 MB? That's not too big... quicksort should do the trick.

Your best option would be to avoid having the data unordered in the first place, if possible. As has been mentioned, you'd be better off reading the data from disk (or network, or whatever the source is) directly into a self-organizing container (a tree; perhaps std::set will do).
That way, you'll never have to sort through the lot, or have to worry about memory management. If you know the required capacity of the container, you might squeeze out additional performance by using std::vector(initialcapacity) or call vector::reserve up front.
You'd then be best advised to use std::make_heap to heapify any existing elements, and then add element by element using push_heap (see also pop_heap). This is essentially the same paradigm as the self-ordering set, but:
duplicates are ok
the storage is 'optimized' as a flat array (which is perfect for e.g. shared memory maps or memory mapped files)
(Oh, minor detail, note that sort_heap on the heap takes at most N log N comparisons, where N is the number of elements)
Let me know if you think this is an interesting approach. I'd really need a bit more info on the use case
