I'm working on a program in C that uses a 4D array with 2 million+ points. I currently have it implemented like this:
main.h
extern float data[31][31][25][100];
main.c
float data[31][31][25][100] = {{.....},{......},.....};
int main(void)
{
    double sum = 0.0;
    int i, j, k, l;
    for (i = 0; i < 31; i++)
        for (j = 0; j < 31; j++)
            for (k = 0; k < 25; k++)
                for (l = 0; l < 100; l++)
                    sum += data[i][j][k][l];
    return 0;
}
The sum is there as a placeholder...in its place will be something that does a 4D table lookup. I implemented it like this because I don't want to load the data from disk. In the future I might use a database or something to load just a portion of the data, but for now I need to use the whole dataset.
So my question is...is there any way I can do this more efficiently, and/or can I make the executable smaller? (The executable is currently ~5 MB.) This will eventually run on a quad-core ARM board.
The only other thing I've tried is optimization with gcc. I've tried -O2 and -O4, and I get the following error. Without -O2, it compiles and runs fine. Any ideas? I haven't really looked up what all the optimization options do...I've just tried things I've seen online.
ld: can't link with a main executable file 'test' for architecture x86_64
Thanks for your help!
Answers to comments:
The data cannot be generated programmatically; it is generated by an offline simulation.
I've updated the code to show that the array is outside of main and is global.
#js1 you're right, it's closer to 9 MB. I was working with a couple of versions of the code, and the 5 MB executable was one with 1 million+ elements.
#pm100 I guess that's a good question...I'm prototyping this code on a PC and it works fine, but where it will actually end up running is on an embedded platform. We are currently planning to test with a Pixhawk (an ARM board for RC vehicles; our production board will be faster and have more memory). I am trying to be as efficient and optimal as possible to mitigate any potential issues on an ARM platform.
#user3629249 yes, we are prototyping our code on Mac OS X and then compiling for ARM once we are finished.
#mcleod_ideafix would loading a binary file that contains the 2 million points be faster? Honestly, I hadn't even considered a binary file...I will try implementing it that way. Every time the program is called, it will need access to some portion of this data. It won't need all 2 million points, but the input that decides which part of the array it needs is variable. Ideally I'd like to load just the portion of the arrays that I need. However, when I tried it out, loading the file and searching for the right arrays took 2-3x longer than the current approach. I wonder if I messed something up when I was loading/searching the file.
Answers to comments, pt. 2:
The data isn't sparse...I can't think of any easy way to reduce the number of points without reducing the fidelity of the model. The data is fixed and won't ever change. What will change are the inputs that use the data, which will lead to different portions of the 4D data being used.
As far as what the data is: it is essentially trajectory prediction data for a flying vehicle. The 4D data is generated offline using a nonlinear simulation running on a cluster.
So what my embedded program has to do is take the current vehicle state (location, orientation, etc.) along with the 4D data and generate an estimated trajectory. I can't really provide the dataset for proprietary reasons. I hope this answers some questions...sorry for being vague.
I will work on a binary implementation and try loading a subset of the array. I might have done something dumb that made it really slow. Thank you all for the comments; they gave me some new ideas to try out.
If your data cannot be generated programmatically, then it has to be somewhere on your hard disk when your program starts, and your program must somehow load that data into the 4D array.
So if your executable is 5 MB in size, that is normal, considering that the initialization data is included. This approach has the benefit that it is the OS that loads and initializes your array. When your program executes the first instruction of main(), the data is already there; you just use it. The disadvantage is that if your program never needs the data, the memory it occupies is still there, wasting address space.
On the other hand, you can keep your data in a separate file: be it a data file that you load as part of your processing, a dynamic library your program loads when needed, or a binary file you map into memory. The advantage of this is that you load the data into memory only if and when it's needed, consuming the extra address space only while you actually access the data and freeing it when it's no longer needed. Besides, your executable will load faster, as no prior load and initialization will be required. The disadvantage is that your program has to include a procedure to load and initialize the 4D array before using it, and another to dispose of it when it's no longer needed.
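For the memory-mapped variant, a minimal sketch in C might look like this. The file name data.bin, and the assumption that it is a raw dump of the array in native byte order, are illustrative, not from the question:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NELEMS (31 * 31 * 25 * 100)   /* 2,402,500 floats, ~9.2 MB */

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* raw dump of the float array */
    if (fd < 0) { perror("open"); return 1; }

    /* Map the file read-only; pages are faulted in on first access,
       so only the parts you actually touch are ever read from disk. */
    float (*data)[31][25][100] =
        mmap(NULL, NELEMS * sizeof(float), PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    double sum = 0.0;
    for (int i = 0; i < 31; i++)
        for (int j = 0; j < 31; j++)
            for (int k = 0; k < 25; k++)
                for (int l = 0; l < 100; l++)
                    sum += data[i][j][k][l];
    printf("sum = %f\n", sum);

    munmap(data, NELEMS * sizeof(float));
    close(fd);
    return 0;
}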
That said, for a static array of values that cannot be computed procedurally and that must live for the entire program, the most efficient approach is to declare the array as global and initialize it in the same declaration. This adds a memory block with the initialized data, already in the exact format the array needs, to your .data section. The beginning of that memory block is assigned to the name of your array during relocation.
Do you need 32-bit floating-point precision?
16-bit fixed-point values, for example, would cut the size of your binary in half.
If the values stored in the table are distributed linearly rather than exponentially, then fixed point is the most efficient way to store them in terms of precision per stored bit of information.
Uneven fixed-point representations of 24 bits or 12 bits per value are also a possibility.
You could also consider using different levels of precision for different parts of the table.
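As a minimal sketch of the 16-bit idea (the scale factor, the value range, and the table name data_fx are assumptions; pick the scale from the actual range of your data):

#include <stdint.h>

/* Assumed scale: values in [-32, 32) stored with 2^-10 resolution. */
#define SCALE 1024.0f

/* Half the size of the float table: 16-bit fixed point. */
extern const int16_t data_fx[31][31][25][100];

/* Decode one entry back to float at lookup time. */
static inline float lookup(int i, int j, int k, int l)
{
    return (float)data_fx[i][j][k][l] / SCALE;
}

/* Offline, each float x would be encoded as (int16_t)lroundf(x * SCALE). */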
Is every single value of the lookup table actually used? Perhaps certain subsections of it can be omitted. It would reduce size at the cost of a more complicated data structure and lookup function.
On a side note, you may want to declare your lookup table to be "const".
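For example, with the array from the question (whether the table then stays in flash or gets copied to RAM depends on the target and toolchain):

/* const lets the compiler place the table in a read-only section
   (.rodata), which on many embedded targets can be kept in flash
   instead of occupying RAM. */
extern const float data[31][31][25][100];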
As far as your lookups go:
I would recommend using a tree for each of your lists. This reduces lookup time to O(log n), with insertion also O(log n) per element (O(n log n) at most to build the whole structure).
That will at least help your application move much faster when operating.
You'll want to use a decent data structure, such as a heap or a generic B-tree.
Related
I am in the process of developing a performance-critical network service in Rust. A request to my service looks like a vector ids: Vec<u64> of numerical ids. For each id in ids, my service must read the id-th record from a long sequence of records stored contiguously on an SSD. Because all records have the same size RECORD_SIZE (in practice, around 6 KB), the position of every record is entirely predictable, so a trivial solution reduces to
for id in ids {
    file.seek(SeekFrom::Start(id * RECORD_SIZE as u64)).unwrap();
    let mut record = vec![0u8; RECORD_SIZE];
    file.read_exact(&mut record).unwrap();
    records.push(record);
}
// Do something with `records`
Now, sadly, the following apply:
The elements of ids are non-contiguous, unpredictable, and unstructured; effectively, they are distributed uniformly at random in the range [0, N].
N is way too large for me to store the entire file in memory.
ids.len() is much smaller than N, so I cannot efficiently cycle through the file linearly without 99% of my reads being for records that have nothing to do with ids.
Now, reading the specs, the raw QD32 IOPS of my SSD should allow me to collect all records in time (i.e., before the next request comes). But what I observe with my trivial implementation is much, much worse. I suspect that this is due to it being effectively a QD1 implementation:
Read something from disk at a random location.
Wait for the data to arrive, store it in RAM.
Read the next thing from disk at another, independent location.
Now, the thing is, I know all the ids at the very beginning, and I would love it if there were a way to specify:
As much in parallel as possible, read all the locations relevant to each element of ids.
When that is done, carry on doing something on everything.
I am wondering if there is an easy way to get this done in Rust. I scouted for file.parallel_read-like functions in the standard library and for useful crates on crates.io, but to no avail. This puzzles me, because it should be a relatively common problem in a server / database setting. Am I missing something?
Depending on the architecture you're targeting, there is the posix_fadvise syscall:
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
You would pass the offset, RECORD_SIZE, and probably the POSIX_FADV_WILLNEED advice. Both the function and the constant are available in the libc crate. The same idea works with memory-mapped files using posix_madvise() and POSIX_MADV_WILLNEED, as hinted in the comments.
You will then need to do some performance tuning to determine how far ahead to make these calls: too late and the data isn't there when you want it; too early and you're needlessly adding pressure on your system memory.
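To make the pattern concrete, here is a minimal sketch in C (the question is in Rust, where the same posix_fadvise call and POSIX_FADV_WILLNEED constant are exposed through the libc crate; the PREFETCH distance of 32 is an assumption to tune, not a recommendation):

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define RECORD_SIZE 6144  /* ~6 KB records, as in the question */
#define PREFETCH    32    /* how far ahead to advise; tune experimentally */

/* Read the record for each id, advising the kernel PREFETCH ids ahead
   so that several reads can be in flight at once. */
static void read_records(int fd, const unsigned long *ids, size_t n, char *out)
{
    size_t ahead = n < PREFETCH ? n : PREFETCH;

    /* Prime the queue with the first few records. */
    for (size_t i = 0; i < ahead; i++)
        posix_fadvise(fd, (off_t)ids[i] * RECORD_SIZE, RECORD_SIZE,
                      POSIX_FADV_WILLNEED);

    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)  /* announce a future read before blocking */
            posix_fadvise(fd, (off_t)ids[i + ahead] * RECORD_SIZE, RECORD_SIZE,
                          POSIX_FADV_WILLNEED);
        if (pread(fd, out + i * RECORD_SIZE, RECORD_SIZE,
                  (off_t)ids[i] * RECORD_SIZE) != RECORD_SIZE) {
            perror("pread");
            exit(1);
        }
    }
}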
Long story short: I have done several prototypes of interactive software. I use pygame now (a Python SDL wrapper) and everything is done on the CPU. I am starting to port it to C and, at the same time, searching for existing ways to use some GPU power to relieve the CPU of redundant operations. However, I cannot find a good "guideline" for which exact technology/tools I should pick in my situation. I have just read a plethora of docs, and it drains my mental powers very fast. I am not sure if it is possible at all, so I'm puzzled.
Here is a very rough sketch of the typical application skeleton I develop, but assuming it now uses the GPU (note: I have almost zero practical knowledge of GPU programming). What is important is that data types and functionality must be exactly preserved. Here it is:
So F(A, R, P) is some custom function, for example element substitution, repetition, etc. The function is presumably constant over the program's lifetime, and the rectangles' shapes are generally not equal to A's shape, so it is not an in-place calculation; the outputs are simply generated by my functions. Examples of F: repeat the rows and columns of A; substitute values using substitution tables; compose some tiles into a single array; apply some math function to A's values, etc. As I said, all this can easily be done on the CPU, but the app must be really smooth. BTW, in pure Python it became simply unusable after adding several visual features based on numpy arrays. Cython helps to make fast custom functions, but then the source code is already kind of a salad.
Question:
Does this scheme reflect some (standard) technology / dev tools?
Is CUDA what I am looking for? If yes, some links/examples that match my application structure would be great.
I realise this is a big question, so I will give more details if it helps.
Update
Here is a concrete example of two typical calculations for my prototype of a bitmap editor. The editor works with indexes, and the data consists of layers with corresponding bit masks. I can determine the size of the layers; the masks are the same size as the layers, and, say, all layers are the same size (1024^2 pixels = 4 MB for 32-bit values). My palette is, say, 1024 elements (4 kilobytes in 32 bpp format).
Suppose I want to do two things now:
Step 1. I want to flatten all layers into one. Say A1 is the default layer (background), and layers 'A2' and 'A3' have masks 'm2' and 'm3'. In Python I'd write:
from numpy import logical_not
...
Result = (A1 * logical_not(m2) + A2 * m2) * logical_not(m3) + A3 * m3
Since the data is independent, I believe this must give a speedup proportional to the number of parallel blocks.
Step 2. Now I have an array and want to 'colorize' it with some palette, so the palette will be my lookup table. As I see it now, there is a problem with simultaneous reads of the lookup table elements.
But my idea is that one can probably just duplicate the palette for all blocks, so each block can read its own palette. Like this:
When your code is highly parallel (i.e., there are few or no data dependencies between stages of processing), you can go for CUDA (more fine-grained control over synching) or OpenCL (a very similar and portable OpenGL-like API to interface with the GPU for kernel processing). Most of the acceleration work we do happens in OpenCL, which has excellent interop with both OpenGL and DirectX, but we also have the same setup working with CUDA. One big difference between CUDA and OpenCL is that with CUDA you can compile kernels once and delay-load (and/or link) them in your app, whereas in OpenCL the compiler plays nice with the OpenCL driver stack to ensure the kernel is compiled when the app starts.
One alternative that is often overlooked if you're using Microsoft Visual Studio is C++ AMP, a C++-syntax-friendly and intuitive API for those who do not want to dig into the logical twists and turns of the OpenCL/CUDA APIs. The big advantage here is that the code also works if you do not have a GPU in the system, although you then do not have as many options to tweak performance. Still, in a lot of cases this is a fast and efficient way to write proof-of-concept code, with the option to re-implement bits and pieces in CUDA or OpenCL later.
OpenMP and Threading Building Blocks are only good alternatives when you have synching issues and lots of data dependencies. Native threading using worker threads is also a viable solution, but only if you have a good idea of how sync points can be set up between the different processes such that threads do not starve each other out when fighting for priority. This is a lot harder to get right, and tools such as Parallel Studio are a must. But then, so is NVIDIA Nsight if you're writing GPU code.
Appendix:
A new platform called Quasar (http://quasar.ugent.be/blog/) is being developed that lets you write your math problems in a syntax very similar to MATLAB, with full support for C/C++/C# or Java integration, and cross-compiles (LLVM, Clang) your "kernel" code to any underlying hardware configuration. It generates CUDA PTX files, or runs on OpenCL, or even on your CPU using TBB, or a mixture of them. Using a few monikers, you can decorate the algorithm so that the underlying compiler can infer types (you can also use strict typing explicitly), leaving the type-heavy stuff entirely up to the compiler. To be fair, at the time of writing, the system is still a work in progress and the first OpenCL-compiled programs are only now being tested, but the most important benefit is fast prototyping with almost identical performance compared to optimized CUDA.
What you want to do is send values to the GPU really fast using high-frequency dispatch, and then display the result of a function that is basically texture lookups plus some parameters.
I would say this problem will only be worth solving on the GPU if two conditions are met:
The size of A[] is such that transfer times become irrelevant (see http://blog.theincredibleholk.org/blog/2012/11/29/a-look-at-gpu-memory-transfer/).
The lookup table is not too big and/or the lookup values are organized in a way that maximizes cache utilization; random lookups on the GPU can generally be slow. Ideally, you can pre-load the R[] values into a shared memory buffer for each element of the A[] buffer.
If you can answer both of those questions positively, then and only then consider having a go at using the GPU for your problem; otherwise those two factors will overpower the computational speed-up that the GPU can give you.
Another thing to look at is overlapping the transfer and compute times as much as possible, to hide the slow CPU->GPU transfer rates.
Regarding your F(A, R, P) function: you need to make sure that you do not need to know the value of F(A, R, P)[0] in order to know the value of F(A, R, P)[1], because if you do, then you need to rewrite F(A, R, P) to get around this issue using some parallelization technique. If you have a limited number of F() functions, this can be solved by writing a parallel version of each F() function for the GPU to use; but if F() is user-defined, your problem becomes a bit trickier.
I hope this is enough information to make an informed guess about whether or not you should use a GPU to solve your problem.
EDIT
Having read your edit, I would say yes. The palette could fit in shared memory (see "GPU shared memory size is very small - what can I do about it?"), which is very fast. If you have more than one palette, you could fit 16 KB (the size of shared memory on most cards) / 4 KB per palette = 4 palettes per block of threads.
One last warning: integer operations are not the fastest on the GPU; once your algorithm is implemented and working, consider switching to floating point if necessary as a cheap optimization.
There is not much difference between OpenCL and CUDA, so choose whichever works better for you. Just remember that CUDA will limit you to NVIDIA GPUs.
If I understand your problem correctly, the kernel (the function executed on the GPU) should be simple. It should follow this outline, written here as a CUDA C kernel (the element type int and the structs R and P are placeholders for whatever you actually use):

__global__ void process(const int *A, int *outA,
                        const struct R R, const struct P P,
                        const int maxOut, const int sizeA)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x; // offset in input array
    if (index >= sizeA)
        return; // GPUs often work better when the number of threads is 2^n
    int outIndex = index * maxOut; // offset in output array
    outA[outIndex] = F(A[index], R, P);
}
Functions F should be inlined, and you can use switch or if for the different functions. Since the size of F's output is not known in advance, you have to use more memory. Each kernel instance must know the positions for its correct memory writes and reads, so there has to be some maximum output size (if there is none, then this is all useless and you have to use the CPU!). If the unusual sizes are sparse, then I would compute those few cases on the CPU after getting the array back to RAM, while filling outA with zeros or indicator values.
The sizes of the arrays are obviously length(A) * maxOut = length(outA).
I forgot to mention that if the execution of F is not the same in most cases (i.e., not the same source code path), the GPU will serialize it. GPU multiprocessors consist of a few cores connected to the same instruction cache, so the code will have to be serialized when it is not the same for all cores! OpenMP or threads are a better choice for this kind of problem!
I need to make a hash table that can eventually be used to write a full assembler.
Basically I will have something like:
foo 100,
and I will need to hash foo and then store the 100 (the address of the command). I was thinking I should just use a 2D array. The second dimension of the array would only be accessed when recording the address (just an int) or when returning the address; there would be no searching done in the second dimension.
If I implement the hash table this way, would it be inefficient? If it is very inefficient, what would be a better way to implement the table?
Edit: I haven't written any code yet. In fact, I don't even know what language I'm going to use yet. I want to write it in C so it will be more of a challenge, but I might write it in Java if I feel pressed for time.
If every other int in the array is unused, then in addition to wasting memory you're going to use the cache poorly, as the cache lines will be underused.
But normally I wouldn't worry about such things when writing an assembler, as it's not as performance-demanding as, say, graphics or heavy computation. At least, I wouldn't rush into optimizing too early.
It is, however, important to keep in mind that once you start assembling large pieces of code (~100,000 lines of assembly) generated automatically (say, from C/C++ code by a compiler), performance will become more and more important as the user experience (wait times) degrades. At that point there will be many candidates for optimization: I/O, parsing, symbol lookup, and generating the shortest possible jump instructions when they have multiple encodings for shorter and longer jumps. Expressions and macros will contribute too. You may even consider minimizing whitespace and comments in the input assembly code in the first place.
Without being able to see any code, there is no reason this would have to be inefficient. The only way it could be is if you pre-allocated a bunch of memory that you did not end up using; however, without seeing the algorithm you have in mind, it is impossible to tell.
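For what it's worth, here is a minimal sketch of the layout the question describes: a fixed-size table hashed on the symbol name, with the address stored alongside it (all names and sizes here are illustrative):

#include <string.h>

#define TABLE_SIZE 1024              /* illustrative capacity */

struct entry {
    char name[32];                   /* symbol, e.g. "foo" */
    int  addr;                       /* its address, e.g. 100 */
    int  used;
};

static struct entry table[TABLE_SIZE];

/* Simple string hash (djb2). */
static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Insert with linear probing; returns 0 on success, -1 if the table is full. */
static int put(const char *name, int addr)
{
    for (unsigned i = 0, h = hash(name); i < TABLE_SIZE; i++, h = (h + 1) % TABLE_SIZE) {
        if (!table[h].used || strcmp(table[h].name, name) == 0) {
            strncpy(table[h].name, name, sizeof table[h].name - 1);
            table[h].addr = addr;
            table[h].used = 1;
            return 0;
        }
    }
    return -1;
}

/* Look up a symbol; returns its address, or -1 if absent. */
static int get(const char *name)
{
    for (unsigned i = 0, h = hash(name); i < TABLE_SIZE; i++, h = (h + 1) % TABLE_SIZE) {
        if (!table[h].used)
            return -1;
        if (strcmp(table[h].name, name) == 0)
            return table[h].addr;
    }
    return -1;
}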
CUDA is awesome and I'm using it like crazy, but I am not using its full potential because I'm having an issue transferring memory, and I was wondering if there is a better way to get a variable amount of data out. Basically, I send a 65535-item array into CUDA, and CUDA analyzes each data item in around 20,000 different ways; if there's a match in my program's logic, then it saves a 30-int list as a result. Think of my logic as analyzing each different combination, then looking at the total, and if the total is equal to a number I'm looking for, saving the results (which is a 30-int list for each analyzed item).
The problem is that 65535 (blocks/items in the data array) * 20000 (total combinations tested per item) = 1,310,700,000. This means I would need to create an array of that size to handle the chance that all the data is a positive match (which is extremely unlikely, and creating int output[1310700000][30] seems crazy memory-wise). I've been forced to make it smaller and send fewer blocks to process, because I don't know whether CUDA can write efficiently to a linked list or a dynamically sized list (with this approach, it writes the output to host memory using block * number_of_different_way_tests).
Is there a better way to do this? Can CUDA somehow write to free memory that is not derived from the block id? When I test this process on the CPU, less than 10% of the item array has a positive match, so it's extremely unlikely I'll use that much memory each time I send work to the kernel.
P.S. Looking above, although it's exactly what I'm doing, if it's confusing, then another way of thinking about it (not exactly what I'm doing, but good enough to understand the problem) is that I am sending 20,000 arrays (each containing 65,535 items), adding each item to its peer in the other arrays, and if the total equals a number (say 200-210), then I want to know the numbers that were added to get that matching result.
If the numbers range very widely, then not all will match, but with my approach I'm forced to malloc that huge amount of memory. Can I capture the results while mallocing less memory? My current approach is to malloc as much as I have free, but then I'm forced to run fewer blocks, which isn't efficient (I want to run as many blocks and threads at a time as possible, because I like the way CUDA organizes and runs blocks). Are there any CUDA or C tricks I can use for this, or am I stuck mallocing the maximum possible number of results (and buying a lot more memory)?
As per Roger Dahl's great answer:
The functionality you're looking for is called stream compaction.
You probably do need to provide an array with room for 4 solutions per thread, because attempting to store the results directly in compact form is likely to create so many dependencies between the threads that the performance gained by copying less data back to the host is lost to a longer kernel execution time. The exception to this is if almost all of the threads find no solutions. In that case, you can use an atomic operation to maintain an index into an array: for each solution that is found, you store it at an index and then atomically increase that index. It should be safe to use atomicAdd() for this. Before storing a result, the thread calls atomicAdd() to increase the index by one; atomicAdd() returns the old value, and the thread stores the result using the old value as the index.
However, in the more common situation where there is a fair number of results, the best solution is to perform the compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
I'm having the following problem: I need to store a huge amount of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what the best way to do this is (combination of programming language + OS + whatever you think is important).
The structure of the information I'm using is a 4D array (NxNxNxN) of double-precision floats (8 bytes). Right now my solution is to slice the 4D array into 2D arrays and store them in separate files on my computer's HDD. This is really slow, and manipulating the data is unbearable, so this is no solution at all!
I'm thinking of moving to a supercomputing facility in my country and storing all the information in RAM, but I'm not sure how to implement an application that takes advantage of it (I'm not a professional programmer, so any book/reference will help me a lot).
An alternative solution I'm considering is to buy a dedicated server with lots of RAM, but I don't know for sure if that will solve the problem. So right now my ignorance doesn't let me choose the best way to proceed.
What would you do if you were in this situation? I'm open to any ideas.
Thanks in advance!
EDIT: Sorry for not providing enough information, I'll try to be more specific.
I'm storing a discretized 4D mathematical function. The operations I would like to perform include transposition of the array (changing b[i,j,k,l] = a[j,i,k,l] and the like), array multiplication, etc.
As this is a simulation of a proposed experiment, the operations will be applied only once. Once the result is obtained, it won't be necessary to perform more operations on the data.
EDIT (2):
I would also like to be able to store more information in the future, so the solution should be somewhat scalable. The current 32 GB goal is because I want the array to have N=256 points, but it would be better if I could use N=512 (which means 512 GB to store it!).
Amazon's "High Memory Extra Large Instance" is only $1.20/hr and has 34 GB of memory. You might find it useful, assuming you're not running this program constantly.
Any decent answer will depend on how you need to access the data. Randomly access? Sequential access?
32GB is not really that huge.
How often do you need to process your data? Once per (lifetime | year | day | hour | nanosecond)? Often, stuff only needs to be done once. This has a profound effect on how much you need to optimize your solution.
What kind of operations will you be performing (you mention multiplication)? Can the data be split up into chunks, such that all necessary data for a set of operations is contained in a chunk? This will make splitting it up for parallel execution easier.
Most computers you buy these days have enough RAM to hold your 32GB in memory. You won't need a supercomputer just for that.
As Chris pointed out, what are you going to do with the data?
Besides, I think storing it in a (relational) database will be faster than reading it from the hard drive, since the RDBMS will perform some optimizations for you (like caching).
If you can represent your problem as MapReduce, consider a clustering system optimized for disk access, such as Hadoop.
Your description sounds more math-intensive, in which case you probably want to have all your data in memory at once. 32 GB of RAM in a single machine is not unreasonable; Amazon EC2 offers virtual servers with up to 68 GB.
Without more information: if you need the quickest possible access to all the data, I would go with C for your programming language, some flavor of *nix as the OS, and buying RAM; it's relatively cheap now. This also depends on what you are familiar with; you can go the Windows route as well. But, as others have mentioned, it will depend on how you are using the data.
So far, there are a lot of very different answers. Two good starting points are mentioned above: David suggests some hardware, and someone else mentioned learning C. Both are good points.
C is going to get you what you need in terms of speed and direct memory paging. The last thing you want to do is perform linear searches on the data. That would be slow - slow - slow.
Determine your workflow. If your workflow is linear, that is one thing. If it is not linear, I would design a binary tree referencing pages in memory. There is a ton of information about B-trees on the Internet. In addition, B-trees will be much easier to work with in C, since you will also be able to set up and manipulate the memory paging yourself.
Depending on your use, some mathematical and physical problems tend to be mostly zeros (for example, Finite Element models). If you expect that to be true for your data, you can get serious space savings by using a sparse matrix instead of actually storing all those zeros in memory or on disk.
Check out wikipedia for a description, and to decide if this might meet your needs:
http://en.wikipedia.org/wiki/Sparse_matrix
Here's another idea:
Try using an SSD to store your data. Since you're grabbing very small amounts of random data, an SSD would probably be much, much faster.
You may want to try using mmap instead of reading the data into memory, but I'm not sure it'll work with 32 GB files.
Database technology is all about manipulating huge amounts of data that can't fit in RAM, so that might be your starting point (i.e., get a good book on DBMS principles and read about indexing, query execution, etc.).
A lot depends on how you need to access the data. If you absolutely need to jump around and access random bits of information, you're in trouble; but perhaps you can structure your processing of the data such that you scan it along one axis (dimension). Then you can use a smaller buffer and continuously dump already-processed data and read new data.
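A minimal C sketch of that axis-wise scanning pattern, assuming the array is stored on disk in row-major order with the scanned axis outermost (the file name and N are illustrative):

#include <stdio.h>
#include <stdlib.h>

#define N 256   /* illustrative; 256^4 doubles = 32 GB total */

int main(void)
{
    FILE *f = fopen("array4d.bin", "rb");   /* row-major dump, axis i outermost */
    if (!f) { perror("fopen"); return 1; }

    /* One 3D slab a[i][*][*][*] at a time: N^3 doubles = 128 MB for N=256. */
    double *slab = malloc((size_t)N * N * N * sizeof *slab);
    if (!slab) { perror("malloc"); return 1; }

    for (int i = 0; i < N; i++) {
        if (fread(slab, sizeof *slab, (size_t)N * N * N, f) != (size_t)N * N * N) {
            perror("fread"); return 1;
        }
        /* ...process slab i here, then let the next read overwrite it... */
    }

    free(slab);
    fclose(f);
    return 0;
}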
For transpositions, it's faster to just change your understanding of which index is which. By that I mean you leave the data where it is and instead wrap it in an accessor delegate that turns a request for b[i][j][k][l] into a fetch (or update) of a[j][i][k][l].
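A minimal sketch of such an accessor in C (the names b_at, a, and n are illustrative):

#include <stddef.h>

/* View the row-major n*n*n*n array a[] as its transpose b without
   moving any data: b[i][j][k][l] is defined as a[j][i][k][l]. */
static double *b_at(double *a, size_t n, size_t i, size_t j, size_t k, size_t l)
{
    return &a[((j * n + i) * n + k) * n + l];   /* first two indices swapped */
}

/* Usage: *b_at(a, N, i, j, k, l) reads or writes the transposed element. */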
Could it be possible to solve it by this procedure?
First, create M child processes and execute them in parallel. Each process will run on a dedicated core of a cluster and will load some portion of the array into that core's RAM.
A parent process will act as the manager of the array, calling (or connecting to) the appropriate child process to obtain certain chunks of data.
Would this be faster than the HDD storage approach? Or am I cracking nuts with a sledgehammer?
The first thing I'd recommend is picking an object-oriented language and developing or finding a class that lets you manipulate a 4D array without concern for how it's actually implemented.
The actual implementation of this class would probably use memory-mapped files, simply because that scales from low-power development machines up to the actual machine where you want to run production code (I'm assuming you'll want to run this many times, so performance is important; if you can let it run overnight, then a consumer PC may be sufficient).
Finally, once I had my algorithms and data debugged, I would look into buying time on a machine that can hold all the data in memory. Amazon EC2, for instance, will provide you with a machine that has 68 GB of memory for US$2.40 an hour (less if you play with spot instances).
How to handle processing large amounts of data typically revolves around the following factors:
Data access order / locality of reference: Can the data be separated into independent chunks that are processed either independently or in a serial/sequential fashion, or does it require random access with little or no order?
CPU vs I/O bound: Is the processing time spent more on computation with the data or reading/writing it from/to storage?
Processing frequency: Will the data be processed only once, every few weeks, daily, etc?
If the data access order is essentially random, you will need either to get access to as much RAM as possible and/or to find a way to at least partially organize the order so that less of the data needs to be in memory at the same time. Virtual memory systems slow down very quickly once physical RAM limits are exceeded and significant swapping occurs. Resolving this aspect of your problem is probably the most critical issue.
Other than the data access order issue above, I don't think your problem has significant I/O concerns. Reading/writing 32 GB is usually measured in minutes on current computer systems, and even data sizes up to a terabyte should not take more than a few hours.
Programming language choice is actually not critical so long as it is a compiled language with a good optimizing compiler and decent native libraries: C++, C, C#, or Java are all reasonable choices. The most computationally and I/O-intensive software I've worked on has actually been in Java and deployed on high-performance supercomputing clusters with a few thousand CPU cores.